Brett Lantz shows how to manage data with R

One of the challenges faced while working with massive datasets involves gathering, preparing, and otherwise managing data from a variety of sources. This article highlights the basic functionality for getting data into and out of R.

The foundational R data structure is the vector, which is extended and combined into more complex data types, such as lists and data frames. The data frame is an R data structure that corresponds to the notion of a dataset having both features and examples. R provides functions for reading data frames from, and writing them to, spreadsheet-like tabular data files.

This article is an excerpt from the book Machine Learning with R, Third Edition, written by Brett Lantz. The book provides a hands-on, readable guide to applying machine learning to real-world problems. Whether you are an experienced R user or new to the language, it teaches you everything you need to uncover key insights, make new predictions, and visualize your findings.

Saving, loading, and removing R data structures

When you have spent a lot of time getting a data frame into the desired form, you shouldn’t need to recreate your work each time you restart your R session. To save a data structure to a file that can be reloaded later or transferred to another system, use the save() function. The save() function writes one or more R data structures to the location specified by the file parameter. R data files have an .RData extension.

Suppose you had three objects named x, y, and z that you would like to save to a permanent file. Regardless of whether they are vectors, factors, lists, or data frames, they can be saved to a file named mydata.RData using the following command:

> save(x, y, z, file = "mydata.RData")

The load() command can recreate any data structures that have been saved to an .RData file. To load the mydata.RData file created in the preceding code, simply type:

> load("mydata.RData")

This will recreate the x, y, and z data structures in your R environment.
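As a minimal sketch of the full round trip, the following saves three objects, removes them, and loads them back. The contents of x, y, and z are hypothetical stand-ins, and a temporary file is used instead of mydata.RData so that nothing is written to your working directory.

```r
# Hypothetical objects standing in for x, y, and z
x <- c(1.5, 2.5, 3.5)                   # numeric vector
y <- factor(c("A", "B", "A"))           # factor
z <- data.frame(id = 1:3, score = x)    # data frame

# Save to a temporary file so the working directory is left untouched
rdata_file <- tempfile(fileext = ".RData")
save(x, y, z, file = rdata_file)

# Keep a copy for comparison, then remove the originals
z_before <- z
rm(x, y, z)

# load() restores the objects under their original names and
# invisibly returns a character vector of those names
restored <- load(rdata_file)
print(restored)
print(identical(z, z_before))
```

Capturing the return value of load() is optional, but it is a convenient way to see which names the file added to your environment.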

Note: Be careful what you are loading! All data structures stored in the file you are importing with the load() command will be added to your workspace, even if they overwrite something else you are working on.
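One way to reduce this risk, sketched below, is to load the file into a separate environment and inspect its contents before copying anything into your workspace. The envir argument is part of load(); the object names and the scratch environment are hypothetical and chosen only for illustration.

```r
# Hypothetical object already in the workspace
x <- "work in progress"

# Create a saved file that contains a different object also named x
rdata_file <- tempfile(fileext = ".RData")
local({
  x <- c(10, 20, 30)
  save(x, file = rdata_file)
})

# Load into a scratch environment rather than the workspace
scratch <- new.env()
loaded_names <- load(rdata_file, envir = scratch)

print(loaded_names)   # names of the objects stored in the file
print(x)              # the workspace copy is unchanged
print(scratch$x)      # the loaded copy lives in the scratch environment
```

Once you have confirmed that nothing important would be overwritten, individual objects can be copied out of the scratch environment by ordinary assignment.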

If you need to wrap up your R session in a hurry, the save.image() command will write your entire session to a file simply called .RData. By default, R will look for this file the next time you start R, and your session will be recreated just as you had left it.

After you’ve been working in an R session for some time, you may have accumulated a number of data structures. The listing function ls() returns a character vector of the names of all data structures currently in memory. For example, if you’ve been following along with the code in this chapter, the ls() function returns the following:

> ls()
[1] "blood" "flu_status" "gender" "m"
[5] "subject_name" "subject1" "symptoms"
[9] "temperature"

R automatically clears all data structures from memory upon quitting the session, but for large objects, you may want to free up the memory sooner. The remove function rm() can be used for this purpose. For example, to eliminate the m and subject1 objects, simply type:

> rm(m, subject1)

The rm() function can also be supplied with a character vector of object names to remove through its list parameter. Combined with the ls() function, this clears every object from the R session:

> rm(list=ls())

Be very careful when executing the preceding code, as you will not be prompted before your objects are removed!
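If clearing everything is more than you need, a narrower approach is to remove only the objects whose names match a pattern. The sketch below assumes a hypothetical naming scheme in which temporary objects begin with tmp_, and it works inside a dedicated environment so that the example cannot delete anything from your own session.

```r
# Use a dedicated environment so the example does not touch your workspace
env <- new.env()
assign("tmp_scores", c(90, 85), envir = env)
assign("tmp_names", c("a", "b"), envir = env)
assign("results", 1:10, envir = env)

# ls() accepts a regular expression through its pattern argument
to_remove <- ls(envir = env, pattern = "^tmp_")
print(to_remove)

# Remove only the matching objects
rm(list = to_remove, envir = env)
print(ls(envir = env))
```

In an interactive session you would normally omit the envir argument, in which case both functions operate on your current workspace.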

Importing and saving data from CSV files

It is very common for public datasets to be stored in text files. Text files can be read on virtually any computer or operating system, which makes the format nearly universal. They can also be imported into and exported from programs such as Microsoft Excel, providing a quick and easy way to work with spreadsheet data.

A tabular (as in “table”) data file is structured in matrix form, such that each line of text reflects one example, and each example has the same number of features. The feature values on each line are separated by a predefined symbol known as a delimiter. Often, the first line of a tabular data file lists the names of the data columns. This is called a header line.

Perhaps the most common tabular text file format is the comma-separated values (CSV) file, which, as the name suggests, uses the comma as a delimiter. CSV files can be imported into and exported from many common applications. A CSV file representing the medical dataset constructed previously could be stored as:

subject_name,temperature,flu_status,gender,blood_type
John Doe,98.1,FALSE,MALE,O
Jane Doe,98.6,FALSE,FEMALE,AB
Steve Graves,101.4,TRUE,MALE,A

Given a patient data file named pt_data.csv located in the R working directory, the read.csv() function can be used as follows to load the file into R:

> pt_data <- read.csv("pt_data.csv", stringsAsFactors = FALSE)

This will read the CSV file into a data frame named pt_data. Just as we had done previously when constructing a data frame, we use the stringsAsFactors = FALSE parameter to prevent R from converting all text variables to factors. (Since R 4.0.0, FALSE is the default, so the argument is strictly needed only on older versions, though stating it explicitly does no harm.) Unless you are certain that every text column in the CSV file is truly a factor, this step is better left to you, not R, to perform.

Note: If your dataset resides outside the R working directory, the full path to the CSV file (for example, “/path/to/mydata.csv”) can be used when calling the read.csv() function.
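The following sketch puts these pieces together. It writes the sample records shown earlier to a temporary location (a hypothetical path built with tempdir()), reads them back using the full path, and then converts a single column to a factor by hand.

```r
# Write the sample records to a temporary CSV file (hypothetical location)
csv_file <- file.path(tempdir(), "pt_data.csv")
writeLines(c(
  "subject_name,temperature,flu_status,gender,blood_type",
  "John Doe,98.1,FALSE,MALE,O",
  "Jane Doe,98.6,FALSE,FEMALE,AB",
  "Steve Graves,101.4,TRUE,MALE,A"
), csv_file)

# Read the file back using its full path
pt_data <- read.csv(csv_file, stringsAsFactors = FALSE)

# Inspect the column types that R inferred
str(pt_data)

# Convert only the columns that are genuinely categorical
pt_data$gender <- factor(pt_data$gender)
print(levels(pt_data$gender))
```

Here subject_name stays as plain text, which is usually what you want for an identifier, while gender becomes a factor because you chose to make it one.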

By default, R assumes that the CSV file includes a header line listing the names of the features in the dataset. If a CSV file does not have a header, specify the option header = FALSE as shown in the following command, and R will assign default feature names by numbering them as V1, V2, and so on:

> mydata <- read.csv("mydata.csv", stringsAsFactors = FALSE,
                     header = FALSE)
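To see the default names in practice, the sketch below reads a small headerless file and then replaces V1, V2, and V3 with descriptive names. The file contents and the replacement names are hypothetical and serve only as an illustration.

```r
# Create a small CSV file with no header line (hypothetical data)
headerless_file <- tempfile(fileext = ".csv")
writeLines(c(
  "John Doe,98.1,FALSE",
  "Jane Doe,98.6,FALSE"
), headerless_file)

mydata <- read.csv(headerless_file, stringsAsFactors = FALSE,
                   header = FALSE)

# R numbers the columns because the file supplied no names
default_names <- names(mydata)
print(default_names)

# Optionally replace the defaults with descriptive names
names(mydata) <- c("subject_name", "temperature", "flu_status")
print(names(mydata))
```

Forgetting header = FALSE on a file like this would cause the first record to be consumed as column names, so it is worth checking the row count after loading.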

The read.csv() function is a special case of the read.table() function, which can read tabular data in many different forms, including other delimited formats such as tab-separated values (TSV). For more detailed information on the read.table() family of functions, refer to the R help page using the ?read.table command.
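As an illustration of another delimited format, the sketch below reads a small tab-separated file in two ways: with read.table() and an explicit tab delimiter, and with read.delim(), a member of the same family whose defaults suit tab-delimited files with a header line. The file contents are hypothetical.

```r
# Create a small tab-separated file (hypothetical data)
tsv_file <- tempfile(fileext = ".tsv")
writeLines(c(
  "subject_name\ttemperature\tflu_status",
  "John Doe\t98.1\tFALSE",
  "Jane Doe\t98.6\tFALSE"
), tsv_file)

# read.table() with an explicit header line and tab delimiter
tsv_a <- read.table(tsv_file, header = TRUE, sep = "\t",
                    stringsAsFactors = FALSE)

# read.delim() uses header = TRUE and sep = "\t" by default
tsv_b <- read.delim(tsv_file, stringsAsFactors = FALSE)

# Both calls should produce equivalent data frames for this file
print(all.equal(tsv_a, tsv_b))
```

The two functions differ in a few other defaults, such as how quotes and comment characters are handled, so results can diverge on messier files.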

To save a data frame to a CSV file, use the write.csv() function. If your data frame is named pt_data, simply enter:

> write.csv(pt_data, file = "pt_data.csv", row.names = FALSE)

This will write a CSV file with the name pt_data.csv to the R working directory. Setting row.names = FALSE overrides R’s default behavior, which is to output row names as an extra first column in the CSV file.
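The effect of row.names is easiest to see by comparing the header line written in each case. The following sketch uses a hypothetical two-row data frame and temporary files rather than the working directory.

```r
# Hypothetical two-row data frame
pt_data <- data.frame(
  subject_name = c("John Doe", "Jane Doe"),
  temperature = c(98.1, 98.6),
  stringsAsFactors = FALSE
)

with_rows <- tempfile(fileext = ".csv")
without_rows <- tempfile(fileext = ".csv")

# Default behavior: row names are written as an unnamed first column
write.csv(pt_data, file = with_rows)

# Row names omitted
write.csv(pt_data, file = without_rows, row.names = FALSE)

# Compare the header lines of the two files
print(readLines(with_rows, n = 1))
print(readLines(without_rows, n = 1))

# The file without row names reads back with the original column names
round_trip <- read.csv(without_rows, stringsAsFactors = FALSE)
print(names(round_trip))
```

Omitting row names avoids an unnamed extra column, which read.csv() would otherwise import as a column called X.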

Conclusion

In this article, we learned about the basics of managing data in R. We started by learning about the structures used for storing various types of data in R and how to save them. We then moved on to reading and writing data in the commonly used CSV format.

Now that you have spent some time understanding the basics of data management with R, you are ready to begin using machine learning to solve real-world problems. To investigate further, you can refer to Brett Lantz’s latest book, Machine Learning with R, Third Edition.

About the author

Brett Lantz is a DataCamp instructor and a frequent speaker at machine learning conferences and workshops around the world. A sociologist by training, Brett was first captivated by machine learning during research on a large database of social network profiles.

Brett Lantz shows how to manage data with R was originally published in Hacker Noon on Medium.

Publication date

04/29/2019 - 10:41