R-Sessions 08: Getting Data into R

Introduction

Various ways are provided to enter data into R. The most basic method is entering it manually, but this quickly gets tedious. An often more useful way is the read.table command, which has some variants, as will be shown below. Another way of getting data into R is through the clipboard. The drawback of that approach is the loss of some control over the process. Finally, it will be described how data from SPSS can be read in directly.

Only basic ways of entering data into R are shown here. Much more is possible as other functions offer almost unlimited control. Here the emphasis will be on day-to-day usage.

Reading data from a file

The most general data files are plain text files that store the data. Rows generally represent the cases (or respondents), although the top row will often state the variable labels. The values these variables can take are written in columns, separated by some kind of indicator, often spaces, commas or tabs. Another variant is that there is no separating character. In that case all variables belonging to a single case are written in succession. Each variable then needs to have a fixed number of character positions defined, to be able to distinguish between variables. Variable labels are often left out of this type of file.

R is able to read all of the above-mentioned file types with the read.table() command, or its derivatives read.csv() and read.delim(). The exception is the fixed-width file, which is loaded using the read.fwf() command; this takes different parameters. The derivatives of read.table() are basically the same command, but have different defaults. Because they are so convenient, they will be used here.

Comma / Tab separated files

As said, the most generic way of reading data is the read.table() command. When given only the filename as parameter, it treats any whitespace (one or more spaces or tabs) as the separating character, so beware of using spaces in variable labels. Unless the first row contains one field fewer than the rows that follow, it also assumes that there are no variable names on the first row of the data. The decimal sign is a ".". This leads to the first row of the syntax below, which assigns the contents of a data file "filename" to the object data, which becomes a data.frame.

The read.csv() and the read.delim() commands are basically the same as read.table(), but they have a different set of default values for the parameters. read.csv() is used for comma-separated files (such as, for instance, Microsoft Excel can export). The syntax for read.csv() is very simple, as shown below. The read.table() command can be used for the exact same purpose by altering its parameters. The header = TRUE parameter means that the first row of the file is now regarded as containing the variable names. The sep parameter now indicates the comma "," as the separating character. fill = TRUE tells the function that if a row contains fewer columns than there are variables defined by the header row, the missing variables are still assigned to the data frame that results from this function. For these cases, numeric variables will have the value NA (missing), while character variables are left as empty strings. With dec = "." the character used for decimal points is set to a point (so as not to interfere with the separating comma). The quote parameter is restricted to the double quote, whereas read.table() also accepts the single quote by default. In contrast with the read.table() function, the comment.char is disabled (set to nothing). Normally, if the comment.char (by default "#") is found in the data, nothing after that sign is read from the row it was found on. In read.csv() this is disabled by default.

The last two rows of the syntax below show the read.delim() command and the parameters needed to get the same functionality from read.table(). The read.delim() function is used to read tab-delimited data, so the sep parameter is set to "\t" by default; \t means tab. The other parameters are identical to those that read.csv() defaults to.

data <- read.table("filename")

data <- read.csv("filename")
data <- read.table("filename", header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "")

data <- read.delim("filename")
data <- read.table("filename", header = TRUE, sep = "\t", quote = "\"", dec = ".", fill = TRUE, comment.char = "")
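As a minimal sketch of the equivalence described above, the following writes a small comma-separated file to a temporary location and reads it back both ways. The file contents and the object names (csv_file, via_csv, via_table) are made up for illustration. The last data row is deliberately one value short, to show what fill = TRUE does for a numeric variable.

```r
# Write a small comma-separated file to a temporary location.
# The contents are invented for this illustration.
csv_file <- tempfile(fileext = ".csv")
writeLines(c("Age,Gender,Income",
             "34,female,2500.50",
             "51,male,3100.00",
             "28,female"),          # this row lacks an Income value
           csv_file)

# The convenient variant ...
via_csv <- read.csv(csv_file)

# ... and the same call spelled out with read.table()
via_table <- read.table(csv_file, header = TRUE, sep = ",", quote = "\"",
                        dec = ".", fill = TRUE, comment.char = "")

# Both calls should give the same data.frame
print(identical(via_csv, via_table))

# The short row is padded: Income is missing (NA) for the third case
print(via_csv)
```

Without fill = TRUE, the short row would make read.table() stop with an error about the number of elements on that line.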

Variable labels

Data that is read into a data.frame can be given variable names. For instance, if the above commands were used to read a data-file containing three variables, variable names can be assigned in several ways. Two ways will be described here: assigning them after the data is read or assigning them using the read.table() command.

names(data) <- c("Age", "Income", "Gender")
data <- read.table("filename", col.names = c("Age", "Income", "Gender"))

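In the syntax above, the names() command is used to assign names to the columns of the data.frame (representing the variables). The names are given as strings (hence the quotation marks) and gathered into a vector using the c() command. The second line passes the same vector to the col.names parameter of read.table(), so that the names are assigned while the data is read.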
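The two approaches can be tried side by side on a small file without a header row. This is only a sketch: the file contents and the object names (txt_file, d1, d2) are invented for the example.

```r
# A whitespace-separated file without a header row; contents are invented.
txt_file <- tempfile(fileext = ".txt")
writeLines(c("34 2500 female",
             "51 3100 male"),
           txt_file)

# Option 1: read first, then name the columns.
# Without a header, read.table() uses the default names V1, V2, V3.
d1 <- read.table(txt_file)
print(names(d1))
names(d1) <- c("Age", "Income", "Gender")

# Option 2: name the columns while reading.
d2 <- read.table(txt_file, col.names = c("Age", "Income", "Gender"))

# Both routes should end in the same data.frame
print(all.equal(d1, d2))
```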

Fixed width files

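When reading files in the 'fixed width' format, we cannot rely on a single character that indicates the separations between variables. Instead, the read.fwf() function has a parameter, widths, by which we tell the function how many characters each variable occupies, and thus where one variable ends and the next one starts. Just as with read.table(), a data.frame is returned. Variable labels are treated in the same way as described above. The second line of the syntax below uses negative widths, which tell read.fwf() to skip that number of characters: there, the first five characters and the two characters before the last variable are ignored.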

data <- read.fwf("filename", widths = c(2, 5, 1), col.names = c("Age", "Income", "Gender"))
data <- read.fwf("filename", widths = c(-5, 2, 5, -2, 1), col.names = c("Age", "Income", "Gender"))
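To make the widths parameter concrete, here is a small sketch with two invented fixed-width files. The first holds only the three variables; the second adds a five-character identifier in front and two filler characters before the last variable, which are skipped with negative widths. File contents and object names are assumptions made for the example.

```r
# File 1: Age (2 characters), Income (5), Gender (1)
fwf_plain <- tempfile(fileext = ".txt")
writeLines(c("3402500F",
             "5103100M"),
           fwf_plain)

# File 2: the same values, preceded by a 5-character id and with
# 2 filler characters before Gender
fwf_padded <- tempfile(fileext = ".txt")
writeLines(c("A00013402500xxF",
             "A00025103100yyM"),
           fwf_padded)

plain <- read.fwf(fwf_plain, widths = c(2, 5, 1),
                  col.names = c("Age", "Income", "Gender"))

# Negative widths skip that many characters
padded <- read.fwf(fwf_padded, widths = c(-5, 2, 5, -2, 1),
                   col.names = c("Age", "Income", "Gender"))

print(plain)
print(all.equal(plain, padded))
```

Note that the leading zero in the Income field disappears on reading, because the column is converted to a number.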

Reading data from the clipboard

data <- read.table(pipe("pbpaste"))
data <- read.table("clipboard")

read.csv() is used for reading comma-separated files and read.delim() for reading tab-delimited files; read.table() is the general command underlying both. read.table(pipe("pbpaste")) reads data from the clipboard on a Mac, and read.table("clipboard") does the same on Windows. Instead of read.table(pipe("pbpaste")) you can use read.delim(pipe("pbpaste")) as well, which is usually the better choice for data copied from a spreadsheet, since that arrives on the clipboard tab-delimited.
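If the same script has to run on both platforms, the choice can be wrapped in a small helper. The function read_clipboard() below is hypothetical (it is not part of R) and is only a sketch: it assumes tab-delimited clipboard contents with a header row, and it deliberately stops on any system other than Windows or macOS.

```r
# Hypothetical helper, not part of R: read tab-delimited clipboard
# contents on Windows or macOS. Extra arguments go to read.delim().
read_clipboard <- function(...) {
  sysname <- Sys.info()[["sysname"]]
  if (sysname == "Windows") {
    read.delim("clipboard", ...)
  } else if (sysname == "Darwin") {       # macOS reports itself as Darwin
    read.delim(pipe("pbpaste"), ...)
  } else {
    stop("read_clipboard() only covers Windows and macOS in this sketch")
  }
}

# Usage, after copying a block of cells (including the header row)
# from a spreadsheet:
# data <- read_clipboard()
```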

Reading data from other statistical packages {foreign}

library(foreign)
data <- read.spss("filename")

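library(foreign) loads the foreign package, which contains the read.spss() function. This function reads data files as written by the SPSS software. By default it returns a list; adding the parameter to.data.frame = TRUE returns a data.frame instead.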
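The difference between the two return types can be seen with a sketch like the one below. It assumes that the example file electric.sav is present in the installed foreign package (it is the file used on the package's own help page for read.spss(), but treat the path as an assumption); the object names are invented.

```r
library(foreign)

# Example SPSS file assumed to ship with the foreign package
sav_file <- system.file("files", "electric.sav", package = "foreign")
stopifnot(nzchar(sav_file))   # stop early if the file is not there

# Default: a list with one element per variable
as_list <- read.spss(sav_file)

# With to.data.frame = TRUE: a data.frame, as returned by read.table()
as_df <- read.spss(sav_file, to.data.frame = TRUE)

print(class(as_list))
print(class(as_df))
```

read.spss() may print warnings about the file while reading; these do not necessarily mean that the import failed.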

– – — — —– ——–
R-Sessions is a collection of manual chapters for R-Project, which are maintained on Curving Normality. All posts are linked to the chapters from the R-Project manual on this site. The manual is free to use, for it is paid by the advertisements, but please refer to it in your work inspired by it. Feedback and topic requests are highly appreciated.
——– —– — — – –
