R-Sessions 07: Data Structure

R-Project handles data differently from some mainstream statistical programs, such as SPSS. It can hold any number of data sets at the same time, limited only by the memory of your computer. The results of statistical tests or estimation procedures are often stored inside 'data-sets' as well. To serve these different needs, several types of data storage are available.

This paragraph introduces some of these types of data storage and shows how such objects are managed by R-Project.

Single values and vectors

As was already shown in a previous paragraph, data can easily be stored inside objects. This goes for single values and for series of values, called vectors. Below, three variables are created (x, y, and z): x contains a single numerical value, y contains two numerical values, and z contains two character values. This is done with the c() command, which concatenates data. Finally, the syntax below shows that different types of data (i.e. numerical data and character data) cannot be kept side by side in one vector: when this is tried, the numerical data are converted into character data.

x <- 3
y <- c(4,5)
z <- c("one", "two")
x
y
z

c(x,y)
c(x,y,z)


> x <- 3
> y <- c(4,5)
> z <- c("one", "two")
> x
[1] 3
> y
[1] 4 5
> z
[1] "one" "two"
>
> c(x,y)
[1] 3 4 5
> c(x,y,z)
[1] "3" "4" "5" "one" "two"
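
As a small additional sketch of the conversion described above, the base R function class() reports the type of data an object holds, and as.numeric() converts character data back to numbers where that is possible. The values are typed in directly, so no new variables are created.

```r
# class() reports the type of data stored in a vector
class(c(3, 4, 5))             # "numeric"
class(c("one", "two"))        # "character"

# mixing numbers and text: the numbers are converted to character
class(c(3, 4, 5, "one"))      # "character"

# as.numeric() converts character data back to numbers where possible
as.numeric(c("3", "4", "5"))  # 3 4 5
```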

Matrix

Oftentimes, we want to store data in more than one dimension. This can be done with matrices, which have two dimensions. As with vectors, all the data inside a matrix have to be of the same type.

a <- matrix(nrow=2, ncol=5)
b <- matrix(data=1:10, nrow=2, ncol=5, byrow=FALSE)
c <- matrix(data=1:10, nrow=2, ncol=5, byrow=TRUE)
d <- matrix(data=c("one", "two", "three", "four", "five", "six"), nrow=3, ncol=2, byrow=TRUE)
a
b
c
d

In the syntax above, four matrices are created and assigned to variables that were called 'a', 'b', 'c', and 'd'. The first matrix ('a') is created using the matrix() function. It is specified that this matrix will have two rows (nrow=2) and five columns (ncol=5). In the output we see the resulting matrix: all the data is missing, which is indicated by 'NA'.

The following two matrices have data assigned to them through the 'data=' parameter. Both matrices receive the values 1 to 10, but for the first matrix 'byrow=FALSE' is specified and for the second 'byrow=TRUE'. This changes the way the data are entered into the matrix (column-wise or row-wise, respectively), as can be seen below.

The last matrix shows us that character values can be stored in matrices as well.

> a <- matrix(nrow=2, ncol=5)
> b <- matrix(data=1:10, nrow=2, ncol=5, byrow=FALSE)
> c <- matrix(data=1:10, nrow=2, ncol=5, byrow=TRUE)
> d <- matrix(data=c("one", "two", "three", "four", "five", "six"), nrow=3, ncol=2, byrow=TRUE)
> a
     [,1] [,2] [,3] [,4] [,5]
[1,]   NA   NA   NA   NA   NA
[2,]   NA   NA   NA   NA   NA
> b
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> c
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
> d
     [,1]    [,2]
[1,] "one"   "two"
[2,] "three" "four"
[3,] "five"  "six"
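
A few further properties of matrices can be checked with a short sketch like the one below. It is only an illustration added to the matrices above: the values are made up and typed in directly, so the variables of this paragraph stay untouched.

```r
# dim() returns the number of rows and columns of a matrix
dim(matrix(data = 1:10, nrow = 2, ncol = 5))      # 2 5

# when only nrow is given, ncol follows from the length of the data
ncol(matrix(data = 1:10, nrow = 2))               # 5

# as with vectors, mixing numbers and text turns everything into character
matrix(data = c(1, "two", 3, "four"), nrow = 2)
```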

data.frame

In social sciences, we often use data-sets in which the rows represent respondents or participants of a survey and the columns represent different variables. This makes matrices less suitable, because we often have variables that store different types of data. This cannot be stored in a matrix.

For this purpose, R-Project has data.frames available. Data.frames can store multiple vectors that don't have to contain the same type of data. The vectors passed to the data.frame() function, which creates data.frames, form the columns. All vectors need to be of the same length.

p <- 1:5
q <- c("one", "two", "three", "four", "five")
r <- data.frame(p, q)
p
q
r

In the syntax above, the variables 'p' and 'q' are created with vectors of respectively numbers and characters. Then, these are combined in the data.frame called r. The output of the data.frame shows that the columns are named according to the variables entered and that the values in the rows correspond to the order of the values in the data-vectors.


> p <- 1:5
> q <- c("one", "two", "three", "four", "five")
> r <- data.frame(p, q)
> p
[1] 1 2 3 4 5
> q
[1] "one" "two" "three" "four" "five"
> r
  p     q
1 1   one
2 2   two
3 3 three
4 4  four
5 5  five
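
The sketch below illustrates two of the points made above: columns keep their own type, and vectors of unequal length are refused. It also shows the stringsAsFactors argument of data.frame(), which decides whether text columns are stored as character data or as factors. The data.frames are built directly inside the function calls, so no new variables are created.

```r
# str() gives a compact overview: one line per column, with its type
str(data.frame(p = 1:5, q = c("one", "two", "three", "four", "five")))

# vectors of lengths 5 and 2 cannot be combined: this raises an error
# (try() prints the error message and lets the script continue)
try(data.frame(p = 1:5, q = c("one", "two")))

# stringsAsFactors decides how text columns are stored
class(data.frame(q = c("one", "two"), stringsAsFactors = FALSE)$q)  # "character"
class(data.frame(q = c("one", "two"), stringsAsFactors = TRUE)$q)   # "factor"
```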

Indexing

Oftentimes, we don't want to use a full dataset for our analysis, or we want to select a single value from a range to be able to review or change it. We may want to see the values on a single variable, or the scores on several variables for a specific respondent of a survey. This can be achieved by a technique called indexing.

Above, we've already created some variables of different types: a vector ('y'), a matrix ('c'), and a data.frame ('r'). In the first rows of the syntax below these variables are called for, so we can see what they look like.

y
c
r

y[2]
c[2,3]
r[3,1]
r$q
r[3:4,]

Then, they are indexed using square brackets [ and ]. Since a vector has only one dimension, we place the number, or index, of the value we want to see between brackets directly behind the name of the variable. In the syntax above, we want to see only the second value stored inside vector 'y'. In the output below we receive the value 5, which is correct.

A matrix has two dimensions and can be indexed using two values instead of just one. For instance, let's say we want to see the value in the second row and the third column of matrix 'c'. We use the index [2,3] to achieve this (first the row number, then the column number). Below we can see that this works out just fine.

Then the data.frame, which works almost the same as a matrix. First we want to see the value on the third row of the first column and index the data.frame 'r' by using [3,1]. The result is as expected. When we want to see all the values on a specific variable, we can address this variable by naming the data.frame in which it is stored, then a dollar-sign $ and finally the exact name of the variable. This is done on the next row of the syntax, where we call for the variable 'q' inside data.frame 'r'. In the output, 'q' is printed with a 'Levels' line because the version of R used here converts character vectors to factors when it builds a data.frame. Since R 4.0.0 such columns remain character vectors unless stringsAsFactors=TRUE is specified.

The same can be achieved for a single row of the data.frame, giving the scores on all columns / variables for one row / respondent. This is done by specifying the row we want, then a comma, and leaving the index for the column number open. We additionally show something else here: it is not necessary to specify just a single value, since a combination or range of values is fine as well. So, here we want the values stored in all the columns of the data.frame for the third and fourth row. We achieve this by indexing the data.frame 'r' using [3:4, ].


> y
[1] 4 5
> c
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
> r
  p     q
1 1   one
2 2   two
3 3 three
4 4  four
5 5  five
>
> y[2]
[1] 5
> c[2,3]
[1] 8
> r[3,1]
[1] 3
> r$q
[1] one two three four five
Levels: five four one three two
> r[3:4,]
  p     q
3 3 three
4 4  four
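
Besides row and column numbers, base R also accepts column names, logical conditions, and negative numbers as an index. The sketch below illustrates this on a small made-up data.frame. The names 'survey', 'age', and 'city' and all values are hypothetical and not part of the examples above.

```r
# hypothetical survey data; 'survey', 'age' and 'city' are made-up names
survey <- data.frame(age  = c(23, 35, 41, 58),
                     city = c("Oslo", "Lima", "Kyoto", "Cairo"),
                     stringsAsFactors = FALSE)

survey[2, "city"]           # row 2, column selected by name: "Lima"
survey[survey$age > 30, ]   # logical index: only rows where age exceeds 30
survey[-1, ]                # negative index: all rows except the first
survey[, c("city", "age")]  # columns selected (and reordered) by name
```

Note that 'survey' stays in the workspace after running this, so it will show up in later ls() listings until it is removed.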

Managing variables

So, we have created a number of variables. Although present-day computers can easily hold the small amounts of data we've put in them, it is a good habit to clean up variables when they are no longer needed. This preserves memory when working with large (real-life) data-sets and prevents you from mistakenly mixing variables up.

Variables in R-Project are stored in what is called the workspace. There is much more to it than will be described here, since not only variables are stored in the workspace. We can see which variables we created by using the ls() function. We receive a list of the ten variables we created in this paragraph. If you've been working on some other project in R, such as previous paragraphs of the manual, other objects might be listed as well.

ls()


> ls()
[1] "a" "b" "c" "d" "p" "q" "r" "x" "y" "z"

Now we want to clean up a bit. This can be done by using the rm() (rm = remove) function. The variables that need to be deleted are specified between the parentheses. We first delete all the variables except the ones that were associated with creating the data.frame, because we will need them below. When ls() is called again, we see that the variables are gone.

rm(a,b,c,d,x,y,z)
ls()
rm(p,q)


> rm(a,b,c,d,x,y,z)
> ls()
[1] "p" "q" "r"
> rm(p,q)

Remember that the variables 'p' and 'q' were stored inside the data.frame we called 'r'. Therefore, we don't need the separate copies anymore, and they are deleted as well.
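
As a further sketch, rm() also accepts the names of the objects as a character vector through its 'list=' argument, and exists() can be used to check whether an object with a given name is available. The objects 'tmp_one' and 'tmp_two' are hypothetical and are created only to be removed again.

```r
# hypothetical objects, created only to be removed again
tmp_one <- 1
tmp_two <- 2

exists("tmp_one")   # TRUE: the object is available

# rm() accepts the names as a character vector through 'list='
rm(list = c("tmp_one", "tmp_two"))

exists("tmp_one")   # FALSE: the object is gone

# rm(list = ls()) would remove every object in the workspace at once;
# it is left commented out here because it cannot be undone
```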

Attaching data.frames

When working with survey data, as the quantitative sociologist often does, data.frames are usually the type of data storage of choice. As we have already seen, variables stored in a data.frame can be addressed individually or group-wise. In daily practice, however, typing all the indexes becomes very tedious when working with specific (subsets of) variables. Fortunately, it is possible to bring the variables stored in a data.frame to the foreground by attaching them to the active workspace.

ls()
p
r
attach(r)
ls()
p
r
detach(r)

In the syntax above, a list of the available data-objects is requested. We see in the output below that only 'r' is available, which we remember to be a data.frame containing the variables 'p' and 'q'. When 'p' is called for directly, an error message is returned: the object "p" cannot be found.

In such cases, we can tell R-Project where to look by indexing (as done above), or by attaching the data.frame. This is done with the attach() function. When we request a list of available objects again, we still see only the data.frame 'r' coming up, but when object 'p' is requested, we now see it returned. The data.frame can still be called for normally. Finally, we can bring the data.frame back to the 'background' by using the detach() function.

A word of caution: when working with an attached data.frame, it is very important to keep track of changes made to the variables. The 'p' variable we could call for while the data.frame was attached is a copy, not the 'p' variable stored inside the data.frame itself. So, changes made to 'p' while the data.frame is attached are not stored in the data.frame, and the data.frame is unchanged when it is detached. This of course does not hold when the changes are made directly to the variables inside the data.frame, for instance through r$p.


> ls()
[1] "r"
> p
Error: object "p" not found
> r
  p     q
1 1   one
2 2   two
3 3 three
4 4  four
5 5  five
> attach(r)
> ls()
[1] "r"
> p
[1] 1 2 3 4 5
> r
  p     q
1 1   one
2 2   two
3 3 three
4 4  four
5 5  five
> detach(r)
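
As an alternative to attach() that sidesteps the pitfall described above, base R offers the with() function, which evaluates a single expression using the columns of a data.frame without attaching anything. The sketch below is only an illustration of that alternative, not part of the session above. The data.frame 'demo' is a hypothetical name, used so that 'r' is left untouched.

```r
# hypothetical data.frame; 'demo' is a made-up name
demo <- data.frame(p = 1:5, q = c("one", "two", "three", "four", "five"))

# with() looks up 'p' inside 'demo' for this one expression only
with(demo, mean(p))     # 3

# changes are made on the data.frame itself, so nothing gets out of sync
demo$p <- demo$p * 2
demo$p                  # 2 4 6 8 10
```

Because nothing is attached, there is no second copy of 'p' to keep track of. The object 'demo' remains in the workspace until it is removed with rm(demo).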
