
R-Sessions 07: Data Structure

August 6, 2008 R-Project, R-Sessions No Comments

The way R-Project handles data differs from some mainstream statistical programs, such as SPSS. R can hold any number of data sets at the same time, limited only by the memory of your computer. Often, the results of statistical tests or estimation procedures are stored inside such objects as well. In order to serve different needs, several different types of data-storage are available.

This paragraph introduces some of the types of data-storage and shows how these objects are managed by R-Project.

Single values and vectors

As was already shown in a previous paragraph, data can easily be stored inside objects. This goes for single values and for ranges of values, called vectors. Below, three variables are created (x, y, and z), where x contains a single numerical value, y contains two numerical values and z contains two character values. This is done by using the c()-command, which concatenates data. Finally, the syntax below shows that different types of data (i.e. numerical data and character data) cannot be kept apart within one vector: when they are concatenated, the numerical data are converted into character data.

x <- 3
y <- c(4,5)
z <- c("one", "two")
x
y
z
 
c(x,y)
c(x,y,z)


> x <- 3
> y <- c(4,5)
> z <- c("one", "two")
> x
[1] 3
> y
[1] 4 5
> z
[1] "one" "two"
>
> c(x,y)
[1] 3 4 5
> c(x,y,z)
[1] "3" "4" "5" "one" "two"
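The conversion seen in the last command can be checked with class(), which reports the type of data an object holds. The following is only a minimal sketch that reuses the values from above; the names `num`, `mixed` and `back` are chosen for illustration and are not part of the original syntax.

```r
# Illustrative names; not part of the original example
num   <- c(3, 4, 5)              # numerical vector
mixed <- c(num, "one", "two")    # the numbers are converted to character

class(num)     # "numeric"
class(mixed)   # "character"

# Converting back only works for elements that look like numbers;
# the others become NA (R normally warns about this)
back <- suppressWarnings(as.numeric(mixed))
back           # 3 4 5 NA NA
```

Checking the class of a vector right after concatenating is a quick way to notice an unintended conversion before it causes trouble in an analysis.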

Matrix

Oftentimes, we want to store data in more than one dimension. This can be done with matrices, which have two dimensions. As with vectors, all the data inside a matrix have to be of the same type.

a <- matrix(nrow=2, ncol=5)
b <- matrix(data=1:10, nrow=2, ncol=5, byrow=FALSE)
c <- matrix(data=1:10, nrow=2, ncol=5, byrow=TRUE)
d <- matrix(data=c("one", "two", "three", "four", "five", "six"), nrow=3, ncol=2, byrow=TRUE)
a
b
c
d

In the syntax above, four matrices are created and assigned to variables that were called ‘a’, ‘b’, ‘c’, and ‘d’. The first matrix (‘a’) is created using the matrix() function. It is specified that this matrix will have two rows (nrow=2) and five columns (ncol=5). In the output we see the resulting matrix: all the data is missing, which is indicated by ‘NA’.

The following two matrices have data assigned to them by the ‘data=’ parameter. To both matrices the values 1 to 10 are assigned, but in the first matrix ‘byrow=FALSE’ is specified and in the second ‘byrow=TRUE’. This changes the way the data is entered into the matrix (column-wise for the first, row-wise for the second), as can be seen below.

The last matrix shows us that character values can be stored in matrices as well.

> a <- matrix(nrow=2, ncol=5)
> b <- matrix(data=1:10, nrow=2, ncol=5, byrow=FALSE)
> c <- matrix(data=1:10, nrow=2, ncol=5, byrow=TRUE)
> d <- matrix(data=c("one", "two", "three", "four", "five", "six"), nrow=3, ncol=2, byrow=TRUE)
> a
     [,1] [,2] [,3] [,4] [,5]
[1,]   NA   NA   NA   NA   NA
[2,]   NA   NA   NA   NA   NA
> b
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10
> c
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
> d
     [,1]    [,2]
[1,] "one"   "two"
[2,] "three" "four"
[3,] "five"  "six"
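To make the effect of ‘byrow’ and the single-type rule a little more tangible, here is a small sketch. The object names (`m_col`, `m_row`, `m_mixed`) are made up for this illustration and do not appear in the original syntax.

```r
# Illustrative names; not part of the original example
m_col <- matrix(1:6, nrow = 2, ncol = 3)                 # filled column by column (the default)
m_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)   # filled row by row

dim(m_col)     # 2 3: two rows, three columns
m_col[1, ]     # first row: 1 3 5
m_row[1, ]     # first row: 1 2 3

# Mixing types in a matrix converts everything to character, as with vectors
m_mixed <- matrix(c(1, "two"), nrow = 1)
typeof(m_mixed)   # "character"
```

Both matrices hold the same six values; only the order in which the cells are filled differs.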

data.frame

In social sciences, we often use data-sets in which the rows represent respondents or participants of a survey and the columns represent different variables. This makes matrices less suitable, because we often have variables that store different types of data. This cannot be stored in a matrix.

For this purpose, R-Project has data.frames available. Data.frames can store multiple vectors that don’t have to contain the same type of data. The columns are formed by the vectors passed to the data.frame() function, which creates data.frames. All vectors need to be of the same length.

p <- 1:5
q <- c("one", "two", "three", "four", "five")
r <- data.frame(p, q)
p
q
r

In the syntax above, the variables ‘p’ and ‘q’ are created with vectors of respectively numbers and characters. Then, these are combined in the data.frame called r. The output of the data.frame shows that the columns are named according to the variables entered and that the values in the rows correspond to the order of the values in the data-vectors.


> p <- 1:5
> q <- c("one", "two", "three", "four", "five")
> r <- data.frame(p, q)
> p
[1] 1 2 3 4 5
> q
[1] "one" "two" "three" "four" "five"
> r
  p     q
1 1   one
2 2   two
3 3 three
4 4  four
5 5  five
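The structure of a data.frame can be inspected with a few standard functions. The sketch below is an illustration only: the names `r_chr` and `r_fac` are hypothetical. It also sets the ‘stringsAsFactors’ argument explicitly, because its default differs between R versions: before R 4.0.0, data.frame() converted character vectors to factors by default, which is why output produced with older versions shows ‘Levels’ for the q column.

```r
p <- 1:5
q <- c("one", "two", "three", "four", "five")

# Setting stringsAsFactors explicitly makes the result independent of the R version
r_chr <- data.frame(p, q, stringsAsFactors = FALSE)
r_fac <- data.frame(p, q, stringsAsFactors = TRUE)

dim(r_chr)        # 5 2: five rows, two columns
names(r_chr)      # "p" "q"
class(r_chr$p)    # "integer"
class(r_chr$q)    # "character"
class(r_fac$q)    # "factor"
```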

Indexing

Oftentimes, we don’t want to use a full dataset for our analysis, or we want to select a single value from a range to be able to review or change it. This can be to see the values on a single variable or to see the scores on several variables for a specific respondent of a survey. This can be achieved by a technique called indexing variables.

Above, we’ve already created some variables of different types: a vector (‘y’), a matrix (‘c’), and a data.frame (‘r’). In the first rows of the syntax below these variables are called for, so we can see what they look like.

y
c
r

y[2]
c[2,3]
r[3,1]
r$q
r[3:4,]

Then, they are indexed using square brackets [ and ]. Since a vector has only one dimension, we place the number or index of the value we want to see between the brackets that are placed behind the name of the variable. In the syntax above, we want to see only the second value stored inside vector ‘y’. In the output below we receive the value 5, which is correct.

A matrix has two dimensions and can be indexed using two values, instead of just one. For instance, let’s say we want to see the value on the second row in the third column stored in matrix ‘c’. We use the index [2,3] to achieve this (first the row number, then the column number). Below we can see that this works out just fine.

Then the data.frame, which works almost the same as a matrix. First we want to see the value on the third row of the first column and index the data.frame ‘r’ by using [3,1]. The result is as expected. When we want to see all the values on a specific variable, we can address this variable by naming the data.frame in which it is stored, then a dollar-sign $ and finally the exact name of the variable. This is done on the next row of the syntax, where we call for the variable ‘q’ inside data.frame ‘r’.

The same can be achieved for a single row of the data.frame, giving the scores on all columns / variables for one row / respondent. This is done by specifying the row we want, then a comma, and leaving the index for the column number open. We additionally show something else here: it is not necessary to specify just a single value; a combination or range of values is fine as well. So here we want the values stored in all the columns of the data.frame for the third and fourth row. We achieve this by indexing the data.frame ‘r’ using [3:4, ].


> y
[1] 4 5
> c
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    2    3    4    5
[2,]    6    7    8    9   10
> r
  p     q
1 1   one
2 2   two
3 3 three
4 4  four
5 5  five
>
> y[2]
[1] 5
> c[2,3]
[1] 8
> r[3,1]
[1] 3
> r$q
[1] one   two   three four  five
Levels: five four one three two
> r[3:4,]
  p     q
3 3 three
4 4  four
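Indexing is not limited to viewing values; it can also be used to select by column name, to select rows that meet a condition, and to change a value. The following sketch goes slightly beyond the syntax above and is meant as an illustration only. It rebuilds the data.frame ‘r’ with ‘stringsAsFactors = FALSE’ so that the q column holds plain character values regardless of the R version.

```r
r <- data.frame(p = 1:5,
                q = c("one", "two", "three", "four", "five"),
                stringsAsFactors = FALSE)

r[3, "q"]        # same cell as r[3, 2]: "three"
r[r$p > 3, ]     # all columns, only the rows for which p is larger than 3

# Indexing on the left-hand side of an assignment changes the value
r[3, "p"] <- 30L
r$p              # 1 2 30 4 5
```

Selecting columns by name rather than by position tends to be more robust, since the syntax keeps working when columns are added or reordered.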

Managing variables

So, we have created a number of variables. Although present-day computers can easily hold the small amounts of data we’ve put in them, it is a good habit to clean up variables when they are no longer needed. This preserves memory when working with large (real-life) data-sets and prevents you from mistakenly mixing variables up.

Variables in R-Project are stored in what is called the workspace. There is much more to it than will be described here, since not only variables are stored in the workspace. We can see what variables we created by using the ls() function. We receive a list of the ten variables we created in this paragraph. If you’ve been working on some other project in R, such as previous paragraphs of the manual, other objects might be listed as well.

ls()


> ls()
[1] "a" "b" "c" "d" "p" "q" "r" "x" "y" "z"

Now we want to clean up a bit. This can be done by using the rm() (rm = remove) function. Between the brackets the variables that need to be deleted are specified. We first delete all the variables, except the ones that were associated with creating the data.frame because we will need them below. When a new ls() is called for, we see that the variables are gone.

rm(a,b,c,d,x,y,z)
ls()
rm(p,q)


> rm(a,b,c,d,x,y,z)
> ls()
[1] "p" "q" "r"
> rm(p,q)

Remember that the values of ‘p’ and ‘q’ were stored inside the data.frame we called ‘r’. Therefore, we don’t need the separate vectors anymore, and they are deleted as well.
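That removing the loose vectors does not affect the data.frame can be verified directly. A minimal sketch, recreating the objects from above so that it runs on its own:

```r
p <- 1:5
q <- c("one", "two", "three", "four", "five")
r <- data.frame(p, q)

rm(p, q)                         # remove the loose vectors

exists("p", inherits = FALSE)    # FALSE: the loose vector is gone
r$p                              # 1 2 3 4 5: the copy inside r is still there
```

The ‘inherits = FALSE’ argument restricts the check to the current environment, so objects of the same name elsewhere on the search path are not counted.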

Attaching data.frames

When working with survey data, as the quantitative sociologist often does, data.frames are often the type of data-storage of choice. As we have already seen, variables stored in a data.frame can be addressed individually or group-wise. But in daily practice, it can become very tedious to type all the indexes when working with specific (subsets of) variables. Fortunately, it is possible to bring the variables stored in a data.frame to the foreground by attaching them to the active workspace.

ls()
p
r
attach(r)
ls()
p
r
detach(r)

In the syntax above, a list of the available data-objects is requested. We see in the output below that only ‘r’ is available, which we remember to be a data.frame containing the variables ‘p’ and ‘q’. When ‘p’ is called for directly, an error message is returned: object “p” is not to be found.

In such cases, we can tell R-Project where to look by indexing (as done above), or by attaching the data.frame. This is done by the attach() function. When we request a list of available objects again, we still see only the data.frame ‘r’ coming up, but when object ‘p’ is requested, we now see it returned. The data.frame can still be called for normally. Finally, we can bring the data.frame back to the ‘background’ by using the detach() function.

A word of caution: when working with an attached data.frame, it is very important to keep track of changes made to the variables. The ‘p’-variable we could call for when the data.frame was attached is a copy, not the ‘p’-variable stored inside the data.frame itself. Assigning a new value to ‘p’ while the data.frame is attached therefore creates a separate variable in the workspace and leaves the data.frame unchanged, so the change is missing when you later work with the data.frame. This of course does not hold when the changes are made directly to the variables inside the data.frame.


> ls()
[1] "r"
> p
Error: object "p" not found
> r
  p     q
1 1   one
2 2   two
3 3 three
4 4  four
5 5  five
> attach(r)
> ls()
[1] "r"
> p
[1] 1 2 3 4 5
> r
  p     q
1 1   one
2 2   two
3 3 three
4 4  four
5 5  five
> detach(r)
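The pitfall with attached data.frames can be shown in a few lines. This is only a sketch using a stripped-down version of ‘r’; it also shows with(), a standard R function that is not used in the original syntax and that evaluates an expression inside a data.frame without attaching it.

```r
r <- data.frame(p = 1:5)

attach(r)
p <- p * 10        # creates a new 'p' in the workspace; r itself is not changed
detach(r)

p                  # 10 20 30 40 50 (the workspace copy)
r$p                # 1 2 3 4 5 (unchanged)

# Changing the column inside the data.frame itself does persist
r$p <- r$p * 10
r$p                # 10 20 30 40 50

# with() looks up 'p' inside r first, without attaching anything
with(r, mean(p))   # 30
```

For short commands, with() or the $-notation avoids the bookkeeping that attach() and detach() require.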

R-Sessions 06: Most Basic of All

August 4, 2008 R-Project, R-Sessions No Comments

In this section of my R manual, the most basic of the basics are introduced. Attention will be paid to basic calculations, still the basis of every refined statistical analysis. Furthermore, storing data and using stored data in functions is introduced.

Calculations in R

R can be used as a fully functional calculator. When R is started, some licensing information is shown, as well as a prompt ( > ). When commands are typed and ENTER is pressed, R starts working and returns the outcome of the command. Probably the most basic command that can be entered is a basic number. Since it is not stated what to do with this number, R simply returns it. Some special numbers have names. When these names are called, the corresponding number is returned. Finally, next to numbers, text can be handled by R as well.

3
3 * 2
3 ^ 2
pi
apple
"apple"

In the box above, six commands were entered at the R prompt. These commands can be entered one by one, or pasted into the R console all at once. After these commands are entered one by one, the screen looks like this:
… Continue Reading

R-Sessions 04: Getting Packages

July 30, 2008 R-Project, R-Sessions 1 Comment


A freshly installed version of R-Project can do some pretty nice things already, but much more functionality can be obtained by installing packages that contain new functions. These packages are available via the internet and can be installed from within R-Project. Let’s say we want to use the lme4-package, which can be used to estimate linear and generalized multilevel models. The lme4-package does not come pre-installed with R-Project, so we have to download and install it manually. The way this is done is shown based on an R-installation on Windows XP.

Is the package installed?

Before we start to install a package, it is good practice to check whether or not it is already installed. We start R-Project and see the basic screen of the software. (Screenshot: package 01) Choose ‘Packages’ from the menus at the top of the screen. Six options are offered:

  • Load Package: This is used to load packages that are already installed
  • Set CRAN Mirror: Choose from which server the packages should be downloaded
  • Select repositories: Choose CRAN and CRAN (extras)
  • Install package(s): Download and install new packages
  • Update packages: Download new versions of already installed packages when available
  • Install package(s) from local zip files: Used for computers not connected to the internet
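The same check can also be done from the R prompt instead of the menu. The sketch below uses standard functions from R itself; the name `installed` is chosen only for this illustration, and whether lme4 is reported as present depends on your own installation.

```r
# Names of all packages installed in the current library paths
installed <- rownames(installed.packages())

"stats" %in% installed    # TRUE: 'stats' ships with every R installation
"lme4" %in% installed     # TRUE or FALSE, depending on your installation

# Installing and loading would then be done with (not run here):
# install.packages("lme4")
# library(lme4)
```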

… Continue Reading

R-Sessions 03: Getting R-Project


R-Project is an open-source software package that can be obtained freely from the internet. It is available for a large variety of computer operating systems, such as Linux, Mac OS X and Windows. To serve the majority of users, the installation process will be described for a computer running Windows XP.

Downloading R-Project

The website of R-Project can be found at http://www.r-project.org. The left sidebar contains a header ‘Download’. Below it, a link to ‘CRAN’ is provided. CRAN stands for the ‘Comprehensive R Archive Network’ and is a network of several web-servers from which both the software and additional packages can be downloaded. When you click on CRAN, a list of providers of the software is shown. Choose one near your location. Then, a page is shown with several files that can be downloaded. What we want for now is a ‘precompiled’ piece of software that is ready for installation.

Screenshot R-Website
… Continue Reading

R-Sessions 02: Why R-Project?


There are many good reasons to start using R. Obviously, there are some reasons not to use R, as well. Some of these reasons are briefly described here. In the end, it is a matter of personal preference that leads a researcher to use one statistical package or another. Here are some arguments as a basis for your own evaluation.

Why use R?

Powerful & Flexible

Probably the best reason to use R is its power. It is not so much a piece of statistical software as a statistical programming language. This results not only in the availability of powerful methods of analysis, but also in strong capabilities for managing, manipulating and storing your data. Due to its data-structure, R gains a tremendous flexibility. Everything can be stored inside an object, from data, via functions, to the output of functions. This allows the user to easily compare different sets of data, or the results of different analyses. Because the results of an analysis can be stored in objects, parts of these results can be extracted as well and used in new functions / analyses.
Besides the many already available functions, it is possible to write your own. This results in flexibility that can be used to create functions that are not available in other packages. In general: if you can think of it, you can make it. This makes R a very attractive choice for methodologically advanced studies. … Continue Reading

R-Sessions 01: What is R?


R is a software package that is used for statistical analyses. It has a syntax-driven interface which allows for a high level of control, many add-on packages, an active community supporting the program and its users, and an open structure. All in all, it aims to be statistical software that goes beyond pre-set analyses. Oh, and it is free too.

The R software is developed by the R Development Core Team, presently having seventeen members. New versions of the R software are coming out regularly, so apparently progress is made. The R software is open-source. This means that everybody is allowed to read and change the program code. The consequence of this is that many people have written extensions to R which are able to nest themselves in the foundations of the software. For instance, it can interact with programs such as (WIN)BUGS or have extensions based on C or Fortran code.

A typical R session can be characterized by its flexibility. The software is set up in such a way that functions or commands can interact and thereby be combined into new ones. … Continue Reading

R-Sessions: Introducing the R-Sessions

Curving Normality generally consists of two parts: my personal blog and a manual for R-Project I wrote last year. I want to continue working on this manual, and to increase the exposure it gets I decided to regularly post chapters of it on my blog. I will do so under the name R-Sessions.

This manual is already set up in a way that allows the reader to try all the examples by copy-pasting the syntax. I want to extend this feature and add chapters on a specific statistical problem that is dealt with, from a description of the problem and initial exploration of the data to the final solution to the problem. However, I’m not at that point yet. In the near future expect several updates each week in which I present the existing manual.

… Continue Reading

Collective curiosity?

June 20, 2008 Book 2 Comments

ResearchBlogging.org

Those of you who have ever attended a ‘Fête Nos’, a typical Breton festival-type of gathering with music and people dancing, may immediately understand what I’m going to write about. All the others who have attended another gathering of a large number of people will also be completely familiar with my revived curiosity in a specific subject: the collectivity of human behavior and its occurrence in large masses of people.

Every time the music starts at a crowded ‘Fête Nos’, something peculiar happens: within seconds, the mass of people, all talking to each other and walking seemingly at random, is suddenly dancing together in familiar patterns. This pattern is way too complex to be imposed on all those people: it must, somehow, emerge from the individual moves these people make. Interesting and intriguing, don’t you think?

Even before I started studying sociology I had read `Critical Mass’ (2004) by Philip Ball. I loved this overview of popular science and still do, but somehow it had moved to the back of my memory. I remembered the actor- or boid-based simulations, but I did not really understand how this could be related to the theory-driven sociology that I was studying. I recognized the possibilities offered by the described simulation techniques, but saw them as theories, rather than empirical tests: we can easily make assumptions about behavior and simulate the consequences of that, but then we still don’t know if these assumed behaviors indeed exist and happen in reality.
… Continue Reading

Lying with WordPress statistics

June 20, 2008 Uncategorized 4 Comments

I must admit that I repeatedly feel flattered by the number of page-views on my blog as shown by the WordPress statistics plugin. However, despite the nice graphical representation, they are a little too flattering for the humble number of page-views my blog attracts. A traditional line-graph consists of two axes. Traditionally, these are referred to as the x-axis and the y-axis. To put it bluntly: the WordPress statistics plugin messes up on both axes. … Continue Reading

Example multi-actor simulation

June 18, 2008 Science No Comments

Recently, I discussed the M.A.R.S simulation models developed by Iannaccone and Makowsky. Based on what I read, I decided to try to work out a similar simulation myself. I did so using R-Project and it resulted in the simulation shown below. For more details on the syntax I used, visit the `my functions’ part of my site, which has a page on the syntax for this specific simulation. Please read further for some interpretation of this animation.

… Continue Reading

Welcome to Curving Normality

Curving Normality is an academic blog maintained by Rense Nieuwenhuis. He uses this blog to write about the social sciences in general, fascinating journal papers, useful data, interesting books, and statistics using R. In addition, his personal academic activities are shared here as well.