R-Sessions 09: Data Manipulation


Today’s edition of R-Sessions deals with the manipulation of data that is stored R-Project. Building upon the previous R-Session, attention is paid to recoding of data, ordering, and finally the merging of several sets of data.

Recoding

The most direct way to recode data in R-Project is using a combination of both indexing and conditionals as described elsewhere. To exemplify this, a simply data.frame will be created below, containing variables indicating gender and monthly income in thousands of euros.

gender <- c(“male”, “female”, “female”, “male”, “male”, “male”, “female”)
income <- c(54, 34, 556, 57, 88, 856, 23)
data <- data.frame(gender, income)
data

> gender <- c("male", "female", "female", "male", "male", "male", "female")
> income <- c(54, 34, 556, 57, 88, 856, 23)
> data <- data.frame(gender, income)
> data
  gender income
1   male     54
2 female     34
3 female    556
4   male     57
5   male     88
6   male    856
7 female     23

Some of the values on the income variable seem exceptionally high. Let’s say we want to remove the two values on income higher than 500. In order to do so, we use the which() command, that reveals which of the values is greater than 500. Next, the result of this is used for indexing the data$income variable. Finally, the indicator for missing values, ‘NA’ is assigned to the that selected values of the ‘income’ variables. Obviously, we would normally only use the third line. The first two are shown here, to make clear exactly what is happening.

which(data$income > 500)
data$income[data$income > 500]
data$income[data$income > 500] <- NA
data

> which(data$income > 500)
[1] 3 6
> data$income[data$income > 500]
[1] 556 856
> data$income[data$income > 500] <- NA
> data
  gender income
1   male     54
2 female     34
3 female     NA
4   male     57
5   male     88
6   male     NA
7 female     23

Sometimes, it is desirable to replace missing values by the mean on the respective variables. That is what we are going to do here. Note, that in general practice it is not very sensible to impute two missing values using only five valid values. Nevertheless, we will proceed here.
The first row of the example below shows that it is not automatically possible to calculate the mean of a variable that contains missing values. Since R-Project cannot compute a valid value, NA is returned. This is not what we want. Therefore, we instruct R-Project to remove missing values by adding na.rm=TRUE to the mean() command. Now, the right value is returned. When the same selection-techniques as above are used, an error will occur. Therefore, we need the is.na() command, that returns a vector of logicals (‘TRUE’ and ‘FALSE’ ). Using is.na(), we can use the which() command to select the desired values on the income variable. To these, the calculated mean is assigned.

mean(data$income)
mean(data$income, na.rm=TRUE)
data$income[which(is.na(data$income))] <- mean(data$income, na.rm=TRUE)
data

> mean(data$income)
[1] NA
> mean(data$income, na.rm=TRUE)
[1] 51.2
> data$income[which(is.na(data$income))] <- mean(data$income, na.rm=TRUE)
> data
  gender income
1   male   54.0
2 female   34.0
3 female   51.2
4   male   57.0
5   male   88.0
6   male   51.2
7 female   23.0

ORDER

It is easy to sort a data-frame using the command order. Combined with indexing functions, it works as follows:

x <- c(1,3,5,4,2)
y <- c('a','b','c','d','e')
df <- data.frame(x,y)

df
  x y
1 1 a
2 3 b
3 5 c
4 4 d
5 2 e

df[order(df$x),]
  x y
1 1 a
5 2 e
2 3 b
4 4 d
3 5 c

MERGE

Merge puts multiple data.frames together, based on an identifier-variable which is unique or a combination of variables.

x <- c(1,2,5,4,3)
y <- c(1,2,3,4,5)
z <- c('a','b','c','d','e')

df1 <- data.frame(x,y)
df2 <- data.frame(x,z)
df3 <- merge(df1,df2,by=c("x"))

 df3
  x y z
1 1 1 a
2 2 2 b
3 3 5 e
4 4 4 d
5 5 3 c

Leave a Reply