Reshaping
Data is sometimes stored in file formats that are difficult for R to recognise, and it can also be arranged within a file in a way that makes it difficult to conduct our analyses.
Most statistical analyses in R need data to be in long format. In long format, columns are variables and rows are observations. For example:
| country | population |
|---|---|
| Albania | 2,800,000 |
| Botswana | 2,000,000 |
| Cambodia | 14,800,000 |
The variables country and population are the columns, and each observation--Albania, Botswana, and Cambodia--is a row.
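As a small illustration, the long format table above could be built in R along these lines. The object name LongExample is made up for this sketch, and population is stored as numbers here so that it can be used directly in a calculation.

```r
# Hypothetical object name; the values come from the table above.
# Long format: one column per variable, one row per observation.
LongExample <- data.frame(
  country = c("Albania", "Botswana", "Cambodia"),
  population = c(2800000, 2000000, 14800000),
  stringsAsFactors = FALSE
)

names(LongExample)            # the variables
nrow(LongExample)             # the number of observations
mean(LongExample$population)  # a numeric column can go straight into an analysis
```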
However, sometimes we run across data that is in wide format. This is when the columns contain the observations and the rows contain the variables. The country population data in wide format looks like this:
| variables | Albania | Botswana | Cambodia |
|---|---|---|---|
| population | 2,800,000 | 2,000,000 | 14,800,000 |
There are a number of ways to change this data from wide to long format, including the t and reshape commands.
One of the easiest ways is with the t (transpose) command. (If you're interested, this just turns the dataframe into a matrix and returns its transpose.)
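To see what "turns the dataframe into a matrix" means in practice, here is a minimal sketch on made-up data (MixedData is not part of the population example). Because a matrix can hold only one type of value, a dataframe that mixes text and numbers comes back from t as a character matrix.

```r
# Hypothetical example data, not from the text:
# one character column and one numeric column.
MixedData <- data.frame(
  name = c("a", "b", "c"),
  value = c(1.5, 2.5, 3.5),
  stringsAsFactors = FALSE
)

Transposed <- t(MixedData)

is.matrix(Transposed)     # the result is a matrix, not a dataframe
is.character(Transposed)  # the numeric column was coerced to character
dim(Transposed)           # old columns are now rows, old rows are now columns

# The same values as converting to a matrix by hand and transposing that
all(Transposed == t(as.matrix(MixedData)))
```

This is why the result of t usually needs to be converted back to a dataframe, and its columns back to numbers, before analysis.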
For the country population example, first set up the wide format data:
variables <- c("population")
Albania <- c("2800000")
Botswana <- c("2000000")
Cambodia <- c("14800000")
WidePop <- data.frame(variables, Albania, Botswana, Cambodia, stringsAsFactors = FALSE)
Now we have something that looks like this:
WidePop
## variables Albania Botswana Cambodia
## 1 population 2800000 2000000 14800000
To reshape the data with t and see the results simply type:
LongPop <- t(WidePop)
LongPop
## [,1]
## variables "population"
## Albania "2800000"
## Botswana "2000000"
## Cambodia "14800000"
You can see that the resulting matrix needs to be cleaned up a bit before we can use it for statistical analyses. First, t converted our dataframe into a matrix. We need to convert it back using as.data.frame.
LongPop <- as.data.frame(LongPop, stringsAsFactors = FALSE)
LongPop
## V1
## variables population
## Albania 2800000
## Botswana 2000000
## Cambodia 14800000
From this output, the data seem almost ready to use: we just need to delete the first row and rename the columns. However, you might have noticed that what looks like the first column has no variable name at all, and the second column has the name V1. This is because R is not storing the country names as a regular column, but as row names. We need to copy the row names into a regular column with the rownames command:
LongPop$country <- rownames(LongPop)
Now we have a new column called country. Now change the name of V1 to population with the rename command in the reshape package.
library(reshape)
LongPop <- rename(LongPop, c(V1 = "population"))
Now we can remove the first row.
LongPop <- LongPop[-1, ]
If you want to have country as the first variable in the dataframe you can create a new dataframe like this:
LongPop <- LongPop[, c(2, 1)]
You should now have a dataframe that looks like this:
LongPop
## country population
## Albania Albania 2800000
## Botswana Botswana 2000000
## Cambodia Cambodia 14800000
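The t steps above can be collected into one short script. This is a sketch rather than the only way to do it: it renames the column with base R's names instead of the reshape package's rename, and the last two steps (converting population to numbers and resetting the row names) are additions that the walkthrough above does not perform.

```r
# Rebuild the wide data from the example.
WidePop <- data.frame(
  variables = "population",
  Albania = "2800000",
  Botswana = "2000000",
  Cambodia = "14800000",
  stringsAsFactors = FALSE
)

# Transpose, then turn the resulting matrix back into a dataframe.
LongPop <- as.data.frame(t(WidePop), stringsAsFactors = FALSE)

# Copy the row names into a regular column.
LongPop$country <- rownames(LongPop)

# Base R alternative to reshape's rename(): rename V1 in place.
names(LongPop)[names(LongPop) == "V1"] <- "population"

# Drop the leftover "variables" row and put country first.
LongPop <- LongPop[-1, c("country", "population")]

# Additions beyond the walkthrough: store population as numbers
# and replace the country row names with plain row numbers.
LongPop$population <- as.numeric(LongPop$population)
rownames(LongPop) <- NULL

LongPop
```

Selecting the columns by name, rather than by position as in c(2, 1), keeps the script working if the column order changes.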
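The reshape command that ships with R (it is not part of the reshape package loaded above) can make the same conversion in a single call. Before using it to convert this data frame into a usable format we need to come up with a vector of observation names. In this case our observations are countries, so we need to create a vector of the country names. The slow way to do this is to type out the names:
CountryNames <- c("Albania", "Botswana", "Cambodia")
The faster way is to use the names command to create an object from the WidePop dataframe's variable names, dropping the first one--variables--with [-1].
CountryNames <- names(WidePop)[-1]
Now pass this vector to reshape:
LongPop <- reshape(WidePop, varying = CountryNames, v.names = "population", timevar = "country", times = CountryNames, direction = "long")
WidePop is our data, and direction tells reshape that we want our data in long format. varying names the wide columns to stack. Leaving it out stops the call with the error "no 'reshapeWide' attribute, must specify 'varying'". v.names gives the name of the new stacked column, while timevar and times create a country column that holds the original column names. The result has one row per country. reshape also keeps the variables column and adds an id column, which you can drop if you do not need them.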
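For reference, here is a self-contained sketch of the reshape approach that ends with the same two-column layout as the t example. The argument choices (varying, v.names, timevar, times) are one reasonable way to call R's built-in reshape on this data, and the cleanup steps at the end are an addition for illustration.

```r
# Rebuild the wide data from the example.
WidePop <- data.frame(
  variables = "population",
  Albania = "2800000",
  Botswana = "2000000",
  Cambodia = "14800000",
  stringsAsFactors = FALSE
)

# Every column except the first is a country.
CountryNames <- names(WidePop)[-1]

LongPop <- reshape(
  WidePop,
  varying = CountryNames,    # the wide columns to stack
  v.names = "population",    # name of the stacked column
  timevar = "country",       # name of the column that records the source column
  times = CountryNames,      # values to put in that column
  direction = "long"
)

# Keep only the two columns of interest and reset the row names.
LongPop <- LongPop[, c("country", "population")]
rownames(LongPop) <- NULL

LongPop
```

As in the t example, population is still stored as text at this point, so it would need as.numeric before being used in calculations.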