Reshaping
Not only are data sometimes stored in file formats that are difficult for R to recognise, but they can also be arranged within a file in a way that makes it difficult to conduct our analyses.
Most statistical analyses in R need data to be in long format. In long format columns are variables and rows are observations. For example:
country | population |
---|---|
Albania | 2,800,000 |
Botswana | 2,000,000 |
Cambodia | 14,800,000 |
The variables country and population are the columns, and the observations (Albania, Botswana, and Cambodia) are the rows.
However, sometimes we run across data that are in wide format. This is when the columns contain the observations and the rows contain the variables. The country population data in wide format looks like this:
variables | Albania | Botswana | Cambodia |
---|---|---|---|
population | 2,800,000 | 2,000,000 | 14,800,000 |
There are a number of ways to change this data from wide to long format. Using the `t` command is covered on this page. Other options include `reshape` in the stats package and various functions in the reshape2 package.
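For comparison, here is a minimal sketch of the `reshape` route from the stats package, applied to the same country population data used in the rest of this page. The object names `countries` and `LongAlt`, and the choice of `country` as the name of the new identifier column, are assumptions made for illustration.

```r
# Wide format data: one row per variable, one column per country
WidePop <- data.frame(variables = "population",
                      Albania = "2800000",
                      Botswana = "2000000",
                      Cambodia = "14800000",
                      stringsAsFactors = FALSE)

# Names of the columns that should be stacked (hypothetical helper object)
countries <- c("Albania", "Botswana", "Cambodia")

# Stack the three country columns into a single population column;
# the column each value came from is recorded in a new country column
LongAlt <- reshape(WidePop,
                   direction = "long",
                   varying = countries,
                   v.names = "population",
                   timevar = "country",
                   times = countries,
                   idvar = "variables")

# Show only the two columns of interest
LongAlt[, c("country", "population")]
```

The population values stay as character strings here because that is how they were entered. The call needs more arguments than `t`, which is one reason the rest of this page sticks with the transpose approach.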
One of the easiest ways is with the `t` (transpose) command. (If you're interested, this just turns the dataframe into a matrix and returns its transpose.)
For this example, first set up the wide format data:
variables <- c("population")
Albania <- c("2800000")
Botswana <- c("2000000")
Cambodia <- c("14800000")
WidePop <- data.frame(variables, Albania, Botswana, Cambodia, stringsAsFactors = FALSE)
Now we have something that looks like this:
WidePop
## variables Albania Botswana Cambodia
## 1 population 2800000 2000000 14800000
To reshape the data with `t` and see the results, simply type:
LongPop <- t(WidePop)
LongPop
## [,1]
## variables "population"
## Albania "2800000"
## Botswana "2000000"
## Cambodia "14800000"
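Before cleaning up, a quick way to confirm what `t` returned is to inspect the result directly. This is only a sketch; it rebuilds `WidePop` so that it can be run on its own.

```r
# Same wide format data as above
WidePop <- data.frame(variables = "population",
                      Albania = "2800000",
                      Botswana = "2000000",
                      Cambodia = "14800000",
                      stringsAsFactors = FALSE)

LongPop <- t(WidePop)

is.matrix(LongPop)      # is the result a matrix?
is.data.frame(LongPop)  # is it still a data frame?
typeof(LongPop)         # storage type shared by every cell
dim(LongPop)            # number of rows and columns
```

A matrix can hold only one type of value, so a wide table that mixes a text label with numbers will come out of `t` with every cell stored as character, even if the numbers were entered as numbers.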
You can see that the resulting matrix needs to be cleaned up a bit before we can use it for statistical analyses. First, `t` converted our dataframe into a matrix. We need to convert it back using `as.data.frame`.
LongPop <- as.data.frame(LongPop, stringsAsFactors = FALSE)
LongPop
## V1
## variables population
## Albania 2800000
## Botswana 2000000
## Cambodia 14800000
From this output, the data seem almost ready to use; we just need to delete the first row and rename the columns. However, you might have noticed that what looks like the first column has no variable name at all, while what looks like the second column has the name V1. This is because R is not storing the country names as a regular column, but as the row names of the dataframe. We need to copy the row names into a regular column with the `rownames` command:
LongPop$country <- rownames(LongPop)
Now we have a new column called country. Next, change the name of V1 to population with the `rename` command in the reshape package.
library(reshape)
LongPop <- rename(LongPop, c(V1 = "population"))
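If the reshape package is not installed, the same renaming can be sketched in base R by assigning to `names`. This is an alternative, not the approach used on this page, and the block rebuilds `LongPop` so that it runs on its own.

```r
# Rebuild the example up to the point where the country column is added
WidePop <- data.frame(variables = "population",
                      Albania = "2800000",
                      Botswana = "2000000",
                      Cambodia = "14800000",
                      stringsAsFactors = FALSE)
LongPop <- as.data.frame(t(WidePop), stringsAsFactors = FALSE)
LongPop$country <- rownames(LongPop)

# Rename only the column currently called V1, leaving the others untouched
names(LongPop)[names(LongPop) == "V1"] <- "population"

names(LongPop)
```

Selecting the column by its current name rather than by position means the line still does the right thing if the column order changes.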
Now we can remove the first row.
LongPop <- LongPop[-1, ]
If you want to have country as the first variable in the dataframe, you can reorder the columns like this:
LongPop <- LongPop[, c(2, 1)]
You should now have a dataframe that looks like this:
LongPop
## country population
## Albania Albania 2800000
## Botswana Botswana 2000000
## Cambodia Cambodia 14800000
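The population values in this example were entered as character strings, so they are still text at this point, and the country names appear twice (once as row names and once as a column). A minimal sketch of two optional finishing steps, converting the values to numbers and dropping the redundant row names, might look like this. The block rebuilds `LongPop` using base R only so that it can be run on its own.

```r
# Rebuild the long format dataframe from the wide format data
WidePop <- data.frame(variables = "population",
                      Albania = "2800000",
                      Botswana = "2000000",
                      Cambodia = "14800000",
                      stringsAsFactors = FALSE)
LongPop <- as.data.frame(t(WidePop), stringsAsFactors = FALSE)
LongPop$country <- rownames(LongPop)
names(LongPop)[names(LongPop) == "V1"] <- "population"

# Drop the first row and put country first, selecting columns by name
LongPop <- LongPop[-1, c("country", "population")]

# Convert the population strings to numbers so they can be analysed
LongPop$population <- as.numeric(LongPop$population)

# Replace the country row names with the default row numbers
rownames(LongPop) <- NULL

str(LongPop)
```

Whether to keep the row names is a matter of taste; resetting them simply avoids carrying the same information in two places.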