christophergandrud edited this page Dec 28, 2012 · 5 revisions

Data are not only sometimes stored in file formats that are difficult for R to recognise; they can also be arranged within a file in a way that makes it difficult to conduct our analyses.

Most statistical analyses in R need data to be in long format. In long format columns are variables and rows are observations. For example:

country population
Albania 2,800,000
Botswana 2,000,000
Cambodia 14,800,000

The variables country and population are the columns, and the observations (Albania, Botswana, and Cambodia) are the rows.

However, sometimes we run across data that is in wide format. This is when the columns contain the observations and the rows contain the variables. The country population data in wide format looks like this:

variables Albania Botswana Cambodia
population 2,800,000 2,000,000 14,800,000

There are a number of ways to change this data from wide to long format. Using the t command is covered on this page. Other options include reshape in the stats package and various functions in the reshape2 package.

t Transpose

One of the easiest ways is with the t (transpose) command. (If you're interested, this just turns the dataframe into a matrix and returns its transpose.)
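To make the matrix conversion concrete, here is a minimal sketch using a small made-up dataframe. Demo and Transposed are hypothetical names used only for this illustration and are not part of the population example that follows.

```r
# Hypothetical two-column dataframe, used only for this illustration
Demo <- data.frame(a = c("1", "2"), b = c("3", "4"),
                   stringsAsFactors = FALSE)

# t() coerces the dataframe to a matrix and then transposes it
Transposed <- t(Demo)

is.data.frame(Demo)       # TRUE: the input is a dataframe
is.matrix(Transposed)     # TRUE: the result is a matrix, not a dataframe

# The values match what we get by converting to a matrix by hand first
all(Transposed == t(as.matrix(Demo)))
```

This is why the result of t has to be converted back with as.data.frame before it behaves like a dataframe again.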

For this example, first set up the wide format data:

variables <- c("population")

Albania <- c("2800000")

Botswana <- c("2000000")

Cambodia <- c("14800000")

WidePop <- data.frame(variables, Albania, Botswana, Cambodia, stringsAsFactors = FALSE)

Now we have something that looks like this:

WidePop
##    variables Albania Botswana Cambodia
## 1 population 2800000  2000000 14800000

To reshape the data with t and see the results simply type:

LongPop <- t(WidePop)
LongPop
##           [,1]        
## variables "population"
## Albania   "2800000"   
## Botswana  "2000000"   
## Cambodia  "14800000"  

You can see that the resulting matrix needs to be cleaned up a bit before we can use it for statistical analyses. First, t converted our dataframe into a matrix. We need to convert it back using as.data.frame.

LongPop <- as.data.frame(LongPop, stringsAsFactors = FALSE)
LongPop
##                   V1
## variables population
## Albania      2800000
## Botswana     2000000
## Cambodia    14800000

From this output, the data seem almost ready to use; we just need to delete the first row and rename the columns. However, you might have noticed that what looks like the first column has no variable name at all, and that the second column has the name V1. This is because R is not storing the country names as a regular column, but as row names. We need to copy the row names into a regular column with the command rownames:

LongPop$country <- rownames(LongPop)

Now we have a new column called country. Next, change the name of V1 to population with the rename command in the reshape package.

library(reshape)
LongPop <- rename(LongPop, c(V1 = "population"))
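If you would rather not load the reshape package for a single rename, the same step can be sketched in base R. This is an alternative to the rename call, not the approach this page uses. The block rebuilds the example up to this point so that it runs on its own.

```r
# Rebuild the example as it stands at this step
WidePop <- data.frame(variables = "population", Albania = "2800000",
                      Botswana = "2000000", Cambodia = "14800000",
                      stringsAsFactors = FALSE)
LongPop <- as.data.frame(t(WidePop), stringsAsFactors = FALSE)
LongPop$country <- rownames(LongPop)

# Base R alternative to rename(): replace the column name "V1" in place
names(LongPop)[names(LongPop) == "V1"] <- "population"

names(LongPop)
```

Matching on the old name rather than on a column position means the line does nothing if V1 has already been renamed.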

Now we can remove the first row.

LongPop <- LongPop[-1, ]

If you want to have country as the first variable in the dataframe, you can reorder the columns like this:

LongPop <- LongPop[, c(2, 1)]
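Reordering by position depends on knowing which column currently sits where. As a hedged alternative, the columns can be selected by name, which gives the same result here. The block recreates the dataframe as it looks just before this step so that it is self-contained.

```r
# State of the example just before reordering:
# population is the first column and country is the second
LongPop <- data.frame(population = c("2800000", "2000000", "14800000"),
                      country = c("Albania", "Botswana", "Cambodia"),
                      row.names = c("Albania", "Botswana", "Cambodia"),
                      stringsAsFactors = FALSE)

# Select the columns by name, in the order we want them
LongPop <- LongPop[, c("country", "population")]

names(LongPop)
```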

You should now have a dataframe that looks like this:

LongPop
##           country population
## Albania   Albania    2800000
## Botswana Botswana    2000000
## Cambodia Cambodia   14800000
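
One thing to keep in mind: because the populations were entered as character strings, the population column is still stored as text. A minimal sketch of the remaining tidy-up follows, assuming the values contain only digits (no commas or other separators). Dropping the now-redundant row names is optional.

```r
# The finished dataframe from the steps above, recreated so this runs alone
LongPop <- data.frame(country = c("Albania", "Botswana", "Cambodia"),
                      population = c("2800000", "2000000", "14800000"),
                      row.names = c("Albania", "Botswana", "Cambodia"),
                      stringsAsFactors = FALSE)

# Convert the population strings to numbers so they can be used in analyses
LongPop$population <- as.numeric(LongPop$population)

# Optional: replace the country row names with the default 1, 2, 3
rownames(LongPop) <- NULL

str(LongPop)
```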