christophergandrud edited this page Sep 23, 2012 · 10 revisions

There are a few different ways to select parts of objects in R. Probably the most useful method for this course is the subset command, which selects a subset of a data frame (or a vector or matrix). You can also use subscripts with square brackets [] to select certain rows or columns.

With both methods you can use any of R's logical operators to subset your data. See the Quick-R Guide.
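
As a small illustration of combining operators, here is a sketch using the built-in swiss data. The object names (CatholicEducated, CatholicEducated2, Extremes) and the cut-off values are made up for this example.

```r
# Provinces that are majority Catholic AND have Education above 10
CatholicEducated <- subset(swiss, Catholic > 50 & Education > 10)

# The same condition written with subscripts
CatholicEducated2 <- swiss[swiss$Catholic > 50 & swiss$Education > 10, ]

# Provinces that are EITHER less than 10% OR more than 90% Catholic
Extremes <- subset(swiss, Catholic < 10 | Catholic > 90)
```

Here & means "and", | means "or", and both can be mixed with comparisons such as >, <, == and !=.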

The subset Command

One way to subset your data in R is to use the subset command from base R. You can use it to create a new data frame with the subsetted data.

Stand Alone subset

For example, if we wanted to look only at data for Swiss provinces that are more than 50% Catholic, using the swiss data set, we could create a new data frame called SwissCatholic by typing:

SwissCatholic <- subset(swiss, Catholic > 50)

Now we have two data sets with the same set of variables. To tell R that we are interested in looking at the Catholic variable in just the subsetted data set we can use the $ operator:

SwissCatholic$Catholic
##  [1]  84.84  93.40  90.57  92.85  97.16  97.67  91.38  98.61  99.71  99.68
## [11] 100.00  98.96  98.22  99.06  99.46  96.83  50.43  58.33
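
The subset command can also keep just some of the variables through its select argument. The following is a minimal sketch; the object name SwissCatholicSmall is only used for illustration.

```r
# Keep majority-Catholic provinces and only two of the variables
SwissCatholicSmall <- subset(swiss, Catholic > 50,
                             select = c(Examination, Catholic))

# Check which variables are left
names(SwissCatholicSmall)
```

Inside select the variable names are written without quotation marks.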

subset Inside Another Function

To simplify your code, you might want to use the subset command inside of another command. For example, if we wanted to plot Examination scores and Education only for Swiss provinces with majority Catholic populations, we could place subset inside of qplot like this:

# Load the ggplot2 package
library(ggplot2)

# Plot subsetted data
qplot(Education, Examination, data = subset(swiss, Catholic > 50))

[Figure: scatterplot of Education (x-axis) and Examination (y-axis) for provinces that are more than 50% Catholic]

Removing Observations with Missing Values

If you want to remove all of the observations with missing values for a particular variable you can nest the !is.na command inside of the subset command. The is.na command tells you if there is a missing value. The exclamation point (!) in front of the command means "not". So you can read the command !is.na as "is not missing".

Here is an example removing all missing values of Education in the swiss data:

swiss <- subset(swiss, !is.na(Education))

Note that this example is a little silly, because there are no missing values of Education to remove.
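
To see the command actually drop something, we can work on a copy of the data with a few values set to missing. This is only a sketch: the copy SwissMissing and the choice of which values to blank out are invented for the example.

```r
# Copy the data so the original swiss data frame is left untouched
SwissMissing <- swiss

# Set the first two Education values to missing (for illustration only)
SwissMissing$Education[1:2] <- NA

# is.na() returns TRUE for each missing value, so sum() counts them
sum(is.na(SwissMissing$Education))

# Drop the rows where Education is missing
SwissComplete <- subset(SwissMissing, !is.na(Education))

# Number of rows that were removed
nrow(SwissMissing) - nrow(SwissComplete)
```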

Subscripts ([])

Subscripts ([]) are a more general way of locating parts of R objects.

To subset our data with subscripts so that, for example, we only have the examination scores for provinces that are more than 50% Catholic we simply type:

SwissExamination <- swiss$Examination[swiss$Catholic > 50]

This gives us a numeric vector with only examination scores. With this we could, for example, find the mean exam score among majority Catholic provinces with the mean command.

mean(SwissExamination)
## [1] 10.5

So the mean score is 10.5.

Of course you can combine these two commands to find the mean in one line of code:

mean(swiss$Examination[swiss$Catholic > 50])
## [1] 10.5
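
It can help to look at what goes inside the brackets. The sketch below stores the condition in its own object first; the names IsCatholic, ExamCatholic and ExamOther are just for illustration.

```r
# The condition on its own is a logical vector, one value per province
IsCatholic <- swiss$Catholic > 50
head(IsCatholic)

# Subscripting keeps only the elements where the condition is TRUE
ExamCatholic <- swiss$Examination[IsCatholic]

# Putting ! in front of the condition keeps the remaining provinces
ExamOther <- swiss$Examination[!IsCatholic]

# Together the two pieces cover every province exactly once
length(ExamCatholic) + length(ExamOther) == nrow(swiss)
```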

Subsetting Columns with Subscripts

So far we have only taken subsets of data frames based on the values in their rows. What if we want to subset a data set to only contain specific columns? We can also use subscripts for this.

So far we have only entered information in the 'row' part of the subscript []. A subscript can have two parts: the first part takes information about the rows and the second part takes information about the columns. We use a comma , to separate the two parts, so we can think of subscripts being organized like this:

[row, column]

In the previous examples we subsetted one column based on row values. To subset multiple columns based on row values we just need to add a comma , after the description of what rows we want. For example:

swisColumns <- swiss[swiss$Catholic > 50, ]

This gives us all of the columns in swiss, but only the rows where the Catholic variable is greater than 50.
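
For the swiss data this row subscript should return the same data frame as the subset command used earlier. A quick check, with object names made up for the example:

```r
# Rows selected with the subset command
FromSubset <- subset(swiss, Catholic > 50)

# The same rows selected with subscripts
FromSubscript <- swiss[swiss$Catholic > 50, ]

# all.equal() returns TRUE when the two data frames match
all.equal(FromSubset, FromSubscript)
```

The two approaches differ when the variable in the condition has missing values: subset drops those rows, while subscripts return them as rows filled with NA.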

To select specific columns of a data frame and put them in a new object we just enter either the column names or the column numbers in the subscript after a comma. For example, if we wanted only the Examination and Catholic variables from the swiss data set put into a new data frame called ExamCath, we type:

ExamCath <- swiss[, c("Examination", "Catholic")]
names(ExamCath)
## [1] "Examination" "Catholic"

Notice that we create the character vector c("Examination", "Catholic"). If we didn't put the variable names in a vector, but only separated them with commas, R would be confused about why there were so many commas.
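
The same idea works with column numbers, and the row and column parts can be filled in at the same time. The following is a sketch; the object names ExamCathByNumber and ExamCathMajority are only for illustration.

```r
# Select the same two columns by position
# (Examination is column 3 and Catholic is column 5 in swiss)
ExamCathByNumber <- swiss[, c(3, 5)]
names(ExamCathByNumber)

# Combine a row condition and a column selection in one subscript
ExamCathMajority <- swiss[swiss$Catholic > 50, c("Examination", "Catholic")]
dim(ExamCathMajority)
```

Column names are usually the safer choice, because column numbers silently point at a different variable if the order of the columns changes.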