
Transforming strings

christophergandrud edited this page Jul 16, 2012 · 2 revisions

Sometimes we want to change parts of character strings, replacing an old substring with a new one. (See Recoding Variables for how to change whole strings.) We often want to do this during data cleanup. Our original variable values may include characters or punctuation that we don't want. For example, a data set might label the 'United Kingdom' as United.Kingdom, or it might append other unwanted characters, as in United.Kingdom.1.

There are a number of ways in R to deal with these types of issues. This page covers the gsub command in base R. It also mentions how to use the similar sub command.

gsub

We can use gsub to find specific patterns of characters and replace them with something else. Imagine we had a character vector of countries:

countries <- c("France.1", "China.1", "United.States.3")
countries
## [1] "France.1"        "China.1"         "United.States.3"

We can use gsub to remove the '.1' if we don't want it.

countries <- gsub(".1", "", countries)
countries
## [1] "France"          "China"           "United.States.3"

The first argument is the string pattern that we want to replace. The second is what we want to replace it with. Since we want to replace it with nothing, we typed "" with nothing in the middle. If we wanted to replace the '.1' with something else, we would type the replacement text between the quotation marks. The final argument is the character vector in which we want to do the replacing.
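
As a minimal sketch of a non-empty replacement, the call below spells out the three arguments by name (pattern, replacement and x are the argument names gsub uses in base R). The replacement text "_a" and the object name relabelled are made up for illustration only.

```r
# Same example vector as above
countries <- c("France.1", "China.1", "United.States.3")

# Replace '.1' with the illustrative suffix "_a" instead of removing it.
# Naming the arguments makes the role of each one explicit.
relabelled <- gsub(pattern = ".1", replacement = "_a", x = countries)
relabelled
## [1] "France_a"        "China_a"         "United.States.3"
```

The original countries vector is left unchanged here because the result is assigned to a new object.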

Notice that the '.3' after 'United.States' was not replaced. This is because it didn't fit the pattern '.1'. We can use regular expressions to create more general patterns. In this example all we need is a pattern that includes a '.' and a number after it. We can use the regular expression [0-9], which matches any single digit from 0 through 9, so that gsub looks for a '.' followed by a digit. (Strictly speaking, an unescaped '.' matches any character, not only a period; see the note on escape characters at the end of this section.)

countries <- c("France.1", "China.1", "United.States.3")
countries <- gsub(".[0-9]", "", countries)
countries
## [1] "France"        "China"         "United.States"
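
One limitation worth sketching: [0-9] matches exactly one digit, so a two-digit suffix is only partly removed. The value "Brazil.12" below is a hypothetical example, not part of the original data, and adding the + quantifier (one or more of the preceding item) is one possible way to handle it.

```r
# Hypothetical vector with a two-digit suffix
suffixed <- c("France.1", "Brazil.12")

# One character plus ONE digit is removed, so the '2' is left behind
one_digit <- gsub(".[0-9]", "", suffixed)
one_digit
## [1] "France"  "Brazil2"

# '+' lets [0-9] match one or more digits in a row
# (the '.' is still unescaped here, so it matches any character)
many_digits <- gsub(".[0-9]+", "", suffixed)
many_digits
## [1] "France" "Brazil"
```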

Finally, we want to replace the '.' between 'United' and 'States' with a space. To do this we type:

countries <- gsub("\\.", " ", countries)
countries
## [1] "France"        "China"         "United States"

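In a regular expression, an unescaped period matches any single character, not just a period. To tell R that we are looking only for a literal period '.', we put the escape characters \\ in front of it. The earlier patterns ".1" and ".[0-9]" gave the right result only because the character before each digit happened to be a period. If you're interested, try seeing what happens when you don't use the escape characters.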
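
For readers who want to see the difference directly, here is a small sketch. The value "Area51.1" and the object names are invented for illustration. fixed = TRUE is a base R argument to gsub that treats the pattern as a literal string rather than a regular expression, which is an alternative to escaping.

```r
country <- "United.States"

# Unescaped: every one of the 13 characters matches '.', so all become spaces
unescaped <- gsub(".", " ", country)
nchar(unescaped)
## [1] 13

# Escaped: only the literal period is replaced
escaped <- gsub("\\.", " ", country)
escaped
## [1] "United States"

# Alternative: treat the pattern as plain text instead of a regular expression
literal <- gsub(".", " ", country, fixed = TRUE)
literal
## [1] "United States"

# Why it matters: an unescaped '.1' also removes the '51' in this made-up value
code <- "Area51.1"
greedy <- gsub(".1", "", code)
greedy
## [1] "Area"
careful <- gsub("\\.1", "", code)
careful
## [1] "Area51"
```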

sub

gsub replaces all instances of a pattern. If you only want to replace the first instance of a pattern in each string, you can use the sub command, which is also in base R. sub and gsub have the same syntax.
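
A short side-by-side sketch of the difference, using a made-up string with several periods:

```r
# Illustrative string containing three periods
long_name <- "United.States.of.America"

# sub replaces only the first match
first_only <- sub("\\.", " ", long_name)
first_only
## [1] "United States.of.America"

# gsub replaces every match
all_matches <- gsub("\\.", " ", long_name)
all_matches
## [1] "United States of America"
```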