diff --git a/lab_dataframes.Rmd b/lab_dataframes.Rmd
index e3fe82d..48d206a 100644
--- a/lab_dataframes.Rmd
+++ b/lab_dataframes.Rmd
@@ -132,7 +132,7 @@ X[] <- 0
 as.vector(X)
 ```
 
-7. In the the earlier exercises, you created a vector with the names of the type Geno\_a\_1, Geno\_a\_2, Geno\_a\_3, Geno\_b\_1, Geno\_b\_2…, Geno\_s\_3 using vectors. In today's lecture, a function named `outer()` that generates matrices was mentioned. Try to generate the same vector as yesterday using this function instead. The `outer()` function is very powerful, but can be hard to wrap you head around, so try to follow the logic, perhaps by creating a simple example to start with.
+7. In the the earlier exercises, you created a vector with the names of the type Geno\_a\_1, Geno\_a\_2, Geno\_a\_3, Geno\_b\_1, Geno\_b\_2…, Geno\_s\_3 using vectors. In a previous lecture, a function named `outer()` that generates matrices was mentioned. Try to generate the same vector as before, but this time using `outer()`. This function is very powerful, but can be hard to wrap you head around, so try to follow the logic, perhaps by creating a simple example to start with.
 
 ```{r}
 letnum <- outer(paste("Geno",letters[1:19], sep = "_"), 1:3, paste, sep = "_")
@@ -180,7 +180,7 @@ E.mm
 
 # Dataframes
 
-Even though vectors are at the very base of R usage, data frames are central to R as the most common ways to import data into R (`read.table()`) will create a dataframe. Even though a dataframe can itself contain another dataframe, by far the most common dataframes consists of a set of equally long vectors. As dataframes can contain several different data types the command `str()` is very useful to run on dataframes.
+Even though vectors are at the very base of R usage, data frames are central to R as the most common ways to import data into R (`read.table()`) will create a data frame. A data frame consists of a set of equally long vectors. As data frames can contain several different data types the command `str()` is very useful to run on data frames.
 
 ```{r}
 vector1 <- 1:10
@@ -194,7 +194,7 @@ In the above example, we can see that the dataframe **dfr** contains 10 observat
 
 ## Exercise
 
-1. Figure out what is going on with the second column in **dfr** dataframe described above and modify the creation of the dataframe so that the second column is stored as a character vector rather than a factor. Hint: Check the help for `data.frame` to find an argument that turns off the factor conversion.
+1. Figure out what is going on with the second column in **dfr** data frame described above and modify the creation of the data frame so that the second column is stored as a character vector rather than a factor. Hint: Check the help for `data.frame` to find an argument that turns off the factor conversion.
 
 ```{r,accordion=TRUE}
 dfr <- data.frame(vector1, vector2, vector3, stringsAsFactors = FALSE)
@@ -215,13 +215,13 @@ dfr[dfr$vector3>0,2]
 dfr$vector2[dfr$vector3>0]
 ```
 
-4. Create a new vector combining the all columns of **dfr** separated by a underscore.
+4. Create a new vector combining all columns of **dfr** and separate them by a underscore.
 
 ```{r,accordion=TRUE}
 paste(dfr$vector1, dfr$vector2, dfr$vector3, sep = "_")
 ```
 
-5. There is a dataframe of car information that comes with the base installation of R. Have a look at this data by typing `mtcars`. How many rows and columns does it have?
+5. There is a data frame of car information that comes with the base installation of R. Have a look at this data by typing `mtcars`. How many rows and columns does it have?
 
 ```{r,accordion=TRUE}
 dim(mtcars)
@@ -229,13 +229,13 @@ ncol(mtcars)
 nrow(mtcars)
 ```
 
-6. Re-arrange the row names of this dataframe and save as a vector.
+6. Re-arrange (shuffle) the row names of this data frame and save as a vector.
 
 ```{r,accordion=TRUE}
 car.names <- sample(row.names(mtcars))
 ```
 
-7. Create a dataframe containing the vector from the previous question and two vectors with random numbers named random1 and random2.
+7. Create a data frame containing the vector from the previous question and two vectors with random numbers named random1 and random2.
 
 ```{r,accordion=TRUE}
 random1 <- rnorm(length(car.names))
@@ -244,7 +244,7 @@ mtcars2 <- data.frame(car.names, random1, random2)
 mtcars2
 ```
 
-8. Now you have two dataframes that both contains information on a set of cars. A collaborator asks you to create a new dataframe with all this information combined. Create a merged dataframe ensuring that rows match correctly.
+8. Now you have two data frames that both contains information on a set of cars. A collaborator asks you to create a new data frame with all this information combined. Create a merged data frame ensuring that rows match correctly.
 
 ```{r,accordion=TRUE}
 mt.merged <- merge(mtcars, mtcars2, by.x = "row.names", by.y = "car.names")
@@ -332,7 +332,7 @@ list.2 <- list(vec1 = c("hi", "ho", "merry", "christmas"),
 list.2
 ```
 
-2. Here is a dataframe.
+2. Here is a data frame.
 
 ```{r}
 dfr <- data.frame(letters, LETTERS, letters == LETTERS)
@@ -369,18 +369,4 @@ lapply(list.a, FUN = "length")
 ```{r,accordion=TRUE}
 lapply(X = list.a, FUN = "summary")
 sapply(X = list.a, FUN = "summary")
-```
-
-# Extras
-
-1. Design a S3 class that should hold information on human proteins. The data needed for each protein is:
-
-- The gene that encodes it
-- The molecular weight of the protein
-- The length of the protein sequence
-- Information on who and when it was discovered
-- Protein assay data
-
-Create this hypothetical S3 object in R.
-
-2. Among the test data sets that are part of base R, there is one called **iris**. It contains measurements on set of plants. You can access the data using by typing `iris` in R. Explore this data set and calculate some useful summary statistics, like SD, mean and median for the parts of the data where this makes sense. Calculate the same statistics for any grouping that you can find in the data.
+```
\ No newline at end of file
diff --git a/lab_loadingdata.Rmd b/lab_loadingdata.Rmd
index 3aa9595..3063b50 100644
--- a/lab_loadingdata.Rmd
+++ b/lab_loadingdata.Rmd
@@ -26,7 +26,7 @@ output:
 
 # Introduction
 
-Up until now we have mostly created the object we worked with on the fly from within R. The most common use-case is however to read in different data sets that are stored as files, either somewhere on a server or locally on your computer. In this exercise we will test some common ways to import data in R and also show to save data from R. After this exercise you will know how to:
+Up until now we have mostly created the object we worked with on the fly from within R. The most common use-case is however to read in different data sets that are stored as files, either somewhere on a server or locally on your computer. In this exercise we will test some common ways to import data in R and also how to save data from R. After this exercise you will know how to:
 
 - Read data from txt files and save the information as a vector, data frame or a list.
 - Identify missing data and correctly encode this at import
@@ -66,17 +66,17 @@ shelley.vec[381]
 2. Go back and fix the way you read in the text to make sure that you get a vector with all words in chapter as individual entries also filter any non-letter characters and now identify the longest word.
 
 ```{r,accordion=TRUE}
-shelley.vec2 <- scan(file="https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/book_chapter.txt", what='character', sep=' ', quote=NULL)
+shelley.vec2 <- scan(file="https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/book_chapter.txt", what=character(), sep=' ', quote=NULL)
 shelley.filt2 <- gsub(pattern='[^[:alnum:] ]', replacement="", x=shelley.vec2)
-which(nchar(shelley.filt2) == max(nchar(shelley.filt2)))
-shelley.filt2[301]
+longest <- which(nchar(shelley.filt2) == max(nchar(shelley.filt2)))
+shelley.filt2[longest]
 ```
 
 # `read.table()`
 
 This is the by far most common way to get data into R. As the function creates a data frame at import it will only work for data set that fits those criteria, meaning that the data needs to have a set of columns of equal length that are separated with a common string eg. tab, comma, semicolon etc.
 
-In this code block with first import the data from [normalized.txt](https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/normalized.txt) and accept the defaults for all other arguments in the function. With this settings R will read it as a tab delimited file and will use the first row of the data as colnames (header) and the first column as rownames.
+In this code block we first import the data from [normalized.txt](https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/normalized.txt) and accept the defaults for all other arguments in the function. With this settings R will read it as a tab delimited file and will use the first row of the data as colnames (header) and the first column as rownames.
 
 ```{r,accordion=TRUE}
 expr.At <- read.table("https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/normalized.txt")
@@ -85,7 +85,7 @@ head(expr.At)
 
 One does however not have to have all data as a file an the local disk, instead one can read data from online resources. The following command will read in a file from a web server.
 
-```{r,accordion=TRUE, error=T}
+```{r,accordion=TRUE}
 url <- 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
 abalone <- read.table(url, header=FALSE , sep=',')
 head(abalone)
@@ -94,7 +94,7 @@ head(abalone)
 1. Read this [example data](https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/example.data) to R using the `read.table()` function. This files consist of gene expression values. Once you have the object in R validate that it looks okay and export it using the `write.table` function.
 
 ```{r,accordion=TRUE}
-ed <- read.table("https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/example.data", sep=":")
+ed <- read.table("https://raw.githubusercontent.com/NBISweden/workshop-r/master/data/lab_loadingdata/example.data", sep=":", header = T)
 head(ed)
 str(ed)
 ```
@@ -102,20 +102,20 @@ str(ed)
 Encode all NA values as "missing", at export.
 
 ```{r,eval=FALSE,accordion=TRUE}
-write.table(x=ed, na="missing", file="example_mis.data")
+write.table(x=ed, na="missing", file="example_write.txt")
 ```
 
 2. Read in the file you just created and double-check that you have the same data as earlier.
 
 ```{r,eval=FALSE,accordion=TRUE}
-df.test <- read.table("example_mis.data", na.strings="missing")
+df.test <- read.table("example_write.txt", na.strings="missing")
 ```
 
 3. Analysing genome annotation in R using read.table
 
 For this exercise we will load a GTF file into R and calculate some basic summary statistics from the file. In the first part we will use basic manipulations of data frames to extract the information. In the second part you get a try out a library designed to work with annotation data, that stores the information in a more complex format, that allow for easy manipulation and calculation of summaries from genome annotation files.
 
-For those not familiar with the gtf format it is a file format containing annotation information for a genome. It does not contain the actual DNA sequence of the organism, but instead refers to positions along the genome.
+For those not familiar with the GTF format it is a file format containing annotation information for a genome. It does not contain the actual DNA sequence of the organism, but instead refers to positions along the genome.
 
 A valid GTF file should contain the following tab delimited fields (taken from the ensembl home page).
 
@@ -136,7 +136,7 @@ A valid GTF file should contain the following tab delimited fields (taken from t
 
 The last column can contain a large number of attributes that are semicolon-separated.
 
-As these files for many organisms are large we will in this exercise use the latest version of Drosophila melanogaster genome annotation available at `ftp://ftp.ensembl.org/pub/release-86/gtf/drosophila_melanogaster` that is small enough for analysis even on a laptop.
+As these files for many organisms are large we will in this exercise use the latest version of *Drosophila melanogaster* genome annotation available at `ftp://ftp.ensembl.org/pub/release-86/gtf/drosophila_melanogaster` that is small enough for analysis even on a laptop.
 
 Download the file named **Drosophila_melanogaster.BDGP6.86.gtf.gz** to your computer. Unzip this file and keep track of where your store the file.
 
@@ -166,13 +166,14 @@ str(d.gtf)
 1. How many chromosome names can be found in the annotation file?
 
 ```{r,accordion=TRUE}
-levels(d.gtf$Chromosome)
+length(levels(as.factor(d.gtf$Chromosome)))
 ```
 
 2. How many **exons** is there in total and per chromosome? (hint: first extract lines that have `feature == 'exon'`)
 
 ```{r,accordion=TRUE}
 d.gtf.exons <- d.gtf[(d.gtf$Feature == 'exon'),]
+nrow(d.gtf.exons)
 aggregate(d.gtf.exons$Feature, by=list(d.gtf.exons$Chromosome), summary)
 ```
 
diff --git a/slide_r_elements_3.Rmd b/slide_r_elements_3.Rmd
index f8af78b..d4fd0b0 100644
--- a/slide_r_elements_3.Rmd
+++ b/slide_r_elements_3.Rmd
@@ -337,7 +337,7 @@ name: data_frames_accessing
 
 # Data frames — accessing values
 
-- We can always use the `[]` notation to access values inside data frames.
+- We can always use the `[row,column]` notation to access values inside data frames.
 
 ```{r data.frame.access, echo=T}
 df[1,] # get the first row
@@ -516,12 +516,12 @@ name: lists_nested
 We can use lists to store hierarchies of data:
 
 ```{r lists_nested, echo=T}
-ikea_lund <- list(park = 125)
+ikea_lund <- list(parking = 125)
 ikea_sweden <- list(ikea_lund = ikea_lund, ikea_uppsala = ikea_uppsala)
 
 # use names to navigate inside the hierarchy
-ikea_sweden$ikea_lund$park
-ikea_sweden$ikea_uppsala$park
+ikea_sweden$ikea_lund$parking
+ikea_sweden$ikea_uppsala$parking
 ```
 
 
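Note on the `park` to `parking` rename in `slide_r_elements_3.Rmd`: the sketch below is one way to check that the nested lookups still resolve after the change. It is only an illustration. The definition of `ikea_uppsala` is not part of this diff, so the value used here is a hypothetical stand-in, and the helper names (`old_style`, `exact_old`, `exact_new`, `parking_all`) are made up for this example. It relies on the fact that `$` on a list matches names partially, while `[[` matches exactly by default.

```r
# ikea_lund comes from the diff; ikea_uppsala is a hypothetical stand-in,
# because its real definition is outside the changed hunk.
ikea_lund <- list(parking = 125)
ikea_uppsala <- list(parking = 80)
ikea_sweden <- list(ikea_lund = ikea_lund, ikea_uppsala = ikea_uppsala)

# `$` matches list names partially, so the old spelling still finds `parking`
old_style <- ikea_sweden$ikea_lund$park

# `[[` matches names exactly by default, so the old spelling yields NULL
exact_old <- ikea_sweden$ikea_lund[["park"]]
exact_new <- ikea_sweden$ikea_lund[["parking"]]

# Collect the same element from every store in the hierarchy
parking_all <- sapply(ikea_sweden, function(store) store[["parking"]])
parking_all
```

Because of the partial matching, a leftover `$park` elsewhere in the slides would not fail after the rename; searching the file for `[["park"]]` or `$park` is a more reliable way to find stragglers than running the code.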