Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add verification step for raw GEO data vs. prepared CSV #105

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
101 changes: 101 additions & 0 deletions episodes/02-setup.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -185,6 +185,107 @@ download.file(
)
```


::::::::::::::::::::::::::::::::::::::: challenge

## Check that the data file matches the raw data on GEO

This is a substantial challenge. You don't need to do it! But you needn't trust us, either.
You can verify that the raw data deposited on GEO is directly comparable to the prepared CSV.
To look up raw data for samples (GSMxxxx) on GEO, we can use the [GEOquery](https://doi.org/10.1093/bioinformatics/btm254) package.
The `GEOquery::getGEOSuppFiles()` function then allows one to download or read the raw data files in question.
The tricky part of this is writing a function to extract the data of interest from the raw data files on GEO.
Once we have that function, we can check that the provided CSV file contains the same counts as the raw data.

```{r check-geo-data}

# Below, we read the summarized data file into a data.frame and inspect its dimensions (how many rows, how many columns).
counts_URL <- "https://github.com/carpentries-incubator/bioc-rnaseq/raw/main/episodes/data/GSE96870_counts_cerebellum.csv"
counts_cerebellum <- read.csv(url(counts_URL), row.names=1)
dim(counts_cerebellum)
# [1] 41786 22

# Each column in the data.frame represents one of 22 GSMs (GEO Samples).
GSMs_cerebellum <- colnames(counts_cerebellum)

# GEOquery can look up the raw data locations for these GSMs.
# Unfortunately, the dependencies for it crash this build!
# We will illustrate the procedure and then use the results.
if (FALSE) {
if (!require("BiocManager")) install.packages("BiocManager")
if (!require("GEOquery")) BiocManager::install("GEOquery")
library(GEOquery)
getSuppURL <- function(GSM) GEOquery::getGEOSuppFiles(GSM, fetch=FALSE)[1, "url"]
GSM_URLs <- sapply(GSMs_cerebellum, getSuppURL)
} else {
GSM_URLprefix <- "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2545nnn"
GSM_URLsep <- "suppl"
GSM_files <- c("GSM2545336_10C_CTGAAGCT-GTACTGAC_L00M_featCounts.txt.gz",
"GSM2545337_11C_TAATGCGC-TATAGCCT_L00M_featCounts.txt.gz",
"GSM2545338_12C_TAATGCGC-ATAGAGGC_L00M_featCounts.txt.gz",
"GSM2545339_13C_TAATGCGC-CCTATCCT_L00M_featCounts.txt.gz",
"GSM2545340_14C_TAATGCGC-GGCTCTGA_L00M_featCounts.txt.gz",
"GSM2545341_17C_TAATGCGC-AGGCGAAG_L00M_featCounts.txt.gz",
"GSM2545342_1C_CTGAAGCT-TATAGCCT_L00M_featCounts.txt.gz",
"GSM2545343_20C_TAATGCGC-GTACTGAC_L00M_featCounts.txt.gz",
"GSM2545344_21C_CGGCTATG-TATAGCCT_L00M_featCounts.txt.gz",
"GSM2545345_22C_CGGCTATG-ATAGAGGC_L00M_featCounts.txt.gz",
"GSM2545346_25C_CGGCTATG-CCTATCCT_L00M_featCounts.txt.gz",
"GSM2545347_26C_CGGCTATG-GGCTCTGA_L00M_featCounts.txt.gz",
"GSM2545348_27C_CGGCTATG-AGGCGAAG_L00M_featCounts.txt.gz",
"GSM2545349_28C_CGGCTATG-TAATCTTA_L00M_featCounts.txt.gz",
"GSM2545350_29C_CGGCTATG-CAGGACGT_L00M_featCounts.txt.gz",
"GSM2545351_2C_CTGAAGCT-ATAGAGGC_L00M_featCounts.txt.gz",
"GSM2545352_30C_CGGCTATG-GTACTGAC_L00M_featCounts.txt.gz",
"GSM2545353_3C_CTGAAGCT-CCTATCCT_L00M_featCounts.txt.gz",
"GSM2545354_4C_CTGAAGCT-GGCTCTGA_L00M_featCounts.txt.gz",
"GSM2545362_5C_CTGAAGCT-AGGCGAAG_L00M_featCounts.txt.gz",
"GSM2545363_6C_CTGAAGCT-TAATCTTA_L00M_featCounts.txt.gz",
"GSM2545380_9C_CTGAAGCT-CAGGACGT_L00M_featCounts.txt.gz")
GSM_URLs <- paste(GSM_URLprefix, GSMs_cerebellum, "suppl", GSM_files, sep="/")
names(GSM_URLs) <- GSMs_cerebellum
}

# To extract the read counts per gene for each sample, we write a small function.
fetch_featcounts <- function(GSM_URL) {
message("Processing ", GSM_URL)
tsv <- readLines(gzcon(url(GSM_URL)))[-1][-1] # drop header and colnames
genes <- sapply(strsplit(tsv, "\t"), "[[", 1)
featcounts <- as.integer(sapply(strsplit(tsv, "\t"), "[[", 7))
names(featcounts) <- genes
return(featcounts)
}

# We use lapply(), cbind(), and do.call() to add each column of read counts:
counts_GSMs <- data.frame(do.call(cbind, lapply(GSM_URLs, fetch_featcounts)))
dim(counts_GSMs)
# [1] 41786 22

```

Above we demonstrate that the numbers of genes and samples are the same.
However, that does not demonstrate that the read counts are the same.
Is there a function in R that will check whether two objects are identical?

::::::::::::::::::::::::::::::::::::: solution

The `identical()` function verifies that two objects in R are the same, save for their name.

```{r check-geo-data-solution}

# Don't bother checking the raw data if the dimensions don't match.
stopifnot(identical(dim(counts_cerebellum), dim(counts_GSMs)))

# If the dimensions do match, check if the data matches, too.
identical(counts_cerebellum, counts_GSMs)
# [1] TRUE
```

The prepared data file does indeed contain the same counts per gene as the raw GEO data files.

:::::::::::::::::::::::::::::::::::::


If you navigate to the `data` folder in your working directory, you should now find a file named
`GSE96870_counts_cerebellum.csv`.

Expand Down
Loading