Detailed metaboprep pipeline steps

This is a detailed log of the steps that the metaboprep pipeline takes to process data from Nightingale, Metabolon or any other platform.

(1) data format

The data may be entered into the metaboprep pipeline in two forms:

1) commercially supplied Excel sheet
	(a) Nightingale supplied excel sheet
		- sometimes accompanied by a [sample] "Metadata.xlsx" sheet
	(b) Metabolon supplied excel sheet	
		- contains within...
			1) metabolite data
			2) sample metadata - in upper rows
			3) feature metadata - in first columns
2) flat text file(s)
	(a) metabolite data - features|metabolites in columns - & - samples in rows **[NOT optional]**
	(b) sample metadata - sample batch variables in columns - & - samples in rows **[OPTIONAL]**
	(c) feature metadata - feature batch variables in columns - & -features|metabolites in rows **[OPTIONAL]**
		++ NOTE: the metaboprep package holds within it a Nightingale metabolite annotation object (ng_anno).
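
As an illustration of the flat text layout described in 2(a) to 2(c), the sketch below builds a toy data set in base R and writes it to tab-delimited files. The file names, sample IDs, feature IDs and the choice of a tab delimiter are assumptions made for this example only; they are not requirements stated by metaboprep.

```r
# Toy flat text inputs; all names and values here are hypothetical.
out_dir <- tempdir()

# (a) metabolite data: samples in rows, features|metabolites in columns
metabolite_data <- data.frame(
  met_1 = c(1.2, 0.8, NA),
  met_2 = c(10.1, 9.7, 11.3),
  row.names = c("sample_A", "sample_B", "sample_C")
)

# (b) sample metadata: first column matches the row names of (a)
sample_data <- data.frame(
  sample_id = rownames(metabolite_data),
  batch = c("b1", "b1", "b2"),
  stringsAsFactors = FALSE
)

# (c) feature metadata: first column matches the column names of (a)
feature_data <- data.frame(
  feature_id = colnames(metabolite_data),
  SUPER_PATHWAY = c("Lipid", "Xenobiotics"),
  stringsAsFactors = FALSE
)

# col.names = NA writes an empty header cell above the row names
write.table(metabolite_data, file.path(out_dir, "metabolite_data.txt"),
            sep = "\t", quote = FALSE, col.names = NA)
write.table(sample_data, file.path(out_dir, "sample_data.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)
write.table(feature_data, file.path(out_dir, "feature_data.txt"),
            sep = "\t", quote = FALSE, row.names = FALSE)

# Sanity check: the IDs in the annotation tables line up with the data
ids_match <- identical(sample_data$sample_id, rownames(metabolite_data)) &&
  identical(feature_data$feature_id, colnames(metabolite_data))
```

Checking that the ID columns line up before running the pipeline is a cheap way to catch mismatched annotation files early.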

(2) parameter file

The parameter file includes 16 arguments:

1) project name
	- a string to give your study|project a name
2) full path to the directory holding your data.
	- This directory must include all files subsequently shared in the parameter file.
3) metabolite data file name
	- the name of the commercially supplied excel sheet or the flat text file holding the metabolite data
	- flat text file should have samples in rows, features|metabolites in columns
4) feature|metabolite annotation file name
	- file name for the flat text feature annotation file
	- NA, if no such file is available
	- otherwise, the first column of data should match the feature (column) names in (3).
		+ NOTE: if you are processing Metabolon data using a flat text file (NOT the commercial excel sheet)
			(i) then it would be advisable to have a feature annotation file that has a column called "SUPER_PATHWAY" that identifies which metabolites|features are "Xenobiotics".
				+ "Xenobiotics" are treated uniquely in the metaboprep pipeline
5) sample annotation file name
	- flat text sample annotation file name
	- NA, if no such file is available
	- otherwise, the first column of data should match the sample (row) names in (3).
6) declaration of the platform used
	- Nightingale
	- Metabolon
	- Other
7) feature missingness
	- the threshold for feature missingness filtering
	- must be a value set between 0 and 1.
8) sample missingness
	- the threshold for sample missingness filtering
	- must be a value set between 0 and 1
9) total sum abundance
	- the threshold for sample total sum abundance
	- In units of standard deviations from the mean.
10) outlier threshold
	- the threshold for defining outliers at each metabolite
	- In units of interquartile range from the median.
11) outlier treatment
	- a string to define how outliers should be treated for the purposes of the PCA and for the PCA only
		+ set to "leave_be" if you would like no action on outliers
		+ set to "winsorize" if you would like outliers to be winsorized to the 100th percentile (the maximum) of all remaining (non-outlying) values, at a feature.
		+ set to "turn_NA" if you would like outliers converted to NA. This means they will be imputed to the median for the purposes of the PCA.
12) tree cut height
	- the tree cut height for the hierarchical clustering dendrogram to identify representative|independent features, in Spearman's rho distances (1 - abs(rho)). 
	- Must be a value set between 0 and 1. 
	- For example, a value of 0.8 would, in principle, cluster features|metabolites with an absolute Spearman's rho > 0.2.
13) PC outlier threshold
	- In units of standard deviations from the mean.
	- The threshold for identifying outliers in the sample principal component analysis.
14) derived variable exclusions
	- TRUE or FALSE
	- Set to TRUE if you would like Nightingale derived variables to be excluded from missingness filters, as they are redundant measures of missingness.
15) batch column name
	- column name in the feature annotation file defining run mode|platform|batch for each metabolite|feature
	- This variable will be used to median normalise the data.
	- This column should contain a series of strings defining the run mode of each metabolite|feature such as "neg" or "pos", or perhaps just a single string "batch".
	- Those string names (or name) should be, each, a column name in the sample annotation file that defines the batches for that run mode|platform. 
	- This parameter can be defined as NA if there are no batches in the data set. If you are running a commercial Metabolon excel file, the script will find this information itself by looking for the column name "platform" among the feature annotation data.
16) plot feature|metabolite distributions
	- TRUE or FALSE
	- Should the distribution and summary statistics for each metabolite|feature be printed to a single PDF report file?
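
To make parameters 10 and 11 concrete, here is a minimal sketch of one plausible reading: a value is an outlier when it lies more than a chosen number of interquartile ranges from the feature median, and the three treatments are then applied to the flagged values. The helpers `flag_outliers()` and `treat_outliers()` are hypothetical names written for this example; they are not metaboprep functions, and the exact rule used inside the package may differ.

```r
# Hypothetical helpers; not part of metaboprep.

# Flag values lying more than n_iqr interquartile ranges from the median.
flag_outliers <- function(x, n_iqr = 5) {
  med <- median(x, na.rm = TRUE)
  spread <- IQR(x, na.rm = TRUE)
  !is.na(x) & abs(x - med) > n_iqr * spread
}

# Apply one of the three outlier treatments to a single feature.
treat_outliers <- function(x, n_iqr = 5,
                           treatment = c("leave_be", "winsorize", "turn_NA")) {
  treatment <- match.arg(treatment)
  out <- flag_outliers(x, n_iqr)
  if (treatment == "leave_be" || !any(out)) return(x)
  if (treatment == "turn_NA") {
    x[out] <- NA
    return(x)
  }
  # winsorize: pull high outliers in to the maximum non-outlying value.
  # Pulling low outliers up to the minimum is an assumption of this sketch;
  # the text above only describes the upper end.
  keep <- x[!out & !is.na(x)]
  x[out & x > max(keep)] <- max(keep)
  x[out & x < min(keep)] <- min(keep)
  x
}

feature <- c(1, 2, 3, 4, 5, 100)
treat_outliers(feature, n_iqr = 5, treatment = "winsorize")  # 1 2 3 4 5 5
treat_outliers(feature, n_iqr = 5, treatment = "turn_NA")    # 1 2 3 4 5 NA
```

In this toy feature the median is 3.5 and the IQR is 2.5, so only the value 100 falls outside 5 IQR units.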

(3) how to run the pipeline

> Rscript run_metaboprep_pipeline.R parameter_file.txt

(4) pipeline steps - in detail

(4.1) initiate

1) check for parameter file
2) record date
3) process project name and data directory, as provided in parameter file
4) make a new directory in provided data directory to place metaboprep output
5) start a log file - placed in the newly made metaboprep data directory.
6) process remaining arguments in the parameter file

(4.2) read in the data

1) evaluate if METABOLITE data file is a flat text file or excel file
	(a) if provided file is a flat text file
		1. read in file
			- check if rownames are numeric. if yes, assign column 1 as row names.
		2. if platform was declared as Nightingale attempt to edit feature names to match data in metaboprep's ng_anno (Nightingale annotation) object.
2) evaluate if a flat text FEATURE annotation|batch file name was provided
	(a) if a file name was provided read it in
		1. read in file
			- check if rownames are numeric. if yes, assign column 1 as row names.
				- column 1 IDs should match the feature (columns) names in metabolite data file 
		2. if platform was declared as Nightingale attempt to edit feature names (rows) to match data in metaboprep's ng_anno (Nightingale annotation) object.
3) evaluate if a flat text SAMPLE annotation|batch file name was provided
	(a) if a file name was provided read it in
		1. read in file
			- check if rownames are numeric. if yes, assign column 1 as row names.
				- column 1 IDs should match the sample (rows) names in metabolite data file
4) generate a working data set object - defined as a list.
	(a) if platform was declared as Nightingale 
		1. the feature annotation data in the object ng_anno will be added to the feature annotation data sheet. 
5) evaluate if METABOLITE data file is an excel sheet
	(a) if yes and platform is declared as Nightingale
		1. read in data with the function read.in.nightingale()
	(b) if yes and platform is declared as Metabolon
		1. read in data with the function read.in.metabolon()
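
The row name check used when reading flat text files can be sketched in base R as below. The helper `ids_to_rownames()` is a hypothetical name for this example and the toy file is made up; the sketch only assumes that "numeric row names" means the default 1..n labels that `read.table()` assigns when no ID column is designated.

```r
# Hypothetical helper; not part of metaboprep.
# If the row names are numeric (e.g. the default 1..n sequence),
# treat column 1 as the ID column and promote it to row names.
ids_to_rownames <- function(df) {
  rn <- rownames(df)
  rownames_are_numeric <- !any(is.na(suppressWarnings(as.numeric(rn))))
  if (rownames_are_numeric) {
    rownames(df) <- as.character(df[[1]])
    df <- df[, -1, drop = FALSE]
  }
  df
}

# Toy file with the sample IDs stored in the first column
tmp <- tempfile(fileext = ".txt")
writeLines(c("sample_id\tmet_1\tmet_2",
             "sample_A\t1.2\t10.1",
             "sample_B\t0.8\t9.7"), tmp)

raw <- read.table(tmp, header = TRUE, sep = "\t", stringsAsFactors = FALSE)
metabolite_data <- ids_to_rownames(raw)
rownames(metabolite_data)   # "sample_A" "sample_B"
colnames(metabolite_data)   # "met_1" "met_2"
```

Calling the helper a second time leaves the data unchanged, because the row names are no longer numeric.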

(4.3) data normalization

1) if platform (parameter value 6) declared as Metabolon
	(a) if "feat_anno_run_mode_col" (parameter value 15) is NOT NA 
		1. run function norm_metabolite_data()
		2. save the initial "raw" data read in on step (4.2) as mydata$raw_metabolitedata
		3. define the normalized data as the default|primary data mydata$metabolitedata
	(b) if "feat_anno_run_mode_col" (parameter value 15) is NA
		1. look for runmode or c("PLATFORM","platform") column in featuredata data frame
		2. if runmode column found in featuredata data frame 
			- edit run mode ids to remove "LC MS " and " ", then lowercase strings.
			- lowercase and remove unwanted characters (" ", "_", "\\.") from column names in sampledata data frame
			- look for column names that match the runmode names in the featuredata data frame
			- run function norm_metabolite_data()
		3. if runmode column NOT found in featuredata data frame
			- look for a "ScaledImp" metabolite data set in the mydata object
			- if found:
				- remove imputed values turning them into NAs
				- set current primary metabolitedata to raw_metabolitedata
				- define normalized (imputations removed) as primary metabolitedata
			- if not found:
				- No normalization possible
2) if platform (parameter value 6) declared as Other and "feat_anno_run_mode_col" (parameter value 15) is NOT NA 
		1. run function norm_metabolite_data()
		2. save the "raw" data read in on step (4.2) as mydata$raw_metabolitedata
		3. define the normalized data as the default|primary data mydata$metabolitedata
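
The arithmetic inside norm_metabolite_data() is not spelled out here, so the following is only a sketch of what batch-wise median normalisation could look like. It assumes each feature is divided by its median within each batch, where the batch of a sample is looked up in the sample annotation column named after the feature's run mode. The function `median_normalise()` and all toy data are hypothetical.

```r
# Hypothetical sketch; norm_metabolite_data() itself may differ.
median_normalise <- function(metabolite_data, feature_data, sample_data,
                             run_mode_col) {
  normalised <- metabolite_data
  for (feature in colnames(metabolite_data)) {
    # run mode of this feature, e.g. "neg" or "pos"
    run_mode <- feature_data[feature, run_mode_col]
    # batch of every sample for that run mode
    batch <- sample_data[rownames(metabolite_data), run_mode]
    values <- metabolite_data[[feature]]
    # median of this feature within each batch, expanded back to sample order
    batch_median <- ave(values, batch,
                        FUN = function(v) median(v, na.rm = TRUE))
    normalised[[feature]] <- values / batch_median
  }
  normalised
}

metabolite_data <- data.frame(
  met_1 = c(2, 4, 10, 30),
  met_2 = c(1, 3, 5, 7),
  row.names = c("s1", "s2", "s3", "s4")
)
# feature annotation: one run mode string per feature
feature_data <- data.frame(
  run_mode = c("neg", "pos"),
  row.names = c("met_1", "met_2"),
  stringsAsFactors = FALSE
)
# sample annotation: one batch column per run mode
sample_data <- data.frame(
  neg = c("n1", "n1", "n2", "n2"),
  pos = c("p1", "p1", "p1", "p1"),
  row.names = c("s1", "s2", "s3", "s4"),
  stringsAsFactors = FALSE
)

normalised <- median_normalise(metabolite_data, feature_data, sample_data,
                               run_mode_col = "run_mode")
```

In this toy data `met_1` is scaled separately within batches `n1` and `n2`, while `met_2` has a single batch and is simply divided by its overall median of 4.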

(4.4) estimate summary statistics on initial or pre-filtered data

1) samples
	(a) If platform is Metabolon
		1. extract feature ids for c("xenobiotics", "Xenobiotics") in feature annotation data column "SUPER_PATHWAY"
	(b) If platform is Nightingale
		1. extract feature ids for "derived_features" in the feature annotation data 
			- do so if the "derived_var_exclusion" variable is set to TRUE, as defined by parameter value 14
	(c) run function sample.sum.stats()
		1. estimate sample missingness with function sample.missingness()
			- remove xenobiotics or derived variables as defined in steps 1.a and 1.b above
		2. estimate total sum abundance with function total.peak.area()
			- estimate TSA for all features (to the exclusion of those excluded by 1.a and 1.b above )
			- estimate TSA for all complete (no missingness) features (to the exclusion of those excluded by steps 1.a and 1.b above )
		3. outliers count with function sample.outliers(); uses function outlier.matrix() internally.
			- defaults at 5IQR from the median, this estimates the number of outlier features a sample has.
	(d) add sample stats to sample annotation data frame
	(e) write sample annotation with sumstats to file
2) features
	(a) extract sample missingness from sample stats
		1. use "sample_missingness_w_exclusions" if present
	(b) run function feature.sum.stats()
		1. estimate feature missingness with function feature.missingness()
		2. estimate a variety of summary statistics with the function feature.describe()
		3. outlier count with the function feature.outliers(); uses function outlier.matrix() internally.
			- defaults at 5SD from the mean, this estimates the number of outlier samples a feature has.
		4. estimate the number of independent | representative features in the data set
			- run function tree_and_independent_features()
				1. exclude any features in parameter feature_names_2_exclude
				2. exclude features with no variance
				3. exclude features with > 20% missingness
				4. run function make.tree() 
					4.1 estimate spearman correlation matrix
					4.2 build distance matrix as 1-abs(rho)
					4.3 estimate hclust() dendrogram
				5. cut tree at defined tree cut height, defaulted to 0.5
				6. identify feature with least missingness within each cluster and return as representative feature
				7. return list object with
					1. dendrogram
					2. vector of independent feature ids
					3. data frame of feature ids
					4. k cluster ids
					5. binary (0|1) for independent features.
		5. return list object with
			1. data frame with
				1. feature missingness
				2. outlier count
				3. summary stats
				4. k cluster ids
				5. binary (0|1) for independent features
			2. hclust dendrogram 
	(c) add feature summary statistics to feature annotation file
	(d) write feature annotation data to file
3) Derive principal components
	(a) run function pc.and.outliers()
		1. use only the independent | representative features identified in step 2.b.4
		2. imputed data PCs
			2.1. impute missing values to the median with the function median_impute()
			2.2. Z-transform the data
			2.3. estimate PCs with prcomp()
		3. probabilistic PCs
			3.1. z-transform data
			3.2. estimate PCs with pcaMethods::ppca()
		4. estimate number of "informative" PCs
			4.1. estimate eigenvalues eigen()
			4.2. run a parallel analysis with nFactors::parallel()
			4.3. estimate acceleration factor with nFactors::nScree()
		5. identify outliers at 3, 4, and 5 SD from the mean from imputed PCs
		6. return:
			6.1. imputed PCs with 3, 4, and 5 SD outlier binaries (0|1)
			6.2. variance explained
			6.3. acceleration factor
			6.4. parallel estimate
			6.5. probabilistic PCs
4) add imputed PCs and outliers to sample annotation and sumstats
	(a) write to file
5) write the variance explained by each PC to file
6) save feature|metabolite tree Robj to file
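
The clustering logic in step 2.b.4 (Spearman correlation, 1 - abs(rho) distances, hclust(), tree cut, least-missing representative per cluster) can be sketched with base R as below. This is not the source of tree_and_independent_features(); the function name `independent_features()` is hypothetical, the default hclust() linkage is an assumption, and the exclusion of zero-variance and high-missingness features is omitted for brevity.

```r
# Hypothetical sketch of the clustering logic; not the metaboprep source.
independent_features <- function(metabolite_data, tree_cut_height = 0.5) {
  # Spearman correlation matrix on pairwise-complete observations
  rho <- cor(metabolite_data, method = "spearman",
             use = "pairwise.complete.obs")
  # distance matrix as 1 - abs(rho)
  dist_mat <- as.dist(1 - abs(rho))
  # dendrogram (default linkage of hclust(), an assumption of this sketch)
  tree <- hclust(dist_mat)
  # cluster ids from cutting the dendrogram at the chosen height
  k <- cutree(tree, h = tree_cut_height)
  # within each cluster keep the feature with the least missingness
  missingness <- colMeans(is.na(metabolite_data))
  representative <- vapply(split(names(k), k), function(members) {
    members[which.min(missingness[members])]
  }, character(1))
  list(tree = tree,
       k = k,
       independent = as.integer(names(k) %in% representative))
}

toy <- data.frame(
  met_1 = c(NA, 2, 3, 4, 5, 6, 7, 8),  # tracks met_2, one missing value
  met_2 = c(2, 1, 3, 4, 5, 6, 7, 8),
  met_3 = c(4, 5, 3, 6, 1, 8, 2, 7)    # weakly correlated with the others
)

res <- independent_features(toy, tree_cut_height = 0.5)
res$k            # met_1 and met_2 share cluster 1, met_3 is cluster 2
res$independent  # 0 1 1
```

Here `met_1` and `met_2` fall in the same cluster, and `met_2` is kept as the representative because it has no missing values.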

(4.5) perform data filtering

1) identify feature exclusions
	(a) if Metabolon exclude SUPER_PATHWAY features associated with c("xenobiotics", "Xenobiotics")
	(b) if Nightingale and if derived variables parameter set to TRUE (parameter value 14) identify derived variable feature names
2) run function perform.metabolite.qc()
	(a) set aside (remove but retain) the excluded features from the data set
	(b) estimate initial sample missingness
	(c) exclude samples with missingness >= 0.8
	(d) estimate initial feature missingness
	(e) exclude features with missingness >= 0.8
	(f) re-estimate sample missingness
	(g) exclude samples with user defined missingness threshold (default >= 0.2)
	(h) re-estimate feature missingness
	(i) exclude features with user defined missingness threshold (default >= 0.2)
	(j) estimate total peak area | abundance
	(k) if parameter value 9 is NOT NA
		1. exclude samples on user defined SD units from the mean (default is 5 SD from mean)
	(l) identify outliers at each feature with function outlier.matrix()
		1. identified using interquartile range unit distances and median estimates. (parameter value 10)
	(m) how to treat outliers before estimating PCs, and only for PCs as defined by parameter value 11
		1. "leave_be": do nothing to outliers at each feature
		2. "turn_NA": turn outlier values to NA, which means they will be median imputed for the PCA
		3. "winsorize": winsorize the outliers to the 100th percentile (the maximum) of all remaining (non-outlying) values at a feature.
	(n) identify independent | representative feature
		1. run function feature.sum.stats(), which runs function tree_and_independent_features()
		2. extract independent feature ids
	(o) estimate PCs
		1. run function pc.and.outliers()
	(p) if parameter value 13 is NOT NA
		1. identify PC outliers with estimated acceleration factor or a minimum of 2 PCs, default is 5SD from mean
	(q) place exclusion features (step 2.a.) back into data frame
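
The two-pass missingness filtering of steps 2(b) to 2(i) can be sketched as follows. This is not the code of perform.metabolite.qc(); `missingness_filter()` is a hypothetical helper, and the toy matrix is made up to show why the coarse 0.8 pass runs before the user-defined thresholds.

```r
# Hypothetical sketch of steps (b) to (i); not the metaboprep source.
missingness_filter <- function(metabolite_data,
                               sample_threshold = 0.2,
                               feature_threshold = 0.2) {
  x <- as.matrix(metabolite_data)
  # first pass: drop very poor samples, then very poor features
  x <- x[rowMeans(is.na(x)) < 0.8, , drop = FALSE]
  x <- x[, colMeans(is.na(x)) < 0.8, drop = FALSE]
  # second pass: re-estimate and apply the user-defined thresholds
  x <- x[rowMeans(is.na(x)) < sample_threshold, , drop = FALSE]
  x <- x[, colMeans(is.na(x)) < feature_threshold, drop = FALSE]
  x
}

toy <- matrix(1, nrow = 7, ncol = 10,
              dimnames = list(paste0("s", 1:7), paste0("met_", 1:10)))
toy["s7", ] <- NA          # a sample with no data at all
toy[, "met_10"] <- NA      # a feature with no data at all
toy["s1", "met_1"] <- NA   # one sporadic missing value

filtered <- missingness_filter(toy)
dim(filtered)   # 6 9
```

Only `s7` and `met_10` are removed. Without the first pass, `met_1` would show a missingness of 2/7 (about 0.29) because of the empty sample `s7` and would be dropped at the 0.2 threshold, even though it is missing in just one of the six usable samples.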

(4.6) estimate summary statistics on filtered data

1) Repeat steps found in (4.4), but on the filtered data set produced by step (4.5).
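
Among the repeated steps, the imputed-data PCA of (4.4) step 3 and the SD-based PC outlier flagging can be sketched in base R as below. This is not the code of pc.and.outliers(): the function `pcs_with_outliers()` is hypothetical, the number of PCs to inspect is passed in directly rather than derived from the acceleration factor, and the probabilistic PCA and parallel analysis are left out.

```r
# Hypothetical sketch; pc.and.outliers() itself may differ.
pcs_with_outliers <- function(metabolite_data, n_sd = 5, n_pcs = 2) {
  x <- as.matrix(metabolite_data)
  # impute missing values to the feature median
  for (j in seq_len(ncol(x))) {
    x[is.na(x[, j]), j] <- median(x[, j], na.rm = TRUE)
  }
  # z-transform each feature, then estimate PCs
  pca <- prcomp(scale(x), center = FALSE, scale. = FALSE)
  var_exp <- pca$sdev^2 / sum(pca$sdev^2)
  # flag samples more than n_sd standard deviations from the mean
  # on any of the first n_pcs components
  pcs <- pca$x[, seq_len(n_pcs), drop = FALSE]
  z <- scale(pcs)
  outlier <- as.integer(apply(abs(z) > n_sd, 1, any))
  list(pcs = pcs, variance_explained = var_exp, outlier = outlier)
}

set.seed(42)
toy <- matrix(rnorm(200 * 6), nrow = 200,
              dimnames = list(paste0("s", 1:200), paste0("met_", 1:6)))
toy[1, ] <- 25           # one extreme sample
toy[2, 3] <- NA          # one missing value to impute

res <- pcs_with_outliers(toy, n_sd = 5, n_pcs = 2)
which(res$outlier == 1)  # indices of flagged samples
```

The extreme sample dominates the first component and is flagged. It also shows why outlier treatment before the PCA matters, since a single sample can define a principal component.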

(4.7) generate HTML report

1) generate the HTML report by running function generate_report()

(4.8) plot each metabolite

1) plot data distributions and summary statistics for each metabolite to PDF by running function feature_plots()
