rna-features is a package used to generate machine-learning features from
RNAseq data. Given a list of dataset directories, each containing DESeq2
contrast files (.csv) and a 'tpm.tsv' matrix of gene Transcripts per Million
(TPM) across samples (generated by the llrnaseq pipeline), it generates a
feature matrix containing the following features per dataset:
- Gene breadth (p <= p-value)
  - down (log2FC <= -1)
  - neither (-1 < log2FC < 1)
  - up (log2FC >= 1)
- log2FC (p <= p-value)
  - Median Absolute Deviation (MAD)
  - Maximum
  - Median
- TPM
  - MAD
  - Maximum
  - Median
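To make the feature definitions above concrete, here is a minimal sketch of how the breadth counts and log2FC summaries could be derived with pandas. It is an illustration under stated assumptions, not the package's own implementation: the function name `summarise_log2fc` is hypothetical, filtering on the `pvalue` column (rather than `padj`) is an assumption, the MAD is computed literally as the median of absolute deviations from the median, and the toy contrasts are made up for the example.

```python
import pandas as pd


def summarise_log2fc(contrasts, p_cutoff=0.05):
    """Hypothetical helper: summarise log2FC across contrasts for each gene.

    `contrasts` is a list of DataFrames indexed by gene ID, each with
    'log2FoldChange' and 'pvalue' columns (assumed column choice).
    """
    # One column per contrast; log2FC values with p > cutoff become NaN.
    lfc = pd.concat(
        [c["log2FoldChange"].where(c["pvalue"] <= p_cutoff) for c in contrasts],
        axis=1,
        keys=range(len(contrasts)),
    )
    out = pd.DataFrame(index=lfc.index)
    # Breadth: number of significant contrasts falling in each category.
    out["down"] = (lfc <= -1).sum(axis=1)
    out["neither"] = ((lfc > -1) & (lfc < 1)).sum(axis=1)
    out["up"] = (lfc >= 1).sum(axis=1)
    # Summary statistics over the significant log2FC values only.
    median = lfc.median(axis=1)
    out["mad"] = lfc.sub(median, axis=0).abs().median(axis=1)
    out["max"] = lfc.max(axis=1)
    out["median"] = median
    return out


# Toy contrasts (illustrative values only).
genes = ["g1", "g2", "g3"]
contrast_a = pd.DataFrame(
    {"log2FoldChange": [1.5, -2.0, 0.2], "pvalue": [0.01, 0.001, 0.5]}, index=genes
)
contrast_b = pd.DataFrame(
    {"log2FoldChange": [1.1, 0.5, 0.3], "pvalue": [0.02, 0.03, 0.04]}, index=genes
)
features = summarise_log2fc([contrast_a, contrast_b])
print(features)
```

The same median, maximum and MAD calculations would apply row-wise to a TPM matrix, without the p-value filter.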
These features are output as a feature_matrix file in both .csv and .pkl
format (the .pkl file can be loaded as a pandas dataframe with
pandas.read_pickle(path)). Below is an output preview:

                              regulation          log2foldchange                    tpm
                              down  neither   up       mad         max      median        mad         max      median
    dataset gene
    set_1   Solyc00g500063.1   0.0      1.0  0.0  0.000000    0.953245    0.953245   8.412766   54.887642   27.721765
            Solyc00g500185.1   0.0      0.0  1.0  0.000000    1.333732    1.333732   0.135050    0.943789    0.254913
            Solyc01g005000.3   0.0      1.0  2.0  0.118566    1.097196    1.093001  44.024541  254.986816  108.668376
            Solyc01g005010.4   4.0      0.0  0.0  0.439194   -1.201843   -1.577684  13.191743   85.372719   12.014153
            Solyc01g005020.3   0.0      1.0  0.0  0.000000    0.649139    0.649139   6.994529   42.430080   18.944556
            ...                ...      ...  ...       ...         ...         ...        ...         ...         ...
    set_2   Solyc12g150103.1   0.0      2.0  3.0  0.245354    1.598051    1.049794   1.223475    7.559584    3.616534
            Solyc12g150108.1   1.0      0.0  0.0  0.000000  -23.707473  -23.707473   1.287612   13.105947    0.000000
            Solyc12g150113.1   0.0      1.0  4.0  0.251563    1.845714    1.397828  40.746832  193.108032   59.591179
            Solyc12g150124.1   0.0      0.0  2.0  0.076378    1.622478    1.546100   0.468325    4.811159    0.703217
            Solyc12g150132.1   0.0      0.0  1.0  0.000000    4.130969    4.130969   0.074633    0.551118    0.091994

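The preview shows a two-level row index (dataset, gene) and two-level column headers. As a rough sketch of working with that layout after `pandas.read_pickle`, the example below builds a small stand-in frame from a few preview values, round-trips it through a temporary pickle, and selects from it. The file name `feature_matrix.pkl` and the exact index and column names are assumptions taken from the preview, so check them against your own output.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the real output, using a few values from the preview above.
columns = pd.MultiIndex.from_tuples(
    [("regulation", "down"), ("regulation", "up"), ("tpm", "median")]
)
index = pd.MultiIndex.from_tuples(
    [
        ("set_1", "Solyc00g500063.1"),
        ("set_1", "Solyc00g500185.1"),
        ("set_2", "Solyc12g150103.1"),
    ],
    names=["dataset", "gene"],
)
demo = pd.DataFrame(
    [[0.0, 0.0, 27.721765], [0.0, 1.0, 0.254913], [0.0, 3.0, 3.616534]],
    index=index,
    columns=columns,
)

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "feature_matrix.pkl"  # assumed file name
    demo.to_pickle(path)
    features = pd.read_pickle(path)

# All genes from one dataset (rows are then indexed by gene only).
set_1 = features.loc["set_1"]

# A single feature column, addressed by its (group, statistic) pair.
tpm_median = features[("tpm", "median")]

# Genes up-regulated in at least one contrast.
up_genes = features[features[("regulation", "up")] >= 1]
print(up_genes)
```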
To install rna-features, download the latest .whl binary from the
releases page and install it using pip (note: the package is not currently
installable with Python 3.10, as dependencies such as numpy have not yet
released compatible wheels):

    wget https://github.com/SpikyClip/rna-features/releases/download/0.1.1-dev/rna_features-0.1.1-py3-none-any.whl
    pip install rna_features-0.1.1-py3-none-any.whl

This will install rna-features as a Python package, and the rna-features
command will be available on $PATH. To test whether installation was
successful:

    rna-features -h

The following help message should appear:

    usage: rna-features [-h] [-p p-value] dir [dir ...]

    Generates machine-learning features from RNAseq data. Takes a list of
    directories containing DESeq2 contrast files (.csv) and a 'tpm.tsv' file
    (containing a matrix of tpm values of genes against sample) returning a
    'feature_matrix.csv' containing gene expression breadth and log2fc/tpm
    mad, max and median for each gene.

    positional arguments:
      dir         Dataset directories containing DESeq2 contrast files (.csv) and a 'tpm.tsv' matrix file.

    optional arguments:
      -h, --help  show this help message and exit
      -p p-value  p-value cutoff for filtering log2fc values [default: 0.05]

To use rna-features, specify a list of directories each containing DESeq2
.csv contrast files and one tpm.tsv file:

    rna-features dataset_1 dataset_2 dataset_3

An optional p-value cutoff can be specified:

    rna-features -p 0.005 dataset_1 dataset_2 dataset_3

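Since each dataset directory is expected to hold one tpm.tsv file plus at least one .csv contrast file, it can be handy to check a directory before running the tool. The sketch below is a hypothetical pre-flight check, not part of rna-features: the helper name `check_dataset_dir` and the temporary example directories are made up for illustration.

```python
import tempfile
from pathlib import Path


def check_dataset_dir(path):
    """Hypothetical helper: return a list of problems (empty if none found)."""
    path = Path(path)
    if not path.is_dir():
        return [f"{path} is not a directory"]
    problems = []
    if not (path / "tpm.tsv").is_file():
        problems.append(f"{path} has no tpm.tsv")
    if not any(path.glob("*.csv")):
        problems.append(f"{path} has no .csv contrast files")
    return problems


# Demonstration on throwaway directories with placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    good = Path(tmp) / "dataset_1"
    good.mkdir()
    (good / "tpm.tsv").write_text("gene_id\tsample_1\n")
    (good / "contrast_a.csv").write_text('"","baseMean"\n')

    bad = Path(tmp) / "dataset_2"
    bad.mkdir()

    good_problems = check_dataset_dir(good)
    bad_problems = check_dataset_dir(bad)

print(good_problems)
print(bad_problems)
```

This only checks that the files exist. It does not validate their contents.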
- The contrast files (*.csv) should be in the following format:

      "", "baseMean", "log2FoldChange", "lfcSE", "stat", "pvalue", "padj"
      "Solyc01g005000.3",4496.05232181299, 1.09719580776875,0.313072912511878, 3.50460152865228,0.000457291165260712, 0.0115280270712814
      "Solyc01g005340.3",540.376944106274, 0.52013987940027,0.170624565359894, 3.04844661906186, 0.0023002777636722, 0.0362570019128406
      "Solyc01g005390.3",16.4785747787331,-1.85885261292963,0.471053842373692,-3.94615741496274,7.94154133931579e-05,0.00287540425470711
      "Solyc01g005410.4",1181.71130130374, 1.37296624988023,0.394738835793252, 3.47816359928501,0.000504861691439399, 0.0125485785916511

- The tpm matrix (tpm.tsv) should be in the following tab-delimited format:

      gene_id 01-0-hr-C1 02-0-hr-C2 03-0-hr-C3 04-0-hr-JA1
      Solyc00g500003.1 0.030844 0.011062 0.006824
      Solyc00g500041.1 1.515571 1.78357 1.503047
      Solyc00g500042.1 0.258916 0.273953 0.248473

NaN values may occur in the regulation and log2foldchange columns if the
tpm.tsv matrix contains a broader set of genes than those found in the
contrast files. Such NaN values have to be processed by the user.
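
How to handle these NaN values depends on the downstream model. As a hedged sketch, two common options are shown below on a small made-up frame that mimics the output layout: dropping genes that have no contrast data, or filling the regulation counts with zero while leaving the log2foldchange statistics as NaN. The frame contents and the choice of zero as a fill value are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Made-up frame mimicking the output layout; the second gene appears only
# in tpm.tsv, so its regulation and log2foldchange entries are NaN.
columns = pd.MultiIndex.from_tuples(
    [
        ("regulation", "down"),
        ("regulation", "neither"),
        ("regulation", "up"),
        ("log2foldchange", "median"),
        ("tpm", "median"),
    ]
)
index = pd.MultiIndex.from_tuples(
    [("set_1", "gene_a"), ("set_1", "gene_b")], names=["dataset", "gene"]
)
features = pd.DataFrame(
    [[0.0, 1.0, 2.0, 1.09, 108.7], [np.nan, np.nan, np.nan, np.nan, 0.01]],
    index=index,
    columns=columns,
)

# Option 1: keep only genes that have regulation data.
has_contrast_data = features["regulation"].notna().all(axis=1)
dropped = features[has_contrast_data]

# Option 2: treat missing regulation counts as zero, leave the rest untouched.
is_regulation = features.columns.get_level_values(0) == "regulation"
filled = features.copy()
filled.loc[:, is_regulation] = filled.loc[:, is_regulation].fillna(0)

print(dropped)
print(filled)
```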