Description
Hi. I have a gene table (genomic locations of genes, 63086 rows × 4 columns) and a CNV table including genomic locations of Copy Number Variables (799505 rows × 5 columns; after segment mean filtering: 507673 rows × 6 columns). I want to determine the genes that have variable copy number. Bioframe.overlap gave RAM error ("Your session crashed after using all available RAM.") for large table dimensions in Colab notebook (Colab RAM is 12.7 GB).
I have tried different dimensions and the results:
CNV table: 799505 rows × 5 columns
Gene table (gene_coord): 63086 rows × 4 columns => RAM error
CNV table: 507673 rows × 6 columns
Gene table (gene_coord): 63086 rows × 4 columns => 23018739 rows × 10 columns
CNV table: 507673 rows × 6 columns
Gene table (gene_coord_expanded): 63086 rows × 6 columns => RAM error (It should have returned 23018739 rows × 12 columns.)
My code:
import pandas
data_url = "https://gdc-hub.s3.us-east-1.amazonaws.com/download/TCGA-BRCA.cnv.tsv.gz"
cnv_data = pandas.read_csv(data_url, compression='gzip', header=0, sep='\t', quotechar='"')
cnv_data.columns = ['sample', 'chrom', 'start', 'end', 'value']
cnv_data['chrom'] = 'chr'+cnv_data['chrom'].astype(str)
cnv_data = cnv_data[(cnv_data['value'] < -0.2) | (cnv_data['value'] > 0.2)]
#cnv_data = cnv_data[abs(cnv_data['value']) > 0.2]
cnv_data['aberration'] = cnv_data['value'].apply(lambda x: 'deletion' if x < -0.2 else ('amplification' if x > 0.2 else 'none'))
cnv_data
gene_coord = bioframe.read_table(
"https://ftp.ensembl.org/pub/release-112/gtf/homo_sapiens/Homo_sapiens.GRCh38.112.gtf.gz",
schema="gtf", skiprows=5)
gene_coord = gene_coord[gene_coord['feature'] == 'gene']
gene_coord['chrom'] = 'chr'+gene_coord['chrom'].astype(str)
gene_coord = gene_coord[['chrom', 'start', 'end', 'attributes']]
chrs_to_keep = [f'chr{i}' for i in range(1, 23)] + ['chrX', 'chrY', 'chrMT']
gene_coord = gene_coord[gene_coord['chrom'].isin(chrs_to_keep)]
gene_coord
import pandas as pd
import re
def parse_attributes(attributes):
attr_dict = {}
matches = re.findall(r'(\w+)\s*"([^"]*)"', attributes)
for key, value in matches:
attr_dict[key] = value
return attr_dict
attributes_df = gene_coord['attributes'].apply(parse_attributes).apply(pd.Series)
gene_coord_expanded = pd.concat([gene_coord, attributes_df], axis=1).drop(columns=['attributes'])
gene_coord_expanded.drop(columns=['gene_version', 'gene_source'], inplace=True)
gene_coord_expanded
import itertools
import numpy
import matplotlib
import matplotlib.pyplot as pyplot
import pandas
import bioframe
import bioframe.vis
cnv_genes = bioframe.overlap(cnv_data, gene_coord_expanded, how='inner', suffixes=('_cnv','_genes'))