This repository contains scripts and tools to quantify and analyze Loss of Y (LOY) in genomic datasets, primarily from WGS, WES, and RNA-seq data. It is designed for CCLE datasets and other human genomic data sources.
Loss of Y (LOY) is a common somatic chromosomal aberration with implications in aging and cancer. This pipeline provides tools to:
- Extract reads mapped to chromosome Y.
- Quantify LOY across samples and tissues.
- Integrate LOY with RNA-seq for differential expression analysis.
- Generate visualizations for LOY patterns.
ExtractYpositions.sh extracts all reads mapped to chromosome Y from a directory of BAM files and records their positions.
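The extraction step can be sketched as a loop over BAM files. This is a minimal illustration, not the contents of ExtractYpositions.sh: the function name `extract_y_positions`, the two-column output layout, and the contig name `Y` (some references use `chrY`) are assumptions, and it presumes samtools is installed and each BAM is coordinate-sorted and indexed.

```shell
# Hypothetical sketch: write "sample<TAB>position" for every read on chromosome Y.
# Assumes samtools is on PATH and every BAM has an index next to it.
extract_y_positions() {
    local bam_dir="$1" out_file="$2" contig="${3:-Y}"
    local bam sample
    : > "$out_file"                       # start with an empty output file
    for bam in "$bam_dir"/*.bam; do
        [ -e "$bam" ] || continue         # skip if the glob matched nothing
        sample="$(basename "$bam" .bam)"
        # SAM column 4 is the 1-based leftmost mapping position of the read
        samtools view "$bam" "$contig" |
            awk -v s="$sample" 'BEGIN { OFS = "\t" } { print s, $4 }' >> "$out_file"
    done
}
```

A call such as `extract_y_positions /path/to/bam/directory /path/to/output/file.txt` would mirror the script's two arguments; the optional third argument overrides the contig name.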
Usage:
bash ExtractYpositions.sh /path/to/bam/directory /path/to/output/file.txt

AnalyzeYPositions.R analyzes chromosome Y read positions to identify sparse regions and low-density genes.
Generates histograms, cumulative distributions, and visualizations of read density.
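Read density of this kind is usually computed by counting reads in fixed-width bins along the chromosome. The sketch that follows shows that idea in awk rather than R; it is not taken from AnalyzeYPositions.R. The input layout (tab-separated sample and position), the function name `bin_read_density`, and the default bin width of 100000 bp are assumptions.

```shell
# Hypothetical sketch: count reads per fixed-width bin for each sample.
# Input: tab-separated lines of "sample<TAB>position" (assumed layout).
# Output: "sample<TAB>bin_start<TAB>read_count", sorted by sample then bin.
bin_read_density() {
    local positions_file="$1" bin_size="${2:-100000}"
    awk -F'\t' -v bin="$bin_size" '
        BEGIN { OFS = "\t" }
        {
            # 1-based start coordinate of the bin containing this read
            start = int(($2 - 1) / bin) * bin + 1
            count[$1 OFS start]++
        }
        END { for (key in count) print key, count[key] }
    ' "$positions_file" | sort -k1,1 -k2,2n
}
```

Bins that contain no reads do not appear in the output, so a sparsity analysis built on this would need to enumerate the expected bins and treat missing ones as zero.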
Dependencies:
- R packages: dplyr, tidyr, ggplot2
Key outputs:
- ChromosomeDensityHist.png – Histogram of read density across chromosome Y
- low_density_2000genes.txt – Genes overlapping low-density regions (below threshold)
- ChromosomeCDF.png – Cumulative distribution function of read densities
- orderedSamples.txt – Sample IDs ordered by number of sparse regions
- ChromosomePosition.png – Scatter plot of reads per sample, ordered by sparsity
- ChromosomeBinSummary.png – Distribution of reads across chromosome Y bins
Usage:
Rscript AnalyzeYPositions.R

CompareDNAtoRNA.R compares DNA sequencing read positions on chromosome Y with RNA-seq expression data for Y-linked genes.
Generates a visualization of expressed Y-linked genes across samples ordered by DNA sparsity.
Dependencies:
- R packages: dplyr, tidyr, ggplot2
- Input files:
  - RNAseqdataClean.txt – Processed RNA-seq expression matrix (GCT-like format)
  - genes_chrY_positions.txt – Start and end positions of Y-linked genes
  - orderedSamples.txt – Sample order generated by AnalyzeYPositions.R
  - Num_mapped_reads.csv – Mapping between Run IDs and cell line names
Key output:
- Gene_vs_Sample_BarsV2.png – Bar plot showing expression presence of Y-linked genes across samples
Usage:
Rscript CompareDNAtoRNA.R

CallVariants.sh performs variant calling on BAM files using bcftools and outputs compressed, indexed VCF files.
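A typical bcftools calling loop pipes `mpileup` into `call` and then indexes the result. The sketch here shows one such arrangement; it is an illustration under assumptions, not the contents of CallVariants.sh. The function name `call_variants`, the choice of the multiallelic caller (`-m`), and the variants-only output (`-v`) are assumptions, while `VCF_Files` follows the output directory named in this README.

```shell
# Hypothetical sketch: call variants for every BAM in the current directory.
# Assumes bcftools is on PATH and the reference FASTA matches the BAM contigs.
call_variants() {
    local ref="$1" out_dir="${2:-VCF_Files}"
    local bam vcf
    mkdir -p "$out_dir"
    for bam in *.bam; do
        [ -e "$bam" ] || continue         # skip if the glob matched nothing
        vcf="$out_dir/$(basename "$bam" .bam).vcf.gz"
        # mpileup emits genotype likelihoods as uncompressed BCF (-Ou);
        # call -m uses the multiallelic caller, -v keeps variant sites only,
        # -Oz writes a bgzip-compressed VCF
        bcftools mpileup -Ou -f "$ref" "$bam" |
            bcftools call -mv -Oz -o "$vcf"
        # bcftools index writes a CSI index (.csi) by default
        bcftools index "$vcf"
    done
}
```

Running `call_variants Homo_sapiens_assembly19.fixed.fasta` from the BAM directory would produce one `.vcf.gz` and one `.csi` file per sample under `VCF_Files/`.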
Dependencies:
- Tools: bcftools (with the mpileup, call, and index subcommands)
- Input files:
  - *.bam – BAM files to be processed
  - Homo_sapiens_assembly19.fixed.fasta – Reference genome (update path as needed)
Outputs:
- VCF_Files/ – Directory containing compressed and indexed VCFs (.vcf.gz and .csi)
Usage:
bash CallVariants.sh

makeMatrixEQTL.R runs eQTL analysis by integrating SNP genotypes from VCF files with RNA-seq expression data using the MatrixEQTL R package.
Dependencies:
- R packages: VariantAnnotation, dplyr, MatrixEQTL, data.table, Matrix, biomaRt
- Input files:
  - *.vcf.gz – Variant call files (compressed and indexed with bcftools)
  - Num_mapped_reads.csv – Maps sample IDs (Run) to cell lines
  - CCLE_RNAseq_genes_rpkm_20180929.gct – Gene expression data (RNA-seq, RPKM values)
  - snps_positions.txt – SNP positions (generated during runtime)
  - gene_positions.txt – Gene positions (fetched via biomaRt)
Outputs:
- snps_positions.txt – SNP positions extracted from VCFs
- gene_positions.txt – Gene coordinates for tested genes
- output_matrixeqtl_results.txt – eQTL associations (with p-values and FDR correction)
Usage:
Rscript makeMatrixEQTL.R

This repository contains data files required for the LOY mapping pipeline.
- gene_chrY_positions.txt – Contains the start and end positions of genes on chromosome Y.
- Num_mapped_reads.csv – Maps sequencing run IDs (Run) to their corresponding CCLE cell lines (cell_line).
  - Example format:

        Run,cell_line,num_mapped_reads
        SRR8639150,EBC1_LUNG,379420
        SRR8639204,CAKI1_KIDNEY,541050
        SRR8639219,DMS53_LUNG,514931

  - Only the Run and cell_line columns are required for analyses.
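Because only the Run and cell_line columns are needed, a custom mapping file can carry extra columns in any order as long as the header names match. The following sketch pulls the two columns out by header name as a quick sanity check; the function name `run_to_cell_line` is hypothetical, and the approach assumes a plain comma-separated file without quoted fields.

```shell
# Hypothetical sketch: print "Run<TAB>cell_line" pairs from a mapping CSV,
# locating both columns by header name rather than by position.
run_to_cell_line() {
    awk -F',' '
        { sub(/\r$/, "") }                # tolerate Windows line endings
        NR == 1 {
            for (i = 1; i <= NF; i++) col[$i] = i
            if (!("Run" in col) || !("cell_line" in col)) {
                print "missing Run or cell_line column" > "/dev/stderr"
                exit 1
            }
            next
        }
        { print $(col["Run"]) "\t" $(col["cell_line"]) }
    ' "$1"
}
```

For the example rows shown in this README, `run_to_cell_line Num_mapped_reads.csv` would print one tab-separated pair per line, starting with SRR8639150 and EBC1_LUNG.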
- BAM files – High-coverage WGS/WES BAM files.
- RNA-seq counts – Processed RNA-seq count files for integration with LOY analyses.
Full CCLE or other controlled-access datasets are not included due to data usage restrictions.