
Analysis pipeline for mapping and quantifying Y chromosome read counts across WGS, WES, and RNA-seq data to study loss of the Y chromosome (LOY) in cancer and other contexts.


tatonetti-lab/Mapping-LOY


LOY (Loss of Y Chromosome) Mapping Pipeline

This repository contains scripts and tools to quantify and analyze loss of the Y chromosome (LOY) in genomic datasets, primarily WGS, WES, and RNA-seq data. It is designed for CCLE (Cancer Cell Line Encyclopedia) datasets and other human genomic data sources.




Overview

Loss of Y (LOY) is a common somatic chromosomal aberration that has been implicated in aging and cancer. This pipeline provides tools to:

  • Extract reads mapped to chromosome Y.
  • Quantify LOY across samples and tissues.
  • Integrate LOY with RNA-seq for differential expression analysis.
  • Generate visualizations for LOY patterns.

Scripts

ExtractYpositions.sh

Extracts all reads mapped to chromosome Y from a directory of BAM files and records their positions.

Usage:

bash ExtractYpositions.sh /path/to/bam/directory /path/to/output/file.txt
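
The script itself is not reproduced in this README. As a rough illustration of what such an extraction can look like, here is a minimal sketch that assumes samtools is installed and that the Y contig is named either Y or chrY. The helper name y_positions_from_sam and the two-column output (sample, position) are assumptions made for the example and may not match what ExtractYpositions.sh actually writes.

```shell
# Hypothetical helper: read SAM records on stdin and print "sample<TAB>position"
# for every record whose reference name is Y or chrY and whose position is set.
# SAM columns used: 3 = RNAME, 4 = POS (1-based leftmost mapping position).
y_positions_from_sam() (
  sample="$1"
  awk -F '\t' -v s="$sample" '
    /^@/ { next }                                           # skip SAM header lines
    ($3 == "Y" || $3 == "chrY") && $4 > 0 { print s "\t" $4 }
  '
)

# Example use over a directory of BAM files (assumes samtools is on PATH;
# -F 4 drops reads flagged as unmapped):
#   for bam in /path/to/bam/directory/*.bam; do
#     samtools view -F 4 "$bam" | y_positions_from_sam "$(basename "$bam" .bam)"
#   done > /path/to/output/file.txt
```

Keeping the filter as a function over SAM text makes it easy to check on a few hand-written records before running it on full BAM files.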

AnalyzeYPositions.R

Analyzes chromosome Y read positions to identify sparse regions and low-density genes.
Generates histograms, cumulative distributions, and visualizations of read density.

Dependencies:

  • R packages: dplyr, tidyr, ggplot2

Key outputs:

  • ChromosomeDensityHist.png – Histogram of read density across chromosome Y
  • low_density_2000genes.txt – Genes overlapping low-density regions (below threshold)
  • ChromosomeCDF.png – Cumulative distribution function of read densities
  • orderedSamples.txt – Sample IDs ordered by number of sparse regions
  • ChromosomePosition.png – Scatter plot of reads per sample, ordered by sparsity
  • ChromosomeBinSummary.png – Distribution of reads across chromosome Y bins

Usage:

Rscript AnalyzeYPositions.R
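
The analysis itself is done in R. To make the idea of a "sparse region" concrete, the sketch below counts reads per fixed-width bin and lists the bins that fall below a read-count threshold. The helper name sparse_bins, the bin width, the threshold, and the input layout (one position per line) are all assumptions for illustration; the parameters used by AnalyzeYPositions.R are not stated here.

```shell
# Hypothetical helper: read one 1-based position per line on stdin and print
# "bin_start<TAB>bin_end<TAB>count" for every bin holding fewer than min_reads reads.
# Arguments (illustrative): bin width, minimum read count, last coordinate to cover.
sparse_bins() (
  bin="$1"; min_reads="$2"; last="$3"
  awk -v bin="$bin" -v min_reads="$min_reads" -v last="$last" '
    { n[int(($1 - 1) / bin)]++ }                  # tally reads per bin index
    END {
      for (i = 0; i * bin < last; i++) {          # visit every bin, including empty ones
        c = n[i] + 0
        if (c < min_reads) {
          stop = (i + 1) * bin
          if (stop > last) stop = last            # clip the final bin
          printf "%d\t%d\t%d\n", i * bin + 1, stop, c
        }
      }
    }
  '
)

# Example (positions.txt and CHRY_LENGTH are placeholders to be set by the user):
#   sparse_bins 10000 5 "$CHRY_LENGTH" < positions.txt
```

Walking every bin up to the last coordinate matters here: bins with no reads at all never appear in the tally, yet they are the ones most relevant to LOY.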

CompareDNAtoRNA.R

Compares DNA sequencing read positions on chromosome Y with RNA-seq expression data for Y-linked genes.
Generates a visualization of expressed Y-linked genes across samples ordered by DNA sparsity.

Dependencies:

  • R packages: dplyr, tidyr, ggplot2
  • Input files:
    • RNAseqdataClean.txt – Processed RNA-seq expression matrix (GCT-like format)
    • genes_chrY_positions.txt – Start and end positions of Y-linked genes
    • orderedSamples.txt – Sample order generated by AnalyzeYPositions.R
    • Num_mapped_reads.csv – Mapping between Run IDs and cell line names

Key output:

  • Gene_vs_Sample_BarsV2.png – Bar plot showing expression presence of Y-linked genes across samples

Usage:

Rscript CompareDNAtoRNA.R

CallVariants.sh

Performs variant calling on BAM files using bcftools and outputs compressed, indexed VCF files.

Dependencies:

  • Tools: bcftools (with mpileup, call, and index subcommands)
  • Input files:
    • *.bam – BAM files to be processed
    • Homo_sapiens_assembly19.fixed.fasta – Reference genome (update path as needed)

Outputs:

  • VCF_Files/ – Directory containing compressed and indexed VCFs (.vcf.gz and .csi)

Usage:

bash CallVariants.sh
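
For orientation, a pipeline built from the three bcftools subcommands listed above could look roughly like the following. This is a sketch under assumptions, not the contents of CallVariants.sh: the helper names vcf_path_for and call_one are hypothetical, and the specific options (-Ou, -mv, -Oz) are common choices rather than the script's confirmed settings.

```shell
# Reference genome and output directory; override through the environment as needed.
REF="${REF:-Homo_sapiens_assembly19.fixed.fasta}"
OUT_DIR="${OUT_DIR:-VCF_Files}"

# Hypothetical helper: derive the output VCF path for a given BAM file.
vcf_path_for() {
  printf '%s/%s.vcf.gz\n' "$OUT_DIR" "$(basename "$1" .bam)"
}

# Hypothetical helper: call variants for one BAM file.
#   mpileup -Ou : stream uncompressed BCF straight into the caller
#   call -mv    : multiallelic caller, report variant sites only
#   -Oz         : write bgzip-compressed VCF
#   index       : writes a .csi index by default
call_one() (
  bam="$1"
  out="$(vcf_path_for "$bam")"
  mkdir -p "$OUT_DIR" &&
    bcftools mpileup -Ou -f "$REF" "$bam" | bcftools call -mv -Oz -o "$out" &&
    bcftools index "$out"
)

# Example: for bam in *.bam; do call_one "$bam"; done
```

Nothing runs until call_one is invoked, so the path logic can be checked without bcftools installed.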

makeMatrixEQTL.R

Runs eQTL analysis by integrating SNP genotypes from VCF files with RNA-seq expression data using the MatrixEQTL R package.

Dependencies:

  • R packages: VariantAnnotation, dplyr, MatrixEQTL, data.table, Matrix, biomaRt
  • Input files:
    • *.vcf.gz – Variant call files (compressed and indexed with bcftools)
    • Num_mapped_reads.csv – Maps sample IDs (Run) to cell lines
    • CCLE_RNAseq_genes_rpkm_20180929.gct – Gene expression data (RNA-seq, RPKM values)
    • snps_positions.txt – SNP positions (generated during runtime)
    • gene_positions.txt – Gene positions (fetched via biomaRt)

Outputs:

  • snps_positions.txt – SNP positions extracted from VCFs
  • gene_positions.txt – Gene coordinates for tested genes
  • output_matrixeqtl_results.txt – eQTL associations (with p-values and FDR correction)

Usage:

Rscript makeMatrixEQTL.R
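
MatrixEQTL takes variant locations as a three-column table (identifier, chromosome, position). The R script builds snps_positions.txt itself; the sketch below only illustrates what that kind of table looks like when derived from VCF text. The helper name vcf_to_snp_positions, the header names, and the synthetic "chrom:pos" identifier for records without an ID are assumptions for the example, and no filtering of SNPs versus indels is attempted.

```shell
# Hypothetical helper: read uncompressed VCF text on stdin and print a
# three-column table: identifier, chromosome, position.
# VCF columns used: 1 = CHROM, 2 = POS, 3 = ID.
vcf_to_snp_positions() {
  awk -F '\t' '
    BEGIN { print "snpid\tchr\tpos" }
    /^#/ { next }                                   # skip meta and header lines
    {
      id = $3
      if (id == ".") id = $1 ":" $2                 # no ID given: fall back to chrom:pos
      print id "\t" $1 "\t" $2
    }
  '
}

# Example (bgzip output can be decompressed with gzip; sample.vcf.gz is a placeholder):
#   gzip -dc VCF_Files/sample.vcf.gz | vcf_to_snp_positions > snps_positions.txt
```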

Data

This repository contains data files required for the LOY mapping pipeline.

Required files

  • gene_chrY_positions.txt – Contains the start and end positions of genes on chromosome Y.
  • Num_mapped_reads.csv – Maps sequencing run IDs (Run) to their corresponding CCLE cell lines (cell_line).
    • Example format:
      Run,cell_line,num_mapped_reads
      SRR8639150,EBC1_LUNG,379420
      SRR8639204,CAKI1_KIDNEY,541050
      SRR8639219,DMS53_LUNG,514931
      
    • Only the Run and cell_line columns are required for analyses.
  • BAM files – High-coverage WGS/WES BAM files.
  • RNA-seq counts – Processed RNA-seq count files for integration with LOY analyses.

Full CCLE or other controlled-access datasets are not included due to data usage restrictions.
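
Because only the Run and cell_line columns of Num_mapped_reads.csv are required, a lookup that finds them by header name is robust to extra columns and to column order. The following is a minimal sketch; the helper name cell_line_for_run is hypothetical, and it assumes a plain comma-separated file with no quoted fields and Unix line endings.

```shell
# Hypothetical helper: print the cell_line for a given Run ID.
# Columns are located by header name, so additional columns are ignored.
cell_line_for_run() (
  run="$1"; csv="$2"
  awk -F ',' -v run="$run" '
    NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header name to column index
    $(col["Run"]) == run { print $(col["cell_line"]) }
  ' "$csv"
)

# Example:
#   cell_line_for_run SRR8639150 Num_mapped_reads.csv
```

An unknown Run ID prints nothing, which callers can treat as "no matching cell line".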
