Compression Integrity Check Loop

Transferring large datasets from cloud infrastructure to local machines carries a risk of file corruption from unstable network connections, among other factors. These scripts were designed to check the integrity of fastq.gz files generated by NGS sequencers before they are run through an analysis pipeline.

comp_check.sh should be run on local machines over a smaller set of files, as it checks each compressed file serially. It can take hours to get through an entire NGS dataset, so without access to a high-performance computing (HPC) cluster it is best to run this overnight.
gz_array_builder.sh should be used to generate the two job array files needed to run this check on an HPC that uses the Sun Grid Engine (SGE) queuing system.

comp_check.sh

This script checks the integrity of all .gz files in a user-provided directory path and outputs a log file containing the names of all files that passed. Any corrupted compressed files are moved, together with the error log, into a newly created corrupt_files directory.
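The behaviour described above can be sketched as follows. This is a minimal illustration, not the contents of comp_check.sh: the function name check_gz_dir and the log names passed.log and errors.log are hypothetical, and it assumes that `gzip -t` is an acceptable integrity test for the files.

```shell
#!/usr/bin/env bash
# Sketch only: NOT the actual comp_check.sh.
# check_gz_dir DIR: test every .gz file in DIR, one after another, with `gzip -t`.
check_gz_dir() {
    local dir="${1%/}"                              # drop a trailing slash if present
    local pass_log="$dir/passed.log"                # hypothetical log name
    local err_log="$dir/corrupt_files/errors.log"   # hypothetical log name
    local f msg
    : > "$pass_log"                                 # start with an empty pass log
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue                     # the glob matched nothing
        if msg=$(gzip -t "$f" 2>&1); then
            basename "$f" >> "$pass_log"            # file decompresses cleanly
        else
            mkdir -p "$dir/corrupt_files"
            printf '%s\n' "$msg" >> "$err_log"      # keep gzip's error message
            mv "$f" "$dir/corrupt_files/"           # quarantine the corrupt file
        fi
    done
}

# Example call (path taken from the example in this README):
# check_gz_dir "/User/Garfunkel/Documents/Example_Project/"
```

Because each file is tested in turn, the run time grows with the total size of the dataset, which is why the serial approach suits smaller sets of files.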

Before using this script, open it in a text editor and add the absolute path to the directory containing the .gz files between the quotes on line 4.

Ex: DIR="/User/Garfunkel/Documents/Example_Project/"

Execute the script by running:

$ bash comp_check.sh

gz_array_builder.sh

Upload gz_array_builder.sh to an HPC and place it into your project's directory.

Run the command below to generate the job array files, replacing INPUT and OUTPUT with the directory paths to your fastq.gz files and results, respectively.

$ bash gz_array_builder.sh -i INPUT -o OUTPUT

Upon successful completion you will receive a message saying that qsub_gz_check_array.sou and gz_check.sh were generated. Run the ls -lah command to check that gz_check.sh is executable: its line should begin with -rwxrwxr-x, and many terminals will also display the filename in green.
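If reading the permission string by eye is inconvenient, the same check can be scripted. The helper below is a hypothetical convenience that is not part of this repository; it uses the standard `[ -x ]` test and, as an assumption about what you would want, adds the execute bit with chmod when it is missing.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not shipped with this repository.
# ensure_executable FILE: report whether FILE is executable; add the bit if it is not.
ensure_executable() {
    if [ -x "$1" ]; then
        echo "$1 is executable"
    else
        chmod +x "$1" && echo "$1 was made executable"
    fi
}

# Example call:
# ensure_executable gz_check.sh
```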

qsub_gz_check_array.sou is the qsub parameter file
gz_check.sh contains the code executed for each job

If you would like to modify any parameters or resources assigned to the jobs, do so in qsub_gz_check_array.sou. You will generally not need to edit gz_check.sh unless you would like to rename the resulting log files or error directory.
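For readers new to SGE job arrays: when a job is submitted with `qsub -t 1-N`, SGE runs the job script N times and sets the SGE_TASK_ID environment variable to a different index for each run. A per-task script typically uses that index to pick one input file. The sketch below shows the general pattern only; the function name pick_task_file is hypothetical, and the generated gz_check.sh may select its file differently.

```shell
#!/usr/bin/env bash
# Sketch of the usual job-array pattern; NOT the generated gz_check.sh.
# pick_task_file DIR [TASK_ID]: print the TASK_ID-th .gz file in DIR (1-based).
# TASK_ID defaults to $SGE_TASK_ID, or to 1 when run outside an array job.
pick_task_file() {
    local dir="${1%/}"
    local task_id="${2:-${SGE_TASK_ID:-1}}"
    set -- "$dir"/*.gz                  # positional parameters now hold the sorted file list
    [ -e "$1" ] || return 1             # no .gz files in DIR
    # Reject non-numeric or out-of-range task IDs.
    [ "$task_id" -ge 1 ] 2>/dev/null && [ "$task_id" -le "$#" ] || return 1
    shift "$((task_id - 1))"
    printf '%s\n' "$1"
}

# Example call inside an array task:
# file=$(pick_task_file INPUT) && gzip -t "$file"
```

With this pattern the array size must match the number of .gz files in the input directory, so that every file is assigned to exactly one task.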

To initiate the job array enter the following command:

$ source qsub_gz_check_array.sou

Check the status of your jobs using:

$ qstat

Upon completion, the job will write two files to the OUTPUT directory that list the passed and corrupted gz files:

gz_check.log
gz_check_error.log

All corrupt gz files will be moved to the newly created corrupt_files directory in the OUTPUT path.

The array will produce a log file for each execution of the script. It is important to spot check these to make sure there wasn't an unforeseen issue with the run.
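Spot checking can be partly automated by searching the per-task logs for common failure messages. The helper below is a hypothetical sketch: the function name is invented, and the search patterns are assumptions based on messages that gzip prints for damaged files ("not in gzip format", "unexpected end of file") plus the generic word "error". It does not replace reading a few logs by hand.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not shipped with this repository.
# flag_suspect_logs FILE...: print the names of log files that contain a
# known failure message. Exits non-zero when no file matches.
flag_suspect_logs() {
    # -l print matching file names only, -i ignore case, -s skip unreadable files
    grep -lis -e 'error' \
              -e 'not in gzip format' \
              -e 'unexpected end of file' -- "$@"
}

# Example call, assuming SGE's default per-task log names
# (<job_name>.o<job_id>.<task_id> and <job_name>.e<job_id>.<task_id>):
# flag_suspect_logs ./*.o* ./*.e*
```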

To access the help menu run:

$ bash gz_array_builder.sh -h

OMahoneyM/compression_integrity_loop
