Transferring large datasets from cloud infrastructure to local machines comes with the risk of file corruption due to issues with broadband connectivity, among other factors. These scripts were designed to check the integrity of fastq.gz files generated by NGS sequencers before they are run through an analysis pipeline.
- comp_check.sh should be run on local machines over a smaller set of files, as it checks each compressed file serially. Working through an entire NGS dataset can take hours, so without access to a High Performance Computing (HPC) cluster it is best to run this overnight.
- gz_array_builder.sh should be used to generate the two job array files needed to run the check on an HPC that uses the Sun Grid Engine (SGE) queuing system.
comp_check.sh checks the integrity of all .gz files in a user-provided directory path and outputs a log file containing the names of all the files that passed. If any compressed files are corrupted, they are moved, along with the error log, into a newly created corrupt_files directory.
Before using this script, open it in a text editor and add the absolute path to the directory containing the .gz files between the quotes on line 4.
- Ex:
DIR="/User/Garfunkel/Documents/Example_Project/"
Execute the script by running:
$ bash comp_check.sh
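For readers curious about what a serial check of this kind involves, here is a minimal sketch. It is not the actual contents of comp_check.sh: the function name check_gz_dir and the log names passed.log and failed.log are hypothetical, and the use of gzip -t as the integrity test is an assumption.

```shell
#!/usr/bin/env bash
# Sketch of a serial integrity check; NOT the actual contents of comp_check.sh.
# check_gz_dir, passed.log and failed.log are hypothetical names.

check_gz_dir() {
    local dir="$1"
    local pass_log="$dir/passed.log"
    local corrupt_dir="$dir/corrupt_files"
    local f

    : > "$pass_log"                        # start with an empty pass log
    for f in "$dir"/*.gz; do
        [ -e "$f" ] || continue            # the glob matched no .gz files
        if gzip -t "$f" 2>/dev/null; then  # -t tests the archive without extracting it
            basename "$f" >> "$pass_log"
        else
            # Create the directory only once a corrupt file is actually found
            mkdir -p "$corrupt_dir"
            basename "$f" >> "$corrupt_dir/failed.log"
            mv "$f" "$corrupt_dir/"
        fi
    done
}
```

Calling check_gz_dir with a directory path processes the files one at a time, which is why the run time grows linearly with the number of files.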
Upload gz_array_builder.sh to an HPC and place it in your project's directory.
Run the command below to generate the job array files, replacing INPUT and OUTPUT with the directory paths to your fastq.gz files and results, respectively.
$ bash gz_array_builder.sh -i INPUT -o OUTPUT
Upon successful completion you will receive a message saying that qsub_gz_check_array.sou and gz_check.sh were generated. Run the ls -lah command to check whether gz_check.sh is executable. If it is, its line will begin with a permission string such as -rwxrwxr-x, and on terminals with color output the filename will typically appear green.
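If the executable bit turns out to be missing, it can be added with chmod. The helper below is a small sketch of that check; ensure_executable is a hypothetical name and is not part of these scripts.

```shell
# Sketch: make sure a script carries the executable bit, adding it if missing.
# ensure_executable is a hypothetical helper name, not part of the repository.
ensure_executable() {
    local script="$1"
    if [ -x "$script" ]; then
        echo "$script is already executable"
    else
        chmod u+x "$script"    # grant execute permission to the owner only
        echo "$script was made executable"
    fi
}
```

For example, ensure_executable gz_check.sh leaves the file untouched when it is already executable.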
- qsub_gz_check_array.sou is the qsub parameter file
- gz_check.sh contains the code executed for each job
If you would like to modify any parameters or resources assigned to the jobs, do so in qsub_gz_check_array.sou. You will generally not need to edit gz_check.sh unless you would like to rename the resulting log files or error directory.
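To give a sense of the kind of parameters an SGE array submission usually carries, the sketch below assembles an example qsub command and prints it rather than submitting it. It is illustrative only: the generated qsub_gz_check_array.sou may use different options, build_qsub_cmd is a hypothetical helper, and the job name, task count, and h_rt limit are placeholder values (whether h_rt is requestable depends on your cluster's configuration).

```shell
# Illustrative only: the generated qsub_gz_check_array.sou may differ.
# build_qsub_cmd is a hypothetical helper; all values are placeholders.
#
#   -N gz_check        job name shown by qstat
#   -cwd               run each task from the current working directory
#   -t 1-<n>           array job with one task per input file
#   -l h_rt=01:00:00   wall-clock limit requested for each task
build_qsub_cmd() {
    n_tasks="$1"    # number of array tasks (e.g. number of fastq.gz files)
    script="$2"     # script executed by every task
    echo "qsub -N gz_check -cwd -t 1-${n_tasks} -l h_rt=01:00:00 ${script}"
}
```

Running build_qsub_cmd 96 gz_check.sh prints the command for a 96-task array, which makes it easy to review the resource requests before anything is queued.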
To initiate the job array enter the following command:
$ source qsub_gz_check_array.sou
Check your jobs' status using:
$ qstat
Upon completion, the job array writes two files to the OUTPUT directory, containing lists of the passed and corrupted gz files:
- gz_check.log
- gz_check_error.log
All corrupt gz files will be moved to the newly created corrupt_files directory in the OUTPUT path.
The array will produce a log file for each execution of the script. It is important to spot check these to make sure there wasn't an unforeseen issue with the run.
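One way to speed up the spot check is to list only the per-task error logs that contain output. The sketch below assumes SGE's default log naming (job name, then .e, the job ID, a dot, and the task ID) and that stderr was not merged into stdout; both depend on the options in qsub_gz_check_array.sou. The function name list_nonempty_stderr is hypothetical.

```shell
# Sketch: list per-task stderr logs that are not empty.
# Assumes SGE's default naming (<job_name>.e<job_id>.<task_id>) and that the
# logs sit directly in the directory passed as the first argument.
list_nonempty_stderr() {
    find "$1" -maxdepth 1 -type f -name '*.e[0-9]*.[0-9]*' -size +0 | sort
}
```

An empty result does not prove that the run was clean, so it is still worth opening a few of the logs by hand.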
To access the help menu run:
$ bash gz_array_builder.sh -h