Working with several samples
Most studies comprise more than one sample. When working with several amplicon sets (i.e. samples) targeting the same DNA fragment, it is recommended to dereplicate the amplicons at the study level to speed up the clustering process.
Let's assume that you have 10 samples (sample_1.fasta, sample_2.fasta, etc.), and each of them was dereplicated as explained here. The following block of shell code performs a dereplication at the study level (i.e. it merges all samples and updates the amplicon abundances accordingly). Please note that these are code examples; there are surely shorter, faster, or safer ways to encode the same logic.
# Dereplicate the whole study: amplicons with the same name are merged
# and their abundances summed, then entries are sorted by decreasing
# abundance before restoring the fasta format
export LC_ALL=C
cat sample_*.fasta | \
    awk 'BEGIN {RS = ">" ; FS = "[_\n]"}
         {if (NR != 1) {abundances[$1] += $2 ; sequences[$1] = $3}}
         END {for (amplicon in sequences) {
                  print ">" amplicon "_" abundances[amplicon] "_" sequences[amplicon]}}' | \
    sort --temporary-directory=$(pwd) -t "_" -k2,2nr -k1.2,1d | \
    sed -e 's/_/\n/2' > all_samples.fasta
The operation is usually very fast (millions of amplicons dealt with per minute). For large projects, it is possible to reach the maximum amount of data sort can deal with in memory; in that case, it is necessary to use the --temporary-directory option of sort. Using LC_ALL=C speeds up all "text" operations (sort, sed, grep, etc.), but it assumes that your fasta files contain only ASCII characters.
To keep track of the presence-absence (or number of occurrences) of each amplicon in each sample, you might want to build a contingency table. The following shell code will read the fasta files representing your samples and produce a tab-delimited table containing all the amplicons (rows) and their occurrences in the different samples (columns). Tested with GNU awk 4.
awk 'BEGIN {FS = "[>_]"}
     # Parse the sample files
     /^>/ {contingency[$2][FILENAME] = $3
           amplicons[$2] += $3
           if (FNR == 1) {
               samples[++i] = FILENAME
           }
     }
     END {# Create table header
          printf "amplicon"
          s = length(samples)
          for (i = 1; i <= s; i++) {
              printf "\t%s", samples[i]
          }
          printf "\t%s\n", "total"
          # Sort amplicons by decreasing total abundance (use a coprocess)
          command = "LC_ALL=C sort -k1,1nr -k2,2d"
          for (amplicon in amplicons) {
              printf "%d\t%s\n", amplicons[amplicon], amplicon |& command
          }
          close(command, "to")
          FS = "\t"
          while ((command |& getline) > 0) {
              amplicons_sorted[++j] = $2
          }
          close(command)
          # Print the amplicon occurrences in the different samples
          n = length(amplicons_sorted)
          for (i = 1; i <= n; i++) {
              amplicon = amplicons_sorted[i]
              printf "%s", amplicon
              for (j = 1; j <= s; j++) {
                  printf "\t%d", contingency[amplicon][samples[j]]
              }
              printf "\t%d\n", amplicons[amplicon]
          }
     }' sample_*.fasta > amplicon_contingency_table.csv
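Before going further, you may want to verify the table. This minimal sketch (assuming the tab-delimited layout produced above, with the total in the last column) reports any amplicon whose per-sample counts do not add up to its total:
awk -F "\t" 'NR > 1 {s = 0
                     for (i = 2; i < NF; i++) s += $i
                     if (s != $NF) print "mismatch: " $1
             }' amplicon_contingency_table.csv
If everything is consistent, the command prints nothing.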
The awk code above is efficient enough to deal with 3 million unique amplicons in 10 samples in less than a minute. For extremely large studies, working with flat text files might become intractable (imagine a table with 130 million rows and 1,000 columns!). If you are in that situation, switching to a (fast) relational database is probably your best option (e.g., PostgreSQL).
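For illustration, here is a minimal sketch of that route (all file, database, and column names are hypothetical): a long-format file occurrences.tsv with tab-separated columns sample, amplicon, and abundance is loaded into a local PostgreSQL server, and the per-amplicon totals are recomputed with a GROUP BY:
# Create a database and a long-format table, then load the data
createdb amplicons
psql -d amplicons -c "CREATE TABLE occurrences (sample text, amplicon text, abundance integer);"
psql -d amplicons -c "\copy occurrences FROM 'occurrences.tsv'"
# Per-amplicon totals, the equivalent of the last column of the flat table
psql -d amplicons -c "SELECT amplicon, sum(abundance) AS total FROM occurrences GROUP BY amplicon ORDER BY total DESC;"
Indexes on the amplicon and sample columns would keep such queries fast as the table grows.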
If the awk code above does not work for you (GNU awk 4 is not installed everywhere), you can use the Python script amplicon_contingency_table.py located in the folder scripts:
python amplicon_contingency_table.py sample_*.fasta > amplicon_contingency_table.csv
If you want swarm to partition your dataset with the finest resolution (a local number of differences d = 1, with elimination of potential chained clusters and secondary grafting of small clusters) on a quad-core CPU:
./swarm -d 1 -f -t 4 amplicons.fasta -s amplicons.stats > amplicons.swarms
The "stats" table obtained gives for each cluster the total abundance, the number of unique amplicons in the cluster, the name and the abundance of the most abundant amplicon in the cluster (i.e. the seed), and the number of low abundant amplicons (abundance = 1) in the cluster.
Now that you have sorted clusters, you can combine that information with the amplicon contingency table to produce a cluster contingency table. That table indicates, for each cluster (rows), the number of times elements of that cluster (i.e. amplicons) have been seen in each sample (columns).
STATS="amplicons.stats"
SWARMS="amplicons.swarms"
AMPLICON_TABLE="amplicon_contingency_table.csv"
CLUSTER_TABLE="cluster_contingency_table.csv"
# Header
echo -e "cluster\t$(head -n 1 "${AMPLICON_TABLE}")" > "${CLUSTER_TABLE}"
# Compute "per sample abundance" for each cluster
awk -v SWARM="${SWARMS}" \
    -v TABLE="${AMPLICON_TABLE}" \
    'BEGIN {FS = " "
            # Load the swarm clusters (one cluster per line)
            while ((getline < SWARM) > 0) {
                swarms[$1] = $0
            }
            FS = "\t"
            # Load the amplicon contingency table
            while ((getline < TABLE) > 0) {
                table[$1] = $0
            }
     }
     {# Parse the stats file (clusters sorted by decreasing abundance)
      seed = $3 "_" $4
      n = split(swarms[seed], cluster, "[ _]")
      for (i = 1; i < n; i = i + 2) {
          s = split(table[cluster[i]], abundances, "\t")
          for (j = 1; j < s; j++) {
              samples[j] += abundances[j+1]
          }
      }
      printf "%s\t%s", NR, $3
      for (j = 1; j < s; j++) {
          printf "\t%s", samples[j]
      }
      printf "\n"
      delete samples
     }' "${STATS}" >> "${CLUSTER_TABLE}"