More fixes

pmitev committed Sep 13, 2021
1 parent cec9268 commit 93b13c7
Showing 3 changed files with 16 additions and 20 deletions.
24 changes: 11 additions & 13 deletions docs/Bio/NCBI-taxonomy.md
@@ -137,7 +137,7 @@ Might not be the best solution but it is easy to read and modify, for now. Note,
$ ./01.tabulate-names.awk names.dmp | sort -g -k 1 > names.tab

# Or with bzip2 compression "on the fly"
$ ./01.tabulate-names.awk <(bzcat names.dmp.bz2) | bzip2 -c > names.tab.bz2
$ ./01.tabulate-names.awk <(bzcat names.dmp.bz2) | sort -g -k 1 | bzip2 -c > names.tab.bz2
```
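As a quick sanity check of the tabulated output (a sketch; the taxon ID below is just an arbitrary illustration, not taken from the original text), a single record can be pulled out with a plain awk filter:

``` bash
# Look up one taxon ID in the tabulated file (fields are separated by "|").
$ awk -F'|' '$1 == 9606' names.tab

# The same lookup against the bzip2-compressed table.
$ bzcat names.tab.bz2 | awk -F'|' '$1 == 9606'
```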

??? note "01.tabulate-names.awk"
@@ -150,9 +150,9 @@ $ ./01.tabulate-names.awk <(bzcat names.dmp.bz2) | bzip2 -c > names.tab.bz2

$4 ~ "scientific name" { sciname[$1*1]= unds(Clean($2)); next}

$4 ~ "common name" { com_name[$1*1]= Cap(Clean($2)); next}

# Order is important, since the second case will match lines that match the first case.
$4 ~ "genbank common name" { genbank[$1*1]= unds(Clean($2)); next}
$4 ~ "common name" { com_name[$1*1]= Cap(Clean($2)); next}

END{
for(i in sciname) print i"|"sciname[i]"|"com_name[i]"|"genbank[i]
@@ -171,9 +171,13 @@ $ ./01.tabulate-names.awk <(bzcat names.dmp.bz2) | bzip2 -c > names.tab.bz2
function Cap (string) { return toupper(substr(string,1,1))substr(string,2) }
```

Note that this script will keep the last information for the corresponding match for each ID. To prevent this we need to take care that any subsequent match is ignored
Note that this script will keep the last value for any match of the same ID. It appears that the database contains repeated lines that do not carry complete information, and the tabulated data gets destroyed. To prevent this, we need to take care that any subsequent match is ignored.


``` bash
$ ./01.tabulate-names-first.awk names.dmp | sort -g -k 1 > names-first.tab
```
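As a minimal illustration of the idea (a toy sketch on hypothetical `id|value` input, not part of the repository), keeping only the first value seen per ID looks like this:

``` awk
#!/usr/bin/awk -f
# Toy sketch (not from the repository): keep only the FIRST value seen for each ID.
# Input lines are "id|value"; later duplicates of the same id are ignored,
# mirroring the `if (! array[id])` guard used in 01.tabulate-names-first.awk.
BEGIN { FS="|" }
{ if (! seen[$1*1]) seen[$1*1]= $2 }
END { for (i in seen) print i"|"seen[i] }
```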

??? note "01.tabulate-names-first.awk"
``` awk
#!/usr/bin/awk -f
@@ -184,9 +188,10 @@

$4 ~ "scientific name" { if (! sciname[$1*1] ) sciname[$1*1]= unds(Clean($2)); next}

# Order is important, since the second case will match lines that match the first case.
$4 ~ "genbank common name" { if (! genbank[$1*1] ) genbank[$1*1]= unds(Clean($2)); next}
$4 ~ "common name" { if (! com_name[$1*1]) com_name[$1*1]= Cap(Clean($2)); next}

$4 ~ "genbank common name" { if (! genbank[$1*1] ) genbank[$1*1]= unds(Clean($2)); next}

END{
for(i in sciname) print i"|"sciname[i]"|"com_name[i]"|"genbank[i]
@@ -228,16 +233,9 @@ Now we can use the tabulated data in `names.tab` and perform the replacement in
Again, this might not be the best way, but it works. The suggested solutions could easily be merged into a single script. I prefer to keep them as separate steps, so I can make sure that the first step has completed successfully (*it takes some time*) before I continue. I can also filter out the unnecessary data in the newly tabulated file and keep only the relevant part, or alter it further if I need to.

``` bash
$ ./02.substitute.awk names.tab hg38.100way.scientificNames.nh > NEW.g38.100way.scientificNames.nh

# Or with bzip2 compression "on the fly"
$ ./02.substitute.awk <(bzcat names.tab.bz2) hg38.100way.scientificNames.nh > NEW.g38.100way.scientificNames.nh
$ ./02.substitute.awk names-first.tab hg38.100way.scientificNames.nh > NEW.g38.100way.scientificNames.nh
```


``` bash
$ ./02.substitute.awk names.tab hg38.100way.scientificNames.nh
```
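The body of `02.substitute.awk` is not shown in this hunk. Purely as an illustration of the two-file pattern such a substitution script typically relies on (a hedged sketch, not the repository's actual code), it could look roughly like this:

``` awk
#!/usr/bin/awk -f
# Sketch only - NOT the actual 02.substitute.awk from the repository.
# File 1: tabulated names, one "id|scientific_name|common_name|genbank_name" record per line.
# File 2: the Newick tree, printed with scientific names swapped for common names where known.
BEGIN { FS="|" }
NR == FNR { if ($3 != "") repl[$2]= $3; next }   # first file: build scientific -> common map
{
    for (s in repl) gsub(s, repl[s])             # naive global substitution on the tree line(s)
    print
}
```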
??? note "02.substitute.awk"
``` awk linenums="1"
#!/usr/bin/awk -f
2 changes: 1 addition & 1 deletion docs/Case_studies/List.md
@@ -35,7 +35,7 @@ Here is a collection of mine and contributed awk scripts.
* **[Gaussian smearing](Gaussian_smearing.md)**
_trivial task done with awk - example how to use functions_
* **[Linear interpolation](Linear_interpolation.md)**
_use linear interpolation to resample your date on different grid_
_use linear interpolation to resample your data on a different grid_

## Physics oriented
* **[Dipole moment example](Dipole_moment.md)**
10 changes: 4 additions & 6 deletions docs/Case_studies/multiple_files_I.md
@@ -35,13 +35,12 @@ Below, it is just one possible way to do it. First we need to have a list of all
``` awk
#!/usr/bin/awk -f

{
names[$1]= 1;
{
data[$1][ARGIND]= $2
}

END {
for (i in names) print i"\t\t"data[i][1]"\t\t"data[i][2]
for (i in data) print i"\t"data[i][1]"\t"data[i][2]
}
```
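Worth noting: `ARGIND` and the `data[$1][ARGIND]` arrays of arrays are gawk extensions, so the script above needs gawk rather than a strictly POSIX awk. A usage sketch (the script and file names here are placeholders, not from the page):

``` bash
# Hypothetical invocation: merge the second column of two files, keyed on the first column.
$ ./merge-columns.awk file1.dat file2.dat
```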

@@ -138,12 +137,11 @@ Leave the extra blanks for the first attempt. We will use this problem (cleaning
BEGIN{ FS="|" }

{
id[$1]= 1;
data[$1][FILENAME]= $2
}

END {
for (i in id) print trim(i)"|"trim(data[i]["scientific"])"|"trim(data[i]["genbank"])
for (i in data) print trim(i)"|"trim(data[i]["scientific"])"|"trim(data[i]["genbank"])
}

function trim (x) {
@@ -153,7 +151,7 @@
}
```
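Since the END block indexes `data[i][FILENAME]` with the literal keys `"scientific"` and `"genbank"`, the two input files are expected to be named exactly that. A usage sketch (the script name is a placeholder):

``` bash
# Hypothetical invocation; the inputs must literally be the files "scientific" and "genbank".
$ ./merge-names.awk scientific genbank
```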

??? "Solution usung join uggested by Amrei Binzer-Panchal, 2021.01.18"
??? "Solution usung join suggested by Amrei Binzer-Panchal, 2021.01.18"
``` bash
$ join -a1 -a2 -j 1 -o 0,1.2,2.2 -e "NULL" -t "|" <(sort scientific) <(sort genbank)

