updated to ensembl release 110; did some patching #1536
base: main
if ensembl_id == "canis_lupus_familiaris":
    ensembl_id = "canis_familiaris"
this is a horrible, terrible thing that i've done, but i've done it because this single ensembl shift disagrees with all the other ensembl ids....
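One way to make this kind of exception easier to spot later is a small lookup table instead of an inline conditional. A minimal sketch, assuming nothing about the maintenance script beyond the two ids in the diff above; `ENSEMBL_ID_OVERRIDES` and `normalize_ensembl_id` are hypothetical names, not existing stdpopsim code:

```python
# Hypothetical table of Ensembl species names that need remapping.
# The single entry mirrors the special case patched in this PR.
ENSEMBL_ID_OVERRIDES = {
    "canis_lupus_familiaris": "canis_familiaris",
}


def normalize_ensembl_id(ensembl_id: str) -> str:
    """Return the registered override for an id, or the id unchanged."""
    return ENSEMBL_ID_OVERRIDES.get(ensembl_id, ensembl_id)
```

Any further one-off mismatches would then be a one-line addition to the table rather than another branch.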
tmp = ensembl_id.split("_")[:2]
print(tmp, ensembl_id)
stray print
oops-- debugging
"MT": {"length": 16543, "synonyms": []},
},
}
data = {"assembly_accession": None, "assembly_name": "BROAD S1", "chromosomes": {}}
uh-oh, what happened to this one?
wow that's not good
it looks to have all the info there: http://rest.ensembl.org/info/assembly/gasterosteus_aculeatus?synonyms=1
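For anyone wanting to check a record like this without a browser, here is a rough sketch using only the standard library. The URL is the one linked above; the function names are made up for illustration, and requesting JSON through a `Content-Type` header is my understanding of how the Ensembl REST service is usually called, so treat that as an assumption:

```python
import json
import urllib.request

# Base URL taken from the link above; everything else here is illustrative.
REST_URL = "http://rest.ensembl.org"


def assembly_url(species: str) -> str:
    """Build the info/assembly URL for a species, asking for synonyms."""
    return f"{REST_URL}/info/assembly/{species}?synonyms=1"


def fetch_assembly(species: str) -> dict:
    """Download the assembly record as JSON (needs network access)."""
    request = urllib.request.Request(
        assembly_url(species), headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def summarize(info: dict) -> dict:
    """Count karyotype and top-level entries to spot empty karyotypes."""
    return {
        "assembly_name": info.get("assembly_name"),
        "n_karyotype": len(info.get("karyotype", [])),
        "n_top_level": len(info.get("top_level_region", [])),
    }
```

Calling `summarize(fetch_assembly("gasterosteus_aculeatus"))` would show at a glance whether the karyotype list is empty while top-level regions are still present.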
@@ -116,7 +116,7 @@

 _species = stdpopsim.Species(
     id="GasAcu",
-    ensembl_id="9307941",
+    ensembl_id="gasterosteus_aculeatus",
perhaps this is why the assembly disappeared?
no, the original ensembl_id was incompatible with the REST API and the maintenance script...
},
"assembly_accession": "GCA_000313835.1",
"assembly_name": "Hmel1",
"chromosomes": {},
And another one that disappeared?
Looks like another manually manipulated one
Well this one has no karyotype entry:
---
assembly_accession: GCA_000313835.1
assembly_date: 2012-02
assembly_name: Hmel1
coord_system_versions:
  - Hmel1
default_coord_system_version: Hmel1
genebuild_initial_release_date: 2012-03
genebuild_last_geneset_update: 2012-03
genebuild_method: import
genebuild_start_date: 2012-03-HGC
golden_path: 273786188
karyotype: []
top_level_region:
  - coord_system: scaffold
    length: 163478
    name: HE667775
    synonyms:
      - dbname: INSDC
        name: HE667775.1
... and, it's just a bunch of scaffolds - nothing as long as in the assembly we've got.
"22": {"length": 35308119, "synonyms": []},
"X": {"length": 151242693, "synonyms": []},
# Mitochondria absent in ponAbe3, so length taken from ponAbe2.
"MT": {"length": 16499, "synonyms": []},
looks like we need to stick in this manually again?
nope - it's there, just not listed in the karyotype:
- coord_system: primary_assembly
  length: 16499
  name: MT
"chromosomes": {"1": {"length": 2065074, "synonyms": ["I"]}},
"assembly_accession": "GCA_001017915.1",
"assembly_name": "ASM101791v1",
"chromosomes": {},
another empty one?
yep - the current build for this has no chromosome-level assembly
The most alarming thing here is those species that now don't have any chromosomes - any idea what's up with that?
looks like the script isn't working quite right... will have to dig in further
Hmm, watch out for the manually edited files that are getting touched here.
It's pretty messy isn't it?
@@ -1,13 +1,6 @@
-# File created manually from https://www.ncbi.nlm.nih.gov/assembly/GCF_004382195.1
+# File autogenerated from Ensembl REST API. Do not edit.
Looks like this file shouldn't be changed, as it was created manually
maybe we should add a species level attribute -- manually_added or something -- that would indicate if the maintenance script should go ahead with the download
I'd vote "no", so we are alerted when a species that previously didn't have a build gets one; otherwise just manually back out the change in such cases.
(btw the reason this doesn't work is because ensembl doesn't have a karyotype entry for DroSec, just some ~Mb scale scaffolds)
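A possible middle ground between the two positions above, sketched under assumptions: the script still queries every species, but refuses to replace a populated chromosome table with an empty one and warns instead, so a new build would still be noticed. `should_overwrite` is a hypothetical helper, not something in the maintenance script, and the dictionaries are assumed to have the same `"chromosomes"` key as the generated data in this PR:

```python
import warnings


def should_overwrite(old: dict, new: dict, species_id: str) -> bool:
    """Return False (and warn) if new data would wipe existing chromosomes."""
    if old.get("chromosomes") and not new.get("chromosomes"):
        warnings.warn(
            f"{species_id}: downloaded assembly has no chromosomes; "
            "keeping the existing genome data"
        )
        return False
    return True
```

With this check, a case like DroSec would leave the manually created file alone, while a species that gains a chromosome-level build would still be updated.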
"38": {"length": 23914537, "synonyms": []}, | ||
"X": {"length": 123869142, "synonyms": []}, | ||
"MT": {"length": 16727, "synonyms": []}, | ||
"1": {"length": 123313939, "synonyms": []}, |
Looks like the chromosome lengths changed between these two builds of CanFam, and new build doesn't have a mitochondrial genome. We probably want to stick with old version to avoid conflicts with recombination map.
Hm - unlike in other species, it appears to legit not have MT any more.
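To catch this sort of change before it collides with a recombination map, the old and new chromosome tables could be compared directly. A small sketch; `diff_chromosomes` is a hypothetical name, and the input shape (`{"name": {"length": ...}}`) is assumed from the generated data shown in this PR:

```python
def diff_chromosomes(old: dict, new: dict) -> dict:
    """Summarise chromosomes that were dropped, added, or changed length."""
    changed = {
        name: (old[name]["length"], new[name]["length"])
        for name in old.keys() & new.keys()
        if old[name]["length"] != new[name]["length"]
    }
    return {
        "dropped": sorted(old.keys() - new.keys()),
        "added": sorted(new.keys() - old.keys()),
        "changed": changed,
    }
```

A non-empty `dropped` or `changed` entry for a species with genetic maps would be the signal to stay on the old build or lift the maps over.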
"d": {"length": 1094478, "synonyms": ["chrLGd"]}, | ||
"f": {"length": 4257874, "synonyms": ["chrLGf"]}, | ||
"g": {"length": 424765, "synonyms": ["chrLGg"]}, | ||
"h": {"length": 248369, "synonyms": ["chrLGh"]}, |
wtf, the previous chromosome names (e.g., LGa) are not included in the synonyms. I vote to leave well enough alone there. But - why no mitochondria?
ah I see what happened here: consulting http://rest.ensembl.org/info/assembly/anolis_carolinensis they've removed the mitochondria from the "karyotype" list of chromosomes, but it's still there, at the end of a list of a bajillion tiny contigs:
- coord_system: primary_assembly
  length: 17223
  name: MT
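If the script only reads the karyotype list, a record like the one above loses its mitochondrion. A rough sketch of a workaround, assuming the JSON has the same `karyotype` and `top_level_region` fields as the YAML quoted in this thread; `chromosome_lengths` and its `extra_names` parameter are hypothetical:

```python
def chromosome_lengths(info: dict, extra_names=("MT",)) -> dict:
    """Map region name to length for karyotype entries plus extra names.

    Regions are looked up in top_level_region, so a name such as MT is
    picked up even when it is missing from the karyotype list.
    """
    wanted = set(info.get("karyotype", [])) | set(extra_names)
    return {
        region["name"]: region["length"]
        for region in info.get("top_level_region", [])
        if region["name"] in wanted
    }
```

This still returns nothing useful for assemblies made up only of scaffolds, which is the separate problem seen for the species with an empty karyotype.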
"CM009944.2": {"length": 10670842, "synonyms": ["NC_037651.1"]}, | ||
"CM009945.2": {"length": 9534514, "synonyms": ["NC_037652.1"]}, | ||
"CM009946.2": {"length": 7238532, "synonyms": ["NC_037653.1"]}, | ||
"CM009947.2": {"length": 16343, "synonyms": ["NC_001566.1", "MT"]}, |
ensembl dropping the MT label here
and those synonyms are still there, labeled "RefSeq" or "GenBank":
- coord_system: primary_assembly
  length: 16343
  name: CM009947.2
  synonyms:
    - dbname: ensembl_internal_synonym
      name: NC_001566
    - dbname: GenBank
      name: MT
    - dbname: INSDC
      name: CM009947.2
    - dbname: ensembl_internal_synonym
      name: CM009947
    - dbname: RefSeq
      name: NC_001566.1
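Given a record shaped like the one above, the old synonyms could be recovered by filtering on `dbname`. A minimal sketch; `public_synonyms` is a hypothetical helper and the choice of `RefSeq` and `GenBank` as the databases to keep is only taken from this one example:

```python
def public_synonyms(region: dict, dbnames=("RefSeq", "GenBank")) -> list:
    """Return synonym names whose dbname is in the allowed set, in order."""
    return [
        synonym["name"]
        for synonym in region.get("synonyms", [])
        if synonym.get("dbname") in dbnames
    ]
```

Applied to the CM009947.2 record, this keeps `MT` and `NC_001566.1` and skips the INSDC and internal entries.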
"22": {"length": 37823149, "synonyms": []}, | ||
"X": {"length": 155549662, "synonyms": []}, | ||
"Y": {"length": 26350515, "synonyms": []}, | ||
"MT": {"length": 16554, "synonyms": []}, |
looks like they just removed those synonyms, oh well
I've been through this. Notes:
This all seems straightforward... except maybe the last point. I guess we should just convert those species over to being "manually added"? The other thing to deal with here is lifting over genetic maps...
okay here is the update to ensembl 110. i had to do a bit of patching along the way, adding or modifying ensembl_ids in the species.py files of a few species. in addition there is a horrible "duct tape" operation i did to patch the canis_familiaris situation. i'll point that out below.