updated to ensembl release 110; did some patching #1536
andrewkern wants to merge 2 commits into popsim-consortium:main from
| if ensembl_id == "canis_lupus_familiaris": | ||
| ensembl_id = "canis_familiaris" |
this is a horrible, terrible thing that i've done, but i've done it because this single ensembl shift disagrees with all the other ensembl ids....
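One way to keep a special case like this contained is a small lookup table of overrides rather than an inline `if`. This is only a sketch: `ENSEMBL_ID_OVERRIDES`, `normalize_ensembl_id` and `genus_species` are hypothetical names, not part of the maintenance script, and only the dog entry comes from this PR.

```python
# Hypothetical table of Ensembl IDs that disagree with the usual naming.
# Only the dog entry is taken from this PR.
ENSEMBL_ID_OVERRIDES = {
    "canis_lupus_familiaris": "canis_familiaris",
}


def normalize_ensembl_id(ensembl_id):
    """Return the ID to use downstream, applying any known override."""
    return ENSEMBL_ID_OVERRIDES.get(ensembl_id, ensembl_id)


def genus_species(ensembl_id):
    """Split a normalized ID into its first two parts, as the patch does."""
    return normalize_ensembl_id(ensembl_id).split("_")[:2]
```

With a table, any future mismatch becomes one more dictionary entry instead of another branch.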
| if ensembl_id == "canis_lupus_familiaris": | ||
| ensembl_id = "canis_familiaris" | ||
| tmp = ensembl_id.split("_")[:2] | ||
| print(tmp, ensembl_id) |
| "MT": {"length": 16543, "synonyms": []}, | ||
| }, | ||
| } | ||
| data = {"assembly_accession": None, "assembly_name": "BROAD S1", "chromosomes": {}} |
uh-oh, what happened to this one?
it looks to have all the info there: http://rest.ensembl.org/info/assembly/gasterosteus_aculeatus?synonyms=1
| _species = stdpopsim.Species( | ||
| id="GasAcu", | ||
| ensembl_id="9307941", | ||
| ensembl_id="gasterosteus_aculeatus", |
perhaps this is why the assembly disappeared?
no, the original ensembl_id was incompatible with the REST API and the maintenance script...
| }, | ||
| "assembly_accession": "GCA_000313835.1", | ||
| "assembly_name": "Hmel1", | ||
| "chromosomes": {}, |
And another one that disappeared?
Looks like another manually manipulated one
Well this one has no karyotype entry:
---
assembly_accession: GCA_000313835.1
assembly_date: 2012-02
assembly_name: Hmel1
coord_system_versions:
- Hmel1
default_coord_system_version: Hmel1
genebuild_initial_release_date: 2012-03
genebuild_last_geneset_update: 2012-03
genebuild_method: import
genebuild_start_date: 2012-03-HGC
golden_path: 273786188
karyotype: []
top_level_region:
-
coord_system: scaffold
length: 163478
name: HE667775
synonyms:
-
dbname: INSDC
name: HE667775.1
... and, it's just a bunch of scaffolds - nothing as long as in the assembly we've got.
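An empty `karyotype` list would explain the empty `"chromosomes": {}`. Here is a minimal sketch, assuming the script keeps only those top-level regions whose names appear under `karyotype` (that is a guess at its behaviour, not a quote of its code). `fetch_assembly_info` and `chromosomes_from_karyotype` are hypothetical names; the field names are the ones in the response above.

```python
import json
import urllib.request


def fetch_assembly_info(ensembl_id):
    """Fetch assembly info as JSON from the Ensembl REST API (needs network)."""
    url = f"http://rest.ensembl.org/info/assembly/{ensembl_id}?synonyms=1"
    request = urllib.request.Request(
        url, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def chromosomes_from_karyotype(info):
    """Keep only the top-level regions that are named in 'karyotype'."""
    wanted = set(info.get("karyotype", []))
    chromosomes = {}
    for region in info.get("top_level_region", []):
        if region["name"] in wanted:
            chromosomes[region["name"]] = {
                "length": region["length"],
                "synonyms": [s["name"] for s in region.get("synonyms", [])],
            }
    return chromosomes
```

Under that assumption, a response containing only scaffolds and `karyotype: []` yields an empty dict, which matches what the diff shows for Hmel1.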
| "22": {"length": 35308119, "synonyms": []}, | ||
| "X": {"length": 151242693, "synonyms": []}, | ||
| # Mitochondria absent in ponAbe3, so length taken from ponAbe2. | ||
| "MT": {"length": 16499, "synonyms": []}, |
looks like we need to stick in this manually again?
nope - it's there, just not listed in the karyotype:
-
coord_system: primary_assembly
length: 16499
name: MT
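Since MT is still present as a top-level region, the script could fall back to that list when the karyotype omits it. A minimal sketch; `add_mitochondrion` is a hypothetical helper, and the `chromosomes` argument is assumed to be the `{name: {"length": ..., "synonyms": [...]}}` mapping used in the genome data files.

```python
def add_mitochondrion(chromosomes, info, name="MT"):
    """Return a copy of `chromosomes` with `name` added from the
    top-level regions, if it is missing and a region by that name exists."""
    result = dict(chromosomes)
    if name in result:
        return result
    for region in info.get("top_level_region", []):
        if region["name"] == name:
            result[name] = {
                "length": region["length"],
                "synonyms": [s["name"] for s in region.get("synonyms", [])],
            }
            break
    return result
```

If no region named MT exists at all, the mapping is returned unchanged, so a build that really dropped its mitochondrion is not papered over.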
| "chromosomes": {"1": {"length": 2065074, "synonyms": ["I"]}}, | ||
| "assembly_accession": "GCA_001017915.1", | ||
| "assembly_name": "ASM101791v1", | ||
| "chromosomes": {}, |
yep - the current build for this has no chromosome-level assembly
The most alarming thing here is those species that now don't have any chromosomes - any idea what's up with that?
looks like the script isn't working quite right... will have to dig in further
jeromekelleher
left a comment
Hmm, watch out for the manually edited files that are getting touched here.
It's pretty messy isn't it?
| @@ -1,13 +1,6 @@ | |||
| # File created manually from https://www.ncbi.nlm.nih.gov/assembly/GCF_004382195.1 | |||
| # File autogenerated from Ensembl REST API. Do not edit. | |||
Looks like this file shouldn't be changed, as it was created manually
maybe we should add a species level attribute -- manually_added or something -- that would indicate if the maintenance script should go ahead with the download
I'd vote "no", so we are alerted when a species that previously didn't have a build gets one; otherwise just manually back out the change in such cases.
(btw the reason this doesn't work is because ensembl doesn't have a karyotype entry for DroSec, just some ~Mb scale scaffolds)
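Short of a new species attribute, the header comment already distinguishes the two kinds of file, so the script could at least flag when it is about to touch a manually created one. A sketch only: `MANUAL_MARKER` and `is_manually_created` are hypothetical names, and it assumes the header is the first non-blank line, as in the diff above.

```python
# Prefix of the header seen on manually created files in this PR.
MANUAL_MARKER = "# File created manually"


def is_manually_created(file_text):
    """True if the first non-blank line carries the manual-creation header."""
    for line in file_text.splitlines():
        if line.strip():
            return line.startswith(MANUAL_MARKER)
    return False
```

A warning based on this check would still let the update run, which keeps the alert when a species gains an Ensembl build while making the manual back-out harder to miss.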
| }, | ||
| "assembly_accession": "GCA_000313835.1", | ||
| "assembly_name": "Hmel1", | ||
| "chromosomes": {}, |
Looks like another manually manipulated one
| "38": {"length": 23914537, "synonyms": []}, | ||
| "X": {"length": 123869142, "synonyms": []}, | ||
| "MT": {"length": 16727, "synonyms": []}, | ||
| "1": {"length": 123313939, "synonyms": []}, |
Looks like the chromosome lengths changed between these two builds of CanFam, and new build doesn't have a mitochondrial genome. We probably want to stick with old version to avoid conflicts with recombination map.
Hm - unlike in other species, it appears to legit not have MT any more.
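A quick way to see whether a new build is safe next to existing recombination maps is to diff the chromosome tables of the two builds. A minimal sketch; `compare_builds` is a hypothetical helper, and both arguments are assumed to be `{name: {"length": ...}}` mappings like the ones in the genome data files.

```python
def compare_builds(old, new):
    """Summarize how two chromosome tables differ."""
    shared = sorted(set(old) & set(new))
    return {
        # Present in the old build only, e.g. a dropped MT.
        "removed": sorted(set(old) - set(new)),
        # Present in the new build only.
        "added": sorted(set(new) - set(old)),
        # Same name, different length: (old_length, new_length).
        "changed": {
            name: (old[name]["length"], new[name]["length"])
            for name in shared
            if old[name]["length"] != new[name]["length"]
        },
    }
```

Any entry under `removed` or `changed` means coordinates in an existing genetic map may no longer line up, which is the conflict raised above.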
| "d": {"length": 1094478, "synonyms": ["chrLGd"]}, | ||
| "f": {"length": 4257874, "synonyms": ["chrLGf"]}, | ||
| "g": {"length": 424765, "synonyms": ["chrLGg"]}, | ||
| "h": {"length": 248369, "synonyms": ["chrLGh"]}, |
wtf, the previous chromosome names (e.g., LGa) are not included in the synonyms. I vote to leave well enough alone there. But - why no mitochondria?
ah I see what happened here: consulting http://rest.ensembl.org/info/assembly/anolis_carolinensis they've removed the mitochondria from the "karyotype" list of chromosomes, but it's still there, at the end of a list of a bajillion tiny contigs:
-
coord_system: primary_assembly
length: 17223
name: MT
| "CM009944.2": {"length": 10670842, "synonyms": ["NC_037651.1"]}, | ||
| "CM009945.2": {"length": 9534514, "synonyms": ["NC_037652.1"]}, | ||
| "CM009946.2": {"length": 7238532, "synonyms": ["NC_037653.1"]}, | ||
| "CM009947.2": {"length": 16343, "synonyms": ["NC_001566.1", "MT"]}, |
ensembl dropping the MT label here
and those synonyms are still there, labeled "RefSeq" or "GenBank":
-
coord_system: primary_assembly
length: 16343
name: CM009947.2
synonyms:
-
dbname: ensembl_internal_synonym
name: NC_001566
-
dbname: GenBank
name: MT
-
dbname: INSDC
name: CM009947.2
-
dbname: ensembl_internal_synonym
name: CM009947
-
dbname: RefSeq
name: NC_001566.1
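Given that each synonym carries a `dbname`, the old labels could be recovered by filtering on it. A minimal sketch; `external_synonyms` is a hypothetical helper, and the choice of `RefSeq` and `GenBank` as the databases to keep is an assumption based on the response above.

```python
def external_synonyms(region, dbnames=("RefSeq", "GenBank")):
    """Return synonym names from the chosen databases, in response order,
    skipping any synonym identical to the region's own name."""
    return [
        synonym["name"]
        for synonym in region.get("synonyms", [])
        if synonym["dbname"] in dbnames and synonym["name"] != region["name"]
    ]
```

Applied to the CM009947.2 entry above, this keeps `MT` and `NC_001566.1` and drops the internal and INSDC entries.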
| "22": {"length": 37823149, "synonyms": []}, | ||
| "X": {"length": 155549662, "synonyms": []}, | ||
| "Y": {"length": 26350515, "synonyms": []}, | ||
| "MT": {"length": 16554, "synonyms": []}, |
looks like they just removed those synonyms, oh well
I've been through this. Notes:
This all seems straightforward... except maybe the last point. I guess we should just convert those species over to being "manually added"? The other thing to deal with here is lifting over genetic maps...
closing this as it's been covered by #1646

okay here is the update to ensembl 110. i had to do a bit of patching along the way, adding or modifying ensembl_ids in the species.py files of a few species.
in addition there is a horrible "duct tape" operation i did to patch the canis_familiaris situation. i'll point that out below.