updated to ensembl release 110; did some patching #1536

Open
wants to merge 2 commits into base: main

andrewkern
Member

okay here is the update to ensembl 110. i had to do a bit of patching along the way, adding or modifying ensembl_ids in the species.py files of a few species.

in addition there is a horrible "duct tape" operation i did to patch the canis_familiaris situation. i'll point that out below.

Comment on lines +161 to +162
if ensembl_id == "canis_lupus_familiaris":
    ensembl_id = "canis_familiaris"
Member Author

this is a horrible, terrible thing that i've done, but i've done it because this single ensembl id change disagrees with all the other ensembl ids....

tmp = ensembl_id.split("_")[:2]
print(tmp, ensembl_id)
Contributor

stray print

Member Author

oops-- debugging

"MT": {"length": 16543, "synonyms": []},
},
}
data = {"assembly_accession": None, "assembly_name": "BROAD S1", "chromosomes": {}}
Contributor

uh-oh, what happened to this one?

Member Author

wow that's not good

@@ -116,7 +116,7 @@

 _species = stdpopsim.Species(
     id="GasAcu",
-    ensembl_id="9307941",
+    ensembl_id="gasterosteus_aculeatus",
Contributor

perhaps this is why the assembly disappeared?

Member Author

no, the original ensembl_id was incompatible with the REST API and the maintenance script...

},
"assembly_accession": "GCA_000313835.1",
"assembly_name": "Hmel1",
"chromosomes": {},
Contributor

And another one that disappeared?

Member

Looks like another manually manipulated one

Contributor

Well this one has no karyotype entry:

--- 
assembly_accession: GCA_000313835.1
assembly_date: 2012-02
assembly_name: Hmel1
coord_system_versions: 
  - Hmel1
default_coord_system_version: Hmel1
genebuild_initial_release_date: 2012-03
genebuild_last_geneset_update: 2012-03
genebuild_method: import
genebuild_start_date: 2012-03-HGC
golden_path: 273786188
karyotype: []

top_level_region: 
  - 
    coord_system: scaffold
    length: 163478
    name: HE667775
    synonyms: 
      - 
        dbname: INSDC
        name: HE667775.1

Contributor

... and, it's just a bunch of scaffolds - nothing as long as in the assembly we've got.

"22": {"length": 35308119, "synonyms": []},
"X": {"length": 151242693, "synonyms": []},
# Mitochondria absent in ponAbe3, so length taken from ponAbe2.
"MT": {"length": 16499, "synonyms": []},
Contributor

looks like we need to stick this in manually again?

Contributor

nope - it's there, just not listed in the karyotype:

  - 
    coord_system: primary_assembly
    length: 16499
    name: MT

"chromosomes": {"1": {"length": 2065074, "synonyms": ["I"]}},
"assembly_accession": "GCA_001017915.1",
"assembly_name": "ASM101791v1",
"chromosomes": {},
Contributor

another empty one?

Contributor

yep - the current build for this has no chromosome-level assembly

@petrelharp
Contributor

The most alarming thing here is those species that now don't have any chromosomes - any idea what's up with that?

@andrewkern
Member Author

looks like the script isn't working quite right... will have to dig in further

Member

@jeromekelleher left a comment

Hmm, watch out for the manually edited files that are getting touched here.

It's pretty messy isn't it?

@@ -1,13 +1,6 @@
-# File created manually from https://www.ncbi.nlm.nih.gov/assembly/GCF_004382195.1
+# File autogenerated from Ensembl REST API. Do not edit.
Member

Looks like this file shouldn't be changed, as it was created manually

Member Author

maybe we should add a species level attribute -- manually_added or something -- that would indicate if the maintenance script should go ahead with the download

Contributor

I'd vote "no", so we are alerted when a species that previously didn't have a build gets one; otherwise just manually back out the change in such cases.

(btw the reason this doesn't work is because ensembl doesn't have a karyotype entry for DroSec, just some ~Mb scale scaffolds)
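
The empty-karyotype case described here could be flagged before the script overwrites an existing genome file. A minimal sketch, assuming the assembly info has already been parsed into a dict with the `karyotype` and `top_level_region` keys shown in the records quoted in this thread; the function name `has_chromosome_level_assembly` is hypothetical and not part of the maintenance code:

```python
def has_chromosome_level_assembly(assembly_info):
    """Return True if the assembly info lists at least one karyotype entry."""
    return len(assembly_info.get("karyotype", [])) > 0


# Abbreviated record mirroring the Hmel1 output quoted earlier in this thread.
hmel1 = {
    "assembly_accession": "GCA_000313835.1",
    "assembly_name": "Hmel1",
    "karyotype": [],
    "top_level_region": [
        {"coord_system": "scaffold", "length": 163478, "name": "HE667775"},
    ],
}

if not has_chromosome_level_assembly(hmel1):
    # Warn instead of silently writing an empty "chromosomes" dict.
    print("warning: Hmel1 has no karyotype entries; keeping existing genome data")
```

Warning rather than skipping keeps the behaviour asked for above: a species that later gains a chromosome-level build would still show up in the diff.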

"38": {"length": 23914537, "synonyms": []},
"X": {"length": 123869142, "synonyms": []},
"MT": {"length": 16727, "synonyms": []},
"1": {"length": 123313939, "synonyms": []},
Contributor

Looks like the chromosome lengths changed between these two builds of CanFam, and the new build doesn't have a mitochondrial genome. We probably want to stick with the old version to avoid conflicts with the recombination map.

Contributor

Hm - unlike in other species, it appears to legit not have MT any more.
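
Changes like these (different lengths, a dropped MT) could be reported automatically when a genome file is regenerated. A rough sketch, assuming both the old and new data are `{name: {"length": ..., "synonyms": [...]}}` mappings like the `chromosomes` dicts in the diffs above; `diff_chromosomes` is a hypothetical helper and the lengths below are made-up toy values, not real CanFam data:

```python
def diff_chromosomes(old, new):
    """Compare two {name: {"length": int, ...}} chromosome mappings.

    Returns (removed, added, changed), where changed maps each shared
    name to an (old_length, new_length) tuple when the lengths differ.
    """
    removed = sorted(set(old) - set(new))
    added = sorted(set(new) - set(old))
    changed = {
        name: (old[name]["length"], new[name]["length"])
        for name in sorted(set(old) & set(new))
        if old[name]["length"] != new[name]["length"]
    }
    return removed, added, changed


# Toy values for illustration only.
old = {"1": {"length": 100, "synonyms": []}, "MT": {"length": 10, "synonyms": []}}
new = {"1": {"length": 120, "synonyms": []}}

removed, added, changed = diff_chromosomes(old, new)
print(removed, added, changed)  # ['MT'] [] {'1': (100, 120)}
```

A non-empty `changed` result would be a hint that existing genetic maps may need lifting over before the new build is adopted.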

"d": {"length": 1094478, "synonyms": ["chrLGd"]},
"f": {"length": 4257874, "synonyms": ["chrLGf"]},
"g": {"length": 424765, "synonyms": ["chrLGg"]},
"h": {"length": 248369, "synonyms": ["chrLGh"]},
Contributor

wtf, the previous chromosome names (e.g., LGa) are not included in the synonyms. I vote to leave well enough alone there. But - why no mitochondria?

Contributor

And gee, the mitochondria shows on ensembl:
[Screenshot from 2023-12-12 09-15-25]

Contributor

ah I see what happened here: consulting http://rest.ensembl.org/info/assembly/anolis_carolinensis they've removed the mitochondria from the "karyotype" list of chromosomes, but it's still there, at the end of a list of a bajillion tiny contigs:

  - 
    coord_system: primary_assembly
    length: 17223
    name: MT

"CM009944.2": {"length": 10670842, "synonyms": ["NC_037651.1"]},
"CM009945.2": {"length": 9534514, "synonyms": ["NC_037652.1"]},
"CM009946.2": {"length": 7238532, "synonyms": ["NC_037653.1"]},
"CM009947.2": {"length": 16343, "synonyms": ["NC_001566.1", "MT"]},
Contributor

ensembl dropping the MT label here

Contributor

@petrelharp Dec 12, 2023

and those synonyms are still there, labeled "RefSeq" or "GenBank":

  - 
    coord_system: primary_assembly
    length: 16343
    name: CM009947.2
    synonyms: 
      - 
        dbname: ensembl_internal_synonym
        name: NC_001566
      - 
        dbname: GenBank
        name: MT
      - 
        dbname: INSDC
        name: CM009947.2
      - 
        dbname: ensembl_internal_synonym
        name: CM009947
      - 
        dbname: RefSeq
        name: NC_001566.1
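
One way the parsing could recover these synonyms is to filter on `dbname`. A minimal sketch, assuming a region dict shaped like the record above; the function name `collect_synonyms` and the default set of `dbname` values are assumptions, not what the maintenance script currently does:

```python
def collect_synonyms(region, dbnames=("INSDC", "RefSeq", "GenBank")):
    """Return unique synonym names for a region, restricted to chosen dbnames.

    Names identical to the region's own name are skipped.
    """
    names = []
    for syn in region.get("synonyms", []):
        name = syn["name"]
        if syn["dbname"] in dbnames and name != region["name"] and name not in names:
            names.append(name)
    return names


# The region quoted above, as a Python dict.
region = {
    "coord_system": "primary_assembly",
    "length": 16343,
    "name": "CM009947.2",
    "synonyms": [
        {"dbname": "ensembl_internal_synonym", "name": "NC_001566"},
        {"dbname": "GenBank", "name": "MT"},
        {"dbname": "INSDC", "name": "CM009947.2"},
        {"dbname": "ensembl_internal_synonym", "name": "CM009947"},
        {"dbname": "RefSeq", "name": "NC_001566.1"},
    ],
}

print(collect_synonyms(region))  # ['MT', 'NC_001566.1']
```

Synonyms come back in the order they appear in the record, so a sort (or an explicit preferred order) may be wanted to keep regenerated files stable.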

"22": {"length": 37823149, "synonyms": []},
"X": {"length": 155549662, "synonyms": []},
"Y": {"length": 26350515, "synonyms": []},
"MT": {"length": 16554, "synonyms": []},
Contributor

looks like they just removed those synonyms, oh well

@petrelharp
Contributor

I've been through this. Notes:

  1. I've added back in (just via git checkout main <file>) the manually added genome info.
  2. Some genomes now have MT missing, even though its length is still listed, because MT has been removed from their "karyotype". This could cause a problem if MT is listed in the genetic map. We could add some code to the ensembl maintenance parsing that checks whether there are any scaffolds with name == "MT" and, if so, adds them even if they are not in the karyotype.
  3. We don't have some synonyms we had previously, but we could fix up the maintenance code to get those by including different dbname values.
  4. Some genomes (two of them!) seem to have much worse assemblies on ensembl now: previously they were chromosome-level; now they aren't. I don't know what's up with that.
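
The check proposed in point 2 might look roughly like the following. This is only a sketch, assuming the parsed assembly info has `karyotype` and `top_level_region` keys as in the records quoted in this thread; `chromosomes_with_mt` is a hypothetical name, the "g", "h" and MT lengths are taken from the Anolis excerpts above, and the scaffold entry is made up:

```python
def chromosomes_with_mt(assembly_info, extra_names=("MT",)):
    """Build a chromosomes dict from the karyotype, plus any extra named
    top-level regions (by default "MT") that the karyotype omits."""
    lengths = {
        region["name"]: region["length"]
        for region in assembly_info.get("top_level_region", [])
    }
    names = list(assembly_info.get("karyotype", []))
    for extra in extra_names:
        # Only add the extra region if the assembly actually contains it.
        if extra in lengths and extra not in names:
            names.append(extra)
    return {
        name: {"length": lengths[name], "synonyms": []}
        for name in names
        if name in lengths
    }


assembly_info = {
    "karyotype": ["g", "h"],
    "top_level_region": [
        {"coord_system": "primary_assembly", "length": 424765, "name": "g"},
        {"coord_system": "primary_assembly", "length": 248369, "name": "h"},
        # Hypothetical scaffold, standing in for the many tiny contigs.
        {"coord_system": "primary_assembly", "length": 1234, "name": "scaffold_x"},
        {"coord_system": "primary_assembly", "length": 17223, "name": "MT"},
    ],
}

chromosomes = chromosomes_with_mt(assembly_info)
print(list(chromosomes))  # ['g', 'h', 'MT']
```

Note that with an empty karyotype this would return MT alone, so it probably wants to be combined with a check that the assembly is chromosome-level at all.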

This all seems straightforward... except maybe the last point. I guess we should just convert those species over to being "manually added"?

The other thing to deal with here is lifting over genetic maps...
