Enable SmokingStatus via cTAKES #273
Comments
@mikix Ideally, you can more simply turn on the pipeline using this "piper" file with the right settings.
OK, initial thoughts: Seems like cTAKES's smoking status support is not a turnkey "flip this switch" kind of thing, but a bunch of pieces that you put together.
text2phenotype
And the text2phenotype project has done that for us. They have:
I don't have a very good understanding of what
Update Oct 2023: I tried to build their docker image. After some tweaks to get things going like switching to the right jdk version (
Integrating into Cumulus
Our current cTAKES is set up as a symptoms extractor by default. We could use a similar override method to replace its pipeline, like we do for the symptoms dictionary. But we also would need to inject a bunch of built Java in there. Which we could also hook up. But... maybe we just build the
Go or Java
Update Oct 2023: The Go code is designed for a very
cNLP Transformers
Tim also suggested that we could just build a new cNLP BERT model for smoking status, and skip cTAKES. We suspect that would have better performance and Tim doesn't think it's hard. That sounds tempting...
That is fine, though the space might get big at some point. We're edging up on the 'we should use a proper container orchestration platform' territory - we could spin up resources as needed if we're clever about it with swarm/k8s.
@tmills believes BERT would beat SOTA for any of the existing SmokingStatus pipelines, which makes sense, because the old SmokingStatus pipelines are very old, like 10-15 years old. Even the Go version is a literally exact translation of the algorithm, just in Go, which is much faster.
So the tradeoff is faster Go vs. more-accurate BERT?
The Go path is something like "build the docker, throw it up in our docker hub, and reference it from Cumulus's compose file."
The BERT path is "go train a cNLP BERT model like we did for negation, and do the same docker hub dance."
In either case, there'd be some integration work in ETL land to call the right service, but that would be a similar amount of effort for both.
@comorbidity is there value in doing both? Like part of looking at smoking status was comparing approaches, right? Or do you think BERT is just going to be so obviously better, and we don't super care about the performance?
@mikix Great question and discussion. The BERT-based model should be expected to run almost identically to the current "negation" pipeline. What was the speed of cNLP negation compared to cTAKES for the symptoms dictionary?
I do not remember numbers, but I recall cTAKES being faster. I could get numbers if that would guide our discussion.
Another advantage of the BERT approach is reusability: there is nothing special about this smoking model. From the BERT perspective it's just words and a label: "smoker", "non-smoker", etc. Therefore, time invested in a BERT model (using cNLP) would pay off better in the long term, because we could reuse the same approach for any number of tasks that we need a model for.
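To make the reusability point concrete, here is a minimal sketch in plain Python. It is not cNLP's or cTAKES's actual API: the `Example` class, the helper names, and the sample sentences are all hypothetical. It only illustrates that, seen as text classification, a smoking status task and a negation task differ in their label set and nothing else, so the same data preparation could serve both.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Example:
    """One training example: some text plus a string label (hypothetical type)."""
    text: str
    label: str


def build_label_map(examples: List[Example]) -> Dict[str, int]:
    """Assign an integer id to each distinct label, in sorted order."""
    labels = sorted({ex.label for ex in examples})
    return {label: idx for idx, label in enumerate(labels)}


def encode(examples: List[Example], label_map: Dict[str, int]) -> List[Tuple[str, int]]:
    """Turn (text, label) examples into (text, label id) pairs for a classifier."""
    return [(ex.text, label_map[ex.label]) for ex in examples]


# Made-up sentences, for illustration only.
smoking = [
    Example("Patient smokes one pack per day.", "smoker"),
    Example("Denies any tobacco use.", "non-smoker"),
]
negation = [
    Example("No fever reported.", "negated"),
    Example("Patient has a cough.", "affirmed"),
]

# The same two functions handle both tasks; only the labels change.
smoking_map = build_label_map(smoking)
negation_map = build_label_map(negation)
smoking_encoded = encode(smoking, smoking_map)
negation_encoded = encode(negation, negation_map)
```

Whatever model consumes these pairs would not need to know which task it is being trained for, which is the sense in which the effort carries over to other tasks.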
Minor note: I updated my comment above with the results of some investigations. I'm putting this down for now to focus on other priorities, but may come back to this.
Official cTAKES page
https://cwiki.apache.org/confluence/display/CTAKES/cTAKES+4.0+-+Smoking+status#cTAKES4.0Smokingstatus-OverviewofSmokingstatus
This reference implementation may be helpful