Enable SmokingStatus via cTAKES #273

Open
comorbidity opened this issue Sep 1, 2023 · 11 comments

Comments

@comorbidity
Contributor Author

@mikix Ideally, you can more simply turn on the pipeline by using this "piper" file with the right settings:
https://github.com/text2phenotype/ctakes/tree/main/src/main/resources/com/text2phenotype/ctakes/resources/smoking_status
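For context, here is a minimal sketch of what invoking such a piper file might look like, assuming the standard cTAKES PiperFileRunner entry point; the SmokingStatus.piper filename and the paths below are guesses for illustration, not taken from their repo:

```bash
# Hypothetical invocation -- piper filename and paths are assumptions, not confirmed from the repo.
# Assumes a cTAKES install whose classpath includes the text2phenotype resources.
java -cp "ctakes/lib/*:ctakes/resources" \
  org.apache.ctakes.core.pipeline.PiperFileRunner \
  -p com/text2phenotype/ctakes/resources/smoking_status/SmokingStatus.piper \
  -i /notes/in \
  -o /notes/out
```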


@mikix
Contributor

mikix commented Sep 1, 2023

OK initial thoughts:

Seems like cTAKES's smoking status support is not a turnkey "flip this switch" kind of thing, but a bunch of pieces that you put together.

text2phenotype

And the text2phenotype project has done that assembly for us.

I don't have a very good understanding of what text2phenotype is doing here, but it seems raw/pure cTAKES is not easy to pull together ourselves, and we might want to leverage that work.

Update Oct 2023: I tried to build their docker image. After some tweaks to get things going, like switching to the right JDK version (9-jdk8-corretto) and changing how resources get included (commenting out INCLUDE_RES=true and copying in some files from the source tree), I was finally able to get something that seemed to run without errors. But I could not figure out how to query it via the REST API. I know that sounds silly, but no endpoints seemed to be exposed at all. My next step was to either try to get Tomcat to print its registered endpoints and debug it at the Tomcat level, OR take the smoking code from text2phenotype and put it into an upstream cTAKES checkout, which I do have working for Cumulus. But both involve potentially-deep Java coding and I'm focusing on other things right now.

Integrating into Cumulus

Our current cTAKES is set up as a symptoms extractor by default. We could use a similar override method to replace its pipeline, like we do for the symptoms dictionary. But we would also need to inject a bunch of built Java in there, which we could also hook up.

But... maybe we just build the text2phenotype docker image, throw it up and use that - meaning we now have two cTAKES images we are building, but for different use cases. Maybe wise, maybe not.

The other interesting part of that is how Cumulus manages multiple dependent services. Ideally it would be able to spin them up and down as needed. But since docker compose doesn't really work like that, we could outsource that to the user: switch our current paradigm from one global etl-support profile to study-specific profiles, and the user would start up whatever they know they need. So docker compose up --profile covid_symptom, for example. (Update Oct 2023: this was done elsewhere when we integrated the termexists cNLP transformer.)
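To make the profile idea concrete, here is a rough sketch of what it could look like from the user's side; the profile and service names are illustrative, not our actual compose config:

```bash
# Illustrative only -- profile/service names are made up for this sketch.
# Each NLP service in compose.yaml would carry a `profiles:` entry (e.g. [covid_symptom]
# or [smoking_status]), so `up` only starts what the chosen study needs.
docker compose --profile covid_symptom up -d    # cTAKES + symptoms dictionary services
docker compose --profile smoking_status up -d   # hypothetical smoking-status service
docker compose --profile covid_symptom down     # tear it back down when the study is done
```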

Go or Java

text2phenotype also has a Go version that might be worth exploring, as it would be faster and is apparently battle-tested for their use cases.

Update Oct 2023: The Go code is designed around a very text2phenotype-specific workflow. It reads notes in via RabbitMQ and drops the results into an S3 bucket, whereas Cumulus expects to talk to NLP over a REST API. So that would take some work to get going.

cNLP Transformers

Tim also suggested that we could just build a new cNLP BERT model for smoking status, and skip cTAKES. We suspect that would have better performance and Tim doesn't think it's hard. That sounds tempting...

@dogversioning
Contributor

> But since docker compose doesn't really work like that, we could outsource that to the user and switch our current paradigm from one global etl-support profile to study-specific profiles and the user would start up what they know they need. So docker compose up --profile covid_symptom for example.

That is fine, though the space might get big at some point. We're edging up on the 'we should use a proper container orchestration platform' territory - we could spin up resources as needed if we're clever about it with swarm/k8s.

@comorbidity
Contributor Author

@tmills believes BERT would beat the SOTA of any of the existing SmokingStatus pipelines, which makes sense, because the old SmokingStatus pipelines are very old... like 10-15 years old. Even the Go version is a literally-exact translation of the algorithm, just in Go, which is much faster.

@mikix
Contributor

mikix commented Sep 1, 2023

So the tradeoff is faster Go vs more-accurate BERT?

The Go path is something like "build the docker image, throw it up on our Docker Hub, and reference it from Cumulus's compose file".

The BERT path is "go train a cNLP BERT model like we did for negation, and do the same Docker Hub dance".

In either case, there'd be some integration work in ETL land to call the right service, but that would be a similar amount of effort for both.
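For a sense of what that ETL-side integration might look like, here is a rough sketch assuming the smoking-status model gets served the same way as the existing cnlp-transformers REST APIs. The /smoking_status/process path, port, and payload shape are assumptions modeled on that style; no such endpoint exists today.

```bash
# Assumption-laden sketch: endpoint path, port, and payload mirror the style of the
# existing cnlp-transformers REST APIs; a real smoking-status endpoint doesn't exist yet.
curl -s -X POST http://localhost:8000/smoking_status/process \
  -H "Content-Type: application/json" \
  -d '{"doc_text": "Patient reports quitting smoking 10 years ago."}'
```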

@mikix
Contributor

mikix commented Sep 1, 2023

@comorbidity is there value in doing both? Like, part of looking at smoking status was comparing approaches, right? Or do you think BERT is just going to be so obviously better that we don't super care about the performance?

@comorbidity
Contributor Author

@mikix Great question and discussion. The BERT-based model should be expected to run almost identically to the current "negation" pipeline. What was the speed of cNLP negation compared to cTAKES for the symptoms dictionary?

@mikix
Contributor

mikix commented Sep 1, 2023

> What was the speed of cNLP negation compared to cTAKES for the symptoms dictionary?

I do not remember numbers, but I recall cTAKES being faster. I could get numbers if that would guide our discussion.

@comorbidity
Contributor Author

Another advantage of the BERT approach is reusability: there is nothing special about this Smoking model -- from the BERT perspective it's just words and labels like "smoker", "non-smoker", etc.

Therefore, long-term, time invested in a BERT model (using cNLP) would be better spent, because we could reuse it for any number of tasks that we need a model for.

@mikix
Contributor

mikix commented Oct 23, 2023

Minor note: I updated my comment above with the results of some investigations. I'm putting this down for now to focus on other priorities, but may come back to this.

@mikix mikix removed their assignment Jul 11, 2024