Enable SmokingStatus via cTAKES #273

Open
comorbidity opened this issue Sep 1, 2023 · 11 comments

Comments

@comorbidity
Contributor Author

@mikix Ideally, you can more simply turn on the pipeline by using this "piper" file with the right settings:
https://github.com/text2phenotype/ctakes/tree/main/src/main/resources/com/text2phenotype/ctakes/resources/smoking_status
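For context, here is a minimal sketch of what invoking such a piper file might look like, assuming the standard cTAKES PiperFileRunner entry point; the SmokingStatus.piper filename and the paths below are guesses for illustration, not taken from their repo:

```bash
# Hypothetical invocation -- piper filename and paths are assumptions, not confirmed from the repo.
# Assumes a cTAKES install whose classpath includes the text2phenotype resources.
java -cp "ctakes/lib/*:ctakes/resources" \
  org.apache.ctakes.core.pipeline.PiperFileRunner \
  -p com/text2phenotype/ctakes/resources/smoking_status/SmokingStatus.piper \
  -i /notes/in \
  -o /notes/out
```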


@mikix
Contributor

mikix commented Sep 1, 2023

OK initial thoughts:

Seems like cTAKES's smoking status support is not a turnkey "flip this switch" kind of thing, but a bunch of pieces that you put together.

text2phenotype

And the text2phenotype project has done that assembly for us.

I don't have a very good understanding of what text2phenotype is doing here, but it seems raw/pure cTAKES is not easy to pull together ourselves, and we might want to leverage that work.

Update Oct 2023: I tried to build their docker image. After some tweaks to get things going, like switching to the right JDK version (9-jdk8-corretto) and changing how resources get included (commenting out INCLUDE_RES=true and copying in some files from the source tree), I was finally able to get something that seemed to run without errors. But I could not figure out how to query it via the REST API. I know that sounds silly, but no endpoints seemed to be exposed at all. My next step was to either try to get Tomcat to print its registered endpoints and debug it at the Tomcat level, OR take the smoking code from text2phenotype and put it into an upstream cTAKES checkout, which I do have working for Cumulus. But both involve potentially-deep Java coding and I'm focusing on other things right now.

Integrating into Cumulus

Our current cTAKES is set up as a symptoms extractor by default. We could use a similar override method to replace its pipeline, like we do for the symptoms dictionary. But we would also need to inject a bunch of built Java in there, which we could also hook up.

But... maybe we just build the text2phenotype docker image, throw it up and use that - meaning we now have two cTAKES images we are building, but for different use cases. Maybe wise, maybe not.

The other interesting part of that is how Cumulus manages multiple dependent services. Ideally it would be able to spin them up and down as needed. But since docker compose doesn't really work like that, we could outsource that to the user: switch our current paradigm from one global etl-support profile to study-specific profiles, and the user would start up whatever they know they need. So docker compose up --profile covid_symptom, for example. (Update Oct 2023: this was done elsewhere when we integrated the termexists cNLP transformer.)
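To make the profile idea concrete, here is a rough sketch of what it could look like from the user's side; the profile and service names are illustrative, not our actual compose config:

```bash
# Illustrative only -- profile/service names are made up for this sketch.
# Each NLP service in compose.yaml would carry a `profiles:` entry (e.g. [covid_symptom]
# or [smoking_status]), so `up` only starts what the chosen study needs.
docker compose --profile covid_symptom up -d    # cTAKES + symptoms dictionary services
docker compose --profile smoking_status up -d   # hypothetical smoking-status service
docker compose --profile covid_symptom down     # tear it back down when the study is done
```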

Go or Java

text2phenotype also has a Go version that might be worth exploring, as it would be faster and is apparently battle-tested for their use cases.

Update Oct 2023: The Go code is designed around a very text2phenotype-specific workflow. It reads notes in via RabbitMQ and drops the results into an S3 bucket, whereas Cumulus expects to talk to NLP over a REST API. So that would take some work to get going.

cNLP Transformers

Tim also suggested that we could just build a new cNLP BERT model for smoking status, and skip cTAKES. We suspect that would have better performance and Tim doesn't think it's hard. That sounds tempting...

@dogversioning
Contributor

> But since docker compose doesn't really work like that, we could outsource that to the user and switch our current paradigm from one global etl-support profile to study-specific profiles and the user would start up what they know they need. So docker compose up --profile covid_symptom for example.

That is fine, though the space might get big at some point. We're edging up on the 'we should use a proper container orchestration platform' territory - we could spin up resources as needed if we're clever about it with swarm/k8s.

@comorbidity
Contributor Author

@tmills believes BERT would beat the SOTA of any of the existing SmokingStatus pipelines, which makes sense, because the old SmokingStatus pipelines are very old... like 10-15 years old. Even the Go version is a literally-exact translation of the algorithm, just in Go, which is much faster.

@mikix
Contributor

mikix commented Sep 1, 2023

So the tradeoff is faster Go vs more-accurate BERT?

The Go path is something like "build the docker image, throw it up on our Docker Hub, and reference it from Cumulus's compose file".

The BERT path is "go train a cNLP BERT model like we did for negation, and do the same Docker Hub dance".

In either case, there'd be some integration work in ETL land to call the right service, but that would be a similar amount of effort for both.
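For a sense of what that ETL-side integration might look like, here is a rough sketch assuming the smoking-status model gets served the same way as the existing cnlp-transformers REST APIs. The /smoking_status/process path, port, and payload shape are assumptions modeled on that style; no such endpoint exists today.

```bash
# Assumption-laden sketch: endpoint path, port, and payload mirror the style of the
# existing cnlp-transformers REST APIs; a real smoking-status endpoint doesn't exist yet.
curl -s -X POST http://localhost:8000/smoking_status/process \
  -H "Content-Type: application/json" \
  -d '{"doc_text": "Patient reports quitting smoking 10 years ago."}'
```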

@mikix
Contributor

mikix commented Sep 1, 2023

@comorbidity is there value in doing both? Like, part of looking at smoking status was comparing approaches, right? Or do you think BERT is just going to be so obviously better that we don't super care about the performance?

@comorbidity
Contributor Author

@mikix Great question and discussion. The BERT-based model should be expected to run almost identically to the current "negation" pipeline. What was the speed of cNLP negation compared to cTAKES for the symptoms dictionary?

@mikix
Contributor

mikix commented Sep 1, 2023

> What was the speed of cNLP negation compared to cTAKES for the symptoms dictionary?

I do not remember numbers, but I recall cTAKES being faster. I could get numbers if that would guide our discussion.

@comorbidity
Contributor Author

Another advantage of the BERT approach is reusability: there is nothing special about this Smoking model -- from the BERT perspective it's just words and labels like "smoker", "non-smoker", etc.

Therefore, long-term, time invested in a BERT model (using cNLP) would be better spent, because we could reuse it for any number of tasks that we need a model for.

@mikix
Contributor

mikix commented Oct 23, 2023

Minor note: I updated my comment above with the results of some investigations. I'm putting this down for now to focus on other priorities, but may come back to this.

@mikix mikix removed their assignment Jul 11, 2024