PHI Detection Eval Framework#157

Open
btbeal-db wants to merge 11 commits into main from brennan.beal/phi-detection-metrics

Conversation

@btbeal-db (Collaborator)

Summary

A common use case for pixels is evaluating whether a given DICOM's metadata contains PHI. This PR creates two notebooks in ./notebooks/DE-ID/ to:

(1) Create golden data for PHI-detection evaluation for a given model
(2) Assess various LLMs' ability to classify JSON strings as PHI via ai_query()

@dmoore247 (Collaborator) left a comment


Good with changes. Thanks Brennan!

masked_data = spark.read.table(dbutils.widgets.get("masked_dicom_data")).drop("pixel_hash")
joined_df = masked_data.join(dicom_data, "path")
tag_df = spark.read.table(dbutils.widgets.get("dicom_tag_mapping_table"))
tag_mapping = dict(tag_df.select("Tag", "Keyword").rdd.map(tuple).collect())
Collaborator

Does this work on serverless compute?
Let's not use any Spark 2.x APIs and remain fully UC-compatible.

Collaborator Author

I've refactored quite a bit to accommodate serverless -- thanks for the callout!
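For readers hitting the same issue: one serverless/UC-friendly way to replace the flagged `dict(tag_df.select("Tag", "Keyword").rdd.map(tuple).collect())` is to collect DataFrame Rows on the driver and build the dict with a comprehension, avoiding the RDD API entirely. A minimal sketch (the stand-in rows below simulate what `tag_df.select("Tag", "Keyword").collect()` would return on Databricks):

```python
# Stand-ins for the Rows returned by tag_df.select("Tag", "Keyword").collect();
# pyspark Row objects support the same r["col"] access used here.
rows = [
    {"Tag": "00100010", "Keyword": "PatientName"},
    {"Tag": "00100020", "Keyword": "PatientID"},
]

# Build the {Tag: Keyword} mapping without touching .rdd or .sparkContext
tag_mapping = {r["Tag"]: r["Keyword"] for r in rows}
```

The comprehension runs entirely on the driver, so it stays within the DataFrame-only API surface that UC shared and serverless compute allow.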

# MAGIC 2. Gather corresponding masked metadata from the `midi_b_val` dataset
# MAGIC 3. Keep only data that the original and masked jsons have in common (i.e., the non-phi)
# MAGIC 5. Sample this data into the `phi_detection_golden` table
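Step 3 above amounts to a key/value intersection between the original and masked metadata: tags whose values survive masking unchanged are the non-PHI candidates. A minimal pure-Python sketch (field values hypothetical):

```python
# Tags whose values differ between original and masked metadata were presumably
# masked as PHI; tags with identical values are the non-PHI candidates.
original = {"PatientName": "DOE^JANE", "PatientID": "12345", "Modality": "MG"}
masked = {"PatientName": "ANON", "PatientID": "000", "Modality": "MG"}

non_phi = {k: v for k, v in original.items() if masked.get(k) == v}
```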

Collaborator

Please include a ## Requirements section:
- Document Databricks compute requirements (e.g., works with serverless, jobs only, etc.)
- Document CPU/GPU needs
- Document significant libraries or services required


dbutils.widgets.text(
name="dicom_tag_mapping_table",
defaultValue="hls_radiology.ddsm.dicom_tags",
Collaborator

Let's include code to enable customers to replicate the dicom_tags table.
Let's explain what the dicom_tags table contains and why it's useful.

Collaborator Author

As it turns out, we don't need this table -- we can just create a small function with pydicom, which I've done and added!


dbutils.widgets.text(
name="masked_dicom_data",
defaultValue="hls_radiology.tcia.midi_b_val",
Collaborator

Let's explain this table too.

StructField("errors", ArrayType(MapType(StringType(), StringType())), True),
])

tag_mapping_bc = spark.sparkContext.broadcast(dicom_tag_mapping)
Collaborator

Let's make sure this works with UC-enabled compute and serverless compute.
.sparkContext and .rdd are red flags for UC compatibility.

Collaborator Author

Yea addressed -- thanks!
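For context on the fix pattern: `spark.sparkContext.broadcast(...)` is unavailable on UC shared and serverless compute, but an explicit broadcast is rarely needed for a small mapping -- a plain Python object captured in the UDF's closure is serialized and shipped to executors automatically. A minimal sketch of the closure-capture pattern (names hypothetical, shown without Spark so the mechanism is visible):

```python
# Small driver-side mapping; no spark.sparkContext.broadcast(...) required.
tag_mapping = {"00100010": "PatientName", "00100020": "PatientID"}


def make_compare_udf(mapping):
    # `mapping` is captured by the returned function's closure; when the
    # function is registered as a Spark UDF, Spark pickles the closure
    # (mapping included) and distributes it to the executors.
    def compare(tag):
        return mapping.get(tag, "UNKNOWN")
    return compare


compare = make_compare_udf(tag_mapping)
```

For very large lookup tables a join is usually the better UC-compatible choice, but for a tag dictionary of this size closure capture is the simplest drop-in replacement.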


display(golden_data.limit(5))

# COMMAND ----------
Collaborator

Please add a description explaining to the user what this is for.

.saveAsTable(table_name)
)

# COMMAND ----------
Collaborator

Probably needs a descriptive comment explaining what is going on and why.

.withColumn("diff_tags", F.col("comparison_struct.diff_tags"))
)

# COMMAND ----------
Collaborator

Probably needs a descriptive comment explaining what is going on and why.

return compare_dicom_tags_udf


# COMMAND ----------
Collaborator

Probably needs a descriptive comment explaining what is going on and why.

@dmoore247 (Collaborator)

@btbeal-db Upon re-run, the linting job succeeded.

@dmoore247 (Collaborator) left a comment

See comments.

@btbeal-db (Collaborator, author) left a comment

Updated!



@dmoore247 (Collaborator)

@btbeal-db can you refresh/update the branch?


2 participants