Conversation
dmoore247
left a comment
Good with changes. Thanks Brennan!
```python
masked_data = spark.read.table(dbutils.widgets.get("masked_dicom_data")).drop("pixel_hash")
joined_df = masked_data.join(dicom_data, "path")
tag_df = spark.read.table(dbutils.widgets.get("dicom_tag_mapping_table"))
tag_mapping = dict(tag_df.select("Tag", "Keyword").rdd.map(tuple).collect())
```
Does this work on serverless compute?
Let's not use any Spark 2.x APIs and remain fully UC compatible.
I've refactored quite a bit to accommodate serverless -- thanks for the callout!
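For reference, the RDD-free pattern might look like the following minimal sketch. It assumes `tag_df` with `Tag`/`Keyword` columns from the snippet above; `rows_to_mapping` is a hypothetical helper name, and plain dicts stand in for Spark `Row` objects (both support `r["col"]` access):

```python
# Hypothetical sketch: build the small driver-side lookup dict with
# DataFrame.collect() instead of RDD APIs, which stay UC/serverless-safe.
def rows_to_mapping(rows):
    """Build a {Tag: Keyword} dict from Row-like mappings."""
    return {r["Tag"]: r["Keyword"] for r in rows}

# On Databricks this would be applied to the collected DataFrame, e.g.:
# tag_mapping = rows_to_mapping(tag_df.select("Tag", "Keyword").collect())

# Pure-Python demonstration with dicts standing in for Rows:
sample_rows = [
    {"Tag": "(0010,0010)", "Keyword": "PatientName"},
    {"Tag": "(0010,0030)", "Keyword": "PatientBirthDate"},
]
print(rows_to_mapping(sample_rows))
```

Since the lookup table is small, a driver-side `collect()` into a plain dict avoids `.rdd` entirely.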
```python
# MAGIC 2. Gather corresponding masked metadata from the `midi_b_val` dataset
# MAGIC 3. Keep only data that the original and masked jsons have in common (i.e., the non-phi)
# MAGIC 5. Sample this data into the `phi_detection_golden` table
```
Please include a ## Requirements section:
- Document Databricks compute requirements (e.g., works with serverless, jobs-only, etc.)
- Document CPU/GPU needs
- Document significant libraries or services required
```python
dbutils.widgets.text(
    name="dicom_tag_mapping_table",
    defaultValue="hls_radiology.ddsm.dicom_tags",
```
Let's include code to enable the customer to replicate the dicom_tags table.
Let's explain what the dicom_tags table contains and why it's useful.
As it turns out, we don't need this table, can just create a small function with pydicom, which I've done and added!
```python
dbutils.widgets.text(
    name="masked_dicom_data",
    defaultValue="hls_radiology.tcia.midi_b_val",
```
Let's explain this table too.
```python
StructField("errors", ArrayType(MapType(StringType(), StringType())), True),
])

tag_mapping_bc = spark.sparkContext.broadcast(dicom_tag_mapping)
```
Let's make sure this works with UC-enabled compute and serverless compute.
.sparkContext and .rdd are red flags for UC compatibility.
Yea addressed -- thanks!
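One common fix here: on serverless and UC shared compute, `spark.sparkContext` is unavailable, but a small lookup dict can simply be captured in the UDF's closure and Spark will serialize it along with the function. A minimal sketch (the `make_tag_lookup` name is hypothetical, not from the PR):

```python
# Hypothetical sketch: instead of spark.sparkContext.broadcast(...), capture
# the small mapping in the UDF's closure. Spark ships the dict to executors
# with the serialized function, which is fine for small lookup tables and
# avoids SparkContext entirely.
def make_tag_lookup(mapping):
    def lookup(tag):
        return mapping.get(tag, "Unknown")
    return lookup

# On Databricks (assumed names), this would be wrapped as a UDF, e.g.:
# from pyspark.sql import functions as F
# from pyspark.sql.types import StringType
# tag_lookup_udf = F.udf(make_tag_lookup(dicom_tag_mapping), StringType())

lookup = make_tag_lookup({"(0010,0010)": "PatientName"})
print(lookup("(0010,0010)"))  # PatientName
```

Explicit broadcasts only pay off for large variables; for a tag dictionary, closure capture is simpler and serverless-compatible.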
```python
display(golden_data.limit(5))

# COMMAND ----------
```
Please add a description explaining to the user what this is for.
```python
.saveAsTable(table_name)
)

# COMMAND ----------
```
Probably needs a descriptive comment explaining what is going on and why.
```python
.withColumn("diff_tags", F.col("comparison_struct.diff_tags"))
)

# COMMAND ----------
```
Probably needs a descriptive comment explaining what is going on and why.
```python
return compare_dicom_tags_udf

# COMMAND ----------
```
Probably needs a descriptive comment explaining what is going on and why.
@btbeal-db Upon re-run, the linting job succeeded.
@btbeal-db can you refresh/update the branch?
Summary
A common use case for pixels is to evaluate whether or not a given DICOM's metadata contains PHI. This PR creates two notebooks in ./notebooks/DE-ID/ in order to:
(1) Create golden data for PHI-detection evaluation for a given model
(2) Assess various LLMs' ability to classify JSON strings as PHI via ai_query()