Conversation
dmoore247
left a comment
Good with changes. Thanks Brennan!
```python
masked_data = spark.read.table(dbutils.widgets.get("masked_dicom_data")).drop("pixel_hash")
joined_df = masked_data.join(dicom_data, "path")
tag_df = spark.read.table(dbutils.widgets.get("dicom_tag_mapping_table"))
tag_mapping = dict(tag_df.select("Tag", "Keyword").rdd.map(tuple).collect())
```
Does this work on serverless compute?
Let's not use any Spark 2.x APIs and remain fully UC compatible.
I've refactored quite a bit to accommodate serverless -- thanks for the callout!
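For reference, the RDD-free pattern might look like the following minimal sketch. It assumes `tag_df` with `Tag`/`Keyword` columns from the snippet above; `rows_to_mapping` is a hypothetical helper name, and plain dicts stand in for Spark `Row` objects (both support `r["col"]` access):

```python
# Hypothetical sketch: build the small driver-side lookup dict with
# DataFrame.collect() instead of RDD APIs, which stay UC/serverless-safe.
def rows_to_mapping(rows):
    """Build a {Tag: Keyword} dict from Row-like mappings."""
    return {r["Tag"]: r["Keyword"] for r in rows}

# On Databricks this would be applied to the collected DataFrame, e.g.:
# tag_mapping = rows_to_mapping(tag_df.select("Tag", "Keyword").collect())

# Pure-Python demonstration with dicts standing in for Rows:
sample_rows = [
    {"Tag": "(0010,0010)", "Keyword": "PatientName"},
    {"Tag": "(0010,0030)", "Keyword": "PatientBirthDate"},
]
print(rows_to_mapping(sample_rows))
```

Since the lookup table is small, a driver-side `collect()` into a plain dict avoids `.rdd` entirely.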
```python
# MAGIC 2. Gather corresponding masked metadata from the `midi_b_val` dataset
# MAGIC 3. Keep only data that the original and masked jsons have in common (i.e., the non-phi)
# MAGIC 5. Sample this data into the `phi_detection_golden` table
```
Please include a ## Requirements section:
- Document Databricks compute requirements (e.g., works with serverless, jobs-only, etc.)
- Document CPU/GPU needs
- Document significant libraries or services required
```python
dbutils.widgets.text(
    name="dicom_tag_mapping_table",
    defaultValue="hls_radiology.ddsm.dicom_tags",
```
Let's include code to enable the customer to replicate the dicom_tags table.
Let's explain what the dicom_tags table contains and why it's useful.
As it turns out, we don't need this table, can just create a small function with pydicom, which I've done and added!
```python
dbutils.widgets.text(
    name="masked_dicom_data",
    defaultValue="hls_radiology.tcia.midi_b_val",
```
Let's explain this table too.
```python
StructField("errors", ArrayType(MapType(StringType(), StringType())), True),
])

tag_mapping_bc = spark.sparkContext.broadcast(dicom_tag_mapping)
```
Let's make sure this works with UC-enabled compute and serverless compute.
.sparkContext and .rdd are red flags for UC compatibility.
Yea addressed -- thanks!
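One common fix here: on serverless and UC shared compute, `spark.sparkContext` is unavailable, but a small lookup dict can simply be captured in the UDF's closure and Spark will serialize it along with the function. A minimal sketch (the `make_tag_lookup` name is hypothetical, not from the PR):

```python
# Hypothetical sketch: instead of spark.sparkContext.broadcast(...), capture
# the small mapping in the UDF's closure. Spark ships the dict to executors
# with the serialized function, which is fine for small lookup tables and
# avoids SparkContext entirely.
def make_tag_lookup(mapping):
    def lookup(tag):
        return mapping.get(tag, "Unknown")
    return lookup

# On Databricks (assumed names), this would be wrapped as a UDF, e.g.:
# from pyspark.sql import functions as F
# from pyspark.sql.types import StringType
# tag_lookup_udf = F.udf(make_tag_lookup(dicom_tag_mapping), StringType())

lookup = make_tag_lookup({"(0010,0010)": "PatientName"})
print(lookup("(0010,0010)"))  # PatientName
```

Explicit broadcasts only pay off for large variables; for a tag dictionary, closure capture is simpler and serverless-compatible.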
```python
display(golden_data.limit(5))

# COMMAND ----------
```
Please add a description explaining to the user what this is for.
```python
.saveAsTable(table_name)
)

# COMMAND ----------
```
Probably needs a descriptive comment explaining what is going on and why.
```python
.withColumn("diff_tags", F.col("comparison_struct.diff_tags"))
)

# COMMAND ----------
```
Probably needs a descriptive comment explaining what is going on and why.
```python
return compare_dicom_tags_udf

# COMMAND ----------
```
Probably needs a descriptive comment explaining what is going on and why.
@btbeal-db Upon re-run, the linting job succeeded.
@btbeal-db can you refresh/update the branch?
Summary
A common use case for pixels is to evaluate whether or not a given DICOM's metadata contains PHI. This PR creates two notebooks in ./notebooks/DE-ID/ in order to:
(1) Create golden data for PHI-detection evaluation for a given model
(2) Assess various LLMs' ability to classify JSON strings as PHI via ai_query()