Normalization step failure-collection/review ideas #49

@colleenXu

Description

[EDITED after discussion with Evan 2025-09-18]

Background

I've previously brought up the complications of the normalization step from my POV, based on my experience doing it in my custom notebooks. I collect data on failures (broken down by failure type) and review/analyze those failures along with other aspects of normalization. I've been asked to share some details on what I'm doing, so that's what this issue is for.

While doing all this is more work, I've found it very helpful for:

  • finding NodeNorm issues/shortcomings
  • finding errors in the resource itself (developers appreciated being informed of specific errors and fixed them!)
  • figuring out what to use as the original ID when the resource contains multiple options

Details

I catch these types of NodeNorm mapping failures:

  • NodeNorm returned None - save the input ID. ORION's current NodeNorm failure file does this
  • NodeNorm clique is the wrong primary category - save the input ID and that NodeNorm category. This requires as input an expected category/list of expected categories (which can be tricky to include)
  • NodeNorm clique doesn't have a primary label - save the input ID. Evan agreed to add this. I've heard that the UI doesn't handle Nodes without a human-readable label well, so this is worth catching
  • Unexpected errors (caught with try-except) - save the input ID and that NodeNorm response
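
The four failure checks above could be sketched roughly like this in Python. The response shape follows NodeNorm's `get_normalized_nodes` output as I understand it (each input CURIE maps to either `None` or an object with `id.identifier`, `id.label`, and a `type` list whose first entry is the primary category); the function and bucket names here are my own, not from any existing codebase:

```python
from collections import defaultdict

def classify_failures(responses, expected_categories):
    """Bucket NodeNorm responses by failure type (hypothetical helper).

    responses: dict of input CURIE -> NodeNorm response object (or None)
    expected_categories: set of acceptable biolink primary categories
    """
    failures = defaultdict(list)
    for input_id, resp in responses.items():
        try:
            # 1. NodeNorm returned None: no clique for this ID
            if resp is None:
                failures["no_mapping"].append(input_id)
                continue
            # 2. Clique's primary category isn't one we expect
            primary_category = resp["type"][0]
            if expected_categories and primary_category not in expected_categories:
                failures["wrong_category"].append((input_id, primary_category))
                continue
            # 3. Clique has no primary (human-readable) label
            if "label" not in resp["id"]:
                failures["no_label"].append(input_id)
        except Exception:
            # 4. Anything unexpected: keep the raw response for review
            failures["unexpected_error"].append((input_id, resp))
    return failures
```

Each bucket keeps enough context (input ID, plus the offending category or raw response) to do the manual review described later.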

Then I save and print summary statistics for each failure type: how many input IDs affected and how many rows removed as a result.
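
The summary step could look something like this (a minimal sketch with hypothetical names, assuming failures have already been bucketed by type and the number of removed rows per type has been tallied):

```python
def summarize_failures(failures, rows_removed):
    """Per failure type, report how many input IDs were affected and
    how many rows were removed as a result.

    failures: dict of failure type -> list of affected input IDs
    rows_removed: dict of failure type -> count of rows dropped
    """
    return {
        failure_type: {
            "ids_affected": len(ids),
            "rows_removed": rows_removed.get(failure_type, 0),
        }
        for failure_type, ids in failures.items()
    }
```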

I've done these kinds of reviews/analyses:

  • A resource provides two ID columns for an entity - compare the NodeNorm mappings from each column for differences.
    • If they differ, manually compare the input IDs, NodeNorm mappings, and original concept (string) to find which column of input IDs best captures the original concept
  • Compare the names provided by the resource to the NodeNorm primary labels
  • For a single ID column from the resource, compare the input IDs, NodeNorm mappings, and original concept (string) to see if the original concept is being accurately represented by the input ID/NodeNorm mapping.
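
As a concrete sketch of the two-ID-column comparison (all field names are hypothetical), assuming each column's input IDs have already been run through NodeNorm into a dict of input ID → primary identifier:

```python
def compare_column_mappings(rows, mapping_a, mapping_b):
    """Flag rows where the two ID columns normalize to different cliques,
    so they can be reviewed manually against the original concept string.

    rows: list of dicts with keys "concept", "id_a", "id_b"
    mapping_a, mapping_b: dict of input ID -> NodeNorm primary ID (or None)
    """
    disagreements = []
    for row in rows:
        norm_a = mapping_a.get(row["id_a"])
        norm_b = mapping_b.get(row["id_b"])
        if norm_a != norm_b:
            # Keep everything needed for manual review: original concept,
            # both input IDs, and both NodeNorm mappings
            disagreements.append({
                "concept": row["concept"],
                "id_a": row["id_a"], "norm_a": norm_a,
                "id_b": row["id_b"], "norm_b": norm_b,
            })
    return disagreements
```

Rows where both columns land in the same clique need no further attention; the disagreements are the ones worth eyeballing to decide which column best captures the original concept.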
