3: NAACCR XML Dictionaries
To download one of the following files, click the link, locate the "Raw" button, right-click the button and select "Save link as..." (the option might have a different name depending on your web browser).
- Version 25 [ Base | ----- | Combined as CSV ]
- Version 24 [ Base | ----- | Combined as CSV ]
- Version 23 [ Base | ----- | Combined as CSV ]
- Version 22 [ Base | ----- | Combined as CSV ]
- Version 21 [ Base | User | Combined as CSV ]
- Version 18 [ Base | User | Combined as CSV ]
- Version 16 [ Base | User | Combined as CSV ]
- Version 15 [ Base | User | Combined as CSV ]
- Version 14 [ Base | User | Combined as CSV ]
NAACCR IDs should not change over time, except in very rare circumstances such as when the NAACCR item name or its semantic meaning changes too much after it has been defined.
- Two items were switched from the Patient level to the Tumor level in the N18 and N21 base dictionaries: dateOfLastCancerStatus and dateOfLastCancerStatusFlag (this is not technically the same as changing an ID, but it will cause the same type of headaches).
- Many IDs were changed in the N18 base dictionary to conform to the shorter naming requirement of the 1.4 NAACCR XML specification: renamed-items-naaccr18.csv
- A few IDs were changed going from N17 to N18 to better align with their new names: renamed-items-naaccr17-to-18.csv
The NAACCR XML standard defines two types of dictionaries:
- Base dictionaries defining all the standard items maintained by the NAACCR organization
- User dictionaries defining state-specific, registry-specific, or use-case specific items maintained by the organizations that define them
In addition to the base dictionaries, NAACCR also provides default user dictionaries. Those define the items that can be re-defined by other organizations (like the state requestor item, or the NPCR item) as one block of text. Without those, the translation from fixed-column to XML would lose the information contained in the state requestor or NPCR items, since those items are not defined in the base dictionaries. In other words, the "default" user-defined dictionaries are an artifact that was needed to ensure a smooth transition from fixed-column to XML. Once the fixed-column standard is retired, those default user-defined dictionaries won't be needed anymore.
Note that there are no syntax differences between a base dictionary and a user-defined one, but they do use different validation logic.
Several tools can be used to create your own user-defined dictionary:
- File*Pro is a more complex software provided by SEER.
- XMLExchange Plus is another more complex software provided by NPCR.
If you maintain software that can create a user-defined dictionary and you would like it to be referenced here, add an issue to the project issue tracker.
There are a few things to consider when creating a new dictionary.
First, a new URI needs to be created. The URI is used to uniquely identify the dictionary, but there is no central organization to ensure that every existing dictionary URI is unique. In theory, multiple organizations could therefore choose the same URI. To avoid that, a URI should be reasonably unique. Here are a few examples (note that URIs look like internet addresses, but they usually aren't).
- https://my-dictionary.xml: simple, but not unique at all; not a great choice.
- https://www.myorgnanization.com/my-dictionary.xml: much better; the organization name makes the URI much more unique.
- https://www.myorgnanization.com/my-dictionary-180.xml: adding the NAACCR version allows the organization to maintain one dictionary per version; that doesn't really make the URI more unique outside the organization, but it helps with handling multiple dictionaries within the organization.
- https://www.myorgnanization.com/my-dictionary-2020001.xml: adding a timestamp is the ultimate way to make the URI unique; this can be useful for software that creates dictionaries on the fly based on user input.
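The URI patterns above can be sketched as a small helper. This is only an illustration: the function name `build_dictionary_uri`, the `www.example.org` domain, and the timestamp format are assumptions made for the example, not part of the NAACCR XML standard or of this project.

```python
from datetime import datetime
from typing import Optional


def build_dictionary_uri(domain: str, name: str,
                         naaccr_version: Optional[str] = None,
                         timestamp: Optional[datetime] = None) -> str:
    """Build a dictionary URI from an organization domain and a dictionary name.

    Hypothetical helper: the optional NAACCR version and timestamp are
    appended to the name, following the examples in the list above.
    """
    parts = [name]
    if naaccr_version:
        parts.append(naaccr_version)
    if timestamp:
        # Assumed format (year to second); any sufficiently unique format would do.
        parts.append(timestamp.strftime("%Y%m%d%H%M%S"))
    return f"https://{domain}/{'-'.join(parts)}.xml"


# One dictionary per NAACCR version.
print(build_dictionary_uri("www.example.org", "my-dictionary", "180"))
```

Passing a `timestamp` as well gives a URI that stays unique even when a tool generates many dictionaries for the same NAACCR version.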
The second thing to consider is how to create the NAACCR XML ID for each data item that needs to appear in the user-defined dictionary. Consider the following when creating your IDs:
- How close is the ID to the data item name? IDs need to be close enough to the name that there is no ambiguity about which ID goes with which name. Note that a standard algorithm can be applied to a name to derive the corresponding ID; that algorithm is described in the next section.
- How unique is the ID? If an ID is not unique enough, there is a risk that a data file would reference two user-defined dictionaries containing the same ID, which would be a conflict.
- Who owns the data item? It is considered good practice to add a prefix representing the owning organization to the ID of the data items owned by that organization. That limits the possible conflicts and it makes the ownership obvious. And so for example, instead of the ID tobaccoHistory, consider using myOrganizationTobaccoHistory.
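The prefixing convention can be sketched as follows. The helper name `prefixed_id` is hypothetical; the 32-character check reflects the length limit mentioned in the ID rules later on this page, since a prefix makes IDs longer.

```python
MAX_ID_LENGTH = 32  # length limit mentioned in the ID rules on this page


def prefixed_id(owner_prefix: str, item_id: str) -> str:
    """Prepend an owner prefix to an ID while keeping the camelCase style.

    Hypothetical helper, not part of the project's API.
    """
    if not owner_prefix or not item_id:
        raise ValueError("both the prefix and the ID are required")
    # Lowercase the first letter of the prefix, uppercase the first letter of the ID.
    candidate = (owner_prefix[0].lower() + owner_prefix[1:]
                 + item_id[0].upper() + item_id[1:])
    if len(candidate) > MAX_ID_LENGTH:
        raise ValueError(f"{candidate!r} is longer than {MAX_ID_LENGTH} characters")
    return candidate


print(prefixed_id("myOrganization", "tobaccoHistory"))  # myOrganizationTobaccoHistory
```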
Another thing to consider is how to choose the NAACCR number for each item contained in the user-defined dictionary. Numbers must also be unique among all the user-defined dictionaries contained in a given data file. There are no real guidelines on what range of numbers to use.
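Because IDs and numbers must both be unique across every user-defined dictionary referenced by a data file, it can help to check for conflicts before distributing a dictionary. The sketch below is a minimal illustration that assumes each dictionary has already been reduced to a list of (ID, number) pairs; the `Item` type, the `find_conflicts` function, and the sample numbers are invented for the example.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Item(NamedTuple):
    naaccr_id: str
    naaccr_num: int


def find_conflicts(dictionaries: Dict[str, List[Item]]
                   ) -> Tuple[Dict[str, List[str]], Dict[int, List[str]]]:
    """Return the IDs and numbers that are defined more than once.

    The input maps a dictionary URI to its items; each returned mapping
    lists the URIs of the dictionaries that define the duplicated value.
    """
    id_owners: Dict[str, List[str]] = defaultdict(list)
    num_owners: Dict[int, List[str]] = defaultdict(list)
    for uri, items in dictionaries.items():
        for item in items:
            id_owners[item.naaccr_id].append(uri)
            num_owners[item.naaccr_num].append(uri)
    duplicate_ids = {k: v for k, v in id_owners.items() if len(v) > 1}
    duplicate_nums = {k: v for k, v in num_owners.items() if len(v) > 1}
    return duplicate_ids, duplicate_nums


# Arbitrary sample data: one shared ID and one shared number.
sample = {
    "uri-a": [Item("tobaccoHistory", 9500)],
    "uri-b": [Item("tobaccoHistory", 9501), Item("otherItem", 9500)],
}
duplicate_ids, duplicate_nums = find_conflicts(sample)
print(duplicate_ids)
print(duplicate_nums)
```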
The name and length of the new data items must be provided (there is no default for those).
In general, the default values for type, padding and trimming are appropriate.
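As a rough illustration of a minimal user-defined dictionary, the sketch below builds one with Python's standard `xml.etree.ElementTree`. The element and attribute names follow the published NAACCR dictionaries as best understood here and should be checked against the specification; a real dictionary may need further attributes (for example a specification version). The `ItemDef` class, the `build_user_dictionary` function, the default parent element, and the sample values are assumptions made for the example. Type, padding and trimming are deliberately left out so that the defaults apply.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

NAACCR_NS = "http://naaccr.org/naaccrxml"  # namespace assumed from published dictionaries


@dataclass
class ItemDef:
    naaccr_id: str
    naaccr_num: int
    naaccr_name: str                     # required: there is no default
    length: int                          # required: there is no default
    parent_xml_element: str = "Tumor"    # assumed default for this sketch


def build_user_dictionary(uri: str, naaccr_version: str, items: List[ItemDef]) -> str:
    """Serialize a minimal user-defined dictionary to an XML string."""
    root = ET.Element("NaaccrDictionary", {
        "dictionaryUri": uri,
        "naaccrVersion": naaccr_version,
        "xmlns": NAACCR_NS,
    })
    item_defs = ET.SubElement(root, "ItemDefs")
    for item in items:
        ET.SubElement(item_defs, "ItemDef", {
            "naaccrId": item.naaccr_id,
            "naaccrNum": str(item.naaccr_num),
            "naaccrName": item.naaccr_name,
            "length": str(item.length),
            "parentXmlElement": item.parent_xml_element,
        })
    return ET.tostring(root, encoding="unicode")


xml_text = build_user_dictionary(
    "https://www.example.org/my-dictionary-180.xml", "180",
    [ItemDef("myOrganizationTobaccoHistory", 9500, "My Organization Tobacco History", 1)],
)
print(xml_text)
```

The output of a sketch like this should still be validated, for example with the project's NaaccrXmlDictionaryUtils class, before it is used.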
The project contains a utility class (NaaccrXmlDictionaryUtils) to read, write and validate a given dictionary file. As noted above, base and user-defined dictionaries share the same syntax but use different validation logic.
That utility class also contains a method to create a NAACCR ID (used for the "naaccrId" attribute) from a given item name using the following rules:
- Spaces, dashes, slashes, ampersands, periods and underscores are considered as word separators and replaced by a single space.
- Anything in parentheses is removed (along with the parentheses).
- Any non-digit and non-letter character is removed.
- The result is split by spaces (called words in the rest of this logic).
- Roman numeral words are converted to the corresponding numbers (so I becomes 1, II becomes 2, etc.); this applies only to full words, and only for numbers up to IX (9).
- If the last two words were converted Roman numerals, a "to" word is inserted between them.
- The first word is uncapitalized, the other words are capitalized. All abbreviations are considered words (so EOD becomes Eod).
- All the words are concatenated back together.
- The resulting ID is manually reviewed to ensure it is no more than 32 characters.
File*Pro has an option to derive the ID from an item name using those rules.
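The rules above can be sketched in Python as shown below. This is an independent illustration of the listed steps, not the code of NaaccrXmlDictionaryUtils. It makes a few assumptions where the rules are silent: only ASCII letters and digits are kept, spaces survive the character-removal step so the text can still be split, Roman numerals are matched in uppercase only, and every word is lowercased before its first letter is adjusted (which is what makes EOD become Eod). The function name `create_naaccr_id` is hypothetical.

```python
import re

# Roman numerals up to IX, matched as full uppercase words only (assumption).
ROMAN_NUMERALS = {"I": "1", "II": "2", "III": "3", "IV": "4", "V": "5",
                  "VI": "6", "VII": "7", "VIII": "8", "IX": "9"}
MAX_ID_LENGTH = 32


def create_naaccr_id(name: str) -> str:
    """Derive a candidate NAACCR ID from an item name."""
    # Separators (whitespace, dash, slash, ampersand, period, underscore) become one space.
    text = re.sub(r"[\s\-/&._]+", " ", name)
    # Anything in parentheses is removed, along with the parentheses.
    text = re.sub(r"\([^)]*\)", "", text)
    # Any remaining character that is not a letter, a digit or a space is removed.
    text = re.sub(r"[^A-Za-z0-9 ]", "", text)
    words = text.split()
    if not words:
        return ""
    # Convert Roman numeral words and remember which words were converted.
    converted = [word in ROMAN_NUMERALS for word in words]
    words = [ROMAN_NUMERALS.get(word, word) for word in words]
    # If the last two words were converted numerals, insert "to" between them.
    if len(words) >= 2 and converted[-1] and converted[-2]:
        words.insert(len(words) - 1, "to")
    # First word lowercased, other words capitalized, then concatenated.
    return words[0].lower() + "".join(word.capitalize() for word in words[1:])


def needs_review(naaccr_id: str) -> bool:
    """Flag IDs longer than 32 characters for the manual review step."""
    return len(naaccr_id) > MAX_ID_LENGTH


print(create_naaccr_id("Date of Last Cancer (tumor) Status"))  # dateOfLastCancerStatus
print(create_naaccr_id("EOD Primary Tumor"))                   # eodPrimaryTumor
print(create_naaccr_id("Stage I-II"))                          # stage1To2
```

The length check is kept separate from the derivation because the last rule describes a manual review rather than an automatic truncation.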