7: NAACCR XML and SAS
This page contains resources related to reading/writing NAACCR XML data files with SAS. The content of this page and any resources it links to are not officially endorsed by any standard setters (including the NAACCR organization).
Feel free to check out this NAACCR 2018 oral presentation (PDF) about NAACCR XML and SAS.
The recommended way to handle NAACCR XML in SAS is to call a Java Archive (JAR) from the SAS program and let it do the work. Two macros were created (readNaaccrXml and writeNaaccrXml) to hide the complexity of interacting with Java code.
You will need to download the NAACCR XML SAS JAR file and macros and reference them in your SAS program. Those resources are included in the zipped distribution found on the release page of this GitHub project. After unzipping the file, you will find them in the "sas" folder of the installation.
You can also find the macros in this GitHub project, as well as some example programs (they are very similar to the examples listed on this page).
With the macros and JAR file available to the SAS program, a typical call to the read macro will look like this (the version of the JAR file might need to be updated):
%include "read_naaccr_xml_macro.sas";
%readNaaccrXml(
libpath="naaccr-xml-10.0-sas.jar",
sourcefile="synthetic-data_naaccr-240-incidence_10-tumors.xml",
naaccrversion="240",
recordtype="I",
dataset=fromxml,
items="patientIdNumber,primarySite"
);
Note that the only required parameters are the path to the library and the source file. Others have a default.
Here is another example that references a user-defined dictionary (note that multiple dictionaries need to be separated by a semicolon):
%include "read_naaccr_xml_macro.sas";
%readNaaccrXml(
libpath="naaccr-xml-10.0-sas.jar",
sourcefile="synthetic-data_naaccr-240-incidence_10-tumors_non-standard.xml",
naaccrversion="240",
recordtype="I",
dataset=fromxml,
items="patientIdNumber,primarySite,myVariable",
dictfile="my-dictionary.csv"
);
Here is a description of the available parameters:
- libpath: needs to point to the Java SAS library (path can be relative or absolute)
- sourcefile: needs to point to the XML to import (path can be relative or absolute):
  - if the path ends with ".gz", it will be processed as a GZIP compressed file
  - if it ends with ".zip", every file inside the zip file will be processed (into the same SAS data set); the inner files can be compressed (".gz") or uncompressed
  - otherwise it will be processed as an uncompressed file
- naaccrversion: should be one of the supported NAACCR versions provided as three digits: "140", "150", "160", etc. (this parameter is required, no default); make sure to provide the proper version or some items might be dropped during the reading process
- recordtype: should be "A", "M", "C" or "I" (required, no default); make sure to provide the proper type or some items might be dropped during the reading process
- dataset: should be the name of the dataset into which the data should be loaded (defaults to alldata)
- items: an optional list of items to read (any items not in the list will be ignored); if not provided, the data set will contain all standard items plus any non-standard items provided via the extra user-defined dictionary (if it was provided). Be aware that creating a data set containing all items will be MUCH slower than creating one for just a few items, and so if you only need a handful of items to do your analysis, it is strongly recommended to provide those items.
There are two ways to provide the list:
1. Hard code the XML IDs in the SAS code, separated by commas:
items="patientIdNumber,tumorRecordNumber,primarySite"
2. Provide the path (relative or absolute) to a CSV file:
items="included-items.csv"
The first line of the file must be headers; the XML IDs to include are expected to be found in the first column (the file can contain other columns); a simple file would look like this:
NAACCR_XML_ID
patientIdNumber
tumorRecordNumber
primarySite
- dictfile: path to an optional user-defined dictionary in CSV format (the free File*Pro software available on the SEER website can generate those files). Path can be relative or absolute; if relative, it will be computed from the directory containing the macro (in other words, the dictionary CSV file can be copied into the same directory as the macro and referenced by its filename only). Use a semicolon to separate multiple paths if you need to provide more than one dictionary.
- cleanuptempfiles: should be "yes" or "no" (defaults to "yes"); if "no" then the temporary flat and format files won't be automatically deleted; use this parameter to QC those files when investigating issues.
- groupeditems: should be "yes" or "no" (defaults to "no"); if "yes" then the grouped items will be added to the created data set. Note that the "items" parameter has no impact on this one: either all the grouped items are included, or none are.
Note that the macro creates a temporary fixed-column file and an input SAS format file in the same folder as the source file; those files will be automatically deleted by the macro when it's done executing (unless the 'cleanuptempfiles' parameter is set to 'no').
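As a sketch of the CSV option for the "items" parameter, the following program writes the sample file shown above from SAS itself and then passes it to the read macro. The JAR version, source file and data set name are copied from the earlier example; writing included-items.csv to the current SAS working directory is an assumption, so adjust the path if relative paths resolve from a different folder in your environment.

```sas
/* Write a small CSV listing the XML IDs to read.               */
/* The first line is a header; the IDs are in the first column. */
/* Assumption: the file lands in the current working directory. */
data _null_;
    file "included-items.csv";
    put "NAACCR_XML_ID";
    put "patientIdNumber";
    put "tumorRecordNumber";
    put "primarySite";
run;

/* Read only the items listed in the CSV file. */
%include "read_naaccr_xml_macro.sas";
%readNaaccrXml(
    libpath="naaccr-xml-10.0-sas.jar",
    sourcefile="synthetic-data_naaccr-240-incidence_10-tumors.xml",
    naaccrversion="240",
    recordtype="I",
    dataset=fromxml,
    items="included-items.csv"
);
```

Keeping the list in a file lets several programs share the same set of items without repeating the IDs in each macro call.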
Once the "fromxml" data set has been created, it can be used like any other data set:
proc freq data=fromxml;
tables primarySite;
run;
Calling the write macro is very similar:
%include "write_naaccr_xml_macro.sas";
%writeNaaccrXml(
libpath="naaccr-xml-10.0-sas.jar",
targetfile="recreated-from-sas.xml",
naaccrversion="240",
recordtype="I",
dataset=fromxml
);
Here is an example that references a user-defined dictionary (and writes the NAACCR numbers):
%include "write_naaccr_xml_macro.sas";
%writeNaaccrXml(
libpath="naaccr-xml-10.0-sas.jar",
targetfile="recreated-from-sas.xml.gz",
naaccrversion="240",
recordtype="I",
writenum="yes",
dataset=fromxml,
dictfile="my-dictionary.csv",
dicturi="https://my.organization.org/my-dictionary.xml"
);
The only required parameters are the path to the library and the target file. Here is a description of the available parameters:
- libpath: needs to point to the Java SAS library (path can be relative or absolute)
- targetfile: needs to point to the XML to export (path can be relative or absolute):
  - if the path ends with ".gz", it will be processed as a GZIP compressed file
  - otherwise it will be processed as an uncompressed file
- naaccrversion: should be one of the supported NAACCR versions provided as three digits: "140", "150", "160", etc. (this parameter is required, no default); make sure to provide the proper version or some items might be dropped during the writing process
- recordtype: should be "A", "M", "C" or "I" (required, no default); make sure to provide the proper type or some items might be dropped during the writing process
- dataset: should be the name of the dataset from which the data should be taken (defaults to alldata)
- items: an optional list of items to write (any items not in the list will be ignored); if not provided, all items in the data set will be written.
There are two ways to provide the list:
1. Hard code the XML IDs in the SAS code, separated by commas:
items="patientIdNumber,tumorRecordNumber,primarySite"
2. Provide the path (relative or absolute) to a CSV file:
items="included-items.csv"
The first line of the file must be headers; the XML IDs to include are expected to be found in the first column (the file can contain other columns); a simple file would look like this:
NAACCR_XML_ID
patientIdNumber
tumorRecordNumber
primarySite
- dictfile: path to an optional user-defined dictionary in CSV format (the free File*Pro software available on the SEER website can generate those files). Path can be relative or absolute; if relative, it will be computed from the directory containing the macro (in other words, the dictionary CSV file can be copied into the same directory as the macro and referenced by its filename only). Use a semicolon to separate multiple paths if you need to provide more than one dictionary.
- dicturi: an optional user-defined dictionary URI to reference in the created XML file (if a CSV dictionary is provided, then this one should be provided as well); the URI can be found as a root attribute of the XML dictionary (it usually looks like an internet address, but it's rarely a legit address; and the macros do not try to connect to that address in any way). Use semicolon to separate multiple URIs.
- writenum: should be "yes" or "no" (defaults to "no"); if "yes" then the NAACCR numbers will be written.
- cleanuptempfiles: should be "yes" or "no" (defaults to "yes"); if "no" then the temporary flat and format files won't be automatically deleted; use this parameter to QC those files when investigating issues.
A typical use-case for the write macro is to read an XML file (using the read macro), do something to the data and write it back. Another common use-case is to start from an existing data set. In that case, there are a few caveats to keep in mind:
- Variable names must be the NAACCR XML IDs (any other variable will be ignored).
- All observations (each of which represents a Tumor) that belong to the same Patient need to have the same value for the "patientIdNumber" variable (otherwise every Tumor will end up in its own Patient and there won’t be any Tumor grouping done).
- The Patient values are taken from the first observation (the first Tumor) of that Patient.
- The NaaccrData values (the items appearing only once per file) are taken from the first observation.
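The caveats above can be sketched with a small, made-up data set. Everything named here is hypothetical (the "mydata" data set, its "pat_id" and "site" variables, and the target file name); the point is only to show renaming variables to NAACCR XML IDs before calling the write macro. Sorting by patientIdNumber is an assumption made to keep the Tumors of a Patient next to each other; the page does not state whether the macro requires it.

```sas
/* Hypothetical existing data set whose variable names are not NAACCR XML IDs. */
data mydata;
    length pat_id $8 site $4;
    input pat_id $ site $;
    datalines;
00000001 C509
00000001 C341
00000002 C619
;
run;

/* Rename the variables to their NAACCR XML IDs; other variables would be ignored. */
data towrite;
    set mydata(rename=(pat_id=patientIdNumber site=primarySite));
    keep patientIdNumber primarySite;
run;

/* Assumption: keep all Tumors of a given Patient together. */
proc sort data=towrite;
    by patientIdNumber;
run;

%include "write_naaccr_xml_macro.sas";
%writeNaaccrXml(
    libpath="naaccr-xml-10.0-sas.jar",
    targetfile="from-existing-data.xml",
    naaccrversion="240",
    recordtype="I",
    dataset=towrite
);
```

With this input, the two observations sharing the ID "00000001" are expected to be written as two Tumors of one Patient.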
A few things to keep in mind:
- This solution creates a temporary flat file as well as a temporary SAS format file that defines all the variables. Those files will be automatically deleted by the macro when it's done executing (unless the 'cleanuptempfiles' parameter is set to 'no').
- The current solution supports one or multiple user-defined dictionaries separated by a semicolon (the NAACCR XML standard specifies that multiple dictionary URIs in a data file must be separated by a space, but the macros need a path to a dictionary file, not a dictionary URI, and those paths can contain spaces; that's why the file path separator is a semicolon instead). The dictionary files must be in the CSV format. The free File*Pro software available on the SEER website can create those CSV files for a given dictionary.
- New lines are translated into the "::" characters and back into new lines when the file is re-created; that means new lines can't be used in analysis within SAS.
- SAS imposes a restriction of 32 characters for its variable names; the NAACCR XML standard used to have some data items with an NAACCR XML ID longer than 32 characters, but those have been truncated and a 32-character limit is now imposed on all data item IDs.
- If you receive an error about a missing/invalid CLASSPATH, please check that the path you provided to the JAR library is correct. As a reminder, the path can be relative (if the macros and the JAR are in the same folder) or absolute. When absolute, it's common for the version to appear in the path itself (in addition to the JAR filename); when upgrading to a new version, make sure to update both versions in the path. The path can also appear multiple times in your program; a safe way to double-check is to search the content of the macro for the old version number.
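Two of the points above lend themselves to a short sketch. The program below keeps the JAR location in a single macro variable, so that a version upgrade is a one-line change, and then derives an analysis-friendly copy of a text item in which the "::" markers are replaced by spaces. The folder in "jarpath", the "remarksFlat" variable and its length are hypothetical, and "textRemarks" is assumed to be the XML ID of the text item of interest.

```sas
/* Keep the JAR location in one place (hypothetical folder). */
%let jarpath = C:\naaccr-xml\sas\naaccr-xml-10.0-sas.jar;

%include "read_naaccr_xml_macro.sas";
%readNaaccrXml(
    libpath="&jarpath",
    sourcefile="synthetic-data_naaccr-240-incidence_10-tumors.xml",
    naaccrversion="240",
    recordtype="I",
    dataset=fromxml,
    items="patientIdNumber,textRemarks"
);

/* New lines arrive as "::"; build a copy with spaces for analysis only. */
/* The original variable is left untouched so it can be written back.    */
data remarks;
    set fromxml;
    length remarksFlat $1000;
    remarksFlat = tranwrd(textRemarks, "::", " ");
run;
```

Because "remarksFlat" is not a NAACCR XML ID, the write macro would ignore it, and the "::" markers in "textRemarks" remain available to be turned back into new lines.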
Other solutions were considered to read/write NAACCR XML in SAS; they worked well on small files but didn't handle large ones very well. Those solutions are explained in the next sections.
Resources: https://github.com/imsweb/naaccr-xml/tree/master/docs/sas/read/xmlmapper
The XML Mapper is the standard way to read XML in SAS. It requires a definition file (an XML file itself) that maps the tags to tables, variables and observations. To be able to flatten the three NAACCR XML levels (NaaccrData, Patient and Tumor), the definition files also define counters that are incremented when one of the main NAACCR XML tags is read. The counters are called keys; the NaaccrDataKey is the same for all observations in the file; the PatientKey is unique per Patient in the file (so several observations can have the same PatientKey if the Tumor is for the same Patient); the TumorKey is unique per Tumor and therefore per observation.
Using those keys, the SAS program can merge the three levels into one big data set.
Unfortunately, that method gets slow for large data files.
One way to make it faster is to limit the number of variables that the XML Mapper needs to process. That requires a new XML Mapper file defining only a subset of the variables, and so this is really not convenient.
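To illustrate how the keys tie the three levels together, here is a minimal sketch that does not use the XML Mapper at all: it builds three tiny tables by hand and joins them on the keys. The table names, the sample values and the assumption that each table carries the key of its parent level are hypothetical; the tables actually produced depend on the mapping definition file.

```sas
/* Hand-made stand-ins for the three tables an XML Mapper could produce. */
data naaccrdata;
    input NaaccrDataKey registryId :$10.;
    datalines;
1 0000001234
;
run;

data patient;
    input NaaccrDataKey PatientKey patientIdNumber :$8.;
    datalines;
1 1 00000001
1 2 00000002
;
run;

data tumor;
    input PatientKey TumorKey primarySite :$4.;
    datalines;
1 1 C509
1 2 C341
2 3 C619
;
run;

/* Flatten the three levels: one observation per Tumor. */
proc sql;
    create table alldata as
    select n.registryId, p.patientIdNumber, t.primarySite, t.TumorKey
    from tumor as t
    inner join patient as p on t.PatientKey = p.PatientKey
    inner join naaccrdata as n on p.NaaccrDataKey = n.NaaccrDataKey
    order by t.TumorKey;
quit;
```

The resulting data set has one observation per Tumor, with the Patient and NaaccrData values repeated on each of them.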
Resources: https://github.com/imsweb/naaccr-xml/tree/master/docs/sas/write/tagset
Tagsets are a simple way to format a given data set into various formats. Using a customized NAACCR XML tagset, a given data set can be written as valid NAACCR XML. For that solution to work, the data set must contain all the NAACCR variables, including the special keys. One way to achieve that is to use an XML Mapper to read the data, apply the required computations and/or recoding, and finally use the tagset to re-write the data. The mapping definition file will need to contain all the variables (for the requested record type) for this to work properly.
Tagsets are event-based; they define actions that need to be triggered based on the event (start table, start row, start column, etc.). It's a very powerful mechanism, but it is difficult to optimize. So again, re-creating a NAACCR XML data file with this solution is rather slow and won't be practical for very large files.