Creating "real" addresses #18
Comments
Found this project, which might be worth considering.
I would like to second this nomination. I used synthetic data to test Naaccr*Prep, and some of its calculated fields use addressAtDxCounty, countyDxGeocode1990, stateAtDxGeocode19708090, and censusCertainty708090. It would be really useful to have this kind of information filled in.
I agree, @halla-ims; however, I'm just not sure about the best source. The project I linked above is nice, but it doesn't cover all the areas we need addresses in. Do you have any other ideas for getting the addresses? Also adding @depryf to the conversation.
Someone suggested using the addresses of libraries or post offices, which are all publicly available.
I was talking with someone who has experience with synthetic data sets, and he mentioned two data sources that he has used in the past to create "real" addresses:
USDA SNAP grocery store locations (n=250,000), source: https://www.fns.usda.gov/snap/retailer-locator
I still don't think a standalone library can be shipped with that much data, but I guess there could be some kind of data provider that can be dynamically plugged into the mechanism and supply some of those real addresses. Something to think about.
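To make the pluggable-provider idea concrete, here is a rough sketch of what such a hook might look like. It is written in Python purely for illustration and is not the library's API: `Address`, `AddressProvider`, `InMemoryAddressProvider`, and `pick_address` are all hypothetical names, and the address in the usage note is a placeholder, not real data.

```python
import random
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class Address:
    """Hypothetical address record; the field names are assumptions."""
    street: str
    city: str
    state: str
    zip_code: str


class AddressProvider(Protocol):
    """Hypothetical plug-in interface: return an address for a state, or None."""

    def next_address(self, state: str) -> Optional[Address]:
        ...


class InMemoryAddressProvider:
    """Serves addresses the caller loaded elsewhere (for example from a downloaded file)."""

    def __init__(self, addresses: list[Address], seed: Optional[int] = None) -> None:
        # Index by state so lookups do not scan the whole list.
        self._by_state: dict[str, list[Address]] = {}
        for addr in addresses:
            self._by_state.setdefault(addr.state, []).append(addr)
        self._rng = random.Random(seed)

    def next_address(self, state: str) -> Optional[Address]:
        candidates = self._by_state.get(state)
        return self._rng.choice(candidates) if candidates else None


def pick_address(state: str, provider: Optional[AddressProvider] = None) -> Optional[Address]:
    """Ask the plugged-in provider, if any.

    Returning None signals that the generator should fall back to
    whatever it does today for that state.
    """
    if provider is None:
        return None
    return provider.next_address(state)
```

A caller would build the provider from whatever data set they have, e.g. `InMemoryAddressProvider([Address("1 Placeholder Way", "Sampletown", "MD", "00000")])`, and hand it to the generator. The large data set stays outside the library, which only needs to know the interface.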
Creating data with addresses that do not geocode is causing edit failures and making it hard to use the data. We need to decide the best way to handle this.
One option is to add a bunch of institutional addresses to the library for use. I'm just not sure we could get enough of them to cover all states. In addition, if we don't have enough, generating a large number of records will produce a lot of possible matches, since the addresses get reused.
Another thing to consider would be to let callers pass in a list of addresses to use for the generation. No matter what we do, I think we should consider that.
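As a rough illustration of the caller-supplied list, the sketch below spreads a list of addresses over the requested number of records. It is Python for illustration only; `assign_addresses` is a hypothetical name, not an existing function in the library. It assumes that when there are more records than addresses, the least bad behavior is to use every address once before any address is used a second time, so reuse stays as even as possible.

```python
import random
from typing import Optional, Sequence


def assign_addresses(
    num_records: int,
    addresses: Sequence[str],
    seed: Optional[int] = None,
) -> list[str]:
    """Return one address per record, drawn from the caller's list.

    Addresses are handed out in shuffled passes over the whole list, so
    the usage counts of any two addresses differ by at most one.
    """
    if not addresses:
        raise ValueError("at least one address is required")
    rng = random.Random(seed)
    assigned: list[str] = []
    while len(assigned) < num_records:
        batch = list(addresses)
        rng.shuffle(batch)
        # Take only as many as are still needed on the final pass.
        assigned.extend(batch[: num_records - len(assigned)])
    return assigned
```

With 3 addresses and 7 records, each address is used two or three times. This does not remove the duplicate-match problem when the list is small; it only keeps the reuse from piling up on a few addresses.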