Creating "real" addresses #18

Open
ctmay4 opened this issue Aug 16, 2017 · 5 comments

Comments

@ctmay4
Member

ctmay4 commented Aug 16, 2017

Creating data with addresses that do not geocode is causing edit failures and making it hard to use the data. We need to decide the best way to handle this.

One option is to add a set of institutional addresses to the library. I'm just not sure we could gather enough of them to cover all states. In addition, if we don't have enough, generating a large number of records will produce a lot of possible matches, since the same addresses get reused.

Another thing to consider would be to support callers passing in a list of addresses to use for the generation. No matter what we do I think we should consider that.
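
To make the caller-supplied list idea concrete, here is a minimal sketch. It is written in Python only for illustration and is not the library's API: `Address`, `DEFAULT_ADDRESSES`, and `pick_address` are hypothetical names, and the address values are placeholders. The generator uses the caller's list when one is given and otherwise falls back to a small built-in set.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Address:
    """Hypothetical address record; the field names are assumptions."""
    street: str
    city: str
    state: str
    zip_code: str


# Placeholder entries standing in for whatever small set might ship with the library.
DEFAULT_ADDRESSES = (
    Address("1 Example St", "Springfield", "IL", "00000"),
    Address("2 Sample Ave", "Albany", "NY", "00000"),
)


def pick_address(state: str,
                 addresses: Optional[Sequence[Address]] = None,
                 rng: Optional[random.Random] = None) -> Address:
    """Pick a random address in `state`, preferring the caller-supplied list."""
    rng = rng or random.Random()
    # An empty or missing caller list falls back to the built-in set.
    pool = addresses if addresses else DEFAULT_ADDRESSES
    candidates = [a for a in pool if a.state == state]
    if not candidates:
        raise ValueError(f"no address available for state {state!r}")
    return rng.choice(candidates)
```

A caller with its own geocodable addresses would pass them in as `pick_address("MD", addresses=my_list)`. Raising an error for an uncovered state is one possible choice; returning a non-geocoded fallback address would be another.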

@ctmay4
Member Author

ctmay4 commented Mar 13, 2018

Found this project which might be worth consideration.

https://github.com/EthanRBrown/rrad

@halla-ims
Collaborator

I would like to second this nomination. I used synthetic data for the testing of Naaccr*Prep, and some of its calculated fields use addressAtDxCounty, countyDxGeocode1990, stateAtDxGeocode19708090, censusCertainty708090. It would be really useful to have this kind of information filled in.

@ctmay4
Member Author

ctmay4 commented Oct 10, 2018

I agree, @halla-ims; however, I'm just not sure about the best source. The project I linked above is nice, but it doesn't cover all the areas we need addresses in. Do you have any other ideas for getting the addresses?

Also adding @depryf to the conversation.

@depryf
Member

depryf commented Oct 10, 2018

Someone suggested using the addresses of libraries or post offices, which are all publicly available.

@depryf
Member

depryf commented Jan 13, 2022

I was talking with someone who has experience with synthetic data sets, and he mentioned two data sources that he has used in the past to create "real" addresses:

USDA SNAP grocery store locations - source: https://www.fns.usda.gov/snap/retailer-locator (n=250,000)
NCES public school locations - source: https://nces.ed.gov/ccd/pubschuniv.asp (n=100,000)

I still don't think a standalone library can be shipped with that much data, but I guess there could be some kind of data provider that can be dynamically plugged into the mechanism and supply some of those real addresses.

Something to think about.
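
A rough sketch of what such a pluggable provider could look like, in Python for illustration only. Everything here is hypothetical: `AddressProvider`, `CsvAddressProvider`, and `generate_records` are not part of the library, and the CSV column names (`street`, `city`, `state`, `zip`) are an assumed layout, not the actual format of the SNAP or NCES files. The idea is that the library only depends on a small interface, and the large address file stays outside the shipped artifact.

```python
import csv
import random
from typing import Dict, Iterable, List, Optional, Protocol, TextIO


class AddressProvider(Protocol):
    """Hypothetical plug-in interface the generator would depend on."""

    def next_address(self, state: str) -> Optional[Dict[str, str]]:
        ...


class CsvAddressProvider:
    """Serve addresses from a caller-supplied CSV file.

    Assumed columns: street, city, state, zip.
    """

    def __init__(self, handle: TextIO, seed: Optional[int] = None) -> None:
        self._rng = random.Random(seed)
        self._by_state: Dict[str, List[Dict[str, str]]] = {}
        # Index rows by state once so each lookup is a single dictionary access.
        for row in csv.DictReader(handle):
            self._by_state.setdefault(row["state"], []).append(dict(row))

    def next_address(self, state: str) -> Optional[Dict[str, str]]:
        rows = self._by_state.get(state)
        # None signals "no real address for this state"; the caller decides the fallback.
        return dict(self._rng.choice(rows)) if rows else None


def generate_records(states: Iterable[str],
                     provider: AddressProvider) -> List[Dict[str, object]]:
    """Build one stub record per state, asking the provider for an address."""
    return [{"state": s, "address": provider.next_address(s)} for s in states]
```

With this shape, a caller who has downloaded one of the public files would open it and pass `CsvAddressProvider(handle)` to the generator, while the default provider could remain a small built-in list.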
