Scan for SSNs and other kinds of PII by content #42

Open
simsong opened this issue Nov 3, 2019 · 0 comments

I learned about this program recently at a data science conference.

From examining your source code, it seems that you mostly detect possible PII by column names rather than by examining content. With some work you could also scan for PII by content. For example, you could use regular expressions that scan for SSNs, phone numbers, email addresses, and the like.
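As a rough sketch of what content-based scanning could look like, the snippet below flags a column when enough of its values match a pattern. It is written in Python only for brevity; the same patterns should port to R's regular expression functions. Everything here is an assumption for illustration: the three patterns are simplified (US-style formats only) and are not the expressions used by bulk_extractor, and `PII_PATTERNS`, `scan_column`, `scan_table`, and the `threshold` parameter are hypothetical names, not part of this project.

```python
import re

# Simplified, illustrative patterns (assumptions, not bulk_extractor's own).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}


def scan_column(values, threshold=0.5):
    """Return the PII kinds matched by at least `threshold` of the non-empty values."""
    cells = [str(v) for v in values if v is not None and str(v).strip()]
    if not cells:
        return []
    hits = []
    for kind, pattern in PII_PATTERNS.items():
        matched = sum(1 for cell in cells if pattern.search(cell))
        if matched / len(cells) >= threshold:
            hits.append(kind)
    return hits


def scan_table(table, threshold=0.5):
    """Scan a dict of column name -> list of values; return only flagged columns."""
    flagged = {}
    for name, values in table.items():
        kinds = scan_column(values, threshold)
        if kinds:
            flagged[name] = kinds
    return flagged
```

Using a match fraction rather than a single hit is one way to keep a stray number in a free-text column from flagging the whole column; the right threshold would need tuning against real data.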

You can find many such regular expressions in the bulk_extractor open source project, which is a named entity recognizer used for processing digital evidence. The bulk_extractor program uses flex as a high-speed RE parser. With little work, you could take the bulk_extractor shared library and call it from R directly. Alternatively, you could manually extract the regular expressions from its files and add them directly here. That approach would be slower but easier to maintain.

Here are the files of interest:
