I learned about this program recently at a data science conference.
From examining your source code, it seems that you are mostly detecting possible PII by column names rather than by examining the content. With some work you could scan for PII by content. For example, you could have regular expressions that scan for SSNs, phone numbers, email addresses, and the like.
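To make the idea concrete, here is a rough sketch of content-based scanning. It is written in Python only for brevity; the same patterns should port to R's `grepl()` with minor changes. The patterns are deliberately simplified and are my own illustrations, not copied from any existing project, and the names `PII_PATTERNS`, `scan_column`, `scan_table`, and the `min_fraction` threshold are all hypothetical.

```python
import re

# Illustrative, simplified patterns (assumptions, not production-grade rules).
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(
        r"(?<!\d)(?:\(\d{3}\)\s?|\d{3}[-.\s])\d{3}[-.\s]\d{4}(?!\d)"
    ),
    "email": re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),
}


def scan_column(values, min_fraction=0.5):
    """Return the PII kinds matched by at least min_fraction of non-empty cells."""
    cells = [str(v) for v in values if v is not None and str(v).strip()]
    if not cells:
        return []
    hits = []
    for kind, pattern in PII_PATTERNS.items():
        matched = sum(1 for cell in cells if pattern.search(cell))
        if matched / len(cells) >= min_fraction:
            hits.append(kind)
    return hits


def scan_table(columns):
    """Map each column name to the PII kinds found in it, skipping clean columns."""
    result = {}
    for name, values in columns.items():
        kinds = scan_column(values)
        if kinds:
            result[name] = kinds
    return result


if __name__ == "__main__":
    # Column names are uninformative on purpose: detection relies on content.
    table = {
        "x1": ["123-45-6789", "987-65-4321"],
        "x2": ["a@example.com", "none", "b@example.org"],
        "notes": ["hello", "world"],
    }
    print(scan_table(table))
```

Requiring a fraction of cells to match, rather than a single hit, is one way to keep a stray number in a free-text column from flagging the whole column; the right threshold would need tuning on real data.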
You can find many such regular expressions in the bulk_extractor open source project, which is a named entity recognizer used for processing digital evidence. The bulk_extractor program uses flex as a high-speed regular expression scanner. With not a lot of work, you could take the bulk_extractor shared library and call it from R directly. Or you could manually extract the regular expressions from its files and add them directly here. That would be slower but easier to maintain.
Here are the files of interest: