feature: no-code anonymizers packs #223
Open
The idea: let users create PHP-less packs, defined by a YAML file and loading their data from plain text or CSV files. The main goal is to allow easy extension without any PHP knowledge.
Features
General
Anyone can create a custom `db_tools.pack.yaml` file, which defines a new pack. The basic information looks like this:
Then, in the `data` section, you can add one or more anonymizers. Let's start with a simple one: a raw data list written directly in the YAML file.
Enum (single column) anonymizers
This means that the `fr_fr.address_street_prefix` enum anonymizer is then exposed, with the given values.
If you have many entries and want to place them in a file instead, you may simply reference the file using a relative path (relative to the directory of the main YAML entrypoint file), as such:
Considering the file contains:
Then the `fr_fr.address_street_name` anonymizer will be exposed, using each plain text line as a value for the enum data list.
If you use a CSV file instead, only the first column will be fetched.
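The file-loading rules above (path relative to the entrypoint directory, one value per line, first CSV column only) can be sketched in a few lines. This is an illustration in Python, not the PHP code from this PR. The function name `load_enum_values`, the detection of CSV by file extension, and the skipping of blank lines are all assumptions.

```python
import csv
from pathlib import Path


def load_enum_values(entrypoint: Path, relative_path: str) -> list[str]:
    """Load enum values from a data file referenced by a pack definition.

    `entrypoint` is the pack's main YAML file. `relative_path` is resolved
    against the directory containing that file.
    """
    path = entrypoint.parent / relative_path
    with path.open(newline="", encoding="utf-8") as handle:
        if path.suffix.lower() == ".csv":
            # CSV: keep only the first column of each non-empty row.
            return [row[0] for row in csv.reader(handle) if row and row[0] != ""]
        # Plain text: one value per line (blank lines skipped, an assumption).
        return [line.strip() for line in handle if line.strip()]
```

Calling `load_enum_values(Path("packs/db_tools.pack.yaml"), "street_names.txt")` would read `packs/street_names.txt`; both paths here are made up for the example.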
Text patterns
One other feature that this brings is the ability to write text lines that concatenate text from multiple anonymizers, such as:
Where:
- `[n-m]` is a range of integers. Because integers could be negative, the `[n,m]` variant will also be accepted and parsed as such.
- `{address_street_prefix}` will fetch a random value from the `fr_fr.address_street_prefix` anonymizer.
- `{self.address_street_name}` will fetch a random value from the `fr_fr.address_street_name` anonymizer. Here, `self` is an alias for the current pack name: this allows disambiguating from existing core anonymizers, which don't require any prefix.
The most important part of this is that the generated SQL will be a `CONCAT(expr, expr, expr)` where each `expr` will be the generated SQL from the target anonymizer. This makes the whole SQL completely random, and doesn't require any sample table.
This technical solution might be reevaluated later, since I don't have any performance numbers yet. If it turns out to be too slow, we will have to make sure that the initialization generates a sample table before anonymizing the full database.
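To make the pattern syntax and the `CONCAT` idea concrete, here is a minimal Python sketch. It is not the PHP implementation from this PR. It splits a pattern into literal text, integer ranges and anonymizer references, then joins one SQL expression per part. `parse_pattern`, `to_concat_sql`, the `sql_for` callback and the `RANDOM_INT(...)` placeholder are all hypothetical names. How an unprefixed reference is resolved is left to the callback, since that detail is not settled here.

```python
import re

# Matches "[n-m]" or "[n,m]" integer ranges, or "{name}" / "{self.name}" references.
TOKEN = re.compile(
    r"\[(?P<low>-?\d+)\s*[-,]\s*(?P<high>-?\d+)\]"
    r"|\{(?P<ref>[A-Za-z0-9_.]+)\}"
)


def parse_pattern(pattern: str, pack: str) -> list:
    """Split a text pattern into ("text" | "range" | "ref", value) parts."""
    parts = []
    position = 0
    for match in TOKEN.finditer(pattern):
        if match.start() > position:
            parts.append(("text", pattern[position:match.start()]))
        if match.group("ref") is not None:
            name = match.group("ref")
            if name.startswith("self."):
                # "self" is an alias for the current pack name.
                name = pack + name[len("self"):]
            parts.append(("ref", name))
        else:
            parts.append(("range", (int(match.group("low")), int(match.group("high")))))
        position = match.end()
    if position < len(pattern):
        parts.append(("text", pattern[position:]))
    return parts


def to_concat_sql(parts: list, sql_for) -> str:
    """Build a CONCAT(...) expression with one SQL expression per part."""
    expressions = []
    for kind, value in parts:
        if kind == "text":
            # Literal text becomes a quoted SQL string.
            expressions.append("'" + value.replace("'", "''") + "'")
        elif kind == "range":
            low, high = value
            # Hypothetical placeholder: the real random-integer SQL is dialect-specific.
            expressions.append(f"RANDOM_INT({low}, {high})")
        else:
            # Delegate to whatever produces the target anonymizer's SQL.
            expressions.append(sql_for(value))
    return "CONCAT(" + ", ".join(expressions) + ")"
```

For example, `parse_pattern("[1-99], {address_street_prefix} {self.address_street_name}", "fr_fr")` yields a range, a literal `", "`, a reference, a literal space, and a reference to `fr_fr.address_street_name`. Passing those parts to `to_concat_sql` produces a single `CONCAT(...)` with five arguments.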
Multiple column anonymizers
Multiple column anonymizers are easy as well. You may simply add a raw entry list, as such:
Or from a CSV file:
Where the CSV file is:
Text patterns in multiple column anonymizers
You may directly use all other anonymizers to generate a "row pattern" for column anonymizers, for example:
Note that you write a single row here, but each new row will be generated from the embedded string patterns, using the same method as described above.
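As an illustration of how a single row pattern can yield many different rows, here is a small in-memory simulation in Python. It is only a sketch: the PR generates SQL instead of rendering values in application code. `render_row` and the `samples` mapping (anonymizer name to candidate values) are hypothetical.

```python
import random
import re

REFERENCE = re.compile(r"\{([A-Za-z0-9_.]+)\}")
INT_RANGE = re.compile(r"\[(-?\d+)\s*[-,]\s*(-?\d+)\]")


def render_row(row_pattern: list, samples: dict, rng: random.Random) -> list:
    """Render one output row from a row pattern, one cell per column."""

    def render_cell(cell: str) -> str:
        # Replace each integer range with a random integer in that range.
        cell = INT_RANGE.sub(
            lambda m: str(rng.randint(int(m.group(1)), int(m.group(2)))), cell
        )
        # Replace each reference with a random value from the named anonymizer.
        return REFERENCE.sub(lambda m: rng.choice(samples[m.group(1)]), cell)

    return [render_cell(cell) for cell in row_pattern]
```

Calling `render_row` repeatedly with the same pattern and the same `random.Random` instance draws fresh values each time, which mirrors the "one pattern, many rows" behaviour described above.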
Technically
@todo Pack registry and factories
And now?
OK, first shot at this. I wanted to open a PR to see the diff somewhere other than my IDE or my terminal. It's still unfinished.