feature: no-code anonymizers packs by pounard · Pull Request #223 · makinacorpus/DbToolsBundle

pounard · 2025-07-21T21:15:31Z

The idea is to let users create PHP-less packs, defined by a YAML file and loading their data from plain text or CSV files. The main goal is to allow easy extension without any PHP knowledge.

Features

General

Anyone can create a custom db_tools.pack.yaml file, which defines a new pack. The basic information looks like this:

name: fr-fr
data: []

Then, in the data section, you can add one or more anonymizers. Let's start with a simple one: a raw data list written directly in the YAML file.

Enum (single column) anonymizers

data:
    address_street_prefix:
        data: [rue, avenue, impasse, voie, chemin, route]

This means that the fr_fr.address_street_prefix enum anonymizer is then exposed, with the given values.

If you have many entries and want to place them in a file instead, you may simply reference the file using a relative path (relative to the directory of the main YAML entrypoint file), as such:

data:
    # ...
    address_street_name:
        data: ./resources/address_street_names.txt

Considering the file contains:

des fleurs
du général chose
de la grand haie
mercoeur

Then the fr_fr.address_street_name anonymizer will be exposed, using each plain text line as a value for the enum data list.

If you use a CSV file instead, only the first column will be fetched.
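The loading rule described above can be sketched as follows. This is only an illustration in Python, not the bundle's PHP implementation: the function name `load_enum_values`, the `;` delimiter (borrowed from the CSV sample shown later in this description) and the choice to skip blank lines are all assumptions.

```python
import csv
import io


def load_enum_values(text: str, is_csv: bool = False, delimiter: str = ";") -> list[str]:
    """Return one enum value per non-empty line (first column only for CSV).

    Hypothetical helper: blank lines are skipped, which is an assumption.
    """
    if is_csv:
        rows = csv.reader(io.StringIO(text), delimiter=delimiter)
        # Only the first column of each record is kept.
        return [row[0].strip() for row in rows if row and row[0].strip()]
    return [line.strip() for line in text.splitlines() if line.strip()]


# Content of ./resources/address_street_names.txt from the example above.
STREET_NAMES = "des fleurs\ndu général chose\nde la grand haie\nmercoeur\n"
values = load_enum_values(STREET_NAMES)
```

With a CSV input, `load_enum_values("a;b\nc;d\n", is_csv=True)` keeps only `a` and `c`.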

Text patterns

Another feature this brings is the ability to write text patterns that concatenate text from multiple anonymizers, such as:

data:
    # ...
    address_street:
        generated: "[0-2000] {address_street_prefix} {self.address_street_name}"

Where:

[n-m] is a range of integers. Because integers can be negative, the [n,m] variant is also accepted and parsed the same way.
{address_street_prefix} will fetch a random value from the fr_fr.address_street_prefix anonymizer.
{self.address_street_name} will fetch a random value from the fr_fr.address_street_name anonymizer. Here, self is an alias for the current pack name: this allows disambiguating from existing core anonymizers, which don't require any prefix.
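To make the three token kinds concrete, here is a minimal tokenizer sketch in Python. It is not the parser from this PR: the class names, the regular expression and the decision to leave unprefixed references unresolved (pack set to `None`, so that a registry can decide later between pack and core anonymizers) are assumptions made for illustration.

```python
import re
from dataclasses import dataclass
from typing import Optional, Union

TOKEN_RE = re.compile(
    r"\[(?P<lo>-?\d+)\s*[-,]\s*(?P<hi>-?\d+)\]"  # [n-m] or [n,m]
    r"|\{(?P<ref>[A-Za-z0-9_.]+)\}"              # {name} or {self.name}
)


@dataclass(frozen=True)
class Literal:
    text: str


@dataclass(frozen=True)
class IntRange:
    low: int
    high: int


@dataclass(frozen=True)
class Reference:
    pack: Optional[str]  # None means "no prefix given", resolved elsewhere
    name: str


Token = Union[Literal, IntRange, Reference]


def tokenize(pattern: str, current_pack: str) -> list[Token]:
    """Split a text pattern into literal, integer range and reference tokens."""
    tokens: list[Token] = []
    pos = 0
    for match in TOKEN_RE.finditer(pattern):
        if match.start() > pos:
            tokens.append(Literal(pattern[pos:match.start()]))
        ref = match.group("ref")
        if ref is None:
            tokens.append(IntRange(int(match.group("lo")), int(match.group("hi"))))
        elif ref.startswith("self."):
            # "self" is an alias for the pack currently being parsed.
            tokens.append(Reference(current_pack, ref[len("self."):]))
        else:
            tokens.append(Reference(None, ref))
        pos = match.end()
    if pos < len(pattern):
        tokens.append(Literal(pattern[pos:]))
    return tokens


tokens = tokenize("[0-2000] {address_street_prefix} {self.address_street_name}", "fr_fr")
```

The example pattern yields five tokens: an integer range, a space, an unprefixed reference, a space, and a reference bound to the `fr_fr` pack.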

The most important part of this is that the generated SQL will be a CONCAT(expr, expr, expr) where each expr is the SQL generated by the target anonymizer. Every value is thus drawn at random directly in SQL, and no sample table is required.
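The shape of that expression can be sketched as follows. This Python snippet only assembles a string: the helper names are hypothetical, the random integer expression is a PostgreSQL-flavoured assumption, and the sub-expressions of the referenced anonymizers are opaque placeholders, since the SQL actually generated by this PR is not shown here.

```python
def sql_literal(text: str) -> str:
    """Quote a hardcoded string as an SQL literal (single quotes doubled)."""
    return "'" + text.replace("'", "''") + "'"


def random_int_sql(low: int, high: int) -> str:
    """Random integer in [low, high], cast to text (PostgreSQL-style assumption)."""
    return f"CAST(FLOOR(RANDOM() * {high - low + 1}) + ({low}) AS TEXT)"


def concat_sql(parts: list[tuple[str, object]]) -> str:
    """Build CONCAT(expr, expr, ...) from (kind, value) pairs.

    kind is "literal" (str), "range" ((low, high)) or "ref" (SQL already
    generated by the target anonymizer, treated here as an opaque string).
    """
    exprs = []
    for kind, value in parts:
        if kind == "literal":
            exprs.append(sql_literal(str(value)))
        elif kind == "range":
            low, high = value
            exprs.append(random_int_sql(low, high))
        elif kind == "ref":
            exprs.append(f"({value})")
        else:
            raise ValueError(f"unknown part kind: {kind}")
    return "CONCAT(" + ", ".join(exprs) + ")"


# "[0-2000] {address_street_prefix} {self.address_street_name}"
EXAMPLE_PARTS = [
    ("range", (0, 2000)),
    ("literal", " "),
    ("ref", "<prefix_expr>"),  # placeholder for the prefix anonymizer SQL
    ("literal", " "),
    ("ref", "<name_expr>"),    # placeholder for the street name anonymizer SQL
]
sql = concat_sql(EXAMPLE_PARTS)
```

Because each sub-expression is evaluated per row by the database, no intermediate sample table is involved in this sketch either.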

This technical solution might be reevaluated later, since I don't have any performance numbers yet. If it turns out to be too slow, we will have to make sure that the initialization generates a sample table before anonymizing the full database.

Multiple column anonymizers

Multiple column anonymizers are easy as well: you may simply add a raw entry list, as such:

data:
    # ...
    # Arbitrary data list with columns
    address_hexasmal:
        columns: [code_insee, locality, postal_code, dependent_locality]
        data:
            # Codes are quoted so that YAML keeps them as strings with their leading zero.
            - ["01001", L ABERGEMENT CLEMENCIAT, "01400", L ABERGEMENT CLEMENCIAT]
            - ["01002", L ABERGEMENT DE VAREY, "01640", L ABERGEMENT DE VAREY]
            - ["01004", AMBERIEU EN BUGEY, "01500", AMBERIEU EN BUGEY]

Or from a CSV file:

data:
    # ...
    address_hexasmal:
        # A null entry ignores the corresponding CSV column: it will appear neither
        # when listing data sources using tooling nor when generating documentation.
        columns: [code_insee, locality, postal_code, null, dependent_locality]
        data: ./resources/address/hexasmal.csv
        csv_skip_header: true

Where the CSV file is:

#Code_commune_INSEE;Nom_de_la_commune;Code_postal;Libellé_d_acheminement;Ligne_5
01001;L ABERGEMENT CLEMENCIAT;01400;L ABERGEMENT CLEMENCIAT;
01002;L ABERGEMENT DE VAREY;01640;L ABERGEMENT DE VAREY;
01004;AMBERIEU EN BUGEY;01500;AMBERIEU EN BUGEY;
...
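The column mapping above, including the null entry and csv_skip_header, can be illustrated with a short Python sketch. The function `load_rows` is hypothetical and the `;` delimiter is inferred from the sample file; the real loader in this PR may behave differently on short or malformed records.

```python
import csv
import io
from typing import Optional


def load_rows(
    text: str,
    columns: list[Optional[str]],
    skip_header: bool = False,
    delimiter: str = ";",
) -> list[dict[str, str]]:
    """Map each CSV record onto the declared column names.

    A None entry in `columns` drops the CSV field at that position.
    """
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    if skip_header:
        next(reader, None)  # discard the first record
    rows = []
    for record in reader:
        if not record:
            continue
        row = {}
        for index, name in enumerate(columns):
            if name is None:
                continue
            # Missing trailing fields become empty strings (assumption).
            row[name] = record[index] if index < len(record) else ""
        rows.append(row)
    return rows


HEXASMAL_CSV = (
    "#Code_commune_INSEE;Nom_de_la_commune;Code_postal;Libellé_d_acheminement;Ligne_5\n"
    "01001;L ABERGEMENT CLEMENCIAT;01400;L ABERGEMENT CLEMENCIAT;\n"
    "01002;L ABERGEMENT DE VAREY;01640;L ABERGEMENT DE VAREY;\n"
)
COLUMNS = ["code_insee", "locality", "postal_code", None, "dependent_locality"]
rows = load_rows(HEXASMAL_CSV, COLUMNS, skip_header=True)
```

Reading the file as text keeps the leading zeros of the INSEE and postal codes, and the fourth CSV field never reaches the resulting rows.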

Text patterns in multiple column anonymizers

You may directly use all other anonymizers to generate a "row pattern" for column anonymizers, for example:

data:
    # ...
    address:
        columns: [country, locality, region, postal_code, street_address]
        generated:
            # This is a raw string, hardcoded.
            - FRANCE
            # Uses columns from another column anonymizer. Note that this does a
            # COALESCE(dependent_locality, locality) over more than one column.
            - "{address_hexasmal.dependent_locality|address_hexasmal.locality}"
            - REGION TODO
            # When you use the same column anonymizer more than once, all values
            # in a single row will be fetched from the same sample row: this ensures
            # consistent results.
            - "{address_hexasmal.postal_code}"
            # Single value from datalists.
            - "{address_street}"

Note that you write a single row here, but each new row will be generated from the embedded string patterns, using the same method as described above.
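The two rules in the comments above (fallback across columns, and a single sample row per column anonymizer within one generated row) can be sketched in Python. This is an in-memory illustration, not the SQL generated by this PR: `generate_row` is a hypothetical name, empty strings stand in for NULL in the COALESCE-like fallback, and `address_street` is simplified to a plain list of values.

```python
import random
import re

REF_RE = re.compile(r"\{([^{}]+)\}")


def generate_row(patterns, tables, enums, rng):
    """Expand each column pattern into a value for one output row."""
    sample_rows = {}  # column anonymizer name -> sample row picked for this output row

    def resolve(ref: str) -> str:
        # "a.x|a.y" behaves like COALESCE(a.x, a.y): first non-empty value wins.
        for candidate in ref.split("|"):
            source, dot, column = candidate.strip().partition(".")
            if dot:
                # Reuse the same sample row for every reference to this source.
                if source not in sample_rows:
                    sample_rows[source] = rng.choice(tables[source])
                value = sample_rows[source].get(column, "")
            else:
                value = rng.choice(enums[source])
            if value:
                return value
        return ""

    return [REF_RE.sub(lambda m: resolve(m.group(1)), p) for p in patterns]


TABLES = {
    "address_hexasmal": [
        {"locality": "A", "postal_code": "01400", "dependent_locality": ""},
        {"locality": "B", "postal_code": "01640", "dependent_locality": "B2"},
    ]
}
ENUMS = {"address_street": ["12 rue des fleurs"]}
PATTERNS = [
    "FRANCE",
    "{address_hexasmal.dependent_locality|address_hexasmal.locality}",
    "{address_hexasmal.postal_code}",
    "{address_street}",
]
row = generate_row(PATTERNS, TABLES, ENUMS, random.Random(0))
```

Whatever sample row is drawn, the locality and the postal code always come from the same source row, so a generated address never mixes the locality of one commune with the postal code of another.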

Technically

@todo Pack registry and factories

And now?

OK, first shot at this: I wanted to open a PR to see the diff somewhere other than my IDE or my terminal. It's not finished yet.

pounard · 2025-10-18T16:12:56Z

Note to self:

the current implementation represents all anonymizers as intermediate classes, and parses and validates everything at once,
another possible implementation would be to be lazy and instantiate anonymizers directly when asked for.

pounard force-pushed the datalist-3-pack branch 2 times, most recently from 7f0ab92 to 5c85ba1 Compare July 22, 2025 11:27

pounard added this to the 3.0.0 milestone Jul 22, 2025

pounard added the enhancement New feature or request label Jul 22, 2025

pounard self-assigned this Jul 24, 2025

pounard force-pushed the datalist-3-pack branch from 5c85ba1 to fbb155a Compare July 24, 2025 13:35

pounard changed the base branch from main to datalist-2quater-file-column July 24, 2025 13:36

pounard changed the title ~~feature: php-less anonymizers packs~~ feature: no-code anonymizers packs Jul 25, 2025

pounard force-pushed the datalist-2quater-file-column branch 2 times, most recently from ee5cfa6 to 70ef59b Compare October 18, 2025 14:07

pounard added 2 commits October 18, 2025 16:24

feature: no code pack

18fe9d7

PENDING WORK RESET ME

879bfef

pounard force-pushed the datalist-3-pack branch from fbb155a to 879bfef Compare October 18, 2025 15:10


pounard wants to merge 2 commits into datalist-2quater-file-column from datalist-3-pack
