@pounard pounard commented Jul 21, 2025

The idea is to let users create PHP-less packs, defined by a YAML file and loading data from plain text or CSV files. The main goal is to allow easy extension without any PHP knowledge.

Features

General

Anyone can create a custom db_tools.pack.yaml file, which defines a new pack. The basic information looks like this:

name: fr-fr
data: []

Then, in the data section, you can add one or more anonymizers. Let's start with a simple one: a raw data list written directly in the YAML file.

Enum (single column) anonymizers

data:
    address_street_prefix:
        data: [rue, avenue, impasse, voie, chemin, route]

This means that the fr_fr.address_street_prefix enum anonymizer is then exposed, with the given values.

If you have many entries and want to place them in a file instead, you may simply reference the file using a relative path (relative to the directory of the main YAML entrypoint file), as such:

data:
    # ...
    address_street_name:
        data: ./resources/address_street_names.txt

Considering the file contains:

des fleurs
du général chose
de la grand haie
mercoeur

Then the fr_fr.address_street_name anonymizer will be exposed, using each plain text line as a value for the enum data list.

If you use a CSV file instead, only the first column will be fetched.
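To make the loading rules concrete, here is a minimal Python sketch (the real implementation is PHP and may differ). The function name `load_enum_values` and the `;` default delimiter are assumptions made for illustration only.

```python
import csv
from pathlib import Path
from typing import List, Union


def load_enum_values(pack_file: str, source: Union[List[str], str],
                     delimiter: str = ";") -> List[str]:
    """Return enum values from an inline list or from a file referenced by the pack.

    Hypothetical helper: the name and the default delimiter are assumptions.
    """
    if isinstance(source, list):
        # Inline list written directly in the YAML file.
        return [str(value) for value in source]
    # Relative paths are resolved against the directory of the pack YAML file.
    path = (Path(pack_file).parent / source).resolve()
    with path.open(encoding="utf-8", newline="") as handle:
        if path.suffix.lower() == ".csv":
            # CSV input: only the first column of each non-empty row is kept.
            return [row[0] for row in csv.reader(handle, delimiter=delimiter) if row]
        # Plain text input: one value per non-empty line.
        return [line.strip() for line in handle if line.strip()]
```

For instance, `load_enum_values("/some/pack/db_tools.pack.yaml", "./resources/address_street_names.txt")` would read `/some/pack/resources/address_street_names.txt`.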

Text patterns

One other feature that this brings is the ability to write text lines that concatenate text from multiple anonymizers, such as:

data:
    # ...
    address_street:
        generated: "[0-2000] {address_street_prefix} {self.address_street_name}"

Where:

  • [n-m] is a range of integers. Because integers can be negative, the [n,m] variant is also accepted and parsed the same way.
  • {address_street_prefix} will fetch a random value from the fr_fr.address_street_prefix anonymizer.
  • {self.address_street_name} will fetch a random value from the fr_fr.address_street_name anonymizer. Here, self is an alias for the current pack name: this allows disambiguating from existing core anonymizers, which don't require any prefix.
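The syntax above can be sketched as a small tokenizer. This is a Python illustration only, not the PHP parser from this PR: the `TOKEN` regex, the `parse_pattern` name and the token tuples are all assumptions. Unprefixed names are left untouched because their resolution order is not specified here.

```python
import re
from typing import List, Tuple

# Either an integer range "[n-m]" / "[n,m]" or a "{name}" reference.
TOKEN = re.compile(
    r"\[(?P<low>-?\d+)\s*[-,]\s*(?P<high>-?\d+)\]"
    r"|\{(?P<ref>[A-Za-z0-9_.\-]+)\}"
)


def parse_pattern(pattern: str, pack: str) -> List[Tuple[str, object]]:
    """Split a pattern into ("text", str), ("range", (low, high)) and ("ref", name) tokens."""
    tokens: List[Tuple[str, object]] = []
    position = 0
    for match in TOKEN.finditer(pattern):
        if match.start() > position:
            # Raw text between two tokens is kept verbatim.
            tokens.append(("text", pattern[position:match.start()]))
        if match.group("ref") is not None:
            name = match.group("ref")
            if name.startswith("self."):
                # "self" is an alias for the current pack name.
                name = pack + name[len("self"):]
            tokens.append(("ref", name))
        else:
            tokens.append(("range", (int(match.group("low")), int(match.group("high")))))
        position = match.end()
    if position < len(pattern):
        tokens.append(("text", pattern[position:]))
    return tokens
```

Calling `parse_pattern("[0-2000] {address_street_prefix} {self.address_street_name}", "fr_fr")` yields a range token, two single-space text tokens and two reference tokens, the last one being `fr_fr.address_street_name`.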

The most important part of this is that the generated SQL will be a CONCAT(expr, expr, expr) where each expr will be the generated SQL from the target anonymizer. This makes the whole SQL completely random, and doesn't require any sample table.
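As a rough illustration of the CONCAT idea, the following Python sketch assembles such an expression from already-parsed tokens. Everything here is hypothetical: the token tuples, the `build_concat` name, the `sql_for` callback standing in for "the generated SQL from the target anonymizer", and the PostgreSQL-flavored random integer expression.

```python
from typing import Callable, Iterable, Tuple


def quote_literal(text: str) -> str:
    """Quote a raw string as an SQL literal, doubling single quotes."""
    return "'" + text.replace("'", "''") + "'"


def build_concat(tokens: Iterable[Tuple[str, object]],
                 sql_for: Callable[[str], str]) -> str:
    """Build a CONCAT(expr, expr, ...) expression from pattern tokens.

    `sql_for` is a stand-in (assumption) for whatever SQL the target
    anonymizer generates for a given anonymizer name.
    """
    parts = []
    for kind, value in tokens:
        if kind == "text":
            parts.append(quote_literal(value))
        elif kind == "range":
            low, high = value
            # Assumption: PostgreSQL-style random integer in [low, high].
            parts.append(f"CAST(FLOOR(RANDOM() * {high - low + 1} + {low}) AS INT)")
        elif kind == "ref":
            parts.append("(" + sql_for(value) + ")")
        else:
            raise ValueError(f"unknown token kind: {kind}")
    return "CONCAT(" + ", ".join(parts) + ")"
```

Each referenced anonymizer contributes its own expression, so no sample table is involved in this sketch.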

This technical solution might be reevaluated later, since I don't have any performance numbers yet. If it happens to be too slow, then we will have to make sure that the initialization generates a sample table prior to anonymizing the full database.

Multiple column anonymizers

Multiple column anonymizers are easy as well. You may simply add a raw entry list as such:

data:
    # ...
    # Arbitrary data list with named columns
    address_hexasmal:
        columns: [code_insee, locality, postal_code, dependent_locality]
        data:
            - ["01001", L ABERGEMENT CLEMENCIAT, "01400", L ABERGEMENT CLEMENCIAT]
            - ["01002", L ABERGEMENT DE VAREY, "01640", L ABERGEMENT DE VAREY]
            - ["01004", AMBERIEU EN BUGEY, "01500", AMBERIEU EN BUGEY]

Or from a CSV file:

data:
    # ...
    address_hexasmal:
        # A null entry ignores the matching CSV column: it will appear neither
        # when listing data sources using tooling nor in generated documentation.
        columns: [code_insee, locality, postal_code, null, dependent_locality]
        data: ./resources/address/hexasmal.csv
        csv_skip_header: true

Where the CSV file is:

#Code_commune_INSEE;Nom_de_la_commune;Code_postal;Libellé_d_acheminement;Ligne_5
01001;L ABERGEMENT CLEMENCIAT;01400;L ABERGEMENT CLEMENCIAT;
01002;L ABERGEMENT DE VAREY;01640;L ABERGEMENT DE VAREY;
01004;AMBERIEU EN BUGEY;01500;AMBERIEU EN BUGEY;
...
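The column mapping can be illustrated with a short Python sketch (again, not the actual PHP code). The `load_rows` name is hypothetical, and the `;` delimiter is only inferred from the sample file above. YAML `null` entries become `None` and the matching CSV field is dropped.

```python
import csv
from typing import Dict, List, Optional


def load_rows(path: str, columns: List[Optional[str]],
              skip_header: bool = False, delimiter: str = ";") -> List[Dict[str, str]]:
    """Map each CSV record onto the declared column names, dropping None columns."""
    rows = []
    with open(path, encoding="utf-8", newline="") as handle:
        reader = csv.reader(handle, delimiter=delimiter)
        if skip_header:
            # Equivalent of csv_skip_header: true, the first line is ignored.
            next(reader, None)
        for record in reader:
            if not record:
                continue
            rows.append({
                name: value
                for name, value in zip(columns, record)
                if name is not None  # a null column name ignores that CSV field
            })
    return rows
```

With the columns declared above, the first data line would map to code_insee `01001`, locality `L ABERGEMENT CLEMENCIAT`, postal_code `01400` and an empty dependent_locality. Whether empty fields should be treated as NULL is left open in this sketch.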

Text patterns in multiple column anonymizers

You may directly use all other anonymizers to generate a "row pattern" for column anonymizers, for example:

data:
    # ...
    address:
        columns: [country, locality, region, postal_code, street_address]
        generated:
            # This is a raw string, hardcoded.
            - FRANCE
            # Uses columns from another column anonymizer. Note that listing more than one
            # column here does a COALESCE(dependent_locality, locality).
            - "{address_hexasmal.dependent_locality|address_hexasmal.locality}"
            - REGION TODO
            # When you use the same column anonymizer more than once, all values
            # in a single row will be fetched from the same sample row: this ensures
            # consistency in results.
            - "{address_hexasmal.postal_code}"
            # Single value from datalists.
            - "{address_street}"

Note that here you write a single row, but each new row will be generated from the embedded string patterns, using the same method as described above.
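The two rules from the comments above (COALESCE over `|` alternatives, and one shared sample row per column anonymizer) can be simulated in memory with a small Python sketch. This only mimics the intended semantics. The real feature generates SQL, and `generate_row`, `sources` and the data shapes are assumptions.

```python
import random
import re
from typing import Dict, List

REFERENCE = re.compile(r"\{([^{}]+)\}")


def generate_row(patterns: List[str], sources: Dict[str, list],
                 rng: random.Random) -> List[str]:
    """Expand each pattern of a row; `sources` maps names to sample data (assumed shape)."""
    picked: Dict[str, dict] = {}  # one sample row per column anonymizer, shared by the whole row

    def lookup(reference: str):
        if "." not in reference:
            # Single value data list: any random value will do.
            return rng.choice(sources[reference])
        source, column = reference.rsplit(".", 1)
        if source not in picked:
            picked[source] = rng.choice(sources[source])
        return picked[source].get(column)

    def expand(match: "re.Match") -> str:
        # "a|b" behaves like COALESCE(a, b): first non-null alternative wins.
        for alternative in match.group(1).split("|"):
            value = lookup(alternative.strip())
            if value is not None:
                return value
        return ""

    return [REFERENCE.sub(expand, pattern) for pattern in patterns]
```

Because the sample row is picked once per generated row, the locality and the postal code always come from the same source line.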

Technically

@todo Pack registry and factories

And now?

OK, first shot at this. I wanted to open a PR to see the diff somewhere other than my IDE or my terminal. It's not finished yet.

@pounard pounard force-pushed the datalist-3-pack branch 2 times, most recently from 7f0ab92 to 5c85ba1 Compare July 22, 2025 11:27
@pounard pounard added this to the 3.0.0 milestone Jul 22, 2025
@pounard pounard added the enhancement New feature or request label Jul 22, 2025
@pounard pounard self-assigned this Jul 24, 2025
@pounard pounard changed the base branch from main to datalist-2quater-file-column July 24, 2025 13:36
@pounard pounard changed the title feature: php-less anonymizers packs feature: no-code anonymizers packs Jul 25, 2025
@pounard pounard force-pushed the datalist-2quater-file-column branch 2 times, most recently from ee5cfa6 to 70ef59b Compare October 18, 2025 14:07

pounard commented Oct 18, 2025

Note for myself:

  • the current implementation represents all anonymizers as intermediate classes, and parses and validates everything at once,
  • another possible implementation would be to be lazy and directly instantiate anonymizers when asked for.
