
Dataset Cleaning Process #208

@KennethEnevoldsen


Get an overview of filters:

  • Create an overview of cleaning taggers #207
    • Figure out the difference between the PII taggers (which ones are most relevant) - we might just start using the fastest one
    • Test NSFW filters (is what they filter out problematic for Danish?) (minor)
    • Test hate-speech filters (is what they filter out problematic for Danish?) (minor)

Check that the filters are valid

  • Apply all filters to DAGW (whole dataset) and see if any of the filters are problematic (none should filter out an extreme amount of data, since we consider the dataset fairly clean)

    • E.g. the current Gopher filters might use thresholds tuned for English, which could be problematic for us.
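
A minimal sketch of the check described above: run each filter over the documents and report the fraction it would remove, so that overly aggressive filters stand out. Everything here is an assumption made for illustration: the stop-word filter, both word lists, the `max_rate` cut-off, and the function names are hypothetical and are not the project's actual taggers.

```python
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical word lists, used only to illustrate an English-tuned rule.
ENGLISH_STOP_WORDS = {"the", "be", "to", "of", "and", "that", "have", "with"}
DANISH_STOP_WORDS = {"og", "i", "at", "det", "en", "den", "til", "er"}


def make_stop_word_filter(stop_words: Set[str], min_hits: int = 2) -> Callable[[str], bool]:
    """Return a filter that keeps a document with at least `min_hits` distinct stop words."""

    def keep(text: str) -> bool:
        tokens = set(text.lower().split())
        return len(tokens & stop_words) >= min_hits

    return keep


def removal_rates(docs: Iterable[str], filters: Dict[str, Callable[[str], bool]]) -> Dict[str, float]:
    """Fraction of documents each filter would remove (0.0 for an empty dataset)."""
    docs = list(docs)
    rates = {}
    for name, keep in filters.items():
        removed = sum(1 for doc in docs if not keep(doc))
        rates[name] = removed / len(docs) if docs else 0.0
    return rates


def flag_aggressive(rates: Dict[str, float], max_rate: float = 0.2) -> List[str]:
    """Names of filters that remove more than `max_rate` of the documents."""
    return sorted(name for name, rate in rates.items() if rate > max_rate)


if __name__ == "__main__":
    sample = [
        "Det er en god dag og solen skinner",
        "Jeg går til byen i dag",
        "the cat and the dog",
    ]
    filters = {
        "english_stop_words": make_stop_word_filter(ENGLISH_STOP_WORDS),
        "danish_stop_words": make_stop_word_filter(DANISH_STOP_WORDS),
    }
    rates = removal_rates(sample, filters)
    print(rates)
    print(flag_aggressive(rates, max_rate=0.5))
```

On a dataset we consider fairly clean, a filter that shows up in `flag_aggressive` is a candidate for having language-specific thresholds rather than for the data being bad.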

Start applying filters to the dataset

  • Formatting datasets to jsonl.gz and applying filters to them.
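
A rough sketch of the formatting step, using only the Python standard library: one JSON object per line, gzip-compressed. The record fields (`id`, `text`, `source`) and the function names are assumptions for illustration; the actual schema depends on what the filtering tool expects.

```python
import gzip
import json
from typing import Dict, Iterable, Iterator


def write_jsonl_gz(records: Iterable[Dict], path: str) -> int:
    """Write records as gzip-compressed JSON Lines; return the number written."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for record in records:
            # ensure_ascii=False keeps Danish characters (æ, ø, å) readable.
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
            count += 1
    return count


def read_jsonl_gz(path: str) -> Iterator[Dict]:
    """Yield records from a gzip-compressed JSON Lines file, skipping blank lines."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)


if __name__ == "__main__":
    # Hypothetical records; field names are assumed, not taken from the project.
    docs = [
        {"id": "doc-0", "text": "Rødgrød med fløde", "source": "example"},
        {"id": "doc-1", "text": "En kort tekst", "source": "example"},
    ]
    write_jsonl_gz(docs, "example.jsonl.gz")
    for doc in read_jsonl_gz("example.jsonl.gz"):
        print(doc["id"], doc["text"])
```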

Decide on reasonable thresholds

  • Once we have run the analysis, we would like to set a reasonable set of starting thresholds (specified in the config), which we can then vary.
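
One possible shape for config-driven thresholds, sketched with a frozen dataclass whose defaults can be overridden from a config mapping. The field names and default values below are placeholders chosen for illustration, not results of the analysis, and `FilterThresholds`, `from_config`, and `passes` are hypothetical names.

```python
from dataclasses import dataclass, fields, replace
from typing import Dict


@dataclass(frozen=True)
class FilterThresholds:
    # Placeholder starting values; to be replaced once the analysis has been run.
    min_words: int = 50
    max_words: int = 100_000
    min_mean_word_length: float = 3.0
    max_mean_word_length: float = 10.0


def from_config(overrides: Dict[str, float]) -> FilterThresholds:
    """Build thresholds from defaults plus overrides, rejecting unknown keys."""
    known = {f.name for f in fields(FilterThresholds)}
    unknown = set(overrides) - known
    if unknown:
        raise KeyError(f"Unknown threshold(s): {sorted(unknown)}")
    return replace(FilterThresholds(), **overrides)


def passes(text: str, thresholds: FilterThresholds) -> bool:
    """Return True if the document's word count and mean word length are within bounds."""
    words = text.split()
    if not words:
        return False
    mean_length = sum(len(w) for w in words) / len(words)
    return (
        thresholds.min_words <= len(words) <= thresholds.max_words
        and thresholds.min_mean_word_length <= mean_length <= thresholds.max_mean_word_length
    )


if __name__ == "__main__":
    relaxed = from_config({"min_words": 3})
    print(passes("dette er en test", relaxed))
    print(passes("dette er en test", FilterThresholds()))
```

Rejecting unknown keys means a typo in the config fails loudly instead of silently leaving a default in place.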

@peterbjorgensen does this seem like a reasonable approach to you as well?
