Reason for not applying remove_non_prining_characters normalization #416

Open
@JoeyOhman

Hi,

We are greatly inspired by this work and are in the process of cleaning our own data. However, if we understand correctly, the remove_non_prining_characters normalization step is not used for the final cleaning. Do you have any thoughts on why it should not be used?

The pattern is currently defined as:

non_printing_characters_re = re.compile(
    f"[{''.join(map(chr, list(range(0,32)) + list(range(127,160))))}]"
)

We modified this to keep newlines (\n) and tabs (\t), and to also remove soft hyphens, non-breaking spaces, and zero-width spaces:

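# 160 = non-breaking space, 173 = soft hyphen, 8203 = zero-width space
additional_chars_to_remove = [160, 173, 8203]
# Skipping code points 9 (\t) and 10 (\n) keeps tabs and newlines in the text.
non_printing_characters_re = re.compile(
    f"[{''.join(map(chr, list(range(0,9)) + list(range(11, 32)) + list(range(127,160)) + additional_chars_to_remove))}]"
)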

There could of course be more characters that one may want to remove.
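
To make the effect of the modified pattern concrete, here is a minimal, self-contained sketch applied to a made-up sample string. The helper name `remove_non_printing` and the sample text are assumptions for illustration only and are not taken from the repository.

```python
import re

# Code points removed in addition to the control ranges:
# 160 = non-breaking space, 173 = soft hyphen, 8203 = zero-width space.
additional_chars_to_remove = [160, 173, 8203]

chars_to_remove = (
    list(range(0, 9))          # control characters before tab (9)
    + list(range(11, 32))      # control characters after newline (10), including \r
    + list(range(127, 160))    # DEL and the C1 control block
    + additional_chars_to_remove
)
non_printing_characters_re = re.compile(
    f"[{''.join(map(chr, chars_to_remove))}]"
)


def remove_non_printing(text: str) -> str:
    """Hypothetical helper: delete every character matched by the pattern."""
    return non_printing_characters_re.sub("", text)


# Made-up sample containing a soft hyphen, tab, newline, zero-width space,
# NUL byte, and non-breaking space.
sample = "soft\u00adhyphen\tkept\nzero\u200bwidth\x00nbsp\u00a0end"
cleaned = remove_non_printing(sample)
print(repr(cleaned))
```

Tabs and newlines survive, while the other listed characters are deleted. Because the non-breaking space is deleted rather than replaced, the words on either side of it are joined; replacing it with a regular space instead would be a separate design choice.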

To be clear, I am writing this here for two reasons:

  1. To get your feedback. Do you think this is a good idea to use for the final data cleaning?
  2. If so, this could be incorporated into this repository to help other people that might be thinking about this.

Thanks for your amazing contributions!
