AlekseyKorshuk/chat-data-pipeline

Chat Data Pipeline

This repository helps to clean, filter and deduplicate conversation datasets.

The open-source community rules the world, so please contribute: open a pull request or create an issue. If you find the project useful, star the repository.

Quick Start

Clone and install dependencies:

git clone https://github.com/AlekseyKorshuk/chat-data-pipeline
cd chat-data-pipeline
pip install -r requirements.txt

Next, prepare a very small instruction dataset:

python3 main.py --config_path ./experiments/tiny-example.yaml

You can take a look at the YAML file to discover the structure of the config.

Each sample in the initial dataset has the following structure:

{
  "conversation": [
    {
      "content": "Explain the main differences between an alligator and a crocodile.",
      "do_train": false,
      "role": "User"
    },
    {
      "content": "Alligators and crocodiles belong to the same order, Crocodilia, but they have several differences. 1) Shape of the snout: Alligators have a U-shaped, wider snout, while crocodiles have a more pointed, V-shaped snout. 2) Teeth placement: In an alligator, lower teeth are mostly hidden when its mouth is closed, while in a crocodile, the fourth lower tooth is visible even when the mouth is closed. 3) Habitat: Alligators are mostly found in freshwater habitats such as swamps and rivers, while crocodiles can be found in both freshwater and saltwater habitats. 4) Distribution: Alligators are mainly found in the southeastern United States and parts of China, whereas crocodiles have a more widespread distribution across Africa, Asia, the Americas, and Australia.",
      "do_train": true,
      "role": "Assistant"
    }
  ]
}

A sample can contain more conversation turns: User, Assistant, User, Assistant, and so on.

The very first item in the list can also have the role "System".
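The structure described above can be checked with a few lines of Python before running the pipeline. This is a minimal sketch, not part of the repository: `validate_sample` and `ALLOWED_ROLES` are hypothetical names, and the rules it enforces (three known roles, "System" only in the first position, `content` as a string, `do_train` as a boolean) are inferred from the sample shown here rather than taken from the pipeline's code.

```python
from typing import Any, Dict, List

# Roles mentioned in this README; the pipeline itself may accept others.
ALLOWED_ROLES = {"System", "User", "Assistant"}


def validate_sample(sample: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in one sample (empty if none)."""
    conversation = sample.get("conversation")
    if not isinstance(conversation, list) or not conversation:
        return ["'conversation' must be a non-empty list"]

    problems = []
    for index, message in enumerate(conversation):
        role = message.get("role")
        if role not in ALLOWED_ROLES:
            problems.append(f"item {index}: unknown role {role!r}")
        if role == "System" and index != 0:
            problems.append(f"item {index}: 'System' is only allowed first")
        if not isinstance(message.get("content"), str):
            problems.append(f"item {index}: 'content' must be a string")
        if not isinstance(message.get("do_train"), bool):
            problems.append(f"item {index}: 'do_train' must be a boolean")
    return problems


good_sample = {
    "conversation": [
        {"content": "You are a helpful assistant.", "do_train": False, "role": "System"},
        {"content": "Name one reptile.", "do_train": False, "role": "User"},
        {"content": "An alligator.", "do_train": True, "role": "Assistant"},
    ]
}
# Same messages, but with the System item moved out of the first position.
bad_sample = {"conversation": list(reversed(good_sample["conversation"]))}

print(validate_sample(good_sample))
print(validate_sample(bad_sample))
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a sample at once.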

Custom Setup

More generally, the pipeline works with any dataset that has a string column. Here is an example:

from datasets import load_dataset

from chat_data_pipeline import utils
from chat_data_pipeline.preprocessor import DataPreprocessor
from chat_data_pipeline import cleaners as cln
from chat_data_pipeline import filters as ftr

# Load a small example dataset from the Hugging Face Hub.
dataset = load_dataset("AlekseyKorshuk/tiny-imdb", split="train")

# Near-deduplication settings (MinHash).
deduplication_config = {
    'do_deduplication': True,
    'minhash_config': {
        'ngram_size': 5,
        'num_perm': 256,
        'threshold': 0.7,
        'min_ngram_size': 5
    }
}

# Cleaners are applied to the text column.
cleaners = [cln.fix_utf8_encoding, cln.normalize_punctuation, cln.remove_empty_lines]
# Filters decide which samples to keep; here, a bound on the word count.
filters = [
    utils.custom_partial(ftr.check_word_number,
                         min_word_threshold=0,
                         max_word_threshold=10000),
]

preprocessor = DataPreprocessor(
    dataset=dataset,
    column_name="text",
    cleaners=cleaners,
    filters=filters,
    deduplication_config=deduplication_config,
    verbose=False,
)
preprocessed_dataset = preprocessor.run()
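To give some intuition for the `minhash_config` values, here is a standalone sketch of MinHash similarity estimation using only the standard library. It is not the repository's implementation. It assumes the keys carry their conventional MinHash meaning: `ngram_size` is the word n-gram (shingle) length, `num_perm` is the number of hash functions in a signature, and `threshold` is the estimated Jaccard similarity above which two texts count as near duplicates. `min_ngram_size` is not modelled, and all function names are hypothetical.

```python
import hashlib
from typing import List, Set


def word_ngrams(text: str, ngram_size: int) -> Set[str]:
    """Split text into a set of lowercase word n-grams (shingles)."""
    words = text.lower().split()
    if len(words) < ngram_size:
        # Too short for a full n-gram: treat the whole text as one shingle.
        return {" ".join(words)}
    return {
        " ".join(words[i:i + ngram_size])
        for i in range(len(words) - ngram_size + 1)
    }


def minhash_signature(shingles: Set[str], num_perm: int) -> List[int]:
    """Keep the minimum hash of the shingles under num_perm salted hashes."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")  # a different salt per hash function
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "little",
            )
            for s in shingles
        ))
    return signature


def estimated_similarity(sig_a: List[int], sig_b: List[int]) -> float:
    """The fraction of matching positions estimates the Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a: Set[str], b: Set[str]) -> float:
    return len(a & b) / len(a | b)


text_a = "the movie was surprisingly good and the acting was great"
text_b = "the movie was surprisingly good and the acting was terrible"

shingles_a = word_ngrams(text_a, ngram_size=5)
shingles_b = word_ngrams(text_b, ngram_size=5)
sig_a = minhash_signature(shingles_a, num_perm=256)
sig_b = minhash_signature(shingles_b, num_perm=256)

print("exact Jaccard:", exact_jaccard(shingles_a, shingles_b))
print("MinHash estimate:", estimated_similarity(sig_a, sig_b))
```

The two texts share 5 of 7 distinct 5-grams, so their exact Jaccard similarity is 5/7, just above the 0.7 threshold. The estimate gets closer to that value as `num_perm` grows, at the cost of more hashing per document.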

Acknowledgment

This is a friendly fork of Squeakily by CarperAI. Unlike the original, this repository targets conversation data, uses pandas to speed up the pipeline, and applies the latest near-deduplication approach.
