Mismatch Between train_file_names and train_files Parameters Causes Assertion Failure #41

Open
@Tangkexian

Description

Great work! However, when running the write_selected_data.py script, passing a different number of values to --train_file_names and --train_files triggers an assertion failure.

Steps to Reproduce

  1. Run the data selection script as described in the README:

    python3 -m less.data_selection.write_selected_data \
    --target_task_names ${TARGET_TASK_NAMES} \
    --train_file_names flan_v2 cot dolly oasst1 \
    --train_files ../data/train/processed/dolly/dolly_data.jsonl ../data/train/processed/oasst1/oasst1_data.jsonl \
    --output_path $SELECTED_DATA_OUTPUT_PATH \
    --percentage 0.05
  2. Observe the Assertion Failure:

    The script contains the following assertion:

    assert len(args.train_file_names) == len(args.train_files)

    In this example, --train_file_names has 4 names (flan_v2, cot, dolly, oasst1), while --train_files only provides 2 file paths (dolly_data.jsonl and oasst1_data.jsonl). This mismatch triggers the assertion, causing the script to terminate unexpectedly.
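
A minimal sketch of how the bare assertion could be replaced with a check that reports both counts. The helper name `validate_train_args` and the error wording are hypothetical and not part of the repository; only the two argument names come from the script.

```python
from typing import List, Tuple


def validate_train_args(
    train_file_names: List[str], train_files: List[str]
) -> List[Tuple[str, str]]:
    """Pair each dataset name with its file path, or fail with a clear message.

    Hypothetical helper: it performs the same length check as the existing
    assertion, but raises a ValueError that states what was received.
    """
    n_names, n_files = len(train_file_names), len(train_files)
    if n_names != n_files:
        raise ValueError(
            f"--train_file_names has {n_names} value(s) {train_file_names} "
            f"but --train_files has {n_files} value(s) {train_files}; "
            "pass exactly one file path per name, in the same order."
        )
    # Names and paths are matched by position.
    return list(zip(train_file_names, train_files))
```

With the arguments from the command above, this would raise a ValueError mentioning 4 names and 2 files instead of a bare AssertionError. Passing only `dolly oasst1` as names together with the two file paths would return two (name, path) pairs.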
