The issue with the Data Selection Pipeline #33

Open
@kuang1216

I have run your code and have the following questions.
(1) Step 2:

CKPT=105
TRAINING_DATA_NAME=dolly
TRAINING_DATA_FILE=../data/train/processed/dolly/dolly_data.jsonl # when changing data name, change the data path accordingly
GRADIENT_TYPE="adam"
MODEL_PATH=../out/llama2-7b-p0.05-lora-seed3/checkpoint-${CKPT}
OUTPUT_PATH=../grads/llama2-7b-p0.05-lora-seed3/${TRAINING_DATA_NAME}-ckpt${CKPT}-${GRADIENT_TYPE}
DIMS="8192"

./less/scripts/get_info/get_train_lora_grads.sh "$TRAINING_DATA_FILE" "$MODEL_PATH" "$OUTPUT_PATH" "$DIMS" "$GRADIENT_TYPE"

Step 3:

First script of step 3:

CKPT=105
TASK=tydiqa
MODEL_PATH=../out/llama2-7b-p0.05-lora-seed3/checkpoint-${CKPT}
OUTPUT_PATH=../grads/llama2-7b-p0.05-lora-seed3/${TASK}-ckpt${CKPT}-sgd # for validation data, we always use sgd
DATA_DIR=../data
DIMS="4096 8192" # We use 8192 as our default projection dimension 

./less/scripts/get_info/get_eval_lora_grads.sh "$TASK" "$DATA_DIR" "$MODEL_PATH" "$OUTPUT_PATH" "$DIMS"

Second script of step 3:

DIM=8192 # decide which dimension to use
GRADIENT_PATH=../grads/llama2-7b-p0.05-lora-seed3/{}-ckpt{}-adam/dim${DIM}
TRAIN_FILE_NAMES="flan_v2 cot dolly oasst1"
CKPTS="105 211 317 420" # checkpoint index
CHECKPOINT_WEIGHTS="1.6877e-05 1.2859e-05 7.7030e-06 2.5616e-06" # average lr of the epoch

VALIDATION_GRADIENT_PATH=../grads/llama2-7b-p0.05-lora-seed3/{}-ckpt{}-sgd/dim${DIM}
TARGET_TASK_NAMES="tydiqa"
SELECTED_DATA_OUTPUT_PATH="../selected_data"

./less/scripts/data_selection/matching.sh "$GRADIENT_PATH" "$TRAIN_FILE_NAMES" "$CKPTS" "$CHECKPOINT_WEIGHTS" "$VALIDATION_GRADIENT_PATH" "$TARGET_TASK_NAMES" "$SELECTED_DATA_OUTPUT_PATH"

Do these steps need to be run for every checkpoint, or only for the last one? (I have four checkpoints in total. Should I run all four, or just the final one?)

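(2) In my experiments I used only the last checkpoint. There were four in total (422, 845, 1268, 1688), and I chose the last one saved; in the first script of step 3 I set CKPTS='1688'. I ran two experiments. In the first, step 2 was unchanged apart from CKPT, and in the second script of step 3 I set CKPTS=1688 and TRAIN_FILE_NAMES=dolly (because I had only computed gradients for dolly). The results on MMLU and BBH were 46.9 and 41.2. In the second experiment, I set CKPT=1688 and TRAINING_DATA_NAME=flan_v2 cot dolly oasst1 in step 2, and CKPTS=1688 and TRAIN_FILE_NAMES=flan_v2 cot dolly oasst1 in the second script of step 3. The results on MMLU and BBH were 43.9 and 40.0. Why did using more datasets give worse results than using fewer?
(3) A question about the Data Selection Pipeline scripts themselves: step 2 computes training-data gradients for the dolly dataset only, but the final step uses four datasets (flan_v2 cot dolly oasst1) to compute the influence scores. This is somewhat confusing.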
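
To make the mismatch in (3) concrete, here is a minimal sketch that lists the gradient directories the matching step would presumably read. It assumes the two `{}` placeholders in GRADIENT_PATH are filled with the dataset name first and the checkpoint index second; that is my reading of the path layout in step 2, not something the scripts confirm. `expected_grad_dirs` is a hypothetical helper of mine and not part of the repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): print one gradient
# directory per (dataset, checkpoint) pair.
# Assumption: the first {} is the dataset name, the second {} is the checkpoint.
expected_grad_dirs() {
  local template="$1" names="$2" ckpts="$3"
  local placeholder='{}' name ckpt path
  for name in $names; do
    for ckpt in $ckpts; do
      path=${template/$placeholder/$name}   # fill the first {}
      path=${path/$placeholder/$ckpt}       # fill the second {}
      printf '%s\n' "$path"
    done
  done
}

DIM=8192
GRADIENT_PATH="../grads/llama2-7b-p0.05-lora-seed3/{}-ckpt{}-adam/dim${DIM}"
TRAIN_FILE_NAMES="flan_v2 cot dolly oasst1"
CKPTS="105 211 317 420"

expected_grad_dirs "$GRADIENT_PATH" "$TRAIN_FILE_NAMES" "$CKPTS"
```

If this reading is right, the step 2 script would have to be run once for each of the four datasets at each of the four checkpoints (16 runs) before the second script of step 3 can find all of its inputs, whereas the example in step 2 covers only dolly at checkpoint 105.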
