Training dataset format: npy or jsonl? #422

Open
@justlovebarbecue

Hi,

I am trying to test the training code with a demo dataset, but the instructions are not clear enough. Before running the training code I ran the "binarize_data" step. Which output format should I use for it, npy or jsonl? If it is jsonl, there appear to be no "input_ids" and "label" fields for the dataloader in the subsequent training step. If it is npy, I hit an error saying the uint type cannot be converted, as shown below:

self.input_ids = [torch.tensor(example["input_ids"], dtype=torch.long) for example in self.input_ids if len(example["input_ids"]) < args.model_max_length]
TypeError: can't convert np.ndarray of type numpy.uint32. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

Any clue about this issue? Or is the only thing needed to explicitly cast the data to a dtype that is not "uint"?
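For reference, here is a minimal sketch of the kind of cast I have in mind. It is not taken from the repository: the helper name `to_int64_ids` and the sample record are hypothetical, and I am assuming each record's "input_ids" is a uint32 NumPy array, as the error message suggests.

```python
import numpy as np


def to_int64_ids(input_ids):
    """Return token ids as an int64 NumPy array (hypothetical helper)."""
    arr = np.asarray(input_ids)
    # Every uint32 value fits in int64, so this cast loses no information.
    return arr.astype(np.int64)


# Hypothetical record standing in for one entry of the binarized npy data.
example = {"input_ids": np.array([1, 2, 4294967295], dtype=np.uint32)}
ids = to_int64_ids(example["input_ids"])
print(ids.dtype, ids.tolist())
```

Since int64 is in the list of supported types in the error above, the result could presumably then be passed to `torch.tensor(ids, dtype=torch.long)` in place of the raw `example["input_ids"]`.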

Thanks!
