Training data format  #4

@NitinAggarwal1

Hi, the documentation gives the following layout for the training dataset:

DatasetName (e.g. LF-AmazonTitles-131K)
│ trn_X.txt (text for trn documents, one text in each line)
│ tst_X.txt (text for tst documents, one text in each line)
│ Y.txt (text for labels, one text in each line)
│ trn_X_Y.txt (trn labels in spmat format)
│ tst_X_Y.txt (tst labels in spmat format)
│ filter_labels_test.txt (filter labels where label and test documents are same)
│
└───XXCondensedData (embeddings for tst, trn documents and labels, for benchmark datasets, XX=DX[Astec])
    │ trn_point_embs.npy (2D numpy matrix for trn document embeddings)
    │ tst_point_embs.npy (2D numpy matrix for tst document embeddings)
    │ label_embs.npy (2D numpy matrix for label embeddings)

I could not understand what "trn labels in spmat format" means. Is there a script that creates trn_X_Y.txt and tst_X_Y.txt from the input files (trn_X.txt, tst_X.txt and Y.txt)? This is for the case where we want to use the label embeddings as well.

I want to generate it for my custom dataset.
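
To make the question concrete, here is a minimal sketch of the kind of script I have in mind. It rests on an assumption the documentation above does not confirm: that "spmat format" is a sparse text matrix with a header line `<num_documents> <num_labels>`, followed by one line per document holding space-separated `label_index:value` pairs, where `label_index` is the 0-based line number of the label in Y.txt. The names `format_spmat`, `write_spmat` and the toy `doc_labels` input are hypothetical and not part of this repository.

```python
from pathlib import Path


def format_spmat(doc_labels, num_labels):
    """Render document-to-label assignments as sparse text.

    ASSUMED format (not confirmed by the repository documentation):
      line 1:    "<num_documents> <num_labels>"
      line i+2:  "<label_idx>:1 ..." for document i, 0-based label indices
    A document with no labels becomes an empty line.
    """
    lines = [f"{len(doc_labels)} {num_labels}"]
    for doc_id, labels in enumerate(doc_labels):
        unique = sorted(set(labels))  # drop duplicates, keep a stable order
        if unique and (unique[0] < 0 or unique[-1] >= num_labels):
            raise ValueError(f"document {doc_id} has a label index out of range")
        lines.append(" ".join(f"{j}:1" for j in unique))
    return "\n".join(lines) + "\n"


def write_spmat(doc_labels, num_labels, path):
    """Write the rendered matrix to `path` (e.g. a hypothetical trn_X_Y.txt)."""
    Path(path).write_text(format_spmat(doc_labels, num_labels), encoding="utf-8")


# Toy example: 3 documents (lines of trn_X.txt), 4 labels (lines of Y.txt).
doc_labels = [[0, 2], [3], [1, 2]]
print(format_spmat(doc_labels, num_labels=4), end="")
```

If this reading is right, `doc_labels[i]` would list the Y.txt line numbers that apply to line `i` of trn_X.txt, and the same function would produce tst_X_Y.txt from the test assignments. Please correct me if the expected format differs.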
