Training data format 

Hi, the documentation describes the following layout for the training data set:


DatasetName (e.g. LF-AmazonTitles-131K)
│   trn_X.txt   (text for trn documents, one text in each line)
│   tst_X.txt   (text for tst documents, one text in each line)
│   Y.txt       (text for labels, one text in each line)
│   trn_X_Y.txt (trn labels in spmat format)
│   tst_X_Y.txt (tst labels in spmat format)
│   filter_labels_test.txt (filter labels where label and test documents are same)
│
└───XXCondensedData (embeddings for tst, trn documents and labels, for benchmark datasets, XX=DX[Astec])
    │   trn_point_embs.npy (2D numpy matrix for trn document embeddings)
    │   tst_point_embs.npy (2D numpy matrix for tst document embeddings)
    │   label_embs.npy     (2D numpy matrix for label embeddings)


I could not understand what "trn labels in spmat format" means. Is there a script that creates these files from the input documents (trn_X.txt, tst_X.txt and Y.txt)? This is for the case where we want to use the label embeddings as well.

I want to generate it for my custom dataset.
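
For reference, here is a minimal sketch of what I think generating such a file could look like. It assumes that "spmat format" means the sparse text layout used by the Extreme Classification Repository datasets: a header line with "num_documents num_labels", then one line per document holding space-separated "label_index:value" pairs, where label_index is the 0-based line number of the label in Y.txt. That layout, the helper name to_spmat_lines, and the toy data are my assumptions and are not confirmed by the documentation quoted above.

```python
from typing import List, Sequence


def to_spmat_lines(doc_labels: Sequence[Sequence[str]],
                   label_texts: Sequence[str]) -> List[str]:
    """Build the lines of an X_Y file in the assumed sparse text layout.

    doc_labels[i] holds the label texts relevant to document i (line i of
    trn_X.txt); label_texts holds the lines of Y.txt in order.
    """
    # Assumption: a label's index is its 0-based line number in Y.txt.
    label_index = {text: idx for idx, text in enumerate(label_texts)}

    # Header line: number of documents, then number of labels.
    lines = [f"{len(doc_labels)} {len(label_texts)}"]
    for labels in doc_labels:
        # De-duplicate and sort the column indices for each document.
        cols = sorted({label_index[label] for label in labels})
        # A document with no labels becomes an empty line.
        lines.append(" ".join(f"{col}:1" for col in cols))
    return lines


# Toy data (hypothetical): three labels and three training documents.
label_texts = ["usb cable", "phone case", "screen protector"]
trn_doc_labels = [["phone case", "usb cable"], ["screen protector"], []]

spmat_lines = to_spmat_lines(trn_doc_labels, label_texts)
print("\n".join(spmat_lines))
```

If that layout is right, writing these lines to trn_X_Y.txt (and doing the same with the test documents for tst_X_Y.txt) should give the two label files. Could someone confirm whether this matches the format the training code expects?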
