HashSet - A Dataset For Hashtag Segmentation

HashSet is a new dataset consisting of 1.9k manually annotated and 3.3M loosely supervised tweets for testing the efficiency of hashtag segmentation models. We compare state-of-the-art hashtag segmentation models on HashSet and on other baseline datasets (STAN and BOUN), and analyse the results across the datasets to argue that HashSet can act as a good benchmark for hashtag segmentation tasks.

Directory Tree

|-- HashtagSegmentation
    |-- Data Stats Notebooks  ## ipynb notebooks used to generate statistics for the datasets used in this study
    |   |-- Data Validation.ipynb
    |   |-- Final Data Statistics.ipynb
    |-- ModelPredictions
    |   |-- distant-sampled-lowercase_hashformers_output.csv
    |   |-- distant-sampled_hashformers_output.csv
    |   |-- hashformer_analysis.ipynb
    |   |-- maddela_analysis.ipynb
    |   |-- stan-large_hashformers_output.csv
    |   |-- stan-small_hashformers_output.csv
    |   |-- Hashformers
    |       |-- HashSet-Manual_hashformers_output.csv
    |       |-- boun_hashformers_output.csv
    |       |-- stan-dev_hashformers_output.csv
    |-- datasets  ## datasets used to compare model performance alongside HashSet
        |-- boun-celebi-et-al.csv
        |-- stan-dev-celebi-etal.csv
        |-- stan-large-maddela_et_al_dev.pkl
        |-- stan-large-maddela_et_al_test.pkl
        |-- stan-large-maddela_et_al_train.pkl
        |-- stan-small-bansal_et_al.pkl
        |-- hashset
            |-- HashSet-Distant-sampled.csv
            |-- HashSet-Distant.csv
            |-- HashSet-Manual.csv

HashSet

HashSet Manual: contains 1.9k manually annotated hashtags. Each row consists of the hashtag, its segmentation, named entity annotations, and a list indicating whether the hashtag mixes Hindi and English tokens and/or contains non-English tokens.
HashSet Distant: contains 3.3M loosely collected camel-cased hashtags, each paired with its segmentation.
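
To illustrate how a camel-cased hashtag can yield a segmentation without manual annotation, here is a minimal sketch. The regular expression and the function name `segment_camel_case` are assumptions made for illustration; they are not taken from this repository and may differ from the procedure actually used to build HashSet Distant.

```python
import re

# Assumed tokenisation rule (hypothetical, not the repository's code):
#   - a run of capitals not followed by a lowercase letter is an acronym,
#   - an optional capital followed by lowercase letters is a word,
#   - a run of digits is a token of its own.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def segment_camel_case(hashtag: str) -> str:
    """Split a camel-cased hashtag into space-separated tokens."""
    body = hashtag.lstrip("#")       # drop the leading '#', if present
    tokens = _TOKEN.findall(body)    # characters outside the pattern are dropped
    return " ".join(tokens)


if __name__ == "__main__":
    for tag in ["#HashSet", "#WorldCup2022", "#NBAFinals"]:
        print(tag, "->", segment_camel_case(tag))
```

A rule like this only works when the author of the hashtag used capitalisation to mark word boundaries, which is why such labels count as loose supervision rather than manual annotation.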

Model Predictions

Add content here Paper

prashantkodali/HashSet
