centre-for-humanities-computing/word-associations

word-associations

Word associations in Indre Mission and Kirkeligt Samfund.

Usage

Expected folder structure:

- dataset/
    - data/
        - pr983_204.txt
        ...
    - Stopord.txt
    - metadata_nordveck.csv

Install requirements:

pip install -r requirements.txt

Preprocessing

Preprocess the corpus (lemmatization, stop word removal, normalization):

python3 src/clean_texts.py

This writes the cleaned corpus to a CSV file with id, text and clean_text columns:

- dataset/
    - clean_data.csv
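As a rough illustration of what this step produces, the sketch below applies normalization and stop word removal and writes rows with the same three columns. It is not the code in src/clean_texts.py: lemmatization is left out, the helper names are hypothetical, and the assumption that Stopord.txt holds one stop word per line is a guess.

```python
import csv
import io
import re


def load_stopwords(lines):
    """Build a stop word set, assuming one word per line (a guess about Stopord.txt)."""
    return {line.strip().lower() for line in lines if line.strip()}


def clean_text(text, stopwords):
    """Lowercase, keep only runs of letters, and drop stop words (no lemmatization)."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return " ".join(tok for tok in tokens if tok not in stopwords)


def write_clean_csv(docs, stopwords, handle):
    """Write one row per document with id, text and clean_text columns."""
    writer = csv.DictWriter(handle, fieldnames=["id", "text", "clean_text"])
    writer.writeheader()
    for doc_id, text in docs.items():
        writer.writerow(
            {"id": doc_id, "text": text, "clean_text": clean_text(text, stopwords)}
        )


# Toy inputs standing in for Stopord.txt and the files under dataset/data/.
stopwords = load_stopwords(["og", "i", "er"])
docs = {"doc_1": "Gud er kærlighed, og kærlighed er i Gud."}

buffer = io.StringIO()
write_clean_csv(docs, stopwords, buffer)
print(buffer.getvalue())
```

Writing to an in-memory buffer keeps the sketch free of side effects; swapping it for an open file handle would produce an actual CSV file.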

Word count

You can use the src/word_count.py run CLI to extract the most common words.
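A minimal sketch of the idea, assuming the counts are taken over whitespace-separated tokens in the clean_text column. The function name and the sample rows are made up for illustration, and src/word_count.py may work differently.

```python
import csv
import io
from collections import Counter


def most_common_words(rows, top_k=10):
    """Count tokens across the clean_text column and return the top_k (word, count) pairs."""
    counts = Counter()
    for row in rows:
        counts.update(row["clean_text"].split())
    return counts.most_common(top_k)


# Invented sample standing in for dataset/clean_data.csv.
sample_csv = io.StringIO(
    "id,text,clean_text\n"
    "doc_1,raw text one,gud kærlighed gud\n"
    "doc_2,raw text two,kærlighed tro gud\n"
)

top_words = most_common_words(csv.DictReader(sample_csv), top_k=2)
print(top_words)
```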

Collect collocations

You can use the src/cooccurrences.py run CLI to extract the highest-scoring collocations of a target word, ranked by pointwise mutual information (PMI).
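To make the scoring concrete, here is a small sketch of window-based PMI ranking. It uses a common simplification, PMI(seed, w) = log2(c(seed, w) * N / (c(seed) * c(w))), where c(seed, w) counts occurrences of w within a fixed window around the seed word and N is the total number of tokens. The function name, the estimator, and the toy documents are assumptions, not the implementation in src/cooccurrences.py.

```python
import math
from collections import Counter


def pmi_collocations(docs, seed_word, n_context=5, top_k=50):
    """Rank words co-occurring with seed_word by PMI, using a symmetric token window."""
    word_counts = Counter()
    cooc_counts = Counter()
    for doc in docs:
        tokens = doc.split()
        word_counts.update(tokens)
        for i, token in enumerate(tokens):
            if token != seed_word:
                continue
            # n_context tokens to the left and to the right of the seed word.
            window = tokens[max(0, i - n_context):i] + tokens[i + 1:i + 1 + n_context]
            cooc_counts.update(w for w in window if w != seed_word)

    total = sum(word_counts.values())
    seed_count = word_counts[seed_word]
    scores = {
        word: math.log2(joint * total / (seed_count * word_counts[word]))
        for word, joint in cooc_counts.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_k]


# Toy cleaned documents, invented for illustration.
docs = ["gud nåde tro", "gud nåde håb", "tro håb tro"]
for word, score in pmi_collocations(docs, "gud", n_context=2, top_k=2):
    print(f"{word}\t{score:.3f}")
```

PMI favours rare words, so a real pipeline would typically also apply a minimum frequency threshold; that is omitted here to keep the sketch short.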

Arguments

| Argument | Description | Type | Default |
| --- | --- | --- | --- |
| seed_word | Seed word to start off from. | str | - |
| -h, --help | Show help message and exit. | | |
| --group_by GROUP_BY, -g GROUP_BY | Metadata column to group results by. | str | None |
| --out_file OUT_FILE, -o OUT_FILE | JSON file to output results to. | str | results/coocurrences.json |
| --top_k TOP_K, -k TOP_K | Top K ranking cooccurring words to output. | int | 50 |
| --n_context N_CONTEXT, -n N_CONTEXT | Number of context words to consider in each direction. | int | 5 |
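The arguments listed above can be mirrored with a small argparse parser, which is a convenient way to see how the flags and defaults fit together. This is only a sketch: whether the actual script uses argparse, and how the run command is wired in, is not stated here, so the parser below is an assumption and the seed word "gud" is just an example value.

```python
import argparse


def build_parser():
    """Hypothetical parser reproducing the documented arguments and defaults."""
    parser = argparse.ArgumentParser(
        description="Extract the highest-scoring PMI collocations of a seed word."
    )
    parser.add_argument("seed_word", type=str, help="Seed word to start off from.")
    parser.add_argument("--group_by", "-g", type=str, default=None,
                        help="Metadata column to group results by.")
    parser.add_argument("--out_file", "-o", type=str,
                        default="results/coocurrences.json",
                        help="JSON file to output results to.")
    parser.add_argument("--top_k", "-k", type=int, default=50,
                        help="Top K ranking cooccurring words to output.")
    parser.add_argument("--n_context", "-n", type=int, default=5,
                        help="Number of context words to consider in each direction.")
    return parser


# Parse an example argument list instead of sys.argv so the sketch runs anywhere.
args = build_parser().parse_args(["gud", "--top_k", "10", "-n", "3"])
print(vars(args))
```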
