Asset cleanup #62

Open
wants to merge 12 commits into
base: main
from
12 changes: 6 additions & 6 deletions README.md
@@ -118,12 +118,12 @@ in `assets/scientific-lit-embeddings`.

Outputs (annually):

-- FastText vectors: `assets/{en,zh}_merged_fasttext.bin`
-- tf-idf vectors and vocab: `assets/{en,zh}_merged_tfidf.bin` and TODO
+- FastText vectors: `assets/en_merged_fasttext.bin`
+- tf-idf vectors and vocab: `assets/en_merged_tfidf.bin` and TODO

Outputs (~weekly):

-- Preprocessed corpus: `assets/corpus/{lang}_corpus-*.jsonl.gz`
+- Preprocessed corpus: `assets/corpus/en_corpus-*.jsonl.gz`

### 2. Field taxonomy

@@ -143,8 +143,8 @@ created embeddings for each field.

Outputs (annually):

-- FastText field embeddings: `assets/{en,zh}_field_fasttext.bin`
-- tf-idf field embeddings: `assets/{en,zh}_field_tfidf.bin`
+- FastText field embeddings: `assets/en_field_fasttext.bin`
+- tf-idf field embeddings: `assets/en_field_tfidf.bin`

### 4. Entity embeddings

@@ -154,7 +154,7 @@ generate FastText _entity embeddings_. This is documented in the `wiki-field-tex

Outputs (annually):

-- FastText entity embeddings: `assets/{en,zh}_field_mention_fasttext.bin`
+- FastText entity embeddings: `assets/en_field_mention_fasttext.bin`

### 5. Publication embedding

Member

Long-term, if this is still in use, I think it would be nice to convert this to a real Python file (notebooks can have weird dependency issues that are harder to work out, and they are more of a pain to re-run), but it's not a big deal for now.

Large diffs are not rendered by default.

File renamed without changes.
1,162 changes: 1,162 additions & 0 deletions analysis/nslp-forc/all_fields_hierarchy.jsonl

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion analysis/nslp-forc/forc.ipynb
@@ -144,7 +144,7 @@
"levels_reversed = defaultdict(set)\n",
"children = defaultdict(list)\n",
"parents = {}\n",
-"with open(\"../../assets/fields/all_fields_hierarchy.jsonl\", \"r\") as json_file:\n",
+"with open(\"all_fields_hierarchy.jsonl\", \"r\") as json_file:\n",
" data = [json.loads(line) for line in json_file]\n",
"for row in data:\n",
" levels[row[\"child_display_name\"].lower()] = int(row[\"child_level\"])\n",
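The notebook cell above could be expressed as a plain Python function, in the spirit of the review comment about moving away from notebooks. This is only a minimal sketch, not the project's code: the function name `load_hierarchy` is hypothetical, the `parent_display_name` key is an assumption (only `child_display_name` and `child_level` appear in the diff), and the way `levels_reversed`, `children`, and `parents` are filled is a guess based on their names and types.

```python
import json
from collections import defaultdict


def load_hierarchy(path):
    """Read a JSONL field hierarchy and index it by lower-cased field name."""
    levels = {}
    levels_reversed = defaultdict(set)  # assumed: level -> names at that level
    children = defaultdict(list)        # assumed: parent name -> child names
    parents = {}                        # assumed: child name -> parent name
    with open(path, "r", encoding="utf-8") as json_file:
        for line in json_file:
            if not line.strip():
                continue  # tolerate blank lines in the file
            row = json.loads(line)
            child = row["child_display_name"].lower()
            level = int(row["child_level"])
            levels[child] = level
            levels_reversed[level].add(child)
            # "parent_display_name" is an assumed key, not shown in the diff.
            parent = row.get("parent_display_name")
            if parent is not None:
                parent = parent.lower()
                children[parent].append(child)
                parents[child] = parent
    return levels, levels_reversed, children, parents
```

In a script, passing `pathlib.Path(__file__).parent / "all_fields_hierarchy.jsonl"` as `path` would make the lookup independent of the working directory, whereas the bare filename in the cell is resolved relative to wherever the notebook is run.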
1,162 changes: 1,162 additions & 0 deletions analysis/venues/all_fields_hierarchy.jsonl

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion analysis/venues/evaluate.py
@@ -15,7 +15,7 @@


 def main():
-    meta = pd.read_pickle(ASSETS_DIR / "fields/fos.pkl.gz")
+    meta = pd.read_pickle("fos.pkl.gz")
     meta.index = meta.index.astype(int)
     id_to_name = meta.query("level == 0")["display_name"].to_dict()

2 changes: 1 addition & 1 deletion analysis/venues/evaluate_l2_venues.ipynb
@@ -24,7 +24,7 @@
"levels_reversed = defaultdict(set)\n",
"children = defaultdict(list)\n",
"parents = {}\n",
-"with open(\"../../assets/fields/all_fields_hierarchy.jsonl\", \"r\") as json_file:\n",
+"with open(\"all_fields_hierarchy.jsonl\", \"r\") as json_file:\n",
" data = [json.loads(line) for line in json_file]\n",
"for row in data:\n",
" levels[row[\"child_display_name\"].lower()] = int(row[\"child_level\"])\n",
4 changes: 2 additions & 2 deletions analysis/venues/evaluate_venues.ipynb
@@ -13,7 +13,7 @@
"\n",
"from fos.settings import ASSETS_DIR\n",
"\n",
-"meta = pd.read_pickle(ASSETS_DIR / \"fields/fos.pkl.gz\")\n",
+"meta = pd.read_pickle(\"fos.pkl.gz\")\n",
"meta.index = meta.index.astype(int)\n",
"id_to_name = meta.query(\"level == 0\")[\"display_name\"].to_dict()\n",
"\n",
@@ -676,4 +676,4 @@
},
"nbformat": 4,
"nbformat_minor": 0
-}
+}
File renamed without changes.
2 changes: 0 additions & 2 deletions assets/fields/.gitignore

This file was deleted.

5 changes: 0 additions & 5 deletions assets/fields/all_fields_hierarchy.jsonl.dvc

This file was deleted.

Binary file removed assets/fields/dag.pkl.gz
Binary file not shown.
Binary file removed assets/fields/example_text.pkl.gz
Binary file not shown.
Binary file removed assets/fields/fos_attr.pkl.gz
Binary file not shown.
Binary file removed assets/fields/fos_children.pkl.gz
Binary file not shown.
83 changes: 0 additions & 83 deletions assets/fields/get.py

This file was deleted.

5 changes: 0 additions & 5 deletions assets/fields/level_one_field_hierarchy.jsonl.dvc

This file was deleted.
