Commit 62ee7f3

Adding v0.0.39
1 parent 03a9bb7 commit 62ee7f3

169 files changed, with 18609 additions and 1690 deletions

CITATION.cff

Lines changed: 2 additions & 2 deletions
@@ -1,4 +1,4 @@
-version: 0.0.38
+version: 0.0.39
 message: "If you use this software, please cite it as below."
 authors:
 - family-names: Eren
@@ -20,7 +20,7 @@ authors:
 - family-names: Alexandrov
 given-names: Boian
 title: "Tensor Extraction of Latent Features (T-ELF)"
-version: 0.0.38
+version: 0.0.39
 url: https://github.com/lanl/T-ELF
 doi: 10.5281/zenodo.10257897
 date-released: 2023-12-04

README.md

Lines changed: 35 additions & 34 deletions
@@ -22,7 +22,7 @@ Central to T-ELF's core capabilities lie non-negative matrix and tensor factoriz

 <div align="center", style="font-size: 50px">
 <p align="center">
-<img src="docs/capabilities.png">
+<img src="docs/smart_tensors_image.png">
 </p>

 </div>
@@ -86,47 +86,48 @@ python post_install.py # use the following, for example, for GPU system: <python

 ### TELF.factorization

-| **Method** | **Dense** | **Sparse** | **GPU** | **CPU** | **Multiprocessing** | **HPC** | **Description** | **Example** | **Release Status** |
-|:-------------------------:|:------------------:|:------------------:|:------------------:|:------------------:|:-------------------:|:------------------:|:----------------------------------------------------------------:|:-----------:|:------------------:|
-| NMFk | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | NMF with Automatic Model Determination | [Link](examples/NMFk/NMFk.ipynb) | :white_check_mark: |
-| Custom NMFk | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | Use Custom NMF Functions with NMFk | [Link](examples/NMFk/Custom_NMF_NMFk.ipynb) | :white_check_mark: |
-| TriNMFk | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | NMF with Automatic Model Determination for Clusters and Patterns | [Link](examples/TriNMFk/TriNMFk.ipynb) | :white_check_mark: |
-| RESCALk | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | RESCAL with Automatic Model Determination | [Link](examples/RESCALk/RESCALk.ipynb) | :white_check_mark: |
-| RNMFk | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | Recommender NMFk | [Link](examples/RNMFk/RNMFk.ipynb) | :white_check_mark: |
-| SymNMFk | :heavy_check_mark: | | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | NMFk with Symmetric Clustering | [Link](examples/SymNMFk/SymNMFk.ipynb) | :white_check_mark: |
-| WNMFk | :heavy_check_mark: | | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | NMFk with weighting - used for recommendation system | [Link](examples/WNMFk/WNMFk.ipynb) | :white_check_mark: |
-| HNMFk | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | Hierarchical NMFk | [Link](examples/HNMFk/HNMFk.ipynb) | :white_check_mark: |
-| BNMFk | :heavy_check_mark: | | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | Boolean NMFk | [Link](examples/BNMFk/BNMFk.ipynb) | :white_check_mark: |
-| LMF | :heavy_check_mark: | | :heavy_check_mark: | :heavy_check_mark: | | | Logistic Matrix Factorization | [Link](examples/LMF/LMF.ipynb) | :white_check_mark: |
-| SPLIT NMFk | | | | | | | Joint NMFk factorization of multiple data via SPLIT | | :soon: |
-| SPLIT Transfer Classifier | | | | | | | Supervised transfer learning method via SPLIT and NMFk | | :soon: |
+| **Method** | **Dense** | **Sparse** | **GPU** | **CPU** | **Multiprocessing** | **HPC** | **Description** | **Example** |
+|:-------------------------:|:------------------:|:------------------:|:------------------:|:------------------:|:-------------------:|:------------------:|:----------------------------------------------------------------:|:-----------:|
+| NMFk | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | NMF with Automatic Model Determination | [Link](examples/NMFk/NMFk.ipynb) |
+| Custom NMFk | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | Use Custom NMF Functions with NMFk | [Link](examples/NMFk/Custom_NMF_NMFk.ipynb) |
+| TriNMFk | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | NMF with Automatic Model Determination for Clusters and Patterns | [Link](examples/TriNMFk/TriNMFk.ipynb) |
+| RESCALk | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | RESCAL with Automatic Model Determination | [Link](examples/RESCALk/RESCALk.ipynb) |
+| RNMFk | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | Recommender NMFk | [Link](examples/RNMFk/RNMFk.ipynb) |
+| SymNMFk | :heavy_check_mark: | | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | NMFk with Symmetric Clustering | [Link](examples/SymNMFk/SymNMFk.ipynb) |
+| WNMFk | :heavy_check_mark: | | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | NMFk with weighting - used for recommendation system | [Link](examples/WNMFk/WNMFk.ipynb) |
+| HNMFk | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | Hierarchical NMFk | [Link](examples/HNMFk/HNMFk.ipynb) |
+| BNMFk | :heavy_check_mark: | | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | Boolean NMFk | [Link](examples/BNMFk/BNMFk.ipynb) |
+| LMF | :heavy_check_mark: | | :heavy_check_mark: | :heavy_check_mark: | | | Logistic Matrix Factorization | [Link](examples/LMF/LMF.ipynb) |
+| SPLIT | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | Joint NMFk factorization of multiple data via SPLIT | [Link](examples/SPLIT/00-SPLIT.ipynb) |
+| SPLITTransfer | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | :heavy_check_mark: | | Supervised transfer learning method via SPLIT and NMFk | [Link](examples/SPLITTransfer/00-SPLITTransfer.ipynb) |

 ### TELF.pre_processing

-| **Method** | **Multiprocessing** | **HPC** | **Description** | **Example** | **Release Status** |
-|:----------:|:-------------------:|:-------------------:|:------------------------------------------------------------------:|:-----------:|:------------------:|
-| Vulture | :heavy_check_mark: | :heavy_check_mark: | Advanced text processing tool for cleaning and NLP | [Link](examples/Vulture) | :white_check_mark: |
-| Beaver | :heavy_check_mark: | :heavy_check_mark: | Fast matrix and tensor building tool for text mining | [Link](examples/Beaver) | :white_check_mark: |
-| iPenguin | :heavy_check_mark: | | Online information retrieval tool for Scopus, SemanticScholar, and OSTI | [Link](examples/iPenguin) | :white_check_mark: |
-| Orca | :heavy_check_mark: | | Duplicate author detector for text mining and information retrieval | [Link](examples/Orca) | :white_check_mark: |
+| **Method** | **Multiprocessing** | **HPC** | **Description** | **Example** |
+|:----------:|:-------------------:|:-------------------:|:------------------------------------------------------------------:|:-----------:|
+| Vulture | :heavy_check_mark: | :heavy_check_mark: | Advanced text processing tool for cleaning and NLP | [Link](examples/Vulture) |
+| Beaver | :heavy_check_mark: | :heavy_check_mark: | Fast matrix and tensor building tool for text mining | [Link](examples/Beaver) |
+| iPenguin | :heavy_check_mark: | | Online information retrieval tool for Scopus, SemanticScholar, and OSTI | [Link](examples/iPenguin) |
+| Orca | :heavy_check_mark: | | Duplicate author detector for text mining and information retrieval | [Link](examples/Orca) |

 ### TELF.post_processing

-| **Method** | **Description** | **Example** | **Release Status** |
-|:----------:|:----------------------------------------------------------:|:-----------:|:------------------:|
-| Wolf | Graph centrality and ranking tool | [Link](examples/Wolf) | :white_check_mark: |
-| Peacock | Data visualization and generation of actionable statistics | [Link](examples/Peacock) | :white_check_mark: |
-| SeaLion | Generic report generation tool | [Link](examples/SeaLion) | :white_check_mark: |
-| Fox | Report generation tool for text data | | :soon: |
+| **Method** | **Description** | **Example** |
+|:----------:|:----------------------------------------------------------:|:-----------:|
+| Wolf | Graph centrality and ranking tool | [Link](examples/Wolf) |
+| Peacock | Data visualization and generation of actionable statistics | [Link](examples/Peacock) |
+| SeaLion | Generic report generation tool | [Link](examples/SeaLion) |
+| Fox | Report generation tool for text data from NMFk using OpenAI | [Link](examples/Fox) |
+| ArcticFox | Report generation tool for text data from HNMFk using local LLMs | [Link](examples/ArcticFox) |

 ### TELF.applications

-| **Method** | **Description** | **Example** | **Release Status** |
-|:----------:|:--------------------------------------------------------------------:|:-----------:|:------------------:|
-| Cheetah | Fast search by keywords and phrases | [Link](examples/Cheetah) | :white_check_mark: |
-| Bunny | Dataset generation tool for documents and their citations/references | [Link](examples/Bunny) | :white_check_mark: |
-| Penguin | Text storage tool | [Link](examples/Penguin) | :white_check_mark: |
-| Termite | Knowladge graph building tool | | :soon: |
+| **Method** | **Description** | **Example** |
+|:----------:|:--------------------------------------------------------------------:|:-----------:|
+| Cheetah | Fast search by keywords and phrases | [Link](examples/Cheetah) |
+| Bunny | Dataset generation tool for documents and their citations/references | [Link](examples/Bunny) |
+| Penguin | Text storage tool | [Link](examples/Penguin) |
+| Termite | Knowladge graph building tool | :soon: |


 ## How to Cite T-ELF?
@@ -150,7 +151,7 @@ Eren, M., Solovyev, N., Barron, R., Bhattarai, M., Truong, D., Boureima, I., Ska
 ```

 ## Authors
-- [Maksim Ekin Eren](mailto:[email protected]): Advanced Research in Cyber Systems, Los Alamos National Laboratory ([Website](https://www.maksimeren.com/))
+- [Maksim Ekin Eren](mailto:[email protected]): Information Systems and Modeling Group, Los Alamos National Laboratory ([Website](https://www.maksimeren.com/))
 - [Nicholas Solovyev](mailto:[email protected]): Theoretical Division, Los Alamos National Laboratory
 - [Ryan Barron](mailto:[email protected]): Theoretical Division, Los Alamos National Laboratory
 - [Manish Bhattarai](mailto:[email protected]): Theoretical Division, Los Alamos National Laboratory

TELF/applications/Bunny/auto_bunny.py

Lines changed: 20 additions & 2 deletions
@@ -5,7 +5,7 @@
 import pandas as pd
 from dataclasses import dataclass, field

-from .bunny import Bunny
+from .bunny import Bunny, BunnyFilter
 from ..Cheetah import Cheetah
 from ...pre_processing.iPenguin.Scopus import Scopus
 from ...pre_processing.iPenguin.SemanticScholar import SemanticScholar
@@ -42,7 +42,16 @@ def __init__(self, core, s2_key=None, scopus_keys=None, output_dir=None, cache_d
 self.verbose = verbose


-def run(self, steps, *, s2_key=None, scopus_keys=None, cheetah_index=None, max_papers=250000, checkpoint=True):
+def run(self,
+steps,
+*,
+s2_key=None,
+scopus_keys=None,
+cheetah_index=None,
+max_papers=250000,
+checkpoint=True,
+filter_type:str=None, # must be a key from Bunny.FILTERS
+filter_value=None):

 # validate input
 if not isinstance(steps, (list, tuple)):
@@ -87,6 +96,15 @@ def run(self, steps, *, s2_key=None, scopus_keys=None, cheetah_index=None, max_p
 return df

 df = self.__bunny_hop(df, modes, step_max_papers, hop_priority)
+if filter_value and filter_type:
+bunny = Bunny()
+query = BunnyFilter(filter_type, filter_value)
+subset_df = bunny.apply_filter(df, query, filter_in_core=True, do_author_match=False).reset_index(drop=True)
+if len(subset_df) < 1:
+print("No papers for filter_value, using original df without filter.")
+else:
+df = subset_df
+
 df = self.__vulture_clean(df, vulture_settings)
 df, cheetah_table = self.__cheetah_filter(df, cheetah_settings)

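Review note: the last hunk above applies an optional filter after each hop and keeps the unfiltered frame when the filter matches nothing. A minimal sketch of that fallback pattern follows, assuming plain Python lists in place of the pandas DataFrame and a hypothetical helper name `filter_with_fallback`; it does not call the real `Bunny` or `BunnyFilter` classes.

```python
def filter_with_fallback(records, predicate):
    """Return the records that satisfy predicate, or all records if none do.

    Hypothetical helper: it mirrors the "use the subset only when it is
    non-empty" branch in the diff, but on plain lists rather than a DataFrame.
    """
    subset = [record for record in records if predicate(record)]
    if not subset:
        # Same fallback as the diff: warn and keep the original data.
        print("No papers matched the filter; keeping the unfiltered set.")
        return records
    return subset


# Illustrative data only; the field names are assumptions for this sketch.
papers = [
    {"doi": "10.1000/a", "year": 2021},
    {"doi": "10.1000/b", "year": 2023},
]

recent = filter_with_fallback(papers, lambda p: p["year"] >= 2022)
none_match = filter_with_fallback(papers, lambda p: p["year"] >= 2030)
```

One consequence of this design is that an overly strict filter is indistinguishable from no filter unless the caller watches the printed message, so a caller that needs to tell the two cases apart may prefer a returned flag or a logged warning.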
TELF/applications/Bunny/bunny.py

Lines changed: 28 additions & 12 deletions
@@ -161,10 +161,17 @@ def __init__(self, s2_key=None, scopus_keys=None, penguin_settings=None, output_
 self.penguin_settings = penguin_settings
 self.enabled = self.s2_key is not None

-# create a dictionary of supported filters and their callable load functions
-filters = {f: f"_filter_{re.sub('-', '', f.lower())}" for f in Bunny.FILTERS}
-self.filter_funcs = {k: getattr(self, v) for k,v in filters.items() if callable(getattr(self, v))}
-
+# Explicitly map supported filters to methods
+self.filter_funcs = {
+'AFFILCOUNTRY': lambda df, f, auth_map=None: self._filter_affil_generic(df, column_name='country', filter_value=f, auth_map=auth_map),
+'AFFILORG': self._filter_affilorg,
+'AF-ID': self._filter_afid,
+'PUBYEAR': self._filter_pubyear,
+'AU-ID': self._filter_auid,
+'KEY': self._filter_key,
+'DOI': lambda df, f, auth_map=None: set(df[df['doi'].str.lower() == f.lower()].index),
+}
+

 def __init_lookup(self, series, priority, sep):
 lookup = [y for x in series for y in x.split(sep)]
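Review note: the hunk above replaces name-based `getattr` lookup with an explicit dictionary from filter name to callable. The sketch below illustrates that dispatch-table idea under stated assumptions: `FilterRegistry`, its `apply` method, and the row dictionaries are hypothetical, and only the filter names `PUBYEAR` and `DOI` are taken from the diff.

```python
class FilterRegistry:
    """Hypothetical stand-in for the filter lookup; not the Bunny class."""

    def __init__(self):
        # Explicit name -> callable mapping. Every entry takes (rows, value)
        # and returns the set of matching row positions.
        self.filter_funcs = {
            "PUBYEAR": self._filter_pubyear,
            "DOI": lambda rows, value: {
                i for i, row in enumerate(rows)
                if row.get("doi", "").lower() == value.lower()
            },
        }

    def _filter_pubyear(self, rows, value):
        return {i for i, row in enumerate(rows) if row.get("year") == int(value)}

    def apply(self, name, rows, value):
        try:
            func = self.filter_funcs[name]
        except KeyError:
            # Unknown names fail loudly instead of resolving to an attribute.
            raise ValueError(f"Unsupported filter: {name!r}") from None
        return func(rows, value)


# Illustrative rows only.
rows = [
    {"doi": "10.1000/A", "year": 2021},
    {"doi": "10.1000/b", "year": 2023},
]
registry = FilterRegistry()
doi_hits = registry.apply("DOI", rows, "10.1000/a")
year_hits = registry.apply("PUBYEAR", rows, "2023")
```

An explicit table makes the supported filters visible in one place and lets short filters be written inline as lambdas, at the cost of having to update the table whenever a new filter method is added.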
@@ -619,31 +626,40 @@ def _filter_auid(self, df, f, auth_map=None):
 return pids


-def _filter_affilcountry(self, df, f, auth_map):
+def _filter_affil_generic(self, df, column_name, filter_value, auth_map):
 if 'affiliations' not in df:
 raise ValueError('"affiliations" not found in df')

-country = f.lower()
+filter_value = filter_value.lower()
 pids, aids = set(), set()
 aff_df = df.dropna(subset=['affiliations'])
-affiliations = {k:v for k,v in zip(aff_df.index.to_list(), aff_df.affiliations.to_list())}
+affiliations = {k: v for k, v in zip(aff_df.index.to_list(), aff_df.affiliations.to_list())}
+
 for idx, affiliation in affiliations.items():
 if isinstance(affiliation, str):
 affiliation = ast.literal_eval(affiliation)
 for aff_id, aff in affiliation.items():
-if aff['country'].lower() == country:
-pids.add(idx)
-aids |= set(aff['authors'])
-break
-
+try:
+if aff[column_name].lower() == filter_value:
+pids.add(idx)
+aids |= set(aff['authors'])
+break
+except KeyError:
+print(f"Warning: '{column_name}' not found in affiliation {aff_id} for index {idx}.")
+except Exception as e:
+print(f"Warning: error processing affiliation {aff_id} at index {idx}{e}")
+
 if auth_map is not None:
 s2_aids = {auth_map[aid] for aid in aids if aid in auth_map}
 for idx, scopus_authors, s2_authors in zip(df.index.to_list(), df.author_ids.to_list(), df.s2_author_ids.to_list()):
 if isinstance(scopus_authors, str) and set(scopus_authors.split(';')) & aids:
 pids.add(idx)
 if isinstance(s2_authors, str) and set(s2_authors.split(';')) & s2_aids:
 pids.add(idx)
+
 return pids
+
+


 def _filter_affilorg(self, df, f, auth_map):
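Review note: the hunk above generalizes the country filter so that any affiliation field can be matched case-insensitively, with string-encoded affiliations parsed through `ast.literal_eval`. A minimal sketch of that matching loop follows, assuming a plain dictionary keyed by paper index in place of the DataFrame. The function name `match_affiliations`, the sample data, and the `name` field are hypothetical; the `country` and `authors` keys come from the diff. Unlike the diff, this sketch skips affiliations that lack the field silently instead of printing a warning.

```python
import ast


def match_affiliations(affiliations_by_paper, field, value):
    """Return (paper_ids, author_ids) for affiliations where field equals value.

    Hypothetical helper on plain dicts; the comparison is case-insensitive.
    """
    value = value.lower()
    paper_ids, author_ids = set(), set()
    for paper_id, affiliation in affiliations_by_paper.items():
        # Affiliations may be stored as the string form of a dict.
        if isinstance(affiliation, str):
            affiliation = ast.literal_eval(affiliation)
        for aff in affiliation.values():
            # Missing fields are treated as a non-match in this sketch.
            if str(aff.get(field, "")).lower() == value:
                paper_ids.add(paper_id)
                author_ids |= set(aff.get("authors", []))
                break  # one matching affiliation is enough for this paper
    return paper_ids, author_ids


# Illustrative data only.
sample = {
    0: {"60000001": {"country": "United States", "name": "Lab A", "authors": ["a1", "a2"]}},
    1: "{'60000002': {'country': 'Germany', 'name': 'Lab B', 'authors': ['a3']}}",
    2: {"60000003": {"name": "Lab C", "authors": ["a4"]}},
}

german_papers, german_authors = match_affiliations(sample, "country", "germany")
```

Because of the `break`, only the authors of the first matching affiliation on each paper are collected, which matches the loop structure shown in the diff.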
