Merge pull request #314 from stephenhky/develop

Release 2.1.0
stephenhky · Dec 15, 2024 · ac14957
2 parents 427b88d + 9b50b62
commit ac14957
Showing 59 changed files with 310 additions and 37,999 deletions.
diff --git a/.circleci/config.yml b/.circleci/config.yml
@@ -13,19 +13,20 @@ shared: &shared
           sudo apt-get update
           sudo apt-get install libc6
           sudo apt-get install python3-dev
+          sudo apt-get install -y g++
 
     - run:
         name: Installing Miniconda and Packages
         command: |
           pip install --upgrade --user pip
-          pip install --upgrade --user  .
           pip install --upgrade --user google-compute-engine
+          pip install --user  .
 
     - run:
         name: Run Unit Tests
         command: |
-          pip install -r test_requirements.txt
-          python setup.py test
+          pip install --user .[test]
+          pytest
 
 
 jobs:

diff --git a/.readthedocs.yml b/.readthedocs.yml
@@ -12,7 +12,7 @@ sphinx:
 build:
   os: ubuntu-22.04
   tools:
-    python: "3.9"
+    python: "3.12"
 
 # Build documentation with MkDocs
 #mkdocs:

diff --git a/MANIFEST.in b/MANIFEST.in
@@ -1,12 +1,6 @@
 include README.md
-include requirements.txt
-include setup_requirements.txt
-include test_requirements.txt
+include LICENSE
 include pyproject.toml
 include shorttext/data/shorttext_exampledata.csv
 include shorttext/utils/stopwords.txt
 include shorttext/utils/nonneg_stopwords.txt
-include shorttext/metrics/dynprog/dldist.pyx
-include shorttext/metrics/dynprog/dldist.c
-include shorttext/metrics/dynprog/lcp.pyx
-include shorttext/metrics/dynprog/lcp.c
diff --git a/README.md b/README.md
@@ -18,7 +18,7 @@ representation of the texts and documents are needed before they are put into
 any classification algorithm. In this package, it facilitates various types
 of these representations, including topic modeling and word-embedding algorithms.
 
-The package `shorttext` runs on Python 3.8, 3.9, 3.10, and 3.11.
+The package `shorttext` runs on Python 3.9, 3.10, 3.11, and 3.12.
 Characteristics:
 
 - example data provided (including subject keywords and NIH RePORT);
@@ -31,8 +31,7 @@ Characteristics:
 - maximum entropy classification;
 - metrics of phrases differences, including soft Jaccard score (using Damerau-Levenshtein distance), and Word Mover's distance (WMD);
 - character-level sequence-to-sequence (seq2seq) learning; 
-- spell correction;
-- API for word-embedding algorithm for one-time loading; and
+- spell correction; and
 - Sentence encodings and similarities based on BERT.
 
 ## Documentation
@@ -84,6 +83,7 @@ If you would like to contribute, feel free to submit the pull requests. You can
 
 ## News
 
+* 12/14/2024: `shorttext` 2.1.0 released.
 * 07/12/2024: `shorttext` 2.0.0 released.
 * 12/21/2023: `shorttext` 1.6.1 released.
 * 08/26/2023: `shorttext` 1.6.0 released.
@@ -159,8 +159,3 @@ If you would like to contribute, feel free to submit the pull requests. You can
 * 12/21/2016: `shorttext` 0.2.0 released.
 * 11/25/2016: `shorttext` 0.1.2 released.
 * 11/21/2016: `shorttext` 0.1.1 released.
-
-## Possible Future Updates
-
-- [ ] Dividing components to other packages;
-- [ ] More available corpus.
diff --git a/docs/codes.rst b/docs/codes.rst
@@ -65,7 +65,13 @@ Module `shorttext.metrics.dynprog`
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
 .. automodule:: shorttext.metrics.dynprog.jaccard
-   :members: soft_intersection_list
+   :members:
+
+.. automodule:: shorttext.metrics.dynprog.dldist
+   :members:
+
+.. automodule:: shorttext.metrics.dynprog.lcp
+   :members:
 
 Module `shorttext.metrics.wassersterin`
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

diff --git a/docs/conf.py b/docs/conf.py
@@ -56,9 +56,9 @@
 # built documents.
 #
 # The short X.Y version.
-version = u'2.0'
+version = u'2.1'
 # The full version, including alpha/beta/rc tags.
-release = u'2.0.0'
+release = u'2.1.0'
 
 # The language for content autogenerated by Sphinx. Refer to documentation
 # for a list of supported languages.

diff --git a/docs/install.rst b/docs/install.rst
@@ -27,7 +27,7 @@ Backend for Keras
 -----------------
 
 The package keras_ (version >= 2.0.0) uses Tensorflow_ as the backend. Refer to
-:doc:`faq` for how to switch the backend. It is also desirable if the package Cython_ has been previously installed.
+:doc:`faq` for how to switch the backend.
 
 
 Possible Solutions for Installation Failures
@@ -41,7 +41,7 @@ you may try one (or more) of the following:
 
 ::
 
-    pip install -U python3-dev
+    pip install python3-dev
 
 
 
@@ -70,7 +70,6 @@ Required Packages
 
 Home: :doc:`index`
 
-.. _Cython: http://cython.org/
 .. _Numpy: http://www.numpy.org/
 .. _SciPy: https://www.scipy.org/
 .. _Scikit-Learn: http://scikit-learn.org/stable/

diff --git a/docs/intro.rst b/docs/intro.rst
@@ -23,19 +23,12 @@ Characteristics:
 - metrics of phrases differences, including soft Jaccard score (using Damerau-Levenshtein distance), and Word Mover's distance (WMD); (see :doc:`tutorial_metrics`)
 - character-level sequence-to-sequence (seq2seq) learning; (see :doc:`tutorial_charbaseseq2seq`)
 - spell correction; (see :doc:`tutorial_spell`)
-- API for word-embedding algorithm for one-time loading; (see :doc:`tutorial_wordembedAPI`) and
 - Sentence encodings and similarities based on BERT (see :doc:`tutorial_wordembed` and :doc:`tutorial_metrics`).
 
-Before release 0.7.2, part of the package was implemented using C, and it is interfaced to
-Python using SWIG_ (Simplified Wrapper and Interface Generator). Since 1.0.0, these implementations
-were replaced with Cython_.
-
 Author: Kwan Yuet Stephen Ho (LinkedIn_, ResearchGate_, Twitter_)
 
 Home: :doc:`index`
 
 .. _LinkedIn: https://www.linkedin.com/in/kwan-yuet-ho-19882530
 .. _ResearchGate: https://www.researchgate.net/profile/Kwan-yuet_Ho
 .. _Twitter: https://twitter.com/stephenhky
-.. _SWIG: http://www.swig.org/
-.. _Cython: http://cython.org/
diff --git a/docs/news.rst b/docs/news.rst
@@ -1,6 +1,7 @@
 News
 ====
 
+* 12/14/2024: `shorttext` 2.1.0 released.
 * 07/12/2024: `shorttext` 2.0.0 released.
 * 12/21/2023: `shorttext` 1.6.1 released.
 * 08/26/2023: `shorttext` 1.6.0 released.
@@ -81,6 +82,13 @@ News
 What's New
 ----------
 
+Released 2.1.0 (December 14, 2024)
+------------------------------
+
+* Use of `pyproject.toml` for package distribution.
+* Removed Cython components.
+* Huge relative import refactoring.
+
 Released 2.0.0 (July 13, 2024)
 ------------------------------
 

diff --git a/docs/requirements.txt b/docs/requirements.txt
@@ -1,14 +1,12 @@
-Cython==3.0.11
-numpy==2.0.1
-scipy==1.14.0
+numpy==2.2.0
+scipy==1.14.1
 joblib==1.4.2
-scikit-learn==1.5.1
-tensorflow==2.17.0
-keras==3.4.1
-gensim==4.3.3
-pandas==2.2.2
+scikit-learn==1.5.2
+tensorflow==2.18.0
+keras==3.7.0
+gensim==4.0.0
+pandas==2.2.3
 snowballstemmer==2.1.0
-transformers==4.43.4
-torch==2.4.0
-python-Levenshtein==0.25.1
+transformers==4.47.0
+torch==2.5.1
 numba==0.60.0
diff --git a/docs/scripts.rst b/docs/scripts.rst
@@ -12,14 +12,15 @@ ShortTextCategorizerConsole
 
     usage: ShortTextCategorizerConsole [-h] [--wv WV] [--vecsize VECSIZE]
                                        [--topn TOPN] [--inputtext INPUTTEXT]
+                                       [--type TYPE]
                                        model_filepath
 
     Perform prediction on short text with a given trained model.
 
     positional arguments:
       model_filepath        Path of the trained (compact) model.
 
-    optional arguments:
+    options:
       -h, --help            show this help message and exit
       --wv WV               Path of the pre-trained Word2Vec model. (None if not
                             needed)
@@ -28,6 +29,9 @@ ShortTextCategorizerConsole
       --inputtext INPUTTEXT
                             single input text for classification. Run console if
                             set to None. (Default: None)
+      --type TYPE           Type of word-embedding model (default: "word2vec";
+                            other options: "fasttext", "poincare",
+                            "word2vec_nonbinary", "poincare_binary")
 
 
 ShortTextWordEmbedSimilarity
@@ -48,25 +52,4 @@ ShortTextWordEmbedSimilarity
                    options: "fasttext", "poincare")
 
 
-WordEmbedAPI
-------------
-
-::
-
-    usage: WordEmbedAPI [-h] [--port PORT] [--embedtype EMBEDTYPE] [--debug]
-                        filepath
-
-    Load word-embedding models into memory.
-
-    positional arguments:
-      filepath              file path of the word-embedding model
-
-    optional arguments:
-      -h, --help            show this help message and exit
-      --port PORT           port number
-      --embedtype EMBEDTYPE
-                            type of word-embedding algorithm (default: "word2vec),
-                            allowing "word2vec", "fasttext", and "poincare"
-      --debug               Debug mode (Default: False)
-
 Home: :doc:`index`
diff --git a/docs/tutorial.rst b/docs/tutorial.rst
@@ -8,9 +8,6 @@ Before using, type
 
 >>> import shorttext
 
-You will get the message that `Theano`, `Tensorflow` or `CNTK` backend is used for `keras`. Refer to `Keras Backend
-<https://keras.io/backend/>`_ for information about switching backends.
-
 .. toctree::
    :maxdepth: 2
 

diff --git a/docs/tutorial_metrics.rst b/docs/tutorial_metrics.rst
@@ -14,7 +14,9 @@ Each of this change causes a distance of 1. The algorithm was written in C.
 
 First import the package:
 
->>> from shorttext.metrics.dynprog import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
+>>> from shorttext.metrics.dynprog.dldist import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
+>>> from shorttext.metrics.dynprog.lcp import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
+>>> from shorttext.metrics.dynprog import similarity, soft_jaccard_score
 
 The distance can be calculated by:
 

diff --git a/docs/tutorial_wordembedAPI.rst b/docs/tutorial_wordembedAPI.rst