Commit

Merge pull request #314 from stephenhky/develop
Release 2.1.0
stephenhky authored Dec 15, 2024
2 parents 427b88d + 9b50b62 commit ac14957
Showing 59 changed files with 310 additions and 37,999 deletions.
7 changes: 4 additions & 3 deletions .circleci/config.yml
@@ -13,19 +13,20 @@ shared: &shared
sudo apt-get update
sudo apt-get install libc6
sudo apt-get install python3-dev
sudo apt-get install -y g++
- run:
name: Installing Miniconda and Packages
command: |
pip install --upgrade --user pip
pip install --upgrade --user .
pip install --upgrade --user google-compute-engine
pip install --user .
- run:
name: Run Unit Tests
command: |
pip install -r test_requirements.txt
python setup.py test
pip install --user .[test]
pytest
jobs:
2 changes: 1 addition & 1 deletion .readthedocs.yml
@@ -12,7 +12,7 @@ sphinx:
build:
os: ubuntu-22.04
tools:
python: "3.9"
python: "3.12"

# Build documentation with MkDocs
#mkdocs:
8 changes: 1 addition & 7 deletions MANIFEST.in
@@ -1,12 +1,6 @@
include README.md
include requirements.txt
include setup_requirements.txt
include test_requirements.txt
include LICENSE
include pyproject.toml
include shorttext/data/shorttext_exampledata.csv
include shorttext/utils/stopwords.txt
include shorttext/utils/nonneg_stopwords.txt
include shorttext/metrics/dynprog/dldist.pyx
include shorttext/metrics/dynprog/dldist.c
include shorttext/metrics/dynprog/lcp.pyx
include shorttext/metrics/dynprog/lcp.c
11 changes: 3 additions & 8 deletions README.md
@@ -18,7 +18,7 @@ representation of the texts and documents are needed before they are put into
any classification algorithm. In this package, it facilitates various types
of these representations, including topic modeling and word-embedding algorithms.

The package `shorttext` runs on Python 3.8, 3.9, 3.10, and 3.11.
The package `shorttext` runs on Python 3.9, 3.10, 3.11, and 3.12.
Characteristics:

- example data provided (including subject keywords and NIH RePORT);
@@ -31,8 +31,7 @@ Characteristics:
- maximum entropy classification;
- metrics of phrases differences, including soft Jaccard score (using Damerau-Levenshtein distance), and Word Mover's distance (WMD);
- character-level sequence-to-sequence (seq2seq) learning;
- spell correction;
- API for word-embedding algorithm for one-time loading; and
- spell correction; and
- Sentence encodings and similarities based on BERT.

## Documentation
@@ -84,6 +83,7 @@ If you would like to contribute, feel free to submit the pull requests. You can

## News

* 12/14/2024: `shorttext` 2.1.0 released.
* 07/12/2024: `shorttext` 2.0.0 released.
* 12/21/2023: `shorttext` 1.6.1 released.
* 08/26/2023: `shorttext` 1.6.0 released.
@@ -159,8 +159,3 @@ If you would like to contribute, feel free to submit the pull requests. You can
* 12/21/2016: `shorttext` 0.2.0 released.
* 11/25/2016: `shorttext` 0.1.2 released.
* 11/21/2016: `shorttext` 0.1.1 released.

## Possible Future Updates

- [ ] Dividing components to other packages;
- [ ] More available corpus.
8 changes: 7 additions & 1 deletion docs/codes.rst
@@ -65,7 +65,13 @@ Module `shorttext.metrics.dynprog`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. automodule:: shorttext.metrics.dynprog.jaccard
:members: soft_intersection_list
:members:

.. automodule:: shorttext.metrics.dynprog.dldist
:members:

.. automodule:: shorttext.metrics.dynprog.lcp
:members:

Module `shorttext.metrics.wassersterin`
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
4 changes: 2 additions & 2 deletions docs/conf.py
@@ -56,9 +56,9 @@
# built documents.
#
# The short X.Y version.
version = u'2.0'
version = u'2.1'
# The full version, including alpha/beta/rc tags.
release = u'2.0.0'
release = u'2.1.0'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
5 changes: 2 additions & 3 deletions docs/install.rst
@@ -27,7 +27,7 @@ Backend for Keras
-----------------

The package keras_ (version >= 2.0.0) uses Tensorflow_ as the backend. Refer to
:doc:`faq` for how to switch the backend. It is also desirable if the package Cython_ has been previously installed.
:doc:`faq` for how to switch the backend.


Possible Solutions for Installation Failures
@@ -41,7 +41,7 @@ you may try one (or more) of the following:

::

pip install -U python3-dev
pip install python3-dev



@@ -70,7 +70,6 @@ Required Packages

Home: :doc:`index`

.. _Cython: http://cython.org/
.. _Numpy: http://www.numpy.org/
.. _SciPy: https://www.scipy.org/
.. _Scikit-Learn: http://scikit-learn.org/stable/
7 changes: 0 additions & 7 deletions docs/intro.rst
@@ -23,19 +23,12 @@ Characteristics:
- metrics of phrases differences, including soft Jaccard score (using Damerau-Levenshtein distance), and Word Mover's distance (WMD); (see :doc:`tutorial_metrics`)
- character-level sequence-to-sequence (seq2seq) learning; (see :doc:`tutorial_charbaseseq2seq`)
- spell correction; (see :doc:`tutorial_spell`)
- API for word-embedding algorithm for one-time loading; (see :doc:`tutorial_wordembedAPI`) and
- Sentence encodings and similarities based on BERT (see :doc:`tutorial_wordembed` and :doc:`tutorial_metrics`).

Before release 0.7.2, part of the package was implemented using C, and it is interfaced to
Python using SWIG_ (Simplified Wrapper and Interface Generator). Since 1.0.0, these implementations
were replaced with Cython_.

Author: Kwan Yuet Stephen Ho (LinkedIn_, ResearchGate_, Twitter_)

Home: :doc:`index`

.. _LinkedIn: https://www.linkedin.com/in/kwan-yuet-ho-19882530
.. _ResearchGate: https://www.researchgate.net/profile/Kwan-yuet_Ho
.. _Twitter: https://twitter.com/stephenhky
.. _SWIG: http://www.swig.org/
.. _Cython: http://cython.org/
8 changes: 8 additions & 0 deletions docs/news.rst
@@ -1,6 +1,7 @@
News
====

* 12/14/2024: `shorttext` 2.1.0 released.
* 07/12/2024: `shorttext` 2.0.0 released.
* 12/21/2023: `shorttext` 1.6.1 released.
* 08/26/2023: `shorttext` 1.6.0 released.
@@ -81,6 +82,13 @@ News
What's New
----------

Released 2.1.0 (December 14, 2024)
------------------------------

* Use of `pyproject.toml` for package distribution.
* Removed Cython components.
* Huge relative import refactoring.

Released 2.0.0 (July 13, 2024)
------------------------------

20 changes: 9 additions & 11 deletions docs/requirements.txt
@@ -1,14 +1,12 @@
Cython==3.0.11
numpy==2.0.1
scipy==1.14.0
numpy==2.2.0
scipy==1.14.1
joblib==1.4.2
scikit-learn==1.5.1
tensorflow==2.17.0
keras==3.4.1
gensim==4.3.3
pandas==2.2.2
scikit-learn==1.5.2
tensorflow==2.18.0
keras==3.7.0
gensim==4.0.0
pandas==2.2.3
snowballstemmer==2.1.0
transformers==4.43.4
torch==2.4.0
python-Levenshtein==0.25.1
transformers==4.47.0
torch==2.5.1
numba==0.60.0
27 changes: 5 additions & 22 deletions docs/scripts.rst
@@ -12,14 +12,15 @@ ShortTextCategorizerConsole

usage: ShortTextCategorizerConsole [-h] [--wv WV] [--vecsize VECSIZE]
[--topn TOPN] [--inputtext INPUTTEXT]
[--type TYPE]
model_filepath

Perform prediction on short text with a given trained model.

positional arguments:
model_filepath Path of the trained (compact) model.

optional arguments:
options:
-h, --help show this help message and exit
--wv WV Path of the pre-trained Word2Vec model. (None if not
needed)
@@ -28,6 +29,9 @@ ShortTextCategorizerConsole
--inputtext INPUTTEXT
single input text for classification. Run console if
set to None. (Default: None)
--type TYPE Type of word-embedding model (default: "word2vec";
other options: "fasttext", "poincare",
"word2vec_nonbinary", "poincare_binary")


ShortTextWordEmbedSimilarity
@@ -48,25 +52,4 @@ ShortTextWordEmbedSimilarity
options: "fasttext", "poincare")


WordEmbedAPI
------------

::

usage: WordEmbedAPI [-h] [--port PORT] [--embedtype EMBEDTYPE] [--debug]
filepath

Load word-embedding models into memory.

positional arguments:
filepath file path of the word-embedding model

optional arguments:
-h, --help show this help message and exit
--port PORT port number
--embedtype EMBEDTYPE
type of word-embedding algorithm (default: "word2vec),
allowing "word2vec", "fasttext", and "poincare"
--debug Debug mode (Default: False)

Home: :doc:`index`
3 changes: 0 additions & 3 deletions docs/tutorial.rst
@@ -8,9 +8,6 @@ Before using, type

>>> import shorttext

You will get the message that `Theano`, `Tensorflow` or `CNTK` backend is used for `keras`. Refer to `Keras Backend
<https://keras.io/backend/>`_ for information about switching backends.

.. toctree::
:maxdepth: 2

4 changes: 3 additions & 1 deletion docs/tutorial_metrics.rst
@@ -14,7 +14,9 @@ Each of this change causes a distance of 1. The algorithm was written in C.

First import the package:

>>> from shorttext.metrics.dynprog import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
>>> from shorttext.metrics.dynprog.dldist import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
>>> from shorttext.metrics.dynprog.lcp import damerau_levenshtein, longest_common_prefix, similarity, soft_jaccard_score
>>> from shorttext.metrics.dynprog import similarity, soft_jaccard_score

The distance can be calculated by:
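
The tutorial's own call is cut off by this hunk. For orientation only, here is a minimal pure-Python sketch of what a Damerau-Levenshtein distance computes, using the restricted (optimal string alignment) variant in which each insertion, deletion, substitution, or adjacent transposition costs 1. This is not the code in `shorttext.metrics.dynprog.dldist`; the function name `osa_distance` is hypothetical, and the package's implementation may differ in variant and performance.

```python
def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Hypothetical illustration; not taken from the shorttext package.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # d[i][j] holds the distance between a[:i] and b[:j].
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b[:j]

    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Adjacent transposition, e.g. "ab" -> "ba".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


if __name__ == "__main__":
    print(osa_distance("diver", "driver"))  # one insertion
    print(osa_distance("ab", "ba"))         # one transposition
```

The table-based form keeps the four edit operations visible at the cost of O(len(a) * len(b)) memory; a two-row version would be enough if only the final distance is needed.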

40 changes: 0 additions & 40 deletions docs/tutorial_wordembedAPI.rst

This file was deleted.
