- Type stubs for the Python bindings are now available, allowing better static code analysis, better code completion in supported IDEs and easier understanding of the library's API. (#197)
- The method `LanguageDetector.detect_multiple_languages_of` still returned character indices instead of byte indices when only a single `DetectionResult` was produced. This has been fixed. (#203, #205)
- The method `LanguageDetector.detect_multiple_languages_of` returns byte indices. For creating string slices in Python and JavaScript, character indices are needed but were not provided. This resulted in incorrect `DetectionResult`s for Python and JavaScript. This has now been fixed by converting the byte indices to character indices. (#192)
- Some minor bugs in the WASM module have been fixed to prepare the first release of Lingua for JavaScript.
- Python bindings for the Rust implementation of Lingua have now replaced the pure Python implementation in order to benefit from Rust's performance in any Python software.
- Parallel equivalents for all methods in `LanguageDetector` have been added to give the user the choice of using the library single-threaded or multi-threaded.
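  A minimal sketch of the difference; the name of the parallel variant, `detect_languages_in_parallel_of`, is assumed here rather than confirmed by this note:

  ```python
  from lingua import Language, LanguageDetectorBuilder

  detector = LanguageDetectorBuilder.from_languages(
      Language.ENGLISH, Language.FRENCH, Language.GERMAN
  ).build()

  texts = ["languages are awesome", "les langues sont merveilleuses"]

  # Single-threaded: one call per text
  single = [detector.detect_language_of(text) for text in texts]

  # Multi-threaded: one call for all texts (method name assumed)
  parallel = detector.detect_languages_in_parallel_of(texts)
  ```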
- This release resolves some dependency issues so that the latest versions of the dependencies NumPy, Pandas, and Matplotlib can be used with Python >= 3.9, while older versions are used with Python 3.8.
- All dependencies have been updated to their latest versions.
- Processing the language models is now a little faster thanks to binary search on the language model NumPy arrays.
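  For illustration only (not Lingua's actual model code), a binary-search lookup over a sorted NumPy array of ngrams might look like this, using the hypothetical arrays `ngrams` and `probabilities`:

  ```python
  import numpy as np

  # Hypothetical model layout: ngrams sorted lexicographically,
  # with a parallel array of probabilities
  ngrams = np.array(["abc", "abd", "xyz"])
  probabilities = np.array([0.12, 0.05, 0.33])

  def lookup(ngram: str) -> float:
      # np.searchsorted performs binary search in O(log n)
      i = np.searchsorted(ngrams, ngram)
      if i < len(ngrams) and ngrams[i] == ngram:
          return float(probabilities[i])
      return 0.0

  print(lookup("abd"))  # 0.05
  ```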
- Several bugs in the detection of multiple languages have been fixed that caused incomplete results to be returned in several cases. (#143, #154)
- A significant number of Kazakh texts were incorrectly classified as Mongolian. This has been fixed. (#160)
- A new section on performance tips has been added to the README.
- All dependencies have been updated to their latest versions.
- After applying some internal optimizations, language detection is now approximately 20% to 30% faster. The speed improvement is greater for long input texts than for short ones.
- For long input texts, an error occurred while computing the confidence values due to numerical underflow when converting probabilities. This has been fixed. Thanks to @jordimas for reporting this bug. (#102)
- The min-max normalization method for the confidence values has been replaced with the softmax function. This gives more realistic probabilities. Big thanks to @Alex-Kopylov for proposing and implementing this change. (#99)
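  As a generic sketch (not the library's internal code), softmax maps the languages' raw log-likelihoods to values that are positive and sum to 1, unlike min-max normalization:

  ```python
  import numpy as np

  def softmax(log_likelihoods: np.ndarray) -> np.ndarray:
      # Subtracting the maximum keeps the exponentials numerically stable
      shifted = log_likelihoods - log_likelihoods.max()
      exp = np.exp(shifted)
      return exp / exp.sum()

  # Example log-likelihoods for three candidate languages
  print(softmax(np.array([-120.0, -124.5, -131.0])))
  ```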
- Under certain circumstances, calling the method `LanguageDetector.detect_multiple_languages_of()` raised an `IndexError`. This has been fixed. Thanks to @Saninsusanin for reporting this bug. (#98)
- The new method `LanguageDetector.detect_multiple_languages_of()` has been introduced. It makes it possible to detect multiple languages in mixed-language text. (#4)
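  A short usage sketch; the `DetectionResult` attributes `start_index` and `end_index` are assumed here:

  ```python
  from lingua import Language, LanguageDetectorBuilder

  detector = LanguageDetectorBuilder.from_languages(
      Language.ENGLISH, Language.FRENCH, Language.GERMAN
  ).build()

  sentence = "Parlez-vous français? Ich spreche nur ein bisschen Deutsch."
  for result in detector.detect_multiple_languages_of(sentence):
      # Each result covers one contiguous slice of the input text
      print(f"{result.language.name}: {sentence[result.start_index:result.end_index]!r}")
  ```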
- The new method `LanguageDetector.compute_language_confidence()` has been introduced. It makes it possible to retrieve the confidence value for one specific language only, given the input text. (#86)
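  A short sketch of the call, assuming the input text comes first and the language second:

  ```python
  from lingua import Language, LanguageDetectorBuilder

  detector = LanguageDetectorBuilder.from_languages(
      Language.ENGLISH, Language.FRENCH, Language.GERMAN
  ).build()

  # Confidence for one specific language instead of the full list of values
  confidence = detector.compute_language_confidence("Bonjour le monde", Language.FRENCH)
  print(f"{confidence:.2f}")
  ```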
- The computation of the confidence values has been revised: the min-max normalization algorithm is now applied to the values, making them easier to compare because they behave more like real probabilities. (#78)
- The library now has a fresh and colorful new logo. Why? Well, why not? (-:
- An `__all__` variable has been added, indicating which types are exported by the library. This helps with type-checking programs that use Lingua. Big thanks to @bscan for the pull request. (#76)
- The rule-based language filter has been improved for German texts. (#71)
- A further bottleneck in the code has been removed, making language detection approximately 30% faster compared to version 1.1.2.
- The language models are now stored on disk as serialized NumPy arrays instead of JSON. This reduces the preloading time of the language models significantly.
- A bottleneck in the language detection code has been removed, making language detection approximately 40% faster.
- The `py.typed` file that activates static type checking was missing. Big thanks to @Vasniktel for reporting this problem. (#63)
- The ISO 639-3 code for Urdu was wrong. Big thanks to @pluiez for reporting this bug. (#64)
- For certain ngrams, wrong probabilities were returned. This has been fixed. Big thanks to @3a77 for reporting this bug. (#62)
- The new method `LanguageDetectorBuilder.with_low_accuracy_mode()` has been introduced. By activating it, detection accuracy for short text is reduced in favor of a smaller memory footprint and faster detection performance.
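  A minimal sketch of activating the mode, assuming the usual builder chain with `from_all_languages()`:

  ```python
  from lingua import LanguageDetectorBuilder

  # Trades accuracy on short text for less memory and faster detection
  detector = (
      LanguageDetectorBuilder.from_all_languages()
      .with_low_accuracy_mode()
      .build()
  )
  ```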
- The memory footprint has been reduced significantly by storing the language models in structured NumPy arrays instead of dictionaries. This reduces memory consumption from approximately 2600 MB to 800 MB.
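  For illustration only (a hypothetical layout, not Lingua's actual format), a structured NumPy array keeps ngrams and probabilities in one contiguous block instead of many dictionary entries:

  ```python
  import numpy as np

  # Hypothetical structured array: one compact record per ngram
  model = np.array(
      [("abc", 0.12), ("abd", 0.05), ("xyz", 0.33)],
      dtype=[("ngram", "U5"), ("probability", "f4")],
  )
  print(model["probability"][model["ngram"] == "abd"])  # [0.05]
  ```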
- Several language model files have become obsolete and could be deleted without decreasing detection accuracy. This results in a smaller memory footprint.
- The lowest supported Python version is now 3.8. Python 3.7 is no longer compatible with this library.
- This patch release makes the library compatible with Python >= 3.7.1. Previously, it could be installed from PyPI only with Python >= 3.9.
- The very first release of Lingua. Enjoy!