Releases: stanfordnlp/CoreNLP
v4.5.0
CoreNLP 4.5.0
The main features are improved lemmatization of English, improved tokenization of both English and non-English flex-based languages, and some updates to tregex, tsurgeon, and semgrex.
- All PTB and German tokens are now normalized in PTBLexer (previously only German umlauts). This makes the tokenizer 2% slower, but should avoid issues with "resumé", for example. d46fecd
- log4j removed entirely from public CoreNLP (the internal "research" branch still has a use) f05cb54
- Fix NumberFormatException showing up in NER models: #547 5ee2c39
- Fix "seconds" in the lemmatizer: e7a073b
- Fix double escaping of & in the online demos: 8413fa1
- Report the cause of an error if "tregex" is asked for but no parse annotator is added: 4db80c0
- Merge ssplit and cleanxml into the tokenize annotator (done in a backwards-compatible manner): #1259
- Custom tregex pattern, ROOT tregex pattern, and a tsurgeon operation for simultaneously moving a subtree and pruning anything left behind, used for processing the Italian VIT treebank in stanza: #1263
- Refactor tokenization of punctuation, filenames, and other entities common to all languages, not just English: 3c40ba3 58a2288 8b97d64
- Improved tokenization of number patterns, names with apostrophes such as Sh'reyan, non-American phone numbers, and invisible commas 9476a8e 6193934 afb1ea8 7c84960
- Significant lemmatizer improvements: adjectives & adverbs, along with various other special cases #1266
- Include graph & semgrex indices in the results for a semgrex query, which makes the results more usable (see the sketch after this list) 45b47e2
- Trim words in the NER training process. Spaces can still be inside a word, but random whitespace won't ruin the performance of the models 0d9e9c8
- Fix NBSP in the Chinese segmenter stanfordnlp/stanza#1052 #1279
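To make the semgrex items above more concrete, here is a minimal sketch of running a semgrex query over a dependency graph from the Java API. The pattern, example text, and printed fields are illustrative assumptions, not something these notes specify.

```java
import java.util.Properties;

import edu.stanford.nlp.ling.IndexedWord;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher;
import edu.stanford.nlp.semgraph.semgrex.SemgrexPattern;

public class SemgrexExample {
  public static void main(String[] args) {
    // Build a pipeline that produces dependency graphs.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    CoreDocument doc = new CoreDocument("The cat chased the mouse.");
    pipeline.annotate(doc);

    // Match any word that governs an nsubj dependent.
    SemgrexPattern pattern = SemgrexPattern.compile("{}=pred >nsubj {}=subj");
    for (CoreSentence sentence : doc.sentences()) {
      SemanticGraph graph = sentence.dependencyParse();
      SemgrexMatcher matcher = pattern.matcher(graph);
      while (matcher.find()) {
        IndexedWord pred = matcher.getNode("pred");
        IndexedWord subj = matcher.getNode("subj");
        // IndexedWord.index() is the 1-based token index within the sentence.
        System.out.printf("%s (index %d) has subject %s (index %d)%n",
            pred.word(), pred.index(), subj.word(), subj.index());
      }
    }
  }
}
```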
v4.4.0
Enhancements
- added a -preTokenized option, which assumes the text is already tokenized on whitespace and should be sentence split on newlines (see the sketch after this list)
- tsurgeon CLI - python side added to stanza #1240
- sutime WORKDAY definition 0dfb118
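A minimal sketch of running the pipeline over pre-tokenized input. The property-based settings below (tokenize.whitespace and ssplit.eolonly) are the long-standing way to get this behavior; treating them as the equivalent of the new -preTokenized shortcut is an assumption, and the example text is made up.

```java
import java.util.Properties;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class PreTokenizedExample {
  public static void main(String[] args) {
    // One sentence per line, tokens separated by single spaces.
    String text = "This is already tokenized .\nSo is this line .";

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos");
    // Assumed equivalent of -preTokenized: split tokens on whitespace only
    // and treat each newline as a sentence boundary.
    props.setProperty("tokenize.whitespace", "true");
    props.setProperty("ssplit.eolonly", "true");

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    CoreDocument doc = new CoreDocument(text);
    pipeline.annotate(doc);

    for (CoreSentence sentence : doc.sentences()) {
      for (CoreLabel token : sentence.tokens()) {
        System.out.print(token.word() + " ");
      }
      System.out.println();
    }
  }
}
```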
Fixes
- rebuilt Italian dependency parser using CoreNLP predicted tags
- XML security issue: #1241
- NER server security issue: 5ee097d
- fix infinite loop in tregex: #1238
- json utf-8 output on windows #1231 stanfordnlp/stanza#894
- fix nondeterministic results in certain SemanticGraph structures #1228 cc806f2
- workaround for NLTK sending % unescaped to the server #1226 20fe1e9
- make TimingTest function on Windows 4aafb84
v4.3.2
v4.3.1
v4.3.0
v4.2.2
v4.2.1
- Fix the server having some links as http instead of https #1146
- Improve MWE expressions in the enhanced dependency conversion 1ef9ef9
- Add the ability for the command line semgrex processor to handle multiple calls in one process c9d50ef
- Fix interaction between discarding tokens in ssplit and assigning NER tags a803bc3
- Reduce the size of the SR parser models (not a huge amount, but some) #1142
- Various QuoteAnnotator bug fixes (usage sketch after this list) #1135 #1134 #1121 #1118 9f1b015 #1147
- Switch to newer istack implementation #1133
- Newer protobuf #1150
- Add a conllu output format to some of the segmenter code, useful for testing with the official test scripts c70ddec
- Fix Turkish locale enums #1126 stanfordnlp/stanza#580
- Use StringBuilder instead of StringBuffer where possible #1010
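For context on the QuoteAnnotator fixes above, a minimal usage sketch. The annotator list and sample text are assumptions based on the standard quote-attribution setup rather than anything specified in these notes.

```java
import java.util.Properties;

import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreQuote;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class QuoteExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    // The quote annotator relies on the earlier annotators for attribution.
    props.setProperty("annotators",
        "tokenize,ssplit,pos,lemma,ner,depparse,coref,quote");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    CoreDocument doc = new CoreDocument("\"I like cheese,\" said Chris.");
    pipeline.annotate(doc);

    for (CoreQuote quote : doc.quotes()) {
      // speaker() is an Optional, since attribution may fail.
      System.out.println(quote.text() + " -- "
          + quote.speaker().orElse("unknown speaker"));
    }
  }
}
```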
v4.2.0
Overview
This release features a collection of small bug fixes and updates. It is the first release built directly from the GitHub repo.
Enhancements
- Upgrade libraries (EJML, JUnit, JFlex)
- Add character offsets to Tregex responses from server
- Improve cleaning of treebanks for English models
- Speed up loading of Wikidict annotator
- New utility for tagging CoNLL-U files in place
- Command line tool for processing TokensRegex (see the sketch after this list)
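As an illustration of what TokensRegex matches look like, here is a short sketch using the Java library API rather than the new command line tool itself; the pattern and text are made-up examples.

```java
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.tokensregex.TokenSequenceMatcher;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class TokensRegexExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    CoreDocument doc = new CoreDocument("The quick fox jumped over the lazy dog.");
    pipeline.annotate(doc);
    List<CoreLabel> tokens = doc.tokens();

    // Match a determiner followed by zero or more adjectives and a noun.
    TokenSequencePattern pattern =
        TokenSequencePattern.compile("[{tag:DT}] [{tag:JJ}]* [{tag:/NN.*/}]");
    TokenSequenceMatcher matcher = pattern.getMatcher(tokens);
    while (matcher.find()) {
      System.out.println(matcher.group());
    }
  }
}
```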
Fixes
- Output single token NER entities in inline XML output format
- Add currency symbol part of speech training data
- Fix issues with tree binarizing
Stanford CoreNLP 4.0.0
Overview
The latest release of Stanford CoreNLP includes a major overhaul of tokenization and a large collection of new parsing and tagging models. There are also miscellaneous enhancements and fixes.
Enhancements
- UD v2.0 tokenization standard for English, French, German, and Spanish. That means "new" LDC tokenization for English (splitting on most hyphens) and not escaping parentheses or turning quotes etc. into ASCII sequences by default.
- Upgrade options for normalizing special chars (quotes, parentheses, etc.) in PTBTokenizer (see the property sketch after this list)
- Have WhitespaceTokenizer support same newline processing as PTBTokenizer
- New mwt annotator for handling multiword tokens in French, German, and Spanish.
- New models with more training data and better performance for tagging and parsing in English, French, German, and Spanish.
- Add French NER
- New Chinese segmentation based off CTB9
- Improved handling of double codepoint characters
- Easier syntax for specifying language specific pipelines and NER pipeline properties
- Improved CoNLL-U processing
- Improved speed and memory performance for CRF training
- Tregex support in CoreSentence
- Updated library dependencies
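A brief sketch of adjusting tokenizer normalization through pipeline properties. The specific tokenize.options values below, which turn the older PTB-style escaping back on, are an illustrative assumption about one possible configuration, not a statement of the new defaults.

```java
import java.util.Properties;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class TokenizerOptionsExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    // Assumed example: re-enable PTB-style escaping so that parentheses and
    // quotes are normalized the way older models expect.
    props.setProperty("tokenize.options",
        "ptb3Escaping=true,normalizeParentheses=true,normalizeOtherBrackets=true");

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    CoreDocument doc = new CoreDocument("He said (quietly) \"hello there\".");
    pipeline.annotate(doc);

    for (CoreLabel token : doc.tokens()) {
      System.out.print(token.word() + " ");
    }
    System.out.println();
  }
}
```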
Fixes
- NPE while simultaneously tokenizing on whitespace and sentence splitting on newlines
- NPE in EntityMentionsAnnotator during language check
- NPE in CorefMentionAnnotator while aligning coref mentions with titles and entity mentions
- NPE in NERCombinerAnnotator in certain configurations of models on/off
- Incorrect handling of eolonly option in ArabicSegmenterAnnotator
- Apply named entity granularity change prior to coref mention detection
- Incorrect handling of keeping newline tokens when using Chinese segmenter on Windows
- Incorrect handling of reading in German treebank files
- SR parser crashes when given bad training input
- New PTBTokenizer known abbreviations: "Tech.", "Amb.". Fix legacy tokenizer hack special casing 'Alex.' for 'Alex. Brown'
- Fix an ancient bug in printing a constituency tree with multiple roots
- Fix the parser failing on the word "STOP", which it treated as a special word