Releases: stanfordnlp/CoreNLP
v4.5.0
CoreNLP 4.5.0
The main features are improved lemmatization of English, improved tokenization of both English and non-English flex-based languages, and some updates to tregex, tsurgeon, and semgrex.
- All PTB and German tokens are now normalized in PTBLexer (previously only German umlauts). This makes the tokenizer 2% slower, but should avoid issues with "resumé", for example. d46fecd
- log4j removed entirely from public CoreNLP (the internal "research" branch still has a use) f05cb54
- Fix NumberFormatException showing up in NER models: #547 5ee2c39
- Fix "seconds" in the lemmatizer: e7a073b
- Fix double escaping of & in the online demos: 8413fa1
- Report the cause of an error if "tregex" is asked for but no parse annotator is added: 4db80c0
- Merge ssplit and cleanxml into the tokenize annotator (done in a backwards-compatible manner): #1259
- Custom tregex pattern, ROOT tregex pattern, and a tsurgeon operation for simultaneously moving a subtree and pruning anything left behind, used for processing the Italian VIT treebank in stanza: #1263
- Refactor tokenization of punctuation, filenames, and other entities common to all languages, not just English: 3c40ba3 58a2288 8b97d64
- Improved tokenization of number patterns, names with apostrophes such as Sh'reyan, non-American phone numbers, and invisible commas 9476a8e 6193934 afb1ea8 7c84960
- Significant lemmatizer improvements: adjectives & adverbs, along with various other special cases #1266
- Include graph & semgrex indices in the results for a semgrex query, which makes the results more usable (see the sketch after this list) 45b47e2
- Trim words in the NER training process. Spaces can still be inside a word, but random whitespace won't ruin the performance of the models 0d9e9c8
- Fix NBSP in the Chinese segmenter stanfordnlp/stanza#1052 #1279
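To make the semgrex items above more concrete, here is a minimal sketch of running a semgrex query over a dependency graph from the Java API. The pattern, example text, and printed fields are illustrative assumptions, not something these notes specify.

```java
import java.util.Properties;

import edu.stanford.nlp.ling.IndexedWord;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;
import edu.stanford.nlp.semgraph.SemanticGraph;
import edu.stanford.nlp.semgraph.semgrex.SemgrexMatcher;
import edu.stanford.nlp.semgraph.semgrex.SemgrexPattern;

public class SemgrexExample {
  public static void main(String[] args) {
    // Build a pipeline that produces dependency graphs.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos,lemma,depparse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    CoreDocument doc = new CoreDocument("The cat chased the mouse.");
    pipeline.annotate(doc);

    // Match any word that governs an nsubj dependent.
    SemgrexPattern pattern = SemgrexPattern.compile("{}=pred >nsubj {}=subj");
    for (CoreSentence sentence : doc.sentences()) {
      SemanticGraph graph = sentence.dependencyParse();
      SemgrexMatcher matcher = pattern.matcher(graph);
      while (matcher.find()) {
        IndexedWord pred = matcher.getNode("pred");
        IndexedWord subj = matcher.getNode("subj");
        // IndexedWord.index() is the 1-based token index within the sentence.
        System.out.printf("%s (index %d) has subject %s (index %d)%n",
            pred.word(), pred.index(), subj.word(), subj.index());
      }
    }
  }
}
```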
v4.4.0
Enhancements
- added a -preTokenized option, which assumes the text is already tokenized on whitespace and should be sentence split on newlines (see the sketch after this list)
- tsurgeon CLI - python side added to stanza #1240
- sutime WORKDAY definition 0dfb118
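A minimal sketch of running the pipeline over pre-tokenized input. The property-based settings below (tokenize.whitespace and ssplit.eolonly) are the long-standing way to get this behavior; treating them as the equivalent of the new -preTokenized shortcut is an assumption, and the example text is made up.

```java
import java.util.Properties;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreSentence;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class PreTokenizedExample {
  public static void main(String[] args) {
    // One sentence per line, tokens separated by single spaces.
    String text = "This is already tokenized .\nSo is this line .";

    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos");
    // Assumed equivalent of -preTokenized: split tokens on whitespace only
    // and treat each newline as a sentence boundary.
    props.setProperty("tokenize.whitespace", "true");
    props.setProperty("ssplit.eolonly", "true");

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    CoreDocument doc = new CoreDocument(text);
    pipeline.annotate(doc);

    for (CoreSentence sentence : doc.sentences()) {
      for (CoreLabel token : sentence.tokens()) {
        System.out.print(token.word() + " ");
      }
      System.out.println();
    }
  }
}
```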
Fixes
- rebuilt Italian dependency parser using CoreNLP predicted tags
- XML security issue: #1241
- NER server security issue: 5ee097d
- fix infinite loop in tregex: #1238
- json utf-8 output on windows #1231 stanfordnlp/stanza#894
- fix nondeterministic results in certain SemanticGraph structures #1228 cc806f2
- workaround for NLTK sending % unescaped to the server #1226 20fe1e9
- make TimingTest function on Windows 4aafb84
v4.3.2
v4.3.1
v4.3.0
v4.2.2
v4.2.1
- Fix the server having some links as http instead of https #1146
- Improve MWE expressions in the enhanced dependency conversion 1ef9ef9
- Add the ability for the command line semgrex processor to handle multiple calls in one process c9d50ef
- Fix interaction between discarding tokens in ssplit and assigning NER tags a803bc3
- Reduce the size of the SR parser models (not a huge amount, but some) #1142
- Various QuoteAnnotator bug fixes (usage sketch after this list) #1135 #1134 #1121 #1118 9f1b015 #1147
- Switch to newer istack implementation #1133
- Newer protobuf #1150
- Add a conllu output format to some of the segmenter code, useful for testing with the official test scripts c70ddec
- Fix Turkish locale enums #1126 stanfordnlp/stanza#580
- Use StringBuilder instead of StringBuffer where possible #1010
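For context on the QuoteAnnotator fixes above, a minimal usage sketch. The annotator list and sample text are assumptions based on the standard quote-attribution setup rather than anything specified in these notes.

```java
import java.util.Properties;

import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.CoreQuote;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class QuoteExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    // The quote annotator relies on the earlier annotators for attribution.
    props.setProperty("annotators",
        "tokenize,ssplit,pos,lemma,ner,depparse,coref,quote");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    CoreDocument doc = new CoreDocument("\"I like cheese,\" said Chris.");
    pipeline.annotate(doc);

    for (CoreQuote quote : doc.quotes()) {
      // speaker() is an Optional, since attribution may fail.
      System.out.println(quote.text() + " -- "
          + quote.speaker().orElse("unknown speaker"));
    }
  }
}
```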
v4.2.0
Overview
This release features a collection of small bug fixes and updates. It is the first release built directly from the GitHub repo.
Enhancements
- Upgrade libraries (EJML, JUnit, JFlex)
- Add character offsets to Tregex responses from server
- Improve cleaning of treebanks for English models
- Speed up loading of Wikidict annotator
- New utility for tagging CoNLL-U files in place
- Command line tool for processing TokensRegex (see the sketch after this list)
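As an illustration of what TokensRegex matches look like, here is a short sketch using the Java library API rather than the new command line tool itself; the pattern and text are made-up examples.

```java
import java.util.List;
import java.util.Properties;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.tokensregex.TokenSequenceMatcher;
import edu.stanford.nlp.ling.tokensregex.TokenSequencePattern;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class TokensRegexExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit,pos");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    CoreDocument doc = new CoreDocument("The quick fox jumped over the lazy dog.");
    pipeline.annotate(doc);
    List<CoreLabel> tokens = doc.tokens();

    // Match a determiner followed by zero or more adjectives and a noun.
    TokenSequencePattern pattern =
        TokenSequencePattern.compile("[{tag:DT}] [{tag:JJ}]* [{tag:/NN.*/}]");
    TokenSequenceMatcher matcher = pattern.getMatcher(tokens);
    while (matcher.find()) {
      System.out.println(matcher.group());
    }
  }
}
```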
Fixes
- Output single token NER entities in inline XML output format
- Add currency symbol part of speech training data
- Fix issues with tree binarizing
Stanford CoreNLP 4.0.0
Overview
The latest release of Stanford CoreNLP includes a major overhaul of tokenization and a large collection of new parsing and tagging models. There are also miscellaneous enhancements and fixes.
Enhancements
- UD v2.0 tokenization standard for English, French, German, and Spanish. That means "new" LDC tokenization for English (splitting on most hyphens) and not escaping parentheses or turning quotes etc. into ASCII sequences by default.
- Upgrade options for normalizing special chars (quotes, parentheses, etc.) in PTBTokenizer (see the property sketch after this list)
- Have WhitespaceTokenizer support same newline processing as PTBTokenizer
- New mwt annotator for handling multiword tokens in French, German, and Spanish.
- New models with more training data and better performance for tagging and parsing in English, French, German, and Spanish.
- Add French NER
- New Chinese segmentation based off CTB9
- Improved handling of double codepoint characters
- Easier syntax for specifying language specific pipelines and NER pipeline properties
- Improved CoNLL-U processing
- Improved speed and memory performance for CRF training
- Tregex support in CoreSentence
- Updated library dependencies
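A brief sketch of adjusting tokenizer normalization through pipeline properties. The specific tokenize.options values below, which turn the older PTB-style escaping back on, are an illustrative assumption about one possible configuration, not a statement of the new defaults.

```java
import java.util.Properties;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.CoreDocument;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class TokenizerOptionsExample {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize,ssplit");
    // Assumed example: re-enable PTB-style escaping so that parentheses and
    // quotes are normalized the way older models expect.
    props.setProperty("tokenize.options",
        "ptb3Escaping=true,normalizeParentheses=true,normalizeOtherBrackets=true");

    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);
    CoreDocument doc = new CoreDocument("He said (quietly) \"hello there\".");
    pipeline.annotate(doc);

    for (CoreLabel token : doc.tokens()) {
      System.out.print(token.word() + " ");
    }
    System.out.println();
  }
}
```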
Fixes
- NPE while simultaneously tokenizing on whitespace and sentence splitting on newlines
- NPE in EntityMentionsAnnotator during language check
- NPE in CorefMentionAnnotator while aligning coref mentions with titles and entity mentions
- NPE in NERCombinerAnnotator in certain configurations of models on/off
- Incorrect handling of eolonly option in ArabicSegmenterAnnotator
- Apply named entity granularity change prior to coref mention detection
- Incorrect handling of keeping newline tokens when using Chinese segmenter on Windows
- Incorrect handling of reading in German treebank files
- SR parser crashes when given bad training input
- New PTBTokenizer known abbreviations: "Tech.", "Amb.". Fix legacy tokenizer hack special casing 'Alex.' for 'Alex. Brown'
- Fix an ancient bug in printing a constituency tree with multiple roots
- Fix the parser failing on the word "STOP", which it treated as a special word