Changelog
Added
- Support for setuptools based projects in the edsnlp.package command
- Pipelines can now be instantiated directly from a config file (instead of having to cast a dict containing their arguments) by putting the @core = "pipeline" or "load" field in the pipeline section
- edsnlp.load now correctly takes disable, enable and exclude parameters into account
- Pipeline now has a basic repr showing its base language (mostly useful to know its tokenizer) and its pipes
- New python -m edsnlp.evaluate script to evaluate a model on a dataset
- Sentence detection can now be configured to change the minimum number of newlines to consider a newline-triggered sentence, and disable capitalization checking.
- New eds.split pipe to split a document into multiple documents based on a splitting pattern (useful for training)
- Allow converter argument of edsnlp.data.read/from_... to be a list of converters instead of a single converter
- New revamped and documented edsnlp.train script and API
- Support YAML config files (supported only CFG/INI files before)
- Most of EDS-NLP functions are now clickable in the documentation
- ScheduledOptimizer now accepts schedules directly in place of parameters, and offers easy parameter selection:

    ScheduledOptimizer(
        optim="adamw",
        module=nlp,
        total_steps=2000,
        groups={
            "^transformer": {
                # lr will go from 0 to 5e-5 then to 0 for params matching "transformer"
                "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 0, "max_value": 5e-5},
            },
            "": {
                # lr will go from 3e-4 during 200 steps then to 0 for other params
                "lr": {"@schedules": "linear", "warmup_rate": 0.1, "start_value": 3e-4, "max_value": 3e-4},
            },
        },
    )
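The comments in the example above describe a warmup phase followed by a decay to 0. The following is a minimal pure-Python sketch of that behavior, not EDS-NLP's implementation: the function names `linear_schedule` and `select_group`, the parameter names used in the usage lines, and the "first matching pattern wins" rule are all assumptions made for illustration.

```python
import re


def linear_schedule(step, total_steps, warmup_rate, start_value, max_value):
    """Hypothetical linear schedule: warm up to max_value, then decay to 0."""
    warmup_steps = int(warmup_rate * total_steps)
    if step < warmup_steps:
        # Linear interpolation from start_value to max_value during warmup
        return start_value + (max_value - start_value) * step / warmup_steps
    remaining = total_steps - warmup_steps
    if remaining <= 0:
        return max_value
    # Linear decay from max_value down to 0 at total_steps
    return max_value * max(0.0, (total_steps - step) / remaining)


def select_group(param_name, patterns):
    """Return the first regex pattern matching the parameter name (assumed rule)."""
    for pattern in patterns:
        if re.search(pattern, param_name):
            return pattern
    return None


# Hypothetical parameter names, for illustration only
patterns = ["^transformer", ""]
transformer_group = select_group("transformer.encoder.layer.0.weight", patterns)
other_group = select_group("ner.crf.transitions", patterns)

# With warmup_rate=0.1 and total_steps=2000, warmup lasts 200 steps
lr_mid_warmup = linear_schedule(100, 2000, 0.1, 0, 5e-5)
lr_constant_warmup = linear_schedule(100, 2000, 0.1, 3e-4, 3e-4)
```

Under these assumptions, the empty pattern acts as a catch-all for every parameter not matched by an earlier pattern, and setting start_value equal to max_value keeps the learning rate constant during the warmup steps.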
Changed
- eds.span_context_getter's parameter context_sents is no longer optional and must be explicitly set to 0 to disable sentence context
- In multi-GPU setups, streams that contain torch components are now stripped of their parameter tensors when sent to CPU Workers since these workers only perform preprocessing and postprocessing and should therefore not need the model parameters.
- The batch_size argument of Pipeline is deprecated and is not used anymore. Use the batch_size argument of stream.map_pipeline instead.
Fixed
- Sort files before iterating over a standoff or json folder to ensure reproducibility
- Sentence detection now correctly matches capitalized letters + apostrophe
- We now ensure that the workers pool is properly closed whatever happens (exception, garbage collection, data ending) in the multiprocessing backend. This prevents some executions from hanging indefinitely at the end of the processing.
- Propagate torch sharing strategy to other workers in the multiprocessing backend. This is useful when the system is running out of file descriptors and ulimit -n is not an option. Torch sharing strategy can also be set via an environment variable TORCH_SHARING_STRATEGY (default is file_descriptor, consider using file_system if you encounter issues).
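As a rough illustration of the environment variable described in the last item, the sketch below reads TORCH_SHARING_STRATEGY with the documented default. The helper name `resolve_sharing_strategy` is hypothetical, and how and when EDS-NLP actually reads the variable is not shown here.

```python
import os


def resolve_sharing_strategy(environ=os.environ):
    """Hypothetical helper: read the strategy, defaulting to file_descriptor."""
    return environ.get("TORCH_SHARING_STRATEGY", "file_descriptor")


default_strategy = resolve_sharing_strategy({})
custom_strategy = resolve_sharing_strategy({"TORCH_SHARING_STRATEGY": "file_system"})

# With PyTorch installed, the resolved value could then be applied with:
#   import torch.multiprocessing
#   torch.multiprocessing.set_sharing_strategy(custom_strategy)
```

In practice it is probably simplest to export the variable in the shell before launching the script, so that it is visible from the start of the process.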
Data API changes
- LazyCollection objects are now called Stream objects
- By default, multiprocessing backend now preserves the order of the input data. To disable this and improve performance, use deterministic=False in the set_processing method
- 🚀 Parallelized GPU inference throughput improvements!
  - For simple {pre-process → model → post-process} pipelines, GPU inference can be up to 30% faster in non-deterministic mode (results can be out of order) and up to 20% faster in deterministic mode (results are in order)
  - For multitask pipelines, GPU inference can be up to twice as fast (measured in a two-tasks BERT+NER+Qualif pipeline on T4 and A100 GPUs)
- The .map_batches, .map_pipeline and .map_gpu methods now support a specific batch_size and batching function, instead of having a single batch size for all pipes
- Readers now have a loop parameter to cycle over the data indefinitely (useful for training)
- Readers now have a shuffle parameter to shuffle the data before iterating over it
- In multiprocessing mode, file based readers now read the data in the workers (was an option before)
- We now support two new special batch sizes
  - "fragment" in the case of parquet datasets: rows of a full parquet file fragment per batch
  - "dataset" which is mostly useful during training, for instance to shuffle the dataset at each epoch.
  These are also compatible with batched writers such as parquet, where each input fragment can be processed and mapped to a single matching output fragment.
- 💥 Breaking change: a map function returning a list or a generator won't be automatically flattened anymore. Use flatten() to flatten the output if needed. This shouldn't change the behavior for most users since most writers (to_pandas, to_polars, to_parquet, ...) still flatten the output
- 💥 Breaking change: chunk_size and sort_chunks are now deprecated: to sort data before applying a transformation, use .map_batches(custom_sort_fn, batch_size=...)
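The two breaking changes above can be illustrated with plain Python lists. This is only a sketch of the semantics, not the Stream API itself: `custom_sort_fn` is one possible function to pass to .map_batches, and `split_sentences` is a hypothetical map function that returns a list for each input.

```python
from itertools import chain


def custom_sort_fn(batch):
    """Sort one batch of texts from shortest to longest."""
    return sorted(batch, key=len)


def split_sentences(text):
    """Hypothetical map function that returns a list for each input item."""
    return [part.strip() for part in text.split(".") if part.strip()]


texts = ["A long first note. With two sentences.", "Short."]

# Sorting is now an explicit per-batch step (here, one batch of two texts)
sorted_batch = custom_sort_fn(texts)

# Without automatic flattening, each input item yields one list ...
mapped = [split_sentences(text) for text in texts]

# ... and flattening is a separate, explicit step
flat = list(chain.from_iterable(mapped))
```

With a stream, the sort function would be passed as in the item above, .map_batches(custom_sort_fn, batch_size=...), and the explicit flattening step corresponds to calling flatten().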
Training API changes
- We now provide a training script python -m edsnlp.train --config config.cfg that should fit many use cases. Check out the docs!
- In particular, we do not require PyTorch's DataLoader for training and can rely solely on the EDS-NLP stream/data API, which is better suited for large streamable datasets and dynamic preprocessing (i.e., a different result each time we apply a noised preprocessing op on a sample).
- Each trainable component can now provide a stats field in its preprocess output to log info about the sample (number of words, tokens, spans, ...). These stats are used:
  - for batching (e.g., make batches of no more than "25000 tokens")
  - for logging
  - for computing correct loss means when accumulating gradients over multiple mini-mini-batches
  - for computing correct loss means in multi-GPU setups, since these stats are synchronized and accumulated across GPUs
- Support multi-GPU training via Hugging Face accelerate and the EDS-NLP Stream API's consideration of the env['WORLD_SIZE'] and env['LOCAL_RANK'] environment variables
Pull Requests
- Improve training tutorials by @percevalw in #331
- Various fixes by @percevalw in #332
- Multiprocessing related fixes by @percevalw in #333
- chore: bump version to 0.14.0 by @percevalw in #334
Full Changelog: v0.13.1...v0.14.0