Releases: PaddlePaddle/Paddle
v0.14.0
Release Log
Major Features
- Enhanced the inference library. Better memory buffer. Added several demos.
- Inference library added support for Anakin engine, TensorRT engine.
- ParallelExecutor supports multi-threaded CPU training, in addition to multi-GPU training.
- Added mean IOU operator, argsort operator, etc. Improved L2norm operator. Added crop API.
- Released pre-trained ResNet50, SE-ResNeXt50, AlexNet, etc. Enhanced the Transformer model.
- New data augmentation operators.
- Major documentation and API comment improvements.
- Enhanced the continuous evaluation system.
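Among the new operators above, mean IOU computes the average intersection-over-union across classes. A small pure-Python sketch of that semantics (illustrative only, not the operator's implementation):

```python
def mean_iou(pred, label, num_classes):
    """Mean intersection-over-union over classes present in pred or label."""
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, l in zip(pred, label) if p == c and l == c)
        union = sum(1 for p, l in zip(pred, label) if p == c or l == c)
        if union:                       # skip classes absent from both
            ious.append(inter / union)
    return sum(ious) / len(ious)

# class 0: inter=1, union=2 -> 0.5; class 1: inter=2, union=3 -> 2/3
print(mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2))  # 0.58333...
```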
Performance Improvements
- More overlap of distributed training network operations with computation, yielding ~10% improvement.
- CPU performance improvements with more MKLDNN support.
Major Bug Fixes
- Fix memory leak issues.
- Fix concat operator.
- Fix ParallelExecutor input data memcpy issue.
- Fix ParallelExecutor deadlock issue.
- Fix distributed training client timeout.
- Fix distributed training pserver side learning rate decay.
- Thread-safe Scope implementation.
- Fix issues when using the memory optimizer and ParallelExecutor together.
Known Issues
- IfElse has some bugs.
- BatchNorm is not stable when batch_size=1.
v0.13.0
Release Log
Major Features
- Asynchronous distributed training support.
- Distributed training with ParallelExecutor.
- Distributed ring-based training with NCCL2.
- Support saving checkpoints on the trainer and storing them on both the trainer and the parameter server.
- Graceful shutdown of parameter server.
- Publish the high-level inference lib API and inference implementation.
- Assign roles to each op.
- Publish the C++ train API, allowing Fluid to be embedded into other C++ systems.
- Support uint8_t type data file and data exchange.
- C++ reader supports customized data augmentation.
- Improved operator and interface support for speech models.
- New random_crop op.
- New shape op to get the tensor's shape.
- New resize_bilinear interface.
- New dice_loss layer.
- Enhanced reduce_op to support reduce on multiple dimensions.
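The enhanced reduce_op above can reduce over several dimensions in one call. A small numpy sketch of the semantics (numpy used only for illustration, not Paddle's implementation):

```python
import numpy as np

x = np.arange(24).reshape(2, 3, 4)

# Reduce over dims 1 and 2 at once: the result has shape (2,)
print(np.sum(x, axis=(1, 2)))   # [ 66 210]
```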
Performance Improvements
On a P40 GPU with the ResNet-50 model, single-GPU speed improves by 23.8% (105 images/sec to 130 images/sec). The 8-GPU speedup ratio is 6, and the 32-GPU speedup ratio reaches 17.4.
- Overlap send/recv op with other operators.
- Multi-thread server-side request handling.
- Weight decay and clipping moved from trainer to parameter server for performance and correctness.
- Improved C++ reader.
Major Bug Fixes
- Fix accuracy loss when both ParallelExecutor and memory optimizer are used.
- Fix ParallelExecutor hang when multiple inputs are duplicated.
- Fix memory leak caused by Program clone.
- Fix GRU unit bias being ineffective and its wrong activation.
- Fix ROI Pooling GPU computation issues.
- Fix fill_constant_batch_size_like when the input is a sequence.
- Fix reshape op.
v0.12.0
Release log
Major Improvements
Reader Prototype. Data can be read asynchronously through the C++ reader, with potentially higher performance.
ParallelExecutor. Significantly improves multi-GPU performance over the previous solution.
Distributed Training. Major performance and stability improvements.
Inplace Activation. Significantly reduces GPU memory requirements and allows larger batch sizes.
Operator Optimizations. Performance improvements for many operators.
Timeline Profiling. Allows visualizing performance as a time series.
Major Bug Fixes
Fix calls to the cublas/cudnn libraries with wrong argument types.
Evaluated Models
Image Classification
Object Detection
OCR
Machine Translation
Text Classification
Language Model
Sequence Tagging
0.11.1a2
This release is a weekly alpha version of PaddlePaddle. It should only be used for internal tests; this is not a production-ready version.
Release log
Performance gain and memory optimization
Config and Env:
- model: SE-ResNet-150
- Input: 3 x 224 x 224
- batch_size: 25
- CentOS 6.3, Tesla P40, single card.
The comparison results before optimization:
| | Speed | Memory |
|---|---|---|
| Fluid(before) | 1.95 sec/iter | 18341 MB |
| PyTorch | 1.154 sec/iter | 13359 MB |
| Fluid/PyTorch | 1.6898 | 1.3729 |
After optimizing the speed:
| | Speed | Memory |
|---|---|---|
| Fluid(opti_speed) | 1.45 sec/iter | 17222 MB |
| PyTorch | 1.154 sec/iter | 13359 MB |
| Fluid/PyTorch | 1.2565 | 1.2892 |
After optimizing the memory usage:
| | Speed | Memory |
|---|---|---|
| Fluid(opti_mem) | 1.93 sec/iter | 14388 MB |
| PyTorch | 1.154 sec/iter | 13359 MB |
| Fluid/PyTorch | 1.6724 | 1.0770 |
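The Fluid/PyTorch rows in these tables are element-wise ratios of the two rows above them. For instance, for the memory-optimized run:

```python
# Ratios from the opti_mem table: Fluid / PyTorch
speed_ratio  = 1.93 / 1.154     # ~1.6724
memory_ratio = 14388 / 13359    # ~1.0770

print(round(speed_ratio, 4), round(memory_ratio, 4))
```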
- Overall performance gain.
- Details in issue #8990.
- Delete GPU memory while training.
- [WIP] Feed data from C++
- Add basic RecordIO API
- Polish C++ Reader operators
- Add DoubleBuffer Reader
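The DoubleBuffer reader prefetches the next batch on a background thread while the current batch is being consumed. A minimal pure-Python sketch of that idea (not the actual C++ reader):

```python
import threading
import queue

def double_buffered(reader, capacity=2):
    """Wrap a batch generator so batches are prefetched on a worker thread."""
    def wrapped():
        buf = queue.Queue(maxsize=capacity)
        stop = object()  # sentinel marking the end of the stream

        def producer():
            for batch in reader():
                buf.put(batch)
            buf.put(stop)

        threading.Thread(target=producer, daemon=True).start()
        while True:
            batch = buf.get()
            if batch is stop:
                return
            yield batch
    return wrapped

batches = double_buffered(lambda: iter([[1, 2], [3, 4]]))
print(list(batches()))  # [[1, 2], [3, 4]]
```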
Distributed training
- Now supports distributed sparse updates.
- [WIP] Send/recv using zero-copy gRPC transfer.
v0.11.0
Fluid
Release v0.11.0 includes a new feature, PaddlePaddle Fluid. Fluid is designed to let users program in the style of PyTorch and TensorFlow Eager Execution. In these systems, there is no longer the concept of a model: applications do not include a symbolic description of a graph of operators or a sequence of layers. Instead, an application looks like an ordinary program that describes a process of training or inference. The difference between Fluid and PyTorch or Eager Execution is that Fluid does not rely on Python's control-flow constructs such as if-then-else or for; instead, Fluid provides their C++ implementations and Python bindings built on the with statement.
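As a rough illustration of this with-based design, here is a minimal pure-Python sketch in which a `with` block records a loop into a program description instead of executing Python control flow (hypothetical names, not the actual Fluid API):

```python
import contextlib

class Program:
    """Records ops instead of executing them, like a Fluid program desc."""
    def __init__(self):
        self.ops = []

    @contextlib.contextmanager
    def while_loop(self, cond_name):
        # Entering the `with` opens a loop op; leaving it closes the loop.
        self.ops.append(("while_begin", cond_name))
        yield
        self.ops.append(("while_end", cond_name))

    def append_op(self, name):
        self.ops.append(("op", name))

prog = Program()
with prog.while_loop("i < n"):       # records a loop, does not run one
    prog.append_op("increment")

print(prog.ops)
# [('while_begin', 'i < n'), ('op', 'increment'), ('while_end', 'i < n')]
```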
In v0.11.0, we provide a C++ class Executor to run a Fluid program. Executor works like an interpreter. In future versions, we will improve Executor into a debugger like GDB, and we might provide compilers that, for example, take an application like the one above and output an equivalent C++ source program, which can be compiled with nvcc to generate binaries that use CUDA, or with icc to generate binaries that make full use of Intel CPUs.
New Features
- Release Fluid.
- Add C-API for model inference.
- Use the Fluid API to create a simple GAN demo.
- Add a development guide on performance tuning.
- Add retry when downloading `paddle.v2.dataset`.
- Link against protobuf-lite instead of protobuf in C++ to reduce the binary size.
- Feature Elastic Deep Learning (EDL) released.
- A new style of cmake functions for Paddle, based on the Bazel API.
- Automatically download and compile the Intel® MKLML library as CBLAS when building with `WITH_MKL=ON`.
- Intel® MKL-DNN on PaddlePaddle:
  - Complete 11 MKL-DNN layers: Convolution, Fully connected, Pooling, ReLU, Tanh, ELU, Softmax, BatchNorm, AddTo, Concat, LRN.
  - Complete 3 MKL-DNN networks: VGG-19, ResNet-50, GoogLeNet.
  - Benchmark on Intel Skylake 6148 CPU: 2~3x training speedup compared with MKLML.
- Add the `softsign` activation.
- Add the dot product layer.
- Add the L2 distance layer.
- Add the sub-nested sequence layer.
- Add the kmax sequence score layer.
- Add the sequence slice layer.
- Add the row convolution layer.
- Add mobile-friendly webpages.
Improvements
- Build and install using a single `whl` package.
- Custom evaluating in the V2 API.
- Change `PADDLE_ONLY_CPU` to `PADDLE_WITH_GPU`, since we will support many kinds of devices.
- Remove the buggy BarrierStat.
- Clean and remove unused functions in `paddle::Parameter`.
- Remove ProtoDataProvider.
- Huber loss supports both regression and classification.
- Add the `stride` parameter for sequence pooling layers.
- Enable the v2 API to use cudnn batch normalization automatically.
- The BN layer's parameters can be shared by fixing the parameter name.
- Support variable-dimension input features for the 2D convolution operation.
- Refine cmake for CUDA to automatically detect GPU architecture.
- Improved website navigation.
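The Huber loss mentioned above blends a quadratic region near zero with a linear tail. A small illustrative implementation (pure Python, not Paddle's code):

```python
def huber(residual, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * a * a
    return delta * (a - 0.5 * delta)

print(huber(0.5))   # 0.125  (quadratic region)
print(huber(3.0))   # 2.5    (linear region: 1.0 * (3.0 - 0.5))
```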
Bug Fixes
v0.10.0
Release v0.10.0
Please pull the official images from Docker Hub.
We are glad to release version 0.10.0. In this version, we are happy to release the new Python API.
- Our old Python API is kind of out of date. It's hard to learn and hard to use. To write a PaddlePaddle program using the old API, we had to write at least two Python files: one `data provider` and another that defines the network topology. Users start a PaddlePaddle job by running the `paddle_trainer` C++ program, which calls the Python interpreter to run the network topology configuration script and then starts the training loop, which iteratively calls the data provider function to load minibatches. This prevents us from writing a Python program in a modern way, e.g., in a Jupyter Notebook.
- The new API, which we often refer to as the v2 API, allows us to write much shorter Python programs that define the network and the data in a single .py file. Also, this program can run in a Jupyter Notebook, since the entry point is a Python program and PaddlePaddle runs as a shared library loaded and invoked by that Python program.
Based on the new API, we delivered an online interactive book, Deep Learning 101, and its Chinese version.
We also worked on updating our online documentation to describe the new API, but this is ongoing work; we will release more documentation improvements in the next version.
We also worked on bringing the new API to distributed model training (via MPI and Kubernetes). This work is ongoing; we will release more about it in the next version.
New Features
- We release the new Python API.
- Deep Learning 101 book in English and Chinese.
- Support rectangular input for CNN.
- Support stride pooling for seqlastin and seqfirstin.
- Expose `seq_concat_layer`/`seq_reshape_layer` in `trainer_config_helpers`.
- Add dataset packages: CIFAR, MNIST, IMDB, WMT14, CONLL05, movielens, imikolov.
- Add Priorbox layer for Single Shot Multibox Detection.
- Add smooth L1 cost.
- Add data reader creator and data reader decorator for the v2 API.
- Add the CPU implementation of cmrnorm projection.
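In the v2 API, a data reader is a no-argument function that returns a sample generator, and decorators wrap one reader into another. A minimal sketch of both ideas (illustrative, not the exact paddle.v2 interface):

```python
def reader_creator(data):
    """Creator: returns a reader, i.e. a function yielding samples."""
    def reader():
        for sample in data:
            yield sample
    return reader

def batched(reader, batch_size):
    """Decorator: wraps a reader into one that yields fixed-size batches."""
    def batch_reader():
        batch = []
        for sample in reader():
            batch.append(sample)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
    return batch_reader

r = batched(reader_creator(range(5)), batch_size=2)
print(list(r()))  # [[0, 1], [2, 3], [4]]
```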
Improvements
- Support Python virtualenv for `paddle_trainer`.
- Add pre-commit hooks to automatically format our code.
- Upgrade protobuf to version 3.x.
- Add an option to check data types in the Python data provider.
- Speed up the backward pass of the average layer on GPU.
- Documentation refinement.
- Check dead links in documents using Travis-CI.
- Add an example explaining `sparse_vector`.
- Add ReLU in layer_math.py.
- Simplify the data processing flow for Quick Start.
- Support CUDNN Deconv.
- Add data feeder in the v2 API.
- Support predicting samples from sys.stdin for the sentiment demo.
- Provide a multi-process interface for image preprocessing.
- Add a benchmark document for the v1 API.
- Add packages for automatically downloading public datasets.
- Rename `Argument::sumCost` to `Argument::sum`, since class `Argument` has nothing to do with cost.
- Expose `Argument::sum` to Python.
- Add a new `TensorExpression` implementation for matrix-related expression evaluations.
- Add lazy assignment to optimize the calculation of a batch of multiple expressions.
- Add the abstract class `Function` and its implementations:
  - `PadFunc` and `PadGradFunc`.
  - `ContextProjectionForwardFunc` and `ContextProjectionBackwardFunc`.
  - `CosSimBackward` and `CosSimBackwardFunc`.
  - `CrossMapNormalFunc` and `CrossMapNormalGradFunc`.
  - `MulFunc`.
- Add classes `AutoCompare` and `FunctionCompare`, which make it easier to write unit tests comparing GPU and CPU versions of a function.
- Generate `libpaddle_test_main.a` and remove the main function inside test files.
- Support dense numpy vectors in PyDataProvider2.
- Clean up the code base and remove some copy-n-pasted code snippets:
  - Extract the `RowBuffer` class for `SparseRowMatrix`.
  - Clean the interface of `GradientMachine`.
  - Use the `override` keyword in layers.
  - Simplify `Evaluator::create`, using `ClassRegister` to create `Evaluator`s.
- Check MD5 checksums when downloading demo datasets.
- Add `paddle::Error`, which intentionally replaces `LOG(FATAL)` in Paddle.
Bug Fixes
- Check layer input types for `recurrent_group`.
- Don't run `clang-format` on .cu source files.
- Fix bugs in `LogActivation`.
- Fix the bug that runs `test_layerHelpers` multiple times.
- Fix the bug that the seq2seq demo exceeds the protobuf message size limit.
- Fix the bug in the dataprovider converter in GPU mode.
- Fix a bug in `GatedRecurrentLayer`.
- Fix a bug in `BatchNorm` when testing more than one model.
- Fix the broken unit test of paramRelu.
- Fix some compile-time warnings about `CpuSparseMatrix`.
- Fix a `MultiGradientMachine` error when `trainer_count > batch_size`.
- Fix bugs that prevented asynchronous data loading in `PyDataProvider2`.
v0.9.0
Please use Docker Hub to get this release.
New Features:
- New Layers
- bilinear interpolation layer.
- spatial pyramid-pool layer.
- de-convolution layer.
- maxout layer.
- Support rectangular padding, stride, window, and input for the pooling operation.
- Add `--job=time` in trainer, which can be used to print time info without the compile option `WITH_TIMER=ON`.
- Expose cost_weight/nce_layer in `trainer_config_helpers`.
- Add FAQ, concepts, and h-rnn docs.
- Add Bidi-LSTM and DB-LSTM to quick start demo @alvations
- Add usage track scripts.
Improvements
- Add Travis-CI for Mac OS X. Enable swig unittest in Travis-CI. Skip Travis-CI when only docs are changed.
- Add code coverage tools.
- Refine convolution layer to speedup and reduce GPU memory.
- Speed up PyDataProvider2
- Add ubuntu deb package build scripts.
- Make Paddle use git-flow branching model.
- PServer supports having no parameter blocks.
Bug Fixes
- Add zlib link to py_paddle.
- Add input sparse data checks for sparse layers at runtime.
- Fix a bug in sparse matrix multiplication.
- Fix the floating-point overflow problem of tanh.
- Fix some nvcc compile options.
- Fix a bug in yielding dictionaries in DataProvider.
- Fix SRL hang on exit.
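The tanh overflow fix above concerns the naive formula (e^x - e^-x)/(e^x + e^-x), whose exponentials overflow for large |x|. A standard overflow-safe reformulation looks like this (illustrative, not Paddle's code):

```python
import math

def tanh_stable(x):
    # Rewrite tanh via exp(-2|x|), which never overflows, then restore sign.
    e = math.exp(-2.0 * abs(x))
    t = (1.0 - e) / (1.0 + e)
    return t if x >= 0 else -t

# The naive formula overflows for large x; the stable one saturates at 1.
print(tanh_stable(1000.0))                              # 1.0
print(abs(tanh_stable(0.5) - math.tanh(0.5)) < 1e-12)   # True
```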
PaddlePaddle v0.8.0beta.1
New features:
- Mac OSX is supported by source code. #138
- Both GPU and CPU versions of PaddlePaddle are supported.
- Support CUDA 8.0
- Enhance `PyDataProvider2`:
  - Add dictionary yield format. `PyDataProvider2` can yield a dictionary whose keys are data_layer names and whose values are features.
  - Add `min_pool_size` to control the memory pool in the provider.
- Add `deb` install package & docker image for no_avx machines.
  - Especially for cloud computing and virtual machines.
- Automatically disable `avx` instructions in cmake when the machine's CPU doesn't support `avx` instructions.
- Add Parallel NN API in `trainer_config_helpers`.
- Add `travis ci` for GitHub.
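The dictionary yield format lets a provider yield a {data_layer_name: features} mapping instead of a positional tuple. A minimal sketch of what such a generator might look like (illustrative, not the PyDataProvider2 decorator machinery):

```python
def process(samples):
    # Yield a dict keyed by data_layer names instead of a positional tuple.
    for features, label in samples:
        yield {"image": features, "label": label}

rows = list(process([([0.1, 0.2], 1), ([0.3, 0.4], 0)]))
print(rows[0])  # {'image': [0.1, 0.2], 'label': 1}
```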
Bug fixes:
- Fix several bugs in trainer_config_helpers, and complete its unittest.
- Check whether PaddlePaddle is installed when running unittests.
- Fix bugs in GTX series GPU
- Fix bug in MultinomialSampler
Also, more documentation has been written since the last release.
PaddlePaddle v0.8.0beta.0
PaddlePaddle v0.8.0beta.0 release. The install package is not stable yet and it's a pre-release version.