GSoC 2016 Final Report : The Shogun Detox
Name: Pan Deng
Organization: Shogun Machine Learning Toolbox
Mentors: Heiko Strathmann, lambday, Viktor
Shogun is a powerful machine learning toolkit built by the efforts of many developers. That history is also the source of its problems: some parts of the code are outdated or poorly optimized, and code is not unified across modules. These problems dampen the developer experience and can obstruct further development. My project therefore focused on cleaning up and refactoring Shogun's code. I worked on two important modules: the linear algebra library, which is the computational core for the machine learning libraries, and the serialization framework, which provides fast and easy serialization of Shogun data.
- Refactor linear algebra library
- Refactor serialization framework
- Add cookbook for Shogun
- Other contributions
- Appendix: timeline
Shogun's internal linear algebra library (referred to as `linalg` below) serves as the computational core for Shogun's machine learning libraries. However, the old `linalg` was not well organized, and many operations that belong in `linalg` were implemented in individual classes. This project works out a new `linalg` framework and migrates the linear algebra methods back into the `linalg` library. We also aim to make the new `linalg` library plugin-based, which allows developers to add external linear algebra libraries easily.
The new `linalg` library supports CPU and GPU backend linear algebra operations with the Eigen3 and ViennaCL libraries, and makes it easy to plug in other linear algebra libraries. However, users can register only one CPU backend and/or one GPU backend at a time, because the backend class `SGLinalg` is designed as a singleton.
The `linalg` library provides a unified interface for all the methods implemented in the `linalg` namespace. Users include the `shogun/mathematics/linalg/LinalgNameSpace.h` header file and call `linalg::method(arg1, arg2, ..)` to run an operation. The library infers which backend to use from the location (CPU/GPU) where the data is stored.
Each backend class implements the operations by deriving from the same base class, defined in `LinalgBackendBase.h`, and overriding its methods. GPU backends must also implement the `to_gpu` and `from_gpu` methods, as required by the base class in `LinalgBackendGPUBase.h`.
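The design described above (a singleton holding at most one CPU and one GPU backend, with dispatch based on where the data lives) can be sketched as follows. This is a minimal illustration, not Shogun's actual code: `Vec`, `BackendBase`, `GPUBackendBase`, `PlainCPUBackend`, `FakeGPUBackend`, `Registry` and `sketch_linalg::dot` are hypothetical names, and the "GPU" backend only flips a flag instead of moving data to a device.

```cpp
#include <cassert>
#include <memory>
#include <numeric>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in for SGVector: it records where its data lives.
struct Vec
{
    std::vector<double> data;
    bool on_gpu;
};

// Plays the role of the common backend base class: every backend overrides dot().
struct BackendBase
{
    virtual ~BackendBase() = default;
    virtual double dot(const Vec& a, const Vec& b) const = 0;
};

// Plays the role of the GPU base class: transfer methods are mandatory.
struct GPUBackendBase : BackendBase
{
    virtual Vec to_gpu(const Vec& v) const = 0;
    virtual Vec from_gpu(const Vec& v) const = 0;
};

struct PlainCPUBackend : BackendBase
{
    double dot(const Vec& a, const Vec& b) const override
    {
        return std::inner_product(a.data.begin(), a.data.end(), b.data.begin(), 0.0);
    }
};

// Fake GPU backend: computes on the host and counts how often it is used.
struct FakeGPUBackend : GPUBackendBase
{
    mutable int calls = 0;

    double dot(const Vec& a, const Vec& b) const override
    {
        ++calls;
        return std::inner_product(a.data.begin(), a.data.end(), b.data.begin(), 0.0);
    }
    Vec to_gpu(const Vec& v) const override
    {
        Vec r = v;
        r.on_gpu = true;
        return r;
    }
    Vec from_gpu(const Vec& v) const override
    {
        Vec r = v;
        r.on_gpu = false;
        return r;
    }
};

// Singleton registry: registering a backend replaces the previous one,
// so at most one CPU and one GPU backend are active at a time.
class Registry
{
public:
    static Registry& instance()
    {
        static Registry registry;
        return registry;
    }
    void set_cpu_backend(std::shared_ptr<BackendBase> b) { cpu_ = std::move(b); }
    void set_gpu_backend(std::shared_ptr<GPUBackendBase> b) { gpu_ = std::move(b); }

    // Pick the backend from the location of the operands.
    const BackendBase& select(const Vec& a, const Vec& b) const
    {
        if (a.on_gpu != b.on_gpu)
            throw std::invalid_argument("operands live on different devices");
        if (a.on_gpu)
        {
            if (!gpu_)
                throw std::runtime_error("no GPU backend registered");
            return *gpu_;
        }
        if (!cpu_)
            throw std::runtime_error("no CPU backend registered");
        return *cpu_;
    }

private:
    Registry() = default;
    std::shared_ptr<BackendBase> cpu_;
    std::shared_ptr<GPUBackendBase> gpu_;
};

namespace sketch_linalg
{
// Unified entry point: callers never name a backend explicitly.
inline double dot(const Vec& a, const Vec& b)
{
    return Registry::instance().select(a, b).dot(a, b);
}
}
```

In this sketch a caller registers the backends once and then calls `sketch_linalg::dot(a, b)` for both CPU-resident and GPU-resident vectors. Throwing on mixed locations is an assumption of the sketch, not a documented behavior of the library.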
The framework of the new `linalg` library was created in PR3317 and PR3348. Minor updates were made in the following PRs: #3346, #3351, #3363, #3367, #3369, #3383, #3392, #3404.
Currently the `linalg` library supports the following linear algebra operations on `SGVector` and `SGMatrix` using the Eigen3 or ViennaCL libraries:
| Pull requests | Descriptions |
|---|---|
| 3335, 3359, 3387, 3391 | `linalg::add()`. In-place add is available. |
| 3334 | `linalg::mean()` |
| 3336 | `linalg::max()` |
| 3340 | `linalg::range_fill()`. Only works with the Eigen3 library. |
| 3344, 3382, 3400, 3403 | `linalg::sum()`, `linalg::rowwise_sum()`, `linalg::colwise_sum()`. The sum methods can operate on matrix blocks and take a `no_diag` flag parameter. |
| 3350 | `linalg::set_const()` |
| 3358 | `linalg::scale()` |
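
To make the sum variants in the table concrete, here is a small sketch of what a full sum, a row-wise sum and a column-wise sum compute, with an optional flag that skips the diagonal. It assumes that `no_diag` means "exclude diagonal entries" and it does not cover block sums. `Mat` and the `sketch` namespace are hypothetical; the real methods operate on `SGMatrix` through the registered backend.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

namespace sketch
{
// Hypothetical column-major matrix (not SGMatrix itself).
struct Mat
{
    std::size_t rows;
    std::size_t cols;
    std::vector<double> data; // element (i, j) is data[j * rows + i]

    double at(std::size_t i, std::size_t j) const { return data[j * rows + i]; }
};

// Sum of all entries; diagonal entries are skipped when no_diag is true.
double sum(const Mat& m, bool no_diag = false)
{
    double total = 0.0;
    for (std::size_t j = 0; j < m.cols; ++j)
        for (std::size_t i = 0; i < m.rows; ++i)
            if (!(no_diag && i == j))
                total += m.at(i, j);
    return total;
}

// One sum per row.
std::vector<double> rowwise_sum(const Mat& m, bool no_diag = false)
{
    std::vector<double> result(m.rows, 0.0);
    for (std::size_t j = 0; j < m.cols; ++j)
        for (std::size_t i = 0; i < m.rows; ++i)
            if (!(no_diag && i == j))
                result[i] += m.at(i, j);
    return result;
}

// One sum per column.
std::vector<double> colwise_sum(const Mat& m, bool no_diag = false)
{
    std::vector<double> result(m.cols, 0.0);
    for (std::size_t j = 0; j < m.cols; ++j)
        for (std::size_t i = 0; i < m.rows; ++i)
            if (!(no_diag && i == j))
                result[j] += m.at(i, j);
    return result;
}
}
```

For the 2x2 matrix with columns (1, 2) and (3, 4), the full sum is 10 and the sum without the diagonal is 5.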
- Migrate other `linalg` methods to the new library and remove the old methods.
- Migrate `linalg` methods in `SGVector`, `SGMatrix` and other classes to the new `linalg` library and remove the old methods.
- Enable the `linalg` operations with other CPU or GPU backends.
The old Shogun serialization framework is redundant and hard to read. We want to switch to a new, light and fast serialization framework built on the Cereal serialization library. For this project, I first modified the CMake files to enable automatic download and installation of the Cereal library in Shogun. I also integrated Cereal into Shogun classes on top of the new Tag-parameter framework (work of sanuj and lisitsyn).
To integrate the Cereal serialization library into Shogun, I added a Cereal check to CMakeLists and provided the download path of Cereal in the .cmake files: #3202 and #3397.
Most classes in Shogun are based on the `SGObject` class, which defines methods for registering the parameters of a class in the parameter list, as well as for serializing the data. To replace the serialization framework, I implemented serialization wrapper methods and serialization functions in `SGObject.cpp` and `Any.h` in PR3375; the latter stores the parameter values registered by an `SGObject` class. Some basic data structures in Shogun are not `SGObject`-based, such as `SGVector` and `SGMatrix`. I therefore also implemented serialization methods in `SGVector` and the `SGReferencedData` class (Shogun's version of a smart pointer object that works with C++0x) in PR3375, and in `SGMatrix` in PR3412. The unit tests for SGObject-Any-SGVector-SGReferencedData can be found in PR3375.
With these implementations, one can serialize `SGObject`-based classes in Shogun into XML, JSON or binary files (JSON is used as the example here):

```cpp
SGObject obj_save;
obj_save.save_json(filename);
SGObject obj_load;
obj_load.load_json(filename);
```

A detailed introduction to the serialization framework can be found in the README file.
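
The idea behind the parameter list (a class registers its members by name, and saving or loading then walks that list) can be illustrated with a small standard-library sketch. It deliberately does not use Cereal, `Any` or the Tag-parameter framework: `ParamObject`, `ToyModel`, `register_param` and the plain-text `name value` format are all hypothetical stand-ins chosen so the example runs on its own.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Hypothetical stand-in for an SGObject-like base class.
class ParamObject
{
public:
    ParamObject() = default;
    // Registered pointers refer to this object's own members, so copying is disabled.
    ParamObject(const ParamObject&) = delete;
    ParamObject& operator=(const ParamObject&) = delete;

    // Write every registered parameter as one "name value" line.
    void save(std::ostream& out) const
    {
        for (const auto& entry : params_)
            out << entry.first << ' ' << *entry.second << '\n';
    }

    // Read "name value" pairs and assign those whose name was registered.
    void load(std::istream& in)
    {
        std::string name;
        double value;
        while (in >> name >> value)
        {
            auto it = params_.find(name);
            if (it != params_.end())
                *it->second = value;
        }
    }

protected:
    // Subclasses call this once per member they want serialized.
    void register_param(const std::string& name, double* value)
    {
        params_[name] = value;
    }

private:
    std::map<std::string, double*> params_;
};

// Hypothetical model with two serializable parameters.
struct ToyModel : ParamObject
{
    double learning_rate;
    double bias;

    explicit ToyModel(double lr = 0.0, double b = 0.0) : learning_rate(lr), bias(b)
    {
        register_param("learning_rate", &learning_rate);
        register_param("bias", &bias);
    }
};
```

A round trip then consists of calling `save` on one `ToyModel` with a `std::stringstream` and `load` on a second, default-constructed one. A real archive library such as Cereal additionally handles nested objects, binary and XML output, and type information, which this sketch leaves out.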
Some work on the serialization project is unfinished. One item is to support serialization of all data types and data structures in the `SGObject` parameter list, `Any.h`, as shown in PR3418. The explicit listing strategy I am currently using is too verbose to read.
To explain Shogun's functionality to users, I worked on the Shogun cookbook project, writing API examples that cover major Shogun machine learning algorithms in all target languages, using Shogun's meta language and a Sphinx-based API documentation system. The goal is a cookbook that covers all algorithms in Shogun; the current one looks like this.
I submitted the following cookbook pages with integration test datasets:
| Cookbook page PR | Test dataset PR | Descriptions |
|---|---|---|
| Clustering | | |
| 3183 | 91, 94, 101 | K-means clustering |
| 3207 | 87 | Hierarchical clustering |
| Binary classifiers | | |
| - | 105 | Linear SVM |
| Multi-class classifiers | | |
| 3208 | 89, 93 | Quadratic discriminant analysis |
| 3242 | 97 | Multi-class linear machine |
| 3244 | 95 | Multi-class logistic regression |
| 3280, 3296 | 98 | ECOC random |
| 3286 | 100 | Relaxed tree classifier |
| 3287, 3318 | 103 | Shareboost classifier |
| 3326 | 112 | Multi-class LDA |
| Gaussian processes | | |
| 3311 | 108 | Gaussian process classifier |
I also refactored the structure of the cookbook in PRs #3297 and #104.
Finish the two cookbook pages in progress: CHAID tree classifier (PR3303, dataset PR119) and CARTree classifier (PR3282, dataset PR120). Both examples translate from the meta language to C++ and generate correct results, but fail in Java and some other languages. I am still looking into the reason for the failure.
I will also continue to add cookbook pages for other algorithms, such as kernels and regression.
- Removed `HAVE_EIGEN3` macros Shogun-wide (PR3092).
- Fixed `shogun/mathematics/` warnings (PR3185).
- Added assertion in `CCHAIDTree` class (PR3395).
- Added new `CQDA` class constructor (PR3233).
Week1: May 23rd – May 29th
- Download and installation of `Cereal` serialization library to Shogun.
- The prototype of new `linalg` library with CPU dot method on vectors.
- Cookbook: hierarchical clustering and quadratic discriminant analysis.
Week2: May 30th – Jun 5th
- `SGVector` dot operation with CPU `Eigen3` library and GPU `ViennaCL` library.
- Added singleton for `Linalg` class in `init.h` and `init.cpp`.
- Cookbook: multiclass logistic regression and multiclass linear machine.
Week3: Jun 6th – Jun 12th
- `SGVector` sum operation with CPU `Eigen3` library and GPU `ViennaCL` library.
- Benchmark of new `linalg` methods.
- Cookbook: ECOC and CARTree.
Week4: Jun 13th – Jun 19th
- Integrated CPU and GPU vector data structure in `SGVector` class and GPU data storage modules.
- Cookbook: shareboost, relaxed tree and kmeans.
Week5: Jun 20th – Jun 26th
- Finished new `linalg` method implementation modules with vector dot method with `Eigen3` and `ViennaCL` library.
- Cookbook: Gaussian process classifier, CHAIDTree.
- Cookbook: split classifiers into binary and multi-class.
Week6: Jun 27th - Jul 3rd
- Finished `to_gpu` and `from_gpu` methods with `ViennaCL` library.
- Doxygens of new `linalg` library.
- Refactored linalg vector `add`, `mean`, `sum`, `max`, `range_fill`, `scale` methods.
- Cookbook: Multi-class linear discriminant analysis classifier.
Week7: Jul 4th – Jul 10th
- Unit-tests of new `linalg` library.
- Integrated `SGMatrix` to the new `linalg` library.
Week8: Jul 11th – Jul 17th
- Checked the current parameter serialization framework and the newly added tags framework.
- Serialization of class `SGVector` with `Cereal` library.
- Serialization of class `Any` with `Cereal` library.
Week9: Jul 18th – Jul 24th
- Finished `add` and `colwise sum/rowwise sum/sum` methods working with `SGVector` and `SGMatrix` in the new `linalg` library.
- Serialization of class `SGObject` with `Cereal` library.
Week10: Jul 25th – Jul 31st
- Had SGObject-Any-SGVector-SGReference data serialization working locally.
- Finished `inplace add` and `mean` methods working with `SGVector` and `SGMatrix` in the new `linalg` library.
Week11: Aug 1st – Aug 7th
- Had SGObject-Any-SGVector-SGReference data serialization working on Travis.
- Added unit-tests for serialization.
- Merged `block sum`, `scale`, `set_const` and `range_fill` methods working with `SGVector` and `SGMatrix` in the new `linalg` library.
Week12: Aug 8th – Aug 14th
- Added `SGMatrix` serialization methods.
- READMEs for `linalg` library and serialization framework.
- Cookbook: `CHAIDTree` and `CARTree` revisit.
Week13: Aug 15th – Aug 23rd
- Peer review: code and README of `Tag`-parameter framework and plugin module by Sanuj.
- GSoC16 summary