Releases: NVIDIA/spark-rapids-ml
v24.08.0 release
Release notes:
- Removed MAXINT limit on number of non-zero inputs per GPU for sparse logistic regression.
- IVF-PQ and Cagra were added to the suite of supported approximate nearest neighbor algorithms.
- Extended benchmarking scripts to be compatible with Databricks runtime 13.3 with the spark-rapids plugin and 14.3 and 15.4 without the plugin.
- Included an experimental CLI for no-import-statement-change acceleration of pyspark.ml applications.
- Fixed a slowdown for inputs with a large number of columns when type conversion is required.
- Updated RAPIDS dependencies to 24.08.
- Known issues to be fixed in the next release:
  - For sparse logistic regression fit, a low-level C++/CUDA exception is raised if a partition has no non-zero data.
  - Array type inputs with int dtypes are not converted to float, leading to errors in some algorithms (e.g. Cagra ANN).
  - In ivf-pq based Cagra, the intermediate graph degree must be <= 128 or a low-level C++ exception is raised.
  - The test_sparse_int64 test requires 256GB of host memory to run, not the 128GB stated in the comments.
pip package available at https://pypi.org/project/spark-rapids-ml/24.08.0/
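Until the known issue on the Cagra intermediate graph degree is fixed, a pre-flight check in user code can turn the low-level C++ exception into a readable Python error. The following is a minimal sketch, not part of the library: the helper name `check_cagra_params` and the dictionary keys `build_algo` and `intermediate_graph_degree` are assumptions made for illustration and should be adapted to however the parameters are actually passed in your application.

```python
# Hypothetical pre-flight check; helper name and dictionary keys are assumptions.
# The limit of 128 comes from the 24.08.0 known issues list.
MAX_IVF_PQ_INTERMEDIATE_GRAPH_DEGREE = 128


def check_cagra_params(algo_params: dict) -> dict:
    """Return algo_params unchanged, or raise ValueError if the known limit is exceeded."""
    build_algo = algo_params.get("build_algo")
    degree = algo_params.get("intermediate_graph_degree")
    # Only the ivf-pq based build is affected according to the release notes.
    if (
        build_algo == "ivf_pq"
        and degree is not None
        and degree > MAX_IVF_PQ_INTERMEDIATE_GRAPH_DEGREE
    ):
        raise ValueError(
            f"intermediate_graph_degree={degree} exceeds the known limit of "
            f"{MAX_IVF_PQ_INTERMEDIATE_GRAPH_DEGREE} for ivf-pq based Cagra in 24.08.0"
        )
    return algo_params
```

Calling the helper on the parameter dictionary before building the index fails fast on the driver instead of inside a GPU task.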
v24.06.0 release
Release notes:
- Double precision support for GPU accelerated logistic regression.
- Added GPU accelerated IVF-Flat Approximate Nearest Neighbor (ANN) to benchmarking scripts.
- Improved throughput of GPU accelerated IVF-Flat ANN for large data sets.
- Update of RAPIDS dependencies to 24.06.
NOTE: For a large number of feature/input columns in float64 type, please use VectorUDT or array type (as opposed to multiple scalar columns) for all algorithms due to a performance issue. This will be resolved in our 24.08 release.
pip package available at https://pypi.org/project/spark-rapids-ml/24.06.0/
v24.04.0 release
Release notes:
- Feature standardization in logistic regression for sparse vectors.
- GPU accelerated Density Based Spatial Clustering for Applications with Noise (DBSCAN) algorithm with example notebook.
- GPU accelerated IVF-Flat Approximate Nearest Neighbor algorithm with example notebook.
- Stage level scheduling support for Yarn and K8s.
- Update of RAPIDS dependencies to 24.04.
pip package available at https://pypi.org/project/spark-rapids-ml/24.04.0/
v24.02.0 release
Release notes:
- Support feature standardization in logistic regression for dense vectors.
- Add large scale synthetic sparse data generation for logistic regression testing.
- Fix tol=0 in KMeans
- Add sparse vectors to logistic regression notebook example.
- Update RAPIDS dependencies to 24.02.
- Known Issue: RandomForest training will throw an exception if the label column takes on only a single value. This will be fixed in 24.04.
pip package available at https://pypi.org/project/spark-rapids-ml/24.02.0/
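The RandomForest known issue in this release can be avoided by checking, before training, that the label column takes on more than one value. The sketch below shows that check in plain Python over any iterable of labels. The helper name `has_multiple_label_values` is hypothetical and not part of spark-rapids-ml.

```python
from typing import Hashable, Iterable


def has_multiple_label_values(labels: Iterable[Hashable]) -> bool:
    """Return True as soon as two distinct label values have been seen."""
    seen = set()
    for value in labels:
        seen.add(value)
        if len(seen) > 1:
            # Stop early: two distinct values are enough to avoid the known issue.
            return True
    return False
```

On a Spark DataFrame, the same idea would presumably be applied by counting the distinct values of the label column, stopping at two, and skipping or redirecting RandomForest training when only one value is found.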
v23.12.0 release
Release notes:
- Match Spark's logistic regression fit behavior when data set has only one label value.
- Support sparse vector based computations through cuML layer in logistic regression fit, transform, and cross validation.
- Update dataproc benchmark script.
- Update Azure Databricks instructions.
- Update RAPIDS dependencies to 23.12.
pip package available at https://pypi.org/project/spark-rapids-ml/23.12.0/
v23.10.0 release
Release Notes:
- L1 and elastic net regularization for GPU accelerated distributed LogisticRegression, with notebook example.
- More than 2 classes for GPU accelerated distributed LogisticRegression, with notebook example.
- Optimized fitMultiple api for LogisticRegression.
- Accelerated cross validation for LogisticRegression and log loss.
- Output raw prediction column for logistic regression.
- Updated Databricks init scripts and benchmarking scripts.
- Improved api docs.
- Updated RAPIDS dependencies to 23.10.
NOTE: While the runtime is compatible with Spark versions >= 3.3, some scripts in python/tests/ are not compatible with Spark 3.3. This is addressed in 23.12.
pip package available at https://pypi.org/project/spark-rapids-ml/23.10.0/
v23.08.0 release
Release Notes:
- GPU accelerated distributed Logistic Regression with L2 regularization fit and transform, along with benchmarking and Jupyter notebook examples.
- GPU accelerated distributed Uniform Manifold Approximation and Projection (UMAP) fit and transform for non-linear dimensionality reduction along with benchmarking and Jupyter notebook examples.
- Stage level scheduling for training on stand-alone clusters.
- Improved logging.
- Preserve input column types during transform.
- Default to float32 inputs to cuML layer.
- Support conversion of GPU Logistic Regression models to pySpark ML CPU.
- Improved local benchmarking script.
- Updated RAPIDS and RAPIDS Accelerator for Spark dependencies to 23.08.
pip package available at https://pypi.org/project/spark-rapids-ml/23.8.0/
v23.06.0 release
Release Notes:
- GPU accelerated CrossValidator for RandomForestClassifier, RandomForestRegressor and LinearRegression, with example notebook
- Support for CUDA unified virtual memory to allow over-subscription of GPU memory
- Benchmarking scripts and instructions for AWS EMR
- Distributed synthetic data generation
- RandomForest example notebooks
- Support Spark ML parameters in constructors
- Improved API docs
- Updated RAPIDS dependencies to 23.06
pip package available at https://pypi.org/project/spark-rapids-ml/23.6.0/
v23.04.0 release
This release includes:
- Getting started guide and benchmarking scripts on GCP dataproc
- Getting started guide on AWS EMR
- cpu method to convert Spark RAPIDS ML generated models to Spark ML models
- Eliminating the need for CUDA on the driver node
- Example notebook for k-NN
- Spark 3.4 compatibility
- Updating RAPIDS dependencies to 23.04
pip package available at https://pypi.org/project/spark-rapids-ml/23.4.0/
v23.02.0 release
Added GPU-accelerated PySpark-compatible APIs for the following algorithms:
- K-Means
- k-NN
- LinearRegression
- PCA
- RandomForestClassifier
- RandomForestRegressor
Pip package: https://pypi.org/project/spark-rapids-ml/