Releases: NVIDIA/MatX
v0.9.4
Note: MatX is approaching a 1.0 release with several major updates. 1.0 will add CUDA JIT capabilities that enable better kernel fusion and overall improvements in kernel runtimes. Alongside the JIT work, most files have been updated to make the kernels more efficient. MatX 1.0 will require C++20 support in both the CUDA and host compilers, and CUDA 11.8 will no longer be supported.
Notable Changes:
- apply() and apply_idx() operators for writing lambda-based custom operators
Full Changelog
- Add profiling unit tests and fix timer safety by @cliffburdick in #1060
- Fixed-size reductions by @cliffburdick in #1061
- Fix gcc warning by @cliffburdick in #1062
- Added enum documentation for all operators by @cliffburdick in #1063
- Support ND operators and transforms to/from python by @cliffburdick in #1064
- Add prerun_done_ flag to prevent duplicate PreRun executions in transform operators by @cliffburdick in #1065
- Fix some iterator issues that come up with CCCL ToT by @miscco in #1066
- Properly use an `if constexpr` to guard segmented CUB algorithms by @miscco in #1067
- Fix cuTENSORNet/cuDSS library path and update to new cuTensorNet API by @cliffburdick in #1069
- Added apply() operator by @cliffburdick in #1072
- Update stdd docs by @cliffburdick in #1076
- Update release container to CUDA 13.0.1 by @tmartin-gh in #1068
- Add apply_idx operator for index-aware computations by @cliffburdick in #1077
- Fix missing include of `<cuda/std/utility>` by @miscco in #1078
Full Changelog: v0.9.3...v0.9.4
v0.9.3
New operators: find_peaks, zipvec
Key Updates:
- C2R FFT transforms
- Indexing speedup for accessing tensors
What's Changed
- Add qualifier to maybe unused variables by @cliffburdick in #1027
- Add CTK 12.9.1 / Ubuntu 24.04 container recipe by @tmartin-gh in #1028
- Added `find_peaks` operator by @cliffburdick in #1029
- Add missing include by @miscco in #1031
- Removing legacy docs folder by @cliffburdick in #1032
- Updated CCCL to 3.0.0 to prepare for CTK 13.0 by @cliffburdick in #1030
- Fix error in nvc++ by @cliffburdick in #1036
- Fixed std::accumulate starting value by @cliffburdick in #1035
- Updated developer docs for EPT by @cliffburdick in #1038
- Fixed issue where op=transform was double-calling transform by @cliffburdick in #1037
- Make cache entries per-thread since most CUDA library handles are not thread-safe by @cliffburdick in #1040
- Add pad operator for padding input operators along one dimension by @tbensonatl in #1041
- Remove unreachable return by @cliffburdick in #1042
- Add zipvec operator by @tbensonatl in #1033
- Fixed const issues seen in user's code by @cliffburdick in #1044
- Added negative file tests by @cliffburdick in #1045
- Add conditional CUDA 13+ support for select vector types by @cliffburdick in #1047
- Update CUDA macro by @cliffburdick in #1048
- Use index_t with get_grid_dims to support 32-bit builds by @tbensonatl in #1050
- qr_econ unreachable fix by @cliffburdick in #1049
- Missing return value in QR even though code is unreachable by @cliffburdick in #1051
- Avoid deprecated `thrust` iterators by @miscco in #1055
- Added support for C2R FFTs via `irfft` and `irfft2` by @cliffburdick in #1054
- Rename `version_config.h` -> `matx/version_config.h` by @valgur in #1052
- Refactor Storage system to use duck-typed allocators by @cliffburdick in #1046
- Added min/max headers where needed by @cliffburdick in #1056
- Optimize tensor indexing for ranks 1-4 with explicit stride calculations by @cliffburdick in #1057
- Add cusparse explicitly to link libraries by @agirault in #1058
- Add shared_ptr constructor to Storage class by @cliffburdick in #1059
Full Changelog: v0.9.2...v0.9.3
v0.9.2
New operator: interp
Other Additions:
- Improvements to sparse support including new batched tri-diagonal solver
- Automatic vectorization and ILP support
- DLPack updated to 1.1
- Many bug fixes
What's Changed
- Fix partial any/all reduction by @simonbyrne in #959
- interp1: add support for higher dimensional sample points and values by @simonbyrne in #963
- Introduce DIA and SkewDIA format by @aartbik in #964
- Refactor MATX_CUDA_CHECK to prevent multiple evaluation by @tmartin-gh in #957
- Introduce DIA format factory method by @aartbik in #965
- reformat sparse files with clang-format by @aartbik in #966
- Implement DIA SpMV kernel by @aartbik in #967
- Generalize SpMV from square to m x n DIA by @aartbik in #969
- replace static_assert(false) with host-only THROW by @aartbik in #968
- Generalize DIA to DIA-I and DIA-J by @aartbik in #972
- Avoid name collision with cpu_set_t from sched.h by @tbensonatl in #971
- Add axis argument to interp1. by @simonbyrne in #970
- Add operator tests back by @cliffburdick in #977
- clang-format on sparse tests by @aartbik in #973
- Add SpMV test for DIA-I and DIA-J by @aartbik in #974
- (re) enable all sparse tests by @aartbik in #979
- Let X = solve(A, B) take X and B along rows by @aartbik in #981
- Add tri-diagonal solve support by @aartbik in #982
- update doc with latest DIA support by @aartbik in #983
- minor sparse documentation refinement by @aartbik in #984
- Updating Google Test by @cliffburdick in #985
- Minor fix in UST level order for DIA by @aartbik in #986
- Vectorization and ILP by @cliffburdick in #980
- Fixing compile error with FFT conv by @cliffburdick in #989
- Fixing another 12.9 compiler bug by @cliffburdick in #991
- Removing unused parameter in lambda causing error on clang by @cliffburdick in #992
- proper lvl2dim computation for add/sub by @aartbik in #994
- add braces to if-then-else by @aartbik in #997
- Avoid `fmod` becoming ambiguous once CCCL specializes it for extended floating point types by @miscco in #996
- clang formatting by @aartbik in #998
- implement batched tri-diagonal direct solve by @aartbik in #999
- add streams to alloc/free in cusparse sequences by @aartbik in #1001
- test for batched tri-diag direct solver by @aartbik in #1000
- fix minor typos in comments by @aartbik in #1002
- DLPack 1.1 update by @cliffburdick in #1004
- Fix host compiler errors when using -Wall -Werror by @tmartin-gh in #1006
- Fix ARM relocation truncation build errors by @dylan-eustice in #1008
- Allocate pinned host memory instead of managed when managed isn't available by @cliffburdick in #1010
- Added executor to cache by @cliffburdick in #1009
- Remove template parameters in constructor by @cliffburdick in #1012
- fix flipud for 1D tensors by @simonbyrne in #1011
- Fix warnings in clang19 by @cliffburdick in #1015
- Missing unit test syncs by @dylan-eustice in #1013
- add convenience constructor for batched tri diag sparse tensor by @aartbik in #1019
- Remove runtime checks on memory spaces by @aartbik in #1018
- build each test file as a separate executable by @simonbyrne in #1017
- use batched sparse solve for interp by @simonbyrne in #1016
New Contributors
- @miscco made their first contribution in #996
- @dylan-eustice made their first contribution in #1008
Full Changelog: v0.9.1...v0.9.2
v0.9.1
Sparse support + bugfixes
- New operators: `argminmax`, `dense2sparse`, `sparse2dense`, `interp1`, `normalize`, `argsort`
- Removed requirement for `--relaxed-constexpr`
- Added MatX NVTX domain
- Significantly improved speed of `svd` and `inv`
- Python integration sample
- Experimental sparse tensor support (SpMM and solver routines supported)
- Significantly reduced FFT memory usage
What's Changed
- Moving definition of CUB cache up by @cliffburdick in #771
- Added documentation of memory types by @cliffburdick in #770
- Cleaning up non-const operator() to avoid code duplication by @cliffburdick in #769
- Switch to CUB/Thrust backend for cuda executor argmax by @tmartin-gh in #772
- Refactor cub argmax to generic cub reduce, use for argmin. Fixes #774. by @tmartin-gh in #776
- Change any() and all() to use CUB's reduce by @tmartin-gh in #777
- Add argminmax operator by @tmartin-gh in #778
- Fix matx::HostExecutor segfault with argmin/argmax by @tmartin-gh in #780
- Added new cusolverDnXsyevBatched API for batched eigen calls for CTK 12.6.2 and up by @cliffburdick in #781
- cub.h CUDACC guards for custom ops by @nvjonwong in #782
- Add example compiled with host compiler to catch regressions. by @tmartin-gh in #783
- Remove relaxed constexpr by @cliffburdick in #775
- Cleanup versions.json so jq can parse it. by @alliepiper in #785
- Allow rapids-cmake's version file to be overridden. by @alliepiper in #786
- Update rapids-cmake (branch-24.12@03ec7ef) by @alliepiper in #787
- Created MatX NVTX domain by @cliffburdick in #784
- Update docs github action by @tmartin-gh in #789
- Update docs github action by @tmartin-gh in #790
- Work around compiler parser bug by @cliffburdick in #791
- Updating developer documentation by @cliffburdick in #793
- Modify concat op to enable concatenating float3. by @nvjonwong in #792
- Fix rapids cmake by @alliepiper in #799
- Switched to getRs instead of getRi for faster inverse by @cliffburdick in #797
- Update CMakeLists.txt by @cliffburdick in #801
- Support half precision R2C transforms by @cliffburdick in #796
- Fix gcc13 erroneous warning by @cliffburdick in #802
- fixed missing forwarding code for allocate by @aartbik in #804
- Fix bug with eye, and also zero workspace before LU factorization by @cliffburdick in #807
- Change shape_type for the remap op by @nvjonwong in #806
- Faster batched SVD for small sizes by @cliffburdick in #805
- Fixing broadcasting in all operator() by @cliffburdick in #795
- Add a better error on memory allocation failure by @cliffburdick in #808
- Fix solver interfaces to use executor in cache by @cliffburdick in #809
- Python integration sample by @tmartin-gh in #812
- Fixes for clang17 errors/warnings by @cliffburdick in #815
- Misc Cleanup by @tmartin-gh in #814
- frexp_fix by @cliffburdick in #817
- Adding structures needed for sparse support by @cliffburdick in #819
- fix missing newline at EOF (to avoid future diff issues) by @aartbik in #822
- add size() to container storage by @aartbik in #824
- minor edit for sparse (layout and proper swap def) by @aartbik in #820
- add a to-string method for memory space by @aartbik in #823
- Cleanup cmake usage when MatX is a dependent project by @tmartin-gh in #827
- Fixing warnings issues by clang-19, both host and device by @cliffburdick in #825
- Update build_docs actions to newest. Add CI_RUN_DATETIME in version.rst by @tmartin-gh in #829
- introduce a versatile sparse tensor type to MatX (experimental) by @aartbik in #821
- Add initial tiff support by @tmartin-gh in #831
- Make dim2lvl translation for printing more in the style of MatX by @aartbik in #832
- Expose tensor format (and lvl specs) to sparse tensor data by @aartbik in #833
- Add cross product operator by @mfzmullen in #818
- remove LVL depth restriction with constexpr templating by @aartbik in #834
- Guard all DIM/LVL recursion against completely empty format by @aartbik in #835
- Adjust half-type threshold for cross product unit tests by @mfzmullen in #838
- Added fp32 version of normcdf by @cliffburdick in #839
- Changing black scholes to float and improving performance by @cliffburdick in #840
- Implement the () operator on sparse tensors by @aartbik in #837
- Support operators into einsum interface by @cliffburdick in #845
- Add print function with nonzero dim args by @tbensonatl in #844
- Updated CCCL to fix regression in newer CTK versions by @cliffburdick in #846
- First version of MATX SpMM (using dispatch to cuSPARSE) by @aartbik in #843
- Moved sparse operator() into tensor_impl_t by @cliffburdick in #841
- Adding timing metrics to CUDA and host executors by @cliffburdick in #842
- Remove dense "testers" from the sparse tensor format type by @aartbik in #847
- cuDSS by @cliffburdick in #848
- Update deprecated CUB types by @cliffburdick in #851
- Renamed versatile into universal for sparse tensor types by @aartbik in #850
- Ignore incorrect gcc warning in einsum by @cliffburdick in #853
- Added documentation on integrating with existing software by @cliffburdick in #852
- Add compile-time check for minimum CUDA arch by @tbensonatl in #855
- First version of MATX Sparse-Direct-Solve (using dispatch to cuDSS) by @aartbik in #849
- First version of MATX sparse2dense conversion (dispatch to cuSPARSE) by @aartbik in #856
- Improve cuFFT errors by @cliffburdick in #860
- workaround for CTAD bug in NVC++ by @cliffburdick in #859
- Add note about host-allocated memory to external guide by @cliffburdick in #862
- Cleanup to use pass-by-reference more consistently by @aartbik in #861
- Move empty storage construction to inline helper method by @aartbik in #857
- Make CCCL copy false by @cliffburdick in #865
- Remove test for free memory on FFTs by @cliffburdick in #864
- Fix initializer list order by @tmartin-gh in #867
- Initialize host cuRAND API when using host compiler by @cliffburdick in #866
- Add user-friendly assertions to make_sparse_tensor by @aartbik in #869
- Add "zero" matrix factor methods for COO,CSR,CSC by @aartbik in #870
- First version of MATX dense2sparse conversion (dispatch to cuSPARSE) by @aartbik in #868
- Add sparse factory method tests by @aartbik in #871
- Enforce library restrictions on MatX transformations by @aartbik in #872
- Add sparse conversion tests (dense2sparse, sparse2dense) by @aartbik in #873
- Add sparse direct-solver tests by @aartbik in #874
- Add SpMM tests by @aartbik in #875
- Refactored OperatorTests.cu for faster compilation time by @cliffburdick in #876
- Test feeding dense output as intermediate for the new sparse ops by @aartbik in #877
- Use transitive include in benchmarks cmake by @cliffburdick in #880
- Remove const qualifier on input to thrust ...
v0.9.0
Version v0.9.0 adds comprehensive support for more host CPU transforms such as BLAS and LAPACK, including multi-threaded versions.
Beyond the CPU support, there are many more minor improvements:
- Added several new operators, including `vector_norm`, `matrix_norm`, `frexp`, `diag`, and more
- Many compiler fixes to support a wider range of older and newer compilers
- Performance improvements to avoid overhead of permutation operators when unnecessary
- Much more!
A full changelist is below
What's Changed
- Update pybind to v2.12.0. Fixes issue #591. by @tmartin-gh in #604
- Change print macro to matx namespaced function by @tmartin-gh in #607
- Added frexp() operator by @cliffburdick in #609
- Disable CUTLASS compile option by @cliffburdick in #610
- Created dimensionless versions of ones() and zeros() by @cliffburdick in #611
- Add smem-based polyphase channelizer kernel by @tbensonatl in #613
- Eigen guide by @tylera-nvidia in #612
- Multithreaded docs build Fix by @tylera-nvidia in #614
- Fixed issues with static tensor unit tests compiling by @cliffburdick in #615
- Implement csqrt by @tylera-nvidia in #619
- Automatic Enumeration of NVTX Range IDs by @tylera-nvidia in #616
- Fixing Clang errors to compile with clang-17 by @cliffburdick in #621
- Update to CCCL 2.4.0 and fix CMake to not use system includes by @cliffburdick in #623
- Remove options that nvc++ doesn't support by @cliffburdick in #624
- Fixing some warnings on certain compilers by @cliffburdick in #625
- More nvc++ warning fixes. Increase minimum supported CUDA to 11.5 by @cliffburdick in #627
- More nvc++ fixes + code coverage generation by @cliffburdick in #628
- fixed printing 0D tensors by @tylera-nvidia in #618
- Remove conversion for double to half by @cliffburdick in #631
- Add NVTX Tests for Code Coverage by @tylera-nvidia in #632
- Feature/add complex cast operators by @tbensonatl in #633
- Avoid array indices passthrough in matxOpTDKernel by @tbensonatl in #634
- Add mixed precision support for channelize_poly by @tbensonatl in #640
- Add test cases for stride kernels by @cliffburdick in #641
- Basic synchronization support with sync() by @aayushg55 in #642
- Converting old std:: types to cuda::std:: types by @cliffburdick in #629
- Fix pybind iterator bug on newer g++ by @cliffburdick in #643
- Initialize NVTX variable by @cliffburdick in #644
- Fixed remaining nvc++ warnings by @cliffburdick in #645
- Change cmake option/project order by @raplonu in #649
- Change check on build type to avoid short circuiting by @cliffburdick in #647
- Add complex cast operators for split inputs by @tbensonatl in #650
- Added `norm()` operator by @cliffburdick in #620
- Add zero-copy interface from MatX to NumPy by @cliffburdick in #653
- Added host multithreading support for FFTW by @aayushg55 in #652
- Fixed OpenMP compiler flags by @aayushg55 in #654
- Fixed issue with operator types used as both lvalue/rvalue not assigning by @cliffburdick in #655
- Smaller FFT test sizes for faster CI/CD by @aayushg55 in #656
- Docs for matrix/vector norm by @cliffburdick in #657
- Change matmul to use tensor_t temp until issue with impl is fixed by @cliffburdick in #658
- Added plan caching for FFTW host plans by @aayushg55 in #659
- Fixed fftw guards and temp allocation by @aayushg55 in #660
- Fixed fftw guards to be fine-grained by @aayushg55 in #661
- Enabled FFT conv for host by @aayushg55 in #662
- NVPL BLAS Support by @aayushg55 in #665
- Change supported CUDA to 11.8 by @cliffburdick in #670
- enh: add macro to define cuda functions accessible at global scope by @mfzmullen in #668
- Add workaround for pre-11.8 CTK smem init errors by @tbensonatl in #673
- Fix to ConvCorr tests to skip host tests when host not enabled by @aayushg55 in #674
- Expanded Host BLAS support by @aayushg55 in #675
- Update README.md by @HugoPhibbs in #676
- Improved the error messages when sizes are incompatible by @cliffburdick in #682
- Added toeplitz operator by @cliffburdick in #683
- Simplified cmake file so no definitions are required by default by @cliffburdick in #684
- fix type for permuted ops in norm. by @luitjens in #696
- Fix c++20 warning by @cliffburdick in #698
- Update Cub Cache Creation to new Method by @tylera-nvidia in #694
- Fixed base operator types by @cliffburdick in #703
- Update slice.rst by @HugoPhibbs in #704
- Fixed issues with host compiler with C++17 and C++20 modes by @cliffburdick in #706
- NVPL LAPACK Solver Support on ARM by @aayushg55 in #701
- Add detail:: namespace to CUB struct by @cliffburdick in #708
- OpenBLAS LAPACK Solver Support for x86 by @aayushg55 in #709
- Exclude examples/cmake_sample_project/build* from doxygen search by @tmartin-gh in #711
- Fixed random pre/post run signature by @cliffburdick in #715
- Rapids cmake 24 06 package by @cliffburdick in #716
- Add support for UINT Generation by @tylera-nvidia in #695
- Update svd docstring by @cliffburdick in #717
- Solver SVD Optimizations and Improved cuSolver batching by @aayushg55 in #721
- MATX_EN_CUTENSOR / MATX_ENABLE_CUTENSOR Unified Variable by @tylera-nvidia in #720
- mtie should output the correct rank and size for the output operator. by @luitjens in #726
- Update bug_report.md by @HugoPhibbs in #729
- eliminate auto spills in permute by @luitjens in #731
- Revert accidental commit to main by @cliffburdick in #734
- Host Solver workspace query fix by @aayushg55 in #733
- Add in-place transform support for inv() by @tbensonatl in #736
- Allow access to Data() pointer from device by @tmartin-gh in #738
- Use cublasmatinvBatched() for N <= 32 by @tbensonatl in #739
- Added new pinv() operator and updated Reduced SVD by @aayushg55 in #740
- optimize our iterator to avoid an unnecessary constructor call by @luitjens in #741
- Updated Solver documentation by @aayushg55 in #742
- Updated documentation for CPU support by @aayushg55 in #743
- Slice optimizations to reduce spills by @cliffburdick in #732
- Fixing shadow declaration by @cliffburdick in #745
- Workaround for constexpr bug inside lambda in CUDA 11.8 by @cliffburdick in #671
- Added diag operator taking 1D operator to generate 2D operator by @cliffburdick in #746
- Add normcdf docs by @cliffburdick in #747
- Refactor template arguments to reductions to force no permutes when unnecessary by @cliffburdick in #749
- Adding workarounds for false positives on gcc14 by @cliffburdick in #751
- Visibility fix for cache static deinit issue by @nvjonwong in #752
- Don't allow in-place make_tensor to change ownership by @cliffburdick in #753
- Fix for erroneous errors on gcc14.1 by @cliffburdick in #755
- Create temp contiguous tensors if needed for sor...
v0.8.0
Release highlights:
- Features
- Updated cuTENSOR and cuTensorNet versions
- Added configurable print formatting
- ARM FFT support via NVPL
- New operators: abs2(), outer(), isnan(), isinf()
- Many more unit tests for CPU tests
- Bug fixes for matmul on Hopper, 2D FFTs, and more
Full changelist:
What's Changed
- Increase cublas workspace to 32 MiB for Hopper+ by @tbensonatl in #545
- matmul bug fixes. by @luitjens in #547
- Added missing synchronization by @luitjens in #552
- Refine some file I/O functions' doxygen comments by @AtomicVar in #549
- Update docs by @tmartin-gh in #551
- Export used environment variables in sphinx config by @tmartin-gh in #553
- Import os by @tmartin-gh in #554
- Add version info by @tmartin-gh in #555
- Fix typo by @tmartin-gh in #556
- Adds IsNan and IsInf Operators by @nvjonwong in #557
- Use cmake project version info in sphinx config by @tmartin-gh in #560
- outer() operator for outer product by @cliffburdick in #559
- Fix nans in QR and SVD. by @luitjens in #558
- Update CMakeLists.txt by @cliffburdick in #548
- Fix CMake to allow multiple rapids-cmake to coexist by @cliffburdick in #562
- Return 0D arrays for 0D shape in operators by @cliffburdick in #561
- Fix NVTX3 include path by @AtomicVar in #564
- Add .npy File I/O by @AtomicVar in #565
- SVD & QR improvements by @luitjens in #563
- chore: Fix typo s/whereever/wherever/ by @hugo-syn in #566
- Add rapids-cmake-dir, if defined, to CMAKE_MODULE_PATH by @tbensonatl in #567
- Add abs2() operator for squared abs() by @tbensonatl in #568
- Fixed issue on g++13 with nullptr dereference that cannot happen at r… by @cliffburdick in #571
- Force max(min) size of direct convolution dimension to be < 1024 by @cliffburdick in #573
- Remove incorrect warning check for any compiler other than gcc by @cliffburdick in #577
- stream memory cleanup by @cliffburdick in #579
- Update reshape indices by @cliffburdick in #580
- Update matlabpython.rst by @cliffburdick in #583
- Prevent potential oob read in matxOpTDKernel by @tbensonatl in #586
- Broadcast lower-rank tensors during batched matmul by @tbensonatl in #585
- Fix bugs in 2D FFTs and add tests by @benbarsdell in #587
- Added ARM FFT Support by @cliffburdick in #576
- Various bug fixes for older compilers by @cliffburdick in #588
- Renamed rmin/rmax functions to min/max and element-wise are now minimum/maximum to match Python by @cliffburdick in #589
- Fix clang macro by @cliffburdick in #592
- Fix misplaced sentence in README by @lucifer1004 in #594
- Add configurable print formatting types by @tmartin-gh in #593
- Fixing return types to allow either prvalue or lvalue in operator() by @cliffburdick in #598
- Rework einsum for new cache style. Fix for issue #597 by @tmartin-gh in #599
- Updated cutensornet to 24.03 and cutensor to 2.0.1 by @cliffburdick in #600
- adding file name and line number to ease debug by @bhaskarrakshit in #601
- Updating versions and notes for v0.8.0 by @cliffburdick in #602
New Contributors
- @hugo-syn made their first contribution in #566
- @benbarsdell made their first contribution in #587
- @lucifer1004 made their first contribution in #594
- @bhaskarrakshit made their first contribution in #601
Full Changelog: v0.7.0...v0.8.0
v0.7.0
Features
- Convert libcudacxx to CCCL by @cliffburdick in #501
- Add PreRun and tests for at/clone/diag operators by @tbensonatl in #502
- Add explicit FFT length to fft_conv example by @tbensonatl in #503
- Add Pre/PostRun support for collapse, concat ops by @tbensonatl in #506
- polyval operator by @cliffburdick in #508
- Optimize resample poly kernels by @tbensonatl in #512
- Allow negative indexing on slices by @cliffburdick in #516
- Automatically publish docs to GH Pages on merge to main by @tmartin-gh in #520
- Add configurable precision support of `print()` by @AtomicVar in #521
- Make matxHalf trivially copyable by @tbensonatl in #513
- Added operator for matvec by @cliffburdick in #514
- New rapids and nvbench by @cliffburdick in #529
Fixes
- Add FFT1D tensor size checks by @tbensonatl in #499
- Fix errors which caused some unit tests failed to compile. by @AtomicVar in #504
- Fix upsample output size by @cliffburdick in #507
- removing print characters accidentally left behind by @tylera-nvidia in #510
- Renamed host executor and prepared for multi-threaded additions by @cliffburdick in #511
- removing old hardcoded limit for repmat rank size by @tylera-nvidia in #515
- Avoid async alloc in some Cholesky decomp cases by @tbensonatl in #517
- Workaround for maybe_unused parse bug in old gcc by @tbensonatl in #522
- Fix matvec output dims to match A rather than B by @tbensonatl in #523
- Remove CUDA system include by @cliffburdick in #525
- Zero-initialize batches field in CUB params by @tbensonatl in #527
- Fixing host include guard on resample poly by @cliffburdick in #528
- Update device.h for host compiler by @cliffburdick in #530
- Made allocator an inline function by @cliffburdick in #532
- Build and publish documentation on merge to main by @tmartin-gh in #533
- Remove doxygen parameter to match tensor_t constructor signature by @tmartin-gh in #534
- Update iterator.h by @cliffburdick in #536
- Update Bug Report Issue Template by @AtomicVar in #539
- Fix CCCL libcudacxx path by @cliffburdick in #537
- Check matmul types and error at compile-time if the backend doesn't support them by @cliffburdick in #540
- Fix batched cov transform by @tbensonatl in #541
- Update caching for transforms to fixing all leaks reported by compute-sanitizer by @cliffburdick in #542
- Update docs for v0.7.0 by @cliffburdick in #544
Full Changelog: v0.6.0...v0.7.0
v0.6.0
Notable Updates
- Transforms as operators by @cliffburdick in #452
- resample_poly optimizations and operator support by @tbensonatl in #465
Full changelog below:
What's Changed
- Added upsample and downsample operators by @cliffburdick in #442
- Added lvalue semantics to operators that needed it by @cliffburdick in #443
- Added operator support to solver functions by @cliffburdick in #444
- Added shapeless version of diag() and eye() by @cliffburdick in #445
- Deprecated random interface by @cliffburdick in #446
- Updated cuTENSOR/cuTensorNet and added example for trace by @cliffburdick in #447
- Fixing host compilation where device code snuck in by @cliffburdick in #453
- Added Protections for Shift Operator inputs and fixed issues with size/Shape returns for certain input sizes by @tylera-nvidia in #454
- Added isclose and allclose functions by @cliffburdick in #448
- Adds normalization options for `fft` and `ifft` by @nvjonwong in #456
- Updated 0D tensor syntax and expanded simple radar pipeline by @cliffburdick in #458
- Add initial polyphase channelizer operator by @tbensonatl in #459
- Fixed inverse from stomping on input by @cliffburdick in #461
- Fix cache issue with strides by @cliffburdick in #460
- Added const to Pre/PostRun by @cliffburdick in #462
- Revert inv by @cliffburdick in #463
- Added proper LHS handling for transforms by @cliffburdick in #464
- Updated incorrect license by @cliffburdick in #466
- Use device mem instead of managed for fft workbuf by @tbensonatl in #467
- Added at() and percentile() operators by @cliffburdick in #471
- Add overlap operator by @cliffburdick in #472
- Support stride 0 A/B batches for GEMMs by @cliffburdick in #473
- Added FFT-based convolution to conv1d() by @cliffburdick in #475
- Documentation cleanup by @tmartin-gh in #477
- Adding FFT convolution benchmarks by @cliffburdick in #476
- Fixed rank of output in matmul operator when A/B had 0 stride by @cliffburdick in #478
- Updating header image by @cliffburdick in #480
- Add pwelch operator by @tmartin-gh in #479
- Docs cleanup. Enforce warning-as-error for doxygen and sphinx. by @tmartin-gh in #481
- Fixes for CUDA 12.3 compiler by @cliffburdick in #483
- Update pwelch.h by @cliffburdick in #486
- Fixes for new compiler issues by @cliffburdick in #488
- Fixing sample Cmake Project by @tylera-nvidia in #489
- Update base_operator.h by @cliffburdick in #490
- Add window operator input to pwelch by @tmartin-gh in #491
- Add PreRun methods for slice/fftshift operators by @tbensonatl in #493
- PreRun support for r2c and other fft related fixes by @tbensonatl in #494
New Contributors
- @tmartin-gh made their first contribution in #477
Full Changelog: v0.5.0...v0.6.0
v0.5.0
Notable Updates
- Documentation rewritten to include working examples for every function based on unit tests
- Polyphase resampler based on SciPy/cuSignal's `resample_poly`
Full changelog below:
What's Changed
- Modifies TensorViewToNumpy and NumpyToTensorView for rank = 5 by @nvjonwong in #427
- NumpyToTensorView overload which returns new TensorView by @nvjonwong in #428
- Added fftfreq() generator by @cliffburdick in #430
- Latest NumpyToTensorView function requires complex conversion for complex types by @nvjonwong in #431
- Fixed print function to work on device in certain cases by @cliffburdick in #436
- Fixed unused variable warning by @cliffburdick in #435
- Adding initial polyphase resampler transform by @tbensonatl in #437
- Revamped documentation by @cliffburdick in #438
- Fixing typo in Cholesky docs by @cliffburdick in #439
- Added broadcasting documentation by @cliffburdick in #440
- Broadcast docs by @cliffburdick in #441
New Contributors
- @nvjonwong made their first contribution in #427
Full Changelog: v0.4.1...v0.5.0
v0.4.1
This is a minor release mostly focused on bug fixes for different compilers and CUDA versions. One major feature was added: all reductions are now supported on the host using a single-threaded executor, with multi-threaded executor support coming soon.
What's Changed
- Host reductions by @cliffburdick in #385
- Reduced cuBLASLt workspace size by @cliffburdick in #404
- Fix benchmarks that broke with new executors by @cliffburdick in #405
- All operator tests converted to use host and device, and improved 16b by @cliffburdick in #403
- Add single argument copy() and copy() tests by @tbensonatl in #407
- Add rank0 tensor remap support by @tbensonatl in #408
- Add Mutex to support multithread NVTX markers by @tylera-nvidia in #406
- Fix a few issues highlighted by linters/clang by @tbensonatl in #409
- Fixed compilation for Pascal by @cliffburdick in #412
- Fixed issue with constructor when passing strides and sizes by @cliffburdick in #413
- CMake fixes found by user by @cliffburdick in #416
- Update libcudacxx to 2.1.0 by @cliffburdick in #417
- Fixed cupy check for unit tests, default constructors, and file IO by @cliffburdick in #419
- Added delta degrees of freedom on var() to mimic Python by @cliffburdick in #421
- Adding correct license on files that were wrong by @cliffburdick in #423
- Fixed two issues with release mode and DLPack and reductions on the host by @cliffburdick in #424
Full Changelog: v0.4.0...v0.4.1