Releases: NVIDIA/MatX
v0.9.4
Note: MatX is approaching a 1.0 release with several major updates. 1.0 will add CUDA JIT capabilities that enable better kernel fusion and overall improvements in kernel runtimes. Alongside the JIT work, most files have been updated to make the kernels more efficient. MatX 1.0 will require C++20 support in both the CUDA and host compilers, and CUDA 11.8 will no longer be supported.
Notable Changes:
- apply() and apply_idx() operators for writing lambda-based custom operators
Full Changelog
- Add profiling unit tests and fix timer safety by @cliffburdick in #1060
- Fixed-size reductions by @cliffburdick in #1061
- Fix gcc warning by @cliffburdick in #1062
- Added enum documentation for all operators by @cliffburdick in #1063
- Support ND operators and transforms to/from python by @cliffburdick in #1064
- Add prerun_done_ flag to prevent duplicate PreRun executions in transform operators by @cliffburdick in #1065
- Fix some iterator issues that come up with CCCL ToT by @miscco in #1066
- Properly use an `if constexpr` to guard segmented CUB algorithms by @miscco in #1067
- Fix cuTENSORNet/cuDSS library path and update to new cuTensorNet API by @cliffburdick in #1069
- Added apply() operator by @cliffburdick in #1072
- Update stdd docs by @cliffburdick in #1076
- Update release container to CUDA 13.0.1 by @tmartin-gh in #1068
- Add apply_idx operator for index-aware computations by @cliffburdick in #1077
- Fix missing include of `<cuda/std/utility>` by @miscco in #1078
Full Changelog: v0.9.3...v0.9.4
v0.9.3
New operators: find_peaks, zipvec
Key Updates:
- C2R FFT transforms
- Indexing speedup for accessing tensors
What's Changed
- Add qualifier to maybe unused variables by @cliffburdick in #1027
- Add CTK 12.9.1 / Ubuntu 24.04 container recipe by @tmartin-gh in #1028
- Added `find_peaks` operator by @cliffburdick in #1029
- Add missing include by @miscco in #1031
- Removing legacy docs folder by @cliffburdick in #1032
- Updated CCCL to 3.0.0 to prepare for CTK 13.0 by @cliffburdick in #1030
- Fix error in nvc++ by @cliffburdick in #1036
- Fixed std::accumulate starting value by @cliffburdick in #1035
- Updated developer docs for EPT by @cliffburdick in #1038
- Fixed issue where op=transform was double-calling transform by @cliffburdick in #1037
- Make cache entries per-thread since most CUDA library handles are not thread-safe by @cliffburdick in #1040
- Add pad operator for padding input operators along one dimension by @tbensonatl in #1041
- Remove unreachable return by @cliffburdick in #1042
- Add zipvec operator by @tbensonatl in #1033
- Fixed const issues seen in user's code by @cliffburdick in #1044
- Added negative file tests by @cliffburdick in #1045
- Add conditional CUDA 13+ support for select vector types by @cliffburdick in #1047
- Update CUDA macro by @cliffburdick in #1048
- Use index_t with get_grid_dims to support 32-bit builds by @tbensonatl in #1050
- qr_econ unreachable fix by @cliffburdick in #1049
- Missing return value in QR even though code is unreachable by @cliffburdick in #1051
- Avoid deprecated `thrust` iterators by @miscco in #1055
- Added support for C2R FFTs via `irfft` and `irfft2` by @cliffburdick in #1054
- Rename `version_config.h` -> `matx/version_config.h` by @valgur in #1052
- Refactor Storage system to use duck-typed allocators by @cliffburdick in #1046
- Added min/max headers where needed by @cliffburdick in #1056
- Optimize tensor indexing for ranks 1-4 with explicit stride calculations by @cliffburdick in #1057
- Add cusparse explicitly to link libraries by @agirault in #1058
- Add shared_ptr constructor to Storage class by @cliffburdick in #1059
Full Changelog: v0.9.2...v0.9.3
v0.9.2
New operator: interp
Other Additions:
- Improvements to sparse support including new batched tri-diagonal solver
- Automatic vectorization and ILP support
- DLPack updated to 1.1
- Many bug fixes
What's Changed
- Fix partial any/all reduction by @simonbyrne in #959
- interp1: add support for higher dimensional sample points and values by @simonbyrne in #963
- Introduce DIA and SkewDIA format by @aartbik in #964
- Refactor MATX_CUDA_CHECK to prevent multiple evaluation by @tmartin-gh in #957
- Introduce DIA format factory method by @aartbik in #965
- reformat sparse files with clang-format by @aartbik in #966
- Implement DIA SpMV kernel by @aartbik in #967
- Generalize SpMV from square to m x n DIA by @aartbik in #969
- replace static_assert(false) with host-only THROW by @aartbik in #968
- Generalize DIA to DIA-I and DIA-J by @aartbik in #972
- Avoid name collision with cpu_set_t from sched.h by @tbensonatl in #971
- Add axis argument to interp1. by @simonbyrne in #970
- Add operator tests back by @cliffburdick in #977
- clang-format on sparse tests by @aartbik in #973
- Add SpMV test for DIA-I and DIA-J by @aartbik in #974
- (re) enable all sparse tests by @aartbik in #979
- Let X = solve(A, B) take X and B along rows by @aartbik in #981
- Add tri-diagonal solve support by @aartbik in #982
- update doc with latest DIA support by @aartbik in #983
- minor sparse documentation refinement by @aartbik in #984
- Updating Google Test by @cliffburdick in #985
- Minor fix in UST level order for DIA by @aartbik in #986
- Vectorization and ILP by @cliffburdick in #980
- Fixing compile error with FFT conv by @cliffburdick in #989
- Fixing another 12.9 compiler bug by @cliffburdick in #991
- Removing unused parameter in lambda causing error on clang by @cliffburdick in #992
- proper lvl2dim computation for add/sub by @aartbik in #994
- add braces to if-then-else by @aartbik in #997
- Avoid `fmod` becoming ambiguous once CCCL specializes it for extended floating point types by @miscco in #996
- clang formatting by @aartbik in #998
- implement batched tri-diagonal direct solve by @aartbik in #999
- add streams to alloc/free in cusparse sequences by @aartbik in #1001
- test for batched tri-diag direct solver by @aartbik in #1000
- fix minor typos in comments by @aartbik in #1002
- DLPack 1.1 update by @cliffburdick in #1004
- Fix host compiler errors when using -Wall -Werror by @tmartin-gh in #1006
- Fix ARM relocation truncation build errors by @dylan-eustice in #1008
- Allocate pinned host memory instead of managed when managed isn't available by @cliffburdick in #1010
- Added executor to cache by @cliffburdick in #1009
- Remove template parameters in constructor by @cliffburdick in #1012
- fix flipud for 1D tensors by @simonbyrne in #1011
- Fix warnings in clang19 by @cliffburdick in #1015
- Missing unit test syncs by @dylan-eustice in #1013
- add convenience constructor for batched tri diag sparse tensor by @aartbik in #1019
- Remove runtime checks on memory spaces by @aartbik in #1018
- build each test file as a separate executable by @simonbyrne in #1017
- use batched sparse solve for interp by @simonbyrne in #1016
New Contributors
- @miscco made their first contribution in #996
- @dylan-eustice made their first contribution in #1008
Full Changelog: v0.9.1...v0.9.2
v0.9.1
Sparse support + bugfixes
- New operators: `argminmax`, `dense2sparse`, `sparse2dense`, `interp1`, `normalize`, `argsort`
- Removed requirement for `--relaxed-constexpr`
- Added MatX NVTX domain
- Significantly improved speed of `svd` and `inv`
- Python integration sample
- Experimental sparse tensor support (SpMM and solver routines supported)
- Significantly reduced FFT memory usage
What's Changed
- Moving definition of CUB cache up by @cliffburdick in #771
- Added documentation of memory types by @cliffburdick in #770
- Cleaning up non-const operator() to avoid code duplication by @cliffburdick in #769
- Switch to CUB/Thrust backend for cuda executor argmax by @tmartin-gh in #772
- Refactor cub argmax to generic cub reduce, use for argmin. Fixes #774. by @tmartin-gh in #776
- Change any() and all() to use CUB's reduce by @tmartin-gh in #777
- Add argminmax operator by @tmartin-gh in #778
- Fix matx::HostExecutor segfault with argmin/argmax by @tmartin-gh in #780
- Added new cusolverDnXsyevBatched API for batched eigen calls for CTK 12.6.2 and up by @cliffburdick in #781
- cub.h CUDACC guards for custom ops by @nvjonwong in #782
- Add example compiled with host compiler to catch regressions. by @tmartin-gh in #783
- Remove relaxed constexpr by @cliffburdick in #775
- Cleanup versions.json so jq can parse it. by @alliepiper in #785
- Allow rapids-cmake's version file to be overridden. by @alliepiper in #786
- Update rapids-cmake (branch-24.12@03ec7ef) by @alliepiper in #787
- Created MatX NVTX domain by @cliffburdick in #784
- Update docs github action by @tmartin-gh in #789
- Update docs github action by @tmartin-gh in #790
- Work around compiler parser bug by @cliffburdick in #791
- Updating developer documentation by @cliffburdick in #793
- Modify concat op to enable concatenating float3. by @nvjonwong in #792
- Fix rapids cmake by @alliepiper in #799
- Switched to getRs instead of getRi for faster inverse by @cliffburdick in #797
- Update CMakeLists.txt by @cliffburdick in #801
- Support half precision R2C transforms by @cliffburdick in #796
- Fix gcc13 erroneous warning by @cliffburdick in #802
- fixed missing forwarding code for allocate by @aartbik in #804
- Fix bug with eye, and also zero workspace before LU factorization by @cliffburdick in #807
- Change shape_type for the remap op by @nvjonwong in #806
- Faster batched SVD for small sizes by @cliffburdick in #805
- Fixing broadcasting in all operator() by @cliffburdick in #795
- Add a better error on memory allocation failure by @cliffburdick in #808
- Fix solver interfaces to use executor in cache by @cliffburdick in #809
- Python integration sample by @tmartin-gh in #812
- Fixes for clang17 errors/warnings by @cliffburdick in #815
- Misc Cleanup by @tmartin-gh in #814
- frexp_fix by @cliffburdick in #817
- Adding structures needed for sparse support by @cliffburdick in #819
- fix missing newline at EOF (to avoid future diff issues) by @aartbik in #822
- add size() to container storage by @aartbik in #824
- minor edit for sparse (layout and proper swap def) by @aartbik in #820
- add a to-string method for memory space by @aartbik in #823
- Cleanup cmake usage when MatX is a dependent project by @tmartin-gh in #827
- Fixing warnings issues by clang-19, both host and device by @cliffburdick in #825
- Update build_docs actions to newest. Add CI_RUN_DATETIME in version.rst by @tmartin-gh in #829
- introduce a versatile sparse tensor type to MatX (experimental) by @aartbik in #821
- Add initial tiff support by @tmartin-gh in #831
- Make dim2lvl translation for printing more in the style of MatX by @aartbik in #832
- Expose tensor format (and lvl specs) to sparse tensor data by @aartbik in #833
- Add cross product operator by @mfzmullen in #818
- remove LVL depth restriction with constexpr templating by @aartbik in #834
- Guard all DIM/LVL recursion against completely empty format by @aartbik in #835
- Adjust half-type threshold for cross product unit tests by @mfzmullen in #838
- Added fp32 version of normcdf by @cliffburdick in #839
- Changing black scholes to float and improving performance by @cliffburdick in #840
- Implement the () operator on sparse tensors by @aartbik in #837
- Support operators into einsum interface by @cliffburdick in #845
- Add print function with nonzero dim args by @tbensonatl in #844
- Updated CCCL to fix regression in newer CTK versions by @cliffburdick in #846
- First version of MATX SpMM (using dispatch to cuSPARSE) by @aartbik in #843
- Moved sparse operator() into tensor_impl_t by @cliffburdick in #841
- Adding timing metrics to CUDA and host executors by @cliffburdick in #842
- Remove dense "testers" from the sparse tensor format type by @aartbik in #847
- cuDSS by @cliffburdick in #848
- Update deprecated CUB types by @cliffburdick in #851
- Renamed versatile into universal for sparse tensor types by @aartbik in #850
- Ignore incorrect gcc warning in einsum by @cliffburdick in #853
- Added documentation on integrating with existing software by @cliffburdick in #852
- Add compile-time check for minimum CUDA arch by @tbensonatl in #855
- First version of MATX Sparse-Direct-Solve (using dispatch to cuDSS) by @aartbik in #849
- First version of MATX sparse2dense conversion (dispatch to cuSPARSE) by @aartbik in #856
- Improve cuFFT errors by @cliffburdick in #860
- workaround for CTAD bug in NVC++ by @cliffburdick in #859
- Add note about host-allocated memory to external guide by @cliffburdick in #862
- Cleanup to use pass-by-reference more consistently by @aartbik in #861
- Move empty storage construction to inline helper method by @aartbik in #857
- Make CCCL copy false by @cliffburdick in #865
- Remove test for free memory on FFTs by @cliffburdick in #864
- Fix initializer list order by @tmartin-gh in #867
- Initialize host cuRAND API when using host compiler by @cliffburdick in #866
- Add user-friendly assertions to make_sparse_tensor by @aartbik in #869
- Add "zero" matrix factor methods for COO,CSR,CSC by @aartbik in #870
- First version of MATX dense2sparse conversion (dispatch to cuSPARSE) by @aartbik in #868
- Add sparse factory method tests by @aartbik in #871
- Enforce library restrictions on MatX transformations by @aartbik in #872
- Add sparse conversion tests (dense2sparse, sparse2dense) by @aartbik in #873
- Add sparse direct-solver tests by @aartbik in #874
- Add SpMM tests by @aartbik in #875
- Refactored OperatorTests.cu for faster compilation time by @cliffburdick in #876
- Test feeding dense output as intermediate for the new sparse ops by @aartbik in #877
- Use transitive include in benchmarks cmake by @cliffburdick in #880
- Remove const qualifier on input to thrust ...
v0.9.0
Version v0.9.0 adds comprehensive support for more host CPU transforms such as BLAS and LAPACK, including multi-threaded versions.
Beyond the CPU support, there are many more minor improvements:
- Added several new operators, including `vector_norm`, `matrix_norm`, `frexp`, `diag`, and more
- Many compiler fixes to support a wider range of older and newer compilers
- Performance improvements to avoid overhead of permutation operators when unnecessary
- Much more!
A full changelist is below
What's Changed
- Update pybind to v2.12.0. Fixes issue #591. by @tmartin-gh in #604
- Change print macro to matx namespaced function by @tmartin-gh in #607
- Added frexp() operator by @cliffburdick in #609
- Disable CUTLASS compile option by @cliffburdick in #610
- Created dimensionless versions of ones() and zeros() by @cliffburdick in #611
- Add smem-based polyphase channelizer kernel by @tbensonatl in #613
- Eigen guide by @tylera-nvidia in #612
- Multithreaded docs build Fix by @tylera-nvidia in #614
- Fixed issues with static tensor unit tests compiling by @cliffburdick in #615
- Implement csqrt by @tylera-nvidia in #619
- Automatic Enumeration of NVTX Range IDs by @tylera-nvidia in #616
- Fixing Clang errors to compile with clang-17 by @cliffburdick in #621
- Update to CCCL 2.4.0 and fix CMake to not use system includes by @cliffburdick in #623
- Remove options that nvc++ doesn't support by @cliffburdick in #624
- Fixing some warnings on certain compilers by @cliffburdick in #625
- More nvc++ warning fixes. Increase minimum supported CUDA to 11.5 by @cliffburdick in #627
- More nvc++ fixes + code coverage generation by @cliffburdick in #628
- fixed printing 0D tensors by @tylera-nvidia in #618
- Remove conversion for double to half by @cliffburdick in #631
- Add NVTX Tests for Code Coverage by @tylera-nvidia in #632
- Feature/add complex cast operators by @tbensonatl in #633
- Avoid array indices passthrough in matxOpTDKernel by @tbensonatl in #634
- Add mixed precision support for channelize_poly by @tbensonatl in #640
- Add test cases for stride kernels by @cliffburdick in #641
- Basic synchronization support with sync() by @aayushg55 in #642
- Converting old std:: types to cuda::std:: types by @cliffburdick in #629
- Fix pybind iterator bug on newer g++ by @cliffburdick in #643
- Initialize NVTX variable by @cliffburdick in #644
- Fixed remaining nvc++ warnings by @cliffburdick in #645
- Change cmake option/project order by @raplonu in #649
- Change check on build type to avoid short circuiting by @cliffburdick in #647
- Add complex cast operators for split inputs by @tbensonatl in #650
- Added `norm()` operator by @cliffburdick in #620
- Add zero-copy interface from MatX to NumPy by @cliffburdick in #653
- Added host multithreading support for FFTW by @aayushg55 in #652
- Fixed OpenMP compiler flags by @aayushg55 in #654
- Fixed issue with operator types used as both lvalue/rvalue not assigning by @cliffburdick in #655
- Smaller FFT test sizes for faster CI/CD by @aayushg55 in #656
- Docs for matrix/vector norm by @cliffburdick in #657
- Change matmul to use tensor_t temp until issue with impl is fixed by @cliffburdick in #658
- Added plan caching for FFTW host plans by @aayushg55 in #659
- Fixed fftw guards and temp allocation by @aayushg55 in #660
- Fixed fftw guards to be fine-grained by @aayushg55 in #661
- Enabled FFT conv for host by @aayushg55 in #662
- NVPL BLAS Support by @aayushg55 in #665
- Change supported CUDA to 11.8 by @cliffburdick in #670
- enh: add macro to define cuda functions accessible at global scope by @mfzmullen in #668
- Add workaround for pre-11.8 CTK smem init errors by @tbensonatl in #673
- Fix to ConvCorr tests to skip host tests when host not enabled by @aayushg55 in #674
- Expanded Host BLAS support by @aayushg55 in #675
- Update README.md by @HugoPhibbs in #676
- Improved the error messages when sizes are incompatible by @cliffburdick in #682
- Added toeplitz operator by @cliffburdick in #683
- Simplified cmake file so no definitions are required by default by @cliffburdick in #684
- fix type for permuted ops in norm. by @luitjens in #696
- Fix c++20 warning by @cliffburdick in #698
- Update Cub Cache Creation to new Method by @tylera-nvidia in #694
- Fixed base operator types by @cliffburdick in #703
- Update slice.rst by @HugoPhibbs in #704
- Fixed issues with host compiler with C++17 and C++20 modes by @cliffburdick in #706
- NVPL LAPACK Solver Support on ARM by @aayushg55 in #701
- Add detail:: namespace to CUB struct by @cliffburdick in #708
- OpenBLAS LAPACK Solver Support for x86 by @aayushg55 in #709
- Exclude examples/cmake_sample_project/build* from doxygen search by @tmartin-gh in #711
- Fixed random pre/post run signature by @cliffburdick in #715
- Rapids cmake 24 06 package by @cliffburdick in #716
- Add support for UINT Generation by @tylera-nvidia in #695
- Update svd docstring by @cliffburdick in #717
- Solver SVD Optimizations and Improved cuSolver batching by @aayushg55 in #721
- MATX_EN_CUTENSOR / MATX_ENABLE_CUTENSOR Unified Variable by @tylera-nvidia in #720
- mtie should output the correct rank and size for the output operator. by @luitjens in #726
- Update bug_report.md by @HugoPhibbs in #729
- eliminate auto spills in permute by @luitjens in #731
- Revert accidental commit to main by @cliffburdick in #734
- Host Solver workspace query fix by @aayushg55 in #733
- Add in-place transform support for inv() by @tbensonatl in #736
- Allow access to Data() pointer from device by @tmartin-gh in #738
- Use cublasmatinvBatched() for N <= 32 by @tbensonatl in #739
- Added new pinv() operator and updated Reduced SVD by @aayushg55 in #740
- optimize our iterator to avoid an unnecessary constructor call by @luitjens in #741
- Updated Solver documentation by @aayushg55 in #742
- Updated documentation for CPU support by @aayushg55 in #743
- Slice optimizations to reduce spills by @cliffburdick in #732
- Fixing shadow declaration by @cliffburdick in #745
- Workaround for constexpr bug inside lambda in CUDA 11.8 by @cliffburdick in #671
- Added diag operator taking 1D operator to generate 2D operator by @cliffburdick in #746
- Add normcdf docs by @cliffburdick in #747
- Refactor template arguments to reductions to force no permutes when unnecessary by @cliffburdick in #749
- Adding workarounds for false positives on gcc14 by @cliffburdick in #751
- Visibility fix for cache static deinit issue by @nvjonwong in #752
- Don't allow in-place make_tensor to change ownership by @cliffburdick in #753
- Fix for erroneous errors on gcc14.1 by @cliffburdick in #755
- Create temp contiguous tensors if needed for sor...
v0.8.0
Release highlights:
- Features
- Updated cuTENSOR and cuTensorNet versions
- Added configurable print formatting
- ARM FFT support via NVPL
- New operators: abs2(), outer(), isnan(), isinf()
- Many more unit tests for CPU tests
- Bug fixes for matmul on Hopper, 2D FFTs, and more
Full changelist:
What's Changed
- Increase cublas workspace to 32 MiB for Hopper+ by @tbensonatl in #545
- matmul bug fixes. by @luitjens in #547
- Added missing synchronization by @luitjens in #552
- Refine some file I/O functions' doxygen comments by @AtomicVar in #549
- Update docs by @tmartin-gh in #551
- Export used environment variables in sphinx config by @tmartin-gh in #553
- Import os by @tmartin-gh in #554
- Add version info by @tmartin-gh in #555
- Fix typo by @tmartin-gh in #556
- Adds IsNan and IsInf Operators by @nvjonwong in #557
- Use cmake project version info in sphinx config by @tmartin-gh in #560
- outer() operator for outer product by @cliffburdick in #559
- Fix nans in QR and SVD. by @luitjens in #558
- Update CMakeLists.txt by @cliffburdick in #548
- Fix CMake to allow multiple rapids-cmake to coexist by @cliffburdick in #562
- Return 0D arrays for 0D shape in operators by @cliffburdick in #561
- Fix NVTX3 include path by @AtomicVar in #564
- Add .npy File I/O by @AtomicVar in #565
- SVD & QR improvements by @luitjens in #563
- chore: Fix typo s/whereever/wherever/ by @hugo-syn in #566
- Add rapids-cmake-dir, if defined, to CMAKE_MODULE_PATH by @tbensonatl in #567
- Add abs2() operator for squared abs() by @tbensonatl in #568
- Fixed issue on g++13 with nullptr dereference that cannot happen at r… by @cliffburdick in #571
- Force max(min) size of direct convolution dimension to be < 1024 by @cliffburdick in #573
- Remove incorrect warning check for any compiler other than gcc by @cliffburdick in #577
- stream memory cleanup by @cliffburdick in #579
- Update reshape indices by @cliffburdick in #580
- Update matlabpython.rst by @cliffburdick in #583
- Prevent potential oob read in matxOpTDKernel by @tbensonatl in #586
- Broadcast lower-rank tensors during batched matmul by @tbensonatl in #585
- Fix bugs in 2D FFTs and add tests by @benbarsdell in #587
- Added ARM FFT Support by @cliffburdick in #576
- Various bug fixes for older compilers by @cliffburdick in #588
- Renamed rmin/rmax functions to min/max and element-wise are now minimum/maximum to match Python by @cliffburdick in #589
- Fix clang macro by @cliffburdick in #592
- Fix misplaced sentence in README by @lucifer1004 in #594
- Add configurable print formatting types by @tmartin-gh in #593
- Fixing return types to allow either prvalue or lvalue in operator() by @cliffburdick in #598
- Rework einsum for new cache style. Fix for issue #597 by @tmartin-gh in #599
- Updated cutensornet to 24.03 and cutensor to 2.0.1 by @cliffburdick in #600
- adding file name and line number to ease debug by @bhaskarrakshit in #601
- Updating versions and notes for v0.8.0 by @cliffburdick in #602
New Contributors
- @hugo-syn made their first contribution in #566
- @benbarsdell made their first contribution in #587
- @lucifer1004 made their first contribution in #594
- @bhaskarrakshit made their first contribution in #601
Full Changelog: v0.7.0...v0.8.0
v0.7.0
Features
- Convert libcudacxx to CCCL by @cliffburdick in #501
- Add PreRun and tests for at/clone/diag operators by @tbensonatl in #502
- Add explicit FFT length to fft_conv example by @tbensonatl in #503
- Add Pre/PostRun support for collapse, concat ops by @tbensonatl in #506
- polyval operator by @cliffburdick in #508
- Optimize resample poly kernels by @tbensonatl in #512
- Allow negative indexing on slices by @cliffburdick in #516
- Automatically publish docs to GH Pages on merge to main by @tmartin-gh in #520
- Add configurable precision support of `print()` by @AtomicVar in #521
- Make matxHalf trivially copyable by @tbensonatl in #513
- Added operator for matvec by @cliffburdick in #514
- New rapids and nvbench by @cliffburdick in #529
Fixes
- Add FFT1D tensor size checks by @tbensonatl in #499
- Fix errors which caused some unit tests failed to compile. by @AtomicVar in #504
- Fix upsample output size by @cliffburdick in #507
- removing print characters accidentally left behind by @tylera-nvidia in #510
- Renamed host executor and prepared for multi-threaded additions by @cliffburdick in #511
- removing old hardcoded limit for repmat rank size by @tylera-nvidia in #515
- Avoid async alloc in some Cholesky decomp cases by @tbensonatl in #517
- Workaround for maybe_unused parse bug in old gcc by @tbensonatl in #522
- Fix matvec output dims to match A rather than B by @tbensonatl in #523
- Remove CUDA system include by @cliffburdick in #525
- Zero-initialize batches field in CUB params by @tbensonatl in #527
- Fixing host include guard on resample poly by @cliffburdick in #528
- Update device.h for host compiler by @cliffburdick in #530
- Made allocator an inline function by @cliffburdick in #532
- Build and publish documentation on merge to main by @tmartin-gh in #533
- Remove doxygen parameter to match tensor_t constructor signature by @tmartin-gh in #534
- Update iterator.h by @cliffburdick in #536
- Update Bug Report Issue Template by @AtomicVar in #539
- Fix CCCL libcudacxx path by @cliffburdick in #537
- Check matmul types and error at compile-time if the backend doesn't support them by @cliffburdick in #540
- Fix batched cov transform by @tbensonatl in #541
- Update caching for transforms to fixing all leaks reported by compute-sanitizer by @cliffburdick in #542
- Update docs for v0.7.0 by @cliffburdick in #544
Full Changelog: v0.6.0...v0.7.0
v0.6.0
Notable Updates
- Transforms as operators by @cliffburdick in #452
- resample_poly optimizations and operator support by @tbensonatl in #465
Full changelog below:
What's Changed
- Added upsample and downsample operators by @cliffburdick in #442
- Added lvalue semantics to operators that needed it by @cliffburdick in #443
- Added operator support to solver functions by @cliffburdick in #444
- Added shapeless version of diag() and eye() by @cliffburdick in #445
- Deprecated random interface by @cliffburdick in #446
- Updated cuTENSOR/cuTensorNet and added example for trace by @cliffburdick in #447
- Fixing host compilation where device code snuck in by @cliffburdick in #453
- Added Protections for Shift Operator inputs and fixed issues with size/Shape returns for certain input sizes by @tylera-nvidia in #454
- Added isclose and allclose functions by @cliffburdick in #448
- Adds normalization options for `fft` and `ifft` by @nvjonwong in #456
- Updated 0D tensor syntax and expanded simple radar pipeline by @cliffburdick in #458
- Add initial polyphase channelizer operator by @tbensonatl in #459
- Fixed inverse from stomping on input by @cliffburdick in #461
- Fix cache issue with strides by @cliffburdick in #460
- Added const to Pre/PostRun by @cliffburdick in #462
- Revert inv by @cliffburdick in #463
- Added proper LHS handling for transforms by @cliffburdick in #464
- Updated incorrect license by @cliffburdick in #466
- Use device mem instead of managed for fft workbuf by @tbensonatl in #467
- Added at() and percentile() operators by @cliffburdick in #471
- Add overlap operator by @cliffburdick in #472
- Support stride 0 A/B batches for GEMMs by @cliffburdick in #473
- Added FFT-based convolution to conv1d() by @cliffburdick in #475
- Documentation cleanup by @tmartin-gh in #477
- Adding FFT convolution benchmarks by @cliffburdick in #476
- Fixed rank of output in matmul operator when A/B had 0 stride by @cliffburdick in #478
- Updating header image by @cliffburdick in #480
- Add pwelch operator by @tmartin-gh in #479
- Docs cleanup. Enforce warning-as-error for doxygen and sphinx. by @tmartin-gh in #481
- Fixes for CUDA 12.3 compiler by @cliffburdick in #483
- Update pwelch.h by @cliffburdick in #486
- Fixes for new compiler issues by @cliffburdick in #488
- Fixing sample Cmake Project by @tylera-nvidia in #489
- Update base_operator.h by @cliffburdick in #490
- Add window operator input to pwelch by @tmartin-gh in #491
- Add PreRun methods for slice/fftshift operators by @tbensonatl in #493
- PreRun support for r2c and other fft related fixes by @tbensonatl in #494
New Contributors
- @tmartin-gh made their first contribution in #477
Full Changelog: v0.5.0...v0.6.0
v0.5.0
Notable Updates
- Documentation rewritten to include working examples for every function based on unit tests
- Polyphase resampler based on SciPy/cuSignal's `resample_poly`
Full changelog below:
What's Changed
- Modifies TensorViewToNumpy and NumpyToTensorView for rank = 5 by @nvjonwong in #427
- NumpyToTensorView overload which returns new TensorView by @nvjonwong in #428
- Added fftfreq() generator by @cliffburdick in #430
- Latest NumpyToTensorView function requires complex conversion for complex types by @nvjonwong in #431
- Fixed print function to work on device in certain cases by @cliffburdick in #436
- Fixed unused variable warning by @cliffburdick in #435
- Adding initial polyphase resampler transform by @tbensonatl in #437
- Revamped documentation by @cliffburdick in #438
- Fixing typo in Cholesky docs by @cliffburdick in #439
- Added broadcasting documentation by @cliffburdick in #440
- Broadcast docs by @cliffburdick in #441
New Contributors
- @nvjonwong made their first contribution in #427
Full Changelog: v0.4.1...v0.5.0
v0.4.1
This is a minor release mostly focused on bug fixes for different compilers and CUDA versions. One major feature was added: all reductions are now supported on the host using a single-threaded executor, with multi-threaded executor support coming soon.
What's Changed
- Host reductions by @cliffburdick in #385
- Reduced cuBLASLt workspace size by @cliffburdick in #404
- Fix benchmarks that broke with new executors by @cliffburdick in #405
- All operator tests converted to use host and device, and improved 16b by @cliffburdick in #403
- Add single argument copy() and copy() tests by @tbensonatl in #407
- Add rank0 tensor remap support by @tbensonatl in #408
- Add Mutex to support multithread NVTX markers by @tylera-nvidia in #406
- Fix a few issues highlighted by linters/clang by @tbensonatl in #409
- Fixed compilation for Pascal by @cliffburdick in #412
- Fixed issue with constructor when passing strides and sizes by @cliffburdick in #413
- CMake fixes found by user by @cliffburdick in #416
- Update libcudacxx to 2.1.0 by @cliffburdick in #417
- Fixed cupy check for unit tests, default constructors, and file IO by @cliffburdick in #419
- Added delta degrees of freedom on var() to mimic Python by @cliffburdick in #421
- Adding correct license on files that were wrong by @cliffburdick in #423
- Fixed two issues with release mode and DLPack and reductions on the host by @cliffburdick in #424
Full Changelog: v0.4.0...v0.4.1