- Refactor Classwise Shapley valuation with the interfaces and sampler architecture PR #616
- Refactor KNN Shapley values with the new sampler architecture PR #610
- Refactor MSR Banzhaf semivalues with the new sampler architecture PR #605
- Refactor group-testing Shapley values with the new sampler architecture PR #602
- Refactor least-core data valuation methods with more supported sampling methods and a consistent interface PR #580
- Refactor Owen Shapley valuation with the new sampler architecture PR #597
- New method `InverseHarmonicMeanInfluence`, an implementation of the paper DataInf: Efficiently Estimating Data Influence in LoRA-tuned LLMs and Diffusion Models PR #582
- Add new backend implementations for influence computation to account for block-diagonal approximations PR #582
- Extend `DirectInfluence` with block-diagonal and Gauss-Newton approximation PR #591
- Extend `LissaInfluence` with block-diagonal and Gauss-Newton approximation PR #593
- Extend `NystroemSketchInfluence` with block-diagonal and Gauss-Newton approximation PR #596
- Extend `ArnoldiInfluence` with block-diagonal and Gauss-Newton approximation PR #598
- Extend `CgInfluence` with block-diagonal and Gauss-Newton approximation PR #601
- Replace `np.float_` with `np.float64` and `np.alltrue` with `np.all`, as the old aliases are removed in NumPy 2.0 PR #604
- Fix a bug in `pydvl.utils.numeric.random_subset` where `1 - q` was used instead of `q` as the probability of an element being sampled PR #597
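  The corrected semantics can be pictured with a short NumPy sketch. This is an illustration only, not pyDVL's implementation, and `random_subset_sketch` is a hypothetical name: each element is kept independently with probability `q`, whereas the bug effectively inverted this to `1 - q`.

  ```python
  from typing import Optional

  import numpy as np

  def random_subset_sketch(s: np.ndarray, q: float, seed: Optional[int] = None) -> np.ndarray:
      """Illustrative only: keep each element of `s` independently with probability `q`."""
      rng = np.random.default_rng(seed)
      mask = rng.uniform(size=len(s)) < q  # True with probability q for each element
      return s[mask]

  # With q=0.8 most elements are kept; with q=0.2 most are dropped.
  print(random_subset_sketch(np.arange(10), q=0.8, seed=42))
  ```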
- Fix a bug in the calculation of variance estimates for MSR Banzhaf PR #605
- Fix a bug in KNN Shapley values. See Issue 613 for details.
- Use tighter bounds for the calculation of the minimal sample size that guarantees an epsilon-delta approximation in group testing (Jia et al. 2023) PR #602
- Breaking Changes
  - Rename parameter `hessian_regularization` of `DirectInfluence` to `regularization` and change the type annotation to allow for block-wise regularization parameters PR #591
  - Rename parameter `hessian_regularization` of `LissaInfluence` to `regularization` and change the type annotation to allow for block-wise regularization parameters PR #593
  - Remove parameter `h0` from init of `LissaInfluence` PR #593
  - Rename parameter `hessian_regularization` of `NystroemSketchInfluence` to `regularization` and change the type annotation to allow for block-wise regularization parameters PR #596
  - Rename parameters of `ArnoldiInfluence`: `hessian_regularization` -> `regularization` (modify type annotation), `rank_estimate` -> `rank` PR #598
  - Remove obsolete functions `lanczos_low_rank_hessian_approximation` and `model_hessian_low_rank` from `influence.torch.functional` PR #598
  - Rename parameters of `CgInfluence`: `hessian_regularization` -> `regularization` (modify type annotation), `pre_conditioner` -> `preconditioner`, `use_block_cg` -> `solve_simultaneously` PR #601
  - Remove parameter `x0` from `CgInfluence` PR #601
  - Rename module `influence.torch.pre_conditioner` -> `influence.torch.preconditioner` PR #601
  - Refactor preconditioner: rename `PreConditioner` -> `Preconditioner` and fit it to `TensorOperator` PR #601
- Add progress bars to the computation of `LazyChunkSequence` and `NestedLazyChunkSequence` PR #567
- Add a device fixture for `pytest`, which depending on the availability and user input (`pytest --with-cuda`) resolves to a CUDA device PR #574
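  A minimal sketch of how such a conftest fixture can be wired up follows. It is illustrative only and assumes PyTorch; the flag name `--with-cuda` comes from the entry above, everything else is an assumed shape rather than pyDVL's actual code.

  ```python
  # conftest.py (sketch)
  import pytest
  import torch

  def pytest_addoption(parser):
      # Register the command line flag mentioned above.
      parser.addoption(
          "--with-cuda",
          action="store_true",
          default=False,
          help="Run tests on a CUDA device if one is available.",
      )

  @pytest.fixture(scope="session")
  def device(request) -> torch.device:
      # Resolve to CUDA only when requested *and* available, otherwise fall back to CPU.
      if request.config.getoption("--with-cuda") and torch.cuda.is_available():
          return torch.device("cuda")
      return torch.device("cpu")
  ```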
- Fixed logging issue in decorator `log_duration` PR #567
- Fixed missing move of tensors to model device in `EkfacInfluence` implementation PR #570
- Missing move to device of `preconditioner` in `CgInfluence` implementation PR #572
- Raise a more specific error message when a `RuntimeError` occurs in `torch.linalg.eigh`, so the user can check if it is related to a known issue PR #578
- Fix an edge case (empty train data) in the test `test_classwise_scorer_accuracies_manual_derivation`, which resulted in undefined behavior (`np.nan` to `int` conversion with different results depending on OS) PR #579
- Changed logging behavior of iterative methods `LissaInfluence` and `CgInfluence` to warn when the desired tolerance is not achieved within `maxiter`; add parameter `warn_on_max_iteration` to set the level for this information to `logging.DEBUG` PR #567
- `FutureWarning` for `ParallelConfig` was constantly raised without actually instantiating the object PR #562
- New method MSR Banzhaf with accompanying notebook, and new stopping criterion `RankCorrelation` PR #520
- New method: `NystroemSketchInfluence` PR #504
- New preconditioned block variant of conjugate gradient PR #507
- Improvements to documentation: fixes, links, text, example gallery, LFS and more PR #532, PR #543
- Glossary of data valuation and influence terms in the documentation PR #537
- Documentation about writing notes for new features, changes or deprecations PR #557
- Bug in `LissaInfluence` when not using CPU device PR #495
- Memory issue with `CgInfluence` and `ArnoldiInfluence` PR #498
- Raise a specific error message with install instructions when trying to load `pydvl.utils.cache.memcached` without `pymemcache` installed. If `pymemcache` is available, all symbols from `pydvl.utils.cache.memcached` are available through `pydvl.utils.cache` PR #509
- Add property `model_dtype` to instances of type `TorchInfluenceFunctionModel`
- Bump versions of CI actions to avoid warnings PR #502
- Add Python Version 3.11 to supported versions PR #510
- Documentation improvements and cleanup PR #521, PR #522
- Simplified parallel backend configuration PR #549
- Implement new method: `EkfacInfluence` PR #451
- New notebook to showcase EKFAC for LLMs PR #483
- Implemented exact games in Castro et al. 2009 and 2017 PR #341
- Bug in using `DaskInfluenceCalculator` with `TorchNumpyConverter` for single-dimensional arrays PR #485
- Fix implementations of the `to` method in `TorchInfluenceFunctionModel` implementations PR #487
- Fixed bug with checking for converged values in semivalues PR #341
- Add applications of data valuation section, display examples more prominently, make all sections visible in table of contents, use mkdocs material cards in the home page PR #492
- New cache backends: InMemoryCacheBackend and DiskCacheBackend PR #458
- New influence function interface `InfluenceFunctionModel`
- Data parallel computation with `DaskInfluenceCalculator` PR #26
- Sequential batch-wise computation and write to disk with `SequentialInfluenceCalculator` PR #377
- Adapt notebooks to new influence abstractions PR #430
- Refactor and simplify caching implementation PR #458
- Simplify display of computation progress PR #466
- Improve readme and explain better the examples PR #465
- Simplify and improve tests, add CodeCov code coverage PR #429
- Breaking Changes
  - Removed `compute_influences` and all related code. Replaced by the new `InfluenceFunctionModel` interface. Removed modules:
    - `influence.general`
    - `influence.inversion`
    - `influence.twice_differentiable`
    - `influence.torch.torch_differentiable`
- Import bug in README PR #457
- New method: Class-wise Shapley values PR #338
- New method: Data-OOB by @BastienZim PR #426, PR #431
- Added `AntitheticPermutationSampler` PR #439
- Faster semi-value computation with per-index check of stopping criteria (optional) PR #437
- Fix initialization of `data_names` in `ValuationResult.zeros()` PR #443
- No longer using docker within tests to start a memcached server PR #444
- Using pytest-xdist for faster local tests PR #440
- Improvements and fixes to notebooks PR #436
- Refactoring of parallel module. Old imports will stop working in v0.9.0 PR #421
This is our first β release! We have worked hard to deliver improvements across
the board, with a focus on documentation and usability. We have also reworked
the internals of the `influence` module, improved parallelism and handling of
randomness.
- Implemented solving the Hessian equation via spectral low-rank approximation PR #365
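  The entry above refers to solving the regularized system (H + μI)x = b with a low-rank spectral approximation of the Hessian. A small NumPy sketch of the idea (illustrative only, not pyDVL's implementation): with the top-`rank` eigenpairs H ≈ V diag(λ) Vᵀ, the system is solved exactly on the captured subspace, while the orthogonal remainder is handled by the regularizer alone.

  ```python
  import numpy as np

  def low_rank_solve_sketch(H: np.ndarray, b: np.ndarray, rank: int, mu: float) -> np.ndarray:
      """Approximately solve (H + mu*I) x = b using the top-`rank` eigenpairs of H."""
      eigvals, eigvecs = np.linalg.eigh(H)    # eigenvalues in ascending order
      V, lam = eigvecs[:, -rank:], eigvals[-rank:]
      b_proj = V.T @ b                        # component inside the low-rank subspace
      x_low = V @ (b_proj / (lam + mu))       # exact solve on that subspace
      x_rest = (b - V @ b_proj) / mu          # remainder handled by regularization only
      return x_low + x_rest

  rng = np.random.default_rng(0)
  A = rng.standard_normal((50, 50))
  H = A @ A.T / 50                            # symmetric PSD stand-in for a Hessian
  x = low_rank_solve_sketch(H, rng.standard_normal(50), rank=10, mu=0.1)
  ```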
- Enabled parallel computation for Leave-One-Out values PR #406
- Added more abbreviations to documentation PR #415
- Added seed to functions from `pydvl.utils.numeric`, `pydvl.value.shapley` and `pydvl.value.semivalues`. Introduced new type `Seed` and conversion function `ensure_seed_sequence` (a short sketch follows below). PR #396
- Added `batch_size` parameter to `compute_banzhaf_semivalues`, `compute_beta_shapley_semivalues`, `compute_shapley_semivalues` and `compute_generic_semivalues`. PR #428
- Added classwise Shapley as proposed by (Schoch et al. 2022) https://arxiv.org/abs/2211.06800 PR #338
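The seed-conversion helper mentioned a few entries above can be pictured roughly as follows. This is a hypothetical sketch built on `numpy.random.SeedSequence`, not pyDVL's actual implementation, and the `Seed` alias here is likewise illustrative.

```python
from typing import Optional, Union

import numpy as np

# Hypothetical alias mirroring the idea of a flexible seed type.
Seed = Union[int, np.random.SeedSequence, np.random.Generator]

def ensure_seed_sequence_sketch(seed: Optional[Seed] = None) -> np.random.SeedSequence:
    """Normalize any accepted seed into a SeedSequence (illustrative only)."""
    if isinstance(seed, np.random.SeedSequence):
        return seed
    if isinstance(seed, np.random.Generator):
        # One possible convention: derive fresh entropy from the generator.
        return np.random.SeedSequence(int(seed.integers(0, 2**32)))
    return np.random.SeedSequence(seed)  # handles int and None

# Spawn independent child seeds, e.g. one per parallel worker.
children = ensure_seed_sequence_sketch(42).spawn(4)
```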
- Replaced sphinx with mkdocs for documentation. Major overhaul of documentation PR #352
- Made ray an optional dependency, relying on joblib as default parallel backend PR #408
- Decoupled `ray.init` from `ParallelConfig` PR #373
- Breaking Changes
  - Signature change: return information about Hessian inversion from `compute_influence_factors` PR #375
  - Major changes to IF interface and functionality. Foundation for a framework abstraction for IF computation. PR #278 PR #394
  - Renamed `semivalues` to `compute_generic_semivalues` PR #413
  - New `joblib` backend as default instead of ray. Simplify `MapReduceJob`. PR #355
  - Bump torch dependency for influence package to 2.0 PR #365
- Fixes to parallel computation of generic semi-values: properly handle all samplers and stopping criteria, irrespective of parallel backend. PR #372
- Optimises memory usage in IF calculation PR #375
- Fix adding valuation results with overlapping indices and different lengths PR #370
- Fixed bugs in conjugate gradient and `linear_solve` PR #358
- Fix installation of dev requirements for Python 3.10 PR #382
- Improvements to IF documentation PR #371
- Fix parsing keyword arguments of `compute_semivalues` dispatch function PR #333
- Create new `RayExecutor` class based on the `concurrent.futures` API, use the new class to fix an issue with Truncated Monte Carlo Shapley (TMCS) starting too many processes and dying, plus other small changes PR #329
- Fix creation of `GroupedDataset` objects using the `from_arrays` and `from_sklearn` class methods PR #324
- Fix release job not triggering on CI when a new tag is pushed PR #331
- Added alias `ApproShapley` from Castro et al. 2009 for permutation Shapley PR #332
- Fixes in `ValuationResult`: bugs around data names, semantics of `empty()`, new method `zeros()` and normalised random values PR #327
- New method: Implements generalised semi-values for data valuation, including Data Banzhaf and Beta Shapley, with configurable sampling strategies PR #319
- Adds `kwargs` parameter to the `from_arrays` and `from_sklearn` `Dataset` and `GroupedDataset` class methods PR #316
- PEP-561 conformance: added `py.typed` PR #307
- Removed default non-negativity constraint on least core subsidy and added instead a `non_negative_subsidy` boolean flag. Renamed `options` to `solver_options` and pass it as dict. Change default least-core solver to SCS with 10000 max_iters. PR #304
- Cleanup: removed unnecessary decorator `@unpackable` PR #233
- Stopping criteria: fixed problem with `StandardError` and enable proper composition of index convergence statuses. Fixed a bug with `n_jobs` in `truncated_montecarlo_shapley`. PR #300 and PR #305
- Shuffling code around to allow for simpler user imports, some cleanup and documentation fixes. PR #284
- Bug fix: Warn instead of raising an error when `n_iterations` is less than the size of the dataset in Monte Carlo Least Core PR #281
- Fixed parallel and antithetic Owen sampling for Shapley values. Simplified and extended tests. PR #267
- Added `Scorer` class for a cleaner interface. Fixed minor bugs around Group-Testing Shapley, added more tests and switched to cvxpy for the solver. PR #264
- Generalised stopping criteria for valuation algorithms. Improved classes `ValuationResult` and `Status` with more operations. Some minor issues fixed. PR #252
- Fixed a bug whereby `compute_shapley_values` would only spawn one process when using `n_jobs=-1` and Monte Carlo methods. PR #270
- Bugfix in `RayParallelBackend`: wrong semantics for `kwargs`. PR #268
- Splitting of problem preparation and solution in Least-Core computation. Umbrella function for LC methods. PR #257
- Operations on `ValuationResult` and `Status` and some cleanup PR #248
- Bug fix and minor improvements: Fixes bug in TMCS with remote Ray cluster, raises an error for dummy sequential parallel backend with TMCS, clones model inside `Utility` before fitting by default, with flag `clone_before_fit` to disable it, catches all warnings in `Utility` when `show_warnings` is `False`. Adds Miner and Gloves toy games utilities PR #247
- GH action to mark issues as stale PR #201
- Disabled caching of Utility values as well as repeated evaluations by default PR #211
- Test and officially support Python version 3.9 and 3.10 PR #208
- Breaking change: Introduces a class ValuationResult to gather and inspect results from all valuation algorithms PR #214
- Fixes bug in Influence calculation with multidimensional input and adds new example notebook PR #195
- Breaking change: Passes the input to `MapReduceJob` at initialization, removes `chunkify_inputs` argument from `MapReduceJob`, removes `n_runs` argument from `MapReduceJob`, calls the parallel backend's `put()` method for each generated chunk in `_chunkify()`, renames ParallelConfig's `num_workers` attribute to `n_local_workers`, fixes a bug in `MapReduceJob`'s chunkification when `n_runs` >= `n_jobs`, and defines a sequential parallel backend to run all jobs in the current thread PR #232
- New method: Implements exact and Monte Carlo Least Core for data valuation, adds `from_arrays()` class method to the `Dataset` and `GroupedDataset` classes, adds `extra_values` argument to `ValuationResult`, adds `compute_removal_score()` and `compute_random_removal_score()` helper functions PR #237
- New method: Group Testing Shapley for valuation, from Jia et al. 2019 PR #240
- Fixes bug in ray initialization in `RayParallelBackend` class PR #239
- Implements "Egalitarian Least Core", adds cvxpy as a dependency and uses it instead of scipy as optimizer PR #243
- Simplified and fixed powerset sampling and testing PR #181
- Simplified and fixed publishing to PyPI from CI PR #183
- Fixed bug in release script and updated contributing docs. PR #184
- Added Pull Request template PR #185
- Modified Pull Request template to automatically link PR to issue PR #186
- First implementation of Owen Sampling, squashed scores, better testing PR #194
- Improved documentation on caching, Shapley, caveats of values, bibtex PR #194
- Breaking change: Rearranging of modules to accommodate for new methods PR #194
Mostly API documentation and notebooks, plus some bugfixes.
In PR #161:
- Support for $$ math in sphinx docs.
- Usage of sphinx extension for external links (introducing new directives like `:gh:`, `:issue:` and `:tfl:` to construct standardised links to external resources).
- Only update auto-generated documentation files if there are changes. Some minor additions to `update_docs.py`.
- Parallelization of exact combinatorial Shapley.
- Integrated KNN Shapley into the main interface `compute_shapley_values`.
In PR #161:
- Improved main docs and Shapley notebooks. Added or fixed many docstrings, readme and documentation for contributors. Typos, grammar and style in code, documentation and notebooks.
- Internal renaming and rearranging in the parallelization and caching modules.
- Bug in random matrix generation PR #161.
- Bugs in MapReduceJob's `_chunkify` and `_backpressure` methods PR #176.
This is the very first release of pyDVL.
It contains:
- Data Valuation Methods:
  - Leave-One-Out
  - Influence Functions
  - Shapley:
    - Exact Permutation and Combinatorial
    - Montecarlo Permutation and Combinatorial
    - Truncated Montecarlo Permutation
- Caching of results with Memcached
- Parallelization of computations with Ray
- Documentation
- Notebooks containing examples of different use cases