Merge branch 'release/v0.6.1'
AnesBenmerzoug committed Apr 13, 2023
2 parents f8e07cc + ce29e0c commit 0e929ae
Showing 31 changed files with 1,311 additions and 680 deletions.
2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
-current_version = 0.6.0
+current_version = 0.6.1
commit = False
tag = False
allow_dirty = False
42 changes: 31 additions & 11 deletions .github/workflows/publish.yaml
@@ -1,4 +1,4 @@
-name: Upload Python Package to PyPI
+name: Publish Python Package to PyPI

on:
push:
@@ -10,32 +10,52 @@ on:
description: Why did you trigger the pipeline?
required: False
default: Check if it runs again due to external changes
tag:
description: Tag for which a package should be published
type: string
required: false

env:
PY_COLORS: 1

jobs:
-deploy:
+publish:
runs-on: ubuntu-latest
concurrency:
-group: deploy
+group: publish
steps:
- uses: actions/checkout@v3
with:
fetch-depth: 0
- name: Fail if manually triggered workflow is not on 'master' branch
if: github.event_name == 'workflow_dispatch' && github.ref_name != 'master'
run: exit -1
- name: Fail if manually triggered workflow does not have 'tag' input
if: github.event_name == 'workflow_dispatch' && inputs.tag == ''
run: |
echo "Input 'tag' should not be empty"
exit -1
- name: Extract branch name from input
id: get_branch_name_input
if: github.event_name == 'workflow_dispatch'
run: |
export BRANCH_NAME=$(git log -1 --format='%D' ${{ inputs.tag }} | sed -e 's/.*origin\/\(.*\).*/\1/')
echo "branch_name=${BRANCH_NAME}" >> $GITHUB_OUTPUT
- name: Extract branch name from tag
-id: get_branch_name
+id: get_branch_name_tag
if: github.ref_type == 'tag'
run: |
-export BRANCH_NAME=$(git log -1 --format='%D' $GITHUB_REF | sed -e 's/.*origin\/\(.*\),.*/\1/')
-echo ::set-output name=branch_name::${BRANCH_NAME}
+export BRANCH_NAME=$(git log -1 --format='%D' $GITHUB_REF | sed -e 's/.*origin\/\(.*\).*/\1/')
+echo "branch_name=${BRANCH_NAME}" >> $GITHUB_OUTPUT
shell: bash
- name: Fail if tag is not on 'master' branch
-if: github.ref_type == 'tag' && steps.get_branch_name.outputs.branch_name != 'master'
-run: exit -1
+if: ${{ steps.get_branch_name_tag.outputs.branch_name != 'master' && steps.get_branch_name_input.outputs.branch_name != 'master' }}
+run: |
+echo "Tag is on branch ${{ steps.get_branch_name.outputs.branch_name }}"
+echo "Should be on Master branch instead"
+exit -1
- name: Fail if running locally
if: ${{ !github.event.act }} # skip during local actions testing
run: |
echo "Running action locally. Failing"
exit -1
- name: Set up Python 3.8
uses: actions/setup-python@v4
with:
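The branch checks in this workflow hinge on one text transformation: `git log -1 --format='%D'` prints a commit's decorating ref names, and the `sed` expression captures whatever follows `origin/`. A standalone sketch of just that pipe, using a made-up ref string instead of a real repository, so the sample output is an assumption about what `git log` would print:

```shell
# What `git log -1 --format='%D' v0.6.0` might print when the tagged
# commit is the tip of origin/master (sample string, no repo needed):
refs="tag: v0.6.0, origin/master"

# The same sed capture used in the workflow: keep what follows 'origin/'
branch=$(echo "$refs" | sed -e 's/.*origin\/\(.*\).*/\1/')
echo "$branch"   # prints "master"
```

Here `branch` comes out as `master`, so the workflow's "fail if not on 'master'" steps would pass for this sample.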
20 changes: 18 additions & 2 deletions CHANGELOG.md
@@ -1,5 +1,21 @@
# Changelog

## 0.6.1 - 🏗 Bug fixes and small improvements

- Fix parsing keyword arguments of `compute_semivalues` dispatch function
[PR #333](https://github.com/appliedAI-Initiative/pyDVL/pull/333)
- Create new `RayExecutor` class based on the `concurrent.futures` API,
use the new class to fix an issue with Truncated Monte Carlo Shapley
(TMCS) starting too many processes and dying, plus other small changes
[PR #329](https://github.com/appliedAI-Initiative/pyDVL/pull/329)
- Fix creation of `GroupedDataset` objects using the `from_arrays`
and `from_sklearn` class methods
[PR #334](https://github.com/appliedAI-Initiative/pyDVL/pull/334)
- Fix release job not triggering on CI when a new tag is pushed
[PR #331](https://github.com/appliedAI-Initiative/pyDVL/pull/331)
- Added alias `ApproShapley` from Castro et al. 2009 for permutation Shapley
[PR #332](https://github.com/appliedAI-Initiative/pyDVL/pull/332)

## 0.6.0 - 🆕 New algorithms, cleanup and bug fixes 🏗

- Fixes in `ValuationResult`: bugs around data names, semantics of
@@ -8,8 +24,8 @@
- **New method**: Implements generalised semi-values for data valuation,
including Data Banzhaf and Beta Shapley, with configurable sampling strategies
[PR #319](https://github.com/appliedAI-Initiative/pyDVL/pull/319)
-- Adds kwargs parameter to `from_array` and `from_sklearn`
-Dataset and GroupedDataset class methods
+- Adds kwargs parameter to `from_array` and `from_sklearn` Dataset and
+GroupedDataset class methods
[PR #316](https://github.com/appliedAI-Initiative/pyDVL/pull/316)
- PEP-561 conformance: added `py.typed`
[PR #307](https://github.com/appliedAI-Initiative/pyDVL/pull/307)
110 changes: 103 additions & 7 deletions CONTRIBUTING.md
@@ -261,6 +261,102 @@ sizeable amount of time, so care must be taken not to overdo it:
2. We try not to trigger CI pipelines when unnecessary (see [Skipping CI
runs](#skipping-ci-runs)).

### Running GitHub Actions locally

To run GitHub Actions locally we use [act](https://github.com/nektos/act).
It reads the workflows defined in `.github/workflows`, determines the set of
actions that need to be run, and uses the Docker API to pull or build the
images required by the workflow files. Finally, it determines the execution
path from the dependencies declared between jobs.

Once it has the execution path, it uses the Docker API to run containers
for each action based on the images prepared earlier. The
[environment variables](https://docs.github.com/en/actions/learn-github-actions/variables#default-environment-variables)
and [filesystem](https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#file-systems)
are all configured to match what GitHub provides.

You can install it manually using:

```shell
curl -s https://raw.githubusercontent.com/nektos/act/master/install.sh | sudo bash -s -- -d -b ~/bin
```

Then add it to your `PATH` variable: `PATH=~/bin:$PATH`

Refer to its official
[readme](https://github.com/nektos/act#installation-through-package-managers)
for more installation options.

#### Cheatsheet

```shell
# List all actions for all events:
act -l

# List the actions for a specific event:
act workflow_dispatch -l

# List the actions for a specific job:
act -j lint -l

# Run the default (`push`) event:
act

# Run a specific event:
act pull_request

# Run a specific job:
act -j lint

# Collect artifacts to the /tmp/artifacts folder:
act --artifact-server-path /tmp/artifacts

# Run a job in a specific workflow (useful if you have duplicate job names):
act -j lint -W .github/workflows/tox.yml

# Run in dry-run mode:
act -n

# Enable verbose logging (can be used with any of the above commands):
act -v
```

#### Example

To run the `publish` job (the toughest one to test) with tag `v0.6.0`,
you would use:

```shell
act push -j publish --eventpath events.json
```

With `events.json` containing:

```json
{
"ref": "refs/tags/v0.6.0"
}
```

To run it instead as if it had been manually triggered (i.e. via
`workflow_dispatch`), you would use:

```shell
act workflow_dispatch -j publish --eventpath events.json
```

With `events.json` containing:

```json
{
"inputs": {
"tag": "v0.6.0"
}
}
```

### Skipping CI runs

One sometimes would like to skip CI for certain commits (e.g. updating the
@@ -348,10 +444,10 @@ create a new release manually by following these steps:
8. Pour yourself a cup of coffee, you earned it! :coffee: :sparkles:
9. A package will be automatically created and published from CI to PyPI.

-### CI and requirements for releases
+### CI and requirements for publishing

-In order to release new versions of the package from the development branch, the
-CI pipeline requires the following secret variables set up:
+In order to publish new versions of the package from the development branch,
+the CI pipeline requires the following secret variables set up:

```
TEST_PYPI_USERNAME
TEST_PYPI_PASSWORD
PYPI_USERNAME
PYPI_PASSWORD
```

@@ -367,13 +463,13 @@
workflow to publish packages to [PyPI](https://pypi.org/) from `develop` after
a GitHub release.

-#### Release to TestPyPI
+#### Publish to TestPyPI

-We use [bump2version](https://pypi.org/project/bump2version/) to bump the build
-part of the version number, create a tag and push it from CI.
+We use [bump2version](https://pypi.org/project/bump2version/) to bump
+the build part of the version number and publish a package to TestPyPI from CI.
To do that, we use 2 different tox environments:
- **bump-dev-version**: Uses bump2version to bump the dev version,
without committing the new version or creating a corresponding git tag.
- **publish-test-package**: Builds and publishes a package to TestPyPI
53 changes: 29 additions & 24 deletions README.md
@@ -32,42 +32,47 @@ Data Valuation is the task of estimating the intrinsic value of a data point
wrt. the training set, the model and a scoring function. We currently implement
methods from the following papers:

-- Ghorbani, Amirata, and James Zou.
-[Data Shapley: Equitable Valuation of Data for Machine Learning](http://proceedings.mlr.press/v97/ghorbani19c.html).
-In International Conference on Machine Learning, 2242–51. PMLR, 2019.
+- Castro, Javier, Daniel Gómez, and Juan Tejada. [Polynomial Calculation of the
+Shapley Value Based on Sampling](https://doi.org/10.1016/j.cor.2008.04.004).
+Computers & Operations Research, Selected papers presented at the Tenth
+International Symposium on Locational Decisions (ISOLDE X), 36, no. 5 (May 1,
+2009): 1726–30.
+- Ghorbani, Amirata, and James Zou. [Data Shapley: Equitable Valuation of Data
+for Machine Learning](http://proceedings.mlr.press/v97/ghorbani19c.html). In
+International Conference on Machine Learning, 2242–51. PMLR, 2019.
- Wang, Tianhao, Yu Yang, and Ruoxi Jia.
-[Improving Cooperative Game Theory-Based Data Valuation via Data Utility Learning](https://doi.org/10.48550/arXiv.2107.06336).
-arXiv, 2022.
-- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li,
-Ce Zhang, Costas Spanos, and Dawn Song.
-[Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms](https://doi.org/10.14778/3342263.3342637).
+[Improving Cooperative Game Theory-Based Data Valuation via Data Utility
+Learning](https://doi.org/10.48550/arXiv.2107.06336). arXiv, 2022.
+- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo
+Li, Ce Zhang, Costas Spanos, and Dawn Song. [Efficient Task-Specific Data
+Valuation for Nearest Neighbor Algorithms](https://doi.org/10.14778/3342263.3342637).
Proceedings of the VLDB Endowment 12, no. 11 (1 July 2019): 1610–23.
-- Okhrati, Ramin, and Aldo Lipani.
-[A Multilinear Sampling Algorithm to Estimate Shapley Values](https://doi.org/10.1109/ICPR48806.2021.9412511).
-In 25th International Conference on Pattern Recognition (ICPR 2020), 7992–99.
-IEEE, 2021.
-- Yan, T., & Procaccia, A. D.
-[If You Like Shapley Then You’ll Love the Core]().
-Proceedings of the AAAI Conference on Artificial Intelligence, 35(6) (2021): 5751-5759.
+- Okhrati, Ramin, and Aldo Lipani. [A Multilinear Sampling Algorithm to Estimate
+Shapley Values](https://doi.org/10.1109/ICPR48806.2021.9412511). In 25th
+International Conference on Pattern Recognition (ICPR 2020), 7992–99. IEEE,
+2021.
+- Yan, T., & Procaccia, A. D. [If You Like Shapley Then You’ll Love the
+Core](https://ojs.aaai.org/index.php/AAAI/article/view/16721). Proceedings of
+the AAAI Conference on Artificial Intelligence, 35(6) (2021): 5751-5759.
- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve
-Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos.
-[Towards Efficient Data Valuation Based on the Shapley Value](http://proceedings.mlr.press/v89/jia19a.html).
+Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos. [Towards Efficient
+Data Valuation Based on the Shapley Value](http://proceedings.mlr.press/v89/jia19a.html).
In 22nd International Conference on Artificial Intelligence and Statistics,
1167–76. PMLR, 2019.
-- Wang, Jiachen T., and Ruoxi Jia.
-[Data Banzhaf: A Robust Data Valuation Framework for Machine Learning](https://doi.org/10.48550/arXiv.2205.15466).
+- Wang, Jiachen T., and Ruoxi Jia. [Data Banzhaf: A Robust Data Valuation
+Framework for Machine Learning](https://doi.org/10.48550/arXiv.2205.15466).
arXiv, October 22, 2022.
-- Kwon, Yongchan, and James Zou.
-[Beta Shapley: A Unified and Noise-Reduced Data Valuation Framework for Machine Learning](http://arxiv.org/abs/2110.14049).
+- Kwon, Yongchan, and James Zou. [Beta Shapley: A Unified and Noise-Reduced Data
+Valuation Framework for Machine Learning](http://arxiv.org/abs/2110.14049).
In Proceedings of the 25th International Conference on Artificial Intelligence
and Statistics (AISTATS) 2022, Vol. 151. Valencia, Spain: PMLR, 2022.

Influence Functions compute the effect that single points have on an estimator /
model. We implement methods from the following papers:

-- Koh, Pang Wei, and Percy Liang.
-[Understanding Black-Box Predictions via Influence Functions](http://proceedings.mlr.press/v70/koh17a.html).
-In Proceedings of the 34th International Conference on Machine Learning,
+- Koh, Pang Wei, and Percy Liang. [Understanding Black-Box Predictions via
+Influence Functions](http://proceedings.mlr.press/v70/koh17a.html). In
+Proceedings of the 34th International Conference on Machine Learning,
70:1885–94. Sydney, Australia: PMLR, 2017.

# Installation
25 changes: 15 additions & 10 deletions docs/30-data-valuation.rst
@@ -314,9 +314,8 @@ values in pyDVL. First construct the dataset and utility, then call
u=utility, mode="owen", n_iterations=4, max_q=200
)
-There are more details on Owen
-sampling, and its variant *Antithetic Owen Sampling* in the documentation for the
-function doing the work behind the scenes:
+There are more details on Owen sampling, and its variant *Antithetic Owen
+Sampling* in the documentation for the function doing the work behind the scenes:
:func:`~pydvl.value.shapley.montecarlo.owen_sampling_shapley`.

Note that in this case we do not pass a
@@ -327,20 +326,26 @@ integration.
Permutation Shapley
^^^^^^^^^^^^^^^^^^^

-An equivalent way of computing Shapley values appears often in the literature.
-It uses permutations over indices instead of subsets:
+An equivalent way of computing Shapley values (``ApproShapley``) appeared in
+:footcite:t:`castro_polynomial_2009` and is the basis for the method most often
+used in practice. It uses permutations over indices instead of subsets:

$$
v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)}
[u(\sigma_{:i} \cup \{i\}) − u(\sigma_{:i})]
,$$

where $\sigma_{:i}$ denotes the set of indices in permutation sigma before the
-position where $i$ appears. To approximate this sum (with $\mathcal{O}(n!)$ terms!)
-one uses Monte Carlo sampling of permutations, something which has surprisingly
-low sample complexity. By adding early stopping, the result is the so-called
-**Truncated Monte Carlo Shapley** (:footcite:t:`ghorbani_data_2019`), which is
-efficient enough to be useful in some applications.
+position where $i$ appears. To approximate this sum (which has $\mathcal{O}(n!)$
+terms!) one uses Monte Carlo sampling of permutations, something which has
+surprisingly low sample complexity. One notable difference wrt. the
+combinatorial approach above is that the approximations always fulfill the
+efficiency axiom of Shapley, namely $\sum_{i=1}^n \hat{v}_i = u(D)$ (see
+:footcite:t:`castro_polynomial_2009`, Proposition 3.2).
+
+By adding early stopping, the result is the so-called **Truncated Monte Carlo
+Shapley** (:footcite:t:`ghorbani_data_2019`), which is efficient enough to be
+useful in applications.
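The permutation estimator with truncation can be sketched in a few lines. This is an illustration only, not pyDVL's actual implementation: the names `permutation_shapley`, `utility`, and the tolerance `atol` are all made up for this sketch, with `utility` standing in for a generic set function $u$.

```python
import random

def permutation_shapley(n, utility, n_permutations, atol=None):
    """Monte Carlo permutation Shapley with optional truncation (TMCS).

    `utility` maps a tuple of indices to a real number. If `atol` is given,
    a permutation is cut short once the utility of the current prefix is
    within `atol` of the total utility, treating the remaining marginal
    contributions as negligible.
    """
    values = [0.0] * n
    total = utility(tuple(range(n)))
    for _ in range(n_permutations):
        sigma = random.sample(range(n), n)  # a uniformly random permutation
        prefix, u_prev = [], utility(())
        for i in sigma:
            prefix.append(i)
            u_curr = utility(tuple(prefix))
            values[i] += u_curr - u_prev  # marginal contribution of i
            u_prev = u_curr
            if atol is not None and abs(total - u_curr) < atol:
                break  # truncate: remaining marginals assumed negligible
    return [v / n_permutations for v in values]

# For an additive utility each marginal contribution is exact, so the
# estimates match the true values regardless of the sampled permutations.
weights = [1.0, 2.0, 3.0]
u = lambda s: sum(weights[i] for i in s)
estimates = permutation_shapley(len(weights), u, n_permutations=100)
```

For this additive `u`, `estimates` recovers `weights` up to floating point, and `sum(estimates)` equals `u((0, 1, 2))`, matching the efficiency property quoted above (with $u(\emptyset) = 0$).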

.. code-block:: python