Merge branch 'release/v0.6.1'
AnesBenmerzoug committed Apr 13, 2023
2 parents f8e07cc + ce29e0c commit 0e929ae
Showing 31 changed files with 1,311 additions and 680 deletions.
2 changes: 1 addition & 1 deletion .bumpversion.cfg
@@ -1,5 +1,5 @@
[bumpversion]
-current_version = 0.6.0
+current_version = 0.6.1
commit = False
tag = False
allow_dirty = False
42 changes: 31 additions & 11 deletions .github/workflows/publish.yaml
@@ -1,4 +1,4 @@
-name: Upload Python Package to PyPI
+name: Publish Python Package to PyPI

on:
push:
@@ -10,32 +10,52 @@ on:
description: Why did you trigger the pipeline?
required: False
default: Check if it runs again due to external changes
tag:
description: Tag for which a package should be published
type: string
required: false

env:
PY_COLORS: 1

jobs:
-deploy:
+publish:
runs-on: ubuntu-latest
concurrency:
-group: deploy
+group: publish
steps:
- uses: actions/checkout@v3
with:
fetch-depth: 0
- name: Fail if manually triggered workflow is not on 'master' branch
if: github.event_name == 'workflow_dispatch' && github.ref_name != 'master'
run: exit -1
- name: Fail if manually triggered workflow does not have 'tag' input
if: github.event_name == 'workflow_dispatch' && inputs.tag == ''
run: |
echo "Input 'tag' should not be empty"
exit -1
- name: Extract branch name from input
id: get_branch_name_input
if: github.event_name == 'workflow_dispatch'
run: |
export BRANCH_NAME=$(git log -1 --format='%D' ${{ inputs.tag }} | sed -e 's/.*origin\/\(.*\).*/\1/')
echo "branch_name=${BRANCH_NAME}" >> $GITHUB_OUTPUT
- name: Extract branch name from tag
-id: get_branch_name
+id: get_branch_name_tag
if: github.ref_type == 'tag'
run: |
-export BRANCH_NAME=$(git log -1 --format='%D' $GITHUB_REF | sed -e 's/.*origin\/\(.*\),.*/\1/')
-echo ::set-output name=branch_name::${BRANCH_NAME}
+export BRANCH_NAME=$(git log -1 --format='%D' $GITHUB_REF | sed -e 's/.*origin\/\(.*\).*/\1/')
+echo "branch_name=${BRANCH_NAME}" >> $GITHUB_OUTPUT
shell: bash
- name: Fail if tag is not on 'master' branch
-if: github.ref_type == 'tag' && steps.get_branch_name.outputs.branch_name != 'master'
-run: exit -1
+if: ${{ steps.get_branch_name_tag.outputs.branch_name != 'master' && steps.get_branch_name_input.outputs.branch_name != 'master' }}
+run: |
+echo "Tag is on branch ${{ steps.get_branch_name.outputs.branch_name }}"
+echo "Should be on Master branch instead"
+exit -1
- name: Fail if running locally
if: ${{ !github.event.act }} # skip during local actions testing
run: |
echo "Running action locally. Failing"
exit -1
- name: Set up Python 3.8
uses: actions/setup-python@v4
with:
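The branch checks in this workflow hinge on one text transformation: `git log -1 --format='%D'` prints a commit's decorating ref names, and the `sed` expression captures whatever follows `origin/`. A standalone sketch of just that pipe, using a made-up ref string instead of a real repository, so the sample output is an assumption about what `git log` would print:

```shell
# What `git log -1 --format='%D' v0.6.0` might print when the tagged
# commit is the tip of origin/master (sample string, no repo needed):
refs="tag: v0.6.0, origin/master"

# The same sed capture used in the workflow: keep what follows 'origin/'
branch=$(echo "$refs" | sed -e 's/.*origin\/\(.*\).*/\1/')
echo "$branch"   # prints "master"
```

Here `branch` comes out as `master`, so the workflow's "fail if not on 'master'" steps would pass for this sample.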
20 changes: 18 additions & 2 deletions CHANGELOG.md
@@ -1,5 +1,21 @@
# Changelog

## 0.6.1 - 🏗 Bug fixes and small improvements

- Fix parsing keyword arguments of `compute_semivalues` dispatch function
[PR #333](https://github.com/appliedAI-Initiative/pyDVL/pull/333)
- Create new `RayExecutor` class based on the `concurrent.futures` API,
use the new class to fix an issue with Truncated Monte Carlo Shapley
(TMCS) starting too many processes and dying, plus other small changes
[PR #329](https://github.com/appliedAI-Initiative/pyDVL/pull/329)
- Fix creation of `GroupedDataset` objects using the `from_arrays`
and `from_sklearn` class methods
[PR #334](https://github.com/appliedAI-Initiative/pyDVL/pull/334)
- Fix release job not triggering on CI when a new tag is pushed
[PR #331](https://github.com/appliedAI-Initiative/pyDVL/pull/331)
- Added alias `ApproShapley` from Castro et al. 2009 for permutation Shapley
[PR #332](https://github.com/appliedAI-Initiative/pyDVL/pull/332)

## 0.6.0 - 🆕 New algorithms, cleanup and bug fixes 🏗

- Fixes in `ValuationResult`: bugs around data names, semantics of
@@ -8,8 +24,8 @@
- **New method**: Implements generalised semi-values for data valuation,
including Data Banzhaf and Beta Shapley, with configurable sampling strategies
[PR #319](https://github.com/appliedAI-Initiative/pyDVL/pull/319)
-- Adds kwargs parameter to `from_array` and `from_sklearn`
-Dataset and GroupedDataset class methods
+- Adds kwargs parameter to `from_array` and `from_sklearn` Dataset and
+GroupedDataset class methods
[PR #316](https://github.com/appliedAI-Initiative/pyDVL/pull/316)
- PEP-561 conformance: added `py.typed`
[PR #307](https://github.com/appliedAI-Initiative/pyDVL/pull/307)
110 changes: 103 additions & 7 deletions CONTRIBUTING.md
@@ -261,6 +261,102 @@ sizeable amount of time, so care must be taken not to overdo it:
2. We try not to trigger CI pipelines when unnecessary (see [Skipping CI
runs](#skipping-ci-runs)).

### Running GitHub Actions locally

To run GitHub Actions locally we use [act](https://github.com/nektos/act).
It reads the workflows defined in `.github/workflows`, determines the set of
actions that need to be run, and uses the Docker API to pull or build the
images required by the workflow files. Finally, it determines the execution
path from the dependencies declared between jobs.

Once it has the execution path, it uses the Docker API to run containers
for each action based on the images prepared earlier. The
[environment variables](https://docs.github.com/en/actions/learn-github-actions/variables#default-environment-variables)
and [filesystem](https://docs.github.com/en/actions/using-github-hosted-runners/about-github-hosted-runners#file-systems)
are all configured to match what GitHub provides.

You can install it manually using:

```shell
curl -s https://raw.githubusercontent.com/nektos/act/master/install.sh | sudo bash -s -- -d -b ~/bin
```

Then add it to your `PATH` variable: `PATH=~/bin:$PATH`

Refer to its official
[readme](https://github.com/nektos/act#installation-through-package-managers)
for more installation options.

#### Cheatsheet

```shell
# List all actions for all events:
act -l

# List the actions for a specific event:
act workflow_dispatch -l

# List the actions for a specific job:
act -j lint -l

# Run the default (`push`) event:
act

# Run a specific event:
act pull_request

# Run a specific job:
act -j lint

# Collect artifacts to the /tmp/artifacts folder:
act --artifact-server-path /tmp/artifacts

# Run a job in a specific workflow (useful if you have duplicate job names):
act -j lint -W .github/workflows/tox.yml

# Run in dry-run mode:
act -n

# Enable verbose logging (can be used with any of the above commands):
act -v
```

#### Example

To run the `publish` job (the toughest one to test) with tag `v0.6.0`,
you would use:

```shell
act push -j publish --eventpath events.json
```

With `events.json` containing:

```json
{
"ref": "refs/tags/v0.6.0"
}
```

To run it instead as if it had been manually triggered (i.e. via
`workflow_dispatch`), you would use:

```shell
act workflow_dispatch -j publish --eventpath events.json
```

With `events.json` containing:

```json
{
"inputs": {
"tag": "v0.6.0"
}
}
```

### Skipping CI runs

One sometimes would like to skip CI for certain commits (e.g. updating the
@@ -348,10 +444,10 @@ create a new release manually by following these steps:
8. Pour yourself a cup of coffee, you earned it! :coffee: :sparkles:
9. A package will be automatically created and published from CI to PyPI.

-### CI and requirements for releases
+### CI and requirements for publishing

-In order to release new versions of the package from the development branch, the
-CI pipeline requires the following secret variables set up:
+In order to publish new versions of the package from the development branch,
+the CI pipeline requires the following secret variables set up:

```
TEST_PYPI_USERNAME
TEST_PYPI_PASSWORD
PYPI_USERNAME
PYPI_PASSWORD
```

@@ -367,13 +463,13 @@
workflow to publish packages to [PyPI](https://pypi.org/) from `develop` after
a GitHub release.

-#### Release to TestPyPI
+#### Publish to TestPyPI

-We use [bump2version](https://pypi.org/project/bump2version/) to bump the build
-part of the version number, create a tag and push it from CI.
+We use [bump2version](https://pypi.org/project/bump2version/) to bump
+the build part of the version number and publish a package to TestPyPI from CI.
To do that, we use 2 different tox environments:
- **bump-dev-version**: Uses bump2version to bump the dev version,
without committing the new version or creating a corresponding git tag.
- **publish-test-package**: Builds and publishes a package to TestPyPI
53 changes: 29 additions & 24 deletions README.md
@@ -32,42 +32,47 @@ Data Valuation is the task of estimating the intrinsic value of a data point
wrt. the training set, the model and a scoring function. We currently implement
methods from the following papers:

-- Ghorbani, Amirata, and James Zou.
-[Data Shapley: Equitable Valuation of Data for Machine Learning](http://proceedings.mlr.press/v97/ghorbani19c.html).
-In International Conference on Machine Learning, 2242–51. PMLR, 2019.
+- Castro, Javier, Daniel Gómez, and Juan Tejada. [Polynomial Calculation of the
+Shapley Value Based on Sampling](https://doi.org/10.1016/j.cor.2008.04.004).
+Computers & Operations Research, Selected papers presented at the Tenth
+International Symposium on Locational Decisions (ISOLDE X), 36, no. 5 (May 1,
+2009): 1726–30.
+- Ghorbani, Amirata, and James Zou. [Data Shapley: Equitable Valuation of Data
+for Machine Learning](http://proceedings.mlr.press/v97/ghorbani19c.html). In
+International Conference on Machine Learning, 2242–51. PMLR, 2019.
- Wang, Tianhao, Yu Yang, and Ruoxi Jia.
-[Improving Cooperative Game Theory-Based Data Valuation via Data Utility Learning](https://doi.org/10.48550/arXiv.2107.06336).
-arXiv, 2022.
-- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo Li,
-Ce Zhang, Costas Spanos, and Dawn Song.
-[Efficient Task-Specific Data Valuation for Nearest Neighbor Algorithms](https://doi.org/10.14778/3342263.3342637).
+[Improving Cooperative Game Theory-Based Data Valuation via Data Utility
+Learning](https://doi.org/10.48550/arXiv.2107.06336). arXiv, 2022.
+- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nezihe Merve Gurel, Bo
+Li, Ce Zhang, Costas Spanos, and Dawn Song. [Efficient Task-Specific Data
+Valuation for Nearest Neighbor Algorithms](https://doi.org/10.14778/3342263.3342637).
Proceedings of the VLDB Endowment 12, no. 11 (1 July 2019): 1610–23.
-- Okhrati, Ramin, and Aldo Lipani.
-[A Multilinear Sampling Algorithm to Estimate Shapley Values](https://doi.org/10.1109/ICPR48806.2021.9412511).
-In 25th International Conference on Pattern Recognition (ICPR 2020), 7992–99.
-IEEE, 2021.
-- Yan, T., & Procaccia, A. D.
-[If You Like Shapley Then You’ll Love the Core]().
-Proceedings of the AAAI Conference on Artificial Intelligence, 35(6) (2021): 5751-5759.
+- Okhrati, Ramin, and Aldo Lipani. [A Multilinear Sampling Algorithm to Estimate
+Shapley Values](https://doi.org/10.1109/ICPR48806.2021.9412511). In 25th
+International Conference on Pattern Recognition (ICPR 2020), 7992–99. IEEE,
+2021.
+- Yan, T., & Procaccia, A. D. [If You Like Shapley Then You’ll Love the
+Core](https://ojs.aaai.org/index.php/AAAI/article/view/16721). Proceedings of
+the AAAI Conference on Artificial Intelligence, 35(6) (2021): 5751-5759.
- Jia, Ruoxi, David Dao, Boxin Wang, Frances Ann Hubis, Nick Hynes, Nezihe Merve
-Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos.
-[Towards Efficient Data Valuation Based on the Shapley Value](http://proceedings.mlr.press/v89/jia19a.html).
+Gürel, Bo Li, Ce Zhang, Dawn Song, and Costas J. Spanos. [Towards Efficient
+Data Valuation Based on the Shapley Value](http://proceedings.mlr.press/v89/jia19a.html).
In 22nd International Conference on Artificial Intelligence and Statistics,
1167–76. PMLR, 2019.
-- Wang, Jiachen T., and Ruoxi Jia.
-[Data Banzhaf: A Robust Data Valuation Framework for Machine Learning](https://doi.org/10.48550/arXiv.2205.15466).
+- Wang, Jiachen T., and Ruoxi Jia. [Data Banzhaf: A Robust Data Valuation
+Framework for Machine Learning](https://doi.org/10.48550/arXiv.2205.15466).
arXiv, October 22, 2022.
-- Kwon, Yongchan, and James Zou.
-[Beta Shapley: A Unified and Noise-Reduced Data Valuation Framework for Machine Learning](http://arxiv.org/abs/2110.14049).
+- Kwon, Yongchan, and James Zou. [Beta Shapley: A Unified and Noise-Reduced Data
+Valuation Framework for Machine Learning](http://arxiv.org/abs/2110.14049).
In Proceedings of the 25th International Conference on Artificial Intelligence
and Statistics (AISTATS) 2022, Vol. 151. Valencia, Spain: PMLR, 2022.

Influence Functions compute the effect that single points have on an estimator /
model. We implement methods from the following papers:

-- Koh, Pang Wei, and Percy Liang.
-[Understanding Black-Box Predictions via Influence Functions](http://proceedings.mlr.press/v70/koh17a.html).
-In Proceedings of the 34th International Conference on Machine Learning,
+- Koh, Pang Wei, and Percy Liang. [Understanding Black-Box Predictions via
+Influence Functions](http://proceedings.mlr.press/v70/koh17a.html). In
+Proceedings of the 34th International Conference on Machine Learning,
70:1885–94. Sydney, Australia: PMLR, 2017.

# Installation
25 changes: 15 additions & 10 deletions docs/30-data-valuation.rst
@@ -314,9 +314,8 @@ values in pyDVL. First construct the dataset and utility, then call
u=utility, mode="owen", n_iterations=4, max_q=200
)
-There are more details on Owen
-sampling, and its variant *Antithetic Owen Sampling* in the documentation for the
-function doing the work behind the scenes:
+There are more details on Owen sampling, and its variant *Antithetic Owen
+Sampling* in the documentation for the function doing the work behind the scenes:
:func:`~pydvl.value.shapley.montecarlo.owen_sampling_shapley`.

Note that in this case we do not pass a
@@ -327,20 +326,26 @@ integration.
Permutation Shapley
^^^^^^^^^^^^^^^^^^^

-An equivalent way of computing Shapley values appears often in the literature.
-It uses permutations over indices instead of subsets:
+An equivalent way of computing Shapley values (``ApproShapley``) appeared in
+:footcite:t:`castro_polynomial_2009` and is the basis for the method most often
+used in practice. It uses permutations over indices instead of subsets:

$$
v_u(x_i) = \frac{1}{n!} \sum_{\sigma \in \Pi(n)}
[u(\sigma_{:i} \cup \{i\}) − u(\sigma_{:i})]
,$$

where $\sigma_{:i}$ denotes the set of indices in permutation sigma before the
-position where $i$ appears. To approximate this sum (with $\mathcal{O}(n!)$ terms!)
-one uses Monte Carlo sampling of permutations, something which has surprisingly
-low sample complexity. By adding early stopping, the result is the so-called
-**Truncated Monte Carlo Shapley** (:footcite:t:`ghorbani_data_2019`), which is
-efficient enough to be useful in some applications.
+position where $i$ appears. To approximate this sum (which has $\mathcal{O}(n!)$
+terms!) one uses Monte Carlo sampling of permutations, something which has
+surprisingly low sample complexity. One notable difference wrt. the
+combinatorial approach above is that the approximations always fulfill the
+efficiency axiom of Shapley, namely $\sum_{i=1}^n \hat{v}_i = u(D)$ (see
+:footcite:t:`castro_polynomial_2009`, Proposition 3.2).
+
+By adding early stopping, the result is the so-called **Truncated Monte Carlo
+Shapley** (:footcite:t:`ghorbani_data_2019`), which is efficient enough to be
+useful in applications.
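The permutation estimator with truncation can be sketched in a few lines. This is an illustration only, not pyDVL's actual implementation: the names `permutation_shapley`, `utility`, and the tolerance `atol` are all made up for this sketch, with `utility` standing in for a generic set function $u$.

```python
import random

def permutation_shapley(n, utility, n_permutations, atol=None):
    """Monte Carlo permutation Shapley with optional truncation (TMCS).

    `utility` maps a tuple of indices to a real number. If `atol` is given,
    a permutation is cut short once the utility of the current prefix is
    within `atol` of the total utility, treating the remaining marginal
    contributions as negligible.
    """
    values = [0.0] * n
    total = utility(tuple(range(n)))
    for _ in range(n_permutations):
        sigma = random.sample(range(n), n)  # a uniformly random permutation
        prefix, u_prev = [], utility(())
        for i in sigma:
            prefix.append(i)
            u_curr = utility(tuple(prefix))
            values[i] += u_curr - u_prev  # marginal contribution of i
            u_prev = u_curr
            if atol is not None and abs(total - u_curr) < atol:
                break  # truncate: remaining marginals assumed negligible
    return [v / n_permutations for v in values]

# For an additive utility each marginal contribution is exact, so the
# estimates match the true values regardless of the sampled permutations.
weights = [1.0, 2.0, 3.0]
u = lambda s: sum(weights[i] for i in s)
estimates = permutation_shapley(len(weights), u, n_permutations=100)
```

For this additive `u`, `estimates` recovers `weights` up to floating point, and `sum(estimates)` equals `u((0, 1, 2))`, matching the efficiency property quoted above (with $u(\emptyset) = 0$).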

.. code-block:: python