[WIP] Non Negative Matrix Factorization (NMF) and NNLS #231
What's new?
Two new classes:

- `NMF` for Non Negative Matrix Factorization. This non-convex problem is NP-hard [3], but it is bi-convex. Solves:

  ```
  min_{H1, H2} 0.5 * ||Y - H1 @ H2.T||_F^2
  s.t. H1 >= 0, H2 >= 0
  ```

  Hence, it is solved with alternating minimization of two convex sub-problems: Non Negative Least Squares problems (`NNLS`).

- `NNLS` solves:

  ```
  min_H 0.5 * ||Y - W @ H.T||_F^2
  s.t. H >= 0
  ```
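As a toy illustration of this alternating scheme (plain NumPy/SciPy, not the code in this PR; `nnls_matrix` and `nmf_alternating` are hypothetical helper names):

```python
import numpy as np
from scipy.optimize import nnls

def nnls_matrix(W, Y):
    # Solve min_H 0.5 * ||Y - W @ H.T||_F^2  s.t. H >= 0, one column of Y at a time.
    return np.stack([nnls(W, Y[:, j])[0] for j in range(Y.shape[1])])

def nmf_alternating(Y, k, n_iter=50, seed=0):
    # Alternate between the two convex NNLS sub-problems of the bi-convex NMF objective.
    rng = np.random.default_rng(seed)
    H1 = np.abs(rng.standard_normal((Y.shape[0], k)))
    H2 = np.abs(rng.standard_normal((Y.shape[1], k)))
    for _ in range(n_iter):
        H2 = nnls_matrix(H1, Y)    # update H2 with H1 fixed
        H1 = nnls_matrix(H2, Y.T)  # update H1 with H2 fixed
    return H1, H2
```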
The implementation is based on the pseudo-code found in [1], which relies on ADMM [2].
[1] Huang, K., Sidiropoulos, N.D. and Liavas, A.P., 2016.
A flexible and efficient algorithmic framework for constrained matrix and tensor factorization.
IEEE Transactions on Signal Processing, 64(19), pp.5052-5065.
[2] Boyd, S., Parikh, N., Chu, E., Peleato, B. and Eckstein, J., 2010.
Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers.
Foundations and Trends in Machine Learning, 3(1), pp.1-122.
[3] Vavasis, S.A., 2010.
On the complexity of nonnegative matrix factorization.
SIAM Journal on Optimization, 20(3), pp.1364-1377.
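For context, a rough NumPy sketch of the ADMM splitting for one NNLS sub-problem looks like the following (illustrative only, not the PR's implementation; step-size tuning and stopping criteria are omitted):

```python
import numpy as np

def nnls_admm(W, Y, rho=1.0, n_iter=200):
    # Sketch: solve min_H 0.5 * ||Y - W @ H.T||_F^2  s.t. H >= 0 with ADMM.
    # W: (m, k), Y: (m, n); returns H: (n, k).
    m, k = W.shape
    n = Y.shape[1]
    H = np.zeros((n, k))           # constrained primal variable
    U = np.zeros((n, k))           # scaled dual variable
    G = W.T @ W + rho * np.eye(k)  # fixed system (its Cholesky factor could be cached)
    WtY = W.T @ Y                  # (k, n)
    for _ in range(n_iter):
        # 1) unconstrained least-squares step on the auxiliary variable Z
        Z = np.linalg.solve(G, WtY + rho * (H - U).T).T
        # 2) projection onto the nonnegative orthant
        H = np.maximum(0.0, Z + U)
        # 3) dual update
        U = U + Z - H
    return H
```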
Difference with Sklearn
Like Sklearn, this implementation is based on alternating minimization of the two sub-problems. However, Sklearn relies on Block Coordinate Descent, whereas this implementation is based on ADMM and provides dual variables.
At first, I thought those dual variables were needed for implicit differentiation (through the KKT conditions). I am just realizing there is another approach: the fixed point of a proximity operator! I don't know which one is faster or more stable.
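For reference, the fixed-point characterization I have in mind for the NNLS sub-problem is the projected-gradient one (same plain-text notation as above; `eta` is any positive step size):

```
H* = max(0, H* - eta * grad_f(H*)),   with grad_f(H) = (W @ H.T - Y).T @ W
```

Differentiating this equation implicitly should give the same derivatives as going through the KKT system, without needing the dual variables.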
Implementation choices
Since the problem is non-convex, the starting point is very important. It is a bit tricky to find a good initialization, so currently the implementation defaults to the ones of Sklearn.
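For reference, Sklearn's 'random' initialization (one of the schemes it provides) is roughly the following sketch; the exact scaling may differ slightly:

```python
import numpy as np

def random_init(Y, k, seed=0):
    # Roughly Sklearn-style 'random' NMF init: scaled absolute Gaussians.
    rng = np.random.default_rng(seed)
    avg = np.sqrt(Y.mean() / k)
    H1 = np.abs(avg * rng.standard_normal((Y.shape[0], k)))
    H2 = np.abs(avg * rng.standard_normal((Y.shape[1], k)))
    return H1, H2
```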
The `nnls_solver` part of the `NMF` class allows switching between different solvers for the NNLS sub-problem: `NMF` is more of a "meta" algorithm for matrix factorization. Note that the pseudo-code of [1] supports arbitrary products of tensors, not only the case `Y - U @ W.T`. I noticed that the heuristics provided by [1] for step size tuning differ from the ones of OSQP: in doubt, I proposed both.
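To make the "meta" aspect concrete, usage could look roughly like the sketch below (all names, arguments and return values are hypothetical, not the final API of this PR):

```python
# Hypothetical usage sketch; class and argument names are illustrative only.
nmf = NMF(n_components=10, nnls_solver="admm")  # another NNLS/QP solver could be plugged in here
H1, H2 = nmf.run(Y)                             # alternates NNLS solves on H1 and H2
```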
Why a separate class for NNLS?
NNLS is a special case of quadratic program (see the sketch after the references below). The `NNLS` class can be enriched with `l2` regularization, `l1` regularization, Huber fitting, masking, etc... (see the examples given on page 9 of [1]). With orthogonal constraints on `U`, the NMF becomes equivalent to K-means [4], which allows defining a differentiable K-means layer (in the spirit of [5]): we could outperform [6] for example. A separate class for NNLS facilitates the addition of these parameters.

[4] Ding, C., He, X. and Simon, H.D., 2005, April. On the equivalence of nonnegative matrix factorization and spectral clustering. In Proceedings of the 2005 SIAM International Conference on Data Mining (pp. 606-610). Society for Industrial and Applied Mathematics.
[5] Genevay, A., Dulac-Arnold, G. and Vert, J.P., 2019. Differentiable deep clustering with cluster size constraints. arXiv preprint arXiv:1910.09036.
[6] Cho, M., Vahid, K.A., Adya, S. and Rastegari, M., 2021. DKM: Differentiable K-Means Clustering Layer for Neural Network Compression. arXiv preprint arXiv:2108.12659.
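For the record, the reduction of NNLS to a quadratic program mentioned above is direct: writing `h` for one row of `H` and `y` for the corresponding column of `Y`,

```
min_h 0.5 * h.T @ Q @ h + c.T @ h    s.t. h >= 0
with Q = W.T @ W  and  c = -W.T @ y   (the constant 0.5 * ||y||^2 is dropped)
```

so any QP solver that handles nonnegativity constraints (e.g. OSQP) could be used as an alternative backend.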
TODO
- Add support for regularization (`l1`, `l2`, etc..)

Discussions are welcome, especially on the ADMM versus Coordinate Descent choice.