-
Notifications
You must be signed in to change notification settings - Fork 34
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for CategoricalArrays.jl #132
Conversation
Is there a way to make this work without adding the 5 (direct and indirect) dependencies?
I think just using DataAPI would add support for CategoricalArrays/PooledArrays, etc. |
@juliohm. Sorry to intrude. You may want to remove LossFunctions as a dependency in |
I also considered other options, including this issue that I opened in CategoricalArrays.jl to see if a CategoricalValue could simply be a subtype of Number: JuliaData/CategoricalArrays.jl#268 I am divided in the issue. What is your point of view @joshday ? Should we push it forward the idea of
Nice. I was unaware of the DataAPI.jl effort 👍 Unfortunately it doesn't have the Meanwhile, I will finish this PR with some preliminary implementation, and then we can get back to this more profound issue on "data type dependencies" in the Julia ecosystem. It is a big issue. Thanks for pointing it out. |
Thank you @OkonSamuel , I will fix the docs. For some reason I've committed some unnecessary changes to the docs Project.toml + Manifest.toml files. |
Take a look at the |
Hi @joshday could you please clarify the usage of
@OkonSamuel could you please clarify what is happening? The docs build successfully in my local machine. Also I have the same setup in GeoStats.jl docs for example and never had an issue. |
@juliohm The package name is already an implicit dependency and there is no need to add it. Also the main cause of the doc build error is in the docs/Manifest.toml file. The Julia registry doesn't recognize version 0.5.2 and the given uuid causing an error. You can comfirm this by trying to get v0.5.2 of the LossFunctions package on your system. |
Thank you @OkonSamuel I get the issue now. The incorrect UUID I commited some time ago still causing issues. I will delete the Manifest.toml and commit a new one. It is usually a good idea to commit the Manifest.toml in the docs, and update it every now and then. That is what I've learned from other project maintainers. |
Checking in a Instead of checking in
@OkonSamuel It's okay to still keep - julia --project=docs/ -e 'using Pkg; Pkg.add(PackageSpec(path=pwd()))' |
I don't think |
The categorical case is a specific case as opposed to a general case @nalimilan. I was wondering why a categorical value could not be seen as a number in the sense that it is usually used as a number? The PR is fine as is with this On a more general view of the issue, I think we need to come up with a data-independent way of expressing methods. I learned about DataAPI.jl from @joshday for example, but I didn't quite get how the package can help with this. I am personally starting to think that ScientificTypes.jl is a possible solution where we describe methods with their scientific semantic as opposed to their machine type. However, to do this we need a set of traits to dispatch on. For example, in this case I think a trait like "numerical scalar" could be implemented as isnum(x) = x <: Union{Finite{N} where N, Infinite} where |
|
I tried removing the In the next recycling of LearnBase.jl + LossFunctions.jl, I will consider this data dependency issue more seriously. We need to open an issue on GitHub to track it more closely. |
My apologies! I was confused about how DataAPI works. A few questions though:
and then add the necessary methods for
? |
Codecov Report
@@ Coverage Diff @@
## master #132 +/- ##
=========================================
Coverage ? 93.88%
=========================================
Files ? 11
Lines ? 572
Branches ? 0
=========================================
Hits ? 537
Misses ? 35
Partials ? 0 Continue to review full report at Codecov.
|
Yes, this is correct. From what I understand, categorical values a la CategoricalArrays.jl need to be explicit in user code. So if I have a vector of string, I need to
Initially I tried to do this in a separate PR, and realized that these are not the same in the sense that
Yes, but it is a
and
And just to clarify, this is for any subtype of |
Why do we need to force someone to do that who already has their I see you're still adding commits, but I'll say I don't think adding CategoricalArrays as a dependency is acceptable for this. What about PooledArrays? What about Vector{String}? |
I think that is the standard in other parts of the ecosystem. For example, DataFrames.jl expects users to explicitly convert the columns with
I am also not happy with the fact that we need to add CategoricalArrays.jl as a dependency here. I'd like to think of this PR as an intermediate solution before we can keep refactoring the whole LearnBase.jl + LossFunctions.jl experience. I think many things can improve in the future, but I don't see a way out of this dependency currently without the traits in place, and other major changes in the codebase. My idea was to merge this PR as is, to at least have something working temporarily with categorical predictions produced by learning models. Afterwards, I plan to come back to LearnBase.jl again to incorporate those suggestions by @CarloLucibello , and then come back here again to fix other major issues. There are too many things to improve, but I think we need to go step-by-step. Do you think it is ok to merge this PR knowing this dependency on CategoricalArrays.jl can go away in the future? I am fully aligned with your vision @joshday, but refactoring the whole codebase again is not something quick to do, or something that I can afford at the moment given other deadlines on my shoulders. |
We never said that. You can just convert string columns to categorical if you want, mainly for two things: 1) use a custom order of levels, 2) improve performance of some operations like grouping. But e.g. StatsModels handles string columns just like categorical columns. You just need to call |
I think we have many steps until we get a version of On a similar argument, the dependency on If you are ok with this plan, I can go ahead and:
The step 1) is what is currently blocking my research. The steps 2-4) are what I want to work on with more time, and without these upcoming deadlines. |
I just don't get the motivation for adding it. Why go through multiple breaking changes (adding and then removing a dep) on a repo that usually doesn't have a lot of activity, just so you have a more limited version of
I think the concerns I've seen raised about LossFunctions could be handled in a single PR (basically your step 3). "Many steps" is tiring for maintainers and I don't think it's necessary. I'm glad that you're willing to put some work in to clean up JuliaML packages, but please find a workaround for your current research. |
Thank you @joshday for the feedback. Is there a solution in between maybe where this PR is adapted to drop the dependency on CategoricalArrays.jl? I think this is your major concern about this PR, right? I just don't see how we can drop the dependency on CategoricalArrays.jl without traits. If you have suggestions on how to do this, we could keep working here as opposed to starting it all over. Regarding the step (3) you mentioned, I also don't see how it addresses the dependency issue on CategoricalArrays.jl Could you please clarify? I see how it addresses other issues in the interface, but I can't see how these changes solve the dependency on data types.
The |
Rely on multiple dispatch: value(loss::MisclassLoss, target, output) = target == output
deriv(loss::MisclassLoss, target, output) = 0
deriv2(loss::MisclassLoss, target, output) = 0
for f in [:value, :deriv, :deriv2]
@eval function $f(loss::MisclassLoss, targets::AbstractArray, outputs::AbstractArray)
$f(loss, targets, outputs, AggMode.None())
end
end
value(loss::MisclassLoss, target::AbstractArray, output::AbstractArray, ::AggMode.None) = target .== output
value(loss::MisclassLoss, target::AbstractArray, output::AbstractArray, ::AggMode.Sum) = sum(target .== output)
value(loss::MisclassLoss, target::AbstractArray, output::AbstractArray, ::AggMode.Mean) = mean(target .== output) (and add the methods I've missed) |
Got it. My previous attempt consisted of removing the |
@joshday I've removed the dependency on CategoricalArrays.jl. Appreciate if you can take a look. |
Co-Authored-By: Josh Day <[email protected]>
This PR is work-in-progress. It attempts to generalize the functionality of the package to support
CategoricalArray
andCategoricalValue
from CategoricalArrays.jl. Fix #117.I will share a comment here when it is ready for review.