Skip to content
This repository was archived by the owner on May 23, 2022. It is now read-only.

Refactor ObsDim out of LearnBase and update related interfaces #44

Merged
merged 7 commits into from
Jul 27, 2021

Conversation

darsnack
Copy link
Member

@darsnack darsnack commented Oct 29, 2020

This is in response to #41 and #42. Based on discussions, I removed ObsDim from LearnBase. Correspondingly, I propose we update the interfaces that relied on it to use a obsdim keyword argument instead. Here's a summary of what that means:

  • getobs(x, idx, ::ObsDim.Undefined) => getobs(x, idx, obsdim::Nothing): LearnBase.jl now contains the default routing that maps getobs(x, idx, ::Nothing) to getobs(x, idx). A developer implementing the interface can still choose to only implement getobs(x, idx) if the observation dimension is nonsensical.
  • getobs(x, idx; obsdim = default_obsdim(x)): This is the keyword version of getobs. By default, default_obsdim(x) = nothing, and for arrays default_obsdim(x) = ndims(x) (in line with popular packages like Flux.jl). Most importantly, the docs will make it clear that the non-keyword argument functions are the ones to implement. This keyword argument version is for end-user convenience.
  • getobs(x, obsdim): This is no longer going to be supported. It conflicts for dispatch with getobs(x, idx), and the routing logic required to make multiple dispatch work is not worth it in my opinion. I believe this function and the corresponding getobs(x) existed for cases where fetching all observations at once is more efficient than fetching all observations one at a time. I think we can still support this optimization by dispatching on getobs(x, 1:nobs(x)) since 1:nobs(x) <: UnitRange.
  • getobs(x): Same as above.

Additionally, nobs is now restricted to StatsBase. For example, in MLDataPattern.jl, I implement StatsBase.nobs instead of LearnBase.nobs.

I think that covers the bulk of the changes. I've done some crude REPL regression testing to make sure that these changes don't cause performance penalties for common use cases like data stored in arrays.
cc @juliohm @Evizero @oxinabox @racinmat

@darsnack
Copy link
Member Author

I also think the docs on this topic could use some love (which I am willing to do). Currently, they are in hosted MLDataPattern.jl. Would it make sense to create some docs for LearnBase.jl and add a forward pointer to the section in the MLDataPattern.jl docs? Just so that someone looking to implement this interface can find the correct descriptions.

@juliohm
Copy link
Member

juliohm commented Oct 29, 2020

Lovely @darsnack. Can you confirm that the code changes are mainly moving code from other.jl to observation.jl?

I agree with you 💯 percent that we need more love in the docs here #38 In my opinion we should be concentrating the documentation efforts in LearnBase.jl as opposed to downstream packages.

@juliohm juliohm requested a review from Evizero October 29, 2020 13:29
@juliohm
Copy link
Member

juliohm commented Oct 29, 2020

Also, I would like to ask the permission of other JuliaML members to add @darsnack to the organization? He is putting a lot of effort to improve the ML ecosystem in Julia, and it would be nice to give him access to infrastructure setups, etc.

@darsnack
Copy link
Member Author

Yeah the code changes are removing the ObsDim module and type, copying getobs to observation.jl (mostly this), and adding the default routing when obsdim::Nothing to getobs(x, idx).

@johnnychen94
Copy link
Member

johnnychen94 commented Oct 29, 2020

getobs(x, idx, ::ObsDim.Undefined) => getobs(x, idx, obsdim::Nothing): LearnBase.jl now contains the default routing that maps getobs(x, idx, ::Nothing) to getobs(x, idx)

Since ObsDim is removed, the dispatching-based API is useless and this default routing can be removed.

Also, I believe things will be clearer if we consistently make obsdim a keyword, as most of the other functions (cat, eachslice) in Base do. And it seems like obsdim is a better choice to dims.

@darsnack
Copy link
Member Author

darsnack commented Oct 29, 2020

What about the case where someone wants to implement getobs for their data container, and observation dimensions don't make sense for that data container? Currently, the developer only needs to define getobs(x::MyContainer, idx) and the default routing will ensure that if someone calls getobs(x::MyContainer, idx; obsdim = default_obsdim(x)) still routes to the getobs(x, idx) defined by the developer of MyContainer.

If we drop support for this, then the developer needs to implement getobs(x, idx, obsdim) and ignore the obsdim argument. That's not the end of the world, just making clear the trade-off. Personally, I agree that there should only be getobs(x, idx; obsdim) as an interface function, but I was trying to maintain some of the old functionality. Happy to make the switch though.

Also, a non-keyword based function is required to support the above behavior, since we can't dispatch on the type of keyword arguments.

@johnnychen94
Copy link
Member

johnnychen94 commented Oct 29, 2020

What about the case where someone wants to implement getobs for their data container, and observation dimensions don't make sense for that data container?

As long as getindex(x::MyContainer, ind...) is well defined, this won't be an issue. ObsDim is definitely a concept in the array world and I don't think there will be a non-trivial non-array data container type in practice.

Also, the core logic in getobs is defined in
https://github.com/JuliaML/MLDataPattern.jl/blob/ac4227ebb8ddefb87d480c6bd80f784c7971aac6/src/container.jl#L130-L144 can be simplified a lot without using generated function.

@darsnack
Copy link
Member Author

As long as getindex(x::MyContainer, ind...) is well defined, this won't be an issue. ObsDim is definitely a concept in the array world and I don't think there will be a non-trivial non-array data container type in practice.

I do see your point. So would the change be to make the interface getobs(x, idx; obsdim) and default to getindex(x, idx) when the method is undefined?

Also, the core logic in getobs is defined in
https://github.com/JuliaML/MLDataPattern.jl/blob/ac4227ebb8ddefb87d480c6bd80f784c7971aac6/src/container.jl#L130-L144 can be simplified a lot without using generated function.

Agreed. I created JuliaML/MLDataPattern.jl#50 which removes a lot of the excess code.

@johnnychen94
Copy link
Member

johnnychen94 commented Oct 29, 2020

So would the change be to make the interface getobs(x, idx; obsdim) and default to getindex(x, idx) when the method is undefined?

Yes, and all the complicated dispatch routes get simplified to getobs(x, idx; obsdim=default_obsdim(x)).

@darsnack
Copy link
Member Author

Just a note on that change. If we use keyword arguments as the interface, then we can't dispatch on the type of obsdim, and we need to dispatch on LearnBase.getobs(tup::Tuple, indices, obsdims::Tuple). This can be fixed by doing special routing for the Tuple case, but it might be cleaner to support dispatch on obsdim. It adds flexibility, and it can be easily supported with a single line of routing logic:

LearnBase.getobs(data, idx; obsdim = default_obsdim(data)) = getobs(data, idx, obsdim)

Copy link
Member

@johnnychen94 johnnychen94 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead, we could dispatch on getobs(tup::Tuple, indices; obsdims::Tuple=default_obsdim(tup)) = map((x, d)->getobs(x, indices, d), tup, obsdims)

Copy link
Member

@Evizero Evizero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have nothing of value to add here. You guys are doing a great job as far as I am concerned

@Evizero
Copy link
Member

Evizero commented Nov 18, 2020

Also, I would like to ask the permission of other JuliaML members to add darsnack to the organization? He is putting a lot of effort to improve the ML ecosystem in Julia, and it would be nice to give him access to infrastructure setups, etc.

@juliohm As far as I am concerned I trust your judgment in these matters. Also I just noticed your were a "member" of the org so I upgraded your membership status as you are as much an "owner" as I have ever been. Thank you for all that you do

@juliohm
Copy link
Member

juliohm commented Nov 18, 2020

Thank you @Evizero, this organization wouldn't even exist without your work. I appreciate it. I will go ahead and add @darsnack as a member of the organization.

@darsnack
Copy link
Member Author

Just an update, I have ported the changes from this PR to MLLabelUtils. MLDataPattern is in progress. I have finished refactoring the portions of MLDataPattern that don't involve the targets. Now that MLLabelUtils is ready, I can refactor the targets.

Hoping to get this done this week so I can stop having it hang around the back of my brain.

@johnnychen94
Copy link
Member

Although not participated, I'm so glad to see the enormous progress on the Flux side.

@CarloLucibello
Copy link
Member

What is missing here instead? Changes can be propagated with no hurry in downstream packages once we merge this PR and tag a major release

@juliohm
Copy link
Member

juliohm commented Mar 23, 2021

I've merged #45 to migrate from Travis/AppVeyor to GitHub CI. Thanks @CarloLucibello for the update.

I will try to close and reopen this PR to see if the GitHub Actions are triggered now.

@juliohm juliohm closed this Mar 23, 2021
@juliohm juliohm reopened this Mar 23, 2021
@darsnack
Copy link
Member Author

Think I will need to rebase for CI to trigger correctly?

@CarloLucibello no changes left, but since this is just an interface, we can't test whether it will work without refactoring the downstream packages.

@darsnack darsnack force-pushed the darsnack/rm-obsdim branch from c952f4b to 64b73b7 Compare March 23, 2021 13:45
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ObsDim is not exported and out of sync with MLLabelUtils?
5 participants