On the form of input data for the MLJ interface #5
Comments
Good point! Since Julia is column-major (for n-dimensional arrays), making sure the input data are column-major as well significantly improves the performance of the algorithms compared to a naive one-to-one port of the Python implementation, which is row-major. Lazy transpose ( Since MLJ assumes that "columns are features by default i.e. n x p" [src] the idea behind this model is also not easily compatible without running I guess there has to be a nicer and cleaner solution to this, but I haven't found it yet. Or is there some solution I'm missing? Also, thank you very much for such a quick and elaborate response! |
Yes, that is why I suggest internally using
# This algorithm to reverse each observation vector is generic and supposes observations
# are columns (uncompliant with MLJ convention).
uncompliant_algorithm(X::AbstractMatrix) =
    hcat([reverse(X[:, j]) for j in axes(X, 2)]...)
# This algorithm also reverses each observation and is generic but supposes observations are
# rows. There is no explicit copying, because `transpose(X)` is a view of `X`:
compliant_algorithm(X::AbstractMatrix) = uncompliant_algorithm(transpose(X)) |> transpose
using BenchmarkTools
Xraw = rand(100, 10000)
Xgood = transpose(Xraw)
Xbad = permutedims(Xraw)
# performance of original algorithm:
@btime uncompliant_algorithm($Xraw);
# 1.556 ms (10002 allocations: 8.62 MiB)
# compliant algorithm is degraded when using `Matrix`:
@btime compliant_algorithm($Xbad);
# 2.307 ms (10002 allocations: 8.62 MiB)
# but performance is recuperated if user uses `transpose(::Matrix)` type:
@btime compliant_algorithm($Xgood);
# 1.575 ms (10002 allocations: 8.62 MiB)
# In all three benchmarks, allocations are the same. |
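The benchmark above relies on `transpose(X)` wrapping `X` without copying, while `permutedims(X)` materializes a new `Matrix`. A minimal sketch to check this, using only Base and `LinearAlgebra` (the variable names here are illustrative and not taken from the thread):

```julia
using LinearAlgebra  # provides the `Transpose` wrapper type

X = rand(3, 4)

Xlazy = transpose(X)     # lazy wrapper sharing memory with X
Xcopy = permutedims(X)   # eager copy stored as a plain Matrix

Xlazy isa Transpose      # true
Xcopy isa Matrix         # true
parent(Xlazy) === X      # true: no data was copied

# Writing through the wrapper changes the original; the copy is unaffected.
Xlazy[2, 1] = -1.0
X[1, 2] == -1.0          # true
Xcopy[2, 1] == -1.0      # false

# Transposing a lazy transpose unwraps it, giving back the original array.
transpose(Xlazy) === X   # true
```

This is also why `compliant_algorithm(Xgood)` costs about the same as the original: the inner `transpose` call recovers `Xraw` itself, so the column slices stay contiguous in memory.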
I'd benchmark the degradation in your case. If it's really a concern, you could have your interface include something along these lines:
verbosity > 0 && @info "Input matrix `X` is a `Matrix`. Performance may be improved by providing "*
    "`transpose(permutedims(X))` instead. " |
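One way such a hint might be wired into an interface is sketched below. The helpers `is_lazy_transpose` and `hint_layout` are hypothetical names made up for this illustration; they are not part of MLJ or of this package, and a real interface would likely make the check inside its `fit` method.

```julia
using LinearAlgebra  # for the `Transpose` type

# Hypothetical helper: true when the n x p input is a lazily transposed
# matrix, i.e. each observation is stored contiguously in the parent array.
is_lazy_transpose(X::AbstractMatrix) = X isa Transpose

# Hypothetical helper: log the hint only for inputs that are not lazy
# transposes, and only when verbosity is positive.
# Returns `true` if a hint was logged, `false` otherwise.
function hint_layout(X::AbstractMatrix, verbosity::Integer)
    if verbosity > 0 && !is_lazy_transpose(X)
        msg = "Input matrix `X` is not a lazy transpose. Performance may be " *
              "improved by providing `transpose(permutedims(X))` instead."
        @info msg
        return true
    end
    return false
end

X = rand(5, 3)                               # 5 observations, 3 features
hint_layout(X, 1)                            # logs the hint, returns true
hint_layout(transpose(permutedims(X)), 1)    # silent, returns false
hint_layout(X, 0)                            # silent, returns false
```

Note that `transpose(permutedims(X))` has the same shape and entries as `X`; only the underlying memory layout differs.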
I implemented the standard MLJ API, so if someone wants, they can pass the data using This way no breaking changes are introduced and the interface should be, hopefully (?), compliant with other MLJ models and standards. These changes and behavior are now also correctly highlighted in model documentation (most notably in this commit 9ce6927). What is your opinion? |
Mmmm. It's not going to work to allow data to be passed in the form We could overload Giving the user the option to specify machine(model, X, y, :column) but I suggest Since An alternative is to make I don't think you can avoid making breaking changes to address MLJ requirements. BTW, standard practice in Julia development is to make your first release |
just FYI: https://github.com/xKDR/TSFrames.jl |
@antoninkriz Any update re an MLJ-compliant solution here? |
@ablaom Sorry, I was quite overwhelmed with university in the past few months. If I remember correctly, the code should be now compliant with MLJ (but I think I should rather check again), utilizing the What remains is yanking 1.0 and re-releasing under 0.1. And again, thanks for all your help! |
I'm starting to look over the MLJ interface. A fundamental issue is the current form of input data, which I understand should be in the form `(X_train, :column_based)` or `(X_train, :row_based)`, where `X` is what exactly?
Passing the row/column flag in this way is quite non-standard for MLJ models. In any case, it means the currently declared `input_scitype` does not match the data requirements.
You can pass metadata like this flag as a third argument, as in `machine(model, X, y, flag)`, but before going that route, can you say more about what data you can accept for `X` and why you need this distinction? Can we instead deal with the row-versus-column issue by having the user provide a lazy `transpose`, if her data does not conform to your preferred format?