Inconsistent results with R package mdlp #12
Hi Ashton! Thanks for your comments. Hmm, it seems the gist I posted is quite old. It looks like I made that script when I was trying to get the package into scikit-learn, but as of a few months ago that effort has been discontinued. #7 seems to have introduced certain breaking changes, which have caused a unit test in this package to fail. (I need to get Travis working for this project sometime.) With all this said, I'll certainly look into your issue when I have the time.
Thanks @hlin117. I am not sure if this is a related error, but I am also failing the unit test (in fact, the output discretized values are floats instead of integers). As a nasty hack, I ended up using rpy2 to simply call the R mdlp function. Perhaps when I am done with my current project I can help take a look. It's too bad this wasn't accepted by scikit-learn; IMO this would be a useful discretization method.
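For reference, a rough sketch of that rpy2 workaround might look like the following (assuming R and the R `discretization` package are installed; the example data frame here is made up, and the last-column-is-label layout follows that package's convention):

```python
# Rough sketch: calling R's mdlp (from the `discretization` package) via rpy2.
# Assumes R and the `discretization` package are installed; untested as-is.
import pandas as pd
from rpy2.robjects import pandas2ri
from rpy2.robjects.packages import importr

pandas2ri.activate()  # enable automatic pandas <-> R data.frame conversion
discretization = importr("discretization")

# mdlp expects a data frame whose last column is the class label.
df = pd.DataFrame({
    "x1": [2.3, 1.1, 4.5, 3.3, 5.0, 0.4],
    "label": [0, 0, 1, 1, 1, 0],
})
result = discretization.mdlp(df)

# The R function returns a list with $cutp (cut points per attribute)
# and $Disc.data (the discretized data frame).
print(result.rx2("cutp"))
print(result.rx2("Disc.data"))
```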
The inconsistency is caused by a difference in implementation. The R code is here:

```r
for (i in 1:(n-1)) {
  if (x[i+1] != x[i]) {  # they check x, not y
    ct <- (x[i] + x[i+1]) / 2
    wx <- which(x <= ct)
    wn <- length(wx) / n
    e1 <- wn * ent(y[wx])
    e2 <- (1 - wn) * ent(y[-wx])
    val <- e1 + e2
    if (val < entropy) {
      entropy <- val
      ci <- i
    }
  }
}
```

Yours:

```cython
for ind in range(start + 1, end):
    # I choose not to use a `min` function here for this optimization.
    if y[ind-1] == y[ind]:  # you check y instead of x
        continue
    # Find the partition entropy, and check if this entropy is minimum
    first_half = (<float> (ind - start)) / length \
        * slice_entropy(y, start, ind)[0]
    second_half = (<float> (end - ind)) / length \
        * slice_entropy(y, ind, end)[0]
    curr_entropy = first_half + second_half
    if prev_entropy > curr_entropy:
        prev_entropy = curr_entropy
        k = ind
```

I think checking y is weird, because the algorithm is supposed to discretize x.
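For comparison, here is a minimal plain-Python sketch of the same loop with the R-style condition (assuming `x` is the sorted feature column aligned with `y`, and inferring the `slice_entropy` signature from the snippet above):

```python
# Plain-Python sketch of the split search with the R-style check.
# Assumes x is sorted ascending, y holds the class labels aligned with x,
# and slice_entropy(y, i, j) returns a tuple whose first element is the
# entropy of y[i:j] (signature inferred from the Cython snippet above).
def find_cut(x, y, start, end, slice_entropy):
    length = end - start
    prev_entropy = float("inf")
    k = -1
    for ind in range(start + 1, end):
        if x[ind - 1] == x[ind]:  # check x, as the R code does
            continue
        # Weighted entropy of the two partitions induced by this split.
        first_half = (ind - start) / length * slice_entropy(y, start, ind)[0]
        second_half = (end - ind) / length * slice_entropy(y, ind, end)[0]
        curr_entropy = first_half + second_half
        if prev_entropy > curr_entropy:
            prev_entropy = curr_entropy
            k = ind
    # The R code then uses the midpoint (x[k-1] + x[k]) / 2 as the cut value.
    return k
```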
The original paper also states that only boundary points of the attribute need to be considered as candidate cut points; boundary points are defined in Section 2.2.
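A rough illustration of that boundary-point restriction (my own sketch, not code from either package; assumes `x` is sorted ascending with class labels `y` aligned): a midpoint between two adjacent distinct values is a boundary point only if the examples taking those two values do not all share a single class.

```python
from itertools import groupby

def boundary_midpoints(x, y):
    """Candidate cut points restricted to boundary points.

    Assumes x is sorted ascending and y holds class labels aligned with x.
    A midpoint between adjacent distinct values of x is kept only when the
    examples at those two values span more than one class.
    """
    # Collect the set of class labels seen at each distinct feature value.
    groups = [(val, {label for _, label in grp})
              for val, grp in groupby(zip(x, y), key=lambda pair: pair[0])]
    cuts = []
    for (v1, c1), (v2, c2) in zip(groups, groups[1:]):
        if len(c1 | c2) > 1:            # classes differ across this gap
            cuts.append((v1 + v2) / 2)  # midpoint, as in the R code
    return cuts
```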
First, I want to say your work in this library looks great and it is perfect for what I am trying to use it for. So thank you :)
I am getting very different results from what you obtained here: https://gist.github.com/hlin117/dee443d98af5ac3413c5 . When I print conv_X after running your example from the README, it looks like this:
Meanwhile, the R package and your previous results give categories in the range [1, 3]. I would feel more confident in these results if they matched the output of the R package.
I created a new environment and set versions for my packages to match the ones you specify in requirements.txt.
Any idea why I might be obtaining these inconsistent results?