Add reason post is likely nonsense #4304
This is not really a good practice, but too many tests think nonsense is not gibberish... For example, "I have this number: 111111111111111" and "This asdf should asdf not asdf be asdf matched asdf because asdf the asdf words do not asdf follow on each asdf other".
What's the standard deviation of entropy-per-char? 3.0 seems quite close to 2.6 to me, but it depends - that might be super unlikely or relatively likely; the standard deviation will reveal which it is.
Well, entropy per char for English + space IIRC is ~21...
That's... not what I'm asking. You have a comment in here that says "Average entropy per char in English is 2.6". If that's the average, what's the stddev?
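For illustration, the stddev question could be answered empirically along these lines. This is only a sketch: `entropy_per_char`, `entropy_stats` and `z_score` are hypothetical helpers, not code from this PR, and it assumes entropy means Shannon entropy in bits over the character frequencies of a single post.

```python
import math
import statistics
from collections import Counter


def entropy_per_char(text):
    """Shannon entropy, in bits per character, of the character frequencies in text."""
    if not text:
        return 0.0
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in Counter(text).values())


def entropy_stats(posts):
    """Return (mean, sample standard deviation) of entropy-per-char over posts.

    statistics.stdev needs at least two posts.
    """
    values = [entropy_per_char(post) for post in posts]
    return statistics.mean(values), statistics.stdev(values)


def z_score(threshold, mean, stdev):
    """How many standard deviations the threshold lies from the mean."""
    return (threshold - mean) / stdev
```

Run over a sample of known-good posts, `z_score(3.0, mean, stdev)` would show whether a 3.0 cutoff is far from the average or well within normal variation.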
The entropy values here are broken; every post gets caught. Legit posts:
Gibberish posts:
I think the entropy values need to be adjusted according to these results.
A stat with 12405 fp posts on MS
So yes, I managed to mess up the decimal point. Note: fp is defined as:
Too much very-compact code that looks like nonsense but is not actually
This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions.
That's a lot... unless I'm misunderstanding something, it means we're catching one out of every six non-spam posts.
Well, the point is that I took many fp posts from the MS records and analyzed them, not that this reason would produce those fps.
@user12986714 Ah, gotcha. Do you happen to have any stats on how many tps/fps this will result in over the MS corpus?
W.r.t. the results on the metasmoke dataset, the fp rate is very low. However, since the samples on MS are biased, we cannot really conclude anything. That said, I believe some test sessions have been run and the fp rate was low.
W.r.t. the results on the metasmoke dataset, the fp rate is very low. However, since the samples on MS are biased, we cannot really conclude anything.
Not necessarily in this case — most false-positives on MS are normal English posts, which are what we want to avoid catching.
In that case, I think this is ready for merge (cc @ArtOfCode-). If we run into problems we can always revert it.
… On Sep 12, 2020, at 1:54 PM, user12986714 ***@***.***> wrote:
W.r.t. the results on the metasmoke dataset, the fp rate is very low. However, since the samples on MS are biased, we cannot really conclude anything.
So I ran some tests on this today. It's really cool, but it has a couple problems and therefore currently catches a ton of FPs. See review comments for details.
"french.stackexchange.com", "spanish.stackexchange.com",
"portuguese.stackexchange.com", "korean.stackexchange.com",
"ukrainian.stackexchange.com", "italian.stackexchange.com"],
max_rep=10000, max_score=10000)
We might want to strip code blocks, for instance https://askubuntu.com/a/623972. Or at least collapse repeated whitespace characters.
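A rough sketch of what that preprocessing could look like, assuming the post body arrives as HTML with code blocks wrapped in `<pre>` tags. `strip_code_and_collapse` is a hypothetical helper, not something already in this PR, and a regex is only an approximation of real HTML parsing.

```python
import re

# Assumption: code blocks are rendered as <pre>...</pre> in the post body.
CODE_BLOCK_RE = re.compile(r"<pre\b[^>]*>.*?</pre>", re.DOTALL | re.IGNORECASE)
WHITESPACE_RE = re.compile(r"\s+")


def strip_code_and_collapse(body):
    """Drop <pre> code blocks, then collapse each whitespace run to one space."""
    without_code = CODE_BLOCK_RE.sub(" ", body)
    return WHITESPACE_RE.sub(" ", without_code).strip()
```

Running the entropy check on the output of `strip_code_and_collapse(body)` instead of the raw body would keep indentation and compact code from skewing the character distribution.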
I ran some more tests today. It's looking a lot better, but we still have problems with:
This PR attempts to detect vandalism and gibberish by calculating the informational entropy of a given text. The constants used in this PR are conservative.
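As a minimal sketch of the idea, assuming "informational entropy" here means Shannon entropy over the character frequencies of the text, measured in bits (the PR's actual formula, log base and thresholds may differ). `entropy_per_char` is a hypothetical name used only for illustration.

```python
import math
from collections import Counter


def entropy_per_char(text):
    """Shannon entropy, in bits per character, of the character frequencies in text."""
    if not text:
        return 0.0
    total = len(text)
    counts = Counter(text)
    # H = -sum(p * log2(p)) over the distinct characters in the text.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())
```

A string of one repeated character scores 0 bits, and the score rises as characters become more varied and evenly distributed, so a rule of this kind would flag texts whose score falls outside an expected range.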