Add reason post is likely nonsense #4304

user12986714 · 2020-08-06T18:17:50Z

This PR attempts to detect vandalism and gibberish by calculating the information entropy of a given text. The constants used in this PR are conservative.

This is not really a good practice, but too many tests think nonsense is not gibberish... For example, "I have this number: 111111111111111" and "This asdf should asdf not asdf be asdf matched asdf because asdf the asdf words do not asdf follow on each asdf other".
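
For readers unfamiliar with the idea, here is a minimal sketch of a per-character entropy measure. It is plain Shannon entropy over character frequencies, not necessarily the formula used in this PR (the values quoted later in the thread appear to be on a different scale), and the name `entropy_per_char` is only illustrative.

```python
import math
from collections import Counter


def entropy_per_char(text: str) -> float:
    """Shannon entropy of the character distribution, in bits per character."""
    if not text:
        return 0.0  # nothing to measure; also avoids dividing by zero
    counts = Counter(text)
    total = len(text)
    # Sum -p * log2(p) over every distinct character.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# A run of one repeated character carries no information per character.
low = entropy_per_char("111111111111111")
# Four equally frequent characters need two bits each.
high = entropy_per_char("abcd")
```

Under this measure, text dominated by a single repeated character scores close to zero, while text that uses many characters about equally often scores high.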

ArtOfCode- · 2020-08-06T19:30:00Z

What's the standard deviation of entropy-per-char? 3.0 seems quite close to 2.6 to me, but it depends - that might be super unlikely or relatively likely; the standard deviation will reveal which it is.

user12986714 · 2020-08-07T01:54:09Z

Well, entropy per char for English + space IIRC is ~21...

ArtOfCode- · 2020-08-07T20:49:41Z

That's... not what I'm asking. You have a comment in here that says "Average entropy per char in English is 2.6". If that's the average, what's the stddev?

ghost · 2020-08-08T13:20:25Z

The entropy values here are broken; every post gets caught.

Legit posts:

“I have seen the discussion about the Turkish Airlines COVID Cabin policy which makes little sense. Regardless though, does anyone know if they are enforcing it? I, like many, will be transferring at a European Airport on two tickets issued separately. I can't check my luggage all the way through from Istanbul to Malaga and can't exit customs to collect the luggage in Brussels (without a forced quarantine or denied entry)” - entropy per char of 0.2332
“Why all Indian rupee notes are accepted in Nepal and Bhutan, except 500 Rs and 1000 Rs? Why spare those two notes?” - entropy per char of 0.2485

Gibberish posts:

“this this this this spamd dshdshdshdshhds” - entropy per char of 0.4045
“test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test test” - entropy per char of 0.4898
“gibberish dshdshdsaasdlaf,afdasfkkdafkafdkdfkfdskdsfksdkfksd.fk.sdfksfdkfk.fsdk.sdfksfdkfsdkfsdk” - entropy per char of 0.3817
“dfahdfhdsfsfkdjjsldfksdflfsdlkjfldskjlkfdsjklsfd/jldsfjlsdfjsdfakjsdafjkfsjkldfsaklfdklsalkfdsaklsdadsaklasfdkldfsaljkdfslkaldsfkdslflfsddfskjsdfllsfdaladsflfdsalsdalfksdklafkdsafdsafdsaklkldfsakfdkslaflklfdsakldfsalfdldsaflkasfldjkdfslsdklaflkdsfakjlfsadkljfsakljafsdlkjsdfjjfsdljasdladsfljkfdsjldfsjldsfjsdf” - entropy per char of 0.4025

I think the entropy constants need to be adjusted according to these results.

user12986714 · 2020-08-08T15:49:13Z

A stat with 12405 fp posts on MS

>>> statistics.mean(result)
0.20483261275004847
>>> statistics.median(result)
0.20223865427238322
>>> statistics.stdev(result)
0.031230117152319384

So yes, I managed to mess up the decimal point.
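
To make the standard-deviation question concrete, here is a rough sketch of how a cutoff could be derived from these figures. It assumes a symmetric band of two standard deviations around the mean (the commit named "2stddev" hints at this, but the thread does not spell out the rule), and it assumes the per-post values quoted earlier in the thread are on the same scale as these statistics. `is_outlier` is a hypothetical helper, not a function from findspam.py.

```python
# Figures reported above for 12405 fp posts, rounded to four places.
MEAN = 0.2048
STDEV = 0.0312

# Assumption: flag anything more than K standard deviations from the mean.
K = 2


def is_outlier(value: float, mean: float = MEAN,
               stdev: float = STDEV, k: float = K) -> bool:
    """Return True when value lies outside mean +/- k * stdev."""
    return abs(value - mean) > k * stdev


# Per-post values quoted earlier in the thread.
legit = [0.2332, 0.2485]
gibberish = [0.4045, 0.4898, 0.3817, 0.4025]

legit_flags = [is_outlier(v) for v in legit]
gibberish_flags = [is_outlier(v) for v in gibberish]
```

With these numbers the band is roughly 0.142 to 0.267, which would leave both legitimate examples alone and flag all four gibberish examples.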

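Note: fp is defined as:

>>> def is_fp(post):
...     # Count feedback whose type starts with "f" (false positive) or "t" (true positive).
...     fp_count = 0
...     tp_count = 0
...     for fb in post['feedback']:
...         if fb[1].startswith("f"):
...             fp_count += 1
...         elif fb[1].startswith("t"):
...             tp_count += 1
...     # fp if false-positive votes lead by more than one, or if they are the only votes.
...     return (fp_count - tp_count > 1) or (fp_count > 0 and tp_count == 0)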
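
As a quick illustration of how that definition behaves on edge cases, here is a self-contained sketch. The post records are hypothetical: the layout of each feedback entry as a `(user, feedback_type)` pair and the type strings `"fp-"` and `"tpu-"` are assumptions made for the example, not taken from the metasmoke data.

```python
def is_fp(post):
    """Same rule as above, written with generator sums."""
    fp_count = sum(1 for fb in post["feedback"] if fb[1].startswith("f"))
    tp_count = sum(1 for fb in post["feedback"] if fb[1].startswith("t"))
    return (fp_count - tp_count > 1) or (fp_count > 0 and tp_count == 0)


# Hypothetical records: each feedback entry is (user, feedback_type).
only_fp = {"feedback": [("alice", "fp-")]}
contested = {"feedback": [("alice", "fp-"), ("bob", "fp-"), ("carol", "tpu-")]}
mostly_fp = {"feedback": [("a", "fp-"), ("b", "fp-"), ("c", "fp-"), ("d", "tpu-")]}
no_feedback = {"feedback": []}

results = [is_fp(p) for p in (only_fp, contested, mostly_fp, no_feedback)]
```

A post with two fp votes and one tp vote is not counted as fp, because the lead is only one vote; a post with no feedback at all is not counted either.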

Too much very-compact code that looks like nonsense but is not actually

stale · 2020-09-11T00:39:03Z

This issue has been closed because it has had no recent activity. If this is still important, please add another comment and find someone with write permissions to reopen the issue. Thank you for your contributions.

NobodyNada · 2020-09-12T05:50:59Z

> A stat with 12405 fp posts on MS

That's a lot...unless I'm misunderstanding something, it means we're catching one out of every six non-spam posts.

user12986714 · 2020-09-12T20:02:53Z

> > A stat with 12405 fp posts on MS
>
> That's a lot...unless I'm misunderstanding something, it means we're catching one out of every six non-spam posts.

Well, I took many fp posts from the MS records and analyzed them; it is not that this reason would produce those fps.

NobodyNada · 2020-09-12T20:04:05Z

@user12986714 Ah, gotcha. Do you happen to have any stats on how many tps/fps this will result in over the MS corpus?

user12986714 · 2020-09-12T20:54:04Z

> @user12986714 Ah, gotcha. Do you happen to have any stats on how many tps/fps this will result in over the MS corpus?

W.r.t. results on the metasmoke dataset, the fp rate is very low. However, since the samples on MS are biased, we cannot really conclude anything.

That said, I believe some test sessions have been run and the fp rate was low.

NobodyNada · 2020-09-12T20:59:51Z

> W.r.t. results on the metasmoke dataset, the fp rate is very low. However, since the samples on MS are biased, we cannot really conclude anything.

Not necessarily in this case — most false-positives on MS are normal English posts, which are what we want to avoid catching. In that case, I think this is ready for merge (cc @ArtOfCode-). If we run into problems we can always revert it.

NobodyNada

So I ran some tests on this today. It's really cool, but it has a couple problems and therefore currently catches a ton of FPs. See review comments for details.

NobodyNada · 2020-10-06T19:20:31Z

findspam.py

+                    "french.stackexchange.com", "spanish.stackexchange.com",
+                    "portuguese.stackexchange.com", "korean.stackexchange.com",
+                    "ukrainian.stackexchange.com", "italian.stackexchange.com"],
+             max_rep=10000, max_score=10000)


We might want to strip code blocks, for instance https://askubuntu.com/a/623972. Or at least collapse repeated whitespace characters.
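
A minimal sketch of the preprocessing suggested here, assuming the post body is HTML in which code blocks are rendered as `<pre>...</pre>` elements. The function name `normalize_for_entropy` and the regular expressions are illustrative only and are not taken from findspam.py.

```python
import re

# Assumption: code blocks appear in the body as <pre>...</pre> elements.
PRE_BLOCK = re.compile(r"<pre\b[^>]*>.*?</pre>", re.DOTALL | re.IGNORECASE)
# Any run of spaces, tabs, or newlines.
WHITESPACE_RUN = re.compile(r"\s+")


def normalize_for_entropy(body: str) -> str:
    """Drop code blocks, then collapse each whitespace run to one space."""
    without_code = PRE_BLOCK.sub(" ", body)
    return WHITESPACE_RUN.sub(" ", without_code).strip()


sample = "<p>Run   this:</p>\n<pre><code>ls    -la\n</code></pre>\n<p>Done.</p>"
cleaned = normalize_for_entropy(sample)
```

This leaves inline `<code>` spans and unformatted code in place, so it would only address part of the problem.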

NobodyNada · 2020-10-13T18:20:57Z

I ran some more tests today. It's looking a lot better, but we still have problems with:

Code. We probably should strip code blocks, but then we'll still have a lot of fp's due to posts with unformatted code.
Posts with lots of un-rendered whitespace. IMO we really should collapse repeated whitespace characters.
Those constants don't seem to be conservative enough; e.g. https://english.stackexchange.com/a/408724/106362 and https://hermeneutics.stackexchange.com/a/51104 are both caught, with entropies of 4.0233 and 5.6742 respectively.

user12986714 added 5 commits August 6, 2020 14:16

Add reason post is likely nonsense

40b64cb

Fix typo

4e23c81

Fix division by zero + use constants

1196fa3

Syntax fix

5b43377

Make CI happy

38b143f

This is not really a good practice, but too many tests think nonsense is not gibberish... For example, "I have this number: 111111111111111" and "This asdf should asdf not asdf be asdf matched asdf because asdf the asdf words do not asdf follow on each asdf other".

user12986714 added 4 commits August 8, 2020 11:53

Fix constants

dc3ba10

Exclude codegolf.SE

eddf95e

Too much very-compact code that looks like nonsense but is not actually

2stddev

a274d2c

Exclude non-English sites

7592eb0

ghost reviewed Aug 8, 2020

user12986714 added 2 commits August 8, 2020 13:24

Add italian.SE to exclusion list + fix CI

d271d65

Fix flake8 attempt 1

83b6000

ghost approved these changes Aug 8, 2020

stale bot added the status: stale label Sep 9, 2020

stale bot closed this Sep 11, 2020

makyen reopened this Sep 11, 2020

stale bot removed the status: stale label Sep 11, 2020

makyen added the status: confirmed label Sep 11, 2020

NobodyNada suggested changes Oct 6, 2020

Correct math

ae2bdf4

Collapse whitespaces

5eec477
