[DETECTION] Expected iso-8859-2 in czech text #571

Open
pawelzwronek opened this issue Dec 4, 2024 · 2 comments
Labels

  • detection: Related to the charset detection mechanism, chaos/mess/coherence
  • question: Further information is requested

Comments

@pawelzwronek

Notice
I hereby announce that my raw input is not:

  • Too small content (<=32 characters) as I do know that ANY charset detector heavily depends on content
  • Encoded in a deprecated/abandoned encoding that is not even supported by my interpreter

Provide the file

Ledňáček říční (Alcedo atthis) je průměrně 16,5 cm velký pták z čeledi
ledňáčkovitých (Alcedinidae). Je velmi výrazně zbarvený s oranžovou spodinou a
modrým hřbetem, křídly a temenem. Výrazným znakem je také jeho nápadně dlouhý
zašpičatělý zobák. Pro své krásné zbarvení je nazýván Létající drahokam.

iso-8859-2.txt

Verbose output

D:\temp\1\test\cs>normalizer -v iso-8859-2.txt
2024-12-04 02:38:55,249 | Level 5 | override steps (5) and chunk_size (512) as content does not fit (301 byte(s) given) parameters.
2024-12-04 02:38:55,249 | Level 5 | Code page ascii does not fit given bytes sequence at ALL. 'ascii' codec can't decode byte 0xf2 in position 3: ordinal not in range(128)
2024-12-04 02:38:55,250 | Level 5 | Code page utf_8 does not fit given bytes sequence at ALL. 'utf-8' codec can't decode byte 0xf2 in position 3: invalid continuation byte
2024-12-04 02:38:55,250 | Level 5 | Code page big5 does not fit given bytes sequence at ALL. 'big5' codec can't decode byte 0xed in position 13: illegal multibyte sequence
2024-12-04 02:38:55,251 | Level 5 | Code page big5hkscs does not fit given bytes sequence at ALL. 'big5hkscs' codec can't decode byte 0xed in position 13: illegal multibyte sequence
2024-12-04 02:38:55,251 | Level 5 | cp037 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 420.500000 %.
2024-12-04 02:38:55,252 | Level 5 | cp1006 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 24.200000 %.
2024-12-04 02:38:55,253 | Level 5 | cp1026 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2024-12-04 02:38:55,253 | Level 5 | cp1125 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 30.300000 %.
2024-12-04 02:38:55,254 | Level 5 | cp1140 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2024-12-04 02:38:55,254 | Level 5 | cp1250 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-12-04 02:38:55,255 | Level 5 | cp1250 should target any language(s) of ['Latin Based']
2024-12-04 02:38:55,257 | Level 5 | We detected language [('Czech', 0.4688), ('Slovak', 0.4375), ('Dutch', 0.375), ('Swedish', 0.375), ('Hungarian', 0.3438), ('Spanish', 0.3125), ('Polish', 0.3125), ('Portuguese', 0.3125), ('Croatian', 0.3125), ('Italian', 0.3125), ('Danish', 0.3125), ('Estonian', 0.3125), ('Finnish', 0.2812), ('French', 0.2812), ('Norwegian', 0.2812), ('Turkish', 0.25), ('Romanian', 0.25), ('German', 0.2188), ('Vietnamese', 0.1562)] using cp1250
2024-12-04 02:38:55,257 | Level 5 | cp1251 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 24.200000 %.
2024-12-04 02:38:55,258 | Level 5 | cp1252 passed initial chaos probing. Mean measured chaos is 5.800000 %
2024-12-04 02:38:55,258 | Level 5 | cp1252 should target any language(s) of ['Latin Based']
2024-12-04 02:38:55,260 | Level 5 | We detected language [('Slovak', 0.4667), ('Czech', 0.4667), ('Dutch', 0.4), ('Hungarian', 0.3667), ('Italian', 0.3667), ('Swedish', 0.3667), ('Portuguese', 0.3333), ('Finnish', 0.3333), ('Danish', 0.3333), ('Polish', 0.3333), ('Croatian', 0.3333), ('French', 0.3), ('Estonian', 0.3), ('Spanish', 0.2667), ('Norwegian', 0.2667), ('Turkish', 0.2667), ('Romanian', 0.2667), ('German', 0.2333), ('Vietnamese', 0.1667)] using cp1252
2024-12-04 02:38:55,260 | Level 5 | cp1253 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 24.200000 %.
2024-12-04 02:38:55,261 | Level 5 | cp1254 passed initial chaos probing. Mean measured chaos is 5.800000 %
2024-12-04 02:38:55,262 | Level 5 | cp1254 should target any language(s) of ['Latin Based']
2024-12-04 02:38:55,263 | Level 5 | We detected language [('Czech', 0.4667), ('Slovak', 0.4667), ('Dutch', 0.4), ('Hungarian', 0.3667), ('Italian', 0.3667), ('Swedish', 0.3667), ('Portuguese', 0.3333), ('Finnish', 0.3333), ('Danish', 0.3333), ('Polish', 0.3333), ('Croatian', 0.3333), ('French', 0.3), ('Turkish', 0.3), ('Estonian', 0.3), ('Spanish', 0.2667), ('Norwegian', 0.2667), ('Romanian', 0.2667), ('German', 0.2333), ('Vietnamese', 0.1667)] using cp1254
2024-12-04 02:38:55,264 | Level 5 | cp1255 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 24.200000 %.
2024-12-04 02:38:55,264 | Level 5 | cp1256 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 24.200000 %.
2024-12-04 02:38:55,265 | Level 5 | cp1257 passed initial chaos probing. Mean measured chaos is 3.700000 %
2024-12-04 02:38:55,266 | Level 5 | cp1257 should target any language(s) of ['Latin Based']
2024-12-04 02:38:55,267 | Level 5 | We detected language [('Slovak', 0.4667), ('Czech', 0.4333), ('Dutch', 0.4), ('Croatian', 0.3667), ('Swedish', 0.3667), ('Hungarian', 0.3333), ('Polish', 0.3333), ('Finnish', 0.3333), ('Italian', 0.3333), ('Portuguese', 0.3333), ('French', 0.3), ('Danish', 0.3), ('Estonian', 0.3), ('Spanish', 0.2667), ('Turkish', 0.2667), ('Romanian', 0.2667), ('German', 0.2333), ('Norwegian', 0.2333), ('Vietnamese', 0.1667)] using cp1257
2024-12-04 02:38:55,268 | Level 5 | cp1258 passed initial chaos probing. Mean measured chaos is 5.900000 %
2024-12-04 02:38:55,268 | Level 5 | cp1258 should target any language(s) of ['Latin Based']
2024-12-04 02:38:55,269 | Level 5 | We detected language [('Slovak', 0.5357), ('Czech', 0.5), ('Dutch', 0.4286), ('Swedish', 0.4286), ('Italian', 0.3929), ('Portuguese', 0.3929), ('Croatian', 0.3929), ('Hungarian', 0.3571), ('Norwegian', 0.3571), ('Finnish', 0.3571), ('Danish', 0.3571), ('Polish', 0.3571), ('Estonian', 0.3214), ('Spanish', 0.2857), ('French', 0.2857), ('Turkish', 0.2857), ('Romanian', 0.25), ('German', 0.1786), ('Vietnamese', 0.1429)] using cp1258
2024-12-04 02:38:55,270 | Level 5 | cp273 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2024-12-04 02:38:55,270 | Level 5 | Code page cp424 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xed in position 10: character maps to <undefined>
2024-12-04 02:38:55,271 | Level 5 | cp437 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 54.500000 %.
2024-12-04 02:38:55,272 | Level 5 | cp500 is deemed too similar to code page cp037 and was consider unsuited already. Continuing!
2024-12-04 02:38:55,272 | Level 5 | cp720 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 30.300000 %.
2024-12-04 02:38:55,273 | Level 5 | cp737 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 36.400000 %.
2024-12-04 02:38:55,274 | Level 5 | cp775 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 20.100000 %.
2024-12-04 02:38:55,274 | Level 5 | cp850 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2024-12-04 02:38:55,275 | Level 5 | cp852 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 32.200000 %.
2024-12-04 02:38:55,275 | Level 5 | cp855 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 42.400000 %.
2024-12-04 02:38:55,276 | Level 5 | Code page cp856 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xe1 in position 4: character maps to <undefined>
2024-12-04 02:38:55,277 | Level 5 | Code page cp857 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xf2 in position 3: character maps to <undefined>
2024-12-04 02:38:55,277 | Level 5 | cp858 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2024-12-04 02:38:55,278 | Level 5 | cp860 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2024-12-04 02:38:55,278 | Level 5 | cp861 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2024-12-04 02:38:55,279 | Level 5 | cp862 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2024-12-04 02:38:55,279 | Level 5 | cp863 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2024-12-04 02:38:55,280 | Level 5 | cp864 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 24.200000 %.
2024-12-04 02:38:55,281 | Level 5 | cp865 is deemed too similar to code page cp437 and was consider unsuited already. Continuing!
2024-12-04 02:38:55,281 | Level 5 | cp866 is deemed too similar to code page cp1125 and was consider unsuited already. Continuing!
2024-12-04 02:38:55,282 | Level 5 | cp869 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 30.300000 %.
2024-12-04 02:38:55,283 | Level 5 | Code page cp874 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfd in position 55: character maps to <undefined>
2024-12-04 02:38:55,283 | Level 5 | cp875 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 163.600000 %.
2024-12-04 02:38:55,284 | Level 5 | Code page cp932 does not fit given bytes sequence at ALL. 'cp932' codec can't decode byte 0xed in position 13: illegal multibyte sequence
2024-12-04 02:38:55,284 | Level 5 | Code page cp949 does not fit given bytes sequence at ALL. 'cp949' codec can't decode byte 0xe8 in position 5: illegal multibyte sequence
2024-12-04 02:38:55,285 | Level 5 | Code page cp950 does not fit given bytes sequence at ALL. 'cp950' codec can't decode byte 0xed in position 13: illegal multibyte sequence
2024-12-04 02:38:55,285 | Level 5 | Code page euc_jis_2004 does not fit given bytes sequence at ALL. 'euc_jis_2004' codec can't decode byte 0xe8 in position 5: illegal multibyte sequence
2024-12-04 02:38:55,286 | Level 5 | Code page euc_jisx0213 does not fit given bytes sequence at ALL. 'euc_jisx0213' codec can't decode byte 0xe8 in position 5: illegal multibyte sequence
2024-12-04 02:38:55,286 | Level 5 | Code page euc_jp does not fit given bytes sequence at ALL. 'euc_jp' codec can't decode byte 0xe8 in position 5: illegal multibyte sequence
2024-12-04 02:38:55,287 | Level 5 | Code page euc_kr does not fit given bytes sequence at ALL. 'euc_kr' codec can't decode byte 0xe8 in position 5: illegal multibyte sequence
2024-12-04 02:38:55,287 | Level 5 | Code page gb18030 does not fit given bytes sequence at ALL. 'gb18030' codec can't decode byte 0xed in position 13: illegal multibyte sequence
2024-12-04 02:38:55,288 | Level 5 | Code page gb2312 does not fit given bytes sequence at ALL. 'gb2312' codec can't decode byte 0xe8 in position 5: illegal multibyte sequence
2024-12-04 02:38:55,288 | Level 5 | Code page gbk does not fit given bytes sequence at ALL. 'gbk' codec can't decode byte 0xf8 in position 9: illegal multibyte sequence
2024-12-04 02:38:55,289 | Level 5 | hp_roman8 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 22.100000 %.
2024-12-04 02:38:55,290 | Level 5 | Code page hz does not fit given bytes sequence at ALL. 'hz' codec can't decode byte 0xf2 in position 3: illegal multibyte sequence
2024-12-04 02:38:55,290 | Level 5 | Code page iso2022_jp does not fit given bytes sequence at ALL. 'iso2022_jp' codec can't decode byte 0xf2 in position 3: illegal multibyte sequence
2024-12-04 02:38:55,291 | Level 5 | Code page iso2022_jp_1 does not fit given bytes sequence at ALL. 'iso2022_jp_1' codec can't decode byte 0xf2 in position 3: illegal multibyte sequence
2024-12-04 02:38:55,291 | Level 5 | Code page iso2022_jp_2 does not fit given bytes sequence at ALL. 'iso2022_jp_2' codec can't decode byte 0xf2 in position 3: illegal multibyte sequence
2024-12-04 02:38:55,292 | Level 5 | Code page iso2022_jp_2004 does not fit given bytes sequence at ALL. 'iso2022_jp_2004' codec can't decode byte 0xf2 in position 3: illegal multibyte sequence
2024-12-04 02:38:55,292 | Level 5 | Code page iso2022_jp_3 does not fit given bytes sequence at ALL. 'iso2022_jp_3' codec can't decode byte 0xf2 in position 3: illegal multibyte sequence
2024-12-04 02:38:55,293 | Level 5 | Code page iso2022_jp_ext does not fit given bytes sequence at ALL. 'iso2022_jp_ext' codec can't decode byte 0xf2 in position 3: illegal multibyte sequence
2024-12-04 02:38:55,293 | Level 5 | Code page iso2022_kr does not fit given bytes sequence at ALL. 'iso2022_kr' codec can't decode byte 0xf2 in position 3: illegal multibyte sequence
2024-12-04 02:38:55,294 | Level 5 | iso8859_10 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-12-04 02:38:55,295 | Level 5 | iso8859_10 should target any language(s) of ['Latin Based']
2024-12-04 02:38:55,296 | Level 5 | We detected language [('Czech', 0.4688), ('Slovak', 0.4375), ('Dutch', 0.375), ('Swedish', 0.375), ('Hungarian', 0.3438), ('Danish', 0.3438), ('Spanish', 0.3125), ('Portuguese', 0.3125), ('Norwegian', 0.3125), ('Croatian', 0.3125), ('Italian', 0.3125), ('Estonian', 0.3125), ('Finnish', 0.2812), ('French', 0.2812), ('Polish', 0.2812), ('Turkish', 0.25), ('Romanian', 0.25), ('German', 0.2188), ('Vietnamese', 0.1562)] using iso8859_10
2024-12-04 02:38:55,297 | Level 5 | Code page iso8859_11 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfd in position 55: character maps to <undefined>
2024-12-04 02:38:55,297 | Level 5 | iso8859_13 passed initial chaos probing. Mean measured chaos is 3.700000 %
2024-12-04 02:38:55,297 | Level 5 | iso8859_13 should target any language(s) of ['Latin Based']
2024-12-04 02:38:55,298 | Level 5 | We detected language [('Slovak', 0.4667), ('Czech', 0.4333), ('Dutch', 0.4), ('Croatian', 0.3667), ('Swedish', 0.3667), ('Hungarian', 0.3333), ('Polish', 0.3333), ('Finnish', 0.3333), ('Italian', 0.3333), ('Portuguese', 0.3333), ('French', 0.3), ('Danish', 0.3), ('Estonian', 0.3), ('Spanish', 0.2667), ('Turkish', 0.2667), ('Romanian', 0.2667), ('German', 0.2333), ('Norwegian', 0.2333), ('Vietnamese', 0.1667)] using iso8859_13
2024-12-04 02:38:55,299 | Level 5 | iso8859_14 passed initial chaos probing. Mean measured chaos is 2.000000 %
2024-12-04 02:38:55,299 | Level 5 | iso8859_14 should target any language(s) of ['Latin Based']
2024-12-04 02:38:55,300 | Level 5 | We detected language [('Czech', 0.4688), ('Slovak', 0.4375), ('Dutch', 0.375), ('Swedish', 0.375), ('Hungarian', 0.3438), ('Italian', 0.3438), ('Danish', 0.3438), ('Spanish', 0.3125), ('Portuguese', 0.3125), ('Norwegian', 0.3125), ('Croatian', 0.3125), ('Estonian', 0.3125), ('French', 0.2812), ('Finnish', 0.2812), ('Polish', 0.2812), ('Turkish', 0.25), ('Romanian', 0.25), ('German', 0.2188), ('Vietnamese', 0.1562)] using iso8859_14
2024-12-04 02:38:55,301 | Level 5 | iso8859_15 passed initial chaos probing. Mean measured chaos is 2.700000 %
2024-12-04 02:38:55,301 | Level 5 | iso8859_15 should target any language(s) of ['Latin Based']
2024-12-04 02:38:55,303 | Level 5 | We detected language [('Czech', 0.4839), ('Slovak', 0.4516), ('Dutch', 0.3871), ('Hungarian', 0.3548), ('Italian', 0.3548), ('Swedish', 0.3548), ('Portuguese', 0.3226), ('Danish', 0.3226), ('Polish', 0.3226), ('Croatian', 0.3226), ('Spanish', 0.2903), ('French', 0.2903), ('Norwegian', 0.2903), ('Finnish', 0.2903), ('Estonian', 0.2903), ('Turkish', 0.2581), ('Romanian', 0.2581), ('German', 0.2258), ('Vietnamese', 0.1613)] using iso8859_15
2024-12-04 02:38:55,304 | Level 5 | iso8859_16 passed initial chaos probing. Mean measured chaos is 2.700000 %
2024-12-04 02:38:55,304 | Level 5 | iso8859_16 should target any language(s) of ['Latin Based']
2024-12-04 02:38:55,305 | Level 5 | We detected language [('Slovak', 0.4688), ('Czech', 0.4688), ('Dutch', 0.375), ('Swedish', 0.375), ('Hungarian', 0.3438), ('Italian', 0.3438), ('Spanish', 0.3125), ('Portuguese', 0.3125), ('Croatian', 0.3125), ('Danish', 0.3125), ('Estonian', 0.3125), ('French', 0.2812), ('Polish', 0.2812), ('Finnish', 0.2812), ('Norwegian', 0.2812), ('Turkish', 0.25), ('Romanian', 0.25), ('German', 0.2188), ('Vietnamese', 0.1562)] using iso8859_16
2024-12-04 02:38:55,306 | Level 5 | iso8859_2 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-12-04 02:38:55,307 | Level 5 | iso8859_2 should target any language(s) of ['Latin Based']
2024-12-04 02:38:55,308 | Level 5 | We detected language [('Czech', 0.4688), ('Slovak', 0.4375), ('Croatian', 0.375), ('Dutch', 0.375), ('Swedish', 0.375), ('Hungarian', 0.3438), ('Spanish', 0.3125), ('Portuguese', 0.3125), ('Italian', 0.3125), ('Danish', 0.3125), ('Estonian', 0.3125), ('Finnish', 0.2812), ('French', 0.2812), ('Polish', 0.2812), ('Norwegian', 0.2812), ('Turkish', 0.25), ('Romanian', 0.25), ('German', 0.2188), ('Vietnamese', 0.1562)] using iso8859_2
2024-12-04 02:38:55,308 | Level 5 | Code page iso8859_3 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xbe in position 133: character maps to <undefined>
2024-12-04 02:38:55,309 | Level 5 | iso8859_4 passed initial chaos probing. Mean measured chaos is 0.000000 %
2024-12-04 02:38:55,310 | Level 5 | iso8859_4 should target any language(s) of ['Latin Based']
2024-12-04 02:38:55,311 | Level 5 | We detected language [('Czech', 0.4688), ('Slovak', 0.4375), ('Croatian', 0.375), ('Dutch', 0.375), ('Swedish', 0.375), ('Hungarian', 0.3438), ('Danish', 0.3438), ('Spanish', 0.3125), ('Portuguese', 0.3125), ('Norwegian', 0.3125), ('Italian', 0.3125), ('Estonian', 0.3125), ('Finnish', 0.2812), ('French', 0.2812), ('Polish', 0.2812), ('Turkish', 0.25), ('Romanian', 0.25), ('German', 0.2188), ('Vietnamese', 0.1562)] using iso8859_4
2024-12-04 02:38:55,312 | Level 5 | iso8859_5 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 24.200000 %.
2024-12-04 02:38:55,312 | Level 5 | Code page iso8859_6 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xf8 in position 9: character maps to <undefined>
2024-12-04 02:38:55,313 | Level 5 | iso8859_7 is deemed too similar to code page cp1253 and was consider unsuited already. Continuing!
2024-12-04 02:38:55,313 | Level 5 | iso8859_8 was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 24.200000 %.
2024-12-04 02:38:55,314 | Level 5 | iso8859_9 passed initial chaos probing. Mean measured chaos is 5.800000 %
2024-12-04 02:38:55,314 | Level 5 | iso8859_9 should target any language(s) of ['Latin Based']
2024-12-04 02:38:55,314 | Level 5 | We detected language [('Czech', 0.4667), ('Slovak', 0.4667), ('Dutch', 0.4), ('Hungarian', 0.3667), ('Italian', 0.3667), ('Swedish', 0.3667), ('Portuguese', 0.3333), ('Finnish', 0.3333), ('Danish', 0.3333), ('Polish', 0.3333), ('Croatian', 0.3333), ('French', 0.3), ('Turkish', 0.3), ('Estonian', 0.3), ('Spanish', 0.2667), ('Norwegian', 0.2667), ('Romanian', 0.2667), ('German', 0.2333), ('Vietnamese', 0.1667)] using iso8859_9
2024-12-04 02:38:55,315 | Level 5 | Code page johab does not fit given bytes sequence at ALL. 'johab' codec can't decode byte 0xed in position 13: illegal multibyte sequence
2024-12-04 02:38:55,316 | Level 5 | koi8_r was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 30.300000 %.
2024-12-04 02:38:55,316 | Level 5 | Code page koi8_t does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xbe in position 133: character maps to <undefined>
2024-12-04 02:38:55,317 | Level 5 | koi8_u was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 30.300000 %.
2024-12-04 02:38:55,317 | Level 5 | kz1048 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2024-12-04 02:38:55,317 | Level 5 | latin_1 passed initial chaos probing. Mean measured chaos is 5.800000 %
2024-12-04 02:38:55,317 | Level 5 | latin_1 should target any language(s) of ['Latin Based']
2024-12-04 02:38:55,317 | Level 5 | We detected language [('Slovak', 0.4667), ('Czech', 0.4667), ('Dutch', 0.4), ('Hungarian', 0.3667), ('Italian', 0.3667), ('Swedish', 0.3667), ('Portuguese', 0.3333), ('Finnish', 0.3333), ('Danish', 0.3333), ('Polish', 0.3333), ('Croatian', 0.3333), ('French', 0.3), ('Estonian', 0.3), ('Spanish', 0.2667), ('Norwegian', 0.2667), ('Turkish', 0.2667), ('Romanian', 0.2667), ('German', 0.2333), ('Vietnamese', 0.1667)] using latin_1
2024-12-04 02:38:55,318 | Level 5 | mac_cyrillic was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 24.200000 %.
2024-12-04 02:38:55,319 | Level 5 | mac_greek was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 24.200000 %.
2024-12-04 02:38:55,319 | Level 5 | mac_iceland was excluded because of initial chaos probing. Gave up 1 time(s). Computed mean chaos is 56.800000 %.
2024-12-04 02:38:55,320 | Level 5 | mac_latin2 passed initial chaos probing. Mean measured chaos is 16.900000 %
2024-12-04 02:38:55,320 | Level 5 | mac_latin2 should target any language(s) of ['Latin Based']
2024-12-04 02:38:55,322 | Level 5 | We detected language [('Slovak', 0.5), ('Czech', 0.4333), ('Croatian', 0.3667), ('Dutch', 0.3667), ('Swedish', 0.3667), ('Finnish', 0.3333), ('Italian', 0.3333), ('Danish', 0.3333), ('Portuguese', 0.3333), ('Hungarian', 0.3), ('Polish', 0.3), ('Norwegian', 0.3), ('Estonian', 0.3), ('Spanish', 0.2667), ('French', 0.2667), ('Turkish', 0.2333), ('Romanian', 0.2333), ('German', 0.2), ('Vietnamese', 0.1667)] using mac_latin2
2024-12-04 02:38:55,323 | Level 5 | mac_roman is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2024-12-04 02:38:55,323 | Level 5 | mac_turkish is deemed too similar to code page mac_iceland and was consider unsuited already. Continuing!
2024-12-04 02:38:55,324 | Level 5 | ptcp154 is deemed too similar to code page cp1251 and was consider unsuited already. Continuing!
2024-12-04 02:38:55,324 | Level 5 | Code page shift_jis does not fit given bytes sequence at ALL. 'shift_jis' codec can't decode byte 0xf2 in position 3: illegal multibyte sequence
2024-12-04 02:38:55,325 | Level 5 | Code page shift_jis_2004 does not fit given bytes sequence at ALL. 'shift_jis_2004' codec can't decode byte 0xed in position 13: illegal multibyte sequence
2024-12-04 02:38:55,325 | Level 5 | Code page shift_jisx0213 does not fit given bytes sequence at ALL. 'shift_jisx0213' codec can't decode byte 0xed in position 13: illegal multibyte sequence
2024-12-04 02:38:55,326 | Level 5 | Code page tis_620 does not fit given bytes sequence at ALL. 'charmap' codec can't decode byte 0xfd in position 55: character maps to <undefined>
2024-12-04 02:38:55,326 | Level 5 | Encoding utf_16 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2024-12-04 02:38:55,326 | Level 5 | Code page utf_16_be does not fit given bytes sequence at ALL. 'utf-16-be' codec can't decode byte 0x0a in position 300: truncated data
2024-12-04 02:38:55,327 | Level 5 | Code page utf_16_le does not fit given bytes sequence at ALL. 'utf-16-le' codec can't decode byte 0x0a in position 300: truncated data
2024-12-04 02:38:55,327 | Level 5 | Encoding utf_32 won't be tested as-is because it require a BOM. Will try some sub-encoder LE/BE.
2024-12-04 02:38:55,327 | Level 5 | Code page utf_32_be does not fit given bytes sequence at ALL. 'utf-32-be' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2024-12-04 02:38:55,328 | Level 5 | Code page utf_32_le does not fit given bytes sequence at ALL. 'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)
2024-12-04 02:38:55,328 | Level 5 | Encoding utf_7 won't be tested as-is because detection is unreliable without BOM/SIG.
2024-12-04 02:38:55,328 | DEBUG | Encoding detection: Found cp1250 as plausible (best-candidate) for content. With 11 alternatives.
{
    "path": "D:\\temp\\1\\test\\cs\\iso-8859-2.txt",
    "encoding": "cp1250",
    "encoding_aliases": [
        "1250",
        "windows_1250"
    ],
    "alternative_encodings": [],
    "language": "Czech",
    "alphabets": [
        "Basic Latin",
        "Control character",
        "Latin Extended-A",
        "Latin-1 Supplement"
    ],
    "has_sig_or_bom": false,
    "chaos": 0.0,
    "coherence": 46.88,
    "unicode_path": null,
    "is_preferred": true
}

Expected encoding
iso-8859-2

Desktop (please complete the following information):

  • OS: Windows 10
  • Python version 3.9
  • Package version 3.4.0

Additional context
Detected encoding: cp1250
The issue is the third character on the fourth line: š. Decoded as cp1250, that byte becomes ą, which is a Polish letter. The source text is Czech, and Czech has no ą.
The test file comes from the uchardet project: iso-8859-2.txt
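
To make the mismatch concrete, here is a minimal sketch that uses only Python's built-in codecs. The word is taken from the fourth line of the sample text; the variable names are illustrative and not part of charset_normalizer.

```python
# Encode one word from the sample the way the test file stores it.
raw = "zašpičatělý".encode("iso8859_2")

# Decode the same bytes with the expected and the detected code page.
as_latin2 = raw.decode("iso8859_2")
as_cp1250 = raw.decode("cp1250")

print(as_latin2)
print(as_cp1250)

# Collect (index, byte, latin2 char, cp1250 char) wherever the decodings disagree.
diff = [
    (i, hex(raw[i]), a, b)
    for i, (a, b) in enumerate(zip(as_latin2, as_cp1250))
    if a != b
]
print(diff)
```

Only byte 0xb9 differs in this word, so the two decodings are nearly indistinguishable to a statistical detector.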

pawelzwronek added the detection and help wanted labels on Dec 4, 2024
@pawelzwronek
Author

There are more missed detections. I used this script to run all the tests from the uchardet 0.0.8 project:

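import os
from charset_normalizer import from_path

# Root directory holding the uchardet test files.
base_dir = "test"

# Walk every file under base_dir and run detection on it.
for root, dirs, files in os.walk(base_dir):
    for file in files:
        file_path = os.path.join(root, file)
        res = from_path(
            file_path,
            chunk_size=512,
            threshold=0.2,
            cp_exclusion=None,
            preemptive_behaviour=True,
            explain=False,
        )

        # best() returns None when no plausible encoding was found.
        best = res.best()
        if best is not None:
            # could_be_from_charset lists every encoding that yields the same decoded text.
            print(f"{file_path}: encoding: {best.could_be_from_charset}")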
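
A possible extension, sketched under the assumption that each test file is named after its expected encoding (as iso-8859-2.txt is). The helpers `expected_from_path` and `same_codec` are hypothetical names, not part of charset_normalizer or uchardet. They rely on the standard `codecs` registry to resolve aliases such as `iso-8859-2` and `iso8859_2` to one canonical name.

```python
import codecs
import os


def expected_from_path(file_path: str) -> str:
    """Hypothetical helper: assume the file stem is the expected encoding name."""
    return os.path.splitext(os.path.basename(file_path))[0]


def same_codec(name_a: str, name_b: str) -> bool:
    """Compare two encoding names after resolving aliases via the codec registry."""
    try:
        return codecs.lookup(name_a).name == codecs.lookup(name_b).name
    except LookupError:
        # At least one name is unknown to this interpreter: treat as a mismatch.
        return False


expected = expected_from_path("test/cs/iso-8859-2.txt")
print(expected, same_codec(expected, "iso8859_2"), same_codec(expected, "cp1250"))
```

Inside the loop, a file could then be reported as a miss when none of the names in `best.could_be_from_charset` satisfies `same_codec` against the expected name.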

pawelzwronek changed the title from "[DETECTION] Expected iso-8859-2 in chech text" to "[DETECTION] Expected iso-8859-2 in czech text" on Dec 4, 2024
@Ousret
Member

Ousret commented Dec 24, 2024

I took the time to assess this case and, unfortunately, we cannot provide a reliable patch to improve it.
The rendered texts differ by only a handful of characters, fewer than 5 if I recall correctly.

We will need much more sample data to be able to improve detection here.
What we generally advise is to take the three best guesses and present them to the end user; a native Czech reader will spot the typos immediately.
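
As a rough illustration of that advice, the sketch below renders one sample under several candidate code pages and marks where they disagree, so a reader can pick the right one. It uses only the standard library. The candidate list is supplied by hand here (two encodings that passed chaos probing in the log above, plus ascii to show a failing decode); in practice it would come from the detector's ranked results. The function names are hypothetical.

```python
def render_candidates(payload: bytes, candidates: list) -> dict:
    """Decode payload with each candidate, skipping code pages that cannot decode it."""
    renderings = {}
    for name in candidates:
        try:
            renderings[name] = payload.decode(name)
        except UnicodeDecodeError:
            continue
    return renderings


def differing_positions(renderings: dict) -> list:
    """Return the character indexes where the renderings do not all agree."""
    texts = list(renderings.values())
    shortest = min(len(text) for text in texts)
    return [i for i in range(shortest) if len({text[i] for text in texts}) > 1]


# Two words from the sample text, stored as the test file stores them.
payload = "oranžovou zašpičatělý".encode("iso8859_2")
renderings = render_candidates(payload, ["cp1250", "iso8859_2", "ascii"])
differing = differing_positions(renderings)

for name, text in renderings.items():
    print(f"{name:>10}: {text}")
# Caret line pointing at the characters a reader should inspect.
print(" " * 12 + "".join("^" if i in differing else " " for i in range(len(payload))))
```

The caret line is only meaningful for single-byte code pages, where byte and character positions coincide.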

If you cannot provide more files, we will have to close this issue.

Regards,

Ousret added the question label and removed the help wanted label on Dec 24, 2024