Wrong ordering from collate() #558

@bact

Description

pythainlp.util.collate() returns results in the wrong order,
because the current implementation ignores tone marks and symbols when sorting.

Try this code:

from pythainlp.util import collate

collate(["ก้วย", "ก๋วย", "ก่วย", "กวย", "ก้วย", "ก่วย", "ก๊วย"])

Expected results

Ordering according to Thai dictionary

['กวย', 'ก่วย', 'ก่วย', 'ก้วย', 'ก้วย', 'ก๊วย', 'ก๋วย']

Current results

['ก้วย', 'ก๋วย', 'ก่วย', 'ก้วย', 'ก่วย', 'ก๊วย', 'กวย']
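
One possible direction for a fix, as a rough sketch only: build a two-part sort key, where the first part is the word with tone marks stripped (and leading vowels swapped behind their consonant), and the second part is the sequence of tone marks, used as a tie-breaker. The names thai_sort_key and collate_sketch are hypothetical and not part of PyThaiNLP. The tie-breaking rule is a simplifying assumption that is only checked against the example in this issue, not against full Thai dictionary ordering, and other symbols (such as thanthakhat) are not handled.

```python
import re

# Leading vowels are written before the consonant but sort after it,
# so swap each (leading vowel, consonant) pair before comparing.
_LEADING_VOWEL = re.compile("([เแโใไ])([ก-ฮ])")

# Thai tone marks: mai ek, mai tho, mai tri, mai chattawa (U+0E48..U+0E4B).
_TONE_MARKS = re.compile("[\u0e48-\u0e4b]")


def thai_sort_key(word):
    """Hypothetical key: (text without tone marks, tone marks in order)."""
    swapped = _LEADING_VOWEL.sub(r"\2\1", word)
    base = _TONE_MARKS.sub("", swapped)
    # Assumption: tone marks only break ties between otherwise equal words,
    # and compare by code point (no mark < mai ek < mai tho < mai tri < mai chattawa).
    tones = "".join(_TONE_MARKS.findall(swapped))
    return (base, tones)


def collate_sketch(words, reverse=False):
    """Hypothetical stand-in for collate(), using the key above."""
    return sorted(words, key=thai_sort_key, reverse=reverse)


print(collate_sketch(["ก้วย", "ก๋วย", "ก่วย", "กวย", "ก้วย", "ก่วย", "ก๊วย"]))
```

Because the key keeps the tone marks instead of discarding them, words that differ only by tone no longer compare as equal, so the result does not depend on the input order.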

Your environment

  • PyThaiNLP version: 2.3.1

Files

pythainlp/util/collate.py

Proposed test case

import unittest

from pythainlp.util import collate


class TestUtilPackage(unittest.TestCase):

    # ### pythainlp.util.collate

    def test_collate(self):
        self.assertEqual(
            collate(["ก้วย", "ก๋วย", "กวย", "ก่วย", "ก๊วย"]),
            collate(["ก๋วย", "ก่วย", "ก้วย", "ก๊วย", "กวย"]),
        )  # should guarantee same order
        self.assertEqual(
            collate(["ก้วย", "ก๋วย", "ก่วย", "กวย", "ก้วย", "ก่วย", "ก๊วย"]),
            ["กวย", "ก่วย", "ก่วย", "ก้วย", "ก้วย", "ก๊วย", "ก๋วย"],
        )

Metadata

Assignees

No one assigned

Labels

  • Hacktoberfest (for Hacktoberfest event)
  • bug (bugs in the library)
  • help wanted (no contributor yet)
