Closed
Labels
Hacktoberfest (for Hacktoberfest event), bug (bugs in the library), help wanted (no contributor yet)
Description
pythainlp.util.collate() returns a wrong ordering because the current implementation ignores tone marks and symbols when sorting.
Try this code:
from pythainlp.util import collate
collate(["ก้วย", "ก๋วย", "ก่วย", "กวย", "ก้วย", "ก่วย", "ก๊วย"])
Expected results
Ordering according to the Thai dictionary:
['กวย', 'ก่วย', 'ก่วย', 'ก้วย', 'ก้วย', 'ก๊วย', 'ก๋วย']
Current results
['ก้วย', 'ก๋วย', 'ก่วย', 'ก้วย', 'ก่วย', 'ก๊วย', 'กวย']
Your environment
- PyThaiNLP version: 2.3.1
Files
pythainlp/util/collate.py
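One possible direction for a fix, as a minimal sketch only: sort on a two-level key, where the first level is the word with tone marks removed and the second level is the tone marks themselves, used only to break ties. This is not the code in pythainlp/util/collate.py. The name collate_sketch and both regular expressions are assumptions made for illustration. It covers only the four tone marks (U+0E48 to U+0E4B) and the usual swap of a leading vowel with the consonant after it. Other symbols are not handled.

```python
import re
from typing import Iterable, List, Tuple

# Assumption: only the four tone marks (mai ek, mai tho, mai tri,
# mai chattawa) are treated as tie-breakers. Their code point order
# matches their dictionary order.
_TONE_MARKS = re.compile("[\u0E48-\u0E4B]")

# Leading vowels (U+0E40..U+0E44) are written before the consonant but
# ordered after it, so swap each pair before comparing.
_LEADING_VOWEL = re.compile("([\u0E40-\u0E44])([\u0E01-\u0E2E])")


def _sort_key(word: str) -> Tuple[str, str]:
    """Return (primary, secondary) key for a Thai word (hypothetical helper)."""
    swapped = _LEADING_VOWEL.sub(r"\2\1", word)
    primary = _TONE_MARKS.sub("", swapped)            # consonants and vowels only
    secondary = "".join(_TONE_MARKS.findall(swapped))  # tone marks, in order
    return primary, secondary


def collate_sketch(words: Iterable[str], reverse: bool = False) -> List[str]:
    """Sort words with tone marks as a tie-breaker (illustrative only)."""
    return sorted(words, key=_sort_key, reverse=reverse)


if __name__ == "__main__":
    print(collate_sketch(["ก้วย", "ก๋วย", "ก่วย", "กวย", "ก้วย", "ก่วย", "ก๊วย"]))
    # ['กวย', 'ก่วย', 'ก่วย', 'ก้วย', 'ก้วย', 'ก๊วย', 'ก๋วย']
```

Because the key depends only on the word itself, the result does not depend on input order, which is what the first assertion in the proposed test case checks.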
Proposed test case
import unittest

from pythainlp.util import collate


class TestUtilPackage(unittest.TestCase):
    # ### pythainlp.util.collate
    def test_collate(self):
        # The same words in a different input order should give the same result
        self.assertEqual(
            collate(["ก้วย", "ก๋วย", "กวย", "ก่วย", "ก๊วย"]),
            collate(["ก๋วย", "ก่วย", "ก้วย", "ก๊วย", "กวย"]),
        )
        # Tone marks should follow Thai dictionary order
        self.assertEqual(
            collate(["ก้วย", "ก๋วย", "ก่วย", "กวย", "ก้วย", "ก่วย", "ก๊วย"]),
            ["กวย", "ก่วย", "ก่วย", "ก้วย", "ก้วย", "ก๊วย", "ก๋วย"],
        )