feat: [sc-26105] Add first/last name tokenizer to NameAI #606
def test_person_name_tokenizer_simple_names():
    """Verify tokenization of clear person names."""
    with init_person_name_tokenizer([]) as tokenizer:
        from nameai.data import get_resource_path
        import json

        with open(get_resource_path('tests/person_names_quality.json')) as f:
            quality_tests = json.load(f)

        failures = []
        for input_label, expected_tokens in quality_tests['simple_names'].items():
            tokenized_labels = list(tokenizer.tokenize_with_scores(input_label))
            expected_tuple = tuple(expected_tokens)
            found = False
            for tokens, score in tokenized_labels:
                if tokens == expected_tuple:
                    found = True
                    assert score > -float('inf'), f'Expected valid score for {input_label}'
                    break
            if not found:
                failures.append(f'Failed to find expected tokenization for {input_label}')

        if failures:
            print('\n=== PersonNameTokenizer Quality Test Failures [simple_names] ===')
            for failure in failures:
                print(failure)
            print(f'\nTotal failures: {len(failures)} out of {len(quality_tests)} test cases')
            assert False, 'Some tokenization quality tests failed. See above for details.'

def test_person_name_tokenizer_ambiguous_names():
    """Verify handling of ambiguous inputs that could be names."""
    with init_person_name_tokenizer([]) as tokenizer:
        from nameai.data import get_resource_path
        import json

        with open(get_resource_path('tests/person_names_quality.json')) as f:
            quality_tests = json.load(f)

        failures = []
        for input_label, interpretation2expected_tokens in quality_tests['ambiguous_names'].items():
            tokenized_labels = list(tokenizer.tokenize_with_scores(input_label))
            if interpretation2expected_tokens['person_name'] is not None:
                person_name_tokens = tuple(interpretation2expected_tokens['person_name'])
                found = False
                for tokens, score in tokenized_labels:
                    if tokens == person_name_tokens:
                        found = True
                        assert score > -float('inf'), f'Expected valid score for {input_label}'
                        break
                if not found:
                    failures.append(f'Failed to find person name tokenization for {input_label}')

        if failures:
            print('\n=== PersonNameTokenizer Quality Test Failures [ambiguous_names] ===')
            for failure in failures:
                print(failure)
            print(f'\nTotal failures: {len(failures)} out of {len(quality_tests)} test cases')
            assert False, 'Some tokenization quality tests failed. See above for details.'

def test_person_name_tokenizer_non_names_low_scores():
    """Verify that non-name inputs get low (< 1e-10) probability scores."""
    with init_person_name_tokenizer([]) as tokenizer:
        from nameai.data import get_resource_path
        import json
        import math

        with open(get_resource_path('tests/person_names_quality.json')) as f:
            quality_tests = json.load(f)

        failures = []
        for input_label in quality_tests['non_names'].keys():
            tokenized_labels = list(tokenizer.tokenize_with_scores(input_label))
            for tokens, log_prob in tokenized_labels:
                if log_prob >= math.log(1e-10):
                    failures.append(f'Expected very low score for non-name {input_label}, got {log_prob}')

        if failures:
            print('\n=== PersonNameTokenizer Quality Test Failures [non_names] ===')
            for failure in failures:
                print(failure)
            print(f'\nTotal failures: {len(failures)} out of {len(quality_tests)} test cases')
            assert False, 'Some tokenization quality tests failed. See above for details.'
Are these tests simply adding a probability score check compared to those from test_nlp_inspector.py?
In test_tokenizer.py separate tokenizers are tested (AllTokenizer and PersonNamesTokenizer). In test_nlp_inspector.py the tokenizations come from both tokenizers (the merging is done in NLPInspector). So these tests are for different levels of the tokenization pipeline.
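A rough sketch of that distinction, using only names that appear in this PR (init_person_name_tokenizer, tokenize_with_scores); the import location, the example label, and the NLPInspector behaviour described in the comments are assumptions, not confirmed API:

# Level 1 (test_tokenizer.py): each tokenizer is exercised on its own.
# Assumes init_person_name_tokenizer is importable from the surrounding test module.
with init_person_name_tokenizer([]) as tokenizer:
    for tokens, log_score in tokenizer.tokenize_with_scores('johnsmith'):
        # For name-like inputs a split such as ('john', 'smith') should appear
        # with a finite log score.
        print(tokens, log_score)

# Level 2 (test_nlp_inspector.py): NLPInspector merges candidate tokenizations
# from AllTokenizer and PersonNamesTokenizer, so those tests see the combined
# output rather than either tokenizer in isolation.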
It all looks good to me. One thing that bothers me is maintaining two separate implementations of the same functionality. Could we consider substituting this functionality in NameGraph with the implementation from here? @djstrong
Story details: https://app.shortcut.com/ps-web3/story/26105
todo:
- add S3 env vars to .env.example
- make bucket public
- run python -m nameai.download in CI/CD and deployment scripts
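A possible shape for that deployment hook, as a sketch only: the NAMEAI_S3_BUCKET variable name is hypothetical (whatever actually lands in .env.example), and only the python -m nameai.download command comes from this todo:

import os
import subprocess

# Hypothetical env var name standing in for the S3 settings added to .env.example.
assert os.environ.get('NAMEAI_S3_BUCKET'), 'configure S3 env vars before downloading'

# Fetch the NameAI data during deployment, as listed in the todo above.
subprocess.run(['python', '-m', 'nameai.download'], check=True)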