Add new nvtext::normalize_characters API #17818

davidwendt · 2025-01-24T21:29:14Z

Description

Adds new normalizer APIs as part of the rework for the subword-tokenizer.
The new API is split into two parts. First, a normalizer object is created with the appropriate state: the lower-case flag and the special tokens. The normalizing tables are currently hardcoded inside libcudf; future versions may load these tables from some other source. The second API takes the input strings column and the normalizer object and returns a normalized strings column. The normalizer object can be reused on all subsequent normalize_characters calls.

The current nvtext::normalize_characters loads the normalizing tables on each call, which can add significant overhead. That API will be deprecated and replaced by the two new ones. Some utility functions from its implementation have been refactored so that both the old and new code can use them until the old one is removed.

The first API creates the normalizer object.

std::unique_ptr<character_normalizer> create_character_normalizer(
  bool do_lower_case,
  cudf::strings_column_view const& special_tokens,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);

The 2nd API uses the normalizer on a strings column:

std::unique_ptr<cudf::column> normalize_characters(
  cudf::strings_column_view const& input,
  character_normalizer const& normalizer,
  rmm::cuda_stream_view stream,
  rmm::device_async_resource_ref mr);
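
The build-once, reuse-many split described above can be illustrated with a small host-only sketch. This is not libcudf code: toy_normalizer and toy_normalize are hypothetical names, the table is a plain 256-entry ASCII map, and special tokens, streams, and memory resources are left out entirely. It only shows why moving table construction into a reusable object removes per-call setup cost.

```cpp
#include <array>
#include <cctype>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a normalizer object: the lookup table is
// built once in the constructor and then only read by normalize calls.
class toy_normalizer {
 public:
  explicit toy_normalizer(bool do_lower_case)
  {
    for (int c = 0; c < 256; ++c) {
      // std::tolower is well-defined here because c fits in unsigned char.
      _table[c] = static_cast<char>(do_lower_case ? std::tolower(c) : c);
    }
  }

  // Map a single character through the prebuilt table.
  char map(char c) const { return _table[static_cast<unsigned char>(c)]; }

 private:
  std::array<char, 256> _table{};
};

// Hypothetical second-stage function: takes the input and an existing
// normalizer, so no table is rebuilt on each call.
std::vector<std::string> toy_normalize(std::vector<std::string> const& input,
                                       toy_normalizer const& normalizer)
{
  std::vector<std::string> output;
  output.reserve(input.size());
  for (auto const& s : input) {
    std::string r;
    r.reserve(s.size());
    for (char c : s) { r.push_back(normalizer.map(c)); }
    output.push_back(std::move(r));
  }
  return output;
}
```

A caller would construct one toy_normalizer and pass it to every toy_normalize call, which mirrors how the normalizer object is meant to be reused across normalize_characters calls.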

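Using the Python interface:

import cudf
from cudf.core.character_normalizer import CharacterNormalizer

# Example input; any cudf.Series of strings works here.
input_strings = cudf.Series(["Hello World", "ABC def"])

# Create the normalizer once, then reuse it for each normalize call.
cn = CharacterNormalizer(do_lower=False)
sn = cn.normalize(input_strings)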

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2025-01-24T21:29:18Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

davidwendt · 2025-01-30T21:58:21Z

/ok to test

davidwendt · 2025-01-30T23:39:07Z

/ok to test

kingcrimsontianyu

Lgtm. Thanks for the work!

karthikeyann

minor nit.

karthikeyann · 2025-02-19T16:29:03Z

cpp/src/text/normalize.cu

+                                                      std::move(tokens_view));
+}
+
+character_normalizer::~character_normalizer() {}


Since we moved to using unique_ptr, we can default this destructor.

davidwendt · 2025-02-19

Actually, the compiler does not like that. With a ~character_normalizer() = default; declaration in the header, the compiler tries to generate the destructor in each including TU (such as the test and benchmark .cpp files), but the _impl class type is incomplete there, so it cannot determine its size and reports an error. Keeping this empty destructor defined here means the compiler will not try to generate the destructor on its own (and fail at it).
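
The destructor issue discussed in this thread can be reproduced in isolation. The sketch below is a minimal single-file illustration, not the libcudf source: normalizer_handle, impl, and client_roundtrip are hypothetical names, and the "header" and "source" parts are simply placed in order within one file. The client function stands in for a test or benchmark TU that sees only the class declaration.

```cpp
#include <cassert>
#include <memory>

// ---- "header" part: everything an including TU would see ----
class normalizer_handle {  // hypothetical name
 public:
  explicit normalizer_handle(bool do_lower_case);
  // Declared only. Writing "= default" here instead would make the
  // compiler define the destructor wherever it is used, and
  // std::unique_ptr<impl> cannot delete an incomplete type there.
  ~normalizer_handle();
  bool lower_case() const;

 private:
  struct impl;                  // incomplete at this point
  std::unique_ptr<impl> _impl;  // allowed as a member of incomplete type
};

// Stand-in for client code in another TU: impl is still incomplete here,
// yet creating and destroying the handle compiles because the destructor
// is only declared above, not generated at this point.
inline bool client_roundtrip(bool flag)
{
  normalizer_handle h(flag);
  return h.lower_case();
}

// ---- "source" part: impl is defined before the special members ----
struct normalizer_handle::impl {
  bool do_lower_case;
};

normalizer_handle::normalizer_handle(bool do_lower_case)
  : _impl(std::make_unique<impl>(impl{do_lower_case}))
{
}

// Empty out-of-line destructor: impl is complete here, so the
// unique_ptr member can be destroyed.
normalizer_handle::~normalizer_handle() {}

bool normalizer_handle::lower_case() const
{
  assert(_impl != nullptr);
  return _impl->do_lower_case;
}
```

Calling client_roundtrip(true) from a main function exercises construction and destruction through the opaque pointer. The key design point is that the destructor body, empty or not, lives where the implementation type is fully defined.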

Add new nvtext::normalize_characters API

dd51eb3

davidwendt added the 2 - In Progress, libcudf, strings, improvement, and non-breaking labels Jan 24, 2025

davidwendt self-assigned this Jan 24, 2025

davidwendt added 8 commits January 24, 2025 17:49

Merge branch 'branch-25.04' into new-normalizer-apis

a8446c5

Merge branch 'branch-25.04' into new-normalizer-apis

74cbed0

add special-tokens column

2dd819a

Merge branch 'new-normalizer-apis' of github.com:davidwendt/cudf into new-normalizer-apis

0777591

add remove_copy_safe

a29a405

add special_tokens_kernel

5b21c0a

add python and pylibcudf interfaces

e37bf3f

Merge branch 'branch-25.04' into new-normalizer-apis

b31499e

github-actions bot added the Python and pylibcudf labels Jan 29, 2025

davidwendt added 8 commits January 29, 2025 09:30

Merge branch 'branch-25.04' into new-normalizer-apis

8194534

add block-store

ac4c5a8

Merge branch 'branch-25.04' into new-normalizer-apis

b575dbb

Merge branch 'branch-25.04' into new-normalizer-apis

f501627

fix block-store algo type

0df89af

Merge branch 'branch-25.04' into new-normalizer-apis

0b9abba

update nvtx range names

51cae79

replace d_sizes with transform-iterator

091cb84

davidwendt added 2 commits January 30, 2025 16:59

Merge branch 'branch-25.04' into new-normalizer-apis

5aa65ff

fix pylibcudf normalize-characters pytest

deb33ac

davidwendt added 3 commits February 11, 2025 17:13

Merge branch 'new-normalizer-apis' of github.com:davidwendt/cudf into new-normalizer-apis

f980d86

Merge branch 'branch-25.04' into new-normalizer-apis

640d8c9

Merge branch 'branch-25.04' into new-normalizer-apis

6d4a670

karthikeyann reviewed Feb 13, 2025

cpp/tests/text/normalize_tests.cpp (resolved)

cpp/src/text/normalize.cu (outdated, resolved)

cpp/src/text/normalize.cu (resolved)

davidwendt added 3 commits February 13, 2025 08:22

Merge branch 'branch-25.04' into new-normalizer-apis

dbd528b

add get_first_and_last_offset utility

10e3e4b

Merge branch 'branch-25.04' into new-normalizer-apis

d308cfb

davidwendt requested a review from karthikeyann February 13, 2025 19:45

Merge branch 'branch-25.04' into new-normalizer-apis

839bb1b

kingcrimsontianyu reviewed Feb 14, 2025

cpp/include/nvtext/normalize.hpp (outdated, resolved)

kingcrimsontianyu reviewed Feb 14, 2025

cpp/src/text/normalize.cu (resolved)

kingcrimsontianyu reviewed Feb 14, 2025

cpp/tests/text/normalize_tests.cpp (resolved)

davidwendt added 2 commits February 14, 2025 19:03

Merge branch 'branch-25.04' into new-normalizer-apis

936c4ce

Merge branch 'branch-25.04' into new-normalizer-apis

373971e

kingcrimsontianyu reviewed Feb 18, 2025

cpp/src/text/normalize.cu (resolved)

davidwendt added 6 commits February 18, 2025 11:57

Merge branch 'branch-25.04' into new-normalizer-apis

c08ae75

change _impl to smart-pointer

29bee66

Merge branch 'branch-25.04' into new-normalizer-apis

dab936a

Merge branch 'new-normalizer-apis' of github.com:davidwendt/cudf into new-normalizer-apis

0960987

fix merge conflict

8f7b16e

fix docstring

e3fdb2d

davidwendt requested a review from kingcrimsontianyu February 19, 2025 13:09

kingcrimsontianyu approved these changes Feb 19, 2025

karthikeyann approved these changes Feb 19, 2025

davidwendt and others added 4 commits February 20, 2025 04:31

Merge branch 'branch-25.04' into new-normalizer-apis

8f46b89

Merge branch 'branch-25.04' into new-normalizer-apis

9a15314

Merge branch 'branch-25.04' into new-normalizer-apis

d31a760

add deprecation warning

56041b5

davidwendt requested a review from Matt711 February 21, 2025 16:24

Merge branch 'branch-25.04' into new-normalizer-apis

79e5c26
