Add new nvtext::normalize_characters API #17818
base: branch-25.04
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
/ok to test
/ok to test
@kingcrimsontianyu and @karthikeyann would you two please lead the review for this contribution?
cpp/include/nvtext/normalize.hpp (outdated)
~character_normalizer();

struct character_normalizer_impl;
character_normalizer_impl* _impl{};
Would it be better to use std::unique_ptr<character_normalizer_impl> for the impl data member?
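For context, a minimal sketch of the unique_ptr-based pimpl variant being suggested; this abbreviates the class and is not the actual cudf header:

```cpp
#include <memory>

struct character_normalizer_impl;  // incomplete here; defined in the source file

struct character_normalizer {
  character_normalizer();
  // Declared here but defined in the source file where character_normalizer_impl
  // is a complete type, so std::unique_ptr's deleter can be instantiated.
  ~character_normalizer();

  std::unique_ptr<character_normalizer_impl> _impl;
};
```

The trade-off is that the destructor (and any move operations) must be defined out of line, but the raw pointer and manual cleanup go away.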
"P^NP", | ||
"$41.07", | ||
"[a,b]", | ||
"丏丟", |
Looked it up in the dictionary as I thought this was a literary word I didn't know 😃
I don't really know what this is, but it tests the normalizer paths I want. Hopefully it is not something offensive.
T const& value,
rmm::cuda_stream_view stream)
{
auto const copy_size = std::min(static_cast<std::size_t>(std::distance(first, last)),
Just a question: Is this because some iterators in libcudf or user-defined iterators may (incorrectly) use int as the iterator's difference_type as opposed to std::ptrdiff_t? I'm curious when this usually happens.
Many of the thrust functions have 2 different code paths depending on the range of the input iterators -- 32-bit and 64-bit. Unfortunately, building with thrust pulls in both even if we only use 1 of them. This bloats the code up to 2x in some cases. We generally only require 32-bit (size_type) iterator ranges, and we are able to compile out the thrust 64-bit iterator paths with some strategic patching in our cmake -- significantly reducing our binary size. Note, the bloat issue is something the CCCL team is working on, so the patch we use is meant to be temporary.
Anyway, this means in places where we actually need a 64-bit iterator range (like for reading the large strings character vectors we currently support), we need to work around the missing 64-bit range thrust APIs. The remove_copy_safe function and the remove_safe utility below work around the limit by calling the underlying thrust functions in batches of max<int> counts.
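For illustration, a minimal sketch of that batching idea, using a hypothetical helper (not the actual remove_copy_safe code) and the default thrust::device policy rather than the stream-aware policy cudf uses:

```cpp
#include <thrust/execution_policy.h>
#include <thrust/remove.h>

#include <algorithm>
#include <cstddef>
#include <limits>

// Hypothetical helper: runs thrust::remove_copy over a potentially 64-bit range
// in chunks no larger than max<int>, so only the 32-bit iterator path is needed.
template <typename InputIterator, typename OutputIterator, typename T>
OutputIterator remove_copy_in_batches(InputIterator first,
                                      InputIterator last,
                                      OutputIterator result,
                                      T const& value)
{
  auto const max_batch = static_cast<std::size_t>(std::numeric_limits<int>::max());
  auto remaining       = static_cast<std::size_t>(std::distance(first, last));
  while (remaining > 0) {
    auto const batch = std::min(remaining, max_batch);
    // Each call stays within a 32-bit iterator distance.
    result = thrust::remove_copy(thrust::device, first, first + batch, result, value);
    first += batch;
    remaining -= batch;
  }
  return result;
}
```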
Description
Adds new normalizer APIs as part of the rework for the subword-tokenizer.
The new API is split into 2 parts. First, a normalizer object is created with the appropriate state: lower-casing and special tokens. The normalizing tables are currently hardcoded inside libcudf; future versions may load these tables from some other source. The 2nd API is given the input strings column and the normalizer object and returns a normalized strings column. The normalizer object can be reused on all subsequent normalize_characters calls.

The current nvtext::normalize_characters loads the normalizing tables on each call, which can be significant overhead. That API will be deprecated and replaced by these 2 new ones. Some utility functions from its implementation have been refactored to be used by both until the old one is removed.

The first API creates the normalizer object:
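For example, creating it might look roughly like the following; the factory name create_character_normalizer and its parameters (a lower-casing flag plus a special-tokens column) are inferred from the description above rather than quoted from the final header:

```cpp
#include <nvtext/normalize.hpp>

#include <cudf/strings/strings_column_view.hpp>

#include <memory>

// Hypothetical usage sketch; parameter names and order are illustrative only.
std::unique_ptr<nvtext::character_normalizer> make_normalizer(
  cudf::strings_column_view const& special_tokens)
{
  // The normalizer state (lower-casing and special tokens) is captured once here.
  return nvtext::create_character_normalizer(/*do_lower_case=*/true, special_tokens);
}
```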
The 2nd API uses the normalizer on a strings column:
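Continuing the sketch above, applying it might look like this; the exact overload is again an assumption, the point being that one normalizer instance is reused across calls:

```cpp
// Hypothetical usage sketch; the exact overload may differ in the final API.
std::unique_ptr<cudf::column> normalize_all(cudf::strings_column_view const& input,
                                            nvtext::character_normalizer const& normalizer)
{
  // Reusing the same normalizer avoids reloading the normalizing tables on
  // every call, unlike the soon-to-be-deprecated nvtext::normalize_characters.
  return nvtext::normalize_characters(input, normalizer);
}
```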
The new normalizer is also exposed through the Python interface.
Checklist