Add new nvtext::normalize_characters API #17818

Open
wants to merge 62 commits into
base: branch-25.04
dd51eb3
Add new nvtext::normalize_characters API
davidwendt Jan 24, 2025
a8446c5
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Jan 24, 2025
74cbed0
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Jan 27, 2025
2dd819a
add special-tokens column
davidwendt Jan 27, 2025
0777591
Merge branch 'new-normalizer-apis' of github.com:davidwendt/cudf into…
davidwendt Jan 27, 2025
a29a405
add remove_copy_safe
davidwendt Jan 27, 2025
5b21c0a
add special_tokens_kernel
davidwendt Jan 28, 2025
e37bf3f
add python and pylibcudf interfaces
davidwendt Jan 29, 2025
b31499e
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Jan 29, 2025
8194534
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Jan 29, 2025
ac4c5a8
add block-store
davidwendt Jan 29, 2025
b575dbb
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Jan 29, 2025
f501627
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Jan 29, 2025
0df89af
fix block-store algo type
davidwendt Jan 30, 2025
0b9abba
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Jan 30, 2025
51cae79
update nvtx range names
davidwendt Jan 30, 2025
091cb84
replace d_sizes with transform-iterator
davidwendt Jan 30, 2025
5aa65ff
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Jan 30, 2025
deb33ac
fix pylibcudf normalize-characters pytest
davidwendt Jan 30, 2025
ac1544d
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 3, 2025
e96bc21
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 3, 2025
f2f35a6
add more gtests and pytests
davidwendt Feb 3, 2025
6530637
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 4, 2025
fc86efa
add longer strings to gtests
davidwendt Feb 4, 2025
ce0c1d0
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 4, 2025
756fa6d
fix typos
davidwendt Feb 4, 2025
2a2b9da
remove unneeded includes
davidwendt Feb 4, 2025
c205062
remove unneeded variable
davidwendt Feb 4, 2025
fbe42f1
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 4, 2025
d2ff0c5
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 5, 2025
f0986a1
Merge branch 'new-normalizer-apis' of github.com:davidwendt/cudf into…
davidwendt Feb 5, 2025
7bf1dc5
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 5, 2025
803d36b
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 5, 2025
3d48c6f
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 6, 2025
b5c0ebc
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 7, 2025
e51d69b
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 7, 2025
b497d88
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 7, 2025
b46df42
fix comment formatting in python source
davidwendt Feb 7, 2025
af35440
Merge branch 'new-normalizer-apis' of github.com:davidwendt/cudf into…
davidwendt Feb 7, 2025
010b730
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 10, 2025
525b4db
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 10, 2025
2d92791
change min() to cuda::std::min()
davidwendt Feb 10, 2025
2e4baf9
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 11, 2025
967c68f
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 11, 2025
4919a07
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 11, 2025
f980d86
Merge branch 'new-normalizer-apis' of github.com:davidwendt/cudf into…
davidwendt Feb 11, 2025
640d8c9
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 12, 2025
6d4a670
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 12, 2025
dbd528b
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 13, 2025
10e3e4b
add get_first_and_last_offset utility
davidwendt Feb 13, 2025
d308cfb
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 13, 2025
839bb1b
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 14, 2025
936c4ce
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 15, 2025
373971e
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 18, 2025
c08ae75
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 18, 2025
29bee66
change _impl to smart-pointer
davidwendt Feb 18, 2025
dab936a
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 18, 2025
0960987
Merge branch 'new-normalizer-apis' of github.com:davidwendt/cudf into…
davidwendt Feb 19, 2025
8f7b16e
fix merge conflict
davidwendt Feb 19, 2025
e3fdb2d
fix docstring
davidwendt Feb 19, 2025
8f46b89
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 20, 2025
9a15314
Merge branch 'branch-25.04' into new-normalizer-apis
davidwendt Feb 20, 2025
9 changes: 6 additions & 3 deletions cpp/benchmarks/text/normalize.cpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2021-2024, NVIDIA CORPORATION.
* Copyright (c) 2021-2025, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -48,15 +48,18 @@ static void bench_normalize(nvbench::state& state)
[&](nvbench::launch& launch) { auto result = nvtext::normalize_spaces(input); });
} else {
bool const to_lower = (normalize_type == "to_lower");
// we expect the normalizer to be created once and re-used
// so creating it is not measured
auto normalizer = nvtext::create_character_normalizer(to_lower);
state.exec(nvbench::exec_tag::sync, [&](nvbench::launch& launch) {
auto result = nvtext::normalize_characters(input, to_lower);
auto result = nvtext::normalize_characters(input, *normalizer);
});
}
}

NVBENCH_BENCH(bench_normalize)
.set_name("normalize")
.add_int64_axis("min_width", {0})
.add_int64_axis("max_width", {32, 64, 128, 256})
.add_int64_axis("max_width", {128, 256})
.add_int64_axis("num_rows", {32768, 262144, 2097152})
.add_string_axis("type", {"spaces", "characters", "to_lower"});
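The benchmark change above moves normalizer creation outside the timed region because the object is meant to be built once and reused. The following is a minimal host-side sketch of that pattern, not the cudf implementation: `table_normalizer` and `normalize_all` are hypothetical names, and the 256-entry byte table merely stands in for the real (much larger) normalization tables.

```cpp
#include <array>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a normalizer object: the constructor builds a
// lookup table (the costly part) and apply() only reads from it.
struct table_normalizer {
  std::array<char, 256> table{};

  explicit table_normalizer(bool to_lower)
  {
    for (int c = 0; c < 256; ++c) {
      table[static_cast<std::size_t>(c)] = static_cast<char>(to_lower ? std::tolower(c) : c);
    }
  }

  // Maps each input byte through the prebuilt table.
  std::string apply(std::string const& input) const
  {
    std::string out(input.size(), '\0');
    for (std::size_t i = 0; i < input.size(); ++i) {
      out[i] = table[static_cast<unsigned char>(input[i])];
    }
    return out;
  }
};

// The caller constructs the normalizer once and passes it by reference,
// so the table is never rebuilt per row or per batch.
std::vector<std::string> normalize_all(std::vector<std::string> const& rows,
                                       table_normalizer const& normalizer)
{
  std::vector<std::string> result;
  result.reserve(rows.size());
  for (auto const& row : rows) {
    result.push_back(normalizer.apply(row));
  }
  return result;
}
```

Timing only `normalize_all` in such a setup measures the per-call cost, which is the same reasoning the benchmark comment gives for not measuring creation.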
14 changes: 13 additions & 1 deletion cpp/include/cudf/strings/detail/utilities.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2024, NVIDIA CORPORATION.
* Copyright (c) 2019-2025, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -96,5 +96,17 @@ int64_t get_offset_value(cudf::column_view const& offsets,
size_type index,
rmm::cuda_stream_view stream);

/**
* @brief Return the first and last offset in the given strings column
*
* This accounts for sliced input columns as well.
*
* @param input Strings column
* @param stream CUDA stream used for device memory operations and kernel launches
* @return First and last offset values
*/
std::pair<int64_t, int64_t> get_first_and_last_offset(cudf::strings_column_view const& input,
rmm::cuda_stream_view stream);

} // namespace strings::detail
} // namespace CUDF_EXPORT cudf
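To illustrate what "accounts for sliced input columns" means for `get_first_and_last_offset`, here is a host-only sketch under simplifying assumptions. `toy_strings_view` and `first_and_last_offset` are hypothetical names; the real function reads offsets from device memory through `get_offset_value` and takes a CUDA stream.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical host-side model of a strings column view. A slice shares the
// parent's offsets array and only records which rows it covers.
struct toy_strings_view {
  std::vector<std::int64_t> offsets;  // parent offsets: one more entry than parent rows
  std::size_t offset;                 // index of the first row in this view
  std::size_t size;                   // number of rows in this view
};

// Returns the byte positions where the view's character data begins and ends.
std::pair<std::int64_t, std::int64_t> first_and_last_offset(toy_strings_view const& view)
{
  if (view.size == 0) { return {0, 0}; }
  // For an unsliced view, view.offset is 0, so the first lookup reads offsets[0].
  return {view.offsets[view.offset], view.offsets[view.offset + view.size]};
}
```

For the rows `["ab", "cde", "", "f"]` the parent offsets are `[0, 2, 5, 5, 6]`. A view over rows 1 and 2 (`offset = 1`, `size = 2`) yields the pair `(2, 5)`, so it spans 3 bytes even though the parent holds 6.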
111 changes: 110 additions & 1 deletion cpp/include/nvtext/normalize.hpp
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2020-2024, NVIDIA CORPORATION.
* Copyright (c) 2020-2025, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -16,6 +16,7 @@
#pragma once

#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf/utilities/export.hpp>
#include <cudf/utilities/memory_resource.hpp>
@@ -107,5 +108,113 @@ std::unique_ptr<cudf::column> normalize_characters(
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());

/**
* @brief Normalizer object to be used with nvtext::normalize_characters
*
* Use nvtext::create_character_normalizer to create this object.
*
* This normalizer includes:
*
* - adding padding around punctuation (unicode category starts with "P")
* as well as certain ASCII symbols like "^" and "$"
* - adding padding around the [CJK Unicode block
* characters](https://en.wikipedia.org/wiki/CJK_Unified_Ideographs_(Unicode_block))
* - changing whitespace (e.g. `"\t", "\n", "\r"`) to just space `" "`
* - removing control characters (unicode categories "Cc" and "Cf")
*
* The padding process adds a single space before and after the character.
* Details on _unicode category_ can be found here:
* https://unicodebook.readthedocs.io/unicode.html#categories
*
* If `do_lower_case = true`, lower-casing also removes any accents. The
* accents cannot be removed from upper-case characters without lower-casing
* and lower-casing cannot be performed without also removing accents.
* However, if the accented character is already lower-case, then only the
* accent is removed.
*
* If `special_tokens` are included, the padding after `[` and before `]` is not
* inserted when the characters between them match one of the given tokens.
* Each special token is expected to include the `[` and `]` characters
* at its beginning and end.
*/
struct character_normalizer {
/**
* @brief Normalizer object constructor
*
* This initializes and holds the character normalizing tables and settings.
*
* @param do_lower_case If true, upper-case characters are converted to
* lower-case and accents are stripped from those characters.
* If false, accented and upper-case characters are not transformed.
* @param special_tokens Each row is a token including the `[]` brackets.
* For example: `[BOS]`, `[EOS]`, `[UNK]`, `[SEP]`, `[PAD]`, `[CLS]`, `[MASK]`
* @param stream CUDA stream used for device memory operations and kernel launches
* @param mr Device memory resource used to allocate the normalizer's device memory
*/
character_normalizer(bool do_lower_case,
cudf::strings_column_view const& special_tokens,
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());
~character_normalizer();

struct character_normalizer_impl;
std::unique_ptr<character_normalizer_impl> _impl;
};
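The normalization rules listed in the `character_normalizer` documentation can be approximated on the host for ASCII input. This is a rough sketch under stated assumptions, not the cudf kernel: `toy_normalize` is a hypothetical name, it treats every `std::ispunct` character as one that gets padded, it passes non-ASCII bytes through unchanged, and it performs no accent removal.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// ASCII-only approximation of the documented rules (hypothetical helper).
std::string toy_normalize(std::string const& input, bool do_lower_case)
{
  std::string out;
  for (unsigned char c : input) {
    if (std::isspace(c)) {
      out.push_back(' ');  // "\t", "\n", "\r" and similar become a single space
    } else if (std::iscntrl(c)) {
      continue;  // remaining control characters are dropped
    } else if (std::ispunct(c)) {
      out.push_back(' ');  // one space before the character
      out.push_back(static_cast<char>(c));
      out.push_back(' ');  // and one space after it
    } else {
      out.push_back(static_cast<char>(do_lower_case ? std::tolower(c) : c));
    }
  }
  return out;
}
```

With these rules `"$24.08"` becomes `" $ 24 . 08"` and `"[a,bb]"` becomes `" [ a , bb ] "`, matching the ASCII rows of the example in the `normalize_characters` documentation.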

/**
* @brief Create a normalizer object
*
* Creates a normalizer object which can be reused on multiple calls to
* nvtext::normalize_characters
*
* @see nvtext::character_normalizer
*
* @param do_lower_case If true, upper-case characters are converted to
* lower-case and accents are stripped from those characters.
* If false, accented and upper-case characters are not transformed.
* @param special_tokens Individual tokens including `[]` brackets.
* Default is no special tokens.
* @param stream CUDA stream used for device memory operations and kernel launches
* @param mr Device memory resource used to allocate the normalizer's device memory
* @return Object to be used with nvtext::normalize_characters
*/
std::unique_ptr<character_normalizer> create_character_normalizer(
bool do_lower_case,
cudf::strings_column_view const& special_tokens = cudf::strings_column_view(cudf::column_view{
cudf::data_type{cudf::type_id::STRING}, 0, nullptr, nullptr, 0}),
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());
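The `character_normalizer` struct keeps its state behind a `std::unique_ptr` to a forward-declared implementation type, and a factory function returns the object by `std::unique_ptr`. A single-file sketch of that arrangement follows; all names (`pimpl_normalizer`, `create_pimpl_normalizer`, `lower_casing`) are hypothetical and the implementation struct holds only one flag for illustration.

```cpp
#include <cassert>
#include <memory>

// "Header" part: the implementation type is only forward-declared here.
struct pimpl_normalizer {
  explicit pimpl_normalizer(bool do_lower_case);
  // Declared here and defined later, once impl is a complete type;
  // std::unique_ptr needs the complete type to destroy the object.
  ~pimpl_normalizer();

  struct impl;
  std::unique_ptr<impl> _impl;
};

// Factory returning an owning pointer to the normalizer.
std::unique_ptr<pimpl_normalizer> create_pimpl_normalizer(bool do_lower_case);

// Accessor used only to observe the stored setting in this sketch.
bool lower_casing(pimpl_normalizer const& normalizer);

// "Source" part: impl is completed, then the members are defined.
struct pimpl_normalizer::impl {
  bool do_lower_case;
};

pimpl_normalizer::pimpl_normalizer(bool do_lower_case) : _impl(new impl{do_lower_case}) {}

pimpl_normalizer::~pimpl_normalizer() = default;

std::unique_ptr<pimpl_normalizer> create_pimpl_normalizer(bool do_lower_case)
{
  return std::make_unique<pimpl_normalizer>(do_lower_case);
}

bool lower_casing(pimpl_normalizer const& normalizer) { return normalizer._impl->do_lower_case; }
```

Declaring the destructor in the struct and defining it after the implementation type is complete is what lets code that only sees the "header" part hold and destroy the object without knowing the layout of `impl`.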

/**
* @brief Normalizes the text in input strings column
*
* @see nvtext::character_normalizer for details on the normalizer behavior
*
* @code{.pseudo}
* cn = create_character_normalizer(true)
* s = ["éâîô\teaio", "ĂĆĖÑÜ", "ACENU", "$24.08", "[a,bb]"]
* s1 = normalize_characters(s,cn)
* s1 is now ["eaio eaio", "acenu", "acenu", " $ 24 . 08", " [ a , bb ] "]
*
* cn = create_character_normalizer(false)
* s2 = normalize_characters(s,cn)
* s2 is now ["éâîô eaio", "ĂĆĖÑÜ", "ACENU", " $ 24 . 08", " [ a , bb ] "]
* @endcode
*
* A null input element at row `i` produces a corresponding null entry
* for row `i` in the output column.
*
* @param input The input strings to normalize
* @param normalizer Normalizer to use for this function
* @param stream CUDA stream used for device memory operations and kernel launches
* @param mr Memory resource to allocate any returned objects
* @return Normalized strings column
*/
std::unique_ptr<cudf::column> normalize_characters(
cudf::strings_column_view const& input,
character_normalizer const& normalizer,
rmm::cuda_stream_view stream = cudf::get_default_stream(),
rmm::device_async_resource_ref mr = cudf::get_current_device_resource_ref());

/** @} */ // end of group
} // namespace CUDF_EXPORT nvtext
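The special-token rule described for `character_normalizer` (no padding after `[` or before `]` when the bracketed text matches a given token) can be sketched on the host as follows. This is an assumption-laden illustration rather than the kernel added in this PR: `pad_with_special_tokens` is a hypothetical name, matching is exact and case-sensitive, only ASCII punctuation is padded, and lower-casing is left out.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Pads ASCII punctuation with spaces, but keeps any listed special token
// (brackets included) intact and pads only around the whole token.
std::string pad_with_special_tokens(std::string const& input,
                                    std::vector<std::string> const& special_tokens)
{
  std::string out;
  std::size_t i = 0;
  while (i < input.size()) {
    auto const c = static_cast<unsigned char>(input[i]);
    if (c == '[') {
      bool matched = false;
      for (auto const& token : special_tokens) {
        // compare() returns 0 only if the text at position i equals the token
        if (!token.empty() && input.compare(i, token.size(), token) == 0) {
          out += ' ';
          out += token;
          out += ' ';
          i += token.size();
          matched = true;
          break;
        }
      }
      if (matched) { continue; }
    }
    if (std::ispunct(c)) {
      out += ' ';
      out += static_cast<char>(c);
      out += ' ';
    } else {
      out += static_cast<char>(c);
    }
    ++i;
  }
  return out;
}
```

Under these assumptions, `"[BOS]hi[EOS]"` with tokens `[BOS]` and `[EOS]` becomes `" [BOS] hi [EOS] "`, while without any tokens the brackets are padded like other punctuation.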
14 changes: 13 additions & 1 deletion cpp/src/strings/utilities.cu
@@ -1,5 +1,5 @@
/*
* Copyright (c) 2019-2024, NVIDIA CORPORATION.
* Copyright (c) 2019-2025, NVIDIA CORPORATION.
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -180,6 +180,18 @@ int64_t get_offset_value(cudf::column_view const& offsets,
: cudf::detail::get_value<int32_t>(offsets, index, stream);
}

std::pair<int64_t, int64_t> get_first_and_last_offset(cudf::strings_column_view const& input,
rmm::cuda_stream_view stream)
{
if (input.is_empty()) { return {0L, 0L}; }
auto const first_offset = (input.offset() == 0) ? 0
: cudf::strings::detail::get_offset_value(
input.offsets(), input.offset(), stream);
auto const last_offset =
cudf::strings::detail::get_offset_value(input.offsets(), input.size() + input.offset(), stream);
return {first_offset, last_offset};
}

} // namespace detail

rmm::device_uvector<string_view> create_string_vector_from_column(