perf(python): optimize pystr deserialize perf #2007

Open
wants to merge 24 commits into
base: main
from

Conversation

chaokunyang
Collaborator

@chaokunyang chaokunyang commented Jan 14, 2025

What does this PR do?

This PR implements optimized versions of PyUnicode_FromUCS1 and PyUnicode_FromUCS2 (Fury_PyUnicode_FromUCS1 and Fury_PyUnicode_FromUCS2) for faster string deserialization by:

  • Replacing the max-char check with a SIMD implementation
  • Casting UCS2 arrays to UCS1 arrays with SIMD
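
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the names `FitsInUcs1` and `NarrowUcs2ToUcs1` are hypothetical, the check accumulates with a bitwise OR rather than a running max, and it assumes SSE2 on x86 with a scalar fallback elsewhere.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__SSE2__)
#include <emmintrin.h>  // SSE2 intrinsics
#endif

// Hypothetical helper: true when every UCS2 code unit is <= 0xFF,
// meaning the string can be stored as UCS1.
inline bool FitsInUcs1(const uint16_t* data, size_t length) {
  size_t i = 0;
#if defined(__SSE2__)
  // OR eight code units at a time into an accumulator. If any high byte
  // ends up non-zero, at least one code unit exceeded 0xFF.
  __m128i acc = _mm_setzero_si128();
  for (; i + 8 <= length; i += 8) {
    __m128i chunk =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
    acc = _mm_or_si128(acc, chunk);
  }
  __m128i high = _mm_srli_epi16(acc, 8);  // keep only the high bytes
  __m128i is_zero = _mm_cmpeq_epi8(high, _mm_setzero_si128());
  if (_mm_movemask_epi8(is_zero) != 0xFFFF) {
    return false;
  }
#endif
  // Scalar loop for the remaining elements (or everything without SSE2).
  for (; i < length; ++i) {
    if (data[i] > 0xFF) return false;
  }
  return true;
}

// Hypothetical helper: narrows UCS2 code units to UCS1 bytes.
// Precondition: FitsInUcs1(src, length) is true.
inline void NarrowUcs2ToUcs1(const uint16_t* src, size_t length,
                             uint8_t* dst) {
  size_t i = 0;
#if defined(__SSE2__)
  for (; i + 16 <= length; i += 16) {
    __m128i lo = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
    __m128i hi =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i + 8));
    // packus saturates each 16-bit lane to [0, 255]; the precondition
    // guarantees the values are already in range, so nothing is lost.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i),
                     _mm_packus_epi16(lo, hi));
  }
#endif
  for (; i < length; ++i) {
    dst[i] = static_cast<uint8_t>(src[i]);
  }
}
```

A caller would first run `FitsInUcs1` and only narrow when it returns true; otherwise the data would stay in its two-byte form.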

Related issues

Does this PR introduce any user-facing change?

  • Does this PR introduce any public API change?
  • Does this PR introduce any binary protocol compatibility change?

Benchmark

@chaokunyang chaokunyang marked this pull request as draft January 14, 2025 05:49
@pandalee99 pandalee99 self-requested a review January 14, 2025 15:26
@chaokunyang chaokunyang force-pushed the optimize_pystr_deserialize_perf branch from 8ba4b1b to 6f0a64b Compare January 15, 2025 14:34
@chaokunyang chaokunyang marked this pull request as ready for review January 15, 2025 15:19
Contributor

@pandalee99 pandalee99 left a comment


This code is very efficient, very nice!

Maybe we can factor out the repetitive code:

  // Handle remaining elements
  for (; i < length; i++) {
    if (arr[i] > max_sse) {
      max_sse = arr[i];
    }

This is only a point about how the code is written, nothing serious.
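
One way to address the repetition, as a sketch only: move the scalar tail loop into a small template so the UCS1 and UCS2 paths can share it. The helper name `MaxOfTail` is hypothetical and does not appear in the PR; `arr`, `i`, `length`, and `max_sse` are the names from the quoted loop.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: scalar max over arr[begin, end), starting from
// `current`. Works for both uint8_t (UCS1) and uint16_t (UCS2) arrays.
template <typename T>
inline T MaxOfTail(const T* arr, size_t begin, size_t end, T current) {
  for (size_t i = begin; i < end; ++i) {
    if (arr[i] > current) {
      current = arr[i];
    }
  }
  return current;
}

// At each call site the remaining-elements loop would then become:
//   max_sse = MaxOfTail(arr, i, length, max_sse);
```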

     cdef const char * buf = <const char *>(self.c_buffer.get().data() + self.reader_index)
     self.reader_index += size
     cdef uint32_t encoding = header & <uint32_t>0b11
     if encoding == 0:
         # PyUnicode_FromASCII
-        return PyUnicode_DecodeLatin1(buf, size, "strict")
+        return <unicode>Fury_PyUnicode_FromUCS1(buf, size)
+        # return PyUnicode_DecodeLatin1(buf, size, "strict")
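
For readers unfamiliar with the header handling in the quoted code, here is a minimal sketch of the bit manipulation. Only the `header & 0b11` step and the fact that encoding 0 takes the Latin-1/UCS1 path come from the quoted lines; the `StringHeader` struct, the function name, and the assumption that the byte size lives in the remaining upper bits are illustrative guesses, not confirmed by this PR.

```cpp
#include <cstdint>

// Hypothetical container for the decoded fields.
struct StringHeader {
  uint32_t encoding;  // low two bits; 0 selects the Latin-1/UCS1 path
  uint64_t size;      // ASSUMPTION: payload size stored in the upper bits
};

// Splits a header into encoding (low two bits) and size (remaining bits).
// The `>> 2` layout is an assumption made for illustration.
inline StringHeader DecodeStringHeader(uint64_t header) {
  StringHeader decoded;
  decoded.encoding = static_cast<uint32_t>(header & 0x3);  // same as & 0b11
  decoded.size = header >> 2;
  return decoded;
}
```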
Collaborator Author


If I use PyUnicode_DecodeLatin1 directly here, it's faster on macOS, which is unexpected since my implementation uses SIMD. If I invoke PyUnicode_DecodeLatin1 directly inside PyUnicode_FromUCS1, it's slower too. @penguin-wwy, do you have any ideas?

Contributor


Could you describe the testing method? The tests I wrote myself do not have this issue.

# integration_tests/cpython_benchmark/fury_benchmark.py
STRING = "sjuveaibngurbzsivbrubiasb3r93284r92r1209130r0fa;2''j93r2nfln''[]\=-_+/,./!@$#%^&*()i9124u0hpq[jnzj0r9h034-2iu1058]"

def micro_benchmark():
    runner.bench_func(
        "fury_string", fury_object, language, not args.no_ref, STRING
    )
    runner.bench_func(
        "fury_large_string", fury_object, language, not args.no_ref, STRING * 10000
    )

Using PyUnicode_FromUCS1:
fury_string: Mean +- std dev: 54.7 us +- 2.5 us
fury_large_string: Mean +- std dev: 255 us +- 24 us

Using Fury_PyUnicode_FromUCS1:
fury_string: Mean +- std dev: 53.8 us +- 2.0 us
fury_large_string: Mean +- std dev: 236 us +- 6 us

3 participants