@edgarrmondragon
Collaborator

@edgarrmondragon edgarrmondragon commented Jul 17, 2024


📚 Documentation preview 📚: https://meltano-sdk--2541.org.readthedocs.build/en/2541/

Summary by Sourcery

Implement msgspec encoding for improved performance.

Enhancements:

  • Replace the default JSON encoder and decoder with msgspec for serialization and deserialization.

Tests:

  • Update tests to accommodate the msgspec implementation.

@codspeed-hq

codspeed-hq bot commented Jul 17, 2024

CodSpeed Performance Report

Merging #2541 will improve performances by ×12

Comparing edgarrmondragon/refactor/msgspec-impl-naive (ae168a0) with main (6f32572)

Summary

⚡ 2 improvements
✅ 5 untouched benchmarks

Benchmarks breakdown

Benchmark                     BASE      HEAD     Change
test_bench_deserialize_json   23.3 ms   5.5 ms   ×4.2
test_bench_format_message     52.2 ms   4.2 ms   ×12

@codecov

codecov bot commented Jul 17, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 91.42%. Comparing base (6f32572) to head (ae168a0).
Report is 195 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2541      +/-   ##
==========================================
+ Coverage   91.34%   91.42%   +0.08%     
==========================================
  Files          63       63              
  Lines        5231     5280      +49     
  Branches      677      673       -4     
==========================================
+ Hits         4778     4827      +49     
  Misses        320      320              
  Partials      133      133              

@edgarrmondragon edgarrmondragon force-pushed the edgarrmondragon/refactor/msgspec-impl-naive branch from 97de9f7 to 1d9e947 Compare July 17, 2024 02:00
@edgarrmondragon edgarrmondragon force-pushed the edgarrmondragon/refactor/msgspec-impl-naive branch from 1d9e947 to 4febeba Compare July 17, 2024 02:59
@edgarrmondragon edgarrmondragon force-pushed the edgarrmondragon/refactor/msgspec-impl-naive branch from 4febeba to cbe10bd Compare July 17, 2024 03:00
@edgarrmondragon edgarrmondragon force-pushed the edgarrmondragon/refactor/msgspec-impl-naive branch from 8876b80 to f691f78 Compare July 25, 2024 15:45
@edgarrmondragon edgarrmondragon added this to the 0.41.0 milestone Aug 14, 2024
@BuzzCutNorman
Contributor

I found that it helped to add a default_output attribute for the SingerWriter to use. This lets write_message be a little more generic.

default_output = sys.stdout.buffer

def write_message(self, message: Message) -> None:
	"""Write a message to stdout.

	Args:
		message: The message to write.
	"""
	self.default_output.write(self.format_message(message))
	self.default_output.flush()
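
A minimal sketch of why a class-level `default_output` makes `write_message` more generic: a subclass can point it at any binary stream. `SketchWriter` and `CapturingWriter` are hypothetical stand-ins, not the SDK's classes, and the standard library `json` module stands in for the real serializer.

```python
import io
import json
import sys
import typing as t


class SketchWriter:
    """Hypothetical stand-in for a Singer writer; not the SDK class."""

    # Class-level default, evaluated once when the class is defined.
    default_output: t.BinaryIO = sys.stdout.buffer

    def format_message(self, message: dict) -> bytes:
        # Stand-in serializer: one compact JSON document per line.
        return json.dumps(message, separators=(",", ":")).encode() + b"\n"

    def write_message(self, message: dict) -> None:
        self.default_output.write(self.format_message(message))
        self.default_output.flush()


class CapturingWriter(SketchWriter):
    """Redirects output to memory, for example in tests."""

    default_output = io.BytesIO()


writer = CapturingWriter()
writer.write_message({"type": "STATE", "value": {"bookmark": 1}})
captured = CapturingWriter.default_output.getvalue()
```

One thing to keep in mind with this pattern: the attribute is bound when the class body runs, so replacing `sys.stdout` later does not change where an already-defined class writes.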

@BuzzCutNorman
Contributor

BuzzCutNorman commented Aug 16, 2024

In json.py, to follow the msgspec performance tips while fitting into the framework you put in place, I added a separate function for generating JSONL. That way serialize_json keeps returning strings, so it can still be used for connector engine creation and in process_batch_files.

https://jcristharif.com/msgspec/perf-tips.html#line-delimited-json

json.py:

def serialize_json(obj: object, **kwargs: t.Any) -> str:
    """Serialize a dictionary into a line of json.

    Args:
        obj: A Python object usually a dict.
        **kwargs: Optional key word arguments.

    Returns:
        A string of serialized json.
    """
    return encoder.encode(obj).decode()

msg_buffer = bytearray(64)

def serialize_jsonl(obj: object, **kwargs: t.Any) -> bytes:
    """Serialize a dictionary into a line of jsonl.

    Args:
        obj: A Python object usually a dict.
        **kwargs: Optional key word arguments.

    Returns:
        A bytes of serialized json.
    """
    encoder.encode_into(obj, msg_buffer)
    msg_buffer.extend(b"\n")
    return msg_buffer

SingerWriter:

    def serialize_message(self, message: Message) -> str | bytes:
        """Serialize a dictionary into a line of json.

        Args:
            message: A Singer message object.

        Returns:
            A string of serialized json.
        """
        return serialize_jsonl(message.to_dict())

@edgarrmondragon edgarrmondragon added the Release Highlight Call this out in the release notes label Aug 22, 2024
@edgarrmondragon
Collaborator Author

edgarrmondragon commented Sep 6, 2024

I found that it helped to add a default_output attribute for the SingerWriter to use. This lets write_message be a little more generic.

default_output = sys.stdout.buffer

def write_message(self, message: Message) -> None:
	"""Write a message to stdout.

	Args:
		message: The message to write.
	"""
	self.default_output.write(self.format_message(message))
	self.default_output.flush()

Do you mean in

class MsgSpecWriter(GenericSingerWriter[bytes, Message]):
    """Interface for all plugins writing Singer messages to stdout."""

    def serialize_message(self, message: Message) -> bytes:  # noqa: PLR6301
        """Serialize a dictionary into a line of json.

        Args:
            message: A Singer message object.

        Returns:
            A string of serialized json.
        """
        return encoder.encode(message.to_dict())

    def write_message(self, message: Message) -> None:
        """Write a message to stdout.

        Args:
            message: The message to write.
        """
        sys.stdout.buffer.write(self.format_message(message) + b"\n")
        sys.stdout.flush()

?

@edgarrmondragon edgarrmondragon force-pushed the edgarrmondragon/refactor/msgspec-impl-naive branch from 29dea7a to d23a8ab Compare September 6, 2024 18:56
@BuzzCutNorman
Contributor

BuzzCutNorman commented Sep 6, 2024

Yes, that is exactly what I meant. Could have definitely been stated clearer on my part😅.

class MsgSpecWriter(GenericSingerWriter[bytes, Message]):
    """Interface for all plugins writing Singer messages to stdout."""

    default_output = sys.stdout.buffer

    def serialize_message(self, message: Message) -> bytes:  # noqa: PLR6301
        """Serialize a dictionary into a line of json.

        Args:
            message: A Singer message object.

        Returns:
            A string of serialized json.
        """
        return serialize_jsonl(message.to_dict())

    def write_message(self, message: Message) -> None:
        """Write a message to stdout.

        Args:
            message: The message to write.
        """
        self.default_output.write(self.format_message(message))
        self.default_output.flush()

@edgarrmondragon edgarrmondragon force-pushed the edgarrmondragon/refactor/msgspec-impl-naive branch from d23a8ab to 3169b58 Compare September 6, 2024 19:15
@edgarrmondragon edgarrmondragon changed the title refactor: Implement (naive) msgspec encoding refactor: Implement msgspec encoding Sep 6, 2024
@edgarrmondragon
Collaborator Author

Naive of me to think I could get this across in 1/2 a day of work 😅. I'll come back to this later, there's plenty of time until the planned release date.

@BuzzCutNorman
Contributor

Like the pun 😊. Great dad joke material. Kind of an inside joke now, since you dropped (naive) from the title of the PR.

@edgarrmondragon
Collaborator Author

Ok, the tests are passing.

Now I want to think of how to make it easy and straightforward for a developer to use msgspec as the SerDe layer, and also keep the door open to the user being the one deciding which serialization layer to use.

@edgarrmondragon edgarrmondragon self-assigned this Jan 13, 2025
@edgarrmondragon edgarrmondragon modified the milestones: v0.44, v0.45 Jan 15, 2025
@edgarrmondragon
Collaborator Author

@sourcery-ai review

@sourcery-ai
Contributor

sourcery-ai bot commented Jan 22, 2025

Reviewer's Guide by Sourcery

This pull request introduces msgspec encoding for improved performance, refactors the reader/writer implementations, and adds new test cases. It also introduces BaseSingerReader and BaseSingerWriter classes to standardize reader/writer implementations.

Updated class diagram for BaseSingerReader and BaseSingerWriter

classDiagram
    class PluginBase {
        +config: dict | PurePath | str | list[PurePath | str] | None
        +parse_env_config: bool
        +validate_config: bool
    }
    class BaseSingerReader {
        +message_reader_class: type[GenericSingerReader]
        +message_reader: GenericSingerReader | None
        +listen(file_input: t.IO[str] | None) : None
        +process_lines(file_input: t.IO[str] | None) : t.Counter[str]
        +process_endofpipe() : None
        +_assert_line_requires(message_dict: dict, requires: set[str]) : None
        <<abstract>>
        +_process_schema_message(message_dict: dict) : None
        <<abstract>>
        +_process_record_message(message_dict: dict) : None
        <<abstract>>
        +_process_state_message(message_dict: dict) : None
        <<abstract>>
        +_process_activate_version_message(message_dict: dict) : None
        <<abstract>>
        +_process_batch_message(message_dict: dict) : None
    }
    class BaseSingerWriter {
        +message_writer_class: type[GenericSingerWriter]
        +message_writer: GenericSingerWriter | None
        +write_message(message: t.Any) : None
    }
    PluginBase <|-- BaseSingerReader
    PluginBase <|-- BaseSingerWriter
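
The `message_writer_class` attribute in the diagram suggests that a plugin selects its writer through a class attribute. A minimal sketch of that idea follows, using hypothetical stand-in classes (`SketchPlugin`, `JsonTextWriter`, `CompactTextWriter`) rather than the SDK's real ones, and creating the writer eagerly for brevity.

```python
import io
import json
import typing as t


class JsonTextWriter:
    """Hypothetical writer producing lines with default json.dumps settings."""

    def __init__(self, output: t.IO[str]) -> None:
        self.output = output

    def write_message(self, message: dict) -> None:
        self.output.write(json.dumps(message) + "\n")


class CompactTextWriter(JsonTextWriter):
    """Hypothetical alternative writer, standing in for a faster encoder."""

    def write_message(self, message: dict) -> None:
        self.output.write(json.dumps(message, separators=(",", ":")) + "\n")


class SketchPlugin:
    """Stand-in for a plugin base that delegates to a configurable writer."""

    message_writer_class: t.Type[JsonTextWriter] = JsonTextWriter

    def __init__(self, output: t.IO[str]) -> None:
        # Instantiate whichever writer class the (sub)class selected.
        self.message_writer = self.message_writer_class(output)

    def write_message(self, message: dict) -> None:
        self.message_writer.write_message(message)


class FastTap(SketchPlugin):
    # Opting in to another serializer is a one-line change on the plugin.
    message_writer_class = CompactTextWriter


default_out, fast_out = io.StringIO(), io.StringIO()
SketchPlugin(default_out).write_message({"a": 1})
FastTap(fast_out).write_message({"a": 1})
```

Composition of this kind keeps the serialization choice in one place, so a developer or a downstream user can override it without touching the plugin's other methods.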

Class diagram for MsgSpecReader and MsgSpecWriter

classDiagram
    class GenericSingerReader {
        <<interface>>
        +deserialize_json(line: str) : dict
    }
    class GenericSingerWriter {
        <<interface>>
        +serialize_message(message: Message) : bytes
    }
    class MsgSpecReader {
        +default_input: t.IO
        +deserialize_json(line: str) : dict
    }
    class MsgSpecWriter {
        +serialize_message(message: Message) : bytes
        +write_message(message: Message) : None
    }

    GenericSingerReader <|.. MsgSpecReader : implements
    GenericSingerWriter <|.. MsgSpecWriter : implements
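
The "implements" relationships in this diagram can be sketched as a generic interface parameterized on the wire type. The class names below are hypothetical, and the standard library `json` module stands in for msgspec in the bytes-producing variant.

```python
import abc
import json
import typing as t

# The wire type a writer produces: text or raw bytes.
T = t.TypeVar("T", str, bytes)


class SketchGenericWriter(t.Generic[T], metaclass=abc.ABCMeta):
    """Hypothetical interface parameterized on the wire type."""

    @abc.abstractmethod
    def serialize_message(self, message: dict) -> T:
        """Return one serialized message, without a trailing newline."""


class TextWriter(SketchGenericWriter[str]):
    def serialize_message(self, message: dict) -> str:
        return json.dumps(message)


class BytesWriter(SketchGenericWriter[bytes]):
    def serialize_message(self, message: dict) -> bytes:
        # Stand-in for a bytes-native encoder such as msgspec.
        return json.dumps(message, separators=(",", ":")).encode("utf-8")


text_line = TextWriter().serialize_message({"type": "STATE"})
bytes_line = BytesWriter().serialize_message({"type": "STATE"})
```

Parameterizing on the wire type lets a type checker flag code that mixes a `str` writer with a binary output stream, or the reverse.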

File-Level Changes

Change: Introduces BaseSingerReader and BaseSingerWriter classes to standardize reader/writer implementations.
Details:
  • Adds BaseSingerReader and BaseSingerWriter as base classes for readers and writers.
  • Moves listen and write_message methods to the new base classes.
  • Updates Target and Tap to inherit from the new base classes.
  • Removes SingerReader and SingerWriter inheritance from Target and Tap.
  • Adds message_reader_class and message_writer_class attributes to BaseSingerReader and BaseSingerWriter respectively.
  • Adds process_endofpipe to BaseSingerReader.
Files:
  • singer_sdk/plugin_base.py
  • singer_sdk/target_base.py
  • singer_sdk/tap_base.py

Change: Implements msgspec encoding for improved performance.
Details:
  • Adds msgspec as a dependency.
  • Creates MsgSpecReader and MsgSpecWriter classes in singer_sdk/contrib/msgspec.py.
  • Implements serialize_jsonl, enc_hook, and dec_hook functions for msgspec encoding.
  • Updates deserialize_json method in MsgSpecReader to use msgspec.
  • Updates serialize_message method in MsgSpecWriter to use msgspec.
  • Adds message_writer_class = MsgSpecWriter to SQLiteTap and SampleTapCountries to use msgspec encoding by default.
  • Adds benchmark tests for msgspec reader and writer.
Files:
  • requirements/requirements.txt
  • singer_sdk/contrib/msgspec.py
  • tests/core/test_io.py
  • samples/sample_tap_sqlite/__init__.py
  • samples/sample_tap_countries/countries_tap.py
  • noxfile.py
  • tests/singerlib/encoding/test_msgspec.py

Change: Refactors listen and process_lines methods to improve code structure and reusability.
Details:
  • Removes listen method from GenericSingerReader.
  • Moves the logic from GenericSingerReader.listen to GenericSingerReader.process_lines.
  • Adds a callbacks argument to GenericSingerReader.process_lines to handle different message types.
  • Removes _process_lines and _process_endofpipe methods from Target.
  • Updates target_sync_test and tap_sync_test to use io.TextIOWrapper instead of io.StringIO.
Files:
  • singer_sdk/plugin_base.py
  • singer_sdk/target_base.py
  • singer_sdk/singerlib/encoding/base.py
  • singer_sdk/testing/legacy.py
  • singer_sdk/testing/runners.py
  • tests/singerlib/encoding/test_simple.py

Change: Adds test files for msgspec and simple encoding.
Details:
  • Adds tests/singerlib/encoding/test_msgspec.py to test msgspec encoding.
  • Adds tests/singerlib/encoding/test_simple.py to test simple encoding.
  • Adds tests/singerlib/encoding/conftest.py to configure tests for encoding.
Files:
  • tests/singerlib/encoding/test_msgspec.py
  • tests/singerlib/encoding/test_simple.py
  • tests/singerlib/encoding/conftest.py

Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey @edgarrmondragon - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider making the IO implementation an attribute of the Singer class rather than using multiple inheritance, to avoid MRO ordering issues. This would provide a cleaner and more explicit design.
Here's what I looked at during the review
  • 🟢 General issues: all looks good
  • 🟡 Security: 1 issue found
  • 🟡 Testing: 2 issues found
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good


@edgarrmondragon edgarrmondragon marked this pull request as ready for review March 5, 2025 05:56
@edgarrmondragon edgarrmondragon requested a review from a team as a code owner March 5, 2025 05:56
Contributor

@sourcery-ai sourcery-ai bot left a comment

Hey @edgarrmondragon - I've reviewed your changes - here's some feedback:

Overall Comments:

  • Consider adding msgspec as a dependency to the core extra in pyproject.toml.
  • The new base classes BaseSingerReader and BaseSingerWriter duplicate some logic from the original PluginBase class; consider refactoring to avoid this duplication.
Here's what I looked at during the review
  • 🟡 General issues: 2 issues found
  • 🟢 Security: all looks good
  • 🟡 Testing: 1 issue found
  • 🟡 Complexity: 1 issue found
  • 🟢 Documentation: all looks good


@edgarrmondragon edgarrmondragon requested a review from a team as a code owner March 14, 2025 22:52
@edgarrmondragon edgarrmondragon changed the title refactor: Implement msgspec encoding feat: Implement Singer msgspec encoding Mar 18, 2025
@edgarrmondragon edgarrmondragon merged commit 99a513c into main Mar 18, 2025
36 of 37 checks passed
@edgarrmondragon edgarrmondragon deleted the edgarrmondragon/refactor/msgspec-impl-naive branch March 18, 2025 23:10