Skip to content

Simplify HDFDocumentComposer #858

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

shihab-dls
Copy link
Contributor

@shihab-dls shihab-dls commented Apr 16, 2025

Fixes #795

This PR should mostly remove duplicate code in writers that perform the same collect_stream_docs operation. As #795 requested tests to remain unchanged, writers still have their collect_stream_docs methods, but this simply calls the composers method.

With this, composers can now be created in open(); however, as panda's writer creates its dataset in _update_datasets(), I create the composer here instead.

EDIT: I've moved this back to collect_stream_docs (for panda writer), as I think it's clearer here.

@shihab-dls
Copy link
Contributor Author

Given I did not change any of the tests, we still pass in name: str to writers collect_stream_docs(). I don't see how this is ever accessed, even prior to the changes of this PR? If I'm not missing something, I'll remove this and amend the tests that call collect_stream_docs().

@shihab-dls shihab-dls requested a review from coretl April 16, 2025 10:20
Copy link
Collaborator

@coretl coretl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks almost there but I spotted a few simplifications

Comment on lines +81 to +83
async def collect_stream_docs(
self, indices_written: int
) -> AsyncIterator[StreamAsset]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, having seen the code it does no I/O, so we can put it back to a plain iterator

Suggested change
async def collect_stream_docs(
self, indices_written: int
) -> AsyncIterator[StreamAsset]:
def collect_stream_docs(
self, indices_written: int
) -> Iterator[StreamAsset]:

Comment on lines +143 to +144
async for doc in self._composer.collect_stream_docs(indices_written):
yield doc
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which simplifies to

Suggested change
async for doc in self._composer.collect_stream_docs(indices_written):
yield doc
yield from self._composer.collect_stream_docs(indices_written)

Comment on lines +148 to +153
if not self._composer:
self._composer = HDFDocumentComposer(
Path(await self.panda_data_block.hdf_directory.get_value())
/ Path(await self.panda_data_block.hdf_file_name.get_value()),
self._datasets,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What was the reasoning to put it here rather than in open()?

yield "stream_resource", doc
for doc in self.composer.stream_data(indices_written):
yield "stream_datum", doc
if self.composer:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should raise if self.composer is None

Suggested change
if self.composer:
if self.composer is None:
msg = "open() not called on {self}"
raise RuntimeError(msg)

@coretl
Copy link
Collaborator

coretl commented Jun 11, 2025

@shihab-dls please could we rebase this so we can merge it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Simplify HDFDocumentComposer
2 participants