Yes, you can use either. As a precondition, you just need to be able to represent your data as a RecordBatch (or any type a write function accepts).
I have a use case where I want to save some information, which can consist of: numpy ndarrays of variable shape, numpy 1D arrays, objects like a PyTorch model state_dict or optimizer state_dict, scalars like floats, ints, and strings, and custom types like MetricsInfo. These can be encoded as various Arrow datatypes: this is straightforward for primitive types, tricks like #48099 work for variable-shape tensors, and the binary type covers other Python objects.
I only want to use a single file for all of these. The information will be generated periodically, e.g. after each epoch when training deep learning models, so at each period (say epoch end) I need to append it to the file. This matters because if the run is interrupted, I don't want to lose all information up to the current epoch. Nor do I want to put pressure on memory by buffering everything for a single write.
Can this be done with files in the arrow or parquet format?

EDIT: after adding the data, I would like random-access reads of the saved data.