Static batch schema #349

laserkelvin · 2025-03-20T18:25:35Z

This PR addresses a problem with the existing BatchSchema definition, which is built on the fly from whatever information was actually packed into the individual data samples. A schema built this way cannot be serialized, which prevents the use of multiple data loader workers and in turn slows down training.

This is solved by creating a static definition of BatchSchema that subclasses DataSampleSchema. Testing so far covers a new HDF5 dataset with PyG graphs, and the change is functional for multi-GPU training.
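To illustrate why a static definition helps, here is a minimal standard-library sketch, not the code in this PR. It assumes the serialization failure comes from pickle being unable to locate a class that only exists at runtime, which is one common cause. The field names (index, cart_coords, lattice_parameters, batch_size) and the helpers make_dynamic_schema and is_picklable are hypothetical.

```python
import pickle
from dataclasses import dataclass, make_dataclass
from typing import Any, Dict, Optional


def make_dynamic_schema(sample: Dict[str, Any]) -> type:
    """Hypothetical on-the-fly schema: fields mirror whatever the sample holds."""
    return make_dataclass("DynamicBatchSchema", [(key, Any) for key in sample])


@dataclass
class DataSampleSchema:
    """Static, module-level definition (field names are illustrative only)."""
    index: int
    cart_coords: list
    lattice_parameters: Optional[list] = None


@dataclass
class BatchSchema(DataSampleSchema):
    """Static batch schema that inherits every field of the sample schema."""
    batch_size: int = 1


def is_picklable(obj: Any) -> bool:
    """Return True if obj survives a pickle round trip."""
    try:
        pickle.loads(pickle.dumps(obj))
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


if __name__ == "__main__":
    sample = {"index": 0, "cart_coords": [[0.0, 0.0, 0.0]]}
    dynamic = make_dynamic_schema(sample)(**sample)
    static = BatchSchema(index=0, cart_coords=[[0.0, 0.0, 0.0]], batch_size=1)
    # The runtime-generated class cannot be found by name, the static one can.
    print(is_picklable(dynamic), is_picklable(static))
```

Data loader workers receive their inputs through pickling when processes are spawned, so anything that fails this round trip cannot cross the process boundary.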

Summary of changes:

- Defined BatchSchema, subclassing DataSampleSchema
- Added a to() method for DataSampleSchema (which is inherited by BatchSchema) for data transfers to accelerators
- Added transfer_batch_to_device methods to LightningModule definitions, which facilitate the data transfer before model calls
- Added unit tests for batching with the new BatchSchema
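The device-transfer changes above can be sketched roughly as follows. This is an assumption-laden illustration rather than the merged code: FakeTensor stands in for torch.Tensor so the example runs without PyTorch, the fields coords and extras are hypothetical, and TransferMixin is a plain class that only mirrors the signature of Lightning's transfer_batch_to_device(batch, device, dataloader_idx) hook instead of subclassing LightningModule.

```python
from dataclasses import dataclass, fields
from typing import Any


def _recursive_move(obj: Any, device: str) -> Any:
    """Move obj to device, descending into dicts, lists, and tuples."""
    # Anything exposing .to() (tensors, graphs, nested schemas) moves itself.
    if callable(getattr(obj, "to", None)):
        return obj.to(device)
    if isinstance(obj, dict):
        return {key: _recursive_move(value, device) for key, value in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_recursive_move(value, device) for value in obj)
    return obj  # plain Python values are left untouched


@dataclass
class FakeTensor:
    """Stand-in for torch.Tensor so the sketch runs without PyTorch."""
    values: list
    device: str = "cpu"

    def to(self, device: str) -> "FakeTensor":
        return FakeTensor(self.values, device)


@dataclass
class DataSampleSchema:
    coords: Any
    extras: dict

    def to(self, device: str) -> "DataSampleSchema":
        """Move every field to device and return the schema itself."""
        for f in fields(self):
            setattr(self, f.name, _recursive_move(getattr(self, f.name), device))
        return self


@dataclass
class BatchSchema(DataSampleSchema):
    batch_size: int = 1  # to() is inherited from DataSampleSchema


class TransferMixin:
    """Mirrors the signature of Lightning's transfer_batch_to_device hook."""

    def transfer_batch_to_device(self, batch: Any, device: str, dataloader_idx: int) -> Any:
        return batch.to(device)
```

Returning the schema object from to() matters because the hook's return value is what the training step receives; a method that moved data in place and returned None would hand the model an empty batch.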

Signed-off-by: Lee, Kin Long Kelvin <[email protected]>

smiret-intel

Does this imply we need to change the underlying data format to HDF5? I assume this works for native PyTorch as well.

Lee, Kin Long Kelvin added 20 commits March 19, 2025 15:25

feat: implementation of a static batch schema

2e7b052

Signed-off-by: Lee, Kin Long Kelvin <[email protected]>

refactor: adding batched lattice parameter type

3088bbe

feat: added sample testing with lattice

cfc52c3

Signed-off-by: Lee, Kin Long Kelvin <[email protected]>

refactor: adding embedding and model output field

8e23eff

refactor: retiring old dynamic batch schema method

5d7a203

refactor: updating collate function with batch schema class method

d065eed

feat: adding to method for base sample schema class for data transfer

4e83aef

refactor: adding placeholder batch transfer for multitask

0a9dc15

Merge branch 'main' into static-batch-schema

38242c0

feat: adding private recursive move tensor method

3036b64

refactor: using the recursive device transfer method

eba23f1

test: adding batch movement test to unit tests

c549410

refactor: removing overly verbose fractional coordinate check

d54c377

refactor: making cart_frac conversion work on tensors and arrays alike

f50d897

refactor: making tensor assumptions now

584fc1e

fix: correcting batch collation for pyg

8cc9979

fix: fully functioning pyg collate process

ce66993

refactor: making to method return the schema object for data loading

df2ea99

fix: relying on pyg graph to method

ecae0d7

fix: only casting to tensor if they aren't already tensors

7faf942

laserkelvin added the bug and data labels Mar 20, 2025

laserkelvin requested a review from smiret-intel March 20, 2025 20:30

smiret-intel approved these changes Mar 20, 2025

laserkelvin merged commit 66b5280 into IntelLabs:main Mar 24, 2025
2 of 4 checks passed
