From 9c43bf432084006ef090ef1f058f3eb0888a4806 Mon Sep 17 00:00:00 2001
From: OussamaSaoudi <45303303+OussamaSaoudi@users.noreply.github.com>
Date: Tue, 14 Jan 2025 13:22:22 -0800
Subject: [PATCH] doc: Clarify `JsonHandler` semantics on EngineData ordering (#635)

## What changes are proposed in this pull request?

When reading multiple log files during log replay, it is important that we read commits in order. This ensures the correctness of add/remove deduplication. Hence, we are implicitly relying on the commit files being read in order by the json handler.

Moreover when in-commit timestamps is enabled, the ordering of batches of engine data in a commit is important. A correct delta table should have the commit info be the _first_ action in a log file.
---
 kernel/src/lib.rs | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/kernel/src/lib.rs b/kernel/src/lib.rs
index fa88e7afa..49dceea75 100644
--- a/kernel/src/lib.rs
+++ b/kernel/src/lib.rs
@@ -371,8 +371,20 @@ pub trait JsonHandler: AsAny {
         output_schema: SchemaRef,
     ) -> DeltaResult>;
 
-    /// Read and parse the JSON format file at given locations and return
-    /// the data as EngineData with the columns requested by physical schema.
+    /// Read and parse the JSON format file at given locations and return the data as EngineData with
+    /// the columns requested by physical schema. Note: The [`FileDataReadResultIterator`] must emit
+    /// data from files in the order that `files` is given. For example if files ["a", "b"] is provided,
+    /// then the engine data iterator must first return all the engine data from file "a", _then_ all
+    /// the engine data from file "b". Moreover, for a given file, all of its [`EngineData`] and
+    /// constituent rows must be in order that they occur in the file. Consider a file with rows
+    /// (1, 2, 3). The following are legal iterator batches:
+    /// iter: [EngineData(1, 2), EngineData(3)]
+    /// iter: [EngineData(1), EngineData(2, 3)]
+    /// iter: [EngineData(1, 2, 3)]
+    /// The following are illegal batches:
+    /// iter: [EngineData(3), EngineData(1, 2)]
+    /// iter: [EngineData(1), EngineData(3, 2)]
+    /// iter: [EngineData(2, 1, 3)]
     ///
     /// # Parameters
     ///
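
The ordering rule in the doc comment above can be written as a small check. This is a minimal sketch and not kernel code: `Batch` and `preserves_order` are hypothetical names introduced here, and `Batch` is a plain stand-in that records a file index and row values rather than the kernel's `EngineData`. The idea is that batch boundaries are free to fall anywhere, but concatenating the batches must reproduce the files in the order given, with each file's rows in their original order.

```rust
/// Hypothetical stand-in for one batch of engine data (assumption: this is
/// NOT the kernel's `EngineData`; it only records where the rows came from).
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    /// Index of the source file within the `files` slice.
    file: usize,
    /// Row values carried by this batch, in emission order.
    rows: Vec<u32>,
}

/// Returns true when `batches`, concatenated, yield every row of every file
/// exactly once, with files in the order given and rows in file order.
/// How the rows are split into batches does not matter.
fn preserves_order(files: &[Vec<u32>], batches: &[Batch]) -> bool {
    // The sequence of (file index, row) pairs the contract requires.
    let expected = files
        .iter()
        .enumerate()
        .flat_map(|(i, rows)| rows.iter().map(move |r| (i, *r)));
    // The sequence of (file index, row) pairs the iterator actually emitted.
    let actual = batches
        .iter()
        .flat_map(|b| b.rows.iter().map(move |r| (b.file, *r)));
    // `Iterator::eq` compares element by element and also checks length.
    expected.eq(actual)
}
```

With a single file holding rows (1, 2, 3), the batches `[(1, 2), (3)]` pass the check while `[(3), (1, 2)]` and `[(2, 1, 3)]` fail it, matching the legal and illegal cases listed in the doc comment. With two files, any batch from the second file that arrives before the first file is exhausted also fails.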