feat(parallel random reads): Changes to enable parallel random read handling in gcs reader #3619
base: master
Conversation
…abhishek/random_reads_gcs
Hi @abhishek10004, it looks like you've invoked /gemini review.
Code Review
This pull request introduces changes to enable parallel random reads in the GCS reader. This is achieved by refactoring the read logic in gcs_reader.go, introducing a mutex to manage concurrent access, and adding logic to dynamically switch between sequential (RangeReader) and random (MultiRangeReader) read strategies. The changes also include significant updates to the test suite to cover the new logic and parallel read scenarios.
While the changes to gcs_reader.go seem well thought out for concurrency, a critical issue has been introduced in internal/fs/fs.go by removing a file handle lock, which leads to a data race. Additionally, some of the test modifications appear fragile due to hardcoded mock expectations.
fh.Inode().Lock()
defer fh.Unlock()
Removing the lock on the file handle fh here introduces a data race. Multiple goroutines calling ReadFile on the same file handle will concurrently access and potentially modify fh.reader and fh.readManager within fh.Read and fh.ReadWithReadManager. This can lead to unpredictable behavior and crashes.
While the goal is to enable parallel reads, the file handle's internal state (like the reader instance) must be protected during initialization and updates. A lock is still required for this. Consider moving the lock inside fh.Read and fh.ReadWithReadManager to protect just the reader management logic, which would allow the actual read I/O operations to run in parallel if the underlying reader supports it.
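A minimal sketch of what this suggestion could look like, using illustrative names (mu, reader, newReader) rather than the actual FileHandle fields:

package fs

import (
	"context"
	"sync"
)

// Illustrative interface; the real reader types in gcsfuse differ.
type reader interface {
	ReadAt(ctx context.Context, dst []byte, offset int64) (int, error)
}

// FileHandle sketch with only the fields needed to show the locking pattern.
type FileHandle struct {
	mu     sync.Mutex
	reader reader
}

// Read holds the handle's mutex only while the reader is (re)initialized, then
// releases it before the I/O so that reads can overlap.
func (fh *FileHandle) Read(ctx context.Context, dst []byte, offset int64, newReader func() reader) (int, error) {
	fh.mu.Lock()
	if fh.reader == nil {
		fh.reader = newReader() // reader creation/update is the state that needs protection
	}
	r := fh.reader
	fh.mu.Unlock()

	// The reader is expected to synchronize its own internals (as the author's
	// reply below notes), so the actual read runs outside the lock.
	return r.ReadAt(ctx, dst, offset)
}

The point of the sketch is that only reader management is guarded; the read I/O itself happens after the lock is released.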
This is okay since we are now synchronizing at the reader level (for range readers)
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #3619 +/- ##
==========================================
+ Coverage 82.34% 82.36% +0.02%
==========================================
Files 146 146
Lines 22524 22663 +139
==========================================
+ Hits 18548 18667 +119
- Misses 3428 3445 +17
- Partials 548 551 +3
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
/gemini review
/gemini summary
func (rr *RangeReader) readFromExistingReader(ctx context.Context, req *gcsx.GCSReaderRequest) (gcsx.ReaderResponse, error) {
func (rr *RangeReader) readFromExistingReader(ctx context.Context, req *gcsx.GCSReaderRequest) (int, error) {
	rr.skipBytes(req.Offset)
	rr.invalidateReaderIfMisalignedOrTooSmall(req.Offset, req.Offset+int64(len(req.Buffer)))
Please compute req.Offset+int64(len(req.Buffer)) once and put the value in a variable so it can be reused.
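For illustration, the suggested extraction could look like this (a sketch of the reviewer's point, not the PR's final code):

	end := req.Offset + int64(len(req.Buffer))
	rr.skipBytes(req.Offset)
	rr.invalidateReaderIfMisalignedOrTooSmall(req.Offset, end)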
Also, what is the difference between endOffset in GCSReaderRequest vs this one?
GCSReaderRequest has the computed endOffset which would be used while creating the reader. Here, we're reading from the existing reader, so I'm reading only what the user has requested.
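A rough illustration of that distinction, with field names assumed from this discussion rather than copied from the actual definition:

// Hypothetical shape of the request: Offset and Buffer appear in the diff above,
// and EndOffset is the field the question refers to.
type GCSReaderRequest struct {
	Buffer    []byte // destination buffer; its length bounds a read from an existing reader
	Offset    int64  // start offset requested by the caller
	EndOffset int64  // precomputed end offset, used only when a new reader has to be created
}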
@@ -407,129 +410,341 @@ func (t *gcsReaderTest) Test_ReadAt_PropagatesCancellation() {
}
}

func (t *gcsReaderTest) Test_ReadInfo_WithInvalidInput() {
	t.object.Size = 10 * MiB
func (t *gcsReaderTest) Test_IsSeekNeeded() { |
Not reviewing the tests, assuming you have written the same tests as random_reader.
/gemini review
Warning: Gemini encountered an error creating the review. You can try again by commenting
@@ -52,6 +52,12 @@ const (
	MultiRangeReaderType
)

type readInfo struct { |
Should we rename it to something more specific and clear? Something like RequestState?
Also, I feel it would be good to document the purpose of this struct, as it is not very straightforward to understand.
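As an illustration of that documentation request, a doc comment could look roughly like this (the fields shown are hypothetical):

// readInfo (or RequestState, per the suggestion above) captures the per-request
// state used to decide whether a request is served sequentially through the
// RangeReader or as a random read through the MultiRangeReader.
type readInfo struct {
	expectedOffset int64 // offset a purely sequential pattern would continue from (hypothetical field)
	isSeekNeeded   bool  // whether the request counts as a seek (hypothetical field)
}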
Description
This pull request introduces changes to enable parallel random reads in the GCS reader. This is achieved by refactoring the read logic in gcs_reader.go and introducing a mutex to manage concurrent access for the range reader, while allowing reads to proceed in parallel for the MultiRangeReader (MRD). The changes also include updates to the test suite to cover the new logic and parallel read scenarios.
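As a rough sketch of the locking scheme described above (type, field, and method names are assumptions, not the PR's actual code), sequential reads are serialized on a mutex while random reads go to the MultiRangeReader without taking it:

package gcsx

import (
	"context"
	"sync"
)

// Illustrative reader interface; the real RangeReader and MultiRangeReader differ.
type objectReader interface {
	ReadAt(ctx context.Context, p []byte, offset int64) (int, error)
}

// GCSReader sketch: the range reader keeps a single underlying stream, so access
// to it is serialized; the multi-range reader serves independent ranges.
type GCSReader struct {
	mu               sync.Mutex
	rangeReader      objectReader
	multiRangeReader objectReader
}

// ReadAt routes sequential reads through the mutex-protected RangeReader and
// lets random reads hit the MultiRangeReader concurrently.
func (gr *GCSReader) ReadAt(ctx context.Context, p []byte, offset int64, sequential bool) (int, error) {
	if sequential {
		gr.mu.Lock()
		defer gr.mu.Unlock()
		return gr.rangeReader.ReadAt(ctx, p, offset)
	}
	// No lock here: MRD reads are independent, so they can run in parallel.
	return gr.multiRangeReader.ReadAt(ctx, p, offset)
}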
Link to the issue in case of a bug fix.
b/427137532
Testing details
Any backward incompatible change? If so, please explain.