When reading files via XRootD with Spark (https://github.com/spark-root/laurelin), profiling the code shows significant RTTs being burned because the HadoopFile interface doesn't support vectorized reads/writes: each TTree basket incurs its own round-trip penalty, whereas (e.g.) CMSSW issues reads for multiple baskets with a single preadv() call. The backing filesystem on the other end typically supports vectorized I/O as well, so it would be a win on that side too.
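For concreteness, here is a minimal sketch of the access pattern in question; `BasketRange` is an illustrative stand-in for laurelin's basket metadata, not a real type. Each `readFully()` below is a separate positioned read through the Hadoop stream, so fetching N baskets costs N round-trips to the storage server:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;

// Illustrative stand-in for a TTree basket's (offset, length) metadata.
class BasketRange {
    final long offset;
    final int length;
    BasketRange(long offset, int length) { this.offset = offset; this.length = length; }
}

class PerBasketReads {
    // Each readFully() is one XRootD request, hence one RTT per basket.
    static byte[][] readBaskets(FSDataInputStream in, BasketRange[] baskets)
            throws IOException {
        byte[][] out = new byte[baskets.length][];
        for (int i = 0; i < baskets.length; i++) {
            out[i] = new byte[baskets[i].length];
            in.readFully(baskets[i].offset, out[i], 0, out[i].length);
        }
        return out;
    }
}
```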
If hadoop-xrootd were to implement a readv()/writev() interface, I could use it to vastly reduce the number of I/O round-trips for Spark. XrdCl itself supports this via the synchronous XrdCl::File::VectorRead call, so if that C++ function could be exported up to XRootDClFile and then XRootDInputStream, I could issue a single vectored read instead of potentially hundreds/thousands of individual ones.
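To sketch what the Java-side surface could look like, a minimal, hypothetical example follows; `ReadChunk` and `readVectored()` are assumed names, not the connector's actual API, and the chunk layout just mirrors the (offset, length, buffer) triples that XrdCl's vector read takes. XRootDInputStream would override `readVectored()` with a single JNI call down to XrdCl::File::VectorRead; the default method here is only a correctness fallback that still pays one round-trip per chunk:

```java
import java.io.IOException;
import java.util.List;

// Hypothetical: one (offset, length) request in a vectored read.
final class ReadChunk {
    final long offset;    // absolute file offset
    final byte[] buffer;  // filled with exactly buffer.length bytes on success

    ReadChunk(long offset, int length) {
        this.offset = offset;
        this.buffer = new byte[length];
    }
}

// Minimal positioned-read surface the input stream already has.
interface PositionedRead {
    void readFully(long position, byte[] buf, int off, int len) throws IOException;
}

// Hypothetical vectored-read extension. The default implementation loops,
// costing one RTT per chunk; XRootDInputStream would instead override it
// with one native XrdCl::File::VectorRead call covering all chunks.
interface VectoredRead extends PositionedRead {
    default void readVectored(List<ReadChunk> chunks) throws IOException {
        for (ReadChunk c : chunks) {
            readFully(c.offset, c.buffer, 0, c.buffer.length);
        }
    }
}
```

With something like this in place, laurelin could coalesce all the baskets it needs from a read-ahead window into one `readVectored()` call instead of hundreds or thousands of individual reads.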