Does gluten support access S3 based on 'path-style'? #9412

squalud · 2025-04-24T09:44:24Z

I use Alluxio's proxy to provide S3 interface access.

By setting spark.hadoop.fs.s3a.endpoint to http://<alluxio-proxy-service-name>:39999/api/v1/s3/ and spark.hadoop.fs.s3a.path.style.access to true (path-style access), I can successfully read CSV files from PySpark using URLs of the form s3a://data/tmp/file.csv. Note that /data is a path, not the bucket.

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Read S3 Data in PySpark") \
    .remote("sc://xx.xx.xx.xx:15002") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://<alluxio-proxy-service-name>:39999/api/v1/s3/") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .getOrCreate()

csv_path = "s3a://data/tmp/file.csv"
df_csv = spark.read.csv(csv_path, header=True)
df_csv.show()

But when I switch to Gluten and set spark.gluten.sql.native.arrow.reader.enabled to true so that Arrow's reader is used, I get an error:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Read S3 Data in PySpark") \
    .remote("sc://xx.xx.xx.xx:15002") \
    .config("spark.hadoop.fs.s3a.endpoint", "http://<alluxio-proxy-service-name>:39999/api/v1/s3/") \
    .config("spark.hadoop.fs.s3a.path.style.access", "true") \
    .config("spark.gluten.enabled", "true") \
    .config("spark.gluten.sql.native.arrow.reader.enabled", "true") \
    .config("spark.plugins", "org.apache.gluten.GlutenPlugin") \
    .getOrCreate()

csv_path = "s3a://data/tmp/file.csv"
df_csv = spark.read.csv(csv_path, header=True)
df_csv.show()

SparkConnectGrpcException: (org.apache.spark.SparkException) Job aborted due to stage failure: Task 0 in stage 6.0 failed 4 times, most recent failure: Lost task 0.3 in stage 6.0 (TID 27) (xx.xx.xx.xx executor 2): org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Error during calling Java code from native code: org.apache.gluten.exception.GlutenException: org.apache.gluten.exception.GlutenException: Exception: VeloxRuntimeError
Error Source: RUNTIME
Error Code: INVALID_STATE
Reason: Operator::getOutput failed for [operator: ValueStream, plan node ID: 0]: Error during calling Java code from native code: java.lang.RuntimeException: When getting information for key 'tmp/file.csv' in bucket 'data': AWS Error ACCESS_DENIED during HeadObject operation: No response body.
	at org.apache.arrow.dataset.file.JniWrapper.makeFileSystemDatasetFactory(Native Method)
	at org.apache.arrow.dataset.file.FileSystemDatasetFactory.createNative(FileSystemDatasetFactory.java:53)
	at org.apache.arrow.dataset.file.FileSystemDatasetFactory.<init>(FileSystemDatasetFactory.java:34)
	at org.apache.gluten.utils.ArrowUtil$.makeArrowDiscovery(ArrowUtil.scala:128)
	at org.apache.gluten.utils.ArrowUtil$.readArrowSchema(ArrowUtil.scala:139)
	at org.apache.gluten.utils.ArrowUtil$.readArrowFileColumnNames(ArrowUtil.scala:152)
	at org.apache.gluten.datasource.ArrowCSVFileFormat.$anonfun$buildReaderWithPartitionValues$3(ArrowCSVFileFormat.scala:128)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:217)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:279)
	at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:129)
	at org.apache.spark.sql.execution.ArrowFileSourceScanExec$$anon$1.hasNext(ArrowFileSourceScanExec.scala:48)
	at org.apache.gluten.iterator.IteratorsV1$ReadTimeAccumulator.hasNext(IteratorsV1.scala:127)
	at scala.collection.Iterator$$anon$10.hasNext(I...

It seems that Arrow's reader treats the first-level path component, data, as the bucket; that is, spark.hadoop.fs.s3a.path.style.access does not take effect in Gluten/Arrow. How can I use Gluten with Arrow's reader to access S3 in path-style, just as Spark's original reader does?
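
For reference, the difference between the two S3 addressing styles can be sketched with the standard library alone. This is only an illustration of how a client could build request URLs, not Gluten or Arrow code. The helper names and the `alluxio-proxy` host are hypothetical placeholders.

```python
from urllib.parse import urlsplit


def split_s3a_uri(uri):
    """Split 's3a://bucket/key' into (bucket, key)."""
    parts = urlsplit(uri)
    if parts.scheme != "s3a":
        raise ValueError(f"not an s3a URI: {uri}")
    return parts.netloc, parts.path.lstrip("/")


def path_style_url(endpoint, bucket, key):
    # Path-style: the bucket is the first path segment after the endpoint.
    return f"{endpoint.rstrip('/')}/{bucket}/{key}"


def virtual_hosted_url(endpoint, bucket, key):
    # Virtual-hosted-style: the bucket is prepended to the endpoint host.
    parts = urlsplit(endpoint)
    base_path = parts.path.rstrip("/")
    return f"{parts.scheme}://{bucket}.{parts.netloc}{base_path}/{key}"


# Placeholder endpoint standing in for the Alluxio proxy address.
endpoint = "http://alluxio-proxy:39999/api/v1/s3/"
bucket, key = split_s3a_uri("s3a://data/tmp/file.csv")

print(path_style_url(endpoint, bucket, key))
print(virtual_hosted_url(endpoint, bucket, key))
```

In both styles the URI splits into bucket 'data' and key 'tmp/file.csv', which matches the names in the error message; only the request URL differs. A proxy that serves the S3 API under a URL path would only be reachable with the first form. Whether the native reader actually receives the endpoint and path-style settings is not established by this sketch.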

PHILO-HE · 2025-06-18T06:03:00Z

Collaborator

@squalud, I see there is already some code that passes the path.style.access config to Velox. Could you first verify that it is actually set for Velox in this piece of code?

incubator-gluten/cpp/velox/utils/ConfigExtractor.cc

Line 131 in 0e5830a

setConfigIfPresent(S3Config::Keys::kPathStyleAccess);
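
Before stepping into the native code, it may help to confirm on the Spark side that the expected keys are present under their exact names. A minimal sketch, assuming the session settings are available as a plain dict: `s3a_settings` and `missing_keys` are hypothetical helpers, not part of Gluten or Spark. It relies only on the fact that Spark copies `spark.hadoop.*` properties into the Hadoop configuration with the prefix stripped.

```python
SPARK_HADOOP_PREFIX = "spark.hadoop."


def s3a_settings(spark_conf):
    """Return the fs.s3a.* keys that Spark would copy into the Hadoop configuration."""
    out = {}
    for key, value in spark_conf.items():
        if key.startswith(SPARK_HADOOP_PREFIX):
            hadoop_key = key[len(SPARK_HADOOP_PREFIX):]
            if hadoop_key.startswith("fs.s3a."):
                out[hadoop_key] = value
    return out


def missing_keys(settings, required=("fs.s3a.endpoint", "fs.s3a.path.style.access")):
    """List the required Hadoop keys that are absent from the settings."""
    return [k for k in required if k not in settings]


# Settings copied from the session builder in the question (host is a placeholder).
conf = {
    "spark.hadoop.fs.s3a.endpoint": "http://alluxio-proxy:39999/api/v1/s3/",
    "spark.hadoop.fs.s3a.path.style.access": "true",
    "spark.gluten.enabled": "true",
}

settings = s3a_settings(conf)
print(settings)
print(missing_keys(settings))
```

On a live session the individual values can be read with spark.conf.get("spark.hadoop.fs.s3a.path.style.access"). This only checks what Spark was given; it does not show what the native side receives.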
