[WIP] Tws read fully fix #52592
Closed
+429,698 −130,188
Conversation
…alueProtoConverter ### What changes were proposed in this pull request? This PR improves the naming of methods in `LiteralValueProtoConverter` to better reflect their actual functionality. The changes rename several methods from using "Catalyst" terminology to "Scala" terminology: - `toCatalystValue()` → `toScalaValue()` - `toCatalystArray()` → `toScalaArray()` - `toCatalystArrayInternal()` → `toScalaArrayInternal()` - `toCatalystMap()` → `toScalaMap()` - `toCatalystMapInternal()` → `toScalaMapInternal()` - `toCatalystStruct()` → `toScalaStruct()` - `toCatalystStructInternal()` → `toScalaStructInternal()` ### Why are the changes needed? The previous method names using "Catalyst" terminology were misleading because these methods actually convert protobuf literal values to standard Scala values (like `Array`, `String`, etc.), not to Catalyst's internal data structures (like `GenericArrayData`, `UTF8String`, etc.). The "Catalyst" naming suggested these methods were part of Spark's Catalyst optimizer, when in fact they are utility methods for converting between protobuf and Scala representations. This naming inconsistency could confuse developers about the purpose and scope of these methods, potentially leading to incorrect usage or maintenance issues. ### Does this PR introduce _any_ user-facing change? No, this is an internal refactoring that improves method naming ### How was this patch tested? Existing tests ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor 1.5.9 Closes apache#52251 from heyihong/SPARK-53502. Authored-by: Yihong He <[email protected]> Signed-off-by: Ruifeng Zheng <[email protected]>
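For context on the distinction the rename captures, here is a small illustrative contrast (not the converter's API) between plain Scala values and Catalyst's internal representations; `UTF8String` and `GenericArrayData` are the Catalyst classes mentioned above.

```scala
import org.apache.spark.unsafe.types.UTF8String
import org.apache.spark.sql.catalyst.util.GenericArrayData

// "Scala" values, which the renamed toScala* methods return:
val scalaString: String = "abc"
val scalaArray: Array[Int] = Array(1, 2, 3)

// Catalyst's internal representations, which the old "Catalyst" naming wrongly suggested:
val catalystString: UTF8String = UTF8String.fromString("abc")
val catalystArray: GenericArrayData = new GenericArrayData(Array[Any](1, 2, 3))
```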
### What changes were proposed in this pull request? This PR aims to update `Gemfile` to allow Ruby 3.0+ explicitly from Apache Spark 4.1.0. Gradually, we are going to update this more accurately to a higher tested Ruby versions in order to prevent any accidental documentation corruptions. ### Why are the changes needed? 1. We use `public_suffix` v6.0.2 which requires `Ruby 3.0+` officially. https://github.com/apache/spark/blob/0dc38b02f4a72573c88d438fedd6df6271c8edd3/docs/Gemfile.lock#L60 - https://rubygems.org/gems/public_suffix/versions/6.0.2 2. We use `jekyll-sass-converter` v3.1.0 which requires `Ruby 3.1+` officially. https://github.com/apache/spark/blob/0dc38b02f4a72573c88d438fedd6df6271c8edd3/docs/Gemfile.lock#L44 - https://rubygems.org/gems/jekyll-sass-converter/versions/3.1.0 3. `Ruby 3.1` and below have been in the EOL status technically. So, it's not a surprise when one of our Ruby dependencies drops Ruby 3.1 and older support completely at any point of time. - https://www.ruby-lang.org/en/downloads/branches/ ``` Ruby 3.1 status: eol release date: 2021-12-25 normal maintenance until: 2024-04-01 EOL: 2025-03-26 ``` 4. Apache Spark 4.1 has Ruby 3.3 test coverage only and `release` script has been using `Ruby 3.0.2` on `Ubuntu Jammy`. https://github.com/apache/spark/blob/d9a23c2f447f7e11430c83799472cb7053119016/.github/workflows/pages.yml#L71 https://github.com/apache/spark/blob/d9a23c2f447f7e11430c83799472cb7053119016/dev/create-release/spark-rm/Dockerfile#L19 ``` root7b0327607ed7:/# ruby --version ruby 3.0.2p107 (2021-07-07 revision 0db68f0233) [aarch64-linux-gnu] ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Since CI passes with Ruby 3.3 always, we need to verify manually like the following. ``` $ cd docs $ docker run -it --rm -v $PWD:/docs -w /docs ruby:2.7.8 bash root5328792f5392:/docs# gem install bundler -v 2.4.22 Fetching bundler-2.4.22.gem Successfully installed bundler-2.4.22 1 gem installed root5328792f5392:/docs# bundle install Your Ruby version is 2.7.8, but your Gemfile specified >= 3.0.0 ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#52258 from dongjoon-hyun/SPARK-53514. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? refactors arrow python runner code so `BaseArrowPythonRunner` can have varying INPUT/OUTPUT types and isn't restricted to internal row and columnar batch ### Why are the changes needed? some implementations may want to have different input/outputs from the python runner. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#52214 from richardc-db/refactor_arrow_eval_code. Authored-by: Richard Chen <[email protected]> Signed-off-by: Ruifeng Zheng <[email protected]>
…/Xml` ### What changes were proposed in this pull request? This PR removes the unused `private lazy val` from both `SchemaOfCsv` and `SchemaOfXml`. ### Why are the changes needed? Code cleanup. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass GitHub Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#52259 from LuciferYang/SPARK-53515. Authored-by: yangjie01 <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
…rsion ### What changes were proposed in this pull request? Add more test for pyarrow datatype conversion ### Why are the changes needed? to improve test coverage ### Does this PR introduce _any_ user-facing change? no, test-only ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#52262 from zhengruifeng/test_arrow_type. Authored-by: Ruifeng Zheng <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request? This is a followup of apache#51434 to fix a regression. Previously, `DescribeTableExec` extracted the column's current default from the table schema, but after apache#51434, we get the table v2 columns, which may be converted from the table schema via `CatalogV2Util.structTypeToV2Columns`. This conversion only sets the default value in v2 `Column` if both the current default and the exists default are present in the `StructField` metadata. However, some data sources may not store the exists default in their column information as they have their own back-fill systems. This PR fixes the regression by allowing the absence of the exists default in v2 `Column`. ### Why are the changes needed? Fix the DESCRIBE TABLE regression. ### Does this PR introduce _any_ user-facing change? No, the regression is not released yet. ### How was this patch tested? New test. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#52252 from cloud-fan/follow. Lead-authored-by: Wenchen Fan <[email protected]> Co-authored-by: Wenchen Fan <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
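As a rough illustration of the situation described above (the `CURRENT_DEFAULT`/`EXISTS_DEFAULT` metadata key names are an assumption here, not taken from this PR), a data source might persist only the current default on a column:

```scala
import org.apache.spark.sql.types.{IntegerType, MetadataBuilder, StructField, StructType}

// A column that stores a current default but no exists default, as described above.
// The metadata key name is assumed for illustration.
val metadata = new MetadataBuilder()
  .putString("CURRENT_DEFAULT", "42")   // what DESCRIBE TABLE should report
  .build()

val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true, metadata)
))
```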
… single-pass Analyzer logic ### What changes were proposed in this pull request? Create Casts with timezone. When single-pass Analyzer reuses this code, it does not update casts which are added several levels below. Also, update `CollationTypePrecedenceSuite` to have session variables allocated more granularly, so that single-pass Analyzer runs cases in this suite even without session variable support. ### Why are the changes needed? For this logic to be reused in single-pass Analyzer in a compatible manner. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests cover these cases. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#52233 from vladimirg-db/vladimir-golubev_data/create-casts-with-timezone-from-coercion-rule. Authored-by: Vladimir Golubev <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…nd withColumnsRenamed in Spark Connect ### What changes were proposed in this pull request? Tries the eager analysis first for `withColumns` and `withColumnsRenamed` in Spark Connect. ### Why are the changes needed? The planning for `withColumns` and `withColumnsRenamed` in Spark Connect was changed to be lazy at apache#49386, but it could cause a performance issue as the current analyzer can't resolve it fast enough when they are used many times, which can happen relatively more often than the other APIs. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The existing tests should pass. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#52253 from ueshin/issues/SPARK-53505/eager_analysis. Authored-by: Takuya Ueshin <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…ultCollationToStringType` ### What changes were proposed in this pull request? Remove using `v2ColumnsToStructType` in `ApplyDefaultCollationToStringType` because of expensive handling for default values, since we only need to know column's data type. ### Why are the changes needed? Performance improvement. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#52234 from ilicmarkodb/fix_apply_def. Authored-by: ilicmarkodb <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
…Suite` which are caused by `executorHolder` undefined ### What changes were proposed in this pull request? This PR aims to fix flaky tests in `SparkConnectServiceSuite` which are caused by `executorHolder` [undefined](https://github.com/apache/spark/blob/ab9a63626018156b3e0f267f14409c30031692b7/sql/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectServiceSuite.scala#L908). The conditions to reproduce this issue are: (1) The operation finishes before its `executeHolder` is set in [MockSparkListener#onOtherEvent](https://github.com/apache/spark/blob/ab9a63626018156b3e0f267f14409c30031692b7/sql/connect/server/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectServiceSuite.scala#L961). (2) `executeHolder` is accessed through calling `verifyEvents.onComplete` after the operation finishes. `SparkListenerConnectOperationStarted` is posted asynchronously with the corresponding operation so the condition (1) can be met. After an operation finishes, `executeHolder` is [removed from a map](https://github.com/apache/spark/blob/af16aa8e11c223642f928b0b9893854a851d70bb/sql/connect/server/src/main/scala/org/apache/spark/sql/connect/service/SparkConnectExecutionManager.scala#L153) so if the condition (1) is met, `executeHolder` is never set because `SparkConnectService.executionManager.getExecuteHolder` consistently returns `None`. One example of the test affected by this issue is `SPARK-43923: commands send events - get_resources_command`. You can easily reproduce this issue by inserting sleep into `MockSparkListener#onOtherEvent` like as follows. ``` val executeKey = ExecuteKey(sessionHolder.userId, sessionHolder.sessionId, e.operationId) + Thread.sleep(1000) executeHolder = SparkConnectService.executionManager.getExecuteHolder(executeKey) ``` And then, run test. ``` $ build/sbt 'connect/testOnly org.apache.spark.sql.connect.planner.SparkConnectServiceSuite -- -z "get_resources_command"' ``` To resolve this issue, this PR proposes: * Change `VerifyEvents#onCompleted` just to assert `executeHolder.eventsManager.getProducedRowCount == producedRowCount` * Call `VerifyEvents#onCompleted` from `StreamObserver#onCompleted` * Add `VerifyEvents#assertClosed` to check if the status is `Closed` ### Why are the changes needed? For test stability. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Inserting `Thread.sleep(1000)` like mentioned above and then run `SparkConnectServiceSuite`. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#52264 from sarutak/SPARK-48163. Authored-by: Kousuke Saruta <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This PR simplifies `PipelineTest` by extending `QueryTest`, as many util functions are already defined there. ### Why are the changes needed? code cleanup ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? existing tests ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#52266 from cloud-fan/sdp_test. Authored-by: Wenchen Fan <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? `catalogString` of User Defined Type is mistakenly truncated, which leads to catalog errors. ### Why are the changes needed? bugfix ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New Unit Test ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#52263 from yaooqinn/SPARK-53518. Authored-by: Kent Yao <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…DTFs ### What changes were proposed in this pull request? Support return type coercion for Arrow Python UDTFs by doing `arrow_cast` by default ### Why are the changes needed? Consistent behavior across Arrow UDFs and Arrow UDTFs ### Does this PR introduce _any_ user-facing change? No, Arrow UDTF is not a public API yet ### How was this patch tested? New and existing UTs ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#52140 from shujingyang-db/arrow-udtf-type-corerion. Lead-authored-by: Shujing Yang <[email protected]> Co-authored-by: Shujing Yang <[email protected]> Signed-off-by: Takuya Ueshin <[email protected]>
…nst scalar under ANSI ### What changes were proposed in this pull request? - Ensure `==` returns a nameless Series when comparing with another Series/Index, but preserves the name for scalar comparisons. - Add test cases to compare with `np.nan` ### Why are the changes needed? Part of https://issues.apache.org/jira/browse/SPARK-53389 ### Does this PR introduce _any_ user-facing change? No, the feature is not released yet. For example, Before ```py >>> psdf['int'] == 'x' 0 False 1 False dtype: bool ``` After ```py >>> psdf['int'] == 'x' 0 False 1 False Name: int, dtype: bool ``` which follows native pandas ```py >>> pdf['int'] == 'x' 0 False 1 False Name: int, dtype: bool ``` ### How was this patch tested? Unit tests Commands below passed ```py 1037 SPARK_ANSI_SQL_MODE=true ./python/run-tests --python-executables=python3.11 --testnames "pyspark.pandas.tests.data_type_ops.test_num_ops NumOpsTests.test_comparison_dtype_compatibility" 1038 SPARK_ANSI_SQL_MODE=false ./python/run-tests --python-executables=python3.11 --testnames "pyspark.pandas.tests.data_type_ops.test_num_ops NumOpsTests.test_comparison_dtype_compatibility" 1039 SPARK_ANSI_SQL_MODE=true ./python/run-tests --python-executables=python3.11 --testnames "pyspark.pandas.tests.data_type_ops.test_num_ops NumOpsTests.test_eq" 1040 SPARK_ANSI_SQL_MODE=false ./python/run-tests --python-executables=python3.11 --testnames "pyspark.pandas.tests.data_type_ops.test_num_ops NumOpsTests.test_eq" 1041 SPARK_ANSI_SQL_MODE=false ./python/run-tests --python-executables=python3.11 --testnames "pyspark.pandas.tests.data_type_ops.test_num_ops NumOpsTests.test_ne" 1042 SPARK_ANSI_SQL_MODE=true ./python/run-tests --python-executables=python3.11 --testnames "pyspark.pandas.tests.data_type_ops.test_num_ops NumOpsTests.test_ne" ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#52224 from xinrong-meng/cmp_op_test. Authored-by: Xinrong Meng <[email protected]> Signed-off-by: Xinrong Meng <[email protected]>
### What changes were proposed in this pull request? apache#49962 added a fallback in case there were already broken (ie, non-resolved) persisted default values in catalogs. A broken one is something like 'current_database, current_user, current_timestamp' , these are non-deterministic and will bring wrong results in EXISTS_DEFAULT, where user expects the value resolved when they set the default. Add yet another fallback for broken default default value, in this case one where there are nested function calls. ### Why are the changes needed? Take the case where the EXISTS_DEFAULT is : ```CONCAT(YEAR(CURRENT_DATE), LPAD(WEEKOFYEAR(CURRENT_DATE), 2, '0'))``` the current code `Literal.fromSQL(defaultSQL)` will throw the exception before getting to the fallback: ``` Caused by: java.lang.AssertionError: assertion failed: function arguments must be resolved. at scala.Predef$.assert(Predef.scala:279) at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expressionBuilder$1(FunctionRegistry.scala:1278) at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistryBase.lookupFunction(FunctionRegistry.scala:251) at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistryBase.lookupFunction$(FunctionRegistry.scala:245) at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:317) at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$fromSQL$1.applyOrElse(literals.scala:325) at org.apache.spark.sql.catalyst.expressions.Literal$$anonfun$fromSQL$1.applyOrElse(literals.scala:317) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$4(TreeNode.scala:586) at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(origin.scala:121) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:586) at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUpWithPruning$1(TreeNode.scala:579) at scala.collection.immutable.List.map(List.scala:251) at scala.collection.immutable.List.map(List.scala:79) at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:768) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUpWithPruning(TreeNode.scala:579) at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:556) at org.apache.spark.sql.catalyst.expressions.Literal$.fromSQL(literals.scala:317) at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.analyzeExistenceDefaultValue(ResolveDefaultColumnsUtil.scala:393) at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.getExistenceDefaultValue(ResolveDefaultColumnsUtil.scala:529) at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.$anonfun$getExistenceDefaultValues$1(ResolveDefaultColumnsUtil.scala:524) at scala.collection.ArrayOps$.map$extension(ArrayOps.scala:936) at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.getExistenceDefaultValues(ResolveDefaultColumnsUtil.scala:524) at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.$anonfun$existenceDefaultValues$2(ResolveDefaultColumnsUtil.scala:594) at scala.Option.getOrElse(Option.scala:201) at org.apache.spark.sql.catalyst.util.ResolveDefaultColumns$.existenceDefaultValues(ResolveDefaultColumnsUtil.scala:592) ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Add unit test in StructTypeSuite ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#52274 from szehon-ho/more_default_value_fallback. 
Authored-by: Szehon Ho <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
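A minimal sketch of the fallback shape described above (only `Literal.fromSQL` is taken from the stack trace; the fallback helper is hypothetical): try the cheap literal path first, and fall back to a full analysis when the stored EXISTS_DEFAULT contains unresolved, nested function calls.

```scala
import scala.util.control.NonFatal
import org.apache.spark.sql.catalyst.expressions.{Expression, Literal}

// Hypothetical sketch: Literal.fromSQL only handles already-resolved literal trees,
// so an EXISTS_DEFAULT like CONCAT(YEAR(CURRENT_DATE), LPAD(WEEKOFYEAR(CURRENT_DATE), 2, '0'))
// must go through a slower analyze-and-fold path instead.
def resolveExistsDefault(sql: String, analyzeAndFold: String => Expression): Expression =
  try {
    Literal.fromSQL(sql)
  } catch {
    case NonFatal(_) => analyzeAndFold(sql)   // hypothetical fallback helper
  }
```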
### What changes were proposed in this pull request? This PR changes default value of `spark.sql.scripting.enabled` in order to enable SQL scripting by default. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? Yes, the default value of `spark.sql.scripting.enabled` is changed. ### How was this patch tested? Existing tests. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#52272 from dusantism-db/enable-sql-scripting-by-default. Authored-by: Dušan Tišma <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? Small refactor of the Star trait to make it compatible with the new single-pass Analyzer. Basically, remove `LogicalPlan` from the signature of core methods. This makes it possible to call them using `NameScope`. ### Why are the changes needed? To eventually support all types of star expressions in the single pass analyzer ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Claude Code v1.0.107 Closes apache#52268 from mikhailnik-db/refactor-star-trait. Authored-by: Mikhail Nikoliukin <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request? Disallow `%` between Decimal and float under ANSI pandas: ```py >>> pdf['decimal'] % 0.1 Traceback (most recent call last): ... TypeError: unsupported operand type(s) for %: 'decimal.Decimal' and 'float' ``` pandas on spark before: ```py >>> psdf['decimal'] % 0.1 0 0.1 1 0.1 2 0.1 Name: decimal, dtype: float64 ``` pandas on spark after: ```py >>> psdf['decimal'] % 0.1 Traceback (most recent call last): ... TypeError: Modulo can not be applied to given types. ``` ### Why are the changes needed? Part of https://issues.apache.org/jira/browse/SPARK-53389 ### Does this PR introduce _any_ user-facing change? No, the feature is not released yet ### How was this patch tested? Unit tests Commands below passed: ```py 1097 SPARK_ANSI_SQL_MODE=true ./python/run-tests --python-executables=python3.11 --testnames "pyspark.pandas.tests.data_type_ops.test_num_mul_div NumMulDivTests.test_mod" 1098 SPARK_ANSI_SQL_MODE=false ./python/run-tests --python-executables=python3.11 --testnames "pyspark.pandas.tests.data_type_ops.test_num_mul_div NumMulDivTests.test_mod" ``` ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#52255 from xinrong-meng/mod. Authored-by: Xinrong Meng <[email protected]> Signed-off-by: Ruifeng Zheng <[email protected]>
…ral.Map/Array optional ### What changes were proposed in this pull request? This PR optimizes the `LiteralValueProtoConverter` to reduce redundant type information in Spark Connect protocol buffers. The key changes include: 1. **Optimized type inference for arrays and maps**: Modified the conversion logic to only include type information in the first element of arrays and the first key-value pair of maps, since subsequent elements can infer their types from the first element. 2. **Added `needDataType` parameter**: Introduced a new parameter to control when type information is necessary, allowing the converter to skip redundant type information. 3. **Updated protobuf documentation**: Enhanced comments in the protobuf definitions to clarify that only the first element needs to contain type information for inference. 4. **Improved test coverage**: Added new test cases for complex nested structures including tuples and maps with array values. ### Why are the changes needed? The current implementation includes type information for every element in arrays and every key-value pair in maps, which is redundant and increases the size of protocol buffer messages. Since Spark Connect can infer types from the first element, including type information for subsequent elements is unnecessary and wastes bandwidth and processing time. ### Does this PR introduce any user-facing change? **No** - This PR does not introduce any user-facing changes. The change is backward compatible and existing connect clients will continue to work unchanged. ### How was this patch tested? `build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"` ### Was this patch authored or co-authored using generative AI tooling? Generated-by: Cursor 1.4.5 Closes apache#51473 from heyihong/SPARK-52449. Authored-by: Yihong He <[email protected]> Signed-off-by: Ruifeng Zheng <[email protected]>
### What changes were proposed in this pull request? Currently, ColumnarRow's `get` call does not check `isNullAt`, but `UnsafeRow.get` does. https://github.com/apache/spark/blob/b177b6515c8371fe0761b46d2fa45dd5e8465910/sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/SpecializedGettersReader.java#L36 And in some cases it is assumed that `InternalRow.get` is null-safe, for example https://github.com/apache/spark/blob/5b2c4cf9ce886b69eeb5d2303d7582f6ecd763aa/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveInspectors.scala#L377 We hit this when extending Spark to work on columnar data. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manually ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#52175 from WangGuangxin/fix_columnarrow. Authored-by: wangguangxin.cn <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
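The null-safety pattern at stake, in a simplified illustration against the generic `InternalRow` accessors (not the actual ColumnarRow patch): check `isNullAt` before dispatching on the data type, mirroring what `UnsafeRow.get` already does.

```scala
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.{DataType, IntegerType, LongType, StringType}

// Simplified sketch of a null-safe generic getter.
def nullSafeGet(row: InternalRow, ordinal: Int, dataType: DataType): Any = {
  if (row.isNullAt(ordinal)) {
    null
  } else {
    dataType match {
      case IntegerType => row.getInt(ordinal)
      case LongType    => row.getLong(ordinal)
      case StringType  => row.getUTF8String(ordinal)
      case other       => row.get(ordinal, other)
    }
  }
}
```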
### What changes were proposed in this pull request? Add tests for arrow udf with numpy output ### Why are the changes needed? to improve test coverage ### Does this PR introduce _any_ user-facing change? no, test-only ### How was this patch tested? ci ### Was this patch authored or co-authored using generative AI tooling? no Closes apache#52285 from zhengruifeng/test_numpy_arrow_agg. Authored-by: Ruifeng Zheng <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…e` building ### What changes were proposed in this pull request? This PR aims to add `libwebp-dev` to recover `spark-rm/Dockerfile` building. ### Why are the changes needed? `Apache Spark` release docker image compilation has been broken for last 7 days due to the SparkR package compilation. - https://github.com/apache/spark/actions/workflows/release.yml - https://github.com/apache/spark/actions/runs/17425825244 ``` apache#11 559.4 No package 'libwebpmux' found ... apache#11 559.4 -------------------------- [ERROR MESSAGE] --------------------------- apache#11 559.4 <stdin>:1:10: fatal error: ft2build.h: No such file or directory apache#11 559.4 compilation terminated. apache#11 559.4 -------------------------------------------------------------------- apache#11 559.4 ERROR: configuration failed for package 'ragg' ``` ### Does this PR introduce _any_ user-facing change? No, this is a fix for Apache Spark release tool. ### How was this patch tested? Manually build. ``` $ cd dev/create-release/spark-rm $ docker build . ``` **BEFORE** ``` ... Dockerfile:83 -------------------- 82 | # See more in SPARK-39959, roxygen2 < 7.2.1 83 | >>> RUN Rscript -e "install.packages(c('devtools', 'knitr', 'markdown', \ 84 | >>> 'rmarkdown', 'testthat', 'devtools', 'e1071', 'survival', 'arrow', \ 85 | >>> 'ggplot2', 'mvtnorm', 'statmod', 'xml2'), repos='https://cloud.r-project.org/')" && \ 86 | >>> Rscript -e "devtools::install_version('roxygen2', version='7.2.0', repos='https://cloud.r-project.org')" && \ 87 | >>> Rscript -e "devtools::install_version('lintr', version='2.0.1', repos='https://cloud.r-project.org')" && \ 88 | >>> Rscript -e "devtools::install_version('pkgdown', version='2.0.1', repos='https://cloud.r-project.org')" && \ 89 | >>> Rscript -e "devtools::install_version('preferably', version='0.4', repos='https://cloud.r-project.org')" 90 | -------------------- ERROR: failed to build: failed to solve: ``` **AFTER** ``` ... => [ 6/22] RUN add-apt-repository 'deb https://cloud.r-project.org/bin/linux/ubuntu jammy-cran40/' 3.8s => [ 7/22] RUN Rscript -e "install.packages(c('devtools', 'knitr', 'markdown', 'rmarkdown', 'testthat', 'devtools', 'e1071', 'survival', 'arrow', 892.2s => [ 8/22] RUN add-apt-repository ppa:pypy/ppa 15.3s ... ``` After merging this PR, we can validate via the daily release dry-run CI. - https://github.com/apache/spark/actions/workflows/release.yml ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#52290 from dongjoon-hyun/SPARK-53539. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? This PR aims to remove `pypy3` from `spark-rm` image. ### Why are the changes needed? - This will reduce the `spark-rm` build time during the daily `dry-run` CI and release process. - https://github.com/apache/spark/actions/workflows/release.yml - `pypy3` was added at 4.0.0 via the following and we didn't have `pypy3` before. - apache#46534 ### Does this PR introduce _any_ user-facing change? No. This is an update for Apache Spark release tool. ### How was this patch tested? Manual review because this is not used technically. After merging this PR, we can check the daily release dry-run CI too. - https://github.com/apache/spark/actions/workflows/release.yml ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#52291 from dongjoon-hyun/SPARK-53540. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request? Upgrade Jetty from 11.0.25 to 11.0.26. ### Why are the changes needed? To bring some bug fixes. https://github.com/jetty/jetty.project/releases/tag/jetty-11.0.26 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#52284 from zml1206/SPARK-53533. Authored-by: zml1206 <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>
…t API in Spark Connect in Scala ### What changes were proposed in this pull request? As titled. ### Why are the changes needed? This allows users to use direct passthrough partitioning API in connect mode ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? New unit tests. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#52242 from shujingyang-db/direct-shuffle-partition-id-connect. Authored-by: Shujing Yang <[email protected]> Signed-off-by: Ruifeng Zheng <[email protected]>
### What changes were proposed in this pull request? Change `ExpandExec#doExecute` to initialize the unsafe projections before using them. ### Why are the changes needed? The unsafe projections might contain non-deterministic expressions, and non-deterministic expressions must be initialized before used. For example, see the test added by this PR (which is essentially the test right above it, except with whole-stage codegen turned off). In the case where the "split-updates" property is set to "true", `RewriteUpdateTable` will create an `Expand` operator with a set of projections, one of which will contain a nondeterministic expression (the assignment value). `ExpandExec` fails to initialize the derived `UnsafeProjection`s before using them, resulting in a `NullPointerException`: ``` [info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 11.0 failed 1 times, most recent failure: Lost task 0.0 in stage 11.0 (TID 11) (10.0.0.101 executor driver): java.lang.NullPointerException: Cannot invoke "java.util.Random.nextDouble()" because "<parameter1>.mutableStateArray_0[0]" is null [info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.writeFields_0_0$(Unknown Source) [info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source) [info] at org.apache.spark.sql.execution.ExpandExec$$anon$1.next(ExpandExec.scala:75) ... ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test. ### Was this patch authored or co-authored using generative AI tooling? Closes apache#52292 from bersprockets/upd_del_oddity. Authored-by: Bruce Robbins <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
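The general contract being fixed, in a minimal sketch (not ExpandExec's actual code): projections containing non-deterministic expressions must have `initialize(partitionIndex)` called before the first `apply`.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.catalyst.expressions.{Expression, UnsafeProjection}

// Minimal sketch: build the projection inside each partition and initialize it with the
// partition index before applying it to any row. The expressions are assumed to be
// already bound to the input schema.
def projectPartitions(rdd: RDD[InternalRow], exprs: Seq[Expression]): RDD[InternalRow] =
  rdd.mapPartitionsWithIndex { (index, rows) =>
    val projection = UnsafeProjection.create(exprs)
    projection.initialize(index)   // required for non-deterministic expressions
    rows.map(projection)
  }
```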
### What changes were proposed in this pull request? This PR computes RowBasedChecksum for ShuffleWriters, which is controlled under spark.shuffle.rowbased.checksum.enabled. If enabled, Spark will calculate the RowBasedChecksum values for each partition and each map output and return the values from executors to the driver. Different from the previous shuffle checksum, RowBasedChecksum is independent of the input row order, and it is used to detect whether different task attempts of the same partition produce different output data (key or value). In case the output data has changed across retries, Spark will need to retry all tasks of the consumer stage to avoid correctness issues. This PR contains only the RowBasedChecksum computation. In the next PR, I plan to trigger the full stage retry when we detect checksum mismatches. ### Why are the changes needed? Problem: Spark's resilience features can cause an RDD to be partially recomputed, e.g. when an executor is lost due to downscaling, or due to a spot instance kill. When the output of a nondeterministic task is recomputed, Spark does not always recompute everything that depends on this task's output. In some cases, some subsequent computations are based on the output of one "attempt" of the task, while other subsequent computations are based on another "attempt". This can be problematic when the producer stage is non-deterministic, in which case the second attempt of the same task can produce output that is very different from the first one. For example, if the stage uses round-robin partitioning, some of the output data could be placed in different partitions in different task attempts. This could lead to incorrect results unless we retry the whole consumer stage that depends on the retried non-deterministic stage. Below is an example of this. Example: Let’s say we have Stage 1 and Stage 2, where Stage 1 is the producer and Stage 2 is the consumer. Assume that the data produced by Task 2 was lost for some reason while Stage 2 is executing. Further assume that at this point, Task 1 of Stage 2 has already gotten all its inputs and finished, while Task 2 of Stage 2 fails with data fetch failures. <img width="600" alt="example 1" src="https://github.com/user-attachments/assets/549d1d90-3a8c-43e3-a891-1a6c614e9f24" /> Task 2 of Stage 1 will be retried to reproduce the data, after which Task 2 of Stage 2 is retried. Eventually, Task 1 and Task 2 of Stage 2 produce the result which contains all 4 tuples {t1, t2, t3, t4} as shown in the example graph. <img width="720" alt="example 2" src="https://github.com/user-attachments/assets/bebf03d5-f05e-46b6-8f78-bfad08999867" /> Now, let’s assume that Stage 1 is non-deterministic (e.g., when using round-robin partitioning and the input data is not ordered), and Task 2 places tuple t3 for Partition 1 and tuple t4 for Partition 2 in its first attempt. It places tuple t4 for Partition 1 and tuple t3 for Partition 2 in its second attempt. When Task 2 of Stage 2 is retried, instead of reading {t2, t4} as it should, it reads {t2, t3} as its input. The result generated by Stage 2 is {t1, t2, t3, t3}, which is inaccurate. <img width="720" alt="example 3" src="https://github.com/user-attachments/assets/730fac0f-dfc3-4392-a74f-ed3e0d11e665" /> The problem can be avoided if we retry all tasks of Stage 2. As all tasks read consistent data, we can produce the result correctly, regardless of how the retry of Stage 1 Task 2 partitions the data.
<img width="720" alt="example 4" src="https://github.com/user-attachments/assets/a501a33e-97bb-4a01-954f-bc7d0f01f3e6" /> Proposal: To avoid correctness issues produced by a non-deterministic stage with partial retry, we propose an approach that first tries to detect inconsistent data that might be generated by different task attempts of a non-deterministic stage. For example, whether all the data partitions generated by Task 2 in the first attempt are the same as all the data partitions generated by the second attempt. We retry the consumer stages entirely if inconsistent data is detected. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit tested. Benchmark test: tpcds (10gb): the overhead of checksum computation with UnsafeRowChecksum is 0.4%. tpcds (3tb): the overhead of checksum computation with UnsafeRowChecksum is 0.72%. ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#50230 from JiexingLi/shuffle-checksum. Lead-authored-by: Tengfei Huang <[email protected]> Co-authored-by: Jiexing Li <[email protected]> Co-authored-by: Wenchen Fan <[email protected]> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
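An illustrative sketch (not Spark's actual RowBasedChecksum/UnsafeRowChecksum) of why such a checksum can be row-order independent: combine per-row hashes with a commutative operation, so two task attempts that emit the same rows in different orders produce the same value.

```scala
import scala.util.hashing.MurmurHash3

// Illustrative only: a per-partition checksum that is independent of row order because
// addition is commutative; what matters is the multiset of rows, not their order.
final class OrderIndependentChecksum {
  private var sum: Long = 0L
  def update(rowBytes: Array[Byte]): Unit = {
    sum += (MurmurHash3.bytesHash(rowBytes).toLong & 0xffffffffL)
  }
  def value: Long = sum
}

// Same rows, different order -> same checksum across task attempts.
val a = new OrderIndependentChecksum
val b = new OrderIndependentChecksum
Seq("t1", "t2", "t3").foreach(s => a.update(s.getBytes("UTF-8")))
Seq("t3", "t1", "t2").foreach(s => b.update(s.getBytes("UTF-8")))
assert(a.value == b.value)
```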
…StateInPySparkStateServer` ### What changes were proposed in this pull request? This PR performs the following cleanup on the code related to `TransformWithStateInPySparkStateServer`: - Removed the `private` function `sendIteratorForListState` from `TransformWithStateInPySparkStateServer`, as it is no longer used after SPARK-51891. - Removed the function `sendIteratorAsArrowBatches` from `TransformWithStateInPySparkStateServer`, as it is no longer used after SPARK-52333. - Removed the input parameters `timeZoneId`, `errorOnDuplicatedFieldNames`, `largeVarTypes`, and `arrowStreamWriterForTest` from the constructor of `TransformWithStateInPySparkStateServer`, as they are no longer used after the cleanup of `sendIteratorAsArrowBatches`. - Removed the input parameter `timeZoneId` from the constructor of `TransformWithStateInPySparkPythonPreInitRunner`, as it was only used for constructing `TransformWithStateInPySparkStateServer`. ### Why are the changes needed? Code cleanup. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass Github Actions ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#52279 from LuciferYang/TransformWithStateInPySparkStateServer. Lead-authored-by: yangjie01 <[email protected]> Co-authored-by: YangJie <[email protected]> Signed-off-by: yangjie01 <[email protected]>
### What changes were proposed in this pull request? This PR aims to increase the K8s test version to 1.34 ### Why are the changes needed? To improve the test coverage because K8s 1.34.0 was released on 2025-08-27. - https://kubernetes.io/blog/2025/08/27/kubernetes-v1-34-release/ - https://kubernetes.io/releases/#release-v1-34 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs and check the CI log. ``` ... * Downloading Kubernetes v1.34.0 preload ... ... System Info: ... Kubelet Version: v1.34.0 ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#52293 from dongjoon-hyun/SPARK-53541. Authored-by: Dongjoon Hyun <[email protected]> Signed-off-by: yangjie01 <[email protected]>
### What changes were proposed in this pull request? When `HadoopRDD.getInputFormat` fails to initialize the InputFormat, the thrown error message is not clear. This PR enhances the exception message to include the class name of InputFormat. A typical case that triggers this issue is reading an Iceberg table without a properly configured Iceberg catalog. https://stackoverflow.com/questions/72620351/getting-error-when-querying-iceberg-table-via-spark-thrift-server-using-beeline ### Why are the changes needed? Improve the error message. ### Does this PR introduce _any_ user-facing change? Yes, the user would know more info from the error message. ### How was this patch tested? Before ``` spark-sql (default)> select * from i; 2025-09-09 05:52:13 ERROR SparkSQLDriver: Failed in [select * from i] java.lang.RuntimeException: java.lang.InstantiationException at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:157) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:126) at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:219) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:233) at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:301) ... Caused by: java.lang.InstantiationException at java.base/jdk.internal.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:48) at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500) at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:481) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:155) ... 67 more ``` After ``` spark-sql (default)> select * from i; 2025-09-09 06:23:48 ERROR SparkSQLDriver: Failed in [select * from i] java.lang.RuntimeException: Failed to instantiate org.apache.hadoop.mapred.FileInputFormat at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:223) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:231) at org.apache.spark.rdd.RDD.$anonfun$partitions$2(RDD.scala:301) ... Caused by: java.lang.InstantiationException at java.base/jdk.internal.reflect.InstantiationExceptionConstructorAccessorImpl.newInstance(InstantiationExceptionConstructorAccessorImpl.java:48) at java.base/java.lang.reflect.Constructor.newInstanceWithCaller(Constructor.java:500) at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:481) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:155) at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:126) at org.apache.spark.rdd.HadoopRDD.getInputFormat(HadoopRDD.scala:218) ... 65 more ``` ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#52282 from pan3793/SPARK-53531. Authored-by: Cheng Pan <[email protected]> Signed-off-by: yangjie01 <[email protected]>
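The wrapping pattern in a small, self-contained sketch (plain JDK reflection, not the HadoopRDD code itself): surface the name of the class that failed to instantiate instead of a bare `InstantiationException`.

```scala
// Simplified sketch of the improved error reporting: name the class that could not
// be instantiated so messages like the "After" example above become possible.
def instantiate[T](cls: Class[T]): T =
  try {
    cls.getDeclaredConstructor().newInstance()
  } catch {
    case e: ReflectiveOperationException =>
      throw new RuntimeException(s"Failed to instantiate ${cls.getName}", e)
  }
```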
### What changes were proposed in this pull request? Push Variant into DSv2 scan ### Why are the changes needed? with the change, DSV2 scan only needs to fetch the necessary shredded columns required by the plan ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? new tests ### Was this patch authored or co-authored using generative AI tooling? No Closes apache#52522 from huaxingao/variant-v2-pushdown. Authored-by: Huaxin Gao <[email protected]> Signed-off-by: Huaxin Gao <[email protected]>
…ateServer Use readFully() instead of read() to ensure the entire protobuf message is read from the input stream. The read() method may only read partial results (experimentally 8KB), which can cause failures when processing large state values. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <[email protected]>
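A minimal sketch of the idea behind the fix (the length-prefixed framing below is an assumption, not the state server's actual wire format): `readFully` blocks until the requested number of bytes is available, whereas a single `read` may return after only part of them, e.g. one socket buffer's worth (~8KB), leaving a large protobuf message truncated.

```scala
import java.io.DataInputStream

// Assumed framing for illustration: a 4-byte length prefix followed by the message bytes.
def readMessageBytes(in: DataInputStream): Array[Byte] = {
  val length = in.readInt()
  val buf = new Array[Byte](length)
  // read(buf) may fill only part of buf and return early;
  // readFully(buf) loops until all `length` bytes have been read (or throws EOFException).
  in.readFully(buf)
  buf
}
```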
This reverts commit 6a0fb22.
This reverts commit 3d8f9ab.