[SPARK-51156][CONNECT] Static token authentication support in Spark Connect #50006

Closed
Kimahriman wants to merge 12 commits into apache:master from Kimahriman:spark-connect-local-auth

Conversation

Kimahriman (Contributor)

What changes were proposed in this pull request?

Adds static token authentication support to Spark Connect, which is used by default for automatically started servers locally.

Why are the changes needed?

To add authentication support to Spark Connect, so that a Connect server isn't inadvertently started in a way that leaves it accessible to other users.

Does this PR introduce any user-facing change?

The local authentication should be transparent to users; this PR also adds the option for users manually starting Connect servers to specify an authentication token.

How was this patch tested?

Existing UTs

Was this patch authored or co-authored using generative AI tooling?

No

@Kimahriman (Contributor, author)

@HyukjinKwon

Mostly based on your initial PR, the main differences being:

  • Uses the config system in addition to an env var. The env var is mostly for internal use and for test cases where additional clients connect to the same automatically launched local server; the config can also be used by external users creating dedicated servers to add authentication support.
  • Uses the Authorization header with the token parameter of the gRPC client, by adding the ability to use a token with a non-TLS local connection (a rough sketch of this flow follows below).
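
To make the two points above concrete, here is a rough, hedged sketch of how the token could flow from environment to client: the env var name SPARK_CONNECT_AUTHENTICATE_TOKEN comes from later in this thread, while the port, user id, fallback token value, and exact URL parameter name are placeholders and may differ from the final PR.

```scala
// Sketch only: shows how a static token could be supplied and what the client is
// expected to send; names follow the convention discussed in this PR.
object TokenFlowSketch extends App {
  // Prefer the env var used for automatically launched local servers, falling back
  // to a placeholder value for illustration.
  val token: String =
    Option(System.getenv("SPARK_CONNECT_AUTHENTICATE_TOKEN")).getOrElse("my-secret-token")

  // A local connect URL carrying the token as a channel option.
  val connectUrl = s"sc://localhost:15002/;user_id=alice;token=$token"

  // On every RPC, the gRPC client attaches the token as a standard bearer-auth header.
  val authorizationHeader = s"Bearer $token"

  println(connectUrl)
  println(s"Authorization: $authorizationHeader")
}
```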

I haven't done the Scala client yet; I would need to make similar changes there to allow a token over a non-TLS local connection as well. Mostly just seeing if you think this is helpful, or a reasonable alternative route to go down.

@HyukjinKwon (Member)

nice!

@HyukjinKwon (Member)

will take a closer look tmr :-)

      metadata: Metadata,
      next: ServerCallHandler[ReqT, RespT]): ServerCall.Listener[ReqT] = {
    val authHeaderValue =
      metadata.get(Metadata.Key.of("Authorization", Metadata.ASCII_STRING_MARSHALLER))
Contributor:

Nit: Let's not recreate the same key over and over. Either put it in the class or in a companion object.

Contributor (author):

just put it in the class for now

        val status = Status.UNAUTHENTICATED.withDescription("No authentication token provided")
        call.close(status, new Metadata())
        new ServerCall.Listener[ReqT]() {}
      } else if (authHeaderValue != s"Bearer $token") {
Contributor:

Nit: you could pre-compute the expected value.

Contributor (author):

updated
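
For reference, a hedged sketch of what the interceptor might look like once both nits are addressed (header key created once in a companion object, expected value pre-computed). The class name and the second error message are illustrative; the PR's actual code may differ in detail.

```scala
import io.grpc.{Metadata, ServerCall, ServerCallHandler, ServerInterceptor, Status}

object PreSharedKeyAuthenticationInterceptor {
  // Create the header key once instead of on every call.
  val AuthorizationKey: Metadata.Key[String] =
    Metadata.Key.of("Authorization", Metadata.ASCII_STRING_MARSHALLER)
}

class PreSharedKeyAuthenticationInterceptor(token: String) extends ServerInterceptor {
  import PreSharedKeyAuthenticationInterceptor.AuthorizationKey

  // Pre-computed expected header value.
  private val expectedValue = s"Bearer $token"

  override def interceptCall[ReqT, RespT](
      call: ServerCall[ReqT, RespT],
      metadata: Metadata,
      next: ServerCallHandler[ReqT, RespT]): ServerCall.Listener[ReqT] = {
    val authHeaderValue = metadata.get(AuthorizationKey)
    if (authHeaderValue == null) {
      call.close(
        Status.UNAUTHENTICATED.withDescription("No authentication token provided"),
        new Metadata())
      new ServerCall.Listener[ReqT]() {}
    } else if (authHeaderValue != expectedValue) {
      call.close(
        Status.UNAUTHENTICATED.withDescription("Invalid authentication token"),
        new Metadata())
      new ServerCall.Listener[ReqT]() {}
    } else {
      next.startCall(call, metadata)
    }
  }
}
```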

@@ -135,7 +136,10 @@ object StreamingForeachBatchHelper extends Logging {
       sessionHolder: SessionHolder): (ForeachBatchFnType, AutoCloseable) = {
 
     val port = SparkConnectService.localPort
-    val connectUrl = s"sc://localhost:$port/;user_id=${sessionHolder.userId}"
+    var connectUrl = s"sc://localhost:$port/;user_id=${sessionHolder.userId}"
+    Connect.getAuthenticateToken.foreach { token =>
Contributor:

Nit: If you like reusing code then factor this into a helper function in sql/connect/server/src/main/scala/org/apache/spark/sql/connect/config/Connect.scala.

def authenticationTokenParam: String = getAuthenticateToken.map(token => ";token=" + token).getOrElse("")
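
A small, hedged sketch of how such a helper might be used; ConnectSketch and buildLocalConnectUrl are hypothetical names for illustration, and the config lookup is stubbed with the env var.

```scala
// Sketch only: the helper suggested above plus one possible call site.
object ConnectSketch {
  // Stand-in for the real static-conf lookup in Connect.scala.
  def getAuthenticateToken: Option[String] =
    Option(System.getenv("SPARK_CONNECT_AUTHENTICATE_TOKEN"))

  def authenticationTokenParam: String =
    getAuthenticateToken.map(token => ";token=" + token).getOrElse("")
}

// Hypothetical call site (e.g. when building the local connect URL for a worker).
def buildLocalConnectUrl(port: Int, userId: String): String =
  s"sc://localhost:$port/;user_id=$userId" + ConnectSketch.authenticationTokenParam
```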

@hvanhovell (Contributor) left a comment:

LGTM

@hvanhovell (Contributor)

@Kimahriman for scala you probably need to change org.apache.spark.sql.connect.client.SparkConnectClient.Configuration.credentials. If you don't want SSL I guess you can compose InsecureChannelCredentials with CallCredentials, but I frankly have no idea if that satisfies the requirements, and if it is even allowed by gRPC.

Kimahriman force-pushed the spark-connect-local-auth branch from a9b1110 to 7daf701 on February 19, 2025 22:18
HyukjinKwon changed the title from "[SPARK-51156] Static token authentication support to Spark Connect" to "[SPARK-51156][CONNECT] Static token authentication support to Spark Connect" on Feb 19, 2025

  def getAuthenticateToken: Option[String] = {
    SparkEnv.get.conf.get(CONNECT_AUTHENTICATE_TOKEN).orElse {
      sys.env.get(CONNECT_AUTHENTICATE_TOKEN_ENV)
Member:

Let's avoid sys.env. This creates a map internally for all environment variables, and it's actually pretty slow. Let's use System.getenv
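
A minimal sketch of the suggested change, with the static-conf lookup stubbed out so the snippet stands alone; the env var name matches the one used elsewhere in this thread.

```scala
object AuthTokenSketch {
  // Sketch only: fall back from the static conf to System.getenv instead of sys.env.
  private val ConnectAuthenticateTokenEnv = "SPARK_CONNECT_AUTHENTICATE_TOKEN"

  // Placeholder for SparkEnv.get.conf.get(CONNECT_AUTHENTICATE_TOKEN).
  private def tokenFromConf: Option[String] = None

  def getAuthenticateToken: Option[String] =
    tokenFromConf.orElse(Option(System.getenv(ConnectAuthenticateTokenEnv)))
}
```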

# Configurations to be overwritten
overwrite_conf = opts
overwrite_conf["spark.master"] = master
overwrite_conf["spark.local.connect"] = "1"
# When running a local server, always use an ephemeral port
overwrite_conf["spark.connect.grpc.binding.port"] = "0"
Member:

I think we shouldn't mix this improvement up with the auth change. Let's revert this change.

Contributor (author):

Yeah, I just had this in the branch as I was messing with things; will revert.

Contributor (author):

Reverted

os.environ["SPARK_LOCAL_CONNECT"] = "1"
os.environ["SPARK_CONNECT_AUTHENTICATE_TOKEN"] = token
@HyukjinKwon (Member, Feb 19, 2025):

You're making the same mistake as I did :-). Individual Spark tests do not actually terminate the JVM; they only stop the SparkContext via SparkSession.stop. After that, you create a new SparkContext (via SparkSession) for the Spark Connect server here, and it creates a new token. Then the previously set SPARK_CONNECT_AUTHENTICATE_TOKEN will be applied to the Spark Connect server, and you end up with a different token.

To work around this problem, you should either create the token once (as in #49880) or have a variable that is reset on SparkSession.stop.

Contributor (author):

Isn't that what del os.environ["SPARK_CONNECT_AUTHENTICATE_TOKEN"] is doing in SparkSession.stop?

I did hit issues with tests and incorrect tokens, which is why I had to manually set the token for the foreach batch worker and query listener processes that are started. The env vars were stomping on each other. I don't think it's from the token not clearing after stopping, but from thread-based parallelism that shares the same environment vars. I need to go through and remind myself where, if anywhere, the env vars are actually still needed with how things are set up. I think there were a few unit tests that try to create a second session to the same remote server, which I don't know is really even a real use case.

Member:

Ah, okie. I saw you are deleting it. One simpler way is just to generate the token only once for this process, and not clean it up at all. The token will be generated per process in any event, so I think it could make things easier. spark.connect.authenticate.token is a static config anyway.

Contributor (author):

Okay I changed things around a little bit:

  • The unit tests set a static token value via the config
  • When starting the local session, we check whether the env var is already set and use it if it is; otherwise, if the config is set, we set the env var to that; otherwise we generate a new token and set the env var (see the sketch after this list)
  • Don't set the config for the auto-started server; just let the env var set the token on the server
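
A hedged sketch of that resolution order (env var if already set, else the explicit config, else a freshly generated token); the function name and the way the config value is passed in are illustrative, not the PR's actual code.

```scala
import java.util.UUID

object LocalTokenResolutionSketch {
  // Sketch of the start-up resolution order described above. In the real start-up
  // path the resolved token would then be exported via SPARK_CONNECT_AUTHENTICATE_TOKEN
  // so the automatically launched server (and any worker processes) pick it up.
  def resolveLocalAuthToken(configuredToken: Option[String]): String =
    Option(System.getenv("SPARK_CONNECT_AUTHENTICATE_TOKEN"))
      .orElse(configuredToken)
      .getOrElse(UUID.randomUUID().toString)
}
```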

@@ -313,4 +314,21 @@ object Connect {
       .internal()
       .booleanConf
       .createWithDefault(true)
 
+  val CONNECT_AUTHENTICATE_TOKEN =
+    buildStaticConf("spark.connect.authenticate.token")
@HyukjinKwon (Member, Feb 19, 2025):

#49880 (comment)

I also still don't like having it as a conf. Are we sure this is not shown in ps? Those configurations will be passed to spark-submit through the Py4J server launcher to start the Spark Connect server.

This is different from other cases because it will always show up in the ps command, whereas other configurations are generally set in the spark.conf file.

Member:

I am being particularly difficult on this because we're adding it mainly to block access from an arbitrary user... but it will be pretty useless if the token can just be seen with a simple ps aux.

@Kimahriman (Contributor, author, Feb 19, 2025):

Okay, I see what you mean. I was not expecting the conf vars set on the Python SparkConf to end up on the command line; I thought they were just passed in memory through Py4J. Let me think about that.

There are a lot of use cases I'm trying to encompass here, though, which is why the dual config/env var setup:

  • pyspark --remote local / spark-shell --remote local: For the most part I assume this is for testing purposes. I'm not sure what the use case is for multiple users being on the same machine who need to prevent others from connecting to their own session. This would be the case where seeing the ps output could be thought of as a security hole. Having the authentication at least prevents users on other servers from remotely accessing this Connect server
  • pyspark --conf spark.api.mode=connect / spark-shell --conf spark.api.mode=connect: I think this is effectively the same thing as the previous use case
  • spark-submit --deploy-mode client --conf spark.api.mode=connect: Kinda similar to the previous two: multiple users on the same machine where a job is submitted from, but you don't want them to access your own sessions. I guess if multiple users can remotely start sessions this way on a dedicated server, you could see the ps output from the Spark driver. I don't think this method would show the token on the command line, but I would need to verify
  • spark-submit --deploy-mode cluster --conf spark.api.mode=connect: This is the case I am most worried about from a security perspective. You are launching a driver in a shared compute cluster, so anyone else on that cluster would be able to access your Spark Connect server without any authentication (the reason I brought up the security issue in the beginning). I also don't think this would show the token on the command line, but I would need to verify

Contributor (author):

The reason the config is designed to be used for more than just the local use case is for something like:

  • spark-submit --deploy-mode cluster --conf spark.connect.authenticate.token=<token> --class org.apache.spark.sql.connect.service.SparkConnectServer: So you can launch a remote Spark session on behalf of a user and then only let that user authenticate to it

Member:

Okie. I am fine with going ahead with this first, and following up.

Contributor (author):

Updated so that the tests specify a constant value via the config, but normal usage would only set the env var for the automatically launched server

@HyukjinKwon (Member)

Would be great if we can have some tests (can refer #49880 or write your own). Otherwise, let's go with this way, looks good to me too 👍

@Kimahriman (Contributor, author)

> Would be great if we can have some tests (can refer #49880 or write your own). Otherwise, let's go with this way, looks good to me too 👍

Yeah I saw you added some tests in your branch, I'll check that out and add some as well

hvanhovell marked this pull request as ready for review on February 20, 2025 01:06
Kimahriman changed the title from "[SPARK-51156][CONNECT] Static token authentication support to Spark Connect" to "[SPARK-51156][CONNECT] Static token authentication support in Spark Connect" on Feb 20, 2025
@HyukjinKwon (Member) left a comment:

LGTM too pending tests

@Kimahriman (Contributor, author)

Yep, working on some tests and still figuring out the Scala client

Kimahriman force-pushed the spark-connect-local-auth branch from a035648 to bd4323d on February 21, 2025 14:59
@Kimahriman (Contributor, author) commented on Feb 21, 2025:

> @Kimahriman for scala you probably need to change org.apache.spark.sql.connect.client.SparkConnectClient.Configuration.credentials. If you don't want SSL I guess you can compose InsecureChannelCredentials with CallCredentials, but I frankly have no idea if that satisfies the requirements, and if it is even allowed by gRPC.

Yep, it is not allowed by gRPC. The Java library doesn't have an equivalent of the Python library's LocalChannelCredentials: grpc/grpc-java#9900

I was able to work around this by simply passing in the metadata through the interceptor if the host is localhost; let me know what you think.

I also had to make some updates to the assertions around tokens and SSL in the client setup, and changed it to behave like the Python setup, where TLS is used by default if you have a token set, unless you're talking to localhost, and no checks are made on the options you try to use. Let me know if you want to go with that same behavior or have some checks in place like before. That's why some tests are failing right now; I can update them accordingly after deciding which route to go.
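
For reference, a hedged sketch of what passing the metadata through a client interceptor for localhost could look like with plain grpc-java; the class name and the bare host check are simplifications for illustration, not the PR's exact code.

```scala
import io.grpc.{CallOptions, Channel, ClientCall, ClientInterceptor, ForwardingClientCall, Metadata, MethodDescriptor}

class LocalBearerTokenInterceptor(token: String, host: String) extends ClientInterceptor {
  private val authorizationKey: Metadata.Key[String] =
    Metadata.Key.of("Authorization", Metadata.ASCII_STRING_MARSHALLER)
  private val headerValue = s"Bearer $token"

  override def interceptCall[ReqT, RespT](
      method: MethodDescriptor[ReqT, RespT],
      callOptions: CallOptions,
      next: Channel): ClientCall[ReqT, RespT] = {
    new ForwardingClientCall.SimpleForwardingClientCall[ReqT, RespT](
        next.newCall(method, callOptions)) {
      override def start(listener: ClientCall.Listener[RespT], headers: Metadata): Unit = {
        // Java gRPC has no LocalChannelCredentials, so for a local, non-TLS channel the
        // token is attached directly as request metadata instead of via CallCredentials.
        if (host == "localhost") {
          headers.put(authorizationKey, headerValue)
        }
        super.start(listener, headers)
      }
    }
  }
}
```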

@HyukjinKwon (Member)

There was a conflict with my merged PR; I resolved it on my own 👍

@HyukjinKwon (Member)

Merged to master and branch-4.0.

Thanks @Kimahriman for taking this over!

HyukjinKwon added a commit that referenced this pull request on Feb 23, 2025: [SPARK-51156][CONNECT] Static token authentication support in Spark Connect (commit message identical to the PR description above).

Closes #50006 from Kimahriman/spark-connect-local-auth.

Lead-authored-by: Adam Binford <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 7e9547c)
Signed-off-by: Hyukjin Kwon <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request on Feb 24, 2025
…N_ON_INSECURE_CONN_ERROR_MSG` from `SparkConnectClient`

### What changes were proposed in this pull request?
This PR aims to remove the unused `private val AUTH_TOKEN_ON_INSECURE_CONN_ERROR_MSG` from `SparkConnectClient` because it becomes a useless `private val` declaration after #50006.

### Why are the changes needed?
Remove unused `private val`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #50070 from LuciferYang/SPARK-51156-FOLLOWUP.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request on Feb 24, 2025: the same follow-up commit as above (Closes #50070 from LuciferYang/SPARK-51156-FOLLOWUP), cherry picked from commit 9de3b7c and signed off by Dongjoon Hyun.
Pajaraja pushed a commit to Pajaraja/spark that referenced this pull request on Mar 6, 2025: the same authentication commit (Closes apache#50006 from Kimahriman/spark-connect-local-auth).
Pajaraja pushed a commit to Pajaraja/spark that referenced this pull request on Mar 6, 2025: the same follow-up commit (Closes apache#50070 from LuciferYang/SPARK-51156-FOLLOWUP).