[SPARK-51156][CONNECT] Static token authentication support in Spark Connect #50006

Closed
Kimahriman wants to merge 12 commits into apache:master from Kimahriman:spark-connect-local-auth

Conversation

Kimahriman (Contributor)

What changes were proposed in this pull request?

Adds static token authentication support to Spark Connect, which is used by default for automatically started servers locally.

Why are the changes needed?

To add authentication support to Spark Connect, so that a Connect server isn't inadvertently started in a way that leaves it accessible to other users.

Does this PR introduce any user-facing change?

The local authentication should be transparent to users; this PR also adds the option for users manually starting Connect servers to specify an authentication token.

How was this patch tested?

Existing UTs

Was this patch authored or co-authored using generative AI tooling?

No

@Kimahriman (Contributor, author)

@HyukjinKwon

Mostly based on your initial PR, the main differences being:

  • Uses the config system in addition to an env var. The env var is mostly for internal use and for test cases where additional clients connect to the same automatically launched local server; the config can also be used by external users creating dedicated servers to add authentication support.
  • Uses the Authorization header with the token parameter of the gRPC client, by adding the ability to use a token with a non-TLS local connection (a rough sketch of this flow follows below).
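
To make the two points above concrete, here is a rough, hedged sketch of how the token could flow from environment to client: the env var name SPARK_CONNECT_AUTHENTICATE_TOKEN comes from later in this thread, while the port, user id, fallback token value, and exact URL parameter name are placeholders and may differ from the final PR.

```scala
// Sketch only: shows how a static token could be supplied and what the client is
// expected to send; names follow the convention discussed in this PR.
object TokenFlowSketch extends App {
  // Prefer the env var used for automatically launched local servers, falling back
  // to a placeholder value for illustration.
  val token: String =
    Option(System.getenv("SPARK_CONNECT_AUTHENTICATE_TOKEN")).getOrElse("my-secret-token")

  // A local connect URL carrying the token as a channel option.
  val connectUrl = s"sc://localhost:15002/;user_id=alice;token=$token"

  // On every RPC, the gRPC client attaches the token as a standard bearer-auth header.
  val authorizationHeader = s"Bearer $token"

  println(connectUrl)
  println(s"Authorization: $authorizationHeader")
}
```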

I haven't done the Scala client yet; I would need to make similar changes there to allow a token over a non-TLS local connection as well. Mostly just seeing if you think this is helpful, or a reasonable alternative route to go down.

@HyukjinKwon (Member)

nice!

@HyukjinKwon (Member)

will take a closer look tmr :-)

      metadata: Metadata,
      next: ServerCallHandler[ReqT, RespT]): ServerCall.Listener[ReqT] = {
    val authHeaderValue =
      metadata.get(Metadata.Key.of("Authorization", Metadata.ASCII_STRING_MARSHALLER))
Contributor:

Nit: Let's not recreate the same key over and over. Either put it in the class or in a companion object.

Contributor (author):

just put it in the class for now

        val status = Status.UNAUTHENTICATED.withDescription("No authentication token provided")
        call.close(status, new Metadata())
        new ServerCall.Listener[ReqT]() {}
      } else if (authHeaderValue != s"Bearer $token") {
Contributor:

Nit: you could pre-compute the expected value.

Contributor (author):

updated
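
For reference, a hedged sketch of what the interceptor might look like once both nits are addressed (header key created once in a companion object, expected value pre-computed). The class name and the second error message are illustrative; the PR's actual code may differ in detail.

```scala
import io.grpc.{Metadata, ServerCall, ServerCallHandler, ServerInterceptor, Status}

object PreSharedKeyAuthenticationInterceptor {
  // Create the header key once instead of on every call.
  val AuthorizationKey: Metadata.Key[String] =
    Metadata.Key.of("Authorization", Metadata.ASCII_STRING_MARSHALLER)
}

class PreSharedKeyAuthenticationInterceptor(token: String) extends ServerInterceptor {
  import PreSharedKeyAuthenticationInterceptor.AuthorizationKey

  // Pre-computed expected header value.
  private val expectedValue = s"Bearer $token"

  override def interceptCall[ReqT, RespT](
      call: ServerCall[ReqT, RespT],
      metadata: Metadata,
      next: ServerCallHandler[ReqT, RespT]): ServerCall.Listener[ReqT] = {
    val authHeaderValue = metadata.get(AuthorizationKey)
    if (authHeaderValue == null) {
      call.close(
        Status.UNAUTHENTICATED.withDescription("No authentication token provided"),
        new Metadata())
      new ServerCall.Listener[ReqT]() {}
    } else if (authHeaderValue != expectedValue) {
      call.close(
        Status.UNAUTHENTICATED.withDescription("Invalid authentication token"),
        new Metadata())
      new ServerCall.Listener[ReqT]() {}
    } else {
      next.startCall(call, metadata)
    }
  }
}
```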

@@ -135,7 +136,10 @@ object StreamingForeachBatchHelper extends Logging {
       sessionHolder: SessionHolder): (ForeachBatchFnType, AutoCloseable) = {
 
     val port = SparkConnectService.localPort
-    val connectUrl = s"sc://localhost:$port/;user_id=${sessionHolder.userId}"
+    var connectUrl = s"sc://localhost:$port/;user_id=${sessionHolder.userId}"
+    Connect.getAuthenticateToken.foreach { token =>
Contributor:

Nit: If you like reusing code then factor this into a helper function in sql/connect/server/src/main/scala/org/apache/spark/sql/connect/config/Connect.scala.

def authenticationTokenParam: String = getAuthenticateToken.map(token => ";token=" + token).getOrElse("")
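
A small, hedged sketch of how such a helper might be used; ConnectSketch and buildLocalConnectUrl are hypothetical names for illustration, and the config lookup is stubbed with the env var.

```scala
// Sketch only: the helper suggested above plus one possible call site.
object ConnectSketch {
  // Stand-in for the real static-conf lookup in Connect.scala.
  def getAuthenticateToken: Option[String] =
    Option(System.getenv("SPARK_CONNECT_AUTHENTICATE_TOKEN"))

  def authenticationTokenParam: String =
    getAuthenticateToken.map(token => ";token=" + token).getOrElse("")
}

// Hypothetical call site (e.g. when building the local connect URL for a worker).
def buildLocalConnectUrl(port: Int, userId: String): String =
  s"sc://localhost:$port/;user_id=$userId" + ConnectSketch.authenticationTokenParam
```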

@hvanhovell (Contributor) left a comment:

LGTM

@hvanhovell (Contributor)

@Kimahriman for scala you probably need to change org.apache.spark.sql.connect.client.SparkConnectClient.Configuration.credentials. If you don't want SSL I guess you can compose InsecureChannelCredentials with CallCredentials, but I frankly have no idea if that satisfies the requirements, and if it is even allowed by gRPC.

Kimahriman force-pushed the spark-connect-local-auth branch from a9b1110 to 7daf701 on February 19, 2025 22:18
HyukjinKwon changed the title from "[SPARK-51156] Static token authentication support to Spark Connect" to "[SPARK-51156][CONNECT] Static token authentication support to Spark Connect" on Feb 19, 2025

  def getAuthenticateToken: Option[String] = {
    SparkEnv.get.conf.get(CONNECT_AUTHENTICATE_TOKEN).orElse {
      sys.env.get(CONNECT_AUTHENTICATE_TOKEN_ENV)
Member:

Let's avoid sys.env. This creates a map internally for all environment variables, and it's actually pretty slow. Let's use System.getenv
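
A minimal sketch of the suggested change, with the static-conf lookup stubbed out so the snippet stands alone; the env var name matches the one used elsewhere in this thread.

```scala
object AuthTokenSketch {
  // Sketch only: fall back from the static conf to System.getenv instead of sys.env.
  private val ConnectAuthenticateTokenEnv = "SPARK_CONNECT_AUTHENTICATE_TOKEN"

  // Placeholder for SparkEnv.get.conf.get(CONNECT_AUTHENTICATE_TOKEN).
  private def tokenFromConf: Option[String] = None

  def getAuthenticateToken: Option[String] =
    tokenFromConf.orElse(Option(System.getenv(ConnectAuthenticateTokenEnv)))
}
```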

# Configurations to be overwritten
overwrite_conf = opts
overwrite_conf["spark.master"] = master
overwrite_conf["spark.local.connect"] = "1"
# When running a local server, always use an ephemeral port
overwrite_conf["spark.connect.grpc.binding.port"] = "0"
Member:

I think we shouldn't mix this improvement up with the auth change. Let's revert this change.

Contributor (author):

Yeah, I just had this in the branch as I was messing with things; will revert.

Contributor (author):

Reverted

os.environ["SPARK_LOCAL_CONNECT"] = "1"
os.environ["SPARK_CONNECT_AUTHENTICATE_TOKEN"] = token
@HyukjinKwon (Member, Feb 19, 2025):

You're making the same mistake as I did :-). Individual Spark tests do not actually terminate the JVM; they only stop the SparkContext via SparkSession.stop. After that, you create a new SparkContext (via SparkSession) for the Spark Connect server here, and it creates a new token. Then the previously set SPARK_CONNECT_AUTHENTICATE_TOKEN will be applied to the Spark Connect server, and you end up with a different token.

To work around this problem, you should either create the token once (as in #49880) or have a variable that is reset on SparkSession.stop.

Contributor (author):

Isn't that what del os.environ["SPARK_CONNECT_AUTHENTICATE_TOKEN"] is doing in SparkSession.stop?

I did hit issues with tests and incorrect tokens, which is why I had to manually set the token for the foreach batch worker and query listener processes that are started. The env vars were stomping on each other. I don't think it's from the token not clearing after stopping, but from thread-based parallelism that shares the same environment vars. I need to go through and remind myself where, if anywhere, the env vars are actually still needed with how things are set up. I think there were a few unit tests that try to create a second session to the same remote server, which I don't know is really even a real use case.

Member:

Ah, okie. I saw you are deleting it. One simpler way is just to generate the token only once for this process, and not clean it up at all. The token will be generated per process in any event, so I think it could make things easier. spark.connect.authenticate.token is a static config anyway.

Contributor (author):

Okay I changed things around a little bit:

  • The unit tests set a static token value via the config
  • When starting the local session, we check whether the env var is already set and use it if it is; otherwise, if the config is set, we set the env var to that; otherwise we generate a new token and set the env var (see the sketch after this list)
  • Don't set the config for the auto-started server; just let the env var set the token on the server
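
A hedged sketch of that resolution order (env var if already set, else the explicit config, else a freshly generated token); the function name and the way the config value is passed in are illustrative, not the PR's actual code.

```scala
import java.util.UUID

object LocalTokenResolutionSketch {
  // Sketch of the start-up resolution order described above. In the real start-up
  // path the resolved token would then be exported via SPARK_CONNECT_AUTHENTICATE_TOKEN
  // so the automatically launched server (and any worker processes) pick it up.
  def resolveLocalAuthToken(configuredToken: Option[String]): String =
    Option(System.getenv("SPARK_CONNECT_AUTHENTICATE_TOKEN"))
      .orElse(configuredToken)
      .getOrElse(UUID.randomUUID().toString)
}
```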

@@ -313,4 +314,21 @@ object Connect {
       .internal()
       .booleanConf
       .createWithDefault(true)
 
+  val CONNECT_AUTHENTICATE_TOKEN =
+    buildStaticConf("spark.connect.authenticate.token")
@HyukjinKwon (Member, Feb 19, 2025):

#49880 (comment)

I also still don't like having it as a conf. Are we sure this is not shown in ps? Those configurations will be passed to spark-submit through the Py4J server launcher to start the Spark Connect server.

This is different from other cases because it will always show up in the ps command, whereas other configurations are generally set in the spark.conf file.

Member:

I am being particularly difficult on this because we're adding it mainly to block access from an arbitrary user... but it will be pretty useless if the token can just be seen with a simple ps aux.

@Kimahriman (Contributor, author, Feb 19, 2025):

Okay, I see what you mean. I was not expecting the conf vars set on the Python SparkConf to end up on the command line; I thought they were just passed in memory through Py4J. Let me think about that.

There are a lot of use cases I'm trying to encompass here, though, which is why the dual config/env var setup:

  • pyspark --remote local / spark-shell --remote local: For the most part I assume this is for testing purposes. I'm not sure what the use case is for multiple users being on the same machine who need to prevent others from connecting to their own session. This would be the case where seeing the ps output could be thought of as a security hole. Having the authentication at least prevents users on other servers from remotely accessing this Connect server
  • pyspark --conf spark.api.mode=connect / spark-shell --conf spark.api.mode=connect: I think this is effectively the same thing as the previous use case
  • spark-submit --deploy-mode client --conf spark.api.mode=connect: Kinda similar to the previous two: multiple users on the same machine where a job is submitted from, but you don't want them to access your own sessions. I guess if multiple users can remotely start sessions this way on a dedicated server, you could see the ps output from the Spark driver. I don't think this method would show the token on the command line, but I would need to verify
  • spark-submit --deploy-mode cluster --conf spark.api.mode=connect: This is the case I am most worried about from a security perspective. You are launching a driver in a shared compute cluster, so anyone else on that cluster would be able to access your Spark Connect server without any authentication (the reason I brought up the security issue in the beginning). I also don't think this would show the token on the command line, but I would need to verify

Contributor (author):

The reason the config is designed to be used for more than just the local use case is for something like:

  • spark-submit --deploy-mode cluster --conf spark.connect.authenticate.token=<token> --class org.apache.spark.sql.connect.service.SparkConnectServer: So you can launch a remote Spark session on behalf of a user and then only let that user authenticate to it

Member:

Okie. I am fine with going ahead with this first, and following up.

Contributor (author):

Updated so that the tests specify a constant value via the config, but normal usage would only set the env var for the automatically launched server

@HyukjinKwon (Member)

Would be great if we can have some tests (can refer #49880 or write your own). Otherwise, let's go with this way, looks good to me too 👍

@Kimahriman (Contributor, author)

> Would be great if we can have some tests (can refer #49880 or write your own). Otherwise, let's go with this way, looks good to me too 👍

Yeah I saw you added some tests in your branch, I'll check that out and add some as well

hvanhovell marked this pull request as ready for review on February 20, 2025 01:06
Kimahriman changed the title from "[SPARK-51156][CONNECT] Static token authentication support to Spark Connect" to "[SPARK-51156][CONNECT] Static token authentication support in Spark Connect" on Feb 20, 2025
@HyukjinKwon (Member) left a comment:

LGTM too pending tests

@Kimahriman (Contributor, author)

Yep, working on some tests and still figuring out the Scala client

Kimahriman force-pushed the spark-connect-local-auth branch from a035648 to bd4323d on February 21, 2025 14:59
@Kimahriman (Contributor, author) commented on Feb 21, 2025:

> @Kimahriman for scala you probably need to change org.apache.spark.sql.connect.client.SparkConnectClient.Configuration.credentials. If you don't want SSL I guess you can compose InsecureChannelCredentials with CallCredentials, but I frankly have no idea if that satisfies the requirements, and if it is even allowed by gRPC.

Yep, it is not allowed by gRPC. The Java library doesn't have an equivalent of the Python library's LocalChannelCredentials: grpc/grpc-java#9900

I was able to work around this by simply passing in the metadata through the interceptor if the host is localhost; let me know what you think.

I also had to make some updates to the assertions around tokens and SSL in the client setup, and changed it to behave like the Python setup, where TLS is used by default if you have a token set, unless you're talking to localhost, and no checks are made on the options you try to use. Let me know if you want to go with that same behavior or have some checks in place like before. That's why some tests are failing right now; I can update them accordingly after deciding which route to go.
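
For reference, a hedged sketch of what passing the metadata through a client interceptor for localhost could look like with plain grpc-java; the class name and the bare host check are simplifications for illustration, not the PR's exact code.

```scala
import io.grpc.{CallOptions, Channel, ClientCall, ClientInterceptor, ForwardingClientCall, Metadata, MethodDescriptor}

class LocalBearerTokenInterceptor(token: String, host: String) extends ClientInterceptor {
  private val authorizationKey: Metadata.Key[String] =
    Metadata.Key.of("Authorization", Metadata.ASCII_STRING_MARSHALLER)
  private val headerValue = s"Bearer $token"

  override def interceptCall[ReqT, RespT](
      method: MethodDescriptor[ReqT, RespT],
      callOptions: CallOptions,
      next: Channel): ClientCall[ReqT, RespT] = {
    new ForwardingClientCall.SimpleForwardingClientCall[ReqT, RespT](
        next.newCall(method, callOptions)) {
      override def start(listener: ClientCall.Listener[RespT], headers: Metadata): Unit = {
        // Java gRPC has no LocalChannelCredentials, so for a local, non-TLS channel the
        // token is attached directly as request metadata instead of via CallCredentials.
        if (host == "localhost") {
          headers.put(authorizationKey, headerValue)
        }
        super.start(listener, headers)
      }
    }
  }
}
```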

@HyukjinKwon (Member)

There was a conflict with my merged PR; I resolved it on my own 👍

@HyukjinKwon (Member)

Merged to master and branch-4.0.

Thanks @Kimahriman for taking this over!

HyukjinKwon added a commit that referenced this pull request on Feb 23, 2025: [SPARK-51156][CONNECT] Static token authentication support in Spark Connect (commit message identical to the PR description above).

Closes #50006 from Kimahriman/spark-connect-local-auth.

Lead-authored-by: Adam Binford <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 7e9547c)
Signed-off-by: Hyukjin Kwon <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request on Feb 24, 2025
…N_ON_INSECURE_CONN_ERROR_MSG` from `SparkConnectClient`

### What changes were proposed in this pull request?
This PR aims to remove the unused `private val AUTH_TOKEN_ON_INSECURE_CONN_ERROR_MSG` from `SparkConnectClient` because it becomes a useless `private val` declaration after #50006.

### Why are the changes needed?
Remove unused `private val`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #50070 from LuciferYang/SPARK-51156-FOLLOWUP.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
dongjoon-hyun pushed a commit that referenced this pull request on Feb 24, 2025: the same follow-up commit as above (Closes #50070 from LuciferYang/SPARK-51156-FOLLOWUP), cherry picked from commit 9de3b7c and signed off by Dongjoon Hyun.
Pajaraja pushed a commit to Pajaraja/spark that referenced this pull request on Mar 6, 2025: the same authentication commit (Closes apache#50006 from Kimahriman/spark-connect-local-auth).
Pajaraja pushed a commit to Pajaraja/spark that referenced this pull request on Mar 6, 2025: the same follow-up commit (Closes apache#50070 from LuciferYang/SPARK-51156-FOLLOWUP).