[SPARK-51156][CONNECT] Static token authentication support in Spark Connect #50006
Conversation
Mostly based on your initial PR, main differences being:

- I haven't done the Scala client yet; I would need to make similar changes allowing a token over a non-TLS local connection as well.

Mostly just seeing if you think this is helpful or a good, slightly alternative route to go down.

nice!

will take a closer look tomorrow :-)
```scala
      metadata: Metadata,
      next: ServerCallHandler[ReqT, RespT]): ServerCall.Listener[ReqT] = {
    val authHeaderValue =
      metadata.get(Metadata.Key.of("Authorization", Metadata.ASCII_STRING_MARSHALLER))
```
Nit: Let's not recreate the same key over and over. Either put it in the class or a companion object.
just put it in the class for now
```scala
      val status = Status.UNAUTHENTICATED.withDescription("No authentication token provided")
      call.close(status, new Metadata())
      new ServerCall.Listener[ReqT]() {}
    } else if (authHeaderValue != s"Bearer $token") {
```
Nit: you could pre-compute the expected value.
updated
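Putting both nits together, a minimal sketch of what the interceptor could look like (a reconstruction from the diffs above, not necessarily the merged code; the class name matches the file path mentioned later in the review, and the second error message is assumed):

```scala
import io.grpc.{Metadata, ServerCall, ServerCallHandler, ServerInterceptor, Status}

class PreSharedKeyAuthenticationInterceptor(token: String) extends ServerInterceptor {
  // Build the header key and expected value once instead of per call.
  private val authorizationKey =
    Metadata.Key.of("Authorization", Metadata.ASCII_STRING_MARSHALLER)
  private val expectedValue = s"Bearer $token"

  override def interceptCall[ReqT, RespT](
      call: ServerCall[ReqT, RespT],
      metadata: Metadata,
      next: ServerCallHandler[ReqT, RespT]): ServerCall.Listener[ReqT] = {
    val authHeaderValue = metadata.get(authorizationKey)
    if (authHeaderValue == null) {
      // Reject calls that carry no token at all.
      call.close(
        Status.UNAUTHENTICATED.withDescription("No authentication token provided"),
        new Metadata())
      new ServerCall.Listener[ReqT]() {}
    } else if (authHeaderValue != expectedValue) {
      // Reject calls whose token does not match the pre-shared one.
      call.close(
        Status.UNAUTHENTICATED.withDescription("Invalid authentication token"),
        new Metadata())
      new ServerCall.Listener[ReqT]() {}
    } else {
      next.startCall(call, metadata)
    }
  }
}
```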
```diff
@@ -135,7 +136,10 @@ object StreamingForeachBatchHelper extends Logging {
       sessionHolder: SessionHolder): (ForeachBatchFnType, AutoCloseable) = {

     val port = SparkConnectService.localPort
-    val connectUrl = s"sc://localhost:$port/;user_id=${sessionHolder.userId}"
+    var connectUrl = s"sc://localhost:$port/;user_id=${sessionHolder.userId}"
+    Connect.getAuthenticateToken.foreach { token =>
```
Nit: If you like reusing code, then factor this into a helper function in sql/connect/server/src/main/scala/org/apache/spark/sql/connect/config/Connect.scala:

```scala
def authenticationTokenParam: String =
  getAuthenticateToken.map(token => ";token=" + token).getOrElse("")
```
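With that helper, the URL construction in the foreach-batch diff above could collapse to a single expression, roughly like the following (assuming the helper lives on the `Connect` object, and using `port` and `sessionHolder` from the surrounding code):

```scala
val connectUrl =
  s"sc://localhost:$port/;user_id=${sessionHolder.userId}${Connect.authenticationTokenParam}"
```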
LGTM
@Kimahriman for scala you probably need to change …
```scala
def getAuthenticateToken: Option[String] = {
  SparkEnv.get.conf.get(CONNECT_AUTHENTICATE_TOKEN).orElse {
    sys.env.get(CONNECT_AUTHENTICATE_TOKEN_ENV)
```
Let's avoid `sys.env`. This creates a map internally for all environment variables, and it's actually pretty slow. Let's use `System.getenv`.
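A minimal sketch of the suggested change, wrapping the possibly-null result of `System.getenv` in an `Option`:

```scala
def getAuthenticateToken: Option[String] = {
  SparkEnv.get.conf.get(CONNECT_AUTHENTICATE_TOKEN).orElse {
    // Single lookup instead of materializing a map of the whole environment.
    Option(System.getenv(CONNECT_AUTHENTICATE_TOKEN_ENV))
  }
}
```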
```python
# Configurations to be overwritten
overwrite_conf = opts
overwrite_conf["spark.master"] = master
overwrite_conf["spark.local.connect"] = "1"
# When running a local server, always use an ephemeral port
overwrite_conf["spark.connect.grpc.binding.port"] = "0"
```
I think we shouldn't mix this auth change up with the improvement. Let's revert this change.
Yeah, I just had this in the branch as I was messing with things; will revert.
Reverted
```python
os.environ["SPARK_LOCAL_CONNECT"] = "1"
os.environ["SPARK_CONNECT_AUTHENTICATE_TOKEN"] = token
```
You're making the same mistake as I did :-). Individual Spark tests do not actually terminate the JVM but only stop the SparkContext via SparkSession.stop. After that, you create the SparkContext (via SparkSession) for the Spark Connect server here, and it creates a new token. Then the previously set `SPARK_CONNECT_AUTHENTICATE_TOKEN` will be applied to the Spark Connect server, and you end up with a different token.

To work around this problem, you should either create the token once (as in #49880), or have a variable to reset on `SparkSession.stop`.
Isn't that what `del os.environ["SPARK_CONNECT_AUTHENTICATE_TOKEN"]` is doing in `SparkSession.stop`?

I did hit issues with tests and incorrect tokens, which is why I had to manually set the token for the foreach batch worker and query listener processes that get started. The env vars were stomping on each other. I don't think it's from the token not clearing after stopping, but from thread-based parallelism sharing the same environment variables. I need to go through and remind myself where, if anywhere, the env vars are actually still needed with how things are set up. I think there were a few unit tests that try to create a second session to the same remote server, which I don't know is really even a real use case.
Ah, okie. I saw you are deleting it. One simpler way is just to generate the token once only for this process, and not clean it up at all. The token will be generated per process in any event, so I think it could make things easier. `spark.connect.authenticate.token` is a static config anyway.
Okay, I changed things around a little bit:

- The unit tests set a static token value via the config.
- When starting the local session, we check if the env var is already set and use that if it is; otherwise, if the config is set, we set the env var to that; otherwise we generate a new token and set the env var (sketched below).
- Don't set the config for the auto-started server; just let the env var set the token on the server.
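The resolution order in the second bullet, sketched in Scala for concreteness (the actual change lives on the Python launch path, and the helper name is illustrative):

```scala
import java.util.UUID
import org.apache.spark.SparkConf

// Env var wins if already set, then the explicit config, then a fresh token.
def resolveLocalToken(conf: SparkConf): String = {
  Option(System.getenv("SPARK_CONNECT_AUTHENTICATE_TOKEN"))
    .orElse(conf.getOption("spark.connect.authenticate.token"))
    .getOrElse(UUID.randomUUID().toString)
}
```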
```diff
@@ -313,4 +314,21 @@ object Connect {
       .internal()
       .booleanConf
       .createWithDefault(true)

+  val CONNECT_AUTHENTICATE_TOKEN =
+    buildStaticConf("spark.connect.authenticate.token")
```
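The diff is truncated here; the remainder of the definition plausibly follows Spark's usual config-builder pattern, along these lines (the doc string and version are assumptions):

```scala
val CONNECT_AUTHENTICATE_TOKEN =
  buildStaticConf("spark.connect.authenticate.token")
    .doc("A pre-shared token that clients must send to authenticate.")
    .internal()
    .version("4.0.0")
    .stringConf
    .createOptional
```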
I also still don't like this being a conf. Are we sure it is not shown in `ps`? Those will be passed to Spark Submit through the Py4J server launcher to start the Spark Connect server.

This is different from other cases because it will always show up in the `ps` output, whereas other configurations are generally set in the spark.conf file.
I am being difficult particularly on this because we're adding this mainly to block access from an arbitrary user... but it will be pretty useless if it can just be seen with a simple `ps aux`...
Okay, I see what you mean. I was not expecting the conf vars set on the Python `SparkConf` to end up on the command line; I thought they were just passed in memory through Py4J. Let me think about that.

There are a lot of use cases to consider here that I'm trying to encompass, which is why I have the dual config/env var setup:
- `pyspark --remote local` / `spark-shell --remote local`: For the most part I assume this is for testing purposes. I'm not sure what the use case is for multiple users being on the same machine but needing to prevent others from connecting to their own session. This would be the case where seeing the `ps` output could be thought of as a security hole. Having the authentication at least prevents users on other servers from remotely accessing this connect server.
- `pyspark --conf spark.api.mode=connect` / `spark-shell --conf spark.api.mode=connect`: I think this is effectively the same thing as the previous use case.
- `spark-submit --deploy-mode client --conf spark.api.mode=connect`: Kinda similar to the previous two: multiple users on the same machine where a job is submitted from, but you don't want them to access your own sessions. I guess if multiple users can remotely start sessions this way on a dedicated server, you could see the `ps` output from the Spark driver. I don't think this method would show the token on the command line, but I would need to verify.
- `spark-submit --deploy-mode cluster --conf spark.api.mode=connect`: This is the case I am most worried about from a security perspective. You are launching a driver in a shared compute cluster, so anyone else on that cluster would be able to access your Spark Connect server without any authentication (the reason I brought up the security issue in the beginning). I also don't think this would show the token on the command line, but would need to verify.
The reason the config is designed to be used for more than just the local use case is for something like:

```
spark-submit --deploy-mode cluster --conf spark.connect.authenticate.token=<token> --class org.apache.spark.sql.connect.service.SparkConnectServer
```

so you can launch a remote Spark session on behalf of a user and then only let that user authenticate to it.
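On the client side, the same token would then go into the connection string; a hypothetical Scala-client example (host, port, and token are placeholders):

```scala
import org.apache.spark.sql.SparkSession

// Connect to the pre-launched server using the matching pre-shared token.
val spark = SparkSession.builder()
  .remote("sc://connect-host:15002/;token=<token>")
  .getOrCreate()
```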
Okie. I am fine with going ahead with this first, and following up.
Updated so that the tests specify a constant value via the config, but normal usage would only set the env var for the automatically launched server
Would be great if we can have some tests (you can refer to #49880 or write your own). Otherwise, let's go with this approach; looks good to me too 👍
Yeah I saw you added some tests in your branch, I'll check that out and add some as well.
LGTM too pending tests
Yep, working on some tests and figuring out the Scala client still.
Yep, it is not allowed by gRPC. The Java library doesn't have the equivalent.

I was able to work around this by simply passing in the metadata through the interceptor if the host is localhost; let me know what you think. I also had to make some updates around the assertions for tokens and SSL on the client setup, and changed it to act like the Python setup, where TLS is used by default if you have a token set, unless you're talking to localhost, and no checks are made on the options you try to use. Let me know if you want to go with that same behavior or have some checks in place like before. That's why some tests are failing right now; I can update those accordingly after deciding which route to go.
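The localhost workaround can be pictured as a plain gRPC client interceptor that injects the header itself, since gRPC's Java library rejects call credentials on a plaintext channel. A hedged sketch (class name is illustrative, not necessarily the PR's code):

```scala
import io.grpc.{CallOptions, Channel, ClientCall, ClientInterceptor, ForwardingClientCall, Metadata, MethodDescriptor}

class LocalTokenClientInterceptor(token: String) extends ClientInterceptor {
  private val authorizationKey =
    Metadata.Key.of("Authorization", Metadata.ASCII_STRING_MARSHALLER)

  override def interceptCall[ReqT, RespT](
      method: MethodDescriptor[ReqT, RespT],
      callOptions: CallOptions,
      next: Channel): ClientCall[ReqT, RespT] = {
    new ForwardingClientCall.SimpleForwardingClientCall[ReqT, RespT](
        next.newCall(method, callOptions)) {
      override def start(listener: ClientCall.Listener[RespT], headers: Metadata): Unit = {
        // Attach the pre-shared token as a bearer header on every outgoing call.
        headers.put(authorizationKey, s"Bearer $token")
        super.start(listener, headers)
      }
    }
  }
}
```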
There's a conflict with my PR that got merged. I resolved it on my own 👍
Merged to master and branch-4.0. Thanks @Kimahriman for taking this over!
[SPARK-51156][CONNECT] Static token authentication support in Spark Connect

### What changes were proposed in this pull request?
Adds static token authentication support to Spark Connect, which is used by default for automatically started servers locally.

### Why are the changes needed?
To add authentication support to Spark Connect so a connect server isn't started that could be accessible to other users inadvertently.

### Does this PR introduce _any_ user-facing change?
The local authentication should be transparent to users, but adds the option for users manually starting connect servers to specify an authentication token.

### How was this patch tested?
Existing UTs

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #50006 from Kimahriman/spark-connect-local-auth.

Lead-authored-by: Adam Binford <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 7e9547c)
Signed-off-by: Hyukjin Kwon <[email protected]>
Remove unused `private val AUTH_TOKEN_ON_INSECURE_CONN_ERROR_MSG` from `SparkConnectClient`

### What changes were proposed in this pull request?
This PR aims to remove the unused `private val AUTH_TOKEN_ON_INSECURE_CONN_ERROR_MSG` from `SparkConnectClient` because it becomes a useless `private val` declaration after #50006.

### Why are the changes needed?
Remove unused `private val`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass GitHub Actions

### Was this patch authored or co-authored using generative AI tooling?
No

Closes #50070 from LuciferYang/SPARK-51156-FOLLOWUP.

Authored-by: yangjie01 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

Adds static token authentication support to Spark Connect, which is used by default for automatically started local servers.

### Why are the changes needed?

To add authentication support to Spark Connect so that an automatically started connect server isn't inadvertently accessible to other users.

### Does this PR introduce _any_ user-facing change?

The local authentication should be transparent to users, but this adds the option for users manually starting connect servers to specify an authentication token.

### How was this patch tested?

Existing UTs.

### Was this patch authored or co-authored using generative AI tooling?

No