
Fixed sqlserver not actually getting a lock if lock is already set #1186


Merged
dhui merged 1 commit into golang-migrate:master from urbim:bugfix/sqlserverLockFix on Apr 25, 2025

Conversation


@urbim urbim commented Oct 11, 2024

This PR fixes the SQL Server driver not actually acquiring a lock when the lock is already held (e.g. by another instance).

As mentioned in this comment by @dhui, the SQL statement for sp_getapplock is not correct.

I tested the locking with this script and added some debug logging, as seen in this commit (the debug logging is not actually part of this PR; I only used it for testing).

The result before applying the fixes was:

=== Lock 0xc000080400
=== Unlock 0xc000080400
=== Lock 0xc000080400
Locked
=== Lock 0xc0001440a0
=== Unlock 0xc0001440a0
panic: mssql: Cannot release the application lock (Database Principal: 'public', Resource: '983082799') because it is not currently held. in line 0: EXEC sp_releaseapplock @Resource = @p1, @LockOwner = 'Session'

goroutine 1 [running]:
main.main()
        /path/to/golang-migrate-test/golang-migrate-test.go:49 +0x254

The first two lines come from d1, err := p.Open(connectionString). Then d1 acquires the lock (line 3). The second call, _, err = p.Open(connectionString) // d2, should fail when it tries to acquire the lock again, but it does not: the call succeeds even though the lock is never actually acquired (line 5). The code only fails when d2 tries to unlock, which it cannot do because it never held the lock in the first place (line 6+).
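
For context, a minimal sketch of the kind of test script described above (a reconstruction, not the actual linked script; the connection string is a placeholder, and the === Lock / === Unlock lines come from the temporary debug logging in the driver's Lock and Unlock methods):

    package main

    import (
        "fmt"

        "github.com/golang-migrate/migrate/v4/database/sqlserver"
    )

    func main() {
        connectionString := "sqlserver://sa:password@localhost:1433?database=master" // placeholder

        p := &sqlserver.SQLServer{}

        // Open runs an internal Lock/Unlock cycle while ensuring the
        // migrations table exists (log lines 1-2).
        d1, err := p.Open(connectionString)
        if err != nil {
            panic(err)
        }

        // d1 takes the application lock explicitly (log line 3).
        if err := d1.Lock(); err != nil {
            panic(err)
        }
        fmt.Println("Locked")

        // The second Open should block or fail on its internal Lock while d1
        // still holds the lock. Before the fix, that Lock "succeeded" without
        // acquiring anything (log line 5), and the internal Unlock then
        // returned the error panicked on above (log line 6+).
        _, err = p.Open(connectionString) // d2
        if err != nil {
            panic(err)
        }
    }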

I'd expect a SELECT statement after the EXEC sp_getapplock statement to be used with sql.Conn.QueryRowContext instead of with sql.Conn.ExecContext().

This is exactly right: the SELECT statement is missing, and this PR fixes that, along with some other improvements to the locking:

It's unclear to me what @LockMode (update vs exclusive)

According to the documentation, Update should be used when updating resources, which is not the case here. The documentation is not entirely clear to me, but from what I can tell, Update is the same as Exclusive with an added mechanism that prevents deadlocks in the read-lock-update pattern. I think Exclusive is the right choice here, so the PR also changes that.

and @LockOwner (transaction vs session) we should be using.

From the documentation:

Locks associated with the current transaction are released when the transaction commits or rolls back. Locks associated with the session are released when the session is logged out.

Since we are acquiring the lock outside of a transaction, Session is the correct option to use. I actually tried using Transaction as the LockOwner and the result was -999, which "Indicates a parameter validation or other call error".

I also set the sp_getapplock timeout to infinite, similar to what the postgres implementation uses. Since migrations can take a long time (e.g. updating an index on a large database), I think the only good options for the timeout are either infinite or user-configurable. Since the second option is probably not feasible in the current architecture of this library, I opted for the first. One PR is already open addressing this timeout, but it would still incorrectly report the lock as obtained if another instance held it for more than 10 seconds.
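
For illustration, the fixed acquisition looks roughly like this (a sketch reconstructed from the query string visible in the second test output below, using the infinite timeout; error wrapping and the driver's lock bookkeeping are simplified, and context, fmt, and the migrate database package are assumed to be imported):

    // Lock acquires the application lock, waiting indefinitely until it is
    // granted (@LockTimeout = -1), mirroring the postgres driver's behavior.
    func (ss *SQLServer) Lock() error {
        aid, err := database.GenerateAdvisoryLockId(ss.config.DatabaseName)
        if err != nil {
            return err
        }

        // sp_getapplock reports its outcome through its return value, so the
        // batch captures it in @lockResult and SELECTs it back. The row must
        // be read with QueryRowContext; ExecContext would discard it.
        // @LockMode = 'Exclusive': we are not updating a resource.
        // @LockOwner = 'Session': the lock is taken outside a transaction.
        query := `DECLARE @lockResult int;
            EXEC @lockResult = sp_getapplock @Resource = @p1, @LockMode = 'Exclusive', @LockOwner = 'Session', @LockTimeout = -1;
            SELECT @lockResult;`

        var status int
        if err := ss.conn.QueryRowContext(context.Background(), query, aid).Scan(&status); err != nil {
            return &database.Error{OrigErr: err, Err: "try lock failed", Query: []byte(query)}
        }
        if status < 0 { // 0 (granted) and 1 (granted after waiting) mean success
            return &database.Error{Err: fmt.Sprintf("try lock failed with error %v: %v", status, lockErrorMap[status]), Query: []byte(query)}
        }
        return nil
    }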

This PR should close #1123 and #253.

I didn't manage to get the tests working locally, but I'm hoping the tests will pass in CI.

Lastly, the result of the above script with the fixes applied (and a non-infinite timeout for testing):

=== Lock 0xc00043a0a0
=== Unlock 0xc00043a0a0
=== Lock 0xc00043a0a0
Locked
=== Lock 0xc0000800c0
panic: try lock failed with error -1: The lock request timed out. in line 0: DECLARE @lockResult int; EXEC @lockResult = sp_getapplock @Resource = @p1, @LockMode = 'Exclusive', @LockOwner = 'Session', @LockTimeout = 1000; SELECT @lockResult; (details: <nil>)

goroutine 1 [running]:
main.main()
        /path/to/golang-migrate-test.go:42 +0x13c

When d2 tries to acquire the lock, it now fails after the timeout with -1: The lock request timed out, as expected.


urbim commented Apr 2, 2025

@dhui Any chance you could take a look at this?

@dhui dhui left a comment (Member)

@urbim Thanks for the PR and for the detailed write-up. Could you also rebase your PR to get the latest supported Go versions?

According to the documentation, Update should be used when updating resources, which is not the case here. The documentation is not entirely clear to me, but from what I can tell, Update is the same as Exclusive with an added mechanism that prevents deadlocks in the read-lock-update pattern. I think Exclusive is the right choice here, so the PR also changes that.

Thanks for the investigation. I agree.

Since we are acquiring the lock outside of a transaction, Session is the correct option to use. I actually tried using Transaction as the LockOwner and the result was -999, which "Indicates a parameter validation or other call error".

👍

I also set the sp_getapplock timeout to infinite, similar to what the postgres implementation uses. Since migrations can take a long time (e.g. updating an index on a large database), I think the only good options for the timeout are either infinite or user-configurable. Since the second option is probably not feasible in the current architecture of this library, I opted for the first.

I thought about this for a bit and think that changing the lock behavior to wait indefinitely makes sense, e.g. most users will want to block and wait for any pending migrations to finish running. It helps that the postgres driver has already made this decision.

#1123 is already open addressing this timeout, but it would still incorrectly report the lock as obtained if another instance held it for more than 10 seconds.

I agree with you

@@ -33,7 +33,7 @@ var (
 	ErrMultipleAuthOptionsPassed = fmt.Errorf("both password and useMsi=true were passed.")
 )
 
-var lockErrorMap = map[mssql.ReturnStatus]string{
+var lockErrorMap = map[int]string{
dhui (Member):

Can you continue to use mssql.ReturnStatus?

urbim (Contributor, Author):

Hi!

Sorry for the late reply!

I think int is the correct choice here, since we are selecting an int result (DECLARE @lockResult int;).

Note that in the previous version this map was never used, since the status variable was always 0 (the statement ss.conn.ExecContext(context.Background(), query, aid, &status) does not actually write anything to the status variable).
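
To make that concrete, the two patterns side by side (a hedged reconstruction of the relevant lines, not an exact diff):

    // Before: the batch's output is discarded. Nothing ever writes to status,
    // so it stays 0 ("lock granted") and lockErrorMap is never consulted.
    var status mssql.ReturnStatus
    _, err := ss.conn.ExecContext(context.Background(), query, aid, &status)

    // After: the batch ends in SELECT @lockResult, and QueryRowContext scans
    // that row into a plain int, so the return code is actually observed.
    var result int
    err = ss.conn.QueryRowContext(context.Background(), query, aid).Scan(&result)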

dhui (Member):

Ahh, you're right. The docs mention ReturnStatus being used for PROCs.
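
For reference, the documented sp_getapplock return codes that such a map covers (a sketch based on the SQL Server documentation; the two values quoted in this thread, -1 and -999, match the messages seen above):

    var lockErrorMap = map[int]string{
        -1:   "The lock request timed out.",
        -2:   "The lock request was canceled.",
        -3:   "The lock request was chosen as a deadlock victim.",
        -999: "Indicates a parameter validation or other call error.",
    }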

@dhui dhui mentioned this pull request Apr 17, 2025
@urbim urbim force-pushed the bugfix/sqlserverLockFix branch from 94afe47 to 6b9f24b on April 24, 2025 10:00

urbim commented Apr 24, 2025

@dhui Thanks for the review. I rebased the PR and added a comment to your review.

@coveralls

Coverage Status

coverage: 56.314% (-0.005%) from 56.319%
when pulling 6b9f24b on urbim:bugfix/sqlserverLockFix
into 9023d66 on golang-migrate:master.

@dhui dhui merged commit 2788339 into golang-migrate:master Apr 25, 2025
8 checks passed