
The Token Rotation Bug That Could Lock Users Out Forever
GourdianToken's refresh rotation had a subtle ordering bug: it destroyed the old token before the new one existed. One failed network call in between, and the user was permanently logged out. A war story about failure ordering in multi-step state changes.

Inside GourdianToken
Part 3 of 3 • Series Complete
The engineering behind GourdianToken v2.0.0, a production-grade JWT library for Go: what shipped, how race-free refresh rotation works across four storage backends, and the bugs that almost made it into production.
Some bugs crash loudly. The worst ones don't: they wait for a network blip and then quietly ruin one user's day in a way no log line will ever explain. GourdianToken's refresh rotation had one of the second kind, and it survived code review, a full test suite, and several releases before I caught it while writing documentation, of all things.
The setup: rotation is invalidate-and-replace
Refresh rotation swaps a user's refresh token on every use: the old token is marked rotated (dead) in storage, and a fresh one is returned. Two steps, touching two different systems: the storage backend and the token signer. And any two-step state change has an ordering question: which step goes first?
```go
// The original flow (simplified)
func (maker *JWTMaker) RotateRefreshToken(ctx context.Context, oldToken string) (*RefreshTokenResponse, error) {
	claims, err := maker.VerifyRefreshToken(ctx, oldToken)
	if err != nil {
		return nil, err
	}

	// Step 1: kill the old token (atomic compare-and-swap)
	marked, err := maker.tokenRepo.MarkTokenRotatedAtomic(ctx, oldToken, ttl)
	if err != nil || !marked {
		return nil, ...
	}

	// Step 2: mint the replacement
	return maker.CreateRefreshToken(ctx, claims.Subject, claims.Username, claims.SessionID)
}
```

Invalidate first, then replace. It reads naturally: you retire the old thing, then hand over the new one. It's also the intuitively 'safe' order from a security mindset: at no point do two valid tokens exist. Every test passed. The concurrency tests from the previous post in this series passed. What could go wrong lives entirely between step 1 succeeding and step 2 failing.
The failure mode: a one-way door
Suppose the atomic mark succeeds and then CreateRefreshToken fails. The causes don't have to be exotic: a context deadline expiring, a transient signing-key I/O error, the process being killed mid-request. Now walk through the user's state: their old refresh token is marked rotated, which means permanently and correctly dead, by design. The replacement token was never created. Nothing was returned.
The user is now unrecoverable
When the client retries the rotation with the only token it has, verification fails with ErrTokenRotated, the exact same signal as a replay attack. The system can't tell an unlucky user from an attacker, because the failure left the state of an attack: a rotated token being presented again. The only way out is a full re-login, and nothing in any log explains why.
What makes this bug vicious is that its blast radius is invisible. It fires once per unlucky request, affects one user at a time, and masquerades as the security feature working correctly. If a support ticket ever said 'the app randomly logged me out,' this would be the last place anyone looked.
The fix: create before you destroy
The fix that shipped in v2.0.0 is a reorder: mint the new token first, and only then mark the old one rotated. Now enumerate the failure points. If creation fails, nothing has been persisted, so the old token is still valid and the client can simply retry. If the atomic mark fails with an error, the old token is likewise untouched. There is no longer any window where the old token is dead and the new one doesn't exist.
```go
// Create the new token BEFORE marking the old one as rotated.
// If CreateRefreshToken fails, nothing has been persisted yet,
// so the old token remains valid and the caller can safely
// retry rotation instead of being permanently locked out.
newToken, err := maker.CreateRefreshToken(ctx, claims.Subject, claims.Username, claims.SessionID)
if err != nil {
	return nil, err
}

// ATOMIC OPERATION: only one goroutine will succeed here
marked, err := maker.tokenRepo.MarkTokenRotatedAtomic(ctx, oldToken, maker.config.RefreshMaxLifetimeExpiry)
if err != nil {
	return nil, fmt.Errorf("repository error: %w", err)
}
if !marked {
	// Already rotated by another goroutine. The new token created
	// above is simply discarded: it was never returned to any
	// caller and never persisted, so it poses no risk.
	return nil, fmt.Errorf("%w", ErrTokenRotated)
}
return newToken, nil
```

The new order has a cost, and it's worth being honest about it: a request that loses the concurrent-rotation race now signs a token that gets thrown away. But that discarded token is pure CPU work: it was never returned to a caller and never written to storage, so it can't be used by anyone. Wasting a signature on the losing path is a fine price for never stranding a user on the failure path.
And the 'two valid tokens exist briefly' worry that made invalidate-first feel safer? It doesn't survive contact with the atomicity guarantee. The new token only reaches the caller after the compare-and-swap succeeds, and the moment it succeeds the old token is dead. From any observer's perspective, the swap is still exactly-once.
The sibling bug hiding in MongoDB
Auditing the rotation path for the lockout fix surfaced a second, quieter bug in the same neighborhood. The MongoDB backend classified duplicate-key errors (the 'someone else already rotated this token' signal) inside its transaction callback. But a write error surfaced to the driver mid-transaction can cause the server to abort the transaction regardless of what the callback returns, which made commit behavior depend on the driver version.
The fix moved the classification to the transaction boundary: let the transaction settle first, then decide whether the error means 'conflict, return false' or 'real failure, return the error'. A dedicated regression test now races concurrent rotations against a transactional Mongo repository and asserts that duplicate keys never surface as raw errors and exactly one call wins.
What I take away from this
- In any invalidate-and-replace flow, create the replacement before destroying the original. It's the same rule as writing the new file before deleting the old one, and it applies to tokens, credentials, DNS records, and config rollouts alike.
- Enumerate the state after every step fails, not just after every step succeeds. The bug was invisible in the success path and obvious the first time I wrote down 'what does the user hold if step 2 fails?'
- Beware failure modes that impersonate features. The lockout produced ErrTokenRotated, the replay-detection signal, so the system would have reported an attack while eating a user's session.
- Happy-path plus concurrency tests still aren't fault tests. What caught this was neither; it was explaining the code's guarantees in prose and failing to justify one line's position.
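The 'enumerate the state after every step fails' exercise can be turned into a small fault matrix. The sketch below is a hypothetical model rather than the library's test suite: `state`, `rotate`, and `errInjected` are invented for illustration. It runs both orderings with a fault injected at each step and reports whether the user ends up stranded, meaning the old token is dead and no replacement was returned.

```go
package main

import (
	"errors"
	"fmt"
)

// errInjected is the fault forced into one step of the rotation.
var errInjected = errors.New("injected fault")

// state is what storage and the client hold after one rotation attempt.
type state struct {
	oldRotated bool   // storage has marked the old token dead
	newToken   string // token handed back to the client, if any
}

// stranded reports the lockout condition: old token dead, no replacement.
func (s state) stranded() bool { return s.oldRotated && s.newToken == "" }

// rotate runs the two steps in the chosen order, failing at failStep
// ("create", "mark", or "" for no fault).
func rotate(createFirst bool, failStep string) state {
	var s state
	minted := ""
	create := func() error {
		if failStep == "create" {
			return errInjected
		}
		minted = "new-token"
		return nil
	}
	mark := func() error {
		if failStep == "mark" {
			return errInjected
		}
		s.oldRotated = true
		return nil
	}
	steps := []func() error{mark, create}
	if createFirst {
		steps = []func() error{create, mark}
	}
	for _, step := range steps {
		if step() != nil {
			return s // abort: nothing is returned to the client
		}
	}
	s.newToken = minted
	return s
}

func main() {
	for _, createFirst := range []bool{false, true} {
		for _, fail := range []string{"", "create", "mark"} {
			s := rotate(createFirst, fail)
			fmt.Printf("createFirst=%-5v fail=%-6q stranded=%v\n", createFirst, fail, s.stranded())
		}
	}
}
```

In this model only one cell of the matrix strands the user: invalidate-first with a failure in the create step. A table like this is cheap to write and makes the ordering question visible in a way success-path tests do not.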
"Documentation is a debugging tool: the bug you can't explain away while writing docs is a bug you just found."
Both fixes shipped in GourdianToken v2.0.0, with regression tests locking them in. If you're catching up on this series: part 1 covers the full release and how to migrate, and part 2 covers how the atomic compare-and-swap that this post leans on is implemented across all four storage backends.