fix: correct inverted park condition in PooledTaskExecutorTest.LongTask#971
fix: correct inverted park condition in PooledTaskExecutorTest.LongTask#971
Conversation
…gTask Co-authored-by: vharseko <6818498+vharseko@users.noreply.github.com> Agent-Logs-Url: https://github.com/OpenIdentityPlatform/OpenAM/sessions/3a385bd2-89e9-4012-b0fe-da93e98eb1f5
… PooledTaskExecutorTest Agent-Logs-Url: https://github.com/OpenIdentityPlatform/OpenAM/sessions/49bb67bb-1acb-4834-ad72-86979cbf3265 Co-authored-by: vharseko <6818498+vharseko@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Fixes a flaky/hanging unit test by removing a racy LockSupport.park/unpark + AtomicBoolean coordination mechanism and replacing it with a CountDownLatch to guarantee unblock() reliably releases execute() even if signaled early.
Changes:
- Replaced
AtomicBoolean/LockSupportstate machine inLongTaskwith aCountDownLatch. - Simplified
unblock()tocountDown()andexecute()toawait().
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
openam-core/src/test/java/org/forgerock/openam/sm/datalayer/impl/PooledTaskExecutorTest.java
Show resolved
Hide resolved
| latch.await(); | ||
| } catch (InterruptedException e) { | ||
| Thread.currentThread().interrupt(); |
There was a problem hiding this comment.
Using an unbounded await() in test code can still hang the entire test run if unblock() is never reached (e.g., due to a preceding failure). Consider await(timeout, unit) and failing the test with a clear message if it times out, to keep CI failures fast and diagnosable.
| latch.await(); | |
| } catch (InterruptedException e) { | |
| Thread.currentThread().interrupt(); | |
| if (!latch.await(30, TimeUnit.SECONDS)) { | |
| fail("LongTask was not unblocked within 30 seconds; test is likely hung."); | |
| } | |
| } catch (InterruptedException e) { | |
| Thread.currentThread().interrupt(); | |
| fail("LongTask was interrupted while waiting to be unblocked", e); |
|
@copilot apply changes based on the comments in this thread |
Agent-Logs-Url: https://github.com/OpenIdentityPlatform/OpenAM/sessions/adbb16e3-cdcd-479d-9111-948001af7070 Co-authored-by: vharseko <6818498+vharseko@users.noreply.github.com>
Done in commit
|
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Agent-Logs-Url: https://github.com/OpenIdentityPlatform/OpenAM/sessions/d61905b6-3165-47d1-a9ca-9e19738e39f3 Co-authored-by: vharseko <6818498+vharseko@users.noreply.github.com>
whileloop condition inLongTask.execute()AtomicBoolean + LockSupportmechanism inLongTaskwith aCountDownLatchlatch.await(30, TimeUnit.SECONDS)withfail()on timeoutInterruptedException, re-interrupt the thread and throwDataLayerExceptionthrowstatement introduced by commit b25d626 (caused compile error)Original prompt
Problem
The CI job https://github.com/OpenIdentityPlatform/OpenAM/actions/runs/23591340704/job/68697595457 is failing due to a flaky/broken test:
The test times out (300 seconds) because a thread never terminates —
task1.isAlive()returnstrueafterjoin().Root Cause
In
openam-core/src/test/java/org/forgerock/openam/sm/datalayer/impl/PooledTaskExecutorTest.java, theLongTask.execute()method has inverted locking logic:After
locked.set(true), the CASlocked.compareAndSet(false, true)fails (because locked is alreadytrue), so!false = true, and the thread parks itself. Whenunblock()setslocked = falseand unparks, the CAS succeeds (false → true), so!true = false, and the loop exits.The problem: on a slow or loaded CI runner (macOS ARM), if
unblock()is called beforeLockSupport.park()executes, the unpark signal is consumed but the thread then parks again and never wakes up, causing the thread to hang indefinitely.Fix
Change the
whileloop condition inLongTask.execute()from the broken CAS to a simplelocked.get()check — park while locked istrue, exit whenunblock()sets it tofalse:The
unblock()method already correctly setslocked = falseand callsLockSupport.unpark()— no changes needed there.File to Change
openam-core/src/test/java/org/forgerock/openam/sm/datalayer/impl/PooledTaskExecutorTest.javawhile (!locked.compareAndSet(false, true))→while (locked.get())The following is the prior conversation context from the user's chat exploration (may be truncated):
User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.
Assistant: ## Failing Test:
PooledTaskExecutorTest.testExecuteRoot Cause
The test fails at line 120 with:
This means
task1(or one of the threads) is still alive after thejoin()call with a 100-second timeout. The issue is a race condition / deadlock in theLongTasklocking logic.Look at the
LongTask.execute()method (lines 155–164):The logic is inverted. After setting
locked = true, thewhilecondition!locked.compareAndSet(false, true)tries to CAS fromfalse → true, butlockedis alreadytrue. This means the CAS fails (returnsfalse), so!false = true, and the thread parks itself immediately and forever — even beforeunblock()is called.When
unblock()setslocked = falseand unparks the thread, the CAS then succeeds (false → true), so!true = false, and the loop exits. However, on a slow/loaded CI runner (macOS ARM in this case), there's a timing window where the unpark happens before the park, causing the thread to never wake up.Fix
The intent of
LongTaskis: stay locked untilunblock()setslocked = false. The correct loop condition should park whilelockedis stilltrue: