Skip to content

fix: correct inverted park condition in PooledTaskExecutorTest.LongTask#971

Open
Copilot wants to merge 6 commits intomasterfrom
copilot/fix-flaky-pooledtaskexecutor-test
Open

fix: correct inverted park condition in PooledTaskExecutorTest.LongTask#971
Copilot wants to merge 6 commits intomasterfrom
copilot/fix-flaky-pooledtaskexecutor-test

Conversation

Copy link
Contributor

Copilot AI commented Mar 26, 2026

  • Initial fix: correct inverted while loop condition in LongTask.execute()
  • Deeper fix: replace the AtomicBoolean + LockSupport mechanism in LongTask with a CountDownLatch
  • Reviewer feedback: use latch.await(30, TimeUnit.SECONDS) with fail() on timeout
  • Reviewer feedback: on InterruptedException, re-interrupt the thread and throw DataLayerException
  • Revert duplicate throw statement introduced by commit b25d626 (caused compile error)
Original prompt

Problem

The CI job https://github.com/OpenIdentityPlatform/OpenAM/actions/runs/23591340704/job/68697595457 is failing due to a flaky/broken test:

[ERROR] org.forgerock.openam.sm.datalayer.impl.PooledTaskExecutorTest.testExecute -- Time elapsed: 300.3 s <<< FAILURE!
[ERROR] PooledTaskExecutorTest.testExecute:120 [Task 1 thread running]
Expecting value to be false but was true expected:<[fals]e> but was:<[tru]e>

The test times out (300 seconds) because a thread never terminates — task1.isAlive() returns true after join().

Root Cause

In openam-core/src/test/java/org/forgerock/openam/sm/datalayer/impl/PooledTaskExecutorTest.java, the LongTask.execute() method has inverted locking logic:

// Current (BROKEN) code
public void execute(TokenStorageAdapter adapter) throws DataLayerException {
    this.executingThread = Thread.currentThread();
    debug("Locking");
    locked.set(true);
    while (!locked.compareAndSet(false, true)) {  // ← BUG: inverted condition
        debug("Task still locked - parking thread");
        LockSupport.park(this);
        debug("Thread unparked");
    }
    debug("Thread unlocked - continuing");
}

After locked.set(true), the CAS locked.compareAndSet(false, true) fails (because locked is already true), so !false = true, and the thread parks itself. When unblock() sets locked = false and unparks, the CAS succeeds (false → true), so !true = false, and the loop exits.

The problem: on a slow or loaded CI runner (macOS ARM), if unblock() is called before LockSupport.park() executes, the unpark signal is consumed but the thread then parks again and never wakes up, causing the thread to hang indefinitely.

Fix

Change the while loop condition in LongTask.execute() from the broken CAS to a simple locked.get() check — park while locked is true, exit when unblock() sets it to false:

// Fixed code
@Override
public void execute(TokenStorageAdapter adapter) throws DataLayerException {
    this.executingThread = Thread.currentThread();
    debug("Locking");
    locked.set(true);
    while (locked.get()) {  // ← FIXED: park while still locked
        debug("Task still locked - parking thread");
        LockSupport.park(this);
        debug("Thread unparked");
    }
    debug("Thread unlocked - continuing");
}

The unblock() method already correctly sets locked = false and calls LockSupport.unpark() — no changes needed there.

File to Change

  • File: openam-core/src/test/java/org/forgerock/openam/sm/datalayer/impl/PooledTaskExecutorTest.java
  • Line 159: Change while (!locked.compareAndSet(false, true))while (locked.get())

The following is the prior conversation context from the user's chat exploration (may be truncated):

User: Please find a solution for this failing job. Use the logs, job definition, and any referenced files where the failure occurred. Keep your response focused on the solution and include code suggestions when appropriate.

Assistant: ## Failing Test: PooledTaskExecutorTest.testExecute

Root Cause

The test fails at line 120 with:

[Task 1 thread running] Expecting value to be false but was true

This means task1 (or one of the threads) is still alive after the join() call with a 100-second timeout. The issue is a race condition / deadlock in the LongTask locking logic.

Look at the LongTask.execute() method (lines 155–164):

public void execute(TokenStorageAdapter adapter) throws DataLayerException {
    this.executingThread = Thread.currentThread();
    debug("Locking");
    locked.set(true);
    while (!locked.compareAndSet(false, true)) {  // ← BUG HERE
        debug("Task still locked - parking thread");
        LockSupport.park(this);
        debug("Thread unparked");
    }
    debug("Thread unlocked - continuing");
}

The logic is inverted. After setting locked = true, the while condition !locked.compareAndSet(false, true) tries to CAS from false → true, but locked is already true. This means the CAS fails (returns false), so !false = true, and the thread parks itself immediately and forever — even before unblock() is called.

When unblock() sets locked = false and unparks the thread, the CAS then succeeds (false → true), so !true = false, and the loop exits. However, on a slow/loaded CI runner (macOS ARM in this case), there's a timing window where the unpark happens before the park, causing the thread to never wake up.

Fix

The intent of LongTask is: stay locked until unblock() sets locked = false. The correct loop condition should park while locked is still true:

@Override
public void execute(TokenStorageAdapter adapter) throws DataLayerException {
    this.executingThread = Thread.currentThread();
    debug("Locking");
    locked.set(true);
    while (locked.get()) {  // park while still locked
        debug("Tas...

</details>



<!-- START COPILOT CODING AGENT SUFFIX -->

*This pull request was created from Copilot chat.*
>

<!-- START COPILOT CODING AGENT TIPS -->
---

🔒 GitHub Advanced Security automatically protects Copilot coding agent pull requests. You can protect all pull requests by enabling Advanced Security for your repositories. [Learn more about Advanced Security.](https://gh.io/cca-advanced-security)

Copilot AI changed the title [WIP] Fix flaky PooledTaskExecutorTest due to inverted locking logic fix: correct inverted park condition in PooledTaskExecutorTest.LongTask Mar 26, 2026
Copilot AI requested a review from vharseko March 26, 2026 11:53
@vharseko vharseko requested a review from maximthomas March 26, 2026 12:08
@vharseko vharseko requested a review from Copilot March 26, 2026 17:18
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a flaky/hanging unit test by removing a racy LockSupport.park/unpark + AtomicBoolean coordination mechanism and replacing it with a CountDownLatch to guarantee unblock() reliably releases execute() even if signaled early.

Changes:

  • Replaced AtomicBoolean/LockSupport state machine in LongTask with a CountDownLatch.
  • Simplified unblock() to countDown() and execute() to await().

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +156 to +158
latch.await();
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
Copy link

Copilot AI Mar 26, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using an unbounded await() in test code can still hang the entire test run if unblock() is never reached (e.g., due to a preceding failure). Consider await(timeout, unit) and failing the test with a clear message if it times out, to keep CI failures fast and diagnosable.

Suggested change
latch.await();
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
if (!latch.await(30, TimeUnit.SECONDS)) {
fail("LongTask was not unblocked within 30 seconds; test is likely hung.");
}
} catch (InterruptedException e) {
Thread.currentThread().interrupt();
fail("LongTask was interrupted while waiting to be unblocked", e);

Copilot uses AI. Check for mistakes.
@vharseko
Copy link
Member

@copilot apply changes based on the comments in this thread

Copy link
Contributor Author

Copilot AI commented Mar 26, 2026

@copilot apply changes based on the comments in this thread

Done in commit 059273d4. Applied both reviewer suggestions:

  • latch.await()latch.await(30, TimeUnit.SECONDS) with fail(...) on timeout so a hung test fails within 30 s with a clear diagnostic message.
  • On InterruptedException: re-interrupts the thread and throws DataLayerException to propagate the failure deterministically instead of silently continuing.

@vharseko vharseko marked this pull request as ready for review March 27, 2026 06:33
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants