Skip to content

chore: improve channel messaging and address memory leaks#1687

Merged
derklaro merged 19 commits intonightlyfrom
data-buf-enhance
Jul 17, 2025
Merged

chore: improve channel messaging and address memory leaks#1687
derklaro merged 19 commits intonightlyfrom
data-buf-enhance

Conversation

@derklaro
Copy link
Member

Motivation

The channel messaging system (as well as a few network parts in general) really need a cleanup to improve the code quality and reliability of the system. Furthermore, there were memory leaks reported that need to be addressed as well.

Modification

This pr mostly improves channel messaging to be more reliable overall and improves the api surface exposed to consumers. Furthermore, all possible memory leaks were addressed (not limited to channel messaging) and some other networking parts were improved as well.

Result

Channel messaging should be more reliable and easier to use. Additionally, there shouldn't be any memory leaks anymore.

@derklaro derklaro added this to the 4.0.0-RC14 milestone Jul 14, 2025
@derklaro derklaro requested a review from 0utplay July 14, 2025 20:39
@derklaro derklaro self-assigned this Jul 14, 2025
@derklaro derklaro added v: 4.X This pull should be included in the 4.0 release t: improvement The pull request improves existing code labels Jul 14, 2025
@derklaro
Copy link
Member Author

/cc @DasBabyPixel memory leaks should be fixed now

@github-actions
Copy link

github-actions bot commented Jul 14, 2025

Test Results

 56 files  ±0   56 suites  ±0   3m 19s ⏱️ +26s
570 tests ±0  570 ✅ ±0  0 💤 ±0  0 ❌ ±0 
935 runs  ±0  935 ✅ ±0  0 💤 ±0  0 ❌ ±0 

Results for commit c4bc4b2. ± Comparison against base commit c108c3e.

This pull request removes 32 and adds 32 tests. Note that renamed tests count towards both.
eu.cloudnetservice.driver.impl.document.DocumentSerialisationTest ‑ [4] {"b":1,"s":2,"i":3,"l":4,"f":5.0,"d":6.0,"c":"/","string":"Hello, World!","bol":true,"cloud":["Ben?","Yes","No","HoHoHoHo"],"world":{"insane":"!","hello":"world","this":"is"}}, PRETTY
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [14] 2025-07-05
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [23] 22:21:16.493105216
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [28] 22:21:16.493221032Z
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [29] 22:21:16.493242613Z
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [30] 22:21:16.493264484+05:00
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [31] 22:21:16.493282818-03:00
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [34] 2025-07-05T22:21:16.493410477
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [39] 2025-07-05T22:21:16.493542874Z
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [4] 2025-07-05T22:21:16.486859409Z
…
eu.cloudnetservice.driver.impl.document.DocumentSerialisationTest ‑ [4] {"b":1,"s":2,"i":3,"l":4,"f":5.0,"d":6.0,"c":"/","string":"Hello, World!","bol":true,"cloud":["Ben?","Yes","No","HoHoHoHo"],"world":{"hello":"world","insane":"!","this":"is"}}, PRETTY
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [14] 2025-07-16
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [23] 16:47:46.067615018
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [28] 16:47:46.067726286Z
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [29] 16:47:46.067746874Z
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [30] 16:47:46.067768364+05:00
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [31] 16:47:46.067787359-03:00
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [34] 2025-07-16T16:47:46.067913374
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [39] 2025-07-16T16:47:46.068089152Z
eu.cloudnetservice.driver.impl.document.gson.JavaTimeSerializerTest ‑ [4] 2025-07-16T16:47:46.058082152Z
…

♻️ This comment has been updated with latest results.

Copy link
Contributor

@DasBabyPixel DasBabyPixel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a step in the right direction. If you give me a few days time I have a reason to update past RC11 other than java 24 and time to test this branch.
I think replacing null checks with pattern matching is either an overly aggressive inspection or a too idealistic world view. It does not make the code more readable.
I haven't yet gone through all usages to look for all possible leaks, but the obvious ones I do remember should be fixed here.

private void handlePacket(@NonNull BasePacket packet) {
try {
this.doHandlePacket(packet);
} catch (Exception exception) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd go further and catch Throwable instead of silently ignoring OOMEs or similar

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should get oracle to have the default UncaughtExceptionHandler call System.exit.
I remember wasting way too many hours because network threads ate my errors and I couldn't for the life of me figure out where.

try {
this.buffer.close();
} catch (IllegalStateException _) {
// possible double-free error due to a race, ignore
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since when are buffers thread safe? This seems like a hack that will cause issues later

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The buffer is not thread-safe, but the drop guarding the buffer is. So concurrent close calls are safe. However, I would also argue that this is the better approach as close calls should not throw an exception if the buffer was already closed before (this was partly implemented using the isAccessible() check before - just not thread-safe).

Copy link
Contributor

@DasBabyPixel DasBabyPixel Jul 15, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding of how the buffers work may be incorrect, but concurrent close calls can never be safe?

EDIT: part of internal buffer code in ResourceSupport class, XXXBuffer extends ResourceSupport

    @Override
    public final void close() {
        if (acquires == -1) {
            throw attachTrace(new IllegalStateException("Double-free: Resource already closed and dropped."));
        }
        int acq = acquires;
        acquires--;
        if (acq != 0) {
            // Only record a CLOSE if we're not going to record a DROP.
            tracer.close(acq);
        } else {
            // The 'acquires' was 0, now decremented to -1, which means we need to drop.
            tracer.drop(0);
            try {
                drop.drop(impl());
                Reference.reachabilityFence(this); // Avoid racing with any Cleaner.
            } finally {
                makeInaccessible();
            }
        }
    }

acquires is a non-volatile int. So a buffer must, at every point in time, have a clear "owner" thread. Having IllegalStateExceptions thrown seems to me like the presence of another issue somewhere else, maybe in how the buffer is used or passed between owners.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to enable concurrent close calls, we need to have some sort of atomic isClosed flag or counter

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that place is not atomic, but the actual dropping happens in ArcDrop (in our case) which is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess you are right, ArcDrop is atomic and concurrent close calls should be safe.
Still think this is a bad idea long-term and while it may work now may fail if the netty implementation details change. Someone once told me relying on implementation details is a hassle and I should avoid it, so I do.
grafik
Imo a buffer lifecycle should be clearly defined and a close call must be on the owner thread, or synchronized in some way.
Also I haven't checked, but does enabling leak detection also enable it for the wrapper?

Copy link
Member Author

@derklaro derklaro Jul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really understand the problem. As stated before the exception is caught to ensure that, at any point, a call to close does NOT throw an exception. The previous isAccessible() guard did the same thing but left the door open a tiny bit open for an exception to be thrown anyway - this is now not possible anymore. Also note that the documentation of DataBuf still states that "buffers are not required to be thread safe", and there is no documentation hinting that close calls are thread-safe. It's just a security harness to be 100% sure that no exception is thrown, ever.

Copy link
Contributor

@DasBabyPixel DasBabyPixel Jul 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we argue that buffers are not thread safe, then concurrent close calls should be able to fail and throw the exception.
It's like instead of getting a ConcurrentModificationException when concurrently modifying an array list, it just silently fails.
The presence of such an exceptions indicates a deeper problem, in that case the concurrent usage of close. The absence of such an exception makes is so that problems, like incorrect usage of buffers, and concurrent close calls, are more difficult to diagnose.
It's not directly a problem where something doesn't work, correct usage of buffers will work. It's that catching that exception will make the future more difficult with bugs that can't be diagnosed because the exception is caught and ignored.

My point is: I want incorrect usage, in this case concurrent usage, to throw an exception. It makes it much easier to debug problems, and often prevents them alltogether.

@DasBabyPixel
Copy link
Contributor

DasBabyPixel commented Jul 16, 2025

EDIT: nvm I think I merged something incorrectly into my fork

@DasBabyPixel
Copy link
Contributor

Disregarding opinions on coding guidelines etc, it's overall looking good. No immediate masses of memory leaks are being thrown at me.
Just don't forget to include the information about closing ChannelMessages in the changelogs.

@derklaro derklaro merged commit a7046d1 into nightly Jul 17, 2025
5 checks passed
@derklaro derklaro deleted the data-buf-enhance branch July 17, 2025 08:27
@derklaro derklaro linked an issue Jul 17, 2025 that may be closed by this pull request
1 task
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

t: improvement The pull request improves existing code v: 4.X This pull should be included in the 4.0 release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Memory leaks because the packet content buffer is not released in handlers

3 participants