Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

bugfix: added clearProperties(ClassLoader) method #215

Closed

Conversation

lprimak
Copy link

@lprimak lprimak commented May 2, 2024

to BeanELResolver to help remove ClassLoader leaks

fixes #214

Relates to payara/Payara#6677

  /**
     * Clears properties cache by class loader.
     * To be called from the Container to clear the cache when ClassLoader is no longer in use
     * <p>
     * Even though {@link #properties} map is using {@link SoftReference} to store
     * its cache content, GC may not clean the cache on its own. This is because
     * the cache stores Class types, which hold a reference to its ClassLoader,
     * which in turn is much heavier than the garbage collector believes.
     * There may not be enough memory pressure to remove the Class element,
     * and GC may not be able to clear the ClassLoader without help from this method.
     *
     * @param classLoader
     */

@lprimak lprimak force-pushed the add-clear-cache-by-classloader branch from 5a6f08c to 1e2e5f9 Compare May 2, 2024 05:58
@markt-asf
Copy link
Contributor

You are going to need to provide a test case that demonstrates a memory leak. And once we have such a test case the fix will likely be to address the memory leak rather than adding a method that clears the cache.

@lprimak
Copy link
Author

lprimak commented May 2, 2024

Screen Shot 2024-05-02 at 3 49 44 PM

@lprimak
Copy link
Author

lprimak commented May 2, 2024

Hi, Mark,

I modified the description to show exactly what's happening. Also uploaded screenshot from VisualVM.
It's clear that GC isn't collecting ClassLoaders while they are in the cache.
The only option is to clear the cache.

@markt-asf
Copy link
Contributor

I'm still not seeing the evidence of a memory leak. Soft references are GC eligible. For a memory leak I'd expect to see a chain of hard references from the leaked object to a GC root.

@lprimak
Copy link
Author

lprimak commented May 3, 2024

SoftConcurrentHashMap.map contains Class<?> ask key, which in turn has a hard reference to the ClassLoader
BeanProperties also has a hard reference to Class<?>

I have done a number of tests and the method in the PR is the minimum necessary to avoid the memory leak.

@lprimak
Copy link
Author

lprimak commented May 20, 2024

@markt-asf Any thoughts please? Thank you

@markt-asf
Copy link
Contributor

There is no memory leak here (confirmed with a profiler). There is nothing to fix.

Absent a test case demonstrating an actual memory leak, this PR and the associated issue will be closed.

@lprimak
Copy link
Author

lprimak commented May 23, 2024

Hi, Mark,

The memory leak is clearly demonstrated by the VisualVM screenshot and my comments above.
I spent 10s of volunteer hours tracking this down, fixing it and verifying the fix works.
Is there some downside to this code that I am not seeing? The "worst-case outcome" of this fix is that it does nothing.
Isn't that a good enough reason to merge this in?

This way, if I am right about this, the fix is already there :)

Any help convincing Mark would be appreciated @smillidge @arjantijms @edburns @ivargrimstad
Thank you

@lprimak
Copy link
Author

lprimak commented May 27, 2024

Hi, Mark,

@edburns would like me to produce a video with the memory leak.
Would that be helpful? Do you have any questions or suggestions? Is there anything above that's
not clearly demonstrating the memory leak?

Thank you

@markt-asf
Copy link
Contributor

I would very much prefer a test case that demonstrates a leak - a leak being a hard reference to an object that isn't eligible for GC that should be eligible for GC.

The information provided so far only shows a reference chain that contains a soft reference to the alleged leaking class. That is not a memory leak.

Based on the information provided so far, I have been unable to recreate a memory leak via use of the BeanELResolver.

My experience of videos (mainly in security vulnerability reports) is not good. The issues are:

  • videos often fail to provide sufficient information to recreate the issue
  • videos are hard to validate as safe and hence require sand-boxed environments to view

You are of course free to provide a video if you wish and it will get looked at. However, I continue to believe the most effective way for you to demonstrate that there is an issue is to provide a test case that demonstrates the leak. I appreciate that creating unit tests for memory leaks is not always easy. The simplest possible web application (with source code) that demonstrates a leak that can be observed in a profiler would be sufficient. Given that I don't see how you can provide a video without first writing such a test case, my recommendation continues to be "provide the test case".

Further:

The proposed fixed is highly unlikely to be applied in any circumstances. Absent evidence of a memory leak, cruft is not going to be added to the API "just in case". If evidence of a memory leak is provided, then the fix is almost certainly going to be to fix the actual leak - not to provide a hack to work-around the leak.

@scottmarlow
Copy link
Contributor

@markt-asf Would a Java process memory dump (e.g. hprof file) help demonstrate the leak? I personally prefer a memory dump over a picture or a video.

@lprimak
Copy link
Author

lprimak commented May 28, 2024

@markt-asf Looks like you have tried to reproduce the memory leak yourself. Can you put that code up so I can take a look? Maybe I can modify it to demo the leak rather than start from scratch?
Thank you

@lprimak
Copy link
Author

lprimak commented May 28, 2024

Also, you claim that all references are soft.
static private class SoftConcurrentHashMap extends ConcurrentHashMap<Class<?>, BeanProperties> {
the above line has a hard key to Class<?>
Why do you think this will be garbage collected?

@markt-asf
Copy link
Contributor

Why do you think this will be garbage collected?

Because I have read the code - particularly the cleanup() method and when it is called - and tested that it behaves as expected and clears both the key and the value from the cache.

The test code I have been using:

package org.apache.markt;

import jakarta.el.BeanELResolver;
import jakarta.el.ELContext;
import jakarta.el.ELResolver;
import jakarta.el.FunctionMapper;
import jakarta.el.VariableMapper;

public class TestBeanELResolver {

    /*
     * Need to run this with the following:
     * -Xmx16m
     * -XX:SoftRefLRUPolicyMSPerMB=100
     */
    public static void main(String[] args) throws Exception {
        @SuppressWarnings("unused")
        Object o = new Object();

        // Create the BeanELResolver instance to be tested
        BeanELResolver r = new BeanELResolver();
        
        // Use the resolver with TesterBeanA. This adds TesterBeanA to BeanELResolver.properties
        resolveA(r);

        /*
         * Now sleep for longer than heap size in MB (16) * SoftRefLRUPolicyMSPerMB (100). This ensures that, when there
         * is memory pressure, any soft references that exist at the moment - and specifically the ones in
         * BeanELResolver.properties - will be cleared.
         */
        Thread.sleep(2000);
        
        /*
         * Create memory pressure.
         * 
         * A call to System.gc() is *not* sufficient to clear the soft references.
         */
        o = null;
        try {
            o = new int[10][10][10][10][10][10][10][10][10][10][10][10];
        } catch(OutOfMemoryError err) {
            // Ignore
        }

        /*
         * At this point the soft reference has been removed from the map and placed on the queue. Trigger the cleanup()
         * method. Do this by using the resolver with a different bean.
         */
        resolveB(r);

        /*
         * Placing a break point here and examining the heap with a profiler will show that references to TesterBeanA
         * have been removed from BeanELResolver.properties
         * 
         * Note that the profiler will still show a hard reference from the class loader to TesterBeanA since the
         * class loader retains a reference to all the classes it has loaded. 
         */
        System.out.println("debug point");
    }

    private static void resolveA(ELResolver r) {
        TesterBeanA b = new TesterBeanA();
        b.setData("test");

        System.out.println(r.getValue(new TestContext(), b, "data"));
    }

    
    private static void resolveB(ELResolver r) {
        TesterBeanB b = new TesterBeanB();
        b.setData("test");

        System.out.println(r.getValue(new TestContext(), b, "data"));
    }


    private static class TestContext extends ELContext {

        @Override
        public ELResolver getELResolver() {
            return null;
        }

        @Override
        public FunctionMapper getFunctionMapper() {
            return null;
        }

        @Override
        public VariableMapper getVariableMapper() {
            return null;
        }
    }


    public static class TesterBeanA {

        private String data;

        public String getData() {
            return data;
        }

        public void setData(String data) {
            this.data = data;
        }
    }


    public static class TesterBeanB {

        private String data;

        public String getData() {
            return data;
        }

        public void setData(String data) {
            this.data = data;
        }
    }
}

@lprimak lprimak force-pushed the add-clear-cache-by-classloader branch from ec45aee to e64af6b Compare May 29, 2024 22:53
@lprimak
Copy link
Author

lprimak commented May 29, 2024

Looking at your test, I can see why you would think it isn't a leak.
However, it absolutely is a leak for all intents and purposes:

The test above introduces memory pressure before forcing cleanup() to be called, thus it does clean out the queue (and the attached ClassLoader)

However, this is not a realistic situation, as memory pressure usually happens after the cleanup() method gets called, not before.
Memory pressure stays active, without an additional beans being resolved, thus cleanup() doesn't get called often enough to clean up the cache and thus is ineffective at cleaning up the class loaders.
This will cause OOM if during application runtime, application will ask for more memory, without resolving additional beans and thus calling cleanup(), which is what I was observing in my real-world tests.

Saying all that, WeakReference solution will probably work in this case, and is probably better solution than the manual cleanup proposed here.

However, I'd like to keep this PR open until the WeakReference method gets incorporated and proven to work in real-world tests.

The risk of WeakReference implementation is that it would be purged from the cache too early and would actually make the cache ineffective. I do not believe that will be the case, but without testing, who knows?

@markt-asf
Copy link
Contributor

This PR really needs to be closed and this discussion moved to the issue.

It does seem unlikely to me that there would be enough bean usage to cause notable memory usage, then some other activity that triggers GC and then no further bean usage until a OOME occurs. Unlikely, but I guess it is possible. I do wonder if some other memory leak was at play as well.

There are a couple of options here.

The simplest may be to switch using class name rather than class as the key. That would reduce the impact of what was left behind before cleanup() is called.

Switching to a WeakReference based cache of some form is probably the better solution although it is higher risk. The Tomcat implementation has been used successfully since Java EE 5 but I'd need to check on where it originated. I know I didn't write it so I can't just copy it.

@lprimak
Copy link
Author

lprimak commented May 30, 2024

I’ll make another PR with WeakReference as the change should be trivial

@lprimak
Copy link
Author

lprimak commented Jun 9, 2024

It does seem unlikely to me that there would be enough bean usage to cause notable memory usage, then some other activity that triggers GC and then no further bean usage until a OOME occurs. Unlikely, but I guess it is possible. I do wonder if some other memory leak was at play as well.

The bean usage isn't the issue. The issue is that stale beans from undeployed applications drag in their ClassLoaders thus creating a huge leak.
Beans are usually created and cached during application deployment.
They are supposed to be unloaded during application undeployment.
However, because they remain in the cache, and there is no memory pressure as-of yet, they do not get cleaned up.
After a number of deploy/undeploy cycles, memory usage gets close to maximum, but not enough to clean up SoftReferences.
After that, during regular application runtime it allocates memory, and because at some point it cannot, OOM occurs.

Even before that point, GC and memory thrashing occurs because there is a lot of uncollectable classes and garbage remaining.

What I am describing above is not a hypothetical. It has been observed and recorded via JFR, VisualVM, JMC, etc.

Solution is this PR is also not a hypothetical. With his applied, the OOM/GC thrashing issues are gone. Also observed by JFR/JMC/VisualVM

The simplest may be to switch using class name rather than class as the key. That would reduce the impact of what was left behind before cleanup() is called.

This will not work, because the cached bean (map value) contains a reference to application's object, which in turn will refer to it's ClassLoader, preventing it from being cleaned up.

@lprimak
Copy link
Author

lprimak commented Jul 26, 2024

Superseded by #248

@lprimak lprimak closed this Jul 26, 2024
@lprimak lprimak deleted the add-clear-cache-by-classloader branch July 26, 2024 17:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BeanELResolver cache causes Class Loader leaks in app servers
3 participants