OAK-11165 - Cache the Path field in the NodeDocument class. #1758

nfsantos · 2024-10-03T06:56:05Z

Creating a Path instance from a NodeDocument is an expensive operation, so it is worth to cache it. The performance improvements are noticeable in the indexing job, which calls #getPath twice in each NodeDocument that it processes.

Similar idea as what is done here:

jackrabbit-oak/oak-store-document/src/main/java/org/apache/jackrabbit/oak/plugins/document/Path.java

Lines 297 to 306 in 35950be

    
           public int hashCode() { 
        
               int h = hash; 
        
               if (h == -1 && parent != null) { 
        
                   h = 17; 
        
                   h = 37 * h + parent.hashCode(); 
        
                   h = 37 * h + name.hashCode(); 
        
                   hash = h; 
        
               } 
        
               return h; 
        
           }

Joscorbe · 2024-10-03T09:18:55Z

Are there any concerns about increased memory usage by caching the Path object in memory in every case?

I understand it can improve performance on indexing job, but it could increase memory consumption during normal operations of the repository. Have you evaluated this impact?

Maybe we could have a way to enable/disable the cache of this object depending on the needs. For example, enable it as part of indexing, but keep it disabled on other parts, at least until we assess any possible impact.

nfsantos · 2024-10-03T09:51:01Z

Are there any concerns about increased memory usage by caching the Path object in memory in every case?

I understand it can improve performance on indexing job, but it could increase memory consumption during normal operations of the repository. Have you evaluated this impact?

Maybe we could have a way to enable/disable the cache of this object depending on the needs. For example, enable it as part of indexing, but keep it disabled on other parts, at least until we assess any possible impact.

I have not evaluated the potential increase in memory in other situations, I was hoping someone more knowledgeable in this module would be able to comment. The main place where there could be a increased in memory usage that I can think of is in the NodeDocument cache. If getPath is called on the documents in the cache and these instances of Path would not be held otherwise anywhere else, then the memory usage will increase. I don't know if this could be an issue.

I was hoping that in most cases the lifetime of the Path instance created by calling #getPAth will be similar to the one of the NodeDocument from where it was created, so there would be no increase in memory.

I'm ok with a way to enabled/disable this feature, could be a static flag set from an env variable with a default value of disabled. And we would enable it only for indexing jobs. But this optimization could benefit other uses of Oak, but I'm not sure how to evaluate it.

nfsantos · 2024-10-03T10:04:12Z

The reason why calling getPath is so expensive are the calls to StringCache for each path segment. I am assuming this is done to save memory, as two different paths often have many path segments in common. On the other hand, this requires computing the hash of every path segment, which might not be otherwise needed in many situations. And this cache might also be ineffective because it only has capacity for 1024 elements and when traversing an Oak tree we will come across a very large number of different names. We can assume that the leaves will all have different names, so traversing 1024 nodes will completely trash the cache, pushing out the common names that appear in the parent segments. I have not done a deep analysis, but I suspect this StringCache might not be reducing significantly the memory usage and is increasing CPU usage. On the indexing job, the calls to StringCache.get are responsible for more than 10% of the total CPU usage, which seems quite high.

stefan-egli · 2024-10-03T12:54:49Z

I think there is a very high likelihood that getPath is executed by the time it gets put into the cache, as for example getNodeAtRevision already calls it, which is very likely to be called. Perhaps having an idea by how much the memory would increase in a realistic scenario would be useful.

Cache the Path field in the NodeDocument class.

d414429

shodaaan approved these changes Oct 3, 2024

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

OAK-11165 - Cache the Path field in the NodeDocument class. #1758

OAK-11165 - Cache the Path field in the NodeDocument class. #1758

nfsantos commented Oct 3, 2024

Joscorbe commented Oct 3, 2024

nfsantos commented Oct 3, 2024 •

edited

Loading

nfsantos commented Oct 3, 2024 •

edited

Loading

stefan-egli commented Oct 3, 2024

	public int hashCode() {
	int h = hash;
	if (h == -1 && parent != null) {
	h = 17;
	h = 37 * h + parent.hashCode();
	h = 37 * h + name.hashCode();
	hash = h;
	}
	return h;
	}

OAK-11165 - Cache the Path field in the NodeDocument class. #1758

Are you sure you want to change the base?

OAK-11165 - Cache the Path field in the NodeDocument class. #1758

Conversation

nfsantos commented Oct 3, 2024

Joscorbe commented Oct 3, 2024

nfsantos commented Oct 3, 2024 • edited Loading

nfsantos commented Oct 3, 2024 • edited Loading

stefan-egli commented Oct 3, 2024

nfsantos commented Oct 3, 2024 •

edited

Loading

nfsantos commented Oct 3, 2024 •

edited

Loading