Joern uses the domain-specific classes from codepropertygraph, which (up to joern 2.x) were generated by overflowdb (specifically https://github.com/ShiftLeftSecurity/overflowdb and https://github.com/ShiftLeftSecurity/overflowdb-codegen). As of joern 4.0.x we replaced overflowdb with it's successor, flatgraph. The most important PRs paving the way for flatgraph are ShiftLeftSecurity/codepropertygraph#1769 and #4630.
Most importantly, flatgraph brings us about 40% less memory usage as well as faster traversals. The reduced memory footprint is achieved by flatgraph's efficient columnar layout, essentially we hold everything in few (albeit very large) arrays. The faster traversals account for about 40% performance improvement for many joern use-cases, e.g. running the default passes while importing a large cpg into joern. Some numbers for Linux 4, as an example for a very large codebase. Numbers are based on my workstation and just rough measurements.
Linux 4.1.16, cpg created with c2cpg after importCpg
into joern:
- 48M nodes with 630M properties (mostly
String
andInteger
) - 431M edges with 115M properties (all
String
)
joern 2 (overflowdb) | joern 4 (flatgraph) | |
---|---|---|
heap after import (after garbage collection) | 33g | 20g |
minimum required heap (Xmx) for import | 80g | 30g |
time for importCpg | 18 minutes | 11 minutes |
file size on disk | 2600M | 400m |
Linux 5 and 6 are considerably larger, so I wasn't able to import them into joern 2 on my workstation (which has 128G physical memory). With joern 4 it works just fine with ./joern -J-Xmx90g
for linux6 😀
Also worth noting: one of overflowdb's features was the overflowing-to-disk mechanism. While it sounds nice to be able to handle graphs larger than the available memory, in practice it was too slow to be useful, so we didn't reimplement it in flatgraph.
We tried to minimise the joern-user-facing API changes, and depending on your usage you may not notice anything at all. That being said, if your code makes use of the overflowdb
namespace then you will have to make some changes. In most cases, it's simply a namespace change to flatgraph
. Since hopefully no joern user used the overflowdb api (with one exception listed below), I won't list the changes here, instead please look at the joern migration PR and/or ask us on discord.
Most relevant changes:
-
overflowdb.BatchedUpdate.applyDiff
->flatgraph.DiffGraphApplier.applyDiff
-
io.shiftleft.passes.IntervalKeyPool
->io.joern.x2cpg.utils.IntervalKeyPool
-
StoredNode.propertyOption
now returns anOption
rather than ajava.util.Optional
- the API is almost identical, and there's builtin conversions both ways (.toScala|.toJava
viaimport scala.jdk.OptionConverters.*
). -
the arrow syntax for quickly constructing graphs, e.g.
v0 --- "CFG" --> v1
, quite useful for testing, doesn't exist in flatgraph yet. You'll need to create a diffgraph instead. There's plenty of examples in the joern migration PR. -
Edges can only have zero or one properties. Since the codepropertygraph schema never defined more than one property per edge type, this should not affect you as a joern user, unless you've extended the cpg schema...
Flatgraph is based on @bbrehm's great ideas for a memory efficient columnar layout on the jvm. He built a working prototype with very promising benchmarks that convinced us that the effort to migrate is worth-while, and that turned out to be true.
I'm glad you asked! Version 3 is typically a source for trouble, you know... just look at Gnome 3, Python 3 and many more. The only exception is Scala 3, of course - ymmv :)