Skip to content

Latest commit

 

History

History
43 lines (29 loc) · 4.31 KB

4.0.0-flatgraph.md

File metadata and controls

43 lines (29 loc) · 4.31 KB

4.0.x: Migration to flatgraph

Joern uses the domain-specific classes from codepropertygraph, which (up to joern 2.x) were generated by overflowdb (specifically https://github.com/ShiftLeftSecurity/overflowdb and https://github.com/ShiftLeftSecurity/overflowdb-codegen). As of joern 4.0.x we replaced overflowdb with it's successor, flatgraph. The most important PRs paving the way for flatgraph are ShiftLeftSecurity/codepropertygraph#1769 and #4630.

Why the change?

Most importantly, flatgraph brings us about 40% less memory usage as well as faster traversals. The reduced memory footprint is achieved by flatgraph's efficient columnar layout, essentially we hold everything in few (albeit very large) arrays. The faster traversals account for about 40% performance improvement for many joern use-cases, e.g. running the default passes while importing a large cpg into joern. Some numbers for Linux 4, as an example for a very large codebase. Numbers are based on my workstation and just rough measurements.

Linux 4.1.16, cpg created with c2cpg after importCpg into joern:

  • 48M nodes with 630M properties (mostly String and Integer)
  • 431M edges with 115M properties (all String)
joern 2 (overflowdb) joern 4 (flatgraph)
heap after import (after garbage collection) 33g 20g
minimum required heap (Xmx) for import 80g 30g
time for importCpg 18 minutes 11 minutes
file size on disk 2600M 400m

Linux 5 and 6 are considerably larger, so I wasn't able to import them into joern 2 on my workstation (which has 128G physical memory). With joern 4 it works just fine with ./joern -J-Xmx90g for linux6 😀

Also worth noting: one of overflowdb's features was the overflowing-to-disk mechanism. While it sounds nice to be able to handle graphs larger than the available memory, in practice it was too slow to be useful, so we didn't reimplement it in flatgraph.

API changes / upgrade guide

We tried to minimise the joern-user-facing API changes, and depending on your usage you may not notice anything at all. That being said, if your code makes use of the overflowdb namespace then you will have to make some changes. In most cases, it's simply a namespace change to flatgraph. Since hopefully no joern user used the overflowdb api (with one exception listed below), I won't list the changes here, instead please look at the joern migration PR and/or ask us on discord.

Most relevant changes:

  1. overflowdb.BatchedUpdate.applyDiff -> flatgraph.DiffGraphApplier.applyDiff

  2. io.shiftleft.passes.IntervalKeyPool -> io.joern.x2cpg.utils.IntervalKeyPool

  3. StoredNode.propertyOption now returns an Option rather than a java.util.Optional - the API is almost identical, and there's builtin conversions both ways (.toScala|.toJava via import scala.jdk.OptionConverters.*).

  4. the arrow syntax for quickly constructing graphs, e.g. v0 --- "CFG" --> v1, quite useful for testing, doesn't exist in flatgraph yet. You'll need to create a diffgraph instead. There's plenty of examples in the joern migration PR.

  5. Edges can only have zero or one properties. Since the codepropertygraph schema never defined more than one property per edge type, this should not affect you as a joern user, unless you've extended the cpg schema...

Credits and kudos

Flatgraph is based on @bbrehm's great ideas for a memory efficient columnar layout on the jvm. He built a working prototype with very promising benchmarks that convinced us that the effort to migrate is worth-while, and that turned out to be true.

Why did we leave out version 3?

I'm glad you asked! Version 3 is typically a source for trouble, you know... just look at Gnome 3, Python 3 and many more. The only exception is Scala 3, of course - ymmv :)