This is the implementation of the GLL-based context-free path querying (CFPQ) algorithm. It is based on a high-performance GLL parsing algorithm implemented in Iguana project. Then, it was modificated to support graph-structured input data. Proposed solution solves both reachability and all paths problems for all pairs and multiple sources cases.
The proposed solution has been evaluated on several real-world graphs for both the all pairs and the multiple sources scenarios. The evaluation shows that the proposed solution is more than 45 times faster than the previous solution for Neo4j and is comparable, in some scenarios and cases, with the linear algebra based solution.
Machine configuration: PC with Ubuntu 20.04, Intel Core i7-6700 3.40GHz CPU, DDR4 64Gb RAM.
Enviroment configuration:
- Java HotSpot(TM) 64-Bit server virtual machine (build 15.0.2+7-27, mixed mode, sharing).
- JVM heap configuration: 55Gb both xms and xmx.
- Neo4j 4.0.3 is used. Almost all configurations are default except two:
- total off-heap transaction memory (tx_state_max_off_heap_memory parameter) is set to 24Gb
- pagecache_warmup_enabled is set to true.
The graph data is selected from CFPQ_Data dataset. There are two types of graph:
- Graphs related to RDF analysis problems
- Graphs related to static code analysis problems
A detailed description of the graphs is listed below.
RDF analysis
Graph name | |V| | |E| | #subClassOf | #type | #broaderTransitive |
---|---|---|---|---|---|
Core | 1 323 | 2 752 | 178 | 0 | 0 |
Pathways | 6 238 | 12 363 | 3 117 | 3 118 | 0 |
Go hierarchy | 45 007 | 490 109 | 490 109 | 0 | 0 |
Enzyme | 48 815 | 86 543 | 8 163 | 14 989 | 8 156 |
Eclass_514en | 239 111 | 360 248 | 90 962 | 72 517 | 0 |
Geospecies | 450 609 | 2 201 532 | 0 | 89 065 | 20 867 |
Go | 582 929 | 1 437 437 | 94 514 | 226 481 | 0 |
Taxonomy | 5 728 398 | 14 922 125 | 2 112 637 | 2 508 635 | 0 |
Static code analysis
Graph name | |V| | |E| | #a | #d |
---|---|---|---|---|
Apache | 1 721 418 | 1 510 411 | 362 799 | 1 147 612 |
Block | 3 423 234 | 2 951 393 | 669 238 | 2 282 155 |
Fs | 4 177 416 | 3 609 373 | 824 430 | 2 784 943 |
Ipc | 3 401 022 | 2 931 498 | 664 151 | 2 267 347 |
Lib | 3 401 355 | 2 931 880 | 664 311 | 2 267 569 |
Mm | 2 538 243 | 2 191 079 | 498 918 | 1 692 161 |
Net | 4 039 470 | 3 500 141 | 807 162 | 2 692 979 |
Postgre | 5 203 419 | 4 678 543 | 1 209 597 | 3 468 946 |
Security | 3 479 982 | 3 003 326 | 683 339 | 2 319 987 |
Sound | 3 528 861 | 3 049 732 | 697 159 | 2 352 573 |
Init | 2 446 224 | 2 112 809 | 481 994 | 1 630 815 |
Arch | 3 448 422 | 2 970 242 | 6 712 95 | 2 298 947 |
Crypto | 3 464 970 | 2 988 387 | 678 408 | 2 309 979 |
Drivers | 4 273 803 | 3 707 769 | 858 568 | 2 849 201 |
Kernel | 11 254 434 | 9 484 213 | 1 981 258 | 7 502 955 |
Field-sensitive points-to analysis
Graph name | |V| | |E| |
---|---|---|
sunflow | 15 464 | 15 957 |
lusearch | 15 774 | 14 994 |
luindex | 18 532 | 17 375 |
avrora | 24 690 | 25 196 |
eclipse | 41 383 | 40 200 |
h2 | 44 717 | 56 683 |
pmd | 54 444 | 59 329 |
xalan | 58 476 | 62 758 |
batik | 60 175 | 63 089 |
fop | 86 183 | 83 016 |
tomcat | 111 327 | 110 884 |
jython | 191 895 | 260 034 |
tradebeans | 439 693 | 466 969 |
tradesoap | 440 680 | 468 263 |
All queries used in evaluation are variants of same-generation query. The inverse of an x
relation and the respective edge is denoted as x_r
.
Grammars used for RDF graphs:
G1
S -> subClassOf_r S subClassOf | subClassOf_r subClassOf
| type_r S type | type_r type
G2
S -> subClassOf_r S subClassOf | subClassOf
Geo
S -> broaderTransitive S broaderTransitive_r
| broaderTransitive broaderTransitive_r
Grammar used for static code analysis graphs:
PointsTo
M -> d_r V d
V -> (M? a_r)* M? (a M?)*
Grammar template used for field-sensitive points-to analysis:
PointsTo -> (assign | load_f Alias store_f )* alloc
Alias -> PointsTo FlowsTo
FlowsTo -> alloc_r (assign_r | store_f_r Alias load_f_r)*
TermiFor all fields f
the terminals load_f
, store_f
are generated
The results of the all pairs reachability queries evaluation on graphs related to RDF analysis are listed below.
The sign ’–’ in cells means that the respective query is not applicable to the graph, so time is not measured.
Graph name | G1 | G2 | Geo | |||
---|---|---|---|---|---|---|
time (sec) | #answer | time (sec) | #answer | time (sec) | #answer | |
Core | 0,02 | 204 | 0,01 | 214 | – | – |
Pathways | 0,07 | 884 | 0,04 | 3117 | – | – |
Go hierarchy | 3,68 | 588 976 | 5,42 | 738 937 | – | – |
Enzyme | 0,22 | 396 | 0,17 | 8163 | 5,7 | 14 267 542 |
Eclass | 1,5 | 90 994 | 0,97 | 96 163 | – | – |
Geospecies | 2,89 | 85 | 2,65 | 0 | 145,8 | 226 669 749 |
Go | 5,56 | 640 316 | 4,24 | 659 501 | – | – |
Taxonomy | 45,47 | 151 706 | 36,06 | 2 112 637 | – | – |
The evaluation results for single source CFPQ for graphs related to RDF analysis and G1, G2, Geo grammars respectively in reachability and all paths scenarious:
The results for graphs related to static code analysis are compared to results of Azimov’s CFPQ algorithm based on matrix operations. The implementation from CFPQ_PyAlgo was taken as the implementation of the matrix CFPQ algorithm. This library contains the implementation for both scenarios, all pairs reachability and single source reachability. To perform matrix operations pygraphblas is used. Pygraphblas is a python wrapper over the SuiteSparse library, which based on the GraphBLAS framework.
The results of the all pairs reachability queries evaluation on graphs related to static code analysis are listed below.
The sign ’–’ in cells means that the respective query and graph require a considerable amount of memory during algorithm execution that leads to unpredictable time to get the result.
Graph name | PointsTo | ||
---|---|---|---|
Neo4j time (sec) | GraphBLAS time (sec) | #answer | |
Apache | – | 536,7 | 92 806 768 |
Block | 113,01 | 123,88 | 5 351 409 |
Fs | 167,73 | 105,72 | 9 646 475 |
Ipc | 109,43 | 79,52 | 5 249 389 |
Lib | 111,09 | 121,79 | 5 276 303 |
Mm | 77,92 | 84,15 | 3 990 305 |
Net | 160,64 | 206,29 | 8 833 403 |
Postgre | – | 969,88 | 90 661 446 |
Security | 115,75 | 181,7 | 5 593 387 |
Sound | 120,14 | 133,64 | 6 085 269 |
Init | 87,25 | 45,84 | 3 783 769 |
Arch | 130,77 | 119,92 | 5 339 563 |
Crypto | 128,8 | 122,09 | 5 428 237 |
Drivers | 371,18 | 279,39 | 18 825 025 |
Kernel | 614,047 | 378,05 | 16 747 731 |
The evaluation results for single source CFPQ for graphs related to static code analysis and pointsTo grammar in reachability and all paths scenarious:
The results of the all pairs reachability queries evaluation on graphs related to field-sensitive points-to analysis. The results are compared to results of .NET GLL implementation.
Graph name | Reachability | All paths | |||||
---|---|---|---|---|---|---|---|
#answer | Neo4j time (sec) | In memory time (sec) | .NET time (sec) | Neo4j time (sec) | In memory time (sec) | .NET time (sec) | |
avrora | 21 532 | 5.022 | 2.168 | 5.022 | 7.879 | 5.042 | 2.984 |
batik | 45 968 | 12.659 | 5.656 | 2.087 | 30.282 | 30.137 | 6.349 |
eclipse | 21 830 | 5.712 | 2.223 | 1.245 | 14.095 | 13.418 | 3.557 |
fop | 76 615 | 19.400 | 9.394 | 4.649 | 53.721 | 45.088 | 15.353 |
h2 | 92 038 | 19.643 | 13.507 | 19.065 | 30.414 | 23.375 | 108.554 |
jython | 561 720 | OOM | OOM | 1283.518 | OOM | OOM | OOM |
luindex | 9 677 | 2.121 | 0.874 | 0.438 | 3.782 | 2.326 | 2.361 |
lusearch | 9 242 | 1.417 | 0.547 | 0.285 | 2.481 | 1.614 | 1.419 |
pmd | 60 518 | 32.193 | 23.034 | 45.487 | 46.522 | 36.269 | OOT |
sunflow | 16 354 | 0.878 | 0.397 | 0.340 | 2.074 | 1.854 | 1.733 |
tomcat | 82 424 | 58.637 | 26.187 | 3.872 | 124.182 | 88.235 | 24.168 |
tradebeans | 696 316 | OOT | OOM | 134.641 | OOT | OOM | OOM |
tradesoap | 698 567 | OOT | OOM | 134.321 | OOT | OOM | OOM |
xalan | 52 382 | 24.140 | 10.991 | 4.490 | 39.168 | 34.247 | 62.832 |
The project is build with Maven.
git clone https://github.com/JetBrains-Research/GLL4Graph
cd GLL4Graph
mvn compile
To replicate experiments:
- make the directory to save results
mkdir results
- run the following command with arguments
mvn exec:java -Dexec.mainClass="benchmark.GraphBenchmark" -Dexec.args="aguments"
- print help
mvn exec:java -Dexec.mainClass="benchmark.GraphBenchmark" -Dexec.args="-h"
usage: GraphBenchmark [-h] -d <dataset name> -gm <path> -gp <path> -gs <storage type> -m <number> -p <problem type> -S <s=value1 a=value2> -w <number>
-d,--dataset <dataset name> The name of the dataset, an important component of the file name with the results
-gm,--grammar <path> Path to JSON file contains context-free grammar
-gp,--graph <path> Path to directory contains files nodes.csv and edges.csv
-gs,--graph_storage <storage type> Graph storage type, allowed values: NEO4J, IN_MEMORY
-h,--help Print help message
-m,--measurement_iterations <number> Number of measurement iterations
-p,--problem <problem type> Benchmarking algorithm, allowed values: REACHABILITY, ALL_PATHS
-S,--scenario <s=value1 a=value2> Benchmarking scenario and its argument, 's' property allowed values: ALL_PAIRS, MULTIPLE_SOURCES, 'a' property
contains number of nodes if 's' equals ALL_PAIRS or path to file with vertices chunks if 's' equals
MULTIPLE_SOURCES
-w,--warmup_iterations <number> Number of warm-up iterations
Here is an example which can be run right after project is downloaded and results directory is created without any additional preparation.
To run all pairs reachability algorithm on Core graph on G1 grammar use the following command:
mvn exec:java -Dexec.mainClass="benchmark.GraphBenchmark" -Dexec.args="-d Core -gm test/resources/grammars/graph/g1/grammar.json -gp data/core -gs NEO4J -p REACHABILITY -S -s ALL_PAIRS -a $(( $(cat "data/core/nodes.csv" | wc -l)-1 )) -w 1 -m 10"
To get more graph data examples use Python script:
graph_loader.py --graph {graph_name} --relationships {comma_separated_relationships_list}
For example:
graph_loader.py --graph core --relationships subClassOf,type
This project is licensed under OpenBSD License. License text can be found in the license file.