Codebase Indexer is a research project aimed at extracting detailed information from C++ code, including symbols, references, and relations, into a database for building Retrieval-Augmented Generation (RAG) systems. The extracted data, such as references and relations, is well-suited for constructing knowledge graphs to support GraphRAG-like systems, enabling advanced code search and contextual AI applications.
Note: This project is built on top of `clangd` and involves a small patch to enhance its capabilities. It does not attribute or modify the entirety of the `clangd` codebase but rather extends its functionality for specialized data extraction.
This project is in its early stages and should be considered a draft for research purposes. Features and functionality are subject to change as development and experimentation continue.
- 🔍 Extraction: Collects C++ symbols, references, and relations as provided by `clangd`. Each symbol is exported with its associated documentation (if available), expanded macros, namespaces, and the actual code of the symbol (e.g., function or class declarations/definitions).
- 📄 TSV Export: Outputs the extracted data into `symbols.tsv`, `refs.tsv`, and `relations.tsv` files for easy database import.
- 💾 Database Integration: Data is imported into ClickHouse for building a knowledge graph and enabling RAG capabilities.
- Build the tool using LLVM build commands. `ninja`, `cmake`, and `lld` (for faster linking) should be installed. Build commands and scripts are available in the `justfile`.
mkdir build && cd build
cmake -G "Ninja" -DLLVM_ENABLE_PROJECTS="clang;clang-tools-extra" -DCMAKE_BUILD_TYPE=Debug -DLLVM_USE_LINKER=lld -DBUILD_SHARED_LIBS=ON ../llvm
ninja
- Generate `compile_commands.json` for your project using `cmake` or `bear`.
- Run `clangd-indexer` with your `compile_commands.json` to extract symbol data. `index.yaml` is for debug purposes only; actual data is exported to `.tsv` files in the current working directory.
./build/bin/clangd-indexer path/to/compile_commands.json --executor=all-TUs --format=yaml > index.yaml
- Apply the schema from `schema.sql`.
- Load the generated TSV files into ClickHouse:
clickhouse-client --password="${PASS}" --query="TRUNCATE TABLE index_symbols"
clickhouse-client --password="${PASS}" --query="INSERT INTO index_symbols FORMAT TabSeparated" < symbols.tsv
clickhouse-client --password="${PASS}" --query="TRUNCATE TABLE index_relations"
clickhouse-client --password="${PASS}" --query="INSERT INTO index_relations FORMAT TabSeparated" < relations.tsv
clickhouse-client --password="${PASS}" --query="TRUNCATE TABLE index_refs"
clickhouse-client --password="${PASS}" --query="INSERT INTO index_refs FORMAT TabSeparated" < refs.tsv
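The load commands above repeat the same two steps per table, so they can be wrapped in a small helper. The following is a minimal sketch, not part of the project: the function names `check_tsv`, `load_table`, and `load_all` and the `CLICKHOUSE_CLIENT` override variable are hypothetical. It also assumes that the exporter escapes tabs and newlines inside values (as ClickHouse's `TabSeparated` format expects), so that every row of a file has the same number of tab-separated fields.

```shell
#!/bin/sh
# Sketch only: all helper names here are hypothetical, not part of the project.

# Client command; override with another command (e.g. a stub) if needed.
CLICKHOUSE_CLIENT="${CLICKHOUSE_CLIENT:-clickhouse-client}"

# check_tsv FILE
# Succeeds when FILE is non-empty and every row has the same number of
# tab-separated fields as the first row. Mismatches are reported on stderr.
check_tsv() {
  [ -s "$1" ] || { echo "missing or empty: $1" >&2; return 1; }
  awk -F '\t' '
    NR == 1    { want = NF }
    NF != want { printf "%s:%d: %d fields, expected %d\n", FILENAME, NR, NF, want; bad = 1 }
    END        { exit (bad ? 1 : 0) }
  ' "$1" >&2
}

# load_table TABLE FILE
# Truncates TABLE and reloads it from FILE, mirroring the commands above.
# Nothing is sent to ClickHouse if FILE fails the sanity check.
load_table() {
  check_tsv "$2" || return 1
  "$CLICKHOUSE_CLIENT" --password="${PASS:-}" --query="TRUNCATE TABLE $1" || return 1
  "$CLICKHOUSE_CLIENT" --password="${PASS:-}" --query="INSERT INTO $1 FORMAT TabSeparated" < "$2"
}

# load_all
# Reloads the three tables from the TSV files in the current directory.
load_all() {
  load_table index_symbols   symbols.tsv   &&
  load_table index_relations relations.tsv &&
  load_table index_refs      refs.tsv
}
```

After sourcing this file in the directory that holds the TSV files, calling `load_all` would perform the same truncate-and-insert sequence as the commands above, stopping at the first file that fails the check. The field-count check is only a cheap guard against a truncated or malformed export; it does not validate the data against `schema.sql`.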
- ✅ Extraction: Precise extraction of C++ codebase using `clangd`, with detailed symbols, references, and relations exported to ClickHouse for structured data storage.
- ⚙️ Embedding: Work in progress. The extracted code snippets will be embedded to support classical RAG systems, enabling effective retrieval-augmented generation.
- 🔬 GraphRAG Research: The data, including references and relations, will be used for further research into knowledge graph construction to build advanced GraphRAG systems, supporting enhanced contextual search and AI-driven code analysis.
Welcome to the LLVM project!
This repository contains the source code for LLVM, a toolkit for the construction of highly optimized compilers, optimizers, and run-time environments.
The LLVM project has multiple components. The core of the project is itself called "LLVM". This contains all of the tools, libraries, and header files needed to process intermediate representations and convert them into object files. Tools include an assembler, disassembler, bitcode analyzer, and bitcode optimizer.
C-like languages use the Clang frontend. This component compiles C, C++, Objective-C, and Objective-C++ code into LLVM bitcode -- and from there into object files, using LLVM.
Other components include: the libc++ C++ standard library, the LLD linker, and more.
Consult the Getting Started with LLVM page for information on building and running LLVM.
For information on how to contribute to the LLVM project, please take a look at the Contributing to LLVM guide.
Join the LLVM Discourse forums, Discord chat, LLVM Office Hours or Regular sync-ups.
The LLVM project has adopted a code of conduct for participants to all modes of communication within the project.