Hal Finkel edited this page Sep 17, 2019 · 20 revisions

Welcome to the llvm-project-cxxjit wiki! This is a fork of LLVM with a Clang enhanced with just-in-time (JIT) compilation functionality.

For more information on the implementation and some evaluation, see: https://arxiv.org/abs/1904.08555

Getting Started

To install ClangJIT, clone https://github.com/hfinkel/llvm-project-cxxjit and build as you would Clang/LLVM for your system.

If you're building on a Linux system, then a basic CMake configuration such as the following will likely work:

cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo -DBUILD_SHARED_LIBS=ON -DLLVM_USE_SPLIT_DWARF=ON -DCMAKE_INSTALL_PREFIX=/path/to/somewhere -DLLVM_ENABLE_PROJECTS="llvm;clang;openmp" /path/to/llvm-project-cxxjit/llvm

and then run make (using -j<number of cores> to get a parallel build) and then make install. Before you install, you might wish to run LLVM's and Clang's regression tests (which include the regression tests associated with the JIT functionality); make -j<number of cores> check-llvm check-clang should do that. Once installed, you can use /path/to/somewhere/bin/clang++ with the -fjit option as described below.

Rationale

Clang JIT extensions are designed to address two key challenges of C++ programming:

  1. Long Compile times - Instantiating many C++ templates in order to provide high-performance, customized behaviors necessarily generates long compile times. In some cases, at runtime, only a subset of these instantiations are actually used. In such cases, compile time can be reduced by delaying relevant template instantiation until runtime.

  2. Performance - While C++ is generally one of the highest-performance languages available, there are cases where runtime-specialized code can obtain significantly better performance than a more-generic ahead-of-time-compiled implementation in C++. In C++, a programmer can choose to instantiate specialized instances of particular algorithms and then dispatch at runtime between those pre-defined choices. This, however, is a major contributor to the aforementioned compile-time problems.

Usage

Command-Line Flag

To enable JIT support when using Clang, add the command-line flag:

-fjit

when both compiling and linking the application or shared library.

Detecting when JIT support is available

#ifndef __has_feature // For compatibility with non-Clang compilers.
  #define __has_feature(x) 0
#endif

#if __has_feature(clang_cxx_jit)
  // Code here depending on the JIT
#endif
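
One way to use this check is to fall back to a fixed set of ahead-of-time instantiations when the JIT is not available. The following is a minimal sketch under that assumption, not part of ClangJIT itself; sum_to, sum_to_dispatch, and the SKETCH_JIT macro are hypothetical names chosen for illustration.

```cpp
#include <stdexcept>

#ifndef __has_feature // For compatibility with non-Clang compilers.
  #define __has_feature(x) 0
#endif

// Apply the attribute only when the compiler reports JIT support.
#if __has_feature(clang_cxx_jit)
  #define SKETCH_JIT [[clang::jit]]
#else
  #define SKETCH_JIT
#endif

// Hypothetical kernel: sums the integers 1..N, with N fixed per instantiation.
template <int N>
SKETCH_JIT int sum_to() {
  int total = 0;
  for (int i = 1; i <= N; ++i)
    total += i;
  return total;
}

// Hypothetical wrapper: uses the JIT when available, otherwise a fixed menu
// of sizes that were instantiated ahead of time.
int sum_to_dispatch(int n) {
#if __has_feature(clang_cxx_jit)
  return sum_to<n>(); // Runtime value as a template argument (JIT builds only).
#else
  switch (n) {
    case 1: return sum_to<1>();
    case 2: return sum_to<2>();
    case 4: return sum_to<4>();
    case 8: return sum_to<8>();
    default:
      throw std::invalid_argument("size was not instantiated ahead of time");
  }
#endif
}
```

Without JIT support, sum_to_dispatch(4) takes the switch path and any size outside the menu throws. With JIT support, the same call is expected to accept any int.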

Attribute

By itself, the command-line flag affects some linking details, but does not enable any use of just-in-time compilation. Function templates can be tagged for just-in-time compilation by using the attribute:

[[clang::jit]]

The attributed function template provides for additional features and restrictions. Features:

  1. Instantiations of this function template will not be constructed at compile time, but rather calling a specialization of the template, or taking the address of a specialization of the template, will trigger the instantiation and compilation of the template at runtime.

  2. Non-constant expressions may be provided for the non-type template parameters, and these values will be used at runtime to construct the type of the requested instantiation.

Example:

$ cat /tmp/jit.cpp
#include <iostream>
#include <cstdlib>

template <int x>
[[clang::jit]] void run() {
  std::cout << "I was compiled at runtime, x = " << x << "\n";
}

int main(int argc, char *argv[]) {
  int a = std::atoi(argv[1]);
  run<a>();
}
$ clang++ -O3 -fjit -o /tmp/jit /tmp/jit.cpp
$ /tmp/jit 5
I was compiled at runtime, x = 5

  3. Type arguments to the template can be provided as strings. If the argument is implicitly convertible to a const char *, then that conversion is performed, and the result is used to identify the requested type. Otherwise, if an object is provided, and that object has a member function named c_str(), and the result of that function can be converted to a const char *, then the call and conversion (if necessary) are performed in order to get a string used to identify the type. The string is parsed and analyzed to identify the type in the declaration context of the parent to the function triggering the instantiation. Whether types defined after the point in the source code that triggers the instantiation are available is not specified.

Example:

$ cat /tmp/jit-t.cpp
#include <iostream>
#include <string>

struct F {
  int i;
  double d;
};

template <typename T, int S>
struct G {
  T arr[S];
};

template <typename T>
[[clang::jit]] void run() {
  std::cout << "I was compiled at runtime, sizeof(T) = " << sizeof(T) << "\n";
}

int main(int argc, char *argv[]) {
  std::string t(argv[1]);
  run<t>();
}
$ clang++ -O3 -fjit -o /tmp/jit-t /tmp/jit-t.cpp
$ /tmp/jit-t '::F'
I was compiled at runtime, sizeof(T) = 16
$ /tmp/jit-t 'F'
I was compiled at runtime, sizeof(T) = 16
$ /tmp/jit-t 'float'
I was compiled at runtime, sizeof(T) = 4
$ /tmp/jit-t 'double'
I was compiled at runtime, sizeof(T) = 8
$ /tmp/jit-t 'size_t'
I was compiled at runtime, sizeof(T) = 8
$ /tmp/jit-t 'std::size_t'
I was compiled at runtime, sizeof(T) = 8
$ /tmp/jit-t 'G<F, 5>'
I was compiled at runtime, sizeof(T) = 80
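
The rule for turning a template argument into a type string (implicit conversion to const char * first, otherwise c_str()) can be mimicked in ordinary C++ with overload priority. This is only an illustrative sketch of the rule as described above, not the code ClangJIT uses; type_string_from and type_string_impl are hypothetical names.

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Rule 1 (preferred): the argument itself converts implicitly to const char*.
template <typename T>
typename std::enable_if<std::is_convertible<const T &, const char *>::value,
                        std::string>::type
type_string_impl(const T &arg, int) {
  const char *s = arg;
  return std::string(s);
}

// Rule 2 (fallback): the argument has a c_str() member whose result converts
// to const char*.
template <typename T>
typename std::enable_if<
    std::is_convertible<decltype(std::declval<const T &>().c_str()),
                        const char *>::value,
    std::string>::type
type_string_impl(const T &arg, long) {
  const char *s = arg.c_str();
  return std::string(s);
}

// Passing the literal 0 prefers the `int` overload (rule 1) when it is viable
// and falls back to the `long` overload (rule 2) otherwise.
template <typename T>
std::string type_string_from(const T &arg) {
  return type_string_impl(arg, 0);
}
```

For instance, type_string_from("float") goes through rule 1, while type_string_from(std::string("G<F, 5>")) goes through rule 2. An argument that satisfies neither rule fails to compile.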

Restrictions:

  1. Because the body of the template is not instantiated at compile time, decltype(auto) and any other type-deduction mechanisms depending on the body of the function are not available.
  2. Because the template specializations are not compiled until runtime, they're not available at compile time for use as non-type template arguments, etc.

Note: Explicit specializations of a JIT function template are not just-in-time compiled.

Note: A JIT template with a pointer/reference non-type template parameter which is provided with a runtime pointer value will generate a different instantiation for each pointer value. If the pointer provided points to a global object, no attempt is made to map that pointer value back to the name of the global object when constructing the new type.

Note: In general, pointer/reference-type non-type template arguments are not permitted to point to subobjects. This restriction still applies formally to the templates instantiated at runtime using runtime-provided pointer values. This has important optimization benefits: pointers that can be traced back to distinct underlying objects are known not to alias, and these template parameters appear to the optimizer to have this unique-object property.

A Benchmark

As a benchmark to illustrate the feature, we'll adapt a benchmark from the Eigen library. Specifically, this one: https://github.com/eigenteam/eigen-git-mirror/blob/master/bench/benchmark.cpp

We want to look at two aspects: compile time and runtime performance. Eigen provides a matrix type which can have either compile-time-specified or runtime-specified sizes (i.e., the number of rows and columns).

#include <iostream>
#include <string>
#include <chrono>
#include <cstdlib>

#include <Eigen/Core>

using namespace std;
using namespace Eigen;

If we wish to support a variant of this benchmark supporting float, double, and long double, and supporting any size at runtime, we can adapt the code as:

template <typename T>
void test_aot(int size, int repeat) {
  Matrix<T,Dynamic,Dynamic> I = Matrix<T,Dynamic,Dynamic>::Ones(size, size);
  Matrix<T,Dynamic,Dynamic> m(size, size);
  for(int i = 0; i < size; i++)
  for(int j = 0; j < size; j++) {
    m(i,j) = (i+size*j);
  }

  auto start = chrono::system_clock::now();

  for (int r = 0; r < repeat; ++r) {
    m = Matrix<T,Dynamic,Dynamic>::Ones(size, size) + T(0.00005) * (m + (m*m));
  }

  auto end = chrono::system_clock::now();
  cout << "AoT: " << chrono::duration<double>(end - start).count() << " s\n";
}

void test_aot(std::string &type, int size, int repeat) {
  if (type == "float")
    test_aot<float>(size, repeat);
  else if (type == "double")
    test_aot<double>(size, repeat);
  else if (type == "long double")
    test_aot<long double>(size, repeat);
  else
    cout << type << " not supported for AoT\n";
}

To do the same thing with the Clang JIT extensions, we can write:

template <typename T, int size>
[[clang::jit]] void test_jit_sz(int repeat) {
  Matrix<T,size,size> I = Matrix<T,size,size>::Ones();
  Matrix<T,size,size> m;
  for(int i = 0; i < size; i++)
  for(int j = 0; j < size; j++) {
    m(i,j) = (i+size*j);
  }

  auto start = chrono::system_clock::now();

  for (int r = 0; r < repeat; ++r) {
    m = Matrix<T,size,size>::Ones() + T(0.00005) * (m + (m*m));
  }

  auto end = chrono::system_clock::now();
  cout << "JIT: " << chrono::duration<double>(end - start).count() << " s\n";
}

void test_jit(std::string &type, int size, int repeat) {
  return test_jit_sz<type, size>(repeat);
}

We can also use very similar code to construct explicit instantiations at compile time, but then, of course, we are limited to the explicit sizes we have selected.
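
A minimal sketch of what such a fixed menu of ahead-of-time specializations might look like follows. To stay self-contained it swaps the Eigen kernel for a hand-written matrix product; sum_of_square and run_fixed are hypothetical names, and the sizes 3, 7, and 16 simply mirror the ones measured below.

```cpp
#include <array>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the Eigen kernel: builds the N x N matrix with
// m(i,j) = i + N*j and returns the sum of all entries of m*m. N is a
// compile-time constant, so every loop bound is known to the optimizer.
template <typename T, int N>
T sum_of_square() {
  std::array<T, N * N> m{};
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      m[i * N + j] = T(i + N * j);

  T total = T(0);
  for (int i = 0; i < N; ++i)
    for (int j = 0; j < N; ++j)
      for (int k = 0; k < N; ++k)
        total += m[i * N + k] * m[k * N + j];
  return total;
}

// Hypothetical dispatcher: only the hand-picked (type, size) pairs exist.
double run_fixed(const std::string &type, int size) {
  if (type == "double") {
    switch (size) {
      case 3:  return sum_of_square<double, 3>();
      case 7:  return sum_of_square<double, 7>();
      case 16: return sum_of_square<double, 16>();
    }
  }
  throw std::invalid_argument("no specialization compiled for " + type +
                              ", size " + std::to_string(size));
}
```

Every extra (type, size) pair adds another instantiation to compile, and any request outside the menu can only be rejected at runtime.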

Compile Time

Compile times were measured on an Intel Xeon E5-2699 using the flags -march=native -ffast-math -O3, taking the "user" time reported by the Linux time command.

| Configuration | Time | Time over Base |
| --- | --- | --- |
| JIT Only | 3.5s | 0.92s |
| Single Specialization (double, size = 16) | 4.95s | 2.37s |
| Single Specialization (double, size = 7) | 3.3s | 0.72s |
| Single Specialization (double, size = 3) | 3.2s | 0.62s |
| Single Specialization (double, size = 1) | 2.95s | 0.37s |
| Two Specializations (double, size = 16) and (double, size = 7) | 5.7s | 3.12s |
| AoT Only (three floating-point types with dispatch) | 9.7s | 7.12s |
| AoT Only (double only) | 5.3s | 2.72s |
| Nothing (just the includes and a main function) | 2.58s | - |

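As you can see, the cost of generating each specialization is essentially additive (2.37s + 0.72s = 3.09s for sizes 16 and 7 compiled separately, against 3.12s when both are compiled together), and specializations get more expensive as the fixed matrix sizes get larger. Generating the code for the JIT has a compile-time cost, but it is lower than that of even a single non-fixed-size implementation (0.92s vs. 2.72s over base for AoT with double only).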

These measurements use a Clang build compiled using GCC 8.2.0 with CMake's RelWithDebInfo mode.

Runtime Performance

Results for (double, size = 3) with a repeat count of 40000000; times are as reported by the code and exclude JIT compilation time.

| Configuration | Time |
| --- | --- |
| JIT | 1.0s |
| Single Specialization | 1.01s |
| AoT | 8.05s |

For (double, size = 7)

| Configuration | Time |
| --- | --- |
| JIT | 8.34s |
| Single Specialization | 8.45s |
| AoT | 20s |

For (double, size = 16)

| Configuration | Time |
| --- | --- |
| JIT | 35.3s |
| Single Specialization | 35.1s |
| AoT | 36.2s |
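
The figures above are timed inside the benchmark functions, so they leave out the cost of JIT compilation itself. One way to observe that cost as well is to time the whole call from the outside. The helper below is a hypothetical sketch; seconds_for is not part of the benchmark, and std::chrono::steady_clock is chosen here (rather than the system_clock used in the benchmark) because it is monotonic.

```cpp
#include <chrono>
#include <utility>

// Hypothetical helper: returns the wall-clock seconds taken by one call to fn.
template <typename Fn>
double seconds_for(Fn &&fn) {
  auto start = std::chrono::steady_clock::now();
  std::forward<Fn>(fn)();
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(end - start).count();
}
```

For example, seconds_for([&] { test_jit(type, size, repeat); }) measures the full call. Subtracting the "JIT:" time that the function prints leaves an estimate of the time spent outside the timed loop, which includes the runtime instantiation and the matrix setup.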

A few trends to notice:

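The JIT-generated code is significantly faster than the ahead-of-time-generated code for small matrix sizes: about 8x at size 3 (1.0s vs. 8.05s) and about 2.4x at size 7 (8.34s vs. 20s). The advantage becomes less significant as the matrix sizes become larger: at size 16 the difference is under 3% (35.3s vs. 36.2s).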

Thus, using the JIT gives the performance advantages of using many ahead-of-time specializations, and is sometimes even better, with very low compile-time cost.

Acknowledgement

This research was supported by the Exascale Computing Project (17-SC-20-SC), a collaborative effort of two U.S. Department of Energy organizations (Office of Science and the National Nuclear Security Administration) responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering, and early testbed platforms, in support of the nation’s exascale computing imperative. Additionally, this research used resources of the Argonne Leadership Computing Facility, which is a DOE Office of Science User Facility supported under Contract DE-AC02-06CH11357.