Using latest LLVM in nix

Jun 29, 2023 · sh1marin

The vector project is a RISC-V Vector compliant hardware generator that uses Nix to set up its environment. Its test cases are written with buddy-mlir, which requires LLVM. This post records the problems I ran into just to bump buddy-mlir to a revision built on the LLVM mainline.

The beginning

A couple of weeks ago, I started learning how to integrate buddy-mlir into the vector project. Just like other languages, MLIR has a built-in language server (the mlir-lsp-server). The buddy-mlir project extends it with RVV Dialect support, but only in a newer commit than the one the vector project currently uses. So I decided to bump the buddy-mlir git rev to the latest commit.

This should be a simple copy-paste task, right? Well, it is not. The maintainers of buddy-mlir had bumped LLVM to the mainline in an earlier commit, and that brought in a massive dependency hell.

So the first thing I needed to do was replace the git rev in the buddy-mlir derivation. That is easy: just replace the rev and hash attributes:

# LLVM derivation
-rev = "e31d27e46048ccc3294d6b215dc778b3390e7834";
-hash = "sha256-CM3+amf2SpOiUBzdnO7sryTwmGcC0NVabNNvuatcCDQ=";
+rev = "8f966cedea594d9a91e585e88a80a42c04049e6c";
+hash = "sha256-g2cYk3/iyUvmIG0QCQpYmWj4L2H4znx9KbuA5TvIjrc=";
# buddy-mlir derivation
-rev = "2900b35cfd5ff34e1608b90d04f9dd9f41296f91";
-hash = "sha256-3qduRyjOQKTDRdOpTJirD0Wm14BuvdJLx2IIWHDMD0g=";
+rev = "74c18e6963cf4781be254d3c5d963b36c0642ba4";
+hash = "sha256-Wx/QQrELfOT0h4B8hF9EPZKn4yVHBZeYh3Wm85Jpq60=";

And it was built successfully. But after I created a GitHub PR for the vector project, I noticed that almost all the tests failed with an error:

*.mlir.S:3:16: error: invalid arch name 'rv32i2p1_m2p0_f2p2_d2p2_v1p0_zicsr2p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0', unsupported version number 2.1 for extension 'i'
 .attribute 5, "rv32i2p1_m2p0_f2p2_d2p2_v1p0_zicsr2p0_zve32f1p0_zve32x1p0_zve64d1p0_zve64f1p0_zve64x1p0_zvl128b1p0_zvl32b1p0_zvl64b1p0"
               ^
*rvv-vp-intrinsic-add*.mlir.S:13:2: error: instruction requires the following: 'V' (Vector Extension for Application Processors), 'Zve32x' or 'Zve64x' (Vector Extensions for Embedded Processors)
 vsetivli zero, 16, e32, m4, ta, ma
 ^

After discussing this with my teammates and mentor, we came up with a possible reason: the LLVM used by buddy-mlir is incompatible with the execution environment. The vector repo has a custom clang using LLVM 14 for the 32-bit RISC-V environment. There must have been a lot of changes between LLVM 14 and the mainline. I tried replacing llvmPackages_14 with llvmPackages_16, but that did not solve the problem. I needed to dig deeper.

First I needed to find out what the error message means. What does “unsupported version number 2.1 for extension ‘i’” mean? Reading the message letter by letter, I found that the generated assembly code has an attribute rv32i2p1_....., which indicates that the target architecture uses version 2.1 of the RISC-V i extension. This .mlir.S file is generated by the buddy-llc tool. Since the error is raised while assembling the code into machine code, the assembler in the execution environment must be incompatible with the assembly generated by buddy-mlir. Searching through the LLVM repository, we found that the supported extension versions are hard-coded in the llvm/lib/Support/RISCVISAInfo.cpp file, and version 2.1 of the i extension is not supported in llvm14, llvm16, or even the llvm17 pre-release. This means that I have to replace the whole toolchain with the latest LLVM.

The Nix magic

Nixpkgs already provides llvmPackages_16, so there is no need to re-package the LLVM derivation myself. Nixpkgs also provides a simple way to override the download source: we can pass in a gitRelease attribute set or an officialRelease attribute set to replace the source.

# ...
# LLVM release information; specify one of these but not both:
, gitRelease ? null
  # i.e.:
  # {
  #   version = /* i.e. "15.0.0" */;
  #   rev = /* commit SHA */;
  #   rev-version = /* human readable version; i.e. "unstable-2022-26-07" */;
  #   sha256 = /* checksum for this release, can omit if specifying your own `monorepoSrc` */;
  # }
, officialRelease ? { version = "16.0.6"; sha256 = "sha256-fspqSReX+VD+Nl/Cfq+tDcdPtnQPV1IRopNDfd5VtUs="; }
  # i.e.:
  # {
  #   version = /* i.e. "15.0.0" */;
  #   candidate = /* optional; if specified, should be: "rcN" */
  #   sha256 = /* checksum for this release, can omit if specifying your own `monorepoSrc` */;
  # }
 # ...

By default it is set to 16.0.6 from the GitHub archive, but we can set officialRelease to null and pass in a gitRelease attribute set to make llvmPackages_16 use the latest LLVM source.

llvmPackages_16.override {
   gitRelease = {
     version = "17.0.0";
     rev = "8f966cedea594d9a91e585e88a80a42c04049e6c";
     rev-version = "unstable-2023-05-02";
     sha256 = "sha256-g2cYk3/iyUvmIG0QCQpYmWj4L2H4znx9KbuA5TvIjrc=";
   };
   officialRelease = null;
}

Nixpkgs carries some custom patches that need to be applied to LLVM, and the LLVM mainline contains many breaking changes, so I had to fix up the original patches. These patches are used by the libllvm attribute in the llvmPackages_16 set. llvmPackages_16 itself is a huge set of attributes containing several LLVM tools and libraries, and there is no conventional overrideAttrs to change a part of it. This confused me for a long time until Sharzy told me that the correct way is to extend it.

myLLVM = (llvmPackages_16.override {
  gitRelease = {
    version = "17.0.0";
    rev = "8f966cedea594d9a91e585e88a80a42c04049e6c";
    rev-version = "unstable-2023-05-02";
    sha256 = "sha256-g2cYk3/iyUvmIG0QCQpYmWj4L2H4znx9KbuA5TvIjrc=";
  };
  officialRelease = null;
}).extend (lfinal: lprev: {
  llvm = lprev.llvm.overrideAttrs (oldAttrs: {
    patches = (builtins.filter (p: builtins.baseNameOf p != "gnu-install-dirs.patch") oldAttrs.patches) ++ [
      ./nix/gnu-install-dirs.patch
    ];
  });
});

The code example above is hard to understand the first time you read it, right? And if you have never written in a functional programming language, it will be even harder. What it does is pass a function to the extend attribute. The function takes two arguments: lfinal, which represents the state of the set after overriding, and lprev, which represents the state before the override. The function returns an attribute set with the llvm field, and this attribute set eventually replaces the original llvm field declared in llvmPackages_16. The llvm attribute in llvmPackages_16 allows overriding its attributes, so the next step is to pass a function taking the oldAttrs argument to overrideAttrs. Finally, I can start modifying the patches attr for llvm. The filter function drops the original gnu-install-dirs.patch file, and my updated patch is appended in its place.
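To make the lfinal/lprev mechanics concrete, here is a minimal sketch that does not depend on nixpkgs. The fix and extends helpers are simplified re-implementations written for illustration (assumption: they behave like lib.fix and lib.extends), and base is a hypothetical two-package set standing in for the real llvmPackages_16, with plain strings instead of patch files.

```nix
let
  # Simplified fixed-point helper: the set receives its own final value.
  fix = f: let x = f x; in x;

  # Apply an overlay on top of a self-referential set.
  extends = overlay: rattrs: final:
    let prev = rattrs final; in prev // overlay final prev;

  # Hypothetical stand-in for the package set (not the real llvmPackages_16).
  base = final: {
    llvm = { patches = [ "gnu-install-dirs.patch" "other.patch" ]; };
    # clang reads llvm through `final`, so it sees later overrides.
    clang = { builtWith = final.llvm.patches; };
  };

  # lprev is the set before this overlay, lfinal the set after all overlays.
  overlay = lfinal: lprev: {
    llvm = lprev.llvm // {
      patches =
        builtins.filter (p: p != "gnu-install-dirs.patch") lprev.llvm.patches
        ++ [ "my-gnu-install-dirs.patch" ];
    };
  };
in
  (fix (extends overlay base)).clang.builtWith
```

Evaluating this file with nix-instantiate --eval --strict should give [ "other.patch" "my-gnu-install-dirs.patch" ]: the old patch is filtered out, the new one is appended, and clang picks up the change because it looks llvm up through the final set.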

So, this is the end, right? No, not yet! I was still far from a successful build. Nix even failed to evaluate the derivation in flake.nix:

error: attribute 'stdenv' missing

at /nix/store/some-hash-path/flake.nix:

devShell = pkgs.mkShell.override { stdenv = pkg.myLLVM.stdenv; }
                                                       ^

stdenv is a package that contains the build tools and libraries necessary to build a basic package with a makefile or another build system. The stdenv provided in llvmPackages has the complete set of build tools for building the LLVM project, which is required for developing the vector repository.

Why is it missing? Because the nesting tripped me up. I didn't know that llvmPackages_16 is actually a set with the following hierarchy:

let
  tools = lib.makeExtensible ( ... );
  libraries = lib.makeExtensible ( ... );
in
  { inherit tools libraries release_version; }

The lib.makeExtensible function adds an extend attribute to the given set. Because llvmPackages_16 exposes the attributes of both the tools and the libraries sets at its top level, calling llvmPackages_16.extend actually calls llvmPackages_16.tools.extend. The extend function only returns the modified set it belongs to, so after calling llvmPackages_16.extend only the tools attributes are left, and the libraries set is gone. And coincidentally, stdenv comes from the libraries set. That's why the error occurs.

This is an internal bug in nixpkgs. But since stdenv is just a basic build environment, we can use the old stdenv from the unmodified llvmPackages_16.
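The disappearing stdenv can be reproduced with a toy model. The sketch below is self-contained; makeExtensible is a simplified re-implementation for illustration, the string values are placeholders for real packages, and the merge order `// libraries // tools` is an assumption chosen to match the behaviour described above (the top-level extend being the one from tools).

```nix
let
  fix = f: let x = f x; in x;
  extends = overlay: rattrs: final:
    let prev = rattrs final; in prev // overlay final prev;

  # Simplified version of lib.makeExtensible: adds an `extend` attribute
  # that returns a new extensible set with the overlay applied.
  makeExtensible = rattrs:
    fix (self: rattrs self // {
      extend = overlay: makeExtensible (extends overlay rattrs);
    });

  # Placeholder sets; the real ones contain many packages.
  tools = makeExtensible (self: { clang = "clang-16"; });
  libraries = makeExtensible (self: { stdenv = "llvm-stdenv"; });

  # Assumed merge: the right-most operand wins, so `extend` is tools.extend.
  llvmPackages = { inherit tools libraries; } // libraries // tools;

  extended = llvmPackages.extend (lfinal: lprev: { clang = "clang-17git"; });
in {
  hasStdenvBefore = llvmPackages ? stdenv;
  hasStdenvAfter = extended ? stdenv;
  clang = extended.clang;
  # Workaround: read stdenv from the unmodified set instead.
  stdenv = llvmPackages.stdenv;
}
```

Here hasStdenvBefore evaluates to true and hasStdenvAfter to false, because extended is only the extended tools set. Reading stdenv from the original llvmPackages still works, which is the workaround mentioned above.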

The overridden llvmPackages_16 package built successfully, but the patch still failed to apply when I built clang. And a confusing problem arose: the patch.rej showed that I was applying the new patch to the old sources! This is because the llvm attr in llvmPackages_16 is just an alias for libllvm.

libllvm = callPackage ./llvm {
  inherit llvm_meta;
};

llvm = tools.libllvm;

However, clang uses the libllvm attr, so when clang builds, it tries to apply the new patch to the old sources. So the correct way is to extend the libllvm attr.
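Applied to the earlier override, that fix might look like the following. This is a sketch assembled from the snippets in this post, not the exact code from the pull request; it assumes that llvm = tools.libllvm is resolved through the extended set, so the alias follows the override automatically.

```nix
{ llvmPackages_16 }:

(llvmPackages_16.override {
  gitRelease = {
    version = "17.0.0";
    rev = "8f966cedea594d9a91e585e88a80a42c04049e6c";
    rev-version = "unstable-2023-05-02";
    sha256 = "sha256-g2cYk3/iyUvmIG0QCQpYmWj4L2H4znx9KbuA5TvIjrc=";
  };
  officialRelease = null;
}).extend (lfinal: lprev: {
  # Override libllvm itself rather than its alias llvm,
  # so that clang and the other consumers see the new patch set.
  libllvm = lprev.libllvm.overrideAttrs (oldAttrs: {
    patches = (builtins.filter (p: builtins.baseNameOf p != "gnu-install-dirs.patch") oldAttrs.patches) ++ [
      ./nix/gnu-install-dirs.patch
    ];
  });
})
```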

My First LLVM contribution

There is also a story from when I tried to build clang. The LLVM mainline forces the llvm-gtest build target when you set LLVM_INCLUDE_TESTS=ON. However, if I set LLVM_INCLUDE_TESTS=OFF, the build still fails:

/build/.../clang/lib/Analysis/FlowSensitive/HTMLLogger.cpp: fatal error: HTMLLogger.inc: No such file or directory

The HTMLLogger.inc file is a bunch of HTML/JS/CSS code wrapped in a char[] and used at runtime to generate the web logger. It is generated by a bundler script written in Python. The script is invoked through CMake's add_custom_command function and is executed at build time.

add_custom_command(OUTPUT HTMLLogger.inc
  COMMAND "${Python3_EXECUTABLE}" ${CLANG_SOURCE_DIR}/utils/bundle_resources.py
  ${CMAKE_CURRENT_BINARY_DIR}/HTMLLogger.inc
  HTMLLogger.html HTMLLogger.css HTMLLogger.js
  WORKING_DIRECTORY ${CMAKE_CURRENT_SOURCE_DIR}
  COMMENT "Bundling HTMLLogger resources"
  DEPENDS ${CLANG_SOURCE_DIR}/utils/bundle_resources.py HTMLLogger.html HTMLLogger.css HTMLLogger.js
  VERBATIM)

There must be some error logged at build time, I thought. But I could not find any. I also tried adding more verbose logging, thinking that stdout/stderr might be captured by CMake. But it was still as silent as the wilderness, and I could not see a single message. So I tried to verify that the bundle script was actually running. To rule out stdout/stderr being captured by the parent process, I added a snippet to the bundler that logs to a file. No such log file ever appeared.

Why isn’t the bundler script running? I opened up the build.ninja file and found out why:

build lib/Analysis/FlowSensitive/HTMLLogger.inc | ...: CUSTOM_COMMAND ...
  COMMAND = cd /build/clang-src-unstable-2023-05-02/clang/lib/Analysis/FlowSensitive
  DESC = Bundling HTMLLogger resources
  restat = 1

There is only a cd command left in the COMMAND! This shouldn’t be a bug in ninja, as the other build commands were generated correctly. So I added some debug logging to the CMakeLists.txt and found that the Python interpreter is missing! The "${Python3_EXECUTABLE}" variable has no value.

But why? If Python is a required component, then a missing python3 should make the configure step fail before the build can even start. Searching through the CMakeLists.txt files, I finally found the core problem. The clang CMakeLists.txt only looks for the python3 executable when LLVM_INCLUDE_TESTS=ON. So if I set the option to OFF, the configure step passes, but the Python3_EXECUTABLE variable is never set. Since the command expands to an empty string, no actual build command is generated and only the cd is left.

So I sent a patch to LLVM to fix the issue and got my first LLVM contribution: https://reviews.llvm.org/D152418.

It compiles, but it does not work

I was exhausted, but the challenge continued. After I finally compiled the whole thing successfully and entered the devshell, I found that lld was broken:

error while loading shared libraries: libLLVM-17git.so: cannot open object: no such file or directory

The shared library was built correctly and placed in the correct path, so there must be something wrong with the lld rpath. I didn’t suffer much thanks to the help of NickCao, who found that LLVM had introduced a breaking change in how build tools are installed. In the preFixup stage, CMake sets the runtime path:

lld-unstable> -- Set runtime path of "/nix/store/qm4w9bxjc6xiixrbdrdzjxgr487yphx7-lld-unstable-2023-05-02/bin/lld" to "$ORIGIN/..//nix/store/5l99zwgmjiynkpb6p4hmqndssz41q98i-lld-unstable-2023-05-02-lib/lib"

The absolute store path is appended to $ORIGIN/.., which produces an invalid rpath. So I added a patch to revert that change.

And once clang worked, spike failed to build.

/nix/store/shw0b6wv2xdvyj71b1fj147i83awrqfz-binutils-2.40/bin/ld: /nix/store/kf147vkmb7a7z17lkpgj2d6y7w75nf7v-gcc-12.2.0//lib/libatomic.a(glfree.o): relocation R_X86_64_32S against `.rodata' can not be used when making a PIE object; recompile with -fPIE
/nix/store/shw0b6wv2xdvyj71b1fj147i83awrqfz-binutils-2.40/bin/ld: failed to set dynamic section sizes: bad value
clang-16: error: linker command failed with exit code 1 (use -v to see invocation)

This is because the ld executable in the provided stdenv is not compatible with the latest LLVM. After discussing it with my teammates, we decided to keep the llvmPackages_14 toolchain in the devshell and add a script called clang-rv32 as a wrapper that uses clang from the latest LLVM. And there it is, the final working version: https://github.com/sequencer/vector/pull/230.
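For readers who want a rough idea of what such a wrapper can look like, here is a minimal sketch using writeShellScriptBin from nixpkgs. It is not the script from the pull request: the myLLVM argument name, the choice of myLLVM.clang, and the --target=riscv32 flag are all assumptions made for illustration.

```nix
{ writeShellScriptBin, myLLVM }:

# Hypothetical wrapper: exposes the mainline clang under its own name,
# so it does not clash with the llvmPackages_14 clang in the devshell.
writeShellScriptBin "clang-rv32" ''
  # Assumed flag; the real wrapper may pass different options.
  exec ${myLLVM.clang}/bin/clang --target=riscv32 "$@"
''
```

Adding the resulting package to the devshell inputs would make clang-rv32 available on PATH next to the default toolchain.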