
After trying to wrangle Boost PP and other advertised compile-time libraries such as Boost Hana (which still has some runtime overhead compared to the same logic with hardcoded values), I've finally converged on simply writing C++ files that write other C++ files. It could be Python, but I'd rather keep the build simple in my C++ project. Code generation is painless with CMake; no idea about other build configuration utilities.


CMake has a particularly irritating flaw here, though: it makes no distinction between host and target when cross-compiling, which makes this kind of code generation really difficult to support in that use case (which is becoming more and more common).


Right, I hadn't thought of that, to be honest. If I understand correctly, you're saying the codegen targets will be compiled to the target arch, and then can't be run on the machine doing the compiling?

I think one solution might be to use target_compile_options() which lets you specify flags per target (instead of globally), assuming you're passing flags to specify the target architecture.


That only works if it's mostly the same compiler, unfortunately. The host and target toolchains could be completely different executables, with different calling conventions, etc. I don't know why CMake still has such a huge hole in its feature set, but it's quite unfortunate.
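A common workaround (a sketch of standard CMake practice, not something described in the thread; all target and file names are made up) is to build the generator in a nested CMake invocation via ExternalProject, which does not inherit the cross toolchain file by default and therefore uses the host compiler:

```cmake
include(ExternalProject)

# Build the code generator subproject for the *host*: ExternalProject does
# not propagate CMAKE_TOOLCHAIN_FILE, so this uses the host compiler even
# when the outer build is cross-compiling.
ExternalProject_Add(host_tools
  SOURCE_DIR      ${CMAKE_CURRENT_SOURCE_DIR}/tools   # hypothetical subproject
  CMAKE_ARGS      -DCMAKE_BUILD_TYPE=Release
  INSTALL_COMMAND "")

ExternalProject_Get_Property(host_tools BINARY_DIR)

# Run the host-built generator to produce a header for the target build.
add_custom_command(
  OUTPUT  ${CMAKE_CURRENT_BINARY_DIR}/generated.hpp
  COMMAND ${BINARY_DIR}/codegen ${CMAKE_CURRENT_BINARY_DIR}/generated.hpp
  DEPENDS host_tools)
```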


> Boost Hana (which still has some runtime overhead compared to the same logic with hardcoded values)

Can you elaborate on that? What was your use-case for which this was true?


One case I benchmarked was Bernstein/Bézier and Lagrange element evaluation. That is: given a degree-d triangle or tetrahedron and some barycentric coordinates, compute the physical coordinate and the Jacobian matrix of the mapping.

Degree 2, Lagrange:

- Runtime: 3.6M/s
- Hana: 16.2M/s
- Hardcoded: 37.7M/s

Degree 3, Lagrange: 2.6M/s, 6.4M/s, 13.4M/s (same order).

"Runtime" here means everything is done using runtime loops, "Hana" using Boost Hana to make loops compile-time and use some constexpr ordering arrays, "hardcoded" is a very Fortran-looking function with all hardcoded indices and operations all unrolled.

As you can see, using Boost Hana does bring some improvement, but there is still a factor of 2x between it and the hardcoded version. This is all compiled with Release optimization flags. Technically, the Hana implementation performs the same operations in the same order as the hardcoded version, with all indices known at compile time, which is why I say there must be some runtime overhead to using hana::while.
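For illustration — this is not the benchmarked code, just an equivalent standard-library technique under my own made-up names: the same "compile-time loop whose index is usable as a constant" effect can be written with std::index_sequence and a fold expression, which leaves no loop machinery behind after expansion:

```cpp
// Illustrative only (not the commenter's benchmark): compile-time loop
// unrolling via std::index_sequence. Each I is a constant expression, so
// it can index a constexpr ordering array or serve as a template argument.
#include <array>
#include <cstddef>
#include <utility>

// Hypothetical constexpr ordering table, stand-in for the ones mentioned above.
inline constexpr std::array<std::size_t, 4> order{2, 0, 3, 1};

template <std::size_t... I>
constexpr double weighted_sum(const std::array<double, 4>& x,
                              std::index_sequence<I...>) {
    // The fold expands to x[order[0]]*1 + x[order[1]]*2 + ... ; the "loop"
    // is fully unrolled at compile time.
    return ((x[order[I]] * static_cast<double>(I + 1)) + ...);
}

constexpr double weighted_sum(const std::array<double, 4>& x) {
    return weighted_sum(x, std::make_index_sequence<4>{});
}
```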

In the case of Bernstein elements, the best solution is to use de Casteljau's recursive algorithm with templates (a 10x to 15x speedup over the runtime recursive version, depending on degree). But not everything recasts itself nicely as a recursive algorithm, or at least I didn't find a way for Lagrange. I did enable -flto since, from my understanding (looking at call stacks), hana::while creates lambda functions, so perhaps a simple function optimization becomes a cross-unit affair if it calls hana::while. (Speculating.)
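A minimal sketch of the template-recursive de Casteljau idea — for a 1D Bézier curve rather than the commenter's simplex elements (the 1D restriction and the names are mine):

```cpp
// Illustrative sketch (not the commenter's element code): de Casteljau
// evaluation of a degree-N Bezier curve in 1D. The recursion is on the
// template parameter N, so every level is a distinct instantiation that
// the compiler can fully unroll and inline.
#include <array>
#include <cstddef>

template <std::size_t N>
constexpr double de_casteljau(const std::array<double, N + 1>& c, double t) {
    if constexpr (N == 0) {
        return c[0];
    } else {
        // One round of linear interpolation lowers the degree by one.
        std::array<double, N> next{};
        for (std::size_t i = 0; i < N; ++i)
            next[i] = (1.0 - t) * c[i] + t * c[i + 1];
        return de_casteljau<N - 1>(next, t);
    }
}
```

N is not deducible from the array size here, so call sites pass it explicitly, e.g. `de_casteljau<2>(coeffs, t)`.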

Similar results for computing the Bernstein coefficients of the Jacobian matrix determinant of a Q2 tetrahedron: a factor of 5x from "runtime" to "Hana" (the only difference being that for loops become hana::whiles), and a factor of 3x from "Hana" to "hardcoded" (the loops are unrolled). So a factor of 15x between naive C++ and code-generated files. In this function in particular there are 4 nested loops; it's branching hell where continues are hit very often.


Hmmm, interesting, thanks for the reply!

It would be fairly interesting to look at the actual code you used and have a look at the resulting codegen. By any chance, is it viable for you to open-source it? I'd guess it would be of great interest to the Hana author(s).

What compiler/version did you use? For example, MSVC isn't (or at least wasn't) good at always evaluating `constexpr` at compile time...

> hana::while creates lambda functions, so perhaps a simple function optimization becomes a cross-unit affair if it calls hana::while. (speculating)

Hmm, I'd say it (LTO) shouldn't matter here, as these lambdas are already fully visible to the compiler.


I never thought to contact them, but I might do that, thanks for the suggestion. This is something I tested almost two years ago; I have the benchmarks written down, but I've since deleted the code I used, save for the optimal implementations (though it wouldn't take too long to rewrite).

I tested with clang on my Mac laptop and gcc on a Linux workstation; versions, I'm not sure. If I revisit this to contact the Hana people, I'll try to include all that information. I did test the constexpr ordering arrays by making sure I could pass, say, arr[0] as a template parameter, which is only possible if the value is known at compile time. Though it's also possible the compiler is lazy in other contexts, i.e. not actually evaluating at compile time if it figures out the result doesn't need to be known at compile time.
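The trick being described can be sketched like this (illustrative names, not the original code): using an array element as a non-type template argument forces the compiler to prove it is a constant expression.

```cpp
// Illustrative check, mirroring the commenter's trick: a value can only be
// used as a non-type template argument if it is a constant expression, so
// this compiles only when arr[0] is truly evaluated at compile time.
#include <array>

inline constexpr std::array<int, 3> arr{7, 1, 4};  // hypothetical ordering array

template <int V>
constexpr int twice() { return 2 * V; }

// Would be a hard compile error if arr[0] were not a compile-time constant.
static_assert(twice<arr[0]>() == 14);
```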

Oh yeah, you're right, I was confusing translation unit and function scope.


This sounds pragmatic, but are you writing C++ executables that when run create the generated code? Are there templating libraries involved?


Yeah, it's all done automatically when you build, and dependencies are properly taken into account: if you modify one of the code generating sources, its outputs are regenerated, and everything that depends on them is correctly recompiled. This doesn't take much CMake logic at all to make work.

In my case, no, it's dumb old code writing strings and dumping them to files. You could do whatever you want in there; it's just a program that writes source files.
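For reference, the CMake glue for such a setup typically looks like this (a sketch with made-up target and file names, not the commenter's build):

```cmake
# Illustrative: build the generator, run it to produce a header, and let
# the dependency chain trigger regeneration when the generator changes.
add_executable(codegen gen_tables.cpp)

add_custom_command(
  OUTPUT  ${CMAKE_CURRENT_BINARY_DIR}/tables.hpp
  COMMAND codegen ${CMAKE_CURRENT_BINARY_DIR}/tables.hpp
  DEPENDS codegen                      # rerun when the generator is rebuilt
  COMMENT "Generating tables.hpp")

# Listing the generated header as a source makes the library depend on it.
add_library(mylib lib.cpp ${CMAKE_CURRENT_BINARY_DIR}/tables.hpp)
target_include_directories(mylib PRIVATE ${CMAKE_CURRENT_BINARY_DIR})
```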

I do use some template metaprogramming where it's more practical than code generation, and Boost Hana provides some algorithmic facilities at compile time, but those incur some runtime cost. For instance, you can write a while loop with bounds evaluated at compile time, which lets you use its index as a template parameter or pass it to constexpr functions. But sometimes the best solution (for me, performance/complexity-wise) has been to just write dumb files that hardcode things for the different cases.



