A performance regression accidentally made it into CMake’s trunk, I was able to find it by luck before any official release included it. Testing on LLVM sources, the regression made CMake 5.5x slower on my Linux main machine, and 4.2x slower on my MacOS secondary machine.

I found this while testing an unrelated project on CMake: I was experiencing an unusually high runtime, and perf revealed that much of the time was spent in memcpy, specifically within cmList::Join.

screenshot of perf report

Looking at the git blame for Source/cmList.h, I found that it was last touched by !8580, a seemingly innocent refactoring change. Benchmarking CMake on this commit (45f17e5) and its parent (88e7ad0) revealed it was indeed the cause of the regression.

$ git checkout 88e7ad0084bd6a2fa6f032d7be1ee5d993440dcf
$ cmake . -B build/`git rev-parse HEAD` -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo
$ cmake --build build/`git rev-parse HEAD`

$ git checkout 45f17e5a8566fcdec0071a5ed1bc2656968bf258
$ cmake . -B build/`git rev-parse HEAD` -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo
$ cmake --build build/`git rev-parse HEAD`

$ rm -rf /tmp/experiment
$ hyperfine                                                                                                      \
    -L version 88e7ad0084bd6a2fa6f032d7be1ee5d993440dcf,45f17e5a8566fcdec0071a5ed1bc2656968bf258                 \
    --warmup 1                                                                                                   \
    '~/cmake/build/{version}/bin/cmake ~/llvm-project/llvm -G Ninja -B /tmp/experiment -DCMAKE_BUILD_TYPE=Debug' \
    --cleanup 'rm -rf /tmp/experiment'
Commit Mean [s] Min [s] Max [s] Relative
88e7ad0 11.964 ± 0.078 11.840 12.059 1.00
45f17e5 66.915 ± 0.789 65.775 68.503 5.59 ± 0.08

After some more digging, the performance issue was specifically at line 1219 of Source/cmList.h.

return std::accumulate(
    std::next(std::begin(r)), std::end(r), cmList::ToString(*std::begin(r)),
    [&sep](std::string const& a,
            typename std::iterator_traits<decltype(std::begin(r))>::value_type const& b) -> std::string {
        return a + sep + cmList::ToString(b); // HERE.
    });

Can you see the hidden copy? Because we are passing a as a const& in the lambda, it ends up being copied at every iteration of std::accumulate. Specifically, the copy will be made for the operator+ operation.

I initially fixed this by changing the lambda to take a by non-const reference and using std::move, but this caused some build issues on certain platforms. The next easiest fix was to simply use a standard for loop. This version is also arguably clearer anyway. For reference, you find find the full change and discussion here: !8612.

std::string joined = cmList::ToString(*std::begin(r));
for (auto it = std::next(std::begin(r)); it != std::end(r); ++it) {
  joined += sep + cmList::ToString(*it);
}
return joined;

Comparing this commit (7ad290b) with its direct parent (fbea5d9) shows a return to normal.

Commit Mean [s] Min [s] Max [s] Relative
fbea5d9 65.630 ± 0.594 64.996 66.756 5.45 ± 0.06
7ad290b 12.034 ± 0.075 11.902 12.152 1.00

I think that this can be made even better with C++ ranges. More cycles can be saved by computing the final result string size, saving from having to re-allocate the string at every iteration, which is what I would assume a proper views::join implementation would do. I wrote a version below using range-v3. It is of course possible to write equivalent code without ranges, but I think it may harm readability too much for a small potential performance gain.

template <typename Container>
std::string join(Container items, std::string sep) {
    return items | ranges::views::transform([](auto e) { return std::to_string(e); })
           | ranges::views::intersperse(sep) | ranges::views::join | ranges::to<std::string>;
}