CMake: Fixing a 4-6x performance regression from unnecessary copies

A performance regression accidentally made it into CMake’s trunk, I was able to find it by luck before any official release included it. Testing on LLVM sources, the regression made CMake 5.5x slower on my Linux main machine, and 4.2x slower on my MacOS secondary machine.

I found this while testing an unrelated project on CMake: I was experiencing an unusually high runtime, and perf revealed that much of the time was spent in memcpy, specifically within cmList::Join.

Looking at the git blame for Source/cmList.h, I found that it was last touched by !8580, a seemingly innocent refactoring change. Benchmarking CMake on this commit (45f17e5) and its parent (88e7ad0) revealed it was indeed the cause of the regression.

$ git checkout 88e7ad0084bd6a2fa6f032d7be1ee5d993440dcf
$ cmake . -B build/`git rev-parse HEAD` -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo
$ cmake --build build/`git rev-parse HEAD`

$ git checkout 45f17e5a8566fcdec0071a5ed1bc2656968bf258
$ cmake . -B build/`git rev-parse HEAD` -G Ninja -DCMAKE_BUILD_TYPE=RelWithDebInfo
$ cmake --build build/`git rev-parse HEAD`

$ rm -rf /tmp/experiment
$ hyperfine                                                                                                      \
    -L version 88e7ad0084bd6a2fa6f032d7be1ee5d993440dcf,45f17e5a8566fcdec0071a5ed1bc2656968bf258                 \
    --warmup 1                                                                                                   \
    '~/cmake/build/{version}/bin/cmake ~/llvm-project/llvm -G Ninja -B /tmp/experiment -DCMAKE_BUILD_TYPE=Debug' \
    --cleanup 'rm -rf /tmp/experiment'

Commit	Mean [s]	Min [s]	Max [s]	Relative
`88e7ad0`	11.964 ± 0.078	11.840	12.059	1.00
`45f17e5`	66.915 ± 0.789	65.775	68.503	5.59 ± 0.08

After some more digging, the performance issue was specifically at line 1219 of Source/cmList.h.

return std::accumulate(
    std::next(std::begin(r)), std::end(r), cmList::ToString(*std::begin(r)),
    [&sep](std::string const& a,
            typename std::iterator_traits<decltype(std::begin(r))>::value_type const& b) -> std::string {
        return a + sep + cmList::ToString(b); // HERE.
    });

Can you see the hidden copy? Because we are passing a as a const& in the lambda, it ends up being copied at every iteration of std::accumulate. Specifically, the copy will be made for the operator+ operation.

I initially fixed this by changing the lambda to take a by non-const reference and using std::move, but this caused some build issues on certain platforms. The next easiest fix was to simply use a standard for loop. This version is also arguably clearer anyway. For reference, you find find the full change and discussion here: !8612.

std::string joined = cmList::ToString(*std::begin(r));
for (auto it = std::next(std::begin(r)); it != std::end(r); ++it) {
  joined += sep + cmList::ToString(*it);
}
return joined;

Comparing this commit (7ad290b) with its direct parent (fbea5d9) shows a return to normal.

Commit	Mean [s]	Min [s]	Max [s]	Relative
`fbea5d9`	65.630 ± 0.594	64.996	66.756	5.45 ± 0.06
`7ad290b`	12.034 ± 0.075	11.902	12.152	1.00

I think that this can be made even better with C++ ranges. More cycles can be saved by computing the final result string size, saving from having to re-allocate the string at every iteration, which is what I would assume a proper views::join implementation would do. I wrote a version below using range-v3. It is of course possible to write equivalent code without ranges, but I think it may harm readability too much for a small potential performance gain.

template <typename Container>
std::string join(Container items, std::string sep) {
    return items | ranges::views::transform([](auto e) { return std::to_string(e); })
           | ranges::views::intersperse(sep) | ranges::views::join | ranges::to<std::string>;
}