Why is processing an unsorted array the same speed as processing a sorted array with modern x86-64 clang?

I came across this popular ~9-year-old SO question and decided to double-check its results.

So, on an AMD Ryzen 9 5950X with clang++ 10 on Linux, I copy-pasted the code from the question, and here is what I got:

Sorted - 0.549702s

~/d/so_sorting_faster$ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
    std::sort(data, data + arraySize);
0.549702
sum = 314931600000

Unsorted - 0.546554s

~/d/so_sorting_faster $ cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
    // std::sort(data, data + arraySize);
0.546554
sum = 314931600000

I'm pretty sure the fact that the unsorted version came out 3 ms faster is just noise, but it seems it is no longer slower.

So, what has changed in CPU architecture (such that it is no longer an order of magnitude slower)?

Here are the results from multiple runs:

Unsorted: 0.543557 0.551147 0.541722 0.555599
Sorted:   0.542587 0.559719 0.53938  0.557909

Just in case, here is my main.cpp:

#include <algorithm>
#include <cstdlib>   // std::rand
#include <ctime>
#include <iostream>

int main()
{
    // Generate data
    const unsigned arraySize = 32768;
    int data[arraySize];

    for (unsigned c = 0; c < arraySize; ++c)
        data[c] = std::rand() % 256;

    // !!! With this, the next loop runs faster.
    // std::sort(data, data + arraySize);

    // Test
    clock_t start = clock();
    long long sum = 0;

    for (unsigned i = 0; i < 100000; ++i)
    {
        // Primary loop
        for (unsigned c = 0; c < arraySize; ++c)
        {
            if (data[c] >= 128)
                sum += data[c];
        }
    }

    double elapsedTime = static_cast<double>(clock() - start) / CLOCKS_PER_SEC;

    std::cout << elapsedTime << std::endl;
    std::cout << "sum = " << sum << std::endl;
    return 0;
}

Update

With more elements (627680):

Unsorted:
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
    // std::sort(data, data + arraySize);
10.3814

Sorted:
cat main.cpp | grep "std::sort" && clang++ -O3 main.cpp && ./a.out
    std::sort(data, data + arraySize);
10.6885

I think the question is still relevant - there is practically no difference.

Answer

Several of the answers in the question you link talk about rewriting the code to be branchless, thus avoiding any branch-prediction issues. That is what your updated compiler is doing.

Specifically, clang++ 10 with -O3 vectorizes the inner loop. See the code on Godbolt, lines 36-67 of the assembly. The code is a little complicated, but one thing you definitely won't see is any conditional branch on the data[c] >= 128 test. Instead it uses vector compare instructions (pcmpgtd) whose output is a mask with 1s for matching elements and 0s for non-matching ones. A subsequent pand with this mask replaces the non-matching elements with 0, so that they contribute nothing when unconditionally added to the sum.

The rough C++ equivalent is:

sum += data[c] & -(data[c] >= 128);

The code actually keeps two running 64-bit sums, for the even and odd elements of the array, so that they can be accumulated in parallel and then added together at the end of the loop.
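As a hedged illustration (this is plain scalar C++, not the compiler's actual vector output, and the helper name `branchless_sum` is invented here), the two-accumulator branchless strategy could be sketched like this:

```cpp
#include <cassert>
#include <cstddef>

// Sketch only: branchless sum with two independent 64-bit accumulators,
// one for even-indexed and one for odd-indexed elements, combined at the end.
// The mask -(x >= 128) is all-ones when the test passes and 0 otherwise.
long long branchless_sum(const int* data, std::size_t n)
{
    long long sum0 = 0, sum1 = 0;   // two independent dependency chains
    std::size_t c = 0;
    for (; c + 1 < n; c += 2)
    {
        sum0 += data[c]     & -(data[c]     >= 128);
        sum1 += data[c + 1] & -(data[c + 1] >= 128);
    }
    if (c < n)                      // odd-length tail element
        sum0 += data[c] & -(data[c] >= 128);
    return sum0 + sum1;
}
```

With two accumulators, the additions for even and odd elements form separate dependency chains, so a superscalar CPU can overlap them; the vectorized assembly achieves the same effect with two 64-bit lanes per register.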

Some of the extra complexity comes from sign-extending the 32-bit data elements to 64 bits; that's what sequences like pxor xmm5, xmm5 ; pcmpgtd xmm5, xmm4 ; punpckldq xmm4, xmm5 accomplish. Turn on -mavx2 and you'll see a simpler vpmovsxdq ymm5, xmm5 in its place.
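For illustration, that sign-extension idiom can be reproduced with baseline SSE2 intrinsics. This is a hedged sketch assuming an x86-64 target (the helper name `sext_low2` is invented here), not the compiler's exact code:

```cpp
#include <emmintrin.h>  // SSE2 intrinsics
#include <cstdint>

// Sketch: sign-extend the low two 32-bit lanes of v to two 64-bit lanes.
// _mm_cmpgt_epi32(0, v) plays the role of pxor + pcmpgtd: it yields all-ones
// in each lane where v is negative (i.e. the sign mask), 0 otherwise.
// _mm_unpacklo_epi32 (punpckldq) then interleaves value and sign dwords,
// producing two little-endian 64-bit sign-extended integers.
__m128i sext_low2(__m128i v)
{
    __m128i sign = _mm_cmpgt_epi32(_mm_setzero_si128(), v);
    return _mm_unpacklo_epi32(v, sign);
}
```

The -mavx2 vpmovsxdq instruction does this widening in a single step instead of the compare-and-interleave pair.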

The code also looks long because the loop has been unrolled, processing 8 elements of data per iteration.

  • Also note that clang unrolls small loops by default (unlike GCC); if you want to see the most simple version of how it's vectorizing, use `-fno-unroll-loops`.
    https://godbolt.org/z/z6WYG9. (I threw in `-march=nehalem` to enable SSE4 including `pmovsxdq` sign-extension to let it make the asm simpler than with manual sign extension. Strangely, even without it, it still only does 8 bytes at a time, not using `punpckldq` + `punpckhdq` to use low and high halves of a load + compare result. To be fair, sometimes GCC shoots itself in the foot by *not* using narrower loads when it has to widen :/)
  • Anyway, so yes the answer to this question is that it auto-vectorizes. As usual compilers don't pick perfect strategies. (Although GCC's might be optimal for SSE2 or SSE4.)
  • Also, it would probably be better for clang's strategy (with SSE4.2 from `-march=nehalem`) to use `pmovsxdq xmm, [mem]` loads and widen the compare to 64-bit, instead of widening the compare *result*. GCC does 16-byte loads like I mentioned in my first comment. With SSE4 it takes 2 shuffles to sign-extend the high two masked elements (still probably worth it), and without SSE4 it's pure win vs. clang to get twice as much work done with each pcmpgtd / pand on the initial data, and even the sign-extending can share some work between halves. https://godbolt.org/z/nWhz3n
  • Also related: [gcc optimization flag -O3 makes code slower than -O2](https://stackoverflow.com/q/28875325) for this same code where branchless (without vectorization) isn't profitable for sorted, and you need PGO (profile guided optimization) to get GCC to make the optimal choice to not do if-conversion, if you're using an old GCC, or compiling with `-fno-tree-vectorize`.
  • So... compilers have gotten better over the years 🙂
