Why can't GCC generate an optimal operator== for a struct of two int32s?

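A colleague showed me code that I thought wouldn't be necessary, but sure enough, it was. I would expect most compilers to treat all three of these attempts at an equality test as equivalent: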

#include <cstdint>
#include <cstring>

struct Point {
    std::int32_t x, y;
};

[[nodiscard]]
bool naiveEqual(const Point &a, const Point &b) {
    return a.x == b.x && a.y == b.y;
}

[[nodiscard]]
bool optimizedEqual(const Point &a, const Point &b) {
    // Why can't the compiler produce the same assembly in naiveEqual as it does here?
    std::uint64_t ai, bi;
    static_assert(sizeof(Point) == sizeof(ai));
    std::memcpy(&ai, &a, sizeof(Point));
    std::memcpy(&bi, &b, sizeof(Point));
    return ai == bi;
}

[[nodiscard]]
bool optimizedEqual2(const Point &a, const Point &b) {
    return std::memcmp(&a, &b, sizeof(a)) == 0;
}


[[nodiscard]]
bool naiveEqual1(const Point &a, const Point &b) {
    // Let's try avoiding any jumps by using bitwise and:
    return (a.x == b.x) & (a.y == b.y);
}

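But to my surprise, only the versions using `memcpy` or `memcmp` get turned into a single 64-bit compare by GCC. Why? (https://godbolt.org/z/aP1ocs)

Isn't it obvious to the optimizer that checking equality on two contiguous four-byte pairs is the same as comparing all eight bytes at once?

An attempt to avoid separately booleanizing the two parts compiles somewhat more efficiently (one fewer instruction and no false dependency on EDX), but it is still two separate 32-bit operations.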

bool bithackEqual(const Point &a, const Point &b) {
    // a^b == 0 only if they're equal
    return ((a.x ^ b.x) | (a.y ^ b.y)) == 0;
}

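GCC and Clang both have the same missed optimization when the structs are passed by value (so `a` is in RDI and `b` is in RSI, because that is how the x86-64 System V calling convention packs structs into registers): https://godbolt.org/z/v88a6s. The memcpy / memcmp versions both compile to `cmp rdi, rsi` / `sete al`, but the others do separate 32-bit operations.

Surprisingly, `struct alignas(uint64_t) Point` still helps in the by-value case where the arguments are in registers: it optimizes both naiveEqual versions for GCC, but not the bithack XOR/OR version (https://godbolt.org/z/ofGa1f). Does this give us any hints about GCC's internals? Alignment doesn't help Clang.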
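For readers who cannot open the Godbolt links, the by-value variants might look roughly like the sketch below. The `ByValue` function names are made up here for illustration and are not necessarily the names used on the linked pages; the bodies mirror the by-reference functions above.

```cpp
#include <cstdint>
#include <cstring>

struct Point {
    std::int32_t x, y;
};

// Hypothetical by-value counterpart of naiveEqual: on x86-64 System V
// each 8-byte Point argument arrives packed in one integer register.
bool naiveEqualByValue(Point a, Point b) {
    return a.x == b.x && a.y == b.y;
}

// Hypothetical by-value counterpart of optimizedEqual (memcpy into 64-bit integers).
bool memcpyEqualByValue(Point a, Point b) {
    std::uint64_t ai, bi;
    static_assert(sizeof(Point) == sizeof(ai), "Point must be exactly 8 bytes");
    std::memcpy(&ai, &a, sizeof(Point));
    std::memcpy(&bi, &b, sizeof(Point));
    return ai == bi;
}

// Hypothetical by-value counterpart of optimizedEqual2 (memcmp over the whole object).
bool memcmpEqualByValue(Point a, Point b) {
    return std::memcmp(&a, &b, sizeof(Point)) == 0;
}
```

All three return the same result for every input; the question is only about which of them the compiler reduces to a single 64-bit compare.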

Answer

If you "fix" the alignment, all give the same assembly language output (with GCC):

struct alignas(std::int64_t) Point {
    std::int32_t x, y;
};

Demo

As a note, one of the correct/legal ways to do certain things (such as type punning) is to use memcpy, so it seems logical that the compiler has specific optimizations for that function, or is more aggressive when it is used.
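A small sketch of what the `alignas` change does and does not alter, assuming a typical platform where `Point` has no padding. `AlignedPoint` and `alignedEqual` are illustrative names only; the answer itself simply adds `alignas` to `Point`.

```cpp
#include <cstdint>

struct Point {
    std::int32_t x, y;
};

// Same members, but the whole object is aligned like a 64-bit integer.
struct alignas(std::int64_t) AlignedPoint {
    std::int32_t x, y;
};

// The layout and size are unchanged; only the alignment guarantee differs.
static_assert(sizeof(AlignedPoint) == sizeof(Point),
              "alignas should not add padding here");
static_assert(alignof(AlignedPoint) == alignof(std::int64_t),
              "AlignedPoint is aligned like std::int64_t");

// Same source-level comparison as naiveEqual, on the over-aligned type.
bool alignedEqual(const AlignedPoint &a, const AlignedPoint &b) {
    return a.x == b.x && a.y == b.y;
}
```

The comparison logic is identical to `naiveEqual`, so any difference in the generated assembly comes from the alignment guarantee alone.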

  • This is an interesting observation, but I don't feel that it answers the "Why?" _Why are these valid, trivial, and equivalent functions producing different assembly?_
  • @AyxanHaqverdili: guaranteed alignment means the optimization is even more profitable: no chance of cache-line splits when using single 64-bit loads. This might make the optimizer try harder, or bump a heuristic past some threshold of profitability. But without knowing which, this answer is just a useful observation and a workaround, not a real answer.
  • So, why does the alignment matter here? Why can't the compiler do the optimization OP did manually?
  • So... why does the memcpy version not need alignment? The compiler sees through the memcpy in that it copies the unaligned structs to registers... is this a missing compiler optimization that the alignment somehow nudges?
  • But memcpy doesn't assume alignment... so the optimizedEqual doesn't assume that Point is overaligned
  • You can just write `alignas(std::uint64_t)`.
  • @Ben Whether it is more expensive to load a 4byte aligned or 8 byte aligned block of 8 bytes is processor dependent, but the general answer is yes. On x86/x64, unaligned reads are allowed to be slower (basically turning them into a pair of reads). On some architectures (like 68000 if I recall), unaligned reads are illegal, generating a "bus exception."

Answer

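There is a performance cliff you can fall off when this is implemented as a single 64-bit comparison:

You break store-to-load forwarding.

If the 32-bit numbers in the struct are written to memory by separate store instructions and then loaded back with a 64-bit load instruction soon afterwards (before the stores reach L1$), execution will stall until the stores commit to the globally visible, cache-coherent L1$. If the loads are 32-bit loads that match the earlier 32-bit stores, modern CPUs avoid the store-load stall by forwarding the stored value to the load instruction before the store reaches the cache. This violates sequential consistency if multiple CPUs access the memory (a CPU sees its own stores in a different order than other CPUs do), but it is allowed by most modern CPU architectures, even x86. The forwarding also allows much more code to be executed completely speculatively, because if the execution has to be rolled back, ...

If you want it to use 64-bit operations and you don't want this performance cliff, you may want to ensure that the struct is also always written as a single 64-bit number.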
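The access pattern described above can be sketched roughly as follows. Both function names are hypothetical, and whether a compiler really emits two 32-bit stores followed by one 64-bit load, or a single 64-bit store, depends on the compiler, the flags, and inlining, so treat this as an illustration of the source-level shape rather than a guaranteed code-generation result.

```cpp
#include <cstdint>
#include <cstring>

struct Point {
    std::int32_t x, y;
};

// Hypothetical example of the problematic shape: two narrow stores to the
// struct, then one wide load that spans both of them.
std::uint64_t writeHalvesThenLoadWide(Point &p, std::int32_t x, std::int32_t y) {
    p.x = x;  // 32-bit store
    p.y = y;  // 32-bit store
    std::uint64_t bits;
    std::memcpy(&bits, &p, sizeof bits);  // 64-bit load covering both stores
    return bits;
}

// Hypothetical helper following the answer's suggestion: build the 64-bit
// image first, then write the struct with one 8-byte copy.
void storeWide(Point &p, std::int32_t x, std::int32_t y) {
    static_assert(sizeof(Point) == sizeof(std::uint64_t), "Point must be 8 bytes");
    const Point tmp{x, y};
    std::uint64_t bits;
    std::memcpy(&bits, &tmp, sizeof bits);
    std::memcpy(&p, &bits, sizeof bits);  // intended as a single 64-bit store
}
```

Both functions leave `p` holding the same bytes; they differ only in the width of the memory operations the source asks for.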

  • *your execution will stall until the stores commit to globally visible cache coherent L1$* - Not quite. There's evidence that a Store-forwarding stall on modern x86 CPUs doesn't have to wait for commit, it just has to do a slower more complete scan of the store buffer, possibly also merging with data from L1d. [Can modern x86 implementations store-forward from more than one prior store?](https://stackoverflow.com/a/46145326) has some more detail on that evidence. It's also not a pipeline stall, OoO exec may be able to hide the latency. But yes, good point, usually something to avoid.
  • But IIRC, I've been told by GCC devs that GCC doesn't know anything about store-forwarding stalls and doesn't actively try to avoid them. (Devs do, so that doesn't rule out tuning some heuristics for cost/benefit of doing wider loads, though.)

Answer

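Why can't the compiler produce [the same assembly as the memcpy version]?

The compiler "could", in the sense that it would be allowed to.

The compiler simply doesn't. Why it doesn't is beyond my knowledge, since that requires deep knowledge of how the optimizer is implemented. The answer may range from "there is no logic covering such a transformation" to "the rules aren't tuned to assume that one output is faster than the other" on all target CPUs.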

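If you use Clang instead of GCC, you'll notice that it produces the same output for `naiveEqual` and `naiveEqual1`, and that this assembly has no jump. It is the same as for the "optimized" version, except that it uses two 32-bit instructions in place of one 64-bit instruction. Furthermore, restricting the alignment of `Point` as shown in Jarod42's answer has no effect on the optimizer.

MSVC behaves like Clang in that it is unaffected by the alignment, but differs in that it doesn't get rid of the jump in `naiveEqual`.

For what it's worth, the compilers (I checked GCC and Clang) produce essentially the same output for the C++20 defaulted comparison as they do for `naiveEqual`. For whatever reason, GCC opted to use `jne` instead of `je` for the jump.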
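For reference, a C++20 defaulted comparison for this struct could be written as in the sketch below. The preprocessor fallback is my addition so that the snippet also compiles before C++20; it is not part of what was tested above.

```cpp
#include <cstdint>

struct Point {
    std::int32_t x, y;
#if defined(__cpp_impl_three_way_comparison)
    // C++20: compiler-generated equality, defined as member-wise comparison
    // in declaration order (x first, then y).
    bool operator==(const Point &) const = default;
#else
    // Pre-C++20 fallback with the same meaning, written out by hand.
    bool operator==(const Point &other) const {
        return x == other.x && y == other.y;
    }
#endif
};
```

With this in place, `a == b` on two `Point` objects means the same thing as `naiveEqual(a, b)`.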

Is this a missed compiler optimization?

Assuming that one is always faster than the other on the target CPU, that would be a fair conclusion.

  • clang with `-march=tigerlake` uses SSE.
  • Also interesting: When I replace my `Point` with `std::tuple<std::int32_t, std::int32_t>` or `std::pair<std::int32_t, std::int32_t>` I get the same behavior... but `std::array<std::int32_t, 2>` is a single compare even though all three are (usually, I expect!) the same bits in memory with the same alignment.
  • @Ben gcc does that array optimization, but clang doesn't...
  • @supercat: As I [commented](https://stackoverflow.com/questions/66263263/why-cant-gcc-generate-an-optimal-operator-for-a-struct-of-two-int32s/66279126?noredirect=1#comment117192100_66263393) in that thread, that's incorrect. C structs are all-or-nothing, unlike separate indexes relative to a pointer. Accessing `a.x` guarantees that `a.y` is accessible.
  • @supercat: How is there any problem here? If the first 32 bits don't match, the `==` compare will be false no matter what garbage you read in the 2nd 32 bits. x86 doesn't have hardware race detection so it won't fault. Or are you talking about hypothetical badness on other ISAs, from GCC's target-independent optimizations doing this without properly checking that the target can't do race detection? Does GCC support any targets with HW race detection?
  • @supercat: I'm aware of that argument / position you take on stuff like strict aliasing or signed-overflow. But I don't see how it applies here. You think it should be possible for a `struct Point &` reference to be to an object that's only half present, i.e. extending into an unmapped page? (Or are you still talking about data races?) That unmapped page argument makes some sense for a `struct Point *` pointer (although it's not in practice how C works), but the caller would have to have a pointer pointing at a partially-present struct, and do `foo(*p)` which looks like the whole thing.
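The `std::pair` / `std::tuple` / `std::array` variation mentioned in the comments can be tried with a sketch like the one below. The alias and function names are illustrative only; paste the functions into a compiler explorer to compare the generated assembly for your compiler and flags.

```cpp
#include <array>
#include <cstdint>
#include <tuple>
#include <utility>

// Illustrative aliases for the three standard-library stand-ins for Point.
using PairPoint = std::pair<std::int32_t, std::int32_t>;
using TuplePoint = std::tuple<std::int32_t, std::int32_t>;
using ArrayPoint = std::array<std::int32_t, 2>;

// Each function just forwards to the library's own operator==.
bool pairEqual(const PairPoint &a, const PairPoint &b) { return a == b; }
bool tupleEqual(const TuplePoint &a, const TuplePoint &b) { return a == b; }
bool arrayEqual(const ArrayPoint &a, const ArrayPoint &b) { return a == b; }
```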
