Is a logical/arithmetic shift by fewer bits faster?

html5 • September 13, 2022, 2:01 pm • Q&A

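Is `x >> 2` faster than `x >> 31`? In other words, is `sar x, 2` faster than `sar x, 31`? I ran some simple tests, and they seem to run at the same speed. I would appreciate any solid evidence.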

Answer

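It depends on the hardware implementation. For common operations that involve a constant shift (such as pointer arithmetic), there may be a faster path (for example, the shift may be fused with the associated add). For variable shifts, a barrel shifter circuit is used, in which every shift amount has the same latency.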
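To illustrate why a barrel shifter has count-independent latency, here is a minimal software model in C. It is only a sketch: the function name `barrel_shr` is hypothetical, it models a logical right shift of a 32-bit value, and real hardware does this with multiplexers rather than a loop. The point is that the data always passes through the same five stages (shift by 1, 2, 4, 8, 16), and each bit of the count merely selects whether a stage shifts or passes its input through.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software model of a 32-bit logical-right barrel shifter.
   There is one stage per bit of the 5-bit shift count, so the amount of
   work is the same whether the count is 2 or 31. */
uint32_t barrel_shr(uint32_t x, unsigned count)
{
    assert(count < 32); /* this model only handles 5-bit counts */

    for (unsigned stage = 0; stage < 5; stage++) {
        unsigned amount = 1u << stage;        /* 1, 2, 4, 8, 16 */
        uint32_t shifted = x >> amount;       /* this stage's shifted input */

        /* 2:1 mux: all-ones mask if bit `stage` of count is set, else zero */
        uint32_t select = 0u - ((count >> stage) & 1u);
        x = (shifted & select) | (x & ~select);
    }
    return x;
}
```

For any count from 0 to 31, `barrel_shr(x, count)` should give the same result as `x >> count`. An arithmetic shift (`sar`) would have the same structure, with each stage filling the vacated bits with copies of the sign bit instead of zeros.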

https://uops.info/ and https://agner.org/optimize/ have numbers for actual x86 CPU instructions. Pentium 4 notoriously had slow shifts, but still with fixed latency/throughput (not data-dependent). Most CPUs have 1-cycle latency for any shift count. (On modern Intel CPUs, shifts by a compile-time constant are cheap, but when the count is a runtime variable, `shr reg, cl` decodes to 3 uops [because of x86 legacy baggage with not updating FLAGS if the count was 0](https://stackoverflow.com/a/36510865/224132), unless you let the compiler use BMI2 `shlx` / `shrx`. Still, latency is only 1 cycle.)
Intel as early as 386SX had a barrel shifter: https://media.digikey.com/pdf/Data%20Sheets/Intel%20PDFs/Intel386%20SX.pdf#page=84 lists cycle counts for shift/rotate of a register as 3 cycles for shift-by-1, shift-by-CL, or shift-by-immediate. (vs. 2 cycles for an instruction like `add reg,reg`). The last Intel x86 to have shift performance that depended on the count seems to be 286: https://www2.math.uni-wuppertal.de/~fpf/Uebungen/GdR-SS02/opcode_i.html has a table for 8088 .. Pentium. **8088 was 8 + 4n, 186 and 286 were 5 + n. 386 was a fixed 3 cycles**.
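To put the question's two counts side by side, the cycle formulas quoted above can be written as a small C function. This is a sketch that takes those table figures at face value. The names `cpu_model` and `shift_cycles` are hypothetical, and limiting the count to 0..31 is a simplifying assumption of the model.

```c
#include <assert.h>

/* Hypothetical labels for the CPU generations named in the table above. */
enum cpu_model { CPU_8088, CPU_286, CPU_386 };

/* Cycles for shifting a register by n bits, per the quoted formulas:
   8088: 8 + 4n, 186/286: 5 + n, 386: fixed 3. */
unsigned shift_cycles(enum cpu_model cpu, unsigned n)
{
    assert(n < 32); /* assumption: the model only covers 5-bit counts */

    switch (cpu) {
    case CPU_8088: return 8 + 4 * n; /* cost grows by 4 cycles per bit */
    case CPU_286:  return 5 + n;     /* cost grows by 1 cycle per bit */
    case CPU_386:  return 3;         /* barrel shifter: count-independent */
    }
    return 0; /* unreachable for valid enum values */
}
```

With these formulas, a shift by 2 versus a shift by 31 costs 16 versus 132 cycles on the 8088 and 7 versus 36 cycles on the 286, but 3 cycles either way on the 386.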
