C#中的硬件SIMD解析性能提升
c#
我已经实现了一种使用 .NET 中可用的 SIMD 内在函数解析长度 <= 8 的无符号整数字符串的方法,如下所示:
public unsafe static uint ParseUint(string text)
{
fixed (char* c = text)
{
var parsed = Sse3.LoadDquVector128((byte*) c);
var shift = (8 - text.Length) * 2;
var shifted = Sse2.ShiftLeftLogical128BitLane(parsed,
(byte) (shift));
Vector128<byte> digit0 = Vector128.Create((byte) '0');
var reduced = Sse2.SubtractSaturate(shifted, digit0);
var shortMult = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
var collapsed2 = Sse2.MultiplyAddAdjacent(reduced.As<byte, short>(), shortMult);
var repack = Sse41.PackUnsignedSaturate(collapsed2, collapsed2);
var intMult = Vector128.Create((short)0, 0, 0, 0, 100, 1, 100, 1);
var collapsed3 = Sse2.MultiplyAddAdjacent(repack.As<ushort,short>(), intMult);
var e1 = collapsed3.GetElement(2);
var e2 = collapsed3.GetElement(3);
return (uint) (e1 * 10000 + e2);
}
}
可悲的是,与基线的比较uint.Parse()给出了以下相当不起眼的结果:
| 方法 | 意思 | 错误 | 标准差 |
|---|---|---|---|
| 基线 | 15.157 纳秒 | 0.0325 纳秒 | 0.0304 纳秒 |
| 解析模拟 | 3.269 纳秒 | 0.0115 纳秒 | 0.0102 纳秒 |
回答
我做了一些优化。
public unsafe uint ParseUint2(string text)
{
fixed (char* c = text)
{
Vector128<ushort> raw = Sse3.LoadDquVector128((ushort*)c);
raw = Sse2.ShiftLeftLogical128BitLane(raw, (byte)(8 - text.Length << 1));
Vector128<ushort> digit0 = Vector128.Create('0');
raw = Sse2.SubtractSaturate(raw, digit0);
Vector128<short> mul0 = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
Vector128<int> res = Sse2.MultiplyAddAdjacent(raw.AsInt16(), mul0);
Vector128<int> mul1 = Vector128.Create(1000000, 10000, 100, 1);
res = Sse41.MultiplyLow(res, mul1);
res = Ssse3.HorizontalAdd(res, res);
res = Ssse3.HorizontalAdd(res, res);
return (uint)res.GetElement(0);
}
}
减少了类型转换和最终计算的数量vphaddd。结果它快了大约 10%。
但是...imm8必须是编译时常量。这意味着您不能使用变量 whereimm8是参数。否则 JIT 编译器不会产生操作的内在指令。它会call在这个地方创建一个外部方法(也许有一些解决方法)。感谢@PeterCordes 的帮助。
这个怪物并不显着,但比上面一个更快,无论text.Length.
public unsafe uint ParseUint3(string text)
{
fixed (char* c = text)
{
Vector128<ushort> raw = Sse3.LoadDquVector128((ushort*)c);
switch (text.Length)
{
case 0: raw = Vector128<ushort>.Zero; break;
case 1: raw = Sse2.ShiftLeftLogical128BitLane(raw, 14); break;
case 2: raw = Sse2.ShiftLeftLogical128BitLane(raw, 12); break;
case 3: raw = Sse2.ShiftLeftLogical128BitLane(raw, 10); break;
case 4: raw = Sse2.ShiftLeftLogical128BitLane(raw, 8); break;
case 5: raw = Sse2.ShiftLeftLogical128BitLane(raw, 6); break;
case 6: raw = Sse2.ShiftLeftLogical128BitLane(raw, 4); break;
case 7: raw = Sse2.ShiftLeftLogical128BitLane(raw, 2); break;
};
Vector128<ushort> digit0 = Vector128.Create('0');
raw = Sse2.SubtractSaturate(raw, digit0);
Vector128<short> mul0 = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
Vector128<int> res = Sse2.MultiplyAddAdjacent(raw.AsInt16(), mul0);
Vector128<int> mul1 = Vector128.Create(1000000, 10000, 100, 1);
res = Sse41.MultiplyLow(res, mul1);
res = Ssse3.HorizontalAdd(res, res);
res = Ssse3.HorizontalAdd(res, res);
return (uint)res.GetElement(0);
}
}
同样,@PeterCordes 不允许我编写缓慢的代码。以下版本有 2 个改进。现在加载的字符串已经移位,然后通过相同的偏移量减去移位的掩码。这避免ShiftLeftLogical128BitLane了可变计数的缓慢回退。
第二个改进是替换vphaddd用pshufd+ paddd。
// Note that this loads up to 14 bytes before the data part of the string. (Or 16 for an empty string)
// This might or might not make it possible to read from an unmapped page and fault, beware.
public unsafe uint ParseUint4(string text)
{
const string mask = "xffffxffffxffffxffffxffffxffffxffffxffff00000000";
fixed (char* c = text, m = mask)
{
Vector128<ushort> raw = Sse3.LoadDquVector128((ushort*)c - 8 + text.Length);
Vector128<ushort> mask0 = Sse3.LoadDquVector128((ushort*)m + text.Length);
raw = Sse2.SubtractSaturate(raw, mask0);
Vector128<short> mul0 = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
Vector128<int> res = Sse2.MultiplyAddAdjacent(raw.AsInt16(), mul0);
Vector128<int> mul1 = Vector128.Create(1000000, 10000, 100, 1);
res = Sse41.MultiplyLow(res, mul1);
Vector128<int> shuf = Sse2.Shuffle(res, 0x1b); // 0 1 2 3 => 3 2 1 0
res = Sse2.Add(shuf, res);
shuf = Sse2.Shuffle(res, 0x41); // 0 1 2 3 => 1 0 3 2
res = Sse2.Add(shuf, res);
return (uint)res.GetElement(0);
}
}
~比初始解决方案快两倍。(o_O) 至少在我的 Haswell i7 上。
- Instead of using a fixed `digit0` constant, load *that* from a sliding window of `0xffff` and `'0'` bytes. (I earlier wrote `0x20` but that was a mistake). Like in [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](https://stackoverflow.com/q/34306933) but instead of loading an AND mask, you load a constant for `psubusw` that will zero the elements you want (because saturating unsigned sub can do that when you subtract 0xffff)