C#中的硬件SIMD解析性能提升

html5 • 2022年12月5日 am7:35 • 问答

我已经实现了一种使用 .NET 中可用的 SIMD 内在函数解析长度 <= 8 的无符号整数字符串的方法，如下所示：

public unsafe static uint ParseUint(string text)
{
  fixed (char* c = text)
  {
    var parsed = Sse3.LoadDquVector128((byte*) c);
    var shift = (8 - text.Length) * 2;
    var shifted = Sse2.ShiftLeftLogical128BitLane(parsed, 
      (byte) (shift));

    Vector128<byte> digit0 = Vector128.Create((byte) '0');
    var reduced = Sse2.SubtractSaturate(shifted, digit0);

    var shortMult = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
    var collapsed2 = Sse2.MultiplyAddAdjacent(reduced.As<byte, short>(), shortMult);

    var repack = Sse41.PackUnsignedSaturate(collapsed2, collapsed2);
    var intMult = Vector128.Create((short)0, 0, 0, 0, 100, 1, 100, 1);
    var collapsed3 = Sse2.MultiplyAddAdjacent(repack.As<ushort,short>(), intMult);

    var e1 = collapsed3.GetElement(2);
    var e2 = collapsed3.GetElement(3);
    return (uint) (e1 * 10000 + e2);
  }
}

可悲的是，与基线的比较uint.Parse()给出了以下相当不起眼的结果：

方法	意思	错误	标准差
基线	15.157 纳秒	0.0325 纳秒	0.0304 纳秒
解析模拟	3.269 纳秒	0.0115 纳秒	0.0102 纳秒

回答

我做了一些优化。

public unsafe uint ParseUint2(string text)
{
    fixed (char* c = text)
    {
        Vector128<ushort> raw = Sse3.LoadDquVector128((ushort*)c);
        raw = Sse2.ShiftLeftLogical128BitLane(raw, (byte)(8 - text.Length << 1));
        Vector128<ushort> digit0 = Vector128.Create('0');
        raw = Sse2.SubtractSaturate(raw, digit0);
        Vector128<short> mul0 = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
        Vector128<int> res = Sse2.MultiplyAddAdjacent(raw.AsInt16(), mul0);
        Vector128<int> mul1 = Vector128.Create(1000000, 10000, 100, 1);
        res = Sse41.MultiplyLow(res, mul1);
        res = Ssse3.HorizontalAdd(res, res);
        res = Ssse3.HorizontalAdd(res, res);
        return (uint)res.GetElement(0);
    }
}

减少了类型转换和最终计算的数量vphaddd。结果它快了大约 10%。

但是...imm8必须是编译时常量。这意味着您不能使用变量 whereimm8是参数。否则 JIT 编译器不会产生操作的内在指令。它会call在这个地方创建一个外部方法（也许有一些解决方法）。感谢@PeterCordes 的帮助。

这个怪物并不显着，但比上面一个更快，无论text.Length.

public unsafe uint ParseUint3(string text)
{
    fixed (char* c = text)
    {
        Vector128<ushort> raw = Sse3.LoadDquVector128((ushort*)c);
        switch (text.Length)
        {
            case 0: raw = Vector128<ushort>.Zero; break;
            case 1: raw = Sse2.ShiftLeftLogical128BitLane(raw, 14); break;
            case 2: raw = Sse2.ShiftLeftLogical128BitLane(raw, 12); break;
            case 3: raw = Sse2.ShiftLeftLogical128BitLane(raw, 10); break;
            case 4: raw = Sse2.ShiftLeftLogical128BitLane(raw, 8); break;
            case 5: raw = Sse2.ShiftLeftLogical128BitLane(raw, 6); break;
            case 6: raw = Sse2.ShiftLeftLogical128BitLane(raw, 4); break;
            case 7: raw = Sse2.ShiftLeftLogical128BitLane(raw, 2); break;
        };
        Vector128<ushort> digit0 = Vector128.Create('0');
        raw = Sse2.SubtractSaturate(raw, digit0);
        Vector128<short> mul0 = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
        Vector128<int> res = Sse2.MultiplyAddAdjacent(raw.AsInt16(), mul0);
        Vector128<int> mul1 = Vector128.Create(1000000, 10000, 100, 1);
        res = Sse41.MultiplyLow(res, mul1);
        res = Ssse3.HorizontalAdd(res, res);
        res = Ssse3.HorizontalAdd(res, res);
        return (uint)res.GetElement(0);
    }
}

同样，@PeterCordes 不允许我编写缓慢的代码。以下版本有 2 个改进。现在加载的字符串已经移位，然后通过相同的偏移量减去移位的掩码。这避免ShiftLeftLogical128BitLane了可变计数的缓慢回退。
第二个改进是替换vphaddd用pshufd+ paddd。

// Note that this loads up to 14 bytes before the data part of the string.  (Or 16 for an empty string)
// This might or might not make it possible to read from an unmapped page and fault, beware.
public unsafe uint ParseUint4(string text)
{
    const string mask = "xffffxffffxffffxffffxffffxffffxffffxffff00000000";
    fixed (char* c = text, m = mask)
    {
        Vector128<ushort> raw = Sse3.LoadDquVector128((ushort*)c - 8 + text.Length);
        Vector128<ushort> mask0 = Sse3.LoadDquVector128((ushort*)m + text.Length);
        raw = Sse2.SubtractSaturate(raw, mask0);
        Vector128<short> mul0 = Vector128.Create(10, 1, 10, 1, 10, 1, 10, 1);
        Vector128<int> res = Sse2.MultiplyAddAdjacent(raw.AsInt16(), mul0);
        Vector128<int> mul1 = Vector128.Create(1000000, 10000, 100, 1);
        res = Sse41.MultiplyLow(res, mul1);
        Vector128<int> shuf = Sse2.Shuffle(res, 0x1b); // 0 1 2 3 => 3 2 1 0
        res = Sse2.Add(shuf, res);
        shuf = Sse2.Shuffle(res, 0x41); // 0 1 2 3 => 1 0 3 2
        res = Sse2.Add(shuf, res);
        return (uint)res.GetElement(0);
    }
}

~比初始解决方案快两倍。(o_O) 至少在我的 Haswell i7 上。

Instead of using a fixed `digit0` constant, load *that* from a sliding window of `0xffff` and `'0'` bytes. (I earlier wrote `0x20` but that was a mistake). Like in [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](https://stackoverflow.com/q/34306933) but instead of loading an AND mask, you load a constant for `psubusw` that will zero the elements you want (because saturating unsigned sub can do that when you subtract 0xffff)

以上是C#中的硬件SIMD解析性能提升的全部内容。

THE END

二维码

按整数列表中的最小值对Map键进行排序

< <上一篇

在调试器工具窗口中看不到协程选项卡

下一篇>>

搜索内容

C#中的硬件SIMD解析性能提升

回答

目录

目录

推荐文章

最新文章