Julia的广播速度是Matlab的两倍

html5 • 2022年9月12日 pm1:29 • 问答

我试图让自己熟悉 Julia 以便从 Matlab 迁移，到目前为止一切顺利，直到我开始使用广播来移植一个特定函数，该函数的执行速度或多或少是 Matlab 的两倍。

function features(X::Vector{Float64},M::Int,hyper::Float64,mid::Float64)
    X = X.-mid
    H = 4.0.*hyper.+maximum(abs.(X))
    X = (X.+H)./(2.0.*H)
    w = transpose(1:M)
    S = (sqrt.(2.0.*pi).*hyper).*exp.((-0.5.*hyper.^2).*((pi.*w./(2.0.*H)).^2))
    f = H.^(-0.5).*sin.(pi.*X.*w).*sqrt.(S)
end

任何帮助，将不胜感激！

回答

首先，您对广播的使用不是最佳的。您使用它太多了，而且还不够 😉

其次，几乎所有的运行时间 (99.9%) 都发生在广播sin表达式中，所以应该把精力集中在那里。

第三，在这种情况下，您真的不应该期望 Julia 的表现优于 Matlab。这正是 Matlab 的优化目标：直接按元素调用优化的 C/Fortran 例程。此外，Matlab 默认是多线程的，隐式地并行运行元素调用，而 Julia 要求您明确多线程。

就目前而言，2 倍的差异似乎并非不合理。

尽管如此，让我们努力吧。这里先提几点意见：

X = X .- mid

您错过了就地操作，请使用

X .= X .- mid

反而。这节省了中间数组的分配。

H = 4.0.*hyper.+maximum(abs.(X))

在标量 ( hyper) 上广播是徒劳的，最坏的情况是浪费。并abs.(X)创建一个不必要的临时数组。而是使用maximum带有函数输入的版本，这样效率更高：

H = 4 * hyper + maximum(abs, X)

这里还有一些不必要的点：

S = (sqrt.(2.0.*pi).*hyper).*exp.((-0.5.*hyper.^2).*((pi.*w./(2.0.*H)).^2))

避免再次通过标量广播并在大多数地方使用整数而不是浮点数：

S = (sqrt(2pi) * hyper) .* exp.((-0.5 * hyper^2 * (pi/2H)^2) .* w.^2)

请注意，x^(-0.5)它比慢得多1/sqrt(x)，所以

f = H.^(-0.5).*sin.(pi.*X.*w).*sqrt.(S)

应该

f = sin.(pi .* X .* w') .* (sqrt.(S)' ./ sqrt(H))

让我们把它放在一起：

function features2(X::Vector{Float64},M::Int,hyper::Float64,mid::Float64)
    X .= X .- mid
    H = 4 * hyper + maximum(abs, X)
    X .= (X .+ H) ./ (2 * H)
    w = 1:M
    S = (sqrt(2pi) * hyper) .* exp.((-0.5 * hyper^2 * (pi/2H)^2) .* w.^2)
    f = sin.(pi .* X .* w') .* (sqrt.(S)' ./ sqrt(H))
    return f
end

基准：

jl> X = rand(10000);

jl> M = 100;

jl> hyper = rand();

jl> mid = 0.4;

jl> @btime features($X, $M, $hyper, $mid);
  17.339 ms (9 allocations: 7.86 MiB)

jl> @btime features2($X, $M, $hyper, $mid);
  17.173 ms (4 allocations: 7.63 MiB)

这并不是一个很大的加速。不过，分配较少。问题是运行时间在很大程度上受sin广播支配。

让我们尝试多线程。我有 8 个内核，所以我使用了 8 个线程：

function features3(X::Vector{Float64},M::Int,hyper::Float64,mid::Float64)
    X .= X .- mid
    H = 4 * hyper + maximum(abs, X)
    X .= (X .+ H) ./ (2 * H)
    w = transpose(1:M)
    S = (sqrt(2pi) * hyper) .* exp.((-0.5 * hyper^2 * (pi/2H)^2) .* w.^2)
    f = similar(X, length(X), M)
    temp = sqrt.(S) ./ sqrt(H)
    Threads.@threads for j in axes(f, 2)
        wj = w[j]
        tempj = temp[j]
        for i in axes(f, 1)
            @inbounds f[i, j] = tempj * sin(pi * X[i] * w[j])
        end
    end
    return f
end

基准：

jl> @btime features3($X, $M, $hyper, $mid);
  1.919 ms (45 allocations: 7.63 MiB)

这要好得多，使用循环和显式线程的速度提高了 9 倍。

但仍有一些选项可供选择：例如 LoopVectorization.jl。您可以安装这个惊人的软件包，但是您需要一个新版本，可能存在一些安装问题，具体取决于您拥有的其他软件包。LoopVectorization具有特别重要的两个宏，@avx并且@avxt，前者做了很多工作，向量化（在SIMD意义上的）你的代码，单线程的，而后者则是相同的，但多线程。

using LoopVectorization

function features4(X::Vector{Float64},M::Int,hyper::Float64,mid::Float64)
    X .= X .- mid
    H = 4 * hyper + maximum(abs, X)
    X .= (X .+ H) ./ (2 * H)
    w = collect(1:M)  # I have to use collect here due to some issue with LoopVectorization
    S = (sqrt(2pi) * hyper) .* exp.((-0.5 * hyper^2 * (pi/2H)^2) .* w.^2)
    f = @avx sin.(pi .* X .* w') .* (sqrt.(S)' ./ sqrt(H))
    return f
end

function features4t(X::Vector{Float64},M::Int,hyper::Float64,mid::Float64)
    X .= X .- mid
    H = 4 * hyper + maximum(abs, X)
    X .= (X .+ H) ./ (2 * H)
    w = collect(1:M)  # I have to use collect here due to some issue with LoopVectorization
    S = (sqrt(2pi) * hyper) .* exp.((-0.5 * hyper^2 * (pi/2H)^2) .* w.^2)
    f = @avxt sin.(pi .* X .* w') .* (sqrt.(S)' ./ sqrt(H))
    return f
end

这些函数之间的唯一区别是@avxvs @avxt。

基准：

jl> @btime features4($X, $M, $hyper, $mid);
  2.695 ms (5 allocations: 7.63 MiB)

对于单线程情况来说，这是一个非常好的加速。

jl> @btime features4t($X, $M, $hyper, $mid);
  431.700 ?s (5 allocations: 7.63 MiB)

多线程 avx 代码的速度是我笔记本电脑上原始代码的 40 倍。不错？

The performance should be the same, so I suspect that there is something wrong with the benchmark in that case. The reason for using integers is two-fold: it's more readable and brief. And it does not inadvertently change the types (as much). For example `2*x` is an integer if x is an integer, a single precision float if `x`, and so on. While `2.0*x` turns the result into a double precision float in those cases. Integers are less disruptive.
@DNF Currently `sinpi` is slower, but it should be as fast since it has a much simpler reduction. Currently it uses `sin_kernel` and `cos_kernel` to save code, but if it had it's own kernels, than I'm pretty sure it would be faster. I'll hopefully have a PR that fixes this soon.

以上是Julia的广播速度是Matlab的两倍的全部内容。

THE END

二维码

如何解决TypeError:path.splitisnotafunction

< <上一篇

没有构造函数“Class:Class”的实例与参数列表匹配

下一篇>>

搜索内容

Julia的广播速度是Matlab的两倍

回答

目录

目录

推荐文章

最新文章