Fastest way to get the minimum of a data array within the bins of another paired array

html5 • September 19, 2022, 2:32 pm • Q&A

I have three 1D arrays:

idxs: the index data
weights: the weight of each index in idxs
bins: the bins within which the minimum weight is calculated.

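Here is my current method. It uses idxs to determine which bin each entry of weights falls into, then calculates the min/max of the binned weights:

Get slices, which records the bin that each element of idxs belongs to.
Sort slices and weights together.
Calculate the minimum of weights in each bin (slice).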

numpy method

import random
import numpy as np

# create example data
out_size = int(10)
bins = np.arange(3, out_size-3)
idxs = np.arange(0, out_size)
#random.shuffle(idxs)

# set duplicated slice manually for test
idxs[4] = idxs[3]
idxs[6] = idxs[7]

weights = idxs

# get which bin idxs belong to
slices = np.digitize(idxs, bins)

# get index and weights in bins
valid = (bins.max() >= idxs) & (idxs >= bins.min())
valid_slices = slices[valid]
valid_weights = weights[valid]

# sort slice and weights
sort_index = valid_slices.argsort()
valid_slices_sort = valid_slices[sort_index]
valid_weights_sort = valid_weights[sort_index]

# get the index of the first occurrence of each unique slice
unique_slices, unique_index = np.unique(valid_slices_sort, return_index=True)
# calculate the minimum
res_sub = np.minimum.reduceat(valid_weights_sort, unique_index)

# save results
res = np.full((out_size), np.nan)
res[unique_slices-1] = res_sub

print(res)

Result:

array([ 3., nan,  5., nan, nan, nan, nan, nan, nan, nan])

If I increase out_size to 1e7 and shuffle the data, the run time (from np.digitize to the end) is slow:

13.5 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Here is the time consumed by each part:

np.digitize: 10.8 s ± 12.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
valid: 171 ms ± 3.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
argsort and slice: 2.02 s ± 33.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
unique: 9.9 ms ± 113 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.minimum.reduceat: 5.11 ms ± 52.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

np.digitize() takes most of the time: 10.8 s. The argsort step comes next: 2.02 s.
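Since the two expensive steps are the bin lookup and the sort, one variant worth timing is sketched below. It is only a sketch under assumptions, not a measured improvement: for monotonically increasing bins, np.searchsorted(bins, x, side="right") returns the same indices as np.digitize(x, bins), and np.minimum.at accumulates a per-slot minimum without sorting. The function name binned_min is hypothetical, and the sketch assumes no real weight equals +inf, because +inf marks empty slots.

```python
import numpy as np

def binned_min(idxs, weights, bins, out_size):
    """Minimum of weights per bin, using the same result layout as the question."""
    # Keep only values inside [bins.min(), bins.max()], as in the original code.
    valid = (idxs >= bins.min()) & (idxs <= bins.max())
    # For increasing bins this matches np.digitize(idxs, bins) with right=False.
    slices = np.searchsorted(bins, idxs[valid], side="right")
    # Unbuffered minimum: each slot keeps the smallest weight written to it.
    res = np.full(out_size, np.inf)
    np.minimum.at(res, slices - 1, weights[valid].astype(float))
    # Slots that never received a value are reported as NaN (assumes no real +inf weights).
    res[np.isinf(res)] = np.nan
    return res

# Same small example as in the question.
idxs = np.array([0, 1, 2, 3, 3, 5, 7, 7, 8, 9])
weights = idxs.copy()
bins = np.arange(3, 7)
res = binned_min(idxs, weights, bins, out_size=10)
print(res)
```

Whether this is actually faster depends on the NumPy version and the data, so it should be timed on the real 1e7-element input before being relied on.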

I also checked the time needed to compute the mean using np.histogram:

counts, _ = np.histogram(idxs, bins=out_size, range=(0, out_size))
sums, _ = np.histogram(idxs, bins=out_size, range=(0, out_size), weights = weights, density=False)
mean = sums / np.where(counts == 0, np.nan, counts)

33.2 s ± 3.47 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

This is similar to my method for computing the minimum.
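For the mean specifically, a possible alternative to the two np.histogram calls is np.bincount. This is a sketch that assumes idxs holds non-negative integers smaller than out_size, so that each unit-width histogram bin corresponds to exactly one integer value. That assumption holds for the example data but is not confirmed for the real data.

```python
import numpy as np

out_size = 10
idxs = np.array([0, 1, 2, 3, 3, 5, 7, 7, 8, 9])
weights = idxs.astype(float)

# Assumes idxs are integers in [0, out_size), so each value is its own bin.
counts = np.bincount(idxs, minlength=out_size)
sums = np.bincount(idxs, weights=weights, minlength=out_size)
# Empty bins become NaN instead of triggering a division by zero.
mean = sums / np.where(counts == 0, np.nan, counts)
print(mean)
```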

scipy method

from scipy.stats import binned_statistic
statistics, _, _ = binned_statistic(idxs, weights, statistic='min', bins=bins)

print(statistics)

The result is slightly different, and for the longer (1e7) shuffled data this is much slower (x6):

array([ 3., nan,  5.])

1min 20s ± 6.93 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Summary

I want to find a faster method. It would be great if the method also worked with dask!
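On the dask point: the minimum is associative and commutative, so per-chunk partial results can be merged elementwise. The sketch below shows that chunk-then-combine shape in plain NumPy only. It calls no dask API, the helper name chunk_min and the chunk count are hypothetical, and it assumes the bin slot of each element has already been computed.

```python
import numpy as np

def chunk_min(slots, weights, n_slots):
    """Per-slot minimum for one chunk; untouched slots stay at +inf."""
    out = np.full(n_slots, np.inf)
    np.minimum.at(out, slots, weights)
    return out

n_slots = 10
slots = np.array([0, 1, 2, 3, 3, 5, 7, 7, 8, 9])
weights = np.array([4.0, 1.0, 2.0, 9.0, 3.0, 5.0, 8.0, 7.0, 6.0, 0.5])

# Split into chunks (hypothetical chunk count), reduce each one, then combine.
parts = [chunk_min(s, w, n_slots)
         for s, w in zip(np.array_split(slots, 3), np.array_split(weights, 3))]
combined = np.minimum.reduce(parts)

# The combined result matches a single pass over all the data.
full = chunk_min(slots, weights, n_slots)
print(np.array_equal(combined, full))
```

Because each chunk yields a fixed-size array, the partial results stay small regardless of how large the input is.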

Use case

Here is what my real data (1D) looks like:
