Why is allocation using np.empty not O(1)?

The official NumPy documentation says the following:

Return a new array of given shape and type, without initializing entries.

For np.empty, this implies that the time taken to create (allocate) this array should be O(1), but some simple timeit tests show that this is not the case:

>>> timeit.timeit(lambda: np.empty(100000000 ), number=10000)
0.2733485999999914
>>> timeit.timeit(lambda: np.empty(1000000000), number=10000)
0.8293009999999867

As a side question, what values are present in an untouched np.empty array? They are all very small values, but I would expect them to just be whatever values happen to be sitting at that memory address. (Example array: np.empty(2) = array([-6.42940774e-036, 2.07409447e-117]). These don't look like something that would just be lying around in memory.)

Answer

First of all, I tried to reproduce this behaviour on my machine with various sizes. Here are the raw results:

np.empty(10**1)   # 421 ns ± 23.7 ns per loop    (on 7 runs, 1000000 loops each)
np.empty(10**2)   # 406 ns ± 1.44 ns per loop    (on 7 runs, 1000000 loops each)
np.empty(10**3)   # 471 ns ± 5.8 ns per loop     (on 7 runs, 1000000 loops each)
np.empty(10**4)   # 616 ns ± 1.56 ns per loop    (on 7 runs, 1000000 loops each)
np.empty(10**5)   # 620 ns ± 2.83 ns per loop    (on 7 runs, 1000000 loops each)
np.empty(10**6)   # 9.61 µs ± 34.2 ns per loop   (on 7 runs, 100000 loops each)
np.empty(10**7)   # 11.1 µs ± 17.6 ns per loop   (on 7 runs, 100000 loops each)
np.empty(10**8)   # 22.1 µs ± 173 ns per loop    (on 7 runs, 10000 loops each)
np.empty(10**9)   # 62.8 µs ± 220 ns per loop    (on 7 runs, 10000 loops each)
np.empty(10**10)  # => Memory Error

So you are right: this is not done in O(1) (at least not on my Windows machine, nor apparently on your system). Note that the values cannot be (eagerly) initialized in such a short time: writing 8 bytes × 10^9 items in 62.8 µs would require a RAM throughput of more than 127 TB/s, which my machine clearly does not have.

For np.empty, this implies that the time taken to create (allocate) this array should be O(1)

The hypothesis that allocations are done in O(1) is not totally true. To check that, I wrote a simple C program that runs a plain malloc+free loop and measured the timings. Here are the raw results:

./malloc.exe 10           # Average time:  41.815 ns (on 1 run, 1000000 loops each)
./malloc.exe 100          # Average time:  45.295 ns (on 1 run, 1000000 loops each)
./malloc.exe 1000         # Average time:  47.400 ns (on 1 run, 1000000 loops each)
./malloc.exe 10000        # Average time: 122.457 ns (on 1 run, 1000000 loops each)
./malloc.exe 100000       # Average time: 123.032 ns (on 1 run, 1000000 loops each)
./malloc.exe 1000000      # Average time:   8.351 us (on 1 run, 1000000 loops each)
./malloc.exe 10000000     # Average time:   9.342 us (on 1 run, 100000 loops each)
./malloc.exe 100000000    # Average time:  18.972 us (on 1 run, 10000 loops each)
./malloc.exe 1000000000   # Average time:  64.527 us (on 1 run, 10000 loops each)
./malloc.exe 10000000000  # => Memory error

As you can see, the results match those of Numpy (except for the small sizes, where the difference is due to the overhead of calling a Python function in CPython). Thus, the issue does not come from Numpy but from the allocation algorithm of the standard libc or from the OS itself.
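For reference, a minimal sketch of such a malloc+free timing loop might look like the following. This is not the author's exact ./malloc.exe program: the 8-byte element size (mirroring NumPy's default float64) and the POSIX clock_gettime timer are assumptions.

// malloc.c -- time a bare malloc+free of n 8-byte elements, without ever
// touching the memory, to mimic what np.empty does under the hood.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
    if (argc < 3) {
        fprintf(stderr, "usage: %s <num_elements> <loops>\n", argv[0]);
        return 1;
    }
    size_t n = strtoull(argv[1], NULL, 10);
    long loops = strtol(argv[2], NULL, 10);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < loops; i++) {
        void *p = malloc(n * sizeof(double));   // allocate only, never write to it
        if (p == NULL) {
            fprintf(stderr, "Memory error\n");
            return 1;
        }
        free(p);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double total_ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("Average time: %.3f ns per malloc+free\n", total_ns / loops);
    return 0;
}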

As a side question, what are the values present in an untouched np.empty array?

It is uninitialized data. In practice, it is often zero-initialized (but not always), because mainstream platforms sanitize freshly allocated memory for security reasons (so that sensitive data such as passwords previously stored in the memory of another process cannot leak). You should not rely on this.
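As an illustration only (reading uninitialized memory is undefined behavior in C, and the exact output depends on the compiler, libc and OS), a small sketch along these lines shows the two typical cases: a small block recycled from the libc pool may still contain leftover bytes, while a large block served by fresh OS pages usually reads back as zeros.

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    // Fill a small block with a recognizable value, then give it back to
    // the libc memory pool.
    double *a = malloc(4 * sizeof(double));
    for (int i = 0; i < 4; i++) a[i] = 123.456;
    free(a);

    // A new small allocation of the same size is often served from the same
    // pool slot: its bytes are whatever was left there (old values and/or
    // allocator bookkeeping data, reinterpreted here as doubles).
    double *b = malloc(4 * sizeof(double));
    for (int i = 0; i < 4; i++) printf("b[%d] = %g\n", i, b[i]);

    // A large allocation is usually served by fresh pages from the OS, which
    // mainstream kernels zero out for security, so it typically reads as 0.
    size_t n = 100 * 1000 * 1000;
    double *c = malloc(n * sizeof(double));
    if (c == NULL) return 1;
    printf("c[0] = %g, c[n-1] = %g\n", c[0], c[n - 1]);

    free(b);
    free(c);
    return 0;
}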


Deeper explanation of the malloc timings:

As you can see, there is a gap between the allocation of 100K items and 1M items. This can be explained by the use of a fast user-space allocator: when requests are small, the libc of most mainstream platforms does not directly request memory from the operating system (through kernel facilities such as sbrk or mmap on Unix and Linux systems). It rather uses fast pre-allocated local memory pools. In fact, on most mainstream platforms, multiple pools of different sizes are pre-allocated, and the libc chooses the "right one" depending on the requested size, hence the timing variation for small data sizes. Note that this is done to improve allocation speed while keeping memory fragmentation under control. This strategy is much faster because kernel calls (like mmap) are very expensive (they take at least several microseconds on my machine).
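On glibc, one way to observe this pool-versus-kernel boundary is to move the threshold above which malloc switches from its pools to direct mmap calls, and time the same request size under both settings. This is a glibc-specific sketch (mallopt and M_MMAP_THRESHOLD are glibc interfaces); the sizes and loop counts are arbitrary.

// Compare a pool-served allocation with the same allocation forced through mmap.
#include <malloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double avg_ns(size_t bytes, int loops) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < loops; i++) {
        void *p = malloc(bytes);
        free(p);
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / loops;
}

int main(void) {
    size_t bytes = 64 * 1024;  // 64 KB: below glibc's default 128 KB mmap threshold

    printf("pool path (default threshold): %.1f ns\n", avg_ns(bytes, 100000));

    // Force every request larger than 4 KB to be served by mmap/munmap instead.
    mallopt(M_MMAP_THRESHOLD, 4096);
    printf("mmap path (threshold = 4 KB) : %.1f ns\n", avg_ns(bytes, 100000));
    return 0;
}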

Moreover, most operating systems (OSs) have what amounts to multiple memory pools of their own. Linux, macOS and Windows split virtual memory into small pages (typically 4 KB). Since working with pages that small introduces a significant overhead when dealing with GBs or TBs of allocated data, these OSs also provide big pages, called super-pages or huge pages (typically 2 MB to a few GB). The path taken inside the OS can change depending on the amount of memory being allocated, and most OSs are optimized for allocating small chunks of virtual memory rather than big ones.
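On Linux, for example, a program can explicitly ask for such huge pages on a large anonymous allocation via madvise(MADV_HUGEPAGE). This is a Linux-specific sketch; whether the kernel actually honors the hint depends on the transparent-huge-page configuration of the system.

// Ask the kernel to back a large, 2 MB-aligned buffer with transparent huge pages.
#define _DEFAULT_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

int main(void) {
    size_t bytes = (size_t)1 << 30;                          // 1 GiB
    void *buf = NULL;
    if (posix_memalign(&buf, 2 * 1024 * 1024, bytes) != 0)   // align to the 2 MB huge-page size
        return 1;
    if (madvise(buf, bytes, MADV_HUGEPAGE) != 0)             // a hint only; it may be refused
        perror("madvise");
    printf("1 GiB buffer at %p (huge pages requested)\n", buf);
    free(buf);
    return 0;
}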

Note that the size of the data structures used to manage the system memory is often bounded by the size of the RAM, which is generally constant at runtime. Moreover, the complexity of the algorithms a given OS uses to manage memory fragmentation may theoretically be O(1) (or close to it). As a result, some people argue that allocating/freeing memory is done in constant time. But this is controversial, because one should take practical measurements into account and not just theoretical asymptotic bounds.


For more information, you can have a look at the following posts:

  • Time complexity of memory allocation
  • What is the time complexity of free?
  • Why does malloc initialize the values to 0 in gcc?
  • Can an O(n) algorithm ever exceed O(n^2) in terms of computation time?
  • Are there any cases where you would prefer a higher big-O time complexity algorithm over the lower one?
