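How can I quickly get the last line of a huge CSV file (48M lines)?
I have a CSV file that keeps growing until it reaches roughly 48M lines.
Before appending a new line to it, I need to read the last line.
I tried the code below, but it is too slow, and I need a faster alternative: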
def return_last_line(filepath):
    with open(filepath,'r') as file:
        for x in file:
            pass
        return x
return_last_line('lala.csv')
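Answer
Here is my take on it in Python: I wrote a function that lets you choose how many of the last lines you want, because the final lines may be empty.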
def get_last_line(file, how_many_last_lines = 1):
    # open your file using with: safety first, kids!
    with open(file, 'r') as file:
        # find the position of the end of the file: end of the file stream
        end_of_file = file.seek(0,2)
        # set your stream at the end: seek the final position of the file
        file.seek(end_of_file)
        # trace back each character of your file in a loop
        n = 0
        for num in range(end_of_file+1):
            file.seek(end_of_file - num)
            # save the last characters of your file as a string: last_line
            last_line = file.read()
            # count how many '\n' you have in your string:
            # if you have 1, you are in the last line; if you have 2, you have the two last lines
            if last_line.count('\n') == how_many_last_lines:
                return last_line
get_last_line('lala.csv', 2)
This lala.csv has 48 million lines, as in your example. It took me 0 seconds to get the last line.
- This isn't actually correct. The '\n' count is one too few for Unix text files. A line is *terminated* by '\n', so a text file ends with '\n', and by default your `get_last_line` would just return the *line terminator* of the last line, not the last line itself.
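To illustrate the point in the comment above, here is a minimal sketch of one way to sidestep the off-by-one: open the file in binary mode, read backwards in blocks, and strip the trailing terminators before looking for the previous '\n'. The names `last_nonempty_line` and `block_size` are made up for this sketch and do not come from either answer; it assumes UTF-8 text with '\n' or '\r\n' line endings.

```python
import os

def last_nonempty_line(path, block_size=4096, encoding="utf-8"):
    """Return the last non-empty line of `path` without its line terminator."""
    # Binary mode: byte offsets passed to seek() are well defined.
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""
        while pos > 0:
            # Step one block towards the start of the file and prepend it.
            step = min(block_size, pos)
            pos -= step
            f.seek(pos)
            tail = f.read(step) + tail
            # Ignore terminators (and blank lines) at the very end of the file.
            stripped = tail.rstrip(b"\r\n")
            cut = stripped.rfind(b"\n")
            if stripped and cut != -1:
                # Everything after the previous '\n' is the last line.
                return stripped[cut + 1:].decode(encoding)
        # Reached the start of the file: what was read is one line (or nothing).
        return tail.rstrip(b"\r\n").decode(encoding)
```

Calling `last_nonempty_line('lala.csv')` should only touch the last few kilobytes of the file, so the cost does not depend on the total number of lines.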
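Answer
Here is code that finds the last line of a file using mmap. It should work on Unixen and their derivatives and on Windows (I have only tested it on Linux, please tell me if it works on Windows too ;), i.e. pretty much everywhere it matters. Since it uses memory-mapped I/O, it can be expected to perform quite well.
It expects that you can map the entire file into the address space of the process - that should be fine for a 50M file everywhere, but for a 5G file you would need a 64-bit processor or some extra slicing.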
import mmap

def iterate_lines_backwards(filename):
    with open(filename, "rb") as f:
        # memory-map the file, size 0 means whole file
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = len(mm)
            while start > 0:
                start, prev = mm.rfind(b"\n", 0, start), start
                slice = mm[start + 1:prev + 1]
                # if the last character in the file was a '\n',
                # technically the empty string after that is not a line.
                if slice:
                    yield slice.decode()

def get_last_nonempty_line(filename):
    for line in iterate_lines_backwards(filename):
        if stripped := line.rstrip("\r\n"):
            return stripped

print(get_last_nonempty_line("datafile.csv"))
As a bonus, the generator iterate_lines_backwards can efficiently iterate over the lines of a file in reverse, for any number of lines:
print("Iterating the lines of datafile.csv backwards")
for l in iterate_lines_backwards("datafile.csv"):
    print(l, end="")
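As for the "extra slicing" mentioned for very large files, one possible approach is to map only a window at the end of the file instead of the whole thing. The sketch below is an illustration under assumptions, not part of the tested code above: `last_line_from_tail` and `window` are hypothetical names, and it assumes the last line fits inside the mapped window. The mmap offset has to be a multiple of `mmap.ALLOCATIONGRANULARITY`, so it is rounded down.

```python
import mmap
import os

def last_line_from_tail(filename, window=1 << 20):
    """Return the last non-empty line, mapping roughly `window` bytes from the end."""
    size = os.path.getsize(filename)
    if size == 0:
        return ""  # an empty file cannot be memory-mapped
    gran = mmap.ALLOCATIONGRANULARITY
    # The offset must be a multiple of the allocation granularity: round down.
    offset = max(0, size - window) // gran * gran
    with open(filename, "rb") as f:
        with mmap.mmap(f.fileno(), size - offset,
                       access=mmap.ACCESS_READ, offset=offset) as mm:
            end = len(mm)
            # Skip trailing line terminators and blank lines.
            while end > 0 and mm[end - 1] in b"\r\n":
                end -= 1
            start = mm.rfind(b"\n", 0, end)
            if start == -1 and offset > 0:
                # No line start inside the window: the assumption does not hold.
                raise ValueError("last line is longer than the mapped window")
            return mm[start + 1:end].decode()
```

If the `ValueError` is raised, retrying with a larger `window` would be the obvious fallback.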