How to quickly get the last line of a huge CSV file (48M lines)?

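I have a CSV file that keeps growing until it reaches approximately 48M lines.

Before appending a new line to it, I need to read the last line.

I tried the code below, but it is too slow and I need a faster alternative: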

def return_last_line(filepath):    
    with open(filepath,'r') as file:        
        for x in file:
            pass
        return x        
return_last_line('lala.csv')

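Answer

Here is my take on it, in Python: I created a function that lets you choose how many of the last lines you want, because the last lines may be empty.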

def get_last_line(file, how_many_last_lines = 1):

    # open your file using with: safety first, kids!
    with open(file, 'r') as file:

        # find the position of the end of the file: end of the file stream
        end_of_file = file.seek(0,2)
        
        # set your stream at the end: seek the final position of the file
        file.seek(end_of_file)             
        
        # trace back each character of your file in a loop
        n = 0
        for num in range(end_of_file+1):            
            file.seek(end_of_file - num)    
           
            # save the last characters of your file as a string: last_line
            last_line = file.read()
           
            # count how many '\n' you have in your string:
            # if you have 1, you are in the last line; if you have 2, you have the two last lines
            if last_line.count('\n') == how_many_last_lines:
                return last_line
get_last_line('lala.csv', 2)

This lala.csv has 48 million lines, as in your example. It took me 0 seconds to get the last line.

  • This isn't actually correct. The '\n' count is one too few for Unix text files. A line is *terminated* by '\n', therefore a text file ends with '\n', and by default your `get_last_line` would just return the *line terminator* of the last line, not the last line itself.
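
One way to account for the trailing terminator is sketched below. This is only a sketch, not the answerer's code: the name `read_last_line` is made up for illustration, and it assumes the file is opened in binary mode, uses '\n' or '\r\n' line endings, and that trailing blank lines should be skipped. It first steps back over the terminator(s) at the end of the file and only then looks for the previous '\n':

```python
import os


def read_last_line(filepath, encoding="utf-8"):
    """Return the last line of a file without its line terminator."""
    with open(filepath, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        if pos == 0:
            return ""  # empty file: there is no last line

        # Step back over the terminator(s) that end the final line.
        while pos > 0:
            f.seek(pos - 1)
            if f.read(1) not in (b"\n", b"\r"):
                break
            pos -= 1
        end = pos

        # Walk back to the previous '\n' (or to the start of the file).
        while pos > 0:
            f.seek(pos - 1)
            if f.read(1) == b"\n":
                break
            pos -= 1

        f.seek(pos)
        return f.read(end - pos).decode(encoding)
```

Because it opens the file in binary mode, seeking to arbitrary byte offsets is well defined, which is not guaranteed for files opened in text mode.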

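Answer

Here is code for finding the last line of a file with mmap. It should work on Unixen and their derivatives as well as on Windows (I have tested it on Linux only, please tell me if it works on Windows too ;), i.e. pretty much everywhere it matters. Since it uses memory-mapped I/O, it can be expected to perform quite well.

It expects that you can map the entire file into the address space of the processor - that should be fine everywhere for a 50M file, but for a 5G file you would need a 64-bit processor or some extra slicing.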

import mmap


def iterate_lines_backwards(filename):
    with open(filename, "rb") as f:
        # memory-map the file, size 0 means whole file
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = len(mm)

            while start > 0:
                start, prev = mm.rfind(b"\n", 0, start), start
                slice = mm[start + 1:prev + 1]
                # if the last character in the file was a '\n',
                # technically the empty string after that is not a line.
                if slice:
                    yield slice.decode()


def get_last_nonempty_line(filename):
    for line in iterate_lines_backwards(filename):
        if stripped := line.rstrip("\r\n"):
            return stripped


print(get_last_nonempty_line("datafile.csv"))
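
For files too large to map whole, the "extra slicing" mentioned above could look roughly like the following. This is a sketch under assumptions, not part of the tested code: the function name `last_line_from_tail` and the 1 MiB default `window` are invented here, and it assumes the last line fits inside that window. It maps only the tail of the file, using an offset rounded down to `mmap.ALLOCATIONGRANULARITY` as the `offset` parameter requires:

```python
import mmap
import os


def last_line_from_tail(filename, window=1 << 20):
    """Return the last non-empty line, mapping only the final `window` bytes."""
    size = os.path.getsize(filename)
    if size == 0:
        return None  # an empty file cannot be memory-mapped

    # The mapping offset must be a multiple of the allocation granularity.
    gran = mmap.ALLOCATIONGRANULARITY
    offset = max(0, size - window) // gran * gran

    with open(filename, "rb") as f:
        with mmap.mmap(f.fileno(), size - offset,
                       access=mmap.ACCESS_READ, offset=offset) as mm:
            end = len(mm)
            # Ignore the line terminators at the very end of the file.
            while end > 0 and mm[end - 1] in b"\r\n":
                end -= 1
            start = mm.rfind(b"\n", 0, end)
            if start == -1 and offset > 0:
                # No '\n' inside the window: the line starts before it.
                raise ValueError("last line does not fit in the mapped window")
            return mm[start + 1:end].decode()
```

The caller can retry with a larger `window` if the `ValueError` is raised.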

As a bonus, there is the generator iterate_lines_backwards, which efficiently iterates over the lines of a file in reverse, for any number of lines:

print("Iterating the lines of datafile.csv backwards")
for l in iterate_lines_backwards("datafile.csv"):
    print(l, end="")
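
If only the last few lines are needed, the generator can be cut short with `itertools.islice`. The helper below is a small sketch (the name `last_n_lines` is hypothetical); it accepts any iterator that yields lines newest-first, such as the one returned by iterate_lines_backwards, and returns them in file order:

```python
from itertools import islice


def last_n_lines(lines_backwards, n):
    """Take the first n items of a reverse line iterator, in file order."""
    newest_first = list(islice(lines_backwards, n))
    return newest_first[::-1]


# Intended use with the generator defined above:
#   last_n_lines(iterate_lines_backwards("datafile.csv"), 3)
```

Since the generator is lazy, only as much of the end of the file is scanned as is needed to produce `n` lines.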

