How to get pivot lines from two tab-separated files?

html5 • November 28, 2022, 9:29 pm • Q&A

Given two files, file1.txt

abc def \t 123 456
jkl mno \t 987 654
foo bar \t 789 123
bar bar \t 432

and file2.txt

foo bar \t hello world
abc def \t good morning
xyz \t 456

The task is to extract the lines whose first column matches in both files and produce:

abc def \t 123 456 \t good morning
foo bar \t 789 123 \t hello world

I can do this in Python:

from io import StringIO

file1 = """abc def \t 123 456
jkl mno \t 987 654
foo bar \t 789 123
bar bar \t 432"""

file2 = """foo bar \t hello world
abc def \t good morning
xyz \t 456"""

map1, map2 = {}, {}

# Map the first column to the second column for each file.
with StringIO(file1) as fin1:
    for line in fin1:
        one, two = line.strip().split('\t')
        map1[one] = two

with StringIO(file2) as fin2:
    for line in fin2:
        one, two = line.strip().split('\t')
        map2[one] = two

# Print the rows whose key appears in both files.
for k in set(map1).intersection(map2):
    print('\t'.join([k, map1[k], map2[k]]))

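The actual files have billions of lines. Is there a faster solution that avoids loading everything and keeping hash maps/dictionaries in memory?

Perhaps with unix/bash commands? Would pre-sorting the files help?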

Answer

The join command is sometimes hard to use, but here it is straightforward:

join -t $'\t' <(sort file1.txt) <(sort file2.txt)

It uses bash's ANSI-C quoting to specify the tab delimiter, and process substitution to treat each program's output as a file.

To inspect the output, pipe the command above into cat -A, which shows each tab as ^I:

abc def^I123 456^Igood morning$
foo bar^I789 123^Ihello world$
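
For readers who want to see why sorted input makes this possible without a dictionary, here is a minimal Python sketch of a sort-merge join, the general technique that a command like join relies on. It is an illustration under stated assumptions, not join's actual implementation: the function name merge_join is hypothetical, both inputs are assumed to be already sorted by their first field in Python's string ordering, and each key is assumed to appear at most once per file. Only one line from each input is held in memory at a time.

```python
from io import StringIO
from typing import Iterable, Iterator


def merge_join(left: Iterable[str], right: Iterable[str], sep: str = "\t") -> Iterator[str]:
    """Yield joined lines for keys that appear in both inputs.

    Assumption: both inputs are sorted by their first field and each key
    occurs at most once per input.
    """
    # partition() returns (key, sep, rest) and never raises on a missing separator.
    left_rows = (line.rstrip("\n").partition(sep) for line in left)
    right_rows = (line.rstrip("\n").partition(sep) for line in right)
    l_row = next(left_rows, None)
    r_row = next(right_rows, None)
    while l_row is not None and r_row is not None:
        if l_row[0] < r_row[0]:
            # Left key is smaller, so it cannot match anything later on the right.
            l_row = next(left_rows, None)
        elif l_row[0] > r_row[0]:
            # Right key is smaller, so it cannot match anything later on the left.
            r_row = next(right_rows, None)
        else:
            # Keys match: emit key, left remainder, right remainder.
            yield sep.join([l_row[0], l_row[2], r_row[2]])
            l_row = next(left_rows, None)
            r_row = next(right_rows, None)


# Sample data from the question, pre-sorted and using real tab characters.
file1_sorted = StringIO("abc def\t123 456\nbar bar\t432\nfoo bar\t789 123\njkl mno\t987 654\n")
file2_sorted = StringIO("abc def\tgood morning\nfoo bar\thello world\nxyz\t456\n")

for row in merge_join(file1_sorted, file2_sorted):
    print(row)
```

With real files, the StringIO objects would be replaced by open file handles, which Python also reads line by line. The sorting itself still has to happen first, for example with the sort command as in the answer above.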
