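Remove duplicate lines only if the duplicates are within 5 lines of each other
I want to remove duplicate lines from a text file, but only when the duplicates are within 5 lines of each other.
For example: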
Chapter 1.1
Overview
Figure 1
Figure 2
Overview <- This should be deleted (ie. within 5 lines of the previous instance)
Figure 3
Figure 4
...
(many lines in between)
Chapter 1.2
Overview <- This should not be deleted (ie. not within 5 lines of the previous instance)
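I tried awk '!a[$0]++', but that removes every duplicate line in the whole file. I also tried looping over sed -n "$startpoint,$endpoint p" file.txt | awk '!a[$0]++', but that actually creates new duplicates...
What other approaches could I try to remove duplicate lines that are within 5 lines of each other?
Answer
You can use this short awk command: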
awk '!NF || NR > rec[$0]; {rec[$0] = NR+5}' file
Chapter 1.1
Overview
Figure 1
Figure 2
Figure 3
Figure 4
...
(many lines in between)
Chapter 1.2
Figure 1
Figure 2
Overview
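Algorithm details:
!NF || NR > rec[$0]; : Prints the record if the current line is empty (has no fields), or if the current record number is greater than the value stored in the rec array for this line. A line that is not yet in rec is printed as well, because its unset entry evaluates to 0. A line is therefore dropped only when an identical line occurred within the previous 5 lines.
{rec[$0] = NR+5} : Saves each record in the rec array, with the current line number + 5 as its value.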
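For readers who prefer a spelled-out form, here is a minimal sketch of the same logic wrapped in a shell function with a configurable window. The function name dedupe_within and the awk variable n are hypothetical names introduced for this illustration; they are not part of the original answer.

```shell
#!/bin/sh
# dedupe_within N [FILE...]
# Hypothetical helper (name chosen for illustration only).
# Drops a line when an identical line occurred within the previous N lines.
dedupe_within() {
    n=$1      # window size, e.g. 5
    shift     # remaining arguments are input files (stdin if none)
    awk -v n="$n" '
        # Lines with no fields (empty or whitespace-only) are always printed.
        # Any other line is printed when it has not been seen yet
        # (rec[$0] is unset and compares as 0) or when the current
        # line number is past the limit stored for that text.
        !NF || NR > rec[$0] { print }

        # Runs for every line, printed or not: remember the last line
        # number at which a repeat of this text would still be dropped.
        { rec[$0] = NR + n }
    ' "$@"
}
```

Calling dedupe_within 5 file should behave like the one-liner above. Note that the limit is refreshed on every occurrence, including dropped ones, so the distance is measured from the most recent occurrence rather than from the last printed one.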
- Yes that was not obvious but I made an assumption that OP may still want to keep all the blank lines between different sections for readability.