Multiple NF options for identifying awk duplicates at different positions?

I hope this finds you well. I'm writing to ask whether something like this can be done in awk.

I need something like multiple NF cases: for NF = 7 the PK is $1,$5, but for NF = 8 it is $1,$6.

Input

AAA|BBB|CCC|DDD|111|20220129|JONH1
AAA|XXX|YYY|DDD|444|20210115|JONH2
AAA|B10|CCC|DDD|000|20200127|JONH3
AAA|BBB|MMM|DDD|444|20200131|JONH4
AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL

Desired output

File PK_OK_1

AAA|BBB|CCC|DDD|111|20220129|JONH1
AAA|B10|CCC|DDD|000|20200127|JONH3

File DUPLICATE_PK_1

AAA|XXX|YYY|DDD|444|20210115|JONH2
AAA|BBB|MMM|DDD|444|20200131|JONH4

File PK_OK_2

AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY

File DUPLICATE_PK_2

AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY

File INVALID_LENGHT

AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL

My code looks like this (NOM_ARCH is a variable):

BEGIN {
    FS="|"
    OFS="|"
}

NF == 7 {
    if (!seen[$1,$5]) {
        print > NOM_ARCH".PK_OK_1"; seen[$1,$5]=1
    } else {
        print > NOM_ARCH".DUPLICATE_PK_1"
    }
    next
}
NF == 8 {
    if (!seen[$1,$6]) {
        print > NOM_ARCH".PK_OK_2"; seen[$1,$6]=1
    } else {
        print > NOM_ARCH".DUPLICATE_PK_2"
    }
    next
}
{ print > NOM_ARCH".INVALID_LENGHT" }

Answer

With your shown samples, please try the following awk code.

awk '
BEGIN{ FS=OFS="|" }
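{
  # Reset key so a line with any other NF does not reuse the previous key.
  key=""
  # Prefix the key with NF so 7-field and 8-field lines never share a key.
  if(NF==7){ key=(NF FS $1 FS $5) }
  if(NF==8){ key=(NF FS $1 FS $6) }
}
FNR==NR{
  arr1[key]++
  next
}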
NF==7{
  outputFile=(arr1[key]==1?"file.PK_OK_1":"file_DUPLICATE_PK_1")
}
NF==8{
  outputFile=(arr1[key]==1?"file.PK_OK_2":"file_DUPLICATE_PK_2")
}
# Any field count other than 7 or 8 (too short as well as too long) is invalid.
NF!=7 && NF!=8{
  outputFile="file_INVALID_LENGHTH"
}
{
  print > (outputFile)
}
' Input_file  Input_file

Per the OP's request, here is the same code without the ternary operator:

awk '
BEGIN{ FS=OFS="|" }
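{
  # Reset key so a line with any other NF does not reuse the previous key.
  key=""
  # Prefix the key with NF so 7-field and 8-field lines never share a key.
  if(NF==7){ key=(NF FS $1 FS $5) }
  if(NF==8){ key=(NF FS $1 FS $6) }
}
FNR==NR{
  arr1[key]++
  next
}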
NF==7{
  if(arr1[key]==1){ outputFile="file.PK_OK_1"       }
  else            { outputFile="file_DUPLICATE_PK_1"}
}
NF==8{
  if(arr1[key]==1){ outputFile="file.PK_OK_2"       }
  else            { outputFile="file_DUPLICATE_PK_2"} 
}
# Any field count other than 7 or 8 (too short as well as too long) is invalid.
NF!=7 && NF!=8{
  outputFile="file_INVALID_LENGHTH"
}
{
  print > (outputFile)
}
' Input_file  Input_file

Explanation: adding a detailed explanation of the above.

## Starting awk program from here.
awk '
## Starting BEGIN section of this program from here, setting FS and OFS to | here.
BEGIN{ FS=OFS="|" }
##Starting main program from here.
{
##Resetting key so that a line with any other NF does not reuse the previous key.
  key=""
##Checking condition if NF is 7 then set key to NF FS $1 FS $5 (NF keeps 7-field and 8-field keys apart).
  if(NF==7){ key=(NF FS $1 FS $5) }
##Checking condition if NF is 8 then set key to NF FS $1 FS $6.
  if(NF==8){ key=(NF FS $1 FS $6) }
}
##Checking condition FNR==NR which will be TRUE only while Input_file is being read the 1st time.
FNR==NR{
##Incrementing the count stored in array arr1 under index key.
  arr1[key]++
##next will skip all further statements from here.
  next
}
##Checking condition if NF==7 then do following.
NF==7{
##Setting outputFile(where contents will be written to), either file.PK_OK_1 OR file_DUPLICATE_PK_1 depending upon value of arr1.
##Basically it uses the ternary operator ? :
##The value after ? is used if condition arr1[key]==1 is TRUE.
##The value after : is used if condition arr1[key]==1 is FALSE.
  outputFile=(arr1[key]==1?"file.PK_OK_1":"file_DUPLICATE_PK_1")
}
##Checking condition if NF==8 then do following.
NF==8{
##Setting outputFile(where contents will be written to), either file.PK_OK_2 OR file_DUPLICATE_PK_2 depending upon value of arr1.
  outputFile=(arr1[key]==1?"file.PK_OK_2":"file_DUPLICATE_PK_2")
}
##Checking condition if NF is neither 7 nor 8 (too short or too long) then do following.
NF!=7 && NF!=8{
##Setting outputFile(where contents will be written to) to file_INVALID_LENGHTH here.
  outputFile="file_INVALID_LENGHTH"
}
{
##Printing current line to outputFile(already set its value above)
  print > (outputFile)
}
##Mentioning Input_file names here.
' Input_file  Input_file
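
The two-pass idiom used above (passing the same file twice and testing FNR==NR) can be seen in isolation in the small sketch below. The file name demo.txt, its contents, and the labels "dup" and "unique" are made up for illustration and are not part of the answer's code.

```shell
# Hypothetical demo data: demo.txt and its contents are invented for this sketch.
tmpdir=$(mktemp -d)
printf '%s\n' 'a|1' 'b|2' 'a|3' > "$tmpdir/demo.txt"

# Pass 1 (NR==FNR is true only for the first file argument): count each $1.
# Pass 2: every line is printed with a label based on the final count.
result=$(awk -F'|' '
NR==FNR { cnt[$1]++; next }
{ print $0, (cnt[$1] > 1 ? "dup" : "unique") }
' "$tmpdir/demo.txt" "$tmpdir/demo.txt")

printf '%s\n' "$result"
```

Because the counts are complete before the second pass starts, the first occurrence of a repeated key is labelled as a duplicate too, which is what a single pass with a seen[] array cannot do.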

  • 1-pass solutions won't catch the first time a dup key is seen.

    Presorted, we can hold one till you check the next, but your key depends on NF. We can presort recs to files by NF, but it still takes multiple passes.

    We could keep recs in associative arrays by key, but it holds the whole file in memory till you're done. Is that even an option?

    We could output a file per key, but may have a huge number of open files prevented by `ulimit`.

    What's your priority, and what are your available resources?
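
As a rough sketch of the in-memory option raised in this comment, the script below reads the input once, buffers every record, and only writes the output files in the END block. It assumes the whole file fits in memory; the names workdir and out are hypothetical, and the suffix INVALID_LENGHT follows the spelling used in the question.

```shell
# Hypothetical working directory and file names, for illustration only.
workdir=$(mktemp -d)
cat > "$workdir/input.txt" <<'EOF'
AAA|BBB|CCC|DDD|111|20220129|JONH1
AAA|XXX|YYY|DDD|444|20210115|JONH2
AAA|B10|CCC|DDD|000|20200127|JONH3
AAA|BBB|MMM|DDD|444|20200131|JONH4
AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL
EOF

awk -v nom_arch="$workdir/out" '
BEGIN { FS = OFS = "|" }
{
    line[NR] = $0                       # buffer every record: whole file in memory
    nf[NR]   = NF
    if (NF == 7 || NF == 8) {
        key[NR] = NF FS $1 FS $(NF-2)   # $5 for 7 fields, $6 for 8 fields
        cnt[key[NR]]++
    }
}
END {
    # Replay the buffered records in input order now that all counts are final.
    for (i = 1; i <= NR; i++) {
        if (nf[i] == 7 || nf[i] == 8)
            sfx = (cnt[key[i]] > 1 ? "DUPLICATE_PK" : "PK_OK") "_" (nf[i] - 6)
        else
            sfx = "INVALID_LENGHT"
        print line[i] > (nom_arch "." sfx)
    }
}
' "$workdir/input.txt"

ls "$workdir"
```

The trade-off is the one the comment names: the input is read only once, but memory use grows with the size of the file.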


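Answer

Normally I'd suggest a first pass with sort and uniq -c for efficiency, but I started out assuming the wrong requirements and wrote most of this under that assumption, so I've just tweaked it for the real requirements. Here is how to do it all in one awk script: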

$ cat tst.awk
BEGIN {
    FS=OFS="|"
    map[7] = 1
    map[8] = 2
}
{ key = $1 FS $(NF-2) FS NF }
NR==FNR {
    cnt[key]++
    next
}
{
    if ( NF in map ) {
        sfx = ( cnt[key]>1 ? "DUPLICATE_PK" : "PK_OK" ) "_" map[NF]
    }
    else {
        sfx = "INVALID_LENGTH"
    }
    print > (nom_arch "." sfx)
}
$ awk -v nom_arch='foo' -f tst.awk file file
$ head foo.*
==> foo.DUPLICATE_PK_1 <==
AAA|XXX|YYY|DDD|444|20210115|JONH2
AAA|BBB|MMM|DDD|444|20200131|JONH4

==> foo.DUPLICATE_PK_2 <==
AAA|BBB|CCC|DDD|777|0054256|JONH5|MARY
AAA|BBB|CCC|DDD|999|0054256|JONH5|MARY

==> foo.INVALID_LENGTH <==
AAA|BBB|CCC|DDD|202|0054256|JONH5|MARY|MIAMI|FL

==> foo.PK_OK_1 <==
AAA|BBB|CCC|DDD|111|20220129|JONH1
AAA|B10|CCC|DDD|000|20200127|JONH3

==> foo.PK_OK_2 <==
AAA|BBB|CCC|DDD|111|0036000|JONH5|MARY
AAA|BBB|CCC|DDD|888|0089999|CENTRAL|MARY

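I corrected the spelling of LENGTH above.

Note that NF is included in key = $1 FS $(NF-2) FS NF so that we avoid the potential case pointed out by @rowboat, where a line with 7 fields has the same $1 and $(NF-2) as a line with 8 fields. Otherwise we would end up with a single count of 2 for what should be 2 separate counts of 1.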
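
To make the collision concrete, here is a small sketch with two made-up records, one with 7 fields and one with 8, that share the same $1 and $(NF-2). The records and the variable names are invented for illustration.

```shell
# Two hypothetical records: $1 is AAA in both, $5 of the first and $6 of the second are 444.
records=$(printf '%s\n' 'AAA|b|c|d|444|f|g' 'AAA|b|c|d|e|444|g|h')

# Key without NF: both records are counted under one key.
without_nf=$(printf '%s\n' "$records" |
    awk -F'|' '{ cnt[$1 FS $(NF-2)]++ } END { for (k in cnt) print k, cnt[k] }' | sort)

# Key with NF appended: each record is counted under its own key.
with_nf=$(printf '%s\n' "$records" |
    awk -F'|' '{ cnt[$1 FS $(NF-2) FS NF]++ } END { for (k in cnt) print k, cnt[k] }' | sort)

printf 'without NF:\n%s\nwith NF:\n%s\n' "$without_nf" "$with_nf"
```

Without NF in the key, both lines would be reported as duplicates of each other even though they belong to different record layouts.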

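We could have used NF-6 instead of map[NF] when setting sfx, but map[] is also useful for identifying valid values of NF, and in future there may be other values of NF for which sfx cannot be determined just by subtracting 6.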
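
A minimal sketch of that point: the map[12] = 3 entry below is a made-up future layout, chosen only because 12 - 6 would not give the right suffix, while a lookup in map[] still works and doubles as the validity check.

```shell
# Hypothetical records with 7, 8, 12 and 2 fields; only the field counts matter here.
report=$(printf '%s\n' 'a|b|c|d|e|f|g' 'a|b|c|d|e|f|g|h' 'a|b|c|d|e|f|g|h|i|j|k|l' 'a|b' |
awk -F'|' '
BEGIN { map[7] = 1; map[8] = 2; map[12] = 3 }   # map[12] is an invented extension
{ print NF, ((NF in map) ? "suffix " map[NF] : "invalid") }
')

printf '%s\n' "$report"
```

Adding a new record layout then means adding one map[] entry rather than another branch in the code.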

  • What's complex about it? For NF of 7 you want an output file with the suffix 1, and for NF of 8 you want an output file with the suffix 2, so I'm just using an array to define that mapping so later in the code you can get the suffix by using `sfx=map[NF]` instead of `if (NF==7) sfx=1; else if (NF==8) sfx=2`. It's **different** from your approach but it's actually much simpler, not more complex, and requires less code and less redundancy of code. Just take a few minutes to think about what each step does - it's all very straightforward.
