Extracting numbers after a specific phrase

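I have been trying to write two regular expressions to accomplish the following two tasks:

  1. Pull the number that follows the phrase "EDG ICD HCUP CCS"
  2. Pull the words that follow "EDG ICD HCUP CCS 159 (PREDICTIVE MODELS-VERSION 1.0)-"

I would like to store the numbers in a column named "category" and the words in a column named "diagnosis".

The strings are in the column named "GROUPER_NAME".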

df <- structure(list(GROUPER_ID = structure(c("9001742130", "9001742138", 
"9001742058", "9001742062", "9001742102", "9001742247", "9001742055", 
"9001742158", "9001742036", "9001742053"), label = "GROUPER_ID", format.sas = "$"), 
    GROUPER_NAME = structure(c("EDG ICD HCUP CCS 130 (PREDICTIVE MODELS-VERSION 1.0)-PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE", 
    "EDG ICD HCUP CCS 138 (PREDICTIVE MODELS-VERSION 1.0)-ESOPHAGEAL DISORDERS", 
    "EDG ICD HCUP CCS 58 (PREDICTIVE MODELS-VERSION 1.0)-OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS", 
    "EDG ICD HCUP CCS 62 (PREDICTIVE MODELS-VERSION 1.0)-COAGULATION AND HEMORRHAGIC DISORDERS", 
    "EDG ICD HCUP CCS 102 (PREDICTIVE MODELS-VERSION 1.0)-NONSPECIFIC CHEST PAIN", 
    "EDG ICD HCUP CCS 247 (PREDICTIVE MODELS-VERSION 1.0)-LYMPHADENITIS", 
    "EDG ICD HCUP CCS 55 (PREDICTIVE MODELS-VERSION 1.0)-FLUID AND ELECTROLYTE DISORDERS", 
    "EDG ICD HCUP CCS 158 (PREDICTIVE MODELS-VERSION 1.0)-CHRONIC KIDNEY DISEASE", 
    "EDG ICD HCUP CCS 36 (PREDICTIVE MODELS-VERSION 1.0)-CANCER OF THYROID", 
    "EDG ICD HCUP CCS 53 (PREDICTIVE MODELS-VERSION 1.0)-DISORDERS OF LIPID METABOLISM"
    ), label = "GROUPER_NAME", format.sas = "$")), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

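For the first example, I would like to extract "159" and "URINARY TRACT INFECTION" and place them in the "category" and "diagnosis" columns respectively. I have tried adapting some of the solutions on here to fit my case, but I am really bad with regular expressions and could not get anything to work. Any help would be greatly appreciated!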

Answer

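We can use sub from base R. Capture the digits (\\d+) that follow the prefix string, and the characters that come after the )-. In the replacement, specify the backreferences of the captured groups (\\1, \\2), and read the result into a two-column data.frame with read.csv.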

# group 1: digits after the four-word prefix; group 2: text after ")-"
read.csv(text = sub("\\w+ \\w+ \\w+ \\w+ (\\d+)\\s.*\\)-(.*)", 
         "\\1:\\2", df$GROUPER_NAME), sep = ":", header = FALSE, 
      col.names = c("category", "diagnosis"))

-output

 category                                             diagnosis
1       130            PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE
2       138                                  ESOPHAGEAL DISORDERS
3        58 OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS
4        62                 COAGULATION AND HEMORRHAGIC DISORDERS
5       102                                NONSPECIFIC CHEST PAIN
6       247                                         LYMPHADENITIS
7        55                       FLUID AND ELECTROLYTE DISORDERS
8       158                                CHRONIC KIDNEY DISEASE
9        36                                     CANCER OF THYROID
10       53                         DISORDERS OF LIPID METABOLISM
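
As a possible variation, the same two capture groups can be pulled out with regexec and regmatches instead of going through read.csv. This is only a sketch: the grouper_name vector below is a stand-in for df$GROUPER_NAME built from two of the strings in the question, and the pattern assumes every string starts with the literal phrase "EDG ICD HCUP CCS". One reason to consider it is that it does not depend on ":" as a separator, so a diagnosis that happened to contain a colon would not be split into extra fields.

```r
# Stand-in for df$GROUPER_NAME (two of the strings from the question)
grouper_name <- c(
  "EDG ICD HCUP CCS 130 (PREDICTIVE MODELS-VERSION 1.0)-PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE",
  "EDG ICD HCUP CCS 58 (PREDICTIVE MODELS-VERSION 1.0)-OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS"
)

# Group 1: the digits after the fixed phrase.
# Group 2: everything after the parenthesised part and the "-" that follows it.
pattern <- "^EDG ICD HCUP CCS (\\d+) \\([^)]*\\)-(.*)$"

# For each string, regmatches() returns the full match followed by each group
parts <- regmatches(grouper_name, regexec(pattern, grouper_name))

result <- data.frame(
  category  = as.integer(vapply(parts, function(p) p[2], character(1))),
  diagnosis = vapply(parts, function(p) p[3], character(1)),
  stringsAsFactors = FALSE
)
result
```

With the real data, the two columns could be assigned back onto df (for example df$category and df$diagnosis) rather than building a separate data.frame. Strings that do not match the pattern end up as NA in both columns.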

