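Extract numbers after a specific phrase
I have been trying to write two regular expressions to accomplish the following two tasks:
- Pull the number after the phrase "EDG ICD HCUP CCS"
- Pull the words after "EDG ICD HCUP CCS 159 (PREDICTIVE MODELS-VERSION 1.0)-"
I would like to store the number in a column named "category" and the words in a column named "diagnosis".
The strings are in the column named "GROUPER_NAME".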
df <- structure(list(GROUPER_ID = structure(c("9001742130", "9001742138",
"9001742058", "9001742062", "9001742102", "9001742247", "9001742055",
"9001742158", "9001742036", "9001742053"), label = "GROUPER_ID", format.sas = "$"),
GROUPER_NAME = structure(c("EDG ICD HCUP CCS 130 (PREDICTIVE MODELS-VERSION 1.0)-PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE",
"EDG ICD HCUP CCS 138 (PREDICTIVE MODELS-VERSION 1.0)-ESOPHAGEAL DISORDERS",
"EDG ICD HCUP CCS 58 (PREDICTIVE MODELS-VERSION 1.0)-OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS",
"EDG ICD HCUP CCS 62 (PREDICTIVE MODELS-VERSION 1.0)-COAGULATION AND HEMORRHAGIC DISORDERS",
"EDG ICD HCUP CCS 102 (PREDICTIVE MODELS-VERSION 1.0)-NONSPECIFIC CHEST PAIN",
"EDG ICD HCUP CCS 247 (PREDICTIVE MODELS-VERSION 1.0)-LYMPHADENITIS",
"EDG ICD HCUP CCS 55 (PREDICTIVE MODELS-VERSION 1.0)-FLUID AND ELECTROLYTE DISORDERS",
"EDG ICD HCUP CCS 158 (PREDICTIVE MODELS-VERSION 1.0)-CHRONIC KIDNEY DISEASE",
"EDG ICD HCUP CCS 36 (PREDICTIVE MODELS-VERSION 1.0)-CANCER OF THYROID",
"EDG ICD HCUP CCS 53 (PREDICTIVE MODELS-VERSION 1.0)-DISORDERS OF LIPID METABOLISM"
), label = "GROUPER_NAME", format.sas = "$")), row.names = c(NA,
-10L), class = c("tbl_df", "tbl", "data.frame"))
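For the first example, I would like to extract "159" and "URINARY TRACT INFECTIONS" and place them in the "category" and "diagnosis" columns, respectively. I have tried adapting some of the solutions posted here to my case, but I am really bad at regular expressions and could not get anything to work. Any help would be greatly appreciated!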
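Answer
We can use sub from base R. Capture the digits ((\\d+)) that follow the four-word prefix, and capture the characters that follow the closing ) and the -. In the replacement, specify the backreferences to the captured groups (\\1, \\2) separated by a :, and read the result into a two-column data.frame with read.csv.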
# Group 1: the digits after the four-word prefix; group 2: the text after ")-"
read.csv(text = sub("\\w+ \\w+ \\w+ \\w+ (\\d+)\\s.*\\)-(.*)",
    "\\1:\\2", df$GROUPER_NAME), sep = ":", header = FALSE,
    col.names = c("category", "diagnosis"))
-output
   category diagnosis
1       130 PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE
2       138 ESOPHAGEAL DISORDERS
3        58 OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS
4        62 COAGULATION AND HEMORRHAGIC DISORDERS
5       102 NONSPECIFIC CHEST PAIN
6       247 LYMPHADENITIS
7        55 FLUID AND ELECTROLYTE DISORDERS
8       158 CHRONIC KIDNEY DISEASE
9        36 CANCER OF THYROID
10       53 DISORDERS OF LIPID METABOLISM
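The approach above joins the two captured groups with ":" and re-parses them with read.csv, so it relies on the diagnosis text never containing a colon. As a possible alternative, not part of the original answer, here is a minimal sketch using strcapture from base R (the utils package, R 3.4.0 or later), which returns the capture groups directly as typed columns. The names x, dat, pattern and parts are hypothetical, and the two sample strings are copied from the question's data; the pattern assumes every value starts with the literal prefix "EDG ICD HCUP CCS".

```r
# Two sample strings copied from the question's GROUPER_NAME column
x <- c(
  "EDG ICD HCUP CCS 130 (PREDICTIVE MODELS-VERSION 1.0)-PLEURISY; PNEUMOTHORAX; PULMONARY COLLAPSE",
  "EDG ICD HCUP CCS 58 (PREDICTIVE MODELS-VERSION 1.0)-OTHER NUTRITIONAL; ENDOCRINE; AND METABOLIC DISORDERS"
)
dat <- data.frame(GROUPER_NAME = x, stringsAsFactors = FALSE)

# Group 1: digits after the fixed prefix
# Group 2: everything after the parenthesised part and the "-" that follows it
pattern <- "^EDG ICD HCUP CCS ([0-9]+) \\([^)]*\\)-(.*)$"

# proto names the output columns and sets their types
parts <- strcapture(
  pattern, dat$GROUPER_NAME,
  proto = data.frame(category = integer(), diagnosis = character(),
                     stringsAsFactors = FALSE)
)

# Attach the new columns next to the original one
dat <- cbind(dat, parts)
print(dat[c("category", "diagnosis")])
```

Strings that do not match the pattern come back as NA in both columns, which makes unexpected formats easy to spot.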