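Split a semicolon- and equals-delimited string into separate columns in R
I have a data frame in which one column contains strings of key=value pairs separated by semicolons.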
data$col1
[1] "ECNT=2;HCNT=4;MAX_ED=51;MIN_ED=51;NLOD=38.78;TLOD=5.45"
[2] "ECNT=2;HCNT=8;MAX_ED=51;MIN_ED=51;NLOD=36.58;TLOD=4.05"
[3] "DB;ECNT=1;HCNT=16;MAX_ED=.;MIN_ED=.;NLOD=20.42;TLOD=5.82"
[4] "DB;ECNT=1;HCNT=4;MAX_ED=.;MIN_ED=.;NLOD=30.70;TLOD=8.03"
[5] "ECNT=2;HCNT=6;MAX_ED=7;MIN_ED=7;NLOD=41.48;TLOD=5.37"
[6] "ECNT=2;HCNT=9;MAX_ED=7;MIN_ED=7;NLOD=40.59;TLOD=5.29"
I want to extract the numbers that follow NLOD= and TLOD= and put them into two separate columns. This is my desired output.
data
col1 TLOD NLOD
"ECNT=2;HCNT=4;MAX_ED=51;MIN_ED=51;NLOD=38.78;TLOD=5.45" 5.45 38.78
"ECNT=2;HCNT=8;MAX_ED=51;MIN_ED=51;NLOD=36.58;TLOD=4.05" 4.05 36.58
"DB;ECNT=1;HCNT=16;MAX_ED=.;MIN_ED=.;NLOD=20.42;TLOD=5.82" 5.82 20.42
"DB;ECNT=1;HCNT=4;MAX_ED=.;MIN_ED=.;NLOD=30.70;TLOD=8.03" 8.03 30.70
"ECNT=2;HCNT=6;MAX_ED=7;MIN_ED=7;NLOD=41.48;TLOD=5.37" 5.37 41.48
"ECNT=2;HCNT=9;MAX_ED=7;MIN_ED=7;NLOD=40.59;TLOD=5.29" 5.29 40.59
Any help is appreciated. Thank you.
Reproducible sample data
df <- structure(list(col1 = c("ECNT=2;HCNT=4;MAX_ED=51;MIN_ED=51;NLOD=38.78;TLOD=5.45",
"ECNT=2;HCNT=8;MAX_ED=51;MIN_ED=51;NLOD=36.58;TLOD=4.05", "DB;ECNT=1;HCNT=16;MAX_ED=.;MIN_ED=.;NLOD=20.42;TLOD=5.82",
"DB;ECNT=1;HCNT=4;MAX_ED=.;MIN_ED=.;NLOD=30.70;TLOD=8.03", "ECNT=2;HCNT=6;MAX_ED=7;MIN_ED=7;NLOD=41.48;TLOD=5.37",
"ECNT=2;HCNT=9;MAX_ED=7;MIN_ED=7;NLOD=40.59;TLOD=5.29")), row.names = c(NA,
-6L), class = c("tbl_df", "tbl", "data.frame"))
Answer
In base R, you can use strcapture to capture the values into separate columns.
cbind(df, strcapture('NLOD=(.*?);TLOD=(.*)', df$col1,
proto = list(NLOD = numeric(), TLOD = numeric())))
#  col1 NLOD TLOD
#1 ECNT=2;HCNT=4;MAX_ED=51;MIN_ED=51;NLOD=38.78;TLOD=5.45 38.78 5.45
#2 ECNT=2;HCNT=8;MAX_ED=51;MIN_ED=51;NLOD=36.58;TLOD=4.05 36.58 4.05
#3 DB;ECNT=1;HCNT=16;MAX_ED=.;MIN_ED=.;NLOD=20.42;TLOD=5.82 20.42 5.82
#4 DB;ECNT=1;HCNT=4;MAX_ED=.;MIN_ED=.;NLOD=30.70;TLOD=8.03 30.70 8.03
#5 ECNT=2;HCNT=6;MAX_ED=7;MIN_ED=7;NLOD=41.48;TLOD=5.37 41.48 5.37
#6 ECNT=2;HCNT=9;MAX_ED=7;MIN_ED=7;NLOD=40.59;TLOD=5.29 40.59 5.29
To match numbers specifically, you can do the following:
cbind(df, strcapture('NLOD=(\\d+\\.\\d+);TLOD=(\\d+\\.\\d+)', df$col1,
proto = list(NLOD = numeric(), TLOD = numeric())))
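One thing to keep in mind: the stricter pattern only matches values that contain a decimal point. The sketch below uses two made-up input strings (they are not from the question's data) to illustrate that strcapture returns NA for rows that do not match, and that a looser character class such as `[0-9.]+` would also accept integers. Whether the looser class is appropriate depends on what the real values look like, so treat it as an assumption.

```r
# Hypothetical inputs: the second NLOD value has no decimal part.
x <- c("NLOD=38.78;TLOD=5.45", "NLOD=20;TLOD=5.82")
proto <- data.frame(NLOD = numeric(), TLOD = numeric())

# Strict pattern: digits, a literal dot, then digits.
strict <- strcapture("NLOD=(\\d+\\.\\d+);TLOD=(\\d+\\.\\d+)", x, proto = proto)

# Looser pattern: any run of digits and dots.
loose <- strcapture("NLOD=([0-9.]+);TLOD=([0-9.]+)", x, proto = proto)

print(strict)  # the second row does not match, so both of its columns are NA
print(loose)   # both rows match
```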
The regular expression from the first example also works with tidyr::extract:
tidyr::extract(df, col1, c('NLOD', 'TLOD'), 'NLOD=(.*?);TLOD=(.*)', remove = FALSE)
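The patterns above assume that NLOD comes immediately before TLOD. If that ordering is not guaranteed, one possible alternative is to pull out each key on its own. The following is a minimal base R sketch; `get_field` is a hypothetical helper written for illustration, not part of any package, and the input strings are made up.

```r
# Hypothetical helper: return the numeric value stored under `key`
# in each semicolon-separated "key=value" string, or NA if the key is absent.
get_field <- function(x, key) {
  # The key must sit at the start of the string or right after a ';'.
  pattern <- paste0("(^|;)", key, "=([^;]*)")
  matches <- regmatches(x, regexec(pattern, x))
  values <- vapply(
    matches,
    function(m) if (length(m)) m[3] else NA_character_,
    character(1)
  )
  as.numeric(values)
}

# Made-up inputs: fields in a different order, and one string without NLOD.
x <- c("DB;ECNT=1;NLOD=20.42;TLOD=5.82", "TLOD=4.05;NLOD=36.58", "ECNT=1")
nlod <- get_field(x, "NLOD")
tlod <- get_field(x, "TLOD")
print(data.frame(NLOD = nlod, TLOD = tlod))
```

Because each key is looked up independently, the order of the fields no longer matters, at the cost of one regular expression pass per key.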
- The `?` in `(.*?)` makes the match non-greedy. It makes no difference here, but in general a greedy `(.*)` captures as much as it can: with a pattern such as `NLOD=(.*);`, the capture group would run up to the last `;` in the string rather than stopping at the first one.
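The difference can be seen in a small sketch. The string below is made up and ends with an extra `;`, and `capture` is a hypothetical one-line helper that returns the first capture group; `perl = TRUE` is used here only so that the non-greedy quantifier follows the familiar Perl semantics.

```r
# Hypothetical input with a trailing ';' after the TLOD value.
x <- "NLOD=38.78;TLOD=5.45;"

# Hypothetical helper: return the first capture group of `pattern` in `text`.
capture <- function(pattern, text) {
  regmatches(text, regexec(pattern, text, perl = TRUE))[[1]][2]
}

greedy <- capture("NLOD=(.*);", x)   # runs up to the last ';'
lazy   <- capture("NLOD=(.*?);", x)  # stops at the first ';'

print(greedy)
print(lazy)
```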