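How do I count matching strings in R when the patterns to match come from a column of another data frame?

I have two very large data frames. The first consists of a single column, body, which is a list of comments; the second consists of names. I want to count how many elements of body contain each element of names. Here is a small reproducible dataset (the original has about 2,000 names, each of which is the name of a car):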

library(tidyverse)  # provides tibble() and str_detect() used below

df1 <- tibble(body = c("The Tesla Roadster has a range of 620 miles",
                       "ferrari needs to make an electric car",
                       "How much does a tesla cost?",
                       "When is the new Mercedes releasing?",
                       "Can't wait to get my hands on the new Tesla"))

df2 <- tibble(names = c("FORD", "TESLA", "MERCEDES", "FERRARI", "JAGUAR", "HYUNDAI"))

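As mentioned above, I am trying to count how many times each value in names occurs in body, and ideally add the result as a column to df2. I have tried the following: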

counter = c()
for (i in df2$names) {
  counter[i] = sum(ifelse(str_detect(df1$body, i),1, 0))
}

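Although this approach works, it takes an extremely long time and returns a vector in which the names are attributes of the counter values. I then unstack that vector and join the resulting data frame to df2 using names as the key. This is the only approach that has worked. I also tried str_count, but with my current proficiency in R the code was atrocious and got me nowhere.

Is there a more efficient way to find the matching strings? I tried to find similar questions on Stack Overflow, but to no avail!

Thanks in advance :)

Answer

Something like this?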

df1 <- data.frame(body = c("The Tesla Roadster has a range of 620 miles",
                       "ferrari needs to make an electric car",
                       "How much does a tesla cost?",
                       "When is the new Mercedes releasing?",
                       "Can't wait to get my hands on the new Tesla"))
df2 <- data.frame(names = c("FORD", "TESLA", "MERCEDES", "FERRARI", "JAGUAR", "HYUNDAI"))

library(tidyverse)            
df2 %>%
  mutate(des_count = map_int(tolower(names), ~ sum(str_detect(tolower(df1$body), .x))))
#>      names des_count
#> 1     FORD         0
#> 2    TESLA         3
#> 3 MERCEDES         1
#> 4  FERRARI         1
#> 5   JAGUAR         0
#> 6  HYUNDAI         0

Created on 2021-05-13 by the reprex package (v2.0.0)

Or, if you want to use base R:

df1 <- data.frame(body = c("The Tesla Roadster has a range of 620 miles",
                       "ferrari needs to make an electric car",
                       "How much does a tesla cost?",
                       "When is the new Mercedes releasing?",
                       "Can't wait to get my hands on the new Tesla"))
df2 <- data.frame(names = c("FORD", "TESLA", "MERCEDES", "FERRARI", "JAGUAR", "HYUNDAI"))

df2$desired_count <- sapply(df2$names, function(x) sum(grepl(x, df1$body, ignore.case = TRUE)))

df2
#>      names desired_count
#> 1     FORD             0
#> 2    TESLA             3
#> 3 MERCEDES             1
#> 4  FERRARI             1
#> 5   JAGUAR             0
#> 6  HYUNDAI             0

Created on 2021-05-13 by the reprex package (v2.0.0)

Answer

You can use rowwise and grepl, which I think is faster than str_detect:

df1 <- df1 %>%
  mutate(body = tolower(body))

df2 %>%
  mutate(names = tolower(names)) %>%
  rowwise() %>%
  mutate(counter = sum(grepl(names,tolower(df1$body),fixed = TRUE )))

# A tibble: 6 x 2
# Rowwise: 
  names    counter
  <chr>      <int>
1 ford           0
2 tesla          3
3 mercedes       1
4 ferrari        1
5 jaguar         0
6 hyundai        0

Since the question is about speed, here is a benchmark:

df1 <- df1 %>%
  mutate(body = tolower(body))
df2 <- df2 %>%
  mutate(names = tolower(names)) 

anilgoyal = function(){
  df2 %>%
    mutate(des_count = map_int(names, ~ sum(str_detect(df1$body, .x))))
}

anigoyal2 = function(){
  sapply(df2$names, function(x) sum(grepl(x, df1$body, ignore.case = T)))
}

denis = function(){
  df2 %>%
    rowwise() %>%
    mutate(counter = sum(grepl(names,df1$body ,fixed = T)))
}

Anoushiravan = function(){
  df1 %>%
    rowwise() %>%
    mutate(match = df2$names[which(str_detect(body, fixed(df2$names, 
                                                          ignore_case = TRUE)))]) -> df3
 
  df2 %>%
    mutate(cnt = map_chr(names, ~ sum(str_detect(df3$match, .x))))
}

chris = function(){
  df2 %>%
    rowwise() %>%
    mutate(count = sum(grepl(paste0("(?i)", names), df1$body)))
}

Results

library(microbenchmark)

microbenchmark(denis(),anilgoyal(),anigoyal2(),Anoushiravan(),chris(),times = 100)

Unit: microseconds
           expr     min       lq      mean   median       uq      max neval  cld
        denis()  5960.6  7059.85 10644.711  8692.50 11533.90  49709.7   100   c 
    anilgoyal()  3614.2  4385.55  6660.244  4886.60  7195.65  31088.9   100  b  
    anigoyal2()   153.4   203.00   315.966   239.35   285.45   2010.8   100 a   
 Anoushiravan() 10083.4 12522.40 19994.135 15355.85 20469.60 100866.2   100    d
        chris()  5971.7  7060.55 11353.754  8356.35 10727.10  98319.3   100   c 

Base R is more efficient! Well done, @AnilGoyal.

  • Add `fixed = TRUE` to `grepl` for a big speed boost.
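
As a minimal sketch of that suggestion (not timed here, and not taken from the answers above): `fixed = TRUE` treats each name as a literal string, and `grepl` ignores `ignore.case = TRUE` with a warning when both are set, so this version lower-cases both columns once beforehand. The names `body_lower`, `names_lower` and `fixed_count` are illustrative only, and `vapply` is swapped in for `sapply` so the result is guaranteed to be an integer vector.

```r
# Toy data from the question.
df1 <- data.frame(body = c("The Tesla Roadster has a range of 620 miles",
                           "ferrari needs to make an electric car",
                           "How much does a tesla cost?",
                           "When is the new Mercedes releasing?",
                           "Can't wait to get my hands on the new Tesla"))
df2 <- data.frame(names = c("FORD", "TESLA", "MERCEDES", "FERRARI", "JAGUAR", "HYUNDAI"))

# Lower-case both sides once, outside the loop, because fixed = TRUE
# cannot be combined with ignore.case = TRUE.
body_lower  <- tolower(df1$body)
names_lower <- tolower(df2$names)

# For each name, count how many comments contain it as a literal substring.
df2$fixed_count <- vapply(
  names_lower,
  function(nm) sum(grepl(nm, body_lower, fixed = TRUE)),
  integer(1),
  USE.NAMES = FALSE
)

df2
```

Because the match is a plain substring test, a name that is also part of a longer word will still be counted; whether that matters depends on the real list of names.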
