Top N values of a cosine similarity matrix in R
How can I get the top pairs from a cosine similarity matrix like the one shown below?
southpark_matrix <- structure(c(0, 0.165272735625452, 0.386480286121192, 0.170696960480773,
0.0869562860988618, 0.165272735625452, 0, 0.251690602341816,
0.472701602991984, 0.137486001150133, 0.386480286121192, 0.251690602341816,
0, 0.255849200006255, 0.0972813221214626, 0.170696960480773,
0.472701602991984, 0.255849200006255, 0, 0.156449701347234, 0.0869562860988618,
0.137486001150133, 0.0972813221214626, 0.156449701347234, 0), .Dim = c(5L,
5L), .Dimnames = list(Docs = c("Mr. Garrison_2", "Cartman_3",
"Mr. Garrison_3", "Cartman_4", "Jimbo_5"), Docs = c("Mr. Garrison_2",
"Cartman_3", "Mr. Garrison_3", "Cartman_4", "Jimbo_5")))
southpark_matrix
                Docs
Docs             Mr. Garrison_2 Cartman_3 Mr. Garrison_3 Cartman_4    Jimbo_5
  Mr. Garrison_2     0.00000000 0.1652727     0.38648029 0.1706970 0.08695629
  Cartman_3          0.16527274 0.0000000     0.25169060 0.4727016 0.13748600
  Mr. Garrison_3     0.38648029 0.2516906     0.00000000 0.2558492 0.09728132
  Cartman_4          0.17069696 0.4727016     0.25584920 0.0000000 0.15644970
  Jimbo_5            0.08695629 0.1374860     0.09728132 0.1564497 0.00000000
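How can I get the top 2 pairs?
In this example, the top 2 pairs would be the ones listed below. In my real data, I have more than 100 rows and columns.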
Cartman_3 Cartman_4 0.4727016
Mr. Garrison_2 Mr. Garrison_3 0.38648029
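Answer
The way I would do this is to convert the matrix to a tibble. We can follow the steps from the question "Convert upper triangular part of a matrix to 3-column long format" to turn the upper triangular part of the matrix into a 3-column data frame (row name, column name, value).
After this, we can simply use the top_n(2, val) function, which ranks the rows by our values and keeps the top 2. An alternative for this step is to sort the values in descending order with arrange(desc(val)) and then take the first 2 rows with head(2).
I have made a reprex of my approach below: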
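As a side note on the top_n() step: since dplyr 1.0.0, top_n() is superseded by slice_max(), which returns the rows already sorted in descending order and keeps ties by default. The following is only a minimal sketch of that alternative, not part of the original approach; the `pairs` data frame is a hypothetical long-format table built from a few rounded values of the question's matrix.

```r
library(dplyr)

# Hypothetical long-format pairs (row, col, val), values rounded from the question
pairs <- data.frame(
  row = c("Mr. Garrison_2", "Mr. Garrison_2", "Cartman_3", "Cartman_3"),
  col = c("Cartman_3", "Mr. Garrison_3", "Mr. Garrison_3", "Cartman_4"),
  val = c(0.1652727, 0.3864803, 0.2516906, 0.4727016)
)

# Keep the 2 rows with the largest val; the result is sorted by descending val
top2 <- pairs %>% slice_max(val, n = 2)
print(top2)
```

If exactly n rows are needed even when values tie, `with_ties = FALSE` can be passed to slice_max().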
library(tidyverse)
southpark_matrix <- structure(c(0, 0.165272735625452, 0.386480286121192, 0.170696960480773,
0.0869562860988618, 0.165272735625452, 0, 0.251690602341816,
0.472701602991984, 0.137486001150133, 0.386480286121192, 0.251690602341816,
0, 0.255849200006255, 0.0972813221214626, 0.170696960480773,
0.472701602991984, 0.255849200006255, 0, 0.156449701347234, 0.0869562860988618,
0.137486001150133, 0.0972813221214626, 0.156449701347234, 0), .Dim = c(5L,
5L), .Dimnames = list(Docs = c("Mr. Garrison_2", "Cartman_3",
"Mr. Garrison_3", "Cartman_4", "Jimbo_5"), Docs = c("Mr. Garrison_2",
"Cartman_3", "Mr. Garrison_3", "Cartman_4", "Jimbo_5")))
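# Convert the upper triangular part of the matrix to a 3-column long format.
# diag = TRUE also keeps the self-pairs on the diagonal; they are all 0 here,
# so they cannot reach the top. Use diag = FALSE to drop them entirely.
ind <- which(upper.tri(southpark_matrix, diag = TRUE), arr.ind = TRUE)
dimnam <- dimnames(southpark_matrix)
df <- data.frame(row = dimnam[[1]][ind[, 1]],
                 col = dimnam[[2]][ind[, 2]],
                 val = southpark_matrix[ind])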
# top_n method (keeps the original row order, so the result is not sorted)
df %>%
  tibble() %>%
  top_n(2, val)
#> # A tibble: 2 x 3
#>   row            col              val
#>   <chr>          <chr>          <dbl>
#> 1 Mr. Garrison_2 Mr. Garrison_3 0.386
#> 2 Cartman_3      Cartman_4      0.473
# arrange and head method (result is sorted by descending value)
df %>%
  tibble() %>%
  arrange(desc(val)) %>%
  head(2)
#> # A tibble: 2 x 3
#>   row            col              val
#>   <chr>          <chr>          <dbl>
#> 1 Cartman_3      Cartman_4      0.473
#> 2 Mr. Garrison_2 Mr. Garrison_3 0.386
Created on 2021-04-04 by the reprex package (v2.0.0)
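For a larger matrix (the question mentions more than 100 rows and columns), the same steps can be wrapped in a small function. The sketch below uses base R only; `top_pairs` is a hypothetical helper name, not something from the answer above, and it assumes a symmetric matrix with row and column names. It uses diag = FALSE so that self-pairs are never returned.

```r
# Hypothetical helper (not from the original answer): top n off-diagonal pairs
top_pairs <- function(m, n = 2) {
  # Positions strictly above the diagonal, so each pair appears only once
  ind <- which(upper.tri(m, diag = FALSE), arr.ind = TRUE)
  out <- data.frame(
    row = rownames(m)[ind[, 1]],
    col = colnames(m)[ind[, 2]],
    val = m[ind],
    stringsAsFactors = FALSE
  )
  # Sort by descending similarity and keep the first n rows
  out <- out[order(out$val, decreasing = TRUE), ]
  rownames(out) <- NULL
  head(out, n)
}

# Small symmetric example with three documents from the question (values rounded)
docs <- c("Cartman_3", "Mr. Garrison_3", "Cartman_4")
m <- matrix(c(0,         0.2516906, 0.4727016,
              0.2516906, 0,         0.2558492,
              0.4727016, 0.2558492, 0),
            nrow = 3, dimnames = list(docs, docs))

result <- top_pairs(m, n = 2)
print(result)
```

Calling top_pairs(southpark_matrix, n = 2) on the matrix from the question should follow the same path as the reprex, with n adjustable to however many pairs are needed.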