How to compute the frequencies of categorical variables by condition

Good afternoon,

Assume we have the following data set from UCI:

ballons=structure(list(YELLOW = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("PURPLE", 
"YELLOW"), class = "factor"), SMALL = structure(c(2L, 2L, 2L, 
2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L
), .Label = c("LARGE", "SMALL"), class = "factor"), STRETCH = structure(c(2L, 
2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 
1L, 1L), .Label = c("DIP", "STRETCH"), class = "factor"), ADULT = structure(c(1L, 
2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 
1L, 2L), .Label = c("ADULT", "CHILD"), class = "factor"), T = c(TRUE, 
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, 
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)), class = "data.frame", row.names = c(NA, 
-19L))
 # output :
   YELLOW SMALL STRETCH ADULT     T
1  YELLOW SMALL STRETCH ADULT  TRUE
2  YELLOW SMALL STRETCH CHILD FALSE
3  YELLOW SMALL     DIP ADULT FALSE
4  YELLOW SMALL     DIP CHILD FALSE
5  YELLOW LARGE STRETCH ADULT  TRUE
6  YELLOW LARGE STRETCH ADULT  TRUE
7  YELLOW LARGE STRETCH CHILD FALSE
8  YELLOW LARGE     DIP ADULT FALSE
9  YELLOW LARGE     DIP CHILD FALSE
10 PURPLE SMALL STRETCH ADULT  TRUE
11 PURPLE SMALL STRETCH ADULT  TRUE
12 PURPLE SMALL STRETCH CHILD FALSE
13 PURPLE SMALL     DIP ADULT FALSE
14 PURPLE SMALL     DIP CHILD FALSE
15 PURPLE LARGE STRETCH ADULT  TRUE
16 PURPLE LARGE STRETCH ADULT  TRUE
17 PURPLE LARGE STRETCH CHILD FALSE
18 PURPLE LARGE     DIP ADULT FALSE
19 PURPLE LARGE     DIP CHILD FALSE

Assume I have also applied a clustering algorithm and obtained the following result:

clusterss=data.frame(index=1:19,class=c(1,2,3,3,3,2,3,1,2,3,3,2,2,3,2,2,1,1,2))
> clusterss
   index class
1      1     1
2      2     2
3      3     3
4      4     3
5      5     3
6      6     2
7      7     3
8      8     1
9      9     2
10    10     3
11    11     3
12    12     2
13    13     2
14    14     3
15    15     2
16    16     2
17    17     1
18    18     1
19    19     2

Here the index variable identifies a row of ballons, and class is the cluster assigned to that row.

I know we can compute the frequencies of all the categorical variables like this:

> sapply(ballons,table)
       y1 y2 y3 y4 y5
PURPLE 10 10  8 11 12
YELLOW  9  9 11  8  7

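However, I need to compute this separately for each cluster. That is, for each class I need to select the observations that belong to it and then compute the frequencies. For example, when class == 1: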

library(dplyr)  # filter() below is dplyr::filter
# Expected results for the first cluster : class == 1
result1 <- filter(clusterss, class == 1)
sapply(ballons[result1[,1],],table)
       y1 y2 y3 y4 y5
PURPLE  2  3  2  3  3
YELLOW  2  1  2  1  1
# Expected results for the second cluster : class == 2
result2 <- filter(clusterss, class == 2)
sapply(ballons[result2[,1],],table)
       y1 y2 y3 y4 y5
PURPLE  5  5  3  4  5
YELLOW  3  3  5  4  3
# Expected results for the third cluster : class == 3
result3 <- filter(clusterss, class == 3)
sapply(ballons[result3[,1],],table)
       y1 y2 y3 y4 y5
PURPLE  3  2  3  4  4
YELLOW  4  5  4  3  3

I am looking for an efficient way to get these results (perhaps with the select function from dplyr). Thanks for your help!

Answer

You can pass an additional vector, here clusterss$class, to table:

sapply(ballons,table, clusterss$class)
#lapply(ballons,table, clusterss$class) #Alternative
#     YELLOW SMALL STRETCH ADULT T
#[1,]      2     3       2     3 3
#[2,]      2     1       2     1 1
#[3,]      5     5       3     4 5
#[4,]      3     3       5     4 3
#[5,]      3     2       3     4 4
#[6,]      4     5       4     3 3
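
The matrix above has lost its labels: sapply flattens each 2 x 3 level-by-cluster table column by column, so rows 1-2 belong to cluster 1, rows 3-4 to cluster 2, and rows 5-6 to cluster 3, with the two levels of each variable in their usual table order inside each pair. A minimal sketch of one way to keep the labels follows. It assumes a trimmed copy of the data (called ballons2 here, with only the first two columns rebuilt) and a plain cluster vector holding the values of clusterss$class; both names are illustrative and not part of the original post.

```r
# Trimmed, illustrative copy of the question's data: first two columns only.
ballons2 <- data.frame(
  YELLOW = factor(rep(c("YELLOW", "PURPLE"), times = c(9, 10))),
  SMALL  = factor(rep(c("SMALL", "LARGE", "SMALL", "LARGE"), times = c(4, 5, 5, 5)))
)
# Same values as clusterss$class in the question.
cluster <- c(1, 2, 3, 3, 3, 2, 3, 1, 2, 3, 3, 2, 2, 3, 2, 2, 1, 1, 2)

# Option 1: one labelled level-by-cluster table per column.
per_column <- lapply(ballons2, function(col) table(level = col, cluster = cluster))

# Option 2: split the rows by cluster first, then tabulate every column,
# which mirrors the per-class layout shown in the question.
per_cluster <- lapply(split(ballons2, cluster), function(d) lapply(d, table))

print(per_column$YELLOW)
print(per_cluster[["1"]])
```

per_column$YELLOW should print as a 2 x 3 table with level and cluster headers, while per_cluster[["1"]] holds the counts for class == 1 only. Using lapply instead of sapply keeps each table's own level names, at the cost of returning a list rather than a single matrix.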

