How to compute the frequencies of categorical variables by condition
Good afternoon,
Suppose we have the following dataset from UCI:
ballons=structure(list(YELLOW = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L,
2L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = c("PURPLE",
"YELLOW"), class = "factor"), SMALL = structure(c(2L, 2L, 2L,
2L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 1L, 1L
), .Label = c("LARGE", "SMALL"), class = "factor"), STRETCH = structure(c(2L,
2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 1L, 1L, 2L, 2L, 2L,
1L, 1L), .Label = c("DIP", "STRETCH"), class = "factor"), ADULT = structure(c(1L,
2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L,
1L, 2L), .Label = c("ADULT", "CHILD"), class = "factor"), T = c(TRUE,
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE,
FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)), class = "data.frame", row.names = c(NA,
-19L))
# output :
YELLOW SMALL STRETCH ADULT T
1 YELLOW SMALL STRETCH ADULT TRUE
2 YELLOW SMALL STRETCH CHILD FALSE
3 YELLOW SMALL DIP ADULT FALSE
4 YELLOW SMALL DIP CHILD FALSE
5 YELLOW LARGE STRETCH ADULT TRUE
6 YELLOW LARGE STRETCH ADULT TRUE
7 YELLOW LARGE STRETCH CHILD FALSE
8 YELLOW LARGE DIP ADULT FALSE
9 YELLOW LARGE DIP CHILD FALSE
10 PURPLE SMALL STRETCH ADULT TRUE
11 PURPLE SMALL STRETCH ADULT TRUE
12 PURPLE SMALL STRETCH CHILD FALSE
13 PURPLE SMALL DIP ADULT FALSE
14 PURPLE SMALL DIP CHILD FALSE
15 PURPLE LARGE STRETCH ADULT TRUE
16 PURPLE LARGE STRETCH ADULT TRUE
17 PURPLE LARGE STRETCH CHILD FALSE
18 PURPLE LARGE DIP ADULT FALSE
19 PURPLE LARGE DIP CHILD FALSE
Suppose I have also applied a clustering algorithm and obtained the following result:
clusterss=data.frame(index=1:19,class=c(1,2,3,3,3,2,3,1,2,3,3,2,2,3,2,2,1,1,2))
> clusterss
index class
1 1 1
2 2 2
3 3 3
4 4 3
5 5 3
6 6 2
7 7 3
8 8 1
9 9 2
10 10 3
11 11 3
12 12 2
13 13 2
14 14 3
15 15 2
16 16 2
17 17 1
18 18 1
19 19 2
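Here the index variable refers to the rows of ballons, and class is the cluster that each row of ballons was assigned to.
I know we can compute the frequencies of all the categorical variables with: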
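A minimal sketch of that index-to-row mapping, using hypothetical toy data (`df` and `cl` are stand-ins, not the objects from the question). It only matters if the cluster table is not already sorted in row order; in the question's clusterss it is, so this step would be a no-op there.

```r
# Hypothetical toy data: 4 observations and a deliberately shuffled cluster table.
df <- data.frame(colour = factor(c("YELLOW", "PURPLE", "YELLOW", "PURPLE")))
cl <- data.frame(index = c(3, 1, 4, 2), class = c(2, 1, 2, 1))

# match() finds, for each row number of df, the position of that row in cl$index,
# so the cluster labels end up in the same order as the rows of df.
df$class <- cl$class[match(seq_len(nrow(df)), cl$index)]
df
```

With the labels aligned this way, a cluster vector can be used directly alongside the data without going through the index column again.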
> sapply(ballons,table)
YELLOW SMALL STRETCH ADULT T
PURPLE 10 10 8 11 12
YELLOW 9 9 11 8 7
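One thing worth noting about this output: the row labels are taken from the first column only, so "PURPLE" and "YELLOW" really stand for the first and second level of each variable. A small sketch with a hypothetical two-column data frame (`toy` is not part of the question's data) shows the difference between sapply and lapply here:

```r
# Hypothetical toy data with two factors that have different levels.
toy <- data.frame(
  colour = factor(c("YELLOW", "PURPLE", "YELLOW")),
  size   = factor(c("SMALL", "LARGE", "LARGE"))
)

# sapply() simplifies to a matrix; its row names come from the first table only.
m <- sapply(toy, table)
rownames(m)

# lapply() keeps one table per column, each labelled with its own levels.
l <- lapply(toy, table)
names(l$size)
```

If the level names matter when reading the result, the list returned by lapply is the safer form.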
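However, I need to compute this separately for each cluster. That is, for each class I need to select the corresponding observations and then compute the frequencies. For example, for class == 1: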
# Expected results for the first cluster : class == 1
library(dplyr)  # filter() on a data frame comes from dplyr
result1 <- filter(clusterss, class == 1)
sapply(ballons[result1[,1],],table)
YELLOW SMALL STRETCH ADULT T
PURPLE 2 3 2 3 3
YELLOW 2 1 2 1 1
# Expected results for the second cluster : class == 2
result2 <- filter(clusterss, class == 2)
sapply(ballons[result2[,1],],table)
YELLOW SMALL STRETCH ADULT T
PURPLE 5 5 3 4 5
YELLOW 3 3 5 4 3
# Expected results for the third cluster : class == 3
result3 <- filter(clusterss, class == 3)
sapply(ballons[result3[,1],],table)
YELLOW SMALL STRETCH ADULT T
PURPLE 3 2 3 4 4
YELLOW 4 5 4 3 3
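I am looking for an efficient way to get these results (perhaps with the select function from dplyr). Thanks for your help!
Answer
You can pass an additional vector, here clusterss$class, to table, which then cross-tabulates each column against the cluster. Because sapply flattens each 2 x 3 table column by column, rows 1-2 of the output are cluster 1, rows 3-4 are cluster 2, and rows 5-6 are cluster 3: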
sapply(ballons,table, clusterss$class)
#lapply(ballons,table, clusterss$class) #Alternative
# YELLOW SMALL STRETCH ADULT T
#[1,] 2 3 2 3 3
#[2,] 2 1 2 1 1
#[3,] 5 5 3 4 5
#[4,] 3 3 5 4 3
#[5,] 3 2 3 4 4
#[6,] 4 5 4 3 3
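The flattened matrix loses the level and cluster labels. As a sketch of two ways to keep them, the block below rebuilds the question's data from the integer codes in its dput output, then (a) uses the lapply alternative so that each column keeps its own labelled table, and (b) splits the data by cluster first, which reproduces the per-cluster layout the question asked for. The names `tabs` and `by_cluster` are illustrative only.

```r
# Rebuild the question's data from the integer codes in its dput() output.
ballons <- data.frame(
  YELLOW  = factor(c(2,2,2,2,2,2,2,2,2,1,1,1,1,1,1,1,1,1,1), labels = c("PURPLE", "YELLOW")),
  SMALL   = factor(c(2,2,2,2,1,1,1,1,1,2,2,2,2,2,1,1,1,1,1), labels = c("LARGE", "SMALL")),
  STRETCH = factor(c(2,2,1,1,2,2,2,1,1,2,2,2,1,1,2,2,2,1,1), labels = c("DIP", "STRETCH")),
  ADULT   = factor(c(1,2,1,2,1,1,2,1,2,1,1,2,1,2,1,1,2,1,2), labels = c("ADULT", "CHILD")),
  T       = c(TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE,
              TRUE, FALSE, FALSE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE)
)
clusterss <- data.frame(index = 1:19,
                        class = c(1,2,3,3,3,2,3,1,2,3,3,2,2,3,2,2,1,1,2))

# (a) One labelled level-by-cluster table per column.
tabs <- lapply(ballons, table, clusterss$class)
tabs$YELLOW   # counts per cluster 1, 2, 3: PURPLE 2 5 3, YELLOW 2 3 4

# (b) One level-by-column matrix per cluster, as in the question's expected results.
# Caveat: a logical column such as T is only tabulated for the values present,
# so a cluster with only TRUE or only FALSE would stop sapply() from simplifying.
by_cluster <- lapply(split(ballons, clusterss$class),
                     function(d) sapply(d, table))
by_cluster[["1"]]
```

Both forms assume that clusterss is in the same row order as ballons, which holds here because index runs from 1 to 19.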