Cumulative count of distinct users across years
Suppose I have a data.frame like this:
user_df = read.table(text = "id industry year
1 Government 1999
2 Government 1999
3 Government 1999
4 Private 1999
5 NGO 1999
1 Government 2000
2 Government 2000
3 Government 2000
4 Government 2000
1 Government 2001
5 Government 2001
2 Private 2001
3 Private 2001
4 Private 2001", header = TRUE)
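For each user I have a unique id, an industry, and a year.
I'm trying to compute a cumulative count of the people who have ever worked in Government, so the count for each year should be the number of unique users seen in that year and in all prior years.
I know I can do an ordinary cumulative sum like this: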
user_df %>% group_by(year, industry) %>% summarize(cum_sum = cumsum(n_distinct(id)))
year industry cum_sum
<int> <chr> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 2
6 2001 Private 3
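However, this isn't what I want, because each year is counted on its own and people already counted in 1999 are not carried into the totals for 2000 and 2001. I want each year's value to be a cumulative count of the unique users who have ever worked in Government up to that year. I couldn't figure out a way to do this in dplyr.
So the correct output should look like this: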
year industry cum_sum
<int> <chr> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 5
6 2001 Private 3
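As a sanity check on the expected Government figures, the definition can be applied directly in base R: for each year, count the distinct ids seen in that year or any earlier one. This is only a minimal sketch; the helper name `cumulative_users` is hypothetical and not part of any package.

```r
user_df <- read.table(text = "id industry year
1 Government 1999
2 Government 1999
3 Government 1999
4 Private 1999
5 NGO 1999
1 Government 2000
2 Government 2000
3 Government 2000
4 Government 2000
1 Government 2001
5 Government 2001
2 Private 2001
3 Private 2001
4 Private 2001", header = TRUE)

# Hypothetical helper: for one industry, count the distinct ids
# observed in each year or in any earlier year.
cumulative_users <- function(df, ind) {
  sub <- df[df$industry == ind, ]
  years <- sort(unique(sub$year))
  counts <- sapply(years, function(y) length(unique(sub$id[sub$year <= y])))
  data.frame(year = years, industry = ind, cum_sum = counts)
}

gov <- cumulative_users(user_df, "Government")
print(gov)
```

For Government this gives 3, 4 and 5 for 1999, 2000 and 2001, matching the expected output above.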
Answer
One option could be:
user_df %>%
group_by(industry) %>%
mutate(cum_sum = cumsum(!duplicated(id))) %>%
group_by(year, industry) %>%
summarise(cum_sum = max(cum_sum))
year industry cum_sum
<int> <fct> <int>
1 1999 Government 3
2 1999 NGO 1
3 1999 Private 1
4 2000 Government 4
5 2001 Government 5
6 2001 Private 3
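Note that `cumsum(!duplicated(id))` counts first appearances in row order, so the pipeline above relies on the rows already being in chronological order within each industry. A variant that sorts explicitly, assuming dplyr 1.0.0 or later (for the `.groups` argument), might look like the sketch below; the row shuffling is only there to illustrate that the result no longer depends on input order.

```r
library(dplyr)

user_df <- read.table(text = "id industry year
1 Government 1999
2 Government 1999
3 Government 1999
4 Private 1999
5 NGO 1999
1 Government 2000
2 Government 2000
3 Government 2000
4 Government 2000
1 Government 2001
5 Government 2001
2 Private 2001
3 Private 2001
4 Private 2001", header = TRUE)

# Reverse the rows to simulate unsorted input (illustration only).
shuffled <- user_df[rev(seq_len(nrow(user_df))), ]

result <- shuffled %>%
  arrange(year) %>%                              # restore chronological order
  group_by(industry) %>%
  mutate(cum_sum = cumsum(!duplicated(id))) %>%  # running count of first-time ids
  group_by(year, industry) %>%
  summarise(cum_sum = max(cum_sum), .groups = "drop")

print(result)
```

Sorting by `year` before grouping means an id is marked as new in the earliest year it appears, regardless of how the input rows were ordered.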