How can I calculate a percentage using two groups in a groupby?
Given the following data frame:
+------+-----+-----+
| Year | Cat | Bin |
+------+-----+-----+
| 2000 | A | 0 |
| 2000 | A | 1 |
| 2001 | A | 0 |
| 2001 | B | 1 |
| 2001 | B | 0 |
| 2001 | B | 1 |
+------+-----+-----+
import pandas as pd

d = {
    'year': [2000, 2000, 2001, 2001, 2001, 2001],
    'cat': ["A", "A", "A", "B", "B", "B"],
    'bin': [0, 1, 0, 1, 0, 1],
}
df = pd.DataFrame(data=d)
I want to create the following table:
+------+-----+------+-------+------+
| year | cat | mean | count | pct |
+------+-----+------+-------+------+
| 2000 | A | 0.5 | 2 | 100% |
| 2001 | A | 0 | 1 | 25% |
| 2001 | B | 0.67 | 3 | 75% |
+------+-----+------+-------+------+
where pct is the count by cat and year, expressed as a percentage of the count by year.
I have already got the first two columns (mean and count):
df["count"] = 1
df_groupby = df.groupby(["year", "cat"]).agg({"bin": "mean", "count": "sum"})
df_groupby.rename(columns={"bin": "mean"}, inplace=True)
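But I don't know how to create the third column (pct)?
Answer
Group by year and cat to compute mean and count, then use value_counts on the year column to get the number of rows per year, and divide each (year, cat) count by its year's total to get the percentage: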
s = df.groupby(['year', 'cat'])['bin'].agg(['mean', 'count'])
s['pct'] = s['count'].div(df['year'].value_counts(), level=0, axis=0).mul(100)
              mean  count    pct
year cat
2000 A    0.500000      2  100.0
2001 A    0.000000      1   25.0
     B    0.666667      3   75.0
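To make the level=0 alignment in that division more explicit, here is a minimal sketch of an equivalent calculation that builds the per-year total with a groupby transform instead of value_counts. The name year_total is introduced here for illustration only and is not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001, 2001, 2001],
    'cat': ["A", "A", "A", "B", "B", "B"],
    'bin': [0, 1, 0, 1, 0, 1],
})

s = df.groupby(['year', 'cat'])['bin'].agg(['mean', 'count'])

# Sum the (year, cat) counts within each year and broadcast the total
# back onto every (year, cat) row, so both sides share the same index.
# year_total is a hypothetical helper name used only in this sketch.
year_total = s.groupby(level='year')['count'].transform('sum')

# Share of each (year, cat) group within its year, as a percentage.
s['pct'] = s['count'] / year_total * 100
print(s)
```

This should give the same pct column as the value_counts version; which one reads better is mostly a matter of taste.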
Answer
Use SeriesGroupBy.value_counts and add the new column with DataFrame.join:
s = df.groupby("year")['cat'].value_counts(normalize=True).mul(100)
df1 = df.groupby(["year", "cat"]).agg(mean = ("bin", "mean"),
                                      count = ("bin", "count")).join(s.rename('pct'))
print(df1)
              mean  count    pct
year cat
2000 A    0.500000      2  100.0
2001 A    0.000000      1   25.0
     B    0.666667      3   75.0
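It may help to look at the intermediate Series s on its own to see why the join lines up. The sketch below rebuilds the sample data and inspects s; it assumes nothing beyond the calls already used in this answer.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001, 2001, 2001],
    'cat': ["A", "A", "A", "B", "B", "B"],
    'bin': [0, 1, 0, 1, 0, 1],
})

# Share of each cat within its year, scaled to a percentage.
s = df.groupby("year")['cat'].value_counts(normalize=True).mul(100)

# s is keyed by (year, cat), the same keys that groupby(["year", "cat"])
# produces, so join can match rows on the index.
print(s)

# Within a year, value_counts orders entries by frequency (B before A
# in 2001), but join aligns on index labels, so the order does not matter.
print(s.loc[(2001, "B")])
```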
Or with assign:
s = df.groupby("year")['cat'].value_counts(normalize=True).mul(100)
df1 = df.groupby(["year", "cat"]).agg(mean = ("bin", "mean"),
                                      count = ("bin", "count")).assign(pct = s)
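If the goal is to match the layout of the table in the question (flat columns, mean rounded to two decimals, pct shown as "25%"), a possible follow-up is sketched below. The name out and the formatting choices are assumptions made for illustration; converting pct to text is for display only, since it makes the column non-numeric.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001, 2001, 2001],
    'cat': ["A", "A", "A", "B", "B", "B"],
    'bin': [0, 1, 0, 1, 0, 1],
})

s = df.groupby("year")['cat'].value_counts(normalize=True).mul(100)
df1 = (
    df.groupby(["year", "cat"])
      .agg(mean=("bin", "mean"), count=("bin", "count"))
      .assign(pct=s)  # aligned on the (year, cat) index
)

# Presentation only: turn the index back into columns, round the mean,
# and render pct as a string with a percent sign.
out = df1.reset_index()
out["mean"] = out["mean"].round(2)
out["pct"] = out["pct"].round().astype(int).astype(str) + "%"
print(out)
```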