How do I calculate a percentage using two groups in a groupby?

Given the following dataframe:

+------+-----+-----+
| Year | Cat | Bin |
+------+-----+-----+
| 2000 | A   |   0 |
| 2000 | A   |   1 |
| 2001 | A   |   0 |
| 2001 | B   |   1 |
| 2001 | B   |   0 |
| 2001 | B   |   1 |
+------+-----+-----+

import pandas as pd

d = {
    'year': [2000, 2000, 2001, 2001, 2001, 2001],
    'cat': ["A", "A", "A", "B", "B", "B"],
    'bin': [0, 1, 0, 1, 0, 1],
}
df = pd.DataFrame(data=d)

I would like to create the following table:

+------+-----+------+-------+------+
| year | cat | mean | count | pct  |
+------+-----+------+-------+------+
| 2000 | A   |  0.5 |     2 | 100% |
| 2001 | A   |    0 |     1 | 25%  |
| 2001 | B   | 0.67 |     3 | 75%  |
+------+-----+------+-------+------+

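where pct is the count by cat & year as a percentage of the count by year. For example, 2001 has 4 rows in total, 1 of which is cat A, so pct for (2001, A) is 25%.

I have already got the first two columns: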

df["count"] = 1
df_groupby = df.groupby(["year", "cat"]).agg({"bin": "mean", "count": "sum"})
df_groupby.rename(columns={"bin": "mean"}, inplace=True)

But I don't know how to create the third column.

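Answer

Group by year and cat to compute mean and count. Then use value_counts on the year column to get the number of rows per year, and divide each (year, cat) count by its year's total to get the percentage: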

s = df.groupby(['year', 'cat'])['bin'].agg(['mean', 'count'])
s['pct'] = s['count'].div(df['year'].value_counts(), level=0, axis=0).mul(100)

              mean  count    pct
year cat                        
2000 A    0.500000      2  100.0
2001 A    0.000000      1   25.0
     B    0.666667      3   75.0
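
A variation on the same idea, offered as a sketch rather than as part of the answer above: the per-year totals can also be taken from the aggregated counts themselves with a groupby on the `year` index level and `transform('sum')`, so the division needs no `level` argument. The name `year_totals` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001, 2001, 2001],
    'cat': ["A", "A", "A", "B", "B", "B"],
    'bin': [0, 1, 0, 1, 0, 1],
})

# mean and count of bin for every (year, cat) pair
s = df.groupby(['year', 'cat'])['bin'].agg(['mean', 'count'])

# Sum the counts within each year and broadcast the total back to every
# (year, cat) row, so year_totals has the same index as s.
year_totals = s.groupby(level='year')['count'].transform('sum')

# Share of each (year, cat) count within its year, as a percentage
s['pct'] = s['count'].div(year_totals).mul(100)

print(s)
```

Because `transform` returns a Series aligned to the index of `s`, this is a plain element-wise division.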


Answer

Use SeriesGroupBy.value_counts with normalize=True to get each cat's share within its year, then add it as a new column with DataFrame.join:

s = df.groupby("year")['cat'].value_counts(normalize=True).mul(100)
df1 = df.groupby(["year", "cat"]).agg(mean = ("bin", "mean"),
                                      count = ("bin", "count")).join(s.rename('pct'))

print(df1)
              mean  count    pct
year cat                        
2000 A    0.500000      2  100.0
2001 A    0.000000      1   25.0
     B    0.666667      3   75.0

Or add the column with DataFrame.assign, which aligns s on the (year, cat) index:

s = df.groupby("year")['cat'].value_counts(normalize=True).mul(100)
df1 = df.groupby(["year", "cat"]).agg(mean = ("bin", "mean"),
                                      count = ("bin", "count")).assign(pct = s)
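
The table in the question shows year and cat as ordinary columns and pct as text such as 25%. If that exact layout is wanted, one possible follow-up step is sketched below. It is an assumption about the desired presentation, not something either answer states, and the name `out` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'year': [2000, 2000, 2001, 2001, 2001, 2001],
    'cat': ["A", "A", "A", "B", "B", "B"],
    'bin': [0, 1, 0, 1, 0, 1],
})

# Share of each cat within its year, as a percentage
s = df.groupby("year")['cat'].value_counts(normalize=True).mul(100)

df1 = (df.groupby(["year", "cat"])
         .agg(mean=("bin", "mean"), count=("bin", "count"))
         .assign(pct=s))

# Turn the (year, cat) index back into columns
out = df1.reset_index()

# Presentation only: round the mean and render pct as a string like "25%".
# After this step pct is text, so do any further arithmetic before it.
out['mean'] = out['mean'].round(2)
out['pct'] = out['pct'].round().astype(int).astype(str) + '%'

print(out)
```

Keeping the numeric pct in df1 and formatting only the copy used for display avoids losing the values for later calculations.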

