Performing an aggregation on MultiIndex columns
I start with this DataFrame:
import pandas as pd

df = pd.DataFrame(
    [
        ["a", "aa", "2020-12-20", 10],
        ["a", "ab", "2020-12-26", 11],
        ["a", "aa", "2020-12-22", 10],
        ["b", "bb", "2020-12-25", 111],
        ["c", "bb", "2020-12-20", 20],
        ["d", "dd", "2020-12-05", 1111]
    ],
    columns=["cat", "user", "date", "value"]
)
# Parse the date strings so the column can be used for weekly grouping
df["date"] = pd.to_datetime(df.date)
| | cat | user | date | value |
|---|---|---|---|---|
| 0 | a | aa | 2020-12-20 00:00:00 | 10 |
| 1 | a | ab | 2020-12-26 00:00:00 | 11 |
| 2 | a | aa | 2020-12-22 00:00:00 | 10 |
| 3 | b | bb | 2020-12-25 00:00:00 | 111 |
| 4 | c | bb | 2020-12-20 00:00:00 | 20 |
| 5 | d | dd | 2020-12-05 00:00:00 | 1111 |
Answer
To select from MultiIndex columns, use a tuple. Here the tuple is wrapped in a one-element list, so the result stays a DataFrame:
print (gb.groupby(level=0)[[("value", "sum")]].mean())
      value
        sum
cat
a      15.5
b     111.0
c      20.0
d    1111.0
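The definition of gb is not shown in this excerpt. As a minimal sketch, assuming it was built by passing a dict of lists to agg (which is one common way to end up with MultiIndex columns), the selection above can be reproduced like this:

```python
import pandas as pd

df = pd.DataFrame(
    [
        ["a", "aa", "2020-12-20", 10],
        ["a", "ab", "2020-12-26", 11],
        ["a", "aa", "2020-12-22", 10],
        ["b", "bb", "2020-12-25", 111],
        ["c", "bb", "2020-12-20", 20],
        ["d", "dd", "2020-12-05", 1111],
    ],
    columns=["cat", "user", "date", "value"],
)
df["date"] = pd.to_datetime(df["date"])

# Assumption: a dict-of-lists agg, which yields two-level column labels
# such as ("value", "sum") and ("user", "nunique").
gb = (
    df.set_index("date")
    .groupby(["cat", pd.Grouper(freq="W")])
    .agg({"value": ["sum"], "user": ["nunique"]})
)

# A list containing one tuple selects one column and keeps a DataFrame.
weekly_sum = gb[[("value", "sum")]]

# Average the weekly sums per category (first index level).
result = weekly_sum.groupby(level=0).mean()
print(result)
```

The columns of result are still a MultiIndex, so the single column is again addressed as ("value", "sum").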
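Alternatively, a simpler solution is to call mean with the level argument (note that this argument was deprecated in pandas 1.3 and removed in 2.0, so on current versions use the groupby(level=0) form shown above):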
print (gb[[("value", "sum")]].mean(level=0))
      value
        sum
cat
a      15.5
b     111.0
c      20.0
d    1111.0
To select a Series instead, omit the nested list:
print (gb[("value", "sum")].mean(level=0))
cat
a 15.5
b 111.0
c 20.0
d 1111.0
Name: (value, sum), dtype: float64
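On pandas 2.0 and later, mean no longer accepts a level argument, so the Series version has to go through groupby as well. The sketch below uses a small hand-built Series (the name weekly_sum and the string date labels are illustrative only) to show the equivalent call:

```python
import pandas as pd

# Hand-built stand-in for gb[("value", "sum")]: weekly sums indexed by (cat, date).
index = pd.MultiIndex.from_tuples(
    [("a", "2020-12-20"), ("a", "2020-12-27"), ("b", "2020-12-27")],
    names=["cat", "date"],
)
weekly_sum = pd.Series([10, 21, 111], index=index, name="val")

# Same result as the older weekly_sum.mean(level=0):
# group on the "cat" index level, then average within each group.
per_cat = weekly_sum.groupby(level="cat").mean()
print(per_cat)
```

Grouping by level name rather than position keeps the call readable if the index levels are ever reordered.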
Your solution should be changed to use named aggregation, which avoids a MultiIndex in the columns:
gb = (
    df.set_index("date")
      .groupby(["cat", pd.Grouper(freq='W')])
      .agg(val = ("value", "sum"),
           nuniq = ("user", "nunique"),
           unqiue_users = ("user", lambda x: x.unique()))
)
print (gb)
                 val  nuniq unqiue_users
cat date
a   2020-12-20    10      1           aa
    2020-12-27    21      2     [ab, aa]
b   2020-12-27   111      1           bb
c   2020-12-20    20      1           bb
d   2020-12-06  1111      1           dd
print (gb['val'].mean(level=0))
cat
a 15.5
b 111.0
c 20.0
d 1111.0
Name: val, dtype: float64
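If the aggregation call itself cannot be changed, another option is to flatten the two-level column labels after the fact. This is a sketch of a common idiom rather than part of the answer above; the frame wide and the "_" separator are illustrative choices:

```python
import pandas as pd

# Small stand-in for an aggregated frame with two-level column labels.
wide = pd.DataFrame(
    [[10, 1], [21, 2]],
    index=pd.Index(["x", "y"], name="key"),
    columns=pd.MultiIndex.from_tuples([("value", "sum"), ("user", "nunique")]),
)

# Join each (column, aggregation) tuple into a single string label.
flat = wide.copy()
flat.columns = ["_".join(col) for col in flat.columns]
print(flat)
```

After flattening, columns are selected with plain strings such as flat["value_sum"] instead of tuples.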