Groupby and drop NaN rows while keeping one row per group in Pandas
Given a test dataset as follows:
id city name
0 1 bj NaN
1 2 bj jack
2 3 bj NaN
3 4 bj jim
4 5 sh NaN
5 6 sh NaN
6 7 sh steve
7 8 sh fiona
8 9 sh NaN
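How can I group by city and drop the rows where name is NaN, while keeping only one NaN row for each group? Many thanks.
The expected result would look like this: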
id city name
0 1 bj NaN
1 2 bj jack
2 4 bj jim
3 5 sh NaN
4 7 sh steve
5 8 sh fiona
Here is a new dataset read from an Excel file with df = pd.read_clipboard(na_filter = False). Please note that N/A should not be treated as NaN:
newcode build_name floor rent_id rent_name
0 1210010403 C? 25 1765228 ????????????
1 1210010403 C? 25 1765229 N/A
2 1210010403 C? 25 1765229 N/A
3 1210010403 C? 25 1765229 N/A
4 1210010403 C? 25 1765230 ????????????
5 1210010403 C? 25 1765229 N/A
6 1210010403 C? 25 1765231 ??????????
7 1210010403 C? 25 1765232 N/A
8 1210010403 C? 25 1765231 ???NA???????
Code: df[df['rent_name'].ne('N/A') | ~df.duplicated(subset=['newcode', 'build_name', 'floor'])], which gives the same result as df[~(df['rent_name'].eq('N/A') & df.duplicated(subset = ['newcode', 'build_name', 'floor'], keep = 'first'))]
Out:
newcode build_name floor rent_id rent_name
0 1210010403 C? 25 1765228 ????????????
4 1210010403 C? 25 1765230 ????????????
6 1210010403 C? 25 1765231 ??????????
8 1210010403 C? 25 1765231 ???NA???????
You can see that the N/A row is missing from the result, and I don't know why.
Desired output:
newcode build_name floor rent_id rent_name
0 1210010403 C? 25 1765228 ????????????
1 1210010403 C? 25 1765229 N/A
4 1210010403 C? 25 1765230 ????????????
6 1210010403 C? 25 1765231 ??????????
8 1210010403 C? 25 1765231 ???NA???????
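Answer
Use a boolean mask to select the rows where name is NaN and that duplicate an earlier row on city and name (keeping the first occurrence), then drop them: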
df[~(df.name.isna() & df.duplicated(subset = ['city', 'name'], keep = 'first'))]
id city name
0 1 bj NaN
1 2 bj jack
3 4 bj jim
4 5 sh NaN
6 7 sh steve
7 8 sh fiona
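The same idea seems to carry over to the second dataset in the question. A likely reason the N/A row goes missing there is that the subset passed to duplicated leaves out rent_name: every row shares the same newcode, build_name and floor, so every row after row 0 counts as a duplicate, including the first N/A row. The sketch below rebuilds both datasets by hand and adds rent_name to the subset. The rent_name values other than N/A are hypothetical placeholders (the originals are garbled in the post), and the variable names df2, attempt and fixed are made up for illustration.

```python
import numpy as np
import pandas as pd

# First dataset from the question; NaN marks a missing name.
df = pd.DataFrame({
    "id": range(1, 10),
    "city": ["bj"] * 4 + ["sh"] * 5,
    "name": [np.nan, "jack", np.nan, "jim",
             np.nan, np.nan, "steve", "fiona", np.nan],
})

# Drop a row only when its name is missing AND an earlier row
# already has the same (city, name) pair.
result = df[~(df["name"].isna()
              & df.duplicated(subset=["city", "name"], keep="first"))]
print(result)

# Second dataset; "N/A" is a plain string here, not NaN.
# The other rent_name values are hypothetical placeholders.
df2 = pd.DataFrame({
    "newcode": [1210010403] * 9,
    "build_name": ["C?"] * 9,
    "floor": [25] * 9,
    "rent_id": [1765228, 1765229, 1765229, 1765229, 1765230,
                1765229, 1765231, 1765232, 1765231],
    "rent_name": ["tenant_a", "N/A", "N/A", "N/A", "tenant_b",
                  "N/A", "tenant_c", "N/A", "tenant_d"],
})

group_cols = ["newcode", "build_name", "floor"]

# The attempt from the question: without rent_name in the subset,
# the first N/A row is already a duplicate of row 0 and is dropped.
attempt = df2[df2["rent_name"].ne("N/A")
              | ~df2.duplicated(subset=group_cols)]

# Including rent_name keeps the first N/A row in each group.
fixed = df2[~(df2["rent_name"].eq("N/A")
              & df2.duplicated(subset=group_cols + ["rent_name"],
                               keep="first"))]
print(fixed)
```

Both results keep their original index labels. Appending .reset_index(drop=True) would renumber the rows from 0, as in the expected result shown in the question.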