Find groups where all values are NaN
I have this dataframe:
import pandas as pd
from io import StringIO
u_cols = ['word','count']
audit_trail = StringIO('''
test 1
testing 24
tested
again 5
begin
again 6
begin
again
''')
df = pd.read_csv(audit_trail, sep=" ", names=u_cols)
I can use groupby to see which values in each group are null:
df.groupby('word')['count'].apply(list)
word
again [5.0, 6.0, nan]
begin [nan, nan]
test [1.0]
tested [nan]
testing [24.0]
Name: count, dtype: object
If a group has more than one row and all of its values are NaN, that group should be excluded.
Expected:
again [5.0, 6.0, nan]
test [1.0]
tested [nan]
testing [24.0]
To get this result, the word "begin" has to be dropped from the dataframe.
df[df.word != "begin"]
word count
0 test 1.0
1 testing 24.0
2 tested NaN
3 again 5.0
5 again 6.0
7 again NaN
How can I find which words should be removed?
Answer
1. groupby with filter
df.groupby('word').filter(lambda x: not (x['count'].isna().all() and len(x) > 1))
2. groupby with transform
g = df.groupby('word')['count']
m = g.transform('count').eq(0) & g.transform('size').gt(1)
df[~m]
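The mask above relies on the difference between 'count' (non-NaN values per group) and 'size' (all rows per group, NaN included). A minimal sketch that makes this visible per row, assuming the same data as in the question; the column names non_null, group_size and drop are hypothetical and exist only for illustration:

```python
import pandas as pd
from io import StringIO

# Same rows as in the question; "tested", "begin" and the last "again" have no count.
data = StringIO("test 1\ntesting 24\ntested\nagain 5\nbegin\nagain 6\nbegin\nagain\n")
df = pd.read_csv(data, sep=" ", names=["word", "count"])

g = df.groupby("word")["count"]
debug = df.assign(
    non_null=g.transform("count"),   # non-NaN values in the row's group
    group_size=g.transform("size"),  # all rows in the row's group, NaN included
)
# A row is dropped only if its group has no valid value AND more than one row.
debug["drop"] = debug["non_null"].eq(0) & debug["group_size"].gt(1)
print(debug)
```

Here "tested" has non_null 0 but group_size 1, so it is kept, while both "begin" rows have non_null 0 and group_size 2, so they are flagged.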
3. groupby with agg
s = df.groupby('word')['count'].agg(['count', 'size'])
m = s['count'].eq(0) & s['size'].gt(1)
df[~df['word'].isin(m[m].index)]
Each of the three approaches gives the same result:
word count
0 test 1.0
1 testing 24.0
2 tested NaN
3 again 5.0
5 again 6.0
7 again NaN
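To answer the question literally (which words should be removed), the agg variant can return the list of words directly. The sketch below assumes the same data as in the question and also checks that the three approaches agree on it; the names to_drop, via_filter, via_transform and via_agg are hypothetical:

```python
import pandas as pd
from io import StringIO

data = StringIO("test 1\ntesting 24\ntested\nagain 5\nbegin\nagain 6\nbegin\nagain\n")
df = pd.read_csv(data, sep=" ", names=["word", "count"])

# One row per word: number of non-NaN values and total number of rows.
stats = df.groupby("word")["count"].agg(["count", "size"])
mask = stats["count"].eq(0) & stats["size"].gt(1)
to_drop = stats[mask].index.tolist()
print(to_drop)  # ['begin']

# The three approaches from the answer, applied to the same dataframe.
via_filter = df.groupby("word").filter(
    lambda x: not (x["count"].isna().all() and len(x) > 1)
)
g = df.groupby("word")["count"]
via_transform = df[~(g.transform("count").eq(0) & g.transform("size").gt(1))]
via_agg = df[~df["word"].isin(to_drop)]

print(via_filter.equals(via_transform) and via_transform.equals(via_agg))
```

On larger data the transform and agg variants are usually preferable to filter, since filter calls a Python function once per group.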