Find rows belonging to groups whose values are all NaN

I have this DataFrame:

from io import StringIO
import pandas as pd
u_cols = ['word','count']
audit_trail = StringIO('''
test 1
testing 24
tested 
again 5
begin 
again 6
begin
again
''')

df = pd.read_csv(audit_trail, sep=" ", names=u_cols)

With groupby I can see which values in each group are null:

df.groupby('word')['count'].apply(list)

word
again      [5.0, 6.0, nan]
begin           [nan, nan]
test                 [1.0]
tested               [nan]
testing             [24.0]
Name: count, dtype: object
    

If a group has more than one row and all of its values are NaN, that group should be excluded.

Expected: 
again      [5.0, 6.0, nan]
test                 [1.0]
tested               [nan]
testing             [24.0]

To get this result, the word "begin" has to be removed from the DataFrame.

df[df.word != "begin"]
      word  count
0     test    1.0
1  testing   24.0
2   tested    NaN
3    again    5.0
5    again    6.0
7    again    NaN

How do I find which words should be removed?

Answer

1. groupby + filter

df.groupby('word').filter(lambda x: not (x['count'].isna().all() and len(x) > 1))

2. groupby + transform

g = df.groupby('word')['count']
m = g.transform('count').eq(0) & g.transform('size').gt(1)
df[~m]
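
The mask in option 2 works because 'count' ignores NaN while 'size' counts every row in the group. Below is a minimal sketch on the same data, built directly rather than parsed from text, that puts the two transforms side by side. The column names non_null, rows and drop are illustrative only and do not appear in the original.

```python
import pandas as pd

nan = float("nan")
# Same rows as the question's DataFrame, constructed directly.
df = pd.DataFrame({
    "word": ["test", "testing", "tested", "again", "begin", "again", "begin", "again"],
    "count": [1, 24, nan, 5, nan, 6, nan, nan],
})

g = df.groupby("word")["count"]
check = df.assign(
    non_null=g.transform("count"),  # non-NaN values in this row's group
    rows=g.transform("size"),       # all rows in this row's group, NaN included
)
# A row is dropped only if its group has no non-NaN value and more than one row.
check["drop"] = check["non_null"].eq(0) & check["rows"].gt(1)
print(check)
```

Here "tested" is kept even though its only value is NaN, because its group has a single row.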

3. groupby + agg

s = df.groupby('word')['count'].agg(['count', 'size'])
m = s['count'].eq(0) & s['size'].gt(1)
df[~df['word'].isin(m[m].index)]

      word  count
0     test    1.0
1  testing   24.0
2   tested    NaN
3    again    5.0
5    again    6.0
7    again    NaN
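
To answer the literal question of which words get removed, the per-group mask from option 3 can be turned into a plain list. This is a small sketch under the same assumptions as the question's data. The names stats, all_nan_multi and words_to_drop are chosen here for illustration.

```python
import pandas as pd

nan = float("nan")
# Same rows as the question's DataFrame, constructed directly.
df = pd.DataFrame({
    "word": ["test", "testing", "tested", "again", "begin", "again", "begin", "again"],
    "count": [1, 24, nan, 5, nan, 6, nan, nan],
})

# One row per word: number of non-NaN values and total number of rows.
stats = df.groupby("word")["count"].agg(["count", "size"])
all_nan_multi = stats["count"].eq(0) & stats["size"].gt(1)

# Words whose group has more than one row and only NaN values.
words_to_drop = all_nan_multi[all_nan_multi].index.tolist()
print(words_to_drop)  # ['begin']

# Filtering on that list gives the same rows as the output shown above.
result = df[~df["word"].isin(words_to_drop)]
```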

