根据选定的列过滤重复的行并与Pandas中的另一个数据框进行比较
给定两个数据帧,如下所示:
import pandas as pd
# Creating a DataFrame object
df1 = pd.DataFrame([('Stuti', 28, 'Varanasi'),
('Saumya', 32, 'Delhi'),
('Aaditya', 25, 'Mumbai'),
('Saumya', 32, 'Delhi')],
columns = ['Name', 'Score', 'City'])
df2 = pd.DataFrame([('Saumya', 32, 'Delhi'),
('Saumya', 32, 'Mumbai'),
('Aaditya', 40, 'Mumbai'),
('Seema', 32, 'Delhi')],
columns = ['Name', 'Score', 'City'])
我怎么可以创建一个面具df2来过滤重复的基础上的行df1和列Name和City,如果同样存在PAIRE中df1,然后返回check其列Duplicated,否则,返回None。
预期的结果将是:
Name Score City Check
0 Saumya 32 Delhi Duplicated
1 Saumya 32 Mumbai None
2 Aaditya 40 Dehradun Duplicated
3 Seema 32 Delhi None
更新代码:
df = pd.concat([df1, df2])
df[df.duplicated(['Name', 'City'])]
出去:
Name Score City
3 Saumya 32 Delhi
0 Saumya 32 Delhi
2 Aaditya 40 Mumbai
回答
In [65]: df2.merge(df1[['Name', 'City']].drop_duplicates(), how='left', indicator='Check').assign(Check=lambda x: np.where(x['Check'] == 'both', 'Duplicated', None))
Out[65]:
Name Score City Check
0 Saumya 32 Delhi Duplicated
1 Saumya 32 Mumbai None
2 Aaditya 40 Mumbai Duplicated
3 Seema 32 Delhi None