How to map two dataframes and output the overlapping items in new columns?
I have two dataframes:
import pandas as pd

data = {
'values': ['Cricket', 'Soccer', 'Football', 'Tennis', 'Badminton', 'Chess'],
'gems': ['A1K, A2M, JA3, AN4', 'B1, A1, Bn2, B3', 'CD1, A1', 'KWS, KQM', 'JP, CVK', 'KF, GF']
}
df1 = pd.DataFrame(data)
df1
values gems
0 Cricket A1K, A2M, JA3, AN4
1 Soccer B1, A1, Bn2, B3
2 Football CD1, A1
3 Tennis KWS, KQM
4 Badminton JP, CVK
5 Chess KF, GF
The second dataframe:
data2 = {
'1C': ['B1', 'K1', 'A1K', 'J1', 'A4'],
'02C': ['Bn2', 'B3', 'JK', 'ZZ', 'ko'],
'34C': ['KF', 'CD1', 'B3','ji', 'HU']
}
df2 = pd.DataFrame(data2)
df2
1C 02C 34C
0 B1 Bn2 KF
1 K1 B3 CD1
2 A1K JK B3
3 J1 ZZ ji
4 A4 ko HU
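I want to check the items in df1['gems'] against each column of df2 and report, per column, the number of overlapping items and the items themselves. The expected output is: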
values gems 1C 1CGroup 02C 02CGroup 34C 34CGroup
0 Cricket A1K, A2M, JA3, AN4 1 A1K 0 NA 0 NA
1 Soccer B1, A1, Bn2, B3 1 B1 2 Bn2, B3 1 B3
2 Football CD1, A1 0 NA 0 NA 1 CD1
3 Tennis KWS, KQM 0 NA 0 NA 0 NA
4 Badminton JP, CVK 0 NA 0 NA 0 NA
5 Chess KF, GF 0 NA 0 NA 1 KF
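Answer
First, str.split and explode the gems column, then reset_index so the original index is kept as a column. Then, for each column of df2, merge it with the exploded gems, groupby the original index, and aggregate with count and ', '.join. pd.concat the per-column results and join them to the original df1. Finally, fillna the count columns with 0, as in the expected output.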
# one row per gem used in the merge
df_ = df1['gems'].str.split(', ').explode().reset_index()
res = (
    df1.join(  # can join to df1 as we keep the original index value
        pd.concat([df_.merge(df2[[col]], left_on='gems', right_on=col)
                      .groupby('index')  # original index in df1
                      [col].agg(**{col: 'count',  # do each aggregation
                                   f'{col}Group': lambda x: ', '.join(x)})
                   for col in df2.columns],  # do it for each column of df2
                  axis=1))
    .fillna({col: 0 for col in df2.columns})  # fill the count columns with 0
)
print(res)
values gems 1C 1CGroup 02C 02CGroup 34C 34CGroup
0 Cricket A1K, A2M, JA3, AN4 1.0 A1K 0.0 NaN 0.0 NaN
1 Soccer B1, A1, Bn2, B3 1.0 B1 2.0 Bn2, B3 1.0 B3
2 Football CD1, A1 0.0 NaN 0.0 NaN 1.0 CD1
3 Tennis KWS, KQM 0.0 NaN 0.0 NaN 0.0 NaN
4 Badminton JP, CVK 0.0 NaN 0.0 NaN 0.0 NaN
5 Chess KF, GF 0.0 NaN 0.0 NaN 1.0 KF
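The counts above print as floats (1.0, 0.0) because the join introduces NaN for rows without a match before fillna runs. If integer counts are wanted, as in the expected output, one possible follow-up is to cast after filling. The sketch below rebuilds the same result with an explicit loop for readability; the names parts and matched are illustrative only and not part of the answer.

```python
import pandas as pd

df1 = pd.DataFrame({
    'values': ['Cricket', 'Soccer', 'Football', 'Tennis', 'Badminton', 'Chess'],
    'gems': ['A1K, A2M, JA3, AN4', 'B1, A1, Bn2, B3', 'CD1, A1',
             'KWS, KQM', 'JP, CVK', 'KF, GF'],
})
df2 = pd.DataFrame({
    '1C': ['B1', 'K1', 'A1K', 'J1', 'A4'],
    '02C': ['Bn2', 'B3', 'JK', 'ZZ', 'ko'],
    '34C': ['KF', 'CD1', 'B3', 'ji', 'HU'],
})

# one row per gem; the original df1 index is kept in the 'index' column
df_ = df1['gems'].str.split(', ').explode().reset_index()

parts = []  # illustrative name: one aggregated frame per df2 column
for col in df2.columns:
    # inner merge keeps only the gems that also occur in this df2 column
    matched = df_.merge(df2[[col]], left_on='gems', right_on=col)
    parts.append(
        matched.groupby('index')[col]
               .agg(**{col: 'count', f'{col}Group': lambda x: ', '.join(x)})
    )

res = (
    df1.join(pd.concat(parts, axis=1))
       .fillna({col: 0 for col in df2.columns})    # NaN counts become 0
       .astype({col: int for col in df2.columns})  # then cast floats to int
)
print(res)
```

The Group columns are left as NaN where nothing matched, so only the count columns change dtype.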
Answer
First, build a table of your groups:
df3 = (pd.merge(df1['gems'].str.split(r',\s+').explode().reset_index(),
                df2.unstack().reset_index(level=0),
                left_on='gems', right_on=0, how='left'
                )
         .pivot_table(index='index',
                      columns=['level_0'],
                      values='gems',
                      aggfunc=list)
      )
Output:
level_0 02C 1C 34C
index
0 NaN [A1K] NaN
1 [Bn2, B3] [B1] [B3]
2 NaN NaN [CD1]
5 NaN NaN [KF]
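The merge above relies on the column names that df2.unstack().reset_index(level=0) produces by default: 'level_0' holds the df2 column name and the integer label 0 holds the gem. A small sketch to inspect that intermediate table; the renamed columns df2_col and gem are hypothetical names chosen here for readability and do not appear in the answer.

```python
import pandas as pd

df2 = pd.DataFrame({
    '1C': ['B1', 'K1', 'A1K', 'J1', 'A4'],
    '02C': ['Bn2', 'B3', 'JK', 'ZZ', 'ko'],
    '34C': ['KF', 'CD1', 'B3', 'ji', 'HU'],
})

# long form of df2: one row per (df2 column, gem) pair
long_df2 = df2.unstack().reset_index(level=0)
print(long_df2.columns.tolist())  # ['level_0', 0]

# optional: readable names (hypothetical, for illustration only)
named = long_df2.rename(columns={'level_0': 'df2_col', 0: 'gem'})

# a gem listed in several df2 columns appears once per column
b3_cols = named.loc[named['gem'] == 'B3', 'df2_col'].tolist()
print(b3_cols)
```

Because B3 is listed in both 02C and 34C, the merge yields one row per column for it, which is why it ends up in both groups for Soccer.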
Then compute the counts and concatenate everything with the original table:
# applymap is deprecated since pandas 2.1; DataFrame.map is the replacement there
pd.concat([df1,
           pd.concat([df3.add_suffix('Group').applymap(lambda x: ', '.join(x) if isinstance(x, list) else x),
                      df3.fillna('').applymap(len)],
                     axis=1).sort_index(axis=1)
          ], axis=1)
Output:
values gems 02C 02CGroup 1C 1CGroup 34C 34CGroup
0 Cricket A1K, A2M, JA3, AN4 0.0 NaN 1.0 A1K 0.0 NaN
1 Soccer B1, A1, Bn2, B3 2.0 Bn2, B3 1.0 B1 1.0 B3
2 Football CD1, A1 0.0 NaN 0.0 NaN 1.0 CD1
3 Tennis KWS, KQM NaN NaN NaN NaN NaN NaN
4 Badminton JP, CVK NaN NaN NaN NaN NaN NaN
5 Chess KF, GF 0.0 NaN 0.0 NaN 1.0 KF
Edit: an alternative that joins the strings first and then counts
df3 = (pd.merge(df1['gems'].str.split(r',\s+').explode().reset_index(),
                df2.unstack().reset_index(level=0),
                left_on='gems', right_on=0, how='left'
                )
         .pivot_table(index='index',
                      columns=['level_0'],
                      values='gems',
                      aggfunc=', '.join)
      )
pd.concat([df1,
           pd.concat([df3.add_suffix('Group'),
                      df3.applymap(lambda x: x.count(',')+1 if isinstance(x, str) else 0)],
                     axis=1).sort_index(axis=1)
          ], axis=1)
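With either variant, df1 rows that match nothing at all (Tennis and Badminton here) are missing from df3, so their counts come out as NaN rather than 0 after the final concat. One possible way around this, sketched below, is to reindex df3 on df1.index before counting. This is an adaptation of the approach, not the answer's own code: it uses the default inner merge, and the names exploded, long_df2, counts and out are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'values': ['Cricket', 'Soccer', 'Football', 'Tennis', 'Badminton', 'Chess'],
    'gems': ['A1K, A2M, JA3, AN4', 'B1, A1, Bn2, B3', 'CD1, A1',
             'KWS, KQM', 'JP, CVK', 'KF, GF'],
})
df2 = pd.DataFrame({
    '1C': ['B1', 'K1', 'A1K', 'J1', 'A4'],
    '02C': ['Bn2', 'B3', 'JK', 'ZZ', 'ko'],
    '34C': ['KF', 'CD1', 'B3', 'ji', 'HU'],
})

exploded = df1['gems'].str.split(r',\s+').explode().reset_index()
long_df2 = df2.unstack().reset_index(level=0)  # columns: 'level_0' and 0

df3 = (
    exploded.merge(long_df2, left_on='gems', right_on=0)  # inner merge: matches only
            .pivot_table(index='index', columns='level_0',
                         values='gems', aggfunc=', '.join)
            .reindex(df1.index)  # bring back the rows that had no match at all
)

# number of joined items = number of commas + 1; NaN (no match) becomes 0
counts = df3.apply(lambda s: s.str.count(',') + 1).fillna(0).astype(int)

out = pd.concat([df1, df3.add_suffix('Group'), counts], axis=1)
print(out)
```

The comma-counting trick assumes no gem contains a comma itself, which holds for the sample data.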
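Answer
Solution with findall
For each column in df2, find all occurrences of that column's values in the gems column of df1, then map with len to count the occurrences, and optionally join the matches with str.join.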
for c in df2.columns:
    s = df1['gems'].str.findall('|'.join(df2[c]))
    df1[c] = s.map(len)
    df1[c + 'group'] = s.str.join(', ')
print(df1)
values gems 1C 1Cgroup 02C 02Cgroup 34C 34Cgroup
0 Cricket A1K, A2M, JA3, AN4 1 A1K 0 0
1 Soccer B1, A1, Bn2, B3 1 B1 2 Bn2, B3 1 B3
2 Football CD1, A1 0 0 1 CD1
3 Tennis KWS, KQM 0 0 0
4 Badminton JP, CVK 0 0 0
5 Chess KF, GF 0 0 1 KF
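One caveat with a plain '|'.join pattern is that findall matches substrings, so a value such as B1 would also be found inside a longer gem like B12, and any regex metacharacters in the values would be interpreted rather than matched literally. The sample data does not trigger either case, but a more defensive sketch could escape the values and anchor them with word boundaries. The probe series below is a made-up example, not data from the question, and out is an illustrative copy so df1 is left untouched.

```python
import re
import pandas as pd

df1 = pd.DataFrame({
    'values': ['Cricket', 'Soccer', 'Football', 'Tennis', 'Badminton', 'Chess'],
    'gems': ['A1K, A2M, JA3, AN4', 'B1, A1, Bn2, B3', 'CD1, A1',
             'KWS, KQM', 'JP, CVK', 'KF, GF'],
})
df2 = pd.DataFrame({
    '1C': ['B1', 'K1', 'A1K', 'J1', 'A4'],
    '02C': ['Bn2', 'B3', 'JK', 'ZZ', 'ko'],
    '34C': ['KF', 'CD1', 'B3', 'ji', 'HU'],
})

out = df1.copy()
for c in df2.columns:
    # re.escape treats each value literally; \b...\b only accepts whole tokens
    pattern = r'\b(?:' + '|'.join(map(re.escape, df2[c])) + r')\b'
    s = out['gems'].str.findall(pattern)
    out[c] = s.str.len()                 # number of matches per row
    out[c + 'group'] = s.str.join(', ')  # empty string when nothing matched
print(out)

# Hypothetical gem string (not from the question) to show the difference
probe = pd.Series(['B12, A1'])
loose = probe.str.findall('B1|K1')             # also hits the 'B1' inside 'B12'
strict = probe.str.findall(r'\b(?:B1|K1)\b')   # whole tokens only
```

On the question's data both patterns give the same counts; the anchored version only matters once values can be prefixes or substrings of one another.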