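Extracting strings and inserting them as multiple rows based on the original index
I have put a sample dataset (df), the expected output (df2) and my code so far below. I have a df in which some rows of column i2 contain a list, in JSON format, that needs to be exploded and re-inserted into the df at the row it was extracted from. However, the extracted text needs to go into a different column (i1). I also need to extract a unique identifier (the 'id_2' value) from each entry and insert it into the id_2 column.
So far in my code I parse the JSON-like data with pd.json_normalize, prepend the original string from column i1 to each extracted string (this should be clearer if you look below), and then re-insert the rows based on index. But I have to specify the index by hand, which is not good. I would like the code to depend less on manually entered indexes, in case more of these nested cells appear in the future or the indexes change.
Any suggestions are very welcome, thank you very much.
Sample data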
import pandas as pd
df = pd.DataFrame(data={'id': [1, 2, 3, 4, 5], 'id_2': ['a','b','c','d','e'], 'i1': ['How old are you?','Over the last month have you felt','Do you live alone?','In the last week have you had','When did you last visit a doctor?'], 'i2': [0,0,0,0,0]})
df['i2'] = df['i2'].astype('object')
a = [{'id': 'b1', 'item': 'happy?', 'id_2': 'hj59'}, {'id': 'b2', 'item': 'sad?', 'id_2': 'dgb'}, {'id': 'b3', 'item': 'angry?', 'id_2':'kj9'}, {'id': 'b4', 'item': 'frustrated?','id_2':'lp7'}]
b = [{'id': 'c1', 'item': 'trouble sleeping?'}, {'id': 'c2', 'item': 'changes in appetite?'}, {'id': 'c3', 'item': 'mood swings?'}, {'id': 'c4', 'item': 'trouble relaxing?'}]
df.at[1, 'i2'] = a
df.at[3, 'i2'] = b
Expected output
df2 = pd.DataFrame(data={'id': [1,2,2,2,2,3,4,4,4,4,5],
                         'id_2': ['a','hj59','dgb','kj9','lp7','c','d','d','d','d','e'],
                         'i1': ['How old are you?',
                                'Over the last month have you felt happy?',
                                'Over the last month have you felt sad?',
                                'Over the last month have you felt angry?',
                                'Over the last month have you felt frustrated?',
                                'Do you live alone?',
                                'In the last week have you had trouble sleeping?',
                                'In the last week have you had changes in appetite?',
                                'In the last week have you had mood swings?',
                                'In the last week have you had trouble relaxing?',
                                'When did you last visit a doctor?'],
                         'i2': [0,1,1,1,1,0,1,1,1,1,0]})
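My ugly code so far
# Rows whose i2 cell holds a list of dicts
s = df[df.i2 != 0]
n = {}
for i in range(len(s)):
    # Flatten each list into its own frame; reset_index keeps the position as an 'index' column
    n[i] = pd.json_normalize(s.loc[s.index[i]]['i2']).reset_index(inplace=False, drop=False)
    # Prefix every extracted item with the parent question text
    n[i]['i1'] = s.iloc[i].i1 + ' ' + n[i]['item']

def insert_row(i, d1, d2):
    # Keep the rows of d1 above position i and stack d2 beneath them
    return pd.concat([d1.iloc[:i], d2])

for i in n:
    # The insertion position is worked out by hand for each nested cell
    if i == 0:
        x = insert_row(s.iloc[i].name, df, n[i])
    elif i == 1:
        x = insert_row(s.iloc[i].name+1+n[i]['index'].count()+1, x, n[i])
# Add back the rows that follow the last nested cell
y = pd.concat([x, df.iloc[s.iloc[i].name+1:]])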
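Answer
Explode the dataframe on column i2, then use the str accessor to retrieve the value associated with the key item from column i2. Then use loc with a boolean mask to set i2 to 1 on the exploded rows and to concatenate the strings in i1 with the retrieved item values.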
df2 = df.explode('i2', ignore_index=True)
s = df2['i2'].str['item']
df2.loc[s.notna(), 'i2'] = 1
df2.loc[s.notna(), 'i1'] += ' ' + s
    id  i1                                                  i2
0    1  How old are you?                                     0
1    2  Over the last month have you felt happy?             1
2    2  Over the last month have you felt sad?               1
3    2  Over the last month have you felt angry?             1
4    2  Over the last month have you felt frustrated?        1
5    3  Do you live alone?                                   0
6    4  In the last week have you had trouble sleeping?      1
7    4  In the last week have you had changes in appetite?   1
8    4  In the last week have you had mood swings?           1
9    4  In the last week have you had trouble relaxing?      1
10   5  When did you last visit a doctor?                    0
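
The output above does not yet carry the nested identifiers over into id_2, which the question also asks for. Below is a minimal sketch of one way to extend the same approach. It assumes that every nested dict with an identifier stores it under the key id_2, and that rows without one should keep the parent row's id_2. The names item, nested_id and expanded are illustrative and not part of the original answer.

```python
import pandas as pd

# Sample frame from the question; the nested dicts are assumed to use the key 'id_2'.
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'id_2': ['a', 'b', 'c', 'd', 'e'],
    'i1': ['How old are you?', 'Over the last month have you felt',
           'Do you live alone?', 'In the last week have you had',
           'When did you last visit a doctor?'],
    'i2': [0, 0, 0, 0, 0],
})
df['i2'] = df['i2'].astype('object')
df.at[1, 'i2'] = [{'id': 'b1', 'item': 'happy?', 'id_2': 'hj59'},
                  {'id': 'b2', 'item': 'sad?', 'id_2': 'dgb'},
                  {'id': 'b3', 'item': 'angry?', 'id_2': 'kj9'},
                  {'id': 'b4', 'item': 'frustrated?', 'id_2': 'lp7'}]
df.at[3, 'i2'] = [{'id': 'c1', 'item': 'trouble sleeping?'},
                  {'id': 'c2', 'item': 'changes in appetite?'},
                  {'id': 'c3', 'item': 'mood swings?'},
                  {'id': 'c4', 'item': 'trouble relaxing?'}]

# One row per list element; rows without a list keep their original value.
df2 = df.explode('i2', ignore_index=True)

# Look up keys in the exploded dicts; rows holding 0, or dicts lacking the key,
# come back as missing values.
item = df2['i2'].str['item']
nested_id = df2['i2'].str['id_2']
expanded = item.notna()

# Use the nested identifier where one exists, otherwise keep the parent's id_2.
df2['id_2'] = df2['id_2'].mask(nested_id.notna(), nested_id)

# Same updates as in the answer: flag the exploded rows and extend their text.
df2.loc[expanded, 'i1'] = df2.loc[expanded, 'i1'] + ' ' + item[expanded]
df2.loc[expanded, 'i2'] = 1

print(df2)
```

Because the rows are selected with boolean masks derived from the data, no row positions have to be entered by hand, so additional nested cells or a changed index should not require changes to the code.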