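Pandas: Counting cell frequencies by index
My DataFrame holds long strings made up of the four letters 'A', 'T', 'G', and 'C', and I need to count how often each letter occurs at each index (position) across the rows.
import pandas as pd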
df = pd.DataFrame({'cases': ['ACCTTGTAGTGTATTTTATGACCAAATGACTTTTTCCCCCCAGTGGCTAATTTGTCTCAGGCCTGCGTCTTAAAGAGACACGGTAATGAGTAGGAAGTCCAGCGTGGTCTGGA','ACCTTGTACTGTATCTTATGACCAGATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACGGTAATGAGTAGGAAGTCCAACGTGGTCTAGA','GCCTTGTACTGTATATTATGACCAAATGACTTTTTCCACCCATTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACGGAAATGAGTAGGAAGTCCAGCGTGGTCTAGA','ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGGCTAATTTGTCTCAGGCCTCCGTCTTAAAGAGACACGGTAATGAGTAGGAAGTCCAGCGTGGTCTAGA']})
cases
0 ACCTTGTAGTGTATTTTATGACCAAATGACTTTTTCCCCCCAGTGG...
1 ACCTTGTACTGTATCTTATGACCAGATGACTTTTTCCACCCAGTGG...
2 GCCTTGTACTGTATATTATGACCAAATGACTTTTTCCACCCATTGG...
3 ACCTTGTACTGTATATTATGACCAGATGACTTTTTCCACCCAGTGG...
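The result should be a new DataFrame of shape 113x4: one row per position and one column per letter. I can't think of a pandas way to do this. Here is my non-pandas solution: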
def freq_lists(dna_list):
    # One counter list per base, with one slot per position in the sequence.
    n = len(dna_list[0])
    A = [0]*n
    T = [0]*n
    G = [0]*n
    C = [0]*n
    for dna in dna_list:
        # Increment the counter of the base found at each position.
        for index, base in enumerate(dna):
            if base == 'A':
                A[index] += 1
            elif base == 'C':
                C[index] += 1
            elif base == 'G':
                G[index] += 1
            elif base == 'T':
                T[index] += 1
    return {'A': A, 'C': C, 'G': G, 'T': T}

fdf = pd.DataFrame(freq_lists(df['cases'].to_list()))
A C G T
0 3 0 1 0
1 0 4 0 0
2 0 4 0 0
3 0 0 0 4
4 0 0 0 4
.. .. .. .. ..
108 0 4 0 0
109 0 0 0 4
110 3 0 1 0
111 0 0 4 0
112 4 0 0 0
To clarify, the first row is obtained by counting the first character of every string in the cases column: AAGA -> A: 3, C: 0, G: 1, T: 0
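The column-wise reading described above can be checked with plain Python: `zip(*sequences)` groups the characters found at each position, and `collections.Counter` tallies them. This is only a minimal sketch on shortened, made-up sequences (the names `sequences` and `position_counts` are illustrative and not from the question), and it assumes every string has the same length.

```python
from collections import Counter

import pandas as pd

# Hypothetical, shortened sequences; their first characters spell "AAGA".
sequences = ["ACCT", "ACCT", "GCCT", "ACCT"]

# zip(*sequences) yields one tuple per position, e.g. ('A', 'A', 'G', 'A').
position_counts = [Counter(column) for column in zip(*sequences)]

# One row per position, one column per base; bases absent at a position become 0.
fdf = (
    pd.DataFrame(position_counts)
    .reindex(columns=list("ACGT"))
    .fillna(0)
    .astype(int)
)
print(fdf)
```

Row 0 of `fdf` then holds the tally for the first characters, which matches the AAGA example.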
Answer
Let's use explode with crosstab:
s = df.cases.map(list).explode()
out = pd.crosstab(s.groupby(level=0).cumcount(),s)
Out[583]:
cases A C G T
row_0
0 3 0 1 0
1 0 4 0 0
2 0 4 0 0
3 0 0 0 4
4 0 0 0 4
.. .. .. .. ..
108 0 4 0 0
109 0 0 0 4
110 3 0 1 0
111 0 0 4 0
112 4 0 0 0
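To see what each step contributes, here is a hedged walk-through of the same explode and crosstab idea on shortened, made-up sequences. It is a variation rather than the answer's exact code: the arrays are passed to `pd.crosstab` via `to_numpy()` so the sketch does not rely on index alignment with the repeated row labels that `explode` produces, and the names `position`, "position", and "base" are chosen here for illustration.

```python
import pandas as pd

# Hypothetical, shortened sequences standing in for the real `cases` column.
df = pd.DataFrame({"cases": ["ACCT", "ACCT", "GCCT", "ACCT"]})

# Step 1: split each string into characters, then give each character its own row.
# The original row label is repeated for every character of that string.
s = df["cases"].map(list).explode()

# Step 2: number the characters within each original row: 0, 1, 2, ...
position = s.groupby(level=0).cumcount()

# Step 3: cross-tabulate position against base to get the counts.
# Plain arrays are used here (an assumption of this sketch, see the lead-in).
out = pd.crosstab(
    position.to_numpy(),
    s.to_numpy(),
    rownames=["position"],
    colnames=["base"],
)

# Optional tidy-up: guarantee all four base columns and drop the axis names.
out = out.reindex(columns=list("ACGT"), fill_value=0)
out = out.rename_axis(index=None, columns=None)
print(out)
```

The `reindex` call matters only when some base never appears in the data; without it, that column would be missing rather than filled with zeros.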