Suppose I have this DataFrame:
Country Market
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 Spain m2_location
4 USA m1_name
5 USA m2_name
6 USA m3_size
7 USA m3_location
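I want to group by the Country column and, within each group, keep only the rows whose prefix (m1, m2, ...) occurs most often. The expected result is: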
Country Market
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
6 USA m3_size
7 USA m3_location
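I have already tried extracting the prefix, computing the mode of the prefix per group, and merging the rows against that mode, but I feel there should be a more direct and more efficient solution.
Here is a working example that reproduces the result: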
import pandas as pd

df = pd.DataFrame({
    "Country": ["Spain","Spain","Spain","Spain","USA","USA","USA","USA"],
    "City": ["m1_name","m1_location","m1_size","m2_location","m1_name","m2_name","m3_size","m3_location"]
})
# Second character of each value, e.g. "1" from "m1_name" (only distinguishes single-digit prefixes)
df['prefix'] = df['City'].str[1]
# Most frequent prefix per country
modes = df.groupby('Country')['prefix'].agg(pd.Series.mode).rename("modes")
# Keep only the rows whose prefix equals the mode of their country
df = df.merge(modes, how="right", left_on=['Country','prefix'], right_on=['Country',"modes"])
df = df.drop(['modes','prefix'], axis = 1)
print(df)
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 USA m3_size
4 USA m3_location
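The `.str[1]` step in the code above works on this data, but it looks only at the second character, so prefixes such as m1 and m10 would be treated as the same. A minimal sketch of two alternatives, assuming the prefix is everything before the first underscore; the sample values (including `m10_size`) are made up for illustration and are not from the question:

```python
import pandas as pd

# Hypothetical values, including a two-digit prefix that the question's data does not contain
s = pd.Series(["m1_name", "m10_size", "m2_location"])

# Second character only: "m1" and "m10" both become "1"
second_char = s.str[1]

# Everything before the first underscore
before_underscore = s.str.split("_").str[0]

# Same idea with a regular expression; expand=False returns a Series
by_regex = s.str.extract(r"^([^_]+)_", expand=False)

print(second_char.tolist())
print(before_underscore.tolist())
print(by_regex.tolist())
```

Either of the last two forms can replace `df['City'].str[1]` without changing the rest of the code.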
Answer 1
You can try groupby with apply to filter the rows within each group:
out = (df.assign(prefix=df['City'].str.split('_').str[0])
       .groupby('Country')
       .apply(lambda g: g[g['prefix'].isin(g['prefix'].mode())])
       .reset_index(drop=True)
       .drop('prefix', axis=1))
print(out)
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
3 USA m3_size
4 USA m3_location
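One detail of the filter above is worth spelling out: `mode()` returns a Series rather than a scalar, because several values can tie for most frequent, which is why the comparison uses `isin` instead of `==`. A small sketch on a single made-up group (the values are illustrative, not from the question) shows what happens on a tie:

```python
import pandas as pd

# Hypothetical single group where m1 and m2 are tied with two rows each
g = pd.DataFrame({"City": ["m1_name", "m1_size", "m2_name", "m2_size", "m3_name"]})

prefix = g["City"].str.split("_").str[0]

# mode() returns every tied value, sorted
modes = prefix.mode()

# isin keeps the rows of all tied prefixes
kept = g[prefix.isin(modes)]

print(modes.tolist())
print(kept)
```

With this approach both tied prefixes are kept. Whether that is the desired behavior depends on the use case, which the question does not specify.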
Answer 2
Use:
In [575]: df['Prefix_count'] = df.groupby(['Country', df.City.str.split('_').str[0]])['City'].transform('size')
In [589]: idx = df.groupby('Country')['Prefix_count'].transform('max') == df['Prefix_count']
In [593]: df[idx].drop('Prefix_count', axis=1)
Out[593]:
Country City
0 Spain m1_name
1 Spain m1_location
2 Spain m1_size
6 USA m3_size
7 USA m3_location
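The same transform-based idea can be wrapped in a function that leaves the input frame unchanged instead of adding a helper column. This is only a sketch: the function name `keep_most_frequent_prefix` and its parameters are hypothetical, and it assumes the prefix is the text before the first underscore.

```python
import pandas as pd

def keep_most_frequent_prefix(df, group_col="Country", value_col="City"):
    """Return the rows whose prefix is the most frequent one in their group."""
    # Prefix such as "m1" from "m1_name"; renamed so it does not clash with a column name
    prefix = df[value_col].str.split("_").str[0].rename("prefix")
    # Size of each (group, prefix) combination, broadcast back to every row
    counts = df.groupby([group_col, prefix])[value_col].transform("size")
    # Largest combination size within each group, also broadcast to every row
    max_counts = counts.groupby(df[group_col]).transform("max")
    return df[counts == max_counts]

df = pd.DataFrame({
    "Country": ["Spain", "Spain", "Spain", "Spain", "USA", "USA", "USA", "USA"],
    "City": ["m1_name", "m1_location", "m1_size", "m2_location",
             "m1_name", "m2_name", "m3_size", "m3_location"],
})

out = keep_most_frequent_prefix(df)
# Rows 0, 1, 2 (Spain, m1) and 6, 7 (USA, m3) remain, with their original index
print(out)
```

Because the counts live in temporary Series, `df` keeps only its original two columns and no `drop` is needed afterwards.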
Answer 3
An interesting fact about the proposed solutions is that Mayank's is much faster. I ran each of them on 1000 rows of my data and got:
Mayank's solution: 0.020 seconds
Ynjxsjmh's solution: 0.402 seconds
My (OP) solution: 0.122 seconds
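The timing code itself is not shown, so the following is only a sketch of how such a comparison could be set up with `timeit`. It makes several assumptions: the data is the question's sample frame repeated to 1000 rows (not the OP's real data), the function names `via_transform` and `via_merge` are hypothetical, and only a transform-based and a merge-based variant are timed. The numbers it prints will differ from those reported above.

```python
import timeit
import pandas as pd

base = pd.DataFrame({
    "Country": ["Spain", "Spain", "Spain", "Spain", "USA", "USA", "USA", "USA"],
    "City": ["m1_name", "m1_location", "m1_size", "m2_location",
             "m1_name", "m2_name", "m3_size", "m3_location"],
})
# Assumption: 125 copies of the 8 sample rows stand in for "1000 rows of my data"
df = pd.concat([base] * 125, ignore_index=True)

def via_transform(df):
    # Count each (Country, prefix) pair and keep rows that reach the per-country maximum
    prefix = df["City"].str.split("_").str[0].rename("prefix")
    counts = df.groupby(["Country", prefix])["City"].transform("size")
    return df[counts == counts.groupby(df["Country"]).transform("max")]

def via_merge(df):
    # Compute one modal prefix per country, then inner-merge to filter the rows.
    # Note: on ties this keeps only the first mode, unlike via_transform.
    tmp = df.assign(prefix=df["City"].str.split("_").str[0])
    modes = (tmp.groupby("Country")["prefix"]
                .agg(lambda s: s.mode().iloc[0])
                .reset_index())
    return tmp.merge(modes, on=["Country", "prefix"], how="inner").drop(columns="prefix")

candidates = {"transform": via_transform, "merge": via_merge}
repeats = 20
for name, fn in candidates.items():
    # Average wall-clock time per call over `repeats` runs
    seconds = timeit.timeit(lambda: fn(df), number=repeats) / repeats
    print(f"{name}: {seconds:.4f} seconds per call")
```

Other candidate functions can be added to the `candidates` dictionary as long as they accept the frame and return the filtered rows.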