While working on an NLP problem,
I ended up with a large feature dataset:
dfMethod
Out[2]:
c0000167 c0000294 c0000545 ... c4721555 c4759703 c4759772
0 0 0 0 ... 0 0 0
1 0 0 0 ... 0 0 0
2 0 0 0 ... 0 0 0
3 0 0 0 ... 0 0 0
4 0 0 0 ... 0 0 0
... ... ... ... ... ... ...
3995 0 0 0 ... 0 0 0
3996 0 0 0 ... 0 0 0
3997 0 0 0 ... 0 0 0
3998 0 0 0 ... 0 0 0
3999 0 0 0 ... 0 0 0
[4000 rows x 14317 columns]
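I want to drop the columns that occur least often (that is, the columns whose sum over all records is smallest).
So if my column sums look like this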
Sum of c0000167 = 7523
Sum of c0000294 = 8330
Sum of c0000545 = 502
Sum of c4721555 = 51
Sum of c4759703 = 9628
then in the end I want to keep only the top 5000 columns, ranked by column sum.
How can I do that?
Answer 1
Assuming you have a large DataFrame big_df,
you can get the top columns with the following:
N = 5000
big_df[big_df.sum().sort_values(ascending=False).index[:N]]
Breaking this down:
(
    big_df.sum()                   # gives the sums you mentioned, one per column
    .sort_values(ascending=False)  # sort the sums in descending order
    .index                         # because .sum() defaults to axis=0, the index is your columns
    [:N]                           # grab the first N items
)
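To see the chain above on something small, here is a minimal sketch on a toy frame. The names toy_df and keep_top_columns are made up for illustration and are not part of the original answer. It also shows a variant based on Series.nlargest that keeps the surviving columns in their original order instead of reordering them by sum, which may or may not matter for your feature matrix.

```python
import pandas as pd


def keep_top_columns(df, n):
    """Keep the n columns with the largest sums, in their original order.

    Hypothetical helper, assuming every column is numeric.
    """
    top = df.sum(axis=0).nlargest(n).index
    return df.loc[:, df.columns.isin(top)]


# Toy data: the column sums are a=3, b=7, c=1, d=10
toy_df = pd.DataFrame(
    {"a": [1, 2, 0], "b": [3, 4, 0], "c": [0, 1, 0], "d": [5, 5, 0]}
)

# Same chain as in the answer: columns come back ordered by descending sum
by_sum = toy_df[toy_df.sum().sort_values(ascending=False).index[:2]]

# Variant: same columns, original left-to-right order preserved
in_place = keep_top_columns(toy_df, 2)

print(list(by_sum.columns))
print(list(in_place.columns))
```

If several columns tie at the cutoff, which of them survive depends on the sort or on the keep argument of nlargest, so the two variants are only guaranteed to agree when the sums are distinct.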
Answer 2
Edited after the author's comment. Let df be a pandas DataFrame. Prepare the filter by selecting the 5000 columns with the largest sums:
df_sum = df.sum()  # compute the column sums once instead of repeating df.sum()
# sort the (column, sum) pairs by sum, largest first, and keep the first 5000
co = sorted(df_sum.items(), key=lambda row: row[1], reverse=True)[:5000]
# this keeps the 5000 columns with the largest sums, not the columns whose sum exceeds 5000
co = [row[0] for row in co]  # convert to a list of the column names of interest
Then keep only the columns listed in co:
df = df.filter(items = co)
df
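As a quick sanity check, the sketch below runs the sorted-and-filter approach on a toy frame and compares the selected columns with those from the one-line sort_values chain in the first answer. The toy data and the name top_n are assumptions for illustration only. The comparison uses sets because the order of the returned columns is not the point here.

```python
import pandas as pd

# Toy data: the column sums are a=3, b=7, c=1, d=10
df = pd.DataFrame(
    {"a": [1, 2, 0], "b": [3, 4, 0], "c": [0, 1, 0], "d": [5, 5, 0]}
)
top_n = 2  # stands in for 5000 on the real data

# Answer 2: sort (column, sum) pairs by sum and keep the top_n column names
df_sum = df.sum()
co = sorted(df_sum.items(), key=lambda item: item[1], reverse=True)[:top_n]
co = [name for name, _ in co]
filtered = df.filter(items=co)

# Answer 1: the sort_values chain
one_liner = df[df.sum().sort_values(ascending=False).index[:top_n]]

# Both approaches should select the same columns when the sums are distinct
print(set(filtered.columns) == set(one_liner.columns))
```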