python - How can I reduce the size of my dataframe in Python?

I am working on an NLP problem and have ended up with a large feature dataset:

dfMethod
Out[2]: 
      c0000167  c0000294  c0000545  ...  c4721555  c4759703  c4759772
0            0         0         0  ...         0         0         0
1            0         0         0  ...         0         0         0
2            0         0         0  ...         0         0         0
3            0         0         0  ...         0         0         0
4            0         0         0  ...         0         0         0
       ...       ...       ...  ...       ...       ...       ...
3995         0         0         0  ...         0         0         0
3996         0         0         0  ...         0         0         0
3997         0         0         0  ...         0         0         0
3998         0         0         0  ...         0         0         0
3999         0         0         0  ...         0         0         0

[4000 rows x 14317 columns]

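I want to drop the columns that occur least often (that is, the columns with the smallest sum over all records).

So if my column sums look like this: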

Sum of c0000167 = 7523
Sum of c0000294 = 8330
Sum of c0000545 = 502
Sum of c4721555 = 51
Sum of c4759703 = 9628

In the end, I want to keep only the top 5000 columns, ranked by column sum.

How can I do this?

Answer 1

Assuming you have a large dataframe big_df, you can select the top columns like this:

N = 5000
big_df[big_df.sum().sort_values(ascending=False).index[:N]]

Breaking this down:

(
    big_df.sum()                   # gives the column sums you mentioned
    .sort_values(ascending=False)  # sorts the sums in descending order
    .index                         # .sum() defaults to axis=0, so the index holds your column names
    [:N]                           # takes the first N labels
)
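To check the behaviour on something small, here is a minimal sketch of the same expression. The toy values and the names top_columns and reduced are made up for illustration; only the column labels are borrowed from the question.

```python
import pandas as pd

# Hypothetical toy data: 4 rows, 5 feature columns with distinct sums.
big_df = pd.DataFrame({
    "c0000167": [1, 0, 2, 0],  # sum = 3
    "c0000294": [0, 0, 1, 0],  # sum = 1
    "c0000545": [3, 2, 2, 1],  # sum = 8
    "c4721555": [0, 0, 0, 0],  # sum = 0
    "c4759703": [1, 1, 2, 1],  # sum = 5
})

N = 3  # keep the 3 columns with the largest sums

# Labels of the N largest column sums, ordered from largest to smallest.
top_columns = big_df.sum().sort_values(ascending=False).index[:N]
reduced = big_df[top_columns]

print(list(reduced.columns))  # ['c0000545', 'c4759703', 'c0000167']
print(reduced.shape)          # (4, 3)
```

Note that the resulting columns come out ordered by sum rather than in their original order.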

Answer 2

Edited after the author's comment. Let df be a pandas DataFrame. First prepare the filter by selecting the 5000 columns with the largest sums:

df_sum = df.sum()  # compute the column sums once
# Sort the (column, sum) pairs by sum in descending order and keep the first 5000.
# Fixed: this selects the top 5000 columns, not the columns whose sum is greater than 5000.
co = sorted(df_sum.items(), key=lambda row: row[1], reverse=True)[:5000]
co = [row[0] for row in co]  # keep only the column names of interest

Then keep only the columns listed in co:

df = df.filter(items = co)
df
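Both answers return the columns ordered by sum. If the original column order matters, one possible variant is sketched below. It swaps the sort for Series.nlargest and selects with a boolean mask; the helper name keep_top_columns and the toy data are hypothetical, and the sketch assumes numeric columns with unique names.

```python
import pandas as pd


def keep_top_columns(df: pd.DataFrame, n: int) -> pd.DataFrame:
    """Return df restricted to the n columns with the largest sums,
    keeping the columns in their original order."""
    top = df.sum().nlargest(n).index
    # Boolean mask over the original columns, so the order is preserved.
    return df.loc[:, df.columns.isin(top)]


# Hypothetical toy data with column sums a=2, b=0, c=7, d=1.
df = pd.DataFrame({"a": [1, 1], "b": [0, 0], "c": [5, 2], "d": [0, 1]})
smaller = keep_top_columns(df, 2)

print(list(smaller.columns))  # ['a', 'c']

# Compare the memory footprint in bytes before and after the reduction.
print(df.memory_usage(deep=True).sum(), smaller.memory_usage(deep=True).sum())
```

Comparing memory_usage before and after is a quick way to confirm that the dataframe actually got smaller.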
