I have a DataFrame like this:
   asset_id source_id open_px  close_px           start_bin             end_bin
0         1         a    None        10 2022-01-01 09:30:00 2022-01-01 10:00:00
1         1         a    None        10 2022-01-01 10:00:00 2022-01-01 10:30:00
2         2         a    None       101 2022-01-01 09:30:00 2022-01-01 10:00:00
3         2         a    None       500 2022-01-01 10:00:00 2022-01-01 10:30:00
4         2         a    None       600 2022-01-01 10:30:00 2022-01-01 11:00:00
Code to generate it:
import datetime
import pandas as pd

rows = [
[1, 'a', None, 10, datetime.datetime(2022, 1, 1, 9, 30), datetime.datetime(2022, 1, 1, 10, 0)],
[1, 'a', None, 10, datetime.datetime(2022, 1, 1, 10, 0), datetime.datetime(2022, 1, 1, 10, 30)],
[2, 'a', None, 101, datetime.datetime(2022, 1, 1, 9, 30), datetime.datetime(2022, 1, 1, 10, 0)],
[2, 'a', None, 500, datetime.datetime(2022, 1, 1, 10, 0), datetime.datetime(2022, 1, 1, 10, 30)],
[2, 'a', None, 600, datetime.datetime(2022, 1, 1, 10, 30), datetime.datetime(2022, 1, 1, 11, 0)]
]
cols = ['asset_id', 'source_id', 'open_px', 'close_px', 'start_bin', 'end_bin']
df = pd.DataFrame(data=rows, columns=cols)
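I want to fill open_px with the last close from the bin that ends at this row's start_bin, grouped by asset_id, in the most pandas-friendly way. (I'm happy for the first entry of each asset to stay None.) I don't want to brute-force it with a loop, because the dataset is very large.
Expected output: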
   asset_id source_id open_px  close_px           start_bin             end_bin
0         1         a    None        10 2022-01-01 09:30:00 2022-01-01 10:00:00
1         1         a      10        10 2022-01-01 10:00:00 2022-01-01 10:30:00
2         2         a    None       101 2022-01-01 09:30:00 2022-01-01 10:00:00
3         2         a     101       500 2022-01-01 10:00:00 2022-01-01 10:30:00
4         2         a     500       600 2022-01-01 10:30:00 2022-01-01 11:00:00
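Answer 1
# Put each asset's bins in chronological order
df.sort_values(['asset_id','start_bin'], inplace=True)
# Take the previous row's close as this row's open
df['open_px'] = df['close_px'].shift()
# The first row of each asset has no previous bin, so clear the value shifted in from the preceding asset
df.loc[~df['asset_id'].duplicated(),'open_px'] = None
print(df)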
   asset_id source_id  open_px  close_px           start_bin             end_bin
0         1         a      NaN        10 2022-01-01 09:30:00 2022-01-01 10:00:00
1         1         a     10.0        10 2022-01-01 10:00:00 2022-01-01 10:30:00
2         2         a      NaN       101 2022-01-01 09:30:00 2022-01-01 10:00:00
3         2         a    101.0       500 2022-01-01 10:00:00 2022-01-01 10:30:00
4         2         a    500.0       600 2022-01-01 10:30:00 2022-01-01 11:00:00
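As a possible alternative to the shift-then-mask steps above, the shift can be done per group, so the first row of each asset never receives a value from the preceding asset. This is only a sketch on the sample data from the question. It assumes the bins of each asset are contiguous (each row's start_bin equals the previous row's end_bin); if there are gaps, the value from the previous row may not be the close you want.

```python
import datetime

import pandas as pd

rows = [
    [1, 'a', None, 10, datetime.datetime(2022, 1, 1, 9, 30), datetime.datetime(2022, 1, 1, 10, 0)],
    [1, 'a', None, 10, datetime.datetime(2022, 1, 1, 10, 0), datetime.datetime(2022, 1, 1, 10, 30)],
    [2, 'a', None, 101, datetime.datetime(2022, 1, 1, 9, 30), datetime.datetime(2022, 1, 1, 10, 0)],
    [2, 'a', None, 500, datetime.datetime(2022, 1, 1, 10, 0), datetime.datetime(2022, 1, 1, 10, 30)],
    [2, 'a', None, 600, datetime.datetime(2022, 1, 1, 10, 30), datetime.datetime(2022, 1, 1, 11, 0)],
]
cols = ['asset_id', 'source_id', 'open_px', 'close_px', 'start_bin', 'end_bin']
df = pd.DataFrame(data=rows, columns=cols)

# Put each asset's bins in chronological order.
df = df.sort_values(['asset_id', 'start_bin'])

# Shift close_px down one row within each asset_id group.
# The first row of every group has no predecessor and becomes NaN.
df['open_px'] = df.groupby('asset_id')['close_px'].shift()

print(df[['asset_id', 'open_px', 'close_px']])
```

Like the answer above, this leaves open_px as a float column with NaN in the first row of each asset, rather than None.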