I have a dataframe with the columns parent_id, parent_name, id, name, and last_category. df looks like this:
parent_id parent_name id name last_category
NaN NaN 1 b 0
1 b 11 b1 0
11 b1 111 b2 0
111 b2 1111 b3 0
1111 b3 11111 b4 1
NaN NaN 2 a 0
2 a 22 a1 0
22 a1 222 a2 0
222 a2 2222 a3 1
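I want to build the hierarchical path for every row of df where last_category is 1, from the root category down to that last category. The new dataframe I want to create (df_last) should look like this: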
name_path id_path
b / b1 / b2 / b3 / b4 1 / 11 / 111 / 1111 / 11111
a / a1 / a2 / a3 2 / 22 / 222 / 2222
How can I do this?
Answer 1
A solution using only numpy and pandas:
import numpy as np
import pandas as pd

# It's easier if we index the dataframe with the `id`
# I assume this ID is unique
df = df.set_index("id")
# `parents[i]` returns the parent ID of `i`
parents = df["parent_id"].to_dict()
paths = {}
# Find all nodes with last_category == 1
for id_ in df.query("last_category == 1").index:
    child_id = id_
    path = [child_id]
    # Iteratively travel up the hierarchy until the parent is nan
    while True:
        pid = parents[id_]
        if np.isnan(pid):
            break
        else:
            path.append(pid)
            id_ = pid
    # The path to the child node is the reverse of
    # the path we traveled
    paths[int(child_id)] = np.array(path[::-1], dtype="int")
Then build the result dataframe:
result = pd.DataFrame({
    id_: (
        " / ".join(df.loc[pids, "name"]),
        " / ".join(pids.astype("str"))
    )
    for id_, pids in paths.items()
}, index=["name_path", "id_path"]).T
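For anyone who wants to run the idea end to end, here is a minimal self-contained sketch of the same parent-walking approach. The sample data is retyped from the question, the helper name `build_paths` is hypothetical, and the `seen` set is an extra guard against cyclic parent links that the original answer does not include. It assumes `id` is unique.

```python
import pandas as pd

NAN = float("nan")

# Sample data retyped from the question.
df = pd.DataFrame({
    "parent_id": [NAN, 1, 11, 111, 1111, NAN, 2, 22, 222],
    "parent_name": [None, "b", "b1", "b2", "b3", None, "a", "a1", "a2"],
    "id": [1, 11, 111, 1111, 11111, 2, 22, 222, 2222],
    "name": ["b", "b1", "b2", "b3", "b4", "a", "a1", "a2", "a3"],
    "last_category": [0, 0, 0, 0, 1, 0, 0, 0, 1],
})


def build_paths(frame):
    """Hypothetical helper: one row per node with last_category == 1."""
    indexed = frame.set_index("id")
    parents = indexed["parent_id"].to_dict()  # id -> parent id (NaN for roots)
    names = indexed["name"].to_dict()         # id -> name
    rows = []
    for leaf in indexed.index[indexed["last_category"] == 1]:
        path, node, seen = [], leaf, set()
        # Walk up until the parent is missing; `seen` stops on cycles.
        while pd.notna(node) and node not in seen:
            node = int(node)
            seen.add(node)
            path.append(node)
            node = parents.get(node)
        path.reverse()  # we collected leaf -> root, so flip to root -> leaf
        rows.append({
            "name_path": " / ".join(names[n] for n in path),
            "id_path": " / ".join(str(n) for n in path),
        })
    return pd.DataFrame(rows, columns=["name_path", "id_path"])


df_last = build_paths(df)
print(df_last)
```

Keeping the parent and name lookups in plain dicts avoids repeated `.loc` calls inside the loop, which may matter on larger category trees.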
Answer 2
You can use networkx to resolve the paths between the root nodes and the leaf nodes with the all_simple_paths function.
# Python env: pip install networkx
# Anaconda env: conda install networkx
import networkx as nx
import pandas as pd

# Create network from your dataframe
G = nx.from_pandas_edgelist(df, source='parent_id', target='id',
                            create_using=nx.DiGraph)
nx.set_node_attributes(G, df.set_index('id')[['name']].to_dict('index'))
# Find roots of your graph (a root is a node with no input)
roots = [node for node, degree in G.in_degree() if degree == 0]
# Find leaves of your graph (a leaf is a node with no output)
leaves = [node for node, degree in G.out_degree() if degree == 0]
# Find all paths
paths = []
for root in roots:
    for leaf in leaves:
        for path in nx.all_simple_paths(G, root, leaf):
            # [1:] to remove NaN parent_id
            paths.append({'id_path': ' / '.join(str(n) for n in path[1:]),
                          'name_path': ' / '.join(G.nodes[n]['name'] for n in path[1:])})
out = pd.DataFrame(paths)
Output:
>>> out
id_path name_path
0 1 / 11 / 111 / 1111 / 11111 b / b1 / b2 / b3 / b4
1 2 / 22 / 222 / 2222 a / a1 / a2 / a3
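Note that the approach above picks its end points by graph shape (nodes with no outgoing edge), not by the last_category column. If some flagged rows are not leaves, or some leaves are not flagged, a variant along the following lines might be closer to what the question asks. This is only a sketch under assumptions: the sample data is retyped from the question, rows with a missing parent_id are treated as roots and left out of the edge list, and the names `edges`, `targets`, and `out_last` are illustrative.

```python
import networkx as nx
import pandas as pd

NAN = float("nan")

# Sample data retyped from the question (parent_name is not needed here).
df = pd.DataFrame({
    "parent_id": [NAN, 1, 11, 111, 1111, NAN, 2, 22, 222],
    "id": [1, 11, 111, 1111, 11111, 2, 22, 222, 2222],
    "name": ["b", "b1", "b2", "b3", "b4", "a", "a1", "a2", "a3"],
    "last_category": [0, 0, 0, 0, 1, 0, 0, 0, 1],
})

# Keep only real parent -> child edges, with integer node labels.
edges = df.dropna(subset=["parent_id"]).astype({"parent_id": int})
G = nx.from_pandas_edgelist(edges, source="parent_id", target="id",
                            create_using=nx.DiGraph)
# Make sure childless roots are still present as nodes.
G.add_nodes_from(df["id"])

names = df.set_index("id")["name"].to_dict()
roots = df.loc[df["parent_id"].isna(), "id"].tolist()
targets = df.loc[df["last_category"] == 1, "id"].tolist()

rows = []
for target in targets:
    for root in roots:
        # In a tree there is at most one root -> target path.
        if nx.has_path(G, root, target):
            path = nx.shortest_path(G, root, target)
            rows.append({
                "id_path": " / ".join(str(n) for n in path),
                "name_path": " / ".join(names[n] for n in path),
            })

out_last = pd.DataFrame(rows)
print(out_last)
```

Dropping the NaN rows before building the graph also means there is no placeholder NaN node to slice off each path.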