During data analysis, we sometimes run into a requirement like this.
For example, a single GO ID may map to several Gene IDs, as shown below:
GO_ids Gene_ids
0 GO:666666 AT1G12310,AT1G12320,AT1G23330
1 GO:888888 Gene1,Gene2,Gene3
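We want to turn this into a one-to-one relationship between GO IDs and Gene IDs, so that we can then attach expression levels or other annotations to each gene. The target table looks like this: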
GO_ids Gene_ids
0 GO:666666 AT1G12310
0 GO:666666 AT1G12320
0 GO:666666 AT1G23330
1 GO:888888 Gene1
1 GO:888888 Gene2
1 GO:888888 Gene3
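I used to do this kind of thing too, and I actually hand-wrote the code for it. Later, at a data analysis meeting, I heard someone mention the explode function in Hive and found it interesting. I figured the Python data analysis ecosystem could not possibly be missing such a wheel, so I searched for it, and sure enough it exists.
The best way to learn a tool is to read the docs, read the docs, read the docs.
The documentation for pandas explode is linked here: pandas explode function.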
This routine will explode list-likes including lists, tuples, sets, Series, and np.ndarray. The result dtype of the subset rows will be object. Scalars will be returned unchanged, and empty list-likes will result in a np.nan for that row. In addition, the ordering of rows in the output will be non-deterministic when exploding sets.
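The behaviors described in this passage can be checked with a small sketch. The Series below is a made-up toy example (not from the documentation) that mixes a list, a scalar, an empty list, and a tuple.

```python
import pandas as pd

# Hypothetical toy data covering the cases the documentation mentions:
# a list, a plain scalar, an empty list, and a tuple.
s = pd.Series([["a", "b"], "scalar", [], ("c", "d")])

# Each element of a list-like becomes its own row; the original
# index label is repeated for rows that came from the same element.
exploded = s.explode()
print(exploded)
```

The scalar row passes through unchanged, and the empty list becomes a single missing value rather than disappearing.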
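The most important point in the documentation is that explode can "explode" the elements of a list-like into new rows. Below, we walk through an example to show how the explode function works.
Code blocks used for testing in a Jupyter notebook:
# Build the test dataset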
import pandas as pd
import numpy as np
df = pd.DataFrame.from_dict({"GO_ids":["GO:666666", "GO:888888"], "Gene_ids":["AT1G12310,AT1G12320,AT1G23330", "Gene1,Gene2,Gene3"]})
df
# Turn each element of the column to be exploded into a list-like
df["Gene_ids"] = df["Gene_ids"].apply(lambda x: x.split(","))
# Equivalent, using the vectorized string accessor:
# df["Gene_ids"] = df["Gene_ids"].str.split(",")
df
# explode: each list element becomes its own row (returns a new DataFrame)
df.explode("Gene_ids")
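As a follow-up, here is a rough sketch of the stated goal: attaching expression values to each gene once the table is in one-to-one form. The `expr` table, its `expression` column, and the numbers in it are invented purely for illustration. The sketch also renumbers the rows, since explode repeats the original index (0, 0, 0, 1, 1, 1) as seen in the target table.

```python
import pandas as pd

# Same toy GO table as in the post.
df = pd.DataFrame({
    "GO_ids": ["GO:666666", "GO:888888"],
    "Gene_ids": ["AT1G12310,AT1G12320,AT1G23330", "Gene1,Gene2,Gene3"],
})

# Split the comma-separated string, explode it, then renumber rows 0..n-1.
long_df = (
    df.assign(Gene_ids=df["Gene_ids"].str.split(","))
      .explode("Gene_ids")
      .reset_index(drop=True)
)

# Hypothetical expression table; column name and values are made up.
expr = pd.DataFrame({
    "Gene_ids": ["AT1G12310", "AT1G12320", "Gene1"],
    "expression": [10.5, 3.2, 7.0],
})

# A left join keeps every GO/gene pair; genes without a measurement get NaN.
annotated = long_df.merge(expr, on="Gene_ids", how="left")
print(annotated)
```

A left join is used here so that no GO/gene pair is dropped when a gene has no expression value; an inner join would keep only the annotated genes.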