NIC Packet Reception
- How is the kernel networking module initialized?
- How does the kernel send and receive packets through the NIC driver?
- How does data received by the driver get handed to the protocol stack?
1. Framework
Within the network subsystem, this article focuses on the interaction between the driver and the kernel: how the NIC hands a received packet to the kernel, and how the kernel then hands that packet to the protocol stack.
In the kernel, a NIC is described by the net_device structure. Through net_device, the driver registers a set of functions that operate the NIC hardware, which is what lets the kernel use the NIC. Every packet in kernel space is represented by an sk_buff structure, so converting the data received by the hardware into an sk_buff the kernel understands is also the driver's job.
Beyond these, two more structures play a very important role: struct softnet_data and struct napi_struct. Together they provide the support for processing packets in softirq context. A borrowed diagram:
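2. Initialization
Everything starts at power-on; once system initialization has finished, the system should be usable. Initialization of the network subsystem also starts from start_kernel, the first C function Linux enters after its two-stage bootstrap. Before that point it is the story of the bootloader and Linux; after it, Linux is a one-man show.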
Call chain for network subsystem initialization: start_kernel -> rest_init -> kernel_init -> kernel_init_freeable -> do_basic_setup -> do_initcalls -> do_initcall_level -> do_one_initcall -> net_dev_init.
In the call chain above, kernel_init runs in a kernel thread of its own:
pid = kernel_thread(kernel_init, NULL, CLONE_FS);
The next problem is that once we step into do_initcalls, the style changes abruptly:
static void __init do_initcalls(void)
{
    int level;

    for (level = 0; level < ARRAY_SIZE(initcall_levels) - 1; level++)
        do_initcall_level(level);
}
Who am I, where did I come from, where am I going?
If do_initcalls still leaves a glimmer of hope of reading on, opening do_initcall_level may bring real despair.
static void __init do_initcall_level(int level)
{
    initcall_t *fn;

    strcpy(initcall_command_line, saved_command_line);
    parse_args(initcall_level_names[level],
               initcall_command_line, __start___param,
               __stop___param - __start___param,
               level, level,
               NULL, &repair_env_string);
    trace_initcall_level(initcall_level_names[level]);
    for (fn = initcall_levels[level]; fn < initcall_levels[level+1]; fn++)
        do_one_initcall(*fn);
}
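There is nothing but a bare fn pointer, and which functions it ends up calling is anyone's guess. Never mind: I say net_dev_init gets called, so it gets called. The great Google tells me that any function wrapped in one of the macros below is called by do_one_initcall. What black magic makes that work can wait: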
#file:include/linux/init.h
#define pure_initcall(fn) __define_initcall(fn, 0)
#define core_initcall(fn) __define_initcall(fn, 1)
#define core_initcall_sync(fn) __define_initcall(fn, 1s)
#define postcore_initcall(fn) __define_initcall(fn, 2)
#define postcore_initcall_sync(fn) __define_initcall(fn, 2s)
#define arch_initcall(fn) __define_initcall(fn, 3)
#define arch_initcall_sync(fn) __define_initcall(fn, 3s)
#define subsys_initcall(fn) __define_initcall(fn, 4)
#define subsys_initcall_sync(fn) __define_initcall(fn, 4s)
#define fs_initcall(fn) __define_initcall(fn, 5)
#define fs_initcall_sync(fn) __define_initcall(fn, 5s)
#define rootfs_initcall(fn) __define_initcall(fn, rootfs)
#define device_initcall(fn) __define_initcall(fn, 6)
#define device_initcall_sync(fn) __define_initcall(fn, 6s)
#define late_initcall(fn) __define_initcall(fn, 7)
#define late_initcall_sync(fn) __define_initcall(fn, 7s)
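The "black magic" deserves a brief sketch. In the kernel, each of these macros stores a pointer to the function in a linker section dedicated to its level, and do_initcall_level walks those sections in level order. The user-space model below is a simplification under stated assumptions: an ordinary table stands in for the linker sections, and every `demo_` name is hypothetical. It only illustrates why a level-4 subsys_initcall such as net_dev_init runs before any level-6 device_initcall.

```c
#include <stddef.h>

typedef int (*initcall_t)(void);

/* Records the level of each init function in the order it actually ran. */
static int demo_run_order[8];
static int demo_run_count;

/* Hypothetical init functions; the stored number is the level they model. */
static int demo_core_init(void)    { demo_run_order[demo_run_count++] = 1; return 0; }
static int demo_net_dev_init(void) { demo_run_order[demo_run_count++] = 4; return 0; }
static int demo_driver_init(void)  { demo_run_order[demo_run_count++] = 6; return 0; }

struct demo_initcall {
    int level;       /* 0..7, as in pure_initcall .. late_initcall */
    initcall_t fn;
};

/* Stand-in for the per-level linker sections, deliberately out of order. */
static const struct demo_initcall demo_table[] = {
    { 6, demo_driver_init },   /* like device_initcall(fn) */
    { 4, demo_net_dev_init },  /* like subsys_initcall(fn) */
    { 1, demo_core_init },     /* like core_initcall(fn)   */
};

/* Model of do_initcalls(): run every registered function, lowest level
 * first. Returns the number of functions called. */
int demo_do_initcalls(void)
{
    size_t n = sizeof(demo_table) / sizeof(demo_table[0]);
    size_t i;
    int level;

    demo_run_count = 0;
    for (level = 0; level <= 7; level++)
        for (i = 0; i < n; i++)
            if (demo_table[i].level == level)
                demo_table[i].fn();
    return demo_run_count;
}

/* Level of the k-th function that ran (valid after demo_do_initcalls()). */
int demo_level_run_at(int k)
{
    return demo_run_order[k];
}
```

Calling demo_do_initcalls() runs the three functions in level order regardless of their position in the table, which is the property the real macros rely on.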
Right below the definition of net_dev_init we find subsys_initcall(net_dev_init);. OK, the entry point of network subsystem initialization has been found.
static int __init net_dev_init(void)
{
    int i, rc = -ENOMEM;

    BUG_ON(!dev_boot_phase);

    if (dev_proc_init())
        goto out;

    if (netdev_kobject_init())
        goto out;

    INIT_LIST_HEAD(&ptype_all);
    for (i = 0; i < PTYPE_HASH_SIZE; i++)
        INIT_LIST_HEAD(&ptype_base[i]);

    INIT_LIST_HEAD(&offload_base);

    if (register_pernet_subsys(&netdev_net_ops))
        goto out;

    /*
     * Initialise the packet receive queues.
     */
    for_each_possible_cpu(i) {
        struct work_struct *flush = per_cpu_ptr(&flush_works, i);
        struct softnet_data *sd = &per_cpu(softnet_data, i);

        INIT_WORK(flush, flush_backlog);

        skb_queue_head_init(&sd->input_pkt_queue);
        skb_queue_head_init(&sd->process_queue);
#ifdef CONFIG_XFRM_OFFLOAD
        skb_queue_head_init(&sd->xfrm_backlog);
#endif
        INIT_LIST_HEAD(&sd->poll_list);
        sd->output_queue_tailp = &sd->output_queue;
#ifdef CONFIG_RPS
        sd->csd.func = rps_trigger_softirq;
        sd->csd.info = sd;
        sd->cpu = i;
#endif
        sd->backlog.poll = process_backlog;
        sd->backlog.weight = weight_p;
    }

    dev_boot_phase = 0;

    /* The loopback device is special if any other network devices
     * is present in a network namespace the loopback device must
     * be present. Since we now dynamically allocate and free the
     * loopback device ensure this invariant is maintained by
     * keeping the loopback device as the first device on the
     * list of network devices. Ensuring the loopback devices
     * is the first device that appears and the last network device
     * that disappears.
     */
    if (register_pernet_device(&loopback_net_ops))
        goto out;

    if (register_pernet_device(&default_device_ops))
        goto out;

    open_softirq(NET_TX_SOFTIRQ, net_tx_action);
    open_softirq(NET_RX_SOFTIRQ, net_rx_action);

    rc = cpuhp_setup_state_nocalls(CPUHP_NET_DEV_DEAD, "net/dev:dead",
                                   NULL, dev_cpu_dead);
    WARN_ON(rc < 0);
    rc = 0;
out:
    return rc;
}
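net_dev_init initializes the kernel's packet send and receive queues and opens the corresponding softirqs, NET_TX_SOFTIRQ and NET_RX_SOFTIRQ. Along the way it initializes one softnet_data per CPU, on whose poll_list the napi_struct of every device that needs servicing is queued. This structure is very important: softirq processing takes napi_struct entries off this list and receives packets through them. It is also one of the interfaces between the kernel and the driver.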
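Then there are the two softirqs that were opened. Once the driver has finished the necessary top-half work in the hard interrupt, it raises the corresponding softirq so that the packets are processed in the softirq bottom half.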
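Once net_dev_init has run, the kernel is able to process packets. As long as a driver queues the napi_struct of a device that needs to receive onto softnet_data, softirq processing does the rest, either on the way out of the hard interrupt or, under load, in the per-CPU kernel thread ksoftirqd. Next comes the initialization of the NIC driver.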
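Different NICs naturally have different drivers; each driver encapsulates the peculiarities of its hardware and presents a uniform interface to the kernel. We do not care here how the driver physically puts data on the wire or takes it off. What we explore is how the NIC hands received data to the kernel and how the kernel hands data to be sent to the NIC; in short, which interfaces the driver must provide to the kernel and what support the kernel must give the driver. We take the e1000 NIC as the example and look at its love story with the kernel.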
The e1000 NIC is a PCI device, so the kernel must first be able to discover it over the PCI bus. For that the driver registers a pci_driver structure with the kernel. How PCI devices work is a separate topic that is not explored here, and not one I know well:
static struct pci_driver e1000_driver = {
    .name        = e1000_driver_name,
    .id_table    = e1000_pci_tbl,
    .probe       = e1000_probe,
    .remove      = e1000_remove,
#ifdef CONFIG_PM
    /* Power Management Hooks */
    .suspend     = e1000_suspend,
    .resume      = e1000_resume,
#endif
    .shutdown    = e1000_shutdown,
    .err_handler = &e1000_err_handler
};
Here e1000_probe is the probe callback given to the kernel. It is effectively the NIC's initialization function, and the driver has to initialize the device in it. With the bus-related, error-handling, and hardware-specific code removed:
static int e1000_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
{
    struct net_device *netdev;
    struct e1000_adapter *adapter;
    int err;

    netdev = alloc_etherdev(sizeof(struct e1000_adapter)); /* allocate the net_device */
    adapter = netdev_priv(netdev); /* driver-private data allocated with the net_device */

    netdev->netdev_ops = &e1000_netdev_ops; /* register the device-operation callbacks */
    e1000_set_ethtool_ops(netdev);
    netdev->watchdog_timeo = 5 * HZ;
    netif_napi_add(netdev, &adapter->napi, e1000_clean, 64); /* poll hook called from the softirq */
    strncpy(netdev->name, pci_name(pdev), sizeof(netdev->name) - 1);

    err = register_netdev(netdev);
    return err;
}
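Every network device is described by a corresponding net_device structure. Much like the file operations of a device file, it holds a set of interface functions for operating the device, netdev_ops; for the e1000 NIC this is e1000_netdev_ops. When the NIC is operated from a terminal with the ifup and ifdown commands, the corresponding open and close callbacks (ndo_open and ndo_stop) are invoked. The most important part of this code, though, is the call to netif_napi_add: it registers e1000_clean with the kernel so that it can be called from the per-CPU receive queue described above.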
After initialization, the driver has registered the net_device that describes the NIC, and the kernel can operate the device through it. Through e1000_clean, the kernel softirq can now receive packets as well.
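The callback-table idea behind netdev_ops can be sketched in user space. This is a minimal model and not the kernel's definition: the `demo_` types and functions are hypothetical, and only the shape (the driver fills in a table of function pointers, the core calls through it) mirrors net_device_ops.

```c
#include <stddef.h>

struct demo_netdev;

/* Hypothetical, much reduced counterpart of struct net_device_ops. */
struct demo_netdev_ops {
    int (*open)(struct demo_netdev *dev);  /* models ndo_open */
    int (*stop)(struct demo_netdev *dev);  /* models ndo_stop */
};

struct demo_netdev {
    const struct demo_netdev_ops *ops;
    int up;                                /* 1 while the interface is up */
};

/* "Driver" side: hardware-specific implementations. */
static int demo_nic_open(struct demo_netdev *dev) { dev->up = 1; return 0; }
static int demo_nic_stop(struct demo_netdev *dev) { dev->up = 0; return 0; }

static const struct demo_netdev_ops demo_nic_ops = {
    .open = demo_nic_open,
    .stop = demo_nic_stop,
};

/* "Core" side: knows nothing about the hardware, only the table. */
static int demo_dev_open(struct demo_netdev *dev)
{
    if (dev->ops == NULL || dev->ops->open == NULL)
        return -1;
    return dev->ops->open(dev);
}

static int demo_dev_close(struct demo_netdev *dev)
{
    if (dev->ops == NULL || dev->ops->stop == NULL)
        return -1;
    return dev->ops->stop(dev);
}

/* Brings a device up and down again; returns 10 * (state after open)
 * + (state after close), so the expected value is 10. */
int demo_ifup_ifdown(void)
{
    struct demo_netdev dev = { &demo_nic_ops, 0 };
    int seen;

    demo_dev_open(&dev);
    seen = dev.up * 10;
    demo_dev_close(&dev);
    seen += dev.up;
    return seen;
}
```

The core code never names the driver's functions, which is why one kernel can drive many different NICs through the same entry points.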
3. Driver Receive
Earlier we saw a kernel softirq that receives packets, but what triggers that softirq? A hard interrupt. When data arrives at the NIC, the NIC raises a hard interrupt. Its handler, e1000_intr, is registered when the NIC is brought up: the open callback in e1000_netdev_ops (e1000_open) requests the IRQ through e1000_request_irq.
/**
 * e1000_intr - Interrupt Handler
 * @irq: interrupt number
 * @data: pointer to a network interface device structure
 **/
static irqreturn_t e1000_intr(int irq, void *data)
{
    struct net_device *netdev = data;
    struct e1000_adapter *adapter = netdev_priv(netdev);
    struct e1000_hw *hw = &adapter->hw;
    u32 icr = er32(ICR);

    /* disable interrupts, without the synchronize_irq bit */
    ew32(IMC, ~0);
    E1000_WRITE_FLUSH();

    if (likely(napi_schedule_prep(&adapter->napi))) {
        adapter->total_tx_bytes = 0;
        adapter->total_tx_packets = 0;
        adapter->total_rx_bytes = 0;
        adapter->total_rx_packets = 0;
        __napi_schedule(&adapter->napi);
    } else {
        /* this really should not happen! if it does it is basically a
         * bug, but not a hard error, so enable ints and continue
         */
        if (!test_bit(__E1000_DOWN, &adapter->flags))
            e1000_irq_enable(adapter);
    }

    return IRQ_HANDLED;
}
With the unlikely branches stripped out, the check if (likely(napi_schedule_prep(&adapter->napi))) tests whether the device's own napi is already scheduled on a CPU. If it is not, __napi_schedule queues the napi on the current CPU's softnet_data, so that the softirq can find this napi and poll it.
/**
 * __napi_schedule - schedule for receive
 * @n: entry to schedule
 *
 * The entry's receive function will be scheduled to run.
 * Consider using __napi_schedule_irqoff() if hard irqs are masked.
 */
void __napi_schedule(struct napi_struct *n)
{
    unsigned long flags;

    local_irq_save(flags);
    ____napi_schedule(this_cpu_ptr(&softnet_data), n);
    local_irq_restore(flags);
}

/* Called with irq disabled */
static inline void ____napi_schedule(struct softnet_data *sd,
                                     struct napi_struct *napi)
{
    list_add_tail(&napi->poll_list, &sd->poll_list);
    __raise_softirq_irqoff(NET_RX_SOFTIRQ); /* set the NET_RX_SOFTIRQ pending flag */
}
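The softnet_data here is the one net_dev_init initialized for each CPU. At this point the hard interrupt has been handled, yet we still have not seen any packet being processed; all we know is that a napi has been queued. That is because a hard interrupt handler must not run for long, so it really does no data processing. All of that is left to the softirq.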
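The hand-off just described (test whether the napi is already scheduled, queue it, mark the softirq pending) can be modeled in a few lines of user-space C. This is a simplified sketch under assumptions: the real napi_schedule_prep also handles disabled and missed states, the real list is a per-CPU struct list_head, and all `demo_` names and bit values are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NAPI_SCHED 1u         /* hypothetical "already queued" bit    */
#define DEMO_NET_RX     (1u << 3)  /* hypothetical pending-softirq bit     */

struct demo_napi {
    atomic_uint state;
    struct demo_napi *next;
};

struct demo_softnet {
    struct demo_napi *head, *tail;  /* models sd->poll_list                */
    unsigned int pending;           /* models the pending-softirq mask     */
};

/* One CPU and one NIC, zero-initialized as static objects. */
static struct demo_softnet demo_cpu0;
static struct demo_napi demo_nic;

/* Models napi_schedule_prep(): atomically set SCHED, report whether we
 * were the one who set it. */
static bool demo_schedule_prep(struct demo_napi *n)
{
    return !(atomic_fetch_or(&n->state, DEMO_NAPI_SCHED) & DEMO_NAPI_SCHED);
}

/* Models ____napi_schedule(): append to the poll list, raise the softirq. */
static void demo_schedule(struct demo_softnet *sd, struct demo_napi *n)
{
    n->next = NULL;
    if (sd->tail)
        sd->tail->next = n;
    else
        sd->head = n;
    sd->tail = n;
    sd->pending |= DEMO_NET_RX;
}

/* Models the hard interrupt handler; returns 1 if the napi was queued. */
int demo_hard_irq(void)
{
    if (!demo_schedule_prep(&demo_nic))
        return 0;
    demo_schedule(&demo_cpu0, &demo_nic);
    return 1;
}

/* Number of napi entries currently on the poll list. */
int demo_queue_length(void)
{
    int len = 0;
    struct demo_napi *n;

    for (n = demo_cpu0.head; n != NULL; n = n->next)
        len++;
    return len;
}

/* 1 if the receive softirq is marked pending. */
int demo_rx_pending(void)
{
    return (demo_cpu0.pending & DEMO_NET_RX) != 0;
}
```

A second interrupt that arrives before the napi has been polled finds the SCHED bit set and queues nothing, so one device never appears on the list twice.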
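4. Kernel Processing
The hard interrupt has tossed a napi structure over to the kernel; how does the kernel use it to receive data? As mentioned earlier, the kernel runs one ksoftirqd kernel thread per CPU core, and softirqs are handled in that thread whenever they cannot be finished on the way out of the hard interrupt. The hard interrupt handler above set the NET_RX_SOFTIRQ pending flag. Remember where the handler for this softirq was registered? Right, in net_dev_init.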
open_softirq(NET_TX_SOFTIRQ, net_tx_action);
open_softirq(NET_RX_SOFTIRQ, net_rx_action);
Clearly, the rest of the processing is done by net_rx_action.
static __latent_entropy void net_rx_action(struct softirq_action *h)
{
    struct softnet_data *sd = this_cpu_ptr(&softnet_data);
    unsigned long time_limit = jiffies +
        usecs_to_jiffies(netdev_budget_usecs);
    int budget = netdev_budget;
    LIST_HEAD(list);
    LIST_HEAD(repoll);

    local_irq_disable();
    list_splice_init(&sd->poll_list, &list);
    local_irq_enable();

    for (;;) {
        struct napi_struct *n;

        if (list_empty(&list)) {
            if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
                goto out;
            break;
        }

        n = list_first_entry(&list, struct napi_struct, poll_list);
        budget -= napi_poll(n, &repoll); /* calls back into the driver's poll function stored in the napi */

        /* If softirq window is exhausted then punt.
         * Allow this to run for 2 jiffies since which will allow
         * an average latency of 1.5/HZ.
         */
        if (unlikely(budget <= 0 ||
                     time_after_eq(jiffies, time_limit))) {
            sd->time_squeeze++;
            break;
        }
    }

    local_irq_disable();

    list_splice_tail_init(&sd->poll_list, &list);
    list_splice_tail(&repoll, &list);
    list_splice(&list, &sd->poll_list);
    if (!list_empty(&sd->poll_list))
        __raise_softirq_irqoff(NET_RX_SOFTIRQ);

    net_rps_action_and_irq_enable(sd);
out:
    __kfree_skb_flush();
}
Above we see budget -= napi_poll(n, &repoll);. This calls the poll function the driver registered during initialization, which for the e1000 NIC is e1000_clean.
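The budget accounting in net_rx_action is easier to follow in a reduced model. The sketch below is an illustration under assumptions, not kernel code: it drops the time limit and RPS, uses an array instead of linked lists, and the `demo_` names and the backlog field are hypothetical. It keeps two rules from the function above: each napi is polled once per pass with at most its weight, and the softirq is raised again if a napi used its full weight or was not reached before the budget ran out.

```c
#include <stdbool.h>

struct demo_napi {
    int weight;   /* max packets per poll call (64 for e1000 above)  */
    int backlog;  /* packets waiting in the device ring (hypothetical) */
};

struct demo_result {
    int processed;  /* packets handled in this softirq pass           */
    bool reraise;   /* true if NET_RX_SOFTIRQ must be raised again     */
};

/* Hypothetical driver poll: handle up to `budget` packets. */
static int demo_poll(struct demo_napi *n, int budget)
{
    int done = n->backlog < budget ? n->backlog : budget;

    n->backlog -= done;
    return done;
}

/* One pass over the poll list with a global budget. */
static struct demo_result demo_net_rx_action(struct demo_napi *napis,
                                             int count, int budget)
{
    struct demo_result r = { 0, false };
    int i = 0, repoll = 0;

    while (i < count) {
        int work = demo_poll(&napis[i], napis[i].weight);

        r.processed += work;
        budget -= work;
        if (work == napis[i].weight)
            repoll++;        /* used its whole weight: poll it again later */
        i++;
        if (budget <= 0)
            break;           /* softirq window exhausted */
    }
    r.reraise = (i < count) || (repoll > 0);
    return r;
}

/* Two devices of weight 64 with the given backlogs. */
struct demo_result demo_two_nics(int backlog_a, int backlog_b, int budget)
{
    struct demo_napi napis[2] = { { 64, backlog_a }, { 64, backlog_b } };

    return demo_net_rx_action(napis, 2, budget);
}
```

With backlogs of 100 and 10 and a budget of 300, the first device is cut off at its weight of 64 and the second drains completely, so 74 packets are processed and the softirq is raised again for the first device.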
/**
 * e1000_clean - NAPI Rx polling callback
 * @adapter: board private structure
 **/
static int e1000_clean(struct napi_struct *napi, int budget)
{
    struct e1000_adapter *adapter = container_of(napi, struct e1000_adapter,
                                                 napi);
    int tx_clean_complete = 0, work_done = 0;

    tx_clean_complete = e1000_clean_tx_irq(adapter, &adapter->tx_ring[0]);

    adapter->clean_rx(adapter, &adapter->rx_ring[0], &work_done, budget); /* hand the data to the protocol stack */

    if (!tx_clean_complete)
        work_done = budget;

    /* If budget not fully consumed, exit the polling mode */
    if (work_done < budget) {
        if (likely(adapter->itr_setting & 3))
            e1000_set_itr(adapter);
        napi_complete_done(napi, work_done);
        if (!test_bit(__E1000_DOWN, &adapter->flags))
            e1000_irq_enable(adapter);
    }

    return work_done;
}
e1000_clean processes packets by calling the clean_rx function pointer.
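Two details of e1000_clean are worth isolating: container_of recovers the adapter from the embedded napi, and the return value tells the core whether polling is finished. The model below is a hedged sketch with hypothetical `demo_` names; the interrupt-enable flag stands in for napi_complete_done plus e1000_irq_enable, and the rx_pending counter stands in for the receive ring.

```c
#include <stdbool.h>
#include <stddef.h>

/* Same idea as the kernel's container_of, without its type checking. */
#define demo_container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_napi {
    int unused;
};

struct demo_adapter {
    int rx_pending;          /* packets waiting in the ring (hypothetical) */
    bool tx_done;            /* result of the transmit cleanup             */
    bool irq_enabled;        /* true once the device IRQ is re-enabled     */
    struct demo_napi napi;   /* embedded, as in struct e1000_adapter       */
};

/* Models the poll contract of e1000_clean. */
static int demo_clean(struct demo_napi *napi, int budget)
{
    struct demo_adapter *ad =
        demo_container_of(napi, struct demo_adapter, napi);
    int work_done = ad->rx_pending < budget ? ad->rx_pending : budget;

    ad->rx_pending -= work_done;

    if (!ad->tx_done)
        work_done = budget;      /* claim the full budget to stay in polling */

    if (work_done < budget)
        ad->irq_enabled = true;  /* polling done: back to interrupt mode */

    return work_done;
}

struct demo_outcome {
    int work_done;
    bool irq_enabled;
};

/* Runs one poll on a fresh adapter and reports what happened. */
struct demo_outcome demo_clean_once(int rx_pending, bool tx_done, int budget)
{
    struct demo_adapter ad = { rx_pending, tx_done, false, { 0 } };
    struct demo_outcome out;

    out.work_done = demo_clean(&ad.napi, budget);
    out.irq_enabled = ad.irq_enabled;
    return out;
}
```

Returning less than the budget is the driver's way of saying "nothing left, re-enable my interrupt"; returning the full budget keeps the device in polling mode with its interrupt still masked.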
/**
 * e1000_clean_jumbo_rx_irq - Send received data up the network stack; legacy
 * @adapter: board private structure
 * @rx_ring: ring to clean
 * @work_done: amount of napi work completed this call
 * @work_to_do: max amount of work allowed for this call to do
 *
 * the return value indicates whether actual cleaning was done, there
 * is no guarantee that everything was cleaned
 */
static bool e1000_clean_jumbo_rx_irq(struct e1000_adapter *adapter,
                                     struct e1000_rx_ring *rx_ring,
                                     int *work_done, int work_to_do)
{
    struct net_device *netdev = adapter->netdev;
    struct pci_dev *pdev = adapter->pdev;
    struct e1000_rx_desc *rx_desc, *next_rxd;
    struct e1000_rx_buffer *buffer_info, *next_buffer;
    u32 length;
    unsigned int i;
    int cleaned_count = 0;
    bool cleaned = false;
    unsigned int total_rx_bytes = 0, total_rx_packets = 0;

    i = rx_ring->next_to_clean;
    rx_desc = E1000_RX_DESC(*rx_ring, i);
    buffer_info = &rx_ring->buffer_info[i];

    /* descriptor loop elided: it builds skb and reads status */

    e1000_receive_skb(adapter, status, rx_desc->special, skb);

    napi_gro_frags(&adapter->napi);

    return cleaned;
}
/**
 * e1000_receive_skb - helper function to handle rx indications
 * @adapter: board private structure
 * @status: descriptor status field as written by hardware
 * @vlan: descriptor vlan field as written by hardware (no le/be conversion)
 * @skb: pointer to sk_buff to be indicated to stack
 */
static void e1000_receive_skb(struct e1000_adapter *adapter, u8 status,
                              __le16 vlan, struct sk_buff *skb)
{
    skb->protocol = eth_type_trans(skb, adapter->netdev);

    if (status & E1000_RXD_STAT_VP) {
        u16 vid = le16_to_cpu(vlan) & E1000_RXD_SPC_VLAN_MASK;

        __vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q), vid);
    }
    napi_gro_receive(&adapter->napi, skb);
}
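The first thing e1000_receive_skb does is call eth_type_trans, which among other things reads the EtherType from the Ethernet header so the stack knows which protocol handler should get the packet. The user-space sketch below shows only that one step and is a simplification: the real function also sets skb->dev, pulls the header, classifies the packet type, handles 802.3 length fields, and returns the value in network byte order. The `demo_` names and the sample frame are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define DEMO_ETH_HLEN 14  /* 6 bytes dst MAC + 6 bytes src MAC + 2 bytes type */

/* Returns the EtherType in host byte order, or -1 if the frame is too
 * short to contain an Ethernet header. */
int demo_eth_type(const uint8_t *frame, size_t len)
{
    if (frame == NULL || len < DEMO_ETH_HLEN)
        return -1;
    /* The type field is big-endian on the wire: bytes 12 and 13. */
    return (frame[12] << 8) | frame[13];
}

/* Parses a hand-written sample header carrying IPv4 (EtherType 0x0800). */
int demo_eth_type_of_sample(void)
{
    static const uint8_t frame[DEMO_ETH_HLEN] = {
        0xff, 0xff, 0xff, 0xff, 0xff, 0xff,  /* destination: broadcast     */
        0x02, 0x00, 0x00, 0x00, 0x00, 0x01,  /* source: made-up local MAC  */
        0x08, 0x00                           /* EtherType: IPv4            */
    };

    return demo_eth_type(frame, sizeof(frame));
}
```

The value stored in skb->protocol is what the protocol stack later uses to pick the handler, for example the IPv4 receive path for 0x0800.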
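e1000_clean_jumbo_rx_irq is too long, so only its call to e1000_receive_skb is kept. e1000_receive_skb calls napi_gro_receive, another function provided by NAPI, and from there our skb reaches netif_receive_skb, the entry point of the protocol stack. The call path is napi_gro_receive -> napi_skb_finish -> netif_receive_skb_internal -> __netif_receive_skb. The detailed flow can wait. After all, NAPI is a framework the kernel designed to improve NIC receive performance, so let me dig a hole here and fill it in when analyzing NAPI later. In short, the difference between the receive flow with NAPI and the one before it is shown in the figure: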
At this point, the path from the NIC driver to the entry of the protocol stack has been covered.