1 Background
An nginx service running under Docker can respond noticeably slowly in some situations. Following a few articles on the topic, this post records some experiments.
2 Test Environment
Docker images were hard to download, so an nginx image from an older environment was imported into the system:
// Import the nginx image
# docker load -i nginx.tar.gz
// Start nginx, publishing container port 80 on local port 8153
# docker run -d -p 8153:80 --name my-nginx nginx
Verify that the web server is up:
[root@nms ~]# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
f5de348d753e nginx "/docker-entrypoint.…" About a minute ago Up About a minute 0.0.0.0:8153->80/tcp my-nginx
7a2a244a57f8 splos:5.1nms "/bin/sh /etc/rcS_do…" 2 weeks ago Up 23 hours 5.1nms
[root@nms ~]# netstat -antp|grep 8153
tcp6 0 0 :::8153 :::* LISTEN 2480/docker-proxy
# A request to it also works fine
[root@nms ~]# curl http://127.0.0.1:8153
<!DOCTYPE html>
<html>
<head>
<title>Welcome to nginx!</title>
<style>
body {
width: 35em;
margin: 0 auto;
font-family: Tahoma, Verdana, Arial, sans-serif;
}
</style>
</head>
<body>
<h1>Welcome to nginx!</h1>
<p>If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.</p>
<p>For online documentation and support please refer to
<a href="http://nginx.org/">nginx.org</a>.<br/>
Commercial support is available at
<a href="http://nginx.com/">nginx.com</a>.</p>
<p><em>Thank you for using nginx.</em></p>
</body>
</html>
Benchmark the web server with ab:
// -c 5000: concurrency of 5000; -n 100000: 100,000 requests in total; -r: keep going on receive errors; -s 2: 2-second timeout
# ab -c 5000 -n 100000 -r -s 2 http://10.xx.xx.xxx:8153/
Document Path: /
Document Length: 612 bytes
Concurrency Level: 5000
Time taken for tests: 9.158 seconds
Complete requests: 100000
Failed requests: 5397
(Connect: 0, Receive: 0, Length: 2741, Exceptions: 2656)
Write errors: 0
Total transferred: 82191460 bytes
HTML transferred: 59528016 bytes
Requests per second: 10919.47 [#/sec] (mean)
Time per request: 457.898 [ms] (mean)
Time per request: 0.092 [ms] (mean, across all concurrent requests)
Transfer rate: 8764.52 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 255 741.4 5 7029
Processing: 18 112 277.4 50 7372
Waiting: 0 107 266.6 48 6758
Total: 27 367 817.0 57 8660
Percentage of the requests served within a certain time (ms)
50% 57
66% 66
75% 249
80% 449
90% 1059
95% 1429
98% 3062
99% 3272
100% 8660 (longest request)
Note a few key metrics:
Requests per second: 10919.47 [#/sec] (mean), i.e. about 10,919 requests served per second on average
Time per request: 457.898 [ms] (mean), i.e. a mean per-request latency of about 458 ms
Connect: 0 255 741.4 5 7029, i.e. a mean connection-setup latency of 255 ms
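As a sanity check, ab's two "Time per request" figures are related through the concurrency level: the mean latency seen by one client equals the aggregate time per request multiplied by the number of concurrent clients. A quick check with awk, using the totals from the report above:

```shell
# Reproduce ab's two latency figures from the totals:
# time taken = 9.158 s, requests = 100000, concurrency = 5000
awk 'BEGIN {
  t = 9.158; n = 100000; c = 5000
  printf "across all concurrent requests: %.3f ms\n", t / n * 1000
  printf "per request (mean):             %.3f ms\n", c * t / n * 1000
}'
# prints 0.092 ms and 457.900 ms, matching the report
```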
Next, start another nginx container, this time binding its port directly on the host:
// Start an nginx container on the host network
# docker run -d -p 9123:80 --network=host --privileged --name my-nginx-host1 nginx
The intent was to map container port 80 to host port 9123, but this has no effect: with host networking enabled, the -p port-mapping option is ignored. So instead:
// Copy out an nginx.conf (presumably with its listen port changed to 9153, which the next test targets) and bind-mount it over the original:
[root@nms ~]# docker run -d --network=host --privileged --name my-nginx-host4 -v /root/nginx.conf:/etc/nginx/nginx.conf nginx
Test again with ab:
// Same flags as before
[root@localhost spiderflow]# ab -c 5000 -n 100000 -r -s 2 http://10.xx.xx.xxx:9153/
// Key output below
Document Path: /
Document Length: 612 bytes
Concurrency Level: 5000
Time taken for tests: 5.356 seconds
Complete requests: 100000
Failed requests: 16296
(Connect: 0, Receive: 25, Length: 8192, Exceptions: 8079)
Write errors: 0
Total transferred: 77597195 bytes
HTML transferred: 56200572 bytes
Requests per second: 18670.58 [#/sec] (mean)
Time per request: 267.801 [ms] (mean)
Time per request: 0.054 [ms] (mean, across all concurrent requests)
Transfer rate: 14148.28 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 0 139 442.0 5 3036
Processing: 11 58 113.6 31 1663
Waiting: 0 49 108.7 27 1653
Total: 16 196 480.4 37 4651
Percentage of the requests served within a certain time (ms)
50% 37
66% 56
75% 100
80% 137
90% 443
95% 1041
98% 1825
99% 3036
100% 4651 (longest request)
With host networking, where the port is opened directly on the host and no port translation takes place, performance improves considerably:
Requests per second: 18670.58 [#/sec] (mean), i.e. mean throughput rises from ~10.9k to ~18.7k requests per second
Time per request: 267.801 [ms] (mean), i.e. mean per-request latency drops from 457 ms to 267 ms
Connect: 0 139 442.0 5 3036, i.e. mean connection-setup latency drops from 255 ms to 139 ms
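The relative changes can be computed directly from the two ab reports:

```shell
# Percentage change from the port-mapped run to the host-network run
awk 'BEGIN {
  printf "throughput: %+.1f%%\n", (18670.58 / 10919.47 - 1) * 100  # requests/sec
  printf "latency:    %+.1f%%\n", (267.801 / 457.898 - 1) * 100    # mean time per request
  printf "connect:    %+.1f%%\n", (139.0 / 255.0 - 1) * 100        # mean connect time
}'
# prints +71.0%, -41.5%, -45.5%
```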
The failure count is higher this time; raising the -s timeout reduces it significantly, so we set that aside for now.
With the environment in place, we can track down the cause.
3 Troubleshooting
3.1 Packet capture
For network problems a packet capture is usually indispensable, so let's capture some traffic first:
[root@nms ~]# tcpdump -i ens192 tcp port 8153 -w a.pcap
Open the capture in Wireshark:
The summary shows quite a few failed segment reassemblies; looking closer, there are assorted retransmissions, RST packets, and so on:
3.2 Analyzing the packet loss
The reassembly failures above mean packets are being dropped. The program and the runtime environment are identical across the two tests, so the drops must be happening inside the kernel; we need to find out where and why.
A dynamic tracing tool can locate the drop points. eBPF is relatively painful to install on CentOS, so we use SystemTap with the following script:
#! /usr/bin/env stap
############################################################
# Dropwatch.stp
# Author: Neil Horman <nhorman@redhat.com>
# An example script to mimic the behavior of the dropwatch utility
# http://fedorahosted.org/dropwatch
############################################################
# Array to hold the list of drop points we find
global locations
# Note when we turn the monitor on and off
probe begin { printf("Monitoring for dropped packets\n") }
probe end { printf("Stopping dropped packet monitor\n") }
# increment a drop counter for every location we drop at
probe kernel.trace("kfree_skb") { locations[$location] <<< 1 }
# Every 5 seconds report our drop locations
probe timer.sec(5)
{
printf("\n")
foreach (l in locations-) {
printf("%d packets dropped at %s\n",
@count(locations[l]), symname(l))
}
delete locations
}
Simply put, the script prints the locations where the kernel calls kfree_skb; those locations are exactly where packets are dropped.
The output looks something like this:
// Run the script
stap -g --all-modules dropwatch.stp
12 packets dropped at nf_hook_slow
11 packets dropped at ip_rcv_finish
8 packets dropped at ip6_mc_input
1 packets dropped at icmpv6_rcv
9 packets dropped at nf_hook_slow
6 packets dropped at ip6_mc_input
5 packets dropped at ip_rcv_finish
1 packets dropped at tcp_v6_rcv
19 packets dropped at ip_rcv_finish
12 packets dropped at nf_hook_slow
8 packets dropped at ip6_mc_input
^CStopping dropped packet monitor
The script shows the drops are concentrated in nf_hook_slow. This function is part of the Netfilter framework: it runs the callbacks hooked into the kernel network stack, which typically implement packet filtering, network address translation (NAT), and packet mangling. Looking at it from another angle: in the tests above the host and the image were identical, and the only difference was port mapping versus host networking. Host networking opens the port directly in the host's default network namespace, with no NAT translation, so NAT is the likely source of the performance gap. Let's inspect the NAT configuration first:
# Show the NAT table: -n disables reverse DNS lookups, -L lists the rules, -t nat selects the nat table
# iptables -nL -t nat
Recall that with Docker's default network, containers get internal addresses in the 172.17 range. When a container talks to an external service, that 172 source address must be rewritten to the host's IP, which is SNAT; when the external service replies, the traffic must be rewritten back to the internal address and port, which is DNAT.
Now the NAT table contents:
[root@nms ~]# iptables -nL -t nat
Chain PREROUTING (policy ACCEPT)
target prot opt source destination
DOCKER all -- 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
Chain INPUT (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination
DOCKER all -- 0.0.0.0/0 !127.0.0.0/8 ADDRTYPE match dst-type LOCAL
Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
MASQUERADE all -- 172.17.0.0/16 0.0.0.0/0
MASQUERADE tcp -- 172.17.0.3 172.17.0.3 tcp dpt:80
Chain DOCKER (2 references)
target prot opt source destination
RETURN all -- 0.0.0.0/0 0.0.0.0/0
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8153 to:172.17.0.3:80
The key rules:
MASQUERADE all -- 172.17.0.0/16 0.0.0.0/0
means every packet originating from the container subnet 172.17.0.0/16 (any destination, any protocol) has its source address masqueraded, i.e. replaced with the host's address.
MASQUERADE tcp -- 172.17.0.3 172.17.0.3 tcp dpt:80
This one is less common: for TCP, when the source is 172.17.0.3 (our nginx container, published on port 8153) and the destination is also 172.17.0.3 on port 80, the source address is likewise masqueraded. For example:
172.17.0.3:1234 ------> 172.17.0.3:80
i.e. when 172.17.0.3 connects to its own port 80, the rule rewrites it to:
10.xx.xx.xx:1234 ------> 172.17.0.3:80
At first glance this rule looks unnecessary, since the traffic never leaves the machine. It appears to be Docker's hairpin-NAT rule: when a container reaches its own published port via the host address, the DNAT rule loops the packet back to the container itself, and without this SNAT rewrite the reply would return directly without the NAT being reversed, breaking the connection.
The other important rule:
DNAT tcp -- 0.0.0.0/0 0.0.0.0/0 tcp dpt:8153 to:172.17.0.3:80
For TCP from any IP to any IP, if the destination port is 8153, forward the packet to port 80 of the container at 172.17.0.3. The RETURN rule above it means everything else falls through to normal processing.
In summary: when the container talks to the outside world its source IP is rewritten to the host's IP, and inbound traffic is DNAT'ed from the published host port to the container's port, here port 80.
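Putting the DNAT and MASQUERADE rules together, the life of one external request looks roughly like this (the client address below is made up for illustration):

```shell
# Print an illustrative trace of one request/reply pair through Docker's NAT.
# All addresses are hypothetical.
cat <<'EOF'
client 10.0.0.9:51000   --->  host 10.xx.xx.xxx:8153    (request arrives)
        PREROUTING DNAT rewrites dst to 172.17.0.3:80
client 10.0.0.9:51000   --->  container 172.17.0.3:80   (delivered via docker0)
container 172.17.0.3:80 --->  client 10.0.0.9:51000     (reply)
        conntrack reverses the DNAT, rewriting src to 10.xx.xx.xxx:8153
host 10.xx.xx.xxx:8153  --->  client 10.0.0.9:51000     (reply leaves the host)
EOF
```

Every one of those rewrites relies on the connection-tracking entry created when the first packet arrived.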
So far this only traces how Docker forwards a container's packets; it does not yet explain the slowness.
3.3 Investigating NAT performance
The evidence above points squarely at NAT, so the question becomes how to investigate it. As the rule analysis showed, both the request and the reply need their NAT address information rewritten, and to map a published port back to the correct container port, the kernel must track the state of every connection. That connection-tracking table can be dumped with:
cat /proc/net/nf_conntrack
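Each NAT'ed connection occupies one entry in this table; when the table is full, the kernel drops new packets and logs "nf_conntrack: table full, dropping packet". A 5000-connection ab run creates a flood of short-lived connections that linger here, e.g. in TIME_WAIT. A sketch of summarizing the dump by TCP state (the sample lines below are made up; in practice, pipe in the real cat /proc/net/nf_conntrack output):

```shell
# Count conntrack entries per TCP state; on tcp lines, field 6 is the state.
# The here-doc stands in for real /proc/net/nf_conntrack output.
cat <<'EOF' | awk '/tcp/ { print $6 }' | sort | uniq -c
ipv4 2 tcp 6 117 TIME_WAIT src=10.0.0.9 dst=172.17.0.3 sport=41234 dport=80
ipv4 2 tcp 6 431999 ESTABLISHED src=10.0.0.9 dst=172.17.0.3 sport=41236 dport=80
ipv4 2 tcp 6 117 TIME_WAIT src=10.0.0.9 dst=172.17.0.3 sport=41238 dport=80
EOF
# prints counts per state: 1 ESTABLISHED, 2 TIME_WAIT
# Current usage vs. the limit can be read from
#   /proc/sys/net/netfilter/nf_conntrack_count and nf_conntrack_max
```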
4 Problem Resolution
4.1 Installing SystemTap
On CentOS:
yum install systemtap kernel-devel yum-utils kernel
4.2 SystemTap drop report shows raw addresses instead of function names
Symptom:
[root@miao miao]# ./dropwatch.stp
Monitoring for dropped packets
18 packets dropped at 0xffffffff8341ab57
17 packets dropped at 0xffffffff8342704d
24 packets dropped at 0xffffffff8342704d
8 packets dropped at 0xffffffff8341ab57
8 packets dropped at 0xffffffff8341ab57
3 packets dropped at 0xffffffff8342704d
1 packets dropped at 0xffffffff834df57c
...
The cause is a missing kernel symbol table (debuginfo):
vim /etc/yum.repos.d/CentOS-Linux-Debuginfo.repo
Set enabled=1 in that repo file.
Install the kernel debuginfo:
debuginfo-install -y kernel-$(uname -r)
// or install
yum install kernel-debuginfo kernel-devel
// then run the preparation check
stap-prep
// raise the console log level so kernel messages are visible
sysctl -w kernel.printk="7 4 1 7"