This article uses hands-on experiments to explain the overlay and docker_gwbridge networks in Docker swarm.
Setting up the test environment
First, build a Docker swarm out of two physical machines (for the procedure, see "docker swarm (1): Getting started, building a simple swarm cluster"):
$ docker node ls
ID HOSTNAME STATUS AVAILABILITY MANAGER STATUS ENGINE VERSION
43k0p9fnwu9dhsyr0n6utfynn * ubuntu Ready Active Leader 19.03.5
gorkh8cb5ylb7szzbbrp2sheu ubuntu-2 Ready Active 19.03.5
Create an overlay network.
$ docker network create -d overlay --attachable --subnet 10.200.0.0/16 overlay_test
The Docker networks that now exist are:
$ docker network ls
NETWORK ID NAME DRIVER SCOPE
a473a52d686d bridge bridge local
5e1880193fbf docker_gwbridge bridge local
62ba25167374 host host local
jjyg85t5ta3k ingress overlay swarm
d056684646b3 none null local
hxyiridl2b9r overlay_test overlay swarm
Two of these networks matter here:
- overlay_test: the overlay network, which carries east-west traffic between containers.
- docker_gwbridge: the network through which containers send and receive north-south traffic.
Preparing the tools
Docker isolates each container's network stack in its own network namespace. The script below, docker_netns.sh, runs a network command inside a given namespace.
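#!/bin/bash
# Usage: docker_netns.sh [NAMESPACE|CONTAINER [COMMAND...]]
NAMESPACE=$1
# With no argument, list the network namespaces that Docker has created.
if [[ -z $NAMESPACE ]]; then
    ls -1 /var/run/docker/netns/
    exit 0
fi
# Treat the argument as a namespace name first ...
NAMESPACE_FILE=/var/run/docker/netns/${NAMESPACE}
# ... and fall back to looking it up as a container name or ID.
if [[ ! -f $NAMESPACE_FILE ]]; then
    NAMESPACE_FILE=$(docker inspect -f "{{.NetworkSettings.SandboxKey}}" "$NAMESPACE" 2>/dev/null)
fi
if [[ ! -f $NAMESPACE_FILE ]]; then
    echo "Cannot open network namespace '$NAMESPACE': No such file or directory"
    exit 1
fi
shift
if [[ $# -lt 1 ]]; then
    echo "No command specified"
    exit 1
fi
# Quote "$@" so that arguments containing spaces reach the command intact.
nsenter --net="${NAMESPACE_FILE}" "$@"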
Run without arguments, it lists the namespaces:
$ sudo ./docker_netns.sh
1-k2rx924tgr
eab3f856fe9a
ingress_sbox
It can also run a command inside a given namespace:
$ sudo ./docker_netns.sh eab3f856fe9a ip link
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
170: eth0@if171: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:0a:00:00:54 brd ff:ff:ff:ff:ff:ff link-netnsid 0
172: eth1@if173: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP mode DEFAULT group default
link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff link-netnsid 1
The second tool is find_links.sh:
#!/bin/bash
DOCKER_NETNS_SCRIPT=./docker_netns.sh
IFINDEX=$1
if [[ -z $IFINDEX ]]; then
    # No ifindex given: dump the links of every namespace.
    for namespace in $($DOCKER_NETNS_SCRIPT); do
        printf "\e[1;31m%s: \e[0m\n" "$namespace"
        $DOCKER_NETNS_SCRIPT "$namespace" ip -c -o link
        printf "\n"
    done
else
    # Print only the namespaces that contain an interface with this ifindex.
    for namespace in $($DOCKER_NETNS_SCRIPT); do
        if $DOCKER_NETNS_SCRIPT "$namespace" ip -c -o link | grep -Pq "^$IFINDEX: "; then
            printf "\e[1;31m%s: \e[0m\n" "$namespace"
            $DOCKER_NETNS_SCRIPT "$namespace" ip -c -o link | grep -P "^$IFINDEX: "
            printf "\n"
        fi
    done
fi
Given an ifindex, this script finds the namespace that holds the interface.
$ sudo ./find_links.sh 60
1-hxyiridl2b:
60: veth1@if59: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default \ link/ether 4a:0a:52:98:84:a7 brd ff:ff:ff:ff:ff:ff link-netnsid 2
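In this output the namespace name, 1-hxyiridl2b, looks like "1-" followed by the first ten characters of the overlay_test network ID (hxyiridl2b9r in the docker network ls listing). The sketch below encodes that observation. It is an assumption drawn only from this output, not documented behaviour, and overlay_netns_name is a hypothetical helper name.

```shell
#!/bin/bash
# Guess the namespace name of an overlay network from its network ID.
# Assumption (from the output above only): name = "1-" + first 10 ID characters.
overlay_netns_name() {
    printf '1-%.10s\n' "$1"
}

overlay_netns_name hxyiridl2b9r
```

If the guess holds on your host, the result can be passed straight to docker_netns.sh instead of searching the namespace list by eye.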
Analyzing the network structure
The experiments below explore how the overlay network and the docker_gwbridge network work.
Start a container on each of the two nodes:
$ docker run -d --name busybox --net overlay_test busybox sleep 36000
Look at the network interfaces from inside the container:
$ docker exec busybox ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
59: eth0@if60: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue
link/ether 02:42:0a:c8:00:02 brd ff:ff:ff:ff:ff:ff
inet 10.200.0.2/16 brd 10.200.255.255 scope global eth0
valid_lft forever preferred_lft forever
61: eth1@if62: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue
link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff
inet 172.18.0.3/16 brd 172.18.255.255 scope global eth1
valid_lft forever preferred_lft forever
Besides the loopback interface there are two others. 10.200.0.2/16 is the address of busybox's interface on the overlay_test network, and 172.18.0.3/16 is the address of its interface on the docker_gwbridge network.
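One detail in the output above: the last four bytes of each MAC address spell out the interface's IPv4 address in hexadecimal (02:42:0a:c8:00:02 for 10.200.0.2, 02:42:ac:12:00:03 for 172.18.0.3). A minimal sketch that decodes it is shown below. mac_to_ip is a hypothetical helper name, and the mapping is an observation about addresses that Docker assigned here, so it will not hold for a MAC set by hand.

```shell
#!/bin/bash
# Decode bytes 3..6 of a MAC such as 02:42:0a:c8:00:02 into dotted-decimal IPv4.
# Assumption: the MAC was derived from the interface's IPv4 address.
mac_to_ip() {
    local mac=$1
    local IFS=:
    # Split the MAC on ':' into the positional parameters $1..$6.
    set -- $mac
    if [ $# -ne 6 ]; then
        echo "expected a 6-byte MAC, got '$mac'" >&2
        return 1
    fi
    printf '%d.%d.%d.%d\n' "0x$3" "0x$4" "0x$5" "0x$6"
}

mac_to_ip 02:42:0a:c8:00:02   # MAC of eth0 in the output above
mac_to_ip 02:42:ac:12:00:03   # MAC of eth1 in the output above
```

This is handy later, when MAC addresses show up in neighbour and forwarding tables without an IP next to them.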
So far we know only the addresses on the container's interfaces; we do not yet know how packets travel between them. (192.168.154.2 is the host's gateway.)
North-south traffic
Trace the route from inside the container to an external IP:
$ docker exec busybox traceroute baidu.com
traceroute to baidu.com (220.181.38.148), 30 hops max, 46 byte packets
1 bogon (172.18.0.1) 0.003 ms 0.004 ms 0.006 ms
2 bogon (192.168.154.2) 0.148 ms 0.330 ms 0.175 ms
...
The traffic passes through 172.18.0.1 and then reaches the host's gateway.
Next, we work out the internal wiring. From the container's point of view, 172.18.0.3 sits on the interface 61: eth1@if62. This means the interface has ifindex 61 and is connected through a veth pair to the interface with ifindex 62.
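Reading the name@ifN notation can be automated. The sketch below takes one-line-per-interface output (the format printed by ip -o link) on standard input and prints each interface's own ifindex, its name, and the ifindex of its veth peer. parse_link_line is a hypothetical helper name, and the sample lines are copied from the container output above.

```shell
#!/bin/bash
# For each "IDX: NAME@ifPEER: ..." line on stdin print "IDX NAME PEER".
# Interfaces without a peer (for example lo) get "-" in the third column.
parse_link_line() {
    awk '{
        idx = $1;  sub(/:$/, "", idx)      # "61:"        -> "61"
        name = $2; sub(/:$/, "", name)     # "eth1@if62:" -> "eth1@if62"
        n = split(name, part, "@if")       # part[1]=eth1, part[2]=62
        print idx, part[1], (n > 1 ? part[2] : "-")
    }'
}

printf '%s\n' \
    '1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000' \
    '59: eth0@if60: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue' \
    '61: eth1@if62: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue' |
    parse_link_line
```

The third column is the number to hand to find_links.sh.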
Look for the namespace that holds interface 62:
$ sudo ./find_links.sh 62
Nothing is printed. That means interface 62 is not in any Docker-managed namespace and lives in the host's root namespace. Check on the host:
$ ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
inet6 ::1/128 scope host
valid_lft forever preferred_lft forever
2: ens33: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc fq_codel state UP group default qlen 1000
link/ether 00:0c:29:e5:66:45 brd ff:ff:ff:ff:ff:ff
inet 192.168.154.135/24 brd 192.168.154.255 scope global dynamic noprefixroute ens33
valid_lft 1502sec preferred_lft 1502sec
inet6 fe80::f378:1d3:6cde:69bb/64 scope link noprefixroute
valid_lft forever preferred_lft forever
3: docker_gwbridge: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default
link/ether 02:42:50:e9:2d:e1 brd ff:ff:ff:ff:ff:ff
inet 172.18.0.1/16 brd 172.18.255.255 scope global docker_gwbridge
valid_lft forever preferred_lft forever
inet6 fe80::42:50ff:fee9:2de1/64 scope link
valid_lft forever preferred_lft forever
4: docker0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN group default
link/ether 02:42:5d:cd:c3:16 brd ff:ff:ff:ff:ff:ff
inet 172.17.0.1/16 brd 172.17.255.255 scope global docker0
valid_lft forever preferred_lft forever
23: veth6ee82c3@if22: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP group default
link/ether 4a:71:4d:f7:0e:4e brd ff:ff:ff:ff:ff:ff link-netnsid 1
inet6 fe80::4871:4dff:fef7:e4e/64 scope link
valid_lft forever preferred_lft forever
62: veth0204500@if61: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue master docker_gwbridge state UP group default
link/ether 9e:d6:10:49:8e:42 brd ff:ff:ff:ff:ff:ff link-netnsid 4
inet6 fe80::9cd6:10ff:fe49:8e42/64 scope link
valid_lft forever preferred_lft forever
The master of interface 62 is docker_gwbridge; in other words, interface 62 is attached to the docker_gwbridge bridge.
North-south traffic is also NATed as it leaves the host:
$ sudo iptables-save -t nat | grep -- '-A POSTROUTING'
-A POSTROUTING -o docker_gwbridge -m addrtype --src-type LOCAL -j MASQUERADE
-A POSTROUTING -s 172.17.0.0/16 ! -o docker0 -j MASQUERADE
-A POSTROUTING -s 172.18.0.0/16 ! -o docker_gwbridge -j MASQUERADE
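The north-south path is now clear: container eth1 (172.18.0.3) -> veth pair -> docker_gwbridge (172.18.0.1) -> NAT on the host -> host interface ens33 -> gateway 192.168.154.2.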
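To see why the container's north-south traffic is masqueraded while its overlay address is untouched, compare the source addresses against the -s 172.18.0.0/16 match in the POSTROUTING rules above. The following is a minimal sketch of that subnet test in shell arithmetic. ip_to_int and in_subnet are hypothetical helper names, and the sketch covers only the source-address match, not the "! -o docker_gwbridge" part of the rule.

```shell
#!/bin/bash
# Convert dotted-decimal IPv4 (octets without leading zeros) to an integer.
ip_to_int() {
    local IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# in_subnet IP NETWORK PREFIXLEN: succeed if IP lies inside NETWORK/PREFIXLEN.
in_subnet() {
    local ip net mask
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "$2")
    mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
    [ $(( ip & mask )) -eq $(( net & mask )) ]
}

for src in 172.18.0.3 10.200.0.2; do
    if in_subnet "$src" 172.18.0.0 16; then
        echo "$src: source matches 172.18.0.0/16"
    else
        echo "$src: source does not match 172.18.0.0/16"
    fi
done
```

Only the docker_gwbridge address, 172.18.0.3, falls inside the matched range, which fits the traceroute result: outbound traffic leaves through eth1.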
East-west traffic
East-west traffic is the traffic between containers. First, test connectivity between the containers.
$ docker exec busybox ping 10.200.0.2
PING 10.200.0.2 (10.200.0.2): 56 data bytes
64 bytes from 10.200.0.2: seq=0 ttl=64 time=41.177 ms
64 bytes from 10.200.0.2: seq=1 ttl=64 time=1.181 ms
64 bytes from 10.200.0.2: seq=2 ttl=64 time=1.110 ms
Next, trace the path this traffic takes. Look at the container's network configuration again.
$ docker exec busybox ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
59: eth0@if60: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1450 qdisc noqueue
link/ether 02:42:0a:c8:00:02 brd ff:ff:ff:ff:ff:ff
inet 10.200.0.2/16 brd 10.200.255.255 scope global eth0
valid_lft forever preferred_lft forever
61: eth1@if62: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue
link/ether 02:42:ac:12:00:03 brd ff:ff:ff:ff:ff:ff
inet 172.18.0.3/16 brd 172.18.255.255 scope global eth1
valid_lft forever preferred_lft forever
10.200.0.2 sits on the interface 59: eth0@if60, that is, ifindex 59, connected to the interface with ifindex 60. Find the namespace that holds interface 60.
$ sudo ./find_links.sh 60
1-hxyiridl2b:
60: veth1@if59: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default \ link/ether 4a:0a:52:98:84:a7 brd ff:ff:ff:ff:ff:ff link-netnsid 2
Interface 60 is in the namespace 1-hxyiridl2b. List the interfaces in that namespace:
$ sudo ./docker_netns.sh 1-hxyiridl2b ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
inet 127.0.0.1/8 scope host lo
valid_lft forever preferred_lft forever
2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP group default
link/ether 0e:2d:34:e6:eb:b7 brd ff:ff:ff:ff:ff:ff
inet 10.200.0.1/16 brd 10.200.255.255 scope global br0
valid_lft forever preferred_lft forever
56: vxlan0@if56: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UNKNOWN group default
link/ether 0e:2d:34:e6:eb:b7 brd ff:ff:ff:ff:ff:ff link-netnsid 0
58: veth0@if57: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP group default
link/ether ea:c1:db:d4:b1:83 brd ff:ff:ff:ff:ff:ff link-netnsid 1
60: veth1@if59: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue master br0 state UP group default
link/ether 4a:0a:52:98:84:a7 brd ff:ff:ff:ff:ff:ff link-netnsid 2
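This namespace contains a VXLAN interface, vxlan0. The Docker overlay network uses VXLAN tunnels to reach containers on other nodes.
Although the two containers communicate over a VXLAN tunnel, they are unaware of it: from the inside they simply appear to share one layer-2 network. The vxlan interface encapsulates each layer-2 frame in the payload of a UDP packet and sends it to the peer, where the peer's vxlan interface decapsulates it.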
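The encapsulation also explains the mtu 1450 reported for eth0, br0 and vxlan0 in the outputs above. The sketch below reproduces the arithmetic, assuming an IPv4 underlay with a 1500-byte MTU and no IP options. overlay_mtu is a hypothetical helper name.

```shell
#!/bin/bash
# Largest inner IP packet that fits once the frame is wrapped in VXLAN.
# Assumption: IPv4 underlay, no IP options.
overlay_mtu() {
    local underlay_mtu=$1
    local outer_ip=20    # outer IPv4 header
    local outer_udp=8    # outer UDP header
    local vxlan_hdr=8    # VXLAN header
    local inner_eth=14   # Ethernet header of the encapsulated frame
    echo $(( underlay_mtu - outer_ip - outer_udp - vxlan_hdr - inner_eth ))
}

overlay_mtu 1500
```

With these header sizes the overhead is 50 bytes, so a 1500-byte host MTU leaves 1450 bytes for the overlay interfaces.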
Look at the ARP table in namespace 1-hxyiridl2b:
$ sudo ./docker_netns.sh 1-hxyiridl2b ip neigh
10.200.0.5 dev vxlan0 lladdr 02:42:0a:c8:00:05 PERMANENT
10.200.0.4 dev vxlan0 lladdr 02:42:0a:c8:00:04 PERMANENT
The container IP 10.200.0.4 on the remote node appears in the local ARP table. This table is how the layer-2 address of the peer is found.
Next, find where the VXLAN packets are sent:
$ sudo ./docker_netns.sh 1-hxyiridl2b bridge fdb
...
02:42:0a:c8:00:05 dev vxlan0 dst 192.168.154.136 link-netnsid 0 self permanent
02:42:0a:c8:00:04 dev vxlan0 dst 192.168.154.136 link-netnsid 0 self permanent
...
This can be read as the VXLAN VTEP table: given a destination MAC address, it yields the outer IP address that the VXLAN packet is sent to, here 192.168.154.136.
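The two lookups can be chained: container IP to MAC through the neighbour table, then MAC to remote host through the forwarding table. The following sketch does this over text in the format printed by ip neigh and bridge fdb. field_after, mac_for_ip and vtep_for_mac are hypothetical helper names, and the sample data is copied from the outputs above.

```shell
#!/bin/bash
# field_after KEY WORD: on stdin lines whose first field is KEY,
# print the token that follows WORD.
field_after() {
    awk -v key="$1" -v word="$2" '$1 == key {
        for (i = 2; i < NF; i++)
            if ($i == word) { print $(i + 1); exit }
    }'
}
mac_for_ip()   { field_after "$1" lladdr; }   # feed it "ip neigh" output
vtep_for_mac() { field_after "$1" dst; }      # feed it "bridge fdb" output

neigh='10.200.0.5 dev vxlan0 lladdr 02:42:0a:c8:00:05 PERMANENT
10.200.0.4 dev vxlan0 lladdr 02:42:0a:c8:00:04 PERMANENT'
fdb='02:42:0a:c8:00:05 dev vxlan0 dst 192.168.154.136 link-netnsid 0 self permanent
02:42:0a:c8:00:04 dev vxlan0 dst 192.168.154.136 link-netnsid 0 self permanent'

mac=$(printf '%s\n' "$neigh" | mac_for_ip 10.200.0.4)
vtep=$(printf '%s\n' "$fdb" | vtep_for_mac "$mac")
echo "10.200.0.4 -> $mac -> VTEP $vtep"
```

On a live host the same functions could read from the docker_netns.sh commands shown above instead of the sample strings.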
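This completes the east-west path: container eth0 (10.200.0.2) -> veth pair -> br0 in namespace 1-hxyiridl2b -> vxlan0 -> UDP encapsulation to the remote host 192.168.154.136, where the steps are reversed to deliver the frame to the remote container.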