Another ultra-short, diary-style post.
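Around lunchtime today, a Flink on YARN job that has always been very stable suddenly started alerting continuously. The TaskManager logs showed nothing wrong, but the JobManager log was full of "Connection reset by peer" messages, interspersed with some odd errors, as shown in the figure below.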
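The job itself, however, kept running normally the whole time. Going by the logs, my gut feeling was that something was wrong with the JobManager's REST endpoint (was it being hit by an RST attack?).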
I logged into the machine hosting the JM container and used netstat and lsof to find its PID and the port it was listening on, which turned out to be 4347. Then I captured packets with tcpdump:
tcpdump -i eth0 tcp port 4347 -XX -vv >> dump.out
Part of the captured traffic is shown below:
12:17:59.434870 IP (tos 0x0, ttl 61, id 34630, offset 0, flags [DF], proto TCP (6), length 60)
172.16.200.34.36762 > ec-bigdata-flink-worker-040.lansurveyor: Flags [S], cksum 0xaa12 (correct), seq 611029849, win 64240, options [mss 1460,sackOK,TS val 285643676 ecr 0,nop,wscale 7], length 0
0x0000: 0016 3e34 f380 eeff ffff ffff 0800 4500 ..>4..........E.
0x0010: 003c 8746 4000 3d06 1076 ac10 c822 0a00 .<.F@.=..v..."..
0x0020: 27cd 8f9a 10fb 246b 9359 0000 0000 a002 '.....$k.Y......
0x0030: faf0 aa12 0000 0204 05b4 0402 080a 1106 ................
0x0040: 939c 0000 0000 0103 0307 ..........
12:17:59.434890 IP (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto TCP (6), length 60)
ec-bigdata-flink-worker-040.lansurveyor > 172.16.200.34.36762: Flags [S.], cksum 0xa62e (incorrect -> 0x3998), seq 2375752204, ack 611029850, win 28960, options [mss 1460,sackOK,TS val 3208549201 ecr 285643676,nop,wscale 9], length 0
0x0000: eeff ffff ffff 0016 3e34 f380 0800 4500 ........>4....E.
0x0010: 003c 0000 4000 4006 94bc 0a00 27cd ac10 .<..@.@.....'...
0x0020: c822 10fb 8f9a 8d9b 1a0c 246b 935a a012 ."........$k.Z..
0x0030: 7120 a62e 0000 0204 05b4 0402 080a bf3e q..............>
0x0040: 9351 1106 939c 0103 0309 .Q........
12:17:59.440022 IP (tos 0x0, ttl 61, id 34631, offset 0, flags [DF], proto TCP (6), length 52)
172.16.200.34.36762 > ec-bigdata-flink-worker-040.lansurveyor: Flags [.], cksum 0xd78b (correct), seq 1, ack 1, win 502, options [nop,nop,TS val 285643681 ecr 3208549201], length 0
0x0000: 0016 3e34 f380 eeff ffff ffff 0800 4500 ..>4..........E.
0x0010: 0034 8747 4000 3d06 107d ac10 c822 0a00 .4.G@.=..}..."..
0x0020: 27cd 8f9a 10fb 246b 935a 8d9b 1a0d 8010 '.....$k.Z......
0x0030: 01f6 d78b 0000 0101 080a 1106 93a1 bf3e ...............>
0x0040: 9351 .Q
12:17:59.448889 IP (tos 0x0, ttl 61, id 34632, offset 0, flags [DF], proto TCP (6), length 431)
172.16.200.34.36762 > ec-bigdata-flink-worker-040.lansurveyor: Flags [P.], cksum 0xa379 (correct), seq 1:380, ack 1, win 502, options [nop,nop,TS val 285643690 ecr 3208549201], length 379
0x0000: 0016 3e34 f380 eeff ffff ffff 0800 4500 ..>4..........E.
0x0010: 01af 8748 4000 3d06 0f01 ac10 c822 0a00 ...H@.=......"..
0x0020: 27cd 8f9a 10fb 246b 935a 8d9b 1a0d 8018 '.....$k.Z......
0x0030: 01f6 a379 0000 0101 080a 1106 93aa bf3e ...y...........>
0x0040: 9351 4745 5420 2f63 6769 2d62 696e 2f63 .QGET./cgi-bin/c
0x0050: 6f69 6e5f 696e 636c 7564 6573 2f63 6f6e oin_includes/con
0x0060: 7374 616e 7473 2e70 6870 3f5f 4343 4647 stants.php?_CCFG
0x0070: 5b5f 504b 475f 5041 5448 5f49 4e43 4c5d [_PKG_PATH_INCL]
0x0080: 3d2f 6574 632f 7061 7373 7764 2530 3020 =/etc/passwd%00.
0x0090: 4854 5450 2f31 2e31 0d0a 486f 7374 3a20 HTTP/1.1..Host:.
0x00a0: 3130 2e30 2e33 392e 3230 353a 3433 3437 10.0.19.105:4347
0x00b0: 0d0a 4163 6365 7074 2d43 6861 7273 6574 ..Accept-Charset
0x00c0: 3a20 6973 6f2d 3838 3539 2d31 2c75 7466 :.iso-8859-1,utf
0x00d0: 2d38 3b71 3d30 2e39 2c2a 3b71 3d30 2e31 -8;q=0.9,*;q=0.1
0x00e0: 0d0a 4163 6365 7074 2d4c 616e 6775 6167 ..Accept-Languag
0x00f0: 653a 2065 6e0d 0a43 6f6e 6e65 6374 696f e:.en..Connectio
0x0100: 6e3a 204b 6565 702d 416c 6976 650d 0a55 n:.Keep-Alive..U
0x0110: 7365 722d 4167 656e 743a 204d 6f7a 696c ser-Agent:.Mozil
0x0120: 6c61 2f34 2e30 2028 636f 6d70 6174 6962 la/4.0.(compatib
0x0130: 6c65 3b20 4d53 4945 2038 2e30 3b20 5769 le;.MSIE.8.0;.Wi
0x0140: 6e64 6f77 7320 4e54 2035 2e31 3b20 5472 ndows.NT.5.1;.Tr
0x0150: 6964 656e 742f 342e 3029 0d0a 5072 6167 ident/4.0)..Prag
0x0160: 6d61 3a20 6e6f 2d63 6163 6865 0d0a 4163 ma:.no-cache..Ac
0x0170: 6365 7074 3a20 696d 6167 652f 6769 662c cept:.image/gif,
0x0180: 2069 6d61 6765 2f78 2d78 6269 746d 6170 .image/x-xbitmap
0x0190: 2c20 696d 6167 652f 6a70 6567 2c20 696d ,.image/jpeg,.im
0x01a0: 6167 652f 706a 7065 672c 2069 6d61 6765 age/pjpeg,.image
0x01b0: 2f70 6e67 2c20 2a2f 2a0d 0a0d 0a /png,.*/*....
12:17:59.448904 IP (tos 0x0, ttl 64, id 35162, offset 0, flags [DF], proto TCP (6), length 52)
ec-bigdata-flink-worker-040.lansurveyor > 172.16.200.34.36762: Flags [.], cksum 0xa626 (incorrect -> 0xd7b4), seq 1, ack 380, win 59, options [nop,nop,TS val 3208549215 ecr 285643690], length 0
0x0000: eeff ffff ffff 0016 3e34 f380 0800 4500 ........>4....E.
0x0010: 0034 895a 4000 4006 0b6a 0a00 27cd ac10 .4.Z@.@..j..'...
0x0020: c822 10fb 8f9a 8d9b 1a0d 246b 94d5 8010 ."........$k....
0x0030: 003b a626 0000 0101 080a bf3e 935f 1106 .;.&.......>._..
0x0040: 93aa ..
12:17:59.449337 IP (tos 0x0, ttl 64, id 35163, offset 0, flags [DF], proto TCP (6), length 251)
ec-bigdata-flink-worker-040.lansurveyor > 172.16.200.34.36762: Flags [P.], cksum 0xa6ed (incorrect -> 0x438a), seq 1:200, ack 380, win 59, options [nop,nop,TS val 3208549216 ecr 285643690], length 199
0x0000: eeff ffff ffff 0016 3e34 f380 0800 4500 ........>4....E.
0x0010: 00fb 895b 4000 4006 0aa2 0a00 27cd ac10 ...[@.@.....'...
0x0020: c822 10fb 8f9a 8d9b 1a0d 246b 94d5 8018 ."........$k....
0x0030: 003b a6ed 0000 0101 080a bf3e 9360 1106 .;.........>.`..
0x0040: 93aa 4854 5450 2f31 2e31 2034 3034 204e ..HTTP/1.1.404.N
0x0050: 6f74 2046 6f75 6e64 0d0a 436f 6e74 656e ot.Found..Conten
0x0060: 742d 5479 7065 3a20 6170 706c 6963 6174 t-Type:.applicat
0x0070: 696f 6e2f 6a73 6f6e 3b20 6368 6172 7365 ion/json;.charse
0x0080: 743d 5554 462d 380d 0a43 6f6e 6e65 6374 t=UTF-8..Connect
0x0090: 696f 6e3a 206b 6565 702d 616c 6976 650d ion:.keep-alive.
0x00a0: 0a63 6f6e 7465 6e74 2d6c 656e 6774 683a .content-length:
0x00b0: 2038 320d 0a0d 0a7b 2265 7272 6f72 7322 .82....{"errors"
0x00c0: 3a5b 2255 6e61 626c 6520 746f 206c 6f61 :["Unable.to.loa
0x00d0: 6420 7265 7175 6573 7465 6420 6669 6c65 d.requested.file
0x00e0: 202f 6367 692d 6269 6e2f 636f 696e 5f69 ./cgi-bin/coin_i
0x00f0: 6e63 6c75 6465 732f 636f 6e73 7461 6e74 ncludes/constant
0x0100: 732e 7068 702e 225d 7d s.php."]}
12:17:59.454755 IP (tos 0x0, ttl 61, id 34633, offset 0, flags [DF], proto TCP (6), length 52)
172.16.200.34.36762 > ec-bigdata-flink-worker-040.lansurveyor: Flags [.], cksum 0xd52c (correct), seq 380, ack 200, win 501, options [nop,nop,TS val 285643696 ecr 3208549216], length 0
0x0000: 0016 3e34 f380 eeff ffff ffff 0800 4500 ..>4..........E.
0x0010: 0034 8749 4000 3d06 107b ac10 c822 0a00 .4.I@.=..{..."..
0x0020: 27cd 8f9a 10fb 246b 94d5 8d9b 1ad4 8010 '.....$k........
0x0030: 01f5 d52c 0000 0101 080a 1106 93b0 bf3e ...,...........>
0x0040: 9360 .`
12:17:59.454924 IP (tos 0x0, ttl 61, id 34634, offset 0, flags [DF], proto TCP (6), length 52)
172.16.200.34.36762 > ec-bigdata-flink-worker-040.lansurveyor: Flags [R.], cksum 0xd528 (correct), seq 380, ack 200, win 501, options [nop,nop,TS val 285643696 ecr 3208549216], length 0
0x0000: 0016 3e34 f380 eeff ffff ffff 0800 4500 ..>4..........E.
0x0010: 0034 874a 4000 3d06 107a ac10 c822 0a00 .4.J@.=..z..."..
0x0020: 27cd 8f9a 10fb 246b 94d5 8d9b 1ad4 8014 '.....$k........
0x0030: 01f5 d528 0000 0101 080a 1106 93b0 bf3e ...(...........>
0x0040: 9360 .`
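The packets above form one complete cycle: handshake → both sides push data → connection reset. The peer was also trying to request highly sensitive data such as /etc/passwd. So it is no surprise that a large number of "Connection reset by peer" messages piled up in a short time. After asking around, I eventually reached the security team and learned that they were running a vulnerability scan, and 4347 happens to be the default port used by the LAN Surveyor tool, which is why tcpdump labels the port as "lansurveyor". A false alarm = =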
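To make the "handshake → push → reset" cycle easier to spot in a long capture, the flag sequence of each connection can be pulled out of the tcpdump text output. The following is a minimal sketch, not something used in the original investigation: the helper names `flag_sequences` and `ends_with_reset` are hypothetical, and the sample lines are abbreviated copies of the summary lines in the capture above.

```python
import re
from collections import defaultdict

# Matches tcpdump summary lines such as
#   "172.16.200.34.36762 > host.lansurveyor: Flags [S], cksum ..."
SUMMARY = re.compile(r"^\s*(\S+) > (\S+): Flags \[([^\]]+)\]")


def flag_sequences(lines):
    """Group TCP flag strings by unordered endpoint pair, in capture order."""
    flows = defaultdict(list)
    for line in lines:
        m = SUMMARY.match(line)
        if m:
            src, dst, flags = m.groups()
            # frozenset makes both directions map to the same flow key
            flows[frozenset((src, dst))].append(flags)
    return dict(flows)


def ends_with_reset(flags):
    """True if the flow opened with a bare SYN and its last packet carried RST."""
    return bool(flags) and flags[0] == "S" and "R" in flags[-1]


# Abbreviated summary lines from the capture (hex dump lines are simply skipped)
CLIENT = "172.16.200.34.36762"
SERVER = "ec-bigdata-flink-worker-040.lansurveyor"
sample = [
    f"{CLIENT} > {SERVER}: Flags [S], seq 611029849, length 0",
    f"{SERVER} > {CLIENT}: Flags [S.], seq 2375752204, ack 611029850, length 0",
    f"{CLIENT} > {SERVER}: Flags [.], seq 1, ack 1, length 0",
    f"{CLIENT} > {SERVER}: Flags [P.], seq 1:380, ack 1, length 379",
    "0x0040: 9351 4745 5420 2f63 6769 2d62 696e 2f63 .QGET./cgi-bin/c",
    f"{SERVER} > {CLIENT}: Flags [.], seq 1, ack 380, length 0",
    f"{SERVER} > {CLIENT}: Flags [P.], seq 1:200, ack 380, length 199",
    f"{CLIENT} > {SERVER}: Flags [.], seq 380, ack 200, length 0",
    f"{CLIENT} > {SERVER}: Flags [R.], seq 380, ack 200, length 0",
]

flows = flag_sequences(sample)
for endpoints, flags in flows.items():
    print(sorted(endpoints), flags, "reset" if ends_with_reset(flags) else "ok")
```

Counting how many flows end in a reset within a time window would give a rough picture of how aggressively the port is being probed.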
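To fix this for good, we can pin the high-availability JobManager to a port range that excludes most of the registered ports (the IANA registered range is 1024 to 49151). Set the following in flink-conf.yaml: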
high-availability.jobmanager.port: 35000-49150
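As a quick sanity check on the chosen range, the sketch below confirms that the scanned port falls outside it. It is only an illustration: `parse_port_range` and `in_range` are hypothetical helpers, and they handle a single `LOW-HIGH` range or a single port, not every form a Flink port option may accept.

```python
def parse_port_range(spec):
    """Parse 'LOW-HIGH' or a single port 'N' into an inclusive (low, high) tuple."""
    low_text, _, high_text = spec.strip().partition("-")
    low = int(low_text)
    high = int(high_text) if high_text else low
    if not (0 < low <= high <= 65535):
        raise ValueError(f"invalid port range: {spec!r}")
    return low, high


def in_range(port, spec):
    """True if `port` lies inside the inclusive range described by `spec`."""
    low, high = parse_port_range(spec)
    return low <= port <= high


CONFIGURED = "35000-49150"  # value from flink-conf.yaml above
SCANNED_PORT = 4347         # port the JobManager happened to bind to

print(parse_port_range(CONFIGURED))
print(in_range(SCANNED_PORT, CONFIGURED))
```

With the range above, port 4347 can no longer be picked, so a scan aimed at the LAN Surveyor port would not land on the JobManager.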
It has been raining nonstop in Beijing today and the weather is miserable, so I'd better head home early.
Good night, everyone.