Handling a socket: too many open files problem

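Here is roughly what happened. A few days ago I finished my handover before leaving the company, and today I saw the old company's WeChat group blowing up with messages: users could not place orders in the WeChat mini program. Xiaopang, who had taken over the backend project, had already restarted the database (no idea why the first thing he did on the server was restart the database) and said it still did not work.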

Later another colleague restarted the consumer service, and the service recovered.

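Looking at the docker log, I found a "dial tcp 127.0.0.1:8202: socket: too many open files" error. The HTTP server being requested is deployed on the same machine, and the two share the same docker network, so normally this should not happen. My guess was that either the local TCP ports had been exhausted or the file descriptors available to the process had been used up, so the HTTP client could no longer open new socket connections.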

The whole picture was still not very clear, so first let's lay out how the backend services are deployed:
[Figure: services]

Among these, the postal service provides the GET /api/status endpoint, which returns the status of the main controller. This endpoint is used by business, consumer, and staff.

In the /proc/CONSUMER_PID/fd/ directory there were several hundred unclosed socket fds; there were only that many because this service had just been restarted. I also looked at /proc/BUSINESS_PID/fd/ and found more than 12000 socket fds in it, which is clearly abnormal. /proc/STAFF_PID/fd/ had only a few hundred, because the staff service does not call GET /api/status very often.
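Counting these entries by hand gets tedious, so here is a minimal sketch of how the socket fds under /proc/<pid>/fd could be counted from Go. This is not what was run on the server; the helper name countSocketFDs and the "self" default are made up for illustration, and the sketch assumes a Linux /proc filesystem and Go 1.16 or newer (for os.ReadDir).

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// countSocketFDs (hypothetical helper) returns how many entries in
// /proc/<pid>/fd are sockets. Each entry is a symlink, and a socket
// resolves to a target such as "socket:[820276]".
func countSocketFDs(pid string) (int, error) {
	dir := filepath.Join("/proc", pid, "fd")
	entries, err := os.ReadDir(dir)
	if err != nil {
		return 0, err
	}
	n := 0
	for _, e := range entries {
		target, err := os.Readlink(filepath.Join(dir, e.Name()))
		if err != nil {
			continue // the fd may have been closed while we were iterating
		}
		if strings.HasPrefix(target, "socket:") {
			n++
		}
	}
	return n, nil
}

func main() {
	pid := "self" // assumption: inspect this process unless a PID is given
	if len(os.Args) > 1 {
		pid = os.Args[1]
	}
	n, err := countSocketFDs(pid)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Printf("pid %s: %d socket fds\n", pid, n)
}
```

Run it with the PID of the service as the only argument, before and after a request, to see whether the count keeps growing. Reading another user's /proc/<pid>/fd usually requires root.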

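Because this was a production server, I picked staff, a backend service with relatively little traffic, for debugging. I sent one request through staff to the postal service's GET /api/status endpoint and found one more socket fd in /proc/STAFF_PID/fd/. The behaviour was completely stable: every request added one more.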

That roughly pinpoints the problem: it is most likely the HTTP client code in staff:

url := fmt.Sprintf("%s/api/stats/%s", settings.Current.BoxServer, box)
transport := &http.Transport{
    TLSHandshakeTimeout: TlsTimeout,
}

client := http.Client{
    Timeout:   HttpTimeout,
    Transport: transport,
}

resp, err := client.Get(url)
if err != nil {
    return nil, err
}
defer resp.Body.Close()
body, err := ioutil.ReadAll(resp.Body)
if err != nil {
    return nil, err
}

The code above is simple, with nothing superfluous, and it closes the response body as the golang docs require.

The core of the problem is here:

transport := &http.Transport{
    TLSHandshakeTimeout: TlsTimeout,
}

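Here only the TLS handshake timeout of the Transport struct is set; every other field keeps its default value. And since this snippet runs for each request, a new Transport is built every time.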

Demo

Server side, server.go:

package main

import (
    "log"
    "net/http"
    "github.com/gin-gonic/gin"
)

func getStatus(c *gin.Context) {
    c.JSON(http.StatusOK, gin.H{
        "Status": "ok",
    })
}

func main() {
    route := gin.Default()
    api := route.Group("/api")
    api.GET("/status", getStatus)

    if err := route.Run("127.0.0.1:4200"); err != nil {
        log.Panic(err)
    }
}

Client code, client.go:

package main

import (
    "io/ioutil"
    "log"
    "net/http"
    "time"
)

func request() {
    log.Println("request()")
    url := "http://127.0.0.1:4200/api/status"

    // A fresh Transport, with its own connection pool, is built on every call.
    transport := &http.Transport{
        // ResponseHeaderTimeout: 15 * time.Second,
    }

    client := http.Client{
        // Timeout:   30,
        Transport: transport,
    }

    resp, err := client.Get(url)
    if err != nil {
        log.Println(err)
        return
    }
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Println(err)
        return
    }
    log.Println(len(string(body)))

    time.Sleep(1 * time.Millisecond)
}

func main() {
    for i := 0; i < 10000; i++ {
        go request()
    }

    time.Sleep(1 * time.Hour)
}

The code above was run locally, and the client prints the same error:

2019/09/19 15:26:34 Get http://127.0.0.1:4200/api/status: dial tcp 127.0.0.1:4200: socket: too many open files

Then check how many files the process has open:

$ ls /proc/5689/fd | wc -l
1024

This number is much lower than what showed up on the server. It turned out that my local Linux system had no tuned configuration, so by default a process may open only 1024 files. The same resource limit shows up below:

$ cat /proc/5689/limits 
Limit                     Soft Limit           Hard Limit           Units     
Max cpu time              unlimited            unlimited            seconds   
Max file size             unlimited            unlimited            bytes     
Max data size             unlimited            unlimited            bytes     
Max stack size            8388608              unlimited            bytes     
Max core file size        0                    unlimited            bytes     
Max resident set          unlimited            unlimited            bytes     
Max processes             63754                63754                processes 
Max open files            1024                 1048576              files     
Max locked memory         67108864             67108864             bytes     
Max address space         unlimited            unlimited            bytes     
Max file locks            unlimited            unlimited            locks     
Max pending signals       63754                63754                signals   
Max msgqueue size         819200               819200               bytes     
Max nice priority         0                    0                    
Max realtime priority     0                    0                    
Max realtime timeout      unlimited            unlimited            us  
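The "Max open files" row can also be read from inside a Go program. The following is a minimal sketch, assuming Linux and the standard syscall package; the helper name openFileLimits is made up for illustration.

```go
package main

import (
	"fmt"
	"log"
	"syscall"
)

// openFileLimits (hypothetical helper) returns the soft and hard
// RLIMIT_NOFILE values of the current process, i.e. the two numbers in
// the "Max open files" row of /proc/<pid>/limits.
func openFileLimits() (soft, hard uint64, err error) {
	var rl syscall.Rlimit
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &rl); err != nil {
		return 0, 0, err
	}
	return uint64(rl.Cur), uint64(rl.Max), nil
}

func main() {
	soft, hard, err := openFileLimits()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("open files: soft=%d hard=%d\n", soft, hard)
}
```

The soft value is the one that triggers "too many open files". Newer Go releases may raise the soft limit towards the hard limit at startup, so the number printed here can differ from what ulimit -n reports in the shell.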

Analyzing Transport

The core of the problem above is that golang's http client does not close its socket connections in time. To see why, we need to look at how its Transport is actually implemented.

First, take a look at golang's transport source (src/net/http/transport.go):

// Transport is an implementation of RoundTripper that supports HTTP,
// HTTPS, and HTTP proxies (for either HTTP or HTTPS with CONNECT).
//
// By default, Transport caches connections for future re-use.
// This may leave many open connections when accessing many hosts.
// This behavior can be managed using Transport's CloseIdleConnections method
// and the MaxIdleConnsPerHost and DisableKeepAlives fields.
//
// Transports should be reused instead of created as needed.
// Transports are safe for concurrent use by multiple goroutines.
//
// A Transport is a low-level primitive for making HTTP and HTTPS requests.
// For high-level functionality, such as cookies and redirects, see Client.
//
// Transport uses HTTP/1.1 for HTTP URLs and either HTTP/1.1 or HTTP/2
// for HTTPS URLs, depending on whether the server supports HTTP/2,
// and how the Transport is configured. The DefaultTransport supports HTTP/2.
// To explicitly enable HTTP/2 on a transport, use golang.org/x/net/http2
// and call ConfigureTransport. See the package docs for more about HTTP/2.
//
// Responses with status codes in the 1xx range are either handled
// automatically (100 expect-continue) or ignored. The one
// exception is HTTP status code 101 (Switching Protocols), which is
// considered a terminal status and returned by RoundTrip. To see the
// ignored 1xx responses, use the httptrace trace package's
// ClientTrace.Got1xxResponse.
//
// Transport only retries a request upon encountering a network error
// if the request is idempotent and either has no body or has its
// Request.GetBody defined. HTTP requests are considered idempotent if
// they have HTTP methods GET, HEAD, OPTIONS, or TRACE; or if their
// Header map contains an "Idempotency-Key" or "X-Idempotency-Key"
// entry. If the idempotency key value is an zero-length slice, the
// request is treated as idempotent but the header is not sent on the
// wire.
type Transport struct {
    idleMu     sync.Mutex
    wantIdle   bool                                // user has requested to close all idle conns
    idleConn   map[connectMethodKey][]*persistConn // most recently used at end
    idleConnCh map[connectMethodKey]chan *persistConn
    idleLRU    connLRU

    reqMu       sync.Mutex
    reqCanceler map[*Request]func(error)

    altMu    sync.Mutex   // guards changing altProto only
    altProto atomic.Value // of nil or map[string]RoundTripper, key is URI scheme

    connCountMu          sync.Mutex
    connPerHostCount     map[connectMethodKey]int
    connPerHostAvailable map[connectMethodKey]chan struct{}

    // Proxy specifies a function to return a proxy for a given
    // Request. If the function returns a non-nil error, the
    // request is aborted with the provided error.
    //
    // The proxy type is determined by the URL scheme. "http",
    // "https", and "socks5" are supported. If the scheme is empty,
    // "http" is assumed.
    //
    // If Proxy is nil or returns a nil *URL, no proxy is used.
    Proxy func(*Request) (*url.URL, error)

    // DialContext specifies the dial function for creating unencrypted TCP connections.
    // If DialContext is nil (and the deprecated Dial below is also nil),
    // then the transport dials using package net.
    //
    // DialContext runs concurrently with calls to RoundTrip.
    // A RoundTrip call that initiates a dial may end up using
    // a connection dialed previously when the earlier connection
    // becomes idle before the later DialContext completes.
    DialContext func(ctx context.Context, network, addr string) (net.Conn, error)

    // Dial specifies the dial function for creating unencrypted TCP connections.
    //
    // Dial runs concurrently with calls to RoundTrip.
    // A RoundTrip call that initiates a dial may end up using
    // a connection dialed previously when the earlier connection
    // becomes idle before the later Dial completes.
    //
    // Deprecated: Use DialContext instead, which allows the transport
    // to cancel dials as soon as they are no longer needed.
    // If both are set, DialContext takes priority.
    Dial func(network, addr string) (net.Conn, error)

    // DialTLS specifies an optional dial function for creating
    // TLS connections for non-proxied HTTPS requests.
    //
    // If DialTLS is nil, Dial and TLSClientConfig are used.
    //
    // If DialTLS is set, the Dial hook is not used for HTTPS
    // requests and the TLSClientConfig and TLSHandshakeTimeout
    // are ignored. The returned net.Conn is assumed to already be
    // past the TLS handshake.
    DialTLS func(network, addr string) (net.Conn, error)

    // TLSClientConfig specifies the TLS configuration to use with
    // tls.Client.
    // If nil, the default configuration is used.
    // If non-nil, HTTP/2 support may not be enabled by default.
    TLSClientConfig *tls.Config

    // TLSHandshakeTimeout specifies the maximum amount of time waiting to
    // wait for a TLS handshake. Zero means no timeout.
    TLSHandshakeTimeout time.Duration

    // DisableKeepAlives, if true, disables HTTP keep-alives and
    // will only use the connection to the server for a single
    // HTTP request.
    //
    // This is unrelated to the similarly named TCP keep-alives.
    DisableKeepAlives bool

    // DisableCompression, if true, prevents the Transport from
    // requesting compression with an "Accept-Encoding: gzip"
    // request header when the Request contains no existing
    // Accept-Encoding value. If the Transport requests gzip on
    // its own and gets a gzipped response, it's transparently
    // decoded in the Response.Body. However, if the user
    // explicitly requested gzip it is not automatically
    // uncompressed.
    DisableCompression bool

    // MaxIdleConns controls the maximum number of idle (keep-alive)
    // connections across all hosts. Zero means no limit.
    MaxIdleConns int

    // MaxIdleConnsPerHost, if non-zero, controls the maximum idle
    // (keep-alive) connections to keep per-host. If zero,
    // DefaultMaxIdleConnsPerHost is used.
    MaxIdleConnsPerHost int

    // MaxConnsPerHost optionally limits the total number of
    // connections per host, including connections in the dialing,
    // active, and idle states. On limit violation, dials will block.
    //
    // Zero means no limit.
    //
    // For HTTP/2, this currently only controls the number of new
    // connections being created at a time, instead of the total
    // number. In practice, hosts using HTTP/2 only have about one
    // idle connection, though.
    MaxConnsPerHost int

    // IdleConnTimeout is the maximum amount of time an idle
    // (keep-alive) connection will remain idle before closing
    // itself.
    // Zero means no limit.
    IdleConnTimeout time.Duration

    // ResponseHeaderTimeout, if non-zero, specifies the amount of
    // time to wait for a server's response headers after fully
    // writing the request (including its body, if any). This
    // time does not include the time to read the response body.
    ResponseHeaderTimeout time.Duration

    // ExpectContinueTimeout, if non-zero, specifies the amount of
    // time to wait for a server's first response headers after fully
    // writing the request headers if the request has an
    // "Expect: 100-continue" header. Zero means no timeout and
    // causes the body to be sent immediately, without
    // waiting for the server to approve.
    // This time does not include the time to send the request header.
    ExpectContinueTimeout time.Duration

    // TLSNextProto specifies how the Transport switches to an
    // alternate protocol (such as HTTP/2) after a TLS NPN/ALPN
    // protocol negotiation. If Transport dials an TLS connection
    // with a non-empty protocol name and TLSNextProto contains a
    // map entry for that key (such as "h2"), then the func is
    // called with the request's authority (such as "example.com"
    // or "example.com:1234") and the TLS connection. The function
    // must return a RoundTripper that then handles the request.
    // If TLSNextProto is not nil, HTTP/2 support is not enabled
    // automatically.
    TLSNextProto map[string]func(authority string, c *tls.Conn) RoundTripper

    // ProxyConnectHeader optionally specifies headers to send to
    // proxies during CONNECT requests.
    ProxyConnectHeader Header

    // MaxResponseHeaderBytes specifies a limit on how many
    // response bytes are allowed in the server's response
    // header.
    //
    // Zero means to use a default limit.
    MaxResponseHeaderBytes int64

    // nextProtoOnce guards initialization of TLSNextProto and
    // h2transport (via onceSetNextProtoDefaults)
    nextProtoOnce sync.Once
    h2transport   h2Transport // non-nil if http2 wired up
}

The key sentences are these:

// By default, Transport caches connections for future re-use.
// This may leave many open connections when accessing many hosts.
// This behavior can be managed using Transport's CloseIdleConnections method
// and the MaxIdleConnsPerHost and DisableKeepAlives fields.
//
// Transports should be reused instead of created as needed.
// Transports are safe for concurrent use by multiple goroutines.

These sentences point out what was wrong with our earlier code: a single global Transport should be defined and shared among multiple goroutines.

The modified client.go is as follows:

package main

import (
    "io/ioutil"
    "log"
    "net/http"
    "time"
)

var globalTransport *http.Transport

func init() {
    globalTransport = &http.Transport{}
}

func request() {
    log.Println("request()")
    url := "http://127.0.0.1:4200/api/status"

    client := http.Client{
        // Timeout:   30,
        Transport: globalTransport,
    }

    resp, err := client.Get(url)
    if err != nil {
        log.Println(err)
        return
    }
    defer resp.Body.Close()
    body, err := ioutil.ReadAll(resp.Body)
    if err != nil {
        log.Println(err)
        return
    }
    log.Println(len(string(body)))

    time.Sleep(10 * time.Millisecond)
}

func main() {
    for i := 0; i < 10000; i++ {
        go request()
        time.Sleep(1 * time.Millisecond)
    }

    time.Sleep(1 * time.Hour)
}

Running the test code again, the number of socket fds it holds open drops back to single digits, and everything is normal.

$ ls /proc/19100/fd -l
total 0
lrwx------ 1 shaohua shaohua 64 Sep 19 15:56 0 -> /dev/pts/8
lrwx------ 1 shaohua shaohua 64 Sep 19 15:56 1 -> /dev/pts/8
lrwx------ 1 shaohua shaohua 64 Sep 19 15:56 2 -> /dev/pts/8
lrwx------ 1 shaohua shaohua 64 Sep 19 15:56 3 -> 'socket:[820276]'
lrwx------ 1 shaohua shaohua 64 Sep 19 15:56 4 -> 'anon_inode:[eventpoll]'

Another problem

At the time I also captured the postal service's network traffic with tcpdump and found that a large number of requests were being retransmitted:

[Figure: 001.cap.png]

The cause of this one has not been identified yet.

Source: https://blog.biofan.org/2019/09/go-http-client/
Reference: https://blog.csdn.net/Roy_70/article/details/78423880

1. Increasing the allowed number of open files: command-line method
ulimit -n 2048

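This sets the maximum number of open files to 2048 for the current shell session and the processes started from it, but the setting is not persistent: it reverts to the default after a reboot.
In the referenced article, a non-root user could only go up to 4096 with ulimit -n, since a non-root user cannot exceed the hard limit.
Setting it to 8192 required sudo or the root user.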

2. Increasing the allowed number of open files: editing the system configuration file

vim /etc/security/limits.conf

  • Append at the end:
* soft nofile 4096  
* hard nofile 4096  

Or add just this one line:

 * - nofile 8192

The leading * means all users. It can be replaced with a specific user as needed, for example:

roy soft nofile 8192  
roy hard nofile 8192  

Note that the "nofile" item has two possible limits, hard and soft. For the new maximum number of open files to take effect, both limits must be set. If the "-" character is used, hard and soft are set at the same time.

Reference: https://kebingzao.com/2018/06/26/golang-too-many-file/
golang pitfalls: a service's file handles exceed the system limit (too many open files)
