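Here is roughly what happened. I finished my handover and left the company a few days ago, and today the WeChat group of my former company blew up with messages: users could not place orders in the WeChat mini program. Xiaopang, who took over the backend project, had already restarted the database (no idea why restarting the database was the first thing he did after logging in to the server) and said it still did not work.
Later another colleague restarted the consumer service, and everything recovered.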
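The docker logs showed this error: dial tcp 127.0.0.1:8202: socket: too many open files. The HTTP server being requested is deployed on the same machine and shares the same docker network, so this normally should not happen. My guess was that either the local TCP ports had been exhausted or the file descriptors allotted to the process had been used up, so the HTTP client could not open any new socket connections. (The wording of the error, "too many open files", points to the second cause.)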
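The picture was still not very clear, so let me first lay out how the backend services are deployed. The postal service exposes the GET /api/status endpoint, which returns the state of the main controller. That endpoint is called by the business, consumer, and staff services.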
Under /proc/CONSUMER_PID/fd/ there were several hundred unclosed socket fds; there were only that many because the service had just been restarted. /proc/BUSINESS_PID/fd/ held more than 12000 socket fds, which is clearly abnormal. /proc/STAFF_PID/fd/ held only a few hundred, because the staff service rarely calls GET /api/status.
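Since this was a production server, I debugged with staff, the backend service with the least traffic. I had the staff service send one request to postal's GET /api/status endpoint and saw one more socket fd appear under /proc/STAFF_PID/fd/. The behavior was fully reproducible: every request added one more fd.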
That roughly pins the problem down to the HTTP client code in staff:
url := fmt.Sprintf("%s/api/stats/%s", settings.Current.BoxServer, box)
transport := &http.Transport{
	TLSHandshakeTimeout: TlsTimeout,
}
client := http.Client{
	Timeout:   HttpTimeout,
	Transport: transport,
}
resp, err := client.Get(url)
if err != nil {
	return nil, err
}
defer resp.Body.Close()
body, err := ioutil.ReadAll(resp.Body)
if err != nil {
	return nil, err
}
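The code above is simple, with nothing superfluous, and it closes the response body as the Go documentation requires.
The core of the problem is here:
transport := &http.Transport{
	TLSHandshakeTimeout: TlsTimeout,
}
Only the TLS handshake timeout of the Transport struct is set here; every other field is left at its default value.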
Demo
Server side, server.go:
package main

import (
	"log"
	"net/http"

	"github.com/gin-gonic/gin"
)

// getStatus answers GET /api/status with a fixed JSON body.
func getStatus(c *gin.Context) {
	c.JSON(http.StatusOK, gin.H{
		"Status": "ok",
	})
}

func main() {
	route := gin.Default()
	api := route.Group("/api")
	api.GET("/status", getStatus)
	if err := route.Run("127.0.0.1:4200"); err != nil {
		log.Panic(err)
	}
}
Client side, client.go:
package main

import (
	"io/ioutil"
	"log"
	"net/http"
	"time"
)

func request() {
	log.Println("request()")
	url := "http://127.0.0.1:4200/api/status"
	// A new Transport for every call: this reproduces the leak.
	transport := &http.Transport{
		// ResponseHeaderTimeout: 15 * time.Second,
	}
	client := http.Client{
		// Timeout: 30 * time.Second, // a bare 30 would mean 30 nanoseconds
		Transport: transport,
	}
	resp, err := client.Get(url)
	if err != nil {
		log.Println(err)
		return
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Println(err)
		return
	}
	log.Println(len(body))
	time.Sleep(1 * time.Millisecond)
}

func main() {
	// Fire 10000 concurrent requests, then keep the process alive for inspection.
	for i := 0; i < 10000; i++ {
		go request()
	}
	time.Sleep(1 * time.Hour)
}
The code above was run locally, and the client prints the same kind of error:
2019/09/19 15:26:34 Get http://127.0.0.1:4200/api/status: dial tcp 127.0.0.1:4200: socket: too many open files
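Then check how many files the process has open:
$ ls /proc/5689/fd | wc -l
1024
This is far lower than the count seen on the server. It turned out that my local Linux system had not been tuned, so a process may open only 1024 files by default. The resource limit information below shows the same: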
$ cat /proc/5689/limits
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
Max stack size            8388608              unlimited            bytes
Max core file size        0                    unlimited            bytes
Max resident set          unlimited            unlimited            bytes
Max processes             63754                63754                processes
Max open files            1024                 1048576              files
Max locked memory         67108864             67108864             bytes
Max address space         unlimited            unlimited            bytes
Max file locks            unlimited            unlimited            locks
Max pending signals       63754                63754                signals
Max msgqueue size         819200               819200               bytes
Max nice priority         0                    0
Max realtime priority     0                    0
Max realtime timeout      unlimited            unlimited            us
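The "Max open files" row can also be read from inside a Go process. The sketch below is an illustration under the assumption of a Unix-like system; openFileLimit is a name chosen for this example. Note that newer Go releases may raise the soft limit toward the hard limit at startup, so the value seen here can differ from what ulimit -n reports in the shell.

```go
package main

import (
	"fmt"
	"syscall"
)

// openFileLimit returns the soft and hard RLIMIT_NOFILE values of this process,
// the same pair shown in the "Max open files" row of /proc/<pid>/limits.
func openFileLimit() (soft, hard uint64, err error) {
	var lim syscall.Rlimit
	if err = syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
		return 0, 0, err
	}
	return uint64(lim.Cur), uint64(lim.Max), nil
}

func main() {
	soft, hard, err := openFileLimit()
	if err != nil {
		fmt.Println("getrlimit failed:", err)
		return
	}
	fmt.Printf("max open files: soft=%d hard=%d\n", soft, hard)
}
```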
Analyzing Transport
The root cause is that Go's HTTP client does not close its socket connections promptly; to see why, we need to look at how its Transport is implemented.
First, the Transport source in Go (src/net/http/transport.go):
// Transport is an implementation of RoundTripper that supports HTTP,
// HTTPS, and HTTP proxies (for either HTTP or HTTPS with CONNECT).
//
// By default, Transport caches connections for future re-use.
// This may leave many open connections when accessing many hosts.
// This behavior can be managed using Transport's CloseIdleConnections method
// and the MaxIdleConnsPerHost and DisableKeepAlives fields.
//
// Transports should be reused instead of created as needed.
// Transports are safe for concurrent use by multiple goroutines.
//
// A Transport is a low-level primitive for making HTTP and HTTPS requests.
// For high-level functionality, such as cookies and redirects, see Client.
//
// Transport uses HTTP/1.1 for HTTP URLs and either HTTP/1.1 or HTTP/2
// for HTTPS URLs, depending on whether the server supports HTTP/2,
// and how the Transport is configured. The DefaultTransport supports HTTP/2.
// To explicitly enable HTTP/2 on a transport, use golang.org/x/net/http2
// and call ConfigureTransport. See the package docs for more about HTTP/2.
//
// Responses with status codes in the 1xx range are either handled
// automatically (100 expect-continue) or ignored. The one
// exception is HTTP status code 101 (Switching Protocols), which is
// considered a terminal status and returned by RoundTrip. To see the
// ignored 1xx responses, use the httptrace trace package's
// ClientTrace.Got1xxResponse.
//
// Transport only retries a request upon encountering a network error
// if the request is idempotent and either has no body or has its
// Request.GetBody defined. HTTP requests are considered idempotent if
// they have HTTP methods GET, HEAD, OPTIONS, or TRACE; or if their
// Header map contains an "Idempotency-Key" or "X-Idempotency-Key"
// entry. If the idempotency key value is an zero-length slice, the
// request is treated as idempotent but the header is not sent on the
// wire.
type Transport struct {
idleMu sync.Mutex
wantIdle bool // user has requested to close all idle conns
idleConn map[connectMethodKey][]*persistConn // most recently used at end
idleConnCh map[connectMethodKey]chan *persistConn
idleLRU connLRU
reqMu sync.Mutex
reqCanceler map[*Request]func(error)
altMu sync.Mutex // guards changing altProto only
altProto atomic.Value // of nil or map[string]RoundTripper, key is URI scheme
connCountMu sync.Mutex
connPerHostCount map[connectMethodKey]int
connPerHostAvailable map[connectMethodKey]chan struct{}
// Proxy specifies a function to return a proxy for a given
// Request. If the function returns a non-nil error, the
// request is aborted with the provided error.
//
// The proxy type is determined by the URL scheme. "http",
// "https", and "socks5" are supported. If the scheme is empty,
// "http" is assumed.
//
// If Proxy is nil or returns a nil *URL, no proxy is used.
Proxy func(*Request) (*url.URL, error)
// DialContext specifies the dial function for creating unencrypted TCP connections.
// If DialContext is nil (and the deprecated Dial below is also nil),
// then the transport dials using package net.
//
// DialContext runs concurrently with calls to RoundTrip.
// A RoundTrip call that initiates a dial may end up using
// a connection dialed previously when the earlier connection
// becomes idle before the later DialContext completes.
DialContext func(ctx context.Context, network, addr string) (net.Conn, error)
// Dial specifies the dial function for creating unencrypted TCP connections.
//
// Dial runs concurrently with calls to RoundTrip.
// A RoundTrip call that initiates a dial may end up using
// a connection dialed previously when the earlier connection
// becomes idle before the later Dial completes.
//
// Deprecated: Use DialContext instead, which allows the transport
// to cancel dials as soon as they are no longer needed.
// If both are set, DialContext takes priority.
Dial func(network, addr string) (net.Conn, error)
// DialTLS specifies an optional dial function for creating
// TLS connections for non-proxied HTTPS requests.
//
// If DialTLS is nil, Dial and TLSClientConfig are used.
//
// If DialTLS is set, the Dial hook is not used for HTTPS
// requests and the TLSClientConfig and TLSHandshakeTimeout
// are ignored. The returned net.Conn is assumed to already be
// past the TLS handshake.
DialTLS func(network, addr string) (net.Conn, error)
// TLSClientConfig specifies the TLS configuration to use with
// tls.Client.
// If nil, the default configuration is used.
// If non-nil, HTTP/2 support may not be enabled by default.
TLSClientConfig *tls.Config
// TLSHandshakeTimeout specifies the maximum amount of time waiting to
// wait for a TLS handshake. Zero means no timeout.
TLSHandshakeTimeout time.Duration
// DisableKeepAlives, if true, disables HTTP keep-alives and
// will only use the connection to the server for a single
// HTTP request.
//
// This is unrelated to the similarly named TCP keep-alives.
DisableKeepAlives bool
// DisableCompression, if true, prevents the Transport from
// requesting compression with an "Accept-Encoding: gzip"
// request header when the Request contains no existing
// Accept-Encoding value. If the Transport requests gzip on
// its own and gets a gzipped response, it's transparently
// decoded in the Response.Body. However, if the user
// explicitly requested gzip it is not automatically
// uncompressed.
DisableCompression bool
// MaxIdleConns controls the maximum number of idle (keep-alive)
// connections across all hosts. Zero means no limit.
MaxIdleConns int
// MaxIdleConnsPerHost, if non-zero, controls the maximum idle
// (keep-alive) connections to keep per-host. If zero,
// DefaultMaxIdleConnsPerHost is used.
MaxIdleConnsPerHost int
// MaxConnsPerHost optionally limits the total number of
// connections per host, including connections in the dialing,
// active, and idle states. On limit violation, dials will block.
//
// Zero means no limit.
//
// For HTTP/2, this currently only controls the number of new
// connections being created at a time, instead of the total
// number. In practice, hosts using HTTP/2 only have about one
// idle connection, though.
MaxConnsPerHost int
// IdleConnTimeout is the maximum amount of time an idle
// (keep-alive) connection will remain idle before closing
// itself.
// Zero means no limit.
IdleConnTimeout time.Duration
// ResponseHeaderTimeout, if non-zero, specifies the amount of
// time to wait for a server's response headers after fully
// writing the request (including its body, if any). This
// time does not include the time to read the response body.
ResponseHeaderTimeout time.Duration
// ExpectContinueTimeout, if non-zero, specifies the amount of
// time to wait for a server's first response headers after fully
// writing the request headers if the request has an
// "Expect: 100-continue" header. Zero means no timeout and
// causes the body to be sent immediately, without
// waiting for the server to approve.
// This time does not include the time to send the request header.
ExpectContinueTimeout time.Duration
// TLSNextProto specifies how the Transport switches to an
// alternate protocol (such as HTTP/2) after a TLS NPN/ALPN
// protocol negotiation. If Transport dials an TLS connection
// with a non-empty protocol name and TLSNextProto contains a
// map entry for that key (such as "h2"), then the func is
// called with the request's authority (such as "example.com"
// or "example.com:1234") and the TLS connection. The function
// must return a RoundTripper that then handles the request.
// If TLSNextProto is not nil, HTTP/2 support is not enabled
// automatically.
TLSNextProto map[string]func(authority string, c *tls.Conn) RoundTripper
// ProxyConnectHeader optionally specifies headers to send to
// proxies during CONNECT requests.
ProxyConnectHeader Header
// MaxResponseHeaderBytes specifies a limit on how many
// response bytes are allowed in the server's response
// header.
//
// Zero means to use a default limit.
MaxResponseHeaderBytes int64
// nextProtoOnce guards initialization of TLSNextProto and
// h2transport (via onceSetNextProtoDefaults)
nextProtoOnce sync.Once
h2transport h2Transport // non-nil if http2 wired up
}
The key sentences are these:
// By default, Transport caches connections for future re-use.
// This may leave many open connections when accessing many hosts.
// This behavior can be managed using Transport's CloseIdleConnections method
// and the MaxIdleConnsPerHost and DisableKeepAlives fields.
//
// Transports should be reused instead of created as needed.
// Transports are safe for concurrent use by multiple goroutines.
These sentences point straight at the problem in our code: a single global Transport should be defined and shared across goroutines.
The modified client.go looks like this:
package main

import (
	"io/ioutil"
	"log"
	"net/http"
	"time"
)

// globalTransport is created once and shared by all goroutines,
// so idle connections are pooled and reused.
var globalTransport *http.Transport

func init() {
	globalTransport = &http.Transport{}
}

func request() {
	log.Println("request()")
	url := "http://127.0.0.1:4200/api/status"
	client := http.Client{
		// Timeout: 30 * time.Second, // a bare 30 would mean 30 nanoseconds
		Transport: globalTransport,
	}
	resp, err := client.Get(url)
	if err != nil {
		log.Println(err)
		return
	}
	defer resp.Body.Close()
	body, err := ioutil.ReadAll(resp.Body)
	if err != nil {
		log.Println(err)
		return
	}
	log.Println(len(body))
	time.Sleep(10 * time.Millisecond)
}

func main() {
	for i := 0; i < 10000; i++ {
		go request()
		time.Sleep(1 * time.Millisecond)
	}
	time.Sleep(1 * time.Hour)
}
Running the test again, the number of open socket fds drops back to single digits and everything is normal.
$ ls /proc/19100/fd -l
total 0
lrwx------ 1 shaohua shaohua 64 Sep 19 15:56 0 -> /dev/pts/8
lrwx------ 1 shaohua shaohua 64 Sep 19 15:56 1 -> /dev/pts/8
lrwx------ 1 shaohua shaohua 64 Sep 19 15:56 2 -> /dev/pts/8
lrwx------ 1 shaohua shaohua 64 Sep 19 15:56 3 -> 'socket:[820276]'
lrwx------ 1 shaohua shaohua 64 Sep 19 15:56 4 -> 'anon_inode:[eventpoll]'
Another issue
At the time I also captured the postal service's network traffic with tcpdump and found a large number of requests being retransmitted.
The cause of that problem has not been identified yet.
Source: https://blog.biofan.org/2019/09/go-http-client/
Reference: https://blog.csdn.net/Roy_70/article/details/78423880
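1. Raise the open file limit from the command line
ulimit -n 2048
This sets the maximum number of open files to 2048, but only for the current shell session and the processes started from it; the setting reverts to the default after a reboot.
A non-root user can raise ulimit -n only up to the hard limit (4096 in the referenced article).
Going up to 8192 requires sudo or the root user.
2. Raise the open file limit in the system configuration file
vim /etc/security/limits.conf
Append at the end:
* soft nofile 4096
* hard nofile 4096
or just:
* - nofile 8192
The leading * means all users; a specific user can be named instead, for example:
roy soft nofile 8192
roy hard nofile 8192
Note that nofile has two possible limits, hard and soft. For a new maximum number of open files to take effect, both must be set. Using the - character sets hard and soft at the same time.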
Reference: https://kebingzao.com/2018/06/26/golang-too-many-file/
Go pitfalls: a service's file handles exceed the system limit (too many open files)