Could Not Recover
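Nothing in C/C++ is more frustrating than finding a wild pointer or an out-of-bounds memory access after release, so that code which "cannot possibly crash" crashes. Nothing is more exasperating than a log statement written long ago, for example one that formats an integer with %s, which one day finally gets executed and crashes. And nothing feels more helpless than the fact that, whatever the cause, a crash affects every user of the service.
If pointers and memory were managed so that users could not access or free them incorrectly, some CPU would be spent on the checks, but these headaches would disappear in a fast-moving business. Modern high-level languages provide such a mechanism: exceptions in Java, Python, and JS, and panic-recover in Go.
After all, trading some CPU for not crashing during rapid iteration is a good deal however you count it.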
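What Can Be Recovered
Go has Defer, Panic, and Recover. defer is typically used to release resources or to catch a panic. panic stops the normal flow of execution, runs all deferred functions of the current function, then returns to the caller and keeps panicking there; both an explicit call to panic and certain runtime errors start this process. recover takes back control during a panic and resumes normal execution.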
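The flow described above can be traced with a small sketch. The function name run and the trace labels are made up for illustration; the point is only to show that deferred functions run in last-in-first-out order while the panic travels up the call stack, until a recover in a deferred function stops it.

```go
package main

import "fmt"

// run triggers a panic two calls deep and records what happens on the way up.
// The names here are illustrative only.
func run() (trace []string, recovered interface{}) {
	inner := func() {
		defer func() { trace = append(trace, "inner defer") }()
		panic("boom")
	}
	outer := func() {
		defer func() { trace = append(trace, "outer defer 1") }()
		defer func() { trace = append(trace, "outer defer 2") }()
		inner()
		trace = append(trace, "not reached") // skipped: inner panicked
	}

	// recover is called directly by this deferred function, so it stops the panic.
	defer func() { recovered = recover() }()
	outer()
	return
}

func main() {
	trace, recovered := run()
	// Deferred functions run innermost first, and last-deferred first:
	// inner defer, outer defer 2, outer defer 1.
	for _, step := range trace {
		fmt.Println(step)
	}
	fmt.Println("recovered:", recovered)
}
```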
Note that recover only works when it is called directly by the deferred function; it has no effect in a function that the deferred function calls. The following sample code therefore does not recover:
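package main

import "fmt"

func main() {
	f := func() {
		// recover is called one frame below the deferred function,
		// so it returns nil and the panic continues.
		if r := recover(); r != nil {
			fmt.Println(r)
		}
	}
	defer func() {
		f()
	}()
	panic("ok")
}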
It still panics when run:
$ go run t.go
panic: ok
goroutine 1 [running]:
main.main()
/Users/winlin/temp/t.go:16 +0x6b
exit status 2
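For contrast, here is a minimal sketch of the same situation with the helper deferred directly. The function names direct, viaHelper, and outer are hypothetical. When f itself is the deferred function, recover works; when f is called from inside another deferred function, recover returns nil and the panic keeps going until something further up catches it.

```go
package main

import "fmt"

// direct defers f itself, so recover is called directly by the deferred function.
func direct() (got interface{}) {
	f := func() { got = recover() }
	defer f()
	panic("ok")
}

// viaHelper calls f from inside the deferred function: one frame too deep,
// so recover returns nil and the panic is not stopped here.
func viaHelper() (got interface{}) {
	f := func() { got = recover() }
	defer func() { f() }()
	panic("ok")
}

// outer shows that the panic from viaHelper escapes and is caught one level up.
func outer() (caught interface{}) {
	defer func() { caught = recover() }()
	viaHelper()
	return nil
}

func main() {
	fmt.Println("direct:", direct())
	fmt.Println("outer:", outer())
}
```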
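Some failures cannot be caught at all: the program simply exits and recover never gets a chance. Ordinary panics, however, can be caught, for example a slice index out of range, a nil pointer dereference, a division by zero, or a send on a closed chan.
Here is the slice out-of-range example, which recover catches: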
package main

import (
	"fmt"
)

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println(r)
		}
	}()

	b := []int{0, 1}
	fmt.Println("Hello, playground", b[2])
}
Here is the nil pointer dereference example, which recover catches:
package main

import (
	"bytes"
	"fmt"
)

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println(r)
		}
	}()

	var b *bytes.Buffer
	fmt.Println("Hello, playground", b.Bytes())
}
Here is the division-by-zero example, which recover catches:
package main

import (
	"fmt"
)

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println(r)
		}
	}()

	var v int
	fmt.Println("Hello, playground", 1/v)
}
Here is the example of sending on a closed chan, which recover catches:
package main

import (
	"fmt"
)

func main() {
	defer func() {
		if r := recover(); r != nil {
			fmt.Println(r)
		}
	}()

	c := make(chan bool)
	close(c)
	c <- true
}
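The four cases above can also be put side by side. The sketch below is only an illustration: the helper names try and runtimePanics are invented here, and the exact wording of some runtime messages varies between Go versions, so the program prints them rather than relying on them.

```go
package main

import "fmt"

type point struct{ x int }

// sink keeps the results alive so each faulty expression is really evaluated.
var sink int

// try runs fn and returns whatever it panicked with as an error (nil if no panic).
func try(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			if e, ok := r.(error); ok {
				err = e
			} else {
				err = fmt.Errorf("%v", r)
			}
		}
	}()
	fn()
	return nil
}

// runtimePanics triggers the four recoverable runtime panics discussed above.
func runtimePanics() map[string]error {
	var p *point
	s := []int{0, 1}
	i, zero := 2, 0
	c := make(chan bool)
	close(c)

	return map[string]error{
		"slice out of range":  try(func() { sink = s[i] }),
		"nil pointer":         try(func() { sink = p.x }),
		"divide by zero":      try(func() { sink = 1 / zero }),
		"send on closed chan": try(func() { c <- true }),
	}
}

func main() {
	errs := runtimePanics()
	names := []string{"slice out of range", "nil pointer", "divide by zero", "send on closed chan"}
	for _, name := range names {
		fmt.Printf("%-20s recovered: %v\n", name, errs[name])
	}
}
```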
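Recover Best Practices
After recover, you usually check whether the recovered value is an error, since some errors may need special handling, and you usually log it or raise an alert. Here is an example: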
package main

import (
	"fmt"
)

type Handler interface {
	Filter(err error, r interface{}) error
}

type Logger interface {
	Ef(format string, a ...interface{})
}

// HandlePanic recovers a panic, filters the error with hdr,
// and finally logs the error with logger.
// It must be deferred directly, so that recover is called by the deferred function.
func HandlePanic(hdr Handler, logger Logger) error {
	return handlePanic(recover(), hdr, logger)
}

type hdrFunc func(err error, r interface{}) error

func (v hdrFunc) Filter(err error, r interface{}) error {
	return v(err, r)
}

type loggerFunc func(format string, a ...interface{})

func (v loggerFunc) Ef(format string, a ...interface{}) {
	v(format, a...)
}

// HandlePanicFunc is like HandlePanic, but takes plain functions
// instead of the Handler and Logger interfaces.
func HandlePanicFunc(hdr func(err error, r interface{}) error,
	logger func(format string, a ...interface{}),
) error {
	var f Handler
	if hdr != nil {
		f = hdrFunc(hdr)
	}

	var l Logger
	if logger != nil {
		l = loggerFunc(logger)
	}

	return handlePanic(recover(), f, l)
}

func handlePanic(r interface{}, hdr Handler, logger Logger) error {
	if r != nil {
		// Convert a non-error panic value to an error.
		err, ok := r.(error)
		if !ok {
			err = fmt.Errorf("r is %v", r)
		}

		if hdr != nil {
			err = hdr.Filter(err, r)
		}

		if err != nil && logger != nil {
			logger.Ef("panic err %+v", err)
		}
		return err
	}

	return nil
}

func main() {
	func() {
		defer HandlePanicFunc(nil, func(format string, a ...interface{}) {
			fmt.Println(fmt.Sprintf(format, a...))
		})

		panic("ok")
	}()

	logger := func(format string, a ...interface{}) {
		fmt.Println(fmt.Sprintf(format, a...))
	}
	func() {
		defer HandlePanicFunc(nil, logger)

		panic("ok")
	}()
}
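The example above passes nil as the filter. As a rough sketch of what "handling special errors" might look like, the standalone snippet below filters out one expected sentinel error and reports everything else. The names errAbort and guard are hypothetical and are not part of the code above.

```go
package main

import (
	"errors"
	"fmt"
)

// errAbort is a hypothetical sentinel for an expected, harmless panic.
var errAbort = errors.New("abort")

// guard runs fn and returns a recovered panic as an error.
// Panics that carry errAbort are treated as expected and filtered out.
func guard(fn func()) (err error) {
	defer func() {
		r := recover()
		if r == nil {
			return
		}

		// Convert a non-error panic value to an error.
		e, ok := r.(error)
		if !ok {
			e = fmt.Errorf("panic: %v", r)
		}

		if errors.Is(e, errAbort) {
			return // expected, nothing to report
		}
		err = e
	}()

	fn()
	return nil
}

func main() {
	fmt.Println(guard(func() { panic(errAbort) })) // filtered, prints <nil>
	fmt.Println(guard(func() { panic("oops") }))   // prints panic: oops
}
```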
How should a library that starts its own goroutines recover?
- If a panic cannot happen, recover is unnecessary, as in this goroutine in tls.go:
errChannel <- conn.Handshake()
- If a panic may happen and it is clear that it can be recovered, invoke a user callback or let the user set a logger, as in the goroutine that serves requests in http/server.go:
if err := recover(); err != nil && err != ErrAbortHandler {
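- If the library has no idea how to handle the recovered panic, for example a cache library where dropping data may cause problems, then let the user start the goroutine, return the abnormal data and the error, and let the user decide how to recover and retry.
- If the library knows exactly how to recover, for example by ignoring the panic and carrying on or by printing a log with a logger, then apply the normal panic-recover logic.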
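The second case in the list, a library goroutine that hands the panic to a user callback, could look roughly like this. Worker, OnPanic, Go, and Wait are hypothetical names for this sketch, not an API from any real library.

```go
package main

import (
	"fmt"
	"sync"
)

// Worker is a hypothetical library type that runs tasks in goroutines.
type Worker struct {
	// OnPanic, if set by the user, receives the value recovered from a task.
	OnPanic func(r interface{})

	wg sync.WaitGroup
}

// Go runs task in a new goroutine and reports a panic to OnPanic.
func (w *Worker) Go(task func()) {
	w.wg.Add(1)
	go func() {
		defer w.wg.Done()
		defer func() {
			r := recover()
			if r == nil {
				return
			}
			if w.OnPanic == nil {
				panic(r) // no handler set: do not hide the failure
			}
			w.OnPanic(r)
		}()

		task()
	}()
}

// Wait blocks until all started tasks have finished.
func (w *Worker) Wait() { w.wg.Wait() }

func main() {
	w := &Worker{OnPanic: func(r interface{}) {
		fmt.Println("task panicked:", r)
	}}
	w.Go(func() { panic("task failed") })
	w.Go(func() { fmt.Println("task ok") })
	w.Wait()
}
```

Re-panicking when no callback is set is a design choice of this sketch: it keeps the library from silently swallowing failures it does not know how to handle.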
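What Cannot Be Recovered
Some failures cannot be caught, including (but not limited to):
- Thread Limit: the program exceeds the limit on OS threads; see the details below.
- Concurrent Map Writers: a race condition in which a map is written concurrently; see the example below. The standard library's sync.Map is the recommended fix.
The following sample code crashes on concurrent map writes, and recover does not help: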
package main

import (
	"fmt"
	"time"
)

func main() {
	m := map[string]int{}
	p := func() {
		defer func() {
			if r := recover(); r != nil {
				fmt.Println(r)
			}
		}()
		for {
			m["t"] = 0
		}
	}

	go p()
	go p()
	time.Sleep(1 * time.Second)
}
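Since the crash above cannot be recovered, the writes have to be made safe in the first place. The sketch below shows two common options, a map guarded by sync.Mutex and the sync.Map mentioned earlier. The type and function names are made up for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// counter is a plain map guarded by a mutex.
type counter struct {
	mu sync.Mutex
	m  map[string]int
}

func (c *counter) inc(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key]++
}

func (c *counter) get(key string) int {
	c.mu.Lock()
	defer c.mu.Unlock()
	return c.m[key]
}

// countConcurrently increments one key from several goroutines and returns the total.
func countConcurrently(writers, perWriter int) int {
	c := &counter{m: map[string]int{}}
	var wg sync.WaitGroup
	for i := 0; i < writers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWriter; j++ {
				c.inc("t")
			}
		}()
	}
	wg.Wait()
	return c.get("t")
}

// storeConcurrently writes distinct keys into a sync.Map and returns how many it holds.
func storeConcurrently(writers int) int {
	var m sync.Map
	var wg sync.WaitGroup
	for i := 0; i < writers; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			m.Store(n, true)
		}(i)
	}
	wg.Wait()

	n := 0
	m.Range(func(_, _ interface{}) bool {
		n++
		return true
	})
	return n
}

func main() {
	fmt.Println("mutex-guarded map:", countConcurrently(4, 1000))
	fmt.Println("sync.Map entries:", storeConcurrently(8))
}
```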
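Note: when the program is built with -race, the race detector reports other data races as well. It detects data races, not deadlocks, and it costs a lot of CPU and memory, so enable it with care.
Remark: errors raised with throw in the standard library generally cannot be recovered; a search of the Go 1.11 source found 690 calls to throw.
Go 1.2 introduced ThreadLimit, a cap on the number of threads a program may use. Exceeding it raises a fatal error that cannot be recovered: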
fatal error: thread exhaustion
runtime stack:
runtime.throw(0x10b60fd, 0x11)
/usr/local/Cellar/go/1.8.3/libexec/src/runtime/panic.go:596 +0x95
runtime.mstart()
/usr/local/Cellar/go/1.8.3/libexec/src/runtime/proc.go:1132
The default is 10,000 OS threads; the limit can be changed with SetMaxThreads in the runtime/debug package:
SetMaxThreads sets the maximum number of operating system threads that the Go program can use. If it attempts to use more than this many, the program crashes. SetMaxThreads returns the previous setting. The initial setting is 10,000 threads.
In other words, this function sets the maximum number of OS threads the program may use, and the program crashes if it goes over. It returns the previous setting; the initial value is 10,000 threads.
The limit controls the number of operating system threads, not the number of goroutines. A Go program creates a new thread only when a goroutine is ready to run but all the existing threads are blocked in system calls, cgo calls, or are locked to other goroutines due to use of runtime.LockOSThread.
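Note that the limit applies to OS threads, not to goroutines. Starting a goroutine does not always create a new OS thread; one is created only when a goroutine is ready to run but every existing thread is blocked in a system call or a cgo call, or is locked to another goroutine by an explicit runtime.LockOSThread.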
SetMaxThreads is useful mainly for limiting the damage done by programs that create an unbounded number of threads. The idea is to take down the program before it takes down the operating system.
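This is the last line of defense: it kills a misbehaving program before the program takes down the operating system.
As a simple example, limit the program to 10 threads and bind each goroutine to an OS thread with runtime.LockOSThread. The program exits before 10 goroutines have been created, because the runtime needs threads too. See the example Playground: ThreadLimit: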
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"sync"
	"time"
)

func main() {
	nv := 10
	ov := debug.SetMaxThreads(nv)
	fmt.Println(fmt.Sprintf("Change max threads %d=>%d", ov, nv))

	var wg sync.WaitGroup
	c := make(chan bool, 0)
	for i := 0; i < 10; i++ {
		fmt.Println(fmt.Sprintf("Start goroutine #%v", i))

		wg.Add(1)
		go func() {
			c <- true
			defer wg.Done()
			runtime.LockOSThread()
			time.Sleep(10 * time.Second)
			fmt.Println("Goroutine quit")
		}()

		<-c
		fmt.Println(fmt.Sprintf("Start goroutine #%v ok", i))
	}

	fmt.Println("Wait for all goroutines about 10s...")
	wg.Wait()

	fmt.Println("All goroutines done")
}
The output is:
Change max threads 10000=>10
Start goroutine #0
Start goroutine #0 ok
......
Start goroutine #6
Start goroutine #6 ok
Start goroutine #7
runtime: program exceeds 10-thread limit
fatal error: thread exhaustion
runtime stack:
runtime.throw(0xffdef, 0x11)
/usr/local/go/src/runtime/panic.go:616 +0x100
runtime.checkmcount()
/usr/local/go/src/runtime/proc.go:542 +0x100
......
/usr/local/go/src/runtime/proc.go:1830 +0x40
runtime.startm(0x1040e000, 0x1040e000)
/usr/local/go/src/runtime/proc.go:2002 +0x180
In this run the limit was 10 OS threads: the runtime used 3 of them, 7 were available to user-level goroutines, and the program crashed when starting the 8th.
Note that the result varies across Go versions; with Go 1.8, for instance, the crash sometimes happens after only 4 or 5 goroutines have started.
Adding recover does not help either, as the sample code below shows. The mechanism really is a last line of defense that cannot be bypassed. We once hit it in a production service because too many goroutines were blocked.
package main

import (
	"fmt"
	"runtime"
	"runtime/debug"
	"sync"
	"time"
)

func main() {
	// Neither this recover nor the one in the goroutines below is ever reached:
	// thread exhaustion is a fatal error, not a panic.
	defer func() {
		if r := recover(); r != nil {
			fmt.Println("main recover is", r)
		}
	}()

	nv := 10
	ov := debug.SetMaxThreads(nv)
	fmt.Println(fmt.Sprintf("Change max threads %d=>%d", ov, nv))

	var wg sync.WaitGroup
	c := make(chan bool, 0)
	for i := 0; i < 10; i++ {
		fmt.Println(fmt.Sprintf("Start goroutine #%v", i))

		wg.Add(1)
		go func() {
			c <- true

			defer func() {
				if r := recover(); r != nil {
					fmt.Println("goroutine recover is", r)
				}
			}()

			defer wg.Done()
			runtime.LockOSThread()
			time.Sleep(10 * time.Second)
			fmt.Println("Goroutine quit")
		}()

		<-c
		fmt.Println(fmt.Sprintf("Start goroutine #%v ok", i))
	}

	fmt.Println("Wait for all goroutines about 10s...")
	wg.Wait()

	fmt.Println("All goroutines done")
}
How can a program avoid being killed for exceeding the thread limit? Threads are usually lost to blocking system calls, so when do those block? And what does GOMAXPROCS do?
The GOMAXPROCS variable limits the number of operating system threads that can execute user-level Go code simultaneously. There is no limit to the number of threads that can be blocked in system calls on behalf of Go code; those do not count against the GOMAXPROCS limit. This package's GOMAXPROCS function queries and changes the limit.
GOMAXPROCS sets the maximum number of CPUs that can be executing simultaneously and returns the previous setting. If n < 1, it does not change the current setting. The number of logical CPUs on the local machine can be queried with NumCPU. This call will go away when the scheduler improves.
So GOMAXPROCS only limits the number of threads that execute user-level code in parallel, that is, the threads that are actually running. When OS threads block in system calls, more OS threads are created. The post Number of threads used by goroutine explains this parameter clearly:
There is no direct correlation. Threads used by your app may be less than, equal to or more than 10.
So if your application does not start any new goroutines, threads count will be less than 10.
If your app starts many goroutines (>10) where none is blocking (e.g. in system calls), 10 operating system threads will execute your goroutines simultaneously.
If your app starts many goroutines where many (>10) are blocked in system calls, more than 10 OS threads will be spawned (but only at most 10 will be executing user-level Go code).
With GOMAXPROCS set to 10: if fewer than 10 goroutines are started, fewer than 10 OS threads are used. If there are many goroutines but none blocks in a system call, only 10 threads run in parallel. If there are many goroutines and more than 10 of them block in system calls, more than 10 OS threads are created, but at most 10 of them execute user-level code.
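For reference, the setting itself can be read and changed at run time. The sketch below uses only runtime.GOMAXPROCS and runtime.NumCPU from the standard library; the helper names procs and withGOMAXPROCS are hypothetical.

```go
package main

import (
	"fmt"
	"runtime"
)

// procs returns the current GOMAXPROCS value.
// Passing a value below 1 only queries the setting and does not change it.
func procs() int {
	return runtime.GOMAXPROCS(0)
}

// withGOMAXPROCS sets the limit to n, runs fn, then restores the previous value.
func withGOMAXPROCS(n int, fn func()) {
	prev := runtime.GOMAXPROCS(n)
	defer runtime.GOMAXPROCS(prev)
	fn()
}

func main() {
	fmt.Println("logical CPUs:", runtime.NumCPU())
	fmt.Println("GOMAXPROCS before:", procs())
	withGOMAXPROCS(1, func() {
		fmt.Println("GOMAXPROCS inside:", procs())
	})
	fmt.Println("GOMAXPROCS after:", procs())
}
```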
When does a goroutine block in a system call? The question Why does it not create many threads when many goroutines are blocked in writing gives a clear example: although GOMAXPROCS is set to 1, 12 threads are created, one OS thread per goroutine. Run the code below, Writing Large Block:
package main

import (
	"io/ioutil"
	"os"
	"runtime"
	"strconv"
	"sync"
)

func main() {
	runtime.GOMAXPROCS(1)
	data := make([]byte, 128*1024*1024)

	var wg sync.WaitGroup
	for i := 0; i < 10; i++ {
		wg.Add(1)
		go func(n int) {
			defer wg.Done()
			for {
				ioutil.WriteFile("testxxx"+strconv.Itoa(n), []byte(data), os.ModePerm)
			}
		}(i)
	}

	wg.Wait()
}
The output is:
Mac chengli.ycl$ time go run t.go
real 1m44.679s
user 0m0.230s
sys 0m53.474s
Although GOMAXPROCS is set to 1, 12 OS threads were actually created.
Most of the time is spent in sys, that is, in system calls.
So I think the syscalls were exiting too quickly in your original test to show the effect you were expecting.
Effective Go explains:
Goroutines are multiplexed onto multiple OS threads so if one should block, such as while waiting for I/O, others continue to run. Their design hides many of the complexities of thread creation and management.
So if a program crashes because it exceeds the thread limit, use Linux tools at the moment the bottleneck appears to collect system call statistics and find out which system calls cause so many threads to be created.
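Alongside the Linux tools, a program can keep an eye on its own thread count. This is a different technique from the system call statistics suggested above: the sketch reads the "threadcreate" profile from runtime/pprof, which counts the OS threads the runtime has created. The helper name threadsCreated is hypothetical, and what to do with the number (logging, alerting) is left open.

```go
package main

import (
	"fmt"
	"runtime"
	"runtime/pprof"
)

// threadsCreated reports how many OS threads the Go runtime has created so far,
// according to the built-in "threadcreate" profile.
func threadsCreated() int {
	return pprof.Lookup("threadcreate").Count()
}

func main() {
	fmt.Println("goroutines:", runtime.NumGoroutine())
	fmt.Println("OS threads created:", threadsCreated())
}
```

Logging this value periodically and comparing it with the limit set by SetMaxThreads is one way to notice thread growth before the runtime aborts the program.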
Links
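Because Jianshu limits the length of an article, the series is split into chapters:
- Overview: Why is Go sometimes called Golang? Why choose Go as a server development language: impulse, or restlessness? What are Go's major milestones and events, and which of the big promises made back then have come true?
- Could Not Recover: Do you know which panics cannot be recovered? They include exceeding the OS thread limit and concurrent map writes. Most panics can be recovered, of course, such as a slice index out of range, a nil pointer dereference, a division by zero, or a send on a closed chan.
- Errors: Why are 2 of the 3 Go2 drafts about error handling? What does good error handling look like? How do errors differ from exceptions? How should error handling and logging work together?
- Logger: Why is the standard library Logger nowhere near enough? How do you split and rotate logs? How do you find the logs of one connection, or even of one stream within a connection, in a jumbled server log? How do you keep it simple yet sufficient?
- Interfaces: What are the SOLID principles of object-oriented design? Why does Go fit SOLID better? Why is interface composition more orthogonal than inheritance and polymorphism? How does Go's type system manage to be looser, organic, decoupled, independent, and therefore scalable? When mathematics shows up in software, it is usually either truly impressive or just showing off. Orthogonality, a mathematical concept, comes up again and again in Go: is it the real thing or a pretender? Why should interface design consider orthogonality?
- Modules: How do you avoid dependency hell? How can a tiny version number cause a disaster? Why did Go add module and vgo after GOPATH and Vendor? I created 16 repositories for testing, hit 9 pitfalls, and worked out how to migrate from gopath and vendor and how to use vgo with vendor (production environments cannot download from the public network every time, after all).
- Concurrency & Control: What makes concurrency hard in servers? Why is Go's concurrency model said to have won it the cloud computing language market? What are the C10K and C10M problems? How do you manage goroutine cancellation, timeouts, and linked cancellation? Why did Go 1.7 move context into the standard library? How is context used, and where does it fall short?
- Engineering: What are Go's engineering advantages? Why is Go called an engineering-oriented language? What level of coverage is appropriate? What is code testability? Why must a good library write its Examples first?
- Go2 Transition: Will Go2 break compatibility the way Python3 did with Python2? What different lessons can be drawn from the evolution of C and C++? How does Go2 approach language upgrades?
- SRS & Others: Using Go in a streaming media server. Is Go's GC reliable? Twitter says it is quite reliable, with charts to prove it. Why does Go's declaration syntax look the way it does, and what about C's? Was it a considered decision or an arbitrary one?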