http://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
The worst characteristic of this list is that it focuses on technical problems with little discussion of social problems an engineer may run into. Since distributed systems require more machines and more capital, their engineers tend to work with more teams and larger organizations. The social stuff is usually the hardest part of any software developer’s job, and, perhaps, especially so with distributed systems development. # Distributed systems are usually built and deployed by many teams working together, so collaboration between engineers matters all the more.
- Distributed systems are different because they fail often. Design for failure.
- Writing robust distributed systems costs more than writing robust single-machine systems. Distributed systems tend to need actual, not simulated, distribution to flush out their bugs.
- Robust, open source distributed systems are much less common than robust, single-machine systems. # Distributed systems have to be proven in real environments (clusters of hundreds or thousands of machines), so an ordinary engineer can rarely produce a stable implementation, and the engineers in the community usually come from large companies. A large company's priorities may differ from your company's, though, so even when the software has a problem and the community knows about it, the problem will not necessarily be fixed.
- Coordination is very hard.
- If you can fit your problem in memory, it’s probably trivial.
- “It’s slow” is the hardest problem you’ll ever debug. # Performance problems are hard to analyze mainly because it is hard to tell how long each stage of the whole pipeline takes. Dapper and Zipkin were built to solve this kind of problem.
- Implement backpressure throughout your system. # Without overload protection, a cascading failure can follow.
- Find ways to be partially available.
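- Metrics are the only way to get your job done. # The most direct and effective way to observe a distributed system is to look at its metrics first; the most direct and effective way to debug one is to analyze its logs, but that still needs metrics to back it up.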
- Use percentiles, not averages.
- Learn to estimate your capacity. # Capacity planning.
- Feature flags are how infrastructure is rolled out. # Use feature flags to keep iterating on the whole system.
- Choose id spaces wisely.
- Exploit data-locality.
- Writing cached data back to persistent storage is bad.
- Computers can do more than you think they can. # As of late 2012, a lightweight web server with 6 processors and 24 GB of memory, serving a relatively complex CRUD application, can easily sustain more than 1k QPS (< 100 ms).
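- Use the CAP theorem to critique systems. # You can only choose between C and A (but that does not mean the whole system must make a single choice between C and A; the choice can be scoped to an individual request).
- Extract services.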
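
To illustrate the "Use percentiles, not averages" item above, here is a minimal sketch in Python. The latency sample and the `percentile` helper are made up for this example (they are not from the original article), and the helper uses the simple nearest-rank definition rather than any particular monitoring tool's method.

```python
def percentile(samples, p):
    """Nearest-rank percentile for an integer p in 1..100 (hypothetical helper)."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 1 <= p <= 100:
        raise ValueError("p must be an integer between 1 and 100")
    ordered = sorted(samples)
    # Integer ceiling of p * n / 100, avoiding floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[rank - 1]


# Assumed data: 95 fast requests (20 ms) and 5 very slow ones (2000 ms).
latencies_ms = [20] * 95 + [2000] * 5

mean_ms = sum(latencies_ms) / len(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(f"mean={mean_ms} ms, p50={p50_ms} ms, p99={p99_ms} ms")
```

In this sample the mean (119 ms) describes no real request: most requests finish in 20 ms, and the unlucky 5% wait 2000 ms. The p50 and p99 values show both groups separately, which is the kind of information an average hides.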