k8s Networking Study Notes, Part 3 ---- How kube-proxy Works

This post analyzes how the IPVS proxy mode is implemented, from the perspective of the source code.

Creating the ProxyServer

The command-line entry point is built on Cobra; that part is straightforward and is not analyzed here. The key object in kube-proxy is ProxyServer: initialization boils down to constructing a ProxyServer instance and then calling ProxyServer.Run().
Let's start with the definition of ProxyServer:

type ProxyServer struct {
    Client                 clientset.Interface
    EventClient            v1core.EventsGetter
    IptInterface           utiliptables.Interface
    IpvsInterface          utilipvs.Interface
    IpsetInterface         utilipset.Interface
    execer                 exec.Interface
    Proxier                proxy.ProxyProvider
    Broadcaster            record.EventBroadcaster
    Recorder               record.EventRecorder
    ConntrackConfiguration kubeproxyconfig.KubeProxyConntrackConfiguration
    Conntracker            Conntracker // if nil, ignored
    ProxyMode              string
    NodeRef                *v1.ObjectReference
    CleanupAndExit         bool
    CleanupIPVS            bool
    MetricsBindAddress     string
    EnableProfiling        bool
    OOMScoreAdj            *int32
    ResourceContainer      string
    ConfigSyncPeriod       time.Duration
    ServiceEventHandler    config.ServiceHandler
    EndpointsEventHandler  config.EndpointsHandler
    HealthzServer          *healthcheck.HealthzServer
}

Low-level command interfaces

Let's walk through the members of this struct. Client and EventClient need no explanation. The four fields IptInterface, IpvsInterface, IpsetInterface, and execer map to the underlying commands: iptables, ipvsadm, and ipset.

    iptInterface = utiliptables.New(execer, dbus, protocol) // used to manipulate iptables rules
    ipvsInterface = utilipvs.New(execer)                    // used to manipulate the IPVS configuration
    kernelHandler = ipvs.NewLinuxKernelHandler()            // probes the IPVS-related kernel modules; if they are missing, kube-proxy degrades to the iptables proxy mode
    ipsetInterface = utilipset.New(execer)                  // runs ipset commands to maintain IP address sets

The roles of these members are described in the comments above. The key member is Proxier: kube-proxy supports three proxy modes, userspace, iptables, and ipvs, and each mode uses a different Proxier implementation.
The mode kube-proxy actually uses is determined as shown below.

proxyMode := getProxyMode(string(config.Mode), iptInterface, kernelHandler, ipsetInterface, iptables.LinuxKernelCompatTester{})

config.Mode is the mode configured when the kube-proxy process starts. Below is a sample kube-proxy configuration file; mode: "ipvs" states that IPVS mode is desired.

    apiVersion: kubeproxy.config.k8s.io/v1alpha1
    bindAddress: 0.0.0.0
    clientConnection:
      acceptContentTypes: ""
      burst: 10
      contentType: application/vnd.kubernetes.protobuf
      kubeconfig: /var/lib/kube-proxy/kubeconfig.conf
      qps: 5
    clusterCIDR: 10.244.0.0/16
    configSyncPeriod: 15m0s
    conntrack:
      max: null
      maxPerCore: 32768
      min: 131072
      tcpCloseWaitTimeout: 1h0m0s
      tcpEstablishedTimeout: 24h0m0s
    enableProfiling: false
    healthzBindAddress: 0.0.0.0:10256
    hostnameOverride: ""
    iptables:
      masqueradeAll: false
      masqueradeBit: 14
      minSyncPeriod: 0s
      syncPeriod: 30s
    ipvs:
      excludeCIDRs: null
      minSyncPeriod: 0s
      scheduler: ""
      syncPeriod: 30s
    kind: KubeProxyConfiguration
    metricsBindAddress: 127.0.0.1:10249
    mode: "ipvs"
    nodePortAddresses: null
    oomScoreAdj: -999
    portRange: ""
    resourceContainer: /kube-proxy
    udpIdleTimeout: 250ms

Although ipvs is configured here, kube-proxy still checks whether the prerequisites are satisfied before actually using it (see Part 2 of these notes); that code is not shown here. If IPVS mode is ultimately chosen, the Proxier member is an ipvs.Proxier instance.
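The fallback cascade can be sketched roughly as follows. This is only an illustration of the decision logic, not the actual getProxyMode implementation; the function name and the boolean flags are placeholders.

package main

import "fmt"

// chooseProxyMode is a simplified sketch of the fallback cascade: prefer the
// configured mode, but degrade when the node cannot satisfy its requirements.
// (Illustrative only; the real getProxyMode also consults a kernel compat tester.)
func chooseProxyMode(configured string, ipvsKernelOK, iptablesOK bool) string {
	if configured == "ipvs" && ipvsKernelOK {
		return "ipvs" // required modules such as ip_vs, ip_vs_rr, ip_vs_wrr, ip_vs_sh are available
	}
	if iptablesOK {
		return "iptables" // fall back when the IPVS prerequisites are missing
	}
	return "userspace" // last resort
}

func main() {
	fmt.Println(chooseProxyMode("ipvs", false, true)) // prints "iptables"
}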

Starting the ProxyServer

    1. Start the EventBroadcaster and sync events to the API server

    // create the eventBroadcaster
    hostname := utilnode.GetHostname(config.HostnameOverride)
    eventBroadcaster := record.NewBroadcaster()
    recorder := eventBroadcaster.NewRecorder(scheme, v1.EventSource{Component: "kube-proxy", Host: hostname})

    // start recording events to the API server
    s.Broadcaster.StartRecordingToSink(&v1core.EventSinkImpl{Interface: s.EventClient.Events("")})
    2. Start the healthz endpoint

The code here is straightforward, so it is not listed; the health information can be checked with the following command.

root@bogon2:~/k8s-yml-tests# curl http://127.0.0.1:10256/healthz
{"lastUpdated": "2019-02-22 07:10:11.258317649 +0000 UTC m=+109458.567293187","currentTime": "2019-02-22 07:10:37.209843398 +0000 UTC m=+109484.518818898"}
    3. Start the metrics endpoint

Once it is up, the metrics data can be inspected with curl. These are Prometheus-format statistics that a Prometheus server can scrape and aggregate for display.
root@bogon2:~/k8s-yml-tests# curl http://127.0.0.1:10249/metrics
# HELP apiserver_audit_event_total Counter of audit events generated and sent to the audit backend.
# TYPE apiserver_audit_event_total counter
apiserver_audit_event_total 0
# HELP apiserver_audit_requests_rejected_total Counter of apiserver requests rejected due to an error in audit logging backend.
# TYPE apiserver_audit_requests_rejected_total counter
apiserver_audit_requests_rejected_total 0
......
rest_client_request_latency_seconds_sum{url="https://172.25.39.6:6443/%7Bprefix%7D",verb="POST"} 0.007607341
rest_client_request_latency_seconds_count{url="https://172.25.39.6:6443/%7Bprefix%7D",verb="POST"} 1
# HELP rest_client_requests_total Number of HTTP requests, partitioned by status code, method, and host.
# TYPE rest_client_requests_total counter
rest_client_requests_total{code="200",host="172.25.39.6:6443",method="GET"} 490
rest_client_requests_total{code="201",host="172.25.39.6:6443",method="POST"} 1
    4. Tune the conntrack parameters

conntrack tracks and records connection state. For every packet that traverses the network stack, Linux creates a connection entry; all subsequent packets belonging to that connection are attributed to the same entry, which carries the connection state. Connection tracking is the basis of stateful firewall inspection and a prerequisite for SNAT and DNAT in address translation.

So how does Netfilter build these connection entries? Every packet has a "source" and a "destination": the host that initiates the connection is the source, and the host that answers it is the destination. Building an entry means tracking the creation, transmission, and termination of each such connection; the table formed by all entries is the connection-tracking table.

kube-proxy configures conntrack through the KubeProxyConntrackConfiguration struct, which is defined as follows:


// KubeProxyConntrackConfiguration contains conntrack settings for
// the Kubernetes proxy server.
type KubeProxyConntrackConfiguration struct {
    // max is the maximum number of NAT connections to track (0 to
    // leave as-is).  This takes precedence over maxPerCore and min.
    Max *int32
    // maxPerCore is the maximum number of NAT connections to track
    // per CPU core (0 to leave the limit as-is and ignore min).
    MaxPerCore *int32
    // min is the minimum value of connect-tracking records to allocate,
    // regardless of maxPerCore (set maxPerCore=0 to leave the limit as-is).
    Min *int32
    // tcpEstablishedTimeout is how long an idle TCP connection will be kept open
    // (e.g. '2s').  Must be greater than 0 to set.
    TCPEstablishedTimeout *metav1.Duration
    // tcpCloseWaitTimeout is how long an idle conntrack entry
    // in CLOSE_WAIT state will remain in the conntrack
    // table. (e.g. '60s'). Must be greater than 0 to set.
    TCPCloseWaitTimeout *metav1.Duration
}
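As a worked example of how these fields combine, here is a hedged sketch of the sizing rule (an assumption about the logic, not the literal kube-proxy function): an explicit max wins outright; otherwise maxPerCore is scaled by the number of CPU cores and never allowed to drop below min.

package main

import (
	"fmt"
	"runtime"
)

// conntrackMax sketches how the effective nf_conntrack_max could be derived
// from the fields above (assumption: a simplified version of kube-proxy's rule).
func conntrackMax(max, maxPerCore, min int) int {
	if max > 0 {
		return max // an explicit max wins outright
	}
	if maxPerCore > 0 {
		scaled := maxPerCore * runtime.NumCPU()
		if scaled < min {
			return min // never go below the configured floor
		}
		return scaled
	}
	return 0 // 0 means "leave the kernel value as-is"
}

func main() {
	// With the sample config above (maxPerCore: 32768, min: 131072) on a 4-core
	// node: 32768*4 = 131072, so the scaled value and the floor coincide.
	fmt.Println(conntrackMax(0, 32768, 131072))
}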

The conntrack settings are applied by the Conntracker member of ProxyServer; on Linux the concrete type is realConntracker (type realConntracker struct{}). Below is the code that sets the Max parameter:

func (rct realConntracker) SetMax(max int) error {
    if err := rct.setIntSysCtl("nf_conntrack_max", max); err != nil {
        return err
    }
    glog.Infof("Setting nf_conntrack_max to %d", max)

    // Linux does not support writing to /sys/module/nf_conntrack/parameters/hashsize
    // when the writer process is not in the initial network namespace
    // (https://github.com/torvalds/linux/blob/v4.10/net/netfilter/nf_conntrack_core.c#L1795-L1796).
    // Usually that's fine. But in some configurations such as with github.com/kinvolk/kubeadm-nspawn,
    // kube-proxy is in another netns.
    // Therefore, check if writing in hashsize is necessary and skip the writing if not.
    hashsize, err := readIntStringFile("/sys/module/nf_conntrack/parameters/hashsize")
    if err != nil {
        return err
    }
    if hashsize >= (max / 4) {
        return nil
    }

    // sysfs is expected to be mounted as 'rw'. However, it may be
    // unexpectedly mounted as 'ro' by docker because of a known docker
    // issue (https://github.com/docker/docker/issues/24000). Setting
    // conntrack will fail when sysfs is readonly. When that happens, we
    // don't set conntrack hashsize and return a special error
    // readOnlySysFSError here. The caller should deal with
    // readOnlySysFSError differently.
    writable, err := isSysFSWritable()
    if err != nil {
        return err
    }
    if !writable {
        return readOnlySysFSError
    }
    // TODO: generify this and sysctl to a new sysfs.WriteInt()
    glog.Infof("Setting conntrack hashsize to %d", max/4)
    return writeIntStringFile("/sys/module/nf_conntrack/parameters/hashsize", max/4)
}

As the code shows, the value in /proc/sys/net/netfilter/nf_conntrack_max is changed via sysctl, and /sys/module/nf_conntrack/parameters/hashsize is then written directly with max/4. The --conntrack-tcp-timeout-established and --conntrack-tcp-timeout-close-wait parameters are handled the same way; they correspond to /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_established and /proc/sys/net/netfilter/nf_conntrack_tcp_timeout_close_wait respectively.

These parameters affect all subsequent network connections, so set them carefully and keep them consistent with the values planned for the system. Setting them too low can lead to "nf_conntrack: table full, dropping packet" errors.
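For reference, applying such a setting ultimately boils down to writing an integer into a file under /proc/sys, which is what sysctl does. The sketch below assumes plain file I/O rather than the sysctl helper packages kube-proxy actually uses; the function name is a placeholder.

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
)

// writeConntrackSysctl sketches the effect of setIntSysCtl: write an integer
// into the matching file under /proc/sys/net/netfilter/. (Assumption: plain
// file I/O instead of the sysctl helper package kube-proxy actually uses.)
func writeConntrackSysctl(name string, value int) error {
	path := filepath.Join("/proc/sys/net/netfilter", name)
	return os.WriteFile(path, []byte(strconv.Itoa(value)), 0o644)
}

func main() {
	// Equivalent in spirit to `sysctl -w net.netfilter.nf_conntrack_max=131072`;
	// requires root and the nf_conntrack module to be loaded.
	if err := writeConntrackSysctl("nf_conntrack_max", 131072); err != nil {
		fmt.Fprintln(os.Stderr, "failed:", err)
	}
}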

    5. Watch for Service and Endpoints changes

Here the serviceConfig and endpointsConfig instances are created and started, as shown below:

    serviceConfig := config.NewServiceConfig(informerFactory.Core().InternalVersion().Services(), s.ConfigSyncPeriod)
    serviceConfig.RegisterEventHandler(s.ServiceEventHandler)
    go serviceConfig.Run(wait.NeverStop)

    endpointsConfig := config.NewEndpointsConfig(informerFactory.Core().InternalVersion().Endpoints(), s.ConfigSyncPeriod)
    endpointsConfig.RegisterEventHandler(s.EndpointsEventHandler)
    go endpointsConfig.Run(wait.NeverStop)

    // This has to start after the calls to NewServiceConfig and NewEndpointsConfig because those
    // functions must configure their shared informer event handlers first.
    go informerFactory.Start(wait.NeverStop)

The event handler registered with serviceConfig and endpointsConfig is in fact the proxier, i.e. the ipvs.Proxier instance initialized earlier. The Run logic of serviceConfig and endpointsConfig waits for the corresponding informer to finish syncing and then calls back into the Proxier's OnServiceSynced and OnEndpointsSynced methods.

The per-event callbacks are wired up when the ServiceConfig and EndpointsConfig objects are constructed, as shown below:

func NewServiceConfig(serviceInformer coreinformers.ServiceInformer, resyncPeriod time.Duration) *ServiceConfig {
    result := &ServiceConfig{
        lister:       serviceInformer.Lister(),
        listerSynced: serviceInformer.Informer().HasSynced,
    }
    // register the event callbacks here
    serviceInformer.Informer().AddEventHandlerWithResyncPeriod(
        cache.ResourceEventHandlerFuncs{
            AddFunc:    result.handleAddService,
            UpdateFunc: result.handleUpdateService,
            DeleteFunc: result.handleDeleteService,
        },
        resyncPeriod,
    )

    return result
}

The core business logic is essentially listening for and handling these events, which is analyzed in detail later.
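To make the dispatch concrete, here is a minimal, self-contained sketch of the fan-out performed by handleAddService: the informer hands it the raw object, and it forwards the call to every registered ServiceHandler (in kube-proxy, the ipvs.Proxier). The types below are simplified stand-ins, not the real api.Service or config.ServiceHandler.

package main

import "fmt"

// Minimal stand-ins for the real types (api.Service, config.ServiceHandler);
// the point is only to show the fan-out pattern used by handleAddService.
type Service struct{ Namespace, Name string }

type ServiceHandler interface {
	OnServiceAdd(svc *Service)
}

type ServiceConfig struct{ eventHandlers []ServiceHandler }

func (c *ServiceConfig) RegisterEventHandler(h ServiceHandler) {
	c.eventHandlers = append(c.eventHandlers, h)
}

// handleAddService is what the informer's AddFunc points at: it forwards the
// object to every registered handler (in kube-proxy, the ipvs.Proxier).
func (c *ServiceConfig) handleAddService(obj interface{}) {
	svc, ok := obj.(*Service)
	if !ok {
		return // the real code logs the unexpected type
	}
	for _, h := range c.eventHandlers {
		h.OnServiceAdd(svc)
	}
}

// fakeProxier stands in for ipvs.Proxier and just records the call.
type fakeProxier struct{}

func (fakeProxier) OnServiceAdd(svc *Service) {
	fmt.Println("OnServiceAdd:", svc.Namespace+"/"+svc.Name)
}

func main() {
	c := &ServiceConfig{}
	c.RegisterEventHandler(fakeProxier{})
	c.handleAddService(&Service{Namespace: "default", Name: "nginx"})
}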

    6. Start the Proxier's processing loop

The loop is driven in two ways, by events and by a timer, as the code below shows:

// SyncLoop runs periodic work.  This is expected to run as a goroutine or as the main loop of the app.  It does not return.
func (proxier *Proxier) SyncLoop() {
    // Update healthz timestamp at beginning in case Sync() never succeeds.
    if proxier.healthzServer != nil {
        proxier.healthzServer.UpdateTimestamp()
    }
    proxier.syncRunner.Loop(wait.NeverStop)
}


// Loop handles the periodic timer and run requests.  This is expected to be
// called as a goroutine.
func (bfr *BoundedFrequencyRunner) Loop(stop <-chan struct{}) {
    glog.V(3).Infof("%s Loop running", bfr.name)
    bfr.timer.Reset(bfr.maxInterval)
    for {
        select {
        case <-stop:
            bfr.stop()
            glog.V(3).Infof("%s Loop stopping", bfr.name)
            return
        case <-bfr.timer.C():
            bfr.tryRun()
        case <-bfr.run:
            bfr.tryRun()
        }
    }
}

Whenever there is a Service or Endpoints event, or the timer fires, BoundedFrequencyRunner.tryRun is invoked. The BoundedFrequencyRunner struct looks like this:

// BoundedFrequencyRunner manages runs of a user-provided function.
// See NewBoundedFrequencyRunner for examples.
type BoundedFrequencyRunner struct {
    name        string        // the name of this instance
    minInterval time.Duration // the min time between runs, modulo bursts
    maxInterval time.Duration // the max time between runs

    run chan struct{} // try an async run

    mu      sync.Mutex  // guards runs of fn and all mutations
    fn      func()      // function to run
    lastRun time.Time   // time of last run
    timer   timer       // timer for deferred runs
    limiter rateLimiter // rate limiter for on-demand runs
}

tryRun invokes the fn member, which in kube-proxy is the proxier.syncProxyRules method. That method is analyzed separately below.
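As an illustration of how the pieces fit together, here is a stripped-down, self-contained stand-in for the runner: it owns the function to run (syncProxyRules in kube-proxy), accepts on-demand run requests, and also fires on a timer. This is a sketch of the pattern only; the real BoundedFrequencyRunner additionally rate-limits on-demand runs through minInterval and a rate limiter.

package main

import (
	"fmt"
	"time"
)

// miniRunner is a stripped-down stand-in for BoundedFrequencyRunner: it owns a
// function (syncProxyRules in kube-proxy), a channel that Run() uses to request
// an async run, and a timer that forces a run at least every maxInterval.
type miniRunner struct {
	fn          func()
	run         chan struct{}
	maxInterval time.Duration
}

func newMiniRunner(fn func(), maxInterval time.Duration) *miniRunner {
	return &miniRunner{fn: fn, run: make(chan struct{}, 1), maxInterval: maxInterval}
}

// Run requests an asynchronous run, just like proxier.syncRunner.Run() in the
// event callbacks; a full channel means a run is already pending.
func (r *miniRunner) Run() {
	select {
	case r.run <- struct{}{}:
	default:
	}
}

// Loop mirrors the shape of BoundedFrequencyRunner.Loop: run fn on demand or
// on timeout. (The real runner also rate-limits on-demand runs.)
func (r *miniRunner) Loop(stop <-chan struct{}) {
	t := time.NewTimer(r.maxInterval)
	defer t.Stop()
	for {
		select {
		case <-stop:
			return
		case <-t.C:
			r.fn()
			t.Reset(r.maxInterval)
		case <-r.run:
			r.fn()
			t.Reset(r.maxInterval)
		}
	}
}

func main() {
	stop := make(chan struct{})
	r := newMiniRunner(func() { fmt.Println("syncProxyRules()") }, time.Second)
	go r.Loop(stop)
	r.Run()                             // event-driven run
	time.Sleep(1500 * time.Millisecond) // long enough to also see a timer-driven run
	close(stop)
}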

Core logic ---- updating the proxy rules

Of the steps in the previous section, the last two contain the core business logic: step 5 registers the event callbacks, and those callbacks drive the processing in step 6 (the syncProxyRules method).

Event callback functions

Before looking at the callbacks, let's look at two members of Proxier:

type Proxier struct {
    // endpointsChanges and serviceChanges contains all changes to endpoints and
    // services that happened since last syncProxyRules call. For a single object,
    // changes are accumulated, i.e. previous is state from before all of them,
    // current is state after applying all of those.
    endpointsChanges endpointsChangeMap // accumulated endpoints changes
    serviceChanges   serviceChangeMap   // accumulated service changes

    serviceMap   proxyServiceMap   // current services
    endpointsMap proxyEndpointsMap // current endpoints
    portsMap     map[utilproxy.LocalPort]utilproxy.Closeable
    ......
}

endpointsChanges and serviceChanges record the pending changes to endpoints and services respectively, while serviceMap, endpointsMap, and portsMap hold the actual service, endpoint, and port state; every change operation is eventually applied to these members.

The OnServiceAdd event callback looks like this:

// OnServiceAdd is called whenever creation of new service object is observed.
func (proxier *Proxier) OnServiceAdd(service *api.Service) {
    namespacedName := types.NamespacedName{Namespace: service.Namespace, Name: service.Name}
    if proxier.serviceChanges.update(&namespacedName, nil, service) && proxier.isInitialized() {
        proxier.syncRunner.Run()
    }
}

An event callback does two main things:

  • update the change records
  • kick the runner to perform the actual rule processing

This is an asynchronous driving model: many changes arriving in a short window can be merged and handled in one batch, as sketched below.
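Below is a hedged sketch of that accumulation idea: for each object only the state before the first pending event and the state after the latest one are kept, so a burst of events collapses into a single effective diff. The types are illustrative placeholders, not the real serviceChangeMap.

package main

import "fmt"

// change keeps the state before the first pending event and after the latest
// one; intermediate states do not matter for rule syncing.
type change struct {
	previous string // state before all pending events ("" = did not exist yet)
	current  string // state after the latest event ("" = deleted)
}

// changeMap is an illustrative stand-in for serviceChangeMap, keyed by
// "namespace/name".
type changeMap map[string]change

// update records one event and reports whether anything is pending, which is
// what triggers syncRunner.Run() in the real callbacks.
func (m changeMap) update(name, previous, current string) bool {
	c, seen := m[name]
	if !seen {
		c.previous = previous // remember the very first "before" state
	}
	c.current = current
	if c.previous == c.current {
		delete(m, name) // the events cancelled each other out; nothing to sync
	} else {
		m[name] = c
	}
	return len(m) > 0
}

func main() {
	m := changeMap{}
	m.update("default/nginx", "", "v1")   // add
	m.update("default/nginx", "v1", "v2") // update
	fmt.Printf("%+v\n", m)                // one collapsed diff: "" -> "v2"
}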

Handling the proxy rule update

Here we only analyze the rule handling for the most common case, Services in ClusterIP mode.

The core fields of the Proxier struct were listed above. The Service and Endpoints watchers described earlier detect changes to Services and Endpoints (Pods) in real time, record the changes in the proxier's endpointsChanges and serviceChanges members, and then trigger a proxy rule update.
The rule update itself is done in Proxier.syncProxyRules; let's walk through it step by step:

    1. Determine which changes actually matter

The Kubernetes event mechanism guarantees that Service and Endpoints changes are delivered, but a change may only touch peripheral metadata that does not affect the proxy rules at all, and the same event may also be delivered more than once. These cases have to be filtered out first: the accumulated changes are compared with the state already held in memory to extract the changes that are genuinely meaningful, so that the subsequent work only touches rules that really need to change.

This check is done by updateServiceMap and updateEndpointsMap.

func (proxier *Proxier) syncProxyRules() {
    proxier.mu.Lock()
    defer proxier.mu.Unlock()
        ......
    serviceUpdateResult := updateServiceMap(
        proxier.serviceMap, &proxier.serviceChanges)
    endpointUpdateResult := updateEndpointsMap(
        proxier.endpointsMap, &proxier.endpointsChanges, proxier.hostname)
        ......
}
    2. Prepare the iptables rules

First, a quick introduction to iptables chains such as INPUT and FORWARD.
Depending on when firewall rules intervene in packet processing, iptables provides five built-in chains, best understood by the point in time at which they apply:

INPUT: applied when a packet addressed to the firewall host itself is received (inbound).
OUTPUT: applied when the firewall host itself sends a packet out (outbound).
FORWARD: applied when a packet needs to pass through the firewall to another address (forwarding).
PREROUTING: applied before the routing decision is made for a packet, e.g. DNAT.
POSTROUTING: applied after the routing decision is made for a packet, e.g. SNAT.

-->PREROUTING-->[ROUTE]-->FORWARD-->POSTROUTING-->
     mangle        |       mangle        ^ mangle
      nat          |       filter        |  nat
                   |                     |
                   |                     |
                   v                     |
                 INPUT                 OUTPUT
                   | mangle              ^ mangle
                   | filter              |  nat
                   v ------>local------->| filter

The iptables rules can be backed up with the iptables-save command. Here is a short, annotated example in iptables-save format:

root@ubuntu:/home/yuxianbing# iptables-save -t nat
# Generated by iptables-save v1.6.0 on Tue Mar  5 09:26:02 2019 
-- the line above is a comment generated by iptables-save
*nat
:PREROUTING ACCEPT [14:12354]
-- ":PREROUTING ACCEPT" means the default policy for the PREROUTING chain of the nat table is ACCEPT (packets matching no rule continue on),
-- "[14:12354]" is [packets:bytes]: 14 packets (12354 bytes) have passed through the nat table's PREROUTING chain so far
:INPUT ACCEPT [0:0]
:OUTPUT ACCEPT [4:222]
:POSTROUTING ACCEPT [3:149]
:CNI-DN-603508538baa710bd2110 - [0:0]
:CNI-HOSTPORT-DNAT - [0:0]
:CNI-HOSTPORT-SNAT - [0:0]
:CNI-SN-603508538baa710bd2110 - [0:0]
:KUBE-FIRE-WALL - [0:0]
:KUBE-MARK-DROP - [0:0]
:KUBE-MARK-MASQ - [0:0]
:KUBE-POSTROUTING - [0:0]
:KUBE-SERVICES - [0:0]
-- same interpretation as above (these are user-defined chains)

---------- all rules are listed below, one per line ----------
-A PREROUTING -m addrtype --dst-type LOCAL -j CNI-HOSTPORT-DNAT
---- each line is the iptables command that would configure this rule (see the iptables documentation for the option details).
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A OUTPUT -m addrtype --dst-type LOCAL -j CNI-HOSTPORT-DNAT
-A OUTPUT -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A POSTROUTING -s 127.0.0.1/32 ! -d 127.0.0.1/32 -j CNI-HOSTPORT-SNAT
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -s 10.0.0.0/8 ! -d 10.0.0.0/8 -j MASQUERADE
-A CNI-DN-603508538baa710bd2110 -p tcp -m tcp --dport 8080 -j DNAT --to-destination 10.221.2.42:80
-A CNI-HOSTPORT-DNAT -m comment --comment "dnat name: \"cni0\" id: \"42ccbebb916d082a6d872aaa48efea33c4cc33267d14779046f687f4dcddda8d\"" -j CNI-DN-603508538baa710bd2110
-A CNI-HOSTPORT-SNAT -m comment --comment "snat name: \"cni0\" id: \"42ccbebb916d082a6d872aaa48efea33c4cc33267d14779046f687f4dcddda8d\"" -j CNI-SN-603508538baa710bd2110
-A CNI-SN-603508538baa710bd2110 -s 127.0.0.1/32 -d 10.221.2.42/32 -p tcp -m tcp --dport 80 -j MASQUERADE
-A KUBE-MARK-DROP -j MARK --set-xmark 0x8000/0x8000
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -m mark --mark 0x4000/0x4000 -j MASQUERADE
-A KUBE-POSTROUTING -m set --match-set KUBE-LOOP-BACK dst,dst,src -j MASQUERADE
-A KUBE-SERVICES -m set --match-set KUBE-CLUSTER-IP dst,dst -j KUBE-MARK-MASQ
-A KUBE-SERVICES -p tcp -m tcp -m set --match-set KUBE-NODE-PORT-TCP dst -j KUBE-MARK-MASQ
COMMIT
-- apply the configuration above
# Completed on Tue Mar  5 09:26:02 2019

So an iptables NAT table dump has two parts, chains and rules; each chain declaration also carries the packet and byte counters of the traffic that has traversed it.
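As a small illustration of the parsing side, extracting the chain declarations (the lines that start with ':') from an iptables-save dump can be sketched as follows; GetChainLines in the code below does essentially this, though its real signature and return types differ.

package main

import (
	"fmt"
	"strings"
)

// parseChains pulls the chain declarations (lines beginning with ':') out of an
// iptables-save dump of a single table; illustrative only.
func parseChains(dump string) map[string]string {
	chains := map[string]string{}
	for _, line := range strings.Split(dump, "\n") {
		if strings.HasPrefix(line, ":") {
			name := strings.Fields(line)[0][1:] // ":KUBE-SERVICES - [0:0]" -> "KUBE-SERVICES"
			chains[name] = line
		}
	}
	return chains
}

func main() {
	dump := "*nat\n:PREROUTING ACCEPT [14:12354]\n:KUBE-SERVICES - [0:0]\nCOMMIT\n"
	fmt.Println(parseChains(dump))
}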

The handling of the iptables and IPVS rules proceeds as follows:

    // 1. run iptables-save for the nat table
    err := proxier.iptables.SaveInto(utiliptables.TableNAT, proxier.iptablesData)
    if err != nil { // if we failed to get any rules
        glog.Errorf("Failed to execute iptables-save, syncing all rules: %v", err)
    } else { // otherwise parse the output
        // 2. read all existing chain declarations
        existingNATChains = utiliptables.GetChainLines(utiliptables.TableNAT, proxier.iptablesData.Bytes())
    }
    // 3. reset the iptables rule buffers
    proxier.natChains.Reset()
    proxier.natRules.Reset()
    // Write table headers.
    writeLine(proxier.natChains, "*nat") // write the *nat table header
    // 4. write the POSTROUTING chain declaration
    if chain, ok := existingNATChains[kubePostroutingChain]; ok {
        writeLine(proxier.natChains, chain)
    } else {
        writeLine(proxier.natChains, utiliptables.MakeChainLine(kubePostroutingChain))
    }

    // write the POSTROUTING rule: packets carrying the masquerade mark are
    // MASQUERADEd (SNATed); traffic leaving the host needs source NAT
    writeLine(proxier.natRules, []string{
        "-A", string(kubePostroutingChain),
        "-m", "comment", "--comment", `"kubernetes service traffic requiring SNAT"`,
        "-m", "mark", "--mark", proxier.masqueradeMark,
        "-j", "MASQUERADE",
    }...)

    // 5. write the KUBE-MARK-MASQ chain declaration
    if chain, ok := existingNATChains[KubeMarkMasqChain]; ok {
        writeLine(proxier.natChains, chain)
    } else {
        writeLine(proxier.natChains, utiliptables.MakeChainLine(KubeMarkMasqChain))
    }
    // KUBE-MARK-MASQ rule: apply MARK, tagging the packet for masquerading
    writeLine(proxier.natRules, []string{
        "-A", string(KubeMarkMasqChain),
        "-j", "MARK", "--set-xmark", proxier.masqueradeMark,
    }...)

    // 6. ensure the dummy interface kube-ipvs0 exists;
    //    every Service ClusterIP will be bound to this interface
    _, err = proxier.netlinkHandle.EnsureDummyDevice(DefaultDummyDevice)
    if err != nil {
        glog.Errorf("Failed to create dummy interface: %s, error: %v", DefaultDummyDevice, err)
        return
    }
    // 7. initialize the ipsets: make sure they all exist, then clear their entries
    // make sure ip sets exists in the system.
    ipSets := []*IPSet{proxier.loopbackSet, proxier.clusterIPSet, proxier.externalIPSet, proxier.nodePortSetUDP, proxier.nodePortSetTCP,
        proxier.lbIngressSet, proxier.lbMasqSet, proxier.lbWhiteListCIDRSet, proxier.lbWhiteListIPSet}
    if err := ensureIPSets(ipSets...); err != nil {
        return
    }
    for i := range ipSets {
        ipSets[i].resetEntries()
    }
    ......
        
// linkKubeServiceChain will Create chain KUBE-SERVICES and link the chain in PREROUTING and OUTPUT

// Chain PREROUTING (policy ACCEPT)
// target            prot opt source               destination
// KUBE-SERVICES     all  --  0.0.0.0/0            0.0.0.0/0

// Chain OUTPUT (policy ACCEPT)
// target            prot opt source               destination
// KUBE-SERVICES     all  --  0.0.0.0/0            0.0.0.0/0

// Chain KUBE-SERVICES (2 references)
    // 8. hook KUBE-SERVICES into PREROUTING and OUTPUT: any protocol, source, and destination
    //    jumps to KUBE-SERVICES for matching. The KUBE-SERVICES chain itself is created here;
    //    its rules are added later, on demand, for ClusterIP and NodePort services.
    //    Roughly equivalent to: iptables -t nat -N KUBE-SERVICES
    if err := proxier.linkKubeServiceChain(existingNATChains, proxier.natChains); err != nil {
        glog.Errorf("Failed to link KUBE-SERVICES chain: %v", err)
        return
    }

    // 9. create the KUBE-FIRE-WALL chain (no rules reference it on our cluster yet,
    //    so its purpose is not clear for now)
    if err := proxier.createKubeFireWallChain(existingNATChains, proxier.natChains); err != nil {
        glog.Errorf("Failed to create KUBE-FIRE-WALL chain: %v", err)
        return
    }
        
    // 10. the crucial part: build the IPVS rules for every service
    for svcName, svcInfo := range proxier.serviceMap {
        protocol := strings.ToLower(string(svcInfo.protocol))
        // Precompute svcNameString; with many services the many calls
        // to ServicePortName.String() show up in CPU profiles.
        svcNameString := svcName.String()

        // Handle traffic that loops back to the originator with SNAT.
        // 10.1 hairpin handling: matches the (ip, port, ip) pattern; not analyzed here, it is rarely hit.
        ......
        // 10.2 ClusterIP handling
        // prepare the ipset entry that records the ClusterIP and its port
        entry := &utilipset.Entry{
            IP:       svcInfo.clusterIP.String(),
            Port:     svcInfo.port,
            Protocol: protocol,
            SetType:  utilipset.HashIPPort,
        }
        // if kube-proxy was started with masqueradeAll or a clusterCIDR, install masquerade (SNAT) rules
        if proxier.masqueradeAll || len(proxier.clusterCIDR) > 0 {
                        ......
            proxier.clusterIPSet.activeEntries.Insert(entry.String())
        }
        // prepare the IPVS virtual server
        serv := &utilipvs.VirtualServer{
            Address:   svcInfo.clusterIP,
            Port:      uint16(svcInfo.port),
            Protocol:  string(svcInfo.protocol),
            Scheduler: proxier.ipvsScheduler,
        }
        // Set session affinity flag and timeout for IPVS service
        if svcInfo.sessionAffinityType == api.ServiceAffinityClientIP {
            serv.Flags |= utilipvs.FlagPersistent
            serv.Timeout = uint32(svcInfo.stickyMaxAgeSeconds)
        }
        // We need to bind ClusterIP to dummy interface, so set `bindAddr` parameter to `true` in syncService()
        // 10.3 create or update the IPVS virtual server and bind the ClusterIP to the
        //      kube-ipvs0 device (the last parameter, bindAddr, is true)
        if err := proxier.syncService(svcNameString, serv, true); err == nil {
            activeIPVSServices[serv.String()] = true
            // ExternalTrafficPolicy only works for NodePort and external LB traffic, does not affect ClusterIP
            // So we still need clusterIP rules in onlyNodeLocalEndpoints mode.
            // note that the second argument, onlyNodeLocalEndpoints, is false here, so all
            // endpoints are registered with the IPVS virtual server; onlyNodeLocalEndpoints
            // only takes effect for NodePort and LB traffic, where it restricts endpoints to local ones
            if err := proxier.syncEndpoint(svcName, false, serv); err != nil {
                glog.Errorf("Failed to sync endpoint for service: %v, err: %v", serv, err)
            }
        } else {
            glog.Errorf("Failed to sync service: %v, err: %v", serv, err)
        }

        // 10.4 Capture externalIPs: skipped here, rarely used
        // 10.5 Capture load-balancer ingress: skipped here, rarely used and needs cloud-provider support
        ......
        // 10.6 handle NodePort-type Services
        if svcInfo.nodePort != 0 {
            lp := utilproxy.LocalPort{
                Description: "nodePort for " + svcNameString,
                IP:          "",
                Port:        svcInfo.nodePort,
                Protocol:    protocol,
            }
            if proxier.portsMap[lp] != nil {
                glog.V(4).Infof("Port %s was open before and is still needed", lp.String())
                replacementPortsMap[lp] = proxier.portsMap[lp]
            } else {
                // listen on the host port (TCP/UDP) to hold it, so that the IPVS rules can be installed safely
                socket, err := proxier.portMapper.OpenLocalPort(&lp)
                if err != nil {
                    glog.Errorf("can't open %s, skipping this nodePort: %v", lp.String(), err)
                    continue
                }
                if lp.Protocol == "udp" {
                    isIPv6 := utilproxy.IsIPv6(svcInfo.clusterIP)
                    utilproxy.ClearUDPConntrackForPort(proxier.exec, lp.Port, isIPv6)
                }
                replacementPortsMap[lp] = socket
            } // We're holding the port, so it's OK to install ipvs rules.
            // if onlyNodeLocalEndpoints is not set, SNAT is needed: insert the port into the ipset
            // so the KUBE-SERVICES rules match it and the source address is masqueraded (SNAT)
            // Nodeports need SNAT, unless they're local.
            // ipset call
            if !svcInfo.onlyNodeLocalEndpoints {
                entry = &utilipset.Entry{
                    // No need to provide ip info
                    Port:     svcInfo.nodePort,
                    Protocol: protocol,
                    SetType:  utilipset.BitmapPort,
                }
                var nodePortSet *IPSet
                switch protocol {
                case "tcp":
                    nodePortSet = proxier.nodePortSetTCP
                case "udp":
                    nodePortSet = proxier.nodePortSetUDP
                default:
                    // It should never hit
                    glog.Errorf("Unsupported protocol type: %s", protocol)
                }
                if nodePortSet != nil {
                    if valid := nodePortSet.validateEntry(entry); !valid {
                        glog.Errorf("%s", fmt.Sprintf(EntryInvalidErr, entry, nodePortSet.Name))
                        continue
                    }
                    nodePortSet.activeEntries.Insert(entry.String())
                }
            }
            // create an IPVS virtual server (and its endpoint destinations) for every physical IP on the node
            // Build ipvs kernel routes for each node ip address
            nodeIPs, err := proxier.ipGetter.NodeIPs()
            if err != nil {
                glog.Errorf("Failed to get node IP, err: %v", err)
            } else {
                for _, nodeIP := range nodeIPs {
                    // ipvs call
                    serv := &utilipvs.VirtualServer{
                        Address:   nodeIP,
                        Port:      uint16(svcInfo.nodePort),
                        Protocol:  string(svcInfo.protocol),
                        Scheduler: proxier.ipvsScheduler,
                    }
                    if svcInfo.sessionAffinityType == api.ServiceAffinityClientIP {
                        serv.Flags |= utilipvs.FlagPersistent
                        serv.Timeout = uint32(svcInfo.stickyMaxAgeSeconds)
                    }
                    // There is no need to bind Node IP to dummy interface, so set parameter `bindAddr` to `false`.
                    if err := proxier.syncService(svcNameString, serv, false); err == nil {
                        activeIPVSServices[serv.String()] = true
                        if err := proxier.syncEndpoint(svcName, svcInfo.onlyNodeLocalEndpoints, serv); err != nil {
                            glog.Errorf("Failed to sync endpoint for service: %v, err: %v", serv, err)
                        }
                    } else {
                        glog.Errorf("Failed to sync service: %v, err: %v", serv, err)
                    }
                }
            }
        }
    }
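The syncService and syncEndpoint helpers called throughout the loop are not shown in this post. As a hedged sketch of what syncService is responsible for (create or update the IPVS virtual server, and bind the ClusterIP to kube-ipvs0 when bindAddr is true), here is a simplified, self-contained version using placeholder types instead of the real utilipvs interfaces.

package main

import "fmt"

// VirtualServer is an illustrative stand-in for utilipvs.VirtualServer.
type VirtualServer struct {
	Address, Protocol, Scheduler string
	Port                         uint16
}

// fakeIPVS keeps virtual servers in memory, standing in for the real IPVS handle.
type fakeIPVS map[string]VirtualServer

func vsKey(vs VirtualServer) string {
	return fmt.Sprintf("%s:%d/%s", vs.Address, vs.Port, vs.Protocol)
}

// syncServiceSketch shows the responsibilities of the helper used above:
// create the virtual server if it is missing, update it if it changed, and bind
// the ClusterIP to the kube-ipvs0 dummy device when bindAddr is true (the bind
// is only printed here; the real code does it via netlink).
func syncServiceSketch(ipvs fakeIPVS, vs VirtualServer, bindAddr bool) {
	k := vsKey(vs)
	if existing, ok := ipvs[k]; !ok {
		ipvs[k] = vs // AddVirtualServer
	} else if existing != vs {
		ipvs[k] = vs // UpdateVirtualServer
	}
	if bindAddr {
		fmt.Printf("bind %s to kube-ipvs0\n", vs.Address)
	}
}

func main() {
	ipvs := fakeIPVS{}
	svc := VirtualServer{Address: "10.96.0.10", Port: 53, Protocol: "UDP", Scheduler: "rr"}
	syncServiceSketch(ipvs, svc, true) // ClusterIP case: bindAddr = true
	fmt.Println(ipvs)
}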

Summary

kube-proxy's IPVS Proxier implements its proxying with a combination of iptables and IPVS. The iptables side consists of only a handful of fixed chains and rules, while the large volume of Service ClusterIPs, endpoint IPs, and so on is packed into ipsets and matched by iptables via match-set. This dramatically reduces the number of iptables rules and improves both rule-maintenance and matching performance.
The iptables rules are mainly responsible for SNAT and similar operations, complementing IPVS to provide the full proxying capability.
