Threads cannot be created inside a Docker container (how ulimit works, traced through the kernel source)

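I. Symptoms

During load testing, several modules reported the error "unable to create thread: Resource temporarily unavailable".
Both the service processes and supervisor hit this error. After several failed attempts to restart, the service exited, and then the container exited.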

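II. Root cause summary

In short, the problem arises because the scope in which a process's resource limit configuration takes effect does not match the scope the kernel uses when it enforces that limit for a user. The ulimit configuration is read inside the container and applied per process, whereas for some resources (here, the number of threads) the kernel does not count per process: it counts the total across all processes of a single user on the whole machine.

The per-user thread count is determined by the kernel. Containers have isolated runtime environments, but to the kernel they are just a set of processes. For processes running under the same user ID, even in different containers, the kernel sees the cumulative thread count.
On the other hand, the limit configuration that actually takes effect is applied in user space: each container reads its own limit configuration.
In the default CentOS limit configuration, the soft limit on the number of processes for non-root users is 4096: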

root@cvm-172_16_30_8:~ # cat /etc/security/limits.d/20-nproc.conf
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.

*          soft    nproc     4096
root       soft    nproc     unlimited

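When a system call creates a new thread, the kernel compares two values: the calling process's own nproc limit and the total number of threads owned by that user. This is why a user ID can be unable to create threads in a container where the thread count is low: the process's own limit is 4096, while from the kernel's point of view the user's total thread count on the machine already exceeds 4096.
The mechanism is explained in detail below.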

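III. Details

This section covers three points:

When the ulimit configuration takes effect.
How the kernel checks whether a limit has been exceeded.
How ulimit settings are inherited when Docker starts a container, i.e. how to solve the problem.

How the ulimit configuration takes effect

1. How processes were originally started in the container:

When the container starts it runs entrypoint.sh. The script creates a user with the specified ID, changes directory ownership, then switches user with su and runs supervisor, which in turn starts the service processes: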

➜  data-proxy git:(master) ✗ cat entrypoint.sh
#!/bin/sh
username="yibot"

#create user if not exists
egrep "^${YIBOT_UID}" /etc/passwd >& /dev/null
if [ $? -ne 0 ]
then
    useradd -u "${YIBOT_UID}" "${username}"
fi

mkdir -p /data/yibot/"${MODULE}"/log/ && \
    mkdir -p /data/supervisor/ && \
    chown -R "${YIBOT_UID}":"${YIBOT_UID}" /entrypoint && \
    chown -R "${YIBOT_UID}":"${YIBOT_UID}" /data && \
    su yibot -c "supervisord -n"

2. A brief introduction to PAM

PAM stands for Pluggable Authentication Modules. The modules are not part of the kernel; the kernel itself performs no authentication. PAM is a set of libraries created to decouple applications that need authentication from the authentication mechanism itself. Applications such as su and login use it today.
Introduction to PAM: https://www.linuxjournal.com/article/5940
PAM man page: http://man7.org/linux/man-pages/man8/pam.8.html
PAM source: https://github.com/linux-pam/linux-pam/tree/master/libpam

3. PAM and reading the ulimit configuration

The PAM source shows that in the limits handling (https://github.com/linux-pam/linux-pam/blob/master/modules/pam_limits/pam_limits.c) every PAM session call goes through parse_config_file -> setup_limits:

    retval = parse_config_file(pamh, pwd->pw_name, pwd->pw_uid, pwd->pw_gid, ctrl, pl);

    retval = setup_limits(pamh, pwd->pw_name, pwd->pw_uid, ctrl, pl);

parse_config_file reads the limit configuration from the given configuration file and stores it in the pam_limit_s structure pointed to by pl, which is defined as follows:

/* internal data */
struct pam_limit_s {
    int login_limit;     /* the max logins limit */
    int login_limit_def; /* which entry set the login limit */
    int flag_numsyslogins; /* whether to limit logins only for a
                  specific user or to count all logins */
    int priority;    /* the priority to run user process with */
    struct user_limits_struct limits[RLIM_NLIMITS];
    const char *conf_file;
    int utmp_after_pam_call;
    char login_group[LINE_LENGTH];
};

The value of each limit is stored in the limits array; each user_limits_struct holds the soft and hard limit in its struct rlimit member:

struct user_limits_struct {
    int supported;
    int src_soft;
    int src_hard;
    struct rlimit limit;
};

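The limit member is filled in by init_limits with the current process's limit values, obtained through the getrlimit system call.

After the configuration file has been parsed, setup_limits uses the setrlimit system call to modify the rlim values in the current process's PCB: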

for (i=0, status=LIMITED_OK; i<RLIM_NLIMITS; i++) {
    int res;

    if (!pl->limits[i].supported) {
        /* skip it if its not known to the system */
        continue;
    }
    if (pl->limits[i].src_soft == LIMITS_DEF_NONE &&
        pl->limits[i].src_hard == LIMITS_DEF_NONE) {
        /* skip it if its not initialized */
        continue;
    }
    if (pl->limits[i].limit.rlim_cur > pl->limits[i].limit.rlim_max)
        pl->limits[i].limit.rlim_cur = pl->limits[i].limit.rlim_max;
    res = setrlimit(i, &pl->limits[i].limit);
    if (res != 0)
        pam_syslog(pamh, LOG_ERR, "Could not set limit for '%s': %m",
                   rlimit2str(i));
    status |= res;
}

That is how the PAM library reads and applies the limit configuration. The behavior of the getrlimit and setrlimit system calls is described later.

4. su and PAM:

su source: https://github.com/shadow-maint/shadow/blob/master/src/su.c
In the latest su implementation, PAM support is conditionally compiled:

#ifdef USE_PAM
    ret = pam_start ("su", name, &conv, &pamh);
    if (PAM_SUCCESS != ret) {
        SYSLOG ((LOG_ERR, "pam_start: error %d", ret);
        fprintf (stderr,
                 _("%s: pam_start: error %d\n"),
                 Prog, ret));
        exit (1);
    }

On a recent CentOS, running ldd on su confirms that this option is enabled:

root@cvm-172_16_30_8:~ # ldd /usr/bin/su | grep pam
    libpam.so.0 => /lib64/libpam.so.0 (0x00007f4d429a6000)
    libpam_misc.so.0 => /lib64/libpam_misc.so.0 (0x00007f4d427a2000)

The su man page says the same:

This  version of su uses PAM for authentication, account and session management.  Some configuration options found in other su implementations such as e.g. support of a wheel group have to be configured via PAM.

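In pam_start ("su", name, &conv, &pamh), PAM looks under /etc/pam.d/ for a file named su and loads its configuration. That file names the modules used during PAM authentication, which is what makes the system pluggable.
Finally, when PAM opens the session, pam_open_session calls pam_sm_open_session in pam_limits, which parses the limits configuration files and applies them.
After switching user, su opens a shell by default, and that shell inherits the updated limits. The inheritance mechanism is described below.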

        /*
         * Use the shell and create an argv
         * with the rest of the command line included.
         */
        argv[-1] = cp;
        execve_shell (shellstr, &argv[-1], environ);

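Every process started afterwards inherits these limits.
A PAM programming example: https://www.freebsd.org/doc/en_US.ISO8859-1/articles/pam/pam-sample-appl.html

This explains what happens with the original entrypoint.sh: su invokes PAM, which reads the limit configuration of the current container (/etc/security/limits.d/). For a non-root user, nproc in the process limits is therefore set to 4096.

5. Behavior of the setrlimit system call

Kernel source: https://github.com/torvalds/linux
The setrlimit system call is defined as follows: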

SYSCALL_DEFINE2(setrlimit, unsigned int, resource, struct rlimit __user *, rlim)
{
    struct rlimit new_rlim;

    if (copy_from_user(&new_rlim, rlim, sizeof(*rlim)))
        return -EFAULT;
    return do_prlimit(current, resource, &new_rlim, NULL);
}

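current returns a pointer to the PCB of the current process, i.e. its task_struct. do_prlimit first calls security_task_setrlimit, an LSM (Linux Security Module) hook that lets security modules veto the change; if the hook allows it, do_prlimit writes the new value into the limits stored in the current PCB.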

int security_task_setrlimit(struct task_struct *p, unsigned int resource,
        struct rlimit *new_rlim)
{
    return call_int_hook(task_setrlimit, 0, p, resource, new_rlim);
}

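The call_int_hook macro below walks the list of hooks registered for FUNC in security_hook_heads, calls each one in turn, and stops at the first non-zero return value. If every hook returns 0, or none is registered, the result is the default IRC: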

#define call_int_hook(FUNC, IRC, ...) ({            \
    int RC = IRC;                       \
    do {                            \
        struct security_hook_list *P;           \
                                \
        hlist_for_each_entry(P, &security_hook_heads.FUNC, list) { \
            RC = P->hook.FUNC(__VA_ARGS__);     \
            if (RC != 0)                \
                break;              \
        }                       \
    } while (0);                        \
    RC;                         \
})

For reference, here is part of the definition of the PCB, task_struct. The full definition is at: https://github.com/torvalds/linux/blob/master/include/linux/sched.h

struct task_struct {
    ...
    /* Real parent process: */
    struct task_struct __rcu    *real_parent;

    /* Recipient of SIGCHLD, wait4() reports: */
    struct task_struct __rcu    *parent;

    /*
     * Children/sibling form the list of natural children:
     */
    struct list_head        children;
    struct list_head        sibling;
    struct task_struct      *group_leader;
    ...

    /* Effective (overridable) subjective task credentials (COW): */
    const struct cred __rcu     *cred;
    ...
    /* Signal handlers: */
    struct signal_struct        *signal;
    ...
}

Note: the kernel implements a generic doubly linked list with list_head and the list_entry macro.
rlim is defined in struct signal_struct:

struct signal_struct {
    ...
     /*
     * We don't bother to synchronize most readers of this at all,
     * because there is no reader checking a limit that actually needs
     * to get both rlim_cur and rlim_max atomically, and either one
     * alone is a single word that can safely be read normally.
     * getrlimit/setrlimit use task_lock(current->group_leader) to
     * protect this instead of the siglock, because they really
     * have no need to disable irqs.
     */
    struct rlimit rlim[RLIM_NLIMITS];
    ...
}

The rlim array holds the resource limit values of the process, and it is the values in this array that setrlimit ultimately modifies. Each process therefore holds its own set of values.

How the kernel checks the nproc limit (the limit on the number of processes)

1. Total number of processes of a user

The task_struct definition above contains a struct cred, defined as follows:

struct cred {
    ...
    kuid_t      uid;        /* real UID of the task */
    kgid_t      gid;        /* real GID of the task */
    kuid_t      suid;       /* saved UID of the task */
    kgid_t      sgid;       /* saved GID of the task */
    kuid_t      euid;       /* effective UID of the task */
    kgid_t      egid;       /* effective GID of the task */
    kuid_t      fsuid;      /* UID for VFS ops */
    kgid_t      fsgid;      /* GID for VFS ops */
    ...
    struct user_struct *user;   /* real user ID subscription */
    ...
}

struct user_struct is defined as:

struct user_struct {
    refcount_t __count; /* reference count */
    atomic_t processes; /* How many processes does this user have? */
    atomic_t sigpending;    /* How many pending signals does this user have? */
    ...
}

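As shown below, the struct user_struct *user in the PCB is globally unique per UID, so processes is the number of all processes that the user currently has running on the system (in Linux, processes and threads are almost the same; the kernel has no separate notion of a thread).
See http://www.mulix.org/lectures/kernel_workshop_mar_2004/things.pdf: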

In Linux, processes and threads are almost the same. The major difference is that threads share the same virtual memory address space.

2. struct user_struct *user is globally unique

In its implementation, su calls change_uid, which ultimately switches the UID through the setuid system call:

SYSCALL_DEFINE1(setuid, uid_t, uid)
{
    return __sys_setuid(uid);
}

__sys_setuid calls set_user to perform the actual switch of user; the parameter new is a copy of the cred structure in the current PCB:

long __sys_setuid(uid_t uid)
{
    ...
    if (ns_capable_setid(old->user_ns, CAP_SETUID)) {
        new->suid = new->uid = kuid;
        if (!uid_eq(kuid, old->uid)) {
            retval = set_user(new);
            if (retval < 0)
                goto error;
        }
    } else if (!uid_eq(kuid, old->uid) && !uid_eq(kuid, new->suid)) {
        goto error;
    }
}

The complete implementation of set_user:

/*
 * change the user struct in a credentials set to match the new UID
 */
static int set_user(struct cred *new)
{
    struct user_struct *new_user;

    new_user = alloc_uid(new->uid);
    if (!new_user)
        return -EAGAIN;

    /*
     * We don't fail in case of NPROC limit excess here because too many
     * poorly written programs don't check set*uid() return code, assuming
     * it never fails if called by root.  We may still enforce NPROC limit
     * for programs doing set*uid()+execve() by harmlessly deferring the
     * failure to the execve() stage.
     */
    if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
            new_user != INIT_USER)
        current->flags |= PF_NPROC_EXCEEDED;
    else
        current->flags &= ~PF_NPROC_EXCEEDED;

    free_uid(new->user);
    new->user = new_user;
    return 0;
}

Now look at alloc_uid:

struct user_struct *alloc_uid(kuid_t uid)
{
    struct hlist_head *hashent = uidhashentry(uid);
    struct user_struct *up, *new;

    spin_lock_irq(&uidhash_lock);
    up = uid_hash_find(uid, hashent);
    spin_unlock_irq(&uidhash_lock);
    ...
}

In kernel/user.c, uidhashentry is defined as follows:

#define uidhashentry(uid)   (uidhash_table + __uidhashfn((__kuid_val(uid))))

static struct kmem_cache *uid_cachep;
struct hlist_head uidhash_table[UIDHASH_SZ];

Together with the implementation of uid_hash_find:

static struct user_struct *uid_hash_find(kuid_t uid, struct hlist_head *hashent)
{
    struct user_struct *user;

    hlist_for_each_entry(user, hashent, uidhash_node) {
        if (uid_eq(user->uid, uid)) {
            refcount_inc(&user->__count);
            return user;
        }
    }

    return NULL;
}

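This shows that for a given UID the user information structure user_struct is globally unique: it is looked up in the hash chain selected by the UID's hash entry, and a pointer to it is handed back to the PCB.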

3. Checking whether a new process is allowed

The set_user function shown above already contains this check:

if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
            new_user != INIT_USER)
        current->flags |= PF_NPROC_EXCEEDED;
    else
        current->flags &= ~PF_NPROC_EXCEEDED;

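rlimit(RLIMIT_NPROC) reads the nproc limit stored in the PCB of the current process and compares it with the total thread count of the new user.
__do_execve_file, the implementation of exec, contains a similar check: https://github.com/torvalds/linux/blob/master/fs/exec.c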

if ((current->flags & PF_NPROC_EXCEEDED) &&
        atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
        retval = -EAGAIN;
        goto out_ret;
    }

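Other process-creation paths perform a similar check. For fork and clone, and therefore for thread creation, it is done in copy_process in kernel/fork.c, which is where the EAGAIN behind "Resource temporarily unavailable" comes from.
The counter itself is maintained along the way: fork ultimately increments it in copy_creds with atomic_inc(&p->cred->user->processes);
exec and setuid ultimately go through commit_creds, which, when the user changes, increments the new user's processes counter and decrements the old user's.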

IV. ulimit inheritance in Docker containers

1. How a child process inherits its parent's ulimit

When a process forks, the fork implementation in kernel/fork.c copies the contents of the PCB. The relevant part is copy_signal:

static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
{
    ...
    task_lock(current->group_leader);
    memcpy(sig->rlim, current->signal->rlim, sizeof sig->rlim);
    task_unlock(current->group_leader);
    ...
}

The rlim array in the PCB is copied in full, so unless setrlimit is used to change it, the child's limits are identical to the parent's.

2. How Docker starts containers

According to the official documentation, process 1 of a container inherits its rlim from the docker daemon:
https://docs.docker.com/engine/reference/commandline/run/

Note: If you do not provide a hard limit, the soft limit will be used for both values. If no ulimits are set, they will be inherited from the default ulimits set on the daemon. as option is disabled now. In other words, the following script is not supported:...

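Because the docker daemon usually runs as root, process 1 still has the same rlim as root even when the container is run as a specified non-root user.
As long as nothing reads the container's ulimit configuration through PAM (for example by running su inside the container to switch users, or through a remote login), all child processes consistently inherit root's rlim.

In summary, the fix is to avoid running su inside the container to switch users before the service processes are started. Instead, any user can be specified when the container is started (for example with the --user option of docker run); this does not affect the ulimits, which are still inherited uniformly from the docker daemon.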

Last edited: 2019.12.30 12:07:23
