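I. Symptoms
- During load testing, several modules reported the error "unable to create thread: Resource temporarily unavailable".
- Both the service processes and supervisor hit this error. After repeated restart attempts failed, the service exited, and the container exited with it.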
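II. Root cause summary
In short, the problem comes from a mismatch between the scope in which a process's resource limit is configured and the scope the kernel uses when it enforces that limit. The ulimit configuration is read inside the container and applies to a single process, whereas for some resources (here, the number of threads) the kernel does not count per process: it counts the total across all processes of the same user on the whole machine.
The per-user thread count is maintained by the kernel. Containers have isolated runtime environments, but to the kernel they are just groups of processes. For processes running under the same user id, even in different containers, the kernel sees one cumulative thread count.
The limit values themselves, on the other hand, are configured from user space: each container reads its own limit configuration.
In the default CentOS limits configuration, the soft limit on the number of processes for non-root users is 4096: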
root@cvm-172_16_30_8:~ # cat /etc/security/limits.d/20-nproc.conf
# Default limit for number of user's processes to prevent
# accidental fork bombs.
# See rhbz #432903 for reasoning.
* soft nproc 4096
root soft nproc unlimited
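When a system call creates a new thread, the kernel compares two values: the calling process's own nproc limit and the total thread count of that user. This explains how a user id can be unable to create a thread inside a container that itself runs only a few threads: the process's own limit is 4096, while the machine-wide thread count the kernel holds for that user id has already exceeded 4096.
The detailed mechanism is explained below.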
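III. Details
This part covers three topics:
- When the ulimit configuration takes effect.
- How the kernel checks a limit.
- How ulimit settings are inherited when docker starts a container, which is how the problem is solved.
How the ulimit configuration takes effect
1. How processes were originally started in the container:
When the container starts, it runs entrypoint.sh. The script creates a user with the given id, fixes directory ownership, then switches user with su and runs supervisor, which in turn starts the service processes: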
➜ data-proxy git:(master) ✗ cat entrypoint.sh
#!/bin/sh
username="yibot"
#create user if not exists
egrep "^${YIBOT_UID}" /etc/passwd >& /dev/null
if [ $? -ne 0 ]
then
useradd -u "${YIBOT_UID}" "${username}"
fi
mkdir -p /data/yibot/"${MODULE}"/log/ && \
mkdir -p /data/supervisor/ && \
chown -R "${YIBOT_UID}":"${YIBOT_UID}" /entrypoint && \
chown -R "${YIBOT_UID}":"${YIBOT_UID}" /data && \
su yibot -c "supervisord -n"
2. A brief introduction to PAM
PAM stands for Pluggable Authentication Modules. The modules are not part of the kernel, and the kernel performs no authentication itself. PAM is a set of libraries that decouples applications needing authentication from the authentication mechanism. Programs such as su and login now use it.
Introduction to PAM: https://www.linuxjournal.com/article/5940
PAM man page: http://man7.org/linux/man-pages/man8/pam.8.html
PAM source: https://github.com/linux-pam/linux-pam/tree/master/libpam
3. PAM and reading the ulimit configuration
The limits handling lives in https://github.com/linux-pam/linux-pam/blob/master/modules/pam_limits/pam_limits.c. Every time a PAM session is opened, it calls parse_config_file and then setup_limits:
retval = parse_config_file(pamh, pwd->pw_name, pwd->pw_uid, pwd->pw_gid, ctrl, pl);
retval = setup_limits(pamh, pwd->pw_name, pwd->pw_uid, ctrl, pl);
parse_config_file reads the limit settings from the given configuration file and stores them in the pam_limit_s struct that pl points to. The struct is defined as follows:
/* internal data */
struct pam_limit_s {
int login_limit; /* the max logins limit */
int login_limit_def; /* which entry set the login limit */
int flag_numsyslogins; /* whether to limit logins only for a
specific user or to count all logins */
int priority; /* the priority to run user process with */
struct user_limits_struct limits[RLIM_NLIMITS];
const char *conf_file;
int utmp_after_pam_call;
char login_group[LINE_LENGTH];
};
The value of each limit is stored in the limits array. Each user_limits_struct holds a soft and a hard limit:
struct user_limits_struct {
int supported;
int src_soft;
int src_hard;
struct rlimit limit;
};
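The limit member is filled in by init_limits, which uses the getrlimit system call to fetch the current process's limits.
After the configuration file is parsed, setup_limits uses the setrlimit system call to update the rlim values in the current process's PCB: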
for (i=0, status=LIMITED_OK; i<RLIM_NLIMITS; i++) {
int res;
if (!pl->limits[i].supported) {
/* skip it if its not known to the system */
continue;
}
if (pl->limits[i].src_soft == LIMITS_DEF_NONE &&
pl->limits[i].src_hard == LIMITS_DEF_NONE) {
/* skip it if its not initialized */
continue;
}
if (pl->limits[i].limit.rlim_cur > pl->limits[i].limit.rlim_max)
pl->limits[i].limit.rlim_cur = pl->limits[i].limit.rlim_max;
res = setrlimit(i, &pl->limits[i].limit);
if (res != 0)
pam_syslog(pamh, LOG_ERR, "Could not set limit for '%s': %m",
rlimit2str(i));
status |= res;
}
That is how the PAM library reads the limit configuration and applies it. The behavior of the getrlimit and setrlimit system calls is covered later.
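The clamp-then-set step in the setup_limits loop can be reproduced from user space. The following is a minimal sketch, not PAM code: set_soft_limit is a hypothetical helper name, and it only uses the standard getrlimit and setrlimit calls.

```c
#include <sys/resource.h>

/*
 * Set the soft limit of `resource` for the calling process.
 * The requested value is clamped to the current hard limit, which mirrors
 * the rlim_cur > rlim_max adjustment in the setup_limits loop.
 * Returns 0 on success, -1 on failure (errno is set by the failing call).
 * set_soft_limit is a hypothetical helper, not part of pam_limits.
 */
int set_soft_limit(int resource, rlim_t wanted)
{
    struct rlimit rl;

    /* Start from the limits the process currently has. */
    if (getrlimit(resource, &rl) != 0)
        return -1;

    /* A soft limit may never exceed the hard limit. */
    rl.rlim_cur = wanted > rl.rlim_max ? rl.rlim_max : wanted;

    return setrlimit(resource, &rl);
}
```

Calling set_soft_limit(RLIMIT_NPROC, 4096) in a process would have roughly the effect that the 20-nproc.conf entry has on a non-root session opened through PAM. Lowering a soft limit needs no privilege; raising the hard limit does.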
4. su and PAM:
su source: https://github.com/shadow-maint/shadow/blob/master/src/su.c
The current su implementation compiles in PAM support conditionally:
#ifdef USE_PAM
ret = pam_start ("su", name, &conv, &pamh);
if (PAM_SUCCESS != ret) {
SYSLOG ((LOG_ERR, "pam_start: error %d", ret));
fprintf (stderr,
_("%s: pam_start: error %d\n"),
Prog, ret);
exit (1);
}
On a recent CentOS, running ldd on su confirms that this option is enabled:
root@cvm-172_16_30_8:~ # ldd /usr/bin/su | grep pam
libpam.so.0 => /lib64/libpam.so.0 (0x00007f4d429a6000)
libpam_misc.so.0 => /lib64/libpam_misc.so.0 (0x00007f4d427a2000)
The su man page says the same:
This version of su uses PAM for authentication, account and session management. Some configuration options found in other su implementations such as e.g. support of a wheel group have to be configured via PAM.
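In pam_start ("su", name, &conv, &pamh), PAM looks under /etc/pam.d/ for a file named su and loads its configuration from it. That file lists the modules used for PAM authentication, which is what makes PAM pluggable.
Finally, when PAM opens the session, pam_open_session calls pam_sm_open_session in pam_limits, which parses the limits configuration files and applies the settings.
After su switches user it launches a shell by default, and that shell inherits the updated limits (the inheritance mechanism is described below):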
/*
* Use the shell and create an argv
* with the rest of the command line included.
*/
argv[-1] = cp;
execve_shell (shellstr, &argv[-1], environ);
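Every process started afterwards inherits these limits.
A PAM programming example: https://www.freebsd.org/doc/en_US.ISO8859-1/articles/pam/pam-sample-appl.html
This explains the behavior of the original entrypoint.sh: su calls PAM, which reads the limit configuration of the current container (/etc/security/limits.d/). For a non-root user, nproc in the process's limits is set to 4096.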
5. Behavior of the setrlimit system call
Kernel source: https://github.com/torvalds/linux
The setrlimit system call is defined as follows:
SYSCALL_DEFINE2(setrlimit, unsigned int, resource, struct rlimit __user *, rlim)
{
struct rlimit new_rlim;
if (copy_from_user(&new_rlim, rlim, sizeof(*rlim)))
return -EFAULT;
return do_prlimit(current, resource, &new_rlim, NULL);
}
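current returns a pointer to the PCB of the current process, that is, its task_struct. do_prlimit then calls security_task_setrlimit, which asks the registered security modules whether the change is allowed. If none of them objects, do_prlimit writes the new limit into the current PCB.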
int security_task_setrlimit(struct task_struct *p, unsigned int resource,
struct rlimit *new_rlim)
{
return call_int_hook(task_setrlimit, 0, p, resource, new_rlim);
}
call_int_hook walks the list of hooks registered for FUNC and calls each one in turn, stopping at the first hook that returns a non-zero value:
#define call_int_hook(FUNC, IRC, ...) ({ \
int RC = IRC; \
do { \
struct security_hook_list *P; \
\
hlist_for_each_entry(P, &security_hook_heads.FUNC, list) { \
RC = P->hook.FUNC(__VA_ARGS__); \
if (RC != 0) \
break; \
} \
} while (0); \
RC; \
})
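To make the macro easier to read, here is a user-space sketch of the same control flow. It is an illustration under assumptions, not kernel code: struct hook, call_hooks, allow_hook and deny_hook are hypothetical names, and a plain singly linked list stands in for the kernel's hlist.

```c
#include <errno.h>
#include <stddef.h>

/* One registered hook; a stand-in for struct security_hook_list. */
struct hook {
    int (*fn)(int resource);
    struct hook *next;
};

/* Counts how many hooks actually ran, so the early exit is observable. */
int hook_calls;

/*
 * Walk the chain and call every hook in order.
 * Stop at the first hook that returns non-zero and return its value,
 * which is the same short-circuit rule call_int_hook applies.
 * If the chain is empty, the initial value is returned unchanged.
 */
int call_hooks(const struct hook *head, int initial_rc, int resource)
{
    int rc = initial_rc;

    for (const struct hook *h = head; h != NULL; h = h->next) {
        rc = h->fn(resource);
        if (rc != 0)
            break;
    }
    return rc;
}

/* Hypothetical hooks: one that permits the change and one that refuses. */
int allow_hook(int resource) { (void)resource; hook_calls++; return 0; }
int deny_hook(int resource)  { (void)resource; hook_calls++; return -EPERM; }
```

With a chain of allow_hook, deny_hook, allow_hook, the call returns -EPERM and the third hook never runs. In the kernel each hook is a security module's task_setrlimit callback, so any single module can veto the limit change.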
For reference, here is part of the PCB definition, task_struct. The full definition is at https://github.com/torvalds/linux/blob/master/include/linux/sched.h
struct task_struct {
...
/* Real parent process: */
struct task_struct __rcu *real_parent;
/* Recipient of SIGCHLD, wait4() reports: */
struct task_struct __rcu *parent;
/*
* Children/sibling form the list of natural children:
*/
struct list_head children;
struct list_head sibling;
struct task_struct *group_leader;
...
/* Effective (overridable) subjective task credentials (COW): */
const struct cred __rcu *cred;
...
/* Signal handlers: */
struct signal_struct *signal;
...
}
Note: the kernel implements a generic doubly linked list with list_head and the list_entry macro.
rlim is defined in struct signal_struct:
struct signal_struct {
...
/*
* We don't bother to synchronize most readers of this at all,
* because there is no reader checking a limit that actually needs
* to get both rlim_cur and rlim_max atomically, and either one
* alone is a single word that can safely be read normally.
* getrlimit/setrlimit use task_lock(current->group_leader) to
* protect this instead of the siglock, because they really
* have no need to disable irqs.
*/
struct rlimit rlim[RLIM_NLIMITS];
...
}
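The rlim array holds the process's resource limit values, and these are the values setrlimit ultimately modifies. Each process therefore holds its own set of values.
How the kernel checks the nproc limit (the limit on the number of processes)
1. Total number of processes per user
The task_struct definition above contains a struct cred, defined as follows: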
struct cred {
...
kuid_t uid; /* real UID of the task */
kgid_t gid; /* real GID of the task */
kuid_t suid; /* saved UID of the task */
kgid_t sgid; /* saved GID of the task */
kuid_t euid; /* effective UID of the task */
kgid_t egid; /* effective GID of the task */
kuid_t fsuid; /* UID for VFS ops */
kgid_t fsgid; /* GID for VFS ops */
...
struct user_struct *user; /* real user ID subscription */
...
}
struct user_struct is defined as:
struct user_struct {
refcount_t __count; /* reference count */
atomic_t processes; /* How many processes does this user have? */
atomic_t sigpending; /* How many pending signals does this user have? */
...
}
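As the next section shows, the struct user_struct *user in the PCB is globally unique per user, so processes is the number of processes that user currently has running on the whole system (in Linux, processes and threads are almost the same; the kernel has no separate thread concept).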
http://www.mulix.org/lectures/kernel_workshop_mar_2004/things.pdf
In Linux, processes and threads are almost the same. The major difference is that threads share the same virtual memory address space.
2. struct user_struct *user is globally unique
su calls change_uid, which ultimately switches the uid through the setuid system call:
SYSCALL_DEFINE1(setuid, uid_t, uid)
{
return __sys_setuid(uid);
}
__sys_setuid calls set_user to perform the actual switch. The new argument is a copy of the cred struct in the current PCB:
long __sys_setuid(uid_t uid)
{
...
if (ns_capable_setid(old->user_ns, CAP_SETUID)) {
new->suid = new->uid = kuid;
if (!uid_eq(kuid, old->uid)) {
retval = set_user(new);
if (retval < 0)
goto error;
}
} else if (!uid_eq(kuid, old->uid) && !uid_eq(kuid, new->suid)) {
goto error;
}
}
The full implementation of set_user:
/*
* change the user struct in a credentials set to match the new UID
*/
static int set_user(struct cred *new)
{
struct user_struct *new_user;
new_user = alloc_uid(new->uid);
if (!new_user)
return -EAGAIN;
/*
* We don't fail in case of NPROC limit excess here because too many
* poorly written programs don't check set*uid() return code, assuming
* it never fails if called by root. We may still enforce NPROC limit
* for programs doing set*uid()+execve() by harmlessly deferring the
* failure to the execve() stage.
*/
if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
new_user != INIT_USER)
current->flags |= PF_NPROC_EXCEEDED;
else
current->flags &= ~PF_NPROC_EXCEEDED;
free_uid(new->user);
new->user = new_user;
return 0;
}
Now look at alloc_uid:
struct user_struct *alloc_uid(kuid_t uid)
{
struct hlist_head *hashent = uidhashentry(uid);
struct user_struct *up, *new;
spin_lock_irq(&uidhash_lock);
up = uid_hash_find(uid, hashent);
spin_unlock_irq(&uidhash_lock);
...
}
In kernel/user.c, uidhashentry is defined as follows:
#define uidhashentry(uid) (uidhash_table + __uidhashfn((__kuid_val(uid))))
static struct kmem_cache *uid_cachep;
struct hlist_head uidhash_table[UIDHASH_SZ];
Together with the implementation of uid_hash_find:
static struct user_struct *uid_hash_find(kuid_t uid, struct hlist_head *hashent)
{
struct user_struct *user;
hlist_for_each_entry(user, hashent, uidhash_node) {
if (uid_eq(user->uid, uid)) {
refcount_inc(&user->__count);
return user;
}
}
return NULL;
}
This shows that for a given uid the user_struct is globally unique: it is looked up in the hash chain selected by the uid's hash entry, and a pointer to it is returned to the PCB.
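The lookup can be modeled in a few lines of user-space C. This is a simplified sketch, not the kernel implementation: struct user_entry and find_or_create_user are hypothetical names, the table size of 128 is an assumption, and locking and reference counting are left out.

```c
#include <stdlib.h>

#define UIDHASH_SZ 128 /* assumed table size for this sketch */

/* Stand-in for struct user_struct: one entry per uid. */
struct user_entry {
    unsigned int uid;
    int processes;           /* task count shared by every task of this uid */
    struct user_entry *next; /* next entry in the same hash chain */
};

static struct user_entry *uidhash_table[UIDHASH_SZ];

/*
 * Return the single entry for `uid`, creating it on first use.
 * Later calls with the same uid walk the chain and return the same
 * pointer, which is why every task of that uid shares one counter.
 * Returns NULL only if memory allocation fails.
 */
struct user_entry *find_or_create_user(unsigned int uid)
{
    struct user_entry **bucket = &uidhash_table[uid % UIDHASH_SZ];
    struct user_entry *u;

    for (u = *bucket; u != NULL; u = u->next)
        if (u->uid == uid)
            return u;

    u = calloc(1, sizeof(*u));
    if (u == NULL)
        return NULL;
    u->uid = uid;
    u->next = *bucket; /* push onto the front of the chain */
    *bucket = u;
    return u;
}
```

Two lookups for uid 1000 return the same pointer, so an increment made through one is visible through the other. That sharing is what lets processes in different containers add to a single count.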
3. Checking whether a new process is allowed
set_user, shown above, already contains this check:
if (atomic_read(&new_user->processes) >= rlimit(RLIMIT_NPROC) &&
new_user != INIT_USER)
current->flags |= PF_NPROC_EXCEEDED;
else
current->flags &= ~PF_NPROC_EXCEEDED;
rlimit(RLIMIT_NPROC) reads the nproc limit stored in the current process's PCB and compares it with the new user's total thread count.
The exec implementation, __do_execve_file, contains a similar check: https://github.com/torvalds/linux/blob/master/fs/exec.c
if ((current->flags & PF_NPROC_EXCEEDED) &&
atomic_read(&current_user()->processes) > rlimit(RLIMIT_NPROC)) {
retval = -EAGAIN;
goto out_ret;
}
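Other paths that create a process perform a similar check.
In addition, fork increments the process count through copy_creds, which calls atomic_inc(&p->cred->user->processes);
exec goes through commit_creds, which calls atomic_inc(&new->user->processes); when the credentials switch to a different user, so the count moves to the new user.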
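The interaction between the per-process limit and the per-user counter can be modeled as follows. This is a simplified sketch with hypothetical names (user_model, task_model, try_create_task). It assumes the same ">= limit" comparison as the set_user check above and exempts uid 0 in place of INIT_USER; the capability checks the real kernel also performs are omitted.

```c
#include <errno.h>
#include <limits.h>

#define NPROC_UNLIMITED ULONG_MAX /* stand-in for RLIM_INFINITY */

/* Shared by every task of one uid, whichever container it runs in. */
struct user_model {
    unsigned int uid;
    unsigned long processes;
};

/* Per-process state: a pointer to the shared user and a private limit. */
struct task_model {
    struct user_model *user;
    unsigned long nproc_limit;
};

/*
 * Model of creating one more task from `parent`.
 * The shared per-user count is compared with the parent's own limit;
 * uid 0 is exempt. Returns 0 and bumps the shared count on success,
 * or -EAGAIN ("Resource temporarily unavailable") on refusal.
 */
int try_create_task(const struct task_model *parent)
{
    if (parent->user->uid != 0 &&
        parent->user->processes >= parent->nproc_limit)
        return -EAGAIN;

    parent->user->processes++;
    return 0;
}
```

Suppose uid 1000 already has 5000 tasks across the machine. A process whose limit was set to 4096 by PAM is refused, even if its own container runs only a handful of threads. A process of the same uid that inherited an unlimited nproc still succeeds.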
IV. ulimit inheritance in docker containers
1. A child process inherits its parent's ulimit
When a process forks, kernel/fork.c copies the contents of the PCB. The relevant part is copy_signal:
static int copy_signal(unsigned long clone_flags, struct task_struct *tsk)
{
...
task_lock(current->group_leader);
memcpy(sig->rlim, current->signal->rlim, sizeof sig->rlim);
task_unlock(current->group_leader);
...
}
The rlim array in the PCB is copied in full, so unless setrlimit is used to change it, the child's limits are identical to the parent's.
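The inheritance is easy to observe from user space. The sketch below forks a child and has it compare its own limit with the values the parent read before forking; child_inherits_limit is a hypothetical helper name, and only standard POSIX calls are used.

```c
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Fork a child and let it compare its own limit for `resource` with the
 * values the parent read before forking.
 * Returns 1 if soft and hard limits match, 0 if they differ, -1 on error.
 */
int child_inherits_limit(int resource)
{
    struct rlimit parent_rl, child_rl;
    int status;
    pid_t pid;

    if (getrlimit(resource, &parent_rl) != 0)
        return -1;

    pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: exit 0 means "same as parent", 1 "different", 2 "error". */
        if (getrlimit(resource, &child_rl) != 0)
            _exit(2);
        _exit(child_rl.rlim_cur == parent_rl.rlim_cur &&
              child_rl.rlim_max == parent_rl.rlim_max ? 0 : 1);
    }

    /* Parent: translate the child's exit status into the return value. */
    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    if (WEXITSTATUS(status) == 2)
        return -1;
    return WEXITSTATUS(status) == 0;
}
```

child_inherits_limit(RLIMIT_NPROC) is expected to return 1, because nothing between fork and the child's getrlimit call changes the copied rlim array.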
2. How docker starts a container
According to the official documentation, the rlim of PID 1 in a container is inherited from the docker daemon:
https://docs.docker.com/engine/reference/commandline/run/
Note: If you do not provide a hard limit, the soft limit will be used for both values. If no ulimits are set, they will be inherited from the default ulimits set on the daemon. as option is disabled now. In other words, the following script is not supported:...
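Because the docker daemon normally runs as root, PID 1 gets the same rlim as root even when the container is run as a non-root user.
As long as nothing reads the container's ulimit configuration through PAM (for example by running su inside the container or by logging in remotely), all child processes inherit root's rlim as well.
In summary, the fix is to avoid running su to switch user inside the container before the service processes are started. Any user can be specified when the container is started, and the ulimit is still inherited uniformly from the docker daemon.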