1. Terminology
VPN: virtual page number
PPN: physical page number
PTE: page-table entry
ASID: address-space identifier
PMA: physical memory attributes
PMP: physical memory protection
PGD: page global directory
PUD: page upper directory
PMD: page middle directory
PT: page table
TVM: trap virtual memory
2. Linux Memory Structure
A 4KB page size may not be the optimal choice; 8KB or 16KB might well be better, but 4KB was a trade-off made in the past under specific conditions. Rather than fixating on the number 4KB, we should pay attention to the factors that produced it, so that when we meet a similar situation we can weigh the same considerations and pick the best choice for the moment. This article covers two factors that influence page size:
- a page size that is too small produces a large number of page-table entries, increasing TLB (translation lookaside buffer) pressure and adding lookup overhead during address translation;
- a page size that is too large wastes memory, causes internal fragmentation, and lowers memory utilization.
Both factors were weighed carefully when page sizes were designed decades ago, and 4KB ended up as the most common page size in operating systems. The following sections look at how these factors affect operating-system performance.
Each process sees its own independent virtual address space. Virtual memory is only a logical concept: the process ultimately accesses the physical memory behind it, and the translation from virtual to physical addresses goes through the page table that each process holds.
In the four-level page-table structure shown above, the operating system uses the low 12 bits as the page offset, and the remaining 36 bits are split into four groups, each serving as an index into the next level's table; any virtual address can be resolved to its physical address through this multi-level walk.
Because the size of the virtual address space is fixed, and the whole space is divided evenly into N pages of equal size, the page size ultimately determines the depth and number of page-table entries in each process: the smaller the virtual page, the more virtual pages and page-table entries a single process has.
PagesCount = VirtualMemorySize / PageSize
因?yàn)槟壳暗奶摂M頁(yè)大小為 4096 字節(jié)毫痕,所以虛擬地址末尾的 12 位可以表示虛擬頁(yè)中的地址征峦,如果虛擬頁(yè)的大小降到了 512 字節(jié),那么原本的四層頁(yè)表結(jié)構(gòu)或者五層頁(yè)表結(jié)構(gòu)會(huì)變成五層或者六層消请,這不僅會(huì)增加內(nèi)存訪問(wèn)的額外開銷栏笆,還會(huì)增加每個(gè)進(jìn)程中頁(yè)表項(xiàng)占用的內(nèi)存大小。
A PGD holds the addresses of a number of PUDs, a PUD holds the addresses of a number of PMDs, and a PMD in turn holds the addresses of a number of PTs. Each page-table entry points to a page frame, and a page frame is an actual physical memory page.
PGD: Page Global Directory
In Linux, every process has its own user-space pgd, but the kernel pgd is shared. Whenever a new process is created, a new page global directory (PGD) is allocated for it, and the kernel-range directory entries are copied from the kernel's master page directory swapper_pg_dir into the corresponding slots of the new PGD. The call chain is:
do_fork() --> copy_mm() --> mm_init() --> pgd_alloc() --> set_pgd_fast() --> get_pgd_slow() --> memcpy(&PGD + USER_PTRS_PER_PGD, swapper_pg_dir +USER_PTRS_PER_PGD, (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t))
3. Processes and Physical Memory
3.1 Process Creation
mm_init(), in fork.c, is where process creation meets physical memory: it calls mm_alloc_pgd(). The relevant fragment of mm_init():

```c
if (mm_alloc_pgd(mm))
    goto fail_nopgd;
```

mm_alloc_pgd() is also defined in fork.c:
```c
static inline int mm_alloc_pgd(struct mm_struct *mm)
{
    mm->pgd = pgd_alloc(mm);
    if (unlikely(!mm->pgd))
        return -ENOMEM;
    return 0;
}
```
pgd_alloc(), in pgalloc.h, allocates one page for the new pgd, returns that page's base address, and copies the kernel PGD entries into the new process's directory. Internally it calls __get_free_page() to obtain a free physical page to hold the process's page directory; __get_free_page() is just the common kernel helper __get_free_pages() with order 0. This is the point where process creation connects directly with underlying physical memory. The sources:
```c
#define __get_free_page(gfp_mask) \
    __get_free_pages((gfp_mask), 0)

static inline pgd_t *pgd_alloc(struct mm_struct *mm)
{
    pgd_t *pgd;

    pgd = (pgd_t *)__get_free_page(GFP_KERNEL);
    if (likely(pgd != NULL)) {
        memset(pgd, 0, USER_PTRS_PER_PGD * sizeof(pgd_t));
        /* Copy kernel mappings */
        memcpy(pgd + USER_PTRS_PER_PGD,
               init_mm.pgd + USER_PTRS_PER_PGD,
               (PTRS_PER_PGD - USER_PTRS_PER_PGD) * sizeof(pgd_t));
    }
    return pgd;
}
```
The init_mm structure, defined in init_mm.c, records all the information of the kernel's root page table; swapper_pg_dir is the global variable that holds the kernel PGD. From init_mm.c:
```c
/*
 * For dynamically allocated mm_structs, there is a dynamically sized cpumask
 * at the end of the structure, the size of which depends on the maximum CPU
 * number the system can see. That way we allocate only as much memory for
 * mm_cpumask() as needed for the hundreds, or thousands of processes that
 * a system typically runs.
 *
 * Since there is only one init_mm in the entire system, keep it simple
 * and size this cpu_bitmask to NR_CPUS.
 */
struct mm_struct init_mm = {
    .mm_rb           = RB_ROOT,
    .pgd             = swapper_pg_dir,
    .mm_users        = ATOMIC_INIT(2),
    .mm_count        = ATOMIC_INIT(1),
    .mmap_sem        = __RWSEM_INITIALIZER(init_mm.mmap_sem),
    .page_table_lock = __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
    .arg_lock        = __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
    .mmlist          = LIST_HEAD_INIT(init_mm.mmlist),
    .user_ns         = &init_user_ns,
    .cpu_bitmap      = { [BITS_TO_LONGS(NR_CPUS)] = 0 },
    INIT_MM_CONTEXT(init_mm)
};
```
3.2 __get_free_pages()
Each process's page directory is thus split into two parts. The first part is the "user space" portion, which maps the process's own address range (0x00000000-0xBFFFFFFF), i.e. 3GB of virtual addresses; the second part is the "system space" portion, which maps (0xC0000000-0xFFFFFFFF), i.e. 1GB of virtual addresses. The second part is identical in every process's page directory, so from a process's point of view it owns 4GB of virtual space: the lower 3GB is its private user space, while the top 1GB is system space shared with the kernel and all other processes. Each process has its own PGD (Page Global Directory), which is one physical page containing an array of pgd_t entries.
4. RISC-V Addressing and Memory Protection
4.1 Virtual Address
An Sv32 virtual address is partitioned into a virtual page number (VPN) and page offset, as shown in
Figure 4.15.
**Sv32 page tables consist of 2^10 page-table entries (PTEs), each of four bytes, i.e. a 4KB page table.**
A virtual address consists of three fields: VPN[1] (10 bits), VPN[0] (10 bits), and the page offset (12 bits).
As the table above shows, a page-table entry is composed of the fields PPN[1], PPN[0], RSW, D, A, G, U, X, W, R, and V.
The U bit: indicates whether the page is accessible to user mode.
The G bit: designates a global mapping.
The A bit: indicates the virtual page has been read, written, or fetched from since the last time the A bit was cleared.
The D bit: indicates the virtual page has been written since the last time the D bit was cleared.
The permission R, W, and X bits: indicate whether the page is readable, writable, and executable, respectively.
The V bit: indicates whether the PTE is valid; if it is 0, all other bits in the PTE are don't-cares and may be used freely by software.
PPN[1] and PPN[0] are collectively referred to as the PFN in the Linux kernel, defined in pgtable-bits.h as follows:
```c
/*
 * PTE format:
 * | XLEN-1  10 | 9             8 | 7 | 6 | 5 | 4 | 3 | 2 | 1 | 0
 *       PFN      reserved for SW   D   A   G   U   X   W   R   V
 */
```
4.2 The satp Register
The satp register consists of the MODE, ASID, and PPN fields:
4.3 Virtual Address Translation Process
The process of translating a virtual address into a physical address is as follows:
Every application has its own Page Global Directory (PGD), which holds physical page-frame addresses. The pgd_t structure array is defined in <asm/page.h>, and different architectures load the PGD in different ways.
A virtual address va is translated into a physical address pa as follows:
(1)If XLEN equals VALEN, proceed. (For Sv32, VALEN=32.) Otherwise, check whether each
bit of va[XLEN-1:VALEN] is equal to va[VALEN-1]. If not, stop and raise a page-fault
exception corresponding to the original access type.
(2)Let a be satp.ppn × PAGESIZE, and let i = LEVELS − 1. (For Sv32, PAGESIZE=2^12 and
LEVELS=2.)
(3)Let pte be the value of the PTE at address a+va.vpn[i]×PTESIZE. (For Sv32, PTESIZE=4.)
If accessing pte violates a PMA or PMP check, raise an access exception corresponding to
the original access type.
(4)If pte.v = 0, or if pte.r = 0 and pte.w = 1, stop and raise a page-fault exception corresponding
to the original access type.
(5)Otherwise, the PTE is valid. If pte.r = 1 or pte.x = 1, go to step 6. Otherwise, this PTE is a
pointer to the next level of the page table. Let i = i − 1. If i < 0, stop and raise a page-fault
exception corresponding to the original access type. Otherwise, let a = pte.ppn × PAGESIZE
and go to step 3.
(6)A leaf PTE has been found. Determine if the requested memory access is allowed by the
pte.r, pte.w, pte.x, and pte.u bits, given the current privilege mode and the value of the
SUM and MXR fields of the mstatus register. If not, stop and raise a page-fault exception
corresponding to the original access type.
(7)If i > 0 and pte.ppn[i − 1 : 0] ≠ 0, this is a misaligned superpage; stop and raise a page-fault
exception corresponding to the original access type.
(8)If pte.a = 0, or if the memory access is a store and pte.d = 0, either raise a page-fault
exception corresponding to the original access type, or:
- Set pte.a to 1 and, if the memory access is a store, also set pte.d to 1.
- If this access violates a PMA or PMP check, raise an access exception corresponding to
the original access type.
- This update and the loading of pte in step 3 must be atomic; in particular, no intervening
store to the PTE may be perceived to have occurred in-between.
(9)The translation is successful. The translated physical address is given as follows:
- pa.pgoff = va.pgoff.
- If i > 0, then this is a superpage translation and pa.ppn[i − 1 : 0] = va.vpn[i − 1 : 0].
- pa.ppn[LEVELS − 1 : i] = pte.ppn[LEVELS − 1 : i].
In summary, the virtual-to-physical translation proceeds roughly as follows:
- Check that XLEN equals VALEN.
- Compute the base address: a = satp.ppn × PAGESIZE (PAGESIZE = 4096). satp.ppn holds the physical page number of the current process's root page table.
- Compute the PTE address: pte_addr = a + va.vpn[i] × PTESIZE (for Sv32, PTESIZE = 4).
- Check the page attributes: if the X, W, and R flags are all 0, the PTE points to the next-level page table, so repeat the previous step one level down; otherwise the current PTE is a valid leaf.
- Once the PMA and PMP checks pass, the address translation has succeeded.
- Finally, combine the resulting PPN with va.offset to form the actual physical address.
4.4 Page Table Creation Process
A virtual address may have no physical mapping; the typical case is a user-space malloc(), after which Linux has allocated only virtual addresses with no physical backing. Only when the program writes to that range does the Linux kernel take a page fault and map physical memory for it. That is the overall picture, but the path from the fault to the final mapping is quite involved. The central piece is the page-fault service routine, which on RISC-V is called do_page_fault() and is defined in arch/riscv/mm/fault.c.
4.4.1 do_page_fault
The implementation of do_page_fault():
```c
/*
 * This routine handles page faults. It determines the address and the
 * problem, and then passes it off to one of the appropriate routines.
 */
asmlinkage void do_page_fault(struct pt_regs *regs)
{
    struct task_struct *tsk;
    struct vm_area_struct *vma;
    struct mm_struct *mm;
    unsigned long addr, cause;
    unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
    int code = SEGV_MAPERR;
    vm_fault_t fault;

    cause = regs->scause;
    addr = regs->sbadaddr;

    tsk = current;
    mm = tsk->mm;

    /*
     * Fault-in kernel-space virtual memory on-demand.
     * The 'reference' page table is init_mm.pgd.
     *
     * NOTE! We MUST NOT take any locks for this case. We may
     * be in an interrupt or a critical region, and should
     * only copy the information from the master page table,
     * nothing more.
     */
    if (unlikely((addr >= VMALLOC_START) && (addr <= VMALLOC_END)))
        goto vmalloc_fault;

    /* Enable interrupts if they were enabled in the parent context. */
    if (likely(regs->sstatus & SR_SPIE))
        local_irq_enable();

    /*
     * If we're in an interrupt, have no user context, or are running
     * in an atomic region, then we must not take the fault.
     */
    if (unlikely(faulthandler_disabled() || !mm))
        goto no_context;

    if (user_mode(regs))
        flags |= FAULT_FLAG_USER;

    perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);

retry:
    down_read(&mm->mmap_sem);
    vma = find_vma(mm, addr);
    if (unlikely(!vma))
        goto bad_area;
    if (likely(vma->vm_start <= addr))
        goto good_area;
    if (unlikely(!(vma->vm_flags & VM_GROWSDOWN)))
        goto bad_area;
    if (unlikely(expand_stack(vma, addr)))
        goto bad_area;

    /*
     * Ok, we have a good vm_area for this memory access, so
     * we can handle it.
     */
good_area:
    code = SEGV_ACCERR;

    switch (cause) {
    case EXC_INST_PAGE_FAULT:
        if (!(vma->vm_flags & VM_EXEC))
            goto bad_area;
        break;
    case EXC_LOAD_PAGE_FAULT:
        if (!(vma->vm_flags & VM_READ))
            goto bad_area;
        break;
    case EXC_STORE_PAGE_FAULT:
        if (!(vma->vm_flags & VM_WRITE))
            goto bad_area;
        flags |= FAULT_FLAG_WRITE;
        break;
    default:
        panic("%s: unhandled cause %lu", __func__, cause);
    }

    /*
     * If for any reason at all we could not handle the fault,
     * make sure we exit gracefully rather than endlessly redo
     * the fault.
     */
    fault = handle_mm_fault(vma, addr, flags);

    /*
     * If we need to retry but a fatal signal is pending, handle the
     * signal first. We do not need to release the mmap_sem because it
     * would already be released in __lock_page_or_retry in mm/filemap.c.
     */
    if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(tsk))
        return;

    if (unlikely(fault & VM_FAULT_ERROR)) {
        if (fault & VM_FAULT_OOM)
            goto out_of_memory;
        else if (fault & VM_FAULT_SIGBUS)
            goto do_sigbus;
        BUG();
    }

    /*
     * Major/minor page fault accounting is only done on the
     * initial attempt. If we go through a retry, it is extremely
     * likely that the page will be found in page cache at that point.
     */
    if (flags & FAULT_FLAG_ALLOW_RETRY) {
        if (fault & VM_FAULT_MAJOR) {
            tsk->maj_flt++;
            perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MAJ,
                          1, regs, addr);
        } else {
            tsk->min_flt++;
            perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS_MIN,
                          1, regs, addr);
        }
        if (fault & VM_FAULT_RETRY) {
            /*
             * Clear FAULT_FLAG_ALLOW_RETRY to avoid any risk
             * of starvation.
             */
            flags &= ~(FAULT_FLAG_ALLOW_RETRY);
            flags |= FAULT_FLAG_TRIED;

            /*
             * No need to up_read(&mm->mmap_sem) as we would
             * have already released it in __lock_page_or_retry
             * in mm/filemap.c.
             */
            goto retry;
        }
    }

    up_read(&mm->mmap_sem);
    return;

    /*
     * Something tried to access memory that isn't in our memory map.
     * Fix it, but check if it's kernel or user first.
     */
bad_area:
    up_read(&mm->mmap_sem);
    /* User mode accesses just cause a SIGSEGV */
    if (user_mode(regs)) {
        do_trap(regs, SIGSEGV, code, addr, tsk);
        return;
    }

no_context:
    /* Are we prepared to handle this kernel fault? */
    if (fixup_exception(regs))
        return;

    /*
     * Oops. The kernel tried to access some bad page. We'll have to
     * terminate things with extreme prejudice.
     */
    bust_spinlocks(1);
    pr_alert("Unable to handle kernel %s at virtual address " REG_FMT "\n",
             (addr < PAGE_SIZE) ? "NULL pointer dereference" :
             "paging request", addr);
    die(regs, "Oops");
    do_exit(SIGKILL);

    /*
     * We ran out of memory, call the OOM killer, and return the userspace
     * (which will retry the fault, or kill us if we got oom-killed).
     */
out_of_memory:
    up_read(&mm->mmap_sem);
    if (!user_mode(regs))
        goto no_context;
    pagefault_out_of_memory();
    return;

do_sigbus:
    up_read(&mm->mmap_sem);
    /* Kernel mode? Handle exceptions or die */
    if (!user_mode(regs))
        goto no_context;
    do_trap(regs, SIGBUS, BUS_ADRERR, addr, tsk);
    return;

vmalloc_fault:
    {
        pgd_t *pgd, *pgd_k;
        pud_t *pud, *pud_k;
        p4d_t *p4d, *p4d_k;
        pmd_t *pmd, *pmd_k;
        pte_t *pte_k;
        int index;

        if (user_mode(regs))
            goto bad_area;

        /*
         * Synchronize this task's top level page-table
         * with the 'reference' page table.
         *
         * Do _not_ use "tsk->active_mm->pgd" here.
         * We might be inside an interrupt in the middle
         * of a task switch.
         *
         * Note: Use the old spbtr name instead of using the current
         * satp name to support binutils 2.29 which doesn't know about
         * the privileged ISA 1.10 yet.
         */
        index = pgd_index(addr);
        pgd = (pgd_t *)pfn_to_virt(csr_read(sptbr)) + index;
        pgd_k = init_mm.pgd + index;

        if (!pgd_present(*pgd_k))
            goto no_context;
        set_pgd(pgd, *pgd_k);

        p4d = p4d_offset(pgd, addr);
        p4d_k = p4d_offset(pgd_k, addr);
        if (!p4d_present(*p4d_k))
            goto no_context;

        pud = pud_offset(p4d, addr);
        pud_k = pud_offset(p4d_k, addr);
        if (!pud_present(*pud_k))
            goto no_context;

        /*
         * Since the vmalloc area is global, it is unnecessary
         * to copy individual PTEs
         */
        pmd = pmd_offset(pud, addr);
        pmd_k = pmd_offset(pud_k, addr);
        if (!pmd_present(*pmd_k))
            goto no_context;
        set_pmd(pmd, *pmd_k);

        /*
         * Make sure the actual PTE exists as well to
         * catch kernel vmalloc-area accesses to non-mapped
         * addresses. If we don't do this, this will just
         * silently loop forever.
         */
        pte_k = pte_offset_kernel(pmd_k, addr);
        if (!pte_present(*pte_k))
            goto no_context;

        return;
    }
}
```