Analysis of the Copy-On-Write (COW) Implementation

Understanding COW at a high level

Motivation: most processes call exec shortly after fork, replacing the child's entire logical address space, so copying all of the parent's data at fork time would be wasted work. Conversely, if parent and child do not exec after fork, their logical contents are identical, and as long as neither of them writes, the memory can simply stay shared; there is no need to copy at all.
Likewise, when a process allocates heap space that must be zero-filled, there is no need to zero fresh physical memory on every allocation (the pages can initially share one zero page), and shared libraries can save memory the same way.
Implementation: 1. At fork, the kernel does not copy the data in the logical address space. Instead, via the MMU it maps the parent's logical addresses to the same physical pages marked read-only, and keeps a reference count. When the parent or the child later tries to write, and the page's reference count is greater than 1, the write to the read-only region raises a page-fault exception. The fault handler inspects the cause, recognizes it as COW, allocates a new physical page, copies the faulting page's data into it, and repoints the faulting logical page at the new physical page; from then on that logical page is no longer shared between parent and child. 2. For zero-filled allocations, the MMU first maps all the logical addresses to a single read-only page; a write access likewise raises a page fault, and only then is physical memory actually allocated.
Potential downside: if fork is not followed by exec and both parent and child write heavily across the address space, a large stream of page faults is generated and performance may actually degrade.
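To make the trade-off concrete, here is a small user-space sketch (my own illustration, not kernel code): the child inherits a 64 MiB buffer without any copying, and only the pages it actually writes get duplicated, each duplication showing up as one minor page fault in getrusage:

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>
#include <sys/wait.h>
#include <unistd.h>

#define SZ (64 * 1024 * 1024) /* 64 MiB buffer, shared via COW after fork */

static long minor_faults(void)
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_minflt;
}

int main(void)
{
    char *buf = malloc(SZ);
    memset(buf, 1, SZ);              /* fault the pages in before forking */

    if (fork() == 0) {               /* child: all pages start out shared */
        long before = minor_faults();
        memset(buf, 2, SZ);          /* every write faults and copies a page */
        printf("child minor faults during write: %ld\n",
               minor_faults() - before);
        _exit(0);
    }
    wait(NULL);
    free(buf);
    return 0;
}
```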
Basic concepts and data structures involved

Process descriptor: task_struct, which contains the process's memory descriptor mm_struct (defined at include/linux/mm_types.h:386). The following fields of mm_struct are relevant to COW:
```c
pgd_t *pgd;
atomic_t mm_users;
atomic_t mm_count;
int map_count;
struct vm_area_struct *mmap;
struct rb_root mm_rb;
```
The structure vm_area_struct (defined at include/linux/mm_types.h:303) contains the start and end addresses of the vma (virtual memory area), pointers to the previous/next vma, a pointer back to the owning mm_struct, and so on:
```c
unsigned long vm_start;
unsigned long vm_end;
struct vm_area_struct *vm_next, *vm_prev;
struct mm_struct *vm_mm;
unsigned long vm_flags;
```
vmas correspond to the program's heap, stack, bss, data, text and other regions; a vma that maps no file is anonymous (stack, heap, etc.).
A pte (page table entry) is an entry in the virtual-address page table and corresponds to one physical page; it is defined as a 32/64-bit unsigned long, laid out as:
| physical page frame base | avail | G | PAT | D | A | PCD | PWT | U/S | R/W | P |
Here:
- P (bit 0): whether a physical page is present; 0 means not present (swapped out or not yet allocated)
- R/W (bit 1): whether the page is writable; 0 means read-only
- U/S (bit 2): access privilege; 0 means accessible only by the kernel
- A (bit 5): accessed bit; 1 means the page has been accessed
- D (bit 6): dirty bit; 1 means the page has been written
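These positions match the kernel's own flag definitions; a shortened excerpt from arch/x86/include/asm/pgtable_types.h:

```c
#define _PAGE_BIT_PRESENT   0  /* is present */
#define _PAGE_BIT_RW        1  /* writeable */
#define _PAGE_BIT_USER      2  /* userspace addressable */
#define _PAGE_BIT_PWT       3  /* page write through */
#define _PAGE_BIT_PCD       4  /* page cache disabled */
#define _PAGE_BIT_ACCESSED  5  /* was accessed (raised by CPU) */
#define _PAGE_BIT_DIRTY     6  /* was written to (raised by CPU) */

#define _PAGE_PRESENT  (_AT(pteval_t, 1) << _PAGE_BIT_PRESENT)
#define _PAGE_RW       (_AT(pteval_t, 1) << _PAGE_BIT_RW)
```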
x86's PAGE_SHIFT is 12, defined in arch/x86/include/asm/page_types.h.
Linux currently uses a unified 4-level page-table scheme, splitting the virtual address into the following fields for lookups:
| unused (16 bits) | PGD (9 bits) | PUD (9 bits) | PMD (9 bits) | PTE (9 bits) | OFFSET (PAGE_SHIFT = 12 bits) |
The upper 128 TB belongs to the kernel and the lower 128 TB to user space; the huge range in between is currently reserved (non-canonical).
Every physical and virtual page is by default a 4 KB-aligned block of data.
On x86, register CR2 holds the linear (virtual) address that caused the page fault, and CR3 holds the physical base address of the current process's page directory, from which page-table lookups start.
A process's virtual-to-physical translation is performed automatically by the MMU hardware. The MMU contains a TLB, a page-table cache that speeds up lookups and avoids repeated memory reads. Given a virtual address from the CPU, the MMU takes bits 47-39 as the offset into the PGD; added to the PGD base this yields the physical location of the PUD. Bits 38-30 then index the PUD to reach the PMD, bits 29-21 index the PMD to reach the PTE page, bits 20-12 select the PTE, and finally OFFSET is added to the page frame base to form the actual physical memory address.
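The index extraction can be sketched in a few lines of C (an illustration mirroring the kernel's pgd_index()/pud_index()/pmd_index()/pte_index() helpers, not kernel code):

```c
#include <stdio.h>

#define PAGE_SHIFT     12
#define PTRS_PER_LEVEL 512 /* 9 bits per level */

int main(void)
{
    unsigned long vaddr = 0x00007f1234567890UL; /* arbitrary user address */

    unsigned long pgd = (vaddr >> 39) & (PTRS_PER_LEVEL - 1);
    unsigned long pud = (vaddr >> 30) & (PTRS_PER_LEVEL - 1);
    unsigned long pmd = (vaddr >> 21) & (PTRS_PER_LEVEL - 1);
    unsigned long pte = (vaddr >> 12) & (PTRS_PER_LEVEL - 1);
    unsigned long off = vaddr & ((1UL << PAGE_SHIFT) - 1);

    printf("pgd=%lu pud=%lu pmd=%lu pte=%lu offset=0x%lx\n",
           pgd, pud, pmd, pte, off);
    return 0;
}
```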
The kernel's page tables live in swapper_pg_dir; the kernel part of each process's page tables is a copy of them, which avoids TLB and CR3 switches and flushes when the process traps into a system call.
Pages swapped out by memory reclaim come in two kinds, file-backed pages and anonymous pages. File-backed pages are read and written directly through their file, while anonymous pages must go through a dedicated swap area on disk. Anonymous pages include heap, stack, bss, pipe, data-segment, tmpfs pages and the like, typically obtained from user space via malloc, mmap, or brk/sbrk.
Linux performs swap reclaim with an LRU algorithm. The accessed bit in the x86 page table marks whether a page was recently touched: the MMU sets it automatically on access, and the kernel clears it later through a function call (e.g. ptep_test_and_clear_young).
A physical memory page is represented by struct page (defined at include/linux/mm_types.h:69). Its field atomic_t _mapcount counts how many page-table entries map the physical page; the counter is biased to start at -1, so -1 means no one maps the page any more.
hugetlb is the TLB page-table entry for huge pages; it points at a huge page, whose size equals the total memory covered by one PMD, by default 2^9 * 4K = 2M. The TLB reaches the huge page through the hugetlb entry, and allocated huge pages are handed to processes through hugetlbfs, similar to tmpfs.
rmap (reverse mapping) is the data structure used to go from an anonymous physical page back to all the ptes that map it; this reverse lookup is used, for example, to unmap the page from every process using it when the physical page is reclaimed.
Analysis target

Linux kernel: linux-5.11.4
Architecture: x86_64
COW in the fork system call

The main call chain is: kernel_clone -> copy_process -> copy_mm -> dup_mm -> dup_mmap -> copy_page_range -> copy_p4d_range -> copy_pud_range -> copy_pmd_range -> copy_pte_range -> copy_present_pte/copy_nonpresent_pte
fork enters the kernel through the x86_64 system-call entry defined at arch/x86/entry/entry_64.S:95, which executes call do_syscall_64 to reach the system-call dispatch logic. do_syscall_64 is defined at arch/x86/entry/common.c:39 and indexes the system call table; this table, generated from arch/x86/entry/syscalls/syscall_64.tbl, assigns fork its system-call number on 64-bit systems: 57 common fork sys_fork. Execution then reaches kernel_clone (defined at kernel/fork.c:2412), which carries the main fork logic.
```c
#ifdef CONFIG_X86_64
__visible noinstr void do_syscall_64(unsigned long nr, struct pt_regs *regs)
{
    nr = syscall_enter_from_user_mode(regs, nr);

    instrumentation_begin();
    if (likely(nr < NR_syscalls)) {
        nr = array_index_nospec(nr, NR_syscalls);
        regs->ax = sys_call_table[nr](regs);
#ifdef CONFIG_X86_X32_ABI
    } else if (likely((nr & __X32_SYSCALL_BIT) &&
              (nr & ~__X32_SYSCALL_BIT) < X32_NR_syscalls)) {
        nr = array_index_nospec(nr & ~__X32_SYSCALL_BIT,
                    X32_NR_syscalls);
        regs->ax = x32_sys_call_table[nr](regs);
#endif
    }
    instrumentation_end();
    syscall_exit_to_user_mode(regs);
}
#endif

#define instrumentation_begin() ({                  \
    asm volatile("%c0: nop\n\t"                     \
             ".pushsection .discard.instr_begin\n\t"\
             ".long %c0b - .\n\t"                   \
             ".popsection\n\t" : : "i" (__COUNTER__)); \
})

#define instrumentation_end() ({                    \
    asm volatile("%c0: nop\n\t"                     \
             ".pushsection .discard.instr_end\n\t"  \
             ".long %c0b - .\n\t"                   \
             ".popsection\n\t" : : "i" (__COUNTER__)); \
})
```
ps: here the book says that the cs and eip registers are loaded with the segment selector and offset fields of the i-th gate descriptor in the IDT, through which the CPU jumps to the first instruction of the selected interrupt handler; this step is performed automatically by the hardware.
```c
pid_t kernel_clone(struct kernel_clone_args *args)
{
    ...
    p = copy_process(NULL, trace, NUMA_NO_NODE, args);
    ...
    wake_up_new_task(p);
    ...
    put_pid(pid);
}
```
copy_process (defined at kernel/fork.c:1844) performs the process-copying logic:
```c
static __latent_entropy struct task_struct *copy_process(
                    struct pid *pid,
                    int trace,
                    int node,
                    struct kernel_clone_args *args)
{
    ...
    dup_task_struct
    ...
    shm_init_task(p);
    retval = security_task_alloc(p, clone_flags);
    retval = copy_semundo(clone_flags, p);
    retval = copy_files(clone_flags, p);
    retval = copy_fs(clone_flags, p);
    retval = copy_sighand(clone_flags, p);
    retval = copy_signal(clone_flags, p);
    retval = copy_mm(clone_flags, p);
    retval = copy_namespaces(clone_flags, p);
    retval = copy_io(clone_flags, p);
    retval = copy_thread(clone_flags, args->stack, args->stack_size, p,
                         args->tls);
    ...
}
```
Copying the memory

In copy_mm (defined at kernel/fork.c:1382):
```c
static int copy_mm(unsigned long clone_flags, struct task_struct *tsk)
{
    tsk->mm = NULL;
    tsk->active_mm = NULL;
    ...
    if (clone_flags & CLONE_VM) {
        mmget(oldmm);
        mm = oldmm;
        goto good_mm;
    }
    ...
    mm = dup_mm(tsk, current->mm);
    ...
good_mm:
    tsk->mm = mm;
    tsk->active_mm = mm;
    return 0;
    ...
}
```
The mmget used here is a helper dedicated to incrementing the mm's user count; it is defined (at include/linux/sched/mm.h:68) as:
```c
static inline void mmget(struct mm_struct *mm)
{
    atomic_inc(&mm->mm_users);
}
```
Its counterpart mmput, which decrements the count and releases the resources when it drops to 0, is defined (at kernel/fork.c:1074-1104) as:
```c
static inline void __mmput(struct mm_struct *mm)
{
    VM_BUG_ON(atomic_read(&mm->mm_users));

    uprobe_clear_state(mm);
    exit_aio(mm);
    ksm_exit(mm);
    khugepaged_exit(mm);
    exit_mmap(mm);
    mm_put_huge_zero_page(mm);
    set_mm_exe_file(mm, NULL);
    if (!list_empty(&mm->mmlist)) {
        spin_lock(&mmlist_lock);
        list_del(&mm->mmlist);
        spin_unlock(&mmlist_lock);
    }
    if (mm->binfmt)
        module_put(mm->binfmt->module);
    mmdrop(mm);
}

void mmput(struct mm_struct *mm)
{
    might_sleep();

    if (atomic_dec_and_test(&mm->mm_users))
        __mmput(mm);
}
EXPORT_SYMBOL_GPL(mmput);
```
ps: the might_sleep() here puzzled me at first; it is a debug annotation declaring that mmput may sleep (the teardown in __mmput can block), so calling it from atomic context triggers a warning when CONFIG_DEBUG_ATOMIC_SLEEP is enabled.
The non-CLONE_VM path goes on to dup_mm (at kernel/fork.c:1345) to copy the mm_struct:
```c
static struct mm_struct *dup_mm(struct task_struct *tsk,
                struct mm_struct *oldmm)
{
    ...
    mm = allocate_mm();
    memcpy(mm, oldmm, sizeof(*mm));

    if (!mm_init(mm, tsk, mm->user_ns))
        goto fail_nomem;

    err = dup_mmap(mm, oldmm);
    ...
    return mm;
    ...
}
```
mm_init (at kernel/fork.c:1004) makes some COW-related settings:
```c
mm->mmap = NULL;
mm->mm_rb = RB_ROOT;
atomic_set(&mm->mm_users, 1);
atomic_set(&mm->mm_count, 1);
mm->map_count = 0;
```
The main work of copying the mmap is done by dup_mmap (at kernel/fork.c:470-644), which distinguishes two cases:
1. An MMU is configured: #ifdef CONFIG_MMU.
2. CONFIG_MMU is not set: only RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm)); runs, RCU-initializing exe_file to point at the parent's exe_file so the two stay consistent. Under RCU, any number of readers may read concurrently, while a writer modifies a copy and then swaps it in to replace the original in one step.
For the case with an MMU configured:
```c
static __latent_entropy int dup_mmap(struct mm_struct *mm,
                    struct mm_struct *oldmm)
{
    uprobe_start_dup_mmap();
    ...
    flush_cache_dup_mm(oldmm);
    ...
    RCU_INIT_POINTER(mm->exe_file, get_mm_exe_file(oldmm));

    mm->total_vm = oldmm->total_vm;
    mm->data_vm = oldmm->data_vm;
    mm->exec_vm = oldmm->exec_vm;
    mm->stack_vm = oldmm->stack_vm;
    ...
    pprev = &mm->mmap;
    retval = ksm_fork(mm, oldmm);
    ...
    prev = NULL;
    for (mpnt = oldmm->mmap; mpnt; mpnt = mpnt->vm_next) {
        ...
        if (mpnt->vm_flags & VM_DONTCOPY) {
            vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
            continue;
        }
        ...
        if (fatal_signal_pending(current)) {
            retval = -EINTR;
            goto out;
        }
        ...
        tmp = vm_area_dup(mpnt);
        ...
        if (tmp->vm_flags & VM_WIPEONFORK) {
            tmp->anon_vma = NULL;
        } else if (anon_vma_fork(tmp, mpnt))
            goto fail_nomem_anon_vma_fork;
        ...
        if (is_vm_hugetlb_page(tmp))
            reset_vma_resv_huge_pages(tmp);

        *pprev = tmp;
        pprev = &tmp->vm_next;
        ...
        mm->map_count++;
        if (!(tmp->vm_flags & VM_WIPEONFORK))
            retval = copy_page_range(tmp, mpnt);
        ...
    }
    retval = arch_dup_mmap(oldmm, mm);
    ...
}
```
ps: at first glance pprev looks unused, but it actually threads the duplicated vmas onto the child's list: *pprev = tmp appends tmp to mm->mmap, and pprev then advances to &tmp->vm_next for the next iteration.
copy_page_range (defined at mm/memory.c:1126) copies the 4-level page tables; it handles two cases, hugetlb pages and normal pages.
```c
int copy_page_range(struct vm_area_struct *dst_vma,
                    struct vm_area_struct *src_vma)
{
    unsigned long addr = src_vma->vm_start;
    ...
    if (!(src_vma->vm_flags & (VM_HUGETLB | VM_PFNMAP | VM_MIXEDMAP)) &&
        !src_vma->anon_vma)
        return 0;

    if (is_vm_hugetlb_page(src_vma))
        return copy_hugetlb_page_range(dst_mm, src_mm, src_vma);

    if (unlikely(src_vma->vm_flags & VM_PFNMAP)) {
        ret = track_pfn_copy(src_vma);
        if (ret)
            return ret;
    }

    is_cow = is_cow_mapping(src_vma->vm_flags);

    if (is_cow) {
        mmu_notifier_range_init(&range, MMU_NOTIFY_PROTECTION_PAGE,
                    0, src_vma, src_mm, addr, end);
        mmu_notifier_invalidate_range_start(&range);
        mmap_assert_write_locked(src_mm);
        raw_write_seqcount_begin(&src_mm->write_protect_seq);
    }

    ret = 0;
    dst_pgd = pgd_offset(dst_mm, addr);
    src_pgd = pgd_offset(src_mm, addr);
    do {
        next = pgd_addr_end(addr, end);
        if (pgd_none_or_clear_bad(src_pgd))
            continue;
        if (unlikely(copy_p4d_range(dst_vma, src_vma, dst_pgd, src_pgd,
                        addr, next))) {
            ret = -ENOMEM;
            break;
        }
    } while (dst_pgd++, src_pgd++, addr = next, addr != end);

    if (is_cow) {
        raw_write_seqcount_end(&src_mm->write_protect_seq);
        mmu_notifier_invalidate_range_end(&range);
    }
    return ret;
}
```
A search of the literature shows that mmu_notifier_invalidate_range_start and mmu_notifier_invalidate_range_end were both merged in 2008 during the 2.6.27 merge window; they notify secondary MMUs to drop their mappings of the pages in this range, avoiding stale MMU caches.
is_cow_mapping (defined at mm/internal.h:299) reads:
```c
static inline bool is_cow_mapping(vm_flags_t flags)
{
    return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
}
```
VM_SHARED is 0x00000008 and VM_MAYWRITE is 0x00000020, so the expression tests whether the vma is private yet writable; if so, the vma is judged to be COW.
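From user space, whether a mapping ends up COW follows directly from the mmap flags; a hedged illustration (MAP_PRIVATE leaves VM_SHARED clear, MAP_SHARED sets it):

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>

int main(void)
{
    /* Private + writable: VM_MAYWRITE set, VM_SHARED clear -> COW mapping.
       Writes fault and get a private copy of each page. */
    void *cow = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* Shared + writable: VM_SHARED set -> is_cow_mapping() is false.
       Writes go straight to the shared pages. */
    void *shared = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                        MAP_SHARED | MAP_ANONYMOUS, -1, 0);

    munmap(cow, 4096);
    munmap(shared, 4096);
    return 0;
}
```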
Copying huge-page (hugetlb) mappings

For hugetlb pages, copy_hugetlb_page_range (defined at mm/hugetlb.c:3779) runs the following logic:
```c
int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
                struct vm_area_struct *vma)
{
    ...
    cow = (vma->vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;

    if (cow) {
        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, src,
                    vma->vm_start,
                    vma->vm_end);
        mmu_notifier_invalidate_range_start(&range);
    } else {
        i_mmap_lock_read(mapping);
    }

    for (addr = vma->vm_start; addr < vma->vm_end; addr += sz) {
        ...
        if (huge_pte_none(entry) || !huge_pte_none(dst_entry)) {
            ;
        } else if (unlikely(is_hugetlb_entry_migration(entry) ||
                    is_hugetlb_entry_hwpoisoned(entry))) {
            if (is_write_migration_entry(swp_entry) && cow) {
                make_migration_entry_read(&swp_entry);
                entry = swp_entry_to_pte(swp_entry);
                set_huge_swap_pte_at(src, addr, src_pte,
                             entry, sz);
            }
            set_huge_swap_pte_at(dst, addr, dst_pte, entry, sz);
        } else {
            if (cow) {
                huge_ptep_set_wrprotect(src, addr, src_pte);
            }
            entry = huge_ptep_get(src_pte);
            ptepage = pte_page(entry);
            get_page(ptepage);
            page_dup_rmap(ptepage, true);
            set_huge_pte_at(dst, addr, dst_pte, entry);
            hugetlb_count_add(pages_per_huge_page(h), dst);
        }
        ...
    }

    if (cow)
        mmu_notifier_invalidate_range_end(&range);
    else
        i_mmap_unlock_read(mapping);
    ...
}
```
Copying default normal pages

From copy_page_range we drop into copy_p4d_range (at mm/memory.c:1103), which mainly loops over the puds under the pgd and copies them:
```c
static inline int copy_p4d_range(struct vm_area_struct *dst_vma,
                 struct vm_area_struct *src_vma,
                 pgd_t *dst_pgd, pgd_t *src_pgd,
                 unsigned long addr, unsigned long end)
{
    ...
    do {
        ...
        if (copy_pud_range(dst_vma, src_vma, dst_p4d, src_p4d,
                   addr, next))
            return -ENOMEM;
    } while (dst_p4d++, src_p4d++, addr = next, addr != end);
    return 0;
}
```
copy_pud_range (at mm/memory.c:1066) loops over all pmds under a pud and copies them:
```c
static inline int copy_pud_range(struct vm_area_struct *dst_vma,
                 struct vm_area_struct *src_vma,
                 p4d_t *dst_p4d, p4d_t *src_p4d,
                 unsigned long addr, unsigned long end)
{
    ...
    do {
        ...
        if (copy_pmd_range(dst_vma, src_vma, dst_pud, src_pud,
                   addr, next))
            return -ENOMEM;
    } while (dst_pud++, src_pud++, addr = next, addr != end);
    return 0;
}
```
copy_pmd_range (at mm/memory.c:1029) loops over the ptes to copy. There is a branch here: a huge pmd, a pmd whose entry is in swap, or a device-mapped pmd goes to copy_huge_pmd, while normal pages go through copy_pte_range:
```c
static inline int copy_pmd_range(struct vm_area_struct *dst_vma,
                 struct vm_area_struct *src_vma,
                 pud_t *dst_pud, pud_t *src_pud,
                 unsigned long addr, unsigned long end)
{
    ...
    do {
        if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd)
            || pmd_devmap(*src_pmd)) {
            ...
            err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
                        addr, src_vma);
            if (err == -ENOMEM)
                return -ENOMEM;
            if (!err)
                continue;
        }
        ...
        if (copy_pte_range(dst_vma, src_vma, dst_pmd, src_pmd,
                   addr, next))
            return -ENOMEM;
    } while (dst_pmd++, src_pmd++, addr = next, addr != end);
    return 0;
}
```
Copying a huge pmd: copy_huge_pmd

The copy_huge_pmd code is at mm/huge_memory.c:1011:
```c
int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
          pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
          struct vm_area_struct *vma)
{
    ...
    if (is_huge_zero_pmd(pmd)) {
        zero_page = mm_get_huge_zero_page(dst_mm);
        set_huge_zero_page(pgtable, dst_mm, vma, addr, dst_pmd,
                   zero_page);
        ...
    }

    if (unlikely(is_cow_mapping(vma->vm_flags) &&
             atomic_read(&src_mm->has_pinned) &&
             page_maybe_dma_pinned(src_page))) {
        ...
    }

    get_page(src_page);
    page_dup_rmap(src_page, true);
    ...
    pmdp_set_wrprotect(src_mm, addr, src_pmd);
    pmd = pmd_mkold(pmd_wrprotect(pmd));
    set_pmd_at(dst_mm, addr, dst_pmd, pmd);
    ...
}
```
Copying all ptes under a normal pmd: copy_pte_range

The code handling normal pte copies is at mm/memory.c:923:
```c
static int copy_pte_range(struct vm_area_struct *dst_vma,
              struct vm_area_struct *src_vma,
              pmd_t *dst_pmd, pmd_t *src_pmd,
              unsigned long addr, unsigned long end)
{
    ...
again:
    progress = 0;
    ...
    do {
        if (progress >= 32) {
            progress = 0;
            if (need_resched() ||
                spin_needbreak(src_ptl) || spin_needbreak(dst_ptl))
                break;
        }
        if (pte_none(*src_pte)) {
            progress++;
            continue;
        }
        if (unlikely(!pte_present(*src_pte))) {
            entry.val = copy_nonpresent_pte(dst_mm, src_mm,
                            dst_pte, src_pte,
                            src_vma, addr, rss);
            if (entry.val)
                break;
            progress += 8;
            continue;
        }
        ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
                       addr, rss, &prealloc);
        ...
        progress += 8;
    } while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
    ...
}
```
How to decide whether a pte is present in memory is architecture-specific; on x86_64 it boils down to testing bit 0 (the P bit) of the pte.
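For reference, the actual x86 helper checks slightly more than the P bit: a PROT_NONE page, whose P bit is deliberately cleared, is still treated as present (from arch/x86/include/asm/pgtable.h):

```c
static inline int pte_present(pte_t a)
{
    return pte_flags(a) & (_PAGE_PRESENT | _PAGE_PROTNONE);
}
```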
Copying a pte that is not in physical memory

copy_nonpresent_pte is at mm/memory.c:698:
```c
static unsigned long copy_nonpresent_pte(struct mm_struct *dst_mm,
                     struct mm_struct *src_mm,
                     pte_t *dst_pte, pte_t *src_pte,
                     struct vm_area_struct *vma,
                     unsigned long addr, int *rss)
{
    pte_t pte = *src_pte;
    ...
    swp_entry_t entry = pte_to_swp_entry(pte);
    ...
    if (likely(!non_swap_entry(entry))) {
        ...
    } else if (is_migration_entry(entry)) {
        page = migration_entry_to_page(entry);
        rss[mm_counter(page)]++;
        if (is_write_migration_entry(entry) &&
                is_cow_mapping(vm_flags)) {
            make_migration_entry_read(&entry);
            pte = swp_entry_to_pte(entry);
            if (pte_swp_soft_dirty(*src_pte))
                pte = pte_swp_mksoft_dirty(pte);
            ...
            set_pte_at(src_mm, addr, src_pte, pte);
        }
    } else if (is_device_private_entry(entry)) {
        page = device_private_entry_to_page(entry);
        get_page(page);
        rss[mm_counter(page)]++;
        page_dup_rmap(page, false);
        if (is_write_device_private_entry(entry) &&
            is_cow_mapping(vm_flags)) {
            make_device_private_entry_read(&entry);
            ...
            set_pte_at(src_mm, addr, src_pte, pte);
        }
    }
    set_pte_at(dst_mm, addr, dst_pte, pte);
    return 0;
}
```
Copying a pte that is in physical memory

The code is at mm/memory.c:851:
```c
static inline int copy_present_pte(struct vm_area_struct *dst_vma,
                   struct vm_area_struct *src_vma,
                   pte_t *dst_pte, pte_t *src_pte,
                   unsigned long addr, int *rss,
                   struct page **prealloc)
{
    ...
    page = vm_normal_page(src_vma, addr, pte);
    if (page) {
        ...
        retval = copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
                       addr, rss, prealloc, pte, page);
        if (retval <= 0)
            return retval;
        ...
        get_page(page);
        page_dup_rmap(page, false);
        rss[mm_counter(page)]++;
    }

    if (is_cow_mapping(vm_flags) && pte_write(pte)) {
        ptep_set_wrprotect(src_mm, addr, src_pte);
        pte = pte_wrprotect(pte);
    }

    if (vm_flags & VM_SHARED)
        pte = pte_mkclean(pte);
    pte = pte_mkold(pte);
    ...
    set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
    return 0;
}
```
vm_normal_page resolves the physical struct page associated with a virtual page's pte:
```c
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
                pte_t pte)
{
    unsigned long pfn = pte_pfn(pte);

    return pfn_to_page(pfn);
}
```
pte_pfn for x86 is defined at arch/x86/include/asm/pgtable.h:212:
```c
static inline unsigned long pte_pfn(pte_t pte)
{
    phys_addr_t pfn = pte_val(pte);
    pfn ^= protnone_mask(pfn);
    return (pfn & PTE_PFN_MASK) >> PAGE_SHIFT;
}
```
The constants involved are defined as follows; PAGE_SIZE defaults to 4K pages, and with the default 4K pages PTE_PFN_MASK works out to 000ffffffffff000:
```c
#define PTE_PFN_MASK        ((pteval_t)PHYSICAL_PAGE_MASK)
#define PHYSICAL_PAGE_MASK  (((signed long)PAGE_MASK) & __PHYSICAL_MASK)
#define PAGE_MASK           (~(PAGE_SIZE-1))
#define PAGE_SHIFT          12
#define __PHYSICAL_MASK_SHIFT 52

#ifdef CONFIG_DYNAMIC_PHYSICAL_MASK
extern phys_addr_t physical_mask;
#define __PHYSICAL_MASK     physical_mask
#else
#define __PHYSICAL_MASK     ((phys_addr_t)((1ULL << __PHYSICAL_MASK_SHIFT) - 1))
#endif
```
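A quick sanity check of that value (my own arithmetic sketch, not kernel code):

```c
#include <stdio.h>

int main(void)
{
    unsigned long long page_mask = ~(4096ULL - 1);   /* 0xfffffffffffff000 */
    unsigned long long phys_mask = (1ULL << 52) - 1; /* bits 51..0         */

    /* Keeps bits 51..12: exactly the page frame number field of the pte. */
    printf("PTE_PFN_MASK = %#llx\n", page_mask & phys_mask);
    /* prints 0xffffffffff000, i.e. 000ffffffffff000 with leading zeros */
    return 0;
}
```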
copy_present_page checks whether the page may be pinned and whether the mapping is private and writable: it returns 1 (meaning "just share the page") unless the mapping is COW and the source page may be DMA-pinned, in which case it copies the page data right away, returning a negative value on error. The code is at mm/memory.c:796:
```c
static inline int copy_present_page(struct vm_area_struct *dst_vma,
                    struct vm_area_struct *src_vma,
                    pte_t *dst_pte, pte_t *src_pte,
                    unsigned long addr, int *rss,
                    struct page **prealloc, pte_t pte,
                    struct page *page)
{
    ...
    if (!is_cow_mapping(src_vma->vm_flags))
        return 1;
    if (likely(!atomic_read(&src_mm->has_pinned)))
        return 1;
    if (likely(!page_maybe_dma_pinned(page)))
        return 1;
    ...
    copy_user_highpage(new_page, page, addr, src_vma);
    __SetPageUptodate(new_page);
    page_add_new_anon_rmap(new_page, dst_vma, addr, false);
    lru_cache_add_inactive_or_unevictable(new_page, dst_vma);
    rss[mm_counter(new_page)]++;

    pte = mk_pte(new_page, dst_vma->vm_page_prot);
    pte = maybe_mkwrite(pte_mkdirty(pte), dst_vma);
    set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
    return 0;
}
```
page_add_new_anon_rmap (defined at mm/rmap.c:1175) calls __page_set_anon_rmap (defined at mm/rmap.c:1039) to record the linear-address index of the anonymous physical page:
```c
anon_vma = (void *) anon_vma + PAGE_MAPPING_ANON;
WRITE_ONCE(page->mapping, (struct address_space *) anon_vma);
page->index = linear_page_index(vma, address);
```
linear_page_index (at include/linux/pagemap.h:551) reads as follows; as a worked example, with vm_start = 0x7f0000000000 and vm_pgoff = 0, address 0x7f0000003045 yields pgoff = 3 (the fourth page of the vma):
```c
static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
                    unsigned long address)
{
    pgoff_t pgoff;
    if (unlikely(is_vm_hugetlb_page(vma)))
        return linear_hugepage_index(vma, address);
    pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
    pgoff += vma->vm_pgoff;
    return pgoff;
}
```
Triggering and handling the COW page fault

For a normal page, the handling chain after a write hits a COW-protected page is: DEFINE_IDTENTRY_RAW_ERRORCODE -> handle_page_fault -> do_user_addr_fault -> handle_mm_fault -> __handle_mm_fault -> handle_pte_fault -> do_wp_page -> wp_page_copy/wp_page_reuse
On x86, when a COW page is accessed for writing, the MMU resolves the linear (virtual) address and finds a page whose R/W bit is 0. Per sections 4.7 and 6.2 of the Intel manual: a write to a page with R/W = 0 raises a Page Fault, vector 14, mnemonic #PF. In the 32-bit error code it reports, bit 1 is set to 1, indicating that the fault was caused by a write-permission violation.
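The kernel gives these error-code bits names (shortened excerpt; in 5.11 they live in arch/x86/include/asm/trap_pf.h). A user-mode COW write fault thus arrives with X86_PF_PROT | X86_PF_WRITE | X86_PF_USER set:

```c
enum x86_pf_error_code {
    X86_PF_PROT  = 1 << 0,  /* 0: no page found, 1: protection fault */
    X86_PF_WRITE = 1 << 1,  /* 0: read access,   1: write access     */
    X86_PF_USER  = 1 << 2,  /* 0: kernel-mode,   1: user-mode access */
    X86_PF_RSVD  = 1 << 3,  /* use of reserved bit detected          */
    X86_PF_INSTR = 1 << 4,  /* fault was an instruction fetch        */
    X86_PF_PK    = 1 << 5,  /* protection keys block access          */
};
```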
The entry point that handles this fault is defined in arch/x86/entry/entry_64.S.
ps: how entry_64.S dispatches to the next handler is something I have not fully traced yet.
Execution then jumps to DEFINE_IDTENTRY_RAW_ERRORCODE at arch/x86/mm/fault.c:1469, which runs:
```c
instrumentation_begin();
handle_page_fault(regs, error_code, address);
instrumentation_end();
```
handle_page_fault (defined at arch/x86/mm/fault.c:1445) is responsible for the page fault; it splits into two paths, one for faults on kernel addresses and one for faults on user-space addresses:
```c
if (unlikely(fault_in_kernel_space(address))) {
    do_kern_addr_fault(regs, error_code, address);
} else {
    do_user_addr_fault(regs, error_code, address);
    ...
}
```
do_user_addr_fault (defined at arch/x86/mm/fault.c:1240) carries the COW handling:
```c
static inline void do_user_addr_fault(struct pt_regs *regs,
                      unsigned long hw_error_code,
                      unsigned long address)
{
    ...
    if (hw_error_code & X86_PF_WRITE)
        flags |= FAULT_FLAG_WRITE;
    ...
    if (unlikely(!mmap_read_trylock(mm))) {
        ...
    }
    ...
    fault = handle_mm_fault(vma, address, flags, regs);
    ...
    mmap_read_unlock(mm);
    ...
}
```
handle_mm_fault is defined at mm/memory.c:4592; it branches on whether the page is a huge page:
```c
vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
               unsigned int flags, struct pt_regs *regs)
{
    __set_current_state(TASK_RUNNING);

    count_vm_event(PGFAULT);
    count_memcg_event_mm(vma->vm_mm, PGFAULT);
    ...
    if (unlikely(is_vm_hugetlb_page(vma)))
        ret = hugetlb_fault(vma->vm_mm, vma, address, flags);
    else
        ret = __handle_mm_fault(vma, address, flags);
    ...
}
```
Handling a hugetlb page fault

The hugetlb_fault logic is defined at mm/hugetlb.c:4507:
```c
vm_fault_t hugetlb_fault(struct mm_struct *mm, struct vm_area_struct *vma,
             unsigned long address, unsigned int flags)
{
    ...
    if ((flags & FAULT_FLAG_WRITE) && !huge_pte_write(entry)) {
        if (vma_needs_reservation(h, vma, haddr) < 0) {
            ret = VM_FAULT_OOM;
            goto out_mutex;
        }
        vma_end_reservation(h, vma, haddr);

        if (!(vma->vm_flags & VM_MAYSHARE))
            pagecache_page = hugetlbfs_pagecache_page(h,
                                vma, haddr);
    }
    ...
    page = pte_page(entry);
    if (page != pagecache_page)
        if (!trylock_page(page)) {
            need_wait_lock = 1;
            goto out_ptl;
        }

    get_page(page);

    if (flags & FAULT_FLAG_WRITE) {
        if (!huge_pte_write(entry)) {
            ret = hugetlb_cow(mm, vma, address, ptep,
                      pagecache_page, ptl);
            goto out_put_page;
        }
        entry = huge_pte_mkdirty(entry);
    }
    entry = pte_mkyoung(entry);
    if (huge_ptep_set_access_flags(vma, haddr, ptep, entry,
                        flags & FAULT_FLAG_WRITE))
        update_mmu_cache(vma, haddr, ptep);
out_put_page:
    if (page != pagecache_page)
        unlock_page(page);
    put_page(page);
}
```
hugetlb_cow is defined at mm/hugetlb.c:4098; its COW-related logic is:
```c
...
if (page_mapcount(old_page) == 1 && PageAnon(old_page)) {
    page_move_anon_rmap(old_page, vma);
    set_huge_ptep_writable(vma, haddr, ptep);
    return 0;
}
...
new_page = alloc_huge_page(vma, haddr, outside_reserve);
...
copy_user_huge_page(new_page, old_page, address, vma,
            pages_per_huge_page(h));
mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm, haddr,
            haddr + huge_page_size(h));
mmu_notifier_invalidate_range_start(&range);

spin_lock(ptl);
ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
    ClearPagePrivate(new_page);

    huge_ptep_clear_flush(vma, haddr, ptep);
    mmu_notifier_invalidate_range(mm, range.start, range.end);
    set_huge_pte_at(mm, haddr, ptep,
            make_huge_pte(vma, new_page, 1));
    page_remove_rmap(old_page, true);
    hugepage_add_new_anon_rmap(new_page, vma, haddr);
    set_page_huge_active(new_page);
    new_page = old_page;
}
spin_unlock(ptl);
mmu_notifier_invalidate_range_end(&range);
out_release_all:
    restore_reserve_on_error(h, vma, haddr, new_page);
    put_page(new_page);
out_release_old:
    put_page(old_page);
```
Handling a normal page fault

__handle_mm_fault is defined at mm/memory.c:4436:
```c
static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
        unsigned long address, unsigned int flags)
{
    struct vm_fault vmf = {
        .vma = vma,
        .address = address & PAGE_MASK,
        .flags = flags,
        .pgoff = linear_page_index(vma, address),
        .gfp_mask = __get_fault_gfp_mask(vma),
    };
    struct mm_struct *mm = vma->vm_mm;
    ...
    pgd = pgd_offset(mm, address);
    p4d = p4d_alloc(mm, pgd, address);
    ...
    vmf.pud = pud_alloc(mm, p4d, address);
    ...
    vmf.pmd = pmd_alloc(mm, vmf.pud, address);
    ...
    return handle_pte_fault(&vmf);
}
```
handle_pte_fault is defined at mm/memory.c:4343:
```c
static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
    pte_t entry;
    ...
    if (!vmf->pte) {
        if (vma_is_anonymous(vmf->vma))
            return do_anonymous_page(vmf);
        else
            return do_fault(vmf);
    }

    if (!pte_present(vmf->orig_pte))
        return do_swap_page(vmf);

    if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
        return do_numa_page(vmf);

    vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
    spin_lock(vmf->ptl);
    entry = vmf->orig_pte;
    if (unlikely(!pte_same(*vmf->pte, entry))) {
        update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
        goto unlock;
    }
    if (vmf->flags & FAULT_FLAG_WRITE) {
        if (!pte_write(entry))
            return do_wp_page(vmf);
        entry = pte_mkdirty(entry);
    }
    entry = pte_mkyoung(entry);
    if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
                vmf->flags & FAULT_FLAG_WRITE)) {
        update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
    } else {
        if (vmf->flags & FAULT_FLAG_TRIED)
            goto unlock;

        if (vmf->flags & FAULT_FLAG_WRITE)
            flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
    }
unlock:
    pte_unmap_unlock(vmf->pte, vmf->ptl);
    return 0;
}
```
Anonymous pages are handled in do_anonymous_page (at mm/memory.c:3482):
```c
static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
{
    ...
    if (pte_alloc(vma->vm_mm, vmf->pmd))
        return VM_FAULT_OOM;
    ...
    page = alloc_zeroed_user_highpage_movable(vma, vmf->address);
    ...
    __SetPageUptodate(page);

    entry = mk_pte(page, vma->vm_page_prot);
    entry = pte_sw_mkyoung(entry);
    if (vma->vm_flags & VM_WRITE)
        entry = pte_mkwrite(pte_mkdirty(entry));

    vmf->pte = pte_offset_map_lock(vma->vm_mm, vmf->pmd, vmf->address,
            &vmf->ptl);
    ...
    set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);

    update_mmu_cache(vma, vmf->address, vmf->pte);
    pte_unmap_unlock(vmf->pte, vmf->ptl);
    ...
}
```
mk_pte for x86 is defined at arch/x86/include/asm/pgtable.h:845:
```c
#define mk_pte(page, pgprot)   pfn_pte(page_to_pfn(page), (pgprot))
```
pfn_pte (at arch/x86/include/asm/pgtable.h:603) and page_to_pfn (at include/asm-generic:55) are respectively:
```c
static inline pte_t pfn_pte(unsigned long page_nr, pgprot_t pgprot)
{
    phys_addr_t pfn = (phys_addr_t)page_nr << PAGE_SHIFT;
    pfn ^= protnone_mask(pgprot_val(pgprot));
    pfn &= PTE_PFN_MASK;
    return __pte(pfn | check_pgprot(pgprot));
}

#define __page_to_pfn(page) (unsigned long)((page) - vmemmap)
```
The first access to a file-backed mmap region is handled by do_fault (defined at mm/memory.c:4111):
```c
static vm_fault_t do_fault(struct vm_fault *vmf)
{
    if (!vma->vm_ops->fault) {
        ...
    } else if (!(vmf->flags & FAULT_FLAG_WRITE))
        ret = do_read_fault(vmf);
    else if (!(vma->vm_flags & VM_SHARED))
        ret = do_cow_fault(vmf);
    else
        ret = do_shared_fault(vmf);
    ...
}
```
do_read_fault, do_cow_fault and do_shared_fault all go through __do_fault. The COW handling is: first allocate a COW page, then call __do_fault (which internally runs vma->vm_ops->fault(vmf), i.e. the fault function installed in vm_ops), copy the page contents into the COW page, then set the pte via finish_fault, and finally drop the references on the old and new pages with put_page:
```c
static vm_fault_t do_cow_fault(struct vm_fault *vmf)
{
    vmf->cow_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma, vmf->address);

    ret = __do_fault(vmf);

    copy_user_highpage(vmf->cow_page, vmf->page, vmf->address, vma);
    __SetPageUptodate(vmf->cow_page);

    ret |= finish_fault(vmf);

    put_page(vmf->page);
    put_page(vmf->cow_page);
    ...
}
```
finish_fault calls alloc_set_pte, which builds the new pte and installs it:
```c
bool write = vmf->flags & FAULT_FLAG_WRITE;
...
entry = mk_pte(page, vma->vm_page_prot);
entry = pte_sw_mkyoung(entry);
if (write)
    entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (write && !(vma->vm_flags & VM_SHARED)) {
    inc_mm_counter_fast(vma->vm_mm, MM_ANONPAGES);
    page_add_new_anon_rmap(page, vma, vmf->address, false);
    lru_cache_add_inactive_or_unevictable(page, vma);
} else {
    inc_mm_counter_fast(vma->vm_mm, mm_counter_file(page));
    page_add_file_rmap(page, false);
}
set_pte_at(vma->vm_mm, vmf->address, vmf->pte, entry);

update_mmu_cache(vma, vmf->address, vmf->pte);
```
do_wp_page (code at mm/memory.c:3085) performs the COW handling as follows:
```c
vmf->page = vm_normal_page(vma, vmf->address, vmf->orig_pte);
if (!vmf->page) {
    if ((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
                 (VM_WRITE|VM_SHARED))
        return wp_pfn_shared(vmf);

    pte_unmap_unlock(vmf->pte, vmf->ptl);
    return wp_page_copy(vmf);
}

if (PageAnon(vmf->page)) {
    struct page *page = vmf->page;

    if (PageKsm(page) || page_count(page) != 1)
        goto copy;
    if (!trylock_page(page))
        goto copy;
    if (PageKsm(page) || page_mapcount(page) != 1 ||
        page_count(page) != 1) {
        unlock_page(page);
        goto copy;
    }
    unlock_page(page);
    wp_page_reuse(vmf);
    return VM_FAULT_WRITE;
} else if (unlikely((vma->vm_flags & (VM_WRITE|VM_SHARED)) ==
                    (VM_WRITE|VM_SHARED))) {
    return wp_page_shared(vmf);
}
copy:
get_page(vmf->page);

pte_unmap_unlock(vmf->pte, vmf->ptl);
return wp_page_copy(vmf);
```
wp_page_copy (defined at mm/memory.c:2828) carries out the page-data copy:
```c
static vm_fault_t wp_page_copy(struct vm_fault *vmf)
{
    struct vm_area_struct *vma = vmf->vma;
    struct mm_struct *mm = vma->vm_mm;
    struct page *old_page = vmf->page;
    struct page *new_page = NULL;
    pte_t entry;
    int page_copied = 0;
    struct mmu_notifier_range range;

    if (unlikely(anon_vma_prepare(vma)))
        goto oom;

    if (is_zero_pfn(pte_pfn(vmf->orig_pte))) {
        new_page = alloc_zeroed_user_highpage_movable(vma,
                                  vmf->address);
        if (!new_page)
            goto oom;
    } else {
        new_page = alloc_page_vma(GFP_HIGHUSER_MOVABLE, vma,
                vmf->address);
        if (!new_page)
            goto oom;

        if (!cow_user_page(new_page, old_page, vmf)) {
            put_page(new_page);
            if (old_page)
                put_page(old_page);
            return 0;
        }
    }

    if (mem_cgroup_charge(new_page, mm, GFP_KERNEL))
        goto oom_free_new;
    cgroup_throttle_swaprate(new_page, GFP_KERNEL);

    __SetPageUptodate(new_page);

    mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, mm,
                vmf->address & PAGE_MASK,
                (vmf->address & PAGE_MASK) + PAGE_SIZE);
    mmu_notifier_invalidate_range_start(&range);

    vmf->pte = pte_offset_map_lock(mm, vmf->pmd, vmf->address, &vmf->ptl);
    if (likely(pte_same(*vmf->pte, vmf->orig_pte))) {
        if (old_page) {
            if (!PageAnon(old_page)) {
                dec_mm_counter_fast(mm,
                        mm_counter_file(old_page));
                inc_mm_counter_fast(mm, MM_ANONPAGES);
            }
        } else {
            inc_mm_counter_fast(mm, MM_ANONPAGES);
        }
        flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
        entry = mk_pte(new_page, vma->vm_page_prot);
        entry = pte_sw_mkyoung(entry);
        entry = maybe_mkwrite(pte_mkdirty(entry), vma);

        ptep_clear_flush_notify(vma, vmf->address, vmf->pte);
        page_add_new_anon_rmap(new_page, vma, vmf->address, false);
        lru_cache_add_inactive_or_unevictable(new_page, vma);

        set_pte_at_notify(mm, vmf->address, vmf->pte, entry);
        update_mmu_cache(vma, vmf->address, vmf->pte);
        if (old_page) {
            page_remove_rmap(old_page, false);
        }

        new_page = old_page;
        page_copied = 1;
    } else {
        update_mmu_tlb(vma, vmf->address, vmf->pte);
    }

    if (new_page)
        put_page(new_page);

    pte_unmap_unlock(vmf->pte, vmf->ptl);
    mmu_notifier_invalidate_range_only_end(&range);
    if (old_page) {
        if (page_copied && (vma->vm_flags & VM_LOCKED)) {
            lock_page(old_page);
            if (PageMlocked(old_page))
                munlock_vma_page(old_page);
            unlock_page(old_page);
        }
        put_page(old_page);
    }
    return page_copied ? VM_FAULT_WRITE : 0;
oom_free_new:
    put_page(new_page);
oom:
    if (old_page)
        put_page(old_page);
    return VM_FAULT_OOM;
}
```
wp_page_reuse (defined at mm/memory.c:2789) re-marks the pte writable and updates the MMU's TLB, among other things:
```c
static inline void wp_page_reuse(struct vm_fault *vmf)
    __releases(vmf->ptl)
{
    struct vm_area_struct *vma = vmf->vma;
    struct page *page = vmf->page;
    pte_t entry;

    if (page)
        page_cpupid_xchg_last(page, (1 << LAST_CPUPID_SHIFT) - 1);

    flush_cache_page(vma, vmf->address, pte_pfn(vmf->orig_pte));
    entry = pte_mkyoung(vmf->orig_pte);
    entry = maybe_mkwrite(pte_mkdirty(entry), vma);
    if (ptep_set_access_flags(vma, vmf->address, vmf->pte, entry, 1))
        update_mmu_cache(vma, vmf->address, vmf->pte);
    pte_unmap_unlock(vmf->pte, vmf->ptl);
    count_vm_event(PGREUSE);
}
```
Shared pages are made writable through wp_page_shared, or have their state updated through wp_page_reuse:
```c
static vm_fault_t wp_page_shared(struct vm_fault *vmf)
    __releases(vmf->ptl)
{
    struct vm_area_struct *vma = vmf->vma;
    vm_fault_t ret = VM_FAULT_WRITE;

    get_page(vmf->page);

    if (vma->vm_ops && vma->vm_ops->page_mkwrite) {
        vm_fault_t tmp;

        pte_unmap_unlock(vmf->pte, vmf->ptl);
        tmp = do_page_mkwrite(vmf);
        if (unlikely(!tmp || (tmp &
                      (VM_FAULT_ERROR | VM_FAULT_NOPAGE)))) {
            put_page(vmf->page);
            return tmp;
        }
        tmp = finish_mkwrite_fault(vmf);
        if (unlikely(tmp & (VM_FAULT_ERROR | VM_FAULT_NOPAGE))) {
            unlock_page(vmf->page);
            put_page(vmf->page);
            return tmp;
        }
    } else {
        wp_page_reuse(vmf);
        lock_page(vmf->page);
    }
    ret |= fault_dirty_shared_page(vmf);
    put_page(vmf->page);

    return ret;
}
```
At this point the COW machinery is complete; after the page fault returns, the system re-executes the action that raised the fault.
ps: how is the faulting action re-executed? The ret_from_exception and resume_userspace functions mentioned in the book cannot be found under x86_64. At the end of the interrupt-handling path in entry_64.S I see jne swapgs_restore_regs_and_return_to_usermode and SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi, which return to the interrupted process to resume where it left off; the literature confirms this is supported by the hardware (a page fault is a fault-class exception, so the saved instruction pointer points at the faulting instruction itself, which is simply run again).
Summary

On fork, parent and child share all private writable physical pages, and the corresponding page table entries are set read-only. When either side attempts a write, a COW page fault is raised; the fault handler allocates a new physical page for the writer, copies the contents of the formerly shared page into it, remaps the writer's page table entry to the new physical page, and marks it writable. If the fault handler finds that the shared page has only one remaining user, it simply marks that page writable in place.
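As a closing illustration (a user-space sketch of my own, with the caveat that reading real PFNs from /proc/self/pagemap requires CAP_SYS_ADMIN; unprivileged reads show the PFN field as 0), one can watch the child's physical frame match the parent's until the first write:

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Read the pagemap entry for addr; bits 0-54 hold the PFN. */
static uint64_t pfn_of(const void *addr)
{
    uint64_t entry = 0;
    int fd = open("/proc/self/pagemap", O_RDONLY);
    pread(fd, &entry, sizeof(entry),
          ((uintptr_t)addr / 4096) * sizeof(entry));
    close(fd);
    return entry & ((1ULL << 55) - 1);
}

int main(void)
{
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    p[0] = 1; /* fault the page in before forking */

    if (fork() == 0) {
        printf("child  before write: pfn=%#llx\n",
               (unsigned long long)pfn_of(p)); /* same frame as parent */
        p[0] = 2;                              /* COW fault: page copied */
        printf("child  after  write: pfn=%#llx\n",
               (unsigned long long)pfn_of(p)); /* now a different frame */
        _exit(0);
    }
    printf("parent             : pfn=%#llx\n", (unsigned long long)pfn_of(p));
    wait(NULL);
    return 0;
}
```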
Memory management is a complex subsystem; this area needs further study, and some of its mechanisms are best understood alongside the concepts in Understanding the Linux Kernel.