[System Operations] How NUMA mempolicy affects memory allocation, as seen through hugetlbfs

With a free weekend at home, I looked into how mempolicy affects page allocation. The analysis is based on Linux kernel 4.19.195.

First, let's see which memory allocation policies the kernel supports (here "memory allocation policy" means the NUMA memory policy):

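enum {
	MPOL_DEFAULT, // fall back to the process (task) policy; if the task policy is also MPOL_DEFAULT, use the system default policy (allocate on the node local to the CPU)
	MPOL_PREFERRED, // try the specified node first; if that fails, fall back to nearby nodes
	MPOL_BIND, // strict: allocate only from the nodes in nodemask (with several nodes in nodemask, the zonelist of the current node is walked, so the allowed node closest to it is tried first)
	MPOL_INTERLEAVE, // interleave allocations across the selected nodes in turn
	MPOL_LOCAL, // prefer the local node
	MPOL_MAX,	/* always last member of enum */
};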
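
The fallback described in the MPOL_DEFAULT comment can be sketched in user space. This is a minimal model under stated assumptions, not kernel code: `toy_policy`, `toy_system_default`, `toy_effective_policy` and `toy_mode_name` are hypothetical names, and the system default is represented simply as MPOL_LOCAL.

```c
#include <assert.h>
#include <stddef.h>

/* Policy modes, in the same order as the kernel enum above. */
enum {
	MPOL_DEFAULT,
	MPOL_PREFERRED,
	MPOL_BIND,
	MPOL_INTERLEAVE,
	MPOL_LOCAL,
	MPOL_MAX,
};

/* Hypothetical, stripped-down stand-in for struct mempolicy. */
struct toy_policy {
	int mode;
};

/* Assumption: the system default is modeled as "allocate on the local node". */
const struct toy_policy toy_system_default = { MPOL_LOCAL };

/* Human-readable name of a mode, for logging in experiments. */
const char *toy_mode_name(int mode)
{
	static const char *const names[MPOL_MAX] = {
		"default", "preferred", "bind", "interleave", "local",
	};

	assert(mode >= 0 && mode < MPOL_MAX);
	return names[mode];
}

/*
 * Pick the policy that governs an allocation. A NULL pointer, or a policy
 * whose mode is MPOL_DEFAULT, means "nothing set at this level", so the
 * lookup falls through: VMA policy, then task policy, then system default.
 */
const struct toy_policy *toy_effective_policy(const struct toy_policy *vma_pol,
					       const struct toy_policy *task_pol)
{
	if (vma_pol && vma_pol->mode != MPOL_DEFAULT)
		return vma_pol;
	if (task_pol && task_pol->mode != MPOL_DEFAULT)
		return task_pol;
	return &toy_system_default;
}
```

In this model a VMA policy always wins over the task policy, and the local-node default only applies when neither level has set anything.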

In the page-fault path, a fault on a hugetlbfs mapping eventually reaches alloc_huge_page, which allocates the huge page.

struct page *alloc_huge_page(struct vm_area_struct *vma,
				    unsigned long addr, int avoid_reserve)
{
	/* ... */
	page = dequeue_huge_page_vma(h, vma, addr, avoid_reserve, gbl_chg);
	/* ... */
}

The allocation itself goes mainly through dequeue_huge_page_vma, so let's look at that function in detail.

static struct page *dequeue_huge_page_vma(struct hstate *h,
				struct vm_area_struct *vma,
				unsigned long address, int avoid_reserve,
				long chg)
{
	struct page *page;
	struct mempolicy *mpol;
	gfp_t gfp_mask;
	nodemask_t *nodemask;
	int nid;

	/*
	 * A child process with MAP_PRIVATE mappings created by their parent
	 * have no page reserves. This check ensures that reservations are
	 * not "stolen". The child may still get SIGKILLed
	 */
	if (!vma_has_reserves(vma, chg) &&
			h->free_huge_pages - h->resv_huge_pages == 0)
		goto err;

	/* If reserves cannot be used, ensure enough pages are in the pool */
	if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
		goto err;

	gfp_mask = htlb_alloc_mask(h);
	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
	page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
	if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
		SetPagePrivate(page);
		h->resv_huge_pages--;
	}

	mpol_cond_put(mpol);
	return page;

err:
	return NULL;
}

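The function that actually performs the allocation is dequeue_huge_page_nodemask. Before calling it, huge_node initializes the two variables nid and nodemask that are passed to it: nid is the preferred node for the allocation, and nodemask says which nodes the allocation is allowed to use. dequeue_huge_page_nodemask shows that hugetlbfs allocates much like the buddy allocator: the order of the zonelist sets the priority among nodes, and nodemask decides whether a given zone on the zonelist may be used.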
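
The zonelist walk can be modeled in user space. The following is a minimal sketch, not the kernel code: `toy_hstate`, `toy_node_order`, `toy_dequeue_nodemask` and the |a - b| distance rule are all made up for illustration, and locking and cpuset handling are left out.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_NODES 4

/* Hypothetical per-node pool: number of free huge pages on each node. */
struct toy_hstate {
	int free_huge_pages_node[TOY_MAX_NODES];
};

/* Assumption: node distance is |a - b|; real systems use a firmware table. */
static int toy_distance(int a, int b)
{
	return a > b ? a - b : b - a;
}

/*
 * Build the fallback order for nid, playing the role of its zonelist:
 * nid first, then the other nodes by increasing distance. The insertion
 * sort is stable, so equally distant nodes keep the lower id first.
 */
void toy_node_order(int nid, int order[TOY_MAX_NODES])
{
	int i, j;

	assert(nid >= 0 && nid < TOY_MAX_NODES);
	for (i = 0; i < TOY_MAX_NODES; i++)
		order[i] = i;
	for (i = 1; i < TOY_MAX_NODES; i++) {
		int node = order[i];

		for (j = i; j > 0 &&
		     toy_distance(nid, order[j - 1]) > toy_distance(nid, node); j--)
			order[j] = order[j - 1];
		order[j] = node;
	}
}

/*
 * Take one free huge page, starting at nid and falling back in order.
 * nmask is a bitmask of allowed nodes; NULL means every node is allowed.
 * Returns the node the page came from, or -1 if no allowed node has one.
 */
int toy_dequeue_nodemask(struct toy_hstate *h, int nid, const unsigned *nmask)
{
	int order[TOY_MAX_NODES];
	int i;

	toy_node_order(nid, order);
	for (i = 0; i < TOY_MAX_NODES; i++) {
		int node = order[i];

		if (nmask && !(*nmask & (1u << node)))
			continue;	/* node filtered out by the policy */
		if (h->free_huge_pages_node[node] > 0) {
			h->free_huge_pages_node[node]--;
			return node;
		}
	}
	return -1;
}
```

With free pages only on nodes 1 and 3 and nid = 2, a NULL mask yields node 1 in this model, while a mask that allows only nodes 0 and 3 yields node 3, and then -1 once node 3 is empty.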
So how are nid and nodemask initialized?

/*
 * huge_node(@vma, @addr, @gfp_flags, @mpol)
 * @vma: virtual memory area whose policy is sought
 * @addr: address in @vma for shared policy lookup and interleave policy
 * @gfp_flags: for requested zone
 * @mpol: pointer to mempolicy pointer for reference counted mempolicy
 * @nodemask: pointer to nodemask pointer for MPOL_BIND nodemask
 *
 * Returns a nid suitable for a huge page allocation and a pointer
 * to the struct mempolicy for conditional unref after allocation.
 * If the effective policy is 'BIND, returns a pointer to the mempolicy's
 * @nodemask for filtering the zonelist.
 *
 * Must be protected by read_mems_allowed_begin()
 */
int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
				struct mempolicy **mpol, nodemask_t **nodemask)
{
	int nid;

	*mpol = get_vma_policy(vma, addr);
	*nodemask = NULL;	/* assume !MPOL_BIND */

	if (unlikely((*mpol)->mode == MPOL_INTERLEAVE)) {
		nid = interleave_nid(*mpol, vma, addr,
					huge_page_shift(hstate_vma(vma)));
	} else {
		nid = policy_node(gfp_flags, *mpol, numa_node_id());
		if ((*mpol)->mode == MPOL_BIND)
			*nodemask = &(*mpol)->v.nodes;
	}
	return nid;
}

huge_node shows that unless the MPOL_BIND policy is in effect, allocation may be attempted on every node (that is, *nodemask = NULL).
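
The two branches of huge_node can be approximated as follows. This is a rough sketch under assumptions: `toy_mempolicy`, `toy_huge_nodemask` and `toy_interleave_nid` are hypothetical, the node set is a plain bitmask, and `page_index` stands in for the offset that the kernel derives from the address and the huge page shift. The round-robin choice reflects my reading of the interleave path, which the listing above does not expand.

```c
#include <assert.h>
#include <stddef.h>

enum {
	MPOL_DEFAULT,
	MPOL_PREFERRED,
	MPOL_BIND,
	MPOL_INTERLEAVE,
	MPOL_LOCAL,
	MPOL_MAX,
};

#define TOY_MAX_NODES 8

/* Hypothetical stand-in for struct mempolicy: a mode plus a node bitmask. */
struct toy_mempolicy {
	int mode;
	unsigned nodes;		/* bit n set => node n belongs to the policy */
};

/* Nodemask half of huge_node(): only MPOL_BIND hands back a filter. */
const unsigned *toy_huge_nodemask(const struct toy_mempolicy *pol)
{
	assert(pol->mode >= 0 && pol->mode < MPOL_MAX);
	return pol->mode == MPOL_BIND ? &pol->nodes : NULL;
}

/*
 * Interleave branch: the index of the huge page inside the mapping picks
 * one of the policy's nodes, round-robin over the set bits.
 * Returns -1 for an empty mask (a simplification of this sketch).
 */
int toy_interleave_nid(const struct toy_mempolicy *pol, unsigned long page_index)
{
	unsigned nnodes = 0, target, seen = 0;
	int nid;

	for (nid = 0; nid < TOY_MAX_NODES; nid++)
		if (pol->nodes & (1u << nid))
			nnodes++;
	if (nnodes == 0)
		return -1;

	target = (unsigned)(page_index % nnodes);
	for (nid = 0; nid < TOY_MAX_NODES; nid++) {
		if (!(pol->nodes & (1u << nid)))
			continue;
		if (seen++ == target)
			return nid;	/* the target-th node of the set */
	}
	return -1;	/* not reached when nnodes > 0 */
}
```

For an interleave set of nodes 0, 2 and 5, consecutive huge pages in a mapping land on 0, 2, 5, 0, and so on in this model, and no nodemask is returned, so a fallback to other nodes remains possible when the chosen node has no free huge page.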
Now look at policy_node, the branch taken in the common case.

/* Return the node id preferred by the given mempolicy, or the given id */
static int policy_node(gfp_t gfp, struct mempolicy *policy,
								int nd)
{
	if (policy->mode == MPOL_PREFERRED && !(policy->flags & MPOL_F_LOCAL))
		nd = policy->v.preferred_node;
	else {
		/*
		 * __GFP_THISNODE shouldn't even be used with the bind policy
		 * because we might easily break the expectation to stay on the
		 * requested node and not break the policy.
		 */
		WARN_ON_ONCE(policy->mode == MPOL_BIND && (gfp & __GFP_THISNODE));
	}

	return nd;
}
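
The decision made by policy_node is small enough to replay in user space. The sketch below is illustrative only: `toy_pol`, `toy_policy_node` and `TOY_MPOL_F_LOCAL` are hypothetical names that mirror the fields used above, and the WARN_ON_ONCE check is omitted.

```c
#include <assert.h>

enum {
	MPOL_DEFAULT,
	MPOL_PREFERRED,
	MPOL_BIND,
	MPOL_INTERLEAVE,
	MPOL_LOCAL,
	MPOL_MAX,
};

/* Models MPOL_F_LOCAL: "preferred" means the local node. Value is illustrative. */
#define TOY_MPOL_F_LOCAL (1u << 1)

/* Hypothetical stand-in for the parts of struct mempolicy used here. */
struct toy_pol {
	int mode;
	unsigned flags;
	int preferred_node;	/* only meaningful for MPOL_PREFERRED */
};

/*
 * Same decision as policy_node(): only an explicit preferred node
 * overrides nd, the node the caller passed in (the local node here).
 */
int toy_policy_node(const struct toy_pol *pol, int nd)
{
	assert(pol->mode >= 0 && pol->mode < MPOL_MAX);
	if (pol->mode == MPOL_PREFERRED && !(pol->flags & TOY_MPOL_F_LOCAL))
		return pol->preferred_node;
	return nd;
}
```

With the local node equal to 0, a MPOL_PREFERRED policy for node 3 returns 3, while the same mode with the local flag set, or a MPOL_BIND policy, returns 0. For MPOL_BIND the starting node is therefore the local one, and the nodemask alone restricts where the page may come from.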