When physical memory becomes genuinely tight and the free-memory watermark drops to a low level, the kernel triggers either indirect reclaim via the kswapd thread or direct reclaim, evicting infrequently used pages from the inactive LRU lists out to backing storage (swap or file).
Whether reclaim runs in the kswapd thread or as direct reclaim, pages are ultimately reclaimed from the inactive LRU lists, and both paths funnel into the same entry function, shrink_list.
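For orientation, the two call chains converge roughly as follows in this kernel generation (circa v5.8); this is a sketch, not an exhaustive call graph:

kswapd (background reclaim)            direct reclaim (allocation path)
  kswapd()                               __alloc_pages_slowpath()
    balance_pgdat()                        __alloc_pages_direct_reclaim()
      kswapd_shrink_node()                   try_to_free_pages()
        shrink_node()                          do_try_to_free_pages()
          shrink_node_memcgs()                   shrink_zones()
            shrink_lruvec()                        shrink_node() -> shrink_lruvec()
              shrink_list()                            shrink_list()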
shrink_list
shrink_list is the top-level entry function for reclaiming pages from an LRU list:
shrink_list(enum lru_list lru, unsigned long nr_to_scan,
            struct lruvec *lruvec, struct scan_control *sc)
Parameters:
- enum lru_list lru: which LRU list type to operate on (the enum is shown after this list).
- unsigned long nr_to_scan: how many pages to scan, which bounds how many can be reclaimed in this pass.
- struct lruvec *lruvec: the LRU management data of the target node (pgdat).
- struct scan_control *sc: the reclaim control structure.
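For context, the lru argument is one of the per-node LRU list types defined in include/linux/mmzone.h for this kernel generation:

enum lru_list {
        LRU_INACTIVE_ANON = LRU_BASE,
        LRU_ACTIVE_ANON = LRU_BASE + LRU_ACTIVE,
        LRU_INACTIVE_FILE = LRU_BASE + LRU_FILE,
        LRU_ACTIVE_FILE = LRU_BASE + LRU_FILE + LRU_ACTIVE,
        LRU_UNEVICTABLE,
        NR_LRU_LISTS
};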
The source of shrink_list:
static unsigned long shrink_list(enum lru_list lru, unsigned long nr_to_scan,
                                 struct lruvec *lruvec, struct scan_control *sc)
{
        if (is_active_lru(lru)) {
                if (sc->may_deactivate & (1 << is_file_lru(lru)))
                        shrink_active_list(nr_to_scan, lruvec, sc, lru);
                else
                        sc->skipped_deactivate = 1;
                return 0;
        }

        return shrink_inactive_list(nr_to_scan, lruvec, sc, lru);
}
- If lru is an active LRU list and sc->may_deactivate allows deactivating that list type, shrink_active_list scans nr_to_scan pages from the tail of the active list and moves those that have not been referenced recently onto the corresponding inactive list; otherwise the skip is recorded in sc->skipped_deactivate. (The predicates used here are shown after this list.)
- If lru is an inactive LRU list, shrink_inactive_list scans nr_to_scan pages from the tail of the inactive list and tries to reclaim them, swapping anonymous pages out to the swap area or writing file pages back to their files.
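The two predicates are simple comparisons on the enum values; from include/linux/mm_inline.h in the same kernel generation:

static inline int is_file_lru(enum lru_list lru)
{
        return (lru == LRU_INACTIVE_FILE || lru == LRU_ACTIVE_FILE);
}

static inline int is_active_lru(enum lru_list lru)
{
        return (lru == LRU_ACTIVE_ANON || lru == LRU_ACTIVE_FILE);
}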
shrink_inactive_list
shrink_inactive_list picks the nr_to_scan least recently used pages off the tail of the given inactive LRU list and tries to reclaim them, writing anonymous pages out to the swap area and file-backed pages back to their files. Walking through the source:
/*
 * shrink_inactive_list() is a helper for shrink_node().  It returns the number
 * of reclaimed pages
 */
static noinline_for_stack unsigned long
shrink_inactive_list(unsigned long nr_to_scan, struct lruvec *lruvec,
                     struct scan_control *sc, enum lru_list lru)
{
        LIST_HEAD(page_list);
        unsigned long nr_scanned;
        unsigned int nr_reclaimed = 0;
        unsigned long nr_taken;
        struct reclaim_stat stat;
        bool file = is_file_lru(lru);
        enum vm_event_item item;
        struct pglist_data *pgdat = lruvec_pgdat(lruvec);
        bool stalled = false;

        while (unlikely(too_many_isolated(pgdat, file, sc))) {
                if (stalled)
                        return 0;

                /* wait a bit for the reclaimer. */
                msleep(100);
                stalled = true;

                /* We are about to die and free our memory. Return now. */
                if (fatal_signal_pending(current))
                        return SWAP_CLUSTER_MAX;
        }

        lru_add_drain();

        spin_lock_irq(&pgdat->lru_lock);

        nr_taken = isolate_lru_pages(nr_to_scan, lruvec, &page_list,
                                     &nr_scanned, sc, lru);

        __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, nr_taken);
        item = current_is_kswapd() ? PGSCAN_KSWAPD : PGSCAN_DIRECT;
        if (!cgroup_reclaim(sc))
                __count_vm_events(item, nr_scanned);
        __count_memcg_events(lruvec_memcg(lruvec), item, nr_scanned);
        __count_vm_events(PGSCAN_ANON + file, nr_scanned);

        spin_unlock_irq(&pgdat->lru_lock);

        if (nr_taken == 0)
                return 0;

        nr_reclaimed = shrink_page_list(&page_list, pgdat, sc, 0,
                                        &stat, false);

        spin_lock_irq(&pgdat->lru_lock);
        move_pages_to_lru(lruvec, &page_list);

        __mod_node_page_state(pgdat, NR_ISOLATED_ANON + file, -nr_taken);
        lru_note_cost(lruvec, file, stat.nr_pageout);
        item = current_is_kswapd() ? PGSTEAL_KSWAPD : PGSTEAL_DIRECT;
        if (!cgroup_reclaim(sc))
                __count_vm_events(item, nr_reclaimed);
        __count_memcg_events(lruvec_memcg(lruvec), item, nr_reclaimed);
        __count_vm_events(PGSTEAL_ANON + file, nr_reclaimed);
        spin_unlock_irq(&pgdat->lru_lock);

        mem_cgroup_uncharge_list(&page_list);
        free_unref_page_list(&page_list);

        /*
         * If dirty pages are scanned that are not queued for IO, it
         * implies that flushers are not doing their job. This can
         * happen when memory pressure pushes dirty pages to the end of
         * the LRU before the dirty limits are breached and the dirty
         * data has expired. It can also happen when the proportion of
         * dirty pages grows not through writes but through memory
         * pressure reclaiming all the clean cache. And in some cases,
         * the flushers simply cannot keep up with the allocation
         * rate. Nudge the flusher threads in case they are asleep.
         */
        if (stat.nr_unqueued_dirty == nr_taken)
                wakeup_flusher_threads(WB_REASON_VMSCAN);

        sc->nr.dirty += stat.nr_dirty;
        sc->nr.congested += stat.nr_congested;
        sc->nr.unqueued_dirty += stat.nr_unqueued_dirty;
        sc->nr.writeback += stat.nr_writeback;
        sc->nr.immediate += stat.nr_immediate;
        sc->nr.taken += nr_taken;
        if (file)
                sc->nr.file_taken += nr_taken;

        trace_mm_vmscan_lru_shrink_inactive(pgdat->node_id,
                        nr_scanned, nr_reclaimed, &stat, sc->priority, file);
        return nr_reclaimed;
}
- too_many_isolated: checks whether too many pages are already isolated from this node's LRU lists. If so, other threads (kswapd or other reclaimers) are already busy reclaiming, so this caller sleeps 100 ms and retries once, giving the isolated-page count a chance to drop back to a reasonable level (see the sketch after this list).
- lru_add_drain: flushes every CPU's private per-CPU LRU pagevec caches (lru_pvecs) onto the corresponding LRU lists, so those pages become visible to the scan below.
- pgdat->lru_lock: locks the node's LRU lists; the next step removes pages from a list, which must be serialized.
- isolate_lru_pages: scans the specified LRU list from its tail and isolates up to nr_to_scan pages that meet the scan criteria (in most cases a scanned page can be isolated), unlinking them from the LRU and collecting them on the local page_list.
- __mod_node_page_state: updates the NR_ISOLATED_ANON/NR_ISOLATED_FILE node counters.
- current_is_kswapd: distinguishes kswapd reclaim from direct reclaim and accounts the matching PGSCAN_KSWAPD/PGSCAN_DIRECT event.
- With the pages isolated, pgdat->lru_lock can be released, so the lock is not held across the expensive reclaim work below, which would hurt performance.
- shrink_page_list: the function that actually attempts the reclaim. Pages that can be reclaimed are freed and removed from page_list: a page-cache page has its contents written back to the file, an anonymous page has its contents written to the swap area, and the freed page is released to the per-CPU pageset or the buddy system for later allocations. Whatever remains on page_list afterwards could not be reclaimed this time and must be returned to an LRU list.
- pgdat->lru_lock is taken again.
- move_pages_to_lru puts the pages that could not be reclaimed back onto the appropriate LRU lists.
- The node LRU statistics are updated and the corresponding PGSTEAL_KSWAPD/PGSTEAL_DIRECT events accounted.
- pgdat->lru_lock is released.
- If page_list still holds pages at this point (pages whose last reference was dropped while they were off the LRU), free_unref_page_list releases them to the per-CPU pageset or the buddy system.
- Finally, the results of this pass (dirty, congested, writeback counts, pages taken, and so on) are propagated into scan_control.
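For reference, the throttle at the top looks roughly like this in the same v5.8-era mm/vmscan.c: only direct reclaimers are throttled, and callers without __GFP_IO/__GFP_FS are allowed to isolate more pages before blocking:

static int too_many_isolated(struct pglist_data *pgdat, int file,
                             struct scan_control *sc)
{
        unsigned long inactive, isolated;

        /* kswapd is never throttled here */
        if (current_is_kswapd())
                return 0;

        if (!writeback_throttling_sane(sc))
                return 0;

        if (file) {
                inactive = node_page_state(pgdat, NR_INACTIVE_FILE);
                isolated = node_page_state(pgdat, NR_ISOLATED_FILE);
        } else {
                inactive = node_page_state(pgdat, NR_INACTIVE_ANON);
                isolated = node_page_state(pgdat, NR_ISOLATED_ANON);
        }

        /*
         * GFP_NOIO/GFP_NOFS callers are allowed to isolate more pages, so
         * they won't get blocked by normal direct-reclaimers, forming a
         * circular deadlock.
         */
        if ((sc->gfp_mask & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS))
                inactive >>= 3;

        return isolated > inactive;
}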
shrink_page_list
shrink_page_list is the function that actually carries out page reclaim, applying a different reclaim action to each page on page_list according to its state. Because the kernel uses pages for many distinct purposes, each kind of page needs its own dedicated swap-out/reclaim handling, which is what makes this function hard to follow. The processing is somewhat involved, but with the overview above in mind the source reads more clearly:
/*
 * shrink_page_list() returns the number of reclaimed pages
 */
static unsigned int shrink_page_list(struct list_head *page_list,
                                     struct pglist_data *pgdat,
                                     struct scan_control *sc,
                                     enum ttu_flags ttu_flags,
                                     struct reclaim_stat *stat,
                                     bool ignore_references)
{
        LIST_HEAD(ret_pages);
        LIST_HEAD(free_pages);
        unsigned int nr_reclaimed = 0;
        unsigned int pgactivate = 0;

        memset(stat, 0, sizeof(*stat));
        cond_resched();

        while (!list_empty(page_list)) {
                struct address_space *mapping;
                struct page *page;
                enum page_references references = PAGEREF_RECLAIM;
                bool dirty, writeback, may_enter_fs;
                unsigned int nr_pages;

                cond_resched();

                page = lru_to_page(page_list);
                list_del(&page->lru);

                if (!trylock_page(page))
                        goto keep;

                VM_BUG_ON_PAGE(PageActive(page), page);

                nr_pages = compound_nr(page);

                /* Account the number of base pages even though THP */
                sc->nr_scanned += nr_pages;

                if (unlikely(!page_evictable(page)))
                        goto activate_locked;

                if (!sc->may_unmap && page_mapped(page))
                        goto keep_locked;

                may_enter_fs = (sc->gfp_mask & __GFP_FS) ||
                        (PageSwapCache(page) && (sc->gfp_mask & __GFP_IO));

                /*
                 * The number of dirty pages determines if a node is marked
                 * reclaim_congested which affects wait_iff_congested. kswapd
                 * will stall and start writing pages if the tail of the LRU
                 * is all dirty unqueued pages.
                 */
                page_check_dirty_writeback(page, &dirty, &writeback);
                if (dirty || writeback)
                        stat->nr_dirty++;

                if (dirty && !writeback)
                        stat->nr_unqueued_dirty++;

                /*
                 * Treat this page as congested if the underlying BDI is or if
                 * pages are cycling through the LRU so quickly that the
                 * pages marked for immediate reclaim are making it to the
                 * end of the LRU a second time.
                 */
                mapping = page_mapping(page);
                if (((dirty || writeback) && mapping &&
                     inode_write_congested(mapping->host)) ||
                    (writeback && PageReclaim(page)))
                        stat->nr_congested++;

                /*
                 * If a page at the tail of the LRU is under writeback, there
                 * are three cases to consider.
                 *
                 * 1) If reclaim is encountering an excessive number of pages
                 *    under writeback and this page is both under writeback and
                 *    PageReclaim then it indicates that pages are being queued
                 *    for IO but are being recycled through the LRU before the
                 *    IO can complete. Waiting on the page itself risks an
                 *    indefinite stall if it is impossible to writeback the
                 *    page due to IO error or disconnected storage so instead
                 *    note that the LRU is being scanned too quickly and the
                 *    caller can stall after page list has been processed.
                 *
                 * 2) Global or new memcg reclaim encounters a page that is
                 *    not marked for immediate reclaim, or the caller does not
                 *    have __GFP_FS (or __GFP_IO if it's simply going to swap,
                 *    not to fs). In this case mark the page for immediate
                 *    reclaim and continue scanning.
                 *
                 *    Require may_enter_fs because we would wait on fs, which
                 *    may not have submitted IO yet. And the loop driver might
                 *    enter reclaim, and deadlock if it waits on a page for
                 *    which it is needed to do the write (loop masks off
                 *    __GFP_IO|__GFP_FS for this reason); but more thought
                 *    would probably show more reasons.
                 *
                 * 3) Legacy memcg encounters a page that is already marked
                 *    PageReclaim. memcg does not have any dirty pages
                 *    throttling so we could easily OOM just because too many
                 *    pages are in writeback and there is nothing else to
                 *    reclaim. Wait for the writeback to complete.
                 *
                 * In cases 1) and 2) we activate the pages to get them out of
                 * the way while we continue scanning for clean pages on the
                 * inactive list and refilling from the active list. The
                 * observation here is that waiting for disk writes is more
                 * expensive than potentially causing reloads down the line.
                 * Since they're marked for immediate reclaim, they won't put
                 * memory pressure on the cache working set any longer than it
                 * takes to write them to disk.
                 */
                if (PageWriteback(page)) {
                        /* Case 1 above */
                        if (current_is_kswapd() &&
                            PageReclaim(page) &&
                            test_bit(PGDAT_WRITEBACK, &pgdat->flags)) {
                                stat->nr_immediate++;
                                goto activate_locked;

                        /* Case 2 above */
                        } else if (writeback_throttling_sane(sc) ||
                            !PageReclaim(page) || !may_enter_fs) {
                                /*
                                 * This is slightly racy - end_page_writeback()
                                 * might have just cleared PageReclaim, then
                                 * setting PageReclaim here end up interpreted
                                 * as PageReadahead - but that does not matter
                                 * enough to care. What we do want is for this
                                 * page to have PageReclaim set next time memcg
                                 * reclaim reaches the tests above, so it will
                                 * then wait_on_page_writeback() to avoid OOM;
                                 * and it's also appropriate in global reclaim.
                                 */
                                SetPageReclaim(page);
                                stat->nr_writeback++;
                                goto activate_locked;

                        /* Case 3 above */
                        } else {
                                unlock_page(page);
                                wait_on_page_writeback(page);
                                /* then go back and try same page again */
                                list_add_tail(&page->lru, page_list);
                                continue;
                        }
                }

                if (!ignore_references)
                        references = page_check_references(page, sc);

                switch (references) {
                case PAGEREF_ACTIVATE:
                        goto activate_locked;
                case PAGEREF_KEEP:
                        stat->nr_ref_keep += nr_pages;
                        goto keep_locked;
                case PAGEREF_RECLAIM:
                case PAGEREF_RECLAIM_CLEAN:
                        ; /* try to reclaim the page below */
                }

                /*
                 * Anonymous process memory has backing store?
                 * Try to allocate it some swap space here.
                 * Lazyfree page could be freed directly
                 */
                if (PageAnon(page) && PageSwapBacked(page)) {
                        if (!PageSwapCache(page)) {
                                if (!(sc->gfp_mask & __GFP_IO))
                                        goto keep_locked;
                                if (PageTransHuge(page)) {
                                        /* cannot split THP, skip it */
                                        if (!can_split_huge_page(page, NULL))
                                                goto activate_locked;
                                        /*
                                         * Split pages without a PMD map right
                                         * away. Chances are some or all of the
                                         * tail pages can be freed without IO.
                                         */
                                        if (!compound_mapcount(page) &&
                                            split_huge_page_to_list(page,
                                                                    page_list))
                                                goto activate_locked;
                                }
                                if (!add_to_swap(page)) {
                                        if (!PageTransHuge(page))
                                                goto activate_locked_split;
                                        /* Fallback to swap normal pages */
                                        if (split_huge_page_to_list(page,
                                                                    page_list))
                                                goto activate_locked;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
                                        count_vm_event(THP_SWPOUT_FALLBACK);
#endif
                                        if (!add_to_swap(page))
                                                goto activate_locked_split;
                                }

                                may_enter_fs = true;

                                /* Adding to swap updated mapping */
                                mapping = page_mapping(page);
                        }
                } else if (unlikely(PageTransHuge(page))) {
                        /* Split file THP */
                        if (split_huge_page_to_list(page, page_list))
                                goto keep_locked;
                }

                /*
                 * THP may get split above, need minus tail pages and update
                 * nr_pages to avoid accounting tail pages twice.
                 *
                 * The tail pages that are added into swap cache successfully
                 * reach here.
                 */
                if ((nr_pages > 1) && !PageTransHuge(page)) {
                        sc->nr_scanned -= (nr_pages - 1);
                        nr_pages = 1;
                }

                /*
                 * The page is mapped into the page tables of one or more
                 * processes. Try to unmap it here.
                 */
                if (page_mapped(page)) {
                        enum ttu_flags flags = ttu_flags | TTU_BATCH_FLUSH;
                        bool was_swapbacked = PageSwapBacked(page);

                        if (unlikely(PageTransHuge(page)))
                                flags |= TTU_SPLIT_HUGE_PMD;

                        if (!try_to_unmap(page, flags)) {
                                stat->nr_unmap_fail += nr_pages;
                                if (!was_swapbacked && PageSwapBacked(page))
                                        stat->nr_lazyfree_fail += nr_pages;
                                goto activate_locked;
                        }
                }

                if (PageDirty(page)) {
                        /*
                         * Only kswapd can writeback filesystem pages
                         * to avoid risk of stack overflow. But avoid
                         * injecting inefficient single-page IO into
                         * flusher writeback as much as possible: only
                         * write pages when we've encountered many
                         * dirty pages, and when we've already scanned
                         * the rest of the LRU for clean pages and see
                         * the same dirty pages again (PageReclaim).
                         */
                        if (page_is_file_lru(page) &&
                            (!current_is_kswapd() || !PageReclaim(page) ||
                             !test_bit(PGDAT_DIRTY, &pgdat->flags))) {
                                /*
                                 * Immediately reclaim when written back.
                                 * Similar in principal to deactivate_page()
                                 * except we already have the page isolated
                                 * and know it's dirty
                                 */
                                inc_node_page_state(page, NR_VMSCAN_IMMEDIATE);
                                SetPageReclaim(page);

                                goto activate_locked;
                        }

                        if (references == PAGEREF_RECLAIM_CLEAN)
                                goto keep_locked;
                        if (!may_enter_fs)
                                goto keep_locked;
                        if (!sc->may_writepage)
                                goto keep_locked;

                        /*
                         * Page is dirty. Flush the TLB if a writable entry
                         * potentially exists to avoid CPU writes after IO
                         * starts and then write it out here.
                         */
                        try_to_unmap_flush_dirty();
                        switch (pageout(page, mapping)) {
                        case PAGE_KEEP:
                                goto keep_locked;
                        case PAGE_ACTIVATE:
                                goto activate_locked;
                        case PAGE_SUCCESS:
                                stat->nr_pageout += hpage_nr_pages(page);

                                if (PageWriteback(page))
                                        goto keep;
                                if (PageDirty(page))
                                        goto keep;

                                /*
                                 * A synchronous write - probably a ramdisk. Go
                                 * ahead and try to reclaim the page.
                                 */
                                if (!trylock_page(page))
                                        goto keep;
                                if (PageDirty(page) || PageWriteback(page))
                                        goto keep_locked;
                                mapping = page_mapping(page);
                        case PAGE_CLEAN:
                                ; /* try to free the page below */
                        }
                }

                /*
                 * If the page has buffers, try to free the buffer mappings
                 * associated with this page. If we succeed we try to free
                 * the page as well.
                 *
                 * We do this even if the page is PageDirty().
                 * try_to_release_page() does not perform I/O, but it is
                 * possible for a page to have PageDirty set, but it is actually
                 * clean (all its buffers are clean). This happens if the
                 * buffers were written out directly, with submit_bh(). ext3
                 * will do this, as well as the blockdev mapping.
                 * try_to_release_page() will discover that cleanness and will
                 * drop the buffers and mark the page clean - it can be freed.
                 *
                 * Rarely, pages can have buffers and no ->mapping. These are
                 * the pages which were not successfully invalidated in
                 * truncate_complete_page(). We try to drop those buffers here
                 * and if that worked, and the page is no longer mapped into
                 * process address space (page_count == 1) it can be freed.
                 * Otherwise, leave the page on the LRU so it is swappable.
                 */
                if (page_has_private(page)) {
                        if (!try_to_release_page(page, sc->gfp_mask))
                                goto activate_locked;
                        if (!mapping && page_count(page) == 1) {
                                unlock_page(page);
                                if (put_page_testzero(page))
                                        goto free_it;
                                else {
                                        /*
                                         * rare race with speculative reference.
                                         * the speculative reference will free
                                         * this page shortly, so we may
                                         * increment nr_reclaimed here (and
                                         * leave it off the LRU).
                                         */
                                        nr_reclaimed++;
                                        continue;
                                }
                        }
                }

                if (PageAnon(page) && !PageSwapBacked(page)) {
                        /* follow __remove_mapping for reference */
                        if (!page_ref_freeze(page, 1))
                                goto keep_locked;
                        if (PageDirty(page)) {
                                page_ref_unfreeze(page, 1);
                                goto keep_locked;
                        }

                        count_vm_event(PGLAZYFREED);
                        count_memcg_page_event(page, PGLAZYFREED);
                } else if (!mapping || !__remove_mapping(mapping, page, true,
                                                         sc->target_mem_cgroup))
                        goto keep_locked;

                unlock_page(page);
free_it:
                /*
                 * THP may get swapped out in a whole, need account
                 * all base pages.
                 */
                nr_reclaimed += nr_pages;

                /*
                 * Is there need to periodically free_page_list? It would
                 * appear not as the counts should be low
                 */
                if (unlikely(PageTransHuge(page)))
                        destroy_compound_page(page);
                else
                        list_add(&page->lru, &free_pages);
                continue;

activate_locked_split:
                /*
                 * The tail pages that are failed to add into swap cache
                 * reach here. Fixup nr_scanned and nr_pages.
                 */
                if (nr_pages > 1) {
                        sc->nr_scanned -= (nr_pages - 1);
                        nr_pages = 1;
                }
activate_locked:
                /* Not a candidate for swapping, so reclaim swap space. */
                if (PageSwapCache(page) && (mem_cgroup_swap_full(page) ||
                                            PageMlocked(page)))
                        try_to_free_swap(page);
                VM_BUG_ON_PAGE(PageActive(page), page);
                if (!PageMlocked(page)) {
                        int type = page_is_file_lru(page);
                        SetPageActive(page);
                        stat->nr_activate[type] += nr_pages;
                        count_memcg_page_event(page, PGACTIVATE);
                }
keep_locked:
                unlock_page(page);
keep:
                list_add(&page->lru, &ret_pages);
                VM_BUG_ON_PAGE(PageLRU(page) || PageUnevictable(page), page);
        }

        pgactivate = stat->nr_activate[0] + stat->nr_activate[1];

        mem_cgroup_uncharge_list(&free_pages);
        try_to_unmap_flush();
        free_unref_page_list(&free_pages);

        list_splice(&ret_pages, page_list);
        count_vm_events(PGACTIVATE, pgactivate);

        return nr_reclaimed;
}
The main page states it has to consider:
- PageWriteback: for a page currently being written back to disk, three cases are distinguished, matching the long comment in the source:
  - Case 1: kswapd is reclaiming, the page is marked PageReclaim, and the node is flagged PGDAT_WRITEBACK. Pages are being queued for IO but cycle back to the LRU tail before the IO completes; the page is counted in stat->nr_immediate and skipped, so the caller can decide to stall after the whole list has been processed.
  - Case 2: global (or cgroup-v2) reclaim, or the page is not yet marked PageReclaim, or the caller may not enter the filesystem. Waiting here is not an option: writeback can be slow or even stuck on faulty or disconnected storage, so blocking on the page risks an indefinite stall. Instead the page is marked PageReclaim, counted in stat->nr_writeback, and skipped.
  - Case 3: legacy memcg reclaim on a page already marked PageReclaim. Legacy memcg has no dirty-page throttling and could OOM otherwise, so the code waits for the writeback to complete (wait_on_page_writeback) and puts the page back on the tail of page_list to retry it in a later iteration.
- For a page not under writeback (and unless ignore_references is set), page_check_references applies a second-chance policy based on how recently the page was referenced (see the sketch after this list):
  - PAGEREF_ACTIVATE: the page was referenced recently; move it to the active LRU list.
  - PAGEREF_KEEP: keep the page on the inactive LRU list.
  - PAGEREF_RECLAIM/PAGEREF_RECLAIM_CLEAN: the page may be reclaimed below.
- For the reclaimable cases (PAGEREF_RECLAIM/PAGEREF_RECLAIM_CLEAN), the reclaim action depends on the page's type:
  - PageAnon(page) && PageSwapBacked(page): an anonymous page whose backing store is the swap area; add_to_swap allocates swap space so the contents can be written out to swap (a THP is split first if necessary).
  - page_mapped: the page is mapped into one or more process address spaces (typical for file mappings, special driver mappings, and so on); try_to_unmap removes all of those mappings.
  - PageDirty: a dirty page is flushed to its backing store via pageout, with the restriction that only kswapd writes back filesystem pages, to avoid stack overflow and inefficient single-page IO.
  - page_has_private: the page carries buffer heads (buffer cache); try_to_release_page drops the buffer mappings so the page itself can be freed.
  - PageAnon(page) && !PageSwapBacked(page): a lazy-free anonymous page waiting to be freed; it is released directly with no IO at all, following the __remove_mapping reference protocol (page_ref_freeze).
- The steps above save the contents of each reclaimable page to the appropriate backing store, so the data can be read back from disk the next time it is needed.
- Once a page's contents have been saved (or never needed saving), the page itself can be reclaimed: list_add(&page->lru, &free_pages) collects it on free_pages, and everything on free_pages can later be released to the buddy system.
- After all pages on page_list have been processed, free_unref_page_list releases everything on free_pages to the per-CPU pageset or the buddy system, ready for subsequent allocations; the pages that could not be reclaimed are spliced back onto page_list for the caller to return to their LRU lists.
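The second-chance decision referenced above is made by page_check_references; abridged from the same v5.8-era mm/vmscan.c (comments trimmed, so treat this as a sketch rather than the verbatim source):

static enum page_references page_check_references(struct page *page,
                                                  struct scan_control *sc)
{
        int referenced_ptes, referenced_page;
        unsigned long vm_flags;

        referenced_ptes = page_referenced(page, 1, sc->target_mem_cgroup,
                                          &vm_flags);
        referenced_page = TestClearPageReferenced(page);

        /* mlocked pages go to the unevictable list via try_to_unmap() */
        if (vm_flags & VM_LOCKED)
                return PAGEREF_RECLAIM;

        if (referenced_ptes) {
                if (PageSwapBacked(page))
                        return PAGEREF_ACTIVATE;
                /*
                 * A mapped file page referenced more than once is activated;
                 * a single reference only earns it another trip around the
                 * inactive list - the "second chance".
                 */
                SetPageReferenced(page);

                if (referenced_page || referenced_ptes > 1)
                        return PAGEREF_ACTIVATE;

                /* Activate file-backed executable pages after first usage. */
                if (vm_flags & VM_EXEC)
                        return PAGEREF_ACTIVATE;

                return PAGEREF_KEEP;
        }

        /* Reclaim if clean, defer dirty pages to writeback */
        if (referenced_page && !PageSwapCache(page))
                return PAGEREF_RECLAIM_CLEAN;

        return PAGEREF_RECLAIM;
}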