IOCB_DONTCACHE_LAZY: Local Filesystem (XFS) Benchmark Analysis

Date: 2026-03-28
Kernel: v6.19-based
Host: 80 CPUs + 256 GB RAM
Filesystem: XFS on local NVMe
RAM: 256 GB
File size: ~512 GB (2x RAM)
I/O engine: fio with io_uring, iodepth=16

Overview

This report compares local filesystem performance across four I/O modes, including RWF_DONTCACHE_LAZY, a new pwritev2/io_uring flag that provides rate-limited writeback with bounded page cache usage.

Modes Tested

Mode	Mechanism	Cache behavior
buffered	Standard buffered I/O	Pages stay in cache indefinitely
dontcache	`RWF_DONTCACHE` — flush all dirty pages on every write	Pages evicted after full flush
direct	`O_DIRECT` — bypass page cache entirely	No caching
dontcache_lazy	`RWF_DONTCACHE_LAZY` — skip-if-busy + proportional writeback	Pages evicted after rate-limited flush

How DONTCACHE_LAZY Works

IOCB_DONTCACHE_LAZY uses two mechanisms to rate-limit writeback:

Skip-if-busy: Before flushing, check mapping_tagged(PAGECACHE_TAG_WRITEBACK). If writeback is already in progress, skip the flush entirely. This eliminates writeback submission contention between concurrent writers.
Proportional cap: When flushing does occur, cap nr_to_write to the number of pages just written. This prevents any single write from triggering a full-file flush that would starve concurrent readers.

Both mechanisms are necessary — benchmarks show removing either one causes severe regressions.
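The two mechanisms above can be illustrated with a toy userspace model. This is only a sketch of the decision logic as described in this section, not kernel code: the `Mapping` class and the `lazy_flush`, `dontcache_lazy_write`, and `complete_writeback` helpers are hypothetical names invented for illustration, and page counts stand in for real writeback state.

```python
from dataclasses import dataclass


@dataclass
class Mapping:
    """Toy stand-in for one file's page cache state (not a kernel type)."""
    dirty_pages: int = 0
    writeback_pages: int = 0  # pages with writeback currently in flight


def lazy_flush(mapping: Mapping, pages_just_written: int) -> int:
    """Return the number of pages this write submits for writeback."""
    # Skip-if-busy: writeback is already in progress, so submit nothing.
    if mapping.writeback_pages > 0:
        return 0
    # Proportional cap: flush no more pages than this write just dirtied.
    nr_to_write = min(pages_just_written, mapping.dirty_pages)
    mapping.dirty_pages -= nr_to_write
    mapping.writeback_pages += nr_to_write
    return nr_to_write


def dontcache_lazy_write(mapping: Mapping, pages: int) -> int:
    """Dirty `pages` pages, then attempt a rate-limited flush."""
    mapping.dirty_pages += pages
    return lazy_flush(mapping, pages)


def complete_writeback(mapping: Mapping) -> None:
    """Model the device finishing all in-flight writeback."""
    mapping.writeback_pages = 0


m = Mapping()
first = dontcache_lazy_write(m, 256)   # idle mapping: flushes its own 256 pages
second = dontcache_lazy_write(m, 256)  # writeback busy: flush skipped
complete_writeback(m)
third = dontcache_lazy_write(m, 256)   # 512 pages dirty, but capped at 256
print(first, second, third, m.dirty_pages)  # prints: 256 0 256 256
```

In this model the third write leaves 256 pages dirty even though writeback was idle, which is the intended effect of the cap: no single writer pays for flushing everyone else's backlog.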

Deliverable 1: Single-Client Benchmarks

Sequential Write

Mode	MB/s	p50 (ms)	p99 (ms)	p99.9 (ms)	Peak Cache
buffered	575	30.0	43.8	51.1	243 GB
dontcache	1179	3.2	103.3	170.9	4.7 GB
direct	1374	11.5	21.9	24.2	242 GB
dontcache_lazy	1442	10.7	22.2	23.5	49 GB

Key finding: buffered is the slowest mode for large sequential writes.

With a 512 GB file on 256 GB RAM, buffered writes fill the page cache (243 GB peak), triggering the kernel's dirty throttling mechanism. This caps throughput at 575 MB/s with a 30 ms median latency: the writer spends most of its time waiting for the writeback subsystem to drain dirty pages.
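For a rough sense of where dirty throttling starts on this host, the sketch below applies the stock Linux defaults (vm.dirty_background_ratio = 10, vm.dirty_ratio = 20) to the 256 GB of RAM. Both the use of default sysctls and the use of total RAM are assumptions: the report does not state the sysctl settings, and the kernel applies the ratios to available memory rather than total memory, so the real thresholds would be somewhat lower.

```python
def dirty_threshold_gb(ram_gb: float, ratio_percent: float) -> float:
    """Approximate a vm.dirty_*_ratio threshold as a share of total RAM."""
    return ram_gb * ratio_percent / 100.0


RAM_GB = 256  # from the test configuration

# Assumed stock defaults; the actual sysctl values are not given in the report.
background_gb = dirty_threshold_gb(RAM_GB, 10)  # background writeback starts
throttle_gb = dirty_threshold_gb(RAM_GB, 20)    # writers start being throttled

print(f"background: {background_gb:.1f} GB, throttle: {throttle_gb:.1f} GB")
```

Under these assumptions, a writer that dirties pages faster than the device can drain them reaches the throttle point after a few tens of GB, long before the 512 GB file is written.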

dontcache_lazy is the fastest mode at 1442 MB/s, slightly beating even direct I/O (1374 MB/s). It achieves this by evicting pages after rate-limited writeback, keeping the cache bounded at 49 GB. Unlike direct I/O, writes still go through the page cache, which lets the kernel coalesce adjacent writes before flushing; this likely accounts for the slight throughput advantage.

dontcache achieves good throughput (1179 MB/s) but with terrible tail latency (p99.9 = 171ms) due to its aggressive full-file flush on every write.
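The relative figures quoted for sequential writes can be reproduced from the table above. The snippet below is only a convenience for checking the arithmetic; the dictionary simply restates the MB/s column.

```python
# Sequential write throughput in MB/s, copied from the table above.
seq_write_mbps = {
    "buffered": 575,
    "dontcache": 1179,
    "direct": 1374,
    "dontcache_lazy": 1442,
}


def speedup(mode: str, baseline: str) -> float:
    """Throughput of `mode` relative to `baseline`."""
    return seq_write_mbps[mode] / seq_write_mbps[baseline]


lazy_vs_buffered = speedup("dontcache_lazy", "buffered")
lazy_vs_direct = speedup("dontcache_lazy", "direct")
print(f"vs buffered: {lazy_vs_buffered:.2f}x, vs direct: {lazy_vs_direct:.2f}x")
```

This gives roughly 2.5x over buffered and 1.05x over direct I/O.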

Random Write

Mode	MB/s	IOPS	p99.9 (ms)
direct	326	83K	0.8
buffered	316	81K	0.9
dontcache_lazy	299	76K	16.4
dontcache	148	38K	8.7

For random 4K writes, dontcache_lazy is 5% behind buffered — an acceptable trade-off for bounded cache. The p99.9 tail (16.4ms vs 0.9ms) shows occasional writeback stalls when the skip-if-busy guard finds no active writeback and triggers a proportional flush.

dontcache collapses to half the throughput (148 MB/s), consistent with its behavior under NFS.

Sequential Read

Mode	MB/s	p50 (ms)	p99.9 (ms)	Peak Cache
buffered	2259	6.5	12.5	254 GB
dontcache	2374	6.3	8.2	5.2 GB
direct	2268	6.4	13.6	254 GB
dontcache_lazy	2343	6.4	8.2	3.5 GB

All modes deliver similar sequential read throughput (~2.3 GB/s). dontcache and dontcache_lazy keep cache bounded (3-5 GB vs 254 GB) with slightly tighter tail latency, since the smaller cache footprint reduces pressure on the memory subsystem.

Random Read

Mode	MB/s	IOPS	p99.9 (ms)
direct	567	145K	0.3
buffered	563	144K	0.3
dontcache_lazy	545	140K	0.8
dontcache	542	139K	0.8

Random reads are within ~4% across all modes. The slight disadvantage of dontcache/dontcache_lazy comes from page eviction preventing re-reads from cache, but on a 512 GB file with 256 GB RAM the hit rate is limited anyway.

Deliverable 2: Multi-Client Benchmarks

Scenario A: Multiple Writers (4 concurrent)

Mode	Aggregate MB/s	Per-client p99 (ms)	Per-client p99.9 (ms)
buffered	1501	41.2	58
dontcache_lazy	1434	38.5	329–447
direct	980	50.1	59
dontcache	707	3.1	160

dontcache_lazy maintains 95% of buffered aggregate throughput under 4-way write contention. dontcache collapses to 47% — every writer's full-file flush serializes against every other writer's flush, exactly as observed in the NFS benchmarks.

Concern: dontcache_lazy p99.9 tail latency is high (329–447ms vs 58ms for buffered). When a writer finds the writeback tag clear and triggers a proportional flush, it can stall briefly behind another writer's in-flight I/O. This wasn't visible in NFS benchmarks where protocol overhead dominates tail latency. Worth investigating whether a tighter proportional cap or exponential backoff would smooth these tails.

Scenario C: Noisy Writer + Latency-Sensitive Readers (same mode for both)

Mode	Writer MB/s	Reader MB/s	Reader p99.9 (ms)
buffered	915	990	0.3
direct	1380	985	0.3
dontcache_lazy	1458	27	1974
dontcache	1323	24	44302

When the same mode is applied to both writer and readers, dontcache and dontcache_lazy destroy reader throughput — pages are marked for eviction on read (dropbehind), forcing every subsequent read back to disk.

This is expected behavior and precisely why mixed-mode I/O exists: use dontcache_lazy for writes (bounded cache, high throughput) and buffered for reads (benefit from warm cache).

Scenario D: Mixed Mode — dontcache_lazy writes + buffered reads

Job	MB/s	IOPS	p50 (us)	p99.9 (ms)
Bulk writer (dontcache_lazy)	1435	1435	482	24.0
reader1 (buffered)	750	192K	129	0.5
reader2 (buffered)	764	196K	129	0.5
reader3 (buffered)	746	191K	129	0.5

This is the optimal configuration. Compared to pure buffered (Scenario C):

Metric	Buffered-only	Mixed mode	Change
Writer throughput	915 MB/s	1435 MB/s	+57%
Reader throughput	990 MB/s	753 MB/s (avg)	−24%
Reader p99.9	0.3 ms	0.5 ms	negligible

The writer gains 57% throughput by avoiding dirty throttling. Readers lose 24% because the faster writer consumes more disk bandwidth, but they still serve primarily from page cache with sub-millisecond tail latency. Total system throughput (writer + 3 readers) is about 5% lower in mixed mode: 3884 MB/s (buffered) vs 3694 MB/s (mixed).
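The comparison figures can be recomputed from the Scenario C and Scenario D tables. One assumption is needed: the 990 MB/s reader figure in Scenario C is treated as a per-reader value for each of the three readers, which is what the reported total implies. The recomputed totals differ from the reported ones by 1 MB/s, presumably because the tables show rounded values.

```python
# Throughput in MB/s, copied from the Scenario C and Scenario D tables.
# Assumption: Scenario C's 990 MB/s applies to each of the three readers.
buffered_only = {"writer": 915, "readers": [990, 990, 990]}
mixed = {"writer": 1435, "readers": [750, 764, 746]}


def pct_change(new: float, old: float) -> float:
    """Relative change from `old` to `new`, in percent."""
    return (new / old - 1.0) * 100.0


def reader_avg(run: dict) -> float:
    return sum(run["readers"]) / len(run["readers"])


def total(run: dict) -> int:
    return run["writer"] + sum(run["readers"])


writer_gain = pct_change(mixed["writer"], buffered_only["writer"])
reader_change = pct_change(reader_avg(mixed), reader_avg(buffered_only))
total_change = pct_change(total(mixed), total(buffered_only))

print(f"writer {writer_gain:+.0f}%, readers {reader_change:+.0f}%")
print(f"totals: {total(buffered_only)} vs {total(mixed)} MB/s ({total_change:+.1f}%)")
```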

Page Cache Footprint

Mode	Seq Write Cache	Multi-Writer Dirty
buffered	243 GB	48 GB
dontcache	4.7 GB	6 KB
direct	242 GB	45 GB
dontcache_lazy	49 GB	44 GB

dontcache_lazy keeps the page cache bounded without the aggressive flushing of dontcache. The 49 GB cache footprint during sequential writes (vs 243 GB for buffered) leaves substantial memory available for other workloads.

Comparison: Local XFS vs NFS Results

Metric	NFS (dontcache_lazy)	Local XFS (dontcache_lazy)
Seq write (single)	~10 GB/s	1.4 GB/s
Multi-writer aggregate	~10 GB/s	1.4 GB/s
vs buffered (multi-write)	98%	95%
Noisy neighbor reader impact	minimal	minimal tail impact (mixed mode: p99.9 0.5 ms, reader throughput −24%)

The relative behavior is consistent: dontcache_lazy maintains near-buffered throughput under contention while keeping cache bounded. The absolute throughput difference reflects that NFS benchmarks used a high-end NVMe array behind the NFS server, while local benchmarks run on a single device.

Conclusions

dontcache_lazy is the fastest sequential write mode on local XFS, beating both buffered (2.5x) and direct I/O (1.05x) by avoiding dirty throttling while still benefiting from page cache write coalescing.
dontcache collapses under contention on local filesystems just as it does over NFS — multi-writer throughput drops to 47% of buffered due to full-file flush serialization.
dontcache_lazy multi-writer throughput is near-buffered (95%), confirming the skip-if-busy + proportional cap mechanisms work correctly on the local I/O path.
Mixed mode (dontcache_lazy writes + buffered reads) is the optimal configuration for workloads with concurrent readers and writers: the writer gains 57% throughput, while readers keep sub-ms tail latency at a 24% throughput cost.
Page cache stays bounded at ~49 GB during sequential writes vs 243 GB for buffered, leaving memory available for other workloads.
Open issue: multi-writer p99.9 tail latency (329–447ms) is elevated compared to buffered (58ms). This warrants investigation into tighter proportional caps or backoff strategies, though it may be acceptable for throughput-oriented workloads.

Test Configuration

Kernel: v6.19-based
Host: 80 CPUs + 256 GB RAM
Filesystem: XFS on local NVMe
RAM: 256 GB
File size: 512 GB (2x RAM, ensures data exceeds cache)
fio: Custom build with RWF_DONTCACHE_LAZY support (uncached=2)
I/O engine: io_uring with iodepth=16
Single-client: numjobs=1, full file size per test
Multi-client: 4 concurrent fio processes, each writing RAM/4 (64 GB)
Noisy neighbor: 1 bulk writer (512 GB) + 3 latency-sensitive readers (512 MB each)
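To make the configuration above easier to reproduce, the sketch below generates an fio job file for each mode. It is an illustration built on several assumptions rather than the job files actually used: `uncached=2` comes from the custom fio build named above, while `uncached=1` for RWF_DONTCACHE is assumed to match stock fio's `uncached` option; the 1M block size is inferred from the bulk writer's IOPS matching its MB/s; and the file path `/mnt/xfs/testfile` and the `fio_job` helper are hypothetical.

```python
# Per-mode fio options. "uncached=2" is taken from this report's custom fio
# build; "uncached=1" for RWF_DONTCACHE is an assumption about stock fio.
MODE_OPTIONS = {
    "buffered": {"direct": 0},
    "dontcache": {"direct": 0, "uncached": 1},
    "direct": {"direct": 1},
    "dontcache_lazy": {"direct": 0, "uncached": 2},
}


def fio_job(mode: str, rw: str, bs: str, size: str, filename: str) -> str:
    """Render a single-job fio file for one I/O mode."""
    options = {
        "ioengine": "io_uring",  # from the test configuration
        "iodepth": 16,           # from the test configuration
        "rw": rw,
        "bs": bs,
        "size": size,
        "filename": filename,
        "numjobs": 1,
    }
    options.update(MODE_OPTIONS[mode])
    lines = [f"[{mode}-{rw}]"] + [f"{key}={value}" for key, value in options.items()]
    return "\n".join(lines) + "\n"


# Hypothetical path and inferred block size; adjust for the actual setup.
job = fio_job("dontcache_lazy", "write", "1M", "512G", "/mnt/xfs/testfile")
print(job)
```

Keeping the mode-specific options in one table makes it harder for the four runs to drift apart in anything other than the cache behavior under test.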