IOCB_DONTCACHE_LAZY: Local Filesystem (XFS) Benchmark Analysis
Date: 2026-03-28
Kernel: v6.19-based
Host: 80 CPUs, 256 GB RAM
Filesystem: XFS on local NVMe
RAM: 256 GB
File size: ~512 GB (2x RAM)
I/O engine: fio with io_uring, iodepth=16
Overview
This report compares four I/O modes on a local XFS filesystem, focusing on RWF_DONTCACHE_LAZY, a new pwritev2/io_uring flag that provides rate-limited writeback with bounded page cache usage.
Modes Tested
| Mode | Mechanism | Cache behavior |
|---|---|---|
| buffered | Standard buffered I/O | Pages stay in cache indefinitely |
| dontcache | RWF_DONTCACHE: flush all dirty pages on every write | Pages evicted after full flush |
| direct | O_DIRECT: bypass page cache entirely | No caching |
| dontcache_lazy | RWF_DONTCACHE_LAZY: skip-if-busy + proportional writeback | Pages evicted after rate-limited flush |
How DONTCACHE_LAZY Works
IOCB_DONTCACHE_LAZY uses two mechanisms to rate-limit writeback:
- Skip-if-busy: Before flushing, check mapping_tagged(PAGECACHE_TAG_WRITEBACK). If writeback is already in progress, skip the flush entirely. This eliminates writeback submission contention between concurrent writers.
- Proportional cap: When flushing does occur, cap nr_to_write to the number of pages just written. This prevents any single write from triggering a full-file flush that would starve concurrent readers.
Both mechanisms are necessary — benchmarks show removing either one causes severe regressions.
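The two rules can be illustrated with a toy model. This is a sketch of the decision logic as described above, not the kernel implementation: the function names, parameters, and the 4 KiB page size used in the example are assumptions made for illustration.

```python
def lazy_flush_budget(writeback_in_progress: bool,
                      pages_just_written: int,
                      dirty_pages: int) -> int:
    """Toy model: how many pages one write submits for writeback."""
    # Skip-if-busy: another flush is already in flight for this file,
    # so this writer submits nothing and returns immediately.
    if writeback_in_progress:
        return 0
    # Proportional cap: flush at most what this write just contributed
    # (the role nr_to_write plays in the description above).
    return min(pages_just_written, dirty_pages)


def eager_flush_budget(dirty_pages: int) -> int:
    """Toy model of plain dontcache: flush every dirty page."""
    return dirty_pages


# A 1 MiB write with 4 KiB pages dirties 256 pages; assume 100,000
# pages are already dirty in the file.
print(lazy_flush_budget(False, 256, 100_000))  # 256
print(lazy_flush_budget(True, 256, 100_000))   # 0
print(eager_flush_budget(100_000))             # 100000
```

In this model the amount of writeback a single write can trigger is bounded by its own size, whereas the eager variant scales with the total dirty backlog.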
Deliverable 1: Single-Client Benchmarks
Sequential Write
| Mode | MB/s | p50 (ms) | p99 (ms) | p99.9 (ms) | Peak Cache |
|---|---|---|---|---|---|
| buffered | 575 | 30.0 | 43.8 | 51.1 | 243 GB |
| dontcache | 1179 | 3.2 | 103.3 | 170.9 | 4.7 GB |
| direct | 1374 | 11.5 | 21.9 | 24.2 | 242 GB |
| dontcache_lazy | 1442 | 10.7 | 22.2 | 23.5 | 49 GB |
Key finding: buffered is the slowest mode for large sequential writes.
With a 512 GB file on 256 GB RAM, buffered writes fill the page cache (243 GB peak), triggering the kernel's dirty throttling mechanism. This caps throughput to 575 MB/s with 30ms median latency — the writer spends most of its time waiting for the writeback subsystem to drain dirty pages.
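For a rough sense of where dirty throttling starts, the thresholds can be estimated from the vm.dirty_background_ratio and vm.dirty_ratio sysctls. The sketch below assumes the usual Linux defaults (10 and 20); the report does not state the values used on the test host. It also approximates dirtyable memory by total RAM, although the kernel applies the ratios to free plus reclaimable memory, so real thresholds are somewhat lower.

```python
def dirty_thresholds_gb(dirtyable_gb: float,
                        dirty_background_ratio: int = 10,
                        dirty_ratio: int = 20) -> tuple[float, float]:
    """Estimate (background, throttle) dirty thresholds in GB.

    Above the background threshold, flusher threads start writing back.
    Above the throttle threshold, writers are made to wait.
    """
    background = dirtyable_gb * dirty_background_ratio / 100
    throttle = dirtyable_gb * dirty_ratio / 100
    return background, throttle


# Assumption: all 256 GB of RAM counted as dirtyable memory.
background, throttle = dirty_thresholds_gb(256)
print(f"background writeback starts near {background:.1f} GB dirty")
print(f"writers are throttled near {throttle:.1f} GB dirty")
```

Under these assumptions a sustained sequential writer reaches the throttle threshold long before the 512 GB file is complete, after which its throughput is governed by how fast writeback drains dirty pages.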
dontcache_lazy is the fastest mode at 1442 MB/s, slightly beating even direct I/O (1374 MB/s). It achieves this by evicting pages after rate-limited writeback, keeping the cache bounded at 49 GB. Unlike direct I/O, writes still go through the page cache, allowing the kernel to coalesce adjacent writes before flushing — hence the slight throughput advantage.
dontcache achieves good throughput (1179 MB/s) but with terrible tail latency (p99.9 = 171ms) due to its aggressive full-file flush on every write.
Random Write
| Mode | MB/s | IOPS | p99.9 (ms) |
|---|---|---|---|
| direct | 326 | 83K | 0.8 |
| buffered | 316 | 81K | 0.9 |
| dontcache_lazy | 299 | 76K | 16.4 |
| dontcache | 148 | 38K | 8.7 |
For random 4K writes, dontcache_lazy is 5% behind buffered — an acceptable trade-off for bounded cache. The p99.9 tail (16.4ms vs 0.9ms) shows occasional writeback stalls when the skip-if-busy guard finds no active writeback and triggers a proportional flush.
dontcache collapses to half the throughput (148 MB/s), consistent with its behavior under NFS.
Sequential Read
| Mode | MB/s | p50 (ms) | p99.9 (ms) | Peak Cache |
|---|---|---|---|---|
| buffered | 2259 | 6.5 | 12.5 | 254 GB |
| dontcache | 2374 | 6.3 | 8.2 | 5.2 GB |
| direct | 2268 | 6.4 | 13.6 | 254 GB |
| dontcache_lazy | 2343 | 6.4 | 8.2 | 3.5 GB |
All modes deliver similar sequential read throughput (~2.3 GB/s). dontcache and dontcache_lazy keep cache bounded (3-5 GB vs 254 GB) with slightly tighter tail latency, since the smaller cache footprint reduces pressure on the memory subsystem.
Random Read
| Mode | MB/s | IOPS | p99.9 (ms) |
|---|---|---|---|
| direct | 567 | 145K | 0.3 |
| buffered | 563 | 144K | 0.3 |
| dontcache_lazy | 545 | 140K | 0.8 |
| dontcache | 542 | 139K | 0.8 |
Random reads are within ~4% across all modes. The slight disadvantage of dontcache/dontcache_lazy comes from page eviction preventing re-reads from cache, but on a 512 GB file with 256 GB RAM the hit rate is limited anyway.
Deliverable 2: Multi-Client Benchmarks
Scenario A: Multiple Writers (4 concurrent)
| Mode | Aggregate MB/s | Per-client p99 (ms) | Per-client p99.9 (ms) |
|---|---|---|---|
| buffered | 1501 | 41.2 | 58 |
| dontcache_lazy | 1434 | 38.5 | 329–447 |
| direct | 980 | 50.1 | 59 |
| dontcache | 707 | 3.1 | 160 |
dontcache_lazy maintains 95% of buffered aggregate throughput under 4-way write contention. dontcache collapses to 47% — every writer's full-file flush serializes against every other writer's flush, exactly as observed in the NFS benchmarks.
Concern: dontcache_lazy p99.9 tail latency is high (329–447ms vs 58ms for buffered). When a writer finds the writeback tag clear and triggers a proportional flush, it can stall briefly behind another writer's in-flight I/O. This wasn't visible in NFS benchmarks where protocol overhead dominates tail latency. Worth investigating whether a tighter proportional cap or exponential backoff would smooth these tails.
Scenario C: Noisy Writer + Latency-Sensitive Readers (same mode for both)
| Mode | Writer MB/s | Reader MB/s (per reader) | Reader p99.9 (ms) |
|---|---|---|---|
| buffered | 915 | 990 | 0.3 |
| direct | 1380 | 985 | 0.3 |
| dontcache_lazy | 1458 | 27 | 1974 |
| dontcache | 1323 | 24 | 44302 |
When the same mode is applied to both writer and readers, dontcache and dontcache_lazy destroy reader throughput — pages are marked for eviction on read (dropbehind), forcing every subsequent read back to disk.
This is expected behavior and precisely why mixed-mode I/O exists: use dontcache_lazy for writes (bounded cache, high throughput) and buffered for reads (benefit from warm cache).
Scenario D: Mixed Mode — dontcache_lazy writes + buffered reads
| Job | MB/s | IOPS | p50 (us) | p99.9 (ms) |
|---|---|---|---|---|
| Bulk writer (dontcache_lazy) | 1435 | 1435 | 482 | 24.0 |
| reader1 (buffered) | 750 | 192K | 129 | 0.5 |
| reader2 (buffered) | 764 | 196K | 129 | 0.5 |
| reader3 (buffered) | 746 | 191K | 129 | 0.5 |
This is the best-performing configuration tested for concurrent readers and writers. Compared to pure buffered (Scenario C):
| Metric | Buffered-only | Mixed mode | Change |
|---|---|---|---|
| Writer throughput | 915 MB/s | 1435 MB/s | +57% |
| Reader throughput | 990 MB/s | 753 MB/s (avg) | −24% |
| Reader p99.9 | 0.3 ms | 0.5 ms | +0.2 ms (negligible) |
The writer gains 57% throughput by avoiding dirty throttling. Readers lose 24% because the faster writer consumes more disk bandwidth, but they still serve primarily from page cache with sub-millisecond tail latency. Total system throughput (writer + 3 readers) is comparable: 3884 MB/s (buffered) vs 3694 MB/s (mixed).
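The derived figures can be recomputed from the Scenario C and Scenario D tables. The sketch below assumes the Scenario C reader figure (990 MB/s) is per reader and that there are three readers, as in Scenario D; the variable and function names are illustrative.

```python
# Rounded values from the Scenario C (buffered) and Scenario D (mixed) tables.
buffered_writer = 915
buffered_readers = [990, 990, 990]   # assumption: 990 MB/s is per reader
mixed_writer = 1435
mixed_readers = [750, 764, 746]


def pct_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100


def mean(values: list[float]) -> float:
    return sum(values) / len(values)


writer_gain = pct_change(buffered_writer, mixed_writer)
reader_change = pct_change(mean(buffered_readers), mean(mixed_readers))
total_buffered = buffered_writer + sum(buffered_readers)
total_mixed = mixed_writer + sum(mixed_readers)

print(f"writer: {writer_gain:+.0f}%")                # +57%
print(f"readers: {reader_change:+.0f}%")             # -24%
print(f"total buffered: {total_buffered} MB/s")      # 3885
print(f"total mixed: {total_mixed} MB/s")            # 3695
```

The totals computed from the rounded table values land within 1 MB/s of the 3884 and 3694 MB/s quoted above; the small difference is presumably due to rounding of the per-job figures.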
Page Cache Footprint
| Mode | Seq Write Cache | Multi-Writer Dirty |
|---|---|---|
| buffered | 243 GB | 48 GB |
| dontcache | 4.7 GB | 6 KB |
| direct | 242 GB | 45 GB |
| dontcache_lazy | 49 GB | 44 GB |
dontcache_lazy keeps the page cache bounded without the aggressive flushing of dontcache. The 49 GB cache footprint during sequential writes (vs 243 GB for buffered) leaves substantial memory available for other workloads.
Comparison: Local XFS vs NFS Results
| Metric | NFS (dontcache_lazy) | Local XFS (dontcache_lazy) |
|---|---|---|
| Seq write (single) | ~10 GB/s | 1.4 GB/s |
| Multi-writer aggregate | ~10 GB/s | 1.4 GB/s |
| vs buffered (multi-write) | 98% | 95% |
| Noisy neighbor reader impact | minimal | minimal (mixed mode) |
The relative behavior is consistent: dontcache_lazy maintains near-buffered throughput under contention while keeping cache bounded. The absolute throughput difference reflects that NFS benchmarks used a high-end NVMe array behind the NFS server, while local benchmarks run on a single device.
Conclusions
- dontcache_lazy is the fastest sequential write mode on local XFS, beating both buffered (2.5x) and direct I/O (1.05x) by avoiding dirty throttling while still benefiting from page cache write coalescing.
- dontcache collapses under contention on local filesystems just as it does over NFS — multi-writer throughput drops to 47% of buffered due to full-file flush serialization.
- dontcache_lazy multi-writer throughput is near-buffered (95%), confirming the skip-if-busy + proportional cap mechanisms work correctly on the local I/O path.
- Mixed mode (dontcache_lazy writes + buffered reads) is the optimal configuration for workloads with concurrent readers and writers — writer gets +57% throughput, readers maintain sub-ms tail latency.
- Page cache stays bounded at ~49 GB vs 243 GB for buffered, leaving memory available for other workloads.
- Open issue: multi-writer p99.9 tail latency (329–447ms) is elevated compared to buffered (58ms). This warrants investigation into tighter proportional caps or backoff strategies, though it may be acceptable for throughput-oriented workloads.
Test Configuration
- Kernel: v6.19-based
- Host: 80 CPUs, 256 GB RAM
- Filesystem: XFS on local NVMe
- RAM: 256 GB
- File size: 512 GB (2x RAM, ensures data exceeds cache)
- fio: Custom build with RWF_DONTCACHE_LAZY support (uncached=2)
- I/O engine: io_uring with iodepth=16
- Single-client: numjobs=1, full file size per test
- Multi-client: 4 concurrent fio processes, each writing RAM/4 (64 GB)
- Noisy neighbor: 1 bulk writer (512 GB) + 3 latency readers (512 MB each)
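As a sketch of how the single-client runs might be invoked, the helper below builds an fio command line for each mode from the settings listed above. Several details are assumptions rather than facts from this report: the file path and the 1 MiB block size are placeholders, and uncached=1 for plain dontcache is a guess by analogy with the uncached=2 value that the custom build uses for RWF_DONTCACHE_LAZY.

```python
def fio_args(mode: str, rw: str,
             filename: str = "/mnt/xfs/testfile",  # hypothetical path
             size: str = "512g",
             bs: str = "1m") -> list[str]:         # assumed block size
    """Build an fio argument list for one single-client test."""
    mode_opts = {
        "buffered": ["--direct=0"],
        "direct": ["--direct=1"],
        # Assumption: uncached=1 selects RWF_DONTCACHE.
        "dontcache": ["--direct=0", "--uncached=1"],
        # From the test configuration: uncached=2 on the custom build.
        "dontcache_lazy": ["--direct=0", "--uncached=2"],
    }
    return [
        "fio",
        f"--name={mode}-{rw}",
        f"--filename={filename}",
        f"--rw={rw}",
        f"--bs={bs}",
        f"--size={size}",
        "--ioengine=io_uring",
        "--iodepth=16",
        "--numjobs=1",
        *mode_opts[mode],
    ]


print(" ".join(fio_args("dontcache_lazy", "write")))
```

The function only assembles arguments and does not run fio, so it can be inspected or adapted before use, for example by passing rw="randwrite" and bs="4k" for the random write test.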