IOCB_DONTCACHE_LAZY: Local Filesystem (XFS) Benchmark Analysis

Date: 2026-03-28
Kernel: v6.19-based
Host: 80 CPUs + 256 GB RAM
Filesystem: XFS on local NVMe
RAM: 256 GB
File size: ~512 GB (2x RAM)
I/O engine: fio with io_uring, iodepth=16


Overview

This report compares local filesystem performance across four I/O modes. One of them uses RWF_DONTCACHE_LAZY, a new pwritev2/io_uring flag that provides rate-limited writeback with bounded page cache usage.

Modes Tested

Mode            Mechanism                                                  Cache behavior
buffered        Standard buffered I/O                                      Pages stay in cache indefinitely
dontcache       RWF_DONTCACHE: flush all dirty pages on every write        Pages evicted after full flush
direct          O_DIRECT: bypass page cache entirely                       No caching
dontcache_lazy  RWF_DONTCACHE_LAZY: skip-if-busy + proportional writeback  Pages evicted after rate-limited flush

How DONTCACHE_LAZY Works

IOCB_DONTCACHE_LAZY uses two mechanisms to rate-limit writeback:

  1. Skip-if-busy: Before flushing, check mapping_tagged(PAGECACHE_TAG_WRITEBACK). If writeback is already in progress, skip the flush entirely. This eliminates writeback submission contention between concurrent writers.
  2. Proportional cap: When flushing does occur, cap nr_to_write to the number of pages just written. This prevents any single write from triggering a full-file flush that would starve concurrent readers.

Both mechanisms are necessary — benchmarks show removing either one causes severe regressions.
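
The two rules above can be sketched as a small decision function. This is a toy model in Python, not kernel code: the function names are hypothetical, and the 1 MiB write size and 4 KiB page size are assumptions chosen only to make the numbers concrete.

```python
def lazy_flush_budget(writeback_in_progress: bool, pages_just_written: int) -> int:
    """Toy model: how many pages a single write may submit for writeback."""
    if writeback_in_progress:
        # Skip-if-busy: a flush is already running on this file, submit nothing.
        return 0
    # Proportional cap: flush at most as many pages as this write dirtied.
    return pages_just_written


def eager_flush_budget(dirty_pages_in_file: int) -> int:
    """Toy model of plain dontcache as described above: flush every dirty page."""
    return dirty_pages_in_file


# Assumed sizes: a 1 MiB write with 4 KiB pages dirties 256 pages.
pages = (1024 * 1024) // 4096

print(lazy_flush_budget(False, pages))  # 256
print(lazy_flush_budget(True, pages))   # 0
print(eager_flush_budget(100_000))      # 100000
```

In this model the lazy budget never exceeds the size of the write that triggered it, while the eager budget grows with however much dirty data the file has accumulated.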


Deliverable 1: Single-Client Benchmarks

Sequential Write

Mode            MB/s  p50 (ms)  p99 (ms)  p99.9 (ms)  Peak Cache
buffered         575      30.0      43.8        51.1      243 GB
dontcache       1179       3.2     103.3       170.9      4.7 GB
direct          1374      11.5      21.9        24.2      242 GB
dontcache_lazy  1442      10.7      22.2        23.5       49 GB

Key finding: buffered is the slowest mode for large sequential writes.

With a 512 GB file on 256 GB RAM, buffered writes fill the page cache (243 GB peak), triggering the kernel's dirty throttling mechanism. This caps throughput to 575 MB/s with 30ms median latency — the writer spends most of its time waiting for the writeback subsystem to drain dirty pages.

dontcache_lazy is the fastest mode at 1442 MB/s, slightly ahead of direct I/O (1374 MB/s). It achieves this by evicting pages after rate-limited writeback, which keeps the cache bounded at 49 GB. Unlike direct I/O, writes still go through the page cache, so the kernel can coalesce adjacent writes before flushing. This likely accounts for the roughly 5% throughput advantage over direct I/O.

dontcache achieves good throughput (1179 MB/s) but with terrible tail latency (p99.9 = 171ms) due to its aggressive full-file flush on every write.
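
The relative figures quoted for sequential writes can be recomputed from the table. A minimal sketch; the dictionary and helper name are illustrative only.

```python
# Sequential write throughput in MB/s, copied from the table above.
seq_write_mbps = {
    "buffered": 575,
    "dontcache": 1179,
    "direct": 1374,
    "dontcache_lazy": 1442,
}


def speedup(mode: str, baseline: str) -> float:
    """Throughput of `mode` relative to `baseline`."""
    return seq_write_mbps[mode] / seq_write_mbps[baseline]


print(f"{speedup('dontcache_lazy', 'buffered'):.2f}x")  # 2.51x
print(f"{speedup('dontcache_lazy', 'direct'):.2f}x")    # 1.05x
```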

Random Write

Mode            MB/s  IOPS  p99.9 (ms)
direct           326   83K         0.8
buffered         316   81K         0.9
dontcache_lazy   299   76K        16.4
dontcache        148   38K         8.7

For random 4K writes, dontcache_lazy is 5% behind buffered — an acceptable trade-off for bounded cache. The p99.9 tail (16.4ms vs 0.9ms) shows occasional writeback stalls when the skip-if-busy guard finds no active writeback and triggers a proportional flush.

dontcache drops to 148 MB/s, about 47% of buffered throughput, consistent with its behavior under NFS.

Sequential Read

Mode            MB/s  p50 (ms)  p99.9 (ms)  Peak Cache
buffered        2259       6.5        12.5      254 GB
dontcache       2374       6.3         8.2      5.2 GB
direct          2268       6.4        13.6      254 GB
dontcache_lazy  2343       6.4         8.2      3.5 GB

All modes deliver similar sequential read throughput (~2.3 GB/s). dontcache and dontcache_lazy keep cache bounded (3-5 GB vs 254 GB) with slightly tighter tail latency, since the smaller cache footprint reduces pressure on the memory subsystem.

Random Read

Mode            MB/s  IOPS  p99.9 (ms)
direct           567  145K         0.3
buffered         563  144K         0.3
dontcache_lazy   545  140K         0.8
dontcache        542  139K         0.8

Random reads are within ~4% across all modes. The slight disadvantage of dontcache/dontcache_lazy comes from page eviction preventing re-reads from cache, but on a 512 GB file with 256 GB RAM the hit rate is limited anyway.


Deliverable 2: Multi-Client Benchmarks

Scenario A: Multiple Writers (4 concurrent)

Mode            Aggregate MB/s  Per-client p99 (ms)  Per-client p99.9 (ms)
buffered                  1501                 41.2                     58
dontcache_lazy            1434                 38.5                329–447
direct                     980                 50.1                     59
dontcache                  707                  3.1                    160

dontcache_lazy maintains 95% of buffered aggregate throughput under 4-way write contention. dontcache collapses to 47% — every writer's full-file flush serializes against every other writer's flush, exactly as observed in the NFS benchmarks.

Concern: dontcache_lazy p99.9 tail latency is high (329–447ms vs 58ms for buffered). When a writer finds the writeback tag clear and triggers a proportional flush, it can stall briefly behind another writer's in-flight I/O. This wasn't visible in NFS benchmarks where protocol overhead dominates tail latency. Worth investigating whether a tighter proportional cap or exponential backoff would smooth these tails.
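
To put the concern in proportion, the throughput share and the tail inflation relative to buffered can be computed from the Scenario A table. A short sketch; variable names are illustrative.

```python
# Scenario A figures copied from the table above.
buffered_agg_mbps, buffered_p999_ms = 1501, 58
lazy_agg_mbps, lazy_p999_range_ms = 1434, (329, 447)

throughput_share = lazy_agg_mbps / buffered_agg_mbps
tail_low, tail_high = (ms / buffered_p999_ms for ms in lazy_p999_range_ms)

print(f"throughput: {throughput_share:.1%} of buffered")        # 95.5% of buffered
print(f"p99.9: {tail_low:.1f}x to {tail_high:.1f}x buffered")   # 5.7x to 7.7x buffered
```

So dontcache_lazy gives up under 5% of aggregate throughput, while its p99.9 per-client latency is several times that of buffered.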

Scenario C: Noisy Writer + Latency-Sensitive Readers (same mode for both)

Mode            Writer MB/s  Reader MB/s  Reader p99.9 (ms)
buffered                915          990                0.3
direct                 1380          985                0.3
dontcache_lazy         1458           27               1974
dontcache              1323           24              44302

When the same mode is applied to both writer and readers, dontcache and dontcache_lazy destroy reader throughput — pages are marked for eviction on read (dropbehind), forcing every subsequent read back to disk.

This is expected behavior and precisely why mixed-mode I/O exists: use dontcache_lazy for writes (bounded cache, high throughput) and buffered for reads (benefit from warm cache).

Scenario D: Mixed Mode — dontcache_lazy writes + buffered reads

Job                           MB/s  IOPS  p50 (us)  p99.9 (ms)
Bulk writer (dontcache_lazy)  1435  1435       482        24.0
reader1 (buffered)             750  192K       129         0.5
reader2 (buffered)             764  196K       129         0.5
reader3 (buffered)             746  191K       129         0.5

This is the best-performing configuration tested. Compared to pure buffered (Scenario C):

Metric             Buffered-only  Mixed mode      Change
Writer throughput  915 MB/s       1435 MB/s       +57%
Reader throughput  990 MB/s       753 MB/s (avg)  −24%
Reader p99.9       0.3 ms         0.5 ms          negligible

The writer gains 57% throughput by avoiding dirty throttling. Readers lose 24% because the faster writer consumes more disk bandwidth, but they still serve primarily from page cache with sub-millisecond tail latency. Total system throughput (writer + 3 readers) is comparable: 3884 MB/s (buffered) vs 3694 MB/s (mixed).
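
The percentages and totals above can be checked against the Scenario C and D tables. This sketch assumes the 990 MB/s buffered reader figure is per reader with three readers, which is what the quoted total implies. Totals computed from the rounded table values come out 1 MB/s higher than the quoted ones, presumably because the report summed unrounded measurements.

```python
# Buffered-only (Scenario C) and mixed-mode (Scenario D) figures in MB/s.
buffered_writer, buffered_reader_each, n_readers = 915, 990, 3
mixed_writer, mixed_readers = 1435, [750, 764, 746]

reader_avg = sum(mixed_readers) / len(mixed_readers)
writer_change = mixed_writer / buffered_writer - 1
reader_change = reader_avg / buffered_reader_each - 1

total_buffered = buffered_writer + n_readers * buffered_reader_each
total_mixed = mixed_writer + sum(mixed_readers)

print(f"reader average: {reader_avg:.0f} MB/s")   # 753 MB/s
print(f"writer change:  {writer_change:+.0%}")    # +57%
print(f"reader change:  {reader_change:+.0%}")    # -24%
print(total_buffered, total_mixed)                # 3885 3695
```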


Page Cache Footprint

Mode            Seq Write Cache  Multi-Writer Dirty
buffered                 243 GB               48 GB
dontcache                4.7 GB                6 KB
direct                   242 GB               45 GB
dontcache_lazy            49 GB               44 GB

dontcache_lazy keeps the page cache bounded without the aggressive flushing of dontcache. The 49 GB cache footprint during sequential writes (vs 243 GB for buffered) leaves substantial memory available for other workloads.
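
Expressed as a share of the 256 GB of host RAM, the sequential-write footprints look like this. A small sketch over the table values; names are illustrative.

```python
ram_gb = 256
# Peak page cache during sequential write, in GB, from the table above.
peak_cache_gb = {"buffered": 243, "dontcache": 4.7, "direct": 242, "dontcache_lazy": 49}

share_of_ram = {mode: gb / ram_gb for mode, gb in peak_cache_gb.items()}
reduction_vs_buffered = 1 - peak_cache_gb["dontcache_lazy"] / peak_cache_gb["buffered"]

for mode, share in share_of_ram.items():
    print(f"{mode}: {share:.0%} of RAM")
print(f"dontcache_lazy uses {reduction_vs_buffered:.0%} less cache than buffered")  # 80% less
```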


Comparison: Local XFS vs NFS Results

Metric                        NFS (dontcache_lazy)  Local XFS (dontcache_lazy)
Seq write (single)            ~10 GB/s              1.4 GB/s
Multi-writer aggregate        ~10 GB/s              1.4 GB/s
vs buffered (multi-write)     98%                   95%
Noisy neighbor reader impact  minimal               minimal (mixed mode)

The relative behavior is consistent: dontcache_lazy maintains near-buffered throughput under contention while keeping cache bounded. The absolute throughput difference reflects that NFS benchmarks used a high-end NVMe array behind the NFS server, while local benchmarks run on a single device.


Conclusions

  1. dontcache_lazy is the fastest sequential write mode on local XFS, beating both buffered (2.5x) and direct I/O (1.05x) by avoiding dirty throttling while still benefiting from page cache write coalescing.
  2. dontcache collapses under contention on local filesystems just as it does over NFS — multi-writer throughput drops to 47% of buffered due to full-file flush serialization.
  3. dontcache_lazy multi-writer throughput is near-buffered (95%), confirming the skip-if-busy + proportional cap mechanisms work correctly on the local I/O path.
  4. Mixed mode (dontcache_lazy writes + buffered reads) is the optimal configuration for workloads with concurrent readers and writers — writer gets +57% throughput, readers maintain sub-ms tail latency.
  5. Page cache stays bounded at ~49 GB vs 243 GB for buffered, leaving memory available for other workloads.
  6. Open issue: multi-writer p99.9 tail latency (329–447ms) is elevated compared to buffered (58ms). This warrants investigation into tighter proportional caps or backoff strategies, though it may be acceptable for throughput-oriented workloads.

Test Configuration

  • Kernel: v6.19-based
  • Host: 80 CPUs + 256 GB RAM
  • Filesystem: XFS on local NVMe
  • RAM: 256 GB
  • File size: 512 GB (2x RAM, ensures data exceeds cache)
  • fio: Custom build with RWF_DONTCACHE_LAZY support (uncached=2)
  • I/O engine: io_uring with iodepth=16
  • Single-client: numjobs=1, full file size per test
  • Multi-client: 4 concurrent fio processes, each writing RAM/4 (64 GB)
  • Noisy neighbor: 1 bulk writer (512 GB) + 3 latency readers (512 MB each)
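
As a rough illustration of the mixed-mode noisy-neighbor setup, the sketch below generates a fio job file with one dontcache_lazy writer and three buffered readers. The engine, queue depth, sizes, and uncached=2 come from the list above (uncached=2 only works with the custom fio build). The job names, file paths, block sizes (1M and 4k), and the choice of randread for the readers are assumptions, not taken from the actual test scripts.

```python
def fio_section(name: str, **options) -> str:
    """Render one section of a fio job file as text."""
    lines = [f"[{name}]"]
    lines.extend(f"{key}={value}" for key, value in options.items())
    return "\n".join(lines)


sections = [
    fio_section("global", ioengine="io_uring", iodepth=16),
    # Bulk writer: uncached=2 selects RWF_DONTCACHE_LAZY in the custom fio build.
    fio_section("bulk_writer", rw="write", bs="1M", size="512g",
                filename="/mnt/xfs/bulk.dat", uncached=2),
]
# Readers use plain buffered I/O, so no uncached option is set (hypothetical paths).
for i in (1, 2, 3):
    sections.append(fio_section(f"reader{i}", rw="randread", bs="4k", size="512m",
                                filename=f"/mnt/xfs/read{i}.dat"))

job_file = "\n\n".join(sections)
print(job_file)
```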
Pub: 28 Mar 2026 12:20 UTC