32 GB in 6 seconds. Large file experiments with Rust
I recently needed to read/write to disk a 64,000 x 64,000 array of u64 values.
64k * 64k = 4.1 billion entries * 8 bytes = 32.7 GB loaded every test run,
well within the range where micro-optimization pays off.
Perfecting superfast load times at these sizes with Rust is a surprisingly hard puzzle though.
The file content is a complete xy array of Entity IDs stored as usize array indexes.
I will use u64 and usize interchangeably (on a 64-bit target they are the same size).
Use case is pathfinding in my facto-loop-miner Factorio Mod.
Sure, this problem has other solutions. But in 2023 we have fast SK hynix P41 NVMe drives rated at 7 GB/s and 256 GB of RAM! We have AWS m7gd.2xlarge instances! "Write memory as-is to disk" should still be an easy, practical solution.
I'm documenting the increasing complexity required to obtain better performance,
including cargo bench benchmarks for my full file read/write use case.
This excludes page cache concerns and higher level Rust IO crates.
I will explore the built-in low-level std Rust and libc Linux APIs
to understand how they work, from safe copy to mmap to io_uring.
All code available at https://github.com/TheLQ/facto-loop-miner io_experiments crate. The file write code is omitted here though looks very similar to read.
- Step 1: Safely Read Vec<u64> from a File in Rust
- Step 2: Unsafely reading Vec<u64> from a file in Rust
- Step 3: Memory Mapped Files
- Step 4: Superfast io_uring
- Summary
Step 1: Safely Read Vec<u64> from a File in Rust
We really just have a byte array. Why not just store on disk as a byte array?
pub fn read_entire_file(path: &Path) -> VIoResult<Vec<u8>> {
let mut file = File::open(path).map_err(VIoError::io_error(path))?;
let xy_array_len_u8 = get_file_size(&file, path)? as usize;
let mut xy_array_u8_raw: Vec<u8> = Vec::with_capacity(xy_array_len_u8);
file.read_to_end(&mut xy_array_u8_raw)
.map_err(VIoError::io_error(path))?;
Ok(xy_array_u8_raw)
}
#[bench]
fn bench_read_minimum_unconverted(bencher: &mut test::Bencher) {
bencher.iter(|| {
let output = read_entire_file(&input_path()).unwrap();
checksum_vec_u8(output)
})
}
The immediate problem is the entire std::io API speaks u8, not u64.
Safe std Rust unfortunately can't reinterpret slices between different sized number types.
So we must Copy each [u8; 8] chunk into a new usize with usize::from_ne_bytes,
which builds a usize from native-endian bytes and is effectively a safe transmute.
Copying isn't ideal, but it should still be fast, right?
Lacking the freedom of C's memcpy() that only needs 2 pointers and a size,
we will copy each entry manually.
The most generic way is a middle iterator from impl IntoIterator<u8> to impl IntoIterator<usize>.
We trust the --release compiler to optimize our iterator chain.
We also should support an iterator of references that does not take ownership
of the source Vec, at the cost of lifetime syntax.
#[bench]
fn bench_read_iter(bencher: &mut test::Bencher) {
bencher.iter(|| {
let data = read_entire_file(&input_path()).unwrap();
let output = map_u8_to_usize_iter(data).into_iter().collect();
checksum_vec_usize(output)
});
}
pub fn map_u8_to_usize_iter(
input_bytes: impl IntoIterator<Item = u8>,
) -> impl IntoIterator<Item = usize> {
input_bytes
.into_iter()
.array_chunks()
.map(usize::from_ne_bytes)
}
pub fn map_u8_to_usize_iter_ref<'a>(
input_bytes: impl IntoIterator<Item = &'a u8> + 'a,
) -> impl IntoIterator<Item = usize> + 'a {
input_bytes
.into_iter()
.array_chunks()
.map(|e| e.map(|b| *b))
.map(usize::from_ne_bytes)
}
pub fn map_usize_to_u8_iter(
input_bytes: impl IntoIterator<Item = usize>,
) -> impl IntoIterator<Item = u8> {
input_bytes.into_iter().flat_map(usize::to_ne_bytes)
}
Let's also try the "feels faster" classic for loop and slice implementation.
Think about how many lines of manual math other languages would take
to implement our 3 line array_chunks_mut loop.
#[bench]
fn bench_read_slice(bencher: &mut test::Bencher) {
bencher.iter(|| {
let data = read_entire_file(&input_path()).unwrap();
let mut usize_data = vec![0; data.len() / USIZE_BYTES];
map_u8_to_usize_slice(&data, &mut usize_data);
checksum_vec_usize(usize_data)
});
}
pub fn map_usize_to_u8_slice(input: &[usize], output: &mut [u8]) {
assert_eq!(input.len() * USIZE_BYTES, output.len(), "output too small");
for (i, output_chunk) in output.array_chunks_mut::<USIZE_BYTES>().enumerate() {
output_chunk.clone_from_slice(&input[i].to_ne_bytes());
}
}
pub fn map_u8_to_usize_slice(input: &[u8], output: &mut [usize]) {
assert_eq!(input.len() / USIZE_BYTES, output.len(), "output too small");
for (i, input_chunk) in input.array_chunks().enumerate() {
output[i] = usize::from_ne_bytes(*input_chunk);
}
}
Worth highlighting slice::array_chunks: fixed size arrays give elegant, almost magical syntax
by implicitly yielding one [u8; 8] at a time, with the chunk size inferred from usize::from_ne_bytes's signature.
Someone experienced in reading LLVM IR could tell whether fixed size arrays
compile to more optimal code than slice::chunks_exact.
$ cargo bench -- bench_read_iter bench_read_slice bench_read_minimum_unconverted
test io_bench::bench_read_minimum_unconverted ... bench: 24,807,104,942 ns/iter (+/- 97,146,525)
test io_bench::bench_read_iter ... bench: 57,887,892,285 ns/iter (+/- 534,311,715)
test io_bench::bench_read_slice ... bench: 43,433,450,052 ns/iter (+/- 9,505,170,723)
$ echo 0 > /sys/module/zfs/parameters/zfs_arc_shrinker_limit
$ echo 3 > /proc/sys/vm/drop_caches
$ cargo run --release
InnerMain loaded pixel-xy-indexes.dat in 30,920 ms
checksum 16799455120
The unit is nanoseconds per iteration, or around 30-60 seconds to read the file.
Slice is significantly faster than Iterator, as expected for such a tight loop
that doesn't benefit from Iterator ergonomics.
Surprisingly, even bench_read_minimum_unconverted takes a long time,
peaking at roughly 1.1 GiB/s in iotop,
both when copying from the page cache and straight from NVMe.
Waiting minutes every run just shuffling memory around for the Copy grew increasingly annoying. The hardware has plenty of headroom remaining; we just need to write better software.
Step 2: Unsafely reading Vec<u64> from a file in Rust
To go faster we enter the land of unsafe memory management.
To show the dangers that lurk, consider this task:
make a Vec<usize>, fill it with u8 from disk, and return the original Vec<usize>.
Sounds reasonable, and we can make it compile.
Let's ignore the perilous lint clippy::unsound_collection_transmute while building.
It crashes at runtime with Bad Address because of unaligned access:
basically this memory is laid out for usize indexing, which breaks
when trying to index the u8 underneath.
What about: fill a Vec<u8> from disk like normal, transmute to Vec<usize> and return?
We no longer crash, however our validation checksum is somehow 0.
A likely culprit: read_to_end appends after the Vec's existing len, so the zeros from vec![0; len] stay at the front and set_len truncates the result down to them.
Clearly abusing transmute is not the solution.
pub fn read_entire_file_transmute_u64_broken(path: &Path) -> VResult<Vec<usize>> {
let mut file = File::open(path).map_err(VError::io_error(path))?;
let xy_array_len_u8 = get_file_size(&file, path)? as usize;
let xy_array_len_u64 = xy_array_len_u8 / 8;
// let big_xy_bytes: Vec<usize> = vec![0; xy_array_len_u64];
//
// let mut small_xy_bytes: Vec<u8> = unsafe { transmute(big_xy_bytes) };
// unsafe { small_xy_bytes.set_len(xy_array_len_u8) };
let mut small_xy_bytes: Vec<u8> = vec![0; xy_array_len_u8];
file.read_to_end(&mut small_xy_bytes)
.map_err(VError::io_error(path))?;
let mut big_xy_bytes: Vec<usize> = unsafe { transmute(small_xy_bytes) };
unsafe { big_xy_bytes.set_len(xy_array_len_u64) };
Ok(big_xy_bytes)
}
fn bench_injest_value(output: Vec<usize>) -> usize {
let total: usize = output.iter().sum();
println!("total {}", total);
assert_ne!(total, 0);
total
}
# first attempt, with the now-commented code enabled
$ BROKEN_TEST=yes cargo bench -- io_bench::bench_read_transmute_broken
thread 'main' panicked at src/util/io_bench.rs:64:77:
... e: Os { code: 14, kind: Uncategorized, message: "Bad address" }
# second attempt, with that code commented out
$ BROKEN_TEST=yes cargo bench -- io_bench::bench_read_transmute_broken
total 0
thread 'main' panicked at src/util/io_bench.rs:70:5:
assertion `left != right` failed
left: 0
right: 0
Searching discovers slice::align_to,
providing a correctly typed slice with aligned access to the same memory.
This is the sound transmute we were looking for!
We just assemble a Vec<u8> for file.read_to_end(), then return a Vec<usize>.
A first memory-bug troubleshooting story: the first version of this code crashed again with Bad Address.
Strangely the checksum would work inside the function but not in the benchmark,
merely one step away. Thinking deeper, what actually happens between those lines
is Dropping all allocated memory besides the result,
including xy_vec_aligned_u8, which points to the same memory as our result.
Oops, we must mem::forget any doubly-owned variables,
an unsurprising but normally impossible situation in Safe Rust.
#[bench]
fn bench_read_aligned_vec(bencher: &mut test::Bencher) {
bencher.iter(|| {
let output = read_entire_file_usize_aligned_vec(&input_path()).unwrap();
checksum_vec_usize(output)
})
}
pub fn read_entire_file_usize_aligned_vec(path: &Path) -> VIoResult<Vec<usize>> {
let mut file = File::open(path).map_err(VIoError::io_error(path))?;
let xy_array_len_u8 = get_file_size(&file, path)? as usize;
let xy_array_len_u64 = xy_array_len_u8 / USIZE_BYTES;
// Allocate result buffer. We will fill the internal capacity
let mut xy_vec_u64: Vec<usize> = vec![0; xy_array_len_u64];
// Build u8 Vec viewing the same memory with proper aligned access
// Docs state the outer slices should be empty in real world environments
let (xy_vec_prefix, xy_vec_aligned, xy_vec_suffix) = unsafe { xy_vec_u64.align_to_mut::<u8>() };
assert_eq!(xy_vec_prefix.len(), 0, "prefix big");
assert_eq!(xy_vec_suffix.len(), 0, "suffix big");
assert_eq!(xy_vec_aligned.len(), xy_array_len_u8, "aligned size");
let mut xy_vec_aligned_u8: Vec<u8> = unsafe {
Vec::from_raw_parts(
xy_vec_aligned.as_mut_ptr(),
0,
xy_array_len_u8 * mem::size_of::<u8>(),
)
};
assert_eq!(xy_vec_aligned_u8.capacity(), xy_array_len_u8, "veccapacity");
file.read_to_end(&mut xy_vec_aligned_u8)
.map_err(VIoError::io_error(path))?;
assert_eq!(xy_vec_aligned_u8.len(), xy_array_len_u8, "vec length");
// Do not double free memory owned by xy_vec_u64
mem::forget(xy_vec_aligned_u8);
println!("wrote array of {}", xy_vec_u64.len());
println!("array sum {}", xy_vec_u64.iter().sum::<usize>());
Ok(xy_vec_u64)
}
# double free
$ cargo bench -- --nocapture io_bench::bench_read_aligned_vec
running 1 test
wrote array of 64000000
array sum 224321692961
error: bench failed ... signal: 11, SIGSEGV: invalid memory reference
# clean
$ cargo bench -- --nocapture io_bench::bench_read_aligned_vec
test io_bench::bench_read_minimum_unconverted ... bench: 24,807,104,942 ns/iter (+/- 97,146,525)
test io_bench::bench_read_aligned_vec ... bench: 17,000,644,090 ns/iter (+/- 217,384,579)
Impressive, only 3 seconds slower than reading the file directly. This is probably the best
we can do with the std::io sequential APIs.
Looking at the io::read_to_end source,
we probably aren't going to write a better one. Our only other responsibilities are
allocating and providing aligned access, which are only 2 calls.
Step 3: Memory Mapped Files
Why are we (underneath) repeatedly calling read() and doing buffer math?
mmap has the Kernel create a buffer for us
and return a pointer to it. The file is transparently read from disk into that buffer.
That sounds way easier!
We mmap Page Aligned memory prepopulated with our file content plus extra padding,
add good-sounding madvise options, then use the same align_to method from above to get a
useful Vec<usize>.
On the first attempt we SIGSEGV.
To Drop an mmap we must use munmap (memory unmap), not free.
To communicate this we wrap our Vec in ManuallyDrop.
For simplicity our benchmark uses an explicit drop function instead of impl Drop.
pub fn read_entire_file_usize_mmap_custom(
path: &Path,
populate: bool,
sequential: bool,
willneed: bool,
) -> VIoResult<ManuallyDrop<Vec<usize>>> {
let file = File::open(path).map_err(VIoError::io_error(path))?;
let file_size = get_file_size(&file, path)? as usize;
let page_size = unsafe { libc::sysconf(libc::_SC_PAGE_SIZE) as usize };
let alignment_padding = page_size - (file_size % page_size);
let xy_array_len_u8 = file_size;
let xy_array_len_u64 = xy_array_len_u8 / USIZE_BYTES;
let xy_array_len_aligned_u8 = file_size + alignment_padding;
let xy_array_len_aligned_u64 = xy_array_len_aligned_u8 / USIZE_BYTES;
let vec: Vec<usize> = unsafe {
let mmap_ptr = libc::mmap(
ptr::null_mut(),
xy_array_len_aligned_u8,
// ACL required to use it
libc::PROT_READ | libc::PROT_WRITE,
// TODO: libc::MAP_HUGETLB | libc::MAP_HUGE_2MB
// Required mode, Prepopulate with file content
libc::MAP_PRIVATE | enable_if(libc::MAP_POPULATE, populate),
// SAFETY file can be closed immediately
file.as_raw_fd(),
0,
);
if mmap_ptr == libc::MAP_FAILED {
panic!("failed to mmap {}", xy_array_len_u8);
}
if libc::madvise(
mmap_ptr,
xy_array_len_aligned_u8,
enable_if(libc::MADV_SEQUENTIAL, sequential) | enable_if(libc::MADV_WILLNEED, willneed),
) != libc::EXIT_SUCCESS
{
panic!("madvise failed {}", io::Error::last_os_error());
}
let xy_array_u8 = slice::from_raw_parts_mut(mmap_ptr as *mut u8, xy_array_len_u8);
// Build usize Vec viewing the same memory with proper aligned access
// Docs state the outer slices should be empty in real world environments
let (xy_array_prefix, xy_array_aligned, xy_array_suffix) =
xy_array_u8.align_to_mut::<usize>();
assert_eq!(xy_array_prefix.len(), 0, "prefix big");
assert_eq!(xy_array_suffix.len(), 0, "suffix big");
assert_eq!(xy_array_aligned.len(), xy_array_len_u64, "aligned size");
let xy_vec_aligned_usize: Vec<usize> = Vec::from_raw_parts(
xy_array_aligned.as_mut_ptr(),
xy_array_len_u64,
xy_array_len_aligned_u64,
);
xy_vec_aligned_usize
};
println!("wrote array of {}", vec.len());
println!("array sum {}", vec.iter().sum::<usize>());
Ok(ManuallyDrop::new(vec))
}
/// Must Drop with munmap() not the normal free()
pub fn drop_mmap_vec(mut mmap_vec: ManuallyDrop<Vec<usize>>) {
unsafe {
let res = munmap(
mmap_vec.as_mut_slice().as_mut_ptr() as *mut libc::c_void,
mmap_vec.capacity() * USIZE_BYTES,
);
if res != 0 {
panic!("munmap failed {}", io::Error::last_os_error());
}
}
}
#[bench]
fn bench_read_mmap_custom(bencher: &mut test::Bencher) {
bencher.iter(|| {
let output = read_entire_file_usize_mmap_custom(&input_path(), true, true, true).unwrap();
let bench_result: usize = output.iter().sum1().unwrap();
drop_mmap_vec(output);
bench_result
});
}
$ cargo bench -- bench_read_mmap_custom bench_read_minimum_unconverted
running 2 tests
test io_bench::bench_read_minimum_unconverted ... bench: 140,110,749 ns/iter (+/- 22,567,824)
test io_bench::bench_read_mmap_custom ... bench: 277,764,977 ns/iter (+/- 34,554,011)
test result: ok. 0 passed; 0 failed; 0 ignored; 2 measured; 8 filtered out; finished in 124.81s
Wow, we are faster than File::read_to_end even including the checksum processing.
Production code should probably use github.com/RazrFalcon/memmap2-rs
or another crate that already implements
safe memory map Drop, Send, Sync, and other useful utils.
This code is to better understand what's happening, and maybe to re-use in other languages.
Step 4: Superfast io_uring
io_uring is an event-based asynchronous IO system for Linux.
It consists of 2 queues:
a submission queue you fill with read, write, create file, accept connection,
and other commands,
and a completion queue where the kernel later stores results.
All asynchronous, with minimal syscall overhead.
It's considered a modern iteration of the previous epoll / libaio / select async IO systems.
More control of the IO pipeline on modern NVMe drives lets us hit
the massive numbers seen in tech media benchmarks like
ATTO.
For fun, let's implement on top of the standard C library github.com/axboe/liburing
with its easy syscall builder functions.
We will create a richer-than-usual implementation in the style of a real application
instead of the common minimal test case.
A positive is we can roughly follow the amazing liburing resources below without
waiting for, and translating from, a high level wrapper library.
- Batch SQE submission
- Provided buffers, where we provide the kernel a list of buffers
- https://kernel.dk/axboe-kr2022.pdf
- https://unixism.net/loti/
- https://00pauln00.medium.com/io-uring-fixed-buffer-versus-non-fixed-buffer-performance-comparison-9fd506de6829
The con is we are far from Safe Rust.
Implementing this is a dizzying feat of raw pointer handling:
allocating 4096-byte Page Aligned memory and initializing a different sized type inside it,
kernel mapped memory that, if Dropped during a panic, causes "kernel BUG in mm/usercopy.c" crashes visible in dmesg,
Default-ing padding/reserved fields via mem::zeroed(),
in addition to the increased buffer management inherent to async programming.
First, create the main ring.
No size guidance is given, so from common recent benchmarks I assume a
32 queue depth will saturate most NVMe controllers.
For initialize-by-pointer C APIs the proper way appears to be not mem::zeroed() but mem::MaybeUninit,
with either zeroed() or scary uninitialized uninit() memory.
A handy as_mut_ptr is provided.
Then you assume_init() to unwrap the created struct.
const BUF_RING_COUNT: usize = 32;
pub struct IoUring {
pub ring: io_uring,
}
impl IoUring {
pub fn new() -> Self {
let ring = unsafe {
let mut s = mem::MaybeUninit::<io_uring>::zeroed();
// IORING_FEAT_ENTER_EXT_ARG so wait_cqes does not do submit() for us
// IORING_FEAT_EXT_ARG
let ret = io_uring_queue_init(BUF_RING_COUNT as u32, s.as_mut_ptr(), 0);
assert_eq!(
ret,
libc::EXIT_SUCCESS,
"io_uring_queue_init: {:?}",
IoError::from_raw_os_error(ret)
);
s.assume_init()
};
IoUring { ring }
}
}
Next create a File specific struct using the io_uring.
Asynchronous IO requires significant state covering
File info,
allocated buffers,
buffer metadata,
progress info,
and the final result Vec<u8>.
Large 1 MiB buffers were chosen to reduce CPU overhead vs 4k buffers, as shown in the above 00pauln00 post.
const BUF_RING_ENTRY_SIZE: usize = PAGE_SIZE * 256; // 1 MiB
const BUF_RING_COUNT: usize = 32;
type BackingBufEntry = [u8; BUF_RING_ENTRY_SIZE];
type BackingBufRing = [BackingBufEntry; BUF_RING_COUNT];
/// Handles read/write to a larger vec
pub struct IoUringFile {
file_handle: File,
file_registered_index: i32,
backing_iovecs: Vec<libc::iovec>,
backing_buf_ring: ManuallyDrop<Box<BackingBufRing>>,
backing_buf_ring_data: [BackingBufData; BUF_RING_COUNT],
result_buffer: ManuallyDrop<Vec<u8>>,
result_cursor: usize,
}
impl IoUringFile {
pub fn open(path: &Path, ring: &mut IoUring) -> IoResult<Self> {
// let file_handle = File::open(path)?;
let file_handle = OpenOptions::new()
.read(true)
// .custom_flags(libc::O_DIRECT)
.open(path)?;
let file_size = file_handle.metadata()?.len() as usize;
// Internal Fixed Files have less per-operation overhead than File Descriptors
let fds = [file_handle.as_raw_fd()];
let register_result = unsafe { io_uring_register_files(&mut ring.ring, fds.as_ptr(), 1) };
assert_eq!(register_result, 0, "register files failed");
let (backing_buf_ptr, _) = allocate_page_size_aligned::<BackingBufRing>();
let mut backing_buf_ring = unsafe { ManuallyDrop::new(Box::from_raw(backing_buf_ptr)) };
// Register buffers as iovecs
let backing_iovecs: Vec<libc::iovec> = backing_buf_ring
.iter_mut()
.map(|backing_buf| libc::iovec {
iov_len: backing_buf.len(),
iov_base: backing_buf.as_mut_ptr() as *mut libc::c_void,
})
.collect();
unsafe {
io_uring_register_buffers(
&mut ring.ring,
backing_iovecs.as_ptr(),
backing_iovecs.len() as libc::c_uint,
)
};
Ok(IoUringFile {
file_handle,
file_registered_index: 0,
backing_iovecs,
backing_buf_ring,
backing_buf_ring_data: [BackingBufData::default(); BUF_RING_COUNT],
result_buffer: ManuallyDrop::new(vec![0u8; file_size]),
result_cursor: 0,
})
}
}
Now we need an all-in-one processing loop that adds SQEs, submits, and processes CQEs until the file ends.
The first loop step is creating submission queue events (SQEs). The submission queue is effectively unlimited, so our throttle is the number of buffers available for in-flight read events. Processing must notify when all buffers are in use, indicating we need to wait for completed events first. It also must notify on EOF to terminate the outer loop. See the calling loop function at the end.
A generous u64 of user data is available that follows our SQE to its matching CQE.
Provide either an appropriately sized struct pointer or, much easier, an array index.
Commands must include additional information
so result handling knows how to use them. See the next CQE section.
impl IoUringFile {
pub unsafe fn refresh_submission_queue_read_for_buffer_index(
&mut self,
ring: &mut IoUring,
buf_index: usize,
) -> VIoResult<CreateSqeResult> {
let offset = self.result_cursor * BUF_RING_ENTRY_SIZE;
if offset >= self.file_size() {
info!("EOF at result_cursor {}", self.result_cursor);
return Ok(CreateSqeResult::Eof);
}
let sqe_ptr = io_uring_get_sqe(&mut ring.ring);
if sqe_ptr.is_null() {
return Err(VIoError::IoUring_SqeNullPointer {
backtrace: Backtrace::capture(),
});
}
io_uring_prep_read_fixed(
sqe_ptr,
self.file_registered_index,
self.backing_buf_ring[buf_index].as_mut_ptr() as *mut libc::c_void,
BUF_RING_ENTRY_SIZE as libc::c_uint,
offset as u64,
buf_index as libc::c_int,
);
io_uring_sqe_set_flags(sqe_ptr, IOSQE_FIXED_FILE);
io_uring_sqe_set_data64(sqe_ptr, buf_index as u64);
self.backing_buf_ring_data[buf_index].result_offset = offset;
self.backing_buf_ring_data[buf_index].backing_result_cursor = self.result_cursor;
trace!(
"sqe create {}\tcursor_status {}",
buf_index,
self.result_cursor
);
self.result_cursor += 1;
Ok(CreateSqeResult::CreatedAt(sqe_ptr))
}
}
Next loop step is to batch submit all previously configured SQEs.
impl IoUring {
pub fn submit(&mut self) -> bool {
let submitted_entries = unsafe { io_uring_submit(&mut self.ring) };
if submitted_entries < 0 {
panic!(
"io_uring_submit: {:?}",
IoError::from_raw_os_error(submitted_entries)
);
}
log_debug("submit");
submitted_entries != 0
}
}
Next we must consume the submission queue's results as Completion Queue Events (CQEs).
We wait_cqe, blocking if needed, get our "id" back with io_uring_cqe_get_data,
then copy the read result data into result_buffer.
We can pop events in 2 ways:
wait_cqe blocks until an event arrives, best for most cases IMO,
and peek_cqe returns immediately, maybe with an event,
best for extreme polling / spinlocking IO applications.
My earlier version used both, waiting only as needed,
but that didn't provide a clear benefit over a simple wait.
Remember CQEs are not ordered relative to how they were submitted.
impl IoUringFile {
pub unsafe fn pop_completion_queue(&mut self, ring: &mut IoUring) -> VIoResult<usize> {
let mut cqe_ptr: *mut io_uring_cqe = ptr::null_mut();
let ret = io_uring_wait_cqe(&mut ring.ring, &mut cqe_ptr);
if ret != 0 {
return Err(VIoError::IoUring_CqeWaitReturn {
e: io::Error::from_raw_os_error(-ret),
backtrace: Backtrace::capture(),
});
}
if cqe_ptr.is_null() {
return Err(VIoError::IoUring_CqeNullPointer {
backtrace: Backtrace::capture(),
});
}
let cqe_result = (*cqe_ptr).res;
if cqe_result < 0 {
return Err(VIoError::IoUring_CqeResultReturn {
e: io::Error::from_raw_os_error(-cqe_result),
backtrace: Backtrace::capture(),
});
} else if cqe_result != BUF_RING_ENTRY_SIZE as i32 {
// return Err(VIoError::IoUring_CqeReadIncomplete {
// expected_size: BUF_RING_ENTRY_SIZE,
// actual_size: cqe_result as usize,
// backtrace: Backtrace::capture(),
// });
warn!(
"expected {} got {}",
BUF_RING_ENTRY_SIZE.to_formatted_string(&LOCALE),
cqe_result.to_formatted_string(&LOCALE)
);
}
// let buffer_id = unsafe {
// (*cqe_ptr).flags >> IORING_CQE_BUFFER_SHIFT;
// };
let cqe_buf_index = io_uring_cqe_get_data64(cqe_ptr) as usize;
let target_offset_start = self.backing_buf_ring_data[cqe_buf_index].result_offset;
if target_offset_start > self.file_size() {
return Err(VIoError::IoUring_CqeOffsetTooBig {
file_size: self.file_size(),
target_offset: target_offset_start,
backtrace: Backtrace::capture(),
});
}
let source: BackingBufEntry = self.backing_buf_ring[cqe_buf_index];
let source_ref: &[u8];
let mut target_offset_end = target_offset_start + source.len();
if target_offset_end < self.file_size() {
source_ref = &source;
} else {
target_offset_end = self.file_size();
source_ref = &source[0..(target_offset_end - target_offset_start)];
}
trace!(
"cqe buf {} range {} to {}",
cqe_buf_index,
target_offset_start.to_formatted_string(&LOCALE),
target_offset_end.to_formatted_string(&LOCALE)
);
let target = &mut self.result_buffer[target_offset_start..target_offset_end];
target.copy_from_slice(source_ref);
io_uring_cqe_seen(&mut ring.ring, cqe_ptr);
Ok(cqe_buf_index)
}
}
$ cargo bench -- bench_read_io_uring bench_read_minimum_unconverted
test io_bench::bench_read_io_uring ... bench: 6,149,094,961 ns/iter (+/- 2,219,072,163)
test io_bench::bench_read_minimum_unconverted ... bench: 24,807,104,942 ns/iter (+/- 97,146,525)
The results are impressive. We flew past all previous iterations.
io_uring Aligned Memory and Hugepages
An important detail is how to get the 4096-byte Page Aligned memory required for kernel mapped memory.
All C io_uring examples mmap some anonymous (not file backed) memory,
then with careful pointer math grab chunks as they need them.
Rust can do the same, but it doesn't look very Rust-y,
passing pointers around and needing munmap like above. The native
way appears to be #[repr(C, align(4096))] structs, but I
couldn't get it to work reliably.
Another way is alloc::Layout, with some nice utility methods,
and the explicit std::alloc::alloc() functions.
The ergonomic API is pretty short and is what I will use.
pub fn allocate_page_size_aligned<Container>() -> (*mut Container, usize) {
let layout = Layout::from_size_align(mem::size_of::<Container>(), PAGE_SIZE)
.expect("allocate_layout")
.pad_to_align();
// let ptr = unsafe { alloc(layout) as *mut Container };
// if ptr.is_null() {
// panic!("allocate");
// }
// (ptr, ptr as usize)
let mmap_ptr = unsafe {
libc::mmap(
ptr::null_mut(),
layout.size(),
// ACL required to use it
libc::PROT_READ | libc::PROT_WRITE,
// TODO: libc::MAP_HUGETLB | libc::MAP_HUGE_2MB
// Anonymous memory, not backed by any file
libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
// no file descriptor for anonymous mappings
-1,
0,
)
};
if mmap_ptr == libc::MAP_FAILED {
panic!("mmap failed");
}
(mmap_ptr as *mut Container, mmap_ptr as usize)
}
The other way is with the built in allocator
fn allocate_array_page_size_aligned<Container, Entry>(count: usize) -> (*mut Container, usize) {
let layout = Layout::from_size_align(count * mem::size_of::<Entry>(), PAGE_SIZE)
.expect("allocate_layout");
let ptr = unsafe { alloc(layout) as *mut Container };
if ptr.is_null() {
panic!("allocate");
}
(ptr, ptr as usize)
}
Back to mmap: I noticed high transparent huge page numbers in my netdata metrics.
What about static huge pages?
Sadly, 1 GiB worth of 2 MB hugepages didn't change performance. The bottleneck on this lightly loaded server is elsewhere, not in page lookups. Going to skip this for now.
$ echo 512 | sudo tee /proc/sys/vm/nr_hugepages
$ grep Huge /proc/meminfo
AnonHugePages: 37255168 kB
ShmemHugePages: 0 kB
FileHugePages: 0 kB
HugePages_Total: 512
HugePages_Free: 512
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
Hugetlb: 1048576 kB
io_uring Caveats
During the artificial cargo bench runs, AMDuProfPcm top -m memory
shows almost 50 GB/s of memory bandwidth used.
With relatively higher CPU usage
because io_uring work queues run in different threads.
Understandable as this churns through a lot of IO.
Doesn't matter for my case though may for yours.
Be very careful panic unwinding with kernel mapped user memory.
Several times the kernel uncleanly exited our process or just hung,
logged stack traces to dmesg,
and left us an unkillable zombie process that breaks restarting our LXC container
(the cgroup can never be closed) until we inconveniently reboot the host.
The panic isn't always logged either, even though stderr
should show immediately, unlike stdout.
My guess is the in-flight buffer slices are "owned" by the kernel, which doesn't like us trying to free them. The robust solution appears to be special unwinding functions that clear the submission and completion queues before Dropping (maybe in the Error constructors?), and/or not writing buggy code.
desk 2633 0.3 0.0 1826404 49172 pts/1 D 17:31 0:09 ?
desk 2749 0.0 0.0 1826400 50424 pts/1 D 17:31 0:00 ?
desk 3181 0.0 0.0 1826404 52816 pts/1 D 17:32 0:00 ?
desk 4148 0.0 0.0 1826468 51016 ? D 17:39 0:00 ?
[2499920.170093] INFO: task facto-loop-mine:3972446 blocked for more than 241 seconds.
[2499920.170108] Tainted: P S OE 6.5.0-4-amd64 #1 Debian 6.5.10-1
[2499920.170115] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[2499920.170121] task:facto-loop-mine state:D stack:0 pid:3972446 ppid:4048933 flags:0x00020006
[2499920.170131] Call Trace:
[2499920.170136] <TASK>
[2499920.170142] __schedule+0x3db/0xb30
[2499920.170155] schedule+0x5e/0xd0
[2499920.170161] schedule_timeout+0x151/0x160
[2499920.170168] wait_for_completion+0x8a/0x160
[2499920.170176] io_wq_put_and_exit+0xa8/0x240
[2499920.170185] io_uring_clean_tctx+0x8a/0xc0
[2499920.170192] io_uring_cancel_generic+0x173/0x340
[2499920.170201] ? __pfx_autoremove_wake_function+0x10/0x10
[2499920.170210] do_exit+0x167/0xb10
[2499920.170216] ? srso_alias_return_thunk+0x5/0x7f
[2499920.170223] ? file_update_time+0x5a/0xe0
[2499920.170231] do_group_exit+0x31/0x80
[2499920.170237] get_signal+0x9a5/0x9e0
[2499920.170244] ? __pfx_pipe_write+0x10/0x10
[2499920.170253] arch_do_signal_or_restart+0x3e/0x290
[2499920.170262] exit_to_user_mode_prepare+0x195/0x1e0
[2499920.170271] syscall_exit_to_user_mode+0x1b/0x40
[2499920.170278] do_syscall_64+0x6c/0xc0
[2499920.170284] ? exit_to_user_mode_prepare+0x40/0x1e0
[2499920.170291] ? srso_alias_return_thunk+0x5/0x7f
[2499920.170297] ? syscall_exit_to_user_mode+0x2b/0x40
[2499920.170302] ? srso_alias_return_thunk+0x5/0x7f
[2499920.170308] ? do_syscall_64+0x6c/0xc0
[2499920.170313] ? srso_alias_return_thunk+0x5/0x7f
[2499920.170319] ? syscall_exit_to_user_mode+0x2b/0x40
[2499920.170325] ? srso_alias_return_thunk+0x5/0x7f
[2499920.170331] ? do_syscall_64+0x6c/0xc0
[2499920.170337] ? srso_alias_return_thunk+0x5/0x7f
[2499920.170342] ? do_syscall_64+0x6c/0xc0
[2499920.170348] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
kernel: usercopy: Kernel memory exposure attempt detected from vmalloc 'no area' (offset 0, size 225960)!
kernel: ------------[ cut here ]------------
kernel: kernel BUG at mm/usercopy.c:102!
kernel: invalid opcode: 0000 [#10] PREEMPT SMP NOPTI
kernel: CPU: 6 PID: 45524 Comm: iou-wrk-45022 Tainted: P D OE 6.6.8-amd64 #1 Debian 6.6.8-1
kernel: Hardware name: To Be Filled By O.E.M. ROMED8-2T/ROMED8-2T, BIOS P3.50 07/19/2022
kernel: RIP: 0010:usercopy_abort+0x6c/0x80
kernel: Code: 85 b3 51 48 c7 c2 7e bd 87 b3 41 52 48 c7 c7 b8 bb 8c b3 48 0f 45 d6 48 c7 c6 12 14 85 b3 48 89 c1 49 0f 45 f3 e8 44 8a d6 ff <0f> 0b 49 c7 c1 e5 86 86 b3 4d 89 ca 4d 89 c8 eb a8 0f 1f 00 90 90
kernel: RSP: 0018:ffffc9000ed7faa8 EFLAGS: 00010246
kernel: RAX: 0000000000000061 RBX: ffffc90083002d58 RCX: 0000000000000000
kernel: RDX: 0000000000000000 RSI: ffff88a00fca13c0 RDI: ffff88a00fca13c0
kernel: RBP: 00000000000372a8 R08: 0000000000000000 R09: ffffc9000ed7f948
kernel: R10: 0000000000000003 R11: ffff88c04f07aee8 R12: 0000000000000001
kernel: R13: ffffc9008303a000 R14: 0000000000000000 R15: ffff88a19970e800
kernel: FS: 00007f38105e11c0(0000) GS:ffff88a00fc80000(0000) knlGS:0000000000000000
kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
kernel: CR2: 00007ffddd85a000 CR3: 000000212fe22005 CR4: 0000000000770ee0
kernel: PKRU: 55555554
kernel: Call Trace:
kernel: <TASK>
kernel: ? die+0x36/0x90
kernel: ? do_trap+0xda/0x100
kernel: ? usercopy_abort+0x6c/0x80
kernel: ? do_error_trap+0x6a/0x90
kernel: ? usercopy_abort+0x6c/0x80
kernel: ? exc_invalid_op+0x50/0x70
kernel: ? usercopy_abort+0x6c/0x80
kernel: ? asm_exc_invalid_op+0x1a/0x20
kernel: ? usercopy_abort+0x6c/0x80
kernel: ? usercopy_abort+0x6c/0x80
kernel: __check_object_size+0x2b1/0x2c0
kernel: zfs_uiomove_iter+0x5b/0xe0 [zfs]
kernel: dmu_read_uio_dnode+0xd3/0x150 [zfs]
kernel: dmu_read_uio_dbuf+0x46/0x60 [zfs]
kernel: zfs_read+0x139/0x3e0 [zfs]
kernel: zpl_iter_read+0xe8/0x190 [zfs]
kernel: io_read+0xeb/0x4f0
kernel: io_issue_sqe+0x63/0x3c0
kernel: io_wq_submit_work+0x8c/0x2d0
kernel: io_worker_handle_work+0x159/0x580
kernel: io_wq_worker+0x10e/0x3a0
kernel: ? srso_alias_return_thunk+0x5/0x7f
kernel: ? __pfx_io_wq_worker+0x10/0x10
kernel: ret_from_fork+0x34/0x50
kernel: ? __pfx_io_wq_worker+0x10/0x10
kernel: ret_from_fork_asm+0x1b/0x30
kernel: </TASK>
Summary
Chasing maximum IO performance with
esoteric low-level Linux APIs is a fun challenge,
plus a break from monotonous code elsewhere.
My goal is to reduce this code into 2 functions:
read_file_easily and read_big_file_fast,
the former our basic read_entire_file from step 1
and the latter using io_uring.
These should be easily embedded in future projects.
Why both? Most day-to-day files like config, user data,
map metadata, etc. are relatively small and
handled fast enough with the plain old
read_entire_file from step 1. Fancy io_uring methods
only benefit Gigabyte and Terabyte scale files.
Even then, micro-optimized IO may have minimal impact when
the larger processing step slowly works through all that data,
or when the bottleneck moves to the network and cloud S3 APIs.
Further optimization becomes application specific.
At the start I mentioned other solutions:
instead store used XY positions with each entity, then
simply rebuild the XY map (until serializing very large numbers of entities becomes the bottleneck).
Or use other data structures like KD-trees with better disk
and memory utilization.
We barely used the full capabilities of io_uring: high performance socket servers,
zero-copy file serving, replacing hacks like NodeJS libuv's threadpool that exists just for blocking
IO syscalls, linked submissions, multi-shot socket accept, provided buffers,
and more.
You can even skip the filesystem, page cache, and ARC cache entirely and let
bleeding edge io_uring directly issue NVMe commands!
I'm excited for future uses as the capabilities grow and I get excuses to use them.