Rust Coreutils: Fixing Low-Hanging Performance Fruit

September 13, 2022

I have been toying with learning Rust for the last few years, but since I wasn’t able to use it at work it never really stuck with me. Throughout my career I mainly worked on low-latency C/C++ systems, and on performance engineering work directly (benchmarking, optimization work, etc) and Rust is trying to play in the same ballpark. Since retiring at the start of this year I’ve had a little more free time, so I’ve been spending more time writing Rust code for my personal projects and trying to contribute to some Rust OSS projects to get a feel for the language.

A few months ago I took a look at the uutils coreutils project, which is a rewrite of the core GNU utils in Rust, to try to get some experience writing newbie code in the language. I saw the maintainers themselves mention that a lot of the code quality isn’t great since a lot of contributions are from people who are very new to Rust, so I figured there would be some easy fixes available for me to take on.

I ended up making 2 quick contributions that sped up the uutils dd tool by 2-3x.

bs=4k count=1000000 bs=1M count=20000 bs=1G count=10
GNU dd (baseline) 1.15s 1.07s 1.13s
Coreutils dd 2.14s 1.98s 5.98s
Coreutils dd (with #3600) 1.99s 1.60s 1.95s
Coreutils dd (with #3610) 1.29s 1.79s 5.96s
Coreutils dd (with both fixes) 1.15s 1.42s 1.90s
  • These results are for copying from /dev/zero to /dev/null to limit I/O impact and measure dd overhead as closely as possible. The commands used are something like: time dd status=progress bs=1G count=10 < /dev/zero > /dev/null

Understanding “dd” Link to heading

I want to explain the two changes I made, but first we need a little background on the dd tool. dd is used to copy (and sometimes convert, which we’ll be ignoring) data from one file to another. It can be setup to copy a specific number of bytes and can be configured to copy in blocks of a specified size.

If you’ve ever installed an .iso file to a USB drive on linux, you’ve probably been directed to use dd with a command that looks something like this:

> dd if=ubuntu-22.04.iso of=/dev/sdb bs=4096

That command sets the input file if, the output file of, and the blocksize bs (in bytes).

dd works by taking the following steps on its main loop:

  1. Read blocksize bytes from the input file into a buffer.
  2. Apply some conversions, if applicable.
  3. Write the buffer to the output file.
  4. Repeat from step 1 until finished (until it has written count blocks, it’s hit the end of the file, or it errors out).

Before I made any changes, here is the main loop for the dd utility in the uutils project: source link

        // Start a thread that reports transfer progress.
        //
        // When `status=progress` is given on the command-line, the
        // `dd` program reports its progress every second or so. We
        // perform this reporting in a new thread so as not to take
        // any CPU time away from the actual reading and writing of
        // data. We send a `ProgUpdate` from the transmitter `prog_tx`
        // to the receives `rx`, and the receiver prints the transfer
        // information.
        let (prog_tx, rx) = mpsc::channel();
        thread::spawn(gen_prog_updater(rx, i.print_level));

        // The main read/write loop.
        //
        // Each iteration reads blocks from the input and writes
        // blocks to this output. Read/write statistics are updated on
        // each iteration and cumulative statistics are reported to
        // the progress reporting thread.
        while below_count_limit(&i.count, &rstat, &wstat) {
            // Read a block from the input then write the block to the output.
            //
            // As an optimization, make an educated guess about the
            // best buffer size for reading based on the number of
            // blocks already read and the number of blocks remaining.
            let loop_bsize = calc_loop_bsize(&i.count, &rstat, &wstat, i.ibs, bsize);
            let (rstat_update, buf) = read_helper(&mut i, loop_bsize)?;
            if rstat_update.is_empty() {
                break;
            }
            let wstat_update = self.write_blocks(&buf)?;

            // Update the read/write stats and inform the progress thread.
            //
            // If the receiver is disconnected, `send()` returns an
            // error. Since it is just reporting progress and is not
            // crucial to the operation of `dd`, let's just ignore the
            // error.
            rstat += rstat_update;
            wstat += wstat_update;
            let prog_update = ProgUpdate::new(rstat, wstat, start.elapsed());
            prog_tx.send(prog_update).unwrap_or(());
        }

Performance Problem #1: Buffer Reuse Link to heading

The first problem becomes most obvious when dd is used with a large blocksize. I initially started looking at dd because of Issue #3094, which hinted at the problem.

The problem is that the main loop above doesn’t reuse the same buffer for each copy. What the block above does is allocate a new buffer of blocksize bytes, then read from the input file into that buffer, write the output, then discard the buffer.

What you want to do instead is to reuse the buffer to avoid reallocating memory. This is one of those problems that people who work in a higher level GC’d language (like Javascript/Python) often miss while learning systems programming, so it’s not too surprising to find something like this in a newbie Rust project. Rust tends to have a mix of both exerienced systems programmers looking for an upgraded C++ (like me) and from web developers looking to write native code for the first time.

Interestingly, this is actually a problem that Rust is pretty uniquely suited to solve safely. Safely reusing a buffer of memory is one of the problems that Rust’s design is intended to help with. The actual code change was relatively small, and Rust helped ensure that it was safe.

This change alone sped up large blocksize copies by up to 3x.

Performance Problem #2: Thread Communication Link to heading

The second problem here is a little less obvious, but shows up mainly when working with small blocksizes.

Whoever wrote this code originally intended to optimize it by moving all of the output to a separate thread, as you can see from the comments. This resulted in sending a message to the output thread after every block was written in order to update the status bar.

dd only outputs the progress once every second to avoid spending too much time printing. The check to see if 1 second had passed was being done in the second thread, so the second thread would be woken up after every block was written but only have output to print once per second.

This is a perfect example of premature optimization, and, like the previous problem, also seems to come from someone unfamiliar with systems programming. Thread-to-thread communication is actually fairly expensive relative to other low-level operations, but you probably won’t think of it that way if you’re writing a higher-level language.

In common usage, blocksizes are in the Kilobyte range, so often this loop will be iterated over thousands of times per second. I have real doubts that sending a message between threads like this is significantly faster than printing a line of text, but sending thousands of messages per second is definitely slower than printing once per second.

My fix was ultimately a general fixing of problems related to output, which included a fix to only wake up the printing thread once per second. There were several other related bugs dealing with with the output that I also fixed, you can look at the PR for more info.

Ultimately I didn’t remove the second printing thread entirely because one signal per second wasn’t ever going to have a significant performance impact, and the code is organized around having that extra thread. It’s still there, but from a performance standpoint there’s certainly no reason for it.

For the smallest commonly used blocksize of 4K, this change alone sped up dd by almost 2x.

The Final Form Link to heading

After my changes the core loop looks like so:

        // Start a thread that reports transfer progress.
        //
        // The `dd` program reports its progress after every block is written,
        // at most every 1 second, and only if `status=progress` is given on
        // the command-line or a SIGUSR1 signal is received. We
        // perform this reporting in a new thread so as not to take
        // any CPU time away from the actual reading and writing of
        // data. We send a `ProgUpdate` from the transmitter `prog_tx`
        // to the receives `rx`, and the receiver prints the transfer
        // information.
        let (prog_tx, rx) = mpsc::channel();
        let output_thread = thread::spawn(gen_prog_updater(rx, i.print_level));
        let mut progress_as_secs = 0;

        // Create a common buffer with a capacity of the block size.
        // This is the max size needed.
        let mut buf = vec![BUF_INIT_BYTE; bsize];

        // The main read/write loop.
        //
        // Each iteration reads blocks from the input and writes
        // blocks to this output. Read/write statistics are updated on
        // each iteration and cumulative statistics are reported to
        // the progress reporting thread.
        while below_count_limit(&i.count, &rstat, &wstat) {
            // Read a block from the input then write the block to the output.
            //
            // As an optimization, make an educated guess about the
            // best buffer size for reading based on the number of
            // blocks already read and the number of blocks remaining.
            let loop_bsize = calc_loop_bsize(&i.count, &rstat, &wstat, i.ibs, bsize);
            let rstat_update = read_helper(&mut i, &mut buf, loop_bsize)?;
            if rstat_update.is_empty() {
                break;
            }
            let wstat_update = self.write_blocks(&buf)?;

            // Update the read/write stats and inform the progress thread once per second.
            //
            // If the receiver is disconnected, `send()` returns an
            // error. Since it is just reporting progress and is not
            // crucial to the operation of `dd`, let's just ignore the
            // error.
            rstat += rstat_update;
            wstat += wstat_update;
            let prog_update = ProgUpdate::new(rstat, wstat, start.elapsed(), false);
            if prog_update.duration.as_secs() >= progress_as_secs {
                progress_as_secs = prog_update.duration.as_secs() + 1;
                prog_tx.send(prog_update).unwrap_or(());
            }
        }

Conclusions Link to heading

I enjoyed these little code fixes, they gave me a chance to practice some of my Rust skills and maybe help out the community if this project ever becomes more popular. I’ve occasionally done similar fixes, or looked into other performance problems in different projects, so I may write up more about them in the future.

For anyone wanting to do something similar checking out the PRs I’ve linked here will point you to several more areas for optimization or fixes in just this one tool.