troubles.md -

A high-level view of low-level code

Inviting God's Wrath with Cursed Rust

This article is a story about optimising the size and performance of std::borrow::Cow<T> in Rust. It requires some basic knowledge of programming Rust, but doesn’t require any knowledge of the kind of low-level details of Rust’s compilation model. We’ll be touching on some pretty low-level subjects later on but I’ll explain them as we go. The optimisations will start pretty reasonable, but will continue to get more disgusting and cursed until all hell breaks loose and the skies themselves turn blood red with the rage of the ancients.

Anyway, to start things off, let’s briefly talk about pointers.

Specifically, let’s talk about array pointers. In C, array pointers are the same as “normal” pointers, and have no size or other metadata attached. If you want to know the size of an array in C, you have to implement that yourself. For strings, this is tradionally implemented by ending the string with a “sentinel value”, so when iterating over the string you can continuously check for this value and exit. For other arrays, this is usually implemented by supplying some metadata as an additional parameter to the function or as a field in a struct. For a safe language like Rust, though, this simply doesn’t fly. Rust has a type &[T], which represents a borrowed array, and Box<[T]>, which is an owned array. Both of these use the same syntax as “normal” pointers to a single element, but additionally store their length using some magic in the compiler itself. You can think of pointers in Rust being split into two different types: the pointer-to-sized-type (a pointer to a type whose size is known at compile-time, like &u64, &[u8; 10], &(), and so forth) and the pointer-to-unsized-type (a pointer to a type whose size is only known at runtime, such as &[u8], &str or &dyn Trait).

struct Ptr<T>
where
    T: Sized
{
    address: usize,
}

struct Ptr<T>
where
    T: !Sized
{
    address: usize,
    // Rust currently only ever uses a pointer or pointer-sized integer for the fat pointer
    // extra data, but in theory we could have anything here, and it could be more than just
    // the size of a `usize`.
    extra_data: usize,
}

This means that &[T] and Box<[T]> are actually two pointer-sized integers (i.e. 64-bit on 64- bit architectures and 32-bit on 32-bit architectures), one for the pointer to the first element and another for the length of the array. I’m going to explain a bit of groundwork here, to make sure that when we start invoking dark rites to twist and bend them you’re not totally lost.

So, if you’ve actually used Rust in real code, though, you might have noticed that this doesn’t mention Vec<T>. Box<[T]> is not a common type in Rust, as Vec<T> is far more flexible - it allows you to add more elements to the array at runtime without making a new array, whereas Box<[T]> does not. The reason for this difference is that Box<[T]> only stores the number of elements, and all those elements must be defined. Vec<T> works differently. It has a integer representing the amount of space it has - which can be more than the number of elements actually in the Vec. This means that it can allocate extra space that doesn’t contain defined elements, and then pushing to that Vec just writes into that space, without having to allocate a whole new array. Fine so far, although it means that Vec sadly requires three pointer-sized integers. Here’s a quick reference:

// This is invalid syntax in Rust of course, but it's just an illustration
// Size: 2 * size_of::<usize>()
struct &'a [T] {
    ptr: *const T,
    length: usize,
}

// Size: 2 * size_of::<usize>()
#[no_mangle]
pub fn dyn_as_ref<'a>(cow: &'a CursedCow<'_, dyn Foo>) -> &'a dyn Foo {
    &**cow
}

#[no_mangle]
pub fn dyn_to_cow<'a>(cow: CursedCow<'a, dyn Foo>) -> std::borrow::Cow<'a, dyn Foo> {
    cow.into()
}

#[no_mangle]
pub fn dyn_new_cow(cow: Box<dyn Foo>) -> Option<CursedCow<'static, dyn Foo>> {
    CursedCow::try_owned(cow)
}

#[repr(transparent)]
pub struct Test(u64);

impl ToOwned for Test {
    type Owned = Box<Test>;
    
    fn to_owned(&self) -> Self::Owned {
        Box::new(Test(self.0))
    }
}

#[no_mangle]
pub fn static_new_cow(cow: Box<Test>) -> Option<CursedCow<'static, Test>> {
    CursedCow::try_owned(cow)
}
struct Box<T> {
    ptr: *mut T,
    length: usize
}

// Size: 3 * size_of::<usize>()
struct Vec<T> {
    ptr: *mut T,
    length: usize,
    capacity: usize,
}

NOTE: If you’re already familiar with the low-level details of Rust, you might have noticed some things I left out of the above types. Ignore them for now, I’ll get to it.

So there’s one other array type that I want to talk about, and it’s what the rest of this article will be focussed on. That type is std::borrow::Cow. We’ll start by talking about Cow<[T]> and Cow<str>, although Cow is generic and works with other types, which we’ll get to later. We’ll mostly be talking about Cow<[T]> when it comes to implementation, since Cow<str> is the same as Cow<[u8]> at runtime - it just requires some extra invariants to be true about the bytes it contains and so it must be a separate type. Cow<[T]>/Cow<str> can be either a &'a [T]/&'a str or a Vec<T>/String. They’re useful in many cases, but one big one is in parsing. One example of where this might be useful is if you have a parser for a programming language that has strings that can have escape characters. You can have many strings simply be a reference into the original program text:

let a = "Hello, world";
         ^----------^ Take a reference to these characters in the original program text
                      and use it as the string value.

If you have an escape sequence, however, you need to allocate a new String, which is an owned string type, allocated on the heap, like Box<[T]> is an owned array type. This requests a block of memory which doesn’t have to follow the same lifetime rules as borrows - borrows must be created in an outer scope and passed inwards, but owned pointers can be created in an inner scope and passed back outwards, and in general are far more flexible, at the cost of inhibiting some optimisations and requiring computation to be done to create and destroy them. Once we’ve allocated our String, we write a version of the string from the program text into that buffer with all the escape sequences turned into the characters they represent. You can represent this with Cow<str> - either Cow::Borrowed(some_string_reference) if the string can be taken from the program text unchanged, or Cow::Owned(some_computed_string) if the string had to be edited. So how many bytes does Cow<[T]> take up? Well, it’s either a Vec<T> or a &[T], so that means that we need enough space for Vec<T>, but we can reuse some of that space if it’s an &[T], since it can only be one or the other. A Vec<T> takes up 3 pointer-size integers, and we can reuse 2 of those for the &[T], so that means we only need 3 pointer-size integers. Except we also need to store a “tag” for whether it’s a Vec<T> or an &[T]. So that means that it’s 3 pointer-size integers, plus a single bit that can be either 0 for Vec<T> or 1 for &[T]. So for 64-bit, that’d be 3 * 64 + 1, or 193, right?

Unfortunately not. You can’t access a type at a bit offset, only at a byte offset. So that means that our size has to be a multiple of 8. Easy, we round it up and have 7 unused bits, right? Well still no. Integers have to be at an offset that is a multiple of its size. You can have a u64 stored at offset 8, 16, 24, 32, 40, etc, but not at an offset of 9, 10, 11, 12, and so forth. Well easy, we just see what the largest size of integer is that we have as a field of our type (in this case, pointer-size), and round our size up to that. So now our Cow type is 4 pointer-sizes in size, or twice the size of a &[T]. This doesn’t sound bad, but this adds up, and there are other downsides to this increase in size that I’ll get to later. We can confirm the size like so:

fn main() {
    // Prints 32 on 64-bit systems, which is 4 * 8, where 8 is the number of bytes in a 64-bit
    // integer
    println!("{}", std::mem::size_of::<std::borrow::Cow<[u8]>>());
}

Act 1: Some Reasonable Optimisations

So what can we do to help? Well there are a few things. For a start, we can notice that if the vector has zero capacity, we can treat it identically to an empty array for most operations. Vectors with zero capacity don’t need their “drop code” ran, for example. So let’s make a version of Vec where it must always have a non-zero capacity.

use std::num::NonZeroUsize;

struct NonZeroCapVec<T> {
    ptr: *mut T,
    len: usize,
    cap: NonZeroUsize,
}

You’ll notice that we can still have a vector with no elements, as long as it has some space to store elements. So now, we can do our size calculations for Cow<[T]> again, replacing Vec<T> with NonZeroCapVec<T>. Again we notice that we NonZeroCapVec is 3 pointer sizes, and can reuse 2 pointer sizes of that to store the slice, except now the Rust compiler knows that it can use cap for both the tag and the capacity, where if cap is zero then it’s a slice, and if it’s non-zero then it’s a Vec. This is a useful trick. We can confirm that this type is now 3 pointer sizes like so:

enum CowArr<'a, T> {
    Borrowed(&'a [T]),
    Owned(NonZeroCapVec<T>),
}

fn main() {
    println!("{}", std::mem::size_of::<CowArr<u8>>());
}

Except no, sadly we actually can’t confirm this. At the time of writing Rust doesn’t yet optimise this correctly, and still reports that the size is 32. This is unfortunate, but until then we can implement this optimisation manually:

struct CowArr<'a, T> {
    // We can use `*mut` to store immutable pointers like `&T`, as long as we never derive an
    // `&mut T` from an `&T`
    ptr: *mut T,
    len: usize,
    cap: usize,
}

Then when we need to know if the value is owned or borrowed, we can simply check if cap is zero.

Rust has NonZero variants for all integers, plus a NonNull<T> pointer type which acts the same as a *mut T, except since Rust knows that it can’t be null, it can use a null pointer as an enum tag. Although Rust doesn’t optimise the size of the enum defined above, it will correctly use these NonZero/NonNull types in Option, which means that Option<Box<[T]>> is the same size as Box<[T]> - it can use a null pointer to mean None. So we can make our CowArr type work the same as Box<[T]> for Option, and let it use a null pointer to represent None, like so:

use std::ptr::NonNull;

struct CowArr<'a, T> {
    ptr: NonNull<T>,
    len: usize,
    cap: usize,
}

Again we can manually do the optimisation where we check cap for zero, but now Rust will automatically use a null pointer for ptr to mean None if we have an Option<CowArr<T>>.

fn main() {
    // Both of these print 24 on 64-bit systems, and 12 on 32-bit systems
    println!("{}", std::mem::size_of::<CowArr<u8>>());
    println!("{}", std::mem::size_of::<Option<CowArr<u8>>>());
}

Hm, except we’ve actually got more than just a size reduction here. To explain what I mean, we’re going to have to talk about assembly. For std::borrow::Cow, to do as_ref we first have to check whether we have a Cow::Borrowed or a Cow::Owned, then if we have the former we return the borrow we already have, and if we have the latter we do <Vec<T>>::as_ref, which is a pretty simple matter of taking the ptr and len from the vector and creating a slice with that ptr and len. The rest of the conversion is in the type system only, at runtime all doing <Vec<T>>::as_ref does is copy a pointer and a length from one place to another. Well with CowArr our code is simpler. The borrowed ptr and len is exactly the same as the owned ptr and len, the only difference is that if we have an owned value then cap is non-zero. That means that we don’t have to check cap at all, we only have to ensure the type system parts of the conversion are correct - essentially, we only need to ensure that we annotate lifetimes correctly. Then once the assembly is created, the type information is removed, and we’re left with an implementation of as_ref which is essentially a no-op. Well, the Rust Playground has a “Show Assembly” feature, so let’s use it:

use std::{
    borrow::Cow,
    ptr::NonNull
    marker::PhantomData,
};

pub struct CowArr<'a, T> {
    ptr: NonNull<T>,
    len: usize,
    cap: usize,
    // I omitted this before since it's just to silence the error that `'a` is unused.
    // There is more to `PhantomData` than just silencing errors, but it's out of scope
    // for this article.
    _marker: PhantomData<&'a T>,
}

impl<'a, T> CowArr<'a, T> {
    pub fn as_ref(&self) -> &[T] {
        unsafe { std::slice::from_raw_parts(self.ptr.as_ptr(), self.len) }
    }
}

#[no_mangle]
pub fn cow_as_ref<'a>(a: &'a Cow<'_, [u8]>) -> &'a [u8] {
    a.as_ref()
}

#[no_mangle]
pub fn cowarr_as_ref<'a>(a: &'a CowArr<'_, u8>) -> &'a [u8] {
    a.as_ref()
}

Clicking “Show Assembly” (in release mode, of course) shows us what this compiles to. Don’t worry, I know assembly can be scary so I’ve written comments:

;; For the standard library `Cow`...
cow_as_ref:
    ;; We only need to load the `ptr` once (good!)
    mov   rax, [rdi + 8]
    ;; Unfortunately, we check the tag to load the length
    cmp   [rdi], 1

    ;; This is a pointer to the length if we have a borrowed slice
    lea   rcx, [rdi + 16]

    ;; This is a pointer to the length if we have an owned vector
    lea   rdx, [rdi + 24]
    ;; We use `cmov`, which will overwrite the pointer to the borrowed
    ;; slice's length with the pointer to the owned vector's length if
    ;; our `Cow`'s tag shows that it is owned.
    cmove rcx, rdx

    ;; Then finally, we dereference this pointer-to-length to get the
    ;; actual length
    mov   rdx, [rcx]
    ret

;; For our `CowArr`
cowarr_as_ref:
    ;; We return the `ptr`
    mov rax, [rdi]
    ;; We return the `len`
    mov rdx, [rdi + 8]
    ;; That's it! We're done
    ret

Even if you don’t understand assembly, you can see that this is an improvement by the reduction in instruction count, if nothing else. It also reduces register pressure, although if you don’t know what that means then don’t worry - its effect is small enough that you don’t need to worry about it for now.

If you know anything about calling conventions, you might notice something a bit odd about that assembly code. We’ll get to it in due time, although not until after we pass through into the Forbidden Realms.

Act 2: Stepping into the Forbidden Realms

So far, so simple. But we can do better, if we’re willing to add some restrictions. We can’t reduce the size of ptr since it’s not really possible to safely make many assumptions about the range of values a pointer can take, but the same can’t be said for len and cap. If we’re on a 64-bit system, using a 64-bit length and capacity allows us to store up to 18,446,744,073,709,551,615 elements, which I think we can agree that it’s unlikely for a single array to contain this many items in the majority of programs. In fact, it’s not even possible to create an array this large for anything other than u8 and other single-byte (or even smaller) types, since you’ll run out of address space long before that, not even mentioning that you’ll run out of memory on your computer long before you run out of address space. So let’s say that len and cap are both 32-bit on 64-bit systems. We’ll ignore 32-bit for now, on 32-bit systems we could choose to either make both len and cap 16-bit, or fall back to the implementation in the previous section. This choice isn’t really important for now, so I’ll focus on 64-bit. With 32-bit len and cap we can store arrays with up to 4,294,967,295 elements, which means, for example, that a single string can be up to 4Gb long. This is a restriction, certainly, it’s not unbelievable that your program would want to process larger strings, but the standard library Cow will always support that if you need it. If you don’t need that many elements, then this gives you a size reduction.

use std::ptr::NonNull;

struct CowArr<'a, T> {
    ptr: NonNull<T>,
    len: u32,
    cap: u32,
}

fn main() {
    // Both of these print 16 on 64-bit systems, and still print 12 on 32-bit systems
    println!("{}", std::mem::size_of::<CowArr<u8>>());
    println!("{}", std::mem::size_of::<Option<CowArr<u8>>>());
}

If you’re anything like me, saving 8 bytes like this is enough to make you cry with joy, but maybe we can do better. Now we get back to that “something a bit odd” that I mentioned above. See, when Rust passes a struct into a function or returns a struct from a function, it has a couple of ways to handle sharing the struct between the caller and the callee. If the struct is “small” (meaning two fields or less, with each field fitting into a single register), then the struct will be passed as an argument in registers, and returned in registers. Otherwise, the struct will be written to stack and a pointer to the struct will be passed to the callee. This is the “something a bit odd” that I mentioned before - many people assume that Rust always passes structs with more than 1 element by pointer, and in fact until relatively recently it did. Now, if you’re as much of a rules-lawyer as me, you might notice a trick we can do here: although the struct above is not considered “small” by Rust, we can make a version of it that is considered small. Let’s do that:

use std::ptr::NonNull;

struct CowArr<'a, T> {
    ptr: NonNull<T>,
    len_cap: u64,
}

const LEN_MASK: u64 = std::u32::MAX as u64;
const CAP_MASK: u64 = !LEN_MASK;
// We want the low 32 bits of `len_cap` to be `len`, and the high 32 bits to be `cap`,
// so we need to shift `cap` when reading and writing it.
const CAP_SHIFT: u64 = 32;

impl<'a, T> CowArr<'a, T> {
    pub fn as_ref(&self) -> &[T] {
        unsafe { std::slice::from_raw_parts(self.ptr.as_ptr(), self.len & LEN_MASK) }
    }
}

So sure, now we have a struct that’s the same size, but can be passed in registers, speeding up function calls that use it. Cool, but isn’t that & to mask the length going to add an additional cost? Well luckily for us, x86 has a way to mask the lower bits of a number for free! Since assembly is untyped, we can just pretend that our 64-bit number is a 32-bit number whenever we’re using it, and it’ll be the same as if we had masked the lower 32 bits. Additionally, we only have to allocate stack space and pass a pointer to the callee when we actually need to pass a reference to the Cow. If we pass an owned value, it’ll just stay in registers. Let’s see the assembly here to see what I mean:

struct CowArr3Fields<'a, T> {
    ptr: NonNull<T>,
    len: u32,
    cap: u32,
}

struct CowArr2Fields<'a, T> {
    ptr: NonNull<T>,
    len_cap: u64,
}

#[no_mangle]
pub fn cow_as_ref<'a>(a: &'a Cow<'_, [u8]>) -> &'a [u8] {
    a.as_ref()
}

#[no_mangle]
pub fn cowarr2fields_as_ref<'a>(a: &'a CowArr2Fields<'_, u8>) -> &'a [u8] {
    a.as_ref()
}
#[no_mangle]
pub fn cowarr3fields_as_ref<'a>(a: &'a CowArr3Fields<'_, u8>) -> &'a [u8] {
    a.as_ref()
}

#[no_mangle]
pub fn cow_noop(a: Cow<'_, [u8]>) -> Cow<'_, [u8]> {
    a
}

#[no_mangle]
pub fn cowarr2fields_noop(a: CowArr2Fields<'_, u8>) -> CowArr2Fields<'_, u8> {
    a
}

#[no_mangle]
pub fn cowarr3fields_noop(a: CowArr3Fields<'_, u8>) -> CowArr3Fields<'_, u8> {
    a
}
cow_as_ref:
    mov   rax, [rdi + 8]
    cmp   [rdi], 1
    lea   rcx, [rdi + 16]
    lea   rdx, [rdi + 24]
    cmove rcx, rdx
    mov   rdx, [rcx]
    ret

cowarr2fields_as_ref:
    mov rax, [rdi]
    mov edx, [rdi + 8]
    ret

cowarr3fields_as_ref:
    mov rax, [rdi]
    mov edx, [rdi + 8]
    ret
    
cow_noop:
    mov    rax, rdi
    movups xmm0, [rsi]
    movups xmm1, [rsi + 16]
    movups [rdi + 16], xmm1
    movups [rdi], xmm0
    ret

cowarr2fields_noop:
    mov rdx, rsi
    mov rax, rdi
    ret

cowarr3fields_noop:
    mov    rax, rdi
    movups xmm0, [rsi]
    movups [rdi], xmm0
    ret

If you can read assembly, you can see that just returning an unmodified Cow requires some messing around with loading and storing data for all the structs apart from CowArr2Fields. If you can’t read assembly, then all you need to know is that the [...] square brackets is a memory access, and cowarr2fields_noop is the only function that doesn’t need them.

Ok, so we’ve optimised the Cow arrays about as much as we can. Now is when we start to invoke the Dark Magicks and risk the wrath of the Old Ones (you know, the Rust core team). Let’s make a “generic” optimised Cow, one that works with more than just arrays.

Act 3: Forgive Me

So this is all well and good, but it’s just not cursed enough. It basically reimplements Cow<[T]> with a custom type that doesn’t work for Cow<str> - you have to write your own wrapper, maybe call it CowStr. Then repeat that for every type. No, we can do better. We can make a CursedCow that you works the same no matter if it’s CursedCow<[T]>, CursedCow<str>, or even CursedCow<dyn Trait>. That last one is where the code really starts to cause damage to the soul. If you find yourself cast into Hell one day, just know that it may be because you read this article. I think we can both agree that this is a fair and just punishment. Anyway, before we can truly damn ourselves, we need to lay down the groundwork and make the far-simpler CursedCow<[T]>/CursedCow<str> work. To make this work, we’ll need a trait.

pub unsafe trait Cursed<T: ?Sized>: Borrow<T> {
    fn borrowed(borowed: &T) -> Option<NonNull<T>>;
    fn owned(self) -> Option<NonNull<T>>;
    fn is_owned(ptr: NonNull<T>) -> bool;
    unsafe fn as_ref<'a>(ptr: NonNull<T>) -> &'a T;
    unsafe fn reconstruct(ptr: NonNull<T>) -> Self;
}

This trait is actually implemented for the owned variant, as opposed to ToOwned which is implemented for the borrowed variant. This is because many types may implement ToOwned pointing to the same type (you could imagine a slice wrapper where ToOwned still generates a Vec), but we still want to explicitly single out Box and other smart pointers. Implementing for the owned variant means that we don’t have to write any generic implementations with impl<T> Cursed for T where ..., which lets us circumvent the issue of overlapping implementations. The trait is unsafe, as it requires some invariants to be true about Borrow, and requires borrowed, owned and is_owned to be in agreement about what borrowed and owned pointers look like. Additionally, as_ref and reconstruct need to be unsafe functions because they should only ever have valid pointers passed into them.

So, let’s write the actual CursedCow struct now. So while in the previous sections we’ve been basically reimplementing Rust’s “fat pointer” system, we can’t do that any more if we want to support more than just arrays. We want to have a 2-pointer-width CursedCow<T> if T is unsized (such as [T], str or dyn Trait) and a 1-pointer-width CursedCow<T> - just a regular pointer - if T is sized. We do this by just using NonNull<T>, which is a fat pointer for unsized types, and letting the implementation of Cursed handle hiding the tag somewhere.

// `repr(transparent)` ensures that this struct is always treated exactly the same as `NonNull<T>`
// at runtime.
#[repr(transparent)]
pub struct CursedCow<'a, T>
where
    T: ?Sized + ToOwned,
    T::Owned: Cursed<T>,
{
    ptr: NonNull<T>,
    _marker: PhantomData<&'a T>,
}

Apart from the fact that you can’t explicitly match on its variants (or get a mutable reference to the internals, but there are ways around this) this is identical to std::borrow::Cow - you don’t do CowArr<T>, you just do CursedCow<[T]>, and this new CursedCow is 2 pointers wide. This is the same as before for 64-bit, although smaller on 32-bit at the cost of only allowing up to 65,535 elements.

Implementing the necessary methods for our new CursedCow<T> - methods to construct it from owned or borrowed data, a Deref implementation, Drop implementation, and so forth - is pretty easy so I’ll skip it, but you can see it in the full gist (linked at the end of the article).

The real work is done in the implementations of Cursed for Vec<T>, String, Box<T> and so forth. I’ll skip over the implementation for String for now since it’s basically the same as Vec<T>, but let’s start with an explanation of how we store all the information needed for a CursedCow<[T]> in a single NonNull<[T]>. You’ll see these constants CAP_SHIFT, CAP_MASK and LEN_MASK referenced in the following functions, so I’ll start by defining them here:

use std::mem;

const CAP_SHIFT: usize = (mem::size_of::<usize>() * 8) / 2;
const LEN_MASK: usize = usize::MAX >> CAP_SHIFT;
const CAP_MASK: usize = !LEN_MASK;

CAP_SHIFT is the amount you need to shift the len field of the NonNull<[T]> fat pointer to get the capacity that we’ve hidden in that field - i.e., the upper 32/16 bits (on 64- and 32-bit, respectively). LEN_MASK and CAP_MASK are the “masks” for these bits, so we can use bitwise & to only get the bits that represent the length or the capacity, respectively.

unsafe impl<T> Cursed<[T]> for Vec<T> {
    fn borrowed(ptr: &[T]) -> Option<NonNull<[T]>> {
        if ptr.len() & CAP_MASK != 0 {
            None
        } else {
            Some(NonNull::from(ptr))
        }
    }

    fn owned(self) -> Option<NonNull<[T]>> {
        // Fail if the capacity is too high
        if self.len() & CAP_MASK != 0 || self.capacity() & CAP_MASK != 0 {
            None
        } else {
            let mut this = mem::ManuallyDrop::new(self);
            unsafe {
                Some(NonNull::from(slice::from_raw_parts_mut(
                    this.as_mut_ptr(),
                    // This combines the length and capacity into a single `usize`
                    this.len() | (this.capacity() << CAP_SHIFT),
                )))
            }
        }
    }

    // ...snip...
}

So here you can see our checks that the length of the borrowed/owned values don’t exceed the amount we can represent in the reduced amount of space that we have. If we have too many elements or too much capacity, we return None, since there’s no way for us to store that. Although we could truncate the len, we can’t safely truncate the capacity without breaking some allocators, and truncating len would be a footgun. Otherwise, this looks pretty much the same as what we had before.

The rest of the trait is implemented essentially how you’d expect, and it looks pretty much the same as CowArr that we had before. This whole trait implementation looks basically exactly the same for String.

unsafe impl<T> Cursed<[T]> for Vec<T> {
    // ...snip...

    fn is_owned(ptr: NonNull<[T]>) -> bool {
        unsafe { ptr.as_ref() }.len() & CAP_MASK != 0
    }

    unsafe fn as_ref<'a>(ptr: NonNull<[T]>) -> &'a [T] {
        // Like before, this mask is essentially free because we can just treat `self.len`
        // like a smaller value, which acts the same as if we did the mask, instead of
        // actually masking.
        slice::from_raw_parts(ptr.as_ptr() as *const T, ptr.as_ref().len() & LEN_MASK)
    }

    unsafe fn reconstruct(ptr: NonNull<[T]>) -> Self {
        Vec::from_raw_parts(
            ptr.as_ptr() as *mut T,
            ptr.as_ref().len() & LEN_MASK,
            ptr.as_ref().len() >> CAP_SHIFT,
        )
    }
}

Well there we go, that’s it. Now our new CursedCow<[T]> acts like CowArr<T> automatically. It’s pretty cursed to hide the tag and capacity inside the len field of a slice, but we’re just getting started

Now we get to the jankiest part of all - CursedCow<T> for other values of T, and specifically CursedCow<dyn Trait>. Now, we can’t be generic over any CursedCow<dyn Trait>, since there’s no way to specify “some trait object” in the type system, but let’s say that if you’re using CursedCow<dyn Trait> for one reason or another then might have an implementation of ToOwned that has Owned = Box<dyn Trait>, since that’s the only way to have an owned trait object. That might look something like this:

trait MyCoolTrait {
    fn clone_boxed(&self) -> Box<Self>;

    // ... the rest of my cool functions... 
}

impl ToOwned for dyn MyCoolTrait {
    type Owned = Box<dyn MyCoolTrait>;

    fn to_owned(&self) -> Self::ToOwned {
        self.clone_boxed()
    }
}

So, how do we store a tag representing the owned-or-borrowed tag in a Box<T>, since Box also uses the possibly-fat-pointer NonNull<T> internally? Well, there are a couple methods, both relying on the fact that not all the bits of pointers get used. For a start, most types have an alignment greater than 1. We mentioned alignment further up - u64s can only be at locations that are a multiple of mem::size_of::<u64>(), u32s can only be at locations that are a multiple of mem::size_of::<u32>(), and so forth. So because of this, we know that if the integer value of a pointer is odd then it must be invalid. We can use this to store a tag - if the pointer is odd then it’s owned (and we should subtract 1 to get the true pointer), if it’s even then it’s just a normal borrowed pointer. That would be a pretty cursed implementation of Cow, for sure, but we can’t implement it generically, and we can’t implement it for dyn Trait since that would lead to weird bugs where Cow<dyn Trait> was fine for most types, but doesn’t work if the implementation of your trait happens to have a size of 1, and you wouldn’t know that until it explodes at runtime.

Another possibility is that Rust goes out of its way to make sure that pointers can’t overflow an isize, since this means that adding two pointers together will never overflow. As far as I know this isn’t a hard guarantee, but certainly on 64-bit x86 it’s not even possible on any hardware that exists in the real world to have a pointer larger than 63 bits, so that 64th bit will always be 0. We can take advantage of that, and if we get given a pointer that happens to have the top bit set, we can just return None, the same as if we get given a &[T]/Vec<T> that’s too large to store. Now, manually manipulating the pointer field of a fat pointer isn’t allowed in Rust - we can do arithmetic on a normal, “thin” pointer, but there’s no way to mutate the pointer field of a fat pointer. However, we can work around this doing a pointer-to-pointer cast, which is undefined behaviour in C but not in Rust. This is wildly unsafe, and we need to be extremely careful to make it work at all, let alone make it work in a way that won’t immediately cause undefined behaviour. Here’s the cursed function at the heart of it all, which allows us to treat the data pointer of a fat pointer as if it’s a usize, and so allows us to manipulate its bits directly.

fn update_as_usize<O, F: FnOnce(&mut usize) -> O, T: ?Sized>(ptr: &mut *mut T, func: F) -> O {
    unsafe {
        // Here's where we invoke the darkest of dark magic, explanation below.
        let ptr_as_bytes = mem::transmute::<&mut *mut T, &mut usize>(ptr);
        // Since this is dark magic, we make sure that we explode if our
        // assumptions are wrong.
        assert_eq!(*ptr_as_bytes, *ptr as *mut u8 as usize);
        func(ptr_as_bytes)
    }
}

The way this works assumes that the layout of the fat pointer has the data pointer first. Essentially, the fat pointer for NonNull<dyn Trait> looks like this, and in fact you can find this exact struct in std::raw::TraitObject:

struct TraitObject {
    data: *mut (),
    vtable: *mut (),
}

However, even with nightly Rust we can’t use std::raw::TraitObject, since it doesn’t work for any fat pointer, only dyn Trait, and as mentioned before we can’t be generic over “any trait object”. So we have to make the further assumption that not only do trait objects have the data pointer first, but all fat pointers have the data pointer first. That’s what the assert_eq in update_as_usize does: it uses Rust’s native ability to convert the pointer to a thin *mut u8 to make sure that our mutable pointer-to-usize is pointing at the right data. This is wildly unsafe, and although it’s likely to work for the forseeable future for all the fat pointers that Rust supports, there’s no guarantee that this will be the case, and if this ever becomes incorrect in a way that the assert_eq doesn’t catch then you’ll get undefined behaviour. So for now we’ll keep using this, because it works, but I want to stress that you should NOT USE THIS IN REAL SOFTWARE, unless you want to get fired and deserve it.

Anyway, since this works just fine for now, let’s explore these cursed realms a little further, shall we?

// The mask for the actual pointer
const PTR_MASK: usize = usize::MAX >> 1;
// The mask for just the "is owned" tag
const TAG_MASK: usize = !PTR_MASK;

unsafe impl<T: ?Sized> Cursed<T> for Box<T> {
    fn borrowed(ptr: &T) -> Option<NonNull<T>> {
        // We use `Self::is_owned` here to avoid duplicating information. You can think of this
        // in this context as expressing "if we _would think_ that `ptr` was owned"
        if Self::is_owned(NonNull::from(ptr)) {
            None
        } else {
            Some(NonNull::from(ptr))
        }
    }

    fn owned(self) -> Option<NonNull<T>> {
        let mut this = mem::ManuallyDrop::new(self);
        let original_ptr = &mut **this as *mut T;
        let mut ptr = original_ptr;

        update_as_usize(&mut ptr, |p| *p |= TAG_MASK);

        unsafe {
            if Self::is_owned(NonNull::new_unchecked(original_ptr)) {
                this.into_inner();
                None
            } else {
                Some(NonNull::new_unchecked(ptr))
            }
        }
    }

    fn is_owned(ptr: NonNull<T>) -> bool {
        ptr.as_ptr() as *mut u8 as usize & TAG_MASK != 0
    }

    unsafe fn as_ref<'a>(ptr: NonNull<T>) -> &'a T {
        let mut ptr = ptr.as_ptr();
        update_as_usize(&mut ptr, |p| *p &= PTR_MASK);
        &*ptr
    }

    unsafe fn reconstruct(ptr: NonNull<T>) -> Self {
        let mut ptr = ptr.as_ptr();
        update_as_usize(&mut ptr, |p| *p &= PTR_MASK);
        Box::from_raw(ptr)
    }
}

So there we go, apart from the implementations of the traits you’d expect from a Cow implementation, which should be fairly self-explanatory, and a Drop implementation, which is also pretty simple, this is more-or-less a fully-working drop-in replacement for std::borrow::Cow. Finally, let’s see what the codegen looks like for this implementation of Cursed for Box<T>. You can just take my word for it that CursedCow<[T]> generates exactly the same code as CowArr<T>. Let’s write the code for a couple of different scenarios that we want to test the codegen for - first for fat pointers, and then for thin pointers (if for some reason you had a ToOwned implementation for a non-dynamically-sized type that still used Box when it’s owned).

#[no_mangle]
pub fn dyn_as_ref<'a>(cow: &'a CursedCow<'_, dyn Foo>) -> &'a dyn Foo {
    &**cow
}

#[no_mangle]
pub fn dyn_to_cow<'a>(cow: CursedCow<'a, dyn Foo>) -> std::borrow::Cow<'a, dyn Foo> {
    cow.into()
}

#[no_mangle]
pub fn dyn_new_cow(cow: Box<dyn Foo>) -> Option<CursedCow<'static, dyn Foo>> {
    CursedCow::try_owned(cow)
}

#[repr(transparent)]
pub struct Test(u64);

impl ToOwned for Test {
    type Owned = Box<Test>;
    
    fn to_owned(&self) -> Self::Owned {
        Box::new(Test(self.0))
    }
}

#[no_mangle]
pub fn static_new_cow(cow: Box<Test>) -> Option<CursedCow<'static, Test>> {
    CursedCow::try_owned(cow)
}

The code made by these functions is pretty amazingly minimal, and almost entirely avoids using the stack - keeping most values in registers. I won’t explain this assembly fully, but it’s here for completeness:

dyn_as_ref:
    mov    rdx, [rdi + 8]
    movabs rax, 9223372036854775807
    and    rax, [rdi]
    ret

dyn_to_cow:
    mov    rax, rdi
    movabs rcx, 9223372036854775807
    and    rcx, rsi
    test   rsi, rsi
    jns    .is_ref
    mov    esi, 1
    mov    [rax + 8], rcx
    mov    [rax + 16], rdx
    mov    [rax], rsi
    ret
.is_ref:
    xor    esi, esi
    mov    [rax + 8], rcx
    mov    [rax + 16], rdx
    mov    [rax], rsi
    ret

dyn_new_cow:
    mov    rdx, rsi
    movabs rcx, -9223372036854775808
    or     rcx, rdi
    xor    eax, eax
    test   rdi, rdi
    cmovns rax, rcx
    ret

static_new_cow:
    movabs rcx, -9223372036854775808
    or     rcx, rdi
    xor    eax, eax
    test   rdi, rdi
    cmovns rax, rcx
    ret

Anyway, if you want to try this out then I recommend the beef crate, which acts as a drop- in replacement for std::borrow::Cow for most cases. It’s written by my colleague and friend, and this article came about due to discussions about how to optimise that crate for the usecase of parsing JSON. It doesn’t include our incredibly sketchy dyn Trait implementation, thank god. There’s also the simpler cowvec crate, which is my earlier implementation, roughly equivalent to the code we had at the end of Act 2. I’d only recommend beef over cowvec because cowvec doesn’t act as a drop-in replacement as it has a different signature to std::borrow::Cow. Well, that and the fact that beef is clearly the far better name.

The JSON-parsing crate simdjson-rs actually integrated beef recently, and you can see the improvements to throughput that they saw just from switching out their Cow implementation in the integration PR.

Now if you don’t mind, I’m going to go seek the advice of a priest.

NOTE: This article originally contained an extended section about how to extend this implementation to support inlining data, but this was considered too cursed for the eyes of mere mortals. Perhaps after I’ve thoroughly cleansed myself with holy water, I may be able to bring myself to write an additional article detailing how to implement this within our existing framework.