Presenting FracPack - A Better Zero-copy Binary Serialization Format for C++ and Rust




FracPack is a zero-copy, fast, secure, future-proof, backward-compatible, and cryptography-friendly data format that requires no code generation or schema files. There is currently native support in Rust and C++, and support for other languages could easily be developed. The Rust FracPack library is a single file that is less than 900 lines long.

The following table compares FracPack to other well known libraries and formats.

| Format           | Zero Copy | Code Gen | Safe | Forward | Canon | Lang     |
|------------------|-----------|----------|------|---------|-------|----------|
| FracPack         | YES       | NO       | YES  | YES     | YES   | C++/Rust |
| Protocol Buffers | NO        | YES      | YES  | YES     | NO    | Most     |
| Cap'n'Proto      | YES       | YES      | NO   | NO      | YES   | C++      |
| FlatBuffers      | YES       | YES      | NO   | YES     | NO    | Many     |
| MessagePack      | NO        | NO       | OPT  | NO      | NO    | Most     |
| JSON             | NO        | NO       | YES  | YES     | NO    | Most     |
| EOS/HIVE         | NO        | NO       | YES  | NO      | YES   | C++      |

Zero copy means that you can access any property in the encoded data structure without having to perform a linear scan of the entire buffer. This eliminates the need for many slow memory allocations and protects you from memory explosion attacks.
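As a toy illustration of the idea (this is not FracPack's actual wire format), a fixed layout lets a reader pull a single field out of a buffer by constant offset, with no scanning and no allocation:

```rust
// Illustrative only: read a u32 field at a known constant offset,
// bounds-checked, assuming little-endian encoding.
fn read_u32_at(buf: &[u8], offset: usize) -> Option<u32> {
    let bytes = buf.get(offset..offset + 4)?; // fails safely on short buffers
    Some(u32::from_le_bytes(bytes.try_into().ok()?))
}

fn main() {
    // a toy buffer: two u32 fields packed back to back
    let buf = [7u32.to_le_bytes(), 42u32.to_le_bytes()].concat();
    assert_eq!(read_u32_at(&buf, 4), Some(42)); // second field, no scan
    assert_eq!(read_u32_at(&buf, 100), None);   // out of bounds is rejected
}
```

The bounds check is what makes this style safe against truncated or malicious buffers: a bad offset yields `None` rather than an invalid access.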

Code Gen means that a utility is required that transforms a schema file into code that must be compiled into your program.

Safe means that with a bug free implementation, your program can parse data produced from untrusted sources without risking memory explosion or invalid data accesses.

Forward (as in forward-compatible) means old programs can read data from new programs even if they don't understand how to read all the new fields.

Canon (as in Canonical) means there is a standardized way to serialize the data structure such that it is reproducible across implementations and will survive an unpack/pack round trip. This is critical if you want the cryptographic hash of an object to be reproducible for the purposes of signature verification.

For FracPack, the canonical representation is limited to situations where all fields are understood. When operating in forward-compatible mode (e.g. old software reading data produced by new software), a round trip will prune the unknown data. When operating in backward-compatible mode, the data will still round-trip because the canonical representation excludes all empty optional values. Even this limited forward-compatibility quirk is largely mitigated by the zero-copy feature, because old code can still hash the original buffer and avoid any need for a round trip.

Tradeoffs

Every software project has to make hard choices about what format to use to store and transfer data. Some formats are human-readable, while other formats are fast. Some are bloated and some are small. Some work with most programming languages, while others only work with a few. Some are flexible and easy to change over time, and others are cast in concrete and difficult to change.

When dealing with systems that take data from untrusted sources the format must be robust against malicious individuals crafting data designed to crash unsuspecting code. When dealing with cryptographic validation, a canonical representation that is the same regardless of who encodes it becomes critical.

Schema & Code Generation Hell

One of my biggest pet peeves is integrating 3rd party code generation tools into my build process. Most popular formats are either self-describing (JSON, BSON, MessagePack) or depend upon a custom schema language that is the input to a code generator that produces the code needed for every supported language. Even if you use JSON/BSON/MessagePack you still need a code generator to parse the data from these formats into your native types.

FracPack is part of the header-only PSIO library, which depends only upon the Boost Preprocessor library and, optionally, the header-only RapidJSON library. Likewise, the Rust version is equally lightweight and unintrusive. With PSIO you also get JSON encoding/decoding for all your types.

Once you introduce external code generation (as opposed to using language-native macros and meta-programming) you face many issues:

  1. Coding Standard Differences
  2. Compiler Differences
  3. Build Order Dependencies
  4. Conversion to/from Internal Types
  5. Generated Types Contaminate Code
  6. Inability/Awkward Process to Augment Generated Types
  7. Build Toolchain Configuration
  8. Potentially Checking in Generated Code
  9. API Changes Break Your Code

With FracPack you don't have to worry about any of these issues: you simply write your types in your native language (C++ or Rust), in the style you like, annotate them with a single macro, and everything just works. Examples are included below.

Performance

The encoding and decoding is "blazing fast" compared to Protocol Buffers and JSON, but that isn't the interesting part. The binary format used by EOS, Hive, and BitShares is also very fast and focuses on keeping the packsize small by using variable-length integer encodings. PSIO also supports EOS/Hive's encoding style.

The following chart shows a comparison in performance for a complex type with nested structs, vectors, and strings. The timing is for 10,000 iterations. Using a fast zero-compression technique like Cap'n'Proto gives you an idea for potential savings if you are willing to give up zero-copy.

| Action    | Format   | Time   | Size  | Comp  |
|-----------|----------|--------|-------|-------|
| Pack      | EOS/HIVE | 2 ms   | 343 b | 297 b |
| Pack      | FracPack | 3.4 ms | 619 b | 400 b |
| Pack      | JSON     | 54 ms  | 750 b | 754 b |
| Unpack    | EOS/HIVE | 7.3 ms |       |       |
| Unpack    | FracPack | 6.3 ms |       |       |
| Unpack    | JSON     | 23 ms  |       |       |
| Validate  | FracPack | 2 ms   |       |       |
| Zero Copy | FracPack | 0 ms   |       |       |

An interesting side effect of zero-copy access is that it can be faster to read your data directly from a FracPacked buffer than from an already unpacked type, due to improved data locality. An unpacked type that contains strings and vectors spreads your data all over your program's heap, which isn't friendly to your CPU's memory cache. So if you are dealing with trusted data, there may even be a performance gain from operating on it via the near zero-overhead FracPack API. Even untrusted data can be validated in just 14% of the time it takes to unpack it.

The only tradeoff with FracPack is that it takes about 50% more time to pack than the EOS/HIVE format and is about 80% larger (in this contrived case). The extra size comes from the use of offset pointers and using a full 4 bytes for length prefixes instead of 1 byte (variable encoding). This is the price we pay for extensibility and zero copy reads.

While the data takes slightly longer to pack, it is faster to unpack so the round-trip time is about the same as EOS/HIVE's binary format. If you consider that in many instances data is read many more times than it is written (database records, files) then the faster read performance pays off.

The reality is that for simple types, such as a struct with a bunch of integers, FracPack is equivalent to a memcpy and can be much faster. Actual performance is so data-dependent that synthetic benchmarks are not much use.

FlatBuffers vs FracPack

The closest competitor to FracPack would be Google's FlatBuffers. I took a look at their benchmark page and then looked at their "raw struct" test. I configured FracPack to utilize the exact same data structure and ran it 1 million times. Notably, the pack size for FracPack was almost identical to their raw struct (308 bytes).

That said, their "raw" test was contrived, with fixed-size arrays and strings padded beyond what the data required. When I replaced their raw structs and fixed arrays with dynamically sized vectors and strings, to make a closer comparison to what FlatBuffers' real types are capable of, the result was quite impressive: FracPack used 22% less space while maintaining the ability to add new optional fields in the future, and it packed in about the same amount of time. If you leverage FracPack's support for "final", never-changing types, then the size falls to 33% less than FlatBuffers.

| Format     | Operation                  | Time   | Size      |
|------------|----------------------------|--------|-----------|
| FlatBuffer | Pack                       | 134 ms | 344 bytes |
| FracPack   | Pack (Extensible)          | 130 ms | 269 bytes |
| FracPack   | Pack (Fixed)               | 89 ms  | 233 bytes |
| FlatBuffer | Zero Copy Use              | 8 ms   |           |
| FracPack   | Zero Copy Use (Extensible) | 30 ms  |           |
| FracPack   | Zero Copy Use (Fixed)      | 30 ms  |           |
| FracPack   | Unpack (Extensible/Fixed)  | 100 ms |           |
| FracPack   | Validate (Extensible)      | 55 ms  |           |
| FracPack   | Validate (Fixed)           | 30 ms  |           |

Note that unpacking the struct into C++ standard strings and vectors is about 3x slower than just reading the data in place, due to the memory allocation required.

The real issue with FlatBuffers is that it requires an obtuse API and pre-order packing of your objects into a buffer. There are libraries that will "pack" and "unpack" to rich types, but they are much slower to use. So the developer ergonomics of FlatBuffers are much worse.

Furthermore, there is no canonical form for FlatBuffers, which makes using it with cryptographic signatures challenging.

FracPack API Example C++

The FracPack format is part of the PSIO library, which also supports several other formats with the same reflection. So the PSIO_REFLECT macro also gets you JSON and the binary format used by Hive, EOSIO, and BitShares.

#include <cstdint>
#include <string>
#include <vector>
// ...plus the PSIO headers providing PSIO_REFLECT and shared_view_ptr

struct Person {
    std::string          name;
    uint32_t             age;
    std::vector<Person>  kids;
};
PSIO_REFLECT( Person, name, age, kids )

int main() {
    // pack Person into a shared view
    psio::shared_view_ptr<Person> elvis( Person{"Elvis", 42} );
    std::span<const char> packed = elvis.data();
    
    // shared view, copies buffer, does not construct person
    psio::shared_view_ptr<Person> elvis_copy( packed.data(), packed.size() );

    // extract person 
    Person p = elvis_copy.unpack();    

    // zero copy demo
    auto zero_copy = psio::view<Person>(packed.data());
    auto elvis_age = zero_copy->age();
    auto elvis_name = zero_copy->name();
    auto elvis_num_kids = zero_copy->kids().size();

    // mutate in place allowed, so long as it doesn't resize the structure
    zero_copy->age() = 100;
}

Example Rust

The Rust version is implemented as a single crate with a single source file that is less than 900 lines long. While it doesn't currently support zero-copy reading, that could be added in the future.

#[derive(Pack, Unpack, PartialEq, Debug)]
#[fracpack(fracpack_mod = "fracpack")]
pub struct Person {
    pub name: String,
    pub age: u32,
    pub kids: Vec<Person>
}

fn main() {
    let elvis = Person { name: "elvis".to_string(), age: 42, kids: vec![] };
    let mut packed = Vec::<u8>::new();
    elvis.pack( &mut packed );

    let unpacked_elvis = Person::unpack(&packed[..], &mut 0).unwrap();
}

Format Specification

The FracPack format is not self-describing; you must have code that knows what it expects to read. That said, the C++ library can generate a schema file that describes the format of every type, which future libraries could use to dynamically interpret data.

Simple, unchanging types are encoded exactly as they would be for a packed C-style struct, assuming you signal that the type will never change in the future and there are no dynamically sized fields such as strings or vectors.
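A minimal sketch of that property (the struct and field names here are illustrative, not from the spec): a "final" type with only fixed-size fields packs to exactly its fields laid end to end, with no size prefix and no heap, assuming little-endian encoding.

```rust
// Illustrative: a final, all-fixed-size type is its own encoding.
struct Point {
    x: u32,
    y: u32,
}

// Pack to the raw field bytes, equivalent in spirit to a memcpy
// of a packed C-style struct.
fn pack_point(p: &Point) -> [u8; 8] {
    let mut out = [0u8; 8];
    out[0..4].copy_from_slice(&p.x.to_le_bytes());
    out[4..8].copy_from_slice(&p.y.to_le_bytes());
    out
}

fn main() {
    let p = Point { x: 1, y: 2 };
    assert_eq!(pack_point(&p), [1, 0, 0, 0, 2, 0, 0, 0]);
}
```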

Types that have dynamically sized fields are divided into two regions: a fixed region, and a heap region where all dynamically sized data is stored. Note that each struct has its own local heap for its own data. Struct members that are dynamic in size store an offset into the heap.

Example Encoding

The Person struct would be encoded like so:

struct PersonData {
    uint16_t     fixed_size = 12; // the amount of fixed size data
    // ------- start fixed data ----------------
    offset_ptr   name_offset = 12; // bytes from &name_offset to &name_size 
    uint32_t     age = 42;
    offset_ptr   kids_offset = 0; // empty vector optimization
    // -------- start heap ----------- (after fixed_size bytes)
    uint32_t     name_size = 5;  // "elvis" is 5 characters long
    char         name[5] = "elvis";                                               
}
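To make the layout concrete, here is a hand-rolled Rust sketch that packs the same example byte for byte, following the pseudo-struct above. This is an illustration under the stated layout, with little-endian encoding assumed, not the library's actual packer:

```rust
// Hand-pack the PersonData example above (illustrative sketch only).
fn pack_person(name: &str, age: u32) -> Vec<u8> {
    let mut buf = Vec::new();
    buf.extend_from_slice(&12u16.to_le_bytes()); // fixed_size = 12
    buf.extend_from_slice(&12u32.to_le_bytes()); // name_offset: 12 bytes ahead
    buf.extend_from_slice(&age.to_le_bytes());   // age
    buf.extend_from_slice(&0u32.to_le_bytes());  // kids_offset = 0 (empty vector)
    buf.extend_from_slice(&(name.len() as u32).to_le_bytes()); // heap: name_size
    buf.extend_from_slice(name.as_bytes());      // heap: name bytes
    buf
}

// Zero-copy read of age: 2-byte fixed_size prefix + 4-byte name_offset = offset 6.
fn read_age(buf: &[u8]) -> u32 {
    u32::from_le_bytes(buf[6..10].try_into().unwrap())
}

fn main() {
    let buf = pack_person("elvis", 42);
    assert_eq!(buf.len(), 2 + 12 + 4 + 5); // prefix + fixed region + name_size + "elvis"
    assert_eq!(read_age(&buf), 42);        // constant-offset field access
}
```

Note how `age` lives at a constant offset regardless of how long the name is; only the heap grows.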

New fields can be added to Person, but they are all "optional", which means each new field adds another offset_ptr and the fixed_size field grows by 4 (the size of the offset_ptr). Old code simply ignores the added fields, and new code does not even need to pack an added field if the optional value is null.
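A rough sketch of the resulting sizing rule (the function name and the assumption that every fixed-region slot is 4 bytes are illustrative, not from the spec): each null optional costs a 4-byte zero offset, unless every optional after it is also null, in which case the whole trailing run is simply truncated from the fixed region.

```rust
// Sketch of the optional-field sizing rule described above.
// Assumes (for illustration) that every fixed-region slot is 4 bytes.
fn fixed_region_bytes(required_fields: usize, optional_present: &[bool]) -> usize {
    // index just past the last present optional; trailing nulls cost nothing
    let kept = optional_present
        .iter()
        .rposition(|&present| present)
        .map_or(0, |i| i + 1);
    (required_fields + kept) * 4
}

fn main() {
    // 3 required slots, two optionals both null: the trailing run is truncated
    assert_eq!(fixed_region_bytes(3, &[false, false]), 12);
    // first optional present, second null: only the trailing null is free
    assert_eq!(fixed_region_bytes(3, &[true, false]), 16);
    // a null before a present optional still costs its 4-byte zero offset
    assert_eq!(fixed_region_bytes(3, &[false, true]), 20);
}
```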

If the Person struct were final, meaning no new fields could ever be added, then the fixed_size field would not be encoded. The start of the heap would be known from the size of the fixed fields, which could never change in the future.

This format is safe because all offsets are positive, meaning pointer loops cannot be created. Furthermore, each offset into the heap must point to the end of the previous object if the object types are known, and must be after the last known position even if the type is not known. The FracPack library comes with a function that can sanity-check untrusted data to ensure all the invariants hold without having to decode the entire struct. The validation step can be skipped for trusted data from your own program.
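A hedged sketch of that invariant (this is my simplification, not the library's validator): treat each heap offset as a forward pointer and reject any offset that fails to advance past the furthest point already claimed.

```rust
// Sketch: heap offsets may only point forward, so loops are impossible.
// `offsets` holds (position_of_pointer, offset_value) pairs in field order;
// a zero offset means an empty/absent value and is skipped.
fn offsets_point_forward(offsets: &[(usize, u32)], heap_start: usize) -> bool {
    let mut frontier = heap_start;
    for &(pos, off) in offsets {
        if off == 0 {
            continue; // empty-container / null-optional optimization
        }
        let target = pos + off as usize; // offsets are relative to the pointer
        if target < frontier {
            return false; // points backward: possible loop or overlap
        }
        frontier = target;
    }
    true
}

fn main() {
    // Using the PersonData example: fixed region starts at 2, heap at 14.
    assert!(offsets_point_forward(&[(2, 12), (10, 0)], 14));
    // An offset landing before the heap (2 + 2 = 4 < 14) must be rejected.
    assert!(!offsets_point_forward(&[(2, 2), (10, 0)], 14));
}
```

Because every check is a simple comparison, a validator in this style runs in one pass over the pointers without decoding any field data.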

Other formats, such as Cap'n'Proto, use pointers that are much more flexible and must use a fuzzy check to ensure there is no infinite loop while parsing. With FracPack these infinite loops are impossible.

The FracPack format specification is described in greater detail on GitHub. The header-only PSIO library that implements FracPack can be used outside the psibase build process even though it is currently part of our mono-repo. Likewise, the FracPack Rust crate is also easy to extract from our mono-repo and use in your own code. Note that the code is still under development and may have undiscovered bugs, but that should rapidly change as adoption increases.

Conclusion

While the library is still under development, I believe that FracPack's binary encoding fills a niche in the data-serialization space which has been lacking. I hope that in time the tools will mature and support for more languages will be added.

Are there any interesting formats that I should consider? Let me know in the comments.


Dan, your FracPack format introduces some fascinating concepts in binary serialization, particularly around efficiency and compatibility. I’m curious about the challenges you faced in balancing performance optimization with the need for forwards and backwards compatibility in FracPack. Could you share some insights into how you navigated these trade-offs during the development process?

Forward and backward compatibility is largely a function of making as many fields as possible in your data structures "optional". This is the approach used by every format I have seen. The cost of making something optional is the extra data you need to encode to indicate the presence of the optional fields.

The smallest way to indicate an optional field is a single bit, but if a field is present you need to know where to find it, which requires an offset (4 bytes). There is no extra overhead for making dynamically sized types optional in FracPack because the struct's heap offset pointer can signal non-presence itself. Likewise, empty strings and vectors have no extra overhead and use the same amount of data as an absent optional.

For speed of zero-copy access to fields, it is best to know a constant offset to the field you are looking for rather than having to first read a variable length bitfield.

So the tradeoff we made is that null optional fields use 4 bytes each (all zero), unless all fields after them are also null, in which case we can truncate the fixed region and each non-present trailing optional field uses 0 bytes.

If data size is an issue, then a fast zero-compression algorithm like the one Cap'n'Proto uses can all but eliminate the overhead; however, compressing/decompressing can slow down the first access and introduces potential security issues such as decompression bombs.

So for the sake of speed and constant time access to fields via pre-computed offsets, optional data has an overhead of 4 bytes. This overhead can be mitigated if you know you will never want to remove fields from version 1 of your data types and you always include or exclude all (or most) of your future extended fields.

I’ve noticed in your past blockchain projects, like what became Hive, the use of an ‘extensions’ array at the end of definitions, which provided flexibility for future enhancements, such as the addition of ‘beneficiaries’ in the ‘comment’ object. Has this experience influenced your approach to data structure design in FracPack? Is that what you're referring to by "extended fields," here?

So good to hear from you again @dan 😊❤️ I really have no much idea about coding, just working on learning it. I hear its quite technical.

Implementing this technology into Bitshares would make a great test case?