Binary Cure for a Headache (My First AI-Assisted Project)

Dmitry Astafyev

Surely, you've come across those projects that were initially designed (or commissioned) as something very small and compact, but then somehow unexpectedly started to grow. For such "tiny" projects, you usually try not to overthink it and stick to a minimalist stack.

So, one day I was working on just such a "seemingly" small and simple project. What kind of project it was doesn't really matter in the context of this article. The important part is that at some point, I needed to save data from multiple streams. Logs. Nothing tricky — just open a file and write to it.

But then a new request comes in. They want to see logs from each stream.

Sure, no problem... Now we have multiple files. The streams, by the way, aren't trivial: they don't just emit log lines but also carry structured information. It could be metrics, it could be some enumerations. Structure-wise, the data from different streams partially overlaps, and partially each stream produces its own unique data. Still, saving it all as plain text was acceptable for the purpose. I repeat: "yeah, it's a small project, just a couple of hours of work."

Done, delivered, let it work.

But not for long...

"Can we have some sort of front-end to browse the output with a mouse? You know, it’s a bit inconvenient for Mrs. Mildred to use the terminal. Also, we’d like filtering by streams and some additional filtering by a number of criteria."

And the log files? They just keep growing during a session, up to 1–2 GB, and they don't seem to stop. And suddenly several problems surface at once:

  • The data is in text format, with millions of lines. You can't just load that into a browser — you need something like virtual scrolling. Fortunately, that's relatively painless to implement.

  • The backend, however, requires more serious improvements. First, you need to figure out how to provide a specific segment from each file based on the client’s request. Second, there's the search problem, again linked to that specific frame that needs to be delivered to the client.

  • And search itself is already an issue: if you need to search by some criterion and your data is in text format, the search condition can catch both what the user is looking for and some irrelevant stuff. That means you need to embed additional data into the text. For example, not just "temperature" but "temperature" with an ID or sensor type.

In short, how that project eventually turned out doesn't really matter. It turned out well, and, by the way, the service stopped growing at that point, having fully met the user's needs.

But that wasn't the first time I faced a situation where I needed to store data in a "slightly" structured way, have the ability for basic search, and have easy (and most importantly, fast) access to data fragments. It’s like you kind of need a very lightweight database, but on the other hand, you really don’t want to mess with it.

So, what comes next? The classic scenario... I want a tool that's not quite a database, but a bit more than just storage, while still being type-safe, and with other nice little features.

After such a long introduction — let's get to the point. This article is about a set of tools for creating your own data protocol for storage or transmission. Meet brec (BinaryRECord).

The essence is ridiculously simple. The unit of data is a packet. You can think of it as a log entry, or as a message within a system — it doesn’t really matter. A packet consists of blocks (from 0 to 255) and an optional payload. It looks like this:

Block
…
Block (up to 255)
Payload (Optional)

The packets themselves, the blocks inside them, and the payload all need to be recognized and checked for integrity. This means signatures of some kind and checksum calculations. Nobody wants to bother with all that manually, so let's automate it.

All you need in the code is to use a macro to specify that a given structure is a block, while another structure is the payload.


#[block]
pub struct BlockA {
    pub field_u8: u8,
    pub field_u16: u16,
    pub field_u32: u32,
    pub field_u64: u64,
    pub field_u128: u128,
    pub field_i8: i8,
    pub field_i16: i16,
    pub field_i32: i32,
    pub field_i64: i64,
    pub field_i128: i128,
    pub field_f32: f32,
    pub field_f64: f64,
    pub field_bool: bool,
}

#[block]
pub struct BlockB {
    pub data: [u8;100],
}

// StructA, StructB, StructC, EnumA, and EnumB are assumed to be ordinary
// user-defined types that derive serde::Serialize / serde::Deserialize.
#[payload(bincode)]
#[derive(serde::Deserialize, serde::Serialize)]
pub struct PayloadA {
    pub field_u8: u8,
    pub field_u16: u16,
    pub field_u32: u32,
    pub field_u64: u64,
    pub field_u128: u128,
    pub field_struct_a: StructA,
    pub field_struct_b: Option<StructB>,
    pub field_struct_c: Vec<StructC>,
    pub field_enum: EnumA,
    pub vec_enum: Vec<EnumB>,
}

That's it. You can create a packet.

let packet = Packet::new(
    vec![
        Block::BlockA(BlockA::default()),
        Block::BlockB(BlockB::default()),
    ],
    Some(Payload::PayloadA(PayloadA::default()))
);

And just like that, the protocol is ready to use :) Yep, it's that simple. So, what kind of magic happens under the hood?

  • The packet will be equipped with a unique signature for identification.

  • The packet will have a header containing data lengths and a checksum.

  • The blocks will have their own signatures and checksums.

  • The payload will have its own header with a signature, checksum, and length information.

In other words, after packing the blocks and payload into a packet, it can be easily recognized even in a "noisy" stream.

An initialized packet provides the necessary methods for writing:

packet.write(&mut dest);
packet.write_all(&mut dest);
packet.write_vectored(&mut dest);

And of course, reading — can't do without it.

Packet::read(&mut source);
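
To make the flow concrete, here is a minimal round trip over an in-memory buffer, combining the two calls shown above. The exact return types belong to the crate, so I simply ignore them here rather than guess at them; treat this as a sketch, not canonical usage.

use std::io::Cursor;

// Write the packet into an in-memory buffer (anything implementing
// std::io::Write would do just as well: a file, a socket, etc.).
// Depending on the crate's signatures, `packet` may need to be `mut`.
let mut buf: Vec<u8> = Vec::new();
let _ = packet.write_all(&mut buf);

// Read it back from anything implementing std::io::Read.
let mut source = Cursor::new(buf);
let _ = Packet::read(&mut source);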

But you might rightly point out that all of this could be achieved without reinventing the wheel, just by using something like bincode (which, as you noticed earlier, is actually used in brec). And you'd be absolutely right! Bincode and similar great crates can do all of this (by the way, huge respect to the developers for their stable and excellent solutions).

However, my task was broader, and therefore so is brec's scope. It's primarily a set of tools — just two, but quite practical ones.

The first one is the buffer reader PacketBufReader<R: std::io::Read>. Its key features can be summarized as follows (a short usage sketch follows the list):

  • First, it can read a "noisy" stream — that is, data containing not only brec packets but also anything else. At the same time, it doesn't lose the "rejected" data but provides the user with a way to retrieve it.

  • Second — and this is a common headache — it has its own internal buffer, allowing PacketBufReader to load data when there’s a shortage. In other words, you don’t have to worry about the classic problem of “just a bit missing.”

  • Third, it can filter packets — but more on that a bit later.
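
Here is a rough sketch of how reading might look. Fair warning: the constructor and the read() call below are my assumptions based on the description above, not a verified copy of the crate's API; the brec documentation is the source of truth.

use std::fs::File;

// Any std::io::Read source will do: a file, a TCP stream, a pipe...
let file = File::open("session.bin").expect("failed to open source");
let mut reader = PacketBufReader::new(file);

// One read attempt (hypothetical call). In real code you would call this in
// a loop and inspect the returned value: packet found, non-packet bytes
// skipped, more data needed, or no more data at all.
let outcome = reader.read();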

The second tool is the storage Storage<S: std::io::Read + std::io::Write + std::io::Seek>, and this one is even more interesting (a sketch follows this list as well):

  • It stores packets in slots, allowing you to access a packet by its ordinal number. For example, if you need the 100th packet — you get it; or if you need packets from 200 to 250 — no problem. Remember when I mentioned virtual scrolling on the front end? This is exactly that (or close to it).

  • Each slot has its own checksum, which helps prevent situations with corrupted data. In other words, if the storage gets damaged, the likelihood of returning compromised data is extremely low — instead, the storage will just "spit out" an error. Moreover, even if the storage is "damaged" and can no longer be read entirely, you can still use PacketBufReader to read the contents, bypassing both the storage metadata and damaged fragments. Sure, you won’t be able to extract data by range anymore, but the packets themselves will not be lost.
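
A similarly hedged sketch for the storage side, as promised above. The insert and range method names are my guesses based on the slot-by-ordinal description, not the crate's verified API.

use std::fs::OpenOptions;

// The storage needs Read + Write + Seek, so a plain file fits naturally.
let file = OpenOptions::new()
    .read(true)
    .write(true)
    .create(true)
    .open("session.storage")
    .expect("failed to open storage file");

let mut storage = Storage::new(file).expect("failed to init storage");

// Append packets as they arrive (hypothetical method name)...
storage.insert(packet).expect("failed to store packet");

// ...and later serve that "virtual scrolling": fetch 50 packets starting at
// ordinal 200 (again, a hypothetical method; check the docs for the real one).
let page = storage.range(200, 50).expect("failed to read range");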

But the main point is that both PacketBufReader and Storage are capable of filtering data. Packet parsing occurs sequentially in a few stages:

  • Partial parsing of blocks. The parser reads all primitive fields of the blocks but doesn't touch slices (doesn't copy data). The result is a BlockReferred — a reference representation of your block.

  • Identifying the payload area. At this stage, the payload is defined as a slice.

  • Parsing the payload.

  • Finally, forming the packet. (At this point, full copying of data from the stream occurs.)

At each of these stages, you can filter the data. For example, you set a filter based on blocks and decide whether you need the packet containing the given blocks or not. If needed, parsing continues; if not, the most resource-consuming part of the parsing process is simply skipped. Moreover, you can peek into the payload as a byte slice and decide whether to continue processing this packet or just skip it. For instance, if your payload is a string, you can perform a search by &str without copying (assuming, of course, that it's valid UTF-8).
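
To show the shape of it, here is a purely illustrative filter setup. I want to stress that add_rule, the rule names, and the callback signatures are assumptions made for the sake of the example, not the crate's verified API; what matters is that the callback sees a cheap, reference-only view and decides whether the expensive part of parsing happens at all.

// Keep a packet only if it contains a BlockA; this is decided from the
// reference-only view, before the payload is parsed or anything is copied.
// (`add_rule`, `Rule::*`, and the callback shapes are assumed names.)
reader.add_rule(Rule::FilterByBlocks(|blocks: &[BlockReferred]| {
    blocks.iter().any(|b| matches!(b, BlockReferred::BlockA(..)))
}));

// Peek at the payload as raw bytes and search it without copying,
// assuming it holds valid UTF-8 text.
reader.add_rule(Rule::FilterByPayload(|bytes: &[u8]| {
    std::str::from_utf8(bytes)
        .map(|s| s.contains("temperature"))
        .unwrap_or(false)
}));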

Naturally, this provides a significant performance boost. The documentation contains benchmark results where brec, in filtering mode, outperforms the ubiquitous JSON by a factor of two in streaming mode (using PacketBufReader), while in storage mode (Storage) their performance is almost identical. Why compare with JSON? Simply because it's insanely popular and heavily optimized. And just to note: we're talking about performance with CRC checking enabled, which, by the way, can be disabled.

Alright, I guess that’s about it. Essentially, the public API of brec boils down to just 4 key elements — two macros (#[block] and #[payload]), the buffer reader PacketBufReader, and the storage Storage. That’s it! Everything else is just details.

And with these 4 elements (ah, almost made it to five), you can get rid of a ton of headaches, ending up with a tool that sits somewhere between a file storage system, a primitive database, and a transport protocol. It doesn't claim to be a "Swiss army knife" — quite the opposite. It’s more like: "Bro, if you need to retrieve, store, search quickly, and withstand noise and corruption — I'm your guy. If you need something more complex, sorry, you’re gonna need a heavier tool."

In a way, brec challenges the common paradigm that messages should be predefined (like in protobuf and similar solutions that involve schemas). But you have to admit, brec does it elegantly. Sure, there are no fixed messages, but there are predefined blocks and payloads that you can combine while still maintaining strict typing. So even without fixed messages, you still have the most important thing — strict data typing.

Based on the discussions I've already had with colleagues, and anticipating some questions, I'll try to address them right away. After all, it often happens that you have a question but just don’t feel like asking... I'll give it a shot :)

  • Testing as proof of reliability. This is a whole separate and tough story. Tough because of the macros — testing them is a thankless task. I had to write (using proptest) a small generator for blocks and structures (for payload). The test works as follows: it generates a random collection of blocks and structures for payload, then assigns some random values to each one. From all of this, packets are formed, which are then written and read. Each such case is a separate crate that runs independently. The second layer of testing is parsing. It's simpler in terms of implementation but trickier in terms of execution. I used the same proptest, which ultimately produces more than 40 GB (not a typo) of data that gets processed in a variety of ways. Obviously, stress tests like that are not something you'd try to run on CI. For CI, there's a lighter version.

  • Has it been used anywhere? In public projects — no. Brec has only just reached a stable beta. I've implemented it in one small closed project in production and started using it in another one that's still in development. So far, so good.

  • Where's the comparison with protobuf? Honestly, I just didn't have the time. If there's a request for it, I'll try to add it. I chose JSON for the comparison in the article for the reasons mentioned earlier — and because it was easier to implement for benchmarking.

  • What about cross-language support? A protocol schema is precisely what usually makes that kind of support possible (see protobuf). But here, there is no schema (as I've already mentioned). Yes, this kind of support will be needed, but only if this article gets some traction — a "star" on GitHub, at the very least ;) Seriously though, it's about demand driving development. If there's a request, I'll make it happen. As of now, I can say that integration with NodeJS or the Web works almost "out of the box" with high performance. I'm talking about WASM. You can define the protocol in a separate crate and then create a "wrapper" crate compiled to WASM, which can be seamlessly connected to NodeJS or a browser to parse packets or receive them as bytes. All of this while maintaining high performance. Of course, implementing the storage this way (via WASM) is quite challenging... But attaching the buffer reader would be much easier. By the way, there's a WASM example in the repository.

  • What about data exchange (like over a network)? Of course! Just take PacketBufReader and off you go — it doesn't care what it's "eating" as long as std::io::Read is implemented.

  • What about data compression? The idea is there. In the places where I've already implemented brec, it's not needed, so I haven't included it "out of the box." A custom implementation is possible, and it's fairly trivial if we’re talking about compressing the payload rather than the entire packet (for the payload, just implementing a couple of traits is enough). As for packet compression itself — that’s something worth adding "out of the box." But let's do it based on need — if there’s a request, I'll add it :)

  • Version control? Not built-in at the moment. A manual implementation via a separate block is pretty easy. But it's not available out of the box. If you need it — please leave a request on GitHub.

  • Did you use ChatGPT? Yes, and quite intensively. In fact, this is my first project where I used it as an assistant.

  • Does the code contain any ChatGPT-generated parts? No, it doesn’t. Initially, I planned to delegate testing to it, but it quickly turned out that when it comes to slightly more complex proptest strategies, it really struggles and ends up being more harmful than helpful. However, it did provide significant support when it came to various "generalizations." I can't quite explain it... For example, I'd say, "I've covered this and that interface... am I missing something?" — and it would give me suggestions on other interfaces that might be worth supporting. Plus, it’s great at catching "red-eye" mistakes — like when it's 2 a.m. and some offset is getting a wrong value because you forgot offset += blocks_len;. And of course, it’s useful for initial API guidance: first, I’d get a "hint" from it, and then I’d dive into the official documentation based on that hint.

That’s pretty much it from my side. Just one more thing: your support in the form of a star on the GitHub page would definitely put a smile on my face, and hopefully on yours too :) Seriously! You know how important that kind of feedback is for a developer. Just some kind of feedback, damn it :) If only I were a girl... no, I won’t go there... not really the right time to joke about that.

Also, if you found the project at least a bit interesting but feel like something’s missing, please check out the repository and leave a feature request. I built brec to meet my own everyday needs and implemented what I really needed in the end. So yeah, I might have missed some important API. That’s also why I left the project in beta — because I’m counting on your input for new features and useful APIs.

Thanks for reading! I hope it helped you take a little break from your routine. Have a great day!
