Handrolling-crypto

Feb 26, 2026

I recently wrote my own implementation of the ChaCha20-Poly1305 cipher, hand-optimised for the AVX2 instruction set. Here is the source

This was also me testing Zig out, after getting fed up with wrestling with Rust by trying to write everything in safe Rust. (That was a bad idea: if you are doing something really low-level, dip your toes into unsafe when it's needed. I learnt this the hard way.) Also, thanks to LLMs for writing the WASM glue code for me!

My implementation ended up being faster than the stdlib, but I really don't know if it's actually secure.

I have not even looked into making it resistant to side-channel attacks.

Here are some things I learned along the way.

But before that: please don't hand-roll your crypto in production. Even subtle implementation differences can silently break the security of the cipher.

ChaCha20

This is an ARX (add-rotate-XOR) stream cipher. It promises security guarantees comparable to AES but is faster on devices that do not have hardware acceleration for AES (certain IoT devices).
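To make "ARX" concrete, here is a minimal Python sketch of the ChaCha20 quarter round as specified in RFC 8439. It is not taken from my Zig source; the helper names `rotl32` and `quarter_round` are just illustrative. The whole cipher is built from these three operations: 32-bit addition, XOR, and rotation by a fixed amount.

```python
MASK32 = 0xFFFFFFFF


def rotl32(x: int, n: int) -> int:
    """Rotate a 32-bit word left by n bits."""
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(a: int, b: int, c: int, d: int):
    """One ChaCha quarter round: four add / XOR / rotate steps."""
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 16)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 12)
    a = (a + b) & MASK32
    d = rotl32(d ^ a, 8)
    c = (c + d) & MASK32
    b = rotl32(b ^ c, 7)
    return a, b, c, d


# Test vector from RFC 8439, section 2.1.1.
out = quarter_round(0x11111111, 0x01020304, 0x9B8D6F43, 0x01234567)
print([hex(w) for w in out])
# ['0xea2a92f4', '0xcb1cf8ce', '0x4581472e', '0x5881c4bb']
```

There are no table lookups and no data-dependent branches here, which is a large part of why the cipher does well on hardware without AES instructions.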

I recently used wolfSSL's implementation on a RISC-V device for a DTLS handshake, and it halved the number of cycles needed for the handshake compared to AES-GCM! (This was for an Inter-IIT PS; writing a blog post about it is on my todo list.)

I made it faster than the Zig stdlib by interleaving multiple blocks while using SIMD, so that the SIMD registers are fully used. This lets the compiler schedule instructions from independent blocks next to each other, which keeps the CPU pipeline full. The limit is 6 blocks: each pair of blocks requires 4 SIMD registers and AVX2 has 16, so 6 blocks take 12 registers and leave 4 for temporary values, which turned out to work well. I hit an IPC of 3.68 (measured using perf stat), which is quite high! My laptop does not have AVX-512, so I could not try that out.
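The reason interleaving is possible at all is that ChaCha20 blocks only differ in their counter and never depend on each other. Below is a rough Python sketch of the idea, not my actual Zig/AVX2 code: `block` computes one block at a time, while `blocks_interleaved` applies each quarter-round step to all 6 blocks before moving on, which is the ordering a SIMD implementation wants. All function names are made up for illustration, and pure Python will of course not show any speedup.

```python
import struct

MASK32 = 0xFFFFFFFF
CONSTANTS = (0x61707865, 0x3320646E, 0x79622D32, 0x6B206574)  # "expand 32-byte k"
# One double round: four column quarter rounds, then four diagonal ones.
QR_INDICES = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
              (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]


def rotl32(x, n):
    return ((x << n) & MASK32) | (x >> (32 - n))


def quarter_round(s, a, b, c, d):
    """In-place quarter round on state words s[a], s[b], s[c], s[d]."""
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 16)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 12)
    s[a] = (s[a] + s[b]) & MASK32
    s[d] = rotl32(s[d] ^ s[a], 8)
    s[c] = (s[c] + s[d]) & MASK32
    s[b] = rotl32(s[b] ^ s[c], 7)


def init_state(key, counter, nonce):
    """16-word state: constants, 256-bit key, 32-bit counter, 96-bit nonce."""
    return (list(CONSTANTS) + list(struct.unpack("<8I", key))
            + [counter & MASK32] + list(struct.unpack("<3I", nonce)))


def finish(working, initial):
    """Add the initial state back in and serialise as little-endian words."""
    return struct.pack("<16I", *((w + i) & MASK32 for w, i in zip(working, initial)))


def block(key, counter, nonce):
    """One 64-byte keystream block, computed on its own."""
    initial = init_state(key, counter, nonce)
    s = initial[:]
    for _ in range(10):
        for idx in QR_INDICES:
            quarter_round(s, *idx)
    return finish(s, initial)


def blocks_interleaved(key, counter, nonce, n=6):
    """n consecutive blocks, with every step applied to all blocks in turn."""
    initials = [init_state(key, counter + i, nonce) for i in range(n)]
    states = [s[:] for s in initials]
    for _ in range(10):
        for idx in QR_INDICES:
            for s in states:  # independent blocks: no data flows between them
                quarter_round(s, *idx)
    return b"".join(finish(s, i) for s, i in zip(states, initials))


key, nonce = bytes(range(32)), bytes(12)
# Interleaving changes the order of the work, not the result.
assert blocks_interleaved(key, 0, nonce) == b"".join(block(key, i, nonce) for i in range(6))
```

In a real AVX2 version the inner `for s in states` loop disappears: the same add, XOR, and rotate instructions are issued back to back on different registers, and since none of them wait on each other the CPU can overlap them.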

Also, maximising IPC is not a good idea in all cases, but here it ended up tracking the performance gains, as my benchmarks confirmed.

Sometimes increasing IPC leaves performance unchanged, because you also ended up increasing the number of instructions executed.
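A toy calculation shows why: the cycle count is roughly the number of instructions divided by IPC, so the two can cancel out. The numbers below are made up for illustration and are not from my benchmarks.

```python
def cycles(instructions: int, ipc: float) -> float:
    """Cycles needed to retire `instructions` at an average of `ipc` per cycle."""
    return instructions / ipc


baseline = cycles(1_000_000, 2.0)  # hypothetical starting point
busier = cycles(2_000_000, 4.0)    # twice the IPC, but twice the instructions
leaner = cycles(1_000_000, 4.0)    # twice the IPC, same instruction count

print(baseline, busier, leaner)  # 500000.0 500000.0 250000.0
```

Only the last case is an actual win, which is why I checked IPC against wall-clock benchmarks instead of trusting it alone.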

This is one of those SIMD optimisations that the compiler doesn't pick up on its own! (Compiler devs out there, please tell me why.)

I also aggressively inlined code (I don't know how much that affects performance, though). Avoiding copies was another thing I kept an eye on.

The Zig stdlib also uses SIMD, but I haven't looked into why it's slower (yet).

My naive benchmark running on GitHub Actions →

CSPRNG

This was the last thing I implemented. A CSPRNG is supposed to have certain properties, such as:
No polynomial-time algorithm should be able to predict the next bit of output with probability non-negligibly better than 50%.
Revealing the current state should not compromise the output that was emitted previously.
As it turns out, ChaCha20 is a stream cipher and its keystream satisfies the first property. For the second, the generator also has to overwrite its key with fresh keystream as it goes (often called fast key erasure); otherwise a leaked key and counter would let anyone recompute the earlier output.
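Here is a rough Python sketch of how a generator can be built on a ChaCha20 keystream while protecting earlier output, using the fast-key-erasure idea: on every refill, the first 32 bytes of keystream replace the key and only the rest is handed out. This is an illustration under my own assumptions, not my Zig code and not a vetted CSPRNG; `ChaChaRng`, `random_bytes`, and `chacha20_block` are hypothetical names.

```python
import struct

MASK32 = 0xFFFFFFFF
CONSTANTS = (0x61707865, 0x3320646E, 0x79622D32, 0x6B206574)
QR_INDICES = [(0, 4, 8, 12), (1, 5, 9, 13), (2, 6, 10, 14), (3, 7, 11, 15),
              (0, 5, 10, 15), (1, 6, 11, 12), (2, 7, 8, 13), (3, 4, 9, 14)]


def rotl32(x, n):
    return ((x << n) & MASK32) | (x >> (32 - n))


def chacha20_block(key: bytes, counter: int, nonce: bytes) -> bytes:
    """One 64-byte ChaCha20 block (RFC 8439 layout)."""
    init = (list(CONSTANTS) + list(struct.unpack("<8I", key))
            + [counter & MASK32] + list(struct.unpack("<3I", nonce)))
    s = init[:]
    for _ in range(10):
        for a, b, c, d in QR_INDICES:
            s[a] = (s[a] + s[b]) & MASK32
            s[d] = rotl32(s[d] ^ s[a], 16)
            s[c] = (s[c] + s[d]) & MASK32
            s[b] = rotl32(s[b] ^ s[c], 12)
            s[a] = (s[a] + s[b]) & MASK32
            s[d] = rotl32(s[d] ^ s[a], 8)
            s[c] = (s[c] + s[d]) & MASK32
            s[b] = rotl32(s[b] ^ s[c], 7)
    return struct.pack("<16I", *((w + i) & MASK32 for w, i in zip(s, init)))


class ChaChaRng:
    """Hypothetical fast-key-erasure generator; a sketch, not production code."""

    def __init__(self, seed: bytes):
        if len(seed) != 32:
            raise ValueError("seed must be 32 bytes")
        self._key = seed
        self._nonce = bytes(12)  # a fixed nonce is fine: the key changes on every refill

    def random_bytes(self, n: int) -> bytes:
        out = bytearray()
        while len(out) < n:
            blk0 = chacha20_block(self._key, 0, self._nonce)
            blk1 = chacha20_block(self._key, 1, self._nonce)
            # Replace the key before releasing any output. The old key is what
            # would be needed to recompute earlier output, and it is no longer
            # stored. (Python cannot guarantee the old bytes are wiped from memory.)
            self._key = blk0[:32]
            out += blk0[32:] + blk1  # 96 output bytes, disjoint from the new key
        return bytes(out[:n])  # leftover bytes are thrown away, never reused


rng = ChaChaRng(bytes(32))  # all-zero seed for the demo only; use real entropy
print(rng.random_bytes(16).hex())
```

The design choice that matters is the order of operations: the key is replaced before the output leaves the function, so a snapshot of the state taken afterwards contains nothing that leads back to bytes already returned.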

Poly1305

I tried to optimise this but failed to do anything substantial! It's still faster than the stdlib, no idea why.

Opinions on Zig

It's a nice improvement over C. I see Rust as a C++ replacement and Zig as a replacement for C, at least once it hits 1.0.