CSV parsing is one of those problems that looks trivial until you try to do it well. The format itself is simple, but supporting quoted fields, escape sequences, and multiple line-ending conventions—while still delivering high throughput—requires careful design.
In this post, I’ll walk through the design of csv-zero, a SIMD-accelerated, zero-allocation CSV iterator. We’ll explore the techniques used to achieve high performance, the tradeoffs involved, and the constraints that shaped the final architecture.
Specifically, we’ll cover:
- The field-oriented, zero-allocation iterator design
- SIMD-accelerated delimiter scanning
- Branch-reduction techniques in the hot path
- Benchmark methodology and results
If you’re interested in SIMD programming, branch prediction, or systems-level optimization, you may find some reusable patterns here.
The implementation is written in Zig, so examples reference Zig syntax and semantics, but the underlying ideas are largely language-agnostic and applicable to C, C++, Rust, or similar systems languages.
If you are already familiar with CSV, feel free to skip this section.
CSV does have an RFC specification—RFC 4180—which is concise and readable. Below is a distilled version of the rules relevant to this discussion.
At a high level:
A field may be either:
- Unquoted: a sequence of characters containing no commas, double quotes, or line breaks
- Quoted: enclosed in double quotes (")

Quoted fields may contain:
- Commas and line breaks
- Literal double quotes, escaped by doubling them ("")

Examples:
a,b,c
"hello, world",42,"foo""bar" Records may be terminated by:
\n)\r\n)A correct parser must handle both.
Understanding these rules is critical for parser implementation because each rule introduces branching logic and state management that impacts performance.
The primary goal of csv-zero is performance, with the explicit constraint of zero dynamic allocation in the iterator. To achieve this, the implementation rests on three main design choices:
- A field-oriented, zero-allocation iterator API
- SIMD-accelerated delimiter scanning
- Aggressive branch reduction in hot paths
The core abstraction is an iterator with a next() method that returns the next field or an error.
const Field = struct {
    data: []u8,
    last_column: bool,
    needs_unescape: bool,
};

const IteratorError = error{ EOF, FieldTooLong };

const Iterator = struct {
    pub fn next(self: *Iterator) IteratorError!Field {
        // ...
    }
};

The production implementation includes additional error types and field metadata, but this simplified interface captures the essential contract.
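A caller might consume this interface as follows. This is a hedged sketch: Iterator.init, process, and endRecord are illustrative placeholders, not part of the contract shown above.

```zig
// Hypothetical usage sketch built on the simplified interface above.
// Iterator.init, process, and endRecord are illustrative placeholders.
var it = Iterator.init(reader, &buffer);
while (true) {
    const field = it.next() catch |err| switch (err) {
        error.EOF => break, // clean end of input
        error.FieldTooLong => return err, // field exceeds the fixed buffer
    };
    process(field.data); // slice valid only until the next call to next()
    if (field.last_column) endRecord();
}
```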
Conceptually, the iterator works as follows:
1. Scan the buffer for the next delimiter (comma, newline, or quote)
2. If the delimiter is a comma or newline: the field is complete; return it
3. If the delimiter is a quote: enter quoted-field handling logic:
   - Scan ahead for the closing quote
   - Treat doubled quotes ("") as escaped quotes rather than terminators

In pseudocode:
function next():
    find next delimiter (quote, comma, newline)
    if delimiter is comma or newline:
        return field
    if delimiter is quote:
        loop:
            find next quote
            if escaped quote:
                skip
            else if followed by comma or newline:
                return field
            else if end of input:
                return field
            else:
                error

While the algorithm appears straightforward, a correct implementation requires careful handling of buffer boundaries, escape sequences, and line-ending variations. The following sections detail how SIMD, zero allocation, and branch reduction address these challenges.
Dynamic memory allocation introduces overhead that compounds at scale. Even with efficient allocators, allocations involve:
- Bookkeeping and metadata management
- Potential locking or synchronization in multi-threaded allocators
- Cache pollution and, eventually, system calls behind malloc/free

While arena allocators and custom allocation strategies can mitigate some of these costs, eliminating allocations entirely sidesteps them. This is why csv-zero is architected to operate entirely on a fixed-size buffer, performing no dynamic allocation during parsing.
To completely avoid dynamic allocation, csv-zero deliberately iterates over fields, not records.
Instead of returning an array of fields for each record, the iterator returns one field at a time as a slice into an internal buffer. That slice is valid only until the next call to next(), at which point the buffer may be reused or refilled.
This design choice has consequences.
Many CSV libraries expose a record-oriented API:
while (iterator.next()) |record| {
    total += parseInt(record[1]);
}

This is convenient, but fundamentally incompatible with zero allocation:
- The number of fields per record is not known in advance, so the record container must be able to grow
- All fields of a record must remain valid at once, so the entire record has to be buffered
You can mitigate this by reusing allocations, but at that point the design is no longer truly allocation-free. You still have to buffer the entire record, which places a significant constraint on the iterator and, in practice, often pushes users back toward explicit memory allocation anyway.
By iterating over fields instead, csv-zero avoids all of these issues. Users who want record-level iteration can easily build it on top of the field iterator, while users focused on streaming transformations—filtering, validation, conversion—avoid unnecessary buffering entirely.
In many practical workloads (e.g. CSV → JSON conversion, column selection, data cleanup), record materialization provides little benefit.
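Record-level iteration can be layered on top of the field iterator. Here is a rough sketch, assuming the simplified Iterator above, with handleRecord as a hypothetical callback; fields are copied into caller-owned storage because each slice is only valid until the next next() call.

```zig
// Sketch: record iteration built on the field iterator. Because a field
// slice is invalidated by the next call to next(), each field is copied
// into caller-owned storage before the whole record is handed off.
var storage: [4096]u8 = undefined; // caller-chosen record capacity
var lens: [64]usize = undefined; // per-field lengths
var used: usize = 0;
var count: usize = 0;
while (it.next()) |field| {
    @memcpy(storage[used..][0..field.data.len], field.data);
    lens[count] = field.data.len;
    used += field.data.len;
    count += 1;
    if (field.last_column) {
        handleRecord(storage[0..used], lens[0..count]); // hypothetical callback
        used = 0;
        count = 0;
    }
} else |err| {
    if (err != error.EOF) return err; // EOF simply ends iteration
}
```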
Because fields are returned as slices into a fixed buffer, the buffer must be large enough to hold the largest field encountered. If a field exceeds the buffer size, the iterator returns a FieldTooLong error, allowing the caller to decide how to proceed (allocate dynamically, skip the field, abort, etc.).
The key challenge is maximizing buffer utilization. If a field starts in the middle of the buffer and extends beyond the current buffered data, we need to:
- Move the partial field to the start of the buffer with @memmove
- Refill the freed space from the input stream

This sliding window technique ensures that any field smaller than the buffer size can be parsed, regardless of its alignment within the input stream.
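A sketch of that refill step follows. The field names start, len, and reader are assumptions about the iterator’s internal state, not the actual implementation.

```zig
// Sliding-window refill sketch: slide the partial field to the front of
// the fixed buffer, then top the buffer back up from the input stream.
fn refill(self: *Iterator) !void {
    const pending = self.len - self.start; // bytes of the unfinished field
    if (pending == self.buffer.len) return error.FieldTooLong; // field cannot fit
    // Move the partial field to the start of the buffer (regions may overlap).
    std.mem.copyForwards(u8, self.buffer[0..pending], self.buffer[self.start..self.len]);
    self.start = 0;
    self.len = pending;
    const n = try self.reader.read(self.buffer[pending..]); // refill the freed tail
    if (n == 0) return error.EOF;
    self.len += n;
}
```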
SIMD (Single Instruction, Multiple Data) allows a single CPU instruction to operate on multiple data elements simultaneously. This data-level parallelism is ideal for operations like searching for delimiters, where we need to compare many bytes against the same set of characters. Modern CPUs support wide vector registers (128–512 bits, i.e. 16–64 bytes).
If you’re unfamiliar with SIMD, I recommend reading about Zig Vectors or reviewing your architecture’s SIMD instruction set documentation.
In csv-zero, SIMD is used to locate delimiter characters—quotes ("), commas (,), and newlines (\n)—by processing multiple bytes at once.
Using Zig vectors:
const Vector = @Vector(16, u8);
const input: Vector = buffer[cursor..][0..16].*; // Load 16 bytes from the buffer.
const quotes = input == @as(Vector, @splat('"'));
const commas = input == @as(Vector, @splat(','));
const newlines = input == @as(Vector, @splat('\n'));
const delimiters = (quotes | commas | newlines);

This produces a vector of booleans indicating the positions of delimiter characters. The entire operation executes in just a few SIMD instructions, processing 16+ bytes in parallel.
Note that @splat in Zig creates an array or vector where each element is set to the same scalar value.
For illustration purposes in this article, we’ll use 6-byte vectors to keep diagrams readable. In production, csv-zero automatically selects the optimal vector length for the target architecture at compile time.
NOTE: The actual bit representation of delimiters uses little-endian bit ordering (LSB = index 0). The diagram shows a human-friendly left-to-right ordering for readability, but the implementation must account for the actual CPU representation when extracting bit positions.
To efficiently consume this result, the boolean vector is bit-cast to an integer:
const Bitmask = std.meta.Int(.unsigned, 16);
const mask: Bitmask = @bitCast(delimiters);

This enables fast integer operations instead of higher-level SIMD helpers.
In practice, treating the mask as an integer and using bit-twiddling operations (@ctz, &=) proved significantly faster than calling helper functions like std.simd.firstTrue.
From this point on, as you will see, most of the work reduces to simple integer operations.
Once the delimiter mask is available:
- If mask == 0, there are no delimiters in this chunk; load the next one
- Otherwise, each set bit in mask marks a delimiter position

To find the position of the next delimiter, we need to locate the least significant set bit (the rightmost 1) in the bitmask. This is done efficiently using the count trailing zeros (CTZ) instruction:
const index = @ctz(mask);

The CTZ instruction counts the number of consecutive zero bits starting from the LSB, which directly gives us the bit index. On x86-64, this compiles to a single TZCNT or BSF instruction.
To avoid returning the same delimiter again, the least significant set bit is cleared:
mask &= mask - 1;

This classic trick compiles down to a single instruction on many architectures (e.g. BLSR).
Why this works: Subtracting 1 from a number flips all trailing zeros to ones and flips the rightmost 1 to 0. The bitwise AND then clears everything up to and including that bit:
A         : 0b110100 (original)
A - 1     : 0b110011 (rightmost 1 becomes 0, trailing 0s become 1s)
A & (A-1) : 0b110000 (AND clears the rightmost 1)

This gives us a two-instruction loop for iterating through set bits: CTZ to find the position, then clear the bit, effectively using the mask integer as a queue of pending delimiter positions. This is significantly faster than scalar iteration or branching on individual bit tests.
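Putting the two instructions together, a self-contained toy loop over a mask looks like this (bitPositions is an illustrative helper, not part of csv-zero):

```zig
// Toy example: collect the positions of all set bits in a mask, using
// CTZ to find each position and mask &= mask - 1 to pop it.
fn bitPositions(mask_in: u16, out: []usize) usize {
    var mask = mask_in;
    var n: usize = 0;
    while (mask != 0) {
        out[n] = @ctz(mask); // index of the next set bit
        n += 1;
        mask &= mask - 1; // clear it, like popping from a queue
    }
    return n;
}
```

For a mask of 0b0100_0100_0001_0000, this yields the indices 4, 10, and 14 in ascending order.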
A helper like nextDelim() encapsulates this logic:
- If a cached delimiter mask exists, pop the next position from it
- Otherwise, load the next chunk, compute a fresh mask with SIMD, and cache it
- Fall back to byte-by-byte scanning near buffer boundaries
The main next() function repeatedly calls nextDelim() until it can return a field, refill the buffer, or signal EOF or error.
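A sketch of that shape follows. The field names mask, chunk_base, and cursor, and the scalarScan fallback, are assumptions for illustration, not csv-zero’s actual internals.

```zig
// Sketch of nextDelim(): pop from the cached mask if possible, otherwise
// scan forward one 16-byte chunk at a time until a delimiter appears.
fn nextDelim(self: *Iterator) ?usize {
    while (self.mask == 0) {
        if (self.cursor + 16 > self.len) return self.scalarScan(); // byte-by-byte near the boundary
        const chunk: @Vector(16, u8) = self.buffer[self.cursor..][0..16].*;
        const quotes: u16 = @bitCast(chunk == @as(@Vector(16, u8), @splat('"')));
        const commas: u16 = @bitCast(chunk == @as(@Vector(16, u8), @splat(',')));
        const newlines: u16 = @bitCast(chunk == @as(@Vector(16, u8), @splat('\n')));
        self.mask = quotes | commas | newlines; // cache this chunk's delimiters
        self.chunk_base = self.cursor;
        self.cursor += 16;
    }
    const index = self.chunk_base + @ctz(self.mask); // absolute buffer position
    self.mask &= self.mask - 1; // pop it for the next call
    return index;
}
```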
Quoted fields introduce complexity and branching. To keep the fast path fast, csv-zero separates quoted and unquoted logic as early as possible, branching to the quoted path only when a field begins with a quote.
Key goals when handling quoted fields:
1. Detect escaped quotes. Escaped quotes ("") are detected but not immediately unescaped; instead, the returned Field carries a needs_unescape flag, letting the caller decide whether to pay that cost.

2. Locate the closing quote and trailing delimiter. The parser must consume:
- The closing quote itself
- The comma, newline, or end of input that follows it
This logic is implemented as a small finite-state automaton, consuming delimiters until the quoted region is complete.
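The deferred unescape itself can be done in place. Here is a sketch; this helper is illustrative, not csv-zero’s actual API.

```zig
// Collapse each escaped quote ("") into a single quote, in place,
// returning the shortened slice. Only needed when needs_unescape is set.
fn unescape(data: []u8) []u8 {
    var write: usize = 0;
    var read: usize = 0;
    while (read < data.len) : (read += 1) {
        data[write] = data[read];
        write += 1;
        if (data[read] == '"' and read + 1 < data.len and data[read + 1] == '"') {
            read += 1; // skip the second quote of the "" pair
        }
    }
    return data[0..write];
}
```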
Modern CPUs rely on branch prediction to maintain instruction pipeline efficiency. When a branch is mispredicted, the pipeline must be flushed, incurring a substantial cycle penalty. In tight loops processing millions of fields, even infrequent mispredictions compound into measurable overhead.
At this level of optimization, branch misprediction becomes a real cost.
One notable example is handling both LF and CRLF line endings. Rather than branching on \r, the parser only checks for \n. If the delimiter is \n, a branch-free computation determines whether the preceding byte was \r and trims it if necessary:
// Branching version (susceptible to misprediction)
if (delim == '\n' and end > 0 and buffer[end - 1] == '\r') {
    return buffer[start .. end - 1]; // Trim \r
} else {
    return buffer[start..end];
}

// Branch-free version
const prev_is_cr = @intFromBool(end != 0 and buffer[end - 1] == '\r');
const is_newline = @intFromBool(delim == '\n');
const trim_cr = prev_is_cr & is_newline; // 1 if we need to trim \r, 0 otherwise
return buffer[start .. end - trim_cr];

This avoids conditional jumps entirely, reducing misprediction in a hot path.
Similar techniques are used elsewhere, but this change alone produced measurable improvements.
Rigorous benchmarking is critical for validating optimization claims. I created csv-race, a benchmarking harness that compares multiple CSV parsers using Linux perf and other profiling tools. The repository provides reproducible methodology for generating performance metrics:
Methodology: To isolate parser performance from downstream processing, the benchmark measures only iteration throughput—no numerical conversion, string manipulation, or data structure building. The task is simply to iterate every field and count them. All parsers use a fixed 64KB buffer to ensure fair comparison.
Test corpus: The benchmark suite includes commonly used CSV files (WorldCities, NFL games) plus synthetically generated files with controlled characteristics:
- Varying field lengths
- Different ratios of quoted to unquoted fields
- LF and CRLF line endings
This diversity ensures the benchmarks reflect real-world CSV file heterogeneity rather than overfitting to a single pattern.
Here are the results from the benchmarks:

[Benchmark charts from csv-race]
Designing a fast CSV parser turned out to be a far richer problem than it initially appeared. By committing to zero allocation, leveraging SIMD for delimiter detection, and aggressively reducing branches, csv-zero achieves high throughput while remaining simple and predictable.
The implementation is available at github.com/peymanmortazavi/csv-zero and is ready to use. I have plans to add additional utilities, but the core library is production-ready. A C-compatible interface is included for integration with C/C++ projects.
I want to express my enthusiasm for the Zig language. Zig has made exploring low-level concepts—custom allocators, SIMD via @Vector, explicit memory control, and compile-time code generation—both accessible and enjoyable. There are many factors to consider when choosing a programming language, but for me, Zig is by far the most enjoyable language to work in—and that enjoyment is not limited to low-level programming.
I want to thank everyone involved in the Zig language and its community. I already support the Zig Software Foundation and encourage you to explore the language, contribute bug reports, share your experiences, or support it financially if it resonates with you.