Overview
Cairo is QuestDB’s columnar storage engine, designed for high-performance time-series data. Named after the Egyptian city, Cairo provides the foundation for QuestDB’s speed and efficiency.
Location: core/src/main/java/io/questdb/cairo/
Core Concepts
Column-Oriented Storage
Data is stored by column rather than by row. Benefits:
- Efficient scans: Read only the columns you need
- Better compression: Similar values stored together compress better
- Vectorization: Process multiple values simultaneously with SIMD
- Cache efficiency: Sequential reads fit in CPU cache
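To make the row versus column distinction concrete, here is a minimal Python sketch. It is an illustration only, not QuestDB code: the column names (`ts`, `price`, `size`) and the sample values are made up, and Python's `array` module stands in for contiguous native column storage.

```python
from array import array

# Row-oriented layout: one tuple per row. Reading a single field
# still walks every row object.
rows = [
    (1_000, 101.5, 10),
    (2_000, 102.0, 20),
    (3_000, 101.0, 30),
]

# Column-oriented layout: one contiguous typed array per column.
columns = {
    "ts": array("q", [r[0] for r in rows]),     # 8-byte signed integers
    "price": array("d", [r[1] for r in rows]),  # 8-byte doubles
    "size": array("i", [r[2] for r in rows]),   # C ints (4 bytes on common platforms)
}

def average(column_name):
    """Scan a single column; the other columns are never touched."""
    values = columns[column_name]
    return sum(values) / len(values)

avg_price = average("price")
```

An aggregate over `price` reads one dense array of doubles. This sequential access pattern is what scans, compression, and SIMD all benefit from.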
Time-Based Partitioning
Tables are automatically partitioned by timestamp. Partitioning options:
- NONE: No partitioning (single partition)
- DAY: Daily partitions (recommended for most use cases)
- MONTH: Monthly partitions
- YEAR: Yearly partitions
- HOUR: Hourly partitions (high-frequency data)
Benefits:
- Fast time-range queries: Skip entire partitions outside the range
- Efficient data management: Drop old partitions in O(1) time
- Parallel processing: Process multiple partitions concurrently
Location: io/questdb/cairo/PartitionBy.java
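The partition-skipping benefit can be sketched in a few lines of Python. This is a simplified model under stated assumptions: partitions are daily, identified by `YYYY-MM-DD` names, and timestamps are microseconds since the epoch. The helpers `day_partition`, `prune`, and `micros` are hypothetical names for this example, not QuestDB functions.

```python
from datetime import datetime, timezone

MICROS_PER_DAY = 86_400_000_000

def day_partition(ts_micros):
    """Name of the daily partition that holds a microsecond epoch timestamp."""
    day_start_seconds = (ts_micros // MICROS_PER_DAY) * 86_400
    return datetime.fromtimestamp(day_start_seconds, tz=timezone.utc).strftime("%Y-%m-%d")

def prune(partitions, lo_micros, hi_micros):
    """Keep only the partitions that overlap the range [lo_micros, hi_micros]."""
    lo, hi = day_partition(lo_micros), day_partition(hi_micros)
    # ISO dates sort lexicographically in chronological order.
    return [p for p in partitions if lo <= p <= hi]

def micros(year, month, day, hour=0):
    """Helper for the example: UTC date and hour to microseconds since epoch."""
    return int(datetime(year, month, day, hour, tzinfo=timezone.utc).timestamp()) * 1_000_000

partitions = ["2024-01-01", "2024-01-02", "2024-01-03", "2024-01-04"]
selected = prune(partitions, micros(2024, 1, 2, 6), micros(2024, 1, 3, 18))
```

Only the two partitions that overlap the query range are kept. The others are skipped without reading any column data.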
Key Classes
CairoEngine
Location: io/questdb/cairo/CairoEngine.java:1
The storage engine core. Manages:
- Table lifecycle (create, drop, rename)
- Reader/writer pools
- WAL coordination
- Schema changes
- Background jobs
Key methods:
- getReader(): Acquires a table reader from the pool
- getWriter(): Acquires a table writer from the pool
- getTableToken(): Resolves table name to token
- removeTableReader() / removeTableWriter(): Closes and removes tables
TableWriter
Location: io/questdb/cairo/TableWriter.java:1
Writes data to tables. Supports:
- In-order appends (most common)
- Out-of-order (O3) data handling
- Column addition/removal
- Index building
- Partition management
Key methods:
- newRow(timestamp): Begins a new row
- putInt(), putDouble(), putStr(), etc.: Set column values
- append(): Appends the completed row to the table
- commit(): Makes appended rows visible to readers
- addColumn(): Adds a new column to the table
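The difference between appending a row and committing it is easy to miss, so here is a small Python model of the lifecycle. `MiniWriter` is a hypothetical in-memory stand-in that mirrors the method names listed above. It is not the QuestDB API and ignores partitions, types, and O3 handling.

```python
class MiniWriter:
    """Hypothetical in-memory stand-in for a table writer (not the QuestDB API)."""

    def __init__(self):
        self.committed = []  # rows visible to readers
        self._pending = []   # rows appended but not yet committed
        self._row = None     # row currently being built

    def new_row(self, timestamp):
        self._row = {"ts": timestamp}
        return self

    def put(self, column, value):
        self._row[column] = value
        return self

    def append(self):
        # The row is complete, but readers cannot see it yet.
        self._pending.append(self._row)
        self._row = None

    def commit(self):
        # Publish all pending rows at once.
        self.committed.extend(self._pending)
        self._pending.clear()

writer = MiniWriter()
writer.new_row(1_000).put("price", 101.5).append()
writer.new_row(2_000).put("price", 102.0).append()
visible_before_commit = len(writer.committed)
writer.commit()
visible_after_commit = len(writer.committed)
```

In this model, both rows become visible together at `commit()`. Readers never observe a partially written batch.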
TableReader
Location: io/questdb/cairo/TableReader.java:61
Reads data from tables. Provides:
- Partition iteration
- Column access
- Snapshot consistency (MVCC)
- Symbol table lookups
Key methods:
- openPartition(index): Opens a partition for reading
- getColumn(index): Returns memory-mapped column data
- getSymbolMapReader(index): Returns symbol table reader
- reload(): Reloads to see new transactions
- size(): Total row count across all partitions
ColumnVersionReader
Location: io/questdb/cairo/ColumnVersionReader.java
Tracks column schema versions across partitions. Handles:
- Column additions
- Column type changes (in WAL tables)
- Column renames
File Format
Column Data Files
Column files store raw data in native binary format.
Naming: <column_name>.d (e.g., price.d, ts.d)
Fixed-width types: Stored as binary arrays
- INT: 4 bytes per value
- LONG: 8 bytes per value
- DOUBLE: 8 bytes per value
- TIMESTAMP: 8 bytes per value (microseconds since epoch)
- BOOLEAN: 1 byte per value
Variable-width types (such as strings): Stored as two files
- <column>.d: Actual string data (UTF-8 or UTF-16)
- <column>.i: Index file (8-byte offsets into the .d file)
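The following Python sketch shows why fixed-width columns allow direct row addressing and why variable-width columns need an offset index. It is a simplified model, not the exact on-disk format: little-endian byte order, the extra trailing end offset, and the absence of any per-value length header are assumptions made for this example.

```python
import struct

def encode_int_column(values):
    """Fixed-width: each INT takes 4 bytes, so row i starts at byte 4 * i."""
    return struct.pack(f"<{len(values)}i", *values)

def read_int(data, row):
    return struct.unpack_from("<i", data, 4 * row)[0]

def encode_string_column(values):
    """Variable-width: a data buffer plus an index of 8-byte offsets into it."""
    data = bytearray()
    index = bytearray()
    for value in values:
        index += struct.pack("<q", len(data))  # where this value starts
        data += value.encode("utf-8")
    index += struct.pack("<q", len(data))      # end offset of the last value
    return bytes(data), bytes(index)

def read_string(data, index, row):
    # Two adjacent offsets delimit the value for this row.
    start, end = struct.unpack_from("<2q", index, 8 * row)
    return data[start:end].decode("utf-8")

int_file = encode_int_column([7, -1, 42])
str_data, str_index = encode_string_column(["ab", "", "xyz"])
```

Reading row 2 of the INT column is a single seek to byte 8. Reading row 2 of the string column takes one index lookup followed by one read from the data buffer.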
Symbol Files
Symbols (string interning) use three files:
- <column>.d: Integer keys (4 bytes each)
- <column>.k: Symbol keys (unique string list)
- <column>.v: Symbol values (offset index for .k)
Benefits:
- Reduces storage (4 bytes vs. full string)
- Faster comparisons (integer equality)
- Better compression
Location: io/questdb/cairo/SymbolMapReader.java, io/questdb/cairo/SymbolMapWriter.java
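String interning itself is a small idea, sketched below in Python. `SymbolTable` is a hypothetical class for illustration. It keeps everything in memory and does not reflect how SymbolMapReader or SymbolMapWriter lay out their files. The ticker strings are made-up sample data.

```python
class SymbolTable:
    """Hypothetical in-memory symbol table: each distinct string gets one integer key."""

    def __init__(self):
        self.keys = {}    # string -> integer key
        self.values = []  # integer key -> string

    def intern(self, text):
        key = self.keys.get(text)
        if key is None:
            key = len(self.values)
            self.keys[text] = key
            self.values.append(text)
        return key

    def lookup(self, key):
        return self.values[key]

symbols = SymbolTable()
# The column stores only the integer keys.
symbol_column = [symbols.intern(s) for s in ["AAPL", "MSFT", "AAPL", "AAPL", "MSFT"]]

# Filtering compares integers, not strings.
wanted = symbols.keys["AAPL"]
matching_rows = [row for row, key in enumerate(symbol_column) if key == wanted]
```

The five-row column holds five small integers and the table holds two strings, so the saving grows with the number of repeated values.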
Metadata Files
_meta — Table metadata (one per table):
- Table ID
- Column definitions (name, type, indexed, symbol)
- Designated timestamp column
- Partition granularity
- Table version
_txn — Transaction metadata (one per table):
- Current transaction ID
- Row count
- Partition metadata (boundaries, counts)
- Attached partition list
Location: io/questdb/cairo/TableUtils.java
Indexing
Bitmap Indexes
QuestDB uses bitmap indexes for low-cardinality columns (especially symbols).
Structure:
- One bitmap per distinct value
- Bitmap: bit array where bit i = 1 if row i has that value
Files:
- <column>.k: Index keys (unique values)
- <column>.v: Index value offsets
Example (six rows, with 'A' in rows 0, 2, 4 and 'B' in rows 1, 5):
- Bitmap for 'A': [1, 0, 1, 0, 1, 0]
- Bitmap for 'B': [0, 1, 0, 0, 0, 1]
- Query WHERE symbol = 'A' → return rows [0, 2, 4]
- Query WHERE symbol IN ('A', 'B') → OR the bitmaps → [1, 1, 1, 0, 1, 1] → rows [0, 1, 2, 4, 5]
Benefits:
- Fast lookups for equality queries
- Efficient set operations (AND, OR, NOT)
- Small memory footprint (compressed bitmaps)
Location: io/questdb/cairo/BitmapIndexReader.java, io/questdb/cairo/BitmapIndexWriter.java, core/src/main/c/share/bitmap_index_utils.cpp
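The bitmap lookups described in this section can be reproduced with a short Python sketch. It follows the conceptual description above (one bit array per distinct value) and says nothing about the actual layout of the .k and .v files. Each bitmap is held in a Python integer, and the value 'C' in row 3 is a placeholder chosen for the example.

```python
def build_bitmaps(column):
    """One bitmap per distinct value; bit i is set if row i holds that value."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

def rows_of(bitmap):
    """Row numbers of the set bits, in ascending order."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows

indexed_column = ["A", "B", "A", "C", "A", "B"]  # "C" is a placeholder value
bitmaps = build_bitmaps(indexed_column)

rows_eq_a = rows_of(bitmaps["A"])                   # WHERE symbol = 'A'
rows_in_a_b = rows_of(bitmaps["A"] | bitmaps["B"])  # WHERE symbol IN ('A', 'B')
```

The IN query is a single bitwise OR over two integers. This is why set operations on such indexes are cheap.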
Column Indexer Task
Indexes are built asynchronously by ColumnIndexerTask:
- Writer appends data without indexes
- Writer commits transaction
- Background job picks up indexing task
- Indexes built concurrently with new writes
Location: io/questdb/tasks/ColumnIndexerTask.java
Write-Ahead Log (WAL)
Overview
WAL provides:
- Higher ingestion throughput (multiple writers)
- Durability (writes persisted before acknowledgment)
- Schema flexibility (columns can be added out-of-order)
Location: core/src/main/java/io/questdb/cairo/wal/
WAL Architecture
Components:
- WalWriter: Writes data to WAL segments (io/questdb/cairo/wal/WalWriter.java)
- TableSequencer: Coordinates multiple writers (io/questdb/cairo/wal/seq/TableSequencer.java)
- ApplyWal2TableJob: Applies WAL to table (io/questdb/cairo/wal/ApplyWal2TableJob.java)
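How the three components cooperate can be sketched as a single-threaded Python model. The class and function names below (`SequencerSketch`, `WalWriterSketch`, `apply_wal`) are hypothetical and only mirror the roles listed above. The real components deal with files, segments, and concurrency, none of which is modelled here.

```python
import itertools

class SequencerSketch:
    """Hypothetical stand-in: assigns one global order to commits from all writers."""

    def __init__(self):
        self._next_txn = itertools.count(1)
        self.log = []  # (txn, wal_id, rows) in commit order

    def commit(self, wal_id, rows):
        txn = next(self._next_txn)
        self.log.append((txn, wal_id, list(rows)))
        return txn

class WalWriterSketch:
    """Hypothetical stand-in: buffers rows locally, then registers a commit."""

    def __init__(self, wal_id, sequencer):
        self.wal_id = wal_id
        self.sequencer = sequencer
        self.buffer = []

    def append(self, row):
        self.buffer.append(row)

    def commit(self):
        txn = self.sequencer.commit(self.wal_id, self.buffer)
        self.buffer = []
        return txn

def apply_wal(sequencer, table, applied_txn=0):
    """Apply not-yet-applied transactions to the table in sequencer order."""
    for txn, _wal_id, rows in sequencer.log:
        if txn > applied_txn:
            table.extend(rows)
            applied_txn = txn
    return applied_txn

sequencer = SequencerSketch()
w1, w2 = WalWriterSketch(1, sequencer), WalWriterSketch(2, sequencer)
w1.append("a1")
w2.append("b1")
txn_w2 = w2.commit()  # w2 commits first
txn_w1 = w1.commit()
table_rows = []
last_applied = apply_wal(sequencer, table_rows)
```

Writers never touch the table directly. In this model the commit order recorded by the sequencer, not the order in which rows were buffered, decides the order in which data is applied.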
WAL Structure
Location: io/questdb/cairo/wal/WalUtils.java
Enabling WAL
The WAL keyword in a CREATE TABLE statement enables WAL for the table.
Default: Non-WAL tables (single writer, direct writes)
Out-of-Order (O3) Data
Problem
Time-series data often arrives out-of-order:
- Network delays
- Multiple data sources
- Late-arriving events
Solution
QuestDB’s O3 algorithm handles out-of-order data efficiently:
- Buffer: Collect O3 data in memory
- Sort: Sort by designated timestamp
- Merge: Merge sorted O3 data with existing partitions
- Commit: Make merged data visible
Implementation:
- TableWriter handles O3 commits: io/questdb/cairo/TableWriter.java
- O3PartitionTask merges data: io/questdb/tasks/O3PartitionTask.java
- Native code for sorting: core/src/main/c/share/ooo.cpp
Configuration:
- cairo.max.uncommitted.rows: Buffer size before commit
- cairo.o3.max.lag: Maximum lag allowed for O3 data
Location: core/src/main/c/share/ooo.cpp, core/src/main/c/share/ooo_dispatch.cpp
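The buffer, sort, merge, and commit steps from the Solution section can be illustrated with a minimal Python sketch. It operates on in-memory lists of `(timestamp, value)` tuples and uses the standard library's sort and `heapq.merge`. The native implementation uses its own sorting and merging code and works per partition and per column, which this sketch does not attempt to reproduce.

```python
import heapq

def o3_commit(partition_rows, o3_buffer):
    """Sort the buffered late rows by timestamp, then merge them with the
    already sorted partition. Rows are (timestamp, value) tuples."""
    by_timestamp = lambda row: row[0]
    o3_sorted = sorted(o3_buffer, key=by_timestamp)  # Sort step
    # Merge step: both inputs are sorted, so a linear merge is enough.
    return list(heapq.merge(partition_rows, o3_sorted, key=by_timestamp))

existing_rows = [(10, "a"), (20, "b"), (40, "d")]  # already ordered on disk
late_rows = [(30, "c"), (5, "z")]                  # Buffer step: arrived late
merged_rows = o3_commit(existing_rows, late_rows)  # Commit step publishes this
```

Only the small buffer is sorted. The existing data is assumed to be ordered already, so merging costs one linear pass instead of a full re-sort.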
Memory-Mapped Files
Overview
Column data is accessed via mmap() (OS-level file mapping):
- OS handles paging (load pages on demand)
- Efficient for large datasets (don’t need to fit in RAM)
- Zero-copy reads (data read directly from page cache)
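As a rough illustration of reading a column through a memory mapping, the Python sketch below writes a small file of 8-byte doubles and reads values back through `mmap`. The file name `price.d`, the little-endian encoding, and the sample values are assumptions for the example. The code uses only the Python standard library, not QuestDB's Memory API.

```python
import mmap
import os
import struct
import tempfile

sample_prices = [1.5, 2.5, 4.0]
column_path = os.path.join(tempfile.mkdtemp(), "price.d")

# Write the column as a flat array of 8-byte little-endian doubles.
with open(column_path, "wb") as f:
    f.write(struct.pack(f"<{len(sample_prices)}d", *sample_prices))

# Map the file read-only; the OS loads pages on demand.
with open(column_path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mem:
    row_count = len(mem) // 8
    # Row i of an 8-byte column lives at byte offset 8 * i.
    third_price = struct.unpack_from("<d", mem, 8 * 2)[0]
    price_total = sum(struct.unpack_from("<d", mem, 8 * i)[0] for i in range(row_count))
```

No explicit read call copies the file into a buffer first. Values are decoded from the mapped region, and the OS decides which pages stay resident.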
Memory API
Location: core/src/main/java/io/questdb/cairo/vm/api/
Hierarchy:
- Memory: Base interface
- MemoryR: Readable memory
- MemoryA: Appendable memory
- MemoryMA: Memory-mapped appendable
- MemoryMR: Memory-mapped readable
- MemoryCR: Contiguous readable
- MemoryARW: Appendable read-write
Implementations:
- MemoryCMRImpl: Memory-mapped read-only
- MemoryCARWImpl: Contiguous read-write
- MemoryPARWImpl: Paged read-write
Location: io/questdb/cairo/vm/api/, io/questdb/cairo/vm/Vm.java
Memory Management
See Memory Management for details on:
- Off-heap allocation
- Memory tagging and tracking
- Leak detection
Transactions and MVCC
Transaction Model
Single writer per table: At most one TableWriter at a time.
Multiple readers: Any number of TableReader instances concurrently.
Snapshot isolation: Readers see a consistent snapshot of the table (no dirty reads).
Transaction Metadata
The _txn file tracks:
- txn: Transaction ID (incremented on each commit)
- transientRowCount: Rows in the last (active) partition
- fixedRowCount: Rows in completed partitions
- minTimestamp, maxTimestamp: Time range
- dataVersion: Partition version
Location: io/questdb/cairo/TxReader.java, io/questdb/cairo/TxWriter.java
TxnScoreboard
Tracks active reader transactions to prevent premature file deletion.
Location: io/questdb/cairo/TxnScoreboard.java
When a writer commits:
- Update _txn file with new transaction ID
- Notify TxnScoreboard
- Readers reload to see new transaction
- Old files kept until all readers release them
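The rule that old files are kept until all readers release them can be modelled in a few lines of Python. `ScoreboardSketch` and `purgeable` are hypothetical names. The idea of tagging each old file version with the transaction that superseded it is an assumption for this example, not a description of the TxnScoreboard implementation.

```python
from collections import Counter

class ScoreboardSketch:
    """Hypothetical model: counts the readers pinned to each transaction."""

    def __init__(self):
        self.active = Counter()  # txn -> number of readers on that snapshot

    def acquire(self, txn):
        self.active[txn] += 1

    def release(self, txn):
        self.active[txn] -= 1
        if self.active[txn] == 0:
            del self.active[txn]

    def min_active(self, default):
        """Oldest transaction still in use, or default when no reader is active."""
        return min(self.active, default=default)

def purgeable(old_files, scoreboard, current_txn):
    """old_files holds (name, superseded_at_txn) pairs. A file is safe to delete
    once no active reader is on a transaction older than the one that replaced it."""
    oldest = scoreboard.min_active(default=current_txn)
    return [name for name, superseded_at in old_files if superseded_at <= oldest]

board = ScoreboardSketch()
old_files = [("price.d.v1", 2), ("price.d.v2", 3)]  # illustrative file names

board.acquire(1)                           # a reader still on txn 1
while_reader_on_1 = purgeable(old_files, board, current_txn=3)
board.release(1)
board.acquire(2)                           # the reader reloads to txn 2
while_reader_on_2 = purgeable(old_files, board, current_txn=3)
board.release(2)
with_no_readers = purgeable(old_files, board, current_txn=3)
```

As the slowest reader advances, more superseded files become eligible for deletion. With no readers left, everything older than the current transaction can go.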
Table Lifecycle
Table Creation
Creating a table produces the following on disk:
- <table>/_meta: Metadata
- <table>/_txn: Transaction file
- <table>/<partition>/: First partition (if partitioned)
Schema Evolution
QuestDB supports schema changes on live tables, such as adding columns. Column versions are tracked in io/questdb/cairo/ColumnVersionReader.java
Table Deletion
Dropping a table removes:
- Metadata files
- Partition directories
- WAL segments (if WAL-enabled)
Performance Considerations
Partition Size
- Too small: Overhead from many files
- Too large: Queries scan unnecessary data
Symbol Cardinality
- Low cardinality (<1000): Use SYMBOL type
- High cardinality (>100K): Use STRING type
Column Count
- Wide tables (>100 columns): More memory-mapped files, slower opens
- Narrow tables (<50 columns): Optimal performance
Related Pages
- SQL Compiler — Query execution over Cairo
- SIMD Optimizations — Vector operations on columns
- Memory Management — Off-heap allocation
- Architecture Overview — System architecture