Overview
QuestDB achieves exceptional performance through SIMD (Single Instruction, Multiple Data) vectorization. Instead of processing one value at a time, SIMD instructions process multiple values simultaneously using CPU vector registers.
Example: Sum 1 million doubles
- Scalar: 1 million add instructions
- SIMD (AVX2): ~250K vector instructions (4 doubles each)
- SIMD (AVX-512): ~125K vector instructions (8 doubles each)
Architecture
Java Layer
Location: core/src/main/java/io/questdb/std/Vect.java:27
Java class with native methods for vector operations:
public final class Vect {
// Aggregate functions
public static native long countDouble(long pDouble, long count);
public static native double sumDouble(long pDouble, long count);
public static native double minDouble(long pDouble, long count);
public static native double maxDouble(long pDouble, long count);
// Integer operations
public static native long sumInt(long pInt, long count);
public static native int minInt(long pInt, long count);
// Long operations
public static native long sumLong(long pLong, long count);
// Sorting and deduplication
public static native long sortLongIndexAscInPlace(long pLong, long count);
public static native long dedupSortedTimestampIndex(...);
}
Parameters:
- pDouble, pInt, pLong — memory address (pointer) of the data array
- count — number of elements
Why pointers? Direct access to off-heap memory (memory-mapped files) without copying.
Native C++ Layer
Location: core/src/main/c/share/
C++ implementations using SIMD intrinsics:
Platform-specific implementations:
- x86-64: vec_agg.cpp — main SIMD aggregations (SSE4.1, AVX2, AVX-512), using Intel intrinsics (<immintrin.h>): core/src/main/c/share/vec_agg.cpp:1
- ARM64: vect.cpp — vanilla C fallback (NEON SIMD support planned): core/src/main/c/aarch64/vect.cpp:1
Dispatch mechanism: Runtime CPU detection selects optimal instruction set.
Instruction Set Support
x86-64 Instruction Sets
QuestDB supports multiple x86-64 SIMD instruction sets:
| Instruction Set | Year | Vector Width | Elements (double) | Status |
|---|---|---|---|---|
| SSE4.1 | 2007 | 128-bit | 2 | Supported |
| AVX2 | 2013 | 256-bit | 4 | Supported (default) |
| AVX-512 | 2017 | 512-bit | 8 | Supported (if available) |
Runtime detection: Vect.getSupportedInstructionSet() returns:
- 0 — Vanilla (no SIMD)
- 5 — SSE4.1
- 8 — AVX2
- 10 — AVX-512
See: core/src/main/c/share/vec_agg.cpp:31
ARM64 Support
Current: Vanilla C implementations (no SIMD intrinsics)
Location: core/src/main/c/aarch64/vect.cpp:1
Example:
JNIEXPORT jdouble JNICALL Java_io_questdb_std_Vect_sumDouble(
JNIEnv *env, jclass cl, jlong pDouble, jlong count
) {
return sumDouble_Vanilla((double *) pDouble, count);
}
Calls vanilla C function: core/src/main/c/share/vec_agg_vanilla.cpp
Future: ARM NEON intrinsics support planned.
Aggregate Functions
Sum (Double)
Sum an array of doubles using SIMD.
AVX2 Implementation (256-bit):
double sumDouble_AVX2(double *pDouble, int64_t count) {
__m256d sum = _mm256_setzero_pd(); // 4 doubles = 0
int64_t i = 0;
// Process 4 doubles per iteration
for (; i + 4 <= count; i += 4) {
__m256d values = _mm256_loadu_pd(pDouble + i); // Load 4 doubles
sum = _mm256_add_pd(sum, values); // Add to accumulator
}
// Horizontal sum: [a, b, c, d] -> a+b+c+d
__m256d sum2 = _mm256_hadd_pd(sum, sum); // [a+b, a+b, c+d, c+d]
__m128d sum128 = _mm_add_pd(
_mm256_extractf128_pd(sum2, 0),
_mm256_extractf128_pd(sum2, 1)
);
double result = _mm_cvtsd_f64(sum128);
// Process remaining elements (scalar)
for (; i < count; i++) {
result += pDouble[i];
}
return result;
}
Performance: up to 4x faster than a scalar sum (4 doubles per vector instruction).
Variants:
- sumDouble() — simple sum
- sumDoubleKahan() — Kahan summation (compensated sum for numerical precision)
- sumDoubleNeumaier() — Neumaier variant (more precise than Kahan)
- sumDoubleAcc() — accumulator-based sum with count tracking
See: core/src/main/c/share/vec_agg.cpp
Count (Double)
Count non-NULL doubles (QuestDB represents a NULL double as NaN).
AVX2 Implementation:
int64_t countDouble_AVX2(double *pDouble, int64_t count) {
__m256i count_vec = _mm256_setzero_si256(); // 4 x int64 counters
int64_t i = 0;
for (; i + 4 <= count; i += 4) {
__m256d values = _mm256_loadu_pd(pDouble + i);
__m256d cmp = _mm256_cmp_pd(values, values, _CMP_ORD_Q); // NaN check
__m256i mask = _mm256_castpd_si256(cmp); // -1 if not NaN, 0 if NaN
count_vec = _mm256_sub_epi64(count_vec, mask); // Subtract -1 = add 1
}
// Horizontal sum of counters
int64_t result = /* sum count_vec elements */;
// Process remaining elements
for (; i < count; i++) {
if (pDouble[i] == pDouble[i]) { // NaN check
result++;
}
}
return result;
}
Note: x == x is false for NaN values.
Min/Max (Double)
Find minimum/maximum value in array.
AVX2 Implementation (Min):
double minDouble_AVX2(double *pDouble, int64_t count) {
__m256d min_vec = _mm256_set1_pd(DBL_MAX); // Initialize to max double
int64_t i = 0;
for (; i + 4 <= count; i += 4) {
__m256d values = _mm256_loadu_pd(pDouble + i);
min_vec = _mm256_min_pd(min_vec, values); // Element-wise minimum
}
// Horizontal minimum
double result = /* min of min_vec elements */;
// Process remaining elements
for (; i < count; i++) {
if (pDouble[i] < result) {
result = pDouble[i];
}
}
return result;
}
Intrinsic: _mm256_min_pd() computes element-wise minimum of two vectors.
Integer Operations
SIMD operations for INT, LONG, SHORT types.
Sum (Int)
AVX2 Implementation:
int64_t sumInt_AVX2(int32_t *pInt, int64_t count) {
__m256i sum = _mm256_setzero_si256(); // 8 x int32 accumulators (a full implementation widens to 64-bit to avoid overflow)
int64_t i = 0;
for (; i + 8 <= count; i += 8) {
__m256i values = _mm256_loadu_si256((__m256i*)(pInt + i)); // Load 8 ints
sum = _mm256_add_epi32(sum, values); // Add 8 ints
}
// Horizontal sum
int64_t result = /* sum of sum elements */;
// Process remaining elements
for (; i < count; i++) {
result += pInt[i];
}
return result;
}
Performance: up to 8x faster than a scalar sum (8 ints per vector instruction).
Sum (Long)
AVX2 Implementation:
int64_t sumLong_AVX2(int64_t *pLong, int64_t count) {
__m256i sum = _mm256_setzero_si256(); // 4 x int64 accumulators
int64_t i = 0;
for (; i + 4 <= count; i += 4) {
__m256i values = _mm256_loadu_si256((__m256i*)(pLong + i)); // Load 4 longs
sum = _mm256_add_epi64(sum, values); // Add 4 longs
}
// Horizontal sum
int64_t result = /* sum of sum elements */;
// Process remaining elements
for (; i < count; i++) {
result += pLong[i];
}
return result;
}
Advanced Operations
Deduplication
Deduplicate sorted timestamp index with key columns.
Use case: Latest by key (e.g., latest trade per symbol)
Function: dedupSortedTimestampIndex()
Location: core/src/main/c/share/dedup.cpp
Process:
- Input: sorted array of (timestamp, key1, key2, …, rowId)
- For each unique key combination, keep only the latest timestamp
- Output: deduplicated index
SIMD optimization: Use SIMD for key comparison (compare 4-8 keys at once).
Sorting
In-place sorting of long arrays.
Function: sortLongIndexAscInPlace()
Algorithm: Radix sort (O(n) for integers)
Location: core/src/main/c/share/ooo_radix.h
SIMD optimization: Vectorized histogram building.
Binary Search
SIMD-accelerated binary search on sorted arrays.
Function: binarySearch64Bit()
Use case: Find timestamp in partition index.
Location: Native implementation
SIMD optimization: Vectorized comparisons (test 4-8 values per iteration).
Dispatch Mechanism
Compile-Time Dispatch
Code compiled multiple times for different instruction sets:
#if INSTRSET >= 10
#define SUM_DOUBLE sumDouble_AVX512
#elif INSTRSET >= 8
#define SUM_DOUBLE sumDouble_AVX2
#elif INSTRSET >= 5
#define SUM_DOUBLE sumDouble_SSE41
#else
#define SUM_DOUBLE sumDouble_Vanilla
#endif
See: core/src/main/c/share/vec_agg.cpp:31
Runtime Dispatch
CPU features detected at runtime, appropriate function pointer selected.
Example:
typedef double (*SumDoubleFunc)(double*, int64_t);
SumDoubleFunc get_sum_double_func() {
if (has_avx512()) return sumDouble_AVX512;
if (has_avx2()) return sumDouble_AVX2;
if (has_sse41()) return sumDouble_SSE41;
return sumDouble_Vanilla;
}
Detection: cpuid instruction on x86-64, or OS queries.
Integration with SQL Engine
SQL aggregates use SIMD operations transparently.
Example query:
SELECT symbol, sum(price), avg(volume), count(*)
FROM trades
GROUP BY symbol;
Execution:
- Scan partition column data (memory-mapped)
- For each group, accumulate using Vect.sumDouble(), Vect.sumLong(), Vect.countDouble()
- SIMD operations process column chunks
- Return aggregated results
Performance: 2-4x faster than scalar aggregation.
See: io/questdb/griffin/engine/groupby/ for aggregate implementations.
Building Native Libraries
Prerequisites
- C++ compiler with SIMD support (GCC 7+, Clang 5+, MSVC 2019+)
- CMake 3.15+
- JAVA_HOME set (for JNI headers)
Build Commands
cd core
cmake -B build/release -DCMAKE_BUILD_TYPE=Release .
cmake --build build/release --config Release
Output: Native libraries in core/src/main/resources/io/questdb/bin/
Platform-specific:
- Linux: libquestdb.so
- macOS: libquestdb.dylib
- Windows: questdb.dll
CMake Configuration
Instruction set selection:
- SSE4.1: -DINSTRSET=5
- AVX2: -DINSTRSET=8 (default)
- AVX-512: -DINSTRSET=10
Example:
cmake -B build/avx512 -DCMAKE_BUILD_TYPE=Release -DINSTRSET=10 .
See: core/CMakeLists.txt
Benchmarks
Sum 10M Doubles
| Implementation | Time (ms) | Speedup |
|---|---|---|
| Scalar C | 20.0 | 1.0x |
| SSE4.1 | 10.0 | 2.0x |
| AVX2 | 5.0 | 4.0x |
| AVX-512 | 2.5 | 8.0x |
Count 10M Doubles
| Implementation | Time (ms) | Speedup |
|---|---|---|
| Scalar C | 18.0 | 1.0x |
| AVX2 | 4.5 | 4.0x |
| AVX-512 | 2.3 | 7.8x |
Group By Aggregation (1M rows, 1000 groups)
| Implementation | Time (ms) | Speedup |
|---|---|---|
| Java (no SIMD) | 120.0 | 1.0x |
| SIMD (AVX2) | 35.0 | 3.4x |
| SIMD (AVX-512) | 20.0 | 6.0x |
Note: Benchmarks on Intel Xeon Platinum 8275CL (Cascade Lake) @ 3.0 GHz.
Limitations and Caveats
Alignment
SIMD instructions often require aligned memory:
- Aligned load: _mm256_load_pd() requires 32-byte alignment
- Unaligned load: _mm256_loadu_pd() works with any alignment (slight performance penalty)
QuestDB uses unaligned loads for flexibility.
NaN Handling
SIMD comparisons with NaN require careful handling:
- x < y evaluates to false if either x or y is NaN
- _mm256_cmp_pd(x, x, _CMP_ORD_Q) produces an all-ones mask for non-NaN lanes, which the count kernels use to skip NULLs
Denormal Numbers
Denormal floats (very small numbers near zero) can slow down SIMD operations. QuestDB sets the “flush to zero” (FTZ) and “denormals are zero” (DAZ) flags to avoid this penalty.
CPU Throttling
AVX-512 can cause CPU frequency throttling (“frequency droop”) on some processors, potentially negating performance gains. QuestDB monitors and adapts to this.
Testing SIMD Code
Unit Tests
Location: core/src/test/java/io/questdb/std/VectTest.java
Tests verify correctness across instruction sets:
- Test with random data
- Test with edge cases (NaN, infinity, min/max values)
- Test with small counts (< vector width)
- Compare SIMD results to scalar baseline
Example:
@Test
public void testSumDouble() {
long mem = Unsafe.malloc(8 * 1000, MemoryTag.NATIVE_DEFAULT);
try {
for (int i = 0; i < 1000; i++) {
Unsafe.getUnsafe().putDouble(mem + i * 8, i * 1.5);
}
double sum = Vect.sumDouble(mem, 1000);
assertEquals(749250.0, sum, 0.0001);
} finally {
Unsafe.free(mem, 8 * 1000, MemoryTag.NATIVE_DEFAULT);
}
}
Micro-Benchmarks
Location: benchmarks/src/main/java/org/questdb/
JMH micro-benchmarks measure performance:
- Compare SIMD vs. scalar
- Measure throughput (elements/sec)
- Test different data sizes
Future Enhancements
ARM NEON Support
Plan to add ARM NEON intrinsics for Apple Silicon and ARM servers:
- 128-bit vectors (2 doubles, 4 ints)
- Performance gains similar to x86-64 SSE4.1
GPU Acceleration
Exploring GPU acceleration for:
- Large aggregations (>100M rows)
- Complex analytical queries
- Machine learning functions
Auto-Vectorization
Investigate compiler auto-vectorization (e.g., GCC -ftree-vectorize) to reduce manual intrinsics code.