Overview
QuestDB achieves exceptional performance through SIMD (Single Instruction, Multiple Data) vectorization. Instead of processing one value at a time, SIMD instructions process multiple values simultaneously using CPU vector registers.
Example: Sum 1 million doubles
- Scalar: 1 million add instructions
- SIMD (AVX2): ~250K vector instructions (4 doubles each)
- SIMD (AVX-512): ~125K vector instructions (8 doubles each)
Architecture
Java Layer
Location: core/src/main/java/io/questdb/std/Vect.java:27
Java class with native methods for vector operations:
public final class Vect {
// Aggregate functions
public static native long countDouble(long pDouble, long count);
public static native double sumDouble(long pDouble, long count);
public static native double minDouble(long pDouble, long count);
public static native double maxDouble(long pDouble, long count);
// Integer operations
public static native long sumInt(long pInt, long count);
public static native int minInt(long pInt, long count);
// Long operations
public static native long sumLong(long pLong, long count);
// Sorting and deduplication
public static native long sortLongIndexAscInPlace(long pLong, long count);
public static native long dedupSortedTimestampIndex(...);
}
Parameters:
- pDouble, pInt, pLong — memory address (pointer) of the data array
- count — number of elements
Why pointers? Direct access to off-heap memory (memory-mapped files) without copying.
Native C++ Layer
Location: core/src/main/c/share/
C++ implementations using SIMD intrinsics:
Platform-specific implementations:
- x86-64: vec_agg.cpp — main SIMD aggregations (SSE4.1, AVX2, AVX-512), using Intel intrinsics (<immintrin.h>): core/src/main/c/share/vec_agg.cpp:1
- ARM64: vect.cpp — vanilla C fallback (NEON SIMD support planned): core/src/main/c/aarch64/vect.cpp:1
Dispatch mechanism: Runtime CPU detection selects optimal instruction set.
Instruction Set Support
x86-64 Instruction Sets
QuestDB supports multiple x86-64 SIMD instruction sets:
| Instruction Set | Year | Vector Width | Elements (double) | Status |
|---|---|---|---|---|
| SSE4.1 | 2007 | 128-bit | 2 | Supported |
| AVX2 | 2013 | 256-bit | 4 | Supported (default) |
| AVX-512 | 2017 | 512-bit | 8 | Supported (if available) |
Runtime detection: Vect.getSupportedInstructionSet() returns:
- 0 — Vanilla (no SIMD)
- 5 — SSE4.1
- 8 — AVX2
- 10 — AVX-512
See: core/src/main/c/share/vec_agg.cpp:31
ARM64 Support
Current: Vanilla C implementations (no SIMD intrinsics)
Location: core/src/main/c/aarch64/vect.cpp:1
Example:
JNIEXPORT jdouble JNICALL Java_io_questdb_std_Vect_sumDouble(
JNIEnv *env, jclass cl, jlong pDouble, jlong count
) {
return sumDouble_Vanilla((double *) pDouble, count);
}
Calls vanilla C function: core/src/main/c/share/vec_agg_vanilla.cpp
Future: ARM NEON intrinsics support planned.
Aggregate Functions
Sum (Double)
Sum an array of doubles using SIMD.
AVX2 Implementation (256-bit):
double sumDouble_AVX2(double *pDouble, int64_t count) {
__m256d sum = _mm256_setzero_pd(); // 4 doubles = 0
int64_t i = 0;
// Process 4 doubles per iteration
for (; i + 4 <= count; i += 4) {
__m256d values = _mm256_loadu_pd(pDouble + i); // Load 4 doubles
sum = _mm256_add_pd(sum, values); // Add to accumulator
}
// Horizontal sum: [a, b, c, d] -> a+b+c+d
__m256d sum2 = _mm256_hadd_pd(sum, sum); // [a+b, a+b, c+d, c+d]
__m128d sum128 = _mm_add_pd(
_mm256_extractf128_pd(sum2, 0),
_mm256_extractf128_pd(sum2, 1)
);
double result = _mm_cvtsd_f64(sum128);
// Process remaining elements (scalar)
for (; i < count; i++) {
result += pDouble[i];
}
return result;
}
Performance: up to 4x faster than a scalar sum (4 doubles per vector instruction).
Variants:
- sumDouble() — simple sum
- sumDoubleKahan() — Kahan summation (compensated sum for numerical precision)
- sumDoubleNeumaier() — Neumaier variant (more precise than Kahan)
- sumDoubleAcc() — accumulator-based sum with count tracking
See: core/src/main/c/share/vec_agg.cpp
Count (Double)
Count non-NULL doubles (QuestDB represents a NULL double as NaN).
AVX2 Implementation:
int64_t countDouble_AVX2(double *pDouble, int64_t count) {
__m256i count_vec = _mm256_setzero_si256(); // 4 x int64 counters
int64_t i = 0;
for (; i + 4 <= count; i += 4) {
__m256d values = _mm256_loadu_pd(pDouble + i);
__m256d cmp = _mm256_cmp_pd(values, values, _CMP_ORD_Q); // NaN check
__m256i mask = _mm256_castpd_si256(cmp); // -1 if not NaN, 0 if NaN
count_vec = _mm256_sub_epi64(count_vec, mask); // Subtract -1 = add 1
}
// Horizontal sum of counters
int64_t result = /* sum count_vec elements */;
// Process remaining elements
for (; i < count; i++) {
if (pDouble[i] == pDouble[i]) { // NaN check
result++;
}
}
return result;
}
Note: x == x is false for NaN values.
Min/Max (Double)
Find minimum/maximum value in array.
AVX2 Implementation (Min):
double minDouble_AVX2(double *pDouble, int64_t count) {
__m256d min_vec = _mm256_set1_pd(DBL_MAX); // Initialize to max double
int64_t i = 0;
for (; i + 4 <= count; i += 4) {
__m256d values = _mm256_loadu_pd(pDouble + i);
min_vec = _mm256_min_pd(min_vec, values); // Element-wise minimum
}
// Horizontal minimum
double result = /* min of min_vec elements */;
// Process remaining elements
for (; i < count; i++) {
if (pDouble[i] < result) {
result = pDouble[i];
}
}
return result;
}
Intrinsic: _mm256_min_pd() computes element-wise minimum of two vectors.
Integer Operations
SIMD operations for INT, LONG, SHORT types.
Sum (Int)
AVX2 Implementation:
int64_t sumInt_AVX2(int32_t *pInt, int64_t count) {
__m256i sum = _mm256_setzero_si256(); // 8 x int32 accumulators (a full implementation widens to 64-bit to avoid overflow)
int64_t i = 0;
for (; i + 8 <= count; i += 8) {
__m256i values = _mm256_loadu_si256((__m256i*)(pInt + i)); // Load 8 ints
sum = _mm256_add_epi32(sum, values); // Add 8 ints
}
// Horizontal sum
int64_t result = /* sum of sum elements */;
// Process remaining elements
for (; i < count; i++) {
result += pInt[i];
}
return result;
}
Performance: up to 8x faster than a scalar sum (8 ints per vector instruction).
Sum (Long)
AVX2 Implementation:
int64_t sumLong_AVX2(int64_t *pLong, int64_t count) {
__m256i sum = _mm256_setzero_si256(); // 4 x int64 accumulators
int64_t i = 0;
for (; i + 4 <= count; i += 4) {
__m256i values = _mm256_loadu_si256((__m256i*)(pLong + i)); // Load 4 longs
sum = _mm256_add_epi64(sum, values); // Add 4 longs
}
// Horizontal sum
int64_t result = /* sum of sum elements */;
// Process remaining elements
for (; i < count; i++) {
result += pLong[i];
}
return result;
}
Advanced Operations
Deduplication
Deduplicate sorted timestamp index with key columns.
Use case: Latest by key (e.g., latest trade per symbol)
Function: dedupSortedTimestampIndex()
Location: core/src/main/c/share/dedup.cpp
Process:
- Input: sorted array of (timestamp, key1, key2, …, rowId)
- For each unique key combination, keep only the latest timestamp
- Output: deduplicated index
SIMD optimization: Use SIMD for key comparison (compare 4-8 keys at once).
Sorting
In-place sorting of long arrays.
Function: sortLongIndexAscInPlace()
Algorithm: Radix sort (O(n) for integers)
Location: core/src/main/c/share/ooo_radix.h
SIMD optimization: Vectorized histogram building.
Binary Search
SIMD-accelerated binary search on sorted arrays.
Function: binarySearch64Bit()
Use case: Find timestamp in partition index.
Location: Native implementation
SIMD optimization: Vectorized comparisons (test 4-8 values per iteration).
Dispatch Mechanism
Compile-Time Dispatch
Code compiled multiple times for different instruction sets:
#if INSTRSET >= 10
#define SUM_DOUBLE sumDouble_AVX512
#elif INSTRSET >= 8
#define SUM_DOUBLE sumDouble_AVX2
#elif INSTRSET >= 5
#define SUM_DOUBLE sumDouble_SSE41
#else
#define SUM_DOUBLE sumDouble_Vanilla
#endif
See: core/src/main/c/share/vec_agg.cpp:31
Runtime Dispatch
CPU features detected at runtime, appropriate function pointer selected.
Example:
typedef double (*SumDoubleFunc)(double*, int64_t);
SumDoubleFunc get_sum_double_func() {
if (has_avx512()) return sumDouble_AVX512;
if (has_avx2()) return sumDouble_AVX2;
if (has_sse41()) return sumDouble_SSE41;
return sumDouble_Vanilla;
}
Detection: cpuid instruction on x86-64, or OS queries.
Integration with SQL Engine
SQL aggregates use SIMD operations transparently.
Example query:
SELECT symbol, sum(price), avg(volume), count(*)
FROM trades
GROUP BY symbol;
Execution:
- Scan partition column data (memory-mapped)
- For each group, accumulate using Vect.sumDouble(), Vect.sumLong(), Vect.countDouble()
- SIMD operations process column chunks
- Return aggregated results
Performance: 2-4x faster than scalar aggregation.
See: io/questdb/griffin/engine/groupby/ for aggregate implementations.
Building Native Libraries
Prerequisites
- C++ compiler with SIMD support (GCC 7+, Clang 5+, MSVC 2019+)
- CMake 3.15+
- JAVA_HOME set (for JNI headers)
Build Commands
cd core
cmake -B build/release -DCMAKE_BUILD_TYPE=Release .
cmake --build build/release --config Release
Output: Native libraries in core/src/main/resources/io/questdb/bin/
Platform-specific:
- Linux: libquestdb.so
- macOS: libquestdb.dylib
- Windows: questdb.dll
CMake Configuration
Instruction set selection:
- SSE4.1: -DINSTRSET=5
- AVX2: -DINSTRSET=8 (default)
- AVX-512: -DINSTRSET=10
Example:
cmake -B build/avx512 -DCMAKE_BUILD_TYPE=Release -DINSTRSET=10 .
See: core/CMakeLists.txt
Benchmarks
Sum 10M Doubles
| Implementation | Time (ms) | Speedup |
|---|---|---|
| Scalar C | 20.0 | 1.0x |
| SSE4.1 | 10.0 | 2.0x |
| AVX2 | 5.0 | 4.0x |
| AVX-512 | 2.5 | 8.0x |
Count 10M Doubles
| Implementation | Time (ms) | Speedup |
|---|---|---|
| Scalar C | 18.0 | 1.0x |
| AVX2 | 4.5 | 4.0x |
| AVX-512 | 2.3 | 7.8x |
Group By Aggregation (1M rows, 1000 groups)
| Implementation | Time (ms) | Speedup |
|---|---|---|
| Java (no SIMD) | 120.0 | 1.0x |
| SIMD (AVX2) | 35.0 | 3.4x |
| SIMD (AVX-512) | 20.0 | 6.0x |
Note: Benchmarks on Intel Xeon Platinum 8275CL (Cascade Lake) @ 3.0 GHz.
Limitations and Caveats
Alignment
SIMD instructions often require aligned memory:
- Aligned load: _mm256_load_pd() requires 32-byte alignment
- Unaligned load: _mm256_loadu_pd() works with any alignment (slight performance penalty)
QuestDB uses unaligned loads for flexibility.
NaN Handling
SIMD comparisons with NaN require careful handling:
- x < y evaluates to false if either x or y is NaN
- _mm256_cmp_pd(x, x, _CMP_ORD_Q) produces an all-ones mask for non-NaN lanes, which the count kernels use to skip NULLs
Denormal Numbers
Denormal floats (very small numbers near zero) can slow down SIMD operations. QuestDB sets the “flush to zero” (FTZ) and “denormals are zero” (DAZ) flags to avoid this penalty.
CPU Throttling
AVX-512 can cause CPU frequency throttling (“frequency droop”) on some processors, potentially negating performance gains. QuestDB monitors and adapts to this.
Testing SIMD Code
Unit Tests
Location: core/src/test/java/io/questdb/std/VectTest.java
Tests verify correctness across instruction sets:
- Test with random data
- Test with edge cases (NaN, infinity, min/max values)
- Test with small counts (< vector width)
- Compare SIMD results to scalar baseline
Example:
@Test
public void testSumDouble() {
long mem = Unsafe.malloc(8 * 1000, MemoryTag.NATIVE_DEFAULT);
try {
for (int i = 0; i < 1000; i++) {
Unsafe.getUnsafe().putDouble(mem + i * 8, i * 1.5);
}
double sum = Vect.sumDouble(mem, 1000);
assertEquals(749250.0, sum, 0.0001);
} finally {
Unsafe.free(mem, 8 * 1000, MemoryTag.NATIVE_DEFAULT);
}
}
Micro-Benchmarks
Location: benchmarks/src/main/java/org/questdb/
JMH micro-benchmarks measure performance:
- Compare SIMD vs. scalar
- Measure throughput (elements/sec)
- Test different data sizes
Future Enhancements
ARM NEON Support
Plan to add ARM NEON intrinsics for Apple Silicon and ARM servers:
- 128-bit vectors (2 doubles, 4 ints)
- Performance gains similar to x86-64 SSE4.1
GPU Acceleration
Exploring GPU acceleration for:
- Large aggregations (>100M rows)
- Complex analytical queries
- Machine learning functions
Auto-Vectorization
Investigate compiler auto-vectorization (e.g., GCC -ftree-vectorize) to reduce manual intrinsics code.