Research
- Boolean Algebra / Logic optimization https://en.m.wikipedia.org/wiki/Logic_optimization
  - https://en.m.wikipedia.org/wiki/Quine%E2%80%93McCluskey_algorithm
  - https://en.m.wikipedia.org/wiki/Espresso_heuristic_logic_minimizer
- Algebraic factoring / Logic Synthesis / Kernel Extraction / Common Subexpression Extraction (CSE): merging table steps leads to a large OR, which should be simplified before being submitted to `ITableWrapper`.
- Task-dependency (DAG) optimization (context: `ICuboid` execution and inference)
  - Dependency construction: all table steps may be inferred from a smaller set of input steps.
  - Merging of leaves into new leaves (e.g. leading to fewer queries to `ITableWrapper`)
  - Generating intermediate nodes computing shared computation (e.g. applying `a=a1` once even if needed by multiple steps)
- Columnar Random-Access Compression
  - FSST enables random-access over compressed lists of `String`
- Immutable, Sequenced, Perfect-Hash Map (`IAdhocMap`)
  - Immutable: no defensive copies needed; safe as a `HashMap` key; enables `hashCode` caching (à la `String.hashCode()`).
  - `SequencedMap` (Java 21): insertion order is preserved and exposed, unlike `HashMap`. Keys and values can be iterated in a stable, predictable order without sorting overhead.
  - Faster `.get` via perfect hashing (`SimplePerfectHash`): for a fixed, known key-set the hash function maps every key to a unique slot with no collision, so a lookup is a single array access after one multiply+mask rather than open-addressing or chaining.
  - `Comparable`: enables `TreeMap`/sorted-column optimisations and benefits from JEP 180 (tree-ified `HashMap` collision chains use `compareTo` when keys implement `Comparable`).
  - Cached keyset projections (`retainAll`): the mapping from a full keyset to a projected sub-keyset is memoised, so repeated `retainAll` calls (common in groupBy/slice projection) pay the cost only once.
  - State-of-the-art alternatives worth knowing:
    - Guava `ImmutableMap` — immutable, iteration-ordered, array-backed for ≤ 5 entries; no `SequencedMap`, no perfect hash.
    - Eclipse Collections `UnifiedMap` / immutable maps — rich immutable variants, primitive-key maps, lower memory footprint than `HashMap`.
    - sux4j (Sebastiano Vigna) — minimal perfect hash functions (MPHF) for static key sets; the theoretical gold standard for space-optimal MPHFs.
    - CHD algorithm (Compress, Hash, Displace) — near-optimal MPHF construction; also implemented in CMPH.
    - Abseil `flat_hash_map` / Swiss tables — SIMD-accelerated open-addressing with a separate metadata byte array; sets the bar for CPU-cache-friendly hash-map throughput (C++; inspiration for Java ports like F14).
    - HPPC (High Performance Primitive Collections) — avoids boxing for primitive keys/values; relevant when keys are `int`/`long` rather than `String`.
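The multiply+mask perfect-hash lookup described above can be sketched as follows. This is a hypothetical illustration, not Adhoc's actual `SimplePerfectHash`: the brute-force multiplier search and the class name are assumptions, and a production implementation would mix high bits rather than rely on the low bits of `hashCode()`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

// Hypothetical sketch of a perfect hash for a fixed, known key-set:
// search for an odd multiplier placing every key in a distinct slot,
// so get() is one multiply, one mask, one array access.
public final class PerfectHashDemo {
    private final Object[] keys;
    private final Object[] values;
    private final int multiplier;
    private final int mask;

    private PerfectHashDemo(Object[] keys, Object[] values, int multiplier, int mask) {
        this.keys = keys;
        this.values = values;
        this.multiplier = multiplier;
        this.mask = mask;
    }

    /** Brute-force search for a multiplier giving zero collisions on the fixed key-set. */
    public static PerfectHashDemo build(Map<String, ?> source) {
        int capacity = Integer.highestOneBit(Math.max(1, source.size() * 2 - 1)) * 2;
        int mask = capacity - 1;
        Random rnd = new Random(42);
        outer:
        while (true) {
            int m = rnd.nextInt() | 1; // odd multiplier
            Object[] ks = new Object[capacity];
            Object[] vs = new Object[capacity];
            for (Map.Entry<String, ?> e : source.entrySet()) {
                int slot = (e.getKey().hashCode() * m) & mask;
                if (ks[slot] != null) {
                    continue outer; // collision: try another multiplier
                }
                ks[slot] = e.getKey();
                vs[slot] = e.getValue();
            }
            return new PerfectHashDemo(ks, vs, m, mask);
        }
    }

    /** Lookup: multiply, mask, single array access (plus one equality check). */
    public Object get(String key) {
        int slot = (key.hashCode() * multiplier) & mask;
        return key.equals(keys[slot]) ? values[slot] : null;
    }

    public static void main(String[] args) {
        Map<String, Object> src = new LinkedHashMap<>();
        src.put("country", 1);
        src.put("city", 2);
        src.put("year", 3);
        PerfectHashDemo phd = build(src);
        System.out.println(phd.get("city"));   // 2
        System.out.println(phd.get("absent")); // null
    }
}
```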
Joining Cuboids
A recurring operation in Adhoc is joining cuboids: when a derived measure (e.g. a `Combinator`) depends on N underlying `CubeQuerySteps`, each producing its own `ICuboid`, the engine must merge those N cuboids into a single deduplicated stream of `SliceAndMeasures` (one entry per distinct slice, carrying a value from each underlying). This is conceptually the same operation as an outer join in SQL or a merge in dataframe libraries — but Adhoc implements it in-process, over its own column types, rather than delegating to a DB.
Why ordering matters
External databases (DuckDB, PostgreSQL, …) can efficiently produce `GROUP BY` results in the grouping-key order (`ORDER BY` is often free when the engine already sorts for `GROUP BY`). When the table wrapper exposes a `TabularRecordStream` that arrives slice-sorted, the downstream engine can exploit this ordering in two ways:
- Sorted-merge join (`MergedSlicesIteratorNavigableElseHash`) — merges N sorted cuboid iterators in O(N × M) total comparisons (M = total slices) with no extra memory beyond a priority queue of size N. This is the production fast path for `Combinator`s whose underlyings share the same `GROUP BY`.
- Append-last insert (`MultitypeNavigableColumn`) — sorted incoming slices always hit the append-last fast path (O(1) per insertion, no binary search, no hash computation). The resulting column is itself sorted, enabling the next step to also benefit from merge-join.
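The priority-queue merge above can be sketched in a few lines. This is a simplified illustration merging plain `String` slices, not Adhoc's `MergedSlicesIteratorNavigableElseHash` (which also carries per-underlying values); the class and method names here are assumptions.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Sketch of a k-way sorted merge: N sorted iterators of slice keys are
// merged into one deduplicated, sorted stream using a heap of size N.
public final class KWayMergeDemo {
    public static List<String> mergeDistinct(List<Iterator<String>> sortedInputs) {
        // Heap entry: (current value, the iterator it came from)
        PriorityQueue<Map.Entry<String, Iterator<String>>> heap =
                new PriorityQueue<>(Map.Entry.comparingByKey());
        for (Iterator<String> it : sortedInputs) {
            if (it.hasNext()) {
                heap.add(new AbstractMap.SimpleEntry<>(it.next(), it));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Map.Entry<String, Iterator<String>> top = heap.poll();
            String slice = top.getKey();
            // Deduplicate: the same slice may come from several underlyings
            if (out.isEmpty() || !out.get(out.size() - 1).equals(slice)) {
                out.add(slice);
            }
            Iterator<String> it = top.getValue();
            if (it.hasNext()) {
                heap.add(new AbstractMap.SimpleEntry<>(it.next(), it));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> merged = mergeDistinct(List.of(
                List.of("FR", "UK", "US").iterator(),
                List.of("DE", "FR", "US").iterator()));
        System.out.println(merged); // [DE, FR, UK, US]
    }
}
```

Each input iterator is consumed exactly once, so the total work is O(M log N) heap operations for M total slices across N inputs, with no buffering beyond the heap.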
When ordering breaks
Several measure types legitimately break the input ordering:
| Measure type | Why ordering breaks | Consequence |
|---|---|---|
| Sparse slice sets | Different underlyings may contribute different subsets of slices. An underlying that has no value for a given slice does not appear in its sorted iterator, so the merger must fall back to point-lookup via `onValue(slice, SORTED_SUB_COMPLEMENT)`. | The `SORTED_SUB_COMPLEMENT` stream (the hash-side tail of `MultitypeNavigableElseHashColumn`) handles these leftovers. |
| `Partitionor` | Often changes the `GROUP BY` — the output grouping may differ from the input grouping, so the input-sorted order is meaningless in the output space. | Output column accumulates values via a mergeable hash or navigable-else-hash column. |
| `Dispatchor` | Creates entirely new slices (e.g. dispatching a value across multiple buckets of a new column). The new slices have no inherent ordering relative to the input. | Same — output is hash-accumulated. |
| `UndictionarizedColumn` | The dictionarization assigns sequential integer indices to slices in insertion order. The index ordering matches slice ordering only for a leading prefix; once an out-of-order slice arrives, the correspondence breaks (see `AAggregatingColumns.sortedPrefixLength`). | `UndictionarizedColumn.stream(SORTED_SUB)` exposes only the genuinely-sorted prefix; `stream(SORTED_SUB_COMPLEMENT)` covers the unordered tail. |
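The sorted-prefix tracking mentioned in the table can be sketched as follows. This is a hypothetical miniature, not `AAggregatingColumns` itself: the prefix grows while appended slices stay in order, and the first out-of-order slice freezes it, splitting the column into a sorted leg and an unordered tail.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the sortedPrefixLength idea: track how far the
// appended slices remain sorted, so partial ordering is never lost.
public final class SortedPrefixDemo {
    private final List<String> slices = new ArrayList<>();
    private int sortedPrefixLength = 0;
    private boolean stillSorted = true;

    public void append(String slice) {
        if (stillSorted) {
            if (slices.isEmpty() || slices.get(slices.size() - 1).compareTo(slice) <= 0) {
                sortedPrefixLength++;
            } else {
                stillSorted = false; // ordering broke: the prefix is frozen
            }
        }
        slices.add(slice);
    }

    /** The genuinely-sorted leading prefix (the SORTED_SUB leg). */
    public List<String> sortedSub() {
        return slices.subList(0, sortedPrefixLength);
    }

    /** The unordered tail (the SORTED_SUB_COMPLEMENT leg). */
    public List<String> sortedSubComplement() {
        return slices.subList(sortedPrefixLength, slices.size());
    }

    public static void main(String[] args) {
        SortedPrefixDemo col = new SortedPrefixDemo();
        for (String s : List.of("DE", "FR", "US", "IT", "UK")) {
            col.append(s);
        }
        System.out.println(col.sortedSub());           // [DE, FR, US]
        System.out.println(col.sortedSubComplement()); // [IT, UK]
    }
}
```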
The sorted-leg / complement split
To handle the mixed case uniformly, every `IMultitypeColumnFastGet` supports `stream(StreamStrategy)`:
- `SORTED_SUB` — the slice-sorted sub-stream. For `MultitypeNavigableColumn` this is everything; for `MultitypeNavigableElseHashColumn` it is the navigable side; for `UndictionarizedColumn` it is the first `sortedPrefixLength` indices.
- `SORTED_SUB_COMPLEMENT` — the unordered remainder. For navigable columns this is empty; for navigable-else-hash it is the hash side; for `UndictionarizedColumn` it is indices ≥ `sortedPrefixLength`.
- `ALL` — both legs concatenated.
The join helper (`UnderlyingQueryStepHelpersNavigableElseHash.distinctSlices`) first merge-joins the sorted legs across cuboids, then iterates each cuboid's complement, deduplicating against already-seen slices. This two-phase approach maximises the benefit of sorted data while still handling arbitrary unordered slices.
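The two-phase approach can be sketched as below, under the simplifying assumption that each cuboid exposes its slices as a sorted leg plus an unordered complement; the `CuboidSlices` record and `distinctSlices` shape are illustrative, not Adhoc's actual API.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Sketch of the two-phase distinct-slices join: merge the sorted legs
// first, then append each complement while deduplicating against seen slices.
public final class TwoPhaseJoinDemo {
    record CuboidSlices(List<String> sortedLeg, List<String> complement) {}

    public static List<String> distinctSlices(List<CuboidSlices> cuboids) {
        // Phase 1: merge-join the sorted legs (shown as a TreeSet union for
        // brevity; a real implementation would k-way merge the iterators).
        TreeSet<String> fromSortedLegs = new TreeSet<>();
        for (CuboidSlices c : cuboids) {
            fromSortedLegs.addAll(c.sortedLeg());
        }
        List<String> out = new ArrayList<>(fromSortedLegs);
        // Phase 2: append each cuboid's unordered complement, skipping
        // slices already produced by phase 1 or by another complement.
        Set<String> seen = new HashSet<>(fromSortedLegs);
        for (CuboidSlices c : cuboids) {
            for (String slice : c.complement()) {
                if (seen.add(slice)) {
                    out.add(slice);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> slices = distinctSlices(List.of(
                new CuboidSlices(List.of("DE", "FR"), List.of("US")),
                new CuboidSlices(List.of("FR", "UK"), List.of("US", "IT"))));
        System.out.println(slices); // [DE, FR, UK, US, IT]
    }
}
```

Note the output is only sorted up to the end of phase 1; the complement tail arrives in encounter order, mirroring how the real column types keep a sorted leg followed by a hash-side tail.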
Comparison with external systems
| System | Join model | Sorted-input optimisation |
|---|---|---|
| SQL engines | Hash join, merge join, nested-loop join; planner picks based on statistics and indexes. | Merge join is chosen when both sides have a compatible sort order (index scan, explicit ORDER BY). |
| Pandas / Polars | `merge()` / `join()` on columns; Polars can exploit sorted flags to use merge join. | Polars' `is_sorted` column flag enables O(N) merge join instead of O(N log N) sort-then-merge. |
| DuckDB | Adaptive join: hash join by default, merge join when profitable. | Pipelines may preserve sort order through operators; merge join is used when inputs are pre-sorted. |
| Adhoc | Two-phase merge: sorted legs via priority-queue merge, then hash-side complement via set dedup. | Relies on the DB producing a sorted `GROUP BY` result (common for single-table queries). Tracks the sorted prefix length via `AAggregatingColumns.sortedPrefixLength` so partial ordering is never lost. |
References
- Sort-merge join — the classical algorithm Adhoc's sorted-leg merge is based on.
- Hash join — the alternative when inputs are unsorted; Adhoc's hash-side complement lookup is a point-lookup variant.
- DuckDB's join ordering paper — background on adaptive join selection in a modern OLAP engine.
- Polars' sorted flag — the same idea as `sortedPrefixLength`, but at the column-metadata level rather than per-wrapper.
- See also optimisations.md § "Values / Aggregates" for the column types involved, and partitionor.md / the `Dispatchor` section for the measures that break ordering.