Introduction
Code and dataset: github.com/purszki/bitcoin_research_01
The problem: private Bitcoin wallets are too slow on mobile
Bitcoin promises self-sovereignty — your money, held by you, not a bank or exchange. To deliver on that promise, a user must be able to check their balance on a phone without leaking addresses to a third-party server. Today that means downloading compact block filters (BIP 158) and scanning them locally. The privacy is excellent, but the cost is real: BIP 158 filters use Golomb-Rice coding, which requires a full sequential decode for every query. On a mobile phone this translates to minutes of heavy CPU work just to scan recent blocks — the phone heats up, the battery drains, and the user waits.
What we achieved
We built a reproducible benchmark framework on top of Bitcoin Core and used it to evaluate Binary Fuse filters as an alternative to Golomb-Rice coded sets. Across 50,000 real mainnet blocks and 10 wallet scenarios (from 24 to 480 scripts), benchmarked on both desktop x86-64 and a Raspberry Pi 5 (ARM Cortex-A76), the current results support two concrete conclusions:
- Fuse16 is a strong candidate for a new compact block filter type alongside BIP 158 basic filters. In the 50k-block benchmark, Fuse16 preserves zero false negatives, reduces client-side query CPU by 9x to 81x on desktop and 6x to 46x on ARM across the wallet set, and produces slightly smaller filter data than GCS (955.9 MB vs 972.1 MB). When matched-block downloads are included, total bandwidth stays very close to GCS, rising by only about 0.2% to 2.9% in the current measurements.
- Hierarchical Fuse filters are a promising bandwidth-oriented research direction. We also benchmarked two-stage Fuse designs that try to reduce total bandwidth further without giving up the speed advantage completely. Some variants improve the bandwidth tradeoff, and some achieve zero false positives in the current 50k run. But this work is still experimental: it needs more optimization, clearer protocol design, and dedicated privacy analysis because a two-stage retrieval scheme may leak wallet interest unless the request pattern is carefully hidden.
The speedup comes from a fundamental algorithmic difference: GCS requires O(N) sequential decoding per query, while flat Fuse16 answers each query with exactly 3 random memory lookups — O(1) regardless of filter size. The hierarchical variants build on the same idea, but add an extra protocol layer that is not yet deployment-ready.
Where we can go with more work
These results are a proof of concept, not a finished protocol. The current evidence makes Fuse16 the clearest deployment candidate, while the hierarchical variants remain exploratory. ARM benchmarking on a Raspberry Pi 5 confirms the desktop results: Fuse16 delivers significant speedups even on a low-power ARM processor. The path forward includes: writing a native C++ implementation suitable for Bitcoin Core (replacing the current C library wrapper); specifying a deterministic construction and wire format; designing privacy-preserving retrieval if a hierarchical scheme is pursued; and ultimately drafting a BIP proposal. Fuse filters could make private, self-sovereign wallet sync practical on everyday phones.
Summary of BIP 158
In short, the server sends a compact probabilistic summary for each block, which lets the client cheaply test whether a wallet-relevant script element may be present in that block.
We refer to the resulting per-block structure as the basic block filter; its encoding uses Golomb-Rice-coded sets (GCS).
Under BIP 158, the server builds a compact block filter for each block. From that filter, the client can tell whether the block may contain any script element relevant to the wallet. If the filter signals a match, the block becomes a candidate and the client can inspect it in full.
FULL BLOCKS (Big Data) BIP 158 FILTERS (Small Data)
(Megabytes of TXs) (KiloBytes of GCS)
---------------------------- ----------------------------
| [TX1][TX2][TX3][TX4]... | | |
| [TX5][TX6][TX7][TX8]... | | Lossy, but |
| [TX9]....... (1-4 MB) | -----> | deterministic digest |
| (typically 1.5 Mb) | | (~20 KB) |
---------------------------- ----------------------------
^ ^
| |
"Hard to "Easy to
download" store"
During filter construction, the block's relevant script elements are collected: output scriptPubKeys and the spent prevout scriptPubKeys referenced by inputs. These elements are hashed and then encoded with Golomb-Rice coding (GCS). If a wallet-relevant script element matches the filter, the block must be downloaded for full inspection.
Since the result is a lossy digest:
- if the filter says 'No match', then the queried script element is not in that block.
- if it says 'Match possible', then the block is only a candidate: the element may really be there, or this may be a false positive.
The question is: how could we speed up this process on the client side, preferably without increasing the compressed data size too much?
Common discussions suggest several methods: XOR, Binary Fuse, or Cuckoo filters.
References:
Binary Fuse Filters
Binary fuse filters are a family of probabilistic data structures for approximate set membership, introduced by Thomas Mueller Graf and Daniel Lemire in "Binary Fuse Filters: Fast and Smaller Than Xor Filters" (Journal of Experimental Algorithmics, Vol. 27, 2022; arXiv:2201.01174).
A binary fuse filter answers one question: is a given value a member of a set? If the answer is "yes", the value might be in the set (a false positive is possible). If the answer is "no", the value is definitely not in the set — there are no false negatives. This is the same guarantee as GCS filters used in BIP 158, but binary fuse filters answer the question with only 3 memory lookups, regardless of how large the set is.
A binary fuse filter is built from a fingerprint function and an array of fingerprints. The algorithm is described in detail in the reference paper. Using a simple lookup method that XORs three fingerprints, the algorithm gives O(1) query time with exactly three memory accesses, regardless of set size.
The probability of a false positive depends on the fingerprint width: a k-bit fingerprint gives FPR = 1/2^k. Common configurations are:
| Variant | Fingerprint | FPR |
|---|---|---|
| Fuse8 | 8 bits | 1/256 |
| Fuse12 | 12 bits | 1/4,096 |
| Fuse16 | 16 bits | 1/65,536 |
| Fuse18 | 18 bits | 1/262,144 |
| Fuse20 | 20 bits | 1/1,048,576 |
A bandwidth optimization - Hierarchical Fuse Filters
We propose an original contribution on top of Binary Fuse Filters: a hierarchical two-stage design that aims to reduce bandwidth without giving up the core speed advantage. To our knowledge, this two-stage outer/inner filter architecture has not been explored in the context of Bitcoin light client block filtering before.
A hierarchical filter splits the decision into two stages. The client still downloads one small outer filter for every block and queries it locally. If that outer filter reports a possible match, the client fetches a second, more selective inner filter for that block. Only if the inner filter also reports a match does the client download the full block.
The naming convention follows that two-stage structure: F12+16 means a Fuse12 outer filter plus a Fuse16 inner filter. Likewise, F10+10 uses Fuse10 in both stages.
HIERARCHICAL FILTER FLOW
download outer filter for every block
|
no match -> stop
|
match
v
fetch and query inner filter
|
no match -> stop
|
match
v
download full block
The benchmarked bandwidth numbers in this document assume exactly that on-demand inner-filter retrieval model: outer filters for all blocks, inner filters only for outer matches, and full blocks only for final matches. That is also why hierarchical filters raise a privacy issue. Under BIP 158, the peer sees only that the client downloads one filter per block. In a hierarchical scheme, the peer can also observe which specific blocks triggered second-stage inner-filter requests unless the protocol adds a privacy defense.
Proof of Concept Implementation
We use the xor_singleheader C library as a reference implementation of Binary Fuse and XOR filters: https://github.com/FastFilter/xor_singleheader
It is good for benchmarking, but not yet protocol-ready. Main limitations:
- It uses
malloc/freeinternally. - Error handling is too weak for critical production code.
To integrate it cleanly, we wrapped it in Bitcoin Core-style C++ classes with the same API shape as BlockFilter. See src/fuse16filter.h/.cpp for Fuse16, src/fuse8filter.h/.cpp for Fuse8, and src/xor8filter.h/.cpp for Xor8. We use PIMPL (pointer-to-implementation) so the large C header does not leak into the rest of the codebase. After a native C++ rewrite, neither malloc nor PIMPL should be needed.
Alongside that wrapper layer, we also started an experimental C++ refactor of the original library in src/bench/xor_singleheader/binaryfusefilter.hpp. It uses templates to support variable-width filters (for example Fuse12, Fuse18, and Fuse20), and it moves serialization toward an explicit platform-independent little-endian format. This is promising for overcoming some limitations of the original C code, but it is still research code rather than a production-ready implementation.
Construction must be deterministic. We derive the seed from the block hash. If construction fails (the probabilistic peeling process does not converge), we currently throw an exception. For benchmarking this is acceptable: in practice failure is very rare, and a deterministic retry strategy can handle it. A protocol-grade implementation must define that retry behavior explicitly.
Core question: can Binary Fuse filters (and related XOR filters) replace or complement the current GCS-based compact block filters from BIP 158? We evaluate:
- Deterministic construction. Fuse construction is probabilistic, so it may fail and retry with a different seed. Can we define this so every full node still produces exactly the same filter for the same block?
- Client-side query speed. On the client, are Fuse filters faster than GCS (Golomb-Rice decode plus per-element hashing)?
- Network bandwidth. Is total filter size, and therefore download bandwidth, comparable to BIP 158?
- False positives. BIP 158 has a very low false-positive rate (1/784,931). Each false positive is expensive: the client downloads the full block (~1.5 MB) and checks it manually. How do Fuse filters compare?
Source Files
All research-specific C++ code lives under src/bench/light_client_research/ and src/bench/xor_singleheader/, separate from the Bitcoin Core codebase.
Open source file inventory
Filter implementations (src/bench/light_client_research/)
| File | Description |
|---|---|
fuse8filter.h / .cpp | C++ wrapper for Binary Fuse8 filter (8-bit fingerprint, FPR 1/256 — rejected due to false positives) |
fuse10filter.h / .cpp | Experimental Binary Fuse10 filter wrapper used in hierarchical F10+10 benchmarks |
fuse12filter.h / .cpp | Experimental Binary Fuse12 filter wrapper used for flat and hierarchical benchmarks |
fuse16filter.h / .cpp | C++ wrapper for Binary Fuse16 filter (16-bit fingerprint, FPR 1/65,536) |
fuse18filter.h / .cpp | Experimental Binary Fuse18 filter wrapper used for flat and hierarchical benchmarks |
fuse20filter.h / .cpp | Experimental Binary Fuse20 filter wrapper used for flat and hierarchical benchmarks |
fuse32filter.h / .cpp | Experimental Binary Fuse32 filter wrapper used as a near-zero-FP validation reference |
xor8filter.h / .cpp | C++ wrapper for Xor8 filter (8-bit, FPR 1/256 — rejected) |
Data loading and streaming (src/bench/light_client_research/)
| File | Description |
|---|---|
full_dataset.h / .cpp | JSON dataset parser: FullDataset (string-based) and PreparedDataset (binary, benchmark-ready) |
tx_block_chunk_store.h / .cpp | Loads .bin chunk file metadata (paths, block ranges) from a directory |
tx_block_stream_reader.h / .cpp | Streams blocks one at a time from .bin chunks, keeping memory usage low |
filter_bench.h / .cpp | Shared helpers: wallet scenario loading, filter construction, dataset I/O |
Benchmarks (src/bench/light_client_research/)
| File | Description |
|---|---|
research_bip158.cpp | Main benchmark driver: flat Fuse benchmarks, hierarchical Fuse benchmarks, and streaming ground-truth validation |
filter_demo.cpp | Minimal demo: builds a Fuse16 filter for a single block and queries it |
Vendored Library and Experimental C++ Impl (src/bench/xor_singleheader/)
| File | Description |
|---|---|
binaryfusefilter.h | Binary Fuse filter C implementation (from xor_singleheader, commit c482686) |
binaryfusefilter.hpp | Experimental C++ refactor of Binary Fuse with template-based variable-width filters and explicit little-endian serialization |
xorfilter.h | Xor filter C implementation (same upstream library) |
README.md | Local notes about the vendored xor_singleheader code and the in-repo adaptations |
Data Extraction for Profiling and Validation
For benchmarking on real mainnet data, we export block and transaction data from a fully synced Bitcoin Core node into compact binary (.bin) blobs. Each blob stores 250 blocks, including all output scriptPubKeys and spent prevout scripts, which are exactly the elements used by BIP 158 filter construction.
This extraction step is needed for three reasons:
- We must build each filter from scratch (not from a pre-built index), so construction time can be measured fairly across filter types.
- The binary format lets us stream blocks one by one, without keeping the full dataset in memory. This is important for large runs (50k+ blocks).
- Ground-truth validation needs the original scripts, to prove we never miss a real match (zero false negatives).
For exact, step-by-step dataset reproduction from a Bitcoin Core node, see REPRODUCE.md.
Simulating Wallet Behaviour
To measure real-world filter performance, we simulate how a light client wallet queries block filters. A wallet holds a set of scriptPubKeys and scans every block filter to find blocks that may contain relevant transactions.
The number of scripts, and the script-type mix, directly affects query time and false-positive rate.
We generate 10 wallet use cases, from a simple mobile wallet (24 scripts) up to a large exchange hot wallet (480 scripts). Each use case has a realistic address-type mix (P2WPKH, P2TR, P2WSH, P2SH, P2PKH) and includes ground-truth data: which blocks in the 50k-block dataset actually contain matching transactions. We use this ground truth to verify zero false negatives and to count false positives exactly.
| Wallet Use Case | Scripts | Address Type Mix | True Hits (50k blocks) |
|---|---|---|---|
| Simple User | 24 | 17 p2wpkh, 5 p2tr, 2 p2sh | 1,034 |
| Cold Storage Vault | 60 | 42 p2wsh, 12 p2tr, 6 p2sh | 761 |
| Legacy Migrator | 90 | 45 p2pkh, 27 p2sh, 18 p2wpkh | 18,179 |
| Taproot Power User | 96 | 82 p2tr, 14 p2wpkh | 7,124 |
| CoinJoin Privacy User | 120 | 72 p2wpkh, 36 p2tr, 12 p2sh | 8,131 |
| Merchant Batch Settler | 140 | 63 p2wpkh, 42 p2tr, 28 p2sh, 7 p2pkh | 21,795 |
| Lightning Operator | 180 | 90 p2wsh, 45 p2tr, 45 p2wpkh | 4,044 |
| Custody Wallet | 220 | 99 p2wsh, 44 p2tr, 44 p2wpkh, 33 p2sh | 11,216 |
| Watch-Only Auditor | 320 | 128 p2wpkh, 80 p2tr, 64 p2wsh, 32 p2sh, 16 p2pkh | 23,267 |
| Exchange Hot Wallet | 480 | 192 p2wpkh, 144 p2tr, 96 p2sh, 48 p2wsh | 27,530 |
The True Hits column shows how many of the 50,000 blocks contain at least one matching transaction for that wallet. A simple user wallet matches about 2% of blocks, while an exchange hot wallet matches about 55%. For most wallets, non-matching blocks are still the majority. This is exactly where filter query speed matters most, because the client checks every block filter, match or not.
The wallet scenario files are generated deterministically from block data using benchmark_tools/generate_wallet_use_cases.py. See REPRODUCE.md for details.
Disk space required for the 50k-block dataset used in our benchmarks:
| Dataset | Size | Files |
|---|---|---|
| Binary block chunks | ~23 GB | 200 |
| Wallet use-case scenarios | ~53 MB | 10 |
Ground-Truth Validation
Any new filter type must satisfy one hard requirement: zero false negatives. If a block contains a transaction matching the wallet, the filter must report a match. A single false negative means the wallet misses a payment — this is unacceptable. False positives (the filter says "maybe" but the block has no matching transaction) are tolerated; they only cost an extra block download.
We validate correctness and count false positives in two layers:
Layer 1: Computing ground truth (Python, offline). During wallet scenario generation, the script generate_wallet_use_cases.py scans every block in the dataset and records which blocks contain at least one scriptPubKey that belongs to the wallet. This is an exact, brute-force check against the raw transaction data — no filter is involved. The result is a list of block indices stored in the wallet scenario JSON under ground_truth.matched_blocks.
Layer 2: Streaming validation (C++, dedicated benchmark). A dedicated benchmark called StreamingGroundTruthValidation reads blocks one at a time from the binary chunks, constructs a filter for each block using all four filter types (GCS, Fuse16, Fuse8, Xor8), and queries each filter with the wallet's scripts. For each filter type it checks:
- False negatives: For every block index in the ground-truth list, did the filter report a match? If any ground-truth block is missed, the validation throws an exception and stops.
- False positives: The count is computed as:
false_positives = candidate_matches - true_hits_in_range
where candidate_matches is the number of blocks the filter flagged, and true_hits_in_range is the number of ground-truth blocks within the scanned range.
The streaming approach processes one block at a time and discards its elements immediately, so memory usage stays low even for 50k+ blocks.
How to run
A convenience script runs all the wallet use cases against all 4 filter types (GCS, Fuse16, Fuse8, Xor8) and produces a summary table:
./light_client_research/benchmark_tools/false_positive_check.sh 50000
The script iterates over every wallet scenario file, runs the StreamingGroundTruthValidation benchmark for each, parses the output, and checks that false_negatives=0 for every filter. If any false negative is detected, the script prints an error banner and exits with a non-zero code. Results are saved to light_client_research/results/false_positive_check_50000.txt.
You can also run individual wallets manually:
# Single wallet:
BIN_SCAN_MAX_BLOCKS=50000 ./build-release/bin/bench_bitcoin \
-filter='StreamingGroundTruthValidation' -min-time=100
# Specific wallet:
BIN_SCAN_MAX_BLOCKS=50000 \
BIN_WALLET_SCENARIO=light_client_research/mainnet_datasets/wallet_use_cases/wallet_use_case_exchange_hot_wallet.json \
./build-release/bin/bench_bitcoin \
-filter='StreamingGroundTruthValidation' -min-time=100
Expected output per filter type:
FUSE16_GROUND_TRUTH scenario=... scanned_blocks=50000 skipped_small=55
construction_failures=0 true_hits_in_range=... candidate_matches=...
false_negatives=0 false_positives=...
The false_negatives=0 field must be zero for every filter type. Any non-zero value indicates a correctness bug.
Benchmarking Methodology and Results
What we measure
The benchmarks simulate what a light client does when it receives block filters over the network: deserialize the filter from bytes, then query it with the wallet's scripts. This is the client-side hot path that runs for every block in the chain.
For each filter type, the benchmark:
- Setup (not timed): Reads blocks from the binary dataset, constructs filters, and serializes them to byte arrays. This simulates the full node building filters; the client never does this.
- Timed loop (nanobench): For each pre-built filter, deserializes from bytes and calls
MatchAny(wallet_scripts). This is repeated many times by nanobench to get stable timing.
For GCS (BIP 158), deserialization reconstructs the Golomb-Rice encoded bitstream and the query performs a full O(N) sequential decode. For Fuse16, deserialization reconstructs the fingerprint array and the query performs 3 array lookups per wallet script — O(1) per element regardless of filter size.
The benchmarks also report:
- Total filter size in bytes (sum of all serialized filters)
- Average filter size per block
- Match count (number of blocks where the filter reports a hit)
- Block download volume for all matched blocks (true positives + false positives)
- Total bandwidth = filter data + downloaded blocks for all matches
Bandwidth is counted from the client point of view. If a filter reports a match, the client must download the full block before it can tell whether the match was real or just a false positive.
TOTAL BANDWIDTH
= filter_data
+ downloaded_blocks_for_all_matches
= filter_data
+ block_downloads(true_positives)
+ block_downloads(false_positives)
match reported by filter
|
v
download full block
|
v
later classify as TP or FP
So the download cost is paid for every reported match, not only for true hits.
For hierarchical schemes, "filter_data" includes both the outer filters and any
inner filters fetched on demand.
How to run
# Build (release mode, required for meaningful timings):
cmake -B build-release -DCMAKE_BUILD_TYPE=Release -DBUILD_BENCH=ON
cmake --build build-release --target bench_bitcoin -j $(nproc)
# GCS baseline — 50k blocks, simple_user wallet:
BIN_SCAN_MAX_BLOCKS=50000 ./build-release/bin/bench_bitcoin \
-filter='ResearchBasicClientSideQuery' -min-time=1000
# Fuse16 — 50k blocks, simple_user wallet:
BIN_SCAN_MAX_BLOCKS=50000 ./build-release/bin/bench_bitcoin \
-filter='ResearchFuse16ClientSideQuery' -min-time=1000
# Specific wallet:
BIN_SCAN_MAX_BLOCKS=50000 \
BIN_WALLET_SCENARIO=light_client_research/mainnet_datasets/wallet_use_cases/wallet_use_case_exchange_hot_wallet.json \
./build-release/bin/bench_bitcoin \
-filter='ResearchFuse16ClientSideQuery' -min-time=1000
# All wallets, flat and hierarchical filter matrix:
./light_client_research/benchmark_tools/wallet_benchmark.sh 50000
The wallet_benchmark.sh script runs the current all-wallet comparison (GCS, flat Fuse variants, and hierarchical variants), computes actual false positives from ground-truth data, and saves summary tables to light_client_research/results/wallet_benchmark_results_<blocks>.txt. The checked-in 50k result file is here: wallet_benchmark_results_50000.txt.
To run both ground-truth validation and benchmarks in one go:
./light_client_research/benchmark_tools/bench_all.sh 50000
Profiling Results: Fuse8 and Xor8 (20k blocks)
This early 20k-block run compared all four filter types to decide which ones to carry forward. The results led us to drop Fuse8 and Xor8; the 50k benchmarks below focus on the viable candidates. Benchmarked on a Lenovo ThinkPad P51 (Intel Core i7-7820HQ @ 2.90 GHz, 4 cores / 8 threads).
| Filter | Time ms | Size MB | FP | Speedup | Viable? |
|---|---|---|---|---|---|
| GCS | 4,716 | 404 | 1 | 1x | baseline |
| Fuse16 | 61 | 397 | 4 | 77x | YES |
| Fuse8 | 48 | 199 | 1,733 | 99x | NO |
| Xor8 | 47 | 190 | 1,841 | 100x | NO |
Taking into account the FP rate and the speedup, Fuse16 seems to be a viable alternative to GCS.
8-bit filters are ~50% smaller and ~100x faster, but their FPR (1/256) causes thousands of unnecessary full block downloads (~2.6 GB for Fuse8 at 20k blocks). The bandwidth cost far exceeds the size savings.
Profiling Results: Flat Fuse Variants vs GCS (50k mainnet blocks)
The flat-filter comparison covers GCS, Fuse16, Fuse18, and Fuse20 across both platforms. Desktop results use wallet_benchmark_results_50000.txt; Raspberry Pi 5 results use arm_fixed_wallet_benchmark_results_50000.txt. Filter sizes and match counts are identical across platforms; only timing differs.
CPU speed — Raspberry Pi 5 (ARM Cortex-A76, deserialize + MatchAny, lower is better):
| Wallet | Scripts | GCS ms | F16 ms | F16 speedup | F18 ms | F20 ms |
|---|---|---|---|---|---|---|
| Simple User | 24 | 14,547 | 317 | 46x | 3,471 | 3,527 |
| Cold Storage Vault | 60 | 15,821 | 750 | 21x | 3,682 | 3,972 |
| Legacy Migrator | 90 | 12,236 | 672 | 18x | 3,620 | 3,779 |
| Taproot Power User | 96 | 14,827 | 1,008 | 15x | 3,957 | 4,095 |
| CoinJoin Privacy User | 120 | 14,680 | 1,054 | 14x | 3,998 | 4,159 |
| Merchant Batch Settler | 140 | 12,071 | 1,207 | 10x | 4,382 | 4,280 |
| Lightning Operator | 180 | 16,867 | 1,817 | 9x | 4,954 | 5,094 |
| Custody Wallet | 220 | 16,103 | 2,010 | 8x | 5,226 | 5,344 |
| Watch-Only Auditor | 320 | 13,552 | 2,396 | 6x | 5,365 | 5,487 |
| Exchange Hot Wallet | 480 | 14,605 | 2,554 | 6x | 5,573 | 5,714 |
CPU speed — desktop x86-64 data table
CPU speed (deserialize + MatchAny, lower is better):
| Wallet | Scripts | GCS ms | F16 ms | F16 speedup | F18 ms | F20 ms |
|---|---|---|---|---|---|---|
| Simple User | 24 | 10,081 | 125 | 81x | 1,392 | 1,458 |
| Cold Storage Vault | 60 | 10,535 | 237 | 44x | 1,510 | 1,573 |
| Legacy Migrator | 90 | 8,249 | 257 | 32x | 1,534 | 1,598 |
| Taproot Power User | 96 | 9,970 | 311 | 32x | 1,580 | 1,645 |
| CoinJoin Privacy User | 120 | 9,837 | 348 | 28x | 1,604 | 1,676 |
| Merchant Batch Settler | 140 | 7,923 | 396 | 20x | 1,769 | 1,733 |
| Lightning Operator | 180 | 10,753 | 550 | 20x | 1,820 | 1,897 |
| Custody Wallet | 220 | 9,899 | 630 | 16x | 1,895 | 1,987 |
| Watch-Only Auditor | 320 | 8,525 | 654 | 13x | 2,001 | 2,005 |
| Exchange Hot Wallet | 480 | 8,716 | 1,004 | 9x | 2,293 | 2,364 |
Total bandwidth — flat filter data table
Total bandwidth (filter data + downloaded blocks for all reported matches, true positives and false positives; lower is better):
| Wallet | Scripts | GCS MB | F16 MB | F18 MB | F20 MB |
|---|---|---|---|---|---|
| Simple User | 24 | 2544.257 | 2555.803 | 2656.590 | 2764.290 |
| Cold Storage Vault | 60 | 2083.057 | 2143.943 | 2212.660 | 2313.220 |
| Legacy Migrator | 90 | 27945.347 | 28003.093 | 28065.810 | 28169.130 |
| Taproot Power User | 96 | 11561.647 | 11643.493 | 11686.710 | 11787.130 |
| CoinJoin Privacy User | 120 | 13173.247 | 13269.393 | 13300.110 | 13403.630 |
| Merchant Batch Settler | 140 | 33695.347 | 33770.393 | 33819.910 | 33917.830 |
| Lightning Operator | 180 | 7003.037 | 7189.113 | 7153.750 | 7220.370 |
| Custody Wallet | 220 | 17756.747 | 17938.493 | 17902.410 | 17978.830 |
| Watch-Only Auditor | 320 | 35732.847 | 35889.293 | 35867.910 | 35940.630 |
| Exchange Hot Wallet | 480 | 42272.847 | 42445.393 | 42410.710 | 42506.730 |
Key observations:
- Fuse16 remains the CPU winner in every wallet on both platforms. On desktop x86-64 it ranges from 9x to 81x faster than GCS. On the Raspberry Pi 5 (ARM Cortex-A76) it ranges from 6x to 46x faster.
- F18 and F20 reduce false-positive-driven block downloads relative to Fuse16, but their larger filters more than offset that gain. In this flat-filter comparison, both variants are slower than Fuse16 and still worse than GCS on total bandwidth.
- GCS keeps the lowest total bandwidth among these flat filters, but it pays for that result with much higher client-side CPU cost.
Profiling Results: Experimental Hierarchical Fuse Variants vs GCS (50k mainnet blocks)
The experimental hierarchical comparison covers F16+20, F10+10, F12+12, F12+16, F12+18, and three Fuse+GCS hybrids: F8+GCS, F12+GCS, and F16+GCS. Desktop results use wallet_benchmark_results_50000.txt; Raspberry Pi 5 results use arm_fixed_wallet_benchmark_results_50000.txt. In the Fuse+GCS hybrids the outer layer is a fast Fuse filter and the inner layer is the existing BIP 158 GCS filter, which full nodes already compute and serve. These schemes trade protocol simplicity for a different speed/bandwidth frontier.
CPU speed — Raspberry Pi 5 (ARM Cortex-A76, hierarchical; lower is better):
| Wallet | Scripts | GCS ms | F16+20 ms | F10+10 ms | F12+12 ms | F12+16 ms | F12+18 ms | F8+GCS ms | F12+GCS ms | F16+GCS ms |
|---|---|---|---|---|---|---|---|---|---|---|
| Simple User | 24 | 14,547 | 439 | 2,601 | 807 | 797 | 905 | 1,176 | 1,053 | 570 |
| Cold Storage Vault | 60 | 15,821 | 804 | 3,040 | 1,210 | 1,204 | 1,300 | 2,597 | 1,421 | 894 |
| Legacy Migrator | 90 | 12,236 | 2,297 | 4,089 | 1,575 | 1,375 | 2,703 | 6,010 | 4,772 | 4,201 |
| Taproot Power User | 96 | 14,827 | 1,615 | 3,744 | 1,639 | 1,555 | 2,075 | 4,920 | 2,926 | 2,336 |
| CoinJoin Privacy User | 120 | 14,680 | 1,811 | 3,924 | 1,734 | 1,682 | 2,288 | 5,868 | 3,317 | 2,681 |
| Merchant Batch Settler | 140 | 12,071 | 3,114 | 4,825 | 2,103 | 1,877 | 3,515 | 8,934 | 6,384 | 5,538 |
| Lightning Operator | 180 | 16,867 | 2,365 | 4,724 | 2,520 | 2,525 | 2,873 | 7,746 | 3,553 | 2,835 |
| Custody Wallet | 220 | 16,103 | 3,343 | 5,413 | 2,914 | 2,693 | 3,629 | 9,754 | 5,112 | 4,348 |
| Watch-Only Auditor | 320 | 13,552 | 4,877 | 6,665 | 3,875 | 3,621 | 5,311 | 14,236 | 8,237 | 7,314 |
| Exchange Hot Wallet | 480 | 14,605 | 6,350 | 8,226 | 5,209 | 4,912 | 6,900 | 16,648 | 9,836 | 8,934 |
CPU speed — desktop x86-64 hierarchical data table
CPU speed (deserialize + MatchAny, including inner-filter verification when triggered; lower is better):
| Wallet | Scripts | GCS ms | F16+20 ms | F10+10 ms | F12+12 ms | F12+16 ms | F12+18 ms | F8+GCS ms | F12+GCS ms | F16+GCS ms |
|---|---|---|---|---|---|---|---|---|---|---|
| Simple User | 24 | 10,081 | 169 | 1,270 | 448 | 440 | 475 | 720 | 613 | 278 |
| Cold Storage Vault | 60 | 10,535 | 277 | 1,421 | 552 | 547 | 591 | 1,588 | 734 | 349 |
| Legacy Migrator | 90 | 8,249 | 935 | 1,941 | 774 | 642 | 1,217 | 4,128 | 3,054 | 2,676 |
| Taproot Power User | 96 | 9,970 | 582 | 1,701 | 715 | 666 | 896 | 2,929 | 1,668 | 1,239 |
| CoinJoin Privacy User | 120 | 9,837 | 667 | 1,782 | 765 | 703 | 976 | 3,651 | 1,899 | 1,458 |
| Merchant Batch Settler | 140 | 7,923 | 1,225 | 2,232 | 960 | 802 | 1,508 | 5,837 | 3,861 | 3,397 |
| Lightning Operator | 180 | 10,753 | 743 | 1,956 | 946 | 922 | 1,080 | 4,703 | 1,709 | 1,157 |
| Custody Wallet | 220 | 9,899 | 1,107 | 2,301 | 1,147 | 1,165 | 1,468 | 6,521 | 2,856 | 2,282 |
| Watch-Only Auditor | 320 | 8,525 | 1,829 | 2,856 | 1,534 | 1,345 | 2,110 | 7,872 | 4,622 | 4,058 |
| Exchange Hot Wallet | 480 | 8,716 | 2,279 | 3,372 | 1,956 | 1,789 | 2,610 | 11,538 | 5,457 | 4,713 |
Total bandwidth — hierarchical filter data table
Total bandwidth (outer filters + inner filters fetched on demand + downloaded blocks for all reported matches; lower is better):
| Wallet | Scripts | GCS MB | F16+20 MB | F10+10 MB | F12+12 MB | F12+16 MB | F12+18 MB | F8+GCS MB | F12+GCS MB | F16+GCS MB |
|---|---|---|---|---|---|---|---|---|---|---|
| Simple User | 24 | 2544.257 | 2558.250 | 2199.385 | 2312.707 | 2318.916 | 2323.001 | 2157.537 | 2319.668 | 2552.206 |
| Cold Storage Vault | 60 | 2083.057 | 2090.117 | 1756.452 | 1851.336 | 1859.193 | 1863.122 | 1808.441 | 1859.864 | 2085.814 |
| Legacy Migrator | 90 | 27945.347 | 28458.200 | 27864.821 | 28015.700 | 28124.010 | 28178.170 | 28042.290 | 28133.630 | 28361.920 |
| Taproot Power User | 96 | 11561.647 | 11737.560 | 11330.304 | 11433.451 | 11477.245 | 11499.142 | 11475.208 | 11480.772 | 11701.420 |
| CoinJoin Privacy User | 120 | 13173.247 | 13393.060 | 12976.099 | 13075.772 | 13127.361 | 13154.006 | 13159.995 | 13131.971 | 13349.320 |
| Merchant Batch Settler | 140 | 33695.347 | 34328.770 | 33681.057 | 33838.440 | 33972.990 | 34040.270 | 33915.180 | 33985.320 | 34209.880 |
| Lightning Operator | 180 | 7003.037 | 7097.310 | 6771.512 | 6835.821 | 6868.300 | 6885.350 | 7034.930 | 6871.252 | 7074.330 |
| Custody Wallet | 220 | 17756.747 | 18074.820 | 17630.290 | 17723.044 | 17800.190 | 17838.760 | 17935.310 | 17807.130 | 18012.350 |
| Watch-Only Auditor | 320 | 35732.847 | 36385.850 | 35751.900 | 35893.080 | 36037.230 | 36110.050 | 36072.920 | 36050.120 | 36259.980 |
| Exchange Hot Wallet | 480 | 42272.847 | 42978.800 | 42360.980 | 42475.390 | 42631.170 | 42709.060 | 42683.870 | 42643.230 | 42843.960 |
Key observations:
- The hierarchical variants remain much faster than GCS on both platforms, but none of them beats flat Fuse16 on CPU time. On desktop,
F16+20is the fastest hierarchical option for small wallets (6 of 10), while on ARMF12+16wins in the majority of wallets (7 of 10) because its smaller outer filter stresses the ARM cache less.F10+10is typically the slowest on both platforms. F10+10is the strongest bandwidth result in the current experimental set. It beats GCS on total bandwidth in 8 of 10 wallets, but it does so with only modest speedups and a non-zero false-positive rate.F12+16is the cleanest zero-false-positive hierarchical result. In the current 50k run it keeps zero false positives across all wallets, remains materially faster than GCS, and lands near the middle of the experimental bandwidth tradeoff curve.F16+20proves that a zero-false-positive two-stage design is possible, but not yet economical. In the current measurements it is worse than GCS on total bandwidth in all 10 wallets because the extra inner-filter traffic outweighs the saved block downloads.- The Fuse+GCS hybrids (
F8+GCS,F12+GCS,F16+GCS) achieve zero false positives across all wallets by using the existing BIP 158 GCS filter as the inner verification layer. Their main advantage is deployment simplicity: the inner GCS filter is already computed and served by full nodes, so only the new Fuse outer filter needs to be added.F8+GCSoffers the best bandwidth for low-match wallets (15% less total bandwidth than GCS for Simple User) but degrades to GCS-level CPU at high script counts because its 1/256 FPR triggers too many expensive O(N) GCS verifications.F16+GCSis 30–36x faster than GCS for small wallets while keeping bandwidth near-neutral.F12+GCSis the most balanced: 5–16x faster with 0–12% bandwidth savings for wallets with less than 20% match rates.
Limitations and future profiling improvements
The benchmarks now cover both desktop x86-64 and ARM (Raspberry Pi 5). The ARM results confirm the algorithmic advantage holds on low-power hardware, though with reduced speedup ratios. Important remaining gaps:
- Mobile SoC profiling. The Raspberry Pi 5 (Cortex-A76) provides our first ARM data point, confirming 6x-46x Fuse16 speedup over GCS. Mobile phones (Snapdragon, Apple A-series) have different cache hierarchies and thermal constraints. Profiling on actual phone hardware would complete the picture.
- Measuring actual network cost beyond byte totals. The current bandwidth tables already count actual serialized filter bytes, actual on-demand inner-filter bytes, and actual full-block bytes for every reported match in the 50,000-block dataset. What they still do not model is transport behavior: request batching, round-trip latency, peer behavior, retries, parallel downloads, and the CPU cost of parsing blocks after receipt. For hierarchical schemes there is an additional unresolved question: whether inner-filter retrieval can preserve wallet privacy without giving up the measured bandwidth gains.
- Skipped blocks with fewer than 2 elements. Both the Fuse filter wrappers and the GCS benchmark path require at least 2 unique script elements to construct a filter. Blocks with 0 or 1 elements (typically empty blocks or blocks with only a coinbase) are skipped during benchmarking. In the 50,000-block dataset, 55 blocks fall into this category. Skipping them has almost no impact on match counts, false positive rates or profiling results. A production implementation would need to handle these blocks (e.g., by padding or using a trivial filter).
- End-to-end wall-clock measurement. The current benchmarks measure filter query time in isolation. A production-relevant benchmark would measure the full wallet sync flow: receive filter bytes over a (simulated) network, deserialize, query, and on match download and process the block. This would capture the true user-facing latency including I/O and memory allocation overhead.
Path to Protocol-Ready
The proof-of-concept results are promising (6–81x speedup across desktop and ARM, comparable filter size, acceptable FP overhead). Below is what remains to go from research prototype to a deployable protocol change.
1. Evaluate Cuckoo filters
Cuckoo filters are another commonly suggested alternative to GCS. They offer O(1) lookup (similar to Fuse), support deletion (which Fuse does not), and have well-studied implementations. Before committing to Fuse as the proposal, we should build a proof-of-concept Cuckoo filter implementation using the same benchmark framework and compare. Our guess is that Fuse will be faster, but the evaluation must happen.
2. Research on hierarchical filters
The hierarchical filter approach raises a serious protocol question: it breaks the main privacy property that makes BIP 158 attractive. Under BIP 158, the client downloads one compact filter per block and performs all wallet queries locally, so the peer cannot tell which blocks are interesting to the wallet. In a hierarchical design, the client first checks an outer filter and then requests inner filters only for the candidate blocks. That second-stage request pattern can leak wallet interest directly.
The research question is whether there is a way to preserve most of the bandwidth savings without giving up that privacy model. Possible directions include:
- Noise injection. Download random extra inner filters so the true candidate set is hidden inside a larger request set.
- Temporal defenses. Decouple inner-filter requests from outer-filter matches by batching, delaying, or shuffling them over time so the correlation is weaker.
- Private Information Retrieval. Investigate PIR-style techniques that let the client fetch inner-filter data without revealing which blocks it is actually interested in.
Any such workaround must be evaluated as a full protocol tradeoff: privacy leakage, extra bandwidth from cover traffic, latency, implementation complexity, and whether the hierarchical scheme still beats a flat filter once those protections are added.
3. Native C++ implementation of a Binary Fuse Filter
The original implementation wrapped the xor_singleheader C library via PIMPL. During the research we partially addressed the most critical issues:
- C++ template implementation: We wrote
binaryfusefilter.hpp, a template-based C++ reimplementation that supports variable-width filters (Fuse10, Fuse12, Fuse18, Fuse20, Fuse32) and moves serialization toward an explicit little-endian format. The Fuse8 and Fuse16 wrappers still use the original C library. - Exception-safe construction: All filter wrappers now throw on failure and clean up via RAII destructors, replacing the C library's silent error-return pattern.
- Ground-truth validated: All filter types pass streaming ground-truth validation across 50,000 mainnet blocks and 10 wallet scenarios with zero false negatives — we are confident in correctness.
Remaining issues that a production implementation must address:
- Memory management: The Fuse8/Fuse16 wrappers and the C library still use
malloc/freeinternally. Bitcoin Core expects RAII-based allocation with no raw malloc in consensus-adjacent code. - PIMPL overhead: All wrappers still use
unique_ptr<Impl>to hide the C/C++ headers. A production implementation should eliminate this indirection. - Endianness: The C++-template filters use explicit little-endian serialization, but the C-library wrappers (Fuse8, Fuse16) still use native endianness. A wire protocol must use a consistent fixed byte order.
- No formal test suite: Correctness is validated through benchmark ground-truth checks, not through unit tests or fuzz tests. A production codebase needs both.
The algorithm itself is straightforward (the paper is clear and the C code is ~400 lines). A clean, unified C++ rewrite following Bitcoin Core coding conventions — replacing both the C library and the current template prototype — would be moderate effort.
4. Deterministic construction specification
Full nodes must produce identical filters for the same block. The current approach derives a deterministic seed from the block hash via SipHash, but throws an exception if the probabilistic peeling process does not converge.
A production specification must define:
- Exact seed derivation formula (e.g.,
SipHash-2-4(block_hash, domain_tag)). - Deterministic retry: if construction fails with seed S, the next seed is
S' = f(S)for a specified function f. The spec must guarantee that all implementations arrive at the same seed and the same filter bytes. - Maximum retry count and behavior on exhaustion (in practice, failure is extraordinarily rare — zero failures observed in 50k mainnet blocks).
5. Wire format specification
Define the exact byte layout for Fuse filters on the network:
- Fixed-size header: seed (8 bytes), element count (4 bytes), segment parameters (12 bytes), array length (4 bytes) = 28 bytes.
- Fingerprint array:
array_length × 2bytes, little-endian uint16 values. - Total: 28 + 2N bytes where N is the array length (~1.125 × element count).
The format must be specified precisely enough that any implementation can deserialize and query filters from any other implementation, bit-for-bit.
6. Handling edge cases
- Blocks with 0–1 unique scripts: Fuse filters require at least 2 elements. The spec must define how to handle empty blocks and single-element blocks (e.g., encode as a trivial filter with a special marker, or fall back to a direct element list).
- Coinbase-only blocks: Some blocks contain only a coinbase transaction with a single output. The filter for such blocks needs defined behavior.
7. Mobile SoC profiling
The Raspberry Pi 5 results confirm the ARM speedup (6x-46x). Profiling on actual phone SoCs (Snapdragon, Apple A-series) under thermal and battery constraints would complete the picture, but the algorithmic advantage is established.
8. End-to-end integration test
Build a prototype light client that performs the full sync flow:
- Download filter headers and verify the header chain
- Download filters from peers
- Query filters against the wallet
- On match, download the full block and scan transactions
- Measure total sync time, bandwidth, and FP handling cost
This validates the real-world performance including network latency, I/O, and the actual cost of false positives beyond the current byte-counting benchmark tables.
9. Security audit
Before any protocol deployment, the following need independent review:
- Filter construction: Can an attacker craft block content that causes filter construction to fail or produce a malicious filter?
- Deserialization: Buffer validation, integer overflow checks, and rejection of malformed filters. The current C library does minimal validation.
- Fingerprint collisions: Verify that the SipHash-based fingerprinting provides sufficient collision resistance against adversarial inputs.
- DoS resistance: Ensure that pathological filters cannot cause excessive CPU or memory usage on the client.
10. Server-side benchmarking
The current benchmarks focus on the client side (filter query). Full nodes must also construct and serve filters. Before proposing a protocol change, we need to measure the server-side cost:
- Construction time: How long does it take to build a Fuse filter per block vs GCS? This runs during block validation on full nodes.
- Index build time: Time to build the full filter index for the entire chain (~895k blocks). GCS index currently takes several hours.
- Storage: Disk space for the complete filter index.
- Serving throughput: How many filter requests per second can a node handle? Fuse filters are slightly smaller than GCS, which may improve I/O.
11. BIP proposal
Once the above items are resolved, a formal BIP proposal would specify:
- Filter type identifier (new type alongside
basic = 0x00) - Construction algorithm with deterministic seed derivation
- Wire format and serialization rules
- P2P protocol messages (extending BIP 157's
getcfilters/cfilters) - Filter header chain construction (extending BIP 157's
getcfheaders) - Test vectors: known block → expected filter bytes
12. Backward compatibility and deployment
- Fuse filters would be a new filter type, not a replacement for existing BIP 158 basic filters. Nodes can serve both types simultaneously.
- Light clients negotiate the filter type during the P2P handshake.
- Deployment can be gradual: nodes opt in to building and serving Fuse filters, clients opt in to requesting them.
Conclusion
This research set out to answer a simple question: can we make private Bitcoin wallet sync fast enough for mobile phones? The answer, based on 50,000 mainnet blocks, 10 wallet scenarios, and benchmarks on both desktop x86-64 and ARM hardware, is a clear yes — with the right data structure.
The strongest current result is that flat Binary Fuse16 filters are a strong candidate for a new compact block filter type alongside BIP 158 basic filters. In the 50,000-block benchmark they deliver 9x to 81x faster queries on desktop and 6x to 46x on ARM, slightly smaller filter data, and zero false negatives, while keeping total bandwidth very close to GCS.
The core insight is algorithmic: GCS forces O(N) sequential decoding for every query, while Fuse16 answers with exactly 3 memory lookups — O(1) regardless of how many elements the filter contains. This is not an optimization of the existing approach; it is a fundamentally different tradeoff. The ARM benchmarks confirm that the advantage holds on low-power hardware, though the speedup ratio narrows because Fuse's random access pattern is more sensitive to cache latency than GCS's sequential decode.
A second result is that hierarchical Fuse filters may open a better bandwidth frontier, but that line of work is still experimental. It needs more optimization and explicit privacy-preserving protocol design, because on-demand inner-filter retrieval could leak wallet interest if handled naively. Beyond that, the path to deployment requires a native C++ implementation, mobile SoC profiling, deterministic construction and wire-format specs, and a BIP proposal. The filter would deploy as an optional new type — no consensus change, no breaking change, gradual adoption.
"There ain't no such thing as a free lunch"
Research rarely follows a straight line. Here is what actually happened — the dead ends, the crashes, and the bugs that shaped the final result.
The hierarchical GCS era
The project did not start with Fuse filters at all. The first idea was to keep GCS but make it faster with a multi-level design: a coarse "window filter" covering N blocks, checked first, with per-block filters only consulted on a match. On testnet this looked decent — a 5.4x geometric-mean speedup at 50k blocks. But it was still O(N) Golomb decode per query, the outer layer leaked false positives no matter how we tuned the parameters, and positive-heavy workloads barely improved. Two git commits are titled saving work-deadend. The entire hierarchical GCS codebase was eventually deleted.
The mainnet data saga
Getting 50,000 real mainnet blocks turned out to be a project in itself. Attempt one: run a full mainnet node with txindex=1 and blockfilterindex=1. It synced for days, then crashed with a LevelDB fatal error: "No space left on device." Turns out you need about 1.5 TB free. Attempt two: fall back to a public remote RPC endpoint. That failed instantly — the public node does not expose getblockfilter. Attempt three: retry in tx-only mode. That ran for hours, then died to a transient DNS resolution error. And because the extractor wrote output only on completion, no partial data was saved. The fix was a complete redesign: binary chunk streaming (250 blocks per .bin file), checkpoint writes, and retry logic.
The 11 GB JSON problem
Early benchmarks loaded block data from pretty-printed JSON. This was fine for testnet, but a 10k-block mainnet extraction produced an 11 GB JSON file. The C++ harness could stomach 500–1k blocks, but iterative benchmarking at 10k+ was impractical. This forced the move to the binary streaming architecture that the final benchmarks use — and in hindsight it was a blessing, because it also solved the memory problem for 50k-block runs.
Fuse8 and Xor8: blazing fast, completely useless
Both 8-bit filter variants were about 100x faster than GCS and 50% smaller. On paper, a dream. In practice, their 1/256 false-positive rate produced ~1,700 false positives over 20k blocks — roughly 2.6 GB of pointless full-block downloads. The bandwidth cost dwarfed the size savings. They were rejected in a single benchmark run.
The benchmark that lied
A subtle bug in the benchmark harness made hierarchical filters look better than they were. Expected-negative queries used early-exit timing, but a probabilistic filter can produce a false positive on the very first check, causing the timer to stop early and report an implausibly fast result. The bug only became visible at 50k blocks, where the statistical effect was large enough to notice. After switching negative queries to full-scan timing, the inflated speedup numbers came back down to earth. Lesson learned: benchmark semantics must match what a real client actually does.
The bandwidth numbers that added up twice
During the Fuse16+20 hierarchical benchmarks, the total bandwidth column looked suspiciously high. It turned out the C++ layer already included false-positive block downloads in its "total bandwidth" output, and then the shell script added them again in the summary table. A classic double-counting bug, caught only by staring at the numbers long enough to realize they could not possibly be right.
The pivot
After hierarchical GCS failed, the project pivoted to Binary Fuse16 as a new filter type alongside GCS — and the very first benchmark showed a 77x speedup. The hierarchical idea was later revived, but this time built on Fuse filters instead of GCS, with domain-separated SipHash keys and a candidate-only optimization. That is where the research stands today: one clear winner (flat Fuse16) and one promising but unfinished direction (hierarchical Fuse).