Light Client Block Filters Research

Introduction

Code and dataset: github.com/purszki/bitcoin_research_01

The problem: private Bitcoin wallets are too slow on mobile

Bitcoin promises self-sovereignty — your money, held by you, not a bank or exchange. To deliver on that promise, a user must be able to check their balance on a phone without leaking addresses to a third-party server. Today that means downloading compact block filters (BIP 158) and scanning them locally. The privacy is excellent, but the cost is real: BIP 158 filters use Golomb-Rice coding, which requires a full sequential decode for every query. On a mobile phone this translates to minutes of heavy CPU work just to scan recent blocks — the phone heats up, the battery drains, and the user waits.

What we achieved

We built a reproducible benchmark framework on top of Bitcoin Core and used it to evaluate Binary Fuse filters as an alternative to Golomb-Rice coded sets. Across 50,000 real mainnet blocks and 10 wallet scenarios (from 24 to 480 scripts), benchmarked on both desktop x86-64 and a Raspberry Pi 5 (ARM Cortex-A76), the current results support two concrete conclusions:

Fuse16 is a strong candidate for a new compact block filter type alongside BIP 158 basic filters. In the 50k-block benchmark, Fuse16 preserves zero false negatives, reduces client-side query CPU by 9x to 81x on desktop and 6x to 46x on ARM across the wallet set, and produces slightly smaller filter data than GCS (955.9 MB vs 972.1 MB). When matched-block downloads are included, total bandwidth stays very close to GCS, rising by only about 0.2% to 2.9% in the current measurements.
Hierarchical Fuse filters are a promising bandwidth-oriented research direction. We also benchmarked two-stage Fuse designs that try to reduce total bandwidth further without giving up the speed advantage completely. Some variants improve the bandwidth tradeoff, and some achieve zero false positives in the current 50k run. But this work is still experimental: it needs more optimization, clearer protocol design, and dedicated privacy analysis because a two-stage retrieval scheme may leak wallet interest unless the request pattern is carefully hidden.

The speedup comes from a fundamental algorithmic difference: GCS requires O(N) sequential decoding per query, while flat Fuse16 answers each query with exactly 3 random memory lookups — O(1) regardless of filter size. The hierarchical variants build on the same idea, but add an extra protocol layer that is not yet deployment-ready.

Where we can go with more work

These results are a proof of concept, not a finished protocol. The current evidence makes Fuse16 the clearest deployment candidate, while the hierarchical variants remain exploratory. ARM benchmarking on a Raspberry Pi 5 confirms the desktop results: Fuse16 delivers significant speedups even on a low-power ARM processor. The path forward includes: writing a native C++ implementation suitable for Bitcoin Core (replacing the current C library wrapper); specifying a deterministic construction and wire format; designing privacy-preserving retrieval if a hierarchical scheme is pursued; and ultimately drafting a BIP proposal. Fuse filters could make private, self-sovereign wallet sync practical on everyday phones.

Summary of BIP 158

In short, the server sends a compact probabilistic summary for each block, which lets the client cheaply test whether a wallet-relevant script element may be present in that block.

We refer to the resulting per-block structure as the basic block filter; its encoding uses Golomb-Rice-coded sets (GCS).

Under BIP 158, the server builds a compact block filter for each block. From that filter, the client can tell whether the block may contain any script element relevant to the wallet. If the filter signals a match, the block becomes a candidate and the client can inspect it in full.

       FULL BLOCKS (Big Data)                BIP 158 FILTERS (Small Data)
      (Megabytes of TXs)                     (KiloBytes of GCS)
    ----------------------------           ----------------------------
    |  [TX1][TX2][TX3][TX4]... |           |                          |
    |  [TX5][TX6][TX7][TX8]... |           |     Lossy, but           |
    |  [TX9]....... (1-4 MB)   |  ----->   |  deterministic digest    |
    |   (typically 1.5 Mb)     |           |        (~20 KB)          |
    ----------------------------           ----------------------------

                   ^                                     ^
                   |                                     |
              "Hard to                             "Easy to
               download"                            store"

During filter construction, the block's relevant script elements are collected: output scriptPubKeys and the spent prevout scriptPubKeys referenced by inputs. These elements are hashed and then encoded with Golomb-Rice coding (GCS). If a wallet-relevant script element matches the filter, the block must be downloaded for full inspection.

Since the result is a lossy digest:

if the filter says 'No match', then the queried script element is not in that block.
if it says 'Match possible', then the block is only a candidate: the element may really be there, or this may be a false positive.

The question is: how could we speed up this process on the client side, preferably without increasing the compressed data size too much?

Common discussions suggest several methods: XOR, Binary Fuse, or Cuckoo filters.

References:

Binary Fuse Filters

Binary fuse filters are a family of probabilistic data structures for approximate set membership, introduced by Thomas Mueller Graf and Daniel Lemire in "Binary Fuse Filters: Fast and Smaller Than Xor Filters" (Journal of Experimental Algorithmics, Vol. 27, 2022; arXiv:2201.01174).

A binary fuse filter answers one question: is a given value a member of a set? If the answer is "yes", the value might be in the set (a false positive is possible). If the answer is "no", the value is definitely not in the set — there are no false negatives. This is the same guarantee as GCS filters used in BIP 158, but binary fuse filters answer the question with only 3 memory lookups, regardless of how large the set is.

A binary fuse filter is built from a fingerprint function and an array of fingerprints. The algorithm is described in detail in the reference paper. Using a simple lookup method that XORs three fingerprints, the algorithm gives O(1) query time with exactly three memory accesses, regardless of set size.

The probability of a false positive depends on the fingerprint width: a k-bit fingerprint gives FPR = 1/2^k. Common configurations are:

Variant	Fingerprint	FPR
Fuse8	8 bits	1/256
Fuse12	12 bits	1/4,096
Fuse16	16 bits	1/65,536
Fuse18	18 bits	1/262,144
Fuse20	20 bits	1/1,048,576

A bandwidth optimization - Hierarchical Fuse Filters

We propose an original contribution on top of Binary Fuse Filters: a hierarchical two-stage design that aims to reduce bandwidth without giving up the core speed advantage. To our knowledge, this two-stage outer/inner filter architecture has not been explored in the context of Bitcoin light client block filtering before.

A hierarchical filter splits the decision into two stages. The client still downloads one small outer filter for every block and queries it locally. If that outer filter reports a possible match, the client fetches a second, more selective inner filter for that block. Only if the inner filter also reports a match does the client download the full block.

The naming convention follows that two-stage structure: F12+16 means a Fuse12 outer filter plus a Fuse16 inner filter. Likewise, F10+10 uses Fuse10 in both stages.

HIERARCHICAL FILTER FLOW

download outer filter for every block
           |
        no match -> stop
           |
        match
           v
   fetch and query inner filter
           |
        no match -> stop
           |
        match
           v
   download full block

The benchmarked bandwidth numbers in this document assume exactly that on-demand inner-filter retrieval model: outer filters for all blocks, inner filters only for outer matches, and full blocks only for final matches. That is also why hierarchical filters raise a privacy issue. Under BIP 158, the peer sees only that the client downloads one filter per block. In a hierarchical scheme, the peer can also observe which specific blocks triggered second-stage inner-filter requests unless the protocol adds a privacy defense.

Proof of Concept Implementation

We use the xor_singleheader C library as a reference implementation of Binary Fuse and XOR filters: https://github.com/FastFilter/xor_singleheader

It is good for benchmarking, but not yet protocol-ready. Main limitations:

It uses malloc/free internally.
Error handling is too weak for critical production code.

To integrate it cleanly, we wrapped it in Bitcoin Core-style C++ classes with the same API shape as BlockFilter. See src/fuse16filter.h/.cpp for Fuse16, src/fuse8filter.h/.cpp for Fuse8, and src/xor8filter.h/.cpp for Xor8. We use PIMPL (pointer-to-implementation) so the large C header does not leak into the rest of the codebase. After a native C++ rewrite, neither malloc nor PIMPL should be needed.

Alongside that wrapper layer, we also started an experimental C++ refactor of the original library in src/bench/xor_singleheader/binaryfusefilter.hpp. It uses templates to support variable-width filters (for example Fuse12, Fuse18, and Fuse20), and it moves serialization toward an explicit platform-independent little-endian format. This is promising for overcoming some limitations of the original C code, but it is still research code rather than a production-ready implementation.

Construction must be deterministic. We derive the seed from the block hash. If construction fails (the probabilistic peeling process does not converge), we currently throw an exception. For benchmarking this is acceptable: in practice failure is very rare, and a deterministic retry strategy can handle it. A protocol-grade implementation must define that retry behavior explicitly.

Core question: can Binary Fuse filters (and related XOR filters) replace or complement the current GCS-based compact block filters from BIP 158? We evaluate:

Deterministic construction. Fuse construction is probabilistic, so it may fail and retry with a different seed. Can we define this so every full node still produces exactly the same filter for the same block?
Client-side query speed. On the client, are Fuse filters faster than GCS (Golomb-Rice decode plus per-element hashing)?
Network bandwidth. Is total filter size, and therefore download bandwidth, comparable to BIP 158?
False positives. BIP 158 has a very low false-positive rate (1/784,931). Each false positive is expensive: the client downloads the full block (~1.5 MB) and checks it manually. How do Fuse filters compare?

Source Files

All research-specific C++ code lives under src/bench/light_client_research/ and src/bench/xor_singleheader/, separate from the Bitcoin Core codebase.

Open source file inventory

Filter implementations (`src/bench/light_client_research/`)

File	Description
`fuse8filter.h` / `.cpp`	C++ wrapper for Binary Fuse8 filter (8-bit fingerprint, FPR 1/256 — rejected due to false positives)
`fuse10filter.h` / `.cpp`	Experimental Binary Fuse10 filter wrapper used in hierarchical `F10+10` benchmarks
`fuse12filter.h` / `.cpp`	Experimental Binary Fuse12 filter wrapper used for flat and hierarchical benchmarks
`fuse16filter.h` / `.cpp`	C++ wrapper for Binary Fuse16 filter (16-bit fingerprint, FPR 1/65,536)
`fuse18filter.h` / `.cpp`	Experimental Binary Fuse18 filter wrapper used for flat and hierarchical benchmarks
`fuse20filter.h` / `.cpp`	Experimental Binary Fuse20 filter wrapper used for flat and hierarchical benchmarks
`fuse32filter.h` / `.cpp`	Experimental Binary Fuse32 filter wrapper used as a near-zero-FP validation reference
`xor8filter.h` / `.cpp`	C++ wrapper for Xor8 filter (8-bit, FPR 1/256 — rejected)

Data loading and streaming (`src/bench/light_client_research/`)

File	Description
`full_dataset.h` / `.cpp`	JSON dataset parser: `FullDataset` (string-based) and `PreparedDataset` (binary, benchmark-ready)
`tx_block_chunk_store.h` / `.cpp`	Loads `.bin` chunk file metadata (paths, block ranges) from a directory
`tx_block_stream_reader.h` / `.cpp`	Streams blocks one at a time from `.bin` chunks, keeping memory usage low
`filter_bench.h` / `.cpp`	Shared helpers: wallet scenario loading, filter construction, dataset I/O

Benchmarks (`src/bench/light_client_research/`)

File	Description
`research_bip158.cpp`	Main benchmark driver: flat Fuse benchmarks, hierarchical Fuse benchmarks, and streaming ground-truth validation
`filter_demo.cpp`	Minimal demo: builds a Fuse16 filter for a single block and queries it

Vendored Library and Experimental C++ Impl (`src/bench/xor_singleheader/`)

File	Description
`binaryfusefilter.h`	Binary Fuse filter C implementation (from xor_singleheader, commit `c482686`)
`binaryfusefilter.hpp`	Experimental C++ refactor of Binary Fuse with template-based variable-width filters and explicit little-endian serialization
`xorfilter.h`	Xor filter C implementation (same upstream library)
`README.md`	Local notes about the vendored `xor_singleheader` code and the in-repo adaptations

Data Extraction for Profiling and Validation

For benchmarking on real mainnet data, we export block and transaction data from a fully synced Bitcoin Core node into compact binary (.bin) blobs. Each blob stores 250 blocks, including all output scriptPubKeys and spent prevout scripts, which are exactly the elements used by BIP 158 filter construction.

This extraction step is needed for three reasons:

We must build each filter from scratch (not from a pre-built index), so construction time can be measured fairly across filter types.
The binary format lets us stream blocks one by one, without keeping the full dataset in memory. This is important for large runs (50k+ blocks).
Ground-truth validation needs the original scripts, to prove we never miss a real match (zero false negatives).

For exact, step-by-step dataset reproduction from a Bitcoin Core node, see REPRODUCE.md.

Simulating Wallet Behaviour

To measure real-world filter performance, we simulate how a light client wallet queries block filters. A wallet holds a set of scriptPubKeys and scans every block filter to find blocks that may contain relevant transactions.

The number of scripts, and the script-type mix, directly affects query time and false-positive rate.

We generate 10 wallet use cases, from a simple mobile wallet (24 scripts) up to a large exchange hot wallet (480 scripts). Each use case has a realistic address-type mix (P2WPKH, P2TR, P2WSH, P2SH, P2PKH) and includes ground-truth data: which blocks in the 50k-block dataset actually contain matching transactions. We use this ground truth to verify zero false negatives and to count false positives exactly.

Wallet Use Case	Scripts	Address Type Mix	True Hits (50k blocks)
Simple User	24	17 p2wpkh, 5 p2tr, 2 p2sh	1,034
Cold Storage Vault	60	42 p2wsh, 12 p2tr, 6 p2sh	761
Legacy Migrator	90	45 p2pkh, 27 p2sh, 18 p2wpkh	18,179
Taproot Power User	96	82 p2tr, 14 p2wpkh	7,124
CoinJoin Privacy User	120	72 p2wpkh, 36 p2tr, 12 p2sh	8,131
Merchant Batch Settler	140	63 p2wpkh, 42 p2tr, 28 p2sh, 7 p2pkh	21,795
Lightning Operator	180	90 p2wsh, 45 p2tr, 45 p2wpkh	4,044
Custody Wallet	220	99 p2wsh, 44 p2tr, 44 p2wpkh, 33 p2sh	11,216
Watch-Only Auditor	320	128 p2wpkh, 80 p2tr, 64 p2wsh, 32 p2sh, 16 p2pkh	23,267
Exchange Hot Wallet	480	192 p2wpkh, 144 p2tr, 96 p2sh, 48 p2wsh	27,530

The True Hits column shows how many of the 50,000 blocks contain at least one matching transaction for that wallet. A simple user wallet matches about 2% of blocks, while an exchange hot wallet matches about 55%. For most wallets, non-matching blocks are still the majority. This is exactly where filter query speed matters most, because the client checks every block filter, match or not.

The wallet scenario files are generated deterministically from block data using benchmark_tools/generate_wallet_use_cases.py. See REPRODUCE.md for details.

Disk space required for the 50k-block dataset used in our benchmarks:

Dataset	Size	Files
Binary block chunks	~23 GB	200
Wallet use-case scenarios	~53 MB	10

Ground-Truth Validation

Any new filter type must satisfy one hard requirement: zero false negatives. If a block contains a transaction matching the wallet, the filter must report a match. A single false negative means the wallet misses a payment — this is unacceptable. False positives (the filter says "maybe" but the block has no matching transaction) are tolerated; they only cost an extra block download.

We validate correctness and count false positives in two layers:

Layer 1: Computing ground truth (Python, offline). During wallet scenario generation, the script generate_wallet_use_cases.py scans every block in the dataset and records which blocks contain at least one scriptPubKey that belongs to the wallet. This is an exact, brute-force check against the raw transaction data — no filter is involved. The result is a list of block indices stored in the wallet scenario JSON under ground_truth.matched_blocks.

Layer 2: Streaming validation (C++, dedicated benchmark). A dedicated benchmark called StreamingGroundTruthValidation reads blocks one at a time from the binary chunks, constructs a filter for each block using all four filter types (GCS, Fuse16, Fuse8, Xor8), and queries each filter with the wallet's scripts. For each filter type it checks:

False negatives: For every block index in the ground-truth list, did the filter report a match? If any ground-truth block is missed, the validation throws an exception and stops.

False positives: The count is computed as:

false_positives = candidate_matches - true_hits_in_range

where candidate_matches is the number of blocks the filter flagged, and true_hits_in_range is the number of ground-truth blocks within the scanned range.

The streaming approach processes one block at a time and discards its elements immediately, so memory usage stays low even for 50k+ blocks.

How to run

A convenience script runs all the wallet use cases against all 4 filter types (GCS, Fuse16, Fuse8, Xor8) and produces a summary table:

./light_client_research/benchmark_tools/false_positive_check.sh 50000

The script iterates over every wallet scenario file, runs the StreamingGroundTruthValidation benchmark for each, parses the output, and checks that false_negatives=0 for every filter. If any false negative is detected, the script prints an error banner and exits with a non-zero code. Results are saved to light_client_research/results/false_positive_check_50000.txt.

You can also run individual wallets manually:

# Single wallet:
BIN_SCAN_MAX_BLOCKS=50000 ./build-release/bin/bench_bitcoin \
    -filter='StreamingGroundTruthValidation' -min-time=100

# Specific wallet:
BIN_SCAN_MAX_BLOCKS=50000 \
BIN_WALLET_SCENARIO=light_client_research/mainnet_datasets/wallet_use_cases/wallet_use_case_exchange_hot_wallet.json \
    ./build-release/bin/bench_bitcoin \
    -filter='StreamingGroundTruthValidation' -min-time=100

Expected output per filter type:

FUSE16_GROUND_TRUTH scenario=... scanned_blocks=50000 skipped_small=55
    construction_failures=0 true_hits_in_range=... candidate_matches=...
    false_negatives=0 false_positives=...

The false_negatives=0 field must be zero for every filter type. Any non-zero value indicates a correctness bug.

Benchmarking Methodology and Results

What we measure

The benchmarks simulate what a light client does when it receives block filters over the network: deserialize the filter from bytes, then query it with the wallet's scripts. This is the client-side hot path that runs for every block in the chain.

For each filter type, the benchmark:

Setup (not timed): Reads blocks from the binary dataset, constructs filters, and serializes them to byte arrays. This simulates the full node building filters; the client never does this.
Timed loop (nanobench): For each pre-built filter, deserializes from bytes and calls MatchAny(wallet_scripts). This is repeated many times by nanobench to get stable timing.

For GCS (BIP 158), deserialization reconstructs the Golomb-Rice encoded bitstream and the query performs a full O(N) sequential decode. For Fuse16, deserialization reconstructs the fingerprint array and the query performs 3 array lookups per wallet script — O(1) per element regardless of filter size.

The benchmarks also report:

Total filter size in bytes (sum of all serialized filters)
Average filter size per block
Match count (number of blocks where the filter reports a hit)
Block download volume for all matched blocks (true positives + false positives)
Total bandwidth = filter data + downloaded blocks for all matches

Bandwidth is counted from the client point of view. If a filter reports a match, the client must download the full block before it can tell whether the match was real or just a false positive.

TOTAL BANDWIDTH
    = filter_data
    + downloaded_blocks_for_all_matches

    = filter_data
    + block_downloads(true_positives)
    + block_downloads(false_positives)

    match reported by filter
           |
           v
      download full block
           |
           v
    later classify as TP or FP

So the download cost is paid for every reported match, not only for true hits.

For hierarchical schemes, "filter_data" includes both the outer filters and any
inner filters fetched on demand.

How to run

# Build (release mode, required for meaningful timings):
cmake -B build-release -DCMAKE_BUILD_TYPE=Release -DBUILD_BENCH=ON
cmake --build build-release --target bench_bitcoin -j $(nproc)

# GCS baseline — 50k blocks, simple_user wallet:
BIN_SCAN_MAX_BLOCKS=50000 ./build-release/bin/bench_bitcoin \
    -filter='ResearchBasicClientSideQuery' -min-time=1000

# Fuse16 — 50k blocks, simple_user wallet:
BIN_SCAN_MAX_BLOCKS=50000 ./build-release/bin/bench_bitcoin \
    -filter='ResearchFuse16ClientSideQuery' -min-time=1000

# Specific wallet:
BIN_SCAN_MAX_BLOCKS=50000 \
BIN_WALLET_SCENARIO=light_client_research/mainnet_datasets/wallet_use_cases/wallet_use_case_exchange_hot_wallet.json \
    ./build-release/bin/bench_bitcoin \
    -filter='ResearchFuse16ClientSideQuery' -min-time=1000

# All wallets, flat and hierarchical filter matrix:
./light_client_research/benchmark_tools/wallet_benchmark.sh 50000

The wallet_benchmark.sh script runs the current all-wallet comparison (GCS, flat Fuse variants, and hierarchical variants), computes actual false positives from ground-truth data, and saves summary tables to light_client_research/results/wallet_benchmark_results_<blocks>.txt. The checked-in 50k result file is here: wallet_benchmark_results_50000.txt.

To run both ground-truth validation and benchmarks in one go:

./light_client_research/benchmark_tools/bench_all.sh 50000

Profiling Results: Fuse8 and Xor8 (20k blocks)

This early 20k-block run compared all four filter types to decide which ones to carry forward. The results led us to drop Fuse8 and Xor8; the 50k benchmarks below focus on the viable candidates. Benchmarked on a Lenovo ThinkPad P51 (Intel Core i7-7820HQ @ 2.90 GHz, 4 cores / 8 threads).

Filter	Time ms	Size MB	FP	Speedup	Viable?
GCS	4,716	404	1	1x	baseline
Fuse16	61	397	4	77x	YES
Fuse8	48	199	1,733	99x	NO
Xor8	47	190	1,841	100x	NO

Taking into account the FP rate and the speedup, Fuse16 seems to be a viable alternative to GCS.

8-bit filters are ~50% smaller and ~100x faster, but their FPR (1/256) causes thousands of unnecessary full block downloads (~2.6 GB for Fuse8 at 20k blocks). The bandwidth cost far exceeds the size savings.

Profiling Results: Flat Fuse Variants vs GCS (50k mainnet blocks)

The flat-filter comparison covers GCS, Fuse16, Fuse18, and Fuse20 across both platforms. Desktop results use wallet_benchmark_results_50000.txt; Raspberry Pi 5 results use arm_fixed_wallet_benchmark_results_50000.txt. Filter sizes and match counts are identical across platforms; only timing differs.

CPU speed — Raspberry Pi 5 (ARM Cortex-A76, deserialize + MatchAny, lower is better):

Wallet	Scripts	GCS ms	F16 ms	F16 speedup	F18 ms	F20 ms
Simple User	24	14,547	317	46x	3,471	3,527
Cold Storage Vault	60	15,821	750	21x	3,682	3,972
Legacy Migrator	90	12,236	672	18x	3,620	3,779
Taproot Power User	96	14,827	1,008	15x	3,957	4,095
CoinJoin Privacy User	120	14,680	1,054	14x	3,998	4,159
Merchant Batch Settler	140	12,071	1,207	10x	4,382	4,280
Lightning Operator	180	16,867	1,817	9x	4,954	5,094
Custody Wallet	220	16,103	2,010	8x	5,226	5,344
Watch-Only Auditor	320	13,552	2,396	6x	5,365	5,487
Exchange Hot Wallet	480	14,605	2,554	6x	5,573	5,714

CPU speed — desktop x86-64 data table

CPU speed (deserialize + MatchAny, lower is better):

Wallet	Scripts	GCS ms	F16 ms	F16 speedup	F18 ms	F20 ms
Simple User	24	10,081	125	81x	1,392	1,458
Cold Storage Vault	60	10,535	237	44x	1,510	1,573
Legacy Migrator	90	8,249	257	32x	1,534	1,598
Taproot Power User	96	9,970	311	32x	1,580	1,645
CoinJoin Privacy User	120	9,837	348	28x	1,604	1,676
Merchant Batch Settler	140	7,923	396	20x	1,769	1,733
Lightning Operator	180	10,753	550	20x	1,820	1,897
Custody Wallet	220	9,899	630	16x	1,895	1,987
Watch-Only Auditor	320	8,525	654	13x	2,001	2,005
Exchange Hot Wallet	480	8,716	1,004	9x	2,293	2,364

Total bandwidth — flat filter data table

Total bandwidth (filter data + downloaded blocks for all reported matches, true positives and false positives; lower is better):

Wallet	Scripts	GCS MB	F16 MB	F18 MB	F20 MB
Simple User	24	2544.257	2555.803	2656.590	2764.290
Cold Storage Vault	60	2083.057	2143.943	2212.660	2313.220
Legacy Migrator	90	27945.347	28003.093	28065.810	28169.130
Taproot Power User	96	11561.647	11643.493	11686.710	11787.130
CoinJoin Privacy User	120	13173.247	13269.393	13300.110	13403.630
Merchant Batch Settler	140	33695.347	33770.393	33819.910	33917.830
Lightning Operator	180	7003.037	7189.113	7153.750	7220.370
Custody Wallet	220	17756.747	17938.493	17902.410	17978.830
Watch-Only Auditor	320	35732.847	35889.293	35867.910	35940.630
Exchange Hot Wallet	480	42272.847	42445.393	42410.710	42506.730

Key observations:

Fuse16 remains the CPU winner in every wallet on both platforms. On desktop x86-64 it ranges from 9x to 81x faster than GCS. On the Raspberry Pi 5 (ARM Cortex-A76) it ranges from 6x to 46x faster.
F18 and F20 reduce false-positive-driven block downloads relative to Fuse16, but their larger filters more than offset that gain. In this flat-filter comparison, both variants are slower than Fuse16 and still worse than GCS on total bandwidth.
GCS keeps the lowest total bandwidth among these flat filters, but it pays for that result with much higher client-side CPU cost.

Profiling Results: Experimental Hierarchical Fuse Variants vs GCS (50k mainnet blocks)

The experimental hierarchical comparison covers F16+20, F10+10, F12+12, F12+16, F12+18, and three Fuse+GCS hybrids: F8+GCS, F12+GCS, and F16+GCS. Desktop results use wallet_benchmark_results_50000.txt; Raspberry Pi 5 results use arm_fixed_wallet_benchmark_results_50000.txt. In the Fuse+GCS hybrids the outer layer is a fast Fuse filter and the inner layer is the existing BIP 158 GCS filter, which full nodes already compute and serve. These schemes trade protocol simplicity for a different speed/bandwidth frontier.

CPU speed — Raspberry Pi 5 (ARM Cortex-A76, hierarchical; lower is better):

Wallet	Scripts	GCS ms	F16+20 ms	F10+10 ms	F12+12 ms	F12+16 ms	F12+18 ms	F8+GCS ms	F12+GCS ms	F16+GCS ms
Simple User	24	14,547	439	2,601	807	797	905	1,176	1,053	570
Cold Storage Vault	60	15,821	804	3,040	1,210	1,204	1,300	2,597	1,421	894
Legacy Migrator	90	12,236	2,297	4,089	1,575	1,375	2,703	6,010	4,772	4,201
Taproot Power User	96	14,827	1,615	3,744	1,639	1,555	2,075	4,920	2,926	2,336
CoinJoin Privacy User	120	14,680	1,811	3,924	1,734	1,682	2,288	5,868	3,317	2,681
Merchant Batch Settler	140	12,071	3,114	4,825	2,103	1,877	3,515	8,934	6,384	5,538
Lightning Operator	180	16,867	2,365	4,724	2,520	2,525	2,873	7,746	3,553	2,835
Custody Wallet	220	16,103	3,343	5,413	2,914	2,693	3,629	9,754	5,112	4,348
Watch-Only Auditor	320	13,552	4,877	6,665	3,875	3,621	5,311	14,236	8,237	7,314
Exchange Hot Wallet	480	14,605	6,350	8,226	5,209	4,912	6,900	16,648	9,836	8,934

CPU speed — desktop x86-64 hierarchical data table

CPU speed (deserialize + MatchAny, including inner-filter verification when triggered; lower is better):

Wallet	Scripts	GCS ms	F16+20 ms	F10+10 ms	F12+12 ms	F12+16 ms	F12+18 ms	F8+GCS ms	F12+GCS ms	F16+GCS ms
Simple User	24	10,081	169	1,270	448	440	475	720	613	278
Cold Storage Vault	60	10,535	277	1,421	552	547	591	1,588	734	349
Legacy Migrator	90	8,249	935	1,941	774	642	1,217	4,128	3,054	2,676
Taproot Power User	96	9,970	582	1,701	715	666	896	2,929	1,668	1,239
CoinJoin Privacy User	120	9,837	667	1,782	765	703	976	3,651	1,899	1,458
Merchant Batch Settler	140	7,923	1,225	2,232	960	802	1,508	5,837	3,861	3,397
Lightning Operator	180	10,753	743	1,956	946	922	1,080	4,703	1,709	1,157
Custody Wallet	220	9,899	1,107	2,301	1,147	1,165	1,468	6,521	2,856	2,282
Watch-Only Auditor	320	8,525	1,829	2,856	1,534	1,345	2,110	7,872	4,622	4,058
Exchange Hot Wallet	480	8,716	2,279	3,372	1,956	1,789	2,610	11,538	5,457	4,713

Total bandwidth — hierarchical filter data table

Total bandwidth (outer filters + inner filters fetched on demand + downloaded blocks for all reported matches; lower is better):

Wallet	Scripts	GCS MB	F16+20 MB	F10+10 MB	F12+12 MB	F12+16 MB	F12+18 MB	F8+GCS MB	F12+GCS MB	F16+GCS MB
Simple User	24	2544.257	2558.250	2199.385	2312.707	2318.916	2323.001	2157.537	2319.668	2552.206
Cold Storage Vault	60	2083.057	2090.117	1756.452	1851.336	1859.193	1863.122	1808.441	1859.864	2085.814
Legacy Migrator	90	27945.347	28458.200	27864.821	28015.700	28124.010	28178.170	28042.290	28133.630	28361.920
Taproot Power User	96	11561.647	11737.560	11330.304	11433.451	11477.245	11499.142	11475.208	11480.772	11701.420
CoinJoin Privacy User	120	13173.247	13393.060	12976.099	13075.772	13127.361	13154.006	13159.995	13131.971	13349.320
Merchant Batch Settler	140	33695.347	34328.770	33681.057	33838.440	33972.990	34040.270	33915.180	33985.320	34209.880
Lightning Operator	180	7003.037	7097.310	6771.512	6835.821	6868.300	6885.350	7034.930	6871.252	7074.330
Custody Wallet	220	17756.747	18074.820	17630.290	17723.044	17800.190	17838.760	17935.310	17807.130	18012.350
Watch-Only Auditor	320	35732.847	36385.850	35751.900	35893.080	36037.230	36110.050	36072.920	36050.120	36259.980
Exchange Hot Wallet	480	42272.847	42978.800	42360.980	42475.390	42631.170	42709.060	42683.870	42643.230	42843.960

Key observations:

The hierarchical variants remain much faster than GCS on both platforms, but none of them beats flat Fuse16 on CPU time. On desktop, F16+20 is the fastest hierarchical option for small wallets (6 of 10), while on ARM F12+16 wins in the majority of wallets (7 of 10) because its smaller outer filter stresses the ARM cache less. F10+10 is typically the slowest on both platforms.
F10+10 is the strongest bandwidth result in the current experimental set. It beats GCS on total bandwidth in 8 of 10 wallets, but it does so with only modest speedups and a non-zero false-positive rate.
F12+16 is the cleanest zero-false-positive hierarchical result. In the current 50k run it keeps zero false positives across all wallets, remains materially faster than GCS, and lands near the middle of the experimental bandwidth tradeoff curve.
F16+20 proves that a zero-false-positive two-stage design is possible, but not yet economical. In the current measurements it is worse than GCS on total bandwidth in all 10 wallets because the extra inner-filter traffic outweighs the saved block downloads.
The Fuse+GCS hybrids (F8+GCS, F12+GCS, F16+GCS) achieve zero false positives across all wallets by using the existing BIP 158 GCS filter as the inner verification layer. Their main advantage is deployment simplicity: the inner GCS filter is already computed and served by full nodes, so only the new Fuse outer filter needs to be added. F8+GCS offers the best bandwidth for low-match wallets (15% less total bandwidth than GCS for Simple User) but degrades to GCS-level CPU at high script counts because its 1/256 FPR triggers too many expensive O(N) GCS verifications. F16+GCS is 30–36x faster than GCS for small wallets while keeping bandwidth near-neutral. F12+GCS is the most balanced: 5–16x faster with 0–12% bandwidth savings for wallets with less than 20% match rates.

Limitations and future profiling improvements

The benchmarks now cover both desktop x86-64 and ARM (Raspberry Pi 5). The ARM results confirm the algorithmic advantage holds on low-power hardware, though with reduced speedup ratios. Important remaining gaps:

Mobile SoC profiling. The Raspberry Pi 5 (Cortex-A76) provides our first ARM data point, confirming 6x-46x Fuse16 speedup over GCS. Mobile phones (Snapdragon, Apple A-series) have different cache hierarchies and thermal constraints. Profiling on actual phone hardware would complete the picture.
Measuring actual network cost beyond byte totals. The current bandwidth tables already count actual serialized filter bytes, actual on-demand inner-filter bytes, and actual full-block bytes for every reported match in the 50,000-block dataset. What they still do not model is transport behavior: request batching, round-trip latency, peer behavior, retries, parallel downloads, and the CPU cost of parsing blocks after receipt. For hierarchical schemes there is an additional unresolved question: whether inner-filter retrieval can preserve wallet privacy without giving up the measured bandwidth gains.
Skipped blocks with fewer than 2 elements. Both the Fuse filter wrappers and the GCS benchmark path require at least 2 unique script elements to construct a filter. Blocks with 0 or 1 elements (typically empty blocks or blocks with only a coinbase) are skipped during benchmarking. In the 50,000-block dataset, 55 blocks fall into this category. Skipping them has almost no impact on match counts, false positive rates or profiling results. A production implementation would need to handle these blocks (e.g., by padding or using a trivial filter).
End-to-end wall-clock measurement. The current benchmarks measure filter query time in isolation. A production-relevant benchmark would measure the full wallet sync flow: receive filter bytes over a (simulated) network, deserialize, query, and on match download and process the block. This would capture the true user-facing latency including I/O and memory allocation overhead.

Path to Protocol-Ready

The proof-of-concept results are promising (6–81x speedup across desktop and ARM, comparable filter size, acceptable FP overhead). Below is what remains to go from research prototype to a deployable protocol change.

1. Evaluate Cuckoo filters

Cuckoo filters are another commonly suggested alternative to GCS. They offer O(1) lookup (similar to Fuse), support deletion (which Fuse does not), and have well-studied implementations. Before committing to Fuse as the proposal, we should build a proof-of-concept Cuckoo filter implementation using the same benchmark framework and compare. Our guess is that Fuse will be faster, but the evaluation must happen.

2. Research on hierarchical filters

The hierarchical filter approach raises a serious protocol question: it breaks the main privacy property that makes BIP 158 attractive. Under BIP 158, the client downloads one compact filter per block and performs all wallet queries locally, so the peer cannot tell which blocks are interesting to the wallet. In a hierarchical design, the client first checks an outer filter and then requests inner filters only for the candidate blocks. That second-stage request pattern can leak wallet interest directly.

The research question is whether there is a way to preserve most of the bandwidth savings without giving up that privacy model. Possible directions include:

Noise injection. Download random extra inner filters so the true candidate set is hidden inside a larger request set.
Temporal defenses. Decouple inner-filter requests from outer-filter matches by batching, delaying, or shuffling them over time so the correlation is weaker.
Private Information Retrieval. Investigate PIR-style techniques that let the client fetch inner-filter data without revealing which blocks it is actually interested in.

Any such workaround must be evaluated as a full protocol tradeoff: privacy leakage, extra bandwidth from cover traffic, latency, implementation complexity, and whether the hierarchical scheme still beats a flat filter once those protections are added.

3. Native C++ implementation of a Binary Fuse Filter

The original implementation wrapped the xor_singleheader C library via PIMPL. During the research we partially addressed the most critical issues:

C++ template implementation: We wrote binaryfusefilter.hpp, a template-based C++ reimplementation that supports variable-width filters (Fuse10, Fuse12, Fuse18, Fuse20, Fuse32) and moves serialization toward an explicit little-endian format. The Fuse8 and Fuse16 wrappers still use the original C library.
Exception-safe construction: All filter wrappers now throw on failure and clean up via RAII destructors, replacing the C library's silent error-return pattern.
Ground-truth validated: All filter types pass streaming ground-truth validation across 50,000 mainnet blocks and 10 wallet scenarios with zero false negatives — we are confident in correctness.

Remaining issues that a production implementation must address:

Memory management: The Fuse8/Fuse16 wrappers and the C library still use malloc/free internally. Bitcoin Core expects RAII-based allocation with no raw malloc in consensus-adjacent code.
PIMPL overhead: All wrappers still use unique_ptr<Impl> to hide the C/C++ headers. A production implementation should eliminate this indirection.
Endianness: The C++-template filters use explicit little-endian serialization, but the C-library wrappers (Fuse8, Fuse16) still use native endianness. A wire protocol must use a consistent fixed byte order.
No formal test suite: Correctness is validated through benchmark ground-truth checks, not through unit tests or fuzz tests. A production codebase needs both.

The algorithm itself is straightforward (the paper is clear and the C code is ~400 lines). A clean, unified C++ rewrite following Bitcoin Core coding conventions — replacing both the C library and the current template prototype — would be moderate effort.

4. Deterministic construction specification

Full nodes must produce identical filters for the same block. The current approach derives a deterministic seed from the block hash via SipHash, but throws an exception if the probabilistic peeling process does not converge.

A production specification must define:

Exact seed derivation formula (e.g., SipHash-2-4(block_hash, domain_tag)).
Deterministic retry: if construction fails with seed S, the next seed is S' = f(S) for a specified function f. The spec must guarantee that all implementations arrive at the same seed and the same filter bytes.
Maximum retry count and behavior on exhaustion (in practice, failure is extraordinarily rare — zero failures observed in 50k mainnet blocks).

5. Wire format specification

Define the exact byte layout for Fuse filters on the network:

Fixed-size header: seed (8 bytes), element count (4 bytes), segment parameters (12 bytes), array length (4 bytes) = 28 bytes.
Fingerprint array: array_length × 2 bytes, little-endian uint16 values.
Total: 28 + 2N bytes where N is the array length (~1.125 × element count).

The format must be specified precisely enough that any implementation can deserialize and query filters from any other implementation, bit-for-bit.

6. Handling edge cases

Blocks with 0–1 unique scripts: Fuse filters require at least 2 elements. The spec must define how to handle empty blocks and single-element blocks (e.g., encode as a trivial filter with a special marker, or fall back to a direct element list).
Coinbase-only blocks: Some blocks contain only a coinbase transaction with a single output. The filter for such blocks needs defined behavior.

7. Mobile SoC profiling

The Raspberry Pi 5 results confirm the ARM speedup (6x-46x). Profiling on actual phone SoCs (Snapdragon, Apple A-series) under thermal and battery constraints would complete the picture, but the algorithmic advantage is established.

8. End-to-end integration test

Build a prototype light client that performs the full sync flow:

Download filter headers and verify the header chain
Download filters from peers
Query filters against the wallet
On match, download the full block and scan transactions
Measure total sync time, bandwidth, and FP handling cost

This validates the real-world performance including network latency, I/O, and the actual cost of false positives beyond the current byte-counting benchmark tables.

9. Security audit

Before any protocol deployment, the following need independent review:

Filter construction: Can an attacker craft block content that causes filter construction to fail or produce a malicious filter?
Deserialization: Buffer validation, integer overflow checks, and rejection of malformed filters. The current C library does minimal validation.
Fingerprint collisions: Verify that the SipHash-based fingerprinting provides sufficient collision resistance against adversarial inputs.
DoS resistance: Ensure that pathological filters cannot cause excessive CPU or memory usage on the client.

10. Server-side benchmarking

The current benchmarks focus on the client side (filter query). Full nodes must also construct and serve filters. Before proposing a protocol change, we need to measure the server-side cost:

Construction time: How long does it take to build a Fuse filter per block vs GCS? This runs during block validation on full nodes.
Index build time: Time to build the full filter index for the entire chain (~895k blocks). GCS index currently takes several hours.
Storage: Disk space for the complete filter index.
Serving throughput: How many filter requests per second can a node handle? Fuse filters are slightly smaller than GCS, which may improve I/O.

11. BIP proposal

Once the above items are resolved, a formal BIP proposal would specify:

Filter type identifier (new type alongside basic = 0x00)
Construction algorithm with deterministic seed derivation
Wire format and serialization rules
P2P protocol messages (extending BIP 157's getcfilters/cfilters)
Filter header chain construction (extending BIP 157's getcfheaders)
Test vectors: known block → expected filter bytes

12. Backward compatibility and deployment

Fuse filters would be a new filter type, not a replacement for existing BIP 158 basic filters. Nodes can serve both types simultaneously.
Light clients negotiate the filter type during the P2P handshake.
Deployment can be gradual: nodes opt in to building and serving Fuse filters, clients opt in to requesting them.

Conclusion

This research set out to answer a simple question: can we make private Bitcoin wallet sync fast enough for mobile phones? The answer, based on 50,000 mainnet blocks, 10 wallet scenarios, and benchmarks on both desktop x86-64 and ARM hardware, is a clear yes — with the right data structure.

The strongest current result is that flat Binary Fuse16 filters are a strong candidate for a new compact block filter type alongside BIP 158 basic filters. In the 50,000-block benchmark they deliver 9x to 81x faster queries on desktop and 6x to 46x on ARM, slightly smaller filter data, and zero false negatives, while keeping total bandwidth very close to GCS.

The core insight is algorithmic: GCS forces O(N) sequential decoding for every query, while Fuse16 answers with exactly 3 memory lookups — O(1) regardless of how many elements the filter contains. This is not an optimization of the existing approach; it is a fundamentally different tradeoff. The ARM benchmarks confirm that the advantage holds on low-power hardware, though the speedup ratio narrows because Fuse's random access pattern is more sensitive to cache latency than GCS's sequential decode.

A second result is that hierarchical Fuse filters may open a better bandwidth frontier, but that line of work is still experimental. It needs more optimization and explicit privacy-preserving protocol design, because on-demand inner-filter retrieval could leak wallet interest if handled naively. Beyond that, the path to deployment requires a native C++ implementation, mobile SoC profiling, deterministic construction and wire-format specs, and a BIP proposal. The filter would deploy as an optional new type — no consensus change, no breaking change, gradual adoption.

"There ain't no such thing as a free lunch"

Research rarely follows a straight line. Here is what actually happened — the dead ends, the crashes, and the bugs that shaped the final result.

The hierarchical GCS era

The project did not start with Fuse filters at all. The first idea was to keep GCS but make it faster with a multi-level design: a coarse "window filter" covering N blocks, checked first, with per-block filters only consulted on a match. On testnet this looked decent — a 5.4x geometric-mean speedup at 50k blocks. But it was still O(N) Golomb decode per query, the outer layer leaked false positives no matter how we tuned the parameters, and positive-heavy workloads barely improved. Two git commits are titled saving work-deadend. The entire hierarchical GCS codebase was eventually deleted.

The mainnet data saga

Getting 50,000 real mainnet blocks turned out to be a project in itself. Attempt one: run a full mainnet node with txindex=1 and blockfilterindex=1. It synced for days, then crashed with a LevelDB fatal error: "No space left on device." Turns out you need about 1.5 TB free. Attempt two: fall back to a public remote RPC endpoint. That failed instantly — the public node does not expose getblockfilter. Attempt three: retry in tx-only mode. That ran for hours, then died to a transient DNS resolution error. And because the extractor wrote output only on completion, no partial data was saved. The fix was a complete redesign: binary chunk streaming (250 blocks per .bin file), checkpoint writes, and retry logic.

The 11 GB JSON problem

Early benchmarks loaded block data from pretty-printed JSON. This was fine for testnet, but a 10k-block mainnet extraction produced an 11 GB JSON file. The C++ harness could stomach 500–1k blocks, but iterative benchmarking at 10k+ was impractical. This forced the move to the binary streaming architecture that the final benchmarks use — and in hindsight it was a blessing, because it also solved the memory problem for 50k-block runs.

Fuse8 and Xor8: blazing fast, completely useless

Both 8-bit filter variants were about 100x faster than GCS and 50% smaller. On paper, a dream. In practice, their 1/256 false-positive rate produced ~1,700 false positives over 20k blocks — roughly 2.6 GB of pointless full-block downloads. The bandwidth cost dwarfed the size savings. They were rejected in a single benchmark run.

The benchmark that lied

A subtle bug in the benchmark harness made hierarchical filters look better than they were. Expected-negative queries used early-exit timing, but a probabilistic filter can produce a false positive on the very first check, causing the timer to stop early and report an implausibly fast result. The bug only became visible at 50k blocks, where the statistical effect was large enough to notice. After switching negative queries to full-scan timing, the inflated speedup numbers came back down to earth. Lesson learned: benchmark semantics must match what a real client actually does.

The bandwidth numbers that added up twice

During the Fuse16+20 hierarchical benchmarks, the total bandwidth column looked suspiciously high. It turned out the C++ layer already included false-positive block downloads in its "total bandwidth" output, and then the shell script added them again in the summary table. A classic double-counting bug, caught only by staring at the numbers long enough to realize they could not possibly be right.

The pivot

After hierarchical GCS failed, the project pivoted to Binary Fuse16 as a new filter type alongside GCS — and the very first benchmark showed a 77x speedup. The hierarchical idea was later revived, but this time built on Fuse filters instead of GCS, with domain-separated SipHash keys and a candidate-only optimization. That is where the research stands today: one clear winner (flat Fuse16) and one promising but unfinished direction (hierarchical Fuse).

Csaba Purszki

C++ Systems Engineer

Csaba is adapting his classical computer vision / automotive algorithm integration experiences into the world of Bitcoin within ChainCode Labs' 2026 BOSS Challenge.