Add sparse gram extraction #113

Open
aneubeck wants to merge 41 commits into main from aneubeck/sparse_ngrams
Conversation

@aneubeck (Collaborator) commented May 8, 2026

This PR adds a highly optimized sparse n-gram extraction implementation.
This version caps the gram size at 8; the limit is not configurable.

In subsequent PRs, I will add functions for extracting grams from a query string.
Extracting sparse n-grams for a regular expression is even more interesting...

Copilot AI review requested due to automatic review settings May 8, 2026 14:39
@aneubeck aneubeck requested a review from a team as a code owner May 8, 2026 14:39
Copilot AI (Contributor) left a comment

Pull request overview

Introduces a new sparse-ngrams crate implementing sparse n-gram extraction over byte slices (with a fixed max gram size of 8), and adds a new pef crate with Elias–Fano-related implementations/binaries, both wired into the workspace.

Changes:

  • Added sparse n-gram extraction (deque + scan variants), compact NGram representation, and a frequency-based bigram priority table.
  • Added benchmarks/docs for sparse-ngrams.
  • Added a new pef crate (Elias–Fano encoding/decoding utilities and tooling) and included it in the workspace.
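
For orientation, here is a naive sketch of what sparse n-gram selection can look like. It assumes the criterion commonly described for sparse n-grams: a substring is kept when the weights of its two border bigrams are strictly greater than every bigram weight in its interior. This page does not confirm that the crate uses exactly this rule. `sparse_grams_naive`, `bigram_weight`, and `MAX_GRAM_LEN` are hypothetical names, and the weight function is an arbitrary stand-in for the frequency-based priority table.

```rust
/// Hypothetical cap on gram length in bytes, mirroring the fixed maximum of 8.
const MAX_GRAM_LEN: usize = 8;

/// Placeholder bigram weight. The PR mentions a frequency-based priority
/// table; this multiplicative mix only keeps the sketch self-contained.
fn bigram_weight(a: u8, b: u8) -> u32 {
    ((u32::from(a) << 8) | u32::from(b)).wrapping_mul(2_654_435_761)
}

/// Returns the byte ranges `(start, end)` of all selected grams.
/// Quadratic in the window size; the deque and scan variants in the PR are
/// presumably far more efficient.
fn sparse_grams_naive(content: &[u8]) -> Vec<(usize, usize)> {
    // weights[i] is the weight of the bigram covering bytes i and i + 1.
    let weights: Vec<u32> = content
        .windows(2)
        .map(|w| bigram_weight(w[0], w[1]))
        .collect();
    let mut grams = Vec::new();
    for left in 0..weights.len() {
        // Assumption: every bigram is itself emitted as a gram.
        grams.push((left, left + 2));
        // Largest weight strictly between the two border bigrams so far.
        let mut interior_max: Option<u32> = None;
        for right in left + 1..weights.len() {
            let end = right + 2;
            if end - left > MAX_GRAM_LEN {
                break;
            }
            let border = weights[left].min(weights[right]);
            // Keep the gram if both borders outrank the whole interior.
            if interior_max.map_or(true, |m| m < border) {
                grams.push((left, end));
            }
            // The current right border becomes interior for longer grams.
            interior_max = Some(interior_max.map_or(weights[right], |m| m.max(weights[right])));
        }
    }
    grams
}
```

Under these assumptions, every trigram is selected as well, because a gram with two adjacent border bigrams has an empty interior.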
Summary per file

| File | Description |
| --- | --- |
| Cargo.toml | Adds pef and sparse-ngrams to the workspace members. |
| crates/sparse-ngrams/Cargo.toml | New crate manifest (including Criterion bench dependency). |
| crates/sparse-ngrams/README.md | Documents algorithm, selection criterion, and performance notes. |
| crates/sparse-ngrams/benchmarks/performance.rs | Benchmarks deque vs scan extraction variants. |
| crates/sparse-ngrams/src/lib.rs | Crate entry point, constants, and exports. |
| crates/sparse-ngrams/src/extract.rs | Core sparse-gram extraction implementations + tests. |
| crates/sparse-ngrams/src/ngram.rs | NGram packed hash+len type + tests. |
| crates/sparse-ngrams/src/table.rs | Bigram priority table initialization from embedded data. |
| crates/sparse-ngrams/src/deque.rs | Fixed-capacity monotone deque implementation. |
| crates/sparse-ngrams/src/murmur.rs | Murmur1 hashing helper for bigrams. |
| crates/pef/Cargo.toml | New crate manifest (deps, bins, benches). |
| crates/pef/src/lib.rs | pef crate module structure and exports. |
| crates/pef/src/elias_fano.rs | Elias–Fano encoding/decoder + intersection + tests. |
| crates/pef/src/batch_decoder.rs | Batch decoder implementation over Elias–Fano. |
| crates/pef/src/avx_batch_decoder.rs | AVX-512 batch decoder implementation + tests. |
| crates/pef/src/helper.rs | SIMD-related helper routines (currently not wired into the crate). |
| crates/pef/src/bin/reorder_docids.rs | Tooling binary for docid reordering via MST (feature-gated). |
| crates/pef/src/bin/encode_pisa.rs | Tooling binary for PISA posting list encoding/analysis + tests. |
| crates/pef/benches/elias_fano_bench.rs | Benchmarks for Elias–Fano decoding variants. |

Copilot's findings

  • Files reviewed: 10/13 changed files
  • Comments generated: 7

Comment on lines +28 to +32

    let mut chars = s.chars();
    let Some((a, b)) = chars.next().zip(chars.next()) else {
        continue;
    };
    let a = (a as u8).to_ascii_lowercase();
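
One property of the quoted cast is worth keeping in mind: in Rust, `char as u8` truncates the code point to its low 8 bits, so a non-ASCII character silently turns into an unrelated byte. The sketch below shows one way to reject such input explicitly. `parse_ascii_bigram` is a hypothetical helper, not a function from the PR, and whether the embedded table data can contain non-ASCII entries is not stated on this page.

```rust
/// Hypothetical helper: parses the first two characters of `s` as a
/// lowercased ASCII bigram. Returns `None` when `s` has fewer than two
/// characters or when either character is not ASCII.
fn parse_ascii_bigram(s: &str) -> Option<(u8, u8)> {
    let mut chars = s.chars();
    let (a, b) = chars.next().zip(chars.next())?;
    if !a.is_ascii() || !b.is_ascii() {
        // `a as u8` would truncate here instead of failing.
        return None;
    }
    Some(((a as u8).to_ascii_lowercase(), (b as u8).to_ascii_lowercase()))
}
```

Skipping non-ASCII entries mirrors the `continue` in the quoted loop; reporting an error would be an equally reasonable choice.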
Comment on lines +43 to +49

    pub fn from_bytes(src: &[u8]) -> Self {
        let mut hash = 0u32;
        for &byte in src {
            hash = hash.wrapping_mul(POLY_HASH_PRIME).wrapping_add(byte as u32);
        }
        Self((hash << 8) | src.len() as u32)
    }
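
The packing in the quoted `from_bytes` can be illustrated in isolation. Shifting the hash left by 8 discards its top 8 bits, so only 24 hash bits survive next to the 8-bit length, and a length above 255 would spill into the hash bits. The sketch below uses a hypothetical `PackedGram` type and a stand-in value for `POLY_HASH_PRIME`, since the real constant is not shown on this page.

```rust
/// Stand-in multiplier (the 32-bit FNV prime); the PR's actual
/// `POLY_HASH_PRIME` value is not visible here.
const POLY_HASH_PRIME: u32 = 16_777_619;

/// Plain polynomial hash over the bytes, as in the quoted loop.
fn poly_hash(src: &[u8]) -> u32 {
    src.iter().fold(0u32, |hash, &byte| {
        hash.wrapping_mul(POLY_HASH_PRIME).wrapping_add(u32::from(byte))
    })
}

/// Hypothetical packed gram: upper 24 bits hold the hash, lower 8 the length.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct PackedGram(u32);

impl PackedGram {
    fn from_bytes(src: &[u8]) -> Self {
        // The length must fit into the low byte, otherwise it would
        // corrupt the hash bits.
        assert!(src.len() <= 0xFF, "gram too long to pack");
        Self((poly_hash(src) << 8) | src.len() as u32)
    }

    /// Length in bytes, recovered from the low 8 bits.
    fn len(self) -> usize {
        (self.0 & 0xFF) as usize
    }

    /// The 24 hash bits that survive the shift.
    fn hash24(self) -> u32 {
        self.0 >> 8
    }
}
```

With a maximum gram size of 8, the length check never fires in practice; it only documents the invariant the layout relies on.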
Comment on lines +17 to +21

    pub(crate) struct FixedDeque<const CAP: usize> {
        data: [MaybeUninit<PosStateBytes>; CAP],
        start: u8,
        len: u8,
    }
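
To make the layout above concrete, here is a minimal safe sketch of a fixed-capacity ring buffer with `u8` cursors that maintains a monotone sequence. It is not the PR's implementation: `RingDeque` is a hypothetical type, it stores plain `u32` values instead of `MaybeUninit<PosStateBytes>`, and it keeps values strictly decreasing, which is an assumption (the real deque may order elements differently).

```rust
/// Fixed-capacity ring buffer, loosely modeled on the quoted layout.
struct RingDeque<const CAP: usize> {
    data: [u32; CAP],
    start: u8, // physical index of the front element
    len: u8,   // number of live elements
}

impl<const CAP: usize> RingDeque<CAP> {
    fn new() -> Self {
        // `u8` cursors can address at most 255 slots.
        assert!(CAP > 0 && CAP <= u8::MAX as usize);
        Self { data: [0; CAP], start: 0, len: 0 }
    }

    /// Maps a logical offset from the front to a physical slot.
    fn slot(&self, offset: u8) -> usize {
        (self.start as usize + offset as usize) % CAP
    }

    fn front(&self) -> Option<u32> {
        (self.len > 0).then(|| self.data[self.slot(0)])
    }

    fn back(&self) -> Option<u32> {
        (self.len > 0).then(|| self.data[self.slot(self.len - 1)])
    }

    fn pop_front(&mut self) -> Option<u32> {
        let value = self.front()?;
        self.start = self.slot(1) as u8;
        self.len -= 1;
        Some(value)
    }

    fn pop_back(&mut self) -> Option<u32> {
        let value = self.back()?;
        self.len -= 1;
        Some(value)
    }

    /// Pushes `value` while keeping the contents strictly decreasing from
    /// front to back, so the front is always the current maximum.
    fn push_monotone(&mut self, value: u32) {
        while self.back().map_or(false, |b| b <= value) {
            self.pop_back();
        }
        assert!((self.len as usize) < CAP, "deque is full");
        let slot = self.slot(self.len);
        self.data[slot] = value;
        self.len += 1;
    }
}
```

The `u8` cursors keep the struct small, at the price of limiting `CAP` to 255, which the constructor checks.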
Comment on lines +45 to +49

    for idx in 1..n as u32 {
        let mask = MAX_SPARSE_GRAM_SIZE as usize - 1;
        let end_hash = prefix_hashes[idx as usize & mask]
            .wrapping_mul(POLY_HASH_PRIME)
            .wrapping_add(content[idx as usize] as u32);
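
The quoted loop appears to keep rolling prefix hashes in a ring buffer indexed with `idx & mask`, which works because the window size is a power of two. The sketch below shows that general technique under my own indexing convention, which may differ from the PR's: slot `i & mask` holds the hash of `content[..i]`, and the hash of any substring of up to `WINDOW` bytes follows from two prefix hashes. `WINDOW`, `PRIME`, `poly_hash`, and `substring_hashes` are hypothetical names, and `PRIME` is a stand-in value.

```rust
const WINDOW: usize = 8; // must be a power of two for the mask trick
const PRIME: u32 = 16_777_619; // stand-in multiplier, not the PR's constant

/// Reference polynomial hash, used to check the rolling computation.
fn poly_hash(bytes: &[u8]) -> u32 {
    bytes
        .iter()
        .fold(0u32, |h, &b| h.wrapping_mul(PRIME).wrapping_add(u32::from(b)))
}

/// Returns `(start, end, hash)` for every substring of 1 to `WINDOW` bytes.
fn substring_hashes(content: &[u8]) -> Vec<(usize, usize, u32)> {
    let mask = WINDOW - 1;
    // powers[k] = PRIME^k (wrapping), needed to cancel the prefix part.
    let mut powers = [1u32; WINDOW + 1];
    for k in 1..=WINDOW {
        powers[k] = powers[k - 1].wrapping_mul(PRIME);
    }
    // Slot `i & mask` holds the hash of `content[..i]`; slot 0 starts as
    // the hash of the empty prefix, which is 0.
    let mut prefix = [0u32; WINDOW];
    let mut out = Vec::new();
    for end in 1..=content.len() {
        // Extend the previous prefix hash by one byte.
        let end_hash = prefix[(end - 1) & mask]
            .wrapping_mul(PRIME)
            .wrapping_add(u32::from(content[end - 1]));
        for start in end.saturating_sub(WINDOW)..end {
            // hash(content[start..end]) = H[end] - H[start] * PRIME^(end - start)
            let h = end_hash.wrapping_sub(prefix[start & mask].wrapping_mul(powers[end - start]));
            out.push((start, end, h));
        }
        // Overwrites the oldest prefix hash, which is no longer needed.
        prefix[end & mask] = end_hash;
    }
    out
}
```

Only the last `WINDOW` prefix hashes are ever read, so a fixed array suffices regardless of input length.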
    harness = false

    [dev-dependencies]
    criterion = "0.5"
Comment thread Cargo.toml
Comment on lines 4 to 8

        "crates/*",
        "crates/bpe/benchmarks",
        "crates/bpe/tests",
        "crates/sparse-ngrams",
    ]
    let chunks_len = bytes.len() / 4 * 4;
    for chunk in bytes[..chunks_len].chunks_exact(4) {
        let ptr = chunk.as_ptr() as *const u32;
        let k = unsafe { ptr.read_unaligned() };
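
The unaligned read in this snippet has a safe counterpart. `read_unaligned` yields a native-endian `u32`, so `u32::from_ne_bytes` gives the same value without `unsafe`, while `u32::from_le_bytes` would make the result independent of the platform's byte order. The sketch below compares the two approaches; `words_unsafe` and `words_safe` are hypothetical names, and whether the crate needs platform-independent hashes is not stated on this page.

```rust
/// Mirrors the quoted approach: cast each 4-byte chunk and read unaligned.
fn words_unsafe(bytes: &[u8]) -> Vec<u32> {
    let chunks_len = bytes.len() / 4 * 4;
    bytes[..chunks_len]
        .chunks_exact(4)
        .map(|chunk| {
            let ptr = chunk.as_ptr() as *const u32;
            // Sound because `chunk` is exactly 4 readable bytes and
            // `read_unaligned` has no alignment requirement.
            unsafe { ptr.read_unaligned() }
        })
        .collect()
}

/// Safe equivalent: `chunks_exact` already ignores the trailing remainder,
/// so no manual length rounding is needed.
fn words_safe(bytes: &[u8]) -> Vec<u32> {
    bytes
        .chunks_exact(4)
        .map(|c| u32::from_ne_bytes([c[0], c[1], c[2], c[3]]))
        .collect()
}
```

Compilers typically lower both forms to a single load, so the safe version is unlikely to cost performance, though that is worth confirming with the crate's benchmarks.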