- replace dist/poly per-iteration updates with doubled LUT loads
- use AVX-512 unsigned compare mask for ge_bits
- remove prefix-hash permute via shifted prefix layout
Pull request overview
Introduces a new sparse-ngrams crate implementing sparse n-gram extraction over byte slices (with a fixed max gram size of 8), and adds a new pef crate with Elias–Fano-related implementations/binaries, both wired into the workspace.
Changes:
- Added sparse n-gram extraction (deque + scan variants), compact `NGram` representation, and a frequency-based bigram priority table.
- Added benchmarks/docs for `sparse-ngrams`.
- Added a new `pef` crate (Elias–Fano encoding/decoding utilities and tooling) and included it in the workspace.
| File | Description |
|---|---|
| `Cargo.toml` | Adds `pef` and `sparse-ngrams` to the workspace members. |
| `crates/sparse-ngrams/Cargo.toml` | New crate manifest (including Criterion bench dependency). |
| `crates/sparse-ngrams/README.md` | Documents algorithm, selection criterion, and performance notes. |
| `crates/sparse-ngrams/benchmarks/performance.rs` | Benchmarks deque vs scan extraction variants. |
| `crates/sparse-ngrams/src/lib.rs` | Crate entry point, constants, and exports. |
| `crates/sparse-ngrams/src/extract.rs` | Core sparse-gram extraction implementations + tests. |
| `crates/sparse-ngrams/src/ngram.rs` | `NGram` packed hash+len type + tests. |
| `crates/sparse-ngrams/src/table.rs` | Bigram priority table initialization from embedded data. |
| `crates/sparse-ngrams/src/deque.rs` | Fixed-capacity monotone deque implementation. |
| `crates/sparse-ngrams/src/murmur.rs` | Murmur1 hashing helper for bigrams. |
| `crates/pef/Cargo.toml` | New crate manifest (deps, bins, benches). |
| `crates/pef/src/lib.rs` | `pef` crate module structure and exports. |
| `crates/pef/src/elias_fano.rs` | Elias–Fano encoding/decoder + intersection + tests. |
| `crates/pef/src/batch_decoder.rs` | Batch decoder implementation over Elias–Fano. |
| `crates/pef/src/avx_batch_decoder.rs` | AVX-512 batch decoder implementation + tests. |
| `crates/pef/src/helper.rs` | SIMD-related helper routines (currently not wired into the crate). |
| `crates/pef/src/bin/reorder_docids.rs` | Tooling binary for docid reordering via MST (feature-gated). |
| `crates/pef/src/bin/encode_pisa.rs` | Tooling binary for PISA posting list encoding/analysis + tests. |
| `crates/pef/benches/elias_fano_bench.rs` | Benchmarks for Elias–Fano decoding variants. |
Copilot's findings
- Files reviewed: 10/13 changed files
- Comments generated: 7
Comment on lines +28 to +32

    let mut chars = s.chars();
    let Some((a, b)) = chars.next().zip(chars.next()) else {
        continue;
    };
    let a = (a as u8).to_ascii_lowercase();
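In Rust, casting a `char` to `u8` with `as` keeps only the low 8 bits of the code point, so a non-ASCII character would silently map to an unrelated byte. The following is a minimal sketch of a stricter parse, assuming the table entries are meant to be ASCII bigrams; the helper name `ascii_bigram` is hypothetical and does not appear in the PR.

```rust
/// Hypothetical helper (not part of the PR): returns the first two
/// characters of `s` as lowercased ASCII bytes, or `None` when `s` has
/// fewer than two characters or either character is not ASCII.
fn ascii_bigram(s: &str) -> Option<(u8, u8)> {
    let mut chars = s.chars();
    let a = chars.next()?;
    let b = chars.next()?;
    // Reject non-ASCII input instead of truncating it with `as u8`.
    if !a.is_ascii() || !b.is_ascii() {
        return None;
    }
    Some(((a as u8).to_ascii_lowercase(), (b as u8).to_ascii_lowercase()))
}
```

A caller in a loop could write `let Some((a, b)) = ascii_bigram(s) else { continue; };` to skip malformed entries the same way the snippet above does.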
Comment on lines +43 to +49

    pub fn from_bytes(src: &[u8]) -> Self {
        let mut hash = 0u32;
        for &byte in src {
            hash = hash.wrapping_mul(POLY_HASH_PRIME).wrapping_add(byte as u32);
        }
        Self((hash << 8) | src.len() as u32)
    }
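The packing in `from_bytes` shifts the hash left by 8 bits, so only the low 24 bits of the polynomial hash survive, and the length must fit in the low 8 bits or it would overlap the hash. A minimal sketch of that layout with explicit accessors follows. The type name `PackedGram`, the accessor names, and the multiplier value are assumptions for illustration; the PR's actual `POLY_HASH_PRIME` is not shown in this excerpt.

```rust
/// Assumed multiplier, for illustration only; the PR's real
/// `POLY_HASH_PRIME` value is not visible in this excerpt.
const POLY_HASH_PRIME: u32 = 0x0100_0193;

/// Sketch of a packed n-gram: the upper 24 bits hold the truncated
/// hash, the lower 8 bits hold the length in bytes.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct PackedGram(u32);

impl PackedGram {
    fn from_bytes(src: &[u8]) -> Self {
        // The packing is only unambiguous when the length fits in 8 bits.
        debug_assert!(src.len() <= 0xFF);
        let mut hash = 0u32;
        for &byte in src {
            hash = hash.wrapping_mul(POLY_HASH_PRIME).wrapping_add(byte as u32);
        }
        Self((hash << 8) | src.len() as u32)
    }

    /// Length stored in the low 8 bits.
    fn len(self) -> usize {
        (self.0 & 0xFF) as usize
    }

    /// The 24 hash bits that survive the shift.
    fn hash24(self) -> u32 {
        self.0 >> 8
    }
}
```

With a fixed maximum gram size of 8, the length always fits, so the `debug_assert!` only documents the invariant.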
Comment on lines +17 to +21

    pub(crate) struct FixedDeque<const CAP: usize> {
        data: [MaybeUninit<PosStateBytes>; CAP],
        start: u8,
        len: u8,
    }
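For readers unfamiliar with the pattern, here is a rough sketch of a fixed-capacity ring-buffer deque used as a monotone queue to compute sliding-window minima. It is a safe simplification under stated assumptions: it stores `Copy + Default` values instead of `MaybeUninit<PosStateBytes>`, and the names `RingDeque` and `window_minima` are hypothetical. How the PR's `FixedDeque` is actually used in extraction may differ.

```rust
/// Safe stand-in for a fixed-capacity deque with `u8` start/len fields.
struct RingDeque<T: Copy + Default, const CAP: usize> {
    data: [T; CAP],
    start: u8,
    len: u8,
}

impl<T: Copy + Default, const CAP: usize> RingDeque<T, CAP> {
    fn new() -> Self {
        // `start` and `len` are u8, so the capacity must fit in a u8.
        assert!(CAP > 0 && CAP <= 255);
        Self { data: [T::default(); CAP], start: 0, len: 0 }
    }
    /// Physical slot of the i-th logical element.
    fn slot(&self, i: usize) -> usize {
        (self.start as usize + i) % CAP
    }
    fn front(&self) -> Option<T> {
        (self.len > 0).then(|| self.data[self.slot(0)])
    }
    fn back(&self) -> Option<T> {
        (self.len > 0).then(|| self.data[self.slot(self.len as usize - 1)])
    }
    fn push_back(&mut self, v: T) {
        assert!((self.len as usize) < CAP, "deque full");
        let i = self.slot(self.len as usize);
        self.data[i] = v;
        self.len += 1;
    }
    fn pop_back(&mut self) {
        if self.len > 0 {
            self.len -= 1;
        }
    }
    fn pop_front(&mut self) {
        if self.len > 0 {
            self.start = ((self.start as usize + 1) % CAP) as u8;
            self.len -= 1;
        }
    }
}

/// Minimum of every full window of `window` consecutive values.
fn window_minima(values: &[u32], window: usize) -> Vec<u32> {
    const CAP: usize = 8;
    assert!(window >= 1 && window <= CAP);
    let mut dq: RingDeque<(usize, u32), CAP> = RingDeque::new();
    let mut out = Vec::new();
    for (pos, &v) in values.iter().enumerate() {
        // Entries at the back that are >= v can never be a minimum again.
        while let Some((_, back)) = dq.back() {
            if back >= v { dq.pop_back(); } else { break; }
        }
        // Drop the front entry once it has slid out of the window.
        if let Some((front_pos, _)) = dq.front() {
            if front_pos + window <= pos {
                dq.pop_front();
            }
        }
        dq.push_back((pos, v));
        if pos + 1 >= window {
            out.push(dq.front().unwrap().1);
        }
    }
    out
}
```

Because every entry in the deque lies inside the current window, the deque never holds more than `window` elements, which is why a small fixed capacity is enough.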
Comment on lines +45 to +49

    for idx in 1..n as u32 {
        let mask = MAX_SPARSE_GRAM_SIZE as usize - 1;
        let end_hash = prefix_hashes[idx as usize & mask]
            .wrapping_mul(POLY_HASH_PRIME)
            .wrapping_add(content[idx as usize] as u32);
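The loop above extends a polynomial prefix hash by one byte per step and keeps recent values in a power-of-two circular buffer indexed with `idx & mask`. The sketch below shows the general identity that makes prefix hashes useful: the hash of any substring can be recovered from two prefix hashes and a power of the multiplier. It stores all prefixes in a `Vec` for clarity rather than using a circular buffer, the multiplier value is an assumption, and the function names are hypothetical.

```rust
/// Assumed multiplier, for illustration only.
const POLY_HASH_PRIME: u32 = 0x0100_0193;

/// Direct polynomial hash of a byte slice (wrapping arithmetic mod 2^32).
fn poly_hash(src: &[u8]) -> u32 {
    src.iter().fold(0u32, |h, &b| {
        h.wrapping_mul(POLY_HASH_PRIME).wrapping_add(b as u32)
    })
}

/// prefix[i] is the hash of content[..i]; prefix[0] is 0.
fn prefix_hashes(content: &[u8]) -> Vec<u32> {
    let mut prefix = Vec::with_capacity(content.len() + 1);
    prefix.push(0u32);
    for &b in content {
        let last = *prefix.last().unwrap();
        prefix.push(last.wrapping_mul(POLY_HASH_PRIME).wrapping_add(b as u32));
    }
    prefix
}

/// powers[k] is POLY_HASH_PRIME^k (wrapping), for k in 0..=max_len.
fn prime_powers(max_len: usize) -> Vec<u32> {
    let mut powers = vec![1u32];
    for _ in 0..max_len {
        let last = *powers.last().unwrap();
        powers.push(last.wrapping_mul(POLY_HASH_PRIME));
    }
    powers
}

/// Hash of content[start..end], computed from prefix hashes alone:
/// prefix[end] - prefix[start] * P^(end - start).
fn range_hash(prefix: &[u32], powers: &[u32], start: usize, end: usize) -> u32 {
    prefix[end].wrapping_sub(prefix[start].wrapping_mul(powers[end - start]))
}
```

Since grams are capped at 8 bytes, only the last 8 prefix hashes are ever needed for a lookup like this, which is consistent with indexing a small buffer through a mask.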

    harness = false

    [dev-dependencies]
    criterion = "0.5"
Comment on lines 4 to 8

        "crates/*",
        "crates/bpe/benchmarks",
        "crates/bpe/tests",
        "crates/sparse-ngrams",
    ]

    let chunks_len = bytes.len() / 4 * 4;
    for chunk in bytes[..chunks_len].chunks_exact(4) {
        let ptr = chunk.as_ptr() as *const u32;
        let k = unsafe { ptr.read_unaligned() };
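Reading a `u32` through `read_unaligned` uses the platform's native byte order, so the resulting values differ between little-endian and big-endian targets. One possible safe alternative, sketched here under the assumption that a little-endian interpretation is intended, builds each word with `u32::from_le_bytes`. `chunks_exact` already skips the trailing bytes and exposes them through `remainder()`, so the manual length rounding is not required. The function name `split_words` is hypothetical.

```rust
/// Hypothetical helper: splits `bytes` into little-endian 32-bit words
/// plus the 0..=3 trailing bytes that do not fill a whole word.
fn split_words(bytes: &[u8]) -> (Vec<u32>, &[u8]) {
    let chunks = bytes.chunks_exact(4);
    // `remainder` returns the tail that `chunks_exact` will not yield.
    let tail = chunks.remainder();
    let words = chunks
        .map(|c| u32::from_le_bytes([c[0], c[1], c[2], c[3]]))
        .collect();
    (words, tail)
}
```

On little-endian targets this produces the same words as the unaligned read in the snippet above, without an `unsafe` block.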
Here is a highly optimized sparse n-gram extraction implementation.
This version fixes the maximum gram size at 8, with no option to configure it.
In subsequent PRs, I will add functions for extracting grams from a query string.
It's even more interesting to extract sparse n-grams for a regular expression...