Feature or enhancement
Proposal:
The performance of unicodedata.normalize can be improved with a more efficient implementation of find_nfc_index. This was found while reviewing one of the PRs for gh-149079 where O(N²) behaviour in unicodedata.normalize("NFC", ...) was reported. The running time of the find_nfc_index part in normalize is O(N·t) with t the table size. The dependency on t is due to a linear search. t is a constant, so this is O(N), but with a large factor. This is no security issue. Whether relevant for real-world applications I am not sure on.
Several approaches work to improve performance. Some of the rejected ones:
| approach |
result |
why not |
| Lookup table |
fastest on all cases (~5× on Greek/Cyrillic) |
adds a generated data table (~1.5–3 KB) |
| Galloping (exponential + binary search) |
very fast (~3×) |
needed extra work (an inlined linear prefix, out-of-line tail) to keep the common low-codepoint cases — ASCII, Latin, Greek — from regressing |
| Affine / piecewise formula (no table) |
— |
the codepoint→index map is too non-linear (dense clusters + large gaps) to fit (see image below) |
Both the LUT and galloping work well but were rejected: the LUT adds data, and
galloping only stays regression-free with enough added complexity.
In the PR we still have a linear scan, but the per step cost is almost halved.
Benchmarks
unicodedata.normalize(), the f<N> suffix is the nfc_first index the lookups mostly hit):
| case |
speedup vs main |
decomposed Greek (f44) |
2.16× |
decomposed Cyrillic (f57) |
2.09× |
Vietnamese double-diacritic (f6/f8) |
2.0× |
combining-mark runs, gh-149079 input (f34) |
1.79× |
Vietnamese double-diacritic (f12) |
1.45× |
decomposed Latin (f3) |
1.32× |
NFKC fullwidth→ASCII (f3) |
1.11× |
| ASCII / already-normalized |
unchanged |
Gains grow with how deep the codepoint sits in the table (longer scans benefit
most); ASCII and already-normalized text are unaffected (handled by a fast path).
Benchmark script
import pyperf
import unicodedata
normalize = unicodedata.normalize
nfd = lambda s: normalize("NFD", s)
# gh-149079 attack input: starter + long alternating-CCC combining run (f34)
ISSUE = "a" + (chr(0x0301) + chr(0x0316)) * 50000
# decomposed text by script (low -> deep table positions)
LATIN = nfd("Café déjà vu naïve coördinate Renée " * 80) # f3
GREEK = nfd("ἀνθρώπων μέγιστος ἥρως ὠκεανός φῶς " * 80) # f44
CYRILLIC = nfd("Йёжик в тумане идёт спокойно " * 100) # f57
# once-accented Latin bases + accent -> Ấ/Ế/Ǘ (f6/f8/f12)
IDX6 = (chr(0x00C2) + chr(0x0301)) * 800
IDX8 = (chr(0x00CA) + chr(0x0301)) * 800
IDX12 = (chr(0x00DC) + chr(0x0301)) * 800
# NFKC fullwidth -> ASCII (urllib/idna spoofing-check shape); already-NFC control
FULLWIDTH = "googledocs" * 100
ALREADY = "Café déjà naïve Renée coördinate " * 100
def bench(loops, form, s):
range_it = range(loops)
t0 = pyperf.perf_counter()
for _ in range_it:
normalize(form, s)
return pyperf.perf_counter() - t0
if __name__ == "__main__":
runner = pyperf.Runner()
runner.bench_time_func("nfc_issue_ccc_f34", bench, "NFC", ISSUE)
runner.bench_time_func("nfc_latin_f3", bench, "NFC", LATIN)
runner.bench_time_func("nfc_double_diacritic_f6", bench, "NFC", IDX6)
runner.bench_time_func("nfc_double_diacritic_f8", bench, "NFC", IDX8)
runner.bench_time_func("nfc_double_diacritic_f12", bench, "NFC", IDX12)
runner.bench_time_func("nfc_greek_f44", bench, "NFC", GREEK)
runner.bench_time_func("nfc_cyrillic_f57", bench, "NFC", CYRILLIC)
runner.bench_time_func("nfkc_fullwidth_f3", bench, "NFKC", FULLWIDTH)
runner.bench_time_func("nfc_already", bench, "NFC", ALREADY)
Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
No response
Linked PRs
Feature or enhancement
Proposal:
The performance of
unicodedata.normalizecan be improved with a more efficient implementation offind_nfc_index. This was found while reviewing one of the PRs forgh-149079where O(N²) behaviour inunicodedata.normalize("NFC", ...)was reported. The running time of thefind_nfc_indexpart innormalizeis O(N·t) withtthe table size. The dependency ontis due to a linear search.tis a constant, so this is O(N), but with a large factor. This is no security issue. Whether relevant for real-world applications I am not sure on.Several approaches work to improve performance. Some of the rejected ones:
Both the LUT and galloping work well but were rejected: the LUT adds data, and
galloping only stays regression-free with enough added complexity.
In the PR we still have a linear scan, but the per step cost is almost halved.
Benchmarks
unicodedata.normalize(), thef<N>suffix is thenfc_firstindex the lookups mostly hit):f44)f57)f6/f8)f34)f12)f3)f3)Gains grow with how deep the codepoint sits in the table (longer scans benefit
most); ASCII and already-normalized text are unaffected (handled by a fast path).
Benchmark script
Has this already been discussed elsewhere?
No response given
Links to previous discussion of this feature:
No response
Linked PRs
unicodedata.normalize()#150890