Improve performance of unicodedata.normalize

Feature or enhancement

Proposal:

The performance of unicodedata.normalize can be improved with a more efficient implementation of find_nfc_index. This was found while reviewing one of the PRs for gh-149079 where O(N²) behaviour in unicodedata.normalize("NFC", ...) was reported. The running time of the find_nfc_index part in normalize is O(N·t) with t the table size. The dependency on t is due to a linear search. t is a constant, so this is O(N), but with a large factor. This is no security issue. Whether relevant for real-world applications I am not sure on.

Several approaches work to improve performance. Some of the rejected ones:

approach	result	why not
Lookup table	fastest on all cases (~5× on Greek/Cyrillic)	adds a generated data table (~1.5–3 KB)
Galloping (exponential + binary search)	very fast (~3×)	needed extra work (an inlined linear prefix, out-of-line tail) to keep the common low-codepoint cases — ASCII, Latin, Greek — from regressing
Affine / piecewise formula (no table)	—	the codepoint→index map is too non-linear (dense clusters + large gaps) to fit (see image below)

Both the LUT and galloping work well but were rejected: the LUT adds data, and
galloping only stays regression-free with enough added complexity.

In the PR we still have a linear scan, but the per step cost is almost halved.

Benchmarks

unicodedata.normalize(), the f<N> suffix is the nfc_first index the lookups mostly hit):

case	speedup vs main
decomposed Greek (`f44`)	2.16×
decomposed Cyrillic (`f57`)	2.09×
Vietnamese double-diacritic (`f6`/`f8`)	2.0×
combining-mark runs, gh-149079 input (`f34`)	1.79×
Vietnamese double-diacritic (`f12`)	1.45×
decomposed Latin (`f3`)	1.32×
NFKC fullwidth→ASCII (`f3`)	1.11×
ASCII / already-normalized	unchanged

Gains grow with how deep the codepoint sits in the table (longer scans benefit
most); ASCII and already-normalized text are unaffected (handled by a fast path).

Benchmark script

import pyperf
import unicodedata

normalize = unicodedata.normalize
nfd = lambda s: normalize("NFD", s)

# gh-149079 attack input: starter + long alternating-CCC combining run (f34)
ISSUE = "a" + (chr(0x0301) + chr(0x0316)) * 50000
# decomposed text by script (low -> deep table positions)
LATIN = nfd("Café déjà vu naïve coördinate Renée " * 80)        # f3
GREEK = nfd("ἀνθρώπων μέγιστος ἥρως ὠκεανός φῶς " * 80)          # f44
CYRILLIC = nfd("Йёжик в тумане идёт спокойно " * 100)            # f57
# once-accented Latin bases + accent -> Ấ/Ế/Ǘ (f6/f8/f12)
IDX6 = (chr(0x00C2) + chr(0x0301)) * 800
IDX8 = (chr(0x00CA) + chr(0x0301)) * 800
IDX12 = (chr(0x00DC) + chr(0x0301)) * 800
# NFKC fullwidth -> ASCII (urllib/idna spoofing-check shape); already-NFC control
FULLWIDTH = "ｇｏｏｇｌｅｄｏｃｓ" * 100
ALREADY = "Café déjà naïve Renée coördinate " * 100


def bench(loops, form, s):
    range_it = range(loops)
    t0 = pyperf.perf_counter()
    for _ in range_it:
        normalize(form, s)
    return pyperf.perf_counter() - t0


if __name__ == "__main__":
    runner = pyperf.Runner()
    runner.bench_time_func("nfc_issue_ccc_f34", bench, "NFC", ISSUE)
    runner.bench_time_func("nfc_latin_f3", bench, "NFC", LATIN)
    runner.bench_time_func("nfc_double_diacritic_f6", bench, "NFC", IDX6)
    runner.bench_time_func("nfc_double_diacritic_f8", bench, "NFC", IDX8)
    runner.bench_time_func("nfc_double_diacritic_f12", bench, "NFC", IDX12)
    runner.bench_time_func("nfc_greek_f44", bench, "NFC", GREEK)
    runner.bench_time_func("nfc_cyrillic_f57", bench, "NFC", CYRILLIC)
    runner.bench_time_func("nfkc_fullwidth_f3", bench, "NFKC", FULLWIDTH)
    runner.bench_time_func("nfc_already", bench, "NFC", ALREADY)

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Linked PRs

gh-150889: Improve performance of unicodedata.normalize() #150890

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Improve performance of unicodedata.normalize #150889

Feature or enhancement

Proposal:

Benchmarks

Has this already been discussed elsewhere?

Links to previous discussion of this feature:

Linked PRs

Metadata

Assignees

Labels

Fields

Projects

Milestone

Relationships

Development

Uh oh!

Improve performance of unicodedata.normalize #150889

Description

Feature or enhancement

Proposal:

Benchmarks

Has this already been discussed elsewhere?

Links to previous discussion of this feature:

Linked PRs

Metadata

Metadata

Assignees

Labels

Fields

Projects

Milestone

Relationships

Development

Issue actions