Skip to content

Improve performance of unicodedata.normalize #150889

@eendebakpt

Description

@eendebakpt

Feature or enhancement

Proposal:

The performance of unicodedata.normalize can be improved with a more efficient implementation of find_nfc_index. This was found while reviewing one of the PRs for gh-149079 where O(N²) behaviour in unicodedata.normalize("NFC", ...) was reported. The running time of the find_nfc_index part in normalize is O(N·t) with t the table size. The dependency on t is due to a linear search. t is a constant, so this is O(N), but with a large factor. This is no security issue. Whether relevant for real-world applications I am not sure on.

Several approaches work to improve performance. Some of the rejected ones:

approach result why not
Lookup table fastest on all cases (~5× on Greek/Cyrillic) adds a generated data table (~1.5–3 KB)
Galloping (exponential + binary search) very fast (~3×) needed extra work (an inlined linear prefix, out-of-line tail) to keep the common low-codepoint cases — ASCII, Latin, Greek — from regressing
Affine / piecewise formula (no table) the codepoint→index map is too non-linear (dense clusters + large gaps) to fit (see image below)
Image

Both the LUT and galloping work well but were rejected: the LUT adds data, and
galloping only stays regression-free with enough added complexity.

In the PR we still have a linear scan, but the per step cost is almost halved.

Benchmarks

unicodedata.normalize(), the f<N> suffix is the nfc_first index the lookups mostly hit):

case speedup vs main
decomposed Greek (f44) 2.16×
decomposed Cyrillic (f57) 2.09×
Vietnamese double-diacritic (f6/f8) 2.0×
combining-mark runs, gh-149079 input (f34) 1.79×
Vietnamese double-diacritic (f12) 1.45×
decomposed Latin (f3) 1.32×
NFKC fullwidth→ASCII (f3) 1.11×
ASCII / already-normalized unchanged

Gains grow with how deep the codepoint sits in the table (longer scans benefit
most); ASCII and already-normalized text are unaffected (handled by a fast path).

Benchmark script
import pyperf
import unicodedata

normalize = unicodedata.normalize
nfd = lambda s: normalize("NFD", s)

# gh-149079 attack input: starter + long alternating-CCC combining run (f34)
ISSUE = "a" + (chr(0x0301) + chr(0x0316)) * 50000
# decomposed text by script (low -> deep table positions)
LATIN = nfd("Café déjà vu naïve coördinate Renée " * 80)        # f3
GREEK = nfd("ἀνθρώπων μέγιστος ἥρως ὠκεανός φῶς " * 80)          # f44
CYRILLIC = nfd("Йёжик в тумане идёт спокойно " * 100)            # f57
# once-accented Latin bases + accent -> Ấ/Ế/Ǘ (f6/f8/f12)
IDX6 = (chr(0x00C2) + chr(0x0301)) * 800
IDX8 = (chr(0x00CA) + chr(0x0301)) * 800
IDX12 = (chr(0x00DC) + chr(0x0301)) * 800
# NFKC fullwidth -> ASCII (urllib/idna spoofing-check shape); already-NFC control
FULLWIDTH = "googledocs" * 100
ALREADY = "Café déjà naïve Renée coördinate " * 100


def bench(loops, form, s):
    range_it = range(loops)
    t0 = pyperf.perf_counter()
    for _ in range_it:
        normalize(form, s)
    return pyperf.perf_counter() - t0


if __name__ == "__main__":
    runner = pyperf.Runner()
    runner.bench_time_func("nfc_issue_ccc_f34", bench, "NFC", ISSUE)
    runner.bench_time_func("nfc_latin_f3", bench, "NFC", LATIN)
    runner.bench_time_func("nfc_double_diacritic_f6", bench, "NFC", IDX6)
    runner.bench_time_func("nfc_double_diacritic_f8", bench, "NFC", IDX8)
    runner.bench_time_func("nfc_double_diacritic_f12", bench, "NFC", IDX12)
    runner.bench_time_func("nfc_greek_f44", bench, "NFC", GREEK)
    runner.bench_time_func("nfc_cyrillic_f57", bench, "NFC", CYRILLIC)
    runner.bench_time_func("nfkc_fullwidth_f3", bench, "NFKC", FULLWIDTH)
    runner.bench_time_func("nfc_already", bench, "NFC", ALREADY)

Has this already been discussed elsewhere?

No response given

Links to previous discussion of this feature:

No response

Linked PRs

Metadata

Metadata

Assignees

No one assigned

    Labels

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions