gh-149079: Optimize sorting in unicodedata.normalize() #150782

Open
serhiy-storchaka wants to merge 1 commit into
python:main from
serhiy-storchaka:unicodedata-normalize-optimize
Member

serhiy-storchaka commented Jun 2, 2026

Sort the Py_UCS4 buffer instead of the PyUnicodeObject. This avoids the use of PyUnicode_READ() and PyUnicode_WRITE().
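
For readers unfamiliar with the sorting step, here is a rough Python sketch of canonical ordering applied to a plain list of code points, which stands in for the Py_UCS4 buffer. It is an illustration only, not the C code in this PR: the function name `canonical_order` is hypothetical, and the stable insertion sort is an assumed strategy chosen for clarity.

```python
import unicodedata


def canonical_order(text: str) -> str:
    """Reorder combining marks by canonical combining class (illustrative)."""
    # A list of ints plays the role of a fixed-width Py_UCS4 buffer:
    # every element is a full code point, so reads and writes are plain
    # indexing with no per-access dispatch on the string's storage width.
    buf = [ord(ch) for ch in text]
    for i in range(1, len(buf)):
        cp = buf[i]
        cc = unicodedata.combining(chr(cp))
        if cc == 0:
            continue  # starters never move
        j = i
        while j > 0:
            prev_cc = unicodedata.combining(chr(buf[j - 1]))
            # Stop at a starter (class 0) or at a mark that already sorts
            # before or equal to this one; "<=" keeps the sort stable.
            if prev_cc == 0 or prev_cc <= cc:
                break
            buf[j] = buf[j - 1]
            j -= 1
        buf[j] = cp
    return "".join(map(chr, buf))
```

For input that has no decomposable characters, such as the benchmark strings used in this conversation, the result should agree with `unicodedata.normalize("NFD", text)`.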

Member Author

serhiy-storchaka commented Jun 2, 2026

./python -m timeit 's=("a"+"\u0300\u0327"*1000)*100; from unicodedata import normalize' -- 'normalize("NFC", s)'

Baseline: 100 loops, best of 5: 3.76 msec per loop
This PR: 100 loops, best of 5: 3.57 msec per loop

./python -m timeit 's=("a"+"\u0300\u0327"*9)*10000; from unicodedata import normalize' -- 'normalize("NFC", s)'

Baseline: 100 loops, best of 5: 3.99 msec per loop
This PR: 100 loops, best of 5: 3.84 msec per loop
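
The same measurements can be approximated from a script with the `timeit` module. This is a sketch under assumptions: the helper names `build_input` and `best_msec` are hypothetical, and wrapping the call in a lambda adds a small overhead that the command-line form does not have, so absolute numbers will differ slightly.

```python
import timeit
from unicodedata import normalize


def build_input(pairs: int, groups: int) -> str:
    """One base letter followed by `pairs` copies of U+0300 U+0327, repeated."""
    return ("a" + "\u0300\u0327" * pairs) * groups


def best_msec(s: str, number: int = 100, repeat: int = 5) -> float:
    """Best per-call time in milliseconds, like 'N loops, best of R'."""
    timer = timeit.Timer(lambda: normalize("NFC", s))
    return min(timer.repeat(repeat=repeat, number=number)) / number * 1000


if __name__ == "__main__":
    # Few long runs of marks, then many short runs, as in the commands above.
    for pairs, groups in [(1000, 100), (9, 10000)]:
        s = build_input(pairs, groups)
        print(f"{pairs} pairs x {groups} groups ({len(s)} code points): "
              f"{best_msec(s):.2f} msec per loop")
```

The two inputs are of similar total length (200100 and 190000 code points) but differ in how long each run of combining marks is, which is what exercises the sort differently.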

@eendebakpt
Contributor

@serhiy-storchaka On your benchmark I can improve the time from 3.75 ms to 2.0 ms by using a more efficient search in find_nfc_index. See main...eendebakpt:gh-149079-find-nfc-index. The changes in this PR look good at first glance.
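
To illustrate the general idea, here is a Python sketch comparing a linear scan with a binary search over a sorted table of code point ranges. It assumes the lookup walks a sorted table of (start, count, index) entries; the table contents and function names below are hypothetical, and the linked branch may use a different technique.

```python
from bisect import bisect_right

# Hypothetical table, sorted by start. Each entry covers the code points
# start .. start + count and maps them to index .. index + count.
RANGES = [(0x3C, 2, 0), (0x41, 0, 3), (0x45, 0, 4), (0x61, 5, 5)]
STARTS = [start for start, _, _ in RANGES]


def find_index_linear(code: int) -> int:
    """Scan entries in order: O(n) in the number of ranges."""
    for start, count, index in RANGES:
        if code < start:
            return -1  # table is sorted, so no later entry can match
        if code <= start + count:
            return index + (code - start)
    return -1


def find_index_bisect(code: int) -> int:
    """Binary search for the last entry whose start is <= code: O(log n)."""
    pos = bisect_right(STARTS, code) - 1
    if pos < 0:
        return -1
    start, count, index = RANGES[pos]
    if code <= start + count:
        return index + (code - start)
    return -1
```

Both functions return the same result for every code point as long as the ranges are sorted and do not overlap; only the number of comparisons differs.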

@serhiy-storchaka
Member Author

by using a more efficient search in find_nfc_index.

Looks interesting. But this is a different issue, not directly related to #149079. Can you open a new issue?


2 participants