gh-150815: Speed up copy.deepcopy() of containers with atomic elements #150822
gaborbernat wants to merge 3 commits into
- def _deepcopy_list(x, memo, deepcopy=deepcopy):
+ def _deepcopy_list(x, memo, deepcopy=deepcopy, _atomic=_atomic_types):
Is this trick of capturing the global in a local still a net positive in newer Python?
The benchmark numbers I provided were done against a build on the main branch. Now I haven't tried enabling the JIT or any other advanced features, but out of the box there is a significant benefit here.
But did those gains come from using a local, or only from doing the type(...) membership check inline?
I think we should let the performance people do their thing and not try to artificially speed things up like this.
There are more options available to make deepcopy faster (see #91610 (comment)). If we want to make deepcopy faster, I believe we should gather enough support so a core dev can review a C (or Rust?) implementation. With a C implementation we have much larger performance gains (see https://github.com/percolab/copium for example).
I don't think a pure-Python tweak and a C/Rust …
I've also reduced this PR to the minimal change: a 3-way benchmark (base / inline check with local capture / inline check using the global) showed the entire speedup comes from the inlined type(...) in _atomic_types check, not from the local capture.
The dict, list and tuple deep-copiers send every element back through deepcopy(), paying a function call even for atomic immutable elements that deepcopy() returns unchanged. Inline the atomic-type check into the three copiers so those elements are returned as-is. Behavior is identical, including shared references, recursion and int/tuple subclasses.
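The inlining described in this commit message can be sketched roughly as follows. This is a minimal sketch and not the code in the PR: `ATOMIC_TYPES` is an illustrative subset standing in for `_atomic_types`, `deepcopy_list_inline` is a hypothetical name, and non-atomic elements are simply handed to the stdlib `copy.deepcopy()`.

```python
import copy

# Illustrative subset only; the real set in Lib/copy.py is `_atomic_types`
# and contains more entries than shown here.
ATOMIC_TYPES = frozenset({type(None), int, float, bool, complex, bytes, str})


def deepcopy_list_inline(x, memo):
    """Hypothetical stand-in for a list copier with the atomic check inlined."""
    y = []
    memo[id(x)] = y  # register before recursing so cycles resolve to y
    append = y.append
    for a in x:
        if type(a) in ATOMIC_TYPES:
            append(a)  # atomic: reuse the same object, no extra call
        else:
            append(copy.deepcopy(a, memo))  # everything else: normal path
    return y


data = ["a", 1, None, 2.5, [1, 2]]
result = deepcopy_list_inline(data, {})
```

Because the check is an exact `type(a) in ...` test, an instance of an `int` or `tuple` subclass is not treated as atomic here and still goes through `copy.deepcopy()`.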
The benchmark shows the speedup comes entirely from the inlined type(...) in _atomic_types check, not from binding the global to a local default argument, so drop the local capture for a minimal change.
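For context, the two lookup styles compared in the review thread look roughly like this. It is a sketch under assumed names (`ATOMIC`, `count_atomic_global`, `count_atomic_local` and `compare` are all hypothetical), using `timeit` rather than pyperf; timings depend on the interpreter build, so no numbers are claimed.

```python
import timeit

ATOMIC = frozenset({int, str, float})  # hypothetical stand-in set


def count_atomic_global(items):
    n = 0
    for a in items:
        if type(a) in ATOMIC:  # ATOMIC is looked up as a global each iteration
            n += 1
    return n


def count_atomic_local(items, _atomic=ATOMIC):
    n = 0
    for a in items:
        if type(a) in _atomic:  # bound once as a default argument, read as a local
            n += 1
    return n


items = list(range(1000)) + [[], {}]


def compare(number=200):
    """Return wall-clock seconds for each variant; values vary by machine."""
    return {
        fn.__name__: timeit.timeit(lambda fn=fn: fn(items), number=number)
        for fn in (count_atomic_global, count_atomic_local)
    }


if __name__ == "__main__":
    for name, seconds in compare().items():
        print(f"{name}: {seconds:.4f} s")
```

Both functions compute the same result; only the name resolution differs, which is the part the 3-way benchmark found not to matter.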
Address review: the inline atomic check recomputed type() for non-atomic elements (once inline, once inside deepcopy()), a measurable regression on non-atomic-heavy containers. Split deepcopy() into a thin entry plus _deepcopy_fallback() that takes the already-computed class, so each element is typed exactly once whether or not it is atomic. Apply the same pattern to the list, tuple, dict and frozendict copiers.
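The entry/fallback split can be illustrated with a small sketch. It assumes hypothetical names (`deepcopy_entry`, `deepcopy_fallback`, `copy_list`, `ATOMIC_TYPES`) and only handles lists itself, deferring every other type to the stdlib `copy.deepcopy()`; the actual patch dispatches inside `Lib/copy.py` and is not reproduced here.

```python
import copy

# Illustrative subset standing in for `_atomic_types`.
ATOMIC_TYPES = frozenset({type(None), int, float, bool, complex, bytes, str})
_MISSING = object()  # sentinel: distinguishes "not in memo" from a stored None


def deepcopy_entry(x, memo=None):
    """Thin entry point: type the object once, return atomics immediately."""
    cls = type(x)
    if cls in ATOMIC_TYPES:
        return x
    return deepcopy_fallback(x, cls, {} if memo is None else memo)


def deepcopy_fallback(x, cls, memo):
    """Slow path; `cls` was already computed by the caller."""
    y = memo.get(id(x), _MISSING)
    if y is not _MISSING:
        return y  # already copied: preserves sharing and cycles
    if cls is list:
        return copy_list(x, memo)
    # Sketch only: other types go to the stdlib, which types x again.
    return copy.deepcopy(x, memo)


def copy_list(x, memo):
    y = []
    memo[id(x)] = y
    append = y.append
    for a in x:
        cls = type(a)  # computed exactly once per element
        if cls in ATOMIC_TYPES:
            append(a)
        else:
            append(deepcopy_fallback(a, cls, memo))
    return y


cyclic = [1, "s", {"k": [2.0]}]
cyclic.append(cyclic)  # the list contains itself
clone = deepcopy_entry(cyclic)
```

The point of the shape is that the per-element loop and the fallback share one `type()` result, so a non-atomic element no longer pays for the check twice.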
@Bobronium is right: the first version ran …
Method
I compared each version of …
Options tested
Results (speedup vs base)
| Benchmark | base | 1. inline | 2. inline+local | 3. single-type |
|---|---|---|---|---|
| json_corpus (atomic-heavy) | 1.777 ms | 1.34x | 1.39x | 1.40x |
| list[int] (pure atomic) | 0.131 ms | 2.11x | 2.27x | 2.19x |
| empties (worst non-atomic) | 1.804 ms | 0.96x | 0.96x | 0.98x |
| nested dict (realistic) | 8.817 ms | 1.13x | 1.15x | 1.18x |
| list[Obj] (non-atomic obj) | 6.470 ms | 1.05x | 1.07x | 1.06x |
How I chose
Option 2 (local capture) buys about 3% on the atomic cases but leaves the worst case at 0.96x, since it removes a global lookup, not the second type() call. It does not address the review. (I also tried option 3 plus local capture; it came out slightly worse than plain option 3 on the worst case and the realistic case, so the extra binding was not worth it.)
Option 3 wins on every axis that matters: it has the smallest regression (0.98x vs 0.96x, halving the worst-case loss), it is fastest on the realistic mixed workload (+18%), and it keeps the atomic gains (1.40x and 2.19x). The regression that remains is narrow: it needs non-atomic elements holding no atomic data anywhere in their subtree. Any atomic leaf tips the structure positive, which is why nested dict lands at +18%.
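One way to read the condition for the remaining regression is as a predicate over the structure. The sketch below is my interpretation, not part of the benchmark: `has_atomic_leaf` and `ATOMIC_TYPES` are hypothetical, only dict, list and tuple containers are walked, and the `empties` value is a guess at what such a workload might contain.

```python
# Illustrative subset standing in for `_atomic_types`.
ATOMIC_TYPES = frozenset({type(None), int, float, bool, complex, bytes, str})


def has_atomic_leaf(obj):
    """True if any atomic value appears anywhere in obj's subtree."""
    if type(obj) in ATOMIC_TYPES:
        return True
    if isinstance(obj, dict):
        return any(has_atomic_leaf(k) or has_atomic_leaf(v) for k, v in obj.items())
    if isinstance(obj, (list, tuple)):
        return any(has_atomic_leaf(e) for e in obj)
    return False  # other objects are not inspected in this sketch


empties = [[], {}, (), [[]], {(): []}]  # no atomic data anywhere
nested = {"a": {"b": [[], 1]}}          # string keys and an int leaf
```

Under this reading, `empties` is the only kind of input that stays below 1.00x, while anything shaped like `nested` benefits from the fast path.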
The branch now uses option 3, applied to the list, tuple, dict and frozendict copiers. test_copy (83 tests) and test_pickle pass. Raw pyperf JSON for all runs is available if anyone wants to verify.
copy.deepcopy() copies a structure by sending every element back through deepcopy(). For elements that need no copying at all (strings, ints, None, booleans, floats and the other immutable atomic types) that round trip costs a full function call each, even though the value handed back is the same object. Real data is dominated by these atomic leaves: a parsed JSON document, a settings dict cloned before mutation, a record copied inside a framework. The keys are strings and most values are strings and numbers, so copying spends most of its time calling deepcopy() only to get the same object straight back.
This folds the atomic-type check that already gates the top of deepcopy() into the dict, list and tuple copiers, so an atomic element is returned as-is without the per-item call. The check is the same one deepcopy() runs, and atomic objects are not memoized either way, so the result is identical for shared references, recursive structures and int/tuple subclasses.
Deep-copying 105 JSON documents drawn from the top-1000 PyPI projects improves from 1.21 ms to 990 µs, 22% faster. This follows the atomic fast path added in gh-114264, extending it from the entry point to the per-element loop.
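The behaviors listed above as unchanged can be checked against any interpreter with the stdlib alone. This is a small illustrative check, not taken from test_copy; `MyInt` and the sample data are made up for the example.

```python
import copy


class MyInt(int):
    """An int subclass; not an exact atomic type, so it is really copied."""


shared = [1, 2]
data = {
    "name": "x",
    "none": None,
    "a": shared,
    "b": shared,      # same list referenced twice
    "sub": MyInt(7),
}
data["self"] = data   # recursive structure

clone = copy.deepcopy(data)
```

After the copy, `clone["a"] is clone["b"]` still holds (sharing is preserved through the memo), `clone["self"] is clone` (the cycle is rebuilt), `clone["name"] is data["name"]` (atomic values are returned as the same object), and `type(clone["sub"])` is still `MyInt`.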
Benchmark (pyperf)
Run base vs patched by swapping Lib/copy.py on the same interpreter. The figure above is from 105 JSON documents in the top-1000 PyPI corpus; the self-contained script below builds an equivalent atomic-heavy structure and shows a comparable percentage gain.
Resolves #150815.
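The author's benchmark script did not survive in this copy of the page. As a rough stand-in, an atomic-heavy structure can be timed along these lines; this sketch uses `timeit` instead of pyperf to stay stdlib-only, and `build_atomic_heavy`, `bench` and the record layout are assumptions of mine, not the PR's corpus.

```python
import copy
import timeit


def build_atomic_heavy(n_records=200):
    """JSON-like records: string keys, mostly string/number/None values."""
    return [
        {
            "id": i,
            "name": f"pkg-{i}",
            "version": "1.0.0",
            "yanked": False,
            "size": i * 1.5,
            "summary": None,
            "tags": ["a", "b", "c"],
        }
        for i in range(n_records)
    ]


doc = build_atomic_heavy()


def bench(repeat=5, number=20):
    """Best-of-`repeat` seconds per copy.deepcopy(doc) call."""
    timer = timeit.repeat(lambda: copy.deepcopy(doc), repeat=repeat, number=number)
    return min(timer) / number


if __name__ == "__main__":
    print(f"{bench() * 1e3:.3f} ms per deepcopy")
```

Running it once with the base `Lib/copy.py` and once with the patched file on the same interpreter gives the two numbers to compare; pyperf remains the better tool for publishable figures because it controls for process-level noise.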