gh-150815: Speed up copy.deepcopy() of containers with atomic elements #150822

Open
gaborbernat wants to merge 3 commits into python:main from gaborbernat:opt/deepcopy-inline-atomic

@gaborbernat
Contributor

@gaborbernat gaborbernat commented Jun 2, 2026

copy.deepcopy() copies a structure by sending every element back through deepcopy(). For elements that need no copying at all — strings, ints, None, booleans, floats and the other immutable atomic types — that round trip costs a full function call each, even though the value handed back is the same object. Real data is dominated by these atomic leaves: a parsed JSON document, a settings dict cloned before mutation, a record copied inside a framework. The keys are strings and most values are strings and numbers, so copying spends most of its time calling deepcopy() only to get the same object straight back.
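
A small check of the claim above, as a sketch rather than part of the patch: deep-copying an atomic value hands back the very same object, while containers are rebuilt. The sample values are made up for illustration.

```python
import copy

# Atomic leaves: deepcopy() returns the identical object, not a new one.
leaves = ["text", 42, None, True, 3.14]
same_object = [copy.deepcopy(leaf) is leaf for leaf in leaves]

# Containers are rebuilt, but their atomic leaves are still shared.
doc = {"name": "example", "tags": ["a", "b"]}
clone = copy.deepcopy(doc)

print(all(same_object))                  # every atomic leaf came back as-is
print(clone is not doc)                  # the dict itself is a new object
print(clone["tags"] is not doc["tags"])  # so is the nested list
print(clone["name"] is doc["name"])      # the string leaf is shared
```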

This folds the atomic-type check that already gates the top of deepcopy() into the dict, list and tuple copiers, so an atomic element is returned as-is without the per-item call. The check is the same one deepcopy() runs, and atomic objects are not memoized either way, so the result is identical for shared references, recursive structures and int/tuple subclasses.
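
For readers who want to see the shape of the change, here is a minimal sketch of the inline guard for the list copier. It is not the code in Lib/copy.py: `ATOMIC_TYPES` is a hand-written stand-in for the private `_atomic_types` set (an assumption, and deliberately incomplete), and `deepcopy_list_sketch` is a hypothetical name.

```python
import copy

# Assumption: stand-in for the private _atomic_types set; not the full list.
ATOMIC_TYPES = frozenset({type(None), bool, int, float, complex, str, bytes})


def deepcopy_list_sketch(x, memo):
    """Copy a list, skipping the deepcopy() call for atomic elements."""
    y = []
    memo[id(x)] = y  # register early so recursive references resolve to y
    for a in x:
        # Exact type match: subclasses such as an int subclass are NOT
        # treated as atomic and still go through copy.deepcopy().
        y.append(a if type(a) in ATOMIC_TYPES else copy.deepcopy(a, memo))
    return y


class MyInt(int):
    """Hypothetical int subclass used to show the exact-type check."""


inner = [1, 2]
source = [inner, inner, "leaf", MyInt(5)]
result = deepcopy_list_sketch(source, {})

recursive = []
recursive.append(recursive)
recursive_copy = deepcopy_list_sketch(recursive, {})
```

The shared `inner` list is copied once and both slots in `result` point at that one copy, the recursive list copies to a list that contains itself, and the `MyInt` element keeps its type because it takes the normal deepcopy() path.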

Deep-copying 105 JSON documents drawn from the top-1000 PyPI projects improves from 1.21 ms to 990 µs, 22% faster. This follows the atomic fast path added in gh-114264, extending it from the entry point to the per-element loop.

| Benchmark | base | patched |
|---|---|---|
| deepcopy 105 real corpus JSON objects | 1.21 ms | 990 µs: 22% faster |
Benchmark (pyperf)

Run base vs patched by swapping Lib/copy.py on the same interpreter. The figure above is from 105 JSON documents in the top-1000 PyPI corpus; the self-contained script below builds an equivalent atomic-heavy structure and shows a comparable percentage gain.

import copy, pyperf

# Representative of parsed-JSON / config data: string keys, scalar leaves.
doc = {
    "name": "example-package", "version": "1.2.3", "private": False,
    "scripts": {"build": "tsc", "test": "pytest", "lint": "ruff check ."},
    "keywords": ["cli", "async", "http", "json"],
    "dependencies": {f"dep{i}": f"^{i}.0.0" for i in range(20)},
    "authors": [{"name": f"Person {i}", "email": f"p{i}@example.com", "active": True} for i in range(10)],
    "config": {"timeout": 30, "retries": 3, "verbose": False, "level": None},
}
objs = [doc] * 50

runner = pyperf.Runner()
runner.bench_func("deepcopy atomic-heavy structures", lambda: [copy.deepcopy(o) for o in objs])
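
One possible way to compare two versions of Lib/copy.py on the same interpreter is to load each file by path under its own module name. This is only a sketch of that idea and an assumption about tooling, not necessarily how the figures above were produced; `load_copy_variant` is a hypothetical helper, and the demo loads the interpreter's own copy.py as a stand-in for a patched file.

```python
import copy as stdlib_copy
import importlib.util


def load_copy_variant(path, module_name):
    """Execute the copy.py found at `path` as a separate module object."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module


# Stand-in path: replace with the base or patched Lib/copy.py to compare.
variant = load_copy_variant(stdlib_copy.__file__, "copy_variant")

sample = {"name": "example", "tags": ["a", "b"], "level": None}
variant_clone = variant.deepcopy(sample)
```

The loaded module's `deepcopy` can then be passed to `runner.bench_func` in place of `copy.deepcopy`.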

Resolves #150815.

Comment thread Lib/copy.py Outdated


-def _deepcopy_list(x, memo, deepcopy=deepcopy):
+def _deepcopy_list(x, memo, deepcopy=deepcopy, _atomic=_atomic_types):
Member

Is this trick of capturing the global in a local still a net positive in newer Python?

Contributor Author

The benchmark numbers I provided were done against a build on the main branch. Now I haven't tried enabling the JIT or any other advanced features, but out of the box there is a significant benefit here.

Member

But did those gains come from using a local or only from doing the type(...) in check?

Member

I think we should let the performance people do their thing and not try to artificially speed things up like this.

@eendebakpt
Contributor

There are more options available to make deepcopy faster (see #91610 (comment)).

If we want to make deepcopy faster, I believe we should gather enough support so a core dev can review a C (or Rust?) implementation. A C implementation would give much larger performance gains (see https://github.com/percolab/copium for example).

@gaborbernat
Contributor Author

I don't think a pure-Python tweak and a C/Rust deepcopy are opposing directions — they optimize different ends and can coexist. This change helps every build today with no extension to compile and no new maintenance surface, and it doesn't block or complicate a future C implementation; if one lands, this just becomes a small fast path that the C version supersedes.

I've also reduced this PR to the minimal change: a 3-way benchmark (base / inline-check-with-local-capture / inline-check-using-the-global) showed the entire speedup comes from the inlined type(...) in _atomic_types check, and the local-variable capture added nothing measurable, so I dropped it.
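
A rough way to repeat that kind of 3-way comparison is sketched below. It uses `timeit` on standalone list copiers instead of pyperf on a patched Lib/copy.py, so it only approximates the real measurement; the function names and the `ATOMIC_TYPES` stand-in are assumptions made for illustration.

```python
import copy
import timeit

# Assumption: stand-in for the private _atomic_types set; not the full list.
ATOMIC_TYPES = frozenset({type(None), bool, int, float, complex, str, bytes})


def copy_base(items, memo):
    """Baseline: every element goes back through deepcopy()."""
    return [copy.deepcopy(a, memo) for a in items]


def copy_inline_global(items, memo):
    """Inline check; the set and deepcopy are looked up as globals."""
    return [a if type(a) in ATOMIC_TYPES else copy.deepcopy(a, memo) for a in items]


def copy_inline_local(items, memo, _atomic=ATOMIC_TYPES, _deepcopy=copy.deepcopy):
    """Inline check with the same names captured as default arguments."""
    return [a if type(a) in _atomic else _deepcopy(a, memo) for a in items]


def best_time(func, items, repeat=5, number=200):
    """Best per-call time in seconds over `repeat` timeit rounds."""
    timings = timeit.repeat(lambda: func(items, {}), repeat=repeat, number=number)
    return min(timings) / number


if __name__ == "__main__":
    data = list(range(1000))  # pure-atomic workload
    for func in (copy_base, copy_inline_global, copy_inline_local):
        print(f"{func.__name__:20s} {best_time(func, data) * 1e6:8.1f} µs")
```

Absolute numbers from a script like this depend on the machine and build, so only the relative ordering is worth reading.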

Comment thread Lib/copy.py Outdated
…lements

The dict, list and tuple deep-copiers send every element back through
deepcopy(), paying a function call even for atomic immutable elements that
deepcopy() returns unchanged. Inline the atomic-type check into the three
copiers so those elements are returned as-is. Behavior is identical, including
shared references, recursion and int/tuple subclasses.

The benchmark shows the speedup comes entirely from the inlined
type(...) in _atomic_types check, not from binding the global to a
local default argument, so drop the local capture for a minimal change.

Address review: the inline atomic check recomputed type() for non-atomic
elements (once inline, once inside deepcopy()), a measurable regression on
non-atomic-heavy containers. Split deepcopy() into a thin entry plus
_deepcopy_fallback() that takes the already-computed class, so each element
is typed exactly once whether or not it is atomic. Apply the same pattern to
the list, tuple, dict and frozendict copiers.
@gaborbernat gaborbernat force-pushed the opt/deepcopy-inline-atomic branch from 6ceb409 to 0c3f9e1 Compare June 3, 2026 21:58
@gaborbernat
Contributor Author

@Bobronium is right: the first version ran type() twice for every non-atomic element, once in the inline guard and once again inside deepcopy(). I measured the cost, tested three designs to address it, and picked the one that is fastest with the smallest regression.

Method

I compared each version of Lib/copy.py on the same built interpreter (3.16.0a0), loaded as the copy module, driven by pyperf (1 warmup + 3 values × 20 processes, 5 independent runs per version). Coefficient of variation across the 5 run-means stayed at or below 1.1% on every workload, so the gaps below are real rather than jitter. Workloads span all-atomic to all-non-atomic and include the worst case for the regression (empties: a list of 2000 empty lists, i.e. non-atomic elements with no atomic data to amortize the check against).
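
The stability check mentioned above can be sketched as follows. The run means here are made-up placeholder values, not the measured data, and the use of the sample standard deviation is an assumption about how the coefficient of variation was computed.

```python
import statistics


def coefficient_of_variation(run_means):
    """Sample standard deviation divided by the mean (dimensionless)."""
    return statistics.stdev(run_means) / statistics.fmean(run_means)


# Hypothetical means (ms) of 5 independent runs of one workload.
run_means_ms = [1.770, 1.781, 1.775, 1.790, 1.769]

cv = coefficient_of_variation(run_means_ms)
print(f"CV = {cv:.2%}")
```

A gap between two versions is only meaningful when it is clearly larger than this spread.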

Options tested

  1. Inline guard (the first push): a if type(a) in _atomic_types else deepcopy(a, memo) in each copier. Skips the call for atomic elements but types non-atomic elements twice.
  2. Inline guard + local capture: same, with type and _atomic_types bound to locals/default args in each copier.
  3. Single type(): split deepcopy() into a thin entry plus an internal _deepcopy_fallback(x, memo, cls) that takes the already-computed class. Each element is typed once whether or not it is atomic.
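
Option 3 can be pictured with the sketch below. It is a simplified illustration under stated assumptions, not the code on the branch: only the list copier is reimplemented, every other type is delegated to the stdlib `copy.deepcopy`, the keep-alive bookkeeping of the real module is omitted, and all `*_sketch` names and `ATOMIC_TYPES` are hypothetical.

```python
import copy

# Assumption: stand-in for the private _atomic_types set; not the full list.
ATOMIC_TYPES = frozenset({type(None), bool, int, float, complex, str, bytes})
_MISSING = object()  # sentinel for "not in memo"


def deepcopy_sketch(x, memo=None):
    """Thin entry point: type the object once, return atomics as-is."""
    cls = type(x)
    if cls in ATOMIC_TYPES:
        return x
    if memo is None:
        memo = {}
    return _deepcopy_fallback_sketch(x, memo, cls)


def _deepcopy_fallback_sketch(x, memo, cls):
    """Slow path that receives the already-computed class."""
    y = memo.get(id(x), _MISSING)
    if y is not _MISSING:
        return y  # already copied: shared or recursive reference
    if cls is list:
        return _copy_list_sketch(x, memo)
    # Everything else is handed to the stdlib, sharing the same memo.
    return copy.deepcopy(x, memo)


def _copy_list_sketch(x, memo):
    y = []
    memo[id(x)] = y
    for a in x:
        cls = type(a)  # the only type() call for this element
        y.append(a if cls in ATOMIC_TYPES else _deepcopy_fallback_sketch(a, memo, cls))
    return y


shared = []
nested = [shared, shared, {"k": shared}, "leaf"]
nested_copy = deepcopy_sketch(nested)

loop = [1]
loop.append(loop)
loop_copy = deepcopy_sketch(loop)
```

Passing `cls` down is what removes the second `type()` call: the guard in the container loop and the dispatch in the fallback both use the same value.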

Results (speedup vs main, higher is better)

| Benchmark | base | 1. inline | 2. inline+local | 3. single-type |
|---|---|---|---|---|
| json_corpus (atomic-heavy) | 1.777 ms | 1.34x | 1.39x | 1.40x |
| list[int] (pure atomic) | 0.131 ms | 2.11x | 2.27x | 2.19x |
| empties (worst non-atomic) | 1.804 ms | 0.96x | 0.96x | 0.98x |
| nested dict (realistic) | 8.817 ms | 1.13x | 1.15x | 1.18x |
| list[Obj] (non-atomic obj) | 6.470 ms | 1.05x | 1.07x | 1.06x |

How I chose

Option 2 (local capture) buys about 3% on the atomic cases but leaves the worst case at 0.96x, since it removes a global lookup, not the second type() call. It does not address the review. (I also tried option 3 plus local capture; it came out slightly worse than plain option 3 on the worst case and the realistic case, so the extra binding was not worth it.)

Option 3 wins on every axis that matters: it has the smallest regression (0.98x vs 0.96x, halving the worst-case loss), it is fastest on the realistic mixed workload (+18%), and it keeps the atomic gains (1.40x and 2.19x). The regression that remains is narrow: it needs non-atomic elements holding no atomic data anywhere in their subtree. Any atomic leaf tips the structure positive, which is why nested dict lands at +18%.
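
The percentages quoted above follow from the speedup ratios in the results table. The sketch below reproduces that arithmetic for the option 1 and option 3 columns, reading a ratio of 1.18x as +18% the same way the text does; the helper name is hypothetical.

```python
# Speedup ratios vs main, copied from the results table (options 1 and 3).
speedup_vs_main = {
    "json_corpus": {"inline": 1.34, "single_type": 1.40},
    "list[int]": {"inline": 2.11, "single_type": 2.19},
    "empties": {"inline": 0.96, "single_type": 0.98},
    "nested dict": {"inline": 1.13, "single_type": 1.18},
    "list[Obj]": {"inline": 1.05, "single_type": 1.06},
}


def percent_change(speedup):
    """Express a speedup ratio as a signed percentage relative to main."""
    return (speedup - 1.0) * 100.0


for name, row in speedup_vs_main.items():
    inline = percent_change(row["inline"])
    single = percent_change(row["single_type"])
    print(f"{name:12s} inline {inline:+6.1f}%   single-type {single:+6.1f}%")
```

On the empties workload the loss shrinks from about 4% to about 2%, which is the "halving" referred to above.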

The branch now uses option 3, applied to the list, tuple, dict and frozendict copiers. test_copy (83 tests) and test_pickle pass. Raw pyperf JSON for all runs is available if anyone wants to verify.

Development

Successfully merging this pull request may close these issues.

Speed up copy.deepcopy() of containers holding atomic elements

5 participants