Skip to content

gh-150821: Skip URL parsing in mimetypes.guess_type() for file paths#150828

Open
gaborbernat wants to merge 5 commits into
python:mainfrom
gaborbernat:opt/mimetypes-skip-urlparse
Open

gh-150821: Skip URL parsing in mimetypes.guess_type() for file paths#150828
gaborbernat wants to merge 5 commits into
python:mainfrom
gaborbernat:opt/mimetypes-skip-urlparse

Conversation

@gaborbernat
Copy link
Copy Markdown
Contributor

@gaborbernat gaborbernat commented Jun 2, 2026

mimetypes.guess_type() accepts either a URL or a filesystem path, so it parses its argument as a URL with urllib.parse.urlparse() before looking at the extension. The common argument is a plain file path, which has no URL scheme to find, so the parse — and the urllib.parse import it triggers — is spent on nothing. Guessing content types from file names is everywhere: static-file servers, upload handlers, archive and build tools deciding how to treat each file as they walk a tree of thousands.

A URL scheme requires a :, so a path without one cannot be a URL. This detects that case and goes straight to extension lookup, skipping urlparse() and its lazy import. Real URLs, and the rare path that contains a :, still take the full parsing path, and results are unchanged for both.

Guessing types for 15 real file names sampled from the top-1000 corpus improves from 23.4 µs to 11.0 µs, 112% faster.

Benchmark base patched
guess_type x15 file paths 23.4 µs 11.0 µs: 112% faster
Benchmark (pyperf)

Run base vs patched by swapping Lib/mimetypes.py on the same interpreter. The names are real file names sampled from the top-1000 corpus.

import mimetypes, pyperf
mimetypes.init()

names = ["webhook_list.py", "tox.ini", "api_management_delete_policy.py",
    ".env.sample.entra-id", "alerts_get_by_id.py", "ai_prompt_workflow.md",
    "functions.py", "sample_connections.py", "certificate_delete.py",
    "_ai_agents_instrumentor.py", ".flake8", "agent_trace_configurator.py",
    "test_ws_invoke.py", "README.md", "setup.cfg"]

runner = pyperf.Runner()
runner.bench_func("guess_type x15 file paths",
                  lambda: [mimetypes.guess_type(n) for n in names])

Resolves #150821.

…paths

guess_type() parsed every argument as a URL before checking the extension,
even for plain file paths that have no scheme. Detect the no-scheme case and
go straight to extension lookup, avoiding urlparse() and its lazy import. Real
URLs keep the full parsing path; results are unchanged.
@gaborbernat gaborbernat force-pushed the opt/mimetypes-skip-urlparse branch from d35cf04 to 175764e Compare June 2, 2026 23:45
Copy link
Copy Markdown
Member

@sobolevn sobolevn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

Comment thread Lib/mimetypes.py
Comment thread Lib/mimetypes.py
scheme = p.scheme
url = p.path
else:
return self.guess_file_type(url, strict=strict)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question: can this branch still happen now? do we have tests for this case?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it can still happen: a ':' that isn't a real URL scheme — a single-letter Windows drive like c:fake.html, or a colon elsewhere in the name like note 12:30.txt — gives urlparse an empty or single-character scheme, so it falls through to this branch and is treated as a file path. I've added test_path_with_colon_but_no_url_scheme covering those cases.

Comment thread Lib/mimetypes.py Outdated
Address review: frame the fast path as 'no colon means it cannot be a
URL' (a file path may legitimately contain ':' on POSIX), add the blank
line before the lazy import, and cover a ':'-containing argument that is
not a URL (single-letter drive, colon in the name) reaching the file
path branch.
Comment thread Lib/mimetypes.py Outdated
Replace the in-function import with a top-level lazy import, keeping
urllib.parse off the module import path while declaring it with the
other imports.
Comment thread Lib/mimetypes.py Outdated
except ImportError:
_winreg = None

lazy import urllib.parse
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe our standard style is to put regular import lines above the try/except clauses.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done — moved os, posixpath, and urllib.parse to module-level lazy imports at the top, above the platform try/except blocks.

Comment thread Lib/mimetypes.py Outdated
@@ -123,10 +125,14 @@ def guess_type(self, url, strict=True):
"""
# Lazy import to improve module import time
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment now only applies to os...which could also be made a lazy import at the top now that we have language supported lazy imports ;) Then the comment can go away.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call. Made os (and posixpath) module-level lazy imports as well and removed the now-redundant comments. test_mimetypes still passes, including test_lazy_import.

Per review: place the lazy imports at the top of the module with the other
imports, above the platform try/except blocks, instead of inside the
functions, and drop the now-redundant 'Lazy import to improve module import
time' comments.
Copy link
Copy Markdown
Member

@bitdancer bitdancer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I'm going to leave to someone with more recent involvement in the mimetypes module to merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Speed up mimetypes.guess_type() for plain file paths

4 participants