gh-150821: Skip URL parsing in mimetypes.guess_type() for file paths#150828
gh-150821: Skip URL parsing in mimetypes.guess_type() for file paths#150828gaborbernat wants to merge 5 commits into
Conversation
…paths guess_type() parsed every argument as a URL before checking the extension, even for plain file paths that have no scheme. Detect the no-scheme case and go straight to extension lookup, avoiding urlparse() and its lazy import. Real URLs keep the full parsing path; results are unchanged.
d35cf04 to
175764e
Compare
| scheme = p.scheme | ||
| url = p.path | ||
| else: | ||
| return self.guess_file_type(url, strict=strict) |
There was a problem hiding this comment.
question: can this branch still happen now? do we have tests for this case?
There was a problem hiding this comment.
Yes, it can still happen: a ':' that isn't a real URL scheme — a single-letter Windows drive like c:fake.html, or a colon elsewhere in the name like note 12:30.txt — gives urlparse an empty or single-character scheme, so it falls through to this branch and is treated as a file path. I've added test_path_with_colon_but_no_url_scheme covering those cases.
Address review: frame the fast path as 'no colon means it cannot be a URL' (a file path may legitimately contain ':' on POSIX), add the blank line before the lazy import, and cover a ':'-containing argument that is not a URL (single-letter drive, colon in the name) reaching the file path branch.
Replace the in-function import with a top-level lazy import, keeping urllib.parse off the module import path while declaring it with the other imports.
| except ImportError: | ||
| _winreg = None | ||
|
|
||
| lazy import urllib.parse |
There was a problem hiding this comment.
I believe our standard style is to put regular import lines above the try/except clauses.
There was a problem hiding this comment.
Done — moved os, posixpath, and urllib.parse to module-level lazy imports at the top, above the platform try/except blocks.
| @@ -123,10 +125,14 @@ def guess_type(self, url, strict=True): | |||
| """ | |||
| # Lazy import to improve module import time | |||
There was a problem hiding this comment.
This comment now only applies to os...which could also be made a lazy import at the top now that we have language supported lazy imports ;) Then the comment can go away.
There was a problem hiding this comment.
Good call. Made os (and posixpath) module-level lazy imports as well and removed the now-redundant comments. test_mimetypes still passes, including test_lazy_import.
Per review: place the lazy imports at the top of the module with the other imports, above the platform try/except blocks, instead of inside the functions, and drop the now-redundant 'Lazy import to improve module import time' comments.
bitdancer
left a comment
There was a problem hiding this comment.
LGTM, but I'm going to leave to someone with more recent involvement in the mimetypes module to merge.
mimetypes.guess_type()accepts either a URL or a filesystem path, so it parses its argument as a URL withurllib.parse.urlparse()before looking at the extension. The common argument is a plain file path, which has no URL scheme to find, so the parse — and theurllib.parseimport it triggers — is spent on nothing. Guessing content types from file names is everywhere: static-file servers, upload handlers, archive and build tools deciding how to treat each file as they walk a tree of thousands.A URL scheme requires a
:, so a path without one cannot be a URL. This detects that case and goes straight to extension lookup, skippingurlparse()and its lazy import. Real URLs, and the rare path that contains a:, still take the full parsing path, and results are unchanged for both.Guessing types for 15 real file names sampled from the top-1000 corpus improves from 23.4 µs to 11.0 µs, 112% faster.
Benchmark (pyperf)
Run base vs patched by swapping
Lib/mimetypes.pyon the same interpreter. The names are real file names sampled from the top-1000 corpus.Resolves #150821.