This fork is a file-format library. It writes and reads .docx bytes.
It does not render, sandbox, or validate the semantic content
that Word will subsequently process. Callers who accept untrusted
input are responsible for their own sanitisation.
An altChunk is a Word primitive that embeds a foreign payload in
the package and asks Word to substitute the payload's rendered content
for the `<w:altChunk>` marker on open. The substitution runs inside
Word's native import filters, not inside python-docx.
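
Because the payload travels as an ordinary package part, an application can check whether a `.docx` it received already carries altChunks before handing it on. The following is a minimal sketch using only the standard library. It assumes the main document's relationships live at the conventional path `word/_rels/document.xml.rels` (packages may place them elsewhere), and the helper name `list_alt_chunk_targets` is hypothetical, not part of python-docx.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List

# Relationship type OOXML uses for alternate-format import parts.
AF_CHUNK_REL = (
    "http://schemas.openxmlformats.org/officeDocument/2006/relationships/aFChunk"
)
PKG_REL_NS = "http://schemas.openxmlformats.org/package/2006/relationships"
# Assumption: conventional location of the main document's relationships part.
RELS_PATH = "word/_rels/document.xml.rels"


def list_alt_chunk_targets(docx_bytes: bytes) -> List[str]:
    """Return the relationship targets of altChunk parts, if any."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        try:
            rels_xml = package.read(RELS_PATH)
        except KeyError:
            return []  # no relationships part at the assumed path
    # For hostile input, a hardened XML parser may be a safer choice
    # than the standard-library ElementTree used here.
    root = ET.fromstring(rels_xml)
    return [
        rel.get("Target", "")
        for rel in root.findall(f"{{{PKG_REL_NS}}}Relationship")
        if rel.get("Type") == AF_CHUNK_REL
    ]
```

An empty list only means no altChunk relationship was found at the assumed path; it says nothing about other risky content in the package.
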
That means an attacker who controls the altChunk payload controls whatever Word's filter chooses to interpret:
- HTML / XHTML (`add_html_chunk`): Word's HTML import pipeline historically evaluates embedded scripts, external image URLs, and conditional-comment VML. Crafted Office documents have been the vector for CVE-2017-11826, CVE-2018-0802, and more recent template-injection issues.
- RTF (`add_rtf_chunk`): RTF's control-word grammar lets payloads carry embedded OLE objects, external data links, and remote templates. CVE-2017-0199 and CVE-2023-21716 are well-known RCE examples that were triggered simply by opening a document.
- MHTML (`add_mhtml_chunk`): multi-part archives combine HTML plus related resources; the HTML portion has the same exposure as HTML altChunks.
- Plain text (`add_text_chunk`): lowest-risk, but note that Word will still interpret hyperlink-looking strings on autoformat.
python-docx does not auto-sanitise altChunk payloads. None of the helpers below inspect, rewrite, or strip the input:
- `Document.add_alt_chunk(content, content_type=..., match_src=...)`
- `Document.add_html_chunk(html, match_src=...)`
- `Document.add_text_chunk(text, encoding=..., match_src=...)`
- `Document.add_rtf_chunk(rtf, match_src=...)`
- `Document.add_mhtml_chunk(mhtml, match_src=...)`
This is deliberate — sanitisation is a content-policy concern that belongs to the embedding application, not to the serializer.
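
One way for an embedding application to own that policy is a thin default-deny wrapper that sits in front of the helpers. The sketch below is illustrative only: `add_untrusted_chunk`, `UntrustedPayloadError`, and the `kind` strings are hypothetical names, the sanitiser is supplied by the caller, and it assumes `add_text_chunk` and `add_html_chunk` can be called with the payload alone (their other keyword arguments are left at whatever defaults the library defines).

```python
from typing import Callable, Optional


class UntrustedPayloadError(ValueError):
    """Raised when policy refuses to embed a payload."""


def add_untrusted_chunk(
    document,
    payload: str,
    kind: str,
    sanitise_html: Optional[Callable[[str], str]] = None,
):
    """Embed untrusted content only through an explicitly allowed route."""
    if kind == "text":
        return document.add_text_chunk(payload)
    if kind == "html":
        if sanitise_html is None:
            # Never forward raw HTML: the caller must supply a sanitiser.
            raise UntrustedPayloadError("HTML requires a sanitiser callback")
        return document.add_html_chunk(sanitise_html(payload))
    # Default deny: every other kind is refused rather than passed through.
    raise UntrustedPayloadError(f"refusing untrusted {kind!r} payload")
```

Keeping the sanitiser as a parameter leaves the content policy with the application, which matches the division of responsibility described above.
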
If the payload originated outside your trust boundary:
- Sanitise HTML with a library like `bleach` or `nh3` with an explicit allow list of tags/attributes. Strip `<script>`, event handlers (`on*=`), `javascript:` / `data:` URIs, and external references.
- Reject RTF unless you can cryptographically verify its origin. RTF has no practical sanitiser; there is no safe subset for opaque third-party input.
- Reject MHTML by default: it multiplexes HTML plus arbitrary MIME parts. Unpack it, sanitise the HTML portion, drop the attachments, and re-emit a plain HTML altChunk if you must.
- Plain text is safe to embed but will be rendered verbatim; consider HTML-escaping if the caller expects literal characters like `<` or `&` to round-trip.
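
As a second line of defence after an allow-list sanitiser has run, an application might assert that none of the constructs named above survived. The sketch below uses only the standard-library `html.parser`. It is a deny-list audit, which is weaker than allow-list sanitisation and is not a substitute for it. The names `audit_html`, `BLOCKED_TAGS`, and `URL_ATTRS`, and the exact tag and attribute sets, are illustrative assumptions.

```python
from html.parser import HTMLParser
from typing import List

# Illustrative sets; a real policy would be an allow list, not a deny list.
BLOCKED_TAGS = {"script", "object", "embed", "iframe"}
URL_ATTRS = {"href", "src", "background"}


def _is_risky_url(value: str) -> bool:
    """True for scheme-qualified or protocol-relative URLs."""
    cleaned = "".join(value.split()).lower()  # drop whitespace hidden in schemes
    return cleaned.startswith("//") or ":" in cleaned.split("/", 1)[0]


class _Audit(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.findings: List[str] = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names before this call.
        if tag in BLOCKED_TAGS:
            self.findings.append(f"blocked tag <{tag}>")
        for name, value in attrs:
            if name.startswith("on"):
                self.findings.append(f"event handler {name} on <{tag}>")
            elif name in URL_ATTRS and value and _is_risky_url(value):
                self.findings.append(f"risky URL in {name} on <{tag}>")


def audit_html(html: str) -> List[str]:
    """Return a list of findings; an empty list means nothing was flagged."""
    parser = _Audit()
    parser.feed(html)
    parser.close()
    return parser.findings
```

A caller could refuse to embed the payload whenever `audit_html(sanitised)` returns a non-empty list, treating any finding as a sign that the sanitiser configuration needs attention.
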
- ECMA-376 Part 1 §17.17 (Glossary Document and Alternate-Format Import Parts)
- Microsoft's Protected View guidance — this mitigates opening altChunks from untrusted origins, but does not eliminate the risk.
`settings.xml` carries a `<w:documentProtection>` element whose
`@w:hash` attribute holds a hash of the editor-restriction password.
The hashing algorithm (`@w:cryptAlgorithmSid="4"`) is SHA-1 with a
small per-document salt and an iteration count; the algorithm is
spec-mandated by ECMA-376 §17.15.1.29 and Microsoft's
MS-OFFCRYPTO §2.3.7.1.
python-docx writes whatever the spec requires — not what a modern
password-hashing policy would prefer — because Word refuses to verify
hashes produced with any other algorithm.
DocumentProtection prevents casual, in-UI editing of a document
opened in Microsoft Word. That is all. It is a user-experience
guardrail, not a cryptographic access-control mechanism:
- The hash and salt live in plaintext inside `settings.xml`: any tool that can read the package bytes (including python-docx itself) can remove the element and drop the protection.
- SHA-1 collision attacks are well-documented, and brute-forcing the original password against the stored hash is feasible for weak passwords on commodity hardware.
- The document body is not encrypted. Only password-based AES encryption (`python-docx[encryption]`, handled by `python-ooxml-crypto` under MS-OFFCRYPTO §2.3.4) provides actual confidentiality.
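
To make the first point concrete, the sketch below reads the protection parameters straight out of a package with the standard library alone, with no password involved. It assumes the settings part sits at the conventional path `word/settings.xml`; `read_protection_attrs` is a hypothetical helper, not a python-docx API.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import Dict, Optional

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
# Assumption: conventional location of the settings part.
SETTINGS_PATH = "word/settings.xml"


def read_protection_attrs(docx_bytes: bytes) -> Optional[Dict[str, str]]:
    """Return the attributes of <w:documentProtection>, or None if absent."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as package:
        try:
            settings_xml = package.read(SETTINGS_PATH)
        except KeyError:
            return None  # no settings part at the assumed path
    root = ET.fromstring(settings_xml)
    element = root.find(f"{{{W_NS}}}documentProtection")
    if element is None:
        return None
    # Drop the "{namespace}" prefix so keys read as "hash", "salt", and so on.
    return {
        name.rsplit("}", 1)[-1]: value
        for name, value in element.attrib.items()
    }
```

Anything this function can read, any other holder of the file can read too, which is why the stored hash should be treated as public.
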
- Treat `DocumentProtection` as a UI hint, not a secret.
- For confidentiality, use whole-package encryption (`Document.save(..., password="…")`), which applies AES under the MS-OFFCRYPTO Agile profile.
- Never store a high-value password in the protection hash; assume anyone who opens the file can see or crack it.
If you find a security issue in the loadfix fork (parser memory-exhaustion, ZIP-path traversal, XML external-entity handling, etc.), file an issue on the loadfix monorepo and mark it security. Do not open a public PR that includes a working exploit; drop a note that describes the class of issue and wait for a maintainer to reach out.