PyFu

Python LXML Insecure XML Parsing

Python-based Web Application Attacks

The lxml library is one of the most widely adopted XML parsing libraries in Python due to its rich feature set, high parsing performance, and compliance with XML specifications.

It supports a wide range of advanced XML processing capabilities, including XPath queries, XSLT transformations, and full DTD support.

This makes it a powerful choice for developers who require complete XML handling capabilities in their applications.

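Recent lxml releases are safe against many XML-related attacks by default: DTD loading and network access are disabled, and since lxml 5.0 the parser resolves only internally defined entities unless told otherwise. Older releases resolved entities by default, so the installed version is worth checking during an audit.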

However, the library allows developers to manually enable these features through its parsing options. If incorrectly configured, lxml becomes vulnerable to XML External Entity (XXE) attacks, which can lead to sensitive data disclosure, server-side request forgery (SSRF) or other issues.

In this example, we have a Python application that reads XML data from a file and parses it using lxml. The vulnerability is introduced when the developer enables external entity resolution and DTD loading by explicitly setting the load_dtd and resolve_entities options:

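from lxml import etree

# Read the raw XML bytes; in a real application this input may be attacker-controlled.
with open("input.xml", "rb") as f:
    xml_data = f.read()

# VULNERABLE: load_dtd and resolve_entities let the parser expand external entities.
parser = etree.XMLParser(load_dtd=True, resolve_entities=True)
root = etree.fromstring(xml_data, parser)

# Any expanded entity content is serialized along with the rest of the tree.
print(etree.tostring(root))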

When load_dtd=True is set, the parser allows Document Type Declarations (DTDs) to be loaded and processed. Combined with resolve_entities=True, external entities defined in the XML input are resolved during parsing.

This becomes dangerous when processing XML from untrusted or user-controlled sources.

An attacker can supply a malicious XML payload that defines an external entity pointing to sensitive system files or internal services.

The following XML input demonstrates a typical XXE attack:

<?xml version="1.0"?>
<!DOCTYPE data [
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<data>&xxe;</data>

Once this payload is processed by the vulnerable parser, the content of /etc/passwd is read and inserted into the XML tree in place of the &xxe; entity.

This allows the attacker to extract file contents or probe internal resources, depending on the entity URL provided.

For example, parsing the previous payload prints the contents of the /etc/passwd file:

PyFu/generic-py-fu/vulnerable-lxml-example 130 » ls
input.xml  vulnerable-lxml-parser.py
PyFu/generic-py-fu/vulnerable-lxml-example » python3 vulnerable-lxml-parser.py 
b'<data>root:x:0:0:root:/root:/bin/bash\ndaemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\nbin:x:2:2:bin:/bin:/usr/sbin/nologin\nsys:x:3:3:sys:/dev:/usr/sbin/nologin\nsync:x:4:65534:sync:/bin:/bin/sync\ngames:x:5:60:games:/usr/games:/usr/sbin/nologin
...

As we have seen, the root cause of this vulnerability lies in improper configuration rather than in lxml itself. The defaults in current releases are safe, but enabling DTD processing and entity resolution without fully understanding the risks exposes applications to XXE attacks.

Offensive security researchers should target such misconfigurations when auditing code for XML parsing vulnerabilities.

Why XXE in lxml matters from an offensive security perspective

I value XXE because a single parsed document hands me three primitives at once: local file read, blind SSRF, and on misconfigured stacks, denial of service through entity expansion. With file:///etc/passwd I read the filesystem, with http://169.254.169.254/ I pivot the parser into cloud metadata and internal services the host can reach but I cannot, and when direct exfiltration is blocked I fall back to out-of-band channels that ship file contents to a server I control. One ingestion endpoint becomes a read primitive across the whole trust boundary.

The reason this keeps appearing is that lxml’s safe defaults get switched off by developers who want DTD or entity features for legitimate reasons. These are the tells I look for when auditing:

  • XMLParser(...) with resolve_entities=True, load_dtd=True, or no_network=False. Any of these on a parser that touches untrusted input re-enables the dangerous path that the defaults shut off.
  • etree.fromstring or etree.parse fed request bodies, uploads, or file contents. SOAP, SAML, SVG, DOCX, RSS, and sitemap handlers are the usual carriers; the XML is rarely something the user “typed,” so it gets trusted.
  • Custom resolvers registered with parser.resolvers.add(...). A registered resolver often reintroduces network and file access even when the base parser looks locked down.
  • Out-of-band behavior on malformed input. When a payload referencing an external DTD makes the server reach back to me, I have blind XXE even without seeing the file in the response. From there it often pivots into server-side request forgery (SSRF), for example in Flask applications.
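The first tell in the list can be checked mechanically. What follows is a minimal sketch using only the standard library's ast module; the function name find_risky_xmlparser_calls and the RISKY_KEYWORDS table are hypothetical helpers for illustration, not part of lxml. It only catches literal True/False keyword values on calls named XMLParser, so flags passed through variables or **kwargs still need manual review.

```python
import ast

# Keyword values that re-enable risky parser behavior (from the list above).
RISKY_KEYWORDS = {"resolve_entities": True, "load_dtd": True, "no_network": False}


def find_risky_xmlparser_calls(source, filename="<string>"):
    """Return sorted (line, keyword, value) tuples for risky XMLParser(...) keywords."""
    findings = []
    for node in ast.walk(ast.parse(source, filename=filename)):
        if not isinstance(node, ast.Call):
            continue
        func = node.func
        # Match both etree.XMLParser(...) and a bare XMLParser(...).
        name = func.attr if isinstance(func, ast.Attribute) else getattr(func, "id", None)
        if name != "XMLParser":
            continue
        for kw in node.keywords:
            # Only literal constants are inspected; computed values are skipped.
            if kw.arg in RISKY_KEYWORDS and isinstance(kw.value, ast.Constant):
                if kw.value.value is RISKY_KEYWORDS[kw.arg]:
                    findings.append((node.lineno, kw.arg, kw.value.value))
    return sorted(findings)
```

Running this over each Python file in a codebase gives a quick shortlist of parsers to trace back to their inputs; whether a flagged parser ever touches untrusted XML still has to be confirmed by hand.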

The defender takeaway: never flip those flags on attacker-reachable parsers, and reach for defusedxml so the safe path is the default one.

Mitigation

The fix is to parse untrusted XML with entity resolution and DTD loading disabled. Current lxml releases behave this way by default, and passing resolve_entities=False explicitly also covers older versions that resolved entities out of the box. On current releases, only an explicit XMLParser(resolve_entities=True, load_dtd=True, no_network=False) re-enables the dangerous path, so never set those flags on attacker-supplied input. Use defusedxml for defense in depth, cap entity expansion to stop billion-laughs style amplification (leave huge_tree at its default of False), and validate the parsed document against an expected schema rather than trusting whatever entities or external references it declares.
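As a rough illustration of that advice, a locked-down parsing helper might look like the sketch below. The name parse_untrusted is hypothetical, and rejecting every document that carries a DOCTYPE is an assumption that only suits applications with no legitimate need for DTDs.

```python
from lxml import etree


def parse_untrusted(xml_bytes):
    """Parse attacker-reachable XML with a locked-down parser (illustrative helper)."""
    parser = etree.XMLParser(
        resolve_entities=False,  # leave entity references unexpanded
        load_dtd=False,          # do not load external DTDs
        no_network=True,         # block network access during parsing
        huge_tree=False,         # keep libxml2's built-in size and depth limits
    )
    root = etree.fromstring(xml_bytes, parser)
    # Assumption: this application never needs a DOCTYPE, so any document
    # that declares one is rejected outright.
    if root.getroottree().docinfo.internalDTD is not None:
        raise ValueError("DOCTYPE declarations are not accepted")
    return root
```

Setting every flag explicitly, even where it matches the default, keeps the behavior stable across lxml versions and makes the intent obvious to the next reviewer. The DOCTYPE check runs after parsing, so it acts as a policy gate on top of the parser flags rather than a replacement for them.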