PyFu

Serialization and Deserialization Concept

Python-based Vulnerabilities Anatomy

Serialization as a Concept

Serialization is the process of converting an in-memory object into a flat, storable representation, typically a string or a byte stream, that captures the object’s state so it can be written to a file, saved in a database, or transmitted over a network.

The core problem serialization solves is that objects live in memory as references and structured data that only make sense inside a running process. Once the process exits, that state is gone. Serialization freezes the object into a portable format that can later be reconstructed into an equivalent object, even in a different process or on a different machine.
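A minimal sketch of that freeze-and-rebuild cycle, using pickle from the standard library. The Session class and its fields are hypothetical and exist only for illustration.

```python
import pickle
from dataclasses import dataclass, field


@dataclass
class Session:
    # Hypothetical example class; not part of any real application.
    user: str
    roles: list = field(default_factory=list)


original = Session(user="alice", roles=["admin"])

# Freeze the in-memory object into a byte stream that can be stored or sent.
blob = pickle.dumps(original)

# Later, possibly in another process that can import Session, rebuild it.
restored = pickle.loads(blob)

print(type(blob).__name__)   # bytes
print(restored == original)  # True: same state
print(restored is original)  # False: a new, equivalent object
```

The restored object carries the same state but is a separate object, which is what "reconstructed into an equivalent object" means in practice.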

Different formats make different trade-offs. Text-based formats like json are language-agnostic and represent only simple data types (strings, numbers, booleans, null, lists, dictionaries), which makes them safe but limited. Python-specific formats like pickle can serialize almost any Python object, including instances of custom classes. The class itself is stored as a reference to its importable name, not as code. Pickle achieves this by embedding instructions for reconstructing the object, and those instructions are what attackers abuse during deserialization.
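The trade-off can be observed directly. The sketch below assumes a hypothetical Point class: json refuses to serialize it, while pickle accepts it and emits a small program of opcodes, which pickletools can list.

```python
import json
import pickle
import pickletools


class Point:
    # Hypothetical example class.
    def __init__(self, x, y):
        self.x = x
        self.y = y


p = Point(1, 2)

# json has no notion of Point, so it refuses instead of guessing.
try:
    json.dumps(p)
    json_error = None
except TypeError as exc:
    json_error = str(exc)

# pickle accepts the instance; protocol 4 is pinned so the opcodes are stable.
blob = pickle.dumps(p, protocol=4)
opcodes = [op.name for op, arg, pos in pickletools.genops(blob)]

print(json_error)
print(opcodes)
```

Opcodes such as STACK_GLOBAL (look up a class by module and name) and BUILD (apply the saved state) are the "instructions for reconstructing the object" in concrete form.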

This distinction is the root of the entire vulnerability class: the richer a serialization format is, the more of the language’s runtime it must invoke to rebuild an object, and the more dangerous it becomes when fed untrusted input.

Deserialization as a Concept

Deserialization is a well-established concept across many programming languages, and Python is no exception.

It refers to the process of loading or reconstructing objects from a serialized representation, whether that representation is stored in a file, transmitted over a network, or embedded in a data stream.

While serialization and deserialization are essential for many legitimate use cases, they introduce significant security risks when an application deserializes input it has no reason to trust.

Insecure Deserialization

Insecure deserialization is a well-known and high-impact vulnerability, especially in environments where user-controlled serialized data is accepted and processed without strict validation.

When an application blindly trusts serialized input, such as data received from a web request or read from a file, it opens the door to exploitation.

Security researchers often target the deserialization process as a critical attack surface.

In Python, modules like pickle can execute arbitrary code during deserialization, which means that a carefully crafted payload can be used to trigger unintended behavior or gain command execution.

The Reconstruction Primitives Attackers Target

What makes a format like pickle dangerous is not that it stores data, but that it stores instructions for rebuilding objects, and Python lets a class define those instructions through magic methods. The most important is __reduce__, which an object can implement to tell the serializer exactly how to recreate it: it returns a callable and the arguments to call it with. During deserialization the library faithfully calls that callable, so an object whose __reduce__ returns (os.system, ("id",)) runs os.system("id") the moment it is loaded.
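The mechanism can be shown without running anything dangerous. This sketch swaps os.system for the harmless operator.add; the Demo class is hypothetical.

```python
import operator
import pickle


class Demo:
    # Hypothetical class; __reduce__ returns (callable, args).
    def __reduce__(self):
        # A harmless stand-in for the os.system example in the text.
        return (operator.add, (40, 2))


blob = pickle.dumps(Demo())

# The loader looks up operator.add and calls it with (40, 2).
result = pickle.loads(blob)

print(result)                    # 42
print(isinstance(result, Demo))  # False
```

The loaded value is whatever the callable returned, not a Demo instance. The byte stream does not mention Demo at all, so the loading side needs no knowledge of the class for the call to happen.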

Related hooks such as __setstate__ and __reduce_ex__ participate in the same flow, and any of them can become an execution sink when the serialized data is attacker-controlled. These methods are part of Python’s normal object model, covered in Python Magic Methods and Attributes; deserialization attacks simply turn them against the loader.
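A small sketch of the __setstate__ side of that flow, assuming a hypothetical Job class. The hook runs inside pickle.loads and receives state taken from the byte stream, without __init__ being called.

```python
import pickle

calls = []  # records what the hook saw


class Job:
    # Hypothetical class used only for illustration.
    def __init__(self, name):
        self.name = name

    def __setstate__(self, state):
        # Runs inside pickle.loads, with state read from the byte stream.
        calls.append(state)
        self.__dict__.update(state)


blob = pickle.dumps(Job("nightly"))
calls_before_load = len(calls)

job = pickle.loads(blob)

print(calls_before_load, len(calls))
print(job.name)
```

Any logic inside such a hook operates on values the sender chose, which is why it counts as part of the attack surface when the sender is untrusted.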

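This is why the safety of a format tracks how much of the object model it has to invoke to rebuild a value. json only ever produces dicts, lists, strings, numbers, booleans, and None, so there is no reconstruction callable to hijack. pickle, shelve (which is pickle on disk), and yaml.load with an unsafe loader all rebuild arbitrary objects, so they all expose this surface. marshal is a slightly different case: it handles only a fixed set of built-in types, but those include code objects, and its documentation states that it is not intended to be secure against erroneous or maliciously constructed data, so it belongs on the same do-not-trust list.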
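The data-only behaviour of json can be checked with a document that merely describes a callable. The type_names helper below is a hypothetical name introduced for this sketch.

```python
import json

# A document that *describes* a callable; json treats it as plain data.
doc = '{"__reduce__": ["os.system", ["id"]], "count": 3, "ok": true, "note": null}'
data = json.loads(doc)


def type_names(value, seen=None):
    # Collect the type name of every value in the decoded structure.
    seen = set() if seen is None else seen
    seen.add(type(value).__name__)
    if isinstance(value, dict):
        for item in value.values():
            type_names(item, seen)
    elif isinstance(value, list):
        for item in value:
            type_names(item, seen)
    return seen


print(sorted(type_names(data)))
# ['NoneType', 'bool', 'dict', 'int', 'list', 'str']
```

Nothing is called during loading. The "__reduce__" key is just a string key in a dict, and it only becomes dangerous if application code later chooses to act on it.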

Python Serialization and Deserialization

Deserialization in Python is the process of reconstructing a Python object from a serialized form, typically a string, file, or byte stream.

It is the inverse of serialization, which converts Python objects into a format that can be stored, transmitted, or persisted using json, pickle, or a similar module.
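With a data-only module the inverse is not always exact. In this sketch with json, a tuple comes back as a list because the format has no tuple type.

```python
import json

original = {"user": "alice", "tags": ("a", "b")}

text = json.dumps(original)   # serialize to a str
restored = json.loads(text)   # deserialize back to Python objects

print(text)                   # {"user": "alice", "tags": ["a", "b"]}
print(restored == original)   # False: the tuple came back as a list
```

That loss of type information is the price of a format that cannot name Python types, and it is the same property that keeps the loader from invoking anything.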

Where This Shows Up in PyFu

The concrete techniques all build directly on the concepts above:

Why serialization formats matter from an offensive security perspective

I go after deserialization first on any Python target because it is one of the highest-yield bug classes the language offers: a rich format does not just store my data, it invokes the runtime to rebuild it, and __reduce__ lets me dictate exactly which callable runs. That turns “the app accepts serialized input” into “the app runs my command” with no chain to assemble and no filter standing in the way. The impact is RCE by default, and it inherits whatever privileges the loading process holds.

What an attacker prizes is that these sinks are everywhere objects need to move and persist, yet they rarely look like attack surface. Pickle, marshal, shelve, and unsafe yaml.load all hide under abstractions (caches, queues, session stores, model files, config loaders), so the dangerous call is almost never a literal loads in the request handler.
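As one illustration of a sink hiding behind an abstraction, the sketch below writes to a shelve store and then reads the same file through dbm. The file name "cache" is an arbitrary choice, and the on-disk layout depends on which dbm backend the platform provides.

```python
import dbm
import os
import pickle
import shelve
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache")  # hypothetical cache file name

    # Application code only ever sees a dict-like "cache" object.
    with shelve.open(path) as cache:
        cache["user"] = {"name": "alice"}

    # Reading the same store without shelve exposes the raw value bytes.
    with dbm.open(path, "r") as raw:
        blob = bytes(raw[b"user"])

# The stored value is an ordinary pickle stream.
value = pickle.loads(blob)

print(blob[:1])
print(value)
```

Nothing in the application code mentions pickle, yet every read from the store is a pickle load. Anyone who can write to that file controls what gets deserialized.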

Where I look for a reachable sink in an assessment:

The audit tell is simple: trace untrusted bytes to the format that rebuilds them, and any object-reconstructing format on that path is a likely execution sink. For defenders the takeaway is that the danger tracks how much of the object model a format must invoke, so untrusted input belongs only in data-only formats.
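A first pass at that trace can be automated. The following is a rough heuristic sketch, not a complete analyzer: the SINKS set and the find_sinks helper are assumptions made for illustration, and the scan only catches direct module.function calls, so aliased imports and from-imports are missed.

```python
import ast

# Calls worth a closer look; this list is illustrative, not exhaustive.
SINKS = {
    ("pickle", "load"), ("pickle", "loads"),
    ("marshal", "load"), ("marshal", "loads"),
    ("shelve", "open"),
    ("yaml", "load"), ("yaml", "unsafe_load"),
}


def find_sinks(source):
    """Return (line, call) pairs for direct module.function sink calls."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and isinstance(node.func.value, ast.Name)
        ):
            pair = (node.func.value.id, node.func.attr)
            if pair in SINKS:
                hits.append((node.lineno, ".".join(pair)))
    return sorted(hits)


sample = "import pickle\n\ndef handler(body):\n    return pickle.loads(body)\n"
print(find_sinks(sample))  # [(4, 'pickle.loads')]
```

A hit only marks a candidate. Whether untrusted bytes actually reach the call still has to be traced by hand.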

Mitigation

The general rule that falls out of this is to match the format to the trust level of its source: use a data-only format such as JSON (or yaml.safe_load when the input is YAML) for anything an attacker can influence, and reserve pickle, marshal, and shelve for data that never leaves your own trust boundary. When objects must be exchanged across a boundary, authenticate the bytes with a keyed signature and verify it before deserializing, then validate the decoded result against an expected schema so a malicious document cannot smuggle in types or fields the application never expected.
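One way the sign, verify, then validate sequence could look, sketched with hmac and json from the standard library. The key, the token layout (hex MAC, a dot, then the payload), the field names, and the allowed roles are all assumptions made for this example.

```python
import hashlib
import hmac
import json

SECRET_KEY = b"replace-with-a-real-secret"  # hypothetical key for the sketch
ALLOWED_ROLES = {"viewer", "editor"}        # hypothetical schema constraint


def sign(payload):
    # Prefix the payload with a hex HMAC-SHA256 tag and a dot separator.
    mac = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest().encode()
    return mac + b"." + payload


def verify_and_load(token):
    mac, sep, payload = token.partition(b".")
    expected = hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest().encode()
    # Check the signature before any parsing; compare in constant time.
    if not sep or not hmac.compare_digest(mac, expected):
        raise ValueError("bad signature")
    data = json.loads(payload)
    # Schema check: exact keys, expected types, known role values.
    if not isinstance(data, dict) or set(data) != {"user", "role"}:
        raise ValueError("unexpected fields")
    if not isinstance(data["user"], str) or data["role"] not in ALLOWED_ROLES:
        raise ValueError("unexpected values")
    return data


token = sign(json.dumps({"user": "alice", "role": "viewer"}).encode())
print(verify_and_load(token))
```

The signature only shows that the bytes came from a holder of the key. The schema check is what limits the damage if that key holder is buggy or compromised.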