Python Under The Hood

Core Python Concepts

Python is an interpreted, high-level programming language, which means that its code is executed directly by an interpreter rather than being compiled into machine-level instructions beforehand.

This simplifies the development workflow by allowing developers to write and run code interactively, enabling faster testing and debugging.

In technical terms, when a Python script is executed, the source code (.py file) is first parsed by the Python interpreter.

The interpreter converts the source code into an intermediate form called bytecode, which is a low-level, platform-independent representation of the source.

This bytecode is then executed by the Python Virtual Machine (PVM), which interprets and runs it on the host system.
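
As a rough illustration of these two stages, the built-in compile() and exec() functions let us perform them by hand. This is a minimal sketch: the source string and the "<example>" filename are made up for the demonstration, and the raw bytecode bytes differ between CPython versions.

```python
# Illustrative source text; any valid Python statement would do.
source = "answer = 6 * 7"

# Stage 1: source text -> code object (the container that holds bytecode).
code_obj = compile(source, "<example>", "exec")

# The raw bytecode is an immutable bytes string whose exact contents
# depend on the CPython version.
raw_bytecode = code_obj.co_code

# Stage 2: the interpreter executes the bytecode inside a namespace.
namespace = {}
exec(code_obj, namespace)

print(type(raw_bytecode).__name__, namespace["answer"])  # bytes 42
```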

Unlike C, which must be compiled into a native executable, or Java, which requires an explicit compilation step into class files, Python compiles to bytecode automatically at run time, making it highly flexible and accessible for rapid development.

Python’s interpreted nature also contributes to its portability. Since the bytecode can run on any platform with a compatible Python interpreter, developers can write code once and execute it across multiple operating systems with minimal modification.

However, this convenience may come at the cost of execution speed compared to ahead-of-time compiled languages, though alternative implementations such as PyPy, which uses a Just-In-Time (JIT) compiler, can narrow the gap considerably for some workloads.

Python Bytecode

Python Bytecode is a low-level, platform-independent representation of your Python source code that is executed by the Python Virtual Machine (PVM).

When you run a .py file, Python automatically compiles the source code into this bytecode before executing it, so no separate build step is involved.

For imported modules, CPython typically stores the compiled bytecode in .pyc files within a special __pycache__ directory located alongside the original source files. The script you run directly is compiled in memory and is not cached.

These cached files help improve performance by avoiding the need to recompile the same code on subsequent runs.
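
To see where a cached file would end up, we can ask importlib for the cache path and compile a module explicitly with py_compile. This is a sketch under assumptions: demo_module.py is a hypothetical file created in a temporary directory, and the location can differ if the interpreter is configured with a custom pycache prefix.

```python
import importlib.util
import pathlib
import py_compile
import tempfile

# Work in a throwaway directory so nothing is written next to real code.
workdir = pathlib.Path(tempfile.mkdtemp())
source_path = workdir / "demo_module.py"  # hypothetical module name
source_path.write_text("VALUE = 1\n")

# Where CPython would cache the bytecode for this source file.
expected_pyc = pathlib.Path(importlib.util.cache_from_source(str(source_path)))

# Compile explicitly; importing the module would normally trigger this.
written_pyc = pathlib.Path(py_compile.compile(str(source_path)))

# Typically <workdir>/__pycache__/demo_module.cpython-XY.pyc
print(written_pyc)
```

The file name embeds the interpreter tag (for example cpython-312), which is how several Python versions can keep their caches side by side.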

The bytecode is then executed by the CPython interpreter, the standard and most widely used implementation of Python, which reads and interprets the bytecode instructions at runtime.

To gain a better understanding of how Python code is translated and executed, we can inspect the bytecode generated by the CPython compiler using the built-in dis module.

The dis module helps with the analysis of CPython bytecode by disassembling it into a human-readable listing of instructions.

We will use this function that performs an arithmetic operation and prints the result as an example:

def cool_calc_function(number1, number2):
    result = number1 + number2
    print(result)

To disassemble this function and view its bytecode, we can use the dis module as shown below:

import dis

def cool_calc_function(number1, number2):
    result = number1 + number2
    print(result)

dis.dis(cool_calc_function)

Running the previous code on CPython 3.12 should give us output similar to the following (opcode names and offsets vary between Python versions):

  3           0 RESUME                   0

  4           2 LOAD_FAST                0 (number1)
              4 LOAD_FAST                1 (number2)
              6 BINARY_OP                0 (+)
             10 STORE_FAST               2 (result)

  5          12 LOAD_GLOBAL              1 (NULL + print)
             22 LOAD_FAST                2 (result)
             24 CALL                     1
             32 POP_TOP
             34 RETURN_CONST             0 (None)

This output from the dis module provides a readable view of the bytecode that Python generates when it compiles a function.

This bytecode is what the Python Virtual Machine (PVM) actually executes, rather than the original source code.
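
When the listing needs to be processed by code rather than read by eye, dis.get_instructions returns the same information as objects. The sketch below simply collects the opcode names; which names appear depends on the CPython version, so nothing here should be read as a fixed instruction set.

```python
import dis

def cool_calc_function(number1, number2):
    result = number1 + number2
    print(result)

# get_instructions yields one Instruction object per bytecode instruction.
instructions = list(dis.get_instructions(cool_calc_function))
opnames = [instr.opname for instr in instructions]

for instr in instructions:
    # offset: position in the bytecode; argrepr: human-readable argument
    print(instr.offset, instr.opname, instr.argrepr)
```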

The Abstract Syntax Tree (AST)

Before CPython produces bytecode, it parses the token stream into an Abstract Syntax Tree (AST), a structured, tree-shaped representation of the program’s grammar. Each node represents a syntactic construct such as a function definition, a function call, an assignment, or a binary operation. The compiler then walks this tree to emit the bytecode we disassembled above.

Python exposes this stage directly through the built-in ast module, which lets us parse source into a tree and inspect it without executing anything. Using the same function from before:

import ast

source = """
def cool_calc_function(number1, number2):
    result = number1 + number2
    print(result)
"""

tree = ast.parse(source)
print(ast.dump(tree, indent=4))

Running this on CPython 3.12 gives a readable view of the tree (abbreviated here for clarity):

Module(
    body=[
        FunctionDef(
            name='cool_calc_function',
            args=arguments(
                args=[
                    arg(arg='number1'),
                    arg(arg='number2')]),
            body=[
                Assign(
                    targets=[Name(id='result', ctx=Store())],
                    value=BinOp(
                        left=Name(id='number1', ctx=Load()),
                        op=Add(),
                        right=Name(id='number2', ctx=Load()))),
                Expr(
                    value=Call(
                        func=Name(id='print', ctx=Load()),
                        args=[Name(id='result', ctx=Load())]))])])

ast.parse runs the lexical analysis and parsing steps for us and stops right before compilation, handing back the tree. This is the same intermediate representation CPython builds internally on every run.
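
The tree can also be traversed programmatically instead of being dumped as text. As a small sketch, ast.walk visits every node, which lets us count node types and collect the names the function reads; the variable names node_counts and names_loaded are just illustrative.

```python
import ast
from collections import Counter

source = """
def cool_calc_function(number1, number2):
    result = number1 + number2
    print(result)
"""

tree = ast.parse(source)

# Count how many nodes of each type the tree contains.
node_counts = Counter(type(node).__name__ for node in ast.walk(tree))

# Collect every name that is read (Load context) rather than assigned.
names_loaded = sorted(
    {
        node.id
        for node in ast.walk(tree)
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Load)
    }
)

print(node_counts["BinOp"], node_counts["Call"])  # 1 1
print(names_loaded)  # ['number1', 'number2', 'print', 'result']
```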

The AST is not only an inspection tool. The tree can be modified, then compiled and executed, which is what makes it relevant from a security perspective:

tree = ast.parse("result = number1 + number2")
code = compile(tree, filename="<ast>", mode="exec")
namespace = {"number1": 3, "number2": 4}
exec(code, namespace)
print(namespace["result"])
# Output: 7

compile() accepts an AST object directly and turns it into a code object, which exec() then runs. Anything that can construct or rewrite an AST and feed it to compile() controls exactly what bytecode is produced, regardless of what the original source looked like. This is the foundation for AST-level code injection and for tooling that rewrites code at import time through AST transformers (subclasses of ast.NodeTransformer).
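
To make the rewriting step concrete, here is a minimal sketch of an ast.NodeTransformer. The AddToMult class is a hypothetical example written for this article, not a real tool: it silently turns every addition into a multiplication before the tree is compiled.

```python
import ast

class AddToMult(ast.NodeTransformer):
    """Hypothetical transformer: rewrite every `a + b` into `a * b`."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # transform nested expressions first
        if isinstance(node.op, ast.Add):
            node.op = ast.Mult()
        return node

tree = ast.parse("result = number1 + number2")
tree = AddToMult().visit(tree)
ast.fix_missing_locations(tree)  # fill in line/column info for new nodes

namespace = {"number1": 3, "number2": 4}
exec(compile(tree, filename="<ast>", mode="exec"), namespace)
print(namespace["result"])  # 12, not 7
```

The source string still says number1 + number2, yet the code that runs multiplies. That gap between what is written and what executes is the point of the paragraph above.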

The standard library also ships ast.literal_eval, which parses a string into an AST and evaluates only literal nodes such as numbers, strings, tuples, lists, dicts, booleans, and None. It is often recommended as a safer replacement for eval on untrusted input because it refuses to evaluate calls, attribute access, or names. That guarantee holds only while the input stays restricted to literals, and its limits are worth understanding before treating it as a security boundary. See Insecure Dynamic Code Evaluation and Execution in Python for how the dynamic evaluation surface is abused.
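
A short sketch of that behaviour: literal structures are evaluated, while calls and bare names are rejected. The is_literal helper is a hypothetical wrapper added for the demonstration, and it only handles the two exception types these inputs raise.

```python
import ast

# A literal structure is parsed and returned as real Python objects.
parsed = ast.literal_eval("{'port': 8080, 'hosts': ['a', 'b'], 'debug': False}")

def is_literal(text):
    """Hypothetical helper: True if literal_eval accepts the text."""
    try:
        ast.literal_eval(text)
        return True
    except (ValueError, SyntaxError):
        return False

print(parsed["port"])                           # 8080
print(is_literal("[1, 2, 3]"))                  # True
print(is_literal("__import__('os').getcwd()"))  # False: calls are refused
print(is_literal("some_name"))                  # False: names are refused
```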

Python Virtual Machine (PVM)

The Python Virtual Machine (PVM) is the core component of the CPython interpreter responsible for executing Python bytecode. It’s essentially the engine that runs your Python programs.
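
What the PVM actually receives is a code object, which every function exposes through its __code__ attribute. The sketch below prints a few of its fields for the earlier example function; the length of co_code is version-dependent, so only the names and the argument count should be expected to stay stable.

```python
def cool_calc_function(number1, number2):
    result = number1 + number2
    print(result)

code_obj = cool_calc_function.__code__

print(code_obj.co_varnames)   # ('number1', 'number2', 'result')
print(code_obj.co_names)      # ('print',)
print(code_obj.co_argcount)   # 2
print(len(code_obj.co_code))  # size of the raw bytecode, varies by version
```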

CPython Interpreter

CPython is the default and most widely used implementation of the Python programming language.

It is written in C and serves as the reference interpreter that most developers interact with when they run Python code.

CPython follows the full Python language specification, ensuring compatibility and consistency across platforms and use cases.

Beyond simply interpreting code, CPython provides the complete runtime environment required to execute Python programs.

This includes the compiler that translates source code into bytecode, the Python Virtual Machine (PVM) that runs the bytecode, memory management systems, and a C API that allows integration with native C libraries.

Its architecture and implementation details are essential knowledge for anyone exploring Python’s internals or investigating low-level bugs and vulnerabilities.

How it Works When You Execute Python Code

Now that we have covered a few core concepts, let’s walk through what happens when you run a Python script.

When you run a Python script, a series of predefined steps is performed under the hood to transform your source code into actions executed by your computer.

Here’s a simplified breakdown of the process:

Source Code (your .py file)
   ↓
Lexical Analysis
   ↓
Parsing → Abstract Syntax Tree (AST)
   ↓
Compilation → Bytecode (.pyc files)
   ↓
Execution → Python Virtual Machine (PVM)

Source Code: The process begins with your .py file, which contains human-readable Python code that is ready to be executed.
Lexical Analysis: The interpreter scans the source code and breaks it down into tokens, the basic syntactic units such as keywords, operators, and identifiers.
Parsing: These tokens are analyzed for grammatical structure and converted into an Abstract Syntax Tree (AST), a tree representation of the code’s logical structure.
Compilation: The AST is then compiled into bytecode, a lower-level, platform-independent representation of your code. For imported modules, this bytecode is typically saved as .pyc files in the __pycache__ directory for reuse in future executions.
Execution: Finally, the bytecode is interpreted by the Python Virtual Machine (PVM), which executes the bytecode instructions one at a time.
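
The stages above can be approximated from Python itself. This is only a sketch: the tokenize module is a standard-library view of tokenization used here as a stand-in for the interpreter's internal tokenizer, and the one-line source string is invented for the example.

```python
import ast
import io
import tokenize

source = "result = 3 + 4\n"

# 1. Lexical analysis: source text -> tokens.
tokens = [
    (tokenize.tok_name[tok.type], tok.string)
    for tok in tokenize.generate_tokens(io.StringIO(source).readline)
]

# 2. Parsing: source text -> AST (ast.parse tokenizes internally).
tree = ast.parse(source)

# 3. Compilation: AST -> code object holding the bytecode.
code_obj = compile(tree, "<pipeline>", "exec")

# 4. Execution: the interpreter runs the bytecode in a namespace.
namespace = {}
exec(code_obj, namespace)

print(tokens[:3])                   # [('NAME', 'result'), ('OP', '='), ('NUMBER', '3')]
print(type(tree.body[0]).__name__)  # Assign
print(namespace["result"])          # 7
```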

Running it in the lab

docker compose run --rm generic-py-fu python3 internals/bytecode-dis.py

 17           2 LOAD_FAST                0 (number1)
              4 LOAD_FAST                1 (number2)
              6 BINARY_OP                0 (+)
             10 STORE_FAST               2 (result)
 18          12 LOAD_GLOBAL              1 (NULL + print)
             22 LOAD_FAST                2 (result)
             24 CALL                     1
             32 POP_TOP
             34 RETURN_CONST             0 (None)

dis prints the exact bytecode the function compiles to, the same instruction stream an attacker reads, rewrites, or injects when manipulating code objects at runtime.

Why the interpreter pipeline matters from an offensive security perspective

I treat the source-to-bytecode pipeline as a chain of intercept points, because every stage is a place where I can substitute my own code for what the developer wrote. The fact that CPython runs bytecode rather than source is the whole reason the language is so soft to attack from the inside.

The AST stage is a code-injection surface. ast.parse then compile() lets me build a code object from a tree I control, so anything that rewrites an AST before compilation (an ast.NodeTransformer, an import-time hook) decides what actually runs, no matter what the original .py file says. When auditing, I look for any code that parses untrusted input into an AST and compiles it.
compile() and exec() accept more than strings. compile() takes an AST object directly, so a sandbox that filters source strings never sees the payload if it arrives as a pre-built tree. I check whether a restriction operates on source or on the compiled object.
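.pyc files decouple execution from source. Because the PVM runs bytecode, a cached __pycache__/*.pyc is executed without its contents being compared to the logic in its .py: by default CPython only checks that the source modification time and size recorded in the .pyc header still match the .py file. I look for writable __pycache__ directories and stale or planted .pyc files, which run with zero visible source change.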
ast.literal_eval gets trusted as a boundary. I flag every place it is treated as fully safe on attacker input, since its guarantee only holds while the input stays restricted to literals.
dis is my reconnaissance tool. Disassembling a target’s functions tells me the exact instruction stream to read, patch, or inject when I am manipulating code objects in a live process.
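
As a starting point for the audit steps above, a first pass can be automated with the ast module itself. This is a rough sketch: find_dynamic_calls and the DYNAMIC_CALLS set are hypothetical names, and the check only catches direct calls by name, so aliased or indirect calls (for example through getattr) slip past it.

```python
import ast

# Assumption: these are the built-in names worth flagging in a first pass.
DYNAMIC_CALLS = {"compile", "exec", "eval"}

def find_dynamic_calls(source):
    """Return sorted (line, name) pairs for direct compile/exec/eval calls."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in DYNAMIC_CALLS
        ):
            findings.append((node.lineno, node.func.id))
    return sorted(findings)

# Illustrative target code; it is only parsed, never executed.
sample = '''
import ast
tree = ast.parse(user_input)
code = compile(tree, "<ast>", "exec")
exec(code)
'''

print(find_dynamic_calls(sample))  # [(4, 'compile'), (5, 'exec')]
```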

For defenders: the source you review is not the code that runs, so file-integrity monitoring on __pycache__ and tight control over any compile/exec/eval path on untrusted input matter more than reading the .py alone.
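
One small piece of that monitoring can be sketched in a few lines: reading the 16-byte .pyc header and comparing it with the source file. The pyc_header_matches_source function is a hypothetical helper, it only handles the default timestamp-based .pyc format, and a matching header says nothing about whether the bytecode after it was tampered with, so treat it as a staleness check rather than an integrity guarantee.

```python
import importlib.util
import os
import pathlib
import py_compile
import struct
import tempfile

def pyc_header_matches_source(pyc_path, source_path):
    """Check a timestamp-based .pyc header against its source file."""
    with open(pyc_path, "rb") as handle:
        header = handle.read(16)
    if len(header) < 16 or header[:4] != importlib.util.MAGIC_NUMBER:
        return False  # truncated, or written by a different Python version
    flags, mtime, size = struct.unpack("<III", header[4:16])
    if flags != 0:
        return False  # hash-based .pyc: not handled in this sketch
    stat = os.stat(source_path)
    return (
        mtime == (int(stat.st_mtime) & 0xFFFFFFFF)
        and size == (stat.st_size & 0xFFFFFFFF)
    )

# Demonstration on a hypothetical module in a temporary directory.
workdir = pathlib.Path(tempfile.mkdtemp())
source_path = workdir / "demo_module.py"
source_path.write_text("VALUE = 1\n")
pyc_path = py_compile.compile(
    str(source_path),
    invalidation_mode=py_compile.PycInvalidationMode.TIMESTAMP,
)

fresh = pyc_header_matches_source(pyc_path, source_path)

# Changing the source size makes the cached header stale.
source_path.write_text("VALUE = 1\nEXTRA = 2\n")
stale = pyc_header_matches_source(pyc_path, source_path)

print(fresh, stale)
```

A stronger check would hash the whole .pyc against a known-good baseline, which is what file-integrity monitoring tools do.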