Configuration of Python Applications

During the development of Python applications, I've continually asked myself how and when variables should be passed and initialized for the application's configuration. I want to be able to easily override the configuration for tests, for example, to use a local database for testing. But what exactly is application configuration, and why is it needed? The website https://12factor.net/ describes application configuration as follows:

An app’s config is everything that is likely to vary between deploys (staging, production, developer environments, etc). This includes:

Resource handles to the database, Memcached, and other backing services

Credentials to external services such as Amazon S3 or Twitter

Per-deploy values such as the canonical hostname for the deploy

Apps sometimes store config as constants in the code. This is a violation of twelve-factor, which requires strict separation of config from code. Config varies substantially across deploys, code does not.

So, mutable configuration variables should not be part of the application code but should be defined externally. This offers several advantages:

The application can be reparameterized and, for example, deployed in a different infrastructure without making changes to the code.
The code contains no secrets (private keys, database passwords, ...) that can be leaked through the code.
All configuration parameters are transparently defined in one place so they can be easily found and documented.

So, how do the configuration variables get into my Python application from outside the code? I'd like to answer this question first and compare different formats for configuration files. Subsequently, I'll try to clarify when and at what point in the code the configuration should be initialized. Finally, I will discuss how configurations can be overridden for tests and what problems may arise in the process.

How Do Configuration Variables Get Into My Code?

There are various ways to load configuration variables from outside the code. The variables can be defined within a configuration file, stored in environment variables, or dynamically loaded from so-called "Secret Stores".

Configuration via a Configuration File

There are different formats for defining a configuration file. I would like to briefly introduce the most common formats.

1. INI Format

The Python standard library provides several modules for loading configuration files. The configparser module loads .ini files, whose structure follows that of Microsoft Windows INI files.

Example INI Format:

[DEFAULT]
TimeoutInterval = 45

[db.postgres]
Host = localhost
Port = 5432

[app.webserver]
Port = 80

The parsed configuration can be accessed like a nested dictionary, with all values returned as strings. The file groups variables into sections; values in the DEFAULT section apply to all other sections.

import configparser

config = configparser.ConfigParser()
config.read('example.ini')

# access section variables
config['db.postgres']['Host']

# list sections
config.sections()
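As a small sketch of the points above, the following reads the same INI content from an inline string (an assumption made only to keep the example self-contained) and shows that values arrive as strings, that the typed getters convert them, and that keys from DEFAULT are visible in every section.

```python
import configparser

# Same content as example.ini above, inlined so the sketch is self-contained.
INI_TEXT = """
[DEFAULT]
TimeoutInterval = 45

[db.postgres]
Host = localhost
Port = 5432

[app.webserver]
Port = 80
"""

config = configparser.ConfigParser()
config.read_string(INI_TEXT)

# Plain access always yields strings ...
raw_port = config["db.postgres"]["Port"]

# ... while the typed getters convert the value.
port = config["db.postgres"].getint("Port")

# Keys from [DEFAULT] are visible in every section.
timeout = config["app.webserver"].getint("TimeoutInterval")

# sections() does not list the special DEFAULT section.
section_names = config.sections()
```

The typed getters (getint, getfloat, getboolean) are worth using consistently, since a forgotten conversion otherwise only shows up later as a string where a number was expected.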

2. JSON Format

In addition to configparser, the standard library also provides the json and tomllib modules, which load the widely used .json and .toml formats.

Example JSON Format:

{
  "db": {
    "postgres": {
      "Host": "localhost",
      "Port": 5432
    }
  },
  "app": {
    "webserver": {
      "Port": 80
    }
  }
}

import json

with open('example.json') as f:
    config = json.load(f)

# access variables
config["db"]["postgres"]["Host"]

3. TOML Format

The TOML format is significantly more compact, and easier to read and edit compared to the JSON format. Whereas nested structures in the JSON format can quickly become confusing, hierarchies in the TOML format can be easily represented through dot notation.

Example TOML (Tom's Obvious Minimal Language) Format:

[db.postgres]
Host = "localhost"
Port = 5432

[app.webserver]
Port = 80

import tomllib

with open("example.toml", "rb") as f:
    config = tomllib.load(f)

# access variables
config['db']['postgres']['Host']

Note: tomllib was added to the standard library in Python 3.11.

4. YAML Format

In addition to TOML, YAML has established itself as a widely used format for configuration files. Its biggest advantages are readability and support for comments. Because the syntax relies on indentation and allows several notation styles for the same data, it is more flexible but also more prone to errors. TOML, with its explicit syntax, is easier to write and parse, but offers less flexibility in representing complex data structures.

In Python, for example, the PyYAML library is available, with which YAML files can be loaded.

Example YAML (YAML Ain't Markup Language) Format:

db:
  postgres:
    Host: "localhost"
    Port: 5432
app:
  webserver:
    Port: 80

import yaml
with open("example.yaml") as f:
    config = yaml.safe_load(f)

# access variables
config['db']['postgres']['Port']

Configuration via Environment Variables

The 12-Factor App methodology describes configuration via environment variables as the preferred method. Environment variables reduce the risk of unintentionally publishing secrets, because they live outside the repository and are therefore much harder to check into version control (e.g., Git) by accident. They are agnostic of programming language and operating system, and they do not force a fixed grouping into dev, stage, or prod environments.

Manually Setting Environment Variables

In the simplest form, environment variables can be set via the console. In Linux, this can be done using:

export DB_POSTGRES_HOST=localhost
export DB_POSTGRES_PORT=5432
export APP_WEBSERVER_PORT=80

or in the Windows command prompt, using:

set DB_POSTGRES_HOST=localhost
set DB_POSTGRES_PORT=5432
set APP_WEBSERVER_PORT=80

Environment variables can be read in Python using the os module. Note that os.getenv always returns a string, or None if the variable is not set:

import os

# access variables
db_postgres_host = os.getenv("DB_POSTGRES_HOST")
db_postgres_port = os.getenv("DB_POSTGRES_PORT")
app_webserver_port = os.getenv("APP_WEBSERVER_PORT")
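Because every value read this way is a string (or None), ports and similar settings need explicit conversion. The following is a minimal sketch of one way to do that; the helper name env_int is hypothetical and not part of the os module.

```python
import os
from typing import Optional


def env_int(name: str, default: Optional[int] = None) -> int:
    """Read an integer environment variable (hypothetical helper).

    Fails loudly if the variable is missing without a default,
    or if its value is not a valid integer.
    """
    raw = os.getenv(name)
    if raw is None:
        if default is None:
            raise KeyError(f"Environment variable {name} is not set")
        return default
    try:
        return int(raw)
    except ValueError:
        raise ValueError(f"{name} must be an integer, got {raw!r}") from None


# Simulate the shell setup so the sketch is self-contained.
os.environ["DB_POSTGRES_PORT"] = "5432"
os.environ.pop("APP_WEBSERVER_PORT", None)

db_postgres_port = env_int("DB_POSTGRES_PORT")
app_webserver_port = env_int("APP_WEBSERVER_PORT", default=80)
```

Failing at startup with a clear message is usually preferable to passing the string "5432" or None deeper into the application.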

Configuration Using a .env File

Since manually setting environment variables in the development environment can be quite tedious, many developers use a so-called .env file, which defines the environment variables. Of course, this gives up the protection against unintentionally publishing secrets, because the values are stored in a file again. Therefore, the .env file should always be listed in the .gitignore file. The content of a .env file could look as follows:

DB_POSTGRES_HOST=localhost
DB_POSTGRES_PORT=5432
APP_WEBSERVER_PORT=80

To load this .env file and access the variables, you can use the python-dotenv library, for instance. By default, this library loads the file but does not overwrite environment variables that are already set.

from dotenv import load_dotenv
import os

load_dotenv(".env")

db_postgres_host = os.getenv("DB_POSTGRES_HOST")
db_postgres_port = os.getenv("DB_POSTGRES_PORT")
app_webserver_port = os.getenv("APP_WEBSERVER_PORT")

Parsing Environment Variables Using Pydantic

The validation library pydantic offers a simple and elegant method to manage configuration files and environment variables in Python programs with Pydantic Settings (installable via the pydantic-settings package). It combines the strengths of Pydantic data validation with a flexible hierarchy of configuration sources, enabling robust and error-free management of application settings.

With validation, once the settings object is initialized without errors, it can be assumed that all necessary configuration parameters are available to the application. The predefined configuration parameters, including all data types, also enable type-checking and auto-completion, significantly simplifying the handling of configuration parameters during development.

Minimal example with pydantic-settings:

from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    db_postgres_host: str
    db_postgres_port: int
    app_webserver_port: int

# load and parse settings
settings = Settings()

# access variables
settings.db_postgres_host
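To make the fail-fast idea concrete without the third-party dependency, here is a rough standard-library sketch. It is not how pydantic-settings is implemented and covers far less; the names EnvSettings and load_env_settings are hypothetical. It maps upper-cased field names to environment variables, converts them to the annotated types, and reports all problems at once.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping


@dataclass(frozen=True)
class EnvSettings:
    db_postgres_host: str
    db_postgres_port: int
    app_webserver_port: int


def load_env_settings(environ: Mapping[str, str] = os.environ) -> EnvSettings:
    """Build EnvSettings from upper-cased field names, collecting all problems first."""
    values = {}
    problems = []
    for field in fields(EnvSettings):
        name = field.name.upper()
        raw = environ.get(name)
        if raw is None:
            problems.append(f"{name} is missing")
            continue
        try:
            # field.type is the annotated class (str or int) and is used as converter.
            values[field.name] = field.type(raw)
        except ValueError:
            problems.append(f"{name}={raw!r} is not a valid {field.type.__name__}")
    if problems:
        raise ValueError("Invalid configuration: " + "; ".join(problems))
    return EnvSettings(**values)


# A plain dict stands in for the process environment here.
env_settings = load_env_settings(
    {
        "DB_POSTGRES_HOST": "localhost",
        "DB_POSTGRES_PORT": "5432",
        "APP_WEBSERVER_PORT": "80",
    }
)
```

Once load_env_settings returns, every field is present and has the expected type, which is the same guarantee the settings object above is meant to provide.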

When and where do I load the application configuration?

Configuration initialization in an application should only occur once. All application components should only access a shared truth. For example, if two database clients are initialized at different times in the program code, it should not be possible for both clients to be initialized with different configurations in case the database configuration is altered in the meantime. This issue becomes particularly relevant when configurations are dynamically loaded from a parameter/secret store. While environment variables are typically set at the time of an application's deployment and therefore a subsequent change is unlikely (though not impossible, for example, due to a developer's manual intervention), it is normal for dynamically loaded configuration parameters to be altered during the program's runtime.

What areas in my program code are there to initialize the configuration?

Initialization in the module's __init__.py

The __init__.py in the package's root directory, as the name suggests, is used for initialization - hence, it is a natural place to initialize the application configuration. The interpreter executes the __init__.py only once, during the first import of the package. An exemplary __init__.py could look like this:

"""Initialize application settings"""
from ._settings import Settings

settings = Settings()

__all__ = ["settings"]

The settings object is thus available to all parts of the program as a simple import and exists only once in the code.

from my_module import settings

Of course, the initialization can also take place in a separate settings.py or config.py file, as is found in many applications.

Initialization within a function

Another way to initialize the application configuration is to do it, for instance, at the application's entry point.

def main():
    settings = Settings()

    sub_function(settings)

The configuration object is then passed on as a whole or in parts to sub-functions. This approach initially involves additional development work, but it ensures that functions do not 'sideload' variables from the settings object behind the caller's back. Here is an example to make that clear.

# Pass all variables
def init_client(host: str, port: int, user: str, password: str) -> Client:
    return Client(host=host, port=port, user=user, password=password)

# Side loading of configuration variables
from my_module import settings

def init_client(user: str, password: str) -> Client:
    return Client(host=settings.host, port=settings.port, user=user, password=password)

The first example is not only easier to read but also much easier to test. The example also illustrates why it can be sensible to define constants separately. By defining constants separate from the configuration object, I reduce the number of configuration parameters. Sideloading of constants is not an issue, as they do not need to be overwritten for tests.

Constants can, for instance, be defined in a separate constants.py file:

# constants.py
MAX_EMAIL_FIELD_LENGTH = 60

# utils.py
from constants import MAX_EMAIL_FIELD_LENGTH

def truncate_email(email: str) -> str:
    return email[:MAX_EMAIL_FIELD_LENGTH]
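Returning to the first init_client variant above: because every value arrives as an argument, a test can call the function with its own values and nothing needs to be patched. The sketch below assumes a hypothetical Client dataclass as a stand-in for a real database client.

```python
from dataclasses import dataclass


@dataclass
class Client:
    """Hypothetical stand-in for a real database client."""

    host: str
    port: int
    user: str
    password: str


def init_client(host: str, port: int, user: str, password: str) -> Client:
    return Client(host=host, port=port, user=user, password=password)


def test_init_client_uses_given_connection_details() -> None:
    # No patching and no environment setup: the test supplies its own values.
    client = init_client(host="localhost", port=5433, user="test", password="test")
    assert (client.host, client.port) == ("localhost", 5433)


test_init_client_uses_given_connection_details()
```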

Passing the settings object as a parameter also has some disadvantages, especially for larger applications. If the settings object is passed as a parameter, this can lead to almost every function of the application needing this additional parameter. This can make the code more complex and harder to read, especially if there are many functions that require settings. It can also mean that the settings object must be passed down through many levels of functions, which also increases complexity.

Both approaches, sideloading a shared settings object and passing it through many layers of functions, add complexity of their own and can make the code harder to understand and maintain. For this reason, neither is usually the preferred way to override settings for tests. It is often easier and cleaner to have separate test configurations or to use setup and teardown functions in the tests.

How do I patch the application configuration for tests?

A prerequisite for good testability is to avoid the sideloading of configuration parameters described in the previous section. Unit tests can then be run without having to patch configuration parameters at all. If configuration parameters still need to be patched, for example, to enable connection to a local test database, there are some pitfalls to consider.

Patching a configuration object initialized at import

The initialization of the application configuration is performed by the first import from the module. Since an import from the module to be tested is carried out before the patching process, there is no easy way to manipulate the application configuration before initialization. A test environment must therefore be created in which initialization can be carried out without problems. Patching the application configuration therefore occurs after initialization.

For initializing the application configuration for tests, it is advisable to maintain separate configuration files (e.g., test_config.toml or dev.env). Individual parameters can then be patched separately for each test case. Visual Studio Code, for example, offers the possibility to set separate .env files for the debug and prod environments, as described in its documentation.
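Patching an individual parameter after initialization can be sketched with unittest.mock.patch.object from the standard library. The Settings fields, the connection_url function, and the host name test-db.internal are hypothetical; the module-level settings object stands in for the one created at import time from the test configuration.

```python
from dataclasses import dataclass
from unittest.mock import patch


@dataclass
class Settings:
    db_postgres_host: str
    db_postgres_port: int


# Stands in for the object created at import time from the test configuration.
settings = Settings(db_postgres_host="test-db.internal", db_postgres_port=5432)


def connection_url() -> str:
    # Reads the shared settings object at call time (sideloading).
    return f"postgresql://{settings.db_postgres_host}:{settings.db_postgres_port}"


def test_connection_url_with_local_database() -> None:
    # Override a single parameter for the duration of this test only.
    with patch.object(settings, "db_postgres_host", "localhost"):
        assert connection_url() == "postgresql://localhost:5432"
    # The original value is restored when the block exits.
    assert settings.db_postgres_host == "test-db.internal"


test_connection_url_with_local_database()
```

This only works because connection_url looks up the attribute when it is called. A value that was copied out of the settings object at import time would not see the patch.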

Patching a configuration object initialized at runtime

Configuration objects that are initialized at runtime and load configuration variables dynamically from a parameter store, for instance, are easier to patch. In this case, it is advisable to patch the function that retrieves the parameters. During implementation, it must be ensured that the function for retrieving the configuration parameters is not called at the time of the module import, as this would make patching in advance impossible. This requirement again impacts the implementation. Here is an example:

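# settings.py
from dataclasses import dataclass

@dataclass
class Settings:
    param1: str

# Module-level cache; stays None until load_settings() is first called
settings: Settings | None = None

def load_settings() -> Settings:
    global settings
    if settings is not None:
        return settings

    # retrieve_settings() fetches the parameters, e.g. from a parameter store
    settings = retrieve_settings()
    return settings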

In this example, settings is defined in such a way that it can either be an object of the Settings class or None. When importing from this module, settings remains None until load_settings() has been called. settings is defined here as a global variable and can therefore exist only once. In this case, each use of the settings object would first have to check whether it is None.

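import settings

def example_fun():
    # Access the attribute on the module: "from settings import settings" would
    # copy the initial None and never see the object set by load_settings().
    if settings.settings is None:
        raise ValueError("Settings not loaded.")
    print(settings.settings.param1)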

This leads to boilerplate code, as the existence of the settings object must be checked on every access (which makes sense, as it might not have been initialized yet). It is much easier to access the settings object through the helper function.

from settings import load_settings

def example_fun():
    print(load_settings().param1)

Of course, this function could also be hidden in a lovely property within a Settings class.
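One possible variant of this lazy loading, sketched here with functools.lru_cache instead of the global variable, also shows how the retrieval step can be patched before the first load. ParameterStore and its fetch method are hypothetical stand-ins for whatever client talks to the parameter store.

```python
from dataclasses import dataclass
from functools import lru_cache
from unittest.mock import patch


@dataclass(frozen=True)
class Settings:
    param1: str


class ParameterStore:
    """Hypothetical client for a parameter/secret store."""

    def fetch(self) -> dict:
        raise RuntimeError("no real parameter store in this sketch")


@lru_cache(maxsize=None)
def load_settings() -> Settings:
    # Runs on the first call only; later calls return the cached object.
    return Settings(**ParameterStore().fetch())


def example_fun() -> str:
    return load_settings().param1


def test_example_fun() -> None:
    load_settings.cache_clear()  # make sure no earlier test left a cached object
    with patch.object(
        ParameterStore, "fetch", return_value={"param1": "from-test"}
    ) as fetch:
        assert example_fun() == "from-test"
        assert example_fun() == "from-test"
        fetch.assert_called_once()  # the second call was served from the cache
    load_settings.cache_clear()  # do not leak the patched settings to other tests


test_example_fun()
```

Nothing is retrieved at import time, so the patch can be applied first. The explicit cache_clear calls are the price of caching: without them, one test's settings would carry over into the next.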

Summary

The configuration of an application is an essential component and includes everything that varies between different environments (e.g., development, test, production). These configurations should not be set as constants in the code but should be loaded from outside. This increases flexibility and security, as sensitive information like passwords or API keys does not have to be stored in the code.

There are various ways to load these configurations in Python, either via configuration file formats like INI, JSON, TOML, or YAML, or via environment variables. The use of environment variables has the advantage of being language- and operating system-independent, thus providing a universal solution.

The configuration initialization should occur exactly once, and all application components should access this one source. The initialization can either occur in a special initialization file or in a function that then passes the configuration objects to the different parts of the application.

For tests, it may be necessary to overwrite certain configurations. In such cases, it is often easier to create separate test configurations or use appropriate setup and teardown functions in the tests.

In conclusion, the configuration of an application is a critical aspect that needs to be carefully planned and implemented. Flexibility and security should always be emphasized to adapt the application to different environments and protect sensitive information.

2023-10-02