QEMU Backend for OpenTofu - Design Document

Goals

Primary Objective

Create a custom backend for OpenTofu that manages QEMU virtual machines directly, providing declarative infrastructure management for VM workloads with minimal external dependencies.

Key Goals

Direct QEMU Control: Manage QEMU processes natively without abstraction layers
OpenTofu Integration: Leverage existing OpenTofu HTTP backend protocol for seamless integration
Declarative VM Management: Define VM infrastructure as code with full lifecycle management
Resource Efficiency: Tracked allocation of memory, ports, and disk, with supervised QEMU processes
Operational Simplicity: Self-contained solution with minimal external dependencies

Success Criteria

OpenTofu can create, modify, and destroy QEMU VMs through standard configuration
State management maintains consistency between declared and actual VM state
Concurrent operations are safely handled through proper locking
System recovers gracefully from failures and restarts
Performance scales to reasonable VM workloads (10-100 VMs)

Architecture Overview

High-Level Design

OpenTofu Client → HTTP Backend Protocol → QEMU Management Server
                                              ↓
                                        QEMU Processes
                                              ↓
                                        State Storage

Components

1. OpenTofu HTTP Backend Client

Role: Built-in OpenTofu HTTP backend (no custom code required)
Responsibilities: State serialization, HTTP communication, locking protocol
Configuration: Points to custom QEMU management server endpoints
Integration: Works with existing OpenTofu workflows and tooling

2. QEMU Management Server

Role: Core application implementing HTTP backend protocol
Responsibilities:
- HTTP API implementation (state CRUD, locking)
- QEMU process lifecycle management
- Resource allocation and conflict resolution
- State persistence and recovery
- Exposing a web interface for monitoring, management and debugging

3. State Storage Layer

Role: Persistent storage for OpenTofu state and VM metadata
Options: SQLite (default), with PostgreSQL and file-based storage as alternative backends
Responsibilities: State persistence, backup, recovery

4. QEMU Process Manager

Role: Direct QEMU process control and supervision
Responsibilities: Process spawning, monitoring, resource management, cleanup

Implementation Plan

Phase 1: Core HTTP Backend Server

Deliverables:

Basic HTTP server implementing OpenTofu backend protocol
State storage and retrieval (GET/POST endpoints)
State locking mechanism (LOCK/UNLOCK endpoints)
Configuration management and validation

Key Tasks:

Implement REST API handlers for state operations
Design state storage schema and persistence layer
Add proper error handling and logging
Create basic configuration system

Validation:

Running OpenTofu against the resulting service should return valid, successful responses from the state and lock endpoints, even though no VM is created or run at this stage.

Phase 2: QEMU Integration

Deliverables:

QEMU process lifecycle management
Resource allocation system (ports, memory, disk)
Process monitoring and health checks
Basic VM operations (create, start, stop, destroy)

Key Tasks:

Build QEMU command-line generation
Implement process supervision and PID tracking
Add QEMU Machine Protocol (QMP) integration
Create resource conflict detection

Validation:

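Running OpenTofu against the resulting service should still return valid, successful responses, and the server should be able to create, start, stop, and destroy a QEMU process on its own, even though OpenTofu state does not drive VM creation yet.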

Phase 3: State Processing and VM Management

Deliverables:

State diff processing to determine required changes
VM configuration template system
Networking and storage management
Graceful shutdown and cleanup procedures

Key Tasks:

Parse OpenTofu state changes into VM operations
Implement VM configuration templating
Add network and storage allocation
Build recovery and cleanup mechanisms
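The task of turning state changes into VM operations can be illustrated with a small diff over two snapshots. This is only a sketch under stated assumptions: `VMSpec`, `Op`, and `diffVMs` are hypothetical names, the snapshots are assumed to be maps keyed by VM name, and the ordering (destroy, then update, then create) is one possible policy for freeing resources before claiming new ones.

```go
package main

import "sort"

// VMSpec is a hypothetical, comparable summary of one VM's configuration.
type VMSpec struct {
	Memory   uint64 // MB
	CPUs     int
	DiskPath string
}

// Op is one VM operation derived from a state change.
type Op struct {
	Action string // "destroy", "update", or "create"
	VM     string
}

// diffVMs compares the previously stored snapshot with the next one and
// returns the operations needed to move from prev to next.
func diffVMs(prev, next map[string]VMSpec) []Op {
	var ops []Op
	for name, want := range next {
		have, exists := prev[name]
		switch {
		case !exists:
			ops = append(ops, Op{Action: "create", VM: name})
		case have != want:
			ops = append(ops, Op{Action: "update", VM: name})
		}
	}
	for name := range prev {
		if _, keep := next[name]; !keep {
			ops = append(ops, Op{Action: "destroy", VM: name})
		}
	}
	// Map iteration order is random, so sort for a deterministic plan:
	// destroy first, then update, then create, by VM name within each group.
	rank := map[string]int{"destroy": 0, "update": 1, "create": 2}
	sort.Slice(ops, func(i, j int) bool {
		if rank[ops[i].Action] != rank[ops[j].Action] {
			return rank[ops[i].Action] < rank[ops[j].Action]
		}
		return ops[i].VM < ops[j].VM
	})
	return ops
}
```

Whether an "update" can be applied live or needs a restart is left open here; the sketch only detects that something changed.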

Validation:

Boot a VM from OpenTofu configuration until network connectivity is established (ping response)
Verify VM configuration changes are applied correctly through state diff processing
Test graceful VM shutdown and resource cleanup
Validate network and storage allocation/deallocation

Phase 4: Production Readiness

Deliverables:

Comprehensive error handling and recovery
Performance optimization and resource limits
Monitoring and observability features
Documentation and deployment guides

Key Tasks:

Add metrics and health endpoints
Implement backup and restore procedures
Performance testing and optimization
Security hardening and authentication

Validation:

Performance: Deploy 10+ concurrent VMs and validate system stability under load
Monitoring: Verify metrics endpoints expose VM count, memory usage, and error rates
Recovery: Kill QEMU processes and validate automatic cleanup and state consistency
Backup/Restore: Create state backup, simulate data loss, and restore from backup
Security: Test authentication mechanisms and validate unauthorized access is blocked
Error Handling: Inject failures (disk full, network issues) and verify graceful degradation
Resource Limits: Exceed configured limits (max VMs, memory) and validate enforcement

Technical Specifications

HTTP Backend Protocol Implementation

Required Endpoints

GET    /state/{project}        - Retrieve current state (JSON)
POST   /state/{project}        - Store new state (JSON body)
DELETE /state/{project}        - Delete state (optional)
LOCK   /state/{project}/lock   - Acquire state lock (JSON body)
UNLOCK /state/{project}/lock   - Release state lock (JSON body)
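To make the endpoint table concrete, here is a minimal in-memory sketch of a handler that routes these five methods. It is not the planned server: `stateServer`, `newStateServer`, and the `do` helper are hypothetical names, persistence and authentication are omitted, and UNLOCK does not verify the lock ID. Returning 404 for a missing state and 423 (Locked) with the holder's lock info on a conflict follows what the built-in HTTP backend client is understood to accept.

```go
package main

import (
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// stateServer is a hypothetical in-memory stand-in for the management server.
type stateServer struct {
	mu     sync.Mutex
	states map[string][]byte // project -> raw state JSON
	locks  map[string][]byte // project -> raw lock info JSON
}

func newStateServer() *stateServer {
	return &stateServer{states: map[string][]byte{}, locks: map[string][]byte{}}
}

// ServeHTTP routes /state/{project} and /state/{project}/lock by HTTP method.
func (s *stateServer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	rest := strings.TrimPrefix(r.URL.Path, "/state/")
	if rest == r.URL.Path || rest == "" {
		http.NotFound(w, r)
		return
	}
	isLock := strings.HasSuffix(rest, "/lock")
	project := strings.TrimSuffix(rest, "/lock")
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}
	s.mu.Lock()
	defer s.mu.Unlock()

	switch {
	case isLock && r.Method == "LOCK":
		if held, exists := s.locks[project]; exists {
			w.WriteHeader(http.StatusLocked) // 423: report the current holder
			w.Write(held)
			return
		}
		s.locks[project] = body
	case isLock && r.Method == "UNLOCK":
		delete(s.locks, project) // a real server would compare lock IDs first
	case !isLock && r.Method == http.MethodGet:
		state, exists := s.states[project]
		if !exists {
			http.NotFound(w, r)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.Write(state)
	case !isLock && r.Method == http.MethodPost:
		s.states[project] = body
	case !isLock && r.Method == http.MethodDelete:
		delete(s.states, project)
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

// do sends one request to the handler in-process and returns status and body.
func do(h http.Handler, method, path, body string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, path, strings.NewReader(body)))
	return rec.Code, rec.Body.String()
}
```

The same handler could be served with `http.ListenAndServe(":8080", newStateServer())`; the `do` helper only exists so the routing can be exercised without opening a socket.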

State Format

Standard Terraform/OpenTofu JSON state format
Version 4 state schema compatibility
Custom resource types for QEMU VMs
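As an illustration of reading a version 4 state document, the sketch below decodes only the fields needed to find QEMU VM resources. The `tfState` and `tfResource` types and the `qemuVMs` function are hypothetical; the `qemu_vm` type name comes from the example configuration in this document, and the addressing scheme (ignoring `count`/`for_each` index keys) is a simplifying assumption.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tfState holds the subset of the version 4 state format used in this sketch.
type tfState struct {
	Version   int          `json:"version"`
	Serial    uint64       `json:"serial"`
	Lineage   string       `json:"lineage"`
	Resources []tfResource `json:"resources"`
}

type tfResource struct {
	Mode      string `json:"mode"` // "managed" or "data"
	Type      string `json:"type"`
	Name      string `json:"name"`
	Instances []struct {
		Attributes map[string]any `json:"attributes"`
	} `json:"instances"`
}

// qemuVMs returns the attributes of every managed qemu_vm instance, keyed by
// a resource address such as "qemu_vm.web_server".
func qemuVMs(raw []byte) (map[string]map[string]any, error) {
	var st tfState
	if err := json.Unmarshal(raw, &st); err != nil {
		return nil, fmt.Errorf("decode state: %w", err)
	}
	if st.Version != 4 {
		return nil, fmt.Errorf("unsupported state version %d", st.Version)
	}
	vms := map[string]map[string]any{}
	for _, res := range st.Resources {
		if res.Mode != "managed" || res.Type != "qemu_vm" {
			continue
		}
		for i, inst := range res.Instances {
			addr := res.Type + "." + res.Name
			if len(res.Instances) > 1 {
				addr = fmt.Sprintf("%s[%d]", addr, i) // assumption: positional index
			}
			vms[addr] = inst.Attributes
		}
	}
	return vms, nil
}
```

Note that `encoding/json` decodes JSON numbers in `map[string]any` as `float64`, so attributes such as `memory` need a conversion before use.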

Locking Protocol

{
  "ID": "unique-lock-id",
  "Operation": "OperationTypePlan|OperationTypeApply",
  "Info": "operation description",
  "Who": "user@host",
  "Version": "opentofu-version",
  "Created": "2024-01-01T00:00:00Z",
  "Path": "terraform-working-directory"
}
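The lock body above maps naturally onto a Go struct. The following is a sketch of how the server might decode it and decide between granting the lock and reporting a conflict; `LockInfo`, `lockTable`, and `errLockHeld` are hypothetical names, the table lives in memory only, and treating the release of a free lock as a no-op is an assumption rather than a documented requirement.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"sync"
	"time"
)

// LockInfo mirrors the JSON fields of the lock request body.
type LockInfo struct {
	ID        string    `json:"ID"`
	Operation string    `json:"Operation"`
	Info      string    `json:"Info"`
	Who       string    `json:"Who"`
	Version   string    `json:"Version"`
	Created   time.Time `json:"Created"`
	Path      string    `json:"Path"`
}

var errLockHeld = errors.New("state is locked")

// lockTable tracks at most one lock per project.
type lockTable struct {
	mu    sync.Mutex
	locks map[string]LockInfo
}

func newLockTable() *lockTable { return &lockTable{locks: map[string]LockInfo{}} }

// Acquire decodes the request body and records the lock if the project is
// free. On a conflict it returns the current holder together with errLockHeld.
func (t *lockTable) Acquire(project string, body []byte) (LockInfo, error) {
	var req LockInfo
	if err := json.Unmarshal(body, &req); err != nil {
		return LockInfo{}, fmt.Errorf("decode lock info: %w", err)
	}
	if req.ID == "" {
		return LockInfo{}, errors.New("lock info has no ID")
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	if held, ok := t.locks[project]; ok {
		return held, errLockHeld
	}
	t.locks[project] = req
	return req, nil
}

// Release removes the lock only when the caller presents the matching ID.
func (t *lockTable) Release(project, id string) error {
	t.mu.Lock()
	defer t.mu.Unlock()
	held, ok := t.locks[project]
	if !ok {
		return nil // assumption: releasing a free lock is a no-op
	}
	if held.ID != id {
		return fmt.Errorf("lock held by %q, not %q", held.ID, id)
	}
	delete(t.locks, project)
	return nil
}
```

An HTTP handler could map `errLockHeld` to a 423 response whose body is the returned holder, serialized back to JSON.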

QEMU Process Management

VM Lifecycle Operations

Create: Generate QEMU configuration and spawn process
Start/Stop: Control VM power state through QMP
Modify: Update VM configuration (restart required for some changes)
Destroy: Graceful shutdown and resource cleanup
Monitor: Health checks and resource usage tracking
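Power control through QMP could look roughly like the sketch below. QMP is a line-oriented JSON protocol: the server sends a greeting, the client negotiates with `qmp_capabilities`, and commands such as `system_powerdown`, `stop`, `cont`, or `query-status` then return a `return` or `error` object, with asynchronous `event` messages possibly interleaved. The `qmpClient` type and the `scriptedConn` test double are hypothetical; in practice the connection would come from something like `net.Dial("unix", socketPath)`, and commands with arguments are not covered.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"strings"
)

// qmpMessage covers the server message shapes this sketch cares about.
type qmpMessage struct {
	Greeting json.RawMessage `json:"QMP"`
	Return   json.RawMessage `json:"return"`
	Event    string          `json:"event"`
	Error    *struct {
		Class string `json:"class"`
		Desc  string `json:"desc"`
	} `json:"error"`
}

type qmpClient struct {
	r *bufio.Reader
	w io.Writer
}

// newQMPClient reads the greeting and leaves capabilities negotiation mode.
func newQMPClient(conn io.ReadWriter) (*qmpClient, error) {
	c := &qmpClient{r: bufio.NewReader(conn), w: conn}
	msg, err := c.read()
	if err != nil {
		return nil, err
	}
	if msg.Greeting == nil {
		return nil, fmt.Errorf("expected QMP greeting")
	}
	if _, err := c.Execute("qmp_capabilities"); err != nil {
		return nil, err
	}
	return c, nil
}

// read decodes one newline-terminated JSON message.
func (c *qmpClient) read() (qmpMessage, error) {
	var msg qmpMessage
	line, err := c.r.ReadBytes('\n')
	if err != nil {
		return msg, err
	}
	err = json.Unmarshal(line, &msg)
	return msg, err
}

// Execute sends an argument-less command and returns its "return" payload,
// skipping any asynchronous events that arrive first.
func (c *qmpClient) Execute(command string) (json.RawMessage, error) {
	req, err := json.Marshal(map[string]string{"execute": command})
	if err != nil {
		return nil, err
	}
	if _, err := c.w.Write(append(req, '\n')); err != nil {
		return nil, err
	}
	for {
		msg, err := c.read()
		if err != nil {
			return nil, err
		}
		switch {
		case msg.Error != nil:
			return nil, fmt.Errorf("qmp %s: %s", msg.Error.Class, msg.Error.Desc)
		case msg.Event != "":
			continue
		default:
			return msg.Return, nil
		}
	}
}

// scriptedConn replays canned server output and records client writes.
// It exists only to demonstrate the client without a running QEMU.
type scriptedConn struct {
	io.Reader
	Sent bytes.Buffer
}

func newScriptedConn(script string) *scriptedConn {
	return &scriptedConn{Reader: strings.NewReader(script)}
}

func (s *scriptedConn) Write(p []byte) (int, error) { return s.Sent.Write(p) }
```

With a live VM, `c.Execute("system_powerdown")` requests an ACPI shutdown from the guest, while `c.Execute("quit")` terminates the QEMU process itself.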

Resource Management

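// NetworkConfig describes guest networking (assumed fields, mirroring the
// network block in the OpenTofu example).
type NetworkConfig struct {
    Type    string // e.g. "user"
    HostFwd string // e.g. "tcp::8080-:80"
}

// PortRange is the host port range available for allocation (assumed fields,
// mirroring port_range in the server configuration).
type PortRange struct {
    Start int
    End   int
}

type VMConfig struct {
    Name          string
    Memory        uint64 // MB
    CPUs          int
    DiskPath      string
    NetworkConfig NetworkConfig
    VNCPort       int
    QMPSocket     string
}

type ResourcePool struct {
    MaxMemory      uint64 // MB
    UsedMemory     uint64 // MB
    PortRange      PortRange
    AllocatedPorts map[int]string
    DiskPaths      map[string]string
}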
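One way the resource pool could hand out memory and ports is sketched below. The `Reserve` and `Release` methods, the `ErrNoCapacity` error, the inclusive port bounds, and the interpretation of `AllocatedPorts` as a port-to-VM-name map are all assumptions made for illustration; the sketch is also not safe for concurrent use without an external mutex.

```go
package main

import (
	"errors"
	"fmt"
)

// PortRange is assumed to be inclusive on both ends.
type PortRange struct{ Start, End int }

// ResourcePool is a trimmed copy of the struct above, enough for this sketch.
type ResourcePool struct {
	MaxMemory      uint64 // MB
	UsedMemory     uint64 // MB
	PortRange      PortRange
	AllocatedPorts map[int]string // assumption: port -> owning VM name
}

var ErrNoCapacity = errors.New("insufficient resources")

// Reserve claims memory and the lowest free port for a VM. On failure the
// pool is left unchanged. It assumes UsedMemory never exceeds MaxMemory.
func (p *ResourcePool) Reserve(vmName string, memoryMB uint64) (int, error) {
	free := p.MaxMemory - p.UsedMemory
	if memoryMB > free {
		return 0, fmt.Errorf("%w: need %d MB, %d MB free", ErrNoCapacity, memoryMB, free)
	}
	for port := p.PortRange.Start; port <= p.PortRange.End; port++ {
		if _, taken := p.AllocatedPorts[port]; taken {
			continue
		}
		if p.AllocatedPorts == nil {
			p.AllocatedPorts = map[int]string{}
		}
		p.AllocatedPorts[port] = vmName
		p.UsedMemory += memoryMB
		return port, nil
	}
	return 0, fmt.Errorf("%w: no free port in %d-%d", ErrNoCapacity, p.PortRange.Start, p.PortRange.End)
}

// Release returns a VM's port and memory to the pool. Unknown ports are ignored.
func (p *ResourcePool) Release(port int, memoryMB uint64) {
	if _, ok := p.AllocatedPorts[port]; ok {
		delete(p.AllocatedPorts, port)
		p.UsedMemory -= memoryMB
	}
}
```

Checking memory before scanning for a port keeps the operation all-or-nothing: either both resources are claimed or neither is.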

Process Supervision

PID tracking and process monitoring
Graceful shutdown with configurable timeouts
Orphan process detection and cleanup
Log aggregation and rotation
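Graceful shutdown with a timeout can be sketched as "ask politely, wait, then force". The `supervised` struct, `shutdown`, and `superviseCmd` below are hypothetical names. Using SIGTERM as the polite request is an assumption for the sketch; a real implementation might instead send QMP `system_powerdown` so the guest can shut down cleanly.

```go
package main

import (
	"errors"
	"os"
	"os/exec"
	"syscall"
	"time"
)

// supervised abstracts what the supervisor needs from one VM process.
type supervised struct {
	Terminate func() error    // polite request, e.g. SIGTERM or QMP system_powerdown
	Kill      func() error    // forced stop, e.g. SIGKILL
	Exited    <-chan struct{} // closed once the process has been reaped
}

// shutdown requests termination, waits up to timeout, and kills the process
// if it is still running. It reports whether the exit was graceful.
func shutdown(p supervised, timeout time.Duration) (bool, error) {
	if err := p.Terminate(); err != nil {
		return false, err
	}
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	select {
	case <-p.Exited:
		return true, nil
	case <-timer.C:
	}
	if err := p.Kill(); err != nil {
		return false, err
	}
	<-p.Exited // wait for the process to be reaped
	return false, nil
}

// ignoreDone treats "process already finished" as success.
func ignoreDone(err error) error {
	if errors.Is(err, os.ErrProcessDone) {
		return nil
	}
	return err
}

// superviseCmd adapts an already started exec.Cmd to the supervised interface.
func superviseCmd(cmd *exec.Cmd) supervised {
	exited := make(chan struct{})
	go func() {
		_ = cmd.Wait() // reap the child; the exit status is ignored in this sketch
		close(exited)
	}()
	return supervised{
		Terminate: func() error { return ignoreDone(cmd.Process.Signal(syscall.SIGTERM)) },
		Kill:      func() error { return ignoreDone(cmd.Process.Kill()) },
		Exited:    exited,
	}
}
```

Separating the policy (`shutdown`) from the process handle (`superviseCmd`) lets the timeout logic be exercised with fake callbacks instead of real QEMU processes.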

State Storage Schema

Core Tables/Collections

-- State storage
CREATE TABLE states (
    project_name VARCHAR PRIMARY KEY,
    state_data JSON,
    version INTEGER,
    updated_at TIMESTAMP
);

-- Lock management
CREATE TABLE locks (
    project_name VARCHAR PRIMARY KEY,
    lock_info JSON,
    acquired_at TIMESTAMP
);

-- VM process tracking
CREATE TABLE vm_processes (
    vm_name VARCHAR PRIMARY KEY,
    project_name VARCHAR,
    pid INTEGER,
    config JSON,
    status VARCHAR,
    created_at TIMESTAMP
);

Configuration

Server Configuration

server:
  host: "0.0.0.0"
  port: 8080
  tls:
    cert_file: "/path/to/cert.pem"
    key_file: "/path/to/key.pem"

storage:
  type: "sqlite"  # sqlite, postgres, file
  connection: "/var/lib/qemu-backend/state.db"

qemu:
  binary_path: "/usr/bin/qemu-system-x86_64"
  default_memory: 1024
  port_range:
    start: 5900
    end: 6000
  max_concurrent_vms: 50
  
resources:
  max_memory_mb: 32768
  disk_base_path: "/var/lib/qemu-backend/disks"
  log_directory: "/var/log/qemu-backend"

auth:
  type: "basic"  # basic, token, none
  username: "admin"
  password: "secret"

OpenTofu Configuration

terraform {
  backend "http" {
    address        = "https://qemu-backend.example.com/state/my_project"
    lock_address   = "https://qemu-backend.example.com/state/my_project/lock"
    unlock_address = "https://qemu-backend.example.com/state/my_project/lock"
    username       = "admin"
    password       = "secret"
  }
}

# Example VM resource (requires custom provider)
resource "qemu_vm" "web_server" {
  name   = "web-01"
  memory = 2048
  cpus   = 2
  
  disk {
    path = "/var/lib/qemu/web-01.qcow2"
    size = "20G"
  }
  
  network {
    type = "user"
    hostfwd = "tcp::8080-:80"
  }
}
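For illustration, the `web-01` resource above might be translated into a QEMU command line along the following lines. This is a sketch, not the planned generator: `qemuArgs` is a hypothetical function, the VNC port and QMP socket path in the usage note are made up, the disk is assumed to be qcow2 on a virtio bus, and the `server=on,wait=off` spelling assumes a reasonably recent QEMU (older releases use `server,nowait`).

```go
package main

import (
	"fmt"
	"strconv"
)

// NetworkConfig and VMConfig follow the structs in the Resource Management
// section; the NetworkConfig fields are assumptions based on the network block.
type NetworkConfig struct {
	Type    string
	HostFwd string
}

type VMConfig struct {
	Name          string
	Memory        uint64 // MB
	CPUs          int
	DiskPath      string
	NetworkConfig NetworkConfig
	VNCPort       int
	QMPSocket     string
}

// qemuArgs builds the argument vector for one VM. Paths containing commas
// would need escaping (",,") and are not handled here.
func qemuArgs(c VMConfig) ([]string, error) {
	if c.VNCPort < 5900 {
		return nil, fmt.Errorf("VNC port %d is below 5900", c.VNCPort)
	}
	netdev := c.NetworkConfig.Type + ",id=net0"
	if c.NetworkConfig.HostFwd != "" {
		netdev += ",hostfwd=" + c.NetworkConfig.HostFwd
	}
	return []string{
		"-name", c.Name,
		"-m", strconv.FormatUint(c.Memory, 10), // QEMU reads a bare number as MiB
		"-smp", strconv.Itoa(c.CPUs),
		"-drive", "file=" + c.DiskPath + ",format=qcow2,if=virtio",
		"-netdev", netdev,
		"-device", "virtio-net-pci,netdev=net0",
		"-vnc", ":" + strconv.Itoa(c.VNCPort-5900), // display number = port - 5900
		"-qmp", "unix:" + c.QMPSocket + ",server=on,wait=off",
	}, nil
}
```

Calling `qemuArgs` with `VNCPort: 5901` and `QMPSocket: "/run/qemu/web-01.qmp"` (both hypothetical values) yields arguments that can be passed to `exec.Command` together with the configured `binary_path`.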

Design Decisions and Rationale

Use HTTP Backend vs Custom Backend Plugin

Decision: Use existing OpenTofu HTTP backend rather than implementing a custom backend plugin.

Rationale:

Avoids modifying OpenTofu codebase
Leverages well-tested HTTP backend implementation
Enables implementation in any programming language
Simplifies testing and deployment
Maintains compatibility across OpenTofu versions

Direct QEMU Management vs LibVirt

Decision: Manage QEMU processes directly instead of using the existing libvirt provider.

Rationale:

Control: Fine-grained control over QEMU parameters and configuration
Simplicity: Eliminates libvirt daemon dependency and complexity
Debugging: Direct access to QEMU processes and logs
Flexibility: Custom networking, storage, and feature implementations
Performance: Reduced overhead from abstraction layers
Reliability: Fewer moving parts and potential failure points

State Storage Options

Decision: Support multiple storage backends with SQLite as default.

Rationale:

SQLite: Simple deployment, no external dependencies, suitable for small-medium scale
PostgreSQL: Production scalability, ACID compliance, concurrent access
File-based: Development simplicity, easy backup and migration
Flexibility: Different deployment scenarios have different requirements

Process Management Approach

Decision: Implement custom process supervision rather than using system service managers.

Rationale:

Integration: Direct integration with state management and HTTP API
Control: Custom lifecycle management and resource allocation
Portability: Works across different operating systems and environments
Monitoring: Built-in health checks and resource tracking
Recovery: Coordinated recovery with state consistency

JSON State Format Compatibility

Decision: Maintain full compatibility with standard Terraform/OpenTofu state format.

Rationale:

Interoperability: Works with existing tooling and workflows
Migration: Easy migration from other backends
Standards: Leverages well-defined, stable format
Debugging: Familiar format for troubleshooting
Future-proofing: Compatibility with ecosystem tools

Security and Authentication

Decision: Start with basic authentication, design for pluggable auth system.

Rationale:

Simplicity: Basic auth sufficient for many use cases
Standards: HTTP-based authentication familiar to operators
Extensibility: Architecture supports additional auth methods
Operations: Integrates with existing HTTP infrastructure (proxies, load balancers)

Risk Assessment

Technical Risks

QEMU Process Management Complexity: Mitigation through comprehensive testing and graceful error handling
State Consistency: Mitigation through robust locking and recovery mechanisms
Resource Conflicts: Mitigation through resource allocation tracking and validation
Scale Limitations: Mitigation through resource limits and monitoring

Operational Risks

Data Loss: Mitigation through backup strategies and state validation
Service Availability: Mitigation through health checks and restart procedures
Security: Mitigation through authentication, TLS, and input validation

Future Enhancements

Potential Features

VM templating and cloning
Snapshot management
Live migration support
Multi-host clustering
Advanced networking (bridges, VLANs)
GPU passthrough support
Backup and restore automation
Prometheus metrics integration
Web UI for monitoring and management

Scalability Improvements

Horizontal scaling across multiple hosts
Load balancing and VM placement policies
Resource scheduling and optimization
High availability and failover
