# QEMU Backend for OpenTofu - Design Document

## Goals

### Primary Objective

Create a custom backend for OpenTofu that manages QEMU virtual machines directly, providing declarative infrastructure management for VM workloads without external dependencies.

### Key Goals

- **Direct QEMU Control**: Manage QEMU processes natively without abstraction layers
- **OpenTofu Integration**: Leverage existing OpenTofu HTTP backend protocol for seamless integration
- **Declarative VM Management**: Define VM infrastructure as code with full lifecycle management
- **Resource Efficiency**: Optimal resource allocation and process supervision
- **Operational Simplicity**: Self-contained solution with minimal external dependencies

### Success Criteria

- OpenTofu can create, modify, and destroy QEMU VMs through standard configuration
- State management maintains consistency between declared and actual VM state
- Concurrent operations are safely handled through proper locking
- System recovers gracefully from failures and restarts
- Performance scales to reasonable VM workloads (10-100 VMs)

## Architecture Overview

### High-Level Design

```
OpenTofu Client → HTTP Backend Protocol → QEMU Management Server
                                                   ↓
                                            QEMU Processes
                                                   ↓
                                            State Storage
```

### Components

#### 1. OpenTofu HTTP Backend Client

- **Role**: Built-in OpenTofu HTTP backend (no custom code required)
- **Responsibilities**: State serialization, HTTP communication, locking protocol
- **Configuration**: Points to custom QEMU management server endpoints

#### 2. QEMU Management Server

- **Role**: Core application implementing HTTP backend protocol
- **Responsibilities**:
  - HTTP API implementation (state CRUD, locking)
  - QEMU process lifecycle management
  - Resource allocation and conflict resolution
  - State persistence and recovery

#### 3. State Storage Layer

- **Role**: Persistent storage for OpenTofu state and VM metadata
- **Options**: SQLite (simple), PostgreSQL (production), file-based (development)
- **Responsibilities**: State persistence, backup, recovery

#### 4. QEMU Process Manager

- **Role**: Direct QEMU process control and supervision
- **Responsibilities**: Process spawning, monitoring, resource management, cleanup

## Implementation Plan

### Phase 1: Core HTTP Backend Server

**Deliverables:**

- Basic HTTP server implementing OpenTofu backend protocol
- State storage and retrieval (GET/POST endpoints)
- State locking mechanism (LOCK/UNLOCK endpoints)
- Configuration management and validation

**Key Tasks:**

- Implement REST API handlers for state operations
- Design state storage schema and persistence layer
- Add proper error handling and logging
- Create basic configuration system

### Phase 2: QEMU Integration

**Deliverables:**

- QEMU process lifecycle management
- Resource allocation system (ports, memory, disk)
- Process monitoring and health checks
- Basic VM operations (create, start, stop, destroy)

**Key Tasks:**

- Build QEMU command-line generation
- Implement process supervision and PID tracking
- Add QEMU Machine Protocol (QMP) integration
- Create resource conflict detection

### Phase 3: State Processing and VM Management

**Deliverables:**

- State diff processing to determine required changes
- VM configuration template system
- Networking and storage management
- Graceful shutdown and cleanup procedures

**Key Tasks:**

- Parse OpenTofu state changes into VM operations
- Implement VM configuration templating
- Add network and storage allocation
- Build recovery and cleanup mechanisms

### Phase 4: Production Readiness

**Deliverables:**

- Comprehensive error handling and recovery
- Performance optimization and resource limits
- Monitoring and observability features
- Documentation and deployment guides

**Key Tasks:**

- Add metrics and health endpoints
- Implement backup and restore procedures
- Performance testing and optimization
- Security hardening and authentication

## Technical Specifications

### HTTP Backend Protocol Implementation

#### Required Endpoints

```
GET    /state/{project}       - Retrieve current state (JSON)
POST   /state/{project}       - Store new state (JSON body)
DELETE /state/{project}       - Delete state (optional)
LOCK   /state/{project}/lock  - Acquire state lock (JSON body)
UNLOCK /state/{project}/lock  - Release state lock (JSON body)
```

#### State Format

- Standard Terraform/OpenTofu JSON state format
- Version 4 state schema compatibility
- Custom resource types for QEMU VMs

#### Locking Protocol

```json
{
  "ID": "unique-lock-id",
  "Operation": "OperationTypePlan|OperationTypeApply",
  "Info": "operation description",
  "Who": "user@host",
  "Version": "opentofu-version",
  "Created": "2024-01-01T00:00:00Z",
  "Path": "terraform-working-directory"
}
```

### QEMU Process Management

#### VM Lifecycle Operations

- **Create**: Generate QEMU configuration and spawn process
- **Start/Stop**: Control VM power state through QMP
- **Modify**: Update VM configuration (restart required for some changes)
- **Destroy**: Graceful shutdown and resource cleanup
- **Monitor**: Health checks and resource usage tracking

#### Resource Management

```go
// NetworkConfig and PortRange are not defined elsewhere in this document.
// The shapes below are assumptions that mirror the `network` block and the
// `port_range` setting shown in the Configuration section.
type NetworkConfig struct {
	Type    string // e.g. "user"
	HostFwd string // e.g. "tcp::8080-:80"
}

type PortRange struct {
	Start int
	End   int
}

// VMConfig describes a single virtual machine.
type VMConfig struct {
	Name          string
	Memory        uint64 // MB
	CPUs          int
	DiskPath      string
	NetworkConfig NetworkConfig
	VNCPort       int
	QMPSocket     string
}

// ResourcePool tracks host-wide allocations so conflicts can be detected.
type ResourcePool struct {
	MaxMemory      uint64
	UsedMemory     uint64
	PortRange      PortRange
	AllocatedPorts map[int]string    // port -> VM name
	DiskPaths      map[string]string // disk path -> VM name
}
```

#### Process Supervision

- PID tracking and process monitoring
- Graceful shutdown with configurable timeouts
- Orphan process detection and cleanup
- Log aggregation and rotation

### State Storage Schema

#### Core Tables/Collections

```sql
-- State storage
CREATE TABLE states (
    project_name VARCHAR PRIMARY KEY,
    state_data   JSON,
    version      INTEGER,
    updated_at   TIMESTAMP
);

-- Lock management
CREATE TABLE locks (
    project_name VARCHAR PRIMARY KEY,
    lock_info    JSON,
    acquired_at  TIMESTAMP
);

-- VM process tracking
CREATE TABLE vm_processes (
    vm_name      VARCHAR PRIMARY KEY,
    project_name VARCHAR,
    pid          INTEGER,
    config       JSON,
    status       VARCHAR,
    created_at   TIMESTAMP
);
```

## Configuration

### Server Configuration

```yaml
server:
  host: "0.0.0.0"
  port: 8080
  tls:
    cert_file: "/path/to/cert.pem"
    key_file: "/path/to/key.pem"

storage:
  type: "sqlite"  # sqlite, postgres, file
  connection: "/var/lib/qemu-backend/state.db"

qemu:
  binary_path: "/usr/bin/qemu-system-x86_64"
  default_memory: 1024
  port_range:
    start: 5900
    end: 6000
  max_concurrent_vms: 50

resources:
  max_memory_mb: 32768
  disk_base_path: "/var/lib/qemu-backend/disks"
  log_directory: "/var/log/qemu-backend"

auth:
  type: "basic"  # basic, token, none
  username: "admin"
  password: "secret"
```

### OpenTofu Configuration

```hcl
terraform {
  backend "http" {
    address        = "https://qemu-backend.example.com/state/my_project"
    lock_address   = "https://qemu-backend.example.com/state/my_project/lock"
    unlock_address = "https://qemu-backend.example.com/state/my_project/lock"
    username       = "admin"
    password       = "secret"
  }
}

# Example VM resource (requires custom provider)
resource "qemu_vm" "web_server" {
  name   = "web-01"
  memory = 2048
  cpus   = 2

  disk {
    path = "/var/lib/qemu/web-01.qcow2"
    size = "20G"
  }

  network {
    type    = "user"
    hostfwd = "tcp::8080-:80"
  }
}
```

## Design Decisions and Rationale

### Use HTTP Backend vs Custom Backend Plugin

**Decision**: Use existing OpenTofu HTTP backend rather than implementing a custom backend plugin.

**Rationale**:

- Avoids modifying OpenTofu codebase
- Leverages well-tested HTTP backend implementation
- Enables implementation in any programming language
- Simplifies testing and deployment
- Maintains compatibility across OpenTofu versions

### Direct QEMU Management vs libvirt

**Decision**: Manage QEMU processes directly instead of using the existing libvirt provider.

**Rationale**:

- **Control**: Fine-grained control over QEMU parameters and configuration
- **Simplicity**: Eliminates libvirt daemon dependency and complexity
- **Debugging**: Direct access to QEMU processes and logs
- **Flexibility**: Custom networking, storage, and feature implementations
- **Performance**: Reduced overhead from abstraction layers
- **Reliability**: Fewer moving parts and potential failure points

### State Storage Options

**Decision**: Support multiple storage backends with SQLite as default.

**Rationale**:

- **SQLite**: Simple deployment, no external dependencies, suitable for small-medium scale
- **PostgreSQL**: Production scalability, ACID compliance, concurrent access
- **File-based**: Development simplicity, easy backup and migration
- **Flexibility**: Different deployment scenarios have different requirements

### Process Management Approach

**Decision**: Implement custom process supervision rather than using system service managers.

**Rationale**:

- **Integration**: Direct integration with state management and HTTP API
- **Control**: Custom lifecycle management and resource allocation
- **Portability**: Works across different operating systems and environments
- **Monitoring**: Built-in health checks and resource tracking
- **Recovery**: Coordinated recovery with state consistency

### JSON State Format Compatibility

**Decision**: Maintain full compatibility with standard Terraform/OpenTofu state format.

**Rationale**:

- **Interoperability**: Works with existing tooling and workflows
- **Migration**: Easy migration from other backends
- **Standards**: Leverages well-defined, stable format
- **Debugging**: Familiar format for troubleshooting
- **Future-proofing**: Compatibility with ecosystem tools

### Security and Authentication

**Decision**: Start with basic authentication, design for pluggable auth system.

**Rationale**:

- **Simplicity**: Basic auth sufficient for many use cases
- **Standards**: HTTP-based authentication familiar to operators
- **Extensibility**: Architecture supports additional auth methods
- **Operations**: Integrates with existing HTTP infrastructure (proxies, load balancers)

## Risk Assessment

### Technical Risks

- **QEMU Process Management Complexity**: Mitigation through comprehensive testing and graceful error handling
- **State Consistency**: Mitigation through robust locking and recovery mechanisms
- **Resource Conflicts**: Mitigation through resource allocation tracking and validation
- **Scale Limitations**: Mitigation through resource limits and monitoring

### Operational Risks

- **Data Loss**: Mitigation through backup strategies and state validation
- **Service Availability**: Mitigation through health checks and restart procedures
- **Security**: Mitigation through authentication, TLS, and input validation

## Future Enhancements

### Potential Features

- VM templating and cloning
- Snapshot management
- Live migration support
- Multi-host clustering
- Advanced networking (bridges, VLANs)
- GPU passthrough support
- Backup and restore automation
- Prometheus metrics integration
- Web UI for monitoring and management

### Scalability Improvements

- Horizontal scaling across multiple hosts
- Load balancing and VM placement policies
- Resource scheduling and optimization
- High availability and failover
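
## Appendix: QEMU Command-Line Generation Sketch

Phase 2 lists "Build QEMU command-line generation" as a key task, and the Create operation is described as generating a QEMU configuration and spawning a process. The Go sketch below illustrates one possible shape for that step. It is not a definitive implementation: `BuildQEMUArgs` and the `NetworkConfig` fields are hypothetical names introduced here, the qcow2 disk format and virtio devices are assumptions, and the `server=on,wait=off` QMP syntax assumes a reasonably recent QEMU release.

```go
package main

import (
	"fmt"
	"strings"
)

// NetworkConfig is an assumed shape mirroring the `network` block
// of the example qemu_vm resource.
type NetworkConfig struct {
	Type    string // e.g. "user"
	HostFwd string // e.g. "tcp::8080-:80"; empty means no port forwarding
}

// VMConfig repeats the fields from the Resource Management section.
type VMConfig struct {
	Name          string
	Memory        uint64 // MB
	CPUs          int
	DiskPath      string
	NetworkConfig NetworkConfig
	VNCPort       int
	QMPSocket     string
}

// BuildQEMUArgs (hypothetical helper) translates a VMConfig into the
// argument list passed to the QEMU binary.
func BuildQEMUArgs(cfg VMConfig) ([]string, error) {
	// QEMU addresses VNC by display number; display N listens on TCP 5900+N.
	if cfg.VNCPort < 5900 {
		return nil, fmt.Errorf("VNC port %d is below 5900", cfg.VNCPort)
	}
	netdev := cfg.NetworkConfig.Type + ",id=net0"
	if cfg.NetworkConfig.HostFwd != "" {
		netdev += ",hostfwd=" + cfg.NetworkConfig.HostFwd
	}
	return []string{
		"-name", cfg.Name,
		"-m", fmt.Sprintf("%d", cfg.Memory), // -m defaults to mebibytes
		"-smp", fmt.Sprintf("%d", cfg.CPUs),
		// Assumption: disks are qcow2 images attached via virtio.
		"-drive", "file=" + cfg.DiskPath + ",format=qcow2,if=virtio",
		"-netdev", netdev,
		"-device", "virtio-net-pci,netdev=net0",
		"-vnc", fmt.Sprintf(":%d", cfg.VNCPort-5900),
		// Expose QMP on a UNIX socket without blocking startup.
		"-qmp", "unix:" + cfg.QMPSocket + ",server=on,wait=off",
	}, nil
}

func main() {
	cfg := VMConfig{
		Name:          "web-01",
		Memory:        2048,
		CPUs:          2,
		DiskPath:      "/var/lib/qemu/web-01.qcow2",
		NetworkConfig: NetworkConfig{Type: "user", HostFwd: "tcp::8080-:80"},
		VNCPort:       5901,
		QMPSocket:     "/tmp/web-01.qmp", // illustrative path
	}
	args, err := BuildQEMUArgs(cfg)
	if err != nil {
		panic(err)
	}
	fmt.Println("/usr/bin/qemu-system-x86_64", strings.Join(args, " "))
}
```

Returning a slice rather than a single string keeps the arguments ready for `exec.Command` without shell quoting. A fuller version would also need to escape commas in paths and validate the port against the configured `port_range`.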