QEMU Backend for OpenTofu - Design Document
Goals
Primary Objective
Create a custom backend for OpenTofu that manages QEMU virtual machines directly, providing declarative infrastructure management for VM workloads without external dependencies.
Key Goals
- Direct QEMU Control: Manage QEMU processes natively without abstraction layers
- OpenTofu Integration: Leverage the existing OpenTofu HTTP backend protocol for seamless integration
- Declarative VM Management: Define VM infrastructure as code with full lifecycle management
- Resource Efficiency: Optimal resource allocation and process supervision
- Operational Simplicity: Self-contained solution with minimal external dependencies
Success Criteria
- OpenTofu can create, modify, and destroy QEMU VMs through standard configuration
- State management maintains consistency between declared and actual VM state
- Concurrent operations are safely handled through proper locking
- System recovers gracefully from failures and restarts
- Performance scales to reasonable VM workloads (10-100 VMs)
Architecture Overview
High-Level Design
OpenTofu Client → HTTP Backend Protocol → QEMU Management Server
                                                     ↓
                                              QEMU Processes
                                                     ↓
                                               State Storage
Components
1. OpenTofu HTTP Backend Client
- Role: Built-in OpenTofu HTTP backend (no custom code required)
- Responsibilities: State serialization, HTTP communication, locking protocol
- Configuration: Points to custom QEMU management server endpoints
- Integration: Works with existing OpenTofu workflows and tooling
2. QEMU Management Server
- Role: Core application implementing HTTP backend protocol
- Responsibilities:
  - HTTP API implementation (state CRUD, locking)
  - QEMU process lifecycle management
  - Resource allocation and conflict resolution
  - State persistence and recovery
  - Web interface for monitoring, management, and debugging
3. State Storage Layer
- Role: Persistent storage for OpenTofu state and VM metadata
- Options: SQLite (default), PostgreSQL, or file-based storage
- Responsibilities: State persistence, backup, recovery
4. QEMU Process Manager
- Role: Direct QEMU process control and supervision
- Responsibilities: Process spawning, monitoring, resource management, cleanup
Implementation Plan
Phase 1: Core HTTP Backend Server
Deliverables:
- Basic HTTP server implementing OpenTofu backend protocol
- State storage and retrieval (GET/POST endpoints)
- State locking mechanism (LOCK/UNLOCK endpoints)
- Configuration management and validation
Key Tasks:
- Implement REST API handlers for state operations
- Design state storage schema and persistence layer
- Add proper error handling and logging
- Create basic configuration system
Validation:
- Running OpenTofu against the resulting service returns valid success responses, even though no VMs are created or run at this stage
Phase 2: QEMU Integration
Deliverables:
- QEMU process lifecycle management
- Resource allocation system (ports, memory, disk)
- Process monitoring and health checks
- Basic VM operations (create, start, stop, destroy)
Key Tasks:
- Build QEMU command-line generation
- Implement process supervision and PID tracking
- Add QEMU Machine Protocol (QMP) integration
- Create resource conflict detection
Validation:
- Running OpenTofu against the resulting service returns valid success responses, even though no VMs are created or run from OpenTofu configuration yet
Phase 3: State Processing and VM Management
Deliverables:
- State diff processing to determine required changes
- VM configuration template system
- Networking and storage management
- Graceful shutdown and cleanup procedures
Key Tasks:
- Parse OpenTofu state changes into VM operations
- Implement VM configuration templating
- Add network and storage allocation
- Build recovery and cleanup mechanisms
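The state diff step can be pictured as a comparison between the VMs declared in the new state and the VMs currently tracked by the server. The following is a minimal sketch under that assumption. The `VMSpec`, `Op`, and `planOps` names are hypothetical, and the sketch assumes VM specifications have already been extracted from the state JSON.

```go
package main

import (
	"fmt"
	"sort"
)

// VMSpec is a hypothetical, reduced view of one VM's desired settings.
type VMSpec struct {
	MemoryMB uint64
	CPUs     int
}

// Op is one operation the server would have to carry out.
type Op struct {
	Kind string // "create", "modify", or "destroy"
	Name string
}

// planOps compares desired and actual VM sets and returns the
// operations needed to converge, ordered by VM name.
func planOps(desired, actual map[string]VMSpec) []Op {
	names := make(map[string]struct{})
	for n := range desired {
		names[n] = struct{}{}
	}
	for n := range actual {
		names[n] = struct{}{}
	}
	sorted := make([]string, 0, len(names))
	for n := range names {
		sorted = append(sorted, n)
	}
	sort.Strings(sorted)

	var ops []Op
	for _, n := range sorted {
		want, inDesired := desired[n]
		have, inActual := actual[n]
		switch {
		case inDesired && !inActual:
			ops = append(ops, Op{Kind: "create", Name: n})
		case !inDesired && inActual:
			ops = append(ops, Op{Kind: "destroy", Name: n})
		case want != have:
			ops = append(ops, Op{Kind: "modify", Name: n})
		}
	}
	return ops
}

func main() {
	desired := map[string]VMSpec{"web-01": {MemoryMB: 2048, CPUs: 2}}
	actual := map[string]VMSpec{"old-01": {MemoryMB: 1024, CPUs: 1}}
	for _, op := range planOps(desired, actual) {
		fmt.Println(op.Kind, op.Name)
	}
}
```

Sorting by name keeps the plan deterministic, which makes the resulting operations easier to log and to test.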
Validation:
- Boot a VM from an OpenTofu configuration and confirm network connectivity (the VM responds to ping)
- Verify VM configuration changes are applied correctly through state diff processing
- Test graceful VM shutdown and resource cleanup
- Validate network and storage allocation/deallocation
Phase 4: Production Readiness
Deliverables:
- Comprehensive error handling and recovery
- Performance optimization and resource limits
- Monitoring and observability features
- Documentation and deployment guides
Key Tasks:
- Add metrics and health endpoints
- Implement backup and restore procedures
- Performance testing and optimization
- Security hardening and authentication
Validation:
- Performance: Deploy 10+ concurrent VMs and validate system stability under load
- Monitoring: Verify metrics endpoints expose VM count, memory usage, and error rates
- Recovery: Kill QEMU processes and validate automatic cleanup and state consistency
- Backup/Restore: Create state backup, simulate data loss, and restore from backup
- Security: Test authentication mechanisms and validate unauthorized access is blocked
- Error Handling: Inject failures (disk full, network issues) and verify graceful degradation
- Resource Limits: Exceed configured limits (max VMs, memory) and validate enforcement
Technical Specifications
HTTP Backend Protocol Implementation
Required Endpoints
GET /state/{project} - Retrieve current state (JSON)
POST /state/{project} - Store new state (JSON body)
DELETE /state/{project} - Delete state (optional)
LOCK /state/{project}/lock - Acquire state lock (JSON body)
UNLOCK /state/{project}/lock - Release state lock (JSON body)
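As a rough illustration of how the five endpoints above could be dispatched, the sketch below maps a method and path onto an action name using only the Go standard library. The `route` and `handler` names and the action strings are hypothetical. A fuller server would also distinguish 404 from 405 responses.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// route maps a method and path onto one of the five protocol actions.
// It returns an empty action when the pair is not part of the protocol.
func route(method, path string) (action, project string) {
	if !strings.HasPrefix(path, "/state/") {
		return "", ""
	}
	rest := strings.TrimPrefix(path, "/state/")
	isLock := strings.HasSuffix(rest, "/lock")
	name := strings.TrimSuffix(rest, "/lock")
	if name == "" || strings.Contains(name, "/") {
		return "", ""
	}
	verbs := map[string]string{
		http.MethodGet:    "get",
		http.MethodPost:   "store",
		http.MethodDelete: "delete",
	}
	if isLock {
		// LOCK and UNLOCK are non-standard methods; net/http passes them through.
		verbs = map[string]string{"LOCK": "lock", "UNLOCK": "unlock"}
	}
	if a, ok := verbs[method]; ok {
		return a, name
	}
	return "", ""
}

// handler only echoes the decoded action; real handlers would call into
// the state store and lock manager.
func handler(w http.ResponseWriter, r *http.Request) {
	action, project := route(r.Method, r.URL.Path)
	if action == "" {
		http.Error(w, "unsupported method or path", http.StatusNotFound)
		return
	}
	fmt.Fprintf(w, "%s %s\n", action, project)
}

func main() {
	req := httptest.NewRequest("LOCK", "/state/my_project/lock", nil)
	rec := httptest.NewRecorder()
	handler(rec, req)
	fmt.Print(rec.Body.String())
}
```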
State Format
- Standard Terraform/OpenTofu JSON state format
- Version 4 state schema compatibility
- Custom resource types for QEMU VMs
Locking Protocol
{
  "ID": "unique-lock-id",
  "Operation": "OperationTypePlan|OperationTypeApply",
  "Info": "operation description",
  "Who": "user@host",
  "Version": "opentofu-version",
  "Created": "2024-01-01T00:00:00Z",
  "Path": "terraform-working-directory"
}
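A minimal in-memory sketch of lock handling for the body above follows. It assumes the status codes described in the HTTP backend documentation: 200 when the lock is granted, and 423 (or 409) with the current holder's lock info when it is already taken. The `lockTable` type and the rule that `release` rejects a mismatched ID are assumptions of this sketch, not settled design.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"sync"
)

// LockInfo mirrors the JSON lock body shown above.
type LockInfo struct {
	ID        string
	Operation string
	Info      string
	Who       string
	Version   string
	Created   string
	Path      string
}

// lockTable holds at most one lock per project.
type lockTable struct {
	mu   sync.Mutex
	held map[string]LockInfo // project name -> current holder
}

func newLockTable() *lockTable {
	return &lockTable{held: make(map[string]LockInfo)}
}

// acquire grants the lock, or reports the existing holder with 423.
func (t *lockTable) acquire(project string, want LockInfo) (int, LockInfo) {
	t.mu.Lock()
	defer t.mu.Unlock()
	if holder, ok := t.held[project]; ok {
		return http.StatusLocked, holder
	}
	t.held[project] = want
	return http.StatusOK, want
}

// release drops the lock when the ID matches (or no lock is held).
func (t *lockTable) release(project, id string) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	if holder, ok := t.held[project]; ok && holder.ID != id {
		return http.StatusConflict
	}
	delete(t.held, project)
	return http.StatusOK
}

func main() {
	locks := newLockTable()
	locks.acquire("my_project", LockInfo{ID: "lock-1", Who: "user@host"})
	code, holder := locks.acquire("my_project", LockInfo{ID: "lock-2"})
	body, _ := json.Marshal(holder)
	fmt.Println(code, string(body))
}
```

A persistent implementation would store the same information in the `locks` table so that locks survive a server restart.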
QEMU Process Management
VM Lifecycle Operations
- Create: Generate QEMU configuration and spawn process
- Start/Stop: Control VM power state through QMP
- Modify: Update VM configuration (restart required for some changes)
- Destroy: Graceful shutdown and resource cleanup
- Monitor: Health checks and resource usage tracking
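For the Create step, command-line generation might look like the sketch below. The `vmSpec` type and `qemuArgs` function are hypothetical. The flags themselves (`-name`, `-m`, `-smp`, `-drive`, `-qmp`, `-vnc`, `-netdev`, `-device`) are standard QEMU options, but the specific choices here (qcow2 disks, virtio devices, user-mode networking) are assumptions, and the `server=on,wait=off` spelling assumes a reasonably recent QEMU release.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// vmSpec is a hypothetical flattened form of a VM definition.
type vmSpec struct {
	Name      string
	MemoryMB  uint64
	CPUs      int
	DiskPath  string
	HostFwd   string // optional user-mode forward rule, e.g. "tcp::8080-:80"
	VNCPort   int    // TCP port; QEMU expects the display number (port - 5900)
	QMPSocket string // path of the Unix socket QEMU should listen on
}

// qemuArgs builds the argument list passed to the QEMU binary.
func qemuArgs(s vmSpec) []string {
	netdev := "user,id=net0"
	if s.HostFwd != "" {
		netdev += ",hostfwd=" + s.HostFwd
	}
	return []string{
		"-name", s.Name,
		"-m", strconv.FormatUint(s.MemoryMB, 10), // megabytes by default
		"-smp", strconv.Itoa(s.CPUs),
		"-drive", "file=" + s.DiskPath + ",format=qcow2,if=virtio",
		"-qmp", "unix:" + s.QMPSocket + ",server=on,wait=off",
		"-vnc", ":" + strconv.Itoa(s.VNCPort-5900),
		"-netdev", netdev,
		"-device", "virtio-net-pci,netdev=net0",
	}
}

func main() {
	args := qemuArgs(vmSpec{
		Name:      "web-01",
		MemoryMB:  2048,
		CPUs:      2,
		DiskPath:  "/var/lib/qemu/web-01.qcow2",
		HostFwd:   "tcp::8080-:80",
		VNCPort:   5901,
		QMPSocket: "/tmp/web-01.qmp",
	})
	fmt.Println(strings.Join(args, " "))
}
```

Returning a slice rather than a single string avoids shell quoting problems when the arguments are passed to `exec.Command`.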
Resource Management
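// VMConfig describes a single QEMU virtual machine.
type VMConfig struct {
	Name          string
	Memory        uint64 // MB
	CPUs          int
	DiskPath      string
	NetworkConfig NetworkConfig
	VNCPort       int
	QMPSocket     string // path to the QMP control socket
}

// NetworkConfig and PortRange are referenced but not defined in this
// document. The fields below are assumptions inferred from the
// configuration examples (network type/hostfwd, port_range start/end).
type NetworkConfig struct {
	Type    string // e.g. "user"
	HostFwd string // e.g. "tcp::8080-:80"
}

type PortRange struct {
	Start int
	End   int
}

// ResourcePool tracks host resources handed out to VMs.
type ResourcePool struct {
	MaxMemory      uint64 // MB
	UsedMemory     uint64 // MB
	PortRange      PortRange
	AllocatedPorts map[int]string // port -> VM name (assumed)
	DiskPaths      map[string]string
}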
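One way the pool could hand out memory and ports is sketched below. It is an illustration rather than the planned implementation: `newPool`, `Allocate`, `Release`, and `ErrNoCapacity` are hypothetical names, the port range is assumed to be inclusive, each VM is assumed to need exactly one port, and the caller is assumed to serialize access (for example with a mutex).

```go
package main

import (
	"errors"
	"fmt"
)

// PortRange is assumed to be inclusive on both ends.
type PortRange struct{ Start, End int }

// ResourcePool is a reduced copy of the structure above.
type ResourcePool struct {
	MaxMemory      uint64 // MB
	UsedMemory     uint64 // MB
	PortRange      PortRange
	AllocatedPorts map[int]string // port -> VM name
}

// ErrNoCapacity is returned when memory or ports are exhausted.
var ErrNoCapacity = errors.New("resource pool exhausted")

func newPool(maxMemoryMB uint64, start, end int) *ResourcePool {
	return &ResourcePool{
		MaxMemory:      maxMemoryMB,
		PortRange:      PortRange{Start: start, End: end},
		AllocatedPorts: make(map[int]string),
	}
}

// Allocate reserves memory and the lowest free port for the named VM.
func (p *ResourcePool) Allocate(vm string, memoryMB uint64) (int, error) {
	if p.UsedMemory+memoryMB > p.MaxMemory {
		return 0, fmt.Errorf("%w: need %d MB, %d MB free",
			ErrNoCapacity, memoryMB, p.MaxMemory-p.UsedMemory)
	}
	for port := p.PortRange.Start; port <= p.PortRange.End; port++ {
		if _, taken := p.AllocatedPorts[port]; !taken {
			p.AllocatedPorts[port] = vm
			p.UsedMemory += memoryMB
			return port, nil
		}
	}
	return 0, fmt.Errorf("%w: no free port in %d-%d",
		ErrNoCapacity, p.PortRange.Start, p.PortRange.End)
}

// Release returns a port and its memory to the pool.
func (p *ResourcePool) Release(port int, memoryMB uint64) {
	if _, ok := p.AllocatedPorts[port]; !ok {
		return // unknown port: nothing to release
	}
	delete(p.AllocatedPorts, port)
	p.UsedMemory -= memoryMB
}

func main() {
	pool := newPool(32768, 5900, 6000)
	port, err := pool.Allocate("web-01", 2048)
	fmt.Println(port, err, pool.UsedMemory)
}
```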
Process Supervision
- PID tracking and process monitoring
- Graceful shutdown with configurable timeouts
- Orphan process detection and cleanup
- Log aggregation and rotation
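Graceful shutdown with a configurable timeout could follow the pattern sketched here: request a clean shutdown, wait up to the timeout, then force-kill. The `vmHandle` type and `stopVM` function are hypothetical. The callbacks stand in for whatever mechanism is chosen, such as a QMP `system_powerdown` command for the request and a process kill for the fallback.

```go
package main

import (
	"fmt"
	"time"
)

// vmHandle abstracts the operations needed to stop one VM.
type vmHandle struct {
	RequestShutdown func() error    // ask the guest to power down
	Kill            func() error    // terminate the QEMU process
	Exited          <-chan struct{} // closed once the process has exited
}

// stopVM attempts a graceful shutdown and escalates to Kill when the
// VM has not exited within timeout. forced reports whether Kill was used.
func stopVM(h vmHandle, timeout time.Duration) (forced bool, err error) {
	if reqErr := h.RequestShutdown(); reqErr == nil {
		select {
		case <-h.Exited:
			return false, nil
		case <-time.After(timeout):
			// Fall through to the forced path below.
		}
	}
	if killErr := h.Kill(); killErr != nil {
		return true, fmt.Errorf("force kill failed: %w", killErr)
	}
	<-h.Exited // a real supervisor would bound this wait as well
	return true, nil
}

func main() {
	exited := make(chan struct{})
	h := vmHandle{
		RequestShutdown: func() error { return nil }, // guest ignores the request
		Kill:            func() error { close(exited); return nil },
		Exited:          exited,
	}
	forced, err := stopVM(h, 100*time.Millisecond)
	fmt.Println(forced, err)
}
```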
State Storage Schema
Core Tables/Collections
-- State storage
CREATE TABLE states (
    project_name VARCHAR PRIMARY KEY,
    state_data   JSON,
    version      INTEGER,
    updated_at   TIMESTAMP
);
-- Lock management
CREATE TABLE locks (
    project_name VARCHAR PRIMARY KEY,
    lock_info    JSON,
    acquired_at  TIMESTAMP
);
-- VM process tracking
CREATE TABLE vm_processes (
    vm_name      VARCHAR PRIMARY KEY,
    project_name VARCHAR,
    pid          INTEGER,
    config       JSON,
    status       VARCHAR,
    created_at   TIMESTAMP
);
Configuration
Server Configuration
server:
  host: "0.0.0.0"
  port: 8080
  tls:
    cert_file: "/path/to/cert.pem"
    key_file: "/path/to/key.pem"
storage:
  type: "sqlite" # sqlite, postgres, file
  connection: "/var/lib/qemu-backend/state.db"
qemu:
  binary_path: "/usr/bin/qemu-system-x86_64"
  default_memory: 1024
  port_range:
    start: 5900
    end: 6000
  max_concurrent_vms: 50
resources:
  max_memory_mb: 32768
  disk_base_path: "/var/lib/qemu-backend/disks"
  log_directory: "/var/log/qemu-backend"
auth:
  type: "basic" # basic, token, none
  username: "admin"
  password: "secret"
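Configuration validation at startup could be sketched as follows. The `Config` struct is a hypothetical, flattened subset of the YAML above, and the rules shown are assumptions about what would be worth checking: a sane port range, at least one port per VM, and a default memory size that fits within the pool. Parsing the YAML itself is out of scope here.

```go
package main

import "fmt"

// Config is a hypothetical flattened subset of the server configuration.
type Config struct {
	PortStart        int
	PortEnd          int
	MaxConcurrentVMs int
	DefaultMemoryMB  uint64
	MaxMemoryMB      uint64
	AuthType         string
}

// Validate returns a human-readable problem for each rule that fails.
func (c Config) Validate() []string {
	var problems []string
	if c.PortStart < 1 || c.PortEnd > 65535 || c.PortStart > c.PortEnd {
		problems = append(problems,
			fmt.Sprintf("invalid port_range %d-%d", c.PortStart, c.PortEnd))
	} else if ports := c.PortEnd - c.PortStart + 1; ports < c.MaxConcurrentVMs {
		// Assumes each VM consumes one port from the range.
		problems = append(problems,
			fmt.Sprintf("port_range holds %d ports but max_concurrent_vms is %d",
				ports, c.MaxConcurrentVMs))
	}
	if c.DefaultMemoryMB > c.MaxMemoryMB {
		problems = append(problems, "default_memory exceeds max_memory_mb")
	}
	switch c.AuthType {
	case "basic", "token", "none":
	default:
		problems = append(problems, fmt.Sprintf("unknown auth type %q", c.AuthType))
	}
	return problems
}

func main() {
	cfg := Config{
		PortStart: 5900, PortEnd: 6000, MaxConcurrentVMs: 50,
		DefaultMemoryMB: 1024, MaxMemoryMB: 32768, AuthType: "basic",
	}
	fmt.Println(len(cfg.Validate()), "problems")
}
```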
OpenTofu Configuration
terraform {
  backend "http" {
    address        = "https://qemu-backend.example.com/state/my_project"
    lock_address   = "https://qemu-backend.example.com/state/my_project/lock"
    unlock_address = "https://qemu-backend.example.com/state/my_project/lock"
    username       = "admin"
    password       = "secret"
  }
}
# Example VM resource (requires custom provider)
resource "qemu_vm" "web_server" {
  name   = "web-01"
  memory = 2048
  cpus   = 2
  disk {
    path = "/var/lib/qemu/web-01.qcow2"
    size = "20G"
  }
  network {
    type    = "user"
    hostfwd = "tcp::8080-:80"
  }
}
Design Decisions and Rationale
Use HTTP Backend vs Custom Backend Plugin
Decision: Use existing OpenTofu HTTP backend rather than implementing a custom backend plugin.
Rationale:
- Avoids modifying OpenTofu codebase
- Leverages well-tested HTTP backend implementation
- Enables implementation in any programming language
- Simplifies testing and deployment
- Maintains compatibility across OpenTofu versions
Direct QEMU Management vs libvirt
Decision: Manage QEMU processes directly instead of using the existing libvirt provider.
Rationale:
- Control: Fine-grained control over QEMU parameters and configuration
- Simplicity: Eliminates libvirt daemon dependency and complexity
- Debugging: Direct access to QEMU processes and logs
- Flexibility: Custom networking, storage, and feature implementations
- Performance: Reduced overhead from abstraction layers
- Reliability: Fewer moving parts and potential failure points
State Storage Options
Decision: Support multiple storage backends, with SQLite as the default.
Rationale:
- SQLite: Simple deployment, no external dependencies, suitable for small-medium scale
- PostgreSQL: Production scalability and concurrent access from multiple writers
- File-based: Development simplicity, easy backup and migration
- Flexibility: Different deployment scenarios have different requirements
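Supporting several backends suggests a small storage interface that each one implements. The sketch below shows one possible shape with an in-memory implementation standing in for SQLite, PostgreSQL, or files. The `StateStore` interface and `memStore` type are hypothetical, and the version counter mirrors the `version` column of the `states` table.

```go
package main

import (
	"fmt"
	"sync"
)

// StateStore is a hypothetical interface each storage backend would satisfy.
type StateStore interface {
	Get(project string) (data []byte, version int, ok bool)
	Put(project string, data []byte) (version int)
	Delete(project string)
}

type row struct {
	data    []byte
	version int
}

// memStore keeps state in memory; useful for tests and as a reference.
type memStore struct {
	mu   sync.Mutex
	rows map[string]row
}

func newMemStore() *memStore {
	return &memStore{rows: make(map[string]row)}
}

func (s *memStore) Get(project string) ([]byte, int, bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	r, ok := s.rows[project]
	return r.data, r.version, ok
}

// Put stores a copy of data and bumps the project's version.
func (s *memStore) Put(project string, data []byte) int {
	s.mu.Lock()
	defer s.mu.Unlock()
	r := s.rows[project]
	r.data = append([]byte(nil), data...)
	r.version++
	s.rows[project] = r
	return r.version
}

func (s *memStore) Delete(project string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	delete(s.rows, project)
}

// Compile-time check that memStore satisfies the interface.
var _ StateStore = (*memStore)(nil)

func main() {
	var store StateStore = newMemStore()
	store.Put("my_project", []byte(`{"version": 4}`))
	data, version, ok := store.Get("my_project")
	fmt.Println(string(data), version, ok)
}
```

Keeping the HTTP handlers dependent only on the interface lets the storage type be chosen from configuration without touching request handling.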
Process Management Approach
Decision: Implement custom process supervision rather than using system service managers.
Rationale:
- Integration: Direct integration with state management and HTTP API
- Control: Custom lifecycle management and resource allocation
- Portability: Works across different operating systems and environments
- Monitoring: Built-in health checks and resource tracking
- Recovery: Coordinated recovery with state consistency
JSON State Format Compatibility
Decision: Maintain full compatibility with standard Terraform/OpenTofu state format.
Rationale:
- Interoperability: Works with existing tooling and workflows
- Migration: Easy migration from other backends
- Standards: Leverages well-defined, stable format
- Debugging: Familiar format for troubleshooting
- Future-proofing: Compatibility with ecosystem tools
Security and Authentication
Decision: Start with HTTP basic authentication, but design for a pluggable auth system.
Rationale:
- Simplicity: Basic auth sufficient for many use cases
- Standards: HTTP-based authentication familiar to operators
- Extensibility: Architecture supports additional auth methods
- Operations: Integrates with existing HTTP infrastructure (proxies, load balancers)
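Basic authentication can be added as middleware in front of the state handlers, which also leaves room for swapping in other schemes later. The following is a minimal sketch using the standard library. `requireBasicAuth`, `credentialsMatch`, `okHandler`, and `probe` are hypothetical names, and the realm string is a placeholder.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// credentialsMatch compares both values in constant time to avoid
// leaking how much of a guess was correct.
func credentialsMatch(gotUser, gotPass, wantUser, wantPass string) bool {
	u := subtle.ConstantTimeCompare([]byte(gotUser), []byte(wantUser))
	p := subtle.ConstantTimeCompare([]byte(gotPass), []byte(wantPass))
	return u&p == 1
}

// requireBasicAuth wraps next and rejects requests without valid credentials.
func requireBasicAuth(user, pass string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		if !ok || !credentialsMatch(u, p, user, pass) {
			w.Header().Set("WWW-Authenticate", `Basic realm="qemu-backend"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// okHandler stands in for the real state handlers.
var okHandler = http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
	fmt.Fprintln(w, "ok")
})

// probe sends one request with the given credentials and returns the status.
func probe(h http.Handler, user, pass string) int {
	req := httptest.NewRequest(http.MethodGet, "/state/my_project", nil)
	req.SetBasicAuth(user, pass)
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	protected := requireBasicAuth("admin", "secret", okHandler)
	fmt.Println(probe(protected, "admin", "secret"))
	fmt.Println(probe(protected, "admin", "wrong"))
}
```

Basic auth sends credentials with every request, so this only makes sense together with the TLS settings in the server configuration.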
Risk Assessment
Technical Risks
- QEMU Process Management Complexity: Mitigation through comprehensive testing and graceful error handling
- State Consistency: Mitigation through robust locking and recovery mechanisms
- Resource Conflicts: Mitigation through resource allocation tracking and validation
- Scale Limitations: Mitigation through resource limits and monitoring
Operational Risks
- Data Loss: Mitigation through backup strategies and state validation
- Service Availability: Mitigation through health checks and restart procedures
- Security: Mitigation through authentication, TLS, and input validation
Future Enhancements
Potential Features
- VM templating and cloning
- Snapshot management
- Live migration support
- Multi-host clustering
- Advanced networking (bridges, VLANs)
- GPU passthrough support
- Backup and restore automation
- Prometheus metrics integration
- Web UI for monitoring and management
Scalability Improvements
- Horizontal scaling across multiple hosts
- Load balancing and VM placement policies
- Resource scheduling and optimization
- High availability and failover