# QEMU Backend for OpenTofu - Design Document

## Goals

### Primary Objective
Create a custom backend for OpenTofu that manages QEMU virtual machines directly, providing declarative infrastructure management for VM workloads with minimal external dependencies.

### Key Goals
- **Direct QEMU Control**: Manage QEMU processes natively without abstraction layers
- **OpenTofu Integration**: Leverage existing OpenTofu HTTP backend protocol for seamless integration
- **Declarative VM Management**: Define VM infrastructure as code with full lifecycle management
- **Resource Efficiency**: Optimal resource allocation and process supervision
- **Operational Simplicity**: Self-contained solution with minimal external dependencies

### Success Criteria
- OpenTofu can create, modify, and destroy QEMU VMs through standard configuration
- State management maintains consistency between declared and actual VM state
- Concurrent operations are safely handled through proper locking
- System recovers gracefully from failures and restarts
- Performance scales to reasonable VM workloads (10-100 VMs)

## Architecture Overview

### High-Level Design
```
OpenTofu Client → HTTP Backend Protocol → QEMU Management Server
                                                   ↓
                                            QEMU Processes
                                                   ↓
                                            State Storage
```

### Components

#### 1. OpenTofu HTTP Backend Client
- **Role**: Built-in OpenTofu HTTP backend (no custom code required)
- **Responsibilities**: State serialization, HTTP communication, locking protocol
- **Configuration**: Points to custom QEMU management server endpoints

#### 2. QEMU Management Server
- **Role**: Core application implementing HTTP backend protocol
- **Responsibilities**:
  - HTTP API implementation (state CRUD, locking)
  - QEMU process lifecycle management
  - Resource allocation and conflict resolution
  - State persistence and recovery

#### 3. State Storage Layer
- **Role**: Persistent storage for OpenTofu state and VM metadata
- **Options**: SQLite (simple), PostgreSQL (production), file-based (development)
- **Responsibilities**: State persistence, backup, recovery

#### 4. QEMU Process Manager
- **Role**: Direct QEMU process control and supervision
- **Responsibilities**: Process spawning, monitoring, resource management, cleanup

## Implementation Plan

### Phase 1: Core HTTP Backend Server
**Deliverables:**
- Basic HTTP server implementing OpenTofu backend protocol
- State storage and retrieval (GET/POST on the state endpoint)
- State locking mechanism (LOCK/UNLOCK methods on the lock endpoint)
- Configuration management and validation

**Key Tasks:**
- Implement REST API handlers for state operations
- Design state storage schema and persistence layer
- Add proper error handling and logging
- Create basic configuration system

### Phase 2: QEMU Integration
**Deliverables:**
- QEMU process lifecycle management
- Resource allocation system (ports, memory, disk)
- Process monitoring and health checks
- Basic VM operations (create, start, stop, destroy)

**Key Tasks:**
- Build QEMU command-line generation
- Implement process supervision and PID tracking
- Add QEMU Machine Protocol (QMP) integration
- Create resource conflict detection

### Phase 3: State Processing and VM Management
**Deliverables:**
- State diff processing to determine required changes
- VM configuration template system
- Networking and storage management
- Graceful shutdown and cleanup procedures

**Key Tasks:**
- Parse OpenTofu state changes into VM operations
- Implement VM configuration templating
- Add network and storage allocation
- Build recovery and cleanup mechanisms

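
The state diff step in Phase 3 could be sketched roughly as follows. This is only an illustration: `vmOp`, `planOps`, and the idea of comparing a per-VM configuration fingerprint string are assumptions made for the example, not part of the design above.

```go
package main

import (
	"fmt"
	"sort"
)

// vmOp is a hypothetical operation derived from comparing two views of the VMs.
type vmOp struct {
	Action string // "create", "update", or "destroy"
	Name   string
}

// planOps compares desired and running VMs (VM name -> config fingerprint)
// and returns the required operations sorted by VM name.
func planOps(desired, running map[string]string) []vmOp {
	var ops []vmOp
	for name, want := range desired {
		have, ok := running[name]
		switch {
		case !ok:
			ops = append(ops, vmOp{"create", name})
		case have != want:
			ops = append(ops, vmOp{"update", name})
		}
	}
	for name := range running {
		if _, ok := desired[name]; !ok {
			ops = append(ops, vmOp{"destroy", name})
		}
	}
	// Map iteration order is random, so sort for a stable plan.
	sort.Slice(ops, func(i, j int) bool { return ops[i].Name < ops[j].Name })
	return ops
}

func main() {
	desired := map[string]string{"web-01": "mem=2048,cpus=2", "db-01": "mem=4096,cpus=4"}
	running := map[string]string{"web-01": "mem=1024,cpus=2", "old-01": "mem=512,cpus=1"}
	for _, op := range planOps(desired, running) {
		fmt.Println(op.Action, op.Name)
	}
}
```

With the sample maps this yields one create (`db-01`), one destroy (`old-01`), and one update (`web-01`). Whether an update can be applied live or needs a restart is a separate decision that this sketch does not cover.
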
### Phase 4: Production Readiness
**Deliverables:**
- Comprehensive error handling and recovery
- Performance optimization and resource limits
- Monitoring and observability features
- Documentation and deployment guides

**Key Tasks:**
- Add metrics and health endpoints
- Implement backup and restore procedures
- Performance testing and optimization
- Security hardening and authentication

## Technical Specifications

### HTTP Backend Protocol Implementation

#### Required Endpoints
```
GET    /state/{project}       - Retrieve current state (JSON)
POST   /state/{project}       - Store new state (JSON body)
DELETE /state/{project}       - Delete state (optional)
LOCK   /state/{project}/lock  - Acquire state lock (JSON body)
UNLOCK /state/{project}/lock  - Release state lock (JSON body)
```

#### State Format
- Standard Terraform/OpenTofu JSON state format
- Version 4 state schema compatibility
- Custom resource types for QEMU VMs

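
To make the state format concrete, here is a minimal sketch that reads a version 4 state document and lists the VMs it declares. It models only a few top-level fields of the state file. The `qemu_vm` type and the `name` attribute are taken from the example resource in this document and are assumptions about what the custom provider would write.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stateV4 models only the parts of the version 4 state file used here.
type stateV4 struct {
	Version   int        `json:"version"`
	Serial    int64      `json:"serial"`
	Resources []resource `json:"resources"`
}

type resource struct {
	Mode      string     `json:"mode"`
	Type      string     `json:"type"`
	Name      string     `json:"name"`
	Instances []instance `json:"instances"`
}

type instance struct {
	Attributes map[string]interface{} `json:"attributes"`
}

// vmNames returns the "name" attribute of every managed qemu_vm instance.
func vmNames(raw []byte) ([]string, error) {
	var st stateV4
	if err := json.Unmarshal(raw, &st); err != nil {
		return nil, err
	}
	if st.Version != 4 {
		return nil, fmt.Errorf("unsupported state version %d", st.Version)
	}
	var names []string
	for _, r := range st.Resources {
		if r.Mode != "managed" || r.Type != "qemu_vm" {
			continue
		}
		for _, inst := range r.Instances {
			if n, ok := inst.Attributes["name"].(string); ok {
				names = append(names, n)
			}
		}
	}
	return names, nil
}

func main() {
	raw := []byte(`{"version":4,"serial":1,"resources":[
		{"mode":"managed","type":"qemu_vm","name":"web_server",
		 "instances":[{"attributes":{"name":"web-01","memory":2048}}]}]}`)
	names, err := vmNames(raw)
	fmt.Println(names, err)
}
```
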
#### Locking Protocol
```json
{
  "ID": "unique-lock-id",
  "Operation": "OperationTypePlan|OperationTypeApply",
  "Info": "operation description",
  "Who": "user@host",
  "Version": "opentofu-version",
  "Created": "2024-01-01T00:00:00Z",
  "Path": "terraform-working-directory"
}
```

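
A server-side lock table for the body above might look like the following sketch. `lockStore`, `acquire`, and `release` are hypothetical names, and `Created` is kept as a plain string for brevity. The HTTP layer is left out. As far as I understand the client, a LOCK request that finds an existing lock should be answered with 423 Locked (or 409 Conflict) and the current holder's lock info as the body.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// LockInfo mirrors the JSON lock body shown above.
type LockInfo struct {
	ID        string
	Operation string
	Info      string
	Who       string
	Version   string
	Created   string
	Path      string
}

// lockStore is a hypothetical in-memory lock table keyed by project name.
type lockStore struct {
	mu    sync.Mutex
	locks map[string]LockInfo
}

// acquire stores the lock unless the project is already locked.
// On conflict it returns the current holder and false.
func (s *lockStore) acquire(project string, body []byte) (LockInfo, bool, error) {
	var req LockInfo
	if err := json.Unmarshal(body, &req); err != nil {
		return LockInfo{}, false, err
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if held, ok := s.locks[project]; ok {
		return held, false, nil
	}
	s.locks[project] = req
	return req, true, nil
}

// release removes the lock only if the caller presents the matching ID.
func (s *lockStore) release(project, id string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	held, ok := s.locks[project]
	if !ok || held.ID != id {
		return false
	}
	delete(s.locks, project)
	return true
}

func main() {
	s := &lockStore{locks: map[string]LockInfo{}}
	_, first, _ := s.acquire("my_project", []byte(`{"ID":"a","Who":"alice@host"}`))
	held, second, _ := s.acquire("my_project", []byte(`{"ID":"b","Who":"bob@host"}`))
	fmt.Println(first, second, held.Who)
	fmt.Println(s.release("my_project", "b"), s.release("my_project", "a"))
}
```

Checking the lock ID on release keeps one client from removing another client's lock. A force-unlock path would need a deliberate exception to that rule.
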
### QEMU Process Management

#### VM Lifecycle Operations
- **Create**: Generate QEMU configuration and spawn process
- **Start/Stop**: Control VM power state through QMP
- **Modify**: Update VM configuration (restart required for some changes)
- **Destroy**: Graceful shutdown and resource cleanup
- **Monitor**: Health checks and resource usage tracking

#### Resource Management
```go
// VMConfig describes a single QEMU virtual machine.
type VMConfig struct {
	Name          string
	Memory        uint64 // MB
	CPUs          int
	DiskPath      string
	NetworkConfig NetworkConfig // type not specified in this document
	VNCPort       int
	QMPSocket     string
}

// ResourcePool tracks host resources handed out to VMs.
type ResourcePool struct {
	MaxMemory      uint64
	UsedMemory     uint64
	PortRange      PortRange // type not specified in this document
	AllocatedPorts map[int]string
	DiskPaths      map[string]string
}
```

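
One way the pool could hand out resources is sketched below. It is an assumption-laden example: `pool` is a reduced copy of `ResourcePool`, `PortRange` is assumed to be an inclusive start/end pair, `AllocatedPorts` is assumed to map a port to the owning VM name, and memory is assumed to be counted in MB.

```go
package main

import "fmt"

// PortRange is assumed to be an inclusive range of TCP ports.
type PortRange struct{ Start, End int }

// pool is a reduced, hypothetical version of ResourcePool.
type pool struct {
	MaxMemory      uint64 // MB
	UsedMemory     uint64 // MB
	PortRange      PortRange
	AllocatedPorts map[int]string // port -> VM name
}

// reserve claims memory and the lowest free port for a VM.
// On failure it leaves the pool unchanged.
func (p *pool) reserve(vm string, memoryMB uint64) (int, error) {
	if p.UsedMemory+memoryMB > p.MaxMemory {
		return 0, fmt.Errorf("not enough memory for %s: %d MB requested, %d MB free",
			vm, memoryMB, p.MaxMemory-p.UsedMemory)
	}
	for port := p.PortRange.Start; port <= p.PortRange.End; port++ {
		if _, taken := p.AllocatedPorts[port]; !taken {
			p.AllocatedPorts[port] = vm
			p.UsedMemory += memoryMB
			return port, nil
		}
	}
	return 0, fmt.Errorf("no free port in %d-%d", p.PortRange.Start, p.PortRange.End)
}

func main() {
	p := &pool{MaxMemory: 4096, PortRange: PortRange{5900, 5901}, AllocatedPorts: map[int]string{}}
	fmt.Println(p.reserve("web-01", 2048))
	fmt.Println(p.reserve("web-02", 2048))
	fmt.Println(p.reserve("web-03", 1024)) // fails: memory is exhausted
}
```

The sketch is not safe for concurrent use. In the server it would need a mutex or would have to run inside the same critical section as the state lock.
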
#### Process Supervision
- PID tracking and process monitoring
- Graceful shutdown with configurable timeouts
- Orphan process detection and cleanup
- Log aggregation and rotation

### State Storage Schema

#### Core Tables/Collections
```sql
-- State storage
CREATE TABLE states (
    project_name VARCHAR PRIMARY KEY,
    state_data   JSON,
    version      INTEGER,
    updated_at   TIMESTAMP
);

-- Lock management
CREATE TABLE locks (
    project_name VARCHAR PRIMARY KEY,
    lock_info    JSON,
    acquired_at  TIMESTAMP
);

-- VM process tracking
CREATE TABLE vm_processes (
    vm_name      VARCHAR PRIMARY KEY,
    project_name VARCHAR,
    pid          INTEGER,
    config       JSON,
    status       VARCHAR,
    created_at   TIMESTAMP
);
```

## Configuration

### Server Configuration
```yaml
server:
  host: "0.0.0.0"
  port: 8080
  tls:
    cert_file: "/path/to/cert.pem"
    key_file: "/path/to/key.pem"

storage:
  type: "sqlite"  # sqlite, postgres, file
  connection: "/var/lib/qemu-backend/state.db"

qemu:
  binary_path: "/usr/bin/qemu-system-x86_64"
  default_memory: 1024
  port_range:
    start: 5900
    end: 6000
  max_concurrent_vms: 50

resources:
  max_memory_mb: 32768
  disk_base_path: "/var/lib/qemu-backend/disks"
  log_directory: "/var/log/qemu-backend"

auth:
  type: "basic"  # basic, token, none
  username: "admin"
  password: "secret"
```

### OpenTofu Configuration
```hcl
terraform {
  backend "http" {
    address        = "https://qemu-backend.example.com/state/my_project"
    lock_address   = "https://qemu-backend.example.com/state/my_project/lock"
    unlock_address = "https://qemu-backend.example.com/state/my_project/lock"
    username       = "admin"
    password       = "secret"
  }
}

# Example VM resource (requires custom provider)
resource "qemu_vm" "web_server" {
  name   = "web-01"
  memory = 2048
  cpus   = 2

  disk {
    path = "/var/lib/qemu/web-01.qcow2"
    size = "20G"
  }

  network {
    type    = "user"
    hostfwd = "tcp::8080-:80"
  }
}
```

## Design Decisions and Rationale

### Use HTTP Backend vs Custom Backend Plugin
**Decision**: Use existing OpenTofu HTTP backend rather than implementing a custom backend plugin.

**Rationale**:
- Avoids modifying OpenTofu codebase
- Leverages well-tested HTTP backend implementation
- Enables implementation in any programming language
- Simplifies testing and deployment
- Maintains compatibility across OpenTofu versions

### Direct QEMU Management vs LibVirt
**Decision**: Manage QEMU processes directly instead of using the existing libvirt provider.

**Rationale**:
- **Control**: Fine-grained control over QEMU parameters and configuration
- **Simplicity**: Eliminates libvirt daemon dependency and complexity
- **Debugging**: Direct access to QEMU processes and logs
- **Flexibility**: Custom networking, storage, and feature implementations
- **Performance**: Reduced overhead from abstraction layers
- **Reliability**: Fewer moving parts and potential failure points

### State Storage Options
**Decision**: Support multiple storage backends with SQLite as default.

**Rationale**:
- **SQLite**: Simple deployment, no external dependencies, suitable for small-medium scale
- **PostgreSQL**: Production scalability, ACID compliance, concurrent access
- **File-based**: Development simplicity, easy backup and migration
- **Flexibility**: Different deployment scenarios have different requirements

### Process Management Approach
**Decision**: Implement custom process supervision rather than using system service managers.

**Rationale**:
- **Integration**: Direct integration with state management and HTTP API
- **Control**: Custom lifecycle management and resource allocation
- **Portability**: Works across different operating systems and environments
- **Monitoring**: Built-in health checks and resource tracking
- **Recovery**: Coordinated recovery with state consistency

### JSON State Format Compatibility
**Decision**: Maintain full compatibility with standard Terraform/OpenTofu state format.

**Rationale**:
- **Interoperability**: Works with existing tooling and workflows
- **Migration**: Easy migration from other backends
- **Standards**: Leverages well-defined, stable format
- **Debugging**: Familiar format for troubleshooting
- **Future-proofing**: Compatibility with ecosystem tools

### Security and Authentication
**Decision**: Start with basic authentication, design for pluggable auth system.

**Rationale**:
- **Simplicity**: Basic auth sufficient for many use cases
- **Standards**: HTTP-based authentication familiar to operators
- **Extensibility**: Architecture supports additional auth methods
- **Operations**: Integrates with existing HTTP infrastructure (proxies, load balancers)

## Risk Assessment

### Technical Risks
- **QEMU Process Management Complexity**: Mitigation through comprehensive testing and graceful error handling
- **State Consistency**: Mitigation through robust locking and recovery mechanisms
- **Resource Conflicts**: Mitigation through resource allocation tracking and validation
- **Scale Limitations**: Mitigation through resource limits and monitoring

### Operational Risks
- **Data Loss**: Mitigation through backup strategies and state validation
- **Service Availability**: Mitigation through health checks and restart procedures
- **Security**: Mitigation through authentication, TLS, and input validation

## Future Enhancements

### Potential Features
- VM templating and cloning
- Snapshot management
- Live migration support
- Multi-host clustering
- Advanced networking (bridges, VLANs)
- GPU passthrough support
- Backup and restore automation
- Prometheus metrics integration
- Web UI for monitoring and management

### Scalability Improvements
- Horizontal scaling across multiple hosts
- Load balancing and VM placement policies
- Resource scheduling and optimization
- High availability and failover