# QEMU Backend for OpenTofu - Design Document
## Goals
### Primary Objective
Create a custom backend for OpenTofu that manages QEMU virtual machines directly, providing declarative infrastructure management for VM workloads without external dependencies.
### Key Goals
- **Direct QEMU Control**: Manage QEMU processes natively without abstraction layers
- **OpenTofu Integration**: Leverage existing OpenTofu HTTP backend protocol for seamless integration
- **Declarative VM Management**: Define VM infrastructure as code with full lifecycle management
- **Resource Efficiency**: Optimal resource allocation and process supervision
- **Operational Simplicity**: Self-contained solution with minimal external dependencies
### Success Criteria
- OpenTofu can create, modify, and destroy QEMU VMs through standard configuration
- State management maintains consistency between declared and actual VM state
- Concurrent operations are safely handled through proper locking
- System recovers gracefully from failures and restarts
- Performance scales to reasonable VM workloads (10-100 VMs)
## Architecture Overview
### High-Level Design
```
OpenTofu Client → HTTP Backend Protocol → QEMU Management Server
                                            ├→ QEMU Processes
                                            └→ State Storage
```
### Components
#### 1. OpenTofu HTTP Backend Client
- **Role**: Built-in OpenTofu HTTP backend (no custom code required)
- **Responsibilities**: State serialization, HTTP communication, locking protocol
- **Configuration**: Points to custom QEMU management server endpoints
- **Integration**: Works unchanged with existing OpenTofu workflows and tooling
#### 2. QEMU Management Server
- **Role**: Core application implementing HTTP backend protocol
- **Responsibilities**:
  - HTTP API implementation (state CRUD, locking)
  - QEMU process lifecycle management
  - Resource allocation and conflict resolution
  - State persistence and recovery
  - Exposing a web interface for monitoring, management, and debugging
#### 3. State Storage Layer
- **Role**: Persistent storage for OpenTofu state and VM metadata
- **Options**: SQLite (default), with PostgreSQL and file-based storage as alternatives
- **Responsibilities**: State persistence, backup, recovery
#### 4. QEMU Process Manager
- **Role**: Direct QEMU process control and supervision
- **Responsibilities**: Process spawning, monitoring, resource management, cleanup
## Implementation Plan
### Phase 1: Core HTTP Backend Server
**Deliverables:**
- Basic HTTP server implementing OpenTofu backend protocol
- State storage and retrieval (GET/POST on the state endpoint)
- State locking mechanism (LOCK/UNLOCK methods on the lock endpoint)
- Configuration management and validation
**Key Tasks:**
- Implement REST API handlers for state operations
- Design state storage schema and persistence layer
- Add proper error handling and logging
- Create basic configuration system
**Validation:**
- OpenTofu runs against the resulting service and receives valid success responses, even though nothing is created or run yet
### Phase 2: QEMU Integration
**Deliverables:**
- QEMU process lifecycle management
- Resource allocation system (ports, memory, disk)
- Process monitoring and health checks
- Basic VM operations (create, start, stop, destroy)
**Key Tasks:**
- Build QEMU command-line generation
- Implement process supervision and PID tracking
- Add QEMU Machine Protocol (QMP) integration
- Create resource conflict detection
**Validation:**
- OpenTofu runs against the resulting service and receives valid success responses, even though nothing is created or run yet
### Phase 3: State Processing and VM Management
**Deliverables:**
- State diff processing to determine required changes
- VM configuration template system
- Networking and storage management
- Graceful shutdown and cleanup procedures
**Key Tasks:**
- Parse OpenTofu state changes into VM operations
- Implement VM configuration templating
- Add network and storage allocation
- Build recovery and cleanup mechanisms
**Validation:**
- Boot a VM from OpenTofu configuration until network connectivity is established (ping response)
- Verify VM configuration changes are applied correctly through state diff processing
- Test graceful VM shutdown and resource cleanup
- Validate network and storage allocation/deallocation
### Phase 4: Production Readiness
**Deliverables:**
- Comprehensive error handling and recovery
- Performance optimization and resource limits
- Monitoring and observability features
- Documentation and deployment guides
**Key Tasks:**
- Add metrics and health endpoints
- Implement backup and restore procedures
- Performance testing and optimization
- Security hardening and authentication
**Validation:**
- **Performance**: Deploy 10+ concurrent VMs and validate system stability under load
- **Monitoring**: Verify metrics endpoints expose VM count, memory usage, and error rates
- **Recovery**: Kill QEMU processes and validate automatic cleanup and state consistency
- **Backup/Restore**: Create state backup, simulate data loss, and restore from backup
- **Security**: Test authentication mechanisms and validate unauthorized access is blocked
- **Error Handling**: Inject failures (disk full, network issues) and verify graceful degradation
- **Resource Limits**: Exceed configured limits (max VMs, memory) and validate enforcement
## Technical Specifications
### HTTP Backend Protocol Implementation
#### Required Endpoints
```
GET /state/{project} - Retrieve current state (JSON)
POST /state/{project} - Store new state (JSON body)
DELETE /state/{project} - Delete state (optional)
LOCK /state/{project}/lock - Acquire state lock (JSON body)
UNLOCK /state/{project}/lock - Release state lock (JSON body)
```
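The endpoint table above can be dispatched with Go's standard `net/http`, since non-standard verbs such as `LOCK` and `UNLOCK` arrive in `r.Method` like any other method. The following is a minimal sketch under that assumption; `route`, `stateHandler`, and the action names are hypothetical and not part of the OpenTofu protocol.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// route maps a method and path onto one of the backend actions listed
// above. The action names ("get", "put", ...) are illustrative only.
func route(method, path string) (action, project string, ok bool) {
	if !strings.HasPrefix(path, "/state/") {
		return "", "", false
	}
	rest := strings.TrimPrefix(path, "/state/")
	isLock := strings.HasSuffix(rest, "/lock")
	project = strings.TrimSuffix(rest, "/lock")
	if project == "" || strings.Contains(project, "/") {
		return "", "", false
	}
	switch {
	case isLock && method == "LOCK":
		return "lock", project, true
	case isLock && method == "UNLOCK":
		return "unlock", project, true
	case !isLock && method == http.MethodGet:
		return "get", project, true
	case !isLock && method == http.MethodPost:
		return "put", project, true
	case !isLock && method == http.MethodDelete:
		return "delete", project, true
	}
	return "", "", false
}

// stateHandler only echoes the routing decision; a real server would
// call into the storage and locking layers here.
func stateHandler(w http.ResponseWriter, r *http.Request) {
	action, project, ok := route(r.Method, r.URL.Path)
	if !ok {
		http.Error(w, "unsupported method or path", http.StatusMethodNotAllowed)
		return
	}
	fmt.Fprintf(w, "%s %s\n", action, project)
}

func main() {
	// Exercise the handler in-process instead of opening a socket.
	req := httptest.NewRequest("LOCK", "/state/my_project/lock", nil)
	rec := httptest.NewRecorder()
	stateHandler(rec, req)
	fmt.Print(rec.Body.String())
}
```

Keeping the routing decision in a pure function makes it easy to unit-test separately from storage and locking.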
#### State Format
- Standard Terraform/OpenTofu JSON state format
- Version 4 state schema compatibility
- Custom resource types for QEMU VMs
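As a rough illustration of reading the version 4 state, the sketch below decodes only the fields needed to list `qemu_vm` instances. The `qemu_vm` type comes from this document's example and would require a custom provider; `vmNames` and the trimmed-down structs are hypothetical, and real state files carry many more fields than are modeled here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// stateV4 models only the subset of the version 4 state used below.
type stateV4 struct {
	Version   int        `json:"version"`
	Resources []resource `json:"resources"`
}

type resource struct {
	Mode      string     `json:"mode"`
	Type      string     `json:"type"`
	Name      string     `json:"name"`
	Instances []instance `json:"instances"`
}

type instance struct {
	Attributes map[string]interface{} `json:"attributes"`
}

// vmNames returns the "name" attribute of every managed qemu_vm instance.
func vmNames(raw []byte) ([]string, error) {
	var st stateV4
	if err := json.Unmarshal(raw, &st); err != nil {
		return nil, err
	}
	if st.Version != 4 {
		return nil, fmt.Errorf("unsupported state version %d", st.Version)
	}
	var names []string
	for _, r := range st.Resources {
		if r.Mode != "managed" || r.Type != "qemu_vm" {
			continue
		}
		for _, inst := range r.Instances {
			if name, ok := inst.Attributes["name"].(string); ok {
				names = append(names, name)
			}
		}
	}
	return names, nil
}

func main() {
	raw := []byte(`{"version":4,"resources":[{"mode":"managed","type":"qemu_vm",
		"name":"web_server","instances":[{"attributes":{"name":"web-01"}}]}]}`)
	names, err := vmNames(raw)
	fmt.Println(names, err)
}
```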
#### Locking Protocol
```json
{
  "ID": "unique-lock-id",
  "Operation": "OperationTypePlan|OperationTypeApply",
  "Info": "operation description",
  "Who": "user@host",
  "Version": "opentofu-version",
  "Created": "2024-01-01T00:00:00Z",
  "Path": "terraform-working-directory"
}
```
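The lock body above can be decoded into a struct and held in a per-project table. This is a minimal in-memory sketch, not the persistent `locks` table: `lockTable`, `acquire`, and `release` are hypothetical names. Answering a contended lock with 423 Locked plus the holder's info follows the HTTP backend's documented behavior; returning 409 on a mismatched unlock ID and 200 on an already-released lock are assumptions of this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"sync"
)

// LockInfo mirrors the JSON lock body sent by the client.
type LockInfo struct {
	ID        string `json:"ID"`
	Operation string `json:"Operation"`
	Info      string `json:"Info"`
	Who       string `json:"Who"`
	Version   string `json:"Version"`
	Created   string `json:"Created"`
	Path      string `json:"Path"`
}

// lockTable holds at most one lock per project.
type lockTable struct {
	mu    sync.Mutex
	locks map[string]LockInfo
}

func newLockTable() *lockTable {
	return &lockTable{locks: make(map[string]LockInfo)}
}

// acquire returns the HTTP status to send and the lock that is now held:
// 200 with the new lock, or 423 with the existing holder.
func (t *lockTable) acquire(project string, body []byte) (int, LockInfo, error) {
	var req LockInfo
	if err := json.Unmarshal(body, &req); err != nil {
		return http.StatusBadRequest, LockInfo{}, err
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	if held, ok := t.locks[project]; ok {
		return http.StatusLocked, held, nil
	}
	t.locks[project] = req
	return http.StatusOK, req, nil
}

// release drops the lock if the ID matches the current holder.
func (t *lockTable) release(project, id string) int {
	t.mu.Lock()
	defer t.mu.Unlock()
	held, ok := t.locks[project]
	if !ok {
		return http.StatusOK // nothing held; treated as success (assumption)
	}
	if held.ID != id {
		return http.StatusConflict // assumption: refuse foreign unlocks
	}
	delete(t.locks, project)
	return http.StatusOK
}

func main() {
	tbl := newLockTable()
	code, _, _ := tbl.acquire("my_project", []byte(`{"ID":"lock-a","Who":"alice@host"}`))
	fmt.Println("first acquire:", code)
	code, held, _ := tbl.acquire("my_project", []byte(`{"ID":"lock-b","Who":"bob@host"}`))
	fmt.Println("second acquire:", code, "held by", held.Who)
}
```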
### QEMU Process Management
#### VM Lifecycle Operations
- **Create**: Generate QEMU configuration and spawn process
- **Start/Stop**: Control VM power state through QMP
- **Modify**: Update VM configuration (restart required for some changes)
- **Destroy**: Graceful shutdown and resource cleanup
- **Monitor**: Health checks and resource usage tracking
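For the operations that go through QMP, each request is a small JSON object sent over the monitor socket, and a session must begin with `qmp_capabilities` before other commands are accepted. The sketch below only builds the messages; the operation keys and `qmpMessage` are hypothetical, while the command names on the right-hand side are standard QMP commands.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// qmpCommand is the minimal shape of an argument-less QMP request.
type qmpCommand struct {
	Execute string `json:"execute"`
}

// qmpByOperation maps illustrative operation keys to QMP command names.
var qmpByOperation = map[string]string{
	"capabilities": "qmp_capabilities", // must be sent first on every session
	"shutdown":     "system_powerdown", // asks the guest to power off
	"pause":        "stop",             // freezes guest execution
	"resume":       "cont",             // resumes a paused guest
	"quit":         "quit",             // terminates the QEMU process
	"status":       "query-status",     // reports the current run state
}

// qmpMessage returns the JSON line to write to the QMP socket.
func qmpMessage(op string) (string, error) {
	name, ok := qmpByOperation[op]
	if !ok {
		return "", fmt.Errorf("no QMP command for operation %q", op)
	}
	b, err := json.Marshal(qmpCommand{Execute: name})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	for _, op := range []string{"capabilities", "status", "shutdown"} {
		msg, err := qmpMessage(op)
		fmt.Println(msg, err)
	}
}
```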
#### Resource Management
```go
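// NetworkConfig and PortRange are not defined in this document; the fields
// below are assumed from the configuration examples elsewhere in it.
type NetworkConfig struct {
	Type    string // e.g. "user"
	HostFwd string // e.g. "tcp::8080-:80"
}

type PortRange struct {
	Start int
	End   int
}

// VMConfig describes a single QEMU virtual machine.
type VMConfig struct {
	Name          string
	Memory        uint64 // MB
	CPUs          int
	DiskPath      string
	NetworkConfig NetworkConfig
	VNCPort       int
	QMPSocket     string
}

// ResourcePool tracks host resources shared across all VMs.
type ResourcePool struct {
	MaxMemory      uint64
	UsedMemory     uint64
	PortRange      PortRange
	AllocatedPorts map[int]string
	DiskPaths      map[string]string
}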
```
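A pool like the one above needs an all-or-nothing reservation step so that a failed allocation leaves no partial claims behind. The following sketch redeclares a trimmed `ResourcePool` to stay self-contained; `Reserve`, the error values, and the lowest-free-port policy are assumptions, and the code is not safe for concurrent use without an external lock.

```go
package main

import (
	"errors"
	"fmt"
)

type PortRange struct{ Start, End int }

// ResourcePool is a trimmed copy of the struct above, for illustration.
type ResourcePool struct {
	MaxMemory      uint64 // MB
	UsedMemory     uint64 // MB
	PortRange      PortRange
	AllocatedPorts map[int]string // port -> VM name (assumed meaning)
}

var (
	ErrNoMemory = errors.New("not enough free memory in pool")
	ErrNoPorts  = errors.New("no free port in configured range")
)

// Reserve claims memoryMB and the lowest free port for vmName. On failure
// the pool is left unchanged.
func (p *ResourcePool) Reserve(vmName string, memoryMB uint64) (int, error) {
	// Subtract instead of adding so the check cannot overflow.
	if memoryMB > p.MaxMemory-p.UsedMemory {
		return 0, ErrNoMemory
	}
	if p.AllocatedPorts == nil {
		p.AllocatedPorts = make(map[int]string)
	}
	for port := p.PortRange.Start; port <= p.PortRange.End; port++ {
		if _, taken := p.AllocatedPorts[port]; !taken {
			p.AllocatedPorts[port] = vmName
			p.UsedMemory += memoryMB
			return port, nil
		}
	}
	return 0, ErrNoPorts
}

func main() {
	pool := &ResourcePool{MaxMemory: 4096, PortRange: PortRange{Start: 5900, End: 5901}}
	for _, name := range []string{"web-01", "db-01", "cache-01"} {
		port, err := pool.Reserve(name, 2048)
		fmt.Println(name, port, err)
	}
}
```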
#### Process Supervision
- PID tracking and process monitoring
- Graceful shutdown with configurable timeouts
- Orphan process detection and cleanup
- Log aggregation and rotation
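One piece of the supervision described above is noticing tracked VMs whose recorded PID no longer refers to a live process. The sketch below shows that half of the reconciliation only; `deadVMs` and `pidAlive` are hypothetical helpers. `pidAlive` relies on the Unix convention that signal 0 performs an existence check without delivering a signal, so it is Unix-only, reports false for processes the server may not signal, and cannot detect PID reuse.

```go
package main

import (
	"fmt"
	"os"
	"sort"
	"syscall"
)

// pidAlive reports whether the current user can signal the given PID.
// Unix-only: signal 0 checks for existence and permission, nothing more.
func pidAlive(pid int) bool {
	proc, err := os.FindProcess(pid)
	if err != nil {
		return false
	}
	return proc.Signal(syscall.Signal(0)) == nil
}

// deadVMs returns the names of tracked VMs whose process is gone, sorted
// for stable output. The liveness check is injected so it can be faked.
func deadVMs(tracked map[string]int, alive func(pid int) bool) []string {
	var dead []string
	for name, pid := range tracked {
		if !alive(pid) {
			dead = append(dead, name)
		}
	}
	sort.Strings(dead)
	return dead
}

func main() {
	// The current process stands in for a running QEMU instance.
	tracked := map[string]int{"this-process": os.Getpid()}
	fmt.Println("stale records:", deadVMs(tracked, pidAlive))
}
```

Records returned by `deadVMs` would then have their ports, memory, and state rows released by the cleanup path.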
### State Storage Schema
#### Core Tables/Collections
```sql
-- State storage
CREATE TABLE states (
    project_name VARCHAR PRIMARY KEY,
    state_data   JSON,
    version      INTEGER,
    updated_at   TIMESTAMP
);

-- Lock management
CREATE TABLE locks (
    project_name VARCHAR PRIMARY KEY,
    lock_info    JSON,
    acquired_at  TIMESTAMP
);

-- VM process tracking
CREATE TABLE vm_processes (
    vm_name      VARCHAR PRIMARY KEY,
    project_name VARCHAR,
    pid          INTEGER,
    config       JSON,
    status       VARCHAR,
    created_at   TIMESTAMP
);
```
## Configuration
### Server Configuration
```yaml
server:
  host: "0.0.0.0"
  port: 8080
  tls:
    cert_file: "/path/to/cert.pem"
    key_file: "/path/to/key.pem"
storage:
  type: "sqlite" # sqlite, postgres, file
  connection: "/var/lib/qemu-backend/state.db"
qemu:
  binary_path: "/usr/bin/qemu-system-x86_64"
  default_memory: 1024
  port_range:
    start: 5900
    end: 6000
  max_concurrent_vms: 50
resources:
  max_memory_mb: 32768
  disk_base_path: "/var/lib/qemu-backend/disks"
  log_directory: "/var/log/qemu-backend"
auth:
  type: "basic" # basic, token, none
  username: "admin"
  password: "secret"
```
### OpenTofu Configuration
```hcl
terraform {
  backend "http" {
    address        = "https://qemu-backend.example.com/state/my_project"
    lock_address   = "https://qemu-backend.example.com/state/my_project/lock"
    unlock_address = "https://qemu-backend.example.com/state/my_project/lock"
    username       = "admin"
    password       = "secret"
  }
}

# Example VM resource (requires custom provider)
resource "qemu_vm" "web_server" {
  name   = "web-01"
  memory = 2048
  cpus   = 2

  disk {
    path = "/var/lib/qemu/web-01.qcow2"
    size = "20G"
  }

  network {
    type    = "user"
    hostfwd = "tcp::8080-:80"
  }
}
```
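To make the mapping from a VM resource to a QEMU invocation concrete, here is a minimal sketch of command-line generation for the `web_server` example. `vmSpec` and `qemuArgs` are hypothetical, the virtio device choices and the QMP socket path are assumptions, and values containing commas would need QEMU's `,,` escaping, which this sketch does not perform.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// vmSpec is an illustrative, flattened form of the VM resource.
type vmSpec struct {
	Name      string
	MemoryMB  uint64
	CPUs      int
	DiskPath  string
	HostFwd   string // user-mode forward rule, e.g. "tcp::8080-:80"
	VNCPort   int    // TCP port; QEMU's display number is VNCPort - 5900
	QMPSocket string // path of the Unix socket for QMP
}

// qemuArgs builds the argument list passed to the QEMU binary.
func qemuArgs(vm vmSpec) []string {
	netdev := "user,id=net0"
	if vm.HostFwd != "" {
		netdev += ",hostfwd=" + vm.HostFwd
	}
	return []string{
		"-name", vm.Name,
		"-m", strconv.FormatUint(vm.MemoryMB, 10), // megabytes by default
		"-smp", strconv.Itoa(vm.CPUs),
		"-drive", "file=" + vm.DiskPath + ",format=qcow2,if=virtio",
		"-netdev", netdev,
		"-device", "virtio-net-pci,netdev=net0",
		"-vnc", ":" + strconv.Itoa(vm.VNCPort-5900),
		"-qmp", "unix:" + vm.QMPSocket + ",server=on,wait=off",
	}
}

func main() {
	args := qemuArgs(vmSpec{
		Name:      "web-01",
		MemoryMB:  2048,
		CPUs:      2,
		DiskPath:  "/var/lib/qemu/web-01.qcow2",
		HostFwd:   "tcp::8080-:80",
		VNCPort:   5901,
		QMPSocket: "/run/qemu-backend/web-01.qmp", // hypothetical path
	})
	fmt.Println("qemu-system-x86_64 " + strings.Join(args, " "))
}
```

Returning a slice rather than a single string lets the process manager hand the arguments to `exec.Command` without shell quoting.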
## Design Decisions and Rationale
### Use HTTP Backend vs Custom Backend Plugin
**Decision**: Use existing OpenTofu HTTP backend rather than implementing a custom backend plugin.
**Rationale**:
- Avoids modifying OpenTofu codebase
- Leverages well-tested HTTP backend implementation
- Enables implementation in any programming language
- Simplifies testing and deployment
- Maintains compatibility across OpenTofu versions
### Direct QEMU Management vs libvirt
**Decision**: Manage QEMU processes directly instead of using the existing libvirt provider.
**Rationale**:
- **Control**: Fine-grained control over QEMU parameters and configuration
- **Simplicity**: Eliminates libvirt daemon dependency and complexity
- **Debugging**: Direct access to QEMU processes and logs
- **Flexibility**: Custom networking, storage, and feature implementations
- **Performance**: Reduced overhead from abstraction layers
- **Reliability**: Fewer moving parts and potential failure points
### State Storage Options
**Decision**: Support multiple storage backends with SQLite as default.
**Rationale**:
- **SQLite**: Simple deployment, no external dependencies, suitable for small-medium scale
- **PostgreSQL**: Production scalability, ACID compliance, concurrent access
- **File-based**: Development simplicity, easy backup and migration
- **Flexibility**: Different deployment scenarios have different requirements
### Process Management Approach
**Decision**: Implement custom process supervision rather than using system service managers.
**Rationale**:
- **Integration**: Direct integration with state management and HTTP API
- **Control**: Custom lifecycle management and resource allocation
- **Portability**: Works across different operating systems and environments
- **Monitoring**: Built-in health checks and resource tracking
- **Recovery**: Coordinated recovery with state consistency
### JSON State Format Compatibility
**Decision**: Maintain full compatibility with standard Terraform/OpenTofu state format.
**Rationale**:
- **Interoperability**: Works with existing tooling and workflows
- **Migration**: Easy migration from other backends
- **Standards**: Leverages well-defined, stable format
- **Debugging**: Familiar format for troubleshooting
- **Future-proofing**: Compatibility with ecosystem tools
### Security and Authentication
**Decision**: Start with basic authentication, design for pluggable auth system.
**Rationale**:
- **Simplicity**: Basic auth sufficient for many use cases
- **Standards**: HTTP-based authentication familiar to operators
- **Extensibility**: Architecture supports additional auth methods
- **Operations**: Integrates with existing HTTP infrastructure (proxies, load balancers)
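A basic-auth layer can be written as ordinary `net/http` middleware, which also leaves room to swap in other schemes later. This is a minimal sketch: `requireBasicAuth` and `credentialsMatch` are hypothetical names, the realm string is an assumption, and hashing before comparison is one common way to keep the comparison constant-time regardless of input length.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// equalConstantTime compares two strings without leaking their length
// or the position of the first mismatch.
func equalConstantTime(a, b string) bool {
	ha := sha256.Sum256([]byte(a))
	hb := sha256.Sum256([]byte(b))
	return subtle.ConstantTimeCompare(ha[:], hb[:]) == 1
}

// credentialsMatch always evaluates both comparisons.
func credentialsMatch(gotUser, gotPass, wantUser, wantPass string) bool {
	userOK := equalConstantTime(gotUser, wantUser)
	passOK := equalConstantTime(gotPass, wantPass)
	return userOK && passOK
}

// requireBasicAuth wraps next and rejects requests without valid credentials.
func requireBasicAuth(user, pass string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		if !ok || !credentialsMatch(u, p, user, pass) {
			w.Header().Set("WWW-Authenticate", `Basic realm="qemu-backend"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

func main() {
	inner := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	h := requireBasicAuth("admin", "secret", inner)

	// Exercise the middleware in-process, with and without credentials.
	anon := httptest.NewRecorder()
	h.ServeHTTP(anon, httptest.NewRequest(http.MethodGet, "/state/my_project", nil))

	req := httptest.NewRequest(http.MethodGet, "/state/my_project", nil)
	req.SetBasicAuth("admin", "secret")
	authed := httptest.NewRecorder()
	h.ServeHTTP(authed, req)

	fmt.Println(anon.Code, authed.Code)
}
```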
## Risk Assessment
### Technical Risks
- **QEMU Process Management Complexity**: Mitigation through comprehensive testing and graceful error handling
- **State Consistency**: Mitigation through robust locking and recovery mechanisms
- **Resource Conflicts**: Mitigation through resource allocation tracking and validation
- **Scale Limitations**: Mitigation through resource limits and monitoring
### Operational Risks
- **Data Loss**: Mitigation through backup strategies and state validation
- **Service Availability**: Mitigation through health checks and restart procedures
- **Security**: Mitigation through authentication, TLS, and input validation
## Future Enhancements
### Potential Features
- VM templating and cloning
- Snapshot management
- Live migration support
- Multi-host clustering
- Advanced networking (bridges, VLANs)
- GPU passthrough support
- Backup and restore automation
- Prometheus metrics integration
- Web UI for monitoring and management
### Scalability Improvements
- Horizontal scaling across multiple hosts
- Load balancing and VM placement policies
- Resource scheduling and optimization
- High availability and failover