First draft of the design file
commit
c0af0d5b41
# QEMU Backend for OpenTofu - Design Document

## Goals

### Primary Objective
Create a custom backend for OpenTofu that manages QEMU virtual machines directly, providing declarative infrastructure management for VM workloads without external dependencies.

### Key Goals
- **Direct QEMU Control**: Manage QEMU processes natively without abstraction layers
- **OpenTofu Integration**: Leverage existing OpenTofu HTTP backend protocol for seamless integration
- **Declarative VM Management**: Define VM infrastructure as code with full lifecycle management
- **Resource Efficiency**: Optimal resource allocation and process supervision
- **Operational Simplicity**: Self-contained solution with minimal external dependencies

### Success Criteria
- OpenTofu can create, modify, and destroy QEMU VMs through standard configuration
- State management maintains consistency between declared and actual VM state
- Concurrent operations are safely handled through proper locking
- System recovers gracefully from failures and restarts
- Performance scales to reasonable VM workloads (10-100 VMs)

## Architecture Overview

### High-Level Design
```
OpenTofu Client → HTTP Backend Protocol → QEMU Management Server
                                                    ↓
                                              QEMU Processes
                                                    ↓
                                              State Storage
```

### Components

#### 1. OpenTofu HTTP Backend Client
- **Role**: Built-in OpenTofu HTTP backend (no custom code required)
- **Responsibilities**: State serialization, HTTP communication, locking protocol
- **Configuration**: Points to custom QEMU management server endpoints

#### 2. QEMU Management Server
- **Role**: Core application implementing HTTP backend protocol
- **Responsibilities**:
  - HTTP API implementation (state CRUD, locking)
  - QEMU process lifecycle management
  - Resource allocation and conflict resolution
  - State persistence and recovery

#### 3. State Storage Layer
- **Role**: Persistent storage for OpenTofu state and VM metadata
- **Options**: SQLite (simple), PostgreSQL (production), file-based (development)
- **Responsibilities**: State persistence, backup, recovery

#### 4. QEMU Process Manager
- **Role**: Direct QEMU process control and supervision
- **Responsibilities**: Process spawning, monitoring, resource management, cleanup

## Implementation Plan

### Phase 1: Core HTTP Backend Server
**Deliverables:**
- Basic HTTP server implementing OpenTofu backend protocol
- State storage and retrieval (GET/POST endpoints)
- State locking mechanism (LOCK/UNLOCK endpoints)
- Configuration management and validation

**Key Tasks:**
- Implement REST API handlers for state operations
- Design state storage schema and persistence layer
- Add proper error handling and logging
- Create basic configuration system

### Phase 2: QEMU Integration
**Deliverables:**
- QEMU process lifecycle management
- Resource allocation system (ports, memory, disk)
- Process monitoring and health checks
- Basic VM operations (create, start, stop, destroy)

**Key Tasks:**
- Build QEMU command-line generation
- Implement process supervision and PID tracking
- Add QEMU Machine Protocol (QMP) integration
- Create resource conflict detection

### Phase 3: State Processing and VM Management
**Deliverables:**
- State diff processing to determine required changes
- VM configuration template system
- Networking and storage management
- Graceful shutdown and cleanup procedures

**Key Tasks:**
- Parse OpenTofu state changes into VM operations
- Implement VM configuration templating
- Add network and storage allocation
- Build recovery and cleanup mechanisms

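The state diff step above can be sketched as a comparison of two maps. This is only an illustration under stated assumptions: `VMSpec`, `Op`, `OpKind`, and `Diff` are hypothetical names not taken from this design, and the sketch assumes the server has already extracted the declared VMs from the incoming state and knows which VMs it currently runs.

```go
package main

import (
	"fmt"
	"sort"
)

// VMSpec is a hypothetical, reduced view of one declared VM.
type VMSpec struct {
	MemoryMB uint64
	CPUs     int
}

// OpKind names the action the process manager would have to take.
type OpKind string

const (
	OpCreate  OpKind = "create"
	OpUpdate  OpKind = "update"
	OpDestroy OpKind = "destroy"
)

// Op is one required change for one VM.
type Op struct {
	Kind OpKind
	Name string
}

// Diff compares the VMs declared in the new state (desired) with the VMs the
// server currently tracks (actual). Results are sorted by VM name so the
// output is deterministic.
func Diff(desired, actual map[string]VMSpec) []Op {
	var ops []Op
	for name, want := range desired {
		have, exists := actual[name]
		switch {
		case !exists:
			ops = append(ops, Op{OpCreate, name})
		case have != want:
			ops = append(ops, Op{OpUpdate, name})
		}
	}
	for name := range actual {
		if _, keep := desired[name]; !keep {
			ops = append(ops, Op{OpDestroy, name})
		}
	}
	sort.Slice(ops, func(i, j int) bool { return ops[i].Name < ops[j].Name })
	return ops
}

func main() {
	desired := map[string]VMSpec{"web-01": {2048, 2}, "db-01": {4096, 4}}
	actual := map[string]VMSpec{"web-01": {1024, 2}, "old-01": {512, 1}}
	for _, op := range Diff(desired, actual) {
		fmt.Println(op.Kind, op.Name)
	}
}
```

A real implementation would likely also need to decide which updates can be applied live and which require a VM restart; the sketch treats every difference as a plain update.
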
### Phase 4: Production Readiness
**Deliverables:**
- Comprehensive error handling and recovery
- Performance optimization and resource limits
- Monitoring and observability features
- Documentation and deployment guides

**Key Tasks:**
- Add metrics and health endpoints
- Implement backup and restore procedures
- Performance testing and optimization
- Security hardening and authentication

## Technical Specifications

### HTTP Backend Protocol Implementation

#### Required Endpoints
```
GET    /state/{project}       - Retrieve current state (JSON)
POST   /state/{project}       - Store new state (JSON body)
DELETE /state/{project}       - Delete state (optional)
LOCK   /state/{project}/lock  - Acquire state lock (JSON body)
UNLOCK /state/{project}/lock  - Release state lock (JSON body)
```

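As a rough illustration of the state endpoints listed above, the following sketch serves GET, POST, and DELETE from an in-memory map. `StateServer`, `NewStateServer`, and `Do` are hypothetical names, the map stands in for the real storage layer, and answering a missing state with 404 is an assumption about what the HTTP backend client accepts as "no state yet". Lock handling and authentication are left out.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// StateServer is a hypothetical in-memory stand-in for the storage layer.
// States are keyed by request path, e.g. /state/my_project.
type StateServer struct {
	mu     sync.RWMutex
	states map[string][]byte
}

func NewStateServer() *StateServer {
	return &StateServer{states: make(map[string][]byte)}
}

// ServeHTTP implements the GET/POST/DELETE state operations.
func (s *StateServer) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	key := r.URL.Path
	switch r.Method {
	case http.MethodGet:
		s.mu.RLock()
		data, ok := s.states[key]
		s.mu.RUnlock()
		if !ok {
			http.NotFound(w, r) // assumption: client treats 404 as empty state
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.Write(data)
	case http.MethodPost:
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		s.mu.Lock()
		s.states[key] = body
		s.mu.Unlock()
		w.WriteHeader(http.StatusOK)
	case http.MethodDelete:
		s.mu.Lock()
		delete(s.states, key)
		s.mu.Unlock()
		w.WriteHeader(http.StatusOK)
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

// Do sends one in-process request to the handler and returns status and body.
func Do(h http.Handler, method, path, body string) (int, string) {
	req := httptest.NewRequest(method, path, strings.NewReader(body))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	srv := NewStateServer()
	code, _ := Do(srv, http.MethodGet, "/state/my_project", "")
	fmt.Println("GET before write:", code)
	code, _ = Do(srv, http.MethodPost, "/state/my_project", `{"version": 4}`)
	fmt.Println("POST:", code)
	code, body := Do(srv, http.MethodGet, "/state/my_project", "")
	fmt.Println("GET after write:", code, body)
}
```

In a running server the same handler could be mounted with `http.Handle("/state/", srv)`; the in-process helper is used here only so the sketch runs without opening a port.
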
#### State Format
- Standard Terraform/OpenTofu JSON state format
- Version 4 state schema compatibility
- Custom resource types for QEMU VMs

#### Locking Protocol
```json
{
  "ID": "unique-lock-id",
  "Operation": "OperationTypePlan|OperationTypeApply",
  "Info": "operation description",
  "Who": "user@host",
  "Version": "opentofu-version",
  "Created": "2024-01-01T00:00:00Z",
  "Path": "terraform-working-directory"
}
```

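A minimal sketch of how the server might process this lock payload is shown below. `LockTable`, `Acquire`, and `Release` are hypothetical names. The sketch assumes a held lock is reported with HTTP 423 (Locked) together with the current holder's info, which to my understanding is what the HTTP backend client expects; returning 409 for an unlock with a mismatched ID is a choice made for this example, not something the document specifies.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"sync"
)

// LockInfo mirrors the JSON lock payload shown above.
type LockInfo struct {
	ID        string
	Operation string
	Info      string
	Who       string
	Version   string
	Created   string
	Path      string
}

// LockTable is a hypothetical in-memory lock registry, one lock per project.
type LockTable struct {
	mu   sync.Mutex
	held map[string]LockInfo
}

func NewLockTable() *LockTable {
	return &LockTable{held: make(map[string]LockInfo)}
}

// Acquire parses the request body and tries to take the project lock.
// It returns the HTTP status to send and the lock info to echo back:
// the new lock on success, or the current holder when already locked.
func (t *LockTable) Acquire(project string, body []byte) (int, LockInfo) {
	var req LockInfo
	if err := json.Unmarshal(body, &req); err != nil || req.ID == "" {
		return http.StatusBadRequest, LockInfo{}
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	if holder, locked := t.held[project]; locked {
		return http.StatusLocked, holder
	}
	t.held[project] = req
	return http.StatusOK, req
}

// Release drops the lock only when the caller presents the holder's ID.
func (t *LockTable) Release(project string, body []byte) int {
	var req LockInfo
	if err := json.Unmarshal(body, &req); err != nil {
		return http.StatusBadRequest
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	holder, locked := t.held[project]
	if !locked || holder.ID != req.ID {
		return http.StatusConflict // assumption: mismatch is reported as 409
	}
	delete(t.held, project)
	return http.StatusOK
}

func main() {
	locks := NewLockTable()
	alice := []byte(`{"ID": "a1", "Who": "alice@host"}`)
	bob := []byte(`{"ID": "b2", "Who": "bob@host"}`)

	code, _ := locks.Acquire("my_project", alice)
	fmt.Println("alice acquires:", code)
	code, holder := locks.Acquire("my_project", bob)
	fmt.Println("bob acquires:", code, "held by", holder.Who)
	fmt.Println("alice releases:", locks.Release("my_project", alice))
}
```

With the SQL schema in this document, the same check-and-insert could rely on the primary key of the `locks` table instead of a mutex, so that two server instances cannot both grant the lock.
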
### QEMU Process Management

#### VM Lifecycle Operations
- **Create**: Generate QEMU configuration and spawn process
- **Start/Stop**: Control VM power state through QMP
- **Modify**: Update VM configuration (restart required for some changes)
- **Destroy**: Graceful shutdown and resource cleanup
- **Monitor**: Health checks and resource usage tracking

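The Create step above implies turning a VM description into a QEMU argument list. The following is a sketch of that translation, assuming a qcow2 disk attached via virtio, user-mode networking, and a QMP UNIX socket. `BuildArgs` and the `HostFwd` field are hypothetical, and the `server=on,wait=off` spelling assumes a reasonably recent QEMU (older releases use `server,nowait`).

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// NetworkConfig is a hypothetical minimal network description.
type NetworkConfig struct {
	HostFwd string // e.g. "tcp::8080-:80"; empty means no port forwarding
}

// VMConfig follows the fields used in this design document.
type VMConfig struct {
	Name          string
	Memory        uint64 // MB
	CPUs          int
	DiskPath      string
	NetworkConfig NetworkConfig
	VNCPort       int
	QMPSocket     string
}

// QEMU's -vnc option takes a display number, which is the TCP port minus 5900.
const vncBasePort = 5900

// BuildArgs translates a VMConfig into QEMU command-line arguments.
func BuildArgs(cfg VMConfig) ([]string, error) {
	if cfg.Name == "" || cfg.DiskPath == "" || cfg.QMPSocket == "" {
		return nil, errors.New("name, disk path and QMP socket are required")
	}
	if cfg.VNCPort < vncBasePort {
		return nil, fmt.Errorf("VNC port %d is below %d", cfg.VNCPort, vncBasePort)
	}
	netdev := "user,id=net0"
	if cfg.NetworkConfig.HostFwd != "" {
		netdev += ",hostfwd=" + cfg.NetworkConfig.HostFwd
	}
	return []string{
		"-name", cfg.Name,
		"-m", strconv.FormatUint(cfg.Memory, 10),
		"-smp", strconv.Itoa(cfg.CPUs),
		"-drive", "file=" + cfg.DiskPath + ",format=qcow2,if=virtio",
		"-netdev", netdev,
		"-device", "virtio-net-pci,netdev=net0",
		"-vnc", ":" + strconv.Itoa(cfg.VNCPort-vncBasePort),
		"-qmp", "unix:" + cfg.QMPSocket + ",server=on,wait=off",
	}, nil
}

func main() {
	args, err := BuildArgs(VMConfig{
		Name:          "web-01",
		Memory:        2048,
		CPUs:          2,
		DiskPath:      "/var/lib/qemu/web-01.qcow2",
		NetworkConfig: NetworkConfig{HostFwd: "tcp::8080-:80"},
		VNCPort:       5901,
		QMPSocket:     "/run/qemu-backend/web-01.qmp",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("qemu-system-x86_64", strings.Join(args, " "))
}
```

Returning a slice rather than a single string keeps the arguments safe to pass to `exec.Command` without shell quoting.
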
#### Resource Management
```go
type VMConfig struct {
	Name          string
	Memory        uint64 // MB
	CPUs          int
	DiskPath      string
	NetworkConfig NetworkConfig
	VNCPort       int
	QMPSocket     string
}

type ResourcePool struct {
	MaxMemory      uint64
	UsedMemory     uint64
	PortRange      PortRange
	AllocatedPorts map[int]string
	DiskPaths      map[string]string
}
```

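To show how the `ResourcePool` above could enforce its limits, here is a sketch of reservation and release. `PortRange`'s fields, `NewResourcePool`, `Reserve`, `Release`, and the two error values are assumptions made for the example. The sketch is not goroutine-safe, so a real server would need to serialize calls, for example with a mutex.

```go
package main

import (
	"errors"
	"fmt"
)

// PortRange is assumed to be an inclusive range of TCP ports.
type PortRange struct {
	Start, End int
}

// ResourcePool follows the struct in this design document.
type ResourcePool struct {
	MaxMemory      uint64
	UsedMemory     uint64
	PortRange      PortRange
	AllocatedPorts map[int]string // port -> VM name
	DiskPaths      map[string]string
}

var (
	ErrNoMemory = errors.New("memory limit exceeded")
	ErrNoPort   = errors.New("no free port in range")
)

func NewResourcePool(maxMemoryMB uint64, ports PortRange) *ResourcePool {
	return &ResourcePool{
		MaxMemory:      maxMemoryMB,
		PortRange:      ports,
		AllocatedPorts: make(map[int]string),
		DiskPaths:      make(map[string]string),
	}
}

// Reserve claims memory and the lowest free port for a VM, or fails without
// changing the pool.
func (p *ResourcePool) Reserve(vm string, memoryMB uint64) (int, error) {
	if memoryMB > p.MaxMemory-p.UsedMemory {
		return 0, ErrNoMemory
	}
	for port := p.PortRange.Start; port <= p.PortRange.End; port++ {
		if _, taken := p.AllocatedPorts[port]; !taken {
			p.AllocatedPorts[port] = vm
			p.UsedMemory += memoryMB
			return port, nil
		}
	}
	return 0, ErrNoPort
}

// Release returns a port and its memory to the pool. Unknown ports are ignored.
func (p *ResourcePool) Release(port int, memoryMB uint64) {
	if _, ok := p.AllocatedPorts[port]; !ok {
		return
	}
	delete(p.AllocatedPorts, port)
	if memoryMB > p.UsedMemory {
		memoryMB = p.UsedMemory // guard against unsigned underflow
	}
	p.UsedMemory -= memoryMB
}

func main() {
	pool := NewResourcePool(4096, PortRange{Start: 5900, End: 5901})
	port, err := pool.Reserve("web-01", 2048)
	fmt.Println(port, err)
	port, err = pool.Reserve("db-01", 2048)
	fmt.Println(port, err)
	_, err = pool.Reserve("cache-01", 512)
	fmt.Println(err)
}
```

Checking memory before scanning for a port means a failed reservation never leaves a half-allocated VM behind, which keeps conflict detection simple.
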
#### Process Supervision
- PID tracking and process monitoring
- Graceful shutdown with configurable timeouts
- Orphan process detection and cleanup
- Log aggregation and rotation

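Orphan detection, as listed above, amounts to comparing the PIDs the server has recorded with the QEMU processes that are actually running. The sketch below shows only that comparison; `Report` and `Reconcile` are hypothetical names, and how the list of running PIDs is obtained (for example by scanning the process table) is deliberately left out.

```go
package main

import (
	"fmt"
	"sort"
)

// Report lists the two kinds of mismatch between records and reality.
type Report struct {
	StaleRecords []string // tracked VMs whose recorded PID is not running
	OrphanPIDs   []int    // running QEMU PIDs that no record refers to
}

// Reconcile compares tracked VM records (VM name -> PID) with the PIDs of
// QEMU processes found on the host. Both result lists are sorted.
func Reconcile(tracked map[string]int, running []int) Report {
	alive := make(map[int]bool, len(running))
	for _, pid := range running {
		alive[pid] = true
	}
	known := make(map[int]bool, len(tracked))
	var report Report
	for name, pid := range tracked {
		known[pid] = true
		if !alive[pid] {
			report.StaleRecords = append(report.StaleRecords, name)
		}
	}
	for _, pid := range running {
		if !known[pid] {
			report.OrphanPIDs = append(report.OrphanPIDs, pid)
		}
	}
	sort.Strings(report.StaleRecords)
	sort.Ints(report.OrphanPIDs)
	return report
}

func main() {
	tracked := map[string]int{"web-01": 1200, "db-01": 1300}
	running := []int{1200, 1450}
	report := Reconcile(tracked, running)
	fmt.Println("stale records:", report.StaleRecords)
	fmt.Println("orphan PIDs:", report.OrphanPIDs)
}
```

Because the operating system can reuse PIDs, a production check would probably also compare something more specific than the number alone, such as the process start time or the QMP socket path.
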
### State Storage Schema

#### Core Tables/Collections
```sql
-- State storage
CREATE TABLE states (
    project_name VARCHAR PRIMARY KEY,
    state_data JSON,
    version INTEGER,
    updated_at TIMESTAMP
);

-- Lock management (the primary key allows at most one lock per project)
CREATE TABLE locks (
    project_name VARCHAR PRIMARY KEY,
    lock_info JSON,
    acquired_at TIMESTAMP
);

-- VM process tracking
CREATE TABLE vm_processes (
    vm_name VARCHAR PRIMARY KEY,
    project_name VARCHAR,
    pid INTEGER,
    config JSON,
    status VARCHAR,
    created_at TIMESTAMP
);
```

## Configuration

### Server Configuration
```yaml
server:
  host: "0.0.0.0"
  port: 8080
  tls:
    cert_file: "/path/to/cert.pem"
    key_file: "/path/to/key.pem"

storage:
  type: "sqlite"  # sqlite, postgres, file
  connection: "/var/lib/qemu-backend/state.db"

qemu:
  binary_path: "/usr/bin/qemu-system-x86_64"
  default_memory: 1024
  port_range:
    start: 5900
    end: 6000
  max_concurrent_vms: 50

resources:
  max_memory_mb: 32768
  disk_base_path: "/var/lib/qemu-backend/disks"
  log_directory: "/var/log/qemu-backend"

auth:
  type: "basic"  # basic, token, none
  username: "admin"
  password: "secret"
```

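Configuration validation is listed as a Phase 1 deliverable, and a few of the settings above constrain each other. The sketch below checks some of those relationships on an already-parsed configuration. The `Config` struct is a hypothetical flattened view of the YAML, parsing is omitted, and the rule that each VM needs one port from the range is an assumption.

```go
package main

import "fmt"

// Config is a hypothetical flattened view of the YAML settings above.
type Config struct {
	StorageType      string
	AuthType         string
	PortStart        int
	PortEnd          int
	MaxConcurrentVMs int
	DefaultMemoryMB  uint64
	MaxMemoryMB      uint64
}

// oneOf reports whether value matches any of the allowed strings.
func oneOf(value string, allowed ...string) bool {
	for _, a := range allowed {
		if value == a {
			return true
		}
	}
	return false
}

// Validate returns one message per problem found; an empty slice means the
// configuration passed every check in this sketch.
func (c Config) Validate() []string {
	var problems []string
	if !oneOf(c.StorageType, "sqlite", "postgres", "file") {
		problems = append(problems, fmt.Sprintf("storage.type %q is not supported", c.StorageType))
	}
	if !oneOf(c.AuthType, "basic", "token", "none") {
		problems = append(problems, fmt.Sprintf("auth.type %q is not supported", c.AuthType))
	}
	if c.PortStart < 1 || c.PortEnd > 65535 || c.PortStart > c.PortEnd {
		problems = append(problems, "qemu.port_range is not a valid port range")
	} else if c.PortEnd-c.PortStart+1 < c.MaxConcurrentVMs {
		// Assumption: every VM needs one port from the range.
		problems = append(problems, "qemu.port_range is smaller than qemu.max_concurrent_vms")
	}
	if c.DefaultMemoryMB > c.MaxMemoryMB {
		problems = append(problems, "qemu.default_memory exceeds resources.max_memory_mb")
	}
	return problems
}

func main() {
	cfg := Config{
		StorageType:      "sqlite",
		AuthType:         "basic",
		PortStart:        5900,
		PortEnd:          6000,
		MaxConcurrentVMs: 50,
		DefaultMemoryMB:  1024,
		MaxMemoryMB:      32768,
	}
	fmt.Println("problems:", len(cfg.Validate()))
}
```

Collecting every problem instead of stopping at the first one lets an operator fix a configuration file in a single pass.
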
### OpenTofu Configuration
```hcl
terraform {
  backend "http" {
    address        = "https://qemu-backend.example.com/state/my_project"
    lock_address   = "https://qemu-backend.example.com/state/my_project/lock"
    unlock_address = "https://qemu-backend.example.com/state/my_project/lock"
    username       = "admin"
    password       = "secret"
  }
}

# Example VM resource (requires custom provider)
resource "qemu_vm" "web_server" {
  name   = "web-01"
  memory = 2048
  cpus   = 2

  disk {
    path = "/var/lib/qemu/web-01.qcow2"
    size = "20G"
  }

  network {
    type    = "user"
    hostfwd = "tcp::8080-:80"
  }
}
```

## Design Decisions and Rationale

### Use HTTP Backend vs Custom Backend Plugin
**Decision**: Use the existing OpenTofu HTTP backend rather than implementing a custom backend plugin.

**Rationale**:
- Avoids modifying OpenTofu codebase
- Leverages well-tested HTTP backend implementation
- Enables implementation in any programming language
- Simplifies testing and deployment
- Maintains compatibility across OpenTofu versions

### Direct QEMU Management vs libvirt
**Decision**: Manage QEMU processes directly instead of using the existing libvirt provider.

**Rationale**:
- **Control**: Fine-grained control over QEMU parameters and configuration
- **Simplicity**: Eliminates libvirt daemon dependency and complexity
- **Debugging**: Direct access to QEMU processes and logs
- **Flexibility**: Custom networking, storage, and feature implementations
- **Performance**: Reduced overhead from abstraction layers
- **Reliability**: Fewer moving parts and potential failure points

### State Storage Options
**Decision**: Support multiple storage backends with SQLite as the default.

**Rationale**:
- **SQLite**: Simple deployment, no external dependencies, suitable for small-medium scale
- **PostgreSQL**: Production scalability, ACID compliance, concurrent access
- **File-based**: Development simplicity, easy backup and migration
- **Flexibility**: Different deployment scenarios have different requirements

### Process Management Approach
**Decision**: Implement custom process supervision rather than using system service managers.

**Rationale**:
- **Integration**: Direct integration with state management and HTTP API
- **Control**: Custom lifecycle management and resource allocation
- **Portability**: Works across different operating systems and environments
- **Monitoring**: Built-in health checks and resource tracking
- **Recovery**: Coordinated recovery with state consistency

### JSON State Format Compatibility
**Decision**: Maintain full compatibility with the standard Terraform/OpenTofu state format.

**Rationale**:
- **Interoperability**: Works with existing tooling and workflows
- **Migration**: Easy migration from other backends
- **Standards**: Leverages well-defined, stable format
- **Debugging**: Familiar format for troubleshooting
- **Future-proofing**: Compatibility with ecosystem tools

### Security and Authentication
**Decision**: Start with basic authentication and design for a pluggable auth system.

**Rationale**:
- **Simplicity**: Basic auth is sufficient for many use cases
- **Standards**: HTTP-based authentication familiar to operators
- **Extensibility**: Architecture supports additional auth methods
- **Operations**: Integrates with existing HTTP infrastructure (proxies, load balancers)

## Risk Assessment

### Technical Risks
- **QEMU Process Management Complexity**: Mitigation through comprehensive testing and graceful error handling
- **State Consistency**: Mitigation through robust locking and recovery mechanisms
- **Resource Conflicts**: Mitigation through resource allocation tracking and validation
- **Scale Limitations**: Mitigation through resource limits and monitoring

### Operational Risks
- **Data Loss**: Mitigation through backup strategies and state validation
- **Service Availability**: Mitigation through health checks and restart procedures
- **Security**: Mitigation through authentication, TLS, and input validation

## Future Enhancements

### Potential Features
- VM templating and cloning
- Snapshot management
- Live migration support
- Multi-host clustering
- Advanced networking (bridges, VLANs)
- GPU passthrough support
- Backup and restore automation
- Prometheus metrics integration
- Web UI for monitoring and management

### Scalability Improvements
- Horizontal scaling across multiple hosts
- Load balancing and VM placement policies
- Resource scheduling and optimization
- High availability and failover