From c0af0d5b41a907640eced94f54d57e9394c38a76 Mon Sep 17 00:00:00 2001
From: John Kenyon
Date: Sun, 21 Sep 2025 21:28:51 -0700
Subject: [PATCH] First draft of the design file

---
 DESIGN.md | 363 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 363 insertions(+)
 create mode 100644 DESIGN.md

diff --git a/DESIGN.md b/DESIGN.md
new file mode 100644
index 0000000..78afd16
--- /dev/null
+++ b/DESIGN.md
@@ -0,0 +1,363 @@
+# QEMU Backend for OpenTofu - Design Document
+
+## Goals
+
+### Primary Objective
+Create a custom backend for OpenTofu that manages QEMU virtual machines directly, providing declarative infrastructure management for VM workloads without external dependencies.
+
+### Key Goals
+- **Direct QEMU Control**: Manage QEMU processes natively without abstraction layers
+- **OpenTofu Integration**: Leverage existing OpenTofu HTTP backend protocol for seamless integration
+- **Declarative VM Management**: Define VM infrastructure as code with full lifecycle management
+- **Resource Efficiency**: Optimal resource allocation and process supervision
+- **Operational Simplicity**: Self-contained solution with minimal external dependencies
+
+### Success Criteria
+- OpenTofu can create, modify, and destroy QEMU VMs through standard configuration
+- State management maintains consistency between declared and actual VM state
+- Concurrent operations are safely handled through proper locking
+- System recovers gracefully from failures and restarts
+- Performance scales to reasonable VM workloads (10-100 VMs)
+
+## Architecture Overview
+
+### High-Level Design
+```
+OpenTofu Client → HTTP Backend Protocol → QEMU Management Server
+                                                    ↓
+                                              QEMU Processes
+                                                    ↓
+                                              State Storage
+```
+
+### Components
+
+#### 1. OpenTofu HTTP Backend Client
+- **Role**: Built-in OpenTofu HTTP backend (no custom code required)
+- **Responsibilities**: State serialization, HTTP communication, locking protocol
+- **Configuration**: Points to custom QEMU management server endpoints
+
+#### 2. QEMU Management Server
+- **Role**: Core application implementing HTTP backend protocol
+- **Responsibilities**:
+  - HTTP API implementation (state CRUD, locking)
+  - QEMU process lifecycle management
+  - Resource allocation and conflict resolution
+  - State persistence and recovery
+
+#### 3. State Storage Layer
+- **Role**: Persistent storage for OpenTofu state and VM metadata
+- **Options**: SQLite (simple), PostgreSQL (production), file-based (development)
+- **Responsibilities**: State persistence, backup, recovery
+
+#### 4. QEMU Process Manager
+- **Role**: Direct QEMU process control and supervision
+- **Responsibilities**: Process spawning, monitoring, resource management, cleanup
+
+## Implementation Plan
+
+### Phase 1: Core HTTP Backend Server
+**Deliverables:**
+- Basic HTTP server implementing OpenTofu backend protocol
+- State storage and retrieval (GET/POST endpoints)
+- State locking mechanism (LOCK/UNLOCK endpoints)
+- Configuration management and validation
+
+**Key Tasks:**
+- Implement REST API handlers for state operations
+- Design state storage schema and persistence layer
+- Add proper error handling and logging
+- Create basic configuration system
+
+### Phase 2: QEMU Integration
+**Deliverables:**
+- QEMU process lifecycle management
+- Resource allocation system (ports, memory, disk)
+- Process monitoring and health checks
+- Basic VM operations (create, start, stop, destroy)
+
+**Key Tasks:**
+- Build QEMU command-line generation
+- Implement process supervision and PID tracking
+- Add QEMU Machine Protocol (QMP) integration
+- Create resource conflict detection
+
+### Phase 3: State Processing and VM Management
+**Deliverables:**
+- State diff processing to determine required changes
+- VM configuration template system
+- Networking and storage management
+- Graceful shutdown and cleanup procedures
+
+**Key Tasks:**
+- Parse OpenTofu state changes into VM operations
+- Implement VM configuration templating
+- Add network and storage allocation
+- Build recovery and cleanup mechanisms
+
+### Phase 4: Production Readiness
+**Deliverables:**
+- Comprehensive error handling and recovery
+- Performance optimization and resource limits
+- Monitoring and observability features
+- Documentation and deployment guides
+
+**Key Tasks:**
+- Add metrics and health endpoints
+- Implement backup and restore procedures
+- Performance testing and optimization
+- Security hardening and authentication
+
+## Technical Specifications
+
+### HTTP Backend Protocol Implementation
+
+#### Required Endpoints
+```
+GET    /state/{project}       - Retrieve current state (JSON)
+POST   /state/{project}       - Store new state (JSON body)
+DELETE /state/{project}       - Delete state (optional)
+LOCK   /state/{project}/lock  - Acquire state lock (JSON body)
+UNLOCK /state/{project}/lock  - Release state lock (JSON body)
+```
+
+#### State Format
+- Standard Terraform/OpenTofu JSON state format
+- Version 4 state schema compatibility
+- Custom resource types for QEMU VMs
+
+#### Locking Protocol
+```json
+{
+  "ID": "unique-lock-id",
+  "Operation": "OperationTypePlan|OperationTypeApply",
+  "Info": "operation description",
+  "Who": "user@host",
+  "Version": "opentofu-version",
+  "Created": "2024-01-01T00:00:00Z",
+  "Path": "terraform-working-directory"
+}
+```
+
+### QEMU Process Management
+
+#### VM Lifecycle Operations
+- **Create**: Generate QEMU configuration and spawn process
+- **Start/Stop**: Control VM power state through QMP
+- **Modify**: Update VM configuration (restart required for some changes)
+- **Destroy**: Graceful shutdown and resource cleanup
+- **Monitor**: Health checks and resource usage tracking
+
+#### Resource Management
+```go
+type VMConfig struct {
+    Name          string
+    Memory        uint64 // MB
+    CPUs          int
+    DiskPath      string
+    NetworkConfig NetworkConfig
+    VNCPort       int
+    QMPSocket     string
+}
+
+type ResourcePool struct {
+    MaxMemory      uint64
+    UsedMemory     uint64
+    PortRange      PortRange
+    AllocatedPorts map[int]string
+    DiskPaths      map[string]string
+}
+```
+
+#### Process Supervision
+- PID tracking and process monitoring
+- Graceful shutdown with configurable timeouts
+- Orphan process detection and cleanup
+- Log aggregation and rotation
+
+### State Storage Schema
+
+#### Core Tables/Collections
+```sql
+-- State storage
+states (
+    project_name VARCHAR PRIMARY KEY,
+    state_data JSON,
+    version INTEGER,
+    updated_at TIMESTAMP
+);
+
+-- Lock management
+locks (
+    project_name VARCHAR PRIMARY KEY,
+    lock_info JSON,
+    acquired_at TIMESTAMP
+);
+
+-- VM process tracking
+vm_processes (
+    vm_name VARCHAR PRIMARY KEY,
+    project_name VARCHAR,
+    pid INTEGER,
+    config JSON,
+    status VARCHAR,
+    created_at TIMESTAMP
+);
+```
+
+## Configuration
+
+### Server Configuration
+```yaml
+server:
+  host: "0.0.0.0"
+  port: 8080
+  tls:
+    cert_file: "/path/to/cert.pem"
+    key_file: "/path/to/key.pem"
+
+storage:
+  type: "sqlite" # sqlite, postgres, file
+  connection: "/var/lib/qemu-backend/state.db"
+
+qemu:
+  binary_path: "/usr/bin/qemu-system-x86_64"
+  default_memory: 1024
+  port_range:
+    start: 5900
+    end: 6000
+  max_concurrent_vms: 50
+
+resources:
+  max_memory_mb: 32768
+  disk_base_path: "/var/lib/qemu-backend/disks"
+  log_directory: "/var/log/qemu-backend"
+
+auth:
+  type: "basic" # basic, token, none
+  username: "admin"
+  password: "secret"
+```
+
+### OpenTofu Configuration
+```hcl
+terraform {
+  backend "http" {
+    address        = "https://qemu-backend.example.com/state/my_project"
+    lock_address   = "https://qemu-backend.example.com/state/my_project/lock"
+    unlock_address = "https://qemu-backend.example.com/state/my_project/lock"
+    username       = "admin"
+    password       = "secret"
+  }
+}
+
+# Example VM resource (requires custom provider)
+resource "qemu_vm" "web_server" {
+  name   = "web-01"
+  memory = 2048
+  cpus   = 2
+
+  disk {
+    path = "/var/lib/qemu/web-01.qcow2"
+    size = "20G"
+  }
+
+  network {
+    type    = "user"
+    hostfwd = "tcp::8080-:80"
+  }
+}
+```
+
+## Design Decisions and Rationale
+
+### Use HTTP Backend vs Custom Backend Plugin
+**Decision**: Use existing OpenTofu HTTP backend rather than implementing a custom backend plugin.
+
+**Rationale**:
+- Avoids modifying OpenTofu codebase
+- Leverages well-tested HTTP backend implementation
+- Enables implementation in any programming language
+- Simplifies testing and deployment
+- Maintains compatibility across OpenTofu versions
+
+### Direct QEMU Management vs LibVirt
+**Decision**: Manage QEMU processes directly instead of using the existing libvirt provider.
+
+**Rationale**:
+- **Control**: Fine-grained control over QEMU parameters and configuration
+- **Simplicity**: Eliminates libvirt daemon dependency and complexity
+- **Debugging**: Direct access to QEMU processes and logs
+- **Flexibility**: Custom networking, storage, and feature implementations
+- **Performance**: Reduced overhead from abstraction layers
+- **Reliability**: Fewer moving parts and potential failure points
+
+### State Storage Options
+**Decision**: Support multiple storage backends with SQLite as default.
+
+**Rationale**:
+- **SQLite**: Simple deployment, no external dependencies, suitable for small-medium scale
+- **PostgreSQL**: Production scalability, ACID compliance, concurrent access
+- **File-based**: Development simplicity, easy backup and migration
+- **Flexibility**: Different deployment scenarios have different requirements
+
+### Process Management Approach
+**Decision**: Implement custom process supervision rather than using system service managers.
+
+**Rationale**:
+- **Integration**: Direct integration with state management and HTTP API
+- **Control**: Custom lifecycle management and resource allocation
+- **Portability**: Works across different operating systems and environments
+- **Monitoring**: Built-in health checks and resource tracking
+- **Recovery**: Coordinated recovery with state consistency
+
+### JSON State Format Compatibility
+**Decision**: Maintain full compatibility with standard Terraform/OpenTofu state format.
+
+**Rationale**:
+- **Interoperability**: Works with existing tooling and workflows
+- **Migration**: Easy migration from other backends
+- **Standards**: Leverages well-defined, stable format
+- **Debugging**: Familiar format for troubleshooting
+- **Future-proofing**: Compatibility with ecosystem tools
+
+### Security and Authentication
+**Decision**: Start with basic authentication, design for pluggable auth system.
+
+**Rationale**:
+- **Simplicity**: Basic auth sufficient for many use cases
+- **Standards**: HTTP-based authentication familiar to operators
+- **Extensibility**: Architecture supports additional auth methods
+- **Operations**: Integrates with existing HTTP infrastructure (proxies, load balancers)
+
+## Risk Assessment
+
+### Technical Risks
+- **QEMU Process Management Complexity**: Mitigation through comprehensive testing and graceful error handling
+- **State Consistency**: Mitigation through robust locking and recovery mechanisms
+- **Resource Conflicts**: Mitigation through resource allocation tracking and validation
+- **Scale Limitations**: Mitigation through resource limits and monitoring
+
+### Operational Risks
+- **Data Loss**: Mitigation through backup strategies and state validation
+- **Service Availability**: Mitigation through health checks and restart procedures
+- **Security**: Mitigation through authentication, TLS, and input validation
+
+## Future Enhancements
+
+### Potential Features
+- VM templating and cloning
+- Snapshot management
+- Live migration support
+- Multi-host clustering
+- Advanced networking (bridges, VLANs)
+- GPU passthrough support
+- Backup and restore automation
+- Prometheus metrics integration
+- Web UI for monitoring and management
+
+### Scalability Improvements
+- Horizontal scaling across multiple hosts
+- Load balancing and VM placement policies
+- Resource scheduling and optimization
+- High availability and failover
\ No newline at end of file