# QEMU Backend for OpenTofu - Design Document

## Goals

### Primary Objective
Create a custom backend for OpenTofu that manages QEMU virtual machines directly, providing declarative infrastructure management for VM workloads without external dependencies.

### Key Goals
- **Direct QEMU Control**: Manage QEMU processes natively without abstraction layers
- **OpenTofu Integration**: Leverage the existing OpenTofu HTTP backend protocol for seamless integration
- **Declarative VM Management**: Define VM infrastructure as code with full lifecycle management
- **Resource Efficiency**: Track memory, port, and disk allocations to avoid over-commitment, and supervise QEMU processes so that failed VMs release their resources
- **Operational Simplicity**: Self-contained solution with minimal external dependencies

### Success Criteria
- OpenTofu can create, modify, and destroy QEMU VMs through standard configuration
- State management maintains consistency between declared and actual VM state
- Concurrent operations are safely handled through proper locking
- System recovers gracefully from failures and restarts
- Performance scales to reasonable VM workloads (10-100 VMs)

## Architecture Overview

### High-Level Design
```
OpenTofu Client → HTTP Backend Protocol → QEMU Management Server
                                              ↓
                                        QEMU Processes
                                              ↓
                                        State Storage
```

### Components

#### 1. OpenTofu HTTP Backend Client
- **Role**: Built-in OpenTofu HTTP backend (no custom code required)
- **Responsibilities**: State serialization, HTTP communication, locking protocol
- **Configuration**: Points to custom QEMU management server endpoints

#### 2. QEMU Management Server
- **Role**: Core application implementing HTTP backend protocol
- **Responsibilities**: 
  - HTTP API implementation (state CRUD, locking)
  - QEMU process lifecycle management
  - Resource allocation and conflict resolution
  - State persistence and recovery

#### 3. State Storage Layer
- **Role**: Persistent storage for OpenTofu state and VM metadata
- **Options**: SQLite (simple), PostgreSQL (production), file-based (development)
- **Responsibilities**: State persistence, backup, recovery
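
As an illustration of the file-based option, the sketch below stores one state file per project and replaces it atomically (write to a temporary file, then rename). The `FileStore` type, the `.tfstate` suffix, and the directory layout are assumptions made for this example, and project names are assumed to be validated before they reach the store.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// FileStore is a hypothetical file-based store: one state file per project.
type FileStore struct {
	Dir string
}

func (s FileStore) path(project string) string {
	return filepath.Join(s.Dir, project+".tfstate")
}

// Save writes the state to a temporary file in the same directory and renames
// it into place, so a crash mid-write never leaves a truncated state file.
func (s FileStore) Save(project string, state []byte) error {
	tmp, err := os.CreateTemp(s.Dir, project+".*.tmp")
	if err != nil {
		return err
	}
	defer os.Remove(tmp.Name()) // fails harmlessly after a successful rename
	if _, err := tmp.Write(state); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Sync(); err != nil {
		tmp.Close()
		return err
	}
	if err := tmp.Close(); err != nil {
		return err
	}
	return os.Rename(tmp.Name(), s.path(project))
}

// Load returns (nil, nil) when the project has no stored state yet.
func (s FileStore) Load(project string) ([]byte, error) {
	data, err := os.ReadFile(s.path(project))
	if errors.Is(err, fs.ErrNotExist) {
		return nil, nil
	}
	return data, err
}

// demoFileStore saves and reloads a state document in a temporary directory.
func demoFileStore() (string, error) {
	dir, err := os.MkdirTemp("", "qemu-backend-")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(dir)
	store := FileStore{Dir: dir}
	if err := store.Save("my_project", []byte(`{"version":4}`)); err != nil {
		return "", err
	}
	data, err := store.Load("my_project")
	return string(data), err
}

func main() {
	out, err := demoFileStore()
	fmt.Println(out, err)
}
```

The rename step is what makes the update atomic on POSIX filesystems; SQLite and PostgreSQL would provide the same guarantee through transactions.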

#### 4. QEMU Process Manager
- **Role**: Direct QEMU process control and supervision
- **Responsibilities**: Process spawning, monitoring, resource management, cleanup

## Implementation Plan

### Phase 1: Core HTTP Backend Server
**Deliverables:**
- Basic HTTP server implementing OpenTofu backend protocol
- State storage and retrieval (GET/POST on the state endpoint)
- State locking mechanism (LOCK/UNLOCK methods on the lock endpoint)
- Configuration management and validation

**Key Tasks:**
- Implement REST API handlers for state operations
- Design state storage schema and persistence layer
- Add proper error handling and logging
- Create basic configuration system

### Phase 2: QEMU Integration
**Deliverables:**
- QEMU process lifecycle management
- Resource allocation system (ports, memory, disk)
- Process monitoring and health checks
- Basic VM operations (create, start, stop, destroy)

**Key Tasks:**
- Build QEMU command-line generation
- Implement process supervision and PID tracking
- Add QEMU Machine Protocol (QMP) integration
- Create resource conflict detection
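
Command-line generation could look roughly like the sketch below. It redeclares a `VMConfig` matching the one in the Technical Specifications so that it is self-contained; the `BuildArgs` name, the choice of virtio devices, the qcow2 disk format, and the `NetworkConfig` fields are assumptions for illustration. The QEMU options themselves (`-name`, `-m`, `-smp`, `-drive`, `-netdev`, `-device`, `-vnc`, `-qmp`) are standard.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// NetworkConfig is an assumed shape based on the example qemu_vm resource.
type NetworkConfig struct {
	Type    string // e.g. "user"
	HostFwd string // e.g. "tcp::8080-:80"
}

type VMConfig struct {
	Name          string
	Memory        uint64 // MB
	CPUs          int
	DiskPath      string
	NetworkConfig NetworkConfig
	VNCPort       int
	QMPSocket     string
}

// BuildArgs translates a VMConfig into QEMU command-line arguments.
// Paths containing commas would need escaping, which this sketch omits.
func BuildArgs(vm VMConfig) ([]string, error) {
	if vm.VNCPort < 5900 {
		return nil, fmt.Errorf("VNC port %d is below 5900", vm.VNCPort)
	}
	netdev := vm.NetworkConfig.Type + ",id=net0"
	if vm.NetworkConfig.HostFwd != "" {
		netdev += ",hostfwd=" + vm.NetworkConfig.HostFwd
	}
	return []string{
		"-name", vm.Name,
		"-m", strconv.FormatUint(vm.Memory, 10), // megabytes by default
		"-smp", strconv.Itoa(vm.CPUs),
		"-drive", "file=" + vm.DiskPath + ",format=qcow2,if=virtio",
		"-netdev", netdev,
		"-device", "virtio-net-pci,netdev=net0",
		"-vnc", ":" + strconv.Itoa(vm.VNCPort-5900), // VNC display N listens on 5900+N
		"-qmp", "unix:" + vm.QMPSocket + ",server=on,wait=off",
	}, nil
}

func main() {
	args, err := BuildArgs(VMConfig{
		Name:          "web-01",
		Memory:        2048,
		CPUs:          2,
		DiskPath:      "/var/lib/qemu/web-01.qcow2",
		NetworkConfig: NetworkConfig{Type: "user", HostFwd: "tcp::8080-:80"},
		VNCPort:       5901,
		QMPSocket:     "/run/qemu-backend/web-01.qmp",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("qemu-system-x86_64 " + strings.Join(args, " "))
}
```

Returning an argument slice rather than a shell string avoids quoting problems when the arguments are later handed to `exec.Command`.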

### Phase 3: State Processing and VM Management
**Deliverables:**
- State diff processing to determine required changes
- VM configuration template system
- Networking and storage management
- Graceful shutdown and cleanup procedures

**Key Tasks:**
- Parse OpenTofu state changes into VM operations
- Implement VM configuration templating
- Add network and storage allocation
- Build recovery and cleanup mechanisms

### Phase 4: Production Readiness
**Deliverables:**
- Comprehensive error handling and recovery
- Performance optimization and resource limits
- Monitoring and observability features
- Documentation and deployment guides

**Key Tasks:**
- Add metrics and health endpoints
- Implement backup and restore procedures
- Performance testing and optimization
- Security hardening and authentication

## Technical Specifications

### HTTP Backend Protocol Implementation

#### Required Endpoints
```
GET    /state/{project}        - Retrieve current state (JSON)
POST   /state/{project}        - Store new state (JSON body)
DELETE /state/{project}        - Delete state (optional)
LOCK   /state/{project}/lock   - Acquire state lock (JSON body)
UNLOCK /state/{project}/lock   - Release state lock (JSON body)
```
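
A minimal sketch of the state endpoints, assuming an in-memory map keyed by URL path in place of the real storage layer. `StateHandler` and `roundTrip` are names invented for this example; locking is left out here and would be dispatched on the `LOCK` and `UNLOCK` methods in the same way. Returning 404 for a project without state is an assumption about what the HTTP backend client treats as "no state yet".

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
	"sync"
)

// StateHandler keeps state documents in memory, keyed by URL path
// (for example /state/my_project). A real server would use the storage layer.
type StateHandler struct {
	mu     sync.Mutex
	states map[string][]byte
}

func NewStateHandler() *StateHandler {
	return &StateHandler{states: map[string][]byte{}}
}

func (h *StateHandler) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	h.mu.Lock()
	defer h.mu.Unlock()
	switch r.Method {
	case http.MethodGet:
		state, ok := h.states[r.URL.Path]
		if !ok {
			http.Error(w, "no state", http.StatusNotFound)
			return
		}
		w.Header().Set("Content-Type", "application/json")
		w.Write(state)
	case http.MethodPost:
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, err.Error(), http.StatusBadRequest)
			return
		}
		h.states[r.URL.Path] = body
		w.WriteHeader(http.StatusOK)
	case http.MethodDelete:
		delete(h.states, r.URL.Path)
		w.WriteHeader(http.StatusOK)
	default:
		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
	}
}

// roundTrip sends one in-process request and returns the status and body.
func roundTrip(h http.Handler, method, path, body string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(method, path, strings.NewReader(body))
	h.ServeHTTP(rec, req)
	return rec.Code, rec.Body.String()
}

func main() {
	h := NewStateHandler()
	roundTrip(h, http.MethodPost, "/state/my_project", `{"version":4}`)
	code, body := roundTrip(h, http.MethodGet, "/state/my_project", "")
	fmt.Println(code, body)
}
```

Switching on `r.Method` rather than relying on router method matching keeps the handler open to the non-standard `LOCK` and `UNLOCK` verbs.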

#### State Format
- Standard Terraform/OpenTofu JSON state format
- Version 4 state schema compatibility
- Custom resource types for QEMU VMs

#### Locking Protocol
```json
{
  "ID": "unique-lock-id",
  "Operation": "OperationTypePlan|OperationTypeApply",
  "Info": "operation description",
  "Who": "user@host",
  "Version": "opentofu-version",
  "Created": "2024-01-01T00:00:00Z",
  "Path": "state-path-if-applicable"
}
```
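
The lock body above maps onto a small Go type. The sketch below parses it and keeps an in-memory lock table with one holder per project; `LockTable`, `Acquire`, and `Release` are hypothetical names. An HTTP handler built on it would typically answer 200 on success and 423 Locked (or 409 Conflict) with the current holder's JSON when `Acquire` reports that the lock is already taken.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"sync"
)

// LockInfo mirrors the JSON body sent with LOCK and UNLOCK requests.
type LockInfo struct {
	ID        string `json:"ID"`
	Operation string `json:"Operation"`
	Info      string `json:"Info"`
	Who       string `json:"Who"`
	Version   string `json:"Version"`
	Created   string `json:"Created"` // kept as a string; RFC 3339 in practice
	Path      string `json:"Path"`
}

// LockTable holds at most one lock per project.
type LockTable struct {
	mu    sync.Mutex
	locks map[string]LockInfo
}

func NewLockTable() *LockTable {
	return &LockTable{locks: map[string]LockInfo{}}
}

// Acquire parses a LOCK body and takes the lock if it is free.
// When the lock is already held it returns the holder and false.
func (t *LockTable) Acquire(project string, body []byte) (LockInfo, bool, error) {
	var req LockInfo
	if err := json.Unmarshal(body, &req); err != nil {
		return LockInfo{}, false, err
	}
	if req.ID == "" {
		return LockInfo{}, false, errors.New("lock request has no ID")
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	if holder, held := t.locks[project]; held {
		return holder, false, nil
	}
	t.locks[project] = req
	return req, true, nil
}

// Release drops the lock only when the caller presents the holder's ID.
func (t *LockTable) Release(project, id string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	holder, held := t.locks[project]
	if !held || holder.ID != id {
		return false
	}
	delete(t.locks, project)
	return true
}

func main() {
	table := NewLockTable()
	body := []byte(`{"ID":"unique-lock-id","Operation":"OperationTypeApply","Who":"user@host"}`)
	_, ok, err := table.Acquire("my_project", body)
	fmt.Println(ok, err)
	holder, ok, _ := table.Acquire("my_project", []byte(`{"ID":"other"}`))
	fmt.Println(ok, "held by", holder.Who)
}
```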

### QEMU Process Management

#### VM Lifecycle Operations
- **Create**: Generate QEMU configuration and spawn process
- **Start/Stop**: Control VM power state through QMP
- **Modify**: Update VM configuration (restart required for some changes)
- **Destroy**: Graceful shutdown and resource cleanup
- **Monitor**: Health checks and resource usage tracking
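
Power-state control through QMP follows a fixed handshake: the server sends a greeting containing a `QMP` object, the client replies with `qmp_capabilities`, and only then are commands such as `query-status`, `system_powerdown`, `stop`, or `cont` accepted. The sketch below implements that exchange over any `io.ReadWriter`. `QMPClient` is a hypothetical type, and `fakeQEMU` is a stand-in server included only so the example runs without QEMU; in practice the connection would come from `net.Dial("unix", socketPath)`.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"net"
)

// qmpMessage covers the server messages handled here: the greeting,
// command responses (return or error), and asynchronous events.
type qmpMessage struct {
	QMP    json.RawMessage `json:"QMP"`
	Return json.RawMessage `json:"return"`
	Error  *struct {
		Class string `json:"class"`
		Desc  string `json:"desc"`
	} `json:"error"`
	Event string `json:"event"`
}

type QMPClient struct {
	enc *json.Encoder
	dec *json.Decoder
}

// NewQMPClient reads the greeting and negotiates capabilities.
func NewQMPClient(rw io.ReadWriter) (*QMPClient, error) {
	c := &QMPClient{enc: json.NewEncoder(rw), dec: json.NewDecoder(rw)}
	var greeting qmpMessage
	if err := c.dec.Decode(&greeting); err != nil {
		return nil, err
	}
	if len(greeting.QMP) == 0 {
		return nil, errors.New("peer did not send a QMP greeting")
	}
	if _, err := c.Execute("qmp_capabilities"); err != nil {
		return nil, err
	}
	return c, nil
}

// Execute sends an argument-less command and waits for its response,
// skipping any asynchronous events that arrive in between.
func (c *QMPClient) Execute(command string) (json.RawMessage, error) {
	if err := c.enc.Encode(map[string]string{"execute": command}); err != nil {
		return nil, err
	}
	for {
		var msg qmpMessage
		if err := c.dec.Decode(&msg); err != nil {
			return nil, err
		}
		switch {
		case msg.Event != "":
			continue
		case msg.Error != nil:
			return nil, fmt.Errorf("qmp %s: %s: %s", command, msg.Error.Class, msg.Error.Desc)
		default:
			return msg.Return, nil
		}
	}
}

// fakeQEMU is a stand-in QMP server so the sketch runs without QEMU.
func fakeQEMU(conn net.Conn) {
	defer conn.Close()
	enc, dec := json.NewEncoder(conn), json.NewDecoder(conn)
	enc.Encode(map[string]interface{}{"QMP": map[string]interface{}{"capabilities": []string{}}})
	for {
		var req struct {
			Execute string `json:"execute"`
		}
		if dec.Decode(&req) != nil {
			return
		}
		result := map[string]interface{}{}
		if req.Execute == "query-status" {
			result = map[string]interface{}{"status": "running", "running": true}
		}
		enc.Encode(map[string]interface{}{"return": result})
	}
}

// demoQMP connects a client to the fake server and queries the VM status.
func demoQMP() (string, error) {
	client, server := net.Pipe()
	defer client.Close()
	go fakeQEMU(server)
	c, err := NewQMPClient(client)
	if err != nil {
		return "", err
	}
	status, err := c.Execute("query-status")
	return string(status), err
}

func main() {
	fmt.Println(demoQMP())
}
```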

#### Resource Management
```go
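// NetworkConfig and PortRange are referenced below; these minimal
// definitions are assumptions based on the example configuration.
type NetworkConfig struct {
    Type    string // e.g. "user"
    HostFwd string // e.g. "tcp::8080-:80"
}

type PortRange struct {
    Start int // first allocatable port (inclusive)
    End   int // last allocatable port (inclusive)
}

type VMConfig struct {
    Name          string
    Memory        uint64 // MB
    CPUs          int
    DiskPath      string
    NetworkConfig NetworkConfig
    VNCPort       int
    QMPSocket     string
}

type ResourcePool struct {
    MaxMemory      uint64            // MB
    UsedMemory     uint64            // MB
    PortRange      PortRange
    AllocatedPorts map[int]string    // port -> VM name (assumed)
    DiskPaths      map[string]string // disk path -> VM name (assumed)
}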
```
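
One way the pool could hand out resources is sketched below: an allocation succeeds only if both enough memory and a free port are available, and leaves the pool unchanged otherwise. The types are redeclared (without disk tracking) to keep the example self-contained; `Allocate`, `Release`, and `ErrNoCapacity` are hypothetical names, and callers are assumed to serialize access with a mutex.

```go
package main

import (
	"errors"
	"fmt"
)

// PortRange is inclusive on both ends, mirroring qemu.port_range in the config.
type PortRange struct{ Start, End int }

type ResourcePool struct {
	MaxMemory      uint64 // MB
	UsedMemory     uint64 // MB
	PortRange      PortRange
	AllocatedPorts map[int]string // port -> VM name
}

var ErrNoCapacity = errors.New("insufficient capacity")

// Allocate reserves memory and the lowest free port for a VM.
// On failure nothing is reserved. Not safe for concurrent use.
func (p *ResourcePool) Allocate(vmName string, memoryMB uint64) (int, error) {
	free := p.MaxMemory - p.UsedMemory // UsedMemory never exceeds MaxMemory
	if memoryMB > free {
		return 0, fmt.Errorf("%w: need %d MB, %d MB free", ErrNoCapacity, memoryMB, free)
	}
	for port := p.PortRange.Start; port <= p.PortRange.End; port++ {
		if _, taken := p.AllocatedPorts[port]; !taken {
			p.AllocatedPorts[port] = vmName
			p.UsedMemory += memoryMB
			return port, nil
		}
	}
	return 0, fmt.Errorf("%w: no free port in %d-%d", ErrNoCapacity, p.PortRange.Start, p.PortRange.End)
}

// Release returns a VM's port and memory to the pool.
func (p *ResourcePool) Release(port int, memoryMB uint64) {
	if _, ok := p.AllocatedPorts[port]; ok {
		delete(p.AllocatedPorts, port)
		p.UsedMemory -= memoryMB
	}
}

func main() {
	pool := &ResourcePool{
		MaxMemory:      4096,
		PortRange:      PortRange{Start: 5900, End: 5901},
		AllocatedPorts: map[int]string{},
	}
	port, err := pool.Allocate("web-01", 2048)
	fmt.Println(port, err)
	_, err = pool.Allocate("big-vm", 4096)
	fmt.Println(errors.Is(err, ErrNoCapacity))
}
```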

#### Process Supervision
- PID tracking and process monitoring
- Graceful shutdown with configurable timeouts
- Orphan process detection and cleanup
- Log aggregation and rotation
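
Graceful shutdown with a timeout can be sketched as follows, assuming a POSIX host: send SIGTERM, wait for the process to exit, and fall back to SIGKILL when the timeout expires. `StopGracefully` is a hypothetical helper and `sleep` stands in for a QEMU process. For QEMU, SIGTERM ends the emulator cleanly but does not shut the guest down; a guest-level shutdown would more likely be requested through QMP (`system_powerdown`) before this step.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"os/exec"
	"syscall"
	"time"
)

// StopGracefully asks a started process to terminate and waits up to timeout.
// It reports true when the process had to be killed instead.
func StopGracefully(cmd *exec.Cmd, timeout time.Duration) (bool, error) {
	exited := make(chan struct{})
	go func() {
		cmd.Wait() // reaps the child; the exit status is ignored in this sketch
		close(exited)
	}()

	err := cmd.Process.Signal(syscall.SIGTERM)
	if err != nil && !errors.Is(err, os.ErrProcessDone) {
		return false, err
	}

	select {
	case <-exited:
		return false, nil
	case <-time.After(timeout):
		if err := cmd.Process.Kill(); err != nil && !errors.Is(err, os.ErrProcessDone) {
			return false, err
		}
		<-exited
		return true, nil
	}
}

// demoStop starts a long-running child process and stops it again.
func demoStop() (bool, error) {
	cmd := exec.Command("sleep", "60") // stand-in for a QEMU process
	if err := cmd.Start(); err != nil {
		return false, err
	}
	return StopGracefully(cmd, 5*time.Second)
}

func main() {
	forced, err := demoStop()
	fmt.Println("forced:", forced, "err:", err)
}
```

Calling `Wait` in every case matters for supervision: without it the child stays behind as a zombie and its PID cannot be reliably reused for orphan detection.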

### State Storage Schema

#### Core Tables/Collections
```sql
-- State storage
CREATE TABLE states (
    project_name VARCHAR PRIMARY KEY,
    state_data JSON,
    version INTEGER,
    updated_at TIMESTAMP
);

-- Lock management
CREATE TABLE locks (
    project_name VARCHAR PRIMARY KEY,
    lock_info JSON,
    acquired_at TIMESTAMP
);

-- VM process tracking
CREATE TABLE vm_processes (
    vm_name VARCHAR PRIMARY KEY,
    project_name VARCHAR,
    pid INTEGER,
    config JSON,
    status VARCHAR,
    created_at TIMESTAMP
);
```

## Configuration

### Server Configuration
```yaml
server:
  host: "0.0.0.0"
  port: 8080
  tls:
    cert_file: "/path/to/cert.pem"
    key_file: "/path/to/key.pem"

storage:
  type: "sqlite"  # sqlite, postgres, file
  connection: "/var/lib/qemu-backend/state.db"

qemu:
  binary_path: "/usr/bin/qemu-system-x86_64"
  default_memory: 1024
  port_range:
    start: 5900
    end: 6000
  max_concurrent_vms: 50
  
resources:
  max_memory_mb: 32768
  disk_base_path: "/var/lib/qemu-backend/disks"
  log_directory: "/var/log/qemu-backend"

auth:
  type: "basic"  # basic, token, none
  username: "admin"
  password: "secret"
```

### OpenTofu Configuration
```hcl
terraform {
  backend "http" {
    address        = "https://qemu-backend.example.com/state/my_project"
    lock_address   = "https://qemu-backend.example.com/state/my_project/lock"
    unlock_address = "https://qemu-backend.example.com/state/my_project/lock"
    username       = "admin"
    password       = "secret"
  }
}

# Example VM resource (requires a custom provider that defines the qemu_vm resource type)
resource "qemu_vm" "web_server" {
  name   = "web-01"
  memory = 2048
  cpus   = 2
  
  disk {
    path = "/var/lib/qemu/web-01.qcow2"
    size = "20G"
  }
  
  network {
    type = "user"
    hostfwd = "tcp::8080-:80"
  }
}
```

## Design Decisions and Rationale

### Use HTTP Backend vs Custom Backend Plugin
**Decision**: Use existing OpenTofu HTTP backend rather than implementing a custom backend plugin.

**Rationale**:
- Avoids modifying OpenTofu codebase
- Leverages well-tested HTTP backend implementation
- Enables implementation in any programming language
- Simplifies testing and deployment
- Depends only on the HTTP backend protocol, so it should keep working across OpenTofu versions as long as that protocol stays stable

### Direct QEMU Management vs LibVirt
**Decision**: Manage QEMU processes directly instead of using the existing libvirt provider.

**Rationale**:
- **Control**: Fine-grained control over QEMU parameters and configuration
- **Simplicity**: Eliminates libvirt daemon dependency and complexity
- **Debugging**: Direct access to QEMU processes and logs
- **Flexibility**: Custom networking, storage, and feature implementations
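- **Performance**: No management daemon in the control path; guest runtime performance should be essentially the same either way, since libvirt also launches plain QEMU processes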
- **Reliability**: Fewer moving parts and potential failure points

### State Storage Options
**Decision**: Support multiple storage backends with SQLite as default.

**Rationale**:
- **SQLite**: Simple deployment, no external dependencies, suitable for small-medium scale
- **PostgreSQL**: Production scalability, concurrent writers, and the option to run storage on a separate host (SQLite is also ACID-compliant but serializes writes)
- **File-based**: Development simplicity, easy backup and migration
- **Flexibility**: Different deployment scenarios have different requirements

### Process Management Approach
**Decision**: Implement custom process supervision rather than using system service managers.

**Rationale**:
- **Integration**: Direct integration with state management and HTTP API
- **Control**: Custom lifecycle management and resource allocation
- **Portability**: Works across different operating systems and environments
- **Monitoring**: Built-in health checks and resource tracking
- **Recovery**: Coordinated recovery with state consistency

### JSON State Format Compatibility
**Decision**: Maintain full compatibility with standard Terraform/OpenTofu state format.

**Rationale**:
- **Interoperability**: Works with existing tooling and workflows
- **Migration**: Easy migration from other backends
- **Standards**: Leverages well-defined, stable format
- **Debugging**: Familiar format for troubleshooting
- **Future-proofing**: Compatibility with ecosystem tools

### Security and Authentication
**Decision**: Start with basic authentication, design for pluggable auth system.

**Rationale**:
- **Simplicity**: Basic auth sufficient for many use cases
- **Standards**: HTTP-based authentication familiar to operators
- **Extensibility**: Architecture supports additional auth methods
- **Operations**: Integrates with existing HTTP infrastructure (proxies, load balancers)
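
Basic authentication fits naturally as HTTP middleware, which is also where other auth methods could later be plugged in. The sketch below wraps any handler and compares credentials in constant time; `BasicAuth`, `protected`, and `statusFor` are hypothetical names, and the hard-coded credentials mirror the example configuration only for illustration.

```go
package main

import (
	"crypto/sha256"
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// BasicAuth rejects requests whose credentials do not match user and pass.
// Hashing both sides first gives equal-length inputs for the constant-time compare.
func BasicAuth(user, pass string, next http.Handler) http.Handler {
	wantUser := sha256.Sum256([]byte(user))
	wantPass := sha256.Sum256([]byte(pass))
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		gotUser := sha256.Sum256([]byte(u))
		gotPass := sha256.Sum256([]byte(p))
		userOK := subtle.ConstantTimeCompare(gotUser[:], wantUser[:]) == 1
		passOK := subtle.ConstantTimeCompare(gotPass[:], wantPass[:]) == 1
		if !ok || !userOK || !passOK {
			w.Header().Set("WWW-Authenticate", `Basic realm="qemu-backend"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// protected wraps a trivial handler; a real server would wrap the state API.
func protected() http.Handler {
	ok := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	return BasicAuth("admin", "secret", ok)
}

// statusFor sends one in-process request and returns the response status.
func statusFor(h http.Handler, user, pass string) int {
	req := httptest.NewRequest(http.MethodGet, "/state/my_project", nil)
	if user != "" {
		req.SetBasicAuth(user, pass)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	h := protected()
	fmt.Println(statusFor(h, "admin", "secret"), statusFor(h, "admin", "wrong"))
}
```

Basic credentials travel base64-encoded rather than encrypted, so this only makes sense together with the TLS settings in the server configuration.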

## Risk Assessment

### Technical Risks
- **QEMU Process Management Complexity**: Mitigation through comprehensive testing and graceful error handling
- **State Consistency**: Mitigation through robust locking and recovery mechanisms  
- **Resource Conflicts**: Mitigation through resource allocation tracking and validation
- **Scale Limitations**: Mitigation through resource limits and monitoring

### Operational Risks
- **Data Loss**: Mitigation through backup strategies and state validation
- **Service Availability**: Mitigation through health checks and restart procedures
- **Security**: Mitigation through authentication, TLS, and input validation

## Future Enhancements

### Potential Features
- VM templating and cloning
- Snapshot management
- Live migration support
- Multi-host clustering
- Advanced networking (bridges, VLANs)
- GPU passthrough support
- Backup and restore automation
- Prometheus metrics integration
- Web UI for monitoring and management

### Scalability Improvements
- Horizontal scaling across multiple hosts
- Load balancing and VM placement policies
- Resource scheduling and optimization
- High availability and failover