Operations Runbook

Audience: Ops

You will learn:
- Running the Icon Management Tool in production
- Start/stop procedures and service management
- Monitoring, logging, and health checks
- Backup strategies and disaster recovery

Prerequisites:
- Familiarity with the development environment
- Admin access to the production system
- Basic system administration knowledge

Service Overview

System Architecture

graph TB
    A[Load Balancer] --> B[Flask App]
    B --> C[File System]
    B --> D[Static Files]

    E[Git Repository] --> F[CI/CD Pipeline]
    F --> B

    G[Monitoring] --> B
    H[Backup System] --> C
    H --> D

Production services:
- Web application: Flask on port 5000
- Static file server: Nginx/Apache for /static/ (optional)
- Monitoring: health checks and metrics
- Backup: automated data backups

Evidence: app.py:66-67, production architecture patterns

Service Dependencies

Service	Dependencies	Critical Path
Flask App	Python 3.11+, Flask	Yes
Icon Files	File System, extracted SVGs	Yes
Metadata	icons.json file	Yes
Git Repository	Version control, CI/CD	No (runtime)
Monitoring	Health endpoint	No
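
The critical-path rows of the table can be checked before a start. The following is a minimal sketch, not part of the tool itself: the function name `preflight`, its parameters, and the directory layout (`static/icons/`, `icons.json` under the application directory) are assumptions based on the paths used elsewhere in this runbook.

```python
import importlib.util
import json
import sys
from pathlib import Path


def preflight(app_dir, min_python=(3, 11), required_modules=("flask",)):
    """Return a list of problems; an empty list means the critical path is satisfied."""
    app_dir = Path(app_dir)
    problems = []

    # Flask App row: interpreter version and importable modules.
    if sys.version_info < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required")
    for module in required_modules:
        if importlib.util.find_spec(module) is None:
            problems.append(f"module {module!r} is not installed")

    # Icon Files row: the directory must contain extracted SVGs.
    icons_dir = app_dir / "static" / "icons"
    if not icons_dir.is_dir() or not any(icons_dir.glob("*.svg")):
        problems.append(f"no SVG files found in {icons_dir}")

    # Metadata row: icons.json must exist and parse as JSON.
    metadata = app_dir / "icons.json"
    try:
        json.loads(metadata.read_text(encoding="utf-8"))
    except FileNotFoundError:
        problems.append(f"{metadata} is missing")
    except json.JSONDecodeError as exc:
        problems.append(f"{metadata} is not valid JSON: {exc}")

    return problems
```

Calling `preflight("/opt/icon-tool")` and refusing to start when the list is non-empty keeps a broken deployment from reaching the health check stage.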

Start/Stop Procedures

1. Service Start

Systemd Service (Linux)

# /etc/systemd/system/icon-tool.service
[Unit]
Description=ak Systems Icon Management Tool
After=network.target

[Service]
Type=simple
User=iconuser
Group=icongroup
WorkingDirectory=/opt/icon-tool
Environment=FLASK_ENV=production
Environment=PYTHONPATH=/opt/icon-tool
ExecStart=/opt/icon-tool/venv/bin/python app.py
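# No ExecReload: unless app.py installs a SIGHUP handler, "kill -HUP"
# terminates the process instead of reloading it. Use "systemctl restart".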
KillMode=mixed
TimeoutStopSec=5
PrivateTmp=true
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

Service management:

# Start the service
sudo systemctl start icon-tool
sudo systemctl enable icon-tool  # Start automatically at boot

# Check status
sudo systemctl status icon-tool

# Stop the service
sudo systemctl stop icon-tool

# Follow the logs
sudo journalctl -u icon-tool -f

Docker Container

# Docker-based start
docker run -d \
  --name icon-tool \
  --restart unless-stopped \
  -p 5000:5000 \
  -v /opt/icon-data:/app/static \
  -e FLASK_ENV=production \
  icon-tool:latest

# Container-Management
docker start icon-tool
docker stop icon-tool
docker logs -f icon-tool

Manual Start (Development/Testing)

# Prepare the environment
cd /opt/icon-tool
source venv/bin/activate
export FLASK_ENV=production

# Check dependencies
python -c "import flask; print('Flask OK')"
ls static/icons/*.svg | wc -l  # Icon count

# Start the service in the foreground
python app.py

# Alternatively, run it as a background process
nohup python app.py > app.log 2>&1 &
echo $! > app.pid

Expected behavior:

 * Serving Flask app 'app'
 * Environment: production
 * Debug mode: off
 * Running on all addresses (0.0.0.0)
 * Running on http://127.0.0.1:5000
 * Running on http://[your-ip]:5000

Evidence: app.py:66-67 Flask startup

2. Service Stop

# Graceful stop (systemd)
sudo systemctl stop icon-tool

# Force stop for hanging processes
sudo pkill -f "python app.py"

# Docker stop
docker stop icon-tool

# Manual stop (uses the PID file written by the manual background start)
if [ -f app.pid ]; then
    kill "$(cat app.pid)"
    rm app.pid
fi

3. Service Restart

# Systemd restart
sudo systemctl restart icon-tool

# Validated restart (a single instance is briefly unavailable while it restarts)
# 1. Health check before restart
curl -f http://localhost:5000/health

# 2. Restart, then validate; retry once if the health check fails
sudo systemctl restart icon-tool
sleep 5
curl -f http://localhost:5000/health || sudo systemctl restart icon-tool

Health Monitoring

1. Health Endpoint

# Basic Health Check
curl -f http://localhost:5000/health

# Expected Response:
{
  "status": "healthy",
  "timestamp": 1692874800.123,
  "checks": {
    "icons_directory": {
      "exists": true,
      "icon_count": 162
    },
    "metadata": {
      "exists": true,
      "valid_json": true
    },
    "api": {
      "functional": true,
      "icon_count": 162,
      "category_count": 20
    }
  }
}
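
A consumer of this endpoint can check more than the HTTP status. The sketch below evaluates a parsed response with the field names from the example above; the function name `evaluate_health` is hypothetical, and the rule that the directory and API icon counts should match is an assumption drawn from the example, where both are 162.

```python
import json

# The example response shown above, used here as input.
SAMPLE = json.loads("""
{
  "status": "healthy",
  "timestamp": 1692874800.123,
  "checks": {
    "icons_directory": {"exists": true, "icon_count": 162},
    "metadata": {"exists": true, "valid_json": true},
    "api": {"functional": true, "icon_count": 162, "category_count": 20}
  }
}
""")


def evaluate_health(payload):
    """Return (ok, problems) for a parsed /health response."""
    problems = []
    if payload.get("status") != "healthy":
        problems.append(f"status is {payload.get('status')!r}")

    checks = payload.get("checks", {})
    icons = checks.get("icons_directory", {})
    metadata = checks.get("metadata", {})
    api = checks.get("api", {})

    if not icons.get("exists"):
        problems.append("icons directory missing")
    if not (metadata.get("exists") and metadata.get("valid_json")):
        problems.append("icons.json missing or invalid")
    if not api.get("functional"):
        problems.append("API check not functional")
    # Assumption: files on disk and icons served by the API should agree.
    if icons.get("icon_count") != api.get("icon_count"):
        problems.append("icon_count differs between directory and API")

    return (not problems, problems)
```

For the example response, `evaluate_health(SAMPLE)` reports no problems; a mismatch between the two counts would typically point at a stale `icons.json` or an incomplete extraction.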

2. Monitoring Script

#!/bin/bash
# scripts/monitor.sh

LOG_FILE="/var/log/icon-tool-monitor.log"

log() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG_FILE"
}

check_service() {
    if systemctl is-active --quiet icon-tool; then
        log "✓ Service is running"
        return 0
    else
        log "❌ Service is not running"
        return 1
    fi
}

check_health() {
    local response
    response=$(curl -s -w "%{http_code}" http://localhost:5000/health)
    local http_code="${response: -3}"
    local body="${response%???}"

    if [ "$http_code" = "200" ]; then
        local icon_count
        icon_count=$(echo "$body" | jq -r '.checks.api.icon_count // 0')
        log "✓ Health check passed - $icon_count icons available"
        return 0
    else
        log "❌ Health check failed - HTTP $http_code"
        return 1
    fi
}

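check_response_time() {
    local start_time end_time response_time
    start_time=$(date +%s%N)
    # -f makes curl fail on HTTP errors, so an unreachable or failing
    # service is not reported as a fast response
    if ! curl -sf http://localhost:5000/api/icons > /dev/null; then
        log "❌ Response time check failed - request error"
        return 1
    fi
    end_time=$(date +%s%N)

    response_time=$(( (end_time - start_time) / 1000000 ))

    if [ "$response_time" -lt 500 ]; then
        log "✓ Response time OK: ${response_time}ms"
        return 0
    else
        log "⚠️  Response time slow: ${response_time}ms"
        return 1
    fi
}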

check_disk_space() {
    local usage
    usage=$(df /opt/icon-tool | tail -1 | awk '{print $5}' | sed 's/%//')

    if [ "$usage" -lt 80 ]; then
        log "✓ Disk usage OK: ${usage}%"
        return 0
    else
        log "⚠️  Disk usage high: ${usage}%"
        return 1
    fi
}

# Main monitoring routine
main() {
    log "Starting health monitoring..."

    local failures=0

    check_service || ((failures++))
    check_health || ((failures++))
    check_response_time || ((failures++))
    check_disk_space || ((failures++))

    if [ $failures -eq 0 ]; then
        log "✅ All checks passed"
        exit 0
    else
        log "❌ $failures check(s) failed"
        exit 1
    fi
}

main "$@"
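
The aggregation pattern in `main` (run every check, count the failures, fail if any check failed) can be sketched in Python for teams that prefer it over bash. The names `run_checks` and `disk_usage_ok` are hypothetical. Note that `shutil.disk_usage` reports used/total, which can differ slightly from the `Use%` column of `df` because `df` excludes blocks reserved for root.

```python
import shutil


def disk_usage_ok(path, limit_percent=80):
    """True if the filesystem holding path is below limit_percent used."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total < limit_percent


def run_checks(checks):
    """Run (name, check) pairs; return the names of the checks that failed."""
    failed = []
    for name, check in checks:
        try:
            passed = bool(check())
        except Exception:
            passed = False  # a crashing check counts as a failure
        if not passed:
            failed.append(name)
    return failed
```

A caller would pass a list such as `[("disk", lambda: disk_usage_ok("/opt/icon-tool"))]` and exit non-zero when the returned list is non-empty, mirroring the exit codes of `monitor.sh`.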

3. Crontab Monitoring

# /etc/crontab - run monitoring every 5 minutes
*/5 * * * * root /opt/icon-tool/scripts/monitor.sh

# Alternative with e-mail alerting on failure (use this entry instead of the one above, not both)
*/5 * * * * root /opt/icon-tool/scripts/monitor.sh || echo "Icon Tool health check failed on $(hostname)" | mail -s "Icon Tool Alert" admin@ak-systems.com

4. Performance Metrics

# Memory usage: RSS in KB (the [p] keeps grep from matching its own process)
ps aux | grep "[p]ython app.py" | awk '{print $6}'

# Connection count
netstat -an | grep :5000 | wc -l

# Log analysis (same log file that the application logging below writes)
tail -f /opt/icon-tool/logs/icon-tool.log | grep "ERROR"

# Response time monitoring
curl -w "@curl-format.txt" -o /dev/null -s http://localhost:5000/api/icons

curl-format.txt:

     time_namelookup:  %{time_namelookup}s\n
        time_connect:  %{time_connect}s\n
     time_appconnect:  %{time_appconnect}s\n
    time_pretransfer:  %{time_pretransfer}s\n
       time_redirect:  %{time_redirect}s\n
  time_starttransfer:  %{time_starttransfer}s\n
                     ----------\n
          time_total:  %{time_total}s\n

Logging

1. Application Logging

# Enhanced logging in app.py
import logging
from logging.handlers import RotatingFileHandler
import os

from flask import request  # used by the request/response hooks below

# Configure logging
if not app.debug:
    if not os.path.exists('logs'):
        os.mkdir('logs')

    file_handler = RotatingFileHandler(
        'logs/icon-tool.log', 
        maxBytes=10240000,  # 10MB
        backupCount=10
    )
    file_handler.setFormatter(logging.Formatter(
        '%(asctime)s %(levelname)s: %(message)s [in %(pathname)s:%(lineno)d]'
    ))
    file_handler.setLevel(logging.INFO)
    app.logger.addHandler(file_handler)

    app.logger.setLevel(logging.INFO)
    app.logger.info('Icon Tool startup')

# Add request logging
@app.before_request
def log_request_info():
    app.logger.info('Request: %s %s', request.method, request.url)

@app.after_request
def log_response_info(response):
    app.logger.info('Response: %s %s', response.status_code, response.content_length or 0)
    return response

2. Log Rotation

# /etc/logrotate.d/icon-tool
/opt/icon-tool/logs/*.log {
    daily
    rotate 30
    compress
    delaycompress
    missingok
    notifempty
    create 644 iconuser icongroup
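    # copytruncate rotates without signalling the app: "systemctl reload" would
    # send SIGHUP, which stops "python app.py" unless it installs a handler.
    copytruncate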
}

3. Log Monitoring

# Error Detection
grep "ERROR" /opt/icon-tool/logs/icon-tool.log | tail -10

# Performance Issues
grep "slow response" /opt/icon-tool/logs/icon-tool.log

# Access Patterns
awk '{print $7}' /var/log/nginx/access.log | grep "/api/" | sort | uniq -c | sort -nr

# Real-time Monitoring
tail -f /opt/icon-tool/logs/icon-tool.log | grep -E "(ERROR|WARNING|slow)"
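
The `slow response` patterns above only match if the application writes such lines, and the logging setup in this runbook does not show that. A minimal sketch of a helper that would produce them follows; the function name, the message wording, and the 500 ms default (borrowed from `check_response_time` in the monitoring script) are assumptions.

```python
import logging

SLOW_THRESHOLD_MS = 500  # assumed: same limit as check_response_time in monitor.sh


def log_request_duration(logger, method, path, duration_ms,
                         threshold_ms=SLOW_THRESHOLD_MS):
    """Log a request; emit a WARNING containing 'slow response' above the threshold.

    Returns True if the request was classified as slow.
    """
    if duration_ms >= threshold_ms:
        logger.warning("slow response: %s %s took %.0fms", method, path, duration_ms)
        return True
    logger.info("%s %s took %.0fms", method, path, duration_ms)
    return False
```

In a Flask app this could be called from an `after_request` hook with a duration measured via `time.perf_counter()` in the matching `before_request` hook; with the formatter shown earlier, slow requests would then be logged at WARNING level and match both grep patterns.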

Backup Strategies

1. File System Backup

#!/bin/bash
# scripts/backup.sh

set -euo pipefail  # abort on errors so a failed backup never reaches the cleanup step

BACKUP_DIR="/backup/icon-tool"
APP_DIR="/opt/icon-tool"
DATE=$(date +%Y%m%d_%H%M%S)

# Create backup directory
mkdir -p "$BACKUP_DIR"

# Backup application files
tar -czf "$BACKUP_DIR/app-backup-$DATE.tar.gz" \
    -C "$APP_DIR" \
    --exclude="venv" \
    --exclude="__pycache__" \
    --exclude="*.pyc" \
    --exclude="logs" \
    .

# Backup icons and metadata separately
tar -czf "$BACKUP_DIR/icons-backup-$DATE.tar.gz" \
    -C "$APP_DIR" \
    static/icons/ \
    icons.json

# Cleanup old backups (keep last 30 days)
find "$BACKUP_DIR" -name "*.tar.gz" -mtime +30 -delete

echo "Backup completed: $DATE"
ls -lh "$BACKUP_DIR"/*-$DATE.tar.gz
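
A backup is only useful if it can be read back. The sketch below opens an archive produced by the script above and reports which expected files are missing; `verify_backup` and the list of required members are assumptions, chosen because `app.py` and `icons.json` sit at the root of the application directory.

```python
import tarfile

REQUIRED_MEMBERS = ("app.py", "icons.json")  # assumed to sit at the archive root


def verify_backup(archive_path, required=REQUIRED_MEMBERS):
    """Return the required members missing from a .tar.gz backup (empty list = OK).

    Raises tarfile.ReadError if the archive is corrupt or not a gzip tarball.
    """
    with tarfile.open(archive_path, "r:gz") as archive:
        # backup.sh archives ".", so member names look like "./app.py".
        names = {name.removeprefix("./") for name in archive.getnames()}
    return [member for member in required if member not in names]
```

Running this against the newest `app-backup-*.tar.gz` after each backup would catch truncated or empty archives before they are needed for a restore.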

2. Automated Backup

# Crontab entry for daily backups
0 2 * * * /opt/icon-tool/scripts/backup.sh >> /var/log/icon-tool-backup.log 2>&1

# Weekly full system backup (% must be escaped as \% inside crontab commands)
0 3 * * 0 rsync -av /opt/icon-tool/ backup-server:/backups/icon-tool/$(date +\%Y\%m\%d)/

3. Restoration Procedures

#!/bin/bash
# scripts/restore.sh

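BACKUP_FILE="${1:-}"
RESTORE_DIR="/opt/icon-tool"

if [ -z "$BACKUP_FILE" ]; then
    echo "Usage: $0 <backup-file.tar.gz>"
    exit 1
fi

# Refuse to stop the service for a backup that does not exist
if [ ! -f "$BACKUP_FILE" ]; then
    echo "❌ Backup file not found: $BACKUP_FILE"
    exit 1
fi

# Stop service
systemctl stop icon-tool

# Create restoration backup (-a preserves ownership and permissions)
cp -a "$RESTORE_DIR" "$RESTORE_DIR.pre-restore-$(date +%Y%m%d_%H%M%S)"

# Extract backup
tar -xzf "$BACKUP_FILE" -C "$RESTORE_DIR" || { echo "❌ Extraction failed"; exit 1; }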

# Validate restoration
if [ -f "$RESTORE_DIR/app.py" ] && [ -d "$RESTORE_DIR/static/icons" ]; then
    echo "✓ Restoration appears successful"

    # Restart service
    systemctl start icon-tool

    # Health check
    sleep 5
    if curl -f http://localhost:5000/health; then
        echo "✅ Service restored and healthy"
    else
        echo "❌ Service started but health check failed"
        exit 1
    fi
else
    echo "❌ Restoration validation failed"
    exit 1
fi
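
The restore script expects an explicit archive path. Picking the newest archive can be sketched as follows; `latest_backup` is a hypothetical helper that relies on the `%Y%m%d_%H%M%S` timestamp that `backup.sh` embeds in file names.

```python
from pathlib import Path


def latest_backup(backup_dir, prefix="app-backup-"):
    """Return the newest matching archive in backup_dir, or None if there is none."""
    candidates = sorted(Path(backup_dir).glob(f"{prefix}*.tar.gz"))
    # Names embed %Y%m%d_%H%M%S, so lexicographic order is chronological order.
    return candidates[-1] if candidates else None
```

Sorting by name rather than by modification time keeps the result stable even if archives were copied between servers and lost their original timestamps.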

Scaling & Performance

1. Horizontal Scaling

# Load Balancer Configuration (nginx)
upstream icon_tool {
    server 127.0.0.1:5001;
    server 127.0.0.1:5002;
    server 127.0.0.1:5003;
}

server {
    listen 80;
    server_name icons.ak-systems.com;

    location / {
        proxy_pass http://icon_tool;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }

    location /static/ {
        alias /opt/icon-tool/static/;
        expires 1y;
        add_header Cache-Control "public, immutable";
    }
}

2. Performance Optimization

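# Application-level caching
from functools import lru_cache
import os
import time

# Cache the icon list until icons.json changes (lru_cache itself has no expiry)
@lru_cache(maxsize=1)
def get_cached_icon_list():
    return get_icon_list()  # get_icon_list() is assumed to be defined in app.py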

# Cache invalidation
_cache_timestamp = time.time()

def invalidate_cache_if_needed():
    global _cache_timestamp
    metadata_mtime = os.path.getmtime('icons.json')

    if metadata_mtime > _cache_timestamp:
        get_cached_icon_list.cache_clear()
        _cache_timestamp = time.time()

@app.before_request
def check_cache():
    invalidate_cache_if_needed()
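
`functools.lru_cache` has no time-based expiry, so the code above refreshes only when `icons.json` changes. If a fixed lifetime such as 5 minutes is also wanted, one option is a small cache object that combines both rules. This is a sketch under assumptions: the class name, the `loader` callable, and the injectable `clock` are illustrative and not part of the application.

```python
import os
import time


class IconListCache:
    """Cache a loader's result until a TTL expires or the metadata file changes."""

    def __init__(self, loader, metadata_path, ttl_seconds=300, clock=time.monotonic):
        self._loader = loader              # callable that builds the icon list
        self._metadata_path = metadata_path
        self._ttl = ttl_seconds
        self._clock = clock                # injectable so tests can control time
        self._value = None
        self._loaded_at = None
        self._mtime = None

    def get(self):
        now = self._clock()
        mtime = os.path.getmtime(self._metadata_path)
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        # Reload when the TTL has passed or the file's mtime differs from
        # the one recorded at load time.
        if expired or mtime != self._mtime:
            self._value = self._loader()
            self._loaded_at = now
            self._mtime = mtime
        return self._value
```

Comparing the stored mtime for inequality, rather than comparing it with a wall-clock timestamp, also picks up a restored `icons.json` whose modification time lies in the past.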

3. Resource Monitoring

#!/bin/bash
# scripts/resource-monitor.sh
# System resource usage

echo "=== Icon Tool Resource Usage ==="
echo "Date: $(date)"

# Memory Usage (the [p] keeps grep from matching its own process)
echo -e "\n📊 Memory Usage:"
ps aux | grep "[p]ython app.py" | awk '{sum += $6} END {print "Total RSS: " sum/1024 " MB"}'

# CPU Usage
echo -e "\n🔄 CPU Usage:"
ps aux | grep "[p]ython app.py" | awk '{sum += $3} END {print "Total CPU: " sum "%"}'

# Disk Usage
echo -e "\n💾 Disk Usage:"
du -sh /opt/icon-tool/static/icons/
du -sh /opt/icon-tool/logs/

# Network Connections
echo -e "\n🌐 Network Connections:"
netstat -an | grep :5000 | grep ESTABLISHED | wc -l

# File Descriptor Usage (-d, joins multiple PIDs into the comma-separated list lsof expects)
echo -e "\n📁 File Descriptors:"
lsof -p "$(pgrep -d, -f "python app.py")" | wc -l

Security Operations

1. Access Control

# Service user and group (non-root); create these before assigning ownership
groupadd -r icongroup
useradd -r -g icongroup -s /bin/false -d /opt/icon-tool iconuser

# File permissions
chown -R iconuser:icongroup /opt/icon-tool/
chmod 750 /opt/icon-tool/
chmod 640 /opt/icon-tool/icons.json
chmod 644 /opt/icon-tool/static/icons/*.svg

2. Security Monitoring

# Log security events (field 9 is the status code in nginx's default combined log format)
awk '$9 ~ /^(401|403|404)$/' /var/log/nginx/access.log | tail -20

# Monitor for suspicious access patterns
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -10

# Check for failed authentication attempts
grep "authentication failed" /opt/icon-tool/logs/icon-tool.log

3. SSL/TLS Management

# Certificate expiry check
openssl x509 -in /etc/ssl/certs/icon-tool.crt -text -noout | grep "Not After"

# Test auto-renewal (Let's Encrypt); --dry-run exercises the renewal without saving new certificates
certbot renew --dry-run
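
The `Not After` value printed by `openssl` can be turned into a number of remaining days for alerting. The sketch below uses `ssl.cert_time_to_seconds` from the Python standard library, which parses exactly this date format; the function name `days_until_expiry` and the example date are illustrative.

```python
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days until a certificate's notAfter date, e.g. 'Jun  1 12:00:00 2030 GMT'.

    not_after is the text after 'Not After :' in the openssl output.
    now is a Unix timestamp; it defaults to the current time.
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires_at - now) / 86400
```

A monitoring job could warn when the result drops below a chosen margin, for example 14 days; a negative result means the certificate has already expired.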

Incident Response

1. Service Down

# Immediate Response Checklist
□ Check service status: systemctl status icon-tool
□ Check system resources: free -h, df -h
□ Check logs: journalctl -u icon-tool --since="10 minutes ago"
□ Restart service: systemctl restart icon-tool
□ Validate health: curl -f http://localhost:5000/health
□ Monitor for stability: watch -n 10 'curl -s http://localhost:5000/health'

2. Performance Degradation

# Performance Investigation
□ Check response times: curl -w "@curl-format.txt" http://localhost:5000/api/icons
□ Monitor system load: uptime, iostat -x 1
□ Check memory usage: ps aux | grep python
□ Analyze slow requests: grep "slow" /opt/icon-tool/logs/icon-tool.log
□ Clear cache if needed: systemctl restart icon-tool

3. Data Corruption

# Data Recovery Steps
□ Stop service immediately: systemctl stop icon-tool
□ Backup current state: cp -r /opt/icon-tool /opt/icon-tool.corrupt-$(date +%Y%m%d_%H%M%S)
□ Validate icons.json: python -c "import json; json.load(open('icons.json'))"
□ Restore from backup: scripts/restore.sh latest-backup.tar.gz
□ Re-extract icons if needed: node extract-icons.js
□ Validate restoration: curl -f http://localhost:5000/health
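
The checklist validates `icons.json` but not the icon files themselves. A minimal sketch for spotting damaged SVGs follows; `find_corrupt_svgs` is a hypothetical helper that only checks for well-formed XML with an `<svg>` root element, not whether the drawing content is correct.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_corrupt_svgs(icons_dir):
    """Return names of .svg files that are not well-formed XML with an <svg> root."""
    corrupt = []
    for path in sorted(Path(icons_dir).glob("*.svg")):
        try:
            root = ET.parse(str(path)).getroot()
        except ET.ParseError:
            corrupt.append(path.name)  # truncated, empty, or otherwise malformed
            continue
        # Tags may be namespaced: "{http://www.w3.org/2000/svg}svg".
        if root.tag.rsplit("}", 1)[-1] != "svg":
            corrupt.append(path.name)
    return corrupt
```

Running this on `/opt/icon-tool/static/icons` before deciding between a full restore and re-extraction shows whether the damage is limited to a few files.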

Maintenance Windows

1. Planned Maintenance

#!/bin/bash
# scripts/maintenance.sh
# Maintenance procedure template

echo "Starting maintenance window: $(date)"

# 1. Notification
echo "Service entering maintenance mode..."

# 2. Health check before maintenance
curl -f http://localhost:5000/health || exit 1

# 3. Create backup (abort the maintenance if it fails)
/opt/icon-tool/scripts/backup.sh || exit 1

# 4. Stop service
systemctl stop icon-tool

# 5. Perform maintenance tasks
# - Update dependencies
# - Rotate logs
# - Clean temporary files
# - Update application code

# 6. Start service
systemctl start icon-tool

# 7. Validation
sleep 10
curl -f http://localhost:5000/health || { echo "Health check failed after maintenance"; exit 1; }

# 8. Monitoring
echo "Maintenance completed: $(date)"
echo "Monitor service for next 30 minutes"

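2. Updates & Patches

# Application update procedure
cd /opt/icon-tool
git pull origin main
npm ci
venv/bin/pip install -r requirements.txt  # install into the service's virtualenv
node extract-icons.js
systemctl restart icon-tool

# System updates
apt update && apt upgrade -y
# Reboot only if an update requests it, e.g. a new kernel (Debian/Ubuntu flag file)
[ -f /var/run/reboot-required ] && reboot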

SLO/SLA Monitoring

Service Level Objectives

Metric	Target	Measurement
Uptime	99.0%	Monthly
Response Time	<200ms	95th percentile
Error Rate	<1%	Weekly
MTTR	<15 minutes	Per incident
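
The uptime target translates into a concrete downtime budget. As a worked example, assuming a 31-day month of 744 hours, a 99.0% target allows 7.44 hours of downtime. The helper names below are illustrative.

```python
def allowed_downtime_hours(slo_percent, period_hours=744):
    """Downtime budget for an availability SLO over a period (744 h = 31 days)."""
    return period_hours * (100 - slo_percent) / 100


def availability_percent(downtime_hours, period_hours=744):
    """Availability actually achieved, given the observed downtime."""
    return 100 * (period_hours - downtime_hours) / period_hours
```

With the MTTR target of 15 minutes per incident, this budget covers roughly 29 incidents per month (7.44 h / 0.25 h) before the uptime objective is missed.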

Monitoring Dashboard

#!/bin/bash
# scripts/slo-report.sh
# SLO reporting script

echo "=== SLO Report for $(date +%Y-%m) ==="

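# Uptime estimate: hours since the last service start relative to a 31-day month.
# This ignores earlier runs within the month and is capped at 100%.
start_seconds=$(date -d "$(systemctl show icon-tool --property=ActiveEnterTimestamp --value)" +%s)
current_seconds=$(date +%s)
uptime_hours=$(( (current_seconds - start_seconds) / 3600 ))
[ "$uptime_hours" -gt 744 ] && uptime_hours=744  # 744 hours in a 31-day month
uptime_percentage=$(echo "scale=2; $uptime_hours * 100 / 744" | bc)  # multiply first to keep precision

echo "Uptime: $uptime_percentage%"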

# Response time analysis
echo "Response Times:"
for i in {1..10}; do
    curl -w "%{time_total}" -o /dev/null -s http://localhost:5000/api/icons
    echo ""
done | awk '{sum+=$1; count++} END {print "Average: " sum/count "s"}'

# Error rate from logs
error_count=$(grep -c "ERROR" /opt/icon-tool/logs/icon-tool.log)
total_requests=$(grep -c "Request:" /opt/icon-tool/logs/icon-tool.log)
if [ "$total_requests" -gt 0 ]; then
    error_rate=$(echo "scale=4; $error_count * 100 / $total_requests" | bc)  # multiply first to keep precision
else
    error_rate=0  # no requests logged; avoid division by zero
fi

echo "Error Rate: $error_rate%"
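
The response time objective is stated as a 95th percentile, while the script above reports an average, which can hide slow outliers. A sketch of a nearest-rank percentile follows; the function names are illustrative, and the sample values in the usage note are invented for demonstration, not measurements.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    # Multiply before dividing so whole-number inputs stay exact.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_slo(samples_ms, target_ms=200, pct=95):
    """True if the pct-th percentile of the samples is below target_ms."""
    return percentile(samples_ms, pct) < target_ms
```

For ten made-up samples of 110 to 190 ms plus one 450 ms outlier, the average is 180 ms and looks compliant, while the 95th percentile is 450 ms and violates the 200 ms target. Ten samples are also too few for a stable percentile, so a real report would draw on logged request durations.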

Evidence: Operations best practices, SLO/SLA standards

Operations Checklist:
- [ ] Service management configured (systemd/Docker)
- [ ] Health monitoring implemented
- [ ] Logging and log rotation enabled
- [ ] Backup strategy implemented and tested
- [ ] Performance monitoring set up
- [ ] Incident response procedures documented
- [ ] SLO/SLA monitoring enabled