Operations Runbook¶
Audience: Ops
You will learn:
- Operating the icon management tool in production
- Start/stop procedures and service management
- Monitoring, logging, and health checks
- Backup strategies and disaster recovery
Pre-requisites:
- Understanding of the development environment
- Admin access to the production system
- Basic knowledge of system administration
Service Overview¶
System Architecture¶
graph TB
A[Load Balancer] --> B[Flask App]
B --> C[File System]
B --> D[Static Files]
E[Git Repository] --> F[CI/CD Pipeline]
F --> B
G[Monitoring] --> B
H[Backup System] --> C
H --> D
Production Services:
- Web Application: Flask on port 5000
- Static File Server: Nginx/Apache for /static/ (optional)
- Monitoring: health checks and metrics
- Backup: automated data backups
Evidence: app.py:66-67, production architecture patterns
Service Dependencies¶
Service | Dependencies | Critical Path |
---|---|---|
Flask App | Python 3.11+, Flask | Yes |
Icon Files | File System, extracted SVGs | Yes |
Metadata | icons.json file | Yes |
Git Repository | Version control, CI/CD | No (runtime) |
Monitoring | Health endpoint | No |
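The critical-path dependencies in the table above can be verified before the service is started. The following is a minimal sketch, not part of the tool itself: the function name `preflight`, its parameters, and the assumed layout (`static/icons/*.svg` and `icons.json` directly under the application directory, as used elsewhere in this runbook) are illustrative assumptions.

```python
import importlib.util
import json
import sys
from pathlib import Path


def preflight(app_dir, min_python=(3, 11), required_modules=("flask",)):
    """Return a list of human-readable problems; an empty list means ready."""
    app_dir = Path(app_dir)
    problems = []

    # Interpreter version (the table lists Python 3.11+).
    if sys.version_info < min_python:
        problems.append(f"Python {min_python[0]}.{min_python[1]}+ required")

    # Importable modules (Flask by default).
    for name in required_modules:
        if importlib.util.find_spec(name) is None:
            problems.append(f"module not installed: {name}")

    # Extracted SVG icons on the file system.
    icons_dir = app_dir / "static" / "icons"
    if not icons_dir.is_dir():
        problems.append(f"missing icons directory: {icons_dir}")
    elif not any(icons_dir.glob("*.svg")):
        problems.append(f"no extracted SVGs in {icons_dir}")

    # Metadata file must exist and parse as JSON.
    metadata = app_dir / "icons.json"
    try:
        json.loads(metadata.read_text(encoding="utf-8"))
    except FileNotFoundError:
        problems.append(f"missing metadata file: {metadata}")
    except json.JSONDecodeError as exc:
        problems.append(f"invalid JSON in {metadata}: {exc}")

    return problems
```

Calling `preflight("/opt/icon-tool")` before `systemctl start icon-tool` would surface a missing icon extraction or a broken `icons.json` without having to read the service logs.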
Start/Stop Procedures¶
1. Service Start¶
Systemd Service (Linux)¶
# /etc/systemd/system/icon-tool.service
[Unit]
Description=ak Systems Icon Management Tool
After=network.target
[Service]
Type=simple
User=iconuser
Group=icongroup
WorkingDirectory=/opt/icon-tool
Environment=FLASK_ENV=production
Environment=PYTHONPATH=/opt/icon-tool
ExecStart=/opt/icon-tool/venv/bin/python app.py
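# No ExecReload: unless app.py installs its own SIGHUP handler, the Flask built-in server
# is terminated by SIGHUP, and systemd treats that as a clean exit (no restart with on-failure).
# Use "systemctl restart icon-tool" instead of reload.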
KillMode=mixed
TimeoutStopSec=5
PrivateTmp=true
Restart=on-failure
RestartSec=10
[Install]
WantedBy=multi-user.target
Service management:
# Start the service
sudo systemctl start icon-tool
sudo systemctl enable icon-tool # Start automatically at boot
# Check status
sudo systemctl status icon-tool
# Stop the service
sudo systemctl stop icon-tool
# Show logs
sudo journalctl -u icon-tool -f
Docker Container¶
# Start with Docker
docker run -d \
--name icon-tool \
--restart unless-stopped \
-p 5000:5000 \
-v /opt/icon-data:/app/static \
-e FLASK_ENV=production \
icon-tool:latest
# Container-Management
docker start icon-tool
docker stop icon-tool
docker logs -f icon-tool
Manual Start (Development/Testing)¶
# Prepare the environment
cd /opt/icon-tool
source venv/bin/activate
export FLASK_ENV=production
# Check dependencies
python -c "import flask; print('Flask OK')"
ls static/icons/*.svg | wc -l # Icon count
# Start the service
python app.py
# Run as a background process
nohup python app.py > app.log 2>&1 &
echo $! > app.pid
Expected output:
* Serving Flask app 'app'
* Environment: production
* Debug mode: off
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:5000
* Running on http://[your-ip]:5000
Evidence: app.py:66-67 Flask startup
2. Service Stop¶
# Graceful Stop (systemd)
sudo systemctl stop icon-tool
# Stop hung processes by name (pkill sends SIGTERM; add -9 only as a last resort)
sudo pkill -f "python app.py"
# Docker Stop
docker stop icon-tool
# Manual Stop
if [ -f app.pid ]; then
kill $(cat app.pid)
rm app.pid
fi
3. Service Restart¶
# Systemd Restart
sudo systemctl restart icon-tool
# Validated restart (a single instance is briefly unavailable while it restarts)
# 1. Health check before restart
curl -f http://localhost:5000/health
# 2. Graceful restart with validation
sudo systemctl restart icon-tool
sleep 5
curl -f http://localhost:5000/health || sudo systemctl restart icon-tool
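A fixed `sleep 5` can be too short on a slow start and needlessly long on a fast one. As a hedged alternative, the sketch below polls the health endpoint until it answers or a retry budget is used up. The names `http_probe` and `wait_until_healthy` and their default values are illustrative assumptions, not part of the tool.

```python
import http.client
import time
import urllib.request


def http_probe(url, timeout=2.0):
    """Return True if the URL answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        return False


def wait_until_healthy(probe, attempts=10, delay=1.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

After a restart, something like `wait_until_healthy(lambda: http_probe("http://localhost:5000/health"))` returns as soon as the service responds, and returns `False` if it never does within the retry budget.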
Health Monitoring¶
1. Health Endpoint¶
# Basic Health Check
curl -f http://localhost:5000/health
# Expected Response:
{
"status": "healthy",
"timestamp": 1692874800.123,
"checks": {
"icons_directory": {
"exists": true,
"icon_count": 162
},
"metadata": {
"exists": true,
"valid_json": true
},
"api": {
"functional": true,
"icon_count": 162,
"category_count": 20
}
}
}
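A monitoring job can check more than the HTTP status by inspecting the response body. The sketch below evaluates a payload with the shape shown above. The function name `health_problems`, the `min_icons` parameter, and the rule that the directory and API icon counts should match are assumptions made for illustration.

```python
def health_problems(payload, min_icons=1):
    """Return a list of problems found in a /health payload (empty list = OK)."""
    problems = []

    status = payload.get("status")
    if status != "healthy":
        problems.append(f"status is {status!r}")

    checks = payload.get("checks", {})
    directory = checks.get("icons_directory", {})
    metadata = checks.get("metadata", {})
    api = checks.get("api", {})

    if not directory.get("exists"):
        problems.append("icons directory missing")
    if not (metadata.get("exists") and metadata.get("valid_json")):
        problems.append("metadata missing or not valid JSON")
    if not api.get("functional"):
        problems.append("API check reports non-functional")
    if api.get("icon_count", 0) < min_icons:
        problems.append("API returns fewer icons than expected")
    # Assumption: both counts describe the same icon set and should agree.
    if directory.get("icon_count") != api.get("icon_count"):
        problems.append("icon count differs between directory and API")

    return problems
```

The body returned by `curl` can be parsed with `json.loads` and passed to `health_problems`; any returned entry is a reason to alert even when the endpoint answered with HTTP 200.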
2. Monitoring Script¶
#!/bin/bash
# scripts/monitor.sh
LOG_FILE="/var/log/icon-tool-monitor.log"
log() {
echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG_FILE"
}
check_service() {
if systemctl is-active --quiet icon-tool; then
log "✓ Service is running"
return 0
else
log "❌ Service is not running"
return 1
fi
}
check_health() {
local response
response=$(curl -s -w "%{http_code}" http://localhost:5000/health)
local http_code="${response: -3}"
local body="${response%???}"
if [ "$http_code" = "200" ]; then
local icon_count
icon_count=$(echo "$body" | jq -r '.checks.api.icon_count // 0')
log "✓ Health check passed - $icon_count icons available"
return 0
else
log "❌ Health check failed - HTTP $http_code"
return 1
fi
}
check_response_time() {
local start_time end_time response_time
start_time=$(date +%s%N)
# -f makes curl fail on HTTP errors, so a down or failing service is not reported as fast
if ! curl -sf http://localhost:5000/api/icons > /dev/null; then
log "❌ Response time check failed - request did not succeed"
return 1
fi
end_time=$(date +%s%N)
response_time=$(( (end_time - start_time) / 1000000 ))
if [ "$response_time" -lt 500 ]; then
log "✓ Response time OK: ${response_time}ms"
return 0
else
log "⚠️ Response time slow: ${response_time}ms"
return 1
fi
}
check_disk_space() {
local usage
usage=$(df /opt/icon-tool | tail -1 | awk '{print $5}' | sed 's/%//')
if [ "$usage" -lt 80 ]; then
log "✓ Disk usage OK: ${usage}%"
return 0
else
log "⚠️ Disk usage high: ${usage}%"
return 1
fi
}
# Main monitoring routine
main() {
log "Starting health monitoring..."
local failures=0
check_service || ((failures++))
check_health || ((failures++))
check_response_time || ((failures++))
check_disk_space || ((failures++))
if [ $failures -eq 0 ]; then
log "✅ All checks passed"
exit 0
else
log "❌ $failures check(s) failed"
exit 1
fi
}
main "$@"
3. Crontab Monitoring¶
# /etc/crontab - run monitoring every 5 minutes (use only one of the two entries below)
*/5 * * * * root /opt/icon-tool/scripts/monitor.sh
# Alternative: same schedule, with an e-mail alert on failure
*/5 * * * * root /opt/icon-tool/scripts/monitor.sh || echo "Icon Tool health check failed on $(hostname)" | mail -s "Icon Tool Alert" admin@ak-systems.com
4. Performance Metrics¶
# Memory Usage Monitoring ([p] keeps grep from matching its own process)
ps aux | grep "[p]ython app.py" | awk '{print $6}' # RSS memory in KB
# Connection Count
netstat -an | grep :5000 | wc -l
# Log Analysis
tail -f /opt/icon-tool/logs/icon-tool.log | grep "ERROR"
# Response Time Monitoring
curl -w "@curl-format.txt" -o /dev/null -s http://localhost:5000/api/icons
curl-format.txt:
time_namelookup: %{time_namelookup}s\n
time_connect: %{time_connect}s\n
time_appconnect: %{time_appconnect}s\n
time_pretransfer: %{time_pretransfer}s\n
time_redirect: %{time_redirect}s\n
time_starttransfer: %{time_starttransfer}s\n
----------\n
time_total: %{time_total}s\n
Logging¶
1. Application Logging¶
# Enhanced logging in app.py
import logging
import os
from logging.handlers import RotatingFileHandler

from flask import request

# Configure file logging outside debug mode
if not app.debug:
    os.makedirs('logs', exist_ok=True)
    file_handler = RotatingFileHandler(
        'logs/icon-tool.log',
        maxBytes=10240000,  # about 10 MB
        backupCount=10
    )
    file_handler.setFormatter(logging.Formatter(
        '%(asctime)s %(levelname)s: %(message)s [in %(pathname)s:%(lineno)d]'
    ))
    file_handler.setLevel(logging.INFO)
    app.logger.addHandler(file_handler)
    app.logger.setLevel(logging.INFO)
    app.logger.info('Icon Tool startup')

# Add request logging
@app.before_request
def log_request_info():
    app.logger.info('Request: %s %s', request.method, request.url)

@app.after_request
def log_response_info(response):
    app.logger.info('Response: %s %s', response.status_code, response.content_length or 0)
    return response
2. Log Rotation¶
# /etc/logrotate.d/icon-tool
/opt/icon-tool/logs/*.log {
daily
rotate 30
compress
delaycompress
missingok
notifempty
# The application keeps its log file open and does not reopen it on a signal,
# so truncate the file in place instead of moving it and reloading the service.
copytruncate
}
3. Log Monitoring¶
# Error Detection
grep "ERROR" /opt/icon-tool/logs/icon-tool.log | tail -10
# Performance Issues
grep "slow response" /opt/icon-tool/logs/icon-tool.log
# Access Patterns
awk '{print $7}' /var/log/nginx/access.log | grep "/api/" | sort | uniq -c | sort -nr
# Real-time Monitoring
tail -f /opt/icon-tool/logs/icon-tool.log | grep -E "(ERROR|WARNING|slow)"
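The `grep "slow response"` check above only finds something if the application actually writes such lines. One way to produce them is a small WSGI middleware, sketched here using only the standard library. The class name `SlowRequestLogger`, the logger name, and the 500 ms default (borrowed from the monitoring script's threshold) are assumptions. The timing covers the call into the application, not the streaming of a response body.

```python
import logging
import time


class SlowRequestLogger:
    """WSGI middleware that logs a 'slow response' warning above a threshold."""

    def __init__(self, wsgi_app, threshold_ms=500.0, logger=None,
                 clock=time.perf_counter):
        self.wsgi_app = wsgi_app
        self.threshold_ms = threshold_ms
        self.logger = logger if logger is not None else logging.getLogger("icon-tool.slow")
        self.clock = clock

    def __call__(self, environ, start_response):
        started = self.clock()
        try:
            # Delegate to the wrapped application unchanged.
            return self.wsgi_app(environ, start_response)
        finally:
            elapsed_ms = (self.clock() - started) * 1000
            if elapsed_ms >= self.threshold_ms:
                self.logger.warning(
                    "slow response: %s %s took %.0fms",
                    environ.get("REQUEST_METHOD"),
                    environ.get("PATH_INFO"),
                    elapsed_ms,
                )
```

With Flask, a middleware like this is typically attached via `app.wsgi_app = SlowRequestLogger(app.wsgi_app, logger=app.logger)`, so the warnings end up in the same log file the commands above read.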
Backup Strategies¶
1. File System Backup¶
#!/bin/bash
# scripts/backup.sh
BACKUP_DIR="/backup/icon-tool"
APP_DIR="/opt/icon-tool"
DATE=$(date +%Y%m%d_%H%M%S)
# Create backup directory
mkdir -p "$BACKUP_DIR"
# Backup application files
tar -czf "$BACKUP_DIR/app-backup-$DATE.tar.gz" \
-C "$APP_DIR" \
--exclude="venv" \
--exclude="__pycache__" \
--exclude="*.pyc" \
--exclude="logs" \
.
# Backup icons and metadata separately
tar -czf "$BACKUP_DIR/icons-backup-$DATE.tar.gz" \
-C "$APP_DIR" \
static/icons/ \
icons.json
# Cleanup old backups (keep last 30 days)
find "$BACKUP_DIR" -name "*.tar.gz" -mtime +30 -delete
echo "Backup completed: $DATE"
ls -lh "$BACKUP_DIR"/*-$DATE.tar.gz
2. Automated Backup¶
# Crontab entry for daily backups
0 2 * * * /opt/icon-tool/scripts/backup.sh >> /var/log/icon-tool-backup.log 2>&1
# Weekly full system backup (% must be escaped as \% inside a crontab entry)
0 3 * * 0 rsync -av /opt/icon-tool/ backup-server:/backups/icon-tool/$(date +\%Y\%m\%d)/
3. Restoration Procedures¶
#!/bin/bash
# scripts/restore.sh
BACKUP_FILE="$1"
RESTORE_DIR="/opt/icon-tool"
if [ -z "$BACKUP_FILE" ] || [ ! -f "$BACKUP_FILE" ]; then
echo "Usage: $0 <backup-file.tar.gz>"
exit 1
fi
# Stop service
systemctl stop icon-tool
# Create restoration backup
cp -r "$RESTORE_DIR" "$RESTORE_DIR.pre-restore-$(date +%Y%m%d_%H%M%S)"
# Extract backup
tar -xzf "$BACKUP_FILE" -C "$RESTORE_DIR"
# Validate restoration
if [ -f "$RESTORE_DIR/app.py" ] && [ -d "$RESTORE_DIR/static/icons" ]; then
echo "✓ Restoration appears successful"
# Restart service
systemctl start icon-tool
# Health check
sleep 5
if curl -f http://localhost:5000/health; then
echo "✅ Service restored and healthy"
else
echo "❌ Service started but health check failed"
exit 1
fi
else
echo "❌ Restoration validation failed"
exit 1
fi
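The restore script stops the service before it knows whether the archive is usable. A pre-check of the archive contents avoids taking the service down for a backup that cannot pass the validation step. This is a minimal sketch: the function name `missing_from_backup` is hypothetical, and the required entries mirror the `app.py` and `static/icons` checks the script performs after extraction.

```python
import tarfile

# Entries the restore script expects to find after extraction.
REQUIRED_MEMBERS = ("app.py", "static/icons")


def missing_from_backup(archive_path, required=REQUIRED_MEMBERS):
    """Return the required entries that are absent from a .tar.gz backup."""
    with tarfile.open(archive_path, "r:gz") as archive:
        # Archives created with `tar -C dir .` store names as "./app.py".
        names = {name.removeprefix("./").rstrip("/") for name in archive.getnames()}
    return [item for item in required if item not in names]
```

An empty result means the archive at least contains what the restore validation looks for. The icons-only archive written by the backup script would correctly be reported as lacking `app.py`.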
Scaling & Performance¶
1. Horizontal Scaling¶
# Load Balancer Configuration (nginx); assumes three app instances listening on ports 5001-5003 rather than the default 5000
upstream icon_tool {
server 127.0.0.1:5001;
server 127.0.0.1:5002;
server 127.0.0.1:5003;
}
server {
listen 80;
server_name icons.ak-systems.com;
location / {
proxy_pass http://icon_tool;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
}
location /static/ {
alias /opt/icon-tool/static/;
expires 1y;
add_header Cache-Control "public, immutable";
}
}
2. Performance Optimization¶
# Application-level caching
import os
import time
from functools import lru_cache

# Cache the icon list until icons.json changes (lru_cache itself has no time-based expiry)
@lru_cache(maxsize=1)
def get_cached_icon_list():
    return get_icon_list()

# Cache invalidation: remember when the cache was last known to be fresh
_cache_timestamp = time.time()

def invalidate_cache_if_needed():
    global _cache_timestamp
    metadata_mtime = os.path.getmtime('icons.json')
    if metadata_mtime > _cache_timestamp:
        get_cached_icon_list.cache_clear()
        _cache_timestamp = time.time()

@app.before_request
def check_cache():
    invalidate_cache_if_needed()
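If a time-based expiry is wanted in addition to the modification-time check, `functools.lru_cache` alone is not enough because it never expires entries. The sketch below combines both conditions in a small class. `IconListCache` and its parameters are hypothetical names, and the 300-second default is an assumed five-minute lifetime. The loader would be the application's own `get_icon_list`.

```python
import os
import time


class IconListCache:
    """Cache one value; reload when the TTL expires or the metadata file changes."""

    def __init__(self, loader, metadata_path, ttl_seconds=300.0,
                 clock=time.monotonic, mtime=os.path.getmtime):
        self._loader = loader
        self._metadata_path = metadata_path
        self._ttl = ttl_seconds
        self._clock = clock
        self._mtime = mtime
        self._value = None
        self._loaded_at = None
        self._seen_mtime = None

    def get(self):
        now = self._clock()
        current_mtime = self._mtime(self._metadata_path)
        expired = self._loaded_at is None or now - self._loaded_at >= self._ttl
        changed = current_mtime != self._seen_mtime
        if expired or changed:
            # Reload and remember when, and for which file version, we loaded.
            self._value = self._loader()
            self._loaded_at = now
            self._seen_mtime = current_mtime
        return self._value
```

A module-level `cache = IconListCache(get_icon_list, "icons.json")` with `cache.get()` in the request handler would remove the need for a separate `before_request` hook, at the cost of one `stat` call per lookup.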
3. Resource Monitoring¶
#!/bin/bash
# scripts/resource-monitor.sh
# System Resource Usage
echo "=== Icon Tool Resource Usage ==="
echo "Date: $(date)"
# Memory Usage ([p] keeps grep from matching its own process)
echo -e "\n📊 Memory Usage:"
ps aux | grep "[p]ython app.py" | awk '{sum += $6} END {print "Total RSS: " sum/1024 " MB"}'
# CPU Usage
echo -e "\n🔄 CPU Usage:"
ps aux | grep "[p]ython app.py" | awk '{sum += $3} END {print "Total CPU: " sum "%"}'
# Disk Usage
echo -e "\n💾 Disk Usage:"
du -sh /opt/icon-tool/static/icons/
du -sh /opt/icon-tool/logs/
# Network Connections
echo -e "\n🌐 Network Connections:"
netstat -an | grep :5000 | grep ESTABLISHED | wc -l
# File Descriptor Usage (-d, joins multiple PIDs with commas, as lsof -p expects)
echo -e "\n📁 File Descriptors:"
lsof -p "$(pgrep -d, -f "python app.py")" | wc -l
Security Operations¶
1. Access Control¶
# Service user and group (non-root); create these before changing ownership
groupadd -r icongroup
useradd -r -s /bin/false -d /opt/icon-tool -g icongroup iconuser
# File ownership and permissions
chown -R iconuser:icongroup /opt/icon-tool/
chmod 750 /opt/icon-tool/
chmod 640 /opt/icon-tool/icons.json
chmod 644 /opt/icon-tool/static/icons/*.svg
2. Security Monitoring¶
# Log Security Events (match the status field only; assumes the default combined log format)
awk '$9 ~ /^(401|403|404)$/' /var/log/nginx/access.log | tail -20
# Monitor for suspicious access patterns
awk '{print $1}' /var/log/nginx/access.log | sort | uniq -c | sort -nr | head -10
# Check for failed authentication attempts
grep "authentication failed" /opt/icon-tool/logs/icon-tool.log
3. SSL/TLS Management¶
# Certificate expiry check
openssl x509 -in /etc/ssl/certs/icon-tool.crt -text -noout | grep "Not After"
# Test auto-renewal (Let's Encrypt); --dry-run does not replace any certificate
certbot renew --dry-run
Incident Response¶
1. Service Down¶
# Immediate Response Checklist
□ Check service status: systemctl status icon-tool
□ Check system resources: free -h, df -h
□ Check logs: journalctl -u icon-tool --since="10 minutes ago"
□ Restart service: systemctl restart icon-tool
□ Validate health: curl -f http://localhost:5000/health
□ Monitor for stability: watch -n 10 'curl -s http://localhost:5000/health'
2. Performance Degradation¶
# Performance Investigation
□ Check response times: curl -w "@curl-format.txt" -o /dev/null -s http://localhost:5000/api/icons
□ Monitor system load: uptime, iostat -x 1
□ Check memory usage: ps aux | grep python
□ Analyze slow requests: grep "slow" /opt/icon-tool/logs/icon-tool.log
□ Clear cache if needed: systemctl restart icon-tool
3. Data Corruption¶
# Data Recovery Steps
□ Stop service immediately: systemctl stop icon-tool
□ Backup current state: cp -r /opt/icon-tool /opt/icon-tool.corrupt-$(date +%Y%m%d_%H%M%S)
□ Validate icons.json: python -c "import json; json.load(open('/opt/icon-tool/icons.json'))"
□ Restore from backup: scripts/restore.sh latest-backup.tar.gz
□ Re-extract icons if needed: node extract-icons.js
□ Validate restoration: curl -f http://localhost:5000/health
Maintenance Windows¶
1. Planned Maintenance¶
#!/bin/bash
# scripts/maintenance.sh
# Maintenance Procedure Template
echo "Starting maintenance window: $(date)"
# 1. Notification
echo "Service entering maintenance mode..."
# 2. Health check before maintenance
curl -f http://localhost:5000/health || exit 1
# 3. Create backup
scripts/backup.sh
# 4. Stop service
systemctl stop icon-tool
# 5. Perform maintenance tasks
# - Update dependencies
# - Rotate logs
# - Clean temporary files
# - Update application code
# 6. Start service
systemctl start icon-tool
# 7. Validation
sleep 10
curl -f http://localhost:5000/health
# 8. Monitoring
echo "Maintenance completed: $(date)"
echo "Monitor service for next 30 minutes"
2. Updates & Patches¶
# Application Update Procedure (run from the application directory, using the service's virtualenv)
cd /opt/icon-tool
git pull origin main
npm ci
venv/bin/pip install -r requirements.txt
node extract-icons.js
systemctl restart icon-tool
# System Updates
apt update && apt upgrade -y
reboot # if kernel updates
SLO/SLA Monitoring¶
Service Level Objectives¶
Metric | Target | Measurement |
---|---|---|
Uptime | 99.0% | Monthly |
Response Time | <200ms | 95th percentile |
Error Rate | <1% | Weekly |
MTTR | <15 minutes | Per incident |
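The targets in the table translate into concrete numbers. The sketch below shows the arithmetic for a downtime budget, a nearest-rank percentile, and an error rate. The function names are illustrative, and the 744-hour month is an assumption matching a 31-day month.

```python
import math


def allowed_downtime_hours(target_percent, hours_in_period=744.0):
    """Downtime budget implied by an uptime target over one period."""
    return hours_in_period * (100.0 - target_percent) / 100.0


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty collection of numbers."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("percentile of empty sample set")
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def error_rate_percent(errors, total):
    """Share of failed requests in percent; 0.0 when nothing was served."""
    return 0.0 if total == 0 else errors * 100.0 / total
```

For the 99.0% monthly uptime target, `allowed_downtime_hours(99.0)` gives a budget of 7.44 hours per 744-hour month. `percentile(response_times, 95)` is the figure to compare against the 200 ms target. An average is not a substitute for it.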
Monitoring Dashboard¶
#!/bin/bash
# scripts/slo-report.sh
# SLO Reporting Script
echo "=== SLO Report for $(date +%Y-%m) ==="
# Uptime approximation: hours since the service last became active, relative to a 744-hour month.
# Earlier outages and restarts are not counted, so treat this as a rough indicator only.
started_seconds=$(systemctl show icon-tool --property=ActiveEnterTimestamp --value | xargs -I {} date -d {} +%s)
current_seconds=$(date +%s)
uptime_hours=$(( (current_seconds - started_seconds) / 3600 ))
# Multiply before dividing so bc does not truncate the intermediate result
uptime_percentage=$(echo "scale=2; $uptime_hours * 100 / 744" | bc)
echo "Uptime: $uptime_percentage%"
# Response time analysis (average of 10 samples, not the 95th percentile)
echo "Response Times:"
for i in {1..10}; do
curl -w "%{time_total}\n" -o /dev/null -s http://localhost:5000/api/icons
done | awk '{sum+=$1; count++} END {print "Average: " sum/count "s"}'
# Error rate from logs
error_count=$(grep -c "ERROR" /opt/icon-tool/logs/icon-tool.log)
total_requests=$(grep -c "Request:" /opt/icon-tool/logs/icon-tool.log)
if [ "$total_requests" -gt 0 ]; then
error_rate=$(echo "scale=4; $error_count * 100 / $total_requests" | bc)
echo "Error Rate: $error_rate%"
else
echo "Error Rate: n/a (no requests logged)"
fi
Evidence: Operations best practices, SLO/SLA standards
Operations Checklist:
- [ ] Service management configured (systemd/Docker)
- [ ] Health monitoring implemented
- [ ] Logging and log rotation enabled
- [ ] Backup strategy implemented and tested
- [ ] Performance monitoring set up
- [ ] Incident response procedures documented
- [ ] SLO/SLA monitoring enabled