Troubleshooting Guide

Audience: Ops, Dev
You will learn:
- Systematic problem diagnosis for the icon tool
- Common problems and their solutions
- Debugging techniques and monitoring tools
- Escalation paths and support procedures

Prerequisites:
- Familiarity with the Operations Runbook
- Basic system administration knowledge
- Access to logs and monitoring tools

General Diagnostic Strategy

1. Problem Categorization

graph TD
    A[Problem detected] --> B{Service running?}
    B -->|Yes| C{Health check OK?}
    B -->|No| D[Service problem]
    C -->|Yes| E{Performance OK?}
    C -->|No| F[Functional problem]
    E -->|Yes| G[User problem]
    E -->|No| H[Performance problem]

    D --> I[Service restart]
    F --> J[Data validation]
    H --> K[Resource analysis]
    G --> L[User support]
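
The decision flow above can also be written as a small function. This is a minimal sketch and not part of the icon tool: the name `categorize_problem` and its three boolean inputs are assumptions that stand for the results of the service, health, and performance checks.

```python
from typing import Tuple


def categorize_problem(service_running: bool,
                       health_ok: bool,
                       performance_ok: bool) -> Tuple[str, str]:
    """Return (category, first action), checking in the flow chart's order."""
    if not service_running:
        return ("service problem", "service restart")
    if not health_ok:
        return ("functional problem", "data validation")
    if not performance_ok:
        return ("performance problem", "resource analysis")
    # Every server-side check passed, so the issue is on the user's side
    return ("user problem", "user support")


if __name__ == "__main__":
    # Service is up and healthy, but responses are slow
    print(categorize_problem(True, True, False))
```

The checks are evaluated in the same order as in the chart, so a stopped service is always reported before a failing health check.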

2. Standard Diagnostic Flow

#!/bin/bash
# scripts/diagnose.sh

echo "=== Icon Tool Diagnostics ==="
echo "Timestamp: $(date)"

# 1. Service Status
echo -e "\n🔍 Service Status:"
systemctl is-active icon-tool && echo "✓ Service running" || echo "❌ Service stopped"

# 2. Health Check
echo -e "\n🏥 Health Check:"
if curl -s -f http://localhost:5000/health > /dev/null; then
    echo "✓ Health check passed"
    curl -s http://localhost:5000/health | jq .
else
    echo "❌ Health check failed"
    curl -s http://localhost:5000/health || echo "No response"
fi

# 3. Resource Usage
echo -e "\n📊 Resource Usage:"
# The [p] bracket keeps grep from matching its own command line
echo "Memory: $(ps aux | grep '[p]ython app.py' | awk '{sum+=$6} END {print sum/1024 " MB"}')"
echo "CPU: $(ps aux | grep '[p]ython app.py' | awk '{sum+=$3} END {print sum+0 "%"}')"
echo "Disk: $(df -h /opt/icon-tool | tail -1 | awk '{print $5}')"

# 4. Recent Errors
echo -e "\n❌ Recent Errors:"
tail -50 /opt/icon-tool/logs/icon-tool.log | grep ERROR | tail -5

# 5. Connection Status
echo -e "\n🌐 Network Status:"
echo "Active connections: $(netstat -an | grep :5000 | grep ESTABLISHED | wc -l)"
echo "Listening: $(netstat -an | grep :5000 | grep LISTEN | wc -l)"

# 6. File System
echo -e "\n📁 File System:"
echo "Icons count: $(ls /opt/icon-tool/static/icons/*.svg 2>/dev/null | wc -l)"
echo "Metadata exists: $([ -f /opt/icon-tool/icons.json ] && echo "Yes" || echo "No")"

Evidence: Operational troubleshooting patterns

Service-Level Problems

Problem: Service does not start

Symptoms

systemctl status icon-tool
# ● icon-tool.service - ak Systems Icon Management Tool
#    Loaded: loaded (/etc/systemd/system/icon-tool.service; enabled; vendor preset: enabled)
#    Active: failed (Result: exit-code) since Sun 2025-08-24 10:30:15 UTC; 2min ago

Diagnosis

# 1. Check the detailed logs
journalctl -u icon-tool --no-pager

# 2. Identify Python errors by running the app directly, without systemd
python /opt/icon-tool/app.py

# 3. Dependency check
cd /opt/icon-tool
python -c "import flask; print('Flask OK')"
python -c "import json; print('JSON OK')"

# 4. Check file permissions
ls -la /opt/icon-tool/
ls -la /opt/icon-tool/static/icons/

Common Causes & Solutions

| Cause | Symptom | Solution |
| --- | --- | --- |
| Missing dependencies | `ModuleNotFoundError: No module named 'flask'` | `pip install flask` |
| Permission denied | `PermissionError: [Errno 13]` | `chown -R iconuser:icongroup /opt/icon-tool/` |
| Port already in use | `Address already in use` | `lsof -i :5000` and stop the process |
| Missing icons | `FileNotFoundError: static/icons` | `node extract-icons.js` |
| Corrupt metadata | `json.decoder.JSONDecodeError` | Restore `icons.json` from backup |
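
The table above can be turned into a lookup that maps a log line to a likely cause and fix. The following is a minimal sketch under stated assumptions: `KNOWN_CAUSES` and `suggest_fix` are hypothetical names, and matching is a plain substring search on the error signatures listed in the table.

```python
from typing import Optional, Tuple

# (error signature, cause, suggested fix), taken from the table above
KNOWN_CAUSES = [
    ("ModuleNotFoundError", "Missing dependencies", "pip install flask"),
    ("PermissionError", "Permission denied",
     "chown -R iconuser:icongroup /opt/icon-tool/"),
    ("Address already in use", "Port already in use",
     "lsof -i :5000 and stop the process"),
    ("FileNotFoundError", "Missing icons", "node extract-icons.js"),
    ("JSONDecodeError", "Corrupt metadata", "restore icons.json from backup"),
]


def suggest_fix(log_line: str) -> Optional[Tuple[str, str]]:
    """Return (cause, fix) for the first known signature found in the line."""
    for signature, cause, fix in KNOWN_CAUSES:
        if signature in log_line:
            return cause, fix
    return None  # unknown error: continue with the manual diagnosis
```

For example, feeding it the last lines of `journalctl -u icon-tool` gives a first hint before working through the resolution steps by hand.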

Resolution Steps

# Standard repair sequence
cd /opt/icon-tool

# 1. Repair dependencies
pip install flask

# 2. Re-extract the icons
node extract-icons.js

# 3. Fix permissions
sudo chown -R iconuser:icongroup .
sudo chmod 755 .
sudo chmod 644 icons.json
sudo chmod 755 static/icons/
sudo chmod 644 static/icons/*.svg

# 4. Restart the service
sudo systemctl restart icon-tool

# 5. Validation
sleep 5
curl -f http://localhost:5000/health

Evidence: systemd service configuration, common startup issues

Problem: Service is running but not reachable

Symptoms

systemctl status icon-tool  # active (running)
curl http://localhost:5000  # Connection refused

Diagnosis

# 1. Check the port binding
netstat -tulpn | grep :5000
# Should show python listening on 0.0.0.0:5000

# 2. Check the firewall
ufw status
iptables -L | grep 5000

# 3. Process status (the [p] bracket keeps grep from matching itself)
ps aux | grep "[p]ython app.py"

# 4. Application logs
tail -f /opt/icon-tool/logs/icon-tool.log

Solutions

# Resolve a port conflict
sudo lsof -i :5000
sudo kill <PID>  # If another process is holding the port

# Add a firewall rule
sudo ufw allow 5000

# Check the Flask binding (in app.py)
# app.run(host='0.0.0.0', port=5000)  # Not only 127.0.0.1

Functional Problems

Problem: Icons are not displayed

Symptoms

curl http://localhost:5000/  # 200 OK
curl http://localhost:5000/api/icons  # {"icons": [], "categories": {}}

Diagnosis

# 1. Check the icons directory
ls -la /opt/icon-tool/static/icons/
# Should contain *.svg files

# 2. Check the metadata
jq . /opt/icon-tool/icons.json
# Should be valid JSON with categories

# 3. Analyze the API response
curl -s http://localhost:5000/api/icons | jq '.icons | length'
# Should be > 0

# 4. File permissions
ls -la /opt/icon-tool/static/icons/ | head -5

Solutions

# Re-extract the icons
cd /opt/icon-tool
node extract-icons.js

# Expected output:
# ✓ 162 Icons erfolgreich extrahiert
# ✓ ZIP-Archiv erstellt: 69.234 bytes

# Validation
ls static/icons/*.svg | wc -l  # Should be 162
curl -s http://localhost:5000/api/icons | jq '.icons | length'  # Should be 162

Evidence: extract-icons.js output, app.py icon loading logic

Problem: Category filter does not work

Symptoms

// Browser Console
fetch('/api/icons').then(r => r.json()).then(d => console.log(d.categories))
// {}  (empty object instead of categories)

Diagnosis

# 1. Check the metadata file
jq 'keys' /opt/icon-tool/icons.json
# Should list the category names

# 2. Validate the JSON syntax
python3 -c "import json; json.load(open('/opt/icon-tool/icons.json')); print('Valid JSON')"

# 3. Test the backend response
curl -s http://localhost:5000/api/icons | jq '.categories | keys'

Solutions

# Repair the metadata
cd /opt/icon-tool

# Create a backup
cp icons.json icons.json.backup

# Re-categorize (if icons.json is corrupt)
python3 -c "
import json
import os

# Put every SVG file into a single fallback category
icons = sorted(f for f in os.listdir('static/icons') if f.endswith('.svg'))
categories = {'Uncategorized': icons}

with open('icons.json', 'w') as f:
    json.dump(categories, f, indent=2)

print(f'Created basic categorization for {len(icons)} icons')
"

# Restart the service
systemctl restart icon-tool

# Validation
curl -s http://localhost:5000/api/icons | jq '.categories | keys'

Evidence: icons.json structure, app.py category loading

Performance Problems

Problem: Slow API Responses

Symptoms

curl -s -o /dev/null -w "Time: %{time_total}s\n" http://localhost:5000/api/icons
# Time: 2.345s  (should be < 0.2s)
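
The same measurement can be scripted for repeated checks. This is a minimal sketch using only the standard library; the helper names `fetch`, `timed`, and `is_slow` are assumptions, and the 0.2 s budget is the target mentioned in the symptom above.

```python
import time
from urllib.request import urlopen

RESPONSE_BUDGET_S = 0.2  # target response time from the symptom above


def fetch(url: str, timeout: float = 5.0) -> bytes:
    """Download the response body; urlopen raises on connection or HTTP errors."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def timed(func, *args):
    """Call func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


def is_slow(elapsed_s: float, budget_s: float = RESPONSE_BUDGET_S) -> bool:
    """True if a measured duration exceeds the response-time budget."""
    return elapsed_s > budget_s
```

A call such as `_, elapsed = timed(fetch, "http://localhost:5000/api/icons")` followed by `is_slow(elapsed)` reproduces the curl check; unlike `time_total` in curl, the measured time here also includes reading the full body in Python.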

Diagnosis

# 1. Response time breakdown
# (curl-format.txt is a curl write-out format file that must exist in the current directory)
curl -w "@curl-format.txt" -o /dev/null http://localhost:5000/api/icons

# 2. Check the system load
uptime
iostat -x 1 5

# 3. Memory usage (the [p] bracket keeps grep from matching itself)
free -h
ps aux | grep '[p]ython' | awk '{print $6/1024 " MB"}'

# 4. File system performance
time ls /opt/icon-tool/static/icons/ | wc -l

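Solutions

# 1. Enable caching
# A variable exported in the shell does not reach a systemd service;
# set it in the unit instead, e.g. via "sudo systemctl edit icon-tool":
#   [Service]
#   Environment=ENABLE_CACHING=true
sudo systemctl restart icon-tool

# 2. Reduce the number of icons (temporary)
cd /opt/icon-tool
mkdir -p static/icons-backup
mv static/icons/*.svg static/icons-backup/
cp static/icons-backup/home.svg static/icons-backup/user.svg static/icons/

# 3. Monitor system resources (-d, joins multiple PIDs with commas for top)
top -p "$(pgrep -d, -f 'python app.py')"

# 4. Optimize disk I/O
# Move the icons to an SSD
sudo mkdir -p /mnt/ssd/icons
sudo cp -r static/icons/* /mnt/ssd/icons/
# Move the original directory aside first; otherwise ln creates the link inside it
sudo mv static/icons static/icons.orig
sudo ln -sfn /mnt/ssd/icons static/icons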

Problem: High Memory Usage

Symptoms

ps aux | grep "python app.py"
# USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
# iconuser  1234  1.0 25.0 500000 250000 ?      S    10:30   0:05 python app.py
# RSS > 100 MB is suspicious
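
To apply the 100 MB rule automatically, the RSS column can be parsed from a `ps aux` line. The sketch below assumes the standard `ps aux` column layout shown above (RSS is the sixth column, in KiB); `rss_mb` and `is_suspicious` are hypothetical helper names.

```python
RSS_THRESHOLD_MB = 100  # threshold named in the symptom above


def rss_mb(ps_line: str) -> float:
    """Return the RSS column of a `ps aux` line (6th field, KiB) in MiB."""
    # maxsplit=10 keeps the command, which may contain spaces, as one field
    fields = ps_line.split(None, 10)
    return int(fields[5]) / 1024


def is_suspicious(ps_line: str, threshold_mb: float = RSS_THRESHOLD_MB) -> bool:
    """True if the process uses more resident memory than the threshold."""
    return rss_mb(ps_line) > threshold_mb
```

For the sample process above, an RSS of 250000 KiB is roughly 244 MiB, well over the threshold.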

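Diagnosis

# 1. Memory profiling of the running app process
# (psutil.Process() without a PID would inspect this one-off interpreter instead)
python3 -c "
import sys
import psutil
p = psutil.Process(int(sys.argv[1]))
print(f'Memory: {p.memory_info().rss / 1024 / 1024:.1f} MB')
print(f'CPU: {p.cpu_percent(interval=1)}%')
" "$(pgrep -f 'python app.py' | head -1)"

# 2. Identify memory leaks
# Monitor over time
while true; do
    echo "$(date '+%Y-%m-%d %H:%M:%S') $(ps aux | grep '[p]ython app.py' | awk '{sum+=$6} END {print sum/1024 " MB"}')"
    sleep 60
done

# 3. Object count in a Python process
# Note: this counts objects in the interpreter started here, not in the
# running app; run the same two lines inside the app process for a meaningful number
python3 -c "
import gc
print(f'Objects in memory: {len(gc.get_objects())}')
"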

Solutions

# 1. Set a memory limit (systemd)
# In /etc/systemd/system/icon-tool.service:
# MemoryMax=100M   (MemoryLimit= is the deprecated name of this setting)

# 2. Schedule a periodic restart (crontab entry)
0 */6 * * * systemctl restart icon-tool  # Every 6 hours, on the hour

# 3. Tune caching: shorter cache lifetime
# Set Environment=ICON_CACHE_TIMEOUT=60 in the unit (an export in the
# shell does not reach the service), then restart
systemctl restart icon-tool

# 4. Force Python garbage collection
systemctl kill -s USR1 icon-tool  # Only if the app implements a USR1 handler

Evidence: Performance monitoring, resource optimization

Data Integrity Problems

Problem: Icon metadata inconsistent

Symptoms

curl -s http://localhost:5000/api/icons | jq '.icons | length'  # 162
cat icons.json | jq '[.[]] | flatten | length'  # 158
# Mismatch between files and metadata

Diagnosis

# Consistency-Check Script
python3 -c "
import json
import os
from pathlib import Path

# Load metadata
with open('icons.json') as f:
    categories = json.load(f)

# Get categorized icons
categorized = set()
for cat_icons in categories.values():
    categorized.update(cat_icons)

# Get actual files
icons_dir = Path('static/icons')
actual = {f.name for f in icons_dir.glob('*.svg')}

print(f'Actual files: {len(actual)}')
print(f'Categorized: {len(categorized)}')

uncategorized = actual - categorized
missing = categorized - actual

if uncategorized:
    print(f'Uncategorized files: {uncategorized}')
if missing:
    print(f'Missing files: {missing}')

if not uncategorized and not missing:
    print('✅ Metadata is consistent')
else:
    print('❌ Metadata inconsistency detected')
"

Solutions

# Automatic repair
python3 -c "
import json
import os
from pathlib import Path

# Load current metadata
try:
    with open('icons.json') as f:
        categories = json.load(f)
except (OSError, json.JSONDecodeError):
    categories = {}

# Get all actual SVG files
icons_dir = Path('static/icons')
actual_files = {f.name for f in icons_dir.glob('*.svg')}

# Remove missing files from categories
for category in categories:
    categories[category] = [f for f in categories[category] if f in actual_files]

# Add uncategorized files
categorized = set()
for cat_icons in categories.values():
    categorized.update(cat_icons)

uncategorized = actual_files - categorized
if uncategorized:
    if 'Uncategorized' not in categories:
        categories['Uncategorized'] = []
    categories['Uncategorized'].extend(sorted(uncategorized))

# Save repaired metadata
with open('icons.json', 'w') as f:
    json.dump(categories, f, indent=2)

print(f'✅ Repaired metadata for {len(actual_files)} files')
"

# Restart the service
systemctl restart icon-tool

# Validation
curl -s http://localhost:5000/api/icons | jq '.icons | length'

Problem: Corrupted SVG Files

Symptoms

curl http://localhost:5000/static/icons/home.svg
# Shows binary data or an error instead of SVG

Diagnosis

# Check SVG integrity
# (search the whole file: a valid SVG may start with an <?xml ...?> declaration)
for svg in /opt/icon-tool/static/icons/*.svg; do
    if ! grep -q "<svg" "$svg"; then
        echo "Corrupted: $svg"
    fi
done

# Check file types
file /opt/icon-tool/static/icons/*.svg | grep -v "SVG"

# Validate SVG syntax
xmllint --noout /opt/icon-tool/static/icons/home.svg 2>&1 || echo "Invalid XML"
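
Where `xmllint` is not installed, a comparable check can be done with Python's standard library. The following is a minimal sketch, assuming that a file counts as a valid icon when it is well-formed XML with an `svg` root element; `is_valid_svg` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

SVG_NAMESPACE = "{http://www.w3.org/2000/svg}"


def is_valid_svg(data: bytes) -> bool:
    """True if data is well-formed XML whose root element is <svg>."""
    try:
        root = ET.fromstring(data)
    except ET.ParseError:
        return False  # binary data, truncated file, or broken markup
    # Accept both namespaced and plain <svg> roots
    return root.tag in ("svg", SVG_NAMESPACE + "svg")
```

Reading each file with `Path(path).read_bytes()` and passing the result to `is_valid_svg` covers both the corrupted-content and the invalid-XML case in one pass.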

Solutions

# Re-extract all icons
cd /opt/icon-tool

# Back up the current icons
mv static/icons static/icons-corrupted-$(date +%Y%m%d_%H%M%S)

# Re-extraction
node extract-icons.js

# Validation
ls static/icons/*.svg | wc -l
grep -q "<svg" static/icons/home.svg && echo "home.svg OK"

# Restart the service
systemctl restart icon-tool

Evidence: SVG generation process, file integrity checks

Network & Connectivity

Problem: External access does not work

Symptoms

curl http://localhost:5000/health  # OK
curl http://external-ip:5000/health  # Connection refused

Diagnosis

# 1. Check the bind address
netstat -tulpn | grep :5000
# Should show 0.0.0.0:5000, not 127.0.0.1:5000

# 2. Firewall status
ufw status verbose
iptables -L INPUT -v

# 3. Network interface
ip addr show
ping external-ip  # From another system
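
The bind-address check from step 1 can be automated by parsing a LISTEN line from `netstat -tulpn` or `ss -tulpn`. This is a minimal sketch; the function name `listens_on_all_interfaces` and the set of wildcard notations are assumptions covering the common output forms of both tools.

```python
# Wildcard notations used by netstat and ss for "all interfaces"
WILDCARD_HOSTS = {"0.0.0.0", "*", "::", "[::]"}


def listens_on_all_interfaces(listen_line: str, port: int = 5000) -> bool:
    """True if the local address in a LISTEN line is a wildcard address."""
    suffix = f":{port}"
    for field in listen_line.split():
        if field.endswith(suffix):
            # Strip ":<port>" to get the host part of the local address
            host = field[: -len(suffix)]
            return host in WILDCARD_HOSTS
    return False  # no socket on this port found in the line
```

A result of `False` for a line containing `127.0.0.1:5000` means the service is only reachable from the host itself.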

Solutions

# 1. Fix the Flask binding
# In app.py: app.run(host='0.0.0.0', port=5000)

# 2. Add a firewall rule
sudo ufw allow 5000/tcp
# -I inserts at the top of the chain; a rule appended with -A is never
# reached if an earlier rule already drops or rejects the traffic
sudo iptables -I INPUT -p tcp --dport 5000 -j ACCEPT

# 3. Restart the service
systemctl restart icon-tool

# 4. Validation
netstat -tulpn | grep :5000  # Should show 0.0.0.0:5000

Problem: Load balancer health checks fail

Symptoms

# Load Balancer Logs zeigen:
# Health check failed: HTTP 503 Service Unavailable

Diagnosis

# 1. Test the health endpoint directly
curl -v http://localhost:5000/health

# 2. Check the response headers
curl -I http://localhost:5000/health

# 3. Load balancer configuration
nginx -t  # Syntax check
cat /etc/nginx/sites-enabled/icon-tool

Solutions

# 1. Dedicated health-check route
# Extend app.py:
@app.route('/lb-health')
def lb_health():
    return 'OK', 200

# 2. Adjust the nginx configuration
upstream icon_tool {
    server 127.0.0.1:5000;

    # Keepalive connection settings (these do not configure health checks)
    keepalive 32;
    keepalive_requests 100;
    keepalive_timeout 60s;
}

location /health {
    proxy_pass http://icon_tool/lb-health;
    proxy_read_timeout 5s;
}

# 3. Restart the service and reload nginx
systemctl restart icon-tool
nginx -s reload

Evidence: Network configuration, load balancer integration

Escalation & Support

Support-Level Matrix

| Problem Severity | Response Time | Escalation Path |
| --- | --- | --- |
| Critical (Service Down) | 15 minutes | On-call → Dev Team Lead |
| High (Performance) | 1 hour | Dev Team → Architecture |
| Medium (Feature Issues) | 4 hours | Support → Dev Team |
| Low (Documentation) | 24 hours | Support → Documentation |
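
For alerting or ticket tooling, the matrix above can be kept as data. This is a minimal sketch: `SUPPORT_MATRIX` and `response_deadline` are hypothetical names, and the severity keys are shortened forms of the labels in the table.

```python
from datetime import datetime, timedelta

# severity -> (response time, escalation path), taken from the matrix above
SUPPORT_MATRIX = {
    "critical": (timedelta(minutes=15), "On-call → Dev Team Lead"),
    "high": (timedelta(hours=1), "Dev Team → Architecture"),
    "medium": (timedelta(hours=4), "Support → Dev Team"),
    "low": (timedelta(hours=24), "Support → Documentation"),
}


def response_deadline(severity: str, reported_at: datetime) -> datetime:
    """Return the latest time by which a first response is due."""
    response_time, _escalation_path = SUPPORT_MATRIX[severity.lower()]
    return reported_at + response_time
```

An unknown severity raises `KeyError`, which is intentional here so that a typo does not silently fall back to a slower response time.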

Information Gathering for Support

#!/bin/bash
# scripts/support-info.sh

REPORT=/tmp/support-info.txt

# Write the report to a file so that it can be added to the package below
{
echo "=== Support Information Package ==="
echo "Generated: $(date)"
echo "System: $(uname -a)"

# 1. System Status
echo -e "\n## System Status"
systemctl status icon-tool --no-pager
uptime
free -h
df -h

# 2. Application Health
echo -e "\n## Application Health"
# Test curl's own exit status; in a pipeline only jq's status would count
if health=$(curl -sf http://localhost:5000/health); then
    echo "$health" | jq .
else
    echo "Health check failed"
fi

# 3. Recent Logs
echo -e "\n## Recent Logs (Last 50 lines)"
tail -50 /opt/icon-tool/logs/icon-tool.log

# 4. Configuration
echo -e "\n## Configuration"
echo "Flask Environment: $FLASK_ENV"
echo "Python Version: $(python --version)"
echo "Node Version: $(node --version)"

# 5. File System Status
echo -e "\n## File System"
ls -la /opt/icon-tool/ | head -10
echo "Icon count: $(ls /opt/icon-tool/static/icons/*.svg 2>/dev/null | wc -l)"
echo "Metadata size: $(wc -l < /opt/icon-tool/icons.json) lines"

# 6. Network Status
echo -e "\n## Network"
netstat -tulpn | grep :5000
ss -tulpn | grep :5000
} > "$REPORT" 2>&1

# Package to send to support
PACKAGE="support-package-$(date +%Y%m%d_%H%M%S).tar.gz"
tar -czf "$PACKAGE" \
    "$REPORT" \
    /opt/icon-tool/logs/icon-tool.log \
    /opt/icon-tool/icons.json

echo -e "\n✅ Support package created: $PACKAGE"

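Remote Debugging

# Secure remote debugging session
# 1. SSH tunnel for secure access
ssh -L 5000:localhost:5000 user@production-server

# 2. Temporarily enable debug mode (caution!)
# An export in the shell does not reach a systemd service, so set the
# variable in the systemd manager environment instead
sudo systemctl set-environment FLASK_DEBUG=1
sudo systemctl restart icon-tool

# 3. After debugging: disable debug mode
sudo systemctl unset-environment FLASK_DEBUG
sudo systemctl restart icon-tool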

Incident Documentation

# Incident Report Template

## Incident Summary
- **Date/Time:** 2025-08-24 14:30 UTC
- **Duration:** 15 minutes
- **Severity:** High
- **Root Cause:** Disk space exhaustion

## Timeline
- 14:30 - Alert triggered: Service unhealthy
- 14:32 - Investigation started
- 14:35 - Root cause identified: /opt full
- 14:40 - Mitigation: Log rotation and cleanup
- 14:45 - Service restored and validated

## Impact
- **Users Affected:** All users (estimated 50)
- **Service Degradation:** Complete outage
- **Data Loss:** None

## Resolution
- Immediate: Freed disk space by rotating logs
- Long-term: Automated disk cleanup cron job

## Prevention
- [ ] Implement disk usage monitoring
- [ ] Automated log rotation
- [ ] Disk usage alerts at 80%

## Lessons Learned
- Need proactive monitoring for disk usage
- Log rotation should be automatic
- Better alerting thresholds needed

Evidence: Incident response best practices, support workflows

Troubleshooting Toolkit:
- [ ] Standard diagnostic script available
- [ ] Common-issues playbook documented
- [ ] Escalation matrix defined
- [ ] Support information gathering automated
- [ ] Remote debugging procedures tested