DIGITALIA IT OPERATIONS MANUAL
Infrastructure & Operations Guide for IT Staff
TABLE OF CONTENTS
- System Overview
- Infrastructure Architecture
- Deployment Procedures
- Monitoring & Alerting
- Backup & Recovery
- Security Operations
- Performance Tuning
- Troubleshooting Guide
- Emergency Procedures
- Maintenance Tasks
SYSTEM OVERVIEW
Infrastructure Stack
PRODUCTION ENVIRONMENT
├─ Load Balancer Layer (HAProxy/NGINX)
│     Web Traffic (Port 80/443) | API Traffic (Port 8545) | Admin Traffic (VPN Access)
├─ Kubernetes Cluster (Production)
│     DVM Nodes (3 replicas) | DigitalianPassport (2 replicas) | Services (API, Monitor)
├─ Data Layer
│     RocksDB (Blockchain) | PostgreSQL (Application) | Redis (Cache)
└─ Monitoring & Logging
      Prometheus (Metrics) | Grafana (Dashboards) | ELK Stack (Logs)
Resource Requirements
Production Cluster Specifications
Master Nodes (3x):
  CPU: 8 cores
  RAM: 32 GB
  Storage: 500 GB SSD
  Network: 10 Gbps

Worker Nodes (6x):
  CPU: 16 cores
  RAM: 64 GB
  Storage: 1 TB NVMe SSD
  Network: 10 Gbps

Database Nodes (3x):
  CPU: 12 cores
  RAM: 128 GB
  Storage: 2 TB NVMe SSD (RAID 10)
  Network: 10 Gbps
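As a quick sanity check when planning upgrades, the aggregate capacity implied by these specifications can be computed directly (counts and sizes are the ones listed above):

```shell
# Aggregate capacity of the production fleet as specified in this manual:
# masters 3x8 cores / 32 GB, workers 6x16 cores / 64 GB, db 3x12 cores / 128 GB.
total_cores() {
  echo $(( 3*8 + 6*16 + 3*12 ))
}
total_ram_gb() {
  echo $(( 3*32 + 6*64 + 3*128 ))
}
echo "fleet capacity: $(total_cores) cores, $(total_ram_gb) GB RAM"
# prints "fleet capacity: 156 cores, 864 GB RAM"
```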
Network Configuration
# Network Topology
Production VLAN: 10.0.0.0/16
- Kubernetes Cluster: 10.0.1.0/24
- Database Cluster: 10.0.2.0/24
- Monitoring: 10.0.3.0/24
- Management: 10.0.4.0/24
Staging VLAN: 10.1.0.0/16
- Mirror of production for testing
Development VLAN: 10.2.0.0/16
- Developer environments
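To avoid addresses landing in the wrong VLAN, subnet membership can be checked in plain bash before handing an IP to a deployment script. A sketch against the CIDRs listed above:

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
  local IFS=.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

# usage: in_cidr <ip> <network> <prefix-bits>; succeeds if ip is inside.
in_cidr() {
  local ip net mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "$2")
  mask=$(( (0xFFFFFFFF << (32 - $3)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}

in_cidr 10.0.1.20 10.0.1.0 24 && echo "in Kubernetes cluster subnet"
```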
INFRASTRUCTURE ARCHITECTURE
Kubernetes Cluster Setup
Master Node Configuration
# Initialize cluster (on first master)
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 \
--service-cidr=10.96.0.0/12 \
--apiserver-advertise-address=10.0.1.10
# Install CNI (Flannel)
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
# Join additional masters
sudo kubeadm join 10.0.1.10:6443 --token <token> \
--discovery-token-ca-cert-hash sha256:<hash> \
--control-plane --certificate-key <key>
Worker Node Configuration
# Join worker nodes
sudo kubeadm join 10.0.1.10:6443 --token <token> \
--discovery-token-ca-cert-hash sha256:<hash>
# Label nodes for specific workloads
kubectl label nodes worker-1 node-role.kubernetes.io/dvm=true
kubectl label nodes worker-2 node-role.kubernetes.io/dvm=true
kubectl label nodes worker-3 node-role.kubernetes.io/frontend=true
kubectl label nodes worker-4 node-role.kubernetes.io/database=true
Storage Classes
# File: k8s/storage-classes.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: blockchain-storage
provisioner: kubernetes.io/aws-ebs
parameters:
  type: gp3
  iops: "16000"
  throughput: "1000"
allowVolumeExpansion: true
Load Balancer Configuration
HAProxy Configuration
# File: /etc/haproxy/haproxy.cfg
global
    daemon
    log stdout local0
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms
    option httplog

frontend digitalia_frontend
    bind *:80
    bind *:443 ssl crt /etc/ssl/certs/digitalia.org.pem
    redirect scheme https if !{ ssl_fc }

    # Route to appropriate backend
    acl is_api path_beg /api
    acl is_rpc path_beg /rpc
    use_backend api_backend if is_api
    use_backend rpc_backend if is_rpc
    default_backend web_backend

backend web_backend
    balance roundrobin
    server web1 10.0.1.20:80 check
    server web2 10.0.1.21:80 check

backend api_backend
    balance roundrobin
    server api1 10.0.1.30:8080 check
    server api2 10.0.1.31:8080 check

backend rpc_backend
    balance roundrobin
    server rpc1 10.0.1.40:8545 check
    server rpc2 10.0.1.41:8545 check
Database Cluster Setup
PostgreSQL High Availability
# Primary database setup
# File: /etc/postgresql/14/main/postgresql.conf
listen_addresses = '*'
wal_level = replica
max_wal_senders = 3
max_replication_slots = 3
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/archive/%f'

# Standby configuration. On PostgreSQL 12+ standby.signal is an EMPTY
# marker file (standby_mode no longer exists); the connection settings
# go in the standby's postgresql.conf:
#   touch /var/lib/postgresql/14/main/standby.signal
# File: /etc/postgresql/14/main/postgresql.conf (standby node)
primary_conninfo = 'host=10.0.2.10 port=5432 user=replicator'
restore_command = 'cp /var/lib/postgresql/archive/%f %p'
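Once the standby is attached, replication lag should be watched. In practice the lag figure comes from pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) in pg_stat_replication on the primary; the sketch below shows only the threshold logic, and the 64 MiB limit is an assumption, not a project standard:

```shell
# Replication health check sketch. The lag value (bytes of WAL the
# standby has not yet replayed) would be queried from the primary;
# here it is passed in as an argument so the logic is testable.
MAX_LAG_BYTES=$(( 64 * 1024 * 1024 ))  # assumed 64 MiB tolerance

# usage: replica_healthy <lag_bytes>; succeeds while lag is acceptable.
replica_healthy() {
  [ "$1" -le "$MAX_LAG_BYTES" ]
}

replica_healthy 1048576 && echo "replica within lag tolerance"
```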
DEPLOYMENT PROCEDURES
Production Deployment Checklist
Pre-Deployment (T-24 hours)
# 1. Backup current state
./scripts/backup-production.sh --full
# 2. Verify staging environment
./scripts/verify-staging.sh --comprehensive
# 3. Run security scan
./scripts/security-scan.sh --production
# 4. Check resource availability
kubectl top nodes
kubectl top pods -A
# 5. Validate Mainnet RevenueRouter deployment record mode
cd digitalia-contracts
npm run validate:mainnet-planned
npm run validate:revenue-policy
npm run validate:service-pricing
npm run validate:revenue-production
Deployment record mode policy:
- Pre-launch phase: validate:mainnet-planned is the required gate.
- Post-launch phase: switch to validate:mainnet-deployment and require a strict pass.
- Do not mark production-ready if strict mode is required and failing.
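The phase-to-gate mapping above can be encoded so scripts cannot pick the wrong validation target (a sketch; the phase names "pre-launch" and "post-launch" are assumptions):

```shell
# Map the launch phase to the required npm validation target, so CI
# scripts can never run the wrong record-mode gate.
record_mode_gate() {
  case "$1" in
    pre-launch)  echo "validate:mainnet-planned" ;;
    post-launch) echo "validate:mainnet-deployment" ;;
    *)           echo "unknown phase: $1" >&2; return 1 ;;
  esac
}

# e.g. inside the checklist: npm run "$(record_mode_gate pre-launch)"
echo "required gate: $(record_mode_gate pre-launch)"
```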
Deployment Process
# 1. Enable maintenance mode
kubectl apply -f k8s/maintenance-mode.yaml
# 2. Deploy new version
kubectl set image deployment/digitalia-dvm dvm=digitalia/dvm:v1.2.0
kubectl set image deployment/digitalian-passport passport=digitalia/passport:v1.2.0
# 3. Wait for rollout completion
kubectl rollout status deployment/digitalia-dvm --timeout=600s
kubectl rollout status deployment/digitalian-passport --timeout=600s
# 4. Run health checks
./scripts/health-check.sh --production
# 5. Disable maintenance mode
kubectl delete -f k8s/maintenance-mode.yaml
Rollback Procedure
# Emergency rollback
kubectl rollout undo deployment/digitalia-dvm
kubectl rollout undo deployment/digitalian-passport
# Verify rollback
kubectl rollout status deployment/digitalia-dvm
kubectl get pods -l app=digitalia-dvm
Blue-Green Deployment
Setup Blue-Green Infrastructure
# Create green environment
kubectl create namespace digitalia-green
# Deploy to green
kubectl apply -f k8s/ -n digitalia-green
# Test green environment
./scripts/test-environment.sh --namespace digitalia-green
# Switch traffic (update ingress)
kubectl patch ingress digitalia-ingress -p '{"spec":{"rules":[{"host":"digitalia.org","http":{"paths":[{"path":"/","pathType":"Prefix","backend":{"service":{"name":"digitalia-service-green","port":{"number":80}}}}]}}]}}'
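The switch step above hard-codes the green service; in practice the idle color is derived from whichever color is currently live. A minimal helper sketch (the color names and the digitalia- namespace prefix are taken from this section):

```shell
# Given the currently live color, return the idle color that the next
# release should be deployed into.
idle_color() {
  case "$1" in
    blue)  echo green ;;
    green) echo blue ;;
    *)     echo "unknown color: $1" >&2; return 1 ;;
  esac
}

LIVE=blue                                    # e.g. read from the ingress
TARGET_NS="digitalia-$(idle_color "$LIVE")"  # namespace to deploy into
echo "deploy to $TARGET_NS"
# prints "deploy to digitalia-green"
```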
MONITORING & ALERTING
Prometheus Configuration
Metrics Collection
# File: monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "digitalia_rules.yml"

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

  - job_name: 'digitalia-dvm'
    static_configs:
      - targets: ['digitalia-dvm:8080']
    metrics_path: /metrics
    scrape_interval: 10s

  - job_name: 'digitalian-passport'
    static_configs:
      - targets: ['digitalian-passport:3000']
    metrics_path: /metrics
Alert Rules
# File: monitoring/prometheus/digitalia_rules.yml
groups:
  - name: digitalia_alerts
    rules:
      - alert: DVMNodeDown
        expr: up{job="digitalia-dvm"} == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "DVM node is down"
          description: "DVM node {{ $labels.instance }} has been down for more than 30 seconds"

      - alert: HighBlockTime
        expr: digitalia_block_time > 30
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Block time is high"
          description: "Block time is {{ $value }} seconds"

      - alert: ValidatorOffline
        expr: digitalia_active_validators < 4500
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Too many validators offline"
          description: "Only {{ $value }} validators are active"
Grafana Dashboards
DVM Node Dashboard
{
  "dashboard": {
    "title": "Digitalia DVM Nodes",
    "panels": [
      {
        "title": "Block Height",
        "type": "stat",
        "targets": [
          { "expr": "digitalia_block_height", "legendFormat": "Block Height" }
        ]
      },
      {
        "title": "Transaction Rate",
        "type": "graph",
        "targets": [
          { "expr": "rate(digitalia_transactions_total[5m])", "legendFormat": "TPS" }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          { "expr": "process_resident_memory_bytes{job=\"digitalia-dvm\"}", "legendFormat": "Memory Usage" }
        ]
      }
    ]
  }
}
Log Management
ELK Stack Configuration
# File: monitoring/elasticsearch/elasticsearch.yml
cluster.name: digitalia-logs
network.host: 0.0.0.0
discovery.seed_hosts: ["elasticsearch-1", "elasticsearch-2", "elasticsearch-3"]
cluster.initial_master_nodes: ["elasticsearch-1"]

# Index template for Digitalia logs (run against the cluster API,
# e.g. via Kibana Dev Tools)
PUT _index_template/digitalia-logs
{
  "index_patterns": ["digitalia-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "digitalia-policy"
    }
  }
}
Logstash Pipeline
# File: monitoring/logstash/pipeline/digitalia.conf
input {
  beats {
    port => 5044
  }
}

filter {
  if [fields][service] == "digitalia-dvm" {
    grok {
      match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}" }
    }
    if [level] == "ERROR" {
      mutate {
        add_tag => ["alert"]
      }
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "digitalia-%{+YYYY.MM.dd}"
  }
}
BACKUP & RECOVERY
Backup Strategy
Daily Backup Schedule
#!/bin/bash
# File: scripts/daily-backup.sh
# Blockchain data backup
kubectl exec -n digitalia-production digitalia-dvm-0 -- \
tar czf /tmp/blockchain-backup-$(date +%Y%m%d).tar.gz /data/blockchain
# Database backup
kubectl exec -n digitalia-production postgres-0 -- \
pg_dump -U digitalia digitalia_db | gzip > /backups/db-backup-$(date +%Y%m%d).sql.gz
# Configuration backup
kubectl get all,configmaps,secrets -n digitalia-production -o yaml > \
/backups/k8s-config-$(date +%Y%m%d).yaml
# Upload to remote storage
aws s3 sync /backups/ s3://digitalia-backups/daily/$(date +%Y%m%d)/
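Local copies in /backups/ accumulate after each S3 sync, so old ones need pruning. A retention sketch keyed on the YYYYMMDD stamp that daily-backup.sh puts in each filename (the 7-day window is an assumption; date -d is GNU date):

```shell
RETENTION_DAYS=7
# GNU date; on BSD/macOS use: date -v-7d +%Y%m%d
cutoff=$(date -d "${RETENTION_DAYS} days ago" +%Y%m%d)

# usage: is_expired <filename> [cutoff-override]
# Succeeds if the YYYYMMDD stamp in the filename is older than the cutoff.
is_expired() {
  local stamp c
  stamp=$(printf '%s' "$1" | grep -oE '[0-9]{8}')
  c=${2:-$cutoff}
  [ "$stamp" -lt "$c" ]
}

is_expired "db-backup-20200101.sql.gz" && echo "prune db-backup-20200101.sql.gz"
```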
Recovery Procedures
#!/bin/bash
# File: scripts/restore-production.sh
BACKUP_DATE=$1
# Stop services
kubectl scale deployment digitalia-dvm --replicas=0
kubectl scale deployment digitalian-passport --replicas=0
# Restore blockchain data
aws s3 cp s3://digitalia-backups/daily/${BACKUP_DATE}/blockchain-backup-${BACKUP_DATE}.tar.gz /tmp/
kubectl cp /tmp/blockchain-backup-${BACKUP_DATE}.tar.gz digitalia-dvm-0:/tmp/
kubectl exec digitalia-dvm-0 -- tar xzf /tmp/blockchain-backup-${BACKUP_DATE}.tar.gz -C /
# Restore database
aws s3 cp s3://digitalia-backups/daily/${BACKUP_DATE}/db-backup-${BACKUP_DATE}.sql.gz /tmp/
gunzip /tmp/db-backup-${BACKUP_DATE}.sql.gz
kubectl exec -i postgres-0 -- psql -U digitalia digitalia_db < /tmp/db-backup-${BACKUP_DATE}.sql
# Restart services
kubectl scale deployment digitalia-dvm --replicas=3
kubectl scale deployment digitalian-passport --replicas=2
Disaster Recovery
RTO/RPO Targets
- Recovery Time Objective (RTO): 4 hours
- Recovery Point Objective (RPO): 1 hour
- Maximum Tolerable Downtime: 8 hours
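The RPO target can be checked mechanically: if the newest backup is older than one hour, the target is already blown. A minimal sketch working on epoch timestamps (in practice the backup timestamp would come from the newest object in the S3 bucket):

```shell
RPO_SECONDS=3600  # 1-hour Recovery Point Objective from the targets above

# usage: backup_fresh <backup_epoch> <now_epoch>
# Succeeds while the newest backup is within the RPO window.
backup_fresh() {
  [ $(( $2 - $1 )) -le "$RPO_SECONDS" ]
}

now=$(date +%s)
backup_fresh $(( now - 1800 )) "$now" && echo "newest backup within RPO"
```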
DR Site Configuration
# Standby cluster configuration
REGION_PRIMARY="us-east-1"
REGION_DR="us-west-2"
# Replicate data to DR site
aws s3 sync s3://digitalia-backups-primary/ s3://digitalia-backups-dr/ --region $REGION_DR
# Maintain warm standby
kubectl apply -f k8s/dr-cluster.yaml --context=digitalia-dr
SECURITY OPERATIONS
Security Monitoring
Intrusion Detection
# Install and configure Falco
helm install falco falcosecurity/falco \
--set falco.grpc.enabled=true \
--set falco.grpcOutput.enabled=true
# Custom rules for Digitalia
# File: /etc/falco/falco_rules.local.yaml
# Note: CIDR matching uses fd.net; comparing fd.ip to a CIDR string
# with != does not work.
- rule: Unauthorized Network Connection from DVM
  desc: Detect network connections from DVM to unauthorized hosts
  condition: >
    (container.name contains "digitalia-dvm") and
    (fd.type = ipv4 or fd.type = ipv6) and
    (not fd.net in ("10.0.0.0/16", "127.0.0.0/8"))
  output: >
    Unauthorized network connection from DVM
    (command=%proc.cmdline connection=%fd.name user=%user.name)
  priority: WARNING
Vulnerability Scanning
# Container image scanning
trivy image digitalia/dvm:latest
trivy image digitalia/passport:latest
# Kubernetes cluster scanning
kube-bench run --targets master,node,etcd,policies
# Network security scanning
nmap -sS -O 10.0.1.0/24
Certificate Management
SSL Certificate Renewal
#!/bin/bash
# File: scripts/renew-certificates.sh
# Renew Let's Encrypt certificates
certbot renew --nginx
# Update Kubernetes secrets
kubectl create secret tls digitalia-tls \
--cert=/etc/letsencrypt/live/digitalia.org/fullchain.pem \
--key=/etc/letsencrypt/live/digitalia.org/privkey.pem \
--dry-run=client -o yaml | kubectl apply -f -
# Reload ingress controller
kubectl rollout restart deployment/nginx-ingress-controller
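A 30-day renewal window (the threshold check-certificates.sh uses under Weekly Tasks) reduces to epoch arithmetic; in practice the expiry epoch would be read from the certificate with openssl x509 -enddate. A sketch of just the decision logic:

```shell
# usage: days_left <expiry_epoch> <now_epoch>
# Whole days remaining before the certificate expires.
days_left() {
  echo $(( ($1 - $2) / 86400 ))
}

# usage: needs_renewal <expiry_epoch> <now_epoch>
# Succeeds when fewer than 30 days remain.
needs_renewal() {
  [ "$(days_left "$1" "$2")" -lt 30 ]
}

# e.g. needs_renewal "$expiry" "$(date +%s)" && certbot renew --nginx
```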
Access Control
RBAC Configuration
# File: k8s/rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: digitalia-operator
rules:
  - apiGroups: [""]
    resources: ["pods", "services", "endpoints"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["apps"]
    resources: ["deployments", "replicasets"]
    verbs: ["get", "list", "watch", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: digitalia-operator-binding
subjects:
  - kind: User
    name: digitalia-ops
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: digitalia-operator
  apiGroup: rbac.authorization.k8s.io
PERFORMANCE TUNING
JVM Optimization
DVM Node JVM Settings
# File: k8s/dvm-deployment.yaml (env section)
# NOTE: the GC-logging flags below (-XX:+PrintGC*, -Xloggc, GC log
# rotation) are Java 8 style and were removed in Java 9+; on newer JVMs
# replace them with -Xlog:gc*:file=/logs/gc.log. Unauthenticated JMX
# must only ever be reachable from the trusted management VLAN.
- name: JAVA_OPTS
  value: |
    -Xmx8g
    -Xms8g
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=200
    -XX:+UseStringDeduplication
    -XX:+PrintGC
    -XX:+PrintGCDetails
    -XX:+PrintGCTimeStamps
    -Xloggc:/logs/gc.log
    -XX:+UseGCLogFileRotation
    -XX:NumberOfGCLogFiles=5
    -XX:GCLogFileSize=10M
    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.port=9999
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false
Database Optimization
PostgreSQL Configuration
# File: /etc/postgresql/14/main/postgresql.conf
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 64MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 256MB
min_wal_size = 2GB
max_wal_size = 8GB
-- Blockchain-specific indexes (run via psql, not in postgresql.conf)
CREATE INDEX CONCURRENTLY idx_blocks_timestamp ON blocks(timestamp);
CREATE INDEX CONCURRENTLY idx_transactions_hash ON transactions(hash);
CREATE INDEX CONCURRENTLY idx_transactions_block_hash ON transactions(block_hash);
Network Optimization
Kubernetes Network Policies
# File: k8s/network-policies.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: digitalia-dvm-policy
spec:
  podSelector:
    matchLabels:
      app: digitalia-dvm
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: digitalian-passport
      ports:
        - protocol: TCP
          port: 8545
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: postgres
      ports:
        - protocol: TCP
          port: 5432
TROUBLESHOOTING GUIDE
Common Issues & Solutions
1. DVM Node Not Starting
# Check logs
kubectl logs -f deployment/digitalia-dvm
# Common causes and solutions:
# - Insufficient memory: Increase memory limits
kubectl patch deployment digitalia-dvm -p '{"spec":{"template":{"spec":{"containers":[{"name":"dvm","resources":{"limits":{"memory":"16Gi"}}}]}}}}'
# - Corrupted blockchain data: Restore from backup
./scripts/restore-blockchain-data.sh
# - Network connectivity: Check service discovery
kubectl get endpoints digitalia-dvm
2. High Block Time
# Check validator status
curl -X POST http://digitalia-rpc:8545 \
-H "Content-Type: application/json" \
-d '{"jsonrpc":"2.0","method":"digitalia_getValidatorInfo","params":[],"id":1}'
# Check network latency
kubectl exec -it digitalia-dvm-0 -- ping digitalia-dvm-1.digitalia-dvm.digitalia-production.svc.cluster.local
# Solutions:
# - Scale up validator nodes
kubectl scale statefulset digitalia-validators --replicas=6
# - Optimize consensus parameters
kubectl patch configmap digitalia-config -p '{"data":{"consensus.timeout":"5s"}}'
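The 30-second threshold here matches the HighBlockTime alert rule. Given a list of consecutive block timestamps, the mean interval can be computed locally while debugging (a sketch; timestamps are epoch seconds pulled from the RPC responses):

```shell
# usage: avg_block_time <t1> <t2> ... (at least two epoch timestamps)
# Prints the mean interval between consecutive blocks, in seconds.
avg_block_time() {
  echo "$@" | awk '{ for (i = 2; i <= NF; i++) s += $i - $(i-1); print s / (NF - 1) }'
}

# usage: block_time_high <t1> <t2> ...
# Succeeds when the mean interval exceeds the 30s alert threshold.
block_time_high() {
  awk -v t="$(avg_block_time "$@")" 'BEGIN { exit !(t > 30) }'
}

avg_block_time 0 10 20 30   # prints 10
```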
3. Database Connection Issues
# Check database connectivity
kubectl exec -it digitalia-dvm-0 -- pg_isready -h postgres -p 5432
# Check connection pool
kubectl exec -it postgres-0 -- psql -U digitalia -c "SELECT * FROM pg_stat_activity;"
# Solutions:
# - Increase connection pool size
kubectl patch configmap postgres-config -p '{"data":{"max_connections":"200"}}'
# - Restart database connections
kubectl rollout restart deployment/digitalia-dvm
Performance Issues
Memory Leaks
# Monitor JVM heap usage
kubectl exec -it digitalia-dvm-0 -- jcmd 1 GC.heap_info
kubectl exec -it digitalia-dvm-0 -- jcmd 1 GC.run
# Heap dump analysis
kubectl exec -it digitalia-dvm-0 -- jcmd 1 GC.heap_dump /tmp/heap.hprof
kubectl cp digitalia-dvm-0:/tmp/heap.hprof ./heap.hprof
# Analyze with Eclipse MAT or similar tool
Disk Space Issues
# Check disk usage
kubectl exec -it digitalia-dvm-0 -- df -h
kubectl top nodes
# Clean up old logs
kubectl exec -it digitalia-dvm-0 -- find /logs -name "*.log" -mtime +7 -delete
# Expand persistent volumes if needed
kubectl patch pvc blockchain-data -p '{"spec":{"resources":{"requests":{"storage":"2Ti"}}}}'
EMERGENCY PROCEDURES
Emergency Response Team
Primary On-Call: +1-555-0101 (DevOps Lead)
Secondary On-Call: +1-555-0102 (Infrastructure Manager)
Security Team: +1-555-0103 (Security Officer)
Management: +1-555-0104 (CTO)
Escalation Matrix:
Level 1: 0-15 minutes - Primary On-Call
Level 2: 15-30 minutes - Secondary On-Call + Security
Level 3: 30+ minutes - Management + Full Team
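The escalation matrix maps directly to a lookup that notification scripts can share, so every tool escalates on the same schedule (a sketch; the level numbers and minute boundaries are the ones above):

```shell
# usage: escalation_level <minutes_since_incident_start>
# Returns the escalation level per the matrix:
#   0-14 min -> 1 (Primary On-Call)
#  15-29 min -> 2 (Secondary On-Call + Security)
#    30+ min -> 3 (Management + Full Team)
escalation_level() {
  if   [ "$1" -lt 15 ]; then echo 1
  elif [ "$1" -lt 30 ]; then echo 2
  else                       echo 3
  fi
}

echo "escalation level after 20 minutes: $(escalation_level 20)"
# prints "escalation level after 20 minutes: 2"
```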
Critical System Failures
Complete System Outage
# 1. Immediate assessment
./scripts/emergency-status-check.sh
# 2. Activate emergency protocols
./scripts/activate-emergency-mode.sh
# 3. Notify stakeholders
./scripts/send-emergency-notification.sh "System outage in progress"
# 4. Begin recovery
./scripts/emergency-recovery.sh --from-backup --latest
# 5. Status updates every 15 minutes
./scripts/send-status-update.sh
Security Breach
# 1. Isolate affected systems
kubectl patch networkpolicy default-deny -p '{"spec":{"podSelector":{},"policyTypes":["Ingress","Egress"]}}'
# 2. Preserve evidence
./scripts/preserve-logs.sh --security-incident
# 3. Begin forensic analysis
./scripts/security-analysis.sh --immediate
# 4. Notify authorities if required
./scripts/breach-notification.sh
Communication Templates
Incident Notification Email
Subject: [CRITICAL] Digitalia System Incident - {{INCIDENT_ID}}
Dear Stakeholders,
We are currently experiencing a {{SEVERITY}} incident affecting {{AFFECTED_SYSTEMS}}.
Incident Details:
- Start Time: {{START_TIME}} UTC
- Impact: {{IMPACT_DESCRIPTION}}
- ETA for Resolution: {{ETA}}
We are actively working to resolve this issue and will provide updates every 30 minutes.
Next Update: {{NEXT_UPDATE_TIME}}
Digitalia Operations Team
MAINTENANCE TASKS
Daily Tasks
#!/bin/bash
# File: scripts/daily-maintenance.sh
# System health check
./scripts/health-check.sh --comprehensive
# Backup verification
./scripts/verify-backups.sh --yesterday
# Security scan
./scripts/security-scan.sh --quick
# Performance metrics review
./scripts/generate-daily-report.sh
Weekly Tasks
#!/bin/bash
# File: scripts/weekly-maintenance.sh
# Update system packages
kubectl apply -f k8s/system-updates.yaml
# Certificate renewal check
./scripts/check-certificates.sh --expiry-30-days
# Database maintenance
kubectl exec postgres-0 -- psql -U digitalia -c "VACUUM ANALYZE;"
# Log rotation and cleanup
./scripts/cleanup-logs.sh --older-than-7-days
Monthly Tasks
#!/bin/bash
# File: scripts/monthly-maintenance.sh
# Full security audit
./scripts/security-audit.sh --comprehensive
# Performance optimization review
./scripts/performance-review.sh --monthly
# Disaster recovery test
./scripts/test-dr-procedures.sh --non-disruptive
# Update documentation
./scripts/update-runbooks.sh
CONTACT INFORMATION & ESCALATION
Emergency Contacts
- Primary DevOps: John Smith - +1-555-0101 - john.smith@digitalia.org
- Infrastructure Lead: Jane Doe - +1-555-0102 - jane.doe@digitalia.org
- Security Officer: Bob Wilson - +1-555-0103 - bob.wilson@digitalia.org
- Database Admin: Alice Brown - +1-555-0104 - alice.brown@digitalia.org
Vendor Support
- Cloud Provider: AWS Enterprise Support - Case Priority 1
- Monitoring: Datadog Enterprise - 24/7 phone support
- Security: CrowdStrike - Emergency response team
External Services
- DNS: Cloudflare Enterprise - Emergency contact available
- CDN: AWS CloudFront - Included in AWS support
- Backup Storage: AWS S3 - Standard AWS support channels
This operations manual should be reviewed quarterly and updated as the infrastructure evolves. Always test procedures in staging before applying to production.