πŸ—οΈ DIGITALIA IT OPERATIONS MANUAL

Infrastructure & Operations Guide for IT Staff

πŸ“‹ TABLE OF CONTENTS

  1. System Overview
  2. Infrastructure Architecture
  3. Deployment Procedures
  4. Monitoring & Alerting
  5. Backup & Recovery
  6. Security Operations
  7. Performance Tuning
  8. Troubleshooting Guide
  9. Emergency Procedures
  10. Maintenance Tasks

πŸ—οΈ SYSTEM OVERVIEW

Infrastructure Stack

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚                    PRODUCTION ENVIRONMENT                       β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Load Balancer Layer (HAProxy/NGINX)                           β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚   Web Traffic   β”‚  β”‚   API Traffic   β”‚  β”‚  Admin Traffic  β”‚  β”‚
β”‚  β”‚  (Port 80/443)  β”‚  β”‚   (Port 8545)   β”‚  β”‚  (VPN Access)   β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Kubernetes Cluster (Production)                               β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚   DVM Nodes     β”‚  β”‚   Digitalian    β”‚  β”‚    Services     β”‚  β”‚
β”‚  β”‚  (3 replicas)   β”‚  β”‚  Passport (2x)  β”‚  β”‚ (API, Monitor)  β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Data Layer                                                     β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚   RocksDB       β”‚  β”‚   PostgreSQL    β”‚  β”‚    Redis        β”‚  β”‚
β”‚  β”‚ (Blockchain)    β”‚  β”‚  (Application)  β”‚  β”‚   (Cache)       β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Monitoring & Logging                                          β”‚
β”‚  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”  β”‚
β”‚  β”‚   Prometheus    β”‚  β”‚     Grafana     β”‚  β”‚  ELK Stack      β”‚  β”‚
β”‚  β”‚   (Metrics)     β”‚  β”‚  (Dashboards)   β”‚  β”‚   (Logs)        β”‚  β”‚
β”‚  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Resource Requirements

Production Cluster Specifications

Master Nodes (3x):
  CPU: 8 cores
  RAM: 32 GB
  Storage: 500 GB SSD
  Network: 10 Gbps

Worker Nodes (6x):
  CPU: 16 cores  
  RAM: 64 GB
  Storage: 1 TB NVMe SSD
  Network: 10 Gbps

Database Nodes (3x):
  CPU: 12 cores
  RAM: 128 GB
  Storage: 2 TB NVMe SSD (RAID 10)
  Network: 10 Gbps
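
For capacity planning, the aggregate figures implied by the specifications above can be computed with a short shell sketch (node counts and per-node values are taken directly from the table; adjust if the fleet changes):

```shell
#!/bin/sh
# Aggregate cluster capacity from the per-node specs above.
MASTERS=3;  MASTER_CPU=8;  MASTER_RAM=32
WORKERS=6;  WORKER_CPU=16; WORKER_RAM=64
DB_NODES=3; DB_CPU=12;     DB_RAM=128

TOTAL_CPU=$(( MASTERS*MASTER_CPU + WORKERS*WORKER_CPU + DB_NODES*DB_CPU ))
TOTAL_RAM=$(( MASTERS*MASTER_RAM + WORKERS*WORKER_RAM + DB_NODES*DB_RAM ))

echo "Total vCPU: ${TOTAL_CPU}"     # 24 + 96 + 36  = 156
echo "Total RAM:  ${TOTAL_RAM} GB"  # 96 + 384 + 384 = 864
```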

Network Configuration

# Network Topology
Production VLAN: 10.0.0.0/16
  - Kubernetes Cluster: 10.0.1.0/24
  - Database Cluster: 10.0.2.0/24
  - Monitoring: 10.0.3.0/24
  - Management: 10.0.4.0/24

Staging VLAN: 10.1.0.0/16
  - Mirror of production for testing

Development VLAN: 10.2.0.0/16
  - Developer environments
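
When tracing traffic, it helps to map an address back to its segment. A minimal lookup sketch over the topology above (the `vlan_role` helper is illustrative, not a real tool; prefixes mirror the plan):

```shell
# vlan_role IP — map an address onto the network topology above.
# Illustrative helper only; patterns follow the /24 and /16 allocations.
vlan_role() {
  case "$1" in
    10.0.1.*) echo "Production: Kubernetes Cluster" ;;
    10.0.2.*) echo "Production: Database Cluster" ;;
    10.0.3.*) echo "Production: Monitoring" ;;
    10.0.4.*) echo "Production: Management" ;;
    10.0.*)   echo "Production VLAN (unallocated)" ;;
    10.1.*)   echo "Staging VLAN" ;;
    10.2.*)   echo "Development VLAN" ;;
    *)        echo "Unknown segment" ;;
  esac
}

vlan_role 10.0.2.15   # Production: Database Cluster
```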

πŸ—οΈ INFRASTRUCTURE ARCHITECTURE

Kubernetes Cluster Setup

Master Node Configuration

# Initialize cluster (on first master)
sudo kubeadm init --pod-network-cidr=10.244.0.0/16 \
  --service-cidr=10.96.0.0/12 \
  --apiserver-advertise-address=10.0.1.10

# Install CNI (Flannel)
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml

# Join additional masters
sudo kubeadm join 10.0.1.10:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash> \
  --control-plane --certificate-key <key>

Worker Node Configuration

# Join worker nodes
sudo kubeadm join 10.0.1.10:6443 --token <token> \
  --discovery-token-ca-cert-hash sha256:<hash>

# Label nodes for specific workloads
kubectl label nodes worker-1 node-role.kubernetes.io/dvm=true
kubectl label nodes worker-2 node-role.kubernetes.io/dvm=true
kubectl label nodes worker-3 node-role.kubernetes.io/frontend=true
kubectl label nodes worker-4 node-role.kubernetes.io/database=true

Storage Classes

# File: k8s/storage-classes.yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: fast-ssd
provisioner: ebs.csi.aws.com   # gp3 with custom IOPS/throughput requires the EBS CSI driver
parameters:
  type: gp3
  iops: "3000"
  throughput: "125"
allowVolumeExpansion: true
---
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: blockchain-storage
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "16000"
  throughput: "1000"
allowVolumeExpansion: true

Load Balancer Configuration

HAProxy Configuration

# File: /etc/haproxy/haproxy.cfg
global
    daemon
    log stdout local0
    chroot /var/lib/haproxy
    stats socket /run/haproxy/admin.sock mode 660 level admin

defaults
    mode http
    timeout connect 5000ms
    timeout client 50000ms
    timeout server 50000ms
    option httplog

frontend digitalia_frontend
    bind *:80
    bind *:443 ssl crt /etc/ssl/certs/digitalia.org.pem
    redirect scheme https if !{ ssl_fc }

    # Route to appropriate backend
    acl is_api path_beg /api
    acl is_rpc path_beg /rpc

    use_backend api_backend if is_api
    use_backend rpc_backend if is_rpc
    default_backend web_backend

backend web_backend
    balance roundrobin
    server web1 10.0.1.20:80 check
    server web2 10.0.1.21:80 check

backend api_backend
    balance roundrobin
    server api1 10.0.1.30:8080 check
    server api2 10.0.1.31:8080 check

backend rpc_backend
    balance roundrobin
    server rpc1 10.0.1.40:8545 check
    server rpc2 10.0.1.41:8545 check

Database Cluster Setup

PostgreSQL High Availability

# Primary database setup
# File: /etc/postgresql/14/main/postgresql.conf
listen_addresses = '*'
wal_level = replica
max_wal_senders = 3
max_replication_slots = 3
archive_mode = on
archive_command = 'cp %p /var/lib/postgresql/archive/%f'

# Standby configuration (PostgreSQL 12+: standby_mode no longer exists;
# an empty standby.signal file in the data directory enables standby mode)
touch /var/lib/postgresql/14/main/standby.signal

# Add to the standby's postgresql.conf:
primary_conninfo = 'host=10.0.2.10 port=5432 user=replicator'
restore_command = 'cp /var/lib/postgresql/archive/%f %p'

πŸš€ DEPLOYMENT PROCEDURES

Production Deployment Checklist

Pre-Deployment (T-24 hours)

# 1. Backup current state
./scripts/backup-production.sh --full

# 2. Verify staging environment
./scripts/verify-staging.sh --comprehensive

# 3. Run security scan
./scripts/security-scan.sh --production

# 4. Check resource availability
kubectl top nodes
kubectl top pods -A

# 5. Validate Mainnet RevenueRouter deployment record mode
cd digitalia-contracts
npm run validate:mainnet-planned
npm run validate:revenue-policy
npm run validate:service-pricing
npm run validate:revenue-production

Deployment record mode policy:

  • Pre-launch phase: validate:mainnet-planned is the required gate.
  • Post-launch phase: switch to validate:mainnet-deployment and require strict pass.
  • Do not mark production-ready if strict mode is required and failing.

Deployment Process

# 1. Enable maintenance mode
kubectl apply -f k8s/maintenance-mode.yaml

# 2. Deploy new version
kubectl set image deployment/digitalia-dvm dvm=digitalia/dvm:v1.2.0
kubectl set image deployment/digitalian-passport passport=digitalia/passport:v1.2.0

# 3. Wait for rollout completion
kubectl rollout status deployment/digitalia-dvm --timeout=600s
kubectl rollout status deployment/digitalian-passport --timeout=600s

# 4. Run health checks
./scripts/health-check.sh --production

# 5. Disable maintenance mode
kubectl delete -f k8s/maintenance-mode.yaml

Rollback Procedure

# Emergency rollback
kubectl rollout undo deployment/digitalia-dvm
kubectl rollout undo deployment/digitalian-passport

# Verify rollback
kubectl rollout status deployment/digitalia-dvm
kubectl get pods -l app=digitalia-dvm

Blue-Green Deployment

Setup Blue-Green Infrastructure

# Create green environment
kubectl create namespace digitalia-green

# Deploy to green
kubectl apply -f k8s/ -n digitalia-green

# Test green environment
./scripts/test-environment.sh --namespace digitalia-green

# Switch traffic (update ingress)
kubectl patch ingress digitalia-ingress -p '{"spec":{"rules":[{"host":"digitalia.org","http":{"paths":[{"path":"/","pathType":"Prefix","backend":{"service":{"name":"digitalia-service-green","port":{"number":80}}}}]}}]}}'

πŸ“Š MONITORING & ALERTING

Prometheus Configuration

Metrics Collection

# File: monitoring/prometheus/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

rule_files:
  - "digitalia_rules.yml"

scrape_configs:
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true

  - job_name: 'digitalia-dvm'
    static_configs:
      - targets: ['digitalia-dvm:8080']
    metrics_path: /metrics
    scrape_interval: 10s

  - job_name: 'digitalian-passport'
    static_configs:
      - targets: ['digitalian-passport:3000']
    metrics_path: /metrics

Alert Rules

# File: monitoring/prometheus/digitalia_rules.yml
groups:
  - name: digitalia_alerts
    rules:
    - alert: DVMNodeDown
      expr: up{job="digitalia-dvm"} == 0
      for: 30s
      labels:
        severity: critical
      annotations:
        summary: "DVM node is down"
        description: "DVM node {{ $labels.instance }} has been down for more than 30 seconds"

    - alert: HighBlockTime
      expr: digitalia_block_time > 30
      for: 2m
      labels:
        severity: warning
      annotations:
        summary: "Block time is high"
        description: "Block time is {{ $value }} seconds"

    - alert: ValidatorOffline
      expr: digitalia_active_validators < 4500
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "Too many validators offline"
        description: "Only {{ $value }} validators are active"

Grafana Dashboards

DVM Node Dashboard

{
  "dashboard": {
    "title": "Digitalia DVM Nodes",
    "panels": [
      {
        "title": "Block Height",
        "type": "stat",
        "targets": [
          {
            "expr": "digitalia_block_height",
            "legendFormat": "Block Height"
          }
        ]
      },
      {
        "title": "Transaction Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(digitalia_transactions_total[5m])",
            "legendFormat": "TPS"
          }
        ]
      },
      {
        "title": "Memory Usage",
        "type": "graph",
        "targets": [
          {
            "expr": "process_resident_memory_bytes{job=\"digitalia-dvm\"}",
            "legendFormat": "Memory Usage"
          }
        ]
      }
    ]
  }
}

Log Management

ELK Stack Configuration

# File: monitoring/elasticsearch/elasticsearch.yml
cluster.name: digitalia-logs
network.host: 0.0.0.0
discovery.seed_hosts: ["elasticsearch-1", "elasticsearch-2", "elasticsearch-3"]
cluster.initial_master_nodes: ["elasticsearch-1", "elasticsearch-2", "elasticsearch-3"]

# Index template for Digitalia logs (apply via the _index_template API,
# e.g. from Kibana Dev Tools or curl)
PUT _index_template/digitalia-logs
{
  "index_patterns": ["digitalia-*"],
  "template": {
    "settings": {
      "number_of_shards": 3,
      "number_of_replicas": 1,
      "index.lifecycle.name": "digitalia-policy"
    }
  }
}

Logstash Pipeline

# File: monitoring/logstash/pipeline/digitalia.conf
input {
  beats {
    port => 5044
  }
}

filter {
  if [fields][service] == "digitalia-dvm" {
    grok {
      match => { "message" => "%{TIMESTAMP_ISO8601:timestamp} %{LOGLEVEL:level} %{GREEDYDATA:message}" }
    }

    if [level] == "ERROR" {
      mutate {
        add_tag => ["alert"]
      }
    }
  }
}

output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    index => "digitalia-%{+YYYY.MM.dd}"
  }
}

πŸ’Ύ BACKUP & RECOVERY

Backup Strategy

Daily Backup Schedule

#!/bin/bash
# File: scripts/daily-backup.sh

# Blockchain data backup (create the archive in the pod, then copy it out)
kubectl exec -n digitalia-production digitalia-dvm-0 -- \
  tar czf /tmp/blockchain-backup-$(date +%Y%m%d).tar.gz /data/blockchain
kubectl cp digitalia-production/digitalia-dvm-0:/tmp/blockchain-backup-$(date +%Y%m%d).tar.gz \
  /backups/blockchain-backup-$(date +%Y%m%d).tar.gz

# Database backup
kubectl exec -n digitalia-production postgres-0 -- \
  pg_dump -U digitalia digitalia_db | gzip > /backups/db-backup-$(date +%Y%m%d).sql.gz

# Configuration backup (note: this dump includes Secrets in base64 β€” store it encrypted)
kubectl get all,configmaps,secrets -n digitalia-production -o yaml > \
  /backups/k8s-config-$(date +%Y%m%d).yaml

# Upload to remote storage
aws s3 sync /backups/ s3://digitalia-backups/daily/$(date +%Y%m%d)/

Recovery Procedures

#!/bin/bash
# File: scripts/restore-production.sh

BACKUP_DATE=$1

# Stop services
kubectl scale deployment digitalia-dvm --replicas=0
kubectl scale deployment digitalian-passport --replicas=0

# Restore blockchain data
aws s3 cp s3://digitalia-backups/daily/${BACKUP_DATE}/blockchain-backup-${BACKUP_DATE}.tar.gz /tmp/
kubectl cp /tmp/blockchain-backup-${BACKUP_DATE}.tar.gz digitalia-dvm-0:/tmp/
kubectl exec digitalia-dvm-0 -- tar xzf /tmp/blockchain-backup-${BACKUP_DATE}.tar.gz -C /

# Restore database
aws s3 cp s3://digitalia-backups/daily/${BACKUP_DATE}/db-backup-${BACKUP_DATE}.sql.gz /tmp/
gunzip /tmp/db-backup-${BACKUP_DATE}.sql.gz
# -i forwards the local file on stdin into the pod
kubectl exec -i postgres-0 -- psql -U digitalia digitalia_db < /tmp/db-backup-${BACKUP_DATE}.sql

# Restart services
kubectl scale deployment digitalia-dvm --replicas=3
kubectl scale deployment digitalian-passport --replicas=2

Disaster Recovery

RTO/RPO Targets

  • Recovery Time Objective (RTO): 4 hours
  • Recovery Point Objective (RPO): 1 hour
  • Maximum Tolerable Downtime: 8 hours

DR Site Configuration

# Standby cluster configuration
REGION_PRIMARY="us-east-1"
REGION_DR="us-west-2"

# Replicate data to DR site
aws s3 sync s3://digitalia-backups-primary/ s3://digitalia-backups-dr/ --region $REGION_DR

# Maintain warm standby
kubectl apply -f k8s/dr-cluster.yaml --context=digitalia-dr

πŸ”’ SECURITY OPERATIONS

Security Monitoring

Intrusion Detection

# Install and configure Falco
helm install falco falcosecurity/falco \
  --set falco.grpc.enabled=true \
  --set falco.grpcOutput.enabled=true

# Custom rules for Digitalia
# File: /etc/falco/falco_rules.local.yaml
- rule: Unauthorized Network Connection from DVM
  desc: Detect network connections from DVM to unauthorized hosts
  condition: >
    (container.name contains "digitalia-dvm") and
    (fd.type = ipv4 or fd.type = ipv6) and
    not (fd.net in ("10.0.0.0/16", "127.0.0.0/8"))
  output: >
    Unauthorized network connection from DVM
    (command=%proc.cmdline connection=%fd.name user=%user.name)
  priority: WARNING

Vulnerability Scanning

# Container image scanning
trivy image digitalia/dvm:latest
trivy image digitalia/passport:latest

# Kubernetes cluster scanning
kube-bench run --targets master,node,etcd,policies

# Network security scanning (SYN scan and OS detection require root)
sudo nmap -sS -O 10.0.1.0/24

Certificate Management

SSL Certificate Renewal

#!/bin/bash
# File: scripts/renew-certificates.sh

# Renew Let's Encrypt certificates
certbot renew --nginx

# Update Kubernetes secrets
kubectl create secret tls digitalia-tls \
  --cert=/etc/letsencrypt/live/digitalia.org/fullchain.pem \
  --key=/etc/letsencrypt/live/digitalia.org/privkey.pem \
  --dry-run=client -o yaml | kubectl apply -f -

# Reload ingress controller
kubectl rollout restart deployment/nginx-ingress-controller
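
Before relying on the renewal script, it is worth checking how close a certificate actually is to expiry; `openssl x509 -checkend` exits non-zero when the certificate expires within the given number of seconds. A sketch (the `cert_expires_within` wrapper is illustrative; the path follows the Let's Encrypt layout above):

```shell
# cert_expires_within CERT_FILE DAYS
# Exit 0 if the certificate expires within DAYS days.
cert_expires_within() {
  cert=$1; days=$2
  ! openssl x509 -checkend $(( days * 86400 )) -noout -in "$cert"
}

# Example: trigger renewal early.
# if cert_expires_within /etc/letsencrypt/live/digitalia.org/fullchain.pem 30; then
#   echo "Certificate expires within 30 days; run renewal now"
# fi
```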

Access Control

RBAC Configuration

# File: k8s/rbac.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: digitalia-operator
rules:
- apiGroups: [""]
  resources: ["pods", "services", "endpoints"]
  verbs: ["get", "list", "watch"]
- apiGroups: ["apps"]
  resources: ["deployments", "replicasets"]
  verbs: ["get", "list", "watch", "update", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: digitalia-operator-binding
subjects:
- kind: User
  name: digitalia-ops
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: digitalia-operator
  apiGroup: rbac.authorization.k8s.io

⚑ PERFORMANCE TUNING

JVM Optimization

DVM Node JVM Settings

# File: k8s/dvm-deployment.yaml (env section)
# Note: the -XX:+PrintGC* and -Xloggc flags were removed in JDK 9;
# unified logging (-Xlog) replaces them. Keep the unauthenticated JMX
# port reachable from the management VLAN only.
- name: JAVA_OPTS
  value: |
    -Xmx8g
    -Xms8g
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=200
    -XX:+UseStringDeduplication
    -Xlog:gc*:file=/logs/gc.log:time,uptime:filecount=5,filesize=10M
    -Dcom.sun.management.jmxremote
    -Dcom.sun.management.jmxremote.port=9999
    -Dcom.sun.management.jmxremote.authenticate=false
    -Dcom.sun.management.jmxremote.ssl=false

Database Optimization

PostgreSQL Configuration

# File: /etc/postgresql/14/main/postgresql.conf
shared_buffers = 8GB
effective_cache_size = 24GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 64MB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 256MB
min_wal_size = 2GB
max_wal_size = 8GB

-- Blockchain-specific indexes
CREATE INDEX CONCURRENTLY idx_blocks_timestamp ON blocks(timestamp);
CREATE INDEX CONCURRENTLY idx_transactions_hash ON transactions(hash);
CREATE INDEX CONCURRENTLY idx_transactions_block_hash ON transactions(block_hash);

Network Optimization

Kubernetes Network Policies

# File: k8s/network-policies.yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: digitalia-dvm-policy
spec:
  podSelector:
    matchLabels:
      app: digitalia-dvm
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: digitalian-passport
    ports:
    - protocol: TCP
      port: 8545
  egress:
  - to:
    - podSelector:
        matchLabels:
          app: postgres
    ports:
    - protocol: TCP
      port: 5432

πŸ”§ TROUBLESHOOTING GUIDE

Common Issues & Solutions

1. DVM Node Not Starting

# Check logs
kubectl logs -f deployment/digitalia-dvm

# Common causes and solutions:
# - Insufficient memory: Increase memory limits
kubectl patch deployment digitalia-dvm -p '{"spec":{"template":{"spec":{"containers":[{"name":"dvm","resources":{"limits":{"memory":"16Gi"}}}]}}}}'

# - Corrupted blockchain data: Restore from backup
./scripts/restore-blockchain-data.sh

# - Network connectivity: Check service discovery
kubectl get endpoints digitalia-dvm

2. High Block Time

# Check validator status
curl -X POST http://digitalia-rpc:8545 \
  -H "Content-Type: application/json" \
  -d '{"jsonrpc":"2.0","method":"digitalia_getValidatorInfo","params":[],"id":1}'

# Check network latency
kubectl exec -it digitalia-dvm-0 -- ping digitalia-dvm-1.digitalia-dvm.digitalia-production.svc.cluster.local

# Solutions:
# - Scale up validator nodes
kubectl scale statefulset digitalia-validators --replicas=6

# - Optimize consensus parameters
kubectl patch configmap digitalia-config -p '{"data":{"consensus.timeout":"5s"}}'

3. Database Connection Issues

# Check database connectivity
kubectl exec -it digitalia-dvm-0 -- pg_isready -h postgres -p 5432

# Check connection pool
kubectl exec -it postgres-0 -- psql -U digitalia -c "SELECT * FROM pg_stat_activity;"

# Solutions:
# - Increase connection pool size
kubectl patch configmap postgres-config -p '{"data":{"max_connections":"200"}}'

# - Restart database connections
kubectl rollout restart deployment/digitalia-dvm

Performance Issues

Memory Leaks

# Monitor JVM heap usage
kubectl exec -it digitalia-dvm-0 -- jcmd 1 GC.heap_info
kubectl exec -it digitalia-dvm-0 -- jcmd 1 GC.run

# Heap dump analysis
kubectl exec -it digitalia-dvm-0 -- jcmd 1 GC.heap_dump /tmp/heap.hprof
kubectl cp digitalia-dvm-0:/tmp/heap.hprof ./heap.hprof
kubectl cp digitalia-dvm-0:/tmp/heap.hprof ./heap.hprof

# Analyze with Eclipse MAT or similar tool

Disk Space Issues

# Check disk usage
kubectl exec -it digitalia-dvm-0 -- df -h
kubectl top nodes

# Clean up old logs
kubectl exec -it digitalia-dvm-0 -- find /logs -name "*.log" -mtime +7 -delete

# Expand persistent volumes if needed
kubectl patch pvc blockchain-data -p '{"spec":{"resources":{"requests":{"storage":"2Ti"}}}}'
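
Expanding the PVC is the last resort; a cheap threshold check can catch disk pressure first. A sketch (the `disk_usage_pct` helper and the 80% threshold are illustrative):

```shell
# disk_usage_pct MOUNTPOINT β€” print the filesystem's used% as a bare number.
disk_usage_pct() {
  df -P "$1" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

# Example: warn before the blockchain volume fills up.
# [ "$(disk_usage_pct /data/blockchain)" -gt 80 ] && echo "Disk pressure: expand PVC"
```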

🚨 EMERGENCY PROCEDURES

Emergency Response Team

Primary On-Call: +1-555-0101 (DevOps Lead)
Secondary On-Call: +1-555-0102 (Infrastructure Manager)
Security Team: +1-555-0103 (Security Officer)
Management: +1-555-0104 (CTO)

Escalation Matrix:
Level 1: 0-15 minutes - Primary On-Call
Level 2: 15-30 minutes - Secondary On-Call + Security
Level 3: 30+ minutes - Management + Full Team
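
The escalation matrix above maps mechanically onto elapsed incident time, which makes it easy to automate in a paging script. A minimal sketch (the `escalation_level` function is an illustrative name):

```shell
# escalation_level MINUTES_ELAPSED
# Maps elapsed incident time onto the escalation matrix above.
escalation_level() {
  if   [ "$1" -lt 15 ]; then echo "Level 1: Primary On-Call"
  elif [ "$1" -lt 30 ]; then echo "Level 2: Secondary On-Call + Security"
  else                       echo "Level 3: Management + Full Team"
  fi
}

escalation_level 20   # Level 2: Secondary On-Call + Security
```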

Critical System Failures

Complete System Outage

# 1. Immediate assessment
./scripts/emergency-status-check.sh

# 2. Activate emergency protocols
./scripts/activate-emergency-mode.sh

# 3. Notify stakeholders
./scripts/send-emergency-notification.sh "System outage in progress"

# 4. Begin recovery
./scripts/emergency-recovery.sh --from-backup --latest

# 5. Status updates every 15 minutes
./scripts/send-status-update.sh

Security Breach

# 1. Isolate affected systems
kubectl patch networkpolicy default-deny -p '{"spec":{"podSelector":{},"policyTypes":["Ingress","Egress"]}}'

# 2. Preserve evidence
./scripts/preserve-logs.sh --security-incident

# 3. Begin forensic analysis
./scripts/security-analysis.sh --immediate

# 4. Notify authorities if required
./scripts/breach-notification.sh

Communication Templates

Incident Notification Email

Subject: [CRITICAL] Digitalia System Incident - {{INCIDENT_ID}}

Dear Stakeholders,

We are currently experiencing a {{SEVERITY}} incident affecting {{AFFECTED_SYSTEMS}}.

Incident Details:
- Start Time: {{START_TIME}} UTC
- Impact: {{IMPACT_DESCRIPTION}}
- ETA for Resolution: {{ETA}}

We are actively working to resolve this issue and will provide updates every 30 minutes.

Next Update: {{NEXT_UPDATE_TIME}}

Digitalia Operations Team
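
The {{PLACEHOLDER}} fields in the template above can be filled automatically when the notification script fires. A minimal `sed`-based sketch (the `render_notification` helper and the sample values are illustrative; values containing `|` or `&` would need escaping):

```shell
# render_notification INCIDENT_ID SEVERITY AFFECTED_SYSTEMS
# Fill the template's {{PLACEHOLDER}} fields, reading the template on stdin.
render_notification() {
  sed -e "s|{{INCIDENT_ID}}|$1|g" \
      -e "s|{{SEVERITY}}|$2|g" \
      -e "s|{{AFFECTED_SYSTEMS}}|$3|g"
}

# Example (hypothetical incident values):
printf 'Subject: [CRITICAL] Digitalia System Incident - {{INCIDENT_ID}}\nSeverity: {{SEVERITY}}\n' \
  | render_notification INC-0001 critical "API, DVM"
```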

πŸ”„ MAINTENANCE TASKS

Daily Tasks

#!/bin/bash
# File: scripts/daily-maintenance.sh

# System health check
./scripts/health-check.sh --comprehensive

# Backup verification
./scripts/verify-backups.sh --yesterday

# Security scan
./scripts/security-scan.sh --quick

# Performance metrics review
./scripts/generate-daily-report.sh

Weekly Tasks

#!/bin/bash
# File: scripts/weekly-maintenance.sh

# Update system packages
kubectl apply -f k8s/system-updates.yaml

# Certificate renewal check
./scripts/check-certificates.sh --expiry-30-days

# Database maintenance
kubectl exec postgres-0 -- psql -U digitalia -c "VACUUM ANALYZE;"

# Log rotation and cleanup
./scripts/cleanup-logs.sh --older-than-7-days
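
The `--older-than-7-days` cleanup above boils down to a `find -mtime` invocation. A sketch of the underlying helper (the `purge_old_logs` name is illustrative):

```shell
# purge_old_logs DIR DAYS β€” delete *.log files older than DAYS days.
purge_old_logs() {
  dir=$1; days=$2
  find "$dir" -name '*.log' -type f -mtime +"$days" -print -delete
}

# Example: purge_old_logs /logs 7
```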

Monthly Tasks

#!/bin/bash
# File: scripts/monthly-maintenance.sh

# Full security audit
./scripts/security-audit.sh --comprehensive

# Performance optimization review
./scripts/performance-review.sh --monthly

# Disaster recovery test
./scripts/test-dr-procedures.sh --non-disruptive

# Update documentation
./scripts/update-runbooks.sh

πŸ“ž CONTACT INFORMATION & ESCALATION

Emergency Contacts

  • Primary DevOps: John Smith - +1-555-0101 - john.smith@digitalia.org
  • Infrastructure Lead: Jane Doe - +1-555-0102 - jane.doe@digitalia.org
  • Security Officer: Bob Wilson - +1-555-0103 - bob.wilson@digitalia.org
  • Database Admin: Alice Brown - +1-555-0104 - alice.brown@digitalia.org

Vendor Support

  • Cloud Provider: AWS Enterprise Support - Case Priority 1
  • Monitoring: Datadog Enterprise - 24/7 phone support
  • Security: CrowdStrike - Emergency response team

External Services

  • DNS: Cloudflare Enterprise - Emergency contact available
  • CDN: AWS CloudFront - Included in AWS support
  • Backup Storage: AWS S3 - Standard AWS support channels

🎯 This operations manual should be reviewed quarterly and updated as the infrastructure evolves. Always test procedures in staging before applying to production.