
Scaling Troubleshooting

Systematic approach to diagnosing and resolving NodeGroup scaling issues.

Scale-Up Issues

Resource Limitations

Symptoms:

  • Scaling operation fails with "insufficient quota" error
  • NodeGroup stuck in "Scaling" status
  • Error messages indicating resource unavailability

Common Causes:

  • vCloud Quota Exhaustion: Compare current quota usage against the configured limits (see the check after this list)
  • Instance Type Unavailability: Verify that the requested instance type is available in the target region
  • Network Resource Depletion: Check available IP addresses in the NodeGroup's subnets

Solutions:

  • Request a quota increase through support
  • Try alternative instance types
  • Expand subnet CIDR ranges or add additional subnets
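
To confirm which cause applies, start from the error reported by the autoscaler; a minimal check, assuming the cluster-autoscaler deployment referenced under Essential Log Sources below (exact error strings vary by provider):

# Look for recent scale-up failures and their reasons
kubectl logs -n kube-system deployment/cluster-autoscaler --tail=200 | grep -iE "scale.?up|quota|insufficient"

# Cluster events often carry the provider-side error message
kubectl get events --all-namespaces --sort-by=.lastTimestamp | grep -iE "quota|insufficient" | tail -n 20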

Node Health Issues

Symptoms:

  • New nodes appear but remain "NotReady"
  • Nodes fail health checks during bootstrap
  • Intermittent connectivity issues

Diagnostic Commands:

# Check node status and conditions
kubectl describe node NODE_NAME

# Review node system events
kubectl get events --field-selector involvedObject.name=NODE_NAME

# Check kubelet logs (run on the affected node; kubelet is a systemd service, not a pod)
journalctl -u kubelet --no-pager | tail -n 200

# Verify network connectivity
kubectl exec -it TEST_POD -- ping NODE_IP

Common Solutions:

  • Verify security group rules and network policies
  • Check cluster DNS configuration
  • Verify container registry access
  • Ensure the node has adequate disk space and memory (see the checks after this list)
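
A couple of quick checks for the DNS and resource items above, assuming CoreDNS carries the usual k8s-app=kube-dns label and the node exposes the standard pressure conditions:

# Cluster DNS pods must be healthy for new nodes to pass bootstrap checks
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Disk and memory pressure conditions reported by the node ("True" means pressure)
kubectl get node NODE_NAME -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")].status}{"\n"}'
kubectl get node NODE_NAME -o jsonpath='{.status.conditions[?(@.type=="MemoryPressure")].status}{"\n"}'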

Scale-Down Issues

Pod Eviction Problems

Symptoms:

  • Scale-down fails with pod eviction errors
  • Nodes refuse to drain
  • Scale-down timeouts during pod migration

Root Causes:

  • Pod Disruption Budget Violations: Check PDB configurations
  • Persistent Storage Dependencies: Identify pods with local storage
  • StatefulSet Constraints: Check for StatefulSets preventing eviction

Diagnostic Commands:

# Check PDB configurations
kubectl get pdb --all-namespaces
kubectl describe pdb PDB_NAME -n NAMESPACE

# Identify persistent storage
kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,VOLUMES:.spec.volumes[*].name

# Check StatefulSets and DaemonSets
kubectl get statefulsets,daemonsets --all-namespaces
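
To confirm that a PDB, rather than storage or capacity, is what blocks eviction, check its currently allowed disruptions and dry-run the drain. This is a sketch; the drain flags assume DaemonSet pods can be skipped and emptyDir data discarded:

# 0 allowed disruptions means the PDB will block eviction right now
kubectl get pdb PDB_NAME -n NAMESPACE -o jsonpath='{.status.disruptionsAllowed}{"\n"}'

# Dry-run the drain to list the pods that would block it, without evicting anything
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data --dry-run=client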

Capacity Issues

Symptoms:

  • Remaining nodes lack capacity for migrated workloads
  • Pod scheduling failures during scale-down
  • Resource exhaustion on remaining nodes

Solutions:

  • Calculate remaining capacity before scale-down (see the sketch after this list)
  • Scale down one node at a time
  • Optimize pod resource requests and limits
  • Use node affinity rules for better distribution
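
A rough way to do the capacity calculation from the first item: compare the requests already allocated on each node that will remain with its allocatable resources, plus live usage if metrics-server is installed:

# Requests and limits already allocated on a node that will remain
kubectl describe node NODE_NAME | grep -A 8 "Allocated resources"

# Live usage (requires metrics-server)
kubectl top nodes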

Performance Issues

Application Instability

Symptoms:

  • Application errors during scaling operations
  • Performance degradation coinciding with scaling
  • Inconsistent behavior across nodes

Investigation:

# Check application pod status
kubectl get pods -l app=APP_NAME -o wide

# Review application logs
kubectl logs -l app=APP_NAME --tail=100

# Monitor application metrics
kubectl top pods -l app=APP_NAME

Common Issues:

  • Service discovery problems (see the checks after this list)
  • Load balancer configuration issues
  • Session affinity settings
  • Database connection pool limits
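
Quick checks for the service discovery and session affinity items; SERVICE_NAME and NAMESPACE are placeholders, as elsewhere in this guide:

# A Service with no endpoints fails discovery even if its pods look healthy
kubectl get endpoints SERVICE_NAME -n NAMESPACE

# Session affinity configured on the Service (None or ClientIP)
kubectl get service SERVICE_NAME -n NAMESPACE -o jsonpath='{.spec.sessionAffinity}{"\n"}'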

Diagnostic Approach

Step-by-Step Diagnosis

  1. Initial Assessment: Correlate the timing of scaling operations with when errors first appeared (see the commands after this list)
  2. Resource Status: Check cluster and NodeGroup health
  3. Error Collection: Gather logs from multiple sources
  4. Impact Analysis: Assess scope of affected services
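
For the initial assessment, the commands below give a quick event timeline and status snapshot to correlate against the scaling operation:

# Recent cluster events in chronological order
kubectl get events --all-namespaces --sort-by=.lastTimestamp | tail -n 50

# Node and pod status at a glance
kubectl get nodes -o wide
kubectl get pods --all-namespaces --field-selector=status.phase!=Running | head -n 30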

Essential Log Sources

# NodeGroup controller logs
kubectl logs -n kube-system deployment/cluster-autoscaler

# Node health problems reported by node-problem-detector
kubectl logs -n kube-system daemonset/node-problem-detector

# Kubelet logs (run on the node)
journalctl -u kubelet -f

# Cloud provider integration
kubectl logs -n kube-system deployment/cloud-controller-manager

Network Troubleshooting

# Test inter-node communication
kubectl run test-pod --image=busybox --rm -it -- ping NODE_IP

# Check DNS resolution
kubectl run test-dns --image=busybox --rm -it -- nslookup kubernetes.default

# Verify service connectivity
kubectl run test-service --image=curlimages/curl --rm -it --command -- curl http://SERVICE_NAME.NAMESPACE.svc.cluster.local

Resolution Strategies

Immediate Actions

  • Rollback Scaling: Revert to previous stable node count
  • Manual Intervention: Manually provision resources if automation fails
  • Traffic Redirection: Route traffic to healthy nodes
  • Resource Reallocation: Move critical workloads to stable nodes (see the sketch after this list)
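
For traffic redirection and resource reallocation, cordoning and then draining the affected node is usually the least invasive first step; a minimal sketch, with the same drain caveats as above:

# Stop scheduling new pods onto the unhealthy node
kubectl cordon NODE_NAME

# Move its existing workloads onto the remaining healthy nodes
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data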

Preventive Measures

  • Early Warning Systems: Alerts before resource exhaustion
  • Capacity Planning: Regular capacity assessment
  • Testing Procedures: Regular scaling operation testing
  • Documentation: Maintain current troubleshooting procedures