Scaling Troubleshooting
Systematic approach to diagnosing and resolving NodeGroup scaling issues.
Scale-Up Issues
Resource Limitations
Symptoms:
- Scaling operation fails with "insufficient quota" error
- NodeGroup stuck in "Scaling" status
- Error messages indicating resource unavailability
Common Causes:
- vCloud Quota Exhaustion: Compare current quota usage against the configured limits
- Instance Type Unavailability: Verify that the requested instance type is available in the target region
- Network Resource Depletion: Check free IP addresses in the NodeGroup subnets
Solutions:
- Request a quota increase through support (confirm quota exhaustion first, as shown below)
- Try alternative instance types
- Expand subnet CIDR ranges or add additional subnets
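Before requesting a quota increase, it helps to confirm that quota exhaustion is actually what blocked the scale-up. A minimal check via cluster events (the exact error strings depend on your provider, and the subnet resource shown is a hypothetical provider CRD):
# Look for recent scale-up failures reported by the NodeGroup controller or autoscaler
kubectl get events -A --sort-by=.lastTimestamp | grep -iE 'quota|insufficient|failedscaleup'
# If IP exhaustion is suspected, inspect subnet usage
# ("subnets" is a placeholder; substitute your provider's subnet resource)
kubectl get subnets -o wide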
Node Health Issues
Symptoms:
- New nodes appear but remain "NotReady"
- Nodes fail health checks during bootstrap
- Intermittent connectivity issues
Diagnostic Commands:
# Check node status and conditions
kubectl describe node NODE_NAME
# Review node system events
kubectl get events --field-selector involvedObject.name=NODE_NAME
# Check kubelet logs (the kubelet runs as a systemd service on the node, not as a pod)
ssh NODE_IP journalctl -u kubelet --no-pager -n 200
# Verify network connectivity
kubectl exec -it TEST_POD -- ping NODE_IP
Common Solutions:
- Verify security group rules and network policies
- Check cluster DNS configuration
- Verify container registry access
- Ensure adequate disk space and memory (see the condition check below)
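Disk and memory pressure are reported as standard node conditions, so a single query can flag unhealthy nodes across the whole cluster:
# Summarize the key health conditions for every node
kubectl get nodes -o custom-columns='NAME:.metadata.name,READY:.status.conditions[?(@.type=="Ready")].status,DISK-PRESSURE:.status.conditions[?(@.type=="DiskPressure")].status,MEMORY-PRESSURE:.status.conditions[?(@.type=="MemoryPressure")].status'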
Scale-Down Issues
Pod Eviction Problems
Symptoms:
- Scale-down fails with pod eviction errors
- Nodes refuse to drain
- Scale-down timeouts during pod migration
Root Causes:
- Pod Disruption Budget Violations: Check PDB configurations
- Persistent Storage Dependencies: Identify pods with local storage
- StatefulSet Constraints: Check for StatefulSets preventing eviction
Diagnostic Commands:
# Check PDB configurations
kubectl get pdb --all-namespaces
kubectl describe pdb PDB_NAME -n NAMESPACE
# Identify persistent storage
kubectl get pods -o custom-columns=NAME:.metadata.name,NODE:.spec.nodeName,VOLUMES:.spec.volumes[*].name
# Check StatefulSets and DaemonSets
kubectl get statefulsets,daemonsets --all-namespaces
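When a PDB is blocking eviction, a server-side dry-run drain reports which pods it refuses to move without actually evicting anything:
# Preview the drain; blocked evictions are reported but not executed
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data --dry-run=server
# Run the real drain with a bounded wait so PDB deadlocks surface as errors
kubectl drain NODE_NAME --ignore-daemonsets --delete-emptydir-data --timeout=120s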
Capacity Issues
Symptoms:
- Remaining nodes lack capacity for migrated workloads
- Pod scheduling failures during scale-down
- Resource exhaustion on remaining nodes
Solutions:
- Calculate remaining capacity before scale-down (see the check after this list)
- Scale down one node at a time
- Optimize pod resource requests and limits
- Use node affinity rules for better distribution
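A rough capacity check before draining: compare the resource requests already scheduled on each remaining node against its allocatable capacity (kubectl top requires metrics-server):
# Requested vs allocatable resources per node
kubectl describe nodes | grep -A 8 'Allocated resources'
# Live utilization of the remaining nodes
kubectl top nodes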
Performance Issues
Application Instability
Symptoms:
- Application errors during scaling operations
- Performance degradation coinciding with scaling
- Inconsistent behavior across nodes
Investigation:
# Check application pod status
kubectl get pods -l app=APP_NAME -o wide
# Review application logs
kubectl logs -l app=APP_NAME --tail=100
# Monitor application metrics
kubectl top pods -l app=APP_NAME
Common Issues:
- Service discovery problems
- Load balancer configuration issues
- Session affinity settings
- Database connection pool limits
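A frequent cause of errors during scaling is traffic reaching pods on freshly provisioned nodes before the application is ready. A readiness probe keeps a pod out of Service endpoints until it responds; the sketch below assumes a hypothetical my-app Deployment serving /healthz on port 8080:
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app        # hypothetical application
spec:
  replicas: 3
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: registry.example.com/my-app:latest  # placeholder image
          ports:
            - containerPort: 8080
          readinessProbe:        # pod receives traffic only after this passes
            httpGet:
              path: /healthz
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
EOF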
Diagnostic Approach
Step-by-Step Diagnosis
- Initial Assessment: Correlate scaling operations with error occurrences (see the example after this list)
- Resource Status: Check cluster and NodeGroup health
- Error Collection: Gather logs from multiple sources
- Impact Analysis: Assess scope of affected services
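For the initial assessment, a time-sorted event stream is usually the fastest way to correlate scaling operations with failures:
# Cluster-wide events in chronological order
kubectl get events -A --sort-by=.lastTimestamp | tail -n 50
# Quick health overview: node status and any pods that are not running
kubectl get nodes -o wide
kubectl get pods -A --field-selector=status.phase!=Running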
Essential Log Sources
# NodeGroup controller logs
kubectl logs -n kube-system deployment/cluster-autoscaler
# Node problem detection logs
kubectl logs -n kube-system daemonset/node-problem-detector
# Kubelet status
journalctl -u kubelet -f
# Cloud provider integration
kubectl logs -n kube-system deployment/cloud-controller-manager
Network Troubleshooting
# Test inter-node communication
kubectl run test-pod --image=busybox --rm -it -- ping NODE_IP
# Check DNS resolution
kubectl run test-dns --image=busybox --rm -it -- nslookup kubernetes.default
# Verify service connectivity
kubectl run test-service --image=curlimages/curl --rm -it --restart=Never --command -- curl http://SERVICE_NAME.NAMESPACE.svc.cluster.local
Resolution Strategies
Immediate Actions
- Rollback Scaling: Revert to previous stable node count
- Manual Intervention: Manually provision resources if automation fails
- Traffic Redirection: Route traffic to healthy nodes
- Resource Reallocation: Move critical workloads to stable nodes (cordon the unhealthy node first, as shown below)
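Cordoning keeps new pods off a suspect node without disturbing the ones already running, which buys time for investigation or a controlled drain:
# Stop scheduling new pods onto the node; existing pods keep running
kubectl cordon NODE_NAME
# Re-enable scheduling once the node is healthy again
kubectl uncordon NODE_NAME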
Preventive Measures
- Early Warning Systems: Alerts before resource exhaustion
- Capacity Planning: Regular capacity assessment
- Testing Procedures: Regular scaling operation testing
- Documentation: Maintain current troubleshooting procedures
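As an example of an early-warning alert, the following Prometheus rule is a sketch that assumes kube-state-metrics v2 metric names; it fires when the CPU requested by pods approaches the cluster's total allocatable CPU:
# capacity-alerts.yaml (hypothetical rule file loaded by Prometheus)
groups:
  - name: capacity
    rules:
      - alert: ClusterCPURequestsHigh
        expr: |
          sum(kube_pod_container_resource_requests{resource="cpu"})
            / sum(kube_node_status_allocatable{resource="cpu"}) > 0.85
        for: 15m
        annotations:
          summary: "Requested CPU above 85% of allocatable; scale up before exhaustion"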