Troubleshooting Guide
Quick reference for diagnosing Azure issues.
Connectivity Issues
VM Can’t Reach Internet
Checklist:
- VM has public IP (or NAT gateway)
- Default route exists (0.0.0.0/0)
- Outbound NSG rule allows traffic (destination = Internet service tag, ports 443/80)
- No user-defined route overrides the default route (for example, 0.0.0.0/0 pointed at a firewall appliance)
Tools:
Network Watcher → IP Flow Verify → Select source VM, destination internet
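The same checks can be scripted. The sketch below is illustrative only: the function names are hypothetical helpers, MyNic and the private IP 10.0.0.4 are placeholder values, and 8.8.8.8 is just an example of a public address. It assumes Network Watcher is enabled in the VM's region.

```shell
# Ask Network Watcher whether the VM's NSGs allow outbound HTTPS.
# Arguments: resource group, VM name, VM private IP (placeholder values).
check_outbound_https() {
  az network watcher test-ip-flow \
    --resource-group "$1" --vm "$2" \
    --direction Outbound --protocol TCP \
    --local "$3:60000" --remote "8.8.8.8:443"
}

# List the routes actually applied to a NIC, including any 0.0.0.0/0 override.
# Arguments: resource group, NIC name.
show_effective_routes() {
  az network nic show-effective-route-table \
    --resource-group "$1" --name "$2" --output table
}
```

Example: check_outbound_https MyRG MyVM 10.0.0.4, then show_effective_routes MyRG MyNic.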
VM Can’t Talk to Another VM
Checklist:
- Both in same VNet (or peered)
- NSG on both sides (check inbound AND outbound)
- Private IP is correct
- Route exists (check route table)
Diagnose:
VM1 → Network Watcher → IP Flow Verify
Source: VM1 IP, Destination: VM2 IP, Port: 443
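Result: Allow/Deny, plus the name of the NSG rule that matched
If NSG blocks:
- Add an inbound allow rule to the NSG on VM2's subnet or NIC
- Give it a lower priority number than the rule that denied the traffic
- Source: VM1 IP or subnet
- Port: Whatever you need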
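A minimal sketch of adding such a rule from the CLI. The function name and the rule name are hypothetical, and priority 200 is an assumption: it must be unique within the NSG and numerically lower than the rule that currently denies the traffic.

```shell
# Allow inbound TCP from one source to one port on the NSG protecting VM2.
# Arguments: resource group, NSG name, source IP or CIDR, destination port.
allow_inbound_from() {
  az network nsg rule create \
    --resource-group "$1" --nsg-name "$2" \
    --name "Allow-Inbound-$4" --priority 200 \
    --direction Inbound --access Allow --protocol Tcp \
    --source-address-prefixes "$3" \
    --destination-port-ranges "$4"
}
```

Example: allow_inbound_from MyRG MyNSG 10.0.1.4 443. Re-run IP Flow Verify afterwards to confirm the result changes to Allow.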
RDP Connection Failed
Checklist:
- VM running?
- Public IP assigned (or use Bastion)?
- NSG allows port 3389 from your IP?
- Windows Firewall on VM allows RDP?
- Boot diagnostics successful (not hung)?
Troubleshooting Steps:
- Check VM status → Should be “Running”
- If VM stuck: Stop (deallocate) → Start
- If still won’t boot: Check boot diagnostics
- If network issue: Check NSG (port 3389 inbound)
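The steps above can be sketched with the Azure CLI. The function names are hypothetical helpers, and the boot log is only available when boot diagnostics is enabled on the VM.

```shell
# Print the VM power state, for example "VM running" or "VM deallocated".
# Arguments: resource group, VM name.
vm_power_state() {
  az vm get-instance-view --resource-group "$1" --name "$2" \
    --query "instanceView.statuses[?starts_with(code, 'PowerState/')].displayStatus | [0]" \
    --output tsv
}

# Deallocate, then start again. The VM may come back on different hardware.
vm_cold_restart() {
  az vm deallocate --resource-group "$1" --name "$2" &&
    az vm start --resource-group "$1" --name "$2"
}

# Dump the serial boot log to look for a hung or failed boot.
vm_boot_log() {
  az vm boot-diagnostics get-boot-log --resource-group "$1" --name "$2"
}
```

Example: vm_power_state MyRG MyVM. Note that deallocating releases a dynamic public IP, so the address may change after the restart.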
SSH Connection Failed
Checklist:
- SSH key correct?
- Port 22 allowed in NSG?
- Public IP assigned?
- SSH daemon running on VM?
Commands:
# Test connectivity
ssh -v azure-user@<public-ip>
# If port blocked:
# Add NSG rule: Inbound, port 22, from your IP
# If key issue:
# Check file permissions: chmod 400 key.pem
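Two of these checks are easy to automate locally. This is a rough sketch with hypothetical function names; it assumes nc (netcat) is installed for the port test, and treats modes 400 and 600 as acceptable for a private key.

```shell
# Return 0 if the private key is accessible by its owner only (mode 400 or 600).
# OpenSSH refuses private keys that group or others can read.
key_perms_ok() {
  # GNU stat uses -c, BSD/macOS stat uses -f; try both.
  mode=$(stat -c '%a' "$1" 2>/dev/null || stat -f '%Lp' "$1")
  [ "$mode" = "400" ] || [ "$mode" = "600" ]
}

# Return 0 if a TCP connection to port 22 succeeds within 5 seconds.
# A failure here points at the NSG, the public IP, or the SSH daemon.
ssh_port_open() {
  nc -z -w 5 "$1" 22
}
```

Example: key_perms_ok key.pem || chmod 400 key.pem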
Performance Issues
VM Running Slow
Check:
- CPU usage high? (Azure Monitor → CPU %)
- Memory high? (Azure Monitor → Available Memory Bytes; a memory % view needs guest-level monitoring such as VM insights)
- Disk I/O high? (Azure Monitor → Disk read/write)
- Network latency? (Network Watcher → Connection Troubleshoot)
Solutions:
High CPU: Resize to larger VM (more cores)
High Memory: Add more RAM (resize)
High Disk I/O: Premium SSD (faster disks)
High Latency: Move to closer region
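The CPU check and the resize can also be done from the CLI. A sketch with hypothetical function names; the target size passed to vm_resize is whatever vm_resize_options reports as available, and a resize normally restarts the VM.

```shell
# Average CPU in 5-minute buckets over the default window (the last hour).
# Argument: full resource ID of the VM.
vm_cpu_recent() {
  az monitor metrics list --resource "$1" \
    --metric "Percentage CPU" \
    --interval PT5M --aggregation Average \
    --output table
}

# Sizes currently available for this VM.
# Arguments: resource group, VM name.
vm_resize_options() {
  az vm list-vm-resize-options --resource-group "$1" --name "$2" --output table
}

# Resize the VM to a new size.
# Arguments: resource group, VM name, target size.
vm_resize() {
  az vm resize --resource-group "$1" --name "$2" --size "$3"
}
```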
Database Queries Slow
Check:
- Connection count maxed?
- Missing indexes?
- Large full table scan?
- Network latency to database?
Diagnose:
Azure Portal → SQL Database → Query Performance Insight
Shows: Slowest queries, resource-intensive queries
Storage Issues
Blob Upload Fails
Checklist:
- Storage account exists and accessible?
- Credentials correct (account key or SAS token)?
- Quota not exceeded?
- Network access allowed (firewall)?
Common Errors:
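401 (Unauthorized): Missing or invalid authentication info (for example, no token or an expired token)
403 (Forbidden): Authentication or authorization failed (wrong account key, bad or expired SAS, missing RBAC data role, or blocked by the storage firewall)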
404 (Not Found): Container doesn't exist
Timeout: Network issue or large file
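A sketch for working through the checklist from the CLI. Function names are hypothetical. It assumes you are signed in with az login and that your identity has a blob data role (such as Storage Blob Data Contributor) on the account, since --auth-mode login uses your identity rather than an account key.

```shell
# Print "true" or "false" depending on whether the container exists.
# Arguments: storage account name, container name.
container_exists() {
  az storage container exists --account-name "$1" --name "$2" \
    --auth-mode login --query exists --output tsv
}

# Print the account firewall's default action ("Allow" or "Deny").
# Arguments: resource group, storage account name.
storage_default_action() {
  az storage account show --resource-group "$1" --name "$2" \
    --query "networkRuleSet.defaultAction" --output tsv
}

# Upload one file, using its base name as the blob name.
# Arguments: storage account name, container name, local file path.
upload_blob() {
  az storage blob upload --account-name "$1" --container-name "$2" \
    --name "$(basename "$3")" --file "$3" --auth-mode login
}
```

If storage_default_action prints Deny, the caller's IP or subnet has to be allowed explicitly in the account's network rules.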
Storage Account Throttled
Symptoms: Requests fail with 503 (Server Busy) or 500 (Operation Timeout)
Solutions:
- Increase throughput:
  - Use a premium-performance storage account (the tier is fixed at creation, so this means migrating data to a new account)
  - Distribute load across multiple accounts
- Optimize:
  - Cache frequently accessed data
  - Batch operations (don’t do 1000 individual writes)
  - Use a CDN for static content
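Clients should also back off when they are throttled instead of retrying immediately. The function below is a generic sketch of exponential backoff in shell, not an Azure feature; retry_backoff is a hypothetical helper name. The Azure SDKs have their own retry policies, so this is mainly useful for scripts.

```shell
# Run a command, retrying with exponential backoff while it fails.
# Usage: retry_backoff MAX_ATTEMPTS INITIAL_DELAY_SECONDS command [args...]
retry_backoff() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$((delay * 2))        # double the wait before the next try
    attempt=$((attempt + 1))
  done
}
```

Example: retry_backoff 5 2 az storage blob upload ... waits 2, 4, 8, then 16 seconds between attempts before giving up.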
Permission Issues
User Can’t Delete Resource
Check:
- Role assigned at resource/RG level?
- Role includes “delete” permission?
- No resource lock (CanNotDelete or ReadOnly) on the resource, its resource group, or the subscription?
- Does the user hold the Owner role on the subscription?
Diagnose:
Azure Portal → Access Control (IAM)
→ Check Roles for user
→ Check if permission includes "delete"
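The same information is available from the CLI. A sketch with hypothetical function names; the assignee is a user principal name or object ID.

```shell
# Locks on a resource group. Either lock type blocks deletion.
# Argument: resource group.
list_locks() {
  az lock list --resource-group "$1" --output table
}

# Roles assigned to a user at any scope in the current subscription.
# Argument: user principal name or object ID.
list_roles_for() {
  az role assignment list --assignee "$1" --all \
    --query "[].{Role:roleDefinitionName, Scope:scope}" --output table
}

# Actions a role grants. Look for "*" or an entry ending in "/delete".
# Argument: role name, for example Contributor.
role_actions() {
  az role definition list --name "$1" \
    --query "[0].permissions[0].actions" --output json
}
```

Locks apply regardless of role, so even an Owner cannot delete a locked resource until the lock is removed.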
App Can’t Write to Database
Check:
- Database credentials correct?
- Database user exists?
- NSG allows traffic (port 1433 for SQL)?
- Firewall rule allows app IP?
- User has INSERT/UPDATE permission?
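For Azure SQL Database, the server-level firewall is a common culprit. A sketch assuming an Azure SQL logical server; function names and the rule name are hypothetical.

```shell
# List server-level firewall rules.
# Arguments: resource group, SQL server name.
sql_firewall_rules() {
  az sql server firewall-rule list \
    --resource-group "$1" --server "$2" --output table
}

# Allow a single client IP through the server firewall.
# Arguments: resource group, SQL server name, rule name, client IP.
allow_sql_client_ip() {
  az sql server firewall-rule create \
    --resource-group "$1" --server "$2" --name "$3" \
    --start-ip-address "$4" --end-ip-address "$4"
}
```

Example: allow_sql_client_ip MyRG MySqlServer AllowApp 203.0.113.10. INSERT/UPDATE permissions are granted inside the database itself and are not visible from these commands.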
Backup & Recovery
Backup Failed
Check:
- Backup vault configured?
- VM accessible (running)?
- Backup agent installed (if needed)?
- Storage space available?
- Network accessible (can reach backup endpoint)?
Solution:
Azure Portal → Backup vault → Protected items
→ Click VM → See failure reason
→ Common: VM not running, agent failed
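The failure reason can also be pulled from the CLI. A sketch with hypothetical function names; MyVault is a placeholder Recovery Services vault name.

```shell
# List failed backup jobs in a vault.
# Arguments: resource group, vault name.
failed_backup_jobs() {
  az backup job list --resource-group "$1" --vault-name "$2" \
    --status Failed --output table
}

# Show full details, including error information, for one job.
# Arguments: resource group, vault name, job name (from the list above).
backup_job_details() {
  az backup job show --resource-group "$1" --vault-name "$2" --name "$3"
}
```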
Can’t Restore from Backup
Issues:
Target VM deleted: Restore to new VM
Wrong region: Use cross-region restore
Quota exceeded: Delete old resources first
Disk conflicts: Rename disks during restore
Monitoring & Logging
No Data in Log Analytics
Check:
- Diagnostic settings configured?
- Data flowing to Log Analytics workspace?
- Workspace in correct region?
- Sufficient permissions to query?
Diagnose:
Resource → Diagnostic settings
→ Check "Send to Log Analytics"
→ Query: AzureDiagnostics | take 10
(should return data; resources in resource-specific mode log to their own tables instead)
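To see which tables are receiving anything at all, a workspace-wide row count can help. A sketch with a hypothetical function name; the argument is the workspace ID (the GUID shown on the workspace overview), not the full resource ID.

```shell
# Count rows per table over the last hour, busiest tables first.
# Argument: Log Analytics workspace ID (GUID).
rows_per_table() {
  az monitor log-analytics query --workspace "$1" \
    --analytics-query "union withsource=TableName * | where TimeGenerated > ago(1h) | summarize Rows = count() by TableName | order by Rows desc" \
    --output table
}
```

An empty result means nothing has been ingested in the last hour, which points back at diagnostic settings or the agent. New data can take several minutes to appear after a setting is created.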
Alert Not Firing
Check:
- Metric data present? (no data = no alert)
- Threshold correct? (metric not exceeding)
- Action group configured?
- Alert rule enabled?
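The last two items can be checked quickly from the CLI. A sketch for metric alerts only (log alerts use a different command group); the function names are hypothetical.

```shell
# Show each metric alert rule and whether it is enabled.
# Argument: resource group.
list_metric_alerts() {
  az monitor metrics alert list --resource-group "$1" \
    --query "[].{Name:name, Enabled:enabled, Severity:severity}" --output table
}

# Show each action group and whether it is enabled.
# Argument: resource group.
list_action_groups() {
  az monitor action-group list --resource-group "$1" \
    --query "[].{Name:name, Enabled:enabled}" --output table
}
```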
Networking Troubleshooting
Traceroute Path Wrong
Use Network Watcher:
Network Watcher → Next Hop
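Source: The VM and the private IP the traffic leaves from
Destination: The IP address you are trying to reach
Result: Next hop type (Internet, VirtualAppliance, VirtualNetworkGateway, VnetLocal, or None) and the route table that supplied the route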
If wrong:
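- Check the route table attached to the source subnet
- User-defined routes override Azure’s default system routes for the same prefix
- The most specific route (longest prefix match) wins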
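Next Hop is also available from the CLI. A sketch with a hypothetical function name; the IP addresses in the example are placeholders.

```shell
# Ask Network Watcher where a packet from the VM would be routed.
# Arguments: resource group, VM name, source private IP, destination IP.
next_hop() {
  az network watcher show-next-hop \
    --resource-group "$1" --vm "$2" \
    --source-ip "$3" --dest-ip "$4"
}
```

Example: next_hop MyRG MyVM 10.0.0.4 10.1.0.4. A next hop type of None means the traffic is dropped because no valid route exists.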
Common Error Codes
| Code | Meaning | Fix |
|---|---|---|
| 401 | Unauthorized | Check credentials |
| 403 | Forbidden | Check permissions/RBAC |
| 404 | Not Found | Resource doesn’t exist |
| 409 | Conflict | Resource already exists |
| 429 | Throttled | Too many requests, slow down |
| 503 | Unavailable | Service overloaded or maintenance |
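| 508 | Loop Detected | Server found an infinite loop while processing the request (a WebDAV status, not a network routing loop); check for circular references or redirects |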
Quick Diagnostic Commands
Azure CLI
# Check VM status
az vm list --resource-group MyRG --show-details --query "[].{Name:name, Status:powerState}"
# Check NSG rules
az network nsg rule list --resource-group MyRG --nsg-name MyNSG
# Check route table
az network route-table route list --resource-group MyRG --route-table-name MyRT
# Check RBAC assignments
az role assignment list --resource-group MyRG --query "[].{Role:roleDefinitionName, User:principalName}"
# Check diagnostics
az monitor diagnostic-settings list --resource /subscriptions/.../resourceGroups/MyRG/providers/Microsoft.Compute/virtualMachines/MyVM
Azure Portal
- Resource → Activity Log → See who did what when
- Resource → Diagnose and solve problems → Automated troubleshooting
- Monitor → Metrics → Check CPU, memory, disk
- Monitor → Logs → Query events
- Network Watcher → IP Flow Verify, Next Hop, Connection Troubleshoot
Escalation Checklist
Still stuck? Before calling support:
- Checked Azure Status (azure.status.microsoft)
- Verified credentials/permissions
- Checked NSG/firewall rules
- Reviewed activity log for errors
- Checked resource quotas
- Tried creating the resource in a different region
- Disabled extensions (if VM won’t start)
- Reproduced in separate resource
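For the quota item, current usage against limits can be listed per region. A sketch with hypothetical function names; eastus in the example is a placeholder region.

```shell
# Compute usage (cores, VMs) against subscription limits in a region.
# Argument: region name.
compute_usage() {
  az vm list-usage --location "$1" --output table
}

# Network usage (public IPs, VNets, NSGs) against limits in a region.
# Argument: region name.
network_usage() {
  az network list-usages --location "$1" --output table
}
```

Example: compute_usage eastus. Rows where the current value equals the limit are the ones to raise with support.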
Pro Tip: Enable diagnostic logs BEFORE you have problems. Troubleshooting is easier with historical data!