Programming lesson
Scaling Cloud Infrastructure: Cost Optimization and Virtualization for Distributed Systems
Learn how to optimize cloud costs and choose virtualization methods for distributed systems, using real-world scenarios from a game backend and a virtual pet platform.
Introduction to Cloud Scaling and Virtualization
In modern distributed systems, efficient resource allocation and cost management are critical. This tutorial explores two key aspects: cost optimization for scaling a game backend (based on Problem Set 2) and virtualization choices for a cloud-based virtual pet platform (Problem Set 3). By understanding these concepts, you'll be better equipped to design scalable, cost-effective systems.
Part 1: Cost-Effective Configuration for a Game Backend
Consider a three-tier game application: a frontend server handling client requests, a backend server running game logic, and a database. Each stage has specific resource requirements per concurrent user and minimum instance resources. Given a constant load of 64 concurrent users, we need to select the most cost-effective instance types from a cloud provider offering Small, Medium, Large, and X-Large machines.
Step 1: Calculate Resource Requirements
For 64 users, compute total CPU, RAM, and disk needed for each stage, then compare with instance capacities.
- Frontend Server: CPU: 64 users × 0.5 Gops = 32 Gops. RAM: 64 × 0.2 GB = 12.8 GB (min 4 GB, so 12.8 GB). Disk: 64 × 0.2 GB = 12.8 GB (min 10 GB, so 12.8 GB).
- Backend Server: CPU: 64 × 0.5 Gops = 32 Gops (but 50% must run on a single node, so 16 Gops must be on one node, the rest can be distributed). RAM: 64 × 2 GB = 128 GB (min 32 GB). Disk: 64 × 1 GB = 64 GB (min 120 GB, so 120 GB).
- Database: CPU: 64 × 50 Mops = 3.2 Gops. RAM: 64 × 0.5 GB = 32 GB (min 16 GB). Disk: 64 × 2 GB = 128 GB (min 300 GB, so 300 GB).
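The per-stage arithmetic above can be sketched in a few lines of Python. This is only an illustration: the `STAGES` table and the `requirements` helper are hypothetical names, and the numbers are copied from the list above (CPU in Gops, RAM and disk in GB).

```python
# Per-user requirements and per-stage minimums, taken from the list above.
# No CPU minimum is stated in the text, so none is applied here.
STAGES = {
    "frontend": {"cpu": 0.5, "ram": 0.2, "disk": 0.2, "min_ram": 4, "min_disk": 10},
    "backend": {"cpu": 0.5, "ram": 2.0, "disk": 1.0, "min_ram": 32, "min_disk": 120},
    "database": {"cpu": 0.05, "ram": 0.5, "disk": 2.0, "min_ram": 16, "min_disk": 300},
}


def requirements(stage, users):
    """Return total CPU/RAM/disk for a stage, applying the stage minimums."""
    s = STAGES[stage]
    return {
        "cpu": round(s["cpu"] * users, 3),
        "ram": max(round(s["ram"] * users, 3), s["min_ram"]),
        "disk": max(round(s["disk"] * users, 3), s["min_disk"]),
    }


if __name__ == "__main__":
    for name in STAGES:
        print(name, requirements(name, 64))
```

Calling `requirements` with a different user count (for example 256 or 160) gives the inputs needed for the variable-load case later in the lesson.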
Step 2: Match to Instance Types
Instance CPU capacity (clock speed × cores): Small 2.5 GHz × 2 = 5 Gops, Medium 3.2 GHz × 4 = 12.8 Gops, Large 3.8 GHz × 8 = 30.4 Gops, X-Large 5.0 GHz × 20 = 100 Gops.
- Frontend: Needs 32 Gops, 12.8 GB RAM, and 12.8 GB disk, so CPU is the binding constraint. One Large (30.4 Gops, $0.60/hr) falls just short, and two Medium (25.6 Gops, 2 × $0.25 = $0.50/hr) fall short as well. One X-Large (100 Gops, 256 GB RAM, 2000 GB disk) works but costs $3.25/hr. Three Medium provide 3 × 12.8 = 38.4 Gops, 3 × 32 = 96 GB RAM, and 3 × 250 = 750 GB disk for 3 × $0.25 = $0.75/hr, which meets every requirement at the lowest cost of these options. So frontend: three Medium instances ($0.75/hr).
- Backend: Needs 32 Gops in total (16 Gops of it on a single node), 128 GB RAM, and 120 GB disk. Only the 50% CPU share is pinned to one node; the rest of the stage can be spread across several nodes. A Medium (12.8 Gops) cannot host the pinned 16 Gops, so at least one Large (30.4 Gops, 64 GB RAM, 500 GB disk) is needed. That Large covers the pinned share and the 120 GB disk, but only 64 GB of the 128 GB RAM. The remainder, 32 - 16 = 16 Gops and 128 - 64 = 64 GB RAM, fits on two Medium (25.6 Gops, 64 GB RAM in total). One Large plus one Medium gives 30.4 + 12.8 = 43.2 Gops but only 64 + 32 = 96 GB RAM, which is still short of 128 GB. One X-Large (100 Gops, 256 GB RAM, 2000 GB disk) would also work, but at $3.25/hr it is overkill. So backend: one Large + two Medium = $0.60 + 2 × $0.25 = $1.10/hr.
- Database: Needs 3.2 Gops, 32 GB RAM, and 300 GB disk, all on a single node. Small (5 Gops, 8 GB RAM) has too little RAM. Medium (12.8 Gops, 32 GB RAM, 250 GB disk) has enough CPU and RAM but too little disk. Large (30.4 Gops, 64 GB RAM, 500 GB disk, $0.60/hr) meets every requirement with capacity to spare. X-Large also fits but is far more expensive. Two Medium are not an option because the database must run on a single node. So database: one Large ($0.60/hr).
Total hourly cost: Frontend $0.75 + Backend $1.10 + Database $0.60 = $2.45/hr.
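The manual matching above can be cross-checked with a small brute-force search. The sketch below is a simplification, not a full placement solver: `INSTANCES`, `cheapest`, `pinned_cpu`, and `max_count` are hypothetical names, Small is left out because its rental price and disk size are not given in the text, and the pinned-CPU rule is modeled only as "the largest node must be able to host the pinned share". Prices are kept in cents to avoid floating-point rounding.

```python
from itertools import product

# (name, CPU in Gops, RAM in GB, disk in GB, rent in cents per hour)
INSTANCES = [
    ("Medium", 12.8, 32, 250, 25),
    ("Large", 30.4, 64, 500, 60),
    ("X-Large", 100.0, 256, 2000, 325),
]


def cheapest(cpu, ram, disk, pinned_cpu=0.0, single_node=False, max_count=4):
    """Try every mix of up to max_count instances per type; return (cents, mix)."""
    best = None
    for counts in product(range(max_count + 1), repeat=len(INSTANCES)):
        total = sum(counts)
        if total == 0 or (single_node and total != 1):
            continue
        used = [(inst, n) for inst, n in zip(INSTANCES, counts) if n]
        if sum(inst[1] * n for inst, n in used) < cpu:
            continue
        if sum(inst[2] * n for inst, n in used) < ram:
            continue
        if sum(inst[3] * n for inst, n in used) < disk:
            continue
        # Simplified pinned-share check: some node must be big enough on its own.
        if max(inst[1] for inst, _ in used) < pinned_cpu:
            continue
        cost = sum(inst[4] * n for inst, n in used)
        if best is None or cost < best[0]:
            best = (cost, {inst[0]: n for inst, n in used})
    return best


frontend = cheapest(32, 12.8, 12.8)
backend = cheapest(32, 128, 120, pinned_cpu=16)
database = cheapest(3.2, 32, 300, single_node=True)
total_cents = frontend[0] + backend[0] + database[0]

if __name__ == "__main__":
    print(frontend, backend, database, total_cents / 100)
```

Under these assumptions the search lands on the same configuration as the hand calculation: three Medium, one Large plus two Medium, and one Large.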
Part B: Buy vs. Rent for 3 Years
Assume 3 years = 3 × 365 × 24 = 26,280 hours (ignoring leap years). Renting costs $2.45/hr × 26,280 hr = $64,386. Purchase prices are Small $2,000, Medium $5,500, Large $10,000, and X-Large $55,000. For our configuration: Frontend: 3 Medium = 3 × $5,500 = $16,500. Backend: 1 Large + 2 Medium = $10,000 + $11,000 = $21,000. Database: 1 Large = $10,000. Total purchase cost = $16,500 + $21,000 + $10,000 = $47,500. Renting costs $64,386 - $47,500 = $16,886 more, so buying is cheaper by $16,886 over the 3 years.
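As a quick check of the arithmetic, the comparison can be written as a short script. The names (`FLEET`, `rent_total`, `buy_total`, `break_even_hours`) are illustrative, and like the text the sketch counts only purchase price for the buy option (no power, hosting, or maintenance). The break-even helper is an extra that follows directly from the same two numbers.

```python
HOURS_3Y = 3 * 365 * 24  # 26,280 hours, ignoring leap years

RENT_CENTS_PER_HOUR = {"Medium": 25, "Large": 60}
BUY_DOLLARS = {"Medium": 5500, "Large": 10000}

# Frontend 3 Medium + backend (1 Large, 2 Medium) + database 1 Large.
FLEET = {"Medium": 5, "Large": 2}


def rent_total(fleet, hours):
    """Total rental cost in dollars for running the fleet for `hours`."""
    cents_per_hour = sum(RENT_CENTS_PER_HOUR[k] * n for k, n in fleet.items())
    return cents_per_hour * hours / 100


def buy_total(fleet):
    """Total purchase cost in dollars."""
    return sum(BUY_DOLLARS[k] * n for k, n in fleet.items())


def break_even_hours(fleet):
    """Hours of operation after which buying becomes cheaper than renting."""
    cents_per_hour = sum(RENT_CENTS_PER_HOUR[k] * n for k, n in fleet.items())
    return buy_total(fleet) * 100 / cents_per_hour


if __name__ == "__main__":
    rent = rent_total(FLEET, HOURS_3Y)
    buy = buy_total(FLEET)
    print(rent, buy, rent - buy, break_even_hours(FLEET))
```

With these figures the break-even point falls at a little over 19,000 hours, so buying only pays off if the system really runs for most of the 3 years.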
Part C: Variable Load (256 day, 160 night)
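Day (12 hrs): 256 users. Night (12 hrs): 160 users. Per-user requirements scale linearly, so the day load needs 256 / 64 = 4× and the night load 160 / 64 = 2.5× the 64-user figures (before applying the per-stage minimums). Repeat the sizing from Part 1 for each load to find the cost-effective configurations; the day configuration likely uses more instances than the night one. Rent cost over 3 years: (day cost × 12 + night cost × 12) × 365 × 3, with both costs in $/hr. Buy cost: purchase enough capacity for the peak (256 users), since owned hardware cannot be handed back at night. Then compare the two totals. Buying for peak tends to be cheaper when the machines stay busy most of the time; renting gains ground as the gap between peak and off-peak load widens. Exact numbers are omitted here for brevity.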
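The rent formula above translates directly into code. This is a minimal sketch: `rent_cost_3y` and `buying_is_cheaper` are hypothetical helper names, rates are in cents per hour, and no day or night rates are filled in because the text does not give them. As a sanity check, passing the 64-user rate of 245 cents/hr for both halves of the day should reproduce the constant-load rental total.

```python
def rent_cost_3y(day_cents, night_cents, day_hours=12, night_hours=12, days=365 * 3):
    """3-year rental cost in dollars for a day/night split (rates in cents/hour)."""
    return (day_cents * day_hours + night_cents * night_hours) * days / 100


def buying_is_cheaper(buy_dollars, day_cents, night_cents):
    """Compare a one-off purchase sized for peak against 3 years of renting."""
    return buy_dollars < rent_cost_3y(day_cents, night_cents)


if __name__ == "__main__":
    # Same rate day and night: matches the constant-load case from Part B.
    print(rent_cost_3y(245, 245))
```

Once the day and night configurations have been sized, their hourly rates and the peak purchase price can be plugged into these two functions.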
Part D: Maximum Concurrent Users
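Identify the bottleneck: for each stage, compute the maximum number of users from the resource limits of the largest instance (X-Large) or of a combination of instances. Stages that can be distributed scale by adding nodes, so the limit comes from work that must fit on a single node: the backend's pinned CPU share or the database. Assuming the X-Large figures used earlier (100 Gops, 256 GB RAM, 2000 GB disk) and no cap on the number of instances, the backend's pinned share of 0.5 × 0.5 = 0.25 Gops per user allows 100 / 0.25 = 400 users, while the database allows min(100 / 0.05, 256 / 0.5, 2000 / 2) = min(2000, 512, 1000) = 512 users. Under those assumptions the backend single-node CPU limit is the bottleneck, at about 400 concurrent users.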
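One way to carry out this calculation is sketched below. It rests on assumptions that go beyond the text: X-Large (100 Gops, 256 GB RAM, 2000 GB disk) is the largest single node available, distributed work can always be absorbed by adding nodes, and only the backend's pinned CPU share and the single-node database are limiting. The names `XLARGE`, `max_users`, and `limits` are illustrative. Exact fractions are used so the floor divisions are not affected by floating-point rounding.

```python
from fractions import Fraction as F

# Capacity of the largest single node (X-Large), as quoted earlier in the lesson.
XLARGE = {"cpu": F(100), "ram": F(256), "disk": F(2000)}

# Backend: 50% of the 0.5 Gops per user must run on one node.
BACKEND_PINNED = {"cpu": F("0.5") * F("0.5")}

# Database: the whole stage must fit on one node.
DATABASE = {"cpu": F("0.05"), "ram": F("0.5"), "disk": F(2)}


def max_users(per_user, capacity):
    """Largest user count for which every per-user resource fits on one node."""
    return min(capacity[res] // need for res, need in per_user.items())


limits = {
    "backend (pinned CPU)": max_users(BACKEND_PINNED, XLARGE),
    "database": max_users(DATABASE, XLARGE),
}
bottleneck = min(limits, key=limits.get)

if __name__ == "__main__":
    print(limits, bottleneck)
```

If a provider offered a larger node, or if the pinned share also carried a RAM requirement, the limits would shift, so the result should be read as conditional on the stated assumptions.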
Part 2: Virtualization for CloudPets
CloudPets is a virtual pet platform requiring high availability. We must choose between VMs (Type 1/2), containers, or processes for specific scenarios.
Scenario A: Multiplayer Rooms
Each room is a short-lived server instance (lasting seconds to minutes) for invited users. Containers fit best: they start quickly, are lightweight, and can be created and torn down rapidly. VMs take longer to boot, and plain processes lack isolation between rooms.
Scenario B: Customizable Personality Module
The module is a single-threaded AI worker that runs on the user's machine and must be CPU-efficient. A plain process has the least overhead; there is no need for a full guest OS or a container.
Scenario C: Testing on Local Machines
A developer runs the full server stack on machines with different host operating systems (Linux, macOS). A Type 2 VM (e.g., VirtualBox) runs the server's OS image on top of whatever host OS the developer uses, providing full isolation and a consistent environment.
Auto-scaling and SLA Violation
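When requests jump to 1800/s and each VM handles 500 req/s, we need ceil(1800 / 500) = 4 VMs (including the existing one). Boot time is 30 s. During boot, the single running VM serves 500 req/s while 1800 req/s arrive, so, assuming requests queue in FIFO order and none are dropped, the backlog grows by 1300 req/s and reaches about 1300 × 30 = 39,000 requests. After 30 s, 4 VMs serve 2000 req/s, only 200 req/s more than the arrival rate, so the backlog takes about 39,000 / 200 = 195 s to drain. Little's Law (W = L / throughput) gives the waiting time: a request arriving just as the new VMs come online sits behind roughly 39,000 requests and waits about 39,000 / 2000 = 19.5 s. The SLA requires 90% of requests to complete in under 200 ms, so it is violated for the boot period and for most of the drain period that follows.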
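The queueing argument can be sketched as a simple fluid model. It assumes a FIFO queue with no dropped requests, one VM running when the spike starts, constant arrival and service rates, and all new VMs coming online at the same moment; `backlog_model` and its parameter names are hypothetical.

```python
import math


def backlog_model(arrival, per_vm, initial_vms, boot_s):
    """Estimate backlog, drain time, and worst-case wait after a load spike."""
    needed_vms = math.ceil(arrival / per_vm)
    # Requests pile up while only the initial VMs are serving.
    backlog = max(arrival - initial_vms * per_vm, 0) * boot_s
    capacity = needed_vms * per_vm
    spare = capacity - arrival
    drain_s = backlog / spare if spare > 0 else math.inf
    # Little's Law style estimate: queue length divided by throughput.
    worst_wait_s = backlog / capacity
    return {
        "needed_vms": needed_vms,
        "backlog": backlog,
        "drain_s": drain_s,
        "worst_wait_s": worst_wait_s,
    }


result = backlog_model(arrival=1800, per_vm=500, initial_vms=1, boot_s=30)

if __name__ == "__main__":
    print(result)
    print("SLA (200 ms) violated:", result["worst_wait_s"] > 0.2)
```

Varying `boot_s` in this model shows why faster-starting instances matter: the backlog, and with it the waiting time, grows linearly with boot time.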
Thrashing and Solution
VMs that terminate shortly after starting and are then relaunched indicate thrashing, typically caused by over-aggressive scaling rules or misconfigured health checks. Possible fixes: lengthen the cooldown period between scaling actions, pre-warm instances, or use containers for faster scaling.
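To illustrate the cooldown idea, here is a minimal sketch of a scaling policy that scales up immediately but refuses to scale down until a cooldown has passed since the last change. `CooldownScaler` and its methods are hypothetical and not tied to any cloud provider's API; time is passed in explicitly so the behavior is easy to follow.

```python
class CooldownScaler:
    """Toy autoscaling policy: instant scale-up, delayed scale-down."""

    def __init__(self, cooldown_s, min_vms=1):
        self.cooldown_s = cooldown_s
        self.min_vms = min_vms
        self.vms = min_vms
        self.last_change = None  # time of the most recent scaling action

    def decide(self, now, desired_vms):
        """Return the VM count to run at time `now` (seconds)."""
        desired = max(desired_vms, self.min_vms)
        if desired > self.vms:
            # Scale up right away so a spike is not left waiting.
            self.vms, self.last_change = desired, now
        elif desired < self.vms:
            in_cooldown = (
                self.last_change is not None
                and now - self.last_change < self.cooldown_s
            )
            if not in_cooldown:
                self.vms, self.last_change = desired, now
        return self.vms


if __name__ == "__main__":
    scaler = CooldownScaler(cooldown_s=120)
    for t, want in [(0, 4), (10, 1), (20, 5), (100, 1), (140, 1)]:
        print(t, want, scaler.decide(t, want))
```

A brief dip in load right after a scale-up no longer terminates the new VMs, which is the start-then-kill pattern described above.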
Understanding these trade-offs is essential for building robust cloud systems. Whether you're optimizing costs or choosing virtualization, always consider workload characteristics and SLA requirements.