I share real-world lessons from building scalable systems at Jump Trading, Binance, and running mission-critical cloud ops at GovTech and Singapore Air Force. No fluff, just practical takeaways, hard-earned fixes, and deep dives that matter.
Identify high-level DB entities (don’t worry about attributes yet).
User info embedded in the JWT header, so it can’t be altered.
Step 3 (High-Level Design)
Map out the architecture diagram.
Determine the type and quality of database needed (both NoSQL and SQL can work).
Start simple without optimization; cover all functional requirements first.
Abstract out the payment component (provide a callback endpoint to receive payment confirmations).
Use distributed locks with auto-TTL to keep the data model clean (mid-level candidates may only think of cron jobs).
Step 4 (Deep Dive (the differentiator))
Reference relevant non-functional requirements.
ElasticSearch for fast search using inverted indexes; cannot be the main DB since it can’t handle complex transaction management.
Change Data Capture (CDC): changes in the primary DB are streamed to update ElasticSearch. Heavy updates may require a queue or batching.
Caching complexities (especially with ranking):
Might require sticky sessions.
AWS OpenSearch supports node query caching for faster queries without adding external caching layers.
Alternatively, Redis works, but invalidation must be handled carefully when primary DB changes.
CDN caching works for static content but loses effectiveness with high key permutations or timestamp-driven keys. User-specific results also reduce usefulness.
Real-time UX improvements:
Long polling: cheap and easy, especially if users remain on a page for a few minutes.
Server-Sent Events (SSE): simpler than WebSockets if communication is server → client only.
Choke Point (virtual waiting queue):
Prevents instant access and avoids “fastest click wins” behavior.
Redis can be used.
Similar to a rate limiter but with better UX.
Misc
When to shard
High write throughput, vertical scaling exhausted.
Excessive data volume.
Hotspot keys causing uneven load.
When not to shard
Strong global consistency required.
Queries frequently join large tables (sharding increases fan-out).