Sharing some anecdotes about hedging requests in distributed systems. When we built the blob storage at LinkedIn, every blob was broken into chunks and streamed to different storage nodes. Each chunk had multiple replicas that were written with a majority quorum, and a metadata chunk recorded the locations of all the chunks. A request for a blob would typically issue read requests for its chunks and stream them to the client. At scale, we found that the tail latency (99.9th percentile) was 3-5x higher than typical. This happens for various reasons: network blips, a momentary traffic spike on a specific node, or deployments. One quick fix was to send requests for the same chunk to multiple replicas. This is called hedging requests. The benefit of this approach is that you get the chunk from whichever replica responds first. This dramatically reduced the tail latency and brought it almost on par with the 99th percentile. The downside was that we immediately doubled the request load across all nodes. To alleviate this problem, we sent the second request only if the first had not started responding within 10-20 ms. This approach increased the load by just 1.2x while keeping the tail latency comparable to the always-hedge approach. One caveat of hedging is that the operations have to be idempotent; immutable blobs were a good candidate for this. We also had to empirically figure out the right interval for sending the second request.
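A minimal sketch of the delayed-hedging idea in Python (the names, the fake replicas, and the 15 ms delay are illustrative, not LinkedIn's actual implementation):

```python
import concurrent.futures as cf
import time

def hedged_fetch(replicas, fetch, hedge_delay=0.015):
    """Ask the first replica for the chunk; if it hasn't responded within
    hedge_delay seconds, send a second request to another replica and
    return whichever answer arrives first. Safe only because reads of
    immutable chunks are idempotent."""
    with cf.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(fetch, replicas[0])
        done, _ = cf.wait([first], timeout=hedge_delay)
        if done:
            return first.result()  # fast path: no hedge, no extra load
        second = pool.submit(fetch, replicas[1])  # first one is slow: hedge
        done, _ = cf.wait([first, second], return_when=cf.FIRST_COMPLETED)
        return done.pop().result()

# Demo with fake replicas: "a" is having a bad day, "b" answers quickly.
def fetch(replica):
    time.sleep(0.2 if replica == "a" else 0.01)
    return f"chunk-from-{replica}"

print(hedged_fetch(["a", "b"], fetch))  # chunk-from-b
```

With the delay in place, most requests never hedge at all, which is where the 1.2x (rather than 2x) load figure comes from.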
The highest-performing developers I worked with at Amazon asked better questions than everyone else. After 18 years in tech, here's what I learned: while average engineers jump to solutions, exceptional ones pause to ask the right questions first. The 6 questions that separated high performers:
"What problem are we actually solving?"
"What happens when this fails?"
"How will we know if this is working?"
"What's the simplest solution that could work?"
"Who else has solved this?"
"What are we NOT going to do?"
The career-changing insight: The quality of your questions determines the quality of your solutions. These thinking patterns apply beyond engineering to any complex problem-solving role. More systematic approaches to advancing your tech career: go.alifeengineered.com/?utm_…
Both VictoriaMetrics and VictoriaLogs are optimized for persistent storage with low IOPS (aka HDD):

- Data ingestion path. Both systems buffer recently ingested data in memory before writing it to files. Both rely on the OS page cache to minimise the number of write operations to persistent storage when writing the buffered data to files - regular writes to files do not induce physical writes to disk; the operating system stores the written data in memory (the page cache) and writes it to disk in the background. Both VictoriaMetrics and VictoriaLogs call fsync() before closing just-created immutable data files in order to guarantee that the data is properly stored on disk. This prevents data loss and data corruption on unclean shutdown, such as an accidental power-off in the middle of creating a data file. The frequency of fsync() calls is usually quite low, even at high data ingestion rates (e.g. a few tens of fsync() calls per second for millions of ingested samples per second / hundreds of thousands of ingested logs per second). This allows using HDD disks with 100-200 IOPS for high data ingestion workloads in VictoriaMetrics and VictoriaLogs.

- Querying path. Usually the most frequent queries are executed over recently ingested data - the majority of alerting and recording rules rely on the data ingested during the last few hours. There is a very high chance that this data is located in the OS page cache, since the operating system automatically and transparently keeps recently written and read file data there (provided there is enough RAM for the working set). In this case the majority of queries don't induce any read operations from disk - they are satisfied from the OS page cache. Some rare and heavy queries over historical data may need disk read operations for reading that data from disk.
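The fsync-before-close write path described above can be sketched like this (the function name and temp-file convention are illustrative, not the actual VictoriaMetrics code):

```python
import os

def write_immutable_part(path, data):
    """Buffered writes land in the OS page cache; a single fsync() before
    close makes the finished immutable file durable, so an unclean
    shutdown can't leave a torn, partially written file behind."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)         # goes to the page cache, not straight to disk
        f.flush()             # drain Python's userspace buffer to the OS
        os.fsync(f.fileno())  # one fsync per file, however large the file is
    os.rename(tmp, path)      # atomically publish the completed file
```

The key property is that the fsync cost is per file, not per write, which is why even slow disks keep up with very high ingestion rates.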
Both VictoriaMetrics and VictoriaLogs have a data storage format optimized for minimising random disk reads when executing heavy queries over historical data. The format is optimized for sequential reads of large chunks of data from disk, and these chunks are stored in compressed form with a high compression ratio. This minimises both the random disk reads and the disk read bandwidth needed for heavy queries over historical data that is missing from the OS page cache, so regular HDD disks are enough for the majority of use cases.

When to use SSD and NVMe with VictoriaMetrics and VictoriaLogs? There are two cases:

- When the working set doesn't fit in the OS page cache. Then the operating system needs to read the frequently requested data from disk, so it needs disks with high IOPS. But I'd recommend increasing the available RAM instead of switching from HDD to SSD in this case - it will give better performance and capacity for both queries and data ingestion.

- When heavy queries need higher performance and the bottleneck is disk read bandwidth or disk read IOPS.
Much of the internet runs on Elastic Block Store. It's the default (and in most cases, required) storage layer for every EC2 instance running in AWS. This article by @molson was a delightful read on its history and engineering challenges.
TLA+ Modeling of AWS outage DNS race condition muratbuffalo.blogspot.com/20… AWS's N. Virginia region suffered a DynamoDB outage triggered by a DNS automation defect. This post focuses narrowly on the race condition at the core of the bug, which is best understood through TLA+ modeling.
Once a queue in a distributed system or a buffer in a network gets large enough, it takes so long to reach the front that the client has already timed out and retried. You'll spend all your time processing useless work. Good systems have small queues. If your queue has to get long, then switching to LIFO under load at least ensures you're making progress on requests users still care about.
What’s the logic for LIFO instead of FIFO?
Counter-intuitive advice for designing very large scale live-site services:
1. Don't have retries inside your system, only at the edges.
2. Don't have queues inside your system, only at the edges.
3. When shit really goes down, process requests in LIFO order.
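The FIFO-vs-LIFO effect is easy to see in a toy simulation of an overloaded server (all numbers here are made up for illustration): requests arrive faster than they can be served, and a response only counts if it arrives before the client's timeout.

```python
def useful_responses(order, n=200, service=2, timeout=10):
    """One request arrives per tick, but the server completes only one
    request every `service` ticks, so the queue grows without bound.
    A response counts as useful only if it lands within `timeout`
    ticks of the request's arrival."""
    queue = []
    useful = 0
    for t in range(n * service):
        if t < n:
            queue.append(t)  # request arriving at tick t
        if t % service == 0 and queue:
            # FIFO serves the oldest (often already timed-out) request;
            # LIFO serves the newest, whose client is still waiting.
            req = queue.pop(0) if order == "fifo" else queue.pop()
            if t - req <= timeout:
                useful += 1
    return useful

print(useful_responses("fifo"), useful_responses("lifo"))
```

Under FIFO almost every served request has already timed out; under LIFO nearly every service slot produces a response someone still wants.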
TIL about the slotted counter pattern. Makes total sense! Continuously reminded that parallel performance is about partitioning.
While researching for the podcast with @samlambert I came across his blog post on the "slotted counter pattern", which is very interesting. I implemented a very similar pattern to solve high-contention single-row updates for an analytics use case. Give it a read! It's a simple and effective pattern to avoid hammering a single row by spreading the updates across N rows, reducing the potential for contention. ps: I am looking forward to the discussion and releasing the episode as soon as I can. 😎 planetscale.com/blog/the-slo… Also subscribe here to stay updated: piped.video/c/TheGeekNarrato…
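For reference, a minimal sketch of the slotted counter pattern using SQLite (the slot count and schema are illustrative; the same upsert works in MySQL and Postgres, which is where the pattern is usually applied):

```python
import random
import sqlite3

SLOTS = 16  # hypothetical slot count; size it to your write concurrency

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE page_views (
    page_id INTEGER,
    slot    INTEGER,
    count   INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (page_id, slot))""")

def increment(page_id):
    # Each writer picks a random slot, so concurrent increments for the
    # same page land on different rows instead of contending on one
    # hot row. (Upsert syntax needs SQLite >= 3.24.)
    slot = random.randrange(SLOTS)
    conn.execute("""INSERT INTO page_views (page_id, slot, count)
                    VALUES (?, ?, 1)
                    ON CONFLICT (page_id, slot)
                    DO UPDATE SET count = count + 1""", (page_id, slot))

def total(page_id):
    # Reads pay a small cost: summing across up to SLOTS rows.
    row = conn.execute(
        "SELECT COALESCE(SUM(count), 0) FROM page_views WHERE page_id = ?",
        (page_id,)).fetchone()
    return row[0]

for _ in range(1000):
    increment(42)
print(total(42))  # 1000
```

The trade is classic partitioning: writes spread across N rows, reads aggregate over them.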
How to implement the Outbox pattern in Go and Postgres by @pliutau packagemain.tech/p/how-to-im…
How did Bluesky move off AWS and continue to thrive despite crossing 25 million users and only employing 15 software engineers? Learn how a distributed social network was built and the challenges encountered along the way. ow.ly/gWps50Waa3m #ScyllaDB
the question I get the most at every conference or meetup: "what even are durable objects?" I spent the past 2 years building almost exclusively with durable objects, and today I'm compiling my notes and favourite patterns into a single blog post boristane.com/blog/what-are-…
she actually summarized everything you must know from the “AI Engineering” book in 76 minutes. if you don't have the time to read the book, you need to watch this. foundational models, evaluation, prompt engineering, RAG, memory, fine-tuning and many more. great starting point.
TernFS — an exabyte scale, multi-region distributed filesystem xtxmarkets.com/tech/2025-ter… This post motivates TernFS, explains its high-level architecture, and then explores some key implementation details.
📝 Blogged: "'You Don't Need Kafka, Just Use Postgres' Considered Harmful" In which I'm arguing that both Postgres and Kafka are great tools for their respective purposes. But don't create your custom implementation of one on top of the other. 👉 morling.dev/blog/you-dont-ne…
Flink's 95% problem #apacheflink Flink might sound like the holy grail of streaming data processing, but for 95% of us, it's just a complex headache we don't need. tinybird.co/blog/flink-is-95…
built my own vector db from scratch with linear scan, kd-tree, HNSW, and IVF indexes, just to understand things from first principles. all the way from:
> recursive BST insertion with d cycling split
> hyperplane splitting perpendicular to the axis at depth % d
> branch-and-bound pruning + greedy DFS
> optimizing tree pruning by ~50% on avg
> skip-list probability distribution
> selecting M diverse neighbors to prevent long-range edge clustering
> bidirectional edge creation
> greedy descent on upper layers for fast O(log n) coarse search
> exhaustive beam search at layer 0 + BFS pruning
> heuristic picks of diverse M from candidate set
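The baseline all those indexes are trying to beat is the exact linear scan, which fits in a few lines (an illustrative sketch, not the author's code):

```python
import heapq
import math

def linear_scan_knn(vectors, query, k=3):
    """Brute-force k-nearest-neighbors: exact results, O(n * d) per
    query. kd-trees, IVF, and HNSW all trade some of this exactness
    or simplicity for sublinear query time."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
    # Return the indices of the k vectors closest to the query.
    return heapq.nsmallest(k, range(len(vectors)),
                           key=lambda i: dist(vectors[i], query))

vecs = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (0.1, 0.0)]
print(linear_scan_knn(vecs, (0.0, 0.0), k=2))  # [0, 3]
```

Everything else in the list above exists to avoid touching all n vectors per query.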
Adding a few more that go deeper into the real-world scaling and distributed systems side of databases:
Consensus protocols (Raft, Paxos, Zab)
Sharding strategies (range, hash, consistent hashing)
Quorum reads/writes
Conflict resolution (CRDTs, vector clocks)
Snapshot isolation & MVCC
Compaction & tombstones (esp. in LSM-based systems)
Query optimizers & cost-based planning
Columnar storage formats (Parquet, ORC)
OLTP vs OLAP design differences
Data lake vs data warehouse architecture
Hot key mitigation & load balancing
Schema evolution in distributed systems
Data locality & co-location strategies
Write amplification and read amplification
ZK / etcd / Consul for coordination
Storage engines (InnoDB, RocksDB, WiredTiger)
Database stuff I’d study if I wanted to understand scaling deeply: Bookmark this.
B+ Trees
LSM Trees
Write-Ahead Logging
Two-Phase Commit
Three-Phase Commit
Read Replicas
Leader-Follower Replication
Partitioning
Query Caching
Secondary Indexes
Vector Indexes (FAISS, HNSW)
Distributed Joins
Materialized Views
Event Sourcing
Change Data Capture