How Liquid Clustering Improves Streaming Merges and P99 Latency

The Trio Behind Simpler Streaming Merges: Deletion Vectors, Row-Level Concurrency, and Liquid Clustering

Why Liquid Clustering Improves Merge Efficiency

Unlike traditional Z-order clustering, which can re-cluster portions of the dataset even when that is unnecessary, Liquid Clustering is designed to be almost always incremental. It clusters only newly arrived data that has not yet been organized, which gives much stronger guarantees against re-clustering existing data and makes clustering more predictable and cost-efficient. The payoff shows up during merges: an efficient merge scans as few files as possible, and that requires the data to be sorted or clustered. Because Liquid Clustering physically stores data clustered on the chosen keys, file pruning improves, merges run faster, and overall latency drops.
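To make the file-pruning argument concrete, here is a minimal sketch in plain Python. It is a toy model, not Delta Lake's implementation. It assumes each data file carries min/max statistics for the merge key and that a merge must open every file whose range could contain an incoming key. The helper names (`make_files`, `files_to_scan`), the file size, and the key values are all hypothetical.

```python
import random

def make_files(keys, rows_per_file):
    """Split keys into fixed-size 'files' and keep only each file's (min, max) key stats."""
    files = []
    for i in range(0, len(keys), rows_per_file):
        chunk = keys[i:i + rows_per_file]
        files.append((min(chunk), max(chunk)))
    return files

def files_to_scan(files, merge_keys):
    """Count files whose [min, max] range could contain at least one incoming merge key."""
    return sum(
        1 for lo, hi in files
        if any(lo <= k <= hi for k in merge_keys)
    )

random.seed(0)  # fixed seed so the toy run is repeatable
keys = list(range(10_000))

# Clustered layout: rows are written in key order, so each file covers a narrow key range.
clustered = make_files(keys, 100)

# Unclustered layout: the same rows in arrival (random) order, so each file spans a wide range.
shuffled = keys[:]
random.shuffle(shuffled)
unclustered = make_files(shuffled, 100)

# A small batch of incoming keys, as a streaming micro-batch merge might carry.
merge_keys = [4200, 4201, 4250]

clustered_scans = files_to_scan(clustered, merge_keys)
unclustered_scans = files_to_scan(unclustered, merge_keys)

print(f"clustered layout:   {clustered_scans} of {len(clustered)} files scanned")
print(f"unclustered layout: {unclustered_scans} of {len(unclustered)} files scanned")
```

In the clustered layout all three incoming keys fall inside a single file's range, so only that file has to be opened. In the shuffled layout almost every file's min/max range overlaps the incoming keys, so statistics-based pruning eliminates very little.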

Reference

  • Behind the Scenes: How do deletion vectors actually work (Substack)

    • Deep dive into the mechanics of deletion vectors: how they mark rows deleted without rewriting whole files.

  • Use row tracking for Delta tables (Delta Lake Docs)

    • Explains row tracking: the new metadata fields (row_id, row_commit_version) that identify and version rows.

    • How to enable/disable it, and what its limitations are.

  • Deep Dive: How Row-level Concurrency Works Out of the Box (Databricks Blog)

    • Describes what row-level concurrency means, and how it works in the Databricks Runtime.

    • Shows how Liquid Clustering + deletion vectors enable out-of-the-box conflict resolution (e.g. avoiding ConcurrentAppendException / ConcurrentUpdateException).

    • Gives examples / internal logic of how concurrent modifications are tracked per row rather than per file or partition.

  • merge_and_optimize_parallel_demo_fixed.ipynb (GitHub notebook by “material_for_public_consumption”)

    • Demo notebook that shows merges and OPTIMIZE operations running in parallel.

  • Why Liquid Clustering
