<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Databricksters: Data Engineering]]></title><description><![CDATA[Simplifying Data Engineering on Databricks]]></description><link>https://www.databricksters.com/s/data-engineering</link><image><url>https://substackcdn.com/image/fetch/$s_!zPJJ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff49ecae-7c56-403c-9389-61b28de6a50f_1280x1280.png</url><title>Databricksters: Data Engineering</title><link>https://www.databricksters.com/s/data-engineering</link></image><generator>Substack</generator><lastBuildDate>Sat, 09 May 2026 04:40:21 GMT</lastBuildDate><atom:link href="https://www.databricksters.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Soni]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[databricksters@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[databricksters@substack.com]]></itunes:email><itunes:name><![CDATA[Canadian Data Guy]]></itunes:name></itunes:owner><itunes:author><![CDATA[Canadian Data Guy]]></itunes:author><googleplay:owner><![CDATA[databricksters@substack.com]]></googleplay:owner><googleplay:email><![CDATA[databricksters@substack.com]]></googleplay:email><googleplay:author><![CDATA[Canadian Data Guy]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Migrating Existing Dashboards to Databricks AI/BI, Part 3: User Filters and Row-Level Security with Unity Catalog]]></title><description><![CDATA[How to implement user-based filtering and row-level security using dynamic views, row filters, column masks, and 
ABAC]]></description><link>https://www.databricksters.com/p/migrating-existing-dashboards-to-3e1</link><guid isPermaLink="false">https://www.databricksters.com/p/migrating-existing-dashboards-to-3e1</guid><dc:creator><![CDATA[Artem Chebotko]]></dc:creator><pubDate>Tue, 14 Apr 2026 15:01:56 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!mMaQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa3ac56e-a591-4cbe-8db1-eace395402a7_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mMaQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa3ac56e-a591-4cbe-8db1-eace395402a7_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mMaQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa3ac56e-a591-4cbe-8db1-eace395402a7_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!mMaQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa3ac56e-a591-4cbe-8db1-eace395402a7_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!mMaQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa3ac56e-a591-4cbe-8db1-eace395402a7_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!mMaQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa3ac56e-a591-4cbe-8db1-eace395402a7_1536x1024.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!mMaQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa3ac56e-a591-4cbe-8db1-eace395402a7_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa3ac56e-a591-4cbe-8db1-eace395402a7_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mMaQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa3ac56e-a591-4cbe-8db1-eace395402a7_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!mMaQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa3ac56e-a591-4cbe-8db1-eace395402a7_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!mMaQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa3ac56e-a591-4cbe-8db1-eace395402a7_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!mMaQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa3ac56e-a591-4cbe-8db1-eace395402a7_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset 
pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As a Specialist Solutions Architect at Databricks, I often hear the same questions from customers who are migrating dashboards from legacy BI tools to Databricks AI/BI Dashboards:</p><ul><li><p><em>&#8220;What&#8217;s the Databricks equivalent of the context filters we use today?&#8221;</em></p></li><li><p><em>&#8220;Can we still do cascading filters where each dropdown only shows relevant values?&#8221;</em></p></li><li><p><em>&#8220;Do you support filter actions when I click on a bar or a point?&#8221;</em></p></li><li><p><em>&#8220;How do we do user-based filtering in AI/BI Dashboards?&#8221;</em></p></li></ul><p>In the <a href="https://www.databricksters.com/p/migrating-existing-dashboards-to">first blog post in this series</a>, I focused on the first two 
questions and showed how to recreate:</p><ul><li><p>context filters using parameters in dataset SQL, and</p></li><li><p>&#8220;<em>Only Relevant Values</em>&#8221; filters using field filters and query-based parameters.</p></li></ul><p>In the <a href="https://www.databricksters.com/p/migrating-existing-dashboards-to-482">second post</a>, I focused on cross-filtering and drill-through interactions.</p><p>This third post tackles the remaining question: &#8220;<em>How do we do user-based filtering in AI/BI Dashboards?</em>&#8221;</p><p>In Databricks, these controls live in <a href="https://docs.databricks.com/aws/en/data-governance/unity-catalog/">Unity Catalog</a>, not in the dashboard itself. AI/BI Dashboards query governed tables and views, and Unity Catalog enforces fine-grained access control before the data ever reaches the dashboard. 
I&#8217;ll walk through how to:</p><ul><li><p>Implement user-based filtering and RLS with <a href="https://docs.databricks.com/aws/en/views/dynamic">dynamic views</a> that use <code>current_user()</code> and <code>is_account_group_member()</code>.</p></li><li><p>Apply RLS directly on tables using <a href="https://docs.databricks.com/aws/en/data-governance/unity-catalog/filters-and-masks/">row filters</a> and protect sensitive fields with <a href="https://docs.databricks.com/aws/en/data-governance/unity-catalog/filters-and-masks/">column masks</a>.</p></li><li><p>Scale these patterns across many tables and columns using <a href="https://docs.databricks.com/aws/en/data-governance/unity-catalog/abac/">ABAC tag policies and governed tags</a>.</p></li></ul><p>As in the previous posts, I&#8217;ll use the built-in <code>samples.tpch</code> dataset. I&#8217;ve also published the <a href="https://github.com/ArtemChebotko/Migrating-Existing-Dashboards-to-Databricks-AI-BI">companion dashboard</a> so you can import it into your workspace, follow along as you read, and adapt these patterns to your own Unity Catalog data.</p><h3><strong>1. 
How User-Based Filtering Maps to Unity Catalog</strong></h3><p>Before we dive into SQL, it&#8217;s useful to clarify where these responsibilities live in Databricks.</p><p>In many BI tools, user-specific security is often implemented close to the dashboard or semantic layer:</p><ul><li><p>You define user- or group-based rules that map principals to specific regions, customers, or business units.</p></li><li><p>You may use identity-aware logic in filters or calculated fields.</p></li><li><p>You may maintain a security table that drives which slice of data each user can see.</p></li></ul><p>In Databricks, these controls live in <a href="https://docs.databricks.com/aws/en/data-governance/unity-catalog/">Unity Catalog</a>, not in AI/BI Dashboards:</p><ul><li><p>Object privileges on catalogs, schemas, tables, and views control whether a user can query a given object at all.</p></li><li><p>Dynamic views, row filters, and column masks implement row-level security and masking at query time. They can inspect the current user and their groups and return different rows or values per user.</p></li><li><p>AI/BI Dashboards simply query those governed tables and views. 
They never bypass Unity Catalog: any row filters or masks you define apply to every query, regardless of whether it comes from a notebook, Databricks SQL, or an AI/BI dashboard.</p></li></ul><p>The result is conceptually similar to user-based filtering in traditional BI tools, but with one important shift: The security rules live with the data, not with a particular dashboard.</p><p>That&#8217;s especially important when:</p><ul><li><p>The same Unity Catalog tables power multiple AI/BI dashboards and external BI tools, and</p></li><li><p>You embed AI/BI Dashboards into applications where thousands of users see the same dashboard definition, but each must see a different subset of data.</p></li></ul><p>In the rest of this post, we&#8217;ll build up from that idea:</p><ol><li><p>Use a <a href="https://docs.databricks.com/aws/en/views/dynamic">dynamic view</a> and a permission table to enforce RLS on a TPCH Sales dataset.</p></li><li><p>Show how to do similar things directly on tables with <a href="https://docs.databricks.com/aws/en/data-governance/unity-catalog/filters-and-masks/">row filters and column masks</a>.</p></li><li><p>Discuss how to scale those patterns across many tables and columns using <a href="https://docs.databricks.com/aws/en/data-governance/unity-catalog/abac/">ABAC tag policies and governed tags</a>.</p></li></ol><h3><strong>2. 
Building blocks: dynamic views, row filters, and column masks</strong></h3><p>To implement user-based filtering in Databricks, you need only three Unity Catalog primitives: dynamic views, row filters, and column masks.</p><p>They all rely on the same core idea: at query time, Unity Catalog can look at who is running the query (and which groups they&#8217;re in), and then decide which rows and values to return.</p><h4><strong>2.1 Identity functions</strong></h4><p>The main functions you&#8217;ll use in policies are:</p><ul><li><p><code>current_user()</code><br>Returns the current user&#8217;s identity (usually their email).</p></li><li><p><code>is_account_group_member('&lt;group_name&gt;')</code><br>Returns <code>TRUE</code> if the current user is a member of the named account-level group.</p></li></ul><p>You can call these functions from views and from SQL UDFs used by row filters and column masks.</p><h4><strong>2.2 Dynamic views</strong></h4><p>A <a href="https://docs.databricks.com/aws/en/views/dynamic">dynamic view</a> is just a normal SQL view whose logic depends on the current user or their groups.</p><p>You can:</p><ul><li><p>Filter rows based on <code>current_user()</code> or group membership.</p></li><li><p>Mask or null out columns for certain users.</p></li><li><p>Join to a separate permission table that maps users/groups to allowed regions, customers, etc.</p></li></ul><p>Any AI/BI dataset that selects from a dynamic view automatically inherits its logic. 
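</p><p>As a minimal sketch of the first two capabilities (row filtering and column masking), and not part of the companion dashboard: the view name <code>tpch_customer_dyn</code>, the group name <code>Finance Admins</code>, and the choice of <code>BUILDING</code> as the segment visible to everyone are all hypothetical, and the sketch assumes a <code>main.demo_tpch</code> schema already exists. A dynamic view over <code>samples.tpch.customer</code> might look like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">-- Hypothetical sketch: view name, group name, and segment are assumptions
CREATE OR REPLACE VIEW main.demo_tpch.tpch_customer_dyn AS
SELECT
  c.c_custkey    AS customer_id,
  c.c_name       AS customer_name,
  c.c_mktsegment AS market_segment,
  -- Column masking: only members of the (hypothetical) group see balances
  CASE
    WHEN is_account_group_member('Finance Admins') THEN c.c_acctbal
    ELSE NULL
  END            AS account_balance
FROM samples.tpch.customer AS c
-- Row filtering: non-members see a single market segment only
WHERE is_account_group_member('Finance Admins')
   OR c.c_mktsegment = 'BUILDING';</code></pre></div><p>Because both checks run inside the view, the same query text returns different rows and values depending on who executes it. 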
You don&#8217;t need to add any special configuration in AI/BI itself.</p><p>We&#8217;ll use a dynamic view for our first <em>TPCH Sales</em> RLS example.</p><h4><strong>2.3 Row filters</strong></h4><p>A <a href="https://docs.databricks.com/aws/en/data-governance/unity-catalog/filters-and-masks/">row filter</a> attaches RLS logic directly to a table, instead of wrapping the table in a view:</p><ul><li><p>You define a <a href="https://docs.databricks.com/aws/en/udf/unity-catalog">SQL UDF</a> that takes one or more columns as input and returns <code>BOOLEAN</code>.</p></li><li><p>You attach it to a table with <code>ALTER TABLE ... SET ROW FILTER ... ON (column[, ...])</code>.</p></li></ul><p>The row filter runs for every query and can call <code>current_user()</code> and <code>is_account_group_member()</code> internally. This is handy when you want:</p><ul><li><p>A stable table name (no extra view layer), or</p></li><li><p>A single table that&#8217;s consumed by many tools, all of which should respect the same RLS.</p></li></ul><p>We&#8217;ll look at both group-based and <code>current_user()</code> + permission-table examples later in the post.</p><h4><strong>2.4 Column masks</strong></h4><p>A <a href="https://docs.databricks.com/aws/en/data-governance/unity-catalog/filters-and-masks/">column mask</a> is similar, but operates at the column level:</p><ul><li><p>You define a <a href="https://docs.databricks.com/aws/en/udf/unity-catalog">SQL UDF</a> that returns a &#8220;masked&#8221; value.</p></li><li><p>You attach it with <code>ALTER TABLE ... ALTER COLUMN ... 
SET MASK ...</code>.</p></li></ul><p>This lets you:</p><ul><li><p>Show full values (for example, email or salary) only to certain groups.</p></li><li><p>Show partially masked or null values to everyone else.</p></li></ul><p>Think of it as the Unity Catalog side of &#8220;row-level security + column-level masking&#8221; that you might combine in legacy BI tools using data source filters and calculated fields.</p><p>Next, we&#8217;ll put these pieces together in a concrete example: enforcing region-based RLS on a <em>TPCH Sales</em> dataset using a combination of a base view, a permission table, and a dynamic view.</p><h3><strong>3. Implementing RLS with a dynamic view</strong></h3><p>Let&#8217;s start with a concrete scenario:</p><ul><li><p>You have a <em>TPCH Sales</em> dashboard shared across multiple sales teams.</p></li><li><p><em>NA Sales Managers</em> should see only <em>AMERICA</em>.</p></li><li><p><em>EMEA Sales Managers</em> should see only <em>EUROPE</em>.</p></li><li><p><em>APAC Sales Managers</em> should see only <em>ASIA</em>.</p></li><li><p>Individual users may have their own custom regions.</p></li></ul><p>In many traditional BI tools, you&#8217;d typically solve this with a user filter or data source filter that maps groups to <em>Regions</em>, and a security table to keep that mapping up to date.</p><p>In Databricks, we&#8217;ll use the same logical pattern &#8211; but move it into Unity Catalog:</p><ol><li><p>Create a base view for <em>TPCH Sales</em> in a demo schema.</p></li><li><p>Create a permission table that maps principals (groups or users) to <em>Regions</em>.</p></li><li><p>Create a dynamic view that joins the base view to the permission table and applies row-level security based on <code>current_user()</code> and <code>is_account_group_member()</code>.</p></li><li><p>Create an AI/BI dataset on top of that dynamic view and build a simple table visualization.</p></li></ol><p>Throughout this section, we&#8217;ll:</p><ul><li><p>Read from 
<code>samples.tpch</code> (which everyone has).</p></li><li><p>Create objects in <code>main.demo_tpch</code> (you can substitute another catalog/schema if needed).</p></li></ul><p>All of the <code>CREATE</code> / <code>INSERT</code> statements in this section should be run in a notebook or SQL editor, not inside an AI/BI dataset. In AI/BI, you&#8217;ll just <code>SELECT</code> from the resulting view.</p><h4><strong>3.1 </strong><em><strong>TPCH Sales</strong></em><strong> base view</strong></h4><p>First, set up a simple demo schema and define a reusable base view for <em>TPCH Sales</em>:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">-- Use the default UC catalog and create a demo schema
USE CATALOG main;
CREATE SCHEMA IF NOT EXISTS demo_tpch;

-- Base TPCH Sales view, reading from samples.tpch
CREATE OR REPLACE VIEW main.demo_tpch.tpch_sales_base AS
SELECT
  r.r_name              AS region,
  n.n_name              AS nation,
  c.c_custkey           AS customer_id,
  c.c_name              AS customer_name,
  o.o_orderkey          AS order_id,
  o.o_orderdate         AS order_date,
  l.l_extendedprice * (1 - l.l_discount) AS revenue
FROM samples.tpch.region   AS r
JOIN samples.tpch.nation   AS n ON n.n_regionkey = r.r_regionkey
JOIN samples.tpch.customer AS c ON c.c_nationkey = n.n_nationkey
JOIN samples.tpch.orders   AS o ON o.o_custkey   = c.c_custkey
JOIN samples.tpch.lineitem AS l ON l.l_orderkey  = o.o_orderkey;</code></pre></div><h4><strong>3.2 Region access permission table</strong></h4><p>Next, create a permission table that describes who is allowed to see which <em>Region</em>.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">-- Optional: you can keep security tables in the same schema
-- or create a separate one, e.g. main.demo_security
CREATE SCHEMA IF NOT EXISTS main.demo_tpch;

CREATE TABLE IF NOT EXISTS main.demo_tpch.tpch_region_access (
  principal_type STRING,   -- 'group' or 'user'
  principal      STRING,   -- group name or user email
  region         STRING    -- must match tpch_sales_base.region
);

INSERT INTO main.demo_tpch.tpch_region_access VALUES
  ('group', 'NA Sales Managers',        'AMERICA'),
  ('group', 'EMEA Sales Managers',      'EUROPE'),
  ('group', 'APAC Sales Managers',      'ASIA'),
  ('user',  'some.user@databricks.com', 'ASIA'),
  ('user',   current_user(),            'AFRICA');</code></pre></div><p>What this does:</p><ul><li><p>The first three rows grant access based on account-level groups.</p></li><li><p>The fourth row grants a specific user access to <em>ASIA</em>, even if they are not in one of those groups.</p></li><li><p>The fifth row uses <code>current_user()</code> to grant you, the person running this SQL, access to <em>AFRICA</em>. When you execute the <code>INSERT</code>, <code>current_user()</code> is evaluated to your own email, which is what gets stored in the table.</p></li></ul><p>If you&#8217;re not in any of the <em>NA/EMEA/APAC Sales Managers</em> groups and you&#8217;re not <em>some.user@databricks.com</em>, the only applicable rule for you will be the one that says you can see <em>AFRICA</em>. We&#8217;ll see the effect of that in a moment when we query through the dynamic view.</p><h4><strong>3.3 Dynamic view with region-level RLS</strong></h4><p>Now create a <a href="https://docs.databricks.com/aws/en/views/dynamic">dynamic view</a> that applies row-level security by joining the base <em>TPCH Sales</em> view to the permission table and checking the current user&#8217;s identity and groups:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">CREATE OR REPLACE VIEW main.demo_tpch.tpch_sales_rls AS
SELECT s.*
FROM   main.demo_tpch.tpch_sales_base AS s
WHERE EXISTS (
  SELECT 1
  FROM   main.demo_tpch.tpch_region_access a
  WHERE  a.region = s.region
    AND (
      (a.principal_type = 'group'
       AND is_account_group_member(a.principal))
      OR
      (a.principal_type = 'user'
       AND a.principal = current_user())
    )
);</code></pre></div><p>This view enforces RLS as follows. For each row in <code>tpch_sales_base</code>, it looks for a matching rule in <code>tpch_region_access</code> based on region and either a group or a user principal. If no matching rule exists for the current user and that region, the row is filtered out.</p><p>You can verify this by running:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">SELECT DISTINCT region
FROM main.demo_tpch.tpch_sales_rls;</code></pre></div><p>If you are only granted access via the <code>('user', current_user(), 'AFRICA')</code> row, you should see only <em>AFRICA</em>.</p><p>From now on:</p><ul><li><p>Any query against <code>main.demo_tpch.tpch_sales_rls</code> returns only the regions granted to the current user.</p></li><li><p>This applies uniformly whether the query comes from a notebook, Databricks SQL, or an AI/BI Dashboard.</p></li><li><p>You can add or revoke access simply by inserting or deleting rows in <code>main.demo_tpch.tpch_region_access</code> &#8211; you don&#8217;t need to change the view logic.</p></li></ul><h4><strong>3.4 Using the dynamic view in an AI/BI dataset</strong></h4><p>With <code>main.demo_tpch.tpch_sales_rls</code> in place, using it in AI/BI Dashboards is straightforward. You don&#8217;t need to re-implement any RLS logic in AI/BI &#8211; the dataset just selects from the governed view.</p><p>Create the <em>TPCH Sales (RLS Dynamic View)</em> dataset:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">SELECT
  region,
  nation,
  customer_id,
  customer_name,
  order_id,
  order_date,
  revenue
FROM main.demo_tpch.tpch_sales_rls;</code></pre></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!dlAq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa92013e6-065f-4405-8564-a23013c9a17c_1600x748.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!dlAq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa92013e6-065f-4405-8564-a23013c9a17c_1600x748.png 424w, https://substackcdn.com/image/fetch/$s_!dlAq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa92013e6-065f-4405-8564-a23013c9a17c_1600x748.png 848w, https://substackcdn.com/image/fetch/$s_!dlAq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa92013e6-065f-4405-8564-a23013c9a17c_1600x748.png 1272w, https://substackcdn.com/image/fetch/$s_!dlAq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa92013e6-065f-4405-8564-a23013c9a17c_1600x748.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!dlAq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa92013e6-065f-4405-8564-a23013c9a17c_1600x748.png" width="1456" height="681" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a92013e6-065f-4405-8564-a23013c9a17c_1600x748.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:681,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!dlAq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa92013e6-065f-4405-8564-a23013c9a17c_1600x748.png 424w, https://substackcdn.com/image/fetch/$s_!dlAq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa92013e6-065f-4405-8564-a23013c9a17c_1600x748.png 848w, https://substackcdn.com/image/fetch/$s_!dlAq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa92013e6-065f-4405-8564-a23013c9a17c_1600x748.png 1272w, https://substackcdn.com/image/fetch/$s_!dlAq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa92013e6-065f-4405-8564-a23013c9a17c_1600x748.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Build a simple dashboard page to visualize the dataset. 
In the <a href="https://github.com/ArtemChebotko/Migrating-Existing-Dashboards-to-Databricks-AI-BI">companion dashboard</a>, I created a <em>RLS with dynamic view</em> page based on this dataset:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!atyN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ce78d-3404-44ae-9cc7-f7e57a4651ab_1600x474.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!atyN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ce78d-3404-44ae-9cc7-f7e57a4651ab_1600x474.png 424w, https://substackcdn.com/image/fetch/$s_!atyN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ce78d-3404-44ae-9cc7-f7e57a4651ab_1600x474.png 848w, https://substackcdn.com/image/fetch/$s_!atyN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ce78d-3404-44ae-9cc7-f7e57a4651ab_1600x474.png 1272w, https://substackcdn.com/image/fetch/$s_!atyN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ce78d-3404-44ae-9cc7-f7e57a4651ab_1600x474.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!atyN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ce78d-3404-44ae-9cc7-f7e57a4651ab_1600x474.png" width="1456" height="431" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/108ce78d-3404-44ae-9cc7-f7e57a4651ab_1600x474.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:431,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!atyN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ce78d-3404-44ae-9cc7-f7e57a4651ab_1600x474.png 424w, https://substackcdn.com/image/fetch/$s_!atyN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ce78d-3404-44ae-9cc7-f7e57a4651ab_1600x474.png 848w, https://substackcdn.com/image/fetch/$s_!atyN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ce78d-3404-44ae-9cc7-f7e57a4651ab_1600x474.png 1272w, https://substackcdn.com/image/fetch/$s_!atyN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F108ce78d-3404-44ae-9cc7-f7e57a4651ab_1600x474.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>When you view this page, if your only grant is the row we inserted with <code>('user', current_user(), 'AFRICA')</code>, the table will show only rows where the region is <em>AFRICA</em>.</p><p>The security logic lives in Unity Catalog (dynamic view + permission table). The dashboard just selects from <code>tpch_sales_rls</code> and automatically respects row-level security for each viewer.</p><h3><strong>4. Implementing RLS on tables with row filters and column masks</strong></h3><p>In the previous section, we implemented row-level security for <em>TPCH Sales</em> using a dynamic view and a permission table. 
That pattern works well when you want a named, shareable view to point AI/BI datasets at.</p><p>Unity Catalog also lets you attach RLS and masking logic directly to tables using:</p><ul><li><p><a href="https://docs.databricks.com/aws/en/data-governance/unity-catalog/filters-and-masks/">Row filters</a> &#8211; control which rows a user can access in a table.</p></li><li><p><a href="https://docs.databricks.com/aws/en/data-governance/unity-catalog/filters-and-masks/">Column masks</a> &#8211; control what values they see in specific columns.</p></li></ul><p>These policies are evaluated in Unity Catalog at query time and apply to all compute &#8211; SQL warehouses, notebooks, and AI/BI Dashboards. Unlike dynamic views, they keep the table name unchanged, which can be important when the same table is shared across many tools.</p><p>In this section, we&#8217;ll:</p><ol><li><p>Create a <em>TPCH Sales</em> table in <code>main.demo_tpch</code> for row filters and masks.</p></li><li><p>Attach a group-based row filter that restricts Regions.</p></li><li><p>Attach a user-based row filter that uses <code>current_user()</code> and a permission table.</p></li><li><p>Add a column mask to protect a sensitive column.</p></li></ol><p>Any AI/BI dataset that selects from this table will automatically respect these policies. You don&#8217;t need to configure anything special in AI/BI.</p><p>All of the statements below should be run in a notebook or SQL editor. AI/BI Dashboards just query the resulting table.</p><h4><strong>4.1 TPCH Sales table for filters and masks</strong></h4><p>Row filters and column masks only apply to tables (and a few other relation types), not to views. 
To keep things simple, we&#8217;ll materialize the <code>tpch_sales_base</code> view from Section 3 into a Delta table that we can attach policies to:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">-- Use the same demo catalog and schema as before
USE CATALOG main;
USE SCHEMA demo_tpch;

-- Create a physical table from the base view for row filters and masks
CREATE OR REPLACE TABLE main.demo_tpch.tpch_sales_table AS
SELECT *
FROM main.demo_tpch.tpch_sales_base;</code></pre></div><p>From now on, we&#8217;ll attach row filters and masks to <code>main.demo_tpch.tpch_sales_table</code>.<br>If you point an AI/BI dataset at this table instead of the dynamic view, the behavior will be controlled by these table-level policies.</p><h4><strong>4.2 Group-based row filter on </strong><em><strong>Region</strong></em></h4><p>First, let&#8217;s attach a row filter that enforces the same <em>Region</em> rules we used in the dynamic view, but purely based on account-level groups:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">-- Row filter function that decides which regions each group can see
CREATE OR REPLACE FUNCTION main.demo_tpch.tpch_region_filter(p_region STRING)
RETURNS BOOLEAN
RETURN
  CASE
    WHEN is_account_group_member('NA Sales Managers')   THEN p_region = 'AMERICA'
    WHEN is_account_group_member('EMEA Sales Managers') THEN p_region = 'EUROPE'
    WHEN is_account_group_member('APAC Sales Managers') THEN p_region = 'ASIA'
    ELSE FALSE
  END;</code></pre></div><p>Attach it to the table:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">ALTER TABLE main.demo_tpch.tpch_sales_table
  SET ROW FILTER main.demo_tpch.tpch_region_filter ON (region);</code></pre></div><p>Effect:</p><ul><li><p>Whenever anyone queries <code>main.demo_tpch.tpch_sales_table</code>, Unity Catalog evaluates <code>tpch_region_filter(region)</code> for each row.</p></li><li><p>Users in <em>NA Sales Managers</em> see only rows where the region is <em>AMERICA</em>; <em>EMEA Sales Managers</em> see only <em>EUROPE</em>; <em>APAC Sales Managers</em> see only <em>ASIA</em>. Because the <code>CASE</code> expression returns at the first matching branch, a user who belongs to more than one of these groups sees only the region of the first group that matches.</p></li><li><p>Users not in any of these groups see no rows from this table.</p></li></ul><p>For example, a dataset can query the table directly:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">SELECT
  region,
  nation,
  customer_id,
  customer_name,
  order_id,
  order_date,
  revenue
FROM main.demo_tpch.tpch_sales_table;</code></pre></div><p>Any visualizations based on such a dataset will now respect the group-based <em>Region</em> logic without going through the dynamic view.</p><h4><strong>4.3 User-based row filter with </strong><code>current_user()</code><strong> and a permission table</strong></h4><p>Group-based rules are great for broad roles, but you may need finer control &#8211; different users seeing different subsets of customers, accounts, or regions.</p><p>We can reuse the same pattern as in Section 3 &#8211; <code>current_user()</code> + a permission table &#8211; but this time embed it in a row filter function instead of a dynamic view.</p><p>Example: restrict access by <code>customer_id</code>:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">-- Permission table mapping users to customers they can see
CREATE TABLE IF NOT EXISTS main.demo_tpch.customer_access (
  user_email  STRING,
  customer_id BIGINT
);

-- Example grants
INSERT INTO main.demo_tpch.customer_access VALUES
  ('some.user@databricks.com', 889),
  (current_user(),             1111);  -- Give yourself access</code></pre></div><p>Now create a row filter function that consults this table. The parameter is named <code>p_customer_id</code> so that it cannot be confused with the <code>customer_id</code> column of <code>customer_access</code> inside the subquery:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">CREATE OR REPLACE FUNCTION main.demo_tpch.tpch_customer_filter(p_customer_id BIGINT)
RETURNS BOOLEAN
RETURN EXISTS (
  SELECT 1
  FROM   main.demo_tpch.customer_access a
  WHERE  a.user_email  = current_user()
    AND  a.customer_id = p_customer_id
);</code></pre></div><p>Attach it to the same table. Because a table can have only one row filter, we&#8217;ll replace the <em>Region</em> filter from the previous subsection in this example:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">ALTER TABLE main.demo_tpch.tpch_sales_table
  DROP ROW FILTER;

ALTER TABLE main.demo_tpch.tpch_sales_table
  SET ROW FILTER main.demo_tpch.tpch_customer_filter ON (customer_id);</code></pre></div><p>Effect:</p><ul><li><p>Whenever someone queries <code>main.demo_tpch.tpch_sales_table</code>, Unity Catalog evaluates <code>tpch_customer_filter(customer_id)</code> for each row.</p></li><li><p>For a given user, only rows whose <code>customer_id</code> appears in <code>main.demo_tpch.customer_access</code> for <code>current_user()</code> are returned.</p></li><li><p>In the sample data above, you will see only orders for customer <code>1111</code>, while <code>some.user@databricks.com</code> will see only orders for customer <code>889</code>.</p></li></ul><p>You can verify this quickly:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">SELECT DISTINCT customer_id
FROM main.demo_tpch.tpch_sales_table
ORDER BY customer_id
LIMIT 20;</code></pre></div><p>If your only mapping is <code>(current_user(), 1111)</code>, this query should return just <code>1111</code>.</p><p>To use this in AI/BI Dashboards, you can point a dataset directly at the table:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">SELECT
  region,
  nation,
  customer_id,
  customer_name,
  order_id,
  order_date,
  revenue
FROM main.demo_tpch.tpch_sales_table;</code></pre></div><p>In the <a href="https://github.com/ArtemChebotko/Migrating-Existing-Dashboards-to-Databricks-AI-BI">companion dashboard</a>, I created a <em>RLS with row filter</em> page based on this dataset:<br></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Pq42!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2322b426-4b34-429a-b7bf-74cb5b9d9dbe_1600x485.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Pq42!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2322b426-4b34-429a-b7bf-74cb5b9d9dbe_1600x485.png 424w, https://substackcdn.com/image/fetch/$s_!Pq42!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2322b426-4b34-429a-b7bf-74cb5b9d9dbe_1600x485.png 848w, https://substackcdn.com/image/fetch/$s_!Pq42!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2322b426-4b34-429a-b7bf-74cb5b9d9dbe_1600x485.png 1272w, https://substackcdn.com/image/fetch/$s_!Pq42!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2322b426-4b34-429a-b7bf-74cb5b9d9dbe_1600x485.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Pq42!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2322b426-4b34-429a-b7bf-74cb5b9d9dbe_1600x485.png" width="1456" height="441" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2322b426-4b34-429a-b7bf-74cb5b9d9dbe_1600x485.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:441,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Pq42!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2322b426-4b34-429a-b7bf-74cb5b9d9dbe_1600x485.png 424w, https://substackcdn.com/image/fetch/$s_!Pq42!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2322b426-4b34-429a-b7bf-74cb5b9d9dbe_1600x485.png 848w, https://substackcdn.com/image/fetch/$s_!Pq42!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2322b426-4b34-429a-b7bf-74cb5b9d9dbe_1600x485.png 1272w, https://substackcdn.com/image/fetch/$s_!Pq42!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2322b426-4b34-429a-b7bf-74cb5b9d9dbe_1600x485.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>When you open it, the table and visuals only show data for customers you are allowed to see according to <code>customer_access</code>, without any RLS logic in the dashboard itself.</p><h4><strong>4.4 Bonus: masking sensitive columns with a column mask</strong></h4><p>Row filters decide which rows a user can see. 
Sometimes you also need to partially hide sensitive values within those rows &#8211; for example, masking customer names, emails or phone numbers for most users while leaving them fully visible for a small group.</p><p>Unity Catalog column masks handle this at the column level using the same pattern: a SQL UDF that can branch on <code>current_user()</code> or group membership.</p><p>Suppose we want:</p><ul><li><p>Users in a <em>PII Full Access</em> group to see full customer names.</p></li><li><p>Everyone else to see a partially masked version (for example, just the first few characters).</p></li></ul><p>First, define a masking function:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">CREATE OR REPLACE FUNCTION main.demo_tpch.mask_customer_name(name STRING)
RETURNS STRING
RETURN
  CASE
    WHEN is_account_group_member('PII Full Access') THEN name
    ELSE concat(substr(name, 1, 3), '***')
  END;</code></pre></div><p>Attach it as a mask on the table:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">ALTER TABLE main.demo_tpch.tpch_sales_table
  ALTER COLUMN customer_name
  SET MASK main.demo_tpch.mask_customer_name;</code></pre></div><p>Effect:</p><ul><li><p>Users in the <em>PII Full Access</em> group see the full <code>customer_name</code> value.</p></li><li><p>All other users see a masked version like <code>Cus***</code> instead of <code>Customer#000001111</code>.</p></li><li><p>The mask is enforced for every query against <code>tpch_sales_table</code> &#8211; notebooks, SQL editor, and AI/BI Dashboards &#8211; including exports.</p></li></ul><p>Here is what it looks like in the <a href="https://github.com/ArtemChebotko/Migrating-Existing-Dashboards-to-Databricks-AI-BI">companion dashboard</a>:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9SDT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3b91a0-d851-4081-9a2a-aa550bf2a506_1600x570.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9SDT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3b91a0-d851-4081-9a2a-aa550bf2a506_1600x570.png 424w, https://substackcdn.com/image/fetch/$s_!9SDT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3b91a0-d851-4081-9a2a-aa550bf2a506_1600x570.png 848w, https://substackcdn.com/image/fetch/$s_!9SDT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3b91a0-d851-4081-9a2a-aa550bf2a506_1600x570.png 1272w, https://substackcdn.com/image/fetch/$s_!9SDT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3b91a0-d851-4081-9a2a-aa550bf2a506_1600x570.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!9SDT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3b91a0-d851-4081-9a2a-aa550bf2a506_1600x570.png" width="1456" height="519" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e3b91a0-d851-4081-9a2a-aa550bf2a506_1600x570.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:519,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9SDT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3b91a0-d851-4081-9a2a-aa550bf2a506_1600x570.png 424w, https://substackcdn.com/image/fetch/$s_!9SDT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3b91a0-d851-4081-9a2a-aa550bf2a506_1600x570.png 848w, https://substackcdn.com/image/fetch/$s_!9SDT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3b91a0-d851-4081-9a2a-aa550bf2a506_1600x570.png 1272w, https://substackcdn.com/image/fetch/$s_!9SDT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e3b91a0-d851-4081-9a2a-aa550bf2a506_1600x570.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3><strong>5. Scaling RLS with ABAC and governed tags</strong></h3><p>Everything we&#8217;ve done so far (dynamic views, row filters, and column masks) is defined directly on individual Unity Catalog objects. That&#8217;s fine for a handful of tables, but it becomes hard to manage when you have dozens of catalogs, hundreds of schemas, and thousands of tables. 
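</p><p>To get a feel for how much per-table configuration has accumulated, one option is to list the row filters and column masks already attached in a schema. The sketch below assumes that the <code>row_filters</code> and <code>column_masks</code> relations of the catalog&#8217;s <code>information_schema</code> are available in your workspace and that you are allowed to read them; it uses <code>SELECT *</code> so that it does not depend on specific column names beyond <code>table_schema</code>:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">-- Row filters currently attached to tables in the demo schema
SELECT *
FROM main.information_schema.row_filters
WHERE table_schema = 'demo_tpch';

-- Column masks currently attached to columns in the demo schema
SELECT *
FROM main.information_schema.column_masks
WHERE table_schema = 'demo_tpch';</code></pre></div><p>Every row in these results is a policy that someone attached by hand and now has to keep consistent with all the others. 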
This is exactly the problem that <a href="https://docs.databricks.com/gcp/en/data-governance/unity-catalog/abac/">attribute-based access control (ABAC)</a> with governed tags is designed to solve in Unity Catalog.</p><p>At a high level, ABAC adds three building blocks on top of the mechanisms we already used:</p><ul><li><p>Governed tags &#8211; account-level tags like <em>sensitivity</em>, <em>business_domain</em>, or <em>region_scope</em>, with a controlled set of allowed values. You attach these tags to catalogs, schemas, tables, or columns.</p></li><li><p>Policy UDFs &#8211; reusable SQL UDFs that implement row-filter or column-mask logic, similar to the functions we wrote earlier, but intended to be reused across many datasets.</p></li><li><p>ABAC policies &#8211; centrally managed policies that say &#8220;when a tagged object matches these conditions, apply this row filter or column mask for these principals.&#8221; Policies can be attached at the catalog, schema, or table level and inherit down the hierarchy.</p></li></ul><p>Databricks recommends using ABAC as the primary way to apply row filters and column masks at scale, and reserving table-by-table configuration for special cases. 
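</p><p>As a rough sketch of how the three building blocks above could fit together, here is the <code>customer_name</code> mask from Section 4.4 expressed as a tag-driven policy. Everything named here is an assumption for illustration: the <code>pii</code> governed tag and its <code>name</code> value, the <code>main.governance</code> schema, the <code>mask_name</code> function, and the <code>mask_pii_names</code> policy. The <code>CREATE POLICY</code> statement follows the ABAC syntax as I understand it from the Databricks documentation, so check it against the current docs before relying on it. The sketch also assumes that the governed tag already exists at the account level and that the manual mask from Section 4.4 has been dropped:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">-- Sketch only: tag, schema, function, and policy names are hypothetical

-- 1. Tag the column so that policies can find it
ALTER TABLE main.demo_tpch.tpch_sales_table
  ALTER COLUMN customer_name
  SET TAGS ('pii' = 'name');

-- 2. Reusable masking UDF kept in a shared governance schema
CREATE SCHEMA IF NOT EXISTS main.governance;

CREATE OR REPLACE FUNCTION main.governance.mask_name(name STRING)
RETURNS STRING
RETURN concat(substr(name, 1, 3), '***');

-- 3. Policy: mask every column tagged pii = name in the schema
--    for all account users except the PII Full Access group
CREATE POLICY mask_pii_names
ON SCHEMA main.demo_tpch
COMMENT 'Mask columns tagged pii = name'
COLUMN MASK main.governance.mask_name
TO `account users`
EXCEPT `PII Full Access`
FOR TABLES
MATCH COLUMNS hasTagValue('pii', 'name') AS name_col
ON COLUMN name_col;</code></pre></div><p>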
Conceptually, you can think of ABAC as &#8220;lifting&#8221; the patterns from Section 4 into a central policy layer:</p><ol><li><p>Tag the data once &#8211; apply governed tags to catalogs, schemas, tables, and columns that participate in RLS or masking.</p></li><li><p>Register reusable UDFs &#8211; define shared row-filter and mask functions in a governance schema (for example, <code>governance.region_filter()</code> and <code>governance.mask_customer_name()</code>).</p></li><li><p>Create ABAC policies &#8211; define policies that attach those UDFs to tagged objects based on tag conditions and target groups.</p></li><li><p>Let tags drive behavior &#8211; as new tables and columns are tagged, the appropriate row filters and masks are applied automatically by Unity Catalog.</p></li></ol><p>From an AI/BI Dashboards perspective, the experience is the same as in Sections 3 and 4: the dataset SQL stays simple, and the dashboard filters and visualizations work as usual. The difference is that the security logic is now centralized in ABAC policies and tags instead of being embedded directly into each table or view.</p><h3><strong>6. Summary and next steps</strong></h3><p>In many BI tools, user-based filtering and row-level security are implemented close to the dashboard or semantic layer using user/group mappings, security tables, and identity-aware logic. In Databricks, the key shift is that these controls move into <a href="https://docs.databricks.com/aws/en/data-governance/unity-catalog/">Unity Catalog</a>, and AI/BI Dashboards simply query governed tables and views.</p><p>In this post, we walked through three main patterns:</p><ul><li><p><a href="https://docs.databricks.com/aws/en/views/dynamic">Dynamic views and permission tables</a> (Section 3)<strong><br></strong> We built <code>main.demo_tpch.tpch_sales_rls</code> on top of a base TPCH Sales view and a <code>tpch_region_access</code> table. 
The dynamic view uses <code>current_user()</code> and <code>is_account_group_member()</code> to return different Regions for different users and groups. AI/BI datasets that query this view automatically inherit the row-level security.</p></li><li><p><a href="https://docs.databricks.com/aws/en/data-governance/unity-catalog/filters-and-masks/">Table-level row filters and column masks</a> (Section 4)<strong><br></strong> We materialized TPCH Sales into <code>main.demo_tpch.tpch_sales_table</code> and attached a row filter that looks up allowed <code>customer_id</code> values in <code>customer_access</code> based on <code>current_user()</code>. We also added a column mask for <code>customer_name</code>, showing full names only to a privileged group. Any dataset that selects from this table sees the combined effect of RLS and masking without any extra logic in the dashboard.</p></li><li><p><a href="https://docs.databricks.com/aws/en/data-governance/unity-catalog/abac/">ABAC and governed tags</a> (Section 5)<strong><br></strong> We then zoomed out to show how ABAC can apply the same kinds of row filters and masks at scale, using governed tags, reusable policy UDFs, and central ABAC policies. Instead of configuring each table or view by hand, you tag data once and let policies attach the right filters and masks automatically.</p></li></ul><p>Across all three patterns, the core idea is the same: <strong>AI/BI Dashboards stay simple; Unity Catalog enforces who sees which rows and what values.</strong></p><p>The <a href="https://github.com/ArtemChebotko/Migrating-Existing-Dashboards-to-Databricks-AI-BI">companion dashboard</a> brings these ideas together in a concrete, runnable example. 
If you import it into your workspace and wire it up to the views and tables from this post, you can see exactly how the visuals behave for different users as Unity Catalog applies dynamic views, row filters, and column masks behind the scenes.</p><p>Combined with the first two posts in this series (<a href="https://www.databricksters.com/p/migrating-existing-dashboards-to">Part 1</a> and <a href="https://www.databricksters.com/p/migrating-existing-dashboards-to-482">Part 2</a>), you now have a practical set of patterns for implementing filtering, drill-through, and row-level security in Databricks AI/BI Dashboards on top of Unity Catalog.</p>
]]></content:encoded></item><item><title><![CDATA[The Kafka TTL Trap: Updating Spark Streaming Tables Without Data Loss]]></title><description><![CDATA[How to update streaming bronze tables in Spark after your source data has expired.]]></description><link>https://www.databricksters.com/p/beating-kafkas-clock-the-zero-data</link><guid isPermaLink="false">https://www.databricksters.com/p/beating-kafkas-clock-the-zero-data</guid><dc:creator><![CDATA[Neil Wilson]]></dc:creator><pubDate>Tue, 24 Mar 2026 15:01:25 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2ac95c65-630b-48c7-b643-405af3cbe2d8_1376x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>TL;DR: Updating Streaming Tables Without Data Loss</strong></p><ul><li><p><strong>The Problem:</strong> A &#8220;Full Refresh&#8221; on Kafka-fed pipelines can cause <strong>permanent data loss</strong> if older records have aged out of the topic (TTL).</p></li><li><p><strong>The Strategy:</strong> Use a <strong>&#8220;Backup-and-Rebase&#8221;</strong> workflow: archive existing data, identify the last processed offsets, and point the new pipeline to that exact starting position.</p></li><li><p><strong>The Execution:</strong> This guide demonstrates how to manually configure <code>startingOffsets</code> in Spark to bridge the gap between historical backups and new Kafka data.</p></li><li><p><strong>The Result:</strong> Seamless streaming table updates with zero data loss or record duplication.</p></li></ul><h2>The Constraint: Kafka TTL &amp; Streaming Table 
Immutability</h2><p>In Spark Declarative Pipelines (SDP), situations may arise when you need to alter a bronze Streaming Table that is being fed by Apache Kafka. This can pose a challenge as you cannot manually alter Streaming Tables via Alter Table commands. <br><br>This is further complicated by the fact that Kafka topics are configured with a finite retention period (Time-to-live or TTL), meaning older records eventually age out. Since a pipeline Full Refresh clears the target Delta Table, you cannot simply update your pipeline definition and full refresh from the source, as older records will be missing. The diagram below illustrates this situation. The Kafka cluster contains user-4 who has already been ingested, and users 5 and 6 who still need to be ingested into our Lakehouse. The older user records (1, 2, and 3) have aged out of the topic&#8217;s TTL.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LMhK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23461c4d-4ca6-4f67-ba66-caf45b929828_767x583.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LMhK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23461c4d-4ca6-4f67-ba66-caf45b929828_767x583.png 424w, https://substackcdn.com/image/fetch/$s_!LMhK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23461c4d-4ca6-4f67-ba66-caf45b929828_767x583.png 848w, https://substackcdn.com/image/fetch/$s_!LMhK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23461c4d-4ca6-4f67-ba66-caf45b929828_767x583.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LMhK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23461c4d-4ca6-4f67-ba66-caf45b929828_767x583.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LMhK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23461c4d-4ca6-4f67-ba66-caf45b929828_767x583.png" width="767" height="583" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/23461c4d-4ca6-4f67-ba66-caf45b929828_767x583.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:583,&quot;width&quot;:767,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32107,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://neilwilsondata.substack.com/i/188168915?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23461c4d-4ca6-4f67-ba66-caf45b929828_767x583.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!LMhK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23461c4d-4ca6-4f67-ba66-caf45b929828_767x583.png 424w, https://substackcdn.com/image/fetch/$s_!LMhK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23461c4d-4ca6-4f67-ba66-caf45b929828_767x583.png 848w, 
https://substackcdn.com/image/fetch/$s_!LMhK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23461c4d-4ca6-4f67-ba66-caf45b929828_767x583.png 1272w, https://substackcdn.com/image/fetch/$s_!LMhK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F23461c4d-4ca6-4f67-ba66-caf45b929828_767x583.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>As new records are constantly being appended to this topic, how can I update my pipeline and alter my result table without missing 
new records (users 5, 6, etc.), dropping old records (users 1, 2, and 3), or duplicating records (user-4)? The following example is contrived to illustrate the solution. The actual reason for implementing this will vary by use case.</p><h2>Example: Ingesting JSON via Kafka</h2><p>Imagine a pipeline is ingesting JSON data from Kafka. This JSON data contains three top-level fields: user, email, and country. It also contains two nested fields, &#8220;device&#8221; and &#8220;event&#8221;, which describe what actions users are taking and from which devices.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;json&quot;,&quot;nodeId&quot;:&quot;bb338c7a-d588-4351-840e-76569893f3ce&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-json">{"user":"user-2","email":"user-2@example.com",
"country":"CA",
"device":{"os":"android","model":"Pixel 7","geo":{"lat":39.98,"lon":-82.98}},
"event":{"name":"demo","seq":2,"ts":"2026-02-17T14:40:26.189Z"}}</code></pre></div><p>This data is written to a json_bronze table, which stores the user, email, and country fields as String and the nested fields as Struct types. The table also stores Kafka metadata fields.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;d6daedb6-3323-4952-9560-62e6c1225bf7&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import dlt
from pyspark.sql.functions import current_timestamp, col, from_json, expr

SERVERS = "REDACTED"
TOPIC = "neil_struct_topic"

# Explicit JSON schema for predictable demos (no schemaLocationKey)
SCHEMA_DDL = """
  user STRING,
  email STRING,
  country STRING,
  device STRUCT&lt;os: STRING, model: STRING, geo: STRUCT&lt;lat: DOUBLE, lon: DOUBLE&gt;&gt;,
  event STRUCT&lt;name: STRING, seq: BIGINT, ts: STRING&gt;
"""

@dlt.table(
    name="json_bronze",
    comment="Raw Kafka payload with explicit JSON schema and rescued data",
    table_properties={"quality": "bronze"}
)
def json_bronze():
    df = (
        spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", SERVERS)
            .option("kafka.security.protocol", "SSL")
            .option("subscribe", TOPIC)
            .option("startingOffsets", "earliest")
            .load()
    )

    parsed = (
        df.selectExpr("CAST(value AS STRING) AS json_str", "topic", "partition", "offset", "timestamp")
          .select(
              from_json(
                  col("json_str"),
                  SCHEMA_DDL,
                  options={"rescuedDataColumn": "_rescued_data"}  # capture type mismatches/new fields
              ).alias("data"),
              "topic", "partition", "offset", "timestamp"
          )
          # data.* includes _rescued_data already; do not reselect it to avoid duplicate column error
          .selectExpr("data.*", "topic", "partition", "offset", "timestamp AS kafka_timestamp")
          .withColumn("ingestion_ts", current_timestamp())
    )

    return parsed</code></pre></div><p>Below is the current state of our target table. This matches the state of the diagram above. Users 1 through 4 have been ingested into the target Delta table.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yqXK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F626e4ec6-b311-40e7-99cc-2fde4bdec665_853x200.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yqXK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F626e4ec6-b311-40e7-99cc-2fde4bdec665_853x200.png 424w, https://substackcdn.com/image/fetch/$s_!yqXK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F626e4ec6-b311-40e7-99cc-2fde4bdec665_853x200.png 848w, https://substackcdn.com/image/fetch/$s_!yqXK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F626e4ec6-b311-40e7-99cc-2fde4bdec665_853x200.png 1272w, https://substackcdn.com/image/fetch/$s_!yqXK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F626e4ec6-b311-40e7-99cc-2fde4bdec665_853x200.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yqXK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F626e4ec6-b311-40e7-99cc-2fde4bdec665_853x200.png" width="853" height="200" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/626e4ec6-b311-40e7-99cc-2fde4bdec665_853x200.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:200,&quot;width&quot;:853,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:41922,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://neilwilsondata.substack.com/i/188168915?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F779260b0-f779-4e54-92d9-9cfc02013122_1171x200.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!yqXK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F626e4ec6-b311-40e7-99cc-2fde4bdec665_853x200.png 424w, https://substackcdn.com/image/fetch/$s_!yqXK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F626e4ec6-b311-40e7-99cc-2fde4bdec665_853x200.png 848w, https://substackcdn.com/image/fetch/$s_!yqXK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F626e4ec6-b311-40e7-99cc-2fde4bdec665_853x200.png 1272w, https://substackcdn.com/image/fetch/$s_!yqXK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F626e4ec6-b311-40e7-99cc-2fde4bdec665_853x200.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Now let&#8217;s imagine the nature of our topic changes and the device and event fields need to become flexible, allowing for new nested fields to be added anytime and 
reflected in our target. In the current implementation, these new fields would not appear automatically in our Struct columns.</p><p>One way to allow for flexibility of these nested columns is to update the Struct columns to Variant type. As mentioned above, however, Alter Table is unavailable on a streaming table to update column types. Here&#8217;s how to accomplish a streaming table update while ensuring there is no data-loss or duplication.</p><h4>Step 1: Pause your pipeline</h4><p>Whether your pipeline is continuous or scheduled to run periodically, you don&#8217;t want to be ingesting data while performing these actions. <br><br>Under the scheduled Job:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eG7D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F236b7239-8e51-48b6-b76f-ba2e657b488c_393x125.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!eG7D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F236b7239-8e51-48b6-b76f-ba2e657b488c_393x125.png 424w, https://substackcdn.com/image/fetch/$s_!eG7D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F236b7239-8e51-48b6-b76f-ba2e657b488c_393x125.png 848w, https://substackcdn.com/image/fetch/$s_!eG7D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F236b7239-8e51-48b6-b76f-ba2e657b488c_393x125.png 1272w, https://substackcdn.com/image/fetch/$s_!eG7D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F236b7239-8e51-48b6-b76f-ba2e657b488c_393x125.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eG7D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F236b7239-8e51-48b6-b76f-ba2e657b488c_393x125.png" width="393" height="125" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/236b7239-8e51-48b6-b76f-ba2e657b488c_393x125.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:125,&quot;width&quot;:393,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:12269,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/189148173?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F236b7239-8e51-48b6-b76f-ba2e657b488c_393x125.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!eG7D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F236b7239-8e51-48b6-b76f-ba2e657b488c_393x125.png 424w, https://substackcdn.com/image/fetch/$s_!eG7D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F236b7239-8e51-48b6-b76f-ba2e657b488c_393x125.png 848w, https://substackcdn.com/image/fetch/$s_!eG7D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F236b7239-8e51-48b6-b76f-ba2e657b488c_393x125.png 1272w, https://substackcdn.com/image/fetch/$s_!eG7D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F236b7239-8e51-48b6-b76f-ba2e657b488c_393x125.png 
1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h4>Step 2: Back Up the Bronze Table</h4><p>This retains all data we&#8217;ve already ingested (users 1-4), including data that no longer exists in our Kafka topic (users 1-3). We&#8217;ve now ensured we won&#8217;t lose records that have aged out of the source.</p><pre><code><code>CREATE TABLE neil_test_catalog.streaming.json_bronze_backup AS
SELECT * FROM neil_test_catalog.streaming.json_bronze;</code></code></pre><h4>Step 3: Determine Max Offset per Partition</h4><p>At this point, Kafka has continued to receive new records behind the scenes (user-5 and user-6). When we Full Refresh our pipeline, we want to read only these new messages without reprocessing data we have already ingested (user-4). We can achieve this with the <code>startingOffsets</code> option of Spark&#8217;s Kafka readStream source.</p><p>This option specifies which offsets Spark should <strong>begin</strong> reading from <em>the first time a pipeline runs</em>, that is, when no checkpoint exists yet. A Full Refresh resets the streaming state, which is why the option takes effect again on the next run. Keep in mind that in Kafka, offsets are integers that uniquely identify messages <strong>per partition</strong>, so you&#8217;ll have to specify a starting offset for each partition in your topic. SDP uses these offsets to ensure that upon initial startup, the pipeline begins exactly where you intend. From that point forward, Spark continuously records these offsets in its internal checkpoints to track progress over time and guarantee exactly-once processing. 
It&#8217;s also a good idea to store this Kafka metadata in the Delta table itself.</p><p>Here&#8217;s the same snapshot of our source topic and target Delta table, with offset information included:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GuUb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe5c843-bf29-4786-a1cd-87bd9405bc8d_862x599.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GuUb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe5c843-bf29-4786-a1cd-87bd9405bc8d_862x599.png 424w, https://substackcdn.com/image/fetch/$s_!GuUb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe5c843-bf29-4786-a1cd-87bd9405bc8d_862x599.png 848w, https://substackcdn.com/image/fetch/$s_!GuUb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe5c843-bf29-4786-a1cd-87bd9405bc8d_862x599.png 1272w, https://substackcdn.com/image/fetch/$s_!GuUb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe5c843-bf29-4786-a1cd-87bd9405bc8d_862x599.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GuUb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe5c843-bf29-4786-a1cd-87bd9405bc8d_862x599.png" width="862" height="599" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3fe5c843-bf29-4786-a1cd-87bd9405bc8d_862x599.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:599,&quot;width&quot;:862,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:46089,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://neilwilsondata.substack.com/i/188168915?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe5c843-bf29-4786-a1cd-87bd9405bc8d_862x599.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!GuUb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe5c843-bf29-4786-a1cd-87bd9405bc8d_862x599.png 424w, https://substackcdn.com/image/fetch/$s_!GuUb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe5c843-bf29-4786-a1cd-87bd9405bc8d_862x599.png 848w, https://substackcdn.com/image/fetch/$s_!GuUb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe5c843-bf29-4786-a1cd-87bd9405bc8d_862x599.png 1272w, https://substackcdn.com/image/fetch/$s_!GuUb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3fe5c843-bf29-4786-a1cd-87bd9405bc8d_862x599.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Notice that to begin processing at user-5 we will need to specify our pipeline start at offset 4. 
The startingOffsets parameter expects this information in the following format:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;059cdf57-c09a-47f0-b447-6a1e51336b9e&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">.option("startingOffsets", '{"neil_struct_topic":{"0":4}}')</code></pre></div><p>If our topic contained multiple partitions, a starting offset would have to be set for each one. For two partitions it would look like this:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;b9310ebd-3020-42c3-b53b-cc58a02106cd&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">.option("startingOffsets", '{"neil_struct_topic":{"0":4, "1":6}}')
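
# A minimal sketch, not part of the original pipeline: build the startingOffsets
# JSON from the highest offset already ingested per partition. The helper name
# build_starting_offsets and the sample offsets below are hypothetical.
import json

def build_starting_offsets(topic, max_ingested_offsets):
    # Spark starts reading AT the given offset, so use last ingested offset + 1.
    next_offsets = {str(p): o + 1 for p, o in sorted(max_ingested_offsets.items())}
    return json.dumps({topic: next_offsets})

# Assumed sample state: partition 0 last ingested offset 3, partition 1 offset 5.
starting_offsets = build_starting_offsets("neil_struct_topic", {0: 3, 1: 5})
# starting_offsets == '{"neil_struct_topic": {"0": 4, "1": 6}}'
# It could then be passed as .option("startingOffsets", starting_offsets).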
</code></pre></div><p>But how can we find this information?</p><h4>Finding via Metadata Columns (If Defined in Pipeline and Tracked in Delta Table)</h4><p>If you&#8217;ve added Kafka metadata to your bronze table, you can retrieve the max offset per partition from the table itself. Remember that this simple example has only one Kafka partition. Note that the query returns the most recent offset <em>ingested</em> for each partition, so you must add 1 to each value to get the offset where Spark should <strong>start</strong> reading.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:&quot;3b7187f9-2c43-4090-9182-4205357e34df&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">SELECT partition, MAX(offset)
FROM neil_test_catalog.streaming.json_bronze
GROUP BY partition</code></pre></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0pFu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b343062-1b38-49b2-b258-106bc89c3a3d_688x196.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0pFu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b343062-1b38-49b2-b258-106bc89c3a3d_688x196.png 424w, https://substackcdn.com/image/fetch/$s_!0pFu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b343062-1b38-49b2-b258-106bc89c3a3d_688x196.png 848w, https://substackcdn.com/image/fetch/$s_!0pFu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b343062-1b38-49b2-b258-106bc89c3a3d_688x196.png 1272w, https://substackcdn.com/image/fetch/$s_!0pFu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b343062-1b38-49b2-b258-106bc89c3a3d_688x196.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0pFu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b343062-1b38-49b2-b258-106bc89c3a3d_688x196.png" width="688" height="196" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6b343062-1b38-49b2-b258-106bc89c3a3d_688x196.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:196,&quot;width&quot;:688,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:20135,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://neilwilsondata.substack.com/i/188168915?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b343062-1b38-49b2-b258-106bc89c3a3d_688x196.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0pFu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b343062-1b38-49b2-b258-106bc89c3a3d_688x196.png 424w, https://substackcdn.com/image/fetch/$s_!0pFu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b343062-1b38-49b2-b258-106bc89c3a3d_688x196.png 848w, https://substackcdn.com/image/fetch/$s_!0pFu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b343062-1b38-49b2-b258-106bc89c3a3d_688x196.png 1272w, https://substackcdn.com/image/fetch/$s_!0pFu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6b343062-1b38-49b2-b258-106bc89c3a3d_688x196.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h4>Finding via Spark Declarative Pipelines Checkpoints</h4><p>Another way to retrieve your offset information is to query SDP&#8217;s /checkpoints/ folder that your 
pipeline uses to track state and progress. For more detailed information on checkpoints and how Spark Structured Streaming achieves exactly-once processing, check out this blog: <a href="https://www.canadiandataguy.com/p/inside-delta-lakes-idempotency-magic">Inside Delta Lake&#8217;s Idempotency Magic: The Secret to Exactly-Once Spark</a>.</p><p>First, find your streaming table&#8217;s storage location via DESCRIBE DETAIL.<br><br>Note: If your table is Unity Catalog-managed, this method requires direct read access to the table&#8217;s managed storage location. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TB3u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef457b7-710f-4ea9-9742-bb201bd52e8d_1396x444.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TB3u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef457b7-710f-4ea9-9742-bb201bd52e8d_1396x444.png 424w, https://substackcdn.com/image/fetch/$s_!TB3u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef457b7-710f-4ea9-9742-bb201bd52e8d_1396x444.png 848w, https://substackcdn.com/image/fetch/$s_!TB3u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef457b7-710f-4ea9-9742-bb201bd52e8d_1396x444.png 1272w, https://substackcdn.com/image/fetch/$s_!TB3u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef457b7-710f-4ea9-9742-bb201bd52e8d_1396x444.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!TB3u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef457b7-710f-4ea9-9742-bb201bd52e8d_1396x444.png" width="1396" height="444" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aef457b7-710f-4ea9-9742-bb201bd52e8d_1396x444.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:444,&quot;width&quot;:1396,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:105584,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://neilwilsondata.substack.com/i/188168915?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef457b7-710f-4ea9-9742-bb201bd52e8d_1396x444.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!TB3u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef457b7-710f-4ea9-9742-bb201bd52e8d_1396x444.png 424w, https://substackcdn.com/image/fetch/$s_!TB3u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef457b7-710f-4ea9-9742-bb201bd52e8d_1396x444.png 848w, https://substackcdn.com/image/fetch/$s_!TB3u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef457b7-710f-4ea9-9742-bb201bd52e8d_1396x444.png 1272w, 
https://substackcdn.com/image/fetch/$s_!TB3u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faef457b7-710f-4ea9-9742-bb201bd52e8d_1396x444.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Using the location, you can append &#8220;/_dlt_metadata/checkpoints/<strong>your_table_name</strong>/&#8221; to find the most recent streaming query context (the greatest number). 
For this streaming table, the max result is 8, as shown below.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;c72351ae-4363-40a7-bb0b-0e4a9ad1c060&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">path = "s3://...."
metadata_path = path + "/_dlt_metadata/checkpoints/neil_test_catalog.streaming.json_bronze/"
display(dbutils.fs.ls(metadata_path))</code></pre></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qpG9!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f30f6ae-941e-45fc-bfc9-6908bd30b52f_1810x863.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qpG9!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f30f6ae-941e-45fc-bfc9-6908bd30b52f_1810x863.png 424w, https://substackcdn.com/image/fetch/$s_!qpG9!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f30f6ae-941e-45fc-bfc9-6908bd30b52f_1810x863.png 848w, https://substackcdn.com/image/fetch/$s_!qpG9!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f30f6ae-941e-45fc-bfc9-6908bd30b52f_1810x863.png 1272w, https://substackcdn.com/image/fetch/$s_!qpG9!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f30f6ae-941e-45fc-bfc9-6908bd30b52f_1810x863.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qpG9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f30f6ae-941e-45fc-bfc9-6908bd30b52f_1810x863.png" width="1456" height="694" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9f30f6ae-941e-45fc-bfc9-6908bd30b52f_1810x863.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:694,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:212970,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://neilwilsondata.substack.com/i/188168915?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd25f7e4f-dfb4-45b5-853b-5ba022965d0c_2028x888.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qpG9!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f30f6ae-941e-45fc-bfc9-6908bd30b52f_1810x863.png 424w, https://substackcdn.com/image/fetch/$s_!qpG9!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f30f6ae-941e-45fc-bfc9-6908bd30b52f_1810x863.png 848w, https://substackcdn.com/image/fetch/$s_!qpG9!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f30f6ae-941e-45fc-bfc9-6908bd30b52f_1810x863.png 1272w, https://substackcdn.com/image/fetch/$s_!qpG9!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9f30f6ae-941e-45fc-bfc9-6908bd30b52f_1810x863.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Within the numbered subfolder under your table name, you will see /offsets/ and /commits/ folders, each also containing numbered folders 0/, 1/, 2/ and so on. These folders represent streaming batches in SDP.</p><ul><li><p><code>offsets/N</code> is a <strong>write&#8209;ahead log entry</strong> written before processing batch N. 
It stores the <strong>end offsets (high&#8209;water mark)</strong> for that batch &#8212; i.e., &#8220;read up to here&#8221; for each topic/partition.</p></li><li><p><code>commits/N</code> is only written after batch N has finished successfully.</p></li></ul><p>Because an offsets/N entry can exist even if its corresponding commits/N is missing (a batch started but never committed), you should follow these steps to determine where to retrieve your startingOffsets.</p><p>List the batch IDs in both folders:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">offsets_path = path + "/_dlt_metadata/checkpoints/your_table_name/8/offsets/"
commits_path = path + "/_dlt_metadata/checkpoints/your_table_name/8/commits/"
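# Optional sketch, not part of the original walkthrough: pick the last fully
# committed batch from the two listings. The helper name latest_committed_batch
# is hypothetical; it only assumes entry names are numeric batch IDs
# (optionally with a trailing slash) and ignores anything else.
def latest_committed_batch(commit_names, offset_names):
    def batch_ids(names):
        return {int(n.rstrip("/")) for n in names if n.rstrip("/").isdecimal()}
    # Keep only batches that have both a commits entry and an offsets entry.
    committed = batch_ids(commit_names).intersection(batch_ids(offset_names))
    return max(committed) if committed else None

# In a Databricks notebook (where dbutils is available) it could be called as:
# max_commit_batch = latest_committed_batch(
#     [f.name for f in dbutils.fs.ls(commits_path)],
#     [f.name for f in dbutils.fs.ls(offsets_path)],
# )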
display(dbutils.fs.ls(offsets_path))
display(dbutils.fs.ls(commits_path))</code></pre></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UwIC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F484104d7-6833-422a-81b1-a33bc63d21e7_1171x579.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UwIC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F484104d7-6833-422a-81b1-a33bc63d21e7_1171x579.png 424w, https://substackcdn.com/image/fetch/$s_!UwIC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F484104d7-6833-422a-81b1-a33bc63d21e7_1171x579.png 848w, https://substackcdn.com/image/fetch/$s_!UwIC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F484104d7-6833-422a-81b1-a33bc63d21e7_1171x579.png 1272w, https://substackcdn.com/image/fetch/$s_!UwIC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F484104d7-6833-422a-81b1-a33bc63d21e7_1171x579.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UwIC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F484104d7-6833-422a-81b1-a33bc63d21e7_1171x579.png" width="1171" height="579" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/484104d7-6833-422a-81b1-a33bc63d21e7_1171x579.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:579,&quot;width&quot;:1171,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:149194,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/189148173?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F434128e2-d030-41e8-823a-07ccf1b63a36_1171x579.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!UwIC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F484104d7-6833-422a-81b1-a33bc63d21e7_1171x579.png 424w, https://substackcdn.com/image/fetch/$s_!UwIC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F484104d7-6833-422a-81b1-a33bc63d21e7_1171x579.png 848w, https://substackcdn.com/image/fetch/$s_!UwIC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F484104d7-6833-422a-81b1-a33bc63d21e7_1171x579.png 1272w, https://substackcdn.com/image/fetch/$s_!UwIC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F484104d7-6833-422a-81b1-a33bc63d21e7_1171x579.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let:</p><ul><li><p><code>commit_batches</code> = all numeric batch IDs under <code>commits/</code></p></li><li><p><code>max_commit_batch</code> = largest value in <code>commit_batches</code></p></li></ul><p>Use <code>max_commit_batch</code> as the last fully committed batch, and read the matching offsets file:</p><p><strong>Always</strong> read <code>offsets/max_commit_batch</code> to get the correct <code>startingOffsets</code> JSON.</p><p>Ignore any higher batch ID that appears only under <code>offsets/</code> but not <code>commits/</code>. 
That batch started but never finished, so if you treat its offsets as your starting position, Spark will behave as if that data has already been read and will <strong>skip</strong> it rather than processing it.</p><p>In my example, both the /offsets/ and /commits/ folders contain only batches 0 and 1, so <code>max_commit_batch</code> is 1 and we read startingOffsets from /offsets/1.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;5e16f2a8-9c27-4b1a-b3ae-e66ef265589f&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">metadata_path = path + "/_dlt_metadata/checkpoints/neil_test_catalog.streaming.json_bronze/8/offsets/1/"
display(dbutils.fs.head(metadata_path))</code></pre></div><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;json&quot;,&quot;nodeId&quot;:&quot;45573231-4ab2-4ab6-b675-b5b5cdd1ed08&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-json">{"batchWatermarkMs":0,"batchTimestampMs":1771439553201,"conf":{...}}
{"neil_struct_topic":{"0":4}}</code></pre></div><p>Notice that the final line is our starting partition:offset information, in the exact format we created manually above. Spark tracks the next offset to read, so you do not need to add 1 to the offset with this method. Again, if our topic had multiple partitions, it might look like:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;json&quot;,&quot;nodeId&quot;:&quot;d18d4bc3-3a9f-4d23-afb3-bbc85e9cf333&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-json">{"neil_struct_topic":{"0":4, "1":6}}</code></pre></div><h4>Step 4: Update pipeline definition</h4><p>Now it&#8217;s time to apply the actual logic changes that prompted this process. For my example, I add variant support to table properties:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">table_properties={"quality": "bronze", "delta.feature.variantType-preview": "supported"}</code></pre></div><p>And cast the Struct columns to Variant:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:null}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">return (
        parsed
            .withColumn("event", expr("parse_json(to_json(event))"))
            .withColumn("device", expr("parse_json(to_json(device))"))
        )</code></pre></div><h4>Step 5: Full Refresh Table</h4><p>With the pipeline logic updated, it&#8217;s time to run a Full refresh, which wipes the target table and ingests Kafka records starting at our specified offsets. The resulting table will contain only the records we did not back up via Step 2.</p><p>To do this, add the startingOffsets option shown first below (line 28 of the final pipeline definition that follows), then run the pipeline with Full refresh.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;6c55857d-d0f1-47b3-a53d-d6592eae648e&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">.option("startingOffsets", '{"neil_struct_topic":{"0":4}}')</code></pre></div><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;152abf3a-b98c-4704-9e92-8f5622e075cb&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import dlt
from pyspark.sql.functions import current_timestamp, col, from_json, expr

SERVERS = "REDACTED"
TOPIC = "neil_struct_topic"

# Explicit JSON schema for predictable demos (no schemaLocationKey)
SCHEMA_DDL = """
  user STRING,
  email STRING,
  country STRING,
  device STRUCT&lt;os: STRING, model: STRING, geo: STRUCT&lt;lat: DOUBLE, lon: DOUBLE&gt;&gt;,
  event STRUCT&lt;name: STRING, seq: BIGINT, ts: STRING&gt;
"""

@dlt.table(
    name="json_bronze",
    comment="Raw Kafka payload with explicit JSON schema and rescued data",
    table_properties={"quality": "bronze", "delta.feature.variantType-preview": "supported"}
)
def json_bronze():
    df = (
        spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", SERVERS)
            .option("kafka.security.protocol", "SSL")
            .option("subscribe", TOPIC)
            .option("startingOffsets", '{"neil_struct_topic":{"0":4}}')
            .load()
    )

    parsed = (
        df.selectExpr("CAST(value AS STRING) AS json_str", "topic", "partition", "offset", "timestamp")
          .select(
              from_json(
                  col("json_str"),
                  SCHEMA_DDL,
                  options={"rescuedDataColumn": "_rescued_data"}  # capture type mismatches/new fields
              ).alias("data"),
              "topic", "partition", "offset", "timestamp"
          )
          # data.* includes _rescued_data already; do not reselect it to avoid duplicate column error
          .selectExpr("data.*", "topic", "partition", "offset", "timestamp AS kafka_timestamp")
          .withColumn("ingestion_ts", current_timestamp())
    )

    return (
        parsed
            .withColumn("event", expr("parse_json(to_json(event))"))
            .withColumn("device", expr("parse_json(to_json(device))"))
        )</code></pre></div><p>On the pipeline page:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!A22d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287e92dd-2a47-4cc9-b03f-612e38730a73_396x219.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!A22d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287e92dd-2a47-4cc9-b03f-612e38730a73_396x219.png 424w, https://substackcdn.com/image/fetch/$s_!A22d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287e92dd-2a47-4cc9-b03f-612e38730a73_396x219.png 848w, https://substackcdn.com/image/fetch/$s_!A22d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287e92dd-2a47-4cc9-b03f-612e38730a73_396x219.png 1272w, https://substackcdn.com/image/fetch/$s_!A22d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287e92dd-2a47-4cc9-b03f-612e38730a73_396x219.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!A22d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287e92dd-2a47-4cc9-b03f-612e38730a73_396x219.png" width="396" height="219" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/287e92dd-2a47-4cc9-b03f-612e38730a73_396x219.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:219,&quot;width&quot;:396,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:33568,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/189148173?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287e92dd-2a47-4cc9-b03f-612e38730a73_396x219.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!A22d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287e92dd-2a47-4cc9-b03f-612e38730a73_396x219.png 424w, https://substackcdn.com/image/fetch/$s_!A22d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287e92dd-2a47-4cc9-b03f-612e38730a73_396x219.png 848w, https://substackcdn.com/image/fetch/$s_!A22d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287e92dd-2a47-4cc9-b03f-612e38730a73_396x219.png 1272w, https://substackcdn.com/image/fetch/$s_!A22d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F287e92dd-2a47-4cc9-b03f-612e38730a73_396x219.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Here&#8217;s the result: user-5 and user-6 as expected:</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!2vh_!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5471779-6f11-42d8-8d0b-1b4d8c58c3f6_909x127.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2vh_!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5471779-6f11-42d8-8d0b-1b4d8c58c3f6_909x127.png 424w, https://substackcdn.com/image/fetch/$s_!2vh_!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5471779-6f11-42d8-8d0b-1b4d8c58c3f6_909x127.png 848w, https://substackcdn.com/image/fetch/$s_!2vh_!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5471779-6f11-42d8-8d0b-1b4d8c58c3f6_909x127.png 1272w, https://substackcdn.com/image/fetch/$s_!2vh_!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5471779-6f11-42d8-8d0b-1b4d8c58c3f6_909x127.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2vh_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5471779-6f11-42d8-8d0b-1b4d8c58c3f6_909x127.png" width="909" height="127" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e5471779-6f11-42d8-8d0b-1b4d8c58c3f6_909x127.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:127,&quot;width&quot;:909,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:24239,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://neilwilsondata.substack.com/i/188168915?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2543971d-6d72-4e6c-b276-43fc19af5dff_1176x127.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!2vh_!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5471779-6f11-42d8-8d0b-1b4d8c58c3f6_909x127.png 424w, https://substackcdn.com/image/fetch/$s_!2vh_!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5471779-6f11-42d8-8d0b-1b4d8c58c3f6_909x127.png 848w, https://substackcdn.com/image/fetch/$s_!2vh_!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5471779-6f11-42d8-8d0b-1b4d8c58c3f6_909x127.png 1272w, https://substackcdn.com/image/fetch/$s_!2vh_!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe5471779-6f11-42d8-8d0b-1b4d8c58c3f6_909x127.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h4>Step 6: Insert Historical Records</h4><p>To complete our intended result table, insert historical data from the backup table, ensuring the data matches the new table 
format (cast columns, etc.):</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;sql&quot;,&quot;nodeId&quot;:&quot;591001a1-3fec-4a05-9d0b-6af6a7eb9d12&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-sql">INSERT INTO neil_test_catalog.streaming.json_bronze
SELECT user, email, country, to_variant_object(device), to_variant_object(event), _rescued_data, topic, partition, offset, kafka_timestamp, ingestion_ts
FROM neil_test_catalog.streaming.json_bronze_backup</code></pre></div><p>Here is the final result. You can now resume your pipeline.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;yaml&quot;,&quot;nodeId&quot;:&quot;95091137-7fc2-4fac-b81c-bc3bd107ffb9&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-yaml">user:string
email:string
country:string
device:variant
event:variant
_rescued_data:string
topic:string
partition:integer
offset:long
kafka_timestamp:timestamp
ingestion_ts:timestamp</code></pre></div><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x3gO!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3d7f02-263e-49ae-a838-8cc8826ac96c_948x224.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x3gO!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3d7f02-263e-49ae-a838-8cc8826ac96c_948x224.png 424w, https://substackcdn.com/image/fetch/$s_!x3gO!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3d7f02-263e-49ae-a838-8cc8826ac96c_948x224.png 848w, https://substackcdn.com/image/fetch/$s_!x3gO!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3d7f02-263e-49ae-a838-8cc8826ac96c_948x224.png 1272w, https://substackcdn.com/image/fetch/$s_!x3gO!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3d7f02-263e-49ae-a838-8cc8826ac96c_948x224.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x3gO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3d7f02-263e-49ae-a838-8cc8826ac96c_948x224.png" width="948" height="224" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4d3d7f02-263e-49ae-a838-8cc8826ac96c_948x224.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:224,&quot;width&quot;:948,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:61678,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://neilwilsondata.substack.com/i/188168915?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc8eb96d1-3eb6-40c1-8a4a-9a7c91c5513f_1303x224.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!x3gO!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3d7f02-263e-49ae-a838-8cc8826ac96c_948x224.png 424w, https://substackcdn.com/image/fetch/$s_!x3gO!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3d7f02-263e-49ae-a838-8cc8826ac96c_948x224.png 848w, https://substackcdn.com/image/fetch/$s_!x3gO!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3d7f02-263e-49ae-a838-8cc8826ac96c_948x224.png 1272w, https://substackcdn.com/image/fetch/$s_!x3gO!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4d3d7f02-263e-49ae-a838-8cc8826ac96c_948x224.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h4>Step 7: Revert startingOffsets</h4><p>To avoid pipeline failures in the case of a future Full Refresh, revert startingOffsets to its prior value.</p><div 
class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;693848c8-f57d-4c01-b5cf-9214847e70ea&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">.option("startingOffsets", "earliest")</code></pre></div><h3>Frequently Asked Questions</h3><p><strong>Q: Why can&#8217;t I just run a standard Full Refresh on the pipeline?</strong></p><p><strong>A:</strong> A Full Refresh clears the target table and re-reads the source from the beginning. If your Kafka topic has a retention policy (TTL), data that has "aged out" of the topic will be permanently lost because it no longer exists in the source to be re-read.</p><p><strong>Q: Why is </strong><code>startingOffsets</code><strong> necessary if I have a backup?</strong></p><p><strong>A:</strong> While the backup saves your history, <code>startingOffsets</code> ensures your pipeline resumes reading <em>exactly</em> where the backup stopped. Without this explicit instruction, the pipeline might start from "earliest" (re-reading everything still retained in Kafka, which can duplicate records that are already in the backup) or "latest" (skipping data that arrived during the maintenance window).</p><p><strong>Q: Is this process required for "Append-Only" tables?</strong></p><p><strong>A:</strong> Generally, yes, if you need to restructure the existing table. 
If you are only adding new columns that are nullable, you might rely on schema evolution, but fundamental type changes usually require the table to be rewritten.</p><p></p>]]></content:encoded></item><item><title><![CDATA[Migrating Existing Dashboards to Databricks AI/BI, Part 2: Filter Actions, Cross-Filtering, and Drill-Through]]></title><description><![CDATA[How to connect visuals, enable cross-filtering, and drill into details in Databricks AI/BI Dashboards]]></description><link>https://www.databricksters.com/p/migrating-existing-dashboards-to-482</link><guid isPermaLink="false">https://www.databricksters.com/p/migrating-existing-dashboards-to-482</guid><dc:creator><![CDATA[Artem Chebotko]]></dc:creator><pubDate>Tue, 10 Mar 2026 15:02:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!gLl3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc4b6aea-ff9b-4eb0-a60e-9e2338aefa0e_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!gLl3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc4b6aea-ff9b-4eb0-a60e-9e2338aefa0e_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!gLl3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc4b6aea-ff9b-4eb0-a60e-9e2338aefa0e_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!gLl3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc4b6aea-ff9b-4eb0-a60e-9e2338aefa0e_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!gLl3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc4b6aea-ff9b-4eb0-a60e-9e2338aefa0e_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!gLl3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc4b6aea-ff9b-4eb0-a60e-9e2338aefa0e_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!gLl3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc4b6aea-ff9b-4eb0-a60e-9e2338aefa0e_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc4b6aea-ff9b-4eb0-a60e-9e2338aefa0e_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!gLl3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc4b6aea-ff9b-4eb0-a60e-9e2338aefa0e_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!gLl3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc4b6aea-ff9b-4eb0-a60e-9e2338aefa0e_1536x1024.png 848w, 
https://substackcdn.com/image/fetch/$s_!gLl3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc4b6aea-ff9b-4eb0-a60e-9e2338aefa0e_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!gLl3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc4b6aea-ff9b-4eb0-a60e-9e2338aefa0e_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>As a Specialist Solutions Architect at Databricks, I often hear the same questions from customers who are migrating dashboards 
from legacy BI tools to Databricks AI/BI Dashboards:</p><ul><li><p><em>&#8220;What&#8217;s the Databricks equivalent of the context filters we use today?&#8221;</em></p></li><li><p><em>&#8220;Can we still do cascading filters where each dropdown only shows relevant values?&#8221;</em></p></li><li><p><em>&#8220;Do you support filter actions when I click on a bar or a point?&#8221;</em></p></li><li><p><em>&#8220;How do we do user-based filtering in AI/BI Dashboards?&#8221;</em></p></li></ul><p>In the <a href="https://www.databricksters.com/p/migrating-existing-dashboards-to">first blog post in this series</a>, I focused on the first two questions and showed how to recreate:</p><ul><li><p>context filters using parameters in dataset SQL, and</p></li><li><p>&#8220;<em>Only Relevant Values</em>&#8221; filters using field filters and query-based parameters.</p></li></ul><p>This blog tackles the third question: &#8220;<em>How do we replace filter actions from existing dashboards when we click on a bar/segment/point?</em>&#8221;</p><p>In many BI tools, this behavior is configured as a <strong>filter action</strong> (a click on a mark filters other views). 
In Databricks AI/BI Dashboards, the equivalent interactivity is split into two built-in features:</p><ul><li><p><strong>Cross-filtering</strong>, where clicking a mark in one chart filters other charts on the same page that use the same dataset.</p></li><li><p><strong>Drill-through</strong>, where right-clicking a mark opens a target page filtered to that selection.</p></li></ul><p>Once you combine cross-filtering and drill-through with the <a href="https://www.databricksters.com/p/migrating-existing-dashboards-to">context and cascading patterns</a>, you can reproduce most real-world filter-action workflows.</p><p>As before, I&#8217;ll use the built-in <code>samples.tpch</code> dataset. I&#8217;ve also published the <a href="https://github.com/ArtemChebotko/Migrating-Existing-Dashboards-to-Databricks-AI-BI">companion dashboard</a>, so you can follow along and inspect the configurations yourself.</p><h3><strong>Recap: TPCH Sales dataset</strong></h3><p>To keep examples concrete, we&#8217;ll use the TPCH sample data that ships with Databricks in the <code>samples.tpch</code> schema. I&#8217;ll reuse the same base dataset, <em>TPCH Sales</em>, from the <a href="https://www.databricksters.com/p/migrating-existing-dashboards-to">first blog post</a>, which joins tables <code>region</code>, <code>nation</code>, <code>customer</code>, <code>orders</code>, and <code>lineitem</code>, and computes revenue:</p><pre><code>SELECT
  r.r_name              AS region,
  n.n_name              AS nation,
  c.c_custkey           AS customer_id,
  c.c_name              AS customer_name,
  o.o_orderkey          AS order_id,
  o.o_orderdate         AS order_date,
  l.l_extendedprice * (1 - l.l_discount) AS revenue
FROM samples.tpch.region   AS r
JOIN samples.tpch.nation   AS n ON n.n_regionkey = r.r_regionkey
JOIN samples.tpch.customer AS c ON c.c_nationkey = n.n_nationkey
JOIN samples.tpch.orders   AS o ON o.o_custkey   = c.c_custkey
JOIN samples.tpch.lineitem AS l ON l.l_orderkey  = o.o_orderkey;</code></pre><p>In the <a href="https://github.com/ArtemChebotko/Migrating-Existing-Dashboards-to-Databricks-AI-BI">companion dashboard</a>, this is the <em>TPCH Sales</em> dataset. It and its derivatives are used by multiple pages:</p><ul><li><p><em>Context filter</em></p></li><li><p><em>Cascading filters with field filters</em></p></li><li><p><em>Cascading filters with query-based parameters</em></p></li><li><p><em>Cross-filtering </em>(new in this post)</p></li><li><p><em>Drill-through details </em>(new in this post)</p></li><li><p>and others</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZQnz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff3310d-8eba-4213-b83e-e5244c972269_1600x740.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZQnz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff3310d-8eba-4213-b83e-e5244c972269_1600x740.png 424w, https://substackcdn.com/image/fetch/$s_!ZQnz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff3310d-8eba-4213-b83e-e5244c972269_1600x740.png 848w, https://substackcdn.com/image/fetch/$s_!ZQnz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff3310d-8eba-4213-b83e-e5244c972269_1600x740.png 1272w, https://substackcdn.com/image/fetch/$s_!ZQnz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff3310d-8eba-4213-b83e-e5244c972269_1600x740.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!ZQnz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff3310d-8eba-4213-b83e-e5244c972269_1600x740.png" width="1456" height="673" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dff3310d-8eba-4213-b83e-e5244c972269_1600x740.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:673,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZQnz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff3310d-8eba-4213-b83e-e5244c972269_1600x740.png 424w, https://substackcdn.com/image/fetch/$s_!ZQnz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff3310d-8eba-4213-b83e-e5244c972269_1600x740.png 848w, https://substackcdn.com/image/fetch/$s_!ZQnz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff3310d-8eba-4213-b83e-e5244c972269_1600x740.png 1272w, https://substackcdn.com/image/fetch/$s_!ZQnz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdff3310d-8eba-4213-b83e-e5244c972269_1600x740.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3><strong>Filter actions vs. cross-filtering and drill-through</strong></h3><p>Before we build anything, it helps to align vocabulary.</p><p>In many traditional BI tools, filter actions are configured explicitly:</p><ul><li><p>You specify one or more source sheets.</p></li><li><p>You specify one or more target sheets.</p></li><li><p>You choose which fields are passed as filters.</p></li><li><p>You choose what happens when the selection is cleared.</p></li></ul><p>In Databricks AI/BI Dashboards, cross-filtering is implicit. 
You don&#8217;t turn it on in the visualization panel.</p><ul><li><p>It is automatically applied to supported visualization types that use the same dataset.</p></li><li><p>When you click a bar, slice, or point, AI/BI adds a filter based on that value and re-runs all other visualizations on the page that share the dataset.</p></li></ul><p>In Databricks AI/BI Dashboards, drill-through is also implicit, but slightly more structured:</p><ul><li><p>When you right-click a supported chart type, AI/BI shows <em>Drill to &#8594; &lt;target page&gt;</em> if there is another page in the dashboard that uses the same dataset.</p></li><li><p>The target page opens with all visuals based on that dataset filtered to the selected segment, and any compatible filters on that dataset are auto-populated.</p></li></ul><p>Conceptually:</p><ul><li><p>Cross-filtering &#8776; a within-page filter action.</p></li><li><p>Drill-through &#8776; a navigation filter action (summary &#8594; details).</p></li></ul><p>The rest of the post walks through how to configure your pages so these implicit behaviors &#8220;just work&#8221;.</p><h3><strong>1. Recreating within-page filter actions with cross-filtering</strong></h3><p>A very common dashboard pattern is:</p><ul><li><p>A summary bar chart (for example, total revenue by nation).</p></li><li><p>One or more supporting charts (for example, revenue share by region, or a donut chart for mix).</p></li><li><p>A filter action so clicking a bar filters the other visuals.</p></li></ul><p>In AI/BI Dashboards, this becomes cross-filtering on top of the <em>TPCH Sales</em> dataset.</p><h4><strong>1.1. 
When cross-filtering is applied</strong></h4><p>Cross-filtering is applied automatically when all of the following are true:</p><ul><li><p>The visualizations are on the same page.</p></li><li><p>The visualizations use the same dataset (for example, <em>TPCH Sales</em>).</p></li><li><p>Each visualization is one of the supported chart types: <em>Bar</em>, <em>Box plot</em>, <em>Heatmap</em>, <em>Histogram</em>, <em>Pie</em>, <em>Scatter</em>, or <em>Point map</em>.</p></li></ul><p>If those conditions are met, there is nothing to enable. Clicks on supported charts become filters for any other visualizations on the page that use that dataset.</p><h4><strong>1.2. Build the </strong><em><strong>Cross-filtering</strong></em><strong> page</strong></h4><p>In your AI/BI dashboard, add a page named <em>Cross-filtering</em>.</p><p>On this page:</p><ol><li><p>Add a bar chart: <em>Revenue by nation</em></p><ul><li><p>Visualization: <em>Bar</em></p></li><li><p>Dataset: <em>TPCH Sales</em></p></li><li><p>X axis: <code>nation</code></p></li><li><p>Y axis: <code>SUM(revenue)</code></p></li></ul></li><li><p>Add a pie chart: <em>Revenue by region</em></p><ul><li><p>Visualization: <em>Pie</em></p></li><li><p>Dataset: <em>TPCH Sales</em></p></li><li><p>Slice by (Color): <code>region</code></p></li><li><p>Value (Angle): <code>SUM(revenue)</code></p></li></ul></li><li><p>Add filters (optional)</p><ul><li><p>Add a <em>Region</em> filter on <code>TPCH Sales.region</code></p></li><li><p>Add a <em>Nation</em> filter on <code>TPCH Sales.nation</code></p></li></ul></li></ol><p>These are just standard filter widgets. 
There is no &#8220;cross-filtering&#8221; toggle anywhere in the configuration.</p><p>In the <a href="https://github.com/ArtemChebotko/Migrating-Existing-Dashboards-to-Databricks-AI-BI">companion dashboard</a>, this page is already built for you.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aM9L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7030f297-722f-4ebe-b522-b20ff0bc75b3_1600x640.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aM9L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7030f297-722f-4ebe-b522-b20ff0bc75b3_1600x640.png 424w, https://substackcdn.com/image/fetch/$s_!aM9L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7030f297-722f-4ebe-b522-b20ff0bc75b3_1600x640.png 848w, https://substackcdn.com/image/fetch/$s_!aM9L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7030f297-722f-4ebe-b522-b20ff0bc75b3_1600x640.png 1272w, https://substackcdn.com/image/fetch/$s_!aM9L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7030f297-722f-4ebe-b522-b20ff0bc75b3_1600x640.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aM9L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7030f297-722f-4ebe-b522-b20ff0bc75b3_1600x640.png" width="1456" height="582" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7030f297-722f-4ebe-b522-b20ff0bc75b3_1600x640.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:582,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aM9L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7030f297-722f-4ebe-b522-b20ff0bc75b3_1600x640.png 424w, https://substackcdn.com/image/fetch/$s_!aM9L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7030f297-722f-4ebe-b522-b20ff0bc75b3_1600x640.png 848w, https://substackcdn.com/image/fetch/$s_!aM9L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7030f297-722f-4ebe-b522-b20ff0bc75b3_1600x640.png 1272w, https://substackcdn.com/image/fetch/$s_!aM9L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7030f297-722f-4ebe-b522-b20ff0bc75b3_1600x640.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4><strong>1.3. Use cross-filtering on the charts</strong></h4><p>Try the following workflow:</p><ol><li><p>Select two regions, <em>AFRICA</em> and <em>ASIA</em>, on the <em>Revenue by region</em> chart.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xwfZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc420da58-ce61-418f-8cbb-35d3494b7e3e_1600x644.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xwfZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc420da58-ce61-418f-8cbb-35d3494b7e3e_1600x644.png 424w, 
https://substackcdn.com/image/fetch/$s_!xwfZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc420da58-ce61-418f-8cbb-35d3494b7e3e_1600x644.png 848w, https://substackcdn.com/image/fetch/$s_!xwfZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc420da58-ce61-418f-8cbb-35d3494b7e3e_1600x644.png 1272w, https://substackcdn.com/image/fetch/$s_!xwfZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc420da58-ce61-418f-8cbb-35d3494b7e3e_1600x644.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xwfZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc420da58-ce61-418f-8cbb-35d3494b7e3e_1600x644.png" width="1456" height="586" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c420da58-ce61-418f-8cbb-35d3494b7e3e_1600x644.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:586,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xwfZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc420da58-ce61-418f-8cbb-35d3494b7e3e_1600x644.png 424w, 
https://substackcdn.com/image/fetch/$s_!xwfZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc420da58-ce61-418f-8cbb-35d3494b7e3e_1600x644.png 848w, https://substackcdn.com/image/fetch/$s_!xwfZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc420da58-ce61-418f-8cbb-35d3494b7e3e_1600x644.png 1272w, https://substackcdn.com/image/fetch/$s_!xwfZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc420da58-ce61-418f-8cbb-35d3494b7e3e_1600x644.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ol start="2"><li><p>Click the bar for <em>JAPAN</em> in the <em>Revenue by nation</em> chart.</p></li></ol><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!N1P1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12352e-c307-47a3-8cd6-e76126084dd5_1600x644.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!N1P1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12352e-c307-47a3-8cd6-e76126084dd5_1600x644.png 424w, https://substackcdn.com/image/fetch/$s_!N1P1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12352e-c307-47a3-8cd6-e76126084dd5_1600x644.png 848w, https://substackcdn.com/image/fetch/$s_!N1P1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12352e-c307-47a3-8cd6-e76126084dd5_1600x644.png 1272w, https://substackcdn.com/image/fetch/$s_!N1P1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12352e-c307-47a3-8cd6-e76126084dd5_1600x644.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!N1P1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12352e-c307-47a3-8cd6-e76126084dd5_1600x644.png" width="1456" height="586" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7f12352e-c307-47a3-8cd6-e76126084dd5_1600x644.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:586,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!N1P1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12352e-c307-47a3-8cd6-e76126084dd5_1600x644.png 424w, https://substackcdn.com/image/fetch/$s_!N1P1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12352e-c307-47a3-8cd6-e76126084dd5_1600x644.png 848w, https://substackcdn.com/image/fetch/$s_!N1P1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12352e-c307-47a3-8cd6-e76126084dd5_1600x644.png 1272w, https://substackcdn.com/image/fetch/$s_!N1P1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7f12352e-c307-47a3-8cd6-e76126084dd5_1600x644.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Because cross-filtering is automatic for supported charts that share a dataset:</p><ul><li><p>AI/BI adds filters <em>Nation: JAPAN</em> and <em>Region: AFRICA, ASIA</em> to the <em>TPCH Sales</em> dataset for this page.</p></li><li><p>Both charts get updated accordingly.</p></li></ul><p>You can <em>Reset all to default</em> and try a different filter combination.</p><p>From a migration perspective, this answers a common question: &#8220;<em>Can clicking a bar automatically update the rest of the dashboard in AI/BI?</em>&#8221; Yes &#8211; when charts share a dataset and use supported visualization types, cross-filtering works implicitly with no additional configuration.</p><h3><strong>2. 
Recreating across-page filter actions with drill-through</strong></h3><p>Another classic <em>summary &#8594; detail</em> filter-action pattern is:</p><ul><li><p>A summary view (for example, revenue by nation).</p></li><li><p>A detail view (for example, individual orders).</p></li><li><p>A filter action that passes the selected value into the detail sheet as a filter.</p></li></ul><p>In AI/BI Dashboards, this is implemented as drill-through.</p><h4><strong>2.1. How drill-through is applied</strong></h4><p>Drill-through shows up as a right-click option when the following conditions are satisfied:</p><ul><li><p>The source chart is a supported type: <em>Bar</em>, <em>Box plot</em>, <em>Heatmap</em>, <em>Histogram</em>, <em>Pie</em>, <em>Scatter</em>, or <em>Point map</em>.</p></li><li><p>There is at least one target page in the same dashboard where:</p><ul><li><p>At least one visualization uses the same dataset as the source chart.</p></li><li><p>The field you click on has a compatible filter or column on the target page.</p></li></ul></li></ul><p>In recent AI/BI releases, drill-through no longer requires an explicit target filter; any visualization based on the same dataset as the source selection is filtered automatically, and filters (if they exist) are populated with the drilled values.</p><p>There is no drill-through toggle in the widget side panel. Once the conditions above are met, AI/BI surfaces <em>Drill to &#8594; &lt;page name&gt;</em> in the context menu automatically.</p><h4><strong>2.2. 
Build the </strong><em><strong>Drill-through details</strong></em><strong> page</strong></h4><p>In the same dashboard, add another page named <em>Drill-through details</em>.</p><p>On this page:</p><ol><li><p>Add a detail table visualization</p><ul><li><p>Visualization: <em>Table</em></p></li><li><p>Dataset: <em>TPCH Sales</em></p></li><li><p>Columns: <em>region</em>, <em>nation</em>, <em>customer_id</em>, <em>customer_name</em>, <em>order_id</em>, <em>order_date</em>, <em>revenue</em></p></li></ul></li><li><p>Add filters (optional)</p><ul><li><p>Add a <em>Region</em> filter on <code>TPCH Sales.region</code></p></li><li><p>Add a <em>Nation</em> filter on <code>TPCH Sales.nation</code></p></li><li><p>Add a <em>Customer</em> filter on <code>TPCH Sales.customer_id</code></p></li></ul></li></ol><p>Again, there is no special drill-through configuration here &#8211; just a normal page that uses the same dataset.</p><p>In the <a href="https://github.com/ArtemChebotko/Migrating-Existing-Dashboards-to-Databricks-AI-BI">companion dashboard</a>, this page is already built for you.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!d5DG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d8d760-8221-47f5-a7d5-b994033d2da3_1504x669.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!d5DG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d8d760-8221-47f5-a7d5-b994033d2da3_1504x669.png 424w, https://substackcdn.com/image/fetch/$s_!d5DG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d8d760-8221-47f5-a7d5-b994033d2da3_1504x669.png 848w, 
https://substackcdn.com/image/fetch/$s_!d5DG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d8d760-8221-47f5-a7d5-b994033d2da3_1504x669.png 1272w, https://substackcdn.com/image/fetch/$s_!d5DG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d8d760-8221-47f5-a7d5-b994033d2da3_1504x669.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!d5DG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d8d760-8221-47f5-a7d5-b994033d2da3_1504x669.png" width="1456" height="648" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87d8d760-8221-47f5-a7d5-b994033d2da3_1504x669.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:648,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!d5DG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d8d760-8221-47f5-a7d5-b994033d2da3_1504x669.png 424w, https://substackcdn.com/image/fetch/$s_!d5DG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d8d760-8221-47f5-a7d5-b994033d2da3_1504x669.png 848w, 
https://substackcdn.com/image/fetch/$s_!d5DG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d8d760-8221-47f5-a7d5-b994033d2da3_1504x669.png 1272w, https://substackcdn.com/image/fetch/$s_!d5DG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87d8d760-8221-47f5-a7d5-b994033d2da3_1504x669.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4><strong>2.3. 
Drill from summary to details</strong></h4><p>Go back to the <em>Cross-filtering</em> page and right-click a bar in the <em>Revenue by nation</em> chart (for example, <em>UNITED STATES</em>).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4-aE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe7d29c2-1a06-494c-822a-2152aefea12c_1502x846.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4-aE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe7d29c2-1a06-494c-822a-2152aefea12c_1502x846.png 424w, https://substackcdn.com/image/fetch/$s_!4-aE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe7d29c2-1a06-494c-822a-2152aefea12c_1502x846.png 848w, https://substackcdn.com/image/fetch/$s_!4-aE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe7d29c2-1a06-494c-822a-2152aefea12c_1502x846.png 1272w, https://substackcdn.com/image/fetch/$s_!4-aE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe7d29c2-1a06-494c-822a-2152aefea12c_1502x846.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4-aE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe7d29c2-1a06-494c-822a-2152aefea12c_1502x846.png" width="1456" height="820" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be7d29c2-1a06-494c-822a-2152aefea12c_1502x846.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:820,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4-aE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe7d29c2-1a06-494c-822a-2152aefea12c_1502x846.png 424w, https://substackcdn.com/image/fetch/$s_!4-aE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe7d29c2-1a06-494c-822a-2152aefea12c_1502x846.png 848w, https://substackcdn.com/image/fetch/$s_!4-aE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe7d29c2-1a06-494c-822a-2152aefea12c_1502x846.png 1272w, https://substackcdn.com/image/fetch/$s_!4-aE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe7d29c2-1a06-494c-822a-2152aefea12c_1502x846.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>After clicking <em>Drill to &#8594; Drill-through details</em>, AI/BI opens the <em>Drill-through details</em> page and filters all visualizations based on <em>TPCH Sales</em> to <em>Nation: UNITED STATES</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QvbJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4964f70f-2bfb-4050-8762-aba6cf0081c3_1501x666.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!QvbJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4964f70f-2bfb-4050-8762-aba6cf0081c3_1501x666.png 424w, 
https://substackcdn.com/image/fetch/$s_!QvbJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4964f70f-2bfb-4050-8762-aba6cf0081c3_1501x666.png 848w, https://substackcdn.com/image/fetch/$s_!QvbJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4964f70f-2bfb-4050-8762-aba6cf0081c3_1501x666.png 1272w, https://substackcdn.com/image/fetch/$s_!QvbJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4964f70f-2bfb-4050-8762-aba6cf0081c3_1501x666.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QvbJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4964f70f-2bfb-4050-8762-aba6cf0081c3_1501x666.png" width="1456" height="646" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4964f70f-2bfb-4050-8762-aba6cf0081c3_1501x666.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:646,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!QvbJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4964f70f-2bfb-4050-8762-aba6cf0081c3_1501x666.png 424w, 
https://substackcdn.com/image/fetch/$s_!QvbJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4964f70f-2bfb-4050-8762-aba6cf0081c3_1501x666.png 848w, https://substackcdn.com/image/fetch/$s_!QvbJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4964f70f-2bfb-4050-8762-aba6cf0081c3_1501x666.png 1272w, https://substackcdn.com/image/fetch/$s_!QvbJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4964f70f-2bfb-4050-8762-aba6cf0081c3_1501x666.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>From the user&#8217;s perspective, this feels almost identical to a filter action in traditional dashboards that navigates from a summary sheet to a detailed sheet with the selected country carried over.</p><h3><strong>Cross-filtering vs. drill-through vs. filter widgets</strong></h3><p>Across this blog and the <a href="https://www.databricksters.com/p/migrating-existing-dashboards-to">first blog</a>, you now have three main interaction tools in AI/BI Dashboards:</p><ol><li><p>Filter widgets</p><ul><li><p>Context filters, cascading filters, and query-based parameters.</p></li><li><p>Best for primary, always-visible controls like <em>Region</em>, <em>Date</em>, <em>Product</em>.</p></li></ul></li><li><p>Cross-filtering</p><ul><li><p>Click data points in a supported visualization to filter other charts on the page that use the same dataset.</p></li><li><p>Best for ad-hoc exploration, answering questions like &#8220;<em>What happens if I focus only on this region?</em>&#8221; and &#8220;<em>Which nations are driving that spike?</em>&#8221;.</p></li></ul></li><li><p>Drill-through</p><ul><li><p>Right-click a mark to open another page already filtered to that selection.</p></li><li><p>Best for guided summary-to-detail flows where you don&#8217;t want to cram everything onto one page.</p></li></ul></li></ol><p>A simple migration rule of thumb:</p><ul><li><p>Use filter widgets to rebuild the core filter panels from your existing dashboards.</p></li><li><p>Use cross-filtering to automatically filter other visualizations on the page based on chart interactions.</p></li><li><p>Use drill-through to replace &#8220;go to sheet&#8221; filter actions and connect high-level KPIs to detail pages.</p></li></ul><h3><strong>Summary and what&#8217;s next</strong></h3><p>In this second blog post in the series, we answered: &#8220;<em>Do you support filter actions when I click on a bar or a 
point?</em>&#8221;</p><p>The short answer is <em>yes</em>:</p><ul><li><p><strong>Cross-filtering</strong> lets viewers click on supported charts to filter all other visualizations on the same page that share a dataset &#8211; no configuration required.</p></li><li><p><strong>Drill-through</strong> lets viewers right-click a mark and open another page where visuals on the same dataset are already filtered to the selected values, and any matching filters are pre-populated.</p></li></ul><p>Combined with the patterns from the <a href="https://www.databricksters.com/p/migrating-existing-dashboards-to">first blog post</a> &#8211; context filters and cascading &#8220;<em>Only Relevant Values</em>&#8221; filters &#8211; you now have a robust toolkit for recreating the interactive filtering experience your users expect from traditional dashboards inside Databricks AI/BI Dashboards.</p><p>The <a href="https://github.com/ArtemChebotko/Migrating-Existing-Dashboards-to-Databricks-AI-BI">companion dashboard</a> now includes: a context filter page, a field-based cascading page, a query-based cascading page, a cross-filtering page, and a drill-through details page. You can import it into your workspace and adapt the patterns to your own datasets.</p><p>In the next post in this series, I&#8217;ll tackle the remaining migration question: &#8220;<em>How do we do user-based filtering and row-level security in AI/BI Dashboards?</em>&#8221;</p>]]></content:encoded></item><item><title><![CDATA[Migrating Existing Dashboards to Databricks AI/BI, Part 1: Context and Cascading Filters]]></title><description><![CDATA[How to implement context filters and &#8220;only relevant values&#8221; behavior in Databricks AI/BI Dashboards]]></description><link>https://www.databricksters.com/p/migrating-existing-dashboards-to</link><guid isPermaLink="false">https://www.databricksters.com/p/migrating-existing-dashboards-to</guid><dc:creator><![CDATA[Artem Chebotko]]></dc:creator><pubDate>Tue, 03 Feb 2026 18:02:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!72WV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289c7692-0274-4274-a173-9db55df49c08_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!72WV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289c7692-0274-4274-a173-9db55df49c08_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!72WV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289c7692-0274-4274-a173-9db55df49c08_1536x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!72WV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289c7692-0274-4274-a173-9db55df49c08_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!72WV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289c7692-0274-4274-a173-9db55df49c08_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!72WV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289c7692-0274-4274-a173-9db55df49c08_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!72WV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289c7692-0274-4274-a173-9db55df49c08_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/289c7692-0274-4274-a173-9db55df49c08_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!72WV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289c7692-0274-4274-a173-9db55df49c08_1536x1024.png 424w, 
https://substackcdn.com/image/fetch/$s_!72WV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289c7692-0274-4274-a173-9db55df49c08_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!72WV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289c7692-0274-4274-a173-9db55df49c08_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!72WV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F289c7692-0274-4274-a173-9db55df49c08_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>As a Specialist Solutions Architect at Databricks, I regularly work with customers who are migrating critical analytics from existing BI tools to Databricks AI/BI Dashboards &#8211; and the first questions I usually get are about filters.</p><p><strong>Teams want to know</strong>:</p><ul><li><p><em>&#8220;What&#8217;s the Databricks equivalent of the context filters we use today?&#8221;</em></p></li><li><p><em>&#8220;Can we still do cascading filters where each dropdown only shows relevant values?&#8221;</em></p></li><li><p><em>&#8220;Do you support filter actions when I click on a bar or a point?&#8221;</em></p></li><li><p><em>&#8220;How do we do user-based filtering in AI/BI Dashboards?&#8221;</em></p></li></ul><p>These aren&#8217;t cosmetic features. 
They&#8217;re how analysts actually interact with dashboards, and they&#8217;re often the reason an existing BI dashboard feels &#8220;alive&#8221; instead of static.</p><p>In this post, I&#8217;ll walk through how to implement two familiar filter patterns from existing BI dashboards in Databricks AI/BI Dashboards, using the built-in <code>samples.tpch</code> dataset:</p><ol><li><p><strong>Context filters</strong> &#8594; implemented as parameters in dataset SQL</p></li><li><p><strong>&#8220;</strong><em><strong>Only Relevant Values</strong></em><strong>&#8221; or cascading filters</strong> &#8594; implemented with field filters and query-based parameters</p></li></ol><p>Row-level security and user-based filtering deserve their own deep dive, and action-style interactions (cross-filtering and drill-through) could easily fill another post, so I&#8217;ll cover those separately.</p><p>I&#8217;ve also published the <a href="https://github.com/ArtemChebotko/Migrating-Existing-Dashboards-to-Databricks-AI-BI">companion dashboard</a>, so you can follow along and inspect the configurations yourself.</p><h3><strong>Quick primer: datasets and filters in Databricks AI/BI Dashboards</strong></h3><p>Before we map those patterns, it helps to align on a few AI/BI Dashboards concepts.</p><h4><strong>Datasets</strong></h4><p>In AI/BI Dashboards, each dashboard has a <em>Data</em> tab where you define one or more datasets:</p><ul><li><p>A dataset is defined by an SQL query, a direct reference to a Unity Catalog table or view, or an uploaded file.</p></li><li><p>Multiple visualizations can reuse the same dataset.</p></li><li><p>Datasets are bundled with the dashboard when you share, export, or import it.</p></li></ul><p>Practically, a dataset is your &#8220;model&#8221; for a set of visuals: one query, many charts.</p><h4><strong>Field filters vs. parameter filters</strong></h4><p>AI/BI Dashboards support two core ways to filter data in a dashboard: <a 
href="https://docs.databricks.com/aws/en/dashboards/filters#should-i-filter-on-a-field-or-a-parameter">field filters and parameter filters</a>. Both are implemented as <strong>filter widgets</strong>, but they behave differently under the hood.</p><p><strong>Field filters</strong> are applied directly to dataset fields (columns) on top of the dataset query. Where the filtering runs depends on the <a href="https://docs.databricks.com/aws/en/dashboards/caching#dataset-optimizations">dataset performance thresholds</a>. Specifically, for small datasets (&#8804; 100K rows or &#8804; 100MB), results are pulled to the browser and visualization-specific filtering and aggregation are applied client-side. For larger datasets, Databricks wraps the dataset query in a <code>WITH</code> clause and applies the filter predicates and aggregations in a Databricks SQL warehouse (DBSQL).</p><p><strong>Parameter filters </strong>are applied to parameters, which are variables that get substituted into your dataset SQL at runtime. 
When the parameter value changes, the query is always re-run in DBSQL.</p><p>In other words, field filters operate on the results of the dataset query, while parameter filters operate inside the dataset SQL itself.</p><p>AI/BI Dashboards and DBSQL use several <a href="https://docs.databricks.com/aws/en/dashboards/caching#caching-and-data-freshness">caching layers</a> to speed up processing.</p><p>We&#8217;ll use parameter filters to emulate context filters, and field filters plus query-based parameters to emulate &#8220;<em>Only Relevant Values</em>.&#8221;</p><h4><strong>Filter scope: global, page-level, and widget-level</strong></h4><p>Filters in AI/BI Dashboards also differ by <a href="https://docs.databricks.com/aws/en/dashboards/filters#filter-interactivity-and-scope">scope</a>:</p><ul><li><p><strong>Global filters</strong> are interactive filters in the global filters panel that apply across all pages of the dashboard to any visualization that uses the selected datasets.</p></li><li><p><strong>Page-level filters</strong> are interactive filter widgets placed on a specific page in the canvas. They apply to all visualizations on that page that share one or more datasets.</p></li><li><p><strong>Widget-level filters</strong> are static filters configured directly on a single visualization widget in its configuration panel. 
Authors set the values, and viewers can&#8217;t change them.</p></li></ul><p>With that foundation in place, we can now map the context filter and &#8220;<em>Only Relevant Values</em>&#8221; patterns onto AI/BI Dashboards.</p><h3><strong>Sample dataset: TPCH on Databricks</strong></h3><p>To keep the examples concrete, we&#8217;ll use the TPCH sample data that ships with Databricks in the <code>samples.tpch</code> schema.</p><p>For the purposes of this post, you can start by creating a dataset that joins the <code>region</code>, <code>nation</code>, <code>customer</code>, <code>orders</code>, and <code>lineitem</code> tables, producing one row per order line item:</p><pre><code><code>SELECT
  r.r_name              AS region,
  n.n_name              AS nation,
  c.c_custkey           AS customer_id,
  c.c_name              AS customer_name,
  o.o_orderkey          AS order_id,
  o.o_orderdate         AS order_date,
  -- Net revenue per line item: extended price after the line-level discount
  l.l_extendedprice * (1 - l.l_discount) AS revenue
FROM samples.tpch.region   AS r
JOIN samples.tpch.nation   AS n ON n.n_regionkey = r.r_regionkey
JOIN samples.tpch.customer AS c ON c.c_nationkey = n.n_nationkey
JOIN samples.tpch.orders   AS o ON o.o_custkey   = c.c_custkey
JOIN samples.tpch.lineitem AS l ON l.l_orderkey  = o.o_orderkey;</code></code></pre><p>In AI/BI Dashboards, you define this query as a dataset in the <em>Data</em> tab and then reuse it across multiple visualizations. Let&#8217;s call this dataset <em>TPCH Sales</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Li34!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2aa1025-9a71-4c07-9bd2-7315bbc81448_1600x745.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Li34!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2aa1025-9a71-4c07-9bd2-7315bbc81448_1600x745.png 424w, https://substackcdn.com/image/fetch/$s_!Li34!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2aa1025-9a71-4c07-9bd2-7315bbc81448_1600x745.png 848w, https://substackcdn.com/image/fetch/$s_!Li34!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2aa1025-9a71-4c07-9bd2-7315bbc81448_1600x745.png 1272w, https://substackcdn.com/image/fetch/$s_!Li34!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2aa1025-9a71-4c07-9bd2-7315bbc81448_1600x745.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Li34!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2aa1025-9a71-4c07-9bd2-7315bbc81448_1600x745.png" width="1456" height="678" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e2aa1025-9a71-4c07-9bd2-7315bbc81448_1600x745.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:678,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Li34!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2aa1025-9a71-4c07-9bd2-7315bbc81448_1600x745.png 424w, https://substackcdn.com/image/fetch/$s_!Li34!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2aa1025-9a71-4c07-9bd2-7315bbc81448_1600x745.png 848w, https://substackcdn.com/image/fetch/$s_!Li34!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2aa1025-9a71-4c07-9bd2-7315bbc81448_1600x745.png 1272w, https://substackcdn.com/image/fetch/$s_!Li34!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe2aa1025-9a71-4c07-9bd2-7315bbc81448_1600x745.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>We&#8217;ll reuse this same dataset or its derivatives throughout the rest of the post to illustrate context filters and cascading filters.</p><h3><strong>Implementing context filters with parameters in dataset SQL</strong></h3><h4><strong>What a context filter does</strong></h4><p>A context filter defines a high-level subset of the data:</p><ul><li><p>The context filter is applied first, often materializing a temporary subset.</p></li><li><p>Other filters and some calculations are then evaluated on top of that subset.</p></li></ul><p>Context filters are used to:</p><ul><li><p>Improve performance by filtering early and shrinking the working set.</p></li><li><p>Enforce logical order, such as &#8220;<em>always filter by Region first</em>.&#8221;</p></li><li><p>Make other filters depend on that subset.</p></li></ul><h4><strong>How to think about context in AI/BI Dashboards</strong></h4><p>Given the primer:</p><ul><li><p><a 
href="https://docs.databricks.com/aws/en/dashboards/filters#should-i-filter-on-a-field-or-a-parameter">Field filters</a> operate on the results of the dataset query (Databricks wraps your dataset SQL and applies the filter on top).</p></li><li><p><a href="https://docs.databricks.com/aws/en/dashboards/filters#should-i-filter-on-a-field-or-a-parameter">Parameter filters</a> substitute values directly into your dataset SQL, so the predicate is part of the query itself and the engine can apply it before the expensive joins and aggregations run.</p></li></ul><p>If you want &#8220;context&#8221; behavior &#8211; <em>filter first, then apply everything else</em> &#8211; you should implement that filter as a <a href="https://docs.databricks.com/aws/en/dashboards/parameters">parameter</a> in the dataset SQL, driven by a parameter filter widget.</p><h4><strong>Pattern: treat the context as a base parameter</strong></h4><p>Let&#8217;s add a context filter for <em>Region</em>.</p><p>If you&#8217;re following along with the <a href="https://github.com/ArtemChebotko/Migrating-Existing-Dashboards-to-Databricks-AI-BI">companion dashboard</a>, this setup lives on the &#8220;Context filter&#8221; page.</p><p><strong>Step 1</strong>. Define <em>TPCH Sales (Context)</em> with a <em>Region</em> parameter</p><p>Create a dataset <em>TPCH Sales (Context)</em>:</p><pre><code><code>SELECT
  r.r_name              AS region,
  n.n_name              AS nation,
  c.c_custkey           AS customer_id,
  c.c_name              AS customer_name,
  o.o_orderkey          AS order_id,
  o.o_orderdate         AS order_date,
  l.l_extendedprice * (1 - l.l_discount) AS revenue  -- extended price net of the line-item discount
FROM samples.tpch.region   AS r
JOIN samples.tpch.nation   AS n ON n.n_regionkey = r.r_regionkey
JOIN samples.tpch.customer AS c ON c.c_nationkey = n.n_nationkey
JOIN samples.tpch.orders   AS o ON o.o_custkey   = c.c_custkey
JOIN samples.tpch.lineitem AS l ON l.l_orderkey  = o.o_orderkey
-- :region_param is a named parameter marker; the dashboard binds the Region widget's selection (or the parameter's default) to it at run time.
WHERE r.r_name = :region_param      -- &#8220;context&#8221; filter</code></code></pre><p>In the dataset&#8217;s <em>Parameters</em> panel:</p><ul><li><p>Define <code>region_param</code> with type <em>String</em>.</p></li><li><p>Optionally set a default (for example, <em>AMERICA</em>) so the dataset runs without any dashboard filter.</p></li></ul><p>This makes <code>region_param</code> the context for all visuals that use <em>TPCH Sales (Context)</em>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Id06!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce877d63-410b-4f3f-878b-aec639cfc9c9_1600x786.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Id06!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce877d63-410b-4f3f-878b-aec639cfc9c9_1600x786.png 424w, https://substackcdn.com/image/fetch/$s_!Id06!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce877d63-410b-4f3f-878b-aec639cfc9c9_1600x786.png 848w, https://substackcdn.com/image/fetch/$s_!Id06!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce877d63-410b-4f3f-878b-aec639cfc9c9_1600x786.png 1272w, https://substackcdn.com/image/fetch/$s_!Id06!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce877d63-410b-4f3f-878b-aec639cfc9c9_1600x786.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Id06!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce877d63-410b-4f3f-878b-aec639cfc9c9_1600x786.png" width="1456" height="715" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ce877d63-410b-4f3f-878b-aec639cfc9c9_1600x786.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:715,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Id06!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce877d63-410b-4f3f-878b-aec639cfc9c9_1600x786.png 424w, https://substackcdn.com/image/fetch/$s_!Id06!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce877d63-410b-4f3f-878b-aec639cfc9c9_1600x786.png 848w, https://substackcdn.com/image/fetch/$s_!Id06!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce877d63-410b-4f3f-878b-aec639cfc9c9_1600x786.png 1272w, https://substackcdn.com/image/fetch/$s_!Id06!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fce877d63-410b-4f3f-878b-aec639cfc9c9_1600x786.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><strong>Step 2</strong>. (Optional but nice) Create a helper dataset for <em>Region</em> values</p><p>You can drive <code>region_param</code> directly from <em>TPCH Sales (Context)</em>, but a tiny helper dataset keeps things tidy, convenient, and re-usable.</p><p>Create <em>TPCH Regions (Context)</em>:</p><pre><code><code>SELECT DISTINCT r_name AS region
FROM samples.tpch.region
ORDER BY region;</code></code></pre><p>This dataset has no parameters; it just returns the list of available regions.</p><p><strong>Step 3</strong>. Add a <em>Region</em> parameter filter widget</p><p>We will configure the widget as a page-level filter (alternatively, you can move it into the global filters panel if it should apply across pages).</p><p>On the page where you want <em>Region</em> as a context filter:</p><ol><li><p>Add a filter widget and title it <em>Region</em>.</p></li><li><p>Set the filter type to <em>Single value</em>.</p></li><li><p>Configure it as a parameter filter:</p><ul><li><p>Fields: <code>TPCH Regions (Context).region</code></p></li><li><p>Parameters: <code>TPCH Sales (Context).region_param</code></p></li></ul></li></ol><p>If you don&#8217;t want a helper dataset, you can instead use <code>TPCH Sales (Context).region</code> as the field source, but the wiring is otherwise identical.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!oUxR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4b617a8-188e-4c89-84cc-0ca1ed8c84dc_512x641.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!oUxR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4b617a8-188e-4c89-84cc-0ca1ed8c84dc_512x641.png 424w, https://substackcdn.com/image/fetch/$s_!oUxR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4b617a8-188e-4c89-84cc-0ca1ed8c84dc_512x641.png 848w, 
https://substackcdn.com/image/fetch/$s_!oUxR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4b617a8-188e-4c89-84cc-0ca1ed8c84dc_512x641.png 1272w, https://substackcdn.com/image/fetch/$s_!oUxR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4b617a8-188e-4c89-84cc-0ca1ed8c84dc_512x641.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!oUxR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4b617a8-188e-4c89-84cc-0ca1ed8c84dc_512x641.png" width="512" height="641" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e4b617a8-188e-4c89-84cc-0ca1ed8c84dc_512x641.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:641,&quot;width&quot;:512,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!oUxR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4b617a8-188e-4c89-84cc-0ca1ed8c84dc_512x641.png 424w, https://substackcdn.com/image/fetch/$s_!oUxR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4b617a8-188e-4c89-84cc-0ca1ed8c84dc_512x641.png 848w, 
https://substackcdn.com/image/fetch/$s_!oUxR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4b617a8-188e-4c89-84cc-0ca1ed8c84dc_512x641.png 1272w, https://substackcdn.com/image/fetch/$s_!oUxR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4b617a8-188e-4c89-84cc-0ca1ed8c84dc_512x641.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><strong>Effect</strong></p><ul><li><p>When a viewer selects <em>Region: EUROPE</em> in the <em>Region</em> filter:</p><ul><li><p>The 
widget writes &#8220;<em>EUROPE</em>&#8221; into <code>region_param</code> for <em>TPCH Sales (Context)</em>.</p></li><li><p><em>TPCH Sales (Context)</em> reruns with <code>WHERE r.r_name = 'EUROPE'</code>.</p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hjO5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10638dc8-f318-443c-ad39-31ed2b15c8ec_1237x626.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hjO5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10638dc8-f318-443c-ad39-31ed2b15c8ec_1237x626.png 424w, https://substackcdn.com/image/fetch/$s_!hjO5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10638dc8-f318-443c-ad39-31ed2b15c8ec_1237x626.png 848w, https://substackcdn.com/image/fetch/$s_!hjO5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10638dc8-f318-443c-ad39-31ed2b15c8ec_1237x626.png 1272w, https://substackcdn.com/image/fetch/$s_!hjO5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10638dc8-f318-443c-ad39-31ed2b15c8ec_1237x626.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hjO5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10638dc8-f318-443c-ad39-31ed2b15c8ec_1237x626.png" width="1237" height="626" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/10638dc8-f318-443c-ad39-31ed2b15c8ec_1237x626.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:626,&quot;width&quot;:1237,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!hjO5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10638dc8-f318-443c-ad39-31ed2b15c8ec_1237x626.png 424w, https://substackcdn.com/image/fetch/$s_!hjO5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10638dc8-f318-443c-ad39-31ed2b15c8ec_1237x626.png 848w, https://substackcdn.com/image/fetch/$s_!hjO5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10638dc8-f318-443c-ad39-31ed2b15c8ec_1237x626.png 1272w, https://substackcdn.com/image/fetch/$s_!hjO5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F10638dc8-f318-443c-ad39-31ed2b15c8ec_1237x626.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><ul><li><p>All visuals built on <em>TPCH Sales (Context)</em> now use only European data as their starting point.</p></li><li><p>Any additional field filters (for example, Nation, Customer, Date) operate on this already-filtered subset, just like secondary filters evaluated after a context filter.</p></li><li><p>From the viewer&#8217;s perspective, <em>Region</em> behaves like a true context filter: it defines the base subset of data first, and everything else &#8211; other filters, cross-filtering, drill-through &#8211; is evaluated on top of that context.</p></li></ul><h3><strong>Implementing &#8220;</strong><em><strong>Only Relevant Values</strong></em><strong>&#8221; with cascading filters and query-based parameters</strong></h3><h4><strong>What &#8220;</strong><em><strong>Only Relevant Values</strong></em><strong>&#8221; behavior means</strong></h4><p>&#8220;<em>Only Relevant Values</em>&#8221; behavior on a filter shrinks the list of values based on the current state of 
other filters and the view:</p><ul><li><p>If you select <em>Region: ASIA</em>, the <em>Country</em> filter only shows countries that actually have data in <em>ASIA</em>.</p></li><li><p>As you add more filters, each filter&#8217;s domain is recomputed from the filtered dataset.</p></li></ul><p>Practically, this gives you cascading filters that stay in sync with each other and with the current slice of data.</p><h4><strong>How to think about &#8220;</strong><em><strong>Only Relevant Values</strong></em><strong>&#8221; in AI/BI Dashboards</strong></h4><p>In AI/BI Dashboards, you get the same effect in two ways:</p><ol><li><p><a href="https://docs.databricks.com/aws/en/dashboards/filters">Field filters</a> on the same dataset &#8211; AI/BI recomputes the value list based on the current filtered dataset.</p></li><li><p><a href="https://docs.databricks.com/aws/en/dashboards/parameters#query-based-parameters">Query-based parameters</a> &#8211; a specialized filter widget that both populates its values from a query, and writes the selected value into a parameter used in your dataset SQL.</p></li></ol><h4><strong>Pattern 1: Cascading filters with field filters</strong></h4><p>The simplest way to mimic &#8220;<em>Only Relevant Values</em>&#8221; is to use <a href="https://docs.databricks.com/aws/en/dashboards/filters">field filters</a> wired to the same dataset. AI/BI Dashboards will automatically recompute each filter&#8217;s value list based on the current filtered dataset.</p><p>We&#8217;ll build a <em>Region &#8594; Nation &#8594; Customer</em> cascade on top of <em>TPCH Sales (Cascading Pattern 1)</em>.</p><p>In the <a href="https://github.com/ArtemChebotko/Migrating-Existing-Dashboards-to-Databricks-AI-BI">companion dashboard</a>, this pattern is implemented on the &#8220;<em>Cascading filters with field filters</em>&#8221; page.</p><p><strong>Step 1</strong>. 
Define <em>TPCH Sales (Cascading Pattern 1)</em></p><p>Create a dataset <em>TPCH Sales (Cascading Pattern 1)</em> with the base TPCH join and a revenue metric:</p><pre><code><code>SELECT
  r.r_name              AS region,
  n.n_name              AS nation,
  c.c_custkey           AS customer_id,
  c.c_name              AS customer_name,
  o.o_orderkey          AS order_id,
  o.o_orderdate         AS order_date,
  l.l_extendedprice * (1 - l.l_discount) AS revenue  -- extended price net of the line-item discount
FROM samples.tpch.region   AS r
JOIN samples.tpch.nation   AS n ON n.n_regionkey = r.r_regionkey
JOIN samples.tpch.customer AS c ON c.c_nationkey = n.n_nationkey
JOIN samples.tpch.orders   AS o ON o.o_custkey   = c.c_custkey
JOIN samples.tpch.lineitem AS l ON l.l_orderkey  = o.o_orderkey;</code></code></pre><p>This dataset has no parameters; all filtering will be done with field filters on top of the query results.</p><p><strong>Step 2</strong>. Add <em>Region</em>, <em>Nation</em>, and <em>Customer</em> field filters</p><p>On the dashboard page where you want cascading behavior:</p><ol><li><p>Add three field filter widgets with titles <em>Region</em>, <em>Nation</em>, and <em>Customer</em>.</p></li><li><p>Configure each widget as a page-level filter (or move them into the global filters panel if you want them to apply across pages).</p></li><li><p>Connect the filters to the following fields from <em>TPCH Sales (Cascading Pattern 1)</em>:</p><ul><li><p><em>Region</em> &#8594; <code>region</code></p></li><li><p><em>Nation</em> &#8594; <code>nation</code></p></li><li><p><em>Customer</em> &#8594; <code>customer_id</code></p></li></ul></li></ol><p>No parameters are involved here &#8211; these are pure field filters on a single dataset.</p><p><strong>Effect</strong></p><ul><li><p>When a viewer selects <em>Region: ASIA</em>, the <em>TPCH Sales (Cascading Pattern 1)</em> dataset is filtered to <em>ASIA</em> for all visuals on the page.</p></li><li><p>The <em>Nation</em> field filter&#8217;s value list is recomputed from that filtered dataset, so it only shows nations in <em>ASIA</em>.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bTX3!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936f7d-2ea4-4089-928b-d1002a4b757e_1600x585.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!bTX3!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936f7d-2ea4-4089-928b-d1002a4b757e_1600x585.png 424w, https://substackcdn.com/image/fetch/$s_!bTX3!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936f7d-2ea4-4089-928b-d1002a4b757e_1600x585.png 848w, https://substackcdn.com/image/fetch/$s_!bTX3!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936f7d-2ea4-4089-928b-d1002a4b757e_1600x585.png 1272w, https://substackcdn.com/image/fetch/$s_!bTX3!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936f7d-2ea4-4089-928b-d1002a4b757e_1600x585.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bTX3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936f7d-2ea4-4089-928b-d1002a4b757e_1600x585.png" width="1456" height="532" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7936f7d-2ea4-4089-928b-d1002a4b757e_1600x585.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:532,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!bTX3!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936f7d-2ea4-4089-928b-d1002a4b757e_1600x585.png 424w, https://substackcdn.com/image/fetch/$s_!bTX3!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936f7d-2ea4-4089-928b-d1002a4b757e_1600x585.png 848w, https://substackcdn.com/image/fetch/$s_!bTX3!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936f7d-2ea4-4089-928b-d1002a4b757e_1600x585.png 1272w, https://substackcdn.com/image/fetch/$s_!bTX3!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7936f7d-2ea4-4089-928b-d1002a4b757e_1600x585.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><ul><li><p>After the viewer chooses a nation (e.g., <em>JAPAN</em>), the <em>Customer</em> field filter shrinks to show only customers in that nation.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D7pG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caeb4b3-84dc-4cc5-ae15-5add326ae0f6_1600x583.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D7pG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caeb4b3-84dc-4cc5-ae15-5add326ae0f6_1600x583.png 424w, https://substackcdn.com/image/fetch/$s_!D7pG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caeb4b3-84dc-4cc5-ae15-5add326ae0f6_1600x583.png 848w, https://substackcdn.com/image/fetch/$s_!D7pG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caeb4b3-84dc-4cc5-ae15-5add326ae0f6_1600x583.png 1272w, https://substackcdn.com/image/fetch/$s_!D7pG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caeb4b3-84dc-4cc5-ae15-5add326ae0f6_1600x583.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!D7pG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caeb4b3-84dc-4cc5-ae15-5add326ae0f6_1600x583.png" width="1456" height="531" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3caeb4b3-84dc-4cc5-ae15-5add326ae0f6_1600x583.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:531,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!D7pG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caeb4b3-84dc-4cc5-ae15-5add326ae0f6_1600x583.png 424w, https://substackcdn.com/image/fetch/$s_!D7pG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caeb4b3-84dc-4cc5-ae15-5add326ae0f6_1600x583.png 848w, https://substackcdn.com/image/fetch/$s_!D7pG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caeb4b3-84dc-4cc5-ae15-5add326ae0f6_1600x583.png 1272w, https://substackcdn.com/image/fetch/$s_!D7pG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3caeb4b3-84dc-4cc5-ae15-5add326ae0f6_1600x583.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><ul><li><p>From the user&#8217;s perspective, these field filters behave like filters with an &#8220;<em>Only Relevant Values</em>&#8221; option enabled: each dropdown shows only values that exist in the currently filtered data. Under the hood, AI/BI Dashboards are simply applying field filters on top of a single dataset and recomputing the dropdown values from the currently filtered result set.</p></li></ul><h4><strong>Pattern 2: Cascading filters with query-based parameters</strong></h4><p>In Pattern 1, we used field filters only. In some cases you may want more control over how dropdown values are loaded, or you may want the same parameter to drive multiple datasets. 
In that case you can use <a href="https://docs.databricks.com/aws/en/dashboards/parameters#query-based-parameters">query-based parameters</a>. A query-based parameter filter widget gets its dropdown values from a field in a &#8220;choices&#8221; dataset, and writes the selected value into one or more parameters that are used in dataset SQL.</p><p>Here we&#8217;ll build a three-level cascade <em>Region &#8594; Nation &#8594; Customer</em> using:</p><ul><li><p>One main dataset: <em>TPCH Sales (Cascading Pattern 2)</em></p></li><li><p>Three small &#8220;value list&#8221; datasets:</p><ul><li><p><em>TPCH Regions (Cascading Pattern 2)</em></p></li><li><p><em>TPCH Nations by Region (Cascading Pattern 2)</em></p></li><li><p><em>TPCH Customers by Nation (Cascading Pattern 2)</em></p></li></ul></li></ul><p>In the <a href="https://github.com/ArtemChebotko/Migrating-Existing-Dashboards-to-Databricks-AI-BI">companion dashboard</a>, this pattern is implemented on the &#8220;<em>Cascading filters with query-based parameters</em>&#8221; page.</p><p><strong>Step 1</strong>. Define <em>TPCH Sales (Cascading Pattern 2)</em></p><p>Create the <em>TPCH Sales (Cascading Pattern 2)</em> dataset with parameters for <em>region</em>, <em>nation</em>, and <em>customer</em>:</p><pre><code><code>SELECT
 r.r_name              AS region,
 n.n_name              AS nation,
 c.c_custkey           AS customer_id,
 c.c_name              AS customer_name,
 o.o_orderkey          AS order_id,
 o.o_orderdate         AS order_date,
 l.l_extendedprice * (1 - l.l_discount) AS revenue
FROM samples.tpch.region   AS r
JOIN samples.tpch.nation   AS n ON n.n_regionkey = r.r_regionkey
JOIN samples.tpch.customer AS c ON c.c_nationkey = n.n_nationkey
JOIN samples.tpch.orders   AS o ON o.o_custkey   = c.c_custkey
JOIN samples.tpch.lineitem AS l ON l.l_orderkey  = o.o_orderkey
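-- 'All' (string parameters) and 0 (numeric parameter) act as "no filter" sentinels:
-- when a parameter holds its sentinel, that predicate is true for every row.
WHERE (:region_param   = 'All' OR r.r_name = :region_param)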
  AND (:nation_param   = 'All' OR n.n_name = :nation_param)
  AND (:customer_param = 0     OR c.c_custkey  = :customer_param);</code></code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ft3p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b3da122-46ff-4505-b268-de0d80f55758_1600x866.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ft3p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b3da122-46ff-4505-b268-de0d80f55758_1600x866.png 424w, https://substackcdn.com/image/fetch/$s_!ft3p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b3da122-46ff-4505-b268-de0d80f55758_1600x866.png 848w, https://substackcdn.com/image/fetch/$s_!ft3p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b3da122-46ff-4505-b268-de0d80f55758_1600x866.png 1272w, https://substackcdn.com/image/fetch/$s_!ft3p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b3da122-46ff-4505-b268-de0d80f55758_1600x866.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ft3p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b3da122-46ff-4505-b268-de0d80f55758_1600x866.png" width="1456" height="788" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2b3da122-46ff-4505-b268-de0d80f55758_1600x866.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:788,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ft3p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b3da122-46ff-4505-b268-de0d80f55758_1600x866.png 424w, https://substackcdn.com/image/fetch/$s_!ft3p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b3da122-46ff-4505-b268-de0d80f55758_1600x866.png 848w, https://substackcdn.com/image/fetch/$s_!ft3p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b3da122-46ff-4505-b268-de0d80f55758_1600x866.png 1272w, https://substackcdn.com/image/fetch/$s_!ft3p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2b3da122-46ff-4505-b268-de0d80f55758_1600x866.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>In the dataset&#8217;s <em>Parameters</em> panel:</p><ul><li><p>Set <code>region_param</code> type to <em>String</em>.</p></li><li><p>Set <code>nation_param</code> type to <em>String</em>.</p></li><li><p>Set <code>customer_param</code> type to <em>Numeric / Integer</em> (to match <code>c_custkey</code>).</p></li></ul><p>This last bit is important: the <em>Customer</em> filter uses a numeric field, so the parameter must be numeric as well.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qzZ2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96df94a4-c73f-4c97-b2bd-b8bb0ba8a63d_262x297.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!qzZ2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96df94a4-c73f-4c97-b2bd-b8bb0ba8a63d_262x297.png 424w, https://substackcdn.com/image/fetch/$s_!qzZ2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96df94a4-c73f-4c97-b2bd-b8bb0ba8a63d_262x297.png 848w, https://substackcdn.com/image/fetch/$s_!qzZ2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96df94a4-c73f-4c97-b2bd-b8bb0ba8a63d_262x297.png 1272w, https://substackcdn.com/image/fetch/$s_!qzZ2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96df94a4-c73f-4c97-b2bd-b8bb0ba8a63d_262x297.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qzZ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96df94a4-c73f-4c97-b2bd-b8bb0ba8a63d_262x297.png" width="262" height="297" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/96df94a4-c73f-4c97-b2bd-b8bb0ba8a63d_262x297.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:297,&quot;width&quot;:262,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!qzZ2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96df94a4-c73f-4c97-b2bd-b8bb0ba8a63d_262x297.png 424w, https://substackcdn.com/image/fetch/$s_!qzZ2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96df94a4-c73f-4c97-b2bd-b8bb0ba8a63d_262x297.png 848w, https://substackcdn.com/image/fetch/$s_!qzZ2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96df94a4-c73f-4c97-b2bd-b8bb0ba8a63d_262x297.png 1272w, https://substackcdn.com/image/fetch/$s_!qzZ2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F96df94a4-c73f-4c97-b2bd-b8bb0ba8a63d_262x297.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p><strong>Step 2</strong>. Create helper datasets for the dropdowns</p><p>1. <em>TPCH Regions (Cascading Pattern 2)</em> &#8211; list of regions:</p><pre><code><code>SELECT DISTINCT r_name AS region
FROM samples.tpch.region
ORDER BY region;</code></code></pre><p>2. <em>TPCH Nations by Region (Cascading Pattern 2)</em> &#8211; nations for the selected region:</p><pre><code><code>SELECT DISTINCT n.n_name AS nation
FROM samples.tpch.nation   AS n
JOIN samples.tpch.region   AS r ON n.n_regionkey = r.r_regionkey
WHERE r.r_name = :region_param
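-- Note: no region name equals 'All', so while region_param still holds the widget
-- default 'All' this query should return no rows and the Nation dropdown stays empty.
-- A hedged variant (an assumption, not taken from the companion dashboard) mirrors
-- the sentinel check used in the main dataset so that every nation is listed instead:
--   WHERE (:region_param = 'All' OR r.r_name = :region_param)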
ORDER BY nation;</code></code></pre><p>This dataset defines its own <code>region_param</code> (<em>string</em>) in the <em>Data</em> tab.</p><p>3. <em>TPCH Customers by Nation (Cascading Pattern 2)</em> &#8211; customers for the selected nation:</p><pre><code><code>SELECT DISTINCT
  c.c_custkey AS customer_id
FROM samples.tpch.nation   AS n
JOIN samples.tpch.customer AS c ON c.c_nationkey = n.n_nationkey
WHERE n.n_name = :nation_param
ORDER BY customer_id;</code></code></pre><p>This dataset defines <code>nation_param</code> (<em>string</em>). <code>customer_id</code> is <em>numeric</em>, matching <code>customer_param</code> in <em>TPCH Sales (Cascading Pattern 2)</em>.</p><p>Run each dataset once in the <em>Data</em> tab to confirm that it succeeds.</p><p><strong>Step 3</strong>. Add <em>Region</em>, <em>Nation</em>, and <em>Customer</em> filter widgets</p><p>On your dashboard page, add three filter widgets and wire them to fields and parameters. Configure all three widgets as page-level filters (or move them into the global filters panel if they should apply across pages).</p><p>1. <em>Region filter widget</em></p><ul><li><p>Filter type: <em>Single value</em></p></li><li><p>Fields: <code>TPCH Regions (Cascading Pattern 2).region</code></p></li><li><p>Parameters:</p><ul><li><p><code>TPCH Sales (Cascading Pattern 2).region_param</code></p></li><li><p><code>TPCH Nations by Region (Cascading Pattern 2).region_param</code></p></li></ul></li><li><p>Default value: <code>All</code></p></li></ul><p>This keeps <code>region_param</code> in <em>TPCH Sales (Cascading Pattern 2)</em> and <em>TPCH Nations by Region (Cascading Pattern 2)</em> in sync.</p><p>2. <em>Nation filter widget</em></p><ul><li><p>Filter type: <em>Single value</em></p></li><li><p>Fields: <code>TPCH Nations by Region (Cascading Pattern 2).nation</code></p></li><li><p>Parameters:</p><ul><li><p><code>TPCH Sales (Cascading Pattern 2).nation_param</code></p></li><li><p><code>TPCH Customers by Nation (Cascading Pattern 2).nation_param</code></p></li></ul></li><li><p>Default value: <code>All</code></p></li></ul><p>This keeps <code>nation_param</code> in <em>TPCH Sales (Cascading Pattern 2)</em> and <em>TPCH Customers by Nation (Cascading Pattern 2)</em> in sync.</p><p>3. 
<em>Customer filter widget</em></p><ul><li><p>Filter type: <em>Single value</em></p></li><li><p>Fields: <code>TPCH Customers by Nation (Cascading Pattern 2).customer_id</code></p></li><li><p>Parameters: <code>TPCH Sales (Cascading Pattern 2).customer_param</code></p></li><li><p>Default value: <code>0</code></p></li></ul><p><strong>Effect</strong></p><ul><li><p>When a viewer selects <em>Region: AMERICA</em>:</p><ul><li><p>The <em>Region</em> widget writes &#8220;<em>AMERICA</em>&#8221; into <code>region_param</code> in <em>TPCH Sales (Cascading Pattern 2)</em> and <em>TPCH Nations by Region (Cascading Pattern 2)</em>.</p></li><li><p><em>TPCH Nations by Region (Cascading Pattern 2)</em> reruns and returns only nations in <em>AMERICA</em>, so the <em>Nation</em> dropdown shows only those nations.</p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jskP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0107b4-e71a-47c4-aecf-e1dd374b6d3d_1600x584.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jskP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0107b4-e71a-47c4-aecf-e1dd374b6d3d_1600x584.png 424w, https://substackcdn.com/image/fetch/$s_!jskP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0107b4-e71a-47c4-aecf-e1dd374b6d3d_1600x584.png 848w, https://substackcdn.com/image/fetch/$s_!jskP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0107b4-e71a-47c4-aecf-e1dd374b6d3d_1600x584.png 1272w, 
https://substackcdn.com/image/fetch/$s_!jskP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0107b4-e71a-47c4-aecf-e1dd374b6d3d_1600x584.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jskP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0107b4-e71a-47c4-aecf-e1dd374b6d3d_1600x584.png" width="1456" height="531" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dd0107b4-e71a-47c4-aecf-e1dd374b6d3d_1600x584.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:531,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!jskP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0107b4-e71a-47c4-aecf-e1dd374b6d3d_1600x584.png 424w, https://substackcdn.com/image/fetch/$s_!jskP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0107b4-e71a-47c4-aecf-e1dd374b6d3d_1600x584.png 848w, https://substackcdn.com/image/fetch/$s_!jskP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0107b4-e71a-47c4-aecf-e1dd374b6d3d_1600x584.png 1272w, 
https://substackcdn.com/image/fetch/$s_!jskP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fdd0107b4-e71a-47c4-aecf-e1dd374b6d3d_1600x584.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><ul><li><p>When the viewer then selects <em>Nation: UNITED STATES</em>:</p><ul><li><p>The <em>Nation</em> widget writes &#8220;<em>UNITED STATES</em>&#8221; into <code>nation_param</code> in <em>TPCH Sales (Cascading Pattern 2)</em> and <em>TPCH Customers by Nation (Cascading Pattern 2)</em>.</p></li><li><p><em>TPCH Customers by Nation (Cascading Pattern 2)</em> 
reruns and returns only customers in <em>UNITED STATES</em>, so the <em>Customer</em> dropdown only shows those customer IDs.</p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0lIb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23de116-d555-497e-9eaf-8375f56465de_1600x583.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0lIb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23de116-d555-497e-9eaf-8375f56465de_1600x583.png 424w, https://substackcdn.com/image/fetch/$s_!0lIb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23de116-d555-497e-9eaf-8375f56465de_1600x583.png 848w, https://substackcdn.com/image/fetch/$s_!0lIb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23de116-d555-497e-9eaf-8375f56465de_1600x583.png 1272w, https://substackcdn.com/image/fetch/$s_!0lIb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23de116-d555-497e-9eaf-8375f56465de_1600x583.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0lIb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23de116-d555-497e-9eaf-8375f56465de_1600x583.png" width="1456" height="531" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d23de116-d555-497e-9eaf-8375f56465de_1600x583.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:531,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0lIb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23de116-d555-497e-9eaf-8375f56465de_1600x583.png 424w, https://substackcdn.com/image/fetch/$s_!0lIb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23de116-d555-497e-9eaf-8375f56465de_1600x583.png 848w, https://substackcdn.com/image/fetch/$s_!0lIb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23de116-d555-497e-9eaf-8375f56465de_1600x583.png 1272w, https://substackcdn.com/image/fetch/$s_!0lIb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd23de116-d555-497e-9eaf-8375f56465de_1600x583.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><ul><li><p>When the viewer selects a specific <em>Customer</em> (for example, <em>607</em>):</p><ul><li><p>The <em>Customer</em> widget writes <em>607</em> into <code>customer_param</code> in <em>TPCH Sales (Cascading Pattern 2)</em>.</p></li><li><p><em>TPCH Sales (Cascading Pattern 2)</em> reruns with <code>region_param</code>, <code>nation_param</code>, and <code>customer_param</code> applied, and all visuals built on this dataset show only orders for customer <em>607</em> in <em>UNITED STATES / AMERICA</em>.</p></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Gv_s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e03bea-826e-4d5a-9022-ad7ec61f30d2_1600x614.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!Gv_s!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e03bea-826e-4d5a-9022-ad7ec61f30d2_1600x614.png 424w, https://substackcdn.com/image/fetch/$s_!Gv_s!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e03bea-826e-4d5a-9022-ad7ec61f30d2_1600x614.png 848w, https://substackcdn.com/image/fetch/$s_!Gv_s!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e03bea-826e-4d5a-9022-ad7ec61f30d2_1600x614.png 1272w, https://substackcdn.com/image/fetch/$s_!Gv_s!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e03bea-826e-4d5a-9022-ad7ec61f30d2_1600x614.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Gv_s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e03bea-826e-4d5a-9022-ad7ec61f30d2_1600x614.png" width="1456" height="559" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/80e03bea-826e-4d5a-9022-ad7ec61f30d2_1600x614.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:559,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!Gv_s!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e03bea-826e-4d5a-9022-ad7ec61f30d2_1600x614.png 424w, https://substackcdn.com/image/fetch/$s_!Gv_s!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e03bea-826e-4d5a-9022-ad7ec61f30d2_1600x614.png 848w, https://substackcdn.com/image/fetch/$s_!Gv_s!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e03bea-826e-4d5a-9022-ad7ec61f30d2_1600x614.png 1272w, https://substackcdn.com/image/fetch/$s_!Gv_s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F80e03bea-826e-4d5a-9022-ad7ec61f30d2_1600x614.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><ul><li><p>From the viewer&#8217;s perspective, <em>Region &#8594; Nation &#8594; Customer</em> behaves like a set of cascading filters with the &#8220;<em>Only Relevant Values</em>&#8221; option enabled. Under the hood, each dropdown is a query-based parameter filter: the <em>Region</em> and <em>Nation</em> widgets keep parameters in multiple datasets in sync, while the <em>Customer</em> widget filters the main <em>TPCH Sales (Cascading Pattern 2)</em> dataset down to a single customer.</p></li></ul><h4><strong>Which pattern when?</strong></h4><p>Both patterns get you &#8220;<em>Only Relevant Values</em>&#8221;-style cascading behavior, but they shine in different situations.</p><p><strong>Pattern 1 &#8211; Cascading field filters</strong></p><p>Use this when:</p><ul><li><p>You&#8217;re working off one main dataset per page.</p></li><li><p>You want the simplest authoring experience: add field filters, connect them to the dataset, done.</p></li><li><p>&#8220;<em>Allow All</em>&#8221; and easy clearing of filters are important to your users.</p></li></ul><p>This is the closest to what many BI tools do by default and is usually the right starting point.</p><p><strong>Pattern 2 &#8211; Cascading query-based parameters</strong></p><p>Use this when:</p><ul><li><p>You need parameters that drive multiple datasets.</p></li><li><p>You want tighter control over dropdown values, including custom queries per level.</p></li><li><p>You&#8217;re comfortable managing parameter types and wiring filters to multiple datasets.</p></li></ul><p>Pattern 2 is more flexible and explicit, but also more advanced. 
In practice, I start with Pattern 1 for most dashboards, and reach for Pattern 2 when I need parameter-driven logic or want to reuse the same parameters across several datasets and pages.</p><h3><strong>Summary</strong></h3><p>In this post, we looked at how to carry two of the most important filter patterns from traditional BI dashboards into Databricks AI/BI Dashboards:</p><ul><li><p><strong>Context filters</strong> become parameters in your dataset SQL, driven by parameter filter widgets. This lets you enforce &#8220;filter by Region first&#8221; semantics and shrink the working set before joins and aggregations.</p></li><li><p><strong>&#8220;Only Relevant Values&#8221; / cascading filters</strong> can be implemented either with simple field filters on a single dataset (Pattern 1) or with query-based parameters and helper datasets (Pattern 2) when you need more control and reusable parameters.</p></li></ul><p>The <a href="https://github.com/ArtemChebotko/Migrating-Existing-Dashboards-to-Databricks-AI-BI">companion dashboard</a> includes all three examples: a context filter page, a field-based cascading page, and a query-based cascading page. You can import it into your workspace and adapt the patterns to your own datasets.</p><p>In future posts, I plan to cover:</p><ul><li><p>Row-level security and user-based filtering in AI/BI Dashboards</p></li><li><p>Action-style interactions such as cross-filtering and drill-through in AI/BI Dashboards</p></li></ul><p>If you&#8217;re starting a migration from an existing BI tool to Databricks AI/BI today, I recommend:</p><ol><li><p>Identify your key context filters (Region, Business Unit, etc.) 
and implement them as parameters in dataset SQL.</p></li><li><p>Start with Pattern 1 (field filters) for cascading behavior, and only move to Pattern 2 where you truly need parameter-driven logic or shared parameters across datasets.</p></li></ol><p>These two patterns alone are usually enough to make an AI/BI dashboard feel as interactive and &#8220;alive&#8221; as the dashboards your teams are used to today.</p>]]></content:encoded></item><item><title><![CDATA[Liquid Clustering at Scale: Overcoming Challenges and Unlocking Performance]]></title><description><![CDATA[This piece shares the story behind migrating to Liquid Clustering: the architectural pain points that forced the decision, the challenges along the way, and the hands-on solutions that made it work at scale.]]></description><link>https://www.databricksters.com/p/liquid-clustering-at-scale-overcoming</link><guid isPermaLink="false">https://www.databricksters.com/p/liquid-clustering-at-scale-overcoming</guid><dc:creator><![CDATA[Geethu]]></dc:creator><pubDate>Tue, 27 Jan 2026 12:10:21 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f0616724-431f-4575-9740-1ef6469eaa68_700x405.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p>As data volumes grow and access patterns become more demanding, traditional data layouts can quickly become a bottleneck. This post walks through a real-world migration to Liquid Clustering, focusing on the architectural limitations that triggered the change, the challenges encountered during the transition, and the practical fixes that made the migration successful at scale.</p><p>The goal was simple but demanding: near-real-time data availability and consistently fast query performance across large time ranges, even in the presence of late-arriving data and massive daily ingestion volumes.</p><h2><strong>Why Traditional Partitioning Fell Short and How Liquid Clustering Solves It</strong></h2><p>The original architecture relied on continuous streaming ingestion into Bronze tables from Kafka, followed by scheduled batch jobs that populated optimized Silver tables. Bronze tables were partitioned by date, while Silver tables were partitioned and z-ordered by relevant keys. 
When everything arrived on time and tables were fully optimized, query performance was excellent.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!otlB!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3a545a-e73d-4d1f-aee8-d8d81ee41fad_627x379.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!otlB!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3a545a-e73d-4d1f-aee8-d8d81ee41fad_627x379.png 424w, https://substackcdn.com/image/fetch/$s_!otlB!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3a545a-e73d-4d1f-aee8-d8d81ee41fad_627x379.png 848w, https://substackcdn.com/image/fetch/$s_!otlB!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3a545a-e73d-4d1f-aee8-d8d81ee41fad_627x379.png 1272w, https://substackcdn.com/image/fetch/$s_!otlB!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3a545a-e73d-4d1f-aee8-d8d81ee41fad_627x379.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!otlB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3a545a-e73d-4d1f-aee8-d8d81ee41fad_627x379.png" width="627" height="379" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa3a545a-e73d-4d1f-aee8-d8d81ee41fad_627x379.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:379,&quot;width&quot;:627,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:67322,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/185583181?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3a545a-e73d-4d1f-aee8-d8d81ee41fad_627x379.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!otlB!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3a545a-e73d-4d1f-aee8-d8d81ee41fad_627x379.png 424w, https://substackcdn.com/image/fetch/$s_!otlB!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3a545a-e73d-4d1f-aee8-d8d81ee41fad_627x379.png 848w, https://substackcdn.com/image/fetch/$s_!otlB!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3a545a-e73d-4d1f-aee8-d8d81ee41fad_627x379.png 1272w, https://substackcdn.com/image/fetch/$s_!otlB!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa3a545a-e73d-4d1f-aee8-d8d81ee41fad_627x379.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>The problem began with late-arriving data.</p><p>Some events arrived days&#8212;even weeks&#8212;after their original event time. These records landed in partitions that had already been optimized, slowly reintroducing small, unoptimized files into previously clean partitions. Over time:</p><ul><li><p>File counts ballooned</p></li><li><p>Queries that once ran in ~10 seconds stretched to a minute or more</p></li><li><p>Performance degraded steadily as more late data accumulated</p></li></ul><p>The only way to recover performance was to re-optimize entire partitions repeatedly, which became increasingly expensive and time-consuming at scale. Rigid partition boundaries simply did not work well with unpredictable arrival patterns.</p><p>This is where Liquid Clustering became a natural fit. 
Instead of relying on static partitions, Liquid Clustering incrementally maintains data layout quality as data arrives. It continuously rebalances files based on clustering keys, reducing the need for repeated full rewrites and making late-arriving data far less disruptive.</p><p>Liquid Clustering achieves this with a multi-dimensional, incremental clustering strategy. It removes rigid partition boundaries, continuously reorganizes poorly clustered segments, and supports both eager clustering (during ingestion) and lazy clustering (post-ingestion). Its tree-based, multi-column clustering improves data skipping and keeps query performance predictable and low-latency, even on large datasets with late-arriving data.</p><h2><strong>Where Liquid Met Production Reality: Scaling Challenges and Fixes</strong></h2><p>While Liquid Clustering addressed the core architectural issue, the migration itself surfaced a new set of challenges, primarily driven by scale.</p><h3><strong>High-Throughput Ingestion and Backfills</strong></h3><p>One of the first challenges during the migration to Liquid Clustering was handling the existing historical data. To adopt the new layout strategy, several months of history, at tens of terabytes of data per day, had to be backfilled and reorganized. This was challenging because the platform also had to continue processing new streaming data and maintain Silver tables for analysts. Running backfill and OPTIMIZE jobs simultaneously put the system under extreme load, pushing cluster resources to their limits.</p><h3><strong>Long-Running OPTIMIZE Jobs After Enabling Liquid</strong></h3><p>After enabling Liquid Clustering, OPTIMIZE runtimes increased noticeably&#8212;not because of Liquid alone, but due to the scale at which it was introduced. Large historical backfills were running alongside ongoing ingestion, forcing OPTIMIZE to process very large data volumes under heavy skew. 
Certain clustering stages ended up handling disproportionate amounts of data, resulting in large shuffles, disk spills, and increasingly unpredictable runtimes.</p><p>To address this, eager clustering was enabled for streaming writes, moving a portion of the clustering work into ingestion. This reduced the amount of reorganization required during OPTIMIZE and helped stabilize optimization runtimes, especially once batch sizes were increased and clustering work was better distributed across the cluster.</p><p>In addition, Liquid-specific tuning played a critical role in improving OPTIMIZE stability at scale. Key adjustments included:</p><ul><li><p><strong>Enhanced data skipping</strong> was enabled to reduce unnecessary data movement during clustering, significantly lowering shuffle volume and improving OPTIMIZE efficiency.</p></li><li><p><strong>Increased clustering parallelism</strong> was also configured to distribute clustering work more evenly across the cluster, reducing skew and stabilizing runtimes for large and wide tables.</p></li></ul><p>For workload-specific tuning and the exact configuration details, we recommend reaching out to Databricks, as the optimal settings can vary based on data volume and cluster characteristics.</p><p>To further reduce reliance on manual OPTIMIZE jobs, we also leveraged Predictive Optimization (PO) to automatically manage optimization workloads&#8212;this topic is discussed in more detail in the upcoming section.</p><h3><strong>Eager Clustering at Scale: Small File Challenges</strong></h3><p>While eager clustering reduced the amount of work during OPTIMIZE, it introduced a new challenge when streaming batches were too small. This was especially noticeable during large historical backfills, where each batch was around 40 GB. 
Although each batch was locally clustered, the small size meant that OPTIMIZE still had to rewrite many small files, resulting in high write amplification, longer runtimes, and increased operational overhead.</p><h4><strong>Solution: Batch Size as a Critical Lever</strong></h4><p>One of the most important lessons from this migration was how sensitive eager clustering is to batch size, particularly for backfills. Increasing batch sizes to larger, more meaningful units (around 1 TB per batch) changed the behavior dramatically, most visibly during petabyte-scale backfills. Larger batches allowed eager clustering to produce larger, better-clustered files upfront, which significantly reduced or even eliminated downstream OPTIMIZE work. This not only shortened OPTIMIZE runtimes but also lowered overall operational overhead by triggering fewer jobs and improving system stability during high-volume backfill processing.</p><h3><strong>Infrastructure Constraints</strong></h3><p>Another challenge emerged from the existing cluster configuration. At petabyte scale, OPTIMIZE planning occasionally failed due to driver disk exhaustion, caused by large Spark event logs generated during complex optimization planning. The original cluster setup, designed for partitioned tables, was no longer sufficient to handle the heavy resource demands of Liquid Clustering and large backfills.</p><h4><strong>Solution: Reducing Event Log Volume</strong></h4><p>The issue was mitigated by tuning Spark to limit event log growth during OPTIMIZE planning. 
This significantly reduced the size of driver-side logs, preventing disk exhaustion and allowing large optimization jobs to complete reliably even while ingestion workloads continued to run.</p><p>For workload-specific tuning and exact configuration details, we recommend reaching out to Databricks.</p><h3><strong>Cluster and Runtime Constraints</strong></h3><p>The existing cluster configuration was no longer sufficient to handle simultaneous high-volume ingestion and large-scale OPTIMIZE jobs. Under petabyte-scale workloads, resource contention could slow down processing and introduce instability.</p><h4><strong>Solution: Right-Sizing Compute and Updating Runtime</strong></h4><p>To address this, cluster capacity was increased where needed, and reliance on spot instances was reduced in favor of on-demand workers for long-running OPTIMIZE operations, where stability mattered most. In addition, the latest DBR 17.3 runtime was adopted for all new Liquid tables, leveraging improvements that enhanced performance, stability, and optimization efficiency at scale.</p><h2><strong>Predictive Optimization: Powerful, but Needs Guardrails</strong></h2><p>Once we stabilized OPTIMIZE runtimes and tuned eager clustering, the next focus was on reducing reliance on manual optimization jobs. For this, we leveraged Predictive Optimization, which automatically determines when and how to run clustering operations based on table state and data layout.</p><p>Concurrent manual OPTIMIZE jobs occasionally conflicted with PO runs, causing transaction failures. 
Limited observability made it difficult to track PO activity and determine how much data remained unoptimized, which could lead to degraded query performance.</p><p>To address these issues, we adopted two practices:</p><ul><li><p>Monitor PO activity using <code>system.storage.predictive_optimization_operations_history</code> to track execution and outcomes.</p></li><li><p>Run fallback manual OPTIMIZE jobs whenever PO does not execute successfully.</p></li></ul><h2><strong>Results: Performance, Freshness, and Simpler Operations</strong></h2><p>The migration to Liquid Clustering delivered clear, measurable improvements:</p><ul><li><p>Query performance improved dramatically&#8212;around 50% faster for most queries and up to 90% faster for long-range scans</p></li><li><p>File counts were cut roughly in half, reducing I/O overhead</p></li><li><p>Data freshness improved from hours to minutes, enabling near-real-time analytics</p></li><li><p>Operational overhead dropped significantly with UC-managed tables and automated optimization</p></li><li><p>Legacy partitioning was eliminated, reducing technical debt and modernizing the architecture</p></li></ul><h2><strong>Final Thoughts</strong></h2><p>Liquid Clustering fundamentally changes how data layout is managed at scale. By moving away from rigid partitions and embracing incremental, adaptive clustering, it becomes possible to handle late-arriving data, massive ingestion volumes, and evolving schemas without sacrificing performance or driving up costs.</p><p>The key is understanding the operational nuances: batch sizing, clustering strategy, Liquid-specific tuning, cluster configuration, and optimization automation. 
When those pieces come together, Liquid Clustering can unlock faster queries, fresher data, and a far more resilient data platform.</p><p>For readers interested in diving deeper:</p><ul><li><p><a href="https://www.databricks.com/blog/arctic-wolfs-liquid-clustering-architecture-tuned-petabyte-scale">Arctic Wolf&#8217;s Liquid Clustering Architecture Tuned for Petabyte Scale &#8211; Databricks Blog</a></p></li><li><p><a href="https://open.substack.com/pub/canadiandataguy/p/optimizing-delta-lake-tables-liquid?utm_campaign=post-expanded-share&amp;utm_medium=web">Optimizing Delta Lake Tables with Liquid Clustering &#8211; Canadian Data Guy</a></p></li></ul>]]></content:encoded></item><item><title><![CDATA[10 Lessons from Analyzing and Tuning Two Dozen Databricks SQL Warehouses]]></title><description><![CDATA[How to cut cost and boost performance]]></description><link>https://www.databricksters.com/p/10-lessons-from-analyzing-and-tuning</link><guid isPermaLink="false">https://www.databricksters.com/p/10-lessons-from-analyzing-and-tuning</guid><dc:creator><![CDATA[Artem Chebotko]]></dc:creator><pubDate>Fri, 07 Nov 2025 16:01:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ghL8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd16790d4-1fd3-447c-ad7e-a84965ef6d54_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ghL8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd16790d4-1fd3-447c-ad7e-a84965ef6d54_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!ghL8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd16790d4-1fd3-447c-ad7e-a84965ef6d54_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!ghL8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd16790d4-1fd3-447c-ad7e-a84965ef6d54_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!ghL8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd16790d4-1fd3-447c-ad7e-a84965ef6d54_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!ghL8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd16790d4-1fd3-447c-ad7e-a84965ef6d54_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ghL8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd16790d4-1fd3-447c-ad7e-a84965ef6d54_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d16790d4-1fd3-447c-ad7e-a84965ef6d54_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2451869,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/178229675?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd16790d4-1fd3-447c-ad7e-a84965ef6d54_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ghL8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd16790d4-1fd3-447c-ad7e-a84965ef6d54_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!ghL8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd16790d4-1fd3-447c-ad7e-a84965ef6d54_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!ghL8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd16790d4-1fd3-447c-ad7e-a84965ef6d54_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!ghL8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd16790d4-1fd3-447c-ad7e-a84965ef6d54_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div>
<p>As a Specialist Solutions Architect at Databricks, over the past year I&#8217;ve led the Databricks SQL Cost &amp; Performance Optimization Assessment initiative, reviewing and tuning two dozen customer warehouses across industries such as ad tech, healthcare, fintech, and energy. Each engagement shared the same goal &#8212; improving performance, reducing cost, and uncovering hidden inefficiencies. While every environment is unique, several recurring themes emerged. The following ten lessons capture those patterns and the practical takeaways that helped customers achieve measurable cost and performance improvements. Of course, these are generalizations &#8212; exceptions may apply depending on workload characteristics, data layout, and other factors.</p><h3><strong>1. Liquid Clustering effectively saves compute and improves performance</strong></h3><p>This should be foundational, yet I continue to see multi-terabyte tables without any data-layout optimization, or with Liquid Clustering misused as hierarchical sorting. In some cases, tables are clustered on too many columns, degrading clustering efficiency. In all of these scenarios, file pruning is limited, and queries end up scanning far more data than necessary.</p>
<p><strong>Lesson:</strong> Use <a href="https://docs.databricks.com/aws/en/delta/clustering">Liquid Clustering</a> strategically to improve query filtering and data skipping. Follow <a href="https://docs.databricks.com/aws/en/delta/clustering#choose-clustering-keys">Databricks best practices</a> when defining clustering keys.</p><h3><strong>2. Predictive Optimization delivers ongoing performance and cost gains</strong></h3><p>Predictive Optimization automatically improves managed Delta tables by compacting files, applying Liquid Clustering, collecting statistics, and deleting old files, reducing both compute and storage costs. It continuously observes workloads and schedules these maintenance operations automatically.</p><p><strong>Lesson:</strong> Enable <a href="https://docs.databricks.com/aws/en/optimizations/predictive-optimization">Predictive Optimization</a> on Unity Catalog managed tables to benefit from continuous layout tuning and automatic vacuuming &#8212; improving performance and lowering storage spend with no manual jobs to manage.</p><h3><strong>3. Missing statistics are a common performance issue</strong></h3><p>Across most environments, I found tables with missing or partial column statistics, often due to external table ingestion or infrequent maintenance. 
Without accurate statistics, the optimizer can&#8217;t effectively estimate data volumes or apply efficient file pruning.</p><p><strong>Lesson:</strong> Run <code>ANALYZE TABLE</code> regularly or rely on <a href="https://docs.databricks.com/aws/en/optimizations/predictive-optimization">Predictive Optimization</a> to automate statistics collection for managed Delta tables. Missing statistics can significantly degrade query performance.</p><h3><strong>4. Disk spills are silent performance killers</strong></h3><p>Between 5% and 20% of queries in typical environments spill to disk during shuffles, adding seconds or even minutes to query runtime. These issues often go unnoticed because queries still &#8220;succeed,&#8221; but at a high cost.</p><p><strong>Lesson:</strong> Identify queries with significant disk spill by analyzing the <a href="https://docs.databricks.com/aws/en/admin/system-tables/query-history">query history system table</a>. Rewriting a query, adding <a href="https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-qry-select-hints">repartitioning or join hints</a>, splitting large queries into smaller ones that process subsets of data, or increasing warehouse size are all valid strategies to mitigate spills. Reducing disk spills alone can yield a 10&#8211;30% performance improvement.</p><h3><strong>5. Tiny files are the hidden tax of inefficient ingestion</strong></h3><p>One customer&#8217;s ingestion pipeline wrote one file per row &#8212; millions of files, each only a few kilobytes. These microfiles drastically slow reads, increase metadata overhead, and make optimization more expensive.</p><p><strong>Lesson:</strong> Adjust partitioning or writer settings in your ingestion pipelines to produce fewer, larger files. 
Regularly compact small files using the <code>OPTIMIZE</code> command, or rely on <a href="https://docs.databricks.com/aws/en/optimizations/predictive-optimization">Predictive Optimization</a> to automate compaction for managed Delta tables. Excessive small files can significantly degrade query performance and increase compute and metadata management costs.</p><h3><strong>6. Large inserts and replaces drive up cost</strong></h3><p>Many teams refresh large datasets using <code>INSERT OVERWRITE</code>, <code>REPLACE WHERE</code>, or even <code>CREATE OR REPLACE TABLE</code> &#8212; for example, rewriting an entire 30-day time series when only a small subset of rows has changed. This results in billions of rows being rewritten unnecessarily each day.</p><p><strong>Lesson:</strong> Replace full overwrites with incremental <code>MERGE</code> operations whenever possible. <a href="https://www.youtube.com/watch?v=yZmrpXJg-G8">Optimize merge performance</a> using <a href="https://docs.databricks.com/aws/en/delta/clustering">Liquid Clustering</a>. In one environment, this simple change reduced warehouse runtime by about 50%.</p><h3><strong>7. Cross-system queries often run slower without Photon acceleration</strong></h3><p>Federated queries &#8212; for example, Databricks reading from Snowflake or other external systems &#8212; often fall out of Photon execution mode, forcing execution to fall back to slower row-based processing. This can significantly increase query latency and compute cost.</p><p><strong>Lesson:</strong> Keep frequently joined or high-volume tables within Databricks when possible, or create materialized views over foreign tables to reduce repeated cross-system access. Use <a href="https://docs.databricks.com/aws/en/query-federation">Lakehouse Federation</a> selectively to balance interoperability with performance. 
When an external system supports <a href="https://docs.databricks.com/aws/en/delta-sharing/">Delta Sharing</a> for Delta Lake tables, it is typically a better option than federated queries because it preserves Photon acceleration.</p><h3><strong>8. Auto-stop settings are often too conservative</strong></h3><p>Many warehouses retain the default 10-minute idle timeout, or even increase it, despite intermittent workloads that provide frequent opportunities for the warehouse to scale down to zero. This leads to clusters sitting idle while still incurring DBU costs.</p><p><strong>Lesson:</strong> Reducing the auto-stop setting to 5 minutes via the UI &#8212; or even 1 minute via <a href="https://docs.databricks.com/api/workspace/warehouses">REST API</a> &#8212; can save thousands of dollars monthly, especially for scheduled or bursty workloads. For continuously running environments, consider right-sizing instead of keeping large warehouses idling.</p><h3><strong>9. BI tools can create inefficiencies by keeping connections alive and not properly cancelling queries</strong></h3><p>A frequent source of hidden inefficiency comes from BI tools that keep connections alive or fail to cancel submitted queries properly. In the first case, clients send periodic heartbeat queries to keep sessions open, preventing the warehouse from shutting down. In the second, clients stop polling for results from submitted queries, leaving the warehouse in a waiting state until it cancels them with the familiar <em>&#8220;Query has been timed out due to inactivity&#8221;</em> message.</p><p><strong>Lesson:</strong> Review how your BI tools handle connection pooling, query cancellation, and session timeouts. Proper client configuration prevents idle workloads from consuming compute and improves overall warehouse efficiency.</p><h3><strong>10. 
Consolidating compatible workloads improves efficiency and reduces cost</strong></h3><p>Most teams maintain multiple warehouses for BI, ETL, ad hoc querying, metadata exploration, and data science. While such separation is necessary for workloads with different latency or concurrency requirements, running too many similar warehouses creates unnecessary startup, scaling, and caching overhead.</p><p><strong>Lesson:</strong> Consolidate workloads with similar characteristics and latency expectations onto shared warehouses with appropriate scaling policies. This approach can deliver 20&#8211;40% cost savings, improves cache reuse, and simplifies governance. Use a small dedicated warehouse for lightweight metadata exploration instead of a large production warehouse.</p><h3><strong>Bonus Insights</strong></h3><ul><li><p>Avoid running <code>OPTIMIZE</code> after every write. In one case, a dbt job was configured to do exactly that &#8212; wasting DBUs without providing meaningful benefit. Let <a href="https://docs.databricks.com/aws/en/optimizations/predictive-optimization">Predictive Optimization</a> or a scheduled job handle compaction automatically.</p></li><li><p>Larger warehouses don&#8217;t always mean better performance. A well-tuned Medium or Large warehouse can, in some cases, outperform an underutilized X-Large, delivering better efficiency at lower cost. Check query profiles to ensure queries parallelize effectively and make full use of available resources.</p></li><li><p>Very high operator counts in SQL queries indicate data model complexity. 
Simplify by denormalizing or materializing gold tables to reduce query overhead and improve performance.</p></li></ul><h3><strong>Conclusion</strong></h3><p>Across the Databricks SQL Cost &amp; Performance Optimization Assessments, the most impactful improvements rarely came from sweeping architectural changes but from small, targeted adjustments &#8212; smarter auto-stop settings, better clustering, predictive optimization, or tuned queries. Collectively, these refinements translated into substantial cost reductions and faster, more predictable performance.</p><p>The key takeaway is simple: <strong>continuously measure, analyze, and iterate.</strong> Databricks SQL provides all the observability and automation needed to make optimization an ongoing practice, not a one-time project.</p>
]]></content:encoded></item><item><title><![CDATA[Warming Up Databricks SQL Disk Cache for Reliable BI Dashboard Benchmarking]]></title><description><![CDATA[Ever benchmarked Databricks SQL and wondered why results fluctuate between runs?&#160;It&#8217;s often not your queries &#8212; it&#8217;s the cache. In my latest post, I walk through a practical, production-friendly method to warm up the Databricks SQL disk cache using real historical queries &#8212; ensuring your BI benchmarks reflect true user experience, not cold starts.]]></description><link>https://www.databricksters.com/p/warming-up-databricks-sql-disk-cache</link><guid isPermaLink="false">https://www.databricksters.com/p/warming-up-databricks-sql-disk-cache</guid><dc:creator><![CDATA[Artem Chebotko]]></dc:creator><pubDate>Tue, 04 Nov 2025 16:02:22 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f8159ccc-6e63-4f2c-b1c4-076e2a302f86_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kz0D!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76e6155-d733-4c4c-9047-51ef6ca5bf10_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!kz0D!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76e6155-d733-4c4c-9047-51ef6ca5bf10_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!kz0D!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76e6155-d733-4c4c-9047-51ef6ca5bf10_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!kz0D!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76e6155-d733-4c4c-9047-51ef6ca5bf10_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!kz0D!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76e6155-d733-4c4c-9047-51ef6ca5bf10_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kz0D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76e6155-d733-4c4c-9047-51ef6ca5bf10_1536x1024.png" width="1456" height="971" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a76e6155-d733-4c4c-9047-51ef6ca5bf10_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:2427894,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/177300044?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76e6155-d733-4c4c-9047-51ef6ca5bf10_1536x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kz0D!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76e6155-d733-4c4c-9047-51ef6ca5bf10_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!kz0D!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76e6155-d733-4c4c-9047-51ef6ca5bf10_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!kz0D!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76e6155-d733-4c4c-9047-51ef6ca5bf10_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!kz0D!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa76e6155-d733-4c4c-9047-51ef6ca5bf10_1536x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As a Specialist Solutions Architect at Databricks, I work with customers who want to compare Databricks SQL (DBSQL) performance against tools like Power BI, Tableau, and Databricks&#8217; own AI/BI Dashboards. They&#8217;ll spin up a dashboard, hit &#8220;Refresh visuals,&#8221; time the result, and call that a benchmark.</p><p>Here&#8217;s the <strong>problem</strong>: production dashboards almost never run on a cold warehouse. Warehouses build up local disk cache as they serve repeated queries. If you benchmark against a cold cache, you&#8217;re not measuring what users actually see &#8212; you&#8217;re measuring the very first run after a restart.</p><p>This post introduces a production-friendly way to warm up the Databricks SQL disk cache using real historical queries. 
By replaying organic query patterns before testing, you can ensure your BI dashboards are benchmarked under realistic, steady-state conditions. <a href="https://github.com/ArtemChebotko/databricks-dbsql-disk-cache/blob/main/Warming%20Up%20Databricks%20SQL%20Disk%20Cache%20for%20Reliable%20BI%20Dashboard%20Benchmarking.py">The companion notebook</a> provides all the code to automate this process and visualize when the cache has stabilized. You&#8217;ll be able to:</p><ul><li><p><em>Prime a warehouse&#8217;s disk cache using realistic dashboard traffic</em></p></li><li><p><em>Keep benchmarks consistent across runs</em></p></li><li><p><em>Avoid cheating by using the result cache</em></p></li><li><p><em>Warm up one warehouse using the historical workload of a completely different warehouse</em></p></li></ul><h3><strong>Why Benchmark with a Warm Cache</strong></h3><p>In production, BI dashboards rarely start from scratch. The Databricks SQL Disk Cache stores frequently accessed data locally on a warehouse cluster for faster retrieval. 
Ignoring this behavior during benchmarking can lead to unfair comparisons, especially if one test benefits from cached data while another does not.</p><p>A quick recap of <a href="https://docs.databricks.com/aws/en/sql/user/queries/query-caching">caching layers</a> in Databricks SQL:</p><ul><li><p><strong>Cold run</strong> &#8212; nothing is cached yet; all data is read from storage.</p></li><li><p><strong>Warm run</strong> &#8212; data files are already cached on disk.</p></li><li><p><strong>Result cache</strong> &#8212; identical query results are returned instantly if underlying data hasn&#8217;t changed.</p></li></ul><p>If you want to measure what end users actually experience in Power BI, Tableau, or Databricks&#8217; AI/BI Dashboards, in most cases, you must benchmark against a <em>warm cache</em>, not a cold run.</p><h3><strong>The Warm-Up Approach</strong></h3><p><a href="https://github.com/ArtemChebotko/databricks-dbsql-disk-cache/blob/main/Warming%20Up%20Databricks%20SQL%20Disk%20Cache%20for%20Reliable%20BI%20Dashboard%20Benchmarking.py">The companion notebook</a> automates the warm-up process by extracting historical <code>SELECT</code> queries from the <code>system.query.history</code> table and replaying them with controlled concurrency. 
This effectively &#8220;primes&#8221; the disk cache with the same data your dashboards typically access.</p><p>Each warm-up session can be customized with parameters such as:</p><ul><li><p><strong>Time range</strong> &#8212; which historical window of queries to extract</p></li><li><p><strong>User or service principal</strong> &#8212; whose queries represent dashboard workloads</p></li><li><p><strong>Warehouse ID</strong> &#8212; which warehouse&#8217;s query history to use as the source of truth</p></li><li><p><strong>Concurrency and delay</strong> &#8212; to simulate natural workload intensity</p></li></ul><p>The notebook also tracks total execution times across multiple warm-up iterations and visualizes convergence &#8212; helping you see when the cache has stabilized and is ready for reliable benchmarking.</p><h3><strong>Extracting Historical Queries</strong></h3><p>The notebook begins by pulling real queries from <code>system.query.history</code>. You specify a user, warehouse, and time range that represent your normal dashboard activity:</p><pre><code>query = f"""
  SELECT statement_text
  FROM system.query.history
  WHERE start_time &gt;= '{start_time_utc.isoformat()}'
    AND start_time &lt;= '{end_time_utc.isoformat()}'
    AND executed_by = '{EXECUTED_BY}'
    AND compute.warehouse_id = '{WAREHOUSE_ID}'
  ORDER BY start_time
"""</code></pre><p>This ensures that the warm-up process reflects genuine production workloads, not synthetic benchmarks.</p><h3><strong>Avoiding Result Cache Shortcuts</strong></h3><p>If you replay the same queries exactly, Databricks SQL will return them from the result cache &#8212; defeating the purpose of warming the disk cache. To avoid that, the notebook automatically injects a harmless dynamic column into each query:</p><pre><code>SELECT col_a, col_b, NOW() AS injected_now
FROM sales.fact_orders
WHERE ...</code></pre><p>This simple trick forces each query to execute freshly while preserving the same scan patterns, ensuring that the disk cache &#8212; not the result cache &#8212; is exercised.</p><p>Alternatively, if you prefer not to modify queries programmatically, you can disable the result cache for the session before replaying queries:</p><pre><code>SET use_cached_result = false;</code></pre><p>This directive ensures that all queries execute freshly, regardless of whether identical results exist in the cache. For more details, see the <a href="https://docs.databricks.com/aws/en/sql/user/queries/query-caching">Databricks documentation on query caching</a>.</p><h3><strong>Replaying Queries with Concurrency</strong></h3><p>The modified queries are then replayed concurrently using the Databricks SQL Connector for Python. This concurrency simulates the burst of queries typical of a dashboard refresh:</p><pre><code>from databricks import sql
import concurrent.futures, time

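def execute_query(q):
    # Context managers close the cursor and connection even if the query fails.
    with sql.connect(**DBSQL_CONFIG) as conn:
        with conn.cursor() as cursor:
            cursor.execute(q)
            cursor.fetchall()  # fetch every row so the full result is produced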
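
# Hypothetical helper, not taken from the companion notebook: a minimal sketch of
# one way to keep a replayed statement out of the result cache. It assumes the
# statement is a single SELECT with unique output column names, and wraps it in
# an outer query that adds a non-deterministic NOW() column.
def add_dynamic_column(q):
    inner = q.strip().rstrip(";").strip()
    # Newlines keep a trailing "--" comment in the original from hiding the ")".
    return f"SELECT *, NOW() AS injected_now FROM (\n{inner}\n) AS warmup_src"

# Illustrative usage: queries = [add_dynamic_column(q) for q in queries]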

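def run_warmup(queries, passes, concurrency, delay_sec):
    durations = []  # wall-clock seconds for each full pass over the queries
    for _ in range(passes):
        start = time.time()
        # Exiting the with-block waits for every submitted query to finish.
        # The map results are never consumed, so a failed query does not raise here.
        with concurrent.futures.ThreadPoolExecutor(max_workers=concurrency) as pool:
            pool.map(execute_query, queries)
        durations.append(round(time.time() - start, 2))
        time.sleep(delay_sec)  # pause between passes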
    return durations</code></pre><p>In practice, you don&#8217;t need extreme concurrency to warm a single DBSQL warehouse. For most standard setups, 10 concurrent queries are sufficient to engage the warehouse&#8217;s local cache efficiently without overwhelming it. However, if you&#8217;re using multiple clusters per warehouse (for example, auto-scaled DBSQL configurations), you should scale concurrency proportionally &#8212; roughly 10 &#215; <em>N</em>, where <em>N</em> is the number of clusters. This ensures that all clusters participate in the warm-up process and that data is cached across all of them.</p><h3><strong>Visualizing Cache Stabilization</strong></h3><p>After replay, the notebook visualizes warm-up durations across multiple passes to confirm that the cache has reached steady state:</p><pre><code>import matplotlib.pyplot as plt
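import math

# Hypothetical helper, not part of the companion notebook: a rough numeric
# counterpart to eyeballing the chart. It assumes the cache is warm once each of
# the last `window` passes is within `tolerance` (relative) of the pass before it.
def has_converged(durations, window=3, tolerance=0.05):
    recent = durations[-(window + 1):]
    if len(recent) != window + 1:
        return False  # not enough passes recorded yet
    return all(
        math.isclose(cur, prev, rel_tol=tolerance)
        for prev, cur in zip(recent, recent[1:])
    )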

plt.figure(figsize=(8, 5))
plt.bar(range(1, len(durations)+1), durations)
plt.title("Execution Time per Warm-up Pass")
plt.xlabel("Pass")
plt.ylabel("Duration (seconds)")
plt.grid(axis='y')
plt.tight_layout()
plt.show()</code></pre><p>You should see total duration drop sharply after the first run and then level off. When the bars flatten, the warehouse&#8217;s disk cache is effectively warmed, and your benchmark can begin.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!9Smc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322830fb-c4ff-42f9-b948-24da0efeaf97_790x490.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!9Smc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322830fb-c4ff-42f9-b948-24da0efeaf97_790x490.png 424w, https://substackcdn.com/image/fetch/$s_!9Smc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322830fb-c4ff-42f9-b948-24da0efeaf97_790x490.png 848w, https://substackcdn.com/image/fetch/$s_!9Smc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322830fb-c4ff-42f9-b948-24da0efeaf97_790x490.png 1272w, https://substackcdn.com/image/fetch/$s_!9Smc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322830fb-c4ff-42f9-b948-24da0efeaf97_790x490.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!9Smc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322830fb-c4ff-42f9-b948-24da0efeaf97_790x490.png" width="790" height="490" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/322830fb-c4ff-42f9-b948-24da0efeaf97_790x490.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:490,&quot;width&quot;:790,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!9Smc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322830fb-c4ff-42f9-b948-24da0efeaf97_790x490.png 424w, https://substackcdn.com/image/fetch/$s_!9Smc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322830fb-c4ff-42f9-b948-24da0efeaf97_790x490.png 848w, https://substackcdn.com/image/fetch/$s_!9Smc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322830fb-c4ff-42f9-b948-24da0efeaf97_790x490.png 1272w, https://substackcdn.com/image/fetch/$s_!9Smc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F322830fb-c4ff-42f9-b948-24da0efeaf97_790x490.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><strong>Cache Warm-Up Convergence Over Time</strong></figcaption></figure></div><p>The chart shows total warm-up duration across repeated passes for an X-Large DBSQL warehouse configured with 2 clusters (min = 2, max = 2). The first run is the slowest (cold cache). After several passes, execution time flattens &#8212; indicating the warehouse&#8217;s disk cache has converged and is ready for benchmarking.</p><h3><strong>Using Separate Warehouses for Query Extraction and Warm-Up</strong></h3><p>A powerful capability of this approach is that you can separate where you extract history from where you warm up.</p><p>This helps keep production environments safe and ensures that your benchmarks are isolated.</p><ul><li><p><strong>Extraction warehouse:</strong> Connect to any SQL warehouse with access to <code>system.query.history</code> and filter for the production warehouse whose workload you want to reproduce. 
This step is lightweight and metadata-only.</p></li><li><p><strong>Warm-up warehouse:</strong> Use a different SQL warehouse (for example, a staging or benchmarking environment) to replay those queries and populate its disk cache.</p></li></ul><p>This separation allows you to safely warm up a benchmarking warehouse using production-like workloads, without affecting your production dashboards or incurring additional load.</p><h3><strong>Why This Matters for BI Dashboards</strong></h3><p>Whether your users interact through Power BI, Tableau, or Databricks&#8217; native AI/BI Dashboards, they all benefit from a warm disk cache. By simulating steady-state conditions before running benchmarks, you can:</p><ul><li><p>Ensure consistent, realistic performance testing</p></li><li><p>Evaluate dashboard changes fairly</p></li><li><p>Tune warehouse size and concurrency</p></li><li><p>Compare tools and configurations on equal footing</p></li></ul><h3><strong>Conclusion</strong></h3><p>To measure <em>true</em> user experience, always benchmark against a warmed Databricks SQL disk cache &#8212; not a cold one.</p><ul><li><p>Extract real queries from <code>system.query.history</code></p></li><li><p>Modify them to bypass the result cache</p></li><li><p>Replay them with realistic concurrency (&#8776;10 per cluster)</p></li><li><p>Confirm warm-up stabilization</p></li><li><p>Then test your dashboards</p></li></ul><p><a href="https://github.com/ArtemChebotko/databricks-dbsql-disk-cache/blob/main/Warming%20Up%20Databricks%20SQL%20Disk%20Cache%20for%20Reliable%20BI%20Dashboard%20Benchmarking.py">The companion notebook</a> automates this entire workflow &#8212; including query extraction, replay, visualization, and optional warehouse separation &#8212; so you can focus on measuring performance, not preparing for it.</p>
]]></content:encoded></item><item><title><![CDATA[The Hidden Price of Streaming: Cutting S3 API Calls for Massive Cloud Savings]]></title><description><![CDATA[A practical approach to cutting cloud expenses through smarter S3 API usage]]></description><link>https://www.databricksters.com/p/the-hidden-price-of-streaming-cutting</link><guid isPermaLink="false">https://www.databricksters.com/p/the-hidden-price-of-streaming-cutting</guid><dc:creator><![CDATA[Geethu]]></dc:creator><pubDate>Tue, 20 May 2025 15:02:18 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GfOV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2db54b-dd6c-4000-9a0b-8fb7f14883a5_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GfOV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2db54b-dd6c-4000-9a0b-8fb7f14883a5_1024x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!GfOV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2db54b-dd6c-4000-9a0b-8fb7f14883a5_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!GfOV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2db54b-dd6c-4000-9a0b-8fb7f14883a5_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!GfOV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2db54b-dd6c-4000-9a0b-8fb7f14883a5_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!GfOV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2db54b-dd6c-4000-9a0b-8fb7f14883a5_1024x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GfOV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2db54b-dd6c-4000-9a0b-8fb7f14883a5_1024x1024.png" width="495" height="495" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ed2db54b-dd6c-4000-9a0b-8fb7f14883a5_1024x1024.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1024,&quot;width&quot;:1024,&quot;resizeWidth&quot;:495,&quot;bytes&quot;:2253150,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/163960459?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2db54b-dd6c-4000-9a0b-8fb7f14883a5_1024x1024.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GfOV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2db54b-dd6c-4000-9a0b-8fb7f14883a5_1024x1024.png 424w, https://substackcdn.com/image/fetch/$s_!GfOV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2db54b-dd6c-4000-9a0b-8fb7f14883a5_1024x1024.png 848w, https://substackcdn.com/image/fetch/$s_!GfOV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2db54b-dd6c-4000-9a0b-8fb7f14883a5_1024x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!GfOV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fed2db54b-dd6c-4000-9a0b-8fb7f14883a5_1024x1024.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" 
stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As data pipelines scale in size and complexity, keeping operational costs under control becomes increasingly important, particularly for streaming workloads. Unlike batch jobs that run at scheduled intervals, streaming pipelines operate continuously, generating a steady stream of read and write operations. This constant interaction with cloud storage services like S3 can quickly accumulate API costs if not properly managed. Gaining visibility into these patterns and optimizing them is essential to maintaining efficient, cost-effective, and scalable data systems.</p><h3><strong>Why S3 API Calls Matter for Cost</strong></h3><p>S3 is widely used for its scalability, durability, and integration with modern data platforms. 
However, beyond storage fees, S3 also charges for API requests, an often underestimated factor that can significantly impact the cost of streaming data pipelines.</p><p>In Auto Loader or DLT workloads in Databricks, S3 is frequently accessed for operations such as file reads, writes, checkpoints, schema inference, and metadata management. These operations translate into S3 API calls, each of which incurs a cost depending on the request type.</p><h4><strong>Use Case Overview:</strong></h4><p>As an example, let's take a Delta Live Tables (DLT) pipeline or a Structured Streaming job with the default trigger interval of 500 milliseconds to illustrate how S3 API calls can quietly drive up the cost of a streaming pipeline.</p><p><strong>Assumptions with 500 ms trigger interval:</strong></p><ul><li><p>100 S3 API calls every 500 ms = 200 API calls per second = 17,280,000 API calls per day</p></li><li><p><strong>Breakdown</strong>:</p><ul><li><p><strong>40%</strong> PUT, LIST, POST &#8594; <strong>80 calls/sec</strong></p></li><li><p><strong>60%</strong> GET, READ &#8594; <strong>120 calls/sec</strong></p></li></ul></li></ul><p><strong>&#128202; Daily API Call Volumes:</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!a1_w!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe82b6e7-bc37-4948-9254-c5ebd8e7f413_1284x306.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!a1_w!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe82b6e7-bc37-4948-9254-c5ebd8e7f413_1284x306.png 424w, 
https://substackcdn.com/image/fetch/$s_!a1_w!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe82b6e7-bc37-4948-9254-c5ebd8e7f413_1284x306.png 848w, https://substackcdn.com/image/fetch/$s_!a1_w!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe82b6e7-bc37-4948-9254-c5ebd8e7f413_1284x306.png 1272w, https://substackcdn.com/image/fetch/$s_!a1_w!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe82b6e7-bc37-4948-9254-c5ebd8e7f413_1284x306.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!a1_w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe82b6e7-bc37-4948-9254-c5ebd8e7f413_1284x306.png" width="1284" height="306" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/be82b6e7-bc37-4948-9254-c5ebd8e7f413_1284x306.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:306,&quot;width&quot;:1284,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60757,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/163960459?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe82b6e7-bc37-4948-9254-c5ebd8e7f413_1284x306.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!a1_w!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe82b6e7-bc37-4948-9254-c5ebd8e7f413_1284x306.png 424w, https://substackcdn.com/image/fetch/$s_!a1_w!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe82b6e7-bc37-4948-9254-c5ebd8e7f413_1284x306.png 848w, https://substackcdn.com/image/fetch/$s_!a1_w!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe82b6e7-bc37-4948-9254-c5ebd8e7f413_1284x306.png 1272w, https://substackcdn.com/image/fetch/$s_!a1_w!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbe82b6e7-bc37-4948-9254-c5ebd8e7f413_1284x306.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>&#128197; Monthly Cost (30 Days):</strong></p><p>$38.71/day &#215; 30 days = <em>~$1,161.30/month</em></p><p>This is the cost of a single pipeline. With 10 pipelines like this, the bill can easily exceed $10,000 per month (10 &#215; $1,161.30 &#8776; $11,600).</p><h2><strong>Common Scenarios Leading to High S3 API Costs</strong></h2><p>Certain patterns such as small file writes, frequent checkpointing, or excessive schema discovery can unintentionally amplify S3 API usage. Since S3 charges per request type, even small optimizations at the infrastructure or code level (e.g., reducing unnecessary LIST calls or batching writes) can have a noticeable impact on cost.</p><p>Understanding where and why S3 API calls occur in a streaming pipeline is crucial. Without this insight, organizations may see ballooning storage-related costs that don't correlate directly with the amount of data stored or processed. Therefore, monitoring and minimizing unnecessary API calls is key to building cost-efficient streaming architectures. 
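</p><p>The per-pipeline estimate from the use case above can be reproduced with a few lines of arithmetic. What follows is only a rough sketch: the request prices ($0.005 per 1,000 PUT/LIST/POST requests and $0.0004 per 1,000 GET requests) are assumed S3 Standard list prices that vary by region and over time, and the function name <code>daily_s3_request_cost</code> is made up for illustration.</p><pre><code class="language-python"># Assumed S3 Standard request prices in USD per 1,000 requests.
# These are illustrative; check the current pricing for your region.
WRITE_PRICE_PER_1K = 0.005   # PUT, LIST, POST
READ_PRICE_PER_1K = 0.0004   # GET and other reads
SECONDS_PER_DAY = 24 * 60 * 60


def daily_s3_request_cost(calls_per_second, write_share=0.40):
    """Estimate the daily S3 request cost (USD) for a steady call rate.

    write_share is the fraction of calls billed at the PUT/LIST/POST rate;
    the remainder is billed at the GET rate.
    """
    calls_per_day = calls_per_second * SECONDS_PER_DAY
    write_calls = calls_per_day * write_share
    read_calls = calls_per_day - write_calls
    return (write_calls * WRITE_PRICE_PER_1K + read_calls * READ_PRICE_PER_1K) / 1000


# 100 calls per 500 ms trigger = 200 calls per second
baseline = daily_s3_request_cost(200)
print(f"daily: ${baseline:.2f}, monthly: ${round(baseline, 2) * 30:.2f}")
</code></pre><p>With these assumed prices, 200 calls per second works out to about $38.71 per day, and 100 calls per second to about $19.35 per day. Swapping in your own call rate, write share, and regional prices gives a quick first estimate before the bill arrives.</p><p>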
Below are proven strategies to slash API expenses while maintaining performance.</p><div><hr></div><h2><strong>Core Optimization Strategies</strong></h2><h4>Option 1: Increase Trigger Interval in Bronze and Silver Layers</h4><p>Reducing the frequency of micro-batch execution helps lower the number of S3 API calls, especially in high-volume streaming jobs where each trigger performs multiple GET, PUT, and LIST operations on S3. This optimization is only an option when the pipeline does not require sub-second latency per micro-batch.</p><p>Note: The following metrics are based on the assumption that increasing micro-batch size leads to fewer files being read and written, as each batch processes more data.</p><p><strong>New Assumptions with 1-Second Interval</strong></p><ul><li><p><strong>Old Trigger Rate:</strong> every 500 ms &#8594; 2 triggers/sec</p></li><li><p><strong>New Trigger Rate:</strong> every 1 second &#8594; 1 trigger/sec (assuming fewer files are written overall because each batch processes more data)</p></li><li><p><strong>Reduction Factor:</strong> 2&#215; fewer triggers</p></li><li><p><strong>New API call rate:</strong></p><ul><li><p>100 API calls every 1 second = 100 API calls per second</p></li><li><p>= 8,640,000 API calls/day</p></li></ul></li></ul><p><strong>&#128202; Daily API Call Breakdown (Same 40/60 Split):</strong></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sYzg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908b723-3fb1-46ce-8b1c-c6dc4c1b8685_1262x300.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!sYzg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908b723-3fb1-46ce-8b1c-c6dc4c1b8685_1262x300.png 424w, https://substackcdn.com/image/fetch/$s_!sYzg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908b723-3fb1-46ce-8b1c-c6dc4c1b8685_1262x300.png 848w, https://substackcdn.com/image/fetch/$s_!sYzg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908b723-3fb1-46ce-8b1c-c6dc4c1b8685_1262x300.png 1272w, https://substackcdn.com/image/fetch/$s_!sYzg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908b723-3fb1-46ce-8b1c-c6dc4c1b8685_1262x300.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sYzg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908b723-3fb1-46ce-8b1c-c6dc4c1b8685_1262x300.png" width="1262" height="300" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8908b723-3fb1-46ce-8b1c-c6dc4c1b8685_1262x300.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:300,&quot;width&quot;:1262,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:60368,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/163960459?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908b723-3fb1-46ce-8b1c-c6dc4c1b8685_1262x300.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" 
class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!sYzg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908b723-3fb1-46ce-8b1c-c6dc4c1b8685_1262x300.png 424w, https://substackcdn.com/image/fetch/$s_!sYzg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908b723-3fb1-46ce-8b1c-c6dc4c1b8685_1262x300.png 848w, https://substackcdn.com/image/fetch/$s_!sYzg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908b723-3fb1-46ce-8b1c-c6dc4c1b8685_1262x300.png 1272w, https://substackcdn.com/image/fetch/$s_!sYzg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8908b723-3fb1-46ce-8b1c-c6dc4c1b8685_1262x300.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>&#128197; Monthly Cost (30 Days):</strong></p><p>$19.35/day &#215; 30 = <em>~$580.50/month</em></p><p>This is a 2&#215; reduction in cost per pipeline, which adds up to significant savings across multiple pipelines.</p><div><hr></div><h4>Option 2: Table Properties Settings &#8211; Use v2 Checkpointing</h4><p>Delta Lake&#8217;s v2 checkpointing is an optimized format that improves how Delta tables manage transaction logs. Unlike the default checkpointing (v1), which may trigger additional reads of Parquet data to gather stats or metadata, v2 checkpointing stores those stats directly in the checkpoint files. This reduces the need to make additional S3 GET or LIST API calls, thereby lowering I/O overhead and S3 request costs.</p><p><strong>Why It Matters for Streaming:</strong></p><p>In streaming pipelines, frequent checkpointing is common (especially in the bronze/silver layers). 
Traditional checkpoints often involve multiple S3 metadata reads. v2 checkpointing reduces this footprint by minimizing how often the engine needs to fetch additional files, helping control the number of REST API calls made to S3.</p><p><strong>Benefits:</strong></p><ul><li><p>Reduces S3 GET/LIST API calls</p></li><li><p>Speeds up streaming reads and commits</p></li><li><p>Lowers cloud storage access costs</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jeuL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e5a2afe-b864-4a07-8653-39c3652f096f_1268x202.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jeuL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e5a2afe-b864-4a07-8653-39c3652f096f_1268x202.png 424w, https://substackcdn.com/image/fetch/$s_!jeuL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e5a2afe-b864-4a07-8653-39c3652f096f_1268x202.png 848w, https://substackcdn.com/image/fetch/$s_!jeuL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e5a2afe-b864-4a07-8653-39c3652f096f_1268x202.png 1272w, https://substackcdn.com/image/fetch/$s_!jeuL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e5a2afe-b864-4a07-8653-39c3652f096f_1268x202.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jeuL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e5a2afe-b864-4a07-8653-39c3652f096f_1268x202.png" 
width="1268" height="202" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e5a2afe-b864-4a07-8653-39c3652f096f_1268x202.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:202,&quot;width&quot;:1268,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:44026,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/163960459?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e5a2afe-b864-4a07-8653-39c3652f096f_1268x202.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jeuL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e5a2afe-b864-4a07-8653-39c3652f096f_1268x202.png 424w, https://substackcdn.com/image/fetch/$s_!jeuL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e5a2afe-b864-4a07-8653-39c3652f096f_1268x202.png 848w, https://substackcdn.com/image/fetch/$s_!jeuL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e5a2afe-b864-4a07-8653-39c3652f096f_1268x202.png 1272w, https://substackcdn.com/image/fetch/$s_!jeuL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e5a2afe-b864-4a07-8653-39c3652f096f_1268x202.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>To enable v2 Checkpointing set the below property:</p><p><code>ALTER TABLE my_table</code></p><p><code>SET TBLPROPERTIES 
(</code></p><p><code>'delta.feature.v2Checkpoint' = 'supported',</code></p><p><code>'delta.checkpointPolicy' = 'v2'</code></p><p><code>);</code></p><div><hr></div><h4>Option 3: Delta Lake Metadata Management</h4><p>Over time, Delta Lake&#8217;s transaction log can grow significantly, especially in high-frequency streaming jobs. In one observed case, 30 days of logs resulted in more than 150 GB of metadata in the _delta_log directory. This metadata bloat increases the number of S3 LIST and GET API calls.</p><p><strong>Fix: Reduce Metadata Retention Durations</strong></p><p>Set shorter retention periods for transaction logs and deleted files to control metadata growth:</p><p><code>ALTER TABLE silver SET TBLPROPERTIES (</code></p><p><code>'delta.logRetentionDuration' = '7 days',</code></p><p><code>'delta.deletedFileRetentionDuration' = '3 days'</code></p><p><code>);</code></p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KU-P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e69b98a-074b-4b70-baba-3b428f2a3c90_1294x194.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KU-P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e69b98a-074b-4b70-baba-3b428f2a3c90_1294x194.png 424w, https://substackcdn.com/image/fetch/$s_!KU-P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e69b98a-074b-4b70-baba-3b428f2a3c90_1294x194.png 848w, https://substackcdn.com/image/fetch/$s_!KU-P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e69b98a-074b-4b70-baba-3b428f2a3c90_1294x194.png 1272w, 
https://substackcdn.com/image/fetch/$s_!KU-P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e69b98a-074b-4b70-baba-3b428f2a3c90_1294x194.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KU-P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e69b98a-074b-4b70-baba-3b428f2a3c90_1294x194.png" width="1294" height="194" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1e69b98a-074b-4b70-baba-3b428f2a3c90_1294x194.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:194,&quot;width&quot;:1294,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:32696,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/163960459?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e69b98a-074b-4b70-baba-3b428f2a3c90_1294x194.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!KU-P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e69b98a-074b-4b70-baba-3b428f2a3c90_1294x194.png 424w, https://substackcdn.com/image/fetch/$s_!KU-P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e69b98a-074b-4b70-baba-3b428f2a3c90_1294x194.png 848w, 
https://substackcdn.com/image/fetch/$s_!KU-P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e69b98a-074b-4b70-baba-3b428f2a3c90_1294x194.png 1272w, https://substackcdn.com/image/fetch/$s_!KU-P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1e69b98a-074b-4b70-baba-3b428f2a3c90_1294x194.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Benefits:</strong></p><ul><li><p>85% smaller _delta_log directories</p></li><li><p>Reduced metadata scanning and S3 API usage</p></li><li><p>Lower cloud storage access costs</p></li></ul><div><hr></div><h4>Option 4: Increase the Frequency of Manual Maintenance Jobs on the Source and Staging Tables</h4><p>In a continuous streaming pipeline, small files tend to accumulate in the intermediate tables due to high-frequency writes. Because each micro-batch writes its own set of files to storage, the result is a large number of small files, which can heavily degrade query performance and increase the number of S3 API calls (e.g., LIST and GET), leading to higher storage and I/O costs. It is therefore essential to run OPTIMIZE regularly to consolidate small files and better organize the data. 
Enabling auto-optimize and auto-compaction settings in Delta Lake can help automatically reduce the number of small files, ensuring more efficient file management and improved query performance.</p><p><strong>Benefits</strong>:</p><ul><li><p>Optimizes large data volumes by consolidating smaller files, reducing file fragmentation.</p></li><li><p>Fewer S3 LIST and GET API calls due to fewer, larger files.</p></li><li><p>Continuous and regular maintenance ensures tables do not degrade over time due to excessive fragmentation.</p></li></ul><div><hr></div><h4>Option 5: Reduce the Number of Min Batches to Retain</h4><p>In streaming jobs, the minBatchesToRetain setting controls the minimum number of recent micro-batches that must be retained and kept recoverable in the checkpoint. By default, spark.sql.streaming.minBatchesToRetain is set to 100, meaning the checkpoint and state files for at least the latest 100 batches are kept. Lowering this number can help reduce API calls, particularly to S3, thereby optimizing costs.</p><p><strong>Smaller State to Manage = Fewer Metadata Reads<br><br></strong>When fewer micro-batches are retained, the streaming engine manages less state during job execution and checkpointing. Delta Lake may avoid reading and re-validating as many past logs or files (via LIST and GET calls to S3) when cleaning up or updating internal state.<br><strong>Impact</strong>: Leads to fewer S3 reads during log replay, especially after restarts or recovery operations.</p><p><strong>Reduced Dependency on Historical Checkpoints<br><br></strong>With a lower minBatchesToRetain, checkpointing may reference fewer previous batch files. It also reduces interactions with _delta_log/ and old checkpoint files stored on S3. 
<br><strong>Impact:</strong> Reduces the need to fetch older transaction log entries or metadata files.</p><p><strong>When to Use This Setting:</strong></p><ul><li><p>For low-latency, high-frequency streaming jobs with frequent checkpointing.</p></li><li><p>When historical state is not needed for auditing or recovery beyond a few batches.</p></li><li><p>As part of a broader S3 cost control strategy, alongside more direct tactics like optimizing checkpoint frequency, batching, and file compaction.</p></li></ul><div><hr></div><h4>Option 6: Tune Shuffle Partitions Based on Cluster Size</h4><p>By default, Spark sets spark.sql.shuffle.partitions = 200, which defines how many partitions are created during shuffles (e.g., after joins, groupBy, aggregations). On clusters with far fewer available CPU cores than that, this leads to unnecessary parallelism, resulting in:</p><ul><li><p>Idle or underutilized partitions</p></li><li><p>More metadata overhead</p></li><li><p>More S3 API calls, especially LIST calls, when reading partitioned data or transaction logs</p></li></ul><p>The number of S3 LIST and GET calls during a shuffle or file scan phase often scales with: S3 API calls &#8776; listCallsPerPartition &#215; number_of_partitions</p><p>Reducing the partition count directly lowers the number of directory listings and metadata requests needed during Delta transaction log reads, query planning, partition pruning, and OPTIMIZE/Z-ORDER operations.</p><p><strong>When to Use: </strong>When total available cores &#215; 2 &lt; 200</p><div><hr></div><h4>Additional Recommendation for Monitoring Costs:</h4><p>To speed up feedback and reduce reliance on delayed billing dashboards, use Amazon Athena to directly query S3 server access logs. 
This allows for:</p><ul><li><p>Near real-time visibility into GET, PUT, LIST, and DELETE requests</p></li><li><p>Faster iteration and validation of pipeline optimizations</p></li></ul><h4>Final Thoughts</h4><p>Streaming pipelines are powerful&#8212;but they can get expensive fast if you&#8217;re not careful with how often you hit services like S3. A lot of the cost sneaks in through frequent API calls, especially when you're writing small files, constantly listing buckets, or triggering jobs too often.</p><p>The good news? Small tweaks like increasing trigger intervals, batching writes, and cutting back on unnecessary reads can go a long way in saving money and making your pipeline run more smoothly.</p><p>At the end of the day, it's all about being smart with how your pipeline talks to S3. A few simple changes now can lead to big cost savings down the line&#8212;without sacrificing performance.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.databricksters.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Databricksters! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Everything You Ever Wanted to Know about Pandas / PyArrow UDFs in Apache Spark ]]></title><description><![CDATA[Vectorized UDFs, Zero-Copy Arrows & 100&#215; Speed-Ups]]></description><link>https://www.databricksters.com/p/everything-you-ever-wanted-to-know</link><guid isPermaLink="false">https://www.databricksters.com/p/everything-you-ever-wanted-to-know</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Fri, 25 Apr 2025 18:15:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>TLDR</h2><blockquote><p><em>Vectorized</em> (Pandas) UDFs marry Spark&#8217;s scale with Pandas &amp; NumPy&#8217;s speed by streaming Arrow column batches across the JVM &#8596; Python boundary. They are 10-100&#215; faster than classic Python UDFs, but they still have sharp edges&#8212;batch sizing, unsupported types, 2 GB limits, executor memory, SafeSpark jail-sandboxes, etc. </p></blockquote><p>In the world of big data processing, Apache Spark stands as the preeminent framework for distributed computation. One of its most powerful features for Python users is the ability to create User-Defined Functions (UDFs). However, traditional Python UDFs often face significant performance limitations. 
Enter Pandas PyArrow UDFs: a revolutionary approach that combines the analytical capabilities of pandas with the efficiency of Apache Arrow to deliver exceptional performance in distributed environments.</p><h2><strong>The Evolution of Python UDFs in Spark</strong></h2><p>Traditional Python UDFs in Spark suffer from three fundamental limitations that impact performance:</p><ol><li><p><strong>Serialization overhead</strong>: Data must be serialized between JVM and Python processes using pickle, which is computationally expensive<a href="https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html">2</a>.</p></li><li><p><strong>Row-by-row processing</strong>: Functions operate on individual rows rather than batches, resulting in millions of function calls for large datasets<a href="https://www.databricks.com/blog/arrow-optimized-python-udfs-apache-sparktm-35">4</a>.</p></li><li><p><strong>Lack of vectorization</strong>: Operations can't leverage the optimized C/Cython implementations in pandas and NumPy libraries<a href="https://www.databricks.com/blog/arrow-optimized-python-udfs-apache-sparktm-35">4</a>.</p></li></ol><p>Pandas UDFs were introduced in Spark 2.3 to address these limitations, with significant improvements in Spark 3.0 and beyond. These vectorized UDFs use Apache Arrow to efficiently transfer data and pandas to process it in a vectorized manner, delivering performance increases of up to 100x compared to traditional UDFs<a href="https://docs.databricks.com/aws/en/udf/pandas">3</a>.</p><h2><strong>Apache Arrow: The Backbone of High-Performance Data Exchange</strong></h2><p>Apache Arrow is the critical technology that enables the exceptional performance of Pandas UDFs in Spark. As an open-source columnar in-memory data format, Arrow was specifically designed to facilitate efficient data exchange between different programming environments<a href="https://spark.apache.org/docs/latest/api/python/user_guide/sql/arrow_pandas.html">2</a>. 
For Pandas UDFs, Arrow eliminates the costly serialization/deserialization overhead that plagues traditional Python UDFs when transferring data between JVM and Python processes<a href="https://downloads.apache.org/spark/docs/3.0.1/sql-pyspark-pandas-with-arrow.html">5</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RZ4n!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a4d76e4-a12c-4e1b-bf4c-c75b8a8f97f4_574x318.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RZ4n!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a4d76e4-a12c-4e1b-bf4c-c75b8a8f97f4_574x318.png 424w, https://substackcdn.com/image/fetch/$s_!RZ4n!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a4d76e4-a12c-4e1b-bf4c-c75b8a8f97f4_574x318.png 848w, https://substackcdn.com/image/fetch/$s_!RZ4n!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a4d76e4-a12c-4e1b-bf4c-c75b8a8f97f4_574x318.png 1272w, https://substackcdn.com/image/fetch/$s_!RZ4n!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a4d76e4-a12c-4e1b-bf4c-c75b8a8f97f4_574x318.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RZ4n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a4d76e4-a12c-4e1b-bf4c-c75b8a8f97f4_574x318.png" width="574" height="318" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1a4d76e4-a12c-4e1b-bf4c-c75b8a8f97f4_574x318.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:318,&quot;width&quot;:574,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Format | Apache Arrow&quot;,&quot;title&quot;:&quot;Format | Apache Arrow&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Format | Apache Arrow" title="Format | Apache Arrow" srcset="https://substackcdn.com/image/fetch/$s_!RZ4n!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a4d76e4-a12c-4e1b-bf4c-c75b8a8f97f4_574x318.png 424w, https://substackcdn.com/image/fetch/$s_!RZ4n!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a4d76e4-a12c-4e1b-bf4c-c75b8a8f97f4_574x318.png 848w, https://substackcdn.com/image/fetch/$s_!RZ4n!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a4d76e4-a12c-4e1b-bf4c-c75b8a8f97f4_574x318.png 1272w, https://substackcdn.com/image/fetch/$s_!RZ4n!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a4d76e4-a12c-4e1b-bf4c-c75b8a8f97f4_574x318.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">Before Arrow</figcaption></figure></div>
<p>Arrow achieves this efficiency through its columnar memory layout, which stores data contiguously by column rather than by row. This approach provides numerous benefits for analytical workloads: better memory compression, improved CPU cache utilization, and support for SIMD (Single Instruction, Multiple Data) vector operations<a href="https://www.databricks.com/blog/arrow-optimized-python-udfs-apache-sparktm-35">4</a>.
Most importantly, Arrow defines one in-memory format that both the JVM and Python understand, which makes "<strong>zero-copy</strong>" reads possible: a process can use an Arrow buffer as-is instead of deserializing it value by value. In Spark the batches still cross the JVM-to-Python process boundary, but skipping per-row (de)serialization dramatically reduces the cost of data transfer<a href="https://best-practice-and-impact.github.io/ons-spark/ancillary-topics/pandas-udfs.html">6</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aG3G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4024ac-5915-4fd9-9f24-c26f9b2217e5_878x472.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aG3G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4024ac-5915-4fd9-9f24-c26f9b2217e5_878x472.png 424w, https://substackcdn.com/image/fetch/$s_!aG3G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4024ac-5915-4fd9-9f24-c26f9b2217e5_878x472.png 848w, https://substackcdn.com/image/fetch/$s_!aG3G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4024ac-5915-4fd9-9f24-c26f9b2217e5_878x472.png 1272w, https://substackcdn.com/image/fetch/$s_!aG3G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4024ac-5915-4fd9-9f24-c26f9b2217e5_878x472.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aG3G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4024ac-5915-4fd9-9f24-c26f9b2217e5_878x472.png" width="878" height="472"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c4024ac-5915-4fd9-9f24-c26f9b2217e5_878x472.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:472,&quot;width&quot;:878,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Origins of Apache Arrow &amp; Its Role Today | Dremio Blog&quot;,&quot;title&quot;:&quot;Origins of Apache Arrow &amp; Its Role Today | Dremio Blog&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Origins of Apache Arrow &amp; Its Role Today | Dremio Blog" title="Origins of Apache Arrow &amp; Its Role Today | Dremio Blog" srcset="https://substackcdn.com/image/fetch/$s_!aG3G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4024ac-5915-4fd9-9f24-c26f9b2217e5_878x472.png 424w, https://substackcdn.com/image/fetch/$s_!aG3G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4024ac-5915-4fd9-9f24-c26f9b2217e5_878x472.png 848w, https://substackcdn.com/image/fetch/$s_!aG3G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4024ac-5915-4fd9-9f24-c26f9b2217e5_878x472.png 1272w, https://substackcdn.com/image/fetch/$s_!aG3G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c4024ac-5915-4fd9-9f24-c26f9b2217e5_878x472.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<p>When a Pandas UDF executes, Spark converts data to Arrow format, splits it into batches, transfers these batches to Python workers as Arrow structures, processes them using pandas, and then returns the results via the same efficient Arrow pathway<a href="https://downloads.apache.org/spark/docs/3.0.1/sql-pyspark-pandas-with-arrow.html">5</a>. This entire pipeline is optimized for high-throughput, parallel processing across a distributed cluster.
The result is performance gains that can transform previously impractical Python processing into viable production workflows<a href="https://docs.databricks.com/aws/en/udf/pandas">3</a>.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ybZ0!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ybZ0!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!ybZ0!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!ybZ0!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!ybZ0!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ybZ0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png" width="728" height="485.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png&quot;,&quot;srcNoWatermark&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/dc64b401-c121-4e33-af57-a1c5f2c5726a_1536x1024.png&quot;,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:971,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!ybZ0!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 424w, https://substackcdn.com/image/fetch/$s_!ybZ0!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 848w, https://substackcdn.com/image/fetch/$s_!ybZ0!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 1272w, https://substackcdn.com/image/fetch/$s_!ybZ0!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e362a94-9995-4b0b-8814-05005ee33c03_1536x1024.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h2><strong>The Data Flow Process</strong></h2><p>The process of executing a Pandas UDF involves several steps that highlight how data flows through the Spark execution environment:</p><ol><li><p>Spark converts the data into Arrow format</p></li><li><p>The data is split into batches (configured by <code>spark.sql.execution.arrow.maxRecordsPerBatch</code>)</p></li><li><p>Arrow batches are transferred to Python workers</p></li><li><p>Python workers convert Arrow batches to pandas Series or DataFrames</p></li><li><p>The UDF function processes these pandas objects</p></li><li><p>Results are converted back to Arrow format</p></li><li><p>Arrow data is transferred back to Spark</p></li><li><p>Spark converts Arrow data back to its internal format</p></li></ol><p>This entire process happens in parallel across the Spark cluster, leveraging the distributed nature of Spark while
maintaining the efficiency of vectorized operations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!RH5p!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7885fd4-83e7-4238-a78b-67b96f939395_1434x502.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!RH5p!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7885fd4-83e7-4238-a78b-67b96f939395_1434x502.png 424w, https://substackcdn.com/image/fetch/$s_!RH5p!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7885fd4-83e7-4238-a78b-67b96f939395_1434x502.png 848w, https://substackcdn.com/image/fetch/$s_!RH5p!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7885fd4-83e7-4238-a78b-67b96f939395_1434x502.png 1272w, https://substackcdn.com/image/fetch/$s_!RH5p!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7885fd4-83e7-4238-a78b-67b96f939395_1434x502.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!RH5p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7885fd4-83e7-4238-a78b-67b96f939395_1434x502.png" width="1434" height="502" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b7885fd4-83e7-4238-a78b-67b96f939395_1434x502.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:502,&quot;width&quot;:1434,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:172626,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/162103523?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7885fd4-83e7-4238-a78b-67b96f939395_1434x502.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!RH5p!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7885fd4-83e7-4238-a78b-67b96f939395_1434x502.png 424w, https://substackcdn.com/image/fetch/$s_!RH5p!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7885fd4-83e7-4238-a78b-67b96f939395_1434x502.png 848w, https://substackcdn.com/image/fetch/$s_!RH5p!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7885fd4-83e7-4238-a78b-67b96f939395_1434x502.png 1272w, https://substackcdn.com/image/fetch/$s_!RH5p!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb7885fd4-83e7-4238-a78b-67b96f939395_1434x502.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n3-L!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c440036-7669-449a-bb11-35aae1932242_1460x594.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n3-L!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c440036-7669-449a-bb11-35aae1932242_1460x594.png 424w,
https://substackcdn.com/image/fetch/$s_!n3-L!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c440036-7669-449a-bb11-35aae1932242_1460x594.png 848w, https://substackcdn.com/image/fetch/$s_!n3-L!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c440036-7669-449a-bb11-35aae1932242_1460x594.png 1272w, https://substackcdn.com/image/fetch/$s_!n3-L!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c440036-7669-449a-bb11-35aae1932242_1460x594.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n3-L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c440036-7669-449a-bb11-35aae1932242_1460x594.png" width="1456" height="592" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7c440036-7669-449a-bb11-35aae1932242_1460x594.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:592,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:217112,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/162103523?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c440036-7669-449a-bb11-35aae1932242_1460x594.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" 
srcset="https://substackcdn.com/image/fetch/$s_!n3-L!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c440036-7669-449a-bb11-35aae1932242_1460x594.png 424w, https://substackcdn.com/image/fetch/$s_!n3-L!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c440036-7669-449a-bb11-35aae1932242_1460x594.png 848w, https://substackcdn.com/image/fetch/$s_!n3-L!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c440036-7669-449a-bb11-35aae1932242_1460x594.png 1272w, https://substackcdn.com/image/fetch/$s_!n3-L!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7c440036-7669-449a-bb11-35aae1932242_1460x594.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lT6Q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71351669-e43d-4b87-b02f-89a30027137e_1448x612.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lT6Q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71351669-e43d-4b87-b02f-89a30027137e_1448x612.png 424w, https://substackcdn.com/image/fetch/$s_!lT6Q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71351669-e43d-4b87-b02f-89a30027137e_1448x612.png 848w, https://substackcdn.com/image/fetch/$s_!lT6Q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71351669-e43d-4b87-b02f-89a30027137e_1448x612.png 1272w, https://substackcdn.com/image/fetch/$s_!lT6Q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71351669-e43d-4b87-b02f-89a30027137e_1448x612.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lT6Q!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71351669-e43d-4b87-b02f-89a30027137e_1448x612.png" width="1200" height="507.18232044198896"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71351669-e43d-4b87-b02f-89a30027137e_1448x612.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:612,&quot;width&quot;:1448,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:284960,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:&quot;&quot;,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.canadiandataguy.com/i/162103523?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71351669-e43d-4b87-b02f-89a30027137e_1448x612.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-large" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lT6Q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71351669-e43d-4b87-b02f-89a30027137e_1448x612.png 424w, https://substackcdn.com/image/fetch/$s_!lT6Q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71351669-e43d-4b87-b02f-89a30027137e_1448x612.png 848w, https://substackcdn.com/image/fetch/$s_!lT6Q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71351669-e43d-4b87-b02f-89a30027137e_1448x612.png 1272w, https://substackcdn.com/image/fetch/$s_!lT6Q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71351669-e43d-4b87-b02f-89a30027137e_1448x612.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div>
<h2>Pandas UDFs Defined</h2><p>You define a pandas UDF by decorating a Python function with <code>@pandas_udf</code> <strong>and</strong> adding <strong>type hints</strong> for the input and output:</p><pre><code><code>from typing import Iterator
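import pandas as pd
from pyspark.sql.functions import pandas_udf

# Iterator-of-Series flavor: Spark hands the function an iterator of batches
# (each a pandas Series) and expects one output Series per input batch.
@pandas_udf('long')
def pandas_plus_one(iterator: Iterator[pd.Series]) -&gt; Iterator[pd.Series]:
    for batch in iterator:
        yield batch + 1  # vectorized add over the whole batch

# display() is a Databricks notebook helper; outside Databricks, call .show() instead.
display(spark.range(10).select(pandas_plus_one("id")))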
</code></code></pre><ul><li><p>The <strong>signature</strong> (here <code>Iterator[pd.Series] &#8594; Iterator[pd.Series]</code>) tells Spark which UDF flavor to pick.</p></li><li><p>Under the hood, Spark uses <strong>Apache Arrow</strong> for zero-copy (de)serialization.</p></li><li><p><strong>Vectorized</strong>: your code gets whole batches as <code>pd.Series</code>/<code>pd.DataFrame</code>, not single cells.</p></li></ul><h2>Different flavours of Pandas UDF</h2><ul><li><p><strong><a href="https://www.databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html">Series to Series (</a></strong><code>pandas.Series -&gt; pandas.Series</code><strong><a href="https://www.databricks.com/blog/2020/05/20/new-pandas-udfs-and-python-type-hints-in-the-upcoming-release-of-apache-spark-3-0.html">)</a>:</strong> This pattern provides a clear, Pythonic, type-hinted way to define vectorized UDFs that transform one Spark column into another. The result is conceptually row-by-row, but Spark calls the function once per batch. It replaces the need to explicitly specify the older <code>SCALAR</code> Pandas UDF type, making the function's intent (taking a Series and returning a Series of the same length) self-evident from the type hints.</p></li><li><p><strong><a href="https://docs.databricks.com/aws/en/udf/pandas">Iterator of Series to Iterator of Series (</a></strong><code>Iterator[pandas.Series] -&gt; Iterator[pandas.Series]</code><strong><a href="https://docs.databricks.com/aws/en/udf/pandas">)</a>:</strong> This pattern was introduced to offer more flexibility and optimization for Series-to-Series transformations. The function receives an iterator over a partition's batches and yields one output Series per batch, instead of being invoked separately for every batch.
The <em>why</em> is twofold: 1) It helps manage memory usage for very large data partitions, and 2) It enables expensive state initialization (e.g., loading a model) to be done once per batch iterator, improving performance.</p></li><li><p><strong>Iterator of Multiple Series to Iterator of Series (</strong><code>Iterator[Tuple[pandas.Series, ...]] -&gt; Iterator[pandas.Series]</code><strong>):</strong> This extends the previous pattern because many operations require logic based on <em>multiple</em> input columns simultaneously. This type hint signature allows users to define UDFs that take batches of multiple input Series, perform calculations using them together, and return a single output Series batch, offering the same memory and initialization benefits for multi-column logic.</p></li><li><p><strong>Series to Scalar (</strong><code>pandas.Series -&gt; Any</code><strong>):</strong> This pattern, often used with <code>groupBy().agg()</code> or window functions, provides a type-hinted way to define aggregations. It takes a Pandas Series representing a group or partition and returns a single scalar value. The <em>why</em> is to replace the older <code>GROUPED_AGG</code> Pandas UDF type with a more standard Python type hint signature, making the aggregation intent clear.</p></li><li><p><code>applyInPandas</code><strong> (on GroupedData):</strong> This function exists specifically to implement the "split-apply-combine" pattern on grouped data<a href="https://learn.microsoft.com/en-us/azure/databricks/pandas/pandas-function-apis">6</a>. 
<strong>Why?</strong> It allows applying a custom Python function, operating on a <em>full Pandas DataFrame</em> for each group, to perform complex, group-specific transformations or aggregations that are difficult or inefficient with standard Spark functions<a href="https://www.getorchestra.io/guides/spark-concepts-pyspark-sql-groupeddata-applyinpandas-quick-start">2</a><a href="https://learn.microsoft.com/en-us/azure/databricks/pandas/pandas-function-apis">6</a>. It expects one Pandas DataFrame (representing a group) as input and requires a Pandas DataFrame as output, effectively transforming each group<a href="https://www.getorchestra.io/guides/spark-concepts-pyspark-sql-groupeddata-applyinpandas-quick-start">2</a><a href="https://community.databricks.com/t5/technical-blog/understanding-pandas-udf-applyinpandas-and-mapinpandas/ba-p/75717">5</a><a href="https://learn.microsoft.com/en-us/azure/databricks/pandas/pandas-function-apis">6</a>. Note: It loads the entire group into memory, which can be demanding for large groups<a href="https://learn.microsoft.com/en-us/azure/databricks/pandas/pandas-function-apis">6</a>.</p></li><li><p><code>mapInPandas</code><strong> (on DataFrame):</strong> This function exists to apply a Python function to an <em>iterator of Pandas DataFrames</em>, where each DataFrame represents a batch of data from a DataFrame partition<a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.mapInPandas.html">1</a><a href="https://www.getorchestra.io/guides/spark-concepts-pyspark-sql-dataframe-mapinpandas-explained">3</a>. 
<strong>Why?</strong> It's designed for transformations on entire partitions or batches where the logic is complex, best expressed in Python/Pandas, and crucially, where the <em>number of output rows per input batch might differ</em> from the input batch size (e.g., filtering rows, or mapping one input row to multiple output rows like unpacking files)<a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.mapInPandas.html">1</a><a href="https://www.databricks.com/blog/processing-uncommon-file-formats-scale-mapinpandas-and-delta-live-tables">4</a><a href="https://community.databricks.com/t5/technical-blog/understanding-pandas-udf-applyinpandas-and-mapinpandas/ba-p/75717">5</a><a href="https://learn.microsoft.com/en-us/azure/databricks/pandas/pandas-function-apis">6</a>. Its iterator-based approach aids memory efficiency for large datasets<a href="https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.mapInPandas.html">1</a><a href="https://learn.microsoft.com/en-us/azure/databricks/pandas/pandas-function-apis">6</a>.</p></li></ul><h2><strong>Arrow Batch Size Considerations</strong></h2><p>The batch size for Arrow transfers is controlled by <code>spark.sql.execution.arrow.maxRecordsPerBatch</code> (default is typically 10,000 records). This setting can be tuned for performance:</p><ul><li><p><strong>Smaller batches</strong>: Reduce memory pressure but increase overhead</p></li><li><p><strong>Larger batches</strong>: Better performance but higher memory requirements</p></li></ul><p>For scalar operations, one partition consists of multiple Arrow batches. 
For grouping operations, each group is processed as one Arrow batch, which can lead to out-of-memory issues if a group is too large<a href="https://best-practice-and-impact.github.io/ons-spark/ancillary-topics/pandas-udfs.html">6</a>.</p><h3>&#9888;&#65039; Gotchas / Limitations</h3><ul><li><p><strong>Memory pressure:</strong> Each Arrow batch (or entire group in grouped UDFs) is copied into the Python worker&#8217;s RAM; large groups or oversized batches can OOM an executor. For grouped operations, all data for a group must fit in memory<a href="https://best-practice-and-impact.github.io/ons-spark/ancillary-topics/pandas-udfs.html">6</a>.</p></li><li><p><strong>Arrow caps:</strong></p><ul><li><p>Max rows per record batch = 2,147,483,647.</p></li><li><p>Some complex / deeply nested types (<code>ArrayType(TimestampType)</code>, very nested <code>Map</code>/<code>Struct</code>) need recent PyArrow versions, or are unsupported.</p></li></ul></li><li><p><strong>Setup overhead on tiny data:</strong> Python-worker spin-up and Arrow serialisation can outweigh benefits for small inputs; native Spark or even a pickled UDF may be faster. 
That said, if your data is small, honestly you should not worry about the difference; just use Arrow.</p></li></ul><div><hr></div><h3>&#128640; Best Practices &amp; Performance Tips</h3><ul><li><p><strong>Prefer native Spark first &#8594; Pandas UDF second &#8594; pickled Python UDF last.</strong></p></li><li><p><strong>Vectorise:</strong> Use pandas/NumPy column ops, not Python loops.</p></li><li><p><strong>Tune batch size:</strong></p><ul><li><p>Start with the default of 10,000 rows.</p></li><li><p>Reduce it to ease memory pressure or raise it to improve throughput; set it via <code>spark.sql.execution.arrow.maxRecordsPerBatch</code>.</p></li></ul></li><li><p><strong>Provision RAM:</strong> Remember processing is in Python space; size executors for worst-case batch or group.</p></li><li><p><strong>Resource cleanup:</strong> In iterator UDFs, wrap model loads / file handles in <code>try &#8230; finally</code>.</p></li><li><p><strong>Type hints everywhere:</strong> Clearer code, earlier failures, faster Arrow conversion.</p></li><li><p><strong>Timestamps:</strong> Keep data in UTC; rely on pandas time-series APIs for conversions.</p></li></ul><blockquote><p><em>Our newsletter is 100% free and always will be, but without your claps, comments, or shares, search engines may bury this post forever. 
A quick <strong>clap</strong> not only tells us this content resonates but also makes sure you (and everyone else) can find it again when it matters most.</em></p></blockquote><p><strong>Performance Hierarchy:</strong></p><ol><li><p><strong>Native Spark Functions:</strong> Generally the fastest as they operate entirely within the optimized Spark engine<a href="https://www.reddit.com/r/apachespark/comments/15ehcma/is_a_pandas_udf_faster_than_a_python_udf_when_it/">4</a><a href="https://community.databricks.com/t5/data-engineering/sql-udf-vs-python-udf-sql-udf-vs-pandas-udf/td-p/91546">6</a>.</p></li><li><p><strong>Pandas UDFs (Vectorized):</strong> Offer significant speedups over traditional Python UDFs by using Arrow and vectorized processing. Can be 10-100x faster1<a href="https://docs.databricks.com/aws/en/udf/pandas">3</a><a href="https://www.reddit.com/r/apachespark/comments/15ehcma/is_a_pandas_udf_faster_than_a_python_udf_when_it/">4</a><a href="https://community.databricks.com/t5/data-engineering/sql-udf-vs-python-udf-sql-udf-vs-pandas-udf/td-p/91546">6</a>.</p></li><li><p><strong>Arrow-Optimized Python UDFs:</strong> Faster than pickled UDFs due to efficient Arrow transport, but still process row-by-row logically1<a href="https://www.reddit.com/r/apachespark/comments/15ehcma/is_a_pandas_udf_faster_than_a_python_udf_when_it/">4</a><a href="https://www.databricks.com/blog/arrow-optimized-python-udfs-apache-sparktm-35">5</a><a href="https://community.databricks.com/t5/data-engineering/sql-udf-vs-python-udf-sql-udf-vs-pandas-udf/td-p/91546">6</a>.</p></li><li><p><strong>Traditional (Pickled) Python UDFs:</strong> Slowest due to serialization overhead and row-by-row execution1<a href="https://www.reddit.com/r/apachespark/comments/15ehcma/is_a_pandas_udf_faster_than_a_python_udf_when_it/">4</a><a href="https://www.databricks.com/blog/arrow-optimized-python-udfs-apache-sparktm-35">5</a><a 
href="https://community.databricks.com/t5/data-engineering/sql-udf-vs-python-udf-sql-udf-vs-pandas-udf/td-p/91546">6</a>.</p></li></ol><div class="pullquote"><blockquote><p>If your batch job processes fewer than 1 billion rows, stressing over whether to use a Pandas UDF or a traditional UDF is honestly overkill. Too many people debate UDF types without timing anything. Benchmark it properly: first, run your pipeline without the UDF to get a baseline. Then add the UDF and compare. Don&#8217;t speculate&#8212;measure.</p><p><strong>Pro tip</strong>: use this command to time your pipeline<br><code>df.write.format("noop").mode("overwrite").save()</code></p></blockquote></div><h2><strong>Conclusion: The Future of Python Data Processing in Spark &amp; Arrow</strong></h2><p>Pandas PyArrow UDFs represent a breakthrough in solving the long-standing performance challenges of Python processing in Apache Spark. By bridging the gap between Spark's distributed computing capabilities and the rich ecosystem of Python data science tools, these UDFs enable data scientists and engineers to write high-performance Python code that can scale to massive datasets.</p><p>With the continued evolution of Arrow-optimized UDFs in Spark 3.5 and beyond, the performance advantages will only increase. The integration between pandas and Spark continues to deepen, making it easier for Python users to leverage their existing skills in a distributed environment while maintaining high performance.</p><p>As the ecosystem evolves, we can expect further innovations in this space, potentially eliminating the remaining barriers between Python's ease of use and Spark's distributed computing power, ultimately making complex data processing both simple and efficient.</p><blockquote><p><strong>TL;DR</strong> &#8211; Stop pickling rows. Start vectorizing columns. 
And may your Arrow batches be ever&#8209;contiguous.</p></blockquote>]]></content:encoded></item><item><title><![CDATA[How to Actually Delete Data in Spark Streaming (Without Breaking Things ) 💥 ]]></title><description><![CDATA[What Every Data Engineer Needs to Know About GDPR-Ready Pipelines]]></description><link>https://www.databricksters.com/p/how-to-actually-delete-data-in-spark</link><guid isPermaLink="false">https://www.databricksters.com/p/how-to-actually-delete-data-in-spark</guid><dc:creator><![CDATA[Canadian Data Guy]]></dc:creator><pubDate>Tue, 25 Mar 2025 15:01:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!UL6k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca975cc4-a1f5-4a24-9f3a-6cbac42ca128_3214x2410.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Modern data pipelines are increasingly adopting streaming paradigms, but handling deletes in streaming pipelines is far from trivial. In this blog, we&#8217;ll explore how to handle deletes effectively using Delta Lake and Spark Structured Streaming, with a focus on real-world use cases like GDPR.</p><div><hr></div><h2>Why Deletes Are Hard in Streaming</h2><p>Streaming pipelines are traditionally append-only. Systems like Kafka or files as a source don&#8217;t natively support deletes. But in real-world applications like GDPR, customers expect their data to be deleted across all layers: Bronze, Silver, and Gold.</p><h4>Delta Lake Capabilities for DELETE</h4><p>Delta Lake supports <code>DELETE</code>, <code>UPDATE</code>, and <code>MERGE</code> on batch and streaming tables. 
This unlocks powerful patterns for Change Data Capture (CDC), Slowly Changing Dimensions (SCD), and data correction.</p><ul><li><p>Supports ACID transactions</p></li><li><p>Deletes propagate only if you plan for it</p></li></ul><p>Said another way, Delta is more flexible than Kafka and Kinesis as a streaming solution.</p><h2>Streaming Table Design for Delete Propagation</h2><p>For append-only sinks, here&#8217;s a common pattern:</p><h3>Bronze (Append-Only Example)</h3><pre><code><code>CREATE STREAMING TABLE bronze ( ... )
TBLPROPERTIES (
  delta.appendOnly = true
);</code></code></pre><blockquote><p><strong>Note</strong>: If <code>delta.appendOnly</code> is set to <code>true</code>, <strong>no</strong> DELETE or UPDATE operations are allowed on that table. This setting locks the table to <strong>only</strong> support appends. So either set this property, or make sure downstream streaming readers either ignore DML commits (<code>skipChangeCommits</code>) or process them (<code>readChangeFeed</code>). Anything else can cause failures in production.</p></blockquote><h3>Bronze (with Deletes/Updates/Merge)</h3><pre><code><code>ignore_dml_df = spark.readStream.format("delta") \
    .option("skipChangeCommits", "true") \
    .table("bronze.table_with_deletes")


change_feed_df = spark.readStream.format("delta") \
  .option("readChangeFeed", "true") \
  .table("bronze.table_with_deletes")

updates = change_feed_df.filter("_change_type = 'update_postimage'")
deletes = change_feed_df.filter("_change_type = 'delete'")</code></code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!UL6k!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca975cc4-a1f5-4a24-9f3a-6cbac42ca128_3214x2410.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!UL6k!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca975cc4-a1f5-4a24-9f3a-6cbac42ca128_3214x2410.png 424w, https://substackcdn.com/image/fetch/$s_!UL6k!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca975cc4-a1f5-4a24-9f3a-6cbac42ca128_3214x2410.png 848w, https://substackcdn.com/image/fetch/$s_!UL6k!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca975cc4-a1f5-4a24-9f3a-6cbac42ca128_3214x2410.png 1272w, https://substackcdn.com/image/fetch/$s_!UL6k!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca975cc4-a1f5-4a24-9f3a-6cbac42ca128_3214x2410.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!UL6k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca975cc4-a1f5-4a24-9f3a-6cbac42ca128_3214x2410.png" width="728" height="546" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ca975cc4-a1f5-4a24-9f3a-6cbac42ca128_3214x2410.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1092,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:13734521,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/159504639?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca975cc4-a1f5-4a24-9f3a-6cbac42ca128_3214x2410.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!UL6k!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca975cc4-a1f5-4a24-9f3a-6cbac42ca128_3214x2410.png 424w, https://substackcdn.com/image/fetch/$s_!UL6k!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca975cc4-a1f5-4a24-9f3a-6cbac42ca128_3214x2410.png 848w, https://substackcdn.com/image/fetch/$s_!UL6k!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca975cc4-a1f5-4a24-9f3a-6cbac42ca128_3214x2410.png 1272w, https://substackcdn.com/image/fetch/$s_!UL6k!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fca975cc4-a1f5-4a24-9f3a-6cbac42ca128_3214x2410.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>Understanding Deletion Vectors</h2><p>Deletion vectors store metadata about logically deleted rows without immediately rewriting data files. They reduce the overhead associated with frequent DELETE operations, significantly improving DELETE efficiency.</p><ul><li><p>Reduces file rewrites</p></li><li><p>Enhances performance of DELETE operations</p></li><li><p>Automatically used by Delta Lake to keep track of logically removed rows</p></li></ul><p>When you want to physically remove these rows from the data files, use <code>REORG TABLE ... APPLY (PURGE)</code>.</p><h2>Importance of VACUUM and REORG</h2><h4>VACUUM</h4><p>After running DELETE operations, Delta tables contain logically deleted data but physically retain this data until explicitly vacuumed. 
Without running <code>VACUUM</code>, storage costs continually grow.</p><ul><li><p>The default retention threshold for VACUUM is 7 days: when VACUUM runs, it physically removes data files that the table no longer references and that are older than the threshold.</p></li><li><p>Regularly running <code>VACUUM</code> helps meet GDPR requirements by physically purging deleted data within a defined timeframe.</p></li></ul><h4>REORG TABLE ... APPLY (PURGE)</h4><p><code>REORG</code> is a command for rewriting data files in a Delta table that contain rows marked by deletion vectors. With <code>APPLY (PURGE)</code>, logically deleted rows are physically removed:</p><pre><code><code>REORG TABLE &lt;table_name&gt; APPLY (PURGE);</code></code></pre><ul><li><p><strong>Deletion vectors</strong> indicate rows that have been logically removed. <code>REORG</code> with <code>APPLY (PURGE)</code> re-creates data files without those rows.</p></li><li><p><strong>This process permanently applies deletions, removing the need to store deletion vectors.</strong></p></li><li><p>Essential for GDPR compliance when you need to ensure no residual data in physical storage.</p></li></ul><p><strong>REORG</strong> rewrites the files, but the old files (which still contain the deleted rows) stay in storage until <strong>VACUUM</strong> removes them. Running both periodically keeps your storage usage optimized and meets strict data compliance regulations like GDPR.</p><h2>&#128588; Keep This Post Discoverable: Your Support Matters!</h2><p>We&#8217;re a small, independent team creating in-depth technical content like this to help the data engineering community. We don&#8217;t rely on sponsorships or ads, just passion and practical experience.</p><p>If you found this blog valuable, please take a moment to <strong>clap</strong>, <strong>comment</strong>, or <strong>share</strong> it. Your engagement helps search engines like Google surface this content for others and ensures you can find it again when needed.</p><p>Without interaction, even the most helpful posts can disappear into the internet void. 
Let&#8217;s keep high-quality content alive and accessible. &#128591;</p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.databricksters.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.databricksters.com/subscribe?"><span>Subscribe now</span></a></p><p><strong>Now back to the blog</strong></p><h2>Choosing Between <code>skipChangeCommits</code> and <code>readChangeFeed</code></h2><h3>When to Use <code>skipChangeCommits</code></h3><ul><li><p>If you do not care about deletes, updates, or merges and only need to handle appends, <code>skipChangeCommits</code> meets your needs.</p></li></ul><pre><code><code>only_appends_df = spark.readStream.format("delta") \
    .option("skipChangeCommits", "true") \
    .table("bronze.table_with_deletes")
</code></code></pre><h3>When to Use <code>readChangeFeed</code></h3><ul><li><p>You need to explicitly handle all changes including DELETE and UPDATE operations.</p></li><li><p>You want detailed tracking of changes (e.g., <code>_change_type</code>: <code>insert</code>, <code>update_preimage</code>, <code>update_postimage</code>, <code>delete</code>).</p></li></ul><pre><code><code>change_feed_df = spark.readStream.format("delta") \
  .option("readChangeFeed", "true") \
  .table("bronze.table_with_deletes")

updates = change_feed_df.filter("_change_type = 'update_postimage'")
deletes = change_feed_df.filter("_change_type = 'delete'")</code></code></pre><h2>Use Case: GDPR Delete Propagation</h2><p>GDPR compliance demands timely deletion:</p><pre><code><code>DELETE FROM bronze WHERE email = 'user@domain.com';
DELETE FROM silver WHERE email = 'user@domain.com';</code></code></pre><p>After deletes:</p><ol><li><p>Execute <strong>REORG TABLE &lt;table_name&gt; APPLY (PURGE)</strong> to physically purge rows marked in deletion vectors.</p></li><li><p>Run periodic <strong>VACUUM</strong> to remove data files no longer referenced.</p></li></ol><h2>Row-Level Concurrency</h2><p>Databricks provides row-level concurrency, ensuring multiple writes (updates, deletes, merges) can be performed without corrupting the data. Delta Lake uses optimistic concurrency control: rather than locking entire partitions or tables, it detects conflicts between concurrent writes at the file or row level when they commit.</p><h4>How it Works</h4><ul><li><p><strong>Write Conflicts</strong>: If multiple transactions try to update the same rows simultaneously, Delta Lake checks the conflict at the row level.</p></li><li><p><strong>Partition/File Boundaries</strong>: For partitioned tables, concurrency can also be handled by partition-level transactions if updates touch different partitions.</p></li><li><p><strong>Isolation Levels</strong>: Reads always see a consistent snapshot of the table. For writes, Delta tables on Databricks use WriteSerializable isolation by default, and you can opt into the stricter Serializable level per table.</p></li></ul><p><strong>Why It Matters</strong></p><ul><li><p>Prevents data corruption.</p></li><li><p>Ensures high availability: you don&#8217;t need to pause all jobs for one update.</p></li><li><p>Complex streaming and batch patterns can coexist safely.</p></li></ul><h2>Summary and Recommendations</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!v5pb!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13dfe440-4807-4543-9e66-d4330a336f3d_1350x502.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!v5pb!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13dfe440-4807-4543-9e66-d4330a336f3d_1350x502.png 424w, https://substackcdn.com/image/fetch/$s_!v5pb!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13dfe440-4807-4543-9e66-d4330a336f3d_1350x502.png 848w, https://substackcdn.com/image/fetch/$s_!v5pb!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13dfe440-4807-4543-9e66-d4330a336f3d_1350x502.png 1272w, https://substackcdn.com/image/fetch/$s_!v5pb!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13dfe440-4807-4543-9e66-d4330a336f3d_1350x502.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!v5pb!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13dfe440-4807-4543-9e66-d4330a336f3d_1350x502.png" width="1200" height="446.22222222222223" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13dfe440-4807-4543-9e66-d4330a336f3d_1350x502.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:502,&quot;width&quot;:1350,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:96106,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/159504639?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13dfe440-4807-4543-9e66-d4330a336f3d_1350x502.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!v5pb!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13dfe440-4807-4543-9e66-d4330a336f3d_1350x502.png 424w, https://substackcdn.com/image/fetch/$s_!v5pb!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13dfe440-4807-4543-9e66-d4330a336f3d_1350x502.png 848w, https://substackcdn.com/image/fetch/$s_!v5pb!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13dfe440-4807-4543-9e66-d4330a336f3d_1350x502.png 1272w, https://substackcdn.com/image/fetch/$s_!v5pb!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13dfe440-4807-4543-9e66-d4330a336f3d_1350x502.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>&#9989; Best Practice:</h3><ul><li><p>Set <code>delta.appendOnly = true</code> on source tables where you want to block updates and deletes entirely.</p></li><li><p>Perform deletes explicitly at Bronze and Silver layers.</p></li><li><p>Periodically execute <strong>REORG TABLE ... APPLY (PURGE)</strong> and <strong>VACUUM</strong>.</p></li><li><p>Decide between <code>skipChangeCommits</code> and <code>readChangeFeed</code> based on downstream needs.</p></li></ul><h2>FAQ</h2><ol><li><p><strong>If I enable predictive optimization, do I need to run </strong><code>REORG</code><strong> / </strong><code>PURGE</code><strong> myself?<br></strong>Yes. Enabling predictive optimization does <strong>not</strong> automatically execute the physical reorganization and purging of deleted data. 
You still must explicitly run <code>REORG TABLE &#8230; APPLY (PURGE)</code>.</p></li><li><p><strong>Do we need to pause batch and streaming jobs during the delete window?</strong></p><p>If you are only doing streaming appends alongside a single DML operation (like MERGE, UPDATE, or DELETE), conflicts are minimal. However, concurrent DML operations can conflict with each other; conflicts are detected at the partition or file level, or at the row level where row-level concurrency applies.</p></li><li><p><strong>How are concurrent updates handled in Spark?</strong></p><p>Databricks employs row-level concurrency controls. Conflicts from concurrent updates are handled at the file or partition level to ensure transaction integrity.</p></li><li><p><strong>If we have two streaming tables, one upstream and one downstream, and </strong><code>skipChangeCommits</code><strong> enabled for both, can we safely run an UPDATE on both?</strong></p><p>Yes. Since they're independent streams, updates can safely occur as long as you use <code>skipChangeCommits</code> or properly handle deletes using Change Data Feed.</p></li><li><p><strong>How often should I run DELETE, REORG, and VACUUM?</strong></p><p>Running them once a week is a more cost-effective way to meet compliance needs than running them daily. Order matters: run DELETE first, then REORG, then VACUUM.</p></li><li><p><strong>What happens if I set the VACUUM retention period to a large number?</strong></p><p>You will end up paying to store logically deleted files, which means a bigger S3 or ADLS bill than you could have had.</p></li><li><p><strong>What should your retention policy be?</strong></p><p>The default VACUUM retention period is 7 days, but you can choose a longer duration based on your data recovery needs. For instance, if someone accidentally deletes data and you&#8217;ve already vacuumed beyond the 7-day threshold, recovery via time travel becomes impossible.</p><p>Consider real-world team dynamics: teams in North America often take two-week vacations. 
In such cases, setting a retention policy of 21 days might be more appropriate to allow for delayed incident detection and recovery.</p><p></p><p>Remember: once a file is vacuumed, you can no longer time travel to a point before its deletion. A longer retention period buys your team more time to investigate and restore if necessary.</p></li></ol><div><hr></div><h2>References</h2><p><a href="https://www.databricks.com/blog/handling-right-be-forgotten-gdpr-and-ccpa-using-delta-live-tables-dlt">https://www.databricks.com/blog/handling-right-be-forgotten-gdpr-and-ccpa-using-delta-live-tables-dlt</a></p><ul><li><p><a href="https://docs.databricks.com/aws/en/sql/language-manual/delta-vacuum">Delta Lake VACUUM</a></p></li><li><p><a href="https://docs.databricks.com/aws/en/sql/language-manual/delta-reorg-table">Delta Lake REORG</a></p></li><li><p><a href="https://docs.databricks.com/aws/en/delta/deletion-vectors">Deletion Vectors</a></p></li><li><p><a href="https://docs.databricks.com/aws/en/optimizations/isolation-level">Row-Level Concurrency</a></p></li><li><p><a href="https://docs.databricks.com/aws/en/structured-streaming/delta-lake#ignore-updates-and-deletes">Skip Change Commits</a></p></li><li><p><a href="https://docs.databricks.com/gcp/en/delta/delta-change-data-feed">Change Data Feed</a></p></li></ul><div id="youtube2-qIz0QM-L_p0" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;qIz0QM-L_p0&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/qIz0QM-L_p0?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div>]]></content:encoded></item><item><title><![CDATA[Concurrent Execution and Query Throughput in 
Databricks SQL]]></title><description><![CDATA[Introduction]]></description><link>https://www.databricksters.com/p/concurrent-execution-and-query-throughput</link><guid isPermaLink="false">https://www.databricksters.com/p/concurrent-execution-and-query-throughput</guid><dc:creator><![CDATA[Artem Chebotko]]></dc:creator><pubDate>Tue, 11 Mar 2025 15:02:12 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!S5rs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1178f606-8f5b-45cd-bde0-6e1a8dd796cc_4022x1025.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S5rs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1178f606-8f5b-45cd-bde0-6e1a8dd796cc_4022x1025.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S5rs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1178f606-8f5b-45cd-bde0-6e1a8dd796cc_4022x1025.png 424w, https://substackcdn.com/image/fetch/$s_!S5rs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1178f606-8f5b-45cd-bde0-6e1a8dd796cc_4022x1025.png 848w, https://substackcdn.com/image/fetch/$s_!S5rs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1178f606-8f5b-45cd-bde0-6e1a8dd796cc_4022x1025.png 1272w, https://substackcdn.com/image/fetch/$s_!S5rs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1178f606-8f5b-45cd-bde0-6e1a8dd796cc_4022x1025.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S5rs!,w_2400,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1178f606-8f5b-45cd-bde0-6e1a8dd796cc_4022x1025.png" width="1200" height="305.7692307692308" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1178f606-8f5b-45cd-bde0-6e1a8dd796cc_4022x1025.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;large&quot;,&quot;height&quot;:371,&quot;width&quot;:1456,&quot;resizeWidth&quot;:1200,&quot;bytes&quot;:6078988,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/158816531?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1178f606-8f5b-45cd-bde0-6e1a8dd796cc_4022x1025.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-large" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S5rs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1178f606-8f5b-45cd-bde0-6e1a8dd796cc_4022x1025.png 424w, https://substackcdn.com/image/fetch/$s_!S5rs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1178f606-8f5b-45cd-bde0-6e1a8dd796cc_4022x1025.png 848w, https://substackcdn.com/image/fetch/$s_!S5rs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1178f606-8f5b-45cd-bde0-6e1a8dd796cc_4022x1025.png 1272w, 
https://substackcdn.com/image/fetch/$s_!S5rs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1178f606-8f5b-45cd-bde0-6e1a8dd796cc_4022x1025.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h1>Introduction</h1><p>While discussing concurrent query execution in Databricks SQL (DBSQL) with customers, I noticed a common misconception that I&#8217;d like to demystify in this blog. Databricks recommends a cluster for every 10 concurrent queries. 
This often sounds too low for customers with hundreds of analysts running BI dashboards containing dozens of queries each. With only 10 concurrent queries per cluster, <strong>can DBSQL truly support high-concurrency workloads?</strong> The answer is yes, and here&#8217;s how.</p><h1>Definitions</h1><p>Let&#8217;s start by defining two important concepts:</p><ul><li><p><strong>Concurrent Query Execution</strong> refers to the number of queries that can run simultaneously on a SQL warehouse. It is constrained by compute resources, workload complexity, and system-enforced concurrency limits.</p></li><li><p><strong>Query Throughput</strong> measures the total number of queries processed over a given period (e.g., queries per hour, per minute, or per second). It depends on query execution speed, warehouse size, and workload optimization.</p></li></ul><p>For example, if an isolated BI dashboard with 20 queries takes one minute to refresh, the throughput is 20 queries per minute (QPM). When 20 queries are submitted simultaneously, only about 10 of them will start executing immediately, while the other 10 may wait in a queue. 
In this case, concurrent query execution is 10, but it can be increased dynamically through autoscaling.</p><h1>Concurrency Limits</h1><p><strong>Databricks recommends a cluster for every 10 concurrent queries. The maximum number of queries in a queue for all SQL warehouse types is 1,000</strong> [<a href="https://docs.databricks.com/en/compute/sql-warehouse/warehouse-behavior.html#queueing-and-autoscaling-for-pro-and-classic-sql-warehouses">1</a>].</p><p>At first glance, 10 concurrent queries per cluster may seem low, especially when compared to Online Transaction Processing (OLTP) workloads, where processing many thousands of queries per second (QPS) is common. However, OLTP queries are relatively simple: they retrieve a "needle in a haystack" using optimized indexes. In contrast, Online Analytical Processing (OLAP) queries scan, process, and aggregate large datasets, requiring more compute resources.</p><p>Concurrency throttling is not unique to Databricks SQL; Snowflake, Redshift, BigQuery, and other cloud data warehouses enforce similar concurrency limits to prevent system overload. Both OLTP and OLAP databases implement query queuing to manage workloads efficiently.</p><h1>Scaling Concurrency</h1><p><strong>DBSQL dynamically provisions compute resources (clusters) to increase concurrency. When new capacity is added, queued queries are automatically routed to the new compute resources</strong> [<a href="https://docs.databricks.com/en/compute/sql-warehouse/warehouse-behavior.html#queueing-and-autoscaling-for-pro-and-classic-sql-warehouses">1</a>].</p><p>The minimum and maximum cluster limits for autoscaling are configurable. For example, for workloads requiring 20 to 50 concurrent queries, setting Min = 2 and Max = 5 helps minimize queuing delays: at roughly 10 concurrent queries per cluster, 2 clusters cover the 20-query baseline and 5 clusters cover the 50-query peak. Additionally, increasing the SQL warehouse size often improves query execution speed, leading to higher throughput.</p><p>The figures below illustrate the impact of SQL warehouse scaling on BI dashboard execution in DBSQL. 
In the first figure, with autoscaling settings Min = 1 and Max = 2, execution starts on one cluster, and after some queries get queued, DBSQL quickly adds another cluster. This setup demonstrates how DBSQL dynamically provisions additional compute resources to handle increased concurrency. In the second figure, the warehouse with a fixed cluster count Min = 2 and Max = 2 demonstrates that execution takes place on two clusters from the start, which avoids query queueing completely.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AZXa!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F590fb58d-7705-4d3b-88ce-c0d351eba794_1600x415.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AZXa!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F590fb58d-7705-4d3b-88ce-c0d351eba794_1600x415.png 424w, https://substackcdn.com/image/fetch/$s_!AZXa!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F590fb58d-7705-4d3b-88ce-c0d351eba794_1600x415.png 848w, https://substackcdn.com/image/fetch/$s_!AZXa!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F590fb58d-7705-4d3b-88ce-c0d351eba794_1600x415.png 1272w, https://substackcdn.com/image/fetch/$s_!AZXa!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F590fb58d-7705-4d3b-88ce-c0d351eba794_1600x415.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!AZXa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F590fb58d-7705-4d3b-88ce-c0d351eba794_1600x415.png" width="724" height="187.96153846153845" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/590fb58d-7705-4d3b-88ce-c0d351eba794_1600x415.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:378,&quot;width&quot;:1456,&quot;resizeWidth&quot;:724,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AZXa!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F590fb58d-7705-4d3b-88ce-c0d351eba794_1600x415.png 424w, https://substackcdn.com/image/fetch/$s_!AZXa!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F590fb58d-7705-4d3b-88ce-c0d351eba794_1600x415.png 848w, https://substackcdn.com/image/fetch/$s_!AZXa!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F590fb58d-7705-4d3b-88ce-c0d351eba794_1600x415.png 1272w, https://substackcdn.com/image/fetch/$s_!AZXa!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F590fb58d-7705-4d3b-88ce-c0d351eba794_1600x415.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p><strong>Figure 1: BI Dashboard Execution on DBSQL Large Min = 1 Max = 2</strong></p><div 
class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fkWU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c4c000-a685-4ee0-8765-138114079fe6_1600x427.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fkWU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c4c000-a685-4ee0-8765-138114079fe6_1600x427.png 424w, https://substackcdn.com/image/fetch/$s_!fkWU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c4c000-a685-4ee0-8765-138114079fe6_1600x427.png 848w, https://substackcdn.com/image/fetch/$s_!fkWU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c4c000-a685-4ee0-8765-138114079fe6_1600x427.png 1272w, https://substackcdn.com/image/fetch/$s_!fkWU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c4c000-a685-4ee0-8765-138114079fe6_1600x427.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fkWU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c4c000-a685-4ee0-8765-138114079fe6_1600x427.png" width="1456" height="389" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f7c4c000-a685-4ee0-8765-138114079fe6_1600x427.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:389,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fkWU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c4c000-a685-4ee0-8765-138114079fe6_1600x427.png 424w, https://substackcdn.com/image/fetch/$s_!fkWU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c4c000-a685-4ee0-8765-138114079fe6_1600x427.png 848w, https://substackcdn.com/image/fetch/$s_!fkWU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c4c000-a685-4ee0-8765-138114079fe6_1600x427.png 1272w, https://substackcdn.com/image/fetch/$s_!fkWU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff7c4c000-a685-4ee0-8765-138114079fe6_1600x427.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Figure 2: BI Dashboard Execution on DBSQL Large Min = 2 Max = 2</strong></p><p>It is worth noting that, for Pro and Classic SQL warehouses, scaling is not instant, as it takes time to spin up new clusters. For Serverless SQL warehouses, scaling is near-instantaneous because resources are allocated from a pre-provisioned pool, eliminating startup delays. Serverless also has the advantage of Intelligent Workload Management (IWM), which predicts resource demands and adjusts capacity accordingly. Over time, IWM learns to make smarter decisions about scaling, ensuring efficient workload distribution and minimized latency.</p><h1>Query Throughput and Performance Benchmarks</h1><p>Here are two fully audited benchmarks that have been available in the public domain since 2021. 
As of today, internal performance benchmarks indicate even higher throughput, though specific numbers cannot be disclosed.</p><h2>Benchmark 1: Small-Scale Workloads</h2><p><strong>In 2021, DBSQL processed 14,777 TPC-DS queries per hour (QPH) on a 10GB dataset using a DBSQL Large warehouse without autoscaling</strong> [<a href="https://www.databricks.com/blog/2021/09/08/new-performance-improvements-in-databricks-sql.html">2</a>].</p><p>To put this into perspective, if each BI dashboard refresh consists of 20 queries, then 14,777 / 20 &#8776; 738 dashboards can be refreshed per hour. In real-world scenarios, many of these dashboards and queries are likely to be served from the results cache, further increasing throughput and reducing compute resource consumption.</p><p>This demonstrates DBSQL&#8217;s efficiency for high-concurrency analytical workloads over small datasets, which are common in multi-tenant environments.</p><h2>Benchmark 2: Large-Scale Workloads</h2><p><strong>In 2021, Databricks set an official world record by reaching 32,941,245 QphDS (queries per hour, as measured by the TPC-DS benchmark) on a 100TB dataset using a DBSQL 4X-Large warehouse without autoscaling</strong> [<a href="https://www.databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html">3</a>].</p><p>Following the same example, if each dashboard refresh consists of 20 queries, then 32,941,245 / 20 &#8776; 1,647,062 dashboards can be refreshed per hour.</p><p>The QphDS metric, used in the TPC-DS benchmark, represents the performance of mixed workloads. This essentially translates to nearly 33 million queries per hour, making DBSQL one of the most efficient solutions for large-scale analytical workloads.</p><h1>Conclusion</h1><p>By understanding and leveraging DBSQL&#8217;s concurrency management and scaling capabilities, organizations can efficiently support high-concurrency, high-throughput workloads. 
Properly sizing SQL warehouses, configuring autoscaling, and optimizing query execution strategies ensure timely and reliable analytical query processing, even for large-scale workloads. Whether handling dozens or millions of queries per hour, DBSQL provides one of the most efficient and scalable solutions available today.</p><h1>References</h1><p>[1] <a href="https://docs.databricks.com/en/compute/sql-warehouse/warehouse-behavior.html#queueing-and-autoscaling-for-pro-and-classic-sql-warehouses">SQL Warehouse Sizing, Scaling, and Queuing Behavior</a></p><p>[2] <a href="https://www.databricks.com/blog/2021/09/08/new-performance-improvements-in-databricks-sql.html">New performance improvements in Databricks SQL</a></p><p>[3] <a href="https://www.databricks.com/blog/2021/11/02/databricks-sets-official-data-warehousing-performance-record.html">Databricks Sets Official Data Warehousing Performance Record</a></p>]]></content:encoded></item></channel></rss>