<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Databricksters: AI & ML]]></title><description><![CDATA[Simplifying Gen AI on Databricks]]></description><link>https://www.databricksters.com/s/gen-ai</link><image><url>https://substackcdn.com/image/fetch/$s_!zPJJ!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff49ecae-7c56-403c-9389-61b28de6a50f_1280x1280.png</url><title>Databricksters: AI &amp; ML</title><link>https://www.databricksters.com/s/gen-ai</link></image><generator>Substack</generator><lastBuildDate>Mon, 04 May 2026 18:59:22 GMT</lastBuildDate><atom:link href="https://www.databricksters.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Soni]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[databricksters@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[databricksters@substack.com]]></itunes:email><itunes:name><![CDATA[Canadian Data Guy]]></itunes:name></itunes:owner><itunes:author><![CDATA[Canadian Data Guy]]></itunes:author><googleplay:owner><![CDATA[databricksters@substack.com]]></googleplay:owner><googleplay:email><![CDATA[databricksters@substack.com]]></googleplay:email><googleplay:author><![CDATA[Canadian Data Guy]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Cutting Token Costs Reaches the Renaissance]]></title><description><![CDATA[A Lakebase Powered Solution for Enforcing Token Budgets, Now with Fewer Sharp Edges]]></description><link>https://www.databricksters.com/p/cutting-token-costs-reaches-the-renaissance</link><guid 
isPermaLink="false">https://www.databricksters.com/p/cutting-token-costs-reaches-the-renaissance</guid><dc:creator><![CDATA[Austin]]></dc:creator><pubDate>Tue, 17 Mar 2026 14:02:05 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!7sNy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13774bb4-e2c9-4cf2-a151-17fe03cd9b73_5786x6090.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Back in October I published a blog called <a href="https://www.databricksters.com/p/getting-medieval-on-token-costs">Getting Medieval on Token Costs</a>. The code and strategy I provided worked, but as the title implied it was rough around the edges. How rough? Well&#8230;</p><ul><li><p>The API calls to the FMs were synchronous, so QPS would have been Medieval indeed</p></li><li><p>The Lakebase instance was provisioned, so it would always be accruing costs even without usage</p></li><li><p>There was no UI, so you or your admin would be spending hours fiddling with thousand line SQL queries for enterprise use cases</p></li></ul><p>But no matter, we&#8217;ve had a Renaissance!</p><p><a href="https://github.com/azaccor/token-rate-limiter">The repo</a> got three meaningful updates and a handful of smaller ones that collectively move this from being technically functional to something a medium enterprise team might actually want to use on Databricks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7sNy!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13774bb4-e2c9-4cf2-a151-17fe03cd9b73_5786x6090.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!7sNy!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13774bb4-e2c9-4cf2-a151-17fe03cd9b73_5786x6090.jpeg 424w, https://substackcdn.com/image/fetch/$s_!7sNy!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13774bb4-e2c9-4cf2-a151-17fe03cd9b73_5786x6090.jpeg 848w, https://substackcdn.com/image/fetch/$s_!7sNy!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13774bb4-e2c9-4cf2-a151-17fe03cd9b73_5786x6090.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!7sNy!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13774bb4-e2c9-4cf2-a151-17fe03cd9b73_5786x6090.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7sNy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13774bb4-e2c9-4cf2-a151-17fe03cd9b73_5786x6090.jpeg" width="1456" height="1532" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/13774bb4-e2c9-4cf2-a151-17fe03cd9b73_5786x6090.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1532,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:11842916,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/190565336?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13774bb4-e2c9-4cf2-a151-17fe03cd9b73_5786x6090.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7sNy!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13774bb4-e2c9-4cf2-a151-17fe03cd9b73_5786x6090.jpeg 424w, https://substackcdn.com/image/fetch/$s_!7sNy!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13774bb4-e2c9-4cf2-a151-17fe03cd9b73_5786x6090.jpeg 848w, https://substackcdn.com/image/fetch/$s_!7sNy!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13774bb4-e2c9-4cf2-a151-17fe03cd9b73_5786x6090.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!7sNy!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F13774bb4-e2c9-4cf2-a151-17fe03cd9b73_5786x6090.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg 
role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The Money Changer and His Wife - Quentin Matsys, 1514 oil-on-panel</figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.databricksters.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.databricksters.com/subscribe?"><span>Subscribe now</span></a></p><h3><strong>Quick</strong> <strong>Refresher</strong></h3><p>The original solution uses a custom MLflow model serving endpoint as a proxy between your users and whatever foundation model they&#8217;re calling. 
Before the request hits the FM, the endpoint checks two Lakebase tables: one for the user&#8217;s token limit and another for how many tokens they&#8217;ve already burned through. If they&#8217;re over budget, the request ends up like John the Baptist in the cover image. If not, it goes through to the FM, and the usage is written back to Lakebase along with the response and remaining balance.</p><h3><strong>Change</strong> <strong>1:</strong> <strong>Autoscaling</strong> <strong>Lakebase</strong></h3><p>The original code used a Provisioned Lakebase instance because that was the only option at the time, but provisioned instances are on their way out, and we have something better: Autoscaling Lakebase. </p><p>Swapping to Autoscaling Lakebase means you can set minimum and maximum scaling bands, and if you don&#8217;t need a high-availability instance, it will also let you scale to zero during periods of no use. </p><p>This is the smallest change architecturally, but it&#8217;s nice not to pay for compute we don&#8217;t need.</p><h3><strong>Change</strong> <strong>2:</strong> <strong>ResponsesAgent</strong> <strong>+</strong> <strong>Async</strong> <strong>FM</strong> <strong>Calls</strong></h3><p>The original code used <code>mlflow.pyfunc.PythonModel</code> and called the FM endpoint via <code>requests.post()</code>, which is synchronous and blocking: only one request can be handled at a time per unit of concurrency. That meant the endpoint that was supposed to help you manage costs via budgeting was throttling your throughput instead. While I suppose that is one way to reduce token costs, it&#8217;s not very useful.</p><p>The new version replaces the PythonModel with a ResponsesAgent and swaps <code>requests</code> for <code>httpx.AsyncClient</code> inside an <code>async def predict_stream()</code>. 
Now multiple FM calls can be in flight simultaneously, and the serving endpoint isn&#8217;t waiting on one user&#8217;s 20-second Claude response before it can look at the next request in the queue.</p><p>The core logic now lives in a standalone <code>rate_limiter_agent.py</code> module, decoupled from the notebook. The public API is much cleaner:</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;2ec01639-67a8-440b-bc77-590a75edefd6&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">  # Lakebase connection details, the proxied endpoint, and group membership are set once
  agent = TokenRateLimiterAgent(
      db_config={...},
      workspace_client=WorkspaceClient(),
      endpoint_name="ep-your-endpoint",
      group_members={
          "data-science-team": ["andrea@company.com", "john@company.com"],
      },
  )

  # Before calling the FM:
  quota = agent.check_quota("andrea@company.com", "databricks-claude-sonnet-4-5")
  if not quota["allowed"]:
      # Return 429 or block the request
      ...

  # After the FM call completes:
  agent.log_usage(
      user_name="andrea@company.com",
      model_name="databricks-claude-sonnet-4-5",
      prompt_tokens=1200,
      completion_tokens=350,
      request_id="req-abc123",
  )
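
# For intuition only: the check_quota() call above boils down to comparing a
# per-identity limit "table" with a usage "table". The real solution keeps
# both in Lakebase; plain dicts stand in for them in this sketch.
limits = {"andrea@company.com": 100_000}  # token budget per user
used = {"andrea@company.com": 98_500}     # tokens consumed so far this window

def check_quota_local(user_name: str) -> dict:
    remaining = limits.get(user_name, 0) - used.get(user_name, 0)
    return {"allowed": remaining > 0, "remaining": remaining}

check_quota_local("andrea@company.com")  # {"allowed": True, "remaining": 1500}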
</code></pre></div><p>You can drop this into any existing pipeline without touching the notebook, which certainly helps if you&#8217;re integrating this into something that already has its own serving infrastructure.</p><h3><strong>Change</strong> <strong>3:</strong> <strong>An</strong> <strong>Actual</strong> <strong>Frontend</strong></h3><p>The original had no management UI, which meant every change to your budgeting policy required hand-written SQL. Not very convenient. </p><p>The new repo ships a full Databricks App: a React + FastAPI application that deploys alongside your serving endpoint and gives administrators a no-code interface for setting granular budgets. How granular, you ask? Any combination of:</p><ul><li><p>A user, service principal, or group</p></li><li><p>Calling any single FM, a list of FMs, or all FMs in the workspace</p></li><li><p>That resets every X hours, days, weeks, months, or never</p></li><li><p>Limited to a specified count of tokens or dollars</p><ul><li><p>Token costs are pre-populated from Databricks documentation, but manually editable in case pricing changes or you have some kind of secret discount I don&#8217;t know about</p></li><li><p>This is another nice quality-of-life feature since not all tokens are created equal; GPT OSS 20B tokens cost roughly 1/100th as much as GPT 5.4 tokens</p></li></ul></li></ul><p>The drop-downs auto-populate with users, SPs, and groups, as well as the Databricks Foundation Models.</p><p>It also comes with a handy monitoring dashboard so you can see usage over time, your top consumers, and the most popular models.</p><p>The App authenticates to Lakebase via a native Postgres role with a static password stored in Databricks Secrets, so there&#8217;s no OAuth token refresh to manage.</p><h3><strong>An</strong> <strong>Honest</strong> <strong>Conclusion</strong></h3><p>Is this production-grade for an org running thousands of concurrent end users? Maybe not. 
You might consider mini-batching requests, but some cost-tracking overhead will remain, and it only gets harder to manage at scale.</p><p>Is this production-grade for most actual enterprise teams who want to stop their power users from accidentally burning through their monthly token budget in a week? Yes. I think this solution really shines when you have dozens to hundreds of daily active users who might get greedy on Opus requests without some budget enforcement. </p><p>But don&#8217;t take my word for it; check it out for yourself. The code <a href="https://github.com/azaccor/token-rate-limiter">lives here</a> and setup instructions are in the README.</p><p>Cheers and happy coding.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.databricksters.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Databricksters! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Observability for Any Agent, Anywhere: Production-Ready Tracing with MLflow & OpenTelemetry on Databricks]]></title><description><![CDATA[MLflow OpenTelemetry traces in Unity Catalog create a continuous improvement flywheel for AI agents through analytics, evals, and monitoring.]]></description><link>https://www.databricksters.com/p/observability-for-any-agent-anywhere</link><guid isPermaLink="false">https://www.databricksters.com/p/observability-for-any-agent-anywhere</guid><dc:creator><![CDATA[Anoop Sunke]]></dc:creator><pubDate>Fri, 20 Feb 2026 16:03:07 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/d936aaad-6c40-4be8-8c7a-0e4ae848188d_1376x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h2>Executive Summary</h2><ul><li><p><strong>The Problem:</strong> AI agents generate massive volumes of trace data, but traditional observability tools make that data expensive to retain, difficult to govern, and hard to use in evaluation and analytics workflows.</p></li><li><p><strong>The Solution:</strong> MLflow now supports writing OpenTelemetry (OTEL) traces directly to Unity Catalog tables via a fully managed, serverless ingestion path.</p></li><li><p><strong>The Benefit: </strong>By landing traces directly in the Lakehouse, teams get governed, analytics-ready observability data with long-term retention, unified evaluation and monitoring workflows, and no OTEL infrastructure to operate.</p></li><li><p><strong>The Outcome: </strong>Production traces become immediately 
usable for analysis and evaluation, enabling faster iteration loops between real-world usage, model evaluation, and continuous improvement.</p></li></ul><h2>Why AI Tracing Breaks Traditional Observability</h2><p>As AI applications move into production, traces become one of the clearest ways to understand how agents actually behave by capturing prompts, tool calls, responses, latency, and execution paths. Without strong tracing, it&#8217;s hard to understand why agents behave the way they do, making debugging, evaluation, and governance much more difficult.</p><p>The challenge isn&#8217;t that observability platforms can&#8217;t ingest this data. It&#8217;s that AI traces quickly become valuable beyond debugging. Teams want to retain them longer, analyze them with SQL, join them with business and model data, and reuse them for evaluation and monitoring. When traces live only inside observability systems, that flexibility is limited, governance becomes fragmented, and moving data into analytics workflows often requires extra pipelines and duplication, especially when sensitive prompt data is involved.</p><h2>MLflow and OTEL Trace Ingestion</h2><p>Databricks now <a href="https://docs.databricks.com/aws/en/mlflow3/genai/tracing/trace-unity-catalog">supports</a> writing MLflow traces directly to Unity Catalog using the OpenTelemetry (OTEL) format. 
In practice, this means traces can be ingested in real time and stored in Delta tables, where they benefit from the same scalability, governance, and tooling as the rest of your data.</p><p>This changes how teams can use trace data:</p><ul><li><p><strong>Real-time ingestion with practical retention:</strong> Traces can be written as they&#8217;re generated at high throughput (GBs/sec) and retained long-term without the cost pressure typically associated with observability platforms.</p></li><li><p><strong>Analyze and govern using the Lakehouse:</strong> Once traces are tables, you can treat them like any other dataset: query them with SQL, build dashboards, run ETL pipelines, use tools like <a href="https://docs.databricks.com/aws/en/genie/">Genie</a>, and apply governance controls such as PII masking.</p></li><li><p><strong>Use the full MLflow evaluation stack:</strong> Persisting traces in Unity Catalog removes typical experiment constraints (such as <a href="https://docs.databricks.com/aws/en/resources/limits">trace caps</a>), making it easier to run large offline evaluations, monitor production systems, and continuously improve quality as workloads grow.</p></li></ul><h3>The Engineering trade-off: SaaS vs. Lakehouse</h3><p>So why not rely entirely on a SaaS observability tool?</p><ol><li><p><strong>Retention economics: </strong>Agents generate massive text payloads. Storing this data in Delta Lake on object storage is often significantly more cost-effective than SaaS-based retention models.</p></li><li><p><strong>The PII deadlock: </strong>Sending raw prompts to third-party platforms can create InfoSec friction. Keeping traces inside Unity Catalog helps maintain data sovereignty and simplifies governance.</p></li><li><p><strong>Analytics, not just telemetry:</strong> SaaS tools are strong for operational metrics like latency, but the Lakehouse gives you something different: an analytics and AI engine. 
You can join traces with business data &#8212; revenue, conversions, customer outcomes &#8212; to understand real impact, not just system health. Furthermore, the Lakehouse enables you to apply AI directly to your traces, allowing for advanced use cases like classifying user interactions as &#8216;good&#8217; or &#8216;bad,&#8217; and building evaluation frameworks to continuously improve system quality.</p></li></ol><h2>Architecture: Serverless OpenTelemetry Ingestion</h2><p>MLflow tracing can use the OpenTelemetry (OTEL) standard, which separates instrumentation from storage. In a typical OTEL deployment, teams are responsible for running collector fleets, scaling agents, handling backpressure, and managing reliability.</p><p>Databricks removes that operational layer by providing a managed OpenTelemetry endpoint, transparently powered by <a href="https://docs.databricks.com/aws/en/ingestion/zerobus-overview">Zerobus</a>. Zerobus is a serverless ingestion engine that enables applications to stream data directly into Delta tables using a gRPC API. Applications can easily export spans, logs, and metrics from <strong>any OTEL-compatible client</strong> directly to Unity Catalog tables, where the data is stored in Delta format.  Zerobus acts as the telemetry pipeline, handling ingestion and durability so teams don&#8217;t have to operate their own collectors.</p><p>From there, traces become first-class data in the Lakehouse, powering MLflow evaluations and monitoring, ad-hoc SQL analysis, dashboards, and downstream analytics. 
This creates a continuous improvement <strong>flywheel</strong> where production behavior feeds evaluation and analysis, which in turn drives faster iteration and better agent performance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AIlN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba10508-cf89-4511-8266-a232bac5f7e3_1920x1025.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AIlN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba10508-cf89-4511-8266-a232bac5f7e3_1920x1025.png 424w, https://substackcdn.com/image/fetch/$s_!AIlN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba10508-cf89-4511-8266-a232bac5f7e3_1920x1025.png 848w, https://substackcdn.com/image/fetch/$s_!AIlN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba10508-cf89-4511-8266-a232bac5f7e3_1920x1025.png 1272w, https://substackcdn.com/image/fetch/$s_!AIlN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba10508-cf89-4511-8266-a232bac5f7e3_1920x1025.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AIlN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba10508-cf89-4511-8266-a232bac5f7e3_1920x1025.png" width="1456" height="777" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/eba10508-cf89-4511-8266-a232bac5f7e3_1920x1025.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:777,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:3766120,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/188328490?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba10508-cf89-4511-8266-a232bac5f7e3_1920x1025.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AIlN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba10508-cf89-4511-8266-a232bac5f7e3_1920x1025.png 424w, https://substackcdn.com/image/fetch/$s_!AIlN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba10508-cf89-4511-8266-a232bac5f7e3_1920x1025.png 848w, https://substackcdn.com/image/fetch/$s_!AIlN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba10508-cf89-4511-8266-a232bac5f7e3_1920x1025.png 1272w, https://substackcdn.com/image/fetch/$s_!AIlN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Feba10508-cf89-4511-8266-a232bac5f7e3_1920x1025.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2><strong>Tutorial: Wiring Traces into the Lakehouse</strong></h2><h3>Sample agent: Support manager assistant</h3><p>For this blog, we&#8217;ll create a simple support manager assistant that we can use to demonstrate tracing end-to-end. The agent can be deployed outside of Databricks, as we&#8217;ve done here, highlighting that trace ingestion is decoupled from where the agent runs.</p><p>We built a LangGraph agent powered by a <a href="https://docs.databricks.com/aws/en/machine-learning/foundation-model-apis/supported-models#-anthropic-claude-sonnet-4">Databricks-hosted Claude Sonnet 4 model</a> for reasoning and response generation. 
The agent calls a Genie Space as a tool, which you can deploy <a href="https://www.databricks.com/resources/demos/tutorials/aibi-customer-support-review-dashboards-and-genie?itm_data=demo_center&amp;itm_source=www&amp;itm_category=resources&amp;itm_page=tutorials&amp;itm_location=Data%20Warehouse%20and%20BI&amp;itm_component=card&amp;itm_offer=aibi-customer-support-review-dashboards-and-genie">here</a>.</p><p>When a user asks a data-driven question, the agent invokes Genie through the MCP tool API. Genie translates the request into SQL, executes it against the support dataset, and returns the result. The agent then summarizes the findings and provides actionable takeaways for a support manager.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ye7I!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99fd4f6b-3002-49d2-bd64-dba9754731fc_667x111.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ye7I!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99fd4f6b-3002-49d2-bd64-dba9754731fc_667x111.png 424w, https://substackcdn.com/image/fetch/$s_!ye7I!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99fd4f6b-3002-49d2-bd64-dba9754731fc_667x111.png 848w, https://substackcdn.com/image/fetch/$s_!ye7I!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99fd4f6b-3002-49d2-bd64-dba9754731fc_667x111.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ye7I!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99fd4f6b-3002-49d2-bd64-dba9754731fc_667x111.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ye7I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99fd4f6b-3002-49d2-bd64-dba9754731fc_667x111.png" width="667" height="111" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/99fd4f6b-3002-49d2-bd64-dba9754731fc_667x111.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:111,&quot;width&quot;:667,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:14975,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/188328490?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99fd4f6b-3002-49d2-bd64-dba9754731fc_667x111.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ye7I!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99fd4f6b-3002-49d2-bd64-dba9754731fc_667x111.png 424w, https://substackcdn.com/image/fetch/$s_!ye7I!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99fd4f6b-3002-49d2-bd64-dba9754731fc_667x111.png 848w, https://substackcdn.com/image/fetch/$s_!ye7I!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99fd4f6b-3002-49d2-bd64-dba9754731fc_667x111.png 
1272w, https://substackcdn.com/image/fetch/$s_!ye7I!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F99fd4f6b-3002-49d2-bd64-dba9754731fc_667x111.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h3>Setting up MLflow tracing with UC</h3><p>Before instrumenting the agent, we first configure MLflow to store traces in Unity Catalog. This involves creating the underlying OpenTelemetry tables and linking them to an MLflow experiment so traces can be searched, analyzed, and annotated from the UI. Start by identifying (or creating) a SQL warehouse and an MLflow experiment, then use the MLflow Python library to create the Unity Catalog tables and link the schema to the experiment. For full steps, follow the docs <a href="https://docs.databricks.com/aws/en/mlflow3/genai/tracing/trace-unity-catalog">here</a>.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;plaintext&quot;,&quot;nodeId&quot;:&quot;db906a00-42d5-4fd1-addd-154efbb0f3dd&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-plaintext">import os
import mlflow
from mlflow.entities import UCSchemaLocation
from mlflow.tracing.enablement import set_experiment_trace_location

# Point MLflow at the Databricks-hosted tracking server
mlflow.set_tracking_uri("databricks")

# SQL warehouse used to create and query the Unity Catalog trace tables
os.environ["MLFLOW_TRACING_SQL_WAREHOUSE_ID"] = "&lt;warehouse-id&gt;"

experiment_name = "&lt;experiment-name&gt;"
catalog_name = "&lt;catalog&gt;"
schema_name = "&lt;schema&gt;"

# Create the experiment that traces will be attached to
experiment_id = mlflow.create_experiment(name=experiment_name)

# Store traces for this experiment in the chosen Unity Catalog schema
set_experiment_trace_location(
    location=UCSchemaLocation(
        catalog_name=catalog_name,
        schema_name=schema_name,
    ),
    experiment_id=experiment_id,
)</code></pre></div><p>This setup creates Unity Catalog tables for spans, logs, and metrics. Once traces begin flowing, the MLflow service also creates Databricks views that transform the underlying OpenTelemetry data into an MLflow-friendly format for easier querying and analysis. These include:</p><ul><li><p><strong>mlflow_experiment_trace_otel_spans</strong>: detailed execution steps for each request</p></li><li><p><strong>mlflow_experiment_trace_otel_logs</strong>: structured events such as metadata, tags, and assessments</p></li><li><p><strong>mlflow_experiment_trace_otel_metrics</strong>: numerical telemetry captured during execution</p></li><li><p><strong>mlflow_experiment_trace_metadata</strong>: MLflow tags, metadata, and assessments grouped by trace ID</p></li><li><p><strong>mlflow_experiment_trace_unified</strong>: a consolidated view that assembles all trace data into a single record per trace. For better performance at scale, consider converting it to a materialized view with incremental refresh.</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0mPM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd12fad-1472-4d95-86d6-af315e542030_790x276.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0mPM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd12fad-1472-4d95-86d6-af315e542030_790x276.png 424w, https://substackcdn.com/image/fetch/$s_!0mPM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd12fad-1472-4d95-86d6-af315e542030_790x276.png 848w, 
https://substackcdn.com/image/fetch/$s_!0mPM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd12fad-1472-4d95-86d6-af315e542030_790x276.png 1272w, https://substackcdn.com/image/fetch/$s_!0mPM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd12fad-1472-4d95-86d6-af315e542030_790x276.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0mPM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd12fad-1472-4d95-86d6-af315e542030_790x276.png" width="790" height="276" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3bd12fad-1472-4d95-86d6-af315e542030_790x276.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:276,&quot;width&quot;:790,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!0mPM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd12fad-1472-4d95-86d6-af315e542030_790x276.png 424w, https://substackcdn.com/image/fetch/$s_!0mPM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd12fad-1472-4d95-86d6-af315e542030_790x276.png 848w, 
https://substackcdn.com/image/fetch/$s_!0mPM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd12fad-1472-4d95-86d6-af315e542030_790x276.png 1272w, https://substackcdn.com/image/fetch/$s_!0mPM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3bd12fad-1472-4d95-86d6-af315e542030_790x276.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>After configuring the trace destination, agent instrumentation remains the same. 
You can do automatic and/or manual tracing as described <a href="https://docs.databricks.com/aws/en/mlflow3/genai/tracing/app-instrumentation/">here</a>. In our example, we rely on <code>mlflow.langchain.autolog()</code> to capture the detailed LangGraph execution (model calls and tool calls). We also wrap the entrypoint with <code>@mlflow.trace</code> to establish a request-level root span, allowing each invocation to be observed as a single end-to-end execution.</p><h3>Inspecting a sample trace</h3><p>Now that the agent is instrumented and traces are flowing into Unity Catalog, let&#8217;s look at a real execution.</p><p>For this example, we asked the Support Manager Assistant:</p><blockquote><p>&#8220;Which support engineer should I put up for promotion?&#8221;</p></blockquote><p>The agent evaluated the request, called the Genie space multiple times to gather supporting data, and returned a recommendation based on performance metrics.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AcEM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a476c30-e51f-4298-977e-ad5a85543aa4_1210x560.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AcEM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a476c30-e51f-4298-977e-ad5a85543aa4_1210x560.png 424w, https://substackcdn.com/image/fetch/$s_!AcEM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a476c30-e51f-4298-977e-ad5a85543aa4_1210x560.png 848w, 
https://substackcdn.com/image/fetch/$s_!AcEM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a476c30-e51f-4298-977e-ad5a85543aa4_1210x560.png 1272w, https://substackcdn.com/image/fetch/$s_!AcEM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a476c30-e51f-4298-977e-ad5a85543aa4_1210x560.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AcEM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a476c30-e51f-4298-977e-ad5a85543aa4_1210x560.png" width="1210" height="560" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2a476c30-e51f-4298-977e-ad5a85543aa4_1210x560.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:560,&quot;width&quot;:1210,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!AcEM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a476c30-e51f-4298-977e-ad5a85543aa4_1210x560.png 424w, https://substackcdn.com/image/fetch/$s_!AcEM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a476c30-e51f-4298-977e-ad5a85543aa4_1210x560.png 848w, 
https://substackcdn.com/image/fetch/$s_!AcEM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a476c30-e51f-4298-977e-ad5a85543aa4_1210x560.png 1272w, https://substackcdn.com/image/fetch/$s_!AcEM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2a476c30-e51f-4298-977e-ad5a85543aa4_1210x560.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>While the response looks straightforward, the trace reveals the underlying execution path that produced it. 
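</p><p>Conceptually, a trace is a tree of parent/child spans, and the execution path falls out of walking it. A toy reconstruction with hypothetical, simplified span records (real spans carry OpenTelemetry IDs and attributes):</p>

```python
# (span_id, parent_id, name) triples for one trace; the names are illustrative.
spans = [
    ("s1", None, "support_manager_assistant"),  # root span for the request
    ("s2", "s1", "ChatDatabricks"),             # model reasoning step
    ("s3", "s1", "genie_space_tool"),
    ("s4", "s1", "genie_space_tool"),
    ("s5", "s1", "genie_space_tool"),
    ("s6", "s1", "ChatDatabricks"),             # final answer synthesis
]

def children(parent_id):
    """Names of spans directly under a given parent, in recorded order."""
    return [name for _, pid, name in spans if pid == parent_id]

root_id = next(sid for sid, pid, _ in spans if pid is None)
genie_calls = children(root_id).count("genie_space_tool")  # 3 tool calls
```

<p>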
In the MLflow experiment, we can see each of the tool calls as well as the reasoning of our Claude Sonnet model, which called the Genie space tool three times before putting together a final answer.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6G7s!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d552252-5c34-46b2-a430-cc0f8c7b7504_488x623.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6G7s!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d552252-5c34-46b2-a430-cc0f8c7b7504_488x623.png 424w, https://substackcdn.com/image/fetch/$s_!6G7s!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d552252-5c34-46b2-a430-cc0f8c7b7504_488x623.png 848w, https://substackcdn.com/image/fetch/$s_!6G7s!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d552252-5c34-46b2-a430-cc0f8c7b7504_488x623.png 1272w, https://substackcdn.com/image/fetch/$s_!6G7s!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d552252-5c34-46b2-a430-cc0f8c7b7504_488x623.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6G7s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d552252-5c34-46b2-a430-cc0f8c7b7504_488x623.png" width="488" height="623"
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d552252-5c34-46b2-a430-cc0f8c7b7504_488x623.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:623,&quot;width&quot;:488,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!6G7s!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d552252-5c34-46b2-a430-cc0f8c7b7504_488x623.png 424w, https://substackcdn.com/image/fetch/$s_!6G7s!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d552252-5c34-46b2-a430-cc0f8c7b7504_488x623.png 848w, https://substackcdn.com/image/fetch/$s_!6G7s!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d552252-5c34-46b2-a430-cc0f8c7b7504_488x623.png 1272w, https://substackcdn.com/image/fetch/$s_!6G7s!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d552252-5c34-46b2-a430-cc0f8c7b7504_488x623.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We can click through each of the individual steps to study the inputs and outputs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ejSG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a107d80-9a43-49d6-8042-c0bc5fc89184_1056x581.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ejSG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a107d80-9a43-49d6-8042-c0bc5fc89184_1056x581.png 424w, https://substackcdn.com/image/fetch/$s_!ejSG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a107d80-9a43-49d6-8042-c0bc5fc89184_1056x581.png 848w, 
https://substackcdn.com/image/fetch/$s_!ejSG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a107d80-9a43-49d6-8042-c0bc5fc89184_1056x581.png 1272w, https://substackcdn.com/image/fetch/$s_!ejSG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a107d80-9a43-49d6-8042-c0bc5fc89184_1056x581.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ejSG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a107d80-9a43-49d6-8042-c0bc5fc89184_1056x581.png" width="1056" height="581" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a107d80-9a43-49d6-8042-c0bc5fc89184_1056x581.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:581,&quot;width&quot;:1056,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ejSG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a107d80-9a43-49d6-8042-c0bc5fc89184_1056x581.png 424w, https://substackcdn.com/image/fetch/$s_!ejSG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a107d80-9a43-49d6-8042-c0bc5fc89184_1056x581.png 848w, 
https://substackcdn.com/image/fetch/$s_!ejSG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a107d80-9a43-49d6-8042-c0bc5fc89184_1056x581.png 1272w, https://substackcdn.com/image/fetch/$s_!ejSG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a107d80-9a43-49d6-8042-c0bc5fc89184_1056x581.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Because traces are stored as Delta tables, they can be queried like any other dataset. 
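</p><p>For example, a quick span-name breakdown off the spans view created earlier (the catalog and schema are placeholders, and the duration column name is an assumption to adjust for your schema):</p>

```python
# Hypothetical fully qualified name for the spans view set up above.
SPANS_VIEW = "main.default.mlflow_experiment_trace_otel_spans"

def span_breakdown_query(view: str = SPANS_VIEW) -> str:
    """Build SQL that counts spans and averages duration per span name."""
    return f"""
        SELECT name,
               COUNT(*)         AS span_count,
               AVG(duration_ms) AS avg_duration_ms
        FROM {view}
        GROUP BY name
        ORDER BY span_count DESC
    """

# On Databricks: spark.sql(span_breakdown_query()).show()
```

<p>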
We can start with the <code>mlflow_experiment_trace_unified</code> view, where we will find a record that includes the request, response, trace metadata, and an array of the spans.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mhqL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F151eb393-06db-488f-9f9f-f8552c1dc125_779x438.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mhqL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F151eb393-06db-488f-9f9f-f8552c1dc125_779x438.png 424w, https://substackcdn.com/image/fetch/$s_!mhqL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F151eb393-06db-488f-9f9f-f8552c1dc125_779x438.png 848w, https://substackcdn.com/image/fetch/$s_!mhqL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F151eb393-06db-488f-9f9f-f8552c1dc125_779x438.png 1272w, https://substackcdn.com/image/fetch/$s_!mhqL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F151eb393-06db-488f-9f9f-f8552c1dc125_779x438.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mhqL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F151eb393-06db-488f-9f9f-f8552c1dc125_779x438.png" width="779" height="438" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/151eb393-06db-488f-9f9f-f8552c1dc125_779x438.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:438,&quot;width&quot;:779,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:68062,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/188328490?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F151eb393-06db-488f-9f9f-f8552c1dc125_779x438.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mhqL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F151eb393-06db-488f-9f9f-f8552c1dc125_779x438.png 424w, https://substackcdn.com/image/fetch/$s_!mhqL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F151eb393-06db-488f-9f9f-f8552c1dc125_779x438.png 848w, https://substackcdn.com/image/fetch/$s_!mhqL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F151eb393-06db-488f-9f9f-f8552c1dc125_779x438.png 1272w, https://substackcdn.com/image/fetch/$s_!mhqL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F151eb393-06db-488f-9f9f-f8552c1dc125_779x438.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><h2>Beyond Debugging: Analytics on Trace Data</h2><p>Now that traces are stored in Unity Catalog, they become immediately available for both batch and streaming analytics.</p><h3>Governance in Unity Catalog</h3><p>Prompts and responses, however, often contain sensitive information, so treating trace data as governed data is critical. By storing it in Unity Catalog, traces inherit fine-grained access controls, from catalog and schema permissions to column masking and row-level filtering,  enabling secure, production-ready analytics without limiting flexibility.</p><p>Once access is established, teams can securely run ad-hoc analytics by querying the underlying tables and views with SQL, as we did above. 
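</p><p>Those controls are themselves plain SQL. A sketch of a column mask that hides raw request text from non-admins (every object name here is hypothetical, and the mask function's signature must match the masked column's type):</p>

```python
# Unity Catalog column-mask DDL: redact prompt text for anyone outside an
# admin group. Catalog, schema, table, column, and group names are placeholders.
MASK_DDL = """
CREATE OR REPLACE FUNCTION main.default.redact_text(v STRING)
RETURN CASE
    WHEN is_account_group_member('genai-admins') THEN v
    ELSE '[REDACTED]'
END;

ALTER TABLE main.default.agent_traces
    ALTER COLUMN request SET MASK main.default.redact_text;
"""
```

<p>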
We can also build ETL pipelines, dashboards, and Genie spaces on top of this data for actionable business insights.</p><h3>Dashboards</h3><p>One of the most powerful aspects of having traces in Unity Catalog is that we aren&#8217;t locked into a vendor&#8217;s rigid, pre-canned views. Because the traces are in Delta tables, we can build custom dashboards that reflect our specific business logic, not just generic system health.</p><p>Using AI/BI Dashboards, we built an<strong> <a href="https://github.com/brunohub/mlflow-traces-observability/tree/main">AI Operations Center</a> </strong>that sits directly on top of our trace tables. This dashboard provides a unified view of our application performance, costs, and reliability. Instead of learning a proprietary query language, we just wrote standard SQL (with the help of <a href="https://www.databricks.com/blog/introducing-databricks-assistant-data-science-agent">AI</a>) to extract exactly what we needed.</p><p>Here are some key capabilities this unlocked:</p><p><strong>Custom Cost &amp; Token Analysis</strong> <br>Generic &#8220;cost&#8221; metrics are rarely accurate because every team negotiates different rates or uses fine-tuned models with unique pricing. Since we control the SQL, we embedded our specific pricing logic directly into the query. Our dashboard tracks token usage by model type (e.g., GPT-4o vs. Claude 4 Sonnet) and applies our contract-specific rates to calculate a precise <strong>Estimated Cost per Trace</strong>.
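</p><p>The rate math itself stays in our hands. A pure-Python sketch of the same logic (the rates below are invented, not list prices):</p>

```python
# Contract-specific USD rates per 1M tokens, keyed by model; illustrative only.
RATES = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
}

def estimated_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost of one trace from its model and token counts."""
    rate = RATES[model]
    return (input_tokens * rate["input"] + output_tokens * rate["output"]) / 1_000_000
```

<p>Embedded in SQL, the same logic becomes a <code>CASE</code> expression over the model column.</p><p>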
This lets us spot expensive outliers immediately&#8212;like a single complex query that costs $0.50 due to a retrieval loop.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!czXE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F004f1220-a2ad-48cd-97a9-c8e627211d33_1041x708.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!czXE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F004f1220-a2ad-48cd-97a9-c8e627211d33_1041x708.png 424w, https://substackcdn.com/image/fetch/$s_!czXE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F004f1220-a2ad-48cd-97a9-c8e627211d33_1041x708.png 848w, https://substackcdn.com/image/fetch/$s_!czXE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F004f1220-a2ad-48cd-97a9-c8e627211d33_1041x708.png 1272w, https://substackcdn.com/image/fetch/$s_!czXE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F004f1220-a2ad-48cd-97a9-c8e627211d33_1041x708.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!czXE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F004f1220-a2ad-48cd-97a9-c8e627211d33_1041x708.png" width="1041" height="708" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/004f1220-a2ad-48cd-97a9-c8e627211d33_1041x708.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:708,&quot;width&quot;:1041,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!czXE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F004f1220-a2ad-48cd-97a9-c8e627211d33_1041x708.png 424w, https://substackcdn.com/image/fetch/$s_!czXE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F004f1220-a2ad-48cd-97a9-c8e627211d33_1041x708.png 848w, https://substackcdn.com/image/fetch/$s_!czXE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F004f1220-a2ad-48cd-97a9-c8e627211d33_1041x708.png 1272w, https://substackcdn.com/image/fetch/$s_!czXE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F004f1220-a2ad-48cd-97a9-c8e627211d33_1041x708.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Component-Level Performance</strong></p><p>High-level latency metrics often hide the real culprit. Is the bottleneck the LLM or is it the Genie space retrieval? We built a <strong>&#8220;Tool Performance&#8221;</strong> widget that breaks down latency (P50, P99) and error rates for every individual tool in our agent (e.g., retrieve_docs vs. generate_response). 
This allows us to pinpoint exactly which step in the chain is degrading the user experience.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lJfx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f860e1-86b0-4285-8c96-cfc77d290f24_1310x718.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lJfx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f860e1-86b0-4285-8c96-cfc77d290f24_1310x718.png 424w, https://substackcdn.com/image/fetch/$s_!lJfx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f860e1-86b0-4285-8c96-cfc77d290f24_1310x718.png 848w, https://substackcdn.com/image/fetch/$s_!lJfx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f860e1-86b0-4285-8c96-cfc77d290f24_1310x718.png 1272w, https://substackcdn.com/image/fetch/$s_!lJfx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f860e1-86b0-4285-8c96-cfc77d290f24_1310x718.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lJfx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f860e1-86b0-4285-8c96-cfc77d290f24_1310x718.png" width="1310" height="718" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/89f860e1-86b0-4285-8c96-cfc77d290f24_1310x718.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:718,&quot;width&quot;:1310,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!lJfx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f860e1-86b0-4285-8c96-cfc77d290f24_1310x718.png 424w, https://substackcdn.com/image/fetch/$s_!lJfx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f860e1-86b0-4285-8c96-cfc77d290f24_1310x718.png 848w, https://substackcdn.com/image/fetch/$s_!lJfx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f860e1-86b0-4285-8c96-cfc77d290f24_1310x718.png 1272w, https://substackcdn.com/image/fetch/$s_!lJfx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F89f860e1-86b0-4285-8c96-cfc77d290f24_1310x718.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>Genie spaces</h3><p>Both business and technical stakeholders often want to explore agent behavior without writing SQL. By exposing trace tables through Genie, teams can enable natural-language analysis over their telemetry data, allowing users to ask questions about performance, tool usage, latency, and model behavior directly. 
In our example, this could include questions such as:</p><ul><li><p>What types of requests require escalation?</p></li><li><p>Are tool retries increasing?</p></li><li><p>Which queries trigger the most complex execution paths?</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!xlNf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d855164-9878-4f7a-90c5-3540ab887ef9_920x437.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!xlNf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d855164-9878-4f7a-90c5-3540ab887ef9_920x437.png 424w, https://substackcdn.com/image/fetch/$s_!xlNf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d855164-9878-4f7a-90c5-3540ab887ef9_920x437.png 848w, https://substackcdn.com/image/fetch/$s_!xlNf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d855164-9878-4f7a-90c5-3540ab887ef9_920x437.png 1272w, https://substackcdn.com/image/fetch/$s_!xlNf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d855164-9878-4f7a-90c5-3540ab887ef9_920x437.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!xlNf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d855164-9878-4f7a-90c5-3540ab887ef9_920x437.png" width="920" height="437" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5d855164-9878-4f7a-90c5-3540ab887ef9_920x437.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:437,&quot;width&quot;:920,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!xlNf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d855164-9878-4f7a-90c5-3540ab887ef9_920x437.png 424w, https://substackcdn.com/image/fetch/$s_!xlNf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d855164-9878-4f7a-90c5-3540ab887ef9_920x437.png 848w, https://substackcdn.com/image/fetch/$s_!xlNf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d855164-9878-4f7a-90c5-3540ab887ef9_920x437.png 1272w, https://substackcdn.com/image/fetch/$s_!xlNf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5d855164-9878-4f7a-90c5-3540ab887ef9_920x437.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h3>ETL pipelines</h3><p>Because traces are stored as Delta tables, they can feed downstream ETL pipelines just like any other dataset. By enabling <a href="https://docs.databricks.com/aws/en/delta/delta-change-data-feed">Change Data Feed (CDF)</a>, teams can process trace data incrementally, either in batch or streaming, without repeatedly scanning entire tables.</p><p>This makes it possible to operationalize observability. For example, a pipeline could monitor trace patterns and trigger alerts when latency exceeds defined thresholds, tool failures spike, or token usage deviates from expected baselines. These signals can then feed dashboards, notification systems, or automated remediation workflows.</p><p>Importantly, this complements real-time protections such as <a href="https://docs.databricks.com/aws/en/ai-gateway/overview-serving-endpoints#ai-guardrails">AI Guardrails</a>. 
While guardrails enforce policy at request time, ETL pipelines create a feedback loop, helping teams analyze trends, refine policies, and continuously improve agent performance.</p><p></p><h2>Closing the Loop: From Production Traces to Evaluation</h2><p>Once traces are available, they can power the full MLflow 3 <a href="https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/">evaluation stack</a>, enabling teams to measure, improve, and maintain the quality of their AI applications across the entire lifecycle. Evaluation and monitoring build directly on tracing, allowing the same telemetry captured during development, testing, and production to be scored using LLM judges and custom metrics.</p><h3>Evaluate during development using AI Judges</h3><p>MLflow allows us to run evaluations against an evaluation dataset, applying built-in or custom judges to score response quality. One effective approach is to bootstrap this dataset from real traces. Because these prompts originate from actual user interactions, they better represent the scenarios your agent must handle compared to synthetic test cases.</p><p>Below, we create an evaluation dataset from recently captured traces. MLflow uses a SQL warehouse to search and materialize dataset records, so be sure to configure the warehouse ID in your environment.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;bc030680-8dd0-46fd-8220-29d726db3488&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">import os
import mlflow
import mlflow.genai.datasets
import time

# Required for dataset operations
os.environ["MLFLOW_TRACING_SQL_WAREHOUSE_ID"] = MLFLOW_TRACING_SQL_WAREHOUSE_ID

DATASET_NAME = f"{CATALOG_NAME}.{SCHEMA_NAME}.support_management_chatbot_traces"

# Create (or load) the dataset
try:
    eval_dataset = mlflow.genai.datasets.create_dataset(name=DATASET_NAME)
except Exception:
    eval_dataset = mlflow.genai.get_dataset(name=DATASET_NAME)

# Pull recent traces (example - from yesterday)
yesterday = int((time.time() - 60 * 60 * 24) * 1000)
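# Sketch (not in the original snippet): a tiny helper makes the millisecond
# epoch math above explicit and reusable for other lookback windows.
def ms_epoch_days_ago(days: int) -> int:
    """Epoch milliseconds for `days` days before now; the filter below expects ms."""
    return int((time.time() - days * 86400) * 1000)

# e.g. yesterday = ms_epoch_days_ago(1)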

traces_df = mlflow.search_traces(
    filter_string=f"attributes.timestamp_ms &gt; {yesterday}",
    order_by=["attributes.timestamp_ms DESC"],
)

# Merge traces into the dataset
eval_dataset = eval_dataset.merge_records(traces_df[["inputs"]])</code></pre></div><p>With the dataset in place, we can define the judges that will score our application. MLflow provides a set of built-in judges, and also allows us to define custom guidelines tailored to our agent&#8217;s expected behavior.</p><div class="highlighted_code_block" data-attrs="{&quot;language&quot;:&quot;python&quot;,&quot;nodeId&quot;:&quot;542a1e40-6a26-40e3-a001-a638c4f625fc&quot;}" data-component-name="HighlightedCodeBlockToDOM"><pre class="shiki"><code class="language-python">from mlflow.genai.scorers import RelevanceToQuery, Safety, Guidelines

# Define judges
agent_judges = [
    RelevanceToQuery(),
    Guidelines(
        name="analytical_correctness",
        guidelines="The response must correctly interpret the data and avoid unsupported conclusions.",
    ),
    Guidelines(
        name="actionable_support_insights",
        guidelines="The response must provide at least one concrete, data-backed recommendation.",
    ),
    Guidelines(
        name="performance_management",
        guidelines="The response should not recommend admonishing or firing employees.",
    ),
    Safety(),
]
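# Sketch (not in the original snippet): the judges above are LLM-based, but a
# deterministic code-based check can run alongside them at zero token cost.
# The 2,000-character budget here is an illustrative assumption, not a recommendation.
def concise_response(outputs) -> bool:
    return len(str(outputs)) < 2000

# Wrapped in the @scorer decorator from mlflow.genai.scorers, a function like
# this can be appended to agent_judges alongside the built-in judges.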

# Run evaluation
eval_results = mlflow.genai.evaluate(
    data=eval_dataset,
    predict_fn=predict_fn,
    scorers=agent_judges,
)

eval_results</code></pre></div><p>And we can now see the results in the MLflow experiment.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JMne!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd855155-6d0d-4355-91aa-bddada2a1bdc_1332x319.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JMne!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd855155-6d0d-4355-91aa-bddada2a1bdc_1332x319.png 424w, https://substackcdn.com/image/fetch/$s_!JMne!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd855155-6d0d-4355-91aa-bddada2a1bdc_1332x319.png 848w, https://substackcdn.com/image/fetch/$s_!JMne!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd855155-6d0d-4355-91aa-bddada2a1bdc_1332x319.png 1272w, https://substackcdn.com/image/fetch/$s_!JMne!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd855155-6d0d-4355-91aa-bddada2a1bdc_1332x319.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!JMne!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd855155-6d0d-4355-91aa-bddada2a1bdc_1332x319.png" width="1332" height="319" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fd855155-6d0d-4355-91aa-bddada2a1bdc_1332x319.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:319,&quot;width&quot;:1332,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:69873,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/188328490?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd855155-6d0d-4355-91aa-bddada2a1bdc_1332x319.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JMne!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd855155-6d0d-4355-91aa-bddada2a1bdc_1332x319.png 424w, https://substackcdn.com/image/fetch/$s_!JMne!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd855155-6d0d-4355-91aa-bddada2a1bdc_1332x319.png 848w, https://substackcdn.com/image/fetch/$s_!JMne!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd855155-6d0d-4355-91aa-bddada2a1bdc_1332x319.png 1272w, https://substackcdn.com/image/fetch/$s_!JMne!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffd855155-6d0d-4355-91aa-bddada2a1bdc_1332x319.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><h3>Production monitoring</h3><p>Development evaluations help us validate behavior before release, but production monitoring shows us how the application performs with real users. 
MLflow can automatically evaluate live traces using the same judges, helping us quickly detect regressions, drift, and emerging failure patterns. This turns evaluation from a one-time task into an ongoing practice as the application evolves.</p><h2>Frequently Asked Questions (FAQ)</h2><ul><li><p><strong>Can I use this for agents running outside of Databricks?</strong></p><p>Yes, the agent can be running anywhere. In fact, the support assistant agent example used for this blog is deployed locally.</p></li><li><p><strong>What are the throughput and storage limits of this solution?</strong></p><p>The ingestion throughput <a href="https://docs.databricks.com/aws/en/mlflow3/genai/tracing/trace-unity-catalog#-limitations">limit is 200 QPS</a> today. There is no limit on storage. Previous limits on traces per experiment are no longer applicable. If you need higher throughput limits, please reach out to your Databricks account team.</p></li><li><p><strong>What can I do to ensure my search queries, MLflow experiment experience, and downstream analytics remain performant?</strong></p><p>Consider optimizing the OTEL tables using Z-ordering, as described <a href="https://docs.databricks.com/aws/en/mlflow3/genai/tracing/observe-with-traces/query-dbsql#performance-considerations">here</a>.</p></li><li><p><strong>How does this handle PII found in user prompts?</strong></p><p>This feature does not apply any special handling to PII. 
However, the data is stored in Unity Catalog, where you can leverage governance capabilities, such as fine-grained access controls, column masking, and row filtering, to manage and restrict downstream access.</p></li></ul><h2>Get started</h2><p>To get started, follow along with the <a href="https://docs.databricks.com/aws/en/mlflow3/genai/tracing/trace-unity-catalog">documentation</a>.</p>]]></content:encoded></item><item><title><![CDATA[Trace your steps back to Slack]]></title><description><![CDATA[Create a Slackbot to review MLflow traces for your agent.]]></description><link>https://www.databricksters.com/p/trace-your-steps-back-to-slack</link><guid isPermaLink="false">https://www.databricksters.com/p/trace-your-steps-back-to-slack</guid><dc:creator><![CDATA[Veena]]></dc:creator><pubDate>Tue, 25 Nov 2025 16:02:25 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!Li1P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faff6dc68-c00c-4aca-8334-fd93b8d4170e_803x1125.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>If you have been creating and deploying agents on Databricks, you may already be familiar with MLflow Review Apps. For those who have not used them before, MLflow Review Apps are an easy way to collect feedback on your agent from your Subject Matter Experts. Databricks supports review apps through the built-in interface or, if you need more customization, through a <a href="https://github.com/databricks-solutions/custom-mlflow-review-app/tree/main">custom review app</a> hosted on Databricks Apps.</p><p>But what if we could just bring this process directly to Slack? 
This blog post will walk you through building a Slackbot that enables real-time agent interaction and feedback collection.</p><h2>How does tracing and feedback work in MLflow?</h2><p>With MLflow Production Monitoring, you can see traces arrive directly in an MLflow experiment. These traces can be synced to a table in Unity Catalog.</p><p>Each trace has a unique ID automatically generated by MLflow. This ID can be used to add feedback (source: <a href="https://docs.databricks.com/aws/en/mlflow3/genai/tracing/collect-user-feedback/#implementing-feedback-collection">Databricks documentation</a>) via the MLflow <code>log_feedback</code> function. This can be an LLM judge or human feedback. Feedback is also stored as an assessment linked to the specific trace, making it queryable through MLflow.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Li1P!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faff6dc68-c00c-4aca-8334-fd93b8d4170e_803x1125.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Li1P!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faff6dc68-c00c-4aca-8334-fd93b8d4170e_803x1125.png 424w, https://substackcdn.com/image/fetch/$s_!Li1P!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faff6dc68-c00c-4aca-8334-fd93b8d4170e_803x1125.png 848w, https://substackcdn.com/image/fetch/$s_!Li1P!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faff6dc68-c00c-4aca-8334-fd93b8d4170e_803x1125.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Li1P!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faff6dc68-c00c-4aca-8334-fd93b8d4170e_803x1125.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Li1P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faff6dc68-c00c-4aca-8334-fd93b8d4170e_803x1125.png" width="803" height="1125" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aff6dc68-c00c-4aca-8334-fd93b8d4170e_803x1125.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1125,&quot;width&quot;:803,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Li1P!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faff6dc68-c00c-4aca-8334-fd93b8d4170e_803x1125.png 424w, https://substackcdn.com/image/fetch/$s_!Li1P!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faff6dc68-c00c-4aca-8334-fd93b8d4170e_803x1125.png 848w, https://substackcdn.com/image/fetch/$s_!Li1P!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faff6dc68-c00c-4aca-8334-fd93b8d4170e_803x1125.png 1272w, 
https://substackcdn.com/image/fetch/$s_!Li1P!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faff6dc68-c00c-4aca-8334-fd93b8d4170e_803x1125.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A labeling session (source: <a href="https://docs.databricks.com/aws/en/mlflow3/genai/human-feedback/concepts/labeling-sessions">Databricks documentation</a>) is a special type of run within MLflow. 
Databricks recommends adding specific traces to a labeling session ahead of time; the custom or built-in review app then connects to that session and exposes its traces to SMEs. Under the hood, the review app is simply a structured way of interacting with the MLflow client, which is why the traces must be pre-selected.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!S6YX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af50aad-ee37-4f49-89f4-5d0e94030a9f_1600x840.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!S6YX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af50aad-ee37-4f49-89f4-5d0e94030a9f_1600x840.png 424w, https://substackcdn.com/image/fetch/$s_!S6YX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af50aad-ee37-4f49-89f4-5d0e94030a9f_1600x840.png 848w, https://substackcdn.com/image/fetch/$s_!S6YX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af50aad-ee37-4f49-89f4-5d0e94030a9f_1600x840.png 1272w, https://substackcdn.com/image/fetch/$s_!S6YX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af50aad-ee37-4f49-89f4-5d0e94030a9f_1600x840.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!S6YX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af50aad-ee37-4f49-89f4-5d0e94030a9f_1600x840.png" width="1456" height="764" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9af50aad-ee37-4f49-89f4-5d0e94030a9f_1600x840.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:764,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!S6YX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af50aad-ee37-4f49-89f4-5d0e94030a9f_1600x840.png 424w, https://substackcdn.com/image/fetch/$s_!S6YX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af50aad-ee37-4f49-89f4-5d0e94030a9f_1600x840.png 848w, https://substackcdn.com/image/fetch/$s_!S6YX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af50aad-ee37-4f49-89f4-5d0e94030a9f_1600x840.png 1272w, https://substackcdn.com/image/fetch/$s_!S6YX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9af50aad-ee37-4f49-89f4-5d0e94030a9f_1600x840.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 
7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>To create a Slackbot that can perform the same tasks as a custom review app, we will need to host it on a Databricks App. In this app, we are going to use labeling sessions slightly differently. Instead of interacting with pre-selected traces, we will allow SMEs to interact with the agent directly, creating traces and immediately adding them to an already-created labeling session. Then, the SME can add feedback via Slack interactions.</p><h1>Building the Slackbot Review App</h1><p><a href="https://github.com/veenaramesh/custom-slack-review-app">Follow along with the code here. 
</a></p><p>This is the experience we want:</p><ol><li><p>Human experts ask questions in a Slack channel.</p></li><li><p>The agent answers the question in the same Slack thread.</p></li><li><p>Human experts provide feedback via Slack shortcuts.</p></li></ol><p>Therefore, our Slackbot should:</p><ol><li><p>Listen to messages in Slack.</p></li><li><p>Call our agent in Databricks.</p></li><li><p>Collect feedback from SMEs in Slack.</p></li><li><p>Annotate MLflow traces with that feedback.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7MQj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9a906f-1769-47fe-a536-bd05ea1145d1_1174x626.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7MQj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9a906f-1769-47fe-a536-bd05ea1145d1_1174x626.png 424w, https://substackcdn.com/image/fetch/$s_!7MQj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9a906f-1769-47fe-a536-bd05ea1145d1_1174x626.png 848w, https://substackcdn.com/image/fetch/$s_!7MQj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9a906f-1769-47fe-a536-bd05ea1145d1_1174x626.png 1272w, https://substackcdn.com/image/fetch/$s_!7MQj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9a906f-1769-47fe-a536-bd05ea1145d1_1174x626.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!7MQj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9a906f-1769-47fe-a536-bd05ea1145d1_1174x626.png" width="1174" height="626" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cd9a906f-1769-47fe-a536-bd05ea1145d1_1174x626.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:626,&quot;width&quot;:1174,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7MQj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9a906f-1769-47fe-a536-bd05ea1145d1_1174x626.png 424w, https://substackcdn.com/image/fetch/$s_!7MQj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9a906f-1769-47fe-a536-bd05ea1145d1_1174x626.png 848w, https://substackcdn.com/image/fetch/$s_!7MQj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9a906f-1769-47fe-a536-bd05ea1145d1_1174x626.png 1272w, https://substackcdn.com/image/fetch/$s_!7MQj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd9a906f-1769-47fe-a536-bd05ea1145d1_1174x626.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft 
icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li></ol><h2>Some housekeeping</h2><p>Before we get started with writing the Databricks app, we will first need to create the following: </p><h3>Creating an app in Slack</h3><p>First, let&#8217;s create an application in Slack. For more detailed instructions, <a href="https://medium.com/m/global-identity-2?redirectUrl=https%3A%2F%2Fpython.plainenglish.io%2Flets-create-a-slackbot-cause-why-not-2972474bf5c1">check out this Medium blog post.</a></p><p>I have included the app manifest for the Slackbot with all necessary configurations, but check out the necessary scope and permissions for the bot. 
At a minimum, we will need the scopes: (1) chat:write, (2) groups:read, (3) im:read, (4) mpim:history, and (5) commands.</p><p>Once you have installed the app in Slack, you will be given a Bot User OAuth token. Save this securely; we will need it in our app.</p><h3>Creating a Databricks App</h3><p>Databricks Apps makes hosting straightforward, as each app has an associated Service Principal. All we need to do is ensure that the Service Principal has access to our MLflow experiment and agent endpoint.</p><p>Using the CLI, we can create the app:</p><p><code>databricks apps create slackbot</code></p><p>Sync local files to the Databricks workspace:</p><p><code>databricks sync . "/Users/$DATABRICKS_USERNAME/slackbot-app"</code></p><p>Then, deploy:</p><p><code>databricks apps deploy slackbot --source-code-path /Workspace/Users/$DATABRICKS_USERNAME/slackbot-app</code></p><h3>Creating an MLflow labeling session</h3><p>We should also create a labeling session within our MLflow experiment. This creates a persistent MLflow run that we will link all Slack-generated traces to. You can do this in a notebook with the SDK or through the MLflow experiment UI. </p><pre><code>import mlflow.genai.labeling as labeling

import mlflow.genai.label_schemas as schemas

# Create a simple labeling session with built-in schemas

session = labeling.create_labeling_session(
    name="customer_service_review_jan_2024",
    assigned_users=["alice@company.com", "bob@company.com"],
    label_schemas=[schemas.EXPECTED_FACTS]  
    # Required: at least one schema needed 
)</code></pre><p>Source: <a href="https://docs.databricks.com/aws/en/mlflow3/genai/human-feedback/concepts/labeling-sessions">Databricks documentation.</a></p><h2>1. Initializing the Slack client</h2><p>In our Databricks App, using the Slack SDK, we can easily connect to our Slack App:</p><pre><code>def get_slack_auth():
    w = WorkspaceClient()
    token_bot = dbutils.secrets.get(scope="brickbrain-scope", key="slack-bot-token")
    return token_bot

def start_slack_client():
    logger.info("Initialized Slack client.")
    # NOTE: disabling certificate verification is insecure; prefer the default
    # SSL settings unless your network's proxy setup forces this
    ssl_context = ssl.create_default_context()
    ssl_context.check_hostname = False
    ssl_context.verify_mode = ssl.CERT_NONE
    token_bot = get_slack_auth()
    client = slack_sdk.WebClient(token=token_bot, ssl=ssl_context)
    return App(client=client, process_before_response=False)

app = start_slack_client()
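
# Aside (a hypothetical helper, not part of the original app): the model replies
# we post back in section 4 arrive as standard Markdown, while Slack renders its
# own mrkdwn dialect. A minimal regex sketch of the conversion, covering only
# bold text and headings:
import re

def md_to_mrkdwn(text):
    text = re.sub(r"\*\*(.+?)\*\*", r"*\1*", text)  # **bold** becomes *bold*
    text = re.sub(r"^#+\s*(.+)$", r"*\1*", text, flags=re.MULTILINE)  # headings become bold lines
    return text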
</code></pre><p>Note: store the Slack token in Databricks Secrets for security, and ensure your Service Principal has permissions on that secret scope.</p><h2>2. Listening to events</h2><p>Depending on the permissions given to your application, your Slackbot will be able to receive and respond to different events. Take a look at the full list of events (source: <a href="https://docs.slack.dev/reference/events/">Slack documentation</a>). </p><p>First, let&#8217;s take a look at the message event, which fires whenever a message is sent to a channel. In the example, I am observing every message sent to the channel; if you want to narrow the scope, you can filter on a message subtype or, more naively, use string manipulation. I am going to be using <a href="https://docs.slack.dev/tools/bolt-python/">slack-bolt</a> moving forward to respond and take actions as the bot.</p><p>Bolt has many decorators that we can use to listen for events. For example, when observing the message event, I can declare the following:</p><pre><code>@app.event("message")
def llm_response(event, say, client):
    logger.info(f"Message received - User: {event['user']}, Text: {event['text'][:20]}...")
   &lt;...&gt;</code></pre><p>For different types of &#8220;listeners&#8221;, we can have different function arguments: </p><ul><li><p><code>payload</code>: also accessible via the alias corresponding to the method name that the listener is passed to (message, event, action, shortcut, view, command, options).</p><ul><li><p>In this case, <code>event</code> == payload</p></li></ul></li><li><p><code>say</code>: a function that sends a message to the channel associated with the event.</p></li><li><p><code>ack</code>: a function that must be called to acknowledge that an incoming event was received by the app.</p></li><li><p><code>client</code>: a Web API client that uses the token associated with that event.</p></li><li><p><code>logger</code></p></li></ul><p>This is not a complete list, but these are the most important ones for our use case (source: <a href="https://docs.slack.dev/tools/bolt-js/reference/#listener-function-arguments">Slack documentation</a>). </p><h2>3. Calling the agent</h2><p>In our app, we want to respond to messages sent to the channel. We can easily trigger an LLM call now. However, in order to add feedback to the trace, we need the trace ID. When interacting with a Databricks endpoint, we can get it by setting the option <code>return_trace</code> to True. </p><pre><code>        input_data = {
            "input": history + [{"role": "user", "content": message_text}],
            "databricks_options": {"return_trace": True}
        }

        response = mlflow_client.predict(endpoint=ENDPOINT_NAME, inputs=input_data)</code></pre><p>The response output then gives us the trace ID: </p><pre><code>        trace_id = response['databricks_output']['trace']['info']['trace_id']</code></pre><h2>4. Responding to the message</h2><p>To respond within a thread, we will need to use the client API. Recall that the listener argument &#8220;say&#8221; is offered with most events, but the client API gives us finer control over threading, blocks, and message metadata. </p><p>LLMs often produce Markdown output. It is important to note that Slack uses its own markup language, and although most basic syntax is supported, some elements are absent. Take a look at what is supported <a href="https://www.markdownguide.org/tools/slack/">here</a>.</p><p>If you want to ensure that the output is stylized the way the LLM intended, I would suggest converting the Markdown text into Slack&#8217;s mrkdwn format manually. This requires some string manipulation with regex (source: <a href="https://github.com/fla9ua/markdown_to_mrkdwn">GitHub repo</a>).</p><pre><code>    result = client.chat_postMessage(
        channel=event['channel'],
        blocks=[
            {
                "type": "section",
                "text": {"type": "mrkdwn", "text": slack_response}
            },
        ],
        text=slack_response,
        thread_ts=event['ts'],  # reply in the thread
        metadata={
            "event_type": "agent_response",
            "event_payload": {
                "trace_id": trace_id,  # trace id in metadata
                "thread_id": event['ts'],
                "resource_type": "AGENT_RESPONSE",
            }
        }
    )</code></pre><p>Using the Client API, we can also attach metadata to each message. This makes it easier to retrieve information across sessions, like <code>trace_id</code>.</p><p>We have designed the response simply, but Slack offers many options for designing a message. Take a look at <a href="https://app.slack.com/block-kit-builder/T02TL6JB2">Block Kit Builder</a> to see how you can structure your Slack message with buttons, dividers, images, inputs, etc. </p><h2>5. Adding feedback</h2><p>We will use a Slack message shortcut to log feedback. I found this method to be the most straightforward and easiest to customize, but we could also design a feedback form with Slack message blocks.</p><p>Using the add_feedback shortcut triggers a &#8220;message_shortcut&#8221; event. Because we added the trace ID to the metadata of the agent-response Slack message, we can access that trace_id from the shortcut payload.</p><pre><code>@app.message_shortcut("log_feedback")
def handle_log_feedback_shortcut(ack, shortcut, client):
    ack()
    logger.info(f"Feedback message shortcut triggered by user: {shortcut['user']['name']}")
    
    message = shortcut['message']
    message_ts = message['ts']
    
    metadata = message.get('metadata', {})</code></pre><p>When handling this event, we can use the Client API to open a view with the formatted feedback form. We can add comments and binary feedback. These inputs are translated into arguments for <code>mlflow.log_feedback()</code>. However, <code>log_feedback</code> can take all sorts of values: integers, floats, categorical values, and multiple-category feedback (source: <a href="https://docs.databricks.com/aws/en/mlflow3/genai/tracing/concepts/log-assessment">Databricks documentation</a>). So, feel free to customize this to what your evaluation system needs.</p><p>Since this is a form, hitting submit fires another Slack event, a &#8220;view&#8221; submission. This is where we actually handle the feedback submission and call <code>mlflow.log_feedback()</code>. For your review app, you can also log expectations (aka ground truth) using <code>log_expectation()</code>.</p><h2>6. Linking everything to a labeling session</h2><p>We still have not linked these traces to a labeling session. To do so, we fetch the run ID associated with the labeling session and the trace_id:</p><pre><code>from typing import Any, Dict, List

import requests
from mlflow.utils.databricks_utils import get_databricks_host_creds

def link_traces_to_run(run_id: str, trace_ids: List[str]) -&gt; Dict[str, Any]:
    creds = get_databricks_host_creds()
    url = _get_mlflow_api_url('/traces/link-to-run', creds=creds)
    data = {'run_id': run_id, 'trace_ids': trace_ids}
    # (sketch) the original snippet elided the request itself; POST the payload
    # to the MLflow REST API and return the parsed response
    resp = requests.post(url, headers={'Authorization': f'Bearer {creds.token}'}, json=data)
    resp.raise_for_status()
    return resp.json()

############################
in the @app.event function: 
############################

link_traces_to_run(run_id=LABELLING_SESSION.mlflow_run_id, trace_ids=[trace_id])
logger.info(f"Traces linked to run - Run: {LABELLING_SESSION.mlflow_run_id}, Trace: {trace_id}")
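
# Sketch of step 5's view handling (the block/action IDs here are hypothetical):
# Slack reports modal inputs under view["state"]["values"], keyed by block ID
# and then action ID, which we unpack before calling mlflow.log_feedback()
def parse_feedback_view(view):
    values = view["state"]["values"]
    rating = values["rating_block"]["rating_action"]["selected_option"]["value"]
    comment = values["comment_block"]["comment_action"]["value"]
    return rating, comment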
</code></pre><h2>7. Handling conversation history</h2><p>Slack threads make conversation history management simple. Instead of requiring a database for checkpointing, we can simply fetch the threads themselves using the client API and the thread ID:</p><pre><code>def get_thread_messages(client, channel, thread_ts):
    response = client.conversations_replies(
        channel=channel,
        ts=thread_ts,
        inclusive=True,  # Include the parent message
        limit=10  # Max messages to retrieve
    )
    logger.info(f"Retrieved {len(response['messages'])} messages from thread {thread_ts}")
        return response['messages']</code></pre><h1>Happy reviewing!</h1><p><a href="https://github.com/veenaramesh/custom-slack-review-app">Take a look at the full implementation and code here. </a></p><p>There are no limits to how you can use MLflow review apps! You can easily bring MLflow&#8217;s feedback mechanism into Slack, reducing friction in the feedback process. Thanks for reading.</p>]]></content:encoded></item><item><title><![CDATA[Getting Medieval on Token Costs]]></title><description><![CDATA[A Lakebase Powered Solution to Token-Based Rate Limiting]]></description><link>https://www.databricksters.com/p/getting-medieval-on-token-costs</link><guid isPermaLink="false">https://www.databricksters.com/p/getting-medieval-on-token-costs</guid><dc:creator><![CDATA[Austin]]></dc:creator><pubDate>Fri, 24 Oct 2025 20:20:45 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!vSad!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1173c2d-8ceb-4f48-9cf1-05423a01ef60_770x554.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Don&#8217;t you hate it when your employees run up a thousand-dollar tab on Claude API calls inside of a week and then hit you with this look when you tell them that was the budget for the quarter?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!vSad!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1173c2d-8ceb-4f48-9cf1-05423a01ef60_770x554.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!vSad!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1173c2d-8ceb-4f48-9cf1-05423a01ef60_770x554.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!vSad!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1173c2d-8ceb-4f48-9cf1-05423a01ef60_770x554.jpeg 848w, https://substackcdn.com/image/fetch/$s_!vSad!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1173c2d-8ceb-4f48-9cf1-05423a01ef60_770x554.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!vSad!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1173c2d-8ceb-4f48-9cf1-05423a01ef60_770x554.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!vSad!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1173c2d-8ceb-4f48-9cf1-05423a01ef60_770x554.jpeg" width="770" height="554" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a1173c2d-8ceb-4f48-9cf1-05423a01ef60_770x554.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:554,&quot;width&quot;:770,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:164374,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/177042002?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1173c2d-8ceb-4f48-9cf1-05423a01ef60_770x554.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!vSad!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1173c2d-8ceb-4f48-9cf1-05423a01ef60_770x554.jpeg 424w, https://substackcdn.com/image/fetch/$s_!vSad!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1173c2d-8ceb-4f48-9cf1-05423a01ef60_770x554.jpeg 848w, https://substackcdn.com/image/fetch/$s_!vSad!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1173c2d-8ceb-4f48-9cf1-05423a01ef60_770x554.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!vSad!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa1173c2d-8ceb-4f48-9cf1-05423a01ef60_770x554.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>I might have something for that. </p><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.databricksters.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.databricksters.com/subscribe?"><span>Subscribe now</span></a></p><p>One of the cornerstones of the Databricks value-add in AI is that we are a model-provider-neutral platform. We offer native pay-per-token hosting for open-source model families like Llama, Gemma, and GPT OSS, and we have first-party connections with Claude, OpenAI, and Gemini. However, if you want to control costs, our current AI Gateway offering only lets you do so via QPM (queries per minute) rate limiting. QPM certainly has its use cases, but the majority of companies don&#8217;t care how many times per minute their employees or end users hit a model; they care about how much it&#8217;s going to cost them.</p><p>Luckily, with Lakebase, token-based rate limiting is now possible, and the implementation is simple: a user submits a request, which the endpoint validates via queries to two Lakebase tables, one to determine that user&#8217;s token limits and another to determine how far into those limits they already are. If the user is out of tokens, a cutoff message is returned and the request never hits the FM. Otherwise, the request is passed to the FM and the payload is written back to Lakebase so that the user&#8217;s total token count is updated. 
Finally, the response is returned to the end user with a message noting their remaining token balance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!KIO-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20253005-5ce7-425b-9167-94487120028d_1800x1022.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!KIO-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20253005-5ce7-425b-9167-94487120028d_1800x1022.png 424w, https://substackcdn.com/image/fetch/$s_!KIO-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20253005-5ce7-425b-9167-94487120028d_1800x1022.png 848w, https://substackcdn.com/image/fetch/$s_!KIO-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20253005-5ce7-425b-9167-94487120028d_1800x1022.png 1272w, https://substackcdn.com/image/fetch/$s_!KIO-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20253005-5ce7-425b-9167-94487120028d_1800x1022.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!KIO-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20253005-5ce7-425b-9167-94487120028d_1800x1022.png" width="1456" height="827" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20253005-5ce7-425b-9167-94487120028d_1800x1022.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:827,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:104493,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/177042002?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20253005-5ce7-425b-9167-94487120028d_1800x1022.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!KIO-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20253005-5ce7-425b-9167-94487120028d_1800x1022.png 424w, https://substackcdn.com/image/fetch/$s_!KIO-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20253005-5ce7-425b-9167-94487120028d_1800x1022.png 848w, https://substackcdn.com/image/fetch/$s_!KIO-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20253005-5ce7-425b-9167-94487120028d_1800x1022.png 1272w, https://substackcdn.com/image/fetch/$s_!KIO-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20253005-5ce7-425b-9167-94487120028d_1800x1022.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Great, let&#8217;s see some code then, huh?</p><p>First we need to install <code>psycopg2</code>:</p><pre><code>%pip install psycopg2
dbutils.library.restartPython()</code></pre><p>And set a few environment variables from a Lakebase instance:</p><pre><code>import mlflow.pyfunc
import os

os.environ['OPENAI_API_KEY'] = '' # or whatever FM API key
os.environ['DATABRICKS_TOKEN'] = ''
os.environ['POSTGRES_HOST'] = ''
os.environ['POSTGRES_DBNAME'] = 'databricks_postgres' # or ''
os.environ['POSTGRES_USER'] = ''
os.environ['POSTGRES_SSLMODE'] = ''
os.environ['POSTGRES_PORT'] = '5432' # must be a string; os.environ only stores strings
os.environ['POSTGRES_PASSWORD'] = ''</code></pre><p>For the demonstration, let&#8217;s create a couple of quick example tables and populate the <code>user_token_limits</code> table with a record:</p><pre><code>%sql
-- Create token_usage table for tracking all API calls
CREATE TABLE IF NOT EXISTS token_usage (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    user_name VARCHAR(255) NOT NULL,
    model_name VARCHAR(100) NOT NULL,
    prompt_tokens INTEGER NOT NULL,
    completion_tokens INTEGER NOT NULL,
    total_tokens INTEGER NOT NULL,
    request_timestamp TIMESTAMP NOT NULL,
    request_id VARCHAR(255),
    response_content TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
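
-- (Optional addition, not in the original post: the rate limiter sums
-- token_usage by user and model on every request, so a multicolumn index
-- on those columns keeps the check fast as the table grows)
CREATE INDEX IF NOT EXISTS idx_token_usage_user_model
    ON token_usage (user_name, model_name);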

-- Create user_token_limits table for managing quotas
CREATE TABLE IF NOT EXISTS user_token_limits (
    id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
    user_name VARCHAR(255) NOT NULL,
    model_name VARCHAR(100) NOT NULL,
    token_limit INTEGER NOT NULL,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    updated_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

-- Insert sample user limit
INSERT INTO user_token_limits (user_name, model_name, token_limit) 
VALUES ('test.user@databricks.com', 'gpt-4.1-2025-04-14', 1000);</code></pre><p>Obviously you could do the above in the PostgreSQL editor, but might as well use the notebook since we&#8217;re here.</p><p>And now we can define our rate limiter. Note that this is <em>extremely</em> flexible. Any kind of rate limiting you can think up is doable as long as you can translate it into PostgreSQL. That means per user, per user per model, per user per model per unit time, and so on are all at your fingertips. I&#8217;m going to define a simple per user per model rate limit as hinted above and populate that with a token cutoff of just 1000 tokens on GPT 4.1:</p><pre><code>import mlflow
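# (Aside, not from the original repo: the "per unit time" variants mentioned
# above only change the usage query. For example, a rolling 24-hour budget
# would add a time filter to the WHERE clause of the SUM query used below:
#     AND request_timestamp &gt;= NOW() - INTERVAL '24 hours'
# The comparison against token_limit stays exactly the same.)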
from mlflow.types import DataType, Schema, ColSpec
import mlflow.models
import json
import pandas as pd
import psycopg2
import requests
from datetime import datetime
import os

class TokenLimitedGatewayModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        """Initialize database connection and endpoint URL"""
        self.conn = psycopg2.connect(
            host=os.environ['POSTGRES_HOST'],
            dbname=os.environ['POSTGRES_DBNAME'],
            user=os.environ['POSTGRES_USER'],
            password=os.environ['POSTGRES_PASSWORD'],
            port=int(os.environ.get('POSTGRES_PORT', '5432')),
            sslmode=os.environ.get('POSTGRES_SSLMODE', 'require')
        )
        self.conn.autocommit = True
        self.cursor = self.conn.cursor()
        
        # FM endpoint
        self.fm_endpoint = ""
        
        # Get API token from environment if needed
        self.api_token = os.environ.get('DATABRICKS_TOKEN', '')
        print("Model context loaded successfully")

    def predict(self, context, model_input):
        """Process request with token limit checking"""
        
        # Handle different input types
        if isinstance(model_input, pd.DataFrame):
            # Convert DataFrame to dict and get first row
            if len(model_input) &gt; 0:
                data = model_input.iloc[0].to_dict()
            else:
                return {"error": "Empty input DataFrame"}
        elif isinstance(model_input, dict):
            data = model_input
        else:
            # Try to convert to dict
            try:
                data = dict(model_input)
            except Exception:
                return {"error": f"Unsupported input type: {type(model_input)}"}
        
        # Extract and parse messages
        messages = data.get("messages", [])
        if isinstance(messages, str):
            try:
                messages = json.loads(messages)
            except json.JSONDecodeError:
                return {"error": "Invalid JSON in messages field"}
        
        # Extract parameters with defaults
        user_name = str(data.get("user_name", "test.user@databricks.com"))
        model_name = str(data.get("model", "gpt-4.1-2025-04-14"))
        
        # Handle a missing max_tokens; this is on the request side, not the rate limiter
        max_tokens_raw = data.get("max_tokens", 128)
        if max_tokens_raw is None or pd.isna(max_tokens_raw):
            max_tokens = 128
        else:
            max_tokens = int(max_tokens_raw)
        
        # Handle a missing temperature
        temperature_raw = data.get("temperature", 0.7)
        if temperature_raw is None or pd.isna(temperature_raw):
            temperature = 0.7
        else:
            temperature = float(temperature_raw)
        
        # Check current token usage
        self.cursor.execute("""
            SELECT COALESCE(SUM(total_tokens), 0) as total_used
            FROM token_usage 
            WHERE user_name = %s AND model_name = %s
        """, (user_name, model_name))
        
        result = self.cursor.fetchone()
        tokens_used = int(result[0]) if result and result[0] else 0
        
        # Check the user's token limit
        self.cursor.execute("""
            SELECT token_limit 
            FROM user_token_limits 
            WHERE user_name = %s AND model_name = %s
        """, (user_name, model_name))
        
        limit_result = self.cursor.fetchone()
        
        if not limit_result:
            return {"error": f"No token limit found for user {user_name} and model {model_name}"}
        
        token_limit = int(limit_result[0])
        
        # Check if the limit is exceeded
        if tokens_used &gt;= token_limit:
            return {
                "error": f"Token limit exceeded. Used: {tokens_used}, Limit: {token_limit}",
                "tokens_used": tokens_used,
                "token_limit": token_limit
            }
        
        # Prepare request for FM endpoint
        fm_request = {
            "messages": messages,
            "max_tokens": max_tokens,
            "temperature": temperature
        }
        
        headers = {
            "Content-Type": "application/json"
        }
        
        if self.api_token:
            headers["Authorization"] = f"Bearer {self.api_token}"
        
        try:
            # Call FM endpoint
            response = requests.post(
                self.fm_endpoint,
                json=fm_request,
                headers=headers,
                timeout=30
            )
            response.raise_for_status()
            
            fm_response = response.json()
            
            # Extract token usage from response
            usage = fm_response.get("usage", {})
            prompt_tokens = int(usage.get("prompt_tokens", 0))
            completion_tokens = int(usage.get("completion_tokens", 0))
            total_tokens = int(usage.get("total_tokens", 0))
            
            # Log token usage to database
            self.cursor.execute("""
                INSERT INTO token_usage (
                    user_name, 
                    model_name, 
                    prompt_tokens, 
                    completion_tokens, 
                    total_tokens, 
                    request_timestamp,
                    request_id,
                    response_content
                ) VALUES (%s, %s, %s, %s, %s, %s, %s, %s)
            """, (
                user_name,
                model_name,
                prompt_tokens,
                completion_tokens,
                total_tokens,
                datetime.utcnow(),
                fm_response.get("id", ""),
                json.dumps(fm_response)
            ))
            
            # Add usage info to response
            fm_response["usage_info"] = {
                "tokens_used_total": tokens_used + total_tokens,
                "token_limit": token_limit,
                "tokens_remaining": token_limit - (tokens_used + total_tokens)
            }
            
            return fm_response
            
        except requests.exceptions.RequestException as e:
            return {
                "error": f"Failed to call FM endpoint: {str(e)}",
                "tokens_used": tokens_used,
                "token_limit": token_limit
            }
        except Exception as e:
            return {
                "error": f"Unexpected error: {str(e)}",
                "tokens_used": tokens_used,
                "token_limit": token_limit
            }</code></pre><p>The astute among you will draw attention to any of the following annoyances:</p><ol><li><p>Now I need to pay for a custom model serving endpoint on top of my token costs to the foundation model(s), that&#8217;s so counterproductive!</p><ol><li><p>Ok fair, but a minimum provisioned CPU endpoint costs $0.28 per hour and can handle a relatively large request volume since it&#8217;s not actually performing any calculations except a simple comparison operation to check your token limits. If you have hundreds of users calling this endpoint per second, then yeah it might break, but for a lot of companies this guaranteed hit of $0.28/hr is worth the protection against a potentially much larger bill if some of my employees run up a huge tab without me knowing.</p></li></ol></li><li><p>This is going to add latency, and at scale I can&#8217;t abide this</p><ol><li><p>Also fair, and I would say bulk queries should certainly be run through <code>ai_query()</code> to obtain serious scale via parallel requests, but what about all your tinkerers? Your BI Analysts, your data scientists, your citizen GenAI practitioners, etc.? Are they hitting 100 QPS?</p></li></ol></li><li><p>Every new model is going to require me to set up a new config, ain&#8217;t nobody got time for that</p><ol><li><p>Yes, but, per user per model rate limiting is just an example I used to show how much specificity you could add to this if and only if you wanted to. You could instead set this up one time to handle requests to any of the main endpoints your employees are calling (new model additions are likely to follow the same API patterns as their predecessors) and only limit per user or per user per unit time. 
This simplifies the deployment and management.</p></li></ol></li></ol><p>With the totally reasonable objections out of the way, let&#8217;s log and register this thing and then I&#8217;ll leave you with a couple concluding thoughts:</p><pre><code># Define signature - all fields required
input_schema = Schema([
    ColSpec(DataType.string, "messages"),
    ColSpec(DataType.string, "user_name"),
    ColSpec(DataType.string, "model"),
    ColSpec(DataType.long, "max_tokens"),
    ColSpec(DataType.double, "temperature")
])

output_schema = Schema([
    ColSpec(DataType.string, "response")
])

signature = mlflow.models.ModelSignature(
    inputs=input_schema,
    outputs=output_schema
)

pip_requirements = [
    "mlflow",
    "requests",
    "psycopg2-binary",
    "pandas"
]


# Create test DataFrame (simulating what the serving endpoint sends)
test_df = pd.DataFrame([{
    "messages": json.dumps([
        {"role": "user", "content": "Say 'Test Successful' and nothing else"}
    ]),
    "user_name": "test.user@databricks.com",
    "model": "gpt-4.1-2025-04-14",
    "max_tokens": 50,
    "temperature": 0.7
}])

print("Test input DataFrame:")
print(test_df)

model = TokenLimitedGatewayModel()
model.load_context(None)

print("\nTesting with DataFrame input...")
result = model.predict(None, test_df)
if "error" not in result:
    print("Test successful!")
    if "choices" in result:
        print(f"Response: {result['choices'][0]['message']['content']}")
    print(f"Usage info: {result.get('usage_info', {})}")
else:
    print(f"Error: {result['error']}")
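
# (Looking ahead -- a sketch, not in the original post: once the registered
# model below is deployed to a serving endpoint, callers would send the same
# fields using the dataframe_records input format. The endpoint name and
# workspace URL here are placeholders:
#
#   requests.post(
#       "https://{workspace-url}/serving-endpoints/token-gateway/invocations",
#       headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
#       json={"dataframe_records": test_df.to_dict(orient="records")},
#   )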

# Log the model
with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(
        artifact_path="token_gateway",
        python_model=TokenLimitedGatewayModel(),
        pip_requirements=pip_requirements,
        signature=signature
    )
    
    model_uri = f"runs:/{run.info.run_id}/token_gateway"
    print(f"Model logged with URI: {model_uri}")
    print(f"Run ID: {run.info.run_id}")

# Register to Unity Catalog
catalog = ""
schema = ""
model_name = "token_limited_gateway"

registered_model = mlflow.register_model(
    model_uri=model_uri,
    name=f"{catalog}.{schema}.{model_name}",
    tags={
        "use_case": "rate_limiting", 
        "model_type": "gateway",
        "backend": "openai_gpt4",
        "database": "lakebase_postgres",
        "version": "dataframe_compatible"
    }
)</code></pre><p>All done.</p><p>Now we have another expense that can&#8217;t scale, adds latency, and is another component to maintain. </p><p>Or, we have a relatively low cost insurance policy against runaway token costs for employees we really wanted to enable on all our LLM endpoints but have previously been too worried about cost controls to do so. </p><p>Only you know which of these is the &#8220;correct&#8221; interpretation. My guess is both are right depending on your users and use case. It&#8217;s not perfect, but I said it was medieval right at the outset. </p><p>Cheers and happy coding!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.databricksters.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Databricksters! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p></p>]]></content:encoded></item><item><title><![CDATA[Agents are like onions (they have layers)]]></title><description><![CDATA[Using custom scorers to investigate spans within a trace in MLflow 3+.]]></description><link>https://www.databricksters.com/p/agents-are-like-onions-they-have</link><guid isPermaLink="false">https://www.databricksters.com/p/agents-are-like-onions-they-have</guid><dc:creator><![CDATA[Veena]]></dc:creator><pubDate>Wed, 03 Sep 2025 16:27:33 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!6qcq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2efd0b1d-a602-4879-820a-52fc9c6d8148_1280x720.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!6qcq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2efd0b1d-a602-4879-820a-52fc9c6d8148_1280x720.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!6qcq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2efd0b1d-a602-4879-820a-52fc9c6d8148_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6qcq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2efd0b1d-a602-4879-820a-52fc9c6d8148_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6qcq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2efd0b1d-a602-4879-820a-52fc9c6d8148_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6qcq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2efd0b1d-a602-4879-820a-52fc9c6d8148_1280x720.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!6qcq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2efd0b1d-a602-4879-820a-52fc9c6d8148_1280x720.jpeg" width="1280" height="720" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/2efd0b1d-a602-4879-820a-52fc9c6d8148_1280x720.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:720,&quot;width&quot;:1280,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:104376,&quot;alt&quot;:&quot;Shrek characters Object Detection Dataset by ML Labs&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Shrek characters Object Detection Dataset by ML Labs" title="Shrek characters Object Detection Dataset by ML Labs" srcset="https://substackcdn.com/image/fetch/$s_!6qcq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2efd0b1d-a602-4879-820a-52fc9c6d8148_1280x720.jpeg 424w, https://substackcdn.com/image/fetch/$s_!6qcq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2efd0b1d-a602-4879-820a-52fc9c6d8148_1280x720.jpeg 848w, https://substackcdn.com/image/fetch/$s_!6qcq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2efd0b1d-a602-4879-820a-52fc9c6d8148_1280x720.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!6qcq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2efd0b1d-a602-4879-820a-52fc9c6d8148_1280x720.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image of Shrek. He looks confused. </figcaption></figure></div><p>Agents are becoming more sophisticated. Traditionally, we use LLM judges to assess the quality of an Agent&#8217;s performance. Databricks has its <a href="https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/predefined-judge-scorers#overview">own suite of pre-defined LLM scorers</a> that can evaluate safety, correctness, etc. For RAG agents, there are scorers that can help you analyze retrieval groundedness and retrieval relevance that can help analyze the quality of the retriever (aka the Vector Search). 
However, when the agent has access to multiple tools, end-to-end evaluation is not sufficient.</p><p>Imagine that we have an agent with access to several tools:</p><ul><li><p>A python code execution function</p></li><li><p>A retriever that gets relevant Databricks documentation</p></li><li><p>A translation function</p></li><li><p>A text summarization function</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!D10d!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3d2ef5-0589-43ef-9402-95a672861a75_983x1248.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!D10d!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3d2ef5-0589-43ef-9402-95a672861a75_983x1248.png 424w, https://substackcdn.com/image/fetch/$s_!D10d!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3d2ef5-0589-43ef-9402-95a672861a75_983x1248.png 848w, https://substackcdn.com/image/fetch/$s_!D10d!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3d2ef5-0589-43ef-9402-95a672861a75_983x1248.png 1272w, https://substackcdn.com/image/fetch/$s_!D10d!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3d2ef5-0589-43ef-9402-95a672861a75_983x1248.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!D10d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3d2ef5-0589-43ef-9402-95a672861a75_983x1248.png" width="728" height="924.2563580874872" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7d3d2ef5-0589-43ef-9402-95a672861a75_983x1248.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:1248,&quot;width&quot;:983,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:101236,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:&quot;center&quot;,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!D10d!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3d2ef5-0589-43ef-9402-95a672861a75_983x1248.png 424w, https://substackcdn.com/image/fetch/$s_!D10d!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3d2ef5-0589-43ef-9402-95a672861a75_983x1248.png 848w, https://substackcdn.com/image/fetch/$s_!D10d!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3d2ef5-0589-43ef-9402-95a672861a75_983x1248.png 1272w, https://substackcdn.com/image/fetch/$s_!D10d!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7d3d2ef5-0589-43ef-9402-95a672861a75_983x1248.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As you can see in the diagram above, there are at minimum two LLM calls-- one when the agent begins and decides what tool to use (if there is a tool to be used) and another when to aggregate the tool output. How do we know if the LLM is making smart decisions about when to use these tools and what tools to use?</p><p>Consider these examples:</p><ul><li><p>Request: &#8220;What&#8217;s 2+2?&#8221;</p><ul><li><p>The agent should not use any tools, since this is quite simple math.</p></li></ul></li><li><p>Request: &#8220;What would <em>print(f&#8221;Random string {variable}&#8221;)</em> output in Python?&#8221;</p><ul><li><p>The agent should use the execute python function.</p></li></ul></li><li><p>Request: &#8220;How do I create a Databricks cluster?&#8221;</p><ul><li><p>The agent should use the retriever to get relevant docs.</p></li></ul></li></ul><p></p><p><a href="https://docs.databricks.com/aws/en/mlflow3/genai/tracing">With MLflow, we can generate traces easily. 
</a>In the example agent I am using here, I use langgraph and declare mlflow.langchain.autolog() to automatically trace every call in the app. Standard MLflow scorers would evaluate the final response quality, but they could miss whether the agent made smart tool choices. Custom scorers offer some flexibility here-- we can define one to break a trace down into its spans, independently analyzing each tool call.</p><h3>How do we implement a custom scorer?</h3><p>There are two main ways of implementing a custom scorer: (1) using the @scorer decorator for a python function or (2) using the Scorer class for more complex scorers that require state. In our example here, we are using the Scorer class and overriding the __call__ method.</p><p><a href="https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/scorers">All scorers retriev</a>e the same inputs:</p><pre><code>  def __call__(
       self,
       *,
       inputs: Optional[dict[str, Any]],
       outputs: Optional[Any],
       expectations: Optional[dict[str, Any]],
       trace: Optional[mlflow.entities.Trace]
   ) -&gt; Feedback:</code></pre><p>First, we are going to determine the required tools. For each user input, the scorer uses an LLM judge to determine which tools should be required. MLflow offers two options for creating a judge: prompt based judges and guideline judges. <a href="https://docs.databricks.com/aws/en/mlflow3/genai/eval-monitor/concepts/judges/prompt-based-judge">Prompt based judges</a>, using the custom_prompt_judge function, allow you to define custom categories (rather than pass/fail binaries) and have full prompt control.</p><p>Here, we are using a custom prompt and mapping our outputs (required and not required) to numeric values, which can be aggregated later on.</p><pre><code>from mlflow.genai.judges import custom_prompt_judge 
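# The snippet below references tool_requirement_prompt without defining it.
# One plausible shape (my assumption, not the author's actual template) uses
# {{placeholders}} matching the judge call's keyword arguments, with each
# output choice wrapped in double square brackets:
tool_requirement_prompt = """
You are deciding whether a tool is needed to answer a user request.

Request: {{inputs}}
Tool: {{tool_name}} - {{tool_description}}

Answer [[required]] if the agent should call this tool, or [[not_required]]
if the request can be answered without it.
"""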

   def determine_required_tools(self, user_input: str, tools: Dict[str, str]) -&gt; Dict[str, bool]:
       required_tools = {}
       for tool_name, tool_description in tools.items():
           judge = custom_prompt_judge(
               name=f"{tool_name}_requirement_judge",
               prompt_template=tool_requirement_prompt,
               numeric_values={"required": 1.0, "not_required": 0.0}
           )
           result = judge(
               inputs=user_input,
               tool_name=tool_name,
               tool_description=tool_description
           )
           required_tools[tool_name] = result.value == 1.0
       return required_tools
</code></pre><p>Second, we are going to extract the actual behavior from the trace spans. MLflow automatically captures every tool call as a span with SpanType.TOOL. Our scorer searches through each span to get the tool name, tool response, and the tool status.</p><pre><code>  def extract_used_tools_from_trace(self, trace: mlflow.entities.Trace) -&gt; List[Dict[str, Any]]:
       tools = []
      
       spans = trace.search_spans(span_type=SpanType.TOOL)
       for span in spans:
           messages = span.get_attribute(SpanAttributeKey.OUTPUTS)
           content = json.loads(messages['content'])
           t = {"tool_call_id": messages['tool_call_id'],
               "tool_name": messages['name'],
               "tool_response": content['value'],
               "tool_status": messages['status']}
          
           tools.append(t)
        return tools</code></pre><p>This gives us more visibility into the agent&#8217;s decision-making process. Now that we know which tools were actually used, we can score this against what should have happened.</p><pre><code>  def compare_tool_usage(self, required_tools: Dict[str, bool], used_tools: List[Dict[str, Any]]) -&gt; Dict[str, Any]:
        # Names of required tools not yet matched; whatever remains at the
        # end was required but never used
        missing_required = [name for name, required in required_tools.items() if required]
        correctly_used_tools = []
        failed_required_tools = []
        incorrectly_used_tools = []


        for tool in used_tools:
            if tool['tool_name'] in missing_required:
                correctly_used_tools.append(tool)
                missing_required.remove(tool['tool_name'])
                # Custom logic to determine if tool execution was successful
                response = self.interpret_tool_call_response(tool['tool_name'], tool['tool_response'], tool['tool_status'])
                if response != "success":
                    failed_required_tools.append(tool)
            else:
                incorrectly_used_tools.append(tool)
               
        return {
            "correctly_used_tools": correctly_used_tools, # used and required! :)
            "incorrectly_used_tools": incorrectly_used_tools, # used but not required
            "failed_required_tools": failed_required_tools, # required but the response was not successful
            "missing_required_tools": missing_required # required but not used
        }</code></pre><p>Finally, we can compute the score based on the number of mistakes that the LLM made. The output of the Scorer should be a Feedback object. We can add a rationale here as well, so when the scorer is used on new traces, we can look at the rationale directly in the MLflow Tracing UI.</p><p>The value of the Feedback object can be anything. Here we define it as a simple Boolean, but it can be a Float, Int, String, a List, or a Dict. Again, lots of flexibility in how you define the LLM Scorers!</p><pre><code>        # Pass only when the comparison found no mistakes (condition reconstructed
        # from the compare_tool_usage output above)
        usage = self.compare_tool_usage(required_tools, used_tools)
        if not (usage["incorrectly_used_tools"]
                or usage["failed_required_tools"]
                or usage["missing_required_tools"]):
            return Feedback(
                value=True,
                rationale="Used required tools properly. No feedback needed.",
                source=AssessmentSource(source_type="LLM_JUDGE", source_id="tool_usage_scorer")
            )
        else:
            return Feedback(
                value=False,
                rationale="Incorrectly used tools. Check metadata for more information.",
                source=AssessmentSource(source_type="LLM_JUDGE", source_id="tool_usage_scorer")
                )</code></pre><p>You can now use the scorer in your evaluation workflow.</p><h3>Key Takeaways</h3><p>By implementing custom scorers that analyze the spans within a trace, you can catch inappropriate LLM tool calls and flag unnecessary tool usage. To use a custom scorer in MLflow:</p><ol><li><p>Define your evaluation criteria (what does good tool usage mean for your use case? Do you always want to rely on tools?)</p></li><li><p>Create the custom scorer class</p></li><li><p>Integrate it with the evaluation pipeline via mlflow.genai.evaluate()</p></li></ol><p>Practically, start by tracking one or two critical tools rather than trying to evaluate everything at once. In this example, I count every single mistake, but define what &#8216;too many mistakes&#8217; means in your use case; you may, for instance, want to penalize only the failed tools and missing required tools.</p><p>Custom MLflow scorers give you the control to build more reliable agents. <a href="https://gist.github.com/veenaramesh/2c91cc8b8d0f7c688b4f238c5526b12c">Take a look at the full code implementation here.</a></p><p>Happy scoring!</p>]]></content:encoded></item><item><title><![CDATA[Doctors HATE this one dependency trick!]]></title><description><![CDATA[A quick guide to dependency management for machine learning using MLflow 3+.]]></description><link>https://www.databricksters.com/p/doctors-hate-this-one-dependency</link><guid isPermaLink="false">https://www.databricksters.com/p/doctors-hate-this-one-dependency</guid><dc:creator><![CDATA[Veena]]></dc:creator><pubDate>Tue, 19 Aug 2025 16:33:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!PGxA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608e1ed6-1ca4-4e7b-acec-e70dca3d75b1_988x575.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Life was promised to be simple. 
You log your model and then you deploy the model. This is supposed to be easy! But even if everything works perfectly in your notebook, once you deploy it, dependency issues that you have never seen before pop up. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!PGxA!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608e1ed6-1ca4-4e7b-acec-e70dca3d75b1_988x575.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!PGxA!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608e1ed6-1ca4-4e7b-acec-e70dca3d75b1_988x575.png 424w, https://substackcdn.com/image/fetch/$s_!PGxA!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608e1ed6-1ca4-4e7b-acec-e70dca3d75b1_988x575.png 848w, https://substackcdn.com/image/fetch/$s_!PGxA!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608e1ed6-1ca4-4e7b-acec-e70dca3d75b1_988x575.png 1272w, https://substackcdn.com/image/fetch/$s_!PGxA!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608e1ed6-1ca4-4e7b-acec-e70dca3d75b1_988x575.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!PGxA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608e1ed6-1ca4-4e7b-acec-e70dca3d75b1_988x575.png" width="988" height="575" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/608e1ed6-1ca4-4e7b-acec-e70dca3d75b1_988x575.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:575,&quot;width&quot;:988,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!PGxA!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608e1ed6-1ca4-4e7b-acec-e70dca3d75b1_988x575.png 424w, https://substackcdn.com/image/fetch/$s_!PGxA!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608e1ed6-1ca4-4e7b-acec-e70dca3d75b1_988x575.png 848w, https://substackcdn.com/image/fetch/$s_!PGxA!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608e1ed6-1ca4-4e7b-acec-e70dca3d75b1_988x575.png 1272w, https://substackcdn.com/image/fetch/$s_!PGxA!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F608e1ed6-1ca4-4e7b-acec-e70dca3d75b1_988x575.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image showing what direct dependencies and transitive dependencies are. Direct dependencies are explicitly used in the project, and transitive dependencies are included as they are the dependencies of the direct dependencies.</figcaption></figure></div><p>Most dependency resolution issues are because of conflicting transitive dependencies. Here is what most data scientists are doing today: </p><ol><li><p>Experiment in notebooks, adding new libraries via pip within a notebook. </p></li><li><p>Log models using <code>mlflow</code> . </p></li><li><p>Register a model to UC and deploy. </p></li><li><p>Cross your fingers and hope everything works. </p></li><li><p>[optional] Get frustrated with Model Serving. </p></li></ol><p><code>mlflow</code> only infers the direct dependencies when you log a model. It identifies them by examining the flavor and the packages used in the model&#8217;s predict function. For more information, check out <code>mlflow.models.infer_pip_requirements()</code>. 
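</p><p>To see why this matters, here is a toy sketch (not MLflow internals; the package names and graph below are made up) of how the transitive closure of a single direct dependency can pull in packages you never pinned:</p><pre><code># Toy dependency graph: only "scikit-learn" is a direct dependency of the
# model; everything it pulls in is transitive. (Names are illustrative.)
graph = {
    "my-model": ["scikit-learn"],
    "scikit-learn": ["numpy", "scipy"],
    "scipy": ["numpy"],
    "numpy": [],
}

def transitive_closure(pkg, graph):
    # Walk the graph to collect every package the root ultimately depends on.
    seen, stack = set(), list(graph[pkg])
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            stack.extend(graph.get(dep, []))
    return sorted(seen)

print(transitive_closure("my-model", graph))
# ['numpy', 'scikit-learn', 'scipy']</code></pre><p>MLflow&#8217;s inference stops at the first level of this graph.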
</p><p>Usually, this means that dependencies are only resolved during Model Serving, which means dependency errors are not caught until then. Waiting for the Serverless compute and the container to build will only extend the developer loop, making it harder to iterate. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!y6cU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7989788f-c032-4858-a1d9-f29a4e5e1370_835x461.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!y6cU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7989788f-c032-4858-a1d9-f29a4e5e1370_835x461.png 424w, https://substackcdn.com/image/fetch/$s_!y6cU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7989788f-c032-4858-a1d9-f29a4e5e1370_835x461.png 848w, https://substackcdn.com/image/fetch/$s_!y6cU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7989788f-c032-4858-a1d9-f29a4e5e1370_835x461.png 1272w, https://substackcdn.com/image/fetch/$s_!y6cU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7989788f-c032-4858-a1d9-f29a4e5e1370_835x461.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!y6cU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7989788f-c032-4858-a1d9-f29a4e5e1370_835x461.png" width="835" height="461" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7989788f-c032-4858-a1d9-f29a4e5e1370_835x461.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:461,&quot;width&quot;:835,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!y6cU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7989788f-c032-4858-a1d9-f29a4e5e1370_835x461.png 424w, https://substackcdn.com/image/fetch/$s_!y6cU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7989788f-c032-4858-a1d9-f29a4e5e1370_835x461.png 848w, https://substackcdn.com/image/fetch/$s_!y6cU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7989788f-c032-4858-a1d9-f29a4e5e1370_835x461.png 1272w, https://substackcdn.com/image/fetch/$s_!y6cU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7989788f-c032-4858-a1d9-f29a4e5e1370_835x461.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 
7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image showing that the requirements.txt file in the Registered Model artifacts are used to install all dependencies in the Model Serving environment. </figcaption></figure></div><h2>Lock your dependencies during development</h2><p>In <code>mlflow</code> 3+, you can enable dependency locking with <code>uv</code>, which would allow you to use the standard <code>mlflow</code> logging workflow. </p><pre><code><code>import os
import mlflow

os.environ["MLFLOW_LOCK_MODEL_DEPENDENCIES"] = "true"

# Now when you log your model, MLflow will capture 
# both direct AND transitive dependencies

mlflow.sklearn.log_model(
    model, 
    "my_model",
)</code></code></pre><p>Your workflow does not have to change at all. You can still use <code>extra_pip_requirements</code>, <code>pip_requirements</code>, or allow <code>mlflow</code> to infer all direct dependencies. The environment variable now enables <code>uv</code> to resolve dependencies during logging time and will capture pinned direct and transitive dependencies. Now, your <code>requirements.txt </code>file will contain all the dependencies you need. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aHcq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a9d835-1e25-4fc8-9f15-a7de811b5f34_1920x1080.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aHcq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a9d835-1e25-4fc8-9f15-a7de811b5f34_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!aHcq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a9d835-1e25-4fc8-9f15-a7de811b5f34_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!aHcq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a9d835-1e25-4fc8-9f15-a7de811b5f34_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!aHcq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a9d835-1e25-4fc8-9f15-a7de811b5f34_1920x1080.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!aHcq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a9d835-1e25-4fc8-9f15-a7de811b5f34_1920x1080.png" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f3a9d835-1e25-4fc8-9f15-a7de811b5f34_1920x1080.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:216742,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/171375832?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a9d835-1e25-4fc8-9f15-a7de811b5f34_1920x1080.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aHcq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a9d835-1e25-4fc8-9f15-a7de811b5f34_1920x1080.png 424w, https://substackcdn.com/image/fetch/$s_!aHcq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a9d835-1e25-4fc8-9f15-a7de811b5f34_1920x1080.png 848w, https://substackcdn.com/image/fetch/$s_!aHcq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a9d835-1e25-4fc8-9f15-a7de811b5f34_1920x1080.png 1272w, https://substackcdn.com/image/fetch/$s_!aHcq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff3a9d835-1e25-4fc8-9f15-a7de811b5f34_1920x1080.png 1456w" 
sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Image comparing the requirements.txt file artifact using dependency locking vs not using dependency locking. </figcaption></figure></div><p>Dependency resolution occurs during model logging time instead of serving time, and we have automatic dependency locking when logging a model. </p><p>However, since <code>uv</code> resolves all of the dependencies, the transitive dependencies captured are often more recent than the packages installed by default on the DBR. 
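</p><p>As a quick illustration, you can spot this kind of drift by comparing the locked pins against what the environment actually has. This is a standalone sketch, not MLflow code, and the version numbers are illustrative:</p><pre><code># Sketch: compare locked pins from a model's requirements.txt against the
# versions installed in the current environment. Versions here are made up.
def find_mismatches(locked, installed):
    """Return {package: (installed_version, locked_version)} for drifted pins."""
    return {
        pkg: (installed.get(pkg), pin)
        for pkg, pin in locked.items()
        if installed.get(pkg) != pin
    }

locked = {"cloudpickle": "3.1.1", "numpy": "1.26.4"}      # from requirements.txt
installed = {"cloudpickle": "2.2.1", "numpy": "1.26.4"}   # e.g. DBR defaults

print(find_mismatches(locked, installed))
# {'cloudpickle': ('2.2.1', '3.1.1')}</code></pre><p>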
So, we will usually get &#8216;warnings&#8217; that the dependencies captured by <code>uv</code> are different from the transitive dependencies in the environment. Why is this a problem? Our training environment and serving environment are still different, which means we could still get behavior differences between our notebook and our deployment. </p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q_rN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e97904-fb95-4e6f-9cc5-fff4a965b0e5_2048x115.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q_rN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e97904-fb95-4e6f-9cc5-fff4a965b0e5_2048x115.png 424w, https://substackcdn.com/image/fetch/$s_!Q_rN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e97904-fb95-4e6f-9cc5-fff4a965b0e5_2048x115.png 848w, https://substackcdn.com/image/fetch/$s_!Q_rN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e97904-fb95-4e6f-9cc5-fff4a965b0e5_2048x115.png 1272w, https://substackcdn.com/image/fetch/$s_!Q_rN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e97904-fb95-4e6f-9cc5-fff4a965b0e5_2048x115.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q_rN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e97904-fb95-4e6f-9cc5-fff4a965b0e5_2048x115.png" width="1456" height="82" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d7e97904-fb95-4e6f-9cc5-fff4a965b0e5_2048x115.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:82,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:&quot;Screenshot 2025-07-28 at 5.44.57&#8239;PM.png&quot;,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="Screenshot 2025-07-28 at 5.44.57&#8239;PM.png" srcset="https://substackcdn.com/image/fetch/$s_!Q_rN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e97904-fb95-4e6f-9cc5-fff4a965b0e5_2048x115.png 424w, https://substackcdn.com/image/fetch/$s_!Q_rN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e97904-fb95-4e6f-9cc5-fff4a965b0e5_2048x115.png 848w, https://substackcdn.com/image/fetch/$s_!Q_rN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e97904-fb95-4e6f-9cc5-fff4a965b0e5_2048x115.png 1272w, https://substackcdn.com/image/fetch/$s_!Q_rN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd7e97904-fb95-4e6f-9cc5-fff4a965b0e5_2048x115.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Example of a warning after using dependency locking: &#8220;Detected one or more mismatches between the model&#8217;s dependencies and the current Python environment: - cloudpickle (current: 2.2.1, required: 
cloudpickle==3.1.1)&#8221;</figcaption></figure></div><p>In the above example, we can see that <code>cloudpickle</code> often resolves to version 3.1.1, but in our recent DBRs, <code>cloudpickle</code> version 2.2.1 is installed. This is especially important because <code>cloudpickle</code> will always be a transitive dependency as <code>mlflow</code> relies on it. </p><h2>Using Databricks Asset Bundles</h2><p>We can resolve the inconsistency between the notebook environment and the serving environment using Databricks Asset Bundles. If we have a `dev` workspace and a `test` workspace, then we can use <code>mlflow</code> and <code>uv</code> to generate the requirements lock file in the `dev` workspace and add the requirements lock file as a dependency for the `test` workspace. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nohg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ca0da7d-6d0c-4864-b04c-a1fd8f1fbf40_301x644.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nohg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ca0da7d-6d0c-4864-b04c-a1fd8f1fbf40_301x644.png 424w, https://substackcdn.com/image/fetch/$s_!nohg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ca0da7d-6d0c-4864-b04c-a1fd8f1fbf40_301x644.png 848w, https://substackcdn.com/image/fetch/$s_!nohg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ca0da7d-6d0c-4864-b04c-a1fd8f1fbf40_301x644.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nohg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ca0da7d-6d0c-4864-b04c-a1fd8f1fbf40_301x644.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nohg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ca0da7d-6d0c-4864-b04c-a1fd8f1fbf40_301x644.png" width="301" height="644" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ca0da7d-6d0c-4864-b04c-a1fd8f1fbf40_301x644.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:644,&quot;width&quot;:301,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nohg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ca0da7d-6d0c-4864-b04c-a1fd8f1fbf40_301x644.png 424w, https://substackcdn.com/image/fetch/$s_!nohg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ca0da7d-6d0c-4864-b04c-a1fd8f1fbf40_301x644.png 848w, https://substackcdn.com/image/fetch/$s_!nohg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ca0da7d-6d0c-4864-b04c-a1fd8f1fbf40_301x644.png 1272w, 
https://substackcdn.com/image/fetch/$s_!nohg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ca0da7d-6d0c-4864-b04c-a1fd8f1fbf40_301x644.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This can all be easily orchestrated using Databricks Asset Bundles and Lakeflow Jobs. By installing the requirements lock file, we can override any conflicting transitive dependencies in the DBR and ensure that the training and serving environments are the exact same. Here is an example job config: </p><pre><code><code>resources:
  jobs:
    my_job:
      tasks:
        - task_key: train_model
          libraries:
            - requirements: ./requirements.txt  # Pre-resolved dependencies</code></code></pre><p>Now the entire pipeline uses the same dependencies from development through production. </p><h2>Dependency management is important! </h2><p>Dependency management is probably not the most exciting part of MLOps but it can easily become a migraine. The few extra minutes you spend considering dependency management will save you lots of time debugging serving deployment failures. </p><p>As always, let us know if you have any questions! </p>]]></content:encoded></item><item><title><![CDATA[Beyond the Pipeline: The Blueprint for Enterprise AI Platforms using Databricks]]></title><description><![CDATA[Moving past dependency hell requires more than code&#8212;it demands a shift to a govern-first architecture.]]></description><link>https://www.databricksters.com/p/beyond-the-pipeline-the-blueprint</link><guid isPermaLink="false">https://www.databricksters.com/p/beyond-the-pipeline-the-blueprint</guid><dc:creator><![CDATA[Debu Sinha]]></dc:creator><pubDate>Tue, 05 Aug 2025 16:18:01 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/2798b17e-6012-48f8-8ee4-177f25f8f5df_2048x2048.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>As a Lead AI/ML Specialist Architect, I see a universal story unfold. It begins with the triumph of a single model, but as organizations scale, a predictable crisis emerges. I call it the <strong>Pipeline Paradox</strong>: as the number of models grows, the complexity of managing them grows exponentially, causing fragility, operational drag, and slowing innovation to a crawl.</p><p>This isn't a failure of talent; it's the result of hitting an architectural wall. The solution isn't a single tool or trick. It's a fundamental shift in perspective&#8212;from building individual, brittle pipelines to engineering a unified platform that embraces distinct, purpose-built architectural patterns. 
This is that blueprint.</p><h3>The Foundation: A Unified Governance Layer</h3><p>Before discussing execution, we must establish the non-negotiable foundation: a unified governance model. <strong>Unity Catalog</strong> provides this by treating all data and AI assets as first-class, governable citizens within a single system. It underpins every pattern below by providing a single source of truth, fine-grained permissions, automated end-to-end lineage, and streamlined CI/CD with model aliases (<code>@champion</code>).</p><h2>Prerequisites and Requirements</h2><p>To implement these patterns successfully, ensure:</p><ul><li><p><strong>Databricks Runtime:</strong> 15.4 LTS or above for ai_query function support.</p></li><li><p><strong>Compute Type:</strong> Serverless SQL warehouses for Patterns 1 &amp; 2, Spark clusters for Pattern 3.</p></li><li><p><strong>Unity Catalog:</strong> Enabled for governance and model management.</p></li><li><p><strong>Model Format:</strong> MLflow-packaged models registered in Unity Catalog.</p></li><li><p><strong>Permissions:</strong> `USE_FUNCTION` privilege on ai_query, appropriate model access grants.</p></li></ul><h3>The Three Core Model Inference Patterns on Databricks</h3><p>A mature AI platform on Databricks offers three distinct approaches for model inference, each optimized for different workloads. Understanding their trade-offs is key to choosing the right pattern for your use case.</p><h4>Pattern 1: Real-Time Model Serving Pattern</h4><p>This pattern is optimized for low-latency, request-response interactions. The primary tool is <strong>Databricks Model Serving</strong>, and it's accessed from SQL using <code>AI_QUERY</code> for ad-hoc analysis.</p><p>Here is what this looks like in practice for a data analyst performing a quick lookup:</p><p>SQL</p><pre><code><code>-- An ad-hoc query to get a churn prediction for a specific, high-value customer.
SELECT
  customer_id,
  -- ai_query calls the serving endpoint for a real-time response.
  ai_query(
    endpoint =&gt; 'prod_customer_churn_model', -- The name of your deployed model endpoint
    request =&gt; named_struct(  -- Pass features as a named struct
      'account_age', account_age,
      'monthly_spend', monthly_spend,
      'support_tickets', support_tickets
    ),
    returnType =&gt; 'DOUBLE'  -- Specify the return type for custom models
  ) AS churn_prediction_score
FROM
  main.gold.customer_features
WHERE
  customer_id = 'A-12345';</code></code></pre><h4>Pattern 2: Serverless Batch Inference Pattern</h4><p>This pattern is designed for maximum simplicity when applying a model to an entire dataset, using the same <code>AI_QUERY</code> function in a large-scale query.</p><p>In a SQL query, this pattern is strikingly simple:</p><p>SQL</p><pre><code><code>-- Enrich an entire customer table with churn scores using a single, scalable SQL statement.
-- Databricks optimizes this for batch performance using serverless compute.
CREATE OR REPLACE TABLE main.gold.customer_churn_predictions AS
SELECT
  customer_id,
  -- The same ai_query function, now applied to the whole table.
  ai_query(
    endpoint =&gt; 'prod_customer_churn_model',
    request =&gt; named_struct(
      'account_age', c.account_age,
      'monthly_spend', c.monthly_spend,
      'support_tickets', c.support_tickets
    ),
    returnType =&gt; 'DOUBLE'
  ) AS churn_prediction_score
FROM
  main.gold.customer_features AS c;
</code></code></pre><h4>Pattern 3: Embedded Spark UDF Pattern</h4><p>This pattern is engineered for maximum performance on the most demanding batch workloads, using <code>mlflow.pyfunc.spark_udf</code> to co-locate model execution with the data in Spark.</p><p>This is a more involved, code-first approach for ML engineering teams:</p><p>Python</p><pre><code><code>import mlflow
from pyspark.sql.functions import col, struct

# 1. Define the URI of the model in Unity Catalog.
model_uri = "models:/main.production_models.customer_churn/1"
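
# The same registered model can also be referenced through a Unity Catalog
# alias instead of a pinned version number (assumption: a 'champion' alias
# has already been set on the registered model):
model_uri_by_alias = "models:/main.production_models.customer_churn@champion"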

# 2. Create the environment-aware Spark UDF.
#    'virtualenv' is faster for pure Python models.
#    For models with complex dependencies, use 'conda'.
predict_udf = mlflow.pyfunc.spark_udf(
    spark,
    model_uri=model_uri,
    env_manager="virtualenv",  # Use 'conda' for complex environments
    result_type="double"  # Specify return type for better performance
)

# 3. Read the source data.
features_df = spark.read.table("main.gold.customer_features")

# 4. Apply the UDF in a distributed fashion.
#    The model runs inside the Spark job, avoiding network calls.
predictions_df = features_df.withColumn(
    "churn_prediction_score",
    predict_udf(
        struct(col("account_age"), col("monthly_spend"), col("support_tickets"))
    )
)

# 5. Write the results to a new table.
predictions_df.write.mode("overwrite").saveAsTable("main.gold.customer_churn_predictions_udf")</code></code></pre><h3>The Architect's Blueprint: A Trade-off Analysis and Decision Framework</h3><p>Choosing the right pattern requires an honest assessment of what you are optimizing for.</p><h4>Analyzing the Patterns:</h4><ul><li><p><strong>Pattern 1 (Real-Time Model Serving):</strong> Optimizes for <strong>sub-second latency</strong> using Mosaic AI Model Serving endpoints. </p><ul><li><p><strong>Trade-off:</strong> Higher cost per prediction for bulk operations. </p></li><li><p><strong>Best for:</strong> Interactive applications, APIs, and real-time decision-making.</p></li></ul></li><li><p>  <strong>Pattern 2 (Serverless Batch Inference):</strong> Optimized for&nbsp;<strong>developer simplicity and automatic scaling,</strong>&nbsp;it offers 10-100x performance improvements (as of Dec 2024). </p><ul><li><p><strong>Trade-off:</strong> Network overhead between SQL warehouse and serving endpoint. </p></li><li><p><strong>Best for:</strong> Regular batch scoring, ETL pipelines, scheduled predictions.</p></li></ul></li><li><p> <strong>Pattern 3 (Embedded Spark UDF):</strong> Optimizes for <strong>maximum throughput and lowest cost</strong> by co-locating model execution with data. </p><ul><li><p><strong>Trade-off:</strong> Complex dependency management and potential version conflicts. </p></li><li><p><strong>Best for:</strong> Massive-scale batch processing, cost-sensitive workloads, models with simple dependencies.</p></li></ul></li></ul><h3>Conclusion: Making the Right Choice</h3><p>By starting with a foundation of governance in Unity Catalog and then using this trade-off analysis to select the right execution pattern, you can build a truly durable, scalable, and democratized engine for enterprise AI. 
The goal is not to find one pattern to rule them all, but to master the blueprint that lets you choose the right one, every time.</p>]]></content:encoded></item><item><title><![CDATA[It’s beaver time! Don’t get logged down with mlflow logging.]]></title><description><![CDATA[A simple workaround for when you are training thousands of models and log_model() becomes your worst enemy.]]></description><link>https://www.databricksters.com/p/its-beaver-time-dont-get-logged-down</link><guid isPermaLink="false">https://www.databricksters.com/p/its-beaver-time-dont-get-logged-down</guid><dc:creator><![CDATA[Veena]]></dc:creator><pubDate>Tue, 22 Jul 2025 16:02:23 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!49dG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715ff9f7-3a3c-42c5-b9e5-4cd82ed89c32_612x408.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!49dG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715ff9f7-3a3c-42c5-b9e5-4cd82ed89c32_612x408.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!49dG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715ff9f7-3a3c-42c5-b9e5-4cd82ed89c32_612x408.jpeg 424w, https://substackcdn.com/image/fetch/$s_!49dG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715ff9f7-3a3c-42c5-b9e5-4cd82ed89c32_612x408.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!49dG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715ff9f7-3a3c-42c5-b9e5-4cd82ed89c32_612x408.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!49dG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715ff9f7-3a3c-42c5-b9e5-4cd82ed89c32_612x408.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!49dG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715ff9f7-3a3c-42c5-b9e5-4cd82ed89c32_612x408.jpeg" width="612" height="408" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/715ff9f7-3a3c-42c5-b9e5-4cd82ed89c32_612x408.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:408,&quot;width&quot;:612,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:92406,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/168898705?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715ff9f7-3a3c-42c5-b9e5-4cd82ed89c32_612x408.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!49dG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715ff9f7-3a3c-42c5-b9e5-4cd82ed89c32_612x408.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!49dG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715ff9f7-3a3c-42c5-b9e5-4cd82ed89c32_612x408.jpeg 848w, https://substackcdn.com/image/fetch/$s_!49dG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715ff9f7-3a3c-42c5-b9e5-4cd82ed89c32_612x408.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!49dG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F715ff9f7-3a3c-42c5-b9e5-4cd82ed89c32_612x408.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">The beaver is building a dam. It is holding a log that says &#8216;mlflow.log_model()&#8217;.</figcaption></figure></div><p>Have you ever tried to train and log thousands of models with MLFlow? Instead of helping you track your runs, a seemingly innocuous line (<code>mlflow.log_model()</code>) has created a bottleneck and stretched your training time. Oh gosh, your training job takes hours now. What a nightmare.</p><h2>What does mlflow.log_model() do?</h2><p>Here is a brief overview of what log_model() does behind the scenes:</p><ol><li><p>Serializes the model.</p></li><li><p>Infers dependencies and manages manually added dependencies to create a requirements file.</p></li><li><p>Creates model specific assets (like MLmodel).</p></li><li><p>Handles metadata and versioning.</p></li></ol><p>All of these operations add a lot of overhead, and if you are looking to train thousands of small models within a single run, the logging overhead can often exceed your actual training time. This scenario is common in forecasting pipelines when you need separate models for different customer groups or time series segments.</p><h2>Instead, log these models to a Delta Table</h2><p>Let&#8217;s get a little creative. We can work around this issue by just not using log_model! We have solved the issue. You can stop reading now.</p><p>Just kidding. The idea is this-- we can get most of MLFlow&#8217;s benefits while improving performance by storing our models in a Delta Table.</p><pre><code><strong>Note</strong>: This is code to demonstrate the concept. Before using this code in production, please implement proper error handling and performance testing. 
Take a look at the <a href="https://gist.github.com/veenaramesh/a7eed4f2fa8ac3386f34b8c22a442602">entire notebook here.</a></code></pre><p>First, we need to create a Delta Table.</p><pre><code>from pyspark.sql.types import (StructType, StructField, StringType, BinaryType,
                               TimestampType, DoubleType, ArrayType)

schema = StructType([
    StructField("group_id", StringType(), True),
    StructField("model_type", StringType(), True),
    StructField("model_version", StringType(), True),  
    StructField("model_binary", BinaryType(), True),
    StructField("run_id", StringType(), True),
    StructField("run_date", TimestampType(), True),
    StructField("mse", DoubleType(), True),
    StructField("forecast", ArrayType(DoubleType()), True),
    StructField("actual", ArrayType(DoubleType()), True),
    StructField("is_latest", StringType(), True)  
])

spark.createDataFrame([], schema).write.format("delta").mode("overwrite").option("overwriteSchema", "true").saveAsTable("&lt;CATALOG&gt;.&lt;SCHEMA&gt;.mlflow_runs")</code></pre><p>In this example, I use the same Delta Table to capture all of my models across different runs. I do this so I can implement a model_version column that will capture the latest version of each model. However, if you do not need this information, you can create a new Delta Table for each MLFlow run. Another thing to note is that I am using one type of model (a Random Forest regressor), but you can imagine a world where you have multiple different model types as well. This workaround extends easily to those scenarios.</p><p>When I train the model, I capture all of the information I need, like the predictions, actuals, and metrics (here, mean squared error). I am also capturing useful information like group_id, run_id, and model_type to make it easy for me to search for this specific model after it is logged to the Delta Table. Finally, I have dumped the model binary in as well.</p><pre><code>def train_model(group_df, group_id, latest_model_version, run_id, run_date): 
    ...
    # train the model as usual 
    ...

    # return metadata 
    return {
        'group_id': group_id,
        'model_type': 'RandomForestRegressor',
        'model_version': str(latest_model_version + 1),
        'model_binary': cloudpickle.dumps(model), 
        'run_id': run_id,
        'run_date': run_date,
        'mse': mse,
        'forecast': predictions.tolist(),
        'actual': y.tolist(),
        'is_latest': "True"
}</code></pre><p>After training completes for all of the models, I can batch-insert them into the Delta Table in one operation:</p><pre><code>def save_to_delta(model_results, table_name): 
  df = spark.createDataFrame(model_results)
  df.write.format("delta").mode("append").saveAsTable(table_name)
  return 

...
# in the mlflow run
...

for group_id in GROUP_IDS: 
  group_df = data[data['group_id'] == group_id] 

  latest_version = current_model_versions.get(group_id, 0)
  model_result = train_model(group_df, group_id, latest_version, run_id, run_date)
  all_model_results.append(model_result)

save_to_delta(all_model_results, table_name)</code></pre><p>Depending on how many models you are training, you can also update the Delta Table with the model information immediately after training.</p><p>Even though we are storing models in Delta Tables, we still want to maintain the link back to the MLFlow run for reproducibility. We want to log high-level information like the number of models trained, the training dataset (assuming it's the same across groups), and the Delta Table used for logging. MLFlow also automatically logs other important information, like start-end times, success-failure statistics, and the source notebook version or git commit associated with the run. All of this is incredibly important for reproducibility and auditing. This is why we log the run_id with the model-- we want to be able to cross-reference between the Delta Table and the MLFlow experiment to get the best of both worlds.</p><p>If you recall the beginning of this post, there is still one thing left to cover: tracking dependencies. Without log_model(), MLFlow does not infer dependencies or create the MLmodel artifact, which contains important information such as the Python version used. After saving all of the models to a Delta Table, we can use log_model() once to create these artifacts. Now, we can save all of the required dependencies.</p><pre><code>def log_models_to_mlflow(data, table_name="&lt;CATALOG&gt;.&lt;SCHEMA&gt;.mlflow_runs"): 
  with mlflow.start_run() as run: 
    run_id = run.info.run_id
    run_date = datetime.now()

    # log high level parameters
    mlflow.log_param("num_groups", len(data['group_id'].unique()))
    mlflow.log_param("delta_table_name", table_name)

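    # get_latest_model_versions is defined in the linked notebook; a minimal
    # sketch of the idea (assumes model_version is a stringified integer):
    # def get_latest_model_versions(table_name):
    #     rows = spark.sql(f"select group_id, max(cast(model_version as int)) as v "
    #                      f"from {table_name} group by group_id").collect()
    #     return {r['group_id']: r['v'] for r in rows}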
    current_model_versions = get_latest_model_versions(table_name) # {'group_id': version #}
    all_model_results = []

    for group_id in GROUP_IDS: 
      group_df = data[data['group_id'] == group_id] 

      latest_version = current_model_versions.get(group_id, 0)
      model_result = train_model(group_df, group_id, latest_version, run_id, run_date)
      all_model_results.append(model_result)

    save_to_delta(all_model_results, table_name)

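    # DummyWrapper is defined in the linked notebook; a minimal stand-in looks
    # like this (its only job is to give log_model a model object so that
    # MLFlow captures the environment and dependencies):
    # class DummyWrapper(mlflow.pyfunc.PythonModel):
    #     def predict(self, context, model_input):
    #         return model_input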
    mlflow.pyfunc.log_model(
        "dummy_model",
        input_example=main_df,
        extra_pip_requirements=[...], 
        python_model=DummyWrapper()
    )</code></pre><p>You can also do this by logging a requirements.txt file directly via <a href="https://gist.github.com/veenaramesh/a7eed4f2fa8ac3386f34b8c22a442602">mlflow.log_artifact().</a></p><h2>But how do we load these models?</h2><p>It is quite straightforward. Since we have saved the model binaries in the Delta Table, we can load them directly and use them for inference. In this example, remember I have saved all of the models across runs, so I have multiple versions of each model in the same Delta Table. You can easily search and get specific versions of the model or simply retrieve the latest one.</p><pre><code>class MultiModelWrapper: 
  def __init__(self, table_name): 
    self.table = table_name

  def load_model_from_delta(self, group_id, table_name, model_type=None, run_id=None, version=None): 
    query = f"select * from {table_name} where group_id = '{group_id}'"
    if model_type: 
      query += f" and model_type = '{model_type}'"
    if run_id: 
      query += f" and run_id = '{run_id}'"
    if version: 
      query += f" and model_version = '{version}'"
    else: 
      query += " and is_latest = 'True'"
    model_df = spark.sql(query).collect()
    if model_df: 
      model = cloudpickle.loads(model_df[0]['model_binary'])
      metadata = model_df[0].asDict(True)
      metadata.pop("model_binary")
      return model, metadata
    else: 
      return None, None
    
  def predict(self, model_input, group_id, model_type=None, run_id=None, version=None): 
    model, _ = self.load_model_from_delta(group_id=group_id, model_type=model_type, run_id=run_id, version=version, table_name=self.table)

    if model is None: 
      raise ValueError(f"No model found for group_id='{group_id}' with the given filters")

    # TODO: make sure the model_input can be ingested by the models!
    return model.predict(model_input.values)

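# For illustration, this is the SQL string load_model_from_delta assembles for
# the call below (group 'A', version 2); the table name is a placeholder:
example_query = (
    "select * from &lt;CATALOG&gt;.&lt;SCHEMA&gt;.mlflow_runs"
    " where group_id = 'A' and model_version = '2'"
)
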
wrapper_model = MultiModelWrapper(table_name="&lt;CATALOG&gt;.&lt;SCHEMA&gt;.mlflow_runs")
test_df = main_df.head(1).drop(columns=["group_id", "target", "date"])
wrapper_model.predict(test_df, 'A', version=2)</code></pre><h2>When does this approach make sense for me?</h2><p>If you are training many (hundreds to thousands of) similar small models and MLFlow logging is stretching your training time, I would suggest looking into this workaround. If you are not frequently training these models or the models are not easily serializable (such as deep learning models), I would not recommend this solution for you.</p><p>Obviously, you lose a lot of native MLflow features, like connection to the UC Model Registry, and there is a lot of custom code to maintain, but for high-volume scenarios, these trade-offs are usually worth it.</p><p>This pattern can offer your team a pragmatic solution that maintains most of MLFlow&#8217;s tracking benefits while improving performance for bulk model training.</p><p>I hope this guide was helpful! Please let me know if you have any questions below. <a href="https://gist.github.com/veenaramesh/a7eed4f2fa8ac3386f34b8c22a442602">Again, here is the link to the notebook. </a></p>]]></content:encoded></item><item><title><![CDATA[Juggling multiple models in a single serving endpoint]]></title><description><![CDATA[How to serve multiple models on a single model serving endpoint in Databricks using pyfunc]]></description><link>https://www.databricksters.com/p/juggling-a-model-circus-a-pyfuncs</link><guid isPermaLink="false">https://www.databricksters.com/p/juggling-a-model-circus-a-pyfuncs</guid><dc:creator><![CDATA[Veena]]></dc:creator><pubDate>Tue, 29 Apr 2025 16:02:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!n1Ap!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff7b1f14-54a4-46fa-aa52-1a9c7f7ff468_3614x1466.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you ever found yourself juggling multiple ML models? 
Imagine this: you're maintaining a prediction service that started with a single model, but now you've got a dozen micro-models serving different business needs. Costs are climbing. You are dreaming of consolidation.</p><p>For most scenarios, Databricks Model Serving provides an easy solution. It lets you deploy multiple models behind a single endpoint, split traffic, and route requests. This approach is perfect for A/B testing and canary deployments, where simple traffic splitting is sufficient. However, there are situations where we can hit limitations:  </p><ul><li><p>routing based on requests (e.g., user attributes) </p></li><li><p>routing based on time</p></li><li><p>consolidating dozens of micro-models onto shared infrastructure</p></li><li><p>routing dynamically based on business rules</p></li></ul><p>You could spin up separate endpoints for each, but that means more DBUs, more management overhead, etc. This is where creating a custom PyFunc wrapper can provide a solution. Note that this should be viewed as an edge case and not a default pattern. </p><h3>Before diving into the implementation, let&#8217;s consider the limitations. </h3><ul><li><p>Individual model metrics are combined, so monitoring is more difficult.</p></li><li><p>Models are loaded together, so there could be a resource inefficiency.</p></li><li><p>Routing rules may obscure decision paths. </p></li><li><p>Model versioning is less transparent. </p></li></ul><p>In this deep dive, we will explore a really simple pattern to help solve this issue using PyFunc. By creating a wrapper with PyFunc, we will package various models in one deployable artifact, implement routing logic to direct requests to the right model, and maintain a single entry point. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n1Ap!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff7b1f14-54a4-46fa-aa52-1a9c7f7ff468_3614x1466.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n1Ap!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff7b1f14-54a4-46fa-aa52-1a9c7f7ff468_3614x1466.png 424w, https://substackcdn.com/image/fetch/$s_!n1Ap!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff7b1f14-54a4-46fa-aa52-1a9c7f7ff468_3614x1466.png 848w, https://substackcdn.com/image/fetch/$s_!n1Ap!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff7b1f14-54a4-46fa-aa52-1a9c7f7ff468_3614x1466.png 1272w, https://substackcdn.com/image/fetch/$s_!n1Ap!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff7b1f14-54a4-46fa-aa52-1a9c7f7ff468_3614x1466.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n1Ap!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff7b1f14-54a4-46fa-aa52-1a9c7f7ff468_3614x1466.png" width="1456" height="591" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ff7b1f14-54a4-46fa-aa52-1a9c7f7ff468_3614x1466.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:591,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:183561,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/162350351?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff7b1f14-54a4-46fa-aa52-1a9c7f7ff468_3614x1466.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n1Ap!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff7b1f14-54a4-46fa-aa52-1a9c7f7ff468_3614x1466.png 424w, https://substackcdn.com/image/fetch/$s_!n1Ap!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff7b1f14-54a4-46fa-aa52-1a9c7f7ff468_3614x1466.png 848w, https://substackcdn.com/image/fetch/$s_!n1Ap!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff7b1f14-54a4-46fa-aa52-1a9c7f7ff468_3614x1466.png 1272w, https://substackcdn.com/image/fetch/$s_!n1Ap!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fff7b1f14-54a4-46fa-aa52-1a9c7f7ff468_3614x1466.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" 
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">A diagram of how this router solution would look. </figcaption></figure></div><h3>Let&#8217;s quickly create some base models. </h3><p>We are training two separate models using the same California Housing dataset. Because both models share the same input and output schemas, the Model Signature for both will be the same. </p><pre><code>import mlflow
import pandas as pd
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor

data = fetch_california_housing()
california_housing = pd.DataFrame(data.data, columns=data.feature_names)
california_housing['target'] = data.target

X_train, X_test, y_train, y_test = train_test_split(
    california_housing.drop('target', axis=1), 
    california_housing['target'], 
    test_size=0.2, 
    random_state=42
)

lr_model = LinearRegression().fit(X_train, y_train)
rf_model = RandomForestRegressor().fit(X_train, y_train)

signature = mlflow.models.infer_signature(X_train, lr_model.predict(X_train))</code></pre><p>This represents a standard model development workflow. This pattern builds on existing models and training processes rather than replacing them. In other words, you can adopt this pattern without too much disruption to your current workflows. </p><p>These can now be logged and registered in Unity Catalog. Nothing new here!  </p><pre><code>with mlflow.start_run(run_name="California Housing Models") as housing_run:
    mlflow.sklearn.log_model(lr_model, "linear_regression_model", signature=signature)
    mlflow.sklearn.log_model(rf_model, "random_forest_model", signature=signature)
    
    mlflow.set_registry_uri("databricks-uc")
    mlflow.register_model(
        f"runs:/{housing_run.info.run_id}/linear_regression_model", 
        "your_catalog.your_schema.california_housing_linear_regression"
    )
    mlflow.register_model(
        f"runs:/{housing_run.info.run_id}/random_forest_model", 
        "your_catalog.your_schema.california_housing_random_forest"
    )
</code></pre><h3>Create a custom model using pyfunc. </h3><p>We are going to use <code>pyfunc</code> to orchestrate and serve as the main interface for interacting with the base models. The wrapper will load our models and dynamically select which model to use based on the request parameters. </p><pre><code>class ModelRouter(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        self.linear_model = mlflow.sklearn.load_model(
            context.artifacts["linear_regression_model"]
        )
        self.forest_model = mlflow.sklearn.load_model(
            context.artifacts["random_forest_model"]
        )

    def predict(self, context, model_input):
        # The 'model' column specifies which model to use.
        # Note: the whole batch is routed to a single model, so every row in a
        # request is expected to carry the same 'model' value.
        if model_input['model'].eq('RandomForest').any():
            return {
                "prediction": self.forest_model.predict(model_input.drop('model', axis=1))
            }
        elif model_input['model'].eq('LinearRegression').any():
            return {
                "prediction": self.linear_model.predict(model_input.drop('model', axis=1))
            }
        else:
            raise ValueError("Unrecognized model type. Use 'RandomForest' or 'LinearRegression'")</code></pre><p>I want to highlight two important aspects of this wrapper. First, in <code>load_context</code>, we are loading the underlying Linear Regression and Random Forest models from the artifacts.  When we log and register this wrapper, we will need to specify these artifacts, so that the wrapper will correctly load the models that we trained. Keep in mind that in the model serving environment, <code>load_context</code> is called once, so loading the models should not affect the serving latency after initialization. </p><p>Second, there is a lot of flexibility here. In the code snippet, we are using an extra column in the model input called <code>model</code> to select which model to use. But you can implement virtually any routing logic. You can switch between the models based on geographic location or the time the request was submitted.</p><h3>Registering the wrapper with the model artifacts. </h3><p>In order to register the model, we need to create a proper Model Signature. I am going to use the <code>infer_signature</code> function to do so. You can also manually construct the signature object. The signature will be similar to the signatures used for the base models. Because our wrapper uses an extra column to decide which model to use, we need to take that into consideration. </p><pre><code>input_example = X_train.copy()
input_example['model'] = 'RandomForest'

router_signature = mlflow.models.infer_signature(
    input_example, 
    {"prediction": rf_model.predict(X_train)}
)</code></pre><p>When we log the model, we need to include the base models as artifacts: </p><pre><code>with mlflow.start_run() as run:
    router_model = ModelRouter()
    mlflow.pyfunc.log_model(
        "model_router",
        python_model=router_model,
        signature=router_signature,
        artifacts={
            "linear_regression_model": 
                "models:/your_catalog.your_schema.california_housing_linear_regression/1",
            "random_forest_model": 
                "models:/your_catalog.your_schema.california_housing_random_forest/1",
        },
        extra_pip_requirements=["scikit-learn==1.4.2", "numpy==1.23.5", "pandas==1.5.3"]
    )
    
    # Register the router model
    mlflow.register_model(
        f"runs:/{run.info.run_id}/model_router", 
        "your_catalog.your_schema.housing_model_router"
    )</code></pre><p>Now, we have created a self-contained wrapper that includes everything needed for serving. </p><h2>What happens when we are dealing with different inputs?</h2><p>Imagine your system spans multiple domains. Different data, different tasks, but you still need a unified interface. </p><p>First, let&#8217;s train a model on a different dataset. </p><pre><code>from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier

cancer_data = load_breast_cancer()
cancer_df = pd.DataFrame(cancer_data.data, columns=cancer_data.feature_names)
cancer_df['target'] = cancer_data.target

X_train_cancer, X_test_cancer, y_train_cancer, y_test_cancer = train_test_split(
    cancer_df.drop('target', axis=1), 
    cancer_df['target'], 
    test_size=0.2
)

cancer_model = RandomForestClassifier().fit(X_train_cancer, y_train_cancer)
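</code></pre><p>As a quick, standalone illustration (my own sketch, not from the original post): the routing trick in the multi-domain wrapper below is just a per-request flag column, and reading that flag out of a request batch is a one-line boolean check.</p>

```python
# Illustrative only: how a 'domain' flag column drives routing decisions.
# The column name and values mirror the MultiDomainRouter shown below.
import pandas as pd

example_request = pd.DataFrame({'domain': ['cancer'], 'mean radius': [14.1]})
routed_to_cancer = example_request['domain'].eq('cancer').any()
print(routed_to_cancer)  # -> True
```

<p>With the second model trained, we log and register it just like the housing models.</p><pre><code>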

with mlflow.start_run() as run:
    cancer_signature = mlflow.models.infer_signature(
        X_train_cancer, 
        cancer_model.predict(X_train_cancer)
    )
    mlflow.sklearn.log_model(
        cancer_model, 
        "random_forest_cancer", 
        signature=cancer_signature
    )
    
    mlflow.register_model(
        f"runs:/{run.info.run_id}/random_forest_cancer", 
        f"{catalog}.{db}.{rf_br_model_name}"  # assumes catalog, db, and rf_br_model_name were set earlier
    )</code></pre><p>Now, we can create a wrapper that handles both datasets. This wrapper is similar to the previous one. </p><pre><code>class MultiDomainRouter(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        self.housing_model = mlflow.sklearn.load_model(
            context.artifacts["housing_model"]
        )
        self.cancer_model = mlflow.sklearn.load_model(
            context.artifacts["cancer_model"]
        )
        
        # Artifact values resolve to local file paths, so here we assume the
        # two feature lists were logged as JSON files (one array of column
        # names each) and load them into sets for the validation logic below.
        import json
        with open(context.artifacts['housing_features']) as f:
            self.housing_columns = set(json.load(f))
        with open(context.artifacts['breast_cancer_features']) as f:
            self.cancer_columns = set(json.load(f))

    def predict(self, context, model_input):
        if model_input['domain'].eq('housing').any():
            # validate input data
            input_cols = set(model_input.columns) - {'domain'}
            missing_cols = self.housing_columns - input_cols
            if missing_cols:
                raise ValueError(f"Missing required columns for model: {missing_cols}")
                
            # columns needed by the housing model
            features = model_input[list(self.housing_columns) + ['domain']]
            return {
                "prediction": self.housing_model.predict(
                    features.drop('domain', axis=1)
                )
            }
            
        elif model_input['domain'].eq('cancer').any():
            # validate input data
            input_cols = set(model_input.columns) - {'domain'}
            missing_cols = self.cancer_columns - input_cols
            if missing_cols:
                raise ValueError(f"Missing required columns for model: {missing_cols}")
                
            # columns needed by the cancer model
            features = model_input[list(self.cancer_columns) + ['domain']]
            return {
                "prediction": self.cancer_model.predict(
                    features.drop('domain', axis=1)
                )
            }
        else:
            raise ValueError("Unrecognized domain. Use 'housing' or 'cancer'")</code></pre><p>But wait: how are we supposed to define the Model Signature? The expected inputs will be different. </p><h4>A quick aside on Model Signatures. </h4><p>A Model Signature defines the schema of the inputs a model expects to receive and of the outputs it returns. There are two main types of signatures: column-based (used for most traditional ML models) and tensor-based (used for deep learning applications). </p><p>Column-based signatures consist of a list of columns (very surprising), each with an expected data type. The signature for the California Housing models looks like this: </p><pre><code>inputs: 
  ['MedInc': double (required), 'HouseAge': double (required), 'AveRooms': double (required), 'AveBedrms': double (required), 'Population': double (required), 'AveOccup': double (required), 'Latitude': double (required), 'Longitude': double (required)]
outputs: 
  [Tensor('float64', (-1,))]</code></pre><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MC11!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c441e9d-5a0c-47c3-86e3-222a6e1a6526_1872x1298.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MC11!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c441e9d-5a0c-47c3-86e3-222a6e1a6526_1872x1298.png 424w, https://substackcdn.com/image/fetch/$s_!MC11!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c441e9d-5a0c-47c3-86e3-222a6e1a6526_1872x1298.png 848w, https://substackcdn.com/image/fetch/$s_!MC11!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c441e9d-5a0c-47c3-86e3-222a6e1a6526_1872x1298.png 1272w, https://substackcdn.com/image/fetch/$s_!MC11!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c441e9d-5a0c-47c3-86e3-222a6e1a6526_1872x1298.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MC11!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c441e9d-5a0c-47c3-86e3-222a6e1a6526_1872x1298.png" width="1456" height="1010" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8c441e9d-5a0c-47c3-86e3-222a6e1a6526_1872x1298.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1010,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:301350,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/162350351?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c441e9d-5a0c-47c3-86e3-222a6e1a6526_1872x1298.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MC11!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c441e9d-5a0c-47c3-86e3-222a6e1a6526_1872x1298.png 424w, https://substackcdn.com/image/fetch/$s_!MC11!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c441e9d-5a0c-47c3-86e3-222a6e1a6526_1872x1298.png 848w, https://substackcdn.com/image/fetch/$s_!MC11!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c441e9d-5a0c-47c3-86e3-222a6e1a6526_1872x1298.png 1272w, https://substackcdn.com/image/fetch/$s_!MC11!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8c441e9d-5a0c-47c3-86e3-222a6e1a6526_1872x1298.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">The function `mlflow.models.infer_signature()` automatically takes an existing dataset and creates a schema with the appropriate datatypes. </figcaption></figure></div><p>Notice: each of these columns is required by the model. Required fields must be included in the input; if one is missing, the request will error out. However, we can include optional fields as well. </p><p>The function assumes that these columns are required because in the dataframe I used, all values were properly populated. In order to configure a field as optional, we can use <code>mlflow.models.infer_signature</code> by passing in some <code>None</code> values for that field. Basically, we can concat the two datasets and infer the signature to make all of the columns for both types of inputs optional. </p><pre><code># bc / california_housing: the breast-cancer and housing DataFrames from the earlier steps
bc_head = bc.head()
california_housing_head = california_housing.head()

merged_df = pd.concat([bc_head, california_housing_head])
merged_df_output = {"california_housing": rf_ch.predict(california_housing_head.drop('target', axis=1))}  # rf_ch: the housing random-forest model from earlier
signature_merged_df = infer_signature(merged_df, merged_df_output)</code></pre><p>By making most fields optional in the signature, we're essentially telling MLflow to let all requests through to our code. We will still need to do the validation for each model. This gives us the flexibility to route between completely different models while still maintaining a consistent interface for client applications.</p><p>If you look at the code snippet where we defined the model wrapper again, you can see that we have performed the validation of the model inputs ourselves. We have manually added feature names that are required for each model to enforce the input schema. </p><h1>Conclusion</h1><p>Voil&#224;! This approach should address more complex routing-logic needs and the consolidation of small models. It is perfect when you need: </p><ol><li><p>request-level routing beyond percentage-based traffic splitting</p></li><li><p>multiple small models where separate endpoints would be inefficient</p></li><li><p>complex routing logic based on request properties</p></li></ol><p>This router pattern gives you a flexible way to consolidate multiple models behind a single endpoint. Remember to think about the limitations we listed out earlier before implementing this in production!</p><p>Happy wrapping!</p>]]></content:encoded></item><item><title><![CDATA[PyFunc it! We'll do it Live! 
]]></title><description><![CDATA[Real-Time Data Preprocessing for Custom Databricks Model Serving Endpoints]]></description><link>https://www.databricksters.com/p/pyfunc-it-well-do-it-live</link><guid isPermaLink="false">https://www.databricksters.com/p/pyfunc-it-well-do-it-live</guid><dc:creator><![CDATA[Austin]]></dc:creator><pubDate>Tue, 22 Apr 2025 16:53:56 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/dcf6be7d-3d5d-475a-9bad-c168ed784065_300x256.gif" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>When performing real-time inference, you rarely get all of the data needed to make your prediction exactly how your model requires it inside the <code>POST</code> request. More commonly, one or both of the following are true:</p><ol><li><p>The data received requires significant preprocessing in the form of parsing, encoding, reformatting, etc.</p></li><li><p>The data received is incomplete and must be combined with another data set in order to perform accurate predictions</p></li></ol><p>Today&#8217;s blog will focus on the first use case, and we will revisit the second one next quarter. I originally wrote Part 2 using <a href="https://docs.databricks.com/aws/en/machine-learning/feature-store/online-tables">Online Tables</a>, which is still a possibility, but there have been some API changes I want to let settle before publishing. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!z2vS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a560027-e022-413f-a218-6de1d788afb1_300x256.gif" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!z2vS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a560027-e022-413f-a218-6de1d788afb1_300x256.gif 424w, https://substackcdn.com/image/fetch/$s_!z2vS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a560027-e022-413f-a218-6de1d788afb1_300x256.gif 848w, https://substackcdn.com/image/fetch/$s_!z2vS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a560027-e022-413f-a218-6de1d788afb1_300x256.gif 1272w, https://substackcdn.com/image/fetch/$s_!z2vS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a560027-e022-413f-a218-6de1d788afb1_300x256.gif 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!z2vS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a560027-e022-413f-a218-6de1d788afb1_300x256.gif" width="320" 
height="273.06666666666666" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9a560027-e022-413f-a218-6de1d788afb1_300x256.gif&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:256,&quot;width&quot;:300,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:4742917,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/gif&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/161897285?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a560027-e022-413f-a218-6de1d788afb1_300x256.gif&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!z2vS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a560027-e022-413f-a218-6de1d788afb1_300x256.gif 424w, https://substackcdn.com/image/fetch/$s_!z2vS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a560027-e022-413f-a218-6de1d788afb1_300x256.gif 848w, https://substackcdn.com/image/fetch/$s_!z2vS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a560027-e022-413f-a218-6de1d788afb1_300x256.gif 1272w, https://substackcdn.com/image/fetch/$s_!z2vS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9a560027-e022-413f-a218-6de1d788afb1_300x256.gif 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">Bill gets frustrated with pre-processing pipelines</figcaption></figure></div><h3><strong>The Power of Pipelines</strong></h3><p>If we have an <code>sklearn</code> model, adding preprocessing steps - even more advanced custom preprocessing classes - is a straightforward task:</p><pre><code>import mlflow
import mlflow.sklearn
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Yes this dataset is simplistic, but we're just proving a point right now
iris = load_iris()
X = iris.data
y = iris.target

# Create the pipeline with whatever preprocessing you may need, Pipeline also accepts custom classes
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
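</code></pre><p>The comment above notes that <code>Pipeline</code> also accepts custom classes. As a standalone sketch (this <code>ClipOutliers</code> transformer is my own illustrative example, not part of the original post), any object exposing <code>fit</code> and <code>transform</code> slots in exactly like the built-ins:</p>

```python
# Illustrative custom step (my own example): clip each feature to its
# 1st-99th percentile range before scaling and classification.
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

class ClipOutliers(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        # learn per-feature clipping bounds from the training data
        self.lo_, self.hi_ = np.percentile(X, [1, 99], axis=0)
        return self

    def transform(self, X):
        return np.clip(X, self.lo_, self.hi_)

demo_pipeline = Pipeline([
    ('clip', ClipOutliers()),
    ('scaler', StandardScaler()),
    ('model', LogisticRegression())
])
demo_X = np.array([[0.0], [1.0], [2.0], [100.0]])  # 100.0 is an outlier
demo_y = np.array([0, 0, 1, 1])
demo_pipeline.fit(demo_X, demo_y)
print(demo_pipeline.predict(demo_X))
```

<p>We will lean on exactly this property later when we wire in the external <code>custom_transformers.py</code> classes.</p><pre><code>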

# Start the MLflow run with autologging for even more quality of life features
with mlflow.start_run():
    mlflow.sklearn.autolog()
    pipeline.fit(X, y)

mlflow.end_run()</code></pre><p>Not exactly groundbreaking code here. But what if our custom parsing and preprocessing logic was <strong>really</strong> complex and we wanted to maintain our own separate python scripts for this logic to maintain modularity? What if we wanted to use a model type that doesn't fit nicely into <code>sklearn</code>? Regardless of the specific motivation, the time may come when this pattern will no longer serve our needs. Calling a separate <code>.py</code> file from within a <a href="https://mlflow.org/docs/latest/traditional-ml/creating-custom-pyfunc/part2-pyfunc-components.html">PyFunc</a> Python model and serving the custom pipeline is an extremely powerful and flexible pattern for multipart inference pipelines. </p><p><strong>Note:</strong><em> You can still use an </em><code>sklearn</code><em> Pipeline for this without using the </em><code>mlflow.sklearn</code><em> flavor, because </em><code>XGBoost</code><em> provides an </em><code>sklearn</code><em> compatible API. I've run the code below both ways, and while you don't have to use a single </em><code>sklearn</code><em> package in order to leverage this pattern, it will make for a simpler to follow demo.</em></p><h2><strong>Defining a Custom Preprocessing Script</strong></h2><p>The below <code>.py</code> file contains two relatively simple preprocessing classes, one that flattens nested JSON strings and one that extracts the domain from email addresses.</p><pre><code>## custom_transformers.py
## We could break these out, but for simplicity of the demo, I'm just making one external .py file
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class JSONFlattener(BaseEstimator, TransformerMixin):
    """
    Transforms the DataFrame by flattening the specified JSON column into a tabular format.
    """
    def __init__(self, json_column, record_prefix=''):
        self.json_column = json_column
        self.record_prefix = record_prefix

    def fit(self, X, y=None):
        return self

    def flatten_dict(self, d, parent_key='', sep='.'):
        items = []
        for k, v in d.items():
            new_key = f"{parent_key}{sep}{k}" if parent_key else k
            if isinstance(v, dict):
                items.extend(self.flatten_dict(v, new_key, sep=sep).items())
            else:
                if isinstance(v, list):
                    v = ';'.join(map(str, v))
                items.append((new_key, v))
        return dict(items)

    def transform(self, X):
        X = X.copy()
        flattened = X[self.json_column].apply(lambda x: self.flatten_dict(x, self.record_prefix, sep='.'))
        json_df = pd.DataFrame(flattened.tolist())
        X = X.drop(columns=[self.json_column])
        X = pd.concat([X.reset_index(drop=True), json_df.reset_index(drop=True)], axis=1)
        return X

class EmailDomainExtractor(BaseEstimator, TransformerMixin):
    """
    Transforms the DataFrame by adding a new column 'email_domain' containing the extracted domains.
    """
    def __init__(self, email_column):
        self.email_column = email_column

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X = X.copy()
        if self.email_column not in X.columns:
            raise ValueError(f"Column '{self.email_column}' not found in input data.")
        X['email_domain'] = X[self.email_column].apply(
            lambda x: x.split('@')[-1] if isinstance(x, str) and '@' in x else 'unknown'
        )
        return X</code></pre><p>Being a fully self-specified data format, JSON strings are an extremely popular method for sending data over the internet, and in order to suit the wide variety of use cases that leverage JSON, they can get quite complex. <a href="https://docs.databricks.com/en/machine-learning/model-serving/index.html">Databricks model serving</a> limits the payload of an individual request to <a href="https://docs.databricks.com/en/machine-learning/model-serving/model-serving-limits.html#resource-and-payload-limits">16MB</a>, which a single JSON could hypothetically occupy all of. Needless to say, our two level JSON flattener class is only meant to be a placeholder to showcase a broader pattern.</p><p>Now that we have our <code>.py</code> file defined, let's open a notebook in the same folder and turn our cluster on if we don't already have one. For this code I used an <code>r6i.xlarge</code> single node CPU cluster on Databricks ML Runtime 15.4 LTS on AWS, but a rough equivalent in Azure would be the <code>E4s_v4</code>, since both feature 4 vCPUs with 8 GiB of RAM, each powered by Intel Xeon Ice Lake processors. These are both fast and low cost, with the AWS one coming in at 1.02 DBU/hr.</p><h2><strong>Synthetic Data Generation</strong></h2><p>To keep the code in this blog fully functional out of the box, we're going to generate some synthetic data using one of my favorite python packages, <code>Faker</code>.</p><pre><code># Note: as we said above, this pattern is not dependent on sklearn Pipelines
%pip install faker==18.11.2
%pip install scikit-learn==1.2.2
%pip install databricks-sdk --upgrade
%pip install mlflow==2.17.0
dbutils.library.restartPython()</code></pre><pre><code># We'll use a non-sklearn ML package for our main model, XGBoost
import pandas as pd
import numpy as np
from faker import Faker
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report
import mlflow
import mlflow.pyfunc
from mlflow.models.signature import infer_signature
import xgboost as xgb
import joblib
import os
import sys</code></pre><p>We'll generate 1,000 rows for now, but you could generate far more if you wanted to experiment with the scalability of your model.</p><pre><code># Faker makes synthetic data generation easy
fake = Faker()
Faker.seed(42)
np.random.seed(42)

def generate_data(num_rows=1000):
    data = []
    for _ in range(num_rows):
        customer_id = fake.unique.uuid4()
        name = fake.name()
        address = {
            'street': fake.street_address(),
            'city': fake.city(),
            'state': fake.state_abbr(),
            'zip_code': fake.zipcode()
        }
        email = fake.email()
        phone_number = fake.phone_number()
        
        transaction = {
            'transaction_id': fake.unique.uuid4(),
            'amount': round(np.random.uniform(10.0, 1000.0), 2),
            'transaction_type': np.random.choice(['online', 'in-store', 'cash withdrawal', 'mobile']),
            'account_age_days': np.random.randint(30, 3650),
            'customer_info': {
                'customer_id': customer_id,
                'name': name,
                'address': address,
                'email': email,
                'phone_number': phone_number
            },
            'fraud': np.random.choice([0, 1], p=[0.95, 0.05])
        }
        data.append(transaction)
    return pd.DataFrame(data)
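</code></pre><p>Since each generated row nests a <code>customer_info</code> dictionary (with a further nested address), it helps to see in isolation what the key-path flattening in <code>JSONFlattener</code> will do to such a record. Here is a standalone sketch of the same idea (illustrative only; the class in <code>custom_transformers.py</code> is what the pipeline actually uses):</p>

```python
# Standalone sketch of key-path flattening: nested dict keys are joined
# with dots, mirroring JSONFlattener.flatten_dict above.
def flatten_record(d, parent_key='', sep='.'):
    out = {}
    for k, v in d.items():
        key = f"{parent_key}{sep}{k}" if parent_key else k
        if isinstance(v, dict):
            out.update(flatten_record(v, key, sep=sep))
        else:
            out[key] = v
    return out

nested = {'customer_info': {'name': 'A. Customer', 'address': {'city': 'Toronto'}}}
print(flatten_record(nested))
# -> {'customer_info.name': 'A. Customer', 'customer_info.address.city': 'Toronto'}
```

<p>Now we can generate the frame and take a peek at it:</p><pre><code>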

df = generate_data(num_rows=1000)
display(df.head(5))</code></pre><div><hr></div><h2><strong>Short aside: What if my preprocessing logic is other models?</strong></h2><p>If you had additional ML models in your pipeline, there are a few ways you could incorporate them together depending on the flow pattern of the data. For example, if you have multiple <strong>independent</strong> processes, then you can use a fan-out model where models can process in parallel on their own endpoints, then send those results back to a wrapper model for the final prediction. The pros of this are that you can scale multiple models independently, and if you have multiple heavy-duty models, this may be the fastest way to structure things. The cons are that you have more cost, both in terms of DBUs (since you have multiple running endpoints) and in terms of overhead latency, since each call from one endpoint to another is going to add ~50ms.</p><p>The other option, especially if you have multiple <strong>dependent</strong> processes, is to wrap the constituent models up in the orchestrator as model artifacts and serve the entire pipeline to a single endpoint. The pros of this are that you save on costs, financially and in overhead latency, and since your model processes are dependent anyway, there is no efficiency loss due to models that could be run in parallel lying idle. The cons are that you need to be more careful in the initial deployment since it's less modularized, but I have some tips to share on that below.</p><p>Note that your preprocessing logic being other models does not fundamentally change anything; since a PyFunc model can be really <strong>any</strong> executable Python code that takes <code>X</code> and returns <code>y</code>, you could even host your non-model preprocessing logic on separate endpoints if you wanted to. 
However, it would be rather unconventional and I can't contrive a scenario where that would be advantageous at the moment.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tYWV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3fce-389e-456e-b9d1-fb646a275353_1920x915.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tYWV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3fce-389e-456e-b9d1-fb646a275353_1920x915.png 424w, https://substackcdn.com/image/fetch/$s_!tYWV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3fce-389e-456e-b9d1-fb646a275353_1920x915.png 848w, https://substackcdn.com/image/fetch/$s_!tYWV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3fce-389e-456e-b9d1-fb646a275353_1920x915.png 1272w, https://substackcdn.com/image/fetch/$s_!tYWV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3fce-389e-456e-b9d1-fb646a275353_1920x915.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!tYWV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3fce-389e-456e-b9d1-fb646a275353_1920x915.png" width="1456" height="694" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15ed3fce-389e-456e-b9d1-fb646a275353_1920x915.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:694,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:44621,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/161897285?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3fce-389e-456e-b9d1-fb646a275353_1920x915.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tYWV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3fce-389e-456e-b9d1-fb646a275353_1920x915.png 424w, https://substackcdn.com/image/fetch/$s_!tYWV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3fce-389e-456e-b9d1-fb646a275353_1920x915.png 848w, https://substackcdn.com/image/fetch/$s_!tYWV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3fce-389e-456e-b9d1-fb646a275353_1920x915.png 1272w, https://substackcdn.com/image/fetch/$s_!tYWV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15ed3fce-389e-456e-b9d1-fb646a275353_1920x915.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><strong>Multiple Endpoints Version</strong></figcaption></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1eBd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e7749-e331-41cf-84b2-9150b24ea2a3_2059x1146.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1eBd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e7749-e331-41cf-84b2-9150b24ea2a3_2059x1146.png 424w, 
https://substackcdn.com/image/fetch/$s_!1eBd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e7749-e331-41cf-84b2-9150b24ea2a3_2059x1146.png 848w, https://substackcdn.com/image/fetch/$s_!1eBd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e7749-e331-41cf-84b2-9150b24ea2a3_2059x1146.png 1272w, https://substackcdn.com/image/fetch/$s_!1eBd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e7749-e331-41cf-84b2-9150b24ea2a3_2059x1146.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1eBd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e7749-e331-41cf-84b2-9150b24ea2a3_2059x1146.png" width="1456" height="810" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/016e7749-e331-41cf-84b2-9150b24ea2a3_2059x1146.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:810,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:47634,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/161897285?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e7749-e331-41cf-84b2-9150b24ea2a3_2059x1146.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!1eBd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e7749-e331-41cf-84b2-9150b24ea2a3_2059x1146.png 424w, https://substackcdn.com/image/fetch/$s_!1eBd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e7749-e331-41cf-84b2-9150b24ea2a3_2059x1146.png 848w, https://substackcdn.com/image/fetch/$s_!1eBd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e7749-e331-41cf-84b2-9150b24ea2a3_2059x1146.png 1272w, https://substackcdn.com/image/fetch/$s_!1eBd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F016e7749-e331-41cf-84b2-9150b24ea2a3_2059x1146.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><strong>Single Endpoint Version</strong></figcaption></figure></div><p>In short, when it comes to calling constituent models to obtain intermediate results, you do have some options. However, the most common one is going to be to call those models almost exactly as you would any other preprocessing logic. Great, now back to the main example.</p><div><hr></div><h2><strong>Apply Preprocessing Logic from .py Script</strong></h2><p>Now that we have some data to work with, let's apply those preprocessing functions we defined earlier. All we need to do is specify the path to where our script lives and import our classes as we would any others off <a href="https://pypi.org/">PyPI</a>.</p><pre><code># Retrieve the current notebook's full path
import os

notebook_path = dbutils.notebook.entry_point.getDbutils().notebook().getContext().notebookPath().get()
notebook_dir = os.path.dirname(notebook_path)
dbfs_path = '/Workspace' + notebook_dir
# This will come into play when we log the model to mlflow
custom_transformers_path = dbfs_path + '/custom_transformers.py'
from custom_transformers import JSONFlattener, EmailDomainExtractor

# # dbfs_path is already in sys.path but if it weren't we could add it like so:
# if dbfs_path not in sys.path:
#     sys.path.append(dbfs_path)</code></pre><pre><code># Split the dataset
X = df.drop(columns=['fraud', 'transaction_id'])
y = df['fraud']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Define preprocessing steps
numeric_features = ['amount', 'account_age_days']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

categorical_features = ['transaction_type', 'customer_info.address.state', 'email_domain']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

preprocessor = Pipeline(steps=[
    ('json_flattener', JSONFlattener(json_column='customer_info', record_prefix='customer_info')),
    ('email_domain_extractor', EmailDomainExtractor(email_column='customer_info.email')),
    ('column_transformer', ColumnTransformer(
        transformers=[
            ('num', numeric_transformer, numeric_features),
            ('cat', categorical_transformer, categorical_features)
        ])
    )
])</code></pre><p>Looking good. Remember we don't need to use an sklearn Pipeline for this to work, but we're using it anyway to keep this blog more focused on calling custom preprocessing modules on a live serving endpoint rather than building custom pipelines.</p><p>Right, let's train the model! This should take about 30 seconds on our machine with 1,000 rows, but this is far from maxing out our cluster's CPU utilization or RAM, so it will scale sub-linearly for orders of magnitude more data.</p><pre><code># Fit the preprocessor and transform the data
preprocessor.fit(X_train, y_train)
X_train_processed = preprocessor.transform(X_train)
X_test_processed = preprocessor.transform(X_test)

# Convert processed data to numpy arrays if they are DataFrames
if isinstance(X_train_processed, pd.DataFrame):
    X_train_processed = X_train_processed.values
if isinstance(X_test_processed, pd.DataFrame):
    X_test_processed = X_test_processed.values

# Train the XGBoost model
dtrain = xgb.DMatrix(X_train_processed, label=y_train)
dtest = xgb.DMatrix(X_test_processed, label=y_test)

params = {
    'objective': 'binary:logistic',
    'eval_metric': 'logloss',
    'seed': 42
}

bst = xgb.train(params, dtrain, num_boost_round=100)

# Evaluate the model
y_pred_proba = bst.predict(dtest)
y_pred = (y_pred_proba &gt; 0.5).astype(int)

print("Classification Report:")
print(classification_report(y_test, y_pred, digits=4))</code></pre><pre><code># Prepare artifacts and signature
joblib.dump(preprocessor, "preprocessor.joblib")
bst.save_model("model.xgb")

artifacts = {
    "preprocessor_path": "preprocessor.joblib",
    "model_path": "model.xgb"
}

sample_input = X_train.iloc[:5]
sample_output = bst.predict(xgb.DMatrix(preprocessor.transform(sample_input)))
signature = infer_signature(sample_input, sample_output)</code></pre><p>We'll now define the PyFunc wrapper model. Notice that we import our custom package again inside the <code>load_context()</code> function. This ensures all required code is saved to our model's artifacts in MLflow.</p><pre><code># Define the custom PyFunc model
class FraudDetectionModel(mlflow.pyfunc.PythonModel):
    def load_context(self, context):
        import xgboost as xgb
        import joblib
        import pandas as pd
        # Import custom transformers from the installed package
        from custom_transformers import JSONFlattener, EmailDomainExtractor

        # Load the preprocessor and model artifacts
        self.preprocessor = joblib.load(context.artifacts["preprocessor_path"])
        self.booster = xgb.Booster()
        self.booster.load_model(context.artifacts["model_path"])

    def predict(self, context, model_input):
        # Re-import here so these names resolve on the serving endpoint,
        # where load_context()'s local imports are out of scope
        import pandas as pd
        import xgboost as xgb

        processed_input = self.preprocessor.transform(model_input)
        if isinstance(processed_input, pd.DataFrame):
            processed_input = processed_input.values
        dmatrix = xgb.DMatrix(processed_input)
        predictions = self.booster.predict(dmatrix)
        return predictions</code></pre><p>Additionally, we can prevent possible container build failures by specifying compatible package versions in our predefined <code>conda_env</code>:</p><pre><code>conda_env = {
    'name': 'mlflow-env',
    'channels': ['defaults'],
    'dependencies': [
        'python=3.11.0',
        'pip',
        {
            'pip': [
                'mlflow==2.17.0',
                'xgboost==2.0.3',
                'joblib==1.2.0',
                'faker==18.11.2',
                'scikit-learn==1.2.2',
                'numpy==1.23.5',
                'cloudpickle==2.2.1',
            ],
        },
    ],
}</code></pre><p>Finally we can log and register the model to MLflow. Two things I want to make note of:</p><ol><li><p>The <code>code_paths</code> parameter in the <code>log_model()</code> function that we alluded to earlier is crucial for ensuring the model has access to our preprocessing code on the endpoint</p></li><li><p>We alias the latest run as <code>Production</code> and then call <code>@Production</code> at the end of the <code>model_uri</code> in the following cell; we could have omitted this and kept track of version numbers instead, but that's up to you</p></li></ol><pre><code># Log the model using mlflow.pyfunc
with mlflow.start_run() as run:
    mlflow.pyfunc.log_model(
        artifact_path="model",
        python_model=FraudDetectionModel(),
        artifacts=artifacts,
        conda_env=conda_env,
        code_paths=[custom_transformers_path],
        signature=signature,
        input_example=sample_input
    )
    model_uri = f"runs:/{run.info.run_id}/model"
    registered_model_name = "credit_card_fraud_detection_pyfunc"
    result = mlflow.register_model(
        model_uri=model_uri,
        name=registered_model_name
    )
    print(f"Model registered as {registered_model_name} with version {result.version}")

from mlflow.tracking import MlflowClient

client = MlflowClient()
client.set_registered_model_alias(
    name=registered_model_name,
    alias="Production",
    version=result.version
)</code></pre><p>We can check that this is functioning as intended by loading the model from MLflow and then running some records through it:</p><pre><code># Load the model using the alias and test predictions
model_uri = f"models:/{registered_model_name}@Production"
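# Alternatively, skip the alias and pin an explicit registry version in the
# URI instead, e.g. using the version returned by register_model() earlier:
# model_uri = f"models:/{registered_model_name}/{result.version}"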
loaded_model = mlflow.pyfunc.load_model(model_uri)

new_data = generate_data(num_rows=5)
predictions = loaded_model.predict(new_data)

print("\nPredictions:")
print(predictions)</code></pre><p>The above shows that this works in the notebook environment, but a common problem data scientists face is code that works in the notebook yet fails on the endpoint. We can dramatically speed up the process of testing dependency compatibility and other environment issues by using the <code>mlflow.models.predict()</code> functionality too. The difference between this and <code>load_model()</code> may not be obvious at first glance, especially because <a href="https://mlflow.org/docs/latest/python_api/mlflow.models.html#mlflow.models.predict">the documentation</a> doesn't fully explain the distinction. In short, <code>mlflow.models.predict()</code> builds a lightweight virtual environment based on the <code>conda_env</code> we specified earlier, while <code>load_model()</code> reuses the notebook's existing environment, making the former a much closer proxy for the endpoint. You'll notice <code>mlflow.models.predict()</code> takes about 5-10x longer than <code>load_model()</code>, <strong>but a minute here may save you hours of iterative development if your alternative is waiting to see whether your endpoint environment is valid!</strong></p><p>In November 2024, the <a href="https://docs.databricks.com/en/machine-learning/model-serving/model-serving-debug.html">Databricks documentation</a> for <code>mlflow.models.predict()</code> was updated to better explain this.</p><pre><code>import os
import json

# Define temporary output path
output_path = "/tmp/mlflow_predictions.json"

# This is a much better test than loaded_model.predict()
mlflow.models.predict(model_uri, new_data, output_path=output_path)
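# Note: mlflow.models.predict() also accepts an env_manager argument
# ("virtualenv" by default). Setting it to "conda" rebuilds the environment
# from the conda_env we logged, which may mirror the endpoint even more
# closely; this variant is a sketch, not something benchmarked here:
# mlflow.models.predict(model_uri, new_data, env_manager="conda", output_path=output_path)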

# Read predictions from the output file
with open(output_path, "r") as f:
    predictions = json.load(f)</code></pre><pre><code>mlflow.end_run()</code></pre><p>And we&#8217;re done! If you&#8217;ve made it this far, I applaud you. Let&#8217;s quickly recap what we&#8217;ve done here:</p><ul><li><p>Demonstrated how to reference an external script within a PyFunc model</p><ul><li><p>In the form of a data preprocessing pipeline for real-time inference</p></li></ul></li><li><p>Implemented an end-to-end fraud detection pipeline </p><ul><li><p>With a non-sklearn model and our custom preprocessing</p></li></ul></li><li><p>Properly packaged, logged, and registered our custom code with MLflow</p><ul><li><p>Tested both in the notebook environment and a simulated deployment environment</p></li></ul></li></ul><p>Happy coding!</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.databricksters.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Databricksters! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Braving through the pitfalls of LLM judges]]></title><description><![CDATA[A guide on improving your LLM evaluations.]]></description><link>https://www.databricksters.com/p/braving-through-the-pitfalls-of-llm</link><guid isPermaLink="false">https://www.databricksters.com/p/braving-through-the-pitfalls-of-llm</guid><dc:creator><![CDATA[Veena]]></dc:creator><pubDate>Tue, 15 Apr 2025 16:36:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!lJgz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67bc2de-96da-41a0-ba74-f30430f56ca0_612x425.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lJgz!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67bc2de-96da-41a0-ba74-f30430f56ca0_612x425.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lJgz!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67bc2de-96da-41a0-ba74-f30430f56ca0_612x425.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!lJgz!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67bc2de-96da-41a0-ba74-f30430f56ca0_612x425.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lJgz!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67bc2de-96da-41a0-ba74-f30430f56ca0_612x425.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lJgz!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67bc2de-96da-41a0-ba74-f30430f56ca0_612x425.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lJgz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67bc2de-96da-41a0-ba74-f30430f56ca0_612x425.jpeg" width="612" height="425" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d67bc2de-96da-41a0-ba74-f30430f56ca0_612x425.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:425,&quot;width&quot;:612,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:40529,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/161362901?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67bc2de-96da-41a0-ba74-f30430f56ca0_612x425.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!lJgz!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67bc2de-96da-41a0-ba74-f30430f56ca0_612x425.jpeg 424w, https://substackcdn.com/image/fetch/$s_!lJgz!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67bc2de-96da-41a0-ba74-f30430f56ca0_612x425.jpeg 848w, https://substackcdn.com/image/fetch/$s_!lJgz!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67bc2de-96da-41a0-ba74-f30430f56ca0_612x425.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!lJgz!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd67bc2de-96da-41a0-ba74-f30430f56ca0_612x425.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">a picture of Judge Judy (the original judge). </figcaption></figure></div><p>LLM judges are the de facto standard for evaluating anything related to LLMs. Human evaluations are too expensive and difficult to scale, so LLMs are a practical alternative.</p><p>But judges aren&#8217;t perfect. Here, we will examine some of the common problems with LLM judges and explore different ways to deal with them.</p><p>Please note that this blog will not cover non-LLM evaluations!</p><h2>Let us first review what Agent Evaluation looks like in Databricks.</h2><p>LLM-as-a-judge is a common evaluation technique; instead of using a human to evaluate a text response, we use an LLM. <a href="https://docs.databricks.com/aws/en/generative-ai/agent-evaluation/">Mosaic AI Agent Evaluation</a> allows you to systematically assess the quality of your agentic applications. This includes the use of LLM judges.</p><p>There are several <a href="https://docs.databricks.com/aws/en/generative-ai/agent-evaluation/llm-judge-reference">built-in judges</a> that can be used, including:</p><ul><li><p>Correctness judge: assesses whether response is accurate</p></li><li><p>Helpfulness judge: assesses if response satisfies the user</p></li><li><p>Harmlessness judge: assesses if response avoids harmful content</p></li><li><p>Coherence judge: assesses if response is logical</p></li><li><p>Relevance judge: assesses whether response addresses the query</p></li></ul><p>Given an evaluation set, you can use these judges to evaluate. 
Each judge takes a different set of inputs; for example, the Correctness judge requires a request, a response, and an expected response, while the Harmlessness judge only requires a request and a response. You can take a look at <a href="https://docs.databricks.com/aws/en/generative-ai/agent-evaluation/llm-judge-reference">what judges are available and how to use them</a>.</p><h3>LLMs have a hard time with numbers.</h3><p>Let me show you what a basic implementation looks like. Using the Databricks callable judge SDK, you can invoke the correctness judge like so:</p><pre><code>from databricks.agents.evals import judges

assessment = judges.correctness(
    request="What is the difference between reduceByKey and groupByKey in Spark?",
    response="reduceByKey aggregates data before shuffling, whereas groupByKey shuffles all data, making reduceByKey more efficient.",
    expected_facts=[
        "reduceByKey aggregates data before shuffling",
        "groupByKey shuffles all data",
    ]
)</code></pre><p>We can see that an assessment contains information something like this:</p><pre><code>Assessment: 
error_code=None
error_message=None
metadata={}
name='correctness'
rationale="..." 
value=CategoricalRating.YES</code></pre><p>The value that the assessment returned is categorical. LLMs notably struggle quite a lot with numerical scoring. Some studies show that they have preferences for certain values. Other studies often show them clustering around the highest and lowest values, instead of utilizing the full range.</p><p>Let us assume you have already created a judge to output scores from 1 to 10. When graphing the scores with the &#8220;ideal&#8221; scores, you could see something like this: </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Jwni!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd78a8622-469a-4803-b210-91df22e37101_800x470.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Jwni!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd78a8622-469a-4803-b210-91df22e37101_800x470.png 424w, https://substackcdn.com/image/fetch/$s_!Jwni!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd78a8622-469a-4803-b210-91df22e37101_800x470.png 848w, https://substackcdn.com/image/fetch/$s_!Jwni!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd78a8622-469a-4803-b210-91df22e37101_800x470.png 1272w, https://substackcdn.com/image/fetch/$s_!Jwni!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd78a8622-469a-4803-b210-91df22e37101_800x470.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!Jwni!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd78a8622-469a-4803-b210-91df22e37101_800x470.png" width="800" height="470" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d78a8622-469a-4803-b210-91df22e37101_800x470.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:470,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:59982,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/161362901?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb1a1230d-b74d-413e-8724-833b3f805619_800x500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Jwni!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd78a8622-469a-4803-b210-91df22e37101_800x470.png 424w, https://substackcdn.com/image/fetch/$s_!Jwni!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd78a8622-469a-4803-b210-91df22e37101_800x470.png 848w, https://substackcdn.com/image/fetch/$s_!Jwni!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd78a8622-469a-4803-b210-91df22e37101_800x470.png 1272w, https://substackcdn.com/image/fetch/$s_!Jwni!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd78a8622-469a-4803-b210-91df22e37101_800x470.png 1456w" sizes="100vw" 
loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">In this example, the LLM gives perfect scores until a certain threshold, where it drops to very low scores. Note: this graph was created manually via matplotlib. </figcaption></figure></div><p>The built-in judge already includes a categorical value instead of a numerical one, but if you need numerical ratings, <a href="https://arxiv.org/abs/2310.08491">prompt the judge with an explanation for each of the scores.</a> In MLFlow, you can include evaluation examples when defining different evaluation metrics.</p><pre><code>average_example = EvaluationExample(
    input="What are the main types of horse breeds?",
    output="The main horse breeds include Arabian, Thoroughbred, Quarter Horse, Appaloosa, Morgan, Tennessee Walker, Clydesdale, and Mustang. Arabians are known for endurance, Thoroughbreds for racing, Quarter Horses for sprinting, and Clydesdales for their size and strength.",
    score=5,
    justification="This response correctly lists 8 common horse breeds and provides brief descriptions for 4 of them, but the descriptions are very basic and only cover half of the breeds mentioned. It lacks depth about breed characteristics, historical origins, or typical uses.",
    grading_context={
        "targets": "There are numerous horse breeds worldwide, with common breeds including Arabian, Thoroughbred, Quarter Horse, Appaloosa, Morgan, Tennessee Walker, Andalusian, Friesian, Clydesdale, Percheron, Mustang, and Shetland Pony. Each breed has distinctive physical traits, temperaments, and was developed for specific purposes like racing, work, or riding."
    },
)

# poor_example and excellent_example (not shown) are assumed to be defined
# analogously to average_example, anchoring the low and high ends of the scale
horse_breed_similarity_metric = answer_similarity(
    examples=[poor_example, average_example, excellent_example])</code></pre><h3>LLMs like long answers.</h3><p>In my last example regarding horses, you can see that the average example was scored lower than it would have been because it lacked &#8216;depth.&#8217; Unfortunately, lots of LLMs equate depth with a lot of unnecessary chatter. When graphing scores of responses of equal quality, you may see long responses rewarded more than short responses, like: </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WoeS!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F661bd0b0-f290-45b2-a50e-44ab162298e3_800x500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WoeS!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F661bd0b0-f290-45b2-a50e-44ab162298e3_800x500.png 424w, https://substackcdn.com/image/fetch/$s_!WoeS!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F661bd0b0-f290-45b2-a50e-44ab162298e3_800x500.png 848w, https://substackcdn.com/image/fetch/$s_!WoeS!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F661bd0b0-f290-45b2-a50e-44ab162298e3_800x500.png 1272w, https://substackcdn.com/image/fetch/$s_!WoeS!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F661bd0b0-f290-45b2-a50e-44ab162298e3_800x500.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!WoeS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F661bd0b0-f290-45b2-a50e-44ab162298e3_800x500.png" width="800" height="500" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/661bd0b0-f290-45b2-a50e-44ab162298e3_800x500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:39998,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/161362901?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F661bd0b0-f290-45b2-a50e-44ab162298e3_800x500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WoeS!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F661bd0b0-f290-45b2-a50e-44ab162298e3_800x500.png 424w, https://substackcdn.com/image/fetch/$s_!WoeS!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F661bd0b0-f290-45b2-a50e-44ab162298e3_800x500.png 848w, https://substackcdn.com/image/fetch/$s_!WoeS!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F661bd0b0-f290-45b2-a50e-44ab162298e3_800x500.png 1272w, https://substackcdn.com/image/fetch/$s_!WoeS!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F661bd0b0-f290-45b2-a50e-44ab162298e3_800x500.png 1456w" sizes="100vw" 
loading="lazy"></picture></div></a><figcaption class="image-caption">In this example, the LLM judge scores rise with response length, even when the actual quality of the content remains constant. Note: this graph was generated manually via matplotlib. </figcaption></figure></div><p><a href="https://arxiv.org/abs/2310.10076">LLM judges tend to prefer longer outputs</a>. This makes sense if you have ever used one of these chatbots. Depending on your use case, this may be undesirable. When I am talking to a customer service chatbot, for example, I get frustrated when it answers my simple questions with paragraph-long responses. Conciseness is incredibly important. This bias can also mean that more accurate responses are drowned out by rambling, semi-accurate ones. Watch what your judge approves to make sure it is not punishing brevity.</p><p>If you notice that only long answers are getting approved, you can adjust scores based on length, penalizing answers that are &#8216;too&#8217; long.</p><p>Assume you have extracted the base score from the judge. A simple function can then compute a penalty for any answer that exceeds a hardcoded length threshold.</p><pre><code>def linear_verbosity_penalty(request_length, response_length, base_score):
    """Penalty to subtract from base_score for overly long responses."""
    length_ratio = response_length / max(1, request_length)
    threshold = 3.0  # responses longer than 3x the request get penalized
    if length_ratio &lt;= threshold:
        return 0.0
    # penalty grows with excess length, capped at 30% of the base score
    return min(base_score * 0.3, (length_ratio - threshold) * 0.5)</code></pre><p>But you can also approach this in a more sophisticated manner. <a href="https://arxiv.org/pdf/2404.04475">In this paper</a>, they fit a regression model to predict &#8220;what would the score be if the responses all had the same length?&#8221; This improved correlation with human preferences from 0.94 to 0.98; in most cases, however, this is overkill.</p><h3>LLMs are biased towards themselves.</h3><p>We have also seen that<a href="https://arxiv.org/html/2405.01724v1"> LLM judges have a preference for text with lower perplexity</a>. This suggests that LLMs prefer language similar to the language they were trained on, which can lead to your evaluators assigning higher scores to outputs generated by their own kind. For example, you can no longer trust a GPT model to evaluate a Llama 8B model against a GPT-4o-mini model without bias.</p><p>You can mitigate this bias by using a jury: a collection of LLM judges. The goal is to use LLMs from different families, so one LLM&#8217;s bias towards an answer does not prevent you from understanding the quality of the response. You can create a custom metric in Databricks to do this. Here, I am defining three different judges using different LLMs with the same prompt.</p><pre><code>import mlflow
from mlflow.metrics.genai import make_genai_metric_from_prompt
from databricks.agents.evals import metric
from mlflow.evaluation import Assessment


judge_prompt = """
Determine if this response accurately covers all expected facts.

Request: '{inputs}'
Response: '{response}'
"""

llama_judge = make_genai_metric_from_prompt(
    name="accuracy_judge1",
    judge_prompt=judge_prompt,
    model="endpoints:/databricks-meta-llama-3-1-405b-instruct",
    metric_metadata={"assessment_type": "ANSWER"},
)

claude_judge = make_genai_metric_from_prompt(
    name="accuracy_judge2",
    judge_prompt=judge_prompt,
    model="endpoints:/databricks-claude-3-7-sonnet",
    metric_metadata={"assessment_type": "ANSWER"},
)

gpt_judge = make_genai_metric_from_prompt(
    name="accuracy_judge3",
    judge_prompt=judge_prompt,
    model="endpoints:/test-gpt-endpoint",
    metric_metadata={"assessment_type": "ANSWER"},
)</code></pre><p>Then, take these individual judges and define a custom metric. I am averaging the scores output by each judge here, but if you want to avoid numeric values, you can instead average across boolean values. If you find the metric is too &#8216;easy&#8217; to pass, you can instead require that each judge individually clear a numeric threshold. This makes sense for higher-risk use cases.</p><pre><code>@metric
def llm_jury(request, response):
   inputs = request['messages'][0]['content']

   llama_metric_result = llama_judge(inputs=inputs, response=response)
   claude_metric_result = claude_judge(inputs=inputs, response=response)
   gpt_metric_result = gpt_judge(inputs=inputs, response=response)

   # average the three judges into a single jury score
   mean_score = (llama_metric_result.scores[0]
                 + claude_metric_result.scores[0]
                 + gpt_metric_result.scores[0]) / 3

   return [
       Assessment(
           name="llm_jury_score",
           value=mean_score,
           rationale=f"LLAMA: {llama_metric_result.scores[0]:.2f}, Claude: {claude_metric_result.scores[0]:.2f}, GPT: {gpt_metric_result.scores[0]:.2f}"
       )
   ]</code></pre><h3>LLMs produce inconsistent evals.</h3><p>Judges can output significantly different results if you prompt them multiple times. By default, <code>make_genai_metric_from_prompt</code> uses <code>temperature=0.0</code> and <code>top_p=1.0</code>, but there is still a chance that the LLM judges output different results (<a href="https://www.databricksters.com/p/on-the-topic-of-llms-and-non-determinism">if you have time, read Austin&#8217;s blog on LLM non-determinism here</a>).</p><p>You can test a judge&#8217;s consistency by prompting it multiple times. In a perfect world, the same prompt would always get the same score, but in practice, you may get something like this: </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LUM8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0246cfe1-175d-4548-a95a-2a42e6360f9d_800x500.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LUM8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0246cfe1-175d-4548-a95a-2a42e6360f9d_800x500.png 424w, https://substackcdn.com/image/fetch/$s_!LUM8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0246cfe1-175d-4548-a95a-2a42e6360f9d_800x500.png 848w, https://substackcdn.com/image/fetch/$s_!LUM8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0246cfe1-175d-4548-a95a-2a42e6360f9d_800x500.png 1272w, 
https://substackcdn.com/image/fetch/$s_!LUM8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0246cfe1-175d-4548-a95a-2a42e6360f9d_800x500.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LUM8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0246cfe1-175d-4548-a95a-2a42e6360f9d_800x500.png" width="800" height="500" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0246cfe1-175d-4548-a95a-2a42e6360f9d_800x500.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:500,&quot;width&quot;:800,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64840,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/161362901?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0246cfe1-175d-4548-a95a-2a42e6360f9d_800x500.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LUM8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0246cfe1-175d-4548-a95a-2a42e6360f9d_800x500.png 424w, https://substackcdn.com/image/fetch/$s_!LUM8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0246cfe1-175d-4548-a95a-2a42e6360f9d_800x500.png 848w, https://substackcdn.com/image/fetch/$s_!LUM8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0246cfe1-175d-4548-a95a-2a42e6360f9d_800x500.png 
1272w, https://substackcdn.com/image/fetch/$s_!LUM8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0246cfe1-175d-4548-a95a-2a42e6360f9d_800x500.png 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">In this example, we can see that the same prompt gets different scores. Note: this graph was made manually in matplotlib. </figcaption></figure></div><p>There are a few ways to address this. You can treat it the same way we treated the self-preference bias above. 
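</p><p>One of those ways is majority voting. Here is a minimal sketch, assuming a hypothetical <code>judge_score</code> function that wraps a single call to your judge endpoint:</p><pre><code>from collections import Counter

def majority_vote(judge_score, inputs, response, n_trials=5):
    """Prompt the judge n_trials times and keep the most common output."""
    outputs = [judge_score(inputs, response) for _ in range(n_trials)]
    return Counter(outputs).most_common(1)[0][0]</code></pre><p>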
Instead of prompting an LLM once, prompt it multiple times (say, five) and keep the majority output. But because this option can increase cost significantly, you can also try to prompt the judge itself more intelligently. Try something like chain-of-thought reasoning, and <a href="https://arxiv.org/abs/2310.08491">ask the model for its reasoning before outputting a binary score.</a> This forces more deliberate consideration and reduces random variation.</p><h3>We are at an intermediate stage of LLM evaluation.</h3><p>Current LLM judges are useful but imperfect tools. So, where does that leave us? </p><p>Start by identifying which biases most significantly impact your specific use case. For customer support bots, length bias might be your primary concern. For factual assessment, inconsistency may be more problematic. Apply targeted solutions for the problems you are actually seeing, instead of trying to solve every issue at once. </p><p>We can expect evaluation techniques to continue to improve. Remember, the goal is not to implement every possible mitigation, but to build a system that provides consistent and actionable feedback. </p><p>I hope this guide was helpful. Please comment if you have any questions. </p><p>Happy judging!</p>]]></content:encoded></item><item><title><![CDATA[Databricks Vector Search Similarity Scores Deep Dive ]]></title><description><![CDATA[Have you noticed unexpected results from Databricks Vector Search&#8217;s similarity_search?]]></description><link>https://www.databricksters.com/p/databricks-vector-search-similarity</link><guid isPermaLink="false">https://www.databricksters.com/p/databricks-vector-search-similarity</guid><dc:creator><![CDATA[Joshua Eason]]></dc:creator><pubDate>Wed, 09 Apr 2025 18:57:37 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!EoBW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c41254-75f7-4e6a-a85b-72bd78781e03_1489x959.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you noticed unexpected results from Databricks Vector Search&#8217;s similarity_search? If you&#8217;re coming from a cosine similarity background, the scores might seem puzzlingly misaligned with your expectations. 
In this blog, we&#8217;ll dive deep into why this happens, explain the key differences between similarity metrics, and provide a solution to bridge this gap.</p><p><strong>Cosine Similarity</strong></p><p>Let&#8217;s start with a quick refresher on cosine similarity. The cosine similarity between two vectors a and b is defined as:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{cosine_sim}(a, b) = \\frac{\\vec{a} \\cdot \\vec{b}}{\\|\\vec{a}\\| \\cdot \\|\\vec{b}\\|}\n&quot;,&quot;id&quot;:&quot;PTHFFWHRWL&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p><strong>Geometric Interpretation</strong></p><p>Cosine similarity gives us a clear geometric intuition for vector similarity by measuring the angle between vectors in vector space. If two vectors point in the same direction, their cosine similarity is 1; if they are orthogonal (at 90&#176;), it is 0; and if they point in opposite directions, it is &#8211;1. 
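</p><p>These three cases are easy to verify numerically. A quick sketch with made-up 2D vectors (the values are illustrative, not from this post):</p><pre><code>import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 0.0])
print(cos_sim(a, np.array([3.0, 0.0])))   # same direction, longer arrow: 1.0
print(cos_sim(a, np.array([0.0, 2.0])))   # orthogonal: 0.0
print(cos_sim(a, np.array([-5.0, 0.0])))  # opposite direction: -1.0</code></pre><p>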
This works because cosine similarity is simply the cosine of the angle between them in their vector space.</p><p>Think of it like comparing the orientation of arrows: two arrows pointing in the same direction, no matter how long, are perfectly aligned (cosine = 1), while arrows at right angles share no alignment (cosine = 0), and arrows pointing in opposite directions are fully misaligned (cosine = &#8211;1). Unlike Euclidean distance, which can be influenced by the length of the vectors, cosine similarity purely reflects alignment, making it especially useful in high-dimensional spaces where magnitude may vary but direction carries semantic meaning.</p><p>Here is a sample implementation</p><pre><code>import numpy as np

def cosine_similarity(a, b):
    a = np.array(a)
    b = np.array(b)
    dot_product = np.dot(a, b)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    return dot_product / (norm_a * norm_b)</code></pre><p>In practice, many of us rely on the scikit-learn implementation.</p><pre><code>import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Example vectors
a = np.array([1, 2, 3])
b = np.array([4, 5, 6])

# Reshape for sklearn (expects 2D arrays)
similarity = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0][0]</code></pre><p>Interestingly, when the input vectors are normalized to unit length so that </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\|\\vec{a}\\| = \\|\\vec{b}\\| = 1&quot;,&quot;id&quot;:&quot;ESCYZSPSAJ&quot;}" data-component-name="LatexBlockToDOM"></div><p>cosine similarity reduces to a simple dot product:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{cosine_sim}(a, b) = \\vec{a} \\cdot \\vec{b}&quot;,&quot;id&quot;:&quot;DXZOPUTNMJ&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>In cases where we want a bounded score, cosine_sim &#8712;<em> </em>[0, 1], we may apply a linear transformation such as </p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;bounded\\_cosine = \\frac{cosine\\_sim +1}{2}&quot;,&quot;id&quot;:&quot;UEFISHOVCQ&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>which preserves order and simply squashes the outputs of your cosine similarity function into the new range. This can be very beneficial when your similarity scores are used as intermediate inputs to downstream models that have strict boundary requirements or are sensitive to negative values.</p><p>Another variation that you may have seen before is the cosine distance, which is just 1 <em>&#8722; </em>cosine_sim. This is often used in conjunction with the other transformations, but provides an interpretation based on distance (smaller is closer) instead of alignment (higher is closer).</p><p><strong>Databricks&#8217; Similarity Computation</strong></p><p>You may have noticed that neither of these methods produces the scores you see when you perform a similarity_search using the Databricks VectorSearchIndex class. 
From the <a href="https://docs.databricks.com/aws/en/generative-ai/vector-search%23keyword-search-algorithm">Official Databricks</a> <a href="https://docs.databricks.com/aws/en/generative-ai/vector-search%23keyword-search-algorithm">Documentation,</a> we can see that Databricks uses the following formula to compute similarity:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{similarity} = \\frac{1}{(1 + \\text{dist(q, x)}^2)}&quot;,&quot;id&quot;:&quot;DFWBHUQRVM&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>Where</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{dist}(q, x) = \\sqrt{(q_1 - x_1)^2 + (q_2 - x_2)^2 + \\cdots + (q_d - x_d)^2}&quot;,&quot;id&quot;:&quot;KNJMSRDIAL&quot;}" data-component-name="LatexBlockToDOM"></div><p></p><p>While this formula relies on the Euclidean distance, it is not just the Euclidean distance. In contrast to cosine similarity, which compares <strong>direction</strong>, this score is a monotonically decreasing transformation of the Euclidean distance, which compares <strong>position</strong>. If the vectors are not normalized, the algorithm is sensitive to both the <strong>magnitude </strong>and <strong>alignment </strong>of the vectors. For example:</p><ul><li><p>Two vectors pointing in the same direction but with very different lengths may have high cosine similarity but large Euclidean distance.</p></li><li><p>Conversely, two vectors that are numerically close (in terms of component values) but not aligned may have small Euclidean distance but low cosine similarity.</p></li></ul><p>This positional sensitivity makes the score potentially misleading in semantic spaces (like embeddings) where <strong>direction </strong>is more meaningful than length.</p><p>For clarity, here&#8217;s the Databricks similarity scoring function referenced in our examples:</p><pre><code>import numpy as np

def euclidean_distance(q, x):
    """Calculate Euclidean distance between vectors q and x"""
    q = np.array(q)
    x = np.array(x)
    return np.linalg.norm(q - x)

def databricks_similarity_score(q, x):
    """similarity score based on the formula: 1 / (1 + dist(q, x)^2)"""
    distance = euclidean_distance(q, x)
    return 1 / (1 + distance ** 2)</code></pre><p><strong>Side Note - Hybrid Similarity Score</strong></p><p>There is an additional step in producing the score for a hybrid search: if you pass the <code>query_type='HYBRID'</code> argument when calling the <code>VectorSearchIndex.similarity_search</code> method, a composite of BM25 keyword and vector similarity is used. This score relies on an algorithm called Reciprocal Rank Fusion (RRF), which aggregates rankings from several sources into a single ranking. For more detailed information, please see <a href="https://plg.uwaterloo.ca/~gvcormac/cormacksigir09-rrf.pdf">Ref (1).</a></p><p><strong>Vector Normalization</strong></p><p>Vector normalization rescales a vector to have unit length (i.e., L2 norm of 1). This process preserves the vector&#8217;s direction while standardizing its magnitude, which is crucial for reliable similarity comparisons.</p><p>Given a vector a, its normalized form is:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\hat{a} = \\frac{\\vec{a}}{\\|\\vec{a}\\|}&quot;,&quot;id&quot;:&quot;VQMRGEQGPD&quot;}" data-component-name="LatexBlockToDOM"></div><p>Normalization essentially projects all vectors onto the unit hypersphere in n-dimensional space, allowing us to compare them purely by their direction and eliminating magnitude differences that can distort similarity measures.</p><p>Here is a vanilla Python implementation:</p><pre><code>import numpy as np

def l2_normalize(vector):
    vector = np.array(vector)
    norm = np.linalg.norm(vector)
    if norm == 0:
        return vector  # Avoid division by zero
    return vector / norm</code></pre><p>Though in practice, most of us rely on the scikit learn implementation</p><pre><code>from sklearn.preprocessing import normalize
import numpy as np

# Each row will be treated as a separate vector
X = np.array([[1, 2, 3], [4, 5, 6]])

# Normalize along rows (axis=1)
X_normalized = normalize(X, norm='l2', axis=1)</code></pre><p><strong>Why Normalize?</strong></p><p>So, why is normalization so important?</p><p>This sketch illustrates the geometric difference between cosine similarity and the Databricks similarity score, which is derived from squared Euclidean distance.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EoBW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c41254-75f7-4e6a-a85b-72bd78781e03_1489x959.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EoBW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c41254-75f7-4e6a-a85b-72bd78781e03_1489x959.png 424w, https://substackcdn.com/image/fetch/$s_!EoBW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c41254-75f7-4e6a-a85b-72bd78781e03_1489x959.png 848w, https://substackcdn.com/image/fetch/$s_!EoBW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c41254-75f7-4e6a-a85b-72bd78781e03_1489x959.png 1272w, https://substackcdn.com/image/fetch/$s_!EoBW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c41254-75f7-4e6a-a85b-72bd78781e03_1489x959.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EoBW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c41254-75f7-4e6a-a85b-72bd78781e03_1489x959.png" width="1456" height="938" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/21c41254-75f7-4e6a-a85b-72bd78781e03_1489x959.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:938,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:105904,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/160948890?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c41254-75f7-4e6a-a85b-72bd78781e03_1489x959.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EoBW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c41254-75f7-4e6a-a85b-72bd78781e03_1489x959.png 424w, https://substackcdn.com/image/fetch/$s_!EoBW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c41254-75f7-4e6a-a85b-72bd78781e03_1489x959.png 848w, https://substackcdn.com/image/fetch/$s_!EoBW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c41254-75f7-4e6a-a85b-72bd78781e03_1489x959.png 1272w, https://substackcdn.com/image/fetch/$s_!EoBW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F21c41254-75f7-4e6a-a85b-72bd78781e03_1489x959.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p></p><p>The green angles between vectors (e.g., cos(a, b) and cos(b, c)) represent cosine similarity, which depends purely on direction. In contrast, the orange segments represent Euclidean distances between vector tips &#8212; and since the Databricks score is computed as a decreasing transformation of that distance, these distances directly influence the similarity score.</p><p>When embeddings are not normalized, the magnitude of the vectors affects the result. Even though Vector b (raw) is directionally aligned with Vector a, its large magnitude causes the straight-line distance dist(a, b_raw) to be quite large &#8212; leading to a low Databricks score. 
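</p><p>To make this concrete, here is a small sketch with made-up vectors (illustrative values, not from the post): a long vector aligned with a scores worse under the distance-based formula than a nearby but less aligned vector, even though cosine similarity ranks them the other way around.</p><pre><code>import numpy as np

def cos_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def databricks_score(q, x):
    # 1 / (1 + dist(q, x)^2), per the formula above
    return float(1 / (1 + np.linalg.norm(np.asarray(q) - np.asarray(x)) ** 2))

a = np.array([1.0, 0.0])
b_raw = np.array([10.0, 0.0])  # same direction as a, much larger magnitude
c = np.array([1.0, 1.0])       # close to a in space, but less aligned

print(cos_sim(a, b_raw), cos_sim(a, c))                    # 1.0 vs ~0.707
print(databricks_score(a, b_raw), databricks_score(a, c))  # ~0.012 vs 0.5</code></pre><p>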
At the same time, Vector c, which is closer in space but less aligned in direction, will have a smaller Euclidean distance to a, and therefore a higher similarity score under the Databricks formula. This misrepresents their true semantic similarity if you were expecting cosine-like behavior.</p><p>This is why normalization is essential: when all vectors are normalized (as with Vector b (norm&#8217;d)), the Databricks similarity score becomes a function of angle alone &#8212; effectively mirroring cosine similarity.</p><p>While normalization offers a range of benefits &#8212; like simplifying cosine similarity computation, removing scale bias, and improving behavior in clustering &#8212; the key reason we normalize <strong>embedding vectors in Databricks Vector Search </strong>is to ensure <strong>rank-order equivalence </strong>to cosine similarity.</p><p><strong>Rank-Order Equivalence</strong></p><p><strong>Rank-Order Equivalence</strong>: When vectors are normalized to unit length, the ranking of results by L2 distance matches the ranking by cosine similarity, even though the actual numerical score values differ.</p><p>This equivalence lets you take advantage of fast approximate L2 search while still reasoning about results as if you&#8217;re using cosine similarity.</p><p>A common misconception is that the relationship is numerical (that the similarity scores themselves will be exactly equal between the two). This is not the case; the values can differ substantially. What is preserved, assuming the embeddings are normalized, is the order of results in the ranking.</p><p><strong>Model-Specific Note: GTE vs.
BGE</strong></p><ul><li><p>The <strong>BGE </strong>(BAAI General Embedding) model <strong>produces normalized embeddings </strong>by default.</p></li><li><p>The <strong>GTE </strong>(General Text Embedding) model <strong>does not </strong>produce normalized embeddings out of the box.</p></li><li><p>Other <strong>external </strong>models may or may not produce normalized embeddings, so check this before indexing with them.</p></li></ul><p>If the model you&#8217;re using does not produce normalized embeddings by default, it is <strong>strongly recommended to normalize them before indexing</strong> into Databricks Vector Search. This can be done with any L2 normalization function (like sklearn.preprocessing.normalize). Additionally, you must apply the <strong>same normalization strategy to your query vectors at retrieval time </strong>that you used when building the index. In other words, if you normalize during indexing, you must also normalize your queries, and if you don&#8217;t, then skip it at query time too.</p><p>Failing to match the normalization behavior between index and query time will result in invalid similarity comparisons and misleading results. Consistency is key to ensuring your search results are meaningful and accurate.</p><p>As the examples below demonstrate, when all vectors are properly normalized, the ranking order is identical between cosine similarity and Databricks similarity scores. This confirms the rank-order equivalence principle, which is crucial for reliable vector search.</p><p><strong>Examples</strong></p><p>Consider the following code:</p><pre><code>import numpy as np
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
from sklearn.preprocessing import normalize

# Example vectors
a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
c = np.array([1.0, 1.0])
d = np.array([0.2, 0.8])

# Helper from earlier in the post, repeated so this snippet runs standalone:
# score = 1 / (1 + ||a - b||^2)
def get_databricks_similarity_score(u, v):
    return 1 / (1 + np.linalg.norm(np.asarray(u) - np.asarray(v)) ** 2)

cos_distance = cosine_similarity(a.reshape(1, -1), b.reshape(1, -1))[0][0]
db_distance = get_databricks_similarity_score(a, b)

print(f"cosine(a,b): {cos_distance}")
print(f"db_distance(a,b): {db_distance}")

vectors = {
    "a": a,
    "b": b,
    "c": c,
}

cosine_scores = {}
db_scores = {}

for name, vec in vectors.items():
    cosine_scores[name] = cosine_similarity(d.reshape(1, -1), vec.reshape(1, -1))[0][0]
    db_scores[name] = get_databricks_similarity_score(d, vec)

cosine_ranking = sorted(cosine_scores.items(), key=lambda x: x[1], reverse=True)
db_ranking = sorted(db_scores.items(), key=lambda x: x[1], reverse=True)
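
# (Added check) The geometric identity behind rank-order equivalence: for unit
# vectors u and v, ||u - v||^2 == 2 * (1 - cos(theta)). Vectors a and b above
# are unit vectors, so the identity holds exactly here:
identity_lhs = float(np.sum((a - b) ** 2))          # squared L2 distance
identity_rhs = 2.0 * (1.0 - float(np.dot(a, b)))    # np.dot(a, b) == cos(theta)
assert np.isclose(identity_lhs, identity_rhs)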

print("\n--- Pairwise Similarity to 'd' ---")
print("Cosine Similarity Scores:")
for name, score in cosine_ranking:
    print(f"{name}: {score:.4f}")

print("\nDatabricks (L2-based) Similarity Scores:")
for name, score in db_ranking:
    print(f"{name}: {score:.4f}")

print("\nCosine Similarity Ranking:", [name for name, _ in cosine_ranking])
print("Databricks Similarity Ranking:", [name for name, _ in db_ranking])</code></pre><p>Which produces the following output</p><pre><code>cosine(a,b): 0.0
db_distance(a,b): 0.33333333333333326

--- Pairwise Similarity to 'd' ---
Cosine Similarity Scores:
b: 0.9701
c: 0.8575
a: 0.2425

Databricks (L2-based) Similarity Scores:
b: 0.9259
c: 0.5952
a: 0.4386

Cosine Similarity Ranking: ['b', 'c', 'a']
Databricks Similarity Ranking: ['b', 'c', 'a']</code></pre><p>Notice that while the numerical scores are quite different, the rankings are preserved. (Strictly speaking, only a and b here are unit vectors; the rankings still agree, but rank-order equivalence is only guaranteed when every vector is normalized, as the next example shows.)</p><p>Let&#8217;s try that again, but this time we will use arbitrary vectors that are not normalized.</p><pre><code># Define non-normalized versions of the same vectors
a_raw = np.array([2.0, 0.0])
b_raw = np.array([0.0, 3.0])
c_raw = np.array([5.0, 5.0])
d_raw = np.array([1.0, 4.0])</code></pre><p>Below, we can see that the results no longer respect rank-order equivalence</p><pre><code>--- Pairwise Similarity to 'd_raw' (Non-Normalized Vectors) ---
Cosine Similarity Scores (Non-Normalized):
b_raw: 0.9701
c_raw: 0.8575
a_raw: 0.2425

Databricks (L2-based) Similarity Scores (Non-Normalized):
b_raw: 0.3333
a_raw: 0.0556
c_raw: 0.0556

Cosine Similarity Ranking (Non-Normalized): ['b_raw', 'c_raw', 'a_raw']
Databricks Similarity Ranking (Non-Normalized): ['b_raw', 'a_raw', 'c_raw']</code></pre><p>Notice how the rankings diverged when using non-normalized vectors! Vector c_raw dropped from second to last place in the Databricks ranking despite maintaining its cosine position. This clearly demonstrates why proper normalization is essential when working with Databricks Vector Search if you want results that align with semantic expectations.</p><p>And, if we normalize the raw vectors above</p><pre><code>a_norm = normalize(a_raw.reshape(1,-1))
b_norm = normalize(b_raw.reshape(1,-1))
c_norm = normalize(c_raw.reshape(1,-1))
d_norm = normalize(d_raw.reshape(1,-1))</code></pre><p>and recheck them, we can see that the relationship is again preserved</p><pre><code>--- Pairwise Similarity to 'd_norm' (Normalized Vectors) ---
Cosine Similarity Scores (After Normalization):
b_norm: 0.9701
c_norm: 0.8575
a_norm: 0.2425

Databricks (L2-based) Similarity Scores (After Normalization):
b_norm: 0.9436
c_norm: 0.7782
a_norm: 0.3976

Cosine Similarity Ranking (Normalized): ['b_norm', 'c_norm', 'a_norm']
Databricks Similarity Ranking (Normalized): ['b_norm', 'c_norm', 'a_norm']</code></pre><p><strong>A Bridge Between These Metrics???</strong></p><p>Suppose that you require the interpretability of actual cosine similarity scores. For example, you want to apply semantic thresholds, visualize search relevance, or feed scores into a downstream model. But you still want to take advantage of the fast, scalable L2-based indexing provided by Databricks Vector Search.</p><p>What do you do?</p><p><strong>The Good News</strong></p><p>If your embedding vectors are <strong>L2-normalized </strong>before indexing, and your query vectors are also normalized at retrieval time, there&#8217;s a direct mathematical relationship between the <strong>Databricks similarity score </strong>and <strong>cosine similarity</strong>. That means you can <strong>recover cosine similarity </strong>&#8212; exactly &#8212; from the score that Databricks returns.</p><p><strong>The Algebra</strong></p><p>The following derivation shows how to convert between Databricks similarity scores and cosine similarity. 
If you&#8217;re primarily interested in the practical application, feel free to skip to the &#8220;Final Formula&#8221; section.</p><p>Let&#8217;s recall how Databricks computes similarity:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{score} = \\frac{1}{1 + \\|\\vec{a} - \\vec{b}\\|^2}&quot;,&quot;id&quot;:&quot;PESACYAPBN&quot;}" data-component-name="LatexBlockToDOM"></div><p>If both vectors, a and b are <strong>L2-normalized</strong>, then:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\|\\vec{a} - \\vec{b}\\|^2 = 2(1 - \\cos(\\theta))&quot;,&quot;id&quot;:&quot;GVAQDVEMUB&quot;}" data-component-name="LatexBlockToDOM"></div><p>Plug that into the Databricks score:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{score} = \\frac{1}{1 + 2(1 - \\cos(\\theta))} = \\frac{1}{3 - 2\\cos(\\theta)}&quot;,&quot;id&quot;:&quot;XTHGZXTCNG&quot;}" data-component-name="LatexBlockToDOM"></div><p>Now solve for cos(&#952;) from the Databricks score:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\text{score} = \\frac{1}{3 - 2\\cos(\\theta)}&quot;,&quot;id&quot;:&quot;AAXNMUIUAR&quot;}" data-component-name="LatexBlockToDOM"></div><p>Invert both sides:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;3 - 2\\cos(\\theta) = \\frac{1}{\\text{score}}&quot;,&quot;id&quot;:&quot;DMKQDQMSKI&quot;}" data-component-name="LatexBlockToDOM"></div><p>Move terms:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;-2\\cos(\\theta) = \\frac{1}{\\text{score}} - 3&quot;,&quot;id&quot;:&quot;HMXWBFFFQT&quot;}" data-component-name="LatexBlockToDOM"></div><p>Finally, divide both sides and simplify:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\cos(\\theta) = 1 - \\frac{1}{2} \\left( \\frac{1}{\\text{score}} - 1 
\\right)&quot;,&quot;id&quot;:&quot;JSTDCLNOPZ&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>Final Formula</strong></p><p>Assuming normalized vectors:</p><div class="latex-rendered" data-attrs="{&quot;persistentExpression&quot;:&quot;\\cos(\\theta) = 1 - \\frac{1}{2} \\left( \\frac{1}{\\text{score}} - 1 \\right)&quot;,&quot;id&quot;:&quot;JZGQHHJYVC&quot;}" data-component-name="LatexBlockToDOM"></div><p><strong>Sample Implementation</strong></p><p>Let&#8217;s implement this in vanilla Python so we can test it out!</p><pre><code>def cosine_from_databricks_score(score):
    """Convert Databricks similarity to cosine similarity"""
    return 1 - 0.5 * (1 / score - 1)</code></pre><p><strong>Experimentation</strong></p><pre><code>from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import normalize
import numpy as np

# Helper from earlier in the post, repeated so this snippet runs standalone:
# score = 1 / (1 + ||a - b||^2)
def databricks_similarity_score(u, v):
    return 1 / (1 + np.linalg.norm(np.asarray(u) - np.asarray(v)) ** 2)

# Define example vectors
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# Normalize the vectors
a_norm = normalize(a.reshape(1, -1))[0]
b_norm = normalize(b.reshape(1, -1))[0]

# Compute cosine similarity directly
cosine_true = cosine_similarity(a_norm.reshape(1, -1), b_norm.reshape(1, -1))[0][0]

# Compute the Databricks-style similarity score implemented earlier
db_score = databricks_similarity_score(a_norm, b_norm)

# Convert back to cosine similarity
cosine_reconstructed = cosine_from_databricks_score(db_score)
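
# (Added sketch) The inverse direction is useful for thresholding: a familiar
# cosine cutoff can be translated into the raw score Databricks returns, via
# score = 1 / (3 - 2 * cos(theta)). Round-tripping through both helpers is exact:
def databricks_score_from_cosine(cos_theta):
    return 1.0 / (3.0 - 2.0 * cos_theta)

assert abs(cosine_from_databricks_score(databricks_score_from_cosine(0.9)) - 0.9) < 1e-12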

print(f"True cosine similarity:        {cosine_true:.8f}")
print(f"Databricks score:              {db_score:.8f}")
print(f"Reconstructed cosine from DB:  {cosine_reconstructed:.8f}")</code></pre><p>Which produces the following output</p><pre><code>True cosine similarity:        0.97463185
Databricks score:              0.95171357
Reconstructed cosine from DB:  0.97463185</code></pre><p><strong>Why This Is Precise</strong></p><p>This formula is <strong>algebraically exact </strong>&#8212; not an approximation &#8212; under the key assumption that both vectors are normalized. In that case, the relationship between cosine similarity and L2 distance becomes a clean geometric identity, and the Databricks score becomes a transformed version of cosine similarity.</p><p>The only sources of deviation would be:</p><ul><li><p><strong>Numerical precision issues </strong>(e.g., in high dimensions)</p></li><li><p><strong>Vectors not being normalized</strong></p></li></ul><p>As long as normalization is handled properly, this conversion is <strong>100% faithful</strong></p><p>to what cosine similarity would return.</p><p><strong>Why You Might Want to Use It</strong></p><ul><li><p><strong>Recover interpretability</strong>: You get the familiar cosine scale of &#8211;1 to 1, or optionally [0, 1] if re-bounded</p></li><li><p><strong>Apply thresholds</strong>: Use well-understood semantic cutoffs like 0.75 or 0.9</p></li><li><p><strong>Post-process top-</strong><em><strong>k </strong></em><strong>results</strong>: After retrieving candidates from the index, re-score using this formula and sort/filter as needed</p></li><li><p><strong>Blend with other cosine-based systems</strong>: Helps when migrating to or integrating with other platforms that rely on cosine similarity</p></li></ul><p>In short: this gives you the best of both worlds &#8212; the performance of L2-based search with the intuitive power of cosine similarity.</p><p><strong>Conclusion</strong></p><p>So there you have it! The mystery of Databricks Vector Search similarity scoring demystified. 
Let&#8217;s recap what we&#8217;ve learned on this mathematical journey:</p><ol><li><p><strong>Cosine similarity </strong>focuses on the alignment (direction) of vectors, making it ideal for semantic similarity in embedding spaces.</p></li><li><p><strong>Databricks similarity </strong>measures positional differences, since it relies on straight-line (L2, Euclidean) distance.</p></li><li><p><strong>Normalization </strong>is the critical bridge between these two worlds &#8212; when vectors are normalized, the rank order of results becomes equivalent between cosine similarity and Databricks&#8217; scoring system.</p></li><li><p>If you need actual cosine similarity values (not just rankings), you can precisely recover them from Databricks scores using our conversion formula.</p></li></ol><p>This understanding gives you the best of both worlds: the performance and scale of Databricks Vector Search with the interpretability and familiarity of cosine similarity. No more scratching your head when similarity scores don&#8217;t match!</p><p>Remember that consistency is key &#8212; if you normalize your vectors during indexing (which you absolutely should for most embedding models), make sure to apply the same normalization to your query vectors at search time. And if you&#8217;re working with models like GTE that don&#8217;t normalize by default, take that extra step to ensure your embeddings live on the unit sphere.</p><p>Armed with these insights, you can now confidently build sophisticated vector search applications in Databricks that behave exactly as you expect them to. Happy searching!</p><p>Sources: (1) Cormack, G. V., Clarke, C. L., &amp; Buettcher, S. (2009, July). Reciprocal rank fusion outperforms condorcet and individual rank learning methods. In Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval (pp.
758-759).</p><p> </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.databricksters.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Databricksters! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[On the Topic of LLMs and Non-Determinism:]]></title><description><![CDATA[Practical Limitations in Combating the Myth of Uncertainty in Deep Learning]]></description><link>https://www.databricksters.com/p/on-the-topic-of-llms-and-non-determinism</link><guid isPermaLink="false">https://www.databricksters.com/p/on-the-topic-of-llms-and-non-determinism</guid><dc:creator><![CDATA[Austin]]></dc:creator><pubDate>Tue, 18 Mar 2025 14:02:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!c2pQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b06023-4434-4a73-b62d-7209a3f1f90b_300x300.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Have you ever been told that Deep Learning algorithms are inherently non-deterministic? Or that GPUs, PyTorch, TensorFlow, or LLMs are? Until recently, I also believed it to be an unfortunate fact of life that due to both hardware and software limitations I didn't fully understand, there was no way to get deterministic output from a deep learning model. But what are those specific limitations? 
Over the next few minutes I want to explore the most frequently cited sources of non-determinism in deep learning systems, why they exist, and where they can be overcome.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c2pQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b06023-4434-4a73-b62d-7209a3f1f90b_300x300.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c2pQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b06023-4434-4a73-b62d-7209a3f1f90b_300x300.jpeg 424w, https://substackcdn.com/image/fetch/$s_!c2pQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b06023-4434-4a73-b62d-7209a3f1f90b_300x300.jpeg 848w, https://substackcdn.com/image/fetch/$s_!c2pQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b06023-4434-4a73-b62d-7209a3f1f90b_300x300.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!c2pQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b06023-4434-4a73-b62d-7209a3f1f90b_300x300.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c2pQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b06023-4434-4a73-b62d-7209a3f1f90b_300x300.jpeg" width="300" height="300" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e6b06023-4434-4a73-b62d-7209a3f1f90b_300x300.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:300,&quot;width&quot;:300,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:11743,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.databricksters.com/i/159304619?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b06023-4434-4a73-b62d-7209a3f1f90b_300x300.jpeg&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!c2pQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b06023-4434-4a73-b62d-7209a3f1f90b_300x300.jpeg 424w, https://substackcdn.com/image/fetch/$s_!c2pQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b06023-4434-4a73-b62d-7209a3f1f90b_300x300.jpeg 848w, https://substackcdn.com/image/fetch/$s_!c2pQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b06023-4434-4a73-b62d-7209a3f1f90b_300x300.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!c2pQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe6b06023-4434-4a73-b62d-7209a3f1f90b_300x300.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Democritus: An early determinist philosopher and earlier day drinker</figcaption></figure></div><p>Let's begin with the low-level hardware operations. A major hurdle to deep learning determinism is the non-associative nature of floating point arithmetic. Anyone with a CS background has probably seen something like this before:</p><pre><code>## This prints False btw
print((0.7 + 0.2 + 0.1) == 1)</code></pre><p>Some of you are already reciting the reason in your mind: <a href="https://en.m.wikipedia.org/wiki/Dyadic_rational">dyadic rationals</a>. In simple terms, if a number can be expressed as a fraction whose denominator is a power of two, then it can be expressed exactly in finite binary representation.</p><p>If even a simple sum can change with the order of its additions, then what hope do we possibly have in making the billions of matrix operations that must take place to predict token sequences deterministic? Matrix multiplication does not suffer from this order-dependence precisely because it follows a fixed sequence of operations whose computation pattern is consistent across most hardware: dot products are computed in a defined order. Consider the below chunk of code from the <a href="https://www.twosigma.com/articles/a-workaround-for-non-determinism-in-tensorflow/">excellent Two Sigma blog</a> on this same topic, which showcases the comparative non-determinism of a naively implemented <code>tf.reduce_sum()</code> operation in TensorFlow versus one that leverages <code>tf.matmul()</code>:</p><pre><code>## Only runnable in TensorFlow 1.x
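## (Added aside, plain Python, runs anywhere) The order-dependence itself needs
## no GPU: regrouping three float additions already changes the bits of the result.
left = (0.1 + 0.2) + 0.3   # 0.6000000000000001
right = 0.1 + (0.2 + 0.3)  # 0.6
## left == right evaluates to False
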
import tensorflow as tf
import numpy as np
N = 100
S = (1, 100000)
np.random.seed(1)
r = np.random.normal(0, 100, S).astype(np.float32)
x = tf.placeholder(tf.float32, S)
examples = {
    'reduce_sum': tf.reduce_sum(x),
    # summing via a dot product with a ones vector fixes the accumulation order,
    # unlike the parallel reduction behind the naive reduce_sum:
    'reduce_sum_det': tf.matmul(x, tf.ones_like(x), transpose_b=True),
}
s = tf.Session()
results = {
    key: np.array([s.run(val, feed_dict={x:r}) for j in range(N)])
    for key, val in examples.items()
}
for key, val in results.items():
    print('%20s mean = %.8f max-min = %.6f' % (key, val.mean(), val.max() - val.min()))</code></pre><p>If you don't want to switch runtimes to see the above output, you can just trust that the <code>reduce_sum</code> version produces inconsistencies as large as the hundredths place, while the <code>matmul</code> version is deterministic at least out to the millionths. I reproduced the above code block in TensorFlow 2.x calling TensorFlow 1.x syntax, but they seem to have changed the way <code>tf.reduce_sum()</code> gets implemented on the backend even in the naive call, as the outputs for the below are identical:</p><pre><code>import tensorflow as tf
import numpy as np
N = 100
S = (1, 100000)
np.random.seed(1)
r = np.random.normal(0, 100, S).astype(np.float32)

def run_graph():
    tf.compat.v1.disable_eager_execution()
    
    x = tf.compat.v1.placeholder(tf.float32, S)
    examples = {
        'reduce_sum': tf.reduce_sum(x),
        'reduce_sum_det': tf.matmul(x, tf.ones_like(x), transpose_b=True),
    }
    
    with tf.compat.v1.Session() as s:
        results = {
            key: np.array([s.run(val, feed_dict={x: r}) for j in range(N)])
            for key, val in examples.items()
        }
    
    return results

results = run_graph()

# Print the results
for key, val in results.items():
    print('%20s mean = %.8f max-min = %.6f' % (key, val.mean(), val.max() - val.min()))

## Note: newer TensorFlow releases also expose
## tf.config.experimental.enable_op_determinism(), which forces deterministic
## kernels globally (at some performance cost) when bit-wise reproducibility
## is a hard requirement.

## So just by upgrading our version of TensorFlow we can achieve much greater determinism with no code changes!</code></pre><p>If you want to dig further into the above, I would strongly advise you to check out <a href="https://www.twosigma.com/articles/a-workaround-for-non-determinism-in-tensorflow/">A Workaround for Non-Determinism in TensorFlow</a> since they provide additional example code that extends this idea from the weights to the bias terms and shows a fully deterministic training run for a neural network on the MNIST dataset.</p><p>For our purposes we've already arrived at an intermediate answer: due to the non-associative properties of floating point arithmetic, certain operations heavily leveraged by deep learning algorithms are implemented at the hardware level in a bit-wise deterministic manner while others are not.</p><p>Unfortunately we can't stop here. Simply leveraging operations like <code>tf.matmul()</code> instead of atomic operations like the old version of <code>tf.reduce_sum()</code> reduces the efficiency of deep learning models substantially as the number of GPUs is increased. For LLMs especially, where hundreds of GPUs may be used, the cost impact of a deterministic architecture would be substantial. 
Add to that any or all of the following:</p><ul><li><p>There could be multiple GPU types the model runs on</p></li><li><p>Other operations such as attention masks, dropout, and different sampling methods also exist</p></li><li><p>In multi-GPU settings, timing differences in synchronization can change results</p></li><li><p>Additional problems we likely haven&#8217;t considered</p></li></ul><p>Even after setting fixed seeds, disabling certain stochastic optimizations, and controlling the data flow using something like <a href="https://docs.mosaicml.com/projects/streaming/en/latest/preparing_datasets/dataset_format.html">MDS</a>, we&#8217;re seemingly still back to square one, with non-deterministic LLMs.</p><h3><strong>Enter our Second Protagonist: LlamaForCausalLM</strong></h3><p><a href="https://huggingface.co/docs/transformers/main/en/model_doc/llama#transformers.LlamaForCausalLM">LlamaForCausalLM</a> has a parameter called <code>do_sample</code>, which if set to <code>False</code> results in no temperature scaling, no top-k or top-p filtering, and no random sampling. In theory, the computation flow reduces to:</p><p><em>Forward pass -&gt; get logits -&gt; argmax -&gt; pick single highest token -&gt; rinse and repeat</em></p><p>And we can test this out for ourselves on Databricks by importing it from the <code>transformers</code> library.</p><pre><code>## Setting up our two examples with a small model and a basic prompt:
import numpy as np
import pandas as pd
import random
import torch
from transformers import pipeline, set_seed, AutoTokenizer, LlamaForCausalLM

# HuggingFace token needed, Llama 3 is a gated repo
hf_token = dbutils.secrets.get(scope="austin_zaccor", key="hf_read_token")
model_id = "meta-llama/Llama-3.2-1B"
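
# (Added sketch) set_seed is imported above but easy to forget: it seeds
# Python's random module, NumPy, and torch in one call, which makes sampled
# runs repeatable on a fixed machine (though not guaranteed across GPU types).
set_seed(42)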

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id, token=hf_token)
model = LlamaForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float32,
    token=hf_token
)

prompt = "Explain the concept of determinism in large language models."</code></pre><pre><code>## Non-deterministic version first:
non_deterministic_results = []
for i in range(5):
  inputs = tokenizer(prompt, return_tensors='pt')
  outputs = model.generate(
    inputs.input_ids, 
    max_length=100, 
    do_sample=True,
    temperature=None, ## unsetting temperature
    top_p=None,       ## unsetting top_p
    pad_token_id=tokenizer.eos_token_id
  )
  non_deterministic_results.append(tokenizer.decode(outputs[0]))

for result in non_deterministic_results:
  print(result, '\n\n')</code></pre><pre><code>## Identical to the above, except do_sample=False
deterministic_results = []
for i in range(5):
  inputs = tokenizer(prompt, return_tensors='pt')
  outputs = model.generate(
    inputs.input_ids, 
    max_length=100, 
    do_sample=False,
    temperature=None, ## unsetting temperature
    top_p=None,       ## unsetting top_p
    pad_token_id=tokenizer.eos_token_id
  )
  deterministic_results.append(tokenizer.decode(outputs[0]))

for result in deterministic_results:
  print(result, '\n\n')</code></pre><p>Now we're back to something that looks promising. The first five are all over the place while the second five are all identical down to the token! In practice, this will create functionally deterministic outputs for most use cases, but the stochastic factors present in the typical forward pass are still present in our reduced computational flow:</p><p><em>Forward pass -&gt; get logits -&gt; argmax -&gt; pick single highest token -&gt; rinse and repeat</em></p><p>Therefore, as the number of output tokens grows, otherwise negligible differences in floating point calculations can get magnified, resulting in eventual deviations from true determinism. For example, it's possible that in applications where the top two or more logit values are extremely similar, small differences in floating point calculations in the forward pass will be enough to flip the selected token in the argmax step. This would then "bioaccumulate" through the remaining autoregressive token calculations until you have a very different output.</p><p>Hopefully, this level of reproducibility is still sufficient for your use case, but if not, I hope you will create the world's first truly deterministic implementation of a major LLM.</p><p>Before closing, I want to make a remark on the difference between setting <code>temp=0.0</code> and LlamaForCausalLM's <code>do_sample=False</code>. In theory, both should result in the same greedy decoding and therefore the same outputs, but this is empirically not the case. Databricks users deploying a Llama 3 model will naturally want to do so via <a href="https://docs.databricks.com/aws/en/machine-learning/foundation-model-apis/deploy-prov-throughput-foundation-model-apis">Provisioned Throughput</a> to take advantage of the significant performance improvements it offers over custom GPU model serving. However, Provisioned Throughput endpoints do not allow you to set <code>do_sample=False</code>.
The most you can do is set <code>temp=0.0</code>, and doing so will reveal that difference.</p><p>As we discussed in the first half of this blog, determinism can sometimes come at the cost of performance, especially in multi-GPU environments. It is therefore unsurprising that the Provisioned Throughput framework favors speed over strict reproducibility. For now, I would advise those with Llama-based use cases that are highly sensitive to non-determinism to use custom GPU model serving if it can accommodate the size of your LLM and your SLAs. For everyone else, take comfort in knowing that deterministic outputs are usually less accurate than their stochastic counterparts due to the greedy decoding that underlies them.</p><p>Happy coding.</p><div><hr></div><p></p><p><strong>About Me:</strong></p><p>I come from a DS/ML background, in which I worked for about 6 years before starting at Databricks as a Specialist Solutions Architect in GenAI and MLOps. I like to write about things I find interesting and that I think other people might benefit from.</p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.databricksters.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Databricksters! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[A Beginner's Guide to MLOps Stacks on Databricks]]></title><description><![CDATA[Equipped with an almost excessive amount of diagrams!]]></description><link>https://www.databricksters.com/p/a-beginners-guide-to-mlops-stacks</link><guid isPermaLink="false">https://www.databricksters.com/p/a-beginners-guide-to-mlops-stacks</guid><dc:creator><![CDATA[Veena]]></dc:creator><pubDate>Tue, 18 Feb 2025 18:00:15 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!SCCU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0204993-00b5-42fa-b443-133029861c71_643x503.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>MLOps Stacks is a template using Databricks Asset Bundles (aka DABs) to implement an MLOps workflow. It is easily customizable, but if you are not familiar with DABs or MLOps, it can get overwhelming quite quickly. There are a lot of folders. A lot of files. But by the end of this blog, you will understand how to use this template for your own use case. </p><p>Instantiating your first MLOps Stack is quite easy. <a href="https://docs.databricks.com/en/dev-tools/bundles/mlops-stacks.html">I would recommend creating a basic one using the instructions here and walking through this blog.</a> You can also take a look at the template directly in the public GitHub repository: <a href="https://github.com/databricks/mlops-stacks/tree/main/template/%7B%7B.input_root_dir%7D%7D">databricks/mlops-stacks</a>. 
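</p><p>If you would rather start from the command line, the Databricks CLI can scaffold the template directly (a sketch; prompts and flags may vary by CLI version):</p><pre><code># Bootstrap a new project from the built-in MLOps Stacks template
databricks bundle init mlops-stacks

# The CLI then asks interactive questions
# (project name, cloud, CI/CD platform, ...)</code></pre><p>Answering the prompts produces the folder structure this post walks through.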
</p><p>Your project bundle is controlled by the <code>databricks.yml</code> file. This contains all of the configurations (aka what the STAGE workspace is, what the DEV workspace is, what the `prod` catalog is, etc.). This file also points to all of the workflows in the project. Surprise! These configurations are also written in yaml. And the workflow configurations point to the notebooks in the project. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!SCCU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0204993-00b5-42fa-b443-133029861c71_643x503.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!SCCU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0204993-00b5-42fa-b443-133029861c71_643x503.png 424w, https://substackcdn.com/image/fetch/$s_!SCCU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0204993-00b5-42fa-b443-133029861c71_643x503.png 848w, https://substackcdn.com/image/fetch/$s_!SCCU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0204993-00b5-42fa-b443-133029861c71_643x503.png 1272w, https://substackcdn.com/image/fetch/$s_!SCCU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0204993-00b5-42fa-b443-133029861c71_643x503.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!SCCU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0204993-00b5-42fa-b443-133029861c71_643x503.png" 
width="643" height="503" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d0204993-00b5-42fa-b443-133029861c71_643x503.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:503,&quot;width&quot;:643,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:36128,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!SCCU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0204993-00b5-42fa-b443-133029861c71_643x503.png 424w, https://substackcdn.com/image/fetch/$s_!SCCU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0204993-00b5-42fa-b443-133029861c71_643x503.png 848w, https://substackcdn.com/image/fetch/$s_!SCCU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0204993-00b5-42fa-b443-133029861c71_643x503.png 1272w, https://substackcdn.com/image/fetch/$s_!SCCU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd0204993-00b5-42fa-b443-133029861c71_643x503.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This walkthrough follows the &#8220;Deploy code&#8221; approach, which is generally recommended by Databricks. This means that the code moves from development to staging and then production. The model is trained in each environment. However, there are certain scenarios where &#8220;Deploy model&#8221; works better, such as when your model training process is quite expensive. This is why the &#8220;Deploy model&#8221; approach is more common in LLMOps, but more on this in a later blog post. </p><h2>Let&#8217;s walk through a theoretical example of how this would work. 
</h2><h3>Development</h3><h4>Step One: Exploratory Data Analysis</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!AnjR!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a50c8e9-4315-489a-b7a6-4e258987958b_958x936.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!AnjR!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a50c8e9-4315-489a-b7a6-4e258987958b_958x936.png 424w, https://substackcdn.com/image/fetch/$s_!AnjR!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a50c8e9-4315-489a-b7a6-4e258987958b_958x936.png 848w, https://substackcdn.com/image/fetch/$s_!AnjR!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a50c8e9-4315-489a-b7a6-4e258987958b_958x936.png 1272w, https://substackcdn.com/image/fetch/$s_!AnjR!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a50c8e9-4315-489a-b7a6-4e258987958b_958x936.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!AnjR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a50c8e9-4315-489a-b7a6-4e258987958b_958x936.png" width="958" height="936" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4a50c8e9-4315-489a-b7a6-4e258987958b_958x936.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:936,&quot;width&quot;:958,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:71193,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!AnjR!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a50c8e9-4315-489a-b7a6-4e258987958b_958x936.png 424w, https://substackcdn.com/image/fetch/$s_!AnjR!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a50c8e9-4315-489a-b7a6-4e258987958b_958x936.png 848w, https://substackcdn.com/image/fetch/$s_!AnjR!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a50c8e9-4315-489a-b7a6-4e258987958b_958x936.png 1272w, https://substackcdn.com/image/fetch/$s_!AnjR!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4a50c8e9-4315-489a-b7a6-4e258987958b_958x936.png 1456w" sizes="100vw"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In the DEV environment, we explore new data alongside existing production data stored in the `dev` catalog. Perhaps, during this exploration, there is a discovery! There is a eureka moment! Now, we need to train and tune a new model. 
</p><h4>Step Two: Model Training</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!sLQo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0cf4417-e512-492f-914b-a7d269d5d2cf_980x940.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!sLQo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0cf4417-e512-492f-914b-a7d269d5d2cf_980x940.png 424w, https://substackcdn.com/image/fetch/$s_!sLQo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0cf4417-e512-492f-914b-a7d269d5d2cf_980x940.png 848w, https://substackcdn.com/image/fetch/$s_!sLQo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0cf4417-e512-492f-914b-a7d269d5d2cf_980x940.png 1272w, https://substackcdn.com/image/fetch/$s_!sLQo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0cf4417-e512-492f-914b-a7d269d5d2cf_980x940.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!sLQo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0cf4417-e512-492f-914b-a7d269d5d2cf_980x940.png" width="980" height="940" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b0cf4417-e512-492f-914b-a7d269d5d2cf_980x940.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:940,&quot;width&quot;:980,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:82410,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!sLQo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0cf4417-e512-492f-914b-a7d269d5d2cf_980x940.png 424w, https://substackcdn.com/image/fetch/$s_!sLQo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0cf4417-e512-492f-914b-a7d269d5d2cf_980x940.png 848w, https://substackcdn.com/image/fetch/$s_!sLQo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0cf4417-e512-492f-914b-a7d269d5d2cf_980x940.png 1272w, https://substackcdn.com/image/fetch/$s_!sLQo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb0cf4417-e512-492f-914b-a7d269d5d2cf_980x940.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>We are still in the DEV environment, but now we can use MLflow to keep track of everything. If you are unfamiliar with MLflow, I would recommend taking a look at this <a href="https://docs.databricks.com/en/_extras/notebooks/source/mlflow/mlflow-end-to-end-example-uc.html">demo notebook linked here</a>. In short, MLflow is a way for you to track your experiments and within that, your runs (each run = iteration of training). If that concept is still confusing, <a href="https://mlflow.org/docs/latest/traditional-ml/hyperparameter-tuning-with-child-runs/part1-child-runs.html">take a look at the docs here</a>.  </p><p>Using MLflow, we can log key metrics, parameters, and artifacts across different runs, enabling us to compare and contrast different trained models and pick the best model. Finally, once we are satisfied with the model that we have created, we can register the model in the `dev` catalog. 
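</p><p>As a minimal sketch of that track-then-register flow (the run name, parameter, metric, and three-level model name are placeholders; registering to Unity Catalog requires a Databricks workspace):</p><pre><code>import mlflow

# Track one training iteration as an MLflow run.
with mlflow.start_run(run_name="baseline") as run:
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("rmse", 0.42)

# Fetch the run back to compare it against other candidates.
logged = mlflow.get_run(run.info.run_id)
print(logged.data.metrics["rmse"])

# When satisfied, a call like
#   mlflow.sklearn.log_model(model, "model",
#       registered_model_name="dev.my_schema.my_model")
# registers the model in the `dev` catalog (names are placeholders).</code></pre><p>Each run lands in the active experiment, which is what makes the side-by-side comparison in the MLflow UI possible.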
</p><h4>Step Three: Push Changes</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Paso!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84117a1-ab5e-4a87-9028-82d98b009a1c_1084x674.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Paso!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84117a1-ab5e-4a87-9028-82d98b009a1c_1084x674.png 424w, https://substackcdn.com/image/fetch/$s_!Paso!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84117a1-ab5e-4a87-9028-82d98b009a1c_1084x674.png 848w, https://substackcdn.com/image/fetch/$s_!Paso!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84117a1-ab5e-4a87-9028-82d98b009a1c_1084x674.png 1272w, https://substackcdn.com/image/fetch/$s_!Paso!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84117a1-ab5e-4a87-9028-82d98b009a1c_1084x674.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Paso!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84117a1-ab5e-4a87-9028-82d98b009a1c_1084x674.png" width="1084" height="674" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e84117a1-ab5e-4a87-9028-82d98b009a1c_1084x674.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:674,&quot;width&quot;:1084,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:53501,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!Paso!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84117a1-ab5e-4a87-9028-82d98b009a1c_1084x674.png 424w, https://substackcdn.com/image/fetch/$s_!Paso!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84117a1-ab5e-4a87-9028-82d98b009a1c_1084x674.png 848w, https://substackcdn.com/image/fetch/$s_!Paso!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84117a1-ab5e-4a87-9028-82d98b009a1c_1084x674.png 1272w, https://substackcdn.com/image/fetch/$s_!Paso!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe84117a1-ab5e-4a87-9028-82d98b009a1c_1084x674.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Let&#8217;s update the repository now! Within a Databricks workspace, we can create a temporary branch (here we call it `dev`) and merge the updated notebooks. </p><p>Note: if you are following along via the MLOps Stacks template, you can see that there is a logical separation in the template structure: </p><ul><li><p>feature_engineering</p><ul><li><p>contains feature transformations</p></li><li><p>In this example, we are using the Databricks Feature Store, which is a centralized repository for managing and serving features</p></li></ul></li><li><p>monitoring</p><ul><li><p>contains code for model monitoring</p></li><li><p>we separate this from validation as monitoring is on-going in production while validation occurs pre-deployment. 
In this template, <code>ModelValidation.py</code> is the second part of the Model Training workflow</p></li></ul></li><li><p>validation</p><ul><li><p>contains code for model validation</p></li></ul></li><li><p>deployment</p><ul><li><p>contains serving endpoint setup and configuration</p></li></ul></li><li><p>training</p><ul><li><p>contains model training logic (e.g. all of the MLflow code)</p></li></ul></li></ul><p>This separation follows the principle of separation of concerns: each directory has a specific responsibility in the ML lifecycle. It also makes it easier to: </p><ul><li><p>have different teams work on different aspects of the lifecycle</p></li><li><p>maintain and update specific parts of the pipeline</p></li><li><p>reuse components across different projects</p></li><li><p>implement proper testing of each component</p></li></ul><p>And now, after the branch is created, we can commit the code directly to the `dev` branch. This will now move us to the next part of the process. 
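</p><p>Before moving on, here is that separation of concerns as a project skeleton (simplified; the generated template contains more files, and names may differ slightly by version):</p><pre><code>my_mlops_project/
&#9500;&#9472;&#9472; databricks.yml        # bundle targets and top-level configuration
&#9500;&#9472;&#9472; resources/            # workflow definitions (also YAML)
&#9500;&#9472;&#9472; feature_engineering/  # feature transformations (Feature Store)
&#9500;&#9472;&#9472; training/             # model training logic (MLflow)
&#9500;&#9472;&#9472; validation/           # pre-deployment model validation
&#9500;&#9472;&#9472; deployment/           # serving endpoint setup
&#9492;&#9472;&#9472; monitoring/           # ongoing monitoring in production</code></pre><p>Keeping these responsibilities in separate folders is what makes per-team ownership and component-level testing practical.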
</p><h3>Staging</h3><h4>Step One: Pull Request</h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!lcvZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a96fa94-e276-4b0d-a8c7-b77c2ae0b23c_2168x364.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!lcvZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a96fa94-e276-4b0d-a8c7-b77c2ae0b23c_2168x364.png 424w, https://substackcdn.com/image/fetch/$s_!lcvZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a96fa94-e276-4b0d-a8c7-b77c2ae0b23c_2168x364.png 848w, https://substackcdn.com/image/fetch/$s_!lcvZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a96fa94-e276-4b0d-a8c7-b77c2ae0b23c_2168x364.png 1272w, https://substackcdn.com/image/fetch/$s_!lcvZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a96fa94-e276-4b0d-a8c7-b77c2ae0b23c_2168x364.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!lcvZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a96fa94-e276-4b0d-a8c7-b77c2ae0b23c_2168x364.png" width="1456" height="244" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3a96fa94-e276-4b0d-a8c7-b77c2ae0b23c_2168x364.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:244,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:77299,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!lcvZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a96fa94-e276-4b0d-a8c7-b77c2ae0b23c_2168x364.png 424w, https://substackcdn.com/image/fetch/$s_!lcvZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a96fa94-e276-4b0d-a8c7-b77c2ae0b23c_2168x364.png 848w, https://substackcdn.com/image/fetch/$s_!lcvZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a96fa94-e276-4b0d-a8c7-b77c2ae0b23c_2168x364.png 1272w, https://substackcdn.com/image/fetch/$s_!lcvZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3a96fa94-e276-4b0d-a8c7-b77c2ae0b23c_2168x364.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>Once our code is merged into the `dev` branch, we can open a Pull Request to merge the new changes to our `main` branch. It is time to test all of these changes. </p><p>Databricks Asset Bundles allows you to easily maintain Databricks resources, but you will still need a way to automate and run the workflows. 
I have personally used a lot of Github Actions, but you can set up Azure DevOps Pipelines, GitLab Pipelines, etc. </p><h4>Step Two: Unit and Integration Tests</h4><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FvEZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560b2d36-0f95-43ec-97c5-88731effea79_1718x1034.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FvEZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560b2d36-0f95-43ec-97c5-88731effea79_1718x1034.png 424w, https://substackcdn.com/image/fetch/$s_!FvEZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560b2d36-0f95-43ec-97c5-88731effea79_1718x1034.png 848w, https://substackcdn.com/image/fetch/$s_!FvEZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560b2d36-0f95-43ec-97c5-88731effea79_1718x1034.png 1272w, https://substackcdn.com/image/fetch/$s_!FvEZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560b2d36-0f95-43ec-97c5-88731effea79_1718x1034.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FvEZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560b2d36-0f95-43ec-97c5-88731effea79_1718x1034.png" width="1456" height="876" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/560b2d36-0f95-43ec-97c5-88731effea79_1718x1034.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:876,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:141514,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!FvEZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560b2d36-0f95-43ec-97c5-88731effea79_1718x1034.png 424w, https://substackcdn.com/image/fetch/$s_!FvEZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560b2d36-0f95-43ec-97c5-88731effea79_1718x1034.png 848w, https://substackcdn.com/image/fetch/$s_!FvEZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560b2d36-0f95-43ec-97c5-88731effea79_1718x1034.png 1272w, https://substackcdn.com/image/fetch/$s_!FvEZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F560b2d36-0f95-43ec-97c5-88731effea79_1718x1034.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Opening a PR immediately triggers multiple workflows (via Github Actions or whatever process you have set up) for Unit Tests and Integration Tests. </p><p>The Integration Tests workflow (<a href="https://github.com/databricks/mlops-stacks/blob/main/template/%7B%7B.input_root_dir%7D%7D/.github/workflows/%7B%7B.input_project_name%7D%7D-run-tests.yml.tmpl">link to Github Actions workflow</a>) uses Databricks Asset Bundle (DAB) commands to create the assets in the Staging environment and trigger the workflow to run the necessary notebooks: </p><pre><code><code># Validate the bundle configuration
databricks bundle validate -t test 

# Create all necessary assets in the STAGE environment
databricks bundle deploy -t test

# Execute the feature engineering pipeline
databricks bundle run write_feature_table_job -t test 

# Execute the model training pipeline
databricks bundle run model_training_job -t test  </code></code></pre><p>The `-t test` flag specifies that these commands should target the STAGE environment, as defined in the <code>databricks.yml</code> file. </p><p>We are running two jobs: the feature engineering workflow (<a href="https://github.com/databricks/mlops-stacks/blob/main/template/%7B%7B.input_root_dir%7D%7D/%7B%7Btemplate%20%60project_name_alphanumeric_underscore%60%20.%7D%7D/resources/feature-engineering-workflow-resource.yml.tmpl">link</a>), which computes all of the features and stores them in the Databricks Feature Store, and the model training workflow (<a href="https://github.com/databricks/mlops-stacks/blob/main/template/%7B%7B.input_root_dir%7D%7D/%7B%7Btemplate%20%60project_name_alphanumeric_underscore%60%20.%7D%7D/resources/model-workflow-resource.yml.tmpl">link</a>), which trains the model with the Feature Store and then validates the model. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cyv5!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785d52d9-fea4-408d-b299-252491806129_1348x1144.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cyv5!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785d52d9-fea4-408d-b299-252491806129_1348x1144.png 424w, https://substackcdn.com/image/fetch/$s_!cyv5!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785d52d9-fea4-408d-b299-252491806129_1348x1144.png 848w, 
https://substackcdn.com/image/fetch/$s_!cyv5!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785d52d9-fea4-408d-b299-252491806129_1348x1144.png 1272w, https://substackcdn.com/image/fetch/$s_!cyv5!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785d52d9-fea4-408d-b299-252491806129_1348x1144.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cyv5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785d52d9-fea4-408d-b299-252491806129_1348x1144.png" width="1348" height="1144" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/785d52d9-fea4-408d-b299-252491806129_1348x1144.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1144,&quot;width&quot;:1348,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:131040,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!cyv5!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785d52d9-fea4-408d-b299-252491806129_1348x1144.png 424w, https://substackcdn.com/image/fetch/$s_!cyv5!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785d52d9-fea4-408d-b299-252491806129_1348x1144.png 848w, 
https://substackcdn.com/image/fetch/$s_!cyv5!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785d52d9-fea4-408d-b299-252491806129_1348x1144.png 1272w, https://substackcdn.com/image/fetch/$s_!cyv5!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F785d52d9-fea4-408d-b299-252491806129_1348x1144.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>These tests can be expanded even further to cover all aspects of the model workflows: </p><ul><li><p>Feature Engineering 
tests</p><ul><li><p>verify the data transformation pipeline</p></li><li><p>For example, test that the data types are as expected, the missing values are handled properly, the feature values are within an expected range, and the Databricks Feature Store itself is working properly</p></li></ul></li><li><p>Model Training tests</p><ul><li><p>verify the model training process</p></li><li><p>For example, ensure that the resources are utilized properly; metrics and parameters are correctly logged in MLflow; and the model can be saved and loaded</p></li></ul></li><li><p>Model Validation tests</p><ul><li><p>verify the model&#8217;s performance and behavior</p></li><li><p>For example, make sure the model predictions are as expected, compare the model&#8217;s performance against previous versions, and check that the model meets the expected thresholds </p></li></ul></li><li><p>Model Deployment tests</p><ul><li><p>verify that the model can be deployed and served</p></li><li><p>For example, test performance under expected traffic</p></li></ul></li><li><p>Model Inference tests</p><ul><li><p>verify the model&#8217;s behavior during prediction</p></li><li><p>For example, ensure that inference speed meets requirements, monitor resource consumption, and verify that data is being logged to the Inference Tables correctly </p></li></ul></li><li><p>Model Monitoring tests</p><ul><li><p>verify the monitoring system is working</p></li><li><p>For example, test the alerting system and validate the visualizations in the Dashboards</p></li></ul></li></ul><p>After a successful job run, you can merge the `dev` branch into the `main` branch. Let&#8217;s move to the next and final part of the process. 
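To make the Feature Engineering bullet above concrete, here is a minimal sketch of such a unit test; <code>compute_trip_features</code> and every column name are hypothetical stand-ins for your own pipeline code, not part of the template:

```python
import pandas as pd

# Hypothetical feature transformation under test; in a real project this
# would be imported from the project's feature engineering module.
def compute_trip_features(df: pd.DataFrame) -> pd.DataFrame:
    features = df.copy()
    features["trip_minutes"] = (
        features["dropoff"] - features["pickup"]
    ).dt.total_seconds() / 60
    features["fare_per_minute"] = (
        features["fare"] / features["trip_minutes"]
    ).fillna(0.0)
    return features

def test_feature_types_and_ranges():
    raw = pd.DataFrame({
        "pickup": pd.to_datetime(["2026-01-01 10:00", "2026-01-01 11:00"]),
        "dropoff": pd.to_datetime(["2026-01-01 10:30", "2026-01-01 11:45"]),
        "fare": [15.0, 30.0],
    })
    out = compute_trip_features(raw)
    # Data types are as expected
    assert out["trip_minutes"].dtype == "float64"
    # Missing values are handled
    assert out[["trip_minutes", "fare_per_minute"]].notna().all().all()
    # Feature values fall within an expected range
    assert ((out["trip_minutes"] > 0) & (out["trip_minutes"] < 24 * 60)).all()

test_feature_types_and_ranges()
```

Run under pytest in the Unit Tests workflow, checks like these fail the PR before anything reaches the Staging environment.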
</p><h3>Production</h3><h4>Step One: Release Branch</h4><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n4R6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4247bfc-6de3-4539-8935-eda961c2d097_2564x374.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n4R6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4247bfc-6de3-4539-8935-eda961c2d097_2564x374.png 424w, https://substackcdn.com/image/fetch/$s_!n4R6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4247bfc-6de3-4539-8935-eda961c2d097_2564x374.png 848w, https://substackcdn.com/image/fetch/$s_!n4R6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4247bfc-6de3-4539-8935-eda961c2d097_2564x374.png 1272w, https://substackcdn.com/image/fetch/$s_!n4R6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4247bfc-6de3-4539-8935-eda961c2d097_2564x374.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n4R6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4247bfc-6de3-4539-8935-eda961c2d097_2564x374.png" width="1456" height="212" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e4247bfc-6de3-4539-8935-eda961c2d097_2564x374.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:212,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:92256,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!n4R6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4247bfc-6de3-4539-8935-eda961c2d097_2564x374.png 424w, https://substackcdn.com/image/fetch/$s_!n4R6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4247bfc-6de3-4539-8935-eda961c2d097_2564x374.png 848w, https://substackcdn.com/image/fetch/$s_!n4R6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4247bfc-6de3-4539-8935-eda961c2d097_2564x374.png 1272w, https://substackcdn.com/image/fetch/$s_!n4R6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe4247bfc-6de3-4539-8935-eda961c2d097_2564x374.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div><p>After successfully validating the changes in the STAGE environment, we can now create a `release` branch to update the changes in the PROD environment. 
This branching strategy provides a clear snapshot of what is being deployed to production, enables hotfixes without disrupting the main branch, and creates a traceable release history. </p><p>The creation of a `release` branch triggers a production deployment workflow in your CI/CD system. </p><h4>Step Two: Create Assets</h4><p>This is similar to what we did in staging: </p><pre><code><code># Validate the bundle configuration 
databricks bundle validate -t prod

# Deploy assets to the PROD environment
databricks bundle deploy -t prod</code></code></pre><p>Instead of running specific jobs like we did in the STAGE environment, we simply deploy the assets (<a href="https://github.com/databricks/mlops-stacks/blob/main/template/%7B%7B.input_root_dir%7D%7D/.github/workflows/%7B%7B.input_project_name%7D%7D-bundle-cd-prod.yml.tmpl">link to Github Actions workflow</a>) because the jobs run on a schedule. This schedule is defined in each workflow configuration.  </p><p>Now, with assets deployed to production and workflows scheduled, we have completed the full MLOps lifecycle implementation. </p><h2>Next Steps</h2><p>Ready to start implementing?</p><ol><li><p>Create your first MLOps Stack using the template repository. </p></li><li><p>Review the example notebooks in the template and understand how each component works. </p></li><li><p>Adapt the workflow configurations to match your needs. </p></li><li><p>Set up your CI/CD pipeline using Github Actions or your preferred tool. </p></li></ol><p>In future posts, we will dive deeper into LLMOps and Best Practices. If you have any questions about implementing MLOps Stacks, feel free to reach out via the comments. 
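As noted above, production jobs fire on the schedule defined in each workflow's resource file rather than being triggered manually. Here is a minimal sketch of what that looks like in a DAB resource YAML; the job name, cron expression, and notebook path are illustrative, not taken verbatim from the template:

```yaml
# resources/model-workflow-resource.yml (illustrative sketch)
resources:
  jobs:
    model_training_job:
      name: model-training-job
      schedule:
        # Quartz cron syntax: run daily at 09:00 UTC
        quartz_cron_expression: "0 0 9 * * ?"
        timezone_id: "UTC"
        pause_status: UNPAUSED
      tasks:
        - task_key: Train
          notebook_task:
            notebook_path: ../training/notebooks/Train.py
```

On `databricks bundle deploy -t prod`, the bundle creates this job with the schedule attached, so nothing needs to be run by hand.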
</p><p>I will leave you with a poll: </p><div class="poll-embed" data-attrs="{&quot;id&quot;:275053}" data-component-name="PollToDOM"></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.databricksters.com/p/a-beginners-guide-to-mlops-stacks?utm_source=substack&utm_medium=email&utm_content=share&action=share&quot;,&quot;text&quot;:&quot;Share&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.databricksters.com/p/a-beginners-guide-to-mlops-stacks?utm_source=substack&utm_medium=email&utm_content=share&action=share"><span>Share</span></a></p><p></p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.databricksters.com/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading The Databricksters! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Deploying DeepSeek R1 Distill Qwen 1.5B on Databricks]]></title><description><![CDATA[&#8220;The results are promising: DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on math benchmarks with 28.9% on AIME and 83.9% on MATH.]]></description><link>https://www.databricksters.com/p/deploying-deepseek-r1-distill-qwen</link><guid isPermaLink="false">https://www.databricksters.com/p/deploying-deepseek-r1-distill-qwen</guid><dc:creator><![CDATA[Austin]]></dc:creator><pubDate>Tue, 04 Feb 2025 16:01:47 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!nTbs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fd5aeb-c20e-44fc-be7b-c36ff9be3ab8_1179x659.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="pullquote"><p>&#8220;The results are promising: DeepSeek-R1-Distill-Qwen-1.5B outperforms GPT-4o and Claude-3.5-Sonnet on math benchmarks with 28.9% on AIME and 83.9% on MATH. 
Other dense models also achieve impressive results, significantly outperforming other instruction-tuned models based on the same underlying checkpoints.&#8221;&#8202;&#8212;&#8202;DeepSeek-AI</p></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!nTbs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fd5aeb-c20e-44fc-be7b-c36ff9be3ab8_1179x659.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!nTbs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fd5aeb-c20e-44fc-be7b-c36ff9be3ab8_1179x659.jpeg 424w, https://substackcdn.com/image/fetch/$s_!nTbs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fd5aeb-c20e-44fc-be7b-c36ff9be3ab8_1179x659.jpeg 848w, https://substackcdn.com/image/fetch/$s_!nTbs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fd5aeb-c20e-44fc-be7b-c36ff9be3ab8_1179x659.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!nTbs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fd5aeb-c20e-44fc-be7b-c36ff9be3ab8_1179x659.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!nTbs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fd5aeb-c20e-44fc-be7b-c36ff9be3ab8_1179x659.jpeg" width="1179" height="659" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/68fd5aeb-c20e-44fc-be7b-c36ff9be3ab8_1179x659.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:659,&quot;width&quot;:1179,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:325013,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!nTbs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fd5aeb-c20e-44fc-be7b-c36ff9be3ab8_1179x659.jpeg 424w, https://substackcdn.com/image/fetch/$s_!nTbs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fd5aeb-c20e-44fc-be7b-c36ff9be3ab8_1179x659.jpeg 848w, https://substackcdn.com/image/fetch/$s_!nTbs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fd5aeb-c20e-44fc-be7b-c36ff9be3ab8_1179x659.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!nTbs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F68fd5aeb-c20e-44fc-be7b-c36ff9be3ab8_1179x659.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://x.com/NotPotBol/status/1883902668864938123">Not Potato Bolshevik&#8217;s Tweet</a></figcaption></figure></div><p>New LLMs usually make waves in AI circles, but DeepSeek R1 was more of a tsunami this week as the US stock market lost about a trillion dollars in valuation inside of 30 minutes. Regardless of your opinions on DeepSeek and its implications for the global (or national) economy, it&#8217;s an open source model whose performance is very competitive with the largest proprietary models currently available. That performance is due both to the architecture it shares with large proprietary reasoning models and to some very interesting departures from the current training paradigm of ChatGPT, Claude, and Gemini. 
(<a href="https://openai.com/index/openai-o3-mini/">At least until very recently</a>).</p><p>For starters, <code>DeepSeek-R1</code> comes in a few flavors, and it's worth understanding the significance of each:</p><ul><li><p><strong>DeepSeek-R1-Zero</strong></p><ul><li><p>Applies RL directly to the base model without any Supervised Fine Tuning (SFT) data at all</p></li></ul></li><li><p><strong>DeepSeek-R1</strong></p><ul><li><p>Applies RL starting from a checkpoint fine-tuned with a minimal set of high-quality Chain of Thought (CoT) examples</p></li></ul></li><li><p><strong>DeepSeek-R1-Distill-X</strong></p><ul><li><p>A half dozen smaller, dense models distilled from DeepSeek-R1 that preserve the reasoning patterns learned from its much larger parent model</p></li></ul></li></ul><p>The <a href="https://arxiv.org/html/2501.12948v1#S1">DeepSeek Paper</a> explains how using 800k high-quality fine-tuning samples curated from DeepSeek-R1 is enough data to significantly improve the reasoning abilities of smaller models such as Qwen2.5, to the point where even just the 1.5B variant can outperform a model with orders of magnitude more parameters on math tasks. 
Notably, this distillation process is purely SFT-based, and includes no RL at all.</p><p>So how can we host one of these overpowered mini models ourselves? Databricks recently published <a href="https://www.databricks.com/blog/deepseek-r1-databricks">a blog post</a> showing how to deploy DeepSeek R1 Distill Llama 8B and 70B using Provisioned Throughput directly on the Databricks platform, which makes sense given that the Llama 3.x architecture is natively supported, even for fine-tuned variants. However, what if we want to serve one of the four Qwen2.5-based distilled models? The architectural differences between Llama 3 and Qwen2.5 are so small you can actually <a href="https://github.com/hiyouga/LLaMA-Factory/blob/main/scripts/convert_ckpt/llamafy_qwen.py">convert one to the other</a>, but if you don&#8217;t want to convert all the weights to Llama, you&#8217;re going to need to use custom GPU Model Serving for Qwen.</p><p>And that brings us to the code portion of this blog:</p><pre><code>%pip install accelerate
%pip install transformers --upgrade ## need RoPE for this to work, and that's only included in newer versions
## torch version also matters here; install the CUDA 12.4 builds directly
%pip uninstall torch torchvision -y
%pip install torch torchvision --index-url https://download.pytorch.org/whl/cu124

dbutils.library.restartPython()</code></pre><pre><code>import torch
import torchvision
import accelerate
import transformers

## Feel free to compare to the conda_env specified below to double check
print(transformers.__version__)
print(accelerate.__version__)
print(torch.__version__)
print(torchvision.__version__)</code></pre><pre><code>import pandas as pd
import mlflow
import mlflow.transformers
from mlflow.models.signature import infer_signature
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoConfig, pipeline

# Point MLflow at the Databricks tracking server
mlflow.set_tracking_uri("databricks")

# Specify the model from HuggingFace transformers
model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)

# Adjust rope_scaling
if "rope_scaling" in config.to_dict():
    config.rope_scaling = {"type": "dynamic", "factor": 8.0}

# Loading this on a single-node, single-GPU A10 cluster; for the 14B or 32B models you will need multiple GPUs of this size
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    config=config, 
    device_map="cuda:0"
)

text_generator = pipeline("text-generation", model=model, tokenizer=tokenizer)

# We need a signature for UC registered models
example_prompt = "Explain quantum mechanics in simple terms."
example_inputs = pd.DataFrame({"inputs": [example_prompt]})
example_outputs = text_generator(example_prompt, max_length=200)
signature = infer_signature(example_inputs, example_outputs)

# Define the Conda environment with correct package versions
conda_env = {
    "name": "mlflow-env",
    "channels": ["defaults", "conda-forge"],
    "dependencies": [
        "python=3.11",
        "pip",
        {
            "pip": [
                "mlflow",
                "transformers==4.48.1",
                "accelerate==0.31.0",
                "torch==2.6.0",
                "torchvision==0.21.0"
            ]
        }
    ]
}

# Log model with MLflow
with mlflow.start_run() as run:
    mlflow.transformers.log_model(
        transformers_model=text_generator,
        artifact_path="deepseek_model",
        signature=signature,
        input_example=example_inputs,
        registered_model_name="deepseek_qwen_1_5b",
        conda_env=conda_env
    )</code></pre><p>The above is all you need to log and register the model to MLflow! From here I like to perform two tests before I kick off a serving endpoint, because I want to be reasonably certain I won&#8217;t get a container build failure before I wait 30 minutes for a GPU model serving endpoint to fully spin up:</p><ol><li><p>I test the model using <code>load_model()</code></p><ol><li><p>This catches immediate errors with my class&#8217;s logic and is quite fast, but doesn&#8217;t test the dependencies for compatibility issues because it&#8217;s loading it in the same notebook environment we kicked it off from</p></li></ol></li><li><p>I test the model again using <code>mlflow.models.predict()</code></p><ol><li><p>This catches the dependency issues in my <code>conda_env</code> because it spins up a lightweight virtual env that mimics the serving endpoint</p></li></ol></li></ol><p>If both pass, then I go to the experiment in MLflow and deploy the model to a live endpoint.</p><pre><code># Load the model locally for test 1
model_uri = "models:/deepseek_qwen_1_5b/4"
loaded_model = mlflow.pyfunc.load_model(model_uri)

input_data = {"inputs": "Explain quantum mechanics in simple terms."}
output = loaded_model.predict(input_data)
print(output)</code></pre><pre><code>## Call model in virtual env as specified by conda_env for test 2
# Define the model URI
model_uri = "models:/deepseek_qwen_1_5b/4"

# Define input data in the required format
input_data = pd.DataFrame({"inputs": ["Explain quantum mechanics in simple terms."]})

# Call the MLflow model predict API, which validates the conda_env in an isolated virtual env (unlike load_model())
output = mlflow.models.predict(model_uri, input_data)
print(output)</code></pre><p>Now we grab some tea and wait for DeepSeek-R1-Distill-Qwen-1.5B to deploy!</p><div><hr></div><p><strong>About Me:</strong></p><p>I come from a DS/ML background, where I worked for about 6 years before starting at Databricks as a Specialist Solutions Architect in GenAI and MLOps. I like to write about things I find interesting and that I think other people might benefit from. </p>]]></content:encoded></item></channel></rss>