Cutting Token Costs Reaches the Renaissance
A Lakebase Powered Solution for Enforcing Token Budgets, Now with Fewer Sharp Edges 🔪
Back in October I published a blog called Getting Medieval on Token Costs. The code and strategy I provided worked, but as the title implied, it was rough around the edges. How rough? Well…

The API calls to the FMs were synchronous, so QPS would have been Medieval indeed
The Lakebase instance was provisioned, so it would always be accruing costs even without usage
There was no UI, so you or your admin would be spending hours fiddling with thousand-line SQL queries for enterprise use cases
But no matter, we’ve had a Renaissance!
The repo got three meaningful updates and a handful of smaller ones that collectively move this from being technically functional to something a mid-sized enterprise team might actually want to use on Databricks.
Quick Refresher
The original solution uses a custom MLflow model serving endpoint as a proxy between your users and whatever foundation model they’re calling. Before the request hits the FM, the endpoint checks two Lakebase tables: one for the user’s token limit and another for how many tokens they’ve already burned through. If they’re over budget, the request ends up like John the Baptist in the cover image. If not, it goes through to the FM and the usage is written back to Lakebase along with the response and remaining balance.
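The check-then-forward flow above can be sketched in a few lines. This is a minimal illustration, not the repo's actual code: the table and column names are my own, and sqlite3 stands in for Lakebase Postgres so the example is self-contained.

```python
# Sketch of the budget check the proxy performs before each FM call.
# Table/column names are illustrative; sqlite3 stands in for Lakebase Postgres.
import sqlite3

def check_budget(conn: sqlite3.Connection, user: str) -> bool:
    """Return True if the user is still under their token budget."""
    row = conn.execute(
        "SELECT token_limit FROM token_limits WHERE user_name = ?", (user,)
    ).fetchone()
    if row is None:
        return True  # no budget configured for this user
    (limit,) = row
    used = conn.execute(
        "SELECT COALESCE(SUM(total_tokens), 0) FROM token_usage WHERE user_name = ?",
        (user,),
    ).fetchone()[0]
    return used < limit

# Demo with an in-memory database
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE token_limits (user_name TEXT, token_limit INTEGER)")
conn.execute("CREATE TABLE token_usage (user_name TEXT, total_tokens INTEGER)")
conn.execute("INSERT INTO token_limits VALUES ('andrea@company.com', 1000)")
conn.execute("INSERT INTO token_usage VALUES ('andrea@company.com', 800)")
print(check_budget(conn, "andrea@company.com"))  # True: 800 < 1000
conn.execute("INSERT INTO token_usage VALUES ('andrea@company.com', 400)")
print(check_budget(conn, "andrea@company.com"))  # False: 1200 >= 1000
```

In the real solution, a passing check forwards the request to the FM and the resulting usage is written back to the usage table; a failing check returns the over-budget response instead.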
Change 1: Autoscaling Lakebase
The original code used a Provisioned Lakebase instance because that was the only option at the time, but Provisioned instances are going away, and we have something better: Autoscaling Lakebase.
Swapping to Autoscaling Lakebase means you can set minimum and maximum scaling bands, and if you don’t need a high-availability instance, it can also scale to zero when there’s no usage.
This is the smallest change architecturally, but it’s nice not to pay for compute we don’t need.
Change 2: ResponsesAgent + Async FM Calls
The original code used mlflow.pyfunc.PythonModel and called the FM endpoint via requests.post(), which is synchronous and blocking: only one request can be handled at a time per unit of concurrency. That meant the endpoint that was supposed to help you manage costs via budgeting was throttling your throughput instead. While I suppose that is one way to reduce token costs, it’s not very useful.
The new version replaces the PythonModel with a ResponsesAgent and swaps requests for httpx.AsyncClient inside an async def predict_stream(). Now multiple FM calls can be in flight simultaneously and the serving endpoint isn’t waiting on one user’s 20-second Claude response before it can look at the next request in the queue.
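To see why the async swap matters, here's a minimal sketch of the concurrency pattern. The stub below simulates an FM endpoint with asyncio.sleep rather than a real httpx.AsyncClient call, and the function names are my own, but the shape is the same: all requests are in flight at once instead of queuing behind each other.

```python
# Concurrency sketch: five simulated FM calls overlap instead of serializing.
# fake_fm_call stands in for `await client.post(fm_url, json=payload)`.
import asyncio
import time

async def fake_fm_call(prompt: str, latency: float = 0.2) -> str:
    await asyncio.sleep(latency)  # simulate FM response time
    return f"response to: {prompt}"

async def serve_batch(prompts: list[str]) -> list[str]:
    # All calls run concurrently, so total wall time is ~one latency,
    # not latency * len(prompts) as with synchronous requests.post().
    return await asyncio.gather(*(fake_fm_call(p) for p in prompts))

start = time.perf_counter()
results = asyncio.run(serve_batch([f"prompt-{i}" for i in range(5)]))
elapsed = time.perf_counter() - start
print(len(results), round(elapsed, 2))  # 5 results in roughly 0.2s, not 1.0s
```

With the synchronous version, the same five calls would take five times as long per unit of concurrency, which is exactly the 20-second-Claude-response bottleneck described above.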
The core logic now lives in a standalone rate_limiter_agent.py decoupled from the notebook. The public API is much cleaner:
agent = TokenRateLimiterAgent(
    db_config={...},
    workspace_client=WorkspaceClient(),
    endpoint_name="ep-your-endpoint",
    group_members={
        "data-science-team": ["andrea@company.com", "john@company.com"],
    },
)

# Before calling the FM:
quota = agent.check_quota("andrea@company.com", "databricks-claude-sonnet-4-5")
if not quota["allowed"]:
    # Return 429 or block the request
    ...

# After the FM call completes:
agent.log_usage(
    user_name="andrea@company.com",
    model_name="databricks-claude-sonnet-4-5",
    prompt_tokens=1200,
    completion_tokens=350,
    request_id="req-abc123",
)
You can drop this into any existing pipeline without touching the notebook, which certainly helps if you’re integrating this into something that already has its own serving infrastructure.
Change 3: An Actual Frontend
The original had no management UI, which meant setting limits by writing SQL by hand for every change to your budgeting policy. Not very convenient.
The new repo ships a full Databricks App: a React + FastAPI application that deploys alongside your serving endpoint and gives administrators a no-code interface for setting granular budgets. How granular, you ask? Any combination of:
A user, service principal, or group
Calling any FM, list of FMs, or across all FMs in the workspace
That resets every X hours, days, weeks, months, or never
Limited to a specified count of tokens or dollars
Token costs are pre-populated from the Databricks documentation, but they remain manually editable in case pricing changes or you have some kind of secret discount I don’t know about. That’s another nice quality-of-life feature, since tokens are not all created equal; GPT OSS 20B tokens cost about 100x less than GPT 5.4 tokens.
The drop-downs auto-populate users, SPs, and groups as well as the Databricks Foundation Models.
It also comes with a handy monitoring dashboard so you can see usage over time, your top consumers, and the most popular models.
The App authenticates to Lakebase via a native Postgres role with a static password stored in Databricks Secrets, so there’s no OAuth token refresh to manage.
An Honest Conclusion
Is this production-grade for an org running thousands of concurrent end users? Maybe not. At that scale you might consider mini-batching requests, but some amount of cost-tracking overhead will remain, and it only grows with load.
Is this production-grade for most actual enterprise teams who want to stop their power users from accidentally burning through their monthly token budget in a week? Yes. I think this solution really shines when you have dozens to hundreds of daily active users who might get greedy on Opus requests without some budget enforcement.
But don’t take my word for it; check it out for yourself. The code lives here and setup instructions are in the README.
Cheers and happy coding.