Proprietary foundation models are great for a wide array of use cases, but many companies have a legitimate need for an OSS model. Sometimes it’s a security concern, other times it’s about tuning and optimizing for task-specific performance, and sometimes it just comes down to cost or latency SLAs. Whatever your reason, you want to serve an LLM on custom GPU model serving, and I’m here to help.
And before you accuse me of writing clickbait titles that have little to do with the content of the blog (I would never do that), I assure you that Mbappé will be mentioned no fewer than 4 times.
The landscape of LLMs evolves so rapidly that it’s nearly impossible to keep up with every model, let alone every dependency change each one requires. With AI Runtime and Custom GPU Model Serving, it’s easy to serve the latest OSS models, because we can pin the package versions at the outset, just as we would in any other notebook.
In honor of their World Cup victory, let’s use a Mistral model: Devstral-Small-2-24B-Instruct-2512, a great mid-sized model for SWE tasks that deploys easily on Databricks custom GPU model serving.
Mbappé.
Most of these package versions are shipped natively with AI Runtime v5, but just to underscore my above point, I’ll set them here as well:
%pip install databricks-sdk==0.102.0 vllm==0.13.0 transformers==4.57.6 mistral_common openai==2.17.0 opencv-python-headless==4.12.* mlflow==3.12.0 hf_transfer==0.1.9
%restart_python
We’ll now set the working directory to local disk and configure the endpoint. We set SKIP_LOCAL_SERVE_TEST = True so that we don’t try to load the whole model into GPU memory here in the serverless notebook. That would work on a single-node H100, but not on an A10. We can also set the max model length here if we don’t need to serve long prompts and want to conserve VRAM during serving. Finally, we set the location in UC we want to save the model to, the endpoint name, and the sizing.
# Set working directory to local disk (/Workspace doesn't support large files).
import os, tempfile
workdir = tempfile.mkdtemp()
os.chdir(workdir)
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
# Hugging Face model to download.
MODEL_REPO_ID = "mistralai/Devstral-Small-2-24B-Instruct-2512"
ARTIFACTS_PATH = "devstral" # Local dir the weights download to
SERVED_MODEL_NAME = "devstral" # Name vLLM exposes the model under.
# Devstral 2 Small FP8 (~24 GB) won't fit an A10
SKIP_LOCAL_SERVE_TEST = True
MAX_MODEL_LEN = 24576
GPU_MEMORY_UTILIZATION = 0.90
# Allowlisted ports for Serverless GPU notebooks are 3000-3999. Model Serving requires 8080.
LOCAL_PORT = 3080
SERVING_PORT = 8080
# Unity Catalog destination
UC_MODEL_NAME = "catalog.schema.devstral_small_2_24b_fp8"
# Serving endpoint configuration.
ENDPOINT_NAME = "austin-devstral-24b-h100"
WORKLOAD_SIZE = "Small"
SCALE_TO_ZERO_ENABLED = False # GPU_XLARGE/H100 does NOT support scale-to-zero in Beta
from databricks.sdk.service.serving import ServingModelWorkloadType
WORKLOAD_TYPE = ServingModelWorkloadType.GPU_XLARGE
Mbappé.
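If you want to sanity-check the "won't fit an A10" comment, here is a rough back-of-the-envelope sketch. It assumes nominal memory of 24 GB for an A10 and 80 GB for an H100, treats the FP8 weights as roughly 24 GB, and ignores everything else (CUDA context, activations), so read the numbers as ballpark only. The helper name `kv_cache_headroom_gb` is made up for illustration.

```python
# Rough VRAM budget for serving FP8 weights; all figures are approximations.
WEIGHTS_GB = 24.0              # ~24 GB of FP8 weights, per the config comment
GPU_MEMORY_UTILIZATION = 0.90  # same value as in the configuration cell

# ASSUMPTION: nominal memory per GPU (24 GB for an A10, 80 GB for an H100).
GPUS_GB = {"A10": 24.0, "H100": 80.0}


def kv_cache_headroom_gb(total_gb, weights_gb=WEIGHTS_GB,
                         utilization=GPU_MEMORY_UTILIZATION):
    """Memory left for KV cache and activations once the weights are loaded."""
    return total_gb * utilization - weights_gb


for name, total in GPUS_GB.items():
    headroom = kv_cache_headroom_gb(total)
    verdict = "fits" if headroom > 0 else "does not fit"
    print(f"{name}: {headroom:+.1f} GB headroom -> {verdict}")
```

Whatever is left after the weights is roughly what vLLM can hand to the KV cache, which is why a smaller MAX_MODEL_LEN helps when memory is tight.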
Now we can download the model:
from huggingface_hub import snapshot_download
snapshot_download(
    repo_id=MODEL_REPO_ID,
    local_dir=ARTIFACTS_PATH,
)
Next we log the model with our custom entry point and print the model URI in case we want to reference it later. This will take a few minutes. Note that we do NOT download from HF in the serving container; the weights travel with the logged model as an artifact.
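The logging cell calls `entrypoint(SERVING_PORT)`, and that helper isn't shown in this post. As a rough sketch of what it might look like, the function below builds a `vllm serve` command from the config values. Treat it as an assumption rather than the real thing: the container path `MODEL_DIR_IN_CONTAINER` is a placeholder, and whether the `entrypoint` metadata expects a single command string like this is something to confirm against the Databricks docs.

```python
import shlex

# Values mirrored from the configuration cell.
SERVED_MODEL_NAME = "devstral"
MAX_MODEL_LEN = 24576
GPU_MEMORY_UTILIZATION = 0.90

# ASSUMPTION: where the logged "model_dir" artifact ends up inside the
# serving container. This path is a placeholder, not a documented location.
MODEL_DIR_IN_CONTAINER = "/model/artifacts/devstral"


def entrypoint(port: int) -> str:
    """Build a hypothetical vLLM launch command for the serving container."""
    args = [
        "vllm", "serve", MODEL_DIR_IN_CONTAINER,
        "--served-model-name", SERVED_MODEL_NAME,
        "--host", "0.0.0.0",
        "--port", str(port),
        "--max-model-len", str(MAX_MODEL_LEN),
        "--gpu-memory-utilization", str(GPU_MEMORY_UTILIZATION),
    ]
    # shlex.join quotes any argument that needs it for a POSIX shell.
    return shlex.join(args)


print(entrypoint(8080))
```

Building the command from a list and joining it with `shlex.join` keeps quoting safe if a path or model name ever contains spaces.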
import mlflow
from mlflow.pyfunc.model import ChatModel, ChatCompletionResponse
# Required placeholder. Serving runs the entrypoint, not python_model.predict.
class LLMModel(ChatModel):
    def predict(self, context, messages, params):
        return ChatCompletionResponse.from_dict({"choices": []})
mlflow.set_registry_uri("databricks-uc")
model_info = mlflow.pyfunc.log_model(
    name=SERVED_MODEL_NAME,
    python_model=LLMModel(),
    artifacts={
        "model_dir": ARTIFACTS_PATH,
    },
    metadata={
        "task": "llm/v1/chat",
        # entrypoint() must be defined before this cell runs; serving runs what it returns.
        "entrypoint": entrypoint(SERVING_PORT),
    },
    extra_pip_requirements=[
        "mlflow==3.12.0",
    ],
)
model_info.model_uri
Now we register the model. Note that env_pack="databricks_model_serving" packages the environment and the bundled weights for Serverless Optimized Deployments. This step takes about 15 minutes.
model_version = mlflow.register_model(model_info.model_uri, UC_MODEL_NAME, env_pack="databricks_model_serving")
print(f"Registered {UC_MODEL_NAME} version {model_version.version}")
Mbappé.
Our model is now ready for deployment. This cell provisions its own GPU_XLARGE (i.e. H100) serving endpoint independent of this A10 notebook.
from databricks.sdk import WorkspaceClient
from datetime import timedelta
from databricks.sdk.service.serving import EndpointCoreConfigInput, ServedEntityInput
config = EndpointCoreConfigInput(
    name=ENDPOINT_NAME,
    served_entities=[
        ServedEntityInput(
            entity_name=UC_MODEL_NAME,
            entity_version=str(model_version.version),
            workload_type=WORKLOAD_TYPE,
            workload_size=WORKLOAD_SIZE,
            scale_to_zero_enabled=SCALE_TO_ZERO_ENABLED,
        )
    ]
)
w = WorkspaceClient()
w.serving_endpoints.create_and_wait(name=ENDPOINT_NAME, config=config, timeout=timedelta(minutes=45))
And you’ve done it. All that’s left is to query the endpoint to confirm it’s working.
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ChatMessage, ChatMessageRole
w = WorkspaceClient()
resp = w.serving_endpoints.query(
    name=ENDPOINT_NAME,
    messages=[ChatMessage(role=ChatMessageRole.USER, content="Hi, will Mbappé lead France to another World Cup victory? And by extension make Mistral the world favorite OSS LLM for SWE tasks?")],
)
print(resp.choices[0].message.content)
Happy coding,
-Austin


