Run local agentic AI on the Mac using MLX
Run AI agents locally with privacy, low latency, and offline access. Dive into how MLX advancements and Mac hardware make powerful agentic workflows possible entirely on-device. You'll explore code agents such as OpenCode, see how they integrate into Xcode, learn techniques for multi-Mac scaling, and discover how to integrate tools seamlessly — without ever leaving your machine.
Chapters
- 0:00 - Introduction
- 0:32 - The chat and agentic loop
- 2:42 - Local agentic AI stack
- 4:36 - Setting up your own agent
- 5:39 - Making agents fast
- 6:53 - Concurrency and distributed inference
- 9:20 - More examples
- 13:01 - Next steps
Resources
- MLX Swift LM on GitHub
- MLX Swift Examples
- MLX Examples
- MLX Swift
- MLX LM - Python API
- MLX Explore - Python API
- MLX Framework
- MLX
Hi, I'm Angelos, an engineer on the MLX team. Today I'm going to show you how to build and run agentic AI workflows entirely on your Mac using MLX. No cloud, no API keys, just your hardware doing the work. Over the past year, AI agents have gone from research prototypes to everyday productivity tools. But before we talk about agents, let's look at what we had before.
Here's the chat experience you're familiar with. You send a prompt to the language model. The model sends a response back. If you need to act on that response, run a command, check a file, or fix an error, that's on you.
But now you're talking to an agent. The agent talks to the model to decide what to do. Then it calls tools to actually do it: running commands, reading files, hitting APIs. It observes the results and goes back to the model to figure out the next step. User to agent. Agent to model. Agent to tools. This is the agentic loop. And it keeps cycling until your task is done.
What makes this particularly exciting on Apple silicon is that the entire loop can run locally. Your data stays on your machine, AI is available anywhere at any time, and there are no usage costs.
Let me now show you what this looks like in practice. Here I have an agent running locally on my Mac. On my screen you can see the setup: on the left, MLX running the model, and on the right, the OpenCode agent I am interacting with. I asked it to fetch the recent pull requests from our MLX repository, summarize the changes, and identify anything that needs my attention. The model reasons about the request, calls the GitHub CLI to fetch PR data, reads through the diffs, and produces a concise summary. All of this is happening locally: the model runs on my hardware and only the git commands reach the network. Well, it seems like I have a lot of work to do after finishing this video.
Now that you've seen what's possible, let me walk you through how we'll get there today. We'll start by introducing the local agentic AI stack, the four layers that make all of this work, from MLX at the foundation all the way up to the agent. Then I'll show you step-by-step how to set up your own local agent. After that, we'll look at how MLX gets the most out of your hardware to make agents fast. And finally, we'll go through more live demos, including building a SwiftUI app from scratch and fixing a bug in Xcode. Let's start with the stack.
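The agentic loop described above (user to agent, agent to model, agent to tools) can be sketched in a few lines of Python. This is a minimal illustration, not how OpenCode or any particular agent is implemented: `fake_model` stands in for a real model call, and the `list_files` tool and its output are hypothetical. The message shapes follow the OpenAI chat completions convention for tool calls.

```python
import json
from typing import Callable


def list_files(path: str) -> str:
    """Hypothetical tool: a real agent would inspect the file system here."""
    return "App.swift\nContentView.swift"


# Tool registry the agent can dispatch into (names are illustrative only).
TOOLS: dict[str, Callable[..., str]] = {"list_files": list_files}


def fake_model(messages: list[dict]) -> dict:
    """Stand-in for the model: requests one tool call, then gives an answer."""
    if not any(m["role"] == "tool" for m in messages):
        return {
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": "call_1",
                "type": "function",
                "function": {
                    "name": "list_files",
                    "arguments": json.dumps({"path": "."}),
                },
            }],
        }
    return {"role": "assistant", "content": "The project has 2 Swift files."}


def run_agent(task: str, model=fake_model, max_steps: int = 8) -> str:
    """Cycle model -> tools -> model until the model stops asking for tools."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = model(messages)
        messages.append(reply)
        calls = reply.get("tool_calls")
        if not calls:
            return reply["content"]  # no tool requested: the task is done
        for call in calls:
            fn = call["function"]
            result = TOOLS[fn["name"]](**json.loads(fn["arguments"]))
            # Feed the observation back so the model can plan the next step.
            messages.append(
                {"role": "tool", "tool_call_id": call["id"], "content": result}
            )
    raise RuntimeError("agent did not finish within max_steps")
```

Calling `run_agent("Summarize the project")` walks through one tool call and one final answer. Swapping `fake_model` for a function that posts `messages` to a local server would turn the sketch into a real, if very small, agent.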
The stack that powers local agentic AI on the Mac has four layers. Let me walk you through each one, starting from the bottom. At the bottom is MLX, our open-source array framework purpose-built for Apple silicon. It handles all the low-level computation, Metal acceleration, and memory management. This is the foundation everything else is built on.
One level up, we have the language model layer. MLX-LM provides everything you need to load, run, quantize, and fine-tune large language models. It supports thousands of models from Hugging Face and gives you both CLI tools and a Python API. If you saw our sessions last year, this is what we covered in depth.
But to serve an agent, we need something more: a persistent server with a standard API. That's where MLX-LM Server comes in. This is an OpenAI-compatible HTTP server that exposes your local model through a standard API. It supports structured tool calling, so the model can invoke functions reliably, and reasoning models that can analyze complex problems step-by-step before responding. It's a drop-in replacement for any cloud LLM API.
And at the top of the stack, we have the agent itself. This can be any framework or tool that speaks the OpenAI chat completions protocol: Xcode, OpenCode, Pi agent, a custom script, or anything else. Because MLX-LM Server provides a standard interface, any agent framework works out of the box.
And it's not just us building on this stack. Several popular apps and tools build on MLX and MLX-LM. Ollama, LM Studio, and vLLM are just a few of the most popular ones. The ecosystem is broad and growing, and if you're using one of these tools, chances are you're already running on MLX.
So that's the stack. Let me now show you how to set everything up yourself. It only takes three steps to go from zero to a fully local agentic workflow. Step one: install MLX-LM. A single pip install gets you everything you need. Step two: start the server. Run mlx_lm.server with a model that supports tool calling. Starting with a small model to test your setup is always a good idea. The server starts up, loads the model, and is ready to accept requests on localhost. Step three: point your agent at the local server. In most agent frameworks, you just set the base URL to your local server's address and you're done. The agent doesn't know or care that the model is running on your Mac rather than in the cloud.
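For a custom script, step three amounts to sending OpenAI-style chat completion requests to the local address. The sketch below uses only the Python standard library and assumes the server is listening on `127.0.0.1:8080` with the model name `default_model`, as in this session's snippets. The `run_command` tool definition is hypothetical and only illustrates the shape of the `tools` field in the chat completions protocol.

```python
import json
import urllib.request

# Assumed address: matches the local server used in this session's examples.
BASE_URL = "http://127.0.0.1:8080/v1"


def build_chat_request(prompt: str, model: str = "default_model"):
    """Build (but do not send) a chat completion request that offers one tool."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "run_command",  # hypothetical tool name
                "description": "Run a shell command and return its output.",
                "parameters": {
                    "type": "object",
                    "properties": {"command": {"type": "string"}},
                    "required": ["command"],
                },
            },
        }],
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request) -> dict:
    """Send the request; this requires the local server to be running."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_chat_request("List the files in this directory.")
```

With the server running, `send(request)` returns the parsed JSON response. If the model decides to use the tool, the call should appear under `choices[0].message.tool_calls` in the OpenAI response format, and the script is then responsible for executing it and sending the result back.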
Let me show you a concrete example. Here's the configuration for OpenCode. We define a local provider. In particular, we set the URL to localhost and set the model name the server expects. We also tell OpenCode to use this local model for everything. That's it. Now every interaction runs through your local model. Now that we have an agent talking to MLX, let's look at how MLX gets the most out of your hardware and addresses the key challenges of running agents locally.
The first challenge is prompt processing. In an agentic workflow, every time the model receives tool output, it has to process all that new context before it can reason about the next step. This happens over and over throughout the agentic loop, and it adds up fast. Agentic sessions usually comprise hundreds of thousands of tokens, and most of those are prompt tokens the model has to read, such as tool output and file contents, rather than tokens it generates.
The M5 chip introduces dedicated Neural Accelerators, and MLX can target them for exactly this kind of work. Specifically, Neural Accelerators make matrix multiplication four times faster on M5 compared to M4. And with the specialized multiplication and attention kernels in MLX, this translates almost exactly into a prompt processing speedup.
Reducing prompt processing time means your agents can read your codebase or process tool results almost four times faster. And the best part? Taking advantage of Neural Accelerators requires no special arguments or code changes on your part. MLX selects the best kernel for the available hardware and it just works.
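A quick back-of-the-envelope calculation shows why prompt processing speed matters so much for agents. All numbers below are hypothetical and chosen only to keep the arithmetic easy to follow; they are not measurements of any Mac or model. The point is that when most tokens in a session are read rather than generated, a fourfold prompt processing speedup removes a large share of the total time.

```python
def session_seconds(prompt_tokens, generated_tokens, prompt_tps, generation_tps):
    """Total time = time spent reading the prompt + time spent generating."""
    return prompt_tokens / prompt_tps + generated_tokens / generation_tps


# Hypothetical session: 200,000 tokens read, 10,000 tokens generated.
PROMPT_TOKENS = 200_000
GENERATED_TOKENS = 10_000

# Hypothetical throughputs in tokens per second.
baseline = session_seconds(PROMPT_TOKENS, GENERATED_TOKENS,
                           prompt_tps=500, generation_tps=50)
# Same session with prompt processing four times faster.
faster = session_seconds(PROMPT_TOKENS, GENERATED_TOKENS,
                         prompt_tps=2000, generation_tps=50)

print(f"baseline: {baseline:.0f} s, with 4x prompt processing: {faster:.0f} s")
```

Under these assumptions, prompt processing drops from 400 to 100 seconds while generation stays at 200 seconds, so the whole session takes half as long. The end-to-end gain depends on how prompt-heavy the session is.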
Let's now talk about the second challenge, concurrency. In practice, agents rarely work alone. A common pattern is for an agent to spawn several subagents, each tackling a different part of the problem in parallel. One might be reading documentation, another searching code, and a third writing tests, all at the same time. That means multiple requests hitting your local model simultaneously. MLX-LM Server handles this with continuous batching.
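From the client side, the subagent pattern is simply several requests in flight at once. The sketch below shows that shape with a thread pool. `ask_model` is a placeholder that returns a canned string; in a real agent it would post a chat completion request to the local server. The subtask names are illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def ask_model(prompt: str) -> str:
    """Placeholder for an HTTP call to the local server (hypothetical)."""
    return f"done: {prompt}"


SUBTASKS = ["read the documentation", "search the code", "write the tests"]


def run_subagents(subtasks, ask=ask_model):
    """Issue one request per subtask concurrently; results keep input order."""
    with ThreadPoolExecutor(max_workers=len(subtasks)) as pool:
        return list(pool.map(ask, subtasks))


results = run_subagents(SUBTASKS)
```

Nothing on the client needs to know about batching: each subagent sends an ordinary request, and how those requests are grouped is entirely up to the server.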
Instead of processing requests one at a time, the server dynamically groups incoming requests into batches and processes them together on the GPU. New requests can join a batch in progress without waiting for the current one to finish. The result is that your subagents don't stall waiting in a queue. They all get served concurrently, which keeps the entire agentic workflow moving.
Finally, the third challenge is model size. Sometimes a single machine, even one with 512GB of RAM, just isn't enough because the model is too large to fit in memory. The most recent DeepSeek model, for instance, has a whopping 1.6 trillion parameters and requires more than 800GB of memory just for the weights. MLX's distributed support lets you spread a model across multiple Macs connected over Thunderbolt or Ethernet. For agents, this is powerful in two ways. First, it lets you run much larger, more capable models that wouldn't fit on a single machine. Second, it parallelizes prompt processing across devices, which directly speeds up the agentic loop since the model can process tool results faster.
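The memory figure above can be sanity-checked with simple arithmetic: weight memory is roughly the parameter count times the bits stored per weight. The sketch below assumes 4-bit and 8-bit quantization purely for illustration; the session does not say which precision the 800GB figure refers to. It also ignores everything except the weights, such as the KV cache and memory used by the system.

```python
import math


def weights_gb(parameters: float, bits_per_weight: int) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes), weights only."""
    return parameters * bits_per_weight / 8 / 1e9


PARAMETERS = 1.6e12  # 1.6 trillion parameters, as quoted in the session

four_bit = weights_gb(PARAMETERS, 4)   # assumed 4-bit quantization
eight_bit = weights_gb(PARAMETERS, 8)  # assumed 8-bit quantization

# Lower bound on the number of 512GB Macs needed just to hold 4-bit weights.
min_macs = math.ceil(four_bit / 512)

print(f"4-bit: {four_bit:.0f} GB, 8-bit: {eight_bit:.0f} GB, Macs: {min_macs}+")
```

At 4 bits per weight the result is 800GB, which already exceeds a single 512GB machine before any working memory is counted. That is the situation distributed inference is meant to address.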
Setting up distributed inference with MLX-LM Server is fairly straightforward. You launch the server using mlx.launch and a hostfile that contains information about the nodes and the type of connection. The model is automatically sharded across all available devices and everything else just works.
Starting with macOS 26.2, we have support for Thunderbolt RDMA, which provides low-latency, high-bandwidth communication over Thunderbolt. As a result, distributed inference with MLX has seen significant speed-ups: up to three times with four nodes. To learn how to set up your Macs for distributed inference with MLX, check out our session "Explore distributed inference and training with MLX".
Remember our PR summary demo from earlier? That was a simple read-and-report task. Let's now push things further and see what happens when we ask an agent to write an entire project from scratch and then fix a bug in an existing one.
In this demo, I'm going to ask the agent to build a small SwiftUI application from scratch.
I have started with a blank Xcode project and I am asking the agent to build a drawing app for the iPad.
And off it goes. The agent first looks at the current directory to find out the existing project structure, makes a plan to guide its implementation, and gets on to writing the code. Using an agent means we don't need to copy anything or even build the project. The agent writes the files, then builds the app, fixing any errors it encounters along the way.
And here we are: the model is done. It only took a couple of minutes to create the first version of the app. At the same time, I have the project open in Xcode and I am launching the app in the simulator.
Let's have a look at what the agent created.
It seems that we have a fully functional drawing app. That's really nice for something that was built in 2 minutes. With agentic coding, however, we can keep iterating until we are happy with the result. For instance, I prefer rounded end caps. I think they look much better. Let's ask the agent to add them.
The agent will edit the code and recompile the app until it compiles without errors.
Let's test the new version.
We now have rounded end caps. This is cool indeed. It's even cooler that all of this happened locally: the model ran through MLX-LM Server on this Mac, and the agent used standard development tools like xcodebuild to build and verify its work.
For our final demo, let's look at something that integrates directly with your development environment.
Here I have the same drawing app project open in Xcode. Let's connect Xcode to our already running MLX server. We open the settings and navigate to the Intelligence tab. We click on Add Chat Provider... and select a Locally Hosted provider. We set the Port to 8080 or whichever port we selected when launching our MLX server and we're done. Now Xcode can talk to our local model.
I have introduced a bug into our previously working app and now we can ask the model to fix it.
Within seconds, it identifies the bug and inspects the code around it. Finally, it writes a fix and we can now build and run our app.
This shows how a locally running agent can integrate with your existing development workflow in Xcode, reading project files, understanding build errors, and making targeted fixes. Local AI means your code never leaves your Mac.
Today, we showed you the full stack for running agentic AI locally on your Mac, from MLX all the way up to the agent, and how Neural Accelerators, continuous batching, and distributed inference make it fast. To get started, install MLX-LM, launch the server, and point your favorite agent at it. Everything we showed today is open source and available right now. Thank you for watching and I'm excited to see what you build with local agentic AI on the Mac.
4:40 - Set up MLX-LM and start the local server
# Step 1: Install MLX-LM
pip install mlx-lm

# Step 2: Start the server
mlx_lm.server --model mlx-community/Qwen-3.5-4B-8bit

# Step 3: Point your agent to the server
curl -X POST \
  http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"default_model","messages":[{"role":"user","content":"Hello!"}]}'
5:18 - Configure an agent to use your local MLX server
{
  "$schema": "https://opencode.ai/config.json",
  "model": "mlx/default_model",
  "small_model": "mlx/default_model",
  "provider": {
    "mlx": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "MLX (local)",
      "options": {
        "baseURL": "http://127.0.0.1:8080/v1"
      },
      "models": {
        "default_model": {
          "name": "Default MLX Model"
        }
      }
    }
  }
}
8:33 - Launch distributed inference with MLX
mlx.launch --hostfile hosts.json \
  --backend jaccl \
  /remote/path/to/mlx_lm.server \
  --model mlx-community/Qwen-3.5-122B-A3B-8bit
- 0:00 - Introduction
Overview of building and running agentic AI workflows entirely on Mac using MLX — no cloud, no API keys, just your hardware.
- 0:32 - The chat and agentic loop
How traditional chat differs from the agentic loop: the model decides what to do, calls tools to run commands, read files, and hit APIs, observes the results, and iterates — all running locally for privacy and offline availability.
- 2:42 - Local agentic AI stack
A walkthrough of the four-layer stack powering local agentic AI on the Mac: MLX (array framework for Apple silicon), MLX-LM (model loading, quantization, and fine-tuning), MLX-LM Server (OpenAI-compatible HTTP server), and the agent layer. Also covers popular tools built on MLX and MLX-LM, such as Ollama, LM Studio, and vLLM.
- 4:36 - Setting up your own agent
Three steps to go from zero to a fully local agentic workflow: install MLX-LM with pip, start the server with a tool-calling model, and configure your agent to point at the local endpoint.
- 5:39 - Making agents fast
How MLX tackles the first challenge of agentic workloads, efficiently processing contexts of hundreds of thousands of tokens, and how M5 Neural Accelerators make prompt processing almost four times faster than on M4.
- 6:53 - Concurrency and distributed inference
How MLX-LM Server uses continuous batching to serve concurrent subagent requests, and how distributed inference spreads large models across multiple Macs connected over Thunderbolt or Ethernet.
- 9:20 - More examples
Two live demos running entirely on-device. First, OpenCode with MLX builds a SwiftUI drawing app for iPad starting from a blank Xcode project; then Xcode, connected to the local MLX server, finds and fixes a bug in that same app.
- 13:01 - Next steps
Summary of the full local AI stack and practical steps to get started: install MLX-LM, launch the server, and connect your agent. All shown tools are open-source and available now.