Explore distributed inference and training with MLX

Scale your machine learning workloads across multiple Macs using MLX. Learn how to tackle interconnect efficiency, large model inference, request batching, and distributed training challenges. Discover how a few Macs on your desk can replace expensive cloud infrastructure for demanding AI workloads.

Chapters
- 0:00 - Introduction
- 2:09 - Distributed communication
- 4:32 - Setting up your cluster
- 10:33 - Distributed inference and fine-tuning
- 13:35 - Model parallelism strategies
- 15:53 - Distributed fine-tuning
- 18:34 - CLI, Python, Swift, and C++ APIs
- 20:45 - Next steps
Related Videos

WWDC26
- Explore numerical computing in Swift with MLX
- Run local agentic AI on the Mac using MLX
WWDC25
- Explore large language models on Apple silicon with MLX
- Get started with MLX for Apple silicon
Hi, I'm Tatiana, a research scientist on the MLX team. It's been a remarkable time for local LLMs: models keep getting larger and gaining amazing new capabilities, becoming smarter and handling harder problems. And as they improve, we use them for more: longer contexts, harder tasks, more complex workflows. Eventually, memory, compute, or bandwidth on a single machine becomes a limitation. Our WWDC26 video "Run local agentic AI on the Mac using MLX" shows how to run AI agents locally. But when you have multiple devices, you can take local AI even further, running larger LLMs or accelerating them through distributed inference and training. Today, we'll take a deep dive into scaling across multiple Macs with MLX, using the hardware right on your desk. We'll start with the command line interface to get models running on your machines, move to the Python API for experimentation, and finish with Swift for embedding these workflows directly into your apps.
Let's start! First, we'll look at the full hardware and software stack that makes distributed workloads on Apple Silicon possible. Then we'll put everything together and turn four M3 Ultras into a cluster. We'll walk through every step: choosing the right topology to connect machines, enabling fast communication, and launching distributed jobs.
Once the cluster is ready, we'll get to the exciting part: fast and local distributed LLM inference and fine-tuning. We'll run it with MLX, compare it side by side against a single Mac, and look at how MLX distributes the model across the cluster.
Most of our examples use the command line interface, so at the end we'll show how distributed communication is also exposed through Python, Swift, and C++ APIs.
Let's start by looking at distributed communication for Apple Silicon. To send and receive data fast, machines need to be connected with a physical link: an interconnect. On top of that, we also need a transport protocol, a mechanism that pushes bytes from one machine's memory to another's. Starting in macOS 26.2, the Remote Direct Memory Access protocol, or RDMA for short, is supported over Thunderbolt 5. RDMA moves data directly from one machine's memory to another's, avoiding most CPU and operating system overhead.
RDMA over Thunderbolt gives us the high-bandwidth, low-latency communication we need for distributed workloads. On its own, however, it only provides raw data movement between two machines. Distributed programs need something higher-level: a communication backend that provides primitives for sending data between individual machines and for coordinating across the entire group. These two kinds of operations are the building blocks of distributed training and inference. And this is where JACCL comes in.
JACCL is an open-source collective communication library built by Apple. It leverages RDMA over Thunderbolt and gives you collective communication primitives for sending data between machines and combining results across the group — without managing any of the low-level transport yourself. And it's not limited to machine learning — any distributed workload on Apple Silicon can be built on top of it.
And the final piece of the stack is a machine learning framework that uses the communication backend for distributed inference and training — that's MLX.
MLX is an open-source machine learning library built by Apple for Apple Silicon. It leverages JACCL for low-latency distributed communication and provides tools for orchestrating distributed jobs across the cluster. If you're new to MLX, check out our video "Get started with MLX for Apple silicon" from WWDC25.
So now we understand the full stack. Let's put it all together and build a cluster — a group of machines that work together on the same task. We will use 4 M3 Ultras.
To set up the cluster, we need to connect the machines with Thunderbolt 5 cables. There are different ways to wire them together, and the topology directly affects the communication time.
So to begin with, we'll look at what defines that time. Next, we'll look at how to actually connect the machines — which topologies JACCL supports, and the trade-offs between them.
After that, we'll show how to enable RDMA on the machines for fast communication. And finally, we'll launch distributed jobs on the cluster using MLX.
So, communication time has two components: latency and transfer time. Latency is the fixed cost paid for each communication operation, independent of the amount of data being sent.
Transfer time is the cost of moving the data through the link; it grows with message size and depends on the bandwidth of the link.
For small messages, the data movement cost is tiny, so latency dominates.
For large messages, the opposite holds: transfer time dominates. Depending on whether communication is latency-bound or bandwidth-bound, we may prefer different topologies.
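The two-component cost model described here can be sketched in a few lines of Python. The latency and bandwidth figures are illustrative assumptions chosen for the example, not measured values for Thunderbolt 5 or JACCL.

```python
def communication_time(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Fixed latency plus size-dependent transfer time, in seconds."""
    transfer_s = message_bytes / bandwidth_bytes_per_s
    return latency_s + transfer_s


def latency_share(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Fraction of the total time spent on latency (near 1.0 = latency-bound)."""
    total = communication_time(message_bytes, latency_s, bandwidth_bytes_per_s)
    return latency_s / total


# Illustrative assumptions only: 10 microseconds of latency, 5 GB/s of bandwidth.
LATENCY_S = 10e-6
BANDWIDTH = 5e9

for size in (1_000, 1_000_000, 1_000_000_000):
    share = latency_share(size, LATENCY_S, BANDWIDTH)
    print(f"{size:>13,} bytes: latency is {share:.1%} of the total time")
```

With these made-up numbers, a kilobyte-sized message is almost entirely latency, while a gigabyte-sized message is almost entirely transfer time.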
JACCL supports two of them: a mesh and a ring.
In a full mesh, every machine connects directly to every other, so any group communication has the lowest possible latency. In a ring, each node connects only to its two neighbors. Communication between nonadjacent nodes must travel through intermediate machines, which increases latency. However, the ring requires fewer cables and ports per machine, making it easier to scale to more nodes. And because each node has only two connections, we can use the extra Thunderbolt ports to run two or three cables per neighbor (depending on the Mac), increasing the bandwidth per link and reducing transfer time.
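To make the cabling trade-off concrete, here is a small sketch that counts cables, ports, and worst-case hops for both topologies. The helper names are hypothetical, and the ring formulas assume at least three machines.

```python
def mesh_cables(n):
    """Full mesh: one cable for every pair of machines."""
    return n * (n - 1) // 2


def mesh_ports_per_machine(n):
    """Full mesh: one port for every other machine."""
    return n - 1


def ring_cables(n, cables_per_neighbor=1):
    """Ring of n >= 3 machines: n links, each made of one or more cables."""
    return n * cables_per_neighbor


def ring_ports_per_machine(cables_per_neighbor=1):
    """Ring: two neighbors, each reached through one or more cables."""
    return 2 * cables_per_neighbor


def ring_max_hops(n):
    """Worst-case number of hops between two machines in a bidirectional ring."""
    return n // 2


for n in (4, 8):
    print(
        f"{n} machines: mesh needs {mesh_cables(n)} cables "
        f"({mesh_ports_per_machine(n)} ports each, 1 hop), "
        f"ring needs {ring_cables(n)} cables "
        f"({ring_ports_per_machine()} ports each, up to {ring_max_hops(n)} hops)"
    )
```

For the four-machine cluster in this session, the mesh uses six cables and three ports per machine, while a single-cable ring uses four cables and two ports per machine.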
When machines are connected into a mesh, we have the flexibility to route each communication through either a mesh topology or a ring topology.
What's nice about JACCL is that it automatically picks the best topology depending on the message size and communication operation: mesh when latency matters, ring when bandwidth matters. For this flexibility, let's connect all M3 Ultras into a mesh.
Now that all M3 Ultras are connected, we need to enable RDMA on every machine. Open System Settings on the machine, search for "RDMA", click "Enable RDMA over Thunderbolt", enable RDMA, and reboot.
Great! Macs are connected with Thunderbolt 5 cables, and RDMA is enabled. Now we need a way to launch distributed programs.
One way to do this is over the local network, for example through Wi-Fi or Ethernet. From any machine with SSH access to the cluster, a MacBook in my case, we connect to each Mac and start the program. From that point on, all machines communicate directly over the Thunderbolt links.
MLX provides a launch helper that does exactly this for you! You run mlx.launch on your MacBook and it orchestrates the cluster. You give it the executable you want to run and a JSON hostfile describing your cluster. From there, it SSHes into each node using the hostnames from the provided hostfile and starts the executable on every machine.
Let's see what the hostfile that describes the cluster should look like.
It is a JSON array — one entry per node. "ssh" is the hostname used by mlx.launch to reach the machine. "ips" is the machine's IP on your local network used by JACCL for initial coordination between nodes. And "rdma" is a list of the RDMA device names for each Thunderbolt peer connection.
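As a rough illustration of that structure, the sketch below runs a few sanity checks on a hostfile. It assumes, based only on the example hostfile in this session, that each node's "rdma" list has one entry per node with null in the node's own position; `validate_hostfile` is a hypothetical helper, not part of MLX.

```python
import json


def validate_hostfile(nodes):
    """Return a list of problems found; an empty list means the checks pass."""
    problems = []
    n = len(nodes)
    for i, node in enumerate(nodes):
        # Every entry is expected to carry the three keys described above.
        for key in ("ssh", "ips", "rdma"):
            if key not in node:
                problems.append(f"node {i}: missing '{key}'")
        rdma = node.get("rdma", [])
        # Assumption: one rdma slot per node, null for the node itself.
        if len(rdma) != n:
            problems.append(f"node {i}: expected {n} rdma entries, got {len(rdma)}")
        elif rdma[i] is not None:
            problems.append(f"node {i}: rdma entry for the node itself should be null")
    return problems


# A two-node example in the same shape as the session's hostfile.
example = json.loads("""
[
  {"ssh": "m3-ultra-0", "ips": ["192.168.1.10"], "rdma": [null, "rdma_en5"]},
  {"ssh": "m3-ultra-1", "ips": ["192.168.1.11"], "rdma": ["rdma_en5", null]}
]
""")

print(validate_hostfile(example) or "hostfile looks consistent")
```

In practice the generated hostfile should not need checking by hand; a check like this is mainly useful when you write the file manually.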
You can write it manually, but MLX also provides a helper script `mlx.distributed_config` that generates it for you.
You pass the list of hostnames and an output path. You can also embed environment variables in the config; they will be set automatically on every node at launch time. Here we set MLX_METAL_FAST_SYNCH=1, which enables faster GPU-to-CPU synchronization. It is critical for distributed tasks because computation runs on the GPU while communication runs on the CPU. You can also pass the --auto-setup flag to configure the Thunderbolt network automatically. The --backend argument selects the topology: for a mesh, --backend is set to jaccl, as in this example; for a ring, we would change it to jaccl-ring.
Let's run this command and generate the hostfile for our cluster.
First, it checks that all hosts are reachable over SSH. Then it probes each machine's Thunderbolt ports to discover which machines are physically connected to which, building a map of the topology. Since we passed --auto-setup, it disables the Thunderbolt Bridge on all machines and configures each Thunderbolt link for RDMA. Finally, it writes a JSON hostfile with everything mlx.launch needs. Note that without the --auto-setup flag, the script prints the configuration commands so you can review them and run them yourself.
Now the cluster is ready. Let's move to the exciting part: distributed language model inference and fine-tuning. The easiest way to start is with the command line interface and MLX LM. MLX LM is an open-source Python package built on top of MLX that provides command-line tools and a Python API for running language models locally on Apple Silicon. Check out our video "Explore large language models on Apple silicon with MLX" from WWDC25 to get started on a single device.
As we showed last year, chatting with a model on a single Mac can be done via command line interface with mlx_lm.chat. We run it in the terminal, specifying the model we want to use, for example, Qwen 3.6, and the maximum number of tokens for the response. Under the hood, MLX LM loads and runs the model on a single machine.
To chat with the same model on the cluster from the command line, we wrap the command with mlx.launch. On our MacBook, in the terminal, we run mlx.launch with --hostfile pointing to our cluster configuration. After the double dash, we pass the exact same mlx_lm.chat command, but using the remote path to the executable on each node. The command is almost identical: MLX LM shards the model and coordinates the distributed inference for you. Keep in mind that all necessary libraries, such as MLX, must be installed on each Mac, and the executable must be accessible on all machines.
One line on the command line, and we're running a model spread across the entire cluster! Let's try both side by side and chat with Qwen 3.6, a 27-billion-parameter model, on a single M3 Ultra and on four of them.
I've already started mlx_lm.chat on both sides — on the left, the model is loaded on a single M3 Ultra; on the right, it's sharded across four machines.
Let's prompt both with "Implement a transformer model in MLX." That's quite an impressive speedup! The cluster generates tokens at nearly three times the rate of a single machine for the Qwen 3.6 model.
As we see, running a model across multiple Macs can significantly boost inference speed. The exact speedup depends on the model size and architecture. But speed is not the only reason to go distributed. Sometimes a model is simply too large for one machine. Kimi 2.6, for example, has 1 trillion total parameters. Even with 8-bit quantization, the weights alone require about one terabyte of memory. That does not fit on a single M3 Ultra, but it can fit across four. So how do we actually split the weights and computation across machines? MLX and MLX LM support two approaches: pipeline and tensor parallelism.
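The memory arithmetic behind that claim can be written out as a short sketch. The session does not state how much memory each machine has, so the 512 GB per machine used here is an assumption, and the estimate covers weights only, ignoring activations, caches, and the operating system.

```python
def weight_bytes(num_params, bits_per_param):
    """Memory needed for the weights alone, in bytes."""
    return num_params * bits_per_param // 8


def weights_fit(num_params, bits_per_param, machines, memory_per_machine_bytes):
    """True if the weights fit in the combined memory of the machines."""
    total_memory = machines * memory_per_machine_bytes
    return weight_bytes(num_params, bits_per_param) <= total_memory


ONE_TRILLION = 1_000_000_000_000
MEMORY_PER_MACHINE = 512 * 10**9  # assumption: 512 GB per machine

needed = weight_bytes(ONE_TRILLION, bits_per_param=8)
print(f"weights at 8 bits: {needed / 10**12:.1f} TB")
for machines in (1, 4):
    ok = weights_fit(ONE_TRILLION, 8, machines, MEMORY_PER_MACHINE)
    print(f"{machines} machine(s): {'fits' if ok else 'does not fit'}")
```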
Pipeline parallelism splits the model by depth. In this case, each machine holds a group of layers, and data moves through the machines sequentially. It does not speed up the inference, because each token still has to pass through the layer groups one after another. But the benefit is simple communication: machines only exchange activations at the boundaries between layer groups.
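A toy version of splitting by depth might look like the sketch below. The even, contiguous layer assignment is an assumption made for illustration; the way MLX LM actually assigns layers to machines may differ, and the "layers" here are plain Python functions standing in for real model layers.

```python
def split_layers(num_layers, num_machines):
    """Assign contiguous layer ranges to machines, as evenly as possible."""
    base, extra = divmod(num_layers, num_machines)
    stages = []
    start = 0
    for rank in range(num_machines):
        count = base + (1 if rank < extra else 0)
        stages.append(range(start, start + count))
        start += count
    return stages


def pipeline_forward(x, layers, stages):
    """Run the stages one after another.

    Moving from one stage to the next stands in for sending the
    activations to the next machine over the network.
    """
    for stage in stages:
        for i in stage:
            x = layers[i](x)
    return x


# Ten toy "layers": layer k simply adds k to its input.
layers = [lambda v, k=k: v + k for k in range(10)]
stages = split_layers(len(layers), num_machines=4)

print([list(stage) for stage in stages])
print(pipeline_forward(0, layers, stages))
```

Because every stage waits for the previous one, a single token is not processed any faster than on one machine; the gain is that each machine only needs to hold its own group of layers.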
Tensor parallelism splits the model by width. In this case, each machine holds part of every layer, so all machines process the same token at the same time. It improves inference speed because the per-layer computation is parallelized. The trade-off is much more frequent communication, which happens at every layer and for every token. This makes low latency important, and that is why the mesh topology is crucial for this case: every machine can reach every other machine in a single hop.
Tensor parallelism is the default sharding strategy in MLX LM. To shard the model with pipeline parallelism, we can simply append the --pipeline flag to the command. Note that not all models support pipeline parallelism.
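Splitting by width can be illustrated with two small linear layers in plain Python. This is a conceptual sketch, not how MLX implements it: the first layer is split by output features, the second by input features, and an element-wise sum across machines (standing in for an all-reduce) recovers the full result. All helper names are hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


def relu(h):
    return [max(0, v) for v in h]


def shard_rows(W, n):
    """Split output features: each machine gets a block of rows."""
    step = len(W) // n
    return [W[r * step:(r + 1) * step] for r in range(n)]


def shard_cols(W, n):
    """Split input features: each machine gets a block of columns."""
    step = len(W[0]) // n
    return [[row[r * step:(r + 1) * step] for row in W] for r in range(n)]


def all_sum(partials):
    """Element-wise sum across machines, standing in for an all-reduce."""
    return [sum(vals) for vals in zip(*partials)]


W1 = [[1, 0, 2, -1], [0, 1, -1, 2], [3, 1, 0, 0], [-2, 0, 1, 1]]
W2 = [[1, 1, 1, 1], [1, -1, 2, 0]]
x = [1, 2, 3, 4]

# Single-machine reference.
reference = matvec(W2, relu(matvec(W1, x)))

# Two "machines": each holds half of W1's rows and half of W2's columns.
machines = 2
W1_shards = shard_rows(W1, machines)
W2_shards = shard_cols(W2, machines)
partials = []
for rank in range(machines):
    hidden_slice = relu(matvec(W1_shards[rank], x))        # local, no communication
    partials.append(matvec(W2_shards[rank], hidden_slice))  # partial output
sharded = all_sum(partials)                                  # one communication step

print(reference, sharded)
```

Every machine works on the same input at the same time, and one communication step is needed per pair of layers, which is why latency matters so much for this strategy.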
Now, let's chat with a one-trillion-parameter Kimi 2.6 on our cluster.
For this we use mlx.launch from our MacBook as before, pointing to the hostfile. I'm not passing the --pipeline flag, so we're using tensor parallelism. We need to wait a moment — mlx.launch is connecting to every machine, MLX LM loads and shards the model, and starts the chat.
Great, the model is loaded! Let's prompt the model with: "Implement machine learning architecture for GPT in Python with MLX".
And there we go — with one command, a massive trillion-parameter model is running locally across your Macs, answering your questions.
With MLX and MLX LM, you can not only run language model inference, you can also fine-tune models on your hardware. Fast, efficient, and fully private — your data never leaves your machines. Let's start with a single Mac, and then scale to our cluster.
When fine-tuning or training on a single machine, we split the training data into batches, each a set of multiple examples.
For each batch, the Mac computes gradients and updates the model weights. We repeat this process for one or more passes over the training dataset, until the model reaches the desired quality.
The faster we process the training data, the sooner fine-tuning finishes. So how can we use multiple machines to speed this up? The idea is straightforward. We replicate the model on every Mac.
Each machine receives a different batch of data and computes gradients locally. Then we average the gradients, so the model's update uses information from all batches. This is called data-parallel training because the model is replicated, while the data is processed in parallel across machines — this is what gives us the speedup.
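Gradient averaging can be sketched with a one-parameter model in plain Python. This is a toy illustration of the idea rather than MLX LM's training loop; the model, data, and learning rate are made up, and the averaging step stands in for the communication between machines.

```python
import math


def gradient(w, batch):
    """Gradient of the mean squared error for the model y = w * x."""
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)


def data_parallel_step(w, batches, lr):
    """One update: local gradients per machine, then an average across machines."""
    local_grads = [gradient(w, batch) for batch in batches]  # one per machine
    averaged = sum(local_grads) / len(local_grads)            # stands in for all-reduce
    return w - lr * averaged


# Toy data generated by y = 2 * x, split into equal batches for two machines.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
batches = [data[:2], data[2:]]

w_parallel = data_parallel_step(0.0, batches, lr=0.01)
w_single = 0.0 - 0.01 * gradient(0.0, data)

print(w_parallel, w_single, math.isclose(w_parallel, w_single))
```

With equal-sized batches, averaging the per-machine gradients gives the same update as computing the gradient over all the data on one machine.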
So with N machines we can process data up to N times faster. Sounds amazing! Let's see how we can use data parallelism with MLX LM.
As before, the only difference from a single device is launching the job with mlx.launch from your MacBook, specifying a path to mlx_lm.lora on remote machines. Data sharding is handled by MLX LM and the command is almost identical — we scale --batch-size by the number of devices so each machine still processes the same number of samples per step as before.
Let's fine-tune Qwen 3.5 with 9 billion parameters on a single machine and on the cluster, and compare the number of tokens the model processes per second.
We are launching fine-tuning on a single device on the left and on the cluster on the right, using mlx.launch and the hostfile and specifying the path to mlx_lm.lora on the remote machines.
First, it loads the data and the model, and then training starts. A single M3 Ultra processes around 180 tokens per second, while the cluster processes around 600 tokens per second, which gives us a speedup of more than 3 times for fine-tuning. Now, with MLX, you can turn your devices into a local training cluster for efficient fine-tuning without moving to the cloud.
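The speedup and the scaling efficiency follow directly from the two throughput figures quoted in the demo. The numbers below are the approximate values mentioned in the session, so the results are approximate as well.

```python
def speedup(cluster_tokens_per_s, single_tokens_per_s):
    """How many times faster the cluster is than a single machine."""
    return cluster_tokens_per_s / single_tokens_per_s


def scaling_efficiency(cluster_tokens_per_s, single_tokens_per_s, machines):
    """Fraction of the ideal N-times speedup that is actually achieved."""
    return speedup(cluster_tokens_per_s, single_tokens_per_s) / machines


SINGLE_TPS = 180   # approximate, from the demo
CLUSTER_TPS = 600  # approximate, from the demo
MACHINES = 4

print(f"speedup: {speedup(CLUSTER_TPS, SINGLE_TPS):.2f}x")
print(f"efficiency: {scaling_efficiency(CLUSTER_TPS, SINGLE_TPS, MACHINES):.0%}")
```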
So far, we have used the command line interface for distributed inference and fine-tuning with MLX LM. However, MLX also provides fine-grained control over sharding and distributed operations through flexible Python, Swift, and C++ APIs.
This allows you to experiment with models in Python and C++ or embed models into your app with Swift. Let's look at some examples.
To run distributed inference with the Python API and MLX LM, we first initialize the distributed group for communication. Then we define the type of parallelism we want, for example tensor parallelism. Finally, we shard the model using the sharded_load function. After that, we use the model exactly as we would on a single device, and MLX LM handles all distributed communication under the hood.
To have more control over the model and its sharding, we can use low-level primitives from MLX itself. For example, after defining a simple Linear layer, we can shard it with tensor parallelism using the shard_linear function.
You can even control basic distributed operations like all-reduce. In Python, Swift, or C++, after initializing the distributed group via JACCL, we perform a collective distributed sum of our tensor across all Macs using the corresponding MLX primitives.
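To show what a collective sum produces, here is a single-process simulation in plain Python. It does not use MLX or any network communication; the hypothetical `simulate_all_sum` function only mimics the result that each machine would see when every rank contributes an array filled with its own rank.

```python
def simulate_all_sum(per_rank_data):
    """Element-wise sum across ranks; every rank receives the same result."""
    total = [sum(values) for values in zip(*per_rank_data)]
    return [list(total) for _ in per_rank_data]


world_size = 4
# Each rank contributes a length-4 array filled with its own rank.
per_rank = [[float(rank)] * 4 for rank in range(world_size)]

results = simulate_all_sum(per_rank)
for rank, result in enumerate(results):
    print(f"rank {rank}: {result}")
```

With four ranks, every element of the result is 0 + 1 + 2 + 3, and every rank ends up holding the same array.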
As we pointed out at the beginning of the session, JACCL is available on its own and you can leverage it for any applications requiring distributed communication, even non-ML applications. JACCL can be built without MLX and it provides a C++ API with communication primitives: after initializing a JACCL group, we again perform a collective distributed sum across all Macs for our tensor but via JACCL directly, not MLX.
Now you know both the high-level and low-level APIs for distributed inference and training with MLX and JACCL, and you are ready to build advanced distributed workflows with MLX.
Throughout this session, we looked at the full stack that makes distributed training and inference possible on Apple Silicon — from RDMA over Thunderbolt, all the way up to MLX and MLX LM. We showed you how easy it is to scale from a single device to multiple devices, and the benefits it brings: faster inference, the ability to run trillion-parameter models, and faster fine-tuning; all with minimal changes to your single device code, supporting command line interface, Python, Swift and C++ APIs.
With a distributed cluster, you can now run local AI agents powered entirely by MLX: fast, private, and on the hardware you own. To learn more, check out our WWDC26 video "Run local agentic AI on the Mac using MLX".
To dive further into advanced distributed features, including custom parallelism strategies and training loops, check out our documentation. You can also use MLX LM's built-in server to serve models across your cluster.
We can't wait to see what you build with MLX on Apple Silicon!

8:31 - Hostfile format for a 4-node MLX cluster

[
  {
    "ssh": "m3-ultra-0",
    "ips": ["192.168.1.10"],
    "rdma": [null, "rdma_en5", "rdma_en4", "rdma_en3"]
  },
  {
    "ssh": "m3-ultra-1",
    "ips": ["192.168.1.11"],
    "rdma": ["rdma_en5", null, "rdma_en4", "rdma_en3"]
  },
  {
    "ssh": "m3-ultra-2",
    "ips": ["192.168.1.12"],
    "rdma": ["rdma_en5", "rdma_en4", null, "rdma_en3"]
  },
  {
    "ssh": "m3-ultra-3",
    "ips": ["192.168.1.13"],
    "rdma": ["rdma_en5", "rdma_en4", "rdma_en3", null]
  }
]

8:56 - Generate the cluster hostfile with mlx.distributed_config

mlx.distributed_config \
    --hosts m3-ultra-0,m3-ultra-1,m3-ultra-2,m3-ultra-3 \
    --output "m3-ultra-jaccl.json" \
    --env MLX_METAL_FAST_SYNCH=1 \
    --auto-setup \
    --backend jaccl

11:04 - Run distributed LLM inference with mlx_lm.chat

# Single-device LLM inference
mlx_lm.chat --model "Qwen/Qwen3.6-27B" --max-tokens 2048

# Distributed LLM inference across the cluster
mlx.launch --hostfile "m3-ultra-jaccl.json" -- \
    /remote/path/to/mlx_lm.chat --model "Qwen/Qwen3.6-27B" --max-tokens 2048

15:03 - Run distributed inference with pipeline parallelism

# Tensor parallelism (default)
mlx.launch --hostfile "m3-ultra-jaccl.json" -- \
    /remote/path/to/mlx_lm.chat --model "moonshotai/Kimi-K2.6" \
                                 --max-tokens 2048

# Pipeline parallelism — append --pipeline flag
mlx.launch --hostfile "m3-ultra-jaccl.json" -- \
    /remote/path/to/mlx_lm.chat --model "moonshotai/Kimi-K2.6" \
                                 --max-tokens 2048 \
                                 --pipeline

17:18 - Run distributed fine-tuning with mlx_lm.lora

# Single-device fine-tuning
mlx_lm.lora --model "Qwen/Qwen3.5-9B" \
             --data "mlx-community/wikisql" \
             --train --batch-size 4

# Distributed fine-tuning (scale --batch-size by number of devices)
mlx.launch --hostfile "m3-ultra-jaccl.json" -- \
    /remote/path/to/mlx_lm.lora --model "Qwen/Qwen3.5-9B" \
                                  --data "mlx-community/wikisql" \
                                  --train --batch-size 16

19:01 - Distributed inference with the MLX LM Python API

import mlx.core as mx
from mlx_lm import stream_generate
from mlx_lm.utils import sharded_load

# Initialise distributed backend
group = mx.distributed.init(strict=True, backend="jaccl")
# Define parallelism
tensor_group, pipeline_group = group, None

# Shard the model
model, tokenizer = sharded_load("moonshotai/Kimi-K2.6", pipeline_group, tensor_group)
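# Prompt for the model
prompt = "Implement machine learning architecture for GPT in Python with MLX"
# Generate on every rank; print from rank 0 only
for response in stream_generate(model, tokenizer, prompt, max_tokens=1024):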
    if group.rank() == 0:
        print(response.text, end="", flush=True)

19:31 - Shard a layer with the MLX Python API

import mlx.core as mx
import mlx.nn as nn

# Initialise distributed backend
group = mx.distributed.init(strict=True, backend="jaccl")

# Define layer and shard it column-wise
layer = nn.Linear(1024, 1024)
sharded_layer = nn.layers.distributed.shard_linear(
    layer, "all-to-sharded", group=group
)
data = mx.random.normal((1, 1, 1024))
output = sharded_layer(data)
mx.eval(output)

19:47 - All-reduce across devices in Python, Swift, and C++

# Python
import mlx.core as mx
world = mx.distributed.init(strict=True, backend="jaccl")
data = mx.full((4,), float(world.rank()), dtype=mx.float32)
result = mx.distributed.all_sum(data, group=world)
mx.eval(result)

// Swift
let group = try DistributedGroup(strict: .ring)
// `rank` is assumed to hold this process's index in the group
let data = rank == 0
    ? MLXArray(converting: [1.0, 2.0, 3.0])
    : MLXArray(converting: [5.0, 6.0, 7.0])
let result = try group.allSum(data)

// C++
namespace mx = mlx::core;
auto world = mx::distributed::init(/* strict */ true, "jaccl");
mx::array data = mx::full({4}, static_cast<float>(world.rank()), mx::float32);
mx::array result = mx::distributed::all_sum(data, world);
mx::eval(result);

20:06 - Standalone distributed sum with the JACCL C++ API

#include <jaccl/jaccl.h>
#include <iostream>

int main() {
    // Initialize JACCL group
    auto group = jaccl::init();
    std::cout << "Rank " << group->rank() << " of " << group->size() << std::endl;
    // Perform all-reduce sum
    float data[10] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 7.0f, 8.0f, 9.0f, 10.0f};
    float output[10];
    group->all_sum(data, output, sizeof(data), jaccl::Float32);
    std::cout << "Result: " << output[0] << std::endl;
    return 0;
}

- 0:00 - Introduction
- Overview of why distributed AI becomes necessary as models grow larger, and a preview of what the session covers: CLI tools, Python API, and Swift for embedding distributed workflows in your apps.
- 2:09 - Distributed communication
- A walkthrough of the full hardware and software stack enabling distributed workloads on Apple silicon: RDMA over Thunderbolt 5 for low-latency data movement, JACCL (open-source collective communication library), and MLX as the ML framework that ties them together.
- 4:32 - Setting up your cluster
- How to physically connect four M3 Ultras into a cluster — understanding latency vs. bandwidth trade-offs, choosing between mesh and ring topologies, enabling RDMA in System Settings, and using mlx.distributed_config and mlx.launch to configure and orchestrate the cluster.
- 10:33 - Distributed inference and fine-tuning
- How to run distributed LLM inference with MLX LM using a single CLI command — wrapping mlx_lm.chat with mlx.launch to shard a 27B-parameter Qwen model across four M3 Ultras, achieving nearly 3x the token generation rate of a single machine.
- 13:35 - Model parallelism strategies
- How MLX LM splits large models across machines using tensor parallelism (splitting by width for faster inference) and pipeline parallelism (splitting by depth for simpler communication) — including a demo running the 1-trillion-parameter Kimi 2.6 model across four Macs.
- 15:53 - Distributed fine-tuning
- How data-parallel training accelerates fine-tuning by replicating the model across machines, processing different data batches in parallel, and averaging gradients — demonstrated fine-tuning Qwen 3.5 (9B) at over 3x throughput on the cluster versus a single M3 Ultra.
- 18:34 - CLI, Python, Swift, and C++ APIs
- How to use MLX's fine-grained Python, Swift, and C++ APIs for distributed inference — initializing a distributed group, sharding models with tensor parallelism, using low-level all_reduce primitives, and leveraging JACCL standalone for non-ML distributed workloads.
- 20:45 - Next steps
- Summary of the full distributed stack — from RDMA over Thunderbolt to MLX and MLX LM — and next steps including the companion session on local agentic AI, documentation on custom parallelism strategies, and the built-in MLX LM distributed server.
