Course Outline

Introduction to EXO and Local AI Clustering

  • Overview of the EXO framework and the exo-explore ecosystem.
  • Comparing centralized cloud inference versus distributed local inference.
  • Architecture: libp2p device discovery, MLX backend, dashboard, and API layers.
  • Hardware requirements: Apple Silicon (M3 Ultra, M4 Pro/Max), Thunderbolt 5, shared storage.

Installing EXO on macOS

  • Setting up Xcode, the Metal Toolchain, and macOS prerequisites.
  • Installing uv, Node.js, and the Rust nightly toolchain.
  • Installing the pinned macmon fork for Apple Silicon monitoring.
  • Cloning the repository and building the dashboard with npm.
  • Running EXO from source and verifying the localhost:52415 dashboard.

Installing EXO on Linux

  • Installing dependencies via apt or Homebrew on Linux.
  • Configuring uv, Node.js 18+, and the Rust nightly toolchain.
  • Building the dashboard and running EXO in CPU-only mode.
  • Directory layout: XDG Base Directory paths for config, data, cache, and logs.
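The directory-layout topic above can be illustrated with a minimal sketch of how XDG Base Directory paths are typically resolved. The defaults and the rule that empty or relative values are ignored come from the XDG specification. The `exo` subdirectory name and the helper `xdg_dirs` are assumptions made for illustration, not confirmed EXO behavior. The specification defines no dedicated log directory, so check where EXO actually writes logs against its documentation.

```python
import os
from pathlib import Path

# Assumption: the application subdirectory is called "exo".
APP_NAME = "exo"

# Environment variable and spec-defined fallback (relative to $HOME) per kind.
_XDG_DEFAULTS = {
    "config": ("XDG_CONFIG_HOME", ".config"),
    "data": ("XDG_DATA_HOME", ".local/share"),
    "cache": ("XDG_CACHE_HOME", ".cache"),
    "state": ("XDG_STATE_HOME", ".local/state"),
}


def xdg_dirs(env=None, home=None):
    """Return a dict mapping each kind of directory to its resolved path."""
    env = os.environ if env is None else env
    home = Path.home() if home is None else Path(home)
    dirs = {}
    for kind, (var, fallback) in _XDG_DEFAULTS.items():
        value = env.get(var, "")
        # The spec says unset, empty, or relative values fall back to the default.
        if value and os.path.isabs(value):
            base = Path(value)
        else:
            base = home / fallback
        dirs[kind] = base / APP_NAME
    return dirs


if __name__ == "__main__":
    for kind, path in xdg_dirs().items():
        print(f"{kind:>6}: {path}")
```

Running the script on a node prints the resolved locations, which is a quick way to confirm which overrides are in effect before looking for configuration or cached models.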

Automatic Device Discovery and Cluster Formation

  • Understanding libp2p-based auto-discovery across local networks.
  • Configuring custom namespaces using EXO_LIBP2P_NAMESPACE for cluster isolation.
  • Verifying node membership in the dashboard cluster view.
  • Handling discovery failures and network segmentation issues.
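As a rough sketch of the namespace-isolation topic above, the helper below builds a process environment with `EXO_LIBP2P_NAMESPACE` set, so that nodes started with different values should not join the same cluster. The helper name `cluster_env`, the whitespace check, and the namespace `lab-a` are hypothetical. The launch command in the trailing comment is also an assumption and should be checked against the EXO documentation.

```python
import os


def cluster_env(namespace, base_env=None):
    """Return a copy of the environment with EXO_LIBP2P_NAMESPACE set.

    The whitespace check is an illustrative sanity rule, not an EXO requirement.
    """
    if not namespace or any(ch.isspace() for ch in namespace):
        raise ValueError("namespace must be a non-empty string without whitespace")
    env = dict(os.environ if base_env is None else base_env)
    env["EXO_LIBP2P_NAMESPACE"] = namespace
    return env


if __name__ == "__main__":
    env = cluster_env("lab-a")
    print(env["EXO_LIBP2P_NAMESPACE"])
    # To launch a node with this environment (command is an assumption):
    # import subprocess
    # subprocess.run(["uv", "run", "exo"], env=env, check=True)
```

Every node meant to form one cluster would need the same namespace value. A mismatch is one plausible cause of a node not appearing in the dashboard cluster view.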

Enabling RDMA over Thunderbolt 5

  • Understanding RDMA architecture and the claimed 99 percent latency reduction.
  • Enabling RDMA in macOS Recovery mode with rdma_ctl.
  • Cable requirements and port topology constraints on Mac Studio.
  • Ensuring macOS versions match across all cluster nodes.
  • Troubleshooting RDMA discovery and DHCP configuration.
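The version-matching requirement above lends itself to a simple pre-flight check. The sketch below assumes each node's macOS version has already been collected by some means, for example with Python's `platform.mac_ver()[0]` run on each machine, and groups nodes by version so that outliers stand out. The hostnames and version strings are placeholders, and the helper names are hypothetical.

```python
from collections import defaultdict


def group_by_version(node_versions):
    """Group hostnames by reported macOS version string."""
    groups = defaultdict(list)
    for host, version in node_versions.items():
        groups[version.strip()].append(host)
    return {version: sorted(hosts) for version, hosts in groups.items()}


def versions_match(node_versions):
    """True when every node reports the same version (or no nodes are given)."""
    return len(group_by_version(node_versions)) <= 1


if __name__ == "__main__":
    # Placeholder inventory; replace with values gathered from the real nodes.
    reported = {"studio-1": "26.2", "studio-2": "26.2", "studio-3": "26.1"}
    if not versions_match(reported):
        for version, hosts in group_by_version(reported).items():
            print(f"{version}: {', '.join(hosts)}")
```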

Deploying Frontier Models

  • Using the dashboard to load and shard DeepSeek V3.1, Qwen3-235B, and Llama family models.
  • Previewing instance placements via the /instance/previews API endpoint.
  • Creating model instances using pipeline or tensor-parallel sharding.
  • Configuring custom model cards from the Hugging Face Hub.
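To make the placement-preview topic above concrete, here is a minimal sketch of calling the `/instance/previews` endpoint on the local API at port 52415. The `model_id` query parameter, the JSON response format, and the model identifier used in the example are assumptions. The helper names are hypothetical, so consult the EXO API documentation for the actual request shape.

```python
import json
import urllib.parse
import urllib.request


def previews_url(model_id, host="localhost", port=52415):
    """Build the preview URL; the model_id parameter name is an assumption."""
    query = urllib.parse.urlencode({"model_id": model_id})
    return f"http://{host}:{port}/instance/previews?{query}"


def fetch_previews(model_id, host="localhost", port=52415, timeout=10.0):
    """Fetch and decode the preview response, assuming the API returns JSON."""
    url = previews_url(model_id, host=host, port=port)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    # Placeholder model identifier; requires a running EXO node on this host.
    print(json.dumps(fetch_previews("llama-3.2-1b"), indent=2))
```

Inspecting the previews before creating an instance makes it easier to compare pipeline and tensor-parallel placements across the available nodes.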

Monitoring and Troubleshooting

  • Reading EXO logs and understanding distributed tracing.
  • Interpreting cluster health in the dashboard cluster view.
  • Diagnosing worker node failures and reconnection behavior.
  • Using EXO_TRACING_ENABLED for performance bottleneck analysis.

Cluster Maintenance and Updates

  • Updating EXO binaries and rebuilding the dashboard.
  • Migrating model caches and managing pre-downloaded models over NFS.
  • Gracefully removing nodes and rebalancing workloads.

Requirements

  • A solid understanding of networking fundamentals (IP, subnetting, firewalls).
  • Experience with command-line administration on macOS or Linux.
  • Familiarity with Python package management (pip/uv) and Node.js tooling.

Audience

  • System administrators.
  • DevOps engineers.
  • AI infrastructure architects responsible for on-premise LLM deployment.

Duration

  • 21 hours.
