swift-mlx-lm

This skill documents mlx-swift-lm, a Swift package for running LLMs and VLMs on Apple Silicon via MLX: local inference, vision-language tasks, LoRA training, and embeddings. Model weights are downloaded from HuggingFace; a hub token can be supplied via HubApi(hfToken: "your_token").

mlx-swift-lm Skill

1. Overview & Triggers

mlx-swift-lm is a Swift package for running Large Language Models (LLMs) and Vision-Language Models (VLMs) on Apple Silicon using MLX. It supports local inference, fine-tuning via LoRA/DoRA, and embeddings generation.

When to Use This Skill

  • Running LLM/VLM inference on macOS/iOS with Apple Silicon
  • Streaming text generation from local models
  • Vision tasks with images/video (VLMs)
  • Tool calling / function calling with models
  • LoRA adapter training and fine-tuning
  • Text embeddings for RAG/semantic search

Architecture Overview

MLXLMCommon     - Core infrastructure (ModelContainer, ChatSession, KVCache, etc.)
MLXLLM          - Text-only LLM support (e.g. Llama, Qwen, Gemma, Phi, DeepSeek; not exhaustive)
MLXVLM          - Vision-Language Models (e.g. Qwen2-VL, PaliGemma, Gemma3; not exhaustive)
Embedders       - Embedding models (BGE, Nomic, MiniLM)

2. Key File Reference

Purpose                      File Path
Thread-safe model wrapper    Libraries/MLXLMCommon/ModelContainer.swift
Simplified chat API          Libraries/MLXLMCommon/ChatSession.swift
Generation & streaming       Libraries/MLXLMCommon/Evaluate.swift
KV cache types               Libraries/MLXLMCommon/KVCache.swift
Model configuration          Libraries/MLXLMCommon/ModelConfiguration.swift
Chat message types           Libraries/MLXLMCommon/Chat.swift
Tool call processing         Libraries/MLXLMCommon/Tool/ToolCallFormat.swift
Concurrency utilities        Libraries/MLXLMCommon/Utilities/SerialAccessContainer.swift
LLM factory & registry       Libraries/MLXLLM/LLMModelFactory.swift
VLM factory & registry       Libraries/MLXVLM/VLMModelFactory.swift
LoRA configuration           Libraries/MLXLMCommon/Adapters/LoRA/LoRAContainer.swift
LoRA training                Libraries/MLXLLM/LoraTrain.swift

3. Quick Start

LLM Chat (Simplest API)

import MLXLLM
import MLXLMCommon

// Load model (downloads from HuggingFace automatically)
let modelContainer = try await LLMModelFactory.shared.loadContainer(
    configuration: .init(id: "mlx-community/Qwen3-4B-4bit")
)

// Create chat session
let session = ChatSession(modelContainer)

// Single response
let response = try await session.respond(to: "What is Swift?")
print(response)

// Streaming response
for try await chunk in session.streamResponse(to: "Explain concurrency") {
    print(chunk, terminator: "")
}
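
For gated or private model repos, a HuggingFace access token can be supplied when loading, as mentioned above with HubApi(hfToken:). A minimal sketch; the hub: parameter name on loadContainer is an assumption to verify against LLMModelFactory.swift:

import Hub   // HubApi comes from swift-transformers

// Sketch: authenticate HuggingFace downloads for gated/private model repos.
// The hub: parameter name is an assumption; check LLMModelFactory.loadContainer's signature.
let hub = HubApi(hfToken: "your_token")
let gatedContainer = try await LLMModelFactory.shared.loadContainer(
    hub: hub,
    configuration: .init(id: "mlx-community/Qwen3-4B-4bit")
)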

VLM with Image

import MLXVLM
import MLXLMCommon

let modelContainer = try await VLMModelFactory.shared.loadContainer(
    configuration: .init(id: "mlx-community/Qwen2-VL-2B-Instruct-4bit")
)

let session = ChatSession(modelContainer)

// With image (video is also an optional parameter)
let image = UserInput.Image.url(imageURL)
let response = try await session.respond(
    to: "Describe this image",
    image: image,
    video: nil  // Optional video parameter
)

Embeddings

import Embedders
import MLX  // MLXArray and eval()

// Note: Embedders uses loadModelContainer() helper (not a factory pattern)
let container = try await loadModelContainer(
    configuration: ModelConfiguration(id: "mlx-community/bge-small-en-v1.5-mlx")
)

let embeddings = await container.perform { model, tokenizer, pooler in
    let tokens = tokenizer.encode(text: "Hello world")
    let input = MLXArray(tokens).expandedDimensions(axis: 0)
    let output = model(input)
    let pooled = pooler(output, normalize: true)
    eval(pooled)
    return pooled
}
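
Because the pooler is called with normalize: true, the returned vectors are unit-length, so a RAG/semantic-search relevance score is just the dot product of two embeddings. A small sketch; queryEmbedding and docEmbedding are hypothetical results of perform blocks like the one above:

// Cosine similarity of two already-normalized embeddings reduces to a dot product.
// queryEmbedding / docEmbedding: [1, dim] MLXArrays produced as shown above (hypothetical names).
let similarity = (queryEmbedding * docEmbedding).sum()
eval(similarity)
print("similarity:", similarity.item(Float.self))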

4. Primary Workflow: LLM Inference

ChatSession API (Recommended)

ChatSession manages conversation history and KV cache automatically:

let session = ChatSession(
    modelContainer,
    instructions: "You are a helpful assistant",  // System prompt
    generateParameters: GenerateParameters(
        maxTokens: 500,
        temperature: 0.7
    )
)

// Multi-turn conversation (history preserved automatically)
let r1 = try await session.respond(to: "What is 2+2?")
let r2 = try await session.respond(to: "And if you multiply that by 3?")

// Clear session to start fresh
await session.clear()

Streaming with generate()

For lower-level control, use generate() directly:

let input = try await modelContainer.prepare(input: UserInput(prompt: .text("Hello")))
let stream = try await modelContainer.generate(input: input, parameters: GenerateParameters())

for await generation in stream {
    switch generation {
    case .chunk(let text):
        print(text, terminator: "")
    case .info(let info):
        print("\n\(info.tokensPerSecond) tok/s")
    case .toolCall(let call):
        // Handle tool call
        break
    }
}

Tool Calling

// 1. Define tool
struct WeatherInput: Codable { let location: String }
struct WeatherOutput: Codable { let temperature: Double; let conditions: String }

let weatherTool = Tool<WeatherInput, WeatherOutput>(
    name: "get_weather",
    description: "Get current weather",
    parameters: [.required("location", type: .string, description: "City name")]
) { input in
    WeatherOutput(temperature: 22.0, conditions: "Sunny")
}

// 2. Include tool schema in request
let input = UserInput(
    prompt: .text("What's the weather in Tokyo?"),
    tools: [weatherTool.schema]
)

// 3. Handle tool calls in generation stream
for await generation in try await modelContainer.generate(input: input, parameters: params) {
    switch generation {
    case .chunk(let text): print(text)
    case .toolCall(let call):
        let result = try await call.execute(with: weatherTool)
        print("Weather: \(result.conditions)")
    case .info: break
    }
}

See references/tool-calling.md for multi-turn and feeding results back.
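As a rough sketch of the feedback step, the tool output can be serialized and sent in a follow-up request; the plain-text framing below is a simplification, and references/tool-calling.md documents the canonical tool-result message format:

// Sketch: serialize the tool result and issue a follow-up request so the model can
// answer the original question. Simplified framing; see references/tool-calling.md.
let weather = WeatherOutput(temperature: 22.0, conditions: "Sunny")
let resultJSON = String(data: try JSONEncoder().encode(weather), encoding: .utf8) ?? "{}"

let followUp = UserInput(prompt: .text(
    "Tool get_weather returned: \(resultJSON). Use it to answer: What's the weather in Tokyo?"
))
for await generation in try await modelContainer.generate(input: followUp, parameters: params) {
    if case .chunk(let text) = generation { print(text, terminator: "") }
}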

GenerateParameters

let params = GenerateParameters(
    maxTokens: 1000,           // nil = unlimited
    maxKVSize: 4096,           // Sliding window (uses RotatingKVCache)
    kvBits: 4,                 // Quantized cache (4 or 8 bit)
    temperature: 0.7,          // 0 = greedy/argmax
    topP: 0.9,                 // Nucleus sampling
    repetitionPenalty: 1.1,    // Penalize repeats
    repetitionContextSize: 20  // Window for penalty
)

Prompt Caching / History Re-hydration

Restore chat from persisted history:

let history: [Chat.Message] = [
    .system("You are helpful"),
    .user("Hello"),
    .assistant("Hi there!")
]

let session = ChatSession(
    modelContainer,
    history: history
)
// Continues from this point

5. Secondary Workflow: VLM Inference

Image Input Types

// From URL (file or remote)
let image = UserInput.Image.url(fileURL)

// From CIImage
let image = UserInput.Image.ciImage(ciImage)

// From MLXArray directly
let image = UserInput.Image.array(mlxArray)

Video Input

// From URL (file or remote)
let video = UserInput.Video.url(videoURL)

// From AVFoundation asset
let video = UserInput.Video.avAsset(avAsset)

// From pre-extracted frames
let video = UserInput.Video.frames(videoFrames)

let response = try await session.respond(
    to: "What happens in this video?",
    video: video
)

Multiple Images

let images: [UserInput.Image] = [
    .url(url1),
    .url(url2)
]

let response = try await session.respond(
    to: "Compare these two images",
    images: images,
    videos: []
)

VLM-Specific Processing

let session = ChatSession(
    modelContainer,
    processing: UserInput.Processing(
        resize: CGSize(width: 512, height: 512)  // Resize images
    )
)

6. Best Practices

DO

// DO: Use ChatSession for multi-turn conversations
let session = ChatSession(modelContainer)

// DO: Use AsyncStream APIs (modern, Swift concurrency)
for try await chunk in session.streamResponse(to: prompt) { ... }

// DO: Check Task.isCancelled in long-running loops
for try await generation in stream {
    if Task.isCancelled { break }
    // process generation
}

// DO: Use ModelContainer.perform() for thread-safe access
try await modelContainer.perform { context in
    // Access model, tokenizer safely
    let tokens = try context.tokenizer.applyChatTemplate(messages: messages)
    return tokens
}

// DO: When breaking early from generation, use generateTask() to get a task handle
// This is the lower-level API used internally by ChatSession
let (stream, task) = generateTask(...)  // Returns (AsyncStream, Task)

for await item in stream {
    if shouldStop { break }
}
await task.value  // Ensures KV cache cleanup before next generation

generateTask() is defined in Evaluate.swift. Most users should use ChatSession, which handles this internally.

DON'T

// DON'T: Share MLXArray across tasks (not Sendable)
let array = MLXArray(...)
Task { array.sum() }  // Wrong!

// DON'T: Use deprecated callback-based generation
// Old:
generate(input: input, parameters: params) { tokens in ... }  // Deprecated
// New:
for await generation in try generate(input: input, parameters: params, context: context) { ... }

// DON'T: Use old perform(model, tokenizer) signature
// Old:
modelContainer.perform { model, tokenizer in ... }  // Deprecated
// New:
modelContainer.perform { context in ... }

// DON'T: Forget to eval() MLXArrays before returning from perform()
await modelContainer.perform { context in
    let result = context.model(input)
    eval(result)  // Required before returning
    return result.item(Float.self)
}

Thread Safety

  • ModelContainer is Sendable and thread-safe
  • ChatSession is NOT thread-safe (use from single task)
  • MLXArray is NOT Sendable - don't pass across isolation boundaries
  • Use SendableBox for transferring non-Sendable data in consuming contexts
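
In practice this means doing all MLXArray work inside perform and returning only Sendable values (scalars, [Float], String) across the boundary, as in this sketch:

// Safe pattern: MLXArray math stays inside perform; only a Sendable [Float] escapes.
// `input` is a hypothetical MLXArray prepared inside the container, as in the eval() example above.
let values: [Float] = await modelContainer.perform { context in
    let output = context.model(input)
    eval(output)                       // materialize before leaving the closure
    return output.asArray(Float.self)  // plain Swift array is Sendable
}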

Memory Management

// For long contexts, use sliding window cache
let params = GenerateParameters(maxKVSize: 4096)

// For memory efficiency, use quantized cache
let params = GenerateParameters(kvBits: 4)  // or 8

// Clear session cache when done
await session.clear()

7. Reference Links

For detailed documentation on specific topics, see:

Reference                         When to Use
references/model-container.md     Loading models, ModelContainer API, ModelConfiguration
references/kv-cache.md            Cache types, memory optimization, cache serialization
references/concurrency.md         Thread safety, SerialAccessContainer, async patterns
references/tool-calling.md        Function calling, tool formats, ToolCallProcessor
references/tokenizer-chat.md      Tokenizer, Chat.Message, EOS tokens
references/supported-models.md    Model families, registries, model-specific config
references/lora-adapters.md       LoRA/DoRA/QLoRA, loading adapters
references/training.md            LoRATrain API, fine-tuning
references/embeddings.md          EmbeddingModel, pooling, use cases

8. Deprecated Patterns Summary

Most common migrations (see individual reference files for topic-specific deprecations):

If you see...                              Use instead...
generate(... didGenerate:) callback        generate(...) -> AsyncStream
perform { model, tokenizer in }            perform { context in }
TokenIterator(prompt: MLXArray)            TokenIterator(input: LMInput)
ModelRegistry typealias                    LLMRegistry or VLMRegistry
createAttentionMask(h:cache:[KVCache]?)    createAttentionMask(h:cache:KVCache?)

Each reference file contains a "Deprecated Patterns" section with topic-specific migrations.

9. Automatic vs Manual Configuration

Automatic Behaviors (NO developer action needed)

The framework handles these automatically:

Feature                       Details
EOS token loading             Loaded from config.json
EOS token override            Priority: generation_config.json > config.json > defaults
EOS token merging             All sources merged at generation time
EOS token detection           Stops generation automatically when EOS is encountered
Chat template application     Applied automatically via applyChatTemplate()
Tool call format detection    Inferred from model_type in config.json
Cache type selection          Based on GenerateParameters (maxKVSize, kvBits)
Tokenizer loading             Loaded from tokenizer.json automatically
Model weights loading         Downloaded and loaded from HuggingFace

Optional Configuration (Developer MAY configure)

Feature           When to Configure
extraEOSTokens    Only if model has unlisted stop tokens
toolCallFormat    Only to override auto-detection
maxKVSize         To enable sliding window cache
kvBits            To enable quantized cache (4 or 8 bit)
maxTokens         To limit output length
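
A short sketch combining several of these options; the extraEOSTokens parameter on ModelConfiguration is an assumption to verify against ModelConfiguration.swift:

// Sketch: opt into the sliding-window + quantized KV cache, cap output length,
// and add a stop token the model's config.json does not list.
// extraEOSTokens on ModelConfiguration is an assumption; check ModelConfiguration.swift.
let configuration = ModelConfiguration(
    id: "mlx-community/Qwen3-4B-4bit",
    extraEOSTokens: ["<|im_end|>"]
)
let container = try await LLMModelFactory.shared.loadContainer(configuration: configuration)

let params = GenerateParameters(
    maxTokens: 512,     // limit output length
    maxKVSize: 4096,    // sliding window cache
    kvBits: 4           // quantized KV cache
)
let session = ChatSession(container, generateParameters: params)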