cloudflare-workers-ai

⚠Review·Scanned 2/18/2026

This skill provides comprehensive guidance and templates for running Workers AI models on Cloudflare Workers (streaming, RAG, images, AI Gateway). It instructs running shell commands like rm -rf ~/.wrangler, using secrets such as env.CLOUDFLARE_API_KEY, and making network calls to https://gateway.ai.cloudflare.com/v1/${accountId}/${gatewayId}/workers-ai/@cf/meta/llama-3.1-8b-instruct.

by jezweb·v10a1f16·107.1 KB·343 installs

Scanned from main at 10a1f16 · Transparency log ↗

$ vett add jezweb/claude-skills/cloudflare-workers-aiReview findings below

Cloudflare Workers AI

Complete knowledge domain for Cloudflare Workers AI - Run AI models on serverless GPUs across Cloudflare's global network.

Auto-Trigger Keywords

Primary Keywords

workers ai
cloudflare ai
ai bindings
llm workers
@cf/meta/llama
workers ai models
ai inference
cloudflare llm

Secondary Keywords

ai streaming
text generation ai
ai embeddings
image generation ai
workers ai rag
ai gateway
llama workers
flux image generation
stable diffusion workers
vision models ai
ai chat completion

Error-Based Keywords

AI_ERROR
rate limit ai
model not found
token limit exceeded
neurons exceeded
ai quota exceeded
streaming failed
model unavailable

Framework Integration Keywords

workers ai hono
ai gateway workers
vercel ai sdk workers
openai compatible workers
workers ai vectorize
ai bindings wrangler
workers ai streaming

What This Skill Does

This skill provides complete Workers AI knowledge including:

✅ Text Generation (LLMs) - Llama 3.1/3.2, Qwen, Mistral, DeepSeek (50+ models)
✅ Streaming Responses - Real-time text generation without buffering
✅ Text Embeddings - BGE models for RAG and semantic search
✅ Image Generation - Flux, Stable Diffusion XL for text-to-image
✅ Vision Models - Llama 3.2 Vision for image understanding
✅ AI Gateway Integration - Caching, logging, and cost tracking
✅ OpenAI Compatibility - Drop-in replacement for OpenAI API
✅ RAG Patterns - Complete Retrieval Augmented Generation examples
✅ Function Calling - Embedded function calling with tools
✅ Vercel AI SDK - Integration with workers-ai-provider

Known Issues Prevented

Issue	Description	Prevention
Rate limit errors (429)	Text generation exceeds 300 req/min	Implement exponential backoff retry
Response buffering	Large responses timeout without streaming	Always use `stream: true` for text generation
Token limit exceeded	Input exceeds model context window	Validate prompt length before inference
Model not found	Invalid model ID	Use correct @cf/ or @hf/ prefixes from catalog
Missing cost tracking	Production costs unmonitored	Always use AI Gateway for logging
Free tier exhaustion	Exceeds 10,000 neurons/day	Plan for Workers Paid ($0.011/1000 neurons)

Quick Example

import { Hono } from 'hono';

type Bindings = {
  AI: Ai;
};

const app = new Hono<{ Bindings: Bindings }>();

// Text generation with streaming
app.post('/chat', async (c) => {
  const { prompt } = await c.req.json<{ prompt: string }>();

  const stream = await c.env.AI.run(
    '@cf/meta/llama-3.1-8b-instruct',
    {
      messages: [{ role: 'user', content: prompt }],
      stream: true,
    }
  );

  return new Response(stream, {
    headers: { 'content-type': 'text/event-stream' },
  });
});

export default app;

wrangler.jsonc:

{
  "ai": {
    "binding": "AI"
  }
}

Token Efficiency

Without this skill:

15-20 documentation lookups for model selection
10+ searches for rate limits and pricing
8-12 examples for different task types
~8,000 tokens of repetitive context

With this skill:

1 skill invocation with complete knowledge
~3,200 tokens of focused, battle-tested patterns
~60% token savings

Coverage

Models Catalog

Text Generation: 20+ LLMs (Llama, Qwen, Mistral, DeepSeek, Gemma)
Text Embeddings: BGE-base, BGE-large, BGE-small
Image Generation: Flux, Stable Diffusion XL, DreamShaper
Vision Models: Llama 3.2 11B Vision Instruct
Translation: M2M100, Opus MT
Classification: DistilBERT, RoBERTa
Speech Recognition: Whisper models

Patterns

Chat completions with message history
RAG (Retrieval Augmented Generation)
Structured output with Zod schemas
Function calling with tools
Image generation and storage
Vision model image understanding
AI Gateway caching and logging

Production Topics

Rate limits by task type
Neurons-based pricing
Streaming best practices
Error handling strategies
OpenAI API compatibility
Vercel AI SDK integration
Cost optimization

When to Use This Skill

Use this skill when you see keywords like:

"integrate Workers AI"
"add LLM to my Worker"
"generate embeddings for RAG"
"create AI image generator"
"stream AI responses"
"setup AI Gateway"
"OpenAI compatible API"
"model rate limit"
"Workers AI pricing"