Together AI runs open-source LLMs at scale with an OpenAI-compatible API; new Together({ apiKey }) initializes the client. together.chat.completions.create({ model: "meta-llama/Llama-3.3-70B-Instruct-Turbo", messages }) generates chat completions. stream: true with for await (const chunk of response) enables streaming. Together is an OpenAI API drop-in: new OpenAI({ apiKey: process.env.TOGETHER_API_KEY, baseURL: "https://api.together.xyz/v1" }) routes any OpenAI SDK call to Together. Models: meta-llama/Llama-3.3-70B-Instruct-Turbo (flagship, fast), deepseek-ai/DeepSeek-R1 (reasoning), Qwen/Qwen2.5-72B-Instruct-Turbo (multilingual), mistralai/Mixtral-8x22B-Instruct-v0.1 (long context). Vision: pass { type: "image_url", image_url: { url } } in message content with meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo. Embeddings: together.embeddings.create({ model: "BAAI/bge-large-en-v1.5", input: texts }). Function calling: a tools array in the OpenAI tool format. Claude Code generates Together AI inference, reasoning agents, and embedding pipelines.
CLAUDE.md for Together AI
## Together AI Stack
- Version: together-ai >= 0.13 OR use OpenAI SDK with baseURL
- SDK init: const together = new Together({ apiKey: process.env.TOGETHER_API_KEY! })
- OpenAI drop-in: const openai = new OpenAI({ apiKey: process.env.TOGETHER_API_KEY, baseURL: "https://api.together.xyz/v1" })
- Chat: const res = await together.chat.completions.create({ model: "meta-llama/Llama-3.3-70B-Instruct-Turbo", messages })
- Stream: const stream = await together.chat.completions.create({ ..., stream: true }); for await (const chunk of stream) text += chunk.choices[0]?.delta?.content ?? ""
- Embed: const res = await together.embeddings.create({ model: "BAAI/bge-large-en-v1.5", input: texts }); res.data[0].embedding
- Models: Llama-3.3-70B-Instruct-Turbo (fast), DeepSeek-R1 (reasoning), Qwen2.5-72B (multilingual)
## Together AI Client
// lib/together/client.ts — Together AI SDK with model catalog
import Together from "together-ai"
const together = new Together({ apiKey: process.env.TOGETHER_API_KEY! })
export const MODELS = {
// Chat/Instruction models
LLAMA_70B: "meta-llama/Llama-3.3-70B-Instruct-Turbo", // Best quality+speed
LLAMA_8B: "meta-llama/Llama-3.1-8B-Instruct-Turbo", // Fastest
DEEPSEEK_R1: "deepseek-ai/DeepSeek-R1", // Reasoning (o1-like)
QWEN_72B: "Qwen/Qwen2.5-72B-Instruct-Turbo", // Multilingual
MIXTRAL: "mistralai/Mixtral-8x22B-Instruct-v0.1", // 64k context
// Vision
LLAMA_VISION: "meta-llama/Llama-3.2-90B-Vision-Instruct-Turbo",
// Embeddings
BGE_LARGE: "BAAI/bge-large-en-v1.5", // 1024-dim, English
BGE_M3: "BAAI/bge-m3", // Multi-lingual, multi-granularity
} as const
export type ChatModel = Exclude<(typeof MODELS)[keyof typeof MODELS], "BAAI/bge-large-en-v1.5" | "BAAI/bge-m3">
export interface ChatOptions {
model?: ChatModel
systemPrompt?: string
temperature?: number
maxTokens?: number
stopSequences?: string[]
}
/** Chat completion */
export async function chat(
prompt: string,
options: ChatOptions = {},
): Promise<string> {
const { model = MODELS.LLAMA_70B, systemPrompt, temperature = 0.7, maxTokens = 1024 } = options
const messages: Array<{ role: string; content: string }> = []
if (systemPrompt) messages.push({ role: "system", content: systemPrompt })
messages.push({ role: "user", content: prompt })
const response = await together.chat.completions.create({
model,
messages: messages as any,
temperature,
max_tokens: maxTokens,
stop: options.stopSequences,
})
return response.choices[0]?.message?.content ?? ""
}
/** Streaming chat */
export async function* streamChat(
prompt: string,
options: ChatOptions = {},
): AsyncGenerator<string> {
const { model = MODELS.LLAMA_70B, systemPrompt, temperature = 0.7, maxTokens = 2048 } = options
const messages: Array<{ role: string; content: string }> = []
if (systemPrompt) messages.push({ role: "system", content: systemPrompt })
messages.push({ role: "user", content: prompt })
const stream = await together.chat.completions.create({
model,
messages: messages as any,
temperature,
max_tokens: maxTokens,
stream: true,
})
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta?.content
if (delta) yield delta
}
}
/** DeepSeek R1 reasoning with <think> tag extraction */
export async function reason(
problem: string,
options: Omit<ChatOptions, "model"> = {},
): Promise<{ thinking: string; answer: string }> {
const fullResponse = await chat(problem, {
...options,
model: MODELS.DEEPSEEK_R1,
maxTokens: 4096,
temperature: 0.6,
})
const thinkMatch = fullResponse.match(/<think>([\s\S]*?)<\/think>/)
const thinking = thinkMatch?.[1]?.trim() ?? ""
const answer = fullResponse.replace(/<think>[\s\S]*?<\/think>/g, "").trim()
return { thinking, answer }
}
/** Vision: analyze an image URL */
export async function analyzeImage(imageUrl: string, question: string): Promise<string> {
const response = await together.chat.completions.create({
model: MODELS.LLAMA_VISION,
messages: [{
role: "user",
content: [
{ type: "image_url", image_url: { url: imageUrl } },
{ type: "text", text: question },
],
}] as any,
max_tokens: 1024,
})
return response.choices[0]?.message?.content ?? ""
}
/** Generate embeddings */
export async function embedTexts(
texts: string[],
model = MODELS.BGE_LARGE,
): Promise<number[][]> {
const response = await together.embeddings.create({
model,
input: texts,
})
return response.data.map((d) => d.embedding)
}
export { together }
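embedTexts above returns raw vectors; the usual next step in an embedding pipeline is ranking documents by cosine similarity. A minimal sketch under that assumption: cosineSimilarity and rankDocs are illustrative helpers, not SDK exports, and rankDocs takes the embed function as a parameter (pass embedTexts from the client above) so the sketch stays self-contained.

```typescript
// lib/together/search.ts - semantic search over embedding vectors (illustrative)
// Plain cosine similarity between two equal-length vectors.
export function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]
    normA += a[i] * a[i]
    normB += b[i] * b[i]
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Embed the query and documents in one batch, then rank documents by
// similarity to the query. `embed` is e.g. embedTexts from the client above.
export async function rankDocs(
  embed: (texts: string[]) => Promise<number[][]>,
  query: string,
  docs: string[],
): Promise<Array<{ text: string; score: number }>> {
  const [queryVec, ...docVecs] = await embed([query, ...docs])
  return docs
    .map((text, i) => ({ text, score: cosineSimilarity(queryVec, docVecs[i]) }))
    .sort((a, b) => b.score - a.score)
}
```

Batching the query with the documents keeps it to a single embeddings API call; for large corpora, embed documents once and cache the vectors.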
## OpenAI Drop-In Usage
// lib/together/openai-compat.ts — Together as an OpenAI 1:1 replacement
import OpenAI from "openai"
// Use exactly like the OpenAI SDK — just point to Together's API
export const togetherOpenAI = new OpenAI({
apiKey: process.env.TOGETHER_API_KEY!,
baseURL: "https://api.together.xyz/v1",
})
// Example: drop-in replacement for any existing OpenAI chat code
export async function generateWithTogether(
messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[],
model = "meta-llama/Llama-3.3-70B-Instruct-Turbo",
): Promise<string> {
const response = await togetherOpenAI.chat.completions.create({
model,
messages,
temperature: 0.7,
})
return response.choices[0].message.content ?? ""
}
// Streaming with the OpenAI SDK pointing to Together
export async function* streamWithTogether(
prompt: string,
model = "meta-llama/Llama-3.1-8B-Instruct-Turbo",
): AsyncGenerator<string> {
const stream = await togetherOpenAI.chat.completions.create({
model,
messages: [{ role: "user", content: prompt }],
stream: true,
})
for await (const chunk of stream) {
const delta = chunk.choices[0]?.delta?.content
if (delta) yield delta
}
}
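The intro notes that function calling uses a tools array in the OpenAI tool format. A sketch under that assumption: the get_weather tool is a hypothetical example, and the minimal ChatClient structural type exists only to keep the sketch self-contained (in practice pass togetherOpenAI from above).

```typescript
// lib/together/tools.ts - function calling in the OpenAI tool format (sketch)
// Structural stand-in for an OpenAI-compatible client such as togetherOpenAI.
type ChatClient = {
  chat: {
    completions: {
      create(body: Record<string, unknown>): Promise<{
        choices: Array<{
          message?: {
            content?: string | null
            tool_calls?: Array<{ function: { name: string; arguments: string } }>
          }
        }>
      }>
    }
  }
}

// get_weather is a hypothetical tool; supply your own schema and handler.
export const tools = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Get the current weather for a city",
      parameters: {
        type: "object",
        properties: { city: { type: "string" } },
        required: ["city"],
      },
    },
  },
]

export async function callWithTools(client: ChatClient, prompt: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "meta-llama/Llama-3.3-70B-Instruct-Turbo",
    messages: [{ role: "user", content: prompt }],
    tools,
  })
  const call = response.choices[0]?.message?.tool_calls?.[0]
  if (!call) return response.choices[0]?.message?.content ?? ""
  // Tool arguments arrive as a JSON string per the OpenAI format.
  const args = JSON.parse(call.function.arguments) as { city?: string }
  return `tool call: ${call.function.name}(${args.city ?? ""})`
}
```

In a real loop you would execute the named tool, append a tool-result message, and call the model again; this sketch stops at extracting the call.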
When ultra-low latency (50-100ms TTFT) and maximum tokens per second on Llama models are the primary requirement, see the Groq guide: Groq's purpose-built LPU hardware is 5-10x faster than Together AI's GPU inference for latency-critical streaming applications, while Together offers a larger model catalog, including DeepSeek-R1 reasoning and vision models. When the goal is image generation models (Stable Diffusion XL, ControlNet), audio/video models, or thousands of specialized community models beyond LLM text generation, see the Replicate guide: Replicate has the broadest model variety, while Together AI specializes in fast, cost-efficient LLM inference with enterprise SLAs. The Claude Skills 360 bundle includes Together AI skill sets covering Llama inference, DeepSeek R1 reasoning, and embeddings. Start with the free tier to try open-model generation.