1. Insight
The problem this article addresses and why it matters.
Vision APIs each want a different image shape
In 2025 the three frontier multimodal APIs converged on base64-encoded image payloads, then diverged on every detail. Anthropic's Messages API expects a content block with type: "image", source.type: "base64", source.media_type: "image/png" (or jpeg/gif/webp), and source.data as raw base64. OpenAI's GPT-4V chat completions want type: "image_url" with image_url.url as a data URL — base64 prefixed with data:image/png;base64,. Google's Gemini API expects inline_data.mime_type and inline_data.data as separate fields with no data URL wrapper.
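To make the divergence concrete, here is a minimal sketch that wraps the same base64 string in each of the three block shapes described above. The helper name toProviderBlock and the provider keys 'claude', 'gpt4v', and 'gemini' are illustrative assumptions, not part of any SDK.

```javascript
// Hypothetical helper: wrap one base64 payload in each provider's block shape.
// The field names follow the shapes described in the paragraph above.
function toProviderBlock(provider, mediaType, base64) {
  switch (provider) {
    case 'claude':
      // Anthropic: raw base64 plus a separate media_type field.
      return { type: 'image', source: { type: 'base64', media_type: mediaType, data: base64 } };
    case 'gpt4v':
      // OpenAI: a single data URL string.
      return { type: 'image_url', image_url: { url: `data:${mediaType};base64,${base64}` } };
    case 'gemini':
      // Google: separate mime_type and data fields, no data URL wrapper.
      return { inline_data: { mime_type: mediaType, data: base64 } };
    default:
      throw new Error(`Unknown provider: ${provider}`);
  }
}

// Placeholder bytes, not a real image; only the block shape matters here.
const samplePayload = Buffer.from('placeholder bytes').toString('base64');
console.log(JSON.stringify(toProviderBlock('gemini', 'image/png', samplePayload)));
```

The payload is identical in all three cases; only the envelope changes, which is why the formatting step is a good candidate for a single shared function.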
That's three near-identical payloads with three different field names. A developer building a multi-provider vision pipeline ends up writing the same encode-and-format logic three times, once per provider. Each implementation drifts. Each has its own subtle bug — the one where Claude rejects the payload because the media type was inferred from extension rather than content, the one where GPT-4V rejects images over 20MB without telling you why, the one where Gemini's mime_type is case-sensitive in ways the others aren't.
The encoding is the easy part
Every developer can call Buffer.from(file).toString('base64'). The annoyance is the surrounding work: stripping EXIF metadata before sending to a provider (some images contain GPS coordinates), resizing to each provider's preferred dimensions to reduce token cost (Claude charges by image area in pixels, not by file size), converting AVIF, HEIF, or any other format the target provider rejects into PNG, and formatting the result into the provider-specific message block. None of those steps is hard. All of them are tedious. All of them have edge cases the team will discover in production.
What this article delivers
The image_to_base64 tool replaces the five-step manual pipeline (convert format → resize → strip metadata → encode → format for provider) with one call. We'll walk through the standard single-image encoding, the multi-image batch mode that returns provider-ready message blocks for an entire vision conversation, the format conversion that handles WebP / AVIF / HEIF input, and the resize logic that auto-picks the optimal dimension for each provider's pricing tier.
2. Intent
What you will be able to do after reading.
By the end of this article you will be able to:
- Convert any image (PNG, JPEG, WebP, AVIF, HEIF, BMP, TIFF) into a base64 payload formatted for Claude, GPT-4V, or Gemini in a single call
- Apply the auto-resize heuristic that picks the optimal pixel count per provider to minimise vision-API token cost
- Strip EXIF metadata in one parameter, including GPS coordinates and camera details that would otherwise leak
- Batch-encode multiple images for a multi-image vision request, all formatted for the same provider in one response
- Choose between the four output formats (data-uri, raw, css-url, llm-vision) based on where the encoded payload is going next
The Examples section walks through the single-image happy path, the multi-image batch, and the WebP-to-PNG conversion path.
3. Examples
Annotated code and worked scenarios.
Before / after: encoding for Claude
The manual approach to sending an image to Anthropic's Messages API:
Before:
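import fs from 'node:fs';
import sharp from 'sharp';

// 1. Read the source file.
const raw = fs.readFileSync('./screenshot.png');
// 2. Resize so the longest edge is at most 1568px (never upscale).
const resized = await sharp(raw)
  .resize(1568, 1568, { fit: 'inside', withoutEnlargement: true })
  .toBuffer();
// 3. Intended to strip EXIF; see the caveat below.
const stripped = await sharp(resized).withMetadata({ exif: {} }).toBuffer();
// 4. Encode.
const base64 = stripped.toString('base64');
// 5. Format for Anthropic's Messages API.
const block = {
  type: 'image',
  source: {
    type: 'base64',
    media_type: 'image/png',
    data: base64,
  },
};
Five steps. Each has its own version drift surface: does this codebase use sharp 0.32 or 0.34? Does withMetadata({ exif: {} }) actually strip EXIF, or does it just empty the EXIF dict but leave XMP intact? (sharp drops metadata from its output by default, and withMetadata() opts back in to keeping it, so this call is more likely to preserve metadata than to remove it. Releases from 0.33 add finer-grained methods such as keepExif() and keepMetadata(), so check the documentation for the version you run.) The Anthropic SDK expects the block shape exactly: get the media_type casing wrong and the API returns a 400 with a generic error message.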
After:
imageToBase64({
dataUrl: `data:image/png;base64,${raw.toString('base64')}`,
format: 'llm-vision',
llmTarget: 'claude',
optimize: true,
});
// llmBlock: {
// type: 'image',
// source: {
// type: 'base64',
// media_type: 'image/png',
// data: '<base64 data, auto-resized to 1568px longest edge, EXIF stripped>'
// }
// }
// dimensions: { width: 1568, height: 1024 }
// bytes: 284_516
// savings: 127_303 // bytes saved by resize + EXIF strip vs original
The block drops into your messages.create({ ... }) call directly. The resize dimension (1568px on the longest side for Claude) reflects Anthropic's vision guidance: images with a longer edge are scaled down on Anthropic's side before the model sees them, so sending more pixels adds upload size and latency without adding detail. The tool defaults to per-provider optima so you don't have to remember the exact pixel counts.
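The longest-edge resize is plain proportional scaling. The sketch below shows the arithmetic only (no pixels are touched); fitLongestEdge is a hypothetical name, and rounding to the nearest pixel is an assumption about how a resizer would behave.

```javascript
// Hypothetical helper: compute target dimensions so the longest edge fits maxEdge.
// Images already within the limit are returned unchanged (no upscaling).
function fitLongestEdge(width, height, maxEdge) {
  const longest = Math.max(width, height);
  if (longest <= maxEdge) {
    return { width, height, resized: false };
  }
  const scale = maxEdge / longest;
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
    resized: true,
  };
}

console.log(fitLongestEdge(3136, 2048, 1568)); // { width: 1568, height: 1024, resized: true }
console.log(fitLongestEdge(800, 600, 1568));   // { width: 800, height: 600, resized: false }
```

Scaling by the longest edge rather than by width alone matters for portrait images, where a width-only resize would leave the height above the limit.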
Before / after: multi-image vision request
A multi-image vision call to GPT-4V comparing two product photos:
Before: loop over each file, encode each one, format each as the OpenAI block, push into the content array. ~40 lines of boilerplate.
After:
imageToBase64({
batch: [
`data:image/jpeg;base64,${product1.toString('base64')}`,
`data:image/jpeg;base64,${product2.toString('base64')}`,
],
format: 'llm-vision',
llmTarget: 'gpt4v',
optimize: true,
});
// batchResults: [
// {
// encoded: '...',
// bytes: 198_443,
// mimeType: 'image/jpeg',
// dimensions: { width: 2048, height: 1365 },
// llmBlock: { type: 'image_url', image_url: { url: 'data:image/jpeg;base64,...' } },
// },
// {
// encoded: '...',
// bytes: 211_092,
// mimeType: 'image/jpeg',
// dimensions: { width: 2048, height: 1365 },
// llmBlock: { type: 'image_url', image_url: { url: 'data:image/jpeg;base64,...' } },
// },
// ]
The agent pushes batchResults[i].llmBlock into the content array of the OpenAI chat request. No looping, no per-image work, no provider-specific formatting in the agent's code.
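As a sketch of that last step, the blocks can be spread into a chat message alongside a text prompt. The response object below is a hand-written stand-in shaped like the batchResults shown above (with placeholder data), and buildUserMessage is a hypothetical helper.

```javascript
// Hypothetical stand-in for a batch response; the base64 data is a placeholder.
const batchResponse = {
  batchResults: [
    { llmBlock: { type: 'image_url', image_url: { url: 'data:image/jpeg;base64,AAAA' } } },
    { llmBlock: { type: 'image_url', image_url: { url: 'data:image/jpeg;base64,BBBB' } } },
  ],
};

// Build one user message: the text prompt first, then every image block in order.
function buildUserMessage(prompt, batchResults) {
  return {
    role: 'user',
    content: [
      { type: 'text', text: prompt },
      ...batchResults.map((result) => result.llmBlock),
    ],
  };
}

const userMessage = buildUserMessage('Compare these two product photos.', batchResponse.batchResults);
console.log(userMessage.content.map((part) => part.type));
```

The resulting object is what would go into the messages array of the chat request; the model name and client call are left out because they depend on your SDK version.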
Before / after: WebP to PNG conversion for vision
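PNG and JPEG are the only formats every vision API can be counted on to accept. WebP support varies by provider (Anthropic lists it among its media types, as noted earlier), and AVIF and HEIF support is narrower still. If your image source is a modern format (browser screenshots are PNG by default, but mobile camera output is HEIF on iOS and WebP on some Android variants), convert before encoding unless you have confirmed the target accepts it.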
Before: detect format via magic bytes, branch on the format, call sharp().toFormat('png').toBuffer(), handle the case where sharp doesn't have the HEIF codec compiled in (HEIC support not available — install libheif first).
After:
imageToBase64({
dataUrl: `data:image/heif;base64,${heifFile.toString('base64')}`,
format: 'llm-vision',
llmTarget: 'claude',
convertTo: 'png',
optimize: true,
});
// converted: true
// llmBlock: { type: 'image', source: { type: 'base64', media_type: 'image/png', data: '...' } }
The full pipeline collapses to one parameter. converted: true in the response confirms the conversion happened, which is useful for logging when the input format was unexpected.
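For reference, the "detect format via magic bytes" step in the manual path looks roughly like this for HEIF and AVIF. Both are ISO base media files, so the check reads the ftyp box near the start of the file. This is a simplified sketch: sniffIsoBmffImage is a hypothetical name, only the major brand is inspected, and the brand lists are a common subset rather than an exhaustive registry.

```javascript
// Common major brands; real files may also declare these only as compatible brands.
const HEIF_BRANDS = new Set(['heic', 'heix', 'hevc', 'hevx', 'mif1', 'msf1']);
const AVIF_BRANDS = new Set(['avif', 'avis']);

// Hypothetical helper: return a MIME type if the bytes look like HEIF or AVIF, else null.
function sniffIsoBmffImage(bytes) {
  // Bytes 0-3 hold the box size, bytes 4-7 the box type, bytes 8-11 the major brand.
  if (bytes.length < 12 || bytes.toString('latin1', 4, 8) !== 'ftyp') {
    return null;
  }
  const majorBrand = bytes.toString('latin1', 8, 12);
  if (AVIF_BRANDS.has(majorBrand)) return 'image/avif';
  if (HEIF_BRANDS.has(majorBrand)) return 'image/heif';
  return null;
}

// Synthetic header for illustration: 4 size bytes, then 'ftyp', then the 'heic' brand.
const fakeHeicHeader = Buffer.concat([
  Buffer.from([0x00, 0x00, 0x00, 0x18]),
  Buffer.from('ftypheic', 'latin1'),
  Buffer.alloc(4),
]);
console.log(sniffIsoBmffImage(fakeHeicHeader)); // image/heif
```

A production detector would also walk the compatible-brands list, since some AVIF files use mif1 as the major brand.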
When humans use this
A designer iterating on a Claude-powered design feedback tool drops images into a browser input. The web UI calls image_to_base64 and renders the formatted block in a preview box so the designer can verify the resize + format before sending to the API. A developer integrating multimodal LLMs for the first time uses the four-format output (data-uri, raw, css-url, llm-vision) to figure out which shape they actually need — the css-url output is the helper most developers don't know they wanted, for embedding small images as CSS background-image: url(...) rules without a separate image file.
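The three non-LLM output formats differ only in how the same base64 string is wrapped. The sketch below is an assumption about what each shape looks like, based on the format names; formatEncoded is a hypothetical helper, and the tool's exact css-url output may differ in quoting.

```javascript
// Hypothetical helper: wrap a base64 payload in one of the simple output shapes.
function formatEncoded(format, mimeType, base64) {
  const dataUri = `data:${mimeType};base64,${base64}`;
  switch (format) {
    case 'raw':
      return base64; // bare base64, e.g. for Claude's source.data
    case 'data-uri':
      return dataUri; // e.g. for an <img src> or OpenAI's image_url.url
    case 'css-url':
      return `url("${dataUri}")`; // e.g. for a background-image declaration
    default:
      throw new Error(`Unsupported format: ${format}`);
  }
}

// Placeholder bytes, not a real image.
const iconPayload = Buffer.from('placeholder icon bytes').toString('base64');
console.log(`.icon { background-image: ${formatEncoded('css-url', 'image/png', iconPayload)}; }`);
```

Base64 output never contains quote characters, so wrapping the data URI in double quotes inside url(...) is safe without further escaping.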
When agents use this
This is the highest-leverage tool in the grid for any agent that calls vision APIs:
- Multi-provider routing. An agent that load-balances vision requests across Claude, GPT-4V, and Gemini calls image_to_base64 once per provider with the appropriate llmTarget. The agent never has to know the provider's block format: the tool encapsulates that knowledge and is updated when providers change their APIs.
- Pre-processing for downstream tools. An agent extracting tabular data from a screenshot needs to (a) crop, (b) resize, (c) encode, (d) format for Claude. With image_to_base64, steps (b), (c), and (d) collapse to one call. The cropping happens upstream with image manipulation primitives the agent already has.
- Bulk vision annotation. An agent processing a folder of 200 product photos for Claude vision uses batch mode in batches of 20 to stay within per-request image limits and keep the response size manageable.
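The bulk-annotation case comes down to chunking the file list before each batch call. A minimal sketch, assuming a batch size of 20 as in the example above; chunk is a hypothetical helper and the file names are made up.

```javascript
// Hypothetical helper: split an array into consecutive groups of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const groups = [];
  for (let start = 0; start < items.length; start += size) {
    groups.push(items.slice(start, start + size));
  }
  return groups;
}

// 200 made-up file names, grouped into batches of 20.
const photoPaths = Array.from({ length: 200 }, (_, index) => `photo-${index}.jpg`);
const photoBatches = chunk(photoPaths, 20);
console.log(photoBatches.length); // 10
```

Each group would then be read, turned into data URLs, and passed as the batch parameter of one call.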
Edge cases
Animated GIFs and multi-frame TIFFs
The tool encodes the first frame only. Vision APIs accept static images; multi-frame formats are converted to single-frame PNG before encoding. The transformations array surfaces this with extracted_first_frame: true so the caller can see the lossy step.
Images exceeding provider size limits
Anthropic accepts images up to 5MB encoded. GPT-4V up to 20MB. Gemini up to 7MB (when inline; larger via file upload). The tool's optimize: true flag aims to keep output under 4.5MB for Anthropic by default. If the input is larger than the limit and the resize doesn't bring it under, the tool returns PAYLOAD_LIMIT with a clear message pointing at the per-provider ceiling.
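A pre-flight check for these ceilings is simple arithmetic, because base64 output is always 4 bytes for every 3 input bytes, rounded up to a whole group. The sketch below uses the limits quoted above and assumes they are binary megabytes applied to the encoded size for every provider; checkPayload and the provider keys are hypothetical.

```javascript
const MB = 1024 * 1024; // Assumption: the quoted limits are binary megabytes.
// Limits as quoted in the text above; treating all three as encoded-size limits is an assumption.
const ENCODED_LIMIT_BYTES = { claude: 5 * MB, gpt4v: 20 * MB, gemini: 7 * MB };

// Length of the padded base64 encoding of `rawByteLength` bytes.
function base64Length(rawByteLength) {
  return 4 * Math.ceil(rawByteLength / 3);
}

// Hypothetical helper: report whether an image of the given raw size fits the provider limit.
function checkPayload(llmTarget, rawByteLength) {
  const limit = ENCODED_LIMIT_BYTES[llmTarget];
  if (limit === undefined) {
    throw new Error(`Unknown llmTarget: ${llmTarget}`);
  }
  const encodedBytes = base64Length(rawByteLength);
  return encodedBytes <= limit
    ? { ok: true, encodedBytes }
    : { ok: false, code: 'PAYLOAD_LIMIT', encodedBytes, limit };
}

// A 4 MB file grows by about a third when encoded, which pushes it past a 5 MB ceiling.
console.log(checkPayload('claude', 4 * MB));
```

This is why a target of 4.5MB of encoded output corresponds to roughly 3.4MB of raw image bytes.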
Data URLs with incorrect MIME type declarations
If the input is data:image/png;base64,... but the actual decoded content is JPEG (some apps lie about MIME types), the tool reads the magic bytes and trusts those over the declared MIME type. The output mimeType reflects the true format.
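A minimal sketch of that check, covering the four formats with fixed signatures at the start of the file. The function names are hypothetical, and this is an illustration of the technique rather than the tool's actual detection code.

```javascript
// Hypothetical helper: identify PNG, JPEG, GIF, or WebP from the leading bytes.
function sniffMimeType(bytes) {
  const startsWith = (signature) =>
    bytes.length >= signature.length && signature.every((byte, i) => bytes[i] === byte);

  if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return 'image/png';
  if (startsWith([0xff, 0xd8, 0xff])) return 'image/jpeg';

  const head = bytes.toString('latin1', 0, 12);
  if (head.startsWith('GIF87a') || head.startsWith('GIF89a')) return 'image/gif';
  // WebP is a RIFF container: 'RIFF', 4 size bytes, then 'WEBP'.
  if (head.startsWith('RIFF') && head.slice(8, 12) === 'WEBP') return 'image/webp';
  return null;
}

// Hypothetical helper: compare the declared MIME type with the sniffed one.
function resolveMimeType(dataUrl) {
  const match = /^data:([^;,]+);base64,(.*)$/s.exec(dataUrl);
  if (!match) {
    throw new Error('Not a base64 data URL');
  }
  const [, declared, payload] = match;
  const sniffed = sniffMimeType(Buffer.from(payload, 'base64'));
  // Trust the bytes when they are recognised; otherwise fall back to the declaration.
  return { declared, actual: sniffed ?? declared };
}

// JPEG start-of-image bytes, deliberately mislabelled as PNG.
const mislabelled = `data:image/png;base64,${Buffer.from([0xff, 0xd8, 0xff, 0xe0]).toString('base64')}`;
console.log(resolveMimeType(mislabelled)); // { declared: 'image/png', actual: 'image/jpeg' }
```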
Privacy-sensitive metadata in optimize: false mode
If you pass optimize: false, EXIF data is preserved including GPS coordinates, camera model, and software version. This is intentional — sometimes you want the metadata (forensic analysis, photo organisation). For any user-uploaded content destined for an LLM API, default to optimize: true to strip metadata.
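If you want to verify that a JPEG really has no EXIF before it leaves your system, you can scan its header segments for an APP1 block that starts with the Exif identifier. The sketch below is a simplified check under stated assumptions: hasExifSegment is a hypothetical name, it handles JPEG only, and it stops at the start of image data rather than handling every unusual marker layout.

```javascript
// Hypothetical helper: return true if a JPEG buffer contains an EXIF APP1 segment.
function hasExifSegment(jpeg) {
  // A JPEG starts with the SOI marker FF D8.
  if (jpeg.length < 4 || jpeg[0] !== 0xff || jpeg[1] !== 0xd8) {
    return false;
  }
  let offset = 2;
  while (offset + 4 <= jpeg.length) {
    if (jpeg[offset] !== 0xff) return false; // not at a marker: treat as malformed
    const marker = jpeg[offset + 1];
    // SOS (FF DA) starts the image data and EOI (FF D9) ends the file: no metadata after this.
    if (marker === 0xda || marker === 0xd9) return false;
    // The 2-byte big-endian length counts itself but not the marker bytes.
    const segmentLength = jpeg.readUInt16BE(offset + 2);
    if (marker === 0xe1 && jpeg.toString('latin1', offset + 4, offset + 10) === 'Exif\0\0') {
      return true;
    }
    offset += 2 + segmentLength;
  }
  return false;
}

// Synthetic header for illustration: SOI, an empty APP0, an APP1 carrying only the Exif identifier, then SOS.
const withExif = Buffer.concat([
  Buffer.from([0xff, 0xd8]),
  Buffer.from([0xff, 0xe0, 0x00, 0x04, 0x00, 0x00]),
  Buffer.from([0xff, 0xe1, 0x00, 0x08]),
  Buffer.from('Exif\0\0', 'latin1'),
  Buffer.from([0xff, 0xda]),
]);
console.log(hasExifSegment(withExif)); // true
```

A check like this fits well in a test suite or a logging hook; it does not replace stripping, and it says nothing about XMP or IPTC blocks.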
4. Documentation
Reference signatures, edge cases, and lookup tables.
Input parameters
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| dataUrl | … | conditional | — | Data URL of the input image. Required when … |
| batch | … | conditional | — | Array of data URLs. When set, … |
| format | … | ✓ | — | Output shape |
| llmTarget | … | when … | — | Which vision provider's block format to produce |
| optimize | … | ✗ | … | Strip EXIF + quantize PNG palette where lossless. Recommended for LLM inputs |
| maxDimension | … | ✗ | provider-default | Longest-side pixel count after resize. Auto-selected per … |
| convertTo | … | ✗ | inferred | Force output format. Required when input is WebP / AVIF / HEIF and target is a vision API |
| … | … | ✗ | inferred | Override the declared MIME type. The tool trusts magic bytes over this value |
Output shape
// Single-image (no batch)
{
encoded: string; // base64 string (no data URL prefix)
bytes: number; // byte length of encoded
mimeType: string; // final MIME type after conversion
dimensions?: { width: number; height: number };
savings?: number; // bytes saved vs unoptimised
resized?: boolean;
converted?: boolean; // true if format was changed
llmBlock?: object; // provider-shaped block, when format: 'llm-vision'
}
// Batch mode
{
batchResults: Array<{
encoded: string;
bytes: number;
mimeType: string;
dimensions: { width: number; height: number };
llmBlock?: object;
}>;
}
Provider-default dimensions
| Provider | Optimal longest edge (px) | Rationale |
|---|---|---|
| Claude (Anthropic) | 1568 | Anthropic's vision tier: quality plateau, token cost minimum |
| GPT-4V (OpenAI) | 2048 | OpenAI's high-detail tier: 170 tokens per 512×512 tile plus 85 base tokens |
| Gemini (Google) | 3072 | Google's vision sweet spot: supports up to 16MP, but 3072px is the cost-quality knee |
Override with maxDimension for application-specific needs (thumbnails, full-resolution batch processing, etc.).
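The lookup-with-override behaviour can be sketched as follows. The pixel values come from the table above; the 'gemini' key and the resolveMaxDimension helper are assumptions made for illustration.

```javascript
// Defaults from the table above. 'claude' and 'gpt4v' appear in the examples;
// 'gemini' is an assumed key for the Google row.
const DEFAULT_MAX_DIMENSION = { claude: 1568, gpt4v: 2048, gemini: 3072 };

// Hypothetical helper: an explicit maxDimension wins, otherwise use the provider default.
function resolveMaxDimension(llmTarget, maxDimension) {
  if (maxDimension !== undefined) {
    if (!Number.isInteger(maxDimension) || maxDimension < 1) {
      throw new RangeError('maxDimension must be a positive integer');
    }
    return maxDimension;
  }
  const providerDefault = DEFAULT_MAX_DIMENSION[llmTarget];
  if (providerDefault === undefined) {
    throw new Error(`No default dimension for llmTarget: ${llmTarget}`);
  }
  return providerDefault;
}

console.log(resolveMaxDimension('claude'));      // 1568
console.log(resolveMaxDimension('claude', 512)); // 512
```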
Error codes
| Code | When it fires | Recovery |
|---|---|---|
| … | … | Provide an image input |
| … | Data URL parse failed (missing …) | Verify the input is a valid data URL |
| … | Input format not recognised by sharp | Convert to PNG / JPEG with another tool first |
| PAYLOAD_LIMIT | Output exceeds the provider's image size limit | Pass smaller input or use … |
| … | sharp encountered an internal error (corrupt image, unsupported codec subset) | Re-export the image from its source application |
When NOT to use this tool
Don't use it for full-resolution archival encoding. The default optimize: true strips EXIF and resamples; if you're encoding for storage where lossless preservation matters, use optimize: false and format: 'raw' (or skip the tool entirely and use sharp directly).
Don't use it as a generic image converter unrelated to vision APIs. The tool's optimisations are calibrated to vision-API economics. For browser-side image processing, lazy-loading thumbnails, or canvas-based pipelines, a smaller per-task library is the right call.
For very large images (over 50MB input), do the resize outside the tool with a streaming pipeline. The tool loads the full image into memory.
Performance notes
Single-image encoding: typical execution under 100ms for inputs under 5MB. Resize + EXIF strip adds 50-200ms. Format conversion (WebP → PNG) adds 100-400ms depending on input dimensions. Batch mode is parallelised internally — encoding ten 2MB images takes about 1.2× the time of one, not 10×. The tool is deterministic: same input + same parameters always produce byte-identical output, so REST responses are Edge-Cache eligible.
The tool depends on sharp's native libvips bindings. In serverless environments (Vercel, Lambda) you need to verify sharp is bundled correctly for the runtime — Vercel handles this automatically; bare AWS Lambda needs a Lambda layer or container deployment.