Preparing images for Claude, GPT-4V, and Gemini in one call

Convert any image format to base64 with provider-optimal resize, EXIF stripping, and ready-to-paste vision message blocks for Claude, GPT-4V, or Gemini. Replaces five manual steps with one tool call.

The Image-to-Base64 tool converts any image into a base64 payload formatted for Claude, GPT-4V, or Gemini in one call. It handles format conversion (WebP, AVIF, HEIF to PNG), per-provider resize optimisation, EXIF metadata stripping, and produces the exact message block shape each vision API expects.

1. Insight

The problem this article addresses and why it matters.

Vision APIs each want a different image shape

In 2025 the three frontier multimodal APIs converged on base64-encoded image payloads, then diverged on every detail. Anthropic's Messages API expects a content block with type: "image", source.type: "base64", source.media_type: "image/png" (or jpeg/gif/webp), and source.data as raw base64. OpenAI's GPT-4V chat completions want type: "image_url" with image_url.url as a data URL — base64 prefixed with data:image/png;base64,. Google's Gemini API expects inline_data.mime_type and inline_data.data as separate fields with no data URL wrapper.
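The three shapes described above can be put side by side in a few lines. This is a minimal sketch: `toVisionBlock` is a hypothetical helper written for illustration, not part of any provider SDK, and the field names are taken from the paragraph above.

```javascript
// Hypothetical helper (not part of any SDK): wrap one base64 payload
// in the block shape each provider expects.
function toVisionBlock(provider, mimeType, base64) {
  switch (provider) {
    case 'claude':
      // Anthropic: media type and raw base64 travel as separate fields.
      return { type: 'image', source: { type: 'base64', media_type: mimeType, data: base64 } };
    case 'gpt4v':
      // OpenAI: a single data URL carries both.
      return { type: 'image_url', image_url: { url: `data:${mimeType};base64,${base64}` } };
    case 'gemini':
      // Gemini: separate fields again, under different names.
      return { inline_data: { mime_type: mimeType, data: base64 } };
    default:
      throw new Error(`Unknown provider: ${provider}`);
  }
}

// Placeholder bytes stand in for a real image.
const samplePayload = Buffer.from('not a real image').toString('base64');
for (const provider of ['claude', 'gpt4v', 'gemini']) {
  console.log(provider, JSON.stringify(toVisionBlock(provider, 'image/png', samplePayload)));
}
```

The same two inputs (MIME type and base64 string) feed all three branches; only the wrapping differs.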

That's three near-identical payloads with three different field names. A developer building a multi-provider vision pipeline ends up writing the same encode-and-format logic three times, once per provider. Each implementation drifts. Each has its own subtle bug — the one where Claude rejects the payload because the media type was inferred from extension rather than content, the one where GPT-4V rejects images over 20MB without telling you why, the one where Gemini's mime_type is case-sensitive in ways the others aren't.

The encoding is the easy part

Every developer can call Buffer.from(file).toString('base64'). The annoyance is the surrounding work: stripping EXIF metadata before sending to a provider (some images contain GPS coordinates), resizing to each provider's preferred dimensions to reduce token cost (Claude's token cost scales with pixel dimensions, not file size), converting WebP, AVIF, or HEIF to PNG because support for those formats varies by provider, and formatting the result into the provider-specific message block. None of those steps is hard. All of them are tedious. All of them have edge cases the team will discover in production.
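For reference, the easy part looks roughly like this. The sketch below builds the kind of data URL the tool takes as input and splits one back apart; `toDataUrl` and `parseDataUrl` are hypothetical names, and the MIME type is simply trusted from the caller here.

```javascript
// Encode raw bytes as a base64 data URL. The MIME type is supplied by the caller.
function toDataUrl(bytes, mimeType) {
  return `data:${mimeType};base64,${Buffer.from(bytes).toString('base64')}`;
}

// Split a base64 data URL back into its declared MIME type and decoded bytes.
function parseDataUrl(dataUrl) {
  const match = /^data:([^;,]+);base64,([A-Za-z0-9+/]*={0,2})$/.exec(dataUrl);
  if (!match) throw new Error('Not a base64 data URL');
  return { mimeType: match[1], bytes: Buffer.from(match[2], 'base64') };
}

const url = toDataUrl(Buffer.from('hi'), 'image/png');
console.log(url); // data:image/png;base64,aGk=
console.log(parseDataUrl(url).bytes.toString()); // hi
```

Everything else in the pipeline (resize, metadata stripping, format conversion, provider formatting) happens between these two steps.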

What this article delivers

The image_to_base64 tool replaces the five-step manual pipeline (convert format → resize → strip metadata → encode → format for provider) with one call. We'll walk through the standard single-image encoding, the multi-image batch mode that returns provider-ready message blocks for an entire vision conversation, the format conversion that handles WebP / AVIF / HEIF input, and the resize logic that auto-picks the optimal dimension for each provider's pricing tier.

2. Intent

What you will be able to do after reading.

By the end of this article you will be able to:

  • Convert any image (PNG, JPEG, WebP, AVIF, HEIF, BMP, TIFF) into a base64 payload formatted for Claude, GPT-4V, or Gemini in a single call
  • Apply the auto-resize heuristic that picks the optimal pixel count per provider to minimise vision-API token cost
  • Strip EXIF metadata in one parameter, including GPS coordinates and camera details that would otherwise leak
  • Batch-encode multiple images for a multi-image vision request, all formatted for the same provider in one response
  • Choose between the four output formats (data-uri, raw, css-url, llm-vision) based on where the encoded payload is going next

The Examples section walks through the single-image happy path, the multi-image batch, and the WebP-to-PNG conversion path.

3. Examples

Annotated code and worked scenarios.

Before / after: encoding for Claude

The manual approach to sending an image to Anthropic's Messages API:

Before:

import fs from 'node:fs';
import sharp from 'sharp';

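const raw = fs.readFileSync('./screenshot.png');

// Cap the longest edge at 1568px without upscaling smaller images.
// sharp drops EXIF and XMP from its output unless withMetadata() or
// keepExif() is called, so no separate strip step is needed.
const processed = await sharp(raw)
  .resize(1568, 1568, { fit: 'inside', withoutEnlargement: true })
  .png()
  .toBuffer();
const base64 = processed.toString('base64');

const block = {
  type: 'image',
  source: {
    type:       'base64',
    media_type: 'image/png',
    data:       base64,
  },
};

Five steps: read, resize, strip, encode, wrap. Each has its own version-drift surface: does this codebase use sharp 0.32 or 0.34? Metadata handling is the usual trap. sharp strips EXIF and XMP from its output by default, and withMetadata() (like the keepExif() family added in 0.33) opts back in to preserving them, so a well-meaning withMetadata({ exif: {} }) call keeps the original metadata instead of removing it. The resize has a similar trap: resize(1568, null) constrains only the width, so a tall portrait image passes through at full height. The Anthropic SDK expects the block shape exactly: get the media_type casing wrong and the API returns a 400 with a generic error message.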

After:

imageToBase64({
  dataUrl:      `data:image/png;base64,${raw.toString('base64')}`,
  format:       'llm-vision',
  llmTarget:    'claude',
  optimize:     true,
});

// llmBlock: {
//   type: 'image',
//   source: {
//     type:       'base64',
//     media_type: 'image/png',
//     data:       '<base64 data, auto-resized to 1568px longest edge, EXIF stripped>'
//   }
// }
// dimensions: { width: 1568, height: 1024 }
// bytes:    284_516
// savings:  127_303    // bytes saved by resize + EXIF strip vs original

The block drops into your messages.create({ ... }) call directly. The resize dimension (1568px longest side for Claude) follows Anthropic's vision guidance: images with a longer edge are downscaled on Anthropic's side before the model sees them, so extra pixels add upload size and latency without adding detail. The tool defaults to per-provider optima so you don't have to remember the exact pixel counts.
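The longest-edge rule itself is a small calculation. A minimal sketch, assuming the resize preserves aspect ratio and never upscales; `fitLongestEdge` is a hypothetical helper, not the tool's internal code.

```javascript
// Scale dimensions so the longest edge is at most maxEdge, preserving aspect ratio.
// Images already within the limit are returned unchanged (no upscaling).
function fitLongestEdge(width, height, maxEdge) {
  const longest = Math.max(width, height);
  if (longest <= maxEdge) return { width, height, resized: false };
  const scale = maxEdge / longest;
  return {
    width: Math.max(1, Math.round(width * scale)),
    height: Math.max(1, Math.round(height * scale)),
    resized: true,
  };
}

console.log(fitLongestEdge(3136, 2048, 1568)); // { width: 1568, height: 1024, resized: true }
console.log(fitLongestEdge(800, 600, 1568));   // { width: 800, height: 600, resized: false }
```

Taking the maximum of width and height is what makes this a longest-edge cap rather than a width cap, so portrait images are handled the same way as landscape ones.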

Before / after: multi-image vision request

A multi-image vision call to GPT-4V comparing two product photos:

Before: loop over each file, encode each one, format each as the OpenAI block, push into the content array. ~40 lines of boilerplate.

After:

imageToBase64({
  batch: [
    `data:image/jpeg;base64,${product1.toString('base64')}`,
    `data:image/jpeg;base64,${product2.toString('base64')}`,
  ],
  format:    'llm-vision',
  llmTarget: 'gpt4v',
  optimize:  true,
});

// batchResults: [
//   {
//     encoded:    '...',
//     bytes:      198_443,
//     mimeType:   'image/jpeg',
//     dimensions: { width: 2048, height: 1365 },
//     llmBlock:   { type: 'image_url', image_url: { url: 'data:image/jpeg;base64,...' } },
//   },
//   {
//     encoded:    '...',
//     bytes:      211_092,
//     mimeType:   'image/jpeg',
//     dimensions: { width: 2048, height: 1365 },
//     llmBlock:   { type: 'image_url', image_url: { url: 'data:image/jpeg;base64,...' } },
//   },
// ]

The agent maps batchResults to their llmBlock fields and appends them to the content array of the OpenAI chat request. No per-image encoding and no provider-specific formatting in the agent's code.
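Assembling the request from the batch output might look like the sketch below. `buildUserMessage` is a hypothetical helper, and the stubbed `batchResults` array only mimics the response shape shown above.

```javascript
// Combine a text prompt with the llmBlock of every batch result
// into one OpenAI-style user message.
function buildUserMessage(prompt, batchResults) {
  return {
    role: 'user',
    content: [
      { type: 'text', text: prompt },
      ...batchResults.map((result) => result.llmBlock),
    ],
  };
}

// Stub standing in for the tool's batch response.
const stubResults = [
  { llmBlock: { type: 'image_url', image_url: { url: 'data:image/jpeg;base64,AAAA' } } },
  { llmBlock: { type: 'image_url', image_url: { url: 'data:image/jpeg;base64,BBBB' } } },
];

const message = buildUserMessage('Which product photo is sharper?', stubResults);
console.log(message.content.length); // 3
```

The message object can then be placed in the messages array of the chat request.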

Before / after: WebP to PNG conversion for vision

PNG and JPEG are the formats every vision API accepts. Support for WebP, AVIF, and HEIF varies by provider, so converting is the portable choice. If your image source is a modern format (browser screenshots are PNG by default, but mobile camera output is HEIF on iOS and WebP on some Android variants), you need to convert before encoding.

Before: detect format via magic bytes, branch on the format, call sharp().toFormat('png').toBuffer(), handle the case where sharp doesn't have the HEIF codec compiled in (HEIC support not available — install libheif first).

After:

imageToBase64({
  dataUrl:    `data:image/heif;base64,${heifFile.toString('base64')}`,
  format:     'llm-vision',
  llmTarget:  'claude',
  convertTo:  'png',
  optimize:   true,
});

// converted: true
// llmBlock: { type: 'image', source: { type: 'base64', media_type: 'image/png', data: '...' } }

The full pipeline collapses to one parameter. converted: true in the response confirms the conversion happened — useful for logging when the input format was unexpected.

When humans use this

A designer iterating on a Claude-powered design feedback tool drops images into a browser input. The web UI calls image_to_base64 and renders the formatted block in a preview box so the designer can verify the resize + format before sending to the API. A developer integrating multimodal LLMs for the first time uses the four-format output (data-uri, raw, css-url, llm-vision) to figure out which shape they actually need — the css-url output is the helper most developers don't know they wanted, for embedding small images as CSS background-image: url(...) rules without a separate image file.
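To make the difference between the non-LLM formats concrete, here is a rough sketch of what each one plausibly wraps around the same base64 string. The exact strings the tool emits are an assumption; `renderPayload` is a hypothetical function written for illustration.

```javascript
// Hypothetical renderer for the three non-LLM formats. The exact output
// strings are an assumption, not confirmed tool behaviour.
function renderPayload(format, mimeType, base64) {
  const dataUri = `data:${mimeType};base64,${base64}`;
  switch (format) {
    case 'raw':      return base64;              // bare base64, no prefix
    case 'data-uri': return dataUri;             // usable in <img src="...">
    case 'css-url':  return `url("${dataUri}")`; // usable as a CSS value
    default:
      throw new Error(`Unsupported format: ${format}`);
  }
}

// base64 of the 8-byte PNG signature, standing in for a real image
const signature = 'iVBORw0KGgo=';
console.log(`.icon { background-image: ${renderPayload('css-url', 'image/png', signature)}; }`);
```

The llm-vision format is the remaining case, where the payload is wrapped in a provider-specific message block instead of a string.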

When agents use this

This is the highest-leverage tool in the grid for any agent that calls vision APIs:

  • Multi-provider routing. An agent that load-balances vision requests across Claude, GPT-4V, and Gemini calls image_to_base64 once per provider with the appropriate llmTarget. The agent never has to know the provider's block format — the tool encapsulates that knowledge and is updated when providers change their APIs.
  • Pre-processing for downstream tools. An agent extracting tabular data from a screenshot needs to (a) crop, (b) resize, (c) encode, (d) format for Claude. With image_to_base64, steps (b), (c), and (d) collapse to one call. The cropping happens upstream with image manipulation primitives the agent already has.
  • Bulk vision annotation. An agent processing a folder of 200 product photos for Claude vision uses batch mode in batches of 20 to keep the response size manageable and stay well inside Anthropic's per-request image limit.
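The bulk-annotation pattern comes down to splitting the input list into fixed-size groups before calling batch mode. A minimal sketch, with `chunk` as a hypothetical helper and the file names invented for the example:

```javascript
// Split an array into consecutive groups of at most `size` items.
function chunk(items, size) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const groups = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}

// 200 placeholder file names split into batches of 20.
const photos = Array.from({ length: 200 }, (_, i) => `photo-${i + 1}.jpg`);
const batches = chunk(photos, 20);
console.log(batches.length);    // 10
console.log(batches[9].length); // 20
```

Each group would then be passed as the batch parameter of one tool call.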

Edge cases

Animated GIFs and multi-frame TIFFs

The tool encodes the first frame only. Vision APIs accept static images; multi-frame formats are converted to single-frame PNG before encoding. The transformations array surfaces this with extracted_first_frame: true so the caller can see the lossy step.

Images exceeding provider size limits

Anthropic accepts images up to 5MB encoded. GPT-4V up to 20MB. Gemini up to 7MB (when inline; larger via file upload). The tool's optimize: true flag aims to keep output under 4.5MB for Anthropic by default. If the input is larger than the limit and the resize doesn't bring it under, the tool returns PAYLOAD_LIMIT with a clear message pointing at the per-provider ceiling.
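Base64 inflates a payload by about a third, which is why a file that looks safely under a limit on disk can still exceed it once encoded. The sketch below estimates the encoded size up front. Two things are assumptions here: that the ceilings quoted above apply to the base64 string, and that 1MB means 1024 × 1024 bytes.

```javascript
// Per-provider ceilings from the text above, interpreted as bytes of base64
// (an assumption) with 1MB = 1024 * 1024 bytes (also an assumption).
const LIMITS_BYTES = {
  claude: 5 * 1024 * 1024,
  gpt4v: 20 * 1024 * 1024,
  gemini: 7 * 1024 * 1024,
};

// Length of the padded base64 encoding of `byteCount` raw bytes.
function base64Length(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

function fitsProviderLimit(byteCount, provider) {
  const limit = LIMITS_BYTES[provider];
  if (limit === undefined) throw new Error(`Unknown provider: ${provider}`);
  return base64Length(byteCount) <= limit;
}

console.log(base64Length(4000000));                // 5333336
console.log(fitsProviderLimit(4000000, 'claude')); // false
console.log(fitsProviderLimit(3900000, 'claude')); // true
```

Under these assumptions a 4,000,000-byte image already overshoots a 5MB encoded ceiling, which is consistent with the tool targeting a lower output size than the nominal limit.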

Data URLs with incorrect MIME type declarations

If the input is data:image/png;base64,... but the actual decoded content is JPEG (some apps lie about MIME types), the tool reads the magic bytes and trusts those over the declared MIME type. The output mimeType reflects the true format.
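Magic-byte detection of this kind can be sketched in a few lines. The signatures below are the standard file headers for PNG, JPEG, GIF, and WebP; whether the tool checks exactly these, or in this order, is an assumption, and `sniffMimeType` is a hypothetical name.

```javascript
// Identify an image format from its leading bytes, ignoring any declared MIME type.
// Returns null when no known signature matches.
function sniffMimeType(bytes) {
  const startsWith = (signature, offset = 0) =>
    signature.every((byte, i) => bytes[offset + i] === byte);

  if (startsWith([0x89, 0x50, 0x4e, 0x47, 0x0d, 0x0a, 0x1a, 0x0a])) return 'image/png';
  if (startsWith([0xff, 0xd8, 0xff])) return 'image/jpeg';
  if (startsWith([0x47, 0x49, 0x46, 0x38])) return 'image/gif'; // "GIF8"
  // WebP is a RIFF container with "WEBP" at byte offset 8.
  if (startsWith([0x52, 0x49, 0x46, 0x46]) && startsWith([0x57, 0x45, 0x42, 0x50], 8)) {
    return 'image/webp';
  }
  return null;
}

// A payload declared as PNG whose bytes are actually JPEG.
const mislabelled = Buffer.from([0xff, 0xd8, 0xff, 0xe0, 0x00, 0x10]);
console.log(sniffMimeType(mislabelled)); // image/jpeg
```

Reading the content rather than the declaration is what avoids the rejection case where a media type was inferred from a file extension.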

Privacy-sensitive metadata in optimize: false mode

If you pass optimize: false, EXIF data is preserved including GPS coordinates, camera model, and software version. This is intentional — sometimes you want the metadata (forensic analysis, photo organisation). For any user-uploaded content destined for an LLM API, default to optimize: true to strip metadata.

4. Documentation

Reference signatures, edge cases, and lookup tables.

Input parameters

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| dataUrl | string | conditional | | Data URL of the input image. Required when batch is not used |
| batch | string[] | conditional | | Array of data URLs. When set, dataUrl is ignored |
| format | 'data-uri' \| 'raw' \| 'css-url' \| 'llm-vision' | | | Output shape |
| llmTarget | 'claude' \| 'gpt4v' \| 'gemini' | when format: 'llm-vision' | | Which vision provider's block format to produce |
| optimize | boolean | | false | Strip EXIF + quantize PNG palette where lossless. Recommended for LLM inputs |
| maxDimension | number | | provider-default | Longest-side pixel count after resize. Auto-selected per llmTarget if omitted |
| convertTo | 'png' \| 'jpeg' | | inferred | Force output format. Required when input is WebP / AVIF / HEIF and target is a vision API |
| mimeType | string | | inferred | Override the declared MIME type. The tool trusts magic bytes over this value |

Output shape

// Single-image (no batch)
{
  encoded:      string;     // base64 string (no data URL prefix)
  bytes:        number;     // byte length of encoded
  mimeType:     string;     // final MIME type after conversion
  dimensions?:  { width: number; height: number };
  savings?:     number;     // bytes saved vs unoptimised
  resized?:     boolean;
  converted?:   boolean;    // true if format was changed
  llmBlock?:    object;     // provider-shaped block, when format: 'llm-vision'
}

// Batch mode
{
  batchResults: Array<{
    encoded:    string;
    bytes:      number;
    mimeType:   string;
    dimensions: { width: number; height: number };
    llmBlock?:  object;
  }>;
}

Provider-default dimensions

| llmTarget | Optimal longest edge (px) | Rationale |
| --- | --- | --- |
| claude | 1568 | Anthropic's vision tier: quality plateau, token cost minimum |
| gpt4v | 2048 | OpenAI's high-detail tier: 170 tokens per 512×512 tile plus an 85-token base (765 tokens for a 1024×1024 image) |
| gemini | 3072 | Google's vision sweet spot: supports up to 16MP, but 3072px is the cost-quality knee |

Override with maxDimension for application-specific needs (thumbnails, full-resolution batch processing, etc.).
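For the gpt4v row, OpenAI's published high-detail accounting charges 170 tokens per 512×512 tile plus an 85-token base, which is where a figure like 765 tokens comes from (a 1024×1024 image resolves to four tiles). The sketch below follows that published procedure as understood at the time of writing. Treating both scaling steps as downscale-only and rounding to whole pixels are assumptions, and `highDetailTokens` is a hypothetical name.

```javascript
// Estimate OpenAI high-detail image tokens from pixel dimensions.
// Assumptions: both scaling steps only shrink, and dimensions round to whole pixels.
function highDetailTokens(width, height) {
  // Step 1: fit the image inside a 2048 x 2048 square.
  const fit = Math.min(1, 2048 / Math.max(width, height));
  let w = width * fit;
  let h = height * fit;

  // Step 2: shrink so the shortest side is at most 768px.
  const shrink = Math.min(1, 768 / Math.min(w, h));
  w = Math.round(w * shrink);
  h = Math.round(h * shrink);

  // Step 3: count 512px tiles and add the fixed base cost.
  const tiles = Math.ceil(w / 512) * Math.ceil(h / 512);
  return 85 + 170 * tiles;
}

console.log(highDetailTokens(1024, 1024)); // 765
console.log(highDetailTokens(2048, 4096)); // 1105
```

Because of the second scaling step, sending more than 2048px on the longest edge buys no extra tiles, which is the reasoning behind the 2048 default.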

Error codes

| Code | When it fires | Recovery |
| --- | --- | --- |
| INPUT_EMPTY | dataUrl empty AND batch not provided | Provide an image input |
| INPUT_MALFORMED | Data URL parse failed (missing data: prefix, invalid base64) | Verify the input is a valid data URL |
| UNSUPPORTED_FORMAT | Input format not recognised by sharp | Convert to PNG / JPEG with another tool first |
| PAYLOAD_LIMIT | Output exceeds the provider's image size limit | Pass smaller input or use maxDimension to force aggressive resize |
| TRANSFORM_FAILED | sharp encountered an internal error (corrupt image, unsupported codec subset) | Re-export the image from its source application |
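The PAYLOAD_LIMIT recovery (retry with a smaller maxDimension) lends itself to a small wrapper. This is a sketch under stated assumptions: that the tool signals errors by throwing an Error with a `code` property, and that the call is synchronous as in the examples above. How the real tool surfaces errors is not documented here, so the encoder is injected and a stub stands in for it.

```javascript
// Retry an encoder with progressively smaller maxDimension values whenever it
// fails with PAYLOAD_LIMIT. Assumption: errors are thrown and carry a `code`.
function encodeWithFallback(encode, params, dimensions = [1568, 1024, 768]) {
  let lastError = new Error('No dimensions to try');
  for (const maxDimension of dimensions) {
    try {
      return encode({ ...params, maxDimension });
    } catch (error) {
      // Other error codes need a different recovery, so rethrow them untouched.
      if (!error || error.code !== 'PAYLOAD_LIMIT') throw error;
      lastError = error;
    }
  }
  throw lastError;
}

// Stub encoder standing in for the real tool: rejects anything above 1024px.
function stubEncode(params) {
  if (params.maxDimension > 1024) {
    throw Object.assign(new Error('Output exceeds provider limit'), { code: 'PAYLOAD_LIMIT' });
  }
  return { encoded: '...', dimensions: { width: params.maxDimension, height: params.maxDimension } };
}

console.log(encodeWithFallback(stubEncode, { llmTarget: 'claude' }).dimensions.width); // 1024
```

The other codes in the table point at input problems rather than size, so retrying them with different dimensions would not help.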

When NOT to use this tool

Don't use it for full-resolution archival encoding. With optimize: true the tool strips EXIF and resamples; if you're encoding for storage where lossless preservation matters, use optimize: false and format: 'raw' (or skip the tool entirely and use sharp directly).

Don't use it as a generic image converter unrelated to vision APIs. The tool's optimisations are calibrated to vision-API economics. For browser-side image processing, lazy-loading thumbnails, or canvas-based pipelines, a smaller per-task library is the right call.

For very large images (over 50MB input), do the resize outside the tool with a streaming pipeline. The tool loads the full image into memory.

Performance notes

Single-image encoding: typical execution under 100ms for inputs under 5MB. Resize + EXIF strip adds 50-200ms. Format conversion (WebP → PNG) adds 100-400ms depending on input dimensions. Batch mode is parallelised internally — encoding ten 2MB images takes about 1.2× the time of one, not 10×. The tool is deterministic: same input + same parameters always produce byte-identical output, so REST responses are Edge-Cache eligible.

The tool depends on sharp's native libvips bindings. In serverless environments (Vercel, Lambda) you need to verify sharp is bundled correctly for the runtime — Vercel handles this automatically; bare AWS Lambda needs a Lambda layer or container deployment.

FAQ

Why does each vision provider need a different block format?

The three providers converged on base64 input but diverged on field names. Anthropic uses source.type/source.media_type/source.data. OpenAI uses image_url.url as a data URL. Gemini uses inline_data.mime_type and inline_data.data. The tool encapsulates this provider drift.

What if my image is in a format vision APIs reject?

Use convertTo: 'png' or 'jpeg'. The tool accepts WebP, AVIF, HEIF, BMP, TIFF as input and converts before encoding. Output mimeType reflects the converted format. converted: true in the response confirms the conversion happened.

Does it work for animated GIFs?

It encodes the first frame only. Vision APIs accept static images, so multi-frame formats are reduced to single-frame PNG. The response surfaces this with extracted_first_frame: true so the caller can see the lossy step.

What's the largest image I can encode?

Anthropic accepts up to 5MB encoded, GPT-4V up to 20MB, Gemini up to 7MB inline. With optimize: true the tool aims for under 4.5MB by default. Larger inputs that don't resize under the limit return PAYLOAD_LIMIT with the provider-specific ceiling.

Can I use the tool without specifying an llmTarget?

Yes. format: 'data-uri', 'raw', or 'css-url' produce non-LLM output shapes. llmTarget is only required when format: 'llm-vision' is set.