About

General plan

We selected some Pokémon test suites and let AI models generate SVG code.

  • Use only APIs for generation where possible, without relying on any additional tools, agents, or harnesses.
  • Use a unified prompt, with 2 generation attempts for each Pokémon.
  • Select Pokémon test suites with different evaluation dimensions for each group.
  • Manually evaluate each generated result and align the scoring scale across models.

How to choose Pokémon

The currently selected test suites divide Pokémon into three groups:

  • S1 Stage 1: Focus on testing the fidelity of basic geometry and colors.
  • S2 Stage 2: Focus on testing the fidelity of iconic features and spatial proportions.
  • S3 Stage 3: Focus on testing the structure and fine details of large, complex Pokémon.

Five Pokémon were selected for each group, with 2 generation attempts per Pokémon, for a total of 30 SVGs per model across the groups. We initially tested with 4 generation attempts, but it didn't significantly affect the final scores, and reducing it lowers the cost of manual evaluation.

If you are not familiar with Pokémon, you can find a brief description of them as test subjects here.

Generation

We use the simplest possible unified prompt for generation:

Generate the SVG code for the Pokémon {{pokemon_name}}

Most models successfully output SVG. However, some models that are overly enthusiastic about front-end styling occasionally generate HTML instead, so we add an extra constraint prompt for these models:

Output only the SVG tag code. No additional text and tool call.

Models that require the constraint prompt: Kimi-K2.6, GLM 5.1

Some models don't have a public API, such as Cursor's Composer 2. In such cases, staying within the product's terms of use, we use subagents in Cursor with tool usage disabled for batch generation. The corresponding rules are as follows:

---
name: pokemon-svg-generator
description: One Pokémon per run only; English filename; do not read other SVGs. This subagent must never batch all 15 in one invocation. Parent agent—when the user gives an output dir plus explicit full-batch intent (e.g. all 15, full whitelist, generate every name in this agent file), launch 15 parallel pokemon-svg-generator subagents. Whitelist and templates in this doc are not user chat input unless the user quotes them in the same turn. /pokemon-svg-generator
model: inherit
---

Models generated via subagents: Composer 2

Some model APIs are inaccessible in mainland China, so we have to use third-party proxies. We also try to keep the model request parameters as close to real-world usage as possible. After all, no one wants to use expensive reasoning tokens to generate poor-quality SVGs.

The relevant information above is marked on the leaderboard.

Scoring rules

Although the evaluation focus differs for each group, a 4-2-4 weight distribution is adopted across dimensions in every group. Each dimension is scored from 0 to 10 points.

  • S1: Geometry 40%, Color 20%, Impression 40%
  • S2: Signature 40%, Proportion 20%, Impression 40%
  • S3: Structure 40%, Detail 20%, Impression 40%

S_i = 0.4 · d_1 + 0.2 · d_2 + 0.4 · d_3

After scoring each generated result, the arithmetic mean within each group gives the group averages S_1, S_2, S_3. Since the groups differ in difficulty, the total score weights them as W_1 : W_2 : W_3 = 1.0 : 1.8 : 3.2. With N items per group (currently 10) and group averages in the 0–10 range, the theoretical maximum of the weighted total is N × 10 × (W_1 + W_2 + W_3) = 600, so the final score is normalized to a 100-point scale:

S_total = (S_1 · N · W_1 + S_2 · N · W_2 + S_3 · N · W_3) · 100 / 600
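
The two formulas above can be checked with a short worked example (the input numbers are made up for illustration):

```python
# Per-dimension weights within a group: 40% / 20% / 40%
def item_score(d1: float, d2: float, d3: float) -> float:
    return 0.4 * d1 + 0.2 * d2 + 0.4 * d3

# Group difficulty weights and normalization constants from the text
W = (1.0, 1.8, 3.2)
N = 10  # items per group (5 Pokémon × 2 attempts)

def total_score(group_averages: tuple[float, float, float]) -> float:
    """S_total = (S_1·N·W_1 + S_2·N·W_2 + S_3·N·W_3) · 100 / 600"""
    weighted = sum(s * N * w for s, w in zip(group_averages, W))
    return weighted * 100 / 600

# A model averaging a perfect 10 in every group reaches exactly 100
perfect = total_score((10.0, 10.0, 10.0))
```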

SVG parameter statistics

We additionally calculated some SVG parameters for reference. These include:

Size

The byte length of the SVG (UTF-8), consistent with the Size column in the "SVG structure" table on the homepage leaderboard.

Total tags

The parsed opening tag count, consistent with the "Total tags" column.

Advanced tags

The proportion (percentage) of nodes falling within the predefined advanced tag set out of the total tag count, consistent with the "Advanced tags" column. Advanced tags include <defs>, <use>, <clipPath>, <mask>, <linearGradient>, <radialGradient>, <filter>, <pattern>, <symbol>, <marker>, <foreignObject>, <textPath>, <switch>.
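
The three statistics above can be reproduced roughly as follows. This is a sketch; the benchmark's exact parser may differ in details such as namespace handling or how self-closing tags are counted:

```python
import xml.etree.ElementTree as ET

# The predefined advanced tag set listed above
ADVANCED = {
    "defs", "use", "clipPath", "mask", "linearGradient", "radialGradient",
    "filter", "pattern", "symbol", "marker", "foreignObject", "textPath", "switch",
}

def svg_stats(svg: str) -> dict:
    size = len(svg.encode("utf-8"))  # Size: UTF-8 byte length
    root = ET.fromstring(svg)
    # Strip any "{namespace}" prefix so tag names compare cleanly
    tags = [el.tag.split("}")[-1] for el in root.iter()]
    total = len(tags)                # Total tags: parsed element count
    advanced = sum(t in ADVANCED for t in tags)
    return {"size": size, "total_tags": total,
            "advanced_pct": 100 * advanced / total}

demo = "<svg><defs><linearGradient id='g'/></defs><rect/><circle/></svg>"
stats = svg_stats(demo)
```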

Anchors

Consistent with the scatter plot under "SVG structure" · "View" on the homepage leaderboard: the horizontal axis is the anchor count, and the vertical axis is the visual score (the weighted score × 10 on a 100-point scale, matching the "Visual score" header and the vertical axis of the scatter plot). The meanings of the four quadrants match the corner annotations:

  • High visual score, low anchors: Minimal
  • High visual score, high anchors: Refined
  • Low visual score, low anchors: Crude
  • Low visual score, high anchors: Cluttered
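
A hypothetical classifier for these four quadrants; the score and anchor thresholds below are illustrative assumptions, not the leaderboard's actual dividing lines:

```python
def quadrant(visual_score: float, anchors: int,
             score_cut: float = 60.0, anchor_cut: int = 300) -> str:
    """Map a (visual score, anchor count) point to its quadrant label.

    score_cut and anchor_cut are illustrative thresholds only.
    """
    high_score = visual_score >= score_cut
    high_anchors = anchors >= anchor_cut
    if high_score:
        return "Refined" if high_anchors else "Minimal"
    return "Cluttered" if high_anchors else "Crude"
```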

FAQ

Q: Will manual evaluation be subjective?

A: Yes, it will. Usually, after evaluating a batch of model scores, I fine-tune them again from a holistic perspective. But I don't have any particular preference for specific models; I just want to see which model generates the best SVGs. I am currently the only evaluator, and I try my best to keep my criteria constant! So, treat this as a reference only.

Q: Why does Claude Opus 4.7 rank lower?

A: Currently, aside from Gemini 3.1 Pro marketing its SVG generation capabilities at launch, and Arrow 1.1 specifically targeting vector graphics, other models haven't focused heavily on this capability. That is also part of what makes this interesting right now. Because Opus 4.7 has the `thinking type: adaptive` parameter, it may default to treating SVG generation as a lightweight task and not focus on generation quality. Adding an extra prompt just for it would be unfair to the other models. As the figure below shows, the left is "effort": "medium" and the right is "effort": "max"; there is actually not much difference.

Claude Opus 4.7 Jigglypuff SVG comparison: effort medium vs effort max

Q: Why are some APIs not from the official source?

A: Anthropic and OpenAI prohibit users in mainland China from using their products, including APIs, and I don't want to bother fighting it. I recently found out that OpenRouter also disabled these two models due to billing addresses, so I chose Cloudflare. Shh, keep it down.

Q: Is the Benchmark related to Nintendo?

A: Not at all, it's just my personal hobby. I hope Nintendo goes easy on me.

Q: Do you play Pokémon games?

A: Although I watched some Pokémon anime in my childhood, I'm not actually a die-hard game fan. In recent years, I played "Pokémon Legends: Arceus" and "Pokémon Legends: Z-A" and felt the joy again! Next time I'll definitely try "Pokémon Wind and Waves".

Feedback

My email:

hi@fenx.work

Behind the scenes

The following products were used during the development of this site:

  • Cloudflare stack: Workers, KV, D1, R2, AI, the online integration is very smooth;
  • This site was developed using Cursor, utilizing GPT 5.5, Codex 5.3, and Composer 2 models;
  • Purely manual design, because I am a UI designer;
  • SvelteKit: I don't know much about it. I was using Astro before, but since the admin backend proportion is larger this time, I switched to this;
  • Pokémon wiki: It helped a lot when selecting test subjects;
  • Simon Willison's Weblog: The pelican riding a bicycle is one of the sources of inspiration;

About me

Some other links: