Model picker,
recursive.
A retrieval system over OpenRouter's model catalog. Ask in plain language, get a side-by-side comparison of the two to four models that fit. Generated by a model on OpenRouter, about OpenRouter, using OpenRouter.
Three hundred-plus models, weekly releases, opaque pricing pages, vague benchmarks. Picking the right one is harder than it should be. The catalog is structured data. Retrieval is the obvious move. Embeddings + cosine similarity + a cheap generator below.
The model catalog has outgrown anyone's ability to track it. Most teams pick by reputation, the loudest tweet last week, or whatever was cheapest the day they shipped.
OpenRouter aggregates over 300 models from a couple dozen providers. Each one has a context window, an input price, an output price, a modality set, a tool-call interface, and an opinion about JSON. The data is structured and freely available through their API. The work of comparing it is not.
The question users actually ask is plain language. "Cheapest model good at code." "Best vision model under five dollars per million." "Long-context for an entire codebase." The catalog is structured. The query is not. That gap is what retrieval is for.
Type a question or pick a starter. The pipeline parses hard constraints (price, context, modality), embeds the query, runs cosine similarity over the catalog, and feeds the top four to a cheap generator. Streamed answer above, comparison table below.
| Model | In $/M | Out $/M | Context | Capabilities | Latency | Best for |
|---|---|---|---|---|---|---|
Catalog → chunks → embeddings
A Python notebook pulls /api/v1/models from OpenRouter once a week. Each row becomes a chunk: provider, name, ID, context length, prices, modalities, tool and JSON support, a one-line best-for tagline. Each chunk is embedded with google/gemini-embedding-2-preview and committed to the repo as models.json. GitHub Actions runs the refresh every Sunday at 02:00 UTC.
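A minimal sketch of that chunk-building step, assuming the public `/api/v1/models` response shape (`id`, `name`, `context_length`, `pricing`); the embedding call itself is elided, and `embed_text` is a hypothetical stand-in for whatever the notebook actually calls.

```python
# Hedged sketch: pull the catalog and flatten each model into one embeddable
# chunk. Field names follow the public /api/v1/models response; embed_text is
# a placeholder, not the notebook's real embedding helper.
import json

import requests

CATALOG_URL = "https://openrouter.ai/api/v1/models"

def build_chunk(m: dict) -> str:
    """One line of text per model: id, name, context, prices, modalities."""
    p = m.get("pricing", {})
    return " | ".join([
        m.get("id", ""),
        m.get("name", ""),
        f"context {m.get('context_length')}",
        f"in ${p.get('prompt')}/tok out ${p.get('completion')}/tok",
        str(m.get("architecture", {}).get("input_modalities", [])),
    ])

catalog = requests.get(CATALOG_URL, timeout=30).json()["data"]
chunks = [{"id": m["id"], "text": build_chunk(m)} for m in catalog]

# Each chunk's text would then be embedded and the vector stored alongside
# the metadata before writing models.json:
# for c in chunks:
#     c["embedding"] = embed_text(c["text"])   # hypothetical helper
with open("models.json", "w") as f:
    json.dump(chunks, f, indent=2)
```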
Hybrid: regex filters + cosine
A regex pre-pass extracts hard constraints from the query: price ceilings, context minimums, modalities, capability flags. The catalog is filtered first. The query is then embedded, and cosine similarity ranks the survivors. Top four go to the generator. If hard filters return zero, the system says so instead of hallucinating a match.
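A sketch of that hybrid pass. The regex patterns and constraint names here are illustrative examples, not the production set; the ranking itself is plain cosine similarity over whatever survives the hard filters.

```python
# Illustrative hard-constraint extraction plus cosine ranking over the
# filtered candidates. Patterns are examples, not the production ones.
import re

import numpy as np

def extract_constraints(query: str) -> dict:
    c: dict = {}
    if m := re.search(r"under\s+\$?(\d+(?:\.\d+)?)", query, re.I):
        c["max_price_per_m"] = float(m.group(1))       # price ceiling, $/M tokens
    if m := re.search(r"(\d+)\s*k\s*(?:context|tokens?)", query, re.I):
        c["min_context"] = int(m.group(1)) * 1000      # context floor
    if re.search(r"\b(vision|image|multimodal)\b", query, re.I):
        c["needs_vision"] = True
    return c

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vec: np.ndarray, survivors: list[dict], k: int = 4) -> list[dict]:
    """Rank the hard-filtered candidates; the top k feed the generator."""
    if not survivors:
        return []   # zero matches: say so instead of hallucinating one
    return sorted(survivors,
                  key=lambda c: cosine(query_vec, np.array(c["embedding"])),
                  reverse=True)[:k]
```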
Cheap model, grounded prompt
The four retrieved chunks become the system context for a Gemini Flash call on OpenRouter. The model writes a two-to-three sentence recommendation and is instructed to refuse anything not in the retrieved set. Streamed back token-by-token. Each call runs ~$0.0002. A daily spend cap, hashed-IP rate limit, and 24-hour cache sit in front.
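The generation call, roughly. The endpoint shape follows OpenRouter's chat-completions API; the system-prompt wording is illustrative, and the 400-token cap mirrors the hard output cap from the guard list below.

```python
# Hedged sketch of the grounded generation call. Prompt wording is
# illustrative; the request shape follows OpenRouter's chat-completions API.
import requests

def recommend(query: str, chunks: list[str], api_key: str) -> requests.Response:
    """Return the raw streaming response; the caller relays it token-by-token."""
    system = (
        "You are a model-selection assistant. Recommend only from the models "
        "listed below. If none fit, say so.\n\n" + "\n".join(chunks)
    )
    return requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": "google/gemini-2.0-flash-001",
            "messages": [
                {"role": "system", "content": system},
                {"role": "user", "content": query},
            ],
            "max_tokens": 400,   # hard output cap
            "stream": True,
        },
        stream=True,
        timeout=60,
    )
```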
Public LLM endpoints get abused. The defense is layered: cheap rejections first, expensive calls only after every gate passes.
Hard input cap
1,500-character limit at the client and the server. Anything longer is rejected before any model call.
Hashed-IP rate limits
Ten requests per IP per hour, thirty per day. IP is SHA-256 hashed at the edge before storage.
Topic guard
A keyword pre-check rejects queries unrelated to LLM or model selection. Cheap, deterministic.
Hard output cap
400 tokens max from the generator. A prompt injection that asks for longer output is still bounded by the API call itself.
24-hour query cache
Identical questions return cached answers. Repeat traffic costs nothing.
Daily spend cap
$2 per day hard limit on the OpenRouter side. Endpoint returns a polite refusal once the cap trips.
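Taken together, the gate order looks roughly like this. In-memory dicts stand in for the edge store, the keyword list is illustrative, and the real checks live in the Cloudflare Pages Function, not in Python.

```python
# Sketch of the layered gates: cheapest rejections first, paid call last.
# Limits match the list above; storage and keywords are stand-ins.
import hashlib
import re
import time

MAX_CHARS = 1500
HOURLY_LIMIT, DAILY_LIMIT = 10, 30
DAILY_SPEND_CAP_USD = 2.00
TOPIC = re.compile(r"\b(model|llm|context|token|price|vision|code|embedding)\b", re.I)

hits: dict[str, list[float]] = {}          # hashed IP -> request timestamps
cache: dict[str, tuple[float, str]] = {}   # query -> (timestamp, answer)
spent_today = 0.0

def hash_ip(ip: str) -> str:
    # Only the SHA-256 digest is ever stored.
    return hashlib.sha256(ip.encode()).hexdigest()

def gate(query: str, ip: str) -> str | None:
    """Return a refusal or cached answer, or None when every gate passes."""
    now = time.time()
    if len(query) > MAX_CHARS:
        return "Query too long."
    key = hash_ip(ip)
    recent = [t for t in hits.get(key, []) if now - t < 86400]
    if sum(now - t < 3600 for t in recent) >= HOURLY_LIMIT or len(recent) >= DAILY_LIMIT:
        return "Rate limit reached."
    if not TOPIC.search(query):
        return "Ask about picking an LLM."
    if (hit := cache.get(query)) and now - hit[0] < 86400:
        return hit[1]                       # repeat traffic costs nothing
    if spent_today >= DAILY_SPEND_CAP_USD:
        return "Daily budget spent."
    hits[key] = recent + [now]
    return None                             # only now does the paid call happen
```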
notebooks/model_picker_lab.py ↗ pulls the OpenRouter catalog, builds a chunk per model, embeds each one, and writes models.json. A GitHub Action ↗ runs the notebook weekly and commits the refreshed file.
A Cloudflare Pages Function loads models.json at request time, runs the regex constraint extractor, embeds the query, computes cosine similarity, and streams a generation from google/gemini-2.0-flash-001. Both embedding and generation go through the same OpenRouter API key. No client-side secrets.
OpenRouter · /api/v1/models reference ↗
Google · text embeddings ↗
Cloudflare Pages Functions ↗
Latency tiers are inferred from provider class and parameter count, not measured. Quality claims are restricted to whatever is in the model card. The system will not rank models on benchmarks it cannot verify. New models that release between Sunday refreshes are not in the index until the next run.