LLM extraction and fallback
Use an LLM as a rescue, a selector repairer, or a schema inferrer.
WebReaper's AI extraction is built so the deterministic path runs first and the
model steps in only when needed. That keeps cost predictable: on a stable page
the LLM is never called. There are three à la carte behaviors, each a single
builder method, and each backed by your own IChatClient.
LLM fallback
Run a deterministic CSS or XPath schema first; if extraction comes back empty or incomplete, the LLM rescues that page. The successful fix is cached, so repeated pages of the same shape do not pay again:

    using WebReaper.Builders;

    var engine = await ScraperEngineBuilder
        .Crawl("https://example.com")
        .Extract(Article.Schema)
        .WithLlmFallback(chatClient)
        .WriteToJsonFile("articles.json")
        .BuildAsync();

    await engine.RunAsync();

This is the most common AI setup: deterministic speed on the pages that behave, an LLM safety net on the pages that do not.
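To make the ordering concrete, here is a minimal sketch of the "deterministic first, model second, cached fix in between" flow. It is not WebReaper's implementation: `Deterministic`, `AskModelForFix`, `fixCache`, and the string-based "page shape" key are hypothetical stand-ins chosen for illustration, and the model call is simulated.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical stand-ins (not WebReaper APIs): a page is (shape, html), and an
// extractor returns the value it found, or null when it finds nothing.
var llmCalls = 0;
var fixCache = new Dictionary<string, Func<string, string?>>();

// Deterministic path: only understands the original <h1> markup.
string? Deterministic(string html) =>
    html.StartsWith("<h1>") ? html[4..^5] : null;

// Simulated LLM rescue: returns a reusable extractor for this page shape.
Func<string, string?> AskModelForFix(string html)
{
    llmCalls++;
    return h => h.StartsWith("<h2>") ? h[4..^5] : null;
}

string? Extract(string shape, string html)
{
    // 1. The deterministic schema always runs first.
    var result = Deterministic(html);
    if (result is not null) return result;

    // 2. Reuse a fix that an earlier page of the same shape already paid for.
    if (fixCache.TryGetValue(shape, out var cached) && cached(html) is { } hit) return hit;

    // 3. Only now call the model, and cache what it produced.
    var fix = AskModelForFix(html);
    fixCache[shape] = fix;
    return fix(html);
}

Console.WriteLine(Extract("article", "<h1>Stable</h1>"));  // deterministic, no model call
Console.WriteLine(Extract("redesign", "<h2>First</h2>"));  // model rescues, fix is cached
Console.WriteLine(Extract("redesign", "<h2>Second</h2>")); // cached fix, no model call
Console.WriteLine($"LLM calls: {llmCalls}");               // LLM calls: 1
```

In this sketch the cache is keyed by a caller-supplied shape label; how WebReaper itself decides that two pages share a shape is not specified here.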
Self-healing selectors
When a site changes its markup and your selectors stop matching, self-healing asks the LLM to repair the broken selectors against the live page, then continues deterministically with the repaired schema:

    var engine = await ScraperEngineBuilder
        .Crawl("https://example.com")
        .Extract(Article.Schema)
        .WithLlmSelfHealing(chatClient)
        .WriteToJsonFile("articles.json")
        .BuildAsync();

    await engine.RunAsync();

The repair is cached against the schema, so one fix covers the rest of the crawl.
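The "repair once, then continue deterministically" idea can be pictured with the sketch below. It assumes a deliberately simplified model that is not WebReaper's API: a page is a selector-to-text lookup, a schema is a field-to-selector map, and `RepairSelector` is a hypothetical function that simulates the model proposing a replacement selector.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Hypothetical model (not WebReaper's API): the schema maps field -> selector,
// and repairs are written back into it so later pages reuse them.
var schema = new Dictionary<string, string> { ["title"] = "h1.title" };
var repairCalls = 0;

// Simulated LLM repair: proposes a replacement selector from the live page.
string RepairSelector(string field, IReadOnlyDictionary<string, string> page)
{
    repairCalls++;
    return "h1.headline"; // pretend the model spotted the renamed class
}

string? ExtractField(string field, IReadOnlyDictionary<string, string> page)
{
    // Deterministic attempt with the selector currently stored on the schema.
    if (page.TryGetValue(schema[field], out var text)) return text;

    // The selector no longer matches: repair it, store the repair, retry once.
    schema[field] = RepairSelector(field, page);
    return page.TryGetValue(schema[field], out var healed) ? healed : null;
}

// The site renamed "h1.title" to "h1.headline" on every page.
var livePages = new[]
{
    new Dictionary<string, string> { ["h1.headline"] = "Page one" },
    new Dictionary<string, string> { ["h1.headline"] = "Page two" },
    new Dictionary<string, string> { ["h1.headline"] = "Page three" },
};

foreach (var livePage in livePages)
    Console.WriteLine(ExtractField("title", livePage));

Console.WriteLine($"Repairs requested: {repairCalls}"); // Repairs requested: 1
```

Only the first page triggers the simulated repair; the second and third pages match the stored selector directly.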
Inferred schema
Sometimes you do not want to write a schema at all. .ExtractInferred(goal)
describes what you want in plain language; .WithLlmSchemaInferrer(...) lets the
model infer a schema from the URL and that goal, which is then reused for the
crawl:

    var engine = await ScraperEngineBuilder
        .Crawl("https://example.com/products")
        .ExtractInferred("product name, price, and rating")
        .WithLlmSchemaInferrer(chatClient)
        .WriteToJsonFile("products.json")
        .BuildAsync();

    await engine.RunAsync();

Stable pages cost zero LLM calls
This is the design principle behind all three. The deterministic extractor and the cached fixes do the work on every page that behaves. The model is reserved for the empty result, the broken selector, or the first inference. A crawl over a consistent site can finish without a single live LLM call after the first page.
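The "first inference" case can be read as lazy initialization: the expensive call happens on first use and its result is reused afterwards. The sketch below illustrates that cost profile only. `InferSchema`, the comma-splitting logic, and the 50-page loop are hypothetical stand-ins for illustration, not how WebReaper infers schemas.

```csharp
#nullable enable
using System;
using System.Linq;

var inferenceCalls = 0;

// Simulated inference: a hypothetical stand-in for asking a model to turn
// a URL and a plain-language goal into a list of field names.
string[] InferSchema(string url, string goal)
{
    inferenceCalls++;
    return goal.Split(',').Select(f => f.Trim()).ToArray();
}

// Lazy<T> runs the factory on first access and caches the result.
var inferred = new Lazy<string[]>(
    () => InferSchema("https://example.com/products", "name, price, rating"));

var pagesProcessed = 0;
foreach (var pageNumber in Enumerable.Range(1, 50))
{
    var fields = inferred.Value; // page 1 infers; pages 2..50 reuse the schema
    if (fields.Length > 0) pagesProcessed++;
}

Console.WriteLine($"Pages: {pagesProcessed}, inference calls: {inferenceCalls}");
// Pages: 50, inference calls: 1
```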
For the one-line way to wire these together, see the AI features overview; for letting a model decide which pages to visit, see the autonomous agent.