Schema extraction
Define a typed schema with the source generator for reflection-free extraction.
When you want structured records instead of Markdown, you describe the shape with
a schema. WebReaper ships a source generator: annotate a class, and it emits both
a Schema you pass to .Extract(...) and a reflection-free Materialize method
that turns extracted JSON into your type. No runtime reflection, AOT-friendly.
Define a schema
Mark the class partial, add [ScrapeSchema], and put [ScrapeField(...)] on
each property you want filled:

    using System.Collections.Generic;
    using WebReaper.Extraction.Attributes;

    [ScrapeSchema]
    public partial class Article
    {
        [ScrapeField("h1")]
        public string? Title { get; set; }

        [ScrapeField(".score", Type = SchemaFieldType.Integer)]
        public int Points { get; set; }

        [ScrapeField(".tag", IsList = true)]
        public List<string> Tags { get; set; } = new();
    }

The first argument to [ScrapeField(...)] is a CSS selector. Type coerces the
extracted text (for example to an integer), and IsList = true collects every
match into a list.
What the generator emits
For the class above, the generator produces two static members:

    // A ready-to-use schema for the extractor
    Article.Schema

    // A reflection-free mapper from extracted JSON to your type
    Article Materialize(JsonObject json)

Use it in a scraper
Pass the generated schema to the .Extract(...) terminal:

    using WebReaper.Builders;

    var engine = await ScraperEngineBuilder
        .Crawl("https://news.ycombinator.com")
        .Extract(Article.Schema)
        .WriteToJsonFile("articles.json")
        .BuildAsync();

    await engine.RunAsync();

Constraints in v1
The source generator covers the common case and is deliberately scoped:
- The class must be declared partial.
- Annotated properties must have a public setter.
- It supports a single level of fields plus lists of primitives. Nested
  [ScrapeSchema] types are not supported yet.
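
To picture what a flat, reflection-free mapper looks like, here is a hand-written sketch of roughly the kind of code a Materialize method could contain. It is not the generator's actual output. ArticleSketch is a hypothetical stand-in for the annotated class, JsonObject is assumed to be System.Text.Json.Nodes.JsonObject, and the JSON keys are assumed to match the property names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json.Nodes;

// Build the kind of JSON an extractor might hand back (keys are an assumption).
var json = new JsonObject
{
    ["Title"] = "Example story",
    ["Points"] = 128,
    ["Tags"] = new JsonArray("dotnet", "scraping"),
};

ArticleSketch article = ArticleSketch.Materialize(json);
Console.WriteLine($"{article.Title} ({article.Points}) [{string.Join(", ", article.Tags)}]");
// prints: Example story (128) [dotnet, scraping]

// Hypothetical stand-in for the annotated Article class; written by hand, not generated.
public sealed class ArticleSketch
{
    public string? Title { get; set; }
    public int Points { get; set; }
    public List<string> Tags { get; set; } = new();

    // Reads each property by key, so no reflection over the type's members is needed.
    public static ArticleSketch Materialize(JsonObject json)
    {
        var result = new ArticleSketch
        {
            Title = json["Title"]?.GetValue<string>(),
            Points = json["Points"]?.GetValue<int>() ?? 0, // a missing field falls back to 0
        };

        // A list of primitives: copy every non-null element of the array.
        if (json["Tags"] is JsonArray tags)
        {
            foreach (JsonNode? tag in tags)
            {
                if (tag is not null) result.Tags.Add(tag.GetValue<string>());
            }
        }

        return result;
    }
}
```

Each property is assigned through its public setter from a single-level JSON object, and the list is filled element by element. That is the shape the v1 constraints describe: flat fields plus lists of primitives.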
If your target page is irregular or its markup drifts often, pair schema
extraction with an LLM fallback or self-healing selectors; see
LLM extraction and fallback. For pages where you would
rather not write a schema at all, .ExtractInferred(goal) infers one at runtime.