civic-llms — agent spec for state government llms.txt

What this is

When an agent fetches a page on a .gov or .us domain, it should first check whether the site publishes an llms.txt file at its root and, if it does, use that file as navigation context. This document specifies that pattern.

This is not a new file format — it uses the llms.txt proposal as-is. It is not a crawling or licensing policy — use robots.txt for that.

Scope

This convention applies to agents fetching content from:

Any domain matching *.gov or *.us
The five US territory root domains tracked at state-llms.keith.is

Agents MAY apply the same convention to non-US civic domains (.gc.ca, .gov.uk, etc.), but this spec does not require it.
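
A minimal sketch of the scope check, assuming the agent already has the hostname it is about to fetch. The names is_in_scope and TERRITORY_ROOT_DOMAINS are hypothetical, and the territory set is left empty because this document points to the tracker rather than listing those domains.

```python
# Hypothetical placeholder: fill this from the tracker at state-llms.keith.is.
# The spec text does not enumerate the five territory root domains.
TERRITORY_ROOT_DOMAINS = frozenset()


def is_in_scope(hostname: str) -> bool:
    """Return True if hostname matches *.gov, *.us, or a listed territory root."""
    host = hostname.strip().lower().rstrip(".")
    if host.endswith((".gov", ".us")):
        return True
    # A territory root matches itself and any of its subdomains.
    return any(
        host == root or host.endswith("." + root)
        for root in TERRITORY_ROOT_DOMAINS
    )
```

Matching on the leading dot keeps look-alike hosts such as www.gov.uk out of scope.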

The convention

Before fetching any path on an in-scope domain, the agent:

Issues GET https://{root-domain}/llms.txt
On 200, parses the response as Markdown and uses it as navigation context for subsequent fetches in the session
On 404 or other client errors, proceeds normally and records the absence (see Telemetry)
On 5xx or timeout, retries once with backoff; if still failing, proceeds as in 404
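
The steps above can be sketched as a small function. This is an illustration under stated assumptions, not a reference implementation: fetch_llms_txt is a hypothetical name, the get callable stands in for whatever HTTP client the agent uses (it returns a status code and body, and raises TimeoutError on timeout), and the one-second backoff is an arbitrary choice. Any non-200 status below 500 is treated as absence.

```python
import time
from typing import Callable, Optional, Tuple


def fetch_llms_txt(
    root_domain: str,
    get: Callable[[str], Tuple[int, str]],
    backoff_seconds: float = 1.0,
) -> Optional[str]:
    """Return the llms.txt body on 200, or None when the file counts as absent."""
    url = f"https://{root_domain}/llms.txt"
    for attempt in range(2):  # the initial request plus one retry
        try:
            status, body = get(url)
        except TimeoutError:
            status, body = None, ""
        if status == 200:
            return body
        if status is not None and status < 500:
            return None  # 404 or other client error: proceed normally
        if attempt == 0:
            time.sleep(backoff_seconds)  # back off before the single retry
    return None  # still failing after the retry: proceed as in 404
```

Callers would record a None result for telemetry and continue with the task.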

Fetching rules

Method: GET
User-Agent: Include civic-llms/0.1 as a token in your existing User-Agent string, e.g. MyAgent/2.4 civic-llms/0.1. This lets state operators see who's consuming the file.
Accept: text/markdown, text/plain;q=0.9, */*;q=0.1
Timeout: 5 seconds for the initial request. Do not block other work on this.
Redirects: Follow up to 3, but only if they stay within the same root domain. Cross-domain redirects MUST be ignored, because an llms.txt that redirects outside the root domain cannot be trusted.
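
The header and redirect rules might look like the following sketch. The function names build_headers and redirect_allowed are hypothetical, and the root domain is passed in by the caller, since deriving it from a hostname in general needs a public suffix list that this sketch does not attempt.

```python
from urllib.parse import urljoin, urlsplit

CIVIC_TOKEN = "civic-llms/0.1"


def build_headers(existing_user_agent: str) -> dict:
    """Append the civic-llms token to the agent's own User-Agent, once."""
    ua = existing_user_agent.strip()
    if CIVIC_TOKEN not in ua.split():
        ua = f"{ua} {CIVIC_TOKEN}".strip()
    return {
        "User-Agent": ua,
        "Accept": "text/markdown, text/plain;q=0.9, */*;q=0.1",
    }


def redirect_allowed(
    current_url: str, location: str, root_domain: str, hops_so_far: int
) -> bool:
    """Allow a redirect only within root_domain and within the 3-hop limit."""
    if hops_so_far >= 3:
        return False
    # Resolve relative Location values against the URL that redirected.
    target = urljoin(current_url, location)
    host = (urlsplit(target).hostname or "").lower()
    root = root_domain.lower()
    return host == root or host.endswith("." + root)
```

For example, build_headers("MyAgent/2.4") yields the User-Agent string MyAgent/2.4 civic-llms/0.1, and the 5-second timeout would be passed to the HTTP client alongside these headers.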

Caching

Honor Cache-Control and ETag headers if present.
If no caching headers are present, cache the file for 24 hours.
Re-fetch on cache miss or when the user starts a new session.
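
One way to express the freshness rule is sketched below. It is a simplification: only the max-age, no-store, and no-cache directives of Cache-Control are read, and the helper names (ttl_seconds, is_fresh, revalidation_headers) are hypothetical. If-None-Match is the standard HTTP request header for revalidating against a stored ETag.

```python
import re
from typing import Mapping, Optional

DEFAULT_TTL_SECONDS = 24 * 60 * 60  # the 24-hour default when no headers apply


def ttl_seconds(headers: Mapping[str, str]) -> int:
    """Lifetime from Cache-Control max-age, else the 24-hour default."""
    cache_control = ""
    for name, value in headers.items():
        if name.lower() == "cache-control":
            cache_control = value.lower()
    if "no-store" in cache_control or "no-cache" in cache_control:
        return 0  # always go back to the server
    match = re.search(r"max-age\s*=\s*(\d+)", cache_control)
    return int(match.group(1)) if match else DEFAULT_TTL_SECONDS


def is_fresh(fetched_at: float, now: float, headers: Mapping[str, str]) -> bool:
    """True while the cached copy is younger than its lifetime (in seconds)."""
    return now - fetched_at < ttl_seconds(headers)


def revalidation_headers(etag: Optional[str]) -> dict:
    """Conditional-request header for a stored ETag, if there is one."""
    return {"If-None-Match": etag} if etag else {}
```

A 304 response to the conditional request means the cached copy can be reused without downloading the file again.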

Parsing

The file is Markdown. Agents have two consumption patterns:

Raw inclusion (simplest): include the entire text in the agent's system context. Recommended unless context budget is tight.
Structured parsing: extract H1/H2 sections and URLs, treat each ## Section as a category, and use linked URLs as canonical destinations. Useful for routing logic.
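
As a rough illustration of structured parsing, the sketch below pulls out the H1 title and groups Markdown links under each ## section. It assumes links use the inline [text](url) form and ignores links that appear before the first ## heading; parse_llms_txt and LINK_RE are hypothetical names, and a production agent may prefer a full Markdown parser.

```python
import re
from typing import Dict, List, Tuple

# Inline Markdown links of the form [text](url).
LINK_RE = re.compile(r"\[([^\]]+)\]\((\S+?)\)")


def parse_llms_txt(text: str) -> Tuple[str, Dict[str, List[Tuple[str, str]]]]:
    """Return (title, {section name: [(link text, url), ...]})."""
    title = ""
    sections: Dict[str, List[Tuple[str, str]]] = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith("## "):
            # Each H2 opens a new category.
            current = stripped[3:].strip()
            sections.setdefault(current, [])
        elif stripped.startswith("# ") and not title:
            title = stripped[2:].strip()
        elif current is not None:
            sections[current].extend(LINK_RE.findall(stripped))
    return title, sections
```

Routing logic can then look up a category by name and treat its URLs as the canonical destinations for that topic.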

Agents SHOULD treat any URL listed in the llms.txt as authoritative for the task or topic it's associated with, overriding inferences from search results or sitemaps.

Agents SHOULD also respect explicit usage guidance in the file. For example, Maryland's llms.txt says: "do not infer legal, policy, or eligibility determinations beyond published content." An agent that ignores that and hallucinates eligibility rules is non-conformant.

Fallback

When llms.txt is absent (54 of 56 US states and territories as of this draft), the agent falls back to standard navigation:

GET /sitemap.xml for URL discovery
GET /robots.txt for crawl restrictions
Heuristic navigation (search box, top-level nav, etc.)

The absence of llms.txt MUST NOT cause the agent to refuse the task.
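
The first two fallback steps could be combined roughly as follows, using only the Python standard library. The sketch assumes the agent has already downloaded sitemap.xml and robots.txt as strings and that the sitemap is a plain urlset rather than a sitemap index; sitemap_urls and allowed_urls are hypothetical helper names.

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(sitemap_xml: str) -> list:
    """Collect every <loc> URL from a sitemap document."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]


def allowed_urls(urls, robots_txt: str, user_agent: str) -> list:
    """Keep only the URLs that robots.txt permits for this user agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if parser.can_fetch(user_agent, url)]
```

If either file is missing as well, the agent moves on to heuristic navigation rather than refusing the task.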

Telemetry

Agents SHOULD log which in-scope domains served llms.txt and which didn't, including timestamps. Aggregated absence data is what creates pressure on states to publish.
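
This spec does not define a log format. As one possible shape, the sketch below emits a JSON line per check; the field names and the telemetry_record helper are assumptions made for illustration only.

```python
import json
from datetime import datetime, timezone
from typing import Optional


def telemetry_record(domain: str, status: Optional[int], checked_at: datetime) -> str:
    """Serialize one llms.txt check as a JSON line (status None means timeout)."""
    record = {
        "domain": domain,
        "llms_txt_present": status == 200,
        "http_status": status,
        # Normalize to UTC so records from different agents are comparable.
        "checked_at": checked_at.astimezone(timezone.utc).isoformat(),
    }
    return json.dumps(record, sort_keys=True)
```

Appending these lines to a file gives a per-domain history that is straightforward to aggregate later.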

If you publish that data, link it from your project. The state-llms.keith.is tracker aggregates known publishers across US states and territories.

Declaring compliance

A framework or agent that implements this convention SHOULD:

Include civic-llms/0.1 in its User-Agent
Document the behavior in user-facing docs
Expose a way for users to disable the check (some agents have legitimate reasons to bypass it — testing, archival crawls, etc.)

Reference implementation

Coming soon — MCP-based SNAP portal agent. Link will land here.

Open questions

These are intentionally unresolved in v0.1:

llms-full.txt. Should agents prefer the longer variant when available? Likely yes for context-rich tasks, no for routing.
.md endpoints. The base llms.txt spec proposes that pages also be served as Markdown at the same URL with .md appended ({url}.md). Few gov sites do this today; worth revisiting if adoption grows.
Per-agency llms.txt. A state portal links to dozens of agency sub-sites. Should each have its own? Almost certainly, but the convention here only covers root domains.
Authentication. Higher-trust agents (for example, those operating on behalf of caseworkers) may need authenticated access to richer guidance. Out of scope for v0.1.

Changelog

2026-05-27 — v0.1 draft published.