GEO Technical Architecture

Blueprint-level documentation of how Geolocus implements Generative Engine Optimization at the infrastructure layer. Every pattern described here is deployed in production and validated by real AI crawler traffic.

For the scoring framework these systems support, see the 8-Signal GEO Methodology.

The Clean-Room HTML Pattern

The core architectural insight of GEO is that AI crawlers and human visitors have fundamentally different rendering capabilities. Human visitors run browsers with full JavaScript engines. AI crawlers send HTTP requests and parse the HTML response — they do not execute JavaScript, hydrate React components, or wait for client-side API calls to complete.

The clean-room HTML pattern solves this with a dual-path architecture: bot detection at the edge determines whether the request comes from an AI crawler or a human browser. Bots receive fully-rendered semantic HTML from Supabase edge functions. Humans receive the interactive single-page application (React + Vite) with client-side rendering. Same content, optimized delivery.

The term "clean-room" reflects the design philosophy: bot-facing HTML is built from scratch with zero framework dependencies. No React, no Tailwind runtime, no hydration scripts, no external CSS files. Every byte in the response serves a purpose that an AI crawler can consume.
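The dual-path routing decision can be sketched as a pure function. This is a minimal illustration, not the production middleware: the signature list is a small illustrative subset, and the `/api/html` query-parameter route is an assumed shape for the proxy handoff.

```typescript
// Minimal sketch of edge bot detection: classify by User-Agent substring,
// then choose a render path. Signatures shown are a tiny illustrative subset.
const BOT_SIGNATURES = ["gptbot", "claudebot", "perplexitybot"];

function isAiBot(userAgent: string): boolean {
  const ua = userAgent.toLowerCase();
  return BOT_SIGNATURES.some((sig) => ua.includes(sig));
}

// Bots get the clean-room HTML proxy route; humans get the SPA shell.
function routeFor(userAgent: string, path: string): string {
  return isAiBot(userAgent)
    ? `/api/html?path=${encodeURIComponent(path)}` // hypothetical proxy route
    : "/index.html";                               // SPA entry point
}
```

The substring check is deliberately cheap: it runs on every request at the edge, so the matcher must add negligible latency.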

Request Flow Diagram

  Client Request
       |
       v
  Vercel Edge (middleware.js)
       |
       +-- User-Agent Analysis
       |
       +--[Bot detected]------> api/html.js (Vercel proxy)
       |                              |
       |                              v
       |                     Supabase Edge Function
       |                     (serve-bot-*-html)
       |                              |
       |                              +-- Render clean-room HTML
       |                              +-- Embed JSON-LD structured data
       |                              +-- Log bot visit (fire-and-forget)
       |                              |
       |                              v
       |                     Return: semantic HTML
       |                     Cache-Control: s-maxage=43200
       |
       +--[Human detected]---> SPA (React + Vite)
                                      |
                                      v
                               Client-side rendering
                               Interactive experience

The proxy layer (api/html.js) handles Content-Type normalization, CORS headers, and CDN cache coordination between Vercel's edge and Supabase's edge.
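The proxy's normalization step might look like the following sketch. The function name and fallback cache policy are assumptions; the source only states that the proxy normalizes Content-Type, CORS, and cache headers between the two edges.

```typescript
// Sketch of the proxy's header normalization (names assumed).
function normalizeBotResponse(upstream: Response): Response {
  const headers = new Headers(upstream.headers);
  // Force the HTML content type regardless of what the origin sent.
  headers.set("Content-Type", "text/html; charset=utf-8");
  // Permissive CORS so any agent can fetch the rendered page.
  headers.set("Access-Control-Allow-Origin", "*");
  // Preserve the origin's cache policy; fall back to the static tier.
  if (!headers.has("Cache-Control")) {
    headers.set("Cache-Control", "public, max-age=0, s-maxage=43200, stale-while-revalidate=60");
  }
  return new Response(upstream.body, { status: upstream.status, headers });
}
```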

Edge Function Architecture

Each page has its own dedicated Deno edge function deployed to Supabase (powered by Deno Deploy). Edge functions execute at the CDN layer, not on a centralized server, delivering sub-100ms response times globally. There is no cold start penalty for frequently accessed functions because Deno Deploy keeps them warm.

Function Structure

Every serve-bot-*-html function follows an identical pattern built on three shared helpers:

  • _shared/layout.ts — HTML document wrapper with navbar, footer, design tokens, and default Organization JSON-LD. Accepts title, body content, active nav path, and optional description/JSON-LD overrides.
  • _shared/response.ts — Response builders with standardized Cache-Control, CORS headers, and Content-Type. Exports htmlResponse(), errorResponse(), and handleOptions().
  • _shared/log-bot-visit.ts — Fire-and-forget bot crawl logger. Extracts User-Agent from request headers, detects bot identity via pattern matching, and inserts a row into bot_crawl_logs without blocking the response.
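The response builders in _shared/response.ts might look like this sketch. Exact signatures, header values, and defaults are assumptions inferred from the helper names and the caching strategy described later.

```typescript
// Sketch of the _shared/response.ts builders (signatures assumed).
const CORS_HEADERS = {
  "Access-Control-Allow-Origin": "*",
  "Access-Control-Allow-Headers": "authorization, x-client-info, apikey, content-type",
};

function htmlResponse(html: string, sMaxAge = 43200): Response {
  return new Response(html, {
    status: 200,
    headers: {
      ...CORS_HEADERS,
      "Content-Type": "text/html; charset=utf-8",
      "Cache-Control": `public, max-age=0, s-maxage=${sMaxAge}, stale-while-revalidate=60`,
    },
  });
}

function errorResponse(message: string, status = 500): Response {
  // no-store: a broken render must never persist at the CDN edge.
  return new Response(message, {
    status,
    headers: { ...CORS_HEADERS, "Content-Type": "text/plain", "Cache-Control": "no-store" },
  });
}

function handleOptions(): Response {
  // CORS preflight: empty body, nothing to cache.
  return new Response(null, { status: 204, headers: CORS_HEADERS });
}
```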

The 10 Bot-Facing Pages

Geolocus deploys 10 marketing page edge functions, each with its own JSON-LD schema type:

  Path            Edge Function                    JSON-LD Type
  /               serve-bot-home-html              Organization
  /services       serve-bot-services-html          Service
  /methodology    serve-bot-methodology-html       WebPage
  /architecture   serve-bot-architecture-html      TechArticle
  /whitepaper     serve-bot-whitepaper-html        ScholarlyArticle
  /about          serve-bot-about-html             AboutPage
  /faq            serve-bot-faq-html               FAQPage
  /case-studies   serve-bot-case-studies-html      Article
  /crawl-stats    serve-bot-crawl-stats-html       WebPage
  /contact        serve-bot-contact-html           ContactPage


In addition to marketing pages, three internal edge functions handle operational tasks: send-email (Gmail OAuth transactional email), geo-audit-email (automated daily GEO audit reports), and health-check-daily (infrastructure health monitoring).

Bot Crawl Telemetry

Every bot visit to every page is logged in real time. The telemetry pipeline works as follows:

  1. Detection: The log-bot-visit.ts helper extracts the User-Agent header (checking x-forwarded-user-agent first for proxy transparency) and matches it against a comprehensive pattern library of known AI crawler signatures.
  2. Logging: Identified bot visits are inserted into bot_crawl_logs with bot name, page path, full user agent string (truncated to 500 chars), and server timestamp. The insert is fire-and-forget — it adds zero latency to the HTML response.
  3. Aggregation: The bot_crawl_hourly table aggregates raw logs into hourly counts by bot name, page path, and source. This powers dashboards and trend analysis without querying the high-volume raw log table.
  4. Analysis: Bot-by-bot breakdowns reveal which AI systems are most actively crawling, which pages they prioritize, crawl frequency trends over time, and whether new AI crawlers are discovering the site.
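The fire-and-forget insert in step 2 can be sketched as follows. The insert function is injected here for illustration; in production it would write a row to bot_crawl_logs (the exact helper signature is an assumption).

```typescript
type CrawlRow = { bot: string; path: string; ua: string };

// insert is injected; in production it writes to the bot_crawl_logs table.
function logBotVisit(
  insert: (row: CrawlRow) => Promise<void>,
  bot: string,
  path: string,
  userAgent: string,
): void {
  const row: CrawlRow = { bot, path, ua: userAgent.slice(0, 500) }; // truncate UA to 500 chars
  // Start the insert but never await it: zero latency added to the response.
  insert(row).catch(() => {
    // Swallow failures; telemetry must never break HTML delivery.
  });
}
```

Because the returned promise is neither awaited nor chained into the response path, a slow or failed database write cannot delay the HTML the crawler receives.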

Recognized AI Crawlers

  • OpenAI: GPTBot, ChatGPT-User, OAI-SearchBot
  • Anthropic: ClaudeBot, Claude-Web, anthropic-ai
  • Google: Google-Extended, Googlebot
  • Perplexity: PerplexityBot, Perplexity-User
  • Microsoft: Bingbot, BingPreview
  • Others: Applebot, Amazonbot, Meta-ExternalAgent, YouBot, DuckAssistBot, cohere-ai, CCBot
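The signature matching behind this list can be sketched as an ordered pattern table. This is an illustrative subset, not the production library, and unknown agents fall through to the human path.

```typescript
// Illustrative subset of the crawler signature library.
const CRAWLER_PATTERNS: Array<[string, RegExp]> = [
  ["GPTBot", /gptbot/i],
  ["ChatGPT-User", /chatgpt-user/i],
  ["ClaudeBot", /claudebot/i],
  ["Google-Extended", /google-extended/i],
  ["PerplexityBot", /perplexitybot/i],
  ["Bingbot", /bingbot/i],
  ["CCBot", /ccbot/i],
];

function identifyCrawler(userAgent: string): string | null {
  for (const [name, pattern] of CRAWLER_PATTERNS) {
    if (pattern.test(userAgent)) return name;
  }
  return null; // unknown agent: treated as a human browser upstream
}
```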

Structured Data Layer

JSON-LD structured data is generated server-side within each edge function and embedded directly in the clean-room HTML response. This guarantees that structured data is always present when AI crawlers parse the page — unlike client-side injection approaches that fail when JavaScript is not executed.

Each page's JSON-LD uses the most specific Schema.org type applicable to its content:

  • Organization: Homepage and brand pages — name, URL, description, contact points
  • WebSite: Site-level metadata paired with Organization for search and AI indexing
  • Service: Service offering pages with provider, name, description, and service type
  • TechArticle: Technical documentation with proficiency level and about topics
  • ScholarlyArticle: Research and thought-leadership content with named authors
  • FAQPage: Question/Answer pairs structured for direct extraction by AI systems
  • ContactPage: Contact information with organization details
  • DefinedTermSet: Glossary and framework definitions (e.g., the 8 GEO Signals)

The layout helper provides a default Organization JSON-LD fallback, ensuring that even if a specific function omits custom structured data, the page still has valid machine-readable metadata.
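The fallback described above might be implemented like this sketch. The default schema's field values are placeholders, and the escaping detail is a standard precaution rather than something stated in the source.

```typescript
// Default Organization schema used when a page passes no custom JSON-LD.
// Field values here are placeholders, not the production metadata.
const DEFAULT_ORG_JSONLD: Record<string, unknown> = {
  "@context": "https://schema.org",
  "@type": "Organization",
  name: "Geolocus",
};

function jsonLdScript(schema?: Record<string, unknown>): string {
  const data = schema ?? DEFAULT_ORG_JSONLD; // fall back to Organization
  // Escape "</" so the JSON payload cannot close the script tag early.
  const json = JSON.stringify(data).replace(/<\//g, "<\\/");
  return `<script type="application/ld+json">${json}</script>`;
}
```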

CDN Caching Strategy

Bot-facing pages use a two-tier caching strategy that balances content freshness (more frequent crawls see updated content sooner) with edge efficiency (reduced origin requests):

Tier 1: Static Marketing Pages

Cache-Control: public, max-age=0, s-maxage=43200, stale-while-revalidate=60
12-hour CDN edge cache with no browser caching (max-age=0). After a redeploy, updated content reaches crawlers once the edge cache expires or revalidates. The 60-second stale-while-revalidate window provides seamless transitions during deploys.

Tier 2: Dynamic Data Pages

Cache-Control: public, max-age=0, s-maxage=60, stale-while-revalidate=30
60-second CDN cache for pages with live data (e.g., crawl-stats). Ensures AI crawlers see near-real-time data without hammering the database on every request.

Error Responses

Cache-Control: no-store
Error responses are never cached, ensuring a broken render does not persist at the edge and poison subsequent crawler visits.

The Vercel proxy at api/html.js coordinates caching between Vercel's edge CDN and Supabase's edge function layer, ensuring consistent Cache-Control headers reach the final response.
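The three policies above can be expressed as constants with a simple selector. The constant and function names are assumptions; the header values come directly from the tiers described.

```typescript
// The three cache tiers, as documented above (names assumed).
const CACHE_STATIC = "public, max-age=0, s-maxage=43200, stale-while-revalidate=60";
const CACHE_DYNAMIC = "public, max-age=0, s-maxage=60, stale-while-revalidate=30";
const CACHE_ERROR = "no-store";

function cacheControlFor(path: string, isError: boolean): string {
  if (isError) return CACHE_ERROR; // never cache a broken render
  // /crawl-stats serves live data; other bot-facing pages are static.
  return path.startsWith("/crawl-stats") ? CACHE_DYNAMIC : CACHE_STATIC;
}
```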

Monitoring Stack: 10 GEO Infrastructure Tables

The entire GEO infrastructure is monitored through 10 purpose-built Postgres tables that track every dimension of AI visibility, from raw crawl logs to composite scoring:

  • geo_ledger_entries — Immutable audit trail of every GEO optimization action: what changed, when, impact score, responsible agent
  • geo_signal_status — Current PASS/FAIL/PARTIAL state of each of the 8 GEO signals per monitored site
  • geo_score_dimensions — Dimension-level scores (0-100) with weights for the 7 active composite categories
  • bot_crawl_logs — Raw log of every AI bot visit: bot name, page path, user agent, timestamp
  • bot_crawl_hourly — Hourly aggregated crawl counts by bot, page, and source for efficient trend queries
  • site_health_checks — TTFB, status codes, SSL validity, and response size for monitored endpoints
  • health_monitor_runs — Execution log of health monitoring jobs: start time, duration, pass/fail counts
  • health_check_daily_runs — Daily rollup of health check results for long-term uptime and performance trending
  • cron_heartbeats — Heartbeat pings from scheduled cron jobs, confirming automated tasks run on schedule
  • middleware_heartbeat — Vercel middleware liveness signal, confirming bot detection is active at the edge

Automated Audit Pipeline

Daily at 06:00 MST, a pg_cron job triggers the geo-audit-email edge function, which:

  1. Queries all 10 infrastructure tables for the trailing 24-hour window
  2. Computes the current GEO composite score from dimension weights
  3. Identifies notable pages (highest/lowest crawl volume, new bot appearances)
  4. Generates a terse, data-dense email report
  5. Sends via Gmail OAuth through the send-email edge function
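Step 2 — the composite score from dimension weights — can be sketched as a weighted average. The dimension shape mirrors the geo_score_dimensions table described earlier; the example values below are illustrative, not real scores.

```typescript
// Weighted composite from dimension scores, as stored in geo_score_dimensions.
interface Dimension {
  name: string;
  score: number;  // 0-100 dimension score
  weight: number; // relative weight
}

function compositeScore(dimensions: Dimension[]): number {
  const totalWeight = dimensions.reduce((sum, d) => sum + d.weight, 0);
  if (totalWeight === 0) return 0; // no active dimensions
  const weighted = dimensions.reduce((sum, d) => sum + d.score * d.weight, 0);
  return Math.round(weighted / totalWeight);
}
```

For example, a dimension scoring 80 with weight 2 and one scoring 50 with weight 1 yield a composite of (160 + 50) / 3 = 70.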

Explore the Framework

Understand the scoring methodology behind this architecture, or read the whitepaper on why GEO infrastructure is the defining investment of the AI era.

  • 8-Signal Methodology →
  • Read the Whitepaper →
  • Live Crawl Stats →
  • Get a GEO Audit →