Core Concepts

Website Crawling

How hej! discovers and extracts content from your website.

The crawling process is the foundation of your AI chatbot's knowledge. Our intelligent crawler visits your website, discovers pages, and extracts the content that will power your AI assistant's responses.

How Crawling Works

1. Initial Discovery

Starting from your homepage, the crawler discovers all linked pages within your domain. It builds a sitemap of your entire website structure.
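The discovery step is essentially a breadth-first traversal of same-domain links. A minimal sketch in Python (the `fetch` callback and the in-memory example site are illustrative stand-ins for real HTTP requests):

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href targets from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def discover(start_url, fetch):
    """Breadth-first discovery of same-domain pages.
    `fetch(url)` returns the page's HTML (an HTTP GET in practice)."""
    domain = urlparse(start_url).netloc
    seen, queue = {start_url}, deque([start_url])
    while queue:
        url = queue.popleft()
        parser = LinkExtractor()
        parser.feed(fetch(url))
        for href in parser.links:
            absolute = urljoin(url, href)  # resolve relative links
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen

# Tiny in-memory "site" standing in for real HTTP fetches.
site = {
    "https://example.com/": '<a href="/about">About</a><a href="/pricing">Pricing</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/pricing": '<a href="https://other.com/">External</a>',
}
pages = discover("https://example.com/", site.__getitem__)
```

Note how the external link to `other.com` is discarded: discovery stays within your domain.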

2. AI Page Selection

An LLM analyzes the discovered pages and selects which ones to include in your knowledge base, prioritizing quality over quantity.
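The selection step can be pictured as prompting a model with the discovered pages and mapping its reply back to URLs. The prompt wording, selection criteria, and reply format below are illustrative assumptions, not hej!'s actual implementation:

```python
def build_selection_prompt(pages):
    """Assemble the kind of prompt a page-selection LLM might receive.
    The wording is illustrative; hej!'s actual prompt is not public."""
    listing = "\n".join(f"{i}: {url} ({title})" for i, (url, title) in enumerate(pages))
    return (
        "Select the pages most useful for answering customer questions. "
        "Skip login, legal, and pagination pages. "
        "Reply with the numbers of the pages to keep, comma-separated.\n\n"
        + listing
    )

def parse_selection(reply, pages):
    """Map a comma-separated model reply back to page URLs."""
    keep = {int(tok) for tok in reply.split(",") if tok.strip().isdigit()}
    return [url for i, (url, _) in enumerate(pages) if i in keep]

pages = [
    ("https://example.com/pricing", "Pricing"),
    ("https://example.com/login", "Log in"),
    ("https://example.com/docs/setup", "Setup guide"),
]
prompt = build_selection_prompt(pages)
selected = parse_selection("0, 2", pages)  # a plausible model reply
```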

3. Content Extraction

For each selected page, we extract the main content, removing navigation, footers, ads, and other boilerplate elements.
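Conceptually, extraction walks the HTML and drops text found inside boilerplate elements. A simplified, stdlib-only sketch (real extractors use far more signals than tag names alone):

```python
from html.parser import HTMLParser

BOILERPLATE = {"nav", "header", "footer", "aside", "script", "style"}

class MainContentExtractor(HTMLParser):
    """Keeps text outside boilerplate elements; a stand-in for the
    extraction pipeline described above."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # >0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_text(html):
    parser = MainContentExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)

html = ("<nav>Home | About</nav>"
        "<main><h1>Pricing</h1><p>Plans start at ten euros.</p></main>"
        "<footer>© Example</footer>")
text = extract_text(html)
```

The navigation and footer text never reaches the knowledge base; only the main content does.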

4. Indexing & Embedding

Extracted content is processed, chunked, and converted into vector embeddings for semantic search capabilities.
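The chunking-and-search mechanics can be sketched with a toy bag-of-words embedding. The window sizes and vocabulary-based `embed` below are illustrative stand-ins for token-based chunking and a learned embedding model, but the retrieval logic (cosine similarity over vectors) is the same idea:

```python
import math

def chunk_words(text, size=8, overlap=2):
    """Split text into overlapping word windows (real chunkers are
    usually token-based and respect document structure)."""
    words = text.split()
    step = size - overlap
    chunks = []
    for i in range(0, len(words), step):
        chunks.append(" ".join(words[i:i + size]))
        if i + size >= len(words):
            break
    return chunks

def embed(text, vocab):
    """Toy bag-of-words vector over a fixed vocabulary, L2-normalized.
    A production pipeline uses a learned embedding model instead."""
    vec = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def best_chunk(query, chunks, vocab):
    """Pick the chunk most similar to the query (the dot product equals
    cosine similarity because the vectors are unit length)."""
    q = embed(query, vocab)
    return max(chunks, key=lambda c: sum(a * b for a, b in zip(q, embed(c, vocab))))

text = ("Our pricing starts at ten euros per month for the Basic plan. "
        "Contact support by email at any time if you need help.")
chunks = chunk_words(text)
vocab = {w: i for i, w in enumerate(sorted({w for c in chunks for w in c.lower().split()}))}
answer = best_chunk("pricing per month", chunks, vocab)
```

A question about pricing retrieves the pricing chunk rather than the support chunk, which is exactly how semantic search narrows a whole site down to the passages the AI should answer from.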

Page Limits by Plan

Each plan includes a maximum number of indexed pages. These limits help ensure optimal performance and relevance of your knowledge base.

Plan        Max Pages     Recrawl Frequency
Basic       200 pages     Weekly
Start       500 pages     3x weekly
Business    2,000 pages   Daily
Pro         5,000 pages   Daily + priority pages
Enterprise  Unlimited     Real-time

Priority Pages

What are Priority Pages?

Priority pages are frequently updated pages (like news, pricing, or contact info) that get recrawled more often than your regular content. This ensures your AI always has the latest information.

Pro and Enterprise plans include priority page support with daily or real-time updates.
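Conceptually, the scheduler just assigns shorter recrawl intervals to priority pages. The sketch below derives base intervals from the plan table; the one-hour priority interval is a hypothetical value standing in for "more often than regular content":

```python
from datetime import datetime, timedelta

# Base intervals taken from the plan table above.
PLAN_INTERVAL = {
    "basic": timedelta(weeks=1),
    "start": timedelta(days=7) / 3,  # 3x weekly
    "business": timedelta(days=1),
    "pro": timedelta(days=1),
}
PRIORITY_INTERVAL = timedelta(hours=1)  # hypothetical priority interval

def next_recrawl(last_crawled, plan, priority=False):
    """When a page is next due for a recrawl. Enterprise is real-time,
    so it is always due; Pro priority pages use the shorter interval."""
    if plan == "enterprise":
        return last_crawled
    if priority and plan == "pro":
        return last_crawled + PRIORITY_INTERVAL
    return last_crawled + PLAN_INTERVAL[plan]

last = datetime(2024, 1, 1, 12, 0)
```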

Re-crawling Your Site

When your website content changes, you'll want to update your AI's knowledge base. You can trigger a re-crawl in the following ways:

Manual Re-crawl

Trigger a re-crawl at any time from your Studio dashboard. This is useful after major content updates.

Scheduled Re-crawl

Automatic re-crawls happen based on your plan's frequency (weekly to real-time).

Best Practices

Ensure pages are publicly accessible

Our crawler can only index pages that don't require authentication.

Use semantic HTML

Proper heading structure and semantic elements help with content extraction.

Keep content focused

Pages with clear, focused content produce better AI responses.

Avoid blocking our crawler

Make sure your robots.txt allows our crawler (User-agent: hej-bot).
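For example, a robots.txt that explicitly allows hej-bot while keeping your other rules in place might look like this (the /admin/ rule is just an illustration of an existing rule):

```text
# Allow the hej! crawler everywhere
User-agent: hej-bot
Allow: /

# Existing rules for all other crawlers
User-agent: *
Disallow: /admin/
```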
