Core Concepts
Website Crawling
How hej! discovers and extracts content from your website.
The crawling process is the foundation of your AI chatbot's knowledge. Our intelligent crawler visits your website, discovers pages, and extracts the content that will power your AI assistant's responses.
How Crawling Works
1. Initial Discovery
Starting from your homepage, the crawler discovers all linked pages within your domain. It builds a sitemap of your entire website structure.
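Conceptually, link discovery boils down to collecting same-domain links from each fetched page and following them breadth-first. A minimal sketch of that per-page step, using only the Python standard library (function and class names here are illustrative, not hej!'s actual internals):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def discover_links(page_url, html, domain):
    """Return absolute, same-domain links found on one page."""
    parser = LinkExtractor()
    parser.feed(html)
    found = set()
    for href in parser.links:
        absolute = urljoin(page_url, href)
        if urlparse(absolute).netloc == domain:
            found.add(absolute.split("#")[0])  # drop fragments
    return found
```

Running this over every newly discovered page until no new URLs appear yields the site map used in the next step.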
2. AI Page Selection
An LLM analyzes the discovered pages and selects the ones worth including in your knowledge base, prioritizing substantive, unique content over raw page count.
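The selection step essentially presents the candidate pages to a language model and asks it to choose. The LLM call itself is internal to hej!, but a hedged sketch of assembling such a selection prompt (all names and wording here are hypothetical) might look like:

```python
def build_selection_prompt(pages, max_pages):
    """Assemble a page-selection prompt from (url, title) candidates.

    Illustrative only: the actual prompt and model used by hej!
    are internal and likely differ.
    """
    lines = [
        f"Select up to {max_pages} pages that best represent this site's",
        "substantive content. Exclude tag archives, login pages, and",
        "near-duplicate listings. Candidates:",
    ]
    for url, title in pages:
        lines.append(f"- {url} | {title}")
    return "\n".join(lines)
```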
3. Content Extraction
For each selected page, we extract the main content, removing navigation, footers, ads, and other boilerplate elements.
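One common way to strip boilerplate is to drop the text inside elements like nav and footer while keeping everything else. A rough stdlib-only sketch of that idea (the production extractor is certainly more sophisticated):

```python
from html.parser import HTMLParser

# Elements whose text is treated as boilerplate in this sketch.
BOILERPLATE = {"nav", "footer", "aside", "script", "style"}

class MainContentExtractor(HTMLParser):
    """Collects text that sits outside common boilerplate elements."""
    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # >0 while inside a boilerplate element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())

def extract_main_text(html):
    parser = MainContentExtractor()
    parser.feed(html)
    return " ".join(parser.chunks)
```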
4. Indexing & Embedding
Extracted content is processed, chunked, and converted into vector embeddings for semantic search capabilities.
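The chunking step splits each page's text into overlapping windows so that semantically related sentences stay together across chunk boundaries. A minimal sketch, assuming word-based chunks (the sizes and the word-level granularity are illustrative, not hej!'s actual parameters):

```python
def chunk_text(text, size=400, overlap=50):
    """Split text into overlapping word-based chunks for embedding.

    `size` and `overlap` are word counts chosen for illustration;
    each chunk would then be converted into a vector embedding.
    """
    words = text.split()
    if not words:
        return []
    chunks = []
    step = size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks
```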
Page Limits by Plan
Each plan includes a maximum number of indexed pages. These limits help ensure optimal performance and relevance of your knowledge base.
| Plan | Max Pages | Recrawl Frequency |
|---|---|---|
| Basic | 200 pages | Weekly |
| Start | 500 pages | 3x weekly |
| Business | 2,000 pages | Daily |
| Pro | 5,000 pages | Daily + Priority pages |
| Enterprise | Unlimited | Real-time |
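In effect, the crawler caps how many selected pages get indexed at your plan's limit. A small sketch of that rule, assuming the limits in the table above (the data structure and names are illustrative):

```python
# Page limits from the plan table; None means unlimited (Enterprise).
PLAN_LIMITS = {
    "basic": 200,
    "start": 500,
    "business": 2000,
    "pro": 5000,
    "enterprise": None,
}

def pages_to_index(plan, discovered):
    """Cap the number of indexed pages at the plan's limit."""
    limit = PLAN_LIMITS[plan]
    return discovered if limit is None else min(discovered, limit)
```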
Priority Pages
What are Priority Pages?
Priority pages are frequently updated pages (like news, pricing, or contact info) that get recrawled more often than your regular content. This ensures your AI always has the latest information.
Pro and Enterprise plans include priority page support with daily or real-time updates.
Re-crawling Your Site
When your website content changes, you'll want to update your AI's knowledge base. There are several ways to trigger a re-crawl:
Manual Re-crawl
Trigger a re-crawl at any time from your Studio dashboard. This is useful after major content updates.
Scheduled Re-crawl
Automatic re-crawls happen based on your plan's frequency (weekly to real-time).
Best Practices
Ensure pages are publicly accessible
Our crawler can only index pages that don't require authentication.
Use semantic HTML
Proper heading structure and semantic elements help with content extraction.
Keep content focused
Pages with clear, focused content produce better AI responses.
Avoid blocking our crawler
Make sure your robots.txt allows our crawler (User-agent: hej-bot).
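You can verify your robots.txt rules before the next crawl with Python's standard-library robots parser. A small sketch checking whether a given robots.txt body permits hej-bot to fetch a URL (the function name is illustrative):

```python
from urllib.robotparser import RobotFileParser

def allows_hej_bot(robots_txt, url):
    """Check whether a robots.txt body permits hej-bot to fetch a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("hej-bot", url)
```

For example, a robots.txt that only disallows a private directory still leaves your public pages crawlable by hej-bot.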