Built-in Fetchers
- requests: Fast HTTP client for static pages
- playwright: Full browser with JavaScript support
- puppeteer: Alternative browser automation (legacy)
Fetcher Architecture
All fetchers inherit from the Fetcher base class and implement a standard interface:
content_fetchers/base.py
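The listing for this file is not reproduced here. As a rough sketch of what such a base class could look like, the example below assumes a `run()` method that populates result attributes and a `quit()` method for cleanup. The attribute names (`content`, `status_code`, `headers`, `error`) and the `EchoFetcher` subclass are illustrative assumptions, not the project's actual interface.

```python
from abc import ABC, abstractmethod
from typing import Optional


class Fetcher(ABC):
    """Hypothetical base class; attribute and method names are illustrative."""

    def __init__(self) -> None:
        self.content: Optional[str] = None
        self.status_code: Optional[int] = None
        self.headers: dict = {}
        self.error: Optional[str] = None

    @abstractmethod
    def run(self, url: str, timeout: float = 60.0) -> None:
        """Fetch `url` and populate content, status_code and headers."""

    def quit(self) -> None:
        """Release any resources held by the fetcher (no-op by default)."""


class EchoFetcher(Fetcher):
    """Trivial subclass used only to show the contract being honoured."""

    def run(self, url: str, timeout: float = 60.0) -> None:
        self.content = f"fetched {url}"
        self.status_code = 200
```

Because `run()` is abstract, instantiating `Fetcher` directly raises `TypeError`, which catches subclasses that forget to implement it.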
requests Fetcher (HTTP Client)
The simplest and fastest fetcher, built on the Python requests library.
Features
- Fast HTTP/HTTPS requests
- Cookie support
- Custom headers and authentication
- Proxy support (HTTP, HTTPS, SOCKS5)
- file:// URL support
- Brotli compression
When to Use
- Static HTML pages without JavaScript
- API endpoints returning JSON/XML
- Pages that don’t require authentication
- Maximum performance needed
- Low resource consumption required
Implementation Highlights
From content_fetchers/requests.py:
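The original excerpt is not available here, so the following is only a minimal sketch of how a requests-based fetch with headers, a proxy and a timeout might be put together. The helper names `build_request_kwargs` and `fetch_static` are hypothetical; `requests.request()` and its `headers`, `proxies` and `timeout` arguments are part of the real library.

```python
from typing import Optional


def build_request_kwargs(
    headers: Optional[dict] = None,
    proxy_url: Optional[str] = None,
    timeout: float = 60.0,
) -> dict:
    """Assemble keyword arguments accepted by requests.request()."""
    kwargs = {"headers": dict(headers or {}), "timeout": timeout}
    if proxy_url:
        # requests expects a mapping of URL scheme to proxy URL.
        kwargs["proxies"] = {"http": proxy_url, "https": proxy_url}
    return kwargs


def fetch_static(url: str, **options) -> tuple:
    """Fetch a static page and return (status_code, text)."""
    import requests  # third-party; imported lazily so the helper above has no dependency

    response = requests.request("GET", url, **build_request_kwargs(**options))
    return response.status_code, response.text
```

Note that SOCKS proxy URLs only work when requests is installed with its optional SOCKS extra (`requests[socks]`).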
Configuration
Watch Configuration
playwright Fetcher (Browser)
Full-featured browser automation using Playwright. Recommended for JavaScript-heavy sites.
Features
- JavaScript execution
- Browser Steps support (click, type, wait)
- Visual Selector integration
- Full-page screenshots
- Cookie and localStorage handling
- Network interception
- Multi-browser support (Chromium, Firefox, WebKit)
When to Use
- Single Page Applications (SPAs)
- JavaScript-rendered content
- Pages requiring login/authentication
- Interactive pages with dynamic loading
- Screenshot monitoring needed
- Visual Selector required
Implementation Highlights
From content_fetchers/playwright.py:
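The original excerpt is not available here. As an illustration of the general approach, the sketch below uses Playwright's synchronous API to load a page, wait for network activity to settle, and capture both the rendered HTML and a full-page screenshot. The function names `seconds_to_ms` and `fetch_rendered` are hypothetical, and the project's own fetcher may be structured quite differently (for example, it may use the async API).

```python
def seconds_to_ms(seconds: float) -> int:
    """Playwright timeouts are expressed in milliseconds."""
    return int(seconds * 1000)


def fetch_rendered(url: str, timeout: float = 60.0) -> tuple:
    """Return (html, screenshot_bytes) for a JavaScript-rendered page."""
    from playwright.sync_api import sync_playwright  # third-party, lazy import

    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            page.goto(url, timeout=seconds_to_ms(timeout), wait_until="networkidle")
            html = page.content()
            screenshot = page.screenshot(full_page=True)
        finally:
            browser.close()  # always release the browser process
    return html, screenshot
```

Closing the browser in a `finally` block keeps a failed navigation from leaving a browser process behind.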
Configuration
Watch Configuration
Creating a Custom Fetcher
Step 1: Create Fetcher File
Create a new file in content_fetchers/:
content_fetchers/my_custom_fetcher.py
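A minimal sketch of what this file might contain is shown below. It uses only the standard library so it can run on its own. In a real project the class would subclass the Fetcher base class, and the `run()`/`quit()` signatures and attribute names used here are assumptions.

```python
import urllib.request
from typing import Optional


class MyCustomFetcher:
    """Illustrative fetcher; in a real project it would subclass the Fetcher base class."""

    def __init__(self, user_agent: str = "my-custom-fetcher/0.1") -> None:
        self.user_agent = user_agent
        self.content: Optional[str] = None
        self.status_code: Optional[int] = None

    def run(self, url: str, timeout: float = 60.0) -> None:
        request = urllib.request.Request(url, headers={"User-Agent": self.user_agent})
        # The context manager closes the connection even if decoding fails.
        with urllib.request.urlopen(request, timeout=timeout) as response:
            self.content = response.read().decode("utf-8", errors="replace")
            # file:// responses carry no HTTP status, so treat them as 200.
            self.status_code = getattr(response, "status", None) or 200

    def quit(self) -> None:
        """Nothing persistent to clean up in this sketch."""
```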
Step 2: Register Fetcher
The fetcher is discovered automatically if it is placed in the content_fetchers/ directory. Update __init__.py if needed:
content_fetchers/__init__.py
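How registration actually works depends on the project's discovery mechanism, which is not shown here. One common pattern, sketched below purely as an assumption, is a name-to-class registry filled in by a decorator. `FETCHERS`, `register_fetcher` and `get_fetcher` are hypothetical names.

```python
from typing import Callable, Dict

# Hypothetical registry mapping a fetcher name to its class.
FETCHERS: Dict[str, type] = {}


def register_fetcher(name: str) -> Callable[[type], type]:
    """Class decorator that records a fetcher under `name`."""
    def decorator(cls: type) -> type:
        FETCHERS[name] = cls
        return cls
    return decorator


@register_fetcher("my_custom_fetcher")
class MyCustomFetcher:
    def run(self, url: str, timeout: float = 60.0) -> None:
        raise NotImplementedError


def get_fetcher(name: str) -> type:
    """Look up a registered fetcher, failing loudly on unknown names."""
    try:
        return FETCHERS[name]
    except KeyError:
        raise ValueError(f"Unknown fetcher: {name!r}") from None
```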
Step 3: Use Custom Fetcher
API Example
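The original request example is not reproduced here. The sketch below builds, without sending, an HTTP request that creates a watch with a chosen fetcher. The endpoint path `/api/v1/watch`, the `x-api-key` header and the `fetch_backend` field are assumptions for illustration, so check them against your installation's API reference before relying on them.

```python
import json
import urllib.request


def build_create_watch_request(
    base_url: str, api_key: str, url: str, fetcher: str
) -> urllib.request.Request:
    """Build (but do not send) a request that creates a watch using `fetcher`."""
    payload = {"url": url, "fetch_backend": fetcher}  # field names are assumptions
    return urllib.request.Request(
        f"{base_url.rstrip('/')}/api/v1/watch",  # endpoint path is an assumption
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "x-api-key": api_key},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen()` would send it.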
Advanced Fetcher Examples
API Client Fetcher
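As a rough sketch of this kind of fetcher, the class below reads a bearer token from an environment variable and attaches it to every request. The class name, the `MY_API_TOKEN` variable and the choice of bearer authentication are all illustrative assumptions.

```python
import json
import os
import urllib.request


class ApiClientFetcher:
    """Illustrative fetcher that sends a bearer token with each request."""

    def __init__(self, token_env_var: str = "MY_API_TOKEN") -> None:
        # Read the credential from the environment rather than hard-coding it.
        self.token = os.environ.get(token_env_var, "")
        self.content = None

    def auth_headers(self) -> dict:
        if not self.token:
            raise RuntimeError("API token is not set")
        return {"Authorization": f"Bearer {self.token}", "Accept": "application/json"}

    def run(self, url: str, timeout: float = 60.0) -> None:
        request = urllib.request.Request(url, headers=self.auth_headers())
        with urllib.request.urlopen(request, timeout=timeout) as response:
            # Re-serialise with sorted keys so key reordering is not seen as a change.
            self.content = json.dumps(json.load(response), sort_keys=True, indent=2)
```

Normalising the JSON before storing it avoids false change alerts when an API returns the same data with its keys in a different order.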
Fetcher that authenticates with an API.
RSS/Atom Feed Fetcher
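A feed-oriented fetcher mostly differs in how it post-processes the response. The sketch below, which handles only plain RSS 2.0 (Atom uses XML namespaces and would need extra handling), turns a feed into one line per item so that new entries stand out in a text diff. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(xml_text: str) -> list:
    """Return a list of {title, link} dicts from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": (item.findtext("title") or "").strip(),
            "link": (item.findtext("link") or "").strip(),
        })
    return items


def items_to_text(items: list) -> str:
    """Flatten items to one line each so a text diff highlights new entries."""
    return "\n".join(f"{i['title']} <{i['link']}>" for i in items)
```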
Specialized fetcher for feed parsing.
WebSocket Fetcher
Fetcher for WebSocket connections.
Best Practices
Resource Management
- Always implement quit() to clean up resources
- Use context managers for file handles and connections
- Close browser instances in Playwright/Selenium fetchers
- Set reasonable timeouts
Error Handling
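One reasonable approach, sketched here under the assumption that transient network failures should be retried and everything else should surface immediately, is to wrap the fetch call in a small retry helper. `FetchError` and `fetch_with_retries` are hypothetical names.

```python
import time
from typing import Callable


class FetchError(Exception):
    """Raised when a fetch fails after all retries (hypothetical name)."""


def fetch_with_retries(
    fetch: Callable[[], str], attempts: int = 3, backoff: float = 0.5
) -> str:
    """Call `fetch` until it succeeds or `attempts` is exhausted."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return fetch()
        except OSError as exc:  # covers URLError, ConnectionError and TimeoutError
            last_error = exc
            if attempt < attempts:
                time.sleep(backoff * attempt)  # linear backoff between tries
    raise FetchError(f"gave up after {attempts} attempts: {last_error}") from last_error
```

Catching only `OSError` lets programming errors such as `TypeError` propagate instead of being retried.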
Performance
- Use requests for static content (10-100x faster than browsers)
- Only use browser fetchers when JavaScript is required
- Implement caching for expensive operations
- Use connection pooling for multiple requests
- Set appropriate timeouts (default: 60s)
Security
- Validate and sanitize URLs
- Respect robots.txt (unless monitoring your own sites)
- Implement rate limiting for external APIs
- Never log sensitive data (API keys, passwords)
- Use environment variables for credentials
When to Extend vs Configure
Extend (Create Custom Fetcher) When:
- Need custom authentication mechanism
- Require special protocol support (WebSocket, gRPC, etc.)
- Want to integrate with proprietary APIs
- Need specialized parsing or transformation
- Performance optimization for specific use case
Configure (Use Built-in Fetcher) When:
- Standard HTTP/HTTPS requests sufficient
- JavaScript rendering needed (use Playwright)
- Browser automation required (use Playwright with Browser Steps)
- Custom headers/cookies needed (use requests with headers)
- Proxy configuration needed (use built-in proxy support)
Testing Your Fetcher
Create a test file:
tests/test_my_fetcher.py
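The original test listing is not reproduced here. A minimal sketch using the standard `unittest` module might look like the following. `StubFetcher` is a stand-in so the example runs on its own. In practice you would import your real fetcher class instead and point it at a local fixture rather than a live site.

```python
import unittest


class StubFetcher:
    """Stand-in for the fetcher under test; replace with your real class."""

    def __init__(self) -> None:
        self.content = None
        self.closed = False

    def run(self, url: str, timeout: float = 60.0) -> None:
        if not url.startswith(("http://", "https://", "file://")):
            raise ValueError(f"Unsupported URL scheme: {url}")
        self.content = "<html>ok</html>"

    def quit(self) -> None:
        self.closed = True


class TestMyFetcher(unittest.TestCase):
    def setUp(self) -> None:
        self.fetcher = StubFetcher()

    def tearDown(self) -> None:
        # Mirror production usage: always release resources after a fetch.
        self.fetcher.quit()

    def test_run_populates_content(self) -> None:
        self.fetcher.run("https://example.com")
        self.assertIn("ok", self.fetcher.content)

    def test_rejects_unknown_scheme(self) -> None:
        with self.assertRaises(ValueError):
            self.fetcher.run("ftp://example.com")
```

The tests can be run with `python -m unittest tests.test_my_fetcher` from the project root, assuming `tests` is importable as a package.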
Next Steps
- Processors: Learn about change detection processors
- Browser Steps: Master browser automation