Content fetchers are responsible for retrieving web page content in changedetection.io. The application includes three built-in fetchers, each optimized for different scenarios.

Built-in Fetchers

requests

Fast HTTP client for static pages

playwright

Full browser with JavaScript support

puppeteer

Alternative browser automation (legacy)

Fetcher Architecture

All fetchers inherit from the Fetcher base class and implement a standard interface:
content_fetchers/base.py
class Fetcher:
    """Base class for all content fetchers"""
    
    browser_steps = None
    browser_steps_screenshot_path = None
    content = None
    error = None
    fetcher_description = "No description"
    headers = {}
    instock_data = None
    screenshot = None
    status_code = None
    
    def __init__(self, **kwargs):
        """Initialize fetcher with configuration"""
        pass
    
    def run(self, url, timeout, request_headers, request_body, request_method="GET",
            ignore_status_codes=False, current_include_filters=None):
        """
        Fetch content from URL.
        
        Returns:
            - self.content: Retrieved content (HTML, JSON, etc.)
            - self.headers: Response headers
            - self.status_code: HTTP status code
            - self.screenshot: Screenshot bytes (if supported)
            - self.error: Error message (if any)
        """
        raise NotImplementedError()
    
    def quit(self):
        """Clean up resources (close browser, etc.)"""
        pass
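The calling side of this contract can be sketched as a small driver loop: instantiate a fetcher, call run(), read the result attributes, and always call quit(). EchoFetcher and check_watch below are illustrative names, not part of the codebase; the real update-check logic is more involved.

```python
# A toy fetcher and driver loop showing how the Fetcher contract is consumed.
# EchoFetcher and check_watch are illustrative, not changedetection.io code.

class EchoFetcher:
    """Implements the same attribute contract as Fetcher, without any I/O."""
    content = None
    headers = {}
    status_code = None
    error = None

    def run(self, url, timeout, request_headers, request_body, request_method="GET"):
        self.content = f"<html>fetched {url}</html>"
        self.status_code = 200

    def quit(self):
        pass


def check_watch(f, url):
    # Run a fetch and always release resources, even if run() raises.
    try:
        f.run(url, timeout=60, request_headers={}, request_body=None)
    finally:
        f.quit()
    return f.status_code, f.content
```

Because the interface is just attributes plus run()/quit(), any object that fills in content, status_code, and error can stand in for a fetcher in tests.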

requests Fetcher (HTTP Client)

The simplest and fastest fetcher using the Python requests library.

Features

  • Fast HTTP/HTTPS requests
  • Cookie support
  • Custom headers and authentication
  • Proxy support (HTTP, HTTPS, SOCKS5)
  • File:// URL support
  • Brotli compression

When to Use

  • Static HTML pages without JavaScript
  • API endpoints returning JSON/XML
  • Pages that don’t require a JavaScript-driven login flow
  • Maximum performance needed
  • Low resource consumption required

Implementation Highlights

From content_fetchers/requests.py:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

from .base import Fetcher

class fetcher(Fetcher):
    fetcher_description = "Basic fast Plaintext/HTTP Client"
    
    def run(self, url, timeout, request_headers, request_body, 
            request_method="GET", **kwargs):
        # Configure retries
        retry_strategy = Retry(
            total=3,
            status_forcelist=[429, 500, 502, 503, 504],
            backoff_factor=1
        )
        
        # Setup session with retries
        session = requests.Session()
        adapter = HTTPAdapter(max_retries=retry_strategy)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        
        # Make request
        r = session.request(
            method=request_method,
            url=url,
            headers=request_headers,
            data=request_body,
            timeout=timeout,
            verify=True
        )
        
        # Store results
        self.content = r.text
        self.headers = dict(r.headers)
        self.status_code = r.status_code
        
        return

Configuration

Watch Configuration
{
  "fetch_backend": "html_requests",
  "headers": {
    "User-Agent": "Custom Agent",
    "Accept": "application/json"
  },
  "method": "POST",
  "body": "param=value"
}
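To see what such a watch actually sends on the wire, the same settings can be prepared with requests without touching the network. This is a sketch, not application code; the URL is a placeholder.

```python
import requests

# Build the request the watch configuration above describes, without sending it.
req = requests.Request(
    method="POST",
    url="https://example.com/api",
    headers={"User-Agent": "Custom Agent", "Accept": "application/json"},
    data="param=value",
)
prepared = req.prepare()

print(prepared.method)              # POST
print(prepared.headers["Accept"])   # application/json
print(prepared.body)                # param=value
```

Preparing requests this way is also a convenient pattern for unit-testing header and body handling in a custom fetcher.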

playwright Fetcher (Browser)

Full-featured browser automation using Playwright. Recommended for JavaScript-heavy sites.

Features

  • JavaScript execution
  • Browser Steps support (click, type, wait)
  • Visual Selector integration
  • Full-page screenshots
  • Cookie and localStorage handling
  • Network interception
  • Multi-browser support (Chromium, Firefox, WebKit)

When to Use

  • Single Page Applications (SPAs)
  • JavaScript-rendered content
  • Pages requiring login/authentication
  • Interactive pages with dynamic loading
  • Screenshot monitoring needed
  • Visual Selector required

Implementation Highlights

From content_fetchers/playwright.py:
import os
import multiprocessing
from playwright.sync_api import sync_playwright

def _run_playwright(queue, url, timeout, headers, user_agent, driver_url,
                    browser_steps=None, screenshot_path=None):
    """Run Playwright in an isolated process; results come back via the queue"""
    with sync_playwright() as p:
        # Connect to the remote browser container
        browser = p.chromium.connect(driver_url, timeout=timeout * 1000)
        
        # Create context with custom settings
        context = browser.new_context(
            user_agent=user_agent,
            viewport={'width': 1920, 'height': 1080},
            extra_http_headers=headers
        )
        
        # Open page
        page = context.new_page()
        page.goto(url, wait_until='networkidle', timeout=timeout * 1000)
        
        # Execute Browser Steps if configured (implemented in the Browser Steps module)
        if browser_steps:
            execute_browser_steps(page, browser_steps)
        
        # Get content and screenshot
        content = page.content()
        if screenshot_path:
            page.screenshot(path=screenshot_path, full_page=True)
        
        queue.put(content)

class fetcher(Fetcher):
    fetcher_description = "Playwright Chrome"
    
    def run(self, url, timeout, request_headers, **kwargs):
        # Run in a separate process for isolation
        ctx = multiprocessing.get_context('spawn')
        queue = ctx.Queue()
        
        driver_url = os.getenv('PLAYWRIGHT_DRIVER_URL', 'ws://playwright-chrome:3000')
        process = ctx.Process(
            target=_run_playwright,
            args=(queue, url, timeout, request_headers,
                  request_headers.get('User-Agent'), driver_url)
        )
        process.start()
        process.join(timeout=timeout + 5)
        
        # Get results (don't block forever if the process died)
        self.content = queue.get(timeout=1)
        return

Configuration

Watch Configuration
{
  "fetch_backend": "html_webdriver",
  "browser_steps": [
    {"operation": "goto", "url": "https://example.com/login"},
    {"operation": "type", "selector": "#username", "value": "user"},
    {"operation": "click", "selector": "#submit"},
    {"operation": "wait", "seconds": 2}
  ],
  "webdriver_screenshot": true
}
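The browser_steps list above can be driven by a small dispatcher that maps each operation onto a Playwright page call. This is a hypothetical sketch (the real implementation lives in the Browser Steps module), covering only the four operations used in the example:

```python
def execute_browser_steps(page, steps):
    """Replay configured browser steps against a Playwright-like page object.

    Hypothetical sketch: handles only the goto/type/click/wait operations
    shown in the configuration example above.
    """
    for step in steps:
        op = step["operation"]
        if op == "goto":
            page.goto(step["url"])
        elif op == "type":
            # Playwright's fill() clears the field and types the value
            page.fill(step["selector"], step["value"])
        elif op == "click":
            page.click(step["selector"])
        elif op == "wait":
            # Playwright timeouts are in milliseconds
            page.wait_for_timeout(step["seconds"] * 1000)
        else:
            raise ValueError(f"Unsupported browser step: {op}")
```

Because the dispatcher only calls page methods, it can be unit-tested with a stub page object that records calls, with no browser running.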

Creating a Custom Fetcher

Step 1: Create Fetcher File

Create a new file in content_fetchers/:
content_fetchers/my_custom_fetcher.py
from .base import Fetcher
import requests

class fetcher(Fetcher):
    fetcher_description = "My Custom Fetcher"
    
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        # Initialize custom settings
        self.api_key = kwargs.get('api_key')
    
    def run(self, url, timeout, request_headers, request_body,
            request_method="GET", ignore_status_codes=False,
            current_include_filters=None, **kwargs):
        """
        Fetch content with custom logic.
        """
        try:
            # Your custom fetching logic
            response = self.fetch_with_custom_method(url)
            
            # Store results in standard format
            self.content = response['content']
            self.headers = response.get('headers', {})
            self.status_code = response.get('status_code', 200)
            
            # Optional: store custom data
            self.instock_data = response.get('metadata')
            
        except Exception as e:
            self.error = str(e)
            self.status_code = 500
    
    def fetch_with_custom_method(self, url):
        """Implement your custom fetching logic"""
        # Example: use special API or protocol
        return {
            'content': '<html>Custom content</html>',
            'headers': {},
            'status_code': 200
        }
    
    def quit(self):
        """Clean up resources"""
        # Close connections, browsers, etc.
        pass

Step 2: Register Fetcher

The fetcher is automatically discovered when placed in the content_fetchers/ directory. Update __init__.py if needed:
content_fetchers/__init__.py
available_fetchers = {
    'html_requests': 'Basic fast Plaintext/HTTP Client',
    'html_webdriver': 'Playwright Chrome',
    'my_custom_fetcher': 'My Custom Fetcher'
}
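A watch's fetch_backend key then has to resolve to an entry in this registry. A minimal sketch of that lookup, with a fallback to the default backend (resolve_fetch_backend is a hypothetical helper, shown against a plain dict):

```python
def resolve_fetch_backend(registry, name, default="html_requests"):
    """Return the registry entry for `name`, falling back to the default backend.

    Hypothetical helper for illustration; the application's own resolution
    logic may differ.
    """
    if name in registry:
        return registry[name]
    return registry[default]


available_fetchers = {
    'html_requests': 'Basic fast Plaintext/HTTP Client',
    'html_webdriver': 'Playwright Chrome',
    'my_custom_fetcher': 'My Custom Fetcher',
}
```

Falling back to html_requests rather than raising keeps existing watches working if a custom fetcher is later removed.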

Step 3: Use Custom Fetcher

API Example
curl -X POST http://localhost:5000/api/v1/watch \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "fetch_backend": "my_custom_fetcher"
  }'

Advanced Fetcher Examples

API Client Fetcher

Fetcher that authenticates with an API:
import os
import requests

class fetcher(Fetcher):
    fetcher_description = "Authenticated API Client"
    
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.api_token = os.getenv('API_TOKEN')
    
    def run(self, url, timeout, request_headers, **kwargs):
        # Add authentication
        headers = request_headers.copy()
        headers['Authorization'] = f'Bearer {self.api_token}'
        
        # Make authenticated request
        r = requests.get(url, headers=headers, timeout=timeout)
        
        self.content = r.text
        self.headers = dict(r.headers)
        self.status_code = r.status_code

RSS/Atom Feed Fetcher

Specialized fetcher for feed parsing:
import json
import feedparser

class fetcher(Fetcher):
    fetcher_description = "RSS/Atom Feed Parser"
    
    def run(self, url, timeout, **kwargs):
        # Parse the feed (feedparser fetches the URL itself)
        feed = feedparser.parse(url)
        
        # Convert to a consistent format; not every feed sets every field
        items = []
        for entry in feed.entries:
            items.append({
                'title': entry.get('title', ''),
                'link': entry.get('link', ''),
                'published': entry.get('published', ''),
                'summary': entry.get('summary', '')
            })
        
        # Store as JSON so diffing works on a stable representation
        self.content = json.dumps(items, indent=2)
        self.status_code = 200 if feed.entries else 404

WebSocket Fetcher

Fetcher for WebSocket connections:
import threading
import websocket

class fetcher(Fetcher):
    fetcher_description = "WebSocket Client"
    
    def run(self, url, timeout, **kwargs):
        messages = []
        
        def on_message(ws, message):
            messages.append(message)
        
        def on_error(ws, error):
            self.error = str(error)
        
        # Connect and collect messages
        ws = websocket.WebSocketApp(
            url,
            on_message=on_message,
            on_error=on_error
        )
        
        # run_forever() has no overall timeout, so close the socket on a timer
        timer = threading.Timer(timeout, ws.close)
        timer.start()
        ws.run_forever()
        timer.cancel()
        
        # Store collected messages
        self.content = '\n'.join(messages)
        self.status_code = 200 if messages else 500

Best Practices

Resource Management

  • Always implement quit() to clean up resources
  • Use context managers for file handles and connections
  • Close browser instances in Playwright/Selenium fetchers
  • Set reasonable timeouts

Error Handling

def run(self, url, timeout, **kwargs):
    try:
        # Fetching logic
        pass
    except requests.Timeout:
        self.error = f"Timeout after {timeout}s"
        self.status_code = 408
    except requests.ConnectionError:
        self.error = "Connection failed"
        self.status_code = 503
    except Exception as e:
        self.error = f"Unexpected error: {str(e)}"
        self.status_code = 500

Performance

  • Use requests for static content (10-100x faster than browsers)
  • Only use browser fetchers when JavaScript is required
  • Implement caching for expensive operations
  • Use connection pooling for multiple requests
  • Set appropriate timeouts (default: 60s)
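Caching can be as simple as conditional requests: remember each URL's ETag and send If-None-Match, so unchanged pages come back as a bodiless 304. A sketch under those assumptions (fetch_if_changed is a hypothetical helper; taking a session argument lets it share one connection pool across watches):

```python
def fetch_if_changed(session, url, etag_cache, timeout=60):
    """GET a URL with conditional-request caching.

    Hypothetical helper: `etag_cache` maps URL -> last seen ETag. Servers
    that support ETags answer 304 Not Modified when nothing changed.
    """
    headers = {}
    if url in etag_cache:
        headers["If-None-Match"] = etag_cache[url]
    r = session.get(url, headers=headers, timeout=timeout)
    # Remember the new validator for the next poll
    if r.status_code == 200 and "ETag" in r.headers:
        etag_cache[url] = r.headers["ETag"]
    return r
```

On a 304 the previous snapshot is still current, so the caller can skip the diff step entirely.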

Security

  • Validate and sanitize URLs
  • Respect robots.txt (unless monitoring your own sites)
  • Implement rate limiting for external APIs
  • Never log sensitive data (API keys, passwords)
  • Use environment variables for credentials
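URL validation can start with a scheme-and-hostname whitelist; a minimal sketch (is_safe_url is a hypothetical helper; extend allowed_schemes if you deliberately rely on the file:// support mentioned earlier):

```python
from urllib.parse import urlparse

def is_safe_url(url, allowed_schemes=("http", "https")):
    """Reject URLs with unexpected schemes or no hostname.

    Hypothetical helper: blocks javascript:, data:, file:, and similar
    schemes by default, plus malformed URLs without a host.
    """
    parsed = urlparse(url)
    return parsed.scheme in allowed_schemes and bool(parsed.hostname)
```

Running this check before handing a URL to any fetcher keeps scheme handling in one place instead of per-fetcher.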

When to Extend vs Configure

Extend (Create Custom Fetcher) When:

  • Need custom authentication mechanism
  • Require special protocol support (WebSocket, gRPC, etc.)
  • Want to integrate with proprietary APIs
  • Need specialized parsing or transformation
  • Performance optimization for specific use case

Configure (Use Built-in Fetcher) When:

  • Standard HTTP/HTTPS requests sufficient
  • JavaScript rendering needed (use Playwright)
  • Browser automation required (use Playwright with Browser Steps)
  • Custom headers/cookies needed (use requests with headers)
  • Proxy configuration needed (use built-in proxy support)

Testing Your Fetcher

Create a test file:
tests/test_my_fetcher.py
import pytest
from changedetectionio.content_fetchers.my_custom_fetcher import fetcher

def test_basic_fetch():
    f = fetcher()
    f.run(
        url="https://example.com",
        timeout=30,
        request_headers={},
        request_body=None
    )
    
    assert f.status_code == 200
    assert f.content is not None
    assert f.error is None

def test_error_handling():
    f = fetcher()
    f.run(
        url="https://invalid.domain.that.does.not.exist",
        timeout=5,
        request_headers={},
        request_body=None
    )
    
    assert f.error is not None
    assert f.status_code != 200
Run tests:
pytest tests/test_my_fetcher.py -v

Next Steps

Processors

Learn about change detection processors

Browser Steps

Master browser automation