Build Your Own Search Result Scraper with Markdown Output Using FastAPI, SearXNG, and Browserless

Jun 4, 2024

Learn how to build your own search result scraper with FastAPI, SearXNG, and Browserless, and return results in Markdown format using a proxy.

Today, I'm excited to share with you a detailed guide on how to build your own search result scraper that returns results in Markdown format. We'll be using FastAPI, SearXNG, and Browserless, and we'll run everything in Docker containers.

This tutorial is perfect for early-stage students or anyone interested in web scraping and data extraction. By the end of this guide, you'll have a working application that can fetch search results, scrape content, and convert it into Markdown format, all while using a proxy.

Table of Contents

  1. Services We'll Use
  2. Purpose of Scraping
  3. Prerequisites
  4. Docker Setup
  5. Manual Setup
  6. Writing the Code
  7. Running SearXNG and Browserless in Docker
  8. Using Proxies
  9. Full Source Code
  10. Enjoyed the Post?

Services We'll Use

  • FastAPI: A modern, high-performance web framework for building APIs with Python, based on standard type hints.
  • SearXNG: A free internet metasearch engine which aggregates results from various search services and databases.
  • Browserless: A web browser automation service that allows you to scrape web pages without dealing with the browser directly.

Purpose of Scraping

Web scraping allows you to extract useful information from websites and use it for various purposes like data analysis, content aggregation, and more. In this tutorial, we'll focus on scraping search results and converting them into Markdown format for easy readability and integration with other tools.

Prerequisites

Before we begin, make sure you have the following installed:

  • Python 3.11
  • Virtualenv

You can install the prerequisites using the following commands:

# Install Python 3.11 (skip if already installed)
sudo apt-get update
sudo apt-get install python3.11

# Install virtualenv
pip install virtualenv

Docker Setup

You can use Docker to simplify the setup process. Follow these steps:

  1. Clone the repository:
  2. Run Docker Compose:

With this setup, if you change the .env or main.py file, you no longer need to restart Docker. Changes will be reloaded automatically.

Manual Setup

Follow these steps for manual setup:

  1. Clone the repository:
  2. Create and activate a virtual environment:
  3. Install dependencies:
  4. Create a .env file in the root directory with the following content:

Writing the Code

Here's the complete code for our FastAPI application:

.env File

SEARXNG_URL=http://localhost:8888
BROWSERLESS_URL=http://localhost:3000
TOKEN=b7a7ad74da294fa39ed75c01cfe4e41b
PROXY_PROTOCOL=http
PROXY_URL=us.premium-residential.geonode.com
PROXY_USERNAME=geonode_OglxV49yXfxxxx
PROXY_PASSWORD=xxxxx-ef6a-42a6-98fa-cc9e86cc0628
PROXY_PORT=9000
REQUEST_TIMEOUT=300

Explanation of Each Variable

  • SEARXNG_URL: This is the URL where your SearXNG service is running. In this setup, it's running locally on port 8888.
  • BROWSERLESS_URL: This is the URL where your Browserless service is running. In this setup, it's running locally on port 3000.
  • TOKEN: The API token for your Browserless service. When it is set, main.py sends it as the token query parameter on every request to Browserless. Leave it empty if your Browserless instance does not require authentication.
  • PROXY_PROTOCOL: The protocol used by your proxy service. Typically, this will be either http or https.
  • PROXY_URL: The hostname or IP address of your proxy service provider, without the protocol or port (those come from PROXY_PROTOCOL and PROXY_PORT). Here, we're using a Geonode proxy.
  • PROXY_USERNAME: The username for authenticating with your proxy service. This is specific to your Geonode account.
  • PROXY_PASSWORD: The password for authenticating with your proxy service. This is specific to your Geonode account.
  • PROXY_PORT: The port number on which your proxy service is running. Common ports include 8080 and 9000.
  • REQUEST_TIMEOUT: The timeout duration for HTTP requests, specified in seconds. This helps ensure your application doesn't hang indefinitely while waiting for a response.
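
To make the relationship between the five proxy variables concrete, here is a minimal sketch of how they combine into a single proxy URL. The helper name build_proxy_url and the example host and credentials are placeholders of mine, not part of the project. The percent-encoding step is an addition: main.py interpolates the values directly, which can break if a password contains characters such as "@" or ":".

```python
from urllib.parse import quote


def build_proxy_url(protocol, host, port, username=None, password=None):
    """Assemble a proxy URL such as http://user:pass@host:9000."""
    credentials = ""
    if username and password:
        # Percent-encode credentials so "@" or ":" inside them cannot break the URL
        credentials = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    return f"{protocol}://{credentials}{host}:{port}"


# Placeholder values, not real credentials
print(build_proxy_url("http", "proxy.example.com", 9000, "user", "p@ss:word"))
# prints http://user:p%40ss%3Aword@proxy.example.com:9000
```

If your credentials contain only letters, digits, and hyphens, the encoded and unencoded forms are identical, so the simpler f-string used in main.py behaves the same way.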

Why Use a .env File?

  1. Security: Keeps sensitive information like API keys, tokens, and passwords out of your codebase.
  2. Configuration: Allows easy configuration changes without modifying the code.
  3. Environment-Specific Settings: Easily switch configurations between different environments (development, testing, production) by changing the .env file.

How to Use the .env File

  1. Create the .env file: In the root of your project directory, create a file named .env.
  2. Add your variables: Copy the variables listed above into your .env file, replacing the example values with your actual values.
  3. Load the .env file in your code: Use python-dotenv to load these variables into your application.
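
Because main.py reads each value with os.getenv, a missing variable silently becomes None and only fails later, at request time. As an optional safeguard, the sketch below validates the settings up front. The function name load_settings and the choice to treat the two service URLs as required are my assumptions. In the real application you would call load_dotenv() first so that os.environ is populated from the .env file.

```python
import os


def load_settings(env=None):
    """Read scraper settings from a mapping (defaults to os.environ)."""
    env = os.environ if env is None else env
    # Assumption: these two URLs are the minimum needed for the app to work
    missing = [name for name in ("SEARXNG_URL", "BROWSERLESS_URL") if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return {
        "searxng_url": env["SEARXNG_URL"].rstrip("/"),
        "browserless_url": env["BROWSERLESS_URL"].rstrip("/"),
        # Same default as main.py: 30 seconds when REQUEST_TIMEOUT is unset
        "request_timeout": int(env.get("REQUEST_TIMEOUT", "30")),
    }


# Passing a plain dict makes the function easy to try without a .env file
settings = load_settings({
    "SEARXNG_URL": "http://localhost:8888/",
    "BROWSERLESS_URL": "http://localhost:3000",
})
print(settings["searxng_url"], settings["request_timeout"])
```

Failing at startup with a clear message is usually easier to debug than a None leaking into a URL such as "None/search".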

main.py

import json
import os

import httpx
from bs4 import BeautifulSoup, Comment
from dotenv import load_dotenv
from fastapi import FastAPI, Query
from fastapi.responses import JSONResponse, PlainTextResponse
from markdownify import markdownify as md

# Load .env file
load_dotenv()

# Retrieve environment variables
SEARXNG_URL = os.getenv('SEARXNG_URL')
BROWSERLESS_URL = os.getenv('BROWSERLESS_URL')
TOKEN = os.getenv('TOKEN')
PROXY_PROTOCOL = os.getenv('PROXY_PROTOCOL', 'http')
PROXY_URL = os.getenv('PROXY_URL')
PROXY_USERNAME = os.getenv('PROXY_USERNAME')
PROXY_PASSWORD = os.getenv('PROXY_PASSWORD')
PROXY_PORT = os.getenv('PROXY_PORT')
REQUEST_TIMEOUT = int(os.getenv('REQUEST_TIMEOUT', '30'))

# Domains that should only be accessed using Browserless
domains_only_for_browserless = ["twitter", "x", "facebook"] # Add more domains here

# Create FastAPI app
app = FastAPI()

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

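def fetch_normal_content(url, proxies):
    # httpx 0.28 removed the `proxies` argument; `proxy` (httpx >= 0.26) takes a single proxy URL
    proxy = proxies.get("https://") if proxies else None
    try:
        # Follow redirects so that a 301/302 response is not treated as a failure
        response = httpx.get(url, headers=HEADERS, timeout=REQUEST_TIMEOUT, proxy=proxy, follow_redirects=True)
        response.raise_for_status()
        return response.text
    except httpx.RequestError as e:
        print(f"An error occurred while requesting {url}: {e}")
    except httpx.HTTPStatusError as e:
        print(f"HTTP error occurred: {e}")
    return None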

def fetch_browserless_content(url, proxies):
    try:
        browserless_url = f"{BROWSERLESS_URL}/content"
        
        params = {}
        if TOKEN:
            params['token'] = TOKEN
            
        proxy_url = f"{PROXY_PROTOCOL}://{PROXY_URL}:{PROXY_PORT}" if PROXY_URL and PROXY_PORT else None
        if proxy_url:
            params['--proxy-server'] = proxy_url
            
        browserless_data = {
            "url": url,
            "rejectResourceTypes": ["image"],
            "rejectRequestPattern": ["/^.*\\.(css)/"],
            "gotoOptions": {"waitUntil": "networkidle2"},
            "bestAttempt": True
        }
        if PROXY_USERNAME and PROXY_PASSWORD:
            browserless_data["authenticate"] = {
                "username": PROXY_USERNAME,
                "password": PROXY_PASSWORD
            }
            
        headers = {
            'Cache-Control': 'no-cache',
            'Content-Type': 'application/json'
        }
        
        # Let httpx serialize the payload; passing a string via `data=` is deprecated in httpx
        response = httpx.post(browserless_url, params=params, headers=headers, json=browserless_data, timeout=REQUEST_TIMEOUT*2)

        response.raise_for_status()
        return response.text
    except httpx.RequestError as e:
        print(f"An error occurred while requesting Browserless for {url}: {e}")
    except httpx.HTTPStatusError as e:
        print(f"HTTP error occurred with Browserless: {e}")
    return None

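from urllib.parse import urlparse  # standard library

def needs_browserless(url):
    # Compare hostname labels so "x" matches x.com, not every URL containing the letter "x"
    host = (urlparse(url).hostname or "").lower()
    return any(label in domains_only_for_browserless for label in host.split("."))

def fetch_content(url):
    proxies = None
    if PROXY_URL and PROXY_USERNAME and PROXY_PORT:
        proxy = f"{PROXY_PROTOCOL}://{PROXY_USERNAME}:{PROXY_PASSWORD}@{PROXY_URL}:{PROXY_PORT}"
        proxies = {"http://": proxy, "https://": proxy}
        # Log only the host and port so credentials never reach the logs
        print(f"Using proxy {PROXY_URL}:{PROXY_PORT}")
    if needs_browserless(url):
        content = fetch_browserless_content(url, proxies)
    else:
        content = fetch_normal_content(url, proxies)
        if content is None:
            # Fall back to Browserless when the plain HTTP request fails
            content = fetch_browserless_content(url, proxies)

    return content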

def clean_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    
    # Remove all script, style, and other unnecessary elements
    for script_or_style in soup(["script", "style", "header", "footer", "noscript", "form", "input", "textarea", "select", "option", "button", "svg", "iframe", "object", "embed", "applet"]):
        script_or_style.decompose()
    
    # Remove unwanted classes and ids
    for tag in soup.find_all(True):
        tag.attrs = {key: value for key, value in tag.attrs.items() if key not in ['class', 'id', 'style']}
    
    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    
    return str(soup)

def parse_html_to_markdown(html, url):
    # Parse once for the title and guard against an empty <title> tag
    title_tag = BeautifulSoup(html, 'html.parser').title
    title = title_tag.string.strip() if title_tag and title_tag.string else 'No title'
    cleaned_html = clean_html(html)
    markdown_content = md(cleaned_html)
    return {
        "title": title,
        "url": url,
        "md": clean_markdown(markdown_content)
    }

def clean_markdown(markdown):
    # Remove extra newlines and whitespace
    lines = markdown.split('\n')
    cleaned_lines = [line.strip() for line in lines if line.strip()]
    return '\n'.join(cleaned_lines)

def search(query: str, num_results: int) -> list:
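    searxng_query_url = f"{SEARXNG_URL}/search"
    # Pass the query via params so httpx URL-encodes spaces and special characters
    params = {"q": query, "categories": "general", "format": "json"}
    try:
        response = httpx.get(searxng_query_url, params=params, headers=HEADERS, timeout=REQUEST_TIMEOUT)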
        response.raise_for_status()
    except httpx.RequestError as e:
        return [{"error": f"Search query failed with error: {e}"}]
    except httpx.HTTPStatusError as e:
        return [{"error": f"Search query failed with HTTP error: {e}"}]

    search_results = response.json()
    results = []
    
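    # Use .get so a response without a "results" key yields an empty list instead of a KeyError
    for result in search_results.get("results", [])[:num_results]: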
        url = result["url"]
        title = result["title"]
        html_content = fetch_content(url)
        if html_content:
            markdown_data = parse_html_to_markdown(html_content, url)
            results.append({
                "title": title,
                "url": url,
                "markdown_content": (
                    f"Title: {markdown_data['title']}\n\n"
                    f"URL Source: {markdown_data['url']}\n\n"
                    f"Markdown Content:\n{markdown_data['md']}"
                )
            })
    
    return results

@app.get("/", response_class=JSONResponse)
def get_search_results(q: str = Query(..., description="Search query"), num_results: int = Query(5, description="Number of results")):
    result_list = search(q, num_results)
    return result_list

@app.get("/r/{url:path}", response_class=PlainTextResponse)
def fetch_url(url: str):
    html_content = fetch_content(url)
    if html_content:
        markdown_data = parse_html_to_markdown(html_content, url)
        response_text = (
            f"Title: {markdown_data['title']}\n\n"
            f"URL Source: {markdown_data['url']}\n\n"
            f"Markdown Content:\n{markdown_data['md']}"
        )
        return PlainTextResponse(response_text)
    # Report the upstream failure with an error status instead of 200 OK
    return PlainTextResponse("Failed to retrieve content", status_code=502)

# Example usage
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
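
Once the app is running, it exposes two routes: GET / for JSON search results and GET /r/<url> for a single page as plain text. The sketch below only builds the request URLs for those routes, so you can paste them into a browser or any HTTP client. The helper names search_endpoint and reader_endpoint are mine, and API_BASE assumes the host and port from the uvicorn.run call above.

```python
from urllib.parse import urlencode

API_BASE = "http://localhost:8000"  # assumption: port taken from the uvicorn.run call


def search_endpoint(query, num_results=5):
    """URL for the JSON search route: GET /?q=...&num_results=..."""
    return f"{API_BASE}/?{urlencode({'q': query, 'num_results': num_results})}"


def reader_endpoint(target_url):
    """URL for the plain-text route: GET /r/<target URL>."""
    return f"{API_BASE}/r/{target_url}"


print(search_endpoint("fastapi tutorial", 3))
# prints http://localhost:8000/?q=fastapi+tutorial&num_results=3
print(reader_endpoint("https://example.com/docs"))
# prints http://localhost:8000/r/https://example.com/docs
```

One caveat with the /r/ route: if the target URL contains its own query string, the "?" is read as the start of the API request's query, so that part may not reach the handler.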

Running SearXNG and Browserless in Docker

Next, let's set up and run the SearXNG and Browserless services using Docker. Create a shell script run-services.sh with the following content:

#!/bin/bash

# Run SearXNG (the container listens on port 8080, so map host port 8888 to it)
docker run -d --name searxng -p 8888:8080 searxng/searxng

# Run Browserless
docker run -d --name browserless -p 3000:3000 browserless/chrome

echo "SearXNG is running at http://localhost:8888"
echo "Browserless is running at http://localhost:3000"

Make the script executable and run it:

chmod +x run-services.sh
./run-services.sh

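This script pulls and runs the SearXNG and Browserless Docker images and makes both services accessible on your local machine. Note that main.py requests results with format=json, which SearXNG serves only when json is listed under search.formats in its settings.yml. If search requests return 403 Forbidden, enable it there.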
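
Containers started with docker run -d can take a few seconds before they accept connections. As a rough readiness check, the sketch below waits until a TCP port accepts a connection. The function name wait_for_port is hypothetical and uses only the standard library. An open port shows that something is listening, not that the service has finished initializing.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once a TCP connection to host:port succeeds, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False


# Example calls, using the host ports from run-services.sh:
# wait_for_port("localhost", 8888)  # SearXNG
# wait_for_port("localhost", 3000)  # Browserless
```

You could call this before starting the FastAPI app so the first search request does not fail just because a container is still booting.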

Using Proxies

In this tutorial, I'm using Geonode proxies to scrape content. You can use my Geonode affiliate link to get started with their proxy services.

Full Source Code

You can find the full source code for this project on my GitHub repository.

Enjoyed the Post?

I hope you found this tutorial helpful! Feel free to reach out if you have any questions or need further assistance. Happy scraping!


Enjoyed the post? Follow my blog at essamamdani.com for more tutorials and insights.
