Build Your Own Search Result Scraper with Markdown Output Using FastAPI, SearXNG, and Browserless
> Learn how to build your own search result scraper with FastAPI, SearXNG, and Browserless, and return results in Markdown format using a proxy.
Today, I'm excited to share with you a detailed guide on how to build your own search result scraper that returns results in Markdown format. We'll be using FastAPI, SearXNG, and Browserless, and we'll run everything in Docker containers.
This tutorial is perfect for beginners or anyone interested in web scraping and data extraction. By the end of this guide, you'll have a working application that can fetch search results, scrape content, and convert it into Markdown format, all while routing requests through a proxy.
Table of Contents
- Services We'll Use
- Purpose of Scraping
- Prerequisites
- Docker Setup
- Manual Setup
- Writing the Code
- Running SearXNG and Browserless in Docker
- Using Proxies
- Full Source Code
- Enjoyed the Post?
Services We'll Use
- FastAPI: A modern, fast (high-performance) web framework for building APIs with Python.
- SearXNG: A free internet metasearch engine which aggregates results from various search services and databases.
- Browserless: A web browser automation service that allows you to scrape web pages without dealing with the browser directly.
Purpose of Scraping
Web scraping allows you to extract useful information from websites and use it for various purposes like data analysis, content aggregation, and more. In this tutorial, we'll focus on scraping search results and converting them into Markdown format for easy readability and integration with other tools.
Prerequisites
Before we begin, make sure you have the following installed:
- Python 3.11
- Virtualenv
You can install the prerequisites using the following commands:
```sh
# Install Python 3.11 (skip if already installed)
sudo apt-get update
sudo apt-get install python3.11

# Install virtualenv
pip install virtualenv
```
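Before moving on, you can quickly confirm that both tools are available on your machine:

```sh
# Verify the Python and virtualenv installations
python3.11 --version
virtualenv --version
```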
Docker Setup
You can use Docker to simplify the setup process. Follow these steps:
- Clone the repository:

```sh
git clone https://github.com/essamamdani/search-result-scraper-markdown.git
cd search-result-scraper-markdown
```

- Run Docker Compose:

```sh
docker compose up --build
```
With this setup, if you change the .env or main.py file, you no longer need to restart Docker. Changes will be reloaded automatically.
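Once the containers are up, you can send a quick smoke-test request to confirm everything is wired together. This assumes the compose file publishes the FastAPI app on port 8000, matching the main.py shown later in this guide:

```sh
# Smoke test: ask the API for two results (assumes the app is exposed on port 8000)
curl "http://localhost:8000/?q=hello+world&num_results=2"
```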
Manual Setup
Follow these steps for manual setup:
- Clone the repository:

```sh
git clone https://github.com/essamamdani/search-result-scraper-markdown.git
cd search-result-scraper-markdown
```

- Create and activate a virtual environment:

```sh
virtualenv venv
source venv/bin/activate
```

- Install dependencies:

```sh
pip install -r requirements.txt
```

- Create a .env file in the root directory with the following content:

```bash
SEARXNG_URL=http://localhost:8888
BROWSERLESS_URL=http://localhost:3000
TOKEN=b7a7ad74da294fa39ed75c01cfe4e41b
PROXY_PROTOCOL=http
PROXY_URL=us.premium-residential.geonode.com
PROXY_USERNAME=geonode_OglxV49yXfxxxx
PROXY_PASSWORD=xxxxx-ef6a-42a6-98fa-cc9e86cc0628
PROXY_PORT=9000
REQUEST_TIMEOUT=300
```
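With the environment configured, you still need to start the API by hand in the manual setup. A minimal way to run it locally, assuming uvicorn is included in requirements.txt and the entry point is the main.py shown in the next section, is:

```sh
# Start the FastAPI app with auto-reload for development
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

Running `python main.py` also works, thanks to the `__main__` block at the bottom of the file, but without auto-reload.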
Writing the Code
Here's the complete code for our FastAPI application:
.env File
```env
SEARXNG_URL=http://localhost:8888
BROWSERLESS_URL=http://localhost:3000
TOKEN=b7a7ad74da294fa39ed75c01cfe4e41b
PROXY_PROTOCOL=http
PROXY_URL=us.premium-residential.geonode.com
PROXY_USERNAME=geonode_OglxV49yXfxxxx
PROXY_PASSWORD=xxxxx-ef6a-42a6-98fa-cc9e86cc0628
PROXY_PORT=9000
REQUEST_TIMEOUT=300
```
Explanation of Each Variable
- SEARXNG_URL: This is the URL where your SearXNG service is running. In this setup, it's running locally on port 8888.
- BROWSERLESS_URL: This is the URL where your Browserless service is running. In this setup, it's running locally on port 3000.
- TOKEN: An API token for services that require authentication. In this local setup, Browserless doesn't require a token by default, but main.py will pass it along as a query parameter if one is set.
- PROXY_PROTOCOL: The protocol used by your proxy service. Typically, this will be either http or https.
- PROXY_URL: The URL or IP address of your proxy service provider. Here, we're using a Geonode proxy.
- PROXY_USERNAME: The username for authenticating with your proxy service. This is specific to your Geonode account.
- PROXY_PASSWORD: The password for authenticating with your proxy service. This is specific to your Geonode account.
- PROXY_PORT: The port number on which your proxy service is running. Common ports include 8080 and 9000.
- REQUEST_TIMEOUT: The timeout duration for HTTP requests, specified in seconds. This helps ensure your application doesn't hang indefinitely while waiting for a response.
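Before wiring the proxy values into the application, you can sanity-check the credentials directly with curl. The username, password, host, and port below are placeholders standing in for your own Geonode values from .env:

```sh
# Hypothetical credentials — substitute the values from your .env file
curl -x "http://PROXY_USERNAME:PROXY_PASSWORD@us.premium-residential.geonode.com:9000" https://httpbin.org/ip
```

If the proxy is working, the response shows the proxy's exit IP rather than your own.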
Why Use a .env File?
- Security: Keeps sensitive information like API keys, tokens, and passwords out of your codebase.
- Configuration: Allows easy configuration changes without modifying the code.
- Environment-Specific Settings: Easily switch configurations between different environments (development, testing, production) by changing the .env file.
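Because the .env file holds real credentials, it should stay out of version control. A common convention (not specific to this project) is to ignore it in git:

```sh
# Keep credentials out of the repository
echo ".env" >> .gitignore
```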
How to Use the .env File
- Create the .env file: In the root of your project directory, create a file named .env.
- Add your variables: Copy the variables listed above into your .env file, replacing the example values with your actual values.
- Load the .env file in your code: Use python-dotenv to load these variables into your application, as main.py does below.
main.py
```python
import os
from dotenv import load_dotenv
import httpx
from fastapi import FastAPI, Query
from fastapi.responses import JSONResponse, PlainTextResponse
from markdownify import markdownify as md
from bs4 import BeautifulSoup, Comment
import json

# Load .env file
load_dotenv()

# Retrieve environment variables
SEARXNG_URL = os.getenv('SEARXNG_URL')
BROWSERLESS_URL = os.getenv('BROWSERLESS_URL')
TOKEN = os.getenv('TOKEN')
PROXY_PROTOCOL = os.getenv('PROXY_PROTOCOL', 'http')
PROXY_URL = os.getenv('PROXY_URL')
PROXY_USERNAME = os.getenv('PROXY_USERNAME')
PROXY_PASSWORD = os.getenv('PROXY_PASSWORD')
PROXY_PORT = os.getenv('PROXY_PORT')
REQUEST_TIMEOUT = int(os.getenv('REQUEST_TIMEOUT', '30'))

# Domains that should only be accessed using Browserless
domains_only_for_browserless = ["twitter", "x", "facebook"]  # Add more domains here

# Create FastAPI app
app = FastAPI()

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

def fetch_normal_content(url, proxies):
    try:
        response = httpx.get(url, headers=HEADERS, timeout=REQUEST_TIMEOUT, proxies=proxies)
        response.raise_for_status()
        return response.text
    except httpx.RequestError as e:
        print(f"An error occurred while requesting {url}: {e}")
    except httpx.HTTPStatusError as e:
        print(f"HTTP error occurred: {e}")
    return None

def fetch_browserless_content(url, proxies):
    try:
        browserless_url = f"{BROWSERLESS_URL}/content"

        params = {}
        if TOKEN:
            params['token'] = TOKEN

        proxy_url = f"{PROXY_PROTOCOL}://{PROXY_URL}:{PROXY_PORT}" if PROXY_URL and PROXY_PORT else None
        if proxy_url:
            params['--proxy-server'] = proxy_url

        browserless_data = {
            "url": url,
            "rejectResourceTypes": ["image"],
            "rejectRequestPattern": ["/^.*\\.(css)/"],
            "gotoOptions": {"waitUntil": "networkidle2"},
            "bestAttempt": True
        }
        if PROXY_USERNAME and PROXY_PASSWORD:
            browserless_data["authenticate"] = {
                "username": PROXY_USERNAME,
                "password": PROXY_PASSWORD
            }

        headers = {
            'Cache-Control': 'no-cache',
            'Content-Type': 'application/json'
        }

        response = httpx.post(browserless_url, params=params, headers=headers, data=json.dumps(browserless_data), timeout=REQUEST_TIMEOUT*2)
        response.raise_for_status()
        return response.text
    except httpx.RequestError as e:
        print(f"An error occurred while requesting Browserless for {url}: {e}")
    except httpx.HTTPStatusError as e:
        print(f"HTTP error occurred with Browserless: {e}")
    return None

def fetch_content(url):
    proxies = None
    if PROXY_URL and PROXY_USERNAME and PROXY_PORT:
        proxies = {
            "http://": f"{PROXY_PROTOCOL}://{PROXY_USERNAME}:{PROXY_PASSWORD}@{PROXY_URL}:{PROXY_PORT}",
            "https://": f"{PROXY_PROTOCOL}://{PROXY_USERNAME}:{PROXY_PASSWORD}@{PROXY_URL}:{PROXY_PORT}"
        }
        print(f"Using proxy {proxies}")
    if any(domain in url for domain in domains_only_for_browserless):
        content = fetch_browserless_content(url, proxies)
    else:
        content = fetch_normal_content(url, proxies)
        if content is None:
            content = fetch_browserless_content(url, proxies)

    return content

def clean_html(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Remove all script, style, and other unnecessary elements
    for script_or_style in soup(["script", "style", "header", "footer", "noscript", "form", "input", "textarea",
                                 "select", "option", "button", "svg", "iframe", "object", "embed", "applet"]):
        script_or_style.decompose()

    # Remove unwanted classes and ids
    for tag in soup.find_all(True):
        tag.attrs = {key: value for key, value in tag.attrs.items() if key not in ['class', 'id', 'style']}

    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    return str(soup)

def parse_html_to_markdown(html, url):
    cleaned_html = clean_html(html)
    markdown_content = md(cleaned_html)
    return {
        "title": BeautifulSoup(html, 'html.parser').title.string if BeautifulSoup(html, 'html.parser').title else 'No title',
        "url": url,
        "md": clean_markdown(markdown_content)
    }

def clean_markdown(markdown):
    # Remove extra newlines and whitespace
    lines = markdown.split('\n')
    cleaned_lines = [line.strip() for line in lines if line.strip()]
    return '\n'.join(cleaned_lines)

def search(query: str, num_results: int) -> list:
    searxng_query_url = f"{SEARXNG_URL}/search?q={query}&categories=general&format=json"
    try:
        response = httpx.get(searxng_query_url, headers=HEADERS, timeout=REQUEST_TIMEOUT)
        response.raise_for_status()
    except httpx.RequestError as e:
        return [{"error": f"Search query failed with error: {e}"}]
    except httpx.HTTPStatusError as e:
        return [{"error": f"Search query failed with HTTP error: {e}"}]

    search_results = response.json()
    results = []

    for result in search_results["results"][:num_results]:
        url = result["url"]
        title = result["title"]
        html_content = fetch_content(url)
        if html_content:
            markdown_data = parse_html_to_markdown(html_content, url)
            results.append({
                "title": title,
                "url": url,
                "markdown_content": (
                    f"Title: {markdown_data['title']}\n\n"
                    f"URL Source: {markdown_data['url']}\n\n"
                    f"Markdown Content:\n{markdown_data['md']}"
                )
            })

    return results

@app.get("/", response_class=JSONResponse)
def get_search_results(q: str = Query(..., description="Search query"), num_results: int = Query(5, description="Number of results")):
    result_list = search(q, num_results)
    return result_list

@app.get("/r/{url:path}", response_class=PlainTextResponse)
def fetch_url(url: str):
    html_content = fetch_content(url)
    if html_content:
        markdown_data = parse_html_to_markdown(html_content, url)
        response_text = (
            f"Title: {markdown_data['title']}\n\n"
            f"URL Source: {markdown_data['url']}\n\n"
            f"Markdown Content:\n{markdown_data['md']}"
        )
        return PlainTextResponse(response_text)
    return PlainTextResponse("Failed to retrieve content")

# Example usage
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
Running SearXNG and Browserless in Docker
Next, let's set up and run the SearXNG and Browserless services using Docker. Create a shell script run-services.sh with the following content:
```sh
#!/bin/bash

# Run SearXNG (the container listens on port 8080 internally, published here on 8888)
docker run -d --name searxng -p 8888:8080 searxng/searxng

# Run Browserless
docker run -d --name browserless -p 3000:3000 browserless/chrome

echo "SearXNG is running at http://localhost:8888"
echo "Browserless is running at http://localhost:3000"
```
Make the script executable and run it:
```sh
chmod +x run-services.sh
./run-services.sh
```
This script will pull and run the SearXNG and Browserless Docker images, making them accessible on your local machine.
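To verify the setup end to end, confirm the containers are running, issue a test query against SearXNG, and then hit the FastAPI endpoints (assuming the app is listening on port 8000 as in main.py). Note that main.py requests format=json from SearXNG, which may need to be enabled under search.formats in SearXNG's settings.yml depending on your configuration:

```sh
# Confirm both containers are running
docker ps | grep -E "searxng|browserless"

# Test SearXNG's JSON API directly (requires the json format to be enabled)
curl "http://localhost:8888/search?q=fastapi&format=json"

# Exercise the FastAPI endpoints: search results and single-page conversion
curl "http://localhost:8000/?q=fastapi+tutorial&num_results=3"
curl "http://localhost:8000/r/https://example.com"
```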
Using Proxies
In this tutorial, I'm using Geonode proxies to scrape content. You can use my Geonode affiliate link to get started with their proxy services.
Full Source Code
You can find the full source code for this project on my GitHub repository.
Enjoyed the Post?
I hope you found this tutorial helpful! Feel free to reach out if you have any questions or need further assistance. Happy scraping!
Enjoyed the post? Follow my blog at essamamdani.com for more tutorials and insights.