Build Your Own Search Result Scraper with Markdown Output Using FastAPI, SearXNG, and Browserless
> Learn how to build your own search result scraper with FastAPI, SearXNG, and Browserless, and return results in Markdown format using a proxy.
Today, I'm excited to share with you a detailed guide on how to build your own search result scraper that returns results in Markdown format. We'll be using FastAPI, SearXNG, and Browserless, and we'll run everything in Docker containers.
This tutorial is perfect for beginners or anyone interested in web scraping and data extraction. By the end of this guide, you'll have a working application that can fetch search results, scrape content, and convert it into Markdown format, all while routing requests through a proxy.
Table of Contents
- Services We'll Use
- Purpose of Scraping
- Prerequisites
- Docker Setup
- Manual Setup
- Writing the Code
- Running SearXNG and Browserless in Docker
- Using Proxies
- Full Source Code
- Enjoyed the Post?
Services We'll Use
- FastAPI: A modern, fast (high-performance) web framework for building APIs with Python.
- SearXNG: A free internet metasearch engine which aggregates results from various search services and databases.
- Browserless: A web browser automation service that allows you to scrape web pages without dealing with the browser directly.
Purpose of Scraping
Web scraping allows you to extract useful information from websites and use it for various purposes like data analysis, content aggregation, and more. In this tutorial, we'll focus on scraping search results and converting them into Markdown format for easy readability and integration with other tools.
Prerequisites
Before we begin, make sure you have the following installed:
- Python 3.11
- Virtualenv
You can install the prerequisites using the following commands:
```sh
# Install Python 3.11 (skip if already installed)
sudo apt-get update
sudo apt-get install python3.11

# Install virtualenv
pip install virtualenv
```
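Before moving on, you can quickly confirm that both tools are available on your machine:

```sh
# Verify the Python and virtualenv installations
python3.11 --version
virtualenv --version
```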
Docker Setup
You can use Docker to simplify the setup process. Follow these steps:
- Clone the repository:

```sh
git clone https://github.com/essamamdani/search-result-scraper-markdown.git
cd search-result-scraper-markdown
```

- Run Docker Compose:

```sh
docker compose up --build
```
With this setup, if you change the .env or main.py file, you no longer need to restart Docker. Changes will be reloaded automatically.
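Once the containers are up, you can send a quick smoke-test request to confirm everything is wired together. This assumes the compose file publishes the FastAPI app on port 8000, matching the main.py shown later in this guide:

```sh
# Smoke test: ask the API for two results (assumes the app is exposed on port 8000)
curl "http://localhost:8000/?q=hello+world&num_results=2"
```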
Manual Setup
Follow these steps for manual setup:
- Clone the repository:

```sh
git clone https://github.com/essamamdani/search-result-scraper-markdown.git
cd search-result-scraper-markdown
```

- Create and activate a virtual environment:

```sh
virtualenv venv
source venv/bin/activate
```

- Install dependencies:

```sh
pip install -r requirements.txt
```

- Create a .env file in the root directory with the following content:

```bash
SEARXNG_URL=http://localhost:8888
BROWSERLESS_URL=http://localhost:3000
TOKEN=b7a7ad74da294fa39ed75c01cfe4e41b
PROXY_PROTOCOL=http
PROXY_URL=us.premium-residential.geonode.com
PROXY_USERNAME=geonode_OglxV49yXfxxxx
PROXY_PASSWORD=xxxxx-ef6a-42a6-98fa-cc9e86cc0628
PROXY_PORT=9000
REQUEST_TIMEOUT=300
```
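With the environment configured, you still need to start the API by hand in the manual setup. A minimal way to run it locally, assuming uvicorn is included in requirements.txt and the entry point is the main.py shown in the next section, is:

```sh
# Start the FastAPI app with auto-reload for development
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```

Running `python main.py` also works, thanks to the `__main__` block at the bottom of the file, but without auto-reload.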
Writing the Code
Here's the complete code for our FastAPI application:
.env File
```env
SEARXNG_URL=http://localhost:8888
BROWSERLESS_URL=http://localhost:3000
TOKEN=b7a7ad74da294fa39ed75c01cfe4e41b
PROXY_PROTOCOL=http
PROXY_URL=us.premium-residential.geonode.com
PROXY_USERNAME=geonode_OglxV49yXfxxxx
PROXY_PASSWORD=xxxxx-ef6a-42a6-98fa-cc9e86cc0628
PROXY_PORT=9000
REQUEST_TIMEOUT=300
```
Explanation of Each Variable
- SEARXNG_URL: This is the URL where your SearXNG service is running. In this setup, it's running locally on port 8888.
- BROWSERLESS_URL: This is the URL where your Browserless service is running. In this setup, it's running locally on port 3000.
- TOKEN: An API token for services that require authentication. In this local setup, Browserless doesn't require a token by default, but main.py will pass it along as a query parameter if one is set.
- PROXY_PROTOCOL: The protocol used by your proxy service. Typically, this will be either http or https.
- PROXY_URL: The URL or IP address of your proxy service provider. Here, we're using a Geonode proxy.
- PROXY_USERNAME: The username for authenticating with your proxy service. This is specific to your Geonode account.
- PROXY_PASSWORD: The password for authenticating with your proxy service. This is specific to your Geonode account.
- PROXY_PORT: The port number on which your proxy service is running. Common ports include 8080 and 9000.
- REQUEST_TIMEOUT: The timeout duration for HTTP requests, specified in seconds. This helps ensure your application doesn't hang indefinitely while waiting for a response.
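Before wiring the proxy values into the application, you can sanity-check the credentials directly with curl. The username, password, host, and port below are placeholders standing in for your own Geonode values from .env:

```sh
# Hypothetical credentials — substitute the values from your .env file
curl -x "http://PROXY_USERNAME:PROXY_PASSWORD@us.premium-residential.geonode.com:9000" https://httpbin.org/ip
```

If the proxy is working, the response shows the proxy's exit IP rather than your own.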
Why Use a .env File?
- Security: Keeps sensitive information like API keys, tokens, and passwords out of your codebase.
- Configuration: Allows easy configuration changes without modifying the code.
- Environment-Specific Settings: Easily switch configurations between different environments (development, testing, production) by changing the .env file.
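Because the .env file holds real credentials, it should stay out of version control. A common convention (not specific to this project) is to ignore it in git:

```sh
# Keep credentials out of the repository
echo ".env" >> .gitignore
```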
How to Use the .env File
- Create the .env file: In the root of your project directory, create a file named .env.
- Add your variables: Copy the variables listed above into your .env file, replacing the example values with your actual values.
- Load the .env file in your code: Use python-dotenv to load these variables into your application, as main.py does below.
main.py
```python
import os
from dotenv import load_dotenv
import httpx
from fastapi import FastAPI, Query
from fastapi.responses import JSONResponse, PlainTextResponse
from markdownify import markdownify as md
from bs4 import BeautifulSoup, Comment
import json

# Load .env file
load_dotenv()

# Retrieve environment variables
SEARXNG_URL = os.getenv('SEARXNG_URL')
BROWSERLESS_URL = os.getenv('BROWSERLESS_URL')
TOKEN = os.getenv('TOKEN')
PROXY_PROTOCOL = os.getenv('PROXY_PROTOCOL', 'http')
PROXY_URL = os.getenv('PROXY_URL')
PROXY_USERNAME = os.getenv('PROXY_USERNAME')
PROXY_PASSWORD = os.getenv('PROXY_PASSWORD')
PROXY_PORT = os.getenv('PROXY_PORT')
REQUEST_TIMEOUT = int(os.getenv('REQUEST_TIMEOUT', '30'))

# Domains that should only be accessed using Browserless
domains_only_for_browserless = ["twitter", "x", "facebook"]  # Add more domains here

# Create FastAPI app
app = FastAPI()

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}

def fetch_normal_content(url, proxies):
    try:
        response = httpx.get(url, headers=HEADERS, timeout=REQUEST_TIMEOUT, proxies=proxies)
        response.raise_for_status()
        return response.text
    except httpx.RequestError as e:
        print(f"An error occurred while requesting {url}: {e}")
    except httpx.HTTPStatusError as e:
        print(f"HTTP error occurred: {e}")
    return None

def fetch_browserless_content(url, proxies):
    try:
        browserless_url = f"{BROWSERLESS_URL}/content"

        params = {}
        if TOKEN:
            params['token'] = TOKEN

        proxy_url = f"{PROXY_PROTOCOL}://{PROXY_URL}:{PROXY_PORT}" if PROXY_URL and PROXY_PORT else None
        if proxy_url:
            params['--proxy-server'] = proxy_url

        browserless_data = {
            "url": url,
            "rejectResourceTypes": ["image"],
            "rejectRequestPattern": ["/^.*\\.(css)/"],
            "gotoOptions": {"waitUntil": "networkidle2"},
            "bestAttempt": True
        }
        if PROXY_USERNAME and PROXY_PASSWORD:
            browserless_data["authenticate"] = {
                "username": PROXY_USERNAME,
                "password": PROXY_PASSWORD
            }

        headers = {
            'Cache-Control': 'no-cache',
            'Content-Type': 'application/json'
        }

        response = httpx.post(browserless_url, params=params, headers=headers, data=json.dumps(browserless_data), timeout=REQUEST_TIMEOUT*2)
        response.raise_for_status()
        return response.text
    except httpx.RequestError as e:
        print(f"An error occurred while requesting Browserless for {url}: {e}")
    except httpx.HTTPStatusError as e:
        print(f"HTTP error occurred with Browserless: {e}")
    return None

def fetch_content(url):
    proxies = None
    if PROXY_URL and PROXY_USERNAME and PROXY_PORT:
        proxies = {
            "http://": f"{PROXY_PROTOCOL}://{PROXY_USERNAME}:{PROXY_PASSWORD}@{PROXY_URL}:{PROXY_PORT}",
            "https://": f"{PROXY_PROTOCOL}://{PROXY_USERNAME}:{PROXY_PASSWORD}@{PROXY_URL}:{PROXY_PORT}"
        }
        print(f"Using proxy {proxies}")
    if any(domain in url for domain in domains_only_for_browserless):
        content = fetch_browserless_content(url, proxies)
    else:
        content = fetch_normal_content(url, proxies)
        if content is None:
            content = fetch_browserless_content(url, proxies)

    return content

def clean_html(html):
    soup = BeautifulSoup(html, 'html.parser')

    # Remove all script, style, and other unnecessary elements
    for script_or_style in soup(["script", "style", "header", "footer", "noscript", "form", "input", "textarea",
                                 "select", "option", "button", "svg", "iframe", "object", "embed", "applet"]):
        script_or_style.decompose()

    # Remove unwanted classes and ids
    for tag in soup.find_all(True):
        tag.attrs = {key: value for key, value in tag.attrs.items() if key not in ['class', 'id', 'style']}

    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()

    return str(soup)

def parse_html_to_markdown(html, url):
    cleaned_html = clean_html(html)
    markdown_content = md(cleaned_html)
    return {
        "title": BeautifulSoup(html, 'html.parser').title.string if BeautifulSoup(html, 'html.parser').title else 'No title',
        "url": url,
        "md": clean_markdown(markdown_content)
    }

def clean_markdown(markdown):
    # Remove extra newlines and whitespace
    lines = markdown.split('\n')
    cleaned_lines = [line.strip() for line in lines if line.strip()]
    return '\n'.join(cleaned_lines)

def search(query: str, num_results: int) -> list:
    searxng_query_url = f"{SEARXNG_URL}/search?q={query}&categories=general&format=json"
    try:
        response = httpx.get(searxng_query_url, headers=HEADERS, timeout=REQUEST_TIMEOUT)
        response.raise_for_status()
    except httpx.RequestError as e:
        return [{"error": f"Search query failed with error: {e}"}]
    except httpx.HTTPStatusError as e:
        return [{"error": f"Search query failed with HTTP error: {e}"}]

    search_results = response.json()
    results = []

    for result in search_results["results"][:num_results]:
        url = result["url"]
        title = result["title"]
        html_content = fetch_content(url)
        if html_content:
            markdown_data = parse_html_to_markdown(html_content, url)
            results.append({
                "title": title,
                "url": url,
                "markdown_content": (
                    f"Title: {markdown_data['title']}\n\n"
                    f"URL Source: {markdown_data['url']}\n\n"
                    f"Markdown Content:\n{markdown_data['md']}"
                )
            })

    return results

@app.get("/", response_class=JSONResponse)
def get_search_results(q: str = Query(..., description="Search query"), num_results: int = Query(5, description="Number of results")):
    result_list = search(q, num_results)
    return result_list

@app.get("/r/{url:path}", response_class=PlainTextResponse)
def fetch_url(url: str):
    html_content = fetch_content(url)
    if html_content:
        markdown_data = parse_html_to_markdown(html_content, url)
        response_text = (
            f"Title: {markdown_data['title']}\n\n"
            f"URL Source: {markdown_data['url']}\n\n"
            f"Markdown Content:\n{markdown_data['md']}"
        )
        return PlainTextResponse(response_text)
    return PlainTextResponse("Failed to retrieve content")

# Example usage
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
```
Running SearXNG and Browserless in Docker
Next, let's set up and run the SearXNG and Browserless services using Docker. Create a shell script run-services.sh with the following content:
```sh
#!/bin/bash

# Run SearXNG (the container listens on port 8080 internally, published here on 8888)
docker run -d --name searxng -p 8888:8080 searxng/searxng

# Run Browserless
docker run -d --name browserless -p 3000:3000 browserless/chrome

echo "SearXNG is running at http://localhost:8888"
echo "Browserless is running at http://localhost:3000"
```
Make the script executable and run it:
```sh
chmod +x run-services.sh
./run-services.sh
```
This script will pull and run the SearXNG and Browserless Docker images, making them accessible on your local machine.
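To verify the setup end to end, confirm the containers are running, issue a test query against SearXNG, and then hit the FastAPI endpoints (assuming the app is listening on port 8000 as in main.py). Note that main.py requests format=json from SearXNG, which may need to be enabled under search.formats in SearXNG's settings.yml depending on your configuration:

```sh
# Confirm both containers are running
docker ps | grep -E "searxng|browserless"

# Test SearXNG's JSON API directly (requires the json format to be enabled)
curl "http://localhost:8888/search?q=fastapi&format=json"

# Exercise the FastAPI endpoints: search results and single-page conversion
curl "http://localhost:8000/?q=fastapi+tutorial&num_results=3"
curl "http://localhost:8000/r/https://example.com"
```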
Using Proxies
In this tutorial, I'm using Geonode proxies to scrape content. You can use my Geonode affiliate link to get started with their proxy services.
Full Source Code
You can find the full source code for this project on my GitHub repository.
Enjoyed the Post?
I hope you found this tutorial helpful! Feel free to reach out if you have any questions or need further assistance. Happy scraping!
Enjoyed the post? Follow my blog at essamamdani.com for more tutorials and insights.