Build Your Own Search Result Scraper with Markdown Output Using FastAPI, SearXNG, and Browserless
Jun 4, 2024
Learn how to build your own search result scraper with FastAPI, SearXNG, and Browserless, and return results in Markdown format using a proxy.
Today, I'm excited to share with you a detailed guide on how to build your own search result scraper that returns results in Markdown format. We'll be using FastAPI, SearXNG, and Browserless, and we'll run everything in Docker containers.
This tutorial is perfect for early-stage students or anyone interested in web scraping and data extraction. By the end of this guide, you'll have a working application that can fetch search results, scrape content, and convert it into Markdown format, all while using a proxy.
Table of Contents
- Services We'll Use
- Purpose of Scraping
- Prerequisites
- Docker Setup
- Manual Setup
- Writing the Code
- Running SearXNG and Browserless in Docker
- Using Proxies
- Full Source Code
- Enjoyed the Post?
Services We'll Use
- FastAPI: A modern, fast (high-performance), web framework for building APIs with Python 3.6+.
- SearXNG: A free internet metasearch engine which aggregates results from various search services and databases.
- Browserless: A web browser automation service that allows you to scrape web pages without dealing with the browser directly.
Purpose of Scraping
Web scraping allows you to extract useful information from websites and use it for various purposes like data analysis, content aggregation, and more. In this tutorial, we'll focus on scraping search results and converting them into Markdown format for easy readability and integration with other tools.
Prerequisites
Before we begin, make sure you have the following installed:
- Python 3.11
- Virtualenv
You can install the prerequisites using the following commands:
# Install Python 3.11 (skip if already installed)
sudo apt-get update
sudo apt-get install python3.11
# Install virtualenv
pip install virtualenv
Docker Setup
You can use Docker to simplify the setup process. Follow these steps:
- Clone the repository:
- Run Docker Compose:
With this setup, if you change the .env or main.py file, you no longer need to restart Docker. Changes will be reloaded automatically.
Manual Setup
Follow these steps for manual setup:
- Clone the repository:
- Create and activate virtual environment:
- Install dependencies:
- Create a .env file in the root directory with the following content:
Writing the Code
Here's the complete code for our FastAPI application:
.env File
SEARXNG_URL=http://localhost:8888
BROWSERLESS_URL=http://localhost:3000
TOKEN=b7a7ad74da294fa39ed75c01cfe4e41b
PROXY_PROTOCOL=http
PROXY_URL=us.premium-residential.geonode.com
PROXY_USERNAME=geonode_OglxV49yXfxxxx
PROXY_PASSWORD=xxxxx-ef6a-42a6-98fa-cc9e86cc0628
PROXY_PORT=9000
REQUEST_TIMEOUT=300
Explanation of Each Variable
- SEARXNG_URL: This is the URL where your SearXNG service is running. In this setup, it's running locally on port 8888.
- BROWSERLESS_URL: This is the URL where your Browserless service is running. In this setup, it's running locally on port 3000.
- TOKEN: The API token for your Browserless service. Whenever it is set, the code sends it as the token query parameter on Browserless requests; leave it empty if your Browserless instance does not require authentication.
- PROXY_PROTOCOL: The protocol used by your proxy service. Typically, this will be either http or https.
- PROXY_URL: The hostname or IP address of your proxy service provider, without the protocol or port (the code adds both). Here, we're using a Geonode proxy.
- PROXY_USERNAME: The username for authenticating with your proxy service. This is specific to your Geonode account.
- PROXY_PASSWORD: The password for authenticating with your proxy service. This is specific to your Geonode account.
- PROXY_PORT: The port number on which your proxy service is running. Common ports include 8080 and 9000.
- REQUEST_TIMEOUT: The timeout duration for HTTP requests, specified in seconds. This helps ensure your application doesn't hang indefinitely while waiting for a response.
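To make the relationship between the proxy variables concrete, here is a minimal sketch of how the pieces combine into a single proxy URL. The helper name build_proxy_url and the sample values are hypothetical. Percent-encoding the credentials is an extra precaution that the tutorial code does not take.

```python
from urllib.parse import quote


def build_proxy_url(protocol, host, port, username=None, password=None):
    """Assemble protocol://user:pass@host:port from the individual settings."""
    credentials = ""
    if username and password:
        # Percent-encode credentials so characters such as "@" or ":" stay unambiguous.
        credentials = f"{quote(username, safe='')}:{quote(password, safe='')}@"
    return f"{protocol}://{credentials}{host}:{port}"


# Hypothetical placeholder values, not real credentials.
print(build_proxy_url("http", "proxy.example.com", "9000", "user", "p@ss:word"))
# http://user:p%40ss%3Aword@proxy.example.com:9000
```

If the username or password is missing, the helper falls back to an unauthenticated URL of the form protocol://host:port.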
Why Use a .env File?
- Security: Keeps sensitive information like API keys, tokens, and passwords out of your codebase.
- Configuration: Allows easy configuration changes without modifying the code.
- Environment-Specific Settings: Easily switch configurations between different environments (development, testing, production) by changing the .env file.
How to Use the .env File
- Create the .env file: In the root of your project directory, create a file named .env.
- Add your variables: Copy the variables listed above into your .env file, replacing the example values with your actual values.
- Load the .env file in your code: Use python-dotenv to load these variables into your application.
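Once python-dotenv has loaded the file, the values are available through os.environ. The sketch below shows one way to validate them at startup so that a missing setting fails early instead of surfacing later as a confusing request error. The names Settings and load_settings are hypothetical and not part of the tutorial's code. The function takes any mapping, so it can be tried without a real .env file.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    searxng_url: str
    browserless_url: str
    request_timeout: int


def load_settings(env=os.environ):
    """Read settings from a mapping and fail early if a required one is missing."""
    required = ("SEARXNG_URL", "BROWSERLESS_URL")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing required settings: {', '.join(missing)}")
    return Settings(
        # Strip a trailing slash so later f-strings do not produce "//search".
        searxng_url=env["SEARXNG_URL"].rstrip("/"),
        browserless_url=env["BROWSERLESS_URL"].rstrip("/"),
        # Same default as the tutorial code: 30 seconds.
        request_timeout=int(env.get("REQUEST_TIMEOUT", "30")),
    )


# Example mapping standing in for the loaded environment.
example_env = {
    "SEARXNG_URL": "http://localhost:8888/",
    "BROWSERLESS_URL": "http://localhost:3000",
}
print(load_settings(example_env))
```

In the real application you would call load_settings() with no arguments after load_dotenv() has run.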
main.py
import os
from dotenv import load_dotenv
import httpx
from fastapi import FastAPI, Query
from fastapi.responses import JSONResponse, PlainTextResponse
from markdownify import markdownify as md
from bs4 import BeautifulSoup, Comment
import json
# Load .env file
load_dotenv()
# Retrieve environment variables
SEARXNG_URL = os.getenv('SEARXNG_URL')
BROWSERLESS_URL = os.getenv('BROWSERLESS_URL')
TOKEN = os.getenv('TOKEN')
PROXY_PROTOCOL = os.getenv('PROXY_PROTOCOL', 'http')
PROXY_URL = os.getenv('PROXY_URL')
PROXY_USERNAME = os.getenv('PROXY_USERNAME')
PROXY_PASSWORD = os.getenv('PROXY_PASSWORD')
PROXY_PORT = os.getenv('PROXY_PORT')
REQUEST_TIMEOUT = int(os.getenv('REQUEST_TIMEOUT', '30'))
# Domains that should only be accessed using Browserless
domains_only_for_browserless = ["twitter", "x", "facebook"] # Add more domains here
# Create FastAPI app
app = FastAPI()
HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'
}
def fetch_normal_content(url, proxies):
    # Note: the proxies= argument is deprecated since httpx 0.26 and removed in 0.28.
    # On newer httpx versions, pass a single proxy URL via proxy= instead.
    try:
        response = httpx.get(url, headers=HEADERS, timeout=REQUEST_TIMEOUT, proxies=proxies)
        response.raise_for_status()
        return response.text
    except httpx.RequestError as e:
        print(f"An error occurred while requesting {url}: {e}")
    except httpx.HTTPStatusError as e:
        print(f"HTTP error occurred: {e}")
    return None
def fetch_browserless_content(url, proxies):
    try:
        browserless_url = f"{BROWSERLESS_URL}/content"
        # Query parameters: auth token and Chrome launch flags
        params = {}
        if TOKEN:
            params['token'] = TOKEN
        proxy_url = f"{PROXY_PROTOCOL}://{PROXY_URL}:{PROXY_PORT}" if PROXY_URL and PROXY_PORT else None
        if proxy_url:
            params['--proxy-server'] = proxy_url
        # Request body: skip images and CSS to speed up rendering
        browserless_data = {
            "url": url,
            "rejectResourceTypes": ["image"],
            "rejectRequestPattern": ["/^.*\\.(css)/"],
            "gotoOptions": {"waitUntil": "networkidle2"},
            "bestAttempt": True
        }
        if PROXY_USERNAME and PROXY_PASSWORD:
            browserless_data["authenticate"] = {
                "username": PROXY_USERNAME,
                "password": PROXY_PASSWORD
            }
        headers = {
            'Cache-Control': 'no-cache',
            'Content-Type': 'application/json'
        }
        # json= serializes the body; passing a string through data= is deprecated in httpx
        response = httpx.post(browserless_url, params=params, headers=headers, json=browserless_data, timeout=REQUEST_TIMEOUT*2)
        response.raise_for_status()
        return response.text
    except httpx.RequestError as e:
        print(f"An error occurred while requesting Browserless for {url}: {e}")
    except httpx.HTTPStatusError as e:
        print(f"HTTP error occurred with Browserless: {e}")
    return None
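from urllib.parse import urlparse  # can be moved to the top with the other imports
def needs_browserless(url):
    # Match whole hostname labels so that "x" matches x.com but not example.com
    host = urlparse(url).hostname or ""
    return any(label in domains_only_for_browserless for label in host.split("."))
def fetch_content(url):
    proxies = None
    if PROXY_URL and PROXY_USERNAME and PROXY_PORT:
        proxy = f"{PROXY_PROTOCOL}://{PROXY_USERNAME}:{PROXY_PASSWORD}@{PROXY_URL}:{PROXY_PORT}"
        proxies = {"http://": proxy, "https://": proxy}
        # Log only the host and port so credentials stay out of the logs
        print(f"Using proxy {PROXY_URL}:{PROXY_PORT}")
    if needs_browserless(url):
        content = fetch_browserless_content(url, proxies)
    else:
        content = fetch_normal_content(url, proxies)
        # Fall back to Browserless when the plain request fails
        if content is None:
            content = fetch_browserless_content(url, proxies)
    return content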
def clean_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    # Remove all script, style, and other unnecessary elements
    for script_or_style in soup(["script", "style", "header", "footer", "noscript", "form", "input", "textarea", "select", "option", "button", "svg", "iframe", "object", "embed", "applet"]):
        script_or_style.decompose()
    # Remove unwanted classes and ids
    for tag in soup.find_all(True):
        tag.attrs = {key: value for key, value in tag.attrs.items() if key not in ['class', 'id', 'style']}
    # Remove comments
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        comment.extract()
    return str(soup)
def parse_html_to_markdown(html, url):
    cleaned_html = clean_html(html)
    markdown_content = md(cleaned_html)
    # Parse the page once for its title and guard against a missing or empty <title>
    title_tag = BeautifulSoup(html, 'html.parser').title
    return {
        "title": title_tag.string if title_tag and title_tag.string else 'No title',
        "url": url,
        "md": clean_markdown(markdown_content)
    }
def clean_markdown(markdown):
    # Remove extra newlines and whitespace
    lines = markdown.split('\n')
    cleaned_lines = [line.strip() for line in lines if line.strip()]
    return '\n'.join(cleaned_lines)
def search(query: str, num_results: int) -> list:
    # Let httpx URL-encode the query instead of interpolating it into the URL
    params = {"q": query, "categories": "general", "format": "json"}
    try:
        response = httpx.get(f"{SEARXNG_URL}/search", params=params, headers=HEADERS, timeout=REQUEST_TIMEOUT)
        response.raise_for_status()
    except httpx.RequestError as e:
        return [{"error": f"Search query failed with error: {e}"}]
    except httpx.HTTPStatusError as e:
        return [{"error": f"Search query failed with HTTP error: {e}"}]
    search_results = response.json()
    results = []
    # Fetch and convert only the first num_results hits
    for result in search_results["results"][:num_results]:
        url = result["url"]
        title = result["title"]
        html_content = fetch_content(url)
        if html_content:
            markdown_data = parse_html_to_markdown(html_content, url)
            results.append({
                "title": title,
                "url": url,
                "markdown_content": (
                    f"Title: {markdown_data['title']}\n\n"
                    f"URL Source: {markdown_data['url']}\n\n"
                    f"Markdown Content:\n{markdown_data['md']}"
                )
            })
    return results
@app.get("/", response_class=JSONResponse)
def get_search_results(q: str = Query(..., description="Search query"), num_results: int = Query(5, description="Number of results")):
    result_list = search(q, num_results)
    return result_list
@app.get("/r/{url:path}", response_class=PlainTextResponse)
def fetch_url(url: str):
    html_content = fetch_content(url)
    if html_content:
        markdown_data = parse_html_to_markdown(html_content, url)
        response_text = (
            f"Title: {markdown_data['title']}\n\n"
            f"URL Source: {markdown_data['url']}\n\n"
            f"Markdown Content:\n{markdown_data['md']}"
        )
        return PlainTextResponse(response_text)
    return PlainTextResponse("Failed to retrieve content")
# Example usage
if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
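With the app running, the two endpoints can be called from any HTTP client. As a rough sketch, the helpers below build the request URLs for the search endpoint ("/") and the single-page endpoint ("/r/{url}"). The names API_BASE, search_endpoint, and page_endpoint are hypothetical. The base URL assumes the uvicorn settings from main.py, and the encoding of "?" in the target URL assumes the server percent-decodes the path, as uvicorn normally does.

```python
from urllib.parse import quote, urlencode

API_BASE = "http://localhost:8000"  # assumed: host and port from uvicorn.run in main.py


def search_endpoint(query, num_results=5):
    """URL for the JSON search endpoint ("/")."""
    return f"{API_BASE}/?{urlencode({'q': query, 'num_results': num_results})}"


def page_endpoint(target_url):
    """URL for the single-page Markdown endpoint ("/r/{url}")."""
    # Encode "?" and "=" so the target's query string is not read as ours.
    return f"{API_BASE}/r/{quote(target_url, safe=':/')}"


print(search_endpoint("fastapi tutorial", 3))
# http://localhost:8000/?q=fastapi+tutorial&num_results=3
print(page_endpoint("https://example.com/docs?page=2"))
# http://localhost:8000/r/https://example.com/docs%3Fpage%3D2
```

You can paste either URL into a browser or pass it to curl to try the API.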
Running SearXNG and Browserless in Docker
Next, let's set up and run the SearXNG and Browserless services using Docker. Create a shell script run-services.sh with the following content:
#!/bin/bash
# Run SearXNG (the container listens on port 8080, so map host port 8888 to it)
docker run -d --name searxng -p 8888:8080 searxng/searxng
# Run Browserless
docker run -d --name browserless -p 3000:3000 browserless/chrome
echo "SearXNG is running at http://localhost:8888"
echo "Browserless is running at http://localhost:3000"
Make the script executable and run it:
chmod +x run-services.sh
./run-services.sh
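This script will pull and run the SearXNG and Browserless Docker images, making them accessible on your local machine. Note that SearXNG only answers format=json requests when json is listed under search.formats in its settings.yml; with the default settings, such requests are typically rejected with a 403 error.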
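Containers can take a few seconds to start accepting connections, so it can help to confirm that both services respond before starting the FastAPI app. The following is a minimal readiness check using only the standard library. The function names is_up and wait_for are hypothetical, and the URLs assume the port mappings from the script above.

```python
import time
import urllib.error
import urllib.request

# Bypass any system-wide HTTP proxy: the services run on localhost.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def is_up(url, timeout=2.0):
    """Return True if the service sends back any HTTP response at all."""
    try:
        with _opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        # The server answered, just not with a 2xx status, so it is running.
        return True
    except (urllib.error.URLError, OSError):
        return False


def wait_for(url, attempts=3, delay=0.5):
    """Poll url until it responds or the attempts run out."""
    for _ in range(attempts):
        if is_up(url):
            return True
        time.sleep(delay)
    return False


if __name__ == "__main__":
    services = [("SearXNG", "http://localhost:8888"),
                ("Browserless", "http://localhost:3000")]
    for name, url in services:
        print(name, "ready" if wait_for(url) else "not reachable")
```

Any HTTP response counts as "up" here, including error statuses, because the goal is only to confirm that the container is listening.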
Using Proxies
In this tutorial, I'm using Geonode proxies to scrape content. You can use my Geonode affiliate link to get started with their proxy services.
Full Source Code
You can find the full source code for this project on my GitHub repository.
Enjoyed the Post?
I hope you found this tutorial helpful! Feel free to reach out if you have any questions or need further assistance. Happy scraping!
Enjoyed the post? Follow my blog at essamamdani.com for more tutorials and insights.