Web Scraping with Python
I recently went to the AI Agent Conference and I noticed a recurring theme - quite a few vendors are offering public data that they’ve scraped from the web (or scrape in real time) to make available to AI agents. BrightData had a big display and even hosted a nice happy hour event. It makes sense when you think about it - AI agents need access to current, structured data to be useful, and web scraping is a popular way for folks to get that data at scale.
It got me thinking about how fundamental web scraping has become, not just for AI but for all kinds of projects. I’ve been scraping websites for well over a decade. I started years ago by pulling player stats to get ready for my Fantasy Football draft, and it’s since become one of those skills I reach for constantly. Need to pull product data for a side project? Scrape it. Want to monitor prices across marketplaces? Scrape it. Need to build a dataset that doesn’t exist as a nice API? You guessed it.
Web scraping is one of those things that sounds simple on the surface - just grab the HTML and parse it, right? But anyone who’s spent real time doing it knows the rabbit hole goes deep. You’ve got JavaScript-rendered pages, rate limiting, anti-bot measures, changing DOM structures, and the ever-present question of whether you’re being a good citizen of the internet. I wanted to put together a thorough guide covering the tools and techniques I’ve found most useful over the years.
The basics
At its core, web scraping is just making HTTP requests and extracting data from the responses. The simplest version looks like this:
import requests
from bs4 import BeautifulSoup
url = "https://example.com/products"
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
for item in soup.select(".product-card"):
    name = item.select_one(".product-name").text.strip()
    price = item.select_one(".product-price").text.strip()
    print(f"{name}: {price}")
That’s the hello world of scraping. You make a GET request with the requests library, pass the HTML into BeautifulSoup, and use CSS selectors to find the elements you care about. For simple, static HTML pages, this is all you need.
But most interesting websites aren’t that simple.
Choosing the right tool for the job
Over the years I’ve settled on a few go-to libraries depending on what I’m dealing with. Here’s how I think about it:
requests + BeautifulSoup
This is my default starting point. It’s lightweight, fast, and handles the majority of cases where the content you want is in the initial HTML response. I’d estimate 60-70% of my scraping projects start and end here.
Best for:
- Static HTML pages
- Sites with server-rendered content
- APIs that return HTML fragments
- Quick one-off scripts
Install them with:
pip install requests beautifulsoup4 lxml
I always install lxml as the parser - it’s significantly faster than the default html.parser and handles malformed HTML more gracefully:
soup = BeautifulSoup(response.text, "lxml")
Selenium
When a site relies heavily on JavaScript to render content, requests + BeautifulSoup won’t cut it because the HTML you get back is basically an empty shell with a bunch of <script> tags. That’s where Selenium comes in - it drives a real browser, so JavaScript executes just like it would for a human visitor.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.get("https://example.com/dynamic-page")
# wait for the content to actually render
wait = WebDriverWait(driver, 10)
items = wait.until(
    EC.presence_of_all_elements_located((By.CSS_SELECTOR, ".product-card"))
)

for item in items:
    name = item.find_element(By.CSS_SELECTOR, ".product-name").text
    price = item.find_element(By.CSS_SELECTOR, ".product-price").text
    print(f"{name}: {price}")

driver.quit()
Best for:
- JavaScript-heavy single page applications (SPAs)
- Pages that require user interaction (clicking, scrolling, form submission)
- Sites where you need to log in first
The downside: Selenium is slow. It’s launching a full browser for every request. For scraping a handful of pages that’s fine, but if you need to hit thousands of URLs, you’ll want to look for alternatives first.
Playwright
Playwright is the newer kid on the block and has become my preferred choice over Selenium for browser automation. It’s faster, has a cleaner API, and handles async operations more naturally.
from playwright.sync_api import sync_playwright
with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/dynamic-page")

    # wait for content to load
    page.wait_for_selector(".product-card")
    items = page.query_selector_all(".product-card")

    for item in items:
        name = item.query_selector(".product-name").inner_text()
        price = item.query_selector(".product-price").inner_text()
        print(f"{name}: {price}")

    browser.close()
Install with:
pip install playwright
playwright install
Best for:
- Same use cases as Selenium, but with better performance
- When you need to intercept network requests (see the sketch below)
- When you want built-in waiting and auto-retry logic
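On the network-interception point, here’s a minimal sketch using Playwright’s expect_response helper to capture a JSON payload the page loads in the background. The example.com URL and the /api/products filter are placeholders - substitute whatever endpoint you see in your browser’s Network tab:

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # wait until the page itself fires a background request matching our filter
    with page.expect_response(lambda r: "/api/products" in r.url) as response_info:
        page.goto("https://example.com/dynamic-page")

    data = response_info.value.json()  # parsed body of the captured response
    print(f"Captured {len(data)} records")  # assumes the endpoint returns a JSON list

    browser.close()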
Worth mentioning: if you’re running into bot detection with Playwright, check out Patchright. It’s a patched fork of Playwright that removes many of the telltale signs that automation tools leave behind (like the navigator.webdriver flag and other browser fingerprinting leaks). The API is identical to Playwright, so you can swap it in with minimal changes:
pip install patchright
patchright install
from patchright.sync_api import sync_playwright
# everything else stays the same
I reach for Patchright when a site’s bot detection is catching standard Playwright but I still need browser-level rendering. It saves you from having to manually patch all those detection vectors yourself.
Scrapy
Scrapy is a full framework rather than just a library. If you’re building a scraper that needs to crawl hundreds or thousands of pages, handle retries, respect robots.txt, manage a queue of URLs, and output structured data - Scrapy is the right answer.
I’ll be honest though, for most of my projects Scrapy is overkill. I tend to reach for it when a project grows beyond what a simple script can handle, rather than starting with it from the beginning.
Best for:
- Large-scale crawling across many pages
- Projects that need built-in retry logic, rate limiting, and pipeline processing
- When you want a structured, maintainable scraping codebase
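To give a feel for the structure, here’s a minimal spider sketch - the URL and selectors are placeholders, and a real project would usually be laid out with scrapy startproject, items, and pipelines:

import scrapy

class ProductSpider(scrapy.Spider):
    name = "products"
    start_urls = ["https://example.com/products"]  # placeholder URL
    custom_settings = {
        "DOWNLOAD_DELAY": 2,      # be polite between requests
        "ROBOTSTXT_OBEY": True,   # respect robots.txt
    }

    def parse(self, response):
        for card in response.css(".product-card"):
            yield {
                "name": card.css(".product-name::text").get(default="").strip(),
                "price": card.css(".product-price::text").get(default="").strip(),
            }

        # follow pagination until there is no next-page link
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

Run it with scrapy runspider spider.py -o products.json and Scrapy takes care of the request queue, retries, and output for you.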
Parsing strategies
Getting the HTML is only half the battle. Extracting the right data from it is where things get interesting.
CSS selectors
CSS selectors are my go-to for most parsing. If you’ve written any frontend code, they’ll feel natural:
# by class
soup.select(".product-name")
# by id
soup.select("#main-content")
# nested elements
soup.select("div.product-card > h2.title")
# by attribute
soup.select('a[href*="product"]')
# nth child
soup.select("table tr:nth-child(2) td")
XPath
Sometimes CSS selectors aren’t expressive enough. XPath lets you do things like “find the div that contains this text” or “get the parent of this element”:
from lxml import html
tree = html.fromstring(response.text)
# find by text content
tree.xpath('//div[contains(text(), "Price")]')
# get parent element
tree.xpath('//span[@class="price"]/parent::div')
# get following sibling
tree.xpath('//h2[text()="Details"]/following-sibling::p[1]')
Regular expressions
I’m not going to tell you to parse HTML with regex - that’s a well-known path to madness. But regex is genuinely useful for extracting structured data from text content you’ve already parsed:
import re
text = soup.select_one(".product-details").text
price = re.search(r'\$[\d,]+\.?\d*', text)
sku = re.search(r'SKU:\s*(\w+-\d+)', text)
if price:
    print(f"Price: {price.group()}")
if sku:
    print(f"SKU: {sku.group(1)}")
Understanding User-Agent strings
Every HTTP request your browser makes includes a User-Agent header that tells the server what software is making the request. When you visit a website in Chrome, the header looks something like this:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36
It’s a bit of a mess historically - all those “Mozilla” and “Safari” references are legacy compatibility artifacts from the browser wars. But the important thing for scraping is that servers use this header to decide how to handle your request. The default User-Agent for Python’s requests library is something like python-requests/2.31.0, which is basically a neon sign saying “I’m a bot.”
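You can verify this yourself and override it per request - the snippet below uses httpbin.org purely as an echo service to show what the server actually receives:

import requests

# what requests sends by default, e.g. "python-requests/2.31.0"
print(requests.utils.default_user_agent())

# override it per request with a headers dict
headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36"
}
response = requests.get("https://httpbin.org/headers", headers=headers)
print(response.json()["headers"]["User-Agent"])  # echoes back the header the server saw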
Why it matters
Many websites treat requests differently based on the User-Agent:
- Some block known bot User-Agents entirely
- Some serve different content (simplified HTML, CAPTCHAs, or error pages)
- Some rate limit more aggressively when they see non-browser User-Agents
- WAFs (Web Application Firewalls) like Cloudflare often flag requests with missing or suspicious User-Agents
Best practices
There are a few approaches, and the right one depends on your situation:
For personal projects and research, I like setting an honest, descriptive User-Agent that identifies your bot and provides contact info. This is the most ethical approach and many site owners appreciate the transparency:
"User-Agent": "Mozilla/5.0 (compatible; JaysDataBot/1.0; +https://jaygrossman.com/bot-info)"
For scraping sites that block non-browser User-Agents, you’ll need to use a realistic browser string. The easiest way to get one is to open your browser’s dev tools (F12), go to the Console tab, and type navigator.userAgent - that gives you your own browser’s exact string. You can also check whatismybrowser.com which maintains lists of current strings across browsers and platforms. The key is to use one that matches a current, common browser - an outdated Chrome version from 2019 can be just as suspicious as no User-Agent at all.
Rotating User-Agents is useful when you’re making many requests and want to avoid pattern detection. But don’t just randomize wildly - stick to a pool of realistic, current browser strings.
Building a UserAgent manager
Here’s a class I use that handles User-Agent rotation and keeps things organized:
import random
from dataclasses import dataclass


@dataclass
class UserAgentConfig:
    rotate: bool = True
    custom: str = None


class UserAgentManager:
    # current, realistic browser User-Agent strings
    BROWSER_AGENTS = [
        # Chrome on Windows
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        # Chrome on Mac
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        # Chrome on Linux
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
        # Firefox on Windows
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:121.0) "
        "Gecko/20100101 Firefox/121.0",
        # Firefox on Mac
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:121.0) "
        "Gecko/20100101 Firefox/121.0",
        # Safari on Mac
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
        "(KHTML, like Gecko) Version/17.2 Safari/605.1.15",
        # Edge on Windows
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36 Edg/120.0.0.0",
    ]

    def __init__(self, config: UserAgentConfig = None):
        self.config = config or UserAgentConfig()
        self._last_used = None

    def get(self) -> str:
        """Return a User-Agent string based on the configuration."""
        if self.config.custom:
            return self.config.custom
        if self.config.rotate:
            # avoid using the same one twice in a row
            available = [ua for ua in self.BROWSER_AGENTS if ua != self._last_used]
            agent = random.choice(available)
            self._last_used = agent
            return agent
        return self.BROWSER_AGENTS[0]

    def apply(self, session) -> None:
        """Apply a User-Agent to a requests session."""
        session.headers.update({"User-Agent": self.get()})
Using it is straightforward:
import requests
# rotating User-Agents (default)
ua_manager = UserAgentManager()
session = requests.Session()
# apply a new User-Agent before each request (or batch of requests)
ua_manager.apply(session)
response = session.get("https://example.com")
# or use a custom, honest bot identifier
ua_manager = UserAgentManager(UserAgentConfig(
    custom="Mozilla/5.0 (compatible; JaysDataBot/1.0; +https://jaygrossman.com)"
))
ua_manager.apply(session)
# or no rotation - just use one consistent browser string
ua_manager = UserAgentManager(UserAgentConfig(rotate=False))
One thing to keep in mind - User-Agent rotation alone won’t get you past sophisticated bot detection. Modern anti-bot systems look at a combination of signals including TLS fingerprints, header ordering, JavaScript execution patterns, and behavioral analysis. But a proper User-Agent is table stakes and will get you past the simpler checks.
Handling common challenges
This is where the real fun starts. Here are the problems you’ll inevitably run into and how I deal with them.
Rate limiting and being a good citizen
The single most important thing in scraping is not hammering the server. It’s both an ethical issue and a practical one - get too aggressive and you’ll get blocked fast.
import time
import random

def polite_get(url, session, min_delay=1, max_delay=5):
    """Make a request with a random delay to avoid hammering the server."""
    time.sleep(random.uniform(min_delay, max_delay))
    response = session.get(url)
    response.raise_for_status()
    return response
Some rules I follow:
- Always add delays between requests. I typically use random delays between 1 and 5 seconds for general scraping, and longer for smaller sites.
- Read the Terms of Service. Some sites explicitly prohibit scraping or automated access. It’s worth knowing what you’re agreeing to before you start.
- Check robots.txt first. It tells you what the site owner is comfortable with. Respect it (there’s a sketch below showing how to check it programmatically).
- Use a session object. It reuses TCP connections and is actually friendlier to the server than creating new connections every time.
- Set a reasonable User-Agent. Use the UserAgentManager class from above, or at minimum set a realistic browser string. The default python-requests User-Agent is an easy way to get blocked.
session = requests.Session()
ua_manager = UserAgentManager()
ua_manager.apply(session)
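And for the robots.txt rule, the standard library’s urllib.robotparser handles the parsing for you. A quick sketch - the bot name and URLs are placeholders, and polite_get is the helper defined above:

from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

bot_name = "JaysDataBot"  # placeholder - match whatever you put in your User-Agent
url = "https://example.com/products?page=1"

if rp.can_fetch(bot_name, url):
    response = polite_get(url, session)
else:
    print(f"robots.txt disallows {url} - skipping")

# some sites also publish a crawl delay worth honoring
delay = rp.crawl_delay(bot_name)
if delay:
    print(f"Site requests a {delay}s delay between requests")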
Handling pagination
Most sites paginate their results. The approach depends on how the site implements pagination:
def scrape_all_pages(base_url, session):
    """Scrape through paginated results."""
    all_items = []
    page = 1

    while True:
        url = f"{base_url}?page={page}"
        response = polite_get(url, session)
        soup = BeautifulSoup(response.text, "lxml")

        items = soup.select(".product-card")
        if not items:
            break

        for item in items:
            all_items.append({
                "name": item.select_one(".product-name").text.strip(),
                "price": item.select_one(".product-price").text.strip(),
            })

        # check if there's a next page
        next_button = soup.select_one("a.next-page")
        if not next_button:
            break

        page += 1
        print(f"Scraped page {page - 1}, found {len(items)} items")

    return all_items
Dealing with anti-bot measures
Sites increasingly use tools like Cloudflare, DataDome, or custom bot detection. Some strategies:
Rotate User-Agent strings:
import random

USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
]

session.headers.update({
    "User-Agent": random.choice(USER_AGENTS)
})
Use proxy rotation for larger scraping jobs where you need to distribute requests across multiple IPs. I won’t go into specific proxy providers here, but the pattern looks like:
proxies = [
    "http://proxy1:port",
    "http://proxy2:port",
    "http://proxy3:port",
]

proxy = {"http": random.choice(proxies), "https": random.choice(proxies)}
response = session.get(url, proxies=proxy)
Use headless browsers with stealth plugins when detection is more aggressive. For Playwright there’s playwright-stealth which patches common detection vectors.
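A rough sketch of what that looks like. Note that playwright-stealth has gone through API changes between versions, so check its docs for the exact entry point; this assumes the stealth_sync helper from the original releases:

from playwright.sync_api import sync_playwright
from playwright_stealth import stealth_sync  # pip install playwright-stealth

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()

    # patch common detection vectors (navigator.webdriver, plugins, etc.)
    stealth_sync(page)

    page.goto("https://example.com/protected-page")
    print(page.content()[:500])
    browser.close()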
Handling login-required pages
Sometimes you need to authenticate first. For cookie-based auth with requests:
session = requests.Session()

# log in
login_data = {
    "username": "myuser",
    "password": "mypassword",
}
session.post("https://example.com/login", data=login_data)

# now subsequent requests carry the session cookie
response = session.get("https://example.com/protected-page")
For sites with CSRF tokens, you’ll need to grab the token from the login page first:
# get the login page to extract CSRF token
login_page = session.get("https://example.com/login")
soup = BeautifulSoup(login_page.text, "lxml")
csrf_token = soup.select_one('input[name="csrf_token"]')["value"]

login_data = {
    "username": "myuser",
    "password": "mypassword",
    "csrf_token": csrf_token,
}
session.post("https://example.com/login", data=login_data)
When the DOM structure changes
This is the bane of every scraper’s existence. You build a beautiful scraper, it works great for two weeks, then the site redesigns and everything breaks.
A few things that help:
- Use data attributes when available (data-product-id, data-price) - they change less often than CSS classes
- Be defensive - always check if an element exists before accessing its text
- Add monitoring - log what you’re scraping so you know quickly when something breaks
def safe_extract(element, selector, default=""):
    """Extract text from a selector, returning a default if not found."""
    found = element.select_one(selector)
    return found.text.strip() if found else default
When you need to get creative
Sometimes the standard approaches just don’t work. The site’s bot detection is too aggressive, the data is buried behind complex JavaScript interactions, or the page you need has been taken down entirely. These are the situations where you need to think sideways.
Use a headed browser instead of headless. This sounds counterintuitive - why would you want a visible browser window? But some anti-bot systems specifically detect headless mode. Running Playwright or Selenium with the browser visible (headless=False) can bypass these checks. It’s slower and you can’t easily run it on a server, but when nothing else works, it works:
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    # launch with a visible browser window
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://stubborn-site.com/data")

    # you can even add manual pauses to mimic human behavior
    page.wait_for_timeout(2000)
    page.mouse.move(100, 200)
    page.wait_for_timeout(500)

    content = page.content()
    browser.close()
Build a Chrome extension. I’ve written a blog post about building Chrome extensions before, and they’re surprisingly useful for scraping. The key advantage is that your extension runs inside a real browser session with your real cookies and browsing context - there is no obvious signal to the site that this is not regular browsing behavior. You can inject a content script that extracts data from pages as you browse them, or use the extension to intercept and save API responses. It’s more manual and potentially more resource intensive than a fully automated scraper, but for sites that aggressively block automation it can be the only option that works.
Check the Internet Archive’s Wayback Machine. This is one people forget about. If you need historical data from a site, or if a page has been taken down, the Wayback Machine might have a cached copy. They also have an API you can use programmatically:
import requests

def get_archived_page(url, timestamp=None):
    """Fetch a page from the Internet Archive's Wayback Machine."""
    if timestamp:
        # get a specific snapshot (format: YYYYMMDDHHmmss)
        api_url = f"https://web.archive.org/web/{timestamp}/{url}"
    else:
        # get the most recent snapshot
        api_url = f"https://web.archive.org/web/{url}"

    response = requests.get(api_url)
    if response.status_code == 200:
        return response.text
    return None

# check what snapshots are available
def list_snapshots(url):
    """List available snapshots for a URL."""
    cdx_url = "https://web.archive.org/cdx/search/cdx"
    params = {
        "url": url,
        "output": "json",
        "limit": 10,
        "fl": "timestamp,statuscode",
        "filter": "statuscode:200",
    }
    response = requests.get(cdx_url, params=params)
    if response.status_code == 200:
        return response.json()
    return []
Look at Google’s cache. Similar to the Wayback Machine, Google has historically cached pages it crawls, accessible by prepending https://webcache.googleusercontent.com/search?q=cache: to a URL. Google has been retiring cached links, so this is far less reliable than the Wayback Machine, but it can still be worth a quick check for recently changed or removed content.
Scrape the API instead of the page. I mentioned this briefly earlier, but it’s worth emphasizing as a creative strategy. Open your browser’s dev tools, go to the Network tab, filter by XHR/Fetch, and browse the site normally. Many modern sites load their data from internal JSON APIs that are way easier to work with than parsing HTML. Sometimes these APIs don’t require authentication, or they accept the same session cookies your browser uses. I’ve had entire scraping projects collapse down to a single requests.get() call once I found the right API endpoint.
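Once you’ve spotted an endpoint in the Network tab, the whole scraper can shrink to a couple of requests calls. The /api/products path, query parameters, and response shape below are hypothetical stand-ins for whatever you observe in dev tools:

import requests

session = requests.Session()
session.headers.update({
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                  "AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Accept": "application/json",
})

# hypothetical internal endpoint discovered in the Network tab
response = session.get(
    "https://example.com/api/products",
    params={"page": 1, "per_page": 50},
)
response.raise_for_status()

for product in response.json().get("items", []):  # assumed response shape
    print(product["name"], product["price"])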
Storing scraped data
Once you’ve got the data, you need somewhere to put it. Here are the approaches I use most.
CSV for simple datasets
For quick scraping jobs where I just need to look at the data in a spreadsheet:
import csv

def save_to_csv(items, filename):
    if not items:
        return

    keys = items[0].keys()
    with open(filename, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=keys)
        writer.writeheader()
        writer.writerows(items)
SQLite for anything recurring
If I’m going to be running a scraper more than once, I almost always use SQLite. It’s zero-config, handles concurrent reads fine, and makes it easy to track what you’ve already scraped:
import sqlite3

def init_db(db_path):
    conn = sqlite3.connect(db_path)
    conn.execute("""
        CREATE TABLE IF NOT EXISTS products (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            url TEXT UNIQUE,
            name TEXT,
            price REAL,
            scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
        )
    """)
    conn.commit()
    return conn

def save_product(conn, product):
    conn.execute("""
        INSERT OR REPLACE INTO products (url, name, price)
        VALUES (?, ?, ?)
    """, (product["url"], product["name"], product["price"]))
    conn.commit()
The UNIQUE constraint on the URL means I can re-run the scraper without worrying about duplicates, and the scraped_at timestamp lets me track freshness.
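If you want to go a step further and skip URLs that were scraped recently, a small helper against the same schema does the trick - this is a sketch, with the seven-day window as an arbitrary default:

def recently_scraped(conn, url, max_age_days=7):
    """Return True if this URL was scraped within the last max_age_days."""
    row = conn.execute(
        """
        SELECT 1 FROM products
        WHERE url = ? AND scraped_at > datetime('now', ?)
        """,
        (url, f"-{max_age_days} days"),
    ).fetchone()
    return row is not None

# usage inside a scraping loop:
#     if not recently_scraped(conn, url):
#         product = ...  # fetch and parse
#         save_product(conn, product)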
JSON for nested/complex data
When the data has nested structures that don’t fit neatly into rows and columns:
import json

def save_to_json(items, filename):
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(items, f, indent=2, ensure_ascii=False)
Putting it all together
Here’s a more complete example that ties together the patterns above. This scraper handles pagination, rate limiting, error recovery, and saves to SQLite:
import requests
from bs4 import BeautifulSoup
import sqlite3
import time
import random
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)


class ProductScraper:
    def __init__(self, db_path="products.db"):
        self.session = requests.Session()
        self.session.headers.update({
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) "
                          "AppleWebKit/537.36 Chrome/120.0.0.0 Safari/537.36"
        })
        self.conn = sqlite3.connect(db_path)
        self._init_db()

    def _init_db(self):
        self.conn.execute("""
            CREATE TABLE IF NOT EXISTS products (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                url TEXT UNIQUE,
                name TEXT,
                price TEXT,
                description TEXT,
                scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
            )
        """)
        self.conn.commit()

    def _polite_get(self, url, retries=3):
        for attempt in range(retries):
            try:
                time.sleep(random.uniform(1, 5))
                response = self.session.get(url, timeout=30)
                response.raise_for_status()
                return response
            except requests.RequestException as e:
                logger.warning(f"Attempt {attempt + 1} failed for {url}: {e}")
                if attempt == retries - 1:
                    raise
                time.sleep(5 * (attempt + 1))  # backoff

    def _parse_product(self, card):
        return {
            "url": card.select_one("a")["href"] if card.select_one("a") else "",
            "name": self._safe_text(card, ".product-name"),
            "price": self._safe_text(card, ".product-price"),
            "description": self._safe_text(card, ".product-description"),
        }

    def _safe_text(self, element, selector):
        found = element.select_one(selector)
        return found.text.strip() if found else ""

    def _save_product(self, product):
        self.conn.execute("""
            INSERT OR REPLACE INTO products (url, name, price, description)
            VALUES (?, ?, ?, ?)
        """, (product["url"], product["name"],
              product["price"], product["description"]))
        self.conn.commit()

    def scrape(self, base_url):
        page = 1
        total = 0

        while True:
            url = f"{base_url}?page={page}"
            logger.info(f"Scraping {url}")

            response = self._polite_get(url)
            soup = BeautifulSoup(response.text, "lxml")
            cards = soup.select(".product-card")

            if not cards:
                break

            for card in cards:
                product = self._parse_product(card)
                self._save_product(product)
                total += 1

            logger.info(f"Page {page}: {len(cards)} products (total: {total})")

            if not soup.select_one("a.next-page"):
                break
            page += 1

        logger.info(f"Done. Scraped {total} products across {page} pages.")

    def close(self):
        self.conn.close()
        self.session.close()


if __name__ == "__main__":
    scraper = ProductScraper()
    try:
        scraper.scrape("https://example.com/products")
    finally:
        scraper.close()
A note on ethics and legality
I’d be irresponsible if I didn’t mention this. Web scraping exists in a gray area, and it’s worth thinking about before you start a project:
- Check the Terms of Service. Some sites explicitly prohibit scraping. That doesn’t necessarily make it illegal everywhere, but it’s worth knowing.
- Respect robots.txt. It’s not legally binding in most jurisdictions, but it represents the site owner’s wishes.
- Don’t overload servers. This is both ethical and practical. A small site running on a shared host can be genuinely impacted by aggressive scraping.
- Be careful with personal data. Scraping publicly available data is generally different from scraping personal information. GDPR and similar regulations apply.
- Consider the API first. If a site offers an API, use it. It’s more reliable, more respectful, and usually gives you better data.
I’ve always tried to follow a simple rule: scrape the way you’d want someone to scrape your site. Be polite, don’t take more than you need, and if the site owner asks you to stop, stop.
What I’ve learned
After years of scraping projects, here are the things I wish I’d known from the start:
- Start simple. Don’t reach for Selenium or Playwright until you’ve confirmed that requests + BeautifulSoup can’t handle it. Check the page source in your browser - if the data is in the HTML, you don’t need a headless browser.
- Check for APIs first. Open your browser’s network tab and watch the XHR requests. Many “dynamic” sites actually load data from JSON APIs that you can call directly, which is way easier and faster than parsing HTML.
- Build incrementally. Get one page working before worrying about pagination, error handling, or data storage. Layer complexity as you need it.
- Monitor your scrapers. If a scraper runs on a schedule, add alerting so you know when it breaks. They always break eventually.
- Cache raw responses during development. Save the HTML to disk so you’re not hitting the server every time you tweak your parsing code. Your development loop gets much faster, and you’re nicer to the server.
import hashlib
import os

def cached_get(url, session, cache_dir="cache"):
    """Cache responses to disk during development."""
    os.makedirs(cache_dir, exist_ok=True)
    cache_key = hashlib.md5(url.encode()).hexdigest()
    cache_path = os.path.join(cache_dir, f"{cache_key}.html")

    if os.path.exists(cache_path):
        with open(cache_path, "r", encoding="utf-8") as f:
            return f.read()

    response = session.get(url)
    response.raise_for_status()

    with open(cache_path, "w", encoding="utf-8") as f:
        f.write(response.text)
    return response.text
Web scraping is one of those skills that just keeps paying dividends. Once you’re comfortable with it, the entire web becomes your dataset. Just remember to be a good neighbor while you’re at it.