How to Scrape the Web with Python: A Practical Guide with requests and BeautifulSoup
Web scraping is the practice of extracting data from web pages programmatically instead of copying it by hand. You write code that fetches a page, reads its HTML, picks out the pieces you care about (prices, headlines, listings, tables), and saves them somewhere structured, like a CSV file or a database. When you want to scrape the web, Python is almost always the right tool, because its ecosystem is mature, readable, and battle-tested for exactly this kind of work.
This guide is developer-to-developer. We will build a working scraper with `requests` and `BeautifulSoup`, handle the real-world problems that break naive scrapers, and talk about where to actually run these jobs once they work. But first, the part most tutorials skip.
Key Takeaways
• Web scraping = fetching HTML and parsing it. `requests` fetches the page, `BeautifulSoup` parses the HTML so you can select elements.
• Legality and ethics come first. Respect `robots.txt`, terms of service, and rate limits. Scrape public data, don’t hammer servers, identify yourself with a `User-Agent`.
• The decisive question is static vs. dynamic. If the data lives in the page’s initial HTML, `requests` + `BeautifulSoup` works. If JavaScript builds it client-side, you need Selenium/Playwright or the site’s API.
• `requests` + `BeautifulSoup` covers most jobs; Scrapy scales to large crawls; Selenium/Playwright handle JS-rendered pages.
• Run scrapers on a server. A VPS with cron turns a one-off script into a scheduled, reliable data pipeline.
Is web scraping legal and ethical?
Before you write a line of code, understand the boundaries. Web scraping is not inherently illegal, but how you do it matters, and being careless can get your IP blocked, your account banned, or worse.
A few principles to scrape responsibly:
- Check `robots.txt`. Most sites publish rules at `https://example.com/robots.txt` describing which paths crawlers may access. It is not a legal contract, but it signals the site owner’s intent — honor it.
- Read the terms of service. Some sites explicitly prohibit automated access. Scraping anyway can violate their terms.
- Prefer public data. Data behind a login, paywall, or that includes personal information carries far more legal and ethical risk. Stick to publicly visible, non-personal data when you can.
- Don’t overload the server. A scraper that fires hundreds of requests per second looks like a denial-of-service attack. Add delays, limit concurrency, and scrape during off-peak hours.
- Identify yourself. Set a descriptive `User-Agent`. Some scrapers even include a contact URL so admins can reach you instead of just blocking you.
The short version: be a good citizen. Take only what you need, at a pace the server can absorb, and look for an official API first — it is almost always the better path.
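The `robots.txt` check can also be automated. The sketch below uses the standard library’s `urllib.robotparser`; the rules text and the `my-scraper` user agent are made-up examples, inlined so the code runs without a network call.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
Crawl-delay: 5
"""

# Hypothetical identifier for this scraper.
USER_AGENT = "my-scraper"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether this user agent may fetch each URL.
print(parser.can_fetch(USER_AGENT, "https://example.com/products"))
print(parser.can_fetch(USER_AGENT, "https://example.com/admin/users"))

# Crawl-delay, if the site declares one, is a sensible minimum pause.
print(parser.crawl_delay(USER_AGENT))
```

Against a live site you would call `parser.set_url("https://example.com/robots.txt")` followed by `parser.read()` instead of parsing an inline string.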
What is the Python web scraping toolkit?
Python gives you a small set of focused tools. You rarely need all of them at once; you pick based on the page.
| Library | Use it when |
|---|---|
| `requests` | You need to fetch a page or call an API over HTTP. The standard for sending GET/POST requests. |
| `BeautifulSoup` (bs4) | You have raw HTML and need to parse it — find tags, read attributes, extract text. |
| `lxml` | You want a fast parser backend for BeautifulSoup, or you prefer XPath selectors. |
| `Scrapy` | You’re building a large crawler: many pages, concurrency, pipelines, retries, and built-in throttling. |
| `Selenium` / `Playwright` | The page renders content with JavaScript and you need a real browser to execute it. |
For the majority of scraping tasks, `requests` to fetch plus `BeautifulSoup` to parse is the whole toolkit. We’ll focus there, then cover when to reach for the heavier options.
Install the two essentials:
```bash
pip install requests beautifulsoup4
```
How do you build a basic scraper with requests and BeautifulSoup?
Let’s build a scraper step by step. The pattern is always the same: fetch, parse, select, extract, save.
Fetching a page with requests
```python
import requests

url = "https://example.com/products"
response = requests.get(url, timeout=10)

response.raise_for_status()  # raises an error on 4xx/5xx

html = response.text
print(response.status_code)  # 200 means success
```
`requests.get()` returns a `Response` object. `response.text` is the raw HTML as a string, `response.status_code` tells you whether the request succeeded, and `raise_for_status()` fails loudly on errors instead of letting you parse an error page by accident.
Parsing HTML with BeautifulSoup
```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "html.parser")

print(soup.title.text)  # the text of the page's <title> element
```
`BeautifulSoup` turns the HTML string into a navigable tree. The second argument is the parser: `html.parser` ships with Python; install `lxml` and pass `"lxml"` for more speed on large documents.
Selecting elements: find, find_all, and CSS selectors
This is the heart of scraping — telling BeautifulSoup exactly which elements you want.
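```python
# find() returns the first matching element (or None)
first_product = soup.find("div", class_="product")

# find_all() returns every matching element
all_products = soup.find_all("div", class_="product")

# select() and select_one() take CSS selectors
titles = soup.select("div.product h2.title")
price = soup.select_one("span.price")
```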
`find()` returns the first match, `find_all()` returns a list. `select()` and `select_one()` use CSS selectors, which are usually the cleanest way to target nested elements — the same selectors you’d use in the browser’s dev tools.
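To see how the two styles relate, here is a small sketch run against made-up markup (the `SAMPLE_HTML` string is an assumption, not a real page). It shows that `find_all()` and `select()` can target the same elements, and that a descendant CSS selector narrows the match in a way a bare tag-and-class lookup does not.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, inlined so the example runs without a network call.
SAMPLE_HTML = """
<div class="product"><h2 class="title">Kettle</h2><span class="price">$25</span></div>
<div class="product"><h2 class="title">Toaster</h2><span class="price">$40</span></div>
<div class="banner"><h2 class="title">Sale!</h2></div>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Two ways to ask for the same product containers.
by_find_all = soup.find_all("div", class_="product")
by_select = soup.select("div.product")

# Descendant selector: only titles nested inside a product div.
product_titles = [h2.get_text(strip=True) for h2 in soup.select("div.product h2.title")]

# A bare tag/class lookup also picks up the banner's title.
all_titles = [h2.get_text(strip=True) for h2 in soup.find_all("h2", class_="title")]

print(len(by_find_all), len(by_select))
print(product_titles)
print(all_titles)
```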
Extracting text and attributes
```python
for product in soup.select("div.product"):
    name = product.select_one("h2.title").get_text(strip=True)
    price = product.select_one("span.price").get_text(strip=True)
    link = product.select_one("a")["href"]        # read an attribute
    image = product.select_one("img").get("src")  # .get() returns None if missing

    print(name, price, link, image)
```
Use `.get_text(strip=True)` to pull clean text without surrounding whitespace. Read attributes with bracket syntax (`element["href"]`) when you’re sure the attribute exists, or `.get("href")` when it might be missing: `.get()` returns `None` instead of raising a `KeyError`.
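One caveat: `select_one()` itself returns `None` when nothing matches, so chaining `.get_text()` onto it fails on any product that lacks that element. The following is a minimal sketch of a more defensive extraction; the helper names `text_or_none` and `attr_or_none`, the page URL, and the markup are all hypothetical. It also uses `urllib.parse.urljoin` to turn relative `href` values into absolute URLs.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical page URL and markup; the second product has no price or image.
PAGE_URL = "https://example.com/products"
SAMPLE_HTML = """
<div class="product">
  <h2 class="title"> Kettle </h2><span class="price">$25</span>
  <a href="/products/kettle">View</a><img src="/img/kettle.jpg">
</div>
<div class="product">
  <h2 class="title">Toaster</h2>
  <a href="/products/toaster">View</a>
</div>
"""


def text_or_none(parent, selector):
    """Return stripped text of the first match, or None if nothing matches."""
    element = parent.select_one(selector)
    return element.get_text(strip=True) if element is not None else None


def attr_or_none(parent, selector, attribute):
    """Return an attribute of the first match, or None if either is missing."""
    element = parent.select_one(selector)
    return element.get(attribute) if element is not None else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
rows = []
for product in soup.select("div.product"):
    href = attr_or_none(product, "a", "href")
    rows.append({
        "name": text_or_none(product, "h2.title"),
        "price": text_or_none(product, "span.price"),
        # urljoin resolves a relative href against the page URL.
        "link": urljoin(PAGE_URL, href) if href else None,
        "image": attr_or_none(product, "img", "src"),
    })

for row in rows:
    print(row)
```

Recording `None` for a missing field keeps one incomplete listing from aborting the whole run; whether to skip such rows or keep them is a choice that depends on your data.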
The deciding question in any scraping project is not which library to use; it is whether the data you want exists in the page’s initial HTML or is loaded later by JavaScript. `requests` and `BeautifulSoup` only ever see the raw HTML the server sends back. If a site builds its content client-side with JavaScript, as is increasingly common in modern apps built on React, Vue, or similar, that data simply isn’t in the response, and you’ll get empty results no matter how clever your selectors are.
Check first: compare View Page Source (the raw HTML) against Inspect Element (the live, rendered DOM). If your target data appears in View Source, it’s in the initial HTML and fast `requests` + `BeautifulSoup` will work. If it only shows up in Inspect Element, it’s JS-rendered, and you’ll need a real browser (Selenium/Playwright) or, far better, the underlying API the page itself calls.
Open the Network tab, filter to XHR/Fetch, and you’ll often find a clean JSON endpoint feeding the page; hitting that directly is faster and more reliable than rendering anything. Diagnosing static-vs-dynamic before you write a single selector saves hours of confused debugging.
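The same check can be scripted: parse the raw HTML and see whether your selector matches anything. This is a rough sketch, with a hypothetical helper name and two made-up responses standing in for a server-rendered page and an empty JavaScript app shell.

```python
from bs4 import BeautifulSoup


def selector_in_static_html(html: str, css_selector: str) -> bool:
    """Return True if the selector matches something in the raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.select_one(css_selector) is not None


# Hypothetical responses: one server-rendered, one an empty JS app shell.
STATIC_PAGE = '<html><body><div class="product">Kettle</div></body></html>'
JS_SHELL = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

print(selector_in_static_html(STATIC_PAGE, "div.product"))
print(selector_in_static_html(JS_SHELL, "div.product"))
```

In practice you would pass in `requests.get(url, timeout=10).text`. A `False` result for a selector that clearly matches in Inspect Element is a strong hint that the content is rendered client-side.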
How do you scrape multiple pages and save the data?
Real datasets span many pages. Most paginated sites use a predictable URL pattern like `?page=2`, which makes looping straightforward.
Looping through pagination
```python
import time

import requests
from bs4 import BeautifulSoup

base_url = "https://example.com/products?page={}"
all_rows = []

for page in range(1, 11):  # pages 1 through 10
    response = requests.get(base_url.format(page), timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    products = soup.select("div.product")
    if not products:
        break  # no more results, so stop early

    for product in products:
        all_rows.append({
            "name": product.select_one("h2.title").get_text(strip=True),
            "price": product.select_one("span.price").get_text(strip=True),
            "link": product.select_one("a")["href"],
        })

    time.sleep(2)  # be polite: pause between requests

print(f"Collected {len(all_rows)} products")
```
Two details matter here. First, the `if not products: break` guard stops the loop when a page returns nothing, so you don’t keep requesting empty pages. Second, `time.sleep(2)` spaces out your requests — this single line is the difference between a respectful scraper and one that gets blocked.
Saving to CSV
Python’s built-in `csv` module handles export with no extra dependencies:
```python
import csv

with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price", "link"])
    writer.writeheader()
    writer.writerows(all_rows)

print("Saved products.csv")
```
If you’d rather work with the data analytically, `pandas` can write the same list of dictionaries with `pd.DataFrame(all_rows).to_csv("products.csv", index=False)`.
How do you handle blocking, headers, and dynamic content?
A scraper that works in testing often breaks in production. Here are the issues you’ll hit and how to deal with them.
Setting headers and a User-Agent
Many servers reject requests that don’t look like a browser. The default `requests` User-Agent (`python-requests/…`) is an easy thing for sites to block. Send realistic headers:
```python
headers = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
        "AppleWebKit/537.36 (KHTML, like Gecko) "
        "Chrome/120.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}

response = requests.get(url, headers=headers, timeout=10)
```
Rate limiting and retries
Don’t just `sleep` a fixed amount — handle transient failures gracefully too. Use a session for connection reuse and back off when you get rate-limited:
```python
import time

import requests

session = requests.Session()
session.headers.update(headers)  # the headers dict defined earlier


def fetch(url, retries=3):
    for attempt in range(retries):
        resp = session.get(url, timeout=10)
        if resp.status_code == 429:  # Too Many Requests
            wait = 2 ** attempt  # exponential backoff: 1s, 2s, 4s
            print(f"Rate limited, waiting {wait}s")
            time.sleep(wait)
            continue
        resp.raise_for_status()
        return resp
    raise RuntimeError(f"Failed to fetch {url}")
```
Exponential backoff (waiting longer after each failure) is the standard, polite way to recover from a `429 Too Many Requests` response.
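Servers that rate-limit sometimes say how long to wait in a `Retry-After` response header. As a possible refinement (not part of the scraper built in this guide), the sketch below computes a delay that prefers a numeric `Retry-After` value and otherwise falls back to capped exponential backoff with a little random jitter. The function name `backoff_delay` and its `base` and `cap` defaults are assumptions chosen for illustration.

```python
import random


def backoff_delay(attempt, retry_after=None, base=1.0, cap=60.0):
    """Seconds to wait before retry number `attempt` (0-based).

    If the server sent a numeric Retry-After value, use it (capped);
    otherwise fall back to capped exponential backoff with jitter.
    """
    if retry_after is not None:
        try:
            return max(0.0, min(float(retry_after), cap))
        except ValueError:
            pass  # Retry-After may also be an HTTP date; ignored in this sketch

    delay = min(base * 2 ** attempt, cap)
    # Jitter: add up to 10% so parallel clients do not retry in lockstep.
    return delay + random.uniform(0, delay * 0.1)


print(backoff_delay(0, retry_after="5"))  # the server's hint wins
print(backoff_delay(2))                   # roughly 4 seconds plus jitter
```

Inside a retry loop this could be wired in as `time.sleep(backoff_delay(attempt, resp.headers.get("Retry-After")))`.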
Dealing with JavaScript-rendered content
If you’ve confirmed (using the View Page Source vs. Inspect Element check above) that your data is built by JavaScript, `requests` won’t see it. You have two options:
- Find the underlying API. Check the Network tab for the JSON endpoint the page calls, and request that directly with `requests`. This is the cleanest solution when available.
- Render with a real browser. Use Selenium or Playwright to load the page, execute its JavaScript, and then hand the rendered HTML to BeautifulSoup:
```python
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=True)
    page = browser.new_page()
    page.goto("https://example.com/spa-products")
    page.wait_for_selector("div.product")  # wait for content to load
    html = page.content()                  # fully rendered HTML
    browser.close()

soup = BeautifulSoup(html, "html.parser")
products = soup.select("div.product")
```
Browser-based scraping is far slower and heavier than `requests`, so use it only when you must. If you can hit an API instead, do that.
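For comparison, the API route usually needs no HTML parsing at all. The sketch below assumes a made-up endpoint (`API_URL`), a `page` query parameter, and a payload shaped like `{"items": [...]}`; the real URL, parameters, and field names are whatever the Network tab shows for your site. The `session` argument is expected to be a `requests.Session`.

```python
# Hypothetical endpoint; replace with the one the Network tab shows.
API_URL = "https://example.com/api/products"


def fetch_products(session, page):
    """Request one page of products from the JSON endpoint."""
    resp = session.get(API_URL, params={"page": page}, timeout=10)
    resp.raise_for_status()
    data = resp.json()
    # Assumes a payload like {"items": [{"name": ..., "price": ...}, ...]}
    return [
        {"name": item.get("name"), "price": item.get("price")}
        for item in data.get("items", [])
    ]
```

With `requests` installed, `fetch_products(requests.Session(), page=1)` would return a list of dictionaries ready for the CSV-writing step shown earlier.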
When you’re being blocked
If responses come back empty, return CAPTCHAs, or throw 403s, the site has likely detected automation. Slow down, rotate your User-Agent, respect `robots.txt`, and reconsider whether the site offers an official API. Aggressive evasion is usually a sign you should be using a sanctioned data source instead.
Run your Python scrapers on infrastructure you control. DarazHost VPS and dedicated servers are an ideal home for Python data pipelines: full root access to install Python, `requests`, `BeautifulSoup`, Selenium, Playwright, and anything else you need, plus cron for scheduled scraping jobs, guaranteed resources so long-running crawls don’t get throttled, and a stable IP that won’t shift under you mid-job. Run your scrapers on real infrastructure, not a flaky laptop connection, backed by 24/7 support. This is exactly the kind of workload that belongs on a server, and it’s a natural extension of hosting for developers: a real environment you control.
Where should you actually run a web scraper?
A scraper that lives on your laptop is a prototype. The moment you need it to run on a schedule, collect data while you sleep, or feed a dashboard, it needs a real home — a server that’s always on.
A VPS gives you that: a Linux environment with full control where you install Python, your dependencies, and your script, then schedule it. On Linux, `cron` is the classic scheduler. A crontab entry like this runs your scraper every day at 3 AM:
```bash
0 3 * * * /usr/bin/python3 /home/user/scrapers/products.py >> /home/user/scrapers/log.txt 2>&1
```
The `>> log.txt 2>&1` redirect captures both output and errors to a log file so you can see what happened on each run. This is where web scraping connects to development hosting in general: scrapers are long-running, scheduled, resource-hungry, and need a stable IP — all things a controlled server environment provides and shared hosting often does not.
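Because cron runs unattended, it helps if the script itself writes timestamped log lines and signals failure through its exit code. Here is one way a script’s entry point might be structured; `run_scrape` is a placeholder for your real scraping logic and the logger name is arbitrary.

```python
import logging
import sys

# Timestamped lines on stdout, which a cron redirect can append to a log file.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
    stream=sys.stdout,
)
log = logging.getLogger("products-scraper")  # hypothetical logger name


def run_scrape():
    """Placeholder for the real scraping logic; returns the rows collected."""
    return [{"name": "Kettle", "price": "$25"}]


def main(scrape=run_scrape):
    """Run one scrape and return a process exit code (0 = success)."""
    try:
        rows = scrape()
    except Exception:
        log.exception("Scrape failed")  # logs the full traceback
        return 1
    log.info("Collected %d rows", len(rows))
    return 0


exit_code = main()
# In a real script, finish with sys.exit(exit_code) so a failed run
# exits non-zero and can be detected by whatever monitors the job.
```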
Frequently asked questions
Is `requests` enough, or do I always need BeautifulSoup?
They do different jobs. `requests` fetches the page; `BeautifulSoup` parses the resulting HTML so you can extract specific elements. For scraping HTML pages you typically use both. If you’re calling a JSON API, `requests` alone is enough; just use `response.json()`.
Why is my scraper returning empty results?
The most common cause is JavaScript-rendered content. Your selectors are looking for elements that don’t exist in the raw HTML because the browser builds them after load. Compare View Page Source with Inspect Element to confirm, then either find the underlying API or render with Selenium/Playwright. Less commonly, the site changed its HTML structure and your selectors are now stale.
How do I avoid getting blocked while scraping?
Set a realistic `User-Agent`, add delays between requests (`time.sleep`), respect `robots.txt`, use exponential backoff on `429` responses, and keep your request rate modest. The goal is to be indistinguishable from polite, low-volume human traffic, not to evade detection.
When should I use Scrapy instead of requests and BeautifulSoup?
Reach for Scrapy when you’re crawling many pages and want built-in concurrency, automatic retries, throttling, and data pipelines without writing that plumbing yourself. For a handful of pages or a single endpoint, `requests` + `BeautifulSoup` is simpler and faster to write.
Can I run a Python scraper for free on my computer?
You can run it locally, but it only works while your machine is on and connected. For anything scheduled or continuous, a VPS with cron is the practical choice: it runs unattended, has a stable IP, and won’t tie up your laptop.