
How to Crawl a Web Page with Scrapy and Python 3
Web scraping has become an essential skill for developers who need to extract data from websites programmatically, whether you’re building data pipelines, conducting market research, or monitoring competitor pricing. Scrapy stands out as one of the most powerful and flexible Python frameworks for web scraping, with built-in support for following links, submitting forms, and running large-scale crawls, plus a plugin ecosystem that adds JavaScript rendering when you need it. In this comprehensive guide, you’ll learn how to set up Scrapy from scratch, build your first spider, handle common challenges like rate limiting and anti-bot measures, and deploy your scraping solution on production servers.
How Scrapy Works Under the Hood
Scrapy operates on an asynchronous, event-driven architecture that makes it incredibly efficient for crawling multiple pages simultaneously. Unlike simple HTTP libraries like requests, Scrapy uses Twisted, a Python networking framework that handles concurrent connections without the overhead of traditional threading.
The framework follows a component-based architecture where each piece serves a specific purpose:
- Spider – Defines which URLs to crawl and how to extract data
- Scheduler – Manages the queue of requests to be processed
- Downloader – Fetches web pages and handles HTTP protocols
- Item Pipeline – Processes and validates extracted data
- Middlewares – Provide hooks for custom processing of requests and responses
When you run a Scrapy spider, the engine coordinates these components in a continuous loop. It sends requests through middlewares to the downloader, processes responses back through the spider, and feeds extracted items through the pipeline. This architecture allows Scrapy to handle thousands of concurrent requests while maintaining low memory usage.
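To make the middleware hook idea concrete, here is a minimal sketch of a custom downloader middleware that timestamps outgoing requests and logs slow responses. The class name and the five-second threshold are illustrative, not part of Scrapy itself.
import time


class TimingMiddleware:
    """Illustrative downloader middleware: stamps requests, logs slow responses."""

    def process_request(self, request, spider):
        # Runs on the way out, before the downloader fetches the page
        request.meta['request_start'] = time.time()
        return None  # returning None lets the request continue through the chain

    def process_response(self, request, response, spider):
        # Runs on the way back, before the response reaches the spider
        elapsed = time.time() - request.meta.get('request_start', time.time())
        if elapsed > 5:
            spider.logger.warning('Slow response (%.1fs): %s', elapsed, response.url)
        return response
Registering it is a single entry in the DOWNLOADER_MIDDLEWARES setting, exactly like the third-party middlewares configured later in this guide.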
Setting Up Your Scrapy Environment
Before diving into spider development, you’ll need a proper Python environment. Scrapy works best with Python 3.7+ and requires several system dependencies that can be tricky to install on some platforms.
For Ubuntu/Debian systems, install the required system packages first:
sudo apt-get update
sudo apt-get install python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
On CentOS/RHEL:
sudo yum install python3-devel python3-pip libxml2-devel libxslt-devel openssl-devel libffi-devel
Create a virtual environment to isolate your project dependencies:
python3 -m venv scrapy_env
source scrapy_env/bin/activate # On Windows: scrapy_env\Scripts\activate
pip install --upgrade pip
Install Scrapy and additional useful packages:
pip install scrapy
pip install scrapy-user-agents # Rotating user agents
pip install scrapy-proxies # Proxy rotation
pip install itemadapter # Item handling utilities
Verify your installation by creating a new Scrapy project:
scrapy startproject webscraper
cd webscraper
scrapy genspider example example.com
This creates a basic project structure with all necessary files and directories. If you’re planning to deploy on a VPS or dedicated server, make sure your hosting environment supports the required system packages.
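The generated layout looks like this (exact contents vary slightly between Scrapy versions):
webscraper/
├── scrapy.cfg            # Deploy configuration
└── webscraper/
    ├── __init__.py
    ├── items.py          # Item definitions
    ├── middlewares.py    # Spider and downloader middlewares
    ├── pipelines.py      # Item pipelines
    ├── settings.py       # Project-wide settings
    └── spiders/
        ├── __init__.py
        └── example.py    # Spider created by genspider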
Building Your First Spider
Let’s create a practical spider that extracts structured data from a website: item text, metadata, and linked detail pages. The example uses quotes.toscrape.com, a site built for scraping practice, and demonstrates core Scrapy concepts while tackling real-world challenges such as pagination and multi-page crawling.
Create a new spider file news_spider.py in your spiders directory:
import scrapy
from datetime import datetime


class NewsSpider(scrapy.Spider):
    name = 'news_crawler'
    allowed_domains = ['quotes.toscrape.com']  # Using quotes.toscrape.com as example
    start_urls = ['http://quotes.toscrape.com/']

    custom_settings = {
        'DOWNLOAD_DELAY': 1,  # Be respectful - 1 second delay between requests
        'RANDOMIZE_DOWNLOAD_DELAY': True,  # Randomize delay between 0.5x and 1.5x DOWNLOAD_DELAY
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
    }

    def parse(self, response):
        # Extract quote data from the page
        quotes = response.css('div.quote')

        for quote in quotes:
            yield {
                'text': quote.css('span.text::text').get().strip('“”"'),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
                'scraped_at': datetime.now().isoformat(),
                'source_url': response.url
            }

        # Follow pagination links
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

        # Follow links to author pages for additional data
        author_links = response.css('small.author + a::attr(href)').getall()
        for link in author_links:
            yield response.follow(link, callback=self.parse_author)

    def parse_author(self, response):
        # Extract author biographical information
        yield {
            'name': response.css('h3.author-title::text').get(),
            'birth_date': response.css('span.author-born-date::text').get(),
            'birth_location': response.css('span.author-born-location::text').get(),
            'description': response.css('div.author-description::text').get().strip(),
            'scraped_at': datetime.now().isoformat()
        }
This spider demonstrates several important concepts:
- CSS selectors – Often simpler and more readable than XPath for everyday HTML parsing
- Custom settings – Configure delays and user agents per spider
- Data extraction – Using get() for single values and getall() for lists
- Following links – Using response.follow() for relative URLs
- Multiple callbacks – Different parsing methods for different page types
Run your spider and save results to JSON:
scrapy crawl news_crawler -o quotes_output.json
For CSV output with custom fields:
scrapy crawl news_crawler -o quotes_output.csv -s FEED_EXPORT_FIELDS="text,author,tags,scraped_at"
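You can also run the spider from a plain Python script instead of the scrapy CLI, which is handy when the crawl is just one step in a larger job. A minimal sketch using Scrapy's CrawlerProcess (the import path assumes the project layout shown earlier):
from scrapy.crawler import CrawlerProcess

from webscraper.spiders.news_spider import NewsSpider  # path assumes the generated project layout

process = CrawlerProcess(settings={
    # Write results to JSON, replacing any previous run's output
    'FEEDS': {'quotes_output.json': {'format': 'json', 'overwrite': True}},
})
process.crawl(NewsSpider)
process.start()  # blocks until the crawl finishes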
Advanced Data Processing with Item Pipelines
Raw scraped data often needs cleaning, validation, and processing before it’s useful. Scrapy’s Item Pipeline system provides a clean way to handle this post-processing.
First, define your data structure in items.py:
import re

import scrapy
from itemloaders.processors import TakeFirst, MapCompose, Join
from w3lib.html import remove_tags


def clean_text(value):
    """Remove extra whitespace and clean text"""
    return re.sub(r'\s+', ' ', value.strip()) if value else value


class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
    scraped_at = scrapy.Field()
    source_url = scrapy.Field()
    word_count = scrapy.Field()


class AuthorItem(scrapy.Item):
    name = scrapy.Field()
    birth_date = scrapy.Field()
    birth_location = scrapy.Field()
    description = scrapy.Field()
    scraped_at = scrapy.Field()
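The imported processors and the clean_text helper come into play if you populate items through an ItemLoader instead of building dicts by hand. Here is one hedged sketch of how that could look, as a variant of the quotes spider (the spider name and import path are illustrative):
import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose

from webscraper.items import QuoteItem, clean_text  # import path assumes the project layout above


class QuoteLoaderSpider(scrapy.Spider):
    """Hypothetical variant that fills QuoteItem via an ItemLoader."""
    name = 'quotes_loader'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            loader = ItemLoader(item=QuoteItem(), selector=quote)
            loader.default_output_processor = TakeFirst()  # collapse single-value fields to scalars
            loader.add_css('text', 'span.text::text', MapCompose(clean_text))
            loader.add_css('author', 'small.author::text')
            loader.add_value('source_url', response.url)
            yield loader.load_item()
The tags field is skipped here because the TakeFirst default would keep only the first tag; declaring output_processor=Identity() on that Field lets the loader keep it as a list.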
Create custom pipelines in pipelines.py:
import re
import sqlite3

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class ValidationPipeline:
    """Validate and clean scraped items"""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Author items (from parse_author) have no quote fields; pass them through untouched
        if 'text' not in adapter and 'author' not in adapter:
            return item

        # Validate required fields
        if not adapter.get('text') or not adapter.get('author'):
            raise DropItem(f"Missing required fields in {item}")

        # Clean text fields
        adapter['text'] = re.sub(r'\s+', ' ', adapter['text'].strip())
        adapter['word_count'] = len(adapter['text'].split())

        # Normalize author names
        adapter['author'] = adapter['author'].title().strip()

        return item


class DuplicatesPipeline:
    """Filter out duplicate items"""

    def __init__(self):
        self.seen_items = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Create a unique identifier: quotes are keyed on text+author, author pages on name
        if 'text' in adapter:
            identifier = f"{adapter.get('text', '')}-{adapter.get('author', '')}"
        else:
            identifier = f"author-{adapter.get('name', '')}"
        identifier_hash = hash(identifier)

        if identifier_hash in self.seen_items:
            raise DropItem(f"Duplicate item found: {item}")

        self.seen_items.add(identifier_hash)
        return item


class DatabasePipeline:
    """Store items in SQLite database"""

    def __init__(self, database_path):
        self.database_path = database_path
        self.connection = None

    @classmethod
    def from_crawler(cls, crawler):
        database_path = crawler.settings.get("DATABASE_PATH", "scrapy_data.db")
        return cls(database_path)

    def open_spider(self, spider):
        self.connection = sqlite3.connect(self.database_path)
        self.connection.execute('''
            CREATE TABLE IF NOT EXISTS quotes (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                text TEXT NOT NULL,
                author TEXT NOT NULL,
                tags TEXT,
                word_count INTEGER,
                scraped_at TEXT,
                source_url TEXT
            )
        ''')
        self.connection.commit()

    def close_spider(self, spider):
        if self.connection:
            self.connection.close()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Only store quote items
        if 'text' in adapter and 'author' in adapter:
            insert_sql = '''
                INSERT INTO quotes (text, author, tags, word_count, scraped_at, source_url)
                VALUES (?, ?, ?, ?, ?, ?)
            '''
            self.connection.execute(insert_sql, (
                adapter.get('text'),
                adapter.get('author'),
                ','.join(adapter.get('tags', [])),
                adapter.get('word_count'),
                adapter.get('scraped_at'),
                adapter.get('source_url')
            ))
            self.connection.commit()

        return item
Enable the pipelines in settings.py:
ITEM_PIPELINES = {
    'webscraper.pipelines.ValidationPipeline': 300,
    'webscraper.pipelines.DuplicatesPipeline': 400,
    'webscraper.pipelines.DatabasePipeline': 500,
}

DATABASE_PATH = 'quotes_database.db'
The pipeline numbers determine execution order – lower numbers run first. This setup validates data, removes duplicates, and stores clean results in a database.
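Once a crawl has run, a quick way to sanity-check the pipeline output is to query the SQLite file directly. A small sketch, using the file name from the DATABASE_PATH setting above:
import sqlite3

# Connect to the database created by DatabasePipeline
connection = sqlite3.connect('quotes_database.db')

# Count stored quotes and show the most-quoted authors
total = connection.execute('SELECT COUNT(*) FROM quotes').fetchone()[0]
print(f'{total} quotes stored')

for author, count in connection.execute(
    'SELECT author, COUNT(*) FROM quotes GROUP BY author ORDER BY COUNT(*) DESC LIMIT 5'
):
    print(f'{author}: {count}')

connection.close()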
Handling JavaScript and Dynamic Content
Many modern websites load content dynamically with JavaScript, which Scrapy’s default downloader can’t handle. Here are several approaches to tackle this challenge:
Option 1: Scrapy-Splash Integration
Splash is a lightweight browser engine specifically designed for scraping JavaScript-heavy sites:
pip install scrapy-splash
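Splash runs as a separate HTTP service; the usual way to start it locally is the official Docker image, which matches the SPLASH_URL used below:
docker pull scrapinghub/splash
docker run -d -p 8050:8050 scrapinghub/splash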
Add Splash settings to settings.py:
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
Use Splash in your spider:
import scrapy
from scrapy_splash import SplashRequest


class JsSpider(scrapy.Spider):
    name = 'js_scraper'

    def start_requests(self):
        urls = ['https://example.com/dynamic-content']
        for url in urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='render.json',  # returns rendered HTML and a screenshot together
                args={
                    'wait': 3,   # Wait 3 seconds for JS to load
                    'html': 1,   # Include rendered HTML in the response
                    'png': 1,    # Include a screenshot (available via response.data['png'])
                }
            )

    def parse(self, response):
        # Parse fully rendered HTML
        titles = response.css('h2.dynamic-title::text').getall()
        for title in titles:
            yield {'title': title}
Option 2: Selenium Integration
For more complex JavaScript scenarios, integrate Selenium directly:
pip install selenium
Create a custom middleware:
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class SeleniumMiddleware:
    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        self.driver = webdriver.Chrome(options=chrome_options)

    @classmethod
    def from_crawler(cls, crawler):
        # Connect spider_closed so the browser is shut down when the crawl ends
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        # Only handle requests explicitly marked for Selenium
        if 'selenium' not in request.meta:
            return None

        self.driver.get(request.url)

        # Wait for specific elements if requested
        if 'wait_for' in request.meta:
            from selenium.webdriver.common.by import By
            from selenium.webdriver.support import expected_conditions as EC
            from selenium.webdriver.support.ui import WebDriverWait

            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, request.meta['wait_for']))
            )

        html = self.driver.page_source
        return HtmlResponse(
            url=request.url,
            body=html,
            encoding='utf-8',
            request=request
        )

    def spider_closed(self, spider):
        self.driver.quit()
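The middleware still has to be registered and targeted. A hedged sketch, assuming the class lives in webscraper/middlewares.py and that chromedriver is installed on the machine:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'webscraper.middlewares.SeleniumMiddleware': 800,
}
Requests then opt in through their meta dict (the selector to wait for is illustrative):
yield scrapy.Request(
    url='https://example.com/dynamic-content',
    callback=self.parse,
    meta={'selenium': True, 'wait_for': 'h2.dynamic-title'},
)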
Performance Comparison and Best Practices
Choosing the right approach depends on your specific requirements. Here’s a performance comparison of different scraping methods:
| Method | Speed (pages/min) | Memory Usage | JavaScript Support | Resource Cost | Best Use Case |
|---|---|---|---|---|---|
| Pure Scrapy | 500-2000 | Low (10-50MB) | None | Very Low | Static HTML sites |
| Scrapy + Splash | 50-200 | Medium (100-300MB) | Full | Medium | Light JS content |
| Scrapy + Selenium | 20-100 | High (200-500MB) | Full | High | Complex JS apps |
| Requests + BeautifulSoup | 100-500 | Low (5-20MB) | None | Very Low | Simple scraping tasks |
Essential Best Practices
- Respect robots.txt – Enable ROBOTSTXT_OBEY in settings for ethical scraping
- Implement delays – Use DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY to avoid overwhelming servers
- Rotate user agents – Install scrapy-user-agents for automatic rotation
- Handle errors gracefully – Implement retry logic and error handling
- Monitor memory usage – Use MEMDEBUG_ENABLED for memory profiling during development
- Cache responses – Enable HTTPCACHE_ENABLED during development to speed up testing (the settings sketch after this list pulls these options together)
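As a starting point, a settings.py fragment applying these practices might look like this; the values are conservative defaults to adjust per target site:
# settings.py - conservative, polite defaults

ROBOTSTXT_OBEY = True              # Respect robots.txt rules

DOWNLOAD_DELAY = 2                 # Base delay between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True    # Jitter the delay between 0.5x and 1.5x

RETRY_ENABLED = True               # Retry transient failures
RETRY_TIMES = 2

HTTPCACHE_ENABLED = True           # Cache responses while developing
HTTPCACHE_EXPIRATION_SECS = 3600   # Expire cached pages after an hour

DOWNLOADER_MIDDLEWARES = {
    # Requires: pip install scrapy-user-agents
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}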
Real-World Use Cases and Applications
Scrapy excels in various professional scenarios where reliable, scalable data extraction is crucial:
E-commerce Price Monitoring
Build competitive intelligence systems that track product prices across multiple retailers. This spider example monitors pricing changes:
import re
from datetime import datetime

import scrapy


class PriceMonitorSpider(scrapy.Spider):
    name = 'price_monitor'

    def __init__(self, products=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.product_urls = products.split(',') if products else []

    def start_requests(self):
        for url in self.product_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse_product,
                meta={'dont_cache': True}  # Always fetch fresh data
            )

    def parse_product(self, response):
        price_text = response.css('.price::text').get()
        if price_text:
            price_match = re.search(r'[\d.]+', price_text)
            currency_match = re.search(r'[^\d\s.]+', price_text)
            if price_match:
                yield {
                    'url': response.url,
                    'price': float(price_match.group()),
                    'currency': currency_match.group() if currency_match else None,
                    'product_name': response.css('h1::text').get(),
                    'availability': response.css('.stock-status::text').get(),
                    'timestamp': datetime.now().isoformat()
                }
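Spider arguments are passed on the command line with -a, so a monitoring run could look like this (the URLs are placeholders):
scrapy crawl price_monitor -a products="https://shop.example.com/item-1,https://shop.example.com/item-2" -o prices.json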
Content Aggregation and News Monitoring
Media companies use Scrapy to aggregate content from multiple sources, track mention of brands, or monitor industry news. The asynchronous architecture makes it perfect for crawling hundreds of news sites simultaneously.
Lead Generation and Business Intelligence
Sales teams leverage web scraping to gather contact information, company details, and market intelligence. However, always ensure compliance with terms of service and data protection regulations.
Troubleshooting Common Issues
Getting Blocked by Anti-Bot Systems
Modern websites employ sophisticated bot detection. Here’s a comprehensive anti-detection setup:
# settings.py
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
}

# Enable autothrottling
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Proxy settings (if using scrapy-proxies)
PROXY_LIST = '/path/to/proxy/list.txt'
PROXY_MODE = 0  # Random proxy
Memory Leaks and Performance Issues
Large-scale crawling can lead to memory problems. Monitor and optimize with these settings:
# Report memory debugging stats when the spider closes
MEMDEBUG_ENABLED = True

# Close the spider when memory usage exceeds a limit (in MB)
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 2048
MEMUSAGE_WARNING_MB = 1536

# Limit concurrent items processed in the item pipeline
CONCURRENT_ITEMS = 100

# Optionally tune Python's garbage collector
# (700, 10, 10 are CPython's defaults; raise the first threshold to collect less often)
import gc
gc.set_threshold(700, 10, 10)
Handling Timeouts and Connection Issues
Network instability requires robust error handling:
DOWNLOAD_TIMEOUT = 180
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# Retried requests are rescheduled with lower priority so fresh pages aren't starved
RETRY_PRIORITY_ADJUST = -1

# Note: Scrapy's built-in RetryMiddleware has no exponential backoff setting. Combine
# retries with AUTOTHROTTLE_ENABLED (above) or a custom retry middleware if you need
# the crawl to slow down under repeated failures.
Deployment and Production Considerations
When deploying Scrapy applications to production servers, several factors require attention:
Containerized Deployment
Docker provides consistent environments across development and production:
# Dockerfile
FROM python:3.9-slim
RUN apt-get update && apt-get install -y \
gcc \
libxml2-dev \
libxslt1-dev \
zlib1g-dev \
libffi-dev \
libssl-dev \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["scrapy", "crawl", "your_spider"]
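The requirements.txt referenced in the Dockerfile just lists the packages installed earlier in this guide (pin versions as needed for your project):
scrapy
scrapy-splash
scrapy-user-agents
scrapy-proxies
itemadapter
Build the image and run a crawl, overriding the default CMD with your spider name and mounting a directory so the output survives the container:
docker build -t webscraper .
docker run --rm -v "$(pwd)/data:/app/data" webscraper scrapy crawl news_crawler -o /app/data/output.json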
Monitoring and Logging
Production deployments need comprehensive monitoring:
# settings.py
LOG_LEVEL = 'INFO'
LOG_FILE = 'scrapy.log'
# Enable stats collection
STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'
# Custom logging configuration
import logging
logging.getLogger('scrapy').setLevel(logging.WARNING)
logging.getLogger('twisted').setLevel(logging.WARNING)
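Beyond log files, crawl statistics are a useful health signal. One sketch of a tiny extension that reports key stats when each spider finishes (the class name is illustrative; the stats keys are standard Scrapy counters):
# extensions.py
from scrapy import signals


class StatsReporter:
    """Hypothetical extension that logs key crawl stats when a spider finishes."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.stats)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        items = self.stats.get_value('item_scraped_count', 0)
        errors = self.stats.get_value('log_count/ERROR', 0)
        spider.logger.info('Crawl finished (%s): %d items, %d errors', reason, items, errors)
Enable it like any extension, e.g. EXTENSIONS = {'webscraper.extensions.StatsReporter': 500}, where the module path assumes the project layout from earlier.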
For large-scale operations, consider using dedicated servers with sufficient CPU and memory resources. Scrapy’s concurrent nature benefits significantly from multiple CPU cores and adequate RAM for handling thousands of simultaneous connections.
The official Scrapy documentation provides extensive details on advanced configuration options, while the GitHub repository offers examples and community contributions for specialized use cases.
Web scraping with Scrapy opens up powerful possibilities for data-driven applications, but success depends on understanding both the technical framework and the ethical considerations of web data extraction. Start with simple projects, gradually incorporate advanced features like JavaScript handling and distributed crawling, and always prioritize respectful, responsible scraping practices that don’t burden target websites.
