
How to Crawl a Web Page with Scrapy and Python 3
Web scraping has become an essential skill for developers who need to extract data from websites programmatically, whether you’re building data pipelines, conducting market research, or monitoring competitor pricing. Scrapy stands out as one of the most powerful and flexible Python frameworks for web scraping, with built-in support for following links, submitting forms, and running large-scale crawls, plus a plugin ecosystem that adds JavaScript rendering when you need it. In this comprehensive guide, you’ll learn how to set up Scrapy from scratch, build your first spider, handle common challenges like rate limiting and anti-bot measures, and deploy your scraping solution on production servers.
How Scrapy Works Under the Hood
Scrapy operates on an asynchronous, event-driven architecture that makes it incredibly efficient for crawling multiple pages simultaneously. Unlike simple HTTP libraries like requests, Scrapy uses Twisted, a Python networking framework that handles concurrent connections without the overhead of traditional threading.
The framework follows a component-based architecture where each piece serves a specific purpose:
- Spider – Defines which URLs to crawl and how to extract data
- Scheduler – Manages the queue of requests to be processed
- Downloader – Fetches web pages and handles HTTP protocols
- Item Pipeline – Processes and validates extracted data
- Middlewares – Provide hooks for custom processing of requests and responses
When you run a Scrapy spider, the engine coordinates these components in a continuous loop. It sends requests through middlewares to the downloader, processes responses back through the spider, and feeds extracted items through the pipeline. This architecture allows Scrapy to handle thousands of concurrent requests while maintaining low memory usage.
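To make the middleware hook idea concrete, here is a minimal sketch of a custom downloader middleware that timestamps outgoing requests and logs slow responses. The class name and the five-second threshold are illustrative, not part of Scrapy itself.
import time


class TimingMiddleware:
    """Illustrative downloader middleware: stamps requests, logs slow responses."""

    def process_request(self, request, spider):
        # Runs on the way out, before the downloader fetches the page
        request.meta['request_start'] = time.time()
        return None  # returning None lets the request continue through the chain

    def process_response(self, request, response, spider):
        # Runs on the way back, before the response reaches the spider
        elapsed = time.time() - request.meta.get('request_start', time.time())
        if elapsed > 5:
            spider.logger.warning('Slow response (%.1fs): %s', elapsed, response.url)
        return response
Registering it is a single entry in the DOWNLOADER_MIDDLEWARES setting, exactly like the third-party middlewares configured later in this guide.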
Setting Up Your Scrapy Environment
Before diving into spider development, you’ll need a proper Python environment. Scrapy works best with Python 3.7+ and requires several system dependencies that can be tricky to install on some platforms.
For Ubuntu/Debian systems, install the required system packages first:
sudo apt-get update
sudo apt-get install python3-dev python3-pip libxml2-dev libxslt1-dev zlib1g-dev libffi-dev libssl-dev
On CentOS/RHEL:
sudo yum install python3-devel python3-pip libxml2-devel libxslt-devel openssl-devel libffi-devel
Create a virtual environment to isolate your project dependencies:
python3 -m venv scrapy_env
source scrapy_env/bin/activate # On Windows: scrapy_env\Scripts\activate
pip install --upgrade pip
Install Scrapy and additional useful packages:
pip install scrapy
pip install scrapy-user-agents # Rotating user agents
pip install scrapy-proxies # Proxy rotation
pip install itemadapter # Item handling utilities
Verify your installation by creating a new Scrapy project:
scrapy startproject webscraper
cd webscraper
scrapy genspider example example.com
This creates a basic project structure with all necessary files and directories. If you’re planning to deploy on a VPS or dedicated server, make sure your hosting environment supports the required system packages.
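The generated layout looks like this (exact contents vary slightly between Scrapy versions):
webscraper/
├── scrapy.cfg            # Deploy configuration
└── webscraper/
    ├── __init__.py
    ├── items.py          # Item definitions
    ├── middlewares.py    # Spider and downloader middlewares
    ├── pipelines.py      # Item pipelines
    ├── settings.py       # Project-wide settings
    └── spiders/
        ├── __init__.py
        └── example.py    # Spider created by genspider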
Building Your First Spider
Let’s create a practical spider that extracts structured data from a website: item text, metadata, and linked detail pages. The example uses quotes.toscrape.com, a site built for scraping practice, and demonstrates core Scrapy concepts while tackling real-world challenges such as pagination and multi-page crawling.
Create a new spider file news_spider.py in your spiders directory:
import scrapy
from datetime import datetime


class NewsSpider(scrapy.Spider):
    name = 'news_crawler'
    allowed_domains = ['quotes.toscrape.com']  # Using quotes.toscrape.com as example
    start_urls = ['http://quotes.toscrape.com/']

    custom_settings = {
        'DOWNLOAD_DELAY': 1,  # Be respectful - 1 second delay between requests
        'RANDOMIZE_DOWNLOAD_DELAY': True,  # Randomize delay between 0.5x and 1.5x DOWNLOAD_DELAY
        'USER_AGENT': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'
    }

    def parse(self, response):
        # Extract quote data from the page
        quotes = response.css('div.quote')

        for quote in quotes:
            yield {
                'text': quote.css('span.text::text').get().strip('“”"'),
                'author': quote.css('small.author::text').get(),
                'tags': quote.css('div.tags a.tag::text').getall(),
                'scraped_at': datetime.now().isoformat(),
                'source_url': response.url
            }

        # Follow pagination links
        next_page = response.css('li.next a::attr(href)').get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)

        # Follow links to author pages for additional data
        author_links = response.css('small.author + a::attr(href)').getall()
        for link in author_links:
            yield response.follow(link, callback=self.parse_author)

    def parse_author(self, response):
        # Extract author biographical information
        yield {
            'name': response.css('h3.author-title::text').get(),
            'birth_date': response.css('span.author-born-date::text').get(),
            'birth_location': response.css('span.author-born-location::text').get(),
            'description': response.css('div.author-description::text').get().strip(),
            'scraped_at': datetime.now().isoformat()
        }
This spider demonstrates several important concepts:
- CSS selectors – Often simpler and more readable than XPath for everyday HTML parsing
- Custom settings – Configure delays and user agents per spider
- Data extraction – Using get() for single values and getall() for lists
- Following links – Using response.follow() for relative URLs
- Multiple callbacks – Different parsing methods for different page types
Run your spider and save results to JSON:
scrapy crawl news_crawler -o quotes_output.json
For CSV output with custom fields:
scrapy crawl news_crawler -o quotes_output.csv -s FEED_EXPORT_FIELDS="text,author,tags,scraped_at"
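You can also run the spider from a plain Python script instead of the scrapy CLI, which is handy when the crawl is just one step in a larger job. A minimal sketch using Scrapy's CrawlerProcess (the import path assumes the project layout shown earlier):
from scrapy.crawler import CrawlerProcess

from webscraper.spiders.news_spider import NewsSpider  # path assumes the generated project layout

process = CrawlerProcess(settings={
    # Write results to JSON, replacing any previous run's output
    'FEEDS': {'quotes_output.json': {'format': 'json', 'overwrite': True}},
})
process.crawl(NewsSpider)
process.start()  # blocks until the crawl finishes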
Advanced Data Processing with Item Pipelines
Raw scraped data often needs cleaning, validation, and processing before it’s useful. Scrapy’s Item Pipeline system provides a clean way to handle this post-processing.
First, define your data structure in items.py:
import re

import scrapy
from itemloaders.processors import TakeFirst, MapCompose, Join
from w3lib.html import remove_tags


def clean_text(value):
    """Remove extra whitespace and clean text"""
    return re.sub(r'\s+', ' ', value.strip()) if value else value


class QuoteItem(scrapy.Item):
    text = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()
    scraped_at = scrapy.Field()
    source_url = scrapy.Field()
    word_count = scrapy.Field()


class AuthorItem(scrapy.Item):
    name = scrapy.Field()
    birth_date = scrapy.Field()
    birth_location = scrapy.Field()
    description = scrapy.Field()
    scraped_at = scrapy.Field()
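The imported processors and the clean_text helper come into play if you populate items through an ItemLoader instead of building dicts by hand. Here is one hedged sketch of how that could look, as a variant of the quotes spider (the spider name and import path are illustrative):
import scrapy
from scrapy.loader import ItemLoader
from itemloaders.processors import TakeFirst, MapCompose

from webscraper.items import QuoteItem, clean_text  # import path assumes the project layout above


class QuoteLoaderSpider(scrapy.Spider):
    """Hypothetical variant that fills QuoteItem via an ItemLoader."""
    name = 'quotes_loader'
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for quote in response.css('div.quote'):
            loader = ItemLoader(item=QuoteItem(), selector=quote)
            loader.default_output_processor = TakeFirst()  # collapse single-value fields to scalars
            loader.add_css('text', 'span.text::text', MapCompose(clean_text))
            loader.add_css('author', 'small.author::text')
            loader.add_value('source_url', response.url)
            yield loader.load_item()
The tags field is skipped here because the TakeFirst default would keep only the first tag; declaring output_processor=Identity() on that Field lets the loader keep it as a list.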
Create custom pipelines in pipelines.py:
import re
import sqlite3

from itemadapter import ItemAdapter
from scrapy.exceptions import DropItem


class ValidationPipeline:
    """Validate and clean scraped items"""

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Author items (from parse_author) have no quote fields; pass them through untouched
        if 'text' not in adapter and 'author' not in adapter:
            return item

        # Validate required fields
        if not adapter.get('text') or not adapter.get('author'):
            raise DropItem(f"Missing required fields in {item}")

        # Clean text fields
        adapter['text'] = re.sub(r'\s+', ' ', adapter['text'].strip())
        adapter['word_count'] = len(adapter['text'].split())

        # Normalize author names
        adapter['author'] = adapter['author'].title().strip()

        return item


class DuplicatesPipeline:
    """Filter out duplicate items"""

    def __init__(self):
        self.seen_items = set()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Create a unique identifier: quotes are keyed on text+author, author pages on name
        if 'text' in adapter:
            identifier = f"{adapter.get('text', '')}-{adapter.get('author', '')}"
        else:
            identifier = f"author-{adapter.get('name', '')}"
        identifier_hash = hash(identifier)

        if identifier_hash in self.seen_items:
            raise DropItem(f"Duplicate item found: {item}")

        self.seen_items.add(identifier_hash)
        return item


class DatabasePipeline:
    """Store items in SQLite database"""

    def __init__(self, database_path):
        self.database_path = database_path
        self.connection = None

    @classmethod
    def from_crawler(cls, crawler):
        database_path = crawler.settings.get("DATABASE_PATH", "scrapy_data.db")
        return cls(database_path)

    def open_spider(self, spider):
        self.connection = sqlite3.connect(self.database_path)
        self.connection.execute('''
            CREATE TABLE IF NOT EXISTS quotes (
                id INTEGER PRIMARY KEY AUTOINCREMENT,
                text TEXT NOT NULL,
                author TEXT NOT NULL,
                tags TEXT,
                word_count INTEGER,
                scraped_at TEXT,
                source_url TEXT
            )
        ''')
        self.connection.commit()

    def close_spider(self, spider):
        if self.connection:
            self.connection.close()

    def process_item(self, item, spider):
        adapter = ItemAdapter(item)

        # Only store quote items
        if 'text' in adapter and 'author' in adapter:
            insert_sql = '''
                INSERT INTO quotes (text, author, tags, word_count, scraped_at, source_url)
                VALUES (?, ?, ?, ?, ?, ?)
            '''
            self.connection.execute(insert_sql, (
                adapter.get('text'),
                adapter.get('author'),
                ','.join(adapter.get('tags', [])),
                adapter.get('word_count'),
                adapter.get('scraped_at'),
                adapter.get('source_url')
            ))
            self.connection.commit()

        return item
Enable the pipelines in settings.py:
ITEM_PIPELINES = {
    'webscraper.pipelines.ValidationPipeline': 300,
    'webscraper.pipelines.DuplicatesPipeline': 400,
    'webscraper.pipelines.DatabasePipeline': 500,
}

DATABASE_PATH = 'quotes_database.db'
The pipeline numbers determine execution order – lower numbers run first. This setup validates data, removes duplicates, and stores clean results in a database.
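Once a crawl has run, a quick way to sanity-check the pipeline output is to query the SQLite file directly. A small sketch, using the file name from the DATABASE_PATH setting above:
import sqlite3

# Connect to the database created by DatabasePipeline
connection = sqlite3.connect('quotes_database.db')

# Count stored quotes and show the most-quoted authors
total = connection.execute('SELECT COUNT(*) FROM quotes').fetchone()[0]
print(f'{total} quotes stored')

for author, count in connection.execute(
    'SELECT author, COUNT(*) FROM quotes GROUP BY author ORDER BY COUNT(*) DESC LIMIT 5'
):
    print(f'{author}: {count}')

connection.close()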
Handling JavaScript and Dynamic Content
Many modern websites load content dynamically with JavaScript, which Scrapy’s default downloader can’t handle. Here are several approaches to tackle this challenge:
Option 1: Scrapy-Splash Integration
Splash is a lightweight browser engine specifically designed for scraping JavaScript-heavy sites:
pip install scrapy-splash
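Splash runs as a separate HTTP service; the usual way to start it locally is the official Docker image, which matches the SPLASH_URL used below:
docker pull scrapinghub/splash
docker run -d -p 8050:8050 scrapinghub/splash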
Add Splash settings to settings.py:
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
Use Splash in your spider:
import scrapy
from scrapy_splash import SplashRequest


class JsSpider(scrapy.Spider):
    name = 'js_scraper'

    def start_requests(self):
        urls = ['https://example.com/dynamic-content']
        for url in urls:
            yield SplashRequest(
                url=url,
                callback=self.parse,
                endpoint='render.json',  # returns rendered HTML and a screenshot together
                args={
                    'wait': 3,   # Wait 3 seconds for JS to load
                    'html': 1,   # Include rendered HTML in the response
                    'png': 1,    # Include a screenshot (available via response.data['png'])
                }
            )

    def parse(self, response):
        # Parse fully rendered HTML
        titles = response.css('h2.dynamic-title::text').getall()
        for title in titles:
            yield {'title': title}
Option 2: Selenium Integration
For more complex JavaScript scenarios, integrate Selenium directly:
pip install selenium
Create a custom middleware:
from scrapy import signals
from scrapy.http import HtmlResponse
from selenium import webdriver
from selenium.webdriver.chrome.options import Options


class SeleniumMiddleware:
    def __init__(self):
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        chrome_options.add_argument('--no-sandbox')
        chrome_options.add_argument('--disable-dev-shm-usage')
        self.driver = webdriver.Chrome(options=chrome_options)

    @classmethod
    def from_crawler(cls, crawler):
        # Connect spider_closed so the browser is shut down when the crawl ends
        middleware = cls()
        crawler.signals.connect(middleware.spider_closed, signal=signals.spider_closed)
        return middleware

    def process_request(self, request, spider):
        # Only handle requests explicitly marked for Selenium
        if 'selenium' not in request.meta:
            return None

        self.driver.get(request.url)

        # Wait for specific elements if requested
        if 'wait_for' in request.meta:
            from selenium.webdriver.common.by import By
            from selenium.webdriver.support import expected_conditions as EC
            from selenium.webdriver.support.ui import WebDriverWait

            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, request.meta['wait_for']))
            )

        html = self.driver.page_source
        return HtmlResponse(
            url=request.url,
            body=html,
            encoding='utf-8',
            request=request
        )

    def spider_closed(self, spider):
        self.driver.quit()
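The middleware still has to be registered and targeted. A hedged sketch, assuming the class lives in webscraper/middlewares.py and that chromedriver is installed on the machine:
# settings.py
DOWNLOADER_MIDDLEWARES = {
    'webscraper.middlewares.SeleniumMiddleware': 800,
}
Requests then opt in through their meta dict (the selector to wait for is illustrative):
yield scrapy.Request(
    url='https://example.com/dynamic-content',
    callback=self.parse,
    meta={'selenium': True, 'wait_for': 'h2.dynamic-title'},
)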
Performance Comparison and Best Practices
Choosing the right approach depends on your specific requirements. Here’s a performance comparison of different scraping methods:
| Method | Speed (pages/min) | Memory Usage | JavaScript Support | Resource Cost | Best Use Case |
|---|---|---|---|---|---|
| Pure Scrapy | 500-2000 | Low (10-50MB) | None | Very Low | Static HTML sites |
| Scrapy + Splash | 50-200 | Medium (100-300MB) | Full | Medium | Light JS content |
| Scrapy + Selenium | 20-100 | High (200-500MB) | Full | High | Complex JS apps |
| Requests + BeautifulSoup | 100-500 | Low (5-20MB) | None | Very Low | Simple scraping tasks |
Essential Best Practices
- Respect robots.txt – Enable ROBOTSTXT_OBEY in settings for ethical scraping
- Implement delays – Use DOWNLOAD_DELAY and RANDOMIZE_DOWNLOAD_DELAY to avoid overwhelming servers
- Rotate user agents – Install scrapy-user-agents for automatic rotation
- Handle errors gracefully – Implement retry logic and error handling
- Monitor memory usage – Use MEMDEBUG_ENABLED for memory profiling during development
- Cache responses – Enable HTTPCACHE_ENABLED during development to speed up testing (the settings sketch after this list pulls these options together)
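As a starting point, a settings.py fragment applying these practices might look like this; the values are conservative defaults to adjust per target site:
# settings.py - conservative, polite defaults

ROBOTSTXT_OBEY = True              # Respect robots.txt rules

DOWNLOAD_DELAY = 2                 # Base delay between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True    # Jitter the delay between 0.5x and 1.5x

RETRY_ENABLED = True               # Retry transient failures
RETRY_TIMES = 2

HTTPCACHE_ENABLED = True           # Cache responses while developing
HTTPCACHE_EXPIRATION_SECS = 3600   # Expire cached pages after an hour

DOWNLOADER_MIDDLEWARES = {
    # Requires: pip install scrapy-user-agents
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
}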
Real-World Use Cases and Applications
Scrapy excels in various professional scenarios where reliable, scalable data extraction is crucial:
E-commerce Price Monitoring
Build competitive intelligence systems that track product prices across multiple retailers. This spider example monitors pricing changes:
import re
from datetime import datetime

import scrapy


class PriceMonitorSpider(scrapy.Spider):
    name = 'price_monitor'

    def __init__(self, products=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.product_urls = products.split(',') if products else []

    def start_requests(self):
        for url in self.product_urls:
            yield scrapy.Request(
                url=url,
                callback=self.parse_product,
                meta={'dont_cache': True}  # Always fetch fresh data
            )

    def parse_product(self, response):
        price_text = response.css('.price::text').get()
        if price_text:
            price_match = re.search(r'[\d.]+', price_text)
            currency_match = re.search(r'[^\d\s.]+', price_text)
            if price_match:
                yield {
                    'url': response.url,
                    'price': float(price_match.group()),
                    'currency': currency_match.group() if currency_match else None,
                    'product_name': response.css('h1::text').get(),
                    'availability': response.css('.stock-status::text').get(),
                    'timestamp': datetime.now().isoformat()
                }
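Spider arguments are passed on the command line with -a, so a monitoring run could look like this (the URLs are placeholders):
scrapy crawl price_monitor -a products="https://shop.example.com/item-1,https://shop.example.com/item-2" -o prices.json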
Content Aggregation and News Monitoring
Media companies use Scrapy to aggregate content from multiple sources, track mention of brands, or monitor industry news. The asynchronous architecture makes it perfect for crawling hundreds of news sites simultaneously.
Lead Generation and Business Intelligence
Sales teams leverage web scraping to gather contact information, company details, and market intelligence. However, always ensure compliance with terms of service and data protection regulations.
Troubleshooting Common Issues
Getting Blocked by Anti-Bot Systems
Modern websites employ sophisticated bot detection. Here’s a comprehensive anti-detection setup:
# settings.py
DOWNLOAD_DELAY = 3
RANDOMIZE_DOWNLOAD_DELAY = True
CONCURRENT_REQUESTS = 1
CONCURRENT_REQUESTS_PER_DOMAIN = 1

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_user_agents.middlewares.RandomUserAgentMiddleware': 400,
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
}

# Enable autothrottling
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 1
AUTOTHROTTLE_MAX_DELAY = 10
AUTOTHROTTLE_TARGET_CONCURRENCY = 2.0

# Proxy settings (if using scrapy-proxies)
PROXY_LIST = '/path/to/proxy/list.txt'
PROXY_MODE = 0  # Random proxy
Memory Leaks and Performance Issues
Large-scale crawling can lead to memory problems. Monitor and optimize with these settings:
# Report memory debugging stats when the spider closes
MEMDEBUG_ENABLED = True

# Close the spider when memory usage exceeds a limit (in MB)
MEMUSAGE_ENABLED = True
MEMUSAGE_LIMIT_MB = 2048
MEMUSAGE_WARNING_MB = 1536

# Limit concurrent items processed in the item pipeline
CONCURRENT_ITEMS = 100

# Optionally tune Python's garbage collector
# (700, 10, 10 are CPython's defaults; raise the first threshold to collect less often)
import gc
gc.set_threshold(700, 10, 10)
Handling Timeouts and Connection Issues
Network instability requires robust error handling:
DOWNLOAD_TIMEOUT = 180
RETRY_TIMES = 3
RETRY_HTTP_CODES = [500, 502, 503, 504, 408, 429]

# Retried requests are rescheduled with lower priority so fresh pages aren't starved
RETRY_PRIORITY_ADJUST = -1

# Note: Scrapy's built-in RetryMiddleware has no exponential backoff setting. Combine
# retries with AUTOTHROTTLE_ENABLED (above) or a custom retry middleware if you need
# the crawl to slow down under repeated failures.
Deployment and Production Considerations
When deploying Scrapy applications to production servers, several factors require attention:
Containerized Deployment
Docker provides consistent environments across development and production:
# Dockerfile
FROM python:3.9-slim
RUN apt-get update && apt-get install -y \
gcc \
libxml2-dev \
libxslt1-dev \
zlib1g-dev \
libffi-dev \
libssl-dev \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["scrapy", "crawl", "your_spider"]
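The requirements.txt referenced in the Dockerfile just lists the packages installed earlier in this guide (pin versions as needed for your project):
scrapy
scrapy-splash
scrapy-user-agents
scrapy-proxies
itemadapter
Build the image and run a crawl, overriding the default CMD with your spider name and mounting a directory so the output survives the container:
docker build -t webscraper .
docker run --rm -v "$(pwd)/data:/app/data" webscraper scrapy crawl news_crawler -o /app/data/output.json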
Monitoring and Logging
Production deployments need comprehensive monitoring:
# settings.py
LOG_LEVEL = 'INFO'
LOG_FILE = 'scrapy.log'
# Enable stats collection
STATS_CLASS = 'scrapy.statscollectors.MemoryStatsCollector'
# Custom logging configuration
import logging
logging.getLogger('scrapy').setLevel(logging.WARNING)
logging.getLogger('twisted').setLevel(logging.WARNING)
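Beyond log files, crawl statistics are a useful health signal. One sketch of a tiny extension that reports key stats when each spider finishes (the class name is illustrative; the stats keys are standard Scrapy counters):
# extensions.py
from scrapy import signals


class StatsReporter:
    """Hypothetical extension that logs key crawl stats when a spider finishes."""

    def __init__(self, stats):
        self.stats = stats

    @classmethod
    def from_crawler(cls, crawler):
        ext = cls(crawler.stats)
        crawler.signals.connect(ext.spider_closed, signal=signals.spider_closed)
        return ext

    def spider_closed(self, spider, reason):
        items = self.stats.get_value('item_scraped_count', 0)
        errors = self.stats.get_value('log_count/ERROR', 0)
        spider.logger.info('Crawl finished (%s): %d items, %d errors', reason, items, errors)
Enable it like any extension, e.g. EXTENSIONS = {'webscraper.extensions.StatsReporter': 500}, where the module path assumes the project layout from earlier.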
For large-scale operations, consider using dedicated servers with sufficient CPU and memory resources. Scrapy’s concurrent nature benefits significantly from multiple CPU cores and adequate RAM for handling thousands of simultaneous connections.
The official Scrapy documentation provides extensive details on advanced configuration options, while the GitHub repository offers examples and community contributions for specialized use cases.
Web scraping with Scrapy opens up powerful possibilities for data-driven applications, but success depends on understanding both the technical framework and the ethical considerations of web data extraction. Start with simple projects, gradually incorporate advanced features like JavaScript handling and distributed crawling, and always prioritize respectful, responsible scraping practices that don’t burden target websites.
