Effective Ways to Scale-Up and Maintain Your Web Crawling Project

Hao-Yun Chuang
Nov 4, 2022

What you'll take home today

  1. Displaying the directory structure with tree
  2. Maintaining a web crawling project using Scrapy
  3. Useful libraries and modules for parsing

Tree directory

How?

  • For Mac users:
brew install tree
  • For Windows users:

Please refer to this website

Now give it a try!

Under any directory...

(base) milanochuang@zhuanghaoyundeMacBook-Pro Documents % tree

Use Scrapy in your crawling project

Scrapy

  • Fast & powerful
    Write the rules to extract the data and let Scrapy do the rest
  • Easily extensible
    extensible by design, plug new functionality easily without having to touch the core
  • Portable, Python
    written in Python and runs on Linux, Windows, Mac

Start a virtual environment

Find a directory you like and run

python3 -m venv scrapy-tutorial

Mac User

source scrapy-tutorial/bin/activate

Windows User

scrapy-tutorial\Scripts\activate.bat

Then clone the tutorial repo and install the dependencies:

git clone https://github.com/milanochuang/scrape_tutorial.git
cd spider_tutorial
pip install -r requirements.txt

Definition of Terms

  • Seed URL: First page the spider starts with.
  • Crawling: visiting websites page by page

    💡 choose your Christmas tree

  • Extraction: parsing the data from a page

    💡 choose the decorations
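
Before writing any spider, you can try both ideas interactively with scrapy shell on the seed URL. A minimal sketch (the extraction selector here is illustrative; the tutorial's own selectors appear later):

scrapy shell "http://books.toscrape.com/"
>>> response.css(".product_pod a ::attr(href)").getall()[:3]   # crawling: links to follow
>>> response.css(".product_pod h3 a ::attr(title)").get()      # extraction: data on the page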

Start the crawling project

Find a directory and enter this in the cmd:
scrapy startproject books_to_scrape
You should see something like this...
New Scrapy project 'books_to_scrape', using template directory '/Users/milanochuang/opt/anaconda3/lib/python3.8/site-packages/scrapy/templates/project', created in:
    /Users/milanochuang/Documents/code/spider_tutorial/books_to_scrape

You can start your first spider with:
    cd books_to_scrape
    scrapy genspider example example.com
cd books_to_scrape

scrapy genspider books books.toscrape.com
Created spider 'books' using template 'basic' 
tree
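
The output should look roughly like this: Scrapy's standard project template plus the books spider we just generated (any __pycache__ folders are omitted here):

.
├── books_to_scrape
│   ├── __init__.py
│   ├── items.py
│   ├── middlewares.py
│   ├── pipelines.py
│   ├── settings.py
│   └── spiders
│       ├── __init__.py
│       └── books.py
└── scrapy.cfg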

  • You can now open the editor of your choice.
  • Open the outer books_to_scrape directory.
  • Keep your terminal open in the same directory alongside.

Scrapy flow (1)

  • Begin at the seed url (books.toscrape.com)
  • Extract all the books from this page
  • Store the extracted data locally
    • run scrapy crawl books -o data.json

Let's open books.py

You should see something like this

# -*- coding: utf-8 -*-
import scrapy


class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        pass

Remember CSS path?

No way you've forgotten, right?

CSS path of each book is .product_pod a ::attr(href)

import scrapy
class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ['http://books.toscrape.com/']

    def parse(self, response):
        for url in response.css(".product_pod a ::attr(href)").getall():
            yield response.follow(url, callback=self.parse_book)

Have you noticed any new friends?

  • yield: try this
power = (i**2 for i in range(100000))
print(power)        # <generator object <genexpr> at 0x7f8ddb35ec10>
print(type(power))  # <class 'generator'>
  • follow
  • callback
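
In short: response.follow builds a request from a (possibly relative) href on the current page, and callback names the method Scrapy will call with the downloaded response. A rough sketch of what the loop above does, spelled out with scrapy.Request:

def parse(self, response):
    for url in response.css(".product_pod a ::attr(href)").getall():
        # response.follow(url, callback=self.parse_book) roughly does this:
        # turn the relative href into an absolute URL and schedule a request
        # whose response will be handed to self.parse_book
        yield scrapy.Request(response.urljoin(url), callback=self.parse_book)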

Extract the book information

import scrapy
class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ['http://books.toscrape.com/']
    
    # Crawl the link of every book on the catalog page (picking the Christmas tree)
    
    def parse(self, response):
        for url in response.css(".product_pod a ::attr(href)").getall():
            yield response.follow(url, callback=self.parse_book)
            
    # Extract each book's information (picking the decorations)
            
    def parse_book(self, response):
        item = {
            "title": response.css(".product_main h1 ::text").get(),
            "price": response.css(".product_main .price_color ::text").get(),
            "url": response.url,
            "availability": response.css(".product_main .availability ::text").getall()
        }
        yield item

Run the code

scrapy crawl books -o data.json

Open the file, what happened?

Oops... something went wrong: availability came back as a list of whitespace-padded strings. Let's revise the code.

def parse_book(self, response):
    item = {
        "title": response.css(".product_main h1 ::text").get(),
        "price": response.css(".product_main .price_color ::text").get(),
        "url": response.url,
    }
    availability = response.css(".product_main .availability ::text").getall()
    item["availability"] = "".join(availability).strip()
    yield item

Let's try again.
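
After the fix, each record in data.json should look roughly like this (illustrative values; they depend on the book):

{
  "title": "A Light in the Attic",
  "price": "£51.77",
  "url": "http://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "availability": "In stock (22 available)"
}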

Scrapy flow (2)

  • Begin at the seed url
  • Extract all the books from this page
  • Go through each category at the sidebar
    • Extract all the books in the categories
    • Structure the data
  • Extract all the books for the page
  • Store the extracted data locally
    • run scrapy crawl books -o data.json

Let's go back to our code.

import scrapy
class BooksSpider(scrapy.Spider):
    name = "books"
    allowed_domains = ["books.toscrape.com"]
    start_urls = ['http://books.toscrape.com/']
    
    def parse(self, response):
        for url in response.css(".product_pod a ::attr(href)").getall():
            yield response.follow(url, callback=self.parse_book)

            ...

We need to find the CSS path of the categories

Let's modify our parse function.

def parse(self, response):
    for url in response.css(".nav-list ul li a ::attr(href)").getall():
        yield response.follow(url, callback=self.parse_category)

Let's create a new parse_category function.

def parse_category(self, response):
    category_name = response.css(".page-header h1 ::text").get()
    for url in response.css(".product_pod a ::attr(href)").getall():
        yield response.follow(
            url,
            callback=self.parse_book,
            cb_kwargs={"category_name": category_name},
        )
    if next_page_url := response.css(".pager .next a ::attr(href)").get():
        yield response.follow(
            next_page_url,
            callback=self.parse_category,
        )

(Remember to add a category_name parameter to parse_book and store it on the item, otherwise the extra keyword argument from cb_kwargs raises a TypeError.)

Have you noticed any new friends?

  • :=
a = list(range(20))  # example list (illustrative)

# without the walrus operator
n = len(a)
if n > 10:
    print(f"List is too long ({n} elements, expected <= 10)")

# with the walrus operator (Python 3.8+)
if (n := len(a)) > 10:
    print(f"List is too long ({n} elements, expected <= 10)")
  • cb_kwargs
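
cb_kwargs is a dict of extra keyword arguments that Scrapy passes to the callback. A minimal sketch of how the category_name set in parse_category arrives in parse_book (the body is illustrative; the real version comes later with page objects):

def parse_book(self, response, category_name):
    # category_name shows up here because the request was created with
    # cb_kwargs={"category_name": category_name}
    yield {
        "title": response.css(".product_main h1 ::text").get(),
        "category_name": category_name,
        "url": response.url,
    }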

Run the code

scrapy crawl books -o data.json

Let's try to improve our spider.

Define items (what you scraped)

  1. Scrapy's own item class
  2. Python's dataclass
  3. attrs library
  4. ...
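
Option 1 is the template already sitting in items.py, and option 2 is what we use below. For comparison, a minimal sketch of the same item written with the attrs library (option 3), not used in the rest of this tutorial:

from typing import Optional

import attr


@attr.s(auto_attribs=True)
class BookItem:
    url: str
    category_name: Optional[str] = None
    title: Optional[str] = None
    price: Optional[str] = None
    availability: Optional[str] = None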

Now go to items.py

You should see this...

import scrapy


class ScrapeBookItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

Delete them all

We are using Python's own dataclass today...

from dataclasses import dataclass
from typing import Optional

# from scrapy_jsonschema.item import JsonSchemaItem


@dataclass
class BookItem:
    url: str
    category_name: Optional[str] = None
    title: Optional[str] = None
    price: Optional[str] = None
    availability: Optional[str] = None

Back to books.py...

from books_to_scrape.items import BookItem
# change the code below
...
    yield item

# into
...
    yield BookItem(**item)

Run scrapy crawl books -o data.json

Let's explore a little bit more

...
        breakpoint()
        yield BookItem(**item)

At the (Pdb) prompt, try:

BookItem(**item)                          # builds the item from the parsed dict
BookItem(**item, **{"unknown": "value"})  # TypeError: unexpected keyword argument 'unknown'
BookItem(title="Alice in wonderland")     # TypeError: url is a required field

Decoupling parsing from crawling

  • Concept of Page Objects!
  • Makes the code simpler
  • The spider focuses on crawling
  • The Page Object focuses on extraction
pip install web-poet scrapy-poet

Create a new Python file called page_objects.py...

from web_poet import ItemWebPage
from books_to_scrape.items import BookItem # moved from books.py

class BookPage(ItemWebPage):
    def to_item(self):
        item = {
            "title": self.css(".product_main h1 ::text").get(),
            "price": self.css(".product_main .price_color ::text").get(),
            "url": self.url,
        }
        availability = self.css(".product_main .availability ::text").getall()
        item["availability"] = "".join(availability).strip()
        return BookItem(**item)
        

Add the config below at the bottom of settings.py

DOWNLOADER_MIDDLEWARES = {
    'scrapy_poet.InjectionMiddleware': 543,
}
SPIDER_MIDDLEWARES = {
    "scrapy_poet.RetryMiddleware": 275,
}

In books.py...

from books_to_scrape.page_objects import BookPage

...

def parse_book(self, response, page: BookPage, category_name):
    item = page.to_item()
    item.category_name = category_name
    yield item

Run the code to see if it works.

Let's modify the other pages with web_poet

In page_objects.py...

from web_poet import ItemWebPage, WebPage

class HomePage(WebPage):
    @property
    def category_urls(self):
        return self.css(".nav-list ul li a ::attr(href)").getall()


class BookCategoryPage(WebPage):
    @property
    def book_urls(self):
        return self.css(".product_pod a ::attr(href)").getall()

    @property
    def category_name(self):
        return self.css(".page-header h1 ::text").get()

    @property
    def next_page_url(self):
        return self.css(".pager .next a ::attr(href)").get()

In books.py...

import scrapy

from books_to_scrape.page_objects import HomePage, BookCategoryPage, BookPage
...

    def parse(self, response, page: HomePage):
        for url in page.category_urls:
            yield response.follow(url, callback=self.parse_category)

    def parse_category(self, response, page: BookCategoryPage):
        for url in page.book_urls:
            yield response.follow(
                url,
                callback=self.parse_book,
                cb_kwargs={"category_name": page.category_name},
            )

        if next_page_url := page.next_page_url:
            yield response.follow(
                next_page_url,
                callback=self.parse_category,
            )

    def parse_book(self, response, page: BookPage, category_name):
        item = page.to_item()
        item.category_name = category_name
        yield item

Let's compare the code with the first version.

Some other features

When you have multiple spiders...

  • You'll need to manage and visualize spider activity:
    • schedule periodic runs
    • view current and finished runs
    • search through the extracted items, logs & stats
    • debug easily
    • visualize run performance

Monitoring the spider

  • Custom monitors
  • Supports notifications via Slack, Telegram, Discord, and email

Run:

pip install spidermon

In books.py...

...
    start_urls = ['http://books.toscrape.com/']
    custom_settings = {
        "SPIDERMON_ENABLED": True,
        "EXTENSIONS": {
            "spidermon.contrib.scrapy.extensions.Spidermon": 500,
        },
        "SPIDERMON_SPIDER_CLOSE_MONITORS": "spidermon.contrib.scrapy.monitors.SpiderCloseMonitorSuite",
        "SPIDERMON_MIN_ITEMS": 10,
        "SPIDERMON_MAX_ERRORS": 1,
        "SPIDERMON_MAX_WARNINGS": 1000,
        "SPIDERMON_ADD_FIELD_COVERAGE": True,
    }

    def parse(self, response, page: HomePage):
...
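
The settings above use Spidermon's built-in close-monitor suite. If you need your own checks, a custom monitor is just a small class; a minimal sketch based on Spidermon's Monitor API (the module path books_to_scrape/monitors.py and the threshold are assumptions):

# books_to_scrape/monitors.py (hypothetical module)
from spidermon import Monitor, MonitorSuite, monitors


@monitors.name("Item count")
class ItemCountMonitor(Monitor):
    @monitors.name("Minimum number of items extracted")
    def test_minimum_number_of_items(self):
        items_extracted = getattr(self.data.stats, "item_scraped_count", 0)
        minimum_threshold = 10  # illustrative threshold
        self.assertTrue(
            items_extracted >= minimum_threshold,
            msg=f"Extracted fewer than {minimum_threshold} items",
        )


class SpiderCloseMonitorSuite(MonitorSuite):
    monitors = [ItemCountMonitor]

To use it, point SPIDERMON_SPIDER_CLOSE_MONITORS at "books_to_scrape.monitors.SpiderCloseMonitorSuite" instead of the built-in suite.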

Add schema validation

  • Ability to write custom schema validations per item
  • Use JSON Schema to define the data format
  • CONS: not compatible with the dataclass or attrs libraries

Run: pip install jsonschema scrapy-jsonschema

In items.py...

from scrapy_jsonschema.item import JsonSchemaItem

class BookSchemaItem(JsonSchemaItem):
    jsonschema = {
        "$schema": "http://json-schema.org/draft-07/schema#",
        "title": "Book",
        "description": "A Book item extracted from books.toscrape.com",
        "type": "object",
        "properties": {
            "url": {
                "description": "Book's URL",
                "type": "string",
                "pattern": "^https?://[\\S]+$"
            },
            "category_name": {
                "description": "Name of the category which the book belongs to",
                "type": "string"
            },
            "title": {
                "description": "Book's title",
                "type": "string"
            },
            "price": {
                "description": "Book's price",
                "minimum": 0,
                "type": "number"
            },
            "availability": {
                "description": "Book's availability",
                "type": "string"
            }
        },
        "required": ["url"]
    }

In page_objects.py...

from books_to_scrape.items import BookItem, BookSchemaItem
...
# change
return BookItem(**item)
# to
return BookSchemaItem(**item)

In books.py...

# delete the custom monitor setting

...
def parse_book(self, response, page: BookPage, category_name):
    item = page.to_item()
    item['category_name'] = category_name
    yield item

What about dynamic pages?

scrapy genspider quotes quotes.toscrape.com
pip install scrapy-playwright
playwright install
sudo playwright install-deps

In quotes.py...

import scrapy


class QuotesSpider(scrapy.Spider):
    name = 'quotes'
    allowed_domains = ['quotes.toscrape.com']

    custom_settings = {
        "DOWNLOAD_HANDLERS": {
            "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
            "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
        },
        "TWISTED_REACTOR": "twisted.internet.asyncioreactor.AsyncioSelectorReactor",
    }

    def start_requests(self):
        yield scrapy.Request(
            "http://quotes.toscrape.com/js",
            callback=self.parse,
            meta={"playwright": True},
        )

    def parse(self, response):
        for quote in response.css(".quote .text ::text").getall():
            yield {"quote": quote}

Helper libraries

from number_parser import parse, parse_number, parse_ordinal

print(parse("One two three go!"))      # '1 2 3 go!'
print(parse_number("twenty three"))    # 23
print(parse_ordinal("seventy fifth"))  # 75

from price_parser import Price, parse_price

print(Price.fromstring("22,90 €"))     # Price(amount=Decimal('22.90'), currency='€')

import dateparser

dateparser.parse("Fri, 12 Dec 2014 10:55:50")  # datetime.datetime(2014, 12, 12, 10, 55, 50)

import requests, extruct

page = requests.get("https://tw.pycon.org/2022/en-us")
print(extruct.extract(page.text))      # structured metadata (e.g. JSON-LD, Open Graph) found in the page

When you're done, leave the virtual environment:

deactivate

Full Code

Source:

Debug note

if this happens...

ERROR: Loading "scrapy.core.downloader.handlers.http.HTTPDownloadHandler" for scheme "https"

or this...

ModuleNotFoundError: No module named 'attrs'

try running:

pip install Twisted==21.7.0




Notes

url is a field that will always be present in the data we crawl, so it is marked as required; the other fields are optional, so even when some data is missing, it will not break the extraction.

The result is the same as before, because we have not yet added more detailed definitions on top of the dataclass.

web-poet implements the Page Object pattern for web scraping.

When scrapy-poet sees page: BookPage, it knows it should call the BookPage class.

Doing this improves maintainability, because the page structure is likely to change in the future.

This is because the JSON schema item does not support attribute-style access; fields are set with item["field"] instead.
