Browser Developer Tools, Scrapy and XPath quirks

Criamos edited this page Jul 8, 2021 · 1 revision

When you want to extract specific data from a website, you essentially tell your Scrapy crawler to look for specific elements at specific locations with the help of Selectors. While CSS selectors (e.g. via response.css()) work perfectly well when a website is well-structured and rarely changes its CSS classes or attributes, you should still take some time to get acquainted with XPath: in the long run, its flexibility will make your (and your scrapy.Spider's) life easier.

Before we get into XPath's quirks and the unique behaviors to look out for depending on the context in which you use an XPath expression, here are some useful XPath bookmarks:

Use your browser's Developer Tools

If you want to save time while building a crawler, modern browsers offer convenient Developer Tools to help you identify the desired elements and their locations inside the DOM. Scrapy's documentation on using your browser's Developer Tools for scraping should be your first read once you've completed the Scrapy Tutorial and want to crawl more efficiently. You'll be using these tools constantly, and avoiding rookie traps like the <tbody> element will save you a lot of time. Trust us on this one.

...but don't trust the browser DOM blindly! Beware of tbody!

Speaking of tbody: what your browser "sees" and shows you when inspecting a website and what the Scrapy shell sees inside response.body are sometimes just different enough to send you down a debugging rabbit hole.

You might encounter websites that hold the information you want to gather in nested tables. Using the context menu with right-click -> Inspect on a text element will help you swiftly identify its location, but don't make the mistake of simply using right-click -> Copy (full) XPath on the selected node and pasting the result straight into your crawler's source code. Website layouts change, and the absolute position of an element might be just one afternoon away from breaking your crawler.

Let's say you're looking for a string that sits within a <td> element, which itself is buried in a nested structure like /html/body/main/div/table/tbody/tr/td. Your best bet is to narrow your XPath expression down to something short, yet still precise enough to match only the elements you actually want to gather data from. If, for example, the table belongs to a specific class or has a unique id, you could shorten the expression to one of these variations:

  • response.xpath('//table[@class="someSpecificClass"]/tr/td').get()

  • response.xpath('//table[@id="someUniqueID"]/tr/td').get()
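To see how such an attribute predicate narrows the match, here is a minimal sketch using the standard library's ElementTree, whose findall() understands a subset of XPath. The sample markup and cell contents are invented for this example:

```python
import xml.etree.ElementTree as ET

# Two tables, but only one carries the unique id we filter on.
doc = ET.fromstring(
    "<html><body><main><div>"
    "<table id='someUniqueID'><tr><td>wanted cell</td></tr></table>"
    "<table><tr><td>some other cell</td></tr></table>"
    "</div></main></body></html>"
)

# Only the table with the matching id is selected:
cells = doc.findall(".//table[@id='someUniqueID']/tr/td")
print(cells[0].text)  # wanted cell
print(len(cells))     # 1
```

Note that ElementTree uses a leading `.//` where Scrapy selectors accept `//`; the predicate syntax is the same.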

Did you notice that the <tbody> element is missing from the above expressions? That was 100% on purpose! Whenever you spot a tbody element within a <table>, act as if it didn't exist in the first place. Modern browsers insert <tbody> elements into tables even when the original HTML document doesn't contain them. So if you want to select elements from a table, always double-check that you have removed every single tbody element from your XPath expression!
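A small sketch (again using the standard library's ElementTree on invented sample markup) shows why the tbody-free expression is the right one: the raw HTML source often has no <tbody> at all, even though your browser's DOM inspector will show one.

```python
import xml.etree.ElementTree as ET

# A table as it typically appears in the raw HTML source: no <tbody>.
raw_source = (
    "<html><body><main><div>"
    "<table class='someSpecificClass'><tr><td>cell text</td></tr></table>"
    "</div></main></body></html>"
)
root = ET.fromstring(raw_source)

# The browser-suggested path (with tbody) matches nothing in the source...
print(root.findall(".//table/tbody/tr/td"))    # []
# ...while the tbody-free variant finds the cell:
print(root.findall(".//table/tr/td")[0].text)  # cell text
```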

XPath: keep it as short and concise as possible!

Be smart with your XPath expressions: keep them as short and concise as possible without casting too wide a net. If a web developer was gracious enough to use proper, distinguishable ids, names or other useful attributes that help you tell elements apart, craft your XPath expressions to target only those elements!

A short example: if the <head> of a website holds metadata that you want to gather, it might look something like this:

<meta property="description" content="Ich kenne die wichtigsten menschlichen Skelettknochen. Ich weiß, welche Funktion die Wirbelsäule hat. Ich kann Rückenschmerzen vorbeugen.">

Using your browser's Developer Tools, right-click -> Copy XPath might offer you an expression like /html/head/meta[4], which may work perfectly today but could already be broken by the next update to the website's header structure. Say we want to gather the string inside the content attribute; we could enter the following XPath expression in our Scrapy shell to confirm that we are on the right track:

>>> response.xpath('//meta[@property="description"]')
[<Selector xpath='//meta[@property="description"]' data='<meta property="description" content=...'>]

Okay. Our XPath-expression returns a Selector, which is neat, but we're not quite there yet. Let's refine our XPath-expression just a little bit:

>>> response.xpath('//meta[@property="description"]/@content').get()
'Ich kenne die wichtigsten menschlichen Skelettknochen. Ich weiß, welche Funktion die Wirbelsäule hat. Ich kann Rückenschmerzen vorbeugen.'

Perfect! Now that we are sure that the XPath expression returns the correct string, we can use it in our crawler's source code to fill the LomGeneralItemLoader with a value for its description key.
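If you want to sanity-check the same extraction outside the Scrapy shell, a rough equivalent with the standard library's ElementTree looks like this. Its XPath subset has no /@attribute step, so we read the attribute from the matched element instead; the markup below is a shortened, made-up stand-in for a real page head:

```python
import xml.etree.ElementTree as ET

head = ET.fromstring(
    '<head>'
    '<meta charset="utf-8" />'
    '<meta property="description" content="Ich kenne die wichtigsten '
    'menschlichen Skelettknochen." />'
    '</head>'
)

# Find the element by its property attribute, then read content directly:
meta = head.find(".//meta[@property='description']")
print(meta.get("content"))  # Ich kenne die wichtigsten menschlichen Skelettknochen.
```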


The more complicated a website's structure, the more useful XPath and its syntax become, because they allow you to traverse the whole DOM and its elements. When a website varies slightly in structure, it might be worth taking a look at XPath axes like child or descendant.

Keep in mind: if you press CTRL+F while inside your browser's Developer Tools, you can experiment with your XPath expressions right in the browser. For quick reference, there are excellent XPath cheat sheets available at devhints.io and quickref.me.