Playwright Web Scraping 2024 - Tutorial
This blog was originally posted on the Crawlbase blog.
In this tutorial, our main focus will be on Playwright web scraping. So what is Playwright? It’s a handy framework created by Microsoft, known for streamlining web interactions and working reliably with all the latest browser engines: Chromium, Firefox, and WebKit. You can run tests in headless or headed mode and emulate native mobile environments such as Google Chrome for Android and Mobile Safari.
Playwright started its journey back in January 2020, and it has been growing steadily since. By March 2024, it was seeing about 4 million downloads every week. Big names like Adobe Spectrum and Visual Studio Code are already using it for their projects, and the framework is more popular than ever this year.
Throughout this tutorial, we’ll show you step by step how to use Playwright for web scraping in Node.js. We’ll also showcase Crawlbase for a different approach to getting data from the web. Whether you’re a seasoned developer or new to web scraping, this tutorial promises to equip you with the necessary skills for success. Let’s begin.
Table of Contents
1. Blog Scope
2. Prerequisites
3. How to Web Scrape using Playwright
4. Scraping Original Price using Playwright
5. Scraping Discounted Price with Playwright
6. Scraping Product Thumbnail using Playwright
7. Scraping Product Ratings with Playwright
8. Scraping Product Reviews Count using Playwright
9. Scraping Product Reviews with Playwright
10. Code Compiling and Execution for Playwright Scraping
11. Scraper using Crawlbase Crawling API and Cheerio
12. Conclusion
13. Frequently Asked Questions
1. Blog Scope
Our goal for this tutorial is to crawl and scrape Backmarket.com, particularly this URL, and extract essential information such as the product’s title, original and discounted prices, thumbnail image URL, ratings, reviews count, and individual reviews.
We will provide two different approaches for this tutorial:
Playwright Method: First, we will utilize the Playwright framework to extract relevant data from our target URL. We will provide step-by-step instructions, from setting up your coding environment for Playwright to parsing the HTML content of the page and printing the results in an easily digestible JSON format that can be used for further analysis.
Crawlbase Crawling API Method: Next, we will utilize the Crawling API from Crawlbase to crawl and scrape data from the same product page. We will make a GET request to the target URL to extract the HTML content without getting blocked, then parse the content using Cheerio. The parsed content will be printed in a readable JSON format as well.
By exploring these approaches, you’ll gain a deeper understanding of different web scraping techniques and learn how to adapt your scraping strategy to the requirements of your project.
2. Prerequisites
Now that we’ve established our goals, we can start setting up our environment for coding. We need to make sure we have all the necessary foundations in place. Here’s what you’ll need to get started:
Node.js installed on your system: Node.js serves as a runtime environment, enabling the execution of JavaScript on your local machine. This is important for running the web scraping script we’re about to develop.
To install the environment, simply download Node.js from the official Node.js website.
Fundamental understanding of JavaScript: Since we will be heavily utilizing JavaScript in our web scraping project, knowing the basics of the language is important. You should be familiar with concepts like variables, functions, loops, and basic DOM manipulation.
For those new to JavaScript, consider engaging in tutorials or documentation available on platforms like Mozilla Developer Network (MDN) or W3Schools.
Crawlbase API Token: In the second part of this tutorial, we’ll be utilizing the Crawlbase API for more efficient web scraping. The API token will serve as your key to authenticating requests and unlocking the full potential of the Crawling API. We’ll be using the Crawlbase JavaScript Token to mimic real browser requests.
To get your token, head over to the Crawlbase website, create an account, and access your API tokens from the account documentation section.
3. How to Web Scrape using Playwright
Once you have Node.js installed, open your command prompt (Windows) or terminal (macOS/Linux).
Create a directory to store your Playwright scraper code by running the following commands:
mkdir playwright-scraper
cd playwright-scraper
Now install Playwright by using the command below:
npm install playwright
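Depending on your Playwright version and environment, the browser binaries may or may not be downloaded automatically during installation. If they are missing, you can fetch Chromium explicitly:

```bash
npx playwright install chromium
```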
Import the library and create a function for scraping the title:
Once you’ve installed the necessary packages, we can begin coding. Start by importing the required modules: chromium from the Playwright library for automating the browser, and fs for file system operations. These are the key modules for performing web scraping and handling file operations within the Node.js environment.
For your convenience, you can copy and paste the code below into your index.js file.
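Here is a minimal sketch of that file, following the flow described below; the target URL is a placeholder for the Backmarket product page you chose:

```javascript
// Import required modules
const { chromium } = require("playwright"); // browser automation
const fs = require("fs"); // file system operations, used later to save results

// Function to scrape product data from a provided web URL
async function scrapeResults(webUrl) {
  // Launch Chromium in non-headless mode (visible UI)
  const browser = await chromium.launch({ headless: false });

  // Create a fresh browsing context and page
  const context = await browser.newContext();
  const page = await context.newPage();

  // Load the target product page
  await page.goto(webUrl);

  // Extract the product title from the first <h1> element
  const title = await page.$eval("h1", (el) => el?.textContent.trim());
  console.log(title);

  await browser.close();
}

scrapeResults("https://www.backmarket.com/..."); // replace with your target product URL
```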
Scraping Process: The scrapeResults function is defined to scrape results from a provided web URL. Within this function:
It launches a new instance of the Chromium browser using chromium.launch({ headless: false }), so the browser runs in non-headless mode (visible UI).
It creates a new browsing context and page using browser.newContext() and context.newPage(), respectively. This prepares the environment for navigating to and interacting with web pages.
It navigates to the provided web URL using page.goto(webUrl). This instructs the browser to load the specified URL.
It extracts the title of the product from the page using page.$eval("h1", ...), which finds the first <h1> element on the page and retrieves its text content. The ?.textContent.trim() ensures that leading and trailing whitespace is removed from the extracted text.
4. Scraping Original Price using Playwright
To extract the original price from the target webpage, first, visit the provided URL in your web browser. Next, right-click on the original price element and select “Inspect” to access the Developer Tools, which will highlight the corresponding HTML code.
Identify the HTML element representing the original price, ensuring uniqueness by examining its attributes, classes, or IDs. Once identified, copy the CSS selector of the element and verify its accuracy in a text or code editor, making any necessary adjustments to accurately target the original price element.
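Below is a sketch of this helper together with the call that extracts the original price. It assumes the code lives inside scrapeResults, where page is in scope:

```javascript
// Function to get text content of an element by selector
const getValueBySelector = async (selector) => {
  return await page.evaluate((sel) => {
    // Find the element and return its trimmed text content
    return document.querySelector(sel)?.textContent.trim();
  }, selector);
};

// Extract the original price of the product
const originalPrice = await getValueBySelector(
  '[data-popper-trigger="true"].body-1-light-striked',
);
```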
Function Definition (getValueBySelector):
This code defines an asynchronous function named getValueBySelector, which takes a CSS selector as its parameter.
Inside the function, it uses page.evaluate() from Playwright to execute JavaScript code in the context of the current page.
The JavaScript code inside page.evaluate() selects an HTML element based on the provided CSS selector using document.querySelector(selector).
It then accesses the textContent property of the selected element to retrieve its text content and applies the trim() method to remove any leading or trailing white space.
The function returns the trimmed text content of the selected element.
Extracting Original Price:
After defining the getValueBySelector function, it is invoked with a specific CSS selector ('[data-popper-trigger="true"].body-1-light-striked') to target the original-price element on the page.
The result, representing the original price of the product, is assigned to the variable originalPrice.
5. Scraping Discounted Price with Playwright
Once again, right-click on the discounted price element and select “Inspect” to access the Developer Tools, which will highlight the associated HTML code.
Once you have the correct CSS selector, you can write the code as shown below:
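A sketch of that call follows; the data-qa selector is a hypothetical example, so replace it with the one you copied from DevTools:

```javascript
// Extract discounted price of the product
// (hypothetical selector; use the one you verified in DevTools)
const discountedPrice = await getValueBySelector('[data-qa="productpage-product-price"]');
```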
This code extracts the discounted price of a product from our target web page. It uses the getValueBySelector() helper to select the element displaying the discounted price based on its attribute. Once found, the discounted price value is stored in the variable discountedPrice for further processing.
6. Scraping Product Thumbnail using Playwright
Similar to the previous steps, right-click on the thumbnail image of the product and select “Inspect” to open the Developer Tools. This action will highlight the corresponding HTML code related to the thumbnail image.
To write the code for extracting the thumbnail image URL using Playwright, you can use the following approach:
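Here is a sketch of that extraction; the gallery selector is an assumption, so verify it against the page’s actual markup:

```javascript
// Extract thumbnail image URL
const thumbnail = await page.evaluate(() => {
  // Hypothetical gallery selector; adjust it to the real attributes in DevTools
  const img = document.querySelector('div[data-qa="productpage-gallery"] img');
  return img?.getAttribute("src");
});
```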
This code extracts the URL of a thumbnail image from a web page. We use the page.evaluate() function to execute JavaScript code in the context of the web page. It selects the image element within a div with specific attributes and retrieves its src attribute, which holds the image URL. The extracted URL is then stored in the variable thumbnail for further use.
7. Scraping Product Ratings with Playwright
By now, you know the drill. Right-click on the product ratings and select “Inspect” to open the Developer Tools. This action will allow you to view the HTML code of the particular element we’ve selected.
Here’s the code snippet for extracting the ratings of the product:
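A sketch using the same helper; the selector is hypothetical and should be replaced with the one highlighted in DevTools:

```javascript
// Extract ratings of the product
// (hypothetical selector; replace with the one from DevTools)
const ratings = await getValueBySelector('[data-qa="rating-label"]');
```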
8. Scraping Product Reviews Count using Playwright
Once again, you’ll want to use the getValueBySelector helper to extract the count of reviews for the product. Pass in the CSS selector that targets the element displaying the reviews count. Once found, store the count value in the variable reviewsCount for further use:
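A sketch of that step, again with a hypothetical selector:

```javascript
// Extract reviews count of the product
// (hypothetical selector; replace with the one from DevTools)
const reviewsCount = await getValueBySelector('[data-qa="reviews-count"]');
```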
9. Scraping Product Reviews with Playwright
Lastly, we will scrape the reviews of the product. Inspect a review element in the Developer Tools to view its HTML, then follow the steps below:
Select All Review Elements: Use document.querySelectorAll() to select all list items containing reviews. Store them in a variable named reviewElements.
Map Over Each Review Element: Use map() to iterate over each review element in reviewElements. Inside the map() function, extract relevant information such as the author name, review URL, ratings, and review text.
Extract Author Name: Use querySelector() to find the element with the class .body-1-bold inside the current review element. Extract the text content and trim any leading or trailing spaces.
Extract Review URL: Use querySelector() to find the anchor (<a>) element with the attribute rel set to “noreferrer noopener”. Get the value of the href attribute and prepend it with the base URL https://www.backmarket.com/. This will give you the complete review URL.
Extract Ratings: Use querySelector() to find the element with the attribute data-qa set to “user-comment”. Extract the text content and trim any leading or trailing spaces.
Extract Review Text: Use querySelector() to find the element with the classes .body-1-light, .text-grey-500, and .whitespace-pre-line. Extract the text content and trim any leading or trailing spaces.
Return Extracted Information: Return an object containing the extracted information for each review element.
Filter Out Reviews: Use filter() to remove any reviews where either the author name or review text is missing.
Here’s how you can write the code:
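The sketch below follows the steps above. The bare li selector is an assumption, since the exact list-item markup isn’t shown here; the class and attribute selectors are the ones listed in the steps:

```javascript
const reviews = await page.evaluate(() => {
  // Select all list items containing reviews (assumed selector)
  const reviewElements = document.querySelectorAll("li");

  return Array.from(reviewElements)
    .map((review) => {
      // Author name
      const author = review.querySelector(".body-1-bold")?.textContent.trim();

      // Review URL: prepend the base URL to the relative href
      const link = review.querySelector('a[rel="noreferrer noopener"]');
      const reviewUrl = link
        ? "https://www.backmarket.com/" + link.getAttribute("href")
        : null;

      // Ratings
      const ratings = review.querySelector('[data-qa="user-comment"]')?.textContent.trim();

      // Review text
      const reviewText = review
        .querySelector(".body-1-light.text-grey-500.whitespace-pre-line")
        ?.textContent.trim();

      return { author, reviewUrl, ratings, reviewText };
    })
    // Drop reviews missing an author name or review text
    .filter((r) => r.author && r.reviewText);
});
```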
10. Code Compiling and Execution for Playwright Scraping
Now that we have the code snippets for each element we want to scrape from backmarket.com, let’s compile them into a single file and save it as index.js.
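A possible compilation of the snippets above is sketched below. The selectors marked as hypothetical earlier still need to be verified against the live page:

```javascript
// Import required modules
const { chromium } = require("playwright");
const fs = require("fs");

// Scrape product data from a provided web URL
async function scrapeResults(webUrl) {
  const browser = await chromium.launch({ headless: false });
  const context = await browser.newContext();
  const page = await context.newPage();

  try {
    await page.goto(webUrl);

    // Helper: get the trimmed text content of an element by CSS selector
    const getValueBySelector = (selector) =>
      page.evaluate((sel) => document.querySelector(sel)?.textContent.trim(), selector);

    // Product title
    const title = await page.$eval("h1", (el) => el?.textContent.trim());

    // Original price (selector taken from section 4)
    const originalPrice = await getValueBySelector(
      '[data-popper-trigger="true"].body-1-light-striked',
    );

    // Discounted price, ratings, and reviews count (hypothetical selectors)
    const discountedPrice = await getValueBySelector('[data-qa="productpage-product-price"]');
    const ratings = await getValueBySelector('[data-qa="rating-label"]');
    const reviewsCount = await getValueBySelector('[data-qa="reviews-count"]');

    // Thumbnail image URL (hypothetical selector)
    const thumbnail = await page.evaluate(() => {
      const img = document.querySelector('div[data-qa="productpage-gallery"] img');
      return img?.getAttribute("src");
    });

    // Reviews (see section 9 for the field-by-field breakdown)
    const reviews = await page.evaluate(() => {
      const reviewElements = document.querySelectorAll("li"); // assumed selector
      return Array.from(reviewElements)
        .map((review) => {
          const link = review.querySelector('a[rel="noreferrer noopener"]');
          return {
            author: review.querySelector(".body-1-bold")?.textContent.trim(),
            reviewUrl: link ? "https://www.backmarket.com/" + link.getAttribute("href") : null,
            ratings: review.querySelector('[data-qa="user-comment"]')?.textContent.trim(),
            reviewText: review
              .querySelector(".body-1-light.text-grey-500.whitespace-pre-line")
              ?.textContent.trim(),
          };
        })
        .filter((r) => r.author && r.reviewText);
    });

    const results = { title, originalPrice, discountedPrice, thumbnail, ratings, reviewsCount, reviews };

    // Print the results and save them to a JSON file
    console.log(JSON.stringify(results, null, 2));
    fs.writeFileSync("results.json", JSON.stringify(results, null, 2));
  } finally {
    // Always close the browser, even if scraping fails
    await browser.close();
  }
}

scrapeResults("https://www.backmarket.com/..."); // replace with your target product URL
```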
Open your terminal or command prompt and navigate to the directory where index.js is saved. Execute the script by running the following command:
node index.js
After running the script, check the output. If successful, you should see the extracted data printed to the console or saved to a file, depending on how you’ve implemented the code.
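With the sketch above, the printed object has roughly the following shape; the values here are placeholders, not real scraped data:

```json
{
  "title": "…",
  "originalPrice": "…",
  "discountedPrice": "…",
  "thumbnail": "https://…",
  "ratings": "…",
  "reviewsCount": "…",
  "reviews": [
    {
      "author": "…",
      "reviewUrl": "https://www.backmarket.com/…",
      "ratings": "…",
      "reviewText": "…"
    }
  ]
}
```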
11. Scraper using Crawlbase Crawling API and Cheerio
Scraping using Crawlbase Crawling API and Cheerio is relevant to this tutorial as it offers an alternative approach to web scraping. This method stands out as one of the best alternatives because it utilizes the Crawlbase Crawling API, which can help prevent potential blocks and CAPTCHAs imposed by the target website.
Additionally, integrating the Crawlbase Crawling API with parsers like Cheerio provides a more stable solution for extracting data from web pages without encountering issues commonly faced in traditional scraping methods like getting blocked or being heavily rate limited. This method ensures that we can get the information we want from a website reliably and effectively.
To start, set up the project and install the packages below:
mkdir scraper
cd scraper
touch index.js
npm install crawlbase cheerio
These commands will create a directory called scraper, create an index.js file, and install the Crawlbase and Cheerio libraries.
Now, we will apply a similar approach as we did with Playwright. However, this time we’ll make an HTTP/HTTPS request to the Crawling API to obtain the HTML code of the page, then use Cheerio to extract the same sets of data from that HTML:
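Below is a sketch of that script, assuming the CrawlingAPI client exported by the crawlbase npm package; the selectors mirror the Playwright section, and the ones flagged as hypothetical there remain assumptions here:

```javascript
// import Crawlbase Crawling API package and Cheerio
const { CrawlingAPI } = require("crawlbase");
const cheerio = require("cheerio");

// Authenticate with your Crawlbase JavaScript token
const api = new CrawlingAPI({ token: "YOUR_CRAWLBASE_JS_TOKEN" });

const webUrl = "https://www.backmarket.com/..."; // replace with your target product URL

api
  .get(webUrl)
  .then((response) => {
    // Load the returned HTML into Cheerio for parsing
    const $ = cheerio.load(response.body);

    const results = {
      title: $("h1").first().text().trim(),
      originalPrice: $('[data-popper-trigger="true"].body-1-light-striked').first().text().trim(),
      discountedPrice: $('[data-qa="productpage-product-price"]').first().text().trim(), // hypothetical selector
      thumbnail: $('div[data-qa="productpage-gallery"] img').first().attr("src"), // hypothetical selector
      ratings: $('[data-qa="rating-label"]').first().text().trim(), // hypothetical selector
      reviewsCount: $('[data-qa="reviews-count"]').first().text().trim(), // hypothetical selector
      reviews: $("li") // assumed review-item selector
        .map((i, el) => ({
          author: $(el).find(".body-1-bold").text().trim(),
          reviewText: $(el).find(".body-1-light.text-grey-500.whitespace-pre-line").text().trim(),
        }))
        .get()
        .filter((r) => r.author && r.reviewText),
    };

    // Print the parsed content in a readable JSON format
    console.log(JSON.stringify(results, null, 2));
  })
  .catch((error) => console.error(error));
```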
After saving the code above, navigate in your terminal or command prompt to the directory where index.js is saved, and execute the script by running node index.js.
Check the output afterward. If successful, you’ll see the extracted data printed as a JSON object with the same fields as in the Playwright section.
12. Conclusion
In conclusion, this tutorial has demonstrated two effective methods for scraping data from backmarket.com using Node.js. We explored the utilization of Playwright, showcasing how to extract specific information from a website. Also, we introduced an alternative approach using the Crawling API with Cheerio, offering a solution to bypass captchas and potential blocks while efficiently parsing HTML content.
Choosing between Playwright and the Crawling API with Cheerio will greatly depend on what you need for your project. Playwright provides a robust solution for dynamic web scraping and interaction with modern web applications, while the Crawling API with Cheerio offers a reliable method for accessing and parsing HTML content.
So, go ahead and choose what suits your project best. Don’t forget, the code we’ve shared here is all free for you to use. Feel free to apply it to scrape data from other websites too!
If you are interested in other scraping projects, we recommend checking out the tutorials below:
How to Scrape Crunchbase Company Data
How to Scrape Websites with ChatGPT
13. Frequently Asked Questions
Q. Can Playwright be used for scraping?
A. Yes. To scrape websites using Playwright, you can follow these general steps (a minimal sketch follows the list):
Step 1: Install Playwright: Begin by installing Playwright via npm using the command npm install playwright.
Step 2: Write Your Script: Create a JavaScript file (e.g., scrape.js) and write the code to automate your scraping tasks using Playwright. This may include navigating to the website, interacting with elements, and extracting data.
Step 3: Run Your Script: Execute your script by running node scrape.js in your terminal or command prompt.
Step 4: Check Output: After running the script, check the output to ensure that the desired data has been scraped successfully.
Step 5: Refine Your Script: Refine your script as needed to handle any edge cases or errors that may arise during the scraping process.
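Here is a minimal scrape.js sketch that follows these steps, using a placeholder URL and selector:

```javascript
// scrape.js - minimal Playwright scraping example
const { chromium } = require("playwright");

(async () => {
  const browser = await chromium.launch(); // headless by default
  const page = await browser.newPage();

  await page.goto("https://example.com"); // replace with your target URL

  // Extract and print the first heading's text
  const heading = await page.$eval("h1", (el) => el?.textContent.trim());
  console.log(heading);

  await browser.close();
})();
```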
Q. Is Playwright easier than Selenium?
A. In terms of ease of use, Playwright is often considered more user-friendly than Selenium. It offers a simpler and more modern interface for automating browser interactions, with features like automatic installation of browser drivers and built-in support for multiple programming languages.
On the other hand, Selenium has been around for longer and has a larger community and ecosystem. It’s widely used and has extensive documentation and resources available.
In short, the choice between Playwright and Selenium depends on factors such as the specific requirements of your project, team preferences, and existing infrastructure.
Q. Can you get blocked when you scrape a website using Playwright?
A. Yes, it’s possible to get blocked when scraping a website using Playwright. Websites may implement measures to detect and block automated scraping activities, such as detecting unusual traffic patterns or high-frequency requests from the same IP address.
To avoid blocks and CAPTCHAs, we recommend using Crawlbase. By integrating your code with the Crawlbase API, you can benefit from its features, including millions of rotating IPs and an AI-based algorithm that mimics human interaction. This helps mitigate the risk of detection and ensures smoother scraping operations.