Automating LinkedIn Job Searches with Puppeteer and RxJS


Tutorial on how to scrape job offers from LinkedIn using Puppeteer and RxJS
Web scraping may seem like a simple task, but there are many challenges to overcome. In this blog, we will dive into how to scrape LinkedIn to extract job listings. To do this, we will use Puppeteer and RxJS. The goal is to achieve web scraping in a declarative, modular, and scalable manner.
What is Web Scraping?
Web scraping is an automated method of extracting valuable data from websites. It allows users to retrieve specific information—such as text, images, links, and structured content—without manually copying and pasting. This technique is widely used for various purposes, including market research, data analysis, job listings aggregation, and competitive intelligence.
By leveraging web scraping tools, developers can efficiently collect, process, and utilize web data, transforming unstructured online information into structured insights.
Puppeteer: A Powerful Web Scraping Tool
Puppeteer is a JavaScript library that provides programmatic control over headless or full browsers like Chrome. It allows developers to automate tasks such as navigating web pages, interacting with elements, and extracting data, making it an excellent choice for web scraping.
One of Puppeteer's biggest advantages is its ability to handle dynamic content. Unlike traditional scraping techniques that rely solely on fetching raw HTML, Puppeteer can execute JavaScript, ensuring that all elements—including those loaded asynchronously—are properly rendered before extraction. This makes it particularly useful for scraping websites with complex structures or content hidden behind interactive elements.
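As a minimal, standalone illustration (not part of the scraper we build below), the sketch that follows assumes Puppeteer is installed and shows a page being fully rendered before its HTML is read:
import puppeteer from 'puppeteer';

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // Wait until network activity settles so that content rendered by JavaScript is present
  await page.goto('https://example.com', { waitUntil: 'networkidle0' });
  const html = await page.content(); // the fully rendered HTML, not just the initial response
  console.log(html.length);
  await browser.close();
})();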
Understanding RxJS
RxJS is a powerful JavaScript library designed for reactive programming, making it easier to handle asynchronous data streams efficiently. In this project, we leverage RxJS due to its numerous advantages:
✅ Streamlined Asynchronous Workflow – Enables a declarative approach to managing async operations.
✅ Robust Error Handling – Provides built-in mechanisms to catch and handle errors gracefully.
✅ Effortless Retry Logic – Allows automatic retries when scraping issues arise.
✅ Flexible and Scalable Code – Simplifies adaptation as project complexity grows.
✅ Extensive Operator Support – Offers a rich set of functions to process and manipulate data efficiently.
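To make these points concrete, here is a small sketch, independent of the scraper, showing how an async operation can be wrapped in an Observable with declarative retries and error handling; fetchPage is a hypothetical Promise-returning function standing in for any async task:
import { defer, of } from 'rxjs';
import { retry, catchError } from 'rxjs/operators';

// fetchPage is a hypothetical Promise-returning operation
declare function fetchPage(url: string): Promise<string>;

const page$ = defer(() => fetchPage('https://example.com')).pipe(
  retry(3),            // automatically retry up to 3 times on failure
  catchError(err => {
    console.error('Giving up after retries', err);
    return of('');     // recover gracefully with a fallback value
  })
);

page$.subscribe(html => console.log(html.length));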
1. Puppeteer initialization
The code snippet below initializes a Puppeteer browser instance in a non-headless mode and subsequently creates a new web page. This represents the most fundamental and straightforward initialization process for Puppeteer:
src/index.ts
import puppeteer from 'puppeteer';

(async () => {
  console.log('Launching Chrome...');
  const browser = await puppeteer.launch({
    headless: false,
    // devtools: true,
    // slowMo: 250, // slow down puppeteer script so that it's easier to follow visually
    args: [
      '--disable-gpu',
      '--disable-dev-shm-usage',
      '--disable-setuid-sandbox',
      '--no-first-run',
      '--no-sandbox',
      '--no-zygote',
      '--single-process',
    ],
  });
  const page = await browser.newPage();
  /**
   * 1. Go to the LinkedIn jobs URL
   * 2. Get the jobs
   * 3. Repeat step 1 with other search parameters
   */
})();
2. Accessing LinkedIn Job Listings and Extracting Data
This is the core section of our blog, where we delve into the process of navigating LinkedIn’s job listings, parsing the HTML content, and extracting job details in a structured JSON format. Our approach ensures that we retrieve relevant job information efficiently while handling potential roadblocks during the scraping process.
2.1. Construct the URL for navigating to LinkedIn job offers page
To access LinkedIn's job listings, we need to construct a URL with the urlQueryPage function:
src/linkedin.ts
export const urlQueryPage = (searchParams: ScraperSearchParams) =>
  `https://linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${searchParams.searchText}` +
  `&start=${searchParams.pageNumber * 25}${searchParams.locationText ? '&location=' + searchParams.locationText : ''}`;
In this case, I have already conducted the necessary research to identify a suitable URL for scraping. Our goal is to find a URL that can be dynamically parameterized based on our desired search criteria.
For this example, the key search parameters will include:
searchText – The job title or keyword.
pageNumber – The pagination index to navigate through job listings.
locationText (optional) – A specific location filter to refine search results.
By structuring the URL accordingly, we can efficiently retrieve job listings that match our specified criteria.
Example URLs:
https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Angular&start=0
https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=python&start=0
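As a quick sanity check, and assuming the search text and location are already URL-safe, calling the helper would produce a URL like this:
// Hypothetical usage of urlQueryPage
const url = urlQueryPage({ searchText: 'Angular', locationText: 'Barcelona', pageNumber: 2 });
// => https://linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Angular&start=50&location=Barcelona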
2.2. Navigate to the URL and extract the job offers
With our target URL identified, we can proceed with the two primary actions required:
Navigating to the Job Listings URL: This step involves directing our web scraping tool to the URL where the job listings are hosted.
Extracting the job offers and converting them to JSON: Once we're on the job listings page, we'll employ web scraping techniques to extract the job data and return it in JSON format.
src/linkedin.ts
export interface ScraperSearchParams {
searchText: string;
locationText: string;
pageNumber: number;
}
/** main function */
export function goToLinkedinJobsPageAndExtractJobs(page: Page, searchParams: ScraperSearchParams): Observable<JobInterface[]> {
return defer(() => fromPromise(navigateToJobsPage(page, searchParams)))
.pipe(switchMap(() => getJobsFromLinkedinPage(page)));
}
/* Utility functions */
export const urlQueryPage = (searchParams: ScraperSearchParams) =>
  `https://linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${searchParams.searchText}` +
  `&start=${searchParams.pageNumber * 25}${searchParams.locationText ? '&location=' + searchParams.locationText : ''}`;
function navigateToJobsPage(page: Page, searchParams: ScraperSearchParams): Promise<Response | null> {
return page.goto(urlQueryPage(searchParams), { waitUntil: 'networkidle0' });
}
export const stacks = ['angularjs', 'kubernetes', 'javascript', 'jenkins', 'html', /* ... */];
export function getJobsFromLinkedinPage(page: Page): Observable<JobInterface[]> {
return defer(() => fromPromise(page.evaluate((pageEvalData) => {
const collection: HTMLCollection = document.body.children;
const results: JobInterface[] = [];
for (let i = 0; i < collection.length; i++) {
try {
const item = collection.item(i)!;
const title = item.getElementsByClassName('base-search-card__title')[0].textContent!.trim();
const imgSrc = item.getElementsByTagName('img')[0].getAttribute('data-delayed-url') || '';
const remoteOk: boolean = !!title.match(/remote|No office location/gi);
const url = (
(item.getElementsByClassName('base-card__full-link')[0] as HTMLLinkElement)
|| (item.getElementsByClassName('base-search-card--link')[0] as HTMLLinkElement)
).href;
const companyNameAndLinkContainer = item.getElementsByClassName('base-search-card__subtitle')[0];
const companyUrl: string | undefined = companyNameAndLinkContainer?.getElementsByTagName('a')[0]?.href;
const companyName = companyNameAndLinkContainer.textContent!.trim();
const companyLocation = item.getElementsByClassName('job-search-card__location')[0].textContent!.trim();
const toDate = (dateString: string) => {
const [year, month, day] = dateString.split('-')
return new Date(parseFloat(year), parseFloat(month) - 1, parseFloat(day) )
}
const dateTime = (
item.getElementsByClassName('job-search-card__listdate')[0]
|| item.getElementsByClassName('job-search-card__listdate--new')[0] // less than a day. TODO: Improve precision on this case.
).getAttribute('datetime');
const postedDate = toDate(dateTime as string).toISOString();
/**
* Calculate minimum and maximum salary
*
* Salary HTML example to parse:
* <span class="job-result-card__salary-info">$65,000.00 - $90,000.00</span>
*/
let currency: SalaryCurrency = ''
let salaryMin = -1;
let salaryMax = -1;
const salaryCurrencyMap: any = {
['€']: 'EUR',
['$']: 'USD',
['£']: 'GBP',
}
const salaryInfoElem = item.getElementsByClassName('job-search-card__salary-info')[0]
if (salaryInfoElem) {
const salaryInfo: string = salaryInfoElem.textContent!.trim();
if (salaryInfo.startsWith('€') || salaryInfo.startsWith('$') || salaryInfo.startsWith('£')) {
const coinSymbol = salaryInfo.charAt(0);
currency = salaryCurrencyMap[coinSymbol] || coinSymbol;
}
const matches = salaryInfo.match(/([0-9]|,|\.)+/g)
if (matches && matches[0]) {
// values are in US format, so we need to remove ALL the commas
salaryMin = parseFloat(matches[0].replace(/,/g, ''));
}
if (matches && matches[1]) {
// values are in US format, so we need to remove ALL the commas
salaryMax = parseFloat(matches[1].replace(/,/g, ''));
}
}
// Calculate tags
let stackRequired: string[] = [];
title.split(' ').concat(url.split('-')).forEach(word => {
if (!!word) {
const wordLowerCase = word.toLowerCase();
if (pageEvalData.stacks.includes(wordLowerCase)) {
stackRequired.push(wordLowerCase)
}
}
})
// Define the uniq function here. Remember that page.evaluate executes inside the browser, so we cannot easily import functions from other contexts
const uniq = (_array) => _array.filter((item, pos) => _array.indexOf(item) == pos);
stackRequired = uniq(stackRequired)
const result: JobInterface = {
id: item!.children[0].getAttribute('data-entity-urn') as string,
city: companyLocation,
url: url,
companyUrl: companyUrl || '',
img: imgSrc,
date: new Date().toISOString(),
postedDate: postedDate,
title: title,
company: companyName,
location: companyLocation,
salaryCurrency: currency,
salaryMax: salaryMax,
salaryMin: salaryMin,
countryCode: '',
countryText: '',
descriptionHtml: '',
remoteOk: remoteOk,
stackRequired: stackRequired
};
console.log('result', result);
results.push(result);
} catch (e) {
console.error(`Something went wrong retrieving linkedin page item: ${i} on url: ${window.location}`, e.stack);
}
}
return results;
}, {stacks})) as Observable<JobInterface[]>)
}
The code above extracts the information for every job on the page. It may not be the most elegant code, but it gets the job done: parsing this kind of HTML inevitably requires plenty of fallbacks and checks.
In a standard programming context, breaking code into smaller, isolated functions improves readability and maintainability. However, when working with page.evaluate in Puppeteer, we face certain limitations. Since this code executes within the Puppeteer (Chrome) instance rather than our Node.js environment, all logic must be self-contained within the page.evaluate call. The only exception is simple variables (such as stacks in our case), which can be passed as arguments to page.evaluate. However, these variables must not contain functions or complex objects that cannot be serialized, as Puppeteer does not support passing non-serializable data between Node.js and the browser context.
In this case, the most challenging part of scraping is extracting salary information, as it requires converting a text format like "$65,000.00 - $90,000.00" into separate salaryMin and salaryMax values.
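Isolated from the rest of the scraper, that conversion can be sketched roughly like this (a simplified version of the logic embedded in page.evaluate above):
// Simplified sketch of the salary parsing logic shown above
function parseSalaryRange(salaryInfo: string): { salaryMin: number; salaryMax: number } {
  // "$65,000.00 - $90,000.00" -> ["65,000.00", "90,000.00"]
  const matches = salaryInfo.match(/([0-9]|,|\.)+/g) || [];
  // values use US formatting, so commas are thousands separators and must be removed
  const salaryMin = matches[0] ? parseFloat(matches[0].replace(/,/g, '')) : -1;
  const salaryMax = matches[1] ? parseFloat(matches[1].replace(/,/g, '')) : -1;
  return { salaryMin, salaryMax };
}

console.log(parseSalaryRange('$65,000.00 - $90,000.00')); // { salaryMin: 65000, salaryMax: 90000 }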
To handle potential issues gracefully, we have encapsulated the entire code within a try/catch block. While we currently log errors to the console, it is highly recommended to implement a mechanism for storing error logs on disk. This is especially important because websites frequently update their structure, requiring regular adjustments to the HTML parsing logic.
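A minimal way to persist those errors, assuming a Node.js environment and a hypothetical logs/ folder, could look like this (called from the Node side, not from inside page.evaluate):
import { appendFileSync, mkdirSync } from 'fs';

// Hypothetical helper to keep a scraping error log on disk
export function logScrapingError(context: string, error: unknown): void {
  mkdirSync('logs', { recursive: true });
  const line = `${new Date().toISOString()} [${context}] ${String(error)}\n`;
  appendFileSync('logs/scraper-errors.log', line);
}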
Finally, we consistently use the defer and fromPromise operators to convert Promises into Observables, ensuring a reactive and efficient data flow throughout the scraping process.
defer(() => fromPromise(myPromise()));
This approach is a recommended best practice that works reliably in all scenarios. Promises are eager, whereas Observables are lazy and only start when someone subscribes to them. The defer operator allows us to make a Promise lazy. Go to this link for more information about it.
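The difference is easy to see in a small standalone example: a bare Promise starts running as soon as it is created, while a deferred Observable only runs when subscribed.
import { defer } from 'rxjs';

const eagerPromise = new Promise<number>(resolve => {
  console.log('Promise body runs immediately'); // logged right away, even with no .then()
  resolve(42);
});

const lazy$ = defer(() => {
  console.log('Factory runs only on subscribe');
  return Promise.resolve(42); // a fresh Promise is created for every subscription
});

lazy$.subscribe(value => console.log('Received', value)); // only now does the factory run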
3. Add an asynchronous loop to iterate through all pages
In the previous step, we learned how to obtain all job offers data from a LinkedIn page. Now, we want to use that code as many times as possible to gather as much data as we can. To achieve this, we first need to iterate through all available pages:
src/linkedin.ts
function getJobsFromAllPages(page: Page, initSearchParams: ScraperSearchParams): Observable<ScraperResult> {
const getJobs$ = (searchParams: ScraperSearchParams) => goToLinkedinJobsPageAndExtractJobs(page, searchParams).pipe(
map((jobs): ScraperResult => ({jobs, searchParams} as ScraperResult)),
catchError(error => {
console.error(error);
return of({jobs: [], searchParams: searchParams})
})
);
return getJobs$(initSearchParams).pipe(
expand(({jobs, searchParams}) => {
console.log(`Linkedin - Query: ${searchParams.searchText}, Location: ${searchParams.locationText}, Page: ${searchParams.pageNumber}, nJobs: ${jobs.length}, url: ${urlQueryPage(searchParams)}`);
if (jobs.length === 0) {
return EMPTY;
} else {
return getJobs$({...searchParams, pageNumber: searchParams.pageNumber + 1});
}
})
);
}
The code above increments the page number until we reach a page with no jobs. To perform this loop in RxJS, we use the expand operator, which recursively projects each source value to an Observable that is merged into the output Observable. Its functionality is well explained here. In RxJS, we cannot write a for loop as we would with async/await; we have to use another technique instead, such as the expand operator or a recursive function. While this might initially look like a limitation, in an asynchronous context it proves more advantageous in many situations.
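If expand is new to you, this toy sketch, unrelated to LinkedIn, shows the recursion pattern in isolation:
import { of, EMPTY } from 'rxjs';
import { expand } from 'rxjs/operators';

// Toy example of expand: every emitted value is fed back into the projection function
// until a stop condition returns EMPTY, much like requesting the next page until one is empty.
of(1).pipe(
  expand(value => (value >= 16 ? EMPTY : of(value * 2)))
).subscribe(value => console.log(value)); // 1, 2, 4, 8, 16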
So, what would the equivalent code using Promises look like? Here's an example:
export async function getJobsFromAllPages(
page: Page,
searchParams: ScraperSearchParams
): Promise<ScraperResult> {
const results: ScraperResult = { jobs: [], searchParams };
try {
while (true) {
const jobs = await getJobsFromLinkedinPage(page, searchParams);
console.log(
`Linkedin - Query: ${searchParams.searchText}, Location: ${
searchParams.locationText
}, Page: ${searchParams.pageNumber}, nJobs: ${
jobs.length
}, url: ${urlQueryPage(searchParams)}`
);
results.jobs.push(...jobs);
if (jobs.length === 0) {
break;
}
searchParams.pageNumber++;
}
} catch (error) {
console.error('Error:', error);
results.jobs = []; // Clear the jobs in case of an error.
}
return results;
}
This code is nearly equivalent to the Observable-based one, with one critical difference: it only resolves once all pages have been processed. In contrast, the Observable-based implementation emits after each page. Creating a stream is crucial in this case because we want to handle the jobs as soon as they are available.
Certainly, we could introduce our logic following the line:
const jobs = await getJobsFromLinkedinPage(page, searchParams);
/* Handle the jobs here */
...but this would unnecessarily couple our scraping code with the part that handles the jobs data. Handling the jobs data may involve some transformations, API calls, and finally, saving the data into a database.
In this example, we clearly see one of the many benefits Observables offer over Promises.
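For example (with saveJobsToDatabase being a hypothetical persistence function and page an already-created Puppeteer page), the consumer can react to every page of results as it is emitted:
import { concatMap } from 'rxjs/operators';

declare function saveJobsToDatabase(result: ScraperResult): Promise<void>; // hypothetical

// Each ScraperResult is processed as soon as its page is scraped, not after all pages finish
getJobsFromAllPages(page, { searchText: 'Angular', locationText: 'Barcelona', pageNumber: 0 })
  .pipe(concatMap(result => saveJobsToDatabase(result)))
  .subscribe({
    complete: () => console.log('All pages processed'),
    error: err => console.error(err),
  });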
4. Implementing an Asynchronous Loop for Multiple Search Parameters
Now that we've established how to iterate through multiple pages for a given search query, it's time to take the next step: expanding our search across multiple search parameters.
To achieve this, we'll introduce an additional asynchronous loop that cycles through various search criteria, ensuring comprehensive data extraction.
The first step is defining a structured data format to store these search parameters. We'll call this list searchParamsList, which will hold different combinations of keywords, locations, or other relevant filters:
src/data.ts
const searchParamsList: { searchText: string; locationText: string }[] = [
{ searchText: 'Angular', locationText: 'Barcelona' },
{ searchText: 'Angular', locationText: 'Madrid' },
// ...
{ searchText: 'React', locationText: 'Barcelona' },
{ searchText: 'React', locationText: 'Madrid' },
// ...
];
To iterate through the searchParamsList array, we essentially need to convert it from an Array into an Observable using the fromArray operator. We then use the concatMap operator to sequentially process each searchText and locationText pair. The power of RxJS here is that, should we ever want to switch from sequential to parallel processing, we only need to swap concatMap for mergeMap. In this case that is not recommended, because we would exceed LinkedIn's rate limits, but it is worth keeping in mind for other scenarios.
src/linkedin.ts
/**
* Creates a new page and scrapes LinkedIn job offers data for each pair of searchText and locationText, recursively retrieving data until there are no more pages.
* @param browser A Puppeteer instance
* @returns An Observable that emits scraped job offers data as ScraperResult
*/
export function getJobsFromLinkedin(browser: Browser): Observable<ScraperResult> {
// Create a new page
const createPage = defer(() => fromPromise(browser.newPage()));
// Iterate through search parameters and scrape jobs
const scrapeJobs = (page: Page): Observable<ScraperResult> =>
fromArray(searchParamsList).pipe(
concatMap(({ searchText, locationText }) =>
getJobsFromAllPages(page, { searchText, locationText, pageNumber: 0 })
)
)
// Compose sequentially previous steps
return createPage.pipe(switchMap(page => scrapeJobs(page)));
}
This code will loop through different search parameters, retrieving job listings for each combination of technology and location efficiently.
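Back in the entry point (src/index.ts in this project), a minimal sketch of how everything could be wired together, with error handling and shutdown kept deliberately simple, might look like this:
(async () => {
  const browser = await puppeteer.launch({ headless: false });

  getJobsFromLinkedin(browser).subscribe({
    next: ({ searchParams, jobs }) =>
      console.log(`Got ${jobs.length} jobs for "${searchParams.searchText}" in ${searchParams.locationText}`),
    error: async err => {
      console.error(err);
      await browser.close();
    },
    complete: async () => {
      console.log('Scraping finished');
      await browser.close();
    },
  });
})();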
🎉 Congratulations! You now have the skills to scrape LinkedIn job postings! 🎉
However, like many other platforms, LinkedIn employs anti-scraping measures to prevent automated data extraction. Let’s explore how to handle these challenges 👇
Common Errors When Scraping LinkedIn
Running the code as it is will quickly lead to various errors, making it challenging to scrape a substantial amount of data. The two most common issues are:
1. 429 Status Code (Too Many Requests)
This error occurs when we send too many requests in a short period. To avoid being blocked, we need to slow down the request rate and introduce random delays until the error subsides.
2. LinkedIn Authwall
Occasionally, instead of the job listings page, LinkedIn may redirect us to an authentication wall. When this happens, the best approach is to pause requests for a while before trying again.
Handling 429 Errors & LinkedIn Authwall
To tackle these issues, we rework goToLinkedinJobsPageAndExtractJobs so that navigation, error detection, and retry handling live there, while getJobsFromLinkedinPage stays focused on the HTML scraping logic. The updated code structure looks like this:
src/linkedin.ts
const AUTHWALL_PATH = 'linkedin.com/authwall';
const STATUS_TOO_MANY_REQUESTS = 429;
const JOB_SEARCH_SELECTOR = '.job-search-card';
function goToLinkedinJobsPageAndExtractJobs(page: Page, searchParams: ScraperSearchParams): Observable<JobInterface[]> {
return defer(() => fromPromise(page.setExtraHTTPHeaders({'accept-language': 'en-US,en;q=0.9'})))
.pipe(
switchMap(() => navigateToLinkedinJobsPage(page, searchParams)),
tap(response => checkResponseStatus(response)),
switchMap(() => throwErrorIfAuthwall(page)),
switchMap(() => waitForJobSearchCard(page)),
switchMap(() => getJobsFromLinkedinPage(page)),
retryWhen(retryStrategyByCondition({
maxRetryAttempts: 4,
retryConditionFn: error => error.retry === true
})),
map(jobs => Array.isArray(jobs) ? jobs : []),
take(1)
);
}
/**
* Navigate to the LinkedIn search page, using the provided search parameters.
*/
function navigateToLinkedinJobsPage(page: Page, searchParams: ScraperSearchParams) {
return defer(() => fromPromise(page.goto(urlQueryPage(searchParams), {waitUntil: 'networkidle0'})));
}
/**
* Check the HTTP response status and throw an error if too many requests have been made.
*/
function checkResponseStatus(response: any) {
const status = response?.status();
if (status === STATUS_TOO_MANY_REQUESTS) {
throw {message: 'Status 429 (Too many requests)', retry: true, status: STATUS_TOO_MANY_REQUESTS};
}
}
/**
* Check if the current page is an authwall and throw an error if it is.
*/
function throwErrorIfAuthwall(page: Page) {
return getPageLocationOperator(page).pipe(tap(locationHref => {
if (locationHref.includes(AUTHWALL_PATH)) {
console.error('Authwall error');
throw {message: `Linkedin authwall! locationHref: ${locationHref}`, retry: true};
}
}));
}
/**
* Wait for the job search card to be visible on the page, and handle timeouts or authwalls.
*/
function waitForJobSearchCard(page: Page) {
return defer(() => fromPromise(page.waitForSelector(JOB_SEARCH_SELECTOR, {visible: true, timeout: 5000}))).pipe(
catchError(error => throwErrorIfAuthwall(page).pipe(tap(() => {throw error})))
);
}
In this code, we address the previously mentioned errors, namely the 429 response and the authwall redirect. Overcoming these errors is essential for successfully scraping LinkedIn.
To handle these errors, the code employs a custom retry strategy implemented by the retryStrategyByCondition function:
src/scraper.utils.ts
export const retryStrategyByCondition = ({maxRetryAttempts = 3, scalingDuration = 1000, retryConditionFn = (error) => true}: {
maxRetryAttempts?: number,
scalingDuration?: number,
retryConditionFn?: (error) => boolean
} = {}) => (attempts: Observable<any>) => {
return attempts.pipe(
mergeMap((error, i) => {
const retryAttempt = i + 1;
if (
retryAttempt > maxRetryAttempts ||
!retryConditionFn(error)
) {
return throwError(error);
}
console.log(
`Attempt ${retryAttempt}: retrying in ${retryAttempt *
scalingDuration}ms`
);
// retry after 1s, 2s, etc...
return timer(retryAttempt * scalingDuration);
}),
finalize(() => console.log('retryStrategyOnlySpecificErrors - finalized'))
);
};
This strategy essentially increases the wait time between each retry after a failure. This way, we ensure that we wait long enough for LinkedIn to allow us to make requests again.
⚠️ Important Note: LinkedIn has strict anti-scraping measures, and excessive requests from a single IP address can lead to IP blacklisting. Simply increasing wait times between requests may not be a foolproof solution. To minimize the risk of detection and reduce errors, it's highly advisable to rotate IP addresses periodically. This can be achieved by using proxy services or VPNs, ensuring a more sustainable and uninterrupted scraping process.
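If you do decide to route traffic through a proxy, Chromium accepts a --proxy-server flag at launch time; the address below is a placeholder, and the authenticate call is only needed for proxies that require credentials:
import puppeteer from 'puppeteer';

(async () => {
  // Sketch: launching Puppeteer behind a (hypothetical) rotating proxy
  const browser = await puppeteer.launch({
    headless: false,
    args: ['--proxy-server=http://my-rotating-proxy.example.com:8080'], // placeholder address
  });

  const page = await browser.newPage();
  // Only needed if the proxy requires authentication
  await page.authenticate({ username: 'proxyUser', password: 'proxyPass' });
})();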
Final Words
Web scraping can sometimes violate a website's terms of service, so it's crucial to review and respect the robots.txt file and Terms of Service before scraping any site. In this case, the provided code is intended strictly for educational and hobby purposes. LinkedIn specifically prohibits any data extraction from its website; you can read more here.
I encourage using web scraping as a learning tool, but always be mindful of ethical practices. Avoid excessive requests, respect the website's resources, and use the extracted data responsibly.
You can find the complete, updated code in this repository. Feel free to give it a star if it helped! 🙏⭐