Automating LinkedIn Job Searches with Puppeteer and RxJS


Tutorial on how to scrape job offers from LinkedIn using Puppeteer and RxJS
Web scraping may seem like a simple task, but there are many challenges to overcome. In this blog, we will dive into how to scrape LinkedIn to extract job listings. To do this, we will use Puppeteer and RxJS. The goal is to achieve web scraping in a declarative, modular, and scalable manner.
What is Web Scraping?
Web scraping is an automated method of extracting valuable data from websites. It allows users to retrieve specific information—such as text, images, links, and structured content—without manually copying and pasting. This technique is widely used for various purposes, including market research, data analysis, job listings aggregation, and competitive intelligence.
By leveraging web scraping tools, developers can efficiently collect, process, and utilize web data, transforming unstructured online information into structured insights.
Puppeteer: A Powerful Web Scraping Tool
Puppeteer is a JavaScript library that provides programmatic control over headless or full browsers like Chrome. It allows developers to automate tasks such as navigating web pages, interacting with elements, and extracting data, making it an excellent choice for web scraping.
One of Puppeteer's biggest advantages is its ability to handle dynamic content. Unlike traditional scraping techniques that rely solely on fetching raw HTML, Puppeteer can execute JavaScript, ensuring that all elements—including those loaded asynchronously—are properly rendered before extraction. This makes it particularly useful for scraping websites with complex structures or content hidden behind interactive elements.
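As a minimal, standalone illustration (not part of the scraper we build below), the sketch that follows assumes Puppeteer is installed and shows a page being fully rendered before its HTML is read:
import puppeteer from 'puppeteer';

(async () => {
  const browser = await puppeteer.launch({ headless: true });
  const page = await browser.newPage();
  // Wait until network activity settles so that content rendered by JavaScript is present
  await page.goto('https://example.com', { waitUntil: 'networkidle0' });
  const html = await page.content(); // the fully rendered HTML, not just the initial response
  console.log(html.length);
  await browser.close();
})();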
Understanding RxJS
RxJS is a powerful JavaScript library designed for reactive programming, making it easier to handle asynchronous data streams efficiently. In this project, we leverage RxJS due to its numerous advantages:
✅ Streamlined Asynchronous Workflow – Enables a declarative approach to managing async operations.
✅ Robust Error Handling – Provides built-in mechanisms to catch and handle errors gracefully.
✅ Effortless Retry Logic – Allows automatic retries when scraping issues arise.
✅ Flexible and Scalable Code – Simplifies adaptation as project complexity grows.
✅ Extensive Operator Support – Offers a rich set of functions to process and manipulate data efficiently.
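To make these points concrete, here is a small sketch, independent of the scraper, showing how an async operation can be wrapped in an Observable with declarative retries and error handling; fetchPage is a hypothetical Promise-returning function standing in for any async task:
import { defer, of } from 'rxjs';
import { retry, catchError } from 'rxjs/operators';

// fetchPage is a hypothetical Promise-returning operation
declare function fetchPage(url: string): Promise<string>;

const page$ = defer(() => fetchPage('https://example.com')).pipe(
  retry(3),            // automatically retry up to 3 times on failure
  catchError(err => {
    console.error('Giving up after retries', err);
    return of('');     // recover gracefully with a fallback value
  })
);

page$.subscribe(html => console.log(html.length));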
1. Puppeteer initialization
The code snippet below initializes a Puppeteer browser instance in a non-headless mode and subsequently creates a new web page. This represents the most fundamental and straightforward initialization process for Puppeteer:
src/index.ts
import puppeteer from 'puppeteer';

(async () => {
  console.log('Launching Chrome...');
  const browser = await puppeteer.launch({
    headless: false,
    // devtools: true,
    // slowMo: 250, // slow down puppeteer script so that it's easier to follow visually
    args: [
      '--disable-gpu',
      '--disable-dev-shm-usage',
      '--disable-setuid-sandbox',
      '--no-first-run',
      '--no-sandbox',
      '--no-zygote',
      '--single-process',
    ],
  });
  const page = await browser.newPage();
  /**
   * 1. Go to the LinkedIn jobs URL
   * 2. Get the jobs
   * 3. Repeat step 1 with other search parameters
   */
})();
2. Accessing LinkedIn Job Listings and Extracting Data
This is the core section of our blog, where we delve into the process of navigating LinkedIn’s job listings, parsing the HTML content, and extracting job details in a structured JSON format. Our approach ensures that we retrieve relevant job information efficiently while handling potential roadblocks during the scraping process.
2.1. Construct the URL for navigating to LinkedIn job offers page
To access LinkedIn's job listings, we need to construct a URL with the urlQueryPage function:
src/linkedin.ts
export const urlQueryPage = (searchParams: ScraperSearchParams) =>
  `https://linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${searchParams.searchText}` +
  `&start=${searchParams.pageNumber * 25}${searchParams.locationText ? '&location=' + searchParams.locationText : ''}`;
In this case, I have already conducted the necessary research to identify a suitable URL for scraping. Our goal is to find a URL that can be dynamically parameterized based on our desired search criteria.
For this example, the key search parameters will include:
searchText – The job title or keyword.
pageNumber – The pagination index to navigate through job listings.
locationText (optional) – A specific location filter to refine search results.
By structuring the URL accordingly, we can efficiently retrieve job listings that match our specified criteria.
Example URLs:
https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Angular&start=0
https://www.linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=python&start=0
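As a quick sanity check, and assuming the search text and location are already URL-safe, calling the helper would produce a URL like this:
// Hypothetical usage of urlQueryPage
const url = urlQueryPage({ searchText: 'Angular', locationText: 'Barcelona', pageNumber: 2 });
// => https://linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=Angular&start=50&location=Barcelona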
2.2. Navigate to the URL and extract the job offers
With our target URL identified, we can proceed with the two primary actions required:
Navigating to the Job Listings URL: This step involves directing our web scraping tool to the URL where the job listings are hosted.
Extracting the job offers and converting them to JSON: Once we're on the job listings page, we'll employ web scraping techniques to extract the job data and return it in JSON format.
src/linkedin.ts
export interface ScraperSearchParams {
searchText: string;
locationText: string;
pageNumber: number;
}
/** main function */
export function goToLinkedinJobsPageAndExtractJobs(page: Page, searchParams: ScraperSearchParams): Observable<JobInterface[]> {
return defer(() => fromPromise(navigateToJobsPage(page, searchParams)))
.pipe(switchMap(() => getJobsFromLinkedinPage(page)));
}
/* Utility functions */
export const urlQueryPage = (searchParams: ScraperSearchParams) =>
  `https://linkedin.com/jobs-guest/jobs/api/seeMoreJobPostings/search?keywords=${searchParams.searchText}` +
  `&start=${searchParams.pageNumber * 25}${searchParams.locationText ? '&location=' + searchParams.locationText : ''}`;
function navigateToJobsPage(page: Page, searchParams: ScraperSearchParams): Promise<Response | null> {
return page.goto(urlQueryPage(searchParams), { waitUntil: 'networkidle0' });
}
export const stacks = ['angularjs', 'kubernetes', 'javascript', 'jenkins', 'html', /* ... */];
export function getJobsFromLinkedinPage(page: Page): Observable<JobInterface[]> {
return defer(() => fromPromise(page.evaluate((pageEvalData) => {
const collection: HTMLCollection = document.body.children;
const results: JobInterface[] = [];
for (let i = 0; i < collection.length; i++) {
try {
const item = collection.item(i)!;
const title = item.getElementsByClassName('base-search-card__title')[0].textContent!.trim();
const imgSrc = item.getElementsByTagName('img')[0].getAttribute('data-delayed-url') || '';
const remoteOk: boolean = !!title.match(/remote|No office location/gi);
const url = (
(item.getElementsByClassName('base-card__full-link')[0] as HTMLLinkElement)
|| (item.getElementsByClassName('base-search-card--link')[0] as HTMLLinkElement)
).href;
const companyNameAndLinkContainer = item.getElementsByClassName('base-search-card__subtitle')[0];
const companyUrl: string | undefined = companyNameAndLinkContainer?.getElementsByTagName('a')[0]?.href;
const companyName = companyNameAndLinkContainer.textContent!.trim();
const companyLocation = item.getElementsByClassName('job-search-card__location')[0].textContent!.trim();
const toDate = (dateString: string) => {
const [year, month, day] = dateString.split('-')
return new Date(parseFloat(year), parseFloat(month) - 1, parseFloat(day) )
}
const dateTime = (
item.getElementsByClassName('job-search-card__listdate')[0]
|| item.getElementsByClassName('job-search-card__listdate--new')[0] // less than a day. TODO: Improve precision on this case.
).getAttribute('datetime');
const postedDate = toDate(dateTime as string).toISOString();
/**
* Calculate minimum and maximum salary
*
* Salary HTML example to parse:
* <span class="job-result-card__salary-info">$65,000.00 - $90,000.00</span>
*/
let currency: SalaryCurrency = ''
let salaryMin = -1;
let salaryMax = -1;
const salaryCurrencyMap: any = {
['€']: 'EUR',
['$']: 'USD',
['£']: 'GBP',
}
const salaryInfoElem = item.getElementsByClassName('job-search-card__salary-info')[0]
if (salaryInfoElem) {
const salaryInfo: string = salaryInfoElem.textContent!.trim();
if (salaryInfo.startsWith('€') || salaryInfo.startsWith('$') || salaryInfo.startsWith('£')) {
const coinSymbol = salaryInfo.charAt(0);
currency = salaryCurrencyMap[coinSymbol] || coinSymbol;
}
const matches = salaryInfo.match(/([0-9]|,|\.)+/g)
if (matches && matches[0]) {
// values are in US format, so we need to remove ALL the commas
salaryMin = parseFloat(matches[0].replace(/,/g, ''));
}
if (matches && matches[1]) {
// values are in US format, so we need to remove ALL the commas
salaryMax = parseFloat(matches[1].replace(/,/g, ''));
}
}
// Calculate tags
let stackRequired: string[] = [];
title.split(' ').concat(url.split('-')).forEach(word => {
if (!!word) {
const wordLowerCase = word.toLowerCase();
if (pageEvalData.stacks.includes(wordLowerCase)) {
stackRequired.push(wordLowerCase)
}
}
})
// Define the uniq function here. Remember that page.evaluate executes inside the browser, so we cannot easily import functions from other contexts
const uniq = (_array) => _array.filter((item, pos) => _array.indexOf(item) == pos);
stackRequired = uniq(stackRequired)
const result: JobInterface = {
id: item!.children[0].getAttribute('data-entity-urn') as string,
city: companyLocation,
url: url,
companyUrl: companyUrl || '',
img: imgSrc,
date: new Date().toISOString(),
postedDate: postedDate,
title: title,
company: companyName,
location: companyLocation,
salaryCurrency: currency,
salaryMax: salaryMax,
salaryMin: salaryMin,
countryCode: '',
countryText: '',
descriptionHtml: '',
remoteOk: remoteOk,
stackRequired: stackRequired
};
console.log('result', result);
results.push(result);
} catch (e) {
console.error(`Something went wrong retrieving linkedin page item: ${i} on url: ${window.location}`, e.stack);
}
}
return results;
}, {stacks})) as Observable<JobInterface[]>)
}
The code above extracts the information for every job on the page. It may not be the most elegant code, but it gets the job done: parsing this kind of HTML inevitably requires plenty of fallbacks and checks.
In a standard programming context, breaking code into smaller, isolated functions improves readability and maintainability. However, when working with page.evaluate in Puppeteer, we face certain limitations. Since this code executes within the Puppeteer (Chrome) instance rather than our Node.js environment, all logic must be self-contained within the page.evaluate call. The only exception is simple variables (such as stacks in our case), which can be passed as arguments to page.evaluate. However, these variables must not contain functions or complex objects that cannot be serialized, as Puppeteer does not support passing non-serializable data between Node.js and the browser context.
In this case, the most challenging part of scraping is extracting salary information, as it requires converting a text format like "$65,000.00 - $90,000.00" into separate salaryMin and salaryMax values.
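Isolated from the rest of the scraper, that conversion can be sketched roughly like this (a simplified version of the logic embedded in page.evaluate above):
// Simplified sketch of the salary parsing logic shown above
function parseSalaryRange(salaryInfo: string): { salaryMin: number; salaryMax: number } {
  // "$65,000.00 - $90,000.00" -> ["65,000.00", "90,000.00"]
  const matches = salaryInfo.match(/([0-9]|,|\.)+/g) || [];
  // values use US formatting, so commas are thousands separators and must be removed
  const salaryMin = matches[0] ? parseFloat(matches[0].replace(/,/g, '')) : -1;
  const salaryMax = matches[1] ? parseFloat(matches[1].replace(/,/g, '')) : -1;
  return { salaryMin, salaryMax };
}

console.log(parseSalaryRange('$65,000.00 - $90,000.00')); // { salaryMin: 65000, salaryMax: 90000 }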
To handle potential issues gracefully, we have encapsulated the entire code within a try/catch block. While we currently log errors to the console, it is highly recommended to implement a mechanism for storing error logs on disk. This is especially important because websites frequently update their structure, requiring regular adjustments to the HTML parsing logic.
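A minimal way to persist those errors, assuming a Node.js environment and a hypothetical logs/ folder, could look like this (called from the Node side, not from inside page.evaluate):
import { appendFileSync, mkdirSync } from 'fs';

// Hypothetical helper to keep a scraping error log on disk
export function logScrapingError(context: string, error: unknown): void {
  mkdirSync('logs', { recursive: true });
  const line = `${new Date().toISOString()} [${context}] ${String(error)}\n`;
  appendFileSync('logs/scraper-errors.log', line);
}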
Finally, we consistently use the defer and fromPromise operators to convert Promises into Observables, ensuring a reactive and efficient data flow throughout the scraping process.
defer(() => fromPromise(myPromise()));
This approach is a recommended best practice that works reliably in all scenarios. Promises are eager, whereas Observables are lazy and only start when someone subscribes to them. The defer operator allows us to make a Promise lazy. Go to this link for more information about it.
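The difference is easy to see in a small standalone example: a bare Promise starts running as soon as it is created, while a deferred Observable only runs when subscribed.
import { defer } from 'rxjs';

const eagerPromise = new Promise<number>(resolve => {
  console.log('Promise body runs immediately'); // logged right away, even with no .then()
  resolve(42);
});

const lazy$ = defer(() => {
  console.log('Factory runs only on subscribe');
  return Promise.resolve(42); // a fresh Promise is created for every subscription
});

lazy$.subscribe(value => console.log('Received', value)); // only now does the factory run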
3. Add an asynchronous loop to iterate through all pages
In the previous step, we learned how to obtain all job offers data from a LinkedIn page. Now, we want to use that code as many times as possible to gather as much data as we can. To achieve this, we first need to iterate through all available pages:
src/linkedin.ts
function getJobsFromAllPages(page: Page, initSearchParams: ScraperSearchParams): Observable<ScraperResult> {
const getJobs$ = (searchParams: ScraperSearchParams) => goToLinkedinJobsPageAndExtractJobs(page, searchParams).pipe(
map((jobs): ScraperResult => ({jobs, searchParams} as ScraperResult)),
catchError(error => {
console.error(error);
return of({jobs: [], searchParams: searchParams})
})
);
return getJobs$(initSearchParams).pipe(
expand(({jobs, searchParams}) => {
console.log(`Linkedin - Query: ${searchParams.searchText}, Location: ${searchParams.locationText}, Page: ${searchParams.pageNumber}, nJobs: ${jobs.length}, url: ${urlQueryPage(searchParams)}`);
if (jobs.length === 0) {
return EMPTY;
} else {
return getJobs$({...searchParams, pageNumber: searchParams.pageNumber + 1});
}
})
);
}
The code above increments the page number until we reach a page with no jobs. To perform this loop in RxJS, we use the expand operator, which recursively projects each source value to an Observable that is merged into the output Observable. Its functionality is well explained here. In RxJS, we cannot write a for loop as we would with async/await; we have to use another technique instead, such as the expand operator or a recursive function. While this might initially look like a limitation, in an asynchronous context it proves more advantageous in many situations.
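If expand is new to you, this toy sketch, unrelated to LinkedIn, shows the recursion pattern in isolation:
import { of, EMPTY } from 'rxjs';
import { expand } from 'rxjs/operators';

// Toy example of expand: every emitted value is fed back into the projection function
// until a stop condition returns EMPTY, much like requesting the next page until one is empty.
of(1).pipe(
  expand(value => (value >= 16 ? EMPTY : of(value * 2)))
).subscribe(value => console.log(value)); // 1, 2, 4, 8, 16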
So, what would the equivalent code using Promises look like? Here's an example:
export async function getJobsFromAllPages(
page: Page,
searchParams: ScraperSearchParams
): Promise<ScraperResult> {
const results: ScraperResult = { jobs: [], searchParams };
try {
while (true) {
const jobs = await getJobsFromLinkedinPage(page, searchParams);
console.log(
`Linkedin - Query: ${searchParams.searchText}, Location: ${
searchParams.locationText
}, Page: ${searchParams.pageNumber}, nJobs: ${
jobs.length
}, url: ${urlQueryPage(searchParams)}`
);
results.jobs.push(...jobs);
if (jobs.length === 0) {
break;
}
searchParams.pageNumber++;
}
} catch (error) {
console.error('Error:', error);
results.jobs = []; // Clear the jobs in case of an error.
}
return results;
}
This code is nearly equivalent to the Observable-based one, with one critical difference: it only resolves once all pages have been processed. In contrast, the Observable-based implementation emits after each page. Creating a stream is crucial in this case because we want to handle the jobs as soon as they are available.
Certainly, we could introduce our logic following the line:
const jobs = await getJobsFromLinkedinPage(page, searchParams);
/* Handle the jobs here */
...but this would unnecessarily couple our scraping code with the part that handles the jobs data. Handling the jobs data may involve some transformations, API calls, and finally, saving the data into a database.
In this example, we clearly see one of the many benefits Observables offer over Promises.
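For example (with saveJobsToDatabase being a hypothetical persistence function and page an already-created Puppeteer page), the consumer can react to every page of results as it is emitted:
import { concatMap } from 'rxjs/operators';

declare function saveJobsToDatabase(result: ScraperResult): Promise<void>; // hypothetical

// Each ScraperResult is processed as soon as its page is scraped, not after all pages finish
getJobsFromAllPages(page, { searchText: 'Angular', locationText: 'Barcelona', pageNumber: 0 })
  .pipe(concatMap(result => saveJobsToDatabase(result)))
  .subscribe({
    complete: () => console.log('All pages processed'),
    error: err => console.error(err),
  });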
4. Implementing an Asynchronous Loop for Multiple Search Parameters
Now that we've established how to iterate through multiple pages for a given search query, it's time to take the next step: expanding our search across multiple search parameters.
To achieve this, we'll introduce an additional asynchronous loop that cycles through various search criteria, ensuring comprehensive data extraction.
The first step is defining a structured data format to store these search parameters. We'll call this list searchParamsList, which will hold different combinations of keywords, locations, or other relevant filters:
src/data.ts
const searchParamsList: { searchText: string; locationText: string }[] = [
{ searchText: 'Angular', locationText: 'Barcelona' },
{ searchText: 'Angular', locationText: 'Madrid' },
// ...
{ searchText: 'React', locationText: 'Barcelona' },
{ searchText: 'React', locationText: 'Madrid' },
// ...
];
To iterate through the searchParamsList array, we essentially need to convert it from an Array into an Observable using the fromArray operator. We then use the concatMap operator to sequentially process each searchText and locationText pair. The power of RxJS here is that, should we ever want to switch from sequential to parallel processing, we only need to swap concatMap for mergeMap. In this case that is not recommended, because we would exceed LinkedIn's rate limits, but it is worth keeping in mind for other scenarios.
src/linkedin.ts
/**
* Creates a new page and scrapes LinkedIn job offers data for each pair of searchText and locationText, recursively retrieving data until there are no more pages.
* @param browser A Puppeteer instance
* @returns An Observable that emits scraped job offers data as ScraperResult
*/
export function getJobsFromLinkedin(browser: Browser): Observable<ScraperResult> {
// Create a new page
const createPage = defer(() => fromPromise(browser.newPage()));
// Iterate through search parameters and scrape jobs
const scrapeJobs = (page: Page): Observable<ScraperResult> =>
fromArray(searchParamsList).pipe(
concatMap(({ searchText, locationText }) =>
getJobsFromAllPages(page, { searchText, locationText, pageNumber: 0 })
)
)
// Compose sequentially previous steps
return createPage.pipe(switchMap(page => scrapeJobs(page)));
}
This code will loop through different search parameters, retrieving job listings for each combination of technology and location efficiently.
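Back in the entry point (src/index.ts in this project), a minimal sketch of how everything could be wired together, with error handling and shutdown kept deliberately simple, might look like this:
(async () => {
  const browser = await puppeteer.launch({ headless: false });

  getJobsFromLinkedin(browser).subscribe({
    next: ({ searchParams, jobs }) =>
      console.log(`Got ${jobs.length} jobs for "${searchParams.searchText}" in ${searchParams.locationText}`),
    error: async err => {
      console.error(err);
      await browser.close();
    },
    complete: async () => {
      console.log('Scraping finished');
      await browser.close();
    },
  });
})();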
🎉 Congratulations! You now have the skills to scrape LinkedIn job postings! 🎉
However, like many other platforms, LinkedIn employs anti-scraping measures to prevent automated data extraction. Let’s explore how to handle these challenges 👇
Common Errors When Scraping LinkedIn
Running the code as it is will quickly lead to various errors, making it challenging to scrape a substantial amount of data. The two most common issues are:
1. 429 Status Code (Too Many Requests)
This error occurs when we send too many requests in a short period. To avoid being blocked, we need to slow down the request rate and introduce random delays until the error subsides.
2. LinkedIn Authwall
Occasionally, instead of the job listings page, LinkedIn may redirect us to an authentication wall. When this happens, the best approach is to pause requests for a while before trying again.
Handling 429 Errors & LinkedIn Authwall
To tackle these issues, we rework goToLinkedinJobsPageAndExtractJobs so that navigation, error detection, and retry handling live there, while getJobsFromLinkedinPage stays focused on the HTML scraping logic. The updated code structure looks like this:
src/linkedin.ts
const AUTHWALL_PATH = 'linkedin.com/authwall';
const STATUS_TOO_MANY_REQUESTS = 429;
const JOB_SEARCH_SELECTOR = '.job-search-card';
function goToLinkedinJobsPageAndExtractJobs(page: Page, searchParams: ScraperSearchParams): Observable<JobInterface[]> {
return defer(() => fromPromise(page.setExtraHTTPHeaders({'accept-language': 'en-US,en;q=0.9'})))
.pipe(
switchMap(() => navigateToLinkedinJobsPage(page, searchParams)),
tap(response => checkResponseStatus(response)),
switchMap(() => throwErrorIfAuthwall(page)),
switchMap(() => waitForJobSearchCard(page)),
switchMap(() => getJobsFromLinkedinPage(page)),
retryWhen(retryStrategyByCondition({
maxRetryAttempts: 4,
retryConditionFn: error => error.retry === true
})),
map(jobs => Array.isArray(jobs) ? jobs : []),
take(1)
);
}
/**
* Navigate to the LinkedIn search page, using the provided search parameters.
*/
function navigateToLinkedinJobsPage(page: Page, searchParams: ScraperSearchParams) {
return defer(() => fromPromise(page.goto(urlQueryPage(searchParams), {waitUntil: 'networkidle0'})));
}
/**
* Check the HTTP response status and throw an error if too many requests have been made.
*/
function checkResponseStatus(response: any) {
const status = response?.status();
if (status === STATUS_TOO_MANY_REQUESTS) {
throw {message: 'Status 429 (Too many requests)', retry: true, status: STATUS_TOO_MANY_REQUESTS};
}
}
/**
* Check if the current page is an authwall and throw an error if it is.
*/
function throwErrorIfAuthwall(page: Page) {
return getPageLocationOperator(page).pipe(tap(locationHref => {
if (locationHref.includes(AUTHWALL_PATH)) {
console.error('Authwall error');
throw {message: `Linkedin authwall! locationHref: ${locationHref}`, retry: true};
}
}));
}
/**
* Wait for the job search card to be visible on the page, and handle timeouts or authwalls.
*/
function waitForJobSearchCard(page: Page) {
return defer(() => fromPromise(page.waitForSelector(JOB_SEARCH_SELECTOR, {visible: true, timeout: 5000}))).pipe(
catchError(error => throwErrorIfAuthwall(page).pipe(tap(() => {throw error})))
);
}
In this code, we address the previously mentioned errors, namely the 429 response and the authwall redirect. Overcoming these errors is essential for successfully scraping LinkedIn.
To handle these errors, the code employs a custom retry strategy implemented by the retryStrategyByCondition function:
src/scraper.utils.ts
export const retryStrategyByCondition = ({maxRetryAttempts = 3, scalingDuration = 1000, retryConditionFn = (error) => true}: {
maxRetryAttempts?: number,
scalingDuration?: number,
retryConditionFn?: (error) => boolean
} = {}) => (attempts: Observable<any>) => {
return attempts.pipe(
mergeMap((error, i) => {
const retryAttempt = i + 1;
if (
retryAttempt > maxRetryAttempts ||
!retryConditionFn(error)
) {
return throwError(error);
}
console.log(
`Attempt ${retryAttempt}: retrying in ${retryAttempt *
scalingDuration}ms`
);
// retry after 1s, 2s, etc...
return timer(retryAttempt * scalingDuration);
}),
finalize(() => console.log('retryStrategyOnlySpecificErrors - finalized'))
);
};
This strategy essentially increases the wait time between each retry after a failure. This way, we ensure that we wait long enough for LinkedIn to allow us to make requests again.
⚠️ Important Note: LinkedIn has strict anti-scraping measures, and excessive requests from a single IP address can lead to IP blacklisting. Simply increasing wait times between requests may not be a foolproof solution. To minimize the risk of detection and reduce errors, it's highly advisable to rotate IP addresses periodically. This can be achieved by using proxy services or VPNs, ensuring a more sustainable and uninterrupted scraping process.
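If you do decide to route traffic through a proxy, Chromium accepts a --proxy-server flag at launch time; the address below is a placeholder, and the authenticate call is only needed for proxies that require credentials:
import puppeteer from 'puppeteer';

(async () => {
  // Sketch: launching Puppeteer behind a (hypothetical) rotating proxy
  const browser = await puppeteer.launch({
    headless: false,
    args: ['--proxy-server=http://my-rotating-proxy.example.com:8080'], // placeholder address
  });

  const page = await browser.newPage();
  // Only needed if the proxy requires authentication
  await page.authenticate({ username: 'proxyUser', password: 'proxyPass' });
})();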
Final Words
Web scraping can sometimes violate a website's terms of service, so it's crucial to review and respect the robots.txt file and Terms of Service before scraping any site. In this case, the provided code is intended strictly for educational and hobby purposes. LinkedIn specifically prohibits any data extraction from its website; you can read more here.
I encourage using web scraping as a learning tool, but always be mindful of ethical practices. Avoid excessive requests, respect the website's resources, and use the extracted data responsibly.
You can find the complete, updated code in this repository. Feel free to give it a star if it helped! 🙏⭐