Web scraping with Puppeteer in Google Cloud Functions



One of the major benefits of web scraping with Google's Puppeteer library is that it can scrape pages which depend on JavaScript to load their content. Puppeteer lets you wait until the content has loaded before you start any web scraping tasks.

The library does this by running Google Chrome in what is referred to as headless mode. Headless Chrome avoids the usual overheads: there is no GUI and there are no user accounts or preferences to load, so it has only what it needs.
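
To make the waiting step concrete, here is a minimal sketch. It assumes `page` is a Puppeteer page object (the kind returned by `browser.newPage()`); the helper name `scrapeText`, the 30 second timeout, and whatever URL and selector you pass in are my own placeholders rather than part of the template further down.

```javascript
// Hypothetical helper: load a page, wait for a JavaScript-rendered element,
// then read its text. `page` is assumed to be a Puppeteer Page object.
async function scrapeText(page, url, selector) {
  // Navigate and wait for the page's load event
  await page.goto(url, { waitUntil: 'load' });
  // Wait until the element exists in the DOM (rejects if the timeout passes)
  await page.waitForSelector(selector, { timeout: 30000 });
  // Run the callback inside the page and return the element's trimmed text
  return page.$eval(selector, node => node.textContent.trim());
}
```

Passing the page in as a parameter keeps the helper easy to reuse across functions and easy to try out with a stand-in object.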

Puppeteer is a Node.js library that you can use in your Google Cloud Functions. It also has a number of other built-in features, such as the ability to:
  • Generate screenshots and PDFs of pages
  • Automate form submissions
  • Run automated tests
  • and more!
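
As a rough illustration of the screenshot and PDF feature, the sketch below saves both for a page that has already been loaded. It assumes `page` is a Puppeteer page object; the helper name `capturePage` and the `baseName` parameter are hypothetical names of mine.

```javascript
// Hypothetical helper: save a screenshot and a PDF of the current page.
// `page` is assumed to be a Puppeteer Page that has already navigated somewhere.
async function capturePage(page, baseName) {
  const files = { png: `${baseName}.png`, pdf: `${baseName}.pdf` };
  // Capture the whole scrollable page, not just the viewport
  await page.screenshot({ path: files.png, fullPage: true });
  // PDF generation relies on Chrome running headless
  await page.pdf({ path: files.pdf, format: 'A4' });
  return files;
}
```

Inside a Cloud Function only the /tmp directory is writable, so a `baseName` such as /tmp/example is the safer choice there.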

How to install Puppeteer in Google Cloud

  1. Navigate to cloud.google.com
  2. Launch the Google Cloud Shell from the top right. For more detail, you may also read Launching the Google Cloud Shell Console.
  3. Once the shell console loads, type npm install -g npm to ensure npm is up to date.
  4. Next, install two helper packages (Puppeteer itself does not require them, and the template below does not use them): type npm install --save request
  5. Type npm install --save request-promise
  6. Now install the Puppeteer library: type npm install puppeteer
  7. You should now have Puppeteer and the helper packages installed. Time to create your first project, using my free templates to get you started.
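
If you want to confirm the install worked, a small check like the one below can be saved to a file in the same directory and run with node. This is only a sketch: the function name `isInstalled` is my own, and it only tests whether Node.js can resolve the package, not whether Chrome itself launches.

```javascript
// Hypothetical check: report whether a package can be resolved by Node.js
// from the current directory.
function isInstalled(name) {
  try {
    require.resolve(name); // throws if the module cannot be found
    return true;
  } catch (err) {
    return false;
  }
}

console.log('puppeteer installed:', isInstalled('puppeteer'));
```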



Create a Node.js project with Puppeteer


  1. Create a new Google Cloud function
  2. Set the memory for the function to 1GB, as Puppeteer uses quite a bit of memory.
  3. Copy the following code into the index.js and package.json files.
  4. The code to copy is embedded below. You can also find it on my GitHub repository.
    /**
    *
    * Templates developed for nodejs, designed for use with Google Cloud Functions
    * Webscraping template provided by lukestoolkit.blogspot.com
    *
    * LukesToolkit Blog: https://lukestoolkit.blogspot.com
    * Template files available at: https://github.com/lukegackle
    *
    * Responds to any HTTP request.
    *
    * @param {!express:Request} req HTTP request context.
    * @param {!express:Response} res HTTP response context.
    */
    exports.helloWorld = async (req, res) => {
      // For more information on Puppeteer visit https://developers.google.com/web/tools/puppeteer
      const puppeteer = require('puppeteer');
      let browser;
      let htmlBody;
      try {
        browser = await puppeteer.launch({ args: ['--no-sandbox'] });
        const page = await browser.newPage();
        await page.setViewport({ width: 1280, height: 1800 });
        page.setDefaultTimeout(0); // 0 disables the timeout, so waits never give up
        page.setDefaultNavigationTimeout(0); // 0 disables the navigation timeout
        await page.goto('https://lukestoolkit.blogspot.com', { waitUntil: 'load', timeout: 0 });
        //await page.waitForFunction('document.querySelector("span#SPANID").innerText.includes("Content")') //Wait for an element to contain particular text
        //await page.waitFor(1000) //Wait for 1000 ms
        //var mentions = await page.$eval("span#SPANID", node => node.textContent); //Get text content of element

        // DEMO: fetch the page HTML to show the code works in this guide
        htmlBody = await page.content();

        // Further usage examples from the Puppeteer docs
        //const checkboxStatus = await page.$eval('#defaultCheck1', input => { return input.checked })
        //console.log('Checkbox checked status:', checkboxStatus)
        //const radios = await page.$$eval('input[name="exampleRadios"]', inputs => { return inputs.map(input => input.value) })
        //console.log('Radio values:', radios)
        //await page.goto('https://lukestoolkit.blogspot.com')
        //const selectOptions = await page.$$eval('.bd-example > select.custom-select.custom-select-lg.mb-3 > option', options => { return options.map(option => option.value ) })
        //await page.screenshot({path: 'example.png'});
        //await page.pdf({path: 'hn.pdf', format: 'A4'});
      } catch (err) {
        console.error('Scraping failed:', err);
      } finally {
        // Close the browser before responding; work after res.send() may be cut short
        if (browser) await browser.close();
      }
      if (htmlBody === undefined) {
        return res.status(500).send('Scraping failed');
      }
      res.send(htmlBody);
    };

    {
      "name": "sample-http",
      "version": "0.0.1",
      "dependencies": {
        "puppeteer": "^1.20.0"
      }
    }
  5. Click Create. Your function may take a couple of minutes to create, and you should get a green tick once it has finished.
  6. Click the URL to test that your code works. It should fetch the HTML for the lukestoolkit blog and render it on screen, which tells you whether the code is working.
  7. Now it's your turn to start developing. Review the Puppeteer documentation at https://developers.google.com/web/tools/puppeteer
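
As one possible next step, the sketch below pulls structured data out of a page instead of returning the raw HTML. It assumes `page` is a Puppeteer page object like the one created in the template; the helper name `extractLinks`, the `a[href]` selector, and the 60 second navigation timeout are choices of mine for illustration.

```javascript
// Hypothetical helper: collect the text and address of every link on a page.
// `page` is assumed to be a Puppeteer Page object.
async function extractLinks(page, url) {
  // A bounded timeout makes a stuck page fail instead of hanging forever
  await page.goto(url, { waitUntil: 'load', timeout: 60000 });
  // $$eval runs the callback in the page against all matching elements
  const links = await page.$$eval('a[href]', anchors =>
    anchors.map(a => ({ text: a.textContent.trim(), href: a.href }))
  );
  // Drop links that have no visible text
  return links.filter(link => link.text.length > 0);
}
```

In the template, a call such as `extractLinks(page, 'https://lukestoolkit.blogspot.com')` could stand in for `page.content()`, with the result sent back using `res.json()` rather than `res.send()`.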
