One of the major benefits of web scraping with Google's Puppeteer library is that it can scrape pages that depend on JavaScript to load their content. Puppeteer lets you wait until the content has loaded before starting any scraping tasks.
The library does this by running a cut-down version of Google Chrome, referred to as headless, which drops the usual overhead: there's no GUI and no user accounts or preferences, only what it needs.
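As a minimal sketch of what "waiting until the content is loaded" can look like, the function below opens a page in headless Chrome and returns the HTML after the page's scripts have run. The function name fetchRenderedHtml, the injected puppeteer parameter, and the choice of networkidle0 as the wait condition are assumptions for illustration; some pages may be better served by waiting for a specific element instead.

```javascript
// Sketch: load a page in headless Chrome and return its rendered HTML.
// `puppeteer` is passed in as a parameter (an illustrative choice) so the
// function works with require('puppeteer') or with a stub in tests.
async function fetchRenderedHtml(puppeteer, url) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    // 'networkidle0' treats navigation as finished once there have been
    // no network connections for at least 500 ms.
    await page.goto(url, { waitUntil: 'networkidle0' });
    // page.content() returns the full HTML of the page as it currently stands.
    return await page.content();
  } finally {
    // Always close the browser, even if navigation fails.
    await browser.close();
  }
}
```

With Puppeteer installed, it could be called as fetchRenderedHtml(require('puppeteer'), 'https://example.com').then(console.log).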
Puppeteer is a Node.js library that you can use in your Google Cloud Functions. It also has a number of other built-in features, such as the ability to:
- Generate screenshots and PDFs of pages
- Automate form submissions
- Run automated tests
- and more!
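To illustrate the first item in the list, here is a short sketch that saves a screenshot and a PDF of the same page. The function name capturePage, its parameters, and the A4 paper size are assumptions chosen for the example. On Google Cloud Functions, the output paths would need to point somewhere writable, such as /tmp.

```javascript
// Sketch: save a full-page screenshot and a PDF of one page.
// `puppeteer` is injected (an illustrative choice) so a stub can stand in
// for the real library.
async function capturePage(puppeteer, url, pngPath, pdfPath) {
  const browser = await puppeteer.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle0' });
    // fullPage captures the whole scrollable page, not just the viewport.
    await page.screenshot({ path: pngPath, fullPage: true });
    // Write the page out as an A4 PDF.
    await page.pdf({ path: pdfPath, format: 'A4' });
  } finally {
    await browser.close();
  }
}
```

A call might look like capturePage(require('puppeteer'), 'https://example.com', '/tmp/page.png', '/tmp/page.pdf').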
How to install Puppeteer in Google Cloud
- Navigate to cloud.google.com
- Launch the Google Cloud Shell from the top right (see also Launching the Google Cloud Shell Console)
- Once the shell console loads, type npm install -g npm to ensure npm is up to date
- Next, install some dependencies to use with Puppeteer: type npm install --save request
- Type npm install --save request-promise
- Now install the Puppeteer library: type npm install puppeteer
- You should now have Puppeteer and the required dependencies installed. It's time to create your first project, using my free templates to get you started.
Create a Node.js project with Puppeteer
- Create a new Google Cloud Function
- Set the function's memory to 1 GB, since Puppeteer uses quite a bit of memory
- Copy the code into the index.js and package.json files
- The code to copy is embedded below; you can also find it in my GitHub repository
- Click Create. Your function may take a couple of minutes to create, and you should get a green tick once it has finished.
- Click the URL to test that your code works. It should fetch the HTML for the lukestoolkit blog and render it on screen, which tells you whether the code is working.
- Now it's your turn to start developing. Review the Puppeteer documentation at https://developers.google.com/web/tools/puppeteer