Request Promise is a popular Node.js library for web scraping and other HTTP requests. It is simple, compact, and easy to use.
Because Request Promise only makes HTTP requests and does not launch a headless browser, it uses much less memory than tools such as Puppeteer and usually returns a result faster.
How to Install Request Promise in Google Cloud Functions
- Navigate to cloud.google.com
- Launch the Google Cloud Shell from the top right. You may also read Launching the Google Cloud Shell Console.
- Once the shell console loads, type npm install -g npm to ensure npm is up to date.
- Next, install request: type npm install --save request
- Then install request-promise: type npm install --save request-promise
- Now install cheerio, a library that parses the fetched HTML and lets you select content from the page with jQuery-style selectors: type npm install --save cheerio
- You should now have Request Promise, Cheerio, and the required dependencies installed. Time to create your first project, using my free templates to get you started.
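If you want to confirm that the installs above succeeded before moving on, a small Node.js check along these lines may help. This is only a sketch: `checkModules` is a hypothetical helper name, and it relies on Node's built-in `require.resolve`, which throws when a module cannot be found from the current directory.

```javascript
// Sketch only: checkModules is a hypothetical helper, not part of any library.
// It reports which of the given module names Node can resolve from here.
function checkModules(names) {
  const result = {};
  for (const name of names) {
    try {
      require.resolve(name); // throws if the module is not installed
      result[name] = true;
    } catch (err) {
      result[name] = false;
    }
  }
  return result;
}

// Prints an object mapping each package name to true (found) or false (missing).
console.log(checkModules(['request', 'request-promise', 'cheerio']));
```

Save it as a file in the directory where you ran npm install and run it with node. Any package reported as false needs to be installed again.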
Create a Node.js project with Request Promise
- Create a new Google Cloud function
- Copy the following code into the index.js and package.json files as follows:
- The code to copy is embedded below. You can also find it on my GitHub repository.
```javascript
/**
 * Templates developed for Node.js, designed for use with Google Cloud Functions
 * Web scraping template provided by lukestoolkit.blogspot.com
 *
 * LukesToolkit Blog: https://lukestoolkit.blogspot.com
 * Template files available at: https://github.com/lukegackle
 *
 * Responds to any HTTP request.
 *
 * @param {!express:Request} req HTTP request context.
 * @param {!express:Response} res HTTP response context.
 */
exports.helloWorld = (req, res) => {
  let rp = require('request-promise').defaults({jar: true});
  var cheerio = require('cheerio'); // Basically jQuery for node.js
  // For more information on Cheerio visit https://github.com/cheeriojs/cheerio
  var tough = require('tough-cookie'); // Don't know that this is required but included for good measure
  var GETURLParam = req.query.GETURLParam;

  let ReqOptions = {
    method: 'GET',
    uri: "https://lukestoolkit.blogspot.com",
    port: 443,
    resolveWithFullResponse: true
    /*
    Example options can be used for API requests with authentication headers
    headers: {
      'Content-Type': 'application/json',
      'sign': sign,
      'key': key
    }

    Example options for posting data to a page with form data
    form: {
      // Like <input type="text" name="name">
      username: '',
      password: '',
      csrf: csrf
    }
    */
  };

  rp(ReqOptions)
    .then(function (response) {
      // Example of getting response headers and setting cookies in request-promise
      // This is useful for web requests that require authentication
      var setcookies = response.headers["set-cookie"]; // If response headers are JSON and if website returns a set-cookie object
      var dt = Date.now();
      //setcookies.forEach(function(cookie) {
      //  rp.jar().setCookie(cookie, 'https://domain.com', {expires: dt });
      //});

      var $$ = cheerio.load(response.body);
      // An <a> element carries its link in href, not src
      var data = $$(".blog-name > a").attr('href');

      res.status(200).send(response.body);

      // Examples of how to get data from the page
      //var rows = $$("table > tr").length; // returns number of tr elements
      //var title = $$("h1").text(); // returns text content of h1 tag
    })
    .catch(function (err) {
      // Crawling failed: report an error status rather than 200
      res.status(500).send(err.message);
    });
};
```

```json
{
  "name": "sample-http",
  "version": "0.0.1",
  "dependencies": {
    "request": "^2.88.0",
    "request-promise": "^4.2.4",
    "cheerio": "^0.22.0"
  }
}
```

- Click Create. Your function may take a couple of minutes to create; you should get a green tick once it has finished.
- Click the URL to test that your code works. It should fetch the HTML of the lukestoolkit blog and render it on screen, which confirms that the function is working.
- Now it's your turn to start developing. Review the documentation for Request Promise and Cheerio linked below.
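As one possible next step, the template reads a GETURLParam query value but never uses it. The sketch below shows how a query parameter could drive the request instead of a hard-coded address. Everything named here is an assumption for illustration: `createScraper`, `fetchPage`, and the `url` query parameter are hypothetical and not part of request-promise or Google Cloud Functions. The fetch function is passed in so the handler can be tried locally with a stub, without network access.

```javascript
// Sketch only: createScraper, fetchPage and the "url" query parameter are
// hypothetical names chosen for this example.
function createScraper(fetchPage) {
  return async function handler(req, res) {
    let target;
    try {
      // new URL() throws on missing or malformed input
      target = new URL(req.query.url);
    } catch (err) {
      res.status(400).send('Missing or invalid "url" query parameter');
      return;
    }
    if (target.protocol !== 'http:' && target.protocol !== 'https:') {
      res.status(400).send('Only http and https URLs are supported');
      return;
    }
    try {
      const response = await fetchPage({
        method: 'GET',
        uri: target.href,
        resolveWithFullResponse: true
      });
      res.status(200).send(response.body);
    } catch (err) {
      // The upstream request failed, so report a gateway-style error
      res.status(502).send('Fetch failed: ' + err.message);
    }
  };
}

// In index.js this could be wired up as follows (assumes request-promise
// is installed as described above):
//   const rp = require('request-promise');
//   exports.helloWorld = createScraper(rp);
```

With this in place, appending ?url=https://lukestoolkit.blogspot.com to the function URL would fetch that page. Before exposing a function like this publicly, you would probably want to restrict which hosts it is allowed to fetch.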
Recommended Reading