Web sites contains mass amount of data. Some times people need to extract those information for their needs. Web scraping is a popular technique use to collect data automatically from websites. Web scraping is also called as Web harvesting.
There are lot of uses of web scraping. It is used
- To compare prices
- To monitor weather data
- To detect changes made to websites
- For research purposes
- To collect contact details
Complexity of the web scraping program depends on the web site and the data we going to collect. If we develop a web scraping software from VBA we need to use several in-built functions to make our program easier. Here below are very important functions we will need.
Functions helpful in web scraping
Mid - This function can be used to extract a string segment from it's parent string.
InStr - This function can be used to check whether certain string is included in another string and get the position of that sub string if it exists.
Replace - This function replace a character or string segments with another character or string.
Create an Internet Explorer object and navigate to a webpage
First, let's see how to create an internet explorer object. You can use below line of code to create that object.
Set objIE = CreateObject("InternetExplorer.Application")
Then set following attributes to the created object. You can change them according to your requirement. First to attributions decide the position of internet explorer window. Next to attributes will decide the size of the window.
objIE.Top = 0
objIE.Left = 0
objIE.Width = 800
objIE.Height = 600
Following attribute enable the viewing of window navigation. However you can set it to false if you do not want to see it.
objIE.Visible = True
So now we have created the internet explorer object and set the attributes. Next let's see how to navigate to a website. You can use below line to navigate to the website you want.
objIE.Navigate ("http://www.example.com/")
There is a very important thing you need to know when developing a web scraping software. Your program will have to run through several line of codes after navigating to a specific website. However website can take some time to be fully loaded due to various reasons. So when you run the program there is a chance of running the rest of the code before web site is fully loaded. This can harmfully affected to the results you get from the software. However we can take preventive action to overcome this kind of situations. For that you need to always put below three lines where ever you have navigation command.
Do
DoEvents
Loop Until objIE.readystate = 4
Following is the full code you can use to Create an Internet Explorer object and navigate to a webpage.
Set objIE = CreateObject("InternetExplorer.Application") objIE.Top = 0 objIE.Left = 0 objIE.Width = 800 objIE.Height = 600 objIE.Visible = True objIE.Navigate ("http://www.example.com/") Do DoEvents Loop Until objIE.readystate = 4 |
There is an another reason which can cause our program to behave differently to the way we have intended. There are times web sites not available due to different reasons. If our program try web harvest with a web site which is not online it will halt our program. So we need to use error handling techniques wisely to overcome this kind of scenarios.
Here below is a example of using error handling in VBA web scraping.
the_start: Set objIE = CreateObject("InternetExplorer.Application") objIE.Top = 0 objIE.Left = 0 objIE.Width = 800 objIE.Height = 600 objIE.Visible = True 'We will see the window navigation On Error Resume Next objIE.navigate ("http://www.example.com/") Do DoEvents If Err.Number <> 0 Then objIE.Quit Set objIE = Nothing GoTo the_start: End If Loop Until objIE.readystate = 4 MsgBox "webpage has loaded" |
Often we need to get the html code to a text box or to a variable, in first steps of the web scraping program. So here below is a code to paste html code of a specific web page to a text box. Here txtHtml refers to a text box name.
txtHtml.Text = objIE.document.body.innerhtml
Some times it is very easier to get required data from the inner-text without dealing with html codes. So we can use below line of code to copy paste inner-text to a text box in that kind of situation.
txtHtml.Text = objIE.document.body.innertext
In web scraping, frequently we need to get elements by their tag names. Here below is an example which can get elements have the "a" tags (anchor)
Set Alllinks = objIE.document.getelementsbytagname("A")
For Each Hyperlink In Alllinks
MsgBox Hyperlink.innertext & "-" & Hyperlink.href
Next
Often web scraping software need to click links inside it's procedures. Below example will click the link if it find a link with the text "New Arrivals".
Set Alllinks = objIE.document.getelementsbytagname("A")
For Each Hyperlink In Alllinks
If InStr(Hyperlink.innertext, "New Arrivals") > 0 Then
Hyperlink.Click
Exit For
End If
Next
Above are few basic things you need to know if you like to develop a software for Web scraping using VBA. Read these posts if you want to learn more.
Useful References for VBA Web Scraping
Web Scraping Techniques - Part 2
Web Scraping - Collecting Data From a Webpage
Web Scraping - Collect Options Inside Drop-down Lists
VBA Web Scraping - Identifying Disabled Dropdowns
How to scroll a web page number of pixels