Web scraping is simply the term for downloading data from various web sources. Data obtained from web sources can be combined with internal data to generate insights. Note that the data obtained through web scraping is often unstructured, so it typically requires regular expressions to convert it into a more usable format.
by Robert Abela
This article is part of a series that goes through all the steps needed to write a script that reads information from a website and saves it locally. Make sure that all the pre-requisites (listed at the end of this article) are in place before continuing.
Installing Selenium and other requirements
Selenium setup requires two steps:
- Install the Selenium library using the command: pip install selenium
- Download the Selenium WebDriver matching your browser's exact version
Chrome drivers can be found on chromium.org
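To verify the setup, a quick smoke test can be run (this assumes the matching chromedriver is available on your PATH):

```python
from selenium import webdriver

# Launch Chrome through the WebDriver downloaded above
driver = webdriver.Chrome()
driver.get("https://www.python.org")
print(driver.title)  # should print the page title, e.g. "Welcome to Python.org"
driver.quit()
```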
Scraper setup requires two commands:
- pip install requests
- pip install beautifulsoup4
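A quick way to confirm that both libraries installed correctly is to fetch a page and print its title (python.org is used here purely as an example):

```python
import requests
from bs4 import BeautifulSoup

response = requests.get("https://www.python.org")
response.raise_for_status()  # fail loudly on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.string)  # the text inside the page's <title> tag
```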
Scraping a website
What is web scraping?
Scraping is like browsing to a website and copying some content, but done programmatically (e.g. using Python), which makes it much faster. The limit to how fast you can scrape is basically your bandwidth, your computing power, and how much the web server will allow. Technically the process can be divided into two parts (a code sketch follows the list below):
- Crawling is the first part, which basically involves opening a page and finding all the interesting links in it, e.g. shops listed in a section of the yellow pages.
- Scraping comes next, where all the links from the previous step are visited to extract specific parts of the web page, e.g. the address or phone number.
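A rough sketch of the two phases; the listing URL and CSS selectors are hypothetical, chosen only to illustrate the pattern:

```python
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "https://example.com"  # hypothetical directory site

# Crawling: open a listing page and collect the interesting links
listing_page = BeautifulSoup(requests.get(BASE + "/shops").text, "html.parser")
links = [urljoin(BASE, a["href"]) for a in listing_page.select("a.shop-link[href]")]

# Scraping: visit every collected link and extract specific parts
for url in links:
    page = BeautifulSoup(requests.get(url).text, "html.parser")
    address = page.select_one(".address")  # hypothetical CSS class
    phone = page.select_one(".phone")      # hypothetical CSS class
    print(url,
          address.get_text(strip=True) if address else "n/a",
          phone.get_text(strip=True) if phone else "n/a")
```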
Challenges of Scraping
One main challenge is that websites vary widely, so you will likely end up writing a scraper specific to every site you deal with. Even if you stick with the same websites, updates and re-designs will likely break your scraper in some way (expect to reach for F12, the browser developer tools, frequently).
Some websites do not tolerate being scraped and will employ different techniques to slow or stop scraping. Another aspect to consider is the legality of the process, which depends on where the server is located, the terms of service, and what you do with the data once you have it, amongst other things.
An alternative to web scraping, when available, is the Application Programming Interface (API), which offers a way to access structured data directly (in formats like JSON and XML) without dealing with the visual presentation of the web pages. Hence it is always a good idea to check whether the website offers an API before investing time and effort in a scraper.
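For example, querying a JSON API is usually far simpler than parsing HTML (the endpoint below is hypothetical; a real site would document its own):

```python
import requests

# Hypothetical endpoint returning structured JSON instead of rendered HTML
data = requests.get("https://example.com/api/shops", params={"city": "Valletta"}).json()
for shop in data.get("shops", []):
    print(shop.get("name"), shop.get("phone"))
```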
Scraping libraries
While there are many ways to get data from web pages (e.g. using Excel, browser plugins or other tools), this article will focus on how to do it with Python. The flexibility of a programming language makes this a very powerful approach, and there are very good libraries available, such as Beautiful Soup, which will be used in the sample below. There is a very good write-up on how to Build a Web Scraper With Beautiful Soup. Another framework to consider is Scrapy.
What is Selenium? Why is it needed?
Some websites use JavaScript to load parts of the page later (not directly when the page is loaded). Some links also trigger JavaScript, with the destination URL computed on click. These techniques are becoming increasingly common (sometimes even as an anti-scraping measure), and unfortunately libraries like Beautiful Soup do not handle them well.
In comes Selenium, a framework designed primarily for automated web application testing. It allows developers to programmatically control a browser from different programming languages. Since Selenium drives a real browser that renders the page, JavaScript is executed normally and the problems mentioned above can be avoided. This of course requires more resources and makes the whole process slower, so it is wise to use it only when strictly required.
Beautiful Soup and Selenium can also be used together as shown in this interesting article at freecodecamp.org.
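A common pattern is to let Selenium render the JavaScript and then hand the finished HTML over to Beautiful Soup for parsing. A minimal sketch (the URL is a placeholder):

```python
import time
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://example.com/js-heavy-page")  # placeholder URL
time.sleep(2)  # crude wait for scripts to finish; WebDriverWait is more robust

# page_source holds the DOM *after* the browser has executed JavaScript
soup = BeautifulSoup(driver.page_source, "html.parser")
driver.quit()

for heading in soup.find_all("h2"):
    print(heading.get_text(strip=True))
```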
Building a first scraper
This first scraper will perform the following steps:
- Visit the page and parse source HTML
- Check that the page title is as expected
- Perform a search
- Look for expected result
- Get the link URL
Two implementations, one using Beautiful Soup and one using Selenium, can be found below.
Scraper using Beautiful Soup
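A minimal sketch of those five steps with Beautiful Soup; the target site, expected title, search endpoint and CSS selectors are all hypothetical placeholders, so substitute the ones for the site you are scraping:

```python
import requests
from bs4 import BeautifulSoup

BASE = "https://example.com"  # hypothetical target site

# 1. Visit the page and parse the source HTML
soup = BeautifulSoup(requests.get(BASE).text, "html.parser")

# 2. Check that the page title is as expected
assert "Example" in soup.title.string, "unexpected page title"

# 3. Perform a search (assumed here to be a plain GET parameter)
results_html = requests.get(BASE + "/search", params={"q": "scraping"}).text
results = BeautifulSoup(results_html, "html.parser")

# 4. Look for the expected result among the result links
match = next((a for a in results.select("a.result")  # hypothetical selector
              if "scraping" in a.get_text().lower()), None)

# 5. Get the link URL
print(match["href"] if match else "expected result not found")
```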
Scraper using Selenium
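An equivalent sketch with Selenium, again with a hypothetical site and selectors; note that the search is performed by typing into the search box, just as a real user would:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
try:
    # 1. Visit the page (the browser parses the HTML for us)
    driver.get("https://example.com")  # hypothetical target site

    # 2. Check that the page title is as expected
    assert "Example" in driver.title, "unexpected page title"

    # 3. Perform a search by typing into the search box
    search_box = driver.find_element(By.NAME, "q")  # assumed input name
    search_box.send_keys("scraping" + Keys.RETURN)

    # 4. Look for the expected result
    links = driver.find_elements(By.CSS_SELECTOR, "a.result")  # hypothetical selector
    match = next((a for a in links if "scraping" in a.text.lower()), None)

    # 5. Get the link URL
    print(match.get_attribute("href") if match else "expected result not found")
finally:
    driver.quit()
```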
Pre-requisites
This is part of a series that goes through all the steps needed to write a script that reads information from a website and saves it locally. This section lists all the technologies you should be familiar with and all the tools that need to be installed.
Basic knowledge of HTML
This article series assumes familiarity with Python 3.x, along with a basic understanding of web page source code, including:
- HTML document structure
- Attributes of common HTML elements
- Basic JavaScript and AJAX
- CSS classes
- HTTP request parameters
- Awareness of lazy-loading techniques
A good place to start is W3Schools.
Installing Python
As part of the pre-requisites, installing the correct version of Python and pip is required. This setup section assumes a Windows operating system, but it should be easily transferable to macOS or Linux.
Which Python version should one use: Python 2 or 3? This may have been a point of discussion in the past: the two versions are not compatible, so one had to pick a side. Today (2020), however, it is safe to go with version 3.x. Python 2.7, the last release line of Python 2.x, dates back to 2010 and reached its end of life in January 2020; the latest stable Python 3 version at the time of writing is 3.8.
Start by downloading the latest version of Python 3 from the official website. Install it as you would any other software, making sure to tick the option that adds Python to the PATH during installation.
To confirm that it was installed successfully, open a Command Prompt window and type python; the Python version banner and the >>> interpreter prompt should appear.
Installing and using pip
pip is the package installer for Python. It very likely came along with your Python installation. You can check by entering pip -V in a Command Prompt window; it should print the pip version along with the path it is installed at.
If pip is not available, it needs to be installed by following these steps:
- Download get-pip.py to a folder on your computer.
- Open a command prompt
- Navigate to the folder where get-pip.py was saved
- Run the following command: python get-pip.py
Our Beginner's Guide to Web Scraping
The internet has become such a powerful tool because of the sheer amount of information it holds. Many marketers, web developers, investors and data scientists use web scraping to collect online data that helps them make valuable decisions.
But if you’re not sure how to use a web scraper tool, it can be intimidating and discouraging. The goal of this beginner's guide is to introduce web scraping to people who are new to it, or who don’t know exactly where to start.
We’ll even go through an example together to give you a basic understanding of it, so we recommend downloading our free web scraping tool so you can follow along.
So, let’s get into it.
Introduction to Web scraping
First, it's important to discuss what web scraping is and what you can do with it. Whether this is your first time hearing about web scraping, or you’ve heard of it but have no idea what it does, this beginner's guide will help you discover what web scraping is capable of!
What is Web Scraping?
Web scraping, also known as web harvesting, is a powerful tool that can help you collect data online and transfer the information into an Excel, CSV or JSON file to help you better understand what you’ve gathered.
Although web scraping can be done manually, that can be a long and tedious process. That’s why using data extraction tools is preferred when scraping online data: they tend to be more accurate and more efficient.
Web scraping is incredibly common and can be used to create APIs out of almost any website.
How do web scrapers work?
Automatic web scraping can be simple yet complex at the same time, but once you get the hang of it, it becomes much easier to understand. Like anything else, it takes practice.
The web scraper will be given one or more URLs to load before scraping. The scraper then loads the entire HTML code for the page in question. More advanced scrapers will render the entire website, including CSS and JavaScript elements.
Then the scraper will either extract all the data on the page or specific data selected by the user before the project is run.
Ideally, you want to go through the process of selecting exactly which data you want to collect from the page. This can be text, images, prices, ratings, ASINs, addresses, URLs, etc.
Once you have selected everything you want to extract, you can export it to an Excel/CSV file to analyze the data. Some advanced web scrapers can also convert the data into a JSON file, which can be used as an API.
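If you are scripting the export yourself rather than using a point-and-click tool, this step needs nothing beyond Python's standard library. A minimal sketch, with made-up rows standing in for scraped data:

```python
import csv
import json

# Made-up rows standing in for scraped data
rows = [
    {"title": "Blog post A", "author": "Jane", "read_time": "4 min"},
    {"title": "Blog post B", "author": "Joe", "read_time": "7 min"},
]

# CSV for spreadsheet analysis
with open("scraped.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)

# JSON, e.g. to serve the data as a simple API payload
with open("scraped.json", "w", encoding="utf-8") as f:
    json.dump(rows, f, indent=2)
```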
If you want to learn more, you can read our guide on What is Web Scraping and what it’s used for
Is Web Scraping Legal?
With the ability to extract public information from competitors or other websites, is web scraping legal?
Generally speaking, publicly available data that anyone on the internet can access can be legally extracted, although the details vary by jurisdiction.
As a rule of thumb, the data should meet these 3 criteria before you extract it:
- The user has made the data public
- No account is required for access
- It is not blocked by the site's robots.txt file
If all 3 rules are met, you are on reasonably safe ground, and the robots.txt rule can even be checked programmatically, as sketched below.
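Here is a minimal check using only Python's standard library (the site below is a placeholder):

```python
from urllib.robotparser import RobotFileParser

# Placeholder site; point this at the robots.txt of the site you want to scrape
robots = RobotFileParser("https://example.com/robots.txt")
robots.read()

# Can a generic crawler ("*") fetch this path?
print(robots.can_fetch("*", "https://example.com/blog/"))
```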
You can learn more about the rules of web scraping here: Is web scraping legal?
Web scraping for beginners
Now that we understand what web scraping is and how it works, let’s put it into action to get the hang of it!
For this example, we are going to extract all of the blog posts ParseHub has created, how long they take to read, who wrote them, and their URLs. We’re not sure what you’ll do with this information, but we just want to show you what you can do with web scraping and how easy it can be!
First, download our free web scraping tool.
You’ll need to set up ParseHub on your desktop so here’s the guide to help you: Downloading and getting started.
Once ParseHub is ready, we can begin scraping data.
If it’s your first time using ParseHub, we recommend following the tutorial just to give you an idea of how it works.
But let’s scrape an actual website like our Blog.
For this example, we want to extract all of the blogs we have written, the URL of the blog, who wrote the blog, and how long it takes to read.
Your first web scraping project
1. Open up ParseHub and create a new project by selecting “New Project”
2. Copy this URL: https://www.parsehub.com/blog/ and place it in the text box on the left-hand side, then click on the “Start project on this URL” button.
3. Once the page is loaded on ParseHub there will be 3 sections:
- Command Section
- The web page you're extracting from
- Preview of what the data will look like
The command section is where you tell the software what you want to do, whether that is a click, a selection, or one of the advanced features ParseHub offers.
4. To begin extracting data, you will need to click on exactly what you want to extract, in this case the blog title. Click on the first blog title you see.
Once clicked, the selection you made will turn green. ParseHub will then make suggestions of what it thinks you want to extract.
The suggested data will be in a yellow container. Click on a title that is in a yellow container then all blog titles will be selected. Scroll down a bit to make sure there is no blog title missing.
Now that you have some data, you can see a preview of what it will look like when it's exported.
5. Let’s rename our selection to something that will help us keep our data organized. To do this, just double click on the selection; the name will be highlighted and you can rename it. In this case, we are going to name it “blog_name”.
Quick note: selection and data names cannot contain spaces, i.e. “Blog names” won’t work but “blog_names” will.
Now that all blog titles are selected, we also want to extract who wrote them, and how long they take to read. We will need to make a relative selection.
6. On the left sidebar, click the PLUS (+) sign next to the blog name selection and choose the Relative Select command.
7. Using the Relative Select command, click on the first blog name and then the author. An arrow will connect the two selections.
Let’s rename the relative selection to blog_author
Since we don’t need the image URL let’s get rid of it. To do this you want to click on the expand button on the “relative blog_author” selection.
Now select the trash can beside “extract blog_author”
8. Repeat steps 6 and 7 to get the length of the blog. You won't need to delete a URL this time, since we are extracting text. Let's name this selection “blog_length”.
Since our blog is a scrolling page (scroll to load more) we will need to tell the software to scroll to get all the content.
If you were to run the project now you would only get the first few blogs extracted.
9. To do this, click on the PLUS (+) sign beside the page selection and choose Select. You will then need to select the page's main container element (the main div).
10. Once you have the main div selected, you can add the scroll function. On the left sidebar, click the PLUS (+) sign next to the main selection, click on Advanced, then select the Scroll function.
You will need to tell the software how many times to scroll; depending on how big the blog is, you may need a bigger number. For now, let’s set it to 5 times and make sure it’s aligned to the bottom.
If you still need help with the scroll option you can click here to learn more.
We will need to move the main scroll option above the blog_name selection in the command list.
11. Now that we have selected everything we want extracted, we can let ParseHub do its magic. Click on the “Get data” button.
12. You’ll be taken to the Get data page.
You can test your extraction to make sure it’s working properly; for bigger projects, we recommend doing a test run first. But for this project, let’s press “Run” so ParseHub can extract the online data.
13. This project shouldn’t take too long. Once ParseHub is done extracting the data, you can download and export it as a CSV/Excel file, JSON, or through an API. We just need a CSV/Excel file for this project.
And there you have it! You’ve completed your first web scraping project. Pretty simple huh? But ParseHub can do so much more!
What else can you do with web scraping?
Now that we scraped our blog and movie titles (if you did the tutorial), you can try to implement web scraping in more of a business-related setting. Our mission is to help you make better decisions and to make better decisions you need data.
ParseHub can help you make valuable decisions by doing efficient competitor research, brand monitoring and management, lead generation, finding investment opportunities and many more!
Whatever you choose to do with web scraping, ParseHub can help!
Check out our other blog posts on how you can use ParseHub to help grow your business. We’ve split our blog posts into different categories depending on what kind of information you're trying to extract and the purpose of your scraping.
Ecommerce Website / Competitor Analysis / Brand Reputation
Lead Generation
Brand Monitoring and Investing Opportunities
Closing Thoughts
There are many ways web scraping can help your business, and every day companies find creative ways to use ParseHub to grow! Web scraping is a great way to collect the data you need, but it can be a bit intimidating at first if you don’t know what you’re doing. That’s why we created this beginner's guide to web scraping, to help you gain a better understanding of what it is, how it works, and how you can use it for your business!
If you have any trouble with anything, you can visit our help center or blog to help you navigate ParseHub, or contact support with any inquiries.
Learn more about web scraping
If you want to learn more about web scraping and elevate your skills, you can check out our free web scraping course! Once completed, you'll get a certification to show off your new skills and knowledge.
Happy Scraping!