I couldn’t find a freely readable Medium post on this topic. There is one by Angelica Dietzel, but unfortunately it is only readable if you have a paid Medium account. If you have any suggestions for improving the method I demonstrate here, please leave a comment.
A typical website hierarchy
We start with a visual hierarchical representation of a website.
The pages in red are links you don’t want to follow (Facebook, LinkedIn, other social media pages). The pages in green are the ones you want to extract valuable information from. Note that the two product pages link to each other, so while we crawl from the top of the hierarchy to the bottom we have to make sure we don’t save the same link multiple times.
1. Import necessary modules
import json
import requests
from bs4 import BeautifulSoup
from tqdm import tqdm
2. Write a function for getting the text data from a website URL
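A minimal sketch of such a function, assuming the requests library for downloading the page and BeautifulSoup's built-in "html.parser" for parsing. The names html_to_text and get_text and the 10-second timeout are illustrative choices, not necessarily what the full code uses.

```python
import requests
from bs4 import BeautifulSoup


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    soup = BeautifulSoup(html, "html.parser")
    # Drop script and style elements so their contents do not end up in the text.
    for tag in soup(["script", "style"]):
        tag.decompose()
    # Join the remaining text fragments with single spaces.
    return soup.get_text(separator=" ", strip=True)


def get_text(url: str, timeout: float = 10.0) -> str:
    """Download `url` and return its visible text (illustrative helper)."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return html_to_text(response.text)
```

Splitting the parsing from the download keeps the parsing part easy to try out on a small HTML string before pointing it at a real website.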
3. Write a function for getting all links from one page and store them in a list
First, the function collects all links marked with “a href”. As mentioned, this could lead the crawler to other websites you do not want information from, so we have to place some restrictions on the function.
Second, we also want to include hrefs that don’t show the full URL but only a relative link starting with a “/”. For instance, we can encounter a valuable link marked with href like this:
<a class="pat-inject follow icon-medicine-group" data-pat-inject="hooks: raptor; history: record" href="/bladeren/groepsteksten/alfabet">
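If we used the bare “/bladeren/groepsteksten/alfabet” string to search for new subpages, we would get an error, because it is not a complete URL. The relative link first has to be joined with the base URL of the website.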
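One way to handle both points, sketched under a few assumptions: the restriction is "stay on the same domain as the page we are on", relative links are resolved with urljoin from the standard library, and extract_links is an illustrative name. The example.com URLs are placeholders.

```python
from urllib.parse import urljoin, urlparse

from bs4 import BeautifulSoup


def extract_links(html: str, page_url: str) -> list:
    """Return absolute same-site links found in `html` (illustrative helper)."""
    site = urlparse(page_url).netloc
    links = []
    for anchor in BeautifulSoup(html, "html.parser").find_all("a", href=True):
        # urljoin leaves absolute URLs alone and resolves "/path" against page_url.
        absolute = urljoin(page_url, anchor["href"])
        parsed = urlparse(absolute)
        # Keep only http(s) links that stay on the site we started from.
        if parsed.scheme in ("http", "https") and parsed.netloc == site:
            links.append(absolute)
    return links


page = (
    '<a href="/bladeren/groepsteksten/alfabet">Groups</a>'
    '<a href="https://www.facebook.com/some-page">Facebook</a>'
    '<a href="https://www.example.com/products">Products</a>'
)
print(extract_links(page, "https://www.example.com/start"))
```

The Facebook link is dropped because its domain differs, and the relative link comes back as a full URL that can be requested directly.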
Third, we convert the list of links to a dictionary with the function dict.fromkeys(). Dictionary keys are unique, so this prevents saving the same link more than once, and checking whether a link is already known is faster in a dictionary than in a list. The result looks like this:
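As a small illustration with made-up links (the example.com URLs are placeholders, and the “Not-checked” value is borrowed from step 5 as an assumption about what is stored here):

```python
# Hypothetical links as they might come back from one page, with a duplicate.
found_links = [
    "https://www.example.com/products/a",
    "https://www.example.com/products/b",
    "https://www.example.com/products/a",
]

# dict.fromkeys keeps one key per distinct link, in first-seen order,
# and gives every key the same starting value.
link_status = dict.fromkeys(found_links, "Not-checked")

print(link_status)
# {'https://www.example.com/products/a': 'Not-checked',
#  'https://www.example.com/products/b': 'Not-checked'}
```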
4. Write a function that loops over all the subpages
We use a for loop to go through the subpages and wrap it in tqdm, which shows how many steps have been completed and an estimate of the time remaining.
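A rough sketch of what one such pass could look like. It assumes the links live in a dictionary that maps each URL to a status, that visited links are marked “Checked” (an illustrative value), and that get_links is any function returning the links found on a page; check_subpages is also an illustrative name.

```python
from tqdm import tqdm


def check_subpages(link_status: dict, get_links) -> dict:
    """Visit every "Not-checked" link once and record newly found links.

    `get_links` is any callable that maps a URL to a list of URLs.
    """
    # Copy the keys first so the dictionary can grow while we iterate.
    to_check = [url for url, status in link_status.items() if status == "Not-checked"]
    for url in tqdm(to_check, desc="Checking subpages"):
        for new_link in get_links(url):
            # setdefault adds unseen links and leaves known ones untouched.
            link_status.setdefault(new_link, "Not-checked")
        link_status[url] = "Checked"
    return link_status
```

Passing get_links in as an argument means the pass can be tried out with a small fake website before it touches the network.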
5. Create the loop
Before we start the loop we have to initialize some variables. We save the website we want to scrape in a variable and convert this variable into a single-key dictionary with the value “Not-checked”. We create a counter “counter” to count the number of “Not-checked” links and a second counter “counter2” to count the number of iterations. We also add some print statements to report the progress back to ourselves.
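The setup described above could be sketched roughly as follows. The names counter and counter2 and the “Not-checked” value come from the text; the “Checked” value, the max_iterations safety cap, the crawl function name and the fake website used to try it out are assumptions for illustration.

```python
def crawl(start_url: str, get_links, max_iterations: int = 50) -> dict:
    """Keep checking links until none are left with the status "Not-checked"."""
    link_status = dict.fromkeys([start_url], "Not-checked")
    counter = 1   # number of "Not-checked" links
    counter2 = 0  # number of iterations
    while counter > 0 and counter2 < max_iterations:
        counter2 += 1
        # Snapshot the unchecked links so the dictionary can grow safely.
        for url in [u for u, s in link_status.items() if s == "Not-checked"]:
            for new_link in get_links(url):
                link_status.setdefault(new_link, "Not-checked")
            link_status[url] = "Checked"
        counter = sum(1 for s in link_status.values() if s == "Not-checked")
        print(f"Iteration {counter2}: {len(link_status)} links, {counter} not checked")
    return link_status


# A made-up website in which the two product pages link to each other.
fake_site = {
    "https://www.example.com/": ["https://www.example.com/products"],
    "https://www.example.com/products": [
        "https://www.example.com/products/a",
        "https://www.example.com/products/b",
    ],
    "https://www.example.com/products/a": ["https://www.example.com/products/b"],
    "https://www.example.com/products/b": ["https://www.example.com/products/a"],
}
result = crawl("https://www.example.com/", lambda url: fake_site.get(url, []))
```

Because every link is stored as a dictionary key, the interlinked product pages are each saved and visited only once, and the loop stops as soon as no “Not-checked” links remain.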
This will give you something like the following:
The JSON file looks something like this:
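As a stand-in, here is a minimal sketch of how a link dictionary could be written to a JSON file and what the result would look like. It assumes the file stores the URL-to-status dictionary; save_links, the example.com URLs and the “Checked” value are illustrative.

```python
import json


def save_links(link_status: dict, path: str) -> None:
    """Write the link dictionary to `path` as indented JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(link_status, handle, indent=2)


# Hypothetical crawl result; real keys would be the URLs found on the site.
example = {
    "https://www.example.com/": "Checked",
    "https://www.example.com/products": "Checked",
}
print(json.dumps(example, indent=2))
```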
You can find the full code on my GitHub page.