As I advanced in my knowledge of Python, I decided to apply for a data engineer position at a prestigious finance company. Luckily, I was selected from the pool of candidates and was entrusted with learning how to scrape the web and use basic libraries like pandas to process the scraped information. Today I'm sharing what I learned during that process: we are going to scrape the web for data and create pandas DataFrames so the data is easy to manipulate and present later.
We are going to focus on the Selenium WebDriver library for Python, using it to scrape a table from www.the-numbers.com/market/2019/top-grossing-movies, collect the information for the top 30 movies in that table, and create a dataset.
The first step in scraping the table is to import all the dependencies we need: Selenium WebDriver to navigate through the webpage, WebDriverWait to wait for dynamic content to load, ChromeDriverManager to give the WebDriver control over a browser so it can navigate to the website, Options to set extra configuration on our WebDriver, and finally pandas to create our dataset.
After loading our dependencies, we need to configure and instantiate our WebDriver.
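A minimal sketch of this setup, assuming Selenium 4 and the webdriver-manager package are installed. The function name make_headless_driver is hypothetical, and the imports sit inside the function only so the snippet can be loaded without a browser; in a script they would normally go at the top of the file. Note that Selenium 4 wraps the driver path in a Service object, whereas Selenium 3 accepted the path directly as the first argument.

```python
def make_headless_driver():
    """Return a headless Chrome WebDriver (sketch; assumes Selenium 4)."""
    # Imports are local so that loading this snippet needs no browser;
    # in a real script they would normally sit at the top of the file.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()
    options.add_argument("--headless")  # do not open a visible browser window

    # install() downloads a matching ChromeDriver and returns its path.
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)
```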
The "--headless" option prevents the driver from opening a browser window, so the scraping happens in the background. We create the instance of the WebDriver on the last line. webdriver.Chrome() receives two parameters. The first is the browser driver it is going to use; in this case we pass the result of ChromeDriverManager().install(), which provides a version of ChromeDriver that matches the version of Google Chrome installed on your computer. The second parameter is the configuration options for the driver, in this case "--headless".
After that, all we need to do is tell the driver to open the URL of the website we are going to scrape and make it wait 5 seconds for all the dynamic content to load before we start traversing the DOM.
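The exact wait call is not shown here, so the following is only one possible way to do it, assuming Selenium 4. It uses an explicit wait that returns as soon as a table is present instead of always pausing for the full 5 seconds. The open_page name is hypothetical, and the https:// prefix on the URL is an assumption.

```python
# Assumption: the address is given without a scheme; https:// is added here.
URL = "https://www.the-numbers.com/market/2019/top-grossing-movies"


def open_page(driver, url=URL, timeout=5):
    """Load `url` and wait up to `timeout` seconds for a <table> to appear."""
    # Local imports so the snippet loads without Selenium installed.
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver.get(url)
    # until() polls the page and raises TimeoutException if no table shows up
    # in time; on success it returns the located <table> element.
    return WebDriverWait(driver, timeout).until(
        EC.presence_of_element_located((By.TAG_NAME, "table"))
    )
```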
The next step is to traverse the DOM and retrieve the information we want, but before that we need to understand how our data is spread across the DOM. Let's have a closer look by visiting the website in the Chrome browser and opening the developer tools.
Here you can explore the structure of the DOM. If you look closely, the data we want is inside the cells within the rows of a table, and some of it is within anchor tags inside the cells. To retrieve those values, we are going to use several methods provided by Selenium WebDriver.
- find_element_by_tag_name("tag_name") retrieves the first element with the matching tag.
- find_elements_by_tag_name("tag_name") retrieves all elements with the matching tag.
- get_attribute("attribute_name") retrieves the value of a particular attribute on a particular element.
There are many other methods that can be used to achieve this; to become more familiar with them, see the Selenium WebDriver documentation at https://selenium-python.readthedocs.io/installation.html.
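One caveat: the find_element_by_* helpers were removed in Selenium 4.3 in favor of find_element and find_elements combined with a By locator. As a rough sketch of the equivalent calls, assuming Selenium 4 (both function names are hypothetical):

```python
def first_link(container):
    """Return (text, href) of the first <a> inside a driver or element."""
    from selenium.webdriver.common.by import By  # local import, sketch only

    # find_element returns the first match and raises
    # NoSuchElementException when nothing matches.
    link = container.find_element(By.TAG_NAME, "a")
    return link.text, link.get_attribute("href")


def all_link_targets(container):
    """Return the href of every <a> inside a driver or element."""
    from selenium.webdriver.common.by import By  # local import, sketch only

    # find_elements returns a list, which is empty when nothing matches.
    links = container.find_elements(By.TAG_NAME, "a")
    return [link.get_attribute("href") for link in links]
```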
Now that we are familiar with the structure of the DOM and with the methods we need, we have to find the table with the information we want. Luckily, since it is the only table on the page, we can easily locate it using find_element_by_tag_name("table").
Once we have the table, we retrieve all the rows within it as follows.
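A short sketch of these two lookups, assuming Selenium 4 locators and a hypothetical helper name get_table_rows:

```python
def get_table_rows(driver):
    """Return every <tr> of the first <table> on the current page."""
    from selenium.webdriver.common.by import By  # local import, sketch only

    # The page is assumed to contain a single table, so the first match is it.
    table = driver.find_element(By.TAG_NAME, "table")
    # Searching from the table element limits the results to its own rows.
    return table.find_elements(By.TAG_NAME, "tr")
```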
Now that we have all the table rows, we iterate over them, get the cells of each row, retrieve the data within those cells, build a dictionary, and save it in a list. While iterating, we skip the first row, since it contains only the header and not the information we are looking for. After retrieving all the information, we close the driver with driver.quit(), which closes all browser windows, terminates the driver session, and frees the memory.
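The loop could look roughly like this, assuming Selenium 4. The function names, the columns argument, and the limit of 30 rows are all assumptions; the column names have to be supplied by the caller because they are not listed here.

```python
def build_record(cell_texts, columns):
    """Pair each column name with the stripped text of the matching cell."""
    # zip() stops at the shorter input, so surplus cells or names are dropped.
    return dict(zip(columns, (text.strip() for text in cell_texts)))


def rows_to_records(rows, columns, limit=30):
    """Turn Selenium <tr> elements into a list of dictionaries."""
    # Local import keeps build_record usable without Selenium installed.
    from selenium.webdriver.common.by import By

    records = []
    for row in rows[1:limit + 1]:  # rows[0] is the header row, so skip it
        cells = row.find_elements(By.TAG_NAME, "td")
        records.append(build_record([cell.text for cell in cells], columns))
    return records
```

Once rows_to_records has returned, calling driver.quit() ends the session; wrapping the scraping in try/finally is one way to make sure that happens even if a lookup fails.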
Now that we have our list of dictionaries with all the information, we can create a dataset using pandas in the following way.
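A minimal sketch, assuming pandas is installed; the helper name records_to_dataframe and its optional columns argument are hypothetical:

```python
def records_to_dataframe(records, columns=None):
    """Build a pandas DataFrame from a list of dictionaries."""
    import pandas as pd  # local import, sketch only

    # Each dictionary becomes one row. Passing `columns` fixes the column
    # order; leaving it as None keeps the order of the dictionary keys.
    return pd.DataFrame(records, columns=columns)
```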
The final result is a dataset with the information of the top 30 movies of 2019.
Web scraping is a great way to acquire data. It is used for contact scraping and as a component of applications for web indexing, web mining and data mining, online price change monitoring and price comparison, product review scraping (to watch the competition), gathering real estate listings, weather data monitoring, website change detection, research, tracking online presence and reputation, web mashups, web data integration, and much more. That makes it a must-have tool for software engineers and data scientists.