Scraping Dynamic Web Content
I have a project in mind that is going to involve a lot of scraping from the websites of U.S. soccer leagues. While the leagues offer a lot of data about each match, we’re going to have to use some slightly non-standard tricks to get the data to load. Once that’s done we can use BeautifulSoup to extract the html elements.
What Needs to Happen
Ultimately, I’m going to want to scrape data from pages like this. As you can see, this page provides a lot of information about a particular football game (Red Bulls II v. FC Cincinnati, 8/19/2017).
Notice that there are elements that are still loading as you open the page. See, for example, the green loading bars.
This is an indication that the data is being loaded dynamically via JavaScript. Indeed, if you just requested this page (e.g. with the requests module) you’d find that a lot of the data elements are missing. In short:
Problem 1. We need to get the JavaScript on the page to load.
Now scroll to the top of the page and reload. Let the page sit for 10 seconds so that you are certain that the page is “loaded”. Now scroll down. Notice that the loading bars come back up and all of that data has to be fetched again. It looks like this information only loads once we scroll down! This is a smart move by whoever made this page (there’s a lot of data on the page!) but it represents a problem for us:
Problem 2. We need to programmatically scroll the page so that the full page loads.
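To see Problem 1 concretely, here is a minimal sketch of what a plain requests fetch turns up. It borrows the roster-table selector from the full example below; the point is just that the dynamically loaded table is unlikely to appear in the raw HTML response.

# A quick check of Problem 1: fetch the match page with requests alone and
# look for the roster table. Because the table is filled in by JavaScript,
# the plain HTML response most likely won't contain it.
import requests
from bs4 import BeautifulSoup

URL = 'http://www.uslsoccer.com/newyorkredbullsii-fccincinnati-905766'
selector = 'table.Opta-Striped.Opta-Squad'  # Roster table (same selector as the example below)

resp = requests.get(URL)
soup = BeautifulSoup(resp.text, 'lxml')
print(len(soup.select(selector)))  # Likely 0: the table was never rendered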
The Code
Here’s the full code along with a usage example:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait


class ScrapeDynamic(object):
    """
    ScrapeDynamic: Methods for scraping dynamic webpages.

    Information on:
      Basic concept: https://coderwall.com/p/vivfza/fetch-dynamic-web-pages-with-selenium
      Selenium scrolling: https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python
      Selenium waiting: http://selenium-python.readthedocs.io/waits.html

    Be sure to call ScrapeDynamic.stop() when you're done so that the
    browser/driver process is shut down.
    """

    def __init__(self, browserPath, browser='phantom'):
        """
        Input:
          - browserPath: Path to browser .exe
          - browser: Browser to use ['phantom', 'firefox']. (default='phantom') (Add later)
        Returns:
          A ScrapeDynamic object.
        """
        # Start the WebDriver (i.e. launch the browser)
        self.wd = webdriver.PhantomJS(executable_path=browserPath)

    def getUrl(self, url, selector):
        """
        Retrieve page source of a dynamic webpage. Waits until `selector` is
        visible before returning. Automatically scrolls to the bottom of the
        page to ensure that all JS loads.

        Inputs:
          - url: website url
          - selector: CSS selector to wait for
        Returns:
          Page source (i.e. suitable for BeautifulSoup).
        """
        # Begin to retrieve the URL
        self.wd.get(url)
        # Scroll to the bottom of the page so that it will load all elements
        self.wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait (up to 10 seconds) for the dynamically loaded elements to show up
        WebDriverWait(self.wd, 10).until(
            EC.visibility_of_element_located((By.CSS_SELECTOR, selector)))
        return self.wd.page_source

    def stop(self):
        # Shut down the browser and the driver process
        self.wd.quit()


if __name__ == "__main__":
    from bs4 import BeautifulSoup

    BrowserPath = 'C:/Program Files/PhantomJS/bin/phantomjs.exe'  # Path to browser .exe
    URL = 'http://www.uslsoccer.com/newyorkredbullsii-fccincinnati-905766'  # URL to retrieve
    selector = 'table.Opta-Striped.Opta-Squad'  # CSS element to wait for

    R = ScrapeDynamic(BrowserPath)
    html_page = R.getUrl(URL, selector)
    R.stop()

    soup = BeautifulSoup(html_page, 'lxml')
    element = soup.select(selector)
    print(element[0].prettify()[0:1000])
The basic idea is to programmatically launch and control a web browser. We initialize the ScrapeDynamic object by launching a browser.
In ScrapeDynamic.getUrl we request a particular web page (self.wd.get(url)). Here we’re really just opening the page like you would if you were surfing yourself.
With a few modifications you can run this code with Firefox, and there the fact that we’re just opening the page becomes really apparent (a sketch of that swap is below).
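For instance, here is a minimal sketch of the Firefox variant. It assumes Firefox and the geckodriver executable are installed; the geckodriver path shown is hypothetical.

# Sketch: swapping PhantomJS for Firefox so that you can watch the page load
# in a visible browser window. The geckodriver path is hypothetical; point it
# at your own install.
from selenium import webdriver

wd = webdriver.Firefox(executable_path='C:/Program Files/geckodriver/geckodriver.exe')
wd.get('http://www.uslsoccer.com/newyorkredbullsii-fccincinnati-905766')
# ...a window opens and the page loads just as it would if you browsed there yourself
wd.quit()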
In the next line we get the page to scroll to the bottom by executing one line of JavaScript. This takes care of Problem 2. The only thing left to do is check whether the information has loaded. This is where the selector argument comes in. The idea is that the code will keep looking for this selector; once the selector shows up on the page, the method returns. In the example script I’ve passed a CSS selector that corresponds to the table containing information about the players on each roster. Since I’m mostly going to be interested in this data, I want to make sure that it has loaded before the method returns the output.
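One thing to keep in mind: if the selector never appears within the 10-second limit, WebDriverWait raises a TimeoutException. Here is a small sketch of how you might guard against that, reusing the names from the example script:

# Sketch: guarding the wait so that a page that never produces the selector
# doesn't crash the scraper. TimeoutException is what WebDriverWait raises
# when the condition isn't met within the time limit.
from selenium.common.exceptions import TimeoutException

try:
    html_page = R.getUrl(URL, selector)
except TimeoutException:
    html_page = None  # The roster table never appeared; skip or retry this page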
Usage
To use the code you’ll:
- Initiate a ScrapeDynamic object with the path to your web browser. For more info on dependencies see below.
- Call the ScrapeDynamic.getUrl method with arguments:
  - URL: The url of the page to be scraped.
  - selector: A CSS selector, discussed more below.
- Tidy up: Shut down the ScrapeDynamic object.
The ScrapeDynamic.getUrl method will return the full html source of the page. This output is essentially what we would get if we scraped a static page with the requests module. We can then pass this on to a tool like BeautifulSoup to parse the elements.
By doing all of this setup once, we can save time when, for example, we want to scrape several pages (see the sketch below).
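Here is a minimal sketch of that reuse. The second URL is hypothetical; in practice it would be another match page with the same roster table.

# Sketch: reusing one ScrapeDynamic object (one browser launch) for several
# match pages. The second URL is hypothetical.
urls = [
    'http://www.uslsoccer.com/newyorkredbullsii-fccincinnati-905766',
    'http://www.uslsoccer.com/some-other-match',  # hypothetical
]
selector = 'table.Opta-Striped.Opta-Squad'

R = ScrapeDynamic('C:/Program Files/PhantomJS/bin/phantomjs.exe')
pages = [R.getUrl(u, selector) for u in urls]
R.stop()  # One tidy-up at the end, after all pages are fetched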
Dependencies
I’ve only tested this code on Windows 10 at the moment. You can see that there are certain features that assume a Windows-type system (e.g. the Windows-style path passed as the browser executable argument).
In terms of python packages you’ll need to:
- pip install selenium (Selenium): Use this to set up a “webdriver”, which can take commands and send them to the browser.
- pip install bs4 (BeautifulSoup4): Needed for the example and to manipulate any of the output returned by the getUrl method.
For this I also used the PhantomJS browser. It is nice because it is lightweight, headless, and designed for exactly this sort of thing.
Conclusion
That’s it. In future posts I’ll put this code to work to gather my data. Tomorrow I’m going to return to my drawing app with a big update.