Scraping Dynamic Web Content

I have a project in mind that is going to involve a lot of scraping from the websites of U.S. soccer leagues. While the leagues offer a lot of data about each match, we'll have to use some slightly non-standard tricks to get that data to load. Once that's done, we can use BeautifulSoup to extract the HTML elements.

What Needs to Happen

Ultimately, I’m going to want to be scraping data from pages like this. As you can see, this page provides a lot of information about a particular football game (Red Bulls II v. FC Cincinnati, 8/19/2017).

Notice that there are elements that load as you open the page. See, for example, the green loading bars. This is an indication that the data is being loaded dynamically via JavaScript. Indeed, if you just requested this page (e.g. with the requests module), you'd find that a lot of the data elements are missing. In short:

Problem 1. We need to get the javascript on the page to load.
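
You can verify this for yourself with a plain static request. Here's a quick check (using the same page and selector as the example further below); it should print an empty list because the roster table isn't in the raw HTML:

import requests
from bs4 import BeautifulSoup

URL = 'http://www.uslsoccer.com/newyorkredbullsii-fccincinnati-905766'
selector = 'table.Opta-Striped.Opta-Squad'

# Fetch the page statically, with no JavaScript execution
static_html = requests.get(URL).text
soup = BeautifulSoup(static_html, 'lxml')

# The roster table that the browser eventually renders isn't in the static source
print(soup.select(selector))  # expect: []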

Now scroll to the top of the page and reload. Let the page sit for 10 seconds so that you're certain the page is "loaded". Now scroll down. Notice that the loading bars come back and all of that data has to be fetched again. It looks like this information only loads once you scroll down! This is a smart move by whoever made this page (there's a lot of data on it!), but it presents a problem for us:

Problem 2. We need to programmatically scroll the page so that the full page loads.
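
The code below handles this with a single jump to the bottom of the page plus an explicit wait, which is enough for these match pages. If a page keeps loading content in stages as you scroll, a common extension (variations appear in the Stack Overflow thread linked in the code's docstring) is to scroll repeatedly until the page height stops growing. A sketch, where wd is a Selenium WebDriver:

import time

def scroll_until_stable(wd, pause=1.0):
    """Keep scrolling to the bottom until the page height stops growing."""
    last_height = wd.execute_script("return document.body.scrollHeight")
    while True:
        wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page's JavaScript time to load more content
        new_height = wd.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break
        last_height = new_height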

The Code

Here’s the full code with a leading example:

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium import webdriver

class ScrapeDynamic(object):
    """
    ScrapeDynamic: Methods for scraping dynamic webpages.

    Information on:
        Basic concept: https://coderwall.com/p/vivfza/fetch-dynamic-web-pages-with-selenium
        Selenium Scrolling: https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python
        Selenium waiting: http://selenium-python.readthedocs.io/waits.html

    Be sure to call ScrapeDynamic.stop() when you're done to shut down the
        WebDriver and its browser process.
    """

    def __init__(self, browserPath, browser='phantom'):
        """
        Input:
            - browserPath: Path to browser .exe
            - browser: Browser to use ['phantom', 'firefox']. (default='phantom'; Firefox support to be added later)
        Returns:
            A ScrapeDynamic object.
        """
        # Start the WebDriver (this launches the browser process)
        self.wd = webdriver.PhantomJS(executable_path=browserPath)

    def getUrl(self, url, selector):
        """
        Retrieve page source of dynamic webpage. Waits until `selector` loads to
            return. Automatically scrolls to bottom of page to ensure that all
            JS loads.

        Inputs:
            - url: website url
            - selector: CSS selector

        Returns:
            Page source (i.e. suitable for BeautifulSoup).
        """
        # Begin to retrieve the URL
        self.wd.get(url)
        # Scroll to the bottom of the page so that it will load all elements
        self.wd.execute_script("window.scrollTo(0, document.body.scrollHeight);")

        # Wait (up to 10 seconds) for the dynamically loaded elements to show up
        WebDriverWait(self.wd, 10).until(
                EC.visibility_of_element_located((By.CSS_SELECTOR, selector)))

        return self.wd.page_source

    def stop(self):
        self.wd.quit()


if __name__ == "__main__":
    from bs4 import BeautifulSoup

    BrowserPath = 'C:/Program Files/PhantomJS/bin/phantomjs.exe' # Path to browser .exe
    URL = 'http://www.uslsoccer.com/newyorkredbullsii-fccincinnati-905766' # URL to retrieve
    selector = 'table.Opta-Striped.Opta-Squad' # CSS selector to wait for

    R = ScrapeDynamic(BrowserPath)
    html_page = R.getUrl(URL, selector)
    R.stop()

    soup = BeautifulSoup(html_page, 'lxml')
    element = soup.select(selector)
    print(element[0].prettify()[0:1000])

The basic idea is to programmatically launch and control a web browser. We initialize the ScrapeDynamic object by launching a browser.

In ScrapeDynamic.getUrl we request a particular web page (self.wd.get(url)). Here we're really just opening the page as you would if you were surfing yourself. With a few modifications you can run this code with Firefox, and there the fact that we're just opening the page becomes readily apparent. In the next line we get the page to scroll to the bottom by executing one line of JavaScript. This takes care of Problem 2.

The only thing left is to check that the information has loaded, which is how we handle Problem 1. This is where the selector argument comes in: the code waits for this selector, and once it spots it in the HTML source the method returns. In the example script I've passed a CSS selector that corresponds to the table containing information about the players on each roster. Since I'm mostly interested in this data, I want to make sure that it has loaded before the method returns the output.
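
For reference, here's a minimal sketch of that Firefox swap. Treat it as an untested assumption: it presumes Firefox is installed and, on Selenium 3+, that you've downloaded geckodriver (the path below is a placeholder, not a real location):

from selenium import webdriver

# Same usage as PhantomJS, but a visible browser window pops up.
# The geckodriver path is a placeholder; adjust it for your machine.
wd = webdriver.Firefox(executable_path='C:/path/to/geckodriver.exe')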

Usage

To use the code you’ll:

  1. Initialize a ScrapeDynamic object with the path to your web browser. For more info on dependencies, see below.
  2. Call the ScrapeDynamic.getUrl method with arguments:
    • URL: The url of the page to be scraped.
    • selector: A CSS selector, discussed more below.
  3. Tidy up: shut down the ScrapeDynamic object by calling its stop method.

The ScrapeDynamic.getUrl method will return the full HTML source of the page. This output is essentially what we would get if we scraped a static page with the requests module. We can then pass it on to a tool like BeautifulSoup to parse the elements.

By doing all of this setup once, we can save time when, for example, we want to scrape several pages, as sketched below.
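
For instance, something like this (the list of URLs is hypothetical; BrowserPath and selector are the same variables as in the example above):

urls = [
    'http://www.uslsoccer.com/newyorkredbullsii-fccincinnati-905766',
    # ... more match pages
]

scraper = ScrapeDynamic(BrowserPath)
pages = {}
for url in urls:
    # One browser launch is shared across every page request
    pages[url] = scraper.getUrl(url, selector)
scraper.stop()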

Dependencies

I’ve only tested this code on Windows 10 at the moment. You can see that there are certain features that assume a Windows-type system (e.g. the need for an executable argument).

In terms of Python packages, you'll need to:

  • pip install selenium (Selenium). Use this to set up a "webdriver", which can take commands and send them to the browser.
  • pip install bs4 (BeautifulSoup4). Needed for the example and to manipulate any of the output returned by the getUrl method.

For this I also used the PhantomJS browser. It is nice because it is lightweight and designed for this sort of thing.

Conclusion

That’s it. In future posts I’ll put this code to work to gather my data. Tomorrow I’m going to return to my drawing app with a big update.