Web Scraping Simple Data Tables

Jack Overby
8 min read · Nov 9, 2020

Sometimes I see a table of data on a website and wonder “How is that information stored?” The data itself, of course, is situated within whatever database the website’s owner is using. However, the data is rendered on the page in simple HTML, and as it turns out, collecting and manipulating that data is pretty darn easy! All we need is our friend, the DOM.

The DOM

DOM stands for "Document Object Model". Basically, an HTML page is an object whose elements are also objects; each element can contain "child" objects and can itself belong to a parent object. The whole page can be thought of as a tree. Take this simple snippet of code, for instance:

<html class="client-js ve-available" lang="en" dir="ltr">
<head></head>
<body class="mediawiki ltr sitedir-ltr mw-hide-empty-elt ns-0 ns-subject page-Main_Page rootpage-Main_Page skin-vector action-view skin-vector-legacy">
<div id="mw-page-base" class="noprint"></div>
<div id="mw-head-base" class="noprint"></div>
<div id="content" class="mw-body" role="main"></div>
<div id="mw-data-after-content"></div>
<div id="mw-navigation"></div>
<footer id="footer" class="mw-footer" role="contentinfo"></footer>
<script></script>
<script></script>
<script></script>
<a accesskey="v" href="https://en.wikipedia.org/wiki/Main_Page?action=edit" class="oo-ui-element-hidden"></a>
<div id="mwe-popups-svg"></div>
</body>
</html>

The entire document is wrapped in an <html> tag. Within this tag, there are two child elements common to virtually all web pages: <head> and <body>. Within the body, we see a variety of child elements: <div> tags, as well as <script> and <a> tags. On the live page, those divs contain further child <div>s, some of which have children of their own, and so on.

Thanks to the DOM API that browsers expose to JavaScript, we can parse any HTML page and search for elements by class name, id, tag type, and so on. Thus, the data on any website can be collected, converted into arrays, and then stored or rendered on our own page! I am going to use these functions to scrape simple data from an NFL wagering site (note: I'm not promoting gambling here; this is purely for recreation).
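Before we turn to the site itself, here is a quick sketch of the kinds of lookup calls we will be leaning on, using the id and class names from the Wikipedia snippet above:

// A few common ways to look elements up in the DOM
// (these selectors refer to the Wikipedia snippet shown earlier)
const content = document.getElementById("content");          // a single element, or null
const noPrint = document.getElementsByClassName("noprint");  // a live HTMLCollection
const divs = document.getElementsByTagName("div");           // every <div> on the page
const footer = document.querySelector("footer.mw-footer");   // first match for a CSS selector
const scripts = document.querySelectorAll("script");         // a static NodeList of all matches

// Collections can be turned into plain arrays for easier manipulation
const divIds = Array.from(divs).map(el => el.id);
console.log(divIds);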

The Site (Vegas Insider)

This is the web page we will be exploring for this article. You can see the data displayed in a fairly straightforward table:

I want to get the following info for each one of these games (rows):

  • Game time (Day, month, time)
  • Home team name
  • Away team name
  • Spread (basically, the median expected outcome for the game. If a team is a 3.5-point favorite, Vegas expects them to win the game outright by 4+ points half the time and to win by 3 or fewer, or lose, half the time)

Vegas Insider lists the spreads for numerous books (e.g. Westgate Superbook, DraftKings), which often differ by half a point or so, so we’re looking for the “VI Consensus”, i.e. the most frequently appearing spread amount, which appears in the third column from the left.

Ultimately, I want to find the HTML representation of this table and store the data for each row as an object. The first row would look something like this:

{
"gameTime": "11/08/2020 1:00 PM",
"awayTeam": "Seattle Seahawks",
"homeTeam": "Buffalo Bills",
"spread": 3
}

Shouldn’t be too hard, especially with one of the most magical tools ever created: Chrome DevTools!

DevTools

By clicking on <tbody>, the table body is highlighted in blue

By pressing CTRL + SHIFT + J (CMD + OPTION + J on a Mac), the DevTools panel appears, with two main tabs that let us do wonderful things:

  1. Elements: This tab displays the page's HTML in a neat fashion. It also lets us click on part of the rendered page to highlight the underlying HTML, or click on a line of HTML to highlight the corresponding section of the page.
  2. Console: This tab lets us run snippets of JavaScript, which will come in handy here, since we can test whether our commands are collecting the right data (a couple of quick examples follow below).
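For instance, here are a couple of throwaway commands you might run in the Console just to get a feel for it (these work on any page, not just ours):

// Quick sanity checks to try in the Console on any page
console.log(document.title);                              // the page's title text
console.log(document.querySelectorAll("table").length);   // how many <table> elements the page has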

By clicking around on the Elements tab, we see that the table data is contained within the <table class="frodds-data-tbl"> tag. We want to capture the game data from each row within this table. So let's declare a variable called myTable, which represents the table:

const myTable = document.querySelector("table[class='frodds-data-tbl']");

We see several child elements within myTable (a quick console check, shown after this list, confirms the structure):

  • <colgroup>, which contains the info about the column headers (Open, VI Consensus, etc.)
  • <tbody>, which contains a bunch of child tags called <tr>
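Here is that quick check, using the myTable variable we just declared:

// List the tag names of myTable's direct children, e.g. ["COLGROUP", "TBODY"]
console.log(Array.from(myTable.children).map(el => el.tagName));

// Count the <tr> rows inside the table body
console.log(myTable.querySelectorAll('tbody tr').length);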

As it turns out, “tr” means “table row”. This could be what we’re looking for. When we click on the arrow to expand, we get this:

So the team info is embedded within <td> cells within the <tr> rows within the <tbody> table body. Is there a way to capture all the cells for a given row? Yes, thanks to querySelector and querySelectorAll.
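The difference between the two is worth a quick illustration (a generic sketch, not tied to any particular page):

// querySelector returns the FIRST element matching a CSS selector (or null if none match)
const firstCell = document.querySelector('td');

// querySelectorAll returns a static NodeList of ALL matching elements
const allCells = document.querySelectorAll('td');
console.log(allCells.length);
allCells.forEach(cell => console.log(cell.innerText));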

First, let’s create an interim variable, myRows, to get the rows:

const myRows = myTable.querySelectorAll('tr');

With further exploration, we can see that the home/away team & game time info lies within the <td class="viCellBg1 cellTextNorm cellBorderL1 gameCell"> tags within each <tr> row.

If we parse the text within that cell, we can get the data we need. Let's declare a variable row equal to the first element of myRows so we can run some tests:

let row = myRows[0];
let rowText = row.querySelector("td[class='viCellBg1 cellTextNorm cellBorderL1 gameCell']").innerText;
// This yields a multi-line string:
// "11/08 1:00 PM
// 451 Seattle
// 452 Buffalo"

Hooray, three of the four bits of data we need! Now we need to break this string up into its individual lines, which can be done with the split method:

rowText = rowText.split('\n');
["11/08 1:00 PM", "451 Seattle", "452 Buffalo"]

So the 0th element of the array gives us the date, while the 1st gives us the away team, and the 2nd the home team. Let’s create another sample object, rowData, to hold our data, which will be added bit by bit:

const rowData = {};

To turn “11/08 1:00 PM” into a Date object, we can simply use new Date():

let rowDate = rowText[0];
console.log(new Date(rowDate));
// Thu Nov 08 2001 13:00:00 GMT-0600 (Central Standard Time)

This isn’t quite what we need! Looks like we need to add the year info to this string. Let’s split up rowDate by space, add “2020” to the string and then try again:

rowDate = rowDate.split(' ');
// ["11/08", "1:00", "PM"]
// Append '/2020' to the end of the 0th element
rowDate[0] += "/2020";
console.log(rowDate);
// ["11/08/2020", "1:00", "PM"]
// Now, turn rowDate back into a string and try new Date() again
rowDate = rowDate.join(' ');
console.log(new Date(rowDate));
// Sun Nov 08 2020 13:00:00 GMT-0600 (Central Standard Time)

Bingo! We now have a Date object, which we can add to our rowData object:

rowDate = new Date(rowDate);
rowData['gameTime'] = rowDate;

For the team names, we need to strip out the numbers at the start. This can be done in a variety of ways, but we'll do it with regular expressions:

console.log(rowText[1]);
// "451 Seattle"
let awayTeam = rowText[1];
// Strip the leading number; allowing spaces in the match keeps
// multi-word cities like "New England" intact
awayTeam = awayTeam.match(/[A-Za-z][A-Za-z ]*/)[0].trim();
rowData['awayTeam'] = awayTeam;
console.log(rowText[2]);
// "452 Buffalo"
let homeTeam = rowText[2];
homeTeam = homeTeam.match(/[A-Za-z][A-Za-z ]*/)[0].trim();
rowData['homeTeam'] = homeTeam;

Now we have three out of the four pieces of data; all that's left is the consensus line. We need to go through the row's odds cells and pick the second one from the left (the game cell has its own class, so among the odds cells the VI Consensus sits at index 1), and then get its inner text:

const cells = row.querySelectorAll("td[class='viCellBg1 cellTextNorm cellBorderL1 center_text nowrap oddsCell']");
let spread = cells[1];

The spread cell HTML looks like this:

Looks like we need to get the innerText from the <a> tag, and split by newline, as before:

let spreadText = spread.querySelector('a').innerText.trim().split('\n');
let spreadAmount;
console.log(spreadText);
// ["-3 -10", "55u-10"]

So if the away team is favored, the spread will appear on the first line (index 0); if the home team is favored, it will appear on the second line (index 1). A line of "PK" ("pick 'em") means neither team is favored, i.e. a spread of zero. Here is some conditional logic to extract it:

// "PK" (pick 'em) means neither team is favored, so the spread is zero
if (spreadText[0].match(/PK/) || spreadText[1].match(/PK/)) {
  spreadAmount = 0;
}
// Away team favored: the spread leads the first line
// (parseFloat, rather than parseInt, keeps half-point spreads like -3.5 intact)
else if (spreadText[0].match(/^-[0-9.]+/)) {
  spreadAmount = -parseFloat(spreadText[0].match(/^-[0-9.]+/)[0]);
}
// Home team favored: the spread leads the second line
else if (spreadText[1].match(/^-[0-9.]+/)) {
  spreadAmount = parseFloat(spreadText[1].match(/^-[0-9.]+/)[0]);
}
rowData['spread'] = spreadAmount;

Finally, we’ll console log the rowData object to make sure we have everything:

{
'awayTeam': "Seattle",
'gameTime': Sun Nov 08 2020 13:00:00 GMT-0600 (Central Standard Time),
'homeTeam': "Buffalo",
'spread': 3
}

Now we have all our data for this row! The final step is to take all our code and apply it to every row on our table. First, we’ll create an array into which we’ll push our data object for every row, then we loop through all the rows:

const allGames = [];
const myTable = document.querySelector("table[class='frodds-data-tbl']");
const myRows = myTable.querySelectorAll('tr');

myRows.forEach(row => {
  // Skip any row that doesn't contain a game cell (e.g. header or spacer rows)
  const gameCell = row.querySelector("td[class='viCellBg1 cellTextNorm cellBorderL1 gameCell']");
  if (!gameCell) return;

  const rowData = {};

  // Game time: split the cell text into lines, add the year, and build a Date
  let rowText = gameCell.innerText;
  rowText = rowText.split('\n');
  let rowDate = rowText[0];
  rowDate = rowDate.split(' ');
  rowDate[0] += "/2020";
  rowDate = rowDate.join(' ');
  rowDate = new Date(rowDate);
  rowData['gameTime'] = rowDate;

  // Team names: strip the leading number, keeping multi-word cities intact
  let awayTeam = rowText[1];
  awayTeam = awayTeam.match(/[A-Za-z][A-Za-z ]*/)[0].trim();
  rowData['awayTeam'] = awayTeam;
  let homeTeam = rowText[2];
  homeTeam = homeTeam.match(/[A-Za-z][A-Za-z ]*/)[0].trim();
  rowData['homeTeam'] = homeTeam;

  // Spread: the VI Consensus is the second odds cell in the row
  const cells = row.querySelectorAll("td[class='viCellBg1 cellTextNorm cellBorderL1 center_text nowrap oddsCell']");
  let spread = cells[1];
  let spreadText = spread.querySelector('a').innerText.trim().split('\n');
  let spreadAmount;
  if (spreadText[0].match(/PK/) || spreadText[1].match(/PK/)) {
    spreadAmount = 0;
  } else if (spreadText[0].match(/^-[0-9.]+/)) {
    // parseFloat keeps half-point spreads (e.g. -3.5) intact
    spreadAmount = -parseFloat(spreadText[0].match(/^-[0-9.]+/)[0]);
  } else if (spreadText[1].match(/^-[0-9.]+/)) {
    spreadAmount = parseFloat(spreadText[1].match(/^-[0-9.]+/)[0]);
  }
  rowData['spread'] = spreadAmount;

  allGames.push(rowData);
});

This loops through every row and produces the necessary data object for each game, then pushes each object into our array of games. At the end of this, we will have all the data we need and can do with it as we wish!
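As one small example of "do with it as we wish", straight from the Console (copy() is a Chrome DevTools console utility, so it only works there):

// Display the scraped games as a sortable table in the Console
console.table(allGames);

// Copy the data to the clipboard as formatted JSON
copy(JSON.stringify(allGames, null, 2));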

Conclusion

Data scraping is not always the most elegant solution, as it requires digging into each site's HTML and writing a script tailored to the organization and naming conventions chosen by that site's designer. Thus, this sort of exercise does not always scale. Regardless, with patience, diligence, and a bit of JavaScript, you can get the data you want from any website. Pretty cool!
