Often the information you want to track is embedded within a web page. The site may offer an RSS feed that you can monitor, but that gives you no control over the contents of the feed. In most cases there is no feed, and you must manually revisit the site to scan for changes.
Sounds like a job for a Klip.
This article shows how to create a Klip that can parse an HTML page and extract specific content, in this example from an HTML table. The techniques in this article can be applied to any content in an HTML page.
HTML is XML
HTML is often described as a subset of XML, but there is a problem: most HTML on the web is not well-formed XML. You can see this by saving an HTML page to your hard disk with the extension .xml and opening it in your web browser. The browser will try to parse the page as XML and will usually start displaying errors.
KlipFolio’s XML parser, however, is non-validating and very forgiving. Unlike an XML parser that uses a complex Document Object Model (DOM) API for extracting content, KlipFolio offers a very powerful, yet very simple, Cascading Style Sheet-like syntax for parsing XML.
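The well-formedness problem is easy to reproduce outside a browser. The following is a minimal sketch in Python (not part of KlipFolio) that feeds two hypothetical snippets to a strict XML parser; the snippets and the helper name is_well_formed are made up for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if a strict XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Typical "tag soup": <br> is never closed and the attribute value is unquoted.
loose_html = "<html><body><p>Sales<br>by region</p><table width=341></table></body></html>"

# The same content written as well-formed XML.
strict_html = '<html><body><p>Sales<br/>by region</p><table width="341"></table></body></html>'

print(is_well_formed(loose_html))   # False
print(is_well_formed(strict_html))  # True
```

A strict parser gives up on the first snippet, which is why a forgiving parser is needed for pages found in the wild.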
Here is a sample HTML page with an embedded table.
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Untitled Document</title>
</head>
<body>
<h1>Table Example </h1>
<table width="341" border="1">
<tr>
<td width="66"> </td>
<td width="35"><strong>East</strong></td>
<td width="46"><strong>West</strong></td>
<td width="77"><strong>North</strong></td>
<td width="83"><strong>South</strong></td>
</tr>
<tr>
<td><strong>Region 1 </strong></td>
<td><div align="right">12.3</div></td>
<td><div align="right">34.5</div></td>
<td><div align="right">33.2</div></td>
<td><div align="right">901</div></td>
</tr>
<tr>
<td><strong>Region 2 </strong></td>
<td><div align="right">83</div></td>
<td><div align="right">32</div></td>
<td><div align="right">34</div></td>
<td><div align="right">98</div></td>
</tr>
</table>
</body>
</html>
Here's what the table looks like:

You can load this table from http://www.serence.com/support/samples/table_1.html.
Let’s build a Klip to monitor changes to this table.
Parsing an HTML table
To parse an XML source – any source – KlipFolio needs to know, at a minimum, two pieces of information specified in CSS. First, it needs to know the enclosing XML around each item (or row) of data. This lets it find items in the XML.
Second, it needs to know the enclosing XML around each element of data within an item. This lets it extract the pieces of data that make up an item, which users see as a column in the Klip or a row in an item’s tooltip.
Looking at our example HTML table, let’s start with the data. We see the data for each cell in a row of the table is enclosed in
<td> </td>
so it’s not difficult to pull out (we’ll show how in a moment). The HTML table itself contains two rows of data, each enclosed in
<tr>
</tr>
A typical web page usually contains many tables, so we need to be specific about which table we want to extract the data from. Looking at our example HTML, we see the tr’s enclosed in this table:
<table width="341" border="1"> ... </table>
The attribute width="341" is specific to this table, so we can use that attribute to uniquely identify this table. Here’s a first cut at the Klip:
<klip>
<identity>
<title>
Get Stats
</title>
</identity>
<locations>
<contentsource>
http://www.serence.com/support/samples/table_1.html
</contentsource>
<icon>
http://www.serence.com/support/samples/images/sample_icon.png
</icon>
<banner>
http://www.serence.com/support/samples/images/sample_banner.png
</banner>
</locations>
<style>
table[width="341"] {
type: item;
}
tr {
itemcol: 1;
noterow: 1;
content: cdata;
}
</style>
</klip>
Here are the results.
At this point, don’t worry if we’re not getting the proper data yet. The key is to get a Klip working that finds the proper table and extracts some data you can verify.
Our Klip has picked out the first row of data in the table, and we can see the raw XML because we specified the property content: cdata in the CSS.
tr {
itemcol: 1;
noterow: 1;
content: cdata;
}
This instructs KlipFolio’s XML parser not to do any processing on the extracted data and just display its contents. This mode is very helpful when parsing HTML, as it lets you see exactly what KlipFolio is extracting before it strips out XML elements or processes XML entities.
OK, we can see the raw data, but the Klip only extracted one row. Why? The answer is that we didn’t get the specification for item quite right. Note that the specification of item is as follows:
table[width="341"] {
type: item;
}
This says that the item is enclosed within
<table width="341"> </table>
That’s not quite specific enough: the item is enclosed within a
<tr>
which is enclosed within
<table width="341">
Here’s the correct specification for item.
table[width="341"] tr {
type: item;
}
td {
itemcol: 1;
noterow: 1;
content: cdata;
}
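The difference between the two item selectors can be checked outside KlipFolio. Here is a rough sketch in Python, using the standard library's XPath subset as a stand-in for KlipFolio's CSS-like matching (an assumption: the two engines are not the same, but the attribute test and the descendant step behave alike on this table). The markup is a trimmed copy of the sample table.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the sample table (header row plus two data rows).
TABLE = """
<body>
<table width="341" border="1">
<tr><td width="66"> </td><td width="35"><strong>East</strong></td></tr>
<tr><td><strong>Region 1 </strong></td><td><div align="right">12.3</div></td></tr>
<tr><td><strong>Region 2 </strong></td><td><div align="right">83</div></td></tr>
</table>
</body>
"""
root = ET.fromstring(TABLE)

# Analogue of: table[width="341"] { type: item; }
tables = root.findall(".//table[@width='341']")

# Analogue of: table[width="341"] tr { type: item; }
rows = root.findall(".//table[@width='341']//tr")

print(len(tables))  # 1
print(len(rows))    # 3
```

The first selector yields a single match (the whole table), so only one item can come out of it; the second yields one match per tr, header row included.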
Reload the Klip. It now shows the following:
Looking better. We now see three items. The CSS rule for
td is extracting the contents of the first
td in each row. But why are there three rows?
It’s because the table does have three rows: one header row and two rows of data. We want to skip the first row and extract just the two data rows. Let’s look again at the HTML for the table.
<table width="341" border="1">
<tr>
<td width="66"> </td>
<td width="35"><strong>East</strong></td>
<td width="46"><strong>West</strong></td>
<td width="77"><strong>North</strong></td>
<td width="83"><strong>South</strong></td>
</tr>
<tr>
<td><strong>Region 1 </strong></td>
<td><div align="right">12.3</div></td>
<td><div align="right">34.5</div></td>
<td><div align="right">33.2</div></td>
<td><div align="right">901</div></td>
</tr>
Notice the
<td>
elements for the first row also specify the attribute width, as in
<td width=N>
We want to match those td's that do not have this attribute. (For the remainder of this article, we'll refer to XML elements without the brackets.)
Modify the CSS definition for td to td:not([width]).
table[width="341"] tr {
type: item;
}
td:not([width]) {
itemcol: 1;
noterow: 1;
content: cdata;
}
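To see what the :not([width]) test keeps and drops, here is a small Python sketch (an illustration only, using simplified cell contents rather than the full sample markup). It lists every td that carries no width attribute.

```python
import xml.etree.ElementTree as ET

ROWS = """
<table width="341">
<tr><td width="66"> </td><td width="35">East</td></tr>
<tr><td>Region 1</td><td>12.3</td></tr>
<tr><td>Region 2</td><td>83</td></tr>
</table>
"""
table = ET.fromstring(ROWS)

# Analogue of td:not([width]): keep only cells that carry no width attribute.
data_cells = [td for td in table.iter("td") if "width" not in td.attrib]

print([td.text for td in data_cells])
# ['Region 1', '12.3', 'Region 2', '83']
```

None of the header cells survive the filter, so the header row has nothing to contribute to the Klip.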
And let’s reload the Klip.
We now have just the rows of data, but only the first column (the first
td). Let’s now extract the second
td.
To extract the second
td we add the following rule
td + td:not([width]) {
itemcol: 2;
noterow: 2;
content: cdata;
}
Read from right to left, this rule says: look for a
td that does not have a width attribute and is immediately preceded by another
td.
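The + combinator can be modelled in a few lines of Python. This sketch follows standard CSS semantics, where a + b matches b only when a is the sibling immediately before it; the helper names are hypothetical, and the idea that the longest matching chain decides the column is our assumption about how overlapping rules resolve, not something taken from KlipFolio documentation.

```python
import xml.etree.ElementTree as ET

ROW = ET.fromstring(
    "<tr><td>Region 1</td><td>12.3</td><td>34.5</td><td>33.2</td><td>901</td></tr>"
)


def matches_adjacent_chain(cells, index, chain_length):
    """True if cells[index] ends a run of chain_length consecutive td siblings
    and has no width attribute, mirroring 'td + ... + td:not([width])'."""
    if index < chain_length - 1:
        return False
    run = cells[index - chain_length + 1 : index + 1]
    return all(c.tag == "td" for c in run) and "width" not in cells[index].attrib


def column_of(cells, index):
    """Length of the longest chain that matches the cell (assumed to pick the column)."""
    return max(
        n for n in range(1, len(cells) + 1) if matches_adjacent_chain(cells, index, n)
    )


cells = list(ROW)

# Under standard CSS rules, 'td + td:not([width])' matches every cell after the first.
two_chain = [c.text for i, c in enumerate(cells) if matches_adjacent_chain(cells, i, 2)]
print(two_chain)  # ['12.3', '34.5', '33.2', '901']

print([column_of(cells, i) for i in range(len(cells))])  # [1, 2, 3, 4, 5]
```

Each extra td + in a rule pushes the match one cell further to the right, which is what lets a rule single out a particular column.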
Reload the Klip and we now see the following.
Looking good, Houston. Now, since we’re getting data, let’s drop the
content: cdata and just view the text.
table[width="341"] tr {
type: item;
}
td:not([width]) {
itemcol: 1;
noterow: 1;
}
td + td:not([width]) {
itemcol: 2;
noterow: 2;
}
Reload the Klip.
Wait! Where did the data go? Why did taking out
content: cdata cause all the data to disappear?
The reason is the default setting for the content property is
content: firstchild. This means KlipFolio only takes the first child of a matching element. To see why it’s empty, look at the contents of the first
td:
<td><strong>Region 1 </strong></td>
The first child after
td is an empty node – it’s between the
>< in
<td><strong>
It’s the same for the second td node. This causes both the first and second cells to be empty in the item. When KlipFolio tries to add the second empty item, it already finds one in the table, so it overwrites it. The result: a Klip with no data that has one unread item.
Why does KlipFolio default to content: firstchild? There are two reasons. First, when parsing XML, an element usually just contains data, as in:
<td>Region 1</td>
Here, the first child is the text node
Region 1. Second, picking out the first node saves processing time because KlipFolio does not need to scan for the closing
td. When parsing large sets of XML, this optimization saves CPU time.
But HTML tables are usually very small, and since the HTML contains formatting that we don’t care about in the Klip, we need to tell KlipFolio to do a bit of extra work.
To tell KlipFolio to scan for the closing td and process everything in between, we specify the content as
content: text.
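The same distinction shows up in most XML libraries. As a rough analogue in Python (an assumption: these helpers imitate the two content modes, they are not KlipFolio's implementation, and trimming whitespace is our choice), compare the text before the first child element with the text of the whole element:

```python
import xml.etree.ElementTree as ET

first_cell = ET.fromstring("<td><strong>Region 1 </strong></td>")
plain_cell = ET.fromstring("<td>Region 1</td>")


def first_child_text(element):
    """Text that appears before the first child element (rough analogue of content: firstchild)."""
    return element.text or ""


def all_text(element):
    """All text inside the element with markup dropped (rough analogue of content: text)."""
    return "".join(element.itertext()).strip()


print(repr(first_child_text(first_cell)))  # ''
print(repr(all_text(first_cell)))          # 'Region 1'
print(repr(first_child_text(plain_cell)))  # 'Region 1'
```

When a cell holds only plain text, both approaches agree; once formatting tags wrap the text, only the full scan finds it.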
Here’s the updated CSS.
table[width="341"] tr {
type: item;
}
td:not([width]) {
itemcol: 1;
noterow: 1;
content: text;
}
td + td:not([width]) {
itemcol: 2;
noterow: 2;
align: right;
content: text;
}
Reload the Klip. We now see our data.
It’s looking good. Let’s add the CSS entries to extract the contents of the third, fourth, and fifth
td from the data.
table[width="341"] tr {
type: item;
}
td:not([width]) {
itemcol: 1;
noterow: 1;
content: text;
}
td + td:not([width]) {
itemcol: 2;
noterow: 2;
align: right;
content: text;
}
td + td + td:not([width]) {
itemcol: 3;
noterow: 3;
align: right;
content: text;
}
td + td + td + td:not([width]) {
itemcol: 4;
noterow: 4;
align: right;
content: text;
}
td + td + td + td + td:not([width]) {
itemcol: 5;
noterow: 5;
align: right;
content: text;
}
Reload. Here’s the resulting Klip.
At this point the core of the Klip is working. We’re finding the right table and extracting only the rows of the data. Let’s take a look at the tooltip (referred to as note).
By default, KlipFolio uses the CSS matching criteria as the label for a note. Let’s use label and notelabel to make our Klip a bit more expressive.
table[width="341"] tr {
type: item;
}
td:not([width]) {
itemcol: 1;
noterow: 1;
content: text;
notelabel: false;
emphasis: strong;
label: 'Region';
}
td + td:not([width]) {
itemcol: 2;
noterow: 2;
align: right;
content: text;
label: 'East';
}
td + td + td:not([width]) {
itemcol: 3;
noterow: 3;
align: right;
content: text;
label: 'West';
}
td + td + td + td:not([width]) {
itemcol: 4;
noterow: 4;
align: right;
content: text;
label: 'North';
}
td + td + td + td + td:not([width]) {
itemcol: 5;
noterow: 5;
align: right;
content: text;
label: 'South';
}
Reload and we now see the Klip has labels for each row. We used notelabel: false to hide the label for the first row and set the emphasis: strong to make it a title.
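For comparison, here is how the whole extraction (find the table's rows, skip cells with a width attribute, take the text, attach labels) might look as a standalone Python sketch. It is an illustration of the same steps, not a replacement for the Klip; the function name extract_items is made up, and the labels are copied from the CSS above.

```python
import xml.etree.ElementTree as ET

TABLE = """
<table width="341" border="1">
<tr><td width="66"> </td><td width="35"><strong>East</strong></td>
<td width="46"><strong>West</strong></td><td width="77"><strong>North</strong></td>
<td width="83"><strong>South</strong></td></tr>
<tr><td><strong>Region 1 </strong></td><td><div align="right">12.3</div></td>
<td><div align="right">34.5</div></td><td><div align="right">33.2</div></td>
<td><div align="right">901</div></td></tr>
<tr><td><strong>Region 2 </strong></td><td><div align="right">83</div></td>
<td><div align="right">32</div></td><td><div align="right">34</div></td>
<td><div align="right">98</div></td></tr>
</table>
"""
LABELS = ["Region", "East", "West", "North", "South"]


def extract_items(table_markup):
    """Return one dict per data row, keyed by the labels used in the Klip."""
    table = ET.fromstring(table_markup)
    items = []
    for tr in table.iter("tr"):
        # Same test as td:not([width]): header cells all carry a width attribute.
        cells = [td for td in tr.findall("td") if "width" not in td.attrib]
        if not cells:
            continue  # header row: nothing left after filtering
        # Same idea as content: text, with surrounding whitespace trimmed.
        values = ["".join(td.itertext()).strip() for td in cells]
        items.append(dict(zip(LABELS, values)))
    return items


for item in extract_items(TABLE):
    print(item)
```

Each printed dictionary corresponds to one item in the Klip, with the keys playing the role of the note labels.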
Since the Klip is going to monitor a table for changes, we want to make it a dashboard Klip. We explain how a dashboard Klip works in the cookbook article
http://www.klipfolio.com/index.php?action=dev,cookbook_item&item=15& .
In short, you need only add some JavaScript that uses the KlipFolio API to make a Klip a dashboard Klip.
Here’s the full Klip with JavaScript.
<klip>
<identity>
<title>
Get Stats
</title>
</identity>
<locations>
<contentsource>
http://www.serence.com/support/samples/table_1.html
</contentsource>
<icon>
http://www.serence.com/support/samples/images/sample_icon.png
</icon>
<banner>
http://www.serence.com/support/samples/images/sample_banner.png
</banner>
</locations>
<style>
table[width="341"] tr {
type: item;
}
td:not([width]) {
itemcol: 1;
noterow: 1;
content: text;
notelabel: false;
emphasis: strong;
label: 'Region';
}
td + td:not([width]) {
itemcol: 2;
noterow: 2;
align: right;
content: text;
label: 'East';
}
td + td + td:not([width]) {
itemcol: 3;
noterow: 3;
align: right;
content: text;
label: 'West';
}
td + td + td + td:not([width]) {
itemcol: 4;
noterow: 4;
align: right;
content: text;
label: 'North';
}
td + td + td + td + td:not([width]) {
itemcol: 5;
noterow: 5;
align: right;
content: text;
label: 'South';
}
</style>
<klipscript>
<![CDATA[
function onRefresh()
{
Items.autoremove = false;
var req = Engines.HTTP.newRequest (Prefs.contentsource);
if (!req.send())
{
return false;
}
var data = req.response.data;
if (!data.length || req.response.status != 200)
{
return (req.response.status != 304);
}
// Show only current items in source
Items.purge (true);
return Engines.KlipFood.process (data);
}
]]>
</klipscript>
</klip>
In summary, when parsing HTML, don’t worry about extracting all the data at first; instead, focus on uniquely identifying the specific table you want to match. There are usually attributes in the table that make it easy to locate.
To show some data, you can usually use the following CSS specification for td:
td {
itemcol: 1;
noterow: 1;
content: cdata;
}
This just shows the raw contents of the first td. Remember: you need to both match the specific table and include a
tr to specify an item as being contained within a row.
table[width="341"] tr {
type: item;
}
The rest is just iteratively refining the CSS specifications to extract the contents of each row. Switch to
content: text to have KlipFolio convert the raw XML to text. You can usually specify a specific
td by enumerating the
td's that precede it. If there is something unique about the
td, such as a style or attribute, you can use that as the matching criteria instead.