Thursday, 19 July 2012

A straightforward method for detecting RSS and Atom feeds on websites with HtmlAgilityPack

Let's create a tool that can query any website and detect whether it provides feeds in one of the common syndication formats. By using HtmlAgilityPack we can apply screen-scraping techniques and examine the <link> tags to see what alternative formats are available.

A little background

If you haven't heard about HtmlAgilityPack yet then you have landed on the wrong post. Head over to the main article index and check out the introduction post.

Syndication feeds? RSS? Atom?

Well, if you don't know what these are then you probably don't have much interest in reading an article about detecting them, but for completeness I will provide a brief introduction.

It's possible to make a copy of a website's news posts available in a special computer-readable format. There are two main standards around: RSS and Atom. Collectively this is referred to as syndication.

These data feeds are normally used in two main ways:

  1. Collected together in special software called feed readers
  2. Republished on other websites as content

Users can add feeds into feed readers to consume the feeds. By keeping feeds inside a feed reader users can read the latest news from their favourite websites without having to check every site for updates.

For content creators it means that their content can easily be integrated into other websites. The other websites like the fresh content and the content creators like being able to generate traffic back to their websites by linking to themselves within the articles.

For a much deeper discussion of this topic check out this Wikipedia article:

Feed discovery

There is a standard way to detect whether a website provides an alternative RSS or Atom feed. This process is called autodiscovery; it has been standardised by the RSS Advisory Board and adopted by the Atom feed format.

The mechanism is simply a <link> tag placed in the <head> of a webpage. The basic rules are:

  • Set rel to "alternate"
  • Each href must be a different feed
  • The type must contain the feed's mime type

You can have as many feeds as you want in a page and you can mix and match feed formats.

This web site uses the following tags to allow feed autodiscovery.

<link rel="alternate" type="application/atom+xml" title="Run Tings Proper - Atom" href="http://runtingsproper.blogspot.com/feeds/posts/default" />
<link rel="alternate" type="application/rss+xml" title="Run Tings Proper - RSS" href="http://runtingsproper.blogspot.com/feeds/posts/default?alt=rss" />

You can read more about RSS autodiscovery here:

Writing the application

We are going to build this application together from scratch. I am actually writing the sample as I write the article so the steps I describe are the steps I’m taking - there won't be any Blue Peter "here's one I made earlier" style magic!

The first step is to load up Microsoft Visual Studio. I am using Visual Studio 11 Beta but I'm not planning on using any special tricks so it should be virtually identical even if you are on an older or Express version.

Now we can create a new website project.

  1. Load up Visual Studio
  2. Click File | New | New Website
  3. Choose an asp.net website template and the C# language. In my Visual Studio I picked .NET Framework 4, Visual C#, ASP.NET Empty Web Site.
  4. Give your project a name. I called mine "HtmlAgilityPackExample-DetectSyndicationFeeds" but you can call it whatever you like.
  5. Click OK to create your new website.

A simple start to the tutorial! As this tutorial relies on the HtmlAgilityPack we are going to need to download it.

Download the HtmlAgilityPack – NuGet

This article was originally written back in 2009 before we had the joys of NuGet in our lives but it never got finished. Well we should all have NuGet installed now so I’ll run through this section really quickly.

If you don’t have NuGet then either get it or follow the guide in the next section to install HtmlAgilityPack manually.

  1. Click the Website menu (or right click on the top project node in your Solution Explorer window)
  2. Choose Manage NuGet packages…
  3. Click the Online tab in the left hand side then All
  4. Search for HtmlAgilityPack in the top right hand corner
  5. Click the Install button on the HtmlAgilityPack entry in the results list that's in the centre of the window.

Download the HtmlAgilityPack – Manual

Note: You can skip this section if you just followed the NuGet installation instructions.

For our purposes we only need the binary release of the HtmlAgilityPack. Go to the site below and download the latest binary release.

At the time of writing the latest version is 1.4.0.

I recommend extracting this to a common location. Personally I keep all my 3rd party library downloads in F:\libraries\ and when I need one I can easily browse to it from any project.

If you follow this method then one advantage is that the assembly will have a .refresh file automatically generated for it. This allows Visual Studio to check that it has the latest version on your hard drive before compiling the main project.

In any case you should now do the following:

  1. Extract the archive and make a note of the path.
  2. Return to Visual Studio and right click on the Website menu then choose Add Reference.
    • If there isn't a menu called Website just move your mouse over to the Solution Explorer tool window and click on any file inside your website. The Project menu will be replaced with a Website menu.
  3. Click the Browse tab and navigate to the location you extracted the code to.
  4. Select HtmlAgilityPack.dll and click OK

Lay out the user interface

The next stage of the article is to lay out the user interface. The interface will have a TextBox to take the website address, a Button to fire off the request, and a GridView to display the results.

  1. Click Website | Add New Item (or press Ctrl-Shift-A)
  2. Select Web Form and click Add (leaving the name as Default.aspx, Place code in separate file ticked and Select master page unticked)
  3. Either copy and paste the code below or drop a TextBox, Button and GridView onto the page until your page looks like this:
<%@ Page Language="C#" AutoEventWireup="true" CodeFile="Default.aspx.cs" Inherits="_Default" %>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

<html xmlns="http://www.w3.org/1999/xhtml">
<head runat="server">
    <title>Html Agility Pack Example - Extract All Syndication Feeds From A Web Page</title>
</head>
<body>
    <form id="form1" runat="server">
    <div>
        <h1>
            Html Agility Pack Example - Extract All Syndication Feeds From A Web Page</h1>
        <p>
            This example shows you how to use Html Agility Pack to load a remote url and parse
            out the urls of all the syndication feeds it contains.</p>
        <div>
            <asp:Label ID="Label1" runat="server" Text="URL:" />
            <asp:TextBox ID="Url" runat="server" Text="http://www.cnn.com/"></asp:TextBox>
            <asp:Button ID="RetrieveButton" runat="server" Text="Get Feeds"/>
        </div>
        <asp:GridView ID="ResultsGrid" runat="server">
        </asp:GridView>
    </div>
    </form>
</body>
</html>

Making a DataObject

We are going to package up the code that does all the heavy lifting inside a class of its own. This will allow us to reuse the code on multiple pages without duplicating the code.

As an added bonus we can tag this class up with DataObject attributes and then codelessly bind it to a databound control via an ObjectDataSource.

  1. Right click on the App_Code folder
    • If the App_Code folder isn't in your project just right click on the project node and choose Add ASP.NET folder | App_Code
  2. Choose Add New Item
  3. Select Class and call the class SyndicationFeedsDataObject.cs
  4. Add a using statement to the top:
    using System.ComponentModel;
    using HtmlAgilityPack;
  5. Tag the class with a [DataObject(true)] attribute
    /// <summary>
    /// Summary description for SyndicationFeedsDataObject
    /// </summary>
    [DataObject(true)]
    public class SyndicationFeedsDataObject
    {
    }
  6. Add a Select() method to the class that has the following signature:
    public IEnumerable<SyndicationFeedsDataObject> Select(string WebsiteUrl)
  7. The last part is to tag the Select() method with the correct attribute:
    [DataObjectMethod(DataObjectMethodType.Select, true)]
    public IEnumerable<SyndicationFeedsDataObject> Select(string WebsiteUrl)
  8. At this point your complete class should look like this:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.ComponentModel;
using HtmlAgilityPack;

/// <summary>
/// Summary description for SyndicationFeedsDataObject
/// </summary>
[DataObject(true)]
public class SyndicationFeedsDataObject
{
    public SyndicationFeedsDataObject()
    {
    }

    [DataObjectMethod(DataObjectMethodType.Select, true)]
    public IEnumerable<SyndicationFeedsDataObject> Select(string WebsiteUrl)
    {
        throw new NotImplementedException();
    }
}

Adding bindable public properties

We need to add some public properties to our DataObject so that we have something to bind against in our GridView. I think that we need to capture the following information about the feed:

  • Feed Url
  • Title
  • Mime Type

To add a property you should go to the top of the class definition and type:

prop [tab] [tab]

This is a shortcut in Visual Studio which lays out what is known as an Automatic Property. It looks like this by default:

public int MyProperty { get; set; }

Your cursor is positioned at the data type so that you can enter it (like string). Pressing Tab again lets you input the property name. Finally, press Enter to complete the property.

Now you know how to do this you need to add properties until your class looks like this:

[DataObject(true)]
public class SyndicationFeedsDataObject
{
    public string FeedUrl { get; set; }
    public string Title { get; set; }
    public string MimeType { get; set; }

    // ...
}

Preparing the Select() statement

The Select() method we stubbed out earlier is going to be the main workhorse of this application. It will take in a website url, validate it, retrieve the html page located at that url, extract the feed urls and return that data.

If an exception occurs that's going to prevent a valid collection of feeds being returned then I intend to return null. The data bound control will handle this seamlessly and execute its empty dataset code.

When I want to start a new method I usually like to put in a series of comments which describe what the method is going to do. This gives focus to the work that I am doing. This approach is discussed in detail in Steve McConnell's book Code Complete (if you buy it make sure you get the second edition).

Following that technique lets us lay out the plans we just made so we can get started on the coding:

    [DataObjectMethod(DataObjectMethodType.Select, true)]
    public IEnumerable<SyndicationFeedsDataObject> Select(string WebsiteUrl)
    {
        // parse the url
        // retrieve the website html
        // extract all tags
        // convert extracted tags to SyndicationFeedsDataObjects
        // return the data
    }

Url Validation

So the first part should be pretty simple. If you pass an invalid url to HtmlAgilityPack it throws a UriFormatException. We don't want to force a user to remember the http:// at the start, so if they forget it we will add it in this method. We can then check that the url is formatted correctly by trying to instantiate a new Uri() from it:

    // Note: Regex lives in System.Text.RegularExpressions, so add
    // "using System.Text.RegularExpressions;" at the top of the file.
    private static string ParseUrl(string WebsiteUrl)
    {
        if (string.IsNullOrEmpty(WebsiteUrl))
        {
            return string.Empty;
        }

        if (!Regex.IsMatch(WebsiteUrl, "^http://", RegexOptions.IgnoreCase))
        {
            WebsiteUrl = "http://" + WebsiteUrl;
        }

        try
        {
            Uri TestUrl = new Uri(WebsiteUrl);
        }
        catch (UriFormatException)
        {
            WebsiteUrl = string.Empty;
        }

        return WebsiteUrl;
    }

This could have been done in the UI using a RegularExpressionValidator control or even a CustomValidator control, but I felt that if we are going to make it idiot-proof then it should be a seamless implementation rather than forcing the user to read an error message and then type the http:// themselves.

Using HtmlDocument and HtmlWeb

Now that we know we have a valid url to rely on, we can make our first use of HtmlAgilityPack in this tutorial.

The HtmlDocument class is the main class used by HtmlAgilityPack. This is the workhorse which contains all of the nodes and attributes within a document. There are two ways to load html into this class.

The first one (which was demonstrated in my easily extracting links from a snippet of html article) is to use the HtmlDocument methods. LoadHtml() will load html in from a string. Load() will load html from a path, Stream, or TextReader.

The new trick that we are going to use today is to load remote html with a simple command. The HtmlWeb class included with HtmlAgilityPack will take a valid url, retrieve the contents and return them as an HtmlDocument.

The code looks like this:

HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(WebsiteUrl);

Simple!
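Later in the article our Select() method will call this two-liner through a helper named RetrieveHtml(). A minimal sketch of that wrapper (assuming it does nothing more than the two lines above behind a descriptive name) would be:

```csharp
// Wraps the HtmlWeb call so callers never touch
// HtmlAgilityPack's loading API directly.
private static HtmlDocument RetrieveHtml(string WebsiteUrl)
{
    HtmlWeb hw = new HtmlWeb();
    return hw.Load(WebsiteUrl);
}
```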

Link tag extraction

This part of the code is where we really see the power of HtmlAgilityPack. All of the tags in the html document we just retrieved are placed under the document's DocumentNode property. To select them based on your requirements you can pass an XPath statement to its SelectNodes() method.

XPath is an official W3C standard designed to give you a structured query language for extracting nodes from an xml document. In this case it is being applied to the (probably) invalid html mark-up by the HtmlAgilityPack library.

Take a look at the following XPath statement:

//link[(@type='application/rss+xml' or @type='application/atom+xml') and @rel='alternate']

What it is looking for is:

  • all <link> tags
  • where the type attribute is equal to either an rss or atom feed
  • and the rel attribute is equal to alternate

The rel="alternate" is important because without it you can get false positives. For example, the Atom format lets a blog advertise, as a <link> tag, the endpoint you connect to when you want to post an item.

<link rel="service.post" type="application/atom+xml" title="Example" href="http://www.example.com/endpoint" />

The type attribute is the same as above but the rel attribute is "service.post".

So to wrap this up in a convenient method we pass the XPath statement above to the SelectNodes() method, which returns an HtmlNodeCollection containing only the tags we wanted. It looks like this:

private static HtmlNodeCollection SelectFeeds(HtmlDocument doc)
{
    return doc.DocumentNode.SelectNodes("//link[(@type='application/rss+xml' or @type='application/atom+xml') and @rel='alternate']");
}

Preparing for Data Binding

The last stage is to convert all of the data extracted so far into a format that can be data bound. The reason we need to convert it is that the HtmlNodes inside the HtmlNodeCollection contain weakly typed attributes that look like this:

link.Attributes["href"].Value

So the conversion is required in order to be able to pass these in to our databound controls and access them using clean notation such as:

Eval("FeedUrl");

This means using a helper method to convert the HtmlNodes into SyndicationFeedsDataObjects.

Apart from being a requirement in this case it's also a good programming practice. This is known as a code seam. By converting to our own class before we return the data we are actually fully encapsulating our use of HtmlAgilityPack. The dataobject takes a string for input (the website url) and returns a collection of itself (the feeds it found at the website url). As far as the outside world is concerned HtmlAgilityPack doesn't exist.

This means that if you write an application that depends on the SyndicationFeedsDataObject and then find some whizzo new html parsing library, you can swap it out without changing any of the calling code. For our sample application this doesn't mean much but in your full-blown application you will certainly appreciate not having to change line after line of code to remove your dependency on the HtmlNode class.

A simple method which uses a linq query to translate the data would look like this:

private static IEnumerable<SyndicationFeedsDataObject> ConvertNodesToSyndicationFeeds(HtmlNodeCollection feeds)
{
    var query = from link in feeds
                select new SyndicationFeedsDataObject
                {
                    FeedUrl = link.Attributes["href"].Value,
                    Title = link.Attributes["title"].Value,
                    MimeType = link.Attributes["type"].Value
                };
    return query;
}

It is written in LINQ query syntax and projects the data into the new class using object initializers. At this point I'm just pleased to be using some of the technical jargon that I've been brushing up on recently. If you don't understand the snippet above then it is far beyond the scope of this tutorial to start explaining it, but I will say that I have been very happy with my recent purchase of C# 4.0 in a Nutshell (no affiliate codes embedded in there - I'm just a happy customer).

Disregarding the technical jargon in the last paragraph, you should be able to see that it simply copies the values of one class into another and returns a collection of the converted objects.
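One caveat worth noting: Attributes["title"] returns null when a <link> tag omits that attribute, so the query above would throw a NullReferenceException on such a tag. HtmlAgilityPack's GetAttributeValue() method accepts a default value instead, so a more defensive variant of the conversion (same shape, just tolerant of missing attributes) might look like this:

```csharp
private static IEnumerable<SyndicationFeedsDataObject> ConvertNodesToSyndicationFeeds(HtmlNodeCollection feeds)
{
    // GetAttributeValue returns the supplied default rather than
    // throwing when the attribute is missing from the tag.
    var query = from link in feeds
                select new SyndicationFeedsDataObject
                {
                    FeedUrl = link.GetAttributeValue("href", string.Empty),
                    Title = link.GetAttributeValue("title", string.Empty),
                    MimeType = link.GetAttributeValue("type", string.Empty)
                };
    return query;
}
```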

Putting our Select() together

A few sections back we made a list of comments that outlined the plan for the Select(). We then spent the next few sections writing the key functionality that would be needed to turn these comments into working code.

This means that we can now flesh out each of these comments into blocks of code.

// parse the url
WebsiteUrl = ParseUrl(WebsiteUrl);

if (!IsValidUrl(WebsiteUrl))
{
    return null;
}

The IsValidUrl() is just a simple check to make sure the string isn't empty:

private static bool IsValidUrl(string WebsiteUrl)
{
    return !string.IsNullOrWhiteSpace(WebsiteUrl);
}

I could have put this check inline but after reading Uncle Bob’s book Clean Code I developed a habit of showing the intention of my code by refactoring it into many small functions.

Then we use SelectFeeds() together with a RetrieveHtml() helper that wraps the HtmlWeb snippet from earlier to pull the data in, plus a quick sanity check on the data:

// retrieve the website html
HtmlDocument doc = RetrieveHtml(WebsiteUrl);

// extract all tags
HtmlNodeCollection feeds = SelectFeeds(doc);

if (!IsValidFeedCollection(feeds))
{
    return null;
}

IsValidFeedCollection() is another one of my seemingly pointless methods which simply checks whether the input is null. Again I could have put this inline but it allows me to express what my intention was for the if statement. It prevents me from having to drop down to a lower level of abstraction. In fact in “real” code I would probably have used var rather than HtmlDocument and HtmlNodeCollection as it would make it read a lot more smoothly, and the datatypes are not important at this stage. I only included the types verbosely for the sake of introducing them in this tutorial.

Here is that IsValidFeedCollection function for the curious:

private static bool IsValidFeedCollection(HtmlNodeCollection feeds)
{
    return feeds != null;
}

Then I throw the collected feeds data through the conversion routine and return it:

// convert extracted tags to SyndicationFeedsDataObjects
var query = ConvertNodesToSyndicationFeeds(feeds);

// return the data
return query;

We are working with IEnumerable<> collection types so this is easily understood by the data bound controls that are going to use the DataObject we just made.
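For reference, stitching the fragments from the last few sections together in order gives the complete Select() method:

```csharp
[DataObjectMethod(DataObjectMethodType.Select, true)]
public IEnumerable<SyndicationFeedsDataObject> Select(string WebsiteUrl)
{
    // parse the url
    WebsiteUrl = ParseUrl(WebsiteUrl);

    if (!IsValidUrl(WebsiteUrl))
    {
        return null;
    }

    // retrieve the website html
    HtmlDocument doc = RetrieveHtml(WebsiteUrl);

    // extract all tags
    HtmlNodeCollection feeds = SelectFeeds(doc);

    if (!IsValidFeedCollection(feeds))
    {
        return null;
    }

    // convert extracted tags to SyndicationFeedsDataObjects
    var query = ConvertNodesToSyndicationFeeds(feeds);

    // return the data
    return query;
}
```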

How to bind our results to the GridView

Because of the DataObject attributes we applied to our code earlier on, binding this to a GridView is a codeless procedure. We don’t get away with it quite that easily though, because there is still an ObjectDataSource that needs to be configured. You shouldn’t see anything new here: it’s a simple ObjectDataSource, with a control parameter set up so that it pulls the url in the textbox into the Select() method. There isn’t even any code in the button click event – it’s only there to trigger a postback and then the ObjectDataSource does the rest.

<asp:GridView ID="ResultsGrid" runat="server" DataSourceID="ObjectDataSource1">
</asp:GridView>
<asp:ObjectDataSource ID="ObjectDataSource1" runat="server" 
    OldValuesParameterFormatString="original_{0}" SelectMethod="Select" 
    TypeName="SyndicationFeedsDataObject">
    <SelectParameters>
        <asp:ControlParameter ControlID="Url" Name="WebsiteUrl" PropertyName="Text" 
            Type="String" />
    </SelectParameters>
</asp:ObjectDataSource>

If anything went wrong in the code we simply returned null. The GridView’s default behaviour in this case is to simply hide itself.

So congratulations, if you have diligently copied each of these snippets into their correct locations you will now have a working app! Press F5 to start it up and let’s see. If you don’t get any compile-time errors then you will see something like the screenshot below. If you do get errors then don’t worry - try to fix them or just jump to the download instructions to get the working files.

[Screenshot: the finished page listing the feeds detected at the default url]

The final code

Despite the epic length of this article you might be surprised to find that the final application weighs in at only 102 lines of code and a simple aspx file to drive it. The work we put in early on meant that the rest of the asp.net ecosystem knew how to play well together. This, coupled with the easy retrieval and manipulation of html that HtmlAgilityPack provides, has been a recipe for success.

Another big benefit to packaging up the code in a DataObject class is that in a few months’ time when you want to reuse this in another project you can rip it out easily without having to untangle dependencies or spend time trying to figure out how it all fits together. It uses standard development patterns and is decoupled. This is the kind of code you could take home to meet your mother!

What can I do with this data once I have found it?

Now that you’ve learned this new technique you have to ask yourself what you can do with it. An obvious next step is to pull the feeds in and display them. I’m not going to start a second tutorial here but I will introduce you to a man who has already swum in these waters:

It’s the tutorial I read when I first needed to display feed headlines in a site.

Download the sample application

I gave the listings and the explanations for everything you need to build the sample app but if you didn’t follow along at the time then you can download or fork this at GitHub:

Conclusion

We have covered a very large spectrum of asp.net development in this article. You should have learned a lot because we have covered:

  • Syndication feeds
  • Feed autodiscovery
  • Using HtmlAgilityPack to check a remote website for feeds
  • Public properties
  • The shortcut for creating Automatic Properties
  • Creating a bindable Data Object

I also hope you picked up a few good book recommendations and some programming techniques that will keep your code clean and professional.

More in this series

This article is part of a series. You can find more posts in this series here:
