Thursday, 19 November 2009

Easily extracting links from a snippet of HTML with HtmlAgilityPack

The HtmlAgilityPack is a powerful library that makes screen scraping in ASP.NET a breeze. This is the second post in a continuing series, and in it I demonstrate a way to extract all the links from a snippet of HTML.

A little background

If you haven't heard about HtmlAgilityPack yet then you have landed on the wrong post. Head over to my introduction to the subject and then come back and see me when you have read that.

How the sample application is going to work

The sample application is going to take a snippet of messy HTML stored in a text file. We are going to load it in, parse out all the <a href=""> tags and present these links in the browser by binding them to a GridView.

Let's take a look at the HTML snippet that we are going to load:

~/App_Data/HtmlSnippet.txt

<table border="0" cellpadding="2" cellspacing="7" style="vertical-align:top;"><tr><td width="80" align="center" valign="top"><font style="font-size:85%;font-family:arial,sans-serif"><a href="http://news.google.com/news/url?fd=R&amp;sa=T&amp;url=http%3A%2F%2Fwww.timesonline.co.uk%2Ftol%2Fnews%2Fworld%2Fus_and_americas%2Farticle6802128.ece&amp;usg=AFQjCNGnZL4BdTSWSglpAZdprg3u_tJVhg"><img src="http://nt2.ggpht.com/news/tbn/XrArEKXhTe6dLM/6.jpg" alt="" border="1" width="80" height="80" /><br /><font size="-2">Times Online</font></a></font></td><td valign="top"><font style="font-size:85%;font-family:arial,sans-serif"><br /><div style="padding-top:0.8em;"><img alt="" height="1" width="1" /></div><div><a href="http://news.google.com/news/url?fd=R&amp;sa=T&amp;url=http%3A%2F%2Fwww.latimes.com%2Fnews%2Fnationworld%2Fnation%2Fla-na-health-coop20-2009aug20%2C0%2C4258832.story&amp;usg=AFQjCNG4LI_9w3yHg7H8ZqUBaKNwzpgiuA"><b>Healthcare co-ops emerging as viable alternative</b></a><!-- snip -->

Well, that's not all of it, but I think you get the point. Like I said, this snippet originally came from a forum question. The HTML itself came from what looks like a Google news feed. I have kept it for this article because it shows that the HtmlAgilityPack can handle messy code, and also that it's not going to be tripped up by the extra URLs which are URL-encoded into it.

The normal approach of using a regular expression to extract this kind of information could be tricked by code like this. I am not a big fan of using regular expressions for this job because they are too brittle (although I am a big fan of regular expressions in general).
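To illustrate that brittleness, here is a small hypothetical example (not part of the sample application, markup invented for illustration) showing a naive href-matching regex silently missing links the moment the markup stops using double quotes:

```csharp
using System;
using System.Text.RegularExpressions;

class RegexBrittlenessDemo
{
    static void Main()
    {
        // A naive href-extracting pattern that assumes
        // double-quoted attribute values.
        var naive = new Regex("href=\"([^\"]*)\"");

        // Real-world markup is rarely that tidy: single-quoted and
        // unquoted attribute values are both common in messy HTML.
        string messy = "<a href='http://example.com/a'>one</a> " +
                       "<a href=http://example.com/b>two</a>";

        // The naive pattern matches nothing here, and fails silently.
        Console.WriteLine(naive.Matches(messy).Count); // 0
    }
}
```

A proper HTML parser normalises all three quoting styles into the same attribute, which is why it is the safer tool for this job.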

The main structure of the program

Here is the code for the Page_Load method for your perusal. It should give you an idea of the main steps this program takes to complete its tasks:

protected void Page_Load(object sender, EventArgs e)
{
    // load snippet
    HtmlDocument htmlSnippet = LoadHtmlSnippetFromFile();

    // extract hrefs
    List<string> hrefTags = ExtractAllAHrefTags(htmlSnippet);

    // bind to gridview
    GridViewHrefs.DataSource = hrefTags;
    GridViewHrefs.DataBind();
}

So as you can see, it takes three main steps: loading the snippet of HTML into the system, parsing it, and a final cosmetic stage of binding the results to a GridView.

In the first line we get hold of an HtmlDocument instance. This is a class which comes with the HtmlAgilityPack library, and it is the primary class you use to store a complete HTML document.

This brings us nicely to LoadHtmlSnippetFromFile().

Loading the html snippet from file

The second method we are going to look at is LoadHtmlSnippetFromFile().

It is a pretty simple method which loads in the full version of that horribly messy HTML snippet I showed you earlier. That is not to say it doesn't do anything educational, though. Let's take a look:

/// <summary>
/// Load the html snippet from the txt file
/// </summary>
private HtmlDocument LoadHtmlSnippetFromFile()
{
    HtmlDocument doc = new HtmlDocument();

    // using ensures the reader is closed even if Load() throws
    using (TextReader reader = File.OpenText(Server.MapPath("~/App_Data/HtmlSnippet.txt")))
    {
        doc.Load(reader);
    }

    return doc;
}

So as you can see, I have used File.OpenText(), one of the many ways to get a stream reader, to effortlessly load the HTML snippet txt file into memory.

Turning this stream of HTML text into a queryable document is the task of the Load() method on the HtmlDocument. If you poke around with IntelliSense on that method you will find that it has 10 overloads which let you use paths, streams or TextReaders with various encoding options.

If you already have the contents of an HTML document contained within a string (such as from a web service) then you can use LoadHtml() instead.
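As a minimal sketch of that second option (the markup string here is made up for illustration):

```csharp
using System;
using HtmlAgilityPack;

class LoadHtmlDemo
{
    static void Main()
    {
        // Markup already in memory, e.g. returned from a web service
        string html = "<p>Hello <a href=\"http://example.com/\">world</a></p>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // no file or stream needed

        // The document is immediately queryable
        Console.WriteLine(doc.DocumentNode.SelectSingleNode("//a").InnerText); // world
    }
}
```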

There is a third option which the HtmlAgilityPack supports, and that is retrieving the page over the internet via a URL. This is demonstrated in the next article, which explains how you can test if a web page contains an RSS or Atom feed, but for now it will remain a tantalising mystery.

I shouldn't have to say it, but make sure your stream always gets closed once your HtmlDocument is populated; wrapping the reader in a using block guarantees that even if Load() throws.

Extract all href tags from the document

This is the section we have all been waiting for: the part where the HTML parsing magic is done. While it is a deceptively simple method, it demonstrates many of the key building blocks you will use in your screen scraping endeavours.

/// <summary>
/// Extract all anchor tags using HtmlAgilityPack
/// </summary>
/// <param name="htmlSnippet"></param>
/// <returns></returns>
private List<string> ExtractAllAHrefTags(HtmlDocument htmlSnippet)
{
    List<string> hrefTags = new List<string>();

    // SelectNodes() returns null when the query matches nothing,
    // so guard against that before iterating
    HtmlNodeCollection links = htmlSnippet.DocumentNode.SelectNodes("//a[@href]");

    if (links != null)
    {
        foreach (HtmlNode link in links)
        {
            HtmlAttribute att = link.Attributes["href"];
            hrefTags.Add(att.Value);
        }
    }

    return hrefTags;
}

Looking at the code several things become clear:

  • The HtmlDocument class contains a collection of HtmlNodes
  • These HtmlNodes can be selected with an XPath query
  • The HtmlNodes can then be interrogated attribute by attribute with the HtmlAttribute class.

Using these three elements I have extracted a list of all the hrefs in the html snippet.

The use of XPath for extracting information out of HTML documents is key to the power of HtmlAgilityPack. If you don't know what XPath is, it is a query language that goes hand in hand with XML and lets you select nodes out of an XML document.

XML documents must be well-formed to be queried, and most HTML out on the web is far from well-formed. It is littered with unclosed tags, inconsistent capitalisation and syntax errors. A normal implementation of XPath can't be used to query HTML unless you clean it up first. The great thing about the implementation in HtmlAgilityPack is that it will do its best to extract the information regardless of the validity of the document.
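A quick sketch of that tolerance (the deliberately broken markup here is invented for illustration): unclosed tags, mixed-case tag names and single-quoted attributes would all break a strict XML parser, yet the XPath query still finds the link.

```csharp
using System;
using HtmlAgilityPack;

class MessyHtmlDemo
{
    static void Main()
    {
        // Unclosed <p> tags and SHOUTING tag names: nothing a browser
        // hasn't seen before, but fatal to a strict XML parser.
        string messy = "<P>first<p>second <A HREF='http://example.com/'>link";

        var doc = new HtmlDocument();
        doc.LoadHtml(messy); // no exception, best-effort parse

        // Tag and attribute names are normalised, so the lowercase
        // XPath query still matches the uppercase <A HREF=...> tag.
        Console.WriteLine(doc.DocumentNode.SelectNodes("//a[@href]").Count);
    }
}
```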

This gives us a very expressive way to describe the information we want to extract. In fact for many projects you will find that the hardest part is figuring out the correct XPath query to describe exactly what you want.

Our query //a[@href] means "select all a tags (HTML anchor tags) that have an href attribute", so an anchor that is only a named anchor is not selected.

The resulting collection is then iterated over in the foreach loop, where I read the href attribute from each a tag and put it into my final collection for databinding.
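If you prefer to stay out of XPath altogether, the same extraction can be sketched with the library's Descendants() helper and LINQ (markup invented for illustration; GetAttributeValue() also has the nice property of taking a default value rather than throwing):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

class DescendantsDemo
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<a href=\"http://example.com/a\">a</a>" +
                     "<a name=\"top\">named anchor</a>");

        // Walk every <a> element, keep only those with an href,
        // and project out the attribute values.
        List<string> hrefs = doc.DocumentNode
            .Descendants("a")
            .Where(a => a.Attributes["href"] != null)
            .Select(a => a.GetAttributeValue("href", string.Empty))
            .ToList();

        // Only the real link survives; the named anchor is skipped.
        Console.WriteLine(hrefs.Count); // 1
    }
}
```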

We have covered a lot of ground in very little code which I hope further impresses on you the power of this library.

Tune in next time to find out how we will build a query engine that can detect if a webpage has an RSS or Atom feed associated with it!

Download the sample application

The sample application contains everything we discussed in this article, including the HtmlAgilityPack, the code and the HTML snippet file.

More In This Series

This article is part of a series. You can find more posts in this series here:


19 comments:

cs said...

Great article. I would love to see/know how HtmlAgilityPack handles large files. For example, if I have a 30 megabyte XHTML document and want to get every nth sub-element and write that to a much smaller document, is that possible? It seems like HtmlAgilityPack is a memory-only tool and does not support disk-level streams. Is that true?

Thanks.

waqas said...

Cool

Anonymous said...

Very informative. This method will be useful when you want to extract information like weather or currency from sites, without having to pay or subscribe to RSS feeds to fetch the same info. It's simply great that it can parse dirty HTML code. Having said that, I am not sure of any legal policies some sites might have about using the content they pay for.

Great article.

Thanks a lot for sharing.

rtpHarry said...

@Anonymous: in my intro article written before this one I link to the wikipedia article which explains screen scraping. It has a section which covers the legal aspects and it concludes that it is legal although some content creators understandably take issue with their content being skimmed. http://en.wikipedia.org/wiki/Web_scraping#Legal_issues

Anonymous said...

This will work for simple, well behaved page. Many times a page has multiple navigations buried in it and the DocumentCompleted has not well, completed.

It's possible to grab the content too soon.

Shailesh kavathiya said...

Great post Thanks for share with us keep it up

Suraj said...

great article ... it helped

Anonymous said...

An excellent article! A link to it ought to be posted on forums.asp.net in the HowTos

Monish said...

Good work. But if I want to Post Data like Username and password and then i need screen scrape of whole html in txt file then what i need to do using Htmlagilitypack? I am waiting for your next article in which you will solve my this problem.

rtpHarry said...

@Monish: You can send authenticated requests but it depends on the login mechanism in use on the site. I would suggest bringing this up in a thread on forums.asp.net if you want to explore this further.

Treeluv Burdpu said...

I like this post. I like HtmlAgilityPack. But I am lost when it doesn't work. If you do a SelectNodes("//*[@id='search-results-module']/*") and you get zero results from a document which you believe has those elements you are SOL, as far as I can tell. You just get nothing. Is there some tool which will help me hone my XPathing skills? All the ones I have found expect an XML doc and break with HTML, or just return nothing. How to troubleshoot syntax? I am already starting on the "try every character combination" method. I should be done this century. Any assistance would be appreciated.

rtpHarry said...

@Treeluv Burdpu: If you are working in .net 3.0 or above then you can now use linq queries against the latest beta release. I have two articles in the pipelines for this but my world has not been blog friendly recently :) If you know how to use linq to objects then you should be able to compose your queries without having to tear your hair out! (I don't get on that well with XSL either)

ooty said...

thanks man it helped me a lot

Taufik Lukman said...

So you still have to use regular expression?

Pradeep Nulu said...

Hi,
How do I exclude commented links ?

Thanks.

Sherihan Anver said...

Hi,
May I know whether HtmlAgilityPack is compatible with Visual Studio 2010 (.NET framework 4.0). Can it extract aps.net pages' every valid tags? Sorry, I just found about HtmlAgilityPack and I think according to this BLOG, this is what I was searching for days now. I need to extract web pages of asp.net for some evaluation process. please reply if my target can be achieved via this

Anonymous said...

Could you tell me where href attributes are getting with this solution? I don't understand which web site has this a tags.