Introduction To The HtmlAgilityPack Library

Posted on: Sunday, 27 September 2009 13:29

If you haven't heard of the HtmlAgilityPack before then you could be in for a treat. It's an open source library hosted on CodePlex which greatly eases HTML screen scraping.

What is screen scraping?

Screen scraping is the term for downloading a web page and then parsing its HTML to pull information out of it. In the old days before RSS feeds this was the only way to get at a website's information. With the advent of syndicated content and web services it has become a lot easier to get information from other sites, but there are still situations where you need to go back to screen scraping to get the job done.

A price comparison site is one example of where this technique is still useful today. When a retailer does not make its complete stock information available, you have to read the data out of its web pages. Other uses include weather data monitoring, website change detection (such as spotting the release of a new version of software) and extracting meta information from a page.

Doesn't .NET support this natively?

The .NET Framework comes with built-in classes that manipulate XML documents with ease. However, in most cases you won't get very far trying to screen scrape with them. This is because they require well-formed mark-up, which is not very common on web sites. The XML standard has a strict policy of failing at the first error, so unless you give these classes perfectly formed mark-up you won't be able to use them.
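To make the "fail at the first error" behaviour concrete, here is a minimal sketch using the framework's XmlDocument class. The StrictXmlDemo class and TryLoadAsXml helper are names I have made up for illustration, and the sample mark-up is invented.

```csharp
using System;
using System.Xml;

public static class StrictXmlDemo
{
    // Hypothetical helper: reports whether the mark-up can be loaded as XML.
    public static bool TryLoadAsXml(string markup)
    {
        XmlDocument doc = new XmlDocument();
        try
        {
            doc.LoadXml(markup);
            return true;
        }
        catch (XmlException)
        {
            // The XML parser gives up at the first well-formedness error.
            return false;
        }
    }

    public static void Main()
    {
        // Typical real-world HTML: <br> and <p> are never closed.
        string sloppy = "<html><body><p>Hello<br>world</body></html>";
        // The same content written as well-formed XML.
        string strict = "<html><body><p>Hello<br/>world</p></body></html>";

        Console.WriteLine(TryLoadAsXml(sloppy));
        Console.WriteLine(TryLoadAsXml(strict));
    }
}
```

The first string is the kind of mark-up most pages serve, and it is rejected outright rather than partially parsed.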

The HtmlAgilityPack provides a set of classes that make it easy to download HTML pages into memory and then query them using XPath syntax. It doesn't matter if the page isn't standards compliant; the library will just do the best it can with what it has.
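As a rough sketch of what that looks like in practice, the snippet below loads a deliberately sloppy fragment and pulls out its links with XPath. It assumes your project references the HtmlAgilityPack assembly. LinkScraper and ExtractLinks are hypothetical names of my own, and the sample mark-up is invented.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class LinkScraper
{
    // Hypothetical helper: returns the href of every <a> element in the mark-up.
    public static List<string> ExtractLinks(string html)
    {
        HtmlDocument doc = new HtmlDocument();
        doc.LoadHtml(html);

        List<string> hrefs = new List<string>();

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return hrefs;
        }

        foreach (HtmlNode anchor in anchors)
        {
            hrefs.Add(anchor.GetAttributeValue("href", string.Empty));
        }
        return hrefs;
    }

    public static void Main()
    {
        // The <li> elements are never closed, which an XML parser would reject.
        string sloppy = "<ul><li><a href=\"/one\">One</a><li><a href=\"/two\">Two</a></ul>";

        foreach (string href in ExtractLinks(sloppy))
        {
            Console.WriteLine(href);
        }
    }
}
```

To work against a live page instead of a string, the pack's HtmlWeb class has a Load(url) method that hands back the same kind of HtmlDocument, so the XPath part stays the same.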

Where can I download it?

The CodePlex project is located at the following location:

To download it just click the Download Now button located down the right hand side of the page and then press I Agree to accept the CodePlex license agreement.

The pack comes with a couple of examples to get you started and there are more posts in this series on this site.

More In This Series

This article is part of a series. You can find more posts in this series here:

Further reading

HtmlAgilityPack Article Series

Posted on: 13:25

This is an index post which pulls together the various HtmlAgilityPack posts on this site. If you don't know what the HtmlAgilityPack is, or what screen scraping means, then you should read the first introductory article. Otherwise you can dive right in!

Article Index

Where to get a 2.0 compatible AjaxControlToolkit

Posted on: Thursday, 17 September 2009 23:01

(Updated: To include the latest controls 2.0 users are missing out on)

This is just a simple link so that I don't have to keep finding it and explaining the situation each time the topic comes up.

2.0 support

The AjaxControlToolkit doesn't support 2.0 any more. This means that if you are still using 2.0 then you will have to download an older release.

The last version that was released with 2.0 support was toolkit version 1.0.20229. Download it here:

3.5 support

If you're looking for the latest 3.5 version then I suggest heading directly to the front page of the site so that you can get the most recent release. Actually, this URL seems to redirect you automatically to the latest version available:

But if it breaks for you in the future then use this link and click the download button in the top right:

What's missing for 2.0 users?

The following controls (at the time of writing) are missing from the older AjaxControlToolkit:

  • HTMLEditor
  • ComboBox
  • ColorPicker
  • MultiHandleSlider
  • Seadragon Image Viewer
  • AsyncFileUpload

Round off time to the nearest minute

Posted on: 07:59

Say you have a DateTime object like this:

DateTime someTime = DateTime.Parse("00:00:38");

Rounding Up

How would you round this up to the nearest minute? There isn't a built-in method to do this so you have to use a little bit of maths to get there. There are 60 seconds in a minute. We already have 38 seconds on the clock. So we need to add on 60 - 38 = 22 more seconds.

In code this looks like:

DateTime RoundUp = DateTime.Parse("00:00:38");
// only add time if we are not already on an exact minute
if (RoundUp.Second > 0)
    RoundUp = RoundUp.AddSeconds(60 - RoundUp.Second);

Now our RoundUp contains "00:01:00".

Rounding Down

To round down we use the same idea:

DateTime RoundDown = DateTime.Parse("00:01:38");
RoundDown = RoundDown.AddSeconds(-RoundDown.Second);


Note that the AddSeconds() method doesn't actually alter the DateTime it's working on; it just returns a new one. This is why I assigned the result back to the same variable in the examples above.
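The snippets above only look at the Second property, so any milliseconds are left in place and they always round in one fixed direction. As a sketch of an alternative (not the approach used above), the same rounding can be done on the Ticks value, which also lets you round to whichever minute is closer. TimeRounding and its method names are hypothetical, and rounding the half-minute mark upwards is simply a choice I have made here.

```csharp
using System;

public static class TimeRounding
{
    // Drops everything below the minute (seconds and fractions of a second).
    public static DateTime RoundDownToMinute(DateTime value)
    {
        long remainder = value.Ticks % TimeSpan.TicksPerMinute;
        return new DateTime(value.Ticks - remainder, value.Kind);
    }

    // Moves to the next minute unless the value is already exactly on a minute.
    public static DateTime RoundUpToMinute(DateTime value)
    {
        DateTime floor = RoundDownToMinute(value);
        return floor == value ? value : floor.AddMinutes(1);
    }

    // Picks whichever minute is closer; exactly 30 seconds rounds up.
    public static DateTime RoundToNearestMinute(DateTime value)
    {
        long remainder = value.Ticks % TimeSpan.TicksPerMinute;
        DateTime floor = RoundDownToMinute(value);
        return remainder >= TimeSpan.TicksPerMinute / 2 ? floor.AddMinutes(1) : floor;
    }

    public static void Main()
    {
        DateTime someTime = new DateTime(2009, 9, 17, 0, 0, 38);
        Console.WriteLine(RoundToNearestMinute(someTime).ToString("HH:mm:ss"));
        Console.WriteLine(RoundDownToMinute(someTime).ToString("HH:mm:ss"));
        Console.WriteLine(RoundUpToMinute(someTime).ToString("HH:mm:ss"));
    }
}
```

Like AddSeconds(), each of these methods returns a new DateTime and leaves the original untouched.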

Farewell VSJ print edition

Posted on: Tuesday, 15 September 2009 19:27

In the first page of VSJ there is a small message from the editor which discreetly announces that this month will be the final month of VSJ in print. I will miss reading this publication which has become a part of my life over the past year.

Being a massive geek, it was a nice fix whenever I found myself forced to leave my computer for a few minutes, such as when I had to wait for somebody.

The magazine is going to continue as an online publication but, if I am honest, I won't be reading it. There is already enough content to read when I am at my computer and I don't have time to get through all of it!

So that's that then. Does anybody know of any other great .NET focused print magazines?

A better way to reference your wizard steps using named steps

Posted on: Saturday, 12 September 2009 10:52

Note: this article uses the plain vanilla <asp:Wizard> but the concepts apply equally well to its popular counterpart <asp:CreateUserWizard>.

By far the most common way that I see wizard steps referenced in code snippets is by their index.

Something like this is common:

void Wizard1_NextButtonClick(object sender, System.Web.UI.WebControls.WizardNavigationEventArgs e)
{
     if (Wizard1.ActiveStepIndex == 1)
     {
          // jump to step three
          Wizard1.ActiveStepIndex = 3;
     }
}

This was actually taken from some MSDN documentation. But what's the problem with this? It works, doesn't it?

Yes, it works today, but the problem comes when you return in a few weeks and want to re-order your steps or add a new one in somewhere. Suddenly all those numbers seem to have a lot less meaning. It's a chore to change the steps over, and it's not something that will get caught by the compiler if you miss one or get it wrong.

There is a better way though, because you don't have to tie yourself down to the index. The trick in a nutshell is to give each of your <asp:WizardStep> elements an ID. By default the Visual Studio GUI doesn't do this but you can add them in yourself. This gives you a meaningful name that you can use, and it's not tied to the order of the steps.

Example 1 - Finding a control in a WizardStep

The first task that this technique makes easier is finding controls that are inside wizard steps. Instead of having to go through the Wizard control you can now call FindControl() directly on your WizardStep, because you have a way to access it.

TextBox firstName = (TextBox)StepNameHere.FindControl("FirstName");

Example 2 - Comparing against CurrentStepIndex

The second example is my favourite trick because I didn't find out about it for a while after I had started using named WizardSteps.

If you have a WizardStep called StepBillingDetails, for example, you can use Wizard1.WizardSteps.IndexOf() to find the index of that named step. This turns your named step into an integer which can be compared against properties such as the ActiveStepIndex property of the Wizard and e.CurrentStepIndex in your NextButtonClick event handler.

protected void Wizard1_NextButtonClick(object sender, WizardNavigationEventArgs e)
{
    Wizard w = (Wizard)sender;

    // check if billing details step has been completed
    if (e.CurrentStepIndex == w.WizardSteps.IndexOf(StepBillingDetails))
    {
        // user has just completed the wizard step "StepBillingDetails"
    }
}

Example 3 - Predictable Hot Jumping

Looking back at the first example, which I took from MSDN, you can probably see how we can improve it using these new techniques. While it is a slightly contrived example (I doubt I would always want to unconditionally jump from step one to step three) it does illustrate my point.

Let's take a look at what it looks like now that it uses named steps:

void Wizard1_NextButtonClick(object sender, System.Web.UI.WebControls.WizardNavigationEventArgs e)
{
     Wizard wizard1 = (Wizard)sender;

     if (wizard1.ActiveStepIndex == wizard1.WizardSteps.IndexOf(StepPersonalDetails))
     {
          // jump to step three
          wizard1.ActiveStepIndex = wizard1.WizardSteps.IndexOf(StepOrderComplete);
     }
}

Complete Sample

Here is a complete sample demonstrating how to take advantage of named WizardSteps with both of the techniques described above.


        <asp:Wizard ID="Wizard1" runat="server" OnNextButtonClick="Wizard1_NextButtonClick">
            <WizardSteps>
                <asp:WizardStep ID="StepPersonalDetails" runat="server" Title="Personal Details">
                    <asp:TextBox ID="FirstName" runat="server"></asp:TextBox>
                </asp:WizardStep>
                <asp:WizardStep ID="StepBillingDetails" runat="server" Title="Billing Details">
                </asp:WizardStep>
                <asp:WizardStep ID="StepOrderComplete" runat="server" Title="Order Complete">
                    First Name: <asp:Label ID="FirstNameLabel" runat="server" Text="Unknown"></asp:Label>
                </asp:WizardStep>
            </WizardSteps>
        </asp:Wizard>

Code behind:

    protected void Wizard1_NextButtonClick(object sender, WizardNavigationEventArgs e)
    {
        Wizard w = (Wizard)sender;

        // check if we just completed the first step
        if (e.CurrentStepIndex == w.WizardSteps.IndexOf(StepPersonalDetails))
        {
            // find the first name control
            TextBox firstName = (TextBox)StepPersonalDetails.FindControl("FirstName");

            // check it has a value
            if (!String.IsNullOrEmpty(firstName.Text))
            {
                // find the label on the last page
                Label firstNameLabel = (Label)StepOrderComplete.FindControl("FirstNameLabel");

                // assign the value
                firstNameLabel.Text = firstName.Text;
            }

            // jump to order complete step
            w.ActiveStepIndex = w.WizardSteps.IndexOf(StepOrderComplete);
        }
    }

Downloadable version:

Further Reading

Not a whole lot of further reading for you today as I have really said everything I wanted to say. Here is one link just for completeness:

You will find a few gotchas on there that aren't related to this article but might trip you up in the future.