HtmlAgilityPack Extension to Load a Uri/Url that handles redirects


This extension lets me load an HtmlDocument from a Uri. I wrote it as a function that also returns a Uri, in case the one you specified redirects to another page. This is important because you will use the redirected Uri to turn any relative links on the page into absolute links. I can also show you an example of how to do this.

First, the extension method “LoadUri” that extends HtmlDocument (the user agent is hard-coded for now; this should be cleaned up so you can specify your own):

VB.Net

Imports System.Net
Imports System.Runtime.CompilerServices
Imports HtmlAgilityPack

' Note: extension methods must be declared inside a Module.

''' <summary>
''' Loads an HtmlDocument from the specified uri.  The System.Uri returned from this function is the page that
''' actually responded (in the case of a redirect).  This Uri can then be used to correctly turn relative links
''' into absolute links.
''' </summary>
''' <param name="hd">The document to load the response into.</param>
''' <param name="uri">The address to request.</param>
''' <param name="timeoutMs">The request timeout in milliseconds.</param>
<Extension()> _
Public Function LoadUri(ByVal hd As HtmlDocument, ByVal uri As System.Uri, ByVal timeoutMs As Integer) As System.Uri
    Dim hwr As HttpWebRequest = DirectCast(WebRequest.Create(uri), HttpWebRequest)
    hwr.Timeout = timeoutMs
    hwr.UserAgent = "Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv:1.9.1.5) Gecko/20091102 Firefox/3.5.5"

    ' Using releases the response (and its connection) even if loading throws.
    Using resp As HttpWebResponse = DirectCast(hwr.GetResponse(), HttpWebResponse)
        ' ResponseUri differs from the requested uri when the server redirected us.
        Dim returnUri As System.Uri = resp.ResponseUri

        ' Only parse the body when the server says it is HTML.
        If resp.ContentType.StartsWith("text/html", StringComparison.InvariantCultureIgnoreCase) Then
            Using resultStream As System.IO.Stream = resp.GetResponseStream()
                hd.Load(resultStream)
            End Using
        End If

        Return returnUri
    End Using
End Function

Second, we have an extension method off of HtmlDocument that will return all the links on a page. This particular extension requires the baseUri to be provided in order to construct absolute links from relative ones. The returnUri from the previous LoadUri method is what would be passed into this method:

VB.Net

Imports System.Collections.Generic
Imports System.Runtime.CompilerServices

''' <summary>
''' Returns all links from an HTML document as a generic list of strings.  The baseUri is used to turn relative links
''' into absolute links.
''' </summary>
''' <param name="doc">The document to read links from.</param>
''' <param name="baseUri">The Uri the document was loaded from, after any redirects.</param>
<Extension()> _
Public Function GetLinks(ByVal doc As HtmlAgilityPack.HtmlDocument, ByVal baseUri As Uri) As List(Of String)
    ' GetLinks(doc) is a separate single-argument overload (not shown here).
    Dim linkList As List(Of String) = GetLinks(doc)
    Dim newList As New List(Of String)

    For Each link As String In linkList
        Dim resolved As Uri = Nothing

        ' Uri.TryCreate resolves the link against baseUri and passes absolute links through.
        ' It handles "/root-relative" and "../parent" links, which plain string concatenation
        ' gets wrong, and it skips malformed links instead of throwing.
        If Uri.TryCreate(baseUri, link, resolved) Then
            newList.Add(resolved.AbsoluteUri)
        End If
    Next

    Return newList
End Function
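
The method above calls a single-argument GetLinks(doc) overload that isn't shown in this post. As a rough sketch of what such an overload could look like (this is my assumption of its shape, not the original code; the module name is made up), it could select every anchor that has an href attribute and return the raw values:

```vbnet
Imports System.Collections.Generic
Imports System.Runtime.CompilerServices
Imports HtmlAgilityPack

Public Module HtmlDocumentRawLinkExtensions

    ''' <summary>
    ''' Hypothetical stand-in for the single-argument GetLinks overload.
    ''' Returns the href values exactly as they appear in the page.
    ''' </summary>
    <Extension()> _
    Public Function GetLinks(ByVal doc As HtmlDocument) As List(Of String)
        Dim links As New List(Of String)

        ' SelectNodes returns Nothing (not an empty collection) when nothing matches.
        Dim anchors As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//a[@href]")
        If anchors Is Nothing Then Return links

        For Each anchor As HtmlNode In anchors
            Dim href As String = anchor.GetAttributeValue("href", String.Empty).Trim()
            If href.Length > 0 Then links.Add(href)
        Next

        Return links
    End Function

End Module
```

Keeping the raw version separate means the base-Uri version only has to worry about resolving links, not finding them.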

Here is how I would use it as a simple test. In this test case, I just dump the contents of the list into a WinForms RichTextBox to quickly see its values. I also have an extension method on List that turns it into a delimited string (you can replace that with a For Each loop, or set a breakpoint to inspect the contents of the list):

VB.Net

Dim hd As New HtmlAgilityPack.HtmlDocument
Dim responseUri As System.Uri = hd.LoadUri(New System.Uri("http://www.blakepell.com"), 3000)
Dim linkList As List(Of String) = hd.GetLinks(responseUri)

RichTextBox1.Text = linkList.ToDelimitedString(vbCrLf)
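
The ToDelimitedString extension used above isn't included in this post. If you don't have one, a minimal version might look like the following. The module name and the exact signature are assumptions on my part; all it does is wrap String.Join:

```vbnet
Imports System.Collections.Generic
Imports System.Runtime.CompilerServices

Public Module ListStringExtensions

    ''' <summary>
    ''' Hypothetical stand-in for ToDelimitedString: joins the items of the
    ''' list into one string, separated by the given delimiter.
    ''' </summary>
    <Extension()> _
    Public Function ToDelimitedString(ByVal list As List(Of String), ByVal delimiter As String) As String
        ' ToArray keeps this working on framework versions where
        ' String.Join only accepts a String array.
        Return String.Join(delimiter, list.ToArray())
    End Function

End Module
```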

This may need some touching up; it doesn’t handle things like javascript in links. It’s a basic starting point for collecting the links on a page.
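
One possible way to deal with javascript: (and mailto:) links is to resolve each link against the base Uri and keep only the http and https results. This is just a sketch; KeepWebLinks and the LinkFilters module are names I made up for illustration, and you may want a different filtering rule:

```vbnet
Imports System
Imports System.Collections.Generic

Public Module LinkFilters

    ''' <summary>
    ''' Hypothetical helper: resolves each link against baseUri and keeps
    ''' only the ones that end up with an http or https scheme.
    ''' </summary>
    Public Function KeepWebLinks(ByVal links As IEnumerable(Of String), ByVal baseUri As Uri) As List(Of String)
        Dim result As New List(Of String)

        For Each link As String In links
            Dim resolved As Uri = Nothing

            ' Skip anything that cannot be parsed as a Uri at all.
            If Not Uri.TryCreate(baseUri, link, resolved) Then Continue For

            ' Keep only schemes that can be requested as web pages.
            If resolved.Scheme = Uri.UriSchemeHttp OrElse resolved.Scheme = Uri.UriSchemeHttps Then
                result.Add(resolved.AbsoluteUri)
            End If
        Next

        Return result
    End Function

End Module
```

You would pass it the raw links from the page along with the Uri returned by LoadUri.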
