A WebCrawler demonstrating the Beauty of TPL Dataflow

[2012-Oct-13 Updated downloadable code to wait for cancel to complete]

For more information on this topic in German, see my article “Gleichzeitig zum Erfolg, Parallele Programmierung und Dataflow mit .NET” in dotnetpro magazine, issue 6/2013.

This post demonstrates the beauty of TPL Dataflow by implementing a simple web crawler. For an overview of Dataflow see:

This Dataflow web crawler implements the following features:

  • Download web pages asynchronously.
  • Download at most 4 web pages in parallel.
  • Traverse the web pages' link tree.
  • Parse for links to images.
  • Download jpg images to disk.
  • Download the images using the new async I/O TAP method Stream.CopyToAsync.

Dataflow Web Crawler Architecture

Sub Main()
    'Define Blocks
    Dim downLoader As New TransformBlock(Of String, String)(
        Async Function(url)
            WriteLineInColor(url, ConsoleColor.White)
            Return Await New HttpClient().GetStringAsync(url).ConfigureAwait(False)
        End Function, New ExecutionDataflowBlockOptions With {.MaxDegreeOfParallelism = 4})

    Dim contentBroadcaster As New BroadcastBlock(Of String)(Function(html) html)

    Dim linkParser As New TransformManyBlock(Of String, String)(
        Function(html)
            Dim doc As New HtmlDocument()
            doc.LoadHtml(html)
            Return From n In doc.DocumentNode.Descendants("a")
                   Where n.Attributes.Contains("href")
                   Let url = n.GetAttributeValue("href", "")
                   Where url.StartsWith("http://")
                   Select url
        End Function)

    Dim imageParser As New TransformManyBlock(Of String, String)(
        Function(html)
            Dim doc As New HtmlDocument()
            doc.LoadHtml(html)
            Return From n In doc.DocumentNode.Descendants("img")
                   Where n.Attributes.Contains("src")
                   Let url = n.GetAttributeValue("src", "")
                   Where url.StartsWith("http://")
                   Select url
        End Function)

    Dim linkBroadCaster As New BroadcastBlock(Of String)(Nothing)

    Dim imageProcessor As New ActionBlock(Of String)(
       Async Function(url)
               Dim uri As New Uri(url)
               Dim fileUrl = If(String.IsNullOrWhiteSpace(uri.Query), uri.AbsoluteUri, uri.AbsoluteUri.Replace(uri.Query, ""))
               WriteLineInColor(fileUrl, ConsoleColor.Yellow)

               Dim webStream = Await (New HttpClient().GetStreamAsync(url)).ConfigureAwait(False)
               Dim filePath = New DirectoryInfo(Path.Combine(My.Application.Info.DirectoryPath, "..\..\..\Images", IO.Path.GetFileName(fileUrl))
                              ).FullName
               Using fileStream = IO.File.OpenWrite(filePath)
                   Await webStream.CopyToAsync(fileStream).ConfigureAwait(False)
               End Using
       End Function)

    'Handle unexpected errors
    downLoader.Completion.ContinueWith(
        Sub(ant) WriteLineInColor(ant.Exception.GetBaseException().ToString(), ConsoleColor.Red),
        TaskContinuationOptions.OnlyOnFaulted)
    ...
    imageProcessor.Completion.ContinueWith(
        Sub(ant) WriteLineInColor(ant.Exception.GetBaseException().ToString(), ConsoleColor.Red),
        TaskContinuationOptions.OnlyOnFaulted)

    'Link Blocks
    downLoader.LinkTo(contentBroadcaster)
    contentBroadcaster.LinkTo(linkParser)
    contentBroadcaster.LinkTo(imageParser)
    linkParser.LinkTo(linkBroadCaster)
    linkBroadCaster.LinkTo(downLoader)
    '.jpg just to demonstrate a link filter
    linkBroadCaster.LinkTo(imageProcessor, Function(url) url.EndsWith(".jpg"))
    imageParser.LinkTo(imageProcessor)

    'Let's start
    downLoader.Post("http://www.bbc.co.uk/news/")

    PromptUser("Crawling... Press  to abort:", ConsoleColor.White)
End Sub

In BroadcastBlocks one should generally consider cloning the message so that the consuming blocks do not all work on the same instance. In this example that is not necessary, because the message is of type String and thus immutable, so in contentBroadcaster I simply pass the html through. linkBroadCaster shows that one can omit the cloning function completely by passing Nothing.
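For a message type that is mutable, the cloning function matters. The following is only a minimal sketch under stated assumptions: PageInfo, the two consumer blocks and the example URL are hypothetical and not part of the crawler. It shows a BroadcastBlock whose cloning function makes a copy, so that two consumers can each modify what they receive without affecting the other.

```vb
Imports System
Imports System.Threading.Tasks
Imports System.Threading.Tasks.Dataflow

Module BroadcastCloningSketch
    ' Hypothetical mutable message type; the crawler in this post only passes Strings.
    Class PageInfo
        Public Property Url As String
        Public Property Html As String
    End Class

    Sub Main()
        ' The cloning function copies the message, so linked targets do not share one instance.
        Dim broadcaster As New BroadcastBlock(Of PageInfo)(
            Function(p) New PageInfo With {.Url = p.Url, .Html = p.Html})

        ' Two hypothetical consumers that each mutate the instance they receive.
        Dim upper As New ActionBlock(Of PageInfo)(
            Sub(p)
                p.Html = p.Html.ToUpperInvariant()
                Console.WriteLine("upper: " & p.Html)
            End Sub)
        Dim trimmed As New ActionBlock(Of PageInfo)(
            Sub(p)
                p.Html = p.Html.Trim()
                Console.WriteLine("trimmed: '" & p.Html & "'")
            End Sub)

        ' Propagate completion so that Main can wait for both consumers to finish.
        Dim linkOptions As New DataflowLinkOptions With {.PropagateCompletion = True}
        broadcaster.LinkTo(upper, linkOptions)
        broadcaster.LinkTo(trimmed, linkOptions)

        broadcaster.Post(New PageInfo With {.Url = "http://example.com/", .Html = "  <p>hello</p>  "})
        broadcaster.Complete()
        Task.WaitAll(upper.Completion, trimmed.Completion)
    End Sub
End Module
```

If Nothing were passed instead of the cloning function here, both consumers would be handed the same PageInfo instance and could overwrite each other's changes.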

Download sample code.

This web crawler example was inspired by Bnaya Eshet’s post TPL Dataflow walkthrough.

About Peter Meinl

IT Consultant

2 Responses to A WebCrawler demonstrating the Beauty of TPL Dataflow

  1. svick says:

    Few comments:

    1. Why are you using ConfigureAwait()? It has no effect in console applications.
    2. contentBroadcaster could have used Nothing instead of the identity lambda, same as linkBroadcaster.
    3. I think downloading to a file would be simpler using WebClient.DownloadFileTaskAsync(). HttpClient is newer, but that doesn’t mean you have to use it everywhere.
    4. Why is the code in the archive slightly different than the code you posted here?

  2. Peter Meinl says:

    svick,
    thanks for your quick and valuable comments.

    1.
    I am trying to make it a habit to use ConfigureAwait(False) whenever I do not need the continuation to be executed on the original context. This way I avoid accidentally incurring the overhead of ConfigureAwait(True) when copying code to a library or application where it runs in a context.

    After reading “It’s all about the synchronization context” http://msdn.microsoft.com/en-us/magazine/gg598924.aspx I am not sure whether there really is no performance penalty in a console app, because the default synchronization context seems to be used. What do you think?

    Additionally, I use ConfigureAwait in this example as a prompt for talking with colleagues about its mere existence.

    2.
    Yes, I could use Nothing for the cloning function in both BroadcastBlocks. In my more detailed article (not published yet) I am using the lambda (Function(html) html) to explain that one commonly clones the message in the broadcast block. I have now added a comment to this post too.

    3.
    You caught me here – always wanting to use the new stuff:-) Yes, WebClient.DownloadFileTaskAsync() would be simpler, I did not think of it.

    However, by default I prefer using the new HttpClient in async scenarios, because WebClient sometimes seems to block the calling thread while initializing the async I/O. If I remember correctly, this happens when passing an invalid URL like http://www.micorosoft.com or when the network is disconnected.

    4.
    I sometimes keep the code shown in posts and articles simpler than in the downloadable version. In this post the download includes cancellation, which I felt would unnecessarily bloat the listing in the post.
