Billion strings falling from cloud

Every day we have to process almost a billion strings from our clickstream store. To keep things simple, let’s say I get a URL and need to parse the domain name out of it. For example, if the URL is:

http://www.microsoft.com/Downloads.

Then I need to parse this string and get the domain name “www.microsoft.com”.
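To make the task concrete, here is a minimal sketch of the extraction step, assuming the URLs start with "http://" as in the example above. The class and method names (DomainParser, ExtractDomain, ExtractDomainWithUri) are made up for illustration and are not part of the code used later in this post; the second method shows the same idea using the framework's System.Uri class.

```csharp
using System;

public static class DomainParser
{
    // Hypothetical helper: returns the host part of an "http://" URL,
    // or null when the string does not start with that scheme.
    public static string ExtractDomain(string url)
    {
        const string scheme = "http://";
        if (url == null || !url.StartsWith(scheme, StringComparison.OrdinalIgnoreCase))
            return null;

        int start = scheme.Length;            // index of the first host character
        int slash = url.IndexOf('/', start);  // first '/' after the host, or -1
        return slash < 0
            ? url.Substring(start)            // no path: the rest is the host
            : url.Substring(start, slash - start);
    }

    // Alternative: let System.Uri parse the string (also accepts other schemes such as https).
    public static string ExtractDomainWithUri(string url)
    {
        Uri uri;
        return Uri.TryCreate(url, UriKind.Absolute, out uri) ? uri.Host : null;
    }
}
```

The hand-rolled version does less work per string, while the Uri version is more forgiving about input; which one fits better depends on how clean the clickstream data is.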

Looking at the parallel libraries in .NET 4.0, I thought I could use them to increase our data-crunching performance. I planned to use Parallel.ForEach in place of foreach to harness all the CPU cores of my box.

However, to my amazement, the parallel version of foreach took more time than the single-threaded version. After a little head scratching, I realized that the work in my foreach loop was so small that the cost of handing each item to a worker thread and coordinating the threads was greater than the work itself. It is like a task so small that explaining it to someone takes longer than doing it yourself: sometimes parallel processing and delegating work is not a good idea.

However, as soon as I put heavier processing in the loop, e.g. making the thread sleep for 10 milliseconds :), the parallel version started to outperform the single-threaded foreach.
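A comparison like this can be reproduced with a small timing harness. The sketch below is only an illustration, not the measurement code behind the numbers in this post: LoopTiming, Compare and Demo are hypothetical names, the input is a repeated sample URL rather than real clickstream data, and the actual timings will vary from machine to machine.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class LoopTiming
{
    // Runs the same body once sequentially and once with Parallel.ForEach,
    // and returns (sequential elapsed, parallel elapsed).
    public static Tuple<TimeSpan, TimeSpan> Compare(string[] urls, Action<string> body)
    {
        Stopwatch sw = Stopwatch.StartNew();
        foreach (string s in urls) body(s);
        TimeSpan sequential = sw.Elapsed;

        sw.Restart();
        Parallel.ForEach(urls, body);
        TimeSpan parallel = sw.Elapsed;

        return Tuple.Create(sequential, parallel);
    }

    public static void Demo()
    {
        // Assumed sample input: the example URL repeated 100 times.
        string[] urls = Enumerable.Repeat("http://www.microsoft.com/Downloads", 100).ToArray();

        // Tiny body: a single IndexOf call per string.
        var tiny = Compare(urls, s => { s.IndexOf('/', 8); });

        // Heavier body: the same call plus a 10 ms sleep per string.
        var heavy = Compare(urls, s => { s.IndexOf('/', 8); Thread.Sleep(10); });

        Console.WriteLine("tiny body:  sequential {0}, parallel {1}", tiny.Item1, tiny.Item2);
        Console.WriteLine("heavy body: sequential {0}, parallel {1}", heavy.Item1, heavy.Item2);
    }
}
```

Calling LoopTiming.Demo() prints both pairs of timings, which makes it easy to see at what point the per-item work becomes large enough to justify the parallel loop.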

Here is the code I wrote to do the parallel parsing:


using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Threading.Tasks;
using System.Threading;

namespace DP_Parallel
{
    class Program
    {
        static void Main(string[] args)
        {
            // List<T>.Add is not safe to call from several threads at once,
            // so the results are collected in a ConcurrentBag<string>.
            ConcurrentBag<string> str = new ConcurrentBag<string>();
            using (AdventureWorksEntities AwContext = new AdventureWorksEntities())
            {
                IEnumerable<string> uri = AwContext.ClickStreams
                    .Where(u => u.ReferringURI.Length > 0)
                    .Select(u => u.ReferringURI)
                    .Take(1000000);

                //Parallel
                Parallel.ForEach(uri, (string s) =>
                {
                    // Length of the host: first "/" after "http://", minus the 7-character scheme
                    int Pos = s.IndexOf(@"/", 8) - 7;
                    if (Pos > 5)
                    {
                        //Thread.Sleep(10);
                        str.Add(s.Substring(7, Pos));
                        //Some other processing on the string - upper case etc
                    }
                });
            }
        }
    }
}
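One common way to cut the per-item overhead of a loop like this is to hand each worker a whole range of items instead of a single string. The sketch below is a variation I am adding for illustration, not the code that was measured: it assumes the URLs have already been loaded into an in-memory list, ChunkedParser and ParseDomains are hypothetical names, and the order of the results is not preserved. It uses Partitioner.Create to produce index ranges and gives each worker its own local list, so the shared list is touched only once per worker.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ChunkedParser
{
    // Hypothetical helper: same parsing rule as the loop above, run over index ranges.
    public static List<string> ParseDomains(IList<string> urls)
    {
        var results = new List<string>();
        if (urls.Count == 0) return results;   // Partitioner.Create rejects an empty range

        object gate = new object();

        Parallel.ForEach(
            Partitioner.Create(0, urls.Count),  // yields (fromInclusive, toExclusive) ranges
            () => new List<string>(),           // one private buffer per worker
            (range, loopState, local) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                {
                    string s = urls[i];
                    if (s == null || s.Length < 8) continue;  // too short to hold a host
                    int pos = s.IndexOf('/', 8) - 7;
                    if (pos > 5) local.Add(s.Substring(7, pos));
                }
                return local;
            },
            local => { lock (gate) results.AddRange(local); });  // merge once per worker

        return results;
    }
}
```

Whether this actually beats the plain foreach for such a small body still has to be measured; it only removes some of the overhead, not all of it.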

For a million rows, the non-parallel C# code took 8.67 seconds, whereas the parallel version took 1 minute and 1.51 seconds. The same work took only 16 seconds in SQL with the following query:

PRINT CAST(GETDATE() AS time)
SELECT TOP 1000000
    CASE CHARINDEX('/', [ReferringURI], 8)
        WHEN 0 THEN [ReferringURI]
        ELSE SUBSTRING([ReferringURI], 8, CHARINDEX('/', [ReferringURI], 8) - 8)
    END AS DomainName
INTO #T
FROM [StagerDW].[dbo].[ClickStream]
WHERE LEN([ReferringURI]) > 0
PRINT CAST(GETDATE() AS time)
 

So for just this task, the single-threaded C# code is roughly twice as fast as SQL and about seven times faster than the parallel code. But, as I said, if you have to do heavier processing per item, then the parallel code may win over the single-threaded code. So the best option is to write a CLR stored procedure in SQL Server, and enjoy the multi-core parallel processing advantages from the CLR code.

Yes, we don’t wait the whole day to process the billion rows, but crunch them as they come in …
