B i l l i o n strings falling from c l o u d

December 25, 2009

Every day we have to process almost a billion strings from our clickstream store. To keep things simple, let's say I get a URL and need to parse the domain name out of it. For example, if the URL is as follows:

http://www.microsoft.com/Downloads

Then I need to parse this string and get the domain name “www.microsoft.com”.
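
As a point of reference, here is a minimal sketch of that parsing step using the framework's own System.Uri class rather than the hand-rolled IndexOf/Substring slicing used later in this post. The class and method names (DomainParser, ExtractDomain) are made up for illustration, and Uri parsing does more work per string than the slicing approach, so treat it as a readable baseline, not as the fastest option.

```csharp
using System;

public static class DomainParser
{
    // Hypothetical helper: returns the host part of an absolute URL,
    // or null when the input cannot be parsed as an absolute URI.
    public static string ExtractDomain(string url)
    {
        Uri parsed;
        if (Uri.TryCreate(url, UriKind.Absolute, out parsed))
        {
            return parsed.Host;
        }
        return null;
    }

    public static void Main()
    {
        // The example URL from the post.
        Console.WriteLine(ExtractDomain("http://www.microsoft.com/Downloads"));
    }
}
```

Returning null for unparseable input is a choice made for this sketch; the SQL version later in the post returns the whole string when no slash is found.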

Looking at the parallel libraries in .NET 4.0, I thought I could use them to increase our data-crunching performance. I planned to use Parallel.ForEach in place of foreach to harness all the CPU cores of my box.

However, to my amazement, the parallel version of foreach took more time than the single-threaded version. After a little head scratching, I realized that the work in my foreach loop was so small that the cost of handing each item to a worker thread and coordinating the threads was more than the work itself. It is like a chore so small that explaining it to someone takes longer than doing it yourself: sometimes parallel processing and delegating work is not a good idea.

However, as soon as I put heavy processing in the loop, e.g. making the thread sleep for 10 milliseconds :), my parallel version started performing better than the single-threaded version of foreach.

Here is the code I wrote to do the parallel parsing:


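using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Threading.Tasks;
using System.Threading;

namespace DP_Parallel
{
    class Program
    {
        static void Main(string[] args)
        {
            // List<T>.Add is not safe to call from several threads at once,
            // so the results go into a thread-safe ConcurrentBag<string>.
            ConcurrentBag<string> str = new ConcurrentBag<string>();
            using (AdventureWorksEntities AwContext = new AdventureWorksEntities())
            {
                IEnumerable<string> uri = AwContext.ClickStreams
                    .Where(u => u.ReferringURI.Length > 0)
                    .Select(u => u.ReferringURI)
                    .Take(1000000);

                //Parallel
                Parallel.ForEach(uri, (string s) =>
                {
                    // IndexOf throws if the start index is past the end of the string
                    if (s.Length < 8) return;

                    // Skip "http://" (7 characters), then find the '/' that ends the host
                    int Pos = s.IndexOf(@"/", 8) - 7;
                    if (Pos > 5)
                    {
                        //Thread.Sleep(10);
                        str.Add(s.Substring(7, Pos));
                        //Some other processing on the string - upper case etc
                    }
                });
            }
        }
    }
}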

For a million rows, the non-parallel C# code took 8.67 seconds, whereas the parallel version took 1 minute and 1.51 seconds. The same job took only 16 seconds in SQL with the following query:

print cast ( getdate() as time )
SELECT TOP 1000000
    CASE CHARINDEX ( '/', [ReferringURI], 8 )
        WHEN 0 THEN [ReferringURI]
        ELSE SUBSTRING ( [ReferringURI], 8, CHARINDEX ( '/', [ReferringURI], 8 ) - 8 )
    END AS DomainName
INTO #T
FROM [StagerDW].[dbo].[ClickStream]
WHERE LEN ( [ReferringURI] ) > 0
print cast ( getdate() as time )
 

So for just this task, the single-threaded C# code is almost twice as fast as SQL and about seven times faster than the parallel code. But, as I said, if you have to do heavier processing on each string, then the parallel code may win over the single-threaded code. So the best option is to write a CLR stored procedure in SQL and, from the CLR code, enjoy the advantages of multi-core parallel processing.
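
One common way to cut the per-item overhead when the loop body is this small is to give each worker a range of items instead of one string at a time. The sketch below is only an illustration under a few assumptions: the URLs are already in memory as an array (not streamed from Entity Framework), and the names ChunkedParse, DomainOf and ParseInRanges are hypothetical. It uses Partitioner.Create to split the index range and a per-task local list so that threads synchronize once per task, not once per string. I have not timed it against the numbers above.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading.Tasks;

public static class ChunkedParse
{
    // Same slicing rule as the loop in this post: the text between
    // "http://" and the next '/', or null when the rule does not match.
    public static string DomainOf(string url)
    {
        if (url == null || url.Length < 8) return null;
        int slash = url.IndexOf('/', 8);
        return slash > 12 ? url.Substring(7, slash - 7) : null;
    }

    public static List<string> ParseInRanges(string[] urls)
    {
        var results = new List<string>();
        if (urls.Length == 0) return results;   // Partitioner.Create rejects an empty range
        var gate = new object();

        Parallel.ForEach(
            Partitioner.Create(0, urls.Length),  // yields (from, to) index ranges
            () => new List<string>(),            // one private buffer per task
            (range, loopState, local) =>
            {
                for (int i = range.Item1; i < range.Item2; i++)
                {
                    string domain = DomainOf(urls[i]);
                    if (domain != null) local.Add(domain);
                }
                return local;
            },
            local => { lock (gate) { results.AddRange(local); } });  // merge once per task

        return results;
    }

    public static void Main()
    {
        var urls = new string[100000];
        for (int i = 0; i < urls.Length; i++)
        {
            urls[i] = "http://www.microsoft.com/Downloads/" + i;
        }
        Console.WriteLine(ParseInRanges(urls).Count);
    }
}
```

The order of the merged results is not guaranteed to match the input order; if order matters, the ranges would need to be written back by index instead.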

And yes, we don’t wait a whole day to process the billion rows; we crunch them as they come in …

AsParallel makes your query topless

December 25, 2009

Here is a simple code snippet using Entity Framework:


using (AdventureWorksEntities AwContext = new AdventureWorksEntities())
{
    var LoginId = AwContext.Employees
        .Where(u => u.LoginID.Length > 0)
        .Select(u => u.LoginID)
        .Take(10);
}

As you can see from the above code, I am trying to get the top 10 rows from the Employee table, and only one column, LoginID, where the LoginID length is more than zero. If you run this query, you may see SQL like the following in SQL Profiler:

SELECT TOP (10)
      [c].[LoginID] AS [LoginID]
        FROM [HumanResources].[Employee] AS [c]
            WHERE LEN( [c].[LoginID] ) > 0

Now, the web is full of examples of how to take advantage of multiple cores and run your query in parallel by using PLINQ: just add the magic word AsParallel to the data source and your code will run on multiple threads. But if you are a developer who reads 5000 words a minute, you may miss the fact that PLINQ applies only to LINQ to Objects (i.e. IEnumerable-based sources, where lambdas are bound to delegates, not IQueryable-based sources, where lambdas are bound to expressions). You may then add AsParallel to your query, thinking that you are using all the cores of your CPU and that your query will somehow become faster. Unfortunately, your code may still work, but now it has severe side effects behind the scenes.

using (AdventureWorksEntities AwContext = new AdventureWorksEntities())
{
    var LoginId = AwContext.Employees.AsParallel()
        .Where(u => u.LoginID.Length > 0)
        .Select(u => u.LoginID)
        .Take(10);
}

With luck your code may still work, but under the covers many things have changed. First, your Top 10 selection is gone from the query that goes to SQL Server. This can be a problem if your table is large, with a couple of million rows. Second, you are now getting all the columns. Third, if you are lucky, you will get exceptions in the code, as other threads try to process the rest of your statements, such as the ‘Where’ clause, and fail with a NullReferenceException; otherwise you will just squander resources under the false pretext that you have written efficient code. Here is the SQL generated after adding the ‘AsParallel’ call:

SELECT
    [Extent1].[EmployeeID] AS [EmployeeID],
    [Extent1].[NationalIDNumber] AS [NationalIDNumber],
    [Extent1].[ContactID] AS [ContactID],
    [Extent1].[LoginID] AS [LoginID],
    ..all other columns
FROM [HumanResources].[Employee] AS [Extent1]
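
The reason for this change can be seen in the static types. The sketch below is a rough illustration only: it uses an in-memory list wrapped with AsQueryable() as a stand-in for the Entity Framework source (an assumption, since no database is involved), and the names BindingDemo, FakeLoginIds and CanBeTranslated are made up. It shows that operators applied before AsParallel() stay on IQueryable, where the lambdas are expression trees a provider can translate, while everything after AsParallel() is a ParallelQuery, where the lambdas are plain delegates that run in memory.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BindingDemo
{
    // Stand-in for AwContext.Employees.Select(u => u.LoginID); not a real EF source.
    public static IQueryable<string> FakeLoginIds()
    {
        return new List<string> { "adventure-works\\guy1", "", "adventure-works\\kevin0" }
            .AsQueryable();
    }

    // Only IQueryable sequences carry an expression tree a provider could turn into SQL.
    public static bool CanBeTranslated(IEnumerable<string> query)
    {
        return query is IQueryable<string>;
    }

    public static void Main()
    {
        // Queryable.Where and Queryable.Take: the lambdas become part of an expression tree.
        IQueryable<string> translated = FakeLoginIds().Where(id => id.Length > 0).Take(10);

        // ParallelEnumerable.Where and Take: the lambdas are compiled delegates,
        // so the filter and the Take run in memory, after the source is enumerated.
        ParallelQuery<string> local = FakeLoginIds().AsParallel().Where(id => id.Length > 0).Take(10);

        Console.WriteLine(CanBeTranslated(translated));
        Console.WriteLine(CanBeTranslated(local));
    }
}
```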

So remember: LINQ to SQL and LINQ to Entities queries are executed by the respective databases and query providers, and PLINQ does not offer a way to parallelize those queries. However, if you wish to process the results of those queries in memory, including joining the output of many heterogeneous queries, then PLINQ can be quite useful.
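
A minimal sketch of that split, again with an in-memory AsQueryable() list standing in for the database query (an assumption made so that the example runs on its own; InMemoryPlinq, FakeLoginIds and TopLoginsUpper are hypothetical names). The filter and the Take stay on the IQueryable side, ToList() materializes the rows, and only then does AsParallel() take over for the in-memory work.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class InMemoryPlinq
{
    // Stand-in for the LoginID column of the Employee table.
    public static IQueryable<string> FakeLoginIds()
    {
        return Enumerable.Range(1, 50)
            .Select(i => "adventure-works\\user" + i)
            .AsQueryable();
    }

    public static List<string> TopLoginsUpper(IQueryable<string> loginIds)
    {
        // Step 1: Where and Take are applied to the IQueryable, so with a real
        // provider they would be part of the query sent to the database.
        List<string> rows = loginIds
            .Where(id => id.Length > 0)
            .Take(10)
            .ToList();

        // Step 2: the rows are now in memory, so PLINQ can process them.
        // AsOrdered() keeps the output in the same order as the input.
        return rows
            .AsParallel()
            .AsOrdered()
            .Select(id => id.ToUpperInvariant())
            .ToList();
    }

    public static void Main()
    {
        foreach (string login in TopLoginsUpper(FakeLoginIds()))
        {
            Console.WriteLine(login);
        }
    }
}
```

For ten short strings the parallel step costs more than it saves, as the previous post showed; it only pays off when the per-row work in step 2 is heavy.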
