B i l l i o n strings falling from c l o u d

December 25, 2009 Rafat Sarosh Leave a comment

Everyday, we have to process almost a Billion strings from our click stream store. To keep things simple, let’s say I get the URL and I need to parse the domain name from string. For example: if the URL is as follows:

http://www.microsoft.com/Downloads.

Then I need to parse this string and get the domain name “www.microsoft.com”.

Looking at the parallel libraries of .NET 4.0,  I thought to use them and increase our data crunching performance. I planned to use Parallel.ForEach in place of foreach  to harness all the CPU cores of my box.

However, to my amazement,  I realized that using parallel version of foreach is taking more time than the single thread version of foreach. After little head scratching, I realized the work in my foreach loop was so small that cost of creating thread and destroying them was more than the work they need to do them self. It is like if the work is small enough that explaining someone to do it will take more time than actually doing it, sometime parallel processing and delegating work is not good.

However, as soon as I put heavy processing in the loop, for e.g. making the thread sleep for 10 millisecond :), my parallel version started performing better than the single threaded version of foreach.

Here is the code I wrote to do the parallel parsing:


using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Threading.Tasks;
using System.Threading;

namespace DP_Parallel
{
class Program
{
static void Main(string[] args)
{
 List str = new List();
 using (AdventureWorksEntities AwContext = new AdventureWorksEntities())
 {
  IEnumerable uri = AwContext.ClickStreams
  .Where(u => u.ReferringURI.Length > 0)
  .Select(u => u.ReferringURI)
     .Take(1000000);

     //Parallel
     Parallel.ForEach
  (uri, (string s) =>
     {
      int Pos = s.IndexOf(@"/", 8) - 7;
      if (Pos > 5)
      {
        //Thread.Sleep(10);
       str.Add(s.Substring(7, Pos))  ;
       //Some other processing on the string - upper case etc
      }
     }
    );
   }
  }
 }
}

For a million rows C# non parallel code took 8.67 second, whereas Parallel version took 1 Minute and 1.51 second, but only 16 second in SQL with the following query

print cast ( getdate() as time )
SELECT  TOP 1000000
  case CHARINDEX ( '/', [referringURI], 8 )
  when 0 then ReferringURI
  else substring ( [ReferringURI] , 8, (CHARINDEX ( '/', [referringURI], 8)  ) - 8)
  end as DomainName
   into #T
  FROM [StagerDW].[dbo].[ClickStream] where len (ReferringURI) > 0
print cast ( getdate() as time )
 

So for just this task, C# code is twice as efficient over SQL and  many fold over the parallel code. But, as I said if you have to do a little heavy processing than parallel code may win over the single thread code.  So the best option to write a CLR Stored procedure in SQL, and from the CLR code enjoy the multi core parallel processing advantages.

yes, we don’t wait whole day to process the billion rows but crunch them as they come in …

AsParallel makes your query topless

December 25, 2009 Rafat Sarosh Leave a comment

Here is a simple code snippet using Entity framework:


using (AdventureWorksEntities AwContext = new AdventureWorksEntities())
 {
   var LoginId = AwContext.Employees
            .Where ( u => u.LoginID.Length > 0 )
            .Select(u => u.LoginID)
            .Take(10);

As you can see from the above code, I am trying to get the top 10 rows from Employee table and only one column LoginID, where loginID length is more than zero. If you fire this query you may get a SQL in SQL Profiler as follows:

SELECT TOP (10)
      [c].[LoginID] AS [LoginID]
        FROM [HumanResources].[Employee] AS [c]
            WHERE LEN( [c].[LoginID] ) > 0
Now, web is full of examples of how to take advantage of multiple cores and make your query run in parallel by using PLINQ.  Just add magic word AsParallel in front of the data source and your code will take advantage of multi core and run in multi thread . But if you are developer who read 5000 words a minute, you may miss the fact that PLINQ applies only to LINQ to objects (i.e. IEnumerable-based sources where lambdas are bound to delegates, not IQueryable-based sources where the lambdas are bound to expressions) and you may add AsParallel in your query thinking that you are using multi core of your CPU and some how your query will become faster. Unfortunately your code may still work, but now it has sever side effects behind the scene.
<pre>using (AdventureWorksEntities AwContext = new AdventureWorksEntities())</pre>
 {
 var LoginId = AwContext.Employees.AsParallel
 .Where ( u => u.LoginID.Length > 0 )
 .Select(u => u.LoginID)
 .Take(10);
With luck your code may still work, but under the cover many things have changed. First one is your Top 10 selection is gone from the query which goes to SQL. This can be a problem if your table is large with couple of millions rows. Second, now you are getting all the columns. Third, if you are lucky then you may get some exceptions in  the code as other threads try to process your rest of statements like ‘Where’ clause and they break up with the Null Reference exception else you will just squander the resources in the false pretext that you have written an efficient code. Here is the SQL generated after adding the ‘AsParallel’  keyword.
SELECT
   [Extent1].[EmployeeID] AS [EmployeeID],
   [Extent1].[NationalIDNumber] AS [NationalIDNumber],
   [Extent1].[ContactID] AS [ContactID],
   [Extent1].[LoginID] AS [LoginID],
..all other columns
     FROM [HumanResources].[Employee] AS [Extent1]
         WHERE LEN( [c].[LoginID] ) > 0

So remember, LINQ-to-SQL and LINQ-to-Entities queries will be executed by the respective databases and query providers, PLINQ does not offer a way to parallelize those queries. However,  If you wish to process the results of those queries in memory, including joining the output of many heterogeneous queries, then PLINQ can be quite useful.

XMLTextWriter v/s XElement

September 29, 2009 Rafat Sarosh Leave a comment

I have to move some data from Database to a XML file. A very simple task, write a SQL query, get the result and put it into a XML File. So here is my data query


DownloadDataContext db = new DownloadDataContext();
IQueryable downloads =
      from f in db.Files
      join fr in db.FamilyReleases
          on f.ReleaseID equals fr.ReleaseID 
      join fdl in  db.FamilyDetailsByLocales
          on fr.FamilyID equals fdl.FamilyID
              select new DownloadDetails{
                FamilyID = fdl.FamilyID,
                FileName = f.FileName,
                FileSize = f.FileSize, URL = f.URL
)

Once I got the data in IQueryable downloads, Out of old habit, I proceeded to make the XML as follows:

MemoryStream ms = new MemoryStream();
using (XmlTextWriter w = new XmlTextWriter( ms, Encoding.UTF8 )) {
     w.Formatting = Formatting.Indented;
     w.WriteStartElement("Products");
           foreach (DownloadDetails d in downloads) { 
          w.WriteStartElement("FamilyID"); w.WriteValue(d.FamilyID.ToString());                                

w.WriteEndElement(); w.WriteStartElement("FileName"); w.WriteValue(d.FileName ); w.WriteEndElement(); w.WriteStartElement("FileSize"); w.WriteValue(d.FileSize); w.WriteEndElement(); w.WriteStartElement("URL"); w.WriteValue(d.URL); w.WriteEndElement(); }
w.WriteEndElement(); //products
}
StreamReader r = new StreamReader(ms,Encoding.UTF8);
ms.Seek(0, SeekOrigin.Begin);
return r.ReadToEnd();

Very simple, However when I check the returned XML string,  it is always cut in between and a proper XML is not formed. As the data is localized there may be a possibility that there is some funny character came in between and stream reader ReadToEnd function got confused. Debugging with 40+ languages in data was a nightmare, and I had no clue why streamreader is not working as it is suppose to work. There were other problems too, If any of the values are null, TextWriter.WriteValue complain very loudly. So every where I have to put a check to see if the value is null or not. However, the XElement came to my rescue. XElement does take all the possible data types, and does not complain if any of the data type is null.


XElement x = new XElement("Products");
foreach (DownloadDetails d in downloads) {
x.Add(new XElement("Product",
new XElement("FamilyID", d.FamilyID.ToString()),
new XElement("FileName", d.FileName),
new XElement("FileSize", d.FileSize),
new XElement("URL", d.URL + ""))); }

XElement not only handles NULL gracefully but does not complain about any of the unknown characters which threw StreamReader off. Clearly XElement is a better choice to make  XML out of data, here is another example of XElement use.

Categories: C#, LINQ Tags: ,

Difference between facade and Decorator Pattern

September 22, 2009 Rafat Sarosh 2 comments

Yesterday, I was discussing with our friends about pattern and differences between facade and decorator pattern? Facade gives the impression that your create a facade object over different desperate objects, or you actually decorate them with another layer, or they are two different name for the same pattern etc. All kind of confusion was there, because of the name, I noticed people get confused between these two pattern, not the one who understand both the pattern well, but the people who read about them last year and did not understand them well.

With this blog entry, I want to make sure that you will never ever forget their definition and never get mixed up between these two patterns.

Decorator pattern is a design pattern that allows new/additional behavior to be added to an existing class dynamically. Decorators provide a flexible alternative to sub classing for extending functionality.

The decorator pattern can be used to make it possible to extend (decorate) the functionality of a class at runtime. Taking the example from Allan Shalloway book, Design patterns explained. Suppose you have to print a SalesTicket, and depending up on sales, you may need to print multiple headers and multiple footers. Decorator pattern helps you control headers, by chaining together the headers desired in the correct order.

Header1->Header2->Header3->Concrete Component.

beadneckles2 The Decorator pattern separates the dynamic building of this chain of functionality from the client. It will also separates the building of the chain from the chain components (Header and Footer).

Now, here is a way to remember it forever, a different way to think is, Decorator pattern help decorate. decorator pattern help us create  necklaces with different bead color combinations.  Decorator pattern is chain of objects. To achieve this you wrap the original bead class with a decorator class.

Suppose we need to write a program, which make different necklaces with different color combinations of beads. Because of decorator pattern it is fairly easy to build the different necklaces and the change in code can be limited to factory class.

 

Here is the code:

using System;

namespace Decorator
{

///
/// The Beads abstract class
///

abstract class Beads
{
        public abstract void MakeNecklace();
}

///
/// The 'ConcreteComponent' class
///

class ConcerteBead : Beads
{

public override void MakeNecklace()
{
         Console.WriteLine("ConcreteBead");
}
}

///
/// The 'Decorator' abstract class
///

abstract class Decorator : Beads
{
      protected Beads beads;
      public Decorator(Beads _beads)
     {
           this.beads = _beads;
     }
    public Decorator()
   {
    }

 public override void MakeNecklace()
 {
       if (beads != null)
     {
         beads.MakeNecklace();
      }
 }
}

class RedBeads : Decorator
{
       public RedBeads(Beads _beads)
      {
           beads = _beads;
        }
       public override void MakeNecklace()
      {
          base.MakeNecklace();
         Console.WriteLine("RED");
     }
}

class GreenBeads : Decorator
{
         public GreenBeads(Beads _beads)
        {
              beads = _beads;
        }
        public override void MakeNecklace()
       {
          base.MakeNecklace();
          Console.WriteLine("GREEN");
        }
}

class WhiteBeads : Decorator
{
        public WhiteBeads(Beads _beads)
       {
           beads = _beads;
        }
         public override void MakeNecklace()
         {
            base.MakeNecklace();
            Console.WriteLine("WHITE");
         }
}

class Program
{
        static void Main()
       {
           Beads beads = BeadFactory(true);
           beads.MakeNecklace();
           Console.ReadKey();
       }

private static Beads BeadFactory(bool Colorful)
{
           if (Colorful)
         {
                return new GreenBeads(new RedBeads(new WhiteBeads(new ConcerteBead())));
          }
else
        {
              return new WhiteBeads(new WhiteBeads(new WhiteBeads(new ConcerteBead())));

       }
  }
 }
}

The output of the above code will be:

Concrete Bead
WHITE
RED
GREEN

 

To decouple client from the chain of object,  a factory objects can instantiate the chains based upon some configuration information.  Instantiation of the chains of objects is completely decoupled from the Client objects that use it. Decorator Allows for extending the functionality of an object without resorting to subclassing.

May be tomorrow we will talk about Facade pattern.

Categories: C#, Pattern

Suddenly working system performance nose dived, now what?

September 17, 2009 Rafat Sarosh Leave a comment

We had a ETL process, which used to process a small table of thirty million records a time. These records were moved in a staging table for the extensive processing. Once the processing completes, we truncate the staging table and get the data for the next day.

The system ran fine for couple of days, then suddenly the ETL processed slowed and the ETL complete time increased by many folds, from 15 minutes it reached to four hours or so.

Our first attempt to find the cause of slow query performance was to check the Query plan. There I noticed that we are using an Index Lookup.

   SELECT @DLStartTime = MIN(DownloadTime)      FROM etl.VLSCRawDataFromStager(nolock)      WHERE (DownloadURI = @DownloadURI AND ASID = @ASID) OR TransactionID = @TransactionId</p>  <p>

We had couple of non-clustered indexes on DownloadURI, ASID and on TransactionId. And a clustered Index on RowId. Looking at the query, you may think the query must be using these simple non-crusted indexes to get the DownloadTime. However, the Query plan looks like as follows:

execution

 

Query is using the key lookup operator. The use of a Key Lookup operator in a query plan indicates that the query might benefit from performance tuning. For example, query performance might be improved by adding a covering index. read more on http://msdn.microsoft.com/en-us/library/bb326635.aspx

Detail of two non-clustered indexes seeks, output shows as  ‘RowId’ and then this ‘RowId’ is used in turn to do a Key lookup on clustered Index for the output ‘DownloadTime’. Key lookup fetches the extra columns from the clustered index when the non clustered index that’s used to retrieve the rows doesn’t have all the columns required.

Bookmark lookup is not a cheap operation. The  plan can be improved by  adding one or more columns to an existing index so as to eliminate a bookmark lookup, in SQL Server 2005 onwards, you can add columns using the include clause of the create index statement.  Included columns are more efficient than key columns; they save disk space and make searching and updating the index more efficient.  SQL decide on the query plan depending on the number of rows returned. Read more on

http://sqlserverpedia.com/blog/sql-server-bloggers/key-lookup-threshold/comment-page-1/
http://blogs.msdn.com/craigfr/archive/2006/06/30/652639.aspx

We modified one of the index and included other columns.

 CREATE NONCLUSTERED INDEX [VLSCRawDataFromStager_ndx2] ON [etl].[VLSCRawDataFromStager]    (  [DownloadTime] ASC,  [DownloadURI] ASC,  [ASID] ASC,  [TransactionID] ASC   )     <br />GO</p>  <p>

This change removed the Key lookup from the query plan, and suddenly our performance increased by many fold.

It worked for few hours, but before our celebration beer can become warm, the performance again tanked. Now what?

What the hell happened?  For every run we are truncating the tables, and the system ran fine for weeks, now it is tanking, specially when we removed the nasty key lookup. We thought we improved the system but system performance is not improved.

Now, I used sys.dm_db_index_physical_stats check the indexes, this returns fragmentation information for the data and indexes of the specified table or view.

 select * from sys.dm_db_index_physical_stats (DB_ID(N'DataBaseName'), OBJECT_ID(N'TableName'), NULL, NULL , NULL); 

and of course we see the indexes are fragmented more then 95%. I  realized the truncate table statement does not get rid of Indexes, TRUNCATE TABLE removes all rows from a table, but the table structure and its columns, constraints, indexes and so on remain. read more http://msdn.microsoft.com/en-us/library/aa260621(SQL.80).aspx

Overlooking this fact threw us in wrong direction, else this is the first thing we should be checking, but we thought truncate table is cleaning the indexes too, which turn out not be the case. However, the simple fix was to get rid of indexes too, and recreate them after every truncate. This fix bought our ETL performance back to acceptable level.

 

P.S

However the result set from sys.dm_db_index_physical_stats has index_id and object_id as the column name. You may waste few minutes to write query to get the index name from these ids.  So here is the query:

 select i.*, stat.* from sys.dm_db_index_physical_stats (DB_ID(N'DataBaseName'), OBJECT_ID(N'TableName'), NULL, NULL , NULL) stat join sys.indexes i on i.object_id = stat.object_id and i.index_id = stat.index_id 

OBJECT_ID is evaluated in the current database context, not in the context of the database which you specified in for DB_ID. If the current database does not have the table, then Object_ID will return NULL and your result will be wrong

To get all the indexes in a database it is simply:

 select i.*, stat.* from sys.dm_db_index_physical_stats (DB_ID(N'AdventureWorks'), NULL, NULL, NULL , NULL) stat join sys.indexes i on i.object_id = stat.object_id and i.index_id = stat.index_id where stat.avg_fragmentation_in_percent &gt; 10   <p></p> 

Read more at http://msdn.microsoft.com/en-us/library/ms188917.aspx

Categories: SQL Tags:

Temp Table Variable v/s Temp table

September 16, 2009 Rafat Sarosh Leave a comment

We know that using Temp table in a stored procedure  reduced the chance to reuse the execution plan. Change the temp table to temp variable

 select top 1000  LimitedProgramTBNEventID into #Events from LimitedProgramTBNEvent(NOLOCK)  	where eventStatus = 7 

we changed it as follows:

 DECLARE @Events table (LimitedProgramTBNEventID int NOT NULL) insert into @Events select top 1000 LimitedProgramTBNEventID 	from LimitedProgramTBNEvent (NOLOCK)where eventStatus = 7  

Here is another reason for not to use the Temp table. If you have a temp table then you cannot use the ‘Display Estimated execution plan"’ option in the query analyzer. When a query is run using the "Display Estimated Execution Plan" option, it does not really run so no temp tables are created. Since they are not created, any references to them in the code will fail, which prevents an estimated execution plan from being created. 

Categories: SQL Tags:

Not for replication

August 24, 2009 Rafat Sarosh Leave a comment

If you have a database which is replicated, then run the following query on the publisher and if you get any result back then you got some work at your hand

select distinct t.[name], c.* from sys.tables t
    join sys.columns c on t.object_id = c.object_id
             and c.is_identity = 1 and c.is_replicated = 0

this query basically tells that you have tables which have identity columns but they are not created with ‘NOT FOR REPLICATION’ option.

If that is the case then your identity columns are out of synch between publisher and subscriber. You can fix it by setting the is_replicated column to 1.

EXEC %%
ColumnEx(
ObjectID = @
object_id,
Name = @
identity_column).
SetIdentityNotForRepl(VALUE = 1)
Categories: SQL Tags: ,

ACCESS (MDB) to XML with LINQ

I had some data in Access DB and wanted to convert it to XML. Data in Access was in one table called ‘Words’, with the following columns: Word, Meaning, Synonyms, Antonyms, Example etc. I wanted to convert this data in XML as follows:

<Dictionary>
<Word Spelling="Munificent">
    <Meaning>very generous</Meaning>
    <Synonyms />
    <Antonyms />
    <Example />
  </Word>
  <Word Spelling="Anfractuosity">
    <Meaning>Twist or turn, A complicated or involved process. </Meaning>
    <Synonyms />
    <Antonyms />
    <Example> The anfractuosities of his intellect and temper. –Macaulay </Example>
  </Word>
</Dictionary>

LINQ made job to read the data and convert it to XML very easy, as you can see all this happen in one line.

using System;
using System.Linq;
using System.Text;
using System.Data.OleDb;
using System.Xml.Linq;
using System.Data.Linq;
using System.Data.Linq.Mapping;

namespace Convert2XML
{
    class Program
    {
static void Main(string[] args)
        {
                  string ConnectionString = @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=Vocabulary.mdb;Persist Security Info=false";
                  DataContext db = new DataContext(new OleDbConnection(ConnectionString)) ;
                  Table <words> tblwords =   db.GetTable<words>();
                  System.IO.File.WriteAllText("Data.XML",
                            ( new XElement("Dictionary", from w in tblwords
                                                   select new XElement("Word" ,
                                                          new XAttribute("Spelling", w.Word),
                                                   new XElement("Meaning", w.Meaning),
                                                   new XElement("Synonyms", w.Synonyms),
                                                   new XElement("Antonyms", w.Antonyms),
                                                   new XElement("Example", w.Example)
                                                   )
                                           )
                             ).ToString()
                             , Encoding.UTF8);
        }

        [Table(Name = "words")]
        public class words
        {
            [Column]
            public string Word;
            [Column]
            public string Meaning;
            [Column]
            public string Synonyms;
            [Column]
            public string Antonyms;
            [Column]
            public int weight;
            [Column]
            public string Reference;
            [Column]
            public string Example;
            [Column]
            public string link;
        }
    }
}

If you don’t have OLEDB 12.0 then You can use the following connection string too

string ConnectionString = @”Provider=Microsoft.Jet.OLEDB.4.0;Data Source=Vocabulary.mdb;Persist Security Info=True” ;

If you don’t have Microsoft.ACE.OLEDB.12.0 then get it from here

http://www.microsoft.com/downloads/details.aspx?familyid=7554F536-8C28-4598-9B72-EF94E038C891&displaylang=en

These oledb providers are 32 bit, so make sure to compile your project for 32 bit on a 64 bit machine.
Error handling and comments are removed for the brevity.

Categories: C#, LINQ Tags: , , ,

How to Partition Cube – using C#

June 3, 2009 Rafat Sarosh 1 comment

There are many different ways you can partition a cube. Here I am talking about – how you can use C# to partition a cube from a DTS package?

Using  Microsoft.AnalysisServices  Object API, you can travel the whole Hierarchy of objects  and completely administer an Analysis Service Instance. From the connection Object you can get to the Server and database instances. And in database you can transverse the Cube and their different measures. Once you get hold of a measure then you can transverse the different partitions of measures and dig deep in to each partition query definitions etc.
Read more…

Push or Pull Replication

March 11, 2009 Rafat Sarosh Leave a comment

A study conducted by MSCOM ops team found 1 GB of data took 110 Minute to push and 12.5 minutes to Pull across the continents. Furthermore SQL 2008 with Windows Server 2008 made a huge difference over SQL 2005 and Windows Server 2003.

 

Push subscription replication of character data with SQL Server 2008 running on Windows Server 2008 yielded a 104 percent increase over SQL Server 2005 running on Windows Server 2003, and pull subscription replication of the same data yielded a 1,298 percent gain.

 

Lot of performance improvement is attributed to windows server TCP/IP Stack and SQL Native Client driver. Different reader and writer thread behavior resulted in push and pull performance difference.

 

Reader thread fills two buffers  of 40 KB  and signals the Writer thread to write the buffer on subscriber, so the 40 KB packet moves thru and fro over the wire, but in case of Pull model Writer thread pulls the data in one huge chunk as big as TCP/IP let it get.  One of the most significant improvements to Windows Server 2008 is the autotuning of the receive window size, designed for high-latency environments. In previous versions of the operating system, the maximum window size was limited by the 16-bit Window field in the TCP header amounting to a maximum window size of 64 KB. In Windows Server 2008, by combining the Window field together with a Scale Factor field of the TCP header, the window can be scaled or tuned, up to 16 megabytes (MB) in size.

 

Read the full study here http://msdn.microsoft.com/en-us/library/dd263442.aspx

 

 

Categories: SQL