B i l l i o n strings falling from c l o u d
Everyday, we have to process almost a Billion strings from our click stream store. To keep things simple, let’s say I get the URL and I need to parse the domain name from string. For example: if the URL is as follows:
http://www.microsoft.com/Downloads.
Then I need to parse this string and get the domain name “www.microsoft.com”.
Looking at the parallel libraries of .NET 4.0, I thought to use them and increase our data crunching performance. I planned to use Parallel.ForEach in place of foreach to harness all the CPU cores of my box.
However, to my amazement, I realized that using parallel version of foreach is taking more time than the single thread version of foreach. After little head scratching, I realized the work in my foreach loop was so small that cost of creating thread and destroying them was more than the work they need to do them self. It is like if the work is small enough that explaining someone to do it will take more time than actually doing it, sometime parallel processing and delegating work is not good.
However, as soon as I put heavy processing in the loop, for e.g. making the thread sleep for 10 millisecond :), my parallel version started performing better than the single threaded version of foreach.
Here is the code I wrote to do the parallel parsing:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Diagnostics;
using System.Threading.Tasks;
using System.Threading;
namespace DP_Parallel
{
class Program
{
static void Main(string[] args)
{
List str = new List();
using (AdventureWorksEntities AwContext = new AdventureWorksEntities())
{
IEnumerable uri = AwContext.ClickStreams
.Where(u => u.ReferringURI.Length > 0)
.Select(u => u.ReferringURI)
.Take(1000000);
//Parallel
Parallel.ForEach
(uri, (string s) =>
{
int Pos = s.IndexOf(@"/", 8) - 7;
if (Pos > 5)
{
//Thread.Sleep(10);
str.Add(s.Substring(7, Pos)) ;
//Some other processing on the string - upper case etc
}
}
);
}
}
}
}
For a million rows C# non parallel code took 8.67 second, whereas Parallel version took 1 Minute and 1.51 second, but only 16 second in SQL with the following query
print cast ( getdate() as time ) SELECT TOP 1000000 case CHARINDEX ( '/', [referringURI], 8 ) when 0 then ReferringURI else substring ( [ReferringURI] , 8, (CHARINDEX ( '/', [referringURI], 8) ) - 8) end as DomainName into #T FROM [StagerDW].[dbo].[ClickStream] where len (ReferringURI) > 0 print cast ( getdate() as time )
So for just this task, C# code is twice as efficient over SQL and many fold over the parallel code. But, as I said if you have to do a little heavy processing than parallel code may win over the single thread code. So the best option to write a CLR Stored procedure in SQL, and from the CLR code enjoy the multi core parallel processing advantages.
yes, we don’t wait whole day to process the billion rows but crunch them as they come in …
AsParallel makes your query topless
Here is a simple code snippet using Entity framework:
using (AdventureWorksEntities AwContext = new AdventureWorksEntities())
{
var LoginId = AwContext.Employees
.Where ( u => u.LoginID.Length > 0 )
.Select(u => u.LoginID)
.Take(10);
As you can see from the above code, I am trying to get the top 10 rows from Employee table and only one column LoginID, where loginID length is more than zero. If you fire this query you may get a SQL in SQL Profiler as follows:
SELECT TOP (10)
[c].[LoginID] AS [LoginID]
FROM [HumanResources].[Employee] AS [c]
WHERE LEN( [c].[LoginID] ) > 0
<pre>using (AdventureWorksEntities AwContext = new AdventureWorksEntities())</pre>
{
var LoginId = AwContext.Employees.AsParallel
.Where ( u => u.LoginID.Length > 0 )
.Select(u => u.LoginID)
.Take(10);
SELECT
[Extent1].[EmployeeID] AS [EmployeeID],
[Extent1].[NationalIDNumber] AS [NationalIDNumber],
[Extent1].[ContactID] AS [ContactID],
[Extent1].[LoginID] AS [LoginID],
..all other columns
FROM [HumanResources].[Employee] AS [Extent1]
WHERE LEN( [c].[LoginID] ) > 0
So remember, LINQ-to-SQL and LINQ-to-Entities queries will be executed by the respective databases and query providers, PLINQ does not offer a way to parallelize those queries. However, If you wish to process the results of those queries in memory, including joining the output of many heterogeneous queries, then PLINQ can be quite useful.
XMLTextWriter v/s XElement
I have to move some data from Database to a XML file. A very simple task, write a SQL query, get the result and put it into a XML File. So here is my data query
DownloadDataContext db = new DownloadDataContext();
IQueryable downloads =
from f in db.Files
join fr in db.FamilyReleases
on f.ReleaseID equals fr.ReleaseID
join fdl in db.FamilyDetailsByLocales
on fr.FamilyID equals fdl.FamilyID
select new DownloadDetails{
FamilyID = fdl.FamilyID,
FileName = f.FileName,
FileSize = f.FileSize, URL = f.URL
)
Once I got the data in IQueryable downloads, Out of old habit, I proceeded to make the XML as follows:
MemoryStream ms = new MemoryStream();
using (XmlTextWriter w = new XmlTextWriter( ms, Encoding.UTF8 )) {
w.Formatting = Formatting.Indented;
w.WriteStartElement("Products");
foreach (DownloadDetails d in downloads) {
w.WriteStartElement("FamilyID"); w.WriteValue(d.FamilyID.ToString());
w.WriteEndElement(); w.WriteStartElement("FileName"); w.WriteValue(d.FileName ); w.WriteEndElement(); w.WriteStartElement("FileSize"); w.WriteValue(d.FileSize); w.WriteEndElement(); w.WriteStartElement("URL"); w.WriteValue(d.URL); w.WriteEndElement(); }
w.WriteEndElement(); //products
}
StreamReader r = new StreamReader(ms,Encoding.UTF8);
ms.Seek(0, SeekOrigin.Begin);
return r.ReadToEnd();
Very simple, However when I check the returned XML string, it is always cut in between and a proper XML is not formed. As the data is localized there may be a possibility that there is some funny character came in between and stream reader ReadToEnd function got confused. Debugging with 40+ languages in data was a nightmare, and I had no clue why streamreader is not working as it is suppose to work. There were other problems too, If any of the values are null, TextWriter.WriteValue complain very loudly. So every where I have to put a check to see if the value is null or not. However, the XElement came to my rescue. XElement does take all the possible data types, and does not complain if any of the data type is null.
XElement x = new XElement("Products");
foreach (DownloadDetails d in downloads) {
x.Add(new XElement("Product",
new XElement("FamilyID", d.FamilyID.ToString()),
new XElement("FileName", d.FileName),
new XElement("FileSize", d.FileSize),
new XElement("URL", d.URL + ""))); }
XElement not only handles NULL gracefully but does not complain about any of the unknown characters which threw StreamReader off. Clearly XElement is a better choice to make XML out of data, here is another example of XElement use.
Difference between facade and Decorator Pattern
Yesterday, I was discussing with our friends about pattern and differences between facade and decorator pattern? Facade gives the impression that your create a facade object over different desperate objects, or you actually decorate them with another layer, or they are two different name for the same pattern etc. All kind of confusion was there, because of the name, I noticed people get confused between these two pattern, not the one who understand both the pattern well, but the people who read about them last year and did not understand them well.
With this blog entry, I want to make sure that you will never ever forget their definition and never get mixed up between these two patterns.
Decorator pattern is a design pattern that allows new/additional behavior to be added to an existing class dynamically. Decorators provide a flexible alternative to sub classing for extending functionality.
The decorator pattern can be used to make it possible to extend (decorate) the functionality of a class at runtime. Taking the example from Allan Shalloway book, Design patterns explained. Suppose you have to print a SalesTicket, and depending up on sales, you may need to print multiple headers and multiple footers. Decorator pattern helps you control headers, by chaining together the headers desired in the correct order.
Header1->Header2->Header3->Concrete Component.
The Decorator pattern separates the dynamic building of this chain of functionality from the client. It will also separates the building of the chain from the chain components (Header and Footer).
Now, here is a way to remember it forever, a different way to think is, Decorator pattern help decorate. decorator pattern help us create necklaces with different bead color combinations. Decorator pattern is chain of objects. To achieve this you wrap the original bead class with a decorator class.
Suppose we need to write a program, which make different necklaces with different color combinations of beads. Because of decorator pattern it is fairly easy to build the different necklaces and the change in code can be limited to factory class.
Here is the code:
using System;
namespace Decorator
{
///
/// The Beads abstract class
///
abstract class Beads
{
public abstract void MakeNecklace();
}
///
/// The 'ConcreteComponent' class
///
class ConcerteBead : Beads
{
public override void MakeNecklace()
{
Console.WriteLine("ConcreteBead");
}
}
///
/// The 'Decorator' abstract class
///
abstract class Decorator : Beads
{
protected Beads beads;
public Decorator(Beads _beads)
{
this.beads = _beads;
}
public Decorator()
{
}
public override void MakeNecklace()
{
if (beads != null)
{
beads.MakeNecklace();
}
}
}
class RedBeads : Decorator
{
public RedBeads(Beads _beads)
{
beads = _beads;
}
public override void MakeNecklace()
{
base.MakeNecklace();
Console.WriteLine("RED");
}
}
class GreenBeads : Decorator
{
public GreenBeads(Beads _beads)
{
beads = _beads;
}
public override void MakeNecklace()
{
base.MakeNecklace();
Console.WriteLine("GREEN");
}
}
class WhiteBeads : Decorator
{
public WhiteBeads(Beads _beads)
{
beads = _beads;
}
public override void MakeNecklace()
{
base.MakeNecklace();
Console.WriteLine("WHITE");
}
}
class Program
{
static void Main()
{
Beads beads = BeadFactory(true);
beads.MakeNecklace();
Console.ReadKey();
}
private static Beads BeadFactory(bool Colorful)
{
if (Colorful)
{
return new GreenBeads(new RedBeads(new WhiteBeads(new ConcerteBead())));
}
else
{
return new WhiteBeads(new WhiteBeads(new WhiteBeads(new ConcerteBead())));
}
}
}
}
The output of the above code will be:
Concrete Bead
WHITE
RED
GREEN
To decouple client from the chain of object, a factory objects can instantiate the chains based upon some configuration information. Instantiation of the chains of objects is completely decoupled from the Client objects that use it. Decorator Allows for extending the functionality of an object without resorting to subclassing.
May be tomorrow we will talk about Facade pattern.
Suddenly working system performance nose dived, now what?
We had a ETL process, which used to process a small table of thirty million records a time. These records were moved in a staging table for the extensive processing. Once the processing completes, we truncate the staging table and get the data for the next day.
The system ran fine for couple of days, then suddenly the ETL processed slowed and the ETL complete time increased by many folds, from 15 minutes it reached to four hours or so.
Our first attempt to find the cause of slow query performance was to check the Query plan. There I noticed that we are using an Index Lookup.
SELECT @DLStartTime = MIN(DownloadTime) FROM etl.VLSCRawDataFromStager(nolock) WHERE (DownloadURI = @DownloadURI AND ASID = @ASID) OR TransactionID = @TransactionId</p> <p>
We had couple of non-clustered indexes on DownloadURI, ASID and on TransactionId. And a clustered Index on RowId. Looking at the query, you may think the query must be using these simple non-crusted indexes to get the DownloadTime. However, the Query plan looks like as follows:
Query is using the key lookup operator. The use of a Key Lookup operator in a query plan indicates that the query might benefit from performance tuning. For example, query performance might be improved by adding a covering index. read more on http://msdn.microsoft.com/en-us/library/bb326635.aspx
Detail of two non-clustered indexes seeks, output shows as ‘RowId’ and then this ‘RowId’ is used in turn to do a Key lookup on clustered Index for the output ‘DownloadTime’. Key lookup fetches the extra columns from the clustered index when the non clustered index that’s used to retrieve the rows doesn’t have all the columns required.
Bookmark lookup is not a cheap operation. The plan can be improved by adding one or more columns to an existing index so as to eliminate a bookmark lookup, in SQL Server 2005 onwards, you can add columns using the include clause of the create index statement. Included columns are more efficient than key columns; they save disk space and make searching and updating the index more efficient. SQL decide on the query plan depending on the number of rows returned. Read more on
http://sqlserverpedia.com/blog/sql-server-bloggers/key-lookup-threshold/comment-page-1/
http://blogs.msdn.com/craigfr/archive/2006/06/30/652639.aspx
We modified one of the index and included other columns.
CREATE NONCLUSTERED INDEX [VLSCRawDataFromStager_ndx2] ON [etl].[VLSCRawDataFromStager] ( [DownloadTime] ASC, [DownloadURI] ASC, [ASID] ASC, [TransactionID] ASC ) <br />GO</p> <p>
This change removed the Key lookup from the query plan, and suddenly our performance increased by many fold.
It worked for few hours, but before our celebration beer can become warm, the performance again tanked. Now what?
What the hell happened? For every run we are truncating the tables, and the system ran fine for weeks, now it is tanking, specially when we removed the nasty key lookup. We thought we improved the system but system performance is not improved.
Now, I used sys.dm_db_index_physical_stats check the indexes, this returns fragmentation information for the data and indexes of the specified table or view.
select * from sys.dm_db_index_physical_stats (DB_ID(N'DataBaseName'), OBJECT_ID(N'TableName'), NULL, NULL , NULL);
and of course we see the indexes are fragmented more then 95%. I realized the truncate table statement does not get rid of Indexes, TRUNCATE TABLE removes all rows from a table, but the table structure and its columns, constraints, indexes and so on remain. read more http://msdn.microsoft.com/en-us/library/aa260621(SQL.80).aspx
Overlooking this fact threw us in wrong direction, else this is the first thing we should be checking, but we thought truncate table is cleaning the indexes too, which turn out not be the case. However, the simple fix was to get rid of indexes too, and recreate them after every truncate. This fix bought our ETL performance back to acceptable level.
P.S
However the result set from sys.dm_db_index_physical_stats has index_id and object_id as the column name. You may waste few minutes to write query to get the index name from these ids. So here is the query:
select i.*, stat.* from sys.dm_db_index_physical_stats (DB_ID(N'DataBaseName'), OBJECT_ID(N'TableName'), NULL, NULL , NULL) stat join sys.indexes i on i.object_id = stat.object_id and i.index_id = stat.index_id
OBJECT_ID is evaluated in the current database context, not in the context of the database which you specified in for DB_ID. If the current database does not have the table, then Object_ID will return NULL and your result will be wrong
To get all the indexes in a database it is simply:
select i.*, stat.* from sys.dm_db_index_physical_stats (DB_ID(N'AdventureWorks'), NULL, NULL, NULL , NULL) stat join sys.indexes i on i.object_id = stat.object_id and i.index_id = stat.index_id where stat.avg_fragmentation_in_percent > 10 <p></p>
Read more at http://msdn.microsoft.com/en-us/library/ms188917.aspx
Temp Table Variable v/s Temp table
We know that using Temp table in a stored procedure reduced the chance to reuse the execution plan. Change the temp table to temp variable
select top 1000 LimitedProgramTBNEventID into #Events from LimitedProgramTBNEvent(NOLOCK) where eventStatus = 7
we changed it as follows:
DECLARE @Events table (LimitedProgramTBNEventID int NOT NULL) insert into @Events select top 1000 LimitedProgramTBNEventID from LimitedProgramTBNEvent (NOLOCK)where eventStatus = 7
Here is another reason for not to use the Temp table. If you have a temp table then you cannot use the ‘Display Estimated execution plan"’ option in the query analyzer. When a query is run using the "Display Estimated Execution Plan" option, it does not really run so no temp tables are created. Since they are not created, any references to them in the code will fail, which prevents an estimated execution plan from being created.
Not for replication
If you have a database which is replicated, then run the following query on the publisher and if you get any result back then you got some work at your hand
select distinct t.[name], c.* from sys.tables t join sys.columns c on t.object_id = c.object_id and c.is_identity = 1 and c.is_replicated = 0
this query basically tells that you have tables which have identity columns but they are not created with ‘NOT FOR REPLICATION’ option.
If that is the case then your identity columns are out of synch between publisher and subscriber. You can fix it by setting the is_replicated column to 1.
EXEC %% ColumnEx( ObjectID = @ object_id, Name = @ identity_column). SetIdentityNotForRepl(VALUE = 1)
ACCESS (MDB) to XML with LINQ
I had some data in Access DB and wanted to convert it to XML. Data in Access was in one table called ‘Words’, with the following columns: Word, Meaning, Synonyms, Antonyms, Example etc. I wanted to convert this data in XML as follows:
<Dictionary>
<Word Spelling="Munificent">
<Meaning>very generous</Meaning>
<Synonyms />
<Antonyms />
<Example />
</Word>
<Word Spelling="Anfractuosity">
<Meaning>Twist or turn, A complicated or involved process. </Meaning>
<Synonyms />
<Antonyms />
<Example> The anfractuosities of his intellect and temper. –Macaulay </Example>
</Word>
</Dictionary>
LINQ made job to read the data and convert it to XML very easy, as you can see all this happen in one line.
using System;
using System.Linq;
using System.Text;
using System.Data.OleDb;
using System.Xml.Linq;
using System.Data.Linq;
using System.Data.Linq.Mapping;
namespace Convert2XML
{
class Program
{
static void Main(string[] args)
{
string ConnectionString = @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source=Vocabulary.mdb;Persist Security Info=false";
DataContext db = new DataContext(new OleDbConnection(ConnectionString)) ;
Table <words> tblwords = db.GetTable<words>();
System.IO.File.WriteAllText("Data.XML",
( new XElement("Dictionary", from w in tblwords
select new XElement("Word" ,
new XAttribute("Spelling", w.Word),
new XElement("Meaning", w.Meaning),
new XElement("Synonyms", w.Synonyms),
new XElement("Antonyms", w.Antonyms),
new XElement("Example", w.Example)
)
)
).ToString()
, Encoding.UTF8);
}
[Table(Name = "words")]
public class words
{
[Column]
public string Word;
[Column]
public string Meaning;
[Column]
public string Synonyms;
[Column]
public string Antonyms;
[Column]
public int weight;
[Column]
public string Reference;
[Column]
public string Example;
[Column]
public string link;
}
}
}
If you don’t have OLEDB 12.0 then You can use the following connection string too
string ConnectionString = @”Provider=Microsoft.Jet.OLEDB.4.0;Data Source=Vocabulary.mdb;Persist Security Info=True” ;
If you don’t have Microsoft.ACE.OLEDB.12.0 then get it from here
http://www.microsoft.com/downloads/details.aspx?familyid=7554F536-8C28-4598-9B72-EF94E038C891&displaylang=en
These oledb providers are 32 bit, so make sure to compile your project for 32 bit on a 64 bit machine.
Error handling and comments are removed for the brevity.
How to Partition Cube – using C#
There are many different ways you can partition a cube. Here I am talking about – how you can use C# to partition a cube from a DTS package?
Using Microsoft.AnalysisServices Object API, you can travel the whole Hierarchy of objects and completely administer an Analysis Service Instance. From the connection Object you can get to the Server and database instances. And in database you can transverse the Cube and their different measures. Once you get hold of a measure then you can transverse the different partitions of measures and dig deep in to each partition query definitions etc.
Read more…
Push or Pull Replication
A study conducted by MSCOM ops team found 1 GB of data took 110 Minute to push and 12.5 minutes to Pull across the continents. Furthermore SQL 2008 with Windows Server 2008 made a huge difference over SQL 2005 and Windows Server 2003.
Push subscription replication of character data with SQL Server 2008 running on Windows Server 2008 yielded a 104 percent increase over SQL Server 2005 running on Windows Server 2003, and pull subscription replication of the same data yielded a 1,298 percent gain.
Lot of performance improvement is attributed to windows server TCP/IP Stack and SQL Native Client driver. Different reader and writer thread behavior resulted in push and pull performance difference.
Reader thread fills two buffers of 40 KB and signals the Writer thread to write the buffer on subscriber, so the 40 KB packet moves thru and fro over the wire, but in case of Pull model Writer thread pulls the data in one huge chunk as big as TCP/IP let it get. One of the most significant improvements to Windows Server 2008 is the autotuning of the receive window size, designed for high-latency environments. In previous versions of the operating system, the maximum window size was limited by the 16-bit Window field in the TCP header amounting to a maximum window size of 64 KB. In Windows Server 2008, by combining the Window field together with a Scale Factor field of the TCP header, the window can be scaled or tuned, up to 16 megabytes (MB) in size.
Read the full study here http://msdn.microsoft.com/en-us/library/dd263442.aspx