Crystal Ball For Churn Prediction

Don’t you think, it is good to know before your customer churn? Imagine you have a crystal ball which tells you that a customer is going to be unhappy and may leave your service. Every business wishes for this crystal ball.

This blog is about building that crystal ball.


Sure, once you have this crystal ball, you can use it to find out the customer churn, which machine is going to break before they actually break, predict maintenance or basically do any other kind of prediction. However, here we will focus on churn prediction.

If you are a SAS (Software as Service) company and you are providing your services thru different API’s,  you can look at the usage of API’s and predict the churn.

Let’s see how you will predict the churn manually? First, you will look at the error’s your customers are getting. Second, you may see how many support calls are made by this customer. This tells you that your customer is facing difficulty in using your services. This may be an indication on your bad documentation or  badly designed API’s.  You may also notice, the age of customer. By that I mean, how long customer is a subscriber of your services. Age plays an important role. Maximum customers leave within 30 days, few leave in 60 days and hardly anyone leaves after 90 days. There can be other factors which you can take into account, but I am keeping this blog simple so we can understand the code.

Now, if you have handful of customers and few calls per day,  you can put all this in an excel file and find out who may churn. But, if you have thousands of customers and millions of calls made daily and you may have more than 50 features then you may need machines to help you. That’s where machine learning comes in.

Please take a look at my spark notebook, which is using Python to predict churn.

from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark import SparkContext
sqlContext = SQLContext(sc) #Define the structure of your file 

schema = StructType([StructField("subscription", StringType(), True), 
StructField("account", StringType(), True), 
StructField("api_type", DoubleType(), True),#Inventory, Customer, Order etc StructField("IsEnterprise", StringType(), True), 
StructField("avg_calls_daily", DoubleType(), True), 
StructField("avg_calls_last10_days", DoubleType(), True), 
StructField("customer_age", DoubleType(), True), #how long customer is with us. First 30 days are critical 
StructField("total_error_event", DoubleType(), True),
StructField("total_http_200", DoubleType(), True), 
StructField("total_http_400", DoubleType(), True), 
StructField("total_http_409", DoubleType(), True), #conflict 
StructField("total_http_500", DoubleType(), True), 
StructField("number_customer_service_calls", DoubleType(), True), 
StructField("churned", StringType(), True)]) 

#read the file 
dbutils.fs.mount( source = "wasbs://", mount_point = "/mnt/MyMount", extra_configs = {"": "FsRlI6dy7qy0 wv0W9r/bc0SYourKeyShouldbeHere4W2nl0DEj7OinQ2MpDJ6/zfQHStFg=="} ) 

df ="csv").option("header","false").schema(schema).load("dbfs:/mnt/MyMount/train.csv") 

# Assemble feature vectors 
from pyspark.mllib.linalg import Vectors 
from import VectorAssembler 

assembler = VectorAssembler( inputCols = [ 
'customer_age', 'avg_calls_daily', 
'api_type'], outputCol = 'features') 

# Reshape /Bin # Transform labels 
from import StringIndexer 

label_indexer = StringIndexer(inputCol = 'churned', outputCol = 'label')#convert string to number as per the frequency 
# Fit the model 

from import Pipeline 
from import PipelineModel 

#Using RandomForestClassifier 
from import RandomForestClassifier 

classifier = RandomForestClassifier(labelCol = 'label', featuresCol = 'features') #Setup the pipeline 

pipeline = Pipeline(stages=[assembler, label_indexer, classifier]) 
(train, test) = df.randomSplit([0.7, 0.3]) 
model = 

#Save the Model to be used later 
model.write().overwrite().save("dbfs:/mnt/MyMount/ChurnModel") #save 

#Do a small test #read the model 

model2 = PipelineModel.load("dbfs:/mnt/MyMount/ChurnModel") 

from import BinaryClassificationEvaluator 

predictions = model2.transform(test) 
evaluator = BinaryClassificationEvaluator() 

The output of the code will be:
Out[1]: 0.9523809523809523

As you can see my model is pretty confident. It is having 95% confidence in predicting the churn.

Now, save this model,  and use it  to predict the churn. The daily data is coming from DailyLogs.csv file.

from import Pipeline
from import PipelineModel
from import RandomForestClassifier
from import BinaryClassificationEvaluator
from pyspark.sql import SQLContext
from pyspark.sql.types import *
from pyspark import SparkContext
from sklearn import preprocessing

# sc = SparkContext().getOrCreate()
sqlContext = SQLContext(sc)

#Define the structure of your file
schema = StructType([StructField("subscription", StringType(), True),
StructField("account", StringType(), True),
StructField("api_type", DoubleType(), True), #Inventory, Customer, Order etc
StructField("IsEnterprise", StringType(), True),
StructField("avg_calls_daily", DoubleType(), True),
StructField("avg_calls_last10_days", DoubleType(), True),
StructField("customer_age", DoubleType(), True), #how long customer is with us. First 30 days are critical
StructField("total_error_event", DoubleType(), True),
StructField("total_http_200", DoubleType(), True),
StructField("total_http_400", DoubleType(), True),
StructField("total_http_409", DoubleType(), True), #conflict
StructField("total_http_500", DoubleType(), True),
StructField("number_customer_service_calls", DoubleType(), True),
StructField("churned", StringType(), True)])


#read data
df ="csv").option("header","false").schema(schema).load("dbfs:/mnt/MyMount/DailyLogs.csv")

model = PipelineModel.load("dbfs:/mnt/MyMount/ChurnModel")
predictions = model.transform(df)"subscription","probability","prediction").show()

evaluator = BinaryClassificationEvaluator()

Mounts successfully refreshed.
|subscription| probability|prediction|
| 1156352|[0.10833333333333...| 1.0|
| 295650|[0.76011904761904...| 0.0|
| 2915731|[0.80416666666666...| 0.0|
| 1537570|[0.89473684210526...| 0.0|
| 765712|[0.83223684210526...| 0.0|
| 106978| [0.14375,0.85625]| 1.0|
| 2652748|[0.14750000000000...| 1.0|
| 230962| [0.775,0.225]| 0.0|
| 1561100| [0.15,0.85]| 1.0|
| 2991707| [0.8875,0.1125]| 0.0|
| 240377|[0.86666666666666...| 0.0|
| 2342217| [0.15,0.85]| 1.0|
| 170620|[0.83928571428571...| 0.0|
| 1652778| [0.05,0.95]| 1.0|
| 2505206| [0.89375,0.10625]| 0.0|
| 538637| [0.9,0.1]| 0.0|
| 1971373| [0.0,1.0]| 1.0|
| 1939815| [0.85,0.15]| 0.0|
| 297083| [0.75,0.25]| 0.0|
| 365425| [0.95,0.05]| 0.0|
only showing top 20 rows

Out[2]: 0.9975510204081632



How to optimize for Search engines (SEO)

Here are few tips to improve your SEO:

Duplicated content exists off-site. Make sure there is no duplicate content on other sites.

The URL structure contains unnecessary sub-folders.

Deep folder structures are a negative SEO factor. Search engines determine the importance of a page by how close to the root directory it resides. The deeper a page is, the lower the importance, which will have an impact on rankings and traffic. Removing unnecessary folder structure will help to increase the importance of pages, which will increase rankings and traffic from search engines.

Image and video filenames are not optimized – Image and video filenames should directly reflect the contents of the image or video. Unoptimized image and video filenames are a missed opportunity to indicate to search engines what the image or video contains, providing relevancy to the page.

Images and videos are missing specific, optimized alt text – Similar to file names, search engines “read” the alt attributes of images and videos. Websites have the opportunity to reinforce relevancy for targeted keywords at the page level by optimizing attributes.

Reduce Relative URLs are being utilized across the site – Absolute URLs assist sites in establishing authority and consistency and ensure links are being crawled effectively. Without an absolute URL, search engines may: Encounter discrepancies such as malformed URLs Get caught in crawl traps Ignore canonicals.

The top navigation changes throughout the site – The navigation should stay consistent throughout the site to provide a good user experience. Users want an easy to navigate site that provides options to top category and sub-category pages.

The robots.txt file is missing the location of the XML sitemap – Without the XML sitemap location distinction search engines may not discover the location of your XML sitemap. The XML sitemap is an important resource to help search engines understand the full scope of your site, missing this file can result in a decrease in organic rankings and organic traffic.


View All Page – Should have a “View All” link for search engines. Searchers generally prefer to view a whole article or category on a single page. Therefore, if we think this is what the searcher is looking for, we try to show the View All page in search results. You can also add a rel= link to the component pages to tell Google that the View All version is the version you want to appear in search results.

Site Latency – Above 2 sec first view load time effects SEO. Site latency has become an increasingly important issue in regards to the way search engines respect domains in SERPs. Search engines have directed more of their algorithmic focus upon site latency due to the overt impact this metric has upon a user’s interaction with a website.

Recommendation – Optimization of images, leveraging of browser caching and compression of resources with gzip be addressed as the highest priority for the greatest improvement. While combining images into CSS sprites and optimization of CSS delivery be slated as less of a priority but still considered important, overall, for latency improvement.

      • Leverage Browser Caching 39/100: Static resources should be cached so that repeat views will have greater performance. This is done through Expires and ETag statements within the HTTP header.
      • Optimize Images 80/100: Properly formatting and compressing images can save many bytes of data.
      • Remove Render-Blocking JavaScript: Before a browser can render a page to the user, it has to parse the page. If it encounters a blocking external script during parsing, it has to stop and download that Javascript. Each time it does that, it is adding a network round trip, which will delay the time to first render of the page.
      • Optimize CSS Delivery: Large CSS should only be inline if necessary for rendering above-the-fold content – other CSS should be deferred until after above-the-fold load.
      • Combine images into CSS sprites 64/100: Combining images into as few files as possible using CSS sprites reduces the number of round-trips and delays in downloading other resources, reduces request overhead, and can reduce the total number of bytes downloaded by a web page.
      • Enable Compression 87/100: Compressing resources with gzip or deflate can reduce the number of bytes sent over the network.
        • is not implemented in the source code of videos – provides a hierarchy of microdata that can be appended to web pages. This practice allows for control of what appears in the rich snippets as well as allowing for further description of on-page items that are not easily digested by search engines. Videos with markup will appear in SERPs as rich snippets. Providing this information to consumers from the search engine improves CTR and conversions.
        • Page titles are too long and branding is at the beginning of the title – The page title is the most important on-page SEO element. Relevant keywords should be placed first in the title, with branded information at the end. Keywords earlier in the title are given greater importance and will positively impact our organic rankings and traffic. Page titles should lead with keywords and phrases that are relevant to the content on the page. Branded information should be at the end of the page title. Page titles should be ideally within 55 characters, and less than 60.
        • Meta descriptions are too short, too long, missing or duplicated across the site – Meta descriptions are important for SEO as they dramatically impact the click-through-rating on search engine result pages. Having duplicated descriptions can confuse the search engine and devalue the pages. A missing meta description or duplicated one is a major gap in optimization. Add unique meta descriptions to all pages. Meta descriptions should include a call to action, be limited to 155 characters, include all targeted key phrases, and copy should be written with users in mind.
        • Social Tags – Pages do not use Facebook Open Graph coding or Twitter Cards.

Open Graph tags and Twitter Cards enable any web page to become an object in a social graph. They control what information Facebook, Twitter and other social networks display when a page is shared.

Add formulaic, scaled Open Graph and Twitter Card code to category and product templates at least – if not entire site. Title, Type, Image, Description, URL, Site Name

Much of the information can be take from existing SEO fields, but we may want to optimize more for social spaces

Facebook Open Graph






Twitter Cards

Following Obstacles should be Analyzed

      • Page Canonicalization Issues
      • Canonical Tags Improperly Used
      • URL Structure Issues
      • Unnecessary Sub-Folders
      • Unnecessary Sub-Domains
      • Generic Folder Structure
      • Tracking Parameters in the URL
      • Session IDs in the URL
      • Broken Images
      • Broken Links
      • Page Load Speed
      • Switchboard Tags
      • Missing Breadcrumbs
      • Missing Custom 404 Page
      • 404 Error Page Does Not Return 404 Status Code
      • Site Requires Session IDs to Function
      • Site Requires Cookies to Function
      • JavaScript Navigation
      • Flash Navigation
      • Images Used in Navigation
      • 302 Redirects
      • Meta-Refresh Redirects
      • JavaScript Redirects
      • Unnecessary Redirects
      • Unnecessary Internal Nofollow Links
      • Generic Desktop to Mobile Site Redirection
      • Duplicate Content
      • Frames are Being Used
      • JavaScript Used to Control Site Content
      • Site Constructed in Flash
      • Popup Windows
      • Relevant Content Contained within PDF or Other Formats
      • Splash Page
      • Main Entry Page Requires User Action
      • Buried Deep / Island Pages
      • Search Engine Incompatibility
      • Cross Browser Incompatibility
      • Webmaster Tools Are Not Available
      • Shared IP Address Issues
      • Site Not Hosted in Same Country as Target Audience
      • Cloaked Content
      • Questionable Content
      • Vary HTTP Header – Same URL
      • Vary HTTP Header – Different URL
      • International Website Localized with CSS / XML
      • International Website is using Multiple Style Sheets
      • Robots.txt File is Missing
      • Robots.txt File is Blocking Content
      • Robots.txt File is Missing XML Sitemap Location
      • Robots Meta Tag
      • Consumer Facing Sitemap Missing
      • XML Sitemap Missing
      • XML Video Sitemap Missing
      • XML Sitemap is Too Large
      • XML Sitemap is Malformed
      • Inline / On-Page CSS
      • Inline / On-Page JavaScript
      • Invalid HTML Markup
      • Page Title Formatting
      • Duplicate Page Titles
      • Page Titles Missing
      • Page Titles Too Long
      • Multiple Page Titles
      • H1 Tags Missing
      • Multiple H1 Tags
      • Meta Descriptions Missing
      • Meta Descriptions Too Short
      • Meta Descriptions Too Long
      • Video Transcripts Missing
      • Missing Micro Formatting
      • Lack of Body Content
      • Text Contained Within Images
      • Missing ALT Tags
      • Generic Anchor Text


Cosmos DB and Key Vault

You can access Cosmos DB with a Key and the URI of the account. However, you should never keep the keys and URL in application code. Be it a desktop application or a web application.

Azure Key Vault brings the perfect solution for your application. The solution can be very simple, summarize in following steps:

  • Create a key Vault
  • Store the CosmosDB Access Keys in Key Vault.
  • Create an application
  • Register the Application with the Active Directory
  • Give permission to application to read the Key Vault.

Your application code will look as follows:

AzureServiceTokenProvider azureServiceTokenProvider = new AzureServiceTokenProvider();
  var keyVaultClient = new KeyVaultClient(
  new KeyVaultClient.AuthenticationCallback(azureServiceTokenProvider.KeyVaultTokenCallback));
  var secret = await keyVaultClient.GetSecretAsync("")
  ViewBag.Secret = $"Secret: {secret.Value}";
catch (Exception exp)
      ViewBag.Error = $"Something went wrong: {exp.Message}";

ViewBag.Principal = azureServiceTokenProvider.PrincipalUsed != null ? $"Principal Used: {azureServiceTokenProvider.PrincipalUsed}" : string.Empty;
return View();

Now, lets go thru all the steps.

Create a secret in Key Vault

Get the secret identifier, my secret is “xxxxx” string. But,  you can keep the Cosmos DB connection  key here.


Here is the identifier of the secret:

You need this identifier in your application. But, you don’t have to worry about this identifier. Even if someone gets it, they cannot access your secret.

Create a web application or you can download the code from here. It is a very simple MVC application, write the code in home controller as shown above. Only interesting file is HomeController.cs. Everything else is a boiler plate code.

Once the application is created, deploy it on Azure. (right click on project and choose publish)

Once the application is deployed. Go to Azure portal, choose the application service and turn on the Managed Service Identity of this application.


If you will run the application now, you will see the following error, as you have not given any permission to this application in Key Vault.


Now go to Key Vault and add the application, using the access policy


After adding the application, you can choose the permission you want to give it to this application as follows:


Now, if you will run the application you will see that you can read the secret from Key Vault.


Similarly you can add a user to access the key Vault.

You need to add yourself to the Key Vault by clicking on “Access Policies” and then give all the permission you need to run the application from Visual studio. When this application is running from your desktop it takes your identity.

Learn more


Reading Cosmos DB Change Feed

To track the changes on the Cosmos DB, you can reads it’s change feed. The change feed support in Azure Cosmos DB enables you to build efficient and scalable solutions.

Azure Cosmos DB change feed provides a sorted list of documents within an Azure Cosmos DB collection in the order in which they were modified. This feed can be used to listen for modifications to data within the collection and perform any action. The change feed is available for each partition key range within the document collection, and thus can be distributed across one or more consumers for parallel processing. Once you get the document which is changed, sky is the limit. You can send that document to Azure Notification hub or trigger any other process.

You can read the change feed in three different ways:

  1. Using the CosmosDB Client Library.
  2. Using the Change Feed Processor SDK
  3. Using the serverless Azure function.

In this article we will discuss first two option and subsequent blog post will address the last serverless Azure function options.

I am keeping this article very short and to the point. I am quickly showing you the code snippet which you need to write to get started on reading the change feed. At the end of article, you will find a link to the full working code.

Azure Cosmos DB SDK
Download CosmosDB SDK 

This SDK gives you all the power to read the change feed, but with power comes lots of responsibilities too. If you want to manage checkpoint, and deal with sequence number of documents and have granule control over partition keys then this may be the right approach.

So let’s get started, Read the database, collection name etc from appconfig. You will get this information from Azure portal.

DocumentClient client;
string DatabaseName = ConfigurationManager.AppSettings["database"];
string CollectionName = ConfigurationManager.AppSettings["collection"];
string endpointUrl = ConfigurationManager.AppSettings["endpoint"];
string authorizationKey = ConfigurationManager.AppSettings["authKey"];

Make the client as follows:

using (client = new DocumentClient(new Uri(endpointUrl), authorizationKey,
new ConnectionPolicy { ConnectionMode = ConnectionMode.Direct, ConnectionProtocol = Protocol.Tcp }))

and then get the partition key ranges

FeedResponse pkRangesResponse = await client.ReadPartitionKeyRangeFeedAsync(
                                      new FeedOptions
                      {RequestContinuation = pkRangesResponseContinuation });

pkRangesResponseContinuation = pkRangesResponse.ResponseContinuation;

and then just call ExecuteNextAsync for every partition key ranges

 foreach (PartitionKeyRange pkRange in partitionKeyRanges){
                string continuation = null;
                checkpoints.TryGetValue(pkRange.Id, out continuation);
                IDocumentQuery<Document> query = client.CreateDocumentChangeFeedQuery(
                    new ChangeFeedOptions
                        PartitionKeyRangeId = pkRange.Id,
                        StartFromBeginning = true,
                        RequestContinuation = continuation,
                        MaxItemCount = -1,
                        // Set reading time: only show change feed results modified since StartTime
                        StartTime = DateTime.Now - TimeSpan.FromSeconds(30)
                while (query.HasMoreResults)
                    FeedResponse<dynamic> readChangesResponse = query.ExecuteNextAsync<dynamic>().Result;

                    foreach (dynamic changedDocument in readChangesResponse)
                        Console.WriteLine("document: {0}", changedDocument);
                    checkpoints[pkRange.Id] = readChangesResponse.ResponseContinuation;

If you have multiple readers, you can use ChangeFeedOptions to distribute read load to different threads or different clients. This is it, with these few lines of code you will start reading the change feed. Get the code from here.

Here the last line ResponseContinuation has the last logical sequence number (LSN) of the document, which will be used next time to read new documents after this sequence numbers. Using StartTime of ChangeFeedOption you can widen your net to get the documents. So, If your ResponseContinuation is null, but your StartTime goes back in time then you will get all the documents change since StartTime. But, if your ResponseContinuation has a value then system will get you all the documents since that LSN.

Side Note: One more thing to note, ETag on FeedResponse is different than the _etag you see on the document. _etag is an internal identifier and used to concurrency, it tells about the version of the document and ETag is used for sequencing the feed.

So, you see your checkpoint array is just keeping LSN for each partition. But if you don’t want to deal with the partitions, checkpoints, LSN, Startime etc the simpler option is to use the Change Feed Processor Library.

 Using Change Feed Processor Library 

Azure Cosmos DB Change Feed Processor library, can help you easily distribute event processing across multiple consumers. This library simplifies reading changes across partitions and multiple threads working in parallel.

The main benefit of Feed Processor Library is that you don’t have to manage the each partition, continuation token etc and you don’t have to poll each collection manually.

The FP Library simplifies reading changes across partitions and multiple threads working in parallel.  Change Feed Processor automatically manages reading changes across partitions using a lease mechanism. As you can see in the below image, If I start two clients who are using Processor Library they divide the work among themselves. You can keep increasing the clients and they can keep dividing the work among themselves.


I started the left client first and it started monitoring all the partitions, then I started the second client and then first let go some of the leases to second one. As you can see this is the nice way to distribute the work between different machines and clients.

To implement Feed library you have to do following:

  1. Implement a DocumentFeedObserver object which implements IChangeFeedObserver.
  2. Implement a DocumentFeedObserverFactory, which implements IChangeFeedObserverFactory.
  3. In the CreateObserver method of DocumentFeedObserverFacory, instantiate ChangeFeedObserver which you made in step 1 and return it.
  4. Instantiate DocumentObserverFactory.
  5. Instantiate a ChangeFeedEventHost
    ChangeFeedEventHost host = new ChangeFeedEventHost(

    Register the DocumentFeedObserverFactory with host.

That’s it. After these few steps you will start seeing the document come in DocumentFeedObserver ProcessChangesAsync method.

Here is the code for step 3.

public IChangeFeedObserver CreateObserver()
          DocumentFeedObserver newObserver = new DocumentFeedObserver(this.client,                                                                                         this.collectionInfo);
          return newObserver;


Here is the code for step 4 & 5

ChangeFeedOptions feedOptions = new ChangeFeedOptions();
feedOptions.StartFromBeginning = true;

ChangeFeedHostOptions feedHostOptions = new ChangeFeedHostOptions();

//Customizing lease renewal interval to 15 seconds. So the if
feedHostOptions.LeaseRenewInterval = TimeSpan.FromSeconds(15);

using (DocumentClient destClient = new DocumentClient(destCollInfo.Uri, destCollInfo.MasterKey))
        DocumentFeedObserverFactory docObserverFactory = new DocumentFeedObserverFactory(destClient, destCollInfo);
        ChangeFeedEventHost host = new ChangeFeedEventHost(hostName, documentCollectionLocation, leaseCollectionLocation, feedOptions, feedHostOptions);
        await host.RegisterObserverFactoryAsync(docObserverFactory);
        await host.UnregisterObserversAsync();

Complete code you will find here which shows step 1 & 2 and all other steps.

The best option to read the change feed of your collection is to use server less function of Azure. Now Azure functions and Cosmos have native integration.