Bing Dialtone Service Failure Correlation

Problem statement 

Bing.com and Bing-powered products (Yahoo, AOL search, etc.) are served by distributed services running in multiple data centers across various geo-locations. Bing.com has well-established targets for availability and performance, including a basic Dialtone availability expectation: a user conducts a search and expects to get ten blue links in the results. A break in this functionality is called a Dialtone failure in Bing. Bing already has a well-established detection mechanism for Dialtone failures, and the system generates alerts when availability dips below the target threshold in a specific market or data center. A Dialtone failure alert pages the on-call engineer (also called the DRI), and a live-site investigation is initiated to mitigate the issue, followed by root-cause analysis, post-mortems, and so on.

Bing is powered by many online services that produce logs and performance counters at petabyte scale. Correlating a failure to a root cause can be quite challenging, both because of the dynamic and complex interactions between the services and because of the volume of operational data produced. Feature teams often rely on existing TSGs (troubleshooting guides) or experience to pinpoint the issues causing failures.

During my internship at Microsoft, I worked on creating a system that attempts to correlate Bing Dialtone failures with performance data from deeper services in the Bing stack. I scoped Bing Dialtone failures to the Bing error page, often called the “Sad Panda” page. Gautam Dewan was my mentor and Debashish Ghosal my manager.

Figure 1- Error page in Bing.com

Challenges with current approach 

The Bing serving stack is divided into many parts, and these services run independently in their own environments. Following best practices, each of these services is also monitored for failures and errors. When a Bing Dialtone failure alert is triggered at the topmost part of the stack, in many cases deeper parts of the Bing stack trigger alerts as well. When engineers receive a flurry of alerts, one of the first things they need to pinpoint is the source of the issue. This requires looking through many views and graphs and relying on knowledge of how Bing services interact. One also needs to be familiar with a service to understand what its performance counters mean, and this familiarity is often the key to finding the root cause of a failure.

Architecture 

Bing services emit performance counter data, which is aggregated and pushed into Cosmos (Microsoft's equivalent of Hadoop) for longer-term storage. I scoped my work to a set of Bing services whose performance counter data amounted to roughly 32 TB per day. In the topology shown in Figure 2, I wrote tasks that download the counter data from Cosmos and store it in a SQL Data Warehouse hosted in Microsoft Azure. I then wrote an R script that reads Bing's Dialtone metric from separate Azure storage and the service counter data from the SQL Data Warehouse, and produces outputs that are visualized in a Shiny app, Power BI, and similar tools.

Figure 2- Architecture

Process 

The Bing Dialtone metric and the Bing services' performance metric (aka counter) data are generated continuously as new traffic hits Bing.com. I looked at past known Bing Dialtone failures and collected the Dialtone metric and performance counter data around them to develop and tune my algorithm. Because this historical data is static and contained in a SQL database under my control, the remaining work was to develop the R script for processing. The algorithm's basic functionality can be described as follows:

  • Use anomaly detection techniques to identify noticeable trend changes in the Bing Dialtone metric. This identifies time windows where the Dialtone metric (the error-page rate, in scope here) spiked.
  • Look for noticeable anomalies in the Bing services' performance counter data during those time windows.
  • Attempt to correlate the performance counter patterns with the Dialtone metric pattern and identify candidate root causes.
  • Manually go through the detected candidates to narrow down to one or more root causes.
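The steps above can be sketched end to end. The original implementation was an R script; the following is a minimal Python illustration of the flow, and all names, thresholds, and toy data here are hypothetical:

```python
import numpy as np

def anomaly_mask(series, q=0.8):
    """Steps 1-2: flag points above the series' q-quantile (the quantile detector)."""
    return series > np.quantile(series, q)

def expand(mask, pad):
    """Widen each anomalous point by `pad` samples on both sides."""
    out = np.zeros_like(mask)
    for i in np.flatnonzero(mask):
        out[max(0, i - pad):i + pad + 1] = True
    return out

def rank_candidates(dialtone, counters, pad=2):
    """Step 3: correlate counters with the dialtone metric inside the expanded
    anomaly window; step 4 (manual triage) is left to the DRI."""
    win = expand(anomaly_mask(dialtone), pad)
    scores = {}
    for name, series in counters.items():
        if anomaly_mask(series)[win].any():          # counter is itself anomalous
            scores[name] = abs(np.corrcoef(dialtone[win], series[win])[0, 1])
    return sorted(scores.items(), key=lambda kv: -kv[1])

# Toy data: an error-rate spike that lines up with one counter but not the other.
dialtone = np.array([1, 1, 1, 1, 9, 9, 1, 1, 1, 1], float)
counters = {
    "svc_latency": np.array([50, 50, 50, 50, 200, 210, 50, 50, 50, 50], float),
    "svc_qps":     np.array([100, 101, 99, 100, 100, 100, 101, 99, 100, 100], float),
}
print(rank_candidates(dialtone, counters)[0][0])   # → svc_latency
```

The gating check (a counter must itself be anomalous inside the window) keeps the correlation from rewarding counters that merely drift alongside the dialtone metric.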

Detecting Anomalies 

The anomaly detection method I used in this project was the quantile (aka percentile) function. The quantile function is fast and can operate with little historical data (a few hours). I also experimented with Twitter's AnomalyDetection model, but it was too slow for the scale of this project and required far more historical data. While the quantile approach may return more false positives for seasonal time series, it performs well at identifying spikes in signals like latency and error rates. I analyzed these false positives further to find ways to filter them out in the correlation phase.

As shown in Figure 3, sometimes a signal remains high, or ramps up steeply for a while and then suddenly drops (e.g., a drop in memory after a GC cycle). In this case, the entire period that the signal was high is considered anomalous and will match the same period in which Bing Availability also had an anomaly. There is also the known case where a signal follows a seasonal pattern and, at some point, the pattern coincides with a drop in Bing Availability, as can be seen in Figure 4. Finally, there is the case where a counter's time series matches the Bing Availability signal, but if we expand the time window, we notice that the spike aligning with Bing Availability was insignificant compared to later spikes in the same counter (Figure 5). The correlation phase ruled out many of these false positives, as discussed in the next section.

  Figure 3- False positive in orange and Bing QoR in blue 

  Figure 4- Seasonal pattern in counter 

 Figure 5- The second spike in the Raw Health Metric didn't cause a spike in the Aggregated Error Rate, so this metric is likely a false positive.
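The seasonal false-positive behavior (the Figure 4 case) is easy to reproduce with the quantile detector. This is an illustrative Python sketch (the original work was in R, and the signal here is synthetic):

```python
import numpy as np

# A purely seasonal counter: three identical "daily" cycles of 96 samples each.
t = np.arange(288)
seasonal = 10 + 5 * np.sin(2 * np.pi * t / 96)

# The quantile detector flags every cycle's peak, not just the one that
# happens to coincide with a real dialtone incident.
threshold = np.quantile(seasonal, 0.95)
flags = np.flatnonzero(seasonal > threshold)
print("flagged samples span cycles 1 and 3:",
      (flags < 96).any() and (flags >= 192).any())
```

Because the cycles are identical, the detector fires at the same offset in every period, which is exactly why the correlation phase is needed to discard the peaks that do not line up with the availability drop.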

Correlating Candidate Metrics 

To reduce the number of false positives, I needed a model that identifies the raw metrics (performance counters) that actually triggered the spike in the aggregated metric (the error rate), rather than merely coinciding with it. The idea is to give high importance to counters that had the most egregious anomalies at almost exactly the time Bing had its maximum error rate. I also extended the time window by a few hours (before and after) for this correlation, to capture scenarios where the raw metric (or the aggregated metric) has multiple spikes. This technique ruled out all the false positives discussed in the previous section. The list of relevant counters produced after correlation was small, but human analysis is still needed to identify the real root cause of the problem (bad code, configuration, or infrastructure). I used clustering and dimensionality reduction to create a view that helps DRIs identify the root cause.
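One way to express this scoring idea is to rate each counter by the severity of its most extreme deviation inside an extended window centered on the error-rate peak. This is a hypothetical Python simplification of the R implementation, with made-up data and parameter names:

```python
import numpy as np

def alignment_score(error_rate, counter, pad=12):
    """Score a counter by its most egregious deviation (max |z-score|)
    inside an extended window around the error-rate peak."""
    peak = int(np.argmax(error_rate))
    lo, hi = max(0, peak - pad), min(len(counter), peak + pad + 1)
    mu, sigma = counter.mean(), counter.std()
    return float(np.abs((counter[lo:hi] - mu) / sigma).max())

rng = np.random.default_rng(42)
n = 200
error_rate = rng.normal(0.01, 0.001, n); error_rate[100:105] += 0.08
culprit = rng.normal(50.0, 2.0, n);      culprit[101:106] += 30   # deviates at the peak
bystander = rng.normal(50.0, 2.0, n);    bystander[20:25] += 30   # deviates elsewhere

# The counter whose anomaly aligns with the peak scores far higher.
assert alignment_score(error_rate, culprit) > alignment_score(error_rate, bystander)
```

The extended window (`pad`) tolerates small lags between the counter's anomaly and the availability drop, which is what rules out the multi-spike false positive shown in Figure 5.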

Clustering 

To further help with the task of finding relevant counters, I built an interactive visualization that combines a machine learning algorithm for dimensionality reduction (t-SNE) with an unsupervised clustering algorithm (k-means), so DRIs can explore counters by similarity to the Bing Availability signal and to other counters. The visualization, shown in Figure 6, is divided into two views of the data. The first is a scatter plot in which each dot represents a performance counter, positioned by similarity; the color of a dot indicates the cluster the counter belongs to, and the largest dot corresponds to the Bing Availability signal. The second view is a line chart where the behavior of the Bing Availability signal can be compared against the selected counter over an 18-hour time window.
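A view like this could be computed with scikit-learn. The production feature set, distance measure, and cluster count are not specified in this writeup, so everything in this sketch (including the synthetic series) is illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
T = 60                                           # samples per time series

# Illustrative counter time series plus the Bing Availability signal.
availability = np.sin(np.linspace(0, 4 * np.pi, T))
counters = [availability + rng.normal(0, 0.2, T) for _ in range(4)]   # similar shape
counters += [rng.normal(0, 1, T) for _ in range(4)]                   # unrelated noise
X = np.vstack([availability] + counters)

# Normalize each series so the embedding reflects shape, not scale.
X = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# 2-D layout by similarity (t-SNE) and coloring by cluster (k-means).
coords = TSNE(n_components=2, perplexity=3, init="random",
              random_state=0).fit_transform(X)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(coords.shape)   # (9, 2): one 2-D point per counter, plus the availability signal
```

Plotting `coords` colored by `labels`, with the availability point drawn larger, reproduces the scatter-plot half of the Figure 6 layout.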

 Figure 6 - Scatter plot on the left where circles represent counters positioned by similarity. On the right is the line chart containing the selected counter in orange and the Bing Availability signal in blue. 

Findings and validation 

Even though the model was able to filter down to the relevant counters, it cannot know which one was the root cause of a problem, since correlation does not necessarily imply causation. This is reasonable, because the problem is sometimes a bad file, bad code, or even just a capacity issue that is not directly related to a specific counter. However, DRIs have the experience to identify the root cause by looking at the relevant counters returned by the algorithm.

Our results were validated by two DRIs, who also gave us more insight into the data. We showed the visualization to one of the DRIs, and he confirmed that this project would really help in the process of finding root causes when an incident occurs. In the future, it should be possible to filter the data to the set of counters and instances that matter most in the DRI's pipeline, since we noticed that some instances, even though correlated with an issue, do not help in detecting the root cause. A machine learning algorithm trained on labeled data indicating whether a counter is relevant would help with this task.

Conclusion 

The algorithm was able to correlate signals and identify relevant counters. However, DRIs are not excluded from the process, since they have the institutional knowledge needed to make decisions and mitigate issues. As future work, we plan to further improve the algorithm by providing labeled data, so that machine learning techniques can be applied to surface only the most critical counters to DRIs.
