
Behind the scenes: Running the Power BI service

By Yitzhak Kesselman

I’m excited to share a new whitepaper that describes the Power BI team’s approach to maintaining a reliable, performant, and scalable service for our customers.

The whitepaper covers monitoring service health, mitigating incidents, managing releases, and acting on necessary improvements. We wrote it to share knowledge with our customers, who often ask about our site reliability engineering practices. The intention is to offer transparency into how the Power BI team minimizes service disruption through safe deployment, continuous monitoring, and rapid incident response. The techniques described here also provide a blueprint for teams hosting service-based solutions to build foundational live site processes that are efficient and effective at scale.

As service owners, we need to make sure our customers can rely on Power BI for mission-critical work. That trust shows in the service's rapid growth: six straight years of triple-digit paid growth since launch. Power BI is now used by 97% of Fortune 500 companies.

The results in the table below follow directly from the engineering, tooling, and culture changes the Power BI team has made over the past few years.

 

 

Metric | Actual (Dec 2018) | Actual (May 2021) | % Improvement
Time to Notify (TTN) Customers of Incidents – P75 | 110 min | 14 min | 87%
Time to Acknowledge (TTA) When Incidents Occur – P75 | 11 min | 0.76 min | 93%
Time to Mitigate (TTM) Issue – P50 | 49.3 min | 2.8 min | 94%
% Alerts Automated (Enrichment) | 7% | 88% | 1,157%
% Alerts Mitigated w/o human intervention | 0% | 82% | New Capability
% Incidents Escalated to SMEs (Subject Matter Expert) | 6.7% | 0.34% | 95%
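
For readers who want to check the arithmetic, the % Improvement column appears to be the usual relative change against the Dec 2018 baseline. The sketch below is an illustration under that assumption, not the team's actual tooling: the function name `pct_improvement` and the `lower_is_better` flag are hypothetical, and the input values are copied from the table above.

```python
def pct_improvement(before: float, after: float, lower_is_better: bool = True) -> float:
    """Relative change versus the baseline, as a percentage.

    Assumption: improvement = change / baseline. For time-based metrics a
    smaller value is better; for coverage metrics a larger value is better.
    """
    if before == 0:
        # A zero baseline has no defined relative change, which is consistent
        # with the table reporting "New Capability" for the 0% -> 82% row.
        raise ValueError("baseline is zero; relative improvement is undefined")
    change = (before - after) if lower_is_better else (after - before)
    return 100.0 * change / before


# (metric, Dec 2018 value, May 2021 value, lower_is_better)
rows = [
    ("TTN P75 (min)", 110, 14, True),
    ("TTA P75 (min)", 11, 0.76, True),
    ("TTM P50 (min)", 49.3, 2.8, True),
    ("% alerts automated", 7, 88, False),
    ("% incidents escalated to SMEs", 6.7, 0.34, True),
]

for name, before, after, lower in rows:
    print(f"{name}: {pct_improvement(before, after, lower):.0f}%")
```

Rounded to whole percentages, these match the published figures. Note that the alert automation row measures growth relative to a small baseline (7%), which is why its value exceeds 100%.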

 

 

Read our service admin site reliability service model whitepaper