Demystifying Power BI Q&A – Part 2
Microsoft Thursday, March 6, 2014
Last time, we gave you the best case scenario for Q&A against a “raw” data import. However, in order to successfully understand the large swath of questions it is capable of interpreting, Q&A makes a number of fundamental assumptions about the model. If the structure of your model violates one or more of these assumptions, that’s when things start to go awry. It’s time to pay for your free lunch by cleaning up your model. Fortunately, the vast majority of the requirements for a good Q&A model are precisely the same basic optimizations you should perform on your model to make it work well with Power View, whether or not you will be enabling the model for Q&A.
Add missing relationships
If you are missing relationships between tables in your model, neither Power View nor Q&A are going to start wildly guessing how to join between those tables if you ask a questions about them. Power Pivot relationships are the cornerstone of a good model. For example, you cannot ask for the “total sales for Seattle customers” if the relationship between the orders table and the customers table is missing.
Rename tables and columns
While poorly named tables and columns make even manual Power View report authoring difficult, choice of names is even more critical for Q&A. For example, if you have a table named “CustomerSummary”, which contains a list of your customers, you would need to ask questions like “List the customer summaries in Chicago” rather than “List the customers in Chicago”. Not exactly intuitive.
While Q&A can do some basic word breaking and detection of plurals, it is going to trust that your table and column names accurately reflect the content.
Consider also this case: You have a table named “Headcount” that contains first and last names and employee numbers, and you have another table named “Employees” contains employee numbers, job numbers and start dates. While this might be decipherable to someone familiar with the model, someone who asks “count the employees” is going to get a count of the rows from the “Employees” table, which is probably not what they had in mind, since that’s a count of every job each employee has ever had. Better to rename the tables to truly reflect what they actually contain.
Choose a Default Field Set for each table
The Default Field Set is used for selecting the columns to display by default when asking questions about a table. When you ask for “orders for last week”, you probably don’t want to see all thirty columns in your Orders table.
Choose a Default Label for each table
The Default Label is used in Power View for selecting items in filters. It is also used by Q&A when deciding what kind of graph or chart to use for displaying your results. For example, “count orders by customer” will show a tabular result if the Default Label is not set for the customers table, since there are too many columns to display by default. Setting it to the Customer Name column will allow the result to be shown in a bar chart. Note that some tables, such as transaction tables, will only have a Default Field Set, and no Default Label.
Fix incorrect data types
Imported data can have incorrect data types. In particular, date and number columns that are imported as strings will not be interpreted by Q&A as dates and numbers. You should select the correct data type in Power Pivot.
Mark year and identifier columns as Do Not Summarize
Power View aggressively aggregates numeric columns by default, so questions like “total sales by year” can sometimes result in a grand total of sales alongside a grand total of years. If you have specific columns where you do not want this behavior, set the Summarize By property on the column to Do Not Summarize. Be on the lookout for year, month, day, and ID columns, as these are the most frequent problems. Other columns that aren’t sensible to sum, such as age, could also benefit from setting Summarize By to Do Not Summarize or to Average.
Choose a Data Category for each date and geography column
The Data Category provides additional semantic knowledge about the content of a column beyond its mere data type. For example, an integer column might be marked as a Year or as a Zip Code. This information is used by Q&A in two important ways: For visualization selection and for language biases.
First, Q&A uses the Data Category information to help make choices about what kind of visual display to use. For example, it recognizes that columns with date or time Data Categories are typically a good choice for the horizontal axis of a line chart or the play axis of a bubble chart. And it assumes that results containing columns with geographical Data Categories may look good on a map.
Second, Q&A makes some educated guesses about how users are likely to talk about date and geography columns, to help it understand certain types of questions. For example, the “when” in “When was John Smith hired?” is almost certain to map to a date column, and the “Brown” in “Count customers in Brown” is more likely to be a city than a hair color.
Note that some data categories, such as Year, will need to be entered as Custom Categories.
Choose a Sort By Column for relevant columns
The Sort By Column property allows sorting on one column to automatically sort by a different column instead. For example, when you ask “sort customers by shirt size”, you probably want your Shirt Size column to sort by the underlying size number (XS, S, M, L, XL) rather than alphabetically (L, M, S, XL, XS).
Normalize your model
Ok, before you panic, we’re not saying you need to reshape your model to conform to Sixth Normal Form or some such nonsense. But there are certain structures that are simply so painful and contorted that Q&A isn’t going to handle them well. If you do some basic normalization of the structure of your model, the usability of Power View will increase significantly, as will the correctness of Q&A results.
The general rule of thumb you’ll want to follow is this: Each unique “thing” the user talks about should be represented by exactly one model object (table or column). So, if your users talk about customers, there should be one customer object. And if your users talk about sales, there should be a one sales object. Simple, right? Depending on the shape of the data you’re starting with, it can be. There are rich data shaping capabilities available in Power Query if you need them, while many of the more straightforward transformations can happen simply using calculations in Power Pivot.
Here are some of the more common transformations you might need to perform:
Create new tables for multi-column entities
If you have multiple columns that act as a single distinct unit within a larger table, those columns should be split out into their own table. For example, if you have a Contact Name, Contact Title, and Contact Phone column within your Companies table, a better design would be to have a separate Contacts table to contain the Name, Title, and Phone and a link back to the Companies table. That makes it significantly easier ask questions about contacts independently of questions about companies for which they are the contact and improves display flexibility (e.g. Default Field Set and Default Label).
Pivot to eliminate property bags
If you have property bags in your model, these should be restructured to have a single column per property. Property bags, while convenient for managing large numbers of properties, suffer from a number of inherent limitations that neither Power View nor Q&A are designed to work around.
For example, consider a CustomerDemographics table with CustomerID, Property, and Value columns, where each row represents a different property of the customer (e.g. age, marital status, city, etc). By overloading the meaning of the Value column based on the content of the Property column, it becomes impossible for Q&A to interpret most queries which reference it. A simple question such as “show the age of each customer” might happen to work, since it could be interpreted as “show the customers and customer demographics where property is age”. However, the structure of the model simply doesn’t support slightly more complex questions like “average age of customers in Chicago”. While users who directly author Power View reports can sometimes find clever ways to get the data they are looking for, Q&A only works when each column has only a single meaning.
Union to eliminate partitioning
If you have partitioned your data across multiple table or if you have pivoted values across multiple columns, a number of common operations will be difficult or impossible for your users to achieve. Consider first a typical table partitioning: a Sales2000-2010 table and a Sales2011-2020 table. If all of your important reports are restricted to a specific decade, you could probably leave it this way for Power View. However, the flexibility of Q&A will lead your users to expect answers to questions like “total sales by year”. For this to work, you’ll need to union the data into a single Power Pivot table.
Similarly, consider a typical pivoted value column: a BookTour table containing Author, Book, City1, City2, and City3 columns. With a structure like this, even simple questions like “count books by city” cannot be interpreted correctly. For this to work, you should create a separate BookTourCities table, which unions the city values into a single column.
Split formatted columns
If the source from which you are importing your data contains formatted columns, Power View (and Q&A) will not reach inside the column to parse out the contents. So if you have, for example, a Full Address column that contains the address, city and country, you should also split it into Address, City and Country columns so your users can query against them individually.
There are two cases like this which we hope to add better automatic support for in the future. First, if you have any person full name columns, you’ll want to add First Name and Last Name columns, just in case someone wants to ask questions using partial names. Second, if you have any date columns, you’ll want to add Year and Month columns so people can ask questions which group by the year and/or month.
Create new tables for multi-value columns
Similarly, if the source from which you are importing your data contains multi-value columns, Power View (and Q&A) will not reach inside the column to parse out the contents. So if you have, for example, a Composer column that contains the names of multiple composers for a song, you should split it into multiple rows in a separate Composers table.
Denormalize to eliminate inactive relationships
The one exception to the “normalization is better” rule of thumb occurs when there is more than one path to get from one table to another. For example: If you have a Flights table with both SourceCityID and DestinationCityID columns, each of which are related to the Cities table, one of those relationships will have to be marked as inactive. Since Q&A can only use active relationships, you would be unable to ask questions about either source or destination, depending on which you chose. If you instead denormalize the city name columns into the Flights table, you’ll be able to ask questions like: “list the flights for tomorrow with a source city of Seattle and a destination city of San Francisco”
Add Synonyms to Tables and Columns
This is the only step which is applicable specifically to Q&A rather than Power View in general. Users often have a variety of terms they use to refer to the same thing, e.g. total sales, net sales, total net sales. Power Pivot allows these synonyms to be added to tables and columns within the model. Do not underestimate the importance of this step. As obvious as you think your table and column names might be, the users of Q&A are asking questions using the vocabulary that first pops into their heads, not picking from a predefined list of columns. The more sensible synonyms you can add, the better.
A word of warning, however: Be aware that you’re introducing ambiguity when you add the same synonym to more than one column or table. Q&A will utilize context, where possible, to choose between ambiguous synonyms, but not all questions have sufficient context. For example, when your user asks “count the customers”, if you have three things with the synonym “customer” in your model, they might not get the answer they are looking for. In these cases, make sure the primary synonym is unique, as that is what is used in the restatement. It can alert the user to the ambiguity (e.g. a restatement of “show the number of archived customer records”), hinting they might want to ask it differently.
Most people will be able to stop here. Typical models that have been optimized for use in Power View and have had synonyms added for Q&A are sufficient for answering the majority of typical user questions. However, there are circumstances under which you might want or need more fine-grained control over the linguistic behavior of Q&A for your model. In Part 3, we’ll give you an early preview of how advanced phrasing rules are defined and what kind of power they give you.