Azure HDInsight now offers a fully managed Spark service. This capability allows for scenarios such as iterative machine learning and interactive data analysis. Power BI lets you connect directly to the data in Spark on HDInsight, offering simple, live exploration.
Power BI allows you to connect directly to your Spark cluster and explore and monitor data without requiring a data model as an intermediate cache. This offers interactive exploration of your data and automatically refreshes the visuals without requiring a scheduled refresh.
The direct connect experience is targeted at users who are familiar with their business data. In this post we’ll cover how to get better insights into your data in Spark on HDInsight using Power BI. For additional details on how to connect and get started, jump to the Connecting to Spark on Azure HDInsight section below.
Exploring your data in Spark on HDInsight
Once you create a connection to your source, you can start exploring your data to create a dashboard like the one above. As you explore the data in Power BI, queries are generated dynamically and sent back to the source. Because it's a live connection, any field selection or filter sends a query back to the source, and the visual is updated with the new results. Tips to optimize your clusters for Power BI can be found here.
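To make the live-connection idea concrete, here is a minimal sketch of how a field selection and a filter could translate into one aggregate query per visual. It is an illustration only, not the SQL that Power BI actually generates. The `build_query` helper and the `sales`, `region`, `amount`, and `year` names are hypothetical.

```python
def build_query(table, group_by, measure, filters=None):
    """Build a SQL aggregate query for one visual (illustrative only)."""
    where = ""
    if filters:
        # Each filter is a (column, value) pair; values are quoted as
        # string literals, with embedded single quotes doubled.
        conditions = [
            "{} = '{}'".format(col, str(val).replace("'", "''"))
            for col, val in filters
        ]
        where = " WHERE " + " AND ".join(conditions)
    return "SELECT {g}, SUM({m}) AS total FROM {t}{w} GROUP BY {g}".format(
        g=group_by, m=measure, t=table, w=where
    )


# Selecting two fields produces one query...
print(build_query("sales", "region", "amount"))
# ...and adding a filter produces a new query rather than reusing a cache.
print(build_query("sales", "region", "amount", filters=[("year", "2015")]))
```

Each change to the visual yields a fresh query, which is why the speed of the cluster directly affects how quickly visuals update.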
After saving your report, any of the visuals can be pinned to your customized dashboard. The data in the dashboard is refreshed approximately every 15 minutes, so no refresh schedule is required. The dashboard can be shared within your organization to keep your team up to date.
Connecting to Spark on Azure HDInsight
To get started, select Databases & More on the Get Data screen in Power BI.
Select the Spark on Azure HDInsight tile or use the Search box to find it quickly. Select Connect to move to the connection screen.
To connect, specify the server name along with your username and password. The server name is always in the form <clustername>.azurehdinsight.net, and the values can be found in the Azure portal.
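As a small sketch of the server name format described above, the following helper builds the value to enter on the connection screen from a cluster name. The `spark_server_name` function and the `mycluster` name are hypothetical and used only for illustration. The helper does not check whether the cluster exists or whether the name is valid in Azure.

```python
def spark_server_name(cluster_name: str) -> str:
    """Return the server name in the form <clustername>.azurehdinsight.net."""
    suffix = ".azurehdinsight.net"
    name = cluster_name.strip()
    if not name:
        raise ValueError("cluster name must not be empty")
    # Accept a value that already carries the suffix, so it is not appended twice.
    if name.lower().endswith(suffix):
        return name
    return name + suffix


print(spark_server_name("mycluster"))  # mycluster.azurehdinsight.net
```

Accepting either the bare cluster name or the full server name avoids the common mistake of ending up with a doubled suffix.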
After selecting Connect, you can select the newly created dataset named "SparkDataset" to begin exploring, or try selecting the placeholder tile on your dashboard.
Selecting fields or dragging them onto the canvas allows you to start exploring your data. Every selection generates a query back to the source. Depending on the size of the query and the optimizations in the database, you may see some loading indicators while the visuals are created. Tips to optimize your clusters for Power BI can be found here.
These visuals are the same as any other in Power BI and can be pinned to your dashboard. Drilling into the tiles will bring you back to the report you’ve created.
We’re always interested in hearing your feedback – please reach out at https://support.powerbi.com to let the team know how your experience was and if there’s anything we can do better. We look forward to your feedback!