Creating Dataflows
Power BI dataflows are a feature of the Microsoft Power BI service that allows users to connect to and combine data from various sources, perform data transformation and cleansing operations, and then store the results in inexpensive Azure Data Lake Storage Gen2.
Dataflows also use the Power Query M language and are constructed using a web-based version of Power Query Editor. In effect, dataflows are cloud-based versions of queries that you would otherwise create in Power BI Desktop.
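Because a dataflow table is simply an M query, its definition looks exactly like a query authored in Power BI Desktop. The following is a minimal, generic sketch; the table and its values are invented purely for illustration:

```
// A minimal, illustrative dataflow entity: an M query that builds a small
// table and then filters it, just as a Power BI Desktop query would.
let
    Source = #table(
        type table [Product = text, Quantity = Int64.Type],
        {{"Bike", 2}, {"Helmet", 5}}
    ),
    FilteredRows = Table.SelectRows(Source, each [Quantity] > 1)
in
    FilteredRows
```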
Getting ready
To prepare for this recipe, you will need a Power BI Pro license or trial license. See the Thinking about Licensing recipe from Chapter 1, Installing and Licensing Power BI Tools. Create a new workspace in the Power BI service. This workspace should be a Pro workspace and not a Fabric or Fabric trial workspace. Fabric workspaces include Dataflow Gen2 capabilities, which are constructed almost identically to the dataflows covered here.
In addition, you should complete the Installing SQL Server and AdventureWorks Database and Creating an On-Premises Data Gateway recipes from Chapter 1, Installing and Licensing Power BI Tools. Ensure that your data gateway is installed and that you are signed in.
How to create a dataflow
To create a dataflow, log in to the Power BI service at https://app.powerbi.com/, navigate to the Power BI Pro workspace created in the Getting ready section of this recipe, and then follow these steps:
- In the upper-right corner, select New and then Dataflow, as shown in Figure 2.24:
Figure 2.24: Selecting a new dataflow to be created
- On the Start creating your dataflow page, click the Add new tables button in the Define new tables area.
- On the Choose a data source page, select SQL Server database.
- On the Connect to data source page, enter the connection settings and connection credentials for your SQL Server, as shown in Figure 2.25. If your gateway is running on the same computer as your SQL Server, you can use localhost for the server name; otherwise, specify the name of the computer running SQL Server. Also specify the database, in this case AdventureWorksDW2022. Leave Connection set to Create new connection, choose your data gateway, and enter your Windows or SQL Server authentication information. When finished, click the Next button.
Figure 2.25: Data source connection information
- On the Choose data page, mark the checkboxes next to FactInternetSales, DimCustomer, DimDate, and DimProduct, and then click the Transform data button.
- A web-based version of Power Query Editor is presented. Here, you can transform your data exactly as you would in Power Query Editor in Power BI Desktop. In our case, we can simply click the Save & close button, as no data transformations are necessary; the M query behind each table resembles the sketch that follows these steps.
- In the Save your dataflow dialog, enter a name for your dataflow, such as AdventureWorksDW2022 Internet Sales, and click the Save button.
- In the upper-right corner, a notification appears. Click the Refresh now button to retrieve the data from SQL Server.
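For reference, the query stored for the FactInternetSales table looks roughly like the following sketch. The localhost server name assumes, as noted above, that the gateway runs on the same machine as SQL Server; substitute your own server name otherwise.

```
// A minimal sketch of the M behind the FactInternetSales table, assuming the
// AdventureWorksDW2022 database is reachable on the gateway machine itself.
let
    Source = Sql.Database("localhost", "AdventureWorksDW2022"),
    dbo_FactInternetSales = Source{[Schema = "dbo", Item = "FactInternetSales"]}[Data]
in
    dbo_FactInternetSales
```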
Once refreshed, the dataflow can be used within Power BI Desktop. To do so, open Power BI Desktop, click Get data in the ribbon on the Home tab, and choose Dataflows. The created dataflow and its associated tables appear under the workspace in the Navigator dialog. Select the desired tables and click the Load button.
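Behind the scenes, the Dataflows connector writes a navigation query for you. The following is an illustrative sketch only; the workspace and dataflow identifiers, and even the exact navigation field names, depend on your tenant and are generated by the Navigator dialog (typically as GUIDs rather than the placeholders shown here).

```
// An illustrative sketch of the kind of query Get data > Dataflows generates.
// The placeholder identifiers below are assumptions; the Navigator fills in
// the real values for your workspace and dataflow.
let
    Source = PowerPlatform.Dataflows(null),
    Workspaces = Source{[Id = "Workspaces"]}[Data],
    Workspace = Workspaces{[workspaceId = "<your-workspace-guid>"]}[Data],
    Dataflow = Workspace{[dataflowId = "<your-dataflow-guid>"]}[Data],
    FactInternetSales = Dataflow{[entity = "FactInternetSales", version = ""]}[Data]
in
    FactInternetSales
```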
How it works
Dataflows ingest data from source systems and store that data as entities in Microsoft Common Data Model-compliant folders. This entity data is stored within Azure Data Lake Storage Gen2. You can think of entities as tables.
A refresh schedule can be created for a dataflow in the same way that Power BI Desktop queries can be refreshed within the Power BI service. Refreshes update the data within the Azure Data Lake Storage Gen2 infrastructure. Power BI Desktop reports connecting to the dataflow pull data from the data entities stored within Azure Data Lake Storage Gen2 folders.
Using dataflows can improve performance for both Power BI reports and the source data systems. Because only the dataflow interacts with the source system, rather than potentially many separate Power BI Desktop queries, the load on the source systems is reduced. In addition, particularly if the source systems are on-premises, report refreshes can be faster, since the network latency of retrieving data through a data gateway is removed and Power BI semantic models are refreshed from data already present in the Microsoft cloud.
There’s more...
The decision to use dataflows versus queries in Power BI Desktop depends on a number of factors, including your data preparation needs, collaboration requirements, and the scalability of your solution. Here are some considerations when choosing between a dataflow and a query in Power BI:
Consider Power BI dataflows in the following scenarios:
- Centralized data preparation: If there are multiple reports or dashboards that require the same or similar data preparation steps, using a dataflow can centralize these transformations, ensuring consistency across various reports.
- Reuse of data transformations: Dataflows allow the reuse of data transformation logic across different semantic models. If you find yourself applying similar transformations to the same tables in multiple semantic models, dataflows can help in creating a standardized set of transformations.
- Collaboration: When multiple users or teams need to collaborate on data preparation, dataflows provide a shared environment. This allows for collaboration on the data preparation process, and changes made by one user are reflected for others.
- Scalability: Dataflows are suitable for handling large volumes of data and complex data preparation scenarios. Dataflows can be optimized for performance and can be scheduled to refresh data at regular intervals.
- Integration with Power Platform: When working with other Power Platform services like Power Apps and Power Automate, using dataflows can facilitate data integration across these services through the Common Data Model.
Consider Power BI Desktop queries in the following scenarios:
- Single report-specific transformations: If data preparation requirements are specific to a single report and won’t be reused in other reports, using Power BI Desktop queries can be more straightforward.
- Ad hoc data exploration: When you are in the process of exploring and analyzing data for a specific report, using Power BI Desktop queries allows for more ad hoc and interactive data exploration.
- Lightweight transformations: For relatively simple or lightweight data transformations that don’t require extensive reuse or collaboration, using Power BI Desktop queries may be sufficient.
- Personal projects: When working on personal projects or small-scale reporting tasks, using Power BI Desktop queries provides a more self-contained environment.
In short, dataflows are beneficial for scenarios requiring centralized and reusable data preparation logic across multiple reports or dashboards. In addition, if collaboration and performance are important, dataflows are likely a better choice. Power BI Desktop queries are more suitable for individual, report-specific data transformations, especially in scenarios where extensive reuse or collaboration is not a priority. The choice ultimately depends on the specific use case and requirements.
See also
- Differences between Dataflow Gen1 and Dataflow Gen2: https://bit.ly/3MK9Zrx
- Introduction to dataflows and self-service data prep: https://bit.ly/3R4w46C