Power BI & Microsoft Purview: Showing ‘full’ lineage from source to report
Microsoft Purview (started as Azure Purview) is a Microsoft solution for data management, including a data catalog, data map (including the assets), and many more. Later they rebranded the whole Microsoft 365 compliancy center to Purview, making it more logical to call it Microsoft Purview, rather than Azure Purview. In this blog, we’ll focus on the Azure part, specifically the lineage part!
I want to show you how you can add lineage between your source, Azure Data Factory, SQL Server, and your Power BI dataset. For this, we’re going to use the Apache Atlas API.
First, a few quick details about Purview and the data map: You register your data source and after that, you can create a scan. This scan (a bit different for all resources) scans the resource, collects all the metadata about the assets, and lists them in the data catalog. While the scan is running, it can also check for classifications like email addresses, credit card numbers, and many more.
But the thing we are going to focus on is one of the main requested features I’ve been hearing so far: Lineage. With lineage, we can view the sequence of where the data is coming from and where it is going. In the Power BI service we can find a light version of this:
But what if we need more details? Where does the data from the SQL Server come from and is the data transformed or not? This is where Microsoft Purview comes in!
In this case, we have scanned the blob storage and the SQL Server. Our Purview account is connected to the Azure Data Factory (ADF) we use to transform the .xlsx file to a SQL Table. Here you can read how this is set up.
Now after creating a report on this specific table, and scanning all the assets again, you would expect to see this table connected to the dataset, and the dataset connected to the report.
In Power BI, we get this result:
In Purview, we get this result:
But I want to know where the data in the SQL Table is coming from!
In a demo that Microsoft showed in 2021, they got the following lineage overview (You can watch this webinar here, after registering):
But for some reason, this isn’t an out-of-the-box functionality, and that is really too bad! In my opinion, this is really a must before we can see Purview as an enterprise product. But don’t be too disappointed because there are options!
First, the Edit button of an asset:
As you can see, we can add manual lineage, but unfortunately only to the report. I want it to connect to the data source because I can use it later for multiple reports! But if this is what you want, this is a nice first idea!
Second: Apache Atlas.
According to the Apache Atlas website: “Atlas is a scalable and extensible set of core foundational governance services — enabling enterprises to effectively and efficiently meet their compliance requirements within Hadoop and allows integration with the whole enterprise data ecosystem.
Apache Atlas provides open metadata management and governance capabilities for organizations to build a catalog of their data assets, classify and govern these assets and provide collaboration capabilities around these data assets for data scientists, analysts, and the data governance team.”
For us, it means we can use it as an API to make changes in Purview and more important: Add lineage the way we want! Franck Mercier put on a very helpful post on GitHub describing how it’s done. You can find it here: https://github.com/franmer2/AzurePurviewFullPBILineage_US
There is a nice post from Will Johson that also describes installing PyApacheAtlas: https://github.com/wjohnson/pyapacheatlas/blob/master/README.md
In general, these are the steps:
- Create a new (or use an existing) app registration in the Azure Portal.
- Make it a Data Curator on the root level in your data map in Azure Purview.
- Make sure you have scanned your Power BI & SQL Server instances (but if you’re reading this blog I guess you've already done that).
- Find the Fully Qualified Names of the Power BI Dataset and the SQL Table you want to use for the lineage.
- Run the script and be happy with the result:
Finally, we can see the lineage from source to report!
In this blog I have shown you how to create the lineage of a Power BI dataset and the origin of the SQL table which is used for this dataset, using the Apache Atlas API in Visual Studio Code.
I really like Microsoft Purview and I can see it has a lot of potential to become an important product for data stewards within organizations. The data catalog is very easy to use, the labels which are automatically applied to the resources are very handy and it gives a nice structure to your data.
From what I hear, the lineage part is the one that triggers companies the most. It’s something they really want or need, but this part has its flaws. It works okay now with Azure Data Factory, but for example not with Synapse Pipelines. And the fact that lineage from source to a Power BI report is not out of the box available is a real bummer.
Thanks to the great posts of Will Johson & Franck Mercier, it is possible and only requires a little bit of extra effort. It’s amazing to finally see which transformations have been made before the data got into a Power BI dataset. Now let’s hope shortly we can also see the transformations that have been made inside the Power BI dataset (using M-query for example).