Constructing an analytics pipeline with BigQuery, dbt, and PowerBI
In this article, we'll take a look at building a data pipeline using Google BigQuery, dbt, and Microsoft Power BI. BigQuery is Google's serverless data warehouse, available through Google Cloud Platform. dbt is an open-source tool that lets you define and run data transformations efficiently. And Power BI is Microsoft's analytics software, which can ingest data from a wide range of sources and visualize it in countless ways.
This BigQuery -> dbt -> Power BI pipeline is a common working paradigm in the data analytics realm, so it's worth taking a look at how to go from raw data to processed data to end-user visualization.
BigQuery can be accessed through the web console, the command line, or the API. I put together a notebook that discusses getting set up with BigQuery:
BigQuery: Getting Started (at Google Colab)
And there are also the official Google docs, which offer a number of quickstarts:
In short, you'll set up a new Google Cloud project, and specify a service account that dbt can use to access your BigQuery data. This dbt tutorial will walk you through setting up both in tandem:
The dbt docs will walk you through setting up some starter models and materializing them as views in BigQuery. I decided to go a little further and use some of the BigQuery public datasets as a source of raw data. If you haven't checked these out, you really should - there are over a hundred publicly available datasets, covering everything from COVID-19 data to blockchain data to GitHub repo data and much more! I'll start out here with the GitHub repo dataset, which contains 2.3m records on GitHub repositories, commits, files, etc.
Something the dbt docs don't cover in much detail, but which turns out to be important, is setting permissions correctly for your service account. I got hung up for a while because the service account inherits permissions from my main user account, so the service account's permissions can't be edited directly - I had to go to my user in IAM and add BigQuery admin privileges there, so that the service account can create tables and do the other things needed to extract, transform, and load data.
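If you prefer the command line, a role binding along these lines can grant those privileges (a sketch - the project ID and service account address are placeholders, and BigQuery Admin is a broad role you may want to narrow for production use):

```
gcloud projects add-iam-policy-binding my-project-id \
    --member="serviceAccount:dbt-user@my-project-id.iam.gserviceaccount.com" \
    --role="roles/bigquery.admin"
```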
Before we get started, let's take a quick look at the structure of the GitHub repo data. The database consists of several tables. Here's a preview of the files table, whose structure we'll need to understand in order to write a useful transformation:
I won't go too much into these details for now. Just remember that the repo data contains several tables which we will need to draw data from.
So once we have BigQuery set up, I want to start a new dbt project to define a couple of models using the GitHub repository data. To do that, I'll cd to wherever I want my new project to live, and then run dbt init project-name (replacing 'project-name' with my desired name).
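For reference, the command and the scaffold it generates look roughly like this (the exact layout varies a bit between dbt versions):

```
$ dbt init my_dbt_project
my_dbt_project/
├── dbt_project.yml
├── models/
│   └── example/
├── seeds/
├── snapshots/
├── tests/
├── macros/
└── analyses/
```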
Here's my dbt_project.yml file:
This file specifies project name, paths, targets, and models and their properties. I have defined two models, commits and filecounts. Let's take a look at each in more detail.
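A dbt_project.yml along these lines would cover what's described above (a sketch - the project, profile, and materialization settings here are assumptions, so adapt them to your own setup):

```yaml
name: 'github_analytics'
version: '1.0.0'

# must match a profile defined in ~/.dbt/profiles.yml,
# which holds the BigQuery connection details
profile: 'github_analytics'

model-paths: ["models"]

models:
  github_analytics:
    commits:
      +materialized: view
    filecounts:
      +materialized: view
```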
dbt init will create the project directory structure seen in the left pane of my Atom window. Models are defined as SQL statements, and the above shows my commits model (defined in commits.sql). The dbt docs recommend making frequent use of the WITH ... AS SQL syntax, which (to be quite honest) I was not too familiar with until going through the dbt tutorials. I've written plenty of complex queries in my time, but usually relied on other SQL language features to structure them. WITH ... AS turns out to be a super useful clause: it lets you name subqueries that you can refer to in subsequent statements, which helps you organize the kind of complex SQL you'll need when defining data models.
So I've defined two models: commits and filecounts. Commits is pretty simple. A commit contains three fields - commit_id, author_name, and commit_time. I'm using only those three fields partly so that I can limit the amount of data returned by the query (and thus what I'll be charged for). Specifying a small subset of fields and using the LIMIT clause are both good ways to make sure you don't end up running a query that returns - and bills you for - a gigantic amount of data. For more on cost control in BigQuery, see my notebook:
BigQuery: Cost Control (at Google Colab)
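The commits model itself looks something like this (a sketch - the source table here is the sample_commits table from the public dataset, and the exact column mappings are assumptions):

```sql
-- models/commits.sql
-- Select a small subset of fields and cap the row count
-- to keep the billed scan size down.
with source as (
    select
        `commit` as commit_id,
        author.name as author_name,
        author.date as commit_time
    from `bigquery-public-data.github_repos.sample_commits`
)

select
    commit_id,
    author_name,
    commit_time
from source
limit 1000
```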
The filecounts model is a little more interesting. I was curious about the total number of files in each repository, and that's what the filecounts model gives us. To calculate file counts for each repository, we need to combine data from the files table and the sample_repos table. I want a list of repos sorted by a count of the number of files in each repo, so I can see the size of the largest repos. And of course, when we need to combine data from multiple tables in SQL, we need a JOIN.
Here's my filecounts.sql file:
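The file looks something like this (a sketch - the table references assume the github_repos public dataset, so adjust names to your setup):

```sql
-- models/filecounts.sql
-- Count files per repo, largest repos first.
with filecounts as (
    select
        files.repo_name,
        count(*) as file_count          -- aliased so we can order by it below
    from `bigquery-public-data.github_repos.files` as files
    inner join `bigquery-public-data.github_repos.sample_repos` as repos
        on files.repo_name = repos.repo_name
    group by files.repo_name
    order by file_count desc
    limit 100
)

select * from filecounts
```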
Notice that in this case, the WITH ... AS syntax is unnecessary - I could take that clause out, have the model definition consist of just the inner SQL statement, and the result would be the same. But it's good to get in the habit of using WITH ... AS, because when we start working with more complex models, including models that build on other models, there will be times when the query we need can only be built with that syntax. Keep that in mind for future reference.
So let's look at what this model definition is doing. We're selecting from two tables, files and sample_repos, and joining them with an INNER JOIN on matching repo_name values from both tables. We're using COUNT to get the total number of files for each repo - without COUNT, we'd get one record for each individual file in the files table, which would be a lot of records and not very useful. We're aliasing that count so we can use it later in the query. We're using GROUP BY to aggregate over the repo_name column, ORDER BY to sort in descending order by file_count, and LIMIT to cap the number of rows returned. But since this is a join and both tables have to be scanned in full to calculate the file counts, the query still scans 50.2GB (!):
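To put that scan size in dollars, you can apply BigQuery's on-demand pricing, which bills per volume of data scanned. Here's a quick back-of-the-envelope helper (the $6.25/TiB rate is an assumption - check Google's current pricing page):

```python
def estimated_query_cost(bytes_scanned: float, price_per_tib: float = 6.25) -> float:
    """Estimate on-demand query cost from bytes scanned.

    price_per_tib is an assumed rate; check Google's current pricing,
    and note that a monthly free tier may apply before any billing.
    """
    tib_scanned = bytes_scanned / (1024 ** 4)  # bytes -> TiB
    return tib_scanned * price_per_tib

# A ~50.2 GB scan like the filecounts query above:
cost = estimated_query_cost(50.2 * 1024 ** 3)
print(f"${cost:.4f}")  # roughly $0.31
```

Cheap for a one-off query, but worth keeping an eye on if a scheduled run rebuilds the view frequently.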