BigQuery, dbt, PowerBI - analyzing github repo data

Constructing an analytics pipeline with BigQuery, dbt, and PowerBI

In this article, we'll take a look at building a data pipeline using Google BigQuery, dbt, and Microsoft Power BI. BigQuery is Google's serverless data warehouse, available through Google Cloud Platform. dbt is an open-source tool that lets you define transformations as SQL models and run them efficiently inside your warehouse. And Power BI is Microsoft's analytics software, which can ingest data from a wide range of sources and visualize it in countless ways.

This BigQuery -> dbt -> PowerBI pipeline is a pretty common working paradigm in the data analytics realm, so it's worth taking a look at how to go from raw data to processed data to end-user visualization.

BigQuery can be accessed through the web console, the command line, or the API. I put together a notebook that walks through getting set up with BigQuery:

BigQuery: Getting Started (at Google Colab)

And there's also the official Google docs, which offer a number of quickstarts:

BigQuery: Quickstarts

In short, you'll set up a new Google Cloud project, and specify a service account that dbt can use to access your BigQuery data. This dbt tutorial will walk you through setting up both in tandem:

Setting up (dbt docs)
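Once the service account and its JSON keyfile exist, the BigQuery profile that ends up in ~/.dbt/profiles.yml looks roughly like this - a sketch with placeholder values; the profile name matches the one referenced in the dbt_project.yml further down, while the project ID, dataset, and keyfile path are stand-ins you'd replace with your own:

bigquery_public:
  target: dev
  outputs:
    dev:
      type: bigquery
      method: service-account                      # authenticate using the service account keyfile
      keyfile: /path/to/service-account-key.json   # placeholder path to the downloaded JSON key
      project: my-gcp-project-id                   # placeholder GCP project ID
      dataset: dbt_bigquery_public                 # placeholder dataset for dbt to build models in
      threads: 1
      location: US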

The dbt docs will walk you through setting up some starter models and materializing them as views in BigQuery. I decided to go a little further and use some of the BigQuery public datasets as a source of raw data. If you haven't checked these out, you really should - there are over a hundred publicly available datasets, covering everything from COVID-19 data to blockchain data to github repo data and much more! I'll start with the github repo dataset, which contains 2.3m records on github repositories, commits, files, and so on.
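To get a feel for what's in the dataset (and how big it is) before writing any models, you can run a quick query against the dataset's __TABLES__ meta-table - a sketch, and keep in mind that the row counts and sizes will have changed since this was written:

-- List the tables in the github_repos public dataset, with row counts and sizes
SELECT
  table_id,
  row_count,
  ROUND(size_bytes / POW(10, 9), 1) AS size_gb
FROM `bigquery-public-data.github_repos.__TABLES__`
ORDER BY size_bytes DESC;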

Something that the dbt docs didn't discuss in much detail, but which turns out to be important, is correctly setting permissions for your service account. I got hung up for a little while because the service account inherits permissions from my main user account, which means the service account's permissions can't be edited directly. I had to go to my user in IAM and edit the permissions there to include BigQuery Admin privileges, so that I could create tables and do the other things I'd need to do in order to extract, transform, and load data.

Before we get started, let's take a quick look at the structure of the github repo data. The dataset consists of several tables, and the files table in particular has a structure we'll need to understand in order to make a useful transformation.
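One cheap way to inspect that structure is a metadata query against INFORMATION_SCHEMA, which only reads table metadata rather than scanning the table itself (a sketch - you can get the same information from the table's Schema tab in the BigQuery console):

-- List the columns and types of the files table
SELECT
  column_name,
  data_type
FROM `bigquery-public-data.github_repos.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'files'
ORDER BY ordinal_position;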

I won't go too much into these details for now. Just remember that the repo data contains several tables which we will need to draw data from.

Once we have BigQuery set up, I want to start a new dbt project to define a couple of models using the github repository data. To do that, I'll cd to wherever I want the new project to live and run dbt init project-name (replacing 'project-name' with your desired name).

Here's my dbt_project.yml file:

name: 'bigquery_public_test'
version: '1.0.0'
config-version: 2

# This setting configures which "profile" dbt uses for this project.
profile: 'bigquery_public'
# These configurations specify where dbt should look for different types of files.
# The `source-paths` config, for example, states that models in this project can be
# found in the "models/" directory. You probably won't need to change these!
source-paths: ["models"]
analysis-paths: ["analysis"]
test-paths: ["tests"]
data-paths: ["data"]
macro-paths: ["macros"]
snapshot-paths: ["snapshots"]

target-path: "target" # directory which will store compiled SQL files
clean-targets: # directories to be removed by `dbt clean`
    - "target"
    - "dbt_modules"

# Configuring models
# Full documentation:  https://docs.getdbt.com/docs/configuring-models 

# In this config, we tell dbt to materialize the commits model as a table and
# the filecounts model as a view. These settings can be overridden in the
# individual model files using the `{{ config(...) }}` macro. Note that model
# configs need to be nested under the project name in order to apply.
models:
  bigquery_public_test:
    commits:
      +materialized: table
    filecounts:
      +materialized: view


This file specifies project name, paths, targets, and models and their properties. I have defined two models, commits and filecounts. Let's take a look at each in more detail.


Running dbt init will create the project directory structure you can see in the left pane of my Atom window. Models are defined as SQL statements; my commits model lives in commits.sql. The dbt docs recommend making frequent use of the WITH ... AS SQL syntax, which (to be quite honest) I was not too familiar with before going through the dbt tutorials. I've written plenty of complex queries in my time, but I usually relied on other SQL language features to structure them. WITH ... AS turns out to be a super useful clause: it lets you name subqueries that you can refer to in subsequent statements, which helps you organize the complex SQL you'll need when defining data models.

So I've defined two models: commits and filecounts. Commits is pretty simple. A commit contains three fields - commit_id, author_name, and commit_time. I'm only using those three fields in part so that I can limit the amount of data the query processes (and thus what I'll be charged for). Selecting a small subset of columns reduces how much data BigQuery has to scan - and bill you for - while a LIMIT clause keeps the result set manageable (note that LIMIT by itself doesn't reduce the bytes scanned, as we'll see later). For more on cost control in BigQuery, see my notebook:

BigQuery: Cost Control (at Google Colab)
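For reference, here's roughly what a commits model along those lines might look like - a sketch rather than my exact commits.sql. It pulls the three fields from the (smaller) sample_commits table; the nested author record and its field names reflect my understanding of the public dataset's schema, so double-check them against the table before running:

{{ config(materialized='table') }}

with gh_commits as (

  -- Pull just the three fields the model needs, from the smaller sample_commits table
  SELECT
    c.commit      AS commit_id,    -- commit SHA
    c.author.name AS author_name,  -- author name from the nested author record
    c.author.date AS commit_time   -- author timestamp from the nested author record
  FROM `bigquery-public-data.github_repos.sample_commits` c

)

SELECT *
FROM gh_commits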

The filecounts model is a little more interesting. I was curious to see the total number of files in each repository, and that's what the filecounts model gives us. To calculate file counts for each repository, we need to combine data from the files table and the sample_repos table: I want a list of repos sorted by the number of files each one contains, so I can see the size of the largest repos. And of course, when we need to combine data from multiple tables in SQL, we need a JOIN!

Here's my filecounts.sql file:

{{ config(materialized='view') }}

with gh_file_counts as (

  SELECT COUNT(f.repo_name) AS file_count, f.repo_name AS repo_name
  FROM `bigquery-public-data.github_repos.files` f INNER JOIN  `bigquery-public-data.github_repos.sample_repos` r
  ON f.repo_name = r.repo_name
  GROUP BY f.repo_name
  ORDER BY file_count DESC
  LIMIT 100

)

SELECT *
FROM gh_file_counts

Notice that in this case the WITH ... AS syntax is unnecessary - I could take that clause out, have this model definition consist of just the inner SQL statement, and the result would be the same. But it's good to get into the habit of using WITH ... AS, because when we start working with more complex models, and with models that build on other models, there will be times when the query we need can only be built with that syntax. Keep that in mind for future reference.
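As a concrete (hypothetical) illustration, here's a follow-on model - call it top_repos.sql; it isn't part of this project - that builds on filecounts through dbt's ref() function and layers its own named subquery on top:

{{ config(materialized='view') }}

with ranked_repos as (

  -- Build on the filecounts model rather than querying the raw tables again
  SELECT
    repo_name,
    file_count
  FROM {{ ref('filecounts') }}

)

SELECT *
FROM ranked_repos
ORDER BY file_count DESC
LIMIT 10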

So let's go back and look at what the filecounts model definition is doing. We're selecting from two tables, files and sample_repos, and joining them with an INNER JOIN on matching repo_name values. We're using COUNT to get the total number of files for each repo - if we didn't aggregate, we'd get one record for every individual file in the files table, which would be a lot of records and not very useful. We alias that count so we can refer to it later in the query, GROUP BY repo_name to aggregate per repository, ORDER BY file_count in descending order so the largest repos come first, and LIMIT the number of rows returned. But because the join and the aggregation still have to read the repo_name column across both entire tables - and, as noted earlier, LIMIT doesn't reduce the bytes scanned - the query scan size is still 50.2GB (!):