It didn’t even look like a bad query. But it didn’t take long to realize that the quick way to get the work done was a disaster of a shortcut.
Before I get into this, I need to defend the people involved. These were very smart developers working on a system that barely required reporting at the end of a complicated process. I understand how they got there. Take the data you have and put it in a data warehouse. Let the magic and arcane engineers of Google handle the storage and processing to get the reporting results you require. You get to just write SQL (or BigQuery’s variant, which is very close to standard) and it just works!
The company tracked packages. Millions of them every day. They wanted to know things about the travel routes and times. The list of ways to measure and describe a shipment was growing quickly, with new ideas and new requirements every sprint. To optimize for this dynamic style of development, the team designed what I would describe as a Visitor Pattern or a Plugin Architecture. The history and state of a package is sent through an interface to any number of functions that developers were constantly updating and adding to. Each function would output more data and tie it to the name of the function.
One simple example was a Days In Transit function. It would analyze the information provided and say something like “Days In Transit = 4.” Put an index on that data and you could quickly see how long all the packages had been in transit. There were lots more of these functions doing all sorts of things. Some measured distance, some judged whether there had been a problem. Some made predictions about outcomes, both common and rare. Some looked at previous predictions and compared them against the new tracking information. Complex stuff. All of it generating more and more data that could be used in aggregate reports later.
Every package we tracked now had a Kotlin Map structure of these pieces of data. Query by the field you want and you get the data element back. It worked like a dynamically typed language. Which is great. Some packages were first processed under different rules and had different fields. Given how many packages we were tracking, it just wasn’t economical to recalculate everything on delivered shipments. More recent data had all the newest functions applied. The datastore for this was MongoDB, which handles dynamic objects really well.
Then came BigQuery.
I have 3 main rules for Google BigQuery.
1. Enter any query into the web console and look at the cost estimate before you run it.
2. Select only the columns you need.
3. Try to run one query and get everything you need. Almost always, the cost is in reading the data, not the processing time.
BigQuery is a column store, like Redshift, Snowflake, Cassandra, and the Parquet file format. Each of those is a bit different, but the column rules apply. You save on queries by going wide with your data. You can go very wide. Tons of columns. When you query for the data you want and select only the specific columns you need, the rest of the data never has to be read from disk. You can analyze lots of rows cheaply because the data is stored by the schema’s columns. Put simply, if you have 100 columns and select one column, your cost will be roughly 1% of a “SELECT *”.
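Here is a rough sketch of what that means in practice. The dataset, table, and column names are hypothetical, not from the original system:

```sql
-- A hypothetical wide table with ~100 columns, one row per package.
-- BigQuery bills by the columns it scans, so this reads one column's
-- worth of data:
SELECT days_in_transit
FROM reporting.shipments;

-- ...while this reads every column for the same rows, costing roughly
-- 100x as much:
SELECT *
FROM reporting.shipments;
```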
The project was mostly a success by this point; we just wanted some reporting across all of the data at once. The dynamic object, with all its generated values, ended up being shoved into BigQuery with columns of “trackingNumber”, “timestamp”, “fieldName”, and “fieldValue”. Not great for a relational database, but terrible for a column store. Every measurement was its own row in the database. To find a set of values, you would group by “trackingNumber” and “fieldName”, find the latest “timestamp”, and get the value for each “fieldName”. That visits every single record, not just the ones you’re looking for. Every report ends up reading the entire table, even if you only care about one single value.
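A sketch of what a report against that narrow layout looks like. The column names are the ones described above; the dataset, table name, and the specific field are hypothetical:

```sql
-- Latest value of one generated field for every package. Even with the
-- WHERE filter, BigQuery still scans the trackingNumber, timestamp,
-- fieldName, and fieldValue columns across every row, and those four
-- columns are the entire table.
SELECT
  trackingNumber,
  ARRAY_AGG(fieldValue ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latestValue
FROM reporting.package_measurements
WHERE fieldName = 'Days In Transit'
GROUP BY trackingNumber;
```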
A second problem was that the SQL in BigQuery looked fine. It was only a few lines, and it pretty directly described what it was doing to create a report. It ran relatively quickly on large datasets. But the effective cost to run this version of the product was 50x what it needed to be to analyze the data we had. One look at our billing report from Google was enough to start saving the data in a wider column format. Deciding to spend the time to manage the schema and write to new columns got the project’s costs back on track.
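For comparison, here is a sketch of the wider layout, again with hypothetical dataset, table, and column names. Every generated field gets its own column, and a report only pays for the columns it names:

```sql
-- One row per package, one column per generated field. The schema has to
-- be managed as new functions are added, which is the trade being made.
CREATE TABLE IF NOT EXISTS reporting.shipment_reporting (
  trackingNumber  STRING,
  updatedAt       TIMESTAMP,
  daysInTransit   INT64,
  distanceKm      FLOAT64,
  predictedLate   BOOL
);

-- A report like this reads two columns and ignores the rest of the table:
SELECT
  COUNTIF(predictedLate) AS late_predictions,
  AVG(daysInTransit) AS avg_days_in_transit
FROM reporting.shipment_reporting;
```

Same data, same report, a small fraction of the bytes scanned.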