Exploring BigQuery's Time Travel and Cloning Features
In this series, we delve into some remarkable features of BigQuery. With so many functionalities available, it's easy for some to be overlooked. Here’s a list of my favorites:
- Auto, Transactions, and Indexes — Hidden gems of BigQuery
- Change History and Defaults — Hidden gems of BigQuery — Part 2
- Data Lineage and Routines — Hidden gems of BigQuery — Part 3
- Views, Caching, Scheduled Queries, Materialized Views — Hidden Gems of BigQuery — Part 4
- ML in BigQuery, including new Generative AI — BigQuery Breaking News
This article will focus on the following aspects:
- Time-travel window
- Fail-safe period
- Snapshots
- Clones
Let’s dive in…
Time Travel
Querying data from a table is common knowledge, but what if you want to access the data as it existed yesterday or the day before? This is where time travel comes into play — and you don't need a DeLorean; all you need is BigQuery.
BigQuery allows you to retrieve data from any point within a seven-day time travel window by default. This feature enables you to query deleted or modified data, recover deleted tables, or restore tables that have expired.
How do you achieve this? It's straightforward:
SELECT * FROM hgbq.artem.part6 FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY) LIMIT 100
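Recovering a deleted table works slightly differently: since the table no longer exists, you cannot SELECT from it, but you can copy its state from a point in time using the bq command-line tool's snapshot decorator. A sketch, where the epoch value and the destination table name are illustrative:

```shell
# Sketch: restore a table deleted within the time-travel window.
# The value after @ is milliseconds since the Unix epoch and is illustrative.
bq cp hgbq:artem.part6@1687503600000 hgbq:artem.part6_restored
```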
However, while you can access past data, it’s essential to note that this capability is limited to the last seven days by default. You can adjust the time frame but can only reduce it to a minimum of two days.
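The window is configured per dataset via the max_time_travel_hours option, which accepts a value in hours in multiples of 24 between 48 (2 days) and 168 (7 days). A minimal sketch of shrinking it to the minimum:

```sql
-- Shrink the dataset's time-travel window to the 2-day minimum.
-- Allowed values: multiples of 24 between 48 and 168 hours.
ALTER SCHEMA hgbq.artem
SET OPTIONS (max_time_travel_hours = 48);
```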
What if your data was deleted more than seven days ago? Fortunately, there's still hope due to the fail-safe period.
Fail-Safe
BigQuery incorporates a fail-safe period that automatically retains deleted data for an additional seven days following the time travel window. This ensures data is available for emergency recovery at the table level. Note that this fail-safe feature is not adjustable.
It's important to distinguish between fail-safe and time travel. The fail-safe has several limitations:
- You cannot query data stored in the fail-safe.
- Recovery must be handled through the Cloud Customer Care team, as self-recovery is not an option. (Currently in Preview, we hope for changes before general availability.)
- Fail-safe retains deleted data, while time travel allows access to previous versions of updated records.
The fail-safe is strictly an emergency resource.
What if you need to retrieve earlier data beyond the 14-day limit? It depends on your timing. If you plan ahead, you have more options.
Creating a Manual Snapshot Table
For those familiar with SQL and OLTP databases, a straightforward solution is to execute a query on a scheduled basis to create a new table (referred to as a manual snapshot) based on your primary table's data.
Note: This approach is not BigQuery-specific best practice; it's included to set the stage for the more effective options that follow.
You can accomplish this with a simple query:
CREATE OR REPLACE TABLE hgbq.artem.part6_mn_23062023 AS (SELECT * FROM hgbq.artem.part6)
You can even specify the schema if you’re not satisfied with the automatic version:
CREATE OR REPLACE TABLE hgbq.artem.part6_mn_23062023 (
  field1 STRING OPTIONS(description="A description."),
  field2 INT64 OPTIONS(description="A description.")
) AS (SELECT * FROM hgbq.artem.part6)
Set up a scheduler to run this query daily.
Considerations for this approach include:
- If you use BigQuery Scheduled Queries, you’ll need a method to alter the table name daily.
- Copying all data to a new table incurs storage fees for both the original and new table. Daily manual snapshots will increase your storage costs.
- You may want to implement a second scheduled query to delete old snapshots no longer needed.
- Since you must read all data to create a copy, you’ll incur processing charges for this daily operation.
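For the cleanup step mentioned above, BigQuery scripting can do the job in a single scheduled query. A sketch, assuming the dated naming convention from the examples in this section:

```sql
-- Sketch: drop manual snapshots older than 7 days.
-- Assumes the part6_mn_<date> naming convention from the examples above.
FOR t IN (
  SELECT table_name
  FROM hgbq.artem.INFORMATION_SCHEMA.TABLES
  WHERE table_name LIKE 'part6_mn_%'
    AND creation_time < TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
)
DO
  EXECUTE IMMEDIATE FORMAT('DROP TABLE hgbq.artem.%s', t.table_name);
END FOR;
```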
This method works, but are there better options available within BigQuery? Yes! Let's explore them step by step.
COPY TABLE
Instead of using a CREATE...SELECT statement to create a new table, you can opt for a CREATE TABLE COPY statement:
CREATE TABLE hgbq.artem.part6_copy_23062023 COPY hgbq.artem.part6
This command creates a table with the same data and metadata as another table.
You can also create a copy based on a specific timestamp:
CREATE TABLE hgbq.artem.part6_copy_23062023 COPY hgbq.artem.part6 FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
What differentiates this method? It's not merely syntactic sugar; there's a critical distinction:
- Copying a table incurs no data processing charges, as the operation occurs at a low system level.
However, be aware of certain limitations: for example, the source and destination tables must reside in the same region.
While we've eliminated processing costs, storage costs remain. Is there a way to optimize that as well? Yes!
Snapshots
BigQuery features a built-in mechanism for table snapshots. A snapshot preserves the contents of a table (the base table) at a specific time.
You can create a snapshot using the following SQL command:
CREATE SNAPSHOT TABLE hgbq.artem.part6_sn23062023 CLONE hgbq.artem.part6
You can either save a snapshot of the current table or create one that reflects the table's state at any time in the past week:
CREATE SNAPSHOT TABLE hgbq.artem.part6_sn23062023 CLONE hgbq.artem.part6 FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
What sets table copies apart from BigQuery snapshots? Several factors:
Storage Cost: Unlike table copies, BigQuery does not duplicate full table data into a new table when creating a snapshot. Instead, it tracks changes between the original table and the snapshot, only storing the differences (i.e., modified and deleted records).
In simpler terms:
- Initially, creating a table snapshot incurs no storage cost.
- If new data is added to the base table after snapshot creation, you won't pay for that data's storage in the snapshot.
- If data that exists in the snapshot is modified or deleted from the base table, you will incur storage costs for that changed or deleted data. If multiple snapshots contain the altered data, you will only be charged for the storage used by the oldest snapshot.
Of course, if you change everything in your original table, your snapshot will turn into a complete copy of the original table at the time of snapshot creation, resulting in increased storage costs.
Note: Be cautious with clustered base tables. Automatic re-clustering rewrites data in the background, and rewritten data counts as changed data, so a snapshot of a clustered base table can accumulate significant storage charges even if you never explicitly modify the table.
Expiration: You can establish an expiration timestamp for snapshots, eliminating the need for a separate process to remove old snapshots.
CREATE SNAPSHOT TABLE hgbq.artem.part6_sn23062023 CLONE hgbq.artem.part6 FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY) OPTIONS (
expiration_timestamp = TIMESTAMP_ADD(CURRENT_TIMESTAMP(), INTERVAL 7 DAY));
Read-only: Snapshots are read-only, while a copied table grants you full access for modifications.
But there’s a way to address this as well!
Clones
What if you want to enjoy the benefits of snapshots (zero initial creation cost and smart storage cost) while also having the ability to modify the snapshot data? In that case, you need Clones.
A table clone is a lightweight, writable copy of another table (the base table). You’re only charged for the storage of data in the clone that differs from the base table, so there's no initial storage cost for a table clone. Besides the billing model for storage and some additional metadata for the base table, a clone functions similarly to a standard table—you can query it, copy it, delete it, etc.
Essentially, BigQuery table clones use a similar smart mechanism for managing changes to the original table, but they also accept modifications themselves.
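Creating a clone mirrors the snapshot syntax, minus the SNAPSHOT keyword, and, as with snapshots, you can clone the table's state from any point within the time-travel window. A sketch following the article's naming convention:

```sql
-- Sketch: a writable clone of the base table's state as of one day ago.
CREATE TABLE hgbq.artem.part6_cl23062023
CLONE hgbq.artem.part6
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 DAY)
```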
How does this affect costs? As with snapshots, you pay only for the storage of data in the clone that differs from the base table, whether the difference comes from later changes to the base table or from writes to the clone itself. You could think of BigQuery table clones as forks of tables…
Personal note: While I appreciate the cleverness and technical execution of this feature, I have yet to find a compelling real-life application for it in my projects. Whenever I identify a potential use case, it often seems like a solution in search of a problem. It could be a useful option for creating Dev/Test environments based on real production data, provided there’s no risk of sensitive data exposure. I hope you discover a valuable use case for this feature!
Summary
We’ve explored multiple options for accessing previous states of a table. Here are useful links to the topics discussed:
- Time travel
- Fail-safe
- Copy
- Snapshots
- Clones
- Scheduling Queries
P.S. Many of these methods require running scheduled queries with changing table names. How do you accomplish this?
If an external scheduling system (like Composer or a task on GKE) executes your queries, you can pass the table name as a parameter.
But if you prefer to use BigQuery's built-in Query Scheduling, you can write a multi-statement SQL query that generates and executes a CREATE TABLE command using EXECUTE IMMEDIATE. A good example can be found here.
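A minimal sketch of that pattern, reusing the manual-snapshot query from earlier (the date format matches the part6_mn_23062023 naming used above):

```sql
-- Sketch: generate a dated table name and run the CREATE statement dynamically.
DECLARE snapshot_name STRING DEFAULT
  FORMAT('hgbq.artem.part6_mn_%s', FORMAT_DATE('%d%m%Y', CURRENT_DATE()));
EXECUTE IMMEDIATE FORMAT(
  'CREATE OR REPLACE TABLE %s AS (SELECT * FROM hgbq.artem.part6)', snapshot_name);
```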