Using Snowflake Git + Kestra to Automate Pipelines

Snowflake has just released its Git integration to public preview across all clouds, giving you a “source of truth” for your SQL scripts, Snowpark functions, procedures, and apps. This opens many doors for automation, particularly when integrated with Kestra, an open-source declarative data orchestration tool.

Snowflake Git Flow Diagram

Setting up Snowflake Git

First, we need to connect our Git repository to our Snowflake account. If you are integrating a private repo, you need to create a secret containing the credentials used for authentication. If you are using a public repo, you can skip this step.

USE ROLE ACCOUNTADMIN;
CREATE OR REPLACE SECRET git_secret
  TYPE = password
  USERNAME = 'gh_username'
  PASSWORD = 'ghp_token';

Note: You will need to generate a personal access token scoped to the appropriate repo(s).

Next, we will create an API Integration (using ACCOUNTADMIN) that allows Git traffic between our Snowflake instance and the repository.

CREATE OR REPLACE API INTEGRATION git_integration
  API_PROVIDER = git_https_api
  API_ALLOWED_PREFIXES = ('https://github.com/<my-account>/')
  -- ALLOWED_AUTHENTICATION_SECRETS = (git_secret)
  ENABLED = TRUE;

To verify this has been created, we can run SHOW API INTEGRATIONS;

Lastly, we need to create the Git Repository in Snowflake, which represents the external Git repo and includes a cache of all files from its branches, tags, and commits.

CREATE OR REPLACE GIT REPOSITORY snowflake
  API_INTEGRATION = git_integration
  -- GIT_CREDENTIALS = my_secret if needed
  ORIGIN = 'https://github.com/<my-account>/snowflake.git';

You can test this by listing the contents of your repo from within Snowflake. The repository stage has a special logical naming structure so you can navigate across files in different branches (@repo_name/branches/branch_name), tags (@repo_name/tags/tag_name), or commits (@repo_name/commits/commit_hash).
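
For example, assuming your default branch is named main, listing the files on that branch looks like this:

LS @snowflake/branches/main;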

To refresh the repository in Snowflake after making changes, you can run the fetch command:

ALTER GIT REPOSITORY snowflake FETCH;
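
To confirm the fetch picked up your latest commits, you can also inspect what the repository stage currently knows about. As a quick check, assuming the repository name used above:

SHOW GIT BRANCHES IN GIT REPOSITORY snowflake;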

Integrating with Kestra

Kestra is a newer data orchestration tool, similar to Airflow but simpler, and it addresses Airflow's shortcomings around scalability and the burden of managing a Python environment. Flows are YAML-based, making them easy to read and understand and allowing non-developers to design orchestrations.

Kestra has cloud and enterprise editions available, as well as the option to self-host and deploy in Docker, Kubernetes, or a public cloud.

In this example, we will be using Kestra to set up a new Snowflake instance to our “standards”, similar to how we would set up a new client account. If you wish to follow along, simply sign up for a Snowflake Trial Account with just a few short clicks.

Create a Flow

In Kestra, go to Flows and click Create.

Kestra Flow

You will then be presented with a YAML editor. Copy the following code, and we will break down the various parts:

id: snowflake_setup
namespace: snowflake

inputs:
  - id: account_identifier
    type: STRING
  - id: username
    type: STRING
  - id: password
    type: STRING

tasks:
  - id: create_stage_db
    type: io.kestra.plugin.jdbc.snowflake.Query
    url: jdbc:snowflake://{{inputs.account_identifier}}.snowflakecomputing.com
    username: "{{inputs.username}}"
    password: "{{inputs.password}}"
    role: SYSADMIN
    sql: CREATE OR REPLACE DATABASE STAGE_DB;

Your screen should look like this:

Kestra flow with YAML

Let's dig into the code line by line:

  • id — The id at the top is the name of your flow and needs to be unique within the given namespace.
  • namespace — Namespaces group flows together and cannot be changed once the flow is saved.
  • inputs — Parameters that make flows dynamic and reusable; their values are provided at runtime.
  • tasks — Discrete actions capable of taking inputs and variables from the flow, performing computations, and producing outputs for downstream consumption.

Kestra makes writing flows easy by adding a live documentation window to aid you as you type. If you select the Source and Documentation view in the top right, the part your cursor is on will display its documentation on the right with examples, properties, outputs, and definitions to help you.

The task we've created will prompt the user for the Snowflake account identifier, username, and password, then use those credentials to execute the SQL statement that creates the STAGE_DB database. Save the flow, hit Execute at the top, enter your credentials, and run the flow.
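
Once the execution completes successfully, you can confirm the result from a Snowflake worksheet. A quick sanity check, using the database name from the flow above:

SHOW DATABASES LIKE 'STAGE_DB';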

Automate Snowflake Git with Kestra

Now that we've successfully run our first Kestra flow and understand how it works, we can add to it. Let's automate the Snowflake Git integration setup. We will remove the create_stage_db task and replace it with SQL that sets up our public repo.

id: snowflake_setup
namespace: snowflake

inputs:
  - id: account_identifier
    type: STRING
  - id: username
    type: STRING
  - id: password
    type: STRING

tasks:
  - id: grant_warehouse_to_sysadmin
    type: io.kestra.plugin.jdbc.snowflake.Query
    url: jdbc:snowflake://{{inputs.account_identifier}}.snowflakecomputing.com
    username: "{{inputs.username}}"
    password: "{{inputs.password}}"
    role: ACCOUNTADMIN
    sql: GRANT USAGE ON WAREHOUSE COMPUTE_WH TO SYSADMIN;

  - id: git_api_integration
    type: io.kestra.plugin.jdbc.snowflake.Query
    url: jdbc:snowflake://{{inputs.account_identifier}}.snowflakecomputing.com
    username: "{{inputs.username}}"
    password: "{{inputs.password}}"
    role: ACCOUNTADMIN
    sql: |
      CREATE OR REPLACE API INTEGRATION git_integration
      API_PROVIDER = git_https_api
      API_ALLOWED_PREFIXES = ('https://github.com/<my-account>/')
      ENABLED = TRUE;
    
  - id: create_git_database
    type: io.kestra.plugin.jdbc.snowflake.Query
    url: jdbc:snowflake://{{inputs.account_identifier}}.snowflakecomputing.com
    username: "{{inputs.username}}"
    password: "{{inputs.password}}"
    role: SYSADMIN
    sql: CREATE OR REPLACE DATABASE GIT;

  - id: create_git_repo
    type: io.kestra.plugin.jdbc.snowflake.Query
    url: jdbc:snowflake://{{inputs.account_identifier}}.snowflakecomputing.com
    username: "{{inputs.username}}"
    password: "{{inputs.password}}"
    role: SYSADMIN
    database: GIT
    schema: PUBLIC
    sql: |
      CREATE OR REPLACE GIT REPOSITORY snowflake
      API_INTEGRATION = git_integration
      ORIGIN = 'https://github.com/<my-account>/snowflake.git';

Make sure to replace my-account with your GitHub organization. We had to add a few more properties to declare the role, database, and schema, depending on the SQL statement. We've also created a new database named GIT to house the repository metadata.

Now that we have the Git integration automated, we can build out batch SQL scripts to call and run. One caveat of the newly released Snowflake Git integration is the inability to edit or change the files from within Snowflake. But any editor will work, such as VS Code, which is what I'll be using. I've set up folders in my Snowflake GitHub repo to organize my SQL scripts, but you can follow whatever structure makes sense for your use case. Create a "create_database.sql" file with the following statements:

CREATE OR REPLACE DATABASE STAGE_DB;
CREATE OR REPLACE DATABASE STAGE_DB_DEV;
CREATE OR REPLACE DATABASE ETL_DB;
CREATE OR REPLACE DATABASE ETL_DB_DEV;
CREATE OR REPLACE DATABASE EDW_DB;
CREATE OR REPLACE DATABASE EDW_DB_DEV;

Let's also create a “create_warehouse.sql” file with the following:

-- Creates x-small warehouse
CREATE OR REPLACE WAREHOUSE WH_XSM
  WAREHOUSE_SIZE = 'XSMALL'
  WAREHOUSE_TYPE = 'STANDARD'
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE
  SCALING_POLICY = 'STANDARD'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 3
  INITIALLY_SUSPENDED = TRUE;

-- Creates medium warehouse
CREATE OR REPLACE WAREHOUSE WH_MD
  WAREHOUSE_SIZE = 'MEDIUM'
  WAREHOUSE_TYPE = 'STANDARD'
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE
  SCALING_POLICY = 'STANDARD'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 5
  INITIALLY_SUSPENDED = TRUE;

-- Creates x-large warehouse
CREATE OR REPLACE WAREHOUSE WH_XLG
  WAREHOUSE_SIZE = 'XLARGE'
  WAREHOUSE_TYPE = 'STANDARD'
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE
  SCALING_POLICY = 'STANDARD'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 8
  INITIALLY_SUSPENDED = TRUE;

Commit the changes and sync to your repository so they're visible on GitHub. These are the standard database structures and warehouses we primarily use unless a specific deviation is needed; feel free to rename, add, or remove databases or warehouses for your use case.

With that in our GitHub repo, we can go back to Kestra to add the calls for those two batch scripts. Add the following new tasks to the end of your snowflake_setup flow:

  - id: create_databases
    type: io.kestra.plugin.jdbc.snowflake.Query
    url: jdbc:snowflake://{{inputs.account_identifier}}.snowflakecomputing.com
    username: "{{inputs.username}}"
    password: "{{inputs.password}}"
    role: SYSADMIN
    database: GIT
    schema: PUBLIC
    sql: EXECUTE IMMEDIATE FROM @snowflake/branches/main/client_setup/create_database.sql;

  - id: create_warehouses
    type: io.kestra.plugin.jdbc.snowflake.Query
    url: jdbc:snowflake://{{inputs.account_identifier}}.snowflakecomputing.com
    username: "{{inputs.username}}"
    password: "{{inputs.password}}"
    role: SYSADMIN
    database: GIT
    schema: PUBLIC
    sql: EXECUTE IMMEDIATE FROM @snowflake/branches/main/client_setup/create_warehouse.sql;

I have placed my create_database.sql and create_warehouse.sql scripts inside the client_setup folder under main. We can call and run these scripts with the EXECUTE IMMEDIATE call:

EXECUTE IMMEDIATE FROM @<repo_name>/branches/main/<name>.sql
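
Because the stage path can also point at tags or commits (as described earlier), you can pin a setup to a specific release rather than the tip of main. A sketch, assuming a hypothetical tag named v1.0 exists in the repo:

EXECUTE IMMEDIATE FROM @snowflake/tags/v1.0/client_setup/create_database.sql;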

And that's all there is to it; not so bad, right? We can save the flow, hit Execute, populate our inputs for account identifier, username, and password, and let it run. Kestra will show each task in real time as it executes, log responses and metrics, and surface any errors that occur.

Gantt view of the executed flow

We can check our Snowflake instance to confirm all the databases and warehouses were created:

Databases and Warehouses created from Kestra in Snowflake

This example is just the beginning of what can be accomplished with Snowflake Git and Kestra, both for automating repetitive tasks and for maintaining a "source of truth" repository that houses your important code and standards. Hopefully, this brings enough familiarity and understanding to get the gears turning on the endless possibilities with these two tools.

For help, documentation, and references, here are some useful links to aid further understanding:
