A Jupyter Magics extension designed for Data Scientists to interact seamlessly with Snowflake
Snowflake is a multifaceted Cloud Data Platform providing scale, elasticity, and performance all in a consumption-based SaaS offering. Customers can load their data into Snowflake tables and easily transform the stored data when the need arises.
Among the many features provided by Snowflake is the ability to establish a remote connection. From this connection, you can leverage the majority of what Snowflake has to offer.
I will focus on two features: running SQL queries and transforming table data via a remote Snowflake connection.
One popular way for data scientists to query Snowflake and transform table data is to connect remotely using the Snowflake Connector for Python inside a Jupyter Notebook.
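A minimal sketch of that workflow looks something like this (the connection details, query, and table name below are placeholders, not values from a real account):

```python
import snowflake.connector

# Create a connection object (all connection details are placeholders)
conn = snowflake.connector.connect(
    user="<username>",
    password="<password>",
    account="<account_identifier>",
    warehouse="<warehouse>",
    database="<database>",
    schema="<schema>",
)

# Create a cursor from the connection and execute a SQL query in Snowflake
cur = conn.cursor()
cur.execute("SELECT * FROM MY_TABLE LIMIT 100")

# Store the query results as a pandas DataFrame for further transformation
df = cur.fetch_pandas_all()
```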
The example above is a typical use of the Snowflake Connector for Python inside a Jupyter Notebook. I first create a connection object. Then, a cursor object is created from the connection. After creating the cursor, I can execute a SQL query inside my Snowflake environment. Finally, I store the query results as a pandas DataFrame. I can now easily transform the pandas DataFrame and upload it to Snowflake as a table.
From the example above, you can see that connecting to Snowflake and executing SQL inside a Jupyter Notebook is not difficult, but it can be inefficient. To address this problem, we developed an open-source Python package and Jupyter extension.
Cloudy SQL is a pandas and Jupyter extension that manages the Snowflake connection process and provides a simplified way to execute SQL in Snowflake from a Jupyter Notebook.
Watch a demonstration video of Cloudy SQL in this Hashmap Megabyte:
Supported Methods
Cloudy SQL currently supports:
- A Jupyter magic method that allows users to execute SQL queries in Snowflake from a Jupyter Notebook easily
- Writing to an existing or new Snowflake table from a pandas DataFrame
Installation and Configuration
To get Cloudy SQL up and running, a few steps need to be completed before use:
- Run `pip install cloudy-sql` to install Cloudy SQL and its dependencies.
- Upon installation, open an empty Jupyter notebook and run the following code in a Jupyter cell:
%load_ext cloudy_sql
After you run the above code, a configuration file will be created in your HOME directory. Cloudy SQL uses the information in this file to connect to Snowflake for you. The path to the configuration file is `$HOME/.cloudy_sql/configuration_profiles.yml`.
For Windows, use `$USERPROFILE` instead of `$HOME`.
The configuration file has the following format:
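The exact keys come from the generated template itself, so open the file to see them; as a rough, illustrative sketch, it holds values along these lines (field names below are based on the credentials discussed later in this post):

```yaml
# Illustrative only -- open the generated configuration_profiles.yml for the exact keys
username: <snowflake_username>
password: <snowflake_password>
account: <snowflake_account>
database: <default_database>
schema: <default_schema>
role: <default_role>            # optional
warehouse: <default_warehouse>  # optional
```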
- Open this file using the path provided above and fill in your Snowflake information in the applicable fields.
Note: Configuration is a one-time setup. Once you’ve configured the credentials file, you can use it for any project that uses Cloudy SQL.
When you call any Cloudy SQL magic or method, it uses the information stored in `configuration_profiles.yml` to seamlessly connect to Snowflake.
Cloudy SQL currently supports two options to pass in Snowflake connection credentials and details:
- The above configuration file. `configuration_profiles.yml` serves as the place to store default Snowflake credentials.
- Passing your Snowflake details as arguments when calling a Cloudy SQL magic or method. Any argument passed in takes priority over its corresponding default value stored in the configuration file.
To use Cloudy SQL in a Jupyter Notebook, you need to run the following code in a cell:
%load_ext cloudy_sql
The intent has been to keep the API as simple as possible by minimally extending the pandas and IPython Magic APIs.
%%sql_to_snowflake
An IPython cell magic that seamlessly connects to Snowflake, runs a query, and optionally returns the result as a pandas DataFrame when applicable.
Parameters
- `<SQL query>` (Required): SQL query to be executed in Snowflake.
- `<destination_var>` (Positional): Variable to store the query results. If none is given, the magic will return the first 10 rows of the pandas DataFrame if applicable.
- `--params <params>` (Optional): Parameters to be used in the SQL query. Params must be passed in as a dictionary string in the format `{"param_name": "param_value"}` or reference a dictionary string defined in a previous cell. The use of a parameter in the query should be indicated with `{{param_name}}`.
- `--username <username>` (Optional): If provided, the magic will connect to Snowflake with this username instead of the default in the configuration file.
- `--password <password>` (Optional): If provided, the magic will connect to Snowflake with this password instead of the default in the configuration file.
- `--account <account>` (Optional): If provided, the magic will connect to Snowflake with this account instead of the default in the configuration file.
- `--role <role>` (Optional): If provided, the magic will connect to Snowflake with this role instead of the default in the configuration file.
- `--warehouse <warehouse>` (Optional): If provided, the magic will use this warehouse instead of the default in the configuration file.
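Putting the pieces together, a typical cell might look like the sketch below, where `df` is the destination variable and the query and table name are placeholders:

```
%%sql_to_snowflake df
SELECT *
FROM MY_DATABASE.MY_SCHEMA.MY_TABLE
LIMIT 10
```

Because a destination variable is supplied, the query results are stored in `df` for use in later cells; without it, the magic would simply display the first 10 rows of the result.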
write_snowflake
This method allows users to create a Snowflake table and write to that table with a pandas DataFrame. Users can also use this method to append data to an existing Snowflake table.
The only required argument to directly include is `table`.
This method works when writing to either an existing Snowflake table or a previously non-existing Snowflake table. If the table you provide does not exist, this method creates a new Snowflake table and writes to it. If the table already exists, the DataFrame data is appended to the existing table by default. If you would like to replace the table with the pandas DataFrame, set `overwrite = True` when calling the method.
`role` and `warehouse` are optional arguments that can be set up in `configuration_profiles.yml`.
`username`, `password`, `account`, `database`, and `schema` are all required but can have default values set up in the configuration file.
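As a rough sketch of a call (the post does not show the exact import path or call signature, so the form below is an assumption used only to illustrate the arguments; the table name is a placeholder):

```python
from cloudy_sql import write_snowflake  # assumed import path -- check the repo for the exact one

# Append df to TEST_TABLE, creating the table first if it does not already exist.
# Credentials not passed explicitly fall back to the defaults in configuration_profiles.yml.
write_snowflake(df, table="TEST_TABLE")

# Replace the table's contents with df instead of appending
write_snowflake(df, table="TEST_TABLE", overwrite=True)
```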
The example above shows how a user can leverage both the `%%sql_to_snowflake` magic and the `write_snowflake` method. It runs a SQL query with `%%sql_to_snowflake` and saves the results as a pandas DataFrame by passing in the destination variable `df` (In [6]). The called `%%sql_to_snowflake` magic uses the Snowflake credentials found in the configuration file.
The example then shows how to easily write that `df` to a Snowflake table (In [8]). The `write_snowflake` method uses the default `username`, `password`, `account`, `database`, and `schema` found in the configuration file.
The example above runs a SQL query with passed-in variables. The variables are used directly in the SQL query by placing each one inside `{{ }}`. A dictionary string `parameters` is passed in when the magic is called by including the `--params` inline argument and placing a `$` to reference the dictionary string created in the previous cell (In [3]). The magic also uses the passed-in `snowflake_username` instead of the default in the configuration file.
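A sketch of what those two cells might look like (the parameter name, query, username, and argument ordering here are illustrative):

```python
# Regular notebook cell: define the dictionary to be referenced later with $
parameters = {"state": "CA"}
```

followed by the magic cell that references it:

```
%%sql_to_snowflake df --params $parameters --username <snowflake_username>
SELECT *
FROM MY_DATABASE.MY_SCHEMA.CUSTOMERS
WHERE STATE = {{state}}
```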
The example then shows how to overwrite the existing `test_cloudy_sql` table with the data in the `df` variable by setting `overwrite = True` (In [5]). The user then drops the table (In [6]).
Get the Code
Find the GitLab repo here.
Cloudy SQL is a pandas and Jupyter extension that manages the Snowflake connection process and provides a simplified and streamlined way to execute SQL in Snowflake from a Jupyter Notebook. This tool continues to be developed with new features, so any feedback is greatly appreciated. Be sure to check out the PyPI package here!
At Hashmap, we work with our clients to build better together.
If you are considering moving data and analytics products and applications to the cloud or if you would like help and guidance and a few best practices in delivering higher value outcomes in your existing cloud program, then please contact us.
Hashmap, an NTT DATA Company, offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our data and cloud service offerings. We would be glad to work through your specific requirements.
Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here. To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap's podcast, Hashmap on Tap, on Spotify, Apple, Google, and other popular streaming apps.