The Real Power of Snowflake – A Snowflake Review
 

Snowflake is a cloud-based data warehousing platform that enables a modern data warehouse, augmented data lakes, accelerated analytics, integrated data engineering, secure data exchange, agile data application development and advanced data science.

The real power of Snowflake lies in its architecture and its flexibility. Snowflake is the first analytics database built for the cloud, meaning you can access your data warehouse from anywhere. On top of this, it integrates seamlessly with AWS and other cloud platforms.

Snowflake handles all aspects of authentication, configuration, resource management, data protection, optimisation and availability. As a user, you only need to sign up, load your data and start querying. Overall, this means that much of the hassle surrounding traditional data analysis platforms simply doesn't exist for Snowflake users!
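To make that flow concrete, here is a minimal sketch using Snowflake's Python connector (snowflake-connector-python). The account, credentials and object names are placeholders, and the virtual warehouse is assumed to already exist.

```python
# Minimal "sign up, load, query" sketch using the snowflake-connector-python
# package (pip install snowflake-connector-python). All names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account",    # placeholder account identifier
    user="your_user",
    password="your_password",
    warehouse="demo_wh",       # assumes this virtual warehouse already exists
)
cur = conn.cursor()
cur.execute("CREATE DATABASE IF NOT EXISTS demo_db")
cur.execute("USE DATABASE demo_db")
cur.execute("CREATE TABLE IF NOT EXISTS sales (id INT, amount NUMBER(10,2))")
cur.execute("INSERT INTO sales VALUES (1, 9.99), (2, 19.99)")
cur.execute("SELECT COUNT(*), SUM(amount) FROM sales")
print(cur.fetchone())
conn.close()
```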

How does Snowflake differ from traditional architectures?

In traditional architectures:

  • Shared-disk architectures use multiple nodes to access data held on a single, shared storage system.

  • Shared-nothing architectures distribute the data across the cluster, with each node storing a portion of it locally.

In Snowflake's architecture:

It combines the benefits of both approaches in an innovative design that takes full advantage of the cloud: a multi-cluster, shared-data architecture consisting of three separate layers:

  • Data Storage Layer

  • Compute Layer

  • Service Layer

Each layer scales independently and includes built-in redundancy. In addition, Snowflake lets you store both structured relational data and semi-structured non-relational data.

Snowflake's architecture is made up of three layers: the database storage layer, the query/compute layer and the cloud services layer.

Regardless of the data type, you can use ANSI-standard SQL to perform all data-related tasks.
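As an illustration, here is a minimal sketch (placeholder names and credentials) of querying semi-structured JSON with ordinary SQL, using Snowflake's VARIANT column type and its path syntax into the document:

```python
# Sketch: the same SQL interface works over semi-structured data. Snowflake
# stores JSON in a VARIANT column and lets you drill in with a path syntax.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
    warehouse="demo_wh", database="demo_db", schema="public",
)
cur = conn.cursor()
cur.execute("CREATE TABLE IF NOT EXISTS events (payload VARIANT)")
# PARSE_JSON can't appear in a VALUES clause, so use INSERT ... SELECT:
cur.execute("""
    INSERT INTO events
    SELECT PARSE_JSON('{"user": "ada", "clicks": 3}')
""")
# Ordinary SQL plus a path expression into the JSON document:
cur.execute("SELECT payload:user::STRING, payload:clicks::INT FROM events")
print(cur.fetchall())
conn.close()
```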

Snowflake uses highly secure cloud storage to hold all your data. As data is loaded into tables, Snowflake converts it into an optimised, compressed columnar format and encrypts it with strong AES-256 encryption.

Unlike traditional architectures, Snowflake allows you to create multiple independent compute clusters, called virtual warehouses, that all access the same data storage layer without contention or performance degradation.

To create a virtual warehouse, you simply give it a name and specify a size; Snowflake handles all the provisioning and configuration of the underlying compute resources. On top of this, virtual warehouses can be scaled up or down at any time without downtime or disruption.
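A minimal sketch of that DDL, run through the Python connector; the warehouse name, settings and credentials are all placeholders:

```python
# Sketch: creating a virtual warehouse is a single DDL statement; you name
# it and pick a size, and Snowflake provisions the compute behind it.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
)
conn.cursor().execute("""
    CREATE WAREHOUSE IF NOT EXISTS reporting_wh
      WITH WAREHOUSE_SIZE = 'MEDIUM'  -- sizes run from X-Small upwards
           AUTO_SUSPEND   = 300       -- pause after 5 idle minutes
           AUTO_RESUME    = TRUE      -- wake up on the next query
""")
conn.close()
```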

A key advantage of Snowflake is that when a virtual warehouse is resized, subsequent queries immediately take advantage of the additional resources. Its unique cloud architecture enables virtually unlimited scale and concurrency without resource contention.
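Resizing is likewise one statement; a brief sketch against the (placeholder) warehouse created above:

```python
# Sketch: resizing takes effect for subsequent queries without downtime.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
)
cur = conn.cursor()
cur.execute("ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'XLARGE'")
# ...run the heavy workload, then scale back down:
cur.execute("ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'SMALL'")
conn.close()
```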

For example, separate virtual warehouses can handle loading and querying concurrently: because all warehouses access the same data storage layer, any updates or inserts become immediately visible to every other warehouse.
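A sketch of that pattern with two sessions, each pinned to its own warehouse; the stage name, table and credentials are hypothetical:

```python
# Sketch: one session loads while another queries the same table; both see
# the same storage layer, so committed loads are visible immediately.
import snowflake.connector

common = dict(account="your_account", user="your_user",
              password="your_password", database="demo_db", schema="public")

# One session loads on a warehouse sized for bulk work...
loader = snowflake.connector.connect(warehouse="loading_wh", **common)
loader.cursor().execute("COPY INTO sales FROM @sales_stage")  # hypothetical stage

# ...while another queries the same table on its own warehouse.
reporter = snowflake.connector.connect(warehouse="reporting_wh", **common)
cur = reporter.cursor()
cur.execute("SELECT SUM(amount) FROM sales")
print(cur.fetchone())

loader.close()
reporter.close()
```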

On top of everything, the services layer coordinates and manages the entire system: it authenticates users, manages sessions, secures data and performs query compilation and optimisation. It also manages virtual warehouses and coordinates data storage updates and access, ensuring that once a transaction completes, all virtual warehouses see the new version of the data with no impact on availability or performance.

The key component of the services layer is the metadata store, which powers a number of unique Snowflake features, including Zero-Copy Cloning, Time Travel and Data Sharing.
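A brief sketch of two of those features in action (placeholder names; the Time Travel query assumes the retention window covers the offset):

```python
# Sketch: a clone is a metadata operation (no data is copied at creation),
# and Time Travel queries a past version of a table.
import snowflake.connector

conn = snowflake.connector.connect(
    account="your_account", user="your_user", password="your_password",
    warehouse="demo_wh", database="demo_db", schema="public",
)
cur = conn.cursor()

# Zero-copy clone: instant; storage is shared until either copy diverges.
cur.execute("CREATE TABLE sales_dev CLONE sales")

# Time Travel: the table as it looked one hour ago (offset in seconds).
cur.execute("SELECT COUNT(*) FROM sales AT(OFFSET => -3600)")
print(cur.fetchone())
conn.close()
```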

A growing ecosystem of external tools has native connectivity with Snowflake, meaning virtually all operations can be integrated seamlessly with the platform.

What do you need to manage Snowflake?

Not much. Snowflake eliminates most of the tuning knobs and parameters required by other data warehouses: you only need to create database tables and virtual warehouses, load data and execute queries - Snowflake handles the rest.

How much does Snowflake cost?

The advantage of Snowflake is that you pay only for the storage and compute resources you actually use.

Storage costs are based on the amount of compressed data stored in database tables, plus the additional data retained to support Snowflake's unique data recovery features.

Compute costs are based on warehouse size and how long your warehouses run.
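As a back-of-the-envelope illustration: compute is billed in credits per hour, with an X-Small warehouse consuming one credit per hour and each size up doubling that. The dollar price per credit varies by edition and region, so the figure below is purely an assumption.

```python
# Rough compute-cost estimate. Credit rates follow the doubling-by-size
# pattern; the price per credit is an assumed figure, not an official one.
CREDITS_PER_HOUR = {"XSMALL": 1, "SMALL": 2, "MEDIUM": 4, "LARGE": 8, "XLARGE": 16}
ASSUMED_PRICE_PER_CREDIT = 3.00  # USD - assumption, varies by edition/region

def compute_cost(size: str, hours_running: float) -> float:
    """Estimated cost of a warehouse of `size` running for `hours_running` hours."""
    return CREDITS_PER_HOUR[size] * hours_running * ASSUMED_PRICE_PER_CREDIT

# e.g. a MEDIUM warehouse running 6 hours a day over a 22-day work month:
print(f"${compute_cost('MEDIUM', 6 * 22):,.2f}")  # 4 credits/h x 132 h x $3
```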

 

We hope this brought some insight into the power of Snowflake and the efficiencies it can create for companies that store and use large amounts of data. If you're interested in learning more about Snowflake, get in touch for a chat about your data strategy today!

Do you use Snowflake? We’d love to know your thoughts on this platform and whether you’ll be using it. Join the discussion on our LinkedIn Page.

 
Commentary, Reviews | Jack Sloman
Why We Chose to Work with KNIME

As a data practitioner or enthusiast, you will have realised or heard that 60-70% of your job involves data preparation. So, it makes sense to have dedicated tools that create reproducible workflows and reduce the pain involved in these tasks. There is a wide range of data transformation tools, such as Alteryx, Pentaho, SSIS and Tableau Prep, which are to a large degree among the best on the market for BI-type tasks. However, these tools don't come cheap; they carry hefty price tags. So, over the last few months here at White Box Analytics, we went on a journey to find an alternative to the costly ETL tools. We arrived at KNIME, and we must say it is pretty good! We thought we would share and highlight our experience with this open source tool.

KNIME is an open source data platform that has been around for more than 10 years, since before the data science buzz. It is guided analytics software for data professionals and an easy-to-use, interactive platform that executes workflows as connected nodes. In the years it has been around, the platform has evolved from standalone desktop software into a collaborative tool for data practitioners, and it has grown into further products catering for enterprise users with the release of KNIME Server and KNIME extensions. It can fairly be described as a data platform, as it covers most of the end-to-end workflow of a data project, including data import and transformation (ETL), visualisation and predictive modelling.

Specifically, our experience with the product has been in transforming and loading data into data sources for our clients - popularly known as the Extract, Transform and Load (ETL) process. With fairly decent data manipulation skills, we found the software relatively quick to pick up and to build workflows with. It is very intuitive, thanks to the development team, and we found the panes really useful: the KNIME Explorer, Node Repository, Description, Console, Outline and Workflow Coach panes, plus KNIME extensions. See Figure 1.

Figure 1: A screenshot of the KNIME desktop.

Like most guided analytics software and ETL tools, KNIME executes workflows as nodes, and the Node Repository pane groups these nodes by function. These groups include a database group specifically for connecting to and querying different databases. Out of the box, KNIME connects to the usual suspects - Microsoft SQL Server, MySQL, Oracle and PostgreSQL - and it can connect to any other data source via an arbitrary JDBC driver.

Another cool feature we found in the KNIME node repository is its file handling capability. Aside from connecting to File Transfer Protocol and Secure File Transfer Protocol (FTP/SFTP) servers with the usual password authentication, we found connecting to servers via public key seamless compared with other applications. Connecting to an AWS S3 bucket was also a walk in the park: just a matter of installing the relevant extensions.

Something worth highlighting is KNIME's scripting capability. This feature underlines why it is not just an ETL tool but a data platform: it makes it easy to integrate other scripting languages into a workflow, which came in handy for tasks the built-in KNIME nodes didn't cover. So far, the languages we have integrated with KNIME are R and Python, two of the most relevant languages in the industry, but it can also integrate others such as Java and JavaScript. This is useful for those problematic CSV files, or if you would rather work in R or Python but still need reproducible visual workflows.

Something to be cautious about is that some of KNIME's node terminology can be confusing the first time around. We had significant delays completing a task simply because we were confused about the input variables that go into the source and target ports of the “Download/Upload from List” node. This node connects to an FTP server and lists the files in a remote directory, with the option to download from or upload to that directory, but its source and target ports expect a variable in KNIME's URI format. It took quite a while to understand that the input variable, which would usually be a plain file path, had to be converted to a URI first.
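For a flavour of the Python integration, here is a minimal sketch of what the body of a legacy KNIME "Python Script" (1 input, 1 output) node can look like. As we understand the node's convention, KNIME injects the input table as a pandas DataFrame named input_table and reads the result back from output_table; the column name below is hypothetical.

```python
# Inside a legacy KNIME "Python Script" (1 => 1) node. KNIME supplies
# `input_table` as a pandas DataFrame and collects `output_table` when
# the node finishes; this snippet only runs inside such a node.
import pandas as pd

df = input_table.copy()  # `input_table` is injected by KNIME at runtime

# Example clean-up that is fiddly with nodes alone; "order_date" is a
# hypothetical column in the incoming table.
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
df = df.dropna(subset=["order_date"])

output_table = df  # handed back to the downstream KNIME nodes
```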

We also noticed strong community engagement. The KNIME community portal was really useful: it is well organised and answered just about any question we had. This may be down to the maturity of the platform, or to using the right keywords, but we would like to put it out there that it was fairly easy to find answers to most questions on the portal. The platform's users are really engaged, and this has led to the development of quite a number of KNIME extensions. There is currently even an extension for sending Slack messages, because this is 2019!

There are two features that are particularly helpful for beginners: the Workflow Coach and NodePit. The Workflow Coach pane is one of the features that makes KNIME stand out, particularly for beginners, as it recommends a list of nodes, each with a percentage ranking, that commonly accompany the selected node. It mimics NodePit, but in less detail. NodePit goes further, giving a detailed explanation of each node, and includes a “Best Friends” section listing the incoming and outgoing nodes that most often accompany a given node. NodePit also provides example workflows that incorporate the node in question, which is super useful for beginners, who can re-engineer an example workflow to understand the node better.

Although there weren't many difficulties in working with KNIME, we found the desktop version wasn't built to handle very large datasets. With KNIME desktop, the workload it can manage depends on the local machine's hardware, as temporary files are stored on the hard drive or in memory. In our experience, the desktop application struggled with a dataset of about 14 million records on a machine with 16 GB of RAM. Another downside we faced was KNIME's difficulty in running stored procedures. There are workarounds, but we hope that in later updates running them becomes as easy as configuring a node.

This is by no means an exhaustive guide to the KNIME desktop product; there are plenty of resources online covering those topics. It is simply a highlight of our experience with the software and an explanation of why we feel it is a better alternative both to the traditional SSIS ETL tool, which does not seem to have evolved with the industry, and to costly ETL tools like Alteryx and Informatica. There is something to be said about the open source movement and its importance to the data analytics industry, but we will leave that for a separate article. Until then, we hope you found this informative.


This review was written by Ije Iruemi, a data analyst for White Box Analytics.

For more data analysis and visualisations, click here.

Or, get in touch for a discussion about your data strategy.

Commentary, Reviews | Jack Sloman