Snowflake, a quick overview, from their documentation and some project usage experience.
Snowflake is a columnar data warehouse and data integration platform SaaS hosted on AWS, GCP, or Azure.
Snowflake Use Cases include:
- Session Data, Transaction Storage
- ETL integration with tools such as Informatica
- Automated Data Processing (Python, Spark, ODBC/JDBC interfaces, SQL)
- Micro-Partitioning of Data (Columnar, compressed, encrypted, 50 MB to 500 MB)
- Machine Learning
- BI with PowerBI, Quicksight, Tableau
- Data formatting variation (eg. ORC, Parquet, XML)
When data is loaded into Snowflake, Snowflake reorganizes that data into its internal optimized, compressed, columnar format. Snowflake stores this optimized data in cloud storage. Snowflake manages all aspects of how this data is stored — the organization, file size, structure, compression, metadata, statistics, and other aspects of data storage are handled by Snowflake. The data objects stored by Snowflake are not directly visible nor accessible by customers; they are only accessible through SQL query operations run using Snowflake.
Other columnar database types include AWS Redshift, HBase, Google Big Query, perhaps Opentext. AWS Dynamo DB (Key-value, NoSQL), Azure Table Storage (NoSQL) and Azure Cosmos could be considered competitors (NoSQL is a different schema and structure than Snowflake, but some of the use cases might be similar).
Azure: You can deploy Snowflake on Azure.
The 3 main components of Snowflake:
- Database Storage — The actual underlying file system in Snowflake is backed by S3 in Snowflake’s account, all data is encrypted, compressed, and distributed to optimize performance. In Amazon S3 the data is geo-redundant and provides excellent data durability and availability.
- Query Processing — Snowflake provides the ability to create “Virtual Warehouses” which are basically compute clusters in EC2 that are provisioned behind the scenes. Virtual Warehouses can be used to load data or run queries and can do both of these tasks concurrently. These Virtual Warehouses can be scaled up or down on demand and can be paused when not in use to reduce the spend on compute.
- Cloud Services — Coordinates and handles all other services in Snowflake including sessions, authentication, SQL compilation, encryption, etc.
By design, each one of these 3 layers can be independently scaled and are redundant. If you interested in detailed information about the underlying architecture visit Snowflake’s documentation here.
Snowflake’s cloud services work on ANSI SQL, allowing users to manage data infrastructure and optimize data. Snowflake’s stored data is encrypted and secured in transit and at rest. The platform’s warehousing certifications include HIPAA and PCI DSS.
- Web-based UI
- command-line clients (e.g., SnowSQL)
- ODBC and JDBC drivers
- native connectors (e.g., Python)
- apps like ETL tools (e.g., Informatica) and BI tools.
Using the web interface and its loading wizard is the simplest way to load data into Snowflake. Click the Load Data button and choose the location from which you want to load your data. The wizard combines data loading and staging phases in one swift operation while deleting staged fields automatically after the process has finished. This approach is only suitable for loading datasets up to 50 MB.
Data Format support
Importantly, Snowflake supports the most popular data formats like JSON, Avro, Parquet, ORC and XML. The ability to easily store structured, unstructured, and semi-structured data will help address the common problem of handling all the incongruent data types that exist in a single data warehouse. This is a big step towards providing more value on the data as a whole using advanced analytics.
Many organizations, especially those dealing with large amounts of structured data, opt to use snowflake databases instead of RDBMS. Here are what some of the advantages are:
- Flexible schema design: Snowflake databases allow you to design schemas that reflect how business users think about data. Not what the database engine needs to store the data effectively. This helps reduce complexity and boost performance.
- Simplified management: Snowflake schemas make it easier for companies to spot problems arising from changes in their organization’s data model. They’re also easier to maintain because they don’t require complex ETL processes as RDBMSs do. And lastly, there’s less computing overhead than other types of database structures because snowflake structures distribute individual tables across multiple servers.
- Enhanced querying capabilities: Since dimensions in a snowflake database aren’t dependent on each other, there’s usually little data duplication. This allows companies to query the entire snowflake more efficiently than an RDBMS.
Snowflake databases help companies solve modern-day problems with traditional relational databases, such as the star schema. Snowflake addresses issues like data fragmentation, maintenance overhead, and computing power.
A Snowflake database organizes the same types of information present in relational databases into dimensional models. The most significant difference between a snowflake model and a star schema is that the dimensions in a snowflake database don’t depend on each other for storage or querying purposes. This gives you greater flexibility when thinking about what tables to build and what columns to put within them.
What Are Snowflake’s Benefits?
Here’s how Snowflake’s architecture transforms into practical benefits for data storage and data management.
Snowflake is a complete SaaS platform, which means it requires no installation, setting up or configuration. You can start using the platform with all its features as soon as you subscribe to the service.
SaaS solutions don’t require ongoing maintenance, as your vendor takes care of everything. There’s no need to hire a dedicated IT team to maintain your solution or train your employees to do this independently.
A multi-cloud environment can prevent vendor lock-in while making the most out of each service. Multi-cloud support lets you rely on Google (GCP), Microsoft Azure and Amazon AWS. For example, one of the platforms might give you better analytics features, while another might be better for boosting security.
The cloud services layer is a collection of services that coordinate activities across Snowflake. These services tie together all of the different components of Snowflake in order to process user requests, from login to query dispatch. The cloud services layer also runs on compute instances provisioned by Snowflake from the cloud provider.
Services managed in this layer include:
- Infrastructure management
- Metadata management
- Query parsing and optimization
- Access control
Because most platforms are interconnected, users have to pay for more storage when they need more compute. Snowflake’s storage and compute are completely separate, and there are no extra charges related to scalability.
Snowflake’s multi-cluster architecture removes all concurrency issues. One virtual warehouse’s performance can’t affect the queries of other virtual warehouses. At the same time, every warehouse can scale quickly according to current needs.
Snowflake supports an unlimited number of concurrent workloads and users. The engine powers analytics processes, feature engineering, interactive applications and complex data pipelines.
Snowflake’s scalability, performance and speed reduce some of the most apparent data management costs.
Snowflake automates: data resiliency, availability, data governance, security and data management.
Automation allows companies to handle higher workloads and volumes of data, improving scalability while keeping costs at the same level. It also reduces downtime as companies are always available and can finish processes on time.
Snowflake provides seamless data sharing, cross-region communication and cross-cloud capabilities without the need to use data silos or ETL processes, which are more complex and require more compute resources.
Anyone can access data through the cloud with seamless compliance and governance policies. When a single data source is shared across the whole enterprise, everyone can be sure they have the latest data, making decision-making and collaboration more effective.
Snowflake has an extensive data marketplace of third-party apps and data. This allows teams to connect with their customers with new applications and comprehensive workflows. Regardless of your data pipelines, you can set them in place with these integrations and automate workflows throughout the organization.
Is it perfect? No
Snowflake isn’t perfect. It will be expensive.
Can be expensive- PAYGM
Snowflake has no data limits on storage and computing. While that is a great thing overall, Snowflake has a pay-as-you-go model, which means users need to control their data usage to avoid expensive monthly bills.
Depending on the applications and use, Snowflake can be expensive compared to its competitors, for instance Redshift. Snowflake bills for one minute each time you start or resume a warehouse and charges for every second after that.