Snowflake is a cloud data warehouse that runs on public cloud infrastructure such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), and is a true SaaS offering. There is no hardware (virtual or physical) for you to select, install, configure, or manage, and no software for you to install, configure, or manage. All ongoing maintenance, management, and tuning are handled by Snowflake.
Architecturally, the Snowflake data warehouse is made up of three main components:
Storage Layer: Snowflake relies on the scalable cloud blob storage available in public clouds such as AWS, Azure, and GCP. Building on massively distributed storage systems lets Snowflake provide the performance, reliability, availability, capacity, and scalability required by the most demanding data warehousing workloads.
Snowflake organizes data into micro-partitions that are internally optimized and compressed, and stores them in a columnar format. Because the data lives in cloud storage, the storage layer behaves like a shared-disk model, which simplifies data management: users do not have to worry about distributing data across multiple nodes, as they would in a shared-nothing model.
The storage layer is architected to scale independently of the compute layer. This design choice benefits the consumer in terms of both performance and cost. The storage layer holds the data, tables, and query results for Snowflake.
Compute nodes connect to the storage layer to fetch data for query processing. Because the storage layer is independent, we pay only for the average monthly storage used. Since Snowflake is provisioned in the cloud, storage is elastic and is billed per TB per month based on usage.
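The "average monthly storage" arithmetic can be illustrated with a minimal Python sketch. The function name, the daily usage figures, and the rate of 23.0 per TB per month are all assumptions made for illustration; they are not Snowflake's actual prices or billing code.

```python
def monthly_storage_cost(daily_tb, rate_per_tb_month):
    """Bill on the average of daily storage readings (in TB) for one month."""
    if not daily_tb:
        return 0.0
    average_tb = sum(daily_tb) / len(daily_tb)
    return average_tb * rate_per_tb_month


# Hypothetical month: 10 days at 2 TB, then 20 days at 5 TB after a large load.
daily_usage = [2.0] * 10 + [5.0] * 20
cost = monthly_storage_cost(daily_usage, rate_per_tb_month=23.0)  # assumed rate
print(f"cost = {cost:.2f}")  # cost = 92.00
```

In this sketch the bill reflects the 4 TB average for the month rather than the 5 TB peak, which is the practical consequence of paying for average usage.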
Compute Layer: Snowflake relies on standard computing infrastructure, i.e. virtual machines available to anyone in a public cloud environment. In AWS this is EC2, and in GCP it is Compute Engine. Virtual warehouses form a critical component of the Snowflake architecture. By design, they can process massive volumes of data with a high degree of efficiency and performance. When an incoming query is detected, computing power becomes available immediately to process the request. As in other database technologies, intelligent caching ensures optimal utilization of resources and reduces the interaction between compute and storage systems. Unlike many of them, however, Snowflake can have multiple virtual warehouses working against the same data at once while maintaining the integrity of each transaction, which keeps the system ACID compliant.
Multiple virtual warehouses can be created in Snowflake to meet different requirements, depending on the workload. Each virtual warehouse works with a single storage layer. Generally, a virtual warehouse has its own independent compute cluster and does not interact with other virtual warehouses.
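The relationship between one shared storage layer and several independent warehouses, each with its own cache, can be modeled in a few lines of Python. This is a toy model only: the class names, warehouse names, and partition key are hypothetical, and it does not reflect how Snowflake implements caching internally.

```python
class SharedStorage:
    """Stand-in for the cloud storage layer; counts how often it is read."""

    def __init__(self, partitions):
        self._partitions = dict(partitions)
        self.reads = 0

    def fetch(self, key):
        self.reads += 1
        return self._partitions[key]


class VirtualWarehouse:
    """Independent compute cluster with its own local cache."""

    def __init__(self, name, storage):
        self.name = name
        self._storage = storage
        self._cache = {}

    def read(self, key):
        # Go to shared storage only on a cache miss.
        if key not in self._cache:
            self._cache[key] = self._storage.fetch(key)
        return self._cache[key]


storage = SharedStorage({"mp_001": [10, 20, 30]})
etl_wh = VirtualWarehouse("etl_wh", storage)
bi_wh = VirtualWarehouse("bi_wh", storage)

etl_wh.read("mp_001")  # miss: fetched from shared storage
etl_wh.read("mp_001")  # hit: served from etl_wh's own cache
bi_wh.read("mp_001")   # miss: bi_wh does not share etl_wh's cache
print(storage.reads)   # 2
```

Both warehouses see the same data, but a cache warmed by one does nothing for the other, which is what "doesn't interact with other virtual warehouses" amounts to in this model.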
Cloud Services Layer: The services layer is where all the intelligent action happens. This layer performs functions such as user authentication, cluster management, query execution and optimization, security, encryption, and the orchestration of transaction execution. It runs on stateless compute nodes that span the entire data center. Metadata distributed across this cluster of computing nodes is used to maintain the global state of transactions and of the system.
When a query is issued, the services layer parses it, compiles it, determines which partitions hold the data of interest, and flags them for scanning. One would expect metadata processing to take up sizable computing power, and one would not be wrong to think so. By design, however, metadata is processed on a separate cluster of machines, which reduces the impact on the compute resources that actually process the user's data.
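The step of deciding which partitions to flag for scanning can be sketched in Python. The text does not say what the metadata contains, so this example assumes each partition records the minimum and maximum value of a column; the class, function, and partition names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PartitionMeta:
    """Hypothetical per-partition metadata: value range of one column."""
    name: str
    min_value: int
    max_value: int


def partitions_to_scan(metadata, low, high):
    """Return names of partitions whose [min, max] range overlaps [low, high]."""
    return [
        p.name
        for p in metadata
        if p.max_value >= low and p.min_value <= high
    ]


# Illustrative metadata for an order_id column spread over four partitions.
catalog = [
    PartitionMeta("mp_001", 1, 1000),
    PartitionMeta("mp_002", 1001, 2000),
    PartitionMeta("mp_003", 2001, 3000),
    PartitionMeta("mp_004", 3001, 4000),
]

# Equivalent of a filter such as: order_id BETWEEN 1500 AND 2500
selected = partitions_to_scan(catalog, 1500, 2500)
print(selected)  # ['mp_002', 'mp_003']
```

Only metadata is consulted here; no partition data is read until the compute layer scans the two flagged partitions, which is why this work can run away from the warehouse itself.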
These three layers scale independently, and Snowflake charges for storage and virtual warehouses separately. The services layer runs on compute nodes that Snowflake provisions, and is therefore generally not charged separately.
The advantage of this architecture is that any one layer can be scaled independently of the others. For example, you can scale the storage layer elastically and be charged for storage separately, while multiple virtual warehouses can be provisioned and scaled when additional resources are needed for faster query processing and better performance.