As your application gains popularity and traction, the size of the data you have to analyze grows rapidly. Some queries start taking a long time, and the data becomes unmanageable in a traditional database. At that point you start looking at data warehouse solutions that can keep the data organized and easily accessible. If you are evaluating data warehouse solutions, keep Redshift in mind as well.
What is Redshift?
Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a couple hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers. Redshift integrates with various AWS services to create a powerful solution. Its key features are:
- Exceptionally fast at loading data and querying it for analytical and reporting purposes.
- High performance thanks to massive parallelism across multiple nodes, columnar storage that reduces I/O, and data compression that shrinks the storage footprint and further improves I/O speed.
- Scales horizontally and integrates well with the rest of the AWS ecosystem, such as S3 and EMR.
- Redshift comes with various security features.
- ANSI SQL compatible.
- Redshift uses a 1 MB block size; the larger block size reduces the number of I/O requests and therefore improves performance.
Amazon Redshift communicates with client applications using industry-standard SQL JDBC/ODBC drivers. Two types of nodes are available in Redshift:
- Leader node
  - Provides the SQL endpoint
  - Coordinates parallel query execution across the compute nodes
  - Stores metadata
- Compute nodes
  - Multiple compute nodes execute queries in parallel
  - Have dedicated CPU, memory, and local storage
  - Scale out/in and up/down
  - Each compute node is divided into slices, each with its own memory and disk
  - The number of slices depends on the compute node type
Launch Redshift Cluster:
You can sign up for the AWS cloud and get one year free for some services. Once you sign up, log in to the AWS Management Console, open the Services menu, and click Amazon Redshift.
Configure the following settings:
- Cluster Identifier : testcluster
- Database Name: test01
- Database Port: 5439
- Master Username/Password: Master/password (re-enter to confirm the password)
Click Continue and then select Node Type: dc2.large (160 GB of storage; not covered under free tier usage). Keep Single Node for the cluster type, otherwise it'll cost you more.
Click Continue again and choose the following, leaving the other settings as default. Here I am assuming you are familiar with VPCs, security groups, and IAM roles.
- Cluster Parameter Group: when you launch a cluster for the first time in a region, Redshift creates a default parameter group for you. The next time, that parameter group will already be available to select.
- Choose a VPC: set up your own VPC rather than using the default one.
- VPC security settings: Redshift security group (create the security group beforehand, with your IP address in the inbound rules for port 5439).
- Available IAM roles: Redshift-role
You’ll see a usage warning, as a Redshift cluster is not covered under the free tier. Click Continue and launch the cluster. It’ll take 5-7 minutes.
Connecting to Redshift:
Once the cluster is ready, copy the endpoint without the :5439 port suffix.
Connect to Aginity and create a new connection.
Now provide the Redshift cluster and database details:
- Host: the endpoint you copied
- Database: test01
- Username: Master
- Password: the master password
- Port: 5439
- SSL mode: disabled
Data Loading Process:
Data can be copied directly into Redshift from various AWS services like S3, EMR, and DynamoDB via the COPY command. A SQL client can query Amazon Redshift, and the query results are returned to the client.
Other AWS services like Amazon Kinesis Firehose / Kinesis-enabled apps or AWS Database Migration Service can copy data to S3, and from S3 the data can then be copied into Redshift.
Data can also be loaded using Lambda: when a file lands in an S3 bucket, a Lambda function triggers the COPY command from S3 to Redshift.
I took a US domestic airline dataset and loaded it into an S3 bucket.
Create the table in Redshift and load the data from S3 (the column list below is shortened for illustration):
CREATE TABLE flights (
    carrier varchar(80) DISTKEY,
    origin char(3),
    dest char(3),
    dep_delay integer
);
Other examples of data loads (from DynamoDB and EMR):
copy flights from 'dynamodb://flights' IAM_ROLE '' readratio 100;
copy flights from 'emr://m-flight/output/part-*' IAM_ROLE '';
To copy data from S3, provide the IAM role. GZIP indicates that the data is compressed; DELIMITER indicates that items are separated by a comma; REMOVEQUOTES strips the quotation marks; and REGION specifies the bucket's region.
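Putting those options together, the COPY from S3 looks roughly like this (the bucket path and role ARN below are placeholders, not the actual values used in this walkthrough):

```sql
-- Load the compressed, comma-delimited flight data from S3.
-- Bucket path and IAM role ARN are placeholders.
COPY flights
FROM 's3://my-flight-bucket/flights.csv.gz'
IAM_ROLE 'arn:aws:iam::123456789012:role/Redshift-role'
GZIP                -- input files are gzip-compressed
DELIMITER ','       -- fields are comma-separated
REMOVEQUOTES        -- strip surrounding quotation marks
REGION 'us-east-1'; -- region of the S3 bucket
```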
The data loaded within 3-4 minutes.
Run the query to see the number of rows:
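For example, a simple row count against the table loaded above:

```sql
SELECT COUNT(*) FROM flights;
```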
Now that the data is loaded, you can run queries to find underlying patterns in it. You can load more data into additional tables and run queries that join the tables for further insight.
Redshift Table Design:
Designing tables in Redshift is different from a row-based RDBMS.
- In a typical row-based RDBMS, tables are linked via primary/referential keys, but in Redshift primary/referential keys are used only for query planning and performance.
- Sort keys are created in Redshift instead of indexes.
- When you create a table, you choose one of the following three distribution styles:
  - EVEN: rows are distributed across the slices in round-robin fashion, regardless of the values in any particular column. This is the default distribution style.
  - KEY: used for joining tables; co-locates matching values on the same slice. Typically used for a large fact table in a star schema.
  - ALL: a copy of the entire table is placed on a slice of every node. More storage is required; used for reasonably small tables whose data rarely changes.
- Sort keys: Amazon Redshift stores your data on disk in sorted order according to the sort key, and the query optimizer also uses the sort order when building query plans. There are two types of sort keys:
  - Compound: the default; good for joins, ORDER BY, and GROUP BY, and suited to large tables. Performance is poor when queries don't filter on the leading column of the sort key.
  - Interleaved: gives equal weight to each column in the sort key; useful when BI developers run many selective queries that filter on different columns.
- Redshift maintains zone maps that record the minimum and maximum value of each block according to the defined sort keys. Because the zone maps know each block's min and max, a SQL query reads only the relevant blocks, which improves performance.
- Compression: reduces storage cost and processing time. There are two ways to enable it:
  - Automatic compression: applied when the table is created and data is loaded. After 100K rows are loaded into a brand-new table, a compression analysis runs in the background; the table is then dropped and re-created with the recommended compression encodings.
  - Manual compression: create and load the table, then run "ANALYZE COMPRESSION table-name", which displays the recommended compression settings for the table. Drop and re-create the table with those settings.
- Constraints are defined at the column or table level to preserve data integrity:
  - Primary key – not enforced
  - Unique key – not enforced
  - NOT NULL / NULL – enforced
  - References (a primary key referenced as a foreign key in another table) – not enforced
- We define these keys when the application needs them, or to help the planner produce a better query plan and performance.
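Putting the distribution style, sort key, and constraint notes together, here is a hedged DDL sketch (the table name and columns are illustrative, not the exact schema used above):

```sql
-- DISTSTYLE KEY co-locates rows with the same carrier on one slice,
-- which helps joins on carrier. The compound SORTKEY keeps data ordered
-- by flight_date, so zone maps let date-filtered queries skip blocks.
CREATE TABLE flights_fact (
    flight_date date NOT NULL,  -- NOT NULL is the only enforced constraint
    carrier     varchar(80),
    origin      char(3),
    dest        char(3),
    dep_delay   integer
)
DISTSTYLE KEY
DISTKEY (carrier)
COMPOUND SORTKEY (flight_date, carrier);
```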
Number of slices:
You can find the number of slices by querying the stv_slices system table. In my case, with a single node, the number of slices is 2.
How many data files should you split your load into? The rule of thumb is a number equal to the number of slices, or a multiple of it:
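For example, listing the slices and which node owns each one:

```sql
-- One row per slice; node identifies the compute node that owns it.
SELECT node, slice FROM stv_slices ORDER BY node, slice;
```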
- 2 slices – 2 files or 4 files
- 4 slices – 4 files or 8 files or so on.
- Avoid an uneven split: slices that finish their files early sit idle waiting for the remaining files to complete.
- Blocks are immutable – Redshift blocks can't be changed after they are written.
  - An UPDATE writes a new block, and a DELETE doesn't remove the row from disk.
  - The older rows still consume disk space and get scanned when a query executes, degrading performance.
You can use the VACUUM command to reclaim disk space. A few options are available:
- vacuum full; or vacuum sort only; or vacuum delete only; or vacuum reindex interleavedtable;
- But VACUUM is I/O-intensive and needs to run during a maintenance window.
- It can be very slow on large tables.
AWS recommends running a deep copy instead, which re-creates and repopulates the table with a bulk insert. It's much faster than the VACUUM command.
- First method: create a new table with the original DDL (the bulk insert automatically sorts the data), insert from the original, drop the original, and rename the new table.
- Second method: use CREATE TABLE ... LIKE to clone the definition.
- Third method: create a temp table, truncate the original table, and reload it from the temp table.
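A sketch of a deep copy, here cloning the definition with CREATE TABLE ... LIKE (table names are illustrative):

```sql
-- Deep copy: re-create, repopulate with a bulk insert (which writes
-- the rows in sorted order), then swap the table names.
BEGIN;
CREATE TABLE flights_copy (LIKE flights);       -- clone column definitions
INSERT INTO flights_copy SELECT * FROM flights; -- bulk insert
DROP TABLE flights;
ALTER TABLE flights_copy RENAME TO flights;
COMMIT;
```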
Update table statistics for use by the query planner.
Redshift automatically runs the ANALYZE command on tables created with "create table as", "select into", or "create temp table as". You can also set a change threshold below which ANALYZE is skipped:
set analyze_threshold_percent to 20;
With this setting, ANALYZE is skipped for any table in which fewer than 20 percent of the rows have changed.
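You can check how stale a table's statistics are via the svv_table_info system view; a stats_off value of 0 means the statistics are current:

```sql
-- Tables with the most out-of-date statistics first.
SELECT "table", stats_off
FROM svv_table_info
ORDER BY stats_off DESC;
```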
Use the following queries to see the space allocated over the nodes.
Disk space capacity (from the stv_partitions system table):
SELECT owner AS node,
       diskno,
       used,
       capacity,
       used/capacity::numeric * 100 AS percent_used
FROM stv_partitions
WHERE host = node
ORDER BY 1, 2;
How much space is taken by each table (each block is 1 MB, so the block count equals the size in MB):
SELECT name, COUNT(*) AS mb
FROM stv_blocklist
JOIN (SELECT DISTINCT name, id AS tbl FROM stv_tbl_perm) USING (tbl)
GROUP BY name
ORDER BY mb DESC;
I will discuss Redshift cluster operations and monitoring from the console in the next blog. Thanks a lot for reading; it took me a while to write this article, and I hope you find it useful.