Our AWS CloudFormation template creates an Amazon Redshift stack. Redshift is a data warehousing solution that allows you to run complex data queries on huge data sets within seconds (it’s pretty awesome). You can use it to generate reports and analyze customer data. This stack will help you get up and running with Redshift.
There are a number of ways to get your data into Redshift. In this template, we use S3 as the data source.
For simplicity, we’ll put Redshift in a VPC subnet so that you can connect directly to it without setting up a VPN or proxy (note: we don’t recommend this for production environments). For some baseline security, Redshift will be locked down to your specific IP address.
Here’s the entire Redshift template:
AWSTemplateFormatVersion: 2010-09-09
Description: Redshift Stack
Conditions:
  SingleNode: !Equals [ !Ref RedshiftNodeCount, 1 ]
Parameters:
  SubnetA:
    Type: AWS::EC2::Subnet::Id
    Description: Make sure this belongs to the VPC specified below (e.g. 172.31.0.0/20)
  SubnetB:
    Type: AWS::EC2::Subnet::Id
    Description: Make sure this is different from the subnet above (e.g. 172.31.16.0/20)
  VPCID:
    Type: AWS::EC2::VPC::Id
    Description: Select a VPC (e.g. 172.31.0.0/16)
  DataBucketName:
    Type: String
    Description: S3 data bucket name
  DatabaseName:
    Type: String
    Description: Database name
  MasterUsername:
    Type: String
    Description: Master user name for Redshift
    Default: admin
  MasterUserPassword:
    Type: String
    Description: Master password for Redshift (use mixed case and numbers)
    NoEcho: true
  DeveloperIPAddress:
    Type: String
    Description: Your public IP address (see http://checkip.dyndns.org/)
  RedshiftNodeCount:
    Type: Number
    Description: Number of Redshift nodes
    Default: 1
    MinValue: 1
    ConstraintDescription: Must be a number greater than or equal to 1
Resources:
  RedshiftCluster:
    Type: AWS::Redshift::Cluster
    Properties:
      ClusterSubnetGroupName: !Ref RedshiftClusterSubnetGroup
      ClusterType: !If [ SingleNode, single-node, multi-node ]
      NumberOfNodes: !If [ SingleNode, !Ref 'AWS::NoValue', !Ref RedshiftNodeCount ]
      DBName: !Sub ${DatabaseName}
      IamRoles:
        - !GetAtt RawDataBucketAccessRole.Arn
      MasterUserPassword: !Ref MasterUserPassword
      MasterUsername: !Ref MasterUsername
      PubliclyAccessible: true
      NodeType: dc1.large
      Port: 5439
      VpcSecurityGroupIds:
        - !Sub ${RedshiftSecurityGroup}
      PreferredMaintenanceWindow: Sun:09:15-Sun:09:45
  DataBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub ${DataBucketName}
  RawDataBucketAccessRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          -
            Effect: Allow
            Principal:
              Service:
                - redshift.amazonaws.com
            Action:
              - sts:AssumeRole
  RawDataBucketRolePolicy:
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: RawDataBucketRolePolicy
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          -
            Effect: Allow
            Action: s3:ListAllMyBuckets
            Resource: arn:aws:s3:::*
          -
            Effect: Allow
            Action:
              - 's3:Get*'
              - 's3:List*'
            Resource: '*'
          -
            Effect: Allow
            Action: cloudwatch:*
            Resource: "*"
      Roles:
        - !Ref RawDataBucketAccessRole
  RedshiftClusterSubnetGroup:
    Type: AWS::Redshift::ClusterSubnetGroup
    Properties:
      Description: Cluster subnet group
      SubnetIds:
        - !Ref SubnetA
        - !Ref SubnetB
  RedshiftSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Enable JDBC port
      VpcId: !Ref VPCID
      SecurityGroupIngress:
        -
          CidrIp: !Sub ${DeveloperIPAddress}/32
          FromPort: 5439
          ToPort: 5439
          IpProtocol: tcp
          Description: IP address for your dev machine
        -
          SourceSecurityGroupId: !Ref AccessToRedshiftSecurityGroup
          FromPort: 5439
          ToPort: 5439
          IpProtocol: tcp
          Description: Access to Redshift
  AccessToRedshiftSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Access to Redshift
      VpcId: !Ref VPCID
  InternalSecurityGroupIngress:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      IpProtocol: tcp
      FromPort: 0
      ToPort: 65535
      SourceSecurityGroupId: !Ref RedshiftSecurityGroup
      GroupId: !Ref RedshiftSecurityGroup
Outputs:
  RedshiftClusterEndpointAddress:
    Description: Redshift Cluster Endpoint Address
    Value: !GetAtt RedshiftCluster.Endpoint.Address
  RedshiftClusterEndpoint:
    Description: Redshift Cluster Endpoint
    Value:
      Fn::Join:
        - ""
        - - 'jdbc:redshift://'
          - !GetAtt RedshiftCluster.Endpoint.Address
          - ':5439/'
          - !Sub ${DatabaseName}
You can download this CloudFormation template by clicking here.
Let’s explore what’s going on here.
Creation of the Redshift cluster
The first thing we do is create the Redshift cluster.
Please note that the code snippet below is simplified for demonstration purposes and doesn’t yet match the code we provided in the overall template above. We’ll revisit and explain the additional Redshift properties in a later section.
.
.
.
Resources:
  RedshiftCluster:
    Type: AWS::Redshift::Cluster
    Properties:
      ClusterType: single-node
      NumberOfNodes: 1
      DBName: !Sub ${DatabaseName}
      MasterUserPassword: !Ref MasterUserPassword
      MasterUsername: !Ref MasterUsername
      PubliclyAccessible: true
      NodeType: dc1.large
Here are the key aspects of this code:
- “ClusterType”: This can be “single-node” or “multi-node”. For now, we hard-code “single-node”.
- “NumberOfNodes”: Since we’re using “single-node”, we set this to 1 here. (As we’ll see later, the final template actually has to omit this property for single-node clusters.)
- “DBName”: This refers to a parameter in the Parameters section called “DatabaseName”, which becomes the name of our Redshift database.
- “MasterUsername”: This is another parameter that sets the master user name.
- “MasterUserPassword”: This is also a parameter for setting the master password.
- “PubliclyAccessible”: This is set to true so that you can connect to it easily.
- “NodeType”: “dc1.large” is the least expensive node type.
As we mentioned earlier, there are a few more Redshift properties in our overall template that we’ll explain in a later section titled “More Redshift cluster properties”.
Set up S3 as a data source
Redshift can load data from different data sources. In this example, we’ll be using S3.
To set this up, we have to create an S3 bucket and an IAM role that grants Redshift access to S3. This is what the code looks like:
Resources:
  .
  .
  .
  DataBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: !Sub ${DataBucketName}
  RawDataBucketAccessRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: 2012-10-17
        Statement:
          -
            Effect: Allow
            Principal:
              Service:
                - redshift.amazonaws.com
            Action:
              - sts:AssumeRole
  RawDataBucketRolePolicy:
    Type: AWS::IAM::Policy
    Properties:
      PolicyName: RawDataBucketRolePolicy
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          -
            Effect: Allow
            Action: s3:ListAllMyBuckets
            Resource: arn:aws:s3:::*
          -
            Effect: Allow
            Action:
              - 's3:Get*'
              - 's3:List*'
            Resource: '*'
          -
            Effect: Allow
            Action: cloudwatch:*
            Resource: "*"
      Roles:
        - !Ref RawDataBucketAccessRole
Here’s a quick overview of what’s going on:
- “S3::Bucket”: The bucket name comes from a parameter called “DataBucketName”.
- “IAM::Role”: This is the IAM role that allows access to S3. It doesn’t have any permissions yet but it allows the Redshift service to assume this role.
- “IAM::Policy”: This contains a list of permissions for accessing S3 and CloudWatch. The policy attaches itself to the IAM role via its “Roles” property; the sketch below shows how Redshift uses this role when loading data.
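Once the stack is up, this role is what lets Redshift read objects out of the bucket. Here is a rough, hypothetical sketch of what a data load might look like; the table name, S3 prefix, and role ARN are placeholders, and "conn" is an open database connection to the cluster (we show one way to create it in the Outputs section below):
# Hypothetical sketch: load files from the data bucket into a Redshift table
# using the IAM role created above. The table name, S3 prefix, and role ARN
# are placeholders; "conn" is an open psycopg2 connection to the cluster.
copy_from_s3 = """
    COPY raw_events
    FROM 's3://my-data-bucket/raw/events/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-stack-RawDataBucketAccessRole-ABC123'
    FORMAT AS CSV
    GZIP;
"""

with conn.cursor() as cur:
    cur.execute(copy_from_s3)
conn.commit()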
So far, the architecture looks like this:

Create VPC and public subnets
You usually want to put databases in a private subnet, like we mentioned in our VPC template article. But in the early stages of a project, you might want direct access to Redshift from your development machine.
We don’t recommend this for production environments, but in this development case, you can start off by putting Redshift into your VPC subnet.
Resources:
  .
  .
  .
  RedshiftClusterSubnetGroup:
    Type: AWS::Redshift::ClusterSubnetGroup
    Properties:
      Description: Cluster subnet group
      SubnetIds:
        - !Ref SubnetA
        - !Ref SubnetB
We can’t put Redshift in a subnet directly, so here we put Redshift in something called a “ClusterSubnetGroup”. You can then add multiple subnets to the “ClusterSubnetGroup”. These subnets should be in different availability zones, which helps with high availability.
Subnet and VPC Parameters
The Redshift CloudFormation template doesn’t create any subnets or networks of its own. Instead, it asks you for parameters — two public subnets and a VPC.
Parameters:
  SubnetA:
    Type: AWS::EC2::Subnet::Id
    Description: Make sure this belongs to the VPC specified below (e.g. 172.31.0.0/20)
  SubnetB:
    Type: AWS::EC2::Subnet::Id
    Description: Make sure this is different from the subnet above (e.g. 172.31.16.0/20)
  VPCID:
    Type: AWS::EC2::VPC::Id
    Description: Select a VPC (e.g. 172.31.0.0/16)
You can just pick the VPC and public subnets that come by default in every region of each AWS account.
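You don’t have to launch the stack through the console, either. Here’s a hedged sketch of creating it with boto3 and supplying these parameters; the stack name, template path, and parameter values are placeholders you would replace with your own (MasterUsername and RedshiftNodeCount fall back to their defaults):
# Hypothetical sketch: launch the Redshift stack with boto3.
# Stack name, template path, and parameter values are placeholders.
import boto3

cloudformation = boto3.client("cloudformation", region_name="us-east-1")

with open("redshift-stack.yml") as f:
    template_body = f.read()

cloudformation.create_stack(
    StackName="redshift-stack",
    TemplateBody=template_body,
    Capabilities=["CAPABILITY_IAM"],  # required because the template creates an IAM role and policy
    Parameters=[
        {"ParameterKey": "SubnetA", "ParameterValue": "subnet-0aaa1111bbbb22223"},
        {"ParameterKey": "SubnetB", "ParameterValue": "subnet-0ccc3333dddd44445"},
        {"ParameterKey": "VPCID", "ParameterValue": "vpc-0eee5555ffff66667"},
        {"ParameterKey": "DataBucketName", "ParameterValue": "my-raw-data-bucket"},
        {"ParameterKey": "DatabaseName", "ParameterValue": "analytics"},
        {"ParameterKey": "MasterUserPassword", "ParameterValue": "ChangeMe1234"},
        {"ParameterKey": "DeveloperIPAddress", "ParameterValue": "203.0.113.10"},
    ],
)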
We wind up with this architecture:

Create a security group
So far, the Redshift cluster is in a public subnet. But before we can connect to it, we have to add a security group to allow port traffic to Redshift.
Resources:
  .
  .
  .
  RedshiftSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Enable JDBC port
      VpcId: !Ref VPCID
      SecurityGroupIngress:
        -
          CidrIp: !Sub ${DeveloperIPAddress}/32
          FromPort: 5439
          ToPort: 5439
          IpProtocol: tcp
          Description: IP address for your dev machine
This allows port 5439 traffic, which is the default TCP port for Redshift. This is locked down to the public IP address of your computer, which you provide via the CloudFormation parameter “DeveloperIPAddress”.
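If you want to sanity-check the rule, a quick way is to test whether the Redshift port is reachable from your machine. This is a minimal sketch; the endpoint value is a placeholder for the cluster endpoint that the stack outputs:
# Minimal reachability check for port 5439 (the endpoint below is a placeholder).
# From your whitelisted IP this should connect; from anywhere else it should time out.
import socket

endpoint = "examplecluster.cg034hpkmmjt.us-east-1.redshift.amazonaws.com"

try:
    socket.create_connection((endpoint, 5439), timeout=5).close()
    print("Port 5439 is reachable")
except OSError as exc:
    print(f"Could not reach Redshift: {exc}")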
Configure security group access
During development, you’ll want to access Redshift directly from your development machine. But eventually, you want to make calls to Redshift from an application, such as AWS Lambda.
For this, you need to create other security groups and grant these access to Redshift.
Resources:
  .
  .
  .
  RedshiftSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Enable JDBC port
      VpcId: !Ref VPCID
      SecurityGroupIngress:
        -
          CidrIp: !Sub ${DeveloperIPAddress}/32
          FromPort: 5439
          ToPort: 5439
          IpProtocol: tcp
          Description: IP address for your dev machine
        -
          SourceSecurityGroupId: !Ref AccessToRedshiftSecurityGroup
          FromPort: 5439
          ToPort: 5439
          IpProtocol: tcp
          Description: Access to Redshift
  AccessToRedshiftSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Access to Redshift
      VpcId: !Ref VPCID
  InternalSecurityGroupIngress:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      IpProtocol: tcp
      FromPort: 0
      ToPort: 65535
      SourceSecurityGroupId: !Ref RedshiftSecurityGroup
      GroupId: !Ref RedshiftSecurityGroup
This example builds off of the “RedshiftSecurityGroup” from the previous section. Here, we’re configuring two types of access:
- “AccessToRedshiftSecurityGroup”: This is an additional security group that you might assign to an application, such as AWS Lambda. We add an ingress rule to “RedshiftSecurityGroup” that allows inbound traffic on port 5439 from anything in this new group.
- “InternalSecurityGroupIngress”: This rule allows resources within the “RedshiftSecurityGroup” to talk to each other. It’s configured as a standalone ingress rule because a CloudFormation resource can’t reference itself within its own properties.
More Redshift Cluster properties
As you wrap up development, you’ll want to start thinking about deploying to production. Here are a few tweaks to the Redshift cluster that we created in the first section that might come in handy:
.
.
.
Conditions:
  SingleNode: !Equals [ !Ref RedshiftNodeCount, 1 ]
Parameters:
  .
  .
  .
  RedshiftNodeCount:
    Type: Number
    Description: Number of Redshift nodes
    Default: 1
    MinValue: 1
    ConstraintDescription: Must be a number greater than or equal to 1
.
.
.
Resources:
  .
  .
  .
  RedshiftCluster:
    Type: AWS::Redshift::Cluster
    Properties:
      ClusterSubnetGroupName: !Ref RedshiftClusterSubnetGroup
      ClusterType: !If [ SingleNode, single-node, multi-node ]
      NumberOfNodes: !If [ SingleNode, !Ref 'AWS::NoValue', !Ref RedshiftNodeCount ]
      DBName: !Sub ${DatabaseName}
      IamRoles:
        - !GetAtt RawDataBucketAccessRole.Arn
      MasterUserPassword: !Ref MasterUserPassword
      MasterUsername: !Ref MasterUsername
      PubliclyAccessible: true
      NodeType: dc1.large
      Port: 5439
      VpcSecurityGroupIds:
        - !Sub ${RedshiftSecurityGroup}
      PreferredMaintenanceWindow: Sun:09:15-Sun:09:45
We add a parameter called “RedshiftNodeCount”. This represents how many Redshift nodes you want in your cluster.
We use a condition called “SingleNode” that checks whether we have just one node. If so, we pass “single-node” to the “ClusterType” property; otherwise, we pass in “multi-node”.
The “NumberOfNodes” property gets a little tricky. If there are multiple nodes, we can just pass in the “RedshiftNodeCount”. But if there’s just a single node, we get an error if we populate “NumberOfNodes” with any information, even if it’s just the number “1”.
The way around this is to use the pseudo parameter “AWS::NoValue”. If there’s just a single node, we pass “AWS::NoValue” to “NumberOfNodes” (which has the same effect as deleting that property).
Here’s the final architecture:

In this diagram, you can access your Redshift cluster from both your development machine and an application such as AWS Lambda.
Outputs
Once you spin up a Redshift cluster, the first thing you want to do is connect to it. One useful piece of information to output would be the Redshift cluster endpoint.
.
.
.
Outputs:
  RedshiftClusterEndpointAddress:
    Description: Redshift Cluster Endpoint Address
    Value: !GetAtt RedshiftCluster.Endpoint.Address
  RedshiftClusterEndpoint:
    Description: Redshift Cluster Endpoint
    Value:
      Fn::Join:
        - ""
        - - 'jdbc:redshift://'
          - !GetAtt RedshiftCluster.Endpoint.Address
          - ':5439/'
          - !Sub ${DatabaseName}
Here we have the “RedshiftClusterEndpointAddress”, which gives you the DNS hostname of the Redshift cluster.
To make things even more convenient, we construct a JDBC URL in the following format:
jdbc:redshift://examplecluster.cg034hpkmmjt.us-east-1.redshift.amazonaws.com:5439/dbname
which you can paste into your database client software.
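If you’re connecting from Python instead of a JDBC client, the same pieces of information are all you need. Here’s a minimal sketch using psycopg2 (Redshift speaks the PostgreSQL wire protocol); the endpoint, database name, and credentials are placeholders taken from the stack outputs and parameters:
# Minimal connection sketch; host, dbname, user, and password are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="examplecluster.cg034hpkmmjt.us-east-1.redshift.amazonaws.com",  # RedshiftClusterEndpointAddress output
    port=5439,
    dbname="dbname",         # DatabaseName parameter
    user="admin",            # MasterUsername parameter
    password="ChangeMe1234", # MasterUserPassword parameter
)

with conn.cursor() as cur:
    cur.execute("SELECT current_database(), version();")
    print(cur.fetchone())

conn.close()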
Amazon Redshift Deployment and Optimization
The Problem
A global telecommunications firm that provides voice, mobile data, IoT, and other services to mobile network operators and digital companies was in need of a new data warehousing solution.
The company used the IBM Netezza data warehouse product to analyze their more than 438 terabytes (TB) of data. The product was reaching its end of life and IBM would no longer support it. Thus, the company was in search of an alternative data warehouse product.
The company wanted to see if Amazon Redshift could provide the performance they needed for their applications. The target metric they wanted to achieve was that all queries needed to run in under three (3) seconds.
The Solution
We performed an in-depth proof of concept to help the client determine whether Redshift would meet their data warehouse performance needs.
Here are the steps we took to do so.
1) Set up AWS Environment and Redshift Instance
The first thing we did was set up the overall AWS architecture to deploy the Redshift instance. We designed this architecture with security in mind from the beginning. The environment included:
- A dedicated virtual private cloud (VPC)
- Private subnets for database cluster nodes
- Public subnets for bastion hosts and publicly accessible services
- A CloudFormation template for repeatable deployments
- A standardized change control process to facilitate Infrastructure as Code

We then determined which Redshift instance would best achieve the client’s needs for the project.
We compared ds2.8xlarge and dc2.8xlarge instances.
The ds2 instances use magnetic disks, which are cheaper and can hold more data. But when we ran queries with test data, these instances did not meet the client’s goal of queries executing in under 3 seconds.
The dc2 instances use SSD, which is much more performant than ds2’s magnetic disk technology. While these are more expensive, the dc2 instances were able to exceed the client’s requirements, so they were the best option.
2) Adapt and Execute the Data Definition Language (DDL)
After we set up the AWS environment, our next step was to modify the Netezza DDL exports and translate them into syntax that Redshift could use.
This entailed modifying the distribution keys, Datetime default formatting, and other syntax.
After these changes were made, we ran the DDL on Redshift to build the data warehouse tables.
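To give a feel for what that translation looked like, here’s a simplified, hypothetical sketch of one adapted fact-table definition. The column names and distribution key are invented for illustration (only the table name comes from the engagement), and "conn" is an open psycopg2 connection to the cluster:
# Hypothetical, simplified example of Netezza DDL adapted for Redshift:
# distribution clauses become DISTKEY/DISTSTYLE, and datetime defaults use
# Redshift syntax. Column names are invented for illustration.
adapted_ddl = """
    CREATE TABLE fact_xdr_detail (
        request_timestamp  TIMESTAMP    NOT NULL,
        imsi               VARCHAR(32),
        bytes_up           BIGINT,
        bytes_down         BIGINT,
        load_ts            TIMESTAMP    DEFAULT GETDATE()
    )
    DISTKEY (imsi);
"""

with conn.cursor() as cur:  # conn: an open psycopg2 connection to the cluster
    cur.execute(adapted_ddl)
conn.commit()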
3) Export Netezza Data and Transfer it to AWS S3
The next step was to export sample data from the Netezza data warehouse and move it to S3.
We received over 6,500 files that amounted to approximately 1.5 terabytes (TB) of compressed data. This data represented six (6) months of GPRS roaming exchange data.
4) Load Data from S3 to Redshift
The data in S3 could now be loaded into Redshift.
We leveraged the Redshift COPY command to load the data in parallel, and temporarily scaled up the number of nodes to reduce the data load time.
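The resize itself can be scripted. Here’s a hedged sketch using boto3; the cluster identifier and node counts are placeholders, and note that a classic resize temporarily puts the cluster into read-only mode:
# Hypothetical sketch: temporarily scale the cluster up for the bulk load,
# then scale it back down. Cluster identifier and node counts are placeholders.
import boto3

redshift = boto3.client("redshift", region_name="us-east-1")

# Scale up before running the COPY commands so the load is spread across more slices.
redshift.modify_cluster(
    ClusterIdentifier="example-cluster",
    ClusterType="multi-node",
    NumberOfNodes=8,
)

# ...wait for the resize to finish, then run the COPY commands...

# Scale back down once the load completes.
redshift.modify_cluster(
    ClusterIdentifier="example-cluster",
    ClusterType="multi-node",
    NumberOfNodes=2,
)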
Here is a snapshot of the amount of data we worked with:

5) Run Queries to Obtain Baseline Benchmarks
Once the data was loaded into Redshift, we ran initial queries to determine baseline performance of the data warehouse before any optimizations were made.
6) Edit Tables to Optimize Query Response Time
After determining the query benchmarks, we optimized the table parameters to make the queries run faster.
We analyzed the queries for ideal sort keys, distribution keys, and compression.
Some of the tactics we used included:
For the Dimension Tables:
- Changed the Distribution Style to ALL. The Dimension tables were small enough to justify putting duplicate copies of each table on every single Redshift node. This improved performance by eliminating the network overhead of a single node needing to fetch dimension data from another node during a query.
For the Fact Tables:
- Changed the Sort Key to match query conditions:
  - fact_xdr_daily.request_time_id
  - fact_xdr_detail.request_timestamp
  - fact_xdr_hourly.request_time_hour_id
- Changed the Distribution Style to EVEN
- Ran ANALYZE COMPRESSION to get compression estimates
We then recreated the optimized tables, copied the data from the original tables to the optimized tables, and reran the queries to compare the performance.
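Here’s a hedged sketch of what that “recreate and copy” step can look like. The column definitions and the dimension table name are invented for illustration (only the fact table name, sort key, and distribution styles come from the work described above), "conn" is an open psycopg2 connection, and ANALYZE COMPRESSION is run separately because it can’t execute inside a transaction:
# Hypothetical sketch: recreate a dimension table with DISTSTYLE ALL and a fact
# table with EVEN distribution plus a sort key, then deep-copy the existing rows.
# Column definitions and the dimension table name are invented for illustration.
statements = [
    # Dimension table: small enough to keep a full copy on every node.
    """
    CREATE TABLE dim_operator_new (
        operator_id    INTEGER       NOT NULL,
        operator_name  VARCHAR(128)
    )
    DISTSTYLE ALL;
    """,
    # Fact table: spread rows evenly, sort on the column the queries filter by.
    """
    CREATE TABLE fact_xdr_daily_new (
        request_time_id  BIGINT      NOT NULL,
        operator_id      INTEGER,
        bytes_up         BIGINT,
        bytes_down       BIGINT
    )
    DISTSTYLE EVEN
    SORTKEY (request_time_id);
    """,
    # Deep copy from the original tables into the optimized versions.
    "INSERT INTO dim_operator_new SELECT * FROM dim_operator;",
    "INSERT INTO fact_xdr_daily_new SELECT * FROM fact_xdr_daily;",
]

with conn.cursor() as cur:  # conn: an open psycopg2 connection to the cluster
    for sql in statements:
        cur.execute(sql)
conn.commit()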
7) Inflate Data from 6 Months to 36 Months
In order to better model the client’s real-world conditions, we inflated the 6-month daily data set, fact_xdr_daily, to represent approximately 36 months of data.
The data set was expanded from 1.6 TB to 13.2 TB, and we reran the queries to check performance.
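The case study doesn’t spell out how the inflation was done, but one straightforward approach is to re-insert the original six months of rows several times with the date column shifted forward, which is roughly what the sketch below does. Table and column names are hypothetical placeholders, and "conn" is an open psycopg2 connection:
# Hypothetical sketch: expand ~6 months of rows to ~36 months by inserting
# shifted copies of the original window. Column names are placeholders.
with conn.cursor() as cur:
    for offset_months in (6, 12, 18, 24, 30):
        cur.execute(
            """
            INSERT INTO fact_xdr_daily (request_date, imsi, bytes_up, bytes_down)
            SELECT DATEADD(month, %s, request_date), imsi, bytes_up, bytes_down
            FROM fact_xdr_daily
            WHERE request_date < DATEADD(month, 6, (SELECT MIN(request_date) FROM fact_xdr_daily))
            """,
            (offset_months,),
        )
conn.commit()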
Results
The results of the queries were as follows:

We achieved significant improvement on all but one of the queries.
Even with the 36-month data set, two of the daily queries (xdr_daily_1 and xdr_daily_2) ran faster than they had against the original 6-month data set. This is because once the proper sort keys are identified, Redshift can immediately zero in on the subset of data a query needs, regardless of how much total data is in the table.
In the case of the xdr_hourly_2 query where an improvement was not achieved, we surmise that this was a very complex query that just didn’t respond to our optimizations. Regardless, the query execution time met the 3-second goal that the client set.
Conclusion
We helped the client determine that Redshift is certainly capable of running queries that match or exceed the performance of their old Netezza data warehouse.
As a result, the client is planning on moving ahead with a full migration to Redshift for all of their data warehousing needs.
Conclusion
Redshift is a really powerful data warehousing tool that makes it fast and simple to analyze your data and glean insights that can help your business. This CloudFormation template will help you automate the deployment of Redshift and get you up and running quickly.
Overall, there’s so much that you can do with CloudFormation and it’s difficult to review every little detail. But we hope that walking through these templates gives you a better idea of the power of CloudFormation and how you can use it to manage your AWS deployments.
Let us know if you have any questions in the comments below; we’d love to hear your thoughts.