AWS Big Data Blog

How to enable cross-account Amazon Redshift COPY and Redshift Spectrum query for AWS KMS–encrypted data in Amazon S3

This post shows a step-by-step walkthrough of how to set up a cross-account Amazon Redshift COPY and Spectrum query using a sample dataset in Amazon S3. The sample dataset is encrypted at rest using AWS KMS-managed keys (SSE-KMS).

About AWS Key Management Service (AWS KMS)

With AWS Key Management Service (AWS KMS), you can have centralized control over the encryption keys used to protect your data at rest. You can create, import, rotate, disable, delete, define usage policies, and audit the use of encryption keys used to encrypt your data. AWS KMS uses FIPS 140-2 validated cryptographic modules to protect the confidentiality and integrity of your master keys.

AWS KMS is seamlessly integrated with most AWS services. This integration means that you can easily use customer master keys (CMKs) to control the encryption of the data you store within these services. When deciding to encrypt data in a service such as Amazon Redshift, you can choose to use an AWS-managed CMK that Amazon Redshift automatically creates in KMS. You can track the usage of the key, but it’s managed by the service on your behalf. In some cases, you might need direct control over the lifecycle of a CMK or want to allow other accounts to use it. In these cases, you can create and manage your own CMK that AWS services such as Amazon Redshift can use on your behalf. These customer-managed CMKs enable you to have full control over the access permissions that determine who can use the key and under which conditions. AWS KMS is integrated with AWS CloudTrail, a service that provides a record of actions performed by a user, role, or AWS service in AWS KMS.

About Amazon Redshift and Redshift Spectrum

Amazon Redshift is a petabyte scale, fully managed data warehouse service on AWS. It uses a distributed, massively parallel processing (MPP), shared-nothing architecture that scales horizontally to meet usage requirements.

Amazon Redshift Spectrum is a feature of Amazon Redshift that extends the analytic power of Amazon Redshift beyond the data that is stored on local disks in the data warehouse. In other words, Amazon Redshift Spectrum enables you to use the same ANSI SQL syntax of Amazon Redshift on the data that is stored in an Amazon S3 data lake. You do so using external tables, without having to ingest the data into Amazon Redshift first. A common pattern is to run queries that span both the frequently accessed “hot” data stored locally in Amazon Redshift and the “warm/cold” data stored cost-effectively in Amazon S3. That pattern separates compute and storage by enabling independent scaling of both to match the use case. This means you don’t have to pay for unused compute capacity just to add more storage. More importantly, this approach enables seamless interoperability between your data lake and Amazon Redshift.

The Amazon Redshift COPY command supports the following types of Amazon S3 encryption:

  • Server-side encryption with Amazon S3-managed keys (SSE-S3)
  • Server-side encryption with AWS KMS-managed keys (SSE-KMS)
  • Client-side encryption using a client-side symmetric master key

The Amazon Redshift COPY command doesn’t support the following types of Amazon S3 encryption:

  • Server-side encryption with customer-provided keys (SSE-C)
  • Client-side encryption using an AWS KMS–managed customer master key
  • Client-side encryption using a customer-provided asymmetric master key

About the use case

A multiple-account AWS environment is a common pattern across our customers for a variety of reasons. One of the common reasons for data lake customers in AWS is to separate ownership of data assets from different business units in the company. At the same time, business units might need to grant access to some of their data assets to each other for new business insights.

As illustrated in the following drawing, in our example Account A owns an S3 bucket with SSE-KMS encrypted data and Account B owns an Amazon Redshift cluster with Redshift Spectrum enabled. Account B needs access to the same data to load to the Amazon Redshift cluster using the COPY command and also to query using Redshift Spectrum.

Solution walkthrough

Following, we walk through a couple different options to support this use case.

Prerequisites

The solution assumes that you already have the following set up:

    1. Access to two AWS accounts (we call them Account A and B) in the same AWS Region.*
    2. Grant the AdministratorAccess policy to the AWS accounts (which should be restricted further for production).
    3. Account A has a customer-managed CMK in AWS KMS with the following attributes:
      • Alias as kms_key_account_a
      • Description as Cross Account KMS Key in Account A
      • Administrator as current IAM user using which you signed in to the AWS console and created the KMS key
      • Account B added as External Accounts

      Copy and save the CMK Amazon Resource Name (ARN) to be used shortly

    4. Account A uses the following sample dataset from AWS:
      Customer - s3://awssampledbuswest2/ssbgz/customer0002_part_00.gz
    5. Account A has an S3 bucket called rs-xacct-kms-bucket with bucket encryption option set to AWS KMS using the KMS key kms_key_account_a created earlier.
    6. Use the following AWS CLI command to copy the customer table data from AWS sample dataset SSB – Sample Schema Benchmark, found in the Amazon Redshift documentation.Note: Because bucket names are global across all AWS customers, you need a unique bucket name for your test run. Be sure to replace rs-xacct-kms-bucket with your own bucket name in the following command:
      aws s3 cp s3://awssampledbuswest2/ssbgz/ s3://rs-xacct-kms-bucket/customer/ --recursive --exclude '*' --include 'customer*'
    7. After the copy is complete, check the KMS key ID for the file from S3 console, as shown following.
    8. Account B has an Amazon Redshift cluster:
      • The cluster name is rstest
      • It’s publicly accessible
      • It has an IAM role attached called redshift_role_account_b with the following two managed IAM policies:
        • AmazonS3ReadOnlyAccess
        • AWSGlueConsoleFullAccess

            Note: Be sure to update redshift_role_account_b with your own IAM role.

            You can set up a database session successfully from a client tool, such as SQL Workbench from your laptop.

* This walkthrough uses a publicly available AWS sample dataset from the US-West-2 (Oregon) Region. Hence, we recommend that you use the US-West-2 (Oregon) Region for your test run to reduce cross-region network latency and cost due to data movement.

Step-by-step walkthrough

Depending on which account’s AWS Glue Data Catalog you want to use for Redshift Spectrum, there are two solution options to choose from:

  1. AWS Glue Data Catalog in Account B
  2. AWS Glue Data Catalog in Account A

Option 1: AWS Glue Data Catalog in Account B

Set up permissions

  1. Sign in to Account A’s AWS console. Then, change the AWS Region to us-west-2 (Oregon). Add the following bucket policy for the rs-xacct-kms-bucket bucket so that Account B (which owns the Amazon Redshift cluster – rstest) can access the bucket.

Note: Replace <Account B> with AWS Account ID for Account B and rs-xacct-kms-bucket with your bucket name.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowS3",
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::<Account B>:root"
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::rs-xacct-kms-bucket/*",
                "arn:aws:s3:::rs-xacct-kms-bucket"
            ]
        }
    ]
}
    1. Sign in to Account B’s AWS console. Then, change the AWS Region to us-west-2 (Oregon). Create IAM policies and roles as described following:

a) Create the following two IAM permission policies: rs_xacct_bucket_policy to give Account B access to the S3 bucket in Account A, and rs_xacct_kms_policy to give Account B access to the CMK in Account A.

Policy name: rs_xacct_kms_policy

Note: Replace <ARN of kms_key_account_a from Account A> with your KMS key ARN from Account A.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowUseOfTheKey",
            "Effect": "Allow",
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*",
                "kms:DescribeKey"
            ],
            "Resource": [
                "<ARN of kms_key_account_a from Account A>"
            ]
        },
        {
            "Sid": "AllowAttachmentOfPersistentResources",
            "Effect": "Allow",
            "Action": [
                "kms:CreateGrant",
                "kms:ListGrants",
                "kms:RevokeGrant"
            ],
            "Resource": [
                "<ARN of kms_key_account_a from Account A>"
            ],
            "Condition": {
                "Bool": {
                    "kms:GrantIsForAWSResource": true
                }
            }
        }
    ]
}

Policy name: rs_xacct_bucket_policy

Note: Replace rs-xacct-kms-bucket with your bucket name.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowS3",
            "Effect": "Allow",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::rs-xacct-kms-bucket/*",
                "arn:aws:s3:::rs-xacct-kms-bucket"
            ]
        }
    ]
}

b) Create a new IAM role called xacct_kms_role_account_b for the Amazon Redshift service with the following IAM policies attached:

rs_xacct_bucket_policy
rs_xacct_kms_policy
AWSGlueConsoleFullAccess

Save the Amazon Resource Name (ARN) of the IAM role. You’ll use it soon.

c) Now let’s set up the IAM role chaining for Amazon Redshift between the two IAM roles, redshift_role_account_b and xacct_kms_role_account_b.

To chain roles, you establish a trust relationship between the roles. A role that assumes another role (for example, Role A) must have a permission policy that allows it to assume the next chained role (for example, Role B). Similarly, the role that passes permissions (Role B) must have a trust policy that allows it to pass its permissions to the previous chained role (Role A).

The first role in the chain must be a role attached to the Amazon Redshift cluster. The first role and each subsequent role that assumes the next role in the chain must have a policy that includes a specific statement. This statement has the Allow effect on the sts:AssumeRole action and the ARN of the next role in a Resource element.

In our example, Role A is redshift_role_account_b, which needs the permission policy rs_xacct_assume_role_policy, which  allows it to assume Role B (which is xacct_kms_role_account_b). Both IAM roles are owned by AWS Account B.

d) Let’s create the IAM permission policy rs_xacct_assume_role_policy and attach the policy to the IAM role redshift_role_account_b.

Policy name: rs_xacct_assume_role_policy

Note: Replace <ARN for IAM role xacct_kms_role_account_b from Account B>.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1487639602000",
            "Effect": "Allow",
            "Action": [
                "sts:AssumeRole"
            ],
            "Resource": [
"<ARN for IAM role xacct_kms_role_account_b from Account B>"
            ]
        }
    ]
}

e) Change the trust relationship for IAM role xacct_kms_role_account_b by choosing Edit trust relationship and replacing the existing trust policy with the following:

Note: Replace <Account B> with the AWS Account ID for Account B.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "redshift.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<Account B>:root"
      },
      "Action": "sts:AssumeRole"
    }
  ]
} 

f) Create an AWS Glue service IAM role called glue_service_role_account_b with the following policies attached:

• AWSGlueServiceRole (AWS managed policy)
• rs_xacct_bucket_policy (managed policy created earlier)
• rs_xacct_kms_policy (managed policy created earlier)

Note: Be sure to update glue_service_role_account_b with your own IAM role.

Perform the Amazon Redshift COPY

  1. Log in to the Amazon Redshift cluster from your query tool and create the customer table using the DDL following.
CREATE TABLE customer 
(
  c_custkey      INTEGER NOT NULL,
  c_name         VARCHAR(25) NOT NULL,
  c_address      VARCHAR(25) NOT NULL,
  c_city         VARCHAR(10) NOT NULL,
  c_nation       VARCHAR(15) NOT NULL,
  c_region       VARCHAR(12) NOT NULL,
  c_phone        VARCHAR(15) NOT NULL,
  c_mktsegment   VARCHAR(10) NOT NULL
);

2. Now you can run the COPY statement following successfully.

copy customer from 's3://rs-xacct-kms-bucket/customer/' 
iam_role '<IAM role ARN of redshift_role_account_b,IAM role ARN of xacct_kms_role_account_b>'
gzip
region 'us-west-2';

Note: Replace the IAM role ARNs from Account B separated by a comma without any spaces around it.

3. Run the following sample query to verify that the data was loaded successfully.

select * from customer limit 10;

Set up an AWS Glue Data Catalog table for Redshift Spectrum to query

Let’s now create an AWS Glue crawler in Account B to crawl the same customer data and create a table called customer in the AWS Glue Data Catalog database spectrumdb_account_b following these steps:

  1. Navigate to Databases on the AWS Glue console and choose Add database to create an AWS Glue Data Catalog database called spectrumdb_account_b, as shown following.

  1. Navigate to Crawlers on the AWS Glue console and choose Add crawler, as shown following.

  1. Create a crawler customerxacct, as shown following.

Note: The Crawler job name (customerxacct in this case) is not same as the table name created by the crawler (a common confusion). The table name is picked up automatically from the prefix and folder name from your S3 bucket and folder structure. You also have an option to attach a table name prefix if you want to.              

  1. Choose Next to enter Data store details of the customer table, as following.

  1. Choose Next to get to the Add another data store We leave the default, No, because we don’t have any other data stores to add.

  1. Choose Next to choose the IAM role created earlier, glue_service_role_account_b, for the crawler to use, as shown following.

  1. Choose Next to go to the Schedule page and choose the schedule that you want this crawler job to run. For this example, we can choose Run on demand.

  1. Choose Next to choose the AWS Glue Data Catalog database spectrumdb_account_b (created earlier by create external schema command) as the crawler output location.

  1. Choose Next to get to the review page.

  1. After reviewing the details, choose Finish to finish creating the crawler.

  1. Now, let’s run the crawler job by selecting the job as following and choosing Run crawler.

  1. Wait and watch for the job to complete. Its status changes from Starting to Stopping to Ready. You can choose the refresh button for the latest status.

  1. If the job fails, the failure is recorded in Amazon CloudWatch logs. To view the logs, choose Logs, shown in the screenshot preceding, which takes you to the CloudWatch logs.
  1. Now, let’s go to the AWS Glue Data Catalog database to make sure that the table exists.

Choose Databases, choose the spectrumdb_account_b database, and then choose View Tables, or choose the hyperlink of the database name. You should see the customer table, as shown following.

  1. Choose the customer hyperlink to get to the external table, details following.

Because the data file didn’t have a header record, the AWS Glue crawler has assigned a default column naming convention as shown preceding. For the customer table, this naming is column 0 to column 7

  1. Choose Edit Schema and assign appropriate column names, as per the mapping following.

c0 => c_custkey

c1 => c_name

c2 => c_address

c3 => c_city

c4 => c_nation

c5 => c_region

c6 => c_phone

c7 => c_mktsegment

When you are done, choose Save.

Perform the Redshift Spectrum query

Now that the customer table is created in AWS Glue Data Catalog, let’s query the table using Redshift Spectrum.

  1. Log in to the Amazon Redshift cluster from your query tool.
  2. Run the statements following to create an external schema called spectrumxacct for Redshift Spectrum pointing to the AWS Glue Data Catalog database. This database is spectrumdb_account_b in Account B, already created on the AWS Glue console.
    drop schema if exists spectrumxacct;
    create external schema spectrumxacct
    from data catalog 
    database 'spectrumdb_account_b'
    iam_role '<IAM role ARN of redshift_role_account_b,IAM role ARN of xacct_kms_role_account_b>'
    create external database if not exists;
    

    Note: Replace the IAM role ARNs from Account B separated by a comma without any spaces around it.

  3. Run the following sample query to verify that Redshift Spectrum can query the data successfully.
    select * from spectrumxacct.customer limit 10;

Note: Redshift Spectrum uses the AWS Glue Data Catalog in Account B, not Account A.

Option 2: AWS Glue Data Catalog in Account A

 

Set up permissions

1. Sign in to the Account A AWS console, then change the AWS Region to us-west-2 (Oregon).

    • a) Create the following IAM policies:

• rs-xacct-bucket-policy to give access to the S3 bucket in Account A
• rs_xacct_kms_policy to give access to the CMK in Account A

Policy name: rs_xacct_bucket_policy

Note: Replace the bucket name rs-xacct-kms-bucket with your bucket name.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowS3",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::rs-xacct-kms-bucket/*",
                "arn:aws:s3:::rs-xacct-kms-bucket"
            ]
        }
    ]
}

Policy name: rs_xacct_kms_policy

Note: Replace <ARN of kms_key_account_a from Account A> with your KMS key ARN from Account A.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowUseOfTheKey",
            "Effect": "Allow",
            "Action": [
                "kms:Encrypt",
                "kms:Decrypt",
                "kms:ReEncrypt*",
                "kms:GenerateDataKey*",
                "kms:DescribeKey"
            ],
            "Resource": [
                "<ARN of kms_key_account_a from Account A>"
            ]
        },
        {
            "Sid": "AllowAttachmentOfPersistentResources",
            "Effect": "Allow",
            "Action": [
                "kms:CreateGrant",
                "kms:ListGrants",
                "kms:RevokeGrant"
            ],
            "Resource": [
                "<ARN of kms_key_account_a from Account A>"
            ],
            "Condition": {
                "Bool": {
                    "kms:GrantIsForAWSResource": true
                }
            }
        }
    ]
}

b) Create a new IAM role called xacct_kms_role_account_a for the Amazon Redshift service with the following IAM policies:

rs_xacct_bucket_policy
rs_xacct_kms_policy
AWSGlueConsoleFullAccess (this managed policy provides the required permissions for the AWS Glue Data Catalog)

Save the IAM role ARN to be used shortly.

c) Change the trust relationship for the IAM role xacct_kms_role_account_a by choosing Edit trust relationship and replacing the existing trust policy with the following:

Note: Replace <Account B> with the AWS account ID for Account B.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "redshift.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    },
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<Account B>:root"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

d) Create an AWS Glue service IAM role called glue_service_role_account_a with the following policies attached:

AWSGlueServiceRole (AWS managed policy)
rs_xacct_bucket_policy (managed policy created earlier)
rs_xacct_kms_policy (managed policy created earlier)

Note: Be sure to update glue_service_role_account_a with your own IAM role

2. Sign in to Account B’s AWS console and change the AWS Region to us-west-2 (Oregon) if it’s not already selected.

a) Modify the existing IAM policy rs_xacct_assume_role_policy and replace the existing JSON policy with the following:

 Note: Replace <ARN for IAM role xacct_kms_role_account_a from Account A>.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1487639602000",
            "Effect": "Allow",
            "Action": [
                "sts:AssumeRole"
            ],
            "Resource": [
"<ARN for IAM role xacct_kms_role_account_a from Account A>"
            ]
        }
    ]
}

Perform the Amazon Redshift COPY

1. Log in to the Amazon Redshift cluster from your query tool and create the customer table using the DDL following.

CREATE TABLE customer 
(
  c_custkey      INTEGER NOT NULL,
  c_name         VARCHAR(25) NOT NULL,
  c_address      VARCHAR(25) NOT NULL,
  c_city         VARCHAR(10) NOT NULL,
  c_nation       VARCHAR(15) NOT NULL,
  c_region       VARCHAR(12) NOT NULL,
  c_phone        VARCHAR(15) NOT NULL,
  c_mktsegment   VARCHAR(10) NOT NULL
);

2. Now you should be able to run the COPY statement following successfully.

copy customer from 's3://rs-xacct-kms-bucket/customer/' 
iam_role '<ARN for IAM role redshift_role_account_b from Account B,<ARN for IAM role xacct_kms_role_account_a from Account A>'
gzip
region 'us-west-2';

Note: Replace the IAM role ARNs separated by a comma without any spaces around it.

3. Run the sample query following to validate that the data was loaded successfully.

select * from customer limit 10;

Set up AWS Glue Data Catalog table for Redshift Spectrum to query

Let’s now create an AWS Glue crawler in Account A to crawl the same customer data and create a table called customer in the AWS Glue Data Catalog database spectrumdb_account_a in Account A following these steps:

Follow the same steps as outlined in Option 1 to create and run a crawler with the following changes:

  1. This time, create the crawler in Account A (as opposed to Account B for Option 1).
  2. Create an AWS Glue Data Catalog database spectrumdb_account_a in Account A (as opposed to spectrumdb_account_b in Account B), and choose that database for crawler to create the customer table.
  3. While providing S3 path, choose the option Specified path in my account (unlike Specified path in another account chosen for Option 1).
  4. Make sure to use glue_service_role_account_a created earlier as the AWS Glue service IAM role.=

Perform the Redshift Spectrum query

Now that the customer table is created in the AWS Glue Data Catalog, let’s query the table using Redshift Spectrum.

1. Log in to the Amazon Redshift cluster from your query tool and run the statements following. These create an external schema called spectrumxacct2 for Redshift Spectrum pointing to the AWS Glue Data Catalog database spectrumdb_account_a (created earlier from AWS Glue console) in Account A.

drop schema if exists spectrumxacct2;
create external schema spectrumxacct2
from data catalog 
database 'spectrumdb_account_a' 
iam_role '<ARN for IAM role redshift_role_account_b from Account B,<ARN for IAM role xacct_kms_role_account_a from Account A>'
create external database if not exists;

Note: Replace the IAM role ARNs separated by a comma without any spaces around it.

2. Run the following query, which should run successfully.

select * from spectrumxacct2.customer limit 10;

Note: Spectrum uses the AWS Glue Data Catalog in Account A, not Account B.

Summary

This post shows a step-by-step walkthrough of how to set up a cross-account Amazon Redshift COPY and query using Redshift Spectrum for a sample KMS encrypted dataset in Amazon S3. It demonstrates two solution options to choose from depending on which account’s AWS Glue Catalog you want to use for Redshift Spectrum.

If you have questions or suggestions, please leave a comment.

 


About the Author

Asim Kumar Sasmal is a Sr. Data Architect – IoT in the Global Specialty Practice of AWS Professional Services. He helps AWS customers around the globe to design and build data driven solutions by providing expert technical consulting, best practices guidance, and implementation services on AWS platform. He is passionate about working backwards from customer ask, help them to think big, and dive deep to solve real business problems by leveraging the power of AWS platform.