AWS Big Data Blog

Restrict access to your AWS Glue Data Catalog with resource-level IAM permissions and resource-based policies

A data lake provides a centralized repository that you can use to store all your structured and unstructured data at any scale. A data lake can include both raw datasets and curated, query-optimized datasets. Raw datasets can be quickly ingested, in their original form, without having to force-fit them into a predefined schema. Using data lakes, you can run different types of analytics on both raw and curated datasets. By using Amazon S3 as the storage layer of your data lakes, you can have a set of rich controls at both the bucket and object level. You can use these to define access control policies for the datasets in your lake.

The AWS Glue Data Catalog is a persistent, fully managed metadata store for your data lake on AWS. Using the Glue Data Catalog, you can store, annotate, and share metadata in the AWS Cloud in the same way you do in an Apache Hive Metastore. The Glue Data Catalog also has seamless out-of-box integration with Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.

Using AWS Glue, you can also create policies that restrict access to different portions of the catalog. These policies can be based on users or roles, or they can be applied at the resource level. With these policies, you can provide granular control over which users can access the various metadata definitions in your data lake.

Important: The S3 and the AWS Glue Data Catalog policies define the access permissions to the data and metadata definitions respectively. In other words, the AWS Glue Data Catalog policies define the access to the metadata, and the S3 policies define the access to the content itself.

Using identity-based (IAM) policies, you can restrict which metadata operations can be performed, such as GetDatabases, GetTables, CreateTable, and others. You can also restrict which Data Catalog objects those operations are performed on. Additionally, you can limit which catalog objects get returned in the resulting call. A Glue Data Catalog “object” here refers to a database, a table, a user-defined function, or a connection stored in the Glue Data Catalog.

Suppose that you have users that require read access to your production databases and tables in your data lake, and others have additional permissions to dev resources. Suppose also that you have a data lake storing both raw data feeds and curated datasets used by business intelligence, analytics, and machine learning applications. You can set these configurations easily, and many others, using the access control mechanisms in the AWS Glue Data Catalog.

Note: The following example shows how to set up a policy on the AWS Glue Data Catalog. It doesn’t set up the related S3 bucket or object level policies. This means the restricted metadata isn’t discoverable when using Athena, EMR, and tools integrating with the AWS Glue Data Catalog, but the underlying data is not protected by the catalog policy. When someone tries to access an S3 object directly, only S3 policy enforcement applies. You should use Data Catalog and S3 bucket or object level policies together.

Fine-grained access control

You can define the access to the metadata using both resource-based and identity-based policies, depending on your organization’s needs. Resource-based policies list the principals that are allowed or denied access to your resources, allowing you to set up policies such as cross-account access. Identity-based policies are attached to users, groups, and roles within IAM.

The fine-grained access portion of the policy is defined within the Resource clause. This portion defines both the AWS Glue Data Catalog object that the action can be performed on, and what resulting objects get returned by that operation.

Let’s run through an example. Suppose that you want to define a policy that allows a set of users to access only the finegrainaccess database. The policy also allows users to return all the tables listed within the database. For the GetTables action, the resource definitions include these resource statements:

"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/finegrainaccess",
"arn:aws:glue:us-east-1:123456789012:table/finegrainaccess/*"

The catalog ARN and the database Amazon Resource Name (ARN) allow the user to call the operation on the finegrainaccess database. The table ARN allows all the tables within that database to be returned.
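To see where these three ARNs sit inside a complete policy document, here is a minimal Python sketch that assembles one. The region, account ID, and database name are the placeholder values from the ARNs above, and the helper names (`glue_arn`, `get_tables_statement`) are hypothetical. It is an illustration of the structure, not the policy used by the example template.

```python
import json

# Placeholder values copied from the ARNs in the text; replace with your own.
REGION = "us-east-1"
ACCOUNT_ID = "123456789012"
DATABASE = "finegrainaccess"


def glue_arn(resource: str) -> str:
    """Build a Glue Data Catalog ARN for the given resource path."""
    return f"arn:aws:glue:{REGION}:{ACCOUNT_ID}:{resource}"


def get_tables_statement(database: str, table_pattern: str = "*") -> dict:
    """Return an IAM statement that allows glue:GetTables on one database."""
    return {
        "Effect": "Allow",
        "Action": ["glue:GetTables"],
        "Resource": [
            glue_arn("catalog"),
            glue_arn(f"database/{database}"),
            glue_arn(f"table/{database}/{table_pattern}"),
        ],
    }


policy = {
    "Version": "2012-10-17",
    "Statement": [get_tables_statement(DATABASE)],
}

# Print the policy as JSON, ready to paste into an inline IAM policy.
print(json.dumps(policy, indent=2))
```

Passing a different `table_pattern` is the only change needed to narrow which tables the statement covers.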

Now, what if we want to return only the tables that start with “dev_” from the “finegrainaccess” database? If so, this is how the resource list changes:

"arn:aws:glue:us-east-1:123456789012:catalog",
"arn:aws:glue:us-east-1:123456789012:database/finegrainaccess",
"arn:aws:glue:us-east-1:123456789012:table/finegrainaccess/dev_*"

Now, we are specifying dev_ as part of the table’s ARN in the last resource definition. This approach also works with actions for getting the list of databases, partitions for a table, connections, and other operations in the catalog.
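The effect of the `dev_*` wildcard can be approximated locally. The sketch below uses Python's `fnmatchcase` as a stand-in for IAM wildcard matching, which is an assumption: `fnmatch` also gives special meaning to square brackets, so the two agree only for simple `*` and `?` patterns. The table names are hypothetical.

```python
from fnmatch import fnmatchcase


def visible_tables(table_names, pattern):
    """Return the table names that match a simple wildcard pattern."""
    return [name for name in table_names if fnmatchcase(name, pattern)]


# Hypothetical table names in the finegrainaccess database.
tables = ["dev_trips", "dev_zones", "prod_trips", "staging_yellow"]

print(visible_tables(tables, "dev_*"))  # ['dev_trips', 'dev_zones']
```

This only models which names a pattern selects. The actual filtering is done by AWS when the GetTables call is evaluated against the policy.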

Taking it for a spin

Note: This post focuses on the policies for AWS Glue Data Catalog. If you look closely, all of these datasets are pointing to the same S3 locations, which are world-readable. In a full example, you should also set the necessary S3 bucket or object level permissions, or both.

Next, we show an example you can do yourself. The next example creates the following in a Data Catalog.

We set up two users in the example, as shown following.

In the AWS Management Console, launch the AWS CloudFormation template.

Choose Next.

Important: Enter a password for the IAM users to be created. These users will have permissions to run Athena queries, access your Athena S3 results bucket, and see the AWS Glue databases and tables that the CloudFormation script creates. The password must meet the minimum requirements of the IAM password policy on the account that you run this example from.

Choose Next, and then on the next page choose Next again.

Lastly, acknowledge that the template will create IAM users and policies.

Then choose Create.

When you refresh your CloudFormation page, you can see the script creating the example resources.

Wait until it’s complete.

This script creates the necessary IAM users and the policies attached to them, along with the necessary databases and tables listed preceding.

After the CloudFormation script completes, you should see these tables if using an administrator user.

If you look on the Outputs tab, you can see the two IAM users that were created along with your IAM sign-in URL.

Note:

If you click the sign-in link in the same browser, the system logs you out. A nice trick is to right-click and open a private or incognito window.

If the provided IAM password doesn’t meet the minimum requirements, you see this message in the CloudFormation script event log:

The specified password is invalid … <why it was invalid>

Looking in the AWS Glue Data Catalog, you can see the tables that just got created by the script.

We can see the script created the structure that we outlined preceding.

Let’s check the two user profiles. If you open the IAM console and choose Users, you can see that the permissions for each user are attached as inline policies. You should see the following for each user.

For the AWS Glue dev user, this section gives us full access to anything in the dev databases:

This section gives us the ability to query and see the prod database:

Lastly, this section gives us access to get tables and partitions from the prod database. You can structure this section so that it explicitly lists the blog_prod database in the resource and only allow that. The following lets someone query for database/* and return only the blog_prod tables. This, in fact, is the default behavior of the console.

Without this, you could still query those two databases explicitly, but the policy would not allow a wildcard query such as the following.
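Stepping back to the three dev-user sections described above, the following Python sketch shows one way such an inline policy could be laid out. It is an assumption-laden reconstruction, not the policy from the CloudFormation template: the region and account ID are placeholders, and the exact action lists are guesses based on the operations named in this post.

```python
import json

# Placeholder region and account ID; substitute your own.
ARN_PREFIX = "arn:aws:glue:us-east-1:123456789012"

dev_user_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Section 1: full Glue access to the dev database and its tables.
            "Effect": "Allow",
            "Action": ["glue:*"],
            "Resource": [
                f"{ARN_PREFIX}:catalog",
                f"{ARN_PREFIX}:database/blog_dev",
                f"{ARN_PREFIX}:table/blog_dev/*",
            ],
        },
        {
            # Section 2: see and query the prod database itself.
            "Effect": "Allow",
            "Action": ["glue:GetDatabase", "glue:GetDatabases"],
            "Resource": [
                f"{ARN_PREFIX}:catalog",
                f"{ARN_PREFIX}:database/blog_prod",
            ],
        },
        {
            # Section 3: read tables and partitions. database/* permits a
            # wildcard query, while the table ARN limits results to blog_prod.
            "Effect": "Allow",
            "Action": ["glue:GetTables", "glue:GetPartitions"],
            "Resource": [
                f"{ARN_PREFIX}:catalog",
                f"{ARN_PREFIX}:database/*",
                f"{ARN_PREFIX}:table/blog_prod/*",
            ],
        },
    ],
}

print(json.dumps(dev_user_policy, indent=2))
```

Replacing `database/*` in the third statement with `database/blog_prod` gives the stricter variant mentioned above, which explicitly lists the database and allows only that.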

In contrast, the QA user doesn’t have access to the dev database and can only see the tables that start with prod_ in the prod database. So the following is what the QA user’s policy looks like.

The query for the prod database is as follows:

Only GetTables and GetPartitions are available for the tables starting with prod_.

Notice the “prod_*” in the resource definition following.
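As a rough illustration of the QA user's restrictions, the sketch below builds a policy with a `prod_*` table resource and checks which table ARNs that resource would cover. The policy layout, the placeholder account ID, the table name `prod_yellow`, and the use of `fnmatchcase` to mimic IAM wildcard matching are all assumptions made for this example. The policy that the template actually creates may differ.

```python
from fnmatch import fnmatchcase

# Placeholder region and account ID; substitute your own.
ARN_PREFIX = "arn:aws:glue:us-east-1:123456789012"

qa_user_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # See and query the prod database only.
            "Effect": "Allow",
            "Action": ["glue:GetDatabase", "glue:GetDatabases"],
            "Resource": [
                f"{ARN_PREFIX}:catalog",
                f"{ARN_PREFIX}:database/blog_prod",
            ],
        },
        {
            # Read tables and partitions, limited to tables named prod_*.
            "Effect": "Allow",
            "Action": ["glue:GetTables", "glue:GetPartitions"],
            "Resource": [
                f"{ARN_PREFIX}:catalog",
                f"{ARN_PREFIX}:database/blog_prod",
                f"{ARN_PREFIX}:table/blog_prod/prod_*",
            ],
        },
    ],
}


def is_covered(arn, statement):
    """Approximate check: does any Resource pattern in the statement match?"""
    return any(fnmatchcase(arn, pattern) for pattern in statement["Resource"])


table_statement = qa_user_policy["Statement"][1]
print(is_covered(f"{ARN_PREFIX}:table/blog_prod/prod_yellow", table_statement))     # True
print(is_covered(f"{ARN_PREFIX}:table/blog_prod/staging_yellow", table_statement))  # False
```

There is no statement for blog_dev at all, so nothing in that database is covered for this user.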

Querying based on the different users

Sign in as each of the two IAM users listed on the AWS CloudFormation Outputs tab, using the password you provided, and you can see some differences.

Notice that the QA user can’t see any metadata definitions for the blog_dev database, or the staging_yellow table in the blog_prod database.

Next, sign in as blog_dev_user and go to the Athena console. Notice that this user sees only the databases and tables that the user is permitted to access.

The dev user can create a table under blog_dev, but not the blog_prod database.

Now let’s look under dev_qa_user. Notice that we only see the blog_prod and prod_* tables in Athena.

The QA user can query the datasets that user can see, but the policy doesn’t let that user create a database or tables.

If the QA user tries to query the dev database through Athena, or manually pull its metadata outside the console, that user can’t see any of the information. You can test this by running the following.

select * from blog_dev.yellow limit 10;
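The same check can be made against the Glue API directly, outside the console. The sketch below is one hedged way to do that in Python. The helper `list_table_names` is hypothetical, and it takes the client as a parameter so that it works with any object exposing a `get_tables` method that follows the Glue GetTables response shape (`TableList`, optional `NextToken`).

```python
def list_table_names(glue_client, database_name):
    """Return the names of all tables the caller can see in a database.

    Follows NextToken pagination until the service returns no more pages.
    """
    names = []
    kwargs = {"DatabaseName": database_name}
    while True:
        response = glue_client.get_tables(**kwargs)
        names.extend(table["Name"] for table in response["TableList"])
        token = response.get("NextToken")
        if not token:
            return names
        kwargs["NextToken"] = token
```

With boto3 installed and the QA user's credentials configured, calling `list_table_names(boto3.client("glue"), "blog_prod")` should list only the tables that user is permitted to see. Based on the behavior described in this post, the same call for `blog_dev` would be expected to fail with an access error rather than return table names.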

Conclusion

Data cataloging is an important part of many analytical systems. The AWS Glue Data Catalog integrates with a wide range of tools. Using the Data Catalog, you can also specify a policy that grants permissions to objects in the Data Catalog. Data lakes require detailed access control at both the content level and the level of the metadata describing the content. In this example, we show how you can define the access policies for the metadata in the catalog.


Additional Reading

Learn how to harmonize, search, and analyze loosely coupled datasets on AWS.

 


About the Author

Ben Snively is a Public Sector Specialist Solutions Architect. He works with government, non-profit and education customers on big data and analytical projects, helping them build solutions using AWS. In his spare time, he adds IoT sensors throughout his house and runs analytics on it.