AWS Big Data Blog

Connect Amazon Athena to your Apache Hive Metastore and use user-defined functions

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run. This post details the two new preview features that you can start using today: connecting to Apache Hive Metastore and using user-defined functions.

Connecting Athena to your Apache Hive Metastore

Several customers use the Hive Metastore as a common metadata catalog for their big data environments. Such customers run Apache Spark, Presto, and Apache Hive on Amazon EC2 and Amazon EMR clusters with a self-hosted Hive Metastore as the common catalog. AWS also offers the AWS Glue Data Catalog, which is a fully managed catalog and drop-in replacement for the Hive Metastore. With the release as of this writing, you can use the Hive Metastore with Athena in addition to the Data Catalog, and you can connect Athena to multiple Hive Metastores alongside your existing Data Catalog.

To connect to a self-hosted Hive Metastore, you need a metastore connector. You can download a reference implementation of this connector, which runs as a Lambda function in your account. The current version of the implementation supports only SELECT queries, and DDL support is limited to basic metadata syntax. For more information, see Considerations and Limitations for this feature. You can also write your own Hive Metastore connector, using the reference implementation as an example, deploy it as a Lambda function, and then use it with Athena. For more information about the feature, see the Using Athena Data Connector for External Hive Metastore (Preview) documentation.

Using user-defined functions in Athena

Athena also offers preview support for scalar user-defined functions (UDFs). UDFs enable you to write functions and invoke them in SQL queries. A scalar UDF is applied one row at a time and returns a single column value. Athena invokes your scalar UDF with batches of rows to limit the performance impact associated with making a remote call for the UDF itself.
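To illustrate why batching matters, here is a minimal sketch in plain Python. It does not use the Athena Query Federation SDK; the names RemoteUdf, batched, and apply_scalar_udf are hypothetical, and the "remote call" is simulated with a counter. It only shows that a scalar function still produces one output value per row while the number of remote invocations drops to one per batch.

```python
from typing import Callable, Iterable, Iterator, List


def batched(rows: Iterable[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive batches of at most batch_size rows."""
    batch: List[str] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly smaller, batch
        yield batch


class RemoteUdf:
    """Hypothetical stand-in for a remotely hosted scalar UDF.

    Each invoke() call represents one simulated remote call.
    """

    def __init__(self, fn: Callable[[str], str]) -> None:
        self.fn = fn
        self.calls = 0

    def invoke(self, values: List[str]) -> List[str]:
        self.calls += 1
        # The scalar function is still applied one row at a time.
        return [self.fn(value) for value in values]


def apply_scalar_udf(udf: RemoteUdf, column: Iterable[str], batch_size: int) -> List[str]:
    """Apply the UDF to every row, making one simulated remote call per batch."""
    results: List[str] = []
    for batch in batched(column, batch_size):
        results.extend(udf.invoke(batch))
    return results


udf = RemoteUdf(str.upper)
result = apply_scalar_udf(udf, ["a", "b", "c", "d", "e"], batch_size=2)
```

With five rows and a batch size of two, the function is applied to all five values but only three simulated calls are made. The batch size Athena actually uses is not stated in this post.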

With the latest release as of this writing, you can use the Athena Query Federation SDK to define your functions and invoke them inline in SQL queries. You can now compress and decompress row values, scrub personally identifiable information (PII) from columns, transform dates to a different format, read image metadata, and execute proprietary custom code in your queries. You can also execute UDFs in both the SELECT and FILTER phases of a query and invoke multiple UDFs in the same query.
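As a rough illustration of the compress and decompress use case, the following Python sketch shows the kind of per-value logic such a pair of scalar functions might contain. This is an assumption for illustration only: real Athena UDFs are written against the Athena Query Federation SDK and deployed as Lambda functions, and the function names, the zlib plus Base64 encoding, and the sample rows here are all hypothetical. The list comprehension at the end mimics calling a function in both the filter and the select step of a query.

```python
import base64
import zlib


def compress(value: str) -> str:
    """Compress a string and return it as Base64 text (one value in, one value out)."""
    raw = zlib.compress(value.encode("utf-8"))
    return base64.b64encode(raw).decode("ascii")


def decompress(value: str) -> str:
    """Reverse compress(): decode Base64 text and decompress it to the original string."""
    raw = base64.b64decode(value.encode("ascii"))
    return zlib.decompress(raw).decode("utf-8")


# Hypothetical table rows with a compressed payload column.
rows = [
    {"id": 1, "payload": compress("hello athena")},
    {"id": 2, "payload": compress("goodbye hive")},
]

# The same function is used to filter rows and to produce the output column.
selected = [
    decompress(row["payload"])
    for row in rows
    if decompress(row["payload"]).startswith("hello")
]
```

Because each function maps a single input value to a single output value, it fits the scalar UDF model described earlier.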

For more information about UDFs, see our documentation. For common UDF example implementations, see the GitHub repo. For more information about writing functions using the Athena Query Federation SDK, please visit this link.

Testing the preview features

All Athena queries originating from the workgroup AmazonAthenaPreviewFunctionality are considered Preview test queries.

Create a new workgroup AmazonAthenaPreviewFunctionality using Athena APIs or the Athena console. For more information, see Create a Workgroup.
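If you prefer the API over the console, the request might look like the following sketch, which assumes the AWS SDK for Python (boto3) and its Athena create_work_group operation. The S3 output location and the description text are placeholders, and the helper function names are hypothetical. Only the request dictionary is built at import time; the actual call needs boto3 installed and valid AWS credentials.

```python
from typing import Any, Dict

# The workgroup name must match exactly for queries to be treated as Preview queries.
PREVIEW_WORKGROUP = "AmazonAthenaPreviewFunctionality"


def build_create_work_group_request(output_location: str) -> Dict[str, Any]:
    """Build keyword arguments for the Athena CreateWorkGroup API."""
    return {
        "Name": PREVIEW_WORKGROUP,
        "Description": "Workgroup for testing Athena preview features",
        "Configuration": {
            # Where query results for this workgroup are written.
            "ResultConfiguration": {"OutputLocation": output_location},
            # Optional property; the workgroup name is the part that must not change.
            "PublishCloudWatchMetricsEnabled": True,
        },
    }


def create_preview_work_group(request: Dict[str, Any]) -> None:
    """Send the request with boto3 (third-party; requires AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("athena", region_name="us-east-1")
    client.create_work_group(**request)


# Placeholder bucket name; replace it with a bucket you own.
request = build_create_work_group_request("s3://example-athena-results/preview/")
```

Calling create_preview_work_group(request) would then create the workgroup in the us-east-1 Region.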

The following considerations are important when using preview features.

Do not edit the workgroup name. You can edit other workgroup properties, such as enabling Amazon CloudWatch metrics and requester pays. You can use the Athena console, JDBC/ODBC drivers, or APIs to submit your test queries. Specify the workgroup AmazonAthenaPreviewFunctionality when you submit test queries.

Preview functionality is available only in the us-east-1 Region. If you use Athena in any other Region and submit queries using AmazonAthenaPreviewFunctionality, your query fails. Cross-Region calls are not supported in preview mode.

During the preview, you do not incur charges for the data scanned from federated data sources. However, you are charged standard Athena rates for data scanned from S3. Additionally, you are charged standard rates for the AWS services that you use with Athena, such as S3, AWS Lambda, AWS Glue, Amazon SageMaker, and AWS SAM. For example, you are charged S3 rates for storage, requests, and inter-Region data transfer. By default, query results are stored in an S3 bucket of your choice and are billed at standard S3 rates. If you use Lambda, you are charged based on the number of requests for your functions and the duration (the time it takes for your code to execute).

We do not recommend onboarding your production workload to AmazonAthenaPreviewFunctionality.

Query performance may vary between the preview workgroup and the other workgroups in your account. Additionally, new features and bug fixes may not be backwards compatible.
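The workgroup and Region considerations above can be sketched as a small guard around query submission. This example assumes boto3 and its Athena start_query_execution operation; the helper names and the sample query are hypothetical. The builder rejects any Region other than us-east-1 locally, which is simply a convenience so that a misconfigured client fails before the request is sent.

```python
from typing import Any, Dict

PREVIEW_WORKGROUP = "AmazonAthenaPreviewFunctionality"
PREVIEW_REGION = "us-east-1"


def build_preview_query_request(sql: str, region_name: str) -> Dict[str, Any]:
    """Build keyword arguments for StartQueryExecution in the preview workgroup."""
    if region_name != PREVIEW_REGION:
        # Preview queries submitted from other Regions fail, so stop early.
        raise ValueError(
            f"Preview functionality is available only in {PREVIEW_REGION}, "
            f"not {region_name}"
        )
    return {"QueryString": sql, "WorkGroup": PREVIEW_WORKGROUP}


def submit_preview_query(request: Dict[str, Any]) -> str:
    """Submit the query with boto3 (third-party; requires AWS credentials)."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("athena", region_name=PREVIEW_REGION)
    response = client.start_query_execution(**request)
    return response["QueryExecutionId"]


# Hypothetical test query.
request = build_preview_query_request("SELECT 1", region_name="us-east-1")
```

The request omits a result location on the assumption that the workgroup already defines one.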

Summary

In summary, this post introduced two new Athena features that were released today in Preview.

Customers who use the Apache Hive Metastore for metadata management, and were previously unable to use Athena, can now connect their Hive Metastore to Athena to run queries. Also, customers can now use Athena’s Query Federation SDK to define and invoke their own functions in their SQL queries in Athena.

Both of these features are available in Preview in the AWS us-east-1 Region. To begin using the Preview, follow the steps in the Athena FAQ. We welcome your feedback at Athena-feedback@amazon.com.

 


About the Author

Janak Agarwal is a product manager for Athena at AWS.