Improve availability of Amazon Neptune during engine upgrade using blue/green deployment

Amazon Neptune is a fully managed graph database service built for the cloud that makes it easier to build and run graph applications that work with highly connected datasets. Neptune provides built-in security, continuous backups, serverless compute, and integrations with other AWS services. Neptune supports in-place upgrades of cluster and database instances. Upgrade of a Neptune cluster can be done either manually or automatically (during the database maintenance window).

Blue/green deployment minimizes downtime when upgrading your databases. It requires two database environments: your current production environment (blue) and a staging environment (green). You must keep these two environments in sync so you can test your changes and upgrade safely.

As a Neptune customer, you can self-manage blue/green deployments with database cloning and read replicas. However, self-managing a blue/green deployment can be costly and time-consuming. As a result, customers often choose availability over database updates.

In this post, we present a custom-built blue/green deployment solution that enables faster, safer, and simpler database updates. It enables you to make changes to the staging environment, such as major and minor version upgrades and graph data model changes, without affecting production. You can create a synchronized, managed staging environment that mirrors production in just a few steps.

Our blue/green deployment solution can be installed and run using an AWS CloudFormation template. In the following sections, we provide a high-level overview of the solution, and then describe the steps to prepare your environment and deploy the solution.

Solution overview

The following diagram illustrates the high-level workflow. The Neptune blue and green cluster versions in the diagram are for representation purposes only. You can upgrade to any higher version of the cluster.

With this solution, you can perform blue/green upgrades for the Neptune cluster. The high-level flow comprises two phases:

Create identical green cluster – Set up a green cluster by using a unique blue/green deployment identifier and a target engine version for the Neptune database (higher than the current engine version). The solution creates a staging environment (green) with the same cluster topology (the same number or size of database instances), parameter groups, and other configurations as the production environment (blue). Additionally, it performs automated tasks to prepare the database for production. The solution performs an engine version upgrade on the green cluster to the specified target engine version. You can specify minor and major engine versions for upgrades. Depending on your target engine version, the solution performs multi-hop upgrades.
Set up continuous data sync – After the green cluster creation is complete, the solution sets up Neptune Streams-based replication between the source (blue) and the target (green) cluster. When the replication difference reaches zero, the staging environment is ready for testing. At this point, you must pause writes to the source (blue) cluster to avoid further replication lag between the blue and green clusters. The target engine version can have new Neptune features or dependencies; refer to Engine releases for Amazon Neptune for guidance. We suggest that you run integration tests (or verify manually) on the green cluster before promoting it to the production environment. After you have tested and qualified the changes in the green cluster, you can switch the database endpoint to the green (new) cluster in your application.

After switchover, the blue/green deployment solution doesn’t delete your old production (blue) environment. You can access the blue cluster for additional validations and performance or regression testing if needed. Standard billing charges apply to old production instances until you delete them. The solution also uses other AWS services, which are billed at their public pricing. Details on deleting the solution after switchover to the green cluster are covered in the Clean up section.

Rollback to an older version is not possible with this solution.

Prerequisites

You need to meet the following prerequisites:

Neptune Streams is enabled on the blue cluster. After changing the neptune_streams DB cluster parameter, you must reboot the database instances in the cluster for the change to take effect.
The database instances in the blue Neptune cluster should be in the available state. You can check the DB instance state on the Neptune console or by running the describe-db-instances management API call.
The database instances in the blue Neptune cluster should be in sync with the DB cluster parameter group.
Select a valid Neptune engine version for the green cluster. It should be higher than the Neptune engine version of the blue cluster.
Blue/green deployments require an Amazon DynamoDB VPC endpoint in the VPC containing Neptune. To create a DynamoDB VPC endpoint, refer to Using Amazon VPC endpoints to access DynamoDB.
Avoid heavy write workloads such as bulk load on the blue cluster when the solution is deployed. They can slow down the Neptune Streams-based replication process.
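
Some of these prerequisites can be checked programmatically before you launch the stack. The following is a minimal sketch, not part of the solution itself. It assumes the response shapes returned by the boto3 Neptune client (describe_db_clusters and describe_db_instances), assumes that a member in sync with the cluster parameter group reports the status in-sync, and compares engine versions numerically only; it does not confirm that the target version is a valid upgrade path, that Neptune Streams is enabled, or that the DynamoDB VPC endpoint exists. The function names are hypothetical.

```python
from __future__ import annotations

from typing import Any


def parse_version(version: str) -> tuple[int, ...]:
    """Turn an engine version such as '1.2.1.0' into a comparable tuple."""
    return tuple(int(part) for part in version.split("."))


def preflight_issues(
    cluster: dict[str, Any],
    instances: list[dict[str, Any]],
    target_version: str,
) -> list[str]:
    """Return a list of problems; an empty list means these checks passed."""
    issues: list[str] = []
    # Every instance in the blue cluster should be in the available state.
    for instance in instances:
        if instance.get("DBInstanceStatus") != "available":
            issues.append(f"{instance['DBInstanceIdentifier']}: not in available state")
    # Assumption: 'in-sync' is the status reported when no reboot is pending.
    for member in cluster.get("DBClusterMembers", []):
        if member.get("DBClusterParameterGroupStatus") != "in-sync":
            issues.append(f"{member['DBInstanceIdentifier']}: not in sync with parameter group")
    # The green engine version must be higher than the blue engine version.
    if parse_version(target_version) <= parse_version(cluster["EngineVersion"]):
        issues.append(f"target {target_version} is not higher than {cluster['EngineVersion']}")
    return issues


def check_blue_cluster(cluster_id: str, target_version: str) -> list[str]:
    """Fetch the cluster description with boto3 and run the checks above."""
    import boto3  # imported here so the pure checks work without boto3

    neptune = boto3.client("neptune")
    cluster = neptune.describe_db_clusters(DBClusterIdentifier=cluster_id)["DBClusters"][0]
    instances = neptune.describe_db_instances(
        Filters=[{"Name": "db-cluster-id", "Values": [cluster_id]}]
    )["DBInstances"]
    return preflight_issues(cluster, instances, target_version)
```

Calling check_blue_cluster with your blue cluster identifier and the intended target version returns a list of issues to resolve before deploying.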

Deploy the solution with AWS CloudFormation

You can deploy the solution using AWS CloudFormation. The CloudFormation template creates an Amazon Elastic Compute Cloud (Amazon EC2) instance in the same VPC as the source Neptune database and installs the solution on the EC2 instance. Additionally, it starts the blue/green deployment solution automatically. You can monitor the progress in the Amazon CloudWatch logs (discussed in the next section).

To launch the CloudFormation stack on the AWS CloudFormation console, choose Launch Stack:

Provide details for the following parameters:

DeploymentID – An identifier that is unique to each blue/green deployment. It is used as the green cluster identifier. It’s also used as a prefix for naming new resources created as part of the solution.
NeptuneSourceClusterId – The cluster identifier that you want to upgrade.
NeptuneTargetClusterVersion – The desired Neptune engine version for the green cluster. It must be higher than the Neptune engine version of the blue cluster.
DeploymentMode – Indicates whether this is a new deployment attempt or the resumption of an existing attempt. The value must be new or resume. The default mode is new. If you reuse the DeploymentID of a previous deployment, set this to resume.
GraphQueryType – The graph type for your database. The value must be either propertygraph or rdf. The default is propertygraph.
SubnetId – A subnet ID from the same VPC as Neptune (see Connecting to a Neptune DB Cluster from an Amazon EC2 instance in the same VPC). Provide a public subnet if you want to SSH to the instance through EC2 Instance Connect.
InstanceSecurityGroup – A security group for your EC2 instance. The security group should have access to your Neptune database and you must be able to SSH to the instance. Refer to Create a security group using the VPC console for details.
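
If you prefer to launch the stack from a script instead of the console, the parameters above map onto the Parameters argument of the CloudFormation create_stack API. The following sketch validates the two constrained values and builds that list. The helper names are hypothetical, the template URL is not given here and must be supplied by you, and whether the template needs any capabilities acknowledged is an assumption left to the caller.

```python
from __future__ import annotations

VALID_MODES = {"new", "resume"}
VALID_GRAPH_TYPES = {"propertygraph", "rdf"}


def build_stack_parameters(
    deployment_id: str,
    source_cluster_id: str,
    target_version: str,
    subnet_id: str,
    security_group_id: str,
    deployment_mode: str = "new",
    graph_query_type: str = "propertygraph",
) -> list[dict[str, str]]:
    """Validate the inputs and return them in CloudFormation's parameter format."""
    if deployment_mode not in VALID_MODES:
        raise ValueError(f"DeploymentMode must be one of {sorted(VALID_MODES)}")
    if graph_query_type not in VALID_GRAPH_TYPES:
        raise ValueError(f"GraphQueryType must be one of {sorted(VALID_GRAPH_TYPES)}")
    values = {
        "DeploymentID": deployment_id,
        "NeptuneSourceClusterId": source_cluster_id,
        "NeptuneTargetClusterVersion": target_version,
        "DeploymentMode": deployment_mode,
        "GraphQueryType": graph_query_type,
        "SubnetId": subnet_id,
        "InstanceSecurityGroup": security_group_id,
    }
    return [{"ParameterKey": key, "ParameterValue": value} for key, value in values.items()]


def launch_stack(
    stack_name: str,
    template_url: str,
    parameters: list[dict[str, str]],
    capabilities: list[str] | None = None,
) -> str:
    """Create the stack and return its ID. template_url is supplied by the caller."""
    import boto3  # imported here so parameter building works without boto3

    kwargs = {"StackName": stack_name, "TemplateURL": template_url, "Parameters": parameters}
    if capabilities:
        # Assumption: only needed if the template creates IAM resources.
        kwargs["Capabilities"] = capabilities
    return boto3.client("cloudformation").create_stack(**kwargs)["StackId"]
```

To resume an interrupted attempt, call build_stack_parameters with the same deployment_id and deployment_mode="resume".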

The deployment process begins automatically once the stack has completed. The stack deployment takes 7 to 10 minutes to complete. Wait until the stack is complete before proceeding to the next section.

Monitor the solution progress

On the CloudWatch console, you can monitor logs in the /aws/neptune/<blue green deployment-id> CloudWatch log group. You can find a link to the CloudWatch logs in the outputs of the solution’s CloudFormation stack, as shown in the following screenshot.
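
You can also read the same log group from a script. The sketch below assumes that the placeholder in the log group name is the DeploymentID you passed to the stack, and uses the CloudWatch Logs filter_log_events API to fetch messages from the last few minutes. The function names are hypothetical.

```python
from __future__ import annotations

import time


def log_group_name(deployment_id: str) -> str:
    """Assumption: the log group suffix is the DeploymentID stack parameter."""
    return f"/aws/neptune/{deployment_id}"


def recent_messages(deployment_id: str, minutes: int = 15) -> list[str]:
    """Return log messages written during the last `minutes` minutes."""
    import boto3  # imported here so log_group_name works without boto3

    logs = boto3.client("logs")
    # CloudWatch Logs expects timestamps in milliseconds since the epoch.
    start_ms = int((time.time() - minutes * 60) * 1000)
    paginator = logs.get_paginator("filter_log_events")
    messages: list[str] = []
    for page in paginator.paginate(
        logGroupName=log_group_name(deployment_id), startTime=start_ms
    ):
        messages.extend(event["message"] for event in page["events"])
    return messages
```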

If you provided a public subnet as a stack parameter, you can also SSH to your EC2 instance created as part of the stack and refer to the log in /var/log/cloud-init-output.log.

The log shows actions taken as part of the solution, as demonstrated in the following screenshot.

Log messages show the sync status between the blue and green clusters. The sync process checks the replication lag by computing the difference between the latest stream eventID on the blue cluster and the replication checkpoint present in the checkpoint table (the DynamoDB table created by the Neptune-to-Neptune replication stack). You can monitor the replication difference in the CloudWatch log. Refer to the following screenshot for details.
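
To make the lag calculation concrete, here is a minimal sketch of the comparison described above. Neptune Streams event IDs consist of a commitNum and an opNum; the sketch assumes the checkpoint has been read into the same shape (the actual layout of the checkpoint table is not described here) and that the reported difference is measured in commits. The function names are hypothetical.

```python
from __future__ import annotations


def commit_difference(latest_event_id: dict, checkpoint_event_id: dict) -> int:
    """Commits on the blue cluster that have not yet been replicated to green."""
    return int(latest_event_id["commitNum"]) - int(checkpoint_event_id["commitNum"])


def is_caught_up(latest_event_id: dict, checkpoint_event_id: dict) -> bool:
    """True when the replication checkpoint has reached the latest blue commit."""
    return commit_difference(latest_event_id, checkpoint_event_id) <= 0


# Example: blue is at commit 120, replication has applied up to commit 100.
lag = commit_difference({"commitNum": 120, "opNum": 1}, {"commitNum": 100, "opNum": 3})
print(f"replication lag: {lag} commits")
```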

Traffic switchover

To promote the green cluster to production, ensure that the commit difference between the blue and green clusters is zero. You should disable all write traffic to the blue cluster from all of your applications and wait for replication to finish. If some writer applications keep writing to the blue cluster after others have switched the database endpoint to the green cluster, partial data is written to both clusters, which may result in data corruption. However, you can continue reading from either cluster. Once replication is complete, enable write traffic only on the green cluster.
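
The wait for replication to finish can be scripted as a simple polling loop. The following sketch is one way to do it, not the solution's own logic. It takes a caller-supplied get_lag function (hypothetical; for example, one that reads the commit difference from the CloudWatch log) and returns once the difference is zero or a timeout expires. It assumes writes to the blue cluster are already paused.

```python
from __future__ import annotations

import time
from typing import Callable


def wait_for_zero_lag(
    get_lag: Callable[[], int],
    poll_seconds: float = 30.0,
    timeout_seconds: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll get_lag until it returns 0. Return False if the timeout expires first."""
    deadline = clock() + timeout_seconds
    while True:
        if get_lag() == 0:
            return True  # green has caught up; safe to move writers
        if clock() >= deadline:
            return False  # still lagging; do not switch over yet
        sleep(poll_seconds)
```

Switch writers to the green endpoint only when the function returns True. If it returns False, confirm that nothing is still writing to the blue cluster.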

If you have enabled AWS Identity and Access Management (IAM) authentication on the source (blue) cluster, make sure to update the IAM policy used in your application to point to the green cluster. Refer to Policy example allowing unrestricted access to the data in a Neptune DB cluster for details.
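
As an illustration, the sketch below builds a data-access policy document for the green cluster in the shape of that policy example. IAM policies for Neptune data access reference the cluster resource ID (the DbClusterResourceId value returned by describe-db-clusters), not the cluster identifier, so the green cluster needs its own resource entry. The function name is hypothetical, the "aws" partition is an assumption, and you should narrow the neptune-db:* action if your application doesn't need unrestricted access.

```python
from __future__ import annotations

import json


def build_neptune_data_policy(
    region: str,
    account_id: str,
    cluster_resource_id: str,
    partition: str = "aws",  # assumption: standard commercial partition
) -> dict:
    """Return an IAM policy document granting data access to one Neptune cluster."""
    resource = f"arn:{partition}:neptune-db:{region}:{account_id}:{cluster_resource_id}/*"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "neptune-db:*",
                "Resource": resource,
            }
        ],
    }


# Placeholder values for illustration only.
policy = build_neptune_data_policy("us-east-1", "123456789012", "cluster-GREENEXAMPLE")
print(json.dumps(policy, indent=2))
```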

Troubleshooting

The following table lists common errors and resolutions.

Error	Resolution
Cluster with id = `blue_green_deployment_id` already exists.	There is an existing cluster with the identifier `blue_green_deployment_id`. Provide a new blue/green deployment ID or change the deployment mode to resume if the cluster was created in a previous blue/green deployment attempt.
Streams should be enabled on the source cluster for blue/green deployment.	Enable the Neptune Streams feature on the source (blue) cluster.
No bulk load should be in progress on the source cluster `cluster_id`.	The solution stops the workflow if it identifies an ongoing bulk load process. This is to ensure that the sync process is able to catch up with the changes. Avoid or cancel any ongoing bulk load jobs when the solution starts.
Blue/green deployment requires instances to be in sync with the DB cluster parameter group.	Any changes to the cluster parameter group should be in sync. Refer to Amazon Neptune parameter groups for details.
Invalid target engine version for blue/green deployment (`TargetEngineVersion`).	Refer to Engine releases for Amazon Neptune. The green cluster Neptune version should be higher than the source (blue) cluster Neptune version.

If you’re still unable to figure out the problem, download the CloudWatch log and contact AWS Support with the cluster details and CloudWatch log for further investigation.

Limitations

This solution has the following limitations:

The solution doesn’t work if you’re using SPARQL (RDF) and change blank nodes while the solution is running. Doing so may lead to data inconsistency in the green cluster.
The green (staging) environment may take an indefinite amount of time to catch up with heavy write workloads such as bulk load.
Your application must be modified to use the green (staging) cluster. There will be some downtime in your application as a result.
You may want to retry the blue/green deployment with the same DeploymentID CloudFormation template parameter if the process was interrupted manually. When retrying, set DeploymentMode to resume and make sure you haven’t deleted resources created in previous attempts, such as cloned clusters, instances, or replication stacks. If you deleted any of those resources, use a fresh DeploymentID.

Best practices

Consider the following best practices:

Ensure that the staging cluster is functioning properly before switching over to production traffic. Check the consistency of the data and the configuration of the database. It’s possible that some of the new engine versions will require client upgrades as well. You can check the engine release notes before the engine upgrade. It’s advisable to test this in your development, testing, and pre-production environments before performing a blue/green upgrade in production.
Keep your original cluster for a period of time after upgrading and synchronizing, until you have confirmed that everything is working properly. It can be useful for recovery if unknown issues arise.
We recommend that you avoid heavy write operations (such as Neptune bulk load) when using this solution. This way, Neptune Streams-based replication can catch up faster, and you can switch over to the green cluster sooner.

Clean up

After you have promoted the staging (green) cluster to production, you should clean up the resources created by the solution, including the EC2 instance, green Neptune cluster, and stack for Neptune Streams-based replication. To clean up, delete the CloudFormation stack with the stack name you provided earlier. Also, delete the CloudFormation stack whose name matches {DeploymentID}-replication.

Deleting CloudFormation stacks won’t delete any clusters (blue or green). After you have verified that the green cluster is working as expected, you can optionally take a snapshot and manually delete the blue cluster.
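
The stack deletion can be scripted with the CloudFormation delete_stack API. The sketch below derives the two stack names described above and deletes them one at a time, waiting for each deletion to finish. The function names are hypothetical, and the deletion order shown is an assumption. As noted, this leaves both clusters in place.

```python
from __future__ import annotations


def stacks_to_delete(solution_stack_name: str, deployment_id: str) -> list[str]:
    """Return the solution stack and the replication stack created for it."""
    return [solution_stack_name, f"{deployment_id}-replication"]


def delete_stacks(stack_names: list[str]) -> None:
    """Delete each stack and block until CloudFormation reports completion."""
    import boto3  # imported here so stacks_to_delete works without boto3

    cloudformation = boto3.client("cloudformation")
    waiter = cloudformation.get_waiter("stack_delete_complete")
    for name in stack_names:
        cloudformation.delete_stack(StackName=name)
        waiter.wait(StackName=name)
```

For example, delete_stacks(stacks_to_delete("my-solution-stack", "bg-1")) removes my-solution-stack and bg-1-replication, where both names are placeholders for your own values.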

Summary

In this post, we explained how you can use the Neptune blue/green deployment solution to keep your applications running smoothly and efficiently during an engine upgrade. Explore this solution to upgrade your cluster with minimal application downtime.

About the authors

Ankit Gupta is a Software Development Manager with the Amazon Neptune Platform Team in India and has been part of the Neptune team since product inception. He works with AWS customers and internal development teams to improve Neptune’s usability, performance, scalability, and user experience.

Abhishek Mishra is a Sr. Specialist Solutions Architect focused on Amazon Neptune at AWS. He helps AWS customers build innovative solutions using graph databases. In his spare time, he loves making the earth a greener place.