AWS Big Data Blog

Working with nested data types using Amazon Redshift Spectrum

Redshift Spectrum is a feature of Amazon Redshift that allows you to query data stored on Amazon S3 directly and supports nested data types. This post discusses which use cases can benefit from nested data types, how to use Amazon Redshift Spectrum with nested data types to achieve excellent performance and storage efficiency, and some of the limitations of nested data types.

This post uses a data set generated with dummy data. You can view its table schema. If you’d like to try the dataset, deploy a Redshift cluster, execute the DDLs there, and use the example queries from this post or build your own.

Data modeling

In many scenarios, data is generated in a hierarchy. For example, assume a customer bought several items. For analytic purposes, there are various data modeling approaches to save storage or speed up data processing. One popular approach to achieve storage efficiency is the dimensional model.

The following table shows dummy customer data.

username name sex address mail birthdate
1 erin15 Sarah Newman F 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 josephlyons@yahoo.com 4/24/10
2 shepherdlisa Mark Lee M 754 Michelle Gateway Port Johnstad, ME 35695 guerrerotodd@hotmail.com 11/10/32
3 palmerpaul Jennifer Marshall F 869 Harrell Forges Apt. 111 East Monica, MO 01243 heather65@hotmail.com 3/11/07
4 brettmcgee Travis Wilson M 535 Lisa Flat East Andrew, ID 43332 bellkim@gmail.com 3/22/10
5 torresdiana Ashley Hoffman F 7815 Lauren Ranch Ambertown, FL 93225 franklinjonathan@hotmail.com 5/14/60

The following table contains dummy order data, which is linked to the customer table via a foreign key username.

username transaction_date shipping_date items price
1 erin15 10/11/19 10/13/19 10 4794
2 erin15 10/11/19 10/12/19 7 1697
3 erin15 10/7/19 10/9/19 2 15
4 erin15 10/6/19 10/10/19 5 1744
5 erin15 10/5/19 10/10/19 7 6346

In the dimensional model, each customer’s information is stored only one time. There is no duplicated data, even though a customer could order multiple items at various times.

The dimensional model is optimal for storage. However, it can be challenging to process data efficiently. To get a full picture of your data, you need to join the two tables together to restore the hierarchy.

For example, to find out how many items customer Mark Lee bought and his total spending in the last three months, the query needs to join the customers and orders table. See the following code:

Select c.username, o.transaction_date, o.shipping_date, sum(items), sum(price) 
from customers c inner join orders o on (c.username = o.username) 
where c.name = ‘Mark Lee’ 
and transaction_date > DATEADD(month, -3, GETDATE())
group by 1,2,3;

When there are millions of customers who might buy multiple items in each transaction, the join can be very expensive. A fast-growing dataset can be so large that you need to store it in a distributed system. To perform the join, you need to shuffle data through the network, and the cost becomes even more significant.

As storage becomes cheaper and cheaper, people are starting to use a flattened model. In this model, data is pre-joined to gain processing efficiency. The following table shows that the customer and order information is stored in one record and ready to be analyzed.

username name sex address mail birthdate transaction_date shipping_date items price
erin15 Sarah Newman F 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 josephlyons@yahoo.com 4/24/10 9/14/19 10/12/19 2 1237
erin15 Sarah Newman F 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 josephlyons@yahoo.com 4/24/10 9/16/19 10/9/19 8 4824
erin15 Sarah Newman F 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 josephlyons@yahoo.com 4/24/10 9/17/19 10/10/19 9 4392
erin15 Sarah Newman F 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 josephlyons@yahoo.com 4/24/10 9/17/19 10/9/19 3 1079
erin15 Sarah Newman F 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 josephlyons@yahoo.com 4/24/10 9/25/19 10/7/19 1 208
erin15 Sarah Newman F 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 josephlyons@yahoo.com 4/24/10 10/2/19 10/5/19 10 3689
erin15 Sarah Newman F 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 josephlyons@yahoo.com 4/24/10 10/5/19 10/10/19 7 6346
erin15 Sarah Newman F 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 josephlyons@yahoo.com 4/24/10 10/6/19 10/10/19 5 1744
erin15 Sarah Newman F 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 josephlyons@yahoo.com 4/24/10 10/7/19 10/9/19 2 15
erin15 Sarah Newman F 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 josephlyons@yahoo.com 4/24/10 10/11/19 10/13/19 10 4794
erin15 Sarah Newman F 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 josephlyons@yahoo.com 4/24/10 10/11/19 10/12/19 7 1697
palmerpaul Jennifer Marshall F 869 Harrell Forges Apt. 111 East Monica, MO 01243 heather65@hotmail.com 3/11/07 9/14/19 9/22/19 6 4642
palmerpaul Jennifer Marshall F 869 Harrell Forges Apt. 111 East Monica, MO 01243 heather65@hotmail.com 3/11/07 9/17/19 9/21/19 1 527
palmerpaul Jennifer Marshall F 869 Harrell Forges Apt. 111 East Monica, MO 01243 heather65@hotmail.com 3/11/07 10/9/19 10/12/19 5 408
torresdiana Ashley Hoffman F 7815 Lauren Ranch Ambertown, FL 93225 franklinjonathan@hotmail.com 5/14/60 9/17/19 9/28/19 9 5452

This model also works well on a distributed system. Because each row contains complete information, you can process it on any node, and don’t need to shuffle data. You can also use the columnar format to store data, which allows the query engine to read only the needed columns instead of the whole row. This technique improves analytics performance and is storage efficient.

Both models have their pros and cons. The dimensional model trades compute power for storage efficiency, and the flattened model trades storage for processing efficiency.

Some new data types are available that achieve the best of both. Instead of putting child records into another table, you can nest them into the parent record and get the full information without performing a join. It effectively denormalizes the data without duplicating the parent record.

The following diagram illustrates this workflow.

You can apply this model to a schemaful hierarchy dataset. Continuing with the customer and order example, although a customer might buy multiple items, each order item contains the same type of information, such as product ID, price, and vendor.

The hierarchy is clear and consistent. You can map data to a nested structured schema, which you can store and access efficiently via SQL language.

The following table is a nested data presentation of the previous example.

username name sex address mail birthdate transaction_date shipping_date items price
erin15 Sarah Newman F 795 Nancy Shoal Apt. 684 Phillipschester, MI 01979 josephlyons@yahoo.com 4/24/10 9/14/19 10/12/19 2 1237
9/16/19 10/9/19 8 4824
9/17/19 10/10/19 9 4392
9/17/19 10/9/19 3 1079
9/25/19 10/7/19 1 208
10/2/19 10/5/19 10 3689
10/5/19 10/10/19 7 6346
10/6/19 10/10/19 5 1744
10/7/19 10/9/19 2 15
10/11/19 10/13/19 10 4794
10/11/19 10/12/19 7 1697
palmerpaul Jennifer Marshall F 869 Harrell Forges Apt. 111 East Monica, MO 01243 heather65@hotmail.com 3/11/07 9/14/19 9/22/19 6 4642
9/17/19 9/21/19 1 527
10/9/19 10/12/19 5 408
torresdiana Ashley Hoffman F 7815 Lauren Ranch Ambertown, FL 93225 franklinjonathan@hotmail.com 5/14/60 9/17/19 9/28/19 9 5452

The following graph compares the storage usage for the three models (all in parquet format).

The graph shows that nested structure is as storage efficient as the dimensional model.

Using nested data types

Nested data types are structured data types for some common data patterns. Nested data types support structs, arrays, and maps.

A struct is similar to a relational table. It groups object properties together. For example, if a customer profile contains their name, address, email, and birthdate, it appears as the following schema:

customer struct<
              username:string,
              name:string,
              sex:string,
              address:string,
              mail:string,
              birthdate:string>

The data appears as the following code:

    "customer": {
        "username": "kevin35",
        "name": "Nancy Alvarez",
        "sex": "F",
        "address": "05472 Kathleen Turnpike\nNew Ashley, NV 84430",
        "mail": "darrellbaker@yahoo.com",
        "birthdate": "1961-02-05"
    }

An array stores one-to-many relationships. For example, a customer may have multiple shipping addresses or phone numbers. If a customer has several phone numbers, it appears as the following schema:

Phonenumbers array<string>

The data appears as the following code:

[‘555-5555’, ‘555-1234’]

A map is a collection of key-value pairs. You can consider it as a list of struct<key, value> elements. For example, if a customer has particular reward preferences, it appears as the following schema:

preference map<string,boolean> 

The data appears as the following code:

{one_day_delivery=true, 
 coupon=false, 
 free_shipping=true}

Nested data could have another nested data type as a member. The most common one is an array of structs. For example, an order containing multiple items could appear as the following schema:

orders array<
            struct<
              product_id:string,
              price:int,
              onsale:boolean,
              tax:int,
              weight:int,
              others:int,
              vendor:string>
            > 

You can create a complex object by combining them. For example, a customer’s online transaction appears as the following schema:

customer struct<
              username:string,
              name:string,
              sex:string,
              address:string,
              mail:string,
              birthdate:string> , 
  shipping_address array<
                      struct<
                        name:string,
                        street_address:string,
                        city:string,
                        postcode:string>
                      > , 
  creditcard string , 
  transaction_date string , 
  shipping_date string , 
  membership string , 
  preference map<string,boolean> , 
  orders array<
            struct<
              product_id:string,
              price:int,
              onsale:boolean,
              tax:int,
              weight:int,
              others:int,
              vendor:string>
            > , 
  platform string , 
  comments string 

Popular query engines such as Hive, Spark, Presto, and Redshift Spectrum support nested data types. The SQL syntax those engines support can be different. To make it straightforward and consistent, all query examples in this post use Amazon Redshift Spectrum. For more information, see Tutorial: Querying Nested Data with Amazon Redshift Spectrum.

Use cases for nested data types

Nested data types have many benefits: simplify your ETL, data modeling, and achieve the good performance. The following are some common use cases that can benefit from nested data types.

Parent-child relationship

Nested data types keep the parent-child (summary-details) relationship by storing them collocated. This often matches how you want to analyze the data. For example, to analyze customers’ purchasing habits, you may need to find the following:

  • Customers who purchase often but buy only a few items each time. They likely want an annual membership that covers the shipping cost.
  • Customers who purchase less frequently but buy many items in one transaction. They likely expect a free shipping benefit or discount.

You need support information from the orders data, such as how many items, on average, a customer buys per transaction.

To find a list of customers who order online at least once per week, with fewer than four items each time, use the following code:

with purchases as (
select co.customer.username as customer, co.transaction_date as transaction_date, co.customer.address as address,
  (select count(*) from co.orders) as total_items, 
  (select sum(case when onsale = true then 1 else 0 end) from co.orders) as items_onsale
from demo.customer_order_nested_parq co )
select customer, count(transaction_date) as tran_cnt, avg(total_items) 
from purchases 
where total_items <= 3 and items_onsale > 0 
      and transaction_date >= '2019-09-01' 
group by 1 having tran_cnt >= 4;

With the nested order details, per item information is already grouped by customer per transaction. Children aggregation is straightforward; you can aggregate order details to categorize a customer. If you use a denormalized table, you have to do GROUP BY two times. The query could also take longer. See the following code:

with purchases as (
select cc_username as customer, transaction_date, cc_address as address,
  count(*) as total_items, 
  sum(case when co_onsale = true then 1 else 0 end) as items_onsale
from demo.customer_order_flatten_parq 
group by 1,2,3)
select customer, count(transaction_date) as tran_cnt, avg(total_items) 
from purchases 
where total_items <= 3 and items_onsale > 0 
      and transaction_date > '2019-09-01'
group by 1 having tran_cnt >= 4;

To find customers who order only once per quarter with at least 10 items and high total spending, use the following code:

with purchases as (
select co.customer.username as customer, co.transaction_date as transaction_date, 
  (select count(*) from co.orders) as total_items, 
  (select sum(price) from co.orders) as total_spending
from demo.customer_order_nested_parq co )
select customer, count(transaction_date) as tran_cnt, avg(total_spending) from purchases 
where total_items >= 10 and total_spending > 5000 and transaction_date > '2019-07-01' transaction_date < '2019-09-30'
group by 1 having tran_cnt < 2
order by 3 desc;

Another benefit of using nested data types for parent-child data analysis is resource usage reduction. If there are one million customer transactions, there could be over five times the item orders. For example, to find each day how many goods ship to Michigan, use the following code:

select co.shipping_date, sum(coo.weight)
from demo.customer_order_nested_parq co, co.orders coo
where co.customer.address like '%MI 012__'
group by 1
order by 1;

Assuming that 3% of customers ship orders to Michigan, after filtering the customer data, there could be approximately 3% of matching transactions. You only need to process 150 thousand item orders instead of 5 million. This greatly reduces the data to process and the resources to use when compared to a flattened model.

For the parent-child use case, nested data types provide straightforward aggregation on children, more efficient filtering, group by, windowing, and storage saving.

Many-to-many relationship

Customers could buy many items from various vendors, and a vendor could sell a product to many customers. This is a many-to-many relationship.

In a dimensional model, you need three tables: a customers table, an orders table, and a transactions table. To find the top vendors who have the most customers, you need to join the three tables. See the following code:

select vendor, transaction_date, count(distinct cc.username)
from customers cc,
     transactions tt,
     orders oo
where cc.username = tt.username
and oo.transaction_id = tt.transaction_id
and tt.transaction_date >= '2019-01-01' 
group by 1,2
order by 3 desc;

With nested data types, the query is similar to the one using the dimensional model. However, because the orders data is collocated with customer transactions, you can join them on-the-fly without paying the cost. See the following code:

select coo.vendor, co.transaction_date, count(distinct co.customer.username)
from demo.customer_order_nested_parq co, 
co.orders coo
where co.transaction_date > '2019-01-01'
group by 1,2
order by 3 desc;

As another example, your vendor, Smith PLC, had a big sale event on October 10, 2019. You want to find out which customers bought your product during this sale and the top customers who spent the most. To do so, use the following code:

select co.customer.username, count(coo.product_id), sum(coo.price)
from demo.customer_order_nested_parq co, co.orders coo
where co.transaction_date = '2019-10-10'
and (select count(*) from co.orders 
     where vendor = 'Smith PLC' and onsale = true) > 0
group by 1
order by 3 desc;

Compared to the dimensional model query, the nested model is two-to-three times faster. This is on a relatively small dataset with only a few million rows. For a larger dataset, the performance improvement is even greater, and with less resource usage.

Sparse and frequently changed data

Assume that you want to reward customers who order from your online store. For each transaction, the customer can choose one or more rewards, such as free shipping, one-day delivery, a discount, or a coupon. Depending on how effective a reward is, you have to frequently modify the reward types, add new ones, or remove ones that aren’t popular.

If you store the data in a flattened model, there are two common options to track this data. The first method is creating a table with one column for each type of reward. You have to think of all possible rewards at the outset and create those columns. This could lead to a wide table and very sparse data. Alternatively, you can modify your table schema when you want to add or remove a reward type. That adds more maintenance work and you may lose history data. The following table demonstrates this method (all transaction_id data in below table examples are faked one).

transaction_id free_shipping one_day_delivery discount coupon
pklein35966659391853535 FALSE TRUE TRUE
rebeccawiliams228880139768961 FALSE TRUE
brooke39180013629693040 TRUE FALSE TRUE TRUE
jchapman4283556333561927 FALSE TRUE FALSE FALSE
mariamartin3515336516983566 FALSE FALSE TRUE

The second option is storing one reward per row. This avoids the wide table issue and the burden of constantly updating the schema. The approach is suitable if you only need to analyze a single reward. If you want to see whether there is any correlation between rewards, such as if more customers prefer free shipping and one-day delivery more than a discount and coupon, this option is more complicated. This model also needs more storage. The following table demonstrates this method.

transaction_id rewordtype value
pklein35966659391853535 free_shipping FALSE
pklein35966659391853535 one_day_delivery TRUE
pklein35966659391853535 coupon TRUE
rebeccawiliams228880139768961 one_day_delivery FALSE
rebeccawiliams228880139768961 coupon TRUE
brooke39180013629693040 free_shipping TRUE
brooke39180013629693040 one_day_delivery FALSE
brooke39180013629693040 discount TRUE
brooke39180013629693040 coupon TRUE

A compromise is to use a JSON string to store selected rewards together in one column, which avoids schema change. See the following code:

preference varchar(65535)	

The following table shows how the data is stored in JSON string:

transaction_id preference
pklein35966659391853535 {“coupon”:true, “free_shipping”:false,”one_day_delivery”:true}
rebeccawiliams228880139768961 {“coupon”:true, one_day_delivery”:false}
brooke39180013629693040 {“coupon”:true, “discount”:true, “free_shipping”:true,”one_day_delivery”:false}
jchapman4283556333561927 {“coupon”:false, “discount”:false, “free_shipping”:false, “one_day_delivery”:true}
mariamartin3515336516983566 {“discount”:true, “free_shipping”:false,”one_day_delivery”:false}

You can analyze it by using a JSON function to extract the reward data. See the following code:

select correlation, count(username) from (
select username,
(case when 
    (json_extract_path_text(preference,'free_shipping')  = 'true' and  
     json_extract_path_text(preference,'one_day_delivery')  = 'true') 
     then 1
 when 
    (json_extract_path_text(preference,'discount') = 'true' and  
     json_extract_path_text(preference,'coupon')  = 'true') 
     then 2
else 0 
 end) as correlation
from demo.transactions
  )
group by 1;

This solution is acceptable, but you could be more storage efficient and more performant by using the nested data type map. See the following code:

preference map<string, boolean>

The following table shows how the data is stored in map:

transaction_id preference
pklein35966659391853535 {coupon=true, free_shipping=false,one_day_delivery=true}
rebeccawiliams228880139768961 {coupon=true, one_day_delivery=false}
brooke39180013629693040 {coupon=true, discount=true, free_shipping=true,one_day_delivery=false}
jchapman4283556333561927 {coupon=false, discount=false, free_shipping=false, one_day_delivery=true}
mariamartin3515336516983566 {discount=true, free_shipping=false,one_day_delivery=false}

 

You can analyze a single reward or multiple rewards using SQL. For example, to find how many customers prefer free shipping, use the following code:

select count(distinct co.customer.username)
from demo.customer_order_nested_parq co, co.preference cm
where cm.key = 'free_shipping' and cm.value = true;

To find how many customers prefer free shipping and one-day delivery more than a coupon or discount, use the following code:

with customer_rewards as (
select co.customer.username as customer, 
 (select count(*) from co.preference cm 
where cm.key = 'free_shipping' and cm.value = true) as shipping_pref,
 (select count(*) from co.preference cm 
where cm.key = 'one_day_delivery' and cm.value = true) as delivery_pref,
 (select count(*) from co.preference cm 
where cm.key = 'coupon' and cm.value = true) as coupon_pref,
 (select count(*) from co.preference cm 
where cm.key = 'discount' and cm.value = true) as discount_pref
from demo.customer_order_nested_parq co;
select case when shipping_pref > 0 and delivery_pref > 0 then 1
            when coupon_pref > 0 and discount_pref > 20 then 2
            else 0
       end as correlation, count(customer)
from customer_rewards
group by 1;

The map type allows you to add any key-value pair. You can add a new reward type at any time without a schema change, and you can analyze the new reward right away.

The main advantage of the map type is that it supports flexible schema and eliminates the need to update the schema frequently. However, there is not much performance benefit. If performance is your top priority, a flattened table is recommended. You can also flatten the most-often accessed columns, and use map for the less frequently accessed columns.

Limitations of nested data types

Although nested data types are useful in many use cases, they have the following limitations:

  • There is a hard limit on children size.
  • You can only append, and updating data is difficult and slow. You need to rewrite the entire nested object even if you want to modify one child attribute.
  • Processing is split at the parent record level. You may run into problems if the children data is heavily skewed.
  • The query engine may not support all types of analytics on nested data.
  • Amazon Redshift Spectrum Nested Data Limitations.

Summary

This post discussed the benefits of nested data types and use cases in which nested data types can help improve storage efficiency, performance, or simplify analysis. There are many more use cases in which nested data types can be an ideal solution. Try it out and share your experiences!

 


About the Author

 Juan Yu is a Data Warehouse Specialist Solutions Architect at AWS.