Ova

How to Create a Surrogate Key in Redshift?

Published in Redshift Surrogate Key 4 mins read

In Amazon Redshift, you primarily create a surrogate key using the IDENTITY keyword, which automatically generates unique, sequential integer values for a column as new rows are added to a table.

Understanding Surrogate Keys in Redshift

A surrogate key is an artificial, system-generated primary key that serves as a unique identifier for each row in a database table, especially in data warehousing environments like Redshift. Unlike natural keys, which are derived from the business data itself, surrogate keys are typically simple integers and have no intrinsic meaning to the business.

Why use surrogate keys in Redshift?

  • Performance: Joining tables on simple integer keys is generally faster than on multi-column natural keys or complex strings.
  • Flexibility: They insulate your data warehouse from changes in natural keys in the source system. If a natural key changes, you only update the associated dimension record without affecting facts.
  • Data Integrity: They guarantee uniqueness for each record, even if the natural key is null or non-unique in the source.
  • Simplicity: They simplify table designs and join conditions.

Implementing Surrogate Keys with IDENTITY

The IDENTITY property in Redshift allows you to define a column that automatically populates with unique, sequential values. This is the recommended and most common method for creating surrogate keys.

Syntax for IDENTITY

When defining a column as an IDENTITY column, you specify a seed and a step:

column_name BIGINT IDENTITY(seed, step)
  • seed: The starting value for the identity column. The first row inserted will have this value.
  • step: The increment value that is added to the previous value to generate the next identity value.

Example: Creating a Dimension Table with an IDENTITY Surrogate Key

Let's say you're building a data warehouse and need a Dim_Venue table to store information about event venues. You can create a VenueSkey (Venue Surrogate Key) using IDENTITY:

CREATE TABLE dim_venue (
    venueskey    BIGINT IDENTITY(1,1), -- Our surrogate key, starting at 1, incrementing by 1
    venueid      INTEGER,              -- Natural key from source system
    venuename    VARCHAR(256),
    venuecity    VARCHAR(128),
    venuestate   VARCHAR(128),
    venuecapacity INTEGER,
    -- ... other venue attributes
    PRIMARY KEY(venueskey)
);

In this example:

  • venueskey is defined as a BIGINT (to accommodate a large number of rows) and is an IDENTITY(1,1) column.
  • The first row inserted into dim_venue will have venueskey = 1, the second will have venueskey = 2, and so on.
  • The venueskey column is automatically populated with unique values as new rows are added to the table.

Inserting Data into Tables with IDENTITY Columns

When inserting data, you typically omit the IDENTITY column from your INSERT statement, and Redshift automatically generates its value.

INSERT INTO dim_venue (venueid, venuename, venuecity, venuestate, venuecapacity)
VALUES
    (101, 'Grand Arena', 'New York', 'NY', 20000),
    (102, 'City Concert Hall', 'Los Angeles', 'CA', 5000),
    (103, 'Exhibition Center', 'Chicago', 'IL', 15000);

-- Redshift will automatically assign VenueSkey values (1, 2, 3 respectively)

You can also explicitly insert a value into an IDENTITY column using the OVERRIDING SYSTEM VALUE clause, but this is generally not recommended for normal operations as it can lead to key collisions or gaps in the sequence. For example:

INSERT INTO dim_venue OVERRIDING SYSTEM VALUE (venueskey, venueid, venuename, venuecity, venuestate, venuecapacity)
VALUES (1000, 104, 'Stadium One', 'Houston', 'TX', 30000);

For more in-depth information on the IDENTITY property, refer to the AWS Redshift documentation on CREATE TABLE.

Alternative Considerations (Less Common for Primary Surrogate Keys)

While IDENTITY is the standard, other methods exist, though they are less common for primary surrogate keys due to performance or complexity.

  • UUID (Universally Unique Identifier): You could generate UUIDs using functions like MD5(RANDOM()::TEXT) or UUID_GENERATE_V4() (if you have the necessary extensions enabled). While globally unique, UUIDs are typically larger (128-bit) and can be less efficient for joins and indexing compared to BIGINT integers.
  • Sequence Objects (simulated): In databases like PostgreSQL, explicit sequence objects are common. Redshift's IDENTITY is essentially an optimized form of an auto-incrementing sequence for table columns.

For most data warehousing scenarios in Redshift, the IDENTITY property provides the ideal balance of performance, simplicity, and data integrity for creating surrogate keys.