I’ve been using Amazon Redshift for nearly a year now, and overall I’ve been satisfied with its performance and ease of scalability. That said, I’ve definitely learned some lessons along the way that may be helpful to others who are considering using it, and I will briefly discuss some of those lessons here.
Redshift is considered an OLAP (Online Analytical Processing) database, which essentially means that it is intended for low-volume complex analytical queries rather than high-volume transactions, which is typically what traditional OLTP (Online Transactional Processing) databases are designed for. It is marketed as a petabyte-scale data warehouse solution with the capability of massively parallel processing (MPP).
One of the core features of Redshift is that data is stored in columnar form, where the data is persisted column-by-column rather than row-by-row like in traditional database systems. This has two main advantages:
- I/O time is significantly reduced when a subset of the columns are queried over a sequential set of rows.
- Data can be more effectively compressed since very similar data is grouped together
In my experience, Redshift works extremely well when it is used as advertised. That is, Redshift primarily meant for executing complex analytical queries over large data sets, and is designed to scale out very effectively. I’ve seen data exports on time-series data involving dozens of joins and millions of rows that take over three minutes to complete on an Oracle system easily finish in under five seconds on Redshift with only one or two nodes, and that includes the time exporting the results to S3.
Here I will cover some of the unexpected challenges that you may run into when using Redshift for the first time.
Maintenance of Dimension Tables
A data warehouse typically isn’t limited to just large fact tables – there are almost always dimension and metadata tables, which often have to be maintained and updated. Developers desiring to avoid over-complicating system architecture may favor simply keeping the maintenance of the metadata within the same Redshift database. But this can result in a situation where Redshift is speedily providing the reporting and data export functionality of the system, but lacking in performance when it comes to performing updates on meta data.
Putting a user interface on top of Redshift to manage this meta data will likely only result in frustration for users of the application, as it’s not at all uncommon for simple selects and updates to take a second or two on Redshift. The slow metadata retrieval could be mitigated if it is cached, but updates will still take a second or two to complete, or even more depending on how many database calls are actually invoked. So unless updates to the metadata are very infrequent, you’re probably still better off moving the metadata maintenance to a different system upstream and then pushing the updates to Redshift as they occur.
If your ingestion workflow is fairly straightforward where you simply need to transform files and copy them into a staging table that you then merge into your fact tables, then Redshift will handle this very well since it is designed to effectively parallelize this sort of operation. But if you need to do some sort of quality control on the data that involves running rules to flag potentially bad data after you push it to a staging table, you’ll probably run into performance problems since Redshift can very slow at performing updates (on Redshift, every update is actually a delete and insert). You may be better off running this sort of quality control prior to ingesting the data into Redshift.
If you use an ETL workflow where you have an AWS Lambda function listening for files to be dropped into an S3 bucket, make sure that you throttle the Lambda function. Otherwise a sudden influx of many data files could overload your Redshift cluster if you don’t have enough nodes, and you may find your Lambda function timing out.
Unsupported SQL operations
Although Redshift is based on PostgreSQL, it does not implement the same full specification. Sometimes you’ll spend the time building what seems like a perfectly reasonable SQL query, only to find out that Redshift doesn’t support it.
As an example, this is a query that uses a WHERE EXISTS clause to flag time-series data where the delta of the min and max value within an hour exceeds a threshold:
UPDATE measurement SET qc_flagged = true FROM measurement outer_measurement WHERE sensor_id = 1 AND EXISTS ( SELECT min(date_time) AS date_time_hour FROM measurement AS inner_measurement WHERE inner.sensorId = outer.sensor_id AND outer_measurement.date_time = inner_measurement.date_time GROUP BY EXTRACT(year from inner_measurement.date_time), EXTRACT(month from inner_measurement.date_time), EXTRACT(day from inner_measurement.date_time), EXTRACT(hour from inner_measurement.date_time) HAVING MIN(value_avg) - MAX(value_avg) > 5 );
At the time of this writing, if you attempt to execute a query like this on Redshift, you’ll receive this error:
Invalid operation: This type of correlated subquery pattern is not supported yet;
Obviously, this sort of query can be restructured using joins, so it’s not the end of the world.
Here’s another unsupported query you might attempt to use when populating a meta data table:
INSERT INTO sensor (name, measurement_type_id) VALUES ('Temperature Sensor', (SELECT measurement_type_id FROM measurement_type WHERE name = 'temperature')), ('Humidity Sensor', (SELECT measurement_type_id FROM measurement_type WHERE name = 'humidity'));
The error received in this case is:
Invalid operation: cannot use subqueries in multi-row VALUES;
I’ve run into this for a few other operations, such as PIVOT, SELECT DISTICT ON, and variable declarations.
In each case that I’ve encountered an issue like this I have been able to identify a workaround, but it hasn’t always been pleasant. In the case of PIVOT for instance, the only way to construct an equivalent query is to programmatically build out a large sequence of joins.
Of course, these unsupported operations are fairly well documented in the official documentation. But unless you go through it in detail and memorize which functions are supported and which are not, you can easily be caught off guard when something doesn’t work.
A fairly comprehensive list of unsupported features are documented on this page: Unsupported PostgreSQL Features
Note that some fairly common operations, such as what I’ve mentioned above, are not listed there even though they are not supported.
What I’ve discussed here may make it seem like I have an unfavorable opinion of Redshift, but I do think it is a very powerful tool . When it is used as intended, it delivers on what it promises. But potential users should carefully look through the documentation and implement some proof of concepts to test their use cases before fully committing to it.