A Few Things to be Aware of Before Using Amazon Redshift

Jared Gommels

I’ve been using Amazon Redshift for nearly a year now, and overall I’ve been satisfied with its performance and ease of scalability. That said, I’ve definitely learned some lessons along the way that may be helpful to others who are considering using it, and I will briefly discuss some of those lessons here.

Brief Overview

Redshift is considered an OLAP (Online Analytical Processing) database, which essentially means that it is intended for low-volume complex analytical queries rather than high-volume transactions, which is typically what traditional OLTP (Online Transactional Processing) databases are designed for. It is marketed as a petabyte-scale data warehouse solution with the capability of massively parallel processing (MPP).

One of the core features of Redshift is that data is stored in columnar form, where the data is persisted column-by-column rather than row-by-row like in traditional database systems. This has two main advantages:

  • I/O time is significantly reduced when a subset of the columns is queried over a sequential set of rows.
  • Data can be more effectively compressed, since very similar data is grouped together.

In my experience, Redshift works extremely well when it is used as advertised. That is, Redshift is primarily meant for executing complex analytical queries over large data sets, and it is designed to scale out very effectively. I’ve seen data exports on time-series data involving dozens of joins and millions of rows that take over three minutes to complete on an Oracle system easily finish in under five seconds on Redshift with only one or two nodes, including the time spent exporting the results to S3.
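For context, exports like this are typically done with Redshift’s UNLOAD command, which writes query results to S3 in parallel across the nodes. Here is a minimal sketch; the table, bucket, and IAM role are placeholders:

UNLOAD ('SELECT sensor_id, date_time, value_avg FROM measurement')
TO 's3://my-bucket/exports/measurement_'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-unload-role'
DELIMITER ',' GZIP PARALLEL ON;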

Possible Challenges

Here I will cover some of the unexpected challenges that you may run into when using Redshift for the first time.

Maintenance of Dimension Tables

A data warehouse typically isn’t limited to just large fact tables – there are almost always dimension and metadata tables, which often have to be maintained and updated. It is common to need to provide a UI or web service layer to maintain this metadata. Developers trying to avoid over-complicating the system architecture may favor simply using the metadata in Redshift as the source of truth and connecting the user interface and web services directly to Redshift. But this can result in a situation where Redshift speedily provides the reporting and data export functionality of the system, yet falls short when it comes to simple metadata updates.

Putting a user interface on top of Redshift to manage this metadata may result in frustration for users of the application, as it’s not at all uncommon for simple selects and updates to take a second or two on Redshift. Slow metadata retrieval can be mitigated with caching, but updates will still take a second or two to take effect, or even longer depending on how many database calls are actually invoked. So unless updates to the metadata are very infrequent, you’re probably still better off moving the metadata maintenance to a different system upstream and then pushing the updates to Redshift as they occur.

Data Ingestion

If your ingestion workflow is fairly straightforward, where you simply need to transform files and copy them into a staging table that you then merge into your fact tables, then Redshift will handle this very well, since it is designed to parallelize this sort of operation effectively. But if you need to do some sort of quality control on the data that involves running a list of rules to flag potentially bad data after you push it to a staging table, you’ll probably run into performance problems, since Redshift can be very slow at performing updates (on Redshift, every update is actually a delete and insert). You may be better off running this sort of quality control before ingesting the data into Redshift.
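For reference, the straightforward path is roughly a COPY from S3 into a staging table followed by a delete-and-insert merge into the fact table. A sketch, with placeholder table names, bucket, and IAM role:

COPY measurement_staging
FROM 's3://my-bucket/incoming/'
IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-copy-role'
FORMAT AS CSV GZIP;

BEGIN;
DELETE FROM measurement
USING measurement_staging
WHERE measurement.sensor_id = measurement_staging.sensor_id
  AND measurement.date_time = measurement_staging.date_time;
INSERT INTO measurement SELECT * FROM measurement_staging;
COMMIT;
TRUNCATE measurement_staging;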

If you use an ETL workflow where an AWS Lambda function listens for files to be dropped into an S3 bucket, make sure that you throttle the Lambda function. Otherwise, a sudden influx of data files could overload your Redshift cluster if you don’t have enough nodes, and you may find your Lambda function timing out.
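One simple way to throttle is to cap how many copies of the function can run at once by setting a reserved concurrency limit. For example, with the AWS CLI (the function name here is hypothetical):

aws lambda put-function-concurrency \
  --function-name redshift-ingest-handler \
  --reserved-concurrent-executions 10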

Unsupported SQL operations

Although Redshift is based on PostgreSQL, it does not implement the full PostgreSQL specification. Sometimes you’ll spend time building what seems like a perfectly reasonable SQL query only to find out that Redshift doesn’t support it.

As an example, this is a query that uses a WHERE EXISTS clause to flag time-series data where the delta of the min and max value within an hour exceeds a threshold:

 UPDATE measurement
 SET qc_flagged = true
 FROM measurement AS outer_measurement
 WHERE measurement.sensor_id = 1 AND EXISTS (
   SELECT MIN(inner_measurement.date_time) AS date_time_hour
   FROM measurement AS inner_measurement
   WHERE inner_measurement.sensor_id = outer_measurement.sensor_id
     AND outer_measurement.date_time = inner_measurement.date_time
   GROUP BY EXTRACT(year FROM inner_measurement.date_time),
     EXTRACT(month FROM inner_measurement.date_time),
     EXTRACT(day FROM inner_measurement.date_time),
     EXTRACT(hour FROM inner_measurement.date_time)
   HAVING MAX(value_avg) - MIN(value_avg) > 5
 );

At the time of this writing, if you attempt to execute a query like this on Redshift, you’ll receive this error:

Invalid operation: This type of correlated subquery pattern is not supported yet;

Obviously, this sort of query can be restructured using joins, so it’s not the end of the world.
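For example, the hourly delta check above could be rewritten as an update against a derived table. This is only a sketch; it uses DATE_TRUNC in place of the separate EXTRACT calls:

UPDATE measurement
SET qc_flagged = true
FROM (
  SELECT sensor_id, DATE_TRUNC('hour', date_time) AS date_time_hour
  FROM measurement
  GROUP BY sensor_id, DATE_TRUNC('hour', date_time)
  HAVING MAX(value_avg) - MIN(value_avg) > 5
) flagged_hours
WHERE measurement.sensor_id = 1
  AND measurement.sensor_id = flagged_hours.sensor_id
  AND DATE_TRUNC('hour', measurement.date_time) = flagged_hours.date_time_hour;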

Here’s another unsupported query you might attempt to use when populating a metadata table:

INSERT INTO sensor (name, measurement_type_id) VALUES
 ('Temperature Sensor', (SELECT measurement_type_id FROM measurement_type WHERE name = 'temperature')),
 ('Humidity Sensor', (SELECT measurement_type_id FROM measurement_type WHERE name = 'humidity'));

The error received in this case is:

Invalid operation: cannot use subqueries in multi-row VALUES;
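In this case, one workaround is to drop the multi-row VALUES list and use INSERT ... SELECT with UNION ALL instead (a sketch against the same tables):

INSERT INTO sensor (name, measurement_type_id)
SELECT 'Temperature Sensor', measurement_type_id FROM measurement_type WHERE name = 'temperature'
UNION ALL
SELECT 'Humidity Sensor', measurement_type_id FROM measurement_type WHERE name = 'humidity';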

I’ve run into this for a few other operations, such as PIVOT, SELECT DISTINCT ON, and variable declarations.

In each case that I’ve encountered an issue like this I have been able to identify a workaround, but it hasn’t always been pleasant. In the case of PIVOT for instance, the only way to construct an equivalent query is to programmatically build out a large sequence of joins.

Of course, these unsupported operations are covered in the official documentation. But unless you go through it in detail and memorize which features are supported and which are not, you can easily be caught off guard when something doesn’t work.

A fairly comprehensive list of unsupported features is documented on this page: Unsupported PostgreSQL Features

Conclusion

What I’ve discussed here may make it seem like I have an unfavorable opinion of Redshift, but I do think it is a very powerful and useful database. When it is used as intended, it delivers on what it promises. But potential users should carefully look through the documentation and build some proofs of concept to test their use cases before fully committing to it.

Installing GDAL and rgdal on Linux to read KML files

Jared Gommels

The rgdal package is a library that provides bindings to GDAL (Geospatial Data Abstraction Library). It can be used to read many different raster and vector file types, including KML. In this post I’ll show how to install GDAL and rgdal on a *nix system such as Ubuntu or CentOS, and how to use rgdal to read a KML file.

Installing with package manager

GDAL can be installed through a package manager like apt-get or yum, but you likely won’t be able to install the latest version. At the time of this writing, the version available in the repositories is 1.11.3 for apt-get and 1.9.2 for yum, while the latest version is 2.1.1. I have run into some issues with 1.11.3 that were not present in 2.1.1, such as inconsistency in what to supply as the layer name when reading KML files.

To install GDAL using a package manager, you first need the correct repository configured on your system, which you should already have if you installed R through a package manager. If using Ubuntu, you can find that documentation here.

To install GDAL using apt-get (if using a Debian-based distribution such as Ubuntu):

sudo apt-get install libgdal-dev gdal-bin libproj-dev

If you’re using an RPM-based distribution such as CentOS or Red Hat, then this may be slightly more involved. You can try the following and see if it works:

sudo yum install gdal

If that doesn’t work, then you may need to follow this guide.

Then we can install the R packages from the R terminal:

> install.packages('sp')
> install.packages('rgdal')


Compiling from source

If you need to install a newer version of GDAL, you may need to compile it manually from source.

Before we compile GDAL, we need to install the Expat XML library; otherwise, reading KML files will not work.

Using apt-get:

sudo apt-get install libexpat1-dev

Or, using yum:

sudo yum install expat-devel.x86_64

Then we can download, compile, and install GDAL:

wget http://download.osgeo.org/gdal/2.1.1/gdal-2.1.1.tar.gz
tar xvf gdal-2.1.1.tar.gz
cd gdal-2.1.1
./configure

It’s important that the output shows “Expat support” as “yes”; otherwise GDAL will not properly load KML files, even though it may compile successfully:

Xerces-C support: no
NAS support: no
Expat support: yes
libxml2 support: no
Google libkml support: no

Then run:

sudo make
sudo make install

You can verify the installed version with:

gdal-config --version

We’ll also need to install PROJ.4. We can use the same approach of compiling from source, since the version in the package manager may be out of date:

wget http://download.osgeo.org/proj/proj-4.9.3.tar.gz
tar xvf proj-4.9.3.tar.gz
cd proj-4.9.3
./configure
sudo make
sudo make install

Now we can install the R packages. We first need to install the “sp” dependency if it is not installed already:

> install.packages('sp')

We can attempt to install “rgdal” now, but it’s possible an error will be thrown:

> install.packages('rgdal')

** testing if installed package can be loaded
Error in dyn.load(file, DLLpath = DLLpath, ...) :
unable to load shared object '/usr/local/lib/R/site-library/rgdal/libs/rgdal.so':
libgdal.so.20: cannot open shared object file: No such file or directory
Error: loading failed
Execution halted
ERROR: loading failed

To fix this, we have to find where libgdal.so.20 is located. On my system, it was in /usr/local/lib, so we can create a file such as /etc/ld.so.conf.d/my.gdal.conf with the contents: /usr/local/lib
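For example, one way to create that file from the shell:

echo '/usr/local/lib' | sudo tee /etc/ld.so.conf.d/my.gdal.conf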

Then run this to update the dynamic linker cache:

sudo ldconfig

Now installing “rgdal” should work.

Verifying that it works

If you need a KML file to test with, you can download a sample file from: https://developers.google.com/kml/documentation/KML_Samples.kml

In order to read the KML file with rgdal, we need to know the name of a layer in the file. Typically the layer names are determined by the contents of the XML file. However, in some instances (depending on the KML file and the version of GDAL installed), the layer name may be the actual name of the file. You can view the layers available in the file by running:

ogrinfo KML_Samples.kml

Note: If you get an error with this command saying that it cannot load the file with the available drivers, then it’s possible that you compiled from source and did not have the Expat library installed first.

You should see output similar to this (if using the sample file linked above):

Had to open data source read-only.
INFO: Open of `KML_Samples.kml'
using driver `KML' successful.
1: Placemarks (3D Point)
2: Highlighted Icon (3D Point)
3: Paths (3D Line String)
4: Google Campus (3D Polygon)
5: Extruded Polygon (3D Polygon)
6: Absolute and Relative (3D Polygon)

Then in R, we load the ‘rgdal’ library:

> library(rgdal)
Loading required package: sp
rgdal: version: 1.1-10, (SVN revision 622)
Geospatial Data Abstraction Library extensions to R successfully loaded
Loaded GDAL runtime: GDAL 2.1.1, released 2016/07/07
Path to GDAL shared files: /usr/local/share/gdal
Loaded PROJ.4 runtime: Rel. 4.9.3, 15 August 2016, [PJ_VERSION: 493]
Path to PROJ.4 shared files: (autodetected)
Linking to sp version: 1.2-3

We can then attempt to load the KML file:

> paths <- readOGR(dsn = 'KML_Samples.kml', layer = 'Paths')

You should get this output:

OGR data source with driver: KML
Source: 'KML_Samples.kml', layer: 'Paths'
with 6 features
It has 2 fields
Warning message:
In readOGR(dsn = 'KML_Samples.kml', layer = 'Paths') :
Z-dimension discarded
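The object returned by readOGR is a standard sp object (a SpatialLinesDataFrame for this layer, since the Z dimension is dropped), so it can be inspected and plotted as usual, for example:

> summary(paths)
> plot(paths)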

Documentation on the rgdal API can be found here.
