14 Chart: Parallel Coordinate Plots

This chapter originated as a community contribution created by aashnakanuga.

14.1 Overview

This section covers how to create static parallel coordinate plots with the GGally package.

For interactive parallel coordinate plots, check out the parcoords package. The package vignette provides instructions on using this package.

14.3 Simple examples

Woah woah woah! Too complicated! Much simpler, please.

Let us use the popular “iris” dataset for this example:
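A minimal sketch of such a plot, assuming the GGally package is installed; the choice of columns 1 to 4 (the four numeric measurements) is ours:

```r
library(GGally)

# Draw one line per flower across the four numeric measurements.
# Column 5 (Species) is categorical, so it is left out of the axes.
p_simple <- ggparcoord(iris, columns = 1:4)
p_simple
```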

For more information about the dataset, type ?iris into the console.

14.4 Theory

For more info about parallel coordinate plots and multivariate continuous data, check out Chapter 6 of the textbook.

14.5 When to use

Parallel coordinate plots are generally used to explore relationships among multiple continuous variables: they help reveal the general trend that the data follows, as well as the specific cases that are outliers.

Please keep in mind that parallel coordinate plots are not the ideal graph to use when only categorical variables are involved. We can include a few categorical variables in our axes or for the sake of clustering, but using a lot of categorical variables results in overlapping profiles, which makes the plot difficult to interpret.

We can also use parallel coordinate plots to identify trends in specific clusters: highlight each cluster in a different color by passing your column to the groupColumn argument of ggparcoord(), and you are good to go!

Sometimes, parallel coordinate plots are very helpful in graphing time series data, where we have information stored at regular time intervals. Each vertical axis then represents a time point, and we pass the corresponding columns to the columns argument of ggparcoord().

14.6 Considerations

14.6.1 When do I use clustering?

Generally, you use clustering when you want to observe a pattern in a set of cases with some specific properties. This may include dividing all cases into clusters based on their value for a specific categorical variable. But you can even use a continuous variable; for example, dividing all cases into two groups based on some continuous variable such as height: those who have a height greater than 150cm and those who do not.

Let us look at an example using our iris dataset, clustering on the “Species” column:
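One way this could look, as a sketch that assumes GGally is installed and colors the lines by the Species column:

```r
library(GGally)

# groupColumn accepts a column name or index; each species gets its own color.
p_grouped <- ggparcoord(iris, columns = 1:4, groupColumn = "Species")
p_grouped
```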

14.6.2 Deciding the value of alpha

In practice, parallel coordinate plots are not going to be used for very small datasets. Your data will likely have thousands of cases, and it can get very difficult to observe anything when so many of them overlap. Setting the alphaLines parameter to a value between zero and one reduces the opacity of all lines, which gives a clearer view of what is going on when many cases overlap.

Again we use our iris data, but set alphaLines to 0.5. Observe how much easier it is now to trace the course of every case:
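A sketch of the same grouped plot with semi-transparent lines, assuming GGally is installed:

```r
library(GGally)

# alphaLines = 0.5 makes every line half transparent,
# so regions where many cases overlap appear darker.
p_alpha <- ggparcoord(iris, columns = 1:4, groupColumn = "Species",
                      alphaLines = 0.5)
p_alpha
```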

14.6.3 Scales

When we use ggparcoord(), we have an option to set the scale attribute, which will scale all variables so we can compare their values.

The different types of scales are as follows:

  1. std: default value; subtracts the mean and divides by the standard deviation of each variable
  2. robust: subtracts the median and divides by the median absolute deviation of each variable
  3. uniminmax: scales each variable so that its minimum is at 0 and its maximum at 1
  4. globalminmax: no scaling; original values are used, and the axis range runs from the global minimum to the global maximum
  5. center: scales with uniminmax, then centers each variable at the value of the summary statistic given in scaleSummary
  6. centerObs: scales with uniminmax, then centers each variable at the value of the observation given in centerObsID

Let us create a sample dataset and see how values on the y-axis change for different scales:
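The sketch below builds one plot per scale so they can be compared. The data frame `toy` and its column names are made up for illustration; the two columns deliberately have very different ranges so that the effect of scaling is visible.

```r
library(GGally)
library(ggplot2)

# Hypothetical sample data: one column with small values, one with large values.
toy <- data.frame(
  small = c(2, 5, 3, 8, 6, 4),
  large = c(120, 180, 150, 200, 90, 160)
)

scales <- c("std", "robust", "uniminmax", "globalminmax", "center", "centerObs")

# Build one plot per scale option, titled with the option's name.
scale_plots <- lapply(scales, function(s) {
  ggparcoord(toy, columns = 1:2, scale = s) + ggtitle(s)
})
names(scale_plots) <- scales

# Print any one of them to inspect its y-axis, for example:
scale_plots[["globalminmax"]]
```

Printing the plots one after another (or arranging them in a grid with a package of your choice) shows how the y-axis values differ between scales.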

14.6.4 Order of the variables

Deciding the order of the variables on the x-axis depends on your application. It can be specified using the order parameter.

The different types of order are as follows:

  1. default: the order in which the variables are listed in the columns argument
  2. given vector: a vector of column indices in the desired order (used most frequently)
  3. anyClass: order based on how well a variable separates any one class from the rest (F-statistic, each class vs. the rest)
  4. allClass: order based on the overall variation between classes (F-statistic from an ANOVA with the group column as the explanatory variable)
  5. skewness: order from most to least skewed
  6. Outlying: order based on the Outlying measure
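Two of the options above can be sketched as follows, assuming GGally is installed. The particular index vector is an arbitrary choice for illustration; the indices refer to columns of the data frame.

```r
library(GGally)

# Explicit order: Petal.Width, Sepal.Width, Sepal.Length, Petal.Length.
p_manual <- ggparcoord(iris, columns = 1:4, groupColumn = "Species",
                       order = c(4, 2, 1, 3))

# Data-driven order: "anyClass" needs a groupColumn to define the classes.
p_anyclass <- ggparcoord(iris, columns = 1:4, groupColumn = "Species",
                         order = "anyClass")

p_manual
p_anyclass
```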

14.7 Modifications

14.7.1 Flipping the coordinates

Flipping the coordinates is a good idea when there are many variables and their names overlap on the x-axis:
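Because ggparcoord() returns a ggplot object, one way to do this is to add ggplot2's coord_flip(), as in this sketch:

```r
library(GGally)
library(ggplot2)

# coord_flip() swaps the axes, so variable names are listed vertically
# and no longer compete for horizontal space.
p_flipped <- ggparcoord(iris, columns = 1:4, groupColumn = "Species") +
  coord_flip()
p_flipped
```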

14.7.3 Using splines

Generally, we use splines if we have a column where there are a lot of repeating values, which adds a lot of noise. The case lines become more and more curved when we set a higher spline factor, which removes noise and makes for easier observations of trends. It can be set using the splineFactor attribute:
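A sketch with the iris data, assuming GGally is installed; the factor of 10 is an arbitrary value chosen for illustration:

```r
library(GGally)

# A larger splineFactor interpolates more points per line,
# which yields smoother, more curved case lines.
p_spline <- ggparcoord(iris, columns = 1:4, groupColumn = "Species",
                       splineFactor = 10)
p_spline
```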

14.7.4 Adding boxplots to the graph

You can add boxplots to your graph, which can be useful for observing the trend of median values. Generally, they are added to data with a lot of variables - for example, if we plot time series data.
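A sketch using the boxplot argument of ggparcoord(), assuming GGally is installed. The low alphaLines value is our choice, made so that the boxplots stay visible behind the case lines.

```r
library(GGally)

# boxplot = TRUE overlays one boxplot per axis variable.
p_box <- ggparcoord(iris, columns = 1:4, boxplot = TRUE, alphaLines = 0.2)
p_box
```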

14.8 Other Packages

There are a number of packages that have functions for creating parallel coordinate plots:

  • parcoords::parcoords() – great interactive option
  • ggplot2::geom_line() – not specific to parallel coordinate plots, but they are easy to create with the group aesthetic.
  • lattice::parallelplot()
  • MASS::parcoord()
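As an illustration of the ggplot2::geom_line() route, the sketch below reshapes iris to long format and draws one line per case. It assumes the tidyr package is available for reshaping; the helper column names `id`, `variable`, and `value` are made up for this example. No scaling is applied, so the result resembles the globalminmax scale.

```r
library(ggplot2)
library(tidyr)

# Give every case an identifier, then stack the four measurements
# into variable/value pairs (one row per case and variable).
iris_long <- iris
iris_long$id <- seq_len(nrow(iris_long))
iris_long <- pivot_longer(iris_long, cols = 1:4,
                          names_to = "variable", values_to = "value")

# group = id connects the points that belong to the same case.
p_lines <- ggplot(iris_long,
                  aes(x = variable, y = value, group = id, color = Species)) +
  geom_line(alpha = 0.5)
p_lines
```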

14.9 External Resources