library(GDAdata)
library(ggplot2)
ggplot(SpeedSki, aes(Year, Speed)) +
geom_point() +
labs(x = "Birth year", y = "Speed achieved (km/hr)") +
ggtitle("Skiers by birth year and speed achieved")
11 Two continuous variables
In this chapter, we will look at techniques that explore the relationships between two continuous variables.
11.1 Scatterplot
11.1.1 Basics and implications
For the following example, we use data set SpeedSki
.
In our example, we simply use geom_point
on variables Year
and Speed
to create the scatterplot. we try to capture if there is a relationship between the age of a player and the speed he/she can achieve. From the graph, it seems such relationship does not exist. Overall, scatterplots are very useful in understanding the correlation (or lack thereof) between variables. The scatterplot gives a good idea of whether that relationship is positive or negative and if there’s a correlation. However, don’t mistake correlation in a scatterplot for causation!
11.1.2 Overplotting
In some situations a scatter plot faces the problem of overplotting as there are so many points overlapping. Consider the following example from class. To save time, we randomly sample 20% of the data in advance.
library(dplyr)
library(ggplot2movies)
<- slice_sample(movies, prop = 0.2)
sample
ggplot(sample,aes(x=votes,y=rating)) +
geom_point() +
ggtitle("Votes vs. rating") +
theme_classic()
To create better visuals, we can use:
Alpha blending -
alpha=...
Open circles -
pch=21
smaller circles -
size=...
orshape="."
library(gridExtra)
<- ggplot(sample,aes(x=votes,y=rating)) +
f1 geom_point(alpha=0.3) +
theme_classic() +
ggtitle("Alpha blending")
<- ggplot(sample,aes(x=votes,y=rating)) +
f2 geom_point(pch = 21) +
theme_classic() +
ggtitle("Open circle")
<- ggplot(sample,aes(x=votes,y=rating)) +
f3 geom_point(size=0.5) +
theme_classic() +
ggtitle("Smaller circle")
grid.arrange(f1, f2, f3,nrow = 3)
Other methods that directly deal with the data:
Randomly sample data - as shown in the first code chunk using
sample_n
Subset - split data into bins using
ntile(votes, 10)
Remove outliers
Transform to log scale
11.1.3 Interactive scatterplot
You can create an interactive scatterplot using plotly
. In the following example, we take 1% of the movie data set to present a better visual. We plotted the votes vs. rating and grouped by the year they are released. In this graph:
You can hover on to the points to see the title of the movie
You can double click on the year legend to look at a certain year
You can zoom into a certain part of the graph to better understand the data points.
library(plotly)
<- slice_sample(movies,prop=0.01) %>%
sample2 filter(year > 2000)
plot_ly(sample2, x = ~votes, y = ~rating,
color = ~as.factor(year), text= ~title,
hoverinfo = 'text')
11.1.4 Modifications
11.1.4.1 Contour lines
Contour lines give a sense of the density of the data at a glance.
For these contour maps, we will use the SpeedSki
dataset.
Contour lines can be added to the plot using geom_density_2d() and contour lines work best when combined with other layers
ggplot(SpeedSki, aes(Year, Speed)) +
geom_density_2d(bins=5) +
geom_point() +
ggtitle("Scatter plot with contour line")
You can use bins
to control the number of contour bins.
11.1.4.2 Scatterplot matrices
If you want to compare multiple parameters to each other, consider using a scatterplot matrix. This will allow you to show many comparisons in a compact and efficient manner.
For these scatterplot matrices, we use the movies
dataset from the ggplot2movies
package.
As a default, the base R plot() function will create a scatterplot matrix when given multiple variables:
<- slice_sample(movies,prop=0.01) #sample data
sample3
<- sample3 %>%
splomvar ::select(length, budget, votes, rating, year)
dplyr
plot(splomvar)
While this is quite useful for personal exploration of a dataset, it is not recommended for presentation purposes. Something called the Hermann grid illusion makes this plot very difficult to examine.
11.2 Heatmaps
11.2.1 Basics and implications
In the following example, we still use the SpeedSki
data set.
ggplot(SpeedSki, aes(Year, Speed)) +
geom_bin2d()
To create a heatmap, simply substitute geom_point()
with geom_bin2d()
. Generally, heat maps are like a combination of scatterplots and histograms: they allow you to compare different parameters while also seeing their relative distributions.
11.2.2 Modifications
For the following section, we introduce some variations on heatmaps.
11.2.2.1 Change number of bins / binwidth
By default, geom_bin2d()
use 30 bins. Similar to a histogram, we can change the number of bins or binwidth.
ggplot(SpeedSki, aes(Year, Speed)) +
geom_bin2d(binwidth = c(5,5)) +
ggtitle("Changing binwidth")
Notice we are specifying the binwidth for both x and y axis.
11.2.2.2 Combine with a scatterplot
ggplot(SpeedSki, aes(Year, Speed)) +
geom_bin2d(binwidth = c(10, 10), alpha = .4) +
geom_point(size = 2) +
ggtitle("Combined with scatterplot")
11.2.2.3 Change color scale
You can change the continuous scale of color
ggplot(SpeedSki, aes(Year, Speed)) +
geom_bin2d() +
ggtitle("Changing color scale") +
scale_fill_viridis_c()
11.2.2.4 Hex heatmap
One alternative is a hex heatmap. You can create the graph using geom_hex
ggplot(SpeedSki, aes(Year, Speed)) +
geom_hex(binswidth = c(10,10)) +
ggtitle("Hex heatmap")
11.2.2.5 Alternative approach to color
If you look at all the previous examples, you might notice that lighter points correspond to more clustered points, which is somewhat counter-intuitive. The following example suggests an alternative approach in color scale.
ggplot(SpeedSki, aes(Year, Speed)) +
geom_hex(bins=12) +
scale_fill_gradient(low = "grey", high = "purple") +
theme_classic(18) +
ggtitle("Alternative approach to color")