# 11 Two continuous variables

In this chapter, we will look at techniques that explore the relationships between two continuous variables.

## 11.1 Scatterplot

### 11.1.1 Basics and implications

For the following example, we use the data set `SpeedSki`.

```
library(GDAdata)
library(ggplot2)
ggplot(SpeedSki, aes(Year, Speed)) +
  geom_point() +
  labs(x = "Birth year", y = "Speed achieved (km/hr)") +
  ggtitle("Skiers by birth year and speed achieved")
```

In our example, we simply use `geom_point` on the variables `Year` and `Speed` to create the scatterplot. We want to see whether there is a relationship between a skier's birth year and the speed he or she can achieve. From the graph, no such relationship appears to exist. Overall, scatterplots are very useful for understanding the correlation (or lack thereof) between variables: they show whether a relationship exists and whether it is positive or negative. However, don’t mistake correlation in a scatterplot for causation!
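To back up the visual impression with a number, a correlation coefficient can be computed for the same two variables. The following is a minimal sketch, assuming the `GDAdata` package is installed. The choice of Pearson and Spearman coefficients and the `use = "complete.obs"` argument are illustrative assumptions, not part of the original example.

```r
library(GDAdata)

# Pearson correlation between birth year and speed.
# Values near 0 suggest little linear association.
# use = "complete.obs" drops rows with missing values, if there are any.
r <- cor(SpeedSki$Year, SpeedSki$Speed, use = "complete.obs")
r

# Spearman rank correlation is less sensitive to outliers
# and to monotone but nonlinear relationships.
rho <- cor(SpeedSki$Year, SpeedSki$Speed,
           use = "complete.obs", method = "spearman")
rho
```

A coefficient close to zero would be consistent with the scatterplot, but it still says nothing about causation.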

### 11.1.2 Overplotting

In some situations a scatterplot suffers from overplotting because so many points overlap. Consider the following example from class. To save time, we randomly sample 20% of the data in advance.

```
library(dplyr)
library(ggplot2movies)
# Randomly keep 20% of the rows to speed up plotting
sample <- slice_sample(movies, prop = 0.2)
ggplot(sample, aes(x = votes, y = rating)) +
  geom_point() +
  ggtitle("Votes vs. rating") +
  theme_classic()
```

To create better visuals, we can use:

- Alpha blending: `alpha=...`
- Open circles: `pch=21`
- Smaller circles: `size=...` or `shape="."`

```
library(gridExtra)
# Semi-transparent points: dense regions appear darker
f1 <- ggplot(sample, aes(x = votes, y = rating)) +
  geom_point(alpha = 0.3) +
  theme_classic() +
  ggtitle("Alpha blending")
# Open circles: overlapping points remain distinguishable
f2 <- ggplot(sample, aes(x = votes, y = rating)) +
  geom_point(pch = 21) +
  theme_classic() +
  ggtitle("Open circle")
# Smaller points: less overlap per point
f3 <- ggplot(sample, aes(x = votes, y = rating)) +
  geom_point(size = 0.5) +
  theme_classic() +
  ggtitle("Smaller circle")
grid.arrange(f1, f2, f3, nrow = 3)
```

Other methods that directly deal with the data:

- Randomly sample data, as shown in the first code chunk using `slice_sample`
- Subset: split data into bins using `ntile(votes, 10)`
- Remove outliers
- Transform to log scale
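Two of these data-level approaches can be sketched as follows. This is an illustrative example only: the object names `movies_sample`, `movies_binned`, `p_log`, and `p_bins` are hypothetical, and the use of `scale_x_log10()` and `facet_wrap()` is one possible way to apply the ideas, not necessarily the one used in class.

```r
library(dplyr)
library(ggplot2)
library(ggplot2movies)

set.seed(1)  # make the random sample reproducible
movies_sample <- slice_sample(movies, prop = 0.2)

# Log scale: spreads out the many movies with few votes
p_log <- ggplot(movies_sample, aes(x = votes, y = rating)) +
  geom_point(alpha = 0.3) +
  scale_x_log10() +
  theme_classic() +
  ggtitle("Votes (log scale) vs. rating")

# Subset: assign each movie to one of 10 equal-sized bins by votes,
# then draw one panel per bin
movies_binned <- movies_sample |>
  mutate(votes_bin = ntile(votes, 10))

p_bins <- ggplot(movies_binned, aes(x = votes, y = rating)) +
  geom_point(alpha = 0.3) +
  facet_wrap(~votes_bin, scales = "free_x") +
  theme_classic() +
  ggtitle("Votes vs. rating by votes decile")

p_log
p_bins
```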

### 11.1.3 Interactive scatterplot

You can create an interactive scatterplot using `plotly`. In the following example, we take 1% of the movie data set and keep only movies released after 2000 to present a better visual. We plot votes vs. rating and color the points by release year. In this graph:

- You can hover over a point to see the title of the movie
- You can double-click a year in the legend to isolate that year
- You can zoom into a certain part of the graph to better understand the data points

```
library(plotly)
# Take a 1% sample, then keep movies released after 2000
sample2 <- slice_sample(movies, prop = 0.01) |>
  filter(year > 2000)
plot_ly(sample2, x = ~votes, y = ~rating,
        color = ~as.factor(year), text = ~title,
        hoverinfo = 'text')
```

### 11.1.4 Modifications

#### 11.1.4.1 Contour lines

Contour lines give a sense of the density of the data at a glance.

For these contour maps, we will use the `SpeedSki` dataset.

Contour lines can be added to the plot using `geom_density_2d()`. They work best when combined with other layers.

```
ggplot(SpeedSki, aes(Year, Speed)) +
geom_density_2d(bins=5) +
geom_point() +
ggtitle("Scatter plot with contour line")
```

You can use `bins` to control the number of contour bins.

#### 11.1.4.2 Scatterplot matrices

If you want to compare multiple parameters to each other, consider using a scatterplot matrix. This will allow you to show many comparisons in a compact and efficient manner.

For these scatterplot matrices, we use the `movies` dataset from the `ggplot2movies` package.

By default, the base R `plot()` function creates a scatterplot matrix when given a data frame with multiple variables:

```
sample3 <- slice_sample(movies, prop = 0.01) # sample data
# Keep only the continuous variables to compare
splomvar <- sample3 |>
  dplyr::select(length, budget, votes, rating, year)
plot(splomvar)
```

While this is quite useful for personal exploration of a dataset, it is **not** recommended for presentation purposes. The Hermann grid illusion makes this plot very difficult to examine.

## 11.2 Heatmaps

### 11.2.1 Basics and implications

In the following example, we still use the `SpeedSki` data set.

```
ggplot(SpeedSki, aes(Year, Speed)) +
geom_bin2d()
```

To create a heatmap, simply substitute `geom_point()` with `geom_bin2d()`. Generally, heatmaps are like a combination of scatterplots and histograms: they allow you to compare different parameters while also seeing their relative distributions.
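To make the histogram analogy concrete, the sketch below counts points per rectangular cell by hand using base R `cut()` and `table()`. It uses simulated data (the vectors `x` and `y` are hypothetical stand-ins, not `SpeedSki`), and it only approximates the idea behind a binned heatmap. It is not how `geom_bin2d()` is implemented.

```r
set.seed(42)
x <- rnorm(500, mean = 1970, sd = 8)   # simulated stand-in for birth year
y <- rnorm(500, mean = 200, sd = 15)   # simulated stand-in for speed

# Split each axis into 5 equal-width intervals, as a histogram would
x_bin <- cut(x, breaks = 5)
y_bin <- cut(y, breaks = 5)

# Cross-tabulate: each cell holds the number of points in that rectangle.
# A heatmap maps these counts to fill color.
counts <- table(x_bin, y_bin)
counts
```

The row sums of `counts` correspond to a histogram of `x`, and the column sums to a histogram of `y`.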

### 11.2.2 Modifications

For the following section, we introduce some variations on heatmaps.

#### 11.2.2.1 Change number of bins / binwidth

By default, `geom_bin2d()` uses 30 bins in each direction. Similar to a histogram, we can change the number of bins or the binwidth.

```
ggplot(SpeedSki, aes(Year, Speed)) +
geom_bin2d(binwidth = c(5,5)) +
ggtitle("Changing binwidth")
```

Notice that we specify the binwidth for both the x and y axes.
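The number of bins can be set instead of the binwidth. A minimal sketch, assuming the `bins` argument of `geom_bin2d()`; the value 10 and the object name `p_bins10` are arbitrary choices for illustration.

```r
library(GDAdata)
library(ggplot2)

# Use 10 bins along each axis instead of the default 30
p_bins10 <- ggplot(SpeedSki, aes(Year, Speed)) +
  geom_bin2d(bins = 10) +
  ggtitle("Changing number of bins")
p_bins10
```

Fewer bins give a coarser picture with higher counts per tile, while more bins show finer detail but leave many tiles holding a single observation.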

#### 11.2.2.2 Combine with a scatterplot

```
ggplot(SpeedSki, aes(Year, Speed)) +
geom_bin2d(binwidth = c(10, 10), alpha = .4) +
geom_point(size = 2) +
ggtitle("Combined with scatterplot")
```

#### 11.2.2.3 Change color scale

You can change the continuous color scale:

```
ggplot(SpeedSki, aes(Year, Speed)) +
geom_bin2d() +
ggtitle("Changing color scale") +
scale_fill_viridis_c()
```

#### 11.2.2.4 Hex heatmap

One alternative is a hex heatmap. You can create the graph using `geom_hex`:

```
ggplot(SpeedSki, aes(Year, Speed)) +
geom_hex(binwidth = c(10,10)) +
ggtitle("Hex heatmap")
```

#### 11.2.2.5 Alternative approach to color

If you look at all the previous examples, you might notice that lighter colors correspond to denser regions, which is somewhat counter-intuitive. The following example suggests an alternative color scale.

```
ggplot(SpeedSki, aes(Year, Speed)) +
geom_hex(bins=12) +
scale_fill_gradient(low = "grey", high = "purple") +
theme_classic(18) +
ggtitle("Alternative approach to color")
```
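A similar "darker means denser" effect can be obtained with a viridis scale by reversing its direction. This is a sketch of one possible variation, not a recommendation from the original text. It uses `geom_bin2d()` rather than `geom_hex()` only to avoid the extra `hexbin` dependency, and the object name `p_rev` is hypothetical.

```r
library(GDAdata)
library(ggplot2)

# direction = -1 reverses the viridis palette so that
# high counts map to the dark end of the scale
p_rev <- ggplot(SpeedSki, aes(Year, Speed)) +
  geom_bin2d(bins = 12) +
  scale_fill_viridis_c(direction = -1) +
  theme_classic(18) +
  ggtitle("Reversed viridis scale")
p_rev
```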