Chapter 10 Unidimensional categorical variables
In real-world datasets, categorical features are common, but they can be tricky to handle during both data pre-processing and visualization. In this chapter, we demonstrate several plotting options for unidimensional categorical variables with ggplot2.
10.1 Bar plot
There are two types of unidimensional categorical variables: nominal and ordinal. This section shows how each type should be plotted with a bar plot, using the same dataset for both.
10.1.1 Nominal data
Nominal data has no fixed category order, so the bars should be sorted from highest to lowest count (left to right, or top to bottom).
By default, R sorts factor levels in alphabetical order. To reorder them by a sorted value, you can try fct_reorder, fct_rev, or fct_relevel in the forcats package.
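As a minimal sketch of what these three functions do to the level order, the example below uses a made-up toy factor (the `fruit` and `count` vectors are hypothetical and not part of any dataset used in this chapter).

```r
library(forcats)

# Hypothetical toy data: one entry per category with a made-up count
fruit <- factor(c("apple", "banana", "cherry"))
count <- c(5, 12, 8)

# Default level order is alphabetical
levels(fruit)                          # "apple" "banana" "cherry"

# fct_reorder: order levels by another variable (here, descending count)
by_count <- fct_reorder(fruit, count, .desc = TRUE)
levels(by_count)                       # "banana" "cherry" "apple"

# fct_rev: reverse whatever the current level order is
levels(fct_rev(by_count))              # "apple" "cherry" "banana"

# fct_relevel: move named levels to the front by hand
levels(fct_relevel(fruit, "cherry"))   # "cherry" "apple" "banana"
```

Only the level order changes; the values stored in the factor stay the same, which is why the reordering can be done directly inside aes().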
library(vcdExtra)
library(ggplot2)
library(forcats)
library(dplyr)
Accident %>%
group_by(mode) %>%
summarise(freq = sum(Freq)) %>%
ggplot(aes(x=fct_reorder(mode,freq,.desc = TRUE),y=freq)) +
geom_bar(stat = "identity",fill = "cornflowerblue") +
ggtitle("Number of people with different modes in accident") +
xlab("") +
theme(panel.grid.major.x = element_blank())
… or top to bottom
Accident %>%
group_by(mode) %>%
summarise(freq = sum(Freq)) %>%
ggplot(aes(x=fct_rev(fct_reorder(mode,freq,.desc = TRUE)),y=freq)) +
geom_bar(stat = "identity",fill = "cornflowerblue") +
ggtitle("Number of people with different modes in accident") +
coord_flip() +
xlab("") +
theme(panel.grid.major.x = element_blank())
10.1.2 Ordinal data
Ordinal data has a fixed category order, so the bars should be sorted in the logical order of the categories (left to right).
Accident %>%
group_by(age) %>%
summarise(freq = sum(Freq)) %>%
ggplot(aes(x=age,y=freq)) +
geom_bar(stat = "identity",fill = "cornflowerblue") +
ggtitle("Number of people of different ages in accident") +
xlab("") +
theme(panel.grid.major.x = element_blank())
For a horizontal bar chart, sort in the logical order of the categories, starting at either the bottom or the top.
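When the alphabetical default does not match the logical order of an ordinal variable, the levels have to be set by hand before plotting. The sketch below assumes a made-up `rating` variable (hypothetical, not from the Accident data) and uses fct_relevel to put its levels in logical order.

```r
library(forcats)

# Hypothetical ordinal responses (made-up values for illustration)
rating <- factor(c("medium", "low", "high", "medium", "high", "medium"))

# Alphabetical default puts "high" before "low"
levels(rating)            # "high" "low" "medium"

# Spell out the logical order explicitly
rating_ordered <- fct_relevel(rating, "low", "medium", "high")
levels(rating_ordered)    # "low" "medium" "high"

# Counts now come out in logical order, ready for a bar plot
table(rating_ordered)
```

Wrapping the result in fct_rev() would give the reverse logical order, which is one way to control whether a horizontal chart starts at the bottom or the top.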
10.2 Cleveland dot plot
A Cleveland dot plot is a good alternative to a bar plot: it stays readable and easy to compare even with more data. As with the nominal bar plot, the categories need to be reordered by value.
library(Lock5withR)
ggplot(USStates, aes(x = IQ, y = fct_reorder(State, IQ))) +
geom_point(color = "blue") +
ggtitle("Avg. IQ for US states") +
ylab("") +
theme_linedraw()
10.2.1 Cleveland dot plot with multiple dots
Here the states are sorted by obesity rate.
library(tidyr)
USStates %>%
select('State','Obese','HeavyDrinkers') %>%
gather(key='type',value='percentage',Obese,HeavyDrinkers) %>%
ggplot(aes(x=percentage, y=fct_reorder2(State,type=='Obese',percentage,.desc=FALSE), color = type)) +
geom_point() +
ggtitle("Obese rate & heavy drinker rate in US") +
ylab("") +
theme_linedraw()
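The fct_reorder2 call above is compact, so here is a small sketch of why it sorts by the obesity rate only. It assumes a made-up three-state table (the `wide` data frame is hypothetical) and uses pivot_longer, the newer tidyr replacement for gather, to reshape it.

```r
library(tidyr)
library(forcats)

# Hypothetical toy data in wide form (made-up percentages)
wide <- data.frame(
  State = c("A", "B", "C"),
  Obese = c(30, 20, 25),
  HeavyDrinkers = c(5, 9, 2)
)

# Reshape to long form: one row per state and measure
long <- pivot_longer(wide, c(Obese, HeavyDrinkers),
                     names_to = "type", values_to = "percentage")

# Within each state, fct_reorder2 (with its default summary function)
# picks the percentage belonging to the largest value of its second
# argument. type == "Obese" is TRUE only on the Obese row, so each state
# is summarised by its Obese value; .desc = FALSE sorts ascending.
state_sorted <- fct_reorder2(long$State, long$type == "Obese",
                             long$percentage, .desc = FALSE)
levels(state_sorted)   # "B" "C" "A"
```

The HeavyDrinkers values play no part in the ordering, so the dots for that measure can appear out of order along each row.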
10.2.2 Cleveland dot plot with facets
You can split the graph into small multiples using facet_grid().
ggplot(USStates, aes(x = IQ, y = reorder(State, IQ))) +
geom_point(color = "blue") +
facet_grid(Pres2008 ~ ., scales = "free_y", space = "free_y") +
ggtitle('IQ of US state residents faceted by Pres2008') +
xlab("IQ") +
ylab('') +
theme_linedraw() +
theme(panel.grid.major.x = element_blank(), panel.grid.minor.x = element_blank())
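To see what the scales and space arguments contribute, the sketch below builds the same kind of faceted dot plot on a made-up data frame (`toy`, with hypothetical `item`, `value`, and `group` columns). With scales = "free_y" each panel lists only its own categories, and with space = "free_y" panel heights are proportional to the number of categories, so rows stay evenly spaced.

```r
library(ggplot2)

# Hypothetical toy data: five items split unevenly across two groups
toy <- data.frame(
  item  = c("a", "b", "c", "d", "e"),
  value = c(3, 7, 5, 9, 1),
  group = c("g1", "g1", "g1", "g2", "g2")
)

p <- ggplot(toy, aes(x = value, y = reorder(item, value))) +
  geom_point() +
  # one panel per group, stacked vertically; each panel shows only
  # its own items and gets height in proportion to how many it has
  facet_grid(group ~ ., scales = "free_y", space = "free_y") +
  ylab("")

p
```

Dropping both arguments would repeat all five items in every panel, leaving empty rows wherever an item does not belong to that group.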
10.2.3 Example: How Much People in the Trump Administration Are Worth
# create dot plot theme
theme_dotplot <-
theme_bw(16) +
theme(axis.text.y = element_text(size = rel(.8)), axis.ticks.y = element_blank(),
axis.title.x = element_text(), axis.text = element_text(face = "bold"),
plot.background = element_rect(fill = "lightcyan2"),
panel.background = element_rect(fill = "moccasin"),
panel.grid.major.x = element_line(linewidth = 0.5),
panel.grid.major.y = element_line(linewidth = 0.5, color = "lightblue"),
panel.grid.minor.x = element_blank(),
strip.text = element_text(size = rel(.7)), legend.position = "top")
# data source:
# NYT, How Much People in the Trump Administration Are Worth
# https://www.nytimes.com/interactive/2017/04/01/us/politics/how-much-people-in-the-trump-administration-are-worth-financial-disclosure.html
df <- read.csv("data/Assets.csv")
# change units to millions
df$Assets <- df$Assets / 1000000
ggplot(df, aes(x = Assets, y = reorder(Name, Assets))) +
geom_point() +
ggtitle("How Much People in the Trump\nAdministration Are Worth") +
xlab("Assets in Millions $") +
ylab("") +
theme_dotplot
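# create Panel column: bin Assets at the five-number summary;
# include.lowest = TRUE keeps the minimum value in the first bin
df <- df |>
mutate(Panel = cut(Assets, breaks = fivenum(Assets), include.lowest = TRUE,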
labels = c("$66k - $604k", "$1 - 3.5 Million",
"$4 - 12 Million", "$18 Million+"))) |> mutate(Panel = fct_rev(Panel))
ggplot(df, aes(x = Assets, y = reorder(Name, Assets))) +
geom_point() +
facet_wrap(~Panel, ncol = 1, scales = "free") +
ggtitle("How Much People in the Trump\nAdministration Are Worth") +
xlab("Assets in Millions $") +
ylab("") +
theme_dotplot
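The Panel column above is built by cutting Assets at the five-number summary, which gives four bins of roughly equal size. The sketch below shows the same idea on a made-up vector (`x` and the bin labels are hypothetical) and why the lowest break needs to be closed.

```r
# Hypothetical toy values
x <- 1:9

# Five-number summary: min, lower hinge, median, upper hinge, max
breaks <- fivenum(x)
breaks                      # 1 3 5 7 9

# Intervals are (lo, hi] by default, so the minimum would fall outside
# the first bin and become NA
is.na(cut(x, breaks = breaks)[1])   # TRUE

# include.lowest = TRUE closes the first interval: [1, 3]
bins <- cut(x, breaks = breaks, include.lowest = TRUE,
            labels = c("Q1", "Q2", "Q3", "Q4"))
table(bins)                 # Q1: 3, Q2: 2, Q3: 2, Q4: 2
```

Because the break points depend on the data, any hand-written bin labels need to be checked against fivenum() whenever the data changes.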