In this chapter, we focus on multivariate categorical data. Note that a multivariate plot is not the same as a multiple-variable plot: the former is used for analyses with multiple outcomes.
Bar charts are used to display the frequencies of multidimensional categorical variables. The next few plots show different kinds of bar charts.
Use position = "dodge" to create a grouped bar chart.
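As a minimal sketch of a grouped bar chart, the example below uses R's built-in Titanic table rather than this chapter's data (an assumption made only so the code is self-contained). The table is first collapsed to counts by Class and Survived, and the bars are then placed side by side with position = "dodge".

```r
library(ggplot2)

# Assumption: the built-in Titanic table stands in for the chapter's data.
# Collapse the 4-way table to a Class x Survived table of counts.
titanic_df <- as.data.frame(margin.table(Titanic, c(1, 4)))

# One bar per Class/Survived combination, dodged side by side within Class
p_dodge <- ggplot(titanic_df, aes(x = Class, y = Freq, fill = Survived)) +
  geom_col(position = "dodge")

p_dodge
```

Without position = "dodge", geom_col() stacks the bars within each class instead of grouping them.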
In this section, we show how to use a chi-square test to check the independence of two features.
We will use the following example to answer: Are older Americans more interested in local news than younger Americans? The dataset is collected from here.
The hypotheses for the chi-square test are:
Null hypothesis: Age and tendency to follow local news are independent
Alternative hypothesis: Age and tendency to follow local news are NOT independent
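A minimal sketch of how this test could be run in base R. It assumes the observed counts are typed in by hand as a matrix called localmat (the name that appears in the test output); how the original object was built from the dataset is not shown, so that step is an assumption.

```r
# Observed counts: age groups in rows, local-news following in columns
# (values copied from the observed table in this section)
localmat <- matrix(
  c( 428, 2423,
    2791, 7176,
    4242, 6921,
    4583, 6328),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("18-29", "30-49", "50-64", "65+"),
                  c("Followers", "Nonfollowers"))
)

# Pearson's chi-squared test of independence
X <- chisq.test(localmat)

X$observed  # observed counts
X$expected  # counts expected if age and following were independent
X           # test statistic, degrees of freedom, and p-value
```

The expected count in each cell is the row total times the column total divided by the grand total, which is what X$expected reports.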
##       Followers Nonfollowers
## 18-29       428         2423
## 30-49      2791         7176
## 50-64      4242         6921
## 65+        4583         6328

##       Followers Nonfollowers
## 18-29  984.1065     1866.893
## 30-49 3440.4032     6526.597
## 50-64 3853.2378     7309.762
## 65+   3766.2526     7144.747

##
##  Pearson's Chi-squared test
##
## data:  localmat
## X-squared = 997.48, df = 3, p-value < 2.2e-16
We compare the observed counts with the expected counts. The very small p-value (< 2.2e-16) leads us to reject the null hypothesis: age and tendency to follow local news are not independent. We can now move on to the next stage, mosaic plots.
Mosaic plots are used for visualizing data from two or more qualitative variables to show their proportions or associations.
Here are some best-practice criteria for mosaic plots:
The dependent variable is split last and split horizontally
Fill is set to the dependent variable
Other variables are split vertically
The most important level of the dependent variable is closest to the x-axis and darkest (or the most noticeable shade)
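The criteria above can be sketched with vcd::mosaic(). This example assumes the built-in Titanic table as stand-in data and treats Survived as the dependent variable; the two fill colors are arbitrary choices for illustration.

```r
library(vcd)

# Assumption: built-in Titanic data, collapsed to Class x Survived
tab <- margin.table(Titanic, c(1, 4))

# Survived is the dependent variable: it appears on the left of the formula,
# is split last and horizontally ("h"), and determines the fill.
# Class is split first and vertically ("v").
# The fill colors follow the level order of Survived (No, Yes), so the
# last level gets the darker shade.
vcd::mosaic(Survived ~ Class, data = tab,
            direction = c("v", "h"),
            highlighting_fill = c("grey85", "steelblue4"))
```

If a different level should sit closest to the x-axis, one option is to reorder the factor levels before plotting, for example with forcats::fct_relevel().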
pairs method to plot a matrix of pairwise mosaic plots for class
A spine plot is a mosaic plot with straight, parallel cuts in one dimension (“spines”) and only one variable cutting in the other direction.
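As a minimal sketch of a spine plot, base R's spineplot() can draw one directly from a two-way table. The built-in Titanic data is an assumption used only to keep the example self-contained; it is not the chapter's own spine plot code.

```r
# Assumption: built-in Titanic data, collapsed to Class x Survived
tab <- margin.table(Titanic, c(1, 4))

# Bar widths are proportional to the Class totals (the "spines");
# each bar is cut once, by Survived, in the other direction
spineplot(tab)
```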
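library(vcdExtra)
library(forcats)
library(dplyr)  # provides %>%, group_by(), summarize(), mutate(), ...

# Order the food levels by total frequency
foodorder <- Alligator %>%
  group_by(food) %>%
  summarize(Freq = sum(count)) %>%
  arrange(Freq) %>%
  pull(food)

# Rename the count column and set the level order of size and food
ally <- Alligator %>%
  rename(Freq = count) %>%
  mutate(size = fct_relevel(size, "small"),
         food = factor(food, levels = foodorder),
         food = fct_relevel(food, "other"))

# food is the dependent variable: split last, horizontally, and used for fill
vcd::mosaic(food ~ sex + size, ally,
            direction = c("v", "v", "h"),
            highlighting_fill = RColorBrewer::brewer.pal(5, "Accent"))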
A treemap is a filled rectangular plot representing hierarchical data (the fill color does not necessarily represent frequency count).
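A minimal sketch of a treemap, assuming the treemap package and its bundled GNI2014 dataset (neither is named in this chapter). Rectangle area encodes population, while fill encodes a different variable, GNI, which illustrates that fill does not have to represent the count.

```r
library(treemap)

# Assumption: GNI2014 ships with the treemap package
data(GNI2014)

# Hierarchy: continent, then country (iso3 code)
# Area = population; fill = GNI (type = "value" maps vColor to a color scale)
treemap(GNI2014,
        index = c("continent", "iso3"),
        vSize = "population",
        vColor = "GNI",
        type = "value")
```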
Alluvial diagrams are typically used to represent changes in flow within a network structure over time or between different levels.
The following plot shows the essential components of alluvial plots used in the naming schemes and documentation (axis, alluvium, stratum, lode):
library(ggalluvial)

df2 <- data.frame(
  Class1 = c("Stats", "Math", "Stats", "Math", "Stats", "Math", "Stats", "Math"),
  Class2 = c("French", "French", "Art", "Art", "French", "French", "Art", "Art"),
  Class3 = c("Gym", "Gym", "Gym", "Gym", "Lunch", "Lunch", "Lunch", "Lunch"),
  Freq = c(20, 3, 40, 5, 10, 2, 5, 15)
)

ggplot(df2, aes(axis1 = Class1, axis2 = Class2, axis3 = Class3, y = Freq)) +
  geom_alluvium(color = "black") +
  geom_stratum() +
  geom_text(stat = "stratum",
            aes(label = paste(after_stat(stratum), "\n", after_stat(count)))) +
  scale_x_discrete(limits = c("Class1", "Class2", "Class3"))
You can choose to color the alluvia by different variables, for example, by the first variable.
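A minimal sketch of coloring by the first variable, reusing the df2 data from the example above (repeated here so the block runs on its own). Mapping fill inside geom_alluvium() only, rather than in the top-level aes(), is a choice made here so that the strata keep their default fill.

```r
library(ggalluvial)  # also attaches ggplot2

df2 <- data.frame(
  Class1 = c("Stats", "Math", "Stats", "Math", "Stats", "Math", "Stats", "Math"),
  Class2 = c("French", "French", "Art", "Art", "French", "French", "Art", "Art"),
  Class3 = c("Gym", "Gym", "Gym", "Gym", "Lunch", "Lunch", "Lunch", "Lunch"),
  Freq = c(20, 3, 40, 5, 10, 2, 5, 15)
)

# Each alluvium is filled according to its Class1 value
p_alluvium <- ggplot(df2, aes(axis1 = Class1, axis2 = Class2, axis3 = Class3,
                              y = Freq)) +
  geom_alluvium(aes(fill = Class1), color = "black") +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("Class1", "Class2", "Class3"))

p_alluvium
```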
Another way of plotting alluvial diagrams is to use geom_flow rather than geom_alluvium. With geom_flow, all Math students taking Art are drawn together between each pair of axes, and the same holds for Stats students. This makes the graph much clearer than with geom_alluvium, since there are fewer crossing alluvia between the axes.
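A minimal sketch of the geom_flow version, again using the df2 data from the earlier example (repeated so the block is self-contained). The fill mapping to Class1 is an illustrative choice, not necessarily the one used for the original figure.

```r
library(ggalluvial)  # also attaches ggplot2

df2 <- data.frame(
  Class1 = c("Stats", "Math", "Stats", "Math", "Stats", "Math", "Stats", "Math"),
  Class2 = c("French", "French", "Art", "Art", "French", "French", "Art", "Art"),
  Class3 = c("Gym", "Gym", "Gym", "Gym", "Lunch", "Lunch", "Lunch", "Lunch"),
  Freq = c(20, 3, 40, 5, 10, 2, 5, 15)
)

# geom_flow() draws flows between adjacent axes only, so observations that
# share the same strata on both sides of a gap are bundled together
p_flow <- ggplot(df2, aes(axis1 = Class1, axis2 = Class2, axis3 = Class3,
                          y = Freq)) +
  geom_flow(aes(fill = Class1), color = "black") +
  geom_stratum() +
  geom_text(stat = "stratum", aes(label = after_stat(stratum))) +
  scale_x_discrete(limits = c("Class1", "Class2", "Class3"))

p_flow
```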
In addition to what was introduced systematically in Chapter 9.2 Heatmaps, this part demonstrates a special case of a heat map in which both x and y are categorical. Here the heat map can be seen as a clustered bar chart, and a predefined theme is used to show the density more clearly.
library(vcdExtra)
library(dplyr)
library(ggplot2)

# Minimal theme without axis lines or ticks
theme_heat <- theme_classic() +
  theme(axis.line = element_blank(),
        axis.ticks = element_blank())

orderedclasses <- c("Farm", "LoM", "UpM", "LoNM", "UpNM")
mydata <- Yamaguchi87
mydata$Son <- factor(mydata$Son, levels = orderedclasses)
mydata$Father <- factor(mydata$Father, levels = orderedclasses)

# Total frequency within each country and father's class
mydata3 <- mydata %>%
  group_by(Country, Father) %>%
  mutate(Total = sum(Freq)) %>%
  ungroup()

# Fill shows the share of sons in each class, given the father's class
ggplot(mydata3, aes(x = Father, y = Son)) +
  geom_tile(aes(fill = (Freq/Total)), color = "white") +
  coord_fixed() +
  scale_fill_gradient2(low = "black", mid = "white", high = "red",
                       midpoint = .2) +
  facet_wrap(~Country) +
  theme_heat