This is the 7.5.1.1 exercise from chapter 7 Exploratory Data Analysis in R for Data Science.

The question is “Use what you have learned to improve the visualisation of the departure times of cancelled vs. non-cancelled flights.”

library(tidyverse)
library(nycflights13)
options(Encoding="UTF-8")
data(flights)
str(flights)

# arr_time = NA means that the flight was cancelled
# add cancelled column to identify
flights_cancellation <- flights %>%
  mutate(cancelled = is.na(arr_time))

Now I am going to plot the result. There are several options I can choose to plot.

Before plotting, we have to know the type of variables, which are “cancelled” and “dep_time” in this case.

head(flights_cancellation$cancelled)
## [1] FALSE FALSE FALSE FALSE FALSE FALSE
head(flights_cancellation$dep_time)
## [1] 517 533 542 544 554 554

Now we know that “cancelled” is a categorical variable and “dep_time” is a continuous variable.

Les’s plot!

# plot01
# draw the plot00
flights_cancellation %>%
  ggplot(aes(x = cancelled)) +
  geom_bar()

It is quite clear that there is counts gap within flights.

# plot01
flights_cancellation %>%
  ggplot(aes(x = dep_time)) +
  geom_freqpoly(aes(col = cancelled), binwidth = 50)

As you can see, since the counts of cancelled flights are largely less than not-cancelled flights, we can’t observe the trend of cancelled flights. Thus, we should change from geom_freqpoly() to geom_density().

# plot02
flights_cancellation %>%
  ggplot(aes(x = dep_time)) +
  geom_density(aes(col = cancelled), bw = 40) 

As you can see, there are some pattern between cancelled and not-cancelled flights.

# dplot03
flights_cancellation %>%
  ggplot(aes(x = dep_time)) +
  geom_freqpoly(binwidth = 50) +
  facet_grid( . ~ cancelled)

Let’s add facet_grid() to the plot, but we need to remember to adjust the geom function.

# plot04
flights_cancellation %>%
  ggplot(aes(x = dep_time)) +
  geom_density(bw = 40) +
  facet_grid( . ~ cancelled)

As you can see, facet_grid() looks great.

# plot05
flights_cancellation %>%
  ggplot(aes(x = cancelled, y = dep_time)) +
  geom_boxplot() 

As you can see, geom_boxplot() is also great.

To sum up, we use geom_freqpoly, geom_density and geom_boxplot() to plot two variables which are “cancelled” and “dep_time”. Since there are counts gap, we have to use density to replace counts. Boxplot is also useful to compare continuous variable broken down by a categorical variable.