Darren Dahly PhD Statistical Epidemiology

Step plots with ggplot2

I have been reviewing studies of infant body composition. There are several different ways to measure body composition in infants, and I wanted to get a sense of how popular the methods were and how this had changed over time. This is the plot I eventually arrived at.

Here is how I made it.

Load the necessary packages and simulate the data. Each row represents an individual publication, described by its Type (i.e. the measurement method used) and the Year it was published.

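library(plyr)         # ddply()
library(ggplot2)
library(RColorBrewer) # palettes used by scale_color_brewer()

# Simulate 100 publications, each with a method (Type) and a publication year
Df <- data.frame(Type = sample(c("A", "B", "C", "D", "E"), 100, replace = TRUE),
                 Year = sample(1980:2014, 100, replace = TRUE))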

You also need a variable reflecting the cumulative number of each type of publication for each year, and another variable giving the final tally for each method.

Df$count <- 1

Df <- Df[order(Df$Year), ] # Sort by year

# Running count of publications within each Type
Df <- ddply(Df, .(Type), transform, cumsumType = cumsum(count))

# Final tally for each Type
Df <- ddply(Df, .(Type), transform, maxCount = max(cumsumType))

# Order the levels of Type by final tally, largest first
Df$Type <- reorder(Df$Type, Df$maxCount, max)
Df$Type <- factor(Df$Type, levels = rev(levels(Df$Type)))

The last two lines reorder the levels of Df$Type by final tally, in descending order, so that the legend entries appear in the same top-to-bottom order as the lines at the final year of the plot.
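As a quick check on this bookkeeping, the same running count and final tally can be computed in base R with ave(), which applies a function within each group. This is only a minimal sketch on a small made-up data frame (the name toy and its values are hypothetical), not the code used for the plot.

```r
# Toy data, made up for illustration: five publications of two types
toy <- data.frame(Type = c("B", "A", "B", "B", "A"),
                  Year = c(1990, 1985, 1982, 2001, 1999))

toy <- toy[order(toy$Year), ]  # sort by year so the running count is chronological
toy$count <- 1

# Running count within each Type (same idea as the first ddply call)
toy$cumsumType <- ave(toy$count, toy$Type, FUN = cumsum)

# Final tally per Type (same idea as the second ddply call)
toy$maxCount <- ave(toy$cumsumType, toy$Type, FUN = max)

# Order the factor levels by final tally, largest first
toy$Type <- reorder(factor(toy$Type), -toy$maxCount, FUN = max)

print(toy)
print(levels(toy$Type))
```

Negating maxCount inside reorder() gives the descending order in one step, which is an alternative to reversing the levels afterwards.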

First I tried a simple line plot.

ggplot(Df, aes(x = Year, y = cumsumType, colour = Type, 
               group = Type)) + 
  geom_line() +
  ylab("Total number of publications") +
  ggtitle("Cumulative number of publications by measurement method") +
  scale_color_brewer(name = "Method", palette = "Set1")

Then I decided I wanted it to look like a step function, so I tried geom_step.

ggplot(Df, aes(x = Year, color = Type)) +
  geom_step(aes(y = cumsumType)) +
  ylab("Total number of publications") +
  ggtitle("Cumulative number of publications by measurement method") +
  scale_color_brewer(name = "Method", palette = "Set1")

However, I wanted all the lines to extend horizontally to the end of the plot, so I tried stat_bin, but this was the result.

ggplot(Df, aes(x = Year, color = Type)) +
  stat_bin(aes(y = cumsum(..count..)), geom="step")  +
  ylab("Total number of publications") +
  ggtitle("Cumulative number of publications by measurement method") +
  scale_color_brewer(name = "Method", palette = "Set1")
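For comparison, the geom_step lines stop short because each Type only has rows up to its last publication year. One possible alternative, sketched here under that assumption, is to append one extra row per Type at the final year, carrying the last cumulative count forward, and then use geom_step as before. The data frame pubs and the helper pad_to_end are hypothetical names for illustration, and this is not the approach taken in the rest of this post.

```r
library(ggplot2)

# Toy cumulative counts, made up for illustration
pubs <- data.frame(Type = c("A", "A", "A", "B", "B"),
                   Year = c(1982, 1990, 2001, 1985, 1999),
                   cumsumType = c(1, 2, 3, 1, 2))

# Hypothetical helper: append one row per Type at end_year,
# repeating that Type's final cumulative count
pad_to_end <- function(data, end_year) {
  last_rows <- do.call(rbind, lapply(split(data, data$Type), function(d) {
    d <- d[which.max(d$cumsumType), ]  # row holding the final tally
    d$Year <- end_year
    d
  }))
  out <- rbind(data, last_rows)
  out <- out[order(out$Type, out$Year), ]
  rownames(out) <- NULL
  out
}

padded <- pad_to_end(pubs, end_year = 2014)

# Every Type now has a point at 2014, so each step line reaches the right edge
p <- ggplot(padded, aes(x = Year, y = cumsumType, colour = Type)) +
  geom_step()
# print(p) draws the plot
```

The padding keeps everything in a single layer, at the cost of adding rows that do not correspond to real publications.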

The best solution I could come up with was to plot each line in its own layer with stat_bin. Here is the complete code.

 ggplot(Df,aes(x = Year, color = Type)) +
  
   stat_bin(data = subset(Df, Type == "A"),
            aes(y = cumsum(..count..)), 
            geom = "step", size = 3, alpha = 0.3) +
   stat_bin(data = subset(Df, Type == "B"),
            aes(y = cumsum(..count..)), 
            geom = "step", size = 3, alpha = 0.3) +
   stat_bin(data = subset(Df, Type == "C"),
            aes(y = cumsum(..count..)), 
            geom = "step", size = 3, alpha = 0.3) +
   stat_bin(data = subset(Df, Type == "D"),
            aes(y = cumsum(..count..)), 
            geom = "step", size = 3, alpha = 0.3) +
   stat_bin(data = subset(Df, Type == "E"),
            aes(y = cumsum(..count..)), 
            geom = "step", size = 3, alpha = 0.3) +
  
   stat_bin(data = subset(Df, Type == "A"),
            aes(y = cumsum(..count..)), 
            geom = "step", size = 1) +
   stat_bin(data = subset(Df, Type == "B"),
            aes(y = cumsum(..count..)), 
            geom = "step", size = 1) +
   stat_bin(data = subset(Df, Type == "C"),
            aes(y = cumsum(..count..)), 
            geom = "step", size = 1) +
   stat_bin(data = subset(Df, Type == "D"),
            aes(y = cumsum(..count..)), 
            geom = "step", size = 1) +
   stat_bin(data = subset(Df, Type == "E"),
            aes(y = cumsum(..count..)), 
            geom = "step", size = 1) +
  
   coord_cartesian(xlim = c(1980, 2015)) +
   ylab("Total number of publications") +
   ggtitle("Cumulative number of publications by measurement method") +
   scale_color_brewer(name = "Method", palette = "Set1", 
                      breaks = levels(Df$Type))

Because the line for each Type is plotted in its own layer, you need to specify the breaks for the color scale to get the legend in the correct order (descending, by final tally).
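The ten near-identical stat_bin calls could probably be generated rather than typed out, since ggplot2 accepts a list of layers on the right-hand side of +. The following is a minimal sketch of that idea, not a tested replacement for the code above: step_layers is a hypothetical helper, and the data are re-simulated with an arbitrary seed so the block runs on its own. It keeps the ..count.. and size spellings used in this post; newer ggplot2 releases prefer after_stat(count) and linewidth and may warn about the older forms.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
Df <- data.frame(Type = factor(sample(c("A", "B", "C", "D", "E"), 100,
                                      replace = TRUE)),
                 Year = sample(1980:2014, 100, replace = TRUE))

# Hypothetical helper: build one stat_bin step layer per level of Type.
# Extra arguments (size, alpha, ...) are passed straight to stat_bin().
step_layers <- function(data, ...) {
  lapply(levels(data$Type), function(type) {
    stat_bin(data = data[data$Type == type, ],
             aes(y = cumsum(..count..)),
             geom = "step", ...)
  })
}

p <- ggplot(Df, aes(x = Year, color = Type)) +
  step_layers(Df, size = 3, alpha = 0.3) +  # wide, translucent lines
  step_layers(Df, size = 1) +               # narrow, opaque lines on top
  coord_cartesian(xlim = c(1980, 2015)) +
  ylab("Total number of publications") +
  ggtitle("Cumulative number of publications by measurement method") +
  scale_color_brewer(name = "Method", palette = "Set1",
                     breaks = levels(Df$Type))
# print(p) draws the plot
```

Because the layers are built from levels(Df$Type), adding a sixth method would not require any new plotting code.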