group_by() is one of my most frequently used and favorite dplyr functions. However, leaving data grouped when you no longer need it, especially when you have many groups, can come with a significant computational cost if you forget to ungroup between operations. Below is a quick example, mostly as a self-reminder to check my pipelines for opportunities to ungroup data before operations that don't need the grouping.

First, let's load our packages

library(dplyr)
library(tibble)

Now, to illustrate the point, let's generate some random data: 20,000 groups of 100 rows each (2 million rows in total)

# 20,000 groups of 100 rows each gives 2,000,000 rows
groups <- 1:20000
df <- tibble(groups = rep(groups, each = 100)) %>%
  mutate(random_value = runif(n = nrow(.), min = 0, max = 100000))

Next, let's calculate the max value for each group and then filter to keep only the per-group maximum rows, leaving the data grouped the whole way through

system.time({
  df %>%
    group_by(groups) %>%
    mutate(max_value = max(random_value)) %>%
    filter(random_value == max_value) # still grouped: the filter is evaluated within each of the 20,000 groups
})
##    user  system elapsed 
##  12.370   0.007  12.381

Finally, let's run the same operation, but this time ungrouping before the filter

system.time({
  df %>%
    group_by(groups) %>%
    mutate(max_value = max(random_value)) %>%
    ungroup() %>% # drop the grouping once the per-group max has been computed
    filter(random_value == max_value) # now a single vectorised comparison over all rows
})
##    user  system elapsed 
##   0.212   0.012   0.224

With the ungroup() in place the pipeline runs roughly 55x faster (about 12.4 vs 0.2 seconds elapsed). If you are doing analyses with thousands of groups, your workflow may benefit from more scrupulous ungrouping.
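
As an aside, recent dplyr versions also offer slice_max(), which finds the per-group maximum without an explicit group_by()/ungroup() pair. I haven't timed it against the pipelines above, but a minimal sketch (assuming dplyr >= 1.1.0 for the by argument) would be

df %>%
  slice_max(random_value, by = groups) # keeps ties by default, like the filter above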

Or, as it turns out, change your code to use data.table instead; faster grouped operations are a well-documented advantage of data.table over dplyr.
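
For reference, here is a minimal sketch of what the equivalent data.table version might look like (I haven't timed it here)

library(data.table)

dt <- as.data.table(df)
system.time({
  # add the per-group max by reference, then chain a filter on it
  dt[, max_value := max(random_value), by = groups][random_value == max_value]
})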