`group_by()` is one of my most frequently used and favorite dplyr functions. However, leaving data grouped can come with a significant computational cost if you forget to ungroup between operations that don't need the grouping, especially with large datasets. Below is a quick example, mostly as a self-reminder to check my pipelines for opportunities to ungroup data before some operations.

## First let’s load our packages

```
library(dplyr)
library(tibble)
```

## To illustrate this point, let's first generate some random data

```
groups <- 1:20000
df <- tibble(groups = rep(groups, each = 100)) %>%
  mutate(random_value = runif(n = nrow(.), min = 0, max = 100000))
```

## Next let’s calculate the max value for each group and then filter data to only keep the max for each group

```
system.time({
  df %>%
    group_by(groups) %>%
    mutate(max_value = max(random_value)) %>%
    filter(random_value == max_value)
})
```

```
## user system elapsed
## 12.370 0.007 12.381
```

## Finally let’s run the same operation but first ungrouping the data

```
system.time({
  df %>%
    group_by(groups) %>%
    mutate(max_value = max(random_value)) %>%
    ungroup() %>%
    filter(random_value == max_value)
})
```

```
## user system elapsed
## 0.212 0.012 0.224
```

Ungrouping first makes the filter run roughly 55x faster (0.224 s vs. 12.381 s elapsed). If you are doing analyses with thousands of groups, your workflow may benefit from more scrupulous ungrouping.
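When auditing a pipeline for leftover grouping, dplyr's `group_vars()` reports the active grouping columns. A minimal sketch (the small `grouped` tibble here is illustrative, not the 2M-row data above):

```
library(dplyr)
library(tibble)

# A tiny grouped tibble to inspect
grouped <- tibble(groups = rep(1:3, each = 2), value = 1:6) %>%
  group_by(groups)

# group_vars() returns the grouping columns, or character(0) if ungrouped
group_vars(grouped)             # "groups"
group_vars(ungroup(grouped))    # character(0)
```

Dropping a quick `group_vars()` check between pipeline stages makes it obvious where a stray grouping is still attached.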

Or, as it turns out, switch your code to `data.table`: its speed on grouped operations is a well-documented advantage over dplyr.
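For reference, a minimal `data.table` sketch of the same max-per-group filter. It rebuilds data of the same shape as the tibble above; note that grouping via `by =` is scoped to the single call, so there is no persistent grouped state to forget to drop:

```
library(data.table)

# Same shape as the tibble above: 20000 groups of 100 random values
dt <- data.table(groups = rep(1:20000, each = 100),
                 random_value = runif(20000 * 100, min = 0, max = 100000))

# Keep only each group's maximum row; .SD is the subset of rows
# for the current group, and the grouping ends with this call
result <- dt[, .SD[random_value == max(random_value)], by = groups]
```

(For very large data, the `dt[dt[, .I[random_value == max(random_value)], by = groups]$V1]` idiom is typically faster still, at some cost in readability.)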