Group by one or more variables

Source: R/group-by.R

group_by.Rd

Most data operations are done on groups defined by variables. group_by() takes an existing tbl and converts it into a grouped tbl where operations are performed "by group". ungroup() removes grouping.

Usage

group_by(.data, ..., .add = FALSE, .drop = group_by_drop_default(.data))

ungroup(x, ...)

Arguments

.data

A data frame, data frame extension (e.g. a tibble), or a lazy data frame (e.g. from dbplyr or dtplyr). See Methods, below, for more details.

...

In group_by(), variables or computations to group by. Computations are always done on the ungrouped data frame. To perform computations on the grouped data, you need to use a separate mutate() step before the group_by(). Computations are not allowed in nest_by().

In ungroup(), variables to remove from the grouping.

.add

When FALSE, the default, group_by() will override existing groups. To add to the existing groups, use .add = TRUE.

This argument was previously called add, but that prevented creating a new grouping variable called add, and conflicts with our naming conventions.

.drop

Drop groups formed by factor levels that don't appear in the data? The default is TRUE except when .data has been previously grouped with .drop = FALSE. See group_by_drop_default() for details.

x

A tbl()
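As a rough illustration of two of the arguments above, the sketch below removes a single variable from an existing grouping by passing it to ungroup(), and counts rows per factor level with .drop = FALSE. The data frame df and the names remaining and counts are made up for this example; it assumes dplyr is installed.

```r
library(dplyr)

# Group by two variables, then remove only `am` from the grouping.
by_vs_am <- mtcars %>% group_by(vs, am)
remaining <- by_vs_am %>% ungroup(am) %>% group_vars()
remaining
#> [1] "vs"

# Hypothetical data: level "b" is declared but never appears.
df <- tibble(
  x = 1:4,
  f = factor(c("a", "a", "c", "c"), levels = c("a", "b", "c"))
)

# With .drop = FALSE the unused level is kept as an empty group,
# so the per-group count includes a zero for "b".
counts <- df %>% group_by(f, .drop = FALSE) %>% summarise(n = n())
counts$n
#> [1] 2 0 2
```

With the default .drop = TRUE, the row for level "b" would not appear in the summary at all.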

Value

A grouped data frame with class grouped_df, unless the combination of ... and .add yields an empty set of grouping columns, in which case a tibble will be returned.
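A small check of the two return types described above, assuming dplyr is installed. The object names grouped and ungrouped are illustrative only.

```r
library(dplyr)

# At least one grouping column: the result is a grouped_df.
grouped <- mtcars %>% group_by(cyl)
is_grouped_df(grouped)
#> [1] TRUE

# No grouping columns supplied: the result is a plain, ungrouped tibble.
ungrouped <- mtcars %>% group_by()
is_grouped_df(ungrouped)
#> [1] FALSE
inherits(ungrouped, "tbl_df")
#> [1] TRUE
```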

Methods

These functions are generics, which means that packages can provide implementations (methods) for other classes. See the documentation of individual methods for extra arguments and differences in behaviour.

Methods available in currently loaded packages:

group_by(): dbplyr (tbl_lazy), dplyr (data.frame).
ungroup(): dbplyr (tbl_lazy), dplyr (data.frame, grouped_df, rowwise_df).

Ordering

Currently, group_by() internally orders the groups in ascending order. This results in ordered output from functions that aggregate groups, such as summarise().

When used as grouping columns, character vectors are ordered in the C locale for performance and reproducibility across R sessions. If the resulting ordering of your grouped operation matters and is dependent on the locale, you should follow up the grouped operation with an explicit call to arrange() and set the .locale argument. For example:

data %>% group_by(chr) %>% summarise(avg = mean(x)) %>% arrange(chr, .locale = "en")

This is often useful as a preliminary step before generating content intended for humans, such as an HTML table.
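To make the C-locale ordering concrete, here is a minimal sketch with a made-up tibble df. It assumes dplyr 1.1.0 or later and that the dplyr.legacy_locale option is not set. In the C locale, uppercase letters sort before lowercase ones.

```r
library(dplyr)

# Hypothetical data with mixed-case group labels.
df <- tibble(
  chr = c("b", "A", "a", "B"),
  x = c(1, 2, 3, 4)
)

# Groups come back in C-locale order: all uppercase before all lowercase.
res <- df %>% group_by(chr) %>% summarise(avg = mean(x))
res$chr
#> [1] "A" "B" "a" "b"
```

Following this with arrange(chr, .locale = "en") would reorder the rows according to English collation rules. That step is left out of the sketch because, as far as I know, a non-C .locale requires the stringi package to be installed.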

Legacy behavior

Prior to dplyr 1.1.0, character vector grouping columns were ordered in the system locale. If you need to temporarily revert to this behavior, you can set the global option dplyr.legacy_locale to TRUE, but this should be used sparingly and you should expect this option to be removed in a future version of dplyr. It is better to update existing code to explicitly call arrange(.locale = ) instead. Note that setting dplyr.legacy_locale will also force calls to arrange() to use the system locale.

Examples

by_cyl <- mtcars %>% group_by(cyl)

# grouping doesn't change how the data looks (apart from listing
# how it's grouped):
by_cyl
#> # A tibble: 32 × 11
#> # Groups:   cyl [3]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows

# It changes how it acts with the other dplyr verbs:
by_cyl %>% summarise(
  disp = mean(disp),
  hp = mean(hp)
)
#> # A tibble: 3 × 3
#>     cyl  disp    hp
#>   <dbl> <dbl> <dbl>
#> 1     4  105.  82.6
#> 2     6  183. 122.
#> 3     8  353. 209.
by_cyl %>% filter(disp == max(disp))
#> # A tibble: 3 × 11
#> # Groups:   cyl [3]
#>     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#> 2  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#> 3  10.4     8  472    205  2.93  5.25  18.0     0     0     3     4

# Each call to summarise() removes a layer of grouping
by_vs_am <- mtcars %>% group_by(vs, am)
by_vs <- by_vs_am %>% summarise(n = n())
#> `summarise()` has grouped output by 'vs'. You can override using the
#> `.groups` argument.
by_vs
#> # A tibble: 4 × 3
#> # Groups:   vs [2]
#>      vs    am     n
#>   <dbl> <dbl> <int>
#> 1     0     0    12
#> 2     0     1     6
#> 3     1     0     7
#> 4     1     1     7
by_vs %>% summarise(n = sum(n))
#> # A tibble: 2 × 2
#>      vs     n
#>   <dbl> <int>
#> 1     0    18
#> 2     1    14

# To remove grouping, use ungroup()
by_vs %>%
  ungroup() %>%
  summarise(n = sum(n))
#> # A tibble: 1 × 1
#>       n
#>   <int>
#> 1    32

# By default, group_by() overrides existing grouping
by_cyl %>%
  group_by(vs, am) %>%
  group_vars()
#> [1] "vs" "am"

# Use .add = TRUE to append instead
by_cyl %>%
  group_by(vs, am, .add = TRUE) %>%
  group_vars()
#> [1] "cyl" "vs"  "am"

# You can group by expressions: this is a short-hand
# for a mutate() followed by a group_by()
mtcars %>%
  group_by(vsam = vs + am)
#> # A tibble: 32 × 12
#> # Groups:   vsam [3]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb  vsam
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4     1
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4     1
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1     2
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2     0
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4     0
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2     1
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2     1
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4     1
#> # ℹ 22 more rows

# The implicit mutate() step is always performed on the
# ungrouped data. Here we get 3 groups:
mtcars %>%
  group_by(vs) %>%
  group_by(hp_cut = cut(hp, 3))
#> # A tibble: 32 × 12
#> # Groups:   hp_cut [3]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows
#> # ℹ 1 more variable: hp_cut <fct>

# If you want it to be performed by groups,
# you have to use an explicit mutate() call.
# Here we get 3 groups per value of vs
mtcars %>%
  group_by(vs) %>%
  mutate(hp_cut = cut(hp, 3)) %>%
  group_by(hp_cut)
#> # A tibble: 32 × 12
#> # Groups:   hp_cut [6]
#>      mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>    <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#>  1  21       6  160    110  3.9   2.62  16.5     0     1     4     4
#>  2  21       6  160    110  3.9   2.88  17.0     0     1     4     4
#>  3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1
#>  4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1
#>  5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2
#>  6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1
#>  7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4
#>  8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2
#>  9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2
#> 10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4
#> # ℹ 22 more rows
#> # ℹ 1 more variable: hp_cut <fct>

# when factors are involved and .drop = FALSE, groups can be empty
tbl <- tibble(
  x = 1:10,
  y = factor(rep(c("a", "c"), each = 5), levels = c("a", "b", "c"))
)
tbl %>%
  group_by(y, .drop = FALSE) %>%
  group_rows()
#> <list_of<integer>[3]>
#> [[1]]
#> [1] 1 2 3 4 5
#>
#> [[2]]
#> integer(0)
#>
#> [[3]]
#> [1]  6  7  8  9 10
#>
