dplyr Package in R Programming - GeeksforGeeks (2024)

Last Updated : 20 Dec, 2023


In this article, we will discuss aggregating and analyzing data with the dplyr package in the R programming language.

dplyr Package in R

The dplyr package in the R programming language is a grammar of data manipulation that provides a consistent set of verbs for solving the most common data manipulation problems.

  • By constraining your options, it lets you focus on the data manipulation problem itself.
  • It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, so thoughts can be translated into code faster.
  • It uses efficient backends, so less time is spent waiting for the computer.

Here are some key functions and concepts within the dplyr package in R.

Data Frame and Tibble

A data frame in R is a table in which each column stores one type of information, such as names, ages, or scores. To create a data frame, specify the column names and their respective values.

R

df <- data.frame(
  Name = c("vipul", "jayesh", "anurag"),
  Age = c(25, 23, 22),
  Score = c(95, 89, 78)
)

df

Output:

    Name Age Score
1  vipul  25    95
2 jayesh  23    89
3 anurag  22    78

On the other hand, tibbles, introduced through the tibble package, share similar functionality but offer enhanced user-friendly features. The syntax for creating a tibble is comparable to that of a data frame.
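As a minimal sketch of the tibble syntax, the data frame shown earlier could be built as a tibble like this. It assumes the tibble package is installed; the variable name tb is only illustrative.

```r
library(tibble)

# Same columns as the data frame example, built with tibble()
tb <- tibble(
  Name = c("vipul", "jayesh", "anurag"),
  Age = c(25, 23, 22),
  Score = c(95, 89, 78)
)

# Printing a tibble also shows its dimensions and each column's type
tb
```

A tibble is still a data frame, so the dplyr verbs described in this article should work on it in the same way.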

Pipes (%>%)

The pipe operator (%>%), which dplyr re-exports from the magrittr package, allows us to chain multiple operations together, improving code readability.

R

# Load necessary libraries
library(dplyr)

# Example: Chain operations using the pipe operator
result <- mtcars %>%
  filter(mpg > 20) %>%            # Filter rows where mpg is greater than 20
  select(mpg, cyl, hp) %>%        # Select specific columns
  group_by(cyl) %>%               # Group the data by the 'cyl' variable
  summarise(mean_hp = mean(hp))   # Calculate the mean horsepower for each group

# Display the result
print(result)

Output:

    cyl mean_hp
  <dbl>   <dbl>
1     4    82.6
2     6   110

Verb Functions

dplyr provides several important functions (verbs) for data manipulation. These are:

filter():

filter() chooses rows (cases) based on conditions on their values.

R

library(dplyr)

# Create a data frame with missing data
d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

# Display the data frame
print(d)

# Find rows where ht is NA
rows_with_na <- d %>% filter(is.na(ht))
print(rows_with_na)

# Find rows where ht is not NA
rows_without_na <- d %>% filter(!is.na(ht))
print(rows_without_na)

Output:

     name age ht school
1    Abhi   7 46    yes
2 Bhavesh   5 NA    yes
3  Chaman   9 NA     no
4   Dimri  16 69     no

Rows where ht is NA:

     name age ht school
1 Bhavesh   5 NA    yes
2  Chaman   9 NA     no

Rows where ht is not NA:

   name age ht school
1  Abhi   7 46    yes
2 Dimri  16 69     no
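filter() also accepts several conditions at once. The sketch below reuses the same data frame d; the result names older_in_school and young_or_tall are illustrative only. Conditions separated by commas are combined with AND, and rows for which a condition evaluates to NA are dropped.

```r
library(dplyr)

d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

# Comma-separated conditions are combined with AND
older_in_school <- d %>% filter(age > 6, school == "yes")
print(older_in_school)

# Use | for OR; Chaman is dropped because FALSE | NA evaluates to NA
young_or_tall <- d %>% filter(age < 6 | ht > 60)
print(young_or_tall)
```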

arrange():

arrange() reorders rows (cases) by the values of one or more columns.

R

library(dplyr)

# Create a data frame with missing data
d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

d

# Arrange the rows by age (ascending)
d_by_age <- arrange(d, age)
print(d_by_age)

Output:

     name age ht school
1    Abhi   7 46    yes
2 Bhavesh   5 NA    yes
3  Chaman   9 NA     no
4   Dimri  16 69     no

Rows arranged by age:

     name age ht school
1 Bhavesh   5 NA    yes
2    Abhi   7 46    yes
3  Chaman   9 NA     no
4   Dimri  16 69     no
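Two common variations are sorting in descending order with desc() and sorting by more than one column. This is a small sketch on the same data frame; the names d_desc and d_multi are illustrative.

```r
library(dplyr)

d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

# Sort by age, oldest first
d_desc <- arrange(d, desc(age))
print(d_desc)

# Sort by school first, then by age within each school value
d_multi <- arrange(d, school, age)
print(d_multi)
```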

select() and rename():

select() chooses columns (variables) based on their names, and rename() changes the names of columns.

R

library(dplyr)

# Create a data frame with missing data
d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

# starts_with(): select only the columns whose names start with "ht"
select(d, starts_with("ht"))

# -starts_with(): select everything except the ht column
select(d, -starts_with("ht"))

# Select columns 1 to 2
select(d, 1:2)

# Select columns whose names contain "a"
select(d, contains("a"))

# Select columns whose names match the regular expression "na"
select(d, matches("na"))

Output:

Only the ht column:

  ht
1 46
2 NA
3 NA
4 69

Everything except the ht column:

     name age school
1    Abhi   7    yes
2 Bhavesh   5    yes
3  Chaman   9     no
4   Dimri  16     no

Columns 1 to 2:

     name age
1    Abhi   7
2 Bhavesh   5
3  Chaman   9
4   Dimri  16

Column names containing "a":

     name age
1    Abhi   7
2 Bhavesh   5
3  Chaman   9
4   Dimri  16

Column names matching "na":

     name
1    Abhi
2 Bhavesh
3  Chaman
4   Dimri
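The examples in this section only exercise select(). A minimal sketch of rename() on the same data frame follows; the new column names height and student are illustrative. Both functions use the form new_name = old_name.

```r
library(dplyr)

d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

# rename() keeps every column and changes only the named one
d_renamed <- rename(d, height = ht)
names(d_renamed)

# select() can rename while selecting, but drops the unselected columns
d_selected <- select(d, student = name, age)
names(d_selected)
```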

mutate() and transmute():

mutate() adds new columns that are functions of existing columns. transmute() does the same but keeps only the new columns.

R

library(dplyr)

# Create a data frame with missing data
d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

# Display the data frame
d

# Add x3, the sum of ht and age, keeping all existing columns
mutate(d, x3 = ht + age)

# Compute x3, the sum of ht and age, dropping all other columns
transmute(d, x3 = ht + age)

Output:

     name age ht school
1    Abhi   7 46    yes
2 Bhavesh   5 NA    yes
3  Chaman   9 NA     no
4   Dimri  16 69     no

mutate(): x3 added to the existing columns:

     name age ht school x3
1    Abhi   7 46    yes 53
2 Bhavesh   5 NA    yes NA
3  Chaman   9 NA     no NA
4   Dimri  16 69     no 85

transmute(): only x3 is kept:

  x3
1 53
2 NA
3 NA
4 85
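mutate() can create several columns in one call, and later expressions may refer to columns created earlier in the same call. The sketch below assumes, purely for illustration, that ht is measured in inches; the column names ht_cm and ht_m are hypothetical.

```r
library(dplyr)

d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

# ht_m is computed from ht_cm, which is created in the same mutate() call
# (assumption for illustration only: ht is in inches)
d_units <- mutate(d,
  ht_cm = ht * 2.54,
  ht_m = ht_cm / 100
)
print(d_units)
```

As with x3 in the main example, rows where ht is NA produce NA in the derived columns.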

summarise():

summarise() condenses multiple values into a single summary value.

R

library(dplyr)

# Create a data frame with missing data
d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

# Calculating mean of age
summarise(d, mean = mean(age))

# Calculating min of age
summarise(d, min = min(age))

# Calculating max of age
summarise(d, max = max(age))

# Calculating median of age
summarise(d, med = median(age))

Output:

Mean of age:

  mean
1 9.25

Minimum of age:

  min
1   5

Maximum of age:

  max
1  16

Median of age:

  med
1   8
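summarise() is most often combined with group_by() to get one summary row per group. The following is a minimal sketch on the same data frame, grouping by school; the result name by_school and the summary column names are illustrative. Because ht contains missing values, na.rm = TRUE is passed to mean() so the NAs are ignored.

```r
library(dplyr)

d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

# One summary row per value of school
by_school <- d %>%
  group_by(school) %>%
  summarise(
    n = n(),                          # number of rows in the group
    mean_age = mean(age),             # average age in the group
    mean_ht = mean(ht, na.rm = TRUE)  # average height, ignoring NA
  )

print(by_school)
```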

sample_n() and sample_frac():

sample_n() and sample_frac() take random samples of rows: a fixed number of rows or a fraction of the rows, respectively.

R

library(dplyr)

# Create a data frame with missing data
d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

# Sample three rows at random
sample_n(d, 3)

# Sample 50% of the rows at random
sample_frac(d, 0.50)

Output (the rows are chosen at random, so your result may differ):

Three rows:

    name age ht school
1 Chaman   9 NA     no
2  Dimri  16 69     no
3   Abhi   7 46    yes

50% of the rows:

   name age ht school
1  Abhi   7 46    yes
2 Dimri  16 69     no
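In dplyr 1.0.0 and later, sample_n() and sample_frac() are superseded by slice_sample(), which covers both cases through its n and prop arguments. A minimal sketch, assuming a dplyr version that provides slice_sample(); the seed value and result names are arbitrary.

```r
library(dplyr)

d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

# Fix the random seed so the same rows are drawn on every run
set.seed(42)

# Equivalent of sample_n(d, 3)
three_rows <- slice_sample(d, n = 3)
print(three_rows)

# Equivalent of sample_frac(d, 0.50)
half_rows <- slice_sample(d, prop = 0.5)
print(half_rows)
```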



FAQs

What is the purpose of dplyr package in R?

The dplyr package makes data manipulation fast and easy: By constraining your options, it helps you think about your data manipulation challenges. It provides simple “verbs”, functions that correspond to the most common data manipulation tasks, to help you translate your thoughts into code.

Why is the dplyr package very useful in big data analysis?

The dplyr package provides a concise set of operations for managing data frames. With these functions we can do a number of complex operations in just a few lines of code. In particular, we can often conduct the beginnings of an exploratory analysis with the powerful combination of group_by() and summarize().

Which are 5 of the most commonly used dplyr functions?

We're going to learn some of the most common dplyr functions: select(), filter(), mutate(), group_by(), and summarize().

What are the R packages for Geeksforgeeks?

Packages in R Programming language are a set of R functions, compiled code, and sample data. These are stored under a directory called “library” within the R environment. By default, R installs a group of packages during installation. Once we start the R console, only the default packages are available by default.

What does %>% mean in dplyr?

The pipe operator (%>%) lets R code be read left to right instead of inside out. It pipes, or transfers, the output of the first function to the input of a second function. In the following code, we invoke the select function, then invoke arrange:

R

mtcars %>% select(cyl, mpg) %>% arrange(cyl, mpg)
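To make the left-to-right reading concrete, this sketch writes the same select-then-arrange operation once with the pipe and once as a nested call, then compares the results. The names piped and nested are illustrative.

```r
library(dplyr)

# With the pipe: read left to right
piped <- mtcars %>% select(cyl, mpg) %>% arrange(cyl, mpg)

# Without the pipe: the same calls, nested and read inside out
nested <- arrange(select(mtcars, cyl, mpg), cyl, mpg)

# Both forms should give the same data frame
all.equal(piped, nested)
```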

What is the difference between dplyr and tidyverse?

dplyr: A package for data manipulation that uses a consistent and intuitive syntax that makes data manipulation tasks more straightforward. tidyr: A package for data tidying that helps you transform data between different formats, such as converting wide data to long format or vice versa. Both are part of the tidyverse, a collection of R packages.

Is data.table better than dplyr?

While dplyr has very flexible and intuitive syntax, data.table can be orders of magnitude faster in some scenarios. One of those scenarios is when performing operations over a very large number of groups.

What is the difference between SQL and dplyr?

Both SQL and dplyr are widely used in industry and academia. In SQL, SELECT is the clause used to select a subset of columns, and dplyr has the select(dataset, col01, col02, ...) verb for the same task. Similarly, the SQL WHERE clause corresponds to filter(dataset, col01 > val1, ...).
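The correspondence can be sketched with the built-in mtcars dataset. The SQL statement in the comment is shown only for comparison and assumes a hypothetical database table with the same name and columns; the threshold of 25 and the name efficient are arbitrary.

```r
library(dplyr)

# SQL (for comparison only):
#   SELECT mpg, cyl FROM mtcars WHERE mpg > 25;

# dplyr: filter() plays the role of WHERE, select() the role of SELECT
efficient <- mtcars %>%
  filter(mpg > 25) %>%
  select(mpg, cyl)

print(efficient)
```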

What is the dplyr function select used for?

The select() function of dplyr package is used to choose which columns of a data frame you would like to work with. It takes column names as arguments and creates a new data frame using the selected columns. select() can be combined with other functions such as filter().

Why is it called dplyr?

The d is for data.frame, and plyr as in a set of pliers to manipulate things with. dplyr is a data.frame-specific set of tools like plyr.

What is the use of arrange () with dplyr package?

dplyr Package – arrange()

The arrange() function is used to reorder rows of a data frame according to one of the variables. Reordering rows of a data frame (while preserving corresponding order of other columns) is normally a pain to do in R.

How to load dplyr package in R?

dplyr: Getting Started

R

# On a local R install, use
install.packages("dplyr")   # To download the package

# Load the dplyr library
library(dplyr)

# dplyr is a part of the R "tidyverse"
library(tidyverse)
# ...

# Use dplyr function glimpse() to view the structure of data
glimpse(mtcars)

# Result is similar to str()
str(mtcars)

Is R easier than Python?

Both Python and R are considered fairly easy languages to learn. Python was originally designed for software development. If you have previous experience with Java or C++, you may be able to pick up Python more naturally than R. If you have a background in statistics, on the other hand, R could be a bit easier.

Which packages should I install in R?

To load data
  • DBI - The standard for communication between R and relational database management systems. ...
  • odbc - Use any ODBC driver with the odbc package to connect R to your database. ...
  • RMySQL, RPostgreSQL, RSQLite - If you'd like to read in data from a database, these packages are a good place to start.

Is R programming easy?

R is considered one of the more difficult programming languages to learn due to how different its syntax is from other languages like Python and its extensive set of commands. It takes most learners without prior coding experience roughly four to six weeks to learn R. Of course, this depends on several factors.

What is the purpose of the R package?

An R package is a collection of R functions, compiled code, and sample data, designed to make the organization and reusability of code more efficient in the R programming language. These packages are stored in a directory named 'library' in the R environment.

What is the difference between data table and dplyr?

So, for example, while data.table includes functions to read, write, or reshape data, dplyr delegates these tasks to companion packages like readr or tidyr. On the other hand, data.table is focused on the processing of local in-memory data, but dplyr offers a database backend.

What is the difference between base and dplyr?

The core dplyr verbs take data frames as input and return data frames as output. This contrasts with base R functions, which more frequently work with individual vectors. dplyr relies heavily on “non-standard evaluation” so that you don't need to use $ to refer to columns in the “current” data frame.
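A small side-by-side sketch of that difference, using an illustrative data frame d and an arbitrary age threshold: the base R version refers to the column as d$age, while the dplyr version uses the bare column name.

```r
library(dplyr)

d <- data.frame(
  name = c("Abhi", "Bhavesh", "Chaman", "Dimri"),
  age = c(7, 5, 9, 16),
  ht = c(46, NA, NA, 69),
  school = c("yes", "yes", "no", "no")
)

# Base R: columns are reached through d$..., and subsetting uses [rows, columns]
base_res <- d[d$age > 6, c("name", "age")]

# dplyr: columns are referred to by bare name inside the verbs
dplyr_res <- d %>%
  filter(age > 6) %>%
  select(name, age)

print(base_res)
print(dplyr_res)
```

The two results hold the same rows and columns. One visible difference is that base R subsetting keeps the original row names, while filter() renumbers the rows.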

Article information

Author: Dan Stracke
