Dplyr vs. SQL (2024)

Dplyr is a part of the tidyverse package and is called the grammar of data manipulation. Structure Query Language (SQL) is a special purpose language used to interact with data stored in databases. In this article, we will compare the similarities and differences in query structure of both dplyr and SQL. SQL and dplyr both are industry standards and are used in industry and academia equally. In SQL SELECT is a clause used to select the columns’ subset and the dplyr has select(dataset, col01, col02, ...) verb used for the same task, similarly WHERE clause and filter(dataset, col01 > val1, ...) verb are used to filter datasets in SQL and dplyr respectively.

Dplyr vs. SQL (2)

We will use mtcars dataset. The CSV and batabasefiles are given below. For the database file, I use SQLite. You can open the .db file in DB Browser For SQLite.

We want to select Miles/(US) gallon (mpg), Number of cylinders(cyl), and Gross horsepower (hp) from mtcars dataset.

  • dplyr has select verb to return required columns
  • SQL has SELECT clause to choose required columns

DPLYR

SQL

If you want to return the top n rows of the dataset you need to use the following dplyr and SQL commands.

  • dplyr has head(n = k) function to return top k rows
  • SQL has LIMIT k clause to return top k rows

DPLYR

SQL

To count the total number of rows in the dataset you need to apply the following dplyr and SQL commands

  • dplyr has summary(count = n()) function to count all the records in the dataset.
  • SQL has COUNT(*) function to return the total number of records in the data table

DPLYR

SQL

Sorting datasets is a very important task in data science. SQL and dplyr has the following functions to sort the dataset

  • dplyr has arrange(col1, desc(col2), ...) verb to sort the dataset order by columns such as col1 and col2. The dataset will sort in ascending order by column col1 and descending order by column col2.
  • SQL has ORDER BY clause to sort the dataset
  • We have given three examples of sorting the mtcars dataset
  • In the first example, we sort the mtcars dataset according to the ascending order of mpg column.
  • In the second example, we sort the mtcars dataset according to the descending order of mpg column.
  • In the third example, we sort the mtcars dataset according to the ascending order of thempg column and descending order of the model column.

DPLYR

SQL

For example, you want to convert Miles-per-gallon mpg, and Weight(1000 lbs) wt to kg-per-gallon kpland wt_kg

  • dplyr has mutate(new_col = old_col*some_val, ...) verb to create new columns
  • SQL creates driver columns in the SELECT clause is given below.

DPLYR

SQL

You can sort datasets in ascending or descending order of any column in the dataset. SQL and dplyr functions are given below

  • dplyr has filter(gear == 4) verb to filter the dataset
  • SQL has WHERE clause to filter records in the database
  • The data type of gear is a factor and we need to convert it into numeric as less than operator does not operate on the factor data type.
  • We showed three examples of filtering data with dplyr and SQL
  • First, we filtered the mtcars dataset with the condition gear == 4
  • Second, we filtered the mtcars dataset with the condition gear == 4 and cyl <= 6.
  • Second, we filtered the mtcars dataset with the condition gear == 4 or cyl <= 6.

DPLYR

SQL

The GROUP BY statement groups rows of the dataset that have the same values into summary rows, like “find the number of gear in each category”.

The GROUP BY statement is often used with aggregate functions (COUNT(), MAX(), MIN(), SUM(), AVG()) to group the result-set by one or more columns.

The dplyr and SQL code for grouping data is given below

DPLYR

SQL

  • dplyr has summarise(mean_wt = mean(wt), ...) verb to calculate mean
  • SQL has AVG() to calculate average

DPLYR

SQL

  • dplyr has summarise(var_wt = var(wt), ...) verb to calculate variance
  • SQL has VAR(wt) to calculate the variance of wt column

DPLYR

SQL

  • dplyr has summarise(MIN = min(as.numeric(gear)), ...) verb to calculate different summary functions.
  • SQL has MIN(wt), MAX(wt), SUM(wt), RANGEto calculate the average values of wt column

DPLYR

SQL

In this article, we learned the structure of dplyr and SQL queries. We learned how to filter, limit, sort, select and group dataets using SQL and dplyr.

Dplyr vs. SQL (2024)

References

Top Articles
Latest Posts
Article information

Author: Ray Christiansen

Last Updated:

Views: 6086

Rating: 4.9 / 5 (69 voted)

Reviews: 84% of readers found this page helpful

Author information

Name: Ray Christiansen

Birthday: 1998-05-04

Address: Apt. 814 34339 Sauer Islands, Hirtheville, GA 02446-8771

Phone: +337636892828

Job: Lead Hospitality Designer

Hobby: Urban exploration, Tai chi, Lockpicking, Fashion, Gunsmithing, Pottery, Geocaching

Introduction: My name is Ray Christiansen, I am a fair, good, cute, gentle, vast, glamorous, excited person who loves writing and wants to share my knowledge and understanding with you.