Data Manipulation Using SQL And Tidyverse’s Dplyr Package
Dplyr is a part of the tidyverse package and is called the grammar of data manipulation. Structure Query Language (SQL) is a special purpose language used to interact with data stored in databases. In this article, we will compare the similarities and differences in query structure of both dplyr and SQL. SQL and dplyr both are industry standards and are used in industry and academia equally. In SQL SELECT
is a clause used to select the columns’ subset and the dplyr has select(dataset, col01, col02, ...)
verb used for the same task, similarly WHERE
clause and filter(dataset, col01 > val1, ...)
verb are used to filter datasets in SQL and dplyr respectively.
We will use mtcars dataset. The CSV and batabase
files are given below. For the database file, I use SQLite. You can open the .db
file in DB Browser For SQLite.
We want to select Miles/(US) gallon (mpg
), Number of cylinders(cyl
), and Gross horsepower (hp
) from mtcars dataset.
- dplyr has
select
verb to return required columns - SQL has
SELECT
clause to choose required columns
DPLYR
SQL
If you want to return the top n rows of the dataset you need to use the following dplyr and SQL commands.
- dplyr has
head(n = k)
function to return topk
rows - SQL has
LIMIT k
clause to return topk
rows
DPLYR
SQL
To count the total number of rows in the dataset you need to apply the following dplyr and SQL commands
- dplyr has
summary(count = n())
function to count all the records in the dataset. - SQL has
COUNT(*)
function to return the total number of records in the data table
DPLYR
SQL
Sorting datasets is a very important task in data science. SQL and dplyr has the following functions to sort the dataset
- dplyr has
arrange(col1, desc(col2), ...)
verb to sort the dataset order by columns such ascol1
andcol2
. The dataset will sort in ascending order by columncol1
and descending order by columncol2
. - SQL has
ORDER BY
clause to sort the dataset - We have given three examples of sorting the
mtcars
dataset - In the first example, we sort the
mtcars
dataset according to the ascending order ofmpg
column. - In the second example, we sort the
mtcars
dataset according to the descending order ofmpg
column. - In the third example, we sort the
mtcars
dataset according to the ascending order of thempg
column and descending order of themodel
column.
DPLYR
SQL
For example, you want to convert Miles-per-gallon mpg,
and Weight(1000 lbs) wt
to kg-per-gallon kpl
and wt_kg
- dplyr has
mutate(new_col = old_col*some_val, ...)
verb to create new columns - SQL creates driver columns in the
SELECT
clause is given below.
DPLYR
SQL
You can sort datasets in ascending or descending order of any column in the dataset. SQL and dplyr functions are given below
- dplyr has
filter(gear == 4)
verb to filter the dataset - SQL has
WHERE
clause to filter records in the database - The data type of
gear
is afactor
and we need to convert it intonumeric
asless than
operator does not operate on the factor data type. - We showed three examples of filtering data with dplyr and SQL
- First, we filtered the mtcars dataset with the condition
gear == 4
- Second, we filtered the mtcars dataset with the condition
gear == 4 and cyl <= 6.
- Second, we filtered the mtcars dataset with the condition
gear == 4 or cyl <= 6.
DPLYR
SQL
The GROUP BY statement groups rows of the dataset that have the same values into summary rows, like “find the number of gear in each category”.
The GROUP BY statement is often used with aggregate functions (COUNT(), MAX(), MIN(), SUM(), AVG()) to group the result-set by one or more columns.
The dplyr and SQL code for grouping data is given below
DPLYR
SQL
- dplyr has
summarise(mean_wt = mean(wt), ...)
verb to calculate mean - SQL has
AVG()
to calculate average
DPLYR
SQL
- dplyr has
summarise(var_wt = var(wt), ...)
verb to calculate variance - SQL has
VAR(wt)
to calculate the variance ofwt
column
DPLYR
SQL
- dplyr has
summarise(MIN = min(as.numeric(gear)), ...)
verb to calculate different summary functions. - SQL has
MIN(wt), MAX(wt), SUM(wt), RANGE
to calculate the average values ofwt
column
DPLYR
SQL
In this article, we learned the structure of dplyr and SQL queries. We learned how to filter, limit, sort, select and group dataets using SQL and dplyr.