101 R data.table Exercises (2024)

The data.table package in R is super fast when it comes to handling data. It has a syntax that reduces keystrokes while making R code easier to read. These set of exercises are designed to help you to oil your data brain through solving data manipulation exercises. Related post: 101 Python datatable Exercises (pydatatable)

101 R data.table Exercises. Photo by Ela Abbau.

1. How to install data.table and check the version?

# Install the data.table packageinstall.packages("data.table")# Load the librarylibrary(data.table)# Check versionprint(packageVersion("data.table"))

2. How to create a data.table from lists (vectors) ?

Difficulty Level: L1 Question : Create a pandas series from each of the items below: a list, numpy and a dictionary Input

list_1 <- c("a","b","c","d")list_2 <- c("1","2","3","4")list_3 <- c("aa","bb","cc","dd")

Show Solution

# Inputslist_1 <- c("a","b","c","d")list_2 <- c("1","2","3","4")list_3 <- c("aa","bb","cc","dd")# SolutionDT <- data.table(V1= list_1, V2 =list_2, V3 = list_3 )print(DT)

 V1 V2 V31: a 1 aa2: b 2 bb3: c 3 cc4: d 4 dd

3. How to import csv data files as data.table ?

Difficulty Level: L1 Question : Import BostonHousing dataset as a data.table. Show Solution

# SolutionDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')print(head(DT))

 crim zn indus chas nox rm age dis rad tax ptratio b lstat1: 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.982: 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.143: 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.034: 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.945: 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.336: 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 medv1: 24.02: 21.63: 34.74: 33.45: 36.26: 28.7

4. How to import only ‘n’ rows from a csv file to create a data.table?

Difficiulty Level: L1 Question : Import first 50 row of BostonHousing dataset as a data.table. Show Solution

# SolutionDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', nrows = 50)print(head(DT))

 crim zn indus chas nox rm age dis rad tax ptratio b lstat1: 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.982: 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.143: 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.034: 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.945: 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.336: 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 medv1: 24.02: 21.63: 34.74: 33.45: 36.26: 28.7

5. How to import only specified columns from a csv file?

Difficulty Level: L1 Question : Import only ‘crim’ and ‘medv’ columns of the BostonHousing dataset as a data.table. Show Solution

# SolutionDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', select = c('crim', 'medv'))print(head(DT))

 crim medv1: 0.00632 24.02: 0.02731 21.63: 0.02729 34.74: 0.03237 33.45: 0.06905 36.26: 0.02985 28.7

6. How to get the nrows, ncolumns, datatype, summary stats of each column in a data.table

Difficulty Level: L2 Question : Get the number of rows, columns, datatype and summary statistics of each column of the Cars93 dataset. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')

Show Solution

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')# number of rows and columnsprint(NROW(DT))print(NCOL(DT))# datatypesprint(sapply(DT,class))# summary statisticssummary(DT)

[1] 506[1] 14 crim zn indus chas nox rm age dis "numeric" "numeric" "numeric" "integer" "numeric" "numeric" "numeric" "numeric" rad tax ptratio b lstat medv "integer" "integer" "numeric" "numeric" "numeric" "numeric" crim zn indus chas Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000 Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000 Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000 Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000 nox rm age dis Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207 Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127 rad tax ptratio b Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38 Median : 5.000 Median :330.0 Median :19.05 Median :391.44 Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23 Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90 lstat medv Min. : 1.73 Min. : 5.00 1st Qu.: 6.95 1st Qu.:17.02 Median :11.36 Median :21.20 Mean :12.65 Mean :22.53 3rd Qu.:16.95 3rd Qu.:25.00 Max. :37.97 Max. :50.00

7. How to extract the row and column number of a particular cell with given criterion?

Difficulty Level: L2 Question : Which manufacturer, model and type has the highest Price? What is the row and column number of the cell with the highest Price value? Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution

# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# Solution# Get Manufacturer with highest priceprint(DT[Price == max(Price, na.rm = TRUE)][,list(Manufacturer, Model, Type)])# Get row numberDT[, .(rownum=.I, Price)][Price == max(Price, na.rm = TRUE), rownum]

 Manufacturer Model Type1: Mercedes-Benz 300E Midsize

8. How to rename a specific columns in a data.table?

Difficulty Level: Easy Question : Rename the column Model as New_Model in DT. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution

# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')print(head(setnames(DT, old = "Model", new = "New_Model")[]))

 Manufacturer New_Model Type Min.Price Price Max.Price MPG.city1: Acura Integra Small 12.9 15.9 18.8 252: <NA> Legend Midsize 29.2 33.9 38.7 183: Audi 90 Compact 25.9 29.1 32.3 204: Audi 100 Midsize NA 37.7 44.6 195: BMW 535i Midsize NA 30.0 NA 226: Buick Century Midsize 14.2 15.7 17.3 22 MPG.highway AirBags DriveTrain Cylinders EngineSize Horsepower1: 31 None Front 4 1.8 1402: 25 Driver & Passenger Front 6 3.2 2003: 26 Driver only Front 6 2.8 1724: 26 Driver & Passenger <NA> 6 NA 1725: 30 <NA> Rear 4 3.5 2086: 31 Driver only <NA> 4 2.2 110 RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length1: 6300 2890 Yes 13.2 5 1772: 5500 2335 Yes 18.0 5 1953: 5500 2280 Yes 16.9 5 1804: 5500 2535 <NA> 21.1 6 1935: 5700 2545 Yes 21.1 4 1866: 5200 2565 No 16.4 6 189 Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin1: 102 68 37 26.5 NA 2705 non-USA2: 115 71 38 30.0 15 3560 non-USA3: 102 67 37 28.0 14 3375 non-USA4: 106 NA 37 31.0 17 3405 non-USA5: 109 69 39 27.0 13 3640 non-USA6: 105 69 41 28.0 16 NA USA Make1: Acura Integra2: Acura Legend3: Audi 904: Audi 1005: BMW 535i6: Buick Century

9. How to check if a data.table has any missing values?

Difficulty Level: L1 Question : Check if DT has any missing values. Return TRUE is present else FALSE. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution

# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# SolutionanyNA(DT)

TRUE

10. How to check for missing values in a data.table?

Difficulty Level: L1 Question : Get the location of missing values in DT. Replace the value with TRUE is missing else FALSE. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

 Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE[2,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE[4,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE[5,] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE(...truncated...)

Show Solution

# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# Solutionprint(head(is.na(DT)))

 Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE[2,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE[4,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE[5,] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE AirBags DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE[4,] FALSE TRUE FALSE TRUE FALSE FALSE FALSE[5,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE[6,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width[1,] FALSE FALSE FALSE FALSE FALSE FALSE[2,] FALSE FALSE FALSE FALSE FALSE FALSE[3,] FALSE FALSE FALSE FALSE FALSE FALSE[4,] TRUE FALSE FALSE FALSE FALSE TRUE[5,] FALSE FALSE FALSE FALSE FALSE FALSE[6,] FALSE FALSE FALSE FALSE FALSE FALSE Turn.circle Rear.seat.room Luggage.room Weight Origin Make[1,] FALSE FALSE TRUE FALSE FALSE FALSE[2,] FALSE FALSE FALSE FALSE FALSE FALSE[3,] FALSE FALSE FALSE FALSE FALSE FALSE[4,] FALSE FALSE FALSE FALSE FALSE FALSE[5,] FALSE FALSE FALSE FALSE FALSE FALSE[6,] FALSE FALSE FALSE TRUE FALSE FALSE

11. How to get the row and column positions of missing values in a data.table?

Difficulty Level: L2 Question: Get the row and column positions of missing values in DT. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

[1] "(1, 24)" "(2, 1)" "(4, 4)" "(4, 10)" "(4, 12)" "(4, 16)" "(4, 21)" [8] "(5, 4)" "(5, 6)" "(5, 9)" "(6, 10)" "(6, 25)" "(7, 6)" "(7, 15)" [15] "(7, 17)" "(9, 21)" "(10, 15)" "(11, 24)" "(12, 9)" "(12, 10)" "(12, 13)"[22] "(13, 11)" "(13, 24)" "(14, 11)" "(14, 17)" "(15, 7)" "(15, 13)" "(15, 19)"[29] "(15, 22)" "(16, 24)" "(17, 13)" "(17, 24)" "(19, 23)" "(19, 24)" "(19, 26)"[36] "(20, 1)" "(20, 22)" "(23, 17)" "(23, 23)" "(24, 15)" "(24, 24)" .. (truncated)

Show Solution

# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# Solutionoutput <- c() # init output vectorfor(i in 1:NROW(DT)){ for(j in 1:NCOL(DT)){ curr_value <- DT[i, j,with=F][[1]] if(is.na(curr_value)){ position <- paste0('(', paste(i, j, sep=', '), ')') output <- c(output, position) } }}

11. How to count the number of missing values in each column of data.table?

Difficulty Level: L2 Question : Count the number of missing values in each column of DT Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution

# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# Solutionfor (i in colnames(DT)){ print(paste("Number of missing values in column", i , "is", sum(is.na(DT[,..i]))))}

[1] "Number of missing values in column Manufacturer is 4"[1] "Number of missing values in column Model is 1"[1] "Number of missing values in column Type is 3"[1] "Number of missing values in column Min.Price is 7"[1] "Number of missing values in column Price is 2"[1] "Number of missing values in column Max.Price is 5"[1] "Number of missing values in column MPG.city is 9"[1] "Number of missing values in column MPG.highway is 2"[1] "Number of missing values in column AirBags is 6"[1] "Number of missing values in column DriveTrain is 7"[1] "Number of missing values in column Cylinders is 5"[1] "Number of missing values in column EngineSize is 2"[1] "Number of missing values in column Horsepower is 7"[1] "Number of missing values in column RPM is 3"[1] "Number of missing values in column Rev.per.mile is 6"[1] "Number of missing values in column Man.trans.avail is 5"[1] "Number of missing values in column Fuel.tank.capacity is 8"[1] "Number of missing values in column Passengers is 2"[1] "Number of missing values in column Length is 4"[1] "Number of missing values in column Wheelbase is 1"[1] "Number of missing values in column Width is 6"[1] "Number of missing values in column Turn.circle is 5"[1] "Number of missing values in column Rear.seat.room is 4"[1] "Number of missing values in column Luggage.room is 19"[1] "Number of missing values in column Weight is 7"[1] "Number of missing values in column Origin is 5"[1] "Number of missing values in column Make is 3"

12. How to replace missing values of data.table with 0?

Difficulty Level: L2 Question : Replace all missing values in DT with 0. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution

# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# SolutionDT[is.na(DT)] <- 0print(head(DT[, 1:8]))

 Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: Acura Integra Small 12.9 15.9 18.8 25 312: 0 Legend Midsize 29.2 33.9 38.7 18 253: Audi 90 Compact 25.9 29.1 32.3 20 264: Audi 100 Midsize 0.0 37.7 44.6 19 265: BMW 535i Midsize 0.0 30.0 0.0 22 306: Buick Century Midsize 14.2 15.7 17.3 22 31

13. How to replace missing values of numeric columns with the mean?

Difficulty Level: L2 Question : Replace missing values of numeric columns with their respective mean. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution

# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# Solutionreplace_with_mean <- function(x){ifelse(is.na(x),mean(x, na.rm = TRUE),x)}output <- DT[,lapply(.SD, replace_with_mean)]print(head(output))

 Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: Acura Integra Small 12.9000 15.9 18.80000 25 312: <NA> Legend Midsize 29.2000 33.9 38.70000 18 253: Audi 90 Compact 25.9000 29.1 32.30000 20 264: Audi 100 Midsize 17.1186 37.7 44.60000 19 265: BMW 535i Midsize 17.1186 30.0 21.45909 22 306: Buick Century Midsize 14.2000 15.7 17.30000 22 31 AirBags DriveTrain Cylinders EngineSize Horsepower RPM1: None Front 4 1.800000 140 63002: Driver & Passenger Front 6 3.200000 200 55003: Driver only Front 6 2.800000 172 55004: Driver & Passenger <NA> 6 2.658242 172 55005: <NA> Rear 4 3.500000 208 57006: Driver only <NA> 4 2.200000 110 5200 Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase1: 2890 Yes 13.2 5 177 1022: 2335 Yes 18.0 5 195 1153: 2280 Yes 16.9 5 180 1024: 2535 <NA> 21.1 6 193 1065: 2545 Yes 21.1 4 186 1096: 2565 No 16.4 6 189 105 Width Turn.circle Rear.seat.room Luggage.room Weight Origin1: 68.00000 37 26.5 13.98649 2705.000 non-USA2: 71.00000 38 30.0 15.00000 3560.000 non-USA3: 67.00000 37 28.0 14.00000 3375.000 non-USA4: 69.44828 37 31.0 17.00000 3405.000 non-USA5: 69.00000 39 27.0 13.00000 3640.000 non-USA6: 69.00000 41 28.0 16.00000 3104.593 USA Make1: Acura Integra2: Acura Legend3: Audi 904: Audi 1005: BMW 535i6: Buick Century

14. How to select a specific column from a data.table as a list (vector)?

Difficulty Level: L1 Question : Get the column Model in DT as a list. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Solution

[1] "Integra" "Legend" "90" "100" [5] "535i" "Century" "LeSabre" "Roadmaster" [9] "Riviera" "DeVille" "Seville" "Cavalier" [13] "Corsica" "Camaro" "Lumina" "Lumina_APV" [17] "Astro" "Caprice" "Corvette" "Concorde" [21] "LeBaron" "Imperial" (...truncated...)

Show Solution

# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#SolutionDT <- DT[, Model]print(DT)

 [1] "Integra" "Legend" "90" "100" [5] "535i" "Century" "LeSabre" "Roadmaster" [9] "Riviera" "DeVille" "Seville" "Cavalier" [13] "Corsica" "Camaro" "Lumina" "Lumina_APV" [17] "Astro" "Caprice" "Corvette" "Concorde" [21] "LeBaron" "Imperial" "Colt" "Shadow" [25] "Spirit" "Caravan" "Dynasty" "Stealth" [29] "Summit" "Vision" "Festiva" "Escort" [33] "Tempo" "Mustang" "Probe" "Aerostar" [37] "Taurus" "Crown_Victoria" "Metro" "Storm" [41] "Prelude" "Civic" "Accord" "Excel" [45] "Elantra" "Scoupe" "Sonata" "Q45" [49] "ES300" "SC300" "Continental" "Town_Car" [53] "323" "Protege" "626" "MPV" [57] "RX-7" "190E" "300E" NA [61] "Cougar" "Mirage" "Diamante" "Sentra" [65] "Altima" "Quest" "Maxima" "Achieva" [69] "Cutlass_Ciera" "Silhouette" "Eighty-Eight" "Laser" [73] "LeMans" "Sunbird" "Firebird" "Grand_Prix" [77] "Bonneville" "900" "SL" "Justy" [81] "Loyale" "Legacy" "Swift" "Tercel" [85] "Celica" "Camry" "Previa" "Fox" [89] "Eurovan" "Passat" "Corrado" "240" [93] "850"

15. How to select a specific column from a data.table as a data.table?

Difficulty Level: L2 Question : Get the column Model in DT as a data.table (rather than as a list). Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Solution

 Model1: Integra2: Legend3: 904: 1005: 535i6: Century

Show Solution

# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#SolutionDT <- DT[, list(Model)]print(head(DT))

 Model1: Integra2: Legend3: 904: 1005: 535i6: Century

16. How to sort a data.table based on a given column?

Difficulty Level: L1 Question: Reorder the rows as per Price column of DT. Input

17. How to sort a data.table based on two or more columns?

Difficulty Level: L1 Question : Reorder the rows as per ascending order of Price column and descending order of Max.Price column of DT. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution

# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#SolutionDT <- DT[order(Price, -Max.Price)]print(head(DT))

 Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: Ford Festiva Small 6.9 7.4 7.9 31 332: Hyundai Excel Small 6.8 8.0 9.2 29 333: Mazda 323 Small 7.4 8.3 9.1 29 374: Geo Metro Small 6.7 8.4 10.0 46 505: Subaru Justy Small 7.3 8.4 9.5 33 376: Suzuki Swift <NA> 7.3 8.6 NA 39 43 AirBags DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile1: <NA> Front 4 1.3 63 5000 31502: None Front 4 1.5 81 5500 27103: None Front 4 1.6 82 5000 23704: None Front 3 1.0 55 5700 37555: None 4WD 3 1.2 73 5600 28756: None Front 3 1.3 70 6000 3360 Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width1: Yes 10.0 4 141 90 632: Yes 11.9 5 168 94 633: Yes 13.2 4 164 97 664: <NA> 10.6 4 151 93 635: Yes 9.2 4 146 90 606: Yes 10.6 4 161 93 NA Turn.circle Rear.seat.room Luggage.room Weight Origin Make1: 33 26.0 12 1845 USA Ford Festiva2: 35 26.0 11 2345 non-USA Hyundai Excel3: 34 27.0 16 2325 non-USA Mazda 3234: 34 27.5 10 1695 non-USA Geo Metro5: 32 23.5 10 2045 non-USA Subaru Justy6: 34 27.5 10 1965 non-USA Suzuki Swift

18. How to filter out rows from a data.table that satisfy multiple conditions?

Difficulty Level: L2 Question : Filter out the rows having Manufacturer as Ford and Price less than 30 Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output :

 Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: Ford Festiva Small 6.9 7.4 7.9 31 332: Ford Escort Small 8.4 10.1 11.9 23 303: Ford Tempo Compact 10.4 11.3 12.2 22 274: Ford Mustang Sporty 10.8 15.9 21.0 22 295: Ford Probe Sporty 12.8 14.0 15.2 NA 306: Ford Aerostar Van 14.5 19.9 25.3 15 20 AirBags DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile1: <NA> Front 4 1.3 63 5000 31502: None Front 4 1.8 127 6500 24103: None Front 4 2.3 96 4200 28054: Driver only Rear 4 2.3 105 4600 22855: Driver only Front 4 2.0 115 5500 23406: Driver only 4WD 6 3.0 145 4800 2080(...truncated...)

Show Solution

#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionsol <- DT[Manufacturer == "Ford" & Price < 30]print(head(sol))

 Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: Ford Festiva Small 6.9 7.4 7.9 31 332: Ford Escort Small 8.4 10.1 11.9 23 303: Ford Tempo Compact 10.4 11.3 12.2 22 274: Ford Mustang Sporty 10.8 15.9 21.0 22 295: Ford Probe Sporty 12.8 14.0 15.2 NA 306: Ford Aerostar Van 14.5 19.9 25.3 15 20 AirBags DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile1: <NA> Front 4 1.3 63 5000 31502: None Front 4 1.8 127 6500 24103: None Front 4 2.3 96 4200 28054: Driver only Rear 4 2.3 105 4600 22855: Driver only Front 4 2.0 115 5500 23406: Driver only 4WD 6 3.0 145 4800 2080 Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width1: Yes 10.0 4 141 90 632: <NA> 13.2 5 171 98 673: Yes 15.9 5 177 100 684: Yes 15.4 4 180 101 685: Yes 15.5 4 179 103 706: Yes 21.0 7 176 119 72 Turn.circle Rear.seat.room Luggage.room Weight Origin Make1: 33 26.0 12 1845 USA Ford Festiva2: 36 28.0 12 2530 USA Ford Escort3: NA 27.5 13 NA USA <NA>4: 40 24.0 12 NA USA Ford Mustang5: 38 23.0 18 2710 USA Ford Probe6: 45 30.0 NA 3735 USA Ford Aerostar

19. How to select multiple columns from data.table and rename them?

Difficulty Level: L2 Question : Get columns ‘Manufacturer’ & ‘Type’ from data.table and rename them to ‘MANUFACTURER’ & ‘TYPE’ Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

 MANUFACTURER TYPE1: Acura Small2: <NA> Midsize3: Audi Compact4: Audi Midsize5: BMW Midsize6: Buick Midsize

Show Solution

#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionsol <- DT[, .(MANUFACTURER = Manufacturer, TYPE = Type)]print(head(sol))

 MANUFACTURER TYPE1: Acura Small2: <NA> Midsize3: Audi Compact4: Audi Midsize5: BMW Midsize6: Buick Midsize

20. How to get mean of rows that satisfy a condition from a given column?

Difficulty Level: L2 Question : Get mean of all the vehicles having Price greater than 30. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output :

38.6833

Show Solution

#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionoutput <- DT[Price > 30, mean(Price, na.rm = TRUE)]output

38.6833333333333

21. How to remove only specified columns from a data.table?

Difficulty Level: L2 Question : Create new data.table by removing ‘Manufacturer’ & ‘Type’ columns. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution

#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution output <- DT[, !c("Manufacturer", "Type")]print(head(output))

 Model Min.Price Price Max.Price MPG.city MPG.highway AirBags1: Integra 12.9 15.9 18.8 25 31 None2: Legend 29.2 33.9 38.7 18 25 Driver & Passenger3: 90 25.9 29.1 32.3 20 26 Driver only4: 100 NA 37.7 44.6 19 26 Driver & Passenger5: 535i NA 30.0 NA 22 30 <NA>6: Century 14.2 15.7 17.3 22 31 Driver only DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile Man.trans.avail1: Front 4 1.8 140 6300 2890 Yes2: Front 6 3.2 200 5500 2335 Yes3: Front 6 2.8 172 5500 2280 Yes4: <NA> 6 NA 172 5500 2535 <NA>5: Rear 4 3.5 208 5700 2545 Yes6: <NA> 4 2.2 110 5200 2565 No Fuel.tank.capacity Passengers Length Wheelbase Width Turn.circle1: 13.2 5 177 102 68 372: 18.0 5 195 115 71 383: 16.9 5 180 102 67 374: 21.1 6 193 106 NA 375: 21.1 4 186 109 69 396: 16.4 6 189 105 69 41 Rear.seat.room Luggage.room Weight Origin Make1: 26.5 NA 2705 non-USA Acura Integra2: 30.0 15 3560 non-USA Acura Legend3: 28.0 14 3375 non-USA Audi 904: 31.0 17 3405 non-USA Audi 1005: 27.0 13 3640 non-USA BMW 535i6: 28.0 16 NA USA Buick Century

22. How to calculate count of each unique value in a given column?

Difficulty Level: L2 Question : Calculate number of records for every Manufacturer in DT. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

 Manufacturer N1: Acura 12: <NA> 43: Audi 24: BMW 15: Buick 46: Cadillac 2

Show Solution

#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution output <- DT[, .(.N), by = .(Manufacturer)]print(head(output))

 Manufacturer N1: Acura 12: <NA> 43: Audi 24: BMW 15: Buick 46: Cadillac 2

23. How to find mean of a column grouped by multiple columns?

Difficulty Level: L2 Question : Calculate mean Price for every Manufacturer, Type combination in DT. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

 Manufacturer Type V11: Acura Small 15.900002: <NA> Midsize 31.933333: Audi Compact 29.100004: Audi Midsize 37.700005: BMW Midsize 30.000006: Buick Midsize 21.00000

Show Solution

#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution output <- DT[, mean(Price, na.rm = TRUE), by = .(Manufacturer, Type)]print(head(output))

 Manufacturer Type V11: Acura Small 15.900002: <NA> Midsize 31.933333: Audi Compact 29.100004: Audi Midsize 37.700005: BMW Midsize 30.000006: Buick Midsize 21.00000

24. How to find mean of column grouped by multiple columns and sort by grouped columns?

Difficulty Level: L2 Question : Calculate mean Price for every Manufacturer, Type combination and arrange it by the keys in DT. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

 Manufacturer Type V11: Acura Small 15.902: Audi Compact 29.103: Audi Midsize 37.704: BMW Midsize 30.005: Buick Large 22.256: Buick Midsize 21.00

Show Solution

#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution 1output <- DT[, mean(Price, na.rm = TRUE), keyby = .(Manufacturer, Type)]#Solution 1output <- DT[, mean(Price, na.rm = TRUE), by = .(Manufacturer, Type)][order(Manufacturer, Type)]print(head(output))

 Manufacturer Type V11: Acura Small 15.902: Audi Compact 29.103: Audi Midsize 37.704: BMW Midsize 30.005: Buick Large 22.256: Buick Midsize 21.00

25. How to select all numeric columns from data.table?

Difficulty Level: L2 Question : Select numeric columns from DT. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

 Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower RPM1: 12.9 15.9 18.8 25 31 1.8 140 63002: 29.2 33.9 38.7 18 25 3.2 200 55003: 25.9 29.1 32.3 20 26 2.8 172 55004: NA 37.7 44.6 19 26 NA 172 55005: NA 30.0 NA 22 30 3.5 208 57006: 14.2 15.7 17.3 22 31 2.2 110 5200 Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width1: 2890 13.2 5 177 102 682: 2335 18.0 5 195 115 713: 2280 16.9 5 180 102 674: 2535 21.1 6 193 106 NA5: 2545 21.1 4 186 109 696: 2565 16.4 6 189 105 69(...truncated...)

Show Solution

# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# Solution 1output <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]print(head(output))

 Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower RPM1: 12.9 15.9 18.8 25 31 1.8 140 63002: 29.2 33.9 38.7 18 25 3.2 200 55003: 25.9 29.1 32.3 20 26 2.8 172 55004: NA 37.7 44.6 19 26 NA 172 55005: NA 30.0 NA 22 30 3.5 208 57006: 14.2 15.7 17.3 22 31 2.2 110 5200 Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width1: 2890 13.2 5 177 102 682: 2335 18.0 5 195 115 713: 2280 16.9 5 180 102 674: 2535 21.1 6 193 106 NA5: 2545 21.1 4 186 109 696: 2565 16.4 6 189 105 69 Turn.circle Rear.seat.room Luggage.room Weight1: 37 26.5 NA 27052: 38 30.0 15 35603: 37 28.0 14 33754: 37 31.0 17 34055: 39 27.0 13 36406: 41 28.0 16 NA

26. How to calculate mean of all numeric columns for every Manufacturer?

Difficulty Level: L3 Question : Calculate mean of all numeric columns for every Manufacturer Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

 Manufacturer Min.Price Price Max.Price MPG.city MPG.highway EngineSize1: Acura 12.900 15.900 18.80 25.0 31.00 1.8002: <NA> 26.775 28.550 30.30 19.0 26.00 2.9753: Audi NA 33.400 38.45 19.5 26.00 NA4: BMW NA 30.000 NA 22.0 30.00 3.5005: Buick 20.750 21.625 NA 19.0 27.75 3.8756: Cadillac 35.250 37.400 39.50 16.0 25.00 4.750(...truncated...)

Show Solution

#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionoutput <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]output <- cbind(output, DT[, list(Manufacturer)])output <- output[, lapply(.SD, mean), by = Manufacturer]print(head(output))

 Manufacturer Min.Price Price Max.Price MPG.city MPG.highway EngineSize1: Acura 12.900 15.900 18.80 25.0 31.00 1.8002: <NA> 26.775 28.550 30.30 19.0 26.00 2.9753: Audi NA 33.400 38.45 19.5 26.00 NA4: BMW NA 30.000 NA 22.0 30.00 3.5005: Buick 20.750 21.625 NA 19.0 27.75 3.8756: Cadillac 35.250 37.400 39.50 16.0 25.00 4.750 Horsepower RPM Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase1: 140.0 6300 2890.0 13.2 5.00 177.00 102.002: 186.5 5750 NA NA 5.00 193.25 109.753: 172.0 5500 2407.5 19.0 5.50 186.50 104.004: 208.0 5700 2545.0 21.1 4.00 186.00 109.005: 157.5 4700 NA NA 5.75 200.75 110.006: 247.5 5050 NA 19.0 5.50 205.00 112.50 Width Turn.circle Rear.seat.room Luggage.room Weight1: 68.00 37.00 26.500 NA 2705.002: 71.25 NA 29.000 13.5 3458.753: NA 37.00 29.500 15.5 3390.004: 69.00 39.00 27.000 13.0 3640.005: NA 42.25 28.875 17.0 NA6: 73.50 43.50 33.000 NA 3777.50

27. How to get first 3 rows for every Manufacturer?

Difficulty Level: L3 Question : Get first 3 rows for every Manufacturer in DT. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

 Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: Acura Integra Small 12.9 15.9 18.8 25 312: Audi 90 Compact 25.9 29.1 32.3 20 263: Audi 100 Midsize NA 37.7 44.6 19 264: BMW 535i Midsize NA 30.0 NA 22 305: Buick Century Midsize 14.2 15.7 17.3 22 316: Buick LeSabre Large 19.9 20.8 NA 19 28(...truncated...)

Show Solution

#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution output <- DT[!is.na(Manufacturer), head(.SD, 3), by = Manufacturer]print(head(output))

 Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: Acura Integra Small 12.9 15.9 18.8 25 312: Audi 90 Compact 25.9 29.1 32.3 20 263: Audi 100 Midsize NA 37.7 44.6 19 264: BMW 535i Midsize NA 30.0 NA 22 305: Buick Century Midsize 14.2 15.7 17.3 22 316: Buick LeSabre Large 19.9 20.8 NA 19 28 AirBags DriveTrain Cylinders EngineSize Horsepower RPM1: None Front 4 1.8 140 63002: Driver only Front 6 2.8 172 55003: Driver & Passenger <NA> 6 NA 172 55004: <NA> Rear 4 3.5 208 57005: Driver only <NA> 4 2.2 110 52006: Driver only Front 6 3.8 170 4800 Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase1: 2890 Yes 13.2 5 177 1022: 2280 Yes 16.9 5 180 1023: 2535 <NA> 21.1 6 193 1064: 2545 Yes 21.1 4 186 1095: 2565 No 16.4 6 189 1056: NA No NA 6 200 111 Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make1: 68 37 26.5 NA 2705 non-USA Acura Integra2: 67 37 28.0 14 3375 non-USA Audi 903: NA 37 31.0 17 3405 non-USA Audi 1004: 69 39 27.0 13 3640 non-USA BMW 535i5: 69 41 28.0 16 NA USA Buick Century6: 74 42 30.5 17 3470 USA Buick LeSabre

28. How to remove specific column from data.table?

Difficulty Level: L2 Question : Remove column Model from DT. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution

#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution DT[, Model := NULL]print(head(DT))

 Manufacturer Type Min.Price Price Max.Price MPG.city MPG.highway1: Acura Small 12.9 15.9 18.8 25 312: <NA> Midsize 29.2 33.9 38.7 18 253: Audi Compact 25.9 29.1 32.3 20 264: Audi Midsize NA 37.7 44.6 19 265: BMW Midsize NA 30.0 NA 22 306: Buick Midsize 14.2 15.7 17.3 22 31 AirBags DriveTrain Cylinders EngineSize Horsepower RPM1: None Front 4 1.8 140 63002: Driver & Passenger Front 6 3.2 200 55003: Driver only Front 6 2.8 172 55004: Driver & Passenger <NA> 6 NA 172 55005: <NA> Rear 4 3.5 208 57006: Driver only <NA> 4 2.2 110 5200 Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase1: 2890 Yes 13.2 5 177 1022: 2335 Yes 18.0 5 195 1153: 2280 Yes 16.9 5 180 1024: 2535 <NA> 21.1 6 193 1065: 2545 Yes 21.1 4 186 1096: 2565 No 16.4 6 189 105 Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make1: 68 37 26.5 NA 2705 non-USA Acura Integra2: 71 38 30.0 15 3560 non-USA Acura Legend3: 67 37 28.0 14 3375 non-USA Audi 904: NA 37 31.0 17 3405 non-USA Audi 1005: 69 39 27.0 13 3640 non-USA BMW 535i6: 69 41 28.0 16 NA USA Buick Century

29. How to set key to data.table?

Difficulty Level: L1 Question : Set column AirBags as key to the data.table Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution

#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution setkey(DT,AirBags)

30. How to get all rows where `AirBags` are at `'Driver & Passenger'` in `DT`?

Difficulty Level: L2 Question : Get all rows where AirBags are Driver & Passenger in DT. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution

#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution setkey(DT,AirBags)print(head(DT["Driver & Passenger"]))

 Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: <NA> Legend Midsize 29.2 33.9 38.7 18 252: Audi 100 Midsize NA 37.7 44.6 19 263: Cadillac Seville Midsize 37.5 40.1 42.7 16 254: Chevrolet Camaro Sporty 13.4 15.1 16.8 19 285: <NA> Concorde Large 18.4 18.4 18.4 20 286: Chrysler LeBaron Compact 14.5 15.8 17.1 23 28 AirBags DriveTrain Cylinders EngineSize Horsepower RPM1: Driver & Passenger Front 6 3.2 200 55002: Driver & Passenger <NA> 6 NA 172 55003: Driver & Passenger Front 8 4.6 295 60004: Driver & Passenger Rear <NA> 3.4 160 46005: Driver & Passenger Front 6 3.3 153 53006: Driver & Passenger Front 4 3.0 141 5000 Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase1: 2335 Yes 18.0 5 195 1152: 2535 <NA> 21.1 6 193 1063: 1985 No 20.0 5 204 1114: 1805 Yes NA 4 193 1015: 1990 No 18.0 6 203 1136: 2090 No 16.0 6 183 104 Width Turn.circle Rear.seat.room Luggage.room Weight Origin1: 71 38 30.0 15 3560 non-USA2: NA 37 31.0 17 3405 non-USA3: 74 44 31.0 NA 3935 USA4: 74 43 25.0 13 3240 USA5: 74 NA 31.0 15 3515 USA6: 68 41 30.5 14 3085 USA Make1: Acura Legend2: Audi 1003: Cadillac Seville4: Chevrolet Camaro5: Chrylser Concorde6: Chrysler LeBaron

31. How to get all rows where `AirBags` are either `'Driver & Passenger'` or `'No data'`?

Difficulty Level: L2 Question : Get all rows where AirBags are Driver & Passenger & No data. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

Show Solution

#Inputlibrary(data.table)DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution setkey(DT,AirBags)print(DT[c("Driver & Passenger", "No data")])

 Manufacturer Model Type Min.Price Price Max.Price MPG.city 1: <NA> Legend Midsize 29.2 33.9 38.7 18 2: Audi 100 Midsize NA 37.7 44.6 19 3: Cadillac Seville Midsize 37.5 40.1 42.7 16 4: Chevrolet Camaro Sporty 13.4 15.1 16.8 19 5: <NA> Concorde Large 18.4 18.4 18.4 20 6: Chrysler LeBaron Compact 14.5 15.8 17.1 23 7: Eagle Vision Large NA 19.3 21.2 20 8: Honda Prelude Sporty 17.0 19.8 22.7 24 9: Honda Accord Compact 13.8 17.5 21.2 2410: <NA> SC300 Midsize 34.7 35.2 35.6 1811: Lincoln Continental Midsize 33.3 34.3 35.3 1712: Lincoln Town_Car Large 34.4 36.1 37.8 1813: Mercedes-Benz 300E Midsize 43.8 61.9 80.0 1914: Pontiac Firebird <NA> 14.0 17.7 21.4 1915: Pontiac Bonneville Large 19.4 24.4 29.4 1916: <NA> 850 Midsize 24.8 26.7 28.5 2017: <NA> <NA> <NA> NA NA NA NA MPG.highway AirBags DriveTrain Cylinders EngineSize Horsepower 1: 25 Driver & Passenger Front 6 3.2 200 2: 26 Driver & Passenger <NA> 6 NA 172 3: 25 Driver & Passenger Front 8 4.6 295 4: 28 Driver & Passenger Rear <NA> 3.4 160 5: 28 Driver & Passenger Front 6 3.3 153 6: 28 Driver & Passenger Front 4 3.0 141 7: 28 Driver & Passenger Front 6 3.5 NA 8: 31 Driver & Passenger Front 4 2.3 160 9: 31 Driver & Passenger Front 4 2.2 14010: 23 Driver & Passenger Rear 6 3.0 22511: 26 Driver & Passenger <NA> 6 3.8 16012: 26 Driver & Passenger Rear 8 4.6 21013: 25 Driver & Passenger Rear 6 3.2 21714: 28 Driver & Passenger Rear 6 NA 16015: 28 Driver & Passenger Front 6 3.8 17016: 28 Driver & Passenger Front 5 2.4 16817: NA No data <NA> <NA> NA NA RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length 1: 5500 2335 Yes 18.0 5 195 2: 5500 2535 <NA> 21.1 6 193 3: 6000 1985 No 20.0 5 204 4: 4600 1805 Yes NA 4 193 5: 5300 1990 No 18.0 6 203 6: 5000 2090 No 16.0 6 183 7: 5800 1980 No 18.0 6 202 8: 5800 2855 Yes NA 4 175 9: 5600 NA Yes 17.0 4 18510: 6000 2510 Yes NA 4 19111: 4400 1835 No 18.4 6 20512: 4600 1840 No 20.0 NA 21913: 5500 2220 No 18.5 5 NA14: 4600 1805 Yes 15.5 4 19615: 4800 1565 No 18.0 6 17716: 6200 NA Yes 19.3 5 18417: NA NA <NA> NA NA NA Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin 1: 115 71 38 30.0 15 3560 non-USA 2: 106 NA 37 31.0 17 3405 non-USA 3: 111 74 44 31.0 NA 3935 USA 4: 101 74 43 25.0 13 3240 USA 5: 113 74 NA 31.0 15 3515 USA 6: 104 68 41 30.5 14 3085 USA 7: 113 74 40 30.0 15 3490 USA 8: 100 70 39 23.5 8 2865 non-USA 9: 107 67 41 28.0 14 3040 non-USA10: 106 71 39 25.0 9 3515 non-USA11: 109 73 42 30.0 19 3695 USA12: 117 77 45 31.5 22 4055 USA13: 110 69 37 NA 15 3525 non-USA14: 101 75 43 25.0 13 3240 USA15: 111 74 43 30.5 18 3495 USA16: 105 69 38 30.0 15 3245 non-USA17: NA NA NA NA NA NA <NA> Make 1: Acura Legend 2: Audi 100 3: Cadillac Seville 4: Chevrolet Camaro 5: Chrylser Concorde 6: Chrysler LeBaron 7: Eagle Vision 8: Honda Prelude 9: Honda Accord10: Lexus SC30011: Lincoln Continental12: Lincoln Town_Car13: Mercedes-Benz 300E14: Pontiac Firebird15: Pontiac Bonneville16: Volvo 85017: <NA>

33. How to get last and 2nd last (penultimate) row of data.table?

Difficulty Level: L2 Question 1 : Get penultimate row of DT. Question 2 : Get last row of DT. Input:

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output:

 Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: <NA> 850 Midsize 24.8 26.7 28.5 20 282: Volvo 240 Compact 21.8 22.7 23.5 21 28 AirBags DriveTrain Cylinders EngineSize Horsepower RPM1: Driver & Passenger Front 5 2.4 168 62002: Driver only Rear <NA> 2.3 114 5400 Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase1: NA Yes 19.3 5 184 1052: 2215 Yes 15.8 5 190 104 Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make1: 69 38 30.0 15 3245 non-USA Volvo 8502: 67 37 29.5 14 2985 non-USA Volvo 240

Show Solution

#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# Solutionoutput <- DT[c(.N, .N - 1)]print(output)

 Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: <NA> 850 Midsize 24.8 26.7 28.5 20 282: Volvo 240 Compact 21.8 22.7 23.5 21 28 AirBags DriveTrain Cylinders EngineSize Horsepower RPM1: Driver & Passenger Front 5 2.4 168 62002: Driver only Rear <NA> 2.3 114 5400 Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase1: NA Yes 19.3 5 184 1052: 2215 Yes 15.8 5 190 104 Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make1: 69 38 30.0 15 3245 non-USA Volvo 8502: 67 37 29.5 14 2985 non-USA Volvo 240

34. How to reorder columns of data.table?

Difficulty Level: L2 Question : Reorder columns of DT as V2 , V1, V3 Input

DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","3","4"), V3 = c("aa","bb","cc","dd") )

Desired Output

 V2 V1 V31: 1 a aa2: 2 b bb3: 3 c cc4: 4 d dd

Show Solution

# InputDT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","3","4"), V3 = c("aa","bb","cc","dd") )#Solutionoutput <- setcolorder(DT,c("V2","V1","V3"))print(output)

 V2 V1 V31: 1 a aa2: 2 b bb3: 3 c cc4: 4 d dd

35. How to select one column and compute standard deviation of another column and return a single value that gets recycled?

Difficulty Level: L2 Question : Select column Manufacturer and compute standard deviation of column Price and return a single value that gets recycled. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output:

 Manufacturer Sd.Price1: Acura 9.724282: <NA> 9.724283: Audi 9.724284: Audi 9.724285: BMW 9.724286: Buick 9.72428

Show Solution

# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionoutput <- DT[,.(Manufacturer, Sd.Price = sd(Price, na.rm = TRUE))]print(head(output))

 Manufacturer Sd.Price1: Acura 9.724282: <NA> 9.724283: Audi 9.724284: Audi 9.724285: BMW 9.724286: Buick 9.72428

36. How to use row indexing, column indexing and group by expressions all together?

Difficulty Level: L2 Question : Calculate sum of Weight of cars under every Manufacturer but exclude all 'Midsize' Type cars. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

 Manufacturer V11: Acura 27052: Audi 33753: Buick 75754: Cadillac 36205: Chevrolet 235456: <NA> 3515

Show Solution

# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionoutput = DT[Type !="Midsize", sum(Weight), by=Manufacturer]print(head(output))

 Manufacturer V11: Acura 27052: Audi 33753: Buick 75754: Cadillac 36205: Chevrolet 235456: <NA> 3515

37. How to get the row number of the nth largest value in a column?

Difficulty Level: L2 Question : Find the row position of the 5th largest value of column Price in DT. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution

# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#SolutionDT <- DT[order(Price)]fifth_largest_price = DT[5, Price]output = DT[Price == fifth_largest_price, which = TRUE]output

38. How to find and cap outliers from a data.table column?

Difficulty Level: L2 Question: Replace all values of Length column in the lower 5%ile and greater than 95%ile with respective 5th and 95th %ile value. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution

# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# SolutionDT[Length > quantile(Length, .95, na.rm=T), Length := round(quantile(DT$Length, .95, na.rm=T))]DT[Length < quantile(Length, .05, na.rm=T), Length := round(quantile(DT$Length, .05, na.rm=T))]

39. How to swap two column values based on a condition in another column of a data.table?

Difficulty Level: L2 Question : Swap column V1 & V3 for all the rows where V2 = 2. Input

DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )

Desired Output:

 V1 V2 V31: a 1 aa2: bb 2 b3: cc 2 c4: d 3 dd

Show Solution

# InputDT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )#SolutionDT[V2=="2", c("V3", "V1") := .(V1, V3)]print(DT)

 V1 V2 V31: a 1 aa2: bb 2 b3: cc 2 c4: d 3 dd

40. How to swap two rows of a dataframe?

Difficulty Level: L2 Questions : Swap rows 1 and 2 in DT. Input

DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )

Desired Output

 V1 V2 V31: b 2 bb2: a 1 aa3: c 2 cc4: d 3 dd

Show Solution

# InputDT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )# Solutionswap_rows <- function(dt, i1, i2){ i1_row <- DT[i1,] i2_row <- DT[i2,] DT[i1,] <- i2_row DT[i2,] <- i1_row return(DT)}output <- swap_rows(DT, 1,2)print(output)

 V1 V2 V31: b 2 bb2: a 1 aa3: c 2 cc4: d 3 dd

41. How to reverse the rows of a data.table?

Difficulty Level: L2 Question : Reverse all the rows of DT so that topmost row goes to bottom. Input

DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )

Desired Output

 V1 V2 V3 [1,] "d" "3" "dd"[2,] "c" "2" "cc"[3,] "b" "2" "bb"[4,] "a" "1" "aa"

Show Solution

# InputDT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )# Solutionoutput <- do.call(cbind, lapply(DT, rev))print(output)

 V1 V2 V3 [1,] "d" "3" "dd"[2,] "c" "2" "cc"[3,] "b" "2" "bb"[4,] "a" "1" "aa"

42. How to create one-hot encodings of a categorical variable (dummy variables)?

Difficulty Level: L2 Question : Get one-hot encodings for columns Gender & Degree in the data.table DT and append it as columns. Input

DT <- data.table(Gender = c("M","M","F","F"), E_Id =c(1,2,3,4), Degree = c("UG","PG","PhD","UG") ) Gender E_Id Degree1: M 1 UG2: M 2 PG3: F 3 PhD4: F 4 UG

Desired Output

 E_Id Gender_F Gender_M Degree_PG Degree_PhD Degree_UG1: 1 0 1 0 0 12: 2 0 1 1 0 03: 3 1 0 0 1 04: 4 1 0 0 0 1

Show Solution

# InputDT <- data.table(Gender = c("M","M","F","F"), E_Id =c(1,2,3,4), Degree = c("UG","PG","PhD","UG") )# Solutionoutput <- dcast(melt(DT, id.vars='E_Id'), E_Id ~ variable + value, fun = length)print(output)

 E_Id Gender_F Gender_M Degree_PG Degree_PhD Degree_UG1: 1 0 1 0 0 12: 2 0 1 1 0 03: 3 1 0 0 1 04: 4 1 0 0 0 1

‘

Difficulty Level: L2 Question : Find row-wise sum values of DT. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

 Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower RPM1: 14.6 17.7 19.15 21 28 2.3 140 5200 Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width1: 2360 16.5 5 181 103 69 Turn.circle Rear.seat.room Luggage.room Weight1: 39 27.5 14 3085

Show Solution

#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionoutput <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]output <- output[, lapply(.SD, median, na.rm = TRUE)]print(head(output))

 Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower RPM1: 14.6 17.7 19.15 21 28 2.3 140 5200 Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width1: 2360 16.5 5 181 103 69 Turn.circle Rear.seat.room Luggage.room Weight1: 39 27.5 14 3085

44. How to compute correlation score of each column against other columns?

Difficulty Level: L2 Question : Compute correlation score of each column against other columns. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Show Solution

#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionoutput <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]print(cor(output, use = "na.or.complete"))

45. How to create a column containing the minimum by maximum of each row?

Difficulty Level: L2 Question : Compute the minimum-by-maximum for every row of df.

DT <- data.table(V1 = runif(10, 1, 100), V2 = runif(10, 1, 80), V3 = runif(10, 1, 100))

Show Solution

# InputDT <- data.table(V1= runif(10, 1, 100), V2 =runif(10, 1, 80), V3 = runif(10, 1, 100))#SolutionDT[,max_min:=do.call(pmin,.SD)/do.call(pmax,.SD)]print(head(DT))# Solution 2DT <- data.table(V1= runif(10, 1, 100), V2 =runif(10, 1, 80), V3 = runif(10, 1, 100))DT[, max_min2:=apply(.SD, 1, min)/apply(.SD, 1, max)]print(head(DT))

 V1 V2 V3 max_min1: 16.33154 66.08863 41.38638 0.24711572: 87.79467 38.57875 38.50150 0.43854043: 40.95019 19.47167 15.03562 0.36716854: 12.40790 17.68716 34.95050 0.35501365: 34.15262 24.52459 71.04999 0.34517386: 60.22314 60.22538 10.79905 0.1793106

46. How to `scale` all columns in a dataframe?

Difficulty Level: L2 Question : Normalize all columns in DT. Without using external packages. Input

DT <- data.table(V1= runif(10, 1, 100), V2 =runif(10, 1, 80), V3 = runif(10, 1, 100))

Show Solution

# InputDT <- data.table(V1= runif(10, 1, 100), V2 =runif(10, 1, 80), V3 = runif(10, 1, 100))# Solutionoutput <- DT[, lapply(.SD, scale)]print(output)

 V1.V1 V2.V1 V3.V1 1: 0.9884659 -1.2595111 0.09593436 2: 0.1217149 -0.8867426 -0.31629923 3: 0.5872020 0.1443417 0.70839114 4: -0.5178531 0.5737876 1.12962574 5: 1.0221688 -0.9475154 -0.63975308 6: -1.5091412 1.2276058 1.11944272 7: -0.9203772 1.0074295 -0.33182208 8: -0.8005338 1.3745140 -1.41247525 9: 1.5281054 -0.9386551 -1.4577190110: -0.4997516 -0.2952545 1.10467468

47. How to compute the correlation of each column with the succeeding column?

Difficulty Level: L2 Compute the correlation of each column of df with its succeeding column. Input

DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')

Desired Output

 Price Min.Price 0.9705018 Max.Price Price 0.9802519 MPG.city Max.Price -0.5377239 MPG.highway MPG.city 0.9479157 EngineSizeMPG.highway -0.6259651 Horsepower EngineSize 0.7313269 RPM Horsepower 0.04736874 Rev.per.mile RPM 0.4623314(...truncated...)

Show Solution

#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionoutput <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))] # numeric columnsfor (i in 1:(ncol(output) - 1)){ a <- i + 1 print(cor(output[,..i], output[,..a], use = "na.or.complete"))}

 PriceMin.Price 0.9705018 Max.PricePrice 0.9802519 MPG.cityMax.Price -0.5377239 MPG.highwayMPG.city 0.9479157 EngineSizeMPG.highway -0.6259651 HorsepowerEngineSize 0.7313269 RPMHorsepower 0.04736874 Rev.per.mileRPM 0.4623314 Fuel.tank.capacityRev.per.mile -0.6318497 PassengersFuel.tank.capacity 0.4965661 LengthPassengers 0.4737477 WheelbaseLength 0.8275146 WidthWheelbase 0.8127431 Turn.circleWidth 0.8170549 Rear.seat.roomTurn.circle 0.4738118 Luggage.roomRear.seat.room 0.668847 WeightLuggage.room 0.6425643

48. How to replace both the diagonals of data.table with 0?

Difficulty Level: L2 Replace both values in both diagonals of DT with 0. Input

DT <- data.table(V1= runif(4, 1, 100), V2 =runif(4, 1, 80), V3 = runif(4, 1, 100), V4 = runif(4,1, 50)) V1 V2 V3 V41: 97.451642 65.393451 6.496714 27.800142: 56.475427 5.234317 80.551154 10.866173: 65.965884 18.954913 70.968490 25.925094: 3.723218 26.590925 75.723006 42.22132

Desired output

 V1 V2 V3 V41: 0.00000 65.39345 6.496714 0.000002: 56.47543 0.00000 0.000000 10.866173: 65.96588 0.00000 0.000000 25.925094: 0.00000 26.59093 75.723006 0.00000

Show Solution

# InputDT <- data.table(V1= runif(4, 1, 100), V2 =runif(4, 1, 80), V3 = runif(4, 1, 100), V4 = runif(4,1, 50))# Solutionfor (i in 1:nrow(DT)){ DT[i,i] <- 0 DT[nrow(DT) - i + 1, i] <- 0}print(DT)

 V1 V2 V3 V41: 0.00000 72.08959 12.63875 0.0000002: 17.61880 0.00000 0.00000 1.6967413: 82.20107 0.00000 0.00000 40.3210274: 0.00000 40.47473 71.60260 0.000000

49. How to join two dataframes by 2 columns so they have only the common rows?

Difficulty Level: L2 Join dataframes df1 and df2 by ‘fruit-pazham’ and ‘weight-kilo’. Input

DT1 = data.table(fruit = c('apple', 'banana', 'orange'), weight = c('high', 'medium', 'low'), price= c(1,2,3))DT2 = data.table(pazham = c('apple', 'orange', 'pine'), kilo= c('high', 'low', 'high'), price = c(1,2,3))

Desired Output

 fruit weight price_left price_right1: apple high 1 12: orange low 3 2

Show Solution

# InputDT1 = data.table(fruit = c('apple', 'banana', 'orange'), weight = c('high', 'medium', 'low'), price= c(1,2,3))DT2 = data.table(pazham = c('apple', 'orange', 'pine'), kilo= c('high', 'low', 'high'), price = c(1,2,3))# Solutionoutput <- merge(DT1, DT2, by.x = c('fruit', 'weight'), by.y = c('pazham', 'kilo'), all = FALSE, suffixes = c("_left", "_right"))print(output)

 fruit weight price_left price_right1: apple high 1 12: orange low 3 2

‘

50. How to get the positions where values of two columns of `data.table` match?

Difficulty Level: L2 Question : Get the positions where values of two columns of DT match. Input

DT <- data.table(fruit1 = c('apple', 'orange', 'banana', 'mango), fruit2 = c('apple', 'grapes', 'banana', 'jackfruit'))

Desired Output

1 3

Show Solution

# InputDT <- data.table(fruit1 = c('apple', 'orange', 'banana', 'mango'), fruit2 = c('apple', 'grapes', 'banana', 'jackfruit'))# Solutionwhich(DT$fruit1 == DT$fruit2)

‘

51. How to create lags and leads of a column in a `data.table`?

Difficulty Level: L2 Create two new columns in df, one of which is a lag 1 (shift column a down by 1 row) of column ‘a’ and the other is a lead 1 (shift column b up by 1 row). Input

DT <- data.table(V1 = runif(4, 1, 100), V2 = runif(4, 1, 80), V3 = runif(4, 1, 100), V4 = runif(4,1, 50))

Desired Output

 V1 V2 V3 V4 lagged_column lead_column1: 90.41034 79.62526 57.525005 16.88133 NA 13.194272: 13.19427 29.20243 3.962821 35.72731 90.41034 13.791103: 13.79110 15.68368 78.933011 18.43108 13.19427 86.762934: 86.76293 33.66214 6.706639 43.40251 13.79110 NA

Show Solution

# InputDT <- data.table(V1= runif(4, 1, 100), V2 =runif(4, 1, 80), V3 = runif(4, 1, 100), V4 = runif(4,1, 50))# SolutionDT[, lagged_column := shift(V1, type = 'lag')]DT[, lead_column := shift(V1, type = 'lead')]print(DT)

 V1 V2 V3 V4 lagged_column lead_column1: 51.50006 74.050716 27.618570 16.68658 NA 82.297832: 82.29783 63.969537 9.054432 1.16761 51.50006 78.364253: 78.36425 63.937995 85.047398 21.39628 82.29783 63.641814: 63.64181 2.482478 10.060054 27.33597 78.36425 NA

52. How to get the frequency of unique values in the entire dataframe?

Difficulty Level: L2 Question : Get the frequency of unique values in the entire dataframe DT. Input

DT <- data.table(fruit = c('apple', 'orange', 'banana', 'apple'))

Desired Output

 fruit number_of_distinct_orders1: apple 22: orange 13: banana 1

Show Solution

# InputDT <- data.table(fruit = c('apple', 'orange', 'banana', 'apple'))# Solutionoutput <- DToutput[,flag := 1]print(output[, .(number_of_distinct_orders = length(flag)), by = fruit])

 fruit number_of_distinct_orders1: apple 22: orange 13: banana 1

53. How to convert a table from long to wide format using R data.table?

Question : Create a pivot table out of DT keeping ID & Month as key. Difficulty Level: L2 Input

n <- 5DT <- data.table( ID = sample(1:20, n, replace=TRUE), Month = sample(1:12, n, replace=TRUE), Category = c("Drinks", "Food", "Drinks", "Food", "Food"), Expenses = runif(n), key = c('ID', 'Month'))

 ID Month Category Expenses1: 4 3 Drinks 0.99215042: 4 4 Food 0.80735233: 8 10 Food 0.55333364: 14 8 Food 0.25880985: 20 4 Drinks 0.2012480

Desired Output

 ID Month Drinks Food1: 4 3 0.9921504 0.00000002: 4 4 0.0000000 0.80735233: 8 10 0.0000000 0.55333364: 14 8 0.0000000 0.25880985: 20 4 0.2012480 0.0000000

Show Solution

# Inputn <- 5DT <- data.table( ID=sample(1:20, n, replace=TRUE), Month=sample(1:12, n, replace=TRUE), Category= c("Drinks", "Food", "Drinks", "Food", "Food"), Expenses=runif(n), key=c('ID', 'Month'))# Solutionoutput <- dcast.data.table(DT, ID + Month ~ Category, fun=sum, value.var='Expenses')print(output)

 ID Month Drinks Food1: 7 6 0.0000000 0.759283502: 7 10 0.5103192 0.000000003: 9 3 0.1320670 0.000000004: 10 2 0.0000000 0.074503865: 18 7 0.0000000 0.23840221

54. How to create a bootstrap sample (sample same number of rows with replacement) from a data.table?

Difficulty Level: L2 Question : Create a bootstrap sample of length 5 from DT. Input

DT <- data.table(V1 = runif(20, 1, 100), V2 = runif(20, 1, 80), V3 = runif(20, 1, 100), V4 = runif(20,1, 50))

Show Solution

# InputDT <- data.table(V1 = runif(20, 1, 100), V2 = runif(20, 1, 80), V3 = runif(20, 1, 100), V4 = runif(20,1, 50))# Solutionn <- 5row_number <- sample(1:20, n, replace=TRUE)print(DT[row_number,])

 V1 V2 V3 V41: 3.213551 25.49768 30.01245 5.811022: 28.313228 63.25563 89.86874 30.283123: 37.163717 41.17688 98.72226 43.524854: 91.656742 60.68913 9.93234 21.660745: 51.661261 26.40149 28.44038 27.73880

55. How to bin a numeric column with user custom intervals to form a categorical (text) column?

Difficulty Level: L2 In DT, create a new column (cat_col, that has 'fail' if the average of the first two columns is less than 40. Else, it contains 'pass'. Input

set.seed(100)DT <- data.table(V1 = runif(20, 1, 100), V2 = runif(20, 1, 80))

Desired Ouput

 V1 V2 cat_col 1: 31.468845 43.32908 fail 2: 26.509578 57.15350 pass 3: 55.679921 43.52955 pass 4: 6.581932 60.16881 fail 5: 47.386379 34.18801 pass 6: 48.893303 14.54220 fail(...truncated...)

Show Solution

# Inputset.seed(100)DT <- data.table(V1= runif(20, 1, 100), V2 =runif(20, 1, 80))# SolutionDT[, cat_col := ifelse((V1 + V2)/2 < 40, 'fail' , 'pass')]print(DT)

 V1 V2 cat_col 1: 31.468845 43.32908 fail 2: 26.509578 57.15350 pass 3: 55.679921 43.52955 pass 4: 6.581932 60.16881 fail 5: 47.386379 34.18801 pass 6: 48.893303 14.54220 fail 7: 81.427859 61.85383 pass 8: 37.661733 70.67433 pass 9: 55.109301 44.37864 pass10: 17.855943 22.94018 fail11: 62.874651 39.57617 pass12: 88.334386 74.35190 pass13: 28.755030 28.54667 fail14: 40.450302 76.37846 pass15: 76.492557 55.92666 pass16: 67.233150 71.26683 pass17: 21.256604 15.25217 fail18: 36.394960 50.72188 pass19: 36.588036 79.17557 pass20: 69.338762 11.29282 pass

56. How to convert a `data.table` from wide to long format?

Difficulty Level: L2 Question : Convert DT from wide to long format by keeping ID & Month as key. Input

n <- 5DT <- data.table( ID = sample(1:20, n, replace=TRUE), Month = sample(1:12, n, replace=TRUE), Drinks = sample(10:20,n, replace = TRUE), Food = sample(10:20,n, replace = TRUE), Starters = sample(5:15,n, replace = TRUE) ) ID Month Drinks Food Starters1: 1 12 17 14 132: 14 2 12 12 113: 12 7 12 15 114: 19 1 16 15 155: 15 11 17 10 10

Desired Output

 ID Month variable value1: 14 4 Drinks 122: 4 11 Drinks 183: 19 5 Drinks 174: 19 7 Drinks 155: 14 4 Drinks 116: 14 4 Food 14(...truncated...)

Show Solution

# Inputn <- 5DT <- data.table( ID=sample(1:20, n, replace=TRUE), Month=sample(1:12, n, replace=TRUE), Drinks= sample(10:20,n, replace = TRUE), Food = sample(10:20,n, replace = TRUE), Starters = sample(5:15,n, replace = TRUE) )# Solutionoutput = melt(DT, id.vars = c("ID", "Month"), measure.vars = c("Drinks", "Food", "Starters"))print(head(output))

 ID Month variable value1: 14 4 Drinks 122: 4 11 Drinks 183: 19 5 Drinks 174: 19 7 Drinks 155: 14 4 Drinks 116: 14 4 Food 14

57. How to create a dataframe that contains the first 3 lags of a time series data?

Difficulty Level: L2 Question : Append a new column in DT with first three lag of column V1. Input

DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1), V1 = round(runif(10, 1, 80),2))

Desired Output

 Date V1 lag1 lag2 lag31: 2011-01-01 48.92 NA NA NA2: 2011-01-02 65.45 48.92 NA NA3: 2011-01-03 67.63 65.45 48.92 NA4: 2011-01-04 63.26 67.63 65.45 48.925: 2011-01-05 2.50 63.26 67.63 65.456: 2011-01-06 56.15 2.50 63.26 67.637: 2011-01-07 65.39 56.15 2.50 63.268: 2011-01-08 45.90 65.39 56.15 2.50

Show Solution

# InputDT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1), V1 = round(runif(10, 1, 80),2))# SolutionDT[, `:=`(lag1 = shift(V1, 1, type = 'lag'), lag2 = shift(V1, 2, type = 'lag'), lag3 = shift(V1, 3, type = 'lag'))]print(head(DT, 8))

 Date V1 lag1 lag2 lag31: 2011-01-01 8.05 NA NA NA2: 2011-01-02 13.85 8.05 NA NA3: 2011-01-03 3.11 13.85 8.05 NA4: 2011-01-04 57.06 3.11 13.85 8.055: 2011-01-05 61.12 57.06 3.11 13.856: 2011-01-06 68.74 61.12 57.06 3.117: 2011-01-07 35.54 68.74 61.12 57.068: 2011-01-08 33.95 35.54 68.74 61.12

58. How to shuffle rows of a dataframe?

Difficulty Level: L2 Question : Shuffle rows of DT. Input

DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1), V1 = runif(10, 1, 80), V2 = runif(10, 1, 100), V3 = runif(10,1, 50))

Show Solution

# InputDT <- data.table(Date = seq(as.Date('2011-01-01'), as.Date('2011-01-10'),by = 1), V1 = runif(10, 1, 80), V2 = runif(10, 1, 100), V3 = runif(10,1, 50))# Solutionset.seed(1000)print(DT[sample(nrow(DT), replace = T),])

 Date V1 V2 V3 1: 2011-01-04 35.76134 2.61564 17.215551 2: 2011-01-06 54.37578 56.66843 15.841530 3: 2011-01-03 59.85359 21.81431 32.227574 4: 2011-01-08 39.23481 74.79349 33.457133 5: 2011-01-03 59.85359 21.81431 32.227574 6: 2011-01-02 54.02029 34.80660 7.503086 7: 2011-01-06 54.37578 56.66843 15.841530 8: 2011-01-06 54.37578 56.66843 15.841530 9: 2011-01-06 54.37578 56.66843 15.84153010: 2011-01-01 76.45873 27.03600 16.913088

59. How to select rows between two dates in a dataframe?

Difficulty Level: L1 Question : Select all rows between dates 2011-01-05 & 2011-01-08 in DT. Input

set.seed(100)DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1), V1 = runif(10, 1, 80), V2 = runif(10, 1, 100), V3 = runif(10,1, 50))

Desired Output

 Date V1 V2 V31: 2011-01-06 28.39314 36.99609 29.8141142: 2011-01-07 52.53458 33.46489 5.073762

Show Solution

# InputDT <- data.table(Date= seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1), V1 =runif(10, 1, 80), V2 = runif(10, 1, 100), V3 = runif(10,1, 50))# Solutionoutput <- DT[(Date > '2011-01-05') & (Date < '2011-01-08'),]print(output)

 Date V1 V2 V31: 2011-01-06 28.39314 36.99609 29.8141142: 2011-01-07 52.53458 33.46489 5.073762

To be continued . . .

101 R data.table Exercises (2024)

1. How to install data.table and check the version?

2. How to create a data.table from lists (vectors) ?

3. How to import csv data files as data.table ?

4. How to import only ‘n’ rows from a csv file to create a data.table?

5. How to import only specified columns from a csv file?

6. How to get the nrows, ncolumns, datatype, summary stats of each column in a data.table

7. How to extract the row and column number of a particular cell with given criterion?

8. How to rename a specific columns in a data.table?

9. How to check if a data.table has any missing values?

10. How to check for missing values in a data.table?

11. How to get the row and column positions of missing values in a data.table?

11. How to count the number of missing values in each column of data.table?

12. How to replace missing values of data.table with 0?

13. How to replace missing values of numeric columns with the mean?

14. How to select a specific column from a data.table as a list (vector)?

15. How to select a specific column from a data.table as a data.table?

16. How to sort a data.table based on a given column?

17. How to sort a data.table based on two or more columns?

18. How to filter out rows from a data.table that satisfy multiple conditions?

19. How to select multiple columns from data.table and rename them?

20. How to get mean of rows that satisfy a condition from a given column?

21. How to remove only specified columns from a data.table?

22. How to calculate count of each unique value in a given column?

23. How to find mean of a column grouped by multiple columns?

24. How to find mean of column grouped by multiple columns and sort by grouped columns?

25. How to select all numeric columns from data.table?

26. How to calculate mean of all numeric columns for every Manufacturer?

27. How to get first 3 rows for every Manufacturer?

28. How to remove specific column from data.table?

29. How to set key to data.table?

30. How to get all rows where AirBags are at 'Driver & Passenger' in DT?

31. How to get all rows where AirBags are either 'Driver & Passenger' or 'No data'?

33. How to get last and 2nd last (penultimate) row of data.table?

34. How to reorder columns of data.table?

35. How to select one column and compute standard deviation of another column and return a single value that gets recycled?

36. How to use row indexing, column indexing and group by expressions all together?

37. How to get the row number of the nth largest value in a column?

38. How to find and cap outliers from a data.table column?

39. How to swap two column values based on a condition in another column of a data.table?

40. How to swap two rows of a dataframe?

41. How to reverse the rows of a data.table?

42. How to create one-hot encodings of a categorical variable (dummy variables)?

44. How to compute correlation score of each column against other columns?

45. How to create a column containing the minimum by maximum of each row?

46. How to scale all columns in a dataframe?

47. How to compute the correlation of each column with the succeeding column?

48. How to replace both the diagonals of data.table with 0?

49. How to join two dataframes by 2 columns so they have only the common rows?

50. How to get the positions where values of two columns of data.table match?

51. How to create lags and leads of a column in a data.table?

52. How to get the frequency of unique values in the entire dataframe?

53. How to convert a table from long to wide format using R data.table?

54. How to create a bootstrap sample (sample same number of rows with replacement) from a data.table?

55. How to bin a numeric column with user custom intervals to form a categorical (text) column?

56. How to convert a data.table from wide to long format?

57. How to create a dataframe that contains the first 3 lags of a time series data?

58. How to shuffle rows of a dataframe?

59. How to select rows between two dates in a dataframe?

References

30. How to get all rows where `AirBags` are at `'Driver & Passenger'` in `DT`?

31. How to get all rows where `AirBags` are either `'Driver & Passenger'` or `'No data'`?

46. How to `scale` all columns in a dataframe?

50. How to get the positions where values of two columns of `data.table` match?

51. How to create lags and leads of a column in a `data.table`?

56. How to convert a `data.table` from wide to long format?