The data.table package in R is super fast when it comes to handling data. It has a syntax that reduces keystrokes while making R code easier to read. These set of exercises are designed to help you to oil your data brain through solving data manipulation exercises. Related post: 101 Python datatable Exercises (pydatatable)
101 R data.table Exercises. Photo by Ela Abbau.
1. How to install data.table and check the version?
# Install the data.table packageinstall.packages("data.table")# Load the librarylibrary(data.table)# Check versionprint(packageVersion("data.table"))
2. How to create a data.table from lists (vectors) ?
Difficulty Level: L1 Question : Create a pandas series from each of the items below: a list, numpy and a dictionary Input
list_1 <- c("a","b","c","d")list_2 <- c("1","2","3","4")list_3 <- c("aa","bb","cc","dd")
Show Solution
# Inputslist_1 <- c("a","b","c","d")list_2 <- c("1","2","3","4")list_3 <- c("aa","bb","cc","dd")# SolutionDT <- data.table(V1= list_1, V2 =list_2, V3 = list_3 )print(DT)
V1 V2 V31: a 1 aa2: b 2 bb3: c 3 cc4: d 4 dd
3. How to import csv data files as data.table ?
Difficulty Level: L1 Question : Import BostonHousing dataset as a data.table. Show Solution
# SolutionDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')print(head(DT))
crim zn indus chas nox rm age dis rad tax ptratio b lstat1: 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.982: 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.143: 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.034: 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.945: 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.336: 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 medv1: 24.02: 21.63: 34.74: 33.45: 36.26: 28.7
4. How to import only ‘n’ rows from a csv file to create a data.table?
Difficiulty Level: L1 Question : Import first 50 row of BostonHousing dataset as a data.table. Show Solution
# SolutionDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', nrows = 50)print(head(DT))
crim zn indus chas nox rm age dis rad tax ptratio b lstat1: 0.00632 18 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.982: 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.143: 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.034: 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.945: 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.336: 0.02985 0 2.18 0 0.458 6.430 58.7 6.0622 3 222 18.7 394.12 5.21 medv1: 24.02: 21.63: 34.74: 33.45: 36.26: 28.7
5. How to import only specified columns from a csv file?
Difficulty Level: L1 Question : Import only ‘crim’ and ‘medv’ columns of the BostonHousing dataset as a data.table. Show Solution
# SolutionDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv', select = c('crim', 'medv'))print(head(DT))
crim medv1: 0.00632 24.02: 0.02731 21.63: 0.02729 34.74: 0.03237 33.45: 0.06905 36.26: 0.02985 28.7
6. How to get the nrows, ncolumns, datatype, summary stats of each column in a data.table
Difficulty Level: L2 Question : Get the number of rows, columns, datatype and summary statistics of each column of the Cars93 dataset. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')
Show Solution
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/BostonHousing.csv')# number of rows and columnsprint(NROW(DT))print(NCOL(DT))# datatypesprint(sapply(DT,class))# summary statisticssummary(DT)
[1] 506[1] 14 crim zn indus chas nox rm age dis "numeric" "numeric" "numeric" "integer" "numeric" "numeric" "numeric" "numeric" rad tax ptratio b lstat medv "integer" "integer" "numeric" "numeric" "numeric" "numeric" crim zn indus chas Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000 Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000 Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000 Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000 nox rm age dis Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100 Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207 Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188 Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127 rad tax ptratio b Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38 Median : 5.000 Median :330.0 Median :19.05 Median :391.44 Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23 Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90 lstat medv Min. : 1.73 Min. : 5.00 1st Qu.: 6.95 1st Qu.:17.02 Median :11.36 Median :21.20 Mean :12.65 Mean :22.53 3rd Qu.:16.95 3rd Qu.:25.00 Max. :37.97 Max. :50.00
7. How to extract the row and column number of a particular cell with given criterion?
Difficulty Level: L2 Question : Which manufacturer, model and type has the highest Price
? What is the row and column number of the cell with the highest Price
value? Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# Solution# Get Manufacturer with highest priceprint(DT[Price == max(Price, na.rm = TRUE)][,list(Manufacturer, Model, Type)])# Get row numberDT[, .(rownum=.I, Price)][Price == max(Price, na.rm = TRUE), rownum]
Manufacturer Model Type1: Mercedes-Benz 300E Midsize
59
8. How to rename a specific columns in a data.table?
Difficulty Level: Easy Question : Rename the column Model
as New_Model
in DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')print(head(setnames(DT, old = "Model", new = "New_Model")[]))
Manufacturer New_Model Type Min.Price Price Max.Price MPG.city1: Acura Integra Small 12.9 15.9 18.8 252: <NA> Legend Midsize 29.2 33.9 38.7 183: Audi 90 Compact 25.9 29.1 32.3 204: Audi 100 Midsize NA 37.7 44.6 195: BMW 535i Midsize NA 30.0 NA 226: Buick Century Midsize 14.2 15.7 17.3 22 MPG.highway AirBags DriveTrain Cylinders EngineSize Horsepower1: 31 None Front 4 1.8 1402: 25 Driver & Passenger Front 6 3.2 2003: 26 Driver only Front 6 2.8 1724: 26 Driver & Passenger <NA> 6 NA 1725: 30 <NA> Rear 4 3.5 2086: 31 Driver only <NA> 4 2.2 110 RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length1: 6300 2890 Yes 13.2 5 1772: 5500 2335 Yes 18.0 5 1953: 5500 2280 Yes 16.9 5 1804: 5500 2535 <NA> 21.1 6 1935: 5700 2545 Yes 21.1 4 1866: 5200 2565 No 16.4 6 189 Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin1: 102 68 37 26.5 NA 2705 non-USA2: 115 71 38 30.0 15 3560 non-USA3: 102 67 37 28.0 14 3375 non-USA4: 106 NA 37 31.0 17 3405 non-USA5: 109 69 39 27.0 13 3640 non-USA6: 105 69 41 28.0 16 NA USA Make1: Acura Integra2: Acura Legend3: Audi 904: Audi 1005: BMW 535i6: Buick Century
9. How to check if a data.table has any missing values?
Difficulty Level: L1 Question : Check if DT
has any missing values. Return TRUE
is present else FALSE
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# SolutionanyNA(DT)
TRUE
10. How to check for missing values in a data.table?
Difficulty Level: L1 Question : Get the location of missing values in DT
. Replace the value with TRUE
is missing else FALSE
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE[2,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE[4,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE[5,] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE(...truncated...)
Show Solution
# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# Solutionprint(head(is.na(DT)))
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE[2,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE[4,] FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE[5,] FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE[6,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE AirBags DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE[2,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE[3,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE[4,] FALSE TRUE FALSE TRUE FALSE FALSE FALSE[5,] TRUE FALSE FALSE FALSE FALSE FALSE FALSE[6,] FALSE TRUE FALSE FALSE FALSE FALSE FALSE Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width[1,] FALSE FALSE FALSE FALSE FALSE FALSE[2,] FALSE FALSE FALSE FALSE FALSE FALSE[3,] FALSE FALSE FALSE FALSE FALSE FALSE[4,] TRUE FALSE FALSE FALSE FALSE TRUE[5,] FALSE FALSE FALSE FALSE FALSE FALSE[6,] FALSE FALSE FALSE FALSE FALSE FALSE Turn.circle Rear.seat.room Luggage.room Weight Origin Make[1,] FALSE FALSE TRUE FALSE FALSE FALSE[2,] FALSE FALSE FALSE FALSE FALSE FALSE[3,] FALSE FALSE FALSE FALSE FALSE FALSE[4,] FALSE FALSE FALSE FALSE FALSE FALSE[5,] FALSE FALSE FALSE FALSE FALSE FALSE[6,] FALSE FALSE FALSE TRUE FALSE FALSE
11. How to get the row and column positions of missing values in a data.table?
Difficulty Level: L2 Question: Get the row and column positions of missing values in DT. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
[1] "(1, 24)" "(2, 1)" "(4, 4)" "(4, 10)" "(4, 12)" "(4, 16)" "(4, 21)" [8] "(5, 4)" "(5, 6)" "(5, 9)" "(6, 10)" "(6, 25)" "(7, 6)" "(7, 15)" [15] "(7, 17)" "(9, 21)" "(10, 15)" "(11, 24)" "(12, 9)" "(12, 10)" "(12, 13)"[22] "(13, 11)" "(13, 24)" "(14, 11)" "(14, 17)" "(15, 7)" "(15, 13)" "(15, 19)"[29] "(15, 22)" "(16, 24)" "(17, 13)" "(17, 24)" "(19, 23)" "(19, 24)" "(19, 26)"[36] "(20, 1)" "(20, 22)" "(23, 17)" "(23, 23)" "(24, 15)" "(24, 24)" .. (truncated)
Show Solution
# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# Solutionoutput <- c() # init output vectorfor(i in 1:NROW(DT)){ for(j in 1:NCOL(DT)){ curr_value <- DT[i, j,with=F][[1]] if(is.na(curr_value)){ position <- paste0('(', paste(i, j, sep=', '), ')') output <- c(output, position) } }}
11. How to count the number of missing values in each column of data.table?
Difficulty Level: L2 Question : Count the number of missing values in each column of DT
Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# Solutionfor (i in colnames(DT)){ print(paste("Number of missing values in column", i , "is", sum(is.na(DT[,..i]))))}
[1] "Number of missing values in column Manufacturer is 4"[1] "Number of missing values in column Model is 1"[1] "Number of missing values in column Type is 3"[1] "Number of missing values in column Min.Price is 7"[1] "Number of missing values in column Price is 2"[1] "Number of missing values in column Max.Price is 5"[1] "Number of missing values in column MPG.city is 9"[1] "Number of missing values in column MPG.highway is 2"[1] "Number of missing values in column AirBags is 6"[1] "Number of missing values in column DriveTrain is 7"[1] "Number of missing values in column Cylinders is 5"[1] "Number of missing values in column EngineSize is 2"[1] "Number of missing values in column Horsepower is 7"[1] "Number of missing values in column RPM is 3"[1] "Number of missing values in column Rev.per.mile is 6"[1] "Number of missing values in column Man.trans.avail is 5"[1] "Number of missing values in column Fuel.tank.capacity is 8"[1] "Number of missing values in column Passengers is 2"[1] "Number of missing values in column Length is 4"[1] "Number of missing values in column Wheelbase is 1"[1] "Number of missing values in column Width is 6"[1] "Number of missing values in column Turn.circle is 5"[1] "Number of missing values in column Rear.seat.room is 4"[1] "Number of missing values in column Luggage.room is 19"[1] "Number of missing values in column Weight is 7"[1] "Number of missing values in column Origin is 5"[1] "Number of missing values in column Make is 3"
12. How to replace missing values of data.table with 0?
Difficulty Level: L2 Question : Replace all missing values in DT
with 0. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# SolutionDT[is.na(DT)] <- 0print(head(DT[, 1:8]))
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: Acura Integra Small 12.9 15.9 18.8 25 312: 0 Legend Midsize 29.2 33.9 38.7 18 253: Audi 90 Compact 25.9 29.1 32.3 20 264: Audi 100 Midsize 0.0 37.7 44.6 19 265: BMW 535i Midsize 0.0 30.0 0.0 22 306: Buick Century Midsize 14.2 15.7 17.3 22 31
13. How to replace missing values of numeric columns with the mean?
Difficulty Level: L2 Question : Replace missing values of numeric columns with their respective mean. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# Solutionreplace_with_mean <- function(x){ifelse(is.na(x),mean(x, na.rm = TRUE),x)}output <- DT[,lapply(.SD, replace_with_mean)]print(head(output))
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: Acura Integra Small 12.9000 15.9 18.80000 25 312: <NA> Legend Midsize 29.2000 33.9 38.70000 18 253: Audi 90 Compact 25.9000 29.1 32.30000 20 264: Audi 100 Midsize 17.1186 37.7 44.60000 19 265: BMW 535i Midsize 17.1186 30.0 21.45909 22 306: Buick Century Midsize 14.2000 15.7 17.30000 22 31 AirBags DriveTrain Cylinders EngineSize Horsepower RPM1: None Front 4 1.800000 140 63002: Driver & Passenger Front 6 3.200000 200 55003: Driver only Front 6 2.800000 172 55004: Driver & Passenger <NA> 6 2.658242 172 55005: <NA> Rear 4 3.500000 208 57006: Driver only <NA> 4 2.200000 110 5200 Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase1: 2890 Yes 13.2 5 177 1022: 2335 Yes 18.0 5 195 1153: 2280 Yes 16.9 5 180 1024: 2535 <NA> 21.1 6 193 1065: 2545 Yes 21.1 4 186 1096: 2565 No 16.4 6 189 105 Width Turn.circle Rear.seat.room Luggage.room Weight Origin1: 68.00000 37 26.5 13.98649 2705.000 non-USA2: 71.00000 38 30.0 15.00000 3560.000 non-USA3: 67.00000 37 28.0 14.00000 3375.000 non-USA4: 69.44828 37 31.0 17.00000 3405.000 non-USA5: 69.00000 39 27.0 13.00000 3640.000 non-USA6: 69.00000 41 28.0 16.00000 3104.593 USA Make1: Acura Integra2: Acura Legend3: Audi 904: Audi 1005: BMW 535i6: Buick Century
14. How to select a specific column from a data.table as a list (vector)?
Difficulty Level: L1 Question : Get the column Model
in DT
as a list. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Solution
[1] "Integra" "Legend" "90" "100" [5] "535i" "Century" "LeSabre" "Roadmaster" [9] "Riviera" "DeVille" "Seville" "Cavalier" [13] "Corsica" "Camaro" "Lumina" "Lumina_APV" [17] "Astro" "Caprice" "Corvette" "Concorde" [21] "LeBaron" "Imperial" (...truncated...)
Show Solution
# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#SolutionDT <- DT[, Model]print(DT)
[1] "Integra" "Legend" "90" "100" [5] "535i" "Century" "LeSabre" "Roadmaster" [9] "Riviera" "DeVille" "Seville" "Cavalier" [13] "Corsica" "Camaro" "Lumina" "Lumina_APV" [17] "Astro" "Caprice" "Corvette" "Concorde" [21] "LeBaron" "Imperial" "Colt" "Shadow" [25] "Spirit" "Caravan" "Dynasty" "Stealth" [29] "Summit" "Vision" "Festiva" "Escort" [33] "Tempo" "Mustang" "Probe" "Aerostar" [37] "Taurus" "Crown_Victoria" "Metro" "Storm" [41] "Prelude" "Civic" "Accord" "Excel" [45] "Elantra" "Scoupe" "Sonata" "Q45" [49] "ES300" "SC300" "Continental" "Town_Car" [53] "323" "Protege" "626" "MPV" [57] "RX-7" "190E" "300E" NA [61] "Cougar" "Mirage" "Diamante" "Sentra" [65] "Altima" "Quest" "Maxima" "Achieva" [69] "Cutlass_Ciera" "Silhouette" "Eighty-Eight" "Laser" [73] "LeMans" "Sunbird" "Firebird" "Grand_Prix" [77] "Bonneville" "900" "SL" "Justy" [81] "Loyale" "Legacy" "Swift" "Tercel" [85] "Celica" "Camry" "Previa" "Fox" [89] "Eurovan" "Passat" "Corrado" "240" [93] "850"
15. How to select a specific column from a data.table as a data.table?
Difficulty Level: L2 Question : Get the column Model
in DT
as a data.table (rather than as a list). Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Solution
Model1: Integra2: Legend3: 904: 1005: 535i6: Century
Show Solution
# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#SolutionDT <- DT[, list(Model)]print(head(DT))
Model1: Integra2: Legend3: 904: 1005: 535i6: Century
16. How to sort a data.table based on a given column?
Difficulty Level: L1 Question: Reorder the rows as per Price
column of DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#SolutionDT <- DT[order(Price)]print(head(DT))
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: Ford Festiva Small 6.9 7.4 7.9 31 332: Hyundai Excel Small 6.8 8.0 9.2 29 333: Mazda 323 Small 7.4 8.3 9.1 29 374: Geo Metro Small 6.7 8.4 10.0 46 505: Subaru Justy Small 7.3 8.4 9.5 33 376: Suzuki Swift <NA> 7.3 8.6 NA 39 43 AirBags DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile1: <NA> Front 4 1.3 63 5000 31502: None Front 4 1.5 81 5500 27103: None Front 4 1.6 82 5000 23704: None Front 3 1.0 55 5700 37555: None 4WD 3 1.2 73 5600 28756: None Front 3 1.3 70 6000 3360 Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width1: Yes 10.0 4 141 90 632: Yes 11.9 5 168 94 633: Yes 13.2 4 164 97 664: <NA> 10.6 4 151 93 635: Yes 9.2 4 146 90 606: Yes 10.6 4 161 93 NA Turn.circle Rear.seat.room Luggage.room Weight Origin Make1: 33 26.0 12 1845 USA Ford Festiva2: 35 26.0 11 2345 non-USA Hyundai Excel3: 34 27.0 16 2325 non-USA Mazda 3234: 34 27.5 10 1695 non-USA Geo Metro5: 32 23.5 10 2045 non-USA Subaru Justy6: 34 27.5 10 1965 non-USA Suzuki Swift
17. How to sort a data.table based on two or more columns?
Difficulty Level: L1 Question : Reorder the rows as per ascending order of Price
column and descending order of Max.Price
column of DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#SolutionDT <- DT[order(Price, -Max.Price)]print(head(DT))
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: Ford Festiva Small 6.9 7.4 7.9 31 332: Hyundai Excel Small 6.8 8.0 9.2 29 333: Mazda 323 Small 7.4 8.3 9.1 29 374: Geo Metro Small 6.7 8.4 10.0 46 505: Subaru Justy Small 7.3 8.4 9.5 33 376: Suzuki Swift <NA> 7.3 8.6 NA 39 43 AirBags DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile1: <NA> Front 4 1.3 63 5000 31502: None Front 4 1.5 81 5500 27103: None Front 4 1.6 82 5000 23704: None Front 3 1.0 55 5700 37555: None 4WD 3 1.2 73 5600 28756: None Front 3 1.3 70 6000 3360 Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width1: Yes 10.0 4 141 90 632: Yes 11.9 5 168 94 633: Yes 13.2 4 164 97 664: <NA> 10.6 4 151 93 635: Yes 9.2 4 146 90 606: Yes 10.6 4 161 93 NA Turn.circle Rear.seat.room Luggage.room Weight Origin Make1: 33 26.0 12 1845 USA Ford Festiva2: 35 26.0 11 2345 non-USA Hyundai Excel3: 34 27.0 16 2325 non-USA Mazda 3234: 34 27.5 10 1695 non-USA Geo Metro5: 32 23.5 10 2045 non-USA Subaru Justy6: 34 27.5 10 1965 non-USA Suzuki Swift
18. How to filter out rows from a data.table that satisfy multiple conditions?
Difficulty Level: L2 Question : Filter out the rows having Manufacturer as Ford and Price less than 30 Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output :
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: Ford Festiva Small 6.9 7.4 7.9 31 332: Ford Escort Small 8.4 10.1 11.9 23 303: Ford Tempo Compact 10.4 11.3 12.2 22 274: Ford Mustang Sporty 10.8 15.9 21.0 22 295: Ford Probe Sporty 12.8 14.0 15.2 NA 306: Ford Aerostar Van 14.5 19.9 25.3 15 20 AirBags DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile1: <NA> Front 4 1.3 63 5000 31502: None Front 4 1.8 127 6500 24103: None Front 4 2.3 96 4200 28054: Driver only Rear 4 2.3 105 4600 22855: Driver only Front 4 2.0 115 5500 23406: Driver only 4WD 6 3.0 145 4800 2080(...truncated...)
Show Solution
#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionsol <- DT[Manufacturer == "Ford" & Price < 30]print(head(sol))
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: Ford Festiva Small 6.9 7.4 7.9 31 332: Ford Escort Small 8.4 10.1 11.9 23 303: Ford Tempo Compact 10.4 11.3 12.2 22 274: Ford Mustang Sporty 10.8 15.9 21.0 22 295: Ford Probe Sporty 12.8 14.0 15.2 NA 306: Ford Aerostar Van 14.5 19.9 25.3 15 20 AirBags DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile1: <NA> Front 4 1.3 63 5000 31502: None Front 4 1.8 127 6500 24103: None Front 4 2.3 96 4200 28054: Driver only Rear 4 2.3 105 4600 22855: Driver only Front 4 2.0 115 5500 23406: Driver only 4WD 6 3.0 145 4800 2080 Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase Width1: Yes 10.0 4 141 90 632: <NA> 13.2 5 171 98 673: Yes 15.9 5 177 100 684: Yes 15.4 4 180 101 685: Yes 15.5 4 179 103 706: Yes 21.0 7 176 119 72 Turn.circle Rear.seat.room Luggage.room Weight Origin Make1: 33 26.0 12 1845 USA Ford Festiva2: 36 28.0 12 2530 USA Ford Escort3: NA 27.5 13 NA USA <NA>4: 40 24.0 12 NA USA Ford Mustang5: 38 23.0 18 2710 USA Ford Probe6: 45 30.0 NA 3735 USA Ford Aerostar
19. How to select multiple columns from data.table and rename them?
Difficulty Level: L2 Question : Get columns ‘Manufacturer’ & ‘Type’ from data.table and rename them to ‘MANUFACTURER’ & ‘TYPE’ Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
MANUFACTURER TYPE1: Acura Small2: <NA> Midsize3: Audi Compact4: Audi Midsize5: BMW Midsize6: Buick Midsize
Show Solution
#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionsol <- DT[, .(MANUFACTURER = Manufacturer, TYPE = Type)]print(head(sol))
MANUFACTURER TYPE1: Acura Small2: <NA> Midsize3: Audi Compact4: Audi Midsize5: BMW Midsize6: Buick Midsize
20. How to get mean of rows that satisfy a condition from a given column?
Difficulty Level: L2 Question : Get mean of all the vehicles having Price
greater than 30
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output :
38.6833
Show Solution
#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionoutput <- DT[Price > 30, mean(Price, na.rm = TRUE)]output
38.6833333333333
21. How to remove only specified columns from a data.table?
Difficulty Level: L2 Question : Create new data.table by removing ‘Manufacturer’ & ‘Type’ columns. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution output <- DT[, !c("Manufacturer", "Type")]print(head(output))
Model Min.Price Price Max.Price MPG.city MPG.highway AirBags1: Integra 12.9 15.9 18.8 25 31 None2: Legend 29.2 33.9 38.7 18 25 Driver & Passenger3: 90 25.9 29.1 32.3 20 26 Driver only4: 100 NA 37.7 44.6 19 26 Driver & Passenger5: 535i NA 30.0 NA 22 30 <NA>6: Century 14.2 15.7 17.3 22 31 Driver only DriveTrain Cylinders EngineSize Horsepower RPM Rev.per.mile Man.trans.avail1: Front 4 1.8 140 6300 2890 Yes2: Front 6 3.2 200 5500 2335 Yes3: Front 6 2.8 172 5500 2280 Yes4: <NA> 6 NA 172 5500 2535 <NA>5: Rear 4 3.5 208 5700 2545 Yes6: <NA> 4 2.2 110 5200 2565 No Fuel.tank.capacity Passengers Length Wheelbase Width Turn.circle1: 13.2 5 177 102 68 372: 18.0 5 195 115 71 383: 16.9 5 180 102 67 374: 21.1 6 193 106 NA 375: 21.1 4 186 109 69 396: 16.4 6 189 105 69 41 Rear.seat.room Luggage.room Weight Origin Make1: 26.5 NA 2705 non-USA Acura Integra2: 30.0 15 3560 non-USA Acura Legend3: 28.0 14 3375 non-USA Audi 904: 31.0 17 3405 non-USA Audi 1005: 27.0 13 3640 non-USA BMW 535i6: 28.0 16 NA USA Buick Century
22. How to calculate count of each unique value in a given column?
Difficulty Level: L2 Question : Calculate number of records for every Manufacturer
in DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Manufacturer N1: Acura 12: <NA> 43: Audi 24: BMW 15: Buick 46: Cadillac 2
Show Solution
#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution output <- DT[, .(.N), by = .(Manufacturer)]print(head(output))
Manufacturer N1: Acura 12: <NA> 43: Audi 24: BMW 15: Buick 46: Cadillac 2
23. How to find mean of a column grouped by multiple columns?
Difficulty Level: L2 Question : Calculate mean Price
for every Manufacturer
, Type
combination in DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Manufacturer Type V11: Acura Small 15.900002: <NA> Midsize 31.933333: Audi Compact 29.100004: Audi Midsize 37.700005: BMW Midsize 30.000006: Buick Midsize 21.00000
Show Solution
#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution output <- DT[, mean(Price, na.rm = TRUE), by = .(Manufacturer, Type)]print(head(output))
Manufacturer Type V11: Acura Small 15.900002: <NA> Midsize 31.933333: Audi Compact 29.100004: Audi Midsize 37.700005: BMW Midsize 30.000006: Buick Midsize 21.00000
24. How to find mean of column grouped by multiple columns and sort by grouped columns?
Difficulty Level: L2 Question : Calculate mean Price
for every Manufacturer
, Type
combination and arrange it by the keys in DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Manufacturer Type V11: Acura Small 15.902: Audi Compact 29.103: Audi Midsize 37.704: BMW Midsize 30.005: Buick Large 22.256: Buick Midsize 21.00
Show Solution
#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution 1output <- DT[, mean(Price, na.rm = TRUE), keyby = .(Manufacturer, Type)]#Solution 1output <- DT[, mean(Price, na.rm = TRUE), by = .(Manufacturer, Type)][order(Manufacturer, Type)]print(head(output))
Manufacturer Type V11: Acura Small 15.902: Audi Compact 29.103: Audi Midsize 37.704: BMW Midsize 30.005: Buick Large 22.256: Buick Midsize 21.00
25. How to select all numeric columns from data.table?
Difficulty Level: L2 Question : Select numeric columns from DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower RPM1: 12.9 15.9 18.8 25 31 1.8 140 63002: 29.2 33.9 38.7 18 25 3.2 200 55003: 25.9 29.1 32.3 20 26 2.8 172 55004: NA 37.7 44.6 19 26 NA 172 55005: NA 30.0 NA 22 30 3.5 208 57006: 14.2 15.7 17.3 22 31 2.2 110 5200 Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width1: 2890 13.2 5 177 102 682: 2335 18.0 5 195 115 713: 2280 16.9 5 180 102 674: 2535 21.1 6 193 106 NA5: 2545 21.1 4 186 109 696: 2565 16.4 6 189 105 69(...truncated...)
Show Solution
# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# Solution 1output <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]print(head(output))
Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower RPM1: 12.9 15.9 18.8 25 31 1.8 140 63002: 29.2 33.9 38.7 18 25 3.2 200 55003: 25.9 29.1 32.3 20 26 2.8 172 55004: NA 37.7 44.6 19 26 NA 172 55005: NA 30.0 NA 22 30 3.5 208 57006: 14.2 15.7 17.3 22 31 2.2 110 5200 Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width1: 2890 13.2 5 177 102 682: 2335 18.0 5 195 115 713: 2280 16.9 5 180 102 674: 2535 21.1 6 193 106 NA5: 2545 21.1 4 186 109 696: 2565 16.4 6 189 105 69 Turn.circle Rear.seat.room Luggage.room Weight1: 37 26.5 NA 27052: 38 30.0 15 35603: 37 28.0 14 33754: 37 31.0 17 34055: 39 27.0 13 36406: 41 28.0 16 NA
26. How to calculate mean of all numeric columns for every Manufacturer?
Difficulty Level: L3 Question : Calculate mean of all numeric columns for every Manufacturer Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Manufacturer Min.Price Price Max.Price MPG.city MPG.highway EngineSize1: Acura 12.900 15.900 18.80 25.0 31.00 1.8002: <NA> 26.775 28.550 30.30 19.0 26.00 2.9753: Audi NA 33.400 38.45 19.5 26.00 NA4: BMW NA 30.000 NA 22.0 30.00 3.5005: Buick 20.750 21.625 NA 19.0 27.75 3.8756: Cadillac 35.250 37.400 39.50 16.0 25.00 4.750(...truncated...)
Show Solution
#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionoutput <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]output <- cbind(output, DT[, list(Manufacturer)])output <- output[, lapply(.SD, mean), by = Manufacturer]print(head(output))
Manufacturer Min.Price Price Max.Price MPG.city MPG.highway EngineSize1: Acura 12.900 15.900 18.80 25.0 31.00 1.8002: <NA> 26.775 28.550 30.30 19.0 26.00 2.9753: Audi NA 33.400 38.45 19.5 26.00 NA4: BMW NA 30.000 NA 22.0 30.00 3.5005: Buick 20.750 21.625 NA 19.0 27.75 3.8756: Cadillac 35.250 37.400 39.50 16.0 25.00 4.750 Horsepower RPM Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase1: 140.0 6300 2890.0 13.2 5.00 177.00 102.002: 186.5 5750 NA NA 5.00 193.25 109.753: 172.0 5500 2407.5 19.0 5.50 186.50 104.004: 208.0 5700 2545.0 21.1 4.00 186.00 109.005: 157.5 4700 NA NA 5.75 200.75 110.006: 247.5 5050 NA 19.0 5.50 205.00 112.50 Width Turn.circle Rear.seat.room Luggage.room Weight1: 68.00 37.00 26.500 NA 2705.002: 71.25 NA 29.000 13.5 3458.753: NA 37.00 29.500 15.5 3390.004: 69.00 39.00 27.000 13.0 3640.005: NA 42.25 28.875 17.0 NA6: 73.50 43.50 33.000 NA 3777.50
27. How to get first 3 rows for every Manufacturer?
Difficulty Level: L3 Question : Get first 3
rows for every Manufacturer
in DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: Acura Integra Small 12.9 15.9 18.8 25 312: Audi 90 Compact 25.9 29.1 32.3 20 263: Audi 100 Midsize NA 37.7 44.6 19 264: BMW 535i Midsize NA 30.0 NA 22 305: Buick Century Midsize 14.2 15.7 17.3 22 316: Buick LeSabre Large 19.9 20.8 NA 19 28(...truncated...)
Show Solution
#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution output <- DT[!is.na(Manufacturer), head(.SD, 3), by = Manufacturer]print(head(output))
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: Acura Integra Small 12.9 15.9 18.8 25 312: Audi 90 Compact 25.9 29.1 32.3 20 263: Audi 100 Midsize NA 37.7 44.6 19 264: BMW 535i Midsize NA 30.0 NA 22 305: Buick Century Midsize 14.2 15.7 17.3 22 316: Buick LeSabre Large 19.9 20.8 NA 19 28 AirBags DriveTrain Cylinders EngineSize Horsepower RPM1: None Front 4 1.8 140 63002: Driver only Front 6 2.8 172 55003: Driver & Passenger <NA> 6 NA 172 55004: <NA> Rear 4 3.5 208 57005: Driver only <NA> 4 2.2 110 52006: Driver only Front 6 3.8 170 4800 Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase1: 2890 Yes 13.2 5 177 1022: 2280 Yes 16.9 5 180 1023: 2535 <NA> 21.1 6 193 1064: 2545 Yes 21.1 4 186 1095: 2565 No 16.4 6 189 1056: NA No NA 6 200 111 Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make1: 68 37 26.5 NA 2705 non-USA Acura Integra2: 67 37 28.0 14 3375 non-USA Audi 903: NA 37 31.0 17 3405 non-USA Audi 1004: 69 39 27.0 13 3640 non-USA BMW 535i5: 69 41 28.0 16 NA USA Buick Century6: 74 42 30.5 17 3470 USA Buick LeSabre
28. How to remove specific column from data.table?
Difficulty Level: L2 Question : Remove column Model
from DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution DT[, Model := NULL]print(head(DT))
Manufacturer Type Min.Price Price Max.Price MPG.city MPG.highway1: Acura Small 12.9 15.9 18.8 25 312: <NA> Midsize 29.2 33.9 38.7 18 253: Audi Compact 25.9 29.1 32.3 20 264: Audi Midsize NA 37.7 44.6 19 265: BMW Midsize NA 30.0 NA 22 306: Buick Midsize 14.2 15.7 17.3 22 31 AirBags DriveTrain Cylinders EngineSize Horsepower RPM1: None Front 4 1.8 140 63002: Driver & Passenger Front 6 3.2 200 55003: Driver only Front 6 2.8 172 55004: Driver & Passenger <NA> 6 NA 172 55005: <NA> Rear 4 3.5 208 57006: Driver only <NA> 4 2.2 110 5200 Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase1: 2890 Yes 13.2 5 177 1022: 2335 Yes 18.0 5 195 1153: 2280 Yes 16.9 5 180 1024: 2535 <NA> 21.1 6 193 1065: 2545 Yes 21.1 4 186 1096: 2565 No 16.4 6 189 105 Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make1: 68 37 26.5 NA 2705 non-USA Acura Integra2: 71 38 30.0 15 3560 non-USA Acura Legend3: 67 37 28.0 14 3375 non-USA Audi 904: NA 37 31.0 17 3405 non-USA Audi 1005: 69 39 27.0 13 3640 non-USA BMW 535i6: 69 41 28.0 16 NA USA Buick Century
29. How to set key to data.table?
Difficulty Level: L1 Question : Set column AirBags
as key to the data.table Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution setkey(DT,AirBags)
30. How to get all rows where AirBags
are at 'Driver & Passenger'
in DT
?
Difficulty Level: L2 Question : Get all rows where AirBags
are Driver & Passenger
in DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution setkey(DT,AirBags)print(head(DT["Driver & Passenger"]))
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: <NA> Legend Midsize 29.2 33.9 38.7 18 252: Audi 100 Midsize NA 37.7 44.6 19 263: Cadillac Seville Midsize 37.5 40.1 42.7 16 254: Chevrolet Camaro Sporty 13.4 15.1 16.8 19 285: <NA> Concorde Large 18.4 18.4 18.4 20 286: Chrysler LeBaron Compact 14.5 15.8 17.1 23 28 AirBags DriveTrain Cylinders EngineSize Horsepower RPM1: Driver & Passenger Front 6 3.2 200 55002: Driver & Passenger <NA> 6 NA 172 55003: Driver & Passenger Front 8 4.6 295 60004: Driver & Passenger Rear <NA> 3.4 160 46005: Driver & Passenger Front 6 3.3 153 53006: Driver & Passenger Front 4 3.0 141 5000 Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase1: 2335 Yes 18.0 5 195 1152: 2535 <NA> 21.1 6 193 1063: 1985 No 20.0 5 204 1114: 1805 Yes NA 4 193 1015: 1990 No 18.0 6 203 1136: 2090 No 16.0 6 183 104 Width Turn.circle Rear.seat.room Luggage.room Weight Origin1: 71 38 30.0 15 3560 non-USA2: NA 37 31.0 17 3405 non-USA3: 74 44 31.0 NA 3935 USA4: 74 43 25.0 13 3240 USA5: 74 NA 31.0 15 3515 USA6: 68 41 30.5 14 3085 USA Make1: Acura Legend2: Audi 1003: Cadillac Seville4: Chevrolet Camaro5: Chrylser Concorde6: Chrysler LeBaron
31. How to get all rows where AirBags
are either 'Driver & Passenger'
or 'No data'
?
Difficulty Level: L2 Question : Get all rows where AirBags
are Driver & Passenger
& No data
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Show Solution
#Inputlibrary(data.table)DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solution setkey(DT,AirBags)print(DT[c("Driver & Passenger", "No data")])
Manufacturer Model Type Min.Price Price Max.Price MPG.city 1: <NA> Legend Midsize 29.2 33.9 38.7 18 2: Audi 100 Midsize NA 37.7 44.6 19 3: Cadillac Seville Midsize 37.5 40.1 42.7 16 4: Chevrolet Camaro Sporty 13.4 15.1 16.8 19 5: <NA> Concorde Large 18.4 18.4 18.4 20 6: Chrysler LeBaron Compact 14.5 15.8 17.1 23 7: Eagle Vision Large NA 19.3 21.2 20 8: Honda Prelude Sporty 17.0 19.8 22.7 24 9: Honda Accord Compact 13.8 17.5 21.2 2410: <NA> SC300 Midsize 34.7 35.2 35.6 1811: Lincoln Continental Midsize 33.3 34.3 35.3 1712: Lincoln Town_Car Large 34.4 36.1 37.8 1813: Mercedes-Benz 300E Midsize 43.8 61.9 80.0 1914: Pontiac Firebird <NA> 14.0 17.7 21.4 1915: Pontiac Bonneville Large 19.4 24.4 29.4 1916: <NA> 850 Midsize 24.8 26.7 28.5 2017: <NA> <NA> <NA> NA NA NA NA MPG.highway AirBags DriveTrain Cylinders EngineSize Horsepower 1: 25 Driver & Passenger Front 6 3.2 200 2: 26 Driver & Passenger <NA> 6 NA 172 3: 25 Driver & Passenger Front 8 4.6 295 4: 28 Driver & Passenger Rear <NA> 3.4 160 5: 28 Driver & Passenger Front 6 3.3 153 6: 28 Driver & Passenger Front 4 3.0 141 7: 28 Driver & Passenger Front 6 3.5 NA 8: 31 Driver & Passenger Front 4 2.3 160 9: 31 Driver & Passenger Front 4 2.2 14010: 23 Driver & Passenger Rear 6 3.0 22511: 26 Driver & Passenger <NA> 6 3.8 16012: 26 Driver & Passenger Rear 8 4.6 21013: 25 Driver & Passenger Rear 6 3.2 21714: 28 Driver & Passenger Rear 6 NA 16015: 28 Driver & Passenger Front 6 3.8 17016: 28 Driver & Passenger Front 5 2.4 16817: NA No data <NA> <NA> NA NA RPM Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length 1: 5500 2335 Yes 18.0 5 195 2: 5500 2535 <NA> 21.1 6 193 3: 6000 1985 No 20.0 5 204 4: 4600 1805 Yes NA 4 193 5: 5300 1990 No 18.0 6 203 6: 5000 2090 No 16.0 6 183 7: 5800 1980 No 18.0 6 202 8: 5800 2855 Yes NA 4 175 9: 5600 NA Yes 17.0 4 18510: 6000 2510 Yes NA 4 19111: 4400 1835 No 18.4 6 20512: 4600 1840 No 20.0 NA 21913: 5500 2220 No 18.5 5 NA14: 4600 1805 Yes 15.5 4 19615: 4800 1565 No 18.0 6 17716: 6200 NA Yes 19.3 5 18417: NA NA <NA> NA NA NA Wheelbase Width Turn.circle Rear.seat.room Luggage.room Weight Origin 1: 115 71 38 30.0 15 3560 non-USA 2: 106 NA 37 31.0 17 3405 non-USA 3: 111 74 44 31.0 NA 3935 USA 4: 101 74 43 25.0 13 3240 USA 5: 113 74 NA 31.0 15 3515 USA 6: 104 68 41 30.5 14 3085 USA 7: 113 74 40 30.0 15 3490 USA 8: 100 70 39 23.5 8 2865 non-USA 9: 107 67 41 28.0 14 3040 non-USA10: 106 71 39 25.0 9 3515 non-USA11: 109 73 42 30.0 19 3695 USA12: 117 77 45 31.5 22 4055 USA13: 110 69 37 NA 15 3525 non-USA14: 101 75 43 25.0 13 3240 USA15: 111 74 43 30.5 18 3495 USA16: 105 69 38 30.0 15 3245 non-USA17: NA NA NA NA NA NA <NA> Make 1: Acura Legend 2: Audi 100 3: Cadillac Seville 4: Chevrolet Camaro 5: Chrylser Concorde 6: Chrysler LeBaron 7: Eagle Vision 8: Honda Prelude 9: Honda Accord10: Lexus SC30011: Lincoln Continental12: Lincoln Town_Car13: Mercedes-Benz 300E14: Pontiac Firebird15: Pontiac Bonneville16: Volvo 85017: <NA>
33. How to get last and 2nd last (penultimate) row of data.table?
Difficulty Level: L2 Question 1 : Get penultimate row of DT
. Question 2 : Get last row of DT
. Input:
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output:
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: <NA> 850 Midsize 24.8 26.7 28.5 20 282: Volvo 240 Compact 21.8 22.7 23.5 21 28 AirBags DriveTrain Cylinders EngineSize Horsepower RPM1: Driver & Passenger Front 5 2.4 168 62002: Driver only Rear <NA> 2.3 114 5400 Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase1: NA Yes 19.3 5 184 1052: 2215 Yes 15.8 5 190 104 Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make1: 69 38 30.0 15 3245 non-USA Volvo 8502: 67 37 29.5 14 2985 non-USA Volvo 240
Show Solution
#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# Solutionoutput <- DT[c(.N, .N - 1)]print(output)
Manufacturer Model Type Min.Price Price Max.Price MPG.city MPG.highway1: <NA> 850 Midsize 24.8 26.7 28.5 20 282: Volvo 240 Compact 21.8 22.7 23.5 21 28 AirBags DriveTrain Cylinders EngineSize Horsepower RPM1: Driver & Passenger Front 5 2.4 168 62002: Driver only Rear <NA> 2.3 114 5400 Rev.per.mile Man.trans.avail Fuel.tank.capacity Passengers Length Wheelbase1: NA Yes 19.3 5 184 1052: 2215 Yes 15.8 5 190 104 Width Turn.circle Rear.seat.room Luggage.room Weight Origin Make1: 69 38 30.0 15 3245 non-USA Volvo 8502: 67 37 29.5 14 2985 non-USA Volvo 240
34. How to reorder columns of data.table?
Difficulty Level: L2 Question : Reorder columns of DT
as V2 , V1, V3
Input
DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","3","4"), V3 = c("aa","bb","cc","dd") )
Desired Output
V2 V1 V31: 1 a aa2: 2 b bb3: 3 c cc4: 4 d dd
Show Solution
# InputDT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","3","4"), V3 = c("aa","bb","cc","dd") )#Solutionoutput <- setcolorder(DT,c("V2","V1","V3"))print(output)
V2 V1 V31: 1 a aa2: 2 b bb3: 3 c cc4: 4 d dd
35. How to select one column and compute standard deviation of another column and return a single value that gets recycled?
Difficulty Level: L2 Question : Select column Manufacturer
and compute standard deviation of column Price
and return a single value that gets recycled. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output:
Manufacturer Sd.Price1: Acura 9.724282: <NA> 9.724283: Audi 9.724284: Audi 9.724285: BMW 9.724286: Buick 9.72428
Show Solution
# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionoutput <- DT[,.(Manufacturer, Sd.Price = sd(Price, na.rm = TRUE))]print(head(output))
Manufacturer Sd.Price1: Acura 9.724282: <NA> 9.724283: Audi 9.724284: Audi 9.724285: BMW 9.724286: Buick 9.72428
36. How to use row indexing, column indexing and group by expressions all together?
Difficulty Level: L2 Question : Calculate sum of Weight
of cars under every Manufacturer
but exclude all 'Midsize'
Type
cars. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Manufacturer V11: Acura 27052: Audi 33753: Buick 75754: Cadillac 36205: Chevrolet 235456: <NA> 3515
Show Solution
# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionoutput = DT[Type !="Midsize", sum(Weight), by=Manufacturer]print(head(output))
Manufacturer V11: Acura 27052: Audi 33753: Buick 75754: Cadillac 36205: Chevrolet 235456: <NA> 3515
37. How to get the row number of the nth largest value in a column?
Difficulty Level: L2 Question : Find the row position of the 5th largest value of column Price
in DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#SolutionDT <- DT[order(Price)]fifth_largest_price = DT[5, Price]output = DT[Price == fifth_largest_price, which = TRUE]output
- 4
- 5
38. How to find and cap outliers from a data.table column?
Difficulty Level: L2 Question: Replace all values of Length
column in the lower 5%ile and greater than 95%ile with respective 5th and 95th %ile value. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
# InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')# SolutionDT[Length > quantile(Length, .95, na.rm=T), Length := round(quantile(DT$Length, .95, na.rm=T))]DT[Length < quantile(Length, .05, na.rm=T), Length := round(quantile(DT$Length, .05, na.rm=T))]
39. How to swap two column values based on a condition in another column of a data.table?
Difficulty Level: L2 Question : Swap column V1 & V3 for all the rows where V2 = 2. Input
DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )
Desired Output:
V1 V2 V31: a 1 aa2: bb 2 b3: cc 2 c4: d 3 dd
Show Solution
# InputDT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )#SolutionDT[V2=="2", c("V3", "V1") := .(V1, V3)]print(DT)
V1 V2 V31: a 1 aa2: bb 2 b3: cc 2 c4: d 3 dd
40. How to swap two rows of a dataframe?
Difficulty Level: L2 Questions : Swap rows 1 and 2 in DT
. Input
DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )
Desired Output
V1 V2 V31: b 2 bb2: a 1 aa3: c 2 cc4: d 3 dd
Show Solution
# InputDT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )# Solutionswap_rows <- function(dt, i1, i2){ i1_row <- DT[i1,] i2_row <- DT[i2,] DT[i1,] <- i2_row DT[i2,] <- i1_row return(DT)}output <- swap_rows(DT, 1,2)print(output)
V1 V2 V31: b 2 bb2: a 1 aa3: c 2 cc4: d 3 dd
41. How to reverse the rows of a data.table?
Difficulty Level: L2 Question : Reverse all the rows of DT
so that topmost row goes to bottom. Input
DT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )
Desired Output
V1 V2 V3 [1,] "d" "3" "dd"[2,] "c" "2" "cc"[3,] "b" "2" "bb"[4,] "a" "1" "aa"
Show Solution
# InputDT <- data.table(V1= c("a","b","c","d"), V2 =c("1","2","2","3"), V3 = c("aa","bb","cc","dd") )# Solutionoutput <- do.call(cbind, lapply(DT, rev))print(output)
V1 V2 V3 [1,] "d" "3" "dd"[2,] "c" "2" "cc"[3,] "b" "2" "bb"[4,] "a" "1" "aa"
42. How to create one-hot encodings of a categorical variable (dummy variables)?
Difficulty Level: L2 Question : Get one-hot encodings for columns Gender
& Degree
in the data.table DT
and append it as columns. Input
DT <- data.table(Gender = c("M","M","F","F"), E_Id =c(1,2,3,4), Degree = c("UG","PG","PhD","UG") ) Gender E_Id Degree1: M 1 UG2: M 2 PG3: F 3 PhD4: F 4 UG
Desired Output
E_Id Gender_F Gender_M Degree_PG Degree_PhD Degree_UG1: 1 0 1 0 0 12: 2 0 1 1 0 03: 3 1 0 0 1 04: 4 1 0 0 0 1
Show Solution
# InputDT <- data.table(Gender = c("M","M","F","F"), E_Id =c(1,2,3,4), Degree = c("UG","PG","PhD","UG") )# Solutionoutput <- dcast(melt(DT, id.vars='E_Id'), E_Id ~ variable + value, fun = length)print(output)
E_Id Gender_F Gender_M Degree_PG Degree_PhD Degree_UG1: 1 0 1 0 0 12: 2 0 1 1 0 03: 3 1 0 0 1 04: 4 1 0 0 0 1
‘
Difficulty Level: L2 Question : Find row-wise sum values of DT
. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower RPM1: 14.6 17.7 19.15 21 28 2.3 140 5200 Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width1: 2360 16.5 5 181 103 69 Turn.circle Rear.seat.room Luggage.room Weight1: 39 27.5 14 3085
Show Solution
#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionoutput <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]output <- output[, lapply(.SD, median, na.rm = TRUE)]print(head(output))
Min.Price Price Max.Price MPG.city MPG.highway EngineSize Horsepower RPM1: 14.6 17.7 19.15 21 28 2.3 140 5200 Rev.per.mile Fuel.tank.capacity Passengers Length Wheelbase Width1: 2360 16.5 5 181 103 69 Turn.circle Rear.seat.room Luggage.room Weight1: 39 27.5 14 3085
44. How to compute correlation score of each column against other columns?
Difficulty Level: L2 Question : Compute correlation score of each column against other columns. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Show Solution
#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionoutput <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))]print(cor(output, use = "na.or.complete"))
45. How to create a column containing the minimum by maximum of each row?
Difficulty Level: L2 Question : Compute the minimum-by-maximum for every row of df
.
DT <- data.table(V1 = runif(10, 1, 100), V2 = runif(10, 1, 80), V3 = runif(10, 1, 100))
Show Solution
# InputDT <- data.table(V1= runif(10, 1, 100), V2 =runif(10, 1, 80), V3 = runif(10, 1, 100))#SolutionDT[,max_min:=do.call(pmin,.SD)/do.call(pmax,.SD)]print(head(DT))# Solution 2DT <- data.table(V1= runif(10, 1, 100), V2 =runif(10, 1, 80), V3 = runif(10, 1, 100))DT[, max_min2:=apply(.SD, 1, min)/apply(.SD, 1, max)]print(head(DT))
V1 V2 V3 max_min1: 16.33154 66.08863 41.38638 0.24711572: 87.79467 38.57875 38.50150 0.43854043: 40.95019 19.47167 15.03562 0.36716854: 12.40790 17.68716 34.95050 0.35501365: 34.15262 24.52459 71.04999 0.34517386: 60.22314 60.22538 10.79905 0.1793106
46. How to scale
all columns in a dataframe?
Difficulty Level: L2 Question : Normalize all columns in DT
. Without using external packages. Input
DT <- data.table(V1= runif(10, 1, 100), V2 =runif(10, 1, 80), V3 = runif(10, 1, 100))
Show Solution
# InputDT <- data.table(V1= runif(10, 1, 100), V2 =runif(10, 1, 80), V3 = runif(10, 1, 100))# Solutionoutput <- DT[, lapply(.SD, scale)]print(output)
V1.V1 V2.V1 V3.V1 1: 0.9884659 -1.2595111 0.09593436 2: 0.1217149 -0.8867426 -0.31629923 3: 0.5872020 0.1443417 0.70839114 4: -0.5178531 0.5737876 1.12962574 5: 1.0221688 -0.9475154 -0.63975308 6: -1.5091412 1.2276058 1.11944272 7: -0.9203772 1.0074295 -0.33182208 8: -0.8005338 1.3745140 -1.41247525 9: 1.5281054 -0.9386551 -1.4577190110: -0.4997516 -0.2952545 1.10467468
47. How to compute the correlation of each column with the succeeding column?
Difficulty Level: L2 Compute the correlation of each column of df
with its succeeding column. Input
DT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')
Desired Output
Price Min.Price 0.9705018 Max.Price Price 0.9802519 MPG.city Max.Price -0.5377239 MPG.highway MPG.city 0.9479157 EngineSizeMPG.highway -0.6259651 Horsepower EngineSize 0.7313269 RPM Horsepower 0.04736874 Rev.per.mile RPM 0.4623314(...truncated...)
Show Solution
#InputDT <- fread('https://raw.githubusercontent.com/selva86/datasets/master/Cars93_miss.csv')#Solutionoutput <- DT[, .SD, .SDcols = which(sapply(DT, is.numeric))] # numeric columnsfor (i in 1:(ncol(output) - 1)){ a <- i + 1 print(cor(output[,..i], output[,..a], use = "na.or.complete"))}
PriceMin.Price 0.9705018 Max.PricePrice 0.9802519 MPG.cityMax.Price -0.5377239 MPG.highwayMPG.city 0.9479157 EngineSizeMPG.highway -0.6259651 HorsepowerEngineSize 0.7313269 RPMHorsepower 0.04736874 Rev.per.mileRPM 0.4623314 Fuel.tank.capacityRev.per.mile -0.6318497 PassengersFuel.tank.capacity 0.4965661 LengthPassengers 0.4737477 WheelbaseLength 0.8275146 WidthWheelbase 0.8127431 Turn.circleWidth 0.8170549 Rear.seat.roomTurn.circle 0.4738118 Luggage.roomRear.seat.room 0.668847 WeightLuggage.room 0.6425643
48. How to replace both the diagonals of data.table with 0?
Difficulty Level: L2 Replace both values in both diagonals of DT
with 0. Input
DT <- data.table(V1= runif(4, 1, 100), V2 =runif(4, 1, 80), V3 = runif(4, 1, 100), V4 = runif(4,1, 50)) V1 V2 V3 V41: 97.451642 65.393451 6.496714 27.800142: 56.475427 5.234317 80.551154 10.866173: 65.965884 18.954913 70.968490 25.925094: 3.723218 26.590925 75.723006 42.22132
Desired output
V1 V2 V3 V41: 0.00000 65.39345 6.496714 0.000002: 56.47543 0.00000 0.000000 10.866173: 65.96588 0.00000 0.000000 25.925094: 0.00000 26.59093 75.723006 0.00000
Show Solution
# InputDT <- data.table(V1= runif(4, 1, 100), V2 =runif(4, 1, 80), V3 = runif(4, 1, 100), V4 = runif(4,1, 50))# Solutionfor (i in 1:nrow(DT)){ DT[i,i] <- 0 DT[nrow(DT) - i + 1, i] <- 0}print(DT)
V1 V2 V3 V41: 0.00000 72.08959 12.63875 0.0000002: 17.61880 0.00000 0.00000 1.6967413: 82.20107 0.00000 0.00000 40.3210274: 0.00000 40.47473 71.60260 0.000000
49. How to join two dataframes by 2 columns so they have only the common rows?
Difficulty Level: L2 Join dataframes df1
and df2
by ‘fruit-pazham’ and ‘weight-kilo’. Input
DT1 = data.table(fruit = c('apple', 'banana', 'orange'), weight = c('high', 'medium', 'low'), price= c(1,2,3))DT2 = data.table(pazham = c('apple', 'orange', 'pine'), kilo= c('high', 'low', 'high'), price = c(1,2,3))
Desired Output
fruit weight price_left price_right1: apple high 1 12: orange low 3 2
Show Solution
# InputDT1 = data.table(fruit = c('apple', 'banana', 'orange'), weight = c('high', 'medium', 'low'), price= c(1,2,3))DT2 = data.table(pazham = c('apple', 'orange', 'pine'), kilo= c('high', 'low', 'high'), price = c(1,2,3))# Solutionoutput <- merge(DT1, DT2, by.x = c('fruit', 'weight'), by.y = c('pazham', 'kilo'), all = FALSE, suffixes = c("_left", "_right"))print(output)
fruit weight price_left price_right1: apple high 1 12: orange low 3 2
‘
50. How to get the positions where values of two columns of data.table
match?
Difficulty Level: L2 Question : Get the positions where values of two columns of DT
match. Input
DT <- data.table(fruit1 = c('apple', 'orange', 'banana', 'mango), fruit2 = c('apple', 'grapes', 'banana', 'jackfruit'))
Desired Output
1 3
Show Solution
# InputDT <- data.table(fruit1 = c('apple', 'orange', 'banana', 'mango'), fruit2 = c('apple', 'grapes', 'banana', 'jackfruit'))# Solutionwhich(DT$fruit1 == DT$fruit2)
- 1
- 3
‘
51. How to create lags and leads of a column in a data.table
?
Difficulty Level: L2 Create two new columns in df
, one of which is a lag 1 (shift column a
down by 1 row) of column ‘a’ and the other is a lead 1 (shift column b
up by 1 row). Input
DT <- data.table(V1 = runif(4, 1, 100), V2 = runif(4, 1, 80), V3 = runif(4, 1, 100), V4 = runif(4,1, 50))
Desired Output
V1 V2 V3 V4 lagged_column lead_column1: 90.41034 79.62526 57.525005 16.88133 NA 13.194272: 13.19427 29.20243 3.962821 35.72731 90.41034 13.791103: 13.79110 15.68368 78.933011 18.43108 13.19427 86.762934: 86.76293 33.66214 6.706639 43.40251 13.79110 NA
Show Solution
# InputDT <- data.table(V1= runif(4, 1, 100), V2 =runif(4, 1, 80), V3 = runif(4, 1, 100), V4 = runif(4,1, 50))# SolutionDT[, lagged_column := shift(V1, type = 'lag')]DT[, lead_column := shift(V1, type = 'lead')]print(DT)
V1 V2 V3 V4 lagged_column lead_column1: 51.50006 74.050716 27.618570 16.68658 NA 82.297832: 82.29783 63.969537 9.054432 1.16761 51.50006 78.364253: 78.36425 63.937995 85.047398 21.39628 82.29783 63.641814: 63.64181 2.482478 10.060054 27.33597 78.36425 NA
52. How to get the frequency of unique values in the entire dataframe?
Difficulty Level: L2 Question : Get the frequency of unique values in the entire dataframe DT
. Input
DT <- data.table(fruit = c('apple', 'orange', 'banana', 'apple'))
Desired Output
fruit number_of_distinct_orders1: apple 22: orange 13: banana 1
Show Solution
# InputDT <- data.table(fruit = c('apple', 'orange', 'banana', 'apple'))# Solutionoutput <- DToutput[,flag := 1]print(output[, .(number_of_distinct_orders = length(flag)), by = fruit])
fruit number_of_distinct_orders1: apple 22: orange 13: banana 1
53. How to convert a table from long to wide format using R data.table?
Question : Create a pivot table out of DT
keeping ID & Month
as key. Difficulty Level: L2 Input
n <- 5DT <- data.table( ID = sample(1:20, n, replace=TRUE), Month = sample(1:12, n, replace=TRUE), Category = c("Drinks", "Food", "Drinks", "Food", "Food"), Expenses = runif(n), key = c('ID', 'Month'))
ID Month Category Expenses1: 4 3 Drinks 0.99215042: 4 4 Food 0.80735233: 8 10 Food 0.55333364: 14 8 Food 0.25880985: 20 4 Drinks 0.2012480
Desired Output
ID Month Drinks Food1: 4 3 0.9921504 0.00000002: 4 4 0.0000000 0.80735233: 8 10 0.0000000 0.55333364: 14 8 0.0000000 0.25880985: 20 4 0.2012480 0.0000000
Show Solution
# Inputn <- 5DT <- data.table( ID=sample(1:20, n, replace=TRUE), Month=sample(1:12, n, replace=TRUE), Category= c("Drinks", "Food", "Drinks", "Food", "Food"), Expenses=runif(n), key=c('ID', 'Month'))# Solutionoutput <- dcast.data.table(DT, ID + Month ~ Category, fun=sum, value.var='Expenses')print(output)
ID Month Drinks Food1: 7 6 0.0000000 0.759283502: 7 10 0.5103192 0.000000003: 9 3 0.1320670 0.000000004: 10 2 0.0000000 0.074503865: 18 7 0.0000000 0.23840221
54. How to create a bootstrap sample (sample same number of rows with replacement) from a data.table?
Difficulty Level: L2 Question : Create a bootstrap sample of length 5 from DT
. Input
DT <- data.table(V1 = runif(20, 1, 100), V2 = runif(20, 1, 80), V3 = runif(20, 1, 100), V4 = runif(20,1, 50))
Show Solution
# InputDT <- data.table(V1 = runif(20, 1, 100), V2 = runif(20, 1, 80), V3 = runif(20, 1, 100), V4 = runif(20,1, 50))# Solutionn <- 5row_number <- sample(1:20, n, replace=TRUE)print(DT[row_number,])
V1 V2 V3 V41: 3.213551 25.49768 30.01245 5.811022: 28.313228 63.25563 89.86874 30.283123: 37.163717 41.17688 98.72226 43.524854: 91.656742 60.68913 9.93234 21.660745: 51.661261 26.40149 28.44038 27.73880
55. How to bin a numeric column with user custom intervals to form a categorical (text) column?
Difficulty Level: L2 In DT, create a new column (cat_col
, that has 'fail'
if the average of the first two columns is less than 40. Else, it contains 'pass'
. Input
set.seed(100)DT <- data.table(V1 = runif(20, 1, 100), V2 = runif(20, 1, 80))
Desired Ouput
V1 V2 cat_col 1: 31.468845 43.32908 fail 2: 26.509578 57.15350 pass 3: 55.679921 43.52955 pass 4: 6.581932 60.16881 fail 5: 47.386379 34.18801 pass 6: 48.893303 14.54220 fail(...truncated...)
Show Solution
# Inputset.seed(100)DT <- data.table(V1= runif(20, 1, 100), V2 =runif(20, 1, 80))# SolutionDT[, cat_col := ifelse((V1 + V2)/2 < 40, 'fail' , 'pass')]print(DT)
V1 V2 cat_col 1: 31.468845 43.32908 fail 2: 26.509578 57.15350 pass 3: 55.679921 43.52955 pass 4: 6.581932 60.16881 fail 5: 47.386379 34.18801 pass 6: 48.893303 14.54220 fail 7: 81.427859 61.85383 pass 8: 37.661733 70.67433 pass 9: 55.109301 44.37864 pass10: 17.855943 22.94018 fail11: 62.874651 39.57617 pass12: 88.334386 74.35190 pass13: 28.755030 28.54667 fail14: 40.450302 76.37846 pass15: 76.492557 55.92666 pass16: 67.233150 71.26683 pass17: 21.256604 15.25217 fail18: 36.394960 50.72188 pass19: 36.588036 79.17557 pass20: 69.338762 11.29282 pass
56. How to convert a data.table
from wide to long format?
Difficulty Level: L2 Question : Convert DT
from wide to long format by keeping ID & Month
as key. Input
n <- 5DT <- data.table( ID = sample(1:20, n, replace=TRUE), Month = sample(1:12, n, replace=TRUE), Drinks = sample(10:20,n, replace = TRUE), Food = sample(10:20,n, replace = TRUE), Starters = sample(5:15,n, replace = TRUE) ) ID Month Drinks Food Starters1: 1 12 17 14 132: 14 2 12 12 113: 12 7 12 15 114: 19 1 16 15 155: 15 11 17 10 10
Desired Output
ID Month variable value1: 14 4 Drinks 122: 4 11 Drinks 183: 19 5 Drinks 174: 19 7 Drinks 155: 14 4 Drinks 116: 14 4 Food 14(...truncated...)
Show Solution
# Inputn <- 5DT <- data.table( ID=sample(1:20, n, replace=TRUE), Month=sample(1:12, n, replace=TRUE), Drinks= sample(10:20,n, replace = TRUE), Food = sample(10:20,n, replace = TRUE), Starters = sample(5:15,n, replace = TRUE) )# Solutionoutput = melt(DT, id.vars = c("ID", "Month"), measure.vars = c("Drinks", "Food", "Starters"))print(head(output))
ID Month variable value1: 14 4 Drinks 122: 4 11 Drinks 183: 19 5 Drinks 174: 19 7 Drinks 155: 14 4 Drinks 116: 14 4 Food 14
57. How to create a dataframe that contains the first 3 lags of a time series data?
Difficulty Level: L2 Question : Append a new column in DT
with first three lag of column V1
. Input
DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1), V1 = round(runif(10, 1, 80),2))
Desired Output
Date V1 lag1 lag2 lag31: 2011-01-01 48.92 NA NA NA2: 2011-01-02 65.45 48.92 NA NA3: 2011-01-03 67.63 65.45 48.92 NA4: 2011-01-04 63.26 67.63 65.45 48.925: 2011-01-05 2.50 63.26 67.63 65.456: 2011-01-06 56.15 2.50 63.26 67.637: 2011-01-07 65.39 56.15 2.50 63.268: 2011-01-08 45.90 65.39 56.15 2.50
Show Solution
# InputDT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1), V1 = round(runif(10, 1, 80),2))# SolutionDT[, `:=`(lag1 = shift(V1, 1, type = 'lag'), lag2 = shift(V1, 2, type = 'lag'), lag3 = shift(V1, 3, type = 'lag'))]print(head(DT, 8))
Date V1 lag1 lag2 lag31: 2011-01-01 8.05 NA NA NA2: 2011-01-02 13.85 8.05 NA NA3: 2011-01-03 3.11 13.85 8.05 NA4: 2011-01-04 57.06 3.11 13.85 8.055: 2011-01-05 61.12 57.06 3.11 13.856: 2011-01-06 68.74 61.12 57.06 3.117: 2011-01-07 35.54 68.74 61.12 57.068: 2011-01-08 33.95 35.54 68.74 61.12
58. How to shuffle rows of a dataframe?
Difficulty Level: L2 Question : Shuffle rows of DT
. Input
DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1), V1 = runif(10, 1, 80), V2 = runif(10, 1, 100), V3 = runif(10,1, 50))
Show Solution
# InputDT <- data.table(Date = seq(as.Date('2011-01-01'), as.Date('2011-01-10'),by = 1), V1 = runif(10, 1, 80), V2 = runif(10, 1, 100), V3 = runif(10,1, 50))# Solutionset.seed(1000)print(DT[sample(nrow(DT), replace = T),])
Date V1 V2 V3 1: 2011-01-04 35.76134 2.61564 17.215551 2: 2011-01-06 54.37578 56.66843 15.841530 3: 2011-01-03 59.85359 21.81431 32.227574 4: 2011-01-08 39.23481 74.79349 33.457133 5: 2011-01-03 59.85359 21.81431 32.227574 6: 2011-01-02 54.02029 34.80660 7.503086 7: 2011-01-06 54.37578 56.66843 15.841530 8: 2011-01-06 54.37578 56.66843 15.841530 9: 2011-01-06 54.37578 56.66843 15.84153010: 2011-01-01 76.45873 27.03600 16.913088
59. How to select rows between two dates in a dataframe?
Difficulty Level: L1 Question : Select all rows between dates 2011-01-05
& 2011-01-08
in DT
. Input
set.seed(100)DT <- data.table(Date = seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1), V1 = runif(10, 1, 80), V2 = runif(10, 1, 100), V3 = runif(10,1, 50))
Desired Output
Date V1 V2 V31: 2011-01-06 28.39314 36.99609 29.8141142: 2011-01-07 52.53458 33.46489 5.073762
Show Solution
# InputDT <- data.table(Date= seq(as.Date('2011-01-01'),as.Date('2011-01-10'),by = 1), V1 =runif(10, 1, 80), V2 = runif(10, 1, 100), V3 = runif(10,1, 50))# Solutionoutput <- DT[(Date > '2011-01-05') & (Date < '2011-01-08'),]print(output)
Date V1 V2 V31: 2011-01-06 28.39314 36.99609 29.8141142: 2011-01-07 52.53458 33.46489 5.073762
To be continued . . .