data.table


If you use R a lot, you will meet two kinds of people: the dplyr people and the data.table people. Both packages are used for table manipulation. Like ggplot2 for graphics, once you know them they change your R experience and you cannot go back.

dplyr

I will not say much about the dplyr package, because I do not use it. However, there are a few things you need to know. dplyr is usually used together with another package named tidyr, and a single package named tidyverse now loads both of them along with others (ggplot2, readr, etc.).

Put simply, dplyr is very useful for handling objects, especially for sorting, subsetting, etc., and it is usually combined with the %>% operator (which comes from the magrittr package and is re-exported by dplyr). This operator pipes an object into the next call so that further changes can be made to it. It is very convenient for avoiding typing the object name several times, as in this example:

set.seed(121)
data <- rnorm(10)
data <- data + 44

# With dplyr package
library(dplyr)
set.seed(121)
data2 <- rnorm(10) %>%
    + 44

identical(data,data2)

TRUE

dplyr also offers many very nice functions to select, subset, sort, and replace data in tables.
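As a rough illustration of those verbs, here is a minimal sketch that chains filter(), select(), arrange() and mutate() with the pipe. It uses the mtcars data set that ships with R rather than my own data, and the threshold of 60 is an arbitrary choice for the example.

```r
library(dplyr)

# mtcars is a built-in data set, so no external file is needed
result <- mtcars %>%
    filter(cyl == 4) %>%                     # keep only the 4-cylinder cars
    select(mpg, hp, wt) %>%                  # keep three columns
    arrange(desc(mpg)) %>%                   # sort by decreasing mpg
    mutate(hp = replace(hp, hp < 60, NA))    # replace small hp values with NA

print(dim(result))
```

Each step receives the result of the previous one, so the intermediate tables never need a name.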

If you are using Emacs, you can define a key binding for %>%. Here is the Lisp code to put in your .emacs config file.

;;Define shortcut key for %>% function
(global-set-key (kbd "<M-f5>") 'dplyr-function)
(defun dplyr-function ()
  (interactive)
  (insert " %>%"))

In my example, the shortcut is Meta (Alt) + F5. Of course, you can change it to whatever you like; just make sure the shortcut is not already in use.

You will find many sites on the internet to learn more about the dplyr package. The user community is large, and there are a lot of questions and answers about dplyr on Stack Overflow.

data.table

I started using data.table a couple of years ago. At first it is very hard and the syntax is counter-intuitive, but you will get used to it. The data.table package is known to be faster than dplyr, but the speed difference only becomes significant for huge data sets (millions of rows and hundreds of columns).

data.table is less verbose than dplyr, but in my opinion it is more complex to write. Here are some examples (the data are here):

library(data.table)

data_plot_pat <- fread("data_plot_scale.txt") #read the data
data_plot_pat
dim(data_plot_pat)

1548

# Set the keys of "data_plot_pat": first Status, then mark (this also sorts the table by these columns)
setkey(data_plot_pat, Status, mark)

# Select the rows with living status and DBH mark,
# then keep only the Stand_ID.m, lng, lat and pattern_state columns
selected_plot <- data_plot_pat[Status == "living" & mark == "DBH"][, .(Stand_ID.m, lng, lat, pattern_state)]
print(dim(selected_plot))

504
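Since the data file may not be at hand, here is a small self-contained sketch of the same idea. The table below is a made-up stand-in (only the column names Status, mark, lng and lat are borrowed from the example above; the values are invented). It shows that the row filter and the column selection can go in a single [i, j] call, and that once the keys are set the same rows can also be looked up through the keys.

```r
library(data.table)

# Hypothetical stand-in data: column names mirror the example, values are invented
dt <- data.table(
    Status = c("living", "living", "dead", "living"),
    mark   = c("DBH", "height", "DBH", "DBH"),
    lng    = c(1.1, 1.2, 1.3, 1.4),
    lat    = c(45.1, 45.2, 45.3, 45.4)
)
setkey(dt, Status, mark)  # sort by Status, then mark, and mark them as keys

# Filter rows (i) and select columns (j) in one call
sub1 <- dt[Status == "living" & mark == "DBH", .(lng, lat)]

# Same rows through a key lookup: values are matched against the keys in order
sub2 <- dt[.("living", "DBH"), .(lng, lat)]

print(dim(sub1))
print(all.equal(sub1$lng, sub2$lng))
```

Chaining two brackets, as in my code above, gives the same result; the single call is just shorter.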

You can also modify existing columns or add new ones:

# Change an existing column: here, add 100 to every row of the "lat" column
selected_plot[, lat := lat + 100]

# Add a new column with the same value in every row (here "repeated text", but it could be numeric as well)
selected_plot[, new_colmun := "repeated text"]
print(dim(selected_plot))

504

As you can see, the table now has a new column.
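One thing that surprised me at first is that := changes the table in place, without any assignment with <-. The sketch below, on a small invented table, illustrates this: a second name created with <- still points to the same table, while copy() gives an independent one.

```r
library(data.table)

# Small invented table for illustration
dt <- data.table(id = 1:3, lat = c(10, 20, 30))

alias    <- dt        # not a copy: both names refer to the same table
snapshot <- copy(dt)  # an independent copy

dt[, lat := lat + 100]           # modify a column by reference
dt[, label := "repeated text"]   # add a column by reference

print(alias$lat)     # follows the change made through dt
print(snapshot$lat)  # keeps the original values
print(dim(dt))
```

So if you want to keep the original table untouched, use copy() before using :=.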

If you are using Emacs with R, I suggest adding the following package to your .emacs so that the := operator is surrounded by spaces:

;; Electric operator will turn a=10*5+2 into a = 10 * 5 + 2, so let’s enable it for R
(use-package electric-operator
  :ensure t
  :config
  (setq electric-operator-R-named-argument-style 'spaced)
  (add-hook 'ess-mode-hook #'electric-operator-mode))
(electric-operator-add-rules-for-mode 'ess-mode
  (cons ":=" " := "))

electric-operator will also make your code easier to read by adding spaces around operators such as " == " or " + ".

Because I am used to data.table, and especially love the fread and fwrite functions for reading and saving tables, I did not want to switch to dplyr. However, if you are a beginner, or if you want to be able to read old code easily years later, I suggest you use dplyr.
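For the curious, here is a minimal sketch of the fread/fwrite round trip. The table and the temporary file are invented for the example; with your own data you would pass a real file path instead.

```r
library(data.table)

# Small invented table for illustration
dt <- data.table(id = 1:3, name = c("a", "b", "c"), value = c(0.5, 1.5, 2.5))

tmp <- tempfile(fileext = ".csv")  # temporary file, removed at the end
fwrite(dt, tmp)                    # save the table (comma-separated by default)
dt_back <- fread(tmp)              # read it back; separator and column types are detected

print(all.equal(dt, dt_back))
unlink(tmp)
```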