diff --git a/NEWS.md b/NEWS.md index 4c9647cdfb..194fa27e9e 100644 --- a/NEWS.md +++ b/NEWS.md @@ -109,11 +109,13 @@ 21. `melt()` was pseudo generic in that `melt(DT)` would dispatch to the `melt.data.table` method but `melt(not-DT)` would explicitly redirect to `reshape2`. Now `melt()` is standard generic so that methods can be developed in other packages, [#4864](https://github.com/Rdatatable/data.table/pull/4864). Thanks to @odelmarcelle for suggesting and implementing. -22. `DT(i, j, by, ...)` has been added, i.e. functional form of a `data.table` query, [#641](https://github.com/Rdatatable/data.table/issues/641) [#4872](https://github.com/Rdatatable/data.table/issues/4872). Thanks to Yike Lu and Elio Campitelli for filing requests, many others for comments and suggestions, and Matt Dowle for the PR. This enables the `data.table` general form query to be invoked on a `data.frame` without converting it to a `data.table` first. The class of the input object is retained. +22. `DT(i, j, by, ...)` has been added, i.e. functional form of a `data.table` query, [#641](https://github.com/Rdatatable/data.table/issues/641) [#4872](https://github.com/Rdatatable/data.table/issues/4872). Thanks to Yike Lu and Elio Campitelli for filing requests, many others for comments and suggestions, and Matt Dowle for the PR. This enables the `data.table` general form query to be invoked on a `data.frame` without converting it to a `data.table` first. The class of the input object is retained. Thanks to Mark Fairbanks and Boniface Kamgang for testing and reporting problems that have been fixed before release, [#5106](https://github.com/Rdatatable/data.table/issues/5106) [#5107](https://github.com/Rdatatable/data.table/issues/5107). ```R mtcars |> DT(mpg>20, .(mean_hp=mean(hp)), by=cyl) ``` + + When `data.table` queries (either `[...]` or `|> DT(...)`) receive a `data.table`, the operations maintain `data.table`'s attributes such as its key and any indices. 
For example, if a `data.table` is reordered by `data.table`, or a key column has a value changed by `:=` in `data.table`, its key and indices will either be dropped or reordered appropriately. Some `data.table` operations automatically add and store an index on a `data.table` for reuse in future queries when `options(datatable.auto.index=TRUE)`, which is the default. A `data.table` is also over-allocated, which means spare column pointer slots are allocated in advance so that a `data.table` in the `.GlobalEnv` can have a column added to it truly by reference, like an in-memory database with multiple client sessions connecting to one server R process, as a `data.table` video has shown in the past. But because R and other packages don't maintain `data.table`'s attributes or over-allocation (e.g. a subset or reorder by R or another package will create invalid `data.table` attributes), `data.table` cannot use these attributes when it detects that base R or another package has touched the `data.table` in the meantime, even if the attributes may sometimes still be valid. Consequently, `DT()` on a `data.table` should achieve better speed and memory usage than `DT()` on a `data.frame`. `DT()` on a `data.frame` may still be useful for applying `data.table`'s syntax (e.g. sub-queries within group: `|> DT(i, .SD[sub-query], by=grp)`) without needing to convert to a `data.table` first. 23. `DT[i, nomatch=NULL]` where `i` contains row numbers now excludes `NA` and any outside the range [1,nrow], [#3109](https://github.com/Rdatatable/data.table/issues/3109) [#3666](https://github.com/Rdatatable/data.table/issues/3666). Before, `NA` rows were returned always for such values; i.e. `nomatch=0|NULL` was ignored. Thanks Michel Lang and Hadley Wickham for the requests, and Jan Gorecki for the PR. Using `nomatch=0` in this case when `i` is row numbers generates the warning `Please use nomatch=NULL instead of nomatch=0; see news item 5 in v1.12.0 (Jan 2019)`. 
diff --git a/R/data.table.R b/R/data.table.R index 4dfa9c276a..8718f3e44e 100644 --- a/R/data.table.R +++ b/R/data.table.R @@ -446,7 +446,7 @@ replace_dot_alias = function(e) { i = as.data.table(i) } - if (is.data.table(i)) { + if (is.data.frame(i)) { if (missing(on)) { if (!haskey(x)) { stopf("When i is a data.table (or character vector), the columns to join by must be specified using 'on=' argument (see ?data.table), by keying x (i.e. sorted, and, marked as sorted, see ?setkey), or by sharing column names between x and i (i.e., a natural join). Keyed joins might have further speed benefits on very large data due to x being sorted in RAM.") @@ -1160,7 +1160,8 @@ replace_dot_alias = function(e) { # ok=-1 which will trigger setalloccol with verbose in the next # branch, which again calls _selfrefok and returns the message then if ((ok<-selfrefok(x, verbose=FALSE))==0L) # ok==0 so no warning when loaded from disk (-1) [-1 considered TRUE by R] - warningf("Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.") + if (is.data.table(x)) warningf("Invalid .internal.selfref detected and fixed by taking a (shallow) copy of the data.table so that := can add this new column by reference. At an earlier point, this data.table has been copied by R (or was created manually using structure() or similar). Avoid names<- and attr<- which in R currently (and oddly) may copy the whole data.table. Use set* syntax instead to avoid copying: ?set, ?setnames and ?setattr. 
If this message doesn't help, please report your use case to the data.table issue tracker so the root cause can be fixed or this message improved.") + # !is.data.table for DF |> DT(,:=) tests 2212.26-28 (#5113) where a shallow copy is routine for data.frame if ((ok<1L) || (truelength(x) < ncol(x)+length(newnames))) { DT = x # in case getOption contains "ncol(DT)" as it used to. TODO: warn and then remove n = length(newnames) + eval(getOption("datatable.alloccol")) # TODO: warn about expressions and then drop the eval() @@ -1325,13 +1326,12 @@ replace_dot_alias = function(e) { if (keylen && (ichk || is.logical(i) || (.Call(CisOrderedSubset, irows, nrow(x)) && ((roll == FALSE) || length(irows) == 1L)))) # see #1010. don't set key when i has no key, but irows is ordered and roll != FALSE setattr(ans,"sorted",head(key(x),keylen)) } - setattr(ans, "class", class(x)) # fix for #64 - setattr(ans, "row.names", .set_row_names(nrow(ans))) + setattr(ans, "class", class(x)) # retain class that inherits from data.table, #64 + setattr(ans, "row.names", .set_row_names(length(ans[[1L]]))) setalloccol(ans) } - if (!with || missing(j)) return(ans) - + if (!is.data.table(ans)) setattr(ans, "class", c("data.table","data.frame")) # DF |> DT(,.SD[...]) .SD should be data.table, test 2212.013 SDenv$.SDall = ans SDenv$.SD = if (length(non_sdvars)) shallow(SDenv$.SDall, sdvars) else SDenv$.SDall SDenv$.N = nrow(ans) @@ -1544,6 +1544,7 @@ replace_dot_alias = function(e) { # TODO add: if (max(len__)==nrow) stopf("There is no need to deep copy x in this case") # TODO move down to dogroup.c, too. SDenv$.SDall = .Call(CsubsetDT, x, if (length(len__)) seq_len(max(len__)) else 0L, xcols) # must be deep copy when largest group is a subset + if (!is.data.table(SDenv$.SDall)) setattr(SDenv$.SDall, "class", c("data.table","data.frame")) # DF |> DT(,.SD[...],by=grp) needs .SD to be data.table, test 2212.012 if (xdotcols) setattr(SDenv$.SDall, 'names', ansvars[xcolsAns]) # now that we allow 'x.' 
prefix in 'j', #2313 bug fix - [xcolsAns] SDenv$.SD = if (length(non_sdvars)) shallow(SDenv$.SDall, sdvars) else SDenv$.SDall } @@ -1934,7 +1935,17 @@ replace_dot_alias = function(e) { setalloccol(ans) # TODO: overallocate in dogroups in the first place and remove this line } -DT = `[.data.table` #4872 +DT = function(x, ...) { #4872 + old = getOption("datatable.optimize") + if (!is.data.table(x) && old>2L) { + options(datatable.optimize=2L) + # GForce still on; building and storing indices in .prepareFastSubset off; see long paragraph in news item 22 of v1.14.2 + } + ans = `[.data.table`(x, ...) + options(datatable.optimize=old) + .global$print = "" # functional form should always print; #5106 + ans +} .optmean = function(expr) { # called by optimization of j inside [.data.table only. Outside for a small speed advantage. if (length(expr)==2L) # no parameters passed to mean, so defaults of trim=0 and na.rm=FALSE @@ -2512,8 +2523,8 @@ copy = function(x) { } shallow = function(x, cols=NULL) { - if (!is.data.table(x)) - stopf("x is not a data.table. Shallow copy is a copy of the vector of column pointers (only), so is only meaningful for data.table") + if (!is.data.frame(x)) + stopf("x is not a data.table|frame. 
Shallow copy is a copy of the vector of column pointers (only), so is only meaningful for data.table|frame") ans = .shallow(x, cols=cols, retain.key=selfrefok(x)) # selfrefok for #5042 ans } diff --git a/R/test.data.table.R b/R/test.data.table.R index 65a62fd0b5..b64dfe119d 100644 --- a/R/test.data.table.R +++ b/R/test.data.table.R @@ -407,8 +407,8 @@ test = function(num,x,y=TRUE,error=NULL,warning=NULL,message=NULL,output=NULL,no y = try(y,TRUE) if (identical(x,y)) return(invisible(TRUE)) all.equal.result = TRUE - if (is.data.table(x) && is.data.table(y)) { - if (!selfrefok(x) || !selfrefok(y)) { + if (is.data.frame(x) && is.data.frame(y)) { + if ((is.data.table(x) && !selfrefok(x)) || (is.data.table(y) && !selfrefok(y))) { # nocov start catf("Test %s ran without errors but selfrefok(%s) is FALSE\n", numStr, if (selfrefok(x)) "y" else "x") fail = TRUE @@ -417,12 +417,14 @@ test = function(num,x,y=TRUE,error=NULL,warning=NULL,message=NULL,output=NULL,no xc=copy(x) yc=copy(y) # so we don't affect the original data which may be used in the next test # drop unused levels in factors - if (length(x)) for (i in which(vapply_1b(x,is.factor))) {.xi=x[[i]];xc[,(i):=factor(.xi)]} - if (length(y)) for (i in which(vapply_1b(y,is.factor))) {.yi=y[[i]];yc[,(i):=factor(.yi)]} - setattr(xc,"row.names",NULL) # for test 165+, i.e. x may have row names set from inheritance but y won't, consider these equal - setattr(yc,"row.names",NULL) + if (length(x)) for (i in which(vapply_1b(x,is.factor))) {.xi=x[[i]];xc[[i]]<-factor(.xi)} + if (length(y)) for (i in which(vapply_1b(y,is.factor))) {.yi=y[[i]];yc[[i]]<-factor(.yi)} + if (is.data.table(xc)) setattr(xc,"row.names",NULL) # for test 165+, i.e. 
x may have row names set from inheritance but y won't, consider these equal + if (is.data.table(yc)) setattr(yc,"row.names",NULL) setattr(xc,"index",NULL) # too onerous to create test RHS with the correct index as well, just check result setattr(yc,"index",NULL) + setattr(xc,".internal.selfref",NULL) # test 2212 + setattr(yc,".internal.selfref",NULL) if (identical(xc,yc) && identical(key(x),key(y))) return(invisible(TRUE)) # check key on original x and y because := above might have cleared it on xc or yc if (isTRUE(all.equal.result<-all.equal(xc,yc,check.environment=FALSE)) && identical(key(x),key(y)) && # ^^ to pass tests 2022.[1-4] in R-devel from 5 Dec 2020, #4835 diff --git a/inst/tests/tests.Rraw b/inst/tests/tests.Rraw index 32b16e471f..a7d292bdf6 100644 --- a/inst/tests/tests.Rraw +++ b/inst/tests/tests.Rraw @@ -7,6 +7,7 @@ if (exists("test.data.table", .GlobalEnv, inherits=FALSE)) { } if ((tt<-compiler::enableJIT(-1))>0) cat("This is dev mode and JIT is enabled (level ", tt, ") so there will be a brief pause around the first test.\n", sep="") + DTfun = DT # just in dev-mode, DT() gets overwritten in .GlobalEnv by DT objects here in tests.Rraw; we restore DT() in test 2212 } else { require(data.table) # Make symbols to the installed version's ::: so that we can i) test internal-only not-exposed R functions @@ -639,7 +640,7 @@ test(211, ncol(TESTDT), 2L) DT = data.table(a=1:6,key="a") test(212, DT[J(3)]$a, 3L) # correct class c("data.table","data.frame") class(DT) = "data.table" # incorrect class, but as from 1.8.1 it works. By accident when moving from colnames() to names(), it was dimnames() doing the check, but rather than add a check that identical(class(DT),c("data.frame","data.table")) at the top of [.data.table, we'll leave it flexible to user (user might not want to inherit from data.frame for some reason). 
-test(213, DT[J(3)]$a, 3L) +test(213, DT[J(3)]$a, error="x is not a data.table|frame") # from v1.14.2, data.table must inherit from data.frame (internals are too hard to reason about if a data.table may not be data.frame too) # setkey now auto coerces double and character for convenience, and # to solve bug #953 @@ -14194,7 +14195,7 @@ test(1984.242, na.omit(data.table(A=c(1,NA,2)), cols=character()), data.table(A= test(1984.25, rbindlist(list(DT[1L], DT[2L]), idcol = TRUE), data.table(.id=1:2, a=1:2)) test(1984.26, setalloccol(`*tmp*`), error='setalloccol attempting to modify `*tmp*`') DF = as.data.frame(DT) -test(1984.27, shallow(DF), error='x is not a data.table') +test(1984.27, shallow(DF), DF) # shallow (which is not exported) works on DF from v1.14.2 test(1984.28, split.data.table(DF), error='argument must be a data.table') test(1984.29, split(DT, by='a', f='a'), error="passing 'f' argument together with 'by' is not allowed") test(1984.30, split(DT), error="Either 'by' or 'f' argument must be supplied") @@ -18050,3 +18051,49 @@ for (col in c("a","b","c")) { } } +# DT() functional form, #4872 #5106 #5107 +if (base::getRversion() >= "4.1.0") { + # we have to EVAL "|>" here too otherwise this tests.Rraw file won't parse in R<4.1.0 + if (exists("DTfun")) DT=DTfun # just in dev-mode restore DT() in .GlobalEnv as DT object overwrote it in tests above + droprn = function(df) { rownames(df)=NULL; df } # TODO: could retain rownames where droprn is currently used below + test(2212.011, EVAL("mtcars |> DT(mpg>20, .(mean_hp=round(mean(hp),2)), by=cyl)"), + data.frame(cyl=c(6,4), mean_hp=c(110.0, 82.64))) + test(2212.012, EVAL("mtcars |> DT(mpg>15, .SD[hp>mean(hp)], by=cyl)"), + droprn(mtcars[c(10,11,30,3,9,21,27,28,32,29), c(2,1,3:11)])) + test(2212.013, EVAL("mtcars |> DT(mpg>20, .SD[hp>mean(hp)])"), + droprn(mtcars[ mtcars$mpg>20 & mtcars$hp>mean(mtcars$hp[mtcars$mpg>20]), ])) + D = copy(mtcars) + test(2212.02, EVAL("D |> DT(,.SD)"), D) + test(2212.03, EVAL("D |> DT(, .SD, 
.SDcols=5:8)"), D[,5:8]) + test(2212.04, EVAL("D |> DT(, 5:8)"), droprn(D[,5:8])) + test(2212.05, EVAL("D |> DT(, lapply(.SD, sum))"), as.data.frame(lapply(D,sum))) + test(2212.06, EVAL("D |> DT(, .SD, keyby=cyl) |> setkey(NULL)"), droprn(D[order(D$cyl),c(2,1,3:11)])) + test(2212.07, EVAL("D |> DT(1:20, .SD)"), droprn(D[1:20,])) + test(2212.08, EVAL("D |> DT(, .SD, by=cyl, .SDcols=5:8)"), droprn(D[unlist(tapply(1:32, D$cyl, c)[c(2,1,3)]), c(2,5:8)])) + test(2212.09, EVAL("D |> DT(1:20, .SD, .SDcols=5:8)"), droprn(D[1:20, 5:8])) + test(2212.10, EVAL("D |> DT(1:20, .SD, by=cyl, .SDcols=5:8)"), droprn(D[unlist(tapply(1:20, D$cyl[1:20], c)[c(2,1,3)]), c(2,5:8)])) + test(2212.11, EVAL("D |> DT(1:20, lapply(.SD, sum))"), as.data.frame(lapply(D[1:20,],sum))) + test(2212.12, droprn(EVAL("D |> DT(1:20, c(N=.N, lapply(.SD, sum)), by=cyl)")[c(1,3),c("cyl","N","carb")]), data.frame(cyl=c(6,8), N=c(6L,8L), carb=c(18,27))) + test(2212.13, EVAL("D |> DT(cyl==4)"), droprn(D[D$cyl==4,])) + test(2212.14, EVAL("D |> DT(cyl==4 & vs==0)"), droprn(D[D$cyl==4 & D$vs==0,])) + test(2212.15, EVAL("D |> DT(cyl==4 & vs>0)"), droprn(D[D$cyl==4 & D$vs>0,])) + test(2212.16, EVAL("D |> DT(cyl>=4)"), droprn(D[D$cyl>=4,])) + test(2212.17, EVAL("D |> DT(cyl!=4)"), droprn(D[D$cyl!=4,])) + test(2212.18, EVAL("D |> DT(cyl!=4 & vs!=0)"), droprn(D[D$cyl!=4 & D$vs!=0,])) + test(2212.19, EVAL("iris |> DT(Sepal.Length==5.0 & Species=='setosa')"), droprn(iris[iris$Sepal.Length==5.0 & iris$Species=="setosa",])) + test(2212.20, EVAL("iris |> DT(Sepal.Length==5.0)"), droprn(iris[iris$Sepal.Length==5.0,])) + test(2212.21, EVAL("iris |> DT(Species=='setosa')"), droprn(iris[iris$Species=='setosa',])) + test(2212.22, EVAL("D |> DT(, cyl)"), droprn(D[,"cyl"])) + test(2212.23, EVAL("D |> DT(1:2, cyl)"), droprn(D[1:2, "cyl"])) + test(2212.24, EVAL("D |> DT(, list(cyl))"), droprn(D[,"cyl",drop=FALSE])) + test(2212.25, EVAL("D |> DT(1:2, .(cyl))"), droprn(D[1:2, "cyl", drop=FALSE])) + test(2212.26, EVAL("D |> DT(, 
z:=sum(cyl))"), cbind(D, z=sum(D$cyl))) + test(2212.27, EVAL("D |> DT(, z:=round(mean(mpg),2), by=cyl)"), cbind(D, z=c("6"=19.74, "4"=26.66, "8"=15.10)[as.character(D$cyl)])) + test(2212.28, EVAL("D |> DT(1:3, z:=5, by=cyl)"), cbind(D, z=c(5,5,5,rep(NA,nrow(D)-3)))) + test(2212.29, EVAL("D |> DT(1:3, z:=NULL)"), error="When deleting columns, i should not be provided") + test(2212.30, EVAL("D |> DT(data.table(cyl=4), on='cyl')"), droprn(D[D$cyl==4,])) + test(2212.31, EVAL("D |> DT(data.frame(cyl=4), on='cyl')"), droprn(D[D$cyl==4,])) + test(2212.32, EVAL("D |> DT(.(4), on='cyl')"), droprn(D[D$cyl==4,])) + test(2212.33, EVAL("iris |> DT('setosa', on='Species')"), {tt=droprn(iris[iris$Species=="setosa",]); tt$Species=as.character(tt$Species); tt}) +} +
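
News item 22 in the NEWS.md hunk above says that a query run by `data.table` on a `data.table` keeps the key where it is still valid and drops it where it is not. The following is a minimal sketch of that behaviour using the long-standing `[...]` form rather than `DT()`. It assumes an installed `data.table`; the object names `dt`, `kept`, `reordered` and `changed` are illustrative and are not part of this patch.

```r
library(data.table)

# Keyed table: data.table() sorts by the key column and marks it as sorted.
dt = data.table(id = c(2L, 1L, 3L), v = c("b", "a", "c"), key = "id")
stopifnot(identical(key(dt), "id"))

# A logical subset keeps the rows in key order, so the key is retained.
kept = dt[id >= 2L]
stopifnot(identical(key(kept), "id"))

# Reordering through data.table breaks the sort order, so the key is dropped.
reordered = dt[order(-id)]
stopifnot(is.null(key(reordered)))

# Changing a key column with := (on a copy, so dt stays intact) also drops the key.
changed = copy(dt)[id == 1L, id := 10L]
stopifnot(is.null(key(changed)))
```

The same checks are not shown for a `data.frame` input because, as the news item explains, base R does not maintain these attributes in the first place.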