Conversation
|
Are you thinking of something like |
|
Thanks for the reminder! I thought I was missing something. Like i.e., it's doing "cartesian unnesting" (if in one row there's a list with 4 elements, and the other there's 6 elements, this will make 24 rows). I couldn't get |
jangorecki
left a comment
There was a problem hiding this comment.
Some few initial comments, not really a review
src/unnest.c
Outdated
| int n = LENGTH(VECTOR_ELT(x, 0)); | ||
| int p = LENGTH(x); | ||
| int row_counts[n]; | ||
| SEXPTYPE col_types[p]; |
There was a problem hiding this comment.
Could segfault for many rows or columns
There was a problem hiding this comment.
Would also be obviated by switch to cols...
The issue here is if we try to instantiate SEXPTYPE array with >INT_MAX elements?
|
Looking again, I'm thinking of just using this instead: because we immediately leverage code built into The To allow a Any thoughts? |
|
Hmm... as straightforward as it looks, it's much slower than the current implementation of A quick look at the profile suggests it's the loop itself slowing things down... At a glance, current implementation working pretty well (not 100% clear this will stay the same once the TODO are all fixed): |
|
What if you call CJ and rbindlist from C? Ideally would be to isolate CJ from R C api then it can be called from C without unnecessary overhead. |
|
Ah, none of the functions in |
Yes I think we're on the same page there. Maybe I'll add |
|
If skipping CJ part can give extra speed up then probably provide an extra argument to control that. |
|
|
Problem with re-using however, i just found this, which looks quite interesting: https://stackoverflow.com/a/7413460/3576984 |
|
AFAIR we already use |
|
I'm seeing only v limited usage: Doesn't this provide a nice window for shutting off parallelism for small-cardinality cases that's been the source of some speed issues in other cases? |
R/setkey.R
Outdated
| if (unique) l[[i]] = unique(y) | ||
| } | ||
| } | ||
| nrow = prod( vapply_1i(l, length) ) # lengths(l) will work from R 3.2.0 |
There was a problem hiding this comment.
migrated this to C, epsilon more efficient but anyway cleaner to have more logic there; will also make it easier to switch to long vector support
|
should be ready for testing. a few tweaks needed & then dotting i's. just ran out of time for now |
|
On 10,000 rows, it looks like the performance is slower than your initial benchmarks. Is it just me? Also, there are typically fewer row names than your alternative |
|
Thanks @colem1 , and good catch w.r.t. For the slowdown, do you have a lot of cores on your machine? Either way could you try again? I just turned on conditional parallelism for |
|
I think the functions would parse easier as |
|
@TysonStanley or anyone else more familiar with output seems correct for
|
|
@MichaelChirico this is looking awesome! Your usage of Some things to consider (as they are useful and
Thanks for putting this together, @MichaelChirico ! I will definitely be able to use this function. |
|
Thanks, very helpful. I tried For 2, yes, should be easy enough to support. For 3, I think for the list columns, |
|
Also, re:naming, my vote (if it counts much) is, if we are essentially replicating the behavior of |
|
For
So I'm leaning towards using |
|
My two cents re: |
|
Ok. I see. Yeah, I would definitely go with Are you still wanting to implement functionality for list-columns of class data frame/table? |
I totally agree. We don't need more terms if a term for it already exists. |
|
@TysonStanley there's some other bug obscuring things a bit here (#4159) but I think the long-and-wide unnesting of So I'm not sure we need to add a new function for that? Or maybe some more complicated things are involved that I'm not seeing... |
|
OK, with the fix in #4161, indeed we can just use |
|
It almost seems like there are two functions in one. For a field of Then for actual lists, your excellent work on |
|
In fact I've added to the TODO & begun work on
Seems a lot of the use cases cited wanted 'matched' so I'll make that the default. |
|
I would stick to |
|
Yes & yes, however, there is still a major difference vs. |
|
See some of my tries (https://hope-data-science.github.io/tidyfst/reference/nest.html), the only problem seems to be the data class of integer and double, nothing else. |
That's exactly the problem I've been running into with pretty much every attempt at "unnesting" in data.table. When attempting to unlist a mixture of numeric types it just dies on the spot. It's a shame because being able to unnest makes producing "long" tables when starting with start/end sequences in two separate columns trivial. |
Closes #2146, part of #3189
Still plenty to do, initial basic idea is here. Haven't benchmarked vs. ad-hoc approaches yet either.
TODO:
rbindlist]list(1, 1L, 'a')) [viarbindlist]cj]row.namestypeargument --'cartesian'to do cross-product of multiple inputs;'matched'(?name) works more liketidyr::unnestcols-- accept int-as-numeric, column names[ ] Make it work recursively?[only if requested... pending a follow-up issue...]base/dplyr,kdb::ungroup, and ad hoc versions using existingdata.tablecode)