ARROW-6532 [R] write_parquet() uses writer properties (general and arrow specific) #5451
romainfrancois wants to merge 31 commits into apache:master
Conversation
Yeah, I'm not sure about this on a few levels. I think the R I want to type to write a compressed Parquet file looks like … I'm also not sure whether this works as intended. The Parquet C++ code seems to have its own compression and writing logic; that may be a historical artifact, or it may be meaningful. Maybe we can get away without implementing bindings for those classes: the proof would be a passing test of writing a compressed Parquet file and reading it back in. Then again, maybe in principle we should write the Parquet bindings to match the C++ library.
It is not a good idea to write a Parquet file into a CompressedOutputStream. Such a file will not be readable with standard Parquet readers; Parquet already compresses data internally.
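A contrast sketch of the two approaches (the file paths and the commented-out anti-pattern are illustrative, not from the PR):

```r
library(arrow, warn.conflicts = FALSE)

df <- tibble::tibble(x = 1:5)

# Anti-pattern: wrapping the sink compresses the whole file, so no reader
# can seek into it without decompressing everything first.
# sink <- CompressedOutputStream$create("/tmp/bad.parquet.gz", "gzip")

# Preferred: let the Parquet writer compress its own data pages.
write_parquet(df, "/tmp/good.parquet", compression = "gzip")
```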
Here's the way we handle it in Python; you'll need to do the same thing in R: https://github.com/apache/arrow/blob/master/python/pyarrow/parquet.py#L363
Some progress, inspired by the Python implementation. The signature now looks like:

```r
write_parquet <- function(
  table,
  sink, chunk_size = NULL,
  version = NULL, compression = NULL, use_dictionary = NULL, write_statistics = NULL, data_page_size = NULL,
  properties = ParquetWriterProperties$create(
    version = version,
    compression = compression,
    use_dictionary = use_dictionary,
    write_statistics = write_statistics,
    data_page_size = data_page_size
  ),
  use_deprecated_int96_timestamps = FALSE, coerce_timestamps = NULL, allow_truncated_timestamps = FALSE,
  arrow_properties = ParquetArrowWriterProperties$create(
    use_deprecated_int96_timestamps = use_deprecated_int96_timestamps,
    coerce_timestamps = coerce_timestamps,
    allow_truncated_timestamps = allow_truncated_timestamps
  )
)
```

with the new arguments managed by the `ParquetWriterProperties` and `ParquetArrowWriterProperties` classes. Only simple versions work so far, e.g.

```r
library(arrow, warn.conflicts = FALSE)

df <- tibble::tibble(x = 1:5)
write_parquet(df, "/tmp/test.parquet", compression = "snappy")
read_parquet("/tmp/test.parquet")
#> # A tibble: 5 x 1
#>       x
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     4
#> 5     5
```

but we can't yet ask for specific columns to be handled by a given compression. I think this is a good place for tidy selection, e.g. something like:

```r
df <- tibble::tibble(x1 = 1:5, x2 = 1:5, y = 1:5)
write_parquet(df, "/tmp/test.parquet",
  compression = list(snappy = starts_with("x"))
)
```

The list in Python goes the other way, so if we do something similar it would look like:

```r
write_parquet(df, "/tmp/test.parquet",
  compression = list(x1 = "snappy", x2 = "snappy")
)
```

Perhaps we can have:

```r
write_parquet(df, "/tmp/test.parquet",
  compression = compression_spec(snappy = starts_with("x"))
)
```
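For context, a minimal sketch of resolving the Python-style named form to one codec per column; the helper `resolve_compression` and the `"uncompressed"` default are hypothetical, not part of the PR:

```r
# Hypothetical helper: turn a named spec (or a single unnamed value) into
# one codec name per column.
resolve_compression <- function(spec, column_names, default = "uncompressed") {
  out <- rep(default, length(column_names))
  names(out) <- column_names
  if (is.character(spec) && length(spec) == 1L && is.null(names(spec))) {
    out[] <- spec   # one unnamed value applies to every column
  } else {
    spec <- unlist(spec)
    stopifnot(all(names(spec) %in% column_names))
    out[names(spec)] <- spec   # named entries override the default
  }
  out
}

resolve_compression(list(x1 = "snappy", x2 = "snappy"), c("x1", "x2", "y"))
#>             x1             x2              y
#>       "snappy"       "snappy" "uncompressed"
```

A tidy-select form like `starts_with("x")` would need an extra step to evaluate the selection against the data's column names before this resolution, which is why the named-list form is simpler to implement.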
One option we discussed with @nealrichardson was to be able to do e.g.

```r
write_parquet(df, "/tmp/test.parquet",
  compression = Codec$create("snappy", 5L)
)
```

But unfortunately the C++ `Codec` class does not expose the compression level it was created with. Instead, I followed Python's lead and we can do this:

```r
write_parquet(df, "/tmp/test.parquet",
  compression = "snappy",
  compression_level = 5L
)
```
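A usage sketch under that API (note that snappy itself has no levels, so a leveled codec like gzip is what makes the second argument meaningful; the file path is illustrative):

```r
library(arrow, warn.conflicts = FALSE)

df <- tibble::tibble(x = 1:5)

# gzip supports levels 1-9; higher trades write speed for file size.
write_parquet(df, "/tmp/test.parquet",
  compression = "gzip",
  compression_level = 9L
)
```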
These arguments that are handled by …
Taking a look now; FTR Travis says …
Codecov Report

```diff
@@            Coverage Diff             @@
##           master    #5451      +/-   ##
===========================================
- Coverage    88.7%   76.76%   -11.94%
===========================================
  Files         964       59      -905
  Lines      128215     4330   -123885
  Branches     1501        0     -1501
===========================================
- Hits       113731     3324   -110407
+ Misses      14119     1006    -13113
+ Partials      365        0      -365
```

Continue to review the full report at Codecov.
nealrichardson left a comment:
A few more notes. I'd also like to see better coverage on https://codecov.io/gh/apache/arrow/pull/5451/diff
r/R/parquet.R (outdated)
I'd write this function as:

```r
make_valid_version <- function(version, valid_versions = valid_parquet_version) {
  pq_version <- valid_versions[[version]]
  if (is.null(pq_version)) {
    stop('"version" should be one of ', oxford_paste(names(valid_versions), "or"), call. = FALSE)
  }
  pq_version
}
```
As it stands, make_valid_version(1) won't work, and it seems like it should.
Per the codecov report, this code isn't being exercised.
wfm:

```r
arrow:::make_valid_version("1.0")
#> [1] 0
arrow:::make_valid_version("2.0")
#> [1] 1
arrow:::make_valid_version(1)
#> [1] 0
arrow:::make_valid_version(2)
#> [1] 1
```

Created on 2019-09-27 by the reprex package (v0.3.0.9000)
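Why the numeric inputs happen to work: with `[[`, a number indexes by position, and if the lookup table is ordered as assumed below (its actual definition isn't shown in this thread), the positions line up with the version numbers:

```r
# Assumed shape of the lookup table; not copied from the package source.
valid_parquet_version <- list("1.0" = 0L, "2.0" = 1L)

valid_parquet_version[["1.0"]]  # lookup by name     -> 0
valid_parquet_version[[1]]      # lookup by position -> 0 as well
```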
r/R/parquet.R (outdated)
I'm not sure there's value in including properties and arrow_properties in the signature here. I kept them in read_delim_arrow() because it exposes some properties that aren't mapped to arguments in the readr::read_delim signature, but that doesn't seem to be the case here. (On reflection, that's probably not the right call there either; if you want lower-level access to those settings, you should probably be doing CsvTableReader$create(...) anyway.)
My rationale was that perhaps you'd have already built those properties and arrow_properties objects beforehand.
But I get the point that maybe this could be diverted to using a ParquetFileWriter instance.
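A hypothetical sketch of that lower-level path: build the properties objects yourself and hand them to a writer, rather than threading them through write_parquet()'s signature. The class and method names below are assumptions based on the discussion, not the confirmed API:

```r
library(arrow, warn.conflicts = FALSE)

tab <- Table$create(tibble::tibble(x = 1:5))

# Build both property objects explicitly ...
props <- ParquetWriterProperties$create(compression = "snappy")
arrow_props <- ParquetArrowWriterProperties$create()

# ... and pass them to a writer instance directly.
writer <- ParquetFileWriter$create(
  tab$schema,
  FileOutputStream$create("/tmp/test.parquet"),
  properties = props,
  arrow_properties = arrow_props
)
writer$WriteTable(tab, chunk_size = 4096L)
writer$Close()
```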
nealrichardson left a comment:

Some notes on the docs.
r/R/parquet.R (outdated)
And not all columns need to be specified, correct?
r/R/parquet.R (outdated)
Does this follow the same conventions as compression? Maybe there should be a paragraph/section in the docs that explains how these parameters work, since they're the same or similar.
I've refactored the documentation in @details.
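To illustrate the shared convention under discussion (whether the named-vector form shown here is exactly what landed is an assumption; paths are illustrative):

```r
library(arrow, warn.conflicts = FALSE)

df <- tibble::tibble(x1 = 1:5, x2 = 1:5, y = 1:5)

# A single unnamed value applies to every column ...
write_parquet(df, "/tmp/all.parquet", compression = "snappy")

# ... while named entries apply per column; columns not named keep the
# default, so not all columns need to be specified.
write_parquet(df, "/tmp/some.parquet",
  compression = c(x1 = "snappy", x2 = "snappy")
)
```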
r/R/parquet.R (outdated)
Same, and what are statistics?
I don't know
Commits (titles truncated in the PR's commit list):

… file path for more flexibility
…rrow::WriterProperties to R side
…ties_Builder class skeleton
…perties to write_parquet()
Co-Authored-By: Neal Richardson <neal.p.richardson@gmail.com>
…ch class. Many tests were false positives
nealrichardson left a comment:
One final style question but otherwise LGTM, happy to merge today regardless of where we land on the whitespace question.
A diff excerpt from the function signature under discussion:

```r
col_select = NULL,
as_data_frame = TRUE,
props = ParquetReaderProperties$create(),
...) {
```
This is "bad", according to the tidyverse style guide, which I believed we were trying to follow: https://style.tidyverse.org/functions.html#long-lines-1
I can get used to whatever style conventions we decide, just want to make sure we're in agreement.
I'll set up my RStudio to obey the style; perhaps we should run styler:: once in a while to do that automatically.
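For reference, the long-line conventions the style guide linked above recommends; the signature here is illustrative, not quoted from the PR:

```r
# Either align each argument with the opening parenthesis ...
read_parquet <- function(file,
                         col_select = NULL,
                         as_data_frame = TRUE,
                         ...) {
  # body
}

# ... or double-indent the arguments, one per line.
read_parquet <- function(
    file,
    col_select = NULL,
    as_data_frame = TRUE,
    ...) {
  # body
}
```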
Can you update the PR description to reflect what is actually in the PR? (Writing a Parquet file into a CompressedOutputStream isn't recommended: you would have to decompress the entire file before being able to read any part of it.)
The ability to preserve categorical values was introduced in #5077 as the convention of storing a special `ARROW:schema` key in the metadata. To invoke this, we need to call `ArrowWriterProperties::store_schema()`. The R binding is already prepared for this, but it calls `store_schema()` only conditionally and uses `parquet___default_arrow_writer_properties()` by default. Though I don't see the motivation for implementing it that way in #5451, considering that [the Python binding always calls `store_schema()`](https://github.com/apache/arrow/blob/dbe708c7527a4aa6b63df7722cd57db4e0bd2dc7/python/pyarrow/_parquet.pyx#L1269), I guess the R code can do the same.

Closes #6135 from yutannihilation/ARROW-7045_preserve_factor_in_parquet and squashes the following commits:

9227e7e <Hiroaki Yutani> Fix test
4d8bb46 <Hiroaki Yutani> Remove default_arrow_writer_properties()
dfd08cb <Hiroaki Yutani> Add failing tests

Authored-by: Hiroaki Yutani <yutani.ini@gmail.com>
Signed-off-by: Neal Richardson <neal.p.richardson@gmail.com>
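A minimal round-trip sketch of what unconditionally calling `store_schema()` buys: the original Arrow schema travels in the file metadata, so factor (dictionary) columns come back as factors. The file path is illustrative:

```r
library(arrow, warn.conflicts = FALSE)

df <- tibble::tibble(f = factor(c("a", "b", "a")))
write_parquet(df, "/tmp/factor.parquet")

# With the ARROW:schema key stored, the column is restored as a factor.
str(read_parquet("/tmp/factor.parquet")$f)
#> Factor w/ 2 levels "a","b": 1 2 1
```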
This adds parameters to `write_parquet()` to control compression, whether to use dictionary encoding, etc., on top of the C++ classes `parquet::WriterProperties` and `parquet::ArrowWriterProperties`, e.g.
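(The example that followed "e.g." was cut off; a sketch reconstructed from the discussion above, with an illustrative file path:)

```r
library(arrow, warn.conflicts = FALSE)

df <- tibble::tibble(x = 1:5)
write_parquet(df, "/tmp/test.parquet",
  compression = "snappy",
  use_dictionary = TRUE,
  write_statistics = TRUE
)
```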