Conversation
Merging this PR will not alter performance
Benchmarks: TPC-H SF=1 on NVME
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.025x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.024x ➖, 0↑ 0↓)
datafusion / parquet (1.023x ➖, 0↑ 1↓)
datafusion / arrow (1.032x ➖, 0↑ 3↓)
duckdb / vortex-file-compressed (1.021x ➖, 0↑ 1↓)
duckdb / vortex-compact (1.024x ➖, 0↑ 0↓)
duckdb / parquet (1.033x ➖, 2↑ 5↓)
duckdb / duckdb (1.032x ➖, 0↑ 1↓)
Full attributed analysis
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.881x ➖, 1↑ 0↓)
datafusion / vortex-compact (1.001x ➖, 2↑ 2↓)
datafusion / parquet (0.974x ➖, 0↑ 2↓)
duckdb / vortex-file-compressed (0.996x ➖, 1↑ 0↓)
duckdb / vortex-compact (1.065x ➖, 0↑ 2↓)
duckdb / parquet (0.969x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: TPC-DS SF=1 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.954x ➖, 7↑ 0↓)
datafusion / vortex-compact (0.963x ➖, 1↑ 1↓)
datafusion / parquet (0.964x ➖, 4↑ 1↓)
duckdb / vortex-file-compressed (0.953x ➖, 10↑ 1↓)
duckdb / vortex-compact (0.958x ➖, 5↑ 2↓)
duckdb / parquet (0.973x ➖, 2↑ 1↓)
duckdb / duckdb (0.957x ➖, 8↑ 0↓)
Full attributed analysis
Benchmarks: Clickbench on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.017x ➖, 0↑ 1↓)
datafusion / parquet (1.017x ➖, 0↑ 2↓)
duckdb / vortex-file-compressed (0.978x ➖, 4↑ 4↓)
duckdb / parquet (1.002x ➖, 0↑ 0↓)
duckdb / duckdb (1.029x ➖, 0↑ 2↓)
Full attributed analysis
Polar Signals Profiling Results
Latest Run
Previous Runs (2)
Powered by Polar Signals Cloud
Benchmarks: PolarSignals Profiling
Vortex (geomean): 1.007x ➖
datafusion / vortex-file-compressed (1.007x ➖, 0↑ 0↓)
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.026x ➖, 0↑ 2↓)
datafusion / vortex-compact (1.026x ➖, 0↑ 1↓)
datafusion / parquet (1.001x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.998x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.011x ➖, 0↑ 0↓)
duckdb / parquet (0.996x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (0.977x ➖, 1↑ 0↓)
duckdb / vortex-compact (0.982x ➖, 1↑ 0↓)
duckdb / parquet (1.002x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: TPC-H SF=10 on NVME
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (1.006x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.006x ➖, 0↑ 0↓)
datafusion / parquet (1.012x ➖, 0↑ 0↓)
datafusion / arrow (1.006x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (1.008x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.008x ➖, 0↑ 0↓)
duckdb / parquet (1.018x ➖, 0↑ 2↓)
duckdb / duckdb (1.008x ➖, 0↑ 0↓)
Full attributed analysis
Signed-off-by: Connor Tsui <connor.tsui20@gmail.com>
eeaea6d to 355df2d (Compare)
Benchmarks: FineWeb S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (1.100x ➖, 0↑ 1↓)
datafusion / vortex-compact (0.848x ➖, 3↑ 0↓)
datafusion / parquet (0.974x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (1.018x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.972x ➖, 0↑ 0↓)
duckdb / parquet (0.996x ➖, 0↑ 0↓)
Full attributed analysis
// actually compressing data.
let mut codes_excludes = vec![IntCode::Dict, IntCode::Sequence];
codes_excludes.extend_from_slice(excludes);
what goes wrong with sequence array?
by definition we want duplicates in the dictionary codes, which means the codes can never form a sequence array
that makes sense, is that documented?
the comment right above?
Note that I'm working on something that will make this kind of reasoning a lot more obvious, if you're interested: #7018
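To make the reasoning in this thread concrete, here is a minimal standalone sketch (plain `Vec`s, not vortex's actual `DictArray`/`SequenceArray` types; `dict_encode` and `is_sequence` are hypothetical helpers): dictionary encoding only pays off when values repeat, so the codes buffer contains duplicates and therefore can never be a strictly consecutive sequence.

```rust
/// Naive dictionary encoding: returns (unique values, codes into them).
/// O(n * u) lookup is fine for a demonstration.
fn dict_encode(values: &[i64]) -> (Vec<i64>, Vec<u16>) {
    let mut dict = Vec::new();
    let mut codes = Vec::new();
    for &v in values {
        let code = match dict.iter().position(|&d| d == v) {
            Some(i) => i,
            None => {
                dict.push(v);
                dict.len() - 1
            }
        };
        codes.push(code as u16);
    }
    (dict, codes)
}

/// A sequence array represents start, start+1, start+2, ...
fn is_sequence(codes: &[u16]) -> bool {
    codes.windows(2).all(|w| w[1] == w[0] + 1)
}

fn main() {
    // Repeated values produce repeated codes, so the codes buffer
    // with a duplicate in it cannot be a strictly increasing sequence.
    let (dict, codes) = dict_encode(&[10, 20, 10, 30, 20]);
    assert_eq!(dict, vec![10, 20, 30]);
    assert_eq!(codes, vec![0, 1, 0, 2, 1]);
    assert!(!is_sequence(&codes));
    println!("codes = {:?}, sequence? {}", codes, is_sequence(&codes));
}
```

If every value were unique the codes would be exactly 0, 1, 2, ... (a sequence), but then dictionary encoding would be pointless, which is the point made above.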
Benchmarks: Random Access
Vortex (geomean): 0.894x ✅
unknown / unknown (0.980x ➖, 7↑ 1↓)
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.992x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.080x ➖, 1↑ 4↓)
datafusion / parquet (0.989x ➖, 0↑ 2↓)
duckdb / vortex-file-compressed (1.117x ➖, 0↑ 4↓)
duckdb / vortex-compact (1.083x ➖, 0↑ 1↓)
duckdb / parquet (1.063x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: Compression
Vortex (geomean): 1.014x ➖
unknown / unknown (1.022x ➖, 0↑ 11↓)
whoops, just meant to make this a draft
I think you're unlikely to find practical improvements here, since the writer already limits dictionaries to 65k unique values, i.e. the codes fit in a u16. I'm not sure what the formula for the best-case improvement is, but it likely won't be a lot.
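As a rough sanity check on that intuition (a back-of-envelope sketch, not vortex's actual codepath; `best_case_ratio` is a hypothetical helper): if the codes are stored as u16 and the dictionary holds `n_unique` values, bit-packing the codes down to ceil(log2(n_unique)) bits bounds the best-case shrinkage of the codes buffer at 16 / bits, ignoring any further compression of the dictionary values themselves.

```rust
/// Upper bound on how much smaller u16 dictionary codes can get if
/// bit-packed to the minimum width needed to address n_unique entries.
fn best_case_ratio(n_unique: u32) -> f64 {
    // ceil(log2(n_unique)) computed as the bit width of n_unique - 1;
    // clamp to at least 1 bit.
    let bits = 32 - (n_unique.max(2) - 1).leading_zeros();
    16.0 / bits as f64
}

fn main() {
    // Even a tiny 256-entry dictionary only halves the codes buffer,
    // and at the 65k cap there is no headroom at all.
    for n in [256u32, 4_096, 65_536] {
        println!("{} unique values -> up to {:.2}x smaller codes", n, best_case_ratio(n));
    }
}
```

Under these assumptions the best case is 2x (256 unique values, 8-bit codes) falling to 1x at the 65k cap, which is consistent with "it likely won't be a lot".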
Summary
Compress dictionary-encoded integer array values.
Note that we already do this for dictionary-encoded float array values.
Testing
N/A