feat: add clickhouse-bench with auto-downloaded ClickHouse binary#6736
feat: add clickhouse-bench with auto-downloaded ClickHouse binary#6736fastio wants to merge 8 commits intovortex-data:developfrom
Conversation
There was a problem hiding this comment.
why do we need this file is it difference to the already included one?
There was a problem hiding this comment.
Good catch! I have removed the duplicate clickbench_clickhouse_queries.sql and validated with cargo check -p vortex-bench.
myrrc
left a comment
There was a problem hiding this comment.
I don't think downloading untrusted binaries from internet via a build script is a good idea. We want first-class integration with duckdb thus we need to download its sources (although I'd not do it in build script as well), but we don't need such integration with Clickhouse yet.
My idea is to use clickhouse binary in CI (as it runs on Linux only) and require users to download it by hand if they want a local run. Benchmarking on MacOS doesn't make much sense anyway as vectorized instrustion set is different.
Agreed — removed the binary download from build.rs entirely. The clickhouse binary is now resolved at runtime: via CLICKHOUSE_BINARY env var or from $PATH. CI installs it via the official installer before building. Local users need to install it manually. No more untrusted binary downloads in the build script. |
.github/workflows/sql-benchmarks.yml
Outdated
| - name: Install ClickHouse | ||
| if: contains(matrix.targets, 'clickhouse:') | ||
| run: | | ||
| curl https://clickhouse.com/ | sh |
There was a problem hiding this comment.
Why not download the latest release file for our architecture from Github releases? We then don't need any installation and curl in general.
There was a problem hiding this comment.
Good call — updated CI to download the static binary directly from GitHub Releases (pined ClickHouse to LTS release v25.8.18.1 from GitHub Releases), no curl | sh or installation needed.
| return query.to_string(); | ||
| } | ||
|
|
||
| strip_simple_identifier_quotes(query) |
There was a problem hiding this comment.
Clickhouse does handle quoted identifiers correctly so I think we can pass them through to reduce this PR's diff.
|
The changes look good to me conceptually, let's see what the CI run says. |
82c11d1 to
60aa2d2
Compare
Introduce a new clickhouse-bench benchmark crate that runs ClickBench queries against Parquet data via clickhouse-local, providing a baseline for comparing Vortex performance against ClickHouse. Key design decisions: - build.rs auto-downloads the full ClickHouse binary (with Parquet support) into target/clickhouse-local/, similar to how vortex-duckdb downloads the DuckDB library. This eliminates manual install steps and avoids issues with slim/homebrew builds lacking Parquet support. - The binary path is baked in via CLICKHOUSE_BINARY env at compile time; CLICKHOUSE_LOCAL env var allows runtime override. - ClickHouse-dialect SQL queries are maintained in a separate clickbench_clickhouse_queries.sql file (43 queries). - CI workflows updated to include clickhouse:parquet target in ClickBench benchmarks and conditionally build clickhouse-bench. Signed-off-by: fastio <pengjian.uestc@gmail.com>
…dling Signed-off-by: fastio <pengjian.uestc@gmail.com>
…use from PATH - Remove reqwest-based binary download from build.rs - Resolve clickhouse binary via CLICKHOUSE_BINARY env var or $PATH at runtime - Add CI step to install clickhouse before building when needed - Fail with clear error message if binary is not found locally Signed-off-by: fastio <pengjian.uestc@gmail.com>
- Pass subcommand arg to clickhouse-bench in run-sql-bench.sh for consistency - Use BenchmarkArg + create_benchmark() in main.rs like other engines - Replace `which` with `clickhouse local --version` for binary verification - Pin ClickHouse to LTS release v25.8.18.1 from GitHub Releases Signed-off-by: fastio <pengjian.uestc@gmail.com>
…identifier handling. Queries are now returned as-is without dialect-specific transformation. Signed-off-by: fastio <pengjian.uestc@gmail.com>
Signed-off-by: fastio <pengjian.uestc@gmail.com>
60aa2d2 to
5aa201a
Compare
Signed-off-by: Peng Jian <pengjian.uestc@gmail.com>
Merging this PR will degrade performance by 20.9%
Performance Changes
Comparing Footnotes
|
|
@fastio Feel free to ping us in the public slack channel if you want us to run CI for you! (Feel free to ping me here as well) Edit: To fix the CI issues right now, could you update the lockfile with |
Signed-off-by: fastio <pengjian.uestc@gmail.com>
@connortsui20 Thanks! I've updated the lockfile by running cargo check. The Cargo.lock should now be in sync. Let me know if CI still has issues after this. |
|
@fastio you can click on the github action that is failing (in this case it is the "CI / Rust (list) (pull_request)" action) and see the problem. Let me know if you need help with this! |
0ax1
left a comment
There was a problem hiding this comment.
It's unclear yet, whether we actually want to maintain this integration. Will loop back with the team.
connortsui20
left a comment
There was a problem hiding this comment.
Before addressing some of these comments, can you fix the clippy lint? We should actually run these benchmarks to see if it even works.
| let time_instant = Instant::now(); | ||
|
|
||
| // The `clickhouse` binary is a multi-tool; invoke it as `clickhouse local`. | ||
| let mut child = Command::new(&self.binary) | ||
| .args(["local", "--format", "TabSeparated"]) | ||
| .stdin(Stdio::piped()) | ||
| .stdout(Stdio::piped()) | ||
| .stderr(Stdio::piped()) | ||
| .spawn() | ||
| .context("Failed to spawn clickhouse-local")?; | ||
|
|
||
| // Write SQL to stdin | ||
| { | ||
| let stdin = child | ||
| .stdin | ||
| .as_mut() | ||
| .context("Failed to open clickhouse-local stdin")?; | ||
| stdin | ||
| .write_all(full_sql.as_bytes()) | ||
| .context("Failed to write SQL to clickhouse-local stdin")?; | ||
| } |
There was a problem hiding this comment.
I don't think this is right. If you look at the other benchmark engine setup, we do not open an entire new process and write to stdin, which means this engine is going to have wildly different characteristics to the rest of the engines.
I also don't know of a good way to solve this since I have not worked much with Clickhouse before. Do you have any ideas how to make this more consistent with the other engines?
There was a problem hiding this comment.
No but we essentially do this. DuckDB we close and re-open the database. DataFusion we start a new session. etc.
There was a problem hiding this comment.
I don't believe this is true (or not true anymore?). In our benchmarks/datafusion-bench/src/main.rs and benchmarks/duckdb-bench/src/lib.rs we create the connections and only time the query
| // Build the full SQL: setup views + the actual query | ||
| let mut full_sql = String::new(); | ||
| for stmt in &self.setup_sql { | ||
| full_sql.push_str(stmt); | ||
| full_sql.push('\n'); | ||
| } |
There was a problem hiding this comment.
This also seems strange to me, if this is setup work then why are we timing it in our measurement?
There was a problem hiding this comment.
The setup SQL (CREATE VIEW statements) is indeed being timed along with the actual query. I'll separate the setup phase from the measurement phase to be consistent with how DuckDB and DataFusion
handle this.
| # ClickHouse-bench only runs for local benchmarks (clickhouse-local reads local files). | ||
| if ! $is_remote && [[ "$has_clickhouse" == "true" ]] && [[ -f "target/release_debug/clickhouse-bench" ]]; then |
There was a problem hiding this comment.
If you look above at the lance setup code, we have a better guard against benching something on remote that is not supposed to be, can you mimic that?
| [dependencies] | ||
| anyhow = { workspace = true } | ||
| clap = { workspace = true, features = ["derive"] } | ||
| tokio = { workspace = true, features = ["full"] } |
There was a problem hiding this comment.
I doubt you need "full" features here
| fn queries_file_path(&self) -> PathBuf { | ||
| if let Some(file) = &self.queries_file { | ||
| return file.into(); | ||
| } | ||
| let manifest_dir = PathBuf::from(env!("CARGO_MANIFEST_DIR")); | ||
| manifest_dir.join("clickbench_queries.sql") | ||
| } |
There was a problem hiding this comment.
This refactoring seems somewhat unnecessary
Hi @0ax1, this PR is Phase 1 of the plan discussed in #6425 with @gatesn. The idea was to first integrate ClickHouse into the ClickBench benchmarking framework As @gatesn put it: "I think we should also look to at Clickbench to the benchmarking framework first, even without support for Vortex. That will give us some sanity checks on the PR that we are doing things This PR intentionally has a narrow scope — it only adds ClickHouse as a benchmark engine running against Parquet via clickhouse-local. The maintenance burden is minimal: one benchmark crate with no custom |
|
Just to confirm, I'm happy to go ahead and merge this. Sounds like there's some open questions about where to put the benchmark timings. Also it's worth making sure the caching behavior is consistent with our other benchmarks. We haven't actually documented precisely what this is yet, but typically means no caching at all, even things like footers. |
Introduce a new clickhouse-bench benchmark crate that runs ClickBench queries against Parquet data via clickhouse-local, providing a baseline for comparing Vortex performance against ClickHouse.
Key design decisions:
#6425