docs: Add instructions on running TPC-H on macOS #1647

Merged
andygrove merged 5 commits into apache:main from andygrove:tpch-macos
Apr 28, 2025
@andygrove (Member) commented Apr 14, 2025

Which issue does this PR close?

Closes #1648 (sort of: this PR will explain why the benchmark results are unstable on macOS)

Rationale for this change

What changes are included in this PR?

How are these changes tested?


# Comet Benchmarking on macOS

This guide is for setting up TPC-H benchmarks locally on macOS using the 100 GB dataset.

--conf spark.memory.offHeap.enabled=true \
--conf spark.memory.offHeap.size=16g \
--conf spark.eventLog.enabled=true \
/path/to/datafusion-benchmarks/tpcbench.py \
Member commented:

The script path seems to be `datafusion-benchmarks/runners/datafusion-comet/tpcbench.py`.
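
The two candidate locations mentioned in this thread can be checked with a small helper. This is only a sketch: `find_tpcbench` is a hypothetical name, and the default checkout directory (`$HOME/datafusion-benchmarks`, overridable through an assumed `DATAFUSION_BENCHMARKS` variable) is a guess about where the repository was cloned.

```shell
# Hypothetical helper: print the first tpcbench.py found under a
# datafusion-benchmarks checkout, trying the path suggested in review first.
find_tpcbench() {
  base="$1"
  for candidate in \
    "$base/runners/datafusion-comet/tpcbench.py" \
    "$base/tpcbench.py"; do
    if [ -f "$candidate" ]; then
      echo "$candidate"
      return 0
    fi
  done
  echo "tpcbench.py not found under $base" >&2
  return 1
}

# Assumed checkout location; adjust to your environment.
find_tpcbench "${DATAFUSION_BENCHMARKS:-$HOME/datafusion-benchmarks}" || true
```

The printed path can then replace `/path/to/datafusion-benchmarks/tpcbench.py` in the `spark-submit` command from the guide.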

@andygrove changed the title from "[WIP] docs: Add instructions on running TPC-H on macOS" to "docs: Add instructions on running TPC-H on macOS" on Apr 22, 2025
@andygrove marked this pull request as ready for review on April 22, 2025 16:42
Install Spark

```shell
wget https://archive.apache.org/dist/spark/spark-3.5.4/spark-3.5.4-bin-hadoop3.tgz
```

Contributor commented:

Should we use 3.5.5?
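
One way to make the version easy to switch is to build the download URL from a variable. This is a sketch, not part of the PR: `SPARK_VERSION` and `spark_tarball_url` are hypothetical names, and it assumes the archive uses the same URL pattern for other 3.5.x releases as for the 3.5.4 link in the guide.

```shell
# Assumption: the archive URL follows the same pattern for every 3.5.x release.
SPARK_VERSION="${SPARK_VERSION:-3.5.4}"   # set to 3.5.5 to try the newer release

# Hypothetical helper: build the tarball URL for a given Spark version.
spark_tarball_url() {
  version="$1"
  echo "https://archive.apache.org/dist/spark/spark-${version}/spark-${version}-bin-hadoop3.tgz"
}

# Print the URL rather than downloading, so the sketch has no side effects.
spark_tarball_url "$SPARK_VERSION"
# To download: wget "$(spark_tarball_url "$SPARK_VERSION")"
```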

Set `SPARK_MASTER` env var (host name will need to be edited):

```shell
export SPARK_MASTER=spark://Rustys-MacBook-Pro.local:7077
```

Contributor commented:

Could this be localhost?

@andygrove (Member, Author) commented:

Spark master does not bind to localhost by default. We could specify `--host localhost` when starting the master process, but I have not tried this on macOS yet. I will test this and update this PR.
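
The two options discussed here could look roughly like the sketch below. `spark_master_url` and the `BIND_LOCALHOST` switch are hypothetical names introduced for illustration, `SPARK_HOME` is assumed to point at the unpacked Spark distribution, and the `--host localhost` variant is untested on macOS, as noted in the comment above. Option 1 also assumes that `hostname` prints the name the master actually bound to.

```shell
# Hypothetical helper: build a standalone master URL from a host and port.
spark_master_url() {
  host="$1"
  port="${2:-7077}"   # 7077 is the port used in the guide's example
  echo "spark://${host}:${port}"
}

# Option 1: use this machine's host name instead of a hard-coded one.
SPARK_MASTER="$(spark_master_url "$(hostname)")"

# Option 2 (untested on macOS): bind the master to localhost explicitly,
# then point SPARK_MASTER at it. Only runs when BIND_LOCALHOST=1 is set.
if [ "${BIND_LOCALHOST:-0}" = "1" ]; then
  "$SPARK_HOME/sbin/start-master.sh" --host localhost
  SPARK_MASTER="$(spark_master_url localhost)"
fi

export SPARK_MASTER
echo "$SPARK_MASTER"
```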

@mbutrovich (Contributor) commented Apr 25, 2025:

FWIW, this is what I currently use to start Spark for benchmarking TPC-H on macOS:

```shell
export SPARK_HOME=/opt/spark-3.5.5-bin-hadoop3
export SPARK_MASTER="local[*]"
export SPARK_MASTER_HOST="127.0.0.1"
export SPARK_LOCAL_IP="127.0.0.1"

$SPARK_HOME/sbin/start-master.sh
$SPARK_HOME/sbin/start-worker.sh $SPARK_MASTER_HOST:7077
$SPARK_HOME/sbin/start-history-server.sh
```
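
Since the guide enables `spark.eventLog.enabled` and the commands above start the history server, the event log directory needs to exist beforehand. The sketch below assumes that `spark.eventLog.dir` and `spark.history.fs.logDirectory` are left at Spark's default of `/tmp/spark-events`; `EVENT_LOG_DIR` is a hypothetical variable used only for illustration.

```shell
# Assumption: spark.eventLog.dir and spark.history.fs.logDirectory are left
# at their default, /tmp/spark-events.
EVENT_LOG_DIR="${EVENT_LOG_DIR:-/tmp/spark-events}"

# Create the directory if it is missing so event logging has somewhere to write.
mkdir -p "$EVENT_LOG_DIR"
ls -ld "$EVENT_LOG_DIR"
```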

@andygrove (Member, Author) commented:

I have been using standalone mode rather than local mode, but local mode may make more sense for macOS.
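
The difference between the two modes comes down to the value passed as `--master`. The helper below is a hypothetical sketch (`spark_master_for` is not part of Spark or this PR), and the default host of `127.0.0.1` is taken from the environment shown earlier in the thread.

```shell
# Hypothetical helper: choose a --master value for spark-submit.
#   local       -> run everything in a single JVM using all available cores
#   standalone  -> connect to a master started with start-master.sh
spark_master_for() {
  mode="$1"
  host="${2:-127.0.0.1}"
  case "$mode" in
    local)      echo "local[*]" ;;
    standalone) echo "spark://${host}:7077" ;;
    *)          echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

SPARK_MASTER="$(spark_master_for local)"
export SPARK_MASTER
echo "$SPARK_MASTER"
```

In local mode there is no separate master or worker process to start; the value is passed directly, for example as `--master "$SPARK_MASTER"` on the `spark-submit` command line.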

@comphead (Contributor) commented:

Are we okay to merge this PR?

@andygrove (Member, Author) commented:

> are we okay to merge this PR?

Yes, I'll go ahead and merge, and we can follow up with changes.

@andygrove merged commit fd09a79 into apache:main on Apr 28, 2025
1 check passed
@andygrove deleted the tpch-macos branch on April 28, 2025 15:50
coderfender pushed a commit to coderfender/datafusion-comet that referenced this pull request Dec 13, 2025
Development

Successfully merging this pull request may close these issues.

Investigate unstable benchmark results on macOS

6 participants