Skip to content

Conversation

@Kimahriman
Copy link
Contributor

What changes were proposed in this pull request?

Update all array set-like expressions to use hashing for all data types. Currently complex types and certain string collations fallback to a nested comparison loop that is O(n^2) for the size of the array, which can get really expensive for larger arrays. Instead, this borrows the idea from InternalRowComparableWrapper and creates a GenericComparableWrapper that works for any data type, using Murmur3HashFunction for the hash code and InterpretedOrdering for equals.

Additionally, since there is just one code path for these expressions now, I moved the eval logic directly into the nullSafeEval function instead of keeping the transient function call.

Why are the changes needed?

To improve performance of array set-like operations for complex types.

Does this PR introduce any user-facing change?

No, just performance improvement.

How was this patch tested?

Since there's no behavior change, existing tests should suffice, but I can add any additional ones if needed.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions github-actions bot added the SQL label Dec 13, 2025
@Kimahriman Kimahriman changed the title [SPARK-54698] Support hashing for all data types for array set like operations [SPARK-54698][SQL] Support hashing for all data types for array set like operations Dec 13, 2025
@Kimahriman
Copy link
Contributor Author

Kimahriman commented Dec 13, 2025

I created a simple benchmark to test:

object ArraySetLikeBenchmark extends SqlBasedBenchmark {
  private val N = 1000L
  private val arrayElements = 100000

  override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
    val benchmark = new Benchmark(s"Array Set Like", N, output = output)

    val arr = (1 to arrayElements).map(x => Array(x, x)).toArray
    benchmark.addCase("array_union", 10) { _ =>
      spark.range(N)
        .select(array_union(lit(arr), lit(arr)).alias("arr"))
        .write
        .format("noop")
        .mode("append")
        .save()
    }
    benchmark.run()
  }
}

Before:

[info] Array Set Like:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] array_union                                       49269          52834        7220          0.0    49268846.2       1.0X

After:

[info] Array Set Like:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] array_union                                        1779           2609         562          0.0     1779167.5       1.0X

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant