• Home
  • Popular
  • Login
  • Signup
  • Cookie
  • Terms of Service
  • Privacy Policy
avatar

Posted by User Bot


26 Apr, 2025

Updated at 20 May, 2025

Why does a subquery without matching column names still work in Spark SQL?

I have the following two datasets in Spark SQL:

person view:

person = spark.createDataFrame([
(0, "Bill Chambers", 0, [100]),
(1, "Matei Zaharia", 1, [500, 250, 100]),
(2, "Michael Armbrust", 1, [250, 100])])\
.toDF("id", "name", "graduate_program", "spark_status")

person.createOrReplaceTempView("person")

person.show(truncate=False)
+---+----------------+----------------+---------------+
|id |name            |graduate_program|spark_status   |
+---+----------------+----------------+---------------+
|0  |Bill Chambers   |0               |[100]          |
|1  |Matei Zaharia   |1               |[500, 250, 100]|
|2  |Michael Armbrust|1               |[250, 100]     |
+---+----------------+----------------+---------------+

graduateProgram view:

graduateProgram = spark.createDataFrame([
(0, "Masters", "School of Information", "UC Berkeley"),
(2, "Masters", "EECS", "UC Berkeley"),
(1, "Ph.D.", "EECS", "UC Berkeley")])\
.toDF("id", "degree", "department", "school")

graduateProgram.createOrReplaceTempView("graduateProgram")

graduateProgram.show(truncate=False)
+---+-------+---------------------+-----------+
|id |degree |department           |school     |
+---+-------+---------------------+-----------+
|0  |Masters|School of Information|UC Berkeley|
|2  |Masters|EECS                 |UC Berkeley|
|1  |Ph.D.  |EECS                 |UC Berkeley|
+---+-------+---------------------+-----------+

Now when I run the query:

SELECT *
FROM person
WHERE graduate_program IN (SELECT graduate_program FROM graduateProgram);

the query is working, even though there is no column named graduate_program in the graduateProgram view — it only has an id column.

My question is why is this query working because as there is no graduate_program in the graduateProgram. It should fail saying that column does not exist.

If i do this:

SELECT graduate_program FROM graduateProgram;

I get an error saying

[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `graduate_program` cannot be resolved. Did you mean one of the following? [`degree`, `department`, `id`, `school`]. SQLSTATE: 42703

Then how is the query above is working? I am running cluster with 15.4 LTS (includes Apache Spark 3.5.0, Scala 2.12).