I have the following two datasets in Spark SQL:
person
view:
person = spark.createDataFrame([
(0, "Bill Chambers", 0, [100]),
(1, "Matei Zaharia", 1, [500, 250, 100]),
(2, "Michael Armbrust", 1, [250, 100])])\
.toDF("id", "name", "graduate_program", "spark_status")
person.createOrReplaceTempView("person")
person.show(truncate=False)
+---+----------------+----------------+---------------+
|id |name |graduate_program|spark_status |
+---+----------------+----------------+---------------+
|0 |Bill Chambers |0 |[100] |
|1 |Matei Zaharia |1 |[500, 250, 100]|
|2 |Michael Armbrust|1 |[250, 100] |
+---+----------------+----------------+---------------+
graduateProgram
view:
graduateProgram = spark.createDataFrame([
(0, "Masters", "School of Information", "UC Berkeley"),
(2, "Masters", "EECS", "UC Berkeley"),
(1, "Ph.D.", "EECS", "UC Berkeley")])\
.toDF("id", "degree", "department", "school")
graduateProgram.createOrReplaceTempView("graduateProgram")
graduateProgram.show(truncate=False)
+---+-------+---------------------+-----------+
|id |degree |department |school |
+---+-------+---------------------+-----------+
|0 |Masters|School of Information|UC Berkeley|
|2 |Masters|EECS |UC Berkeley|
|1 |Ph.D. |EECS |UC Berkeley|
+---+-------+---------------------+-----------+
Now when I run the query:
SELECT *
FROM person
WHERE graduate_program IN (SELECT graduate_program FROM graduateProgram);
the query is working, even though there is no column named graduate_program
in the graduateProgram
view — it only has an id
column.
My question is why is this query working because as there is no graduate_program
in the graduateProgram
. It should fail saying that column does not exist.
If i do this:
SELECT graduate_program FROM graduateProgram;
I get an error saying
[UNRESOLVED_COLUMN.WITH_SUGGESTION] A column, variable, or function parameter with name `graduate_program` cannot be resolved. Did you mean one of the following? [`degree`, `department`, `id`, `school`]. SQLSTATE: 42703
Then how is the query above is working? I am running cluster with 15.4 LTS (includes Apache Spark 3.5.0, Scala 2.12)
.