While chaining `when` calls in PySpark, I found myself wondering:
"Does the earlier `when` take precedence, just like in SQL?"
"Or, since it is a method chain, does the `when` written last overwrite the earlier ones?"
To find out, I wrote some verification code and checked.
df = spark.createDataFrame([(1,),(2,),(3,)], schema=('val',))
display(df)
| val | 
|---|
| 1 | 
| 2 | 
| 3 | 
# Register as a temporary view so it can be queried from Spark SQL
# (registerTempTable is deprecated since Spark 2.0; createOrReplaceTempView is the modern equivalent)
df.createOrReplaceTempView('tmp')
SELECT
  val,
  CASE
    WHEN val <= 1 THEN 'label_1'
    WHEN val <= 2 THEN 'label_2'
    ELSE 'label_3'
  END AS label
FROM tmp
| val | label | 
|---|---|
| 1 | label_1 | 
| 2 | label_2 | 
| 3 | label_3 | 
In SQL, as expected, the WHEN condition written first takes precedence: a row is labeled by the first condition it matches.
from pyspark.sql import functions as F
df_label = df.withColumn('label',
    F.when(F.col('val') <= 1, 'label_1')
     .when(F.col('val') <= 2, 'label_2')
     .otherwise('label_3')
)
display(df_label)
| val | label | 
|---|---|
| 1 | label_1 | 
| 2 | label_2 | 
| 3 | label_3 | 
So even when `when` is chained in PySpark, the condition written earlier takes priority, just as in Spark SQL's CASE WHEN.