I have just started using Databricks/PySpark. I'm using Python with Spark 2.1. I have uploaded data to a table. This table is a single column full of strings. I wish to apply a mapping function to each element in the column. I load the table into a dataframe:
df = spark.table("mynewtable")
The only way I could find, from what others were saying, was to convert the dataframe to an RDD, apply the mapping function, and then convert back to a dataframe to show the data. But this throws a "Job aborted due to stage failure" error:
df2 = df.select("_c0").rdd.flatMap(lambda x: x.append("anything")).toDF()
All I want to do is apply some sort of map function to my data in the table. For example, append something to each string in the column, or perform a split on a character, and then put that back into a dataframe so I can .show() or display it.
Best Answer
You cannot:

- use flatMap, because it will flatten the Row.
- use append, because:
  - tuple and Row have no append method.
  - append (if present on a collection) is executed for its side effect and returns None (see the short demonstration after this list).
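A short demonstration of both points (plain Python/PySpark, with a made-up value that is not from the original post):

from pyspark.sql import Row

row = Row(_c0="abc")
# row.append("anything")  # AttributeError: 'Row' object has no attribute 'append'

values = ["abc"]
result = values.append("anything")  # append mutates the list in place...
print(result)  # ...and returns None, which is what the lambda would hand to flatMap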
I would use withColumn:
df.withColumn("foo", lit("anything"))
But map should work as well:
df.select("_c0").rdd.flatMap(lambda x: x + ("anything", )).toDF()
Edit (given the comment):
You probably want a udf:
from pyspark.sql.functions import udf

def iplookup(s):
    return ...  # Some lookup logic

iplookup_udf = udf(iplookup)

df.withColumn("foo", iplookup_udf("_c0"))
The default return type is StringType, so if you want something else you should adjust it.
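For example, if the lookup returned an integer, the return type would be declared explicitly (IntegerType here is only an illustration):

from pyspark.sql.types import IntegerType
from pyspark.sql.functions import udf

iplookup_udf = udf(iplookup, IntegerType())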