I am new to PySpark. I want to plot the result of a SQL query using matplotlib, but I am not sure which function to use. I searched for a way to convert the SQL result to pandas and then plot it.

Best Answer


I found a solution for this. I converted the SQL DataFrame to a pandas DataFrame, and then I was able to plot the graphs. Below is the sample code.

    from pyspark.sql import Row
    from pyspark.sql import HiveContext
    import pyspark
    from IPython.display import display
    import matplotlib
    import matplotlib.pyplot as plt
    # IPython/Jupyter magic: render plots inline in the notebook
    %matplotlib inline

    sc = pyspark.SparkContext()
    sqlContext = HiveContext(sc)

    # Build a small DataFrame from a list of (id, name) tuples
    test_list = [(1, 'hasan'), (2, 'nana'), (3, 'dad'), (4, 'mon')]
    rdd = sc.parallelize(test_list)
    people = rdd.map(lambda x: Row(id=int(x[0]), name=x[1]))
    schemaPeople = sqlContext.createDataFrame(people)

    # Register it as a temp table
    sqlContext.registerDataFrameAsTable(schemaPeople, "test_table")

    # Run the query, convert the result to pandas, and plot it
    df1 = sqlContext.sql("Select * from test_table")
    pdf1 = df1.toPandas()
    pdf1.plot(kind='barh', x='name', y='id', colormap='winter_r')

For small data, you can use .select() and .collect() on the PySpark DataFrame. collect() returns a Python list of pyspark.sql.types.Row objects, which can be indexed. From there you can plot with matplotlib directly, without pandas. That said, converting to a pandas DataFrame with df.toPandas() is probably easier.
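
As a rough sketch of the collect() route, something like the following might work. The helper names rows_to_columns and plot_rows are made up for illustration and are not part of PySpark or matplotlib. It assumes each collected row can be indexed by field name, which Row objects allow, and that the data is small enough to fit on the driver.

```python
def rows_to_columns(rows, label_field, value_field):
    """Split a list of rows into two parallel lists.

    Each row only needs to support row[field_name]; pyspark Row objects do,
    and so do plain dicts. (Hypothetical helper, not a PySpark API.)
    """
    labels = [row[label_field] for row in rows]
    values = [row[value_field] for row in rows]
    return labels, values


def plot_rows(rows, label_field="name", value_field="id"):
    """Draw a horizontal bar chart from collected rows, without pandas."""
    # Imported here so rows_to_columns can be used without matplotlib installed.
    import matplotlib.pyplot as plt

    labels, values = rows_to_columns(rows, label_field, value_field)
    fig, ax = plt.subplots()
    ax.barh(labels, values)
    ax.set_xlabel(value_field)
    ax.set_ylabel(label_field)
    return ax


# Usage with the df1 DataFrame from the answer above (needs a running Spark context):
# rows = df1.select("name", "id").collect()
# plot_rows(rows)
```

Keeping the row-to-list step separate from the plotting step makes it possible to check the extracted values before drawing anything.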