Spark magic

In [1]:
%load_ext sparkmagic.magics
In [2]:
%manage_spark
Starting Spark application
ID   YARN Application ID             Kind     State  Spark UI  Driver log  Current session?
134  application_1522938745830_0059  pyspark  idle   Link      Link
SparkSession available as 'spark'.
In [11]:
%spark info
Info for running Spark:
    Sessions:
        Name: s1        Session id: 134 YARN id: application_1522938745830_0059 Kind: pyspark   State: idle
        Spark UI: http://vcm-2168.oit.duke.edu:8088/proxy/application_1522938745830_0059/
        Driver Log: http://vcm-3544.oit.duke.edu:8042/node/containerlogs/container_e19_1522938745830_0059_01_000001/user06021
    Session configs:
        {'driverMemory': '2048M', 'executorCores': 2, 'proxyUser': 'user06021', 'conf': {'spark.master': 'yarn-client'}}

In [24]:
%%spark -o foo

foo = spark.read.parquet('foo.parquet')
foo.show(4)
+-------+--------+-------+-----+----+---+
|   name|semester|subject|score| sex|age|
+-------+--------+-------+-----+----+---+
|    bob|    fall|  stats|   92|male| 19|
|    bob|  summer|  stats|  100|male| 19|
|    bob|  spring|  stats|  100|male| 19|
|charles|  spring|  stats|   88|male| 22|
+-------+--------+-------+-----+----+---+
only showing top 4 rows

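Export data to a pandas DataFrame

The -o foo option on the %%spark cell above copies the Spark DataFrame foo into the local notebook session as a pandas DataFrame with the same name.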

In [25]:
foo
Out[25]:
      name semester subject  score     sex  age
0      bob     fall   stats   92.0    male   19
1      bob   summer   stats  100.0    male   19
2      bob   spring   stats  100.0    male   19
3  charles   spring   stats   88.0    male   22
4  charles     fall     bio  100.0    male   22
5      ann   spring    math   98.0  female   23
6      ann     fall     bio   50.0  female   23
7    daivd      NaN     NaN    NaN    male   23
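
Once the data is local, it can be handled with ordinary pandas calls in a regular (non-%%spark) cell. The following is a minimal sketch, not part of the original notebook: it rebuilds the eight rows shown above by hand so that it runs without a Spark session, and the names mean_score and missing are chosen here for illustration only.

```python
import numpy as np
import pandas as pd

# Rebuild the exported frame from the output above so the sketch is
# self-contained; in the notebook, foo already exists after the
# %%spark -o foo cell has run.
foo = pd.DataFrame({
    "name": ["bob", "bob", "bob", "charles", "charles", "ann", "ann", "daivd"],
    "semester": ["fall", "summer", "spring", "spring", "fall", "spring", "fall", np.nan],
    "subject": ["stats", "stats", "stats", "stats", "bio", "math", "bio", np.nan],
    "score": [92.0, 100.0, 100.0, 88.0, 100.0, 98.0, 50.0, np.nan],
    "sex": ["male", "male", "male", "male", "male", "female", "female", "male"],
    "age": [19, 19, 19, 22, 22, 23, 23, 23],
})

# Mean score per student. NaN scores are skipped, so a student with no
# recorded score gets NaN instead of causing an error.
mean_score = foo.groupby("name")["score"].mean()

# Rows that have no score recorded.
missing = foo[foo["score"].isna()]

print(mean_score)
print(missing)
```

Aggregating locally like this is convenient for small results. For large tables it is usually better to aggregate inside the %%spark cell and export only the summary.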