Hello Spark Cluster

Get YARN Application info


In [1]:
%%info


Current session configs: {u'kind': 'pyspark3', u'name': u'remotesparkmagics'}
No active sessions.

Upload dataset

Upload COVID spread dataset to Spark RDD


In [2]:
import pandas as pd
from pyspark import SparkFiles


Starting Spark application
IDYARN Application IDKindStateSpark UIDriver logCurrent session?
8application_1587653246481_0012pyspark3idleLinkLink
SparkSession available as 'spark'.

In [3]:
# load from URL
url = 'https://raw.githubusercontent.com/CSSEGISandData/COVID-19/web-data/data/cases_country.csv'
spark.sparkContext.addFile(url)

# store to cases country RDD
cases_country = spark.read.csv("file://"+SparkFiles.get('cases_country.csv'), header=True, inferSchema=True)
cases_country.printSchema()


root
 |-- Country_Region: string (nullable = true)
 |-- Last_Update: timestamp (nullable = true)
 |-- Lat: double (nullable = true)
 |-- Long_: double (nullable = true)
 |-- Confirmed: integer (nullable = true)
 |-- Deaths: integer (nullable = true)
 |-- Recovered: integer (nullable = true)
 |-- Active: integer (nullable = true)
 |-- Incident_Rate: double (nullable = true)
 |-- People_Tested: string (nullable = true)
 |-- People_Hospitalized: string (nullable = true)
 |-- Mortality_Rate: double (nullable = true)
 |-- UID: integer (nullable = true)
 |-- ISO3: string (nullable = true)

Discover data


In [5]:
cases_country.select('ISO3', 'Country_Region').distinct().show()


+----+------------------+
|ISO3|    Country_Region|
+----+------------------+
| GNQ| Equatorial Guinea|
| LBY|             Libya|
| HTI|             Haiti|
| AUT|           Austria|
| ECU|           Ecuador|
| JAM|           Jamaica|
| RWA|            Rwanda|
| UZB|        Uzbekistan|
| DOM|Dominican Republic|
| MEX|            Mexico|
| LBR|           Liberia|
| COL|          Colombia|
| YEM|             Yemen|
| GMB|            Gambia|
| ZWE|          Zimbabwe|
| SVN|          Slovenia|
| SUR|          Suriname|
| SRB|            Serbia|
| BHS|           Bahamas|
| SVK|          Slovakia|
+----+------------------+
only showing top 20 rows

In [7]:
cases_country.select(['ISO3', 'Country_Region', 'Confirmed', 'Recovered', 'Deaths', 'Active', 'Last_Update']).orderBy(cases_country.Active.desc()).show(20)


+----+--------------+---------+---------+------+------+-------------------+
|ISO3|Country_Region|Confirmed|Recovered|Deaths|Active|        Last_Update|
+----+--------------+---------+---------+------+------+-------------------+
| USA|            US|   869172|    80934| 49963|738275|2020-04-24 12:31:02|
| GBR|United Kingdom|   139246|      712| 18791|119743|2020-04-24 12:30:33|
| ITA|         Italy|   189973|    57576| 25549|106848|2020-04-24 12:30:33|
| ESP|         Spain|   219764|    92355| 22524|104885|2020-04-24 12:30:33|
| FRA|        France|   159495|    42792| 21889| 94814|2020-04-24 12:30:33|
| TUR|        Turkey|   101790|    18491|  2491| 80808|2020-04-24 12:30:33|
| RUS|        Russia|    68622|     5568|   615| 62439|2020-04-24 12:30:33|
| DEU|       Germany|   153393|   106800|  5575| 41018|2020-04-24 12:30:33|
| NLD|   Netherlands|    36727|      101|  4304| 32322|2020-04-24 12:30:33|
| BEL|       Belgium|    44293|    10122|  6679| 27492|2020-04-24 12:30:33|
| CAN|        Canada|    43286|    14761|  2241| 26284|2020-04-24 12:31:17|
| PRT|      Portugal|    22353|     1201|   820| 20332|2020-04-24 12:30:33|
| BRA|        Brazil|    50230|    26573|  3343| 20314|2020-04-24 12:30:33|
| IND|         India|    23502|     5012|   722| 17768|2020-04-24 12:30:33|
| IRN|          Iran|    88194|    66599|  5574| 16021|2020-04-24 12:30:33|
| SWE|        Sweden|    17567|      550|  2152| 14865|2020-04-24 12:30:33|
| PER|          Peru|    20914|     7422|   572| 12920|2020-04-24 12:30:33|
| SAU|  Saudi Arabia|    13930|     1925|   121| 11884|2020-04-24 12:30:33|
| SGP|     Singapore|    12075|      924|    12| 11139|2020-04-24 12:30:33|
| JPN|         Japan|    12368|     1494|   328| 10546|2020-04-24 12:30:33|
+----+--------------+---------+---------+------+------+-------------------+
only showing top 20 rows

Store results to HIVE table


In [9]:
cases_country.createOrReplaceTempView('CasesCountryTable')

In [10]:
%%sql
select * from CasesCountryTable



In [11]:
%%sql
select * from CasesCountryTable where ISO3 == 'USA' or ISO3 == 'GBR' or ISO3 == 'ITA' or ISO3 == 'ESP' or ISO3 == 'FRA' or ISO3 = 'RUS'



In [ ]: