Jupyter Notebook
Apache Toree
sampleDataNetflix.tsv placed in local filesystem and path updated in 1) below
We will read in a TSV file and try to infer schema since it is not very complex data types we are using
In [104]:
val sessions = spark.read.option("header", "true")
.option("sep", "\t")
.option("inferSchema","true")
.csv("/Users/memo/Desktop/netflixSpark/sampleDataNetflix.tsv")
Out[104]:
In [11]:
sessions.printSchema
In [12]:
sessions.show(2)
In [13]:
sessions.registerTempTable("SESSIONS")
Using DISTINCT to show unique users
In [16]:
%%SQL select distinct user_id
from SESSIONS
where navigation_page = 'OurPlanetTitle'
Out[16]:
Showing the page visits just for validation, can be easily removed from the projection list in query
In [25]:
%%SQL select user_id, count(user_id) as page_visits
from SESSIONS
where navigation_page = 'OurPlanetTitle'
group by user_id
having page_visits == 1
Out[25]:
Making sure we filter for the same path using the timestamps and making sure it's all within the same session_id
In [101]:
%%SQL select distinct a.user_id
from sessions a,
sessions b,
sessions c,
sessions d
where a.user_id = b.user_id
and b.user_id = c.user_id
and c.user_id = d.user_id
and a.navigation_page = 'HomePage'
and b.navigation_page = 'OriginalsGenre'
and c.navigation_page = 'OurPlanetTitle'
and d.navigation_page = 'HomePage'
and a.timestamp < b.timestamp
and b.timestamp < c.timestamp
and c.timestamp < d.timestamp
and a.session_id = b.session_id
and b.session_id = c.session_id
and c.session_id = d.session_id
Out[101]:
The like operator is not the most performant but the SQL optimizer should be able to tell that my 2nd where clause can improve selectivity of this query. I am using the timestamp
column to make sure that a before landing on a Login page, the user first comes from a Title page
In [98]:
%%SQL select a.user_id
from sessions a,
sessions b
where a.user_id = b.user_id
and b.navigation_page = 'LogIn'
and a.navigation_page like '%Title'
and a.timestamp < b.timestamp
Out[98]:
We are using relation 'b' to get the total count of url
the user has visited
In [82]:
%%SQL select a.user_id
from sessions a,
(select user_id, count(url) as totalUrl from sessions group by user_id) b
where a.user_id = b.user_id
and a.navigation_page = 'OurPlanetTitle'
and b.totalurl = 1
Out[82]: