Apache Impala and SQL magic for IPython/Jupyter notebooks

with Kerberos authentication

1. Load SQL magic extension (uses ipython-sql by Catherine Devlin)


In [1]:
%load_ext sql

2. Connect to the target database

  • ### requires Cloudera impyla package and thrift_sasl
  • ### edit the value of config_args as relevant for your environment

In [2]:
%config SqlMagic.connect_args="{'kerberos_service_name':'impala', 'auth_mechanism':'GSSAPI'}"
%sql impala://impalasrv-prod:21050/test2


Out[2]:
u'Connected: None@test2'

3. Run SQL on the target using the %%sql cell magic or %sql line magic


In [3]:
%%sql 
select * from emp


Done.
Out[3]:
empno ename job mgr hiredate sal comm deptno
7521 WARD SALESMAN 7698 1981-02-22 00:00:00 1250 500 30
7902 FORD ANALYST 7566 1981-12-03 00:00:00 3000 None 20
7900 JAMES CLERK 7698 1981-12-03 00:00:00 950 None 30
7782 CLARK MANAGER 7839 1981-06-09 00:00:00 2450 None 10
7566 JONES MANAGER 7839 1981-04-02 00:00:00 2975 None 20
7698 BLAKE MANAGER 7839 1981-05-01 00:00:00 2850 None 30
7839 KING PRESIDENT None 1981-11-17 00:00:00 5000 None 10
7788 SCOTT ANALYST 7566 1987-04-19 00:00:00 3000 None 20
7369 SMITH CLERK 7902 1980-12-17 00:00:00 800 None 20
7654 MARTIN SALESMAN 7698 1981-09-28 00:00:00 1250 1400 30
7934 MILLER CLERK 7782 1982-01-23 00:00:00 1300 None 10
7876 ADAMS CLERK 7788 1987-05-23 00:00:00 1100 None 20
7844 TURNER SALESMAN 7698 1981-09-08 00:00:00 1500 0 30
7499 ALLEN SALESMAN 7698 1981-02-20 00:00:00 1600 300 30

Bind variables


In [4]:
Employee_name="SCOTT"

%sql select * from emp where ename=:Employee_name


Done.
Out[4]:
empno ename job mgr hiredate sal comm deptno
7788 SCOTT ANALYST 7566 1987-04-19 00:00:00 3000 None 20

Additional example of the integration with the IPython environment


In [5]:
myResultSet=%sql select ename "Employee Name", sal "Salary" from emp

%matplotlib inline
import matplotlib
matplotlib.style.use('ggplot')

myResultSet.bar()


Done.
Out[5]:
<Container object of 14 artists>

The integration with Pandas opens many additional possibilities for data analysis


In [6]:
my_DataFrame=myResultSet.DataFrame()

In [7]:
my_DataFrame.head()


Out[7]:
employee name salary
0 TURNER 1500
1 FORD 3000
2 WARD 1250
3 JONES 2975
4 JAMES 950

In [ ]: