In [ ]:
import ibis
import os
hdfs_port = os.environ.get('IBIS_WEBHDFS_PORT', 50070)
hdfs = ibis.hdfs_connect(host='quickstart.cloudera', port=hdfs_port)
con = ibis.impala.connect(host='quickstart.cloudera', database='ibis_testing',
                          hdfs_client=hdfs)
ibis.options.interactive = True
The Ibis type system is pretty basic and will get better (and better documented) over time. It maps directly onto the current Impala type system:
int8
int16
int32
int64
boolean
float
double
string
timestamp
decimal($precision, $scale)
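You can see how these types map onto a concrete table by inspecting its schema. A quick sketch, using the functional_alltypes test table that appears in the examples below:
In [ ]:
con.table('functional_alltypes').schema()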
These type names can be used to cast from one type to another:
In [ ]:
table = con.table('functional_alltypes')
table.string_col.cast('double').sum()
In [ ]:
table.string_col.cast('decimal(12,2)').sum()
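A cast changes the type carried by the expression itself. As a quick check (a sketch, assuming the type() accessor on value expressions), you can ask an expression for its type before and after casting:
In [ ]:
table.string_col.type()
In [ ]:
table.string_col.cast('decimal(12,2)').type()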
In [ ]:
expr = (table.string_col
        .case()
        .when('4', 'fee')
        .when('7', 'fi')
        .when('1', 'fo')
        .when('0', 'fum')
        .else_(table.string_col)
        .end()
        .name('new_strings'))
expr.value_counts()
If the else_ default condition is not provided, any values not matching one of the conditions will be NULL.
In [ ]:
expr = (table.string_col
        .case()
        .when('4', 'fee')
        .when('7', 'fi')
        .end()
        .name('with_nulls'))
expr.value_counts()
To test for an arbitrary series of boolean conditions, use the case API method and pass any boolean expressions potentially involving columns of the table:
In [ ]:
expr = (ibis.case()
        .when(table.int_col > 5, table.bigint_col * 2)
        .when(table.int_col > 2, table.bigint_col)
        .else_(table.int_col)
        .end())
table['id', 'int_col', 'bigint_col', expr.name('case_result')].limit(20)
Simple ternary cases (like the Python X if COND else Y) can be written using the ifelse function:
In [ ]:
expr = ((table.int_col > 5)
        .ifelse(table.bigint_col / 2, table.bigint_col * 2)
        .name('ifelse_result'))
table['int_col', 'bigint_col', expr].limit(10)
In [ ]:
bool_clause = table.string_col.notin(['1', '4', '7'])
table[bool_clause].string_col.value_counts()
You can also check for membership in an array. Here is an example of filtering based on the top 3 (ignoring ties) most frequently occurring values in the string_col column of alltypes:
In [ ]:
top_strings = table.string_col.value_counts().limit(3).string_col
top_filter = table.string_col.isin(top_strings)
expr = table[top_filter]
expr.count()
This is a common enough operation that we provide a special analytical filter function, topk:
In [ ]:
table[table.string_col.topk(3)].count()
Cool, huh? More on topk later.
In [ ]:
expr = (table.string_col
        .case()
        .when('4', 'fee')
        .when('7', 'fi')
        .when('1', 'fo')
        .end()
        .name('new_strings'))
expr.isnull().value_counts()
Functions like isnull can be combined with case expressions or functions like ifelse to replace null values with some other value. ifelse here will use the first value supplied for any True value and the second value for any False value. Either value can be a scalar or array.
In [ ]:
expr2 = expr.isnull().ifelse('was null', expr).name('strings')
expr2.value_counts()
In [ ]:
table['int_col', 'bigint_col'].distinct()
In [ ]:
table.string_col.distinct()
This can be combined with count to form a reduction metric:
In [ ]:
metric = (table.bigint_col
          .distinct().count()
          .name('unique_bigints'))
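The metric is just an expression, so it can be evaluated on its own or reused inside a grouped aggregation. A brief sketch, assuming the standard group_by/aggregate API:
In [ ]:
metric
In [ ]:
table.group_by('string_col').aggregate([metric])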
This is common enough to have a shortcut, nunique:
In [ ]:
table.string_col.nunique()
In [ ]:
nation = con.table('tpch_nation')
nation.limit(5)
At the moment, basic substring operations (substr, with conveniences left and right) and Python-like APIs such as lower and upper (for case normalization) are supported. So you could count first-letter occurrences in a string column like so:
In [ ]:
expr = nation.n_name.lower().left(1).name('first_letter')
expr.value_counts().sort_by(('count', False))
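The other helpers mentioned above work the same way. A brief sketch, assuming substr takes a starting position and a length:
In [ ]:
nation[nation.n_name,
       nation.n_name.substr(0, 3).name('prefix'),   # first three characters
       nation.n_name.right(3).name('suffix')].limit(5)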
For fuzzy and regex filtering/searching, you can use one of the following:
like, which works as the SQL LIKE keyword
rlike, like re.search or SQL RLIKE
contains, like x in str_value in Python
In [ ]:
nation[nation.n_name.like('%GE%')]
In [ ]:
nation[nation.n_name.lower().rlike('.*ge.*')]
In [ ]:
nation[nation.n_name.lower().contains('ge')]
In [ ]:
table = con.table('functional_alltypes')
table[table.timestamp_col, table.timestamp_col.minute().name('minute')].limit(10)
Somewhat more comprehensively:
In [ ]:
def get_field(f):
    return getattr(table.timestamp_col, f)().name(f)
fields = ['year', 'month', 'day', 'hour', 'minute', 'second', 'millisecond']
projection = [table.timestamp_col] + [get_field(x) for x in fields]
table[projection].limit(10)
For timestamp arithmetic and comparisons, check out functions in the top-level ibis namespace. These include things like day and second, but also the ibis.timestamp function:
In [ ]:
table[table.timestamp_col.min(), table.timestamp_col.max(), table.count().name('nrows')]
In [ ]:
table[table.timestamp_col < '2010-01-01'].count()
In [ ]:
table[table.timestamp_col <
      (ibis.timestamp('2010-01-01') + ibis.interval(months=3))].count()
In [ ]:
expr = (table.timestamp_col + ibis.interval(days=1) + ibis.interval(hours=4)).name('offset')
table[table.timestamp_col, expr, ibis.now().name('current_time')].limit(10)