Db2 11.1 introduced extended support for regular expressions. Regular expressions allow you to do very complex pattern matching in character strings. Normal SQL LIKE searches are limited to very specific patterns, but Regular expression have a rich syntax that gives you much more flexibility in searching.
Set up the connection to the database and Db2 command extensions.
In [ ]:
%run db2.ipynb
The following SQL will create a new table in the SAMPLE database (or whatever database you are currently connected to). This table represents all of the stations of the London Underground Central line (existing stations only, not historical ones!). This table will be used for all of the examples within this section.
In [ ]:
%%sql -q
DROP TABLE CENTRAL_LINE;
CREATE TABLE CENTRAL_LINE
(
STATION_NO INTEGER GENERATED ALWAYS AS IDENTITY,
STATION VARCHAR(31),
UPPER_STATION VARCHAR(31) GENERATED ALWAYS AS (UCASE(STATION))
)
;
INSERT INTO CENTRAL_LINE(STATION)
VALUES 'West Ruislip','Ruislip Gardens','South Ruislip','Northolt','Greenford',
'Perivale','Hanger Lane','Ealing Broadway','West Acton','North Acton',
'East Acton','White City','Shepherd''s Bush','Holland Park','Notting Hill Gate',
'Queensway','Lancaster Gate','Marble Arch','Bond Street','Oxford Circus',
'Tottenham Court Road','Holborn','Chancery Lane','St. Paul''s','Bank',
'Liverpool Street','Bethnal Green','Mile End','Stratford','Leyton',
'Leytonstone','Wanstead','Redbridge','Gants Hill','Newbury Park',
'Barkingside','Fairlop','Hainault','Grange Hill','Chigwell',
'Roding Valley','Snaresbrook','South Woodford','Woodford','Buckhurst Hill',
'Loughton','Debden','Theydon Bois','Epping'
;
There are six regular expression functions within DB2 including:
Each one of these functions follows a similar calling sequence:
REGEXP_FUNCTION(source, pattern, flags, start_pos, codeunits)
The arguments to the function are:
The source can be any valid Db2 string including CHAR, VARCHAR, CLOB, etc. Start_pos is the location in the source string that you want to start searching from, and codeunits tells Db2 whether to treat the start_pos as an absolute location (think byte location) or a character location which takes into account the unicode size of the character string.
Codeunits can be specified as CODEUNITS16, CODEUNITS32, or OCTETS. CODEUNITS16 specifies that start is expressed in 16-bit UTF-16 code units. CODEUNITS32 specifies that start is expressed in 32-bit UTF-32 code units. OCTETS specifies that start is expressed in bytes.
Pattern and flag values are complex and so are discussed in the following sections.
Regular expression functions have a flag specification that can be used to change the behavior of the search. There are six possible flags that can be specified as part of the REGEXP command:
Flag | Purpose |
---|---|
c | Specifies that matching is case-sensitive (the default value) |
i | Specifies that matching is case insensitive |
m | Specifies that the input data can contain more than one line. By default, the '^' in a pattern matches only the start of the input string; the '\$' in a pattern matches only the end of the input string. If this flag is set, "^" and "\$" also matches at the start and end of each line within the input string. |
n | Specifies that the '.' character in a pattern matches a line terminator in the input string. By default, the '.' character in a pattern does not match a line terminator. A carriage-return and line-feed pair in the input string behaves as a single-line terminator, and matches a single "." in a pattern. |
s | Specifies that the '.' character in a pattern matches a line terminator in the input string. This value is a synonym for the 'n' value. |
x | Specifies that white space characters in a pattern are ignored, unless escaped. |
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'Ruislip')
The pattern 'Ruislip' will look for a match of Ruislip within the STATION column. Note that this pattern will also match 'West Ruislip' or 'Ruislip Gardens' since we placed no restriction on where the pattern can be found in the string. The match will also be exact (case matters). This type of search would be equivalent to using the SQL LIKE statement:
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE STATION LIKE '%Ruislip%'
If you didn't place the %
at the beginning of the LIKE
string, only the stations that start with Ruislip would be
found.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE STATION LIKE 'Ruislip%'
If you want to match Ruislip with upper or lower case being
ignored, you would add the 'i'
flag as part of the
REGEXP_LIKE (or any REGEXP function).
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'RUISLIP','i')
By default a pattern will be matched anywhere in a string.
Our previous example showed how Ruislip could be found
anywhere in a string. To force a match to start at the
beginning of a string, the carat symbol ^
can be used to
force the match to occur at the beginning of a string.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'^Ruislip')
To match a pattern at the end of the string, the dollar sign
$
can be used.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'Ruislip$')
To force an exact match with a string you would use both the beginning and end anchors.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'^Leyton$')
Note that if we didn't use the end anchor, we are going to get more than one result.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'^Leyton');
So far the examples have dealt with strings that do not contain newline characters (or carriage feeds). In some applications, data from an input panel may include multiple lines which may contain hard line feeds. What this means is that there are actually multiple lines in the data, but from a database perspective, there is only one line in the VARCHAR field. You can modify the behavior of the Regular Expression search by instructing it to honor the CRLF characters as line delimiters.
The following SQL will insert a single line with multiple CRLF characters in it to simulate a multi-line text string.
In [ ]:
%%sql -q
DROP TABLE LONGLINE;
CREATE TABLE LONGLINE (NAME VARCHAR(255));
INSERT INTO LONGLINE
VALUES 'George' || CHR(10) || 'Katrina';
Searching for Katrina at the beginning and end of string doesn't work.
In [ ]:
%%sql
SELECT COUNT(*) FROM LONGLINE
WHERE REGEXP_LIKE(NAME,'^Katrina$')
We can override the regular expression search by telling it to treat each NL/CRLF as the end of a string within a string.
In [ ]:
%%sql
SELECT COUNT(*) FROM LONGLINE
WHERE REGEXP_LIKE(NAME,'^Katrina$','m')
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'way|ing')
Some things to be aware of when creating the search pattern. Spaces in the patterns themselves are significant. If the previous search pattern had a space in one of the words, it would not find it (unless of course there was a space in the station name).
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'way| ing')
Using the "x"
flag will ignore blanks in your pattern, so
this would fix issues that we have in the previous example.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'way| ing','x')
Brackets can be used to make it clear what the pattern is that you are searching for and avoid the problem of having blanks in the expression. Brackets do have a specific usage in regular expressions, but here we are using it only to separate the two search strings.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'(way)|(ing)')
As we found out in the previous section, there is an OR operator that you can use to select between two patterns. How do you request that multiple patterns be present? First we must understand how matching occurs when we have multiple strings that need to be matched that have an unknown number of characters between them.
For instance, how do we create a pattern that looks for
"ing" followed by "way" somewhere in the string? Regular
expression recognize the "."
(period) character as matching
anything. Following the pattern you can add a modifier that
specifies how many times you want the pattern matched:
The following regular expression searches for a pattern with "ing" followed by any characters and then "way".
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'(ing)*.(way)')
The previous answer gave you two results (Ealing Broadway
and Queensway). Why two? The reason is that we used the *
in
the wrong place (a single character in a wrong place can
result in very different results!). What we really needed to
do was place a .*
after the (ing) to match "ing" and then
any characters, before matching "way". What our query did
above was match 0 or more occurences of "ing", followed by
any character, and then match "way". Here is the correct
query.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'(ing).*(way)')
Finding at least one occurrence of a pattern requires the
use of the +
operator, or the bracket operators. This
example locates at least one occurrence of the "an" string
in station names.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'(an)+')
If we want to find an exact number of occurrences, we need
to use the {} notation to tell the regular expression
matcher how many we want to find. The syntax of the {}
match is:
So the "+"
symbol is equivalent to the following regular
expression using the {}
syntax.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'(an){1,}')
If we want to match exactly 2 'an' patterns in a string, we
would think that changing the expression to {2}
would work.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'(an){2}')
Sadly, we get no results! This would appear to be the wrong
result, but it's because we got lucky with our first search!
The best way to figure out what we matched in the original
query is to use the REGEXP_EXTRACT
and REGEXP_INSTR
functions.
The following SQL gives us a clue to what was found with the
(an)
pattern.
In [ ]:
%%sql
SELECT STATION,
REGEXP_INSTR(STATION,'(an)') AS LOCATION,
REGEXP_EXTRACT(STATION,'(an)') AS EXTRACT
FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'(an)')
What you should see in the previous result is the location
where the "an"
pattern was found in the string. Note that
all we matched was the "an"
pattern, nothing else. So why
can't I find two "an"
patterns in the string? The reason is
that (an){2}
means "an" followed by another "an"
! We didn't
tell the pattern to match anything else! What we need to do
is modify the pattern to say that it can match "an"
followed
by anything else. The pattern needs to be modifed to (an).*
where the ".*"
means any character following the "an".
In order to tell the regular expression function to use this
entire pattern (an).\*
twice, we need to place brackets
around it as well. The final pattern is ((an).\*){2}
.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'((an).*){2}')
You should find that two stations match the pattern. The following SQL shows which pattern is matched first in the STATIONS names.
In [ ]:
%%sql
SELECT STATION,
REGEXP_INSTR(STATION,'((an).*){2}') AS LOCATION,
REGEXP_EXTRACT(STATION,'((an).*){2}') AS EXTRACT
FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'((an).*){2}')
Aside from matching entire strings, you can also use regular
expression to look for patterns of characters. The simplest
matching pattern is the period (.)
which matches any
character. Matching a string of arbitrary length is the
pattern ".*"
. The "+"
and "?"
characters can also be used to
modify how many characters you want matched.
What about situations where you want to check for certain patterns or characters in a string? A good example would be a social security number, or credit card number. There are certain patterns that you would find for these objects. Assume we have a social security number in the format xxx-xx-xxxx. It is possible to create a regular expression that would return true if the SSN matched the pattern above (it doesn't tell you if the SSN itself is valid, only that it has the proper format).
Regular expressions allow you to create a list of characters that need to be matched in something called a bracket expression. A bracket expression has the format:
[a-z] [A-Z] [0-9] [a-zA-z]
The examples above represent the following search patterns:
You can also enumerate all of the characters you want to
match by listing them between the brackets like
[abcdefghikjlmnopqrstuvwxyz]
. The short form a-z
is easier
to read and less prone to typing errors!
The following example checks for station names that start
with the letter P-R
.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'^[P-R]')
If you wanted to include all stations that have the letter
P-R
or p-e
, you could add the condition within the brackets.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'[p-rP-R]')
Back to our SSN question. Can a regular expression pattern be used to determine whether or not the string is in the correct format? The data will be in the format XXX-XX-XXXX so the regular expression needs to find the three numeric values separated by dashes.
The number pattern can be represented with the bracket
expression [0-9]
. To specify the number of characters that
need to be found, we use the braces {}
to specify the exact
number required.
For the three numbers in the pattern we can use [0-9]{3}
,
[0-9]{2}
, and [0-9]{4}
. Adding in the dashes gives us the
final pattern. The SQL below checks to see if a SSN is
correct.
In [ ]:
%%sql
VALUES
CASE
WHEN REGEXP_LIKE('123-34-1422','[0-9]{3}-[0-9]{2}-[0-9]{4}') THEN 'Valid'
ELSE 'Invalid'
END
The SSN is valid in the example above. Here are some other examples to show whether or not the regular expression picks up all of the errors.
In [ ]:
%%sql
WITH SSNS(SSN) AS (
VALUES
'123-34-1322',
'ABC-34-9999',
'X123-44-0001',
'123X-Y44-Z0001',
'111-222-111'
)
SELECT SSN,
CASE
WHEN REGEXP_LIKE(SSN,'[0-9]{3}-[0-9]{2}-[0-9]{4}') THEN 'Valid'
ELSE 'Invalid'
END
FROM SSNS
If you check closely, one of the strings was marked as
valid, although it is not correct (X123-44-0001)
. The reason
this occurred is that the pattern was found after the "X"
and it was correct. To prevent this from happening, we need
to anchor the pattern at the beginning to avoid this
situation. A better pattern would be to anchor both ends of
the pattern so there is no possibility of other characters
being at the beginning or end of the pattern.
In [ ]:
%%sql
WITH SSNS(SSN) AS (
VALUES
'123-34-1322',
'ABC-34-9999',
'X123-44-0001',
'123X-Y44-Z0001',
'111-222-111'
)
SELECT SSN,
CASE
WHEN REGEXP_LIKE(SSN,'^[0-9]{3}-[0-9]{2}-[0-9]{4}$') THEN 'Valid'
ELSE 'Invalid'
END
FROM SSNS
The previous example used the [0-9] syntax to request that only numbers be found in the pattern. There are some predefined patterns that define these common patterns. The first argument is Posix format (if it exists), the second is the escape character equivalent, and the final one is the raw pattern it represents.
Posix | Escape | Pattern | Meaning |
---|---|---|---|
[:alnum:] | [A-Za-z0-9] | Alphanumeric characters | |
\w | [A-Za-z0-9_] | Alphanumeric characters plus "_" | |
\W | [^A-Za-z0-9_] | Non-word characters | |
[:alpha:] | \a | [A-Za-z] | Alphabetic characters |
[:blank:] | \s, \t | Space and tab | |
\b | Word boundaries | ||
[:cntrl:] | [\x00-\x1F\x7F] | Control characters | |
[:digit:] | \d | [0-9] | Digits |
\D | [^0-9] | Non-digits | |
[:graph:] | [\x21-\x7E] | Visible characters | |
[:lower:] | \l | [a-z] | Lowercase letters |
[:print:] | \p | [\x20-\x7E] | Visible characters and the space character |
[:punct:] | [][!"#\$%&'()*+,./:;<=>?@\^_`{| }~-] |
Punctuation characters | |
[:space:] | \s | [ \t\r\n\v\f] | Whitespace characters |
\S | [^ \t\r\n\v\f] | Non-whitespace characters | |
[:upper:] | \u | [A-Z] | Uppercase letters |
[:xdigit:] | \x | [A-Fa-f0-9] | Hexadecimal digits |
For instance, the following three statements will produce the same result.
In [ ]:
%%sql -a
WITH SSNS(SSN) AS (
VALUES
'123-34-1322',
'ABC-34-9999',
'X123-44-0001',
'123X-Y44-Z0001',
'111-222-111'
)
SELECT 'Original', SSN,
CASE
WHEN REGEXP_LIKE(SSN,'^[0-9]{3}-[0-9]{2}-[0-9]{4}$') THEN 'Valid'
ELSE 'Invalid'
END
FROM SSNS
UNION ALL
SELECT 'Posix', SSN,
CASE
WHEN REGEXP_LIKE(SSN,'^[:digit:]{3}-[:digit:]{2}-[:digit:]{4}$') THEN 'Valid'
ELSE 'Invalid'
END
FROM SSNS
UNION ALL
SELECT 'Escape', SSN,
CASE
WHEN REGEXP_LIKE(SSN,'^\d{3}-\d{2}-\d{4}$') THEN 'Valid'
ELSE 'Invalid'
END
FROM SSNS
Up to this point in time, the patterns that have been used are looking for a positive match. In some cases you may want to find values that do not match. The easiest way is to negate the actual REGEXP_LIKE expression. The following expression finds all of the stations that start with "West".
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'^West')
Adding the NOT modifier in front of the REGEXP function gives us the stations that do not begin with West.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE NOT REGEXP_LIKE(STATION,'^West')
You can also negate some of the searches in a pattern by
using the [^...]
syntax where the ^
tells the regular
expression not to match the following characters. The
expression [^0-9]
would mean match any characters which are
not numeric.
However, regular expressions have something called negative
lookarounds which basically mean find the pattern which does
not match. You create this pattern by adding the (?!..)
at
the beginning of the string. The same query (finding
stations that don't start with West) would be written with
this lookaround logic found in the SQL below.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'^(?!West)')
The previous example used something called a negative
lookaround with capturing parenthesis. When you place a
pattern within a set of brackets (...)
the string that
matches this pattern is "remembered". The strings that are
matched can be used in subsequent parts of your regular
expression. This allows a form of programming within your
regular expression!
Each set of parentheses that are matched are associated with a number, starting at one and incrementing for each subsequent pattern match. For instance, the following pattern will have three matches:
^([0-9]{3})-([0-9]{3})-([0-9]{3})$This is similar to the SSN example used earlier on in this section. The difference in this example is that each block of numbers is exactly the same (3 digits). This pattern will match any sequence of numbers in the format 123-456-789.
In [ ]:
%%sql
WITH SSNS(SSN) AS (
VALUES
'123-456-789',
'123-555-123',
'890-533-098',
'123-456-456'
)
SELECT SSN,
CASE
WHEN REGEXP_LIKE(SSN,'^([0-9]{3})-([0-9]{3})-([0-9]{3})$') THEN 'Valid'
ELSE 'Invalid'
END
FROM SSNS
All of these numbers fit the pattern and should be valid.
When one of the capturing parenthesis matches, it will
remember the string that it matched. For instance, in the
first example (123-456-789), the first match will find the
string '123'. The second match will find '456' and so on. We
can refer to these matched strings with the special control
characters \n
where n represents the capturing parenthesis
location. So \1
would refer to the '123' that was found. \2
would be for '456'.
The regular expression will be updated so that the last part of the pattern needs to be numeric (0-9) but can't be same as the first match.
In [ ]:
%%sql
WITH SSNS(SSN) AS (
VALUES
'123-456-789',
'123-555-123',
'890-533-098',
'123-456-456'
)
SELECT SSN,
CASE
WHEN REGEXP_LIKE(SSN,'^([0-9]{3})-([0-9]{3})-(?!\1)([0-9]{3})$') THEN 'Valid'
ELSE 'Invalid'
END
FROM SSNS
In many cases it may be easier to find the patterns that
match and then negate the REGEXP statement! The (?...)
syntax is used for a variety of purposes in regular
expressions:
Pattern | Result |
---|---|
(?: ... ) | Non-capturing parentheses. Groups the included pattern, but does not provide capturing of matching text. More efficient than capturing parentheses. |
(?> ... ) | Atomic-match parentheses. First match of the parenthesized subexpression is the only one tried. If it does not lead to an overall pattern match, back up the search for a match to a position before the "(?>" |
(?# ... ) | Free-format comment (?# comment ) |
(?= ... ) | Look-ahead assertion. True if the parenthesized pattern matches at the current input position, but does not advance the input position. |
(?! ... ) | Negative look-ahead assertion. True if the parenthesized pattern does not match at the current input position. Does not advance the input position. |
(?<= ... ) | Look-behind assertion. True if the parenthesized pattern matches text that precedes the current input position. The last character of the match is the input character just before the current position. Does not alter the input position. The length of possible strings that is matched by the look-behind pattern must not be unbounded (no * or + operators.) |
(?<!... ) | Negative Look-behind assertion. True if the parenthesized pattern does not match text that precedes preceding the current input position. The last character of the match is the input character just before the current position. Does not alter the input position. The length of possible strings that is matched by the look-behind pattern must not be unbounded (no * or + operators.) |
For efficiency in matching, the best approach is to place
strings that you are searching for in non-capturing
parentheses (?:...)
rather than the generic () parenthesis.
The following example finds all stations with "West" in the
name.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'(West)')
The following SQL is equivalent, except that the matched pattern is not kept for future use in matching.
In [ ]:
%%sql
SELECT STATION FROM CENTRAL_LINE
WHERE REGEXP_LIKE(STATION,'(?:West)')
We know that there are better ways to write Regular Expressions (like the use of non-capturing parenthesis. How much of a difference does this make? In order to find out, the following SQL will take the existing table and insert into a temporary table 1000 times (for close to 50000 records).
This SQL will generate the new table and fill it with the base CENTRAL_LINE data.
In [ ]:
%%sql -d -q
DROP TABLE TEMP_LINE
@
CREATE TABLE TEMP_LINE AS (SELECT * FROM CENTRAL_LINE) DEFINITION ONLY NOT LOGGED INITIALLY
@
BEGIN
DECLARE I INTEGER DEFAULT 0;
WHILE I <= 1000 DO
INSERT INTO TEMP_LINE SELECT * FROM CENTRAL_LINE;
SET I = I + 1;
END WHILE;
END
@
SELECT COUNT(*) FROM TEMP_LINE
@
We will run four queries in the following SQL.
The queries are run as many times as possible per second to get consistent results. Odds are that there will be variations in the results on your system!
In [ ]:
results_like = %sql -t SELECT COUNT(*) FROM TEMP_LINE WHERE STATION LIKE '%West%'
results_string = %sql -t SELECT COUNT(*) FROM TEMP_LINE WHERE REGEXP_LIKE(STATION,'West')
results_capturing = %sql -t SELECT COUNT(*) FROM TEMP_LINE WHERE REGEXP_LIKE(STATION,'(West)')
results_noncapturing = %sql -t SELECT COUNT(*) FROM TEMP_LINE WHERE REGEXP_LIKE(STATION,'(?:West)')
The results are placed into a temporary table for easier formatting.
In [ ]:
%sql -q DROP TABLE RESULTS
%sql CREATE TABLE RESULTS(TYPE VARCHAR(16), RESULT DEC(9,2))
%sql INSERT INTO RESULTS VALUES ('LIKE', {results_like} ), \
('REGX STRING', {results_string} ), \
('REGX CAPTURE', {results_capturing} ), \
('REGX NONCAPTURE', {results_noncapturing} )
%sql SELECT * FROM RESULTS
The results are clearer when we plot them all on one graph!
In [ ]:
%sql -pb SELECT * FROM RESULTS
Every system will have different performance characteristics, but the odds are that the LIKE statement was faster than the regular expression. The performance is probably ranked this way (with tx/sec being the measurement):
LIKE > REGEXP(Non-capturing) > REGEXP(Capturing) > REGEXP(String)
However, your system will probably get different results. So while Regular expressions are very powerful, there is a performance penalty when using them. In addition, you may need to consider the use of indexes from a performance perspective.
The following SQL will add an index to the STATION name.
In [ ]:
%%sql -q
DROP INDEX TEMP_STATION_INDEX;
CREATE INDEX TEMP_STATION_INDEX ON TEMP_LINE(STATION);
We will try the 4 queries again and plot the result.
In [ ]:
results_like = %sql -t SELECT COUNT(*) FROM TEMP_LINE WHERE STATION LIKE '%West%'
results_string = %sql -t SELECT COUNT(*) FROM TEMP_LINE WHERE REGEXP_LIKE(STATION,'West')
results_capturing = %sql -t SELECT COUNT(*) FROM TEMP_LINE WHERE REGEXP_LIKE(STATION,'(West)')
results_noncapturing = %sql -t SELECT COUNT(*) FROM TEMP_LINE WHERE REGEXP_LIKE(STATION,'(?:West)')
%sql -q DROP TABLE RESULTS
%sql -q CREATE TABLE RESULTS(TYPE VARCHAR(16), RESULT DEC(9,2))
%sql -q INSERT INTO RESULTS VALUES ('LIKE', {results_like} ), \
('REGX STRING', {results_string} ), \
('REGX CAPTURE', {results_capturing} ), \
('REGX NONCAPTURE', {results_noncapturing} )
%sql SELECT * FROM RESULTS
%sql -pb SELECT * FROM RESULTS
The index can help speed up processing of the LIKE statement and the regular expression calls. The results may or may not favor LIKE over regular expressions. In summary, if you are going to use regular expressions, you may get a performance advantage by using indexes if possible.