In [ ]:
#@title Licensed under the Apache License, Version 2.0 (the "License")
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
Filters input string elements based on a regex. May also transform them based on the matching groups.
To run a code cell, you can click the Run cell button at the top left of the cell, or select it and press Shift+Enter.
Try modifying a code cell and re-running it to see what happens.
To learn more about Colab, see Welcome to Colaboratory!.
First, let's install the apache-beam module.
In [ ]:
!pip install --quiet -U apache-beam
In the following examples, we create a pipeline with a PCollection of text strings.
Then, we use the Regex transform to search, replace, and split through the text elements using regular expressions.
You can use tools such as regex101 to help you create and test your regular expressions. Make sure to select the Python flavor in the left sidebar.
Let's look at the [regular expression (?P<icon>[^\s,]+), *(\w+), *(\w+)](https://regex101.com/r/Z7hTTj/3) for example.
It matches one or more characters that are neither whitespace \s ([ \t\n\r\f\v]) nor a comma, up to the first comma, and stores them in the named group icon. This also matches non-ASCII characters such as emoji.
Then it matches the comma, any number of spaces, and at least one word character \w ([a-zA-Z0-9_]), which is stored in the second group for the name.
It does the same with the third group for the duration.
Note: To avoid unexpected string escaping in your regular expressions, it is recommended to use raw strings such as r'raw-string' instead of 'escaped-string'.
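To make the groups concrete, here is a minimal sketch that applies the same pattern with Python's standard re module, outside of any pipeline. The sample line is taken from the examples in this notebook; the variable names are only illustrative.

```python
import re

# Same pattern as described above, written as a raw string.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
match = re.match(regex, '🍓, Strawberry, perennial')

icon = match.group('icon')  # named group
name = match.group(2)       # second group
duration = match.group(3)   # third group
print(icon, name, duration)

# Raw strings keep backslashes as-is: '\n' is one newline character,
# while r'\n' is a backslash followed by the letter n.
escaped_length = len('\n')
raw_length = len(r'\n')
print(escaped_length, raw_length)
```

Group 0 is always the entire match, which is why the transforms below use it as the default.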
Regex.matches keeps only the elements that match the regular expression, returning the matched group.
The argument group is set to 0 (the entire match) by default, but can be set to a group number like 3, or to a named group like 'icon'.
Regex.matches starts to match the regular expression at the beginning of the string.
To match until the end of the string, add '$' at the end of the regular expression.
To start matching at any point instead of the beginning of the string, use Regex.find(regex).
In [ ]:
import apache_beam as beam
# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_matches = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓, Strawberry, perennial',
          '🥕, Carrot, biennial ignoring trailing words',
          '🍆, Eggplant, perennial',
          '🍅, Tomato, annual',
          '🥔, Potato, perennial',
          '# 🍌, invalid, format',
          'invalid, 🍉, format',
      ])
      | 'Parse plants' >> beam.Regex.matches(regex)
      | beam.Map(print)
  )
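The anchoring rules above can be checked without running a pipeline. The sketch below uses re.match from the standard library as a stand-in, assuming it follows the same rule of matching only at the beginning of the string; it is not the transform's implementation.

```python
import re

regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
line = '🥕, Carrot, biennial ignoring trailing words'

# Matching starts at the beginning but may stop before the end,
# so the trailing words are left out of the match.
prefix_match = re.match(regex, line)
whole_match = prefix_match.group(0)
icon_only = prefix_match.group('icon')

# Appending '$' requires the match to reach the end of the string.
anchored_match = re.match(regex + '$', line)

# A line that does not match from its first character is rejected.
rejected = re.match(regex, '# 🍌, invalid, format')

print(whole_match, icon_only, anchored_match, rejected)
```

This is why the two invalid lines in the cell above produce no output, while the carrot line is kept without its trailing words.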
Regex.all_matches keeps only the elements that match the regular expression, returning all groups as a list.
The groups are returned in the order encountered in the regular expression, including group 0 (the entire match) as the first group.
Regex.all_matches starts to match the regular expression at the beginning of the string.
To match until the end of the string, add '$' at the end of the regular expression.
To start matching at any point instead of the beginning of the string, use Regex.find_all(regex, group=Regex.ALL, outputEmpty=False).
In [ ]:
import apache_beam as beam
# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_all_matches = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓, Strawberry, perennial',
          '🥕, Carrot, biennial ignoring trailing words',
          '🍆, Eggplant, perennial',
          '🍅, Tomato, annual',
          '🥔, Potato, perennial',
          '# 🍌, invalid, format',
          'invalid, 🍉, format',
      ])
      | 'Parse plants' >> beam.Regex.all_matches(regex)
      | beam.Map(print)
  )
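As a rough illustration of the list shape described above, the following sketch builds the same kind of list with the standard re module. The helper all_groups is hypothetical and only mirrors the documented ordering (group 0 first, then each group in order).

```python
import re

pattern = re.compile(r'(?P<icon>[^\s,]+), *(\w+), *(\w+)')


def all_groups(line):
  """Returns [group 0, group 1, ...], or None if the line does not match."""
  match = pattern.match(line)
  if match is None:
    return None
  # pattern.groups is the number of capturing groups in the pattern.
  return [match.group(i) for i in range(pattern.groups + 1)]


groups = all_groups('🍅, Tomato, annual')
skipped = all_groups('invalid, 🍉, format')
print(groups, skipped)
```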
Regex.matches_kv keeps only the elements that match the regular expression, returning a key-value pair using the specified groups.
The argument keyGroup is set to a group number like 3, or to a named group like 'icon'.
The argument valueGroup is set to 0 (the entire match) by default, but can be set to a group number like 3, or to a named group like 'icon'.
Regex.matches_kv starts to match the regular expression at the beginning of the string.
To match until the end of the string, add '$' at the end of the regular expression.
To start matching at any point instead of the beginning of the string, use Regex.find_kv(regex, keyGroup).
In [ ]:
import apache_beam as beam
# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_matches_kv = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓, Strawberry, perennial',
          '🥕, Carrot, biennial ignoring trailing words',
          '🍆, Eggplant, perennial',
          '🍅, Tomato, annual',
          '🥔, Potato, perennial',
          '# 🍌, invalid, format',
          'invalid, 🍉, format',
      ])
      | 'Parse plants' >> beam.Regex.matches_kv(regex, keyGroup='icon')
      | beam.Map(print)
  )
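To see how the key and value groups combine, here is a small sketch with the standard re module. The helper to_kv and its parameter names are hypothetical; they only imitate the keyGroup and valueGroup arguments described above.

```python
import re

pattern = re.compile(r'(?P<icon>[^\s,]+), *(\w+), *(\w+)')


def to_kv(line, key_group, value_group=0):
  """Returns (key, value) built from two groups, or None without a match."""
  match = pattern.match(line)
  if match is None:
    return None
  return (match.group(key_group), match.group(value_group))


# Named group as key, entire match (group 0) as value.
pair = to_kv('🥔, Potato, perennial', 'icon')
# Group numbers work for both the key and the value.
by_number = to_kv('🥔, Potato, perennial', 2, 3)
print(pair, by_number)
```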
Regex.find keeps only the elements that match the regular expression, returning the matched group.
The argument group is set to 0 (the entire match) by default, but can be set to a group number like 3, or to a named group like 'icon'.
Regex.find matches the first occurrence of the regular expression in the string.
To start matching at the beginning, add '^' at the beginning of the regular expression.
To match until the end of the string, add '$' at the end of the regular expression.
If you need to match from the start only, consider using Regex.matches(regex).
In [ ]:
import apache_beam as beam
# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_matches = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '# 🍓, Strawberry, perennial',
          '# 🥕, Carrot, biennial ignoring trailing words',
          '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',
          '# 🍅, Tomato, annual - 🍉, Watermelon, annual',
          '# 🥔, Potato, perennial',
      ])
      | 'Parse plants' >> beam.Regex.find(regex)
      | beam.Map(print)
  )
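The difference between matching at the start and finding the first occurrence can be sketched with re.match and re.search from the standard library. This is an analogy under the assumption that the transforms follow the same rules, not a copy of their implementation.

```python
import re

regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
line = '# 🍆, Eggplant, perennial - 🍌, Banana, perennial'

# Searching anywhere in the string returns the first occurrence only.
first = re.search(regex, line).group(0)

# Matching from the start fails because of the leading '# '.
at_start = re.match(regex, line)

# Prefixing '^' makes a search behave like a match from the start.
anchored = re.search('^' + regex, line)

print(first, at_start, anchored)
```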
Regex.find_all returns a list of all the matches of the regular expression, returning the matched group.
The argument group is set to 0 by default, but can be set to a group number like 3, to a named group like 'icon', or to Regex.ALL to return all groups.
The argument outputEmpty is set to True by default, but can be set to False to skip elements where no matches were found.
Regex.find_all matches the regular expression anywhere it is found in the string.
To start matching at the beginning, add '^' at the start of the regular expression.
To match until the end of the string, add '$' at the end of the regular expression.
If you need to match all groups from the start only, consider using Regex.all_matches(regex).
In [ ]:
import apache_beam as beam
# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_find_all = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '# 🍓, Strawberry, perennial',
          '# 🥕, Carrot, biennial ignoring trailing words',
          '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',
          '# 🍅, Tomato, annual - 🍉, Watermelon, annual',
          '# 🥔, Potato, perennial',
      ])
      | 'Parse plants' >> beam.Regex.find_all(regex)
      | beam.Map(print)
  )
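To preview what a list of all matches looks like for a single element, here is a minimal sketch using re.finditer from the standard library. The line without plants is a made-up input that shows the empty-list case.

```python
import re

regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
line = '# 🍅, Tomato, annual - 🍉, Watermelon, annual'

# One entry per occurrence, using group 0 (the entire match).
matches = [m.group(0) for m in re.finditer(regex, line)]

# The same occurrences, keeping only the named group 'icon'.
icons = [m.group('icon') for m in re.finditer(regex, line)]

# A string with no occurrences produces an empty list.
no_matches = [m.group(0) for m in re.finditer(regex, 'no plants here')]

print(matches, icons, no_matches)
```

An empty list like no_matches is the kind of result that outputEmpty=False is described as skipping.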
Regex.find_kv returns all the matches of the regular expression, each as a key-value pair using the specified groups.
The argument keyGroup is set to a group number like 3, or to a named group like 'icon'.
The argument valueGroup is set to 0 (the entire match) by default, but can be set to a group number like 3, or to a named group like 'icon'.
Regex.find_kv matches the regular expression anywhere it is found in the string.
To start matching at the beginning, add '^' at the beginning of the regular expression.
To match until the end of the string, add '$' at the end of the regular expression.
If you need to match as key-value pairs from the start only, consider using Regex.matches_kv(regex, keyGroup).
In [ ]:
import apache_beam as beam
# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_matches_kv = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '# 🍓, Strawberry, perennial',
          '# 🥕, Carrot, biennial ignoring trailing words',
          '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',
          '# 🍅, Tomato, annual - 🍉, Watermelon, annual',
          '# 🥔, Potato, perennial',
      ])
      | 'Parse plants' >> beam.Regex.find_kv(regex, keyGroup='icon')
      | beam.Map(print)
  )
Regex.replace_all returns the string with all the occurrences of the regular expression replaced by another string.
You can also use backreferences in the replacement.
In [ ]:
import apache_beam as beam
with beam.Pipeline() as pipeline:
  plants_replace_all = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓 : Strawberry : perennial',
          '🥕 : Carrot : biennial',
          '🍆\t:\tEggplant\t:\tperennial',
          '🍅 : Tomato : annual',
          '🥔 : Potato : perennial',
      ])
      # Replaces every ':' and its surrounding whitespace with a comma.
      | 'To CSV' >> beam.Regex.replace_all(r'\s*:\s*', ',')
      | beam.Map(print)
  )
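The cell above uses a plain replacement string. As a sketch of what a backreference looks like, the following uses re.sub from the standard library, assuming the replacement string follows Python's usual syntax (\1, \2, and so on). The second input line is a shortened, made-up example.

```python
import re

# Plain replacement: every ':' separator becomes a comma.
to_csv = re.sub(r'\s*:\s*', ',', '🍆\t:\tEggplant\t:\tperennial')

# Backreferences: \1 and \2 refer to the first and second groups,
# here used to swap the two fields around the separator.
swapped = re.sub(
    r'(\w+)\s*:\s*(\w+)', r'\2 (\1)', 'Strawberry : perennial')

print(to_csv)
print(swapped)
```

Writing the replacement as a raw string matters here, since '\1' without the r prefix would be read as an escape sequence.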
Regex.replace_first returns the string with the first occurrence of the regular expression replaced by another string.
You can also use backreferences in the replacement.
In [ ]:
import apache_beam as beam
with beam.Pipeline() as pipeline:
  plants_replace_first = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓, Strawberry, perennial',
          '🥕, Carrot, biennial',
          '🍆,\tEggplant, perennial',
          '🍅, Tomato, annual',
          '🥔, Potato, perennial',
      ])
      # Replaces only the first ',' and its surrounding whitespace with ': '.
      | 'As dictionary' >> beam.Regex.replace_first(r'\s*,\s*', ': ')
      | beam.Map(print)
  )
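To compare replacing the first occurrence with replacing all of them, here is a minimal sketch using re.sub and its count argument from the standard library. It is only an analogy for the two transforms.

```python
import re

line = '🍆,\tEggplant, perennial'

# count=1 stops after the first occurrence of the separator.
first_only = re.sub(r'\s*,\s*', ': ', line, count=1)

# Without count, every occurrence is replaced.
every = re.sub(r'\s*,\s*', ': ', line)

print(first_only)
print(every)
```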
Regex.split returns the list of strings that were delimited by the specified regular expression.
In [ ]:
import apache_beam as beam
with beam.Pipeline() as pipeline:
  plants_split = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓 : Strawberry : perennial',
          '🥕 : Carrot : biennial',
          '🍆\t:\tEggplant : perennial',
          '🍅 : Tomato : annual',
          '🥔 : Potato : perennial',
      ])
      # Splits each line on ':' and its surrounding whitespace.
      | 'Parse plants' >> beam.Regex.split(r'\s*:\s*')
      | beam.Map(print)
  )
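For a quick look at the kind of list a split produces, the sketch below applies the same separator pattern with re.split from the standard library to one of the lines above.

```python
import re

# The separator is a ':' with any surrounding whitespace, including tabs.
parts = re.split(r'\s*:\s*', '🍆\t:\tEggplant : perennial')
print(parts)
```

Because the pattern consumes the whitespace around each ':', the resulting fields need no extra stripping.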