In [ ]:
#@title Licensed under the Apache License, Version 2.0 (the "License")
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.

Regex

Filters input string elements based on a regex. May also transform them based on the matching groups.

Setup

To run a code cell, you can click the Run cell button at the top left of the cell, or select it and press Shift+Enter. Try modifying a code cell and re-running it to see what happens.

To learn more about Colab, see "Welcome to Colaboratory!"

First, let's install the apache-beam module.


In [ ]:
!pip install --quiet -U apache-beam

Examples

In the following examples, we create a pipeline with a PCollection of text strings. Then, we use the Regex transform to search, replace, and split through the text elements using regular expressions.

You can use tools such as regex101 to help you create and test your regular expressions. Make sure to select the Python flavor in the left sidebar.

Let's look at the [regular expression (?P<icon>[^\s,]+), *(\w+), *(\w+)](https://regex101.com/r/Z7hTTj/3), for example. It matches one or more characters that are neither whitespace \s ([ \t\n\r\f\v]) nor a comma , up to the first comma, and stores them in the named group icon. Because the character class is negated, this group can match non-ASCII characters such as emoji. It then matches any number of spaces, followed by at least one word character \w ([a-zA-Z0-9_]), which is stored in the second group for the name. It does the same with the third group for the duration.
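
To see how the three groups line up with a sample element, here is a minimal sketch that uses Python's built-in re module directly instead of Beam. The variable names are only for illustration.

```python
import re

# Same pattern as in the text: a named group 'icon' plus two unnamed groups.
pattern = re.compile(r'(?P<icon>[^\s,]+), *(\w+), *(\w+)')

match = pattern.match('🍓, Strawberry, perennial')
icon = match.group('icon')  # named group, also reachable as group 1
name = match.group(2)       # second group
duration = match.group(3)   # third group
print(icon, name, duration)
```

Group 0 is always the entire match, so the named group icon is group 1, the name is group 2, and the duration is group 3.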

Note: To avoid unexpected string escaping in your regular expressions, it is recommended to use raw strings such as r'raw-string' instead of 'escaped-string'.
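
As a small illustration of why raw strings matter, the sketch below uses plain Python re with a made-up input string. In a regular string literal, Python turns '\b' into a backspace character before the regex engine ever sees it.

```python
import re

text = 'a cat sat'

# In a regular string literal, '\b' is the backspace character (U+0008),
# so this pattern looks for literal backspaces around 'cat' and finds none.
escaped = re.search('\bcat\b', text)

# In a raw string, the backslash reaches the regex engine untouched,
# where \b means "word boundary".
raw = re.search(r'\bcat\b', text)

print(escaped)       # None
print(raw.group(0))  # cat
```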

Example 1: Regex match

Regex.matches keeps only the elements that match the regular expression, returning the matched group. The argument group is set to 0 (the entire match) by default, but can be set to a group number like 3, or to a named group like 'icon'.

Regex.matches starts to match the regular expression at the beginning of the string. To match until the end of the string, add '$' at the end of the regular expression.

To start matching at any point instead of the beginning of the string, use Regex.find(regex).


In [ ]:
import apache_beam as beam

# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_matches = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓, Strawberry, perennial',
          '🥕, Carrot, biennial ignoring trailing words',
          '🍆, Eggplant, perennial',
          '🍅, Tomato, annual',
          '🥔, Potato, perennial',
          '# 🍌, invalid, format',
          'invalid, 🍉, format',
      ])
      | 'Parse plants' >> beam.Regex.matches(regex)
      | beam.Map(print)
  )
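
To make the anchoring rules concrete without running a pipeline, here is a rough sketch of the same idea with plain re.match. The helper match_from_start is hypothetical and is not part of Beam; it only mirrors the behavior described above.

```python
import re

regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
lines = [
    '🍓, Strawberry, perennial',
    '🥕, Carrot, biennial ignoring trailing words',
    '# 🍌, invalid, format',
]

def match_from_start(pattern, text, group=0):
    """Hypothetical helper: the requested group if text matches at its start, else None."""
    m = re.match(pattern, text)
    return m.group(group) if m else None

# Anchored at the start only: trailing words are ignored.
anchored_start = [match_from_start(regex, line) for line in lines]
# Adding '$' also requires the match to reach the end of the string.
anchored_both = [match_from_start(regex + '$', line) for line in lines]
# Selecting a named group instead of the entire match.
icons = [match_from_start(regex, line, group='icon') for line in lines]

print(anchored_start)
print(anchored_both)
print(icons)
```

Here None stands for an element that would be dropped. The carrot line survives the first pattern but not the one ending in '$', and the line starting with '#' never matches because the match must begin at the first character.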

Example 2: Regex match with all groups

Regex.all_matches keeps only the elements that match the regular expression, returning all groups as a list. The groups are returned in the order encountered in the regular expression, including group 0 (the entire match) as the first group.

Regex.all_matches starts to match the regular expression at the beginning of the string. To match until the end of the string, add '$' at the end of the regular expression.

To start matching at any point instead of the beginning of the string, use Regex.find_all(regex, group=Regex.ALL, outputEmpty=False).


In [ ]:
import apache_beam as beam

# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_all_matches = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓, Strawberry, perennial',
          '🥕, Carrot, biennial ignoring trailing words',
          '🍆, Eggplant, perennial',
          '🍅, Tomato, annual',
          '🥔, Potato, perennial',
          '# 🍌, invalid, format',
          'invalid, 🍉, format',
      ])
      | 'Parse plants' >> beam.Regex.all_matches(regex)
      | beam.Map(print)
  )
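
The shape of each output element can be sketched with plain re as well. This is an approximation of the behavior described above, not Beam's implementation.

```python
import re

regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
m = re.match(regex, '🥕, Carrot, biennial ignoring trailing words')

# Group 0 (the entire match) first, then the groups in pattern order.
all_groups = [m.group(0), *m.groups()]
print(all_groups)
```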

Example 3: Regex match into key-value pairs

Regex.matches_kv keeps only the elements that match the regular expression, returning a key-value pair using the specified groups. The argument keyGroup is set to a group number like 3, or to a named group like 'icon'. The argument valueGroup is set to 0 (the entire match) by default, but can be set to a group number like 3, or to a named group like 'icon'.

Regex.matches_kv starts to match the regular expression at the beginning of the string. To match until the end of the string, add '$' at the end of the regular expression.

To start matching at any point instead of the beginning of the string, use Regex.find_kv(regex, keyGroup).


In [ ]:
import apache_beam as beam

# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_matches_kv = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓, Strawberry, perennial',
          '🥕, Carrot, biennial ignoring trailing words',
          '🍆, Eggplant, perennial',
          '🍅, Tomato, annual',
          '🥔, Potato, perennial',
          '# 🍌, invalid, format',
          'invalid, 🍉, format',
      ])
      | 'Parse plants' >> beam.Regex.matches_kv(regex, keyGroup='icon')
      | beam.Map(print)
  )
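
As a sketch of how the key and value groups combine, the following plain re code builds the same kind of pairs. The variable names are illustrative, and the second loop assumes a value group of 2 purely as an example.

```python
import re

regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
lines = [
    '🍓, Strawberry, perennial',
    '🍅, Tomato, annual',
    'invalid, 🍉, format',
]

# Key from the named group 'icon', value from group 0 (the entire match).
pairs = []
for line in lines:
    m = re.match(regex, line)
    if m:
        pairs.append((m.group('icon'), m.group(0)))

# Same key, but taking group 2 (the name) as the value.
names_by_icon = {}
for line in lines:
    m = re.match(regex, line)
    if m:
        names_by_icon[m.group('icon')] = m.group(2)

print(pairs)
print(names_by_icon)
```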

Example 4: Regex find

Regex.find keeps only the elements that match the regular expression, returning the matched group. The argument group is set to 0 (the entire match) by default, but can be set to a group number like 3, or to a named group like 'icon'.

Regex.find matches the first occurrence of the regular expression in the string. To start matching at the beginning, add '^' at the beginning of the regular expression. To match until the end of the string, add '$' at the end of the regular expression.

If you need to match from the start only, consider using Regex.matches(regex).


In [ ]:
import apache_beam as beam

# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_find = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '# 🍓, Strawberry, perennial',
          '# 🥕, Carrot, biennial ignoring trailing words',
          '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',
          '# 🍅, Tomato, annual - 🍉, Watermelon, annual',
          '# 🥔, Potato, perennial',
      ])
      | 'Parse plants' >> beam.Regex.find(regex)
      | beam.Map(print)
  )
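
The difference between matching from the start and finding the first occurrence corresponds to re.match versus re.search in plain Python. A minimal sketch on one of the elements above:

```python
import re

regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
line = '# 🍆, Eggplant, perennial - 🍌, Banana, perennial'

# Matching from the start fails because of the leading '# '.
from_start = re.match(regex, line)

# Searching scans forward and stops at the first occurrence.
first = re.search(regex, line)

# Adding '^' forces the search back to the start of the string.
search_anchored = re.search('^' + regex, line)

print(from_start)
print(first.group(0))
print(search_anchored)
```

Only the eggplant is returned; the banana later in the same string is ignored because the search stops at the first occurrence.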

Example 5: Regex find all

Regex.find_all returns a list of all the matches of the regular expression, returning the matched group. The argument group is set to 0 by default, but can be set to a group number like 3, to a named group like 'icon', or to Regex.ALL to return all groups. The argument outputEmpty is set to True by default, but can be set to False to skip elements where no matches were found.

Regex.find_all matches the regular expression anywhere it is found in the string. To start matching at the beginning, add '^' at the start of the regular expression. To match until the end of the string, add '$' at the end of the regular expression.

If you need to match all groups from the start only, consider using Regex.all_matches(regex).


In [ ]:
import apache_beam as beam

# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_find_all = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '# 🍓, Strawberry, perennial',
          '# 🥕, Carrot, biennial ignoring trailing words',
          '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',
          '# 🍅, Tomato, annual - 🍉, Watermelon, annual',
          '# 🥔, Potato, perennial',
      ])
      | 'Parse plants' >> beam.Regex.find_all(regex)
      | beam.Map(print)
  )
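
Collecting every occurrence in a string can be approximated with re.finditer. The helper find_all_sketch below is hypothetical and only illustrates the group argument and the empty-result case.

```python
import re

regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'

def find_all_sketch(pattern, text, group=0):
    """Hypothetical helper: the requested group from every match, left to right."""
    return [m.group(group) for m in re.finditer(pattern, text)]

line = '# 🍅, Tomato, annual - 🍉, Watermelon, annual'
whole = find_all_sketch(regex, line)
icons = find_all_sketch(regex, line, group='icon')
nothing = find_all_sketch(regex, 'no plants here')

print(whole)
print(icons)
print(nothing)
```

The empty list in the last case is the kind of result that outputEmpty controls: it is either emitted as is or skipped.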

Example 6: Regex find as key-value pairs

Regex.find_kv returns a list of all the matches of the regular expression, returning a key-value pair using the specified groups. The argument keyGroup is set to a group number like 3, or to a named group like 'icon'. The argument valueGroup is set to 0 (the entire match) by default, but can be set to a group number like 3, or to a named group like 'icon'.

Regex.find_kv matches the regular expression anywhere it is found in the string. To start matching at the beginning, add '^' at the beginning of the regular expression. To match until the end of the string, add '$' at the end of the regular expression.

If you need to match as key-value pairs from the start only, consider using Regex.matches_kv(regex, keyGroup).


In [ ]:
import apache_beam as beam

# Matches a named group 'icon', and then two comma-separated groups.
regex = r'(?P<icon>[^\s,]+), *(\w+), *(\w+)'
with beam.Pipeline() as pipeline:
  plants_find_kv = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '# 🍓, Strawberry, perennial',
          '# 🥕, Carrot, biennial ignoring trailing words',
          '# 🍆, Eggplant, perennial - 🍌, Banana, perennial',
          '# 🍅, Tomato, annual - 🍉, Watermelon, annual',
          '# 🥔, Potato, perennial',
      ])
      | 'Parse plants' >> beam.Regex.find_kv(regex, keyGroup='icon')
      | beam.Map(print)
  )

Example 7: Regex replace all

Regex.replace_all returns the string with all the occurrences of the regular expression replaced by another string. You can also use backreferences on the replacement.


In [ ]:
import apache_beam as beam

with beam.Pipeline() as pipeline:
  plants_replace_all = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓 : Strawberry : perennial',
          '🥕 : Carrot : biennial',
          '🍆\t:\tEggplant\t:\tperennial',
          '🍅 : Tomato : annual',
          '🥔 : Potato : perennial',
      ])
      | 'To CSV' >> beam.Regex.replace_all(r'\s*:\s*', ',')
      | beam.Map(print)
  )
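
The cell above uses a fixed replacement string. To illustrate backreferences, here is a small sketch with Python's re.sub, assuming the replacement follows Python's syntax, where \1 refers to the first group. The second input string is made up for the example.

```python
import re

# Plain replacement: every separator becomes a comma.
as_csv = re.sub(r'\s*:\s*', ',', '🍆\t:\tEggplant\t:\tperennial')

# Backreference: \1 re-inserts whatever the first group captured.
bracketed = re.sub(r'(\w+)', r'<\1>', 'Strawberry : perennial')

print(as_csv)
print(bracketed)
```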

Example 8: Regex replace first

Regex.replace_first returns the string with the first occurrence of the regular expression replaced by another string. You can also use backreferences on the replacement.


In [ ]:
import apache_beam as beam

with beam.Pipeline() as pipeline:
  plants_replace_first = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓, Strawberry, perennial',
          '🥕, Carrot, biennial',
          '🍆,\tEggplant, perennial',
          '🍅, Tomato, annual',
          '🥔, Potato, perennial',
      ])
      | 'As dictionary' >> beam.Regex.replace_first(r'\s*,\s*', ': ')
      | beam.Map(print)
  )
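
In plain Python, replacing only the first occurrence corresponds to passing count=1 to re.sub. A minimal sketch for comparison:

```python
import re

line = '🍆,\tEggplant, perennial'

# count=1 stops after the first replacement.
first_only = re.sub(r'\s*,\s*', ': ', line, count=1)

# Without count, every occurrence is replaced.
every = re.sub(r'\s*,\s*', ': ', line)

print(first_only)
print(every)
```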

Example 9: Regex split

Regex.split returns the list of strings that were delimited by the specified regular expression. The argument outputEmpty is set to False by default, but can be set to True to keep empty items in the output list.


In [ ]:
import apache_beam as beam

with beam.Pipeline() as pipeline:
  plants_split = (
      pipeline
      | 'Garden plants' >> beam.Create([
          '🍓 : Strawberry : perennial',
          '🥕 : Carrot : biennial',
          '🍆\t:\tEggplant : perennial',
          '🍅 : Tomato : annual',
          '🥔 : Potato : perennial',
      ])
      | 'Parse plants' >> beam.Regex.split(r'\s*:\s*')
      | beam.Map(print)
  )
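
Empty items appear when two delimiters are adjacent. The sketch below uses re.split and a hypothetical helper, split_sketch, whose output_empty flag only mimics the role of outputEmpty. The second input string is made up to trigger an empty item.

```python
import re

def split_sketch(pattern, text, output_empty=False):
    """Hypothetical helper: split text on pattern, optionally keeping empty items."""
    parts = re.split(pattern, text)
    return parts if output_empty else [part for part in parts if part]

regular = split_sketch(r'\s*:\s*', '🍆\t:\tEggplant : perennial')
dropped = split_sketch(r'\s*:\s*', 'Tomato :: annual')
kept = split_sketch(r'\s*:\s*', 'Tomato :: annual', output_empty=True)

print(regular)
print(dropped)
print(kept)
```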

Related transforms

  • FlatMap behaves the same as Map, but for each input it may produce zero or more outputs.
  • Map applies a simple 1-to-1 mapping function over each element in the collection.