Day 9: Explosives in Cyberspace

author: Harshvardhan Pandit

license: MIT

link to problem statement

Wandering around a secure area, you come across a datalink port to a new part of the network. After briefly scanning it for interesting files, you find one file in particular that catches your attention. It's compressed with an experimental format, but fortunately, the documentation for the format is nearby.

The format compresses a sequence of characters. Whitespace is ignored. To indicate that some sequence should be repeated, a marker is added to the file, like (10x2). To decompress this marker, take the subsequent 10 characters and repeat them 2 times. Then, continue reading the file after the repeated data. The marker itself is not included in the decompressed output.

If parentheses or other characters appear within the data referenced by a marker, that's okay - treat it like normal data, not a marker, and then resume looking for markers after the decompressed section.

For example:

  • ADVENT contains no markers and decompresses to itself with no changes, resulting in a decompressed length of 6.
  • A(1x5)BC repeats only the B a total of 5 times, becoming ABBBBBC for a decompressed length of 7.
  • (3x3)XYZ becomes XYZXYZXYZ for a decompressed length of 9.
  • A(2x2)BCD(2x2)EFG doubles the BC and EF, becoming ABCBCDEFEFG for a decompressed length of 11.
  • (6x1)(1x3)A simply becomes (1x3)A - the (1x3) looks like a marker, but because it's within a data section of another marker, it is not treated any differently from the A that comes after it. It has a decompressed length of 6.
  • X(8x2)(3x3)ABCY becomes X(3x3)ABC(3x3)ABCY (for a decompressed length of 18), because the decompressed data from the (8x2) marker (the (3x3)ABC) is skipped and not processed further.

What is the decompressed length of the file (your puzzle input)? Don't count whitespace.

Solution logic

Any time we encounter brackets of the format (AxB), we take in the next A characters regardless of what they are, and repeat them B times in the decompressed data. There are a few points to note here:

  • If any of the A characters contain bracket(s), they are to be ignored and treated as normal characters
  • Brackets of other formats could be present in the input. For e.g. (), (A), (AxBxC), etc. These are to be ignored and not considered as markers.
  • Whitespace is to be ignored. So any newlines, spaces, tabs, etc. are ignored and not parsed.

We iterate over each character at a time, and check if it is a start of the marker (bracket). If it is, we skip the next A characters, and add A * B to the count.

Since we only need to count the length of the decompressed data, we do not need to actually store it. We can simply use a count variable to hold the count of the decompressed characters.

We define a function called parse_marker to get the values of A and B from the marker and to calculate the decompressed length. If the start of the sequence is not a (, it is a single character with the decompressed length of 1. If the character is (, it is a marker and we extract the values and count the decompressed length as A x B.

Reading input: The input needs to be stitched together, ignoring whitespace. We use the string.split method that splits a string on any whitespace character. Reading in each line of input from the file, we split it, and join it together back again.

data = [''.join(line.split()) for line in f.readlines()]

Then, we join all lines from the file to create a continuos data stream.

data = ''.join(data)

Another way to read the file would be to read the file character by character usign file.read. This would require us to check whether each character is a whitespace or not, and then to ignore it if it is.


In [1]:
with open('../inputs/day09.txt', 'r') as f:
    data = ''.join([''.join(line.split()) for line in f.readlines()])
#     TEST DATA
#     data = ''.join([
#         'ADVENT',
#         'A(1x5)BC',
#         '(3x3)XYZ',
#         'A(2x2)BCD(2x2)EFG',
#         '(6x1)(1x3)A',
#         'X(8x2)(3x3)ABCY'
#     ])

Reading in markers, Calculating decompressed length: We use the (very awesome) itertools module to do the iterating and filtering for us.

We use an iterator to go over the input values, so that we can use the itertools functions such as takewhile that selects characters as long as a condition is fulfilled.

def takewhile(condition, data):
    filtered_data = []
    for item in data:
        if condition(data):
            filtered_data.append(item)
        else:
            break
    return filtered_data

We use takewhile to swallow characters until we reach the markers, and then to get the marker itself. Using regular expressions, we extract the two values from the marker. Since we are using iterators, we need to skip the next A characters, which we do using a for loop.

At the end, the answer is in the count variable.


In [2]:
from itertools import islice, takewhile
import re
numbers = re.compile(r'(\d+)')


def decompress(data_iterator):
    '''parses markers and returns index of last character and length of decompressed data'''
    count = 0
    index = 0

    while True:
        #  handle single tokens that decompress to length 1 until start of marker
        count += len(list(takewhile(lambda character: character != '(', data_iterator)))
        #  extract marker
        marker = ''.join(takewhile(lambda character: character != ')', data_iterator))
        #  extract A and B
        try:
            a, b = map(int, numbers.findall(marker))
        except ValueError:
            # EOF or no other markers present
            break
        # skip the next a characters
        for i in range(a):
            next(data_iterator)
        # increment count
        count += a * b
    
    return count

In [3]:
print(decompress(iter(data)))


110346

Part Two

Apparently, the file actually uses version two of the format.

In version two, the only difference is that markers within decompressed data are decompressed. This, the documentation explains, provides much more substantial compression capabilities, allowing many-gigabyte files to be stored in only a few kilobytes.

For example:

  • (3x3)XYZ still becomes XYZXYZXYZ, as the decompressed section contains no markers.
  • X(8x2)(3x3)ABCY becomes XABCABCABCABCABCABCY, because the decompressed data from the (8x2) marker is then further decompressed, thus triggering the (3x3) marker twice for a total of six ABC sequences.
  • (27x12)(20x12)(13x14)(7x10)(1x12)A decompresses into a string of A repeated 241920 times.
  • (25x3)(3x3)ABC(2x3)XY(5x2)PQRSTX(18x9)(3x2)TWO(5x7)SEVEN becomes 445 characters long.

Unfortunately, the computer you brought probably doesn't have enough memory to actually decompress the file; you'll have to come up with another way to get its decompressed length.

What is the decompressed length of the file using this improved format?

Solution logic

In this part, we need to keep track of the markers within the skipped marked from part one. As an assumption, we take the approach that no internal marker will extend the limits of the external marker. If it does, we will need to take a different approach to scan the string over and over again. Instead, we use a recursive approach to parse the string by marker and return the correct length.

X(8x2)(3x3)ABCY

(8x2) --> 8 characters: (3x3)ABC multiplied by 2
      --> 2 x decompressed (3x3)ABC
      --> 2 x 3 x ABC


For this, we extend the decompress function so that it will return the length of the string plus recursively scan any marker within it and return the final index of the last character scanned. This is the same function as in part one, except that it recusively checks for markers.

The recursive part of this approach is to further decompress the string (or characters) that were skipped in the first part. For this, we use islice to extract part of the string specified by the markers and recursively call the function on it to get its decompressed length.


In [4]:
def decompress(data_iterator):
    count = 0
    '''parses markers and returns index of last character and length of decompressed data'''
    while(True):
        # handle all single characters
        count += len(list(takewhile(lambda character: character != '(', data_iterator)))
        # marker occurs here, extract marker
        marker = ''.join(takewhile(lambda character: character != ')', data_iterator))
        # extract A and B
        try:
            a, b = map(int, numbers.findall(marker))
        except ValueError:
            break
        count += b * decompress(islice(data_iterator, a))
    return count

In [5]:
print(decompress(iter(data)))


10774309173

== END ==