Wandering around a secure area, you come across a datalink port to a new part of the network. After briefly scanning it for interesting files, you find one file in particular that catches your attention. It's compressed with an experimental format, but fortunately, the documentation for the format is nearby.
The format compresses a sequence of characters. Whitespace is ignored. To indicate that some sequence should be repeated, a marker is added to the file, like (10x2)
. To decompress this marker, take the subsequent 10
characters and repeat them 2
times. Then, continue reading the file after the repeated data. The marker itself is not included in the decompressed output.
If parentheses or other characters appear within the data referenced by a marker, that's okay - treat it like normal data, not a marker, and then resume looking for markers after the decompressed section.
For example:
ADVENT
contains no markers and decompresses to itself with no changes, resulting in a decompressed length of 6
.A(1x5)BC
repeats only the B
a total of 5
times, becoming ABBBBBC
for a decompressed length of 7
.(3x3)XYZ
becomes XYZXYZXYZ
for a decompressed length of 9
.A(2x2)BCD(2x2)EFG
doubles the BC
and EF
, becoming ABCBCDEFEFG
for a decompressed length of 11
.(6x1)(1x3)A
simply becomes (1x3)A
- the (1x3)
looks like a marker, but because it's within a data section of another marker, it is not treated any differently from the A
that comes after it. It has a decompressed length of 6
.X(8x2)(3x3)ABCY
becomes X(3x3)ABC(3x3)ABCY
(for a decompressed length of 18
), because the decompressed data from the (8x2)
marker (the (3x3)ABC)
is skipped and not processed further.What is the decompressed length of the file (your puzzle input)? Don't count whitespace.
Any time we encounter brackets of the format (AxB)
, we take in the next A
characters regardless of what they are, and repeat them B
times in the decompressed data. There are a few points to note here:
A
characters contain bracket(s), they are to be ignored and treated as normal characters()
, (A)
, (AxBxC)
, etc. These are to be ignored and not considered as markers.We iterate over each character at a time, and check if it is a start of the marker (bracket). If it is, we skip the next A
characters, and add A * B
to the count.
Since we only need to count the length of the decompressed data, we do not need to actually store it. We can simply use a count
variable to hold the count of the decompressed characters.
We define a function called parse_marker
to get the values of A
and B
from the marker and to calculate the decompressed length. If the start of the sequence is not a (
, it is a single character with the decompressed length of 1
. If the character is (
, it is a marker and we extract the values and count the decompressed length as A x B
.
Reading input: The input needs to be stitched together, ignoring whitespace.
We use the string.split
method that splits a string on any whitespace character. Reading in each line of input from the file, we split it, and join it together back again.
data = [''.join(line.split()) for line in f.readlines()]
Then, we join all lines from the file to create a continuos data stream.
data = ''.join(data)
Another way to read the file would be to read the file character by character usign file.read
. This would require us to check whether each character is a whitespace or not, and then to ignore it if it is.
In [1]:
with open('../inputs/day09.txt', 'r') as f:
data = ''.join([''.join(line.split()) for line in f.readlines()])
# TEST DATA
# data = ''.join([
# 'ADVENT',
# 'A(1x5)BC',
# '(3x3)XYZ',
# 'A(2x2)BCD(2x2)EFG',
# '(6x1)(1x3)A',
# 'X(8x2)(3x3)ABCY'
# ])
Reading in markers, Calculating decompressed length: We use the (very awesome) itertools
module to do the iterating and filtering for us.
We use an iterator to go over the input values, so that we can use the itertools
functions such as takewhile
that selects characters as long as a condition is fulfilled.
def takewhile(condition, data):
filtered_data = []
for item in data:
if condition(data):
filtered_data.append(item)
else:
break
return filtered_data
We use takewhile
to swallow characters until we reach the markers, and then to get the marker itself. Using regular expressions, we extract the two values from the marker. Since we are using iterators, we need to skip
the next A
characters, which we do using a for
loop.
At the end, the answer is in the count
variable.
In [2]:
from itertools import islice, takewhile
import re
numbers = re.compile(r'(\d+)')
def decompress(data_iterator):
'''parses markers and returns index of last character and length of decompressed data'''
count = 0
index = 0
while True:
# handle single tokens that decompress to length 1 until start of marker
count += len(list(takewhile(lambda character: character != '(', data_iterator)))
# extract marker
marker = ''.join(takewhile(lambda character: character != ')', data_iterator))
# extract A and B
try:
a, b = map(int, numbers.findall(marker))
except ValueError:
# EOF or no other markers present
break
# skip the next a characters
for i in range(a):
next(data_iterator)
# increment count
count += a * b
return count
In [3]:
print(decompress(iter(data)))
Apparently, the file actually uses version two of the format.
In version two, the only difference is that markers within decompressed data are decompressed. This, the documentation explains, provides much more substantial compression capabilities, allowing many-gigabyte files to be stored in only a few kilobytes.
For example:
(3x3)XYZ
still becomes XYZXYZXYZ
, as the decompressed section contains no markers.X(8x2)(3x3)ABCY
becomes XABCABCABCABCABCABCY
, because the decompressed data from the (8x2)
marker is then further decompressed, thus triggering the (3x3)
marker twice for a total of six ABC
sequences.(27x12)(20x12)(13x14)(7x10)(1x12)A
decompresses into a string of A
repeated 241920
times.(25x3)(3x3)ABC(2x3)XY(5x2)PQRSTX(18x9)(3x2)TWO(5x7)SEVEN
becomes 445
characters long.Unfortunately, the computer you brought probably doesn't have enough memory to actually decompress the file; you'll have to come up with another way to get its decompressed length.
What is the decompressed length of the file using this improved format?
In this part, we need to keep track of the markers within the skipped marked from part one. As an assumption, we take the approach that no internal marker will extend the limits of the external marker. If it does, we will need to take a different approach to scan the string over and over again. Instead, we use a recursive approach to parse the string by marker and return the correct length.
X(8x2)(3x3)ABCY
(8x2) --> 8 characters: (3x3)ABC multiplied by 2
--> 2 x decompressed (3x3)ABC
--> 2 x 3 x ABC
For this, we extend the decompress
function so that it will return the length of the string plus recursively scan any marker within it and return the final index of the last character scanned. This is the same function as in part one, except that it recusively checks for markers.
The recursive part of this approach is to further decompress the string (or characters) that were skipped in the first part. For this, we use islice
to extract part of the string specified by the markers and recursively call the function on it to get its decompressed length.
In [4]:
def decompress(data_iterator):
count = 0
'''parses markers and returns index of last character and length of decompressed data'''
while(True):
# handle all single characters
count += len(list(takewhile(lambda character: character != '(', data_iterator)))
# marker occurs here, extract marker
marker = ''.join(takewhile(lambda character: character != ')', data_iterator))
# extract A and B
try:
a, b = map(int, numbers.findall(marker))
except ValueError:
break
count += b * decompress(islice(data_iterator, a))
return count
In [5]:
print(decompress(iter(data)))
== END ==