dpp runs the Knesset data pipelines periodically on our server.
This notebook shows how to run the pipelines that render pages for the static website at https://oknesset.org.
In [ ]:
!{'cd /pipelines; KNESSET_LOAD_FROM_URL=1 dpp run --concurrency 4 '\
'./committees/kns_committee,'\
'./people/committee-meeting-attendees,'\
'./members/mk_individual'}
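If you are not sure which pipeline IDs are available, running dpp without arguments from the /pipelines directory prints the list of registered pipelines and their status (standard datapackage-pipelines CLI behaviour):
In [ ]:
!{'cd /pipelines; dpp'}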
This pipeline aggregates the relevant data and supports filtering for quicker development cycles.
To change the filter, uncomment and modify the filter step under the build pipeline in
committees/dist/knesset.source-spec.yaml.
The build pipeline can take a few minutes the first time it runs.
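To see what the build pipeline currently looks like before editing it, you can print the spec file from the notebook. This is a minimal sketch; the path assumes the notebook runs inside the same environment the dpp commands above use, where the pipelines live under /pipelines.
In [ ]:
# print the raw spec so the commented-out filter step is visible as well
print(open('/pipelines/committees/dist/knesset.source-spec.yaml').read())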
In [2]:
!{'cd /pipelines; dpp run --verbose ./committees/dist/build'}
In [ ]:
!{'pip install --upgrade dataflows'}
Restart the kernel if an upgrade was installed.
Choose some session IDs to download protocol files for:
In [1]:
session_ids = [2063122, 2063126]
In [2]:
from dataflows import Flow, load, printer, filter_rows

# load the committee sessions datapackage and keep only the selected sessions
sessions_data = Flow(
    load('/pipelines/data/committees/kns_committeesession/datapackage.json'),
    filter_rows(lambda row: row['CommitteeSessionID'] in session_ids),
    printer(tablefmt='html')
).results()
In [7]:
import os
import subprocess
import sys

# download the parsed protocol text and parts files for the selected sessions
# from the production server into the local pipelines data directory
for session in sessions_data[0][0]:
    for attr in ['text_parsed_filename', 'parts_parsed_filename']:
        pathpart = 'meeting_protocols_text' if attr == 'text_parsed_filename' else 'meeting_protocols_parts'
        url = 'https://production.oknesset.org/pipelines/data/committees/{}/{}'.format(pathpart, session[attr])
        filename = '/pipelines/data/committees/{}/{}'.format(pathpart, session[attr])
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        cmd = 'curl -s -o {} {}'.format(filename, url)
        print(cmd, file=sys.stderr)
        subprocess.check_call(cmd, shell=True)
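Optionally, a quick sanity check (not part of the original pipeline) to confirm the protocol files were downloaded and are not empty before re-rendering:
In [ ]:
import os

for session in sessions_data[0][0]:
    for attr in ['text_parsed_filename', 'parts_parsed_filename']:
        pathpart = 'meeting_protocols_text' if attr == 'text_parsed_filename' else 'meeting_protocols_parts'
        filename = '/pipelines/data/committees/{}/{}'.format(pathpart, session[attr])
        # a missing or zero-byte file usually means the curl download failed silently
        print(filename, os.path.getsize(filename) if os.path.exists(filename) else 'MISSING')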
In [8]:
%%bash
# remove the cached .hash files so the render pipelines below re-process the downloaded protocols
find /pipelines/data/committees/dist -type f -name '*.hash' -delete
In [9]:
!{'cd /pipelines; dpp run ./committees/dist/render_meetings'}
In [10]:
from dataflows import Flow, load, printer, filter_rows, add_field

def add_filenames():
    # add the paths of the rendered html / json files for each meeting
    def _add_filenames(row):
        for ext in ['html', 'json']:
            row['rendered_' + ext] = '/pipelines/data/committees/dist/dist/meetings/{}/{}/{}.{}'.format(
                str(row['CommitteeSessionID'])[0], str(row['CommitteeSessionID'])[1],
                str(row['CommitteeSessionID']), ext)
    return Flow(
        add_field('rendered_html', 'string'),
        add_field('rendered_json', 'string'),
        _add_filenames
    )

rendered_meetings = Flow(
    load('/pipelines/data/committees/dist/rendered_meetings_stats/datapackage.json'),
    add_filenames(),
    filter_rows(lambda row: row['CommitteeSessionID'] in session_ids),
    printer(tablefmt='html')
).results()[0][0]
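To verify that a meeting was actually rendered, you can open one of the rendered JSON files. This is only a sketch: the structure of the rendered JSON is not documented here, so it just prints the top-level keys.
In [ ]:
import json

with open(rendered_meetings[0]['rendered_json']) as f:
    meeting_json = json.load(f)
# print the top-level keys (assuming the rendered JSON is a mapping)
print(list(meeting_json) if isinstance(meeting_json, dict) else type(meeting_json))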
In [13]:
!{'cd /pipelines; dpp run ./committees/dist/render_committees'}
In [12]:
!{'cd /pipelines; dpp run ./committees/dist/create_members,./committees/dist/build_positions,./committees/dist/create_factions'}
To serve the site, locate the local directory corresponding to /pipelines/data/committees/dist/dist and run:
python -m http.server 8000
Pages should be available at http://localhost:8000/
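For convenience, a small helper (not part of the original notebook) to print the local URLs of the meetings rendered above, assuming the http.server is started from the dist/dist directory on port 8000:
In [ ]:
for meeting in rendered_meetings:
    # strip the local dist path prefix to get the path relative to the web root
    path = meeting['rendered_html'].replace('/pipelines/data/committees/dist/dist', '')
    print('http://localhost:8000' + path)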