Load and prepare data

Objective: Load news and tweet data from the raw data files into a SQLite (sqlite3) db.

Last modified: 2017-10-15

Roadmap

  1. Copy Meng's original data folder to DATA_DIR, unzip it, and check the format.
  2. Create db. Build tables for news and tweets.
  3. Bulk load news and tweets data into db.
  4. Check basic statistics.
  5. Build df pickles for news over the selected period.

Steps


In [1]:
"""
Initialization
"""

'''
Standard modules
'''
import os
import sqlite3
import csv
import time
import codecs
from pprint import pprint

'''
Analysis modules
'''
import pandas as pd

'''
Custom modules
'''
import config
import utilities

'''
Misc
'''
nb_name = '20171002-daheng-load_and_prepare_data'

Copy Meng's original data folder to DATA_DIR, unzip it, and check the format.


In [2]:
"""
Register
    MENG_NEWS_TWEETS_DIR = os.path.join(DATA_DIR, 'raw-news_tweets-meng') DEPRECATED
    ORIGINAL_NEWS_TWEETS_DIR = os.path.join(DATA_DIR, 'raw-news_tweets-original')
in config.
"""


Out[2]:
"\nRegister\n    MENG_NEWS_TWEETS_DIR = os.path.join(DATA_DIR, 'raw-news_tweets-meng') DEPRECATED\n    ORIGINAL_NEWS_TWEETS_DIR = os.path.join(DATA_DIR, 'raw-news_tweets-original')\nin config.\n"

The raw data files are located in the ORIGINAL_NEWS_TWEETS_DIR folder.


In [3]:
! ls -1 ./data/raw-news_tweets-original/


dataset1
dataset2

In [4]:
! ls ./data/raw-news_tweets-original/dataset1/news


2014-11-18.txt	2014-12-28.txt	2015-02-05.txt	2015-03-16.txt	2015-04-30.txt
2014-11-19.txt	2014-12-29.txt	2015-02-06.txt	2015-03-17.txt	2015-05-01.txt
2014-11-20.txt	2014-12-30.txt	2015-02-07.txt	2015-03-18.txt	2015-05-02.txt
2014-11-21.txt	2014-12-31.txt	2015-02-08.txt	2015-03-19.txt	2015-05-03.txt
2014-11-22.txt	2015-01-01.txt	2015-02-09.txt	2015-03-20.txt	2015-05-04.txt
2014-11-23.txt	2015-01-02.txt	2015-02-10.txt	2015-03-21.txt	2015-05-05.txt
2014-11-24.txt	2015-01-03.txt	2015-02-11.txt	2015-03-22.txt	2015-05-06.txt
2014-11-25.txt	2015-01-04.txt	2015-02-12.txt	2015-03-29.txt	2015-05-07.txt
2014-11-26.txt	2015-01-05.txt	2015-02-13.txt	2015-03-30.txt	2015-05-08.txt
2014-11-27.txt	2015-01-06.txt	2015-02-14.txt	2015-03-31.txt	2015-05-09.txt
2014-11-28.txt	2015-01-07.txt	2015-02-15.txt	2015-04-01.txt	2015-05-10.txt
2014-11-29.txt	2015-01-08.txt	2015-02-16.txt	2015-04-02.txt	2015-05-11.txt
2014-11-30.txt	2015-01-09.txt	2015-02-17.txt	2015-04-03.txt	2015-05-12.txt
2014-12-01.txt	2015-01-10.txt	2015-02-18.txt	2015-04-04.txt	2015-05-13.txt
2014-12-02.txt	2015-01-11.txt	2015-02-19.txt	2015-04-05.txt	2015-05-14.txt
2014-12-03.txt	2015-01-12.txt	2015-02-20.txt	2015-04-06.txt	2015-05-15.txt
2014-12-04.txt	2015-01-13.txt	2015-02-21.txt	2015-04-07.txt	2015-05-16.txt
2014-12-05.txt	2015-01-14.txt	2015-02-22.txt	2015-04-08.txt	2015-05-17.txt
2014-12-06.txt	2015-01-15.txt	2015-02-23.txt	2015-04-09.txt	2015-05-18.txt
2014-12-07.txt	2015-01-16.txt	2015-02-24.txt	2015-04-10.txt	2015-05-19.txt
2014-12-08.txt	2015-01-17.txt	2015-02-25.txt	2015-04-11.txt	2015-05-20.txt
2014-12-09.txt	2015-01-18.txt	2015-02-26.txt	2015-04-12.txt	2015-05-21.txt
2014-12-10.txt	2015-01-19.txt	2015-02-27.txt	2015-04-13.txt	2015-05-22.txt
2014-12-11.txt	2015-01-20.txt	2015-02-28.txt	2015-04-14.txt	2015-05-23.txt
2014-12-12.txt	2015-01-21.txt	2015-03-01.txt	2015-04-15.txt	2015-05-24.txt
2014-12-13.txt	2015-01-22.txt	2015-03-02.txt	2015-04-16.txt	2015-05-25.txt
2014-12-14.txt	2015-01-23.txt	2015-03-03.txt	2015-04-17.txt	2015-05-26.txt
2014-12-16.txt	2015-01-24.txt	2015-03-04.txt	2015-04-18.txt	2015-05-27.txt
2014-12-17.txt	2015-01-25.txt	2015-03-05.txt	2015-04-19.txt	2015-05-28.txt
2014-12-18.txt	2015-01-26.txt	2015-03-06.txt	2015-04-20.txt	2015-05-29.txt
2014-12-19.txt	2015-01-27.txt	2015-03-07.txt	2015-04-21.txt	2015-05-30.txt
2014-12-20.txt	2015-01-28.txt	2015-03-08.txt	2015-04-22.txt	2015-05-31.txt
2014-12-21.txt	2015-01-29.txt	2015-03-09.txt	2015-04-23.txt	2015-06-01.txt
2014-12-22.txt	2015-01-30.txt	2015-03-10.txt	2015-04-24.txt	2015-06-02.txt
2014-12-23.txt	2015-01-31.txt	2015-03-11.txt	2015-04-25.txt	2015-06-03.txt
2014-12-24.txt	2015-02-01.txt	2015-03-12.txt	2015-04-26.txt	2015-06-04.txt
2014-12-25.txt	2015-02-02.txt	2015-03-13.txt	2015-04-27.txt	2015-06-05.txt
2014-12-26.txt	2015-02-03.txt	2015-03-14.txt	2015-04-28.txt
2014-12-27.txt	2015-02-04.txt	2015-03-15.txt	2015-04-29.txt

In [5]:
! ls ./data/raw-news_tweets-original/dataset1/tweets/2014-11-18/


2685_Missouris_Nixon_Declares_State_of_Emergency_Awaiting_Grand_Jury-Businessweek
2686_PEOPLE_Bill_Cosby_Charles_Manson_Solange_Knowles_and_more-Pittsburgh_Post_Gazette
2687_Ebola_patient_who_died_had_received_ZMapp_late_in_his_treatment-Los_Angeles_Times
2688_At_least_4_dead_in_attack_in_Kabul_official_says-Fox_News
2689_Australia_will_not_be_at_periphery_of_Indias_vision_Modi-The_Hindu
2690_FBI_Violence_could_follow_Ferguson_indictment_decision-CBS_News
2691_Four_Killed_in_Palestinian_Attack_at_Jerusalem_Synagogue-Businessweek
2692_Mass_murderer_Charles_Manson_issued_marriage_license_may_get_hitched_next_-New_York_Daily_News
2693_News_Guide_Texas_latest_history_textbook_tussle-Washington_Post
2694_AbdulRahman_Kassigs_parents_mourn_beloved_son-BBC_News
2695_Obama_orders_full_review_of_US_hostage_policy-Fox_News
2696_Homeless_Children_in_US_A_parenttoparent_approach_to_help_kids_video-Christian_Science_Monitor
2697_Alleged_Bill_Cosby_victim_has_connection_to_Colorado-KRDO
2698_Church_of_England_approves_women_bishops-Network_Norwich
2699_Uber_executive_wants_to_dig_into_personal_lives_discredit_journalists_who_cover_-New_York_Daily_News
2700_Suicide_blast_kills_two_at_Kabuls_foreign_compound-BBC_News
2701_Answers_to_questions_about_the_Ferguson_grand_jury-Chroncom
2702_Four_Killed_in_Jerusalem_Synagogue_Complex-New_York_Times
2703_Source_Charles_Manson_fiance_get_marriage_license-CNN
2704_Suicide_Attack_in_Afghan_Capital_Kills_2-New_York_Times
2705_Hong_Kong_Protesters_Greet_Court_Officials_With_Indifference-TIME
2706_Europeans_have_prominent_role_in_beheading_video-Salt_Lake_Tribune
2707_Deals_Heat_Up_for_Lawyers_Like_Its_1998_Business_of_Law-Businessweek
2708_Cupich_set_to_become_9th_archbishop_of_Chicago-MiamiHeraldcom
2709_4_Israelis_2_Palestinians_killed_in_synagogue_attack_Israeli_police_say-CNN
2710_Missouri_Gov_Jay_Nixon_Declares_State_Of_Emergency_Ahead_Of_Grand_Jury_Decision-Huffington_Post
2711_French_National_Identified_In_Islamic_State_Beheading_Video-Chinatopix
2712_Suicide_bombing_near_coalition_base_in_Kabul_kills_2_security_officers-Los_Angeles_Times
2713_Baseball_notes_Nov_17_Stanton_gets_record_325_million_deal-SFGate
2714_Surgeon_dies_of_Ebola_at_Nebraska_hospital_after_contracting_disease_in_Sierra_-New_York_Daily_News
2715_Some_Barricades_Cleared_From_HK_Protest_Site-ABC_News
2716_Pakistan_ranks_third_among_terrorhit_countries-The_News_International
2717_Gunfight_at_godman_Rampals_ashram_in_Hisar_devotees_take_on_cops-Hindustan_Times
2718_AP_Exclusive_Charles_Manson_gets_marriage_license-seattlepicom
2719_Philadephiaarea_Catholics_excited_about_pontiffs_visit-USA_TODAY
2720_India_sixth_worst_affected_country_by_terrorism_in_2013_Report-Times_of_India
2721_Hong_Kong_authorities_clear_part_of_Admiralty_protest_site-BBC_News
2722_Should_Uber_Fire_Exec_Who_Suggested_Investigating_Reporters_Personal_Lives-Recode
2723_Cruel_murder_Netanyahu_Kerry_denounce_terror_attack_on_Jerusalem_synagogue-Fox_News
2724_Missouri_Governor_Activates_National_Guard_Ahead_of_Ferguson_Grand_Jury_Ruling-NBCNewscom
2725_Taliban_Suicide_Attacker_Kills_4_in_Kabul-ABC_News
2726_Bill_Cosby_77_Will_Still_Premiere_on_Netflix-Netflix_Life
2727_Uber_executive_suggests_digging_up_dirt_on_media_critics_BuzzFeed-Reuters
2728_Indian_PM_Modi_Urges_Greater_Security_Economic_Ties_With_Australia-Voice_of_America
2729_Study_Finds_Alternative_to_AntiCholesterol_Drug-New_York_Times
2730_Four_dead_in_suspected_Palestinian_attack_on_Jerusalem_synagogue-Reuters
2731_Answers_to_questions_about_the_Ferguson_grand_jury-Fort_Worth_Star_Telegram
2732_Charles_Manson_reluctantly_applies_for_marriage_license_to_wed_girlfriend-Washington_Post
2733_UPDATE_3Truck_bomb_kills_two_in_attack_on_foreign_base_in_Kabul-Reuters
2734_All_50_States_Face_Winter_Whack_5_Feet_of_Snow_Forecast_Near_Buffalo-NBCNewscom
2735_Deals_Heat_Up_for_Lawyers_Like_Its_1998_Business_of_Law-Bloomberg
2736_Israel_We_will_respond_with_a_heavy_hand_after_synagogue_attack_kills_4-CNN
2737_Missouri_governor_declares_state_of_emergency_in_Ferguson_St_Louis_region-New_York_Daily_News
2738_Charles_Manson_reluctantly_applies_for_marriage_license_to_wed_girlfriend-Washington_Post
2739_Supporters_of_oil_pipeline_scramble_for_last_vote-Buffalo_News
2740_Police_Storm_Ashram_in_India_in_Search_of_Guru-ABC_News
2741_Modi_Abbott_agree_on_closer_cooperation_on_security_and_trade-Times_of_India
2742_Missouri_governor_declares_state_of_emergency_ahead_of_ruling_on_Ferguson_shooting-Reuters
2743_Keystone_Vote_May_Be_Too_Late_to_Help_Democrat_Hold_Senate_Seat-Businessweek
2744_Deaths_caused_by_terrorism_rises_by_61_percent_report_shows-Deutsche_Welle
2745_Truck_bomb_hits_foreign_base_in_Afghan_capital_kills_two-Reuters
2746_Toyota_bets_on_hydrogen_fuel_cell_technology-Financial_Times
2747_Followers_of_wanted_Indian_guru_hold_out_against_police-Reuters
2748_Uber_exec_proposed_publishing_journalists_personal_secrets_to_fight_bad_press-Washington_Post
2749_Prime_Minister_Narendra_Modi_in_Australia_Parliament-Indian_Express
2750_Can_a_Keystone_pipeline_vote_help_Mary_Landrieu-CBS_News
2751_Small_plane_crashes_into_home_near_Chicagos_Midway_airport_PHOTOS-Atlanta_Journal_Constitution
2752_Whoopi_Goldberg_defends_Bill_Cosby_over_rape_allegations_I_have_a_lot_of_-The_Independent
2753_Haryana_police_storm_godman_Rampals_ashram_main_gate_damaged_Report-Times_of_India
2754_Child_Homelessness_Reaches_Record_High-Neon_Tommy
2755_Attackers_storm_Jerusalem_synagogue_killing_4_worshippers-Washington_Post
2756_Deaths_Linked_to_Terrorism_Are_Up_60_Percent_Study_Finds-New_York_Times
2757_Russia_sees_no_chance_of_breakthrough_on_Ukraine_in_German_ministers_visit-Reuters
2758_Small_Plane_Crashes_Into_Home_Near_Chicagos_Midway_Airport-NBC_Chicago
2759_Suicide_bombers_attack_foreigner_compound_in_Afghan_capital_killing_2-Washington_Post
2760_Why_liberals_are_turning_on_Bill_Cosby_over_rape_allegations-Fox_News
2761_Japans_Abe_Calls_Early_Election_to_Save_His_Grand_Economic_Plan-TIME
2762_At_Least_6_Killed_at_Jerusalem_Synagogue-Voice_of_America
2763_National_Guard_Prepares_for_More_Ferguson_Unrest-ABC_News
2764_Global_terrorism_on_rise_Fivefold_increase_in_terrorrelated_deaths_since_2000_—_RT_News-RT
2765_Small_Plane_Crashes_Into_Chicago_Home-ABC_News
2766_Home_Depot_profit_beats_estimates_as_US_job_market_improves-Reuters
2767_Indian_PM_jokingly_accuses_Tony_Abbott_of_shirtfronting_Australia-Telegraphcouk
2768_As_Missouri_awaits_decision_on_police_shooting_National_Guard_called_in-Reuters
2769_Terror_Deaths_Up_by_37_Study-Newsweek_Pakistan
2770_Uber_rides_into_new_PR_storm_over_digging_dirt_on_hostile_press-Fortune
2771_State_Board_Mulls_New_History_Textbooks_«_CBS_Dallas__Fort_Worth-CBS_Local
2772_Cupich_set_to_become_9th_archbishop_of_Chicago-seattlepicom
2773_FBI_warns_Ferguson_grand_jury_decision_will_likely_lead_to_violence-Fox_News
2774_Charles_Manson_gets_marriage_license-The_Detroit_News
2775_Small_plane_crashes_into_Chicago_home_police_say-Fox_News
2776_Supporters_Of_Keystone_Oil_Pipeline_Scramble_For_Last_Vote__NewsOn6com__Tulsa_OK__News_Weather_Video_and_Sports__KOTVcom-News_On_6
2777_LakeEffect_Snow_Pummels_New_York_Closes_Thruway-ABC_News
2778_Uber_Exec_Suggests_Spending_1_Million_To_Dig_Up_Dirt_On_Journalists_BuzzFeed-Huffington_Post
2779_Japan_leads_world_markets_higher_on_stimulus_hopes-Businessweek
2780_Mary_Landrieu_scrambles_for_60th_vote_for_Keystone-Politico
2781_Plane_crashes_into_Chicago_home_but_elderly_couple_survives-Christian_Science_Monitor
2782_Bernie_Sanders_Has_Found_His_Grassroots_Support_Base_The_Fake_News_Audience-National_Journal
2783_House_Democrats_lash_out_at_Nancy_Pelosi-MarketWatch
2785_What_explains_the_continuing_fascination_with_Charles_Manson-BBC_News
2786_NFL_suspends_Vikings_RB_Adrian_Peterson_without_pay_for_remainder_of_the_-New_York_Daily_News
2787_Obama_orders_review_of_hostage_policy__World_News-Independent_Online
2788_NM_Rep_Ben_Ray_Luján_tapped_to_head_Democratic_campaign_committee-Fox_News_Latino
2789_US_Producer_Prices_Rise_02_Percent_in_October-ABC_News
2790_Upholding_the_Sanctity_of_Marriage_for_Charles_Manson-Huffington_Post
2791_NFL_suspends_Adrian_Peterson_without_pay_for_at_least_rest_of_regular_season-USA_TODAY
2792_Small_Cargo_Plane_Crashes_Into_Chicago_Home-ABC_News
2793_Lakeeffect_snow_snarls_Buffalo_flights_in_wake_of_storm-USA_TODAY
2794_Uber_Just_Stuck_a_Knife_in_the_Republican_Partys_Heart-New_York_Magazine
2795_Jose_Canseco_says_hes_selling_detached_finger_and_digitblasting_gun-Los_Angeles_Times
2796_Jennifer_Lawrence_and_the_stars_of_The_Hunger_Games_Mockingjay_Part_1_stun_-New_York_Daily_News
2797_Disappointment_Becomes_GlobalGrowth_Norm_as_Japan_Contracts-Businessweek
2798_Americas_Disastrous_History_of_Pipeline_Accidents_Shows_Why_the_Keystone_Vote_Matters-Huffington_Post
2799_AP_Exclusive_Charles_Manson_Plans_Prison_Wedding-ABC_News
2800_Missouri_Gov_Jay_Nixon_issues_state_of_emergency_ahead_of_Ferguson_grand_jury_decision-Fox_News
2801_Peterson_Suspended_Without_Pay_for_Rest_of_Season_by_NFL-Bloomberg
2802_And_the_2014_Word_of_the_Year_Is…-Bostoncom
2803_Everything_you_need_to_know_about_the_Bill_Cosby_sexual_assault_allegations-Vox
2804_Former_Slugger_Jose_Canseco_Plans_To_Put_His_Finger_On_EBay-CBS_Local
2805_House_Democrats_reelect_Pelosi_as_minority_leader-SFGate
2806_Keystone_Supporters_Hustle_to_Get_60_Yeses_for_Senate_Vote_on_Tuesday-Wall_Street_Journal
2807_14_Questions_And_Answers_About_The_Ferguson_Grand_Jury-Huffington_Post
2808_Chicago_plane_crashes_into_home_near_Midway_1_dead-WLSTV
2809_Oxford_names_vape_2014_Word_of_the_Year-USA_TODAY
2810_Uber_Responds_to_BuzzFeed_Report_on_Journalism_Smear_Campaign-Fox_Business
2811_Obama_Orders_Review_Of_US_Hostage_Policy-New_Hampshire_Public_Radio
2812_Jerusalem_Attack_A_Look_at_the_Victims-NBCNewscom
2813_AP_Exclusive_Charles_Manson_gets_marriage_license-SFGate
2814_Tie_Keystone_approval_to_bigger_environmental_goals-Boston_Globe
2815_Myers_With_Adrian_Peterson_ban_NFL_commish_gets_it_right-New_York_Daily_News
2816_Jonathan_Grubers_Obamacare_comments_Ferguson_grand_jury_and_more-Boston_Globe
2817_Ubers_Plan_to_Win_Over_the_Press_Backfires-Vanity_Fair
2818_President_Orders_Review_of_US_Hostage_Policies-Wall_Street_Journal
2819_Israels_Lone_Wolf_Attacks_Show_Weapons_Threat_Hard_to_Track-Bloomberg
2820_GOP_Vows_to_Pass_Keystone_Later_If_Bill_Fails_Now-ABC_News
2821_National_Guard_prepares_for_more_Ferguson_unrest-SFGate
2822_Uber_Exec_in_Hot_Water_After_Suggesting_Smear_Campaign_on_Journalists-Bostoncom
2823_Where_does_Adrian_Petersons_NFL_career_go_from_here-USA_TODAY
2824_Plane_misses_elderly_couple_by_8_inches_after_crashing_into_home_near_Midway-Chicago_Tribune
2825_Obama_orders_review_of_US_policy_on_hostages-Channel_News_Asia
2826_Oxford_Dictionaries_2014_Word_of_the_Year_is_vape-CTV_News
2827_The_new_PEANUTS_trailer_is_here_And_its_everything_we_ever_wanted-moviepilotcom
2828_Israels_Lone_Wolf_Attacks_Show_Weapons_Threat_Hard_to_Track-Businessweek
2829_Louisianas_Landrieu_Silent_at_Almost_70_of_Energy_Hearings-Bloomberg
2830_Obama_order_could_protect_thousands_of_illegal_immigrants_in_Md_Va_report_says-Washington_Post
2831_Browns_waive_running_back_Ben_Tate-Washington_Post
2832_Bob_Marley_Named_As_Face_of_Global_Marijuana_Brand-GossipCentercom
2833_Keystone_backers_scramble_for_last_vote_on_bill_Boehner_warns_Obama_against_veto-Fox_News
2834_NFL_suspends_Adrian_Peterson_for_remainder_of_2014_season-Fox_News
2835_House_Democrats_Reelect_Pelosi_as_Minority_Leader-US_News__World_Report
2836_Oxford_chooses_vape_as_its_2014_Word_of_the_Year-Reuters
2837_Obama_Orders_Review_of_Hostage_Policy-New_York_Times
2838_Four_rabbis_killed_in_Jerusalem_synagogue_terror_attack-USA_TODAY
2839_Keystone_XL_chances_dim_in_Senate_as_King_says_no-Reuters
2840_What_Charles_Mansons_Future_MotherinLaw_Thinks_About_Wedding-ABC_News
2841_Report_Alleged_Officer_Warns_Ferguson_If_You_Do_Not_Have_a_Gun_Get_One-CBS_Local
2842_Tech_world_calls_out_Uber_for_thuggish_behavior-USA_TODAY
2843_Liberals_oppose_Himes_in_House_Democratic_race-Middletown_Press
2844_President_Obama_Orders_Full_Review_of_Hostage_Negotiation_Policy-ABC_News
2845_Charles_Manson_Set_to_Tie_the_Knot_With_26YearOld_Woman-ABC_News
2846_Ferguson_Activists_Prepare_Havens_for_PostDecision_Protests-Bloomberg
2847_Toyota_aims_to_replicate_Prius_success_with_fuel_cell_Mirai-Reuters
2848_Official_Bob_Marley_Marijuana_Is_Coming-TIME
2849_Three_Americans_Among_Four_Rabbis_Slaughtered_in_Jerusalem_Synagogue-NBCNewscom
2850_Senator_Landrieus_Hail_Mary_goes_beyond_Keystone_XL_pipeline-Christian_Science_Monitor
2851_Charles_Manson_fan_insists_she_will_marry_the_80yearold_murderer-Los_Angeles_Times
2852_Uber_Draws_Fire_After_Executive_Suggests_Investigating_Reporters-Businessweek
2853_Earnest_Old_Gruber_Videos_Are_Not_Views_That_Are_Shared_By_Anybody_At_The_White_House-RealClearPolitics
2854_Crime_and_Inept_Punishment_Sheriff_Roger_Goodell_Is_Barney_Fife_Once_More-Bostoncom
2855_Vape_is_Oxford_Dictionaries_Word_Of_The_Year-BBC_News
2856_Putin_says_US_wants_to_subdue_Russia-Daily_Mail
2857_Top_Republican_floats_new_attack_plan_for_Obamas_immigration_action-Fox_News
2858_Dear_relatives_not_in_SoCal_We_really_dont_miss_your_crazy_snowstorms-OCRegister
2859_Cost_to_Treat_Ebola_1_Million_For_Two_Patients-NBCNewscom
2860_Fla_sees_big_rise_of_residents_in_US_illegally-Washington_Post
2861_Uber_cant_sweep_execs_revenge_campaign_under_the_car_mat-Los_Angeles_Times
2862_The_NFL_Suspends_Adrian_Peterson_and_the_Sponsors_Stay_Quiet-Businessweek
2863_Lakeeffect_snow_pummels_New_York_closes_Thruway-Sacramento_Bee
2864_East_Coast_popular_for_immigrants_in_US_illegally-Charlotte_Observer
2865_Vape_is_English_Word_of_the_Year_for_2014_Oxford_Says-Voice_of_America
2866_National_Guard_coming_to_help_dig_out_from_colossal_effect_storm-Buffalo_News
2867_Pilot_dies_after_small_plane_crashes_into_Chicago_home-Boston_Globe
2868_Who_is_Hannibal_Buress_and_why_did_he_call_Bill_Cosby_a_rapist-CBS_News
2869_Obama_will_not_change_policy_against_paying_ransom_for_hostages-Reuters
2870_Winter_Whack_Nation_Faces_Arctic_Chill_Almost_6_Feet_of_Snow_Forecast_Near_Buffalo-NBCNewscom
2871_Ferguson_dilemma_Was_calling_up_National_Guard_the_right_move-Christian_Science_Monitor
2872_Democrats_ReElect_Nancy_Pelosi_As_House_Minority_Leader_Amid_Criticism_Over_2014_Midterm_Elections-International_Business_Times
2873_Can_Uber_afford_to_have_this_many_enemies-Christian_Science_Monitor
2874_Synagogue_attack_Netanyahu_vow_in_battle_for_Jerusalem-BBC_News
2875_Vape_is_word_of_the_year_for_2014-Sydney_Morning_Herald
2876_Louisiana_Senate_Seat_Is_Real_Reward_in_Keystone_Pipeline_Vote-New_York_Times
2877_What_May_Happen_to_Officer_Darren_Wilson_After_Ferguson_Grand_Jury_Decision-ABC_News
2878_Bill_Cosby_hunkers_down_as_scandal_rages-Boston_Herald
2879_Chicago_plane_crashes_into_home_near_Midway_pilot_killed-WLSTV
2880_NY_agency_aids_more_than_100_snowstranded_drivers-Chroncom
2881_UPDATE_1Seventh_Sierra_Leone_doctor_killed_by_Ebola_source-Reuters_Africa
2882_For_Obama_Executive_Action_Will_Not_Be_Limited_to_Immigration-NBCNewscom
2883_Vape_named_as_Oxford_English_Dictionarys_2014_word_of_the_year-ABC_Online
2884_Gruber_frequently_visited_White_House-The_Hill
2885_Heres_Everything_We_Know_and_Dont_Know_About_the_Bill_Cosby_Rape_-TIME
2886_Obama_orders_review_of_US_hostage_policy-Lynchburg_News_and_Advance
2887_Uber_CEO_Apologies_For_Execs_Terrible_Suggestion_That_The_Company_-Huffington_Post
2888_With_beheading_deaths_of_Americans_Obama_orders_review_of_US_response_to_hostage_takings-Minneapolis_Star_Tribune
2889_Bill_Cosby_Rape_Accuser_Joan_Tarshis_Reveals_Details_Of_Horrifying_Attack-Popdust
2890_House_Democratic_Leaders_Hold_Caucus_Meeting_To_Elect_Leaders_For_114th_Congress-Getty_Images
2891_UVa_student_Hannah_Grahams_death_the_result_of_homicidal_violence_officials_-Washington_Post
2892_Adrian_Peterson_will_not_return_this_weekend_against_Green_Bay_updated-Washington_Post
2893_Palestinians_kill_five_in_Jerusalem_synagogue_attack-Reuters
2894_Keystones_Big_Senate_Test_A_Search_For_One_Vote-ABC_News
2895_Missouri_Gov_swears_in_Ferguson_panel_ahead_of_grand_jury_decision_in_shooting-Washington_Post
2896_LISTEN_Bill_Cosbys_1969_riff_on_drugging_womens_drinks-Chicago_SunTimes
2897_Obama_orders_review_of_US_hostage_policy-STLtodaycom
2898_Ubers_vast_trove_of_customer_data_is_ripe_for_abuse-Vox
2899_Senate_Narrowly_Defeats_Keystone_XL_Bill-Wall_Street_Journal
2900_Oxford_Dictionaries_word_of_2014_Have_you_ever_heard_of_it__One_News-TVNZ
2901_What_will_Bill_Cosbys_legacy_be-USA_TODAY
2902_Obama_Nowhere_Near_Out_of_the_Woods_on_Ebola-Wall_Street_Journal
2903_The_Short_List_Uber_wants_to_silence_journalists_Keystone_bill_fails_Peterson_-USA_TODAY
2904_Keystone_Vote_Falls_Short_in_Senate-NBCNewscom
2905_Jerusalem_synagogue_attack_Lone_wolf_pattern_seen_in_deadly_assault-Los_Angeles_Times
2906_Western_New_York_Snow_Storm_Could_Set_Records-Huffington_Post
2907_Justice_Department_Probe_Of_Ferguson_Police_Could_Spur_Broad_Change-Huffington_Post
2908_Janice_Dickinson_accuses_Bill_Cosby_of_sexual_assault_during_1982_hotel_meetup-New_York_Daily_News
2909_West_Africa_nowhere_near_out_of_woods_on_Ebola_Obama__Xinhua-Xinhua
2910_Obama_orders_review_of_US_hostage_policy-KOMO_News
2911_Tracy_Morgan_Still_Struggling_but_Fighting_to_Get_Better_After_Brain_Injury_-E_Online
2912_Will_Kim_Jongun_face_mass_crimes_prosecution_at_The_Hague_video-Christian_Science_Monitor
2913_Keystone_Pipeline_Fails_to_Get_Through_Senate-ABC_News
2914_Tech_lobby_to_keep_tabs_on_NSA_reform_votes-The_Hill
2915_Lawyer_Tracy_Morgan_Still_Struggling_With_Severe_Brain_Injury-NBCNewscom
2916_UPDATE_1Seventh_Sierra_Leone_doctor_killed_by_Ebola_source-Reuters_Africa
2917_Quinn_Uber_grapples_with_its_aggressive_image-San_Jose_Mercury_News
2918_Senate_Narrowly_Defeats_Keystone_XL_Pipeline-New_York_Times
2919_Senate_Republicans_Block_Sweeping_Overhaul_of_NSA_Program-New_York_Times
2920_Obama_orders_review_of_hostage_policy-WTSP_10_News
2921_Ebola_Researchers_Race_to_Slow_Epidemic-Voice_of_America
2922_Sarah_Lacy_on_Uber_Im_doing_everything_I_can_to_keep_my_family_safe-USA_TODAY
2923_Police_nab_man_sought_after_fatal_NYC_subway_shove-seattlepicom
2924_UN_panel_calls_for_NKorea_referral_to_international_court-Reuters
2925_Senate_defeats_Keystone_XL_pipeline-USA_TODAY
2926_Middle_EastJewish_Victims_All_From_One_Jerusalem_Street_Were_a_-New_York_Times
2927_Obama_orders_review_of_the_policy_on_terroristrelated_hostage_cases-I_Know_Today
2928_150_cars_snowbound_in_early_winter_storm-Chroncom
2929_Ubers_plot_to_spy_on_reporter_is_latest_controversy-USA_TODAY
2930_UN_Rights_Committee_Urges_Court_Referral_for_North_Korea-Voice_of_America
2931_Death_of_Virginia_college_student_ruled_a_homicide-Chicago_Tribune
2932_After_Jerusalem_attack_Netanyahu_hopes_PR_porn_will_win_support_abroad__Diplomacy_and_Defense_Israel_News-Haaretz
2934_Uber_CEO_Apologizes_For_Execs_Terrible_Suggestion_That_The_Company_Investigate_Journalists-Huffington_Post
2935_Ebola_crisis_Seventh_Sierra_Leone_doctor_dies_from_virus-ABC_Online
2936_US_reviewing_how_it_deals_with_families_of_American_hostages-McClatchy_Washington_Bureau
2937_United_Nations_Urges_North_Korea_Prosecutions-New_York_Times
2938_Stupidity_reconsidered-WNDcom
2939_Ryan_to_chair_tax_panel_a_possible_2016_platform-El_Paso_Inc
2940_Senate_fails_to_advance_legislation_on_NSA_reform-Washington_Post
2941_Uber_executive_stirs_up_privacy_controversy-Washington_Post
2942_Janice_Dickinson_Says_She_Was_Sexually_Assaulted_By_Bill_Cosby-seattlepicom
2943_UVA_student_Hannah_Graham_died_from_homicidal_violence_medical_examiner-New_York_Daily_News
2944_Police_nab_man_sought_after_fatal_NYC_subway_shove-Westport_News
2945_Ryan_to_Chair_Tax_Panel_a_Possible_2016_Platform-ABC_News
2946_In_latest_College_Football_Playoff_rankings_Alabama_rolls_straight_to_the_No_1_spot-Washington_Post
2947_Senate_Fails_to_Advance_NSA_Data_Collection_Overhaul_Legislation-Wall_Street_Journal
2948_Storm_blamed_for_at_least_4_deaths_in_upstate_New_York-CBS_News
2949_Report_Janice_Dickinson_accuses_Bill_Cosby_of_rape-CNN
2950_Tracy_Morgan_Still_Battling_With_Brain_Injury-FanSided
2951_UN_calls_for_probe_of_North_Korea_crimes_against_humanity-Channel_News_Asia
2953_Keystone_Vote_Unlikely_to_Change_Odds_for_Mary_Landrieu-New_York_Times
2954_Senate_Republicans_block_bill_NSA_will_continue_monitoring_your_calls-Christian_Science_Monitor
2955_TV_host_Janice_Dickinson_latest_Cosby_accuser-USA_TODAY
2956_Lawyer_Comedian_Tracy_Morgan_Suffered_Traumatic_Brain_Injury_In_NJ_Tpke_Crash-CBS_Local
2957_Video_seems_to_show_Ferguson_officer_in_confrontation-USA_TODAY
2958_North_Korea_UN_moves_closer_to_ICC_human_rights_probe-BBC_News
2959_Defective_Takata_Airbag_Grows_Into_Global_Problem_for_Manufacturer-New_York_Times
2960_Report_Va_students_death_was_homicide-WCVB_Boston
2961_Winter_Whack_Nation_Faces_Arctic_Chill_6_Feet_of_Snow_Hit_Buffalo_Area-NBCNewscom
2962_Fail_Mary_Senate_rejects_Keystone_bill-Politico
2963_Palestinians_kill_five_in_Jerusalem_synagogue_attack-Reuters
2964_Attorney_says_actor_Tracy_Morgan_struggling_after_crash_report-Reuters_Canada
2965_Court_clears_way_for_gay_marriage_in_South_Carolina-The_Augusta_Chronicle
2966_Subway_Motorman_Describes_Deadly_Train_Push-NBC_New_York
2967_Federal_highway_safety_agency_demands_recall_of_cars_with_Takata_air_bags-Fox_News
2968_North_Korea_reacts_angrily_after_UN_votes_to_probe_crimes_against_humanity-South_China_Morning_Post_subscription
2969_Early_winter_pummels_much_of_country_strands_motorists_emergency_vehicles-Fox_News
2970_Palestinians_kill_five_in_Jerusalem_synagogue_attack-Reuters
2971_Attorney_says_actor_Tracy_Morgan_struggling_after_crash_report-Reuters
2972_UPDATE_3US_auto_regulator_seeks_nationwide_recall_of_Takata_air_bags-Reuters
2973_Death_of_Virginia_college_student_ruled_a_homicide-Reuters
2974_Adrian_Peterson_May_Be_Suspended_But_Hes_Unlikely_To_Lose_Any_Pay-Forbes
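
The listings above suggest a layout of [dataset]/news/[date].txt for news and [dataset]/tweets/[date]/[id]_[title]-[source] for tweets. The sketch below builds those paths and splits a tweets file name into its parts. It is only an illustration: DATA_DIR = ./data is assumed from the shell commands, the helper names are hypothetical, and reading the numeric prefix as the news native id is an inference from the sample files, not something the raw data documents.

```python
import os

# Assumption: DATA_DIR is the ./data folder used in the shell cells above.
DATA_DIR = os.path.join('.', 'data')
ORIGINAL_NEWS_TWEETS_DIR = os.path.join(DATA_DIR, 'raw-news_tweets-original')


def news_file_path(dataset, date_str):
    """Path of the news file for one date, e.g. dataset1/news/2014-11-18.txt."""
    return os.path.join(ORIGINAL_NEWS_TWEETS_DIR, dataset, 'news', date_str + '.txt')


def tweets_dir_path(dataset, date_str):
    """Path of the folder holding one tweets file per news item for one date."""
    return os.path.join(ORIGINAL_NEWS_TWEETS_DIR, dataset, 'tweets', date_str)


def split_tweets_file_name(file_name):
    """Split '[id]_[title]-[source]' into (id, title, source).

    Assumes the id precedes the first underscore and the source follows
    the last hyphen, as in the file names listed above.
    """
    native_id, rest = file_name.split('_', 1)
    title, source = rest.rsplit('-', 1)
    return int(native_id), title, source


example_name = '2690_FBI_Violence_could_follow_Ferguson_indictment_decision-CBS_News'
example_parts = split_tweets_file_name(example_name)
print(news_file_path('dataset1', '2014-11-18'))
print(example_parts)
```

File names whose title or source deviates from this pattern would need extra handling, so the split should be checked against the full listing before relying on it.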

Check format of news data:


In [6]:
! head -1 ./data/raw-news_tweets-original/dataset1/news/2014-11-18.txt


2685	http://www.businessweek.com/news/2014-11-17/missouri-s-nixon-declares-state-of-emergency-awaiting-grand-jury	Missouri's Nixon Declares State of Emergency Awaiting Grand Jury	Businessweek	Mon, 17 Nov 2014 21:04:21 -0800	Toluse Olorunnipa	missouri;nixon declares;emergency awaiting;grand jury	Missouri Governor Jay Nixon's actions underscore rising tensions in the St. Louis region, which has seen months of protests since the Aug. 9 shooting death of 18-year-old Michael Brown. Photographer: Scott Olson/Getty Images. Governor Jay Nixon declared ...	Missouri Governor Jay Nixon’s actions underscore rising tensions in the St. Louis region, which has seen months of protests since the Aug. 9 shooting death of 18-year-old Michael Brown. Photographer: Scott Olson/Getty Images::::::::Governor Jay Nixon activated the Missouri National Guard in anticipation of unrest when a grand jury decides whether to indict a white police officer who shot and killed an unarmed black teenager.::::::::Nixon declared a state of emergency and created a “unified command” of police agencies in preparation for the decision, due this month in the slaying of 18-year-old Michael Brown of Ferguson, a St. Louis suburb.::::::::“As part of our ongoing efforts to plan and be prepared for any contingency, it is necessary to have these resources in place in advance of any announcement,” Nixon, a 58-year-old Democrat, said yesterday in a statement. “Public safety demands that we are fully prepared.”::::::::The governor’s actions underscore rising tensions in the St. Louis region, which has seen months of protests since the Aug. 9 shooting. 
Protesters and police have predicted demonstrations and potential violence if Officer Darren Wilson, who shot Brown, isn’t indicted for murder.::::::::Hundreds of police officers have been trained to deal with demonstrators, Nixon said last week, and “violence will not be tolerated.” The response to the first round of protests -- including armored vehicles and officers in military gear who leveled rifles at protesters -- drew international rebuke.::::::::Attorneys representing Brown’s family said Nixon should also warn law enforcement against using excessive force.::::::::“To the law-enforcement officers who will be patrolling the streets, we would like to thank you in advance for not having a repeat of the horrific encounters that took place in August,” Benjamin Crump, a lawyer for the family, said last week at a news conference.::::::::Nixon called up the National Guard in August after police and protesters faced off in nightly clashes featuring tear gas, gunshots and dozens of arrests.::::::::Some protests have continued, with demonstrators blocking roads and chanting at officers.::::::::The St. Louis County Police Department has been in charge of policing the protests since last month. The Missouri Highway Patrol and the St. Louis Metropolitan Police Department will also be part of the unified command, according to the executive order.::::::::The Ferguson Police Department isn’t listed as part of the group.::::::::“Having the support of the National Guard available will enable local officers to continue to respond promptly to calls for service,” St. Louis Metropolitan Police Chief Sam Dotson said in a statement released by Nixon’s office.::::::::Since August, the St. Louis County Police Department has spent more than $100,000 on riot gear and equipment such as tear gas, according to a spokesman, Brian Schellman.::::::::Autopsy results show that Wilson shot Brown at least six times. Police said Brown attacked Wilson and attempted to get his gun. 
Protesters and some residents said the teenager raised his hands in surrender before he was killed.::::::::To contact the reporter on this story: Toluse Olorunnipa in Tallahassee, Florida at tolorunnipa@bloomberg.net::::::::To contact the editors responsible for this story: Stephen Merelman at smerelman@bloomberg.net Pete Young, Theo Mullen	
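
The sample record looks like a tab-separated line whose last text field uses '::::::::' between paragraphs and whose keywords are separated by ';'. A minimal parsing sketch under those assumptions follows. The field names in NEWS_FIELDS are hypothetical labels inferred from the sample, not names defined by the raw data, and the sample line is shortened placeholder text.

```python
import csv
import io
from email.utils import parsedate_to_datetime

# Hypothetical field names, inferred from the sample record above.
NEWS_FIELDS = ['native_id', 'url', 'title', 'source', 'post_time',
               'author', 'keywords', 'summary', 'doc']


def parse_news_line(line):
    """Split one tab-separated news record into a dict (illustrative only)."""
    reader = csv.reader(io.StringIO(line), delimiter='\t',
                        doublequote=False, quoting=csv.QUOTE_NONE)
    row = next(reader)
    # zip() ignores any extra trailing fields, such as an empty last column.
    record = dict(zip(NEWS_FIELDS, row))
    record['native_id'] = int(record['native_id'])
    # Post time is in RFC 2822 style, e.g. 'Mon, 17 Nov 2014 21:04:21 -0800'.
    record['post_time'] = parsedate_to_datetime(record['post_time'])
    record['keywords'] = record['keywords'].split(';')
    record['paragraphs'] = record['doc'].split('::::::::')
    return record


# Shortened placeholder record with the same shape as the sample above.
sample_line = '\t'.join([
    '2685', 'http://example.com/story', 'Sample title', 'Businessweek',
    'Mon, 17 Nov 2014 21:04:21 -0800', 'A. Reporter',
    'missouri;grand jury', 'Short summary ...',
    'First paragraph.::::::::Second paragraph.',
])
news_record = parse_news_line(sample_line)
print(news_record['native_id'], news_record['post_time'].isoformat())
print(news_record['paragraphs'])
```

The sample output above wraps in the middle of the record, which suggests the body text may contain other line-break characters. This sketch does not treat them specially.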

Check format of tweets data:


In [7]:
! head -3 ./data/raw-news_tweets-original/dataset1/tweets/2014-11-18/2685_Missouris_Nixon_Declares_State_of_Emergency_Awaiting_Grand_Jury-Businessweek


534598949748092928	Missouri governor declares emergency, activates National Guard ahead of grand jury announcement... http://t.co/iXqEprM4KJ	Tue Nov 18 06:48:16 +0000 2014	http://t.co/iXqEprM4KJ							False	0	False	0	104591943	False	567	3187	913	None	Wed Jan 13 20:12:34 +0000 2010			Facebook			
534598876976910336	Missouri+governor+declares+state+of+emergency+ahead+of+Ferguson+grand+jury+decision+(+video)+-+http://t.co/XCnLNeibfk http://t.co/9byi9KURzB	Tue Nov 18 06:47:59 +0000 2014	http://t.co/XCnLNeibfk,http://t.co/9byi9KURzB							False	0	False	0	40676472	False	1002	1227	2001	None	Sun May 17 15:27:05 +0000 2009	Orlando, FL		Twitter for Android			
534598442820317184	RT @TeaPartyCat: Missouri governor declares state of racism in Ferguson in anticipation of grand jury deciding that killing a black man isn…	Tue Nov 18 06:46:15 +0000 2014			534570486706216960	False	93	False	113	False	0	False	113	1699988293	False	173	2732	281	None	Sun Aug 25 19:21:04 +0000 2013	Spokane, wa		Twitter for iPhone	27754737		
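
Each tweet line appears to start with the tweet id, the tweet text, and a post time such as 'Tue Nov 18 06:48:16 +0000 2014', followed by further tab-separated fields. The sketch below reads only those first three fields. The meaning of the remaining columns is not documented in this notebook, so they are left untouched, and the dict keys simply mirror the column names of the tweets table as an assumption.

```python
from datetime import datetime

# Format of the third field, e.g. 'Tue Nov 18 06:48:16 +0000 2014'.
# %a and %b are locale-dependent; an English (or C) locale is assumed.
TWEET_TIME_FORMAT = '%a %b %d %H:%M:%S %z %Y'


def parse_tweet_line(line):
    """Read the first three tab-separated fields of one tweet record."""
    fields = line.rstrip('\n').split('\t')
    return {
        'tweet_native_id': int(fields[0]),
        'tweet_text': fields[1],
        'tweet_post_time': datetime.strptime(fields[2], TWEET_TIME_FORMAT),
    }


# Shortened record shaped like the first sample line above.
sample_tweet_line = '\t'.join([
    '534598949748092928',
    'Missouri governor declares emergency ...',
    'Tue Nov 18 06:48:16 +0000 2014',
    'http://t.co/iXqEprM4KJ',
])
tweet_record = parse_tweet_line(sample_tweet_line)
print(tweet_record['tweet_native_id'], tweet_record['tweet_post_time'].isoformat())
```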

Create db. Build tables for news and tweets.

Define the DDL script that creates the db, with table schemas for news and tweets.


In [8]:
"""
Register
    NEWS_TWEETS_DDL_FILE = os.path.join(DATA_DIR, 'original-news_tweets.schema.sql')
in config.
"""


Out[8]:
"\nRegister\n    NEWS_TWEETS_DDL_FILE = os.path.join(DATA_DIR, 'original-news_tweets.schema.sql')\nin config.\n"

In [9]:
! cat ./data/original-news_tweets.schema.sql


-- original-news_tweets.schema.sql

-- Schema for news and tweets data

-- News table
create table news (
    news_id integer primary key autoincrement not null,
    news_native_id integer not null,
    news_url text,
    news_title text,
    news_source text,
    news_post_time datetime,
    news_collected_time date,
    news_keywords text,
    news_doc text,
    news_entities text
);

-- Tweets table
create table tweets (
    tweet_id integer primary key autoincrement not null,
    tweet_native_id integer not null,
    tweet_text text,
    tweet_post_time datetime,
    tweet_collected_time date,
    news_native_id integer not null,
    foreign key (news_native_id) references news (news_native_id) on delete cascade
);
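
One thing worth checking with this schema: SQLite usually leaves foreign key enforcement off unless `pragma foreign_keys = on` is issued on each connection, so the `on delete cascade` clause may have no effect in the cells that follow. The sketch below applies an abbreviated copy of the schema (columns trimmed, an assumption made for brevity) to an in-memory db, lists the created tables, and reports the pragma state.

```python
import sqlite3

# Abbreviated copy of the schema above (columns trimmed for brevity).
schema = '''
create table news (
    news_id integer primary key autoincrement not null,
    news_native_id integer not null,
    news_title text
);
create table tweets (
    tweet_id integer primary key autoincrement not null,
    tweet_native_id integer not null,
    tweet_text text,
    news_native_id integer not null,
    foreign key (news_native_id) references news (news_native_id) on delete cascade
);
'''

conn = sqlite3.connect(':memory:')
conn.executescript(schema)

# List the user tables that were created (internal sqlite_* tables are skipped).
tables = sorted(row[0] for row in conn.execute(
    "select name from sqlite_master where type = 'table' and name not like 'sqlite_%'"))
print(tables)

# Report whether foreign key enforcement is active on this connection.
fk_state = conn.execute('pragma foreign_keys').fetchone()
print('foreign_keys pragma:', fk_state)
conn.close()
```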

Build db and tables


In [10]:
"""
Register 
    NEWS_TWEETS_DB_FILE = os.path.join(DATA_DIR, 'news_tweets-meng.db')
in config.
"""
def check_db():
    db_exists = os.path.exists(config.NEWS_TWEETS_DB_FILE)
    
    with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
        if not db_exists:
            print('No db exists. Creating new db and tables...')
            with open(config.NEWS_TWEETS_DDL_FILE, 'r') as f:
                schema = f.read()
                conn.executescript(schema)
        else:
            print('db already exists!')

In [11]:
if 1 == 1:
    check_db()


db already exists!

Bulk load news and tweets into db

Load news data into db table


In [13]:
%%time
"""
Parse and load original news data

Note: run this cell twice, changing the dataset folder in news_dir each time:
 - config.ORIGINAL_NEWS_TWEETS_DIR/dataset1/news/
 - config.ORIGINAL_NEWS_TWEETS_DIR/dataset2/news/

"""
if 0 == 1:
    '''
    Get all [date].txt news files
    '''
    news_dir = os.path.join(config.ORIGINAL_NEWS_TWEETS_DIR, 'dataset1', 'news')
    news_files = os.listdir(news_dir)
    # sort files by date
    news_files.sort(key=lambda x: time.strptime(x, '%Y-%m-%d.txt'))

    '''
    Define news file line format information
    '''
    csv.register_dialect('original_news_line', delimiter='\t', doublequote=False, quoting=csv.QUOTE_NONE)
    
    # NOTE: line_index 7 is news_abstract! news_doc should be line_index 8
    # line_index 9 is entities for news_doc in format of [entity_name]:[entity_type]:[frequency]
    news_line_index_dict = {'news_native_id': 0, 'news_url': 1, 'news_title': 2, 'news_source': 3, 
                            'news_post_time': 4, 'news_keywords': 6, 'news_doc': 8, 'news_entities': 9}
    
    query = '''
    insert into news (news_native_id, news_url, news_title, news_source, news_post_time, news_collected_time, news_keywords, news_doc, news_entities)
    values (?, ?, ?, ?, ?, ?, ?, ?, ?)
    '''
    
    with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
        cursor = conn.cursor()
        '''
        For each news file, read-in, parse, and load into db news table
        '''
        for news_file in news_files:
            # parse out news collected date from the name of the file
            news_collected_time = os.path.split(news_file)[1].split('.')[0]
            with open(os.path.join(news_dir, news_file), 'r') as f:
                news_tpl_lst = []
                news_lines = csv.reader(f, dialect='original_news_line')
                for news_line in news_lines:
                    # check format of news article
                    if len(news_line) == 10:
                        news_tpl = (news_line[news_line_index_dict['news_native_id']],
                                    news_line[news_line_index_dict['news_url']],
                                    news_line[news_line_index_dict['news_title']],
                                    news_line[news_line_index_dict['news_source']],
                                    utilities.parse_news_post_time(news_line[news_line_index_dict['news_post_time']]),
                                    news_collected_time,
                                    news_line[news_line_index_dict['news_keywords']],
                                    news_line[news_line_index_dict['news_doc']],
                                    news_line[news_line_index_dict['news_entities']])
                        news_tpl_lst.append(news_tpl)
                
                cursor.executemany(query, news_tpl_lst)


CPU times: user 2.48 s, sys: 244 ms, total: 2.72 s
Wall time: 3.34 s
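
The parsing step in the cell above can be tried in isolation, without touching the db. The sketch below reuses the same column positions and the same tab-delimited, no-quoting csv settings. `news_line_to_dict` is a hypothetical helper and both sample rows are made up for illustration.

```python
import csv
import io

# Same column positions as news_line_index_dict in the cell above.
news_line_index_dict = {'news_native_id': 0, 'news_url': 1, 'news_title': 2, 'news_source': 3,
                        'news_post_time': 4, 'news_keywords': 6, 'news_doc': 8, 'news_entities': 9}

def news_line_to_dict(news_line):
    """Map a 10-field news line to a dict, or return None if malformed."""
    if len(news_line) != 10:
        return None
    return {name: news_line[index] for name, index in news_line_index_dict.items()}

# Made-up sample rows: one well-formed (10 fields), one truncated (3 fields).
good = '\t'.join(['2685', 'http://example.com/a', 'Title', 'Source', 'time',
                  'unused', 'kw1,kw2', 'abstract', 'body "with quotes"', 'Nixon:PERSON:3'])
bad = '\t'.join(['2686', 'http://example.com/b', 'Title only'])

reader = csv.reader(io.StringIO(good + '\n' + bad + '\n'),
                    delimiter='\t', doublequote=False, quoting=csv.QUOTE_NONE)
records = [news_line_to_dict(line) for line in reader]
print(records[0]['news_doc'])
print(records[1])
```

With `csv.QUOTE_NONE`, quote characters inside the article body are kept as literal text rather than being treated as field delimiters.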

Load tweets data into db table


In [12]:
%%time
"""
Parse and load original tweets data
"""
if 0 == 1:
    '''
    Retrieve all valid dates for news
    '''
    news_valid_dates_lst = []
    
    query_news_valid_dates = '''
    select distinct news_collected_time from news
    order by news_collected_time;'''
    
    with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
        cursor = conn.cursor()
        cursor.execute(query_news_valid_dates)
        news_valid_dates_lst = [item[0] for item in cursor.fetchall()]
    
    '''
    Filter out dates listed in config.ORIGINAL_TWEETS_ERROR_DATES_LST, whose tweets files caused errors when building the db
    
    Also see Notes section
    ''' 
    tmp_lst = []
    for news_valid_date in news_valid_dates_lst:
        if news_valid_date not in config.ORIGINAL_TWEETS_ERROR_DATES_LST:
            tmp_lst.append(news_valid_date)
            
    news_valid_dates_lst = tmp_lst
    
    '''
    Establish a conn to db for each news date and commit inserts
    '''
    for news_valid_date in news_valid_dates_lst:
        localtime = time.asctime(time.localtime(time.time()))
        print('Processing tweets associated with news on {} ({})...'.format(news_valid_date, localtime))
        '''
        Retrieve news_native_id from db
        '''
        news_native_ids_lst = []

        query_select_news = '''
        select news_native_id from news
        where news_collected_time = :news_valid_date
        order by news_native_id asc;'''
        with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
            cursor = conn.cursor()
            cursor.execute(query_select_news, {'news_valid_date': news_valid_date})
            news_native_ids_lst = [item[0] for item in cursor.fetchall()]

        '''
        Define tweets file line format information
        '''
        csv.register_dialect('original_tweet_line', delimiter='\t', doublequote=False, quoting=csv.QUOTE_NONE)
        tweet_line_index_dict = {'tweet_native_id': 0, 'tweet_text': 1, 'tweet_post_time': 2}

        query_insert_tweets = '''insert into tweets (tweet_native_id, tweet_text, tweet_post_time, tweet_collected_time, news_native_id) values (?, ?, ?, ?, ?)'''

        with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
            cursor = conn.cursor()
            '''
            For each news:
             - locate corresponding tweets file
             - read-in file lines
             - parse content
             - load into tweets table
            '''
            for news_ind, news_native_id in enumerate(news_native_ids_lst):
                news_collected_time = news_valid_date
                # print out progress info every 100 news articles
#                 if not news_ind % 100:
#                     print('Processing tweets associated with news {} on {} ({}/{})...'.format(news_native_id, news_collected_time, news_ind+1, len(news_native_ids_lst)))
                # determine if the news_tweets file is in dataset1 folder or dataset2 folder
                if news_collected_time <= '2015-06-05':
                    news_tweets_file_dir = os.path.join(config.ORIGINAL_NEWS_TWEETS_DIR, 'dataset1', 'tweets', news_collected_time)
                else:
                    news_tweets_file_dir = os.path.join(config.ORIGINAL_NEWS_TWEETS_DIR, 'dataset2', 'tweets', news_collected_time)

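                # find out the name of the news_tweets file (names look like [news_native_id]_[title]-[source]);
                # match on the id plus '_' so that e.g. id 268 does not also match files of id 2685
                tweets_files_lst = [file for file in os.listdir(news_tweets_file_dir) if file.startswith('{}_'.format(news_native_id))]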
                # if news_tweets file exists and its size larger than 0
                if tweets_files_lst and os.stat(os.path.join(news_tweets_file_dir, tweets_files_lst[0])).st_size > 0:
                    news_tweets_file = tweets_files_lst[0]
                    with open(os.path.join(news_tweets_file_dir, news_tweets_file), 'r') as f:
                        tweets_tpl_lst = []
                        # get rid of '\0' (Error: line contains NULL byte)
                        tweet_lines = csv.reader((line.replace('\0','') for line in f), dialect='original_tweet_line')
                        for tweet_line in tweet_lines:
                            # filter out non-complete tweet lines and retweets
                            if len(tweet_line) >=3 and not tweet_line[tweet_line_index_dict['tweet_text']].startswith('RT'):
                                # build tpl for tweets table
                                tweet_tpl = (tweet_line[tweet_line_index_dict['tweet_native_id']],
                                             tweet_line[tweet_line_index_dict['tweet_text']],
                                             utilities.parse_tweet_post_time(tweet_line[tweet_line_index_dict['tweet_post_time']]),
                                             news_collected_time,
                                             news_native_id)
                                tweets_tpl_lst.append(tweet_tpl)

                        if tweets_tpl_lst:
                            cursor.executemany(query_insert_tweets, tweets_tpl_lst)


Processing tweets associated with news on 2014-11-18 (Thu Oct 12 19:34:22 2017)...
Processing tweets associated with news on 2014-11-19 (Thu Oct 12 19:36:52 2017)...
Processing tweets associated with news on 2014-11-20 (Thu Oct 12 19:38:44 2017)...
Processing tweets associated with news on 2014-11-21 (Thu Oct 12 19:41:07 2017)...
Processing tweets associated with news on 2014-11-22 (Thu Oct 12 19:43:17 2017)...
Processing tweets associated with news on 2014-11-23 (Thu Oct 12 19:45:23 2017)...
Processing tweets associated with news on 2014-11-24 (Thu Oct 12 19:47:34 2017)...
Processing tweets associated with news on 2014-11-25 (Thu Oct 12 19:49:39 2017)...
Processing tweets associated with news on 2014-11-26 (Thu Oct 12 19:51:39 2017)...
Processing tweets associated with news on 2014-11-27 (Thu Oct 12 19:54:12 2017)...
Processing tweets associated with news on 2014-11-28 (Thu Oct 12 19:56:11 2017)...
Processing tweets associated with news on 2014-11-29 (Thu Oct 12 19:58:41 2017)...
Processing tweets associated with news on 2014-11-30 (Thu Oct 12 20:00:39 2017)...
Processing tweets associated with news on 2014-12-01 (Thu Oct 12 20:02:47 2017)...
Processing tweets associated with news on 2014-12-02 (Thu Oct 12 20:05:04 2017)...
Processing tweets associated with news on 2014-12-03 (Thu Oct 12 20:07:27 2017)...
Processing tweets associated with news on 2014-12-04 (Thu Oct 12 20:09:39 2017)...
Processing tweets associated with news on 2014-12-05 (Thu Oct 12 20:11:42 2017)...
Processing tweets associated with news on 2014-12-06 (Thu Oct 12 20:13:50 2017)...
Processing tweets associated with news on 2014-12-07 (Thu Oct 12 20:16:01 2017)...
Processing tweets associated with news on 2014-12-08 (Thu Oct 12 20:18:04 2017)...
Processing tweets associated with news on 2014-12-09 (Thu Oct 12 20:20:03 2017)...
Processing tweets associated with news on 2014-12-10 (Thu Oct 12 20:22:13 2017)...
Processing tweets associated with news on 2014-12-11 (Thu Oct 12 20:24:08 2017)...
Processing tweets associated with news on 2014-12-12 (Thu Oct 12 20:26:08 2017)...
Processing tweets associated with news on 2014-12-13 (Thu Oct 12 20:28:03 2017)...
Processing tweets associated with news on 2014-12-14 (Thu Oct 12 20:29:43 2017)...
Processing tweets associated with news on 2014-12-16 (Thu Oct 12 20:30:05 2017)...
Processing tweets associated with news on 2014-12-17 (Thu Oct 12 20:30:28 2017)...
Processing tweets associated with news on 2014-12-18 (Thu Oct 12 20:32:00 2017)...
Processing tweets associated with news on 2014-12-19 (Thu Oct 12 20:32:36 2017)...
Processing tweets associated with news on 2014-12-20 (Thu Oct 12 20:35:12 2017)...
Processing tweets associated with news on 2014-12-21 (Thu Oct 12 20:37:20 2017)...
Processing tweets associated with news on 2014-12-22 (Thu Oct 12 20:39:30 2017)...
Processing tweets associated with news on 2014-12-23 (Thu Oct 12 20:41:43 2017)...
Processing tweets associated with news on 2014-12-24 (Thu Oct 12 20:43:37 2017)...
Processing tweets associated with news on 2014-12-25 (Thu Oct 12 20:46:06 2017)...
Processing tweets associated with news on 2014-12-26 (Thu Oct 12 20:48:06 2017)...
Processing tweets associated with news on 2014-12-27 (Thu Oct 12 20:50:13 2017)...
Processing tweets associated with news on 2014-12-28 (Thu Oct 12 20:52:18 2017)...
Processing tweets associated with news on 2014-12-29 (Thu Oct 12 20:54:24 2017)...
Processing tweets associated with news on 2014-12-30 (Thu Oct 12 20:55:40 2017)...
Processing tweets associated with news on 2014-12-31 (Thu Oct 12 20:55:46 2017)...
Processing tweets associated with news on 2015-01-01 (Thu Oct 12 20:56:46 2017)...
Processing tweets associated with news on 2015-01-02 (Thu Oct 12 20:59:07 2017)...
Processing tweets associated with news on 2015-01-03 (Thu Oct 12 21:01:02 2017)...
Processing tweets associated with news on 2015-01-04 (Thu Oct 12 21:03:10 2017)...
Processing tweets associated with news on 2015-01-05 (Thu Oct 12 21:05:22 2017)...
Processing tweets associated with news on 2015-01-06 (Thu Oct 12 21:07:14 2017)...
Processing tweets associated with news on 2015-01-07 (Thu Oct 12 21:09:33 2017)...
Processing tweets associated with news on 2015-01-08 (Thu Oct 12 21:11:23 2017)...
Processing tweets associated with news on 2015-01-09 (Thu Oct 12 21:13:39 2017)...
Processing tweets associated with news on 2015-01-10 (Thu Oct 12 21:15:38 2017)...
Processing tweets associated with news on 2015-01-11 (Thu Oct 12 21:17:18 2017)...
Processing tweets associated with news on 2015-01-12 (Thu Oct 12 21:19:29 2017)...
Processing tweets associated with news on 2015-01-13 (Thu Oct 12 21:21:26 2017)...
Processing tweets associated with news on 2015-01-14 (Thu Oct 12 21:23:28 2017)...
Processing tweets associated with news on 2015-01-15 (Thu Oct 12 21:25:26 2017)...
Processing tweets associated with news on 2015-01-16 (Thu Oct 12 21:27:22 2017)...
Processing tweets associated with news on 2015-01-17 (Thu Oct 12 21:29:35 2017)...
Processing tweets associated with news on 2015-01-18 (Thu Oct 12 21:31:09 2017)...
Processing tweets associated with news on 2015-01-19 (Thu Oct 12 21:33:02 2017)...
Processing tweets associated with news on 2015-01-20 (Thu Oct 12 21:35:11 2017)...
Processing tweets associated with news on 2015-01-21 (Thu Oct 12 21:37:04 2017)...
Processing tweets associated with news on 2015-01-22 (Thu Oct 12 21:39:11 2017)...
Processing tweets associated with news on 2015-01-23 (Thu Oct 12 21:40:56 2017)...
Processing tweets associated with news on 2015-01-24 (Thu Oct 12 21:41:10 2017)...
Processing tweets associated with news on 2015-01-25 (Thu Oct 12 21:43:23 2017)...
Processing tweets associated with news on 2015-01-26 (Thu Oct 12 21:45:30 2017)...
Processing tweets associated with news on 2015-01-27 (Thu Oct 12 21:47:33 2017)...
Processing tweets associated with news on 2015-01-28 (Thu Oct 12 21:49:29 2017)...
Processing tweets associated with news on 2015-01-29 (Thu Oct 12 21:51:07 2017)...
Processing tweets associated with news on 2015-01-30 (Thu Oct 12 21:53:25 2017)...
Processing tweets associated with news on 2015-01-31 (Thu Oct 12 21:55:34 2017)...
Processing tweets associated with news on 2015-02-01 (Thu Oct 12 21:57:37 2017)...
Processing tweets associated with news on 2015-02-02 (Thu Oct 12 21:59:44 2017)...
Processing tweets associated with news on 2015-02-03 (Thu Oct 12 22:02:05 2017)...
Processing tweets associated with news on 2015-02-04 (Thu Oct 12 22:04:04 2017)...
Processing tweets associated with news on 2015-02-05 (Thu Oct 12 22:06:13 2017)...
Processing tweets associated with news on 2015-02-06 (Thu Oct 12 22:08:37 2017)...
Processing tweets associated with news on 2015-02-07 (Thu Oct 12 22:10:27 2017)...
Processing tweets associated with news on 2015-02-08 (Thu Oct 12 22:12:32 2017)...
Processing tweets associated with news on 2015-02-09 (Thu Oct 12 22:14:50 2017)...
Processing tweets associated with news on 2015-02-10 (Thu Oct 12 22:16:42 2017)...
Processing tweets associated with news on 2015-02-11 (Thu Oct 12 22:18:46 2017)...
Processing tweets associated with news on 2015-02-12 (Thu Oct 12 22:20:57 2017)...
Processing tweets associated with news on 2015-02-13 (Thu Oct 12 22:22:53 2017)...
Processing tweets associated with news on 2015-02-14 (Thu Oct 12 22:25:04 2017)...
Processing tweets associated with news on 2015-02-15 (Thu Oct 12 22:27:11 2017)...
Processing tweets associated with news on 2015-02-16 (Thu Oct 12 22:29:07 2017)...
Processing tweets associated with news on 2015-02-17 (Thu Oct 12 22:31:10 2017)...
Processing tweets associated with news on 2015-02-18 (Thu Oct 12 22:33:09 2017)...
Processing tweets associated with news on 2015-02-19 (Thu Oct 12 22:35:23 2017)...
Processing tweets associated with news on 2015-02-20 (Thu Oct 12 22:37:17 2017)...
Processing tweets associated with news on 2015-02-21 (Thu Oct 12 22:39:22 2017)...
Processing tweets associated with news on 2015-02-22 (Thu Oct 12 22:41:25 2017)...
Processing tweets associated with news on 2015-02-23 (Thu Oct 12 22:43:20 2017)...
Processing tweets associated with news on 2015-02-24 (Thu Oct 12 22:45:21 2017)...
Processing tweets associated with news on 2015-02-25 (Thu Oct 12 22:47:33 2017)...
Processing tweets associated with news on 2015-02-26 (Thu Oct 12 22:49:51 2017)...
Processing tweets associated with news on 2015-02-27 (Thu Oct 12 22:52:02 2017)...
Processing tweets associated with news on 2015-02-28 (Thu Oct 12 22:54:00 2017)...
Processing tweets associated with news on 2015-03-01 (Thu Oct 12 22:55:55 2017)...
Processing tweets associated with news on 2015-03-02 (Thu Oct 12 22:57:51 2017)...
Processing tweets associated with news on 2015-03-03 (Thu Oct 12 22:59:44 2017)...
Processing tweets associated with news on 2015-03-04 (Thu Oct 12 23:01:53 2017)...
Processing tweets associated with news on 2015-03-05 (Thu Oct 12 23:04:04 2017)...
Processing tweets associated with news on 2015-03-06 (Thu Oct 12 23:05:55 2017)...
Processing tweets associated with news on 2015-03-07 (Thu Oct 12 23:08:06 2017)...
Processing tweets associated with news on 2015-03-08 (Thu Oct 12 23:10:25 2017)...
Processing tweets associated with news on 2015-03-09 (Thu Oct 12 23:12:18 2017)...
Processing tweets associated with news on 2015-03-10 (Thu Oct 12 23:14:25 2017)...
Processing tweets associated with news on 2015-03-11 (Thu Oct 12 23:16:16 2017)...
Processing tweets associated with news on 2015-03-12 (Thu Oct 12 23:18:05 2017)...
Processing tweets associated with news on 2015-03-13 (Thu Oct 12 23:20:12 2017)...
Processing tweets associated with news on 2015-03-14 (Thu Oct 12 23:22:13 2017)...
Processing tweets associated with news on 2015-03-15 (Thu Oct 12 23:23:06 2017)...
Processing tweets associated with news on 2015-03-16 (Thu Oct 12 23:24:57 2017)...
Processing tweets associated with news on 2015-03-17 (Thu Oct 12 23:26:52 2017)...
Processing tweets associated with news on 2015-03-18 (Thu Oct 12 23:28:04 2017)...
Processing tweets associated with news on 2015-03-19 (Thu Oct 12 23:31:02 2017)...
Processing tweets associated with news on 2015-03-20 (Thu Oct 12 23:33:22 2017)...
Processing tweets associated with news on 2015-03-21 (Thu Oct 12 23:35:33 2017)...
Processing tweets associated with news on 2015-03-22 (Thu Oct 12 23:36:31 2017)...
Processing tweets associated with news on 2015-03-29 (Thu Oct 12 23:36:35 2017)...
Processing tweets associated with news on 2015-03-30 (Thu Oct 12 23:39:40 2017)...
Processing tweets associated with news on 2015-03-31 (Thu Oct 12 23:41:42 2017)...
Processing tweets associated with news on 2015-04-01 (Thu Oct 12 23:43:49 2017)...
Processing tweets associated with news on 2015-04-02 (Thu Oct 12 23:46:09 2017)...
Processing tweets associated with news on 2015-04-03 (Thu Oct 12 23:48:07 2017)...
Processing tweets associated with news on 2015-04-04 (Thu Oct 12 23:50:28 2017)...
Processing tweets associated with news on 2015-04-05 (Thu Oct 12 23:52:33 2017)...
Processing tweets associated with news on 2015-04-06 (Thu Oct 12 23:54:51 2017)...
Processing tweets associated with news on 2015-04-07 (Thu Oct 12 23:56:51 2017)...
Processing tweets associated with news on 2015-04-08 (Thu Oct 12 23:58:51 2017)...
Processing tweets associated with news on 2015-04-09 (Fri Oct 13 00:01:13 2017)...
Processing tweets associated with news on 2015-04-10 (Fri Oct 13 00:03:07 2017)...
Processing tweets associated with news on 2015-04-11 (Fri Oct 13 00:05:19 2017)...
Processing tweets associated with news on 2015-04-12 (Fri Oct 13 00:07:08 2017)...
Processing tweets associated with news on 2015-04-13 (Fri Oct 13 00:09:23 2017)...
Processing tweets associated with news on 2015-04-14 (Fri Oct 13 00:12:19 2017)...
Processing tweets associated with news on 2015-04-15 (Fri Oct 13 00:15:14 2017)...
Processing tweets associated with news on 2015-04-16 (Fri Oct 13 00:16:05 2017)...
Processing tweets associated with news on 2015-04-17 (Fri Oct 13 00:16:27 2017)...
Processing tweets associated with news on 2015-04-18 (Fri Oct 13 00:16:45 2017)...
Processing tweets associated with news on 2015-04-19 (Fri Oct 13 00:17:01 2017)...
Processing tweets associated with news on 2015-04-20 (Fri Oct 13 00:17:19 2017)...
Processing tweets associated with news on 2015-04-21 (Fri Oct 13 00:17:36 2017)...
Processing tweets associated with news on 2015-04-22 (Fri Oct 13 00:17:50 2017)...
Processing tweets associated with news on 2015-04-23 (Fri Oct 13 00:18:07 2017)...
Processing tweets associated with news on 2015-04-24 (Fri Oct 13 00:18:25 2017)...
Processing tweets associated with news on 2015-04-25 (Fri Oct 13 00:18:38 2017)...
Processing tweets associated with news on 2015-04-26 (Fri Oct 13 00:18:54 2017)...
Processing tweets associated with news on 2015-04-27 (Fri Oct 13 00:19:09 2017)...
Processing tweets associated with news on 2015-04-28 (Fri Oct 13 00:19:24 2017)...
Processing tweets associated with news on 2015-04-29 (Fri Oct 13 00:19:39 2017)...
Processing tweets associated with news on 2015-04-30 (Fri Oct 13 00:19:56 2017)...
Processing tweets associated with news on 2015-05-01 (Fri Oct 13 00:20:11 2017)...
Processing tweets associated with news on 2015-05-02 (Fri Oct 13 00:20:34 2017)...
Processing tweets associated with news on 2015-05-03 (Fri Oct 13 00:20:52 2017)...
Processing tweets associated with news on 2015-05-04 (Fri Oct 13 00:21:07 2017)...
Processing tweets associated with news on 2015-05-05 (Fri Oct 13 00:21:14 2017)...
Processing tweets associated with news on 2015-05-06 (Fri Oct 13 00:21:21 2017)...
Processing tweets associated with news on 2015-05-07 (Fri Oct 13 00:21:43 2017)...
Processing tweets associated with news on 2015-05-08 (Fri Oct 13 00:21:56 2017)...
Processing tweets associated with news on 2015-05-09 (Fri Oct 13 00:22:22 2017)...
Processing tweets associated with news on 2015-05-10 (Fri Oct 13 00:22:48 2017)...
Processing tweets associated with news on 2015-05-11 (Fri Oct 13 00:23:09 2017)...
Processing tweets associated with news on 2015-05-12 (Fri Oct 13 00:23:34 2017)...
Processing tweets associated with news on 2015-05-13 (Fri Oct 13 00:24:11 2017)...
Processing tweets associated with news on 2015-05-14 (Fri Oct 13 00:24:49 2017)...
Processing tweets associated with news on 2015-05-15 (Fri Oct 13 00:25:11 2017)...
Processing tweets associated with news on 2015-05-16 (Fri Oct 13 00:25:35 2017)...
Processing tweets associated with news on 2015-05-17 (Fri Oct 13 00:25:59 2017)...
Processing tweets associated with news on 2015-05-18 (Fri Oct 13 00:26:30 2017)...
Processing tweets associated with news on 2015-05-19 (Fri Oct 13 00:27:01 2017)...
Processing tweets associated with news on 2015-05-20 (Fri Oct 13 00:27:21 2017)...
Processing tweets associated with news on 2015-05-21 (Fri Oct 13 00:27:55 2017)...
Processing tweets associated with news on 2015-05-22 (Fri Oct 13 00:28:23 2017)...
Processing tweets associated with news on 2015-05-23 (Fri Oct 13 00:28:49 2017)...
Processing tweets associated with news on 2015-05-24 (Fri Oct 13 00:29:15 2017)...
Processing tweets associated with news on 2015-05-25 (Fri Oct 13 00:29:39 2017)...
Processing tweets associated with news on 2015-05-26 (Fri Oct 13 00:30:05 2017)...
Processing tweets associated with news on 2015-05-27 (Fri Oct 13 00:30:29 2017)...
Processing tweets associated with news on 2015-05-28 (Fri Oct 13 00:30:58 2017)...
Processing tweets associated with news on 2015-05-29 (Fri Oct 13 00:31:19 2017)...
Processing tweets associated with news on 2015-05-30 (Fri Oct 13 00:31:49 2017)...
Processing tweets associated with news on 2015-05-31 (Fri Oct 13 00:32:07 2017)...
Processing tweets associated with news on 2015-06-01 (Fri Oct 13 00:32:31 2017)...
Processing tweets associated with news on 2015-06-02 (Fri Oct 13 00:33:03 2017)...
Processing tweets associated with news on 2015-06-03 (Fri Oct 13 00:33:29 2017)...
Processing tweets associated with news on 2015-06-04 (Fri Oct 13 00:33:47 2017)...
Processing tweets associated with news on 2015-09-07 (Fri Oct 13 00:34:10 2017)...
Processing tweets associated with news on 2015-09-08 (Fri Oct 13 00:34:17 2017)...
Processing tweets associated with news on 2015-09-09 (Fri Oct 13 00:34:45 2017)...
Processing tweets associated with news on 2015-09-10 (Fri Oct 13 00:35:24 2017)...
Processing tweets associated with news on 2015-09-11 (Fri Oct 13 00:35:59 2017)...
Processing tweets associated with news on 2015-09-12 (Fri Oct 13 00:36:25 2017)...
Processing tweets associated with news on 2015-09-13 (Fri Oct 13 00:36:54 2017)...
Processing tweets associated with news on 2015-09-14 (Fri Oct 13 00:37:16 2017)...
Processing tweets associated with news on 2015-09-15 (Fri Oct 13 00:37:44 2017)...
Processing tweets associated with news on 2015-09-16 (Fri Oct 13 00:38:08 2017)...
Processing tweets associated with news on 2015-09-17 (Fri Oct 13 00:38:43 2017)...
Processing tweets associated with news on 2015-09-18 (Fri Oct 13 00:39:14 2017)...
Processing tweets associated with news on 2015-09-19 (Fri Oct 13 00:39:37 2017)...
Processing tweets associated with news on 2015-09-30 (Fri Oct 13 00:39:59 2017)...
Processing tweets associated with news on 2015-10-01 (Fri Oct 13 00:40:02 2017)...
Processing tweets associated with news on 2015-11-13 (Fri Oct 13 00:40:22 2017)...
Processing tweets associated with news on 2015-11-14 (Fri Oct 13 00:40:26 2017)...
Processing tweets associated with news on 2015-11-15 (Fri Oct 13 00:40:57 2017)...
Processing tweets associated with news on 2015-11-16 (Fri Oct 13 00:41:24 2017)...
Processing tweets associated with news on 2015-11-17 (Fri Oct 13 00:41:59 2017)...
Processing tweets associated with news on 2015-11-18 (Fri Oct 13 00:42:29 2017)...
Processing tweets associated with news on 2015-11-19 (Fri Oct 13 00:42:56 2017)...
Processing tweets associated with news on 2015-11-20 (Fri Oct 13 00:43:20 2017)...
Processing tweets associated with news on 2015-11-21 (Fri Oct 13 00:43:45 2017)...
Processing tweets associated with news on 2015-11-22 (Fri Oct 13 00:44:03 2017)...
Processing tweets associated with news on 2015-11-23 (Fri Oct 13 00:44:32 2017)...
Processing tweets associated with news on 2015-11-24 (Fri Oct 13 00:44:55 2017)...
Processing tweets associated with news on 2015-11-25 (Fri Oct 13 00:45:23 2017)...
Processing tweets associated with news on 2015-11-26 (Fri Oct 13 00:45:49 2017)...
Processing tweets associated with news on 2015-11-27 (Fri Oct 13 00:46:10 2017)...
Processing tweets associated with news on 2015-11-28 (Fri Oct 13 00:46:28 2017)...
Processing tweets associated with news on 2015-11-29 (Fri Oct 13 00:46:53 2017)...
Processing tweets associated with news on 2015-11-30 (Fri Oct 13 00:47:18 2017)...
Processing tweets associated with news on 2015-12-01 (Fri Oct 13 00:47:37 2017)...
Processing tweets associated with news on 2015-12-03 (Fri Oct 13 00:48:01 2017)...
Processing tweets associated with news on 2015-12-04 (Fri Oct 13 00:48:08 2017)...
Processing tweets associated with news on 2015-12-05 (Fri Oct 13 00:48:35 2017)...
Processing tweets associated with news on 2015-12-06 (Fri Oct 13 00:48:49 2017)...
Processing tweets associated with news on 2015-12-07 (Fri Oct 13 00:49:06 2017)...
Processing tweets associated with news on 2016-01-06 (Fri Oct 13 00:49:25 2017)...
Processing tweets associated with news on 2016-01-07 (Fri Oct 13 00:49:39 2017)...
Processing tweets associated with news on 2016-01-08 (Fri Oct 13 00:49:48 2017)...
Processing tweets associated with news on 2016-01-09 (Fri Oct 13 00:50:00 2017)...
Processing tweets associated with news on 2016-01-10 (Fri Oct 13 00:50:17 2017)...
Processing tweets associated with news on 2016-01-11 (Fri Oct 13 00:50:30 2017)...
Processing tweets associated with news on 2016-01-12 (Fri Oct 13 00:50:41 2017)...
Processing tweets associated with news on 2016-01-13 (Fri Oct 13 00:51:06 2017)...
Processing tweets associated with news on 2016-01-14 (Fri Oct 13 00:51:22 2017)...
Processing tweets associated with news on 2016-01-15 (Fri Oct 13 00:51:37 2017)...
Processing tweets associated with news on 2016-01-16 (Fri Oct 13 00:51:46 2017)...
Processing tweets associated with news on 2016-01-17 (Fri Oct 13 00:51:54 2017)...
Processing tweets associated with news on 2016-01-18 (Fri Oct 13 00:52:14 2017)...
Processing tweets associated with news on 2016-01-19 (Fri Oct 13 00:52:28 2017)...
Processing tweets associated with news on 2016-01-20 (Fri Oct 13 00:52:46 2017)...
Processing tweets associated with news on 2016-01-21 (Fri Oct 13 00:53:07 2017)...
Processing tweets associated with news on 2016-01-22 (Fri Oct 13 00:53:27 2017)...
Processing tweets associated with news on 2016-01-23 (Fri Oct 13 00:53:40 2017)...
Processing tweets associated with news on 2016-01-24 (Fri Oct 13 00:53:55 2017)...
Processing tweets associated with news on 2016-01-25 (Fri Oct 13 00:54:07 2017)...
Processing tweets associated with news on 2016-01-26 (Fri Oct 13 00:54:17 2017)...
Processing tweets associated with news on 2016-01-27 (Fri Oct 13 00:54:31 2017)...
Processing tweets associated with news on 2016-01-28 (Fri Oct 13 00:54:51 2017)...
Processing tweets associated with news on 2016-01-29 (Fri Oct 13 00:55:05 2017)...
Processing tweets associated with news on 2016-01-30 (Fri Oct 13 00:55:21 2017)...
Processing tweets associated with news on 2016-01-31 (Fri Oct 13 00:55:35 2017)...
Processing tweets associated with news on 2016-02-01 (Fri Oct 13 00:55:46 2017)...
Processing tweets associated with news on 2016-02-02 (Fri Oct 13 00:55:57 2017)...
Processing tweets associated with news on 2016-02-03 (Fri Oct 13 00:55:57 2017)...
Processing tweets associated with news on 2016-02-04 (Fri Oct 13 00:55:57 2017)...
Processing tweets associated with news on 2016-02-05 (Fri Oct 13 00:55:57 2017)...
Processing tweets associated with news on 2016-02-06 (Fri Oct 13 00:55:57 2017)...
Processing tweets associated with news on 2016-02-07 (Fri Oct 13 00:55:59 2017)...
Processing tweets associated with news on 2016-02-08 (Fri Oct 13 00:56:14 2017)...
Processing tweets associated with news on 2016-02-09 (Fri Oct 13 00:56:26 2017)...
Processing tweets associated with news on 2016-02-10 (Fri Oct 13 00:56:35 2017)...
Processing tweets associated with news on 2016-02-11 (Fri Oct 13 00:56:47 2017)...
Processing tweets associated with news on 2016-02-12 (Fri Oct 13 00:56:58 2017)...
Processing tweets associated with news on 2016-02-13 (Fri Oct 13 00:57:11 2017)...
Processing tweets associated with news on 2016-02-16 (Fri Oct 13 00:57:23 2017)...
CPU times: user 5h 7min 10s, sys: 7min 19s, total: 5h 14min 29s
Wall time: 5h 23min 11s
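
The line-level filtering used in the cell above (strip NUL bytes, skip incomplete lines, skip retweets) can be checked on a few in-memory lines before committing to a multi-hour run. This is a minimal sketch: `iter_original_tweets` is a hypothetical helper and the sample lines are made up.

```python
import csv
import io

def iter_original_tweets(lines):
    """Yield (tweet_native_id, tweet_text, tweet_post_time) for non-retweet lines."""
    # Remove NUL bytes first; the csv module may reject lines that contain them.
    cleaned = (line.replace('\0', '') for line in lines)
    reader = csv.reader(cleaned, delimiter='\t', doublequote=False, quoting=csv.QUOTE_NONE)
    for fields in reader:
        # Skip incomplete lines and retweets, as in the cell above.
        if len(fields) >= 3 and not fields[1].startswith('RT'):
            yield fields[0], fields[1], fields[2]

# Made-up lines: a normal tweet with a stray NUL byte, a retweet, a truncated line.
raw = io.StringIO(
    '1\tMissouri governor\0 declares emergency\tTue Nov 18 06:48:16 +0000 2014\n'
    '2\tRT @someone: Missouri governor declares\tTue Nov 18 06:46:15 +0000 2014\n'
    '3\tbroken line\n'
)
kept = list(iter_original_tweets(raw))
print(kept)
```

Note that the `startswith('RT')` test also drops any tweet whose text merely begins with the letters "RT"; that matches the behavior of the cell above and is kept here on purpose.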

Delete news collected on tweet error dates


In [14]:
"""
Twitter's internal server was unstable on some dates.
Tweets data collected on these dates contains errors and cannot be parsed.
Delete the corresponding news data collected on these dates.

Also see Notes section
"""
if 0 == 1:
    with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
        cursor = conn.cursor()
        query_delete_news = '''
        delete from news where news_collected_time = :news_collected_time
        '''
        for original_tweets_error_date in config.ORIGINAL_TWEETS_ERROR_DATES_LST:
            print('Delete news on {} ...'.format(original_tweets_error_date))
            cursor.execute(query_delete_news, {'news_collected_time': original_tweets_error_date})


Delete news on 2015-06-05 ...
Delete news on 2015-09-20 ...
Delete news on 2015-09-21 ...
Delete news on 2015-12-08 ...
Delete news on 2015-12-09 ...
Delete news on 2015-12-10 ...
Delete news on 2016-02-14 ...
Delete news on 2016-02-15 ...
Delete news on 2016-02-17 ...
Delete news on 2016-02-18 ...
Delete news on 2016-02-19 ...
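
The per-date delete loop above can also be written as one `executemany` call. This is a minimal sketch against an in-memory database: the two-column `news` table and `error_dates_lst` are illustrative stand-ins for the real table and `config.ORIGINAL_TWEETS_ERROR_DATES_LST`, not the notebook's actual schema or config.

```python
import sqlite3

# Hypothetical stand-in for config.ORIGINAL_TWEETS_ERROR_DATES_LST.
error_dates_lst = ['2015-06-05', '2015-09-20']

with sqlite3.connect(':memory:') as conn:
    cursor = conn.cursor()
    # Simplified stand-in for the news table (only the columns needed here).
    cursor.execute('create table news (news_id integer primary key, news_collected_time text)')
    cursor.executemany('insert into news (news_collected_time) values (?)',
                       [('2015-06-04',), ('2015-06-05',), ('2015-09-20',), ('2015-09-20',)])

    # One parameterized delete statement, executed once per error date.
    query_delete_news = 'delete from news where news_collected_time = :news_collected_time'
    cursor.executemany(query_delete_news,
                       [{'news_collected_time': d} for d in error_dates_lst])

    cursor.execute('select news_collected_time from news')
    remaining_rows = cursor.fetchall()

print(remaining_rows)
```

Only the row collected on 2015-06-04 should remain in this toy table. As in the cell above, leaving the `with` block commits the transaction.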

Build necessary indexes


In [15]:
%%time
"""
Build indexes on news table news_native_id, news_title, news_post_time, and news_collected_time fields.
"""
if 0 == 1:
    with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
        cursor = conn.cursor()
        queries_lst = ['create index news_news_native_id on news(news_native_id);',
                       'create index news_news_title on news(news_title);',
                       'create index news_news_post_time on news(news_post_time);',
                       'create index news_news_collected_time on news(news_collected_time);']
        
        for query in queries_lst:
            cursor.execute(query)


CPU times: user 316 ms, sys: 488 ms, total: 804 ms
Wall time: 914 ms

In [12]:
%%time
"""
Build indexes on tweets table tweet_native_id, tweet_post_time, tweet_collected_time, and news_native_id fields.
"""
if 0 == 1:
    with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
        cursor = conn.cursor()
        queries_lst = ['create index tweets_tweet_native_id on tweets(tweet_native_id);',
                       'create index tweets_tweet_post_time on tweets(tweet_post_time);',
                       'create index tweets_tweet_collected_time on tweets(tweet_collected_time);',
                       'create index tweets_news_native_id on tweets(news_native_id);']

        
        for query_ind, query in enumerate(queries_lst):
            print('Building index {}/{} ...'.format(query_ind+1, len(queries_lst)))
            cursor.execute(query)


Building index 1/4 ...
Building index 2/4 ...
Building index 3/4 ...
Building index 4/4 ...
CPU times: user 17min 32s, sys: 6min 47s, total: 24min 19s
Wall time: 50min 8s

In [2]:
"""
List out tables and indices
"""
if 1 == 1:
    with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
        cursor = conn.cursor()
        query = """
        select name from sqlite_master
        where type = 'table';
        """
        cursor.execute(query)
        print('TABLES:')
        print(cursor.fetchall())
        
        query = """
        select name from sqlite_master
        where type = 'index'
        """
        cursor.execute(query)
        print('INDICES:')
        print(cursor.fetchall())


TABLES:
[('news',), ('sqlite_sequence',), ('tweets',)]
INDICES:
[('news_news_native_id',), ('news_news_title',), ('news_news_post_time',), ('news_news_collected_time',), ('tweets_tweet_native_id',), ('tweets_tweet_post_time',), ('tweets_tweet_collected_time',), ('tweets_news_native_id',)]
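
Listing the indices confirms they exist, but not that a query uses them. A rough check, sketched here on an in-memory stand-in for the `tweets` table (the reduced schema is an assumption), is to read the `explain query plan` output for a lookup by `news_native_id`.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Reduced, hypothetical version of the tweets table plus one of the indexes built above.
cursor.execute('create table tweets (tweet_id integer primary key, news_native_id integer, tweet_text text)')
cursor.execute('create index tweets_news_native_id on tweets(news_native_id);')

cursor.execute('explain query plan select tweet_text from tweets where news_native_id = :news_native_id',
               {'news_native_id': 3098})

# The last column of each plan row is the human-readable detail string.
plan_details_lst = [row[-1] for row in cursor.fetchall()]
uses_index = any('tweets_news_native_id' in detail for detail in plan_details_lst)

print(plan_details_lst)
print('Uses index: {}'.format(uses_index))
conn.close()
```

The exact wording of the plan text varies between SQLite versions, so the sketch only checks that the index name appears in it.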

DEPRECATED CODE


In [ ]:
%%time
"""
DEPRECATED

Load news data by Meng

Note: should be executed twice, once for each file (the path below is hardcoded to dataset2)
 - data/raw-news_tweets-meng/dataset1/news.txt
 - data/raw-news_tweets-meng/dataset2/news.txt
"""
if 0 == 1:
    with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
        with open(os.path.join(config.DATA_DIR, 'raw-news_tweets-meng/dataset2/news.txt'), 'r') as f:
            news_df_chunksize = 10000
            for df_chunk in pd.read_csv(f, sep='\t', chunksize=news_df_chunksize, iterator=True):
                column_names = {
                    'POST_TIME': 'post_time',
                    'NEWS_SOURCE': 'news_source',
                    'NEWS_TITLE': 'news_title',
                    'NEWS_DOC': 'news_doc'
                }
                df_chunk = df_chunk.rename(columns=column_names)
                df_chunk.to_sql(name='news', con=conn, if_exists='append', index=False)

In [ ]:
%%time
"""
DEPRECATED

Load tweets data by Meng

Note: should be executed twice, once for each file (the path below is hardcoded to dataset2)
 - data/raw-news_tweets-meng/dataset1/tweets.txt
 - data/raw-news_tweets-meng/dataset2/tweets.txt
"""
if 0 == 1:
    with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
        with open(os.path.join(config.DATA_DIR, 'raw-news_tweets-meng/dataset2/tweets.txt'), 'r') as f:
            tweets_df_chunksize = 100000
            for df_chunk in pd.read_csv(f, sep='\t', chunksize=tweets_df_chunksize, iterator=True):
                column_names = {
                    'POST_TIME': 'post_time',
                    'TWEET_TEXT': 'tweet_text'
                }
                df_chunk = df_chunk.rename(columns=column_names)
                df_chunk.to_sql(name='tweets', con=conn, if_exists='append', index=False)

Check basic statistics


In [18]:
%%time
"""
Check number of news per day
"""
if 1 == 1:
    with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
        conn.row_factory = sqlite3.Row
        cursor = conn.cursor()
        
        query = '''
        select date(news_collected_time) as news_date, count(news_id) as news_num
        from news
        group by date(news_collected_time);
        '''
        
        cursor.execute(query)
        for row in cursor.fetchall():
            print('{}: {}'.format(row['news_date'], row['news_num']))


2014-11-18: 290
2014-11-19: 267
2014-11-20: 320
2014-11-21: 293
2014-11-22: 260
2014-11-23: 275
2014-11-24: 274
2014-11-25: 323
2014-11-26: 307
2014-11-27: 287
2014-11-28: 291
2014-11-29: 250
2014-11-30: 263
2014-12-01: 323
2014-12-02: 291
2014-12-03: 293
2014-12-04: 283
2014-12-05: 288
2014-12-06: 233
2014-12-07: 262
2014-12-08: 292
2014-12-09: 272
2014-12-10: 284
2014-12-11: 284
2014-12-12: 254
2014-12-13: 231
2014-12-14: 149
2014-12-16: 47
2014-12-17: 313
2014-12-18: 51
2014-12-19: 301
2014-12-20: 248
2014-12-21: 268
2014-12-22: 293
2014-12-23: 282
2014-12-24: 287
2014-12-25: 261
2014-12-26: 269
2014-12-27: 253
2014-12-28: 265
2014-12-29: 263
2014-12-30: 69
2014-12-31: 55
2015-01-01: 271
2015-01-02: 282
2015-01-03: 254
2015-01-04: 243
2015-01-05: 262
2015-01-06: 311
2015-01-07: 269
2015-01-08: 275
2015-01-09: 296
2015-01-10: 257
2015-01-11: 282
2015-01-12: 276
2015-01-13: 299
2015-01-14: 311
2015-01-15: 280
2015-01-16: 303
2015-01-17: 248
2015-01-18: 258
2015-01-19: 295
2015-01-20: 274
2015-01-21: 287
2015-01-22: 277
2015-01-23: 144
2015-01-24: 258
2015-01-25: 280
2015-01-26: 214
2015-01-27: 43
2015-01-28: 153
2015-01-29: 295
2015-01-30: 291
2015-01-31: 290
2015-02-01: 272
2015-02-02: 302
2015-02-03: 306
2015-02-04: 290
2015-02-05: 281
2015-02-06: 267
2015-02-07: 283
2015-02-08: 268
2015-02-09: 300
2015-02-10: 305
2015-02-11: 274
2015-02-12: 279
2015-02-13: 280
2015-02-14: 243
2015-02-15: 270
2015-02-16: 275
2015-02-17: 287
2015-02-18: 316
2015-02-19: 295
2015-02-20: 300
2015-02-21: 256
2015-02-22: 251
2015-02-23: 326
2015-02-24: 312
2015-02-25: 324
2015-02-26: 305
2015-02-27: 280
2015-02-28: 253
2015-03-01: 273
2015-03-02: 297
2015-03-03: 277
2015-03-04: 274
2015-03-05: 281
2015-03-06: 286
2015-03-07: 279
2015-03-08: 244
2015-03-09: 304
2015-03-10: 287
2015-03-11: 287
2015-03-12: 318
2015-03-13: 304
2015-03-14: 257
2015-03-15: 273
2015-03-16: 309
2015-03-17: 306
2015-03-18: 301
2015-03-19: 298
2015-03-20: 283
2015-03-21: 242
2015-03-22: 31
2015-03-29: 305
2015-03-30: 320
2015-03-31: 313
2015-04-01: 320
2015-04-02: 297
2015-04-03: 292
2015-04-04: 284
2015-04-05: 286
2015-04-06: 315
2015-04-07: 323
2015-04-08: 328
2015-04-09: 298
2015-04-10: 299
2015-04-11: 289
2015-04-12: 299
2015-04-13: 307
2015-04-14: 315
2015-04-15: 309
2015-04-16: 309
2015-04-17: 295
2015-04-18: 276
2015-04-19: 284
2015-04-20: 294
2015-04-21: 294
2015-04-22: 316
2015-04-23: 326
2015-04-24: 312
2015-04-25: 254
2015-04-26: 278
2015-04-27: 312
2015-04-28: 347
2015-04-29: 334
2015-04-30: 284
2015-05-01: 236
2015-05-02: 224
2015-05-03: 245
2015-05-04: 264
2015-05-05: 106
2015-05-06: 246
2015-05-07: 248
2015-05-08: 229
2015-05-09: 219
2015-05-10: 236
2015-05-11: 264
2015-05-12: 239
2015-05-13: 257
2015-05-14: 230
2015-05-15: 209
2015-05-16: 208
2015-05-17: 242
2015-05-18: 232
2015-05-19: 245
2015-05-20: 246
2015-05-21: 257
2015-05-22: 238
2015-05-23: 212
2015-05-24: 198
2015-05-25: 254
2015-05-26: 245
2015-05-27: 245
2015-05-28: 270
2015-05-29: 214
2015-05-30: 179
2015-05-31: 193
2015-06-01: 228
2015-06-02: 242
2015-06-03: 226
2015-06-04: 219
2015-09-07: 49
2015-09-08: 227
2015-09-09: 251
2015-09-10: 236
2015-09-11: 226
2015-09-12: 208
2015-09-13: 195
2015-09-14: 225
2015-09-15: 241
2015-09-16: 221
2015-09-17: 254
2015-09-18: 181
2015-09-19: 142
2015-09-30: 15
2015-10-01: 169
2015-11-13: 23
2015-11-14: 199
2015-11-15: 162
2015-11-16: 218
2015-11-17: 183
2015-11-18: 162
2015-11-19: 152
2015-11-20: 182
2015-11-21: 137
2015-11-22: 168
2015-11-23: 139
2015-11-24: 172
2015-11-25: 159
2015-11-26: 149
2015-11-27: 130
2015-11-28: 143
2015-11-29: 137
2015-11-30: 148
2015-12-01: 134
2015-12-03: 74
2015-12-04: 179
2015-12-05: 149
2015-12-06: 164
2015-12-07: 161
2016-01-06: 115
2016-01-07: 91
2016-01-08: 119
2016-01-09: 106
2016-01-10: 98
2016-01-11: 108
2016-01-12: 125
2016-01-13: 113
2016-01-14: 105
2016-01-15: 90
2016-01-16: 114
2016-01-17: 117
2016-01-18: 91
2016-01-19: 133
2016-01-20: 107
2016-01-21: 110
2016-01-22: 131
2016-01-23: 111
2016-01-24: 115
2016-01-25: 93
2016-01-26: 124
2016-01-27: 122
2016-01-28: 126
2016-01-29: 150
2016-01-30: 129
2016-01-31: 110
2016-02-01: 116
2016-02-02: 107
2016-02-03: 126
2016-02-04: 112
2016-02-05: 131
2016-02-06: 115
2016-02-07: 118
2016-02-08: 128
2016-02-09: 140
2016-02-10: 154
2016-02-11: 132
2016-02-12: 162
2016-02-13: 116
2016-02-16: 145
CPU times: user 116 ms, sys: 28 ms, total: 144 ms
Wall time: 132 ms
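
The per-day listing above skips some dates (for example, there is no row for 2014-12-15). Such gaps can be listed programmatically. The sketch below uses a four-day sample copied from the output; `news_num_by_date` is a hypothetical name for the query result turned into a dict.

```python
from datetime import date, timedelta

# Sample copied from the output above; 2014-12-15 is absent.
news_num_by_date = {'2014-12-13': 231, '2014-12-14': 149, '2014-12-16': 47, '2014-12-17': 313}

observed_dates = sorted(date.fromisoformat(d) for d in news_num_by_date)

# Walk every calendar day between the first and last observed date.
missing_dates_lst = []
current = observed_dates[0]
while current <= observed_dates[-1]:
    if current.isoformat() not in news_num_by_date:
        missing_dates_lst.append(current.isoformat())
    current += timedelta(days=1)

print(missing_dates_lst)
```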

In [38]:
"""
Check any single news article
"""
if 1 == 1:
    with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
        conn.row_factory = sqlite3.Row
        cursor = conn.cursor()
        
        '''
        Note:
         - News collected up to 2015-03-17 has an empty news_entities field.
         - Starting from news_native_id = '34583' on 2015-03-18, most news has news_entities information (with a few exceptions each day).
        '''
        
        query = """select * from news where news_native_id = '34583';"""
        
        cursor.execute(query)
        
        result = cursor.fetchone()
        row_keys_lst = [item[0] for item in cursor.description]
        # result is a single row; iterate over its field values
        for field_ind, field_value in enumerate(result):
            print('({}/{}) {}: {}'.format(field_ind+1, len(result), row_keys_lst[field_ind], field_value))
        
#         for row in cursor.fetchall():
#             print('{}: {}'.format(row['news_native_id'], row['news_entities']))


(1/10) news_id: 31899
(2/10) news_native_id: 34583
(3/10) news_url: http://www.wsj.com/articles/netanyahu-gains-edge-in-official-vote-count-1426581793
(4/10) news_title: Netanyahu Pulls Ahead of Main Challenger Herzog in Israeli Elections
(5/10) news_source: Wall Street Journal
(6/10) news_post_time: 2015-03-17 21:30:55
(7/10) news_collected_time: 2015-03-18
(8/10) news_keywords: Netanyahu;Elections;Israeli
(9/10) news_doc: TEL AVIV—Conservative Prime Minister Benjamin Netanyahu pulled ahead of his main challenger in Israeli elections with most of the votes counted early Wednesday, a strong showing after he hammered away at security issues in the final hours of the campaign.::::::::He will still need the support of smaller parties to form a ruling coalition, but appeared to be in the best position to garner a majority of support in parliament after Tuesday’s...
(10/10) news_entities: Israeli:place:2;Netanyahu:person:1;Benjamin Netanyahu:person:1;Wednesday:org:1
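
The `news_entities` value shown above looks like a `;`-separated list of `name:type:count` items. Assuming that format holds for other rows (only this one row was inspected here), a hypothetical helper `parse_news_entities` could split it into tuples:

```python
def parse_news_entities(news_entities):
    """Split a 'name:type:count;name:type:count' string into (name, type, count) tuples.

    Assumes the format seen in the sample row above. rsplit keeps any ':' inside
    the entity name intact by splitting from the right.
    """
    entities_lst = []
    if not news_entities:
        return entities_lst
    for item in news_entities.split(';'):
        name, entity_type, count = item.rsplit(':', 2)
        entities_lst.append((name, entity_type, int(count)))
    return entities_lst


sample_news_entities = 'Israeli:place:2;Netanyahu:person:1;Benjamin Netanyahu:person:1;Wednesday:org:1'
parsed_entities = parse_news_entities(sample_news_entities)
for entity in parsed_entities:
    print(entity)
```

An empty field (as in news collected up to 2015-03-17) yields an empty list.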

In [16]:
%%time
"""
Check number of tweets on a given date
"""
if 1 == 1:
    with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
        conn.row_factory = sqlite3.Row
        cursor = conn.cursor()
        
        tweet_collected_time = '2015-04-03'
        
        query = """
        select tweet_collected_time, count(tweet_id) as tweets_num
        from tweets
        where tweet_collected_time = :tweet_collected_time;
        """
        
        cursor.execute(query, {'tweet_collected_time': tweet_collected_time})
        for row in cursor.fetchall():
            print('{}: {}'.format(row['tweet_collected_time'], row['tweets_num']))


2015-04-03: 1487312
CPU times: user 256 ms, sys: 56 ms, total: 312 ms
Wall time: 619 ms

In [17]:
%%time
"""
Check number of tweets per day
"""
if 1 == 1:
    with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
        conn.row_factory = sqlite3.Row
        cursor = conn.cursor()
        
        query = """
        select tweet_collected_time, count(tweet_id) as tweets_num
        from tweets
        group by tweet_collected_time;
        """
        
        cursor.execute(query)
        for row in cursor.fetchall():
            print('{}: {}'.format(row['tweet_collected_time'], row['tweets_num']))


2014-11-18: 1494515
2014-11-19: 1147495
2014-11-20: 1439572
2014-11-21: 1320177
2014-11-22: 1305833
2014-11-23: 1354981
2014-11-24: 1246511
2014-11-25: 1244563
2014-11-26: 1593780
2014-11-27: 1227528
2014-11-28: 1546461
2014-11-29: 1236868
2014-11-30: 1393424
2014-12-01: 1457180
2014-12-02: 1515236
2014-12-03: 1336795
2014-12-04: 1265828
2014-12-05: 1308233
2014-12-06: 1367410
2014-12-07: 1272486
2014-12-08: 1214623
2014-12-09: 1346906
2014-12-10: 1188458
2014-12-11: 1273633
2014-12-12: 1210513
2014-12-13: 1029532
2014-12-14: 213366
2014-12-16: 234871
2014-12-17: 924315
2014-12-18: 387406
2014-12-19: 1675156
2014-12-20: 1329957
2014-12-21: 1327250
2014-12-22: 1407439
2014-12-23: 1209476
2014-12-24: 1594981
2014-12-25: 1283703
2014-12-26: 1345370
2014-12-27: 1289327
2014-12-28: 1312207
2014-12-29: 810524
2014-12-30: 55453
2014-12-31: 647316
2015-01-01: 1470963
2015-01-02: 1218878
2015-01-03: 1419915
2015-01-04: 1381914
2015-01-05: 1191842
2015-01-06: 1518042
2015-01-07: 1136823
2015-01-08: 1443992
2015-01-09: 1247942
2015-01-10: 989682
2015-01-11: 1334989
2015-01-12: 1232914
2015-01-13: 1290127
2015-01-14: 1260424
2015-01-15: 1239630
2015-01-16: 1400073
2015-01-17: 999526
2015-01-18: 1189740
2015-01-19: 1327846
2015-01-20: 1214109
2015-01-21: 1319163
2015-01-22: 1083340
2015-01-23: 146012
2015-01-24: 1372743
2015-01-25: 1294871
2015-01-26: 1320838
2015-01-27: 1270881
2015-01-28: 1057614
2015-01-29: 1506609
2015-01-30: 1365749
2015-01-31: 1300781
2015-02-01: 1318308
2015-02-02: 1478749
2015-02-03: 1203581
2015-02-04: 1361611
2015-02-05: 1540522
2015-02-06: 1116566
2015-02-07: 1225153
2015-02-08: 1396166
2015-02-09: 1159127
2015-02-10: 1320385
2015-02-11: 1354398
2015-02-12: 1224552
2015-02-13: 1401978
2015-02-14: 1298971
2015-02-15: 1232985
2015-02-16: 1306548
2015-02-17: 1260859
2015-02-18: 1458500
2015-02-19: 1219920
2015-02-20: 1316815
2015-02-21: 1324776
2015-02-22: 1214908
2015-02-23: 1243563
2015-02-24: 1403509
2015-02-25: 1456489
2015-02-26: 1413762
2015-02-27: 1208442
2015-02-28: 1175229
2015-03-01: 1250931
2015-03-02: 1208104
2015-03-03: 1363427
2015-03-04: 1394678
2015-03-05: 1172094
2015-03-06: 1384549
2015-03-07: 1466687
2015-03-08: 1212779
2015-03-09: 1337632
2015-03-10: 1165537
2015-03-11: 1146778
2015-03-12: 1344391
2015-03-13: 1307209
2015-03-14: 576043
2015-03-15: 1208341
2015-03-16: 1242481
2015-03-17: 761801
2015-03-18: 1946802
2015-03-19: 1531272
2015-03-20: 1426327
2015-03-21: 600128
2015-03-22: 42051
2015-03-29: 1903144
2015-03-30: 1275066
2015-03-31: 1353108
2015-04-01: 1508713
2015-04-02: 1224303
2015-04-03: 1487312
2015-04-04: 1318566
2015-04-05: 1445558
2015-04-06: 1264336
2015-04-07: 1263628
2015-04-08: 1520243
2015-04-09: 1218857
2015-04-10: 1414855
2015-04-11: 1149927
2015-04-12: 1431028
2015-04-13: 1312947
2015-04-14: 1535514
2015-04-15: 531903
2015-04-16: 231244
2015-04-17: 195513
2015-04-18: 167382
2015-04-19: 195480
2015-04-20: 172493
2015-04-21: 151853
2015-04-22: 162830
2015-04-23: 182853
2015-04-24: 136910
2015-04-25: 147805
2015-04-26: 148999
2015-04-27: 139318
2015-04-28: 143570
2015-04-29: 170839
2015-04-30: 141286
2015-05-01: 227562
2015-05-02: 171961
2015-05-03: 153977
2015-05-04: 67367
2015-05-05: 72491
2015-05-06: 240319
2015-05-07: 140652
2015-05-08: 283339
2015-05-09: 281771
2015-05-10: 233896
2015-05-11: 252855
2015-05-12: 365623
2015-05-13: 381719
2015-05-14: 236633
2015-05-15: 254261
2015-05-16: 259351
2015-05-17: 336145
2015-05-18: 305858
2015-05-19: 214905
2015-05-20: 352908
2015-05-21: 299601
2015-05-22: 262991
2015-05-23: 246439
2015-05-24: 240611
2015-05-25: 294277
2015-05-26: 257284
2015-05-27: 303121
2015-05-28: 231811
2015-05-29: 329029
2015-05-30: 191764
2015-05-31: 252268
2015-06-01: 328906
2015-06-02: 280437
2015-06-03: 189182
2015-06-04: 251068
2015-09-07: 78088
2015-09-08: 296053
2015-09-09: 406226
2015-09-10: 370227
2015-09-11: 250640
2015-09-12: 305628
2015-09-13: 236682
2015-09-14: 290543
2015-09-15: 266116
2015-09-16: 352708
2015-09-17: 326349
2015-09-18: 232285
2015-09-19: 253023
2015-09-30: 31679
2015-10-01: 198903
2015-11-13: 34746
2015-11-14: 288460
2015-11-15: 262927
2015-11-16: 337592
2015-11-17: 292001
2015-11-18: 261343
2015-11-19: 240694
2015-11-20: 252174
2015-11-21: 182985
2015-11-22: 299802
2015-11-23: 227217
2015-11-24: 267451
2015-11-25: 270813
2015-11-26: 214435
2015-11-27: 179895
2015-11-28: 251914
2015-11-29: 253996
2015-11-30: 199121
2015-12-01: 239553
2015-12-03: 65394
2015-12-04: 266862
2015-12-05: 146099
2015-12-06: 167430
2015-12-07: 183862
2016-01-06: 142603
2016-01-07: 85298
2016-01-08: 115484
2016-01-09: 148917
2016-01-10: 136676
2016-01-11: 103113
2016-01-12: 240683
2016-01-13: 159246
2016-01-14: 125269
2016-01-15: 93761
2016-01-16: 78186
2016-01-17: 192521
2016-01-18: 132254
2016-01-19: 170801
2016-01-20: 197839
2016-01-21: 193879
2016-01-22: 135009
2016-01-23: 147907
2016-01-24: 121936
2016-01-25: 100489
2016-01-26: 134754
2016-01-27: 195210
2016-01-28: 145628
2016-01-29: 164063
2016-01-30: 139050
2016-01-31: 107129
2016-02-01: 101926
2016-02-06: 23738
2016-02-07: 138055
2016-02-08: 110841
2016-02-09: 83998
2016-02-10: 124320
2016-02-11: 112097
2016-02-12: 133920
2016-02-13: 130288
2016-02-16: 102389
CPU times: user 42.9 s, sys: 8.6 s, total: 51.5 s
Wall time: 1min 4s
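
The daily counts above drop sharply after 2015-04-14. Because `tweet_collected_time` is stored as ISO `YYYY-MM-DD` text, plain string comparison orders dates correctly, so the two periods can be split directly in SQL. The sketch below uses a hypothetical `daily_tweets` table filled with four values copied from the output, not the real `tweets` table.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.row_factory = sqlite3.Row
cursor = conn.cursor()

# Hypothetical summary table; values copied from the per-day output above.
cursor.execute('create table daily_tweets (tweet_collected_time text, tweets_num integer)')
cursor.executemany('insert into daily_tweets values (?, ?)',
                   [('2015-04-13', 1312947), ('2015-04-14', 1535514),
                    ('2015-04-15', 531903), ('2015-04-16', 231244)])

query = '''
select case when tweet_collected_time <= :period_end then 'selected' else 'after' end as period,
       sum(tweets_num) as tweets_num
from daily_tweets
group by period;
'''
cursor.execute(query, {'period_end': '2015-04-14'})
period_totals = {row['period']: row['tweets_num'] for row in cursor.fetchall()}

print(period_totals)
conn.close()
```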

In [18]:
%%time
"""
Check number of tweets per news
"""
if 1 == 1:
    with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
        conn.row_factory = sqlite3.Row
        cursor = conn.cursor()
        
        query = """
        select news_native_id, count(tweet_id) as tweets_num
        from tweets
        where tweet_collected_time < '2014-11-20'
        group by news_native_id;
        """
        
        cursor.execute(query)
        for row in cursor.fetchall():
            print('{}: {}'.format(row['news_native_id'], row['tweets_num']))


2685: 1265
2686: 23
2687: 19945
2688: 2036
2689: 15087
2690: 5591
2691: 18219
2692: 5511
2693: 417
2694: 123
2695: 3220
2696: 315
2697: 581
2698: 749
2699: 1202
2700: 519
2701: 10257
2702: 1833
2703: 4656
2704: 685
2705: 106
2706: 180
2707: 238
2708: 39023
2709: 24822
2710: 484
2711: 1596
2712: 130
2713: 42760
2714: 45109
2715: 216
2716: 314
2717: 6584
2718: 4216
2719: 90
2720: 35864
2721: 583
2722: 148
2723: 474
2724: 2313
2725: 182
2726: 63740
2727: 2915
2728: 1263
2729: 285
2730: 24840
2731: 9881
2732: 3592
2733: 18
2734: 154
2735: 235
2736: 324
2737: 18374
2738: 3898
2739: 4998
2740: 1108
2741: 18195
2742: 17848
2743: 341
2744: 57
2745: 110
2746: 349
2747: 52
2748: 255
2749: 7384
2750: 3412
2751: 6089
2752: 130
2753: 103
2754: 26
2755: 20428
2756: 2403
2757: 842
2758: 3529
2759: 106
2760: 3153
2761: 150
2762: 20762
2763: 154
2764: 591
2765: 2310
2766: 103
2767: 2202
2768: 1252
2769: 606
2770: 1672
2772: 20854
2773: 767
2774: 2152
2775: 4587
2776: 10063
2777: 1095
2778: 537
2779: 1906
2780: 2247
2781: 51
2782: 202
2783: 2252
2785: 14954
2786: 15391
2787: 669
2788: 90
2789: 237
2790: 11867
2791: 2461
2792: 77
2793: 6888
2794: 17041
2795: 1177
2796: 620
2797: 82
2798: 69
2799: 99
2800: 142
2801: 8645
2802: 21792
2803: 669
2804: 106
2805: 2554
2806: 273
2807: 8028
2808: 3001
2809: 5096
2810: 238
2811: 1722
2812: 18102
2813: 1913
2814: 9
2815: 13232
2816: 22
2817: 27423
2818: 1143
2819: 26714
2820: 2651
2821: 5023
2822: 563
2823: 11399
2824: 665
2825: 2487
2826: 3043
2827: 4795
2828: 35986
2829: 223
2830: 1823
2831: 1801
2832: 2309
2833: 18219
2834: 13667
2835: 671
2836: 4237
2837: 1822
2838: 3519
2839: 710
2840: 2081
2841: 1660
2842: 6822
2843: 438
2844: 78
2845: 1249
2846: 12
2847: 2762
2848: 3619
2849: 13694
2850: 10730
2851: 992
2852: 136
2853: 336
2854: 41
2855: 2569
2856: 26944
2857: 1072
2858: 52
2859: 1599
2860: 714
2862: 1264
2863: 9308
2864: 3707
2865: 720
2866: 76
2867: 559
2868: 1790
2869: 23615
2870: 36
2871: 4323
2873: 31027
2874: 27836
2875: 7572
2876: 105
2877: 194
2878: 106
2879: 1582
2880: 72
2881: 5046
2882: 107
2883: 517
2884: 10668
2885: 12071
2886: 1076
2887: 1198
2888: 1
2889: 29
2890: 21
2891: 297
2892: 2
2893: 13410
2894: 13548
2895: 957
2896: 351
2897: 607
2898: 70
2899: 27
2900: 943
2901: 428
2902: 38872
2903: 73
2904: 4507
2905: 368
2906: 66
2907: 107
2908: 68
2909: 29866
2910: 425
2911: 4643
2912: 96
2913: 1704
2914: 490
2915: 4638
2916: 4780
2917: 134
2918: 407
2919: 449
2920: 719
2921: 1401
2922: 707
2923: 2374
2924: 325
2925: 5832
2926: 251
2927: 99
2928: 120
2929: 1982
2930: 9256
2931: 4838
2932: 6628
2934: 79
2935: 3433
2936: 145
2937: 90
2938: 7328
2939: 495
2940: 17400
2941: 781
2942: 573
2943: 143
2944: 2116
2945: 10950
2946: 10343
2947: 99
2948: 49000
2949: 8747
2950: 4945
2951: 4230
2953: 328
2954: 1424
2955: 2944
2956: 31
2957: 19523
2958: 7250
2959: 270
2960: 224
2961: 155
2962: 189
2963: 9559
2964: 4422
2965: 5164
2966: 42
2967: 490
2968: 147
2969: 356
2970: 9168
2971: 4473
2972: 4
2973: 5463
2974: 124
2975: 428
2976: 9412
2977: 3552
2978: 312
2979: 605
2980: 196
2981: 14073
2982: 1264
2983: 75
2984: 252
2985: 854
2986: 20165
2987: 2730
2988: 7834
2989: 8496
2990: 67
2991: 780
2992: 338
2993: 26431
2994: 277
2995: 2144
2996: 115
2997: 3207
2998: 183
2999: 4118
3000: 393
3002: 50
3003: 3207
3004: 2033
3005: 388
3006: 18477
3007: 240
3008: 20781
3009: 217
3010: 3092
3011: 101
3012: 23
3013: 15301
3014: 504
3015: 8001
3016: 99
3017: 634
3018: 9883
3019: 794
3020: 16
3021: 27565
3022: 604
3023: 3634
3024: 1070
3025: 131
3027: 392
3028: 116
3029: 113
3030: 271
3031: 22649
3032: 593
3033: 111
3034: 153
3035: 2971
3036: 181
3037: 2641
3038: 142
3039: 39050
3040: 7535
3041: 72
3042: 4066
3043: 23162
3044: 11944
3045: 3
3046: 198
3047: 51
3048: 321
3049: 69673
3050: 26802
3051: 3
3052: 121
3053: 1304
3054: 8720
3055: 119
3056: 2291
3057: 8531
3058: 78
3059: 225
3060: 103
3061: 344
3062: 342
3063: 1145
3064: 29
3065: 843
3066: 7340
3067: 670
3068: 284
3069: 95
3070: 127
3071: 784
3072: 261
3073: 8200
3074: 2099
3075: 305
3076: 8011
3077: 96
3078: 2526
3079: 4390
3080: 1078
3081: 319
3082: 19124
3083: 10879
3084: 109
3085: 138
3086: 300
3087: 4961
3088: 54
3089: 126
3090: 6924
3091: 400
3092: 852
3093: 25
3094: 2230
3095: 1861
3096: 314
3097: 1930
3098: 13
3099: 18136
3100: 91
3101: 1689
3102: 2203
3103: 124
3104: 716
3105: 795
3106: 5781
3107: 7945
3108: 7036
3109: 3566
3110: 227
3111: 198
3112: 6053
3113: 5811
3114: 176
3115: 146
3116: 2625
3117: 15170
3118: 113
3119: 29108
3120: 232
3121: 1584
3122: 680
3123: 6996
3124: 6553
3125: 77
3126: 883
3127: 369
3128: 1329
3129: 11849
3130: 52
3131: 10
3132: 42
3133: 308
3134: 834
3135: 19514
3136: 32928
3137: 367
3138: 540
3139: 4813
3140: 3880
3141: 589
3142: 788
3143: 1117
3145: 1306
3146: 586
3147: 458
3148: 5641
3149: 1411
3150: 964
3151: 234
3152: 5914
3153: 28
3154: 699
3155: 1357
3156: 381
3157: 561
3158: 18691
3159: 321
3160: 5
3161: 205
3162: 104
3163: 86
3164: 613
3165: 14038
3166: 3712
3167: 57
3168: 744
3169: 53
3170: 1922
3171: 1914
3172: 52416
3173: 4639
3174: 221
3175: 5951
3176: 587
3177: 9173
3178: 5474
3179: 205
3180: 25
3181: 94
3182: 141
3183: 55
3184: 159
3185: 5737
3186: 15107
3187: 33233
3188: 2599
3189: 128
3190: 233
3191: 21852
3192: 89
3193: 5330
3194: 67
3195: 638
3196: 122
3197: 836
3198: 2115
3199: 5758
3200: 6525
3201: 5
3202: 1132
3203: 329
3204: 189
3205: 905
3206: 1082
3207: 746
3208: 55
3209: 543
3210: 29
3211: 234
3212: 41725
3213: 17825
3214: 104
3215: 397
3216: 589
3217: 936
3218: 26294
3219: 456
3220: 106
3221: 6875
3222: 26244
3223: 89
3224: 1550
3225: 1011
3226: 710
3227: 1106
3228: 337
3229: 84
3230: 6
3231: 261
3232: 56
3233: 114
3234: 120
3235: 54
3236: 1250
3237: 39
3238: 848
3239: 6863
3240: 10639
3241: 622
CPU times: user 3min 8s, sys: 1min 11s, total: 4min 19s
Wall time: 7min 47s
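
Printing one line per news item is hard to scan. One alternative is to wrap the same `group by` in an outer query and let SQLite summarize the distribution. This is only a sketch on a toy in-memory table; the reduced schema and sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
cursor = conn.cursor()

# Toy stand-in for the tweets table: 3, 1 and 5 tweets for three news items.
cursor.execute('create table tweets (tweet_id integer primary key autoincrement, news_native_id integer)')
sample_rows = [(2685,)] * 3 + [(2686,)] * 1 + [(2687,)] * 5
cursor.executemany('insert into tweets (news_native_id) values (?)', sample_rows)

query = '''
select count(*), min(tweets_num), max(tweets_num), avg(tweets_num)
from (
    select news_native_id, count(tweet_id) as tweets_num
    from tweets
    group by news_native_id
);
'''
cursor.execute(query)
news_num, tweets_min, tweets_max, tweets_avg = cursor.fetchone()

print('news: {}, min: {}, max: {}, avg: {}'.format(news_num, tweets_min, tweets_max, tweets_avg))
conn.close()
```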

In [19]:
%%time
"""
Check tweets for a given news
"""
if 1 == 1:
    with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
        cursor = conn.cursor()
        
        query = """
        select tweet_native_id, tweet_text, tweet_post_time, tweet_collected_time from tweets
        where tweets.news_native_id = :news_native_id;
        """
        
        cursor.execute(query, {'news_native_id': 3098})
        for row in cursor.fetchall():
            print(row)


(535147631333634049, "Today's six-pack: The sexiest man and the longest bus ride http://t.co/VsNXpbFmsY", '2014-11-19 19:08:32', '2014-11-19')
(535102061751201792, "Today's six-pack: The sexiest man and the longest bus ride http://t.co/GdgzcI05jl", '2014-11-19 16:07:27', '2014-11-19')
(535099134542622720, "Today's six-pack: The sexiest man and the longest bus ride - Bloomington Pantagraph #sixpack #fitness http://t.co/rjCp7dQxtG", '2014-11-19 15:55:49', '2014-11-19')
(535093918036025345, "Today's six-pack: The sexiest man and the longest bus ride http://t.co/HEsWy4KDGd", '2014-11-19 15:35:06', '2014-11-19')
(535092632356020224, "Today's six-pack: The sexiest man and the longest bus ride - http://t.co/SiBp21eBlj: STLtoday.comToday's six-pa... http://t.co/gV16CKE50W", '2014-11-19 15:29:59', '2014-11-19')
(535090575868166144, "#Today's six-pack: The sexiest man and the longest bus ride - http://t.co/Prj4UmbSKx http://t.co/AckfiBlEua", '2014-11-19 15:21:49', '2014-11-19')
(535088384096497664, "Today's six-pack: The sexiest man and the longest bus ride - http://t.co/rFj3UOa6NH http://t.co/YhNCUFoc7C", '2014-11-19 15:13:06', '2014-11-19')
(535082645651472384, "Today's six-pack: The sexiest man and the longest bus ride - http://t.co/d2h9lsae2E http://t.co/ykuyxwaAxX", '2014-11-19 14:50:18', '2014-11-19')
(535080673917550592, "Today's six-pack: The sexiest man and the longest bus ride - http://t.co/3AkuFUKaCa #sixpack #fitness http://t.co/euMBAk7a5u", '2014-11-19 14:42:28', '2014-11-19')
(532810463353270272, "Man on the bus that's on the phone every morning telling someone he's got a six pack and he's massive, when he's actually fat and ugly😏😴😴", '2014-11-13 08:21:28', '2014-11-19')
(532303228768358402, "There's a white lady and a black guy sitting at the bus stop across from the grotto drinking a six pack... Man I love west view!", '2014-11-11 22:45:54', '2014-11-19')
(535166212251332608, "Today's six-pack: The sexiest man and the longest bus ride.http://t.co/g651BV2KQB http://t.co/KbkoC1gUrh", '2014-11-19 20:22:22', '2014-11-19')
(535319645738106881, "Today's six-pack: The sexiest man and the longest bus ride http://t.co/aw1V5m9O8j", '2014-11-20 06:32:03', '2014-11-19')
CPU times: user 8 ms, sys: 0 ns, total: 8 ms
Wall time: 59.6 ms

Build df pickles for news and tweets over selected period

Select news and tweets data collected between 2014-11-18 and 2015-04-14 inclusive (before Twitter changed API rates)


In [2]:
%%time
"""
Build pickle for news data over selected period.

Register
    NEWS_PERIOD_DF_PKL = os.path.join(DATA_DIR, 'news-period.df.pkl')
in config.
"""
if 0 == 1:
    with sqlite3.connect(config.NEWS_TWEETS_DB_FILE) as conn:
        query = """
        select news_id, news_native_id, news_title, news_post_time, news_collected_time, news_keywords, news_doc, news_entities from news
        where news_collected_time <= '2015-04-14';
        """
        news_period_df = pd.read_sql_query(sql=query, con=conn, parse_dates=['news_post_time', 'news_collected_time'])
        
        '''
        Remove news with an empty news_title or news_doc field
        '''
        news_period_df = news_period_df[news_period_df['news_title'] != '']
        news_period_df = news_period_df[news_period_df['news_doc'] != '']
                
        '''
        Make pickle
        '''
        news_period_df.to_pickle(config.NEWS_PERIOD_DF_PKL)


CPU times: user 1.12 s, sys: 1.41 s, total: 2.54 s
Wall time: 2.54 s
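
The filter-then-pickle step can be tried on a small frame before running it on the full table. In this sketch the sample data, the combined boolean mask, the `reset_index` call, and the temporary file name are all illustrative additions; the cell above applies the two filters one after another and writes to `config.NEWS_PERIOD_DF_PKL`.

```python
import os
import tempfile

import pandas as pd

# Made-up sample rows: one complete, one with an empty title, one with an empty doc.
news_df = pd.DataFrame({
    'news_native_id': [1, 2, 3],
    'news_title': ['Title A', '', 'Title C'],
    'news_doc': ['Doc A', 'Doc B', ''],
})

# Keep only rows where both fields are non-empty.
mask = (news_df['news_title'] != '') & (news_df['news_doc'] != '')
filtered_df = news_df[mask].reset_index(drop=True)

# Round-trip through a pickle in a temporary directory.
with tempfile.TemporaryDirectory() as tmp_dir:
    pkl_file = os.path.join(tmp_dir, 'news-sample.df.pkl')
    filtered_df.to_pickle(pkl_file)
    recovered_df = pd.read_pickle(pkl_file)

print(recovered_df)
```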

In [3]:
%%time
"""
Test recovering the df pickle
"""
if 1 == 1:
    news_period_df = pd.read_pickle(config.NEWS_PERIOD_DF_PKL)


CPU times: user 248 ms, sys: 416 ms, total: 664 ms
Wall time: 665 ms

In [4]:
news_period_df


Out[4]:
news_id news_native_id news_title news_post_time news_collected_time news_keywords news_doc news_entities
0 1 2685 Missouri's Nixon Declares State of Emergency A... 2014-11-17 21:04:21 2014-11-18 missouri;nixon declares;emergency awaiting;gra... Missouri Governor Jay Nixon’s actions undersco...
1 2 2686 PEOPLE: Bill Cosby. Charles Manson, Solange Kn... 2014-11-17 21:01:31 2014-11-18 people;bill cosby;charles manson;solange knowles A blanket denial from Bill Cosby’s lawyer dism...
2 3 2687 Ebola patient who died had received ZMapp late... 2014-11-17 21:34:11 2014-11-18 ebola;zmapp A frantic, 36-hour effort to save the life of ...
3 4 2688 At least 4 dead in attack in Kabul, official says 2014-11-17 21:20:53 2014-11-18 kabul A least four people were killed in a suicide a...
4 5 2689 Australia will not be at periphery of India's ... 2014-11-17 21:26:59 2014-11-18 australia;india;'s vision;modi "We celebrate the legend of Bradman and the cl...
5 6 2690 FBI: Violence could follow Ferguson indictment... 2014-11-17 22:09:02 2014-11-18 violence;ferguson;indictment decision Clayton Police Department officers keep a watc...
6 7 2691 Four Killed in Palestinian Attack at Jerusalem... 2014-11-17 22:16:03 2014-11-18 killed;palestinian attack;jerusalem synagogue Palestinians killed at least four people in an...
7 8 2692 Mass murderer Charles Manson issued marriage l... 2014-11-17 22:15:05 2014-11-18 mass;charles manson;marriage license Manson and 26-year-old Afton Elaine Burton wer...
8 9 2693 News Guide: Texas' latest history textbook tussle 2014-11-17 22:12:58 2014-11-18 guide;texas;history textbook tussle AUSTIN, Texas — The Texas Board of Education i...
9 10 2694 Abdul-Rahman Kassig's parents mourn 'beloved son' 2014-11-17 22:06:17 2014-11-18 abdul-rahman kassig;'s parents The parents of Abdul-Rahman Kassig, who was mu...
10 11 2695 Obama orders full review of US hostage policy 2014-11-17 20:34:21 2014-11-18 obama;full review;us hostage policy President Obama has ordered a complete review ...
11 12 2696 Homeless Children in US: A parent-to-parent ap... 2014-11-17 21:26:18 2014-11-18 homeless;parent-to-parent approach A new report tries to capture the number of ho...
12 13 2697 Alleged Bill Cosby victim has connection to Co... 2014-11-17 22:39:16 2014-11-18 alleged bill cosby;colorado A woman who has recently come forward alleging...
13 14 2698 Church of England approves women bishops 2014-11-17 19:50:14 2014-11-18 england;approves women bishops The Church of England’s General Synod has toda...
14 15 2699 Uber executive wants to dig into personal live... 2014-11-17 22:13:03 2014-11-18 uber;personal lives;discredit journalists One tech exec is uber sorry after being outed ...
15 16 2700 Suicide blast kills two at Kabul's foreign com... 2014-11-17 21:30:18 2014-11-18 suicide;blast kills;kabul;foreign compound At least two security guards have been killed ...
16 17 2701 Answers to questions about the Ferguson grand ... 2014-11-17 22:43:57 2014-11-18 answers;ferguson;grand jury In this Wednesday, Nov. 12, 2014 photo, Rev. O...
17 18 2702 Four Killed in Jerusalem Synagogue Complex 2014-11-17 22:59:39 2014-11-18 killed;jerusalem synagogue complex JERUSALEM — Two assailants armed with a gun, k...
18 19 2703 Source: Charles Manson, fiance get marriage li... 2014-11-17 23:06:36 2014-11-18 source;charles manson;marriage license (CNN) -- Charles Manson and his fiancee, who a...
19 20 2704 Suicide Attack in Afghan Capital Kills 2 2014-11-17 22:42:29 2014-11-18 suicide attack;afghan capital kills KABUL, Afghanistan — A suicide attack targeted...
20 21 2705 Hong Kong Protesters Greet Court Officials Wit... 2014-11-17 23:04:47 2014-11-18 hong kong protesters greet;court officials;ind... Hong Kong officials began enforcing the first ...
21 22 2706 Europeans have prominent role in beheading video 2014-11-17 21:21:22 2014-11-18 europeans;prominent role Maxime Hauchard has been on the radar of Frenc...
22 23 2707 Deals Heat Up for Lawyers Like It's 1998: Busi... 2014-11-17 22:53:02 2014-11-18 deals heat up;lawyers After a day like yesterday, it’s no surprise t...
23 24 2708 Cupich set to become 9th archbishop of Chicago 2014-11-17 22:40:41 2014-11-18 cupich;chicago Blase Cupich is poised become the archbishop o...
24 25 2709 4 Israelis, 2 Palestinians killed in synagogue... 2014-11-17 23:41:46 2014-11-18 israelis;palestinians;synagogue attack;israeli Jerusalem (CNN) -- Two knife- and axe-wielding...
25 26 2710 Missouri Gov. Jay Nixon Declares State Of Emer... 2014-11-17 23:40:39 2014-11-18 missouri gov;jay nixon declares;emergency ahea... Gov. Jay Nixon (D) issued an executive order o...
26 27 2711 French National Identified In Islamic State Be... 2014-11-17 23:36:48 2014-11-18 identified;islamic;beheading video French authorities on Monday identified a Fren...
27 28 2712 Suicide bombing near coalition base in Kabul k... 2014-11-17 23:37:15 2014-11-18 suicide;coalition base;kabul;security officers A Taliban suicide bombing near a base used by ...
28 29 2713 Baseball notes, Nov. 17: Stanton gets record $... 2014-11-17 22:59:38 2014-11-18 nov.;stanton;record $ Power-hitting outfielder Giancarlo Stanton, wh...
29 30 2714 Surgeon dies of Ebola at Nebraska hospital aft... 2014-11-17 22:00:46 2014-11-18 surgeon;ebola;nebraska;sierra The doomed doctor who contracted Ebola while w...
... ... ... ... ... ... ... ... ...
38213 38214 40898 The enduring images of Abraham Lincoln 2015-04-14 16:13:34 2015-04-14 images;Abraham Lincoln;enduring When John Wilkes Booth crept into the presiden... Abraham Lincoln:person:23;Joseph Stalin:person...
38214 38215 40899 Senate Approves a Bill on Changes to Medicare 2015-04-14 18:57:51 2015-04-14 Medicare;Bill;Senate WASHINGTON — The Senate on Tuesday approved sw... United States Senate:org:12;United States Cong...
38215 38216 40900 Hillary Clinton is going after Wall Street 2015-04-14 14:40:55 2015-04-14 Wall Street;Hillary Clinton MONTICELLO, Iowa (Reuters) - Democrat Hillary ... Hillary Rodham Clinton:person:14;Iowa:place:6;...
38216 38217 40901 Lincoln's Assassination: 150 Years Later 2015-04-14 18:33:53 2015-04-14 Assassination;Lincolns Abraham Lincoln had been in good cheer on the ... Abraham Lincoln:person:31;Joseph Stalin:person...
38217 38218 40902 SpaceX Launches Cargo Capsule, Fails to Nail R... 2015-04-14 19:24:21 2015-04-14 SpaceX;Launches;Rocket SpaceX launched a shipment of groceries to the... SpaceX:org:9;NASA:org:4;Italy:place:2;Samantha...
38218 38219 40903 UPDATE 4-Shooting at NC community college inve... 2015-04-14 12:48:48 2015-04-14 hate crime;UPDATE;community college WINSTON-SALEM, N.C., April 14 (Reuters) - A 20... Lane County, Oregon:place:2;Goldsboro, North C...
38219 38220 40904 Hillary's folksy, populist re-entry 2015-04-14 20:04:48 2015-04-14 reentry;Hillarys;populist MONTICELLO, Iowa — “Good to see you! Thanks fo... Hillary Rodham Clinton:person:21;Iowa:place:17...
38220 38221 40905 The man who created the Lincoln we know 2015-04-14 18:42:13 2015-04-14 Lincoln;man In the late hours of April 14, 1865, 150 years... Abraham Lincoln:person:26;Joseph Stalin:person...
38221 38222 40906 US promises stricter regulation on private sec... 2015-04-14 15:09:31 2015-04-14 Blackwater;regulation;private security The US State Department offered assurances on ... United States Department of State:org:4;Academ...
38222 38223 40907 Cuba praises 'fair' US pledge on terrorism list 2015-04-14 19:19:31 2015-04-14 terrorism;Cuba Cuba has welcomed as "fair" a US decision to r... Cuba:place:21;Fidel Castro:person:5;Barack Oba...
38223 38224 40908 SEE IT: Former NYPD cop disarms gunman by ramm... 2015-04-14 19:56:51 2015-04-14 police cruiser;NYPD;Former A former NYPD veteran who calls himself Roboco... Valencia:place:9;Marana, Arizona:place:5;Tucso...
38224 38225 40909 Jury Selected In Colorado Movie Theater Shooti... 2015-04-14 20:07:38 2015-04-14 Movie Theater;Trial;Colorado DENVER—After nearly three months, a jury was s... Arapahoe County, Colorado:place:1;United State...
38225 38226 40910 The American Register – USS Oklahoma crew memb... 2015-04-14 18:48:17 2015-04-14 USS Oklahoma;American;Register The Pentagon announced that the unknown sailor... United States Navy:org:6;United States Marine ...
38226 38227 40911 Killing investigated as hate crime; accused ma... 2015-04-14 19:39:44 2015-04-14 hate crime;allegations;court GOLDSBORO — Police are investigating the shoot... Lane County, Oregon:place:13;Sutton, London:pl...
38227 38228 40912 April 14, 2015 in Falcon 9: Falcon 9 successfu... 2015-04-14 19:56:13 2015-04-14 April;descends;Falcon 9 SpaceX’s Falcon 9 rocket successfully launched... SpaceX:org:16;Kennedy Space Center:org:5;NASA:...
38228 38229 40913 Watch Arizona Police Car Ram Suspect, Ending D... 2015-04-14 19:24:32 2015-04-14 Spree;Arizona;Police The page you've requested either does not exis... Joseph Stalin:person:1;Arizona:place:1;Spree:p...
38229 38230 40914 USS Oklahoma sailors, Marines to be exhumed 2015-04-14 20:02:33 2015-04-14 sailors;Marines;USS Oklahoma Tom Gray's family has waited for more than 70 ... Johns Hopkins University:org:5;United States N...
38230 38231 40915 Quintana leads White Sox to third straight vic... 2015-04-14 19:14:37 2015-04-14 White Sox;victory;straight Jose Abreu celebrates in the dugout after a so... Jose Canseco:person:5;Bobby Abreu:person:5;Ven...
38231 38232 40916 Senate approves bill changing how Medicare pay... 2015-04-14 20:39:58 2015-04-14 Senate;Medicare;doctors The Senate overwhelmingly passed legislation p... United States Congress:org:8;United States Sen...
38232 38233 40917 150 years ago, Abraham Lincoln was shot. Histo... 2015-04-14 19:00:11 2015-04-14 Historians;Abraham Lincoln Thanks to movies and TV, we all think we have ... Booth Theatre:place:24;Abraham Lincoln:person:...
38233 38234 40918 Dashboard camera shows Arizona police officer ... 2015-04-14 20:29:51 2015-04-14 Arizona;cruiser;camera Dramatic dash-cam video released Tuesday shows... Valencia:place:7;Joseph Stalin:person:4;Tucson...
38234 38235 40919 SCOTUS, Gov. Jay Nixon allows execution to go ... 2015-04-14 20:06:10 2015-04-14 Jay Nixon;SCOTUS;execution This article has been updated.::::::::BONNE TE... Cole Thornhart:person:22;Missouri:place:6;Jose...
38235 38236 40920 SpaceX launches space station groceries, espre... 2015-04-14 20:33:41 2015-04-14 SpaceX;groceries;space After a scrub on Monday, the Falcon 9 SpaceX r... SpaceX:org:13;Cape Canaveral Air Force Station...
38236 38237 40921 Cuba gave US assurances it will not support te... 2015-04-14 18:19:07 2015-04-14 future US;Cuba;terrorism "The assurances that they (provided) were fair... Cuba:place:10;Joseph Stalin:person:2;United St...
38237 38238 40922 Unknown USS Oklahoma sailors to be disinterred... 2015-04-14 20:19:48 2015-04-14 sailors;USS Oklahoma The Department of Defense (DoD) announced toda... United States Navy:org:9;United States Departm...
38238 38239 40923 Events, exhibits mark 150 years since Presiden... 2015-04-14 21:03:30 2015-04-14 Abraham Lincoln assassination;Events;mark You'll still find it on News-Sentinel.com, but...
38239 38240 40924 Attorneys ask US Supreme Court to halt Missour... 2015-04-14 20:54:38 2015-04-14 US Supreme Court;execution;Missouri A federal appeals court ruled that the lethal ... Cole Thornhart:person:19;Missouri:place:5;Jose...
38240 38241 40925 Cuba welcomes US move to drop island from terr... 2015-04-14 21:12:52 2015-04-14 Cuba;island;US President Barack Obama's move to drop Cuba fro... Cuba:place:33;United States Armed Forces:org:1...
38241 38242 40926 Actress urges second opinion in cancer diagnosis 2015-04-14 20:58:06 2015-04-14 diagnosis;cancer;Actress BUFFALO, NY (WIVB) — When actress Rita Wilson ... Montana:place:2;Tom Hanks:person:1;Woodrow Wil...
38242 38243 40927 Best Boss Ever Aims to Raise Minimum Worker Pa... 2015-04-14 20:54:55 2015-04-14 Year;Ever;Worker Bet you wish Dan Price was your boss right abo... France:place:1;NASA:org:1;Apple Inc.:org:1;Ste...

37286 rows × 8 columns


In [5]:
"""
Check number of news per day
"""
with pd.option_context('display.max_rows', 150):
    print(news_period_df.groupby(['news_collected_time']).size())


news_collected_time
2014-11-18    288
2014-11-19    253
2014-11-20    311
2014-11-21    285
2014-11-22    252
2014-11-23    269
2014-11-24    272
2014-11-25    309
2014-11-26    303
2014-11-27    281
2014-11-28    284
2014-11-29    246
2014-11-30    254
2014-12-01    317
2014-12-02    290
2014-12-03    286
2014-12-04    275
2014-12-05    263
2014-12-06    232
2014-12-07    257
2014-12-08    288
2014-12-09    269
2014-12-10    270
2014-12-11    271
2014-12-12    249
2014-12-13    224
2014-12-14    147
2014-12-16     45
2014-12-17    308
2014-12-18     51
2014-12-19    297
2014-12-20    243
2014-12-21    248
2014-12-22    285
2014-12-23    275
2014-12-24    283
2014-12-25    245
2014-12-26    267
2014-12-27    245
2014-12-28    258
2014-12-29    261
2014-12-30     66
2014-12-31     54
2015-01-01    265
2015-01-02    268
2015-01-03    248
2015-01-04    236
2015-01-05    257
2015-01-06    301
2015-01-07    265
2015-01-08    272
2015-01-09    294
2015-01-10    256
2015-01-11    274
2015-01-12    254
2015-01-13    292
2015-01-14    299
2015-01-15    278
2015-01-16    297
2015-01-17    246
2015-01-18    252
2015-01-19    281
2015-01-20    262
2015-01-21    281
2015-01-22    272
2015-01-23    140
2015-01-24    256
2015-01-25    266
2015-01-26    203
2015-01-27     42
2015-01-28    150
2015-01-29    282
2015-01-30    278
2015-01-31    282
2015-02-01    271
2015-02-02    295
2015-02-03    294
2015-02-04    287
2015-02-05    275
2015-02-06    262
2015-02-07    275
2015-02-08    264
2015-02-09    292
2015-02-10    303
2015-02-11    265
2015-02-12    266
2015-02-13    276
2015-02-14    237
2015-02-15    262
2015-02-16    272
2015-02-17    284
2015-02-18    310
2015-02-19    285
2015-02-20    295
2015-02-21    247
2015-02-22    239
2015-02-23    309
2015-02-24    309
2015-02-25    319
2015-02-26    301
2015-02-27    274
2015-02-28    247
2015-03-01    270
2015-03-02    292
2015-03-03    276
2015-03-04    269
2015-03-05    279
2015-03-06    282
2015-03-07    262
2015-03-08    231
2015-03-09    303
2015-03-10    285
2015-03-11    285
2015-03-12    315
2015-03-13    298
2015-03-14    245
2015-03-15    270
2015-03-16    300
2015-03-17    299
2015-03-18    296
2015-03-19    297
2015-03-20    277
2015-03-21    233
2015-03-22     31
2015-03-29    298
2015-03-30    312
2015-03-31    309
2015-04-01    316
2015-04-02    291
2015-04-03    288
2015-04-04    279
2015-04-05    274
2015-04-06    306
2015-04-07    301
2015-04-08    316
2015-04-09    292
2015-04-10    296
2015-04-11    275
2015-04-12    285
2015-04-13    301
2015-04-14    312
dtype: int64
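
The per-day counts above skip some calendar dates (for example 2014-12-15, and 2015-03-23 through 2015-03-28) and include a few days with well under 100 articles. A minimal sketch of how such gaps and low-volume days could be listed automatically is given here. The helper name find_gaps_and_low_days and the threshold of 100 are illustrative assumptions, not part of the original notebook, and the toy series only copies four counts from the output above.

```python
import pandas as pd

def find_gaps_and_low_days(daily_counts, low_threshold=100):
    """Hypothetical helper: return (missing_dates, low_days) for per-day counts.

    daily_counts: Series of counts indexed by date.
    low_threshold: illustrative cut-off for a "low-volume" day (an assumption).
    """
    counts = pd.Series(daily_counts.values,
                       index=pd.to_datetime(daily_counts.index)).sort_index()
    # Every calendar day between the first and last observed date
    full_range = pd.date_range(counts.index.min(), counts.index.max(), freq='D')
    # Days that never appear in the index
    missing_dates = full_range.difference(counts.index)
    # Days that appear but have suspiciously few rows
    low_days = counts[counts < low_threshold]
    return missing_dates, low_days

# Toy counts copied from the output above (2014-12-13 to 2014-12-17)
toy_counts = pd.Series(
    [224, 147, 45, 308],
    index=pd.to_datetime(['2014-12-13', '2014-12-14', '2014-12-16', '2014-12-17']),
)
missing_dates, low_days = find_gaps_and_low_days(toy_counts)
print(missing_dates)
print(low_days)
```

On the real data the same call could be made on news_period_df.groupby(['news_collected_time']).size().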

In [6]:
"""
Check number of news with empty 'news_entities' field per day
"""
with pd.option_context('display.max_rows', 150):
    print(news_period_df[news_period_df['news_entities'] == ''].groupby(['news_collected_time']).size())


news_collected_time
2014-11-18    288
2014-11-19    253
2014-11-20    311
2014-11-21    285
2014-11-22    252
2014-11-23    269
2014-11-24    272
2014-11-25    309
2014-11-26    303
2014-11-27    281
2014-11-28    284
2014-11-29    246
2014-11-30    254
2014-12-01    317
2014-12-02    290
2014-12-03    286
2014-12-04    275
2014-12-05    263
2014-12-06    232
2014-12-07    257
2014-12-08    288
2014-12-09    269
2014-12-10    270
2014-12-11    271
2014-12-12    249
2014-12-13    224
2014-12-14    147
2014-12-16     45
2014-12-17    308
2014-12-18     51
2014-12-19    297
2014-12-20    243
2014-12-21    248
2014-12-22    285
2014-12-23    275
2014-12-24    283
2014-12-25    245
2014-12-26    267
2014-12-27    245
2014-12-28    258
2014-12-29    261
2014-12-30     66
2014-12-31     54
2015-01-01    265
2015-01-02    268
2015-01-03    248
2015-01-04    236
2015-01-05    257
2015-01-06    301
2015-01-07    265
2015-01-08    272
2015-01-09    294
2015-01-10    256
2015-01-11    274
2015-01-12    254
2015-01-13    292
2015-01-14    299
2015-01-15    278
2015-01-16    297
2015-01-17    246
2015-01-18    252
2015-01-19    281
2015-01-20    262
2015-01-21    281
2015-01-22    272
2015-01-23    140
2015-01-24    256
2015-01-25    266
2015-01-26    203
2015-01-27     42
2015-01-28    150
2015-01-29    282
2015-01-30    278
2015-01-31    282
2015-02-01    271
2015-02-02    295
2015-02-03    294
2015-02-04    287
2015-02-05    275
2015-02-06    262
2015-02-07    275
2015-02-08    264
2015-02-09    292
2015-02-10    303
2015-02-11    265
2015-02-12    266
2015-02-13    276
2015-02-14    237
2015-02-15    262
2015-02-16    272
2015-02-17    284
2015-02-18    310
2015-02-19    285
2015-02-20    295
2015-02-21    247
2015-02-22    239
2015-02-23    309
2015-02-24    309
2015-02-25    319
2015-02-26    301
2015-02-27    274
2015-02-28    247
2015-03-01    270
2015-03-02    292
2015-03-03    276
2015-03-04    269
2015-03-05    279
2015-03-06    282
2015-03-07    262
2015-03-08    231
2015-03-09    303
2015-03-10    285
2015-03-11    285
2015-03-12    315
2015-03-13    298
2015-03-14    245
2015-03-15    270
2015-03-16    300
2015-03-17    299
2015-03-18      2
2015-03-20      1
2015-03-21      2
2015-03-29      1
2015-03-30      3
2015-03-31      1
2015-04-01      1
2015-04-03      2
2015-04-04      1
2015-04-05      1
2015-04-09      1
2015-04-12      3
2015-04-13      3
2015-04-14      2
dtype: int64
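
Comparing this output with the previous cell, every article up to 2015-03-17 has an empty news_entities field, while from 2015-03-18 on only a handful per day do. The comparison is easier to read as a per-day share of empty values. The following is a minimal sketch on a toy frame; the helper name empty_entities_share and the toy rows are assumptions made for illustration only.

```python
import pandas as pd

def empty_entities_share(df):
    """Hypothetical helper: fraction of rows per day with an empty news_entities field."""
    is_empty = df['news_entities'].eq('').astype(float)
    return is_empty.groupby(df['news_collected_time']).mean()

# Toy frame (made-up rows) mimicking the two relevant columns of news_period_df
toy_df = pd.DataFrame({
    'news_collected_time': pd.to_datetime([
        '2015-03-17', '2015-03-17',
        '2015-03-18', '2015-03-18', '2015-03-18', '2015-03-18',
    ]),
    'news_entities': ['', '', 'Cuba:place:21', '', 'SpaceX:org:9', 'NASA:org:4'],
})

share = empty_entities_share(toy_df)
print(share)
```

A share of 1.0 marks a day on which no article carries entities; values close to 0 mark days on which almost all do.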

In [7]:
news_period_df.dtypes


Out[7]:
news_id                         int64
news_native_id                  int64
news_title                     object
news_post_time         datetime64[ns]
news_collected_time    datetime64[ns]
news_keywords                  object
news_doc                       object
news_entities                  object
dtype: object

Notes

  • The data provided by Dr. Jiang is minimal. Use the original version of the data for complete information.
  • On some dates the Twitter server was unstable, and tweets with errors that cannot be parsed were collected. These dates were MANUALLY compiled into config.ORIGINAL_TWEETS_ERROR_DATES_LST during experiments. We discard all news/tweets data on these dates.
  • Raw news articles have a fixed format: each news line is separated by '\t', and the resulting list has length 10.
  • Most news starting from news_native_id = '34583' on 2015-03-18 have a non-empty news_entities field (with a few exceptions on each day).
  • Data in the 'news_entities' field is not reliable enough.
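
The format and filtering rules in the notes can be sketched as two small checks: one that splits a raw news line on '\t' and accepts it only if it yields 10 fields, and one that drops any date listed as an error date. This is an illustrative sketch, not the notebook's actual loader. The function names are hypothetical, the placeholder dates are made up, and the real list lives in config.ORIGINAL_TWEETS_ERROR_DATES_LST, whose exact format is not shown here.

```python
EXPECTED_NUM_FIELDS = 10  # per the note: each tab-separated news line has 10 fields

def parse_news_line(line):
    """Hypothetical helper: split a raw line on tabs; return None if malformed."""
    fields = line.rstrip('\r\n').split('\t')
    if len(fields) != EXPECTED_NUM_FIELDS:
        return None
    return fields

def keep_date(date_str, error_dates):
    """Hypothetical helper: True if date_str is not one of the error dates."""
    return date_str not in set(error_dates)

# Toy inputs: placeholder values only, not real data
good_line = '\t'.join('field{}'.format(i) for i in range(10)) + '\n'
bad_line = 'only\ttwo\n'
placeholder_error_dates = ['2000-01-02']  # stand-in for the config list (assumed date strings)

parsed_good = parse_news_line(good_line)
parsed_bad = parse_news_line(bad_line)
kept = [d for d in ['2000-01-01', '2000-01-02', '2000-01-03']
        if keep_date(d, placeholder_error_dates)]
print(parsed_good, parsed_bad, kept)
```

Returning None for malformed lines keeps the caller free to count or log the rejects instead of failing on the first bad line.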