This notebook contains our experiments with optimizing the internal heuristics of the python-readability module.
We forked python-readability and replaced its hard-coded constants with instance variables that we could then tweak to find their optimal values. (The forked repo is available at https://github.com/dlarochelle/python-readability .)
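For example, once a constant such as RETRY_LENGTH becomes an instance variable, it can be tweaked per document. A minimal sketch (assuming the fork is importable; the sample HTML string is made up):
import readability  # our fork, which exposes the former hard-coded constants as instance variables
sample_html = u"<html><body><p>" + u"Words of article text. " * 40 + u"</p></body></html>"  # made-up input
doc = readability.Document( sample_html )
doc.RETRY_LENGTH          # raises AttributeError if the fork does not expose this constant
doc.RETRY_LENGTH = 500    # override the former hard-coded value for this extraction only
print doc.summary()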
We experimented with two approaches: a manual, binary-search-style sweep over individual parameters, and SciPy's automated optimization routines.
We randomly split our data into training and test sets. Then, using a binary-search-like sweep, we found six parameters that could each be modified to improve the F1 score on the training set by more than 0.001: LOW_WEIGHT_LINK_DENSITY_THRESHOLD, MIN_SIBLING_SCORE_THRESHOLD, BEST_SCORE_MULTIPLIER_THRESHOLD, CONTENT_SCORE_DIV_BONUS, CLASS_WEIGHT_NEGATIVE_RE_PENALTY, and CLASS_WEIGHT_POSITVE_RE_BONUS. Unfortunately, our tests showed that these parameters are not independent: rerunning the analysis on the training data with all of the significantly improved parameters set to their optimized values yielded an improvement of only 0.000805, which is smaller than the improvement we obtained from tweaking each parameter individually.
We also tried evaluating the test set while modifying only the parameter that produced the largest improvement on the training set (CONTENT_SCORE_DIV_BONUS, a 0.005 improvement). However, accuracy on the test set actually decreased compared to using the default values.
SciPy includes routines that attempt automated parameter optimization. We experimented with them briefly, but found that they are mainly designed for functions in which even small changes to the parameters change the result. They therefore tended to accept the initial values as optimal, because small tweaks (e.g. +/- 0.001) had no effect on our objective.
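To illustrate the problem with a toy objective (not our extractor): our F1 surface is piecewise constant, so the numerical gradient around the starting point is zero and a local optimizer stops immediately.
import math
import scipy.optimize
# Toy objective that, like our F1 score, does not change under tiny parameter tweaks.
flat_step = lambda p : 1 - math.floor( p[0] * 10 ) / 10.0
result = scipy.optimize.minimize( flat_step, [ 0.23 ], method='L-BFGS-B', bounds=[ ( 0, 1 ) ] )
print result.x   # stays at 0.23: larger moves would improve the objective, but every small step looks flat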
SciPy also contains a basinhopping optimization function ( http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.optimize.basinhopping.html ), which automates selecting initial parameter values that are then passed to another optimization routine. However, it is slow: even with only a few parameters to optimize, it did not complete after running overnight. (It is slow because evaluating a single set of parameter values on our test set takes over a minute.) Still, if we want to attempt global optimization over multiple parameters, basinhopping is probably the right tool.
At the moment, we have decided to simply use the default version of python-readability instead of investing more development time in parameter optimization.
Although it would be interesting to explore basinhopping in more detail, that would require more development time and CPU resources. Additionally, there is the danger that by optimizing over multiple parameters, we would overfit the training data.
Internal Redmine issue for reference: https://cyber.law.harvard.edu/projectmanagement/issues/10722
In [1]:
import cPickle
import os.path
api_key = cPickle.load( file( os.path.expanduser( '~/mediacloud_api_key.pickle' ), 'r' ) )
In [2]:
import cPickle
import os.path
cPickle.dump( api_key, file( os.path.expanduser( '~/mediacloud_api_key.pickle' ), 'wb' ) )
In [3]:
#import sys
#sys.path.append('../../foreign_modules/python/')
In [4]:
loc_key = 'f66a50230d54afaf18822808aed649f1d6ca72b08fb06d5efb6247afe9fbae52'
In [5]:
import subprocess
import tempfile
import codecs
import time
import sys
In [6]:
import operator
In [7]:
def lines_to_comparable_text( lines ):
text = u"\n\n".join([ clean_for_comparison(line) for line in lines ])
if text == '':
text = u''
return text
def html_to_comparable_text( html_text ):
text = clean_for_comparison( html_text )
if text == '' or text == None:
text = u''
return text
In [8]:
import lxml
import html2text
def html_strip( str ):
if str.isspace() or str == '':
return u' '
if str == '<':
return u' '
try:
h = html2text.HTML2Text()
h.ignore_links = True
return h.handle( str )
#return lxml.html.fromstring(str).text_content()
except:
print "Unexpected error on string '" + str + "'" , sys.exc_info()[0]
raise
return u''
def clean_for_comparison( str ):
if len(str) > 0:
ret = html_strip( str )
else:
return str
return ret
In [9]:
import difflib
from IPython.display import HTML
from collections import Counter
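# ro_compare_base does a word-level diff between the expected (gold) text and the extracted text
# using difflib's Ratcliff/Obershelp-style matcher: words common to both sequences count as true
# positives, words only in the extraction as false positives, and words only in the expected text
# as false negatives.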
def ro_compare_base( actual_text, expected_text ):
words_expected = expected_text.split()
words_crf = actual_text.split()
differ = difflib.Differ( )
#print words_crf[:10]
#print words_expected[:10]
list( differ.compare( words_crf , words_expected ) )
counts = Counter([ d[0] for d in differ.compare( words_expected, words_crf ) ])
tp = counts[' ']
fp = counts['+']
fn = counts['-']
return { 'tp': tp, 'fp': fp, 'fn': fn }
def precision_recall_f1( tp, fp, fn ):
if float(tp+fp) == 0:
precision = 0.0
else:
precision = tp/float(tp+fp)
if float( tp + fn ) == 0:
recall = 0
else:
recall = tp/float( tp + fn )
if ( precision + recall ) > 0:
f1 = 2*(precision*recall)/( precision + recall )
else:
f1 = 0
ret = { 'precision': precision,
'recall': recall,
'f1': f1
}
return ret
def ratcliff_obershelp_compare( actual_text, expected_text ):
comp_results = ro_compare_base( actual_text, expected_text )
tp = comp_results[ 'tp' ]
fp = comp_results['fp']
fn = comp_results['fn']
ret = precision_recall_f1( tp, fp, fn )
return ret
def compare_with_expected( extractor_name, actual_text, actual_html, expected_text, story ):
#actual_text = lines_to_comparable_text( actual_lines )
#expected_text = lines_to_comparable_text( expected_lines )
ret = {}
ret[ extractor_name ] = ratcliff_obershelp_compare( actual_text, expected_text )
if compare_deduplicated:
dedup_text = remove_duplicate_sentences( actual_html, story )
ret[ extractor_name + "_dedup" ] = ratcliff_obershelp_compare( dedup_text, expected_text )
return ret
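# Quick sanity check of the scoring with made-up counts: 80 shared words, 20 extra words in the
# extraction, and 20 missing words give precision = recall = 0.8 and therefore F1 = 0.8.
print precision_recall_f1( tp=80, fp=20, fn=20 )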
In [10]:
def python_readability_results( eto, readability_options):
#readability_options['debug'] = True
raw_content = eto['raw_content']
extract_res = { 'extracted_html': extract_with_python_readability( raw_content , readability_options) }
if 'extracted_text' not in extract_res:
extract_res['extracted_text'] = html_to_comparable_text( extract_res['extracted_html' ] )
expected_text = eto['expected_text']
story = eto['story']
return ro_compare_base( actual_text=extract_res['extracted_text' ] , expected_text=expected_text )
In [11]:
regenerate_extractor_training_objects = True
regenerate_media_id_media_map = False
regenerate_comps_downloads = True
compare_deduplicated = False
In [12]:
extractor_training_objects = cPickle.load( file(
os.path.expanduser( '~/Dropbox/mc/extractor_test/extractor_training_objects.pickle' ), "rb" ) )
#cPickle.load( open( "extractor_traning_objects.pickle", "rb") )
print len( extractor_training_objects )
In [13]:
import math
import random
import numpy
def python_readability_f1_mean( extractor_training_objects, py_readability_options = {}):
#reload( difflib )
#reload( readability )
#reload( lxml )
#reload( html2text )
#dreload( difflib )
#dreload( readability )
#dreload( lxml )
#random.seed(12345)
#numpy.random.seed( 12345 )
print 'python_readability_f1_mean', py_readability_options
#py_readability_options = {}
#py_readability_options['retry_length'] = retry_length
#py_readability_options['min_text_length'] = min_text_length
download_results = {}
comp_res = []
for eto in extractor_training_objects:
comp_result = python_readability_results( eto, py_readability_options )
#comp_result = python_readability_results( eto, {} )
comp_res.append( comp_result )
download_results[ eto['downloads_id'] ] = comp_result
#return download_results
#comp_res = [ python_readability_results( eto, py_readability_options ) for eto in extractor_training_objects ]
#print comp_res
#f1_s = [ res['python_readability']['f1'] for res in comp_res ]
#ret = math.fsum( f1_s ) / len( f1_s )
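# Micro-average: sum tp/fp/fn across all documents and compute a single precision/recall/F1,
# rather than averaging the per-document F1 scores (the commented-out approach above).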
tp = sum( [ x['tp'] for x in comp_res ] )
fp = sum( [ x['fp'] for x in comp_res ] )
fn = sum( [ x['fn'] for x in comp_res ] )
comp_stats = precision_recall_f1( tp=tp, fp=fp, fn=fn )
print 'comp_stats', comp_stats['f1']
return comp_stats['f1' ]
#print 'python_readability_f1_mean', 'retry_length', retry_length, 'min_text_length', min_text_length, 'return', ret
#return ret
In [43]:
sys.path = ['/home/dlarochelle/dev_scratch/python-readability-non-determinism/'] + sys.path
import readability
COMMA_COUNT = 10
P_TO_INPUT_RATIO = 3
MIN_EMBED_COMMENT_LENGTH = 75
def extract_with_python_readability( raw_content, readability_options=None ):
if readability_options == None:
readability_options = {}
#readability.htmls = lxml.html.HTMLParser(encoding='utf-8')
doc = readability.Document( raw_content, **readability_options )
if 'LONG_NODE_LENGTH' in readability_options:
doc.LONG_NODE_LENGTH # ensure class varaible has been declared
doc.LONG_NODE_LENGTH = readability_options['LONG_NODE_LENGTH']
if 'P_TO_INPUT_RATIO' in readability_options:
doc.P_TO_INPUT_RATIO # ensure class varaible has been declared
doc.P_TO_INPUT_RATIO = readability_options['P_TO_INPUT_RATIO']
if 'LOW_WEIGHT_LINK_DENSITY_THRESHOLD' in readability_options:
doc.LOW_WEIGHT_LINK_DENSITY_THRESHOLD # ensure class varaible has been declared
doc.LOW_WEIGHT_LINK_DENSITY_THRESHOLD = readability_options['LOW_WEIGHT_LINK_DENSITY_THRESHOLD']
if 'HEADER_LINK_DENSITY_THRESHOLD' in readability_options:
doc.HEADER_LINK_DENSITY_THRESHOLD # ensure class varaible has been declared
doc.HEADER_LINK_DENSITY_THRESHOLD = readability_options['HEADER_LINK_DENSITY_THRESHOLD']
if 'HIGH_WEIGHT_LINK_DENSITY_THRESHOLD' in readability_options:
doc.HIGH_WEIGHT_LINK_DENSITY_THRESHOLD # ensure class varaible has been declared
doc.HIGH_WEIGHT_LINK_DENSITY_THRESHOLD = readability_options['HIGH_WEIGHT_LINK_DENSITY_THRESHOLD']
if 'MIN_SIBLING_SCORE_THRESHOLD' in readability_options:
doc.MIN_SIBLING_SCORE_THRESHOLD # ensure class varaible has been declared
doc.MIN_SIBLING_SCORE_THRESHOLD = readability_options['MIN_SIBLING_SCORE_THRESHOLD']
if 'BEST_SCORE_MULTIPLIER_THRESHOLD' in readability_options:
doc.BEST_SCORE_MULTIPLIER_THRESHOLD # ensure class varaible has been declared
doc.BEST_SCORE_MULTIPLIER_THRESHOLD = readability_options['BEST_SCORE_MULTIPLIER_THRESHOLD']
if 'LONG_NODE_LINK_DENSITY_THRESHOLD' in readability_options:
doc.LONG_NODE_LINK_DENSITY_THRESHOLD # ensure class varaible has been declared
doc.LONG_NODE_LINK_DENSITY_THRESHOLD = readability_options['LONG_NODE_LINK_DENSITY_THRESHOLD']
if 'COMMA_COUNT' in readability_options:
doc.COMMA_COUNT # ensure class varaible has been declared
doc.COMMA_COUNT = readability_options['COMMA_COUNT']
if 'MIN_EMBED_COMMENT_LENGTH' in readability_options:
doc.MIN_EMBED_COMMENT_LENGTH # ensure class varaible has been declared
doc.MIN_EMBED_COMMENT_LENGTH = readability_options['MIN_EMBED_COMMENT_LENGTH']
if 'TEXT_LENGTH_THRESHOLD' in readability_options:
doc.TEXT_LENGTH_THRESHOLD # ensure class varaible has been declared
doc.TEXT_LENGTH_THRESHOLD = readability_options['TEXT_LENGTH_THRESHOLD']
if 'RETRY_LENGTH' in readability_options:
doc.RETRY_LENGTH # ensure class varaible has been declared
doc.RETRY_LENGTH = readability_options['RETRY_LENGTH']
if 'SIBLING_CONTENT_LENGTH_SUM' in readability_options:
doc.SIBLING_CONTENT_LENGTH_SUM # ensure class varaible has been declared
doc.SIBLING_CONTENT_LENGTH_SUM = readability_options['SIBLING_CONTENT_LENGTH_SUM']
if 'CONTENT_SCORE_DIV_BONUS' in readability_options:
doc.CONTENT_SCORE_DIV_BONUS # ensure class varaible has been declared
doc.CONTENT_SCORE_DIV_BONUS = readability_options['CONTENT_SCORE_DIV_BONUS']
if 'CONTENT_SCORE_PRE_TD_BONUS' in readability_options:
doc.CONTENT_SCORE_PRE_TD_BONUS # ensure class varaible has been declared
doc.CONTENT_SCORE_PRE_TD_BONUS = readability_options['CONTENT_SCORE_PRE_TD_BONUS']
if 'CONTENT_SCORE_ADDRESS_OL_PENALTY' in readability_options:
doc.CONTENT_SCORE_ADDRESS_OL_PENALTY # ensure class varaible has been declared
doc.CONTENT_SCORE_ADDRESS_OL_PENALTY = readability_options['CONTENT_SCORE_ADDRESS_OL_PENALTY']
if 'CONTENT_SCORE_HEADER_PENALTY' in readability_options:
doc.CONTENT_SCORE_HEADER_PENALTY # ensure class varaible has been declared
doc.CONTENT_SCORE_HEADER_PENALTY = readability_options['CONTENT_SCORE_HEADER_PENALTY']
if 'CLASS_WEIGHT_NEGATIVE_RE_PENALTY' in readability_options:
doc.CLASS_WEIGHT_NEGATIVE_RE_PENALTY # ensure class varaible has been declared
doc.CLASS_WEIGHT_NEGATIVE_RE_PENALTY = readability_options['CLASS_WEIGHT_NEGATIVE_RE_PENALTY']
if 'CLASS_WEIGHT_POSITVE_RE_BONUS' in readability_options:
doc.CLASS_WEIGHT_POSITVE_RE_BONUS # ensure class varaible has been declared
doc.CLASS_WEIGHT_POSITVE_RE_BONUS = readability_options['CLASS_WEIGHT_POSITVE_RE_BONUS']
if 'CONTENT_SCORE_START' in readability_options:
doc.CONTENT_SCORE_START # ensure class varaible has been declared
doc.CONTENT_SCORE_START = readability_options['CONTENT_SCORE_START']
if 'CONTENT_SCORE_INNER_TEXT_MIN_BONUS' in readability_options:
doc.CONTENT_SCORE_INNER_TEXT_MIN_BONUS # ensure class varaible has been declared
doc.CONTENT_SCORE_INNER_TEXT_MIN_BONUS = readability_options['CONTENT_SCORE_INNER_TEXT_MIN_BONUS']
if 'LI_COUNT_REDUCTION' in readability_options:
doc.LI_COUNT_REDUCTION # ensure class varaible has been declared
doc.LI_COUNT_REDUCTION = readability_options['LI_COUNT_REDUCTION']
valid_options = ['LONG_NODE_LENGTH', 'P_TO_INPUT_RATIO', 'LOW_WEIGHT_LINK_DENSITY_THRESHOLD',
'HEADER_LINK_DENSITY_THRESHOLD', 'HIGH_WEIGHT_LINK_DENSITY_THRESHOLD', 'MIN_SIBLING_SCORE_THRESHOLD',
'BEST_SCORE_MULTIPLIER_THRESHOLD', 'LONG_NODE_LINK_DENSITY_THRESHOLD', 'COMMA_COUNT',
'MIN_EMBED_COMMENT_LENGTH', 'TEXT_LENGTH_THRESHOLD', 'RETRY_LENGTH', 'SIBLING_CONTENT_LENGTH_SUM',
'CONTENT_SCORE_DIV_BONUS', 'CONTENT_SCORE_PRE_TD_BONUS', 'CONTENT_SCORE_ADDRESS_OL_PENALTY',
'CONTENT_SCORE_HEADER_PENALTY', 'CLASS_WEIGHT_NEGATIVE_RE_PENALTY', 'CLASS_WEIGHT_POSITVE_RE_BONUS',
'CONTENT_SCORE_START', 'CONTENT_SCORE_INNER_TEXT_MIN_BONUS', 'LI_COUNT_REDUCTION']
for key in readability_options.keys():
#print key
assert key in valid_options, "invalid key " + key
#doc.RETRY_LENGTH = 100000
#doc.TEXT_LENGTH_THRESHOLD = 0
#doc.MAX_SIBLING_P_LINK_DENSITY
#doc.MAX_SIBLING_P_LINK_DENSITY = 2.0
title = doc.short_title()
summary = doc.summary()
ret = title + "\n\n" + summary
return ret
In [15]:
#dreload( readability )
In [16]:
python_readability_results( extractor_training_objects[0], {} )
Out[16]:
In [17]:
import datetime
test_sizes = [ 10, 100, 1000 ]
if False:
f1_expected = {}
start_time = datetime.datetime.now()
for test_size in test_sizes:
f1_expected[ test_size ] = python_readability_f1_mean( extractor_training_objects[ : test_size ], 250, 25 )
current_time = datetime.datetime.now()
print test_size, "total time", current_time - start_time
cPickle.dump( f1_expected,
file( os.path.expanduser( '~/Dropbox/mc/extractor_test/python_reability_expected.pickle'), "wb") )
f1_expected
In [18]:
f1_expected = cPickle.load(
file( os.path.expanduser( '~/Dropbox/mc/extractor_test/python_reability_expected.pickle'), "rb") )
f1_expected
Out[18]:
In [19]:
reload(readability)
test_sizes = [ 10, 100, 1000 ]
f1_actual = {}
start_time = datetime.datetime.now()
for test_size in test_sizes:
f1_actual[ test_size ] = python_readability_f1_mean( extractor_training_objects[ : test_size ],
# {'SIBLING_CONTENT_LENGTH_SUM': 0} )
{} )
current_time = datetime.datetime.now()
print test_size, "total time", current_time - start_time
print 'result for sample size ', test_size, f1_expected[ test_size ] == f1_actual[ test_size ]
In [20]:
run1 = python_readability_f1_mean( extractor_training_objects[:], 250, 25 )
In [ ]:
print "foo"
opt_function = lambda p : 1 - python_readability_f1_mean( extractor_training_objects[:50], { 'LOW_WEIGHT_LINK_DENSITY_THRESHOLD': p[0], 'HEADER_LINK_DENSITY_THRESHOLD': p[1], 'HIGH_WEIGHT_LINK_DENSITY_THRESHOLD': p[2] } )
opt_function( [ LOW_WEIGHT_LINK_DENSITY_THRESHOLD, HEADER_LINK_DENSITY_THRESHOLD, HIGH_WEIGHT_LINK_DENSITY_THRESHOLD ] )
In [ ]:
import scipy.optimize
opt_result = scipy.optimize.minimize( opt_function,
[ LOW_WEIGHT_LINK_DENSITY_THRESHOLD, HEADER_LINK_DENSITY_THRESHOLD, HIGH_WEIGHT_LINK_DENSITY_THRESHOLD ],
method='TNC', bounds=[ [0,1],[0,1],[0,1]],
options={ 'maxiter': 1, 'disp': True} )
opt_result
In [ ]:
import numpy as np
# Accept test for basinhopping: reject any proposed step that falls outside [0, 1] on any axis.
class MyBounds(object):
    def __init__(self, xmax=[1,1,1], xmin=[0,0,0] ):
        self.xmax = np.array(xmax)
        self.xmin = np.array(xmin)
    def __call__(self, **kwargs):
        x = kwargs["x_new"]
        tmax = bool(np.all(x <= self.xmax))
        tmin = bool(np.all(x >= self.xmin))
        return tmax and tmin
In [ ]:
import scipy.optimize
mybounds = MyBounds()
opt_result = scipy.optimize.basinhopping( opt_function,
[ LOW_WEIGHT_LINK_DENSITY_THRESHOLD, HEADER_LINK_DENSITY_THRESHOLD, HIGH_WEIGHT_LINK_DENSITY_THRESHOLD ],
minimizer_kwargs={ 'method':"L-BFGS-B", 'bounds': [ [0,1],[0,1],[0,1]] },
accept_test = mybounds, disp=True )
opt_result
In [21]:
# Default values of the python-readability constants, used below as the starting points for optimization.
LOW_WEIGHT_LINK_DENSITY_THRESHOLD = 0.2
HEADER_LINK_DENSITY_THRESHOLD = 0.33
HIGH_WEIGHT_LINK_DENSITY_THRESHOLD = 0.5
MIN_SIBLING_SCORE_THRESHOLD = 10
BEST_SCORE_MULTIPLIER_THRESHOLD = 0.2
LONG_NODE_LINK_DENSITY_THRESHOLD = 0.25
LONG_NODE_LENGTH = 80
COMMA_COUNT = 10
P_TO_INPUT_RATIO = 3
MIN_EMBED_COMMENT_LENGTH = 75
type( COMMA_COUNT )
In [38]:
params_to_optimize = [
{'param_name':'LONG_NODE_LENGTH', 'start_value': 80},
{'param_name':'P_TO_INPUT_RATIO', 'start_value': 3},
{'param_name':'LOW_WEIGHT_LINK_DENSITY_THRESHOLD', 'start_value': 0.2},
{'param_name':'HEADER_LINK_DENSITY_THRESHOLD', 'start_value': 0.33},
{'param_name':'HIGH_WEIGHT_LINK_DENSITY_THRESHOLD', 'start_value': 0.5},
{'param_name':'MIN_SIBLING_SCORE_THRESHOLD', 'start_value': 10},
{'param_name':'BEST_SCORE_MULTIPLIER_THRESHOLD', 'start_value': 0.2},
{'param_name':'LONG_NODE_LINK_DENSITY_THRESHOLD', 'start_value': 0.25},
{'param_name':'COMMA_COUNT', 'start_value': 10},
{'param_name':'MIN_EMBED_COMMENT_LENGTH', 'start_value': 75},
{'param_name':'TEXT_LENGTH_THRESHOLD', 'start_value': 25},
{'param_name':'RETRY_LENGTH', 'start_value': 250} ,
{'param_name': 'SIBLING_CONTENT_LENGTH_SUM', 'start_value': 1000},
{'param_name': 'CONTENT_SCORE_DIV_BONUS', 'start_value': 5},
{'param_name': 'CONTENT_SCORE_PRE_TD_BONUS', 'start_value': 3},
{'param_name': 'CONTENT_SCORE_ADDRESS_OL_PENALTY', 'start_value': 3},
{'param_name': 'CONTENT_SCORE_HEADER_PENALTY', 'start_value': 5 },
{'param_name': 'CLASS_WEIGHT_NEGATIVE_RE_PENALTY', 'start_value': 25},
{'param_name': 'CLASS_WEIGHT_POSITVE_RE_BONUS', 'start_value': 25 },
{'param_name': 'CONTENT_SCORE_START', 'start_value': 1},
{'param_name': 'CONTENT_SCORE_INNER_TEXT_MIN_BONUS', 'start_value': 3},
#{'param_name': 'CONTENT_SCORE_GRAND_PARENT_BONUS_FACTOR', 'start_value': 2.0, 'non_zero': True},
{'param_name': 'LI_COUNT_REDUCTION', 'start_value': 100}
]
for value_dict in params_to_optimize:
value_dict[ 'make_int'] = type( value_dict['start_value']) == int
print len( params_to_optimize )
params_to_optimize
Out[38]:
In [39]:
for param_to_opt in params_to_optimize:
param_name = param_to_opt['param_name']
print "if '" + param_name + "' in readability_options:"
print ' doc.' + param_name +' # ensure class varaible has been declared'
print ' doc.' + param_name + " = readability_options['" + param_name + "']"
print
param_names = [ param_to_opt['param_name'] for param_to_opt in params_to_optimize ]
print 'valid_options = ',
print param_names
In [62]:
def adjusted_mean( a, b, make_int ):
ret = ( a + b ) / 2
if make_int:
ret = int ( ret )
else:
ret = round( ret, 2 )
return ret
import random
random.seed( 12345 )
extractor_training_subset = extractor_training_objects[:]
random.shuffle( extractor_training_subset )
#extractor_training_subset = extractor_training_subset[ : ( len( extractor_training_objects)/2 ) ]
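# binary_search_opt_param: starting from the parameter's default value, evaluate F1 at the current
# value, at half the value, and at double the value; if the better of the two neighbors beats the
# current value, jump to it, otherwise bisect between the current value and that neighbor. Every 5
# iterations, bisect between the two best values seen so far, and stop once they are within
# stop_delta of each other (about 1 for integer parameters, about 0.01 for floats).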
def binary_search_opt_param( value_to_optimize, start_value, make_int):
current = start_value
prev = {}
iteration = 0
if make_int:
stop_delta = 1.01
else:
stop_delta = 0.011
max_iterations = 100
funct = lambda param : python_readability_f1_mean( extractor_training_subset, { value_to_optimize: param } )
while True:
iteration += 1
if iteration > max_iterations:
break
print 'iteration', iteration
if make_int:
current = int( current )
print 'current', current
if current not in prev:
prev[ current ] = funct( current )
if iteration % 5 == 0 and iteration > 0 :
sorted_keys = list(reversed(sorted( prev.keys() )))
best_keys = sorted( sorted_keys, key = lambda k : prev[ k ] )
best_keys.reverse()
print 'best_keys', [ ( k, prev[k] ) for k in best_keys ]
if abs( best_keys[1] - best_keys[0] ) <= stop_delta:
print 'stopping for small delta'
print current
break
current = adjusted_mean(best_keys[0], best_keys[1], make_int )
if current in prev:
current = adjusted_mean( best_keys[0], current, make_int )
# just pick a point between the current best and the next closest value
if current in prev:
print 'falling back in heuristic'
best_index = sorted_keys.index( best_keys[ 0 ] )
if best_index == 0:
comp_index = 1
else:
comp_index = best_index - 1
current = adjusted_mean( sorted_keys[ best_index ], sorted_keys[ comp_index ], make_int )
if current in prev:
assert abs( best_keys[ 0 ] - current ) <= stop_delta
print "stopping for small delta", best_keys[0], current
break
print 'continue'
continue
lower = round( current/2.0, 2 )
higher = round( current*2.0, 2)
if make_int:
lower = int( lower )
higher = int( higher )
if lower not in prev:
prev[lower] = funct( lower )
if higher not in prev:
prev[higher] = funct( higher )
if prev[lower] >= prev[higher]:
compare_point = lower
else:
compare_point = higher
if prev[ current ] > prev[ compare_point ]:
current = adjusted_mean(current, compare_point, make_int )
else:
current = compare_point
ret = { 'start_value': start_value,
'start_result': prev[start_value],
'opt_value': best_keys[0],
'opt_result': prev[ best_keys[0] ],
'param_name': value_to_optimize
}
return ret
In [60]:
len( extractor_training_subset )
Out[60]:
In [45]:
python_readability_f1_mean( extractor_training_subset, {} )
Out[45]:
In [61]:
opt_results = []
start_time = datetime.datetime.now()
for param_info in params_to_optimize:
print param_info
opt_results.append( binary_search_opt_param( param_info['param_name'], param_info['start_value'], param_info['make_int'] ))
for opt_result in opt_results:
if opt_result['opt_result'] > opt_result['start_result']:
improvement = opt_result['opt_result'] - opt_result['start_result']
if improvement < 0.001:
print opt_result['param_name'], " - SMALL OPT - start", opt_result['start_value'], opt_result['start_result'], "opt to", opt_result['opt_value'],
print opt_result['opt_result'], 'improvement', improvement
else:
print opt_result['param_name'], "- LARGE OPT - start", opt_result['start_value'], opt_result['start_result'], "opt to", opt_result['opt_value'],
print opt_result['opt_result'], 'improvement', improvement
else:
print opt_result['param_name'], "no improvement - start", opt_result['start_value'], opt_result['start_result'], "opt to", opt_result['opt_value'], opt_result['opt_result']
end_time = datetime.datetime.now()
print 'total time', end_time - start_time
In [73]:
default_f1 = python_readability_f1_mean( extractor_training_subset[ ( len( extractor_training_objects)/2 ) : ], {} )
print 'F1 with default params', default_f1
In [67]:
opt_params = {
'TEXT_LENGTH_THRESHOLD': 48,
'CONTENT_SCORE_DIV_BONUS': 18,
'CLASS_WEIGHT_POSITVE_RE_BONUS': 6
}
opt_params = {
'LOW_WEIGHT_LINK_DENSITY_THRESHOLD': 0.69,
'MIN_SIBLING_SCORE_THRESHOLD': 30 ,
'BEST_SCORE_MULTIPLIER_THRESHOLD': 0.01,
'CONTENT_SCORE_DIV_BONUS': 41,
'CLASS_WEIGHT_NEGATIVE_RE_PENALTY': 36,
'CLASS_WEIGHT_POSITVE_RE_BONUS': 7}
opt_f1 = python_readability_f1_mean(extractor_training_subset[ ( len( extractor_training_objects)/2 ) : ], opt_params )
print 'F1 with optimized params', opt_f1
In [74]:
opt_params = {
'CONTENT_SCORE_DIV_BONUS': 41,
}
opt_f1 = python_readability_f1_mean( extractor_training_subset[ ( len( extractor_training_objects)/2 ) : ], opt_params )
print 'F1 with optimized params', opt_f1
In [27]:
make_int = True
value_to_optimize = 'min_text_length'
start_value = 25
binary_search_opt_param( value_to_optimize, start_value, make_int )
Out[27]:
In [19]:
#random.seed(12345)
#numpy.random.seed( 12345 )
#reload( difflib )
#reload( readability )
#reload( lxml )
#dreload( difflib )
#dreload( readability )
#dreload( lxml )
run2 = python_readability_f1_mean( extractor_training_objects[:], 250, 25 )
In [20]:
print run1 == run2
print run1.keys() == run2.keys()
for k in run1.keys():
if run1[ k] != run2[k]:
print k, run1[k], run2[k]
In [ ]:
#random.seed(12345)
#numpy.random.seed( 12345 )
#reload( difflib )
#reload( readability )
#reload( lxml )
#dreload( difflib )
#dreload( readability )
#dreload( lxml )
run3 = python_readability_f1_mean( extractor_training_objects, 250, 25 )
In [ ]:
print run1 == run3
print run1.keys() == run3.keys()
for k in run1.keys():
if run1[ k] != run3[k]:
print k, run1[k], run3[k]
In [ ]:
bad_etos = [ eto for eto in extractor_training_objects if eto['downloads_id'] == 590957745 ]
for x in range( 10):
print python_readability_f1_mean( bad_etos, 250, 25 )
In [ ]:
bad_etos = [ eto for eto in extractor_training_objects if eto['downloads_id'] == 590957745 ]
for x in range( 10):
print python_readability_results( bad_etos[0], {} )
In [ ]:
text_outputs = []
for x in range( 10):
text_outputs.append( extract_with_python_readability( bad_etos[0]['raw_content'] ) )
print len( text_outputs )
print len(set( text_outputs ) )
In [ ]:
isinstance( raw_content, unicode )
import hashlib
[ hashlib.md5( out_text ).hexdigest() for out_text in text_outputs2 ]
In [ ]:
text_outputs2 = []
raw_content = bad_etos[0]['raw_content']
#import readability
for x in range( 10):
reload( lxml.etree )
reload ( lxml.html )
reload( readability.cleaners )
reload( readability.encoding )
reload( readability)
reload( readability.htmls )
#readability.htmls = lxml.html.HTMLParser(encoding='utf-8')
print readability.cleaners.html_cleaner
print readability.htmls.utf8_parser
random.seed(12345)
numpy.random.seed( 12345 )
print isinstance( raw_content, unicode )
text_outputs2.append( readability.Document( raw_content ).summary() )
print len( text_outputs2 )
print len(set( text_outputs2 ) )
In [ ]:
set( text_outputs )
In [ ]:
sys.path
In [ ]:
readability.readability.total_siblings = 0
readability.readability.total_eval_siblings = 0
readability.readability.total_candidate_siblings = 0
readability.readability.num_articles = 0
readability.readability.articles_with_siblings = 0
for eto in extractor_training_objects:
extract_with_python_readability( eto['raw_content'] )
print readability.readability.total_siblings
print readability.readability.num_articles
print float( readability.readability.total_siblings ) / readability.readability.num_articles
print readability.readability.articles_with_siblings
print readability.readability.total_candidate_siblings
print float( readability.readability.total_candidate_siblings ) / readability.readability.num_articles
print
print readability.readability.total_eval_siblings
print float( readability.readability.total_eval_siblings ) / readability.readability.num_articles
In [ ]:
#comp_res = [ comp_extractors( eto ) for eto in extractor_training_objects[ :10] ]
#print comp_res
#f1_s = [ res['python_readibilty']['f1'] for res in comp_res ]
#print f1_s
#print numpy.mean( f1_s )
print python_readability_f1_mean( extractor_training_objects, 100000, 25 )
print python_readability_f1_mean( extractor_training_objects, -10, 25 )
print python_readability_f1_mean( extractor_training_objects, 10, 15 )
In [ ]:
opt_fun = lambda p : 1 - python_readability_f1_mean( extractor_training_objects[:400], retry_length=p[0], min_text_length=p[1])
#print opt_fun( [0, 250] )
#print opt_fun( [1000000000, 250] )
In [ ]:
from operator import itemgetter, attrgetter, methodcaller
defaults = { 'retry_length': 250,
'min_text_length':25
}
current_min_text_length = defaults['min_text_length']
range = [0, 100]
prev_values = []
eval_with_param = lambda param : { 'param': param, 'result': opt_fun( [param, defaults['min_text_length'], param ]) }
prev_values = [ eval_with_param( param ) for param in [ 0, 300]]
#print prev_values
#prev_values.sort( key=itemgetter('result') )
#print prev_values
#prev_values.pop()
#print prev_values
old_value = opt_fun( [ defaults['retry_length'], current_min_text_length] )
while True:
prev_values.sort( key=itemgetter('result') )
if prev_values[0]['param'] == prev_values[1]['param']:
prev_values.pop()
break
if abs(prev_values[0]['param'] - prev_values[1]['param']) == 1:
prev_values.pop()
break
print prev_values
new_param = int ( ( prev_values[0]['param'] + prev_values[1]['param'])/2 )
prev_values.pop()
if new_param == prev_values[0]['param']:
break
prev_values.append( eval_with_param( new_param ) )
print prev_values
In [18]:
python_readability_cache = {}
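# Memoize F1 results keyed on the training-object list (by id()) and the integer-truncated
# (retry_length, min_text_length) pair, since a single evaluation takes over a minute and the
# scipy optimizers repeatedly probe nearly identical parameter values.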
def cached_readability_f1( extractor_training_objects, p ):
print 'cached_readability_f1( extractor_training_objects', p
if id( extractor_training_objects ) not in python_readability_cache:
python_readability_cache[ id( extractor_training_objects ) ] = {}
retry_length = max( -1, int( p[0] ) )
min_text_length = max( -1, int(p[1] ) )
if retry_length not in python_readability_cache[ id( extractor_training_objects ) ]:
python_readability_cache[ id( extractor_training_objects ) ][ retry_length ] = {}
if min_text_length not in python_readability_cache[ id( extractor_training_objects ) ][ retry_length ]:
print 'cached_readability_f1 recalculating ', retry_length, min_text_length
ret = python_readability_f1_mean( extractor_training_objects,
retry_length=retry_length, min_text_length=min_text_length)
python_readability_cache[ id( extractor_training_objects ) ][ retry_length ][min_text_length] = ret
else:
print 'cached_readability_f1 returning from cache', retry_length, min_text_length
ret = python_readability_cache[ id( extractor_training_objects ) ][ retry_length ][min_text_length]
print 'f1', ret
return ret
In [19]:
#id(extractor_training_objects)
cached_readability_f1( extractor_training_objects, [250, 25 ] )
Out[19]:
In [20]:
import scipy.optimize
#f = lambda var : opt_fun( [defaults['retry_length'], var])
#f( 2 )
opt_result = scipy.optimize.minimize( lambda p : 1- cached_readability_f1( extractor_training_objects, p ),
[ 150, 20 ], method='SLSQP', bounds=[ [1,500],[0,50]],
options={ 'maxiter': 1, 'disp': True} )
print opt_result
opt_result
#opt_result.values()
Out[20]:
In [21]:
import scipy.optimize
#f = lambda var : opt_fun( [defaults['retry_length'], var])
#f( 2 )
opt_result = scipy.optimize.minimize_scalar(
lambda min_text_length : 1-
cached_readability_f1( extractor_training_objects, [ 250, min_text_length] ),
bounds=[ 0, 50 ], method='Brent',
options={ 'maxiter': 10, 'disp': True} )
print opt_result
opt_result
#opt_result.values()
Out[21]:
In [22]:
import scipy.optimize
opt_result = scipy.optimize.brute( lambda min_text_length : 1 -
cached_readability_f1( extractor_training_objects, [ 250, min_text_length] ),
ranges=[ slice( 0, 50, 1 )], disp=True, full_output=True)
print opt_result
opt_result
#opt_result = scipy.optimize.minimize( opt_fun, [ 0, 0 ], method='L-BFGS-B', bounds=[ [0,400],[0,100]],options={ 'maxiter': 2000} )
#print opt_result
#opt_result.values()
Out[22]:
In [23]:
import pandas as pd
zip( opt_result[2], [ 1 - x for x in opt_result[3]])
Out[23]:
In [24]:
print opt_fun( [10, 0] )
print opt_fun( [250, 25] )
opt_result
In [ ]:
#opt_result = scipy.optimize.brute( opt_fun, ranges=[ slice( 0,400, 1 ),slice( 0,1000, 1 )])
opt_result = scipy.optimize.brute( opt_fun, ranges=[ slice( 249,251, 1 ),slice( 22,27, 1 )], disp=True, full_output=True)
print opt_result
opt_result
In [ ]:
start_time = datetime.datetime.now()
batch = extractor_training_objects[:400]
print python_readability_f1_mean( batch, 250, 25 )
end_time = datetime.datetime.now()
print "Total_time", end_time - start_time
print "Time per download", (end_time - start_time)/ (len(batch) )
In [ ]:
import datetime
if regenerate_comps_downloads:
comps_downloads = []
processed = 0
skipped = 0
start_time = datetime.datetime.now()
e=None
for extractor_training_object in extractor_training_objects[:100]:
print 'processed ', processed
print 'skipped ', skipped
print extractor_training_object[ 'downloads_id']
try:
res = comp_extractors( extractor_training_object )
#print res
comps_downloads.append( res )
processed += 1
except Exception, e:
print "error on download{}".format( extractor_training_object[ 'downloads_id'] )
e = sys.exc_info()
import traceback
traceback.print_exc()
print e
#raise e
skipped += 1
end_time = datetime.datetime.now()
print "Total_time", end_time - start_time
print "Time per download", (end_time - start_time)/ (processed + skipped )
#cPickle.dump( comps_downloads, file(
# os.path.expanduser( "~/Dropbox/mc/extractor_test/comps_downloads.pickle"), "wb"))
e
#extractor_training_objects
In [ ]:
#comps_downloads = cPickle.load( file(
# os.path.expanduser( "~/Dropbox/mc/extractor_test/comps_downloads.pickle"), "rb") )
In [ ]:
comps_downloads[0]
In [ ]:
df = get_data_frame_from_comparision_objects( comps_downloads )
print_results_by_measurement_type( df )
In [ ]:
non_spidered_downloads = remove_spidered_downloads( comps_downloads )
df = get_data_frame_from_comparision_objects( non_spidered_downloads )
print_results_by_measurement_type( df )
In [ ]:
print "spidered"
df = get_data_frame_from_comparision_objects( only_spidered_downloads( comps_downloads ) )
print_results_by_measurement_type( df )
In [ ]:
regional = { 2453107 }
print "region / pew knight study / 245107 "
df = get_data_frame_from_comparision_objects( filter_by_media_tags_id( non_spidered_downloads, regional ) )
print_results_by_measurement_type( df )
ap_english_us_top_25 = { 8875027 }
print "ap_english_us_top25 / 8875027 "
df = get_data_frame_from_comparision_objects( filter_by_media_tags_id( non_spidered_downloads, ap_english_us_top_25 ) )
print_results_by_measurement_type( df )
political_blogs = { 125 }
print "political blogs / 125"
df = get_data_frame_from_comparision_objects( filter_by_media_tags_id( non_spidered_downloads, political_blogs ) )
print_results_by_measurement_type( df )
russian = { 7796878 }
print 'russian'
df = get_data_frame_from_comparision_objects( filter_by_media_tags_id( non_spidered_downloads, russian ) )
print_results_by_measurement_type( df )
print 'brazil'
df = get_data_frame_from_comparision_objects( filter_by_media_tags_id( non_spidered_downloads, {8877968, 8877969, 8877973, 8877970 } ) )
print_results_by_measurement_type( df )
arabic = { 8878255 }
print 'arabic'
df = get_data_frame_from_comparision_objects( filter_by_media_tags_id( non_spidered_downloads, arabic ) )
print_results_by_measurement_type( df )
In [ ]:
boiler_pipe_extractor_training_objects = cPickle.load( open( "boiler_pipe_google_news_extractor_training_objects.pickle", "rb") )
#eto = extractor_training_objects[ 0 ]
#eto.keys()
#print eto['expected_text']
#get_extraction_results( eto )
#comp_extractors ( eto )
comps_downloads_boiler_pipe = []
processed = 0
skipped = 0
start_time = datetime.datetime.now()
e=None
for extractor_training_object in boiler_pipe_extractor_training_objects[:]:
try:
res = comp_extractors( extractor_training_object )
#print res
comps_downloads_boiler_pipe.append( res )
processed += 1
except Exception, e:
print "error on download{}".format( extractor_training_object[ 'downloads_id'] )
e = sys.exc_info()
import traceback
traceback.print_exc()
print e
#raise e
skipped += 1
print 'processed', processed, 'skipped', skipped
#extraction_results.append( er )
end_time = datetime.datetime.now()
print "Total_time", end_time - start_time
print "Time per download", (end_time - start_time)/ (processed + skipped )
res.keys()
In [ ]:
df = get_data_frame_from_comparision_objects( comps_downloads_boiler_pipe )
print_results_by_measurement_type( df )