In [1]:
import urllib

book_urls = {"Jane Austin":"http://www.gutenberg.org/cache/epub/1342/pg1342.txt",
             "Alice in Wonderland":"http://www.gutenberg.org/cache/epub/11/pg11.txt",
             "The Yellow Wallpaper":"http://www.gutenberg.org/cache/epub/1952/pg1952.txt",
             "Huckleberry Finn":"http://www.gutenberg.org/cache/epub/76/pg76.txt",
             "The Importance of Being Earnest":"http://www.gutenberg.org/cache/epub/844/pg844.txt"
             }

def load_gutenberg_book(url, char_limit=10000, min_len_of_sections=40):
    """
    Returns a list of paragraphs in the book.
    
    url: A url from Project Gutenberg.
    char_limit: Amount of characters of the book to read.
    min_len_of_sections: Each paragraph must be at least this many characters long.
    """
    book = urllib.urlopen(url)
    book_text = book.read(char_limit if char_limit else -1)
    
    result = []
    for text in book_text[:char_limit].split("\r\n\r\n"):
        if len(text) >= min_len_of_sections:
            clean_text = text.replace("\r\n", " ").strip()
            result.append(clean_text)
    
    start_position = len(result) if len(result) < 6 else 6
    return result[start_position:]

In [29]:
ja = load_gutenberg_book(book_urls["Jane Austin"])
aw = load_gutenberg_book(book_urls["Alice in Wonderland"], char_limit=None)

In [30]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [31]:
vect = TfidfVectorizer()
awt = vect.fit_transform(aw)

In [32]:
import numpy as np

In [33]:
def cosine_similarity(new_docs, old_docs):
    """
    Returns a similarity matrix where the first row is an array of
    similarities of the first new_doc compared with each of the old
    docs.
    """
    return new_docs*old_docs.T

def find_closest_matches(similarity_matrix, n_matches_to_return=1):
    """
    Expects a dense array of the form [[1., .5, .2],
                                       [.3, 1., .1],
                                       [.2, .4, 1.]]
    where rows correspond to similarities.
    """
    top_indices = np.apply_along_axis(func1d=lambda x: x.argsort()[-n_matches_to_return:][::-1], 
                                      axis=1, 
                                      arr=similarity_matrix)
    return top_indices
    
similarities = cosine_similarity(awt, awt).todense()
matches = find_closest_matches(similarities, 2)

In [35]:
top_score = 0

for new_text, old_texts in enumerate(matches[:]):
    max_score = max([float(similarities[[new_text],[ind]]) for ind in old_texts[1:]])
    if top_score < max_score:
        top_score = max_score
        print max_score
        similar_texts = [(float(similarities[[new_text],[ind]]), aw[ind]) for ind in old_texts[1:]]
        print aw[new_text]
        print similar_texts
        print new_text, old_texts
        print


0.242165356148
So she was considering in her own mind (as well as she could, for the hot day made her feel very sleepy and stupid), whether the pleasure of making a daisy-chain would be worth the trouble of getting up and picking the daisies, when suddenly a White Rabbit with pink eyes ran close by her.
[(0.24216535614796228, "Alice thought she might as well go back, and see how the game was going on, as she heard the Queen's voice in the distance, screaming with passion. She had already heard her sentence three of the players to be executed for having missed their turns, and she did not like the look of things at all, as the game was in such confusion that she never knew whether it was her turn or not. So she went in search of her hedgehog.")]
0 [  0 425]

0.342408595348
There was nothing so VERY remarkable in that; nor did Alice think it so VERY much out of the way to hear the Rabbit say to itself, 'Oh dear! Oh dear! I shall be late!' (when she thought it over afterwards, it occurred to her that she ought to have wondered at this, but at the time it all seemed quite natural); but when the Rabbit actually TOOK A WATCH OUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on, Alice started to her feet, for it flashed across her mind that she had never before seen a rabbit with either a waistcoat-pocket, or a watch to take out of it, and burning with curiosity, she ran across the field after it, and fortunately was just in time to see it pop down a large rabbit-hole under the hedge.
[(0.3424085953476502, "Alice was not a bit hurt, and she jumped up on to her feet in a moment: she looked up, but it was all dark overhead; before her was another long passage, and the White Rabbit was still in sight, hurrying down it. There was not a moment to be lost: away went Alice like the wind, and was just in time to hear it say, as it turned a corner, 'Oh my ears and whiskers, how late it's getting!' She was close behind it when she turned the corner, but the Rabbit was no longer to be seen: she found herself in a long, low hall, which was lit up by a row of lamps hanging from the roof.")]
1 [1 9]

0.414956778645
Either the well was very deep, or she fell very slowly, for she had plenty of time as she went down to look about her and to wonder what was going to happen next. First, she tried to look down and make out what she was coming to, but it was too dark to see anything; then she looked at the sides of the well, and noticed that they were filled with cupboards and book-shelves; here and there she saw maps and pictures hung upon pegs. She took down a jar from one of the shelves as she passed; it was labelled 'ORANGE MARMALADE', but to her great disappointment it was empty: she did not like to drop the jar for fear of killing somebody, so managed to put it into one of the cupboards as she fell past it.
[(0.414956778645421, "As she said this she looked down at her hands, and was surprised to see that she had put on one of the Rabbit's little white kid gloves while she was talking. 'How CAN I have done that?' she thought. 'I must be growing small again.' She got up and went to the table to measure herself by it, and found that, as nearly as she could guess, she was now about two feet high, and was going on shrinking rapidly: she soon found out that the cause of this was the fan she was holding, and she dropped it hastily, just in time to avoid shrinking away altogether.")]
4 [ 4 35]

0.463145263184
After a while, finding that nothing more happened, she decided on going into the garden at once; but, alas for poor Alice! when she got to the door, she found she had forgotten the little golden key, and when she went back to the table for it, she found she could not possibly reach it: she could see it quite plainly through the glass, and she tried her best to climb up one of the legs of the table, but it was too slippery; and when she had tired herself out with trying, the poor little thing sat down and cried.
[(0.46314526318368665, "Once more she found herself in the long hall, and close to the little glass table. 'Now, I'll manage better this time,' she said to herself, and began by taking the little golden key, and unlocking the door that led into the garden. Then she went to work nibbling at the mushroom (she had kept a piece of it in her pocket) till she was about a foot high: then she walked down the little passage: and THEN--she found herself at last in the beautiful garden, among the bright flower-beds and the cool fountains.")]
18 [ 18 367]

0.492144166262
'I didn't mean it!' pleaded poor Alice. 'But you're so easily offended, you know!'
[(0.49214416626227414, "'But I'm not used to it!' pleaded poor Alice in a piteous tone. And she thought of herself, 'I wish the creatures wouldn't be so easily offended!'")]
83 [ 83 170]

0.65148085129
'Sure, it's an arm, yer honour!' (He pronounced it 'arrum.')
[(0.6514808512898599, "'Sure, it does, yer honour: but it's an arm for all that.'")]
110 [110 112]

1.0
Will you, won't you, will you, won't you, will you join the dance?  Will you, won't you, will you, won't you, won't you join the dance?
[(0.9999999999999999, "Will you, won't you, will you, won't you, will you join the dance?  Will you, won't you, will you, won't you, won't you join the dance?")]
543 [547 543]