Setting up your python environment

We will be using python 3 and ipython/jupyter extensively in this course, so you need to set up a working python environment first. Below are instructions for a Mac OS X 10.11 environment; you should be able to adapt them to other environments -- try Google if you run into problems.

Installing anaconda and jupyter

  1. Download and install the anaconda installation package (for python 3.6) from https://www.continuum.io/downloads
  2. Create a py36 virtual environment with conda create -n py36 python=3.6 anaconda. See more at http://conda.pydata.org/docs/using/envs.html
  3. Activate py36 (or put the command in your ~/.bashrc): source activate py36
  4. To install a new package in an environment, switch to it and use conda install PACKAGENAME or pip install PACKAGENAME (or, without switching, conda install -n py36 PACKAGENAME)
  5. Install jupyter with conda install jupyter

Test your installation

% python -V
Python 3.6.3 :: Anaconda ...
% ipython -V
6.1.0
% jupyter notebook

The last command should open a new page in your browser. Also check that when you click the "new" button, there is a "python 3" choice under 'notebooks'.
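You can also confirm from inside a notebook cell that the kernel is running the intended interpreter (a minimal check; the exact path and version string will differ on your machine):

import sys
print(sys.executable)  # should point into your anaconda py36 environment
print(sys.version)     # should report Python 3.6.x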

Using jupyter

Start with this simple tutorial: https://jupyter-notebook-beginner-guide.readthedocs.io/en/latest/index.html

Press h (you may need to press ESC first) to learn a few important keyboard shortcuts, e.g.,

  • SHIFT+RETURN: run the current cell and move to the next one
  • A, B, X: insert a cell above, insert a cell below, or cut the current cell (in command mode)
  • ESC: leave edit mode and enter command mode
  • ESC m: change the current cell to a markdown cell
  • Selecting multiple lines + TAB (indent them) / Cmd + / (block comment)
  • Note that mouse selection copies to the clipboard (sometimes annoying)

Read the markdown syntax at http://jupyter-notebook.readthedocs.io/en/latest/examples/Notebook/Working%20With%20Markdown%20Cells.html and try it out yourself.

It can also display maths symbols/equations, e.g., $e^{ix} = \cos(x) + i \sin(x)$.

$$ P \implies Q \qquad \equiv \qquad \neg P \lor Q $$

Try out cells with simple python code (or try the following cells in this notebook).

Tips:

  • Recommended browser: firefox (Chrome has issues rendering maths fonts/equations)
  • Your code may run into an infinite loop and you may HAVE TO kill the browser, so use a decent session manager for your browser.

In [1]:
import random

n = 10
data = [random.randint(1, 10) for _ in range(n)]
data  # display the variable's content (the last expression in a cell is shown as its output)


Out[1]:
[2, 9, 2, 2, 4, 6, 9, 3, 10, 7]

In [2]:
import urllib.error
import urllib.request
from bs4 import BeautifulSoup

def get_page(url):
    # fetch the page and return it as a parsed BeautifulSoup object (None on error)
    try:
        web_page = urllib.request.urlopen(url).read()
        soup = BeautifulSoup(web_page, 'html.parser')
        return soup
    except urllib.error.HTTPError:
        print("HTTPERROR!")
    except urllib.error.URLError:
        print("URLERROR!")

def get_titles(sp):
    # each publication on a DBLP author page sits in a <div class="data"> element
    papers = sp.find_all('div', {'class': 'data'})
    for i, paper in enumerate(papers, start=1):
        title = paper.find('span', {'class': 'title'})
        print("Paper {}:\t{}".format(i, title.get_text()))

In [3]:
sp = get_page('http://dblp.uni-trier.de/pers/hd/m/Manning:Christopher_D=')

In [4]:
get_titles(sp)


Paper 1:	Understanding Human Language: Can NLP and Deep Learning Help?
Paper 2:	Evaluating the word-expert approach for Named-Entity Disambiguation.
Paper 3:	A Fast Unified Model for Parsing and Sentence Understanding.
Paper 4:	Achieving Open Vocabulary Neural Machine Translation with Hybrid Word-Character Models.
Paper 5:	Improving Coreference Resolution by Learning Entity-Level Distributed Representations.
Paper 6:	Learning Language Games through Interaction.
Paper 7:	A Thorough Examination of the CNN/Daily Mail Reading Comprehension Task.
Paper 8:	Compression of Neural Machine Translation Models via Pruning.
Paper 9:	Natural language translation at the intersection of AI and HCI.
Paper 10:	Computational Linguistics and Deep Learning.
Paper 11:	Natural Language Translation at the Intersection of AI and HCI.
Paper 12:	Text to 3D Scene Generation with Rich Lexical Grounding.
Paper 13:	Leveraging Linguistic Structure For Open Domain Information Extraction.
Paper 14:	Robust Subgraph Generation Improves Abstract Meaning Representation Parsing.
Paper 15:	Entity-Centric Coreference Resolution with Model Stacking.
Paper 16:	Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks.
Paper 17:	Deep Neural Language Models for Machine Translation.
Paper 18:	Forum77: An Analysis of an Online Health Forum Dedicated to Addiction Recovery.
Paper 19:	A large annotated corpus for learning natural language inference.
Paper 20:	Effective Approaches to Attention-based Neural Machine Translation.
Paper 21:	Distributed Representations of Words to Guide Bootstrapped Entity Classifiers.
Paper 22:	Tree-Structured Composition in Neural Networks without Tree-Structured Architectures.
Paper 23:	On-the-Job Learning with Bayesian Decision Theory.
Paper 24:	Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks.
Paper 25:	Text to 3D Scene Generation with Rich Lexical Grounding.
Paper 26:	Robust Subgraph Generation Improves Abstract Meaning Representation Parsing.
Paper 27:	On-the-Job Learning with Bayesian Decision Theory.
Paper 28:	Tree-structured composition in neural networks without tree-structured architectures.
Paper 29:	Effective Approaches to Attention-based Neural Machine Translation.
Paper 30:	A large annotated corpus for learning natural language inference.
Paper 31:	Research and applications: Induced lexico-syntactic patterns improve information extraction from online medical forums.
Paper 32:	Cross-lingual Projected Expectation Regularization for Weakly Supervised Learning.
Paper 33:	Grounded Compositional Semantics for Finding and Describing Images with Sentences.
Paper 34:	The Stanford CoreNLP Natural Language Processing Toolkit.
Paper 35:	Robust Logistic Regression using Shift Parameters.
Paper 36:	Faster Phrase-Based Decoding by Refining Feature State.
Paper 37:	Two Knives Cut Better Than One: Chinese Word Segmentation with Dual Decomposition.
Paper 38:	Word Segmentation of Informal Arabic with Domain Adaptation.
Paper 39:	TransPhoner: automated mnemonic keyword generation.
Paper 40:	Improved Pattern Learning for Bootstrapped Entity Extraction.
Paper 41:	NaturalLI: Natural Logic Inference for Common Sense Reasoning.
Paper 42:	Human Effort and Machine Learnability in Computer Aided Translation.
Paper 43:	Combining Distant and Partial Supervision for Relation Extraction.
Paper 44:	Modeling Biological Processes for Reading Comprehension.
Paper 45:	A Fast and Accurate Dependency Parser using Neural Networks.
Paper 46:	Glove: Global Vectors for Word Representation.
Paper 47:	Learning Spatial Knowledge for Text to 3D Scene Generation.
Paper 48:	A Gold Standard Dependency Corpus for English.
Paper 49:	Event Extraction Using Distant Supervision.
Paper 50:	Universal Stanford dependencies: A cross-linguistic typology.
Paper 51:	Global Belief Recursive Neural Networks.
Paper 52:	Simple MAP Inference via Low-Rank Relaxations.
Paper 53:	Learning Distributed Representations for Structured Output Prediction.
Paper 54:	On being the right scale: sizing large collections of 3D models.
Paper 55:	Predictive translation memory: a mixed-initiative system for human language translation.
Paper 56:	Recursive Neural Networks for Learning Logical Semantics.
Paper 57:	Learning Distributed Word Representations for Natural Logic Reasoning.
Paper 58:	Parsing Models for Identifying Multiword Expressions.
Paper 59:	Effective Bilingual Constraints for Semi-Supervised Learning of Named Entity Recognizers.
Paper 60:	Fast and Adaptive Online Training of Feature-Rich Translation Models.
Paper 61:	Parsing with Compositional Vector Grammars.
Paper 62:	Joint Word Alignment and Bilingual Named Entity Recognition Using Dual Decomposition.
Paper 63:	The efficacy of human post-editing for language translation.
Paper 64:	Better Word Representations with Recursive Neural Networks for Morphology.
Paper 65:	Philosophers are Mortal: Inferring the Truth of Unseen Facts.
Paper 66:	Feature Noising for Log-Linear Structured Prediction.
Paper 67:	Bilingual Word Embeddings for Phrase-Based Machine Translation.
Paper 68:	Fast dropout training.
Paper 69:	Topic Model Diagnostics: Assessing Domain Relevance via Topical Alignment.
Paper 70:	Learning a Product of Experts with Elitist Lasso.
Paper 71:	Effect of Non-linear Deep Architecture in Sequence Labeling.
Paper 72:	Deep Learning for NLP (without Magic).
Paper 73:	Named Entity Recognition with Bilingual Constraints.
Paper 74:	Reasoning With Neural Tensor Networks for Knowledge Base Completion.
Paper 75:	Zero-Shot Learning Through Cross-Modal Transfer.
Paper 76:	Stanford's 2013 KBP System.
Paper 77:	Learning New Facts From Knowledge Bases With Neural Tensor Networks and Semantic Word Vectors.
Paper 78:	Zero-Shot Learning Through Cross-Modal Transfer.
Paper 79:	Robust Logistic Regression using Shift Parameters.
Paper 80:	Cross-lingual Pseudo-Projected Expectation Regularization for Weakly Supervised Learning.
Paper 81:	Relaxations for inference in restricted Boltzmann machines.
Paper 82:	Combining joint models for biomedical event extraction.
Paper 83:	Did It Happen? The Pragmatic Complexity of Veridicality Assessment.
Paper 84:	"Without the clutter of unimportant words": Descriptive keyphrases for text visualization.
Paper 85:	Deep Learning for NLP (without Magic).
Paper 86:	Baselines and Bigrams: Simple, Good Sentiment and Topic Classification.
Paper 87:	Improving Word Representations via Global Context and Multiple Word Prototypes.
Paper 88:	Termite: visualization techniques for assessing textual topic models.
Paper 89:	Interpretation and trust: designing model-driven visualizations for text analysis.
Paper 90:	Short message communications: users, topics, and in-language processing.
Paper 91:	Multi-instance Multi-label Learning for Relation Extraction.
Paper 92:	Learning Constraints for Consistent Timeline Extraction.
Paper 93:	Probabilistic Finite State Machines for Regression-based MT Evaluation.
Paper 94:	Semantic Compositionality through Recursive Matrix-Vector Spaces.
Paper 95:	SUTime: A library for recognizing and normalizing time expressions.
Paper 96:	Entity Clustering Across Languages.
Paper 97:	Parsing Time: Learning to Interpret Time Expressions.
Paper 98:	Convolutional-Recursive Deep Learning for 3D Object Classification.
Paper 99:	Event Extraction as Dependency Parsing.
Paper 100:	Part-of-Speech Tagging from 97% to 100%: Is It Time for Some Linguistics?
Paper 101:	Semi-Supervised Recursive Autoencoders for Predicting Sentiment Distributions.
Paper 102:	Multiword Expression Identification with Tree Substitution Grammars: A Parsing tour de force with French.
Paper 103:	Risk analysis for intellectual property litigation.
Paper 104:	Parsing Natural Scenes and Natural Language with Recursive Neural Networks.
Paper 105:	Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers.
Paper 106:	Partially labeled topic models for interpretable text mining.
Paper 107:	Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection.
Paper 108:	Veridicality and Utterance Understanding.
Paper 109:	TopicFlow Model: Unsupervised Learning of Topic-specific Influences of Hyperlinked Documents.
Paper 110:	Spectral Chinese Restaurant Processes: Nonparametric Clustering Based on Similarities.
Paper 111:	Proceedings of the Fifteenth Conference on Computational Natural Language Learning, CoNLL 2011, Portland, Oregon, USA, June 23-24, 2011.
Paper 112:	Stanford-UBC Entity Linking at TAC-KBP, Again.
Paper 113:	Stanford's Distantly-Supervised Slot-Filling System.
Paper 114:	Which words are hard to recognize? Prosodic, lexical, and disfluency factors that increase speech recognition error rates.
Paper 115:	"Was It Good? It Was Provocative." Learning the Meaning of Scalar Adjectives.
Paper 116:	Hierarchical Joint Learning: Improving Joint Parsing and Named Entity Recognition with Non-Jointly Labeled Data.
Paper 117:	Better Arabic Parsing: Baselines, Evaluations, and Analysis.
Paper 118:	Probabilistic Tree-Edit Models with Structured Latent Variables for Textual Entailment and Question Answering.
Paper 119:	Viterbi Training Improves Unsupervised Dependency Parsing.
Paper 120:	A Multi-Pass Sieve for Coreference Resolution.
Paper 121:	Parsing to Stanford Dependencies: Trade-offs between Speed and Accuracy.
Paper 122:	Phrasal: A Statistical Machine Translation Toolkit for Exploring New Model Features.
Paper 123:	Subword Variation in Text Message Classification.
Paper 124:	The Best Lexical Metric for Phrase-Based Statistical MT System Optimization.
Paper 125:	Ensemble Models for Dependency Parsing: Cheap and Good?
Paper 126:	Improved Models of Distortion Cost for Statistical Machine Translation.
Paper 127:	Accurate Non-Hierarchical Phrase-Based Translation.
Paper 128:	Stanford-UBC Entity Linking at TAC-KBP.
Paper 129:	A Simple Distant Supervision Approach for the TAC-KBP Slot Filling Task.
Paper 130:	Measuring machine translation quality as semantic equivalence: A metric based on entailment features.
Paper 131:	Robust Machine Translation Evaluation with Entailment Features.
Paper 132:	Quadratic-Time Dependency Parsing for Machine Translation.
Paper 133:	Nested Named Entity Recognition.
Paper 134:	Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora.
Paper 135:	Joint Parsing and Named Entity Recognition.
Paper 136:	Hierarchical Bayesian Domain Adaptation.
Paper 137:	Random Walks for Text Semantic Similarity.
Paper 138:	WikiWalk: Random walks on Wikipedia for Semantic Relatedness.
Paper 139:	Clustering the tagged web.
Paper 140:	Stanford-UBC at TAC-KBP.
Paper 141:	Introduction to information retrieval.
Paper 142:	A Global Joint Model for Semantic Role Labeling.
Paper 143:	Enforcing Transitivity in Coreference Resolution.
Paper 144:	Which Words Are Hard to Recognize? Prosodic, Lexical, and Disfluency Factors that Increase ASR Error Rates.
Paper 145:	Efficient, Feature-based, Conditional Random Field Parsing.
Paper 146:	Finding Contradictions in Text.
Paper 147:	Modeling Semantic Containment and Exclusion in Natural Language Inference.
Paper 148:	Studying the History of Ideas Using Topic Models.
Paper 149:	Legal Docket Classification: Where Machine Learning Stumbles.
Paper 150:	A Phrase-Based Alignment Model for Natural Language Inference.
Paper 151:	A Simple and Effective Hierarchical Phrase Reordering Model.
Paper 152:	Lexicon Schemas and Related Data Models: when Standards Meet Users.
Paper 153:	Deciding Entailment and Contradiction with Stochastic and Edit Distance-based Alignment.
Paper 154:	Robust Graph Alignment Methods for Textual Inference and Machine Reading.
Paper 155:	The Infinite Tree.
Paper 156:	Regularization, adaptation, and non-independent features improve hidden conditional random fields for phone classification.
Paper 157:	An Effective Two-Stage Model for Exploiting Non-Local Dependencies in Named Entity Recognition.
Paper 158:	Unsupervised Discovery of a Statistical Verb Lexicon.
Paper 159:	Solving the Problem of Cascading Errors: Approximate Bayesian Inference for Linguistic Annotation Pipelines.
Paper 160:	Learning to recognize features of valid textual entailments.
Paper 161:	Graphical Model Representations of Word Lattices.
Paper 162:	Exploring the boundaries: gene and protein identification in biomedical text.
Paper 163:	Natural language grammar induction with a generative constituent-context model.
Paper 164:	Robust Textual Inference Via Learning and Abductive Reasoning.
Paper 165:	Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling.
Paper 166:	Unsupervised Learning of Field Segmentation Models for Information Extraction.
Paper 167:	Joint Learning Improves Semantic Role Labeling.
Paper 168:	A Joint Model for Semantic Role Labeling.
Paper 169:	Robust Textual Inference via Graph Matching.
Paper 170:	Deep Dependencies from Context-Free Statistical Parsers: Correcting the Surface Dependency Approximation.
Paper 171:	Corpus-Based Induction of Syntactic Structure: Models of Dependency and Constituency.
Paper 172:	Language Learning: Beyond Thunderdome.
Paper 173:	Using Feature Conjunctions Across Examples for Learning Pairwise Classifiers.
Paper 174:	Max-Margin Parsing.
Paper 175:	Verb Sense and Subcategorization: Using Joint Inference to Improve Performance on Complementary Task.
Paper 176:	The Leaf Path Projection View of Parse Trees: Exploring String Kernels for HPSG Parse Selection.
Paper 177:	Learning random walk models for inducing word dependency distributions.
Paper 178:	Accurate Unlexicalized Parsing.
Paper 179:	Is it Harder to Parse Chinese, or the Chinese Treebank?
Paper 180:	Named Entity Recognition with Character-Level Models.
Paper 181:	A Generative Model for Semantic Role Labeling.
Paper 182:	Optimizing Local Probability Models for Statistical Parsing.
Paper 183:	Spectral Learning.
Paper 184:	Factored A* Search for Models over Sequences and Trees.
Paper 185:	A* Parsing: Fast Exact Viterbi Parse Selection.
Paper 186:	Optimization, Maxent Models, and Conditional Estimation without Magic.
Paper 187:	Feature-Rich Part-of-Speech Tagging with a Cyclic Dependency Network.
Paper 188:	Log-Linear Models for Label Ranking.
Paper 189:	Extrapolation methods for accelerating PageRank computations.
Paper 190:	A Generative Constituent-Context Model for Improved Grammar Induction.
Paper 191:	The LinGO Redwoods Treebank: Motivation and Preliminary Applications.
Paper 192:	Feature Selection for a Rich HPSG Grammar Using Decision Trees.
Paper 193:	Interpreting and Extending Classical Agglomerative Clustering Algorithms using a Model-Based approach.
Paper 194:	From Instance-level Constraints to Space-Level Constraints: Making the Most of Prior Knowledge in Data Clustering.
Paper 195:	Fast Exact Inference with a Factored Model for Natural Language Parsing.
Paper 196:	Foundations of statistical natural language processing.
Paper 197:	Kirrkirr: Software for Browsing and Visual Exploration of a Structured Warlpiri Dictionary.
Paper 198:	Parsing with Treebank Grammars: Empirical Bounds, Theoretical Models, and the Structure of the Penn Treebank.
Paper 199:	Distributional phrase structure induction.
Paper 200:	Parsing and Hypergraphs.
Paper 201:	Natural Language Grammar Induction Using a Constituent-Context Model.
Paper 202:	What's related? Generalizing approaches to related articles in medicine.
Paper 203:	The segmentation problem in morphology learning.
Paper 204:	Probabilistic Parsing Using Left Corner Language Models.
Paper 205:	Automatic Acquisition of a Large Subcategorization Dictionary from Corpora.

Numpy
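Numpy ships with anaconda; if you have not used it before, here is a minimal warm-up sketch you can paste into one of the empty cells below (the array values are arbitrary examples):

import numpy as np

a = np.arange(10)            # array([0, 1, ..., 9])
print(a.mean(), a.std())     # basic statistics over the array
b = a.reshape(2, 5)          # view the same data as a 2x5 matrix
print(b @ b.T)               # matrix product of b with its transpose (a 2x2 result)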


In [ ]:


In [ ]:

Exercise

  1. Compute the top-10 most frequently appearing words in the titles of the author's papers (a sketch of one possible approach follows the expected output below).

For Manning, the output should be:

[('for', 74),
 ('and', 48),
 ('of', 37),
 ('Learning', 30),
 ('with', 25),
 ('the', 20),
 ('A', 20),
 ('to', 18),
 ('in', 15),
 ('Parsing', 13)]
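
One possible approach is to collect the title strings from the parsed page and count whitespace-separated tokens with collections.Counter. This is only a sketch: it reuses the sp object from above, and whether it reproduces the counts shown exactly depends on how you tokenize and strip punctuation.

from collections import Counter

def top_words(sp, k=10):
    # gather every title string on the DBLP page
    titles = [t.get_text() for t in sp.find_all('span', {'class': 'title'})]
    # split on whitespace and drop the trailing period that ends each title
    words = [w.rstrip('.') for title in titles for w in title.split()]
    return Counter(words).most_common(k)

top_words(sp)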

You may even be able to use the wordcloud package to generate the word cloud:


In [6]:
from IPython.display import Image
Image(filename="./asset/Christopher_Manning_wordcloud.png")


Out[6]:
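
A minimal sketch of how such an image could be produced with the wordcloud package (assuming it is installed, e.g. via pip install wordcloud; the output path simply mirrors the asset displayed above):

from wordcloud import WordCloud

# join all title strings into one blob and let WordCloud compute word frequencies itself
text = ' '.join(t.get_text() for t in sp.find_all('span', {'class': 'title'}))
wc = WordCloud(width=800, height=400, background_color='white').generate(text)
wc.to_file('./asset/Christopher_Manning_wordcloud.png')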
