Python Scikit-Learn for Computational Linguists

(C) 2017 by Damir Cavar

Version: 1.0, January 2017

This tutorial was developed as part of my course material for the course Machine Learning for Computational Linguistics in the Computational Linguistics Program of the Department of Linguistics at Indiana University.

This material is based on various other tutorials.

Introduction

One of the problems that Machine Learning aims to solve is making predictions based on previous experience. This can be achieved by extracting features from existing data collections. Scikit-Learn comes with several sample datasets, among them the Iris flower data (classification), the Optical Recognition of Handwritten Digits data (classification), the Boston Housing data (regression), and the Diabetes data (regression). These datasets are part of Scikit-Learn and do not have to be downloaded. We can load them by importing the datasets module from sklearn and then loading the individual datasets.


In [71]:
from sklearn import datasets

We can load a dataset using the following function:


In [72]:
diabetes = datasets.load_diabetes()
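The loaded dataset is a Bunch object whose feature matrix is stored in the data member. As a quick sanity check (a sketch; the numbers below refer to the standard dataset shipped with Scikit-Learn):

print(diabetes.data.shape)   # (442, 10): 442 samples with 10 features each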

Some datasets provide a description in the DESCR field:


In [73]:
iris = datasets.load_iris()
print(iris.DESCR)


Iris Plants Database
====================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20  0.76     0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

This is a copy of UCI ML iris datasets.
http://archive.ics.uci.edu/ml/datasets/Iris

The famous Iris database, first used by Sir R.A Fisher

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

References
----------
   - Fisher,R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda,R.O., & Hart,P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

We can inspect the entire content of a dataset by printing it out:


In [74]:
digits = datasets.load_digits()
print(digits)


{'DESCR': "Optical Recognition of Handwritten Digits Data Set\n===================================================\n\nNotes\n-----\nData Set Characteristics:\n    :Number of Instances: 5620\n    :Number of Attributes: 64\n    :Attribute Information: 8x8 image of integer pixels in the range 0..16.\n    :Missing Attribute Values: None\n    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)\n    :Date: July; 1998\n\nThis is a copy of the test set of the UCI ML hand-written digits datasets\nhttp://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits\n\nThe data set contains images of hand-written digits: 10 classes where\neach class refers to a digit.\n\nPreprocessing programs made available by NIST were used to extract\nnormalized bitmaps of handwritten digits from a preprinted form. From a\ntotal of 43 people, 30 contributed to the training set and different 13\nto the test set. 32x32 bitmaps are divided into nonoverlapping blocks of\n4x4 and the number of on pixels are counted in each block. This generates\nan input matrix of 8x8 where each element is an integer in the range\n0..16. This reduces dimensionality and gives invariance to small\ndistortions.\n\nFor info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.\nT. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.\nL. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,\n1994.\n\nReferences\n----------\n  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their\n    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of\n    Graduate Studies in Science and Engineering, Bogazici University.\n  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.\n  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.\n    Linear dimensionalityreduction using relevance weighted LDA. School of\n    Electrical and Electronic Engineering Nanyang Technological University.\n    2005.\n  - Claudio Gentile. A New Approximate Maximal Margin Classification\n    Algorithm. NIPS. 2000.\n", 'images': array([[[  0.,   0.,   5., ...,   1.,   0.,   0.],
        [  0.,   0.,  13., ...,  15.,   5.,   0.],
        [  0.,   3.,  15., ...,  11.,   8.,   0.],
        ..., 
        [  0.,   4.,  11., ...,  12.,   7.,   0.],
        [  0.,   2.,  14., ...,  12.,   0.,   0.],
        [  0.,   0.,   6., ...,   0.,   0.,   0.]],

       [[  0.,   0.,   0., ...,   5.,   0.,   0.],
        [  0.,   0.,   0., ...,   9.,   0.,   0.],
        [  0.,   0.,   3., ...,   6.,   0.,   0.],
        ..., 
        [  0.,   0.,   1., ...,   6.,   0.,   0.],
        [  0.,   0.,   1., ...,   6.,   0.,   0.],
        [  0.,   0.,   0., ...,  10.,   0.,   0.]],

       [[  0.,   0.,   0., ...,  12.,   0.,   0.],
        [  0.,   0.,   3., ...,  14.,   0.,   0.],
        [  0.,   0.,   8., ...,  16.,   0.,   0.],
        ..., 
        [  0.,   9.,  16., ...,   0.,   0.,   0.],
        [  0.,   3.,  13., ...,  11.,   5.,   0.],
        [  0.,   0.,   0., ...,  16.,   9.,   0.]],

       ..., 
       [[  0.,   0.,   1., ...,   1.,   0.,   0.],
        [  0.,   0.,  13., ...,   2.,   1.,   0.],
        [  0.,   0.,  16., ...,  16.,   5.,   0.],
        ..., 
        [  0.,   0.,  16., ...,  15.,   0.,   0.],
        [  0.,   0.,  15., ...,  16.,   0.,   0.],
        [  0.,   0.,   2., ...,   6.,   0.,   0.]],

       [[  0.,   0.,   2., ...,   0.,   0.,   0.],
        [  0.,   0.,  14., ...,  15.,   1.,   0.],
        [  0.,   4.,  16., ...,  16.,   7.,   0.],
        ..., 
        [  0.,   0.,   0., ...,  16.,   2.,   0.],
        [  0.,   0.,   4., ...,  16.,   2.,   0.],
        [  0.,   0.,   5., ...,  12.,   0.,   0.]],

       [[  0.,   0.,  10., ...,   1.,   0.,   0.],
        [  0.,   2.,  16., ...,   1.,   0.,   0.],
        [  0.,   0.,  15., ...,  15.,   0.,   0.],
        ..., 
        [  0.,   4.,  16., ...,  16.,   6.,   0.],
        [  0.,   8.,  16., ...,  16.,   8.,   0.],
        [  0.,   1.,   8., ...,  12.,   1.,   0.]]]), 'data': array([[  0.,   0.,   5., ...,   0.,   0.,   0.],
       [  0.,   0.,   0., ...,  10.,   0.,   0.],
       [  0.,   0.,   0., ...,  16.,   9.,   0.],
       ..., 
       [  0.,   0.,   1., ...,   6.,   0.,   0.],
       [  0.,   0.,   2., ...,  12.,   0.,   0.],
       [  0.,   0.,  10., ...,  12.,   1.,   0.]]), 'target': array([0, 1, 2, ..., 8, 9, 8]), 'target_names': array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])}

The data of the digits dataset is stored in the data member. Each row represents the 64 features of one digit image.


In [75]:
print(digits.data)


[[  0.   0.   5. ...,   0.   0.   0.]
 [  0.   0.   0. ...,  10.   0.   0.]
 [  0.   0.   0. ...,  16.   9.   0.]
 ..., 
 [  0.   0.   1. ...,   6.   0.   0.]
 [  0.   0.   2. ...,  12.   0.   0.]
 [  0.   0.  10. ...,  12.   1.   0.]]

The target member contains the true labels that correspond to the feature vectors, that is, the digits that the feature vectors represent.


In [77]:
print(digits.target)
print(digits.DESCR)


[0 1 2 ..., 8 9 8]
Optical Recognition of Handwritten Digits Data Set
===================================================

Notes
-----
Data Set Characteristics:
    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
http://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each block. This generates
an input matrix of 8x8 where each element is an integer in the range
0..16. This reduces dimensionality and gives invariance to small
distortions.

For info on NIST preprocessing routines, see M. D. Garris, J. L. Blue, G.
T. Candela, D. L. Dimmick, J. Geist, P. J. Grother, S. A. Janet, and C.
L. Wilson, NIST Form-Based Handprint Recognition System, NISTIR 5469,
1994.

References
----------
  - C. Kaynak (1995) Methods of Combining Multiple Classifiers and Their
    Applications to Handwritten Digit Recognition, MSc Thesis, Institute of
    Graduate Studies in Science and Engineering, Bogazici University.
  - E. Alpaydin, C. Kaynak (1998) Cascading Classifiers, Kybernetika.
  - Ken Tang and Ponnuthurai N. Suganthan and Xi Yao and A. Kai Qin.
    Linear dimensionalityreduction using relevance weighted LDA. School of
    Electrical and Electronic Engineering Nanyang Technological University.
    2005.
  - Claudio Gentile. A New Approximate Maximal Margin Classification
    Algorithm. NIPS. 2000.

In the case of the digits dataset, each image is stored as a 2D 8x8 matrix. You can print these matrices using the images member:


In [78]:
print(0, '\n', digits.images[0])
print()
print(1, '\n', digits.images[1])


0 
 [[  0.   0.   5.  13.   9.   1.   0.   0.]
 [  0.   0.  13.  15.  10.  15.   5.   0.]
 [  0.   3.  15.   2.   0.  11.   8.   0.]
 [  0.   4.  12.   0.   0.   8.   8.   0.]
 [  0.   5.   8.   0.   0.   9.   8.   0.]
 [  0.   4.  11.   0.   1.  12.   7.   0.]
 [  0.   2.  14.   5.  10.  12.   0.   0.]
 [  0.   0.   6.  13.  10.   0.   0.   0.]]

1 
 [[  0.   0.   0.  12.  13.   5.   0.   0.]
 [  0.   0.   0.  11.  16.   9.   0.   0.]
 [  0.   0.   3.  15.  16.   6.   0.   0.]
 [  0.   7.  15.  16.  16.   2.   0.   0.]
 [  0.   0.   1.  16.  16.   3.   0.   0.]
 [  0.   0.   1.  16.  16.   6.   0.   0.]
 [  0.   0.   1.  16.  16.   6.   0.   0.]
 [  0.   0.   0.  11.  16.  10.   0.   0.]]
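To view a digit as an actual image, we can render the 8x8 matrix with matplotlib (a sketch; assumes matplotlib is installed):

import matplotlib.pyplot as plt

# Render the first digit's 8x8 pixel matrix in inverted grayscale.
plt.imshow(digits.images[0], cmap=plt.cm.gray_r, interpolation='nearest')
plt.show()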

The digits dataset is a set of images of digits that can be used to train a classifier and test the classification on unseen images. To use a Support Vector Classifier we import the svm module:


In [79]:
from sklearn import svm

We create a classifier instance with manually set parameters. The parameters can also be selected automatically, for example with a grid search over candidate values, as sketched after the next cell.


In [80]:
classifier = svm.SVC(gamma=0.001, C=100.)
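A minimal grid-search sketch; the candidate values for gamma and C below are illustrative assumptions, not tuned choices:

from sklearn.model_selection import GridSearchCV

# Try all combinations of the candidate parameter values using
# cross-validation and keep the best-scoring combination.
param_grid = {'gamma': [0.0001, 0.001, 0.01], 'C': [1., 10., 100.]}
search = GridSearchCV(svm.SVC(), param_grid)
search.fit(digits.data, digits.target)
print(search.best_params_)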

The classifier instance has to be trained on the data. The fit method takes two parameters: the feature matrix and the array with the corresponding classes or labels. The features are stored in the data member and the labels in the target member. We use all but the last element of data and target for training (fitting).


In [81]:
classifier.fit(digits.data[:-1], digits.target[:-1])


Out[81]:
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

We can use the predict method to predict the class of the last element in the data member, which was held out of training:


In [83]:
print("Prediction:", classifier.predict(digits.data[-1:]))
print("Image:\n", digits.images[-1])
print("Label:", digits.target[-1])


Prediction: [8]
Image:
 [[  0.   0.  10.  14.   8.   1.   0.   0.]
 [  0.   2.  16.  14.   6.   1.   0.   0.]
 [  0.   0.  15.  15.   8.  15.   0.   0.]
 [  0.   0.   5.  16.  16.  10.   0.   0.]
 [  0.   0.  12.  15.  15.  12.   0.   0.]
 [  0.   4.  16.   6.   4.  16.   6.   0.]
 [  0.   8.  16.  10.   8.  16.   8.   0.]
 [  0.   1.   8.  12.  14.  12.   1.   0.]]
Label: 8
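A single held-out example tells us little about the quality of the model. A more informative check (a sketch; the half-and-half split is an arbitrary choice) trains on the first half of the data and scores the second half:

from sklearn import metrics

# Train on the first half of the digits and evaluate on the second half.
n = len(digits.data) // 2
classifier.fit(digits.data[:n], digits.target[:n])
predicted = classifier.predict(digits.data[n:])
print(metrics.accuracy_score(digits.target[n:], predicted))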

Storing Models

We can train a new model from the Iris data using the fit method:


In [84]:
classifier.fit(iris.data, iris.target)


Out[84]:
SVC(C=100.0, cache_size=200, class_weight=None, coef0=0.0,
  decision_function_shape=None, degree=3, gamma=0.001, kernel='rbf',
  max_iter=-1, probability=False, random_state=None, shrinking=True,
  tol=0.001, verbose=False)

To store the model in a file, we can use the pickle module:


In [85]:
import pickle

We can serialize the classifier to a bytes object that we can process further or save to disk:


In [86]:
s = pickle.dumps(classifier)

We can save the serialized model to a file named irisModel.dat:


In [87]:
ofp = open("irisModel.dat", mode='wb')
ofp.write(s)
ofp.close()

The model can be read back into memory using the following code:


In [88]:
ifp = open("irisModel.dat", mode='rb')
model = ifp.read()
ifp.close()
classifier2 = pickle.loads(model)

We can use this unpickled classifier2 in the same way as shown above:


In [89]:
print("Prediction:", classifier2.predict(iris.data[0:1]))
print("Target:", iris.target[0])


Prediction: [0]
Target: 0
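For models that contain large NumPy arrays, the Scikit-Learn documentation recommends joblib over plain pickle. A sketch (in the Scikit-Learn version used here, joblib ships as sklearn.externals.joblib; newer versions use the standalone joblib package):

from sklearn.externals import joblib

# Persist the model to disk and load it back; the filename is arbitrary.
joblib.dump(classifier, "irisModel.pkl")
classifier3 = joblib.load("irisModel.pkl")
print(classifier3.predict(iris.data[0:1]))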

Nearest Neighbor Classification

We will use the numpy module for arrays and operations on them.


In [90]:
import numpy

We can print the target array of the iris dataset and the unique set of classes it contains using the following code:


In [91]:
print(iris.target)
print(numpy.unique(iris.target))


[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2
 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
 2 2]
[0 1 2]

We can split the iris dataset into a training and a testing dataset using a random permutation of the indices. Note that seeding the random number generator does not make repeated calls return the same permutation; each call advances the generator, so the two permutations below differ:


In [96]:
numpy.random.seed(0)
indices = numpy.random.permutation(len(iris.data))
print(indices)
indices = numpy.random.permutation(len(iris.data))
print(indices)


[114  62  33 107   7 100  40  86  76  71 134  51  73  54  63  37  78  90
  45  16 121  66  24   8 126  22  44  97  93  26 137  84  27 127 132  59
  18  83  61  92 112   2 141  43  10  60 116 144 119 108  69 135  56  80
 123 133 106 146  50 147  85  30 101  94  64  89  91 125  48  13 111  95
  20  15  52   3 149  98   6  68 109  96  12 102 120 104 128  46  11 110
 124  41 148   1 113 139  42   4 129  17  38   5  53 143 105   0  34  28
  55  75  35  23  74  31 118  57 131  65  32 138  14 122  19  29 130  49
 136  99  82  79 115 145  72  77  25  81 140 142  39  58  88  70  87  36
  21   9 103  67 117  47]
[ 92 141 130 119  48 143 122  63  26  64  42 108  91  77  22 148   6  65
  47  68  60  15 124  58 142  12  59 105  89  78  52 131 113  98  30 136
  66 133  49  62  74  17 106   8 135  80 107  90   0  36 112   5  57 102
  55  34 128  33  21  73   7  45 129 103 146 120  94  50 134  99 126 114
   9  39  97 101  29  81  20  46  51  53  23  27   2  28  37 111  10  84
 137 127  43  87  69 144 140  35  76   3  82 145 116  88  44 147   1  93
  38  11 115  54  40  18  41  79  24  56  71  13  31  85  70 132 125 123
 100  32 104  83 117 118 138  25 110  16  75 109 121  86 139   4  96  14
  61  67 149  95  19  72]

As an aside, the same kind of integer indexing works on any Python sequence, for example on a string:


In [95]:
text = "Hello"
for i in range(len(text)):
    print(i, ':', text[i])


0 : H
1 : e
2 : l
3 : l
4 : o

In [97]:
irisTrain_data = iris.data[indices[:-10]]
irisTrain_target = iris.target[indices[:-10]]
irisTest_data = iris.data[indices[-10:]]
irisTest_target = iris.target[indices[-10:]]
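Scikit-Learn also provides a helper that performs this kind of split in one call. A sketch (it will not reproduce the exact split above; test_size=10 mirrors the ten held-out samples):

from sklearn.model_selection import train_test_split

# Randomly split features and targets into training and test portions.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=10, random_state=0)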

We import a k-nearest neighbors classifier and fit it on the training split:


In [98]:
from sklearn.neighbors import KNeighborsClassifier

In [99]:
knn = KNeighborsClassifier()
knn.fit(irisTrain_data, irisTrain_target)


Out[99]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

The trained classifier can now predict the classes of the test data, which we compare to the true targets:


In [100]:
knn.predict(irisTest_data)


Out[100]:
array([2, 0, 1, 0, 1, 1, 2, 1, 0, 2])

In [101]:
irisTest_target


Out[101]:
array([2, 0, 1, 0, 1, 1, 2, 1, 0, 1])

Note that the classifier misclassified the last test instance: it predicted class 2, while the true label is 1.
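Instead of comparing the arrays by eye, the score method reports the fraction of correct predictions directly (a sketch):

print(knn.score(irisTest_data, irisTest_target))   # 9 of 10 correct: 0.9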

Clustering

K-means Clustering

Clustering groups data points without using the target labels. We can cluster the iris measurements into three groups and compare the result to the known classes:


In [102]:
from sklearn import cluster

In [104]:
k_means = cluster.KMeans(n_clusters=3)

In [105]:
k_means.fit(iris.data)


Out[105]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

In [106]:
print(k_means.labels_[::10])


[1 1 1 1 1 2 2 2 2 2 0 0 0 0 0]

In [107]:
print(iris.target[::10])


[0 0 0 0 0 1 1 1 1 1 2 2 2 2 2]

The cluster indices assigned by K-means are arbitrary: here cluster 1 corresponds to class 0, cluster 2 to class 1, and cluster 0 to class 2, so on this sample the clustering recovers the three classes.
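Because the cluster indices are arbitrary, comparing a clustering to the true classes calls for a permutation-invariant measure such as the adjusted Rand index (a sketch):

from sklearn.metrics import adjusted_rand_score

# 1.0 means a perfect match up to relabeling; 0.0 is chance level.
print(adjusted_rand_score(iris.target, k_means.labels_))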

Classification

Using Kernels

Support Vector Machines can use different kernel functions. We train the same digits classifier with a linear, a polynomial, and an RBF kernel.

Linear kernel


In [108]:
svc = svm.SVC(kernel='linear', gamma=0.001, C=100.)
svc.fit(digits.data[:-1], digits.target[:-1])
print(svc.predict(digits.data[-1:]))
print(digits.target[-1:])


[8]
[8]

Polynomial kernel:

The degree parameter specifies the degree of the polynomial kernel.


In [109]:
svc = svm.SVC(kernel='poly', degree=3, gamma=0.001, C=100.)
svc.fit(digits.data[:-1], digits.target[:-1])
print(svc.predict(digits.data[-1:]))
print(digits.target[-1:])


[8]
[8]

RBF kernel (Radial Basis Function):


In [110]:
svc = svm.SVC(kernel='rbf', gamma=0.001, C=100.)
svc.fit(digits.data[:-1], digits.target[:-1])
print(svc.predict(digits.data[-1:]))
print(digits.target[-1:])


[8]
[8]
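The single test example above cannot distinguish the kernels. A sketch that scores each kernel on a larger held-out slice (the choice of 100 test images is arbitrary):

# Fit each kernel on all but the last 100 digits and score it on those 100.
for kernel in ('linear', 'poly', 'rbf'):
    clf = svm.SVC(kernel=kernel, gamma=0.001, C=100.)
    clf.fit(digits.data[:-100], digits.target[:-100])
    print(kernel, clf.score(digits.data[-100:], digits.target[-100:]))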

Logistic Regression

We can train a logistic regression classifier on the same iris training split and compare its predictions to the test targets:


In [111]:
from sklearn import linear_model

logistic = linear_model.LogisticRegression(C=1e5)

In [112]:
logistic.fit(irisTrain_data, irisTrain_target)


Out[112]:
LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='ovr', n_jobs=1, penalty='l2', random_state=None,
          solver='liblinear', tol=0.0001, verbose=0, warm_start=False)

In [113]:
logistic.predict(irisTest_data)


Out[113]:
array([2, 0, 1, 0, 1, 1, 2, 1, 0, 1])

In [114]:
irisTest_target


Out[114]:
array([2, 0, 1, 0, 1, 1, 2, 1, 0, 1])
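Logistic regression also provides class membership probabilities through the predict_proba method (a sketch):

# Probability of each of the three classes for the first test sample.
print(logistic.predict_proba(irisTest_data[:1]))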

Random Forest

We can also train an ensemble classifier, for example a random forest:


In [115]:
from sklearn import ensemble
rfc = ensemble.RandomForestClassifier()
rfc.fit(irisTrain_data, irisTrain_target)


Out[115]:
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
            max_depth=None, max_features='auto', max_leaf_nodes=None,
            min_impurity_split=1e-07, min_samples_leaf=1,
            min_samples_split=2, min_weight_fraction_leaf=0.0,
            n_estimators=10, n_jobs=1, oob_score=False, random_state=None,
            verbose=0, warm_start=False)

In [116]:
rfc.predict(irisTest_data)


Out[116]:
array([2, 0, 1, 0, 1, 1, 2, 1, 0, 1])

In [117]:
irisTest_target


Out[117]:
array([2, 0, 1, 0, 1, 1, 2, 1, 0, 1])
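A trained random forest exposes an estimate of how informative each feature was through its feature_importances_ attribute (a sketch; the four values correspond to the four iris measurements):

print(rfc.feature_importances_)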


A Simple Text Classification Example

As a final example, we compare an unseen test text to two reference texts. We compute the set of tokens that is unique to each reference text and count how many tokens of the test text fall into each unique set; the test text is then associated with the reference text that yields the larger overlap:


In [129]:
text_s1 = """
User (computing)
A user is a person who uses a computer or network service. Users generally use a system or a software product[1] without the technical expertise required to fully understand it.[1] Power users use advanced features of programs, though they are not necessarily capable of computer programming and system administration.[2][3]

A user often has a user account and is identified to the system by a username (or user name). Other terms for username include login name, screenname (or screen name), nickname (or nick) and handle, which is derived from the identical Citizen's Band radio term.

Some software products provide services to other systems and have no direct end users.
End user
See also: End user

End users are the ultimate human users (also referred to as operators) of a software product. The term is used to abstract and distinguish those who only use the software from the developers of the system, who enhance the software for end users.[4] In user-centered design, it also distinguishes the software operator from the client who pays for its development and other stakeholders who may not directly use the software, but help establish its requirements.[5][6] This abstraction is primarily useful in designing the user interface, and refers to a relevant subset of characteristics that most expected users would have in common.

In user-centered design, personas are created to represent the types of users. It is sometimes specified for each persona which types of user interfaces it is comfortable with (due to previous experience or the interface's inherent simplicity), and what technical expertise and degree of knowledge it has in specific fields or disciplines. When few constraints are imposed on the end-user category, especially when designing programs for use by the general public, it is common practice to expect minimal technical expertise or previous training in end users.[7] In this context, graphical user interfaces (GUIs) are usually preferred to command-line interfaces (CLIs) for the sake of usability.[8]

The end-user development discipline blurs the typical distinction between users and developers. It designates activities or techniques in which people who are not professional developers create automated behavior and complex data objects without significant knowledge of a programming language.

Systems whose actor is another system or a software agent have no direct end users.
User account

A user's account allows a user to authenticate to a system and potentially to receive authorization to access resources provided by or connected to that system; however, authentication does not imply authorization. To log in to an account, a user is typically required to authenticate oneself with a password or other credentials for the purposes of accounting, security, logging, and resource management.

Once the user has logged on, the operating system will often use an identifier such as an integer to refer to them, rather than their username, through a process known as identity correlation. In Unix systems, the username is correlated with a user identifier or user id.

Computer systems operate in one of two types based on what kind of users they have:

    Single-user systems do not have a concept of several user accounts.
    Multi-user systems have such a concept, and require users to identify themselves before using the system.

Each user account on a multi-user system typically has a home directory, in which to store files pertaining exclusively to that user's activities, which is protected from access by other users (though a system administrator may have access). User accounts often contain a public user profile, which contains basic information provided by the account's owner. The files stored in the home directory (and all other directories in the system) have file system permissions which are inspected by the operating system to determine which users are granted access to read or execute a file, or to store a new file in that directory.

While systems expect most user accounts to be used by only a single person, many systems have a special account intended to allow anyone to use the system, such as the username "anonymous" for anonymous FTP and the username "guest" for a guest account.
Usernames

Various computer operating-systems and applications expect/enforce different rules for the formats of user names.

In Microsoft Windows environments, for example, note the potential use of:[9]

    User Principal Name (UPN) format - for example: UserName@orgName.com
    Down-Level Logon Name format - for example: DOMAIN\accountName

Some online communities use usernames as nicknames for the account holders. In some cases, a user may be better known by their username than by their real name, such as CmdrTaco (Rob Malda), founder of the website Slashdot.
Terminology

Some usability professionals have expressed their dislike of the term "user", proposing it to be changed.[10] Don Norman stated that "One of the horrible words we use is 'users'. I am on a crusade to get rid of the word 'users'. I would prefer to call them 'people'."[11]
See also

    Information technology portal iconSoftware portal 

    1% rule (Internet culture)
    Anonymous post
    Pseudonym
    End-user computing, systems in which non-programmers can create working applications.
    End-user database, a collection of data developed by individual end-users.
    End-user development, a technique that allows people who are not professional developers to perform programming tasks, i.e. to create or modify software.
    End-User License Agreement (EULA), a contract between a supplier of software and its purchaser, granting the right to use it.
    User error
    User agent
    User experience
    User space
"""

text_s2 = """
Personal account

A personal account is an account for use by an individual for that person's own needs. It is a relative term to differentiate them from those accounts for corporate or business use. The term "personal account" may be used generically for financial accounts at banks and for service accounts such as accounts with the phone company, or even for e-mail accounts.

Banking

In banking "personal account" refers to one's account at the bank that is used for non-business purposes. Most likely, the service at the bank consists of one of two kinds of accounts or sometimes both: a savings account and a current account.

Banks differentiate their services for personal accounts from business accounts by setting lower minimum balance requirements, lower fees, free checks, free ATM usage, free debit card (Check card) usage, etc. The term does not apply to any one service or limit the banks from providing the same services to non-individuals. Personal account can be classified into three categories: 1. Persons of Nature, 2. Persons of Artificial Relationship, 3. Persons of Representation.

At the turn of the 21st century, many banks started offering free checking, a checking account with no minimum balance, a free check book, and no hidden fees. This encouraged Americans who would otherwise live from check to check to open their "personal" account at financial institutions. For businesses that issue corporate checks to employees, this enables reduction in the amount of paperwork.

Finance

In the financial industry, 'personal account' (usually "PA") refers to trading or investing for yourself, rather than the company one is working for. There are often restrictions on what may be done with a PA, to avoid conflict of interest.
"""

test_text = """
A user account is a location on a network server used to store a computer username, password, and other information. A user account allows or does not allow a user to connect to a network, another computer, or other share. Any network that has multiple users requires user accounts.
"""

from nltk import word_tokenize, sent_tokenize

# Split the first text into sentences and tokenize each sentence.
sentences_s1 = sent_tokenize(text_s1)
#print(sentences_s1)

toksentences_s1 = [ word_tokenize(sentence) for sentence in sentences_s1 ]
#print(toksentences_s1)

# Build the token vocabulary of each reference text.
tokens_s1 = set(word_tokenize(text_s1))
tokens_s2 = set(word_tokenize(text_s2))

#print(set.intersection(tokens_s1, tokens_s2))

# Tokens that occur in one reference text but not in the other.
unique_s1 = tokens_s1 - tokens_s2
unique_s2 = tokens_s2 - tokens_s1
#print(unique_s1)
#print(unique_s2)

# Count how many tokens of the test text are unique to each reference text.
testTokens = set(word_tokenize(test_text))
print(len(set.intersection(testTokens, unique_s1)))
print(len(set.intersection(testTokens, unique_s2)))


13
0

The test text shares 13 tokens with the vocabulary unique to text_s1 and none with the vocabulary unique to text_s2, so this simple overlap measure associates it with the first text.
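The same idea can be expressed with Scikit-Learn's text tools by treating the two reference texts as a two-document training set for a Naive Bayes classifier (a sketch; the labels 0 and 1 stand for text_s1 and text_s2):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Turn the two reference texts into token-count vectors and fit a
# Naive Bayes classifier with one class per reference text.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform([text_s1, text_s2])
clf = MultinomialNB().fit(X, [0, 1])
print(clf.predict(vectorizer.transform([test_text])))   # 0 = text_s1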
