In this notebook, we build a text classification model on the CNAE-9 dataset from the UCI Machine Learning Repository.
From the dataset description:
This is a data set containing 1080 documents of free text business descriptions of Brazilian companies categorized into a subset of 9 categories cataloged in a table called National Classification of Economic Activities (Classificação Nacional de Atividade Econômicas - CNAE). The original texts were pre-processed to obtain the current data set: initially, it was kept only letters and then it was removed prepositions of the texts. Next, the words were transformed to their canonical form. Finally, each document was represented as a vector, where the weight of each word is its frequency in the document. This data set is highly sparse (99.22% of the matrix is filled with zeros).
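The representation described above (one vector per document, one entry per vocabulary word, each entry a word count) can be sketched on a tiny example. The three documents below are hypothetical and are not taken from CNAE-9; the sketch also assumes the text has already been cleaned and reduced to canonical word forms, as the description says was done for the real data.

```python
from collections import Counter

# Hypothetical, already-cleaned toy documents (not from CNAE-9).
toy_docs = [
    "comercio varejista roupa",
    "fabricacao roupa roupa",
    "transporte carga",
]

# Split each document into words.
tokenized = [doc.split() for doc in toy_docs]

# The vocabulary is the sorted set of all words seen in any document.
vocabulary = sorted({word for doc in tokenized for word in doc})

# Each document becomes a vector of word counts, one entry per vocabulary word.
toy_vectors = []
for doc in tokenized:
    counts = Counter(doc)  # missing words count as 0
    toy_vectors.append([counts[word] for word in vocabulary])

print(vocabulary)
for row in toy_vectors:
    print(row)
```

Even in this toy case most entries are zero, because each document uses only a few of the vocabulary words. The same effect, at a much larger vocabulary size, is what makes the real data set so sparse.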
In [1]:
import pandas as pd
import numpy as np
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('ggplot')
import seaborn as sns
In [2]:
%load_ext version_information
%version_information scipy, numpy, pandas, matplotlib, seaborn, version_information
Out[2]:
In [3]:
url = r'https://archive.ics.uci.edu/ml/machine-learning-databases/00233/CNAE-9.data'
count_df = pd.read_csv(url, header=None)
In [4]:
count_df.info()
The result is a dense 1080 × 857 DataFrame: the first column holds the class label and the remaining 856 columns hold the word counts. We extract the first column as the labels and convert the rest to a sparse matrix.
In [5]:
from scipy.sparse import csr_matrix
labels, count_features = count_df.loc[:, 0], count_df.loc[:, 1:]
count_data = csr_matrix(count_features.values)
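As a sanity check on the conversion, the share of zero entries in a sparse matrix can be computed directly. The sketch below is illustrative only: `zero_fraction` is a helper name introduced here (it is not part of SciPy or of the original notebook), and it is demonstrated on a small made-up matrix rather than on the downloaded data.

```python
import numpy as np
from scipy.sparse import csr_matrix


def zero_fraction(matrix):
    """Return the fraction of entries that are zero in a SciPy sparse matrix."""
    n_rows, n_cols = matrix.shape
    return 1.0 - matrix.count_nonzero() / (n_rows * n_cols)


# Made-up 3 x 4 count matrix with 3 non-zero entries out of 12.
toy_dense = np.array([
    [0, 2, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 3],
])
toy_sparse = csr_matrix(toy_dense)

toy_zero_fraction = zero_fraction(toy_sparse)
print(f"{toy_zero_fraction:.2%} of the entries are zero")
```

Calling `zero_fraction(count_data)` on the notebook's matrix would be expected to give a value close to the sparsity quoted in the dataset description, assuming the downloaded file matches the described data.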
In [6]:
count_data
Out[6]:
In [7]:
label_counts = labels.value_counts()
label_counts.plot(kind='bar', rot=0)
Out[7]:
In [8]:
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.linear_model import LogisticRegression
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import StandardScaler
In [9]:
pipeline = Pipeline(
    [
        ('reducer', TruncatedSVD(n_components=100, random_state=1000)),
        ('scaler', StandardScaler(with_mean=False)),
        ('model', LogisticRegression(max_iter=100, random_state=1234,
                                     solver='lbfgs', multi_class='multinomial'))
    ]
)
cv = StratifiedKFold(n_splits=10, random_state=1245, shuffle=True)
predictions = cross_val_predict(pipeline, count_data, labels, cv=cv)
cr = classification_report(labels, predictions)
print(cr)