In pseudo-code the bubble sort algorithm can be written as:
* Take a list of length n.
* For each element i in the list from the first to the last:
    * For each element j in the list from element i+1 to the last:
        * If element i is bigger than element j then swap them.
* The list is now sorted.

We can do a direct translation of this into Python:
In [1]:
def bubblesort(unsorted):
    """
    Sorts a list using the bubble sort algorithm.

    Parameters
    ----------
    unsorted : list
        The unsorted list

    Returns
    -------
    sorted : list
        The sorted list (sorted in place)
    """
    last = len(unsorted)
    # All Python lists start from 0
    for i in range(last):
        for j in range(i+1, last):
            if unsorted[i] > unsorted[j]:
                temp = unsorted[j]
                unsorted[j] = unsorted[i]
                unsorted[i] = temp
    return unsorted
In [2]:
unsorted = [2, 4, 6, 0, 1, 3, 5]
print(bubblesort(unsorted))
We can see the essential features of Python used:
* The function is defined using the keyword def.
* Code blocks are denoted by indentation, after a : is used.
* Comments are either contained in the docstring, between """, or follow #.
* The result is passed back using the return keyword.
* The range function produces a range of integers, usually used to loop over.

Note: there are in-built Python functions to sort lists which should be used in general:
In [3]:
unsorted = [2, 4, 6, 0, 1, 3, 5]
print(sorted(unsorted))
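Note the difference between the two approaches: sorted returns a new list and leaves the original untouched, while the list.sort method sorts in place and returns None.

```python
unsorted = [2, 4, 6, 0, 1, 3, 5]
# sorted returns a new list, leaving the original as it was
new_list = sorted(unsorted)
print(unsorted)  # [2, 4, 6, 0, 1, 3, 5]
print(new_list)  # [0, 1, 2, 3, 4, 5, 6]
# list.sort sorts in place (and returns None)
unsorted.sort()
print(unsorted)  # [0, 1, 2, 3, 4, 5, 6]
```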
Note: there is a "more Pythonic" way of writing the bubble sort function, taking advantage of the feature that Python can assign to multiple things at once. Compare the internals of the loop:
In [4]:
def bubblesort(unsorted):
    """
    Sorts a list using the bubble sort algorithm.

    Parameters
    ----------
    unsorted : list
        The unsorted list

    Returns
    -------
    sorted : list
        The sorted list (sorted in place)
    """
    last = len(unsorted)
    # All Python lists start from 0
    for i in range(last):
        for j in range(i+1, last):
            if unsorted[i] > unsorted[j]:
                unsorted[j], unsorted[i] = unsorted[i], unsorted[j]
    return unsorted
In [5]:
unsorted = [2, 4, 6, 0, 1, 3, 5]
print(bubblesort(unsorted))
This gets rid of the need for a temporary variable.
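The same tuple-assignment idiom works for any pair of variables: the right-hand side is evaluated in full before the assignments happen, so no temporary is needed.

```python
a, b = 1, 2
# Swap the two values in a single statement
a, b = b, a
print(a, b)  # 2 1
```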
Here is a pseudo-code for the counting sort algorithm:
* Take a list of length n.
* Find the minimum value min_value and maximum value max_value of the list.
* Create a list counts that will count the number of entries in the list with each value between min_value and max_value inclusive, and set its entries to zero.
* For each element i in the list from the first to the last: add one to the counts entry whose index matches the value of this element.
* For each entry i in the counts list from the first to the last: write that many entries of value i into the list.
* The list is now sorted.

Translate this into Python. Note that the in-built Python min and max functions can be used on lists. To create a list of the correct size you can use
counts = list(range(min_value, max_value+1))
but this list will not contain zeros so must be reset.
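Alternatively, an equivalent sketch that builds a zero-filled list of the right length in one step, using list repetition:

```python
min_value, max_value = 0, 6
# One entry per possible value between min_value and max_value inclusive
counts = [0] * (max_value - min_value + 1)
print(counts)  # [0, 0, 0, 0, 0, 0, 0]
```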
In [6]:
def countingsort(unsorted):
    """
    Sorts a list using the counting sort algorithm.

    Parameters
    ----------
    unsorted : list
        The unsorted list

    Returns
    -------
    sorted : list
        The sorted list (sorted in place)
    """
    # Allocate the counts array
    min_value = min(unsorted)
    max_value = max(unsorted)
    # This creates a list of the right length, but the entries are not zero, so reset
    counts = list(range(min_value, max_value+1))
    for i in range(len(counts)):
        counts[i] = 0
    # Count the values, shifting by min_value so indices start at zero
    last = len(unsorted)
    for i in range(last):
        counts[unsorted[i] - min_value] += 1
    # Write the items back into the list
    next_index = 0
    for i in range(min_value, max_value+1):
        for j in range(counts[i - min_value]):
            unsorted[next_index] = i
            next_index += 1
    return unsorted
In [7]:
unsorted = [2, 4, 6, 0, 1, 3, 5]
print(countingsort(unsorted))
For the linear programming problem $$ \begin{aligned} \max x_1 + x_2 &= z \\ 2 x_1 + x_2 & \le 4 \\ x_1 + 2 x_2 & \le 3 \end{aligned} $$ where $x_1, x_2 \ge 0$, one standard approach is the simplex method.
Introducing slack variables $s_1, s_2 \ge 0$ the standard tableau form becomes $$ \begin{pmatrix} 1 & -1 & -1 & 0 & 0 \\ 0 & 2 & 1 & 1 & 0 \\ 0 & 1 & 2 & 0 & 1 \end{pmatrix} \begin{pmatrix} z & x_1 & x_2 & s_1 & s_2 \end{pmatrix}^T = \begin{pmatrix} 0 \\ 4 \\ 3 \end{pmatrix}. $$
The simplex method performs row operations to remove all negative numbers from the top row, at each stage choosing the smallest (in magnitude) pivot.
Assume the tableau is given in this standard form. We can use numpy to implement the problem.
In [3]:
import numpy
In [9]:
tableau = numpy.array([ [1, -1, -1, 0, 0, 0],
[0, 2, 1, 1, 0, 4],
[0, 1, 2, 0, 1, 3] ], dtype=numpy.float64)
print(tableau)
To access an entry we use square brackets as with lists:
In [10]:
print(tableau[0, 0])
print(tableau[1, 2])
row = 2
column = 5
print(tableau[row, column])
To access a complete row or column, we use slicing notation:
In [11]:
print(tableau[row, :])
print(tableau[:, column])
To apply the simplex method, we have to remove the negative entries in row 0. These appear in columns 1 and 2. For column 1 the pivot in row 1 has magnitude $|-1/2| = 1/2$ and the pivot in row 2 has magnitude $|-1/1|=1$. So we choose row 1.
To perform the row operation we want to eliminate all entries in column 1 except for the diagonal, which is set to $1$:
In [12]:
column = 1
pivot_row = 1
# Rescale pivot row
tableau[pivot_row, :] /= tableau[pivot_row, column]
# Remove all entries in columns except the pivot
pivot0 = tableau[0, column] / tableau[pivot_row, column]
tableau[0, :] -= pivot0 * tableau[pivot_row, :]
pivot2 = tableau[2, column] / tableau[pivot_row, column]
tableau[2, :] -= pivot2 * tableau[pivot_row, :]
print(tableau)
Now we repeat this on column 2, noting that we can only pivot on row 2:
In [13]:
column = 2
pivot_row = 2
# Rescale pivot row
tableau[pivot_row, :] /= tableau[pivot_row, column]
# Remove all entries in columns except the pivot
pivot0 = tableau[0, column] / tableau[pivot_row, column]
tableau[0, :] -= pivot0 * tableau[pivot_row, :]
pivot1 = tableau[1, column] / tableau[pivot_row, column]
tableau[1, :] -= pivot1 * tableau[pivot_row, :]
print(tableau)
We read off the solution (noting that floating point representations mean we need care interpreting the results): $z = 7/3$ when $x_1 = 5/3$ and $x_2 = 2/3$:
In [14]:
print("z =", tableau[0, -1])
print("x_1 =", tableau[1, -1])
print("x_2 =", tableau[2, -1])
Let's turn that into a function.
In [15]:
def simplex(tableau):
    """
    Assuming a standard form tableau, find the solution.
    """
    nvars = tableau.shape[1] - tableau.shape[0] - 1
    for column in range(1, nvars+2):
        if tableau[0, column] < 0:
            pivot_row = numpy.argmin(numpy.abs(tableau[0, column] /
                                               tableau[1:, column])) + 1
            # Rescale pivot row
            tableau[pivot_row, :] /= tableau[pivot_row, column]
            # Remove all entries in this column except the pivot
            for row in range(0, pivot_row):
                pivot = tableau[row, column] / tableau[pivot_row, column]
                tableau[row, :] -= pivot * tableau[pivot_row, :]
            for row in range(pivot_row+1, tableau.shape[0]):
                pivot = tableau[row, column] / tableau[pivot_row, column]
                tableau[row, :] -= pivot * tableau[pivot_row, :]
    z = tableau[0, -1]
    x = tableau[1:nvars+1, -1]
    return z, x
In [16]:
tableau = numpy.array([ [1, -1, -1, 0, 0, 0],
[0, 2, 1, 1, 0, 4],
[0, 1, 2, 0, 1, 3] ], dtype=numpy.float64)
z, x = simplex(tableau)
print("z =", z)
print("x =", x)
Once the problem is phrased in the tableau form the short simplex function solves it without problem. However, for large problems, we don't want to type in the matrix by hand. Instead we want a way of keeping track of the objective function to maximize, and the constraints, and make the computer do all the work.
To do that we'll introduce classes. In VBA a class is a special module, and you access its variables and methods using dot notation. For example, if Student is a class, which has a variable Name, and s1 is a Student object, then s1.Name is the name associated with that particular instance of student.
The same approach is used in Python:
In [17]:
class Student(object):

    def __init__(self, name):
        self.name = name

    def print_name(self):
        print("Hello", self.name)
In [18]:
s1 = Student("Christine Carpenter")
print(s1.name)
s2 = Student("Jörg Fliege")
s2.print_name()
See how this compares to VBA.
* The class keyword is used to start the definition of the class.
* The class name (Student) is given. It follows similar rules and conventions to variable names, but typically is capitalized.
* The name in brackets (object) is what the class inherits from. Here we use the default (object).
* Methods of the class take the instance, self, as their first argument. Methods are called using <instance>.<method>.
* When an instance is created (s1 = Student(...)) the __init__ method is called if it exists. We can use this to set up the instance.

There are a number of special methods that can be defined that work with Python operations. For example, suppose we printed the instances above:
In [19]:
print(s1)
print(s2)
This isn't very informative. However, we can define the string representation of our class using the __repr__ method:
In [20]:
class Student(object):

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name
In [21]:
s1 = Student("Christine Carpenter")
s2 = Student("Jörg Fliege")
print(s1)
print(s2)
We can also define what it means to add two instances of our class:
In [22]:
class Student(object):

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return self.name

    def __add__(self, other):
        return Student(self.name + " and " + other.name)
In [23]:
s1 = Student("Christine Carpenter")
s2 = Student("Jörg Fliege")
print(s1 + s2)
Going back to the simplex method, we want to define a class that contains the objective function and the constraints, a method to solve the problem, and a representation of the problem and solution.
In [24]:
class Constraint(object):

    def __init__(self, coefficients, value):
        self.coefficients = numpy.array(coefficients)
        self.value = value

    def __repr__(self):
        string = ""
        for i in range(len(self.coefficients)-1):
            string += str(self.coefficients[i]) + " x_{}".format(i+1) + " + "
        string += str(self.coefficients[-1]) + " x_{}".format(len(self.coefficients))
        string += " \le "
        string += str(self.value)
        return string
In [25]:
c1 = Constraint([2, 1], 4)
c2 = Constraint([1, 2], 3)
print(c1)
print(c2)
In [26]:
class Linearprog(object):

    def __init__(self, objective, constraints):
        self.objective = numpy.array(objective)
        self.nvars = len(self.objective)
        self.constraints = constraints
        self.nconstraints = len(self.constraints)
        self.tableau = numpy.zeros((1+self.nconstraints, 2+self.nvars+self.nconstraints))
        self.tableau[0, 0] = 1.0
        self.tableau[0, 1:1+self.nvars] = -self.objective
        for nc, c in enumerate(self.constraints):
            self.tableau[1+nc, 1:1+self.nvars] = c.coefficients
            self.tableau[1+nc, 1+self.nvars+nc] = 1.0
            self.tableau[1+nc, -1] = c.value
        self.z, self.x = self.simplex()

    def simplex(self):
        for column in range(1, self.nvars+2):
            if self.tableau[0, column] < 0:
                pivot_row = numpy.argmin(numpy.abs(self.tableau[0, column] /
                                                   self.tableau[1:, column])) + 1
                # Rescale pivot row
                self.tableau[pivot_row, :] /= self.tableau[pivot_row, column]
                # Remove all entries in this column except the pivot
                for row in range(0, pivot_row):
                    pivot = self.tableau[row, column] / self.tableau[pivot_row, column]
                    self.tableau[row, :] -= pivot * self.tableau[pivot_row, :]
                for row in range(pivot_row+1, self.tableau.shape[0]):
                    pivot = self.tableau[row, column] / self.tableau[pivot_row, column]
                    self.tableau[row, :] -= pivot * self.tableau[pivot_row, :]
        z = self.tableau[0, -1]
        x = self.tableau[1:self.nvars+1, -1]
        return z, x

    def __repr__(self):
        string = "max "
        for i in range(len(self.objective)-1):
            string += str(self.objective[i]) + " x_{}".format(i+1) + " + "
        string += str(self.objective[-1]) + " x_{}".format(len(self.objective))
        string += "\n\nwith constraints\n"
        for c in self.constraints:
            string += "\n"
            string += c.__repr__()
        string += "\n\n"
        string += "Solution has objective function maximum of " + str(self.z)
        string += "\n\n"
        string += "at location x = " + str(self.x)
        return string
In [27]:
problem = Linearprog([1, 1], [c1, c2])
print(problem)
The main advantage of using Python is the range of libraries you can use to solve your problems more efficiently. For linear programming there is the pulp library, which is a Python wrapper to efficient low-level libraries such as GLPK. It's worth noting that pulp provides high-level access to leading proprietary libraries like CPLEX, but doesn't provide the binaries or the licences. By default pulp uses CBC, which is considerably slower: consult your supervisor as to what's suitable for your work.
There are a range of examples that you can look at, but we'll quickly revisit the example above. The approach is to use a lot of pulp defined classes, which are hopefully fairly transparent:
In [4]:
import pulp
In [29]:
problem = pulp.LpProblem("Simple problem", pulp.LpMaximize)
This gives a "meaningful" title to the problem and says if we're going to maximize or minimize.
In [30]:
x1 = pulp.LpVariable("x_1", lowBound=0, upBound=None, cat='Continuous')
x2 = pulp.LpVariable("x_2", lowBound=0, upBound=None, cat='Continuous')
Defining the variables again gives them "meaningful" names, and specifies their lower and upper bounds, and whether the variable type is continuous or integer. We could ignore the latter two definitions as they take their default values.
The first thing to do now is to define the objective function by "adding" it to the problem:
In [31]:
objective = x1 + x2, "Objective function to maximize"
problem += objective
Again we have given a "meaningful" name to the objective function we're maximizing.
Next we can create constraints and add them to the problem.
In [32]:
c1 = 2 * x1 + x2 <= 4, "First constraint"
c2 = x1 + 2 * x2 <= 3, "Second constraint"
problem += c1
problem += c2
If you want to save the problem at this stage, you can use problem.writeLP(<filename>), where the .lp extension is normally used.
To solve the problem, we just call
In [33]:
problem.solve()
Out[33]:
The returned 1 just means the solver ran: it does not say whether it succeeded! We need to print the status:
In [34]:
print("Status:", pulp.LpStatus[problem.status])
As it's found a solution, we can print the objective function and the variables:
In [35]:
print("Maximized objective function = ", pulp.value(problem.objective))
for v in problem.variables():
print(v.name, "=", v.varValue)
Using pulp is far easier and robust than coding our own, and will cover a much wider range of problems.
Try using pulp to implement the following optimisation problem.
Whiskas want to make their cat food out of just two ingredients: chicken and beef. These ingredients must be blended such that they meet the nutritional requirements for the food whilst minimising costs. The costs of chicken and beef are \$0.013 and \$0.008 per gram, and their nutritional contributions per gram are:
| Stuff | Protein | Fat | Fibre | Salt |
|---|---|---|---|---|
| Chicken | 0.100 | 0.080 | 0.001 | 0.002 |
| Beef | 0.200 | 0.100 | 0.005 | 0.005 |
Let's define our decision variables:
$$x_1 = \text{percentage of chicken in can of cat food}$$
$$x_2 = \text{percentage of beef in can of cat food}$$

As these are percentages, we know that both must be $0 \leq x \leq 100$ and that they must sum to 100. The objective function to minimise costs is

$$\min 0.013 x_1 + 0.008 x_2$$

The constraints (that the variables must sum to 100 and that the nutritional requirements are met) are:

$$1.000 x_1 + 1.000 x_2 = 100.0$$
$$0.100 x_1 + 0.200 x_2 \ge 8.0$$
$$0.080 x_1 + 0.100 x_2 \ge 6.0$$
$$0.001 x_1 + 0.005 x_2 \le 2.0$$
$$0.002 x_1 + 0.005 x_2 \le 0.4$$

This problem was taken from the pulp documentation - you can find the solution there.
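Following the pattern from the worked example above, a sketch of the pulp setup might look like this (the problem, variable, and constraint names here are our own choices):

```python
import pulp

# Minimise cost this time, rather than maximise
problem = pulp.LpProblem("Whiskas", pulp.LpMinimize)

# Percentages of chicken and beef in a can
x1 = pulp.LpVariable("x_1", lowBound=0, upBound=100, cat='Continuous')
x2 = pulp.LpVariable("x_2", lowBound=0, upBound=100, cat='Continuous')

# Objective function: the cost per can
problem += 0.013 * x1 + 0.008 * x2, "Objective function to minimize"

# Constraints: percentages sum to 100, nutritional requirements met
problem += x1 + x2 == 100.0, "Percentages sum"
problem += 0.100 * x1 + 0.200 * x2 >= 8.0, "Protein requirement"
problem += 0.080 * x1 + 0.100 * x2 >= 6.0, "Fat requirement"
problem += 0.001 * x1 + 0.005 * x2 <= 2.0, "Fibre requirement"
problem += 0.002 * x1 + 0.005 * x2 <= 0.4, "Salt requirement"

problem.solve()
print("Status:", pulp.LpStatus[problem.status])
print("Minimized cost =", pulp.value(problem.objective))
for v in problem.variables():
    print(v.name, "=", v.varValue)
```

The salt constraint turns out to be the binding one, giving roughly one third chicken to two thirds beef.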
There's a number of projects using pulp out there - one for people interested in scheduling is Conference Scheduler which works out when to put talks on, given constraints.
One type of optimization problem deals with queues. As an example problem we'll take the Unilink bus service U1C from the Airport into the centre and ask: at busy times, how many people will not be able to get on the bus, and at what stops?
If we use a fixed set of customers and want to simulate the events in time, this is an example of a discrete event model. An example Python Discrete Event simulator is ciw, which has a detailed set of tutorials.
If we want to provide a random set of customers, to see what range of problems we may have, this is an example of Monte Carlo simulation. This can be done using standard Python random number generators, built into (for example) numpy and scipy.
We will only consider the main stops:
In [36]:
bus_stops = ["Airport Parkway Station",
"Wessex Lane",
"Highfield Interchange",
"Portswood Broadway",
"The Avenue Archers Road",
"Civic Centre",
"Central Station",
"West Quay",
"Town Quay",
"NOCS"]
We will assume that the bus capacity is $85$ people, that $250$ people want to travel, that they are distributed at the $10$ stops following a discrete random distribution, and each wants to travel a number of stops that also follows a discrete random distribution (distributed between $1$ and the maximum number of stops they could travel).
There are smarter ways of doing it than this, I'm sure:
In [37]:
import numpy
In [38]:
capacity = 85
n_people = 250
total_stops = len(bus_stops)
# No-one boards at the final stop, so initial stops run over the first total_stops-1 stops
initial_stops = numpy.random.randint(0, total_stops-1, n_people)
n_stops = numpy.zeros_like(initial_stops)
n_onboard = numpy.zeros((total_stops,), dtype=int)
n_left_behind = numpy.zeros_like(n_onboard)
for i in range(total_stops):
    if i == total_stops - 1:  # Can only take one stop
        n_stops[initial_stops == i] = 1
    else:
        n_people_at_stop = len(initial_stops[initial_stops == i])
        n_stops[initial_stops == i] = numpy.random.randint(1, total_stops-i, n_people_at_stop)
for i in range(total_stops):
    n_people_at_stop = len(initial_stops[initial_stops == i])
    n_people_getting_on = max([0, min([n_people_at_stop, capacity - n_onboard[i]])])
    n_left_behind[i] = max([n_people_at_stop - n_people_getting_on, 0])
    for fill_stops in n_stops[initial_stops == i][:n_people_getting_on]:
        n_onboard[i:i+fill_stops] += 1
In [39]:
print(n_left_behind)
In [40]:
print(n_onboard)
And now that we know how to do it once, we can do it many times:
In [41]:
def mc_unilink(n_people, n_runs=10000):
    """
    Given n_people wanting to ride the U1, use Monte Carlo to see how many are left behind on average at each stop.

    Parameters
    ----------
    n_people : int
        Total number of people wanting to use the bus
    n_runs : int
        Number of realizations

    Returns
    -------
    n_left_behind_average : array of float
        Average number of people left behind at each stop
    """
    bus_stops = ["Airport Parkway Station",
                 "Wessex Lane",
                 "Highfield Interchange",
                 "Portswood Broadway",
                 "The Avenue Archers Road",
                 "Civic Centre",
                 "Central Station",
                 "West Quay",
                 "Town Quay",
                 "NOCS"]
    total_stops = len(bus_stops)
    capacity = 85
    n_left_behind = numpy.zeros((total_stops, n_runs), dtype=int)
    for run in range(n_runs):
        initial_stops = numpy.random.randint(0, total_stops-1, n_people)
        n_stops = numpy.zeros_like(initial_stops)
        n_onboard = numpy.zeros((total_stops,), dtype=int)
        for i in range(total_stops):
            if i == total_stops - 1:  # Can only take one stop
                n_stops[initial_stops == i] = 1
            else:
                n_people_at_stop = len(initial_stops[initial_stops == i])
                n_stops[initial_stops == i] = numpy.random.randint(1, total_stops-i, n_people_at_stop)
        for i in range(total_stops):
            n_people_at_stop = len(initial_stops[initial_stops == i])
            n_people_getting_on = max([0, min([n_people_at_stop, capacity - n_onboard[i]])])
            n_left_behind[i, run] = max([n_people_at_stop - n_people_getting_on, 0])
            for fill_stops in n_stops[initial_stops == i][:n_people_getting_on]:
                n_onboard[i:i+fill_stops] += 1
    return numpy.mean(n_left_behind, axis=1)
In [42]:
n_left_behind_average = mc_unilink(250, 10000)
In [43]:
n_left_behind_average
Out[43]:
We see that, as expected, it's the stops in the middle that fare worst. We can easily plot this:
In [44]:
%matplotlib inline
from matplotlib import pyplot
x = list(range(len(n_left_behind_average)))
pyplot.bar(x, n_left_behind_average)
pyplot.xticks(x, bus_stops, rotation='vertical')
pyplot.ylabel("Average # passengers unable to board")
pyplot.show()
Take a look at the discrete distributions in scipy.stats and work out how to improve this model.
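As a starting sketch, the fixed total of 250 people spread uniformly could be replaced by a random number of arrivals at each stop drawn from a discrete distribution; the Poisson mean of 25 per stop here is purely an illustrative assumption:

```python
from scipy import stats

# Assume arrivals at each of the 10 stops are Poisson distributed,
# with an (assumed, illustrative) mean of 25 people per stop
arrivals = stats.poisson(mu=25)
people_per_stop = arrivals.rvs(size=10)
print(people_per_stop)
print("Total passengers:", people_per_stop.sum())
```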
If you want to get a computer to classify a large dataset for you, or to "learn", then packages for Machine Learning, or Neural Networks, or Deep Learning, are the place to go. This field is very much a moving target, but the Python scikit-learn library has a lot of very useful tools that can be used as a starting point.
In this example we'll focus on classification: given a dataset that's known to fall into fixed groups, develop a model that predicts from the data what group a new data point falls within.
As a concrete example we'll use the standard Iris data set again, which we can get from GitHub. We used this with pandas, and we can use that route to get the data in.
In [5]:
import numpy
import pandas
import sklearn
iris = pandas.read_csv('https://raw.githubusercontent.com/pandas-dev/pandas/master/pandas/tests/data/iris.csv')
A quick reminder of what the dataset contains:
In [46]:
iris.head()
Out[46]:
There are different types of iris, classified by the Name. Each individual flower observed has four measurements, given by the data. We want to use some of the data (the Sepal Length and Width, and the Petal Length and Width) to construct a model. The model will take an observation - these four numbers - and predict which flower type we have. We'll use the rest of the data to check how accurate our model is.
First let's look at how many different types of flower there are:
In [47]:
iris['Name'].unique()
Out[47]:
So we're trying to choose one of three types.
The range of values can be summarized:
In [48]:
iris.describe()
Out[48]:
There's 150 observations, with a reasonable range of values.
So, let's split the dataframe into its data and its labels. What we're wanting to do here is predict the label (the type, or Name, of the Iris observed) from the data (the measurements of the sepal and petal).
In [49]:
labels = iris['Name']
data = iris.drop('Name', axis=1)
We then want to split our data, and associated labels, into a training set (where we tell the classifier what the answer is) and a testing set (to check the accuracy of the model):
In [50]:
from sklearn.model_selection import train_test_split
In [51]:
data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size = 0.5)
Here we have split the data set in two: 50% is in the training set, and 50% in the testing set.
We can now use a classification algorithm. To start, we will use a decision tree algorithm:
In [52]:
from sklearn import tree
classifier = tree.DecisionTreeClassifier()
classifier.fit(data_train, labels_train)
Out[52]:
We now have a model: given data, it will return its prediction for the label. We use the testing data to check the model:
In [53]:
print(labels_test)
print(classifier.predict(data_test))
We see from the first couple of entries that it's done ok, but that there are errors. As estimating the accuracy of a classification by comparing to test data is so standard, there's a function for that:
In [54]:
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(labels_test, classifier.predict(data_test))
print("Decision Tree Accuracy with 50/50: {}".format(accuracy))
So the result is very accurate on this simple dataset.
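Other classifiers follow the same fit/predict pattern. For instance, a k-nearest-neighbours sketch, shown self-contained here using the copy of the iris data bundled with sklearn.datasets:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# Load the bundled iris data: 150 observations, 4 measurements, 3 types
iris = load_iris()
data_train, data_test, labels_train, labels_test = train_test_split(
    iris.data, iris.target, test_size=0.5)

# Classify by majority vote among the 3 nearest training observations
classifier = KNeighborsClassifier(n_neighbors=3)
classifier.fit(data_train, labels_train)
accuracy = accuracy_score(labels_test, classifier.predict(data_test))
print("KNN Accuracy with 50/50: {}".format(accuracy))
```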
As an exercise, try a different classifier - KNeighborsClassifier, for example. You could also try a different dataset: look at sklearn.datasets for possibilities, such as the digits dataset which does handwriting recognition.

Here's a worked solution for the music data classifier. We'll start by importing the libraries and data we need.
In [9]:
from sklearn.model_selection import train_test_split
from sklearn import tree
from sklearn.metrics import accuracy_score
In [7]:
dfs = {'indie': pandas.read_csv('spotify_data/indie.csv'),
       'pop': pandas.read_csv('spotify_data/pop.csv'),
       'country': pandas.read_csv('spotify_data/country.csv'),
       'metal': pandas.read_csv('spotify_data/metal.csv'),
       'house': pandas.read_csv('spotify_data/house.csv'),
       'rap': pandas.read_csv('spotify_data/rap.csv')}
In [67]:
# label each dataframe with its genre, then combine them
for genre, df in dfs.items():
    df['genre'] = genre
dat = pandas.concat(dfs.values())
# define a list of the fields we want to use to train our classifier
columns = ['duration_ms', 'explicit', 'popularity', 'acousticness', 'danceability',
'energy', 'instrumentalness', 'key', 'liveness', 'loudness',
'mode', 'speechiness', 'tempo', 'time_signature', 'valence', 'genre']
# define data as all columns but the genre column
data = dat[columns].drop('genre', axis=1)
# define labels as the genre column
labels = dat[columns].genre
# split the data into a training set and a testing set
data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size = 0.3)
# create the classifier
classifier = tree.DecisionTreeClassifier()
# train the classifier using the training data
classifier.fit(data_train, labels_train)
Out[67]:
In [68]:
# calculate the accuracy of the classifier using the testing data
accuracy = accuracy_score(labels_test, classifier.predict(data_test))
print("Decision Tree Accuracy with 70/30: {}".format(accuracy))
The accuracy of this classifier is not great. As the train_test_split function randomly selects its training and test data, the accuracy will change every time you run it, but it tends to be 60-70%. Let's try excluding the pop data.
In [64]:
nopop_dat = dat[dat.genre != 'pop']
# define data as all columns but the genre column
data = nopop_dat[columns].drop('genre', axis=1)
# define labels as the genre column
labels = nopop_dat[columns].genre
data_train, data_test, labels_train, labels_test = train_test_split(data, labels, test_size = 0.1)
classifier = tree.DecisionTreeClassifier()
classifier.fit(data_train, labels_train)
accuracy = accuracy_score(labels_test, classifier.predict(data_test))
print("Decision Tree Accuracy with 90/10: {}".format(accuracy))
This classifier is now quite a bit more accurate - 75-85%. Generally, the accuracy of your classifier should improve the fewer categories it has to classify.
The scikit-learn documentation is very detailed, provided you have some level of understanding of Machine Learning to start with. Sebastian Raschka has a book specifically on Python and Machine Learning. There are other tutorials online that you may find useful.