Introduction

Ši užrašų knygutė yra skirta gretutinių studijų analizei. Čia nagrinėjama gretutinių studijų pasirinkimo dinamika, ieškoma dėsningumu.


In [2]:
from IPython.display import Image
Image('http://www6.cityu.edu.hk/projectflame/images/indexikon/withHat.png',width=100)


Out[2]:

In [3]:
Image('http://morphocode.com/wp-content/uploads/2013/07/krzywinski-hiveplot-poster.jpg',width=50)


Out[3]:

Įvadas

Tai yra gretutinės studijos. Sunku nustatyti, kada įvyko pirmieji gretutinių studijų atvejai. Tačiau, gretutinės studijos yra pasaulio mastu atpažįstama akademinės kultūros dalis. Gretutinės studijos pastebimos ir Lietuvos aukštojo mokslo institucijoje. Čia bus nagrinėjamas VDU gretutinių studijų reiškinys.

Teorija

Naudosime grafas. (V,E), kur V - viršūnių rinkinys, o E kraštinių rinkinys. $$ V =\{v_1,..., v_5\} $$ $$ E = \{(v_1, v_2),(v_2, v_5),(v_5, v_5),(v_5, v_4),(v_5, v_4)\}$$ Supaprastinimo vardan naudosimetokį žymėjimą: $_p,_q$ and $_r$. Tarkime, kad mūsų viršūnės yra trys studijų programos: teisė, politikos mokslai ir istorija.

p - 'teise', q - 'politikos_mokslai' and r - 'istorija'

Svorio skaičiavimas:

$$D - svoris$$$$D_{(p,q)}$$

Mūsų atveju kraštinės, jungančios du viršūnes svorys yra lygus.

p - pagrindinė studijų programa

q - gretutinė studijų programa

n - bet kokia gretutinė programa

D - kraštinės svoris

$H_x$ - visi galėję rinktis studijų programą x.

$S_{x,m,k}$ - studentų, įstojusių į studijų programą x, m metais ir gretutines pasirinkusių k kurse, skaičius.

$m_i$, - metai, kuriuose pasirinktos gretutinės studijos.

$$H_x = S_{x, 2014, 4} + S_{x, 2014, 3} + S_{x, 2014, 2} + S_{x, 2014, 1} + S_{x, 2013, 3} + S_{x, 2013, 2} + S_{x, 2013, 1} + S_{x, 2012, 2} + S_{x, 2012, 2} + S_{x, 2011, 1}$$$$H_{x} = \sum {S_{x, x, m-2010}}$$$$H_{2014} = \sum {S_{2014, 2014, (m-2010)}}$$$$D_{(p,q)}= \frac{\sum_{(p,q)}} {\sum_{(p,n)}} \cdot \frac {1}{H_p}$$

Kaip atrasti dominuojantį svorį? $D_{(p,q)}$ yra dominuojantis, jeigu $\frac{D_{(p,q)}} {D_{(q,p)}} > 0$

Svorių suvienodinimas, jei $D_{(p,q)} > 0$ ir $D_{(q,p)} > 0$ ir $D_{(p,q)} \neq D_{(q,p)}$.

*Apie trijų narių ryšius. Tarkime, jog turime $D_{(p,q)}$, $D_{(p,r)}$ ir $D_{(p,q)}$
jei $D_{(p,q)} > D_{(p,r)}$ tai koks $D_{(p,r)}$? $$(x,y)$$

$(x)^{(-1)}$

$m_i$, $m_3$

1 The data


In [4]:
import os
import glob
import re
import codecs
import pickle

In [5]:
f = open("Data/edgelist.p","rb")
edgelist = pickle.loads(f.read())

f = open("Data/nodelist.p","rb")
nodelist = pickle.loads(f.read())

In [6]:
len(edgelist)


Out[6]:
698

2 DATAFRAMES

2.1 EDGE DATAFRAME

Kuriame gretutinių studijų pasirinkimo dataframe. Šie duomenys gauti iš VDU. Šis dataframe yra pats svarbiausias mūsų tyrimui. Sukūrę jį mes jį pildysime naudodami duomenis iš kitų šaltinių. Taip pat, jį naudosime konstruodami grafą.


In [7]:
import pandas as pd
# Pandas settings
pd.options.display.mpl_style = 'default'
pd.set_option('display.max_columns', 10)
pd.set_option('display.max_rows', 7)
pd.set_option('display.width', 10)

# Loading DFs
dfe = pd.read_pickle('data/dfe.pkl')
dfn = pd.read_pickle('data/dfn.pkl')
dff = pd.read_pickle('data/dff.pkl')

# Saving DFs
import time
timestr = time.strftime("%Y-%m-%d_%H-%M")
timestrless = time.strftime("%Y-%m-%d")
#dfe.to_pickle('data/dfe.pkl')
#dfn.to_pickle('data/dfn.pkl')
#dff.to_pickle('data/dff.pkl')

#dfe.to_pickle('data/dfe{}.pkl'.format(timestrless))
#dfn.to_pickle('data/dfn{}.pkl'.format(timestrless))
#dff.to_pickle('data/dff{}.pkl'.format(timestrless))

In [8]:
import json

json_data=open(r'json/get_faculty_name.json')
get_faculty_name = json.load(json_data)
json_data.close()

%matplotlib inline
import matplotlib.pyplot as plt 

fak_colors = {'evf':'#659768', 'gmf':'#b2ce69', 'hmf':'#e6c70b', 'if':'#8cbac2', 'ktf':'#904e98',
              'mf':'#ce8d37', 'pmdf':'#58636c', 'smf':'#4667a9', 'tf':'#2a2f58', 'ma':'#00ceb9'}

fak_colors1 = {'evf':'r', 'gmf':'h', 'hmf':'b', 'if':'c', 'ktf':'k', 
               'mf':'r', 'pmdf':'g', 'smf':'b', 'tf':'c', 'ma':'y'}

programos_fakulteto_spalva = {}

def gauk_programos_fakulteto_spalva(prog):
    return fak_colors[get_faculty_name[prog]] 

facultylist = set(dfe.pagrfak.unique().tolist() + dfe.gretfak.unique().tolist())
proglist = set(dfe.pagr.unique().tolist() + dfe.gret.unique().tolist())

for x in proglist:
    programos_fakulteto_spalva[x] = gauk_programos_fakulteto_spalva(x)

Kaip jau minėjome, mūsų pagrindinė informacija išsaugota dviejuose DF. Viename DF išsaugota gretutinių studijų pasirinkimo informacija, o kitame - studentų stojimo duomenys.

  1. Šioje dalyje, prie DFE pridėsime, kiek studentų rinkosi pagrindinę studijų programą.
  2. Palyginsime su skaičiumi studentų, kurie rinkosi gretutines studijas.
Bet kaip dėl tikrųjų svorių?

Kaip apskaiciuojamas grafo kraštinių svoris? 2014 pavasario semestro gretutinių studijų pasirinkimo svorį skaičiuosime toliau pateiktą forumulę.


In [9]:
# This function gets possible students of this year.
def get_number_of_current_gret_of_prog(PROG):  # Returns
    retlist = []
    for x in range(1, 9):
        dft = dfe[dfe['studsem'] == x]
        dft = dft[dft['studmet'] >= x]
        dft = dft[dft['pagr'] == PROG]
        retlist.append(len(dft))
    return sum(retlist)

def gauk_skaiciu_pagal_metus_pagr_prog(PROG, df): # Paduodame programos pavadinimą, gauname, kiek i programą istojo per paskutinius keturiu metus.
    result = 0
    yeardict = {2010:'m10', 
                2011:'m11',
                2012:'m12', 
                2013:'m13'}
    stoje_pasirinkegret = df[df[u'pavadinimas'].isin([PROG])]
    res = stoje_pasirinkegret[[yeardict[2010],
                               yeardict[2011],
                               yeardict[2012],
                               yeardict[2013]]]
    
    try:
        result = int(res.values.sum(axis=1))
        return result
    except:
        return result

# Dictionary that gets number of current students and year
mydict = {}
for x in proglist:
    ret1 = gauk_skaiciu_pagal_metus_pagr_prog(x, dfn)
    ret2 = get_number_of_current_gret_of_prog(x)
    ret2 = float(ret2)
    try:
        ret1 = float(ret1)
    except:
        pass
    mydict[x] = [ret2, ret1]

# In this part we get something, by diving to values we got in previous cell
tikrasis_gret_pop = [[x, mydict[x][0] / mydict[x][1]]
                     for x in proglist 
                     if (type(mydict[x][1]) == float) 
                     and (mydict[x][1] != 0)]

dft = pd.DataFrame([x[1] for x in tikrasis_gret_pop],
                   [x[0] for x in tikrasis_gret_pop],
                   columns=['santykis'])

dft['visostojo'] = [mydict[x][1] for x in dft.index]
dft.add = [x in range(20)]

TTL = 'Koks procentas studentu renkasi gretutines(is siuo metu studijuojanciu)'
dft['visostojo'] = list([mydict[x][1] for x in dft.index])

# FOR GRET GRET GRET
def get_number_of_current_students_getting_the_GRET_of_the_prog(PROG):
    retlist = []
    for x in range(1, 9):
        dft = dfe[dfe[u'studsem'] == x]
        dft = dft[dft[u'gretpradz'] >= x]
        dft = dft[dft[u'gret'] == PROG]
        ret = len(dft)
        retlist.append(ret)
    return sum(retlist)

# Kiek studentų šiuo metu turi šias gretutines.

dfn.set_index(dfn.pavadinimas.values, inplace=True)

dfn = pd.concat([dfn,dft],axis=1)
dfe[dfe.loc[:,'vardas'].str.contains(u"yte |aite |iute |iene | ytė |aitė |iutė |ienė")]['gim'] = 'mot'

In [10]:
dfe[dfe.loc[:,'vardas'].str.contains(u"yte |aite |iute |iene | ytė |aitė |iutė |ienė")]['gim'] = 'mot'

In [11]:
import numpy as np
df = []
df = dfn.loc[:,['santykis','visostojo']]
df = df[np.isfinite(df['santykis'])]
of_minor = [get_number_of_current_students_getting_the_GRET_of_the_prog(x)
            for x in df.index]
df


Out[11]:
santykis visostojo
anglu_filologija 0.032070 343
anglu_ir_vokieciu_filologija 0.014706 68
aplinkotyra_ir_ekologija 0.003937 254
... ... ...
viesasis_administravimas 0.002075 482
viesoji_komunikacija 0.002837 705
vokieciu_filologija 0.000000 16

38 rows × 2 columns


In [12]:
fig = plt.figure(num=None,
                 figsize=(5, 5),
                 dpi=80,
                 facecolor='w',
                 edgecolor='k')

labels = df.index
ax = fig.add_subplot(111)
ax.set_title(u'Išeinantys prieš ateinančius')
ax.set_xlabel(u'Pasirinke šią gretutinių studijų programą')
ax.set_ylabel(u'Šios studijų programos skaičius')
plt.scatter(of_minor, 
            df.santykis, 
            s=(df.visostojo * 3), 
            c=[programos_fakulteto_spalva[x] for x in df.index],
            alpha=0.7)

for label, x, y in zip(labels, of_minor, df.santykis):
    plt.annotate(
        label, 
        xy = (x, y), xytext = (-10, 10),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.3', fc = 'grey', alpha = 0.1),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

plt.show()



In [13]:
import mpld3
fig = plt.figure(num=None,
                 figsize=(5, 5),
                 dpi=80,
                 facecolor='w',
                 edgecolor='k')

scatter = plt.scatter(of_minor, 
                      df.santykis, 
                      s=(df.visostojo * 3), 
                      c=[programos_fakulteto_spalva[x] for x in df.index],
                      alpha=0.7)

ax.grid(color='white', linestyle='solid')

#for label, x, y in zip(labels, of_minor, srtd.santykis):
#    plt.annotate(
#        label, 
#        xy = (x, y), xytext = (-10, 10),
#        textcoords = 'offset points', ha = 'right', va = 'bottom',
#        bbox = dict(boxstyle = 'round,pad=0.3', fc = 'grey', alpha = 0.1),
#        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
ax.set_title(u'Išeinantys prieš ateinančius')
ax.set_xlabel(u'Pasirinke šią gretutinių studijų programą')
ax.set_ylabel(u'Šios studijų programos skaičius')

labels = [x.decode('utf-8') for x in df.index]
tooltip = mpld3.plugins.PointLabelTooltip(scatter,labels=labels)  # labels=labels
mpld3.plugins.connect(fig, tooltip)

mpld3.display()


Out[13]:

3. Faculty level analysis

Now, when we have our dataframes ready, we can dive deeper into the analysis. First, let's see what we can do. First, lets begin by analysing faculty level. But before we start it we have to recognise some things. First, it is not really abotu faculty level, it is about the reduction to faculty level. The data will include all individual cases. At this moment, our focus is on faculty level, we will dive deeper, of course, but lets wait for a bit. So, lets look how we filter


In [14]:
# This shows ho we filter to entries of one faculty. In this case we look at the source of academic minor
# dfe[dfe['pagrfak'].isin(['hmf'])]

In [15]:
# This is another look, we see all cases when someone chooses an entry in other faculty.
#dfe[dfe['gretfak'].isin(['smf'])]

In [16]:
# dataframe for faculties

rez1 = [len(dfe[dfe['pagrfak'].isin([x])]) for x in facultylist]  # Gretutines pasirinkusių fakulto studentų skaičius.
rez2 = [len(dfe[dfe['pagrfak'].isin([x]) & ~dfe['gretfak'].isin([x])]) for x in facultylist]  # Gretutines fakulteto išorėje pasirinkusių studentų skaičius.
rez3 = [len(dfe[dfe['pagrfak'].isin([x]) & dfe['gretfak'].isin([x])]) for x in facultylist]
#rez3_1 = [rez1[x] - rez2[x] for x in range(len(rez1))]  # Gretutines tame pačiame fakultete pasirinkusių studentų skaičius. # rez3 == rez3_1

rez4 = [len(dfe[dfe['gretfak'].isin([x])]) for x in facultylist] 
rez5 = [len(dfe[dfe['gretfak'].isin([x]) & ~dfe['pagrfak'].isin([x])]) for x in facultylist]

rex = [rez1, rez2, rez3, rez4, rez5]
headers = ['is_fak','is_fak_out','to_same_fak','i_fak','i_fak_is_isores']
dft = pd.DataFrame(rex, index=headers, columns=facultylist)
dff = dft.T


# Sudėti šiuos du 

dft1 = pd.DataFrame([rez2,rez3], index=['is_fak_out','to_same_fak'], columns=facultylist)
dft2 = dft1.T
dft2['procentas'] = (dft2.is_fak_out / dft2.to_same_fak) * 10
dft2 = dft2.drop('procentas', 1)
dft2['programu_fakultete'] = dfn.groupby(['fakultetas']).size()
dft2['programu_NEfakultete'] = len(proglist) - dfn.groupby(['fakultetas']).size()
dft2['TSF_atsizvelgus_i_PF'] = (dft2.is_fak_out / dft2.programu_NEfakultete)
dft2['IFO_atsizvelgus_i_PF'] = (dft2.to_same_fak / (dft2.programu_fakultete - 1)) # nes negali i ta pacia
dft2['TSF_IFO_procentas'] = dft2.TSF_atsizvelgus_i_PF / dft2.IFO_atsizvelgus_i_PF * 50

srtd = dft2.sort('TSF_IFO_procentas',ascending=True)
srtd.head()


Out[16]:
is_fak_out to_same_fak programu_fakultete programu_NEfakultete TSF_atsizvelgus_i_PF IFO_atsizvelgus_i_PF TSF_IFO_procentas
tf 81 0 1 60 1.350000 inf 0.000000
smf 46 42 6 55 0.836364 8.400000 4.978355
evf 18 9 5 56 0.321429 2.250000 7.142857
gmf 33 23 7 54 0.611111 3.833333 7.971014
if 17 3 3 58 0.293103 1.500000 9.770115
Preparation for plotting.

In [17]:
# Ar palikti fakultetus, kurie turi labai mažai studentų? Manau reikia atsisakyti MA ir TF.
# Kaip pridėti procentinę liniją?

# Šiuos duomenis galime patikslinti, atsižvelgiant į pasirinkimo galimybę. 
# Kiek Buvo galima pasirinkti programų fakulteto viduje? 
# Teisės fakulteto studentai turėjo 0 galimybių pasirinkti fakulteto viduje.
# Didesnį pasirinkimą turėjo HMF, SMF ir kitų programų studentai.
# HMF, nepaisydami didelio pasirinkimo, juda į 
# SMF, linkę likti tame pačiame fakultete.
# GMF, linkę likti ten pat.
# PMDF, didelis judėjimas į išorę, vidutinis pasirinkimas.

# Kas tiesa, tas nemelas. Sunku įvertinti programų pasirinkimą, klausimai: 
# "apie kiek programų studentai svarstė" ir 
#  "kiek tuo metu programų buvo"?
srtd.sort('TSF_IFO_procentas',ascending=True).plot(kind='barh', alpha=1)


Out[17]:
<matplotlib.axes.AxesSubplot at 0xcccfe48>

Būtina atsžvelgti, jog ne visi gali judėti į kitą fakultetą. Geriausias to pavyzdys - teisės fakultetas. Jame - tik viena studijų programa. Toliau esančiame grafe tai ir darome.


In [18]:
fig = plt.figure(num=None,
                 figsize=(5, 5),
                 dpi=80,
                 facecolor='w',
                 edgecolor='k')

labels = dft2.index.unique()
plt.subplots_adjust(bottom = 0.1)

ax = fig.add_subplot(111)
ax.set_title(u'Fakulteto ribose ir už tarpfakultetiniai')
ax.set_xlabel(u'Į kitus fakultetus')
ax.set_ylabel(u'Lieka fakultete')

COLORS = [fak_colors[x] for x in facultylist]

plt.scatter(dft2.TSF_atsizvelgus_i_PF, 
            dft2.IFO_atsizvelgus_i_PF, 
            alpha=0.7, 
            color=COLORS, 
            s=dft2.programu_fakultete * 100)

for label, x, y in zip(labels, dft2.TSF_atsizvelgus_i_PF, dft2.IFO_atsizvelgus_i_PF):
    plt.annotate(
        label, 
        xy = (x, y), xytext = (-10, 10),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.3', fc = 'grey', alpha = 0.1),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))
##########################
# padaryti ta pati, tik atskiroms programoms.
##########################
plt.show()


Pairs of faculties

HMF and SMF


In [19]:
import matplotlib.pyplot as plt

plt.figure()
# Example with HMF and SMF
fltrd1 = dfe[dfe['pagrfak'].isin(['hmf'])]
fltrd2 = fltrd1[fltrd1['gretfak'].isin(['smf'])]
grpd = fltrd2.groupby(['gret']).size().order('gret')
#grpd.plot(title='hmf studentai renkasi smf gretutines', kind='barh')
#grpd
#plt.show()


C:\Users\Saonkfas\Anaconda\lib\site-packages\pandas\core\series.py:1686: FutureWarning: na_last is deprecated. Please use na_position instead
  FutureWarning)
<matplotlib.figure.Figure at 0xcf2e400>

In [20]:
fltrd1 = dfe[dfe['pagrfak'].isin(['smf'])]
fltrd2 = fltrd1[fltrd1['gretfak'].isin(['hmf'])]
grpd = fltrd2.groupby(['gret']).size().order('gret')
#grpd.plot(title='smf studentai renkasi hmf gretutines', kind='barh')
#plt.show()

4 One faculty analysis

Faculty of social sciences


In [21]:
# Lets analyse SMF
plt.figure()
smf = dfe[dfe['pagrfak'].isin(['smf'])]
grpd = smf.groupby('gretfak').size().order(ascending=True)
grpd.plot(kind='barh',title=u'Kokių fakultetų gretutines renkasi SMF studentai?')


Out[21]:
<matplotlib.axes.AxesSubplot at 0xcf438d0>

In [22]:
plt.figure()
grpd1 = smf.groupby('studsem').size() #Studsem, not very interesting. But I should make them all.
grpd1.plot(kind='barh',title=u'Kokiame semestre stoja smf studentai?')


Out[22]:
<matplotlib.axes.AxesSubplot at 0xcbd9dd8>

SMF į HMF


In [23]:
# Kokių HMF programų studentai labiausiai linkę rinktis SMF gretutines?
dfh2s = dfe[dfe.pagrfak.isin(['hmf']) & dfe.gretfak.isin(['smf'])]
TITLE = u'Kokių HMF programų studentai labiausiai linkę rinktis SMF gretutines?'
dfh2s.groupby('pagr').size().order(ascending=True).plot(kind='barh', title=TITLE)


Out[23]:
<matplotlib.axes.AxesSubplot at 0xddea358>

In [24]:
# Kokias SMF gretutines labiausiai linke rinktis HMF studentai?
dfh2s.groupby('gret').size().order(ascending=True).plot(kind='barh')


Out[24]:
<matplotlib.axes.AxesSubplot at 0xde4b6d8>

In [25]:
# Faculty Ego(INTERNAL)graph
faculty_internal_ego_graph = [dfe[dfe.pagrfak.isin([x]) & dfe.gretfak.isin([x])] for x in facultylist]
[x.groupby(['studsem']).size() for x in faculty_internal_ego_graph][0]
[x.groupby(['gretfak']).size() for x in faculty_internal_ego_graph]


Out[25]:
[gretfak
 mf         5
 dtype: int64, Series([], dtype: int64), gretfak
 gmf        23
 dtype: int64, gretfak
 evf        9
 dtype: int64, Series([], dtype: int64), gretfak
 smf        42
 dtype: int64, gretfak
 hmf        38
 dtype: int64, gretfak
 pmdf       14
 dtype: int64, gretfak
 ktf        1
 dtype: int64, gretfak
 if         3
 dtype: int64]

5 NetworkX

5.1 NetworkX allows us to get more data


In [26]:
import networkx as nx
import mpld3

# Create the graph object.
G = nx.MultiDiGraph()
GF = nx.MultiDiGraph() # Graph of faculty level


Couldn't import dot_parser, loading of dot files will not be possible.

Gaukime duomenis iš Pandas.


In [27]:
pd2nx = dfe.T.to_dict() #pandas to networkx
pd2nx1 = [(pd2nx[x]['pagr'],
           pd2nx[x]['gret'],
           {'gretall': pd2nx[x]['gretall'],
            'gretfak': pd2nx[x]['gretfak'],
            'gretpab': pd2nx[x]['gretpab'],
            'gretpop': pd2nx[x]['gretpop'],
            'gretpradz': pd2nx[x]['gretpradz'],
            'pagrall': pd2nx[x]['pagrall'],
            'pagrfak': pd2nx[x]['pagrfak'],
            'pagrpop': pd2nx[x]['pagrpop'],
            'saltinionr': pd2nx[x]['saltinionr'],
            'studmet': pd2nx[x]['studmet'],
            'studsem': pd2nx[x]['studsem'],
            'vardas': pd2nx[x]['vardas']}) 
            for x in range(len(pd2nx))]

# Add edges and their attributes (nodes are included).
G.add_edges_from(pd2nx1)

# Add node attributes.
for x in G.nodes():
    G.node[x]['fakultetas'] = get_faculty_name[x]
    #add more?
    
#G.nodes(data=True)
#G.edges(data=True)
len(G.nodes()),len(G.edges())


Out[27]:
(61, 698)
SUB GRAPHS

Lets start by crating an internal graph. The criteria is simple, all nodes of graph have to be in same faculty.


In [28]:
facultylist


Out[28]:
{u'evf', u'gmf', u'hmf', u'if', u'ktf', u'ma', u'mf', u'pmdf', u'smf', u'tf'}

5.1 Program graphs

5.1.1 Program ego graphs


In [29]:
# This is FACULTY GRAPH. I SHOULD MOVE IT. But no, it focus on level of programs.
faculty = u'hmf'
GHMF=nx.Graph([(u,v,d) for u,v,d in G.edges(data=True) if d[u'pagrfak'] == faculty or d[u'gretfak'] == faculty])

# Dictionary of faculty Egographs
# Move to another place.
mydict = {}
for x in facultylist:
    graph = nx.MultiDiGraph([(u,v,d) for u,v,d in G.edges(data=True) if d['pagrfak'] == x or d['gretfak'] == x])
    mydict[x] = graph

5.1.2 Pairs of programs


In [30]:
#This place should contain pairs of programs.  What we can say about how philosophy interacts with law?
x,y =u'hmf',u'pmdf'
fakpair = {}
fakpair['hmfsmf'] = nx.MultiDiGraph([(u,v,d) for u,v,d in G.edges(data=True) if d['pagrfak'] == x and d['gretfak'] == y])
# Also, how many different interaction possibilities do we have? What are the trends of interaction?
fakpair


Out[30]:
{'hmfsmf': <networkx.classes.multidigraph.MultiDiGraph at 0xe0e36a0>}

5.1.3 Program triplets


In [31]:
# This is still not working but someday I will be able to use deque.
from collections import deque
triplet = deque(['filosofija','etnologija','teologija'])
triplet1 = deque(['filosofija','teologija','etnologija'])
triplet2 = deque(['etnologija','filosofija','teologija'])
triplet1.rotate(1)
trplt = triplet1.rotate(1)
trplt





headers = [u'pagr', u'gret', u'pagrfak', u'gretfak', u'Weight']
def haha(x, y, z):
    filtered = dfe[dfe.pagr.isin([x]) & dfe.gret.isin([y])]
    
    if len(filtered) > 0:
        grpd = filtered.groupby([u'gret', u'pagr']).size()
        filtered1 = dfe[dfe.pagr.isin([y]) & dfe.gret.isin([z])]
        
        if len(filtered1) > 0:
            grpd1 = filtered1.groupby([u'gret', u'pagr']).size()
            filtered2 = dfe[dfe.pagr.isin([z]) & dfe.gret.isin([x])]
            
            if len(filtered2) > 0:
                grpd2 = filtered2.groupby([u'gret', u'pagr']).size()
                ret = [len(filtered), len(filtered1), len(filtered2)]
                ret1 = len(filtered) * len(filtered1) * len(filtered2)
                ret2 = [grpd, grpd1, grpd2]

                print(x, y, z, ret1)
                return[ret, ret1, ret2]

def haha2(x, y, z):
    filtered = dfe[dfe.pagr.isin([x]) & dfe.gret.isin([y])]

    if len(filtered) > 0:
        filtered1 = dfe[dfe.pagr.isin([y]) & dfe.gret.isin([z])]

        if len(filtered1) > 0:
            filtered2 = dfe[dfe.pagr.isin([z]) & dfe.gret.isin([x])]

            if len(filtered2) > 0:
                ret = [len(filtered), len(filtered1), len(filtered2)]
                ret1 = len(filtered) * len(filtered1) * len(filtered2)
                
                return([x, y, z], ret, ret1)

def ha():
    thelist = []
    seenlist = []
    count = 0
    for k in proglist:
        count += 1
        for l in proglist:
            for m in proglist:
                seenlist.append([k, l, m])
                if[k, l, m] in seenlist is not True:
                    rez = haha2(k, l, m)
                    if rez is not None:
                        thelist.append(deque(rez))
                        print (rez)
    return thelist

import pickle

#alltriplets = ha()
f = open("Data/alltriplets.p","rb")

bin_data = f.read()
alltriplets = pickle.loads(bin_data)

#f = open("Data/alltriplets_unicode.p","rb")
#alltriplets  = pickle.load(f)

#pickle.dump(alltriplets, open( "Data/alltriplets.p", "wb" ))
#pickle.dump(unicode(alltriplets), open( "Data/alltriplets_unicode.p", "wb" ))

#TRIPLETS
rez = [x for x in alltriplets]
trip_df = pd.DataFrame(rez)

The following are not triplets


In [32]:
SG = mydict[faculty]
for x in SG.nodes():
    SG.node[x]['fakultetas'] = get_faculty_name[x]

all_faculties_in_graph = [get_faculty_name[x] for x in SG.nodes()]

In [33]:
list_of_node_lists = []
for FAK in set(all_faculties_in_graph):
    node_list = []
    node_list = [n for n, attrdict in SG.node.items() if attrdict["fakultetas"] == FAK]
    list_of_node_lists.append(node_list)

In [34]:
### Developing a colored version.
#from mpld3 import enable_notebook
#enable_notebook()

def draw_graphs(faculty):
    G = mydict[faculty]
    for x in G.nodes():
       G.node[x]['fakultetas'] = get_faculty_name[x]
          
    plt.figure(figsize=(10, 10))
    plt.title('{}'.format(faculty))
    
    pos=nx.spring_layout(G, iterations=100)
    
    all_faculties_in_graph = [get_faculty_name[x] for x in G.nodes()]
    rez = set(all_faculties_in_graph)
   # weights = [n for n, attrdict in G.edge.items() if attrdict["fakultetas"] == "hmf"]
    count = 0
    for x in rez:
        count += 1
        nx.draw_networkx_nodes(G,
                               pos,
                               #nodelist=G.nodes()[5:-1],
                               nodelist=[z for z in G.nodes() if get_faculty_name[z] == x],
                               node_color=fak_colors[x],
                               node_size=500,
                               alpha=0.5)
    
    count = 0
    for x in rez:
        nx.draw_networkx_edges(G,
                               pos,
         #                      edgelist=G.edges(),
                               edgelist = [(u,v,d) for u,v,d in G.edges(data=True) if get_faculty_name[u] == x],
                               width=2,
                               alpha=0.2,
                               weight=1,
                               edge_color=fak_colors[x])

        #add nx.draw_networkx_edges... if u and v already in graph
    nx.draw_networkx_labels(G, 
                             pos, 
                             labels=None, 
                             font_size=10, 
                             font_color='k', 
                             font_family='sans-serif', 
                             font_weight='normal', 
                             alpha=0.6, 
                             ax=None)
    plt.show()
[draw_graphs(x) for x in facultylist]


Out[34]:
[None, None, None, None, None, None, None, None, None, None]

In [35]:
# Wonderful, but lets do the same for 
#EACH PROGRAM! PROGRAMS EGO NETWORK
#INTERNAL FACULTY RELATIONS

In [36]:
# Let's do internal faculty

In [37]:
# Let's do favorite locations of each faculty. Which faculty do they like to go to

In [38]:
# Let's do STUDSEM paterns

In [39]:
# Let's do changes of faculty patterns. What amount of people go where, is it continuous or is it fluctuating?

In [40]:
# Let's do some tweaks for pring layout.

5.2 Faculty graphs (where relations between faculties are analysed)

5.2.1 Ego

5.2.2 Twin(pair)


In [41]:
# Works only with two graphs
GHSC = nx.compose(mydict['hmf'],mydict['smf'])
len(GHSC.edges()), len(GHSC.nodes())


Out[41]:
(469, 56)

We have many connections between programs. These connections are very itneresting. The most interesting thing to us is, whether we can interpret connections as one sided. We have a hypothesis, that each connection is onesided, to prove that, we have to find, how values of connections compare.

If we find that connections are indeed onesided, we will enable to reduce our graph to DiGraph, that will help us to process the graph even further.

We will not abandon the data, we will identify unlikely connections. That is to say, connections, that go against the stream.


In [42]:
def gauk_tikruosius_svorius(x,y):
    dft = dfe.loc[:,['pagr','pagrpop','pagrall']]

    x_pagrpop = dft[dft.pagr == x].drop_duplicates()['pagrpop'].values
    x_pagrall = dft[dft.pagr == x].drop_duplicates()['pagrall'].values

    y_pagrpop = dft[dft.pagr == y].drop_duplicates()['pagrpop'].values
    y_pagrall = dft[dft.pagr == y].drop_duplicates()['pagrall'].values

    x_to_y = len(dfe[dfe['pagr'].isin([x]) & dfe['gret'].isin([y])])
    y_to_x = len(dfe[dfe['pagr'].isin([y]) & dfe['gret'].isin([x])])
    
    rez = [x, y, 
           float(x_pagrpop), float(x_pagrall),  # Something wroing with returning 0 length arrays and turning them to float
           float(y_pagrpop), float(y_pagrall), 
           x_to_y, y_to_x]
    
    value1 = 0
    value2 = 0
    
    if float(rez[6]) != 0:
        value1 = (float(rez[3]) / float(rez[6]))
    if float(rez[7]) != 0:
        value2 = (float(rez[5]) / float(rez[7]))
                  
    ret = [rez[0],
           rez[1], 
           value1 * rez[2],
           value2 * rez[4]]
    return ret
x,y = 'istorija','teise'


dft = dfe.loc[:,['pagr','pagrpop','pagrall']]
dft


Out[42]:
pagr pagrpop pagrall
0 biochemija 0.087500 9
1 prancuzu_filologija 0.181818 10
2 viesoji_komunikacija 0.073759 61
... ... ... ...
695 menotyra 0.077844 17
696 kurybines_industrijos 0.078125 4
697 marketingo_vadyba 0.039326 35

698 rows × 3 columns


In [43]:
rex = []
for x in proglist:
    for y in proglist:
        rezz = []
        try:
            rezz = gauk_tikruosius_svorius(x,y)
        except:
            pass
        rex.append(rezz)

In [45]:
DTTTTTT = pd.DataFrame(rex, columns=['pav1','pav2','wght1','wght2'])
DTTTTTT


Out[45]:
pav1 pav2 wght1 wght2
0 marketingo_vadyba marketingo_vadyba 0.000000 0.000000
1 marketingo_vadyba ekonomika_ir_finansai 0.688202 0.329897
2 marketingo_vadyba atlikimo_menas 0.000000 NaN
... ... ... ... ...
3718 teise biochemija 0.000000 0.000000
3719 teise italistika_ir_romanu_kalbos 5.425837 NaN
3720 teise teise 0.000000 0.000000

3721 rows × 4 columns


In [46]:
import pandas as pd
rex = [x for x in rex if x]
#[len(x) for x in rex] #Check

names1 = [x[0] for x in rex] 
names2 = [x[1] for x in rex]
values = [x[2] for x in rex]

DATAFRAME = pd.DataFrame()
DATAFRAME = pd.DataFrame(rex,
                         columns=['pagr','gret','weight','weight2'])
DATAFRAME = DATAFRAME.dropna()
DATAFRAME = DATAFRAME[DATAFRAME.weight != 0].sort('weight')
DATAFRAME1 = DATAFRAME
DATAFRAME = DATAFRAME.drop('weight2', 1)

In [48]:
fig = plt.figure(num=None,
                 figsize=(15, 15),
                 dpi=80,
                 facecolor='w',
                 edgecolor='k')

labels = zip(DATAFRAME1.pagr,DATAFRAME1.gret)
ax = fig.add_subplot(111)
ax.set_title(u'Neblogas grafas')
ax.set_xlabel(u'Kiek iseina')
ax.set_ylabel(u'Koks antros programos skaicius')
plt.scatter(DATAFRAME1.weight, 
            DATAFRAME1.weight2, 
            s=500, 
            c=[programos_fakulteto_spalva[x] for x in DATAFRAME1.pagr],
            alpha=0.7)

for label, x, y in zip(labels, DATAFRAME1.weight, DATAFRAME1.weight2):
    plt.annotate(
        label, 
        xy = (x, y), xytext = (-10, 10),
        textcoords = 'offset points', ha = 'right', va = 'bottom',
        bbox = dict(boxstyle = 'round,pad=0.3', fc = 'grey', alpha = 0.1),
        arrowprops = dict(arrowstyle = '->', connectionstyle = 'arc3,rad=0'))

plt.show


Out[48]:
<function matplotlib.pyplot.show>

In [49]:
pd2nx = DATAFRAME.T.to_dict()
pd2nx2 = [(pd2nx[x]['pagr'], pd2nx[x]['gret'], {'gretall': pd2nx[x]['weight']}) for x in DATAFRAME.index]

In [51]:
from mpld3 import enable_notebook
enable_notebook()

G = nx.MultiDiGraph(pd2nx2)
#SG = G.subgraph([n for n, attrdict in G.node.items() 
#                 if (attrdict["fakultetas"] == "hmf"]) 
#                 or (attrdict["fakultetas"] == "smf"]) 
nx.draw(SG)



In [52]:
#x_pagrpop = dft[dft.pagr == x].drop_duplicates()['pagrpop'].values
#x_pagrall = dft[dft.pagr == x].drop_duplicates()['pagrall'].values

In [53]:
# Faculty pairs
# Better to do some of this with pandas.
# creates a list of dictionaries, where dictionaries represent relations between two. PAIR
mylist = []
for x in proglist:
    mydict = {}
    mydict['x'] = x
    
  #  lala = dfe.iloc['pagrpop','pagrviso']
    for y in proglist:
        mydict['y'] = y
        mydict['xy'] = [len(dfe[dfe['pagr'].isin([x]) & dfe['gret'].isin([y])])][0]
        mydict['yx'] = [len(dfe[dfe['pagr'].isin([y]) & dfe['pagr'].isin([x])])][0]
        mylist.append(mydict)
#some1 = [x['yx'] for x in mylist if x['xy'] > 1]
#some2 = [x['yx'] for x in mylist if x['yx'] > 1]

In [54]:
mylist[2]


Out[54]:
{'x': u'marketingo_vadyba', 'xy': 0, 'y': 'teise', 'yx': 0}

5.2.3 Faculty triplets

5.3.1 Triplet(Clique) graphs


In [55]:
# Prie dfe prideti PASVERTA OUT

grpd = dfe[dfe.pagrpop > 0].sort('pagrpop', ascending=0)
drp = grpd.drop_duplicates(cols=['pagr', 'gret'])[['pagr', 'gret', 'pagrpop']]
THEGRAPH = nx.DiGraph()


dfl1 = drp['pagr'].tolist()
dfl2 = drp['gret'].tolist()
dfl3 = drp['pagrpop'].tolist()
dfl = zip(dfl1,
          dfl2,
          dfl3)

for x in range(len(dfl1)):
    THEGRAPH.add_edge(dfl1[x], dfl2[x])
    THEGRAPH[dfl1[x]][dfl2[x]]['weight'] = float(dfl3[x])
#PLOT THEGRAPH


C:\Users\Saonkfas\Anaconda\lib\site-packages\pandas\util\decorators.py:53: FutureWarning: cols is deprecated, use subset instead
  warnings.warn(msg, FutureWarning)

STUDSEM


In [56]:
#rez = [dfe[dfe.pagr == x].groupby('studsem').size() for x in proglist] # get lots of dataframes
def get_all_studsems(x, pagr_or_gret): #By study program
    count = 0
    myarray = []
    mymainlist = []
    for z in range(8):
        count += 1
        if pagr_or_gret == 'pagr':
            rez = dfe[dfe.pagr == x]
        if pagr_or_gret == 'gret':
            rez = dfe[dfe.gret == x]
        rez = rez[rez.studsem == count]
        rez = len(rez)
    #    mydict[count] = np.array(rez)
        myarray.append(rez)
    mymainlist.append(np.array(myarray))
    return mymainlist

#get_all_studsems('filosofija') # TEST

In [57]:
import numpy as nphgf
rez_pagr = [get_all_studsems(x,'pagr') for x in proglist]
rez_gret = [get_all_studsems(x,'gret') for x in proglist]
DFT = pd.DataFrame(rez_pagr, proglist, ['studsem_cnts_pagr'])
DFT['studsem_cnts_gret'] = [x[0] for x in rez_gret]
#DFT.iloc[1,0]
#DFT.T.psichologija
DFT


Out[57]:
studsem_cnts_pagr studsem_cnts_gret
marketingo_vadyba [1, 4, 0, 0, 2, 0, 0, 0] [5, 11, 17, 1, 1, 0, 0, 0]
ekonomika_ir_finansai [1, 3, 4, 0, 0, 0, 0, 0] [2, 7, 3, 9, 8, 1, 1, 1]
atlikimo_menas [0, 1, 2, 0, 0, 0, 0, 0] [0, 0, 0, 0, 0, 0, 0, 0]
... ... ...
biochemija [0, 6, 1, 0, 0, 0, 0, 0] [1, 6, 2, 0, 0, 0, 0, 0]
italistika_ir_romanu_kalbos [4, 0, 1, 0, 0, 0, 0, 0] [1, 1, 4, 0, 2, 0, 0, 0]
teise [24, 15, 31, 4, 2, 4, 1, 0] [5, 19, 26, 2, 2, 1, 1, 0]

61 rows × 2 columns


In [58]:
def get_stats_of_array(arrayy):
    mylist = []       
    rez1 = [sum(arrayy[0:4]), sum(arrayy[4:8])] # Pirma metu puse, pries antra
    rez2 = [sum(arrayy[0::2]), sum(arrayy[1::2])] #Pavasaris pries rudeni.
    rez3 = [sum(arrayy[0:1]), sum(arrayy[2:3]), sum(arrayy[4:5]), sum(arrayy[6:7])]
    mylist.extend([rez1,rez2,rez3])
    return mylist

# Pagr
REX1 = [get_stats_of_array(DFT.iloc[x,0])[0] for x in range(len(DFT))]
REX2 = [get_stats_of_array(DFT.iloc[x,0])[1] for x in range(len(DFT))]
REX3 = [get_stats_of_array(DFT.iloc[x,0])[2] for x in range(len(DFT))]

DFT['prps_pagr'] = REX1
DFT['pvsrs_rd_pagr'] = REX2
DFT['ktr_mt_pagr'] = REX3

# Gret
REX4 = [get_stats_of_array(DFT.iloc[x,1])[0] for x in range(len(DFT))]
REX5 = [get_stats_of_array(DFT.iloc[x,1])[1] for x in range(len(DFT))]
REX6 = [get_stats_of_array(DFT.iloc[x,1])[2] for x in range(len(DFT))]

DFT['prps_gret'] = REX4
DFT['pvsrs_rd_gret'] = REX5
DFT['ktr_mt_gret'] = REX6
DFT


Out[58]:
studsem_cnts_pagr studsem_cnts_gret prps_pagr pvsrs_rd_pagr ktr_mt_pagr prps_gret pvsrs_rd_gret ktr_mt_gret
marketingo_vadyba [1, 4, 0, 0, 2, 0, 0, 0] [5, 11, 17, 1, 1, 0, 0, 0] [5, 2] [3, 4] [1, 0, 2, 0] [34, 1] [23, 12] [5, 17, 1, 0]
ekonomika_ir_finansai [1, 3, 4, 0, 0, 0, 0, 0] [2, 7, 3, 9, 8, 1, 1, 1] [8, 0] [5, 3] [1, 4, 0, 0] [21, 11] [14, 18] [2, 3, 8, 1]
atlikimo_menas [0, 1, 2, 0, 0, 0, 0, 0] [0, 0, 0, 0, 0, 0, 0, 0] [3, 0] [2, 1] [0, 2, 0, 0] [0, 0] [0, 0] [0, 0, 0, 0]
... ... ... ... ... ... ... ... ...
biochemija [0, 6, 1, 0, 0, 0, 0, 0] [1, 6, 2, 0, 0, 0, 0, 0] [7, 0] [1, 6] [0, 1, 0, 0] [9, 0] [3, 6] [1, 2, 0, 0]
italistika_ir_romanu_kalbos [4, 0, 1, 0, 0, 0, 0, 0] [1, 1, 4, 0, 2, 0, 0, 0] [5, 0] [5, 0] [4, 1, 0, 0] [6, 2] [7, 1] [1, 4, 2, 0]
teise [24, 15, 31, 4, 2, 4, 1, 0] [5, 19, 26, 2, 2, 1, 1, 0] [74, 7] [58, 23] [24, 31, 2, 1] [52, 4] [34, 22] [5, 26, 2, 1]

61 rows × 8 columns


In [59]:
def PLOT_GRAPH(axis,threshold):
    for x in range(len(DFT)):
        if sum(DFT.iloc[x,axis]) > threshold:
            plt.plot(DFT.iloc[x,axis] / float(sum(DFT.iloc[x,axis])))
PLOT_GRAPH(0,20)



In [60]:
PLOT_GRAPH(1,20)



In [61]:
for x in range(len(DFT)):
    plt.plot(DFT.iloc[x,2])



In [62]:
for x in range(len(DFT)):
    plt.plot(DFT.iloc[x,3])



In [63]:
for x in range(len(DFT)):
    plt.plot(DFT.iloc[x,4])



In [73]:
import numpy as np
import pandas as pd
from pandas.tools.plotting import parallel_coordinates

a = dfn.loc[:,['m09','m10','m11','m12']].iloc[1]
b = dfn.loc[:,['m09','m10','m11','m12']].iloc[2]
c = dfn.loc[:,['m09','m10','m11','m12']].iloc[3]

#dfn
##[dfn.loc[:,['m09','m10','m11','m12']].iloc[1] for x in range(len(dfn))]
df = pd.DataFrame([dfn.loc[:,['m09','m10','m11','m12']].iloc[x] 
                   for x in range(len(dfn))], index=dfn.pavadinimas.values)

#from mpld3 import enable_notebook # Causes soooo much lag!
#enable_notebook() # Causes soooo much lag! Also too much of axis.
mpld3.disable_notebook
pd.scatter_matrix(df, 
                  diagonal='hist',
                  figsize=(10,10),
                  alpha=0.3,
                  marker='o',
                  c='k',
                  #c=[programos_fakulteto_spalva[x] for x in df.index],
                  s=50)


Out[73]:
array([[<matplotlib.axes.AxesSubplot object at 0x0000000012A62E10>,
        <matplotlib.axes.AxesSubplot object at 0x0000000011D82320>,
        <matplotlib.axes.AxesSubplot object at 0x0000000013467780>,
        <matplotlib.axes.AxesSubplot object at 0x0000000013490240>],
       [<matplotlib.axes.AxesSubplot object at 0x00000000138232B0>,
        <matplotlib.axes.AxesSubplot object at 0x00000000133DC5C0>,
        <matplotlib.axes.AxesSubplot object at 0x000000001399F908>,
        <matplotlib.axes.AxesSubplot object at 0x00000000139C4A58>],
       [<matplotlib.axes.AxesSubplot object at 0x0000000013B22588>,
        <matplotlib.axes.AxesSubplot object at 0x0000000013B453C8>,
        <matplotlib.axes.AxesSubplot object at 0x0000000013B2D438>,
        <matplotlib.axes.AxesSubplot object at 0x0000000013CA7518>],
       [<matplotlib.axes.AxesSubplot object at 0x0000000013E0E4E0>,
        <matplotlib.axes.AxesSubplot object at 0x0000000013E27BA8>,
        <matplotlib.axes.AxesSubplot object at 0x0000000013FDBDA0>,
        <matplotlib.axes.AxesSubplot object at 0x0000000013FF11D0>]], dtype=object)

KITA

Turimea atsižvelgti, į tai, kiek studentų galėjo pasirinkti gretutines.

Now, lets create something that combines the two numbers.

Down to the individual level


In [65]:
%qtconsole

The unique cases of choosing academic minor. The cases that oppose the dominant trends. Howe we can find them?

We did a lot already. But we did not assess any of the individual choices. It is very good, that we have made a throughout analysis of the environment, but at the end, it is all about choices of individuals. So what are we interested in? We are interested in such things: o Can we find the motivation of the choice? o Can we assess the success of the choice?


In [66]:
prg1, prg2 = 'filosofija', 'psichologija'
[len(dfe[dfe['pagr'].isin([prg1]) & dfe['gret'].isin([prg2])])]


Out[66]:
[20]

In [67]:
# Creating AFG(ALL FACULTY GRAPHS) this will make it easier to go through them. The graph is basically an EgoGraphf of Faculty
AFG = [nx.DiGraph for x in facultylist] # All Faculty Graphs
#add edges and nodes

In [71]:
# THIS PART IS GOOD!!!
%matplotlib inline
import networkx as nx
pos = nx.fruchterman_reingold_layout(G)
plt.figure(figsize=(10, 10))
nx.draw_networkx(G, pos, with_labels=True) # unicod eproblems
plt.show()