This notebook is dedicated to analysis of disciplines in relation to profesional occupations. We will use many sources, bout our first source is http://www.occupationsguide.cz/en/

Occupations Guide

Get data


In [1]:
import requests
MAIN = 'http://www.occupationsguide.cz/en/'
url = 'http://www.occupationsguide.cz/en/abecedni/abecedni.htm'
COMPARISONS_FRAME = 'http://www.occupationsguide.cz/en/pribuzn/PRIBUZN.ASP?Prvni={}&Druhy={}'
r = requests.get(url)

from bs4 import BeautifulSoup
soup = BeautifulSoup(r.text)

In [2]:
tempor = [x['href'].lstrip('../') for x in soup.findAll('a', href=True) if r'abecedni' not in x]
endings = sorted([x for x in tempor if 'abec' not in x])
all_links = [''.join([MAIN,x]) for x in endings]

In [4]:
all_rel_link = [x.replace('POVOL', 'pribuzn') for x in  all_links]

let's define two functions that will help us to get the files we need.


In [5]:
import urllib
def get_main_descriptions():
    count = 0
    for x in all_links:
        count += 1
        urllib.urlretrieve(x, "downloads/{}.htm".format(count))

def get_connections():
    count = 0
    for x in all_rel_link:
        count += 1
        urllib.urlretrieve(x, "downloads/rels{}.htm".format(count))

In [31]:
#main_descriptions = get_main_descriptions()
#connection_descriptions = get_connections()

Now, as we have it in our drive. Let's do some functions that will take the info that we need:


In [32]:
def get_all_descriptions():
    mylist = []
    error_count = 0
    for count in range(1, 1417):
        templist = []
        try:
            soup = BeautifulSoup(open(r'downloads/{}.htm'.format(count)).read())
            temp_list = soup.findAll(True)[10].findAll('table')[7].findAll('p')
            mylist.append([x.text for x in temp_list if len(x) > 1])
        except:
            error_count += 1
    print (error_count, "errors found")
    return mylist
all_descriptions = get_all_descriptions()


(749, 'errors found')

In [34]:
len(all_descriptions)


Out[34]:
667

In [35]:
#WIP(Work in progress)
count = 1
soup = BeautifulSoup(open(r'downloads/rels{}.htm'.format(count)).read())
tables = soup.find_all('table')
tables[0].find_all('a')
#mylist.append([x.text for x in temp_list if len(x) > 1])


Out[35]:
[<a href="PRIBUZN.ASP?Prvni=10&amp;Druhy=48"><img alt="Knowledge and activities common to both occupations" border="0" height="15" src="../ICO/_PRIBUZ.gif" width="14"/></a>,
 <a href="../povol/48.htm">interior designer interior designer</a>,
 <a href="PRIBUZN.ASP?Prvni=10&amp;Druhy=657"><img alt="Knowledge and activities common to both occupations" border="0" height="15" src="../ICO/_PRIBUZ.gif" width="14"/></a>,
 <a href="../povol/657.htm">stage designer / theatre designer</a>,
 <a href="PRIBUZN.ASP?Prvni=10&amp;Druhy=6"><img alt="Knowledge and activities common to both occupations" border="0" height="15" src="../ICO/_PRIBUZ.gif" width="14"/></a>,
 <a href="../povol/6.htm">animator</a>,
 <a href="PRIBUZN.ASP?Prvni=10&amp;Druhy=7"><img alt="Knowledge and activities common to both occupations" border="0" height="15" src="../ICO/_PRIBUZ.gif" width="14"/></a>,
 <a href="../povol/7.htm">window-dresser</a>,
 <a href="PRIBUZN.ASP?Prvni=10&amp;Druhy=141"><img alt="Knowledge and activities common to both occupations" border="0" height="15" src="../ICO/_PRIBUZ.gif" width="14"/></a>,
 <a href="../povol/141.htm">script editor</a>,
 <a href="PRIBUZN.ASP?Prvni=10&amp;Druhy=177"><img alt="Knowledge and activities common to both occupations" border="0" height="15" src="../ICO/_PRIBUZ.gif" width="14"/></a>,
 <a href="../povol/177.htm">graphic designer (graphic designer /audiovisual graphic designer)</a>,
 <a href="PRIBUZN.ASP?Prvni=10&amp;Druhy=178"><img alt="Knowledge and activities common to both occupations" border="0" height="15" src="../ICO/_PRIBUZ.gif" width="14"/></a>,
 <a href="../povol/178.htm">audio graphic designer</a>,
 <a href="PRIBUZN.ASP?Prvni=10&amp;Druhy=341"><img alt="Knowledge and activities common to both occupations" border="0" height="15" src="../ICO/_PRIBUZ.gif" width="14"/></a>,
 <a href="../povol/341.htm">scene painter</a>,
 <a href="PRIBUZN.ASP?Prvni=10&amp;Druhy=347"><img alt="Knowledge and activities common to both occupations" border="0" height="15" src="../ICO/_PRIBUZ.gif" width="14"/></a>,
 <a href="../povol/347.htm">make-up artist and wigmaker</a>,
 <a href="PRIBUZN.ASP?Prvni=10&amp;Druhy=615"><img alt="Knowledge and activities common to both occupations" border="0" height="15" src="../ICO/_PRIBUZ.gif" width="14"/></a>,
 <a href="../povol/615.htm">retoucher</a>,
 <a href="PRIBUZN.ASP?Prvni=10&amp;Druhy=912"><img alt="Knowledge and activities common to both occupations" border="0" height="15" src="../ICO/_PRIBUZ.gif" width="14"/></a>,
 <a href="../povol/912.htm">product designer (industrial designer)</a>,
 <a href="PRIBUZN.ASP?Prvni=10&amp;Druhy=1101"><img alt="Knowledge and activities common to both occupations" border="0" height="15" src="../ICO/_PRIBUZ.gif" width="14"/></a>,
 <a href="../povol/1101.htm">fine artist</a>,
 <a href="PRIBUZN.ASP?Prvni=10&amp;Druhy=1139"><img alt="Knowledge and activities common to both occupations" border="0" height="15" src="../ICO/_PRIBUZ.gif" width="14"/></a>,
 <a href="../povol/1139.htm">building/civil engineering/ architectural technician/technologist</a>,
 <a href="PRIBUZN.ASP?Prvni=10&amp;Druhy=1204"><img alt="Knowledge and activities common to both occupations" border="0" height="15" src="../ICO/_PRIBUZ.gif" width="14"/></a>,
 <a href="../povol/1204.htm">industrial designer (wood and furniture industry)</a>,
 <a href="PRIBUZN.ASP?Prvni=10&amp;Druhy=1228"><img alt="Knowledge and activities common to both occupations" border="0" height="15" src="../ICO/_PRIBUZ.gif" width="14"/></a>,
 <a href="../povol/1228.htm">multimedia designer</a>]

In [39]:
def get_all_connections():
    mylist = []
    error_count = 0
    for count in range(1, 1417):
        templist = []
        try:
            soup = BeautifulSoup(open(r'downloads\rels{}.htm'.format(count)).read())
            #temp_list = soup.findAll(True)[10].findAll('table')[7].findAll('p')   # <------update----------
            #mylist.append([x.text for x in temp_list if len(x) > 1]) # <------update----------
        except:
            error_count += 1
    print (error_count, "errors found")
    return mylist
relations_list = get_all_connections()


(1416, 'errors found')

In [40]:
relations_list


Out[40]:
[]

Now, as we have some info already, we can do some more.

We want to have them in very nice structure.


In [41]:
def get_nice():
    big_list = []
    for line in all_descriptions:
        q1, q2, q3, q4, q5 = [
                u'Who are they and what do they do?',
                u'What are the activities of the job?',
                u'Where is it done and under what conditions?',
                u'What tools/equipment do they use?',
                u'What do you need to succeed?'
                ]
        small_dict = {q1:None, q2:None, q3:None, q4:None, q5:None}
        for item in line:
            if item.startswith(q1):
                small_dict[q1] = item.lstrip(q1)
            if item.startswith(q2):
                small_dict[q2] = item.lstrip(q2)
            if item.startswith(q3):
                small_dict[q3] = item.lstrip(q3)
            if item.startswith(q4):
                small_dict[q4] = item.lstrip(q4)
            if item.startswith(q5):
                small_dict[q5] = item.lstrip(q5)
        big_list.append(small_dict)
    return big_list
nice_list = get_nice()

To make it possible for us to access all entries. We can do it this way.


In [44]:
succeed = [x[u'What do you need to succeed?'] for x in nice_list]
who = [x[u'Who are they and what do they do?'] for x in nice_list]
activities = [x[u'What are the activities of the job?'] for x in nice_list]
where = [x[u'Where is it done and under what conditions?'] for x in nice_list]
tools = [x[u'What tools/equipment do they use?'] for x in nice_list]

In [52]:
import pandas as pd
df = pd.DataFrame([succeed, who, activities, where, tools]).T
df.columns = ['suceed', 'who','activities', 'where', 'toolks']

In [53]:
df


Out[53]:
suceed who activities where toolks
0 You need university education, creative abilit... The film designer is a worker who, according t... He/she supervises creation of the scene accord... In a studio, or where film-shooting is going o... Cameras, film, and a variety of special effect...
1 You need on the job training or a specific cou... A manager manages, organises or plans activiti... There are two main areas: None None
2 You need training on the job or a course. A physiotherapist helps to rehabilitate indivi... None None
3 You need on the job training or a specific cou... The work of the economist has a comparatively ... Economists may specialise in many different ar... None None
4 You need on-the-job training or a training cou... Assistant printing workers carry out various a... They may transport bales or rolls of paper, re... In a workshop as a part of the printing proces... Hand tools, trolleys, possibly forklift trucks...
5 You need a talent for visual arts in the first... A farm worker does unskilled work in agriculture. For example, manual digging, hoeing and the ha... In the open air where varying weather conditio... Manual tools such as spades and shovels, and c...
6 You need a talent for visual arts in the first... kers in paper manufacturing carry out basic ma... Simple manual or machine operations in cellulo... The job is carried out in paper making factori... Various machines and equipment, automatic prod...
7 You need higher or university education, with ... kers in the production of construction materia... They work on the production of cement, lime, c... In production premises where noise, heat and c... rking tools are a variety of different machine...
8 You need to have completed university studies ... A fine artist uses line and colour to express ... Artists usually work alone in a studio. They m... In her/his studio but often also outside work.... Creative media, on a broad scale from the clas...
9 You need to have graduated from a faculty of m... A sculptor is a creative artist who creates va... Modelling, carving, chiselling, moulding, cast... In a studio. The working environment of the sc... All that has a decisive bearing on the use of ...
10 You need to have completed university studies ... The anthropologist studies the origin, develop... The variation in the bodily structure of human... Anthropologists may find job opportunities, de... Anthropometric tools, diagnostic equipment, in...
11 You need to have completed university studies ... A nutritionist is an expert, who resolves vari... At health institutions s/he adapts the curativ... In residential health establishments, hospital... The spoken or written word, information techno...
12 You need to complete university studies at a u... A bartender (barman/ barmaid) prepares and ser... S/he serves drinks at places designed for this... Restaurants, cafés, bars, - often then in a no... None
13 You need to have completed university studies ... A botanist carries out research into the veget... Perhaps the most frequent field work of a bota... In the open air during field work and in labor... A botanist needs special aids for the collecti...
14 You need to have completed university studies ... A paediatrician makes diagnoses, plans and pro... S/he organises his/her own professional activi... A paediatrician mostly works at hospitals and ... The tools generally used by doctors and the di...
15 You need to study physics at the department of... An ergonomist seeks the optimum relation betwe... Considers the appropriateness of production eq... The working environment may be a production pl... Specialised devices, writing and drawing aids,...
16 You need to complete university studies at a f... None An ethnographer collects during his/her field ... An ethnographer works both indoors and outdoor... riting tools, recording and photographic equip...
17 Basic qualifications for this job involve educ... The philosopher analyses and synthesises philo... A philosopher or metaphysician discovers gener... At universities, or in schools, classrooms, se... rge number of books and various study material...
18 You need to complete studies at any of the man... A physicist studies the objective properties o... Physicists study various forms of energy (heat... Physicists work mostly indoors, in laboratorie... They need for their work various kinds of spec...
19 You need excellent technical thinking ability,... The geneticist studies and applies the laws of... Study of genetic material and inheritance at v... k may be in a laboratory or outdoors doing fie... Various items of special scientific equipment ...
20 You need to have studied art and design with a... The geophysicist examines the structure and co... S/he makes various geophysical measurements. I... Mostly inside -in the observatory, in the offi... Special measurement instruments (seismographs,...
21 You need on the job training or a specific cou... A historian studies the conditions and factors... S/he may focus on political, economic, social ... In libraries or archives, where s/he finds the... Films, photographic documentation and sound re...
22 To join the police force you need to be a grad... A publican runs a public house or similar prem... Preparation of hot and cold meals, ready-to-co... Indoors, but sometimes outdoors, in good weath... None
23 You need to have completed university studies ... The general labourer in petroleum refineries i... Pumping and mixing various products, handling ... In refineries and chemical plants, where expos... Various machinery and equipment, hand tools.
24 You need a degree from a school of electrical ... The electrical equipment designer conceives an... The design drawings are supported by calculati... In offices and related spaces, including drawi... The use of computers is now prevalent. The des...
25 You need to have completed studies at secondar... None Keeping the register of births, deaths and mar... In the registrar's office, a reasonably comfor... Computers and a photocopier, official document...
26 To qualify for this job, you need to complete ... The mechatronic engineering profession is rela... Development, production, assembly, adjustment,... Offices and factories and related spaces. Like other technical professions, mechatronics...
27 You need a university degree in education and ... A fashion designer designs clothes, often prod... . To some extent this depends on the sector in... A fashion designer works usually indoors, in a... Computers and computer graphics, pens, pencils...
28 You need to have completed university studies,... An operator of NC machines and industrial robo... Using the technical documentation and technolo... S/he works in modern engineering companies, wh... The control panels for the machines and the co...
29 You need to have graduated from a university t... Railway freight handlers load, unload and stor... Loading, unloading or storage of freight, whic... In railway stations and freight yards, a noisy... rk is done by hand or using some sort of mecha...
... ... ... ... ... ...
637 None The task is to create and administer a network... Insurance salesperson - providing information,... The job is done in office spaces, meeting room... Primarily computer technology and normal offic...
638 None The landscape architect prepares architectural... The designs cover various garden structures, d... Both in studios and in gardens and parks and e... Information technology and other office equipm...
639 None The task of a gardener is to grow, and look af... Manual preparation of the soil by digging, hoe... A wide range of activities are involved in thi... The tools most commonly used are standard gard...
640 None A glass production worker does simple, but som... Shaping solid glass in the furnace treatment o... In glassworks, where dirt, high temperatures a... Various machines and instruments and automatic...
641 None A railway track construction fitter makes and ... Makes, assembles, adjusts, repairs and renovat... On railway tracks, where there will be dirt an... Hand tools, welding units and other aids and e...
642 None A rail vehicle mechanic or carman/woman inspec... Repairs carriages, tramcars and metro cars, re... In railway stations and depots. There will be ... Hand craft tools and measurement technology.
643 None A pawnbroker lends money to clients who in exc... Assess the value of the goods offered as secur... In pawnshops as an over-the-counter service. Information technology and other office equipm...
644 None The task of a driver's mate is to aid lorry dr... Loading and unloading lorries, especially indi... Mainly on the lorry, at the place of loading o... Primarily shovels, fork-lift trucks, ramps, tr...
645 None Medical laboratory assistants test samples of ... Checks samples of blood, urine, and other biol... In the laboratory. Some contact with chemicals... Normal laboratory equipment, sample bottles etc.
646 None The nurse assists doctors in the treatment of ... k activities include- complex nursing care- ad... Primarily in medical facilities such as clinic... Syringes, splints, thermometers, and other mea...
647 None The paramedic or ambulance worker provides qua... Rescue of, and first aid to, victims of accide... Primarily at the sites of accidents, natural d... Bandages, splints, stethoscopes, manometers. a...
648 None The safety and communications electrician prod... Produces, assembles, installs and repairs the ... In stations and depots where noise, dust and d... Mainly hand tools of the trade and measurement...
649 None The bricklayer's job is mainly the constructio... Preparation of mortar and concrete and other c... On a construction site or in existing building... Trowel, bucket, smoother, hammer, spirit level...
650 None The task of a farmer is to grow crops or raise... Sowing, planting and protecting corn and other... In fields and agricultural complexes (cattle s... Sowing machines and planters, tractors, combin...
651 None Land surveyors measure, and map land. They cre... Surveying land and recording its boundaries an... Both in offices and in the field where there c... Field survey devices, land survey information ...
652 None The task of cartographer is to make maps, char... Collecting data about the face of the earth in... In offices and drawing offices as well as out ... The tools most commonly used include maps, opt...
653 None The task of the jeweller is to design, make, r... king with precious metals and stones by cuttin... In jeweller's workshops, where the environment... Drills, grinders, bending tools, drawing and m...
654 None A goldsmith makes jewellery and other decorati... Beats out gold, silver and aluminium or may cu... In a goldsmith's workshop. There may be some d... Power hammers and certain hand and tools.
655 None A telecommunications worker does the less qual... Laying lines involves trench digging. S/he als... In an open air environment, where varying weat... Hand tools, painting gear, spades, mechanical ...
656 None The task of a specialist in animal husbandry i... Choosing and carrying out the technical proces... Partly in offices or similar rooms and partly ... Transport vehicles, computer technology, telep...
657 None The dental surgery assistant aids the dentist. Processing dental materials and preparation of... Primarily in a dental clinic where there may, ... Mainly small tools and equipment, especially f...
658 None The dental technician makes and repairs dentur... Making and repairing crowns, bridgework and co... Primarily in a dental laboratory, where contac... For example machinery to make casts, gas burne...
659 None The metal refiner improves the properties (har... This is done using various technologies (quenc... In heavy duty steel mills. The environment can... Various machines and equipment, ovens, furnace...
660 None The textile refiner enhances the properties of... Using various finishing methods, i.e. impregna... In workshops and factories where some noise an... Various machines and equipment, automatic prod...
661 None A sound engineer sets and operates the equipme... S/he also processes sound recordings to produc... In theatres, concert halls, film studios, reco... Sound reproduction equipment, sound recording ...
662 None The railway yard worker's job covers a wide ra... Operating the signals that govern the movement... On railway sites where changing weather condit... Fork lift trucks, conveyors, electronic safety...
663 None The job of an auxiliary worker in the textile ... His/her activities are as follows - he/she fil... The work is done in thread factories, spinning... The work is done using sewing machines and equ...
664 None An assistant worker in ore dressing operates e... . For example they may tend machines that brea... In ore dressing operations of metallurgical or... Crushers, conveyors, sifters, mills etc. and s...
665 None Railway workers perform less demanding work in... E.g. washing or cleaning wagons, greasing and ... In railway stations or depots, a noisy, dusty ... Various hand tools, cleaning and maintenance e...
666 None This worker does various less qualified work, ... E.g. coil winding, assembly of rotor/stator wi... In workshops and industrial factories and ther... Various machines, craft hand tools, and other ...

667 rows × 5 columns

True, we have no NAMES of disciplines here.

Now we are almost ready to do some NLP stuff. But before, we should do something else.


In [24]:
#Also, a list of connected professions. For example, philosopheres are related to:
closely_related_occupations = [
    'priest/minister of religion (clergyman/woman)',
    'historian',
    'political scientist'
    ]

less_related_to = [
    'commentator, reporter, journalist (commentator, reporter)',
    'newspaper editor/ sub-editor (journalist)',
    'art critic and historian',
    'songwriter',
    'tutor/ governess (home tutor)',
    'lecturer/researcher in linguistics',
    ]

Also, what is great, that you can see what skills are common for both. In website common are highlighted with red. Let's see philosopher and political scientist:


In [25]:
Main_knowledge_areas = [
    'native language, writing styles'
    'basics of philosophy'
    'social phenomena'
    'basics of history']

Characteristic_activities = [
    'writing texts on computer'
    'composing the contents of demanding texts']

To do:

* Scrap comparisons of skills
* Use NLP (NLTK)

Doing:

* Scrap for close and related occupations (replace 'provol' with 'pribuzn'). Some problems with getting the right soup.

Done:

* Scrap for descriptions