Import the main function of the module:
In [1]:
from pydataset import data
initiated datasets repo at: /Users/Aziz/.pydataset/
Don't worry about the log message; it appears only on the first import of the module. (It won't appear again unless that directory doesn't exist or has been deleted.)
In [2]:
from pydataset import data
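The log line suggests that the datasets are cached in a `.pydataset` folder under the user's home directory. The sketch below checks for that folder to predict whether the message would be printed on the next import. The location is an assumption inferred from the path in the log above, and `pydataset_home` and `first_import_message_expected` are hypothetical helper names, not part of pydataset:

```python
import tempfile
from pathlib import Path


def pydataset_home(home=None):
    """Assumed cache location: <home>/.pydataset (inferred from the log line)."""
    base = Path(home) if home is not None else Path.home()
    return base / ".pydataset"


def first_import_message_expected(home=None):
    """True when the assumed cache folder is missing, so the log would show."""
    return not pydataset_home(home).is_dir()


# Demonstrate on a throwaway directory instead of the real home folder.
with tempfile.TemporaryDirectory() as tmp:
    before = first_import_message_expected(tmp)  # folder not created yet
    pydataset_home(tmp).mkdir()
    after = first_import_message_expected(tmp)   # folder now exists

print(before, after)
```

Calling `first_import_message_expected()` with no argument checks the real home directory.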
1. Load a dataset from the repository
For example, to load the iris flower dataset:
In [3]:
iris = data('iris')
In [4]:
iris
Out[4]:
     Sepal.Length  Sepal.Width  Petal.Length  Petal.Width     Species
  1           5.1          3.5           1.4          0.2      setosa
  2           4.9          3.0           1.4          0.2      setosa
  3           4.7          3.2           1.3          0.2      setosa
  4           4.6          3.1           1.5          0.2      setosa
  5           5.0          3.6           1.4          0.2      setosa
  6           5.4          3.9           1.7          0.4      setosa
  7           4.6          3.4           1.4          0.3      setosa
  8           5.0          3.4           1.5          0.2      setosa
  9           4.4          2.9           1.4          0.2      setosa
 10           4.9          3.1           1.5          0.1      setosa
 11           5.4          3.7           1.5          0.2      setosa
 12           4.8          3.4           1.6          0.2      setosa
 13           4.8          3.0           1.4          0.1      setosa
 14           4.3          3.0           1.1          0.1      setosa
 15           5.8          4.0           1.2          0.2      setosa
 16           5.7          4.4           1.5          0.4      setosa
 17           5.4          3.9           1.3          0.4      setosa
 18           5.1          3.5           1.4          0.3      setosa
 19           5.7          3.8           1.7          0.3      setosa
 20           5.1          3.8           1.5          0.3      setosa
 21           5.4          3.4           1.7          0.2      setosa
 22           5.1          3.7           1.5          0.4      setosa
 23           4.6          3.6           1.0          0.2      setosa
 24           5.1          3.3           1.7          0.5      setosa
 25           4.8          3.4           1.9          0.2      setosa
 26           5.0          3.0           1.6          0.2      setosa
 27           5.0          3.4           1.6          0.4      setosa
 28           5.2          3.5           1.5          0.2      setosa
 29           5.2          3.4           1.4          0.2      setosa
 30           4.7          3.2           1.6          0.2      setosa
 31           4.8          3.1           1.6          0.2      setosa
 32           5.4          3.4           1.5          0.4      setosa
 33           5.2          4.1           1.5          0.1      setosa
 34           5.5          4.2           1.4          0.2      setosa
 35           4.9          3.1           1.5          0.2      setosa
 36           5.0          3.2           1.2          0.2      setosa
 37           5.5          3.5           1.3          0.2      setosa
 38           4.9          3.6           1.4          0.1      setosa
 39           4.4          3.0           1.3          0.2      setosa
 40           5.1          3.4           1.5          0.2      setosa
 41           5.0          3.5           1.3          0.3      setosa
 42           4.5          2.3           1.3          0.3      setosa
 43           4.4          3.2           1.3          0.2      setosa
 44           5.0          3.5           1.6          0.6      setosa
 45           5.1          3.8           1.9          0.4      setosa
 46           4.8          3.0           1.4          0.3      setosa
 47           5.1          3.8           1.6          0.2      setosa
 48           4.6          3.2           1.4          0.2      setosa
 49           5.3          3.7           1.5          0.2      setosa
 50           5.0          3.3           1.4          0.2      setosa
 51           7.0          3.2           4.7          1.4  versicolor
 52           6.4          3.2           4.5          1.5  versicolor
 53           6.9          3.1           4.9          1.5  versicolor
 54           5.5          2.3           4.0          1.3  versicolor
 55           6.5          2.8           4.6          1.5  versicolor
 56           5.7          2.8           4.5          1.3  versicolor
 57           6.3          3.3           4.7          1.6  versicolor
 58           4.9          2.4           3.3          1.0  versicolor
 59           6.6          2.9           4.6          1.3  versicolor
 60           5.2          2.7           3.9          1.4  versicolor
 61           5.0          2.0           3.5          1.0  versicolor
 62           5.9          3.0           4.2          1.5  versicolor
 63           6.0          2.2           4.0          1.0  versicolor
 64           6.1          2.9           4.7          1.4  versicolor
 65           5.6          2.9           3.6          1.3  versicolor
 66           6.7          3.1           4.4          1.4  versicolor
 67           5.6          3.0           4.5          1.5  versicolor
 68           5.8          2.7           4.1          1.0  versicolor
 69           6.2          2.2           4.5          1.5  versicolor
 70           5.6          2.5           3.9          1.1  versicolor
 71           5.9          3.2           4.8          1.8  versicolor
 72           6.1          2.8           4.0          1.3  versicolor
 73           6.3          2.5           4.9          1.5  versicolor
 74           6.1          2.8           4.7          1.2  versicolor
 75           6.4          2.9           4.3          1.3  versicolor
 76           6.6          3.0           4.4          1.4  versicolor
 77           6.8          2.8           4.8          1.4  versicolor
 78           6.7          3.0           5.0          1.7  versicolor
 79           6.0          2.9           4.5          1.5  versicolor
 80           5.7          2.6           3.5          1.0  versicolor
 81           5.5          2.4           3.8          1.1  versicolor
 82           5.5          2.4           3.7          1.0  versicolor
 83           5.8          2.7           3.9          1.2  versicolor
 84           6.0          2.7           5.1          1.6  versicolor
 85           5.4          3.0           4.5          1.5  versicolor
 86           6.0          3.4           4.5          1.6  versicolor
 87           6.7          3.1           4.7          1.5  versicolor
 88           6.3          2.3           4.4          1.3  versicolor
 89           5.6          3.0           4.1          1.3  versicolor
 90           5.5          2.5           4.0          1.3  versicolor
 91           5.5          2.6           4.4          1.2  versicolor
 92           6.1          3.0           4.6          1.4  versicolor
 93           5.8          2.6           4.0          1.2  versicolor
 94           5.0          2.3           3.3          1.0  versicolor
 95           5.6          2.7           4.2          1.3  versicolor
 96           5.7          3.0           4.2          1.2  versicolor
 97           5.7          2.9           4.2          1.3  versicolor
 98           6.2          2.9           4.3          1.3  versicolor
 99           5.1          2.5           3.0          1.1  versicolor
100           5.7          2.8           4.1          1.3  versicolor
101           6.3          3.3           6.0          2.5   virginica
102           5.8          2.7           5.1          1.9   virginica
103           7.1          3.0           5.9          2.1   virginica
104           6.3          2.9           5.6          1.8   virginica
105           6.5          3.0           5.8          2.2   virginica
106           7.6          3.0           6.6          2.1   virginica
107           4.9          2.5           4.5          1.7   virginica
108           7.3          2.9           6.3          1.8   virginica
109           6.7          2.5           5.8          1.8   virginica
110           7.2          3.6           6.1          2.5   virginica
111           6.5          3.2           5.1          2.0   virginica
112           6.4          2.7           5.3          1.9   virginica
113           6.8          3.0           5.5          2.1   virginica
114           5.7          2.5           5.0          2.0   virginica
115           5.8          2.8           5.1          2.4   virginica
116           6.4          3.2           5.3          2.3   virginica
117           6.5          3.0           5.5          1.8   virginica
118           7.7          3.8           6.7          2.2   virginica
119           7.7          2.6           6.9          2.3   virginica
120           6.0          2.2           5.0          1.5   virginica
121           6.9          3.2           5.7          2.3   virginica
122           5.6          2.8           4.9          2.0   virginica
123           7.7          2.8           6.7          2.0   virginica
124           6.3          2.7           4.9          1.8   virginica
125           6.7          3.3           5.7          2.1   virginica
126           7.2          3.2           6.0          1.8   virginica
127           6.2          2.8           4.8          1.8   virginica
128           6.1          3.0           4.9          1.8   virginica
129           6.4          2.8           5.6          2.1   virginica
130           7.2          3.0           5.8          1.6   virginica
131           7.4          2.8           6.1          1.9   virginica
132           7.9          3.8           6.4          2.0   virginica
133           6.4          2.8           5.6          2.2   virginica
134           6.3          2.8           5.1          1.5   virginica
135           6.1          2.6           5.6          1.4   virginica
136           7.7          3.0           6.1          2.3   virginica
137           6.3          3.4           5.6          2.4   virginica
138           6.4          3.1           5.5          1.8   virginica
139           6.0          3.0           4.8          1.8   virginica
140           6.9          3.1           5.4          2.1   virginica
141           6.7          3.1           5.6          2.4   virginica
142           6.9          3.1           5.1          2.3   virginica
143           5.8          2.7           5.1          1.9   virginica
144           6.8          3.2           5.9          2.3   virginica
145           6.7          3.3           5.7          2.5   virginica
146           6.7          3.0           5.2          2.3   virginica
147           6.3          2.5           5.0          1.9   virginica
148           6.5          3.0           5.2          2.0   virginica
149           6.2          3.4           5.4          2.3   virginica
150           5.9          3.0           5.1          1.8   virginica
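As a sketch of what can be done with the loaded frame, the snippet below averages petal length per species. To stay self-contained it rebuilds six of the rows shown above by hand (`iris_sample` is a stand-in name, not something pydataset provides); with pydataset installed, the frame returned by `data('iris')` should work in its place:

```python
import pandas as pd

# Rows 1, 2, 51, 52, 101 and 102 copied from the output above.
iris_sample = pd.DataFrame(
    {
        "Sepal.Length": [5.1, 4.9, 7.0, 6.4, 6.3, 5.8],
        "Sepal.Width": [3.5, 3.0, 3.2, 3.2, 3.3, 2.7],
        "Petal.Length": [1.4, 1.4, 4.7, 4.5, 6.0, 5.1],
        "Petal.Width": [0.2, 0.2, 1.4, 1.5, 2.5, 1.9],
        "Species": ["setosa", "setosa", "versicolor",
                    "versicolor", "virginica", "virginica"],
    },
    index=[1, 2, 51, 52, 101, 102],
)

# Column names contain dots, so use bracket access rather than attributes.
petal_means = iris_sample.groupby("Species")["Petal.Length"].mean()
print(petal_means)
```

The result is a Series indexed by species name, so `petal_means["setosa"]` returns the mean for that group only.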
2. Show the documentation of a certain dataset:
In [5]:
data('iris', show_doc=True)
iris
PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)
## Edgar Anderson's Iris Data
### Description
This famous (Fisher's or Anderson's) iris data set gives the measurements in
centimeters of the variables sepal length and width and petal length and
width, respectively, for 50 flowers from each of 3 species of iris. The
species are _Iris setosa_, _versicolor_, and _virginica_.
### Usage
iris
iris3
### Format
`iris` is a data frame with 150 cases (rows) and 5 variables (columns) named
`Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, and `Species`.
`iris3` gives the same data arranged as a 3-dimensional array of size 50 by 4
by 3, as represented by S-PLUS. The first dimension gives the case number
within the species subsample, the second the measurements with names `Sepal
L.`, `Sepal W.`, `Petal L.`, and `Petal W.`, and the third the species.
### Source
Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems.
_Annals of Eugenics_, **7**, Part II, 179–188.
The data were collected by Anderson, Edgar (1935). The irises of the Gaspe
Peninsula, _Bulletin of the American Iris Society_, **59**, 2–5.
### References
Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S Language_.
Wadsworth & Brooks/Cole. (has `iris3` as `iris`.)
### See Also
`matplot` some examples of which use `iris`.
### Examples
dni3 <- dimnames(iris3)
ii <- data.frame(matrix(aperm(iris3, c(1,3,2)), ncol = 4,
dimnames = list(NULL, sub(" L.",".Length",
sub(" W.",".Width", dni3[[2]])))),
Species = gl(3, 50, labels = sub("S", "s", sub("V", "v", dni3[[3]]))))
all.equal(ii, iris) # TRUE
3. See the available datasets:
In [6]:
data()
Out[6]:
     dataset_id             title
  0  AirPassengers          Monthly Airline Passenger Numbers 1949-1960
  1  BJsales                Sales Data with Leading Indicator
  2  BOD                    Biochemical Oxygen Demand
  3  Formaldehyde           Determination of Formaldehyde
  4  HairEyeColor           Hair and Eye Color of Statistics Students
  5  InsectSprays           Effectiveness of Insect Sprays
  6  JohnsonJohnson         Quarterly Earnings per Johnson & Johnson Share
  7  LakeHuron              Level of Lake Huron 1875-1972
  8  LifeCycleSavings       Intercountry Life-Cycle Savings Data
  9  Nile                   Flow of the River Nile
 10  OrchardSprays          Potency of Orchard Sprays
 11  PlantGrowth            Results from an Experiment on Plant Growth
 12  Puromycin              Reaction Velocity of an Enzymatic Reaction
 13  Titanic                Survival of passengers on the Titanic
 14  ToothGrowth            The Effect of Vitamin C on Tooth Growth in Guinea Pigs
 15  UCBAdmissions          Student Admissions at UC Berkeley
 16  UKDriverDeaths         Road Casualties in Great Britain 1969-84
 17  UKgas                  UK Quarterly Gas Consumption
 18  USAccDeaths            Accidental Deaths in the US 1973-1978
 19  USArrests              Violent Crime Rates by US State
 20  USJudgeRatings         Lawyers' Ratings of State Judges in the US Superior Court
 21  USPersonalExpenditure  Personal Expenditure Data
 22  VADeaths               Death Rates in Virginia (1940)
 23  WWWusage               Internet Usage per Minute
 24  WorldPhones            The World's Telephones
 25  airmiles               Passenger Miles on Commercial US Airlines, 1937-1960
 26  airquality             New York Air Quality Measurements
 27  anscombe               Anscombe's Quartet of 'Identical' Simple Linear Regressions
 28  attenu                 The Joyner-Boore Attenuation Data
 29  attitude               The Chatterjee-Price Attitude Data
 30  austres                Quarterly Time Series of the Number of Australian Residents
 31  cars                   Speed and Stopping Distances of Cars
 32  chickwts               Chicken Weights by Feed Type
 33  co2                    Mauna Loa Atmospheric CO2 Concentration
 34  crimtab                Student's 3000 Criminals Data
 35  discoveries            Yearly Numbers of Important Discoveries
 36  esoph                  Smoking, Alcohol and (O)esophageal Cancer
 37  euro                   Conversion Rates of Euro Currencies
 38  faithful               Old Faithful Geyser Data
 39  freeny                 Freeny's Revenue Data
 40  infert                 Infertility after Spontaneous and Induced Abortion
 41  iris                   Edgar Anderson's Iris Data
 42  islands                Areas of the World's Major Landmasses
 43  lh                     Luteinizing Hormone in Blood Samples
 44  longley                Longley's Economic Regression Data
 45  lynx                   Annual Canadian Lynx trappings 1821-1934
 46  morley                 Michelson Speed of Light Data
 47  mtcars                 Motor Trend Car Road Tests
 48  nhtemp                 Average Yearly Temperatures in New Haven
 49  nottem                 Average Monthly Temperatures at Nottingham, 1920-1939
 50  npk                    Classical N, P, K Factorial Experiment
 51  occupationalStatus     Occupational Status of Fathers and their Sons
 52  precip                 Annual Precipitation in US Cities
 53  presidents             Quarterly Approval Ratings of US Presidents
 54  pressure               Vapor Pressure of Mercury as a Function of Temperature
 55  quakes                 Locations of Earthquakes off Fiji
 56  randu                  Random Numbers from Congruential Generator RANDU
 57  rivers                 Lengths of Major North American Rivers
 58  rock                   Measurements on Petroleum Rock Samples
 59  sleep                  Student's Sleep Data
 60  stackloss              Brownlee's Stack Loss Plant Data
 61  sunspot.month          Monthly Sunspot Data, from 1749 to "Present"
 62  sunspot.year           Yearly Sunspot Data, 1700-1988
 63  sunspots               Monthly Sunspot Numbers, 1749-1983
 64  swiss                  Swiss Fertility and Socioeconomic Indicators (1888) Data
 65  treering               Yearly Treering Data, -6000-1979
 66  trees                  Girth, Height and Volume for Black Cherry Trees
 67  uspop                  Populations Recorded by the US Census
 68  volcano                Topographic Information on Auckland's Maunga Whau Volcano
 69  warpbreaks             The Number of Breaks in Yarn during Weaving
 70  women                  Average Heights and Weights for American Women
 71  acme                   Monthly Excess Returns
 72  aids                   Delay in AIDS Reporting in England and Wales
 73  aircondit              Failures of Air-conditioning Equipment
 74  aircondit7             Failures of Air-conditioning Equipment
 75  amis                   Car Speeding and Warning Signs
 76  aml                    Remission Times for Acute Myelogenous Leukaemia
 77  bigcity                Population of U.S. Cities
 78  brambles               Spatial Location of Bramble Canes
 79  breslow                Smoking Deaths Among Doctors
 80  calcium                Calcium Uptake Data
 81  cane                   Sugar-cane Disease Data
 82  capability             Simulated Manufacturing Process Data
 83  catsM                  Weight Data for Domestic Cats
 84  cav                    Position of Muscle Caveolae
...  ...                    ...
672  students               Student Risk Taking
673  suicides               Crowd Baiting Behaviour and Suicides
674  toothpaste             Toothpaste Data
675  voting                 House of Representatives Voting Data
676  water                  Mortality and Water Hardness
677  watervoles             Water Voles Data
678  waves                  Electricity from Wave Power at Sea
679  weightgain             Gain in Weight of Rats
680  womensrole             Womens Role in Society
681  Bechtoldt              Seven data sets showing a bifactor solution.
682  Bechtoldt.1            Seven data sets showing a bifactor solution.
683  Bechtoldt.2            Seven data sets showing a bifactor solution.
684  Dwyer                  8 cognitive variables used by Dwyer for an example.
685  Gleser                 Example data from Gleser, Cronbach and Rajaratnam (1965) to show basic principles of g...
686  Gorsuch                Example data set from Gorsuch (1997) for an example factor extension.
687  Harman.5               5 socio-economic variables from Harman (1967)
688  Harman.8               Correlations of eight physical variables (from Harman, 1966)
689  Harman.political       Eight political variables used by Harman (1967) as example 8.17
690  Holzinger              Seven data sets showing a bifactor solution.
691  Holzinger.9            Seven data sets showing a bifactor solution.
692  Reise                  Seven data sets showing a bifactor solution.
693  Schmid                 12 variables created by Schmid and Leiman to show the Schmid-Leiman Transformation
694  Thurstone              Seven data sets showing a bifactor solution.
695  Thurstone.33           Seven data sets showing a bifactor solution.
696  Tucker                 9 Cognitive variables discussed by Tucker and Lewis (1973)
697  ability                16 ability items scored as correct or incorrect.
698  affect                 Two data sets of affect and arousal scores as a function of personality and movie cond...
699  bfi                    25 Personality items representing 5 factors
700  bfi.dictionary         25 Personality items representing 5 factors
701  blot                   Bond's Logical Operations Test - BLOT
702  burt                   11 emotional variables from Burt (1915)
703  cities                 Distances between 11 US cities
704  cubits                 Galton's example of the relationship between height and 'cubit' or forearm length
705  cushny                 A data set from Cushny and Peebles (1905) on the effect of three drugs on hours of sle...
706  epi                    Eysenck Personality Inventory (EPI) data for 3570 participants
707  epi.bfi                13 personality scales from the Eysenck Personality Inventory and Big 5 inventory
708  epi.dictionary         Eysenck Personality Inventory (EPI) data for 3570 participants
709  galton                 Galton's Mid parent child height data
710  heights                A data.frame of the Galton (1888) height and cubit data set.
711  income                 US family income from US census 2008
712  iqitems                16 multiple choice IQ items
713  msq                    75 mood items from the Motivational State Questionnaire for 3896 participants
714  neo                    NEO correlation matrix from the NEO_PI_R manual
715  peas                   Galton's Peas
716  sat.act                3 Measures of ability: SATV, SATQ, ACT
717  withinBetween          An example of the distinction between within group and between group correlations
718  Bosco                  Boscovich Data
719  CobarOre               Cobar Ore data
720  Mammals                Garland(1983) Data on Running Speed of Mammals
721  barro                  Barro Data
722  engel                  Engel Data
723  uis                    UIS Drug Treatment study data
724  dietox                 Growth curves of pigs in a 3x3 factorial experiment
725  koch                   Ordinal Data from Koch
726  ohio                   Ohio Children Wheeze Status
727  respdis                Clustered Ordinal Respiratory Disorder
728  respiratory            Data from a clinical trial comparing two treatments for a respiratory illness
729  seizure                Epiliptic Seizures
730  sitka89                Growth of Sitka Spruce Trees
731  spruce                 Log-size of 79 Sitka spruce trees
732  liver                  Liver related laboratory data
733  portpirie              Rain, wavesurge and portpirie datasets.
734  rain                   Rain, wavesurge and portpirie datasets.
735  summer                 Air pollution data, separately for summer and winter months
736  wavesurge              Rain, wavesurge and portpirie datasets.
737  winter                 Air pollution data, separately for summer and winter months
738  arthritis              Rheumatoid Arthritis Clinical Trial
739  housing                Homeless Data
740  bmw                    Daily Log Returns on BMW Share Price
741  danish                 Danish Fire Insurance Claims
742  nidd.annual            The River Nidd Data
743  nidd.thresh            The River Nidd Data
744  siemens                Daily Log Returns on Siemens Share Price
745  sp.raw                 SP Data to June 1993
746  spto87                 SP Return Data to October 1987
747  Dyestuff               Yield of dyestuff by batch
748  Dyestuff2              Yield of dyestuff by batch
749  InstEval               University Lecture/Instructor Evaluations by Students at ETH
750  Pastes                 Paste strength by batch and cask
751  Penicillin             Variation in penicillin testing
752  VerbAgg                Verbal Aggression item responses
753  cake                   Breakage Angle of Chocolate Cakes
754  cbpp                   Contagious bovine pleuropneumonia
755  grouseticks            Data on red grouse ticks from Elston et al. 2001
756  sleepstudy             Reaction times in a sleep deprivation study
757 rows × 2 columns
Or, see the docstring for more details:
In [7]:
help(data)
Help on function data in module pydataset:
data(item=None, show_doc=False)
    loads a datasaet (from in-modules datasets) in a dataframe data structure.
    Args:
        item (str)      : name of the dataset to load.
        show_doc (bool) : to show the dataset's documentation.
    Examples:
        >>> iris = data('iris')
        loaded: iris <class 'pandas.core.frame.DataFrame'>
        >>> data('Titanic', show_doc=True)
        : returns the dataset's documentation.
        >>> data()
        : like help(), returns a dataframe [Item, Title]
        for a list of the available datasets.
Since a dataset is loaded as a pandas DataFrame, you can use it like any other DataFrame:
In [9]:
trees = data('trees')
In [10]:
trees.head()
Out[10]:
   Girth  Height  Volume
1    8.3      70    10.3
2    8.6      65    10.3
3    8.8      63    10.2
4   10.5      72    16.4
5   10.7      81    18.8
In [11]:
trees.plot()
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x106958ac8>
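Beyond `head()` and `plot()`, the usual DataFrame operations apply as well. The sketch below filters rows, averages a column, and measures a correlation. It rebuilds the five `trees.head()` rows shown above by hand so that it runs without pydataset (`trees_head` is a stand-in name); the same calls should work on the full frame returned by `data('trees')`:

```python
import pandas as pd

# The five rows printed by trees.head() above.
trees_head = pd.DataFrame(
    {
        "Girth": [8.3, 8.6, 8.8, 10.5, 10.7],
        "Height": [70, 65, 63, 72, 81],
        "Volume": [10.3, 10.3, 10.2, 16.4, 18.8],
    },
    index=[1, 2, 3, 4, 5],
)

# Boolean indexing: keep only trees strictly taller than 70.
tall = trees_head[trees_head["Height"] > 70]

# Column aggregate and pairwise Pearson correlation.
mean_volume = trees_head["Volume"].mean()
girth_volume_corr = trees_head["Girth"].corr(trees_head["Volume"])

print(tall)
print(mean_volume, girth_volume_corr)
```

These figures describe only the five sample rows, not the whole dataset.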
And of course you can search a dataset by name, too!
For example, to see if the anscombe dataset is available:
In [12]:
data()[data().dataset_id == 'anscombe']
Out[12]:
    dataset_id  title
27  anscombe    Anscombe's Quartet of 'Identical' Simple Linear Regressions
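The exact-match lookup above can be loosened to a case-insensitive substring search over both columns. This is a minimal sketch: `search` is a hypothetical helper, not part of pydataset, and `catalog` is a hand-copied excerpt of the listing that stands in for the frame returned by `data()`:

```python
import pandas as pd

# A few rows copied from the dataset listing; a stand-in for data().
catalog = pd.DataFrame(
    {
        "dataset_id": ["airquality", "anscombe", "iris",
                       "sunspot.month", "sunspot.year", "sunspots"],
        "title": [
            "New York Air Quality Measurements",
            "Anscombe's Quartet of 'Identical' Simple Linear Regressions",
            "Edgar Anderson's Iris Data",
            'Monthly Sunspot Data, from 1749 to "Present"',
            "Yearly Sunspot Data, 1700-1988",
            "Monthly Sunspot Numbers, 1749-1983",
        ],
    },
    index=[26, 27, 41, 61, 62, 63],
)


def search(frame, text):
    """Return rows whose dataset_id or title contains text, ignoring case."""
    # regex=False treats text literally, so dots in ids are not wildcards.
    in_id = frame["dataset_id"].str.contains(text, case=False, regex=False)
    in_title = frame["title"].str.contains(text, case=False, regex=False)
    return frame[in_id | in_title]


sunspot_hits = search(catalog, "sunspot")
anscombe_hits = search(catalog, "ANSCOMBE")
print(sunspot_hits)
```

Passing the result of `data()` as `frame` would search the full listing in a single call instead of calling `data()` twice.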
In [13]:
anscombe = data('anscombe')
In [14]:
anscombe.plot(kind='scatter', x='x1', y='y1', figsize=(4,2));pass
anscombe.plot(kind='scatter', x='x2', y='y1', figsize=(4,2));pass
anscombe.plot(kind='scatter', x='x1', y='y2', figsize=(4,2));pass
anscombe.plot(kind='scatter', x='x2', y='y2', figsize=(4,2));pass
Content source: iamaziz/PyDataset