Basic usage of PyDataset

Import the module's main function:


In [1]:
from pydataset import data


initiated datasets repo at: /Users/Aziz/.pydataset/

Don't worry about the log message. It appears only on the first import of the module, and it won't appear again unless that directory is missing or has been deleted.


In [2]:
from pydataset import data

Use the data() function to do one of three things:

1. Load a dataset from the repository

For example, to load the iris flower dataset:


In [3]:
iris = data('iris')

In [4]:
iris


Out[4]:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
7 4.6 3.4 1.4 0.3 setosa
8 5.0 3.4 1.5 0.2 setosa
9 4.4 2.9 1.4 0.2 setosa
10 4.9 3.1 1.5 0.1 setosa
11 5.4 3.7 1.5 0.2 setosa
12 4.8 3.4 1.6 0.2 setosa
13 4.8 3.0 1.4 0.1 setosa
14 4.3 3.0 1.1 0.1 setosa
15 5.8 4.0 1.2 0.2 setosa
16 5.7 4.4 1.5 0.4 setosa
17 5.4 3.9 1.3 0.4 setosa
18 5.1 3.5 1.4 0.3 setosa
19 5.7 3.8 1.7 0.3 setosa
20 5.1 3.8 1.5 0.3 setosa
21 5.4 3.4 1.7 0.2 setosa
22 5.1 3.7 1.5 0.4 setosa
23 4.6 3.6 1.0 0.2 setosa
24 5.1 3.3 1.7 0.5 setosa
25 4.8 3.4 1.9 0.2 setosa
26 5.0 3.0 1.6 0.2 setosa
27 5.0 3.4 1.6 0.4 setosa
28 5.2 3.5 1.5 0.2 setosa
29 5.2 3.4 1.4 0.2 setosa
30 4.7 3.2 1.6 0.2 setosa
31 4.8 3.1 1.6 0.2 setosa
32 5.4 3.4 1.5 0.4 setosa
33 5.2 4.1 1.5 0.1 setosa
34 5.5 4.2 1.4 0.2 setosa
35 4.9 3.1 1.5 0.2 setosa
36 5.0 3.2 1.2 0.2 setosa
37 5.5 3.5 1.3 0.2 setosa
38 4.9 3.6 1.4 0.1 setosa
39 4.4 3.0 1.3 0.2 setosa
40 5.1 3.4 1.5 0.2 setosa
41 5.0 3.5 1.3 0.3 setosa
42 4.5 2.3 1.3 0.3 setosa
43 4.4 3.2 1.3 0.2 setosa
44 5.0 3.5 1.6 0.6 setosa
45 5.1 3.8 1.9 0.4 setosa
46 4.8 3.0 1.4 0.3 setosa
47 5.1 3.8 1.6 0.2 setosa
48 4.6 3.2 1.4 0.2 setosa
49 5.3 3.7 1.5 0.2 setosa
50 5.0 3.3 1.4 0.2 setosa
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
55 6.5 2.8 4.6 1.5 versicolor
56 5.7 2.8 4.5 1.3 versicolor
57 6.3 3.3 4.7 1.6 versicolor
58 4.9 2.4 3.3 1.0 versicolor
59 6.6 2.9 4.6 1.3 versicolor
60 5.2 2.7 3.9 1.4 versicolor
61 5.0 2.0 3.5 1.0 versicolor
62 5.9 3.0 4.2 1.5 versicolor
63 6.0 2.2 4.0 1.0 versicolor
64 6.1 2.9 4.7 1.4 versicolor
65 5.6 2.9 3.6 1.3 versicolor
66 6.7 3.1 4.4 1.4 versicolor
67 5.6 3.0 4.5 1.5 versicolor
68 5.8 2.7 4.1 1.0 versicolor
69 6.2 2.2 4.5 1.5 versicolor
70 5.6 2.5 3.9 1.1 versicolor
71 5.9 3.2 4.8 1.8 versicolor
72 6.1 2.8 4.0 1.3 versicolor
73 6.3 2.5 4.9 1.5 versicolor
74 6.1 2.8 4.7 1.2 versicolor
75 6.4 2.9 4.3 1.3 versicolor
76 6.6 3.0 4.4 1.4 versicolor
77 6.8 2.8 4.8 1.4 versicolor
78 6.7 3.0 5.0 1.7 versicolor
79 6.0 2.9 4.5 1.5 versicolor
80 5.7 2.6 3.5 1.0 versicolor
81 5.5 2.4 3.8 1.1 versicolor
82 5.5 2.4 3.7 1.0 versicolor
83 5.8 2.7 3.9 1.2 versicolor
84 6.0 2.7 5.1 1.6 versicolor
85 5.4 3.0 4.5 1.5 versicolor
86 6.0 3.4 4.5 1.6 versicolor
87 6.7 3.1 4.7 1.5 versicolor
88 6.3 2.3 4.4 1.3 versicolor
89 5.6 3.0 4.1 1.3 versicolor
90 5.5 2.5 4.0 1.3 versicolor
91 5.5 2.6 4.4 1.2 versicolor
92 6.1 3.0 4.6 1.4 versicolor
93 5.8 2.6 4.0 1.2 versicolor
94 5.0 2.3 3.3 1.0 versicolor
95 5.6 2.7 4.2 1.3 versicolor
96 5.7 3.0 4.2 1.2 versicolor
97 5.7 2.9 4.2 1.3 versicolor
98 6.2 2.9 4.3 1.3 versicolor
99 5.1 2.5 3.0 1.1 versicolor
100 5.7 2.8 4.1 1.3 versicolor
101 6.3 3.3 6.0 2.5 virginica
102 5.8 2.7 5.1 1.9 virginica
103 7.1 3.0 5.9 2.1 virginica
104 6.3 2.9 5.6 1.8 virginica
105 6.5 3.0 5.8 2.2 virginica
106 7.6 3.0 6.6 2.1 virginica
107 4.9 2.5 4.5 1.7 virginica
108 7.3 2.9 6.3 1.8 virginica
109 6.7 2.5 5.8 1.8 virginica
110 7.2 3.6 6.1 2.5 virginica
111 6.5 3.2 5.1 2.0 virginica
112 6.4 2.7 5.3 1.9 virginica
113 6.8 3.0 5.5 2.1 virginica
114 5.7 2.5 5.0 2.0 virginica
115 5.8 2.8 5.1 2.4 virginica
116 6.4 3.2 5.3 2.3 virginica
117 6.5 3.0 5.5 1.8 virginica
118 7.7 3.8 6.7 2.2 virginica
119 7.7 2.6 6.9 2.3 virginica
120 6.0 2.2 5.0 1.5 virginica
121 6.9 3.2 5.7 2.3 virginica
122 5.6 2.8 4.9 2.0 virginica
123 7.7 2.8 6.7 2.0 virginica
124 6.3 2.7 4.9 1.8 virginica
125 6.7 3.3 5.7 2.1 virginica
126 7.2 3.2 6.0 1.8 virginica
127 6.2 2.8 4.8 1.8 virginica
128 6.1 3.0 4.9 1.8 virginica
129 6.4 2.8 5.6 2.1 virginica
130 7.2 3.0 5.8 1.6 virginica
131 7.4 2.8 6.1 1.9 virginica
132 7.9 3.8 6.4 2.0 virginica
133 6.4 2.8 5.6 2.2 virginica
134 6.3 2.8 5.1 1.5 virginica
135 6.1 2.6 5.6 1.4 virginica
136 7.7 3.0 6.1 2.3 virginica
137 6.3 3.4 5.6 2.4 virginica
138 6.4 3.1 5.5 1.8 virginica
139 6.0 3.0 4.8 1.8 virginica
140 6.9 3.1 5.4 2.1 virginica
141 6.7 3.1 5.6 2.4 virginica
142 6.9 3.1 5.1 2.3 virginica
143 5.8 2.7 5.1 1.9 virginica
144 6.8 3.2 5.9 2.3 virginica
145 6.7 3.3 5.7 2.5 virginica
146 6.7 3.0 5.2 2.3 virginica
147 6.3 2.5 5.0 1.9 virginica
148 6.5 3.0 5.2 2.0 virginica
149 6.2 3.4 5.4 2.3 virginica
150 5.9 3.0 5.1 1.8 virginica
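
A practical note on the column names: they are carried over from R and contain dots, so attribute-style access such as iris.Sepal.Length does not work, while bracket indexing does. The sketch below illustrates this on a three-row stand-in built by hand from rows 1, 51 and 101 of the output above; the name iris_sample is hypothetical, and with pydataset installed you would use the frame returned by data('iris') instead.

```python
import pandas as pd

# Three rows copied from Out[4] above (one per species).
# This is a stand-in for data('iris'), not the full dataset.
iris_sample = pd.DataFrame(
    {
        "Sepal.Length": [5.1, 7.0, 6.3],
        "Sepal.Width": [3.5, 3.2, 3.3],
        "Petal.Length": [1.4, 4.7, 6.0],
        "Petal.Width": [0.2, 1.4, 2.5],
        "Species": ["setosa", "versicolor", "virginica"],
    },
    index=[1, 51, 101],
)

# Bracket indexing works with dotted column names.
sepal_length = iris_sample["Sepal.Length"]

# Attribute access would look up a column called "Sepal", which does not exist.
has_sepal_attribute = hasattr(iris_sample, "Sepal")

# Count how many rows belong to each species.
species_counts = iris_sample["Species"].value_counts()

print(iris_sample.shape)
print(sepal_length)
print(species_counts)
```

On the full dataset, iris.head() or iris.shape is usually a more convenient first look than printing all 150 rows.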

2. Show the documentation of a dataset:


In [5]:
data('iris', show_doc=True)


iris

PyDataset Documentation (adopted from R Documentation. The displayed examples are in R)

## Edgar Anderson's Iris Data

### Description

This famous (Fisher's or Anderson's) iris data set gives the measurements in
centimeters of the variables sepal length and width and petal length and
width, respectively, for 50 flowers from each of 3 species of iris. The
species are _Iris setosa_, _versicolor_, and _virginica_.

### Usage

    iris
    iris3

### Format

`iris` is a data frame with 150 cases (rows) and 5 variables (columns) named
`Sepal.Length`, `Sepal.Width`, `Petal.Length`, `Petal.Width`, and `Species`.

`iris3` gives the same data arranged as a 3-dimensional array of size 50 by 4
by 3, as represented by S-PLUS. The first dimension gives the case number
within the species subsample, the second the measurements with names `Sepal
L.`, `Sepal W.`, `Petal L.`, and `Petal W.`, and the third the species.

### Source

Fisher, R. A. (1936) The use of multiple measurements in taxonomic problems.
_Annals of Eugenics_, **7**, Part II, 179–188.

The data were collected by Anderson, Edgar (1935). The irises of the Gaspe
Peninsula, _Bulletin of the American Iris Society_, **59**, 2–5.

### References

Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) _The New S Language_.
Wadsworth & Brooks/Cole. (has `iris3` as `iris`.)

### See Also

`matplot` some examples of which use `iris`.

### Examples

    dni3 <- dimnames(iris3)
    ii <- data.frame(matrix(aperm(iris3, c(1,3,2)), ncol = 4,
                            dimnames = list(NULL, sub(" L.",".Length",
                                            sub(" W.",".Width", dni3[[2]])))),
        Species = gl(3, 50, labels = sub("S", "s", sub("V", "v", dni3[[3]]))))
    all.equal(ii, iris) # TRUE


3. See the available datasets:


In [6]:
data()


Out[6]:
dataset_id title
0 AirPassengers Monthly Airline Passenger Numbers 1949-1960
1 BJsales Sales Data with Leading Indicator
2 BOD Biochemical Oxygen Demand
3 Formaldehyde Determination of Formaldehyde
4 HairEyeColor Hair and Eye Color of Statistics Students
5 InsectSprays Effectiveness of Insect Sprays
6 JohnsonJohnson Quarterly Earnings per Johnson & Johnson Share
7 LakeHuron Level of Lake Huron 1875-1972
8 LifeCycleSavings Intercountry Life-Cycle Savings Data
9 Nile Flow of the River Nile
10 OrchardSprays Potency of Orchard Sprays
11 PlantGrowth Results from an Experiment on Plant Growth
12 Puromycin Reaction Velocity of an Enzymatic Reaction
13 Titanic Survival of passengers on the Titanic
14 ToothGrowth The Effect of Vitamin C on Tooth Growth in Guinea Pigs
15 UCBAdmissions Student Admissions at UC Berkeley
16 UKDriverDeaths Road Casualties in Great Britain 1969-84
17 UKgas UK Quarterly Gas Consumption
18 USAccDeaths Accidental Deaths in the US 1973-1978
19 USArrests Violent Crime Rates by US State
20 USJudgeRatings Lawyers' Ratings of State Judges in the US Superior Court
21 USPersonalExpenditure Personal Expenditure Data
22 VADeaths Death Rates in Virginia (1940)
23 WWWusage Internet Usage per Minute
24 WorldPhones The World's Telephones
25 airmiles Passenger Miles on Commercial US Airlines, 1937-1960
26 airquality New York Air Quality Measurements
27 anscombe Anscombe's Quartet of 'Identical' Simple Linear Regressions
28 attenu The Joyner-Boore Attenuation Data
29 attitude The Chatterjee-Price Attitude Data
30 austres Quarterly Time Series of the Number of Australian Residents
31 cars Speed and Stopping Distances of Cars
32 chickwts Chicken Weights by Feed Type
33 co2 Mauna Loa Atmospheric CO2 Concentration
34 crimtab Student's 3000 Criminals Data
35 discoveries Yearly Numbers of Important Discoveries
36 esoph Smoking, Alcohol and (O)esophageal Cancer
37 euro Conversion Rates of Euro Currencies
38 faithful Old Faithful Geyser Data
39 freeny Freeny's Revenue Data
40 infert Infertility after Spontaneous and Induced Abortion
41 iris Edgar Anderson's Iris Data
42 islands Areas of the World's Major Landmasses
43 lh Luteinizing Hormone in Blood Samples
44 longley Longley's Economic Regression Data
45 lynx Annual Canadian Lynx trappings 1821-1934
46 morley Michelson Speed of Light Data
47 mtcars Motor Trend Car Road Tests
48 nhtemp Average Yearly Temperatures in New Haven
49 nottem Average Monthly Temperatures at Nottingham, 1920-1939
50 npk Classical N, P, K Factorial Experiment
51 occupationalStatus Occupational Status of Fathers and their Sons
52 precip Annual Precipitation in US Cities
53 presidents Quarterly Approval Ratings of US Presidents
54 pressure Vapor Pressure of Mercury as a Function of Temperature
55 quakes Locations of Earthquakes off Fiji
56 randu Random Numbers from Congruential Generator RANDU
57 rivers Lengths of Major North American Rivers
58 rock Measurements on Petroleum Rock Samples
59 sleep Student's Sleep Data
60 stackloss Brownlee's Stack Loss Plant Data
61 sunspot.month Monthly Sunspot Data, from 1749 to "Present"
62 sunspot.year Yearly Sunspot Data, 1700-1988
63 sunspots Monthly Sunspot Numbers, 1749-1983
64 swiss Swiss Fertility and Socioeconomic Indicators (1888) Data
65 treering Yearly Treering Data, -6000-1979
66 trees Girth, Height and Volume for Black Cherry Trees
67 uspop Populations Recorded by the US Census
68 volcano Topographic Information on Auckland's Maunga Whau Volcano
69 warpbreaks The Number of Breaks in Yarn during Weaving
70 women Average Heights and Weights for American Women
71 acme Monthly Excess Returns
72 aids Delay in AIDS Reporting in England and Wales
73 aircondit Failures of Air-conditioning Equipment
74 aircondit7 Failures of Air-conditioning Equipment
75 amis Car Speeding and Warning Signs
76 aml Remission Times for Acute Myelogenous Leukaemia
77 bigcity Population of U.S. Cities
78 brambles Spatial Location of Bramble Canes
79 breslow Smoking Deaths Among Doctors
80 calcium Calcium Uptake Data
81 cane Sugar-cane Disease Data
82 capability Simulated Manufacturing Process Data
83 catsM Weight Data for Domestic Cats
84 cav Position of Muscle Caveolae
... ... ...
672 students Student Risk Taking
673 suicides Crowd Baiting Behaviour and Suicides
674 toothpaste Toothpaste Data
675 voting House of Representatives Voting Data
676 water Mortality and Water Hardness
677 watervoles Water Voles Data
678 waves Electricity from Wave Power at Sea
679 weightgain Gain in Weight of Rats
680 womensrole Womens Role in Society
681 Bechtoldt Seven data sets showing a bifactor solution.
682 Bechtoldt.1 Seven data sets showing a bifactor solution.
683 Bechtoldt.2 Seven data sets showing a bifactor solution.
684 Dwyer 8 cognitive variables used by Dwyer for an example.
685 Gleser Example data from Gleser, Cronbach and Rajaratnam (1965) to show basic principles of g...
686 Gorsuch Example data set from Gorsuch (1997) for an example factor extension.
687 Harman.5 5 socio-economic variables from Harman (1967)
688 Harman.8 Correlations of eight physical variables (from Harman, 1966)
689 Harman.political Eight political variables used by Harman (1967) as example 8.17
690 Holzinger Seven data sets showing a bifactor solution.
691 Holzinger.9 Seven data sets showing a bifactor solution.
692 Reise Seven data sets showing a bifactor solution.
693 Schmid 12 variables created by Schmid and Leiman to show the Schmid-Leiman Transformation
694 Thurstone Seven data sets showing a bifactor solution.
695 Thurstone.33 Seven data sets showing a bifactor solution.
696 Tucker 9 Cognitive variables discussed by Tucker and Lewis (1973)
697 ability 16 ability items scored as correct or incorrect.
698 affect Two data sets of affect and arousal scores as a function of personality and movie cond...
699 bfi 25 Personality items representing 5 factors
700 bfi.dictionary 25 Personality items representing 5 factors
701 blot Bond's Logical Operations Test - BLOT
702 burt 11 emotional variables from Burt (1915)
703 cities Distances between 11 US cities
704 cubits Galton's example of the relationship between height and 'cubit' or forearm length
705 cushny A data set from Cushny and Peebles (1905) on the effect of three drugs on hours of sle...
706 epi Eysenck Personality Inventory (EPI) data for 3570 participants
707 epi.bfi 13 personality scales from the Eysenck Personality Inventory and Big 5 inventory
708 epi.dictionary Eysenck Personality Inventory (EPI) data for 3570 participants
709 galton Galton's Mid parent child height data
710 heights A data.frame of the Galton (1888) height and cubit data set.
711 income US family income from US census 2008
712 iqitems 16 multiple choice IQ items
713 msq 75 mood items from the Motivational State Questionnaire for 3896 participants
714 neo NEO correlation matrix from the NEO_PI_R manual
715 peas Galton's Peas
716 sat.act 3 Measures of ability: SATV, SATQ, ACT
717 withinBetween An example of the distinction between within group and between group correlations
718 Bosco Boscovich Data
719 CobarOre Cobar Ore data
720 Mammals Garland(1983) Data on Running Speed of Mammals
721 barro Barro Data
722 engel Engel Data
723 uis UIS Drug Treatment study data
724 dietox Growth curves of pigs in a 3x3 factorial experiment
725 koch Ordinal Data from Koch
726 ohio Ohio Children Wheeze Status
727 respdis Clustered Ordinal Respiratory Disorder
728 respiratory Data from a clinical trial comparing two treatments for a respiratory illness
729 seizure Epiliptic Seizures
730 sitka89 Growth of Sitka Spruce Trees
731 spruce Log-size of 79 Sitka spruce trees
732 liver Liver related laboratory data
733 portpirie Rain, wavesurge and portpirie datasets.
734 rain Rain, wavesurge and portpirie datasets.
735 summer Air pollution data, separately for summer and winter months
736 wavesurge Rain, wavesurge and portpirie datasets.
737 winter Air pollution data, separately for summer and winter months
738 arthritis Rheumatoid Arthritis Clinical Trial
739 housing Homeless Data
740 bmw Daily Log Returns on BMW Share Price
741 danish Danish Fire Insurance Claims
742 nidd.annual The River Nidd Data
743 nidd.thresh The River Nidd Data
744 siemens Daily Log Returns on Siemens Share Price
745 sp.raw SP Data to June 1993
746 spto87 SP Return Data to October 1987
747 Dyestuff Yield of dyestuff by batch
748 Dyestuff2 Yield of dyestuff by batch
749 InstEval University Lecture/Instructor Evaluations by Students at ETH
750 Pastes Paste strength by batch and cask
751 Penicillin Variation in penicillin testing
752 VerbAgg Verbal Aggression item responses
753 cake Breakage Angle of Chocolate Cakes
754 cbpp Contagious bovine pleuropneumonia
755 grouseticks Data on red grouse ticks from Elston et al. 2001
756 sleepstudy Reaction times in a sleep deprivation study

757 rows × 2 columns

Or, see the docstring for more details:


In [7]:
help(data)


Help on function data in module pydataset:

data(item=None, show_doc=False)
    loads a datasaet (from in-modules datasets) in a dataframe data structure.
    
    Args:
        item (str)      : name of the dataset to load.
        show_doc (bool) : to show the dataset's documentation.
    
    Examples:
    
    >>> iris = data('iris')
    loaded: iris <class 'pandas.core.frame.DataFrame'>
    
    >>> data('Titanic', show_doc=True)
        : returns the dataset's documentation.
    
    >>> data()
        : like help(), returns a dataframe [Item, Title]
        for a list of the available datasets.

More examples

Since a dataset is loaded as a pandas DataFrame, you can use it like one:


In [9]:
trees = data('trees')

In [10]:
trees.head()


Out[10]:
Girth Height Volume
1 8.3 70 10.3
2 8.6 65 10.3
3 8.8 63 10.2
4 10.5 72 16.4
5 10.7 81 18.8
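
As a rough illustration of "using it like a DataFrame", the sketch below runs a few ordinary pandas operations on the five rows shown in Out[10]. The frame is rebuilt by hand so the example runs without pydataset; the names trees_head and VolumePerGirth are hypothetical, and the same calls should apply to the full frame returned by data('trees').

```python
import pandas as pd

# The five rows shown in Out[10], standing in for data('trees').head().
trees_head = pd.DataFrame(
    {
        "Girth": [8.3, 8.6, 8.8, 10.5, 10.7],
        "Height": [70, 65, 63, 72, 81],
        "Volume": [10.3, 10.3, 10.2, 16.4, 18.8],
    },
    index=[1, 2, 3, 4, 5],
)

# Column-wise summary statistics (count, mean, std, min, quartiles, max).
summary = trees_head.describe()

# Boolean filtering: keep only the rows with Height above 70.
tall = trees_head[trees_head["Height"] > 70]

# Derived column computed from two existing columns.
trees_head["VolumePerGirth"] = trees_head["Volume"] / trees_head["Girth"]

print(summary)
print(tall)
```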

In [11]:
trees.plot()


Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x106958ac8>

And of course you can search for a dataset by name, too!

For example, to see if the anscombe dataset is available:


In [12]:
data()[data().dataset_id == 'anscombe']


Out[12]:
dataset_id title
27 anscombe Anscombe's Quartet of 'Identical' Simple Linear Regressions
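
The exact-match filter above can be loosened into a keyword search over both columns. The following is a minimal sketch, not part of pydataset: find_datasets is a hypothetical helper, and the small catalog frame is a hand-built stand-in for the frame returned by data(), with three rows copied from Out[6].

```python
import pandas as pd


def find_datasets(catalog: pd.DataFrame, keyword: str) -> pd.DataFrame:
    """Return the catalog rows whose id or title contains the keyword.

    The match is a case-insensitive plain substring test (no regex).
    """
    in_id = catalog["dataset_id"].str.contains(keyword, case=False, regex=False)
    in_title = catalog["title"].str.contains(keyword, case=False, regex=False)
    return catalog[in_id | in_title]


# Stand-in for data(): same column names, three rows taken from Out[6].
catalog = pd.DataFrame(
    {
        "dataset_id": ["anscombe", "iris", "trees"],
        "title": [
            "Anscombe's Quartet of 'Identical' Simple Linear Regressions",
            "Edgar Anderson's Iris Data",
            "Girth, Height and Volume for Black Cherry Trees",
        ],
    },
    index=[27, 41, 66],
)

matches = find_datasets(catalog, "quartet")
print(matches)
```

With pydataset installed, passing data() as the first argument, for example find_datasets(data(), 'sleep'), should search the full list in the same way. Calling data() once and reusing the result also avoids building the catalog twice, as the one-line filter does.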

In [13]:
anscombe = data('anscombe')

In [14]:
anscombe.plot(kind='scatter', x='x1', y='y1', figsize=(4,2));pass
anscombe.plot(kind='scatter', x='x2', y='y1', figsize=(4,2));pass
anscombe.plot(kind='scatter', x='x1', y='y2', figsize=(4,2));pass
anscombe.plot(kind='scatter', x='x2', y='y2', figsize=(4,2));pass