Main Script for Final Project

T. Martz-Oberlander, 2015-11-10

Change in pitch of a pipe organ from CO2

This script looks for mathematical relationships between changes in CO2 concentration and changes in pitch from a pipe organ. It loads and cleans the data, organizes new dataframes, creates figures, and performs statistical tests on the relationship between CO2 concentration and the frequency of sound from a note played on a pipe organ.

Here I pursue data analysis route 1 (as mentioned in my notebook.md file), which involves comparing one pitch dataframe with one dataframe of environmental characteristics taken at one sensor location. The two dataframes are matched by the time at which each data point was recorded.
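
As a rough illustration of matching two dataframes by recording time, here is a minimal sketch that uses `pd.merge_asof` to pair each pitch reading with the nearest sensor reading. The frames `pitch_demo` and `env_demo`, their values, and the two-minute tolerance are assumptions made for this example; this is one possible approach, not necessarily the one used later in the notebook.

```python
import pandas as pd

# Hypothetical pitch readings (one averaged frequency per timestamp)
pitch_demo = pd.DataFrame({
    "time": pd.to_datetime(["2010-04-17 11:00:30", "2010-04-17 11:05:40"]),
    "pitch_average": [523.68, 523.71],
})

# Hypothetical sensor log sampled every two minutes
env_demo = pd.DataFrame({
    "date_time": pd.to_datetime(["2010-04-17 11:00", "2010-04-17 11:02",
                                 "2010-04-17 11:04", "2010-04-17 11:06"]),
    "CO2_1": [452.4, 450.5, 450.5, 448.7],
})

# merge_asof needs both frames sorted by their time keys
matched = pd.merge_asof(
    pitch_demo.sort_values("time"),
    env_demo.sort_values("date_time"),
    left_on="time",
    right_on="date_time",
    direction="nearest",             # take the closest sensor reading in time
    tolerance=pd.Timedelta("2min"),  # leave unmatched if nothing is within 2 minutes
)
print(matched)
```

Rows with no sensor reading inside the tolerance window would get NaN in the sensor columns, which makes gaps between the two logs easy to spot.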



In [21]:

    
# I import useful libraries (with functions) so I can load and visualize my data
# I use Pandas because this dataset has word/string column titles, and I like the readability of its commands and the finished visual products that Pandas offers

import pandas as pd
import matplotlib.pyplot as plt
import re
import numpy as np

%matplotlib inline

#I want to be able to scroll through this notebook easily, so I limit the number of rows shown when a dataframe is displayed
pd.set_option('display.max_rows', 10)

Loading data into Python

First I load my data sets. I am working with two: one for pitch measurements and another for environmental characteristics (CO2, temperature (deg C), and relative humidity (RH, %) measurements). My environmental data come from sensing logger devices in the "Choir Division" section of the organ console.



In [22]:

    
#I import the environmental characteristics data file

env_choir_div=pd.read_csv('../data/CO2May.csv')

#comment by nick: here I am reassigning column names to remove blank spaces and units.
#I pass a single flat list; newer pandas versions treat a nested list as a MultiIndex
env_choir_div.columns=['test','date_time','temp','RH','CO2_1','CO2_2']

#I display my dataframe
env_choir_div









    Out[22]:






  
    
      
        test              date_time    temp      RH  CO2_1  CO2_2
0          1  04/17/10 11:00:00  AM  20.650  35.046  452.4  689.9
1          2  04/17/10 11:02:00  AM  20.579  35.105  450.5  677.0
2          3  04/17/10 11:04:00  AM  20.507  35.229  450.5  663.6
3          4  04/17/10 11:06:00  AM  20.460  35.291  448.7  652.0
4          5  04/17/10 11:08:00  AM  20.412  35.352  442.0  641.0
...      ...                    ...     ...     ...    ...    ...
10853  10854       2005-02-10 12:46  21.581  44.604  501.2  483.5
10854  10855       2005-02-10 12:48  21.581  44.604  504.3  482.9
10855  10856       2005-02-10 12:50  21.581  44.604  503.7  482.3
10856  10857       2005-02-10 12:52  21.604  44.575  503.1  481.7
10857  10858       2005-02-10 12:54  21.604  44.575  498.8  480.5
    
  

10858 rows × 6 columns



In [23]:

    
#comment by nick: converting the date_time column from strings to actual datetime values.
env_choir_div['date_time']= pd.to_datetime(env_choir_div['date_time'])

#print the new table and the data type of each column.
print(env_choir_div)

env_choir_div.dtypes









    



        test           date_time    temp      RH  CO2_1  CO2_2
0          1 2010-04-17 11:00:00  20.650  35.046  452.4  689.9
1          2 2010-04-17 11:02:00  20.579  35.105  450.5  677.0
2          3 2010-04-17 11:04:00  20.507  35.229  450.5  663.6
3          4 2010-04-17 11:06:00  20.460  35.291  448.7  652.0
4          5 2010-04-17 11:08:00  20.412  35.352  442.0  641.0
...      ...                 ...     ...     ...    ...    ...
10853  10854 2005-02-10 12:46:00  21.581  44.604  501.2  483.5
10854  10855 2005-02-10 12:48:00  21.581  44.604  504.3  482.9
10855  10856 2005-02-10 12:50:00  21.581  44.604  503.7  482.3
10856  10857 2005-02-10 12:52:00  21.604  44.575  503.1  481.7
10857  10858 2005-02-10 12:54:00  21.604  44.575  498.8  480.5

[10858 rows x 6 columns]






    Out[23]:





test                  int64
date_time    datetime64[ns]
temp                float64
RH                  float64
CO2_1               float64
CO2_2               float64
dtype: object

Now I know that my date_time column is read as an actual date and time value (the datetime64 type shown above), and not as an object or string, as it was before performing the "to_datetime" operation.
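
The raw display above shows two different timestamp layouts in the same column (for example `04/17/10 11:00:00 AM` and `2005-02-10 12:46`). As a hedged sketch, one way to make the conversion explicit is to parse each layout with its own format string and then combine the results. The series `raw` and the two format strings below are assumptions based on the rows displayed above, not a check of the whole file.

```python
import pandas as pd

# A few timestamp strings in the two layouts seen in the display above
raw = pd.Series(["04/17/10 11:00:00 AM", "04/17/10 11:02:00 AM", "2005-02-10 12:46"])

# Parse each layout explicitly; errors="coerce" turns non-matching strings into NaT
us_style = pd.to_datetime(raw, format="%m/%d/%y %I:%M:%S %p", errors="coerce")
iso_style = pd.to_datetime(raw, format="%Y-%m-%d %H:%M", errors="coerce")

# Fill the gaps in one result with the values from the other
parsed = us_style.fillna(iso_style)
print(parsed)
print(parsed.dtype)
```

Any value still missing after combining would point to a third layout that neither format string covers.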

Now, I will load the pitch data so I can compare the change in pitch of certain notes with the change in environmental characteristics.



In [24]:

    
#I import the pitch data file

pitch=pd.read_table('../Data/pitches.csv', sep=',')

#I display my dataframe
pitch









    Out[24]:






  
    
      
                 time    div  note   freq1   freq2   freq3   freq4   freq5   freq6   freq7   freq8   freq9
0     2010-04-13 8:37  pedal    c3  131.17  131.20  131.18  131.11  131.17  131.14  131.21     NaN     NaN
1     2010-04-13 8:37  pedal    c4  262.08  262.12  262.09  262.05  262.07  262.10  262.08     NaN     NaN
2     2010-04-13 8:40  swell    c3  131.42  131.47  131.45  131.47  131.50  131.47  131.45     NaN     NaN
3     2010-04-13 8:40  swell    c4  262.90  262.87  262.84  262.85  262.90  262.87  262.88     NaN     NaN
4     2010-04-13 8:42  great    c4  262.04  262.05  262.01  262.03  261.97  261.98  261.99     NaN     NaN
...               ...    ...  ...     ...     ...     ...     ...     ...     ...     ...     ...     ...
52   2010-04-17 10:35  pedal    c4  261.95  261.95  262.02  262.00  261.97  262.01  261.95  261.97     NaN
53   2010-04-17 10:37  great    c4  261.69  261.69  261.68  261.71  261.74  261.66  261.68  261.69  261.67
54    2010-04-17 9:54  choir    c5     NaN  523.73  523.61  523.66  523.77  523.63  523.65  523.69     NaN
55   2010-04-17 10:35  pedal    c4     NaN  261.95  261.95  262.02  262.00  261.97  262.01  261.95  261.97
56   2010-04-17 10:37  great    c4     NaN  261.69  261.69  261.68  261.71  261.74  261.66  261.68  261.69
    
  

57 rows × 12 columns

Munging data for plotting and stats comparison--Pitch data

Using regular expressions to find matching dated data points for comparison

To make a meaningful comparison between pitch and CO2, I need to format my two data files. First, for the pitches.csv file, I select the data that correspond to the environmental data file, which are the frequency data collected on 2010-04-17 in the "choir division".

I can make a regular expression to select these rows of pitch/sound frequency data.
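
A minimal sketch of applying a regular expression to a column of strings: the pandas `.str.contains` method tests the pattern against every row and returns one True/False per row, which can then be used to select rows. The series `times` below holds a few illustrative strings in the same layout as the 'time' column; it is not the full data set.

```python
import pandas as pd

# Illustrative timestamps stored as strings, as in the 'time' column
times = pd.Series(["2010-04-13 8:37", "2010-04-17 9:54", "2010-04-17 10:35"])

# ^ anchors the pattern to the start of each string
is_target_day = times.str.contains(r"^2010-04-17")

# Keep only the entries where the pattern matched
print(times[is_target_day])
```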



In [25]:

    
#First, let's work with the pitch data. I want to select the "choir" values in the "div" (division) column.

#Then, I can select the data from 2010-04-17 only, which is the date that can be matched with the temp, RH, and CO2 measurements in the other data file.



In [26]:

    
import re

#I read the file and split it into a list of lines at each '\n' newline character
with open('/Users/shubbymartz-oberlander/Desktop/t_final_project/organ_pitch/Data/pitches.csv', 'r') as f:
    lines = f.read().strip().split('\n')

#check whether '2010' appears in the first entry, [0], of the 'time' column of the pitch dataframe
'2010' in pitch['time'][0]









    Out[26]:





True



In [27]:

    
pitch['note'] #selects the 'note' column









    Out[27]:





0     c3
1     c4
2     c3
3     c4
4     c4
      ..
52    c4
53    c4
54    c5
55    c4
56    c4
Name: note, dtype: object



In [28]:

    
#My first attempt was pitch.loc[('2010-04-17' in pitch['time']), 'time'].
#The expression '2010-04-17' in pitch['time'] checks the index labels (0, 1, 2, ...) rather than the values,
#so it gives a single False, and pitch.loc[False, 'time'] raises "KeyError: False".

#Instead I build a boolean mask with one True/False per row.
#The 'time' column still holds strings here, so I can match the date part and keep every time recorded on 2010-04-17.
on_0417 = pitch['time'].str.startswith('2010-04-17')
pitch.loc[on_0417, 'time']

#I also tried RegEx's with the re module, but William said that is for a string or a list of strings;
#in a dataframe the .str methods apply the search to every row of a column



In [29]:

    
#I can then make a new dataframe with 2010-04-17 data only
#(a Python name cannot start with a digit, so I call it data_0417 rather than 17data)

data_0417 = pitch[pitch['time'].str.startswith('2010-04-17')]
data_0417

I then need to select notes from the "choir" cells in the "div" column of pitch (because my CO2 readings come from the choir division area in the chapel and so are spatially comparable).
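
A short sketch of the division filter, combined with a date filter using `&`. The frame `pitch_demo` copies three rows of the pitch table shown earlier (time, div, and note only); the variable names are chosen for this example.

```python
import pandas as pd

# Three rows taken from the pitch table above (frequency columns omitted)
pitch_demo = pd.DataFrame({
    "time": ["2010-04-13 8:37", "2010-04-17 9:54", "2010-04-17 10:35"],
    "div": ["pedal", "choir", "pedal"],
    "note": ["c3", "c5", "c4"],
})

# One boolean mask per condition
on_date = pitch_demo["time"].str.startswith("2010-04-17")
in_choir = pitch_demo["div"] == "choir"

# & keeps only the rows where both masks are True
choir_0417 = pitch_demo[on_date & in_choir]
print(choir_0417)
```

Keeping the two masks as separate variables makes it easy to check how many rows each condition keeps on its own before combining them.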




Making a useful/comparable pitch value with the mean of all pitch frequencies

To make a comparison between pitch and CO2, I need to find one pitch value for each time sample. I will do this by averaging the pitch data points in each row of my "pitches.csv" file.
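
As a small sketch of the row-wise average, `DataFrame.mean(axis=1)` computes one mean per row and skips NaN cells by default. The frame `pitch_demo` and its values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up frequencies with some missing readings
pitch_demo = pd.DataFrame({
    "note": ["c4", "c5"],
    "freq1": [262.0, np.nan],
    "freq2": [262.2, 523.6],
    "freq3": [np.nan, 523.8],
})

# Select every column whose name starts with 'freq'
freq_cols = [c for c in pitch_demo.columns if c.startswith("freq")]

# axis=1 averages across the columns of each row; NaN cells are ignored
pitch_demo["pitch_average"] = pitch_demo[freq_cols].mean(axis=1)
print(pitch_demo)
```

Because missing cells are skipped, each row's mean is taken over however many readings that row actually has.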



In [30]:

    
#I average the frequency columns across each row (axis=1); NaN cells are skipped by default

freq_cols = ['freq1', 'freq2', 'freq3', 'freq4', 'freq5', 'freq6', 'freq7', 'freq8', 'freq9']
pitch['pitch_average'] = pitch[freq_cols].mean(axis=1)


#pitch[['freq1', 'freq2', 'freq3', 'freq4', 'freq5', 'pitch_average']]



In [31]:

    
#I display two of the frequency columns next to the new average column
pitch[['freq1', 'freq2', 'pitch_average']]



In [32]:

    
#An earlier attempt gave NaNs for the mean pitch values, so I check the type of data in 'pitch_average'
#pitch['pitch_average'].dtype would show the type of that one column

#pitch.dtypes lists the data type of every column
pitch.dtypes









    Out[32]:





time              object
div               object
note              object
freq1            float64
freq2            float64
                  ...   
freq6            float64
freq7            float64
freq8            float64
freq9            float64
pitch_average    float64
dtype: object
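
One common way to end up with a float64 column that is entirely NaN is index alignment. `DataFrame.mean()` without `axis=1` returns one mean per column, labelled by column name. Assigning that result to a new column matches on row labels, finds none, and fills with NaN. The sketch below shows the effect on a made-up frame `demo`; whether this is what happened in the cell above is an assumption.

```python
import pandas as pd

demo = pd.DataFrame({"freq1": [1.0, 3.0], "freq2": [2.0, 4.0]})

col_means = demo.mean()        # one value per column, labelled 'freq1', 'freq2'
row_means = demo.mean(axis=1)  # one value per row, labelled 0, 1

# Labels 'freq1'/'freq2' do not match the row labels 0/1, so every cell is NaN
demo["wrong"] = col_means

# Row labels match, so each row gets its own mean
demo["right"] = row_means
print(demo)
```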

Munging data for plotting and stats comparison--Environmental data

As I did for pitches.csv, I need to select the rows of my choir_division.csv file for data logged on 2010-04-17. I will use a similar RegEx to do this.
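
Once a column has been converted to datetime values, an alternative to pattern matching is to compare dates directly. The sketch below is one possible approach on a made-up frame `env_demo`; the first and last CO2 values are invented to show rows that fall outside the target day.

```python
import pandas as pd

# Hypothetical sensor log spanning three calendar days
env_demo = pd.DataFrame({
    "date_time": pd.to_datetime(["2010-04-16 23:58", "2010-04-17 11:00",
                                 "2010-04-17 11:02", "2010-04-18 00:00"]),
    "CO2_1": [460.1, 452.4, 450.5, 455.0],
})

# normalize() sets every time of day to midnight, leaving only the date part
target = pd.Timestamp("2010-04-17")
same_day = env_demo["date_time"].dt.normalize() == target

env_0417 = env_demo[same_day]
print(env_0417)
```

This works regardless of how the dates were written in the original file, since the comparison is made on parsed values rather than on text.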



In [ ]:

    
#read in choir_division.csv and split it into a list of lines
with open('/Users/shubbymartz-oberlander/Desktop/t_final_project/organ_pitch/Data/Choir_Division_May.csv', 'r') as f:
    lines = f.read().strip().split('\n')



In [ ]:

    
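#search for lines that contain the given pattern "2010-04-17"
#re.search works on one string at a time, so I loop over the list of lines

matches = [line for line in lines if re.search('2010-04-17', line)]

#Passing the whole list "lines" (or the dataframe "env_choir_div") to re.search returns an error because neither is a single string.
#If this file writes dates like 04/17/10, as the CO2May.csv display above does, the pattern would need to be '04/17/10' instead.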






In [33]:

    
# STATS: print (lm.summary())
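
The statistics step is only a placeholder here. As a hedged sketch of what a first test could look like, the block below fits a straight line and computes a correlation coefficient with NumPy. The arrays `co2` and `freq` are invented, perfectly linear values used only to show the calls; they are not results from this project, and the `lm.summary()` note may refer to a different library.

```python
import numpy as np

# Invented matched readings: CO2 concentration and mean frequency in Hz
co2 = np.array([440.0, 450.0, 460.0, 470.0])
freq = np.array([523.60, 523.65, 523.70, 523.75])

# Least-squares straight line: freq = slope * co2 + intercept
slope, intercept = np.polyfit(co2, freq, 1)

# Pearson correlation coefficient between the two variables
r = np.corrcoef(co2, freq)[0, 1]

print("slope (Hz per unit CO2):", slope)
print("intercept (Hz):", intercept)
print("correlation r:", r)
```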


