This script looks for mathematical relationships between changes in CO2 concentration and changes in the pitch of a pipe organ. It loads and cleans the data, organizes new dataframes, creates figures, and performs statistical tests on the relationship between CO2 concentration and the sound frequency of a note played on the organ.
Here I pursue data analysis route 1 (as mentioned in my notebook.md file), which compares one pitch dataframe with one dataframe of environmental characteristics recorded at a single sensor location. The two dataframes are matched by the time at which each measurement was recorded.
In [21]:
# I import useful libraries (with functions) so I can organize and visualize my data
# I use Pandas because this dataset has word/string column titles, and I like the readable commands and finished visual products that Pandas offers
import pandas as pd
import matplotlib.pyplot as plt
import re
import numpy as np
%matplotlib inline
#I want to be able to easily scroll through this notebook so I limit the length of the appearance of my dataframes
from pandas import set_option
set_option('display.max_rows', 10)
First I upload my data sets. I am working with two: one for pitch measurements and another for environmental characteristics (CO2, temperature (deg C), and relative humidity (RH) (%) measurements). My data come from environmental sensing logger devices in the "Choir Division" section of the organ console.
In [ ]:
In [22]:
#I import the environmental characteristics data file
#Load the data file with the pathname not from your computer, but within your final project directory
env_choir_div=pd.read_table('../data/CO2May.csv', sep=',')
#I rename the columns after loading, because the dataframe must exist before its columns can be renamed
env_choir_div.columns=['test','date_time','temp','RH','CO2_1','CO2_2']
#I display my dataframe
env_choir_div
Out[22]:
In [23]:
#Comment by Nick: convert your date_time column from text to actual datetime values.
env_choir_div['date_time']= pd.to_datetime(env_choir_div['date_time'])
#print the new table and the type of data.
print(env_choir_div)
env_choir_div.dtypes
Out[23]:
Now I know that my date_time column is read as an actual date and time value (the datetime64 data type), and not as an object or string, as it was before performing the "to_datetime" operation.
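As a quick check of what the conversion changes, here is a minimal sketch on a made-up two-row frame. The timestamps and CO2 values are invented for illustration and are not taken from my logger file.

```python
import pandas as pd

# Toy frame with invented timestamps stored as plain text
demo = pd.DataFrame({'date_time': ['2010-04-17 09:00:00', '2010-04-17 09:15:00'],
                     'CO2_1': [410.0, 415.0]})
before = demo['date_time'].dtype      # a text dtype before conversion

demo['date_time'] = pd.to_datetime(demo['date_time'])
after = demo['date_time'].dtype       # a datetime64 dtype after conversion

# Once the column is datetime, time arithmetic works directly
elapsed = demo['date_time'].iloc[1] - demo['date_time'].iloc[0]
print(before, after, elapsed)
```

The subtraction on the last lines only works after the conversion, which is why I convert before trying to match the two data files by time.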
Now, I will upload the pitch data so I can compare change in pitch of certain notes and change in environmental characteristics.
In [24]:
#I import the pitch data file
#Same thing with the filepath here.
pitch=pd.read_table('../data/pitches.csv', sep=',')
#I display my dataframe
pitch
Out[24]:
In [12]:
# If you look at the end of your table (lines 54 to 56), the columns are shifted.
# You need to fix this first, especially because you want to work with the 2010-04-17 date.
# To fix this, first isolate the last 3 rows that you need to fix.
# .ix has been removed from recent versions of pandas; .loc selects the same rows by index label
pitch2 = pitch.loc[54:]
pitch2
Out[12]:
In [13]:
# Now you need to drop the first column with the NaN values
pitch3 = pitch2.drop('time', axis=1)
pitch3
Out[13]:
In [14]:
# Now you need to rename your columns
# Use a single list of names (a nested list can create a multi-level column index)
pitch3.columns = ['time', 'div', 'note', 'freq1', 'freq2', 'freq3', 'freq4', 'freq5', 'freq6', 'freq7', 'freq8']
pitch3
Out[14]:
In [15]:
# Now you can merge this fixed data back with the original data frame and delete the old rows containing this data
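The merge step described in the comment above is not written out in this notebook. One possible sketch, on an invented three-row stand-in for the pitch table (the names `pitch_demo`, `fixed`, and `pitch_fixed` and all values are hypothetical), is to keep the good rows, append the repaired rows with `pd.concat`, and renumber the index:

```python
import numpy as np
import pandas as pd

# Invented stand-in for the pitch table: the last row is shifted one column to the right
pitch_demo = pd.DataFrame({
    'time':  ['2010-04-10', '2010-04-10', np.nan],
    'div':   ['choir', 'great', '2010-04-17'],
    'note':  ['c1', 'c1', 'choir'],
    'freq1': [65.4, 65.5, 'c1'],
    'freq2': [65.3, 65.6, 65.2],
})

shifted = pitch_demo['time'].isna()                  # rows that need repair
fixed = pitch_demo[shifted].drop('time', axis=1)     # drop the empty first column
fixed.columns = ['time', 'div', 'note', 'freq1']     # every name moves one place left

# Keep the good rows, append the repaired rows, and renumber the index
pitch_fixed = pd.concat([pitch_demo[~shifted], fixed], ignore_index=True)
# The frequency column held a stray note name, so convert it back to numbers
pitch_fixed['freq1'] = pd.to_numeric(pitch_fixed['freq1'])
print(pitch_fixed)
```

Because the repaired rows have one fewer column, their last frequency column is filled with NaN after the concat; that is expected and not a new error.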
To make a meaningful comparison between pitch and CO2, I need to format my two data files. First, for the pitches.csv file, I select the data that correspond to the environmental data file: the frequency data collected on 2010-04-17 in the "choir division".
I can make a regular expression to select these rows of pitch/sound frequency data.
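A minimal sketch of that idea, assuming the 'time' column holds text such as '2010-04-17 10:30' (the rows below are invented): pandas can apply a regular expression to every cell of a column with `str.contains`, which returns one True/False per row.

```python
import pandas as pd

# Invented rows; I assume 'time' holds text and may contain missing cells
demo = pd.DataFrame({'time': ['2010-04-10 10:00', '2010-04-17 10:30', '2010-04-17 11:00', None],
                     'note': ['c1', 'c1', 'g1', 'c1']})

# The pattern '^2010-04-17' matches cells that start with that date
# na=False treats missing cells as non-matches instead of propagating NaN
mask = demo['time'].str.contains(r'^2010-04-17', na=False)
april_17 = demo[mask]
print(april_17)
```

The mask keeps every time value on that date, whatever the hour and minute.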
In [25]:
#First, let's work with the pitch. I want to select the "choir" values in the "div" (division) column.
#Then, I can select the data from 2010-04-17 only, which is the date that can be matched with the temp, RH, and CO2 measurements in the other data file.
In [26]:
import re
#I import the file with '\n' new line separators
#Same thing with the filepath here
lines = open('../data/pitches.csv', 'r').read().strip().split('\n' )
#search for '2010' in the 'time' column of the pitch dataframe
'2010' in pitch['time'][0] #checks one item only, the first entry [0] of the 'time' column
Out[26]:
In [27]:
pitch['note'] #selects the 'note' column
Out[27]:
In [28]:
#select the 'time' values of every row recorded on 2010-04-17
pitch.loc[pitch['time'].str.contains('2010-04-17', na=False), 'time']
#I first tried the boolean statement ('2010-04-17' in pitch['time']), but "in" checks the index labels of a column, not its values,
#and it returns a single True/False, whereas I am searching for part of the DateTime values (the date part) and want all time values associated with 2010-04-17.
#I also tried RegEx's with re.search, but William said that is for a single string, not a whole dataframe;
#in a dataframe you should use a search function like the one above
#re.search('2010-04-17', pitch) #looking for these date values
#show [new data lines]
In [29]:
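#I can then make a new dataframe with 2010-04-17 data only
#(a variable name cannot start with a number, so I call it data_17 rather than 17data)
data_17 = pitch[pitch['time'].str.contains('2010-04-17', na=False)]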
I then need to select notes from the "choir" cells in the "div" column of pitch (because my CO2 readings come from the choir division area in the chapel and so are spatially comparable).
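A minimal sketch of that selection, assuming the division label is spelled "choir" (possibly with different capitalization or stray spaces; I have not checked the exact spelling in the file, and the rows below are invented):

```python
import pandas as pd

# Invented rows standing in for the pitch table
demo = pd.DataFrame({'div': ['choir', 'great', 'Choir ', 'swell'],
                     'note': ['c1', 'c1', 'g1', 'c1'],
                     'freq1': [65.4, 65.5, 98.0, 65.3]})

# Normalize the labels first so 'Choir ' and 'choir' count as the same division
is_choir = demo['div'].str.strip().str.lower() == 'choir'
choir_only = demo[is_choir]
print(choir_only)
```

The same mask can be combined with a date mask using `&` to keep only choir rows from a single day.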
In [ ]:
To make a comparison between pitch and CO2, I need to find one pitch value for each time sample. I will do this by averaging the pitch data points in each row of my "pitches.csv" file.
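A minimal sketch of the row-wise average on invented values. I assume here that some frequency cells may have been read as text, which is one possible reason a mean can come out as NaN; `pd.to_numeric` with `errors='coerce'` turns unreadable cells into NaN so the rest of the row can still be averaged.

```python
import numpy as np
import pandas as pd

# Invented rows: some numbers are stored as text and one cell is unreadable
demo = pd.DataFrame({'note': ['c1', 'g1'],
                     'freq1': ['65.4', '98.0'],
                     'freq2': ['65.6', 'n/a'],
                     'freq3': [65.5, 98.2]})

freq_cols = ['freq1', 'freq2', 'freq3']
# Convert text to numbers; anything unreadable becomes NaN instead of raising an error
numeric = demo[freq_cols].apply(pd.to_numeric, errors='coerce')
# axis=1 averages across the columns of each row; NaN cells are skipped by default
demo['pitch_average'] = numeric.mean(axis=1)
print(demo[['note', 'pitch_average']])
```

Note that a row with a missing reading is averaged over fewer values, so it may be worth checking how many readings each row actually has.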
In [30]:
#I average across the frequency columns of each row (axis=1), converting any text entries to numbers first
freq_cols = ['freq1', 'freq2', 'freq3', 'freq4', 'freq5', 'freq6', 'freq7', 'freq8', 'freq9']
pitch['pitch_average'] = pitch[freq_cols].apply(pd.to_numeric, errors='coerce').mean(axis=1)
#pitch[['freq1', 'freq2', 'freq3', 'freq4', 'freq5', 'pitch_average']]
In [31]:
#I display selected columns by passing a list of column names inside square brackets
pitch[['freq1', 'freq2', 'pitch_average']]
In [32]:
#I want to find out why the mean pitch values I calculated are NaNs, so I check the type of data in each column
#(columns listed as "object" hold text rather than numbers)
#pitch['pitch_average'].dtype would check the new column alone
pitch.dtypes
Out[32]:
As I did for pitches.csv, I need to select the rows of my choir_division.csv file that were logged on 2010-04-17. I will use a similar RegEx to do this.
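Since the date_time column of the environmental dataframe was already converted with `pd.to_datetime`, an alternative to a RegEx is to compare the calendar date directly. This is only a sketch on invented readings (the names `env_demo` and `env_17` are hypothetical):

```python
import pandas as pd

# Invented sensor readings around the day of interest
env_demo = pd.DataFrame({
    'date_time': pd.to_datetime(['2010-04-16 23:45', '2010-04-17 00:00',
                                 '2010-04-17 12:30', '2010-04-18 00:00']),
    'CO2_1': [400.0, 405.0, 430.0, 410.0]})

# dt.normalize() sets every timestamp to midnight, leaving only the calendar date
target = pd.Timestamp('2010-04-17')
same_day = env_demo['date_time'].dt.normalize() == target
env_17 = env_demo[same_day]
print(env_17)
```

This keeps every reading from 00:00 up to, but not including, midnight of the next day.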
In [ ]:
#call in choir_division.csv with line separation
lines = open('/Users/shubbymartz-oberlander/Desktop/t_final_project/organ_pitch/Data/Choir_Division_May.csv', 'r').read().strip().split('\n' )
In [ ]:
#search for lines that contain the given pattern "2010-04-17"
#re.search expects a single string, so passing the list "lines" (or the dataframe "env_choir_div") returns an error
#instead I test each line in turn and keep the ones that match
lines_17 = [line for line in lines if re.search('2010-04-17', line)]
lines_17
In [ ]:
In [33]:
# STATS: print (lm.summary())
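The statistics step is not written yet. As a rough sketch of where it could go, the code below pairs each pitch sample with the nearest sensor reading and then computes a correlation and a straight-line fit. All values are invented, the 10-minute matching window is an assumption, and a plain numpy least-squares line stands in for whatever `lm` model I end up fitting.

```python
import numpy as np
import pandas as pd

# Invented pitch averages and CO2 readings for one day
pitch_demo = pd.DataFrame({
    'time': pd.to_datetime(['2010-04-17 10:02', '2010-04-17 11:01',
                            '2010-04-17 12:03', '2010-04-17 13:00']),
    'pitch_average': [65.40, 65.44, 65.47, 65.52]})
env_demo = pd.DataFrame({
    'date_time': pd.to_datetime(['2010-04-17 10:00', '2010-04-17 11:00',
                                 '2010-04-17 12:00', '2010-04-17 13:00']),
    'CO2_1': [410.0, 450.0, 480.0, 530.0]})

# Pair each pitch sample with the nearest sensor reading within 10 minutes
# (merge_asof needs both tables sorted by their time column)
paired = pd.merge_asof(pitch_demo.sort_values('time'),
                       env_demo.sort_values('date_time'),
                       left_on='time', right_on='date_time',
                       direction='nearest', tolerance=pd.Timedelta('10min'))
paired = paired.dropna(subset=['CO2_1'])   # drop pitch samples with no nearby reading

# Pearson correlation and a least-squares line of pitch against CO2
r = np.corrcoef(paired['CO2_1'], paired['pitch_average'])[0, 1]
slope, intercept = np.polyfit(paired['CO2_1'], paired['pitch_average'], 1)
print(f'r = {r:.3f}, slope = {slope:.5f}, intercept = {intercept:.3f}')
```

With only a handful of paired samples, a high correlation says very little, so the real test needs as many matched time points as the two files allow.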
In [ ]: