The overall project looks for mathematical relationships between changes in CO2 concentration and changes in the pitch of a pipe organ. It loads and cleans data, organizes new dataframes, creates figures, and performs statistical tests on the relationship between CO2 concentration and the frequency of a note played on the organ.
This uploader script:
1) Loads the organ note pitch data file
2) Munges it (converts the time stamps to a datetime column) and checks that the frequency columns contain floats
3) Outputs a new dataframe with mean pitch values, grouped by area of the chapel for comparison with same-location environmental data.
Here I pursue data analysis route 1 (as mentioned in my notebook.md file), which involves comparing one pitch dataframe with one dataframe of environmental characteristics taken at one sensor location. The two dataframes are matched by the time at which each measurement was recorded.
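Matching two dataframes by time rarely works with an exact join, because the logger and the pitch recordings are unlikely to share identical time stamps. A minimal sketch of one possible approach, assuming pandas' `merge_asof` is acceptable here: the demo values, the `co2` column name, and the five-minute tolerance are all hypothetical and not taken from the real data files.

```python
import pandas as pd

# Hypothetical pitch readings (column names mirror this notebook).
pitch_demo = pd.DataFrame({
    "time": pd.to_datetime(["2010-04-13 10:00:30", "2010-04-13 10:10:10"]),
    "mean_freq": [524.1, 524.4],
})

# Hypothetical logger readings; the 'co2' column name is an assumption.
env_demo = pd.DataFrame({
    "time": pd.to_datetime([
        "2010-04-13 10:00:00",
        "2010-04-13 10:05:00",
        "2010-04-13 10:10:00",
    ]),
    "co2": [450.0, 470.0, 495.0],
})

# merge_asof needs both frames sorted by the key column.
aligned = pd.merge_asof(
    pitch_demo.sort_values("time"),
    env_demo.sort_values("time"),
    on="time",
    direction="nearest",             # take the closest logger reading
    tolerance=pd.Timedelta("5min"),  # leave NaN if nothing is close enough
)
print(aligned)
```

Each pitch row picks up the environmental reading closest to it in time, so the two kinds of measurement end up side by side in one dataframe.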
In [2]:
# I import useful libraries (with functions) so I can visualize my data
# I use Pandas because this dataset has string column titles, and I like the readable commands and polished output that Pandas offers
import pandas as pd
import matplotlib.pyplot as plt
import re
import numpy as np
%matplotlib inline
#I want to be able to easily scroll through this notebook so I limit the length of the appearance of my dataframes
from pandas import set_option
set_option('display.max_rows', 10)
First I load my data sets. I am working with two: one for pitch measurements and another for environmental characteristics (CO2, temperature (deg C), and relative humidity (RH) (%) measurements). My data come from environmental sensing logger devices in the "Choir Division" section of the organ console.
In [3]:
#I import a pitch data file
#Comment by Nick: changed the path the data is loaded from, making it compatible with cloned copies of the project
pitch=pd.read_table('../Data/pitches.csv', sep=',')
#assigning columns names
#pitch.columns=[['date_time','section','note','freq1','freq2','freq3', 'freq4', 'freq5', 'freq6', 'freq7', 'freq8', 'freq9']]
#I display my dataframe
pitch
Out[3]:
In [4]:
output = pitch['freq7'].dtype
output
Out[4]:
In [5]:
#Test to see if data is a float
def test_data_type(data):
    '''Check to see if a column contains only floats'''
    obs = data.dtype #I check the dtype of the column passed into the test function
    exp = 'float64'
    assert obs == exp, 'Data is not a float'
    return
test_data_type(pitch['freq7'])
In [6]:
#Tell Python that my time column holds datetime values, so it won't be read as a string or object
pitch['time']= pd.to_datetime(pitch['time'])
#print the new table and the type of data to check that all columns are in line with the column names
print(pitch)
#Check the type of data in each column. This shows integers, floats, and datetimes, which is what I need for the analysis.
pitch.dtypes
Out[6]:
I have pitch and I have CO2. I want to make a continuous chart of expected pitch over time, calculated from CO2, then plot the measured pitch points over it to see whether changes in CO2 affect how closely the predicted and measured pitch values agree.
How do I generalize the column labels so I can use this script for a file with different number of freq measurements?
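One possible answer, sketched under the assumption that every frequency column is named `freq` followed by a number: select the columns by pattern instead of listing them by hand. The small dataframe below is hypothetical and stands in for a file with only three frequency readings per row.

```python
import pandas as pd

# Hypothetical frame with only three frequency readings per row.
demo = pd.DataFrame({
    "div": ["choir", "choir"],
    "freq1": [524.0, 523.9],
    "freq2": [524.2, 524.1],
    "freq3": [524.4, 524.6],
})

# Pick every column named 'freq' followed by digits, however many there are.
freq_cols = demo.filter(regex=r"^freq\d+$").columns

# Row-wise summaries across whichever frequency columns were found.
demo["mean_freq"] = demo[freq_cols].mean(axis=1)
demo["median_freq"] = demo[freq_cols].median(axis=1)

print(list(freq_cols))
print(demo)
```

Because the pattern is anchored at both ends, derived columns such as `mean_freq` are not picked up if the cell is run a second time.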
In [7]:
#Calculate MEDIAN of pitch values for each time stamp (disabled for now; earlier failed attempts removed)
#pitch['median_freq'] = np.median(pitch[['freq1','freq2','freq3', 'freq4', 'freq5', 'freq6', 'freq7', 'freq8', 'freq9']], axis=1)
In [8]:
#Calculate standard mean of frequency values, add another column to dataframe
pitch['mean_freq'] = np.mean(pitch[['freq1','freq2','freq3', 'freq4', 'freq5', 'freq6', 'freq7', 'freq8', 'freq9']], axis=1)
In [9]:
#Calculate the standard deviation of the mean_freq column; this is a single value, repeated in every row
pitch['stdev_freq'] = np.std(pitch['mean_freq'])
#Now my data frame has the mean and standard deviation for frequency
pitch
Out[9]:
These data points can be plotted on top of a calculated pitch line.
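To show how much the individual readings scatter around each plotted point, a spread per time stamp may be more useful than one value for the whole column. A minimal sketch of the difference, using a hypothetical three-column dataframe; `ddof=0` is passed explicitly because that is NumPy's default, whereas pandas' `.std()` defaults to `ddof=1`.

```python
import pandas as pd

# Hypothetical frequency readings: one row per time stamp.
demo = pd.DataFrame({
    "freq1": [524.0, 523.9],
    "freq2": [524.2, 524.1],
    "freq3": [524.4, 524.6],
})

# One number for the whole table: the spread of the row means.
overall_std = demo.mean(axis=1).std(ddof=0)

# One number per row: the spread of the readings within each time stamp.
row_std = demo.std(axis=1, ddof=0)

print(overall_std)
print(row_std)
```

The per-row version gives a different value for each time stamp, which is what a plot of measured points against a calculated line would need.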
In [10]:
#Group by section to compare to environmental measurements from the choir division
#selecting pitch values for the choir division, which is closest to the CO2 sensor
organized_pitch = pitch.groupby(['div']).get_group('choir')
# "organized_pitch" is a new dataframe holding only the choir rows of the "pitch" dataframe
organized_pitch
Out[10]:
In [11]:
#Save this data frame as a file which can be called into the plotting script
organized_pitch.to_csv('pitch.csv', sep=',') #columns='time', 'div', 'note', 'freq1', 'freq2')
I now have an input (raw file) and an output (measured pitch dataframe for the selected section). This output can be loaded by my next script, the "env_data.py" file, for comparison between measured and calculated pitch (calculated pitch is derived from environmental data).
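When the saved file is read back in, two details are easy to miss: `to_csv` writes the dataframe index as an unnamed first column, and the time stamps come back as plain strings unless they are parsed again. A minimal sketch of the round trip, using an in-memory buffer and hypothetical values in place of the real `pitch.csv`:

```python
import io

import pandas as pd

# Hypothetical stand-in for organized_pitch; the non-contiguous index
# mimics what is left behind after selecting one group of rows.
demo = pd.DataFrame(
    {
        "time": pd.to_datetime(["2010-04-13 10:00:00", "2010-04-16 11:30:00"]),
        "div": ["choir", "choir"],
        "mean_freq": [524.1, 524.5],
    },
    index=[3, 7],
)

# Write to an in-memory buffer instead of a file on disk.
buffer = io.StringIO()
demo.to_csv(buffer)  # the index becomes an unnamed first column
buffer.seek(0)

# Restore the index and parse the time stamps on the way back in.
reloaded = pd.read_csv(buffer, index_col=0, parse_dates=["time"])
print(reloaded.dtypes)
```

Without `index_col=0` the old index would appear as an extra column, and without `parse_dates` the `time` column would need another `pd.to_datetime` call.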
In [17]:
def make_plot(data):
    '''Plot mean pitch against time for the dataframe passed in'''
    plt.figure(figsize=(8,5))
    fig = plt.plot(data['time'], data['mean_freq'], color = 'navy')
    plt.title('Pitch of C5 Pipe Organ Note')
    plt.ylabel('Sound Frequency (Hz)')
    plt.xlabel('Time of Sample Taken (Apr. 13, 16 and 17, 2010)')
    plt.show()
    return fig
make_plot(organized_pitch)
Out[17]:
In [ ]:
#We can see that the pitch ranges from about 523.9 to 524.7 Hz