In this notebook, we use data from an outside source as well as data generated by the class to explore the relationships between formants, gender, and height.
3 - Vowel Spaces
Remember that to run a cell, you can either click the play button in the toolbar, or you can press shift and enter on your keyboard. To get a quick review of Jupyter notebooks, you can look at the VOT Notebook. Make sure to run the following cell before you get started.
In [ ]:
# DON'T FORGET TO RUN THIS CELL
import math
import numpy as np
import pandas as pd
import seaborn as sns
import datascience as ds
import matplotlib.pyplot as plt
sns.set_style('darkgrid')
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')
We will start off by exploring TIMIT data taken from 8 different regions. These measurements are taken at the midpoint of vowels, where vowel boundaries were determined automatically using forced alignment.
Before we can work with the data, we have to load our dataset. The first line of code below reads in our data and creates a dataframe. The second line displays the timit dataframe, but instead of displaying the whole dataframe, the method .head shows only the first 5 rows.
In [ ]:
timit = pd.read_csv('data/timitvowels.csv')
timit.head()
Look at the dataframe you created and try to figure out what each column measures. Each column represents a different attribute; see the following table for more information.
| Column Name | Details |
|---|---|
| speaker | Unique speaker ID |
| gender | Speaker’s self-reported gender |
| region | Speaker dialect region number |
| word | Lexical item (from sentence prompt) |
| vowel | Vowel ID |
| duration | Vowel duration (seconds) |
| F1/F2/F3/f0 | f0 and F1-F3 (Hz) |
Sometimes data is encoded with an identifier, or key, to save space and simplify calculations. Each of those keys corresponds to a specific value. If you look at the region column, you will notice that all of the values are numbers. Each of those numbers corresponds to a region. For example, in our first row the speaker, cjf0, is from region 1, which corresponds to New England. Below is a table with all of the keys for region.
| Key | Region |
|---|---|
| 1 | New England |
| 2 | Northern |
| 3 | North Midland |
| 4 | South Midland |
| 5 | Southern |
| 6 | New York City |
| 7 | Western |
| 8 | Army Brat |
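When a readable label is more convenient than the numeric key, the key can be translated with a lookup dictionary. The following is a minimal sketch on a made-up three-row dataframe; `region_names`, `toy`, and the `region_name` column are hypothetical names that do not appear elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical lookup dictionary built from the key table above.
region_names = {
    1: "New England", 2: "Northern", 3: "North Midland", 4: "South Midland",
    5: "Southern", 6: "New York City", 7: "Western", 8: "Army Brat",
}

# Toy stand-in for the region column of timit; speakers and values are made up.
toy = pd.DataFrame({"speaker": ["s1", "s2", "s3"], "region": [1, 5, 8]})

# Translate each numeric key into its region name.
toy["region_name"] = toy["region"].map(region_names)
print(toy)
```

Keeping the numeric column alongside the new one means nothing is lost if a later step still expects the keys.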
When inspecting data, you may realize that there are changes to be made -- possibly due to the representation of the data or errors in the recording. Before jumping into analysis, it is important to clean the data.
One thing to notice about timit is that the column vowel contains ARPABET identifiers for the vowels. We want to convert the vowel column to be IPA characters, and will do so in the cell below.
In [ ]:
IPAdict = {"AO" : "ɔ", "AA" : "ɑ", "IY" : "i", "UW" : "u", "EH" : "ɛ", "IH" : "ɪ", "UH":"ʊ", "AH": "ʌ", "AX" : "ə", "AE":"æ", "EY" :"eɪ", "AY": "aɪ", "OW":"oʊ", "AW":"aʊ", "OY" :"ɔɪ", "ER":"ɚ"}
timit['vowel'] = [IPAdict[x] for x in timit['vowel']]
timit.head()
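The list comprehension above raises a KeyError if the column contains a code that is missing from the dictionary. As an alternative sketch (not the approach this notebook uses), the pandas method Series.map leaves unknown codes as missing values, which can make stray entries easier to spot during cleaning. The codes and the name `small_map` below are made up for illustration.

```python
import pandas as pd

# Made-up ARPABET codes; "ZZ" is deliberately not a real vowel code.
codes = pd.Series(["IY", "AE", "ZZ"])

# Hypothetical two-entry excerpt of a code-to-IPA dictionary.
small_map = {"IY": "i", "AE": "æ"}

# map() looks up each code and yields NaN where no entry exists.
mapped = codes.map(small_map)
n_missing = mapped.isna().sum()
print(mapped)
print("codes without a translation:", n_missing)
```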
Most of the speakers will say the same vowel multiple times, so we are going to average those values together. The end result will be a dataframe where each row represents the average values for each vowel for each speaker.
In [ ]:
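# numeric_only=True averages just the numeric columns; text columns such as word cannot be averaged
timit_avg = timit.groupby(['speaker', 'vowel', 'gender', 'region']).mean(numeric_only=True).reset_index()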
timit_avg.head()
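To see what this averaging does, here is a small sketch on made-up numbers (the dataframe `toy` and its values are hypothetical). Speaker a produces the vowel i twice, so those two rows collapse into one averaged row.

```python
import pandas as pd

# Made-up measurements: speaker "a" says the vowel "i" twice.
toy = pd.DataFrame({
    "speaker": ["a", "a", "a", "b"],
    "vowel": ["i", "i", "u", "i"],
    "F1": [300.0, 320.0, 350.0, 280.0],
    "F2": [2300.0, 2200.0, 900.0, 2400.0],
})

# One row per (speaker, vowel) pair, holding the mean of each formant.
toy_avg = toy.groupby(["speaker", "vowel"]).mean().reset_index()
print(toy_avg)
```

Four input rows become three output rows, because only the repeated speaker and vowel combination is merged.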
Using the same dataframe from above, timit_avg, we are going to split into dataframes grouped by gender. To identify the possible values of gender in the gender column, we can use the method .unique on the column.
In [ ]:
timit_avg.gender.unique()
You can see that for this specific dataset there are only "female" and "male" values in the column. Given that information, we'll create two subsets based on gender.
We'll split timit_avg into two separate dataframes, one for females, timit_female, and one for males, timit_male. Creating these subset dataframes does not affect the original timit_avg dataframe.
In [ ]:
timit_female = timit_avg[timit_avg['gender'] == 'female']
timit_male = timit_avg[timit_avg['gender'] == 'male']
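The comparison inside the brackets produces a column of True/False values, and indexing with it keeps only the True rows. A small sketch with made-up rows (the names `toy`, `mask`, and `toy_female` are hypothetical):

```python
import pandas as pd

# Made-up rows standing in for timit_avg.
toy = pd.DataFrame({
    "gender": ["female", "male", "female"],
    "F1": [550.0, 480.0, 600.0],
})

# The comparison yields one True/False value per row.
mask = toy["gender"] == "female"
print(mask.tolist())

# Indexing with the mask keeps only the rows where it is True.
toy_female = toy[mask]
print(len(toy_female), "of", len(toy), "rows kept")
```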
We want to inspect the distributions of F1, F2, and F3 for those that self-report as male and those that self-report as female to identify possible trends or relationships. Having our two split dataframes, timit_female and timit_male, eases the plotting process.
Run the cell below to see the distribution of F1.
In [ ]:
sns.distplot(timit_female['F1'], kde_kws={"label": "female"})
sns.distplot(timit_male['F1'], kde_kws={"label": "male"})
plt.title('F1')
plt.xlabel("Hz")
plt.ylabel('Proportion per Hz');
Does there seem to be a notable difference between male and female distributions of F1?
Next, we plot F2.
In [ ]:
sns.distplot(timit_female['F2'], kde_kws={"label": "female"})
sns.distplot(timit_male['F2'], kde_kws={"label": "male"})
plt.title('F2')
plt.xlabel("Hz")
plt.ylabel('Proportion per Hz');
Finally, we create the same visualization, but for F3.
In [ ]:
sns.distplot(timit_female['F3'], kde_kws={"label": "female"})
sns.distplot(timit_male['F3'], kde_kws={"label": "male"})
plt.title('F3')
plt.xlabel("Hz")
plt.ylabel('Proportion per Hz');
Do you see a more pronounced difference across the different F values? Are they the same throughout? Can we make any meaningful assumptions from these visualizations?
An additional question: How do you think the fact that we average each vowel together first for each individual affects the shape of the histograms?
This portion of the notebook will rely on the data that was submitted for HW5. Just like we did for the TIMIT data, we are going to read it into a dataframe and modify the column vowel to reflect the corresponding IPA translation. We will name the dataframe class_data.
In [ ]:
# reading in the data
class_data = pd.read_csv('data/110_formants.csv')
class_data.head()
The ID column contains a unique value for each individual. Each individual has a row for each of the different vowels they measured.
In [ ]:
# translating the vowel column
class_data['vowel'] = [IPAdict[x] for x in class_data['vowel']]
class_data.head()
As we did with the TIMIT data, we are going to split class_data based on self-reported gender. We need to figure out what the possible responses for the column were.
In [ ]:
class_data['Gender'].unique()
Notice that there are three possible values for the column. We do not have a large enough sample size to responsibly come to conclusions for Prefer not to answer, so for now we'll compare Male and Female. We'll call our new split dataframes class_female and class_male.
In [ ]:
class_female = class_data[class_data['Gender'] == 'Female']
class_male = class_data[class_data['Gender'] == 'Male']
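Since each individual contributes one row per vowel, counting rows overstates how many people gave each response. One way to count individuals instead is sketched below on made-up data; the IDs and values are hypothetical, and only the column names ID and Gender are taken from class_data.

```python
import pandas as pd

# Made-up data: three individuals, two vowel rows each.
toy = pd.DataFrame({
    "ID": [1, 1, 2, 2, 3, 3],
    "Gender": ["Female", "Female", "Male", "Male", "Female", "Female"],
    "vowel": ["i", "u", "i", "u", "i", "u"],
})

# Keep one row per individual, then count how many gave each response.
people_per_response = toy.drop_duplicates("ID")["Gender"].value_counts()
print(people_per_response)
```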
The following visualizations compare the distribution of formants for males and females, like we did for the TIMIT data.
First, we'll start with F1.
In [ ]:
sns.distplot(class_female['F1'], kde_kws={"label": "female"})
sns.distplot(class_male['F1'], kde_kws={"label": "male"})
plt.title('F1')
plt.xlabel("Hz")
plt.ylabel('Proportion per Hz');
Next is F2.
In [ ]:
sns.distplot(class_female['F2'], kde_kws={"label": "female"})
sns.distplot(class_male['F2'], kde_kws={"label": "male"})
plt.title('F2')
plt.xlabel("Hz")
plt.ylabel('Proportion per Hz');
And finally F3.
In [ ]:
sns.distplot(class_female['F3'], kde_kws={"label": "female"})
sns.distplot(class_male['F3'], kde_kws={"label": "male"})
plt.title('F3')
plt.xlabel("Hz")
plt.ylabel('Proportion per Hz');
Does the spread of values appear to be the same for females and males? Do the same patterns that occur in the TIMIT data appear in the class's data?
Run the cell below to define some functions that we will be using.
In [ ]:
def plot_blank_vowel_chart():
    # Draw the blank vowel chart image stretched over the current axis limits.
    im = plt.imread('images/blankvowel.png')
    plt.imshow(im, extent=(plt.xlim()[0], plt.xlim()[1], plt.ylim()[0], plt.ylim()[1]))

def plot_vowel_space(avgs_df):
    # Plot each vowel's IPA symbol at its (F2, F1) position, with both axes inverted.
    plt.figure(figsize=(10, 8))
    plt.gca().invert_yaxis()
    plt.gca().invert_xaxis()
    vowels = ['eɪ', 'i', 'oʊ', 'u', 'æ', 'ɑ', 'ɚ', 'ɛ', 'ɪ', 'ʊ', 'ʌ', 'ɔ']
    for v in vowels:
        # Skip vowels that are absent from the dataframe instead of raising a KeyError.
        if v in avgs_df.index:
            plt.scatter(avgs_df.loc[v, 'F2'], avgs_df.loc[v, 'F1'], marker=r"$ {} $".format(v), s=1000)
    plt.ylabel('F1')
    plt.xlabel('F2')
We are going to be recreating the following graphic from this website.
Before we can get to creating, we need to get a single value for each column for each of the vowels (so we can create coordinate pairs). To do this, we are going to find the average formant values for each of the vowels in our dataframes. We'll do this for both timit and class_data.
In [ ]:
# numeric_only=True skips text columns, which cannot be averaged
class_vowel_avgs = class_data.drop('ID', axis=1).groupby('vowel').mean(numeric_only=True)
class_vowel_avgs.head()
In [ ]:
# numeric_only=True skips text columns, which cannot be averaged
timit_vowel_avgs = timit.groupby('vowel').mean(numeric_only=True)
timit_vowel_avgs.head()
Each of these new tables has a row for each vowel, which consists of the values averaged across all speakers.
Run the cell below to construct a vowel space for the class's data, in which we plot F1 against F2.
Note that both axes are descending.
In [ ]:
plot_vowel_space(class_vowel_avgs)
plt.xlabel('F2 (Hz)')
plt.ylabel('F1 (Hz)');
In [ ]:
log_timit_vowels = timit_vowel_avgs.apply(np.log)
log_class_vowels = class_vowel_avgs.apply(np.log)
class_data['log(F1)'] = np.log(class_data['F1'])
class_data['log(F2)'] = np.log(class_data['F2'])
log_class_vowels.head()
Below we plot the vowel space using these new values.
In [ ]:
plot_vowel_space(log_class_vowels)
plt.xlabel('log(F2) (Hz)')
plt.ylabel('log(F1) (Hz)');
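One property of logged values may help when thinking about these plots: multiplying a frequency by a constant factor shifts its logarithm by a constant amount. The sketch below uses invented formant values for two hypothetical speakers, the second with every formant 20% higher than the first.

```python
import numpy as np

# Invented F1 values (Hz) for three vowels from one hypothetical speaker.
speaker_a = np.array([300.0, 500.0, 700.0])

# A second hypothetical speaker whose formants are all 20% higher.
speaker_b = 1.2 * speaker_a

# In Hz, the gap between the speakers grows with the formant value.
hz_gap = speaker_b - speaker_a
print("gap in Hz:", hz_gap)

# After taking logs, the gap is the same for every vowel: log(1.2).
log_gap = np.log(speaker_b) - np.log(speaker_a)
print("gap in log units:", log_gap)
```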
What effect does using the logged values have, if any? What advantages does using these values have? Are there any negatives? This paper might give some ideas.
In [ ]:
plot_vowel_space(log_class_vowels)
plot_blank_vowel_chart()
plt.xlabel('log(F2) (Hz)')
plt.ylabel('log(F1) (Hz)');
How well does it match the original?
Below we generate the same graph, except using the information from the TIMIT dataset.
In [ ]:
plot_vowel_space(log_timit_vowels)
plot_blank_vowel_chart()
plt.xlabel('log(F2) (Hz)')
plt.ylabel('log(F1) (Hz)');
How does the TIMIT vowel space compare to the vowel space from our class data? What may be the cause for any differences between our vowel space and the one constructed using the TIMIT data? Do you notice any outliers or any points that seem off?
In [ ]:
sns.lmplot(x='log(F2)', y='log(F1)', hue='vowel', data=class_data, fit_reg=False, height=8, scatter_kws={'s':30})
plt.xlim(8.2, 6.7)
plt.ylim(7.0, 5.7);
In the following visualization, we replace the colors with the IPA characters and attempt to clump the vowels together.
In [ ]:
plt.figure(figsize=(10, 12))
pick_vowel = lambda v: class_data[class_data['vowel'] == v]
colors = ['Greys_r', 'Purples_r', 'Blues_r', 'Greens_r', 'Oranges_r',
          'Reds_r', 'GnBu_r', 'PuRd_r', 'winter_r', 'YlOrBr_r', 'pink_r', 'copper_r']
# Draw one density contour per vowel, each with its own colormap.
for vowel, color in zip(class_data.vowel.unique(), colors):
    vowel_subset = pick_vowel(vowel)
    sns.kdeplot(x=vowel_subset['log(F2)'], y=vowel_subset['log(F1)'], levels=1, cmap=color, fill=False)
# Plot every measurement as its IPA symbol; iterating over rows avoids assuming a particular index.
for _, row in class_data.iterrows():
    plt.scatter(row['log(F2)'], row['log(F1)'], color='black', linewidths=.5, marker=r"$ {} $".format(row['vowel']), s=40)
plt.xlim(8.2, 6.7)
plt.ylim(7.0, 5.7);
We are going to compare each of the formants and height to see if there is a relationship between the two. To help visualize that, we are going to plot a regression line, which is also referred to as the line of best fit.
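As a rough sketch of what a line of best fit is, the example below fits a straight line to invented height and formant values with NumPy's polyfit, which chooses the slope and intercept that minimize the squared vertical distances to the points. The numbers are made up and say nothing about the real data.

```python
import numpy as np

# Invented values for five hypothetical speakers.
heights = np.array([150.0, 160.0, 170.0, 180.0, 190.0])  # cm
max_f1 = np.array([900.0, 870.0, 880.0, 840.0, 830.0])   # Hz

# Degree-1 least-squares fit; polyfit returns the slope first, then the intercept.
slope, intercept = np.polyfit(heights, max_f1, deg=1)

# Value the fitted line predicts at each height.
predicted = slope * heights + intercept
print(f"slope: {slope:.2f} Hz per cm, intercept: {intercept:.1f} Hz")
```

The plots in this section draw a line of this kind with seaborn rather than computing it by hand.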
We are going to use the maximum of each formant to compare to height. So for each speaker, we will calculate their greatest F1, F2, and F3 across all vowels, then compare one of those to their height. We create the necessary dataframe in the cell below using the class's data.
In [ ]:
genders = class_data['Gender']
plotting_data = class_data.drop('vowel', axis=1)[np.logical_or(genders == 'Male', genders == 'Female')]
maxes = plotting_data.groupby(['ID', 'Gender']).max().reset_index()[plotting_data.columns[:-2]]
maxes.columns = ['ID', 'Language', 'Gender', 'Height', 'Max F1', 'Max F2', 'Max F3']
maxes_female = maxes[maxes['Gender'] == 'Female']
maxes_male = maxes[maxes['Gender'] == 'Male']
maxes.head()
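The key step in the cell above is groupby followed by max, which keeps the largest value in each column for every speaker. A small sketch on made-up measurements (the dataframe `toy` is hypothetical):

```python
import pandas as pd

# Made-up measurements: two individuals, two vowels each.
toy = pd.DataFrame({
    "ID": [1, 1, 2, 2],
    "vowel": ["i", "ɑ", "i", "ɑ"],
    "F1": [310.0, 750.0, 280.0, 690.0],
    "F2": [2400.0, 1200.0, 2250.0, 1100.0],
})

# Largest F1 and largest F2 per individual, taken across all of their vowels.
toy_maxes = toy.drop("vowel", axis=1).groupby("ID").max().reset_index()
print(toy_maxes)
```

Note that the maximum is taken column by column, so a speaker's largest F1 and largest F2 can come from different vowels.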
First we will plot Max F1 against Height.
Note: Each gender has a different color dot, but the line represents the line of best fit for ALL points.
In [ ]:
sns.regplot(x='Height', y='Max F1', data=maxes)
sns.regplot(x='Height', y='Max F1', data=maxes_male, fit_reg=False)
sns.regplot(x='Height', y='Max F1', data=maxes_female, fit_reg=False)
plt.xlabel('Height (cm)')
plt.ylabel('Max F1 (Hz)')
print('female: green')
print('male: orange')
Is there a general trend for the data that you notice? What do you notice about the different color dots?
Next, we plot Max F2 on Height.
In [ ]:
sns.regplot(x='Height', y='Max F2', data=maxes)
sns.regplot(x='Height', y='Max F2', data=maxes_male, fit_reg=False)
sns.regplot(x='Height', y='Max F2', data=maxes_female, fit_reg=False)
plt.xlabel('Height (cm)')
plt.ylabel('Max F2 (Hz)')
print('female: green')
print('male: orange')
Finally, Max F3 vs Height.
In [ ]:
sns.regplot(x='Height', y='Max F3', data=maxes)
sns.regplot(x='Height', y='Max F3', data=maxes_male, fit_reg=False)
sns.regplot(x='Height', y='Max F3', data=maxes_female, fit_reg=False)
plt.xlabel('Height (cm)')
plt.ylabel('Max F3 (Hz)')
print('female: green')
print('male: orange')
Do you notice a difference between the trends for the three formants?
Now we are going to plot two lines of best fit, one for males and one for females. Before, we plotted one line for all of the values, but now we are separating by gender to see if gender explains some of the difference in formant values.
For now, we're going to deal with just Max F1.
In [ ]:
sns.lmplot(x='Height', y='Max F1', data=maxes, hue='Gender')
plt.xlabel('Height (cm)')
plt.ylabel('Max F1 (Hz)');
Is there a noticeable difference between the two? Did you expect this result?
We're going to repeat the above graph, plotting a different regression line for males and females, but this time using timit, since a larger sample size may help expose patterns. Before we do that, we have to repeat the process of calculating the maximum value of each formant for each speaker. Run the cell below to do that and generate the plot. The blue dots are females, the orange dots are males, and the green line is the regression line for all speakers.
In [ ]:
timit_maxes = timit.groupby(['speaker', 'gender']).max().reset_index()
timit_maxes.columns = ['speaker', 'gender', 'region', 'height', 'word', 'vowel', 'Max duration', 'Max F1', 'Max F2', 'Max F3', 'Max f0']
plt.xlim(140, 210)
plt.ylim(500, 1400)
sns.regplot(x='height', y='Max F1', data=timit_maxes[timit_maxes['gender'] == 'female'], scatter_kws={'alpha':0.3})
sns.regplot(x='height', y='Max F1', data=timit_maxes[timit_maxes['gender'] == 'male'], scatter_kws={'alpha':0.3})
sns.regplot(x='height', y='Max F1', data=timit_maxes, scatter=False)
plt.xlabel('Height (cm)')
plt.ylabel('Max F1 (Hz)');
Does this graph differ from the one based on class_data? If it does, what are some possible explanations for this? From the visualization, what can you say about height as a predictor of Max F1? Do you think gender plays a role in the value of Max F1?
Do you think similar patterns would emerge for Max F2 and Max F3? We only used Max F1, but consider trying to plot them by copying some of the code from above and making slight alterations (remember that to insert a code cell below, you can either press esc + b or click Insert > Insert Cell Below on the toolbar).
Please fill out our feedback form!