%matplotlib inline
from matplotlib import pyplot as plt
from scipy.io import wavfile as wv
from scipy.signal import iirfilter, lfilter, butter, freqz
from scipy.fftpack import fft
import sounddevice as sd
import string
from glob import glob
import pandas as pd
import numpy as np
import seaborn as sns
import os
sep = os.sep
sns.set(style='darkgrid', font_scale=1.5)
blue, red, green = sns.color_palette(n_colors=3)
Pandas is a package designed to make data analysis easier and more efficient. Among the many things it can do, it makes it easy to read data from files, write data to files, and easily process data in various complicated ways. Conveniently, it comes bundled with Anaconda, so you already have it!
As coded above, it's standard to import pandas as pd
, just like it's standard to import numpy as np
.
The key object that you'll use in pandas is the DataFrame
. There are a number of ways to create data frames "from scratch," and functions like read_csv()
in pandas return data frames, as well. Within pandas, DataFrame()
is a function (note the uppercase D and F), so, as usual, you can look at the doscstring to get a sense of how it works:
pd.DataFrame?
You can create a data frame in a number of ways. One is by giving pd.DataFrame()
a dictionary as input:
x = np.random.normal(loc=-5, scale=3, size=100) # 100 random normal variates
y = 5 + 2*x + np.random.normal(size=100) # 100 more
letters = string.ascii_lowercase # lower case letters
z = [letters[np.random.randint(26)] for i in range(100)] # random letters
dd = {'x':x, 'y':y, 'z':z} # dictionary with x, y, and z
df_xyz = pd.DataFrame(dd) # data frame from dictionary
df_xyz.head(n=7) # first 5 rows
Data frames have a shape attribute that tells you the number of rows and columns:
df_xyz.shape
You can also create a data frame by initializing it as an empty data frame and then filling in data column by column:
df_uvw = pd.DataFrame()
df_uvw.head()
df_uvw['u'] = np.random.random(size=100) # column name 'u', value is 100 random numbers
df_uvw['v'] = 1/df_uvw['u'] # column name 'v', the reciprocal of column 'u'
df_uvw['w'] = [letters[np.random.randint(26)] for i in range(100)] # random letters
df_uvw.head()
A very useful feature of data frames is the fact that different columns can contain different data types. In both of these toy examples, one column contains strings, while the other two contain numeric data.
In order to illustrate some more realistic uses for pandas, we will create a data frame (and do some signal processing) with the sound files we processed in the last notebook. In order to do that, we can use glob()
(from the glob
module) to get the list of sound files. Here's our current working directory:
Here's where the sound files are (on my computer) relative to the current working directory:
snd_dir = 'sentences' + sep
We can use glob()
to read files from that folder:
snd_files = glob(snd_dir + '*.wav')
print(snd_files[:5])
Note that each file name includes the string indicating the relative folder location. We will be creating a dictionary to contain sound files we read it, and we will use the names of the sound files as the keys for the dictionary. In order to isolate the file names from the rest of the directory structure, we can use the split()
method that strings have:
# separate a single filename's directories
file = snd_files[0]
file.split(sep)
# do the same for all the filenames, using a list comprehension
f_split = [ff.split(sep) for ff in snd_files]
f_split[:3]
Now we'll create a dictionary to hold the sound files, loop through the list of files, read each one, normalize its amplitude, get just the names of the sounds (i.e., get rid of the relative location and the file type extension "wav") and use them as keys:
snd_dict = {}
for file in snd_files:
fs,s = wv.read(file) # read file in
sname = file.split(sep)[-1].split('.')[0] # get file name, w/o .wav extension
smax = np.max(np.abs(s)) # get peak amplitude
s_norm = .975*s/smax # normalize w.r.t. peak amplitude
snd_dict[sname] = s_norm # put the normalized vector in the dictionary
snd_dict[sname + '_fs'] = fs # put the sampling frequency in the dictionary
print(snd_dict.keys())
Okay, so now we will calculate some quantities (overall energy, high frequency energy, low frequency energy, duration), calculating the overall energy measure using two techniques, which we will then store in a data frame.
The two techniques will allow us to see how well Parseval's theorem holds in the sound files in question. Parseval's theorem says that the "the total energy of a signal can be calculated by summing power-per-sample across time or spectral power across frequency":
\begin{equation*} \sum_{n=0}^{N-1}|x[n]|^2 = \frac{1}{N}\sum_{k=0}^{N-1}|X[k]|^2 \end{equation*}
First, we'll get a list of just the keys for the sound arrays (not the sampling frequencies):
# list comprehension to get sound wave names (excluding _fs keys)
snames = [i for i in list(snd_dict.keys()) if '_fs' not in i]
snames.sort()
print(snames)
Next, we'll prepare for our calculations by creating a couple filters and a function for calculating RMS energy in a signal:
Nord = 5 # order of the filter
cut_freq = 500 # cutoff freq in Hz
nyq = fs/2 # nyquist, assumes all fs values were the same
cut_freq_rad = np.array([cut_freq/nyq]) # cutoff in radians
hi_b, hi_a = iirfilter(N=Nord, Wn=cut_freq_rad, btype='highpass') # high-pass filter
lo_b, lo_a = iirfilter(N=Nord, Wn=cut_freq_rad, btype='lowpass') # low-pass filter
iirfilter?
What about 500 Hz (or whatever cutoff frequency)? With Butterworth filters, the cutoff frequency is defined as the point that is -3 dB relative to the passband. Here's an illustration of the frequency response of these two filters:
w_hi, H_hi = freqz(hi_b, hi_a) # frequency response of high pass filter
w_lo, H_lo = freqz(lo_b, lo_a) # frequency response of low pass filter
frq = np.linspace(0, nyq, len(w_hi))
fig, ax = plt.subplots(1, 1, figsize=(10,5))
ax.plot(frq[:50], np.abs(H_hi)[:50], lw=3)
ax.plot(frq[:50], np.abs(H_lo)[:50], lw=3)
We'll define a convenience function for calculating RMS values:
def frms(x):
return np.sqrt(np.mean(x**2))
Then we'll initialize our data frame:
n_snd = len(snames) # number of sound files we read in
# initialize data frame, specify index and columns
col_names = ['sex','talker','sentence_idx','rms','rms_lo','rms_hi','dur','pow_t','POW_f']
df_ac = pd.DataFrame(index=np.arange(n_snd), columns=col_names, dtype=float)
df_ac.head()
df_ac.shape
Next, we'll loop through the snd_dict
keys, extract each sound, and calculate a few things (note that this cell take a while to run):
for si, sn in enumerate(snames):
if si % 25 == 0: # if si / 25 has a remainder of 0, then...
print(si,sn)
# sex, talker, sentence index
sex = sn[0]
talker = np.int(sn[1])
sentence_idx = np.int(sn[3])
# extract sound and fs from dictionary
sound = snd_dict[sn]
snd_fs = snd_dict[sn + '_fs']
# filter the sound into high and low freq bands
snd_hi = lfilter(hi_b, hi_a, sound)
snd_lo = lfilter(lo_b, lo_a, sound)
# calculate RMS energy values, duration
rms_hi = frms(snd_hi)
rms_lo = frms(snd_lo)
rms = frms(sound)
pow_t = np.sum(sound**2)
dur = len(sound)/snd_fs
# get FFT of signals to calculate mean squared spectral magnitudes
POW_f = np.mean(np.abs(fft(sound))**2)
# put the data into the data frame
df_ac.loc[si,['sex','talker','sentence_idx']] = sex, talker, sentence_idx
df_ac.loc[si,['rms','rms_hi','rms_lo']] = rms, rms_hi, rms_lo
df_ac.loc[si,['dur','pow_t','POW_f']] = dur, pow_t, POW_f
It's often useful to use head()
(or tail()
) to quickly check to make sure things look sensible:
df_ac.head()
We can listen to the last sound and its filtered counterparts using sounddevice
:
sd.play(sound)
sd.play(snd_hi)
sd.play(snd_lo)
There are a number of ways to access the data in a data frame. One way is to treat it like a dictionary and use a column name as we would use a key.
rms_vec = df_ac['rms']
rms_vec[:10]
It is important to note that this does not give us back a numpy array. Rather, it gives us a pandas Series
object (essentially, a 1D data frame):
type(rms_vec)
If we want a numpy array, we can use the values
attribute:
rms_npy = df_ac['rms'].values
rms_npy[:10]
type(rms_npy)
Series and data frames have lots of methods, including plotting methods, e.g.:
rms_vec.plot(style='o:',figsize=(15,5))
ax = plt.gca() # gca = get current axis object
ax.set_xlim(-.5,n_snd-.5);
As suggested by the use of column names to get data from a particular column, indexing with data frames is different than it is with numpy arrays. Whereas with a numpy array, we index rows and columns using slices (or boolean arrays), accessing specific rows in a data frame is very different.
Recall that the data frame has a component called the index
. We can look at the index for a data frame like this:
df_ac.index
In order to specify certain rows, we need to use .loc
or .iloc
. .loc
uses labels (i.e., the specific values in the index), whereas .iloc
uses position (specified by integers, as in numpy).
For our data frame, the index is just the set of integers indicating the positions of the rows, so none of this makes any difference. We'll come back to this later on, and illustrate the various methods. For now, it's just important to know that you don't just use numpy-like indexing.
So, if we want the first five rows of our data frame, and just the columns with RMS values in them, we would do this:
rms_0to4 = df_ac.loc[np.arange(5),['rms','rms_lo','rms_hi']]
rms_0to4
We can use logical indexing with a numpy array of booleans to pick out rows, as well.
rows = (df_ac['rms'] < df_ac['rms'].median()).as_matrix() # converts Series (or data frame) to numpy array
print('boolean array:')
print(rows)
print('')
print('indices:')
print(df_ac.index[rows])
low_df = df_ac.loc[rows,:] # extract rows such that df['rms'] is in the bottom half of values
low_df.head()
low_df.shape
The data frame plot()
method can take multiple column names as arguments:
ax = df_ac.plot(x='rms_lo', y='rms_hi', style='*', ms=18, figsize=(8,8))
ax = df_ac.plot(x='dur', y='pow_t', style='o', ms=11, figsize=(8,8))
Or to make kernel density plots (smoothed histograms, with a bandwidth parameter that governs smoothness):
ax = df_ac.plot(y='rms',kind='kde', figsize=(10,5), lw=3, bw_method=.5)
ax = df_ac.plot(y='rms_lo',kind='kde', figsize=(10,5), lw=3, ax=ax, bw_method=.5)
ax = df_ac.plot(y='rms_hi',kind='kde', figsize=(10,5), lw=3, ax=ax, bw_method=.5)
Or histograms:
bins = np.linspace(0,.3,20)
ax = df_ac.plot(y='rms',kind='hist', histtype='step', figsize=(10,5), lw=3, bins=bins)
ax = df_ac.plot(y='rms_lo',kind='hist', histtype='step', figsize=(10,5), lw=3, bins=bins, ax=ax)
ax = df_ac.plot(y='rms_hi',kind='hist', histtype='step', figsize=(10,5), lw=3, bins=bins, ax=ax)
We can use visualize pow_t
and POW_f
to see if Parseval's theorem holds:
ax = df_ac.plot(y='pow_t', kind='kde', figsize=(10,5), lw=4, bw_method=.3)
ax = df_ac.plot(y='POW_f', kind='kde', style=':', figsize=(10,5), lw=6, bw_method=.3, ax=ax)
Now that we have a data frame, we can easily save the data in it to a file on our computer. I typically use csv
files for this, but there are multiple options.
df_ac.to_csv('sound_data.csv', index=False)
%ls # look at what's in the directory
As a brief aside, numpy also allows you to save data in arrays as files on your computer. These can take up a lot less space than pandas data frames, but pandas data frames can have columns of strings, numbers, etc, all in the same structure, plus you get convenient, easy to remember column names. (Numpy actually has a type of array that is a lot like a data frame, for what it's worth: structured arrays.)
Anyway, if you have exclusively numerical data, particularly if you have a large amount, and column names aren't particularly important (or if they are not important at all), you can use numpy to save data:
# use a list to pick out multiple columns
# make sure they are encoded as floats (.astype(float))
# make sure it's giving me a numpy array (.as_matrix())
snd_arr = df_ac[['rms','rms_lo','rms_hi']].astype(float).as_matrix() # extract RMS data as array of floats
np.save(file='rms_data.npy', arr=snd_arr)
%ls
snd_arr[:5,:]
You can read (numpy) data in with numpy, too:
ac_arr = np.load(file='rms_data.npy')
ac_arr[:5,:]
If you have multiple arrays, you can use the savez()
function, and read structures saved this way with load()
:
dur_arr = df_ac['dur'].as_matrix() # array with just durations
rms_arr = df_ac[['rms','rms_lo','rms_hi']].as_matrix() # rms array (again)
np.savez(file='snd_arrays.npz', dur_arr=dur_arr, rms_arr=rms_arr) # save both in a numpy zipped file
arrays = np.load(file='snd_arrays.npz') # returns dictionary-like object
print(type(arrays))
arrays.keys()
Reading from csv
(or many other files types) is easy with pandas. For example, we can read in a csv
file with a couple of acoustic measurements for consonants produced 10 times by each of 20 talkers (10 male, 10 female) in onset position in simple CV or VC non-words:
pd.read_csv?
cod_df = pd.read_csv('con_data_onset_dur.csv') # cod = consonant onset duration
cod_df.head()
We can see which consonants are in this corpus by using the unique()
function:
cod_df['consonant'].unique()
Pandas provides easy to use functions for combining data sets. The functions concat()
concatenates data frames. In order to illustrate, we'll first read in some more data (one file with power measures for the same onset tokens, and two more with duration and power measurements for the same consonants produced in coda position).
ccd_df = pd.read_csv('con_data_coda_dur.csv') # ccd = consonant coda duration
cop_df = pd.read_csv('con_data_onset_pow.csv') # cop = consonant onset power
ccp_df = pd.read_csv('con_data_coda_pow.csv') # ccp = consonant coda power
We can combine the onset and coda duration data using concat()
like this:
cd_df = pd.concat((cod_df,ccd_df),axis=0) # concatenate along rows, i.e., one on top of the other
print(cod_df['prosody'].unique()) # only onset (CV) data
print(ccd_df['prosody'].unique()) # only coda (VC) data
print(cd_df['prosody'].unique()) # 0 = onset (CV), 1 = coda (VC)
print(cod_df.shape, ccd_df.shape, cd_df.shape)
cd_df.head()
We can combine the onset and coda power data using append()
like this (the end result is the same, whether you use concat()
or append()
):
cp_df = cop_df.append(ccp_df) # append ccp_df to cop_df
print(cop_df['prosody'].unique())
print(ccp_df['prosody'].unique())
print(cp_df['prosody'].unique())
print(cop_df.shape, ccp_df.shape, cp_df.shape)
cp_df.head()
We can now combine the duration and power data frames so that we have duration and power data for each unique production of each consonant. We'll use merge()
to do this.
The function merge()
can take a number of arguments, specifying exactly how the data frames to be merged should be combined. Here's the docstring for merge()
:
pd.merge?
We can illustrate some options with a toy example:
df_a = pd.DataFrame({'name':['Bob','Mark','Jane','James','Fran','Erin'],
'job':['engineer','analyst','analyst','designer','designer','engineer'],
'level':['senior','senior','senior','junior','junior','junior']})
df_a
df_b = pd.DataFrame({'job':['engineer','engineer','designer','designer','analyst','analyst','manager','manager'],
'level':['junior','senior','junior','senior','junior','senior','junior','senior'],
'salary':[55000, 63000, 37000, 45000, 62000, 77000, 88000, 94000]})
df_b
df_inner = pd.merge(left=df_a, right=df_b, on=['job','level'], how='inner') # inner is the default
df_inner
df_outer = pd.merge(left=df_a, right=df_b, on=['job','level'], how='outer')
df_outer
We can use merge()
to combine the duration and power data from above:
# this produces a data frame merged for each combination of suject, token, consonant, etc...
c_df = pd.merge(left=cd_df, right=cp_df, on=['subject','token','consonant','prosody','voice','place','manner'])
print(c_df.shape,cd_df.shape,cp_df.shape)
c_df.head()
Pandas allows you to create new columns easily. For a simple example, we can add the noipow
and voipow
columns together to get a total power column, we can add the condur
and vowdur
columns together to get a total duration column, and we can divide the vowdur
column by the condur
column to get a measure of the consonant duration as a proportion of the vowel duration:
c_df['power'] = c_df['voipow'] + c_df['noipow']
c_df['duration'] = c_df['condur'] + c_df['vowdur']
c_df['cv_ratio'] = c_df['condur']/c_df['vowdur']
c_df.head()
For a more complicated example, we may find it inconvenient that voicing, place, manner, and prosody (onset vs coda) are encoded with numbers in c_df
. It would be nice to have words, instead.
Noting that 0 = voiceless, 1 = voiced; 0 = labial, 1 = alveolar; 0 = stop, 1 = fricative; and 0 = onset, 1 = coda, we can do the following:
v = ['voiceless','voiced'] # to be picked out with 0, 1 from voice column
p = ['labial','alveolar'] # to be picked out with 0, 1 from place column
m = ['stop','fricative'] # to be picked out with 0, 1 from manner column
s = ['onset','coda'] # to be picked out with 0, 1 from prosody column
labels = [v,p,m,s] # list of lists
cols_old = ['voice','place','manner','prosody'] # old column names
cols_new = ['vlab','plab','mlab','slab'] # column names for new corresponding labels
for lab_idx, old_new in enumerate(zip(cols_old,cols_new)):
old_col, new_col = old_new
c_df[new_col] = [labels[lab_idx][ii] for ii in c_df[old_col]]
c_df.head()
If we only want the columns with labels, we can create a copy with just the subset of the columns that we want:
df = c_df[['consonant','condur','vowdur','subject','token',
'noipow','voipow','power','duration','cv_ratio',
'vlab','plab','mlab','slab']].copy()
df.head()
df.shape
Pandas has some very nice methods for grouping data and calculating quantities for separate groups. The groupby()
method creates a pandas object that can be easily used to, e.g., calculate group means for one or another variable.
con_grp = df.groupby('consonant') # creates grouped object
cd_m = con_grp['cv_ratio'].mean() # get mean consonant duration for each consonant
cd_m.sort_index(inplace=True) # sort the Series according to the index values
cd_m
In this case, because we only picked out cv_ratio
, we get a Series object back (i.e., a 1D pandas object):
type(cd_m)
The index
of this Series contains the consonants we're interested in:
cd_m.index
Using .loc
with an index label gives the value at that labeled location in the Series:
cd_m.loc['p']
If we use .iloc
, we use an integer to indicate the location:
cd_m.iloc[3]
cd_s = con_grp['cv_ratio'].std()
cd_s
Reorder the rows:
cd_s = cd_s.loc[['p','b','t','d','f','v','s','z']] # rearrange the indices
cd_m = cd_m.loc[['p','b','t','d','f','v','s','z']] # rearrange the indices
cd_m
cd_s
fig, ax = plt.subplots(1, 1, figsize=(10,5))
# plot error bars, means plus/minus one SD, using a list comprehension
[ ax.plot([i,i],[cd_m.loc[c]-cd_s.loc[c],cd_m.loc[c]+cd_s.loc[c]], color=blue) for i,c in enumerate(cd_m.index) ]
ax.plot(cd_m.values, 'o', color=blue, ms=14)
ax.set(xticks=np.arange(len(cd_m)),xticklabels=cd_m.index);
We can use groupby()
to see if there is a difference in low or high frequency energy for male vs female speakers in the first (big) data frame we made:
# split by sex
# apply function mean() to columns rms, rms_lo, and rms_hi
# combine into single data frame
rms_x_sex = df_ac.groupby('sex')['rms','rms_lo','rms_hi'].mean()
rms_x_sex
You can group by multiple factors, as well. This will produce a 'MultiIndex', which is a more complex, structured index:
vpm_grp = df.groupby(['vlab','plab','mlab'])
np_vpm_m = vpm_grp[['noipow','voipow']].mean()
np_vpm_m
We can use the xs()
method to pick out particular rows (cross-sections) of a series or data frame with a multi-index. Note that the first argument (key
) indicates the value of the index in the desired rows, while the second (level
) indicates the component of the multi-index we want to use:
np_vpm_m.xs(key=('voiced','labial'), level=('vlab','plab'))
np_vpm_m.xs(key=('labial','fricative'), level=('plab','mlab'))
np_vpm_m.xs(key=('fricative'), level=('mlab'))
Objects produced by groupby()
have a number of aggregation functions:
Function | Description |
---|---|
count() |
total number of elements |
first() , last() |
first, last element |
mean() , median() |
mean, median of all elements |
min() , max() |
minimum, maximum |
std() , var() |
standard deviation, variance |
mad() |
mean absolute deviation |
prod() , sum() |
product, sum of elements |
vpm_grp[['condur','vowdur','noipow','voipow']].first()
vpm_grp[['condur','vowdur','noipow','voipow']].median()
vpm_grp[['condur','vowdur','noipow','voipow']].mad()
vpm_grp[['condur','vowdur','noipow','voipow']].count()
Pandas also enables easy creation of cross-tabulations (i.e., sums of co-occurences of given factors):
# new column indicating greater than average noise power
df['high_noipow'] = df['noipow'] > df['noipow'].mean()
pd.crosstab(df['vlab'],df['high_noipow'], margins=True)#, normalize=True)
Cross-tabs can involve more than just two factors:
pd.crosstab(df['vlab'],[ df['mlab'], df['high_noipow'] ]) #, margins=True)
pd.crosstab(df['vlab'],[df['plab'],df['mlab'],df['high_noipow']])
You can also give crosstab()
an aggregation function if you don't want a sum:
pd.crosstab(c_df['vlab'],c_df['plab'], aggfunc=np.std, values=c_df['condur'])
Finally, pandas makes it pretty easy to reshape data sets. The data frame df
is currently in "long" format:
df.sort_values(['subject','consonant','token']).head()
We can specify one or more columns to be the (multi)index of a pivot table, putting the data into "wide" format:
pd.pivot_table?
pt_a = pd.pivot_table(data=df, index=df[['subject','vlab','slab']]) # values are means by default
pt_a.head(n=17)
pt_b = pd.pivot_table(data=df, index=df[['subject','vlab','slab']], aggfunc=np.min) # use min instead
pt_b.head(n=15)
We can also specify what we want in the columns:
pt_c = pd.pivot_table(data=df, index=df[['subject','vlab','slab']], columns=df['token'])
pt_c.head(n=15)
pt_c.shape
We can specify the columns to use to calculate the values in the pivot table:
pt_d = pd.pivot_table(data=df, index=['subject','vlab','slab'],
columns=['token'],
values=['cv_ratio'])
pt_d.columns = [col[0] + '_' + str(col[1]) for col in pt_d.columns.values] # flatten the column names
pt_d.head(n=15)
We can also reshape in the other direction, going from "wide" to "long":
pd.melt?
pt_d.reset_index(inplace=True) # puts multi-index values into columns
pt_d.head()
df_z = pd.melt(frame=pt_d, id_vars=['subject','vlab','slab'],
var_name='original', value_name='mean')
df_z.head(n=10)
This notebook covers the basic capabilities of pandas data frames. We saw how to create data frames and how indexing in data frames is different from indexing in numpy arrays. We saw how some basic signal processing can be used in conjunction with data frames to take acoustic measurements and store the results in a convenient format. We showed how to read and write data with data frames (as well as with numpy arrays), and we used pandas to visualize data, combine data from different sources, and reshape data frames from long to wide format and back again.