%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf
sns.set(style='darkgrid', font_scale=2)
blue, green, red, purple = sns.color_palette('deep', n_colors=4)
There are a number of packages designed for model fitting and evaluation in Python. Two in particular come bundled with Anaconda: statsmodels and scikit-learn. We'll work with parts of each, beginning with statsmodels. We will (probably) also work with pystan later on, when we get to Bayesian data analysis.
First, note the slightly different import statement in the code block above. In order to import and use statsmodels, we need to import the API (application programming interface). The standard abbreviation is sm. Note also that we've imported statsmodels.formula.api as smf.
We can illustrate some of the useful tools in sm with the consonant acoustics data we have looked at before. Because there are multiple repetitions of each consonant by each speaker, we will just take a subset of the data (in this case, the first repetition).
cons_all = pd.read_csv('consonant_acoustics_subset_labels.csv') # read in data file
cons_all.sort_values(by=['subject','consonant','slab','token'], inplace=True) # sort the data
cons_first = cons_all.groupby(['subject','consonant','slab']).first() # take just the first repetition
cons_first.head(n=10)
cons_first.reset_index(level=['subject','consonant','slab'], inplace=True) # make index into columns
cons_first.head(n=10)
cols = ['voice','vlab','place','plab','manner','mlab','prosody','slab','noipow','voipow','condur','vowdur','token']
cons = cons_first[cols].copy() # make a copy, with just the subset of columns in cols
cons.head(n=10)
The sm module makes it easy to specify and fit linear regression models. You can use R-style syntax with a pandas data frame or with your data in numpy arrays. Here's an example using R-style syntax and the data frame defined above.
We will specify and fit a simple linear regression model to analyze how well the duration of consonants predicts their RMS energy.
We'll visualize the data before fitting a model:
sns.jointplot(cons['condur'], cons['noipow'], size=8, stat_func=None);
In order to fit a standard linear regression model, we can use the OLS (ordinary least squares) class in the sm module to create a model object, and then call the fit method of that model object. OLS requires arrays for the dependent variable (endog, for 'endogenous') and the independent variable(s) (exog, for 'exogenous').
The term 'ordinary least squares' refers to the fact that the model is fit (i.e., parameter values are found) such that the sum of the squared deviations of the observed data from the expected values is minimized.
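In symbols, with $\mathbf{x}_i$ denoting the $i$-th row of the predictor matrix, OLS finds:
\begin{equation} \hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min} \sum_i \left( y_i - \mathbf{x}_i \boldsymbol{\beta} \right)^2 \end{equation}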
Note that you have to include a column of ones if you want the model to have an intercept.
The model we'll fit is:
\begin{equation} \hat{y}_i = \beta_0 + \beta_1 x_{i} \end{equation}
y = cons['noipow'].as_matrix() # noipow = dependent/endogenous variable
x0 = np.ones(y.shape[0]) # for just estimating an intercept
x = np.stack((np.ones(y.shape[0]),cons['condur']),axis=1) # ones for intercept, condur = ind/exog variable
x[:5,:]
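As an aside, the add_constant function in sm builds the same kind of array for us, prepending the column of ones to the predictor(s):
x_alt = sm.add_constant(cons['condur'].as_matrix()) # same as the np.stack call above
x_alt[:5,:]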
Simple, intercept-only model (i.e., calculating the mean and variance of y in an unnecessarily complicated way):
mod_0 = sm.OLS(endog=y, exog=x0) # intercept-only model
fit_0 = mod_0.fit()
print(fit_0.summary())
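Since the intercept-only model just estimates the mean of y, the single parameter estimate should match the sample mean:
print(fit_0.params[0], y.mean()) # these should be identical (up to rounding)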
Adding in consonant duration as a predictor to see how much information it gives us about acoustic energy above and beyond just calculating the mean (and variance) of y:
mod_a = sm.OLS(endog=y, exog=x) # OLS model object
fit_a = mod_a.fit()
print(fit_a.summary())
Or, somewhat more conveniently, we can use the ols function inside smf. Calling this with a formula and a data frame creates a model object, which again requires us to call the fit method:
fit_b = smf.ols('noipow ~ condur', data=cons).fit()
print(fit_b.summary())
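The formula interface fits exactly the same model as the array-based call above (it builds the design matrix, intercept included, for us), so the two sets of estimates should agree:
print(fit_a.params) # array-based: [intercept, condur slope]
print(fit_b.params) # formula-based: same values, with names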
We can use the seaborn function lmplot to plot the data and the model's fitted values along with a bootstrap confidence interval around the fitted line:
y_mean = fit_0.params # intercept-only estimate, i.e., the mean of y
sns.lmplot(x='condur', y='noipow', data=cons, size=8)
ax = plt.gca()
ax.plot(ax.get_xlim(),y_mean*np.ones(2),':',color=purple, lw=3);
We could also plot the 'lowess' (locally weighted) regression line, instead:
sns.lmplot(x='condur',y='noipow', data=cons, size=8, lowess=True);
The fitted model also allows us to generate predicted mean values, and standard errors around these predicted means, for whatever values of the independent variable we're interested in:
# create data frame with values of 'condur' at which we want predictions
x_new_d = pd.DataFrame({'condur':np.arange(0,751,50)})
# make array for plotting
x_new = x_new_d['condur'].as_matrix()
# generate predictions of noise power at new 'condur' values
y_pred = fit_b.get_prediction(x_new_d)
fig, ax = plt.subplots(1,1,figsize=(8,8))
ax.set_xlabel('condur'); ax.set_ylabel('noipow');
ax.plot(cons['condur'],cons['noipow'],'o',color=list(blue) + [.75])
y_pred_mean = y_pred.predicted_mean # predicted mean
y_pred_se = y_pred.se_mean # predicted standard error
y1 = y_pred_mean-2*y_pred_se; y2 = y_pred_mean+2*y_pred_se # approximate 95% confidence band for the mean
ax.fill_between(x=x_new,y1=y1,y2=y2,color=list(blue) + [.25])
ax.plot(x_new,y_pred_mean,'-',lw=3,color=list(blue) + [.95]);
Of course, we can also fit more complicated models. Here is a model predicting consonant duration as a function of vowel duration, syllable position (slab: onset vs. coda), and the interaction between these two factors.
Note that, because syllable position is a dichotomous variable, its presence in the model allows the onset and coda data to have different intercepts, and the interaction term allows vowel duration to have a different slope for onset and coda positions.
If syllable position is indicated by $S$ and encoded with onset = 1 and coda = 0, with estimated consonant duration indicated by $\hat{C}$ and vowel duration by $V$, then the model is:
\begin{equation} \hat{C}_i = \beta_0 + \beta_1 V_i + \beta_2 S_i + \beta_3 V_i \times S_i \end{equation}
For coda consonants, $S_i = 0$, so this reduces to:
\begin{equation} \hat{C}_i = \beta_0 + \beta_1 V_i \end{equation}
Whereas for onset consonants, $S_i = 1$, so the model is:
\begin{align} \hat{C}_i &= \beta_0 + \beta_1 V_i + \beta_2 S_i + \beta_3 V_i \times S_i\\ &= \beta_0 + \beta_1 V_i + \beta_2 + \beta_3 V_i\\ &= \left(\beta_0 + \beta_2 \right) + \left(\beta_1 + \beta_3\right) V_i \end{align}
Here's the model, using a formula and the smf module:
fit_c = smf.ols('condur ~ vowdur + slab + vowdur:slab', data=cons).fit()
print(fit_c.summary())
We can also parameterize the model in terms of an overall intercept and slope plus deviations from these for onset and coda consonants (i.e., effect coding). If we code coda as -0.5 and onset as +0.5, then for coda consonants the model will be:
\begin{align} \hat{C}_i &= \beta_0 + \beta_1 V_i - 0.5\beta_2 - 0.5\beta_3 V_i\\ \hat{C}_i &= (\beta_0 - 0.5\beta_2) + (\beta_1 - 0.5\beta_3) V_i \end{align}
and for onset consonants it will be:
\begin{align} \hat{C}_i &= \beta_0 + \beta_1 V_i + 0.5\beta_2 + 0.5\beta_3 V_i\\ \hat{C}_i &= (\beta_0 + 0.5\beta_2) + (\beta_1 + 0.5\beta_3) V_i \end{align}
cons['pros_ec'] = np.in1d(cons['slab'],'onset') - 0.5 # effect code: coda -> -0.5, onset -> +0.5
cons.head()
fit_c2 = smf.ols('condur ~ vowdur + pros_ec + vowdur:pros_ec', data=cons).fit()
print(fit_c2.summary())
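The dummy-coded and effect-coded versions are just two parameterizations of the same model, so their fitted values should be identical (up to floating-point error):
np.allclose(fit_c.fittedvalues, fit_c2.fittedvalues) # should be True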
We can visualize the data and model predictions for onsets and codas separately by specifying the col and hue arguments in lmplot:
sns.lmplot(x='vowdur', y='condur', data=cons, col='slab',
hue='slab', size=8, scatter_kws={'s':100});
The statsmodels package also has the ability to fit 'mixed effects' linear models. I prefer to refer to this kind of model as a multilevel model. The basic idea is that if you have multiple observations within some grouping unit (e.g., multiple acoustic measurements for a given talker), you can simultaneously (a) fit a model to each unit's data and (b) fit a model governing variation across these units.
A more concrete example will probably help clarify. In this case, we'll be predicting consonant duration as a function of each consonant's voicing (voiceless = 0, voiced = 1) and manner of articulation (stop = 0, fricative = 1).
Recall that we just took the first observation of the consonant acoustic data above. The full data set has 10 repetitions of each of 8 consonants from each speaker (in each syllable position). In theory, we could fit a model to each set of repetitions from each speaker. If we did this, we would get beta coefficients for each subject. A multilevel model allows us to do this, while simultaneously modeling the variation in the beta coefficients across speakers.
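In symbols, a sketch of this structure for a single predictor $x$ (assuming, for illustration, independent normal distributions for the speaker-level coefficients) is:
\begin{equation} y_{is} = \beta_{0s} + \beta_{1s} x_{is} + \epsilon_{is}, \qquad \beta_{0s} \sim \mathcal{N}\left(\beta_0, \sigma_0^2\right), \qquad \beta_{1s} \sim \mathcal{N}\left(\beta_1, \sigma_1^2\right) \end{equation}
Here $s$ indexes speakers: the first equation is the individual-level model, and the distributions on the speaker-level coefficients constitute the group-level model.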
Here's what the full data set looks like:
cons_all.head()
We will fit a multilevel model to the coda data. In order to fit this model, we need a formula for the 'fixed effects' (i.e., the group-level model) and a formula for the 'random effects' (i.e., the individual-level model). Note that the latter only refers to the independent variables, with a 1 indicating the intercept (which is included in the group-level model by default). Note, too, that we specify that subject is the grouping variable.
Note also that voice is coded 0 for voiceless and 1 for voiced consonants, and manner is coded 0 for stops and 1 for fricatives. This means that the overall group-level model is:
\begin{equation} \hat{C}_i = \beta_0 + \beta_1 voice + \beta_2 manner \end{equation}
So, for voiceless stops, this reduces to:
\begin{equation} \hat{C}_i = \beta_0 \end{equation}
whereas for voiced stops, it is:
\begin{equation} \hat{C}_i = \beta_0 + \beta_1 \end{equation}
and for voiceless fricatives, it is:
\begin{equation} \hat{C}_i = \beta_0 + \beta_2 \end{equation}
and, finally, for voiced fricatives, it is:
\begin{equation} \hat{C}_i = \beta_0 + \beta_1 + \beta_2 \end{equation}
cons_coda = cons_all.loc[cons_all['slab']=='coda',:].copy() # get just coda data
# specify "random effects" formula
# this is the model fit to each subject's data
re_formula = '1 + voice + manner'
# specify "fixed effects" formula
# this is the model governing the subject-level models
fe_formula = 'condur ~ voice + manner'
# create and fit model
fit_e = smf.mixedlm(formula=fe_formula, re_formula=re_formula,
groups=cons_coda['subject'], data=cons_coda).fit()
print(fit_e.summary())
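As a rough sanity check on the fixed-effects estimates, we can compare them with the raw cell means (pooled over speakers and repetitions). These should approximately track $\beta_0$, $\beta_0 + \beta_1$, $\beta_0 + \beta_2$, and $\beta_0 + \beta_1 + \beta_2$:
cons_coda.groupby(['voice','manner'])['condur'].mean() # means for the four voice x manner cells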
We can extract the estimates of the 'random effects', i.e., each speaker's estimated deviations from the group-level beta coefficients:
re_df = pd.DataFrame(fit_e.random_effects).T
re_df.shape # n talkers, n model parameters
re_df.head()
Adding the group-level estimates back in gives each speaker's own coefficients. We can then plot the distributions of these values to visualize how the different speakers vary with respect to how voicing and manner influence consonant duration:
b0, b1, b2 = fit_e.params.loc[['Intercept','voice','manner']]
# add the group-level estimates to each speaker's deviations
sns.pairplot(re_df + np.stack([b0,b1,b2]), size=3, plot_kws={'s':100},
             diag_kws={'histtype':'step', 'lw':3}, diag_kind='hist');
Finally, we can visualize the data using violin plots organized by speaker (panel), voicing (x-axis), and manner (color):
fig = plt.figure(figsize=(20,20))
for i in range(5):
    for j in range(4):
        idx = j + 4*i + 1 # subject number (1-20)
        ax = plt.subplot(5,4,idx)
        rows = np.in1d(cons_coda['subject'],idx)
        df_s = cons_coda.loc[rows,['condur','vlab','mlab']]
        sns.violinplot(y='condur', x='vlab', data=df_s, hue='mlab', ax=ax, split=True, order=['voiceless','voiced'])
        ax.set_yticks([])
        if idx not in [1,5,9,13,17]: # y label only in the left column
            ax.set_ylabel('')
        else:
            ax.set_ylabel('Con. Dur.')
        if idx not in [17,18,19,20]: # x labels only in the bottom row
            ax.set_xlabel('')
            ax.set_xticklabels(['',''])
        else:
            ax.set_xlabel('Voicing')
        if idx < 20: # legend only in the last panel
            ax.legend_.remove()
        else:
            ax.legend_.set_title('')
#fig.savefig('condur_by_svm.png',bbox_inches='tight')
The statsmodels package also allows us to get standard ANOVA results from an OLS model fit:
cons_onset = cons.loc[cons['slab']=='onset',:] # pull out onset data
# fit model with main effects and interactions for voicing, place, and manner
# 'condur ~ voice*place*manner' is same as
# 'condur ~ voice + place + manner
# + voice:place + voice:manner + place:manner
# + voice:place:manner
condur_vpm_fit = smf.ols('condur ~ voice*place*manner', data=cons_onset).fit()
condur_vpm_ANOVA = sm.stats.anova_lm(condur_vpm_fit)
print(np.round(condur_vpm_ANOVA,2))
Statsmodels also has functions for estimating group-level effects using generalized estimating equations (GEEs). GEEs provide a robust way to test group-level effects with clustered (i.e., multilevel-structured) data even when the covariance structure of the data (conditional on the model) is misspecified. We didn't go into detail about the covariance structure with the 'mixed effects' model above, but those models can produce questionable results when it is misspecified.
Fitting a GEE model and viewing the results is very similar to fitting and viewing a mixed effect model:
form = 'condur ~ voice + manner'
fit_f = smf.gee(form, groups=cons_coda['subject'], data=cons_coda).fit()
print(fit_f.summary())
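By default, the GEE fit above uses an independence working covariance structure. Other structures can be specified with the cov_struct argument; for example, an exchangeable structure (constant correlation within each subject):
fit_f_ex = smf.gee(form, groups=cons_coda['subject'], data=cons_coda,
                   cov_struct=sm.cov_struct.Exchangeable()).fit()
print(fit_f_ex.summary())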
We can also fit generalized linear models by using the glm function rather than the ols function, in which case we have to specify exactly which kind of GLM we're fitting.
In this case, we will fit a logistic regression model to predict voicing (voiced = 1, voiceless = 0) as a function of vowel duration, syllable position, and the interaction between vowel duration and syllable position. Note that, in formula syntax, vowdur*prosody is equivalent to vowdur + prosody + vowdur:prosody.
Generalized linear models allow us to fit models to dependent variables that are not normally distributed around the expected value (e.g., the fitted value in the OLS models we fit above). The key additional component of a GLM is the link function, which transforms the linear combination of variables to produce an appropriate expected value for the type of dependent variable in question.
For logistic regression, the dependent variable for each row of the data should be either a Bernoulli random variable (e.g., a variable indicating if a single coin flip is heads = 1 or tails = 0) or a Binomial random variable, which is a sum of independent Bernoulli random variables (e.g., the number of heads = k in the total number of coin flips = N).
For a Bernoulli random variable $y$ with probability $\theta$ of a 'heads', we write:
\begin{equation} \Pr(y) = \theta^y (1-\theta)^{1-y} \end{equation}
If $y = 1$, this reduces to:
\begin{equation} \Pr(y) = \theta \end{equation}
And if $y = 0$, this reduces to:
\begin{equation} \Pr(y) = (1-\theta) \end{equation}
The (inverse) link function for logistic regression with dependent variable $y_i$ takes a linear combination of independent variables $X_i$ and maps it onto the interval (0,1) (i.e., it ensures that the result is a reasonable value for a probability):
\begin{equation} \Pr(y_i = 1) = \frac{1}{1+\exp\left(-\left(\beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} \right)\right)} \end{equation}
Since we know how to make arrays of values and plots to illustrate functions of those values, let's do that with the logistic link function:
def logistic(x):
    return 1/(1+np.exp(-x))
x = np.linspace(-5,5,500)
y1 = logistic(0+1*x); y2 = logistic(-1+1*x); y3 = logistic(3+1*x)
y4 = logistic(0+2*x); y5 = logistic(0+.35*x)
fig, axes = plt.subplots(1,2,figsize=(15,15))
ax1, ax2 = axes; ax1.set_aspect(10); ax2.set_aspect(10)
ax1.plot(x,y1,color=blue,lw=4); ax1.plot(x,y2,color=red,lw=4); ax1.plot(x,y3,color=green,lw=4)
ax2.plot(x,y1,color=blue,lw=4); ax2.plot(x,y4,color=red,lw=4); ax2.plot(x,y5,color=green,lw=4)
ax1.set_ylabel(r'$\theta$'); ax1.set_xlabel('x'); ax2.set_xlabel('x');
We can use the glm function to create and fit a logistic regression model as follows:
cons.head()
fit_d = smf.glm('voice ~ vowdur*prosody', data=cons, family=sm.families.Binomial()).fit()
print(fit_d.summary())
The coefficients of a logistic regression model are log odds-ratios. The odds of a dichotomous event are the ratio of the probability of the event occurring to the probability of it not occurring:
\begin{equation} odds = \frac{p}{1-p} \end{equation}
The odds ratio is a ratio of odds (for p at, e.g., levels 1 and 2):
\begin{equation} OR = \frac{\frac{p_1}{1-p_1}}{\frac{p_2}{1-p_2}} \end{equation}
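Exponentiating the fitted coefficients therefore converts them from log odds-ratios to odds ratios:
np.exp(fit_d.params) # coefficients as odds ratios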
There is a separate notebook all about why odds ratios are uninterpretable and bad; it also briefly discusses a more intuitive way to interpret logistic regression coefficients.
Seaborn can plot data and fitted values for logistic regression, too:
sns.lmplot(x='vowdur', y='voice', data=cons, hue='slab', col='slab',
logistic=True, size=8, y_jitter=.025, scatter_kws={'s':100});
As with the linear model, we can generate predicted means (probabilities) for unobserved independent variable values:
vowdur_new = np.concatenate((np.arange(0,751,25),np.arange(0,751,25)))
n_new = len(vowdur_new)
slab_new = np.concatenate((np.zeros(int(n_new/2),dtype=int),np.ones(int(n_new/2),dtype=int)))
# the formula recomputes the interaction term, so only the two predictor columns are needed
x_new = pd.DataFrame({'vowdur':vowdur_new,'prosody':slab_new})
y_new = fit_d.get_prediction(x_new)
y_pred_mean = y_new.predicted_mean
y_pred_se = y_new.se_mean
fig, ax = plt.subplots(1,1,figsize=(8,8))
clrs = [list(green), list(blue)]
for i in range(2):
    idx = np.in1d(x_new['prosody'],i)
    y1 = y_pred_mean[idx]-2*y_pred_se[idx]
    y2 = y_pred_mean[idx]+2*y_pred_se[idx]
    ax.fill_between(x=x_new.loc[idx,'vowdur'],y1=y1,y2=y2,color=clrs[i] + [.35])
    ax.plot(x_new.loc[idx,'vowdur'],y_pred_mean[idx],'-',lw=3,color=clrs[i] + [.95])
ax.set_ylabel('Predicted Pr(voiced)')
ax.set_xlabel('Vowel Duration');
We can also transform variables inside a formula using, e.g., numpy functions. First, here is consonant duration on the raw and log scales:
fig, axes = plt.subplots(1,2,figsize=(16,8))
ax1, ax2 = axes; ax1.set_aspect(12.5); ax2.set_aspect(1/10)
ax1.hist(cons['condur'],bins=50,histtype='step',lw=3); ax1.set(xlabel='dur (ms)')
ax2.hist(np.log(cons['condur']),bins=50,histtype='step',lw=3); ax2.set(xlabel='log(dur)');
And we can specify the transformation directly in the formula using the smf module:
fit_f = smf.glm('voice ~ np.log(condur)*slab', data=cons, family=sm.families.Binomial()).fit()
print(fit_f.summary())
In order to use seaborn's lmplot, we need to create a new variable that is the log of condur:
cons['log condur'] = np.log(cons['condur'])
sns.lmplot(x='log condur', y='voice', data=cons, hue='slab', col='slab', logistic=True, size=8, y_jitter=.025);
Statsmodels also has some more recently developed model fitting tools.
For example, you can fit regularized models. Regularization puts a penalty on the magnitude of the model coefficients, with the result that small coefficients are often forced to zero. It's similar in some ways to the pressure that the group-level model puts on the individual-level coefficients in a multilevel model, as discussed above. We will come back to this again later.
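With an L1 (lasso-style) penalty, which is what the fit_regularized call below uses by default, the fitting criterion becomes (with $\log L$ the model's log-likelihood and $\alpha$ the penalty weight):
\begin{equation} \hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\arg\min} \left[ -\log L(\boldsymbol{\beta}) + \alpha \sum_j |\beta_j| \right] \end{equation}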
For now, here's an example:
y = cons['voice'].as_matrix() # predicting voice (1 = voiced, 0 = voiceless)
n_obs = len(y) # number of observations
x_np = cons['noipow'].as_matrix() # using noise power as a predictor
x_cd = np.log(cons['condur'].as_matrix()) # and consonant duration
x_sb = np.in1d(cons['slab'],'coda') # and syllable position
x = np.stack((np.ones(n_obs),x_np,x_cd,x_np*x_sb,x_cd*x_sb),axis=1) # predictor matrix (w/ column of ones)
mod_g = sm.Logit(endog=y, exog=x) # logistic regression model object
fit_g1 = mod_g.fit_regularized(alpha=5) # use method fit_regularized()
print(fit_g1.summary())
fit_g2 = mod_g.fit()
print(fit_g2.summary())
Finally, we can use a logistic regression model as a classification model. Consider again the model predicting voicing as a function of vowel duration. We'll re-plot just the fitted values for the coda and onset data, along with three possible classification criteria (horizontal dotted lines):
# coda rows, onset rows
c_rows = np.in1d(cons['slab'],'coda')
o_rows = np.in1d(cons['slab'],'onset')
# coda voiced rows, onset voiced rows
c_voice = np.in1d(cons.loc[c_rows,'vlab'],'voiced')
o_voice = np.in1d(cons.loc[o_rows,'vlab'],'voiced')
# fitted coda values, fitted onset values
c_fitted = fit_d.fittedvalues[c_rows]
o_fitted = fit_d.fittedvalues[o_rows]
# coda vowel duration, onset vowel duration
c_vowdur = cons.loc[c_rows,'vowdur']
o_vowdur = cons.loc[o_rows,'vowdur']
# set up grid of axes
fig = plt.figure(figsize=(30,10)); ax1 = plt.subplot2grid((4,4),(0,0),rowspan=2)
ax2 = plt.subplot2grid((4,4),(0,1)); ax3 = plt.subplot2grid((4,4),(1,1)); ax2.set_xlim(0,1); ax3.set_xlim(0,1)
# plot predicted/fitted values and classification criteria
ax1.hlines(y=[.25,.5,.75],xmin=0,xmax=500,colors=[.5,.5,.5,.5],linestyles=[':','--','-.'],lw=3)
ax1.plot(c_vowdur,c_fitted,'o',color=blue) # coda
ax1.plot(o_vowdur,o_fitted,'s',color=green) # onset
ax1.set_xlabel('vowel duration')
# histograms of fitted Pr(voiced) for voiced and voiceless coda consonants
ax2.hist(c_fitted[c_voice],histtype='step',bins=20,lw=3,color=blue)
ax2.hist(c_fitted[~c_voice],histtype='step',bins=20,lw=3,color=np.array(blue)*.5); ax2.set_yticks([])
ax2.vlines([.25,.5,.75],ymin=0,ymax=ax2.get_ylim()[1],linestyles=[':','--','-.'],colors=[.25,.25,.25,.75],lw=3)
# histograms of fitted Pr(voiced) for voiced and voiceless onset consonants
ax3.hist(o_fitted[o_voice],histtype='step',bins=20,lw=3,color=green)
ax3.hist(o_fitted[~o_voice],histtype='step',bins=20,lw=3,color=np.array(green)*.5); ax3.set_yticks([])
ax3.vlines([.25,.5,.75],ymin=0,ymax=ax2.get_ylim()[1],linestyles=[':','--','-.'],colors=[.25,.25,.25,.75],lw=3);
ax3.set_xlabel('Pr(voiced)');
The relatively steep fitted-value function for the coda consonants (blue) indicates that vowel duration provides a lot of information about consonant voicing, whereas the shallower function for the onset consonants indicates that vowel duration is less informative.
The histograms for the coda consonants (top-right panel) illustrate the relatively large separation of the fitted values for voiced and voiceless consonants, while the histograms for the onset consonants (bottom-right panel) illustrate their near-complete overlap.
The three vertical broken lines in each panel illustrate three possible classification criteria. The leftmost (dotted) line is a very liberal classification rule (relative to the category 'voiced'), according to which any consonant with a fitted value greater than 0.25 is classified as 'voiced'. The middle (dashed) line indicates a moderate rule such that any consonant with a fitted value greater than 0.5 is classified as 'voiced', and the rightmost (dash-dot) line uses the conservative criterion 0.75.
Here are confusion matrices for each of these criteria for the onset and coda data and models:
criteria = np.array([0.25, 0.5, 0.75])
onset_mats = np.zeros((3,2,2), dtype=int)
coda_mats = np.zeros((3,2,2), dtype=int)
for ci, cc in enumerate(criteria):
    # use strict < on one side so observations exactly at the criterion are counted once
    onset_mats[ci,0,0] = np.sum(np.less(o_fitted,cc) & ~o_voice)
    onset_mats[ci,0,1] = np.sum(np.greater_equal(o_fitted,cc) & ~o_voice)
    onset_mats[ci,1,0] = np.sum(np.less(o_fitted,cc) & o_voice)
    onset_mats[ci,1,1] = np.sum(np.greater_equal(o_fitted,cc) & o_voice)
    coda_mats[ci,0,0] = np.sum(np.less(c_fitted,cc) & ~c_voice)
    coda_mats[ci,0,1] = np.sum(np.greater_equal(c_fitted,cc) & ~c_voice)
    coda_mats[ci,1,0] = np.sum(np.less(c_fitted,cc) & c_voice)
    coda_mats[ci,1,1] = np.sum(np.greater_equal(c_fitted,cc) & c_voice)
For each 2 x 2 matrix, the first (top) row is for voiceless consonants and the second (bottom) row is for voiced consonants; the first (left) column is for "voiceless" classifications (i.e., the model says "voiceless") and the second (right) column is for "voiced" classifications.
onset_mats
coda_mats
If we take a large number of criteria, ranging from the most conservative to the most liberal, and we calculate the probability of a hit (i.e., correctly classifying a voiced consonant as voiced) and the probability of a false alarm (i.e., incorrectly classifying a voiceless consonant as voiced), we get Receiver Operating Characteristic (ROC) curves, which give an indication of the overall classification accuracy of a given model (as well as some additional information):
# number of voiced and voiceless consonants in each syllable position
n_cv = np.sum(c_voice); n_cl = np.sum(~c_voice)
n_ov = np.sum(o_voice); n_ol = np.sum(~o_voice)
# criteria, ROC vector initialization
criteria = np.linspace(1,0,101)
n_crit = len(criteria)
c_roc = np.zeros((n_crit,2)); o_roc = np.zeros((n_crit,2))
# calculate hit and FA probabilities
for i,crit in enumerate(criteria):
    c_hit = np.sum(np.greater_equal(c_fitted,crit) & c_voice)/n_cv
    c_fa = np.sum(np.greater_equal(c_fitted,crit) & ~c_voice)/n_cl
    c_roc[i,:] = c_fa,c_hit
    o_hit = np.sum(np.greater_equal(o_fitted,crit) & o_voice)/n_ov
    o_fa = np.sum(np.greater_equal(o_fitted,crit) & ~o_voice)/n_ol
    o_roc[i,:] = o_fa,o_hit
fig, ax = plt.subplots(1,1,figsize=(10,10))
ax.plot([0,1],[0,1],':',color=[.5,.5,.5,.5])
lc, = ax.plot(c_roc[:,0],c_roc[:,1],lw=4,color=blue)
lo, = ax.plot(o_roc[:,0],o_roc[:,1],lw=4,color=green)
ax.set_ylabel('Pr(Hit)'); ax.set_xlabel('Pr(False Alarm)')
ax.legend([lc,lo],['coda','onset']);
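A common single-number summary of an ROC curve is the area under it (AUC; 0.5 is chance performance, 1.0 is perfect classification). Since we already have the hit and false alarm probabilities on a grid of criteria, the trapezoidal rule gives a quick estimate:
# trapezoidal-rule estimate of the area under each ROC curve
auc_coda = np.trapz(c_roc[:,1], x=c_roc[:,0])
auc_onset = np.trapz(o_roc[:,1], x=o_roc[:,0])
print(auc_coda, auc_onset)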
The statsmodels module contains a number of useful statistical model-fitting tools. We looked at just a small sample of them here: ordinary least squares linear regression, logistic regression, multilevel linear regression, generalized estimating equations, and regularized logistic regression. We also briefly discussed the use of logistic regression as a classification model.