Poster presented at CUNY 2015 (PDF)

Informativity in adaptation: Supervised and unsupervised learning of linguistic cue distributions

  • Do people use informative labels during adaptation?
    • You don't stop learning language as an adult: you need to learn, or adapt to, the language produced by every new talker you meet.
    • Adapting to unusual productions is much easier when you know from other cues what the talker meant to say (labeled) than when you are uncertain about what the intended category was (unlabeled).
    • But no studies have directly compared labeled and unlabeled adaptation.

Background

  • Categories (/b/ and /p/) are distributions of cues (VOT, f0, etc.); see the sketch after this list
  • Distributional learning:
    • Acquisition: learn language distributions
    • Adaptation: learn talker's distributions
  • Same underlying process?
    • Acquisition is slow and hard, adaptation fast and easy. Why?
    • Labels: context provides lots of information (visual, lexical, etc.) that labels the cues for the listener.
  • Do listeners actually use labels for adaptation when they're provided? There hasn't been a good direct test.
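
To make this concrete, here is a minimal sketch of the distributional view (the /b/ and /p/ means and SDs are illustrative assumptions, not the experiment's stimulus parameters): each category is a distribution over VOT, the ideal classification function is the posterior probability of /p/, and the category boundary sits where the two distributions cross.

# Minimal sketch: categories as distributions over a cue (VOT).
# Means and SDs here are illustrative assumptions only.
b_mean <- 0;  p_mean <- 50   # ms VOT
b_sd   <- 10; p_sd   <- 10

vot <- seq(-20, 80, by = 1)

# Posterior probability of /p/ given VOT (equal priors):
p_post <- dnorm(vot, p_mean, p_sd) /
  (dnorm(vot, p_mean, p_sd) + dnorm(vot, b_mean, b_sd))

# With equal priors and equal SDs, the optimal boundary is the crossing
# point, i.e. the midpoint between the category means:
boundary <- (b_mean + p_mean) / 2   # 25 ms

plot(vot, p_post, type = 'l', xlab = 'VOT (ms)', ylab = 'p(/p/ | VOT)')
abline(v = boundary, lty = 2)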


Preliminaries

Set things up and load data

library(knitr)
knitr::opts_chunk$set(cache=TRUE, 
                      autodep=TRUE,
                      dev=c('png', 'pdf', 'svg'),
                      fig.retina=2,
                      warning=FALSE,
                      message=FALSE)

library(devtools)
if (! require('supunsup', quietly=TRUE)) {
  devtools::install_github('kleinschmidt/phonetic-sup-unsup')
  require('supunsup')
}

# pre-parsed + excluded data from package
dat <- supunsup::supunsup_clean

Fit a mixed-effects logistic regression model to the responses on unlabeled trials.

library(lme4)
library(dplyr)

dat_mod <- dat %>%
  # keep unlabeled trials only (every trial in the unsupervised condition is unlabeled)
  filter(trialSupCond == 'unsupervised' | supCond == 'unsupervised') %>%
  mutate_for_lmer()

dat_fit <- glmer(respP ~ trial.s * vot_rel.s * supCond * bvotCond +
                   (trial.s * vot_rel.s | subject),
                 data = dat_mod,
                 family = 'binomial',
                 control = glmerControl(optimizer = 'bobyqa'))

Lay the groundwork for visualization.

library(ggplot2)
theme_set(theme_bw())

four_colors <-
  c("#255984",     # blue
    "#CC8A2E",     # yellow
    "#CC5B2E",     # red-orange
    "#208C5E")     # green

four_colors_saturated <-
  c("#1461A1",
    "#F99710",
    "#F95210",
    "#0BAB66")


scale_color_discrete <- function(...) {
  scale_color_manual(values = four_colors_saturated, ...)
}

scale_fill_discrete <- function(...) {
  scale_fill_manual(values = four_colors_saturated, ...)
}



# size for three-panel results figures
res_w <- 11
res_h <- 5

# formatting boilerplate:
format_results_plot <- function(p) {
  p + 
    scale_color_discrete('Shift (ms)', drop=FALSE) + 
    scale_fill_discrete('Shift (ms)', drop=FALSE) +
    scale_x_continuous('VOT (ms)', breaks=seq(-20, 80, by=20)) + 
    scale_y_continuous('Proportion /p/ response') +
    scale_linetype_discrete('Condition')
}

plot_category_bounds <- function(cat_bounds, dodge_w = 0.75) {
  ggplot(cat_bounds,
         aes(x=factor(bvotCond, levels=rev(levels(bvotCond))),
             y=boundary_vot,
             ymin=boundary_vot - 1.96*boundary_vot_se,
             ymax=boundary_vot + 1.96*boundary_vot_se,
             color=bvotCond,
             linetype=factor(supCond,
               levels=c('unsupervised', 'supervised', 'mixed')),
             group=paste(shift, supCond))) +
    geom_pointrange(size=1.5, position=position_dodge(w=dodge_w)) +
    geom_point(aes(y=boundary_vot_true), shape=1, size=6) + 
    scale_color_discrete(drop=FALSE) +
    scale_linetype_manual(drop=FALSE,
                          values = c(1, 2, 3)) + 
    theme(legend.position='none') +
    scale_x_discrete('Shift (ms VOT)') +
    scale_y_continuous('/b/-/p/ boundary (ms VOT)',
                       breaks = seq(10, 60, by=5)) +
    coord_flip()
}

From the fitted model, generate predictions to visualize the model fits and measure the fitted category boundaries.

add_experiment <- function(data_) {
  data_ %>%
    mutate(experiment = 
             ifelse(supCond == 'mixed', 'Experiment 4',
                    ifelse(bvotCond == 20, 'Experiment 2',
                           ifelse(bvotCond == 30, 'Experiment 3',
                                  'Experiment 1'))))
}

dat_pred <- make_prediction_data(dat, dat_mod) %>% add_experiment

# raw average respond-P probability
respP_by_thirds <- dat %>%
  mutate(thirds=ntile(trial, 3)) %>%
  select(-trial) %>%
  left_join(bin_trials(dat)) %>%
  group_by(supCond, trialSupCond, trial_range, bvotCond, vot) %>%
  summarise(respP = mean(respP)) %>%
  add_experiment

respP_by_thirds_unlab <- respP_by_thirds %>%
  mutate(type='data') %>%               # for plotting along w/ glmer fits
  filter(trialSupCond == 'unsupervised')

cat_bounds <- category_boundaries(dat_mod, dat_fit) %>% add_experiment

Methods

Subjects

n_subj <- supunsup::supunsup %>%
  group_by(subject) %>%
  summarise %>%
  nrow

n_subj_excluded <- supunsup::supunsup_excluded %>%
  group_by(subject) %>%
  summarise %>%
  nrow

We ran 368 subjects on Mechanical Turk, excluded 26 for chance performance, and had data from 342 for analysis. There were about 29 subjects in each condition:

supunsup::supunsup_clean %>%
  group_by(supCond, bvotCond, subject) %>%
  summarise %>%
  tally %>%
  kable(caption = 'Subjects in each condition')
| supCond      | bvotCond |  n |
|:-------------|---------:|---:|
| mixed        |        0 | 30 |
| mixed        |       10 | 28 |
| mixed        |       20 | 29 |
| mixed        |       30 | 29 |
| supervised   |        0 | 31 |
| supervised   |       10 | 30 |
| supervised   |       20 | 27 |
| supervised   |       30 | 29 |
| unsupervised |        0 | 26 |
| unsupervised |       10 | 27 |
| unsupervised |       20 | 30 |
| unsupervised |       30 | 29 |

Procedure

On each trial, subjects heard a spoken b/p minimal-pair word (beach/peach, bees/peas, beak/peak) and clicked on the picture matching the word they heard. On labeled trials, only one picture could match the audio (e.g., bees and peach). On unlabeled trials, both members of the minimal pair were shown (e.g., beach and peach). Subjects heard 222 trials, with the mixture of VOT values and of labeled/unlabeled trials determined by the condition they were randomly assigned to.
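
The picture-pairing logic might be sketched like this (a hypothetical illustration, not the experiment's actual code; it assumes the labeled-trial distractor takes the opposite voicing, as in the bees/peach example):

# Hypothetical sketch of picture selection on each trial.
pairs <- data.frame(b = c('beach', 'bees', 'beak'),
                    p = c('peach', 'peas', 'peak'),
                    stringsAsFactors = FALSE)

pick_pictures <- function(word, labeled) {
  i <- which(pairs$b == word | pairs$p == word)  # which pair was heard
  if (!labeled) {
    # unlabeled: show both members of the heard minimal pair
    c(pairs$b[i], pairs$p[i])
  } else {
    # labeled: distractor comes from a different pair (assumed here to have
    # the opposite voicing), so only one picture matches the audio
    j <- sample(setdiff(seq_len(nrow(pairs)), i), 1)
    if (word %in% pairs$b) c(word, pairs$p[j]) else c(pairs$b[j], word)
  }
}

pick_pictures('beach', labeled = FALSE)  # "beach" "peach"
pick_pictures('peach', labeled = TRUE)   # e.g., "bees" "peach"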

Conditions and stimulus distributions

Shift conditions

[Figure expt1-stim-counts: counts of stimulus VOT values in each shift condition]

dat %>%
  group_by(bvotCond) %>%
  filter(subject == first(subject)) %>%
  group_by(bvotCond, vot) %>%
  tally() %>%
  ggplot(aes(x=vot, y=n, fill=factor(bvotCond))) +
  geom_bar(stat='identity') +
  facet_grid(.~bvotCond) +
  scale_x_continuous('VOT (ms)', breaks=seq(-20, 80, by=20))

Supervision conditions

[Figure sup-unsup-mixed-stim-counts: counts of labeled vs. unlabeled stimuli by VOT in each supervision condition]

dat %>%
  filter(bvotCond == 0) %>%
  group_by(supCond) %>%
  filter(subject == first(subject)) %>%
  group_by(supCond, labeled, vot) %>%
  tally %>%
  ggplot(aes(x=vot, y=n, fill=labeled)) +
  geom_bar(stat='identity') +
  scale_fill_manual(values = c('black', 'gray')) + 
  facet_grid(.~supCond) +
  scale_x_continuous('VOT (ms)', breaks=seq(-20, 80, by=20))

Measuring learning

Learning was assessed by fitting a mixed-effects logistic regression model, extracting the category boundary (the VOT at which the predicted response was 50% /b/, 50% /p/), and comparing this fitted boundary to the boundary predicted by the input distribution (the maximally ambiguous stimulus given the /b/ and /p/ distributions).

Because the model includes slopes for trial, we had to pick a point at which to evaluate the category boundaries. We picked the point 5/6ths of the way through the experiment, because this is near the end of the experiment but not so close to the end that edge effects add additional uncertainty.
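
For intuition, the boundary can be read directly off a logistic fit: if logit p(/p/) = b0 + b1 * VOT, the predicted response is 50% /p/ at VOT = -b0/b1. Here is a minimal sketch with simulated responses (the real analysis instead evaluates the glmer fixed and random effects at the chosen trial value):

# Sketch: recovering a category boundary from a logistic regression,
# using simulated responses with a true boundary at 30 ms VOT.
set.seed(1)
vot    <- rep(seq(-10, 70, by = 10), each = 20)
resp_p <- rbinom(length(vot), 1, plogis((vot - 30) / 5))

fit <- glm(resp_p ~ vot, family = binomial)

# p(/p/) = 0.5 where the linear predictor is zero:
boundary <- -coef(fit)[['(Intercept)']] / coef(fit)[['vot']]
boundary  # close to 30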

Results

Labeled trials across all experiments

Accuracy on labeled trials (where there was a correct response) was very good:

[Figure labeled-summary: proportion of /p/ responses by label on labeled trials]

se <- function(x) sd(x) / sqrt(length(x))  # standard error of the mean

# responses and accuracy on labeled trials across all experiments:
labeled_summary <- dat %>%
  filter(labeled == 'labeled') %>%
  add_experiment %>%
  mutate(labelCat = respCategory) %>%
  group_by(labelCat, subject) %>%
  summarise(acc = mean(labelCat == respCat),
            respP = mean(respP)) %>%
  summarise_each(funs(mean, se))

ggplot(labeled_summary, aes(x=labelCat, y=respP_mean)) +
  geom_bar(stat='identity') +
  scale_x_discrete('Labeled as') +
  scale_y_continuous('Proportion /p/ responses', breaks=c(0, 0.5, 1))

Subjects were 98% accurate across all experiments.

Experiment 1

dat_ex1 <- dat %>%
  add_experiment %>%
  filter(experiment == 'Experiment 1')

  • Distributions:
    • Unshifted
    • +10ms
  • Good learning (matched predicted category boundaries)
  • No effect of labels

[Figure expt1-results: fitted classification functions and data, Experiment 1]

expt1_respP <- respP_by_thirds_unlab %>%
  filter(experiment == 'Experiment 1')

predict_and_plot(filter(dat_pred, experiment == 'Experiment 1'),
                 dat_fit,
                 show_se=TRUE) %>%
  format_results_plot + 
  geom_point(data = expt1_respP, aes(y=respP)) +
  geom_line(data = expt1_respP, aes(y=respP))

[Figure expt1-cat-bounds: fitted vs. predicted /b/-/p/ boundaries, Experiment 1]

cat_bounds %>%
  filter(experiment == 'Experiment 1') %>%
  plot_category_bounds() +
  scale_y_continuous('/b/-/p/ boundary (ms VOT)',
                     breaks = seq(10, 60, by=5),
                     limits = c(18, 32))

Experiments 2+3

  • Too easy? Use bigger shifts
    • +20ms
    • +30ms
  • Learning still occurred, but boundaries did not match the input as well.
  • Still no effect of labels

[Figure expt2-3-results: fitted classification functions and data, Experiments 2 and 3]

expt23_respP <- respP_by_thirds_unlab %>%
  filter(experiment %in% c('Experiment 2', 'Experiment 3'))

predict_and_plot(filter(dat_pred, experiment %in% c('Experiment 2', 'Experiment 3')),
                 dat_fit,
                 show_se=TRUE) %>%
  format_results_plot + 
  geom_point(data = expt23_respP, aes(y=respP)) + 
  geom_line(data = expt23_respP, aes(y=respP))

[Figure expt2-3-cat-bounds: fitted vs. predicted /b/-/p/ boundaries, Experiments 2 and 3]

cat_bounds %>%
  filter(experiment %in% c('Experiment 2', 'Experiment 3')) %>%
  plot_category_bounds()

Experiment 4

  • Is learning stimulus-specific? Mix labeled and unlabeled trials within subjects, and compare with the unsupervised conditions from Experiments 1-3.
  • All shifts: +0, +10, +20, +30ms
  • Nothing changes.

[Figure expt4-results: fitted classification functions and data, Experiment 4 and unsupervised conditions]

expt4_respP <- respP_by_thirds_unlab %>%
  filter(experiment == 'Experiment 4' | supCond == 'unsupervised')

predict_and_plot(filter(dat_pred, supCond %in% c('mixed', 'unsupervised')),
                 dat_fit,
                 show_se=TRUE) %>%
  format_results_plot + 
  geom_point(data = expt4_respP, aes(y=respP)) + 
  geom_line(data = expt4_respP, aes(y=respP))

[Figure expt4-cat-bounds: fitted vs. predicted /b/-/p/ boundaries, Experiment 4 and unsupervised conditions]

cat_bounds %>%
  filter(experiment == 'Experiment 4' | supCond == 'unsupervised') %>%
  plot_category_bounds(dodge_w=1)

Conclusion

  • Listeners don't use labels to speed up or improve adaptation.
  • Adaptation looks more like acquisition: listeners rely on the distributions themselves.
  • But other sources of information do matter: there was less adaptation to the more unusual distributions (+20 and +30 ms shifts).

Model output

For the curious and/or masochistic. Trial was centered and scaled to the range \([-0.5, 0.5]\), and VOT was centered on the predicted category boundary for each subject and scaled to continuum steps (+1 is +10 ms). Shift condition was treated as a factor and Helmert coded, while supervision was sum coded with unsupervised as the base level. By-subject random intercepts and random slopes for trial, VOT, and their interaction were included.
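
What this coding amounts to, roughly (a sketch only: the actual transformation is done by mutate_for_lmer() in the supunsup package, and the scaling constants and boundary_pred column here are assumptions based on the description above):

# Sketch of the predictor coding described above; mutate_for_lmer() in the
# supunsup package does the real work, so details here are assumptions.
code_predictors <- function(d) {
  # trial: centered and scaled so the experiment spans -0.5 to 0.5
  d$trial.s <- (d$trial - mean(range(d$trial))) / diff(range(d$trial))
  # VOT: centered on each subject's predicted boundary (hypothetical
  # boundary_pred column), in continuum steps (+1 = +10 ms)
  d$vot_rel.s <- (d$vot - d$boundary_pred) / 10
  # shift condition: factor with Helmert contrasts
  d$bvotCond <- factor(d$bvotCond)
  contrasts(d$bvotCond) <- contr.helmert(nlevels(d$bvotCond))
  # supervision: sum coding with unsupervised as the base level
  d$supCond <- factor(d$supCond,
                      levels = c('supervised', 'mixed', 'unsupervised'))
  contrasts(d$supCond) <- contr.sum(nlevels(d$supCond))
  d
}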

library(stargazer)

var_name_subs <- list(
  c(':', ' : '),
  c('vot_rel.s', 'VOT'),
  c('bvotCond', 'Shift'),
  c('supCond', 'unsup-vs-'),
  c('trial.s', 'Trial'))

stargazer(dat_fit, float=FALSE, single.row=TRUE,
          covariate.labels = str_replace_multi(names(fixef(dat_fit)),
                                               var_name_subs, TRUE),
          digits = 2, star.cutoffs = c(0.05, 0.01, 0.001),
          align=TRUE,
          intercept.bottom=FALSE, model.numbers=FALSE, 
          dep.var.labels.include=FALSE, dep.var.caption='', 
          keep.stat = c('n'), type='html')
| Predictor | Estimate (SE) |
|:--|--:|
| (Intercept) | 0.78*** (0.05) |
| Trial | 0.07 (0.10) |
| VOT | 1.62*** (0.04) |
| unsup-vs-mixed | 0.04 (0.07) |
| unsup-vs-supervised | -0.11 (0.07) |
| Shift10 | 0.34*** (0.07) |
| Shift20 | 0.42*** (0.04) |
| Shift30 | 0.35*** (0.03) |
| Trial : VOT | 0.75*** (0.07) |
| Trial : unsup-vs-mixed | -0.10 (0.13) |
| Trial : unsup-vs-supervised | 0.06 (0.13) |
| VOT : unsup-vs-mixed | -0.05 (0.05) |
| VOT : unsup-vs-supervised | -0.08 (0.05) |
| Trial : Shift10 | -0.07 (0.13) |
| Trial : Shift20 | 0.02 (0.08) |
| Trial : Shift30 | -0.14* (0.05) |
| VOT : Shift10 | 0.01 (0.05) |
| VOT : Shift20 | -0.07* (0.03) |
| VOT : Shift30 | -0.09*** (0.02) |
| unsup-vs-mixed : Shift10 | -0.06 (0.10) |
| unsup-vs-supervised : Shift10 | -0.02 (0.10) |
| unsup-vs-mixed : Shift20 | -0.05 (0.06) |
| unsup-vs-supervised : Shift20 | -0.03 (0.06) |
| unsup-vs-mixed : Shift30 | -0.02 (0.04) |
| unsup-vs-supervised : Shift30 | -0.04 (0.04) |
| Trial : VOT : unsup-vs-mixed | 0.07 (0.09) |
| Trial : VOT : unsup-vs-supervised | -0.09 (0.09) |
| Trial : VOT : Shift10 | 0.10 (0.09) |
| Trial : VOT : Shift20 | -0.04 (0.05) |
| Trial : VOT : Shift30 | -0.01 (0.04) |
| Trial : unsup-vs-mixed : Shift10 | -0.15 (0.19) |
| Trial : unsup-vs-supervised : Shift10 | 0.14 (0.18) |
| Trial : unsup-vs-mixed : Shift20 | -0.08 (0.11) |
| Trial : unsup-vs-supervised : Shift20 | 0.18 (0.11) |
| Trial : unsup-vs-mixed : Shift30 | -0.20** (0.08) |
| Trial : unsup-vs-supervised : Shift30 | 0.15* (0.08) |
| VOT : unsup-vs-mixed : Shift10 | -0.05 (0.07) |
| VOT : unsup-vs-supervised : Shift10 | 0.02 (0.07) |
| VOT : unsup-vs-mixed : Shift20 | 0.02 (0.04) |
| VOT : unsup-vs-supervised : Shift20 | -0.07 (0.04) |
| VOT : unsup-vs-mixed : Shift30 | 0.02 (0.03) |
| VOT : unsup-vs-supervised : Shift30 | -0.03 (0.03) |
| Trial : VOT : unsup-vs-mixed : Shift10 | -0.16 (0.13) |
| Trial : VOT : unsup-vs-supervised : Shift10 | 0.04 (0.13) |
| Trial : VOT : unsup-vs-mixed : Shift20 | -0.03 (0.07) |
| Trial : VOT : unsup-vs-supervised : Shift20 | 0.03 (0.08) |
| Trial : VOT : unsup-vs-mixed : Shift30 | -0.05 (0.05) |
| Trial : VOT : unsup-vs-supervised : Shift30 | 0.02 (0.05) |
| Observations | 50,724 |

Note: *p<0.05; **p<0.01; ***p<0.001