Previous research based on self-report suggests that individuals with a Vocal Pitch Imitation Deficit (VPID; often called “poor-pitch singers”) experience less vivid auditory images than accurate imitators (Pfordresher & Halpern, 2013). In the present research we sought to test this proposal directly by having accurate and VPID imitators produce or recognize short melodies either in their original form (untransformed) or after mentally transforming the auditory image of the melody. For the production task, group differences were largest during untransformed imitation. Importantly, producing mental transformations of the auditory image degraded performance for all participants, but was relatively more disruptive to accurate than to VPID imitators. These findings suggest that VPID is due partly to poor initial imagery formation, as manifested in the production of untransformed melodies. By contrast, producing a transformed mental image may rely on working memory ability, which is more equally matched across participants. This interpretation was further supported by correlations with self-reports of auditory imagery and measures of working memory.
We propose a new framework to understand singing accuracy, based on multi-modal imagery associations: the MMIA model. This model is based on recent data suggesting a link between auditory imagery and singing accuracy, evidence for a link between imagery and the functioning of internal models for sensorimotor associations, and the use of imagery in singing pedagogy. By this account, imagery involves automatic associations between different modalities, which in the present context comprise associations between pitch height and the regulation of vocal fold tension. Importantly, these associations are based on probabilistic relationships that may vary with respect to their precision and accuracy. We further describe how this framework may be extended to multi-modal associations at the sequential level, and how these associations develop. The model we propose here constitutes one part of a larger architecture responsible for singing, but at the same time is cast at a general level that can extend to multi-modal associations outside the domain of singing.