Corpus Phonetics Tutorial

Eleanor Chodroff


Corpus phonetics has become an increasingly popular method of research in linguistic analysis. With advances in speech technology and computational power, large-scale processing of speech data has become a viable technique. A fair number of researchers have exploited these methods, yet the techniques remain elusive for many. In the words of Mark Liberman (2009), there has been “surprisingly little change in style and scale of [phonetic] research” from 1966 on, which implies that the field still relies on small samples of speech data. While “big data” phonetics is not the be-all and end-all of phonetic research, larger samples support more statistically reliable conclusions about phonetic values in an individual or a population. Moreover, corpus research is not synonymous with big data. Rather, corpus phonetics is a method of processing speech data whose main advantages are computational power, which is where the link to big data comes in, and efficiency. The methods and tools developed for corpus phonetics are based on engineering algorithms, primarily from automatic speech recognition (ASR), as well as simple programming for data manipulation. This tutorial aims to bring some of these tools to the non-engineer, and specifically to the speech scientist.

Acoustic analysis programs such as Praat and MATLAB are already capable of large-scale phonetic measurement via their respective scripting languages. While the tutorial covers some phonetic processing in Praat, its primary aim is to introduce supplementary tools for phonetic processing. These tools build on concepts and algorithms from automatic speech recognition that allow phonetic boundaries to be aligned automatically to the speech signal.

In particular, the tutorial covers the Penn Forced Aligner, AutoVOT, and various tools from Kaldi. The Penn Forced Aligner performs “forced alignment”, the automatic synchronization of a sequence of phones with a speech segment, which greatly expedites data processing and phonetic measurement. AutoVOT is an automatic voice onset time (VOT) measurement tool that demarcates the burst release and vocalic onset of a word-initial, prevocalic stop consonant. Kaldi is an automatic speech recognition toolkit that provides the infrastructure to build “personalized” acoustic models and forced alignment systems. Acoustic models are statistical representations of each phoneme's acoustic information. “Personalized” means that the system is capable of modeling any corpus of speech, be it British English, Southern American English, Hungarian, or Korean. Kaldi additionally houses many speech processing algorithms that may be of use to the speech scientist. This tutorial covers only acoustic model training and forced alignment in Kaldi, but the toolkit as a whole offers exceptional potential for phonetic research.

Finally, the tutorial assumes basic familiarity with Praat and the use of a Mac operating system, primarily for the default Unix shell (bash) in the Terminal application. For the Penn Forced Aligner and AutoVOT, most of the Unix commands are provided in the tutorial itself. Kaldi requires more fluency in shell scripting, although I try to provide as many of the commands as possible. If you have not used the Terminal application before, I recommend looking over some basic Unix commands online (Google is every programmer's best friend). For a list of the most useful commands, I recommend this website: For more details regarding the argument structure, I recommend this website:
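For readers who have never opened the Terminal, the following is a minimal sketch of the kind of basic Unix commands meant here. It is not taken from any of the tools covered in this tutorial: the directory and file names (corpus, s01.wav, and so on) are made up for illustration, and everything happens inside a temporary directory so that nothing else on your machine is touched.

```shell
# Work in a throwaway directory; all names below are hypothetical examples.
workdir=$(mktemp -d)
cd "$workdir"

pwd                                     # print the current (working) directory
mkdir corpus                            # make a new directory
cd corpus                               # move into it
touch s01.wav s01.txt s02.wav s02.txt   # create empty placeholder files
ls                                      # list everything in the directory
ls *.wav                                # list only the files ending in .wav

echo "the quick brown fox" > s01.txt    # write text to a file (overwrites it)
cat s01.txt                             # print the contents of a file
cp s01.txt s01_backup.txt               # copy a file
mv s01_backup.txt old.txt               # rename (move) a file
rm old.txt                              # delete a file (there is no undo)

# A pipe (|) passes the output of one command to the next:
# here, count the .wav files and strip any padding spaces from the count.
n_wav=$(ls *.wav | wc -l | tr -d ' ')
echo "$n_wav wav files"

cd ..                                   # go back up one directory
```

Most commands follow the same pattern: the command name, then optional flags (such as -l or -d), then the files or directories to act on. Typing man followed by a command name (for example, man ls) shows the manual page for that command.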

Each section covers the prerequisites for installing a program, as well as an example “problem” that the program can solve. As a rule of thumb, install all prerequisites before installing the program itself.