While we will not go into the low-level details of our implementation here, it is useful to raise some points about how the nature of the programming language used affects the implementation.
Festival is implemented in two languages, C++ and Scheme (a variant of Lisp). While in principle it would be attractive to implement the system in a single language, practical considerations concerning the nature of programming languages necessitate the approach we have taken here.
In addition to being a research platform, Festival also operates as a run-time system, and hence speed is vital. Because of this, it is necessary to have substantial amounts of the code written in a compiled low-level language such as C or C++. For particular types of operations, such as the array processing often used in signal processing, a language such as C/C++ is much faster than higher-level alternatives. However, a system that is 100% compiled is too restrictive, because it prevents essential run-time configuration, as the following examples demonstrate.
In developing an algorithm, it is often useful to try alternatives. With a completely compiled system, trying alternatives would involve changing the code and then re-compiling and running the program. While this may be an acceptable effort for two or three algorithm variations, it soon becomes impractical for larger numbers, especially when a large set of alternatives needs to be tried as part of an experiment. With an interpreter, the changes can be made at run time, and it is often a simple matter to write a Scheme script that iterates through all the alternatives and produces a table of results.
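The idea of choosing among alternatives at run time, rather than by re-compilation, can be sketched in C++ alone. The following is a minimal illustration and not Festival's mechanism (which uses the Scheme interpreter): the names `Algorithm` and `AlgorithmRegistry` are hypothetical, and an "algorithm" is assumed to be a simple function from a number to a number.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical: an "algorithm" here is any function from an input to a result.
using Algorithm = std::function<double(double)>;

// Registry of compiled variants, keyed by a name that is chosen at run time.
class AlgorithmRegistry {
public:
    // Registers (or replaces) the variant stored under the given name.
    void add(const std::string& name, Algorithm fn) {
        variants_[name] = std::move(fn);
    }

    // Runs every registered variant on the same input and returns a
    // (name, result) table, ordered alphabetically by name.
    std::vector<std::pair<std::string, double>> run_all(double input) const {
        std::vector<std::pair<std::string, double>> table;
        for (const auto& entry : variants_) {
            table.emplace_back(entry.first, entry.second(input));
        }
        return table;
    }

private:
    std::map<std::string, Algorithm> variants_;
};
```

New variants are added with `add` under any name, and `run_all` plays the role of the script that iterates over the alternatives and tabulates the results.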
Festival is used for synthesizing many languages, and it would be impossible to re-configure the compiled code for each. In practical usage, Festival must therefore be flexible enough that any algorithm in any language can be implemented without re-compiling existing code. Due to this requirement, structures such as features, items and relations have been implemented so that any number of each, with any name, can be used. Specifically, the use of C structures whose fields correspond to entities such as ``stress'', ``end'' or ``part of speech'' has been avoided, as the addition of any new feature would require a re-compilation. Instead, features are represented by extensible key-value lists. Feature names are stored as strings, and efficient functions are used to return the value of a feature given its string name. The relations in an utterance are also stored as an extensible list, with strings being used to provide access to a given relation. Hence there is nothing in the C++ architecture which dictates what relations, items or features should be called. All items, relations and features are of exactly the same type in C++ regardless of what linguistic information they carry. It is only at run time that their linguistic function is designated.
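A minimal sketch of this design is given below. It is an illustration under stated assumptions, not Festival's actual C++ interface: the class names `Features`, `Item`, `Relation` and `Utterance` and their member functions are hypothetical, feature values are assumed to be strings, a hash map stands in for the key-value list, and a relation is simplified to a flat sequence of items.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical feature set: string names mapped to string values.
// New feature names can be introduced at run time without re-compilation.
class Features {
public:
    void set(const std::string& name, const std::string& value) {
        values_[name] = value;
    }

    // Returns the stored value, or `fallback` if the feature was never set.
    std::string get(const std::string& name,
                    const std::string& fallback = "") const {
        auto it = values_.find(name);
        return it == values_.end() ? fallback : it->second;
    }

    bool has(const std::string& name) const {
        return values_.count(name) != 0;
    }

private:
    std::unordered_map<std::string, std::string> values_;
};

// An item carries only a feature set; its linguistic role is not in its type.
struct Item {
    Features features;
};

// Simplification: a relation is modelled as a flat sequence of items.
struct Relation {
    std::vector<Item> items;
};

// An utterance holds any number of relations, accessed by string name.
class Utterance {
public:
    // Returns the named relation, creating an empty one if it does not exist.
    Relation& relation(const std::string& name) { return relations_[name]; }

    bool has_relation(const std::string& name) const {
        return relations_.count(name) != 0;
    }

private:
    std::unordered_map<std::string, Relation> relations_;
};
```

Note that a word, a syllable and a phone would all be plain `Item` objects here. Only the relation name and the feature names, both supplied at run time, distinguish them.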
We have found from experience that when designing a complex architecture such as this, it is important to take into account the expectations programmers have of a language they are experienced with. In a previous architecture we developed ([4]), we achieved some of the generality and run-time flexibility described above. However, the C language constructs we used were often obscure, stretching the language to its very limits. Because of this, even experienced C programmers found the system very difficult to program with, simply because much of the code did not look like recognizable C to them. In Festival we have paid much more attention to this problem, and the current C++ interface seems fairly natural and unobtrusive.
File I/O functions have been written so that utterance structures can be saved to and loaded from disk at any point in the synthesis procedure. In fact, this facility has proved so useful that we now store many of our speech databases (which contain phone, word and intonation information) in this format.