Many early synthesis systems used what has been referred to as a string re-writing mechanism as their central data structure. In this formalism, the linguistic representation of an utterance is stored as a string. Initially, the string contains text, which is then re-written or embellished with extra symbols as processing takes place. Systems such as MITalk [1] and the CSTR Alvey synthesizer [6] used this method.
This formalism has many shortcomings, which have been recognized previously ([7], [4], [5], [8]). The main problem is that the string soon becomes unwieldy for anything but the most trivial of tasks. It often grows very complex, with words, phrase symbols, stress symbols, phones, etc., all mixed together. Modules process such strings in one of two main ways. A module can work on the string directly, but interpreting the symbols often gets in the way of the algorithm itself. Alternatively, a module can parse the string into an internal format and let its algorithms operate on that. Although this may simplify the writing of the algorithms themselves, it is a cumbersome approach, as the string has to be parsed every time a module is called. Moreover, it often leads to each module having its own internal data structure, which is unattractive from a programming point of view, as new structures and techniques have to be learnt to understand the workings of any new module.
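The second style of processing can be illustrated with a minimal Python sketch. The string notation used here ("|" for a phrase break, "word/1" for a stressed word) and all function names are hypothetical, chosen only for illustration; they are not the notation of MITalk or the CSTR Alvey synthesizer. The stress rule is likewise a placeholder. The point is only that a module which wants a structured view must parse the string on entry and serialise it again on exit.

```python
# Hypothetical string notation, for illustration only:
#   "|"       phrase boundary symbol
#   "word"    plain orthographic word
#   "word/1"  word annotated with a stress level

def add_phrase_breaks(utt):
    """Module 1: works on the string directly, rewriting commas as '|'."""
    return utt.replace(",", " |")

def parse(utt):
    """Convert the annotated string into a module-internal list of items."""
    items = []
    for token in utt.split():
        if token == "|":
            items.append({"type": "break"})
        else:
            word, _, stress = token.partition("/")
            items.append({"type": "word", "text": word, "stress": stress or None})
    return items

def unparse(items):
    """Serialise the internal items back into the shared string."""
    out = []
    for item in items:
        if item["type"] == "break":
            out.append("|")
        elif item["stress"]:
            out.append(item["text"] + "/" + item["stress"])
        else:
            out.append(item["text"])
    return " ".join(out)

def add_stress(utt):
    """Module 2: parses, modifies and re-serialises on every call."""
    items = parse(utt)
    for item in items:
        # Placeholder rule (an assumption): stress any word longer than 3 letters.
        if item["type"] == "word" and len(item["text"]) > 3:
            item["stress"] = "1"
    return unparse(items)

utt = add_phrase_breaks("hello, big world")
utt = add_stress(utt)
print(utt)  # hello/1 | big world/1
```

Every further module written in this style would need its own equivalent of parse and unparse, and each new symbol added to the string would have to be handled in all of them.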
To lessen these problems, information is often deleted from the string so as to keep only what is perceived as essential. For instance, after grapheme-to-phoneme conversion, the orthographic form of the word may be deleted from the string. This can have unfortunate consequences, in that information which might be of use to a later module may already have been deleted by an earlier one.
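The loss of information can be made concrete with a small sketch, again in Python and again under stated assumptions: the two-word lexicon, the phone symbols, and the "#" word separator are all hypothetical. Once the words have been rewritten as phones, a later module that needs the orthography cannot recover it unambiguously.

```python
# Hypothetical toy lexicon; the pronunciations are illustrative only.
LEXICON = {"read": "r iy d", "reed": "r iy d"}

def grapheme_to_phoneme(utt):
    """Rewrite each word as its phones, deleting the orthographic form."""
    return " # ".join(LEXICON[word] for word in utt.split())

def recover_words(phone_string):
    """A later module tries to get the words back from the phones alone."""
    inverse = {}
    for word, phones in LEXICON.items():
        inverse.setdefault(phones, []).append(word)
    return [inverse[phones] for phones in phone_string.split(" # ")]

phones = grapheme_to_phoneme("reed")
print(phones)                 # r iy d
print(recover_words(phones))  # [['read', 'reed']]
```

The second result lists two candidate words for a single input word, so the original orthography has been lost.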