Similar Articles
20 similar articles found (search time: 15 ms)
1.
A central challenge for articulatory speech synthesis is the simulation of realistic articulatory movements, which is critical for the generation of highly natural and intelligible speech. This includes modeling coarticulation, i.e., the context-dependent variation of the articulatory and acoustic realization of phonemes, especially of consonants. Here we propose a method to simulate the context-sensitive articulation of consonants in consonant-vowel syllables. To achieve this, the vocal tract target shape of a consonant in the context of a given vowel is derived as the weighted average of three measured and acoustically optimized reference vocal tract shapes for that consonant in the context of the corner vowels /a/, /i/, and /u/. The weights are determined by mapping the target shape of the given context vowel into the vowel subspace spanned by the corner vowels. The model was applied to the synthesis of consonant-vowel syllables with the consonants /b/, /d/, /g/, /l/, /r/, /m/, /n/ in all combinations with the eight long German vowels. In a perception test, the mean recognition rate for the consonants in the isolated syllables was 82.4%. This demonstrates the potential of the approach for highly intelligible articulatory speech synthesis.
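A minimal sketch of this weighting scheme, assuming a two-dimensional vowel subspace and toy articulatory parameter vectors; all coordinates and shape values below are hypothetical placeholders, not data from the paper.

```python
import numpy as np

# Hypothetical 2-D coordinates of the corner vowels /a/, /i/, /u/ in a
# vowel subspace (e.g., the first two articulatory principal components).
corners = {"a": np.array([1.0, 0.0]),
           "i": np.array([0.0, 1.0]),
           "u": np.array([-1.0, 0.0])}

def barycentric_weights(v, corners):
    """Express vowel point v as an affine combination of the corner vowels."""
    a, i, u = corners["a"], corners["i"], corners["u"]
    # Solve [a-u, i-u] @ [w_a, w_i]^T = v - u, then w_u = 1 - w_a - w_i.
    M = np.column_stack([a - u, i - u])
    w_a, w_i = np.linalg.solve(M, v - u)
    return np.array([w_a, w_i, 1.0 - w_a - w_i])

# Hypothetical measured vocal tract target shapes of one consonant (say /d/)
# in /a/, /i/ and /u/ context, as articulatory parameter vectors.
shapes = np.array([[0.2, 0.8, 0.5],   # /d/ before /a/
                   [0.1, 0.9, 0.7],   # /d/ before /i/
                   [0.3, 0.6, 0.4]])  # /d/ before /u/

vowel_e = np.array([0.3, 0.6])        # hypothetical position of context vowel /e:/
w = barycentric_weights(vowel_e, corners)
target_shape = w @ shapes             # weighted average of the reference shapes
print("weights:", w, "-> context-specific consonant target:", target_shape)
```

Because the weights are barycentric coordinates they sum to one, so the interpolated consonant target always stays within the span of the three measured reference shapes.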

2.
Azadpour M, Balaban E. PLoS ONE 2008, 3(4): e1966
Neuroimaging studies of speech processing increasingly rely on artificial speech-like sounds whose perceptual status as speech or non-speech is assigned by simple subjective judgments; brain activation patterns are interpreted according to these status assignments. The naïve perceptual status of one such stimulus, spectrally rotated speech (not consciously perceived as speech by naïve subjects), was evaluated in discrimination and forced identification experiments. Discrimination of variation in spectrally rotated syllables by one group of naïve subjects was strongly related to the pattern of similarities in the phonological identification of the same stimuli provided by a second, independent group of naïve subjects, suggesting either that (1) naïve rotated-syllable perception involves phonetic-like processing, or (2) that perception is based solely on physical acoustic similarity, and similar sounds are assigned similar phonetic identities. Analysis of acoustic (Euclidean distances between the center frequencies of formants) and phonetic similarities in the perception of the vowel portions of the rotated syllables revealed that discrimination was significantly and independently influenced by both acoustic and phonological information. We conclude that simple subjective assessments of artificial speech-like sounds can be misleading, as perception of such sounds may initially and unconsciously engage speech-like, phonological processing.

3.
Gao S, Hu J, Gong D, Chen S, Kendrick KM, Yao D. PLoS ONE 2012, 7(5): e38289
Consonants, unlike vowels, are thought to be speech-specific, so no interactions would be expected between consonants and pitch, a basic element of musical tones. The present study used an electrophysiological approach to investigate whether, contrary to this view, there is integrative processing of consonants and pitch, by measuring the additivity of changes in the mismatch negativity (MMN) of evoked potentials. The MMN is elicited by discriminable variations occurring in a sequence of repetitive, homogeneous sounds. In the experiment, event-related potentials (ERPs) were recorded while participants heard frequently presented sung consonant-vowel syllables and rare stimuli deviating in consonant identity only, pitch only, or both dimensions. Every type of deviation elicited a reliable MMN. As expected, the two single-deviant MMNs had similar amplitudes, but the double-deviant MMN was also not significantly different from them. This absence of additivity in the double-deviant MMN suggests that consonant and pitch variations are processed, at least at a pre-attentive level, in an integrated rather than independent way. Domain-specificity of consonants may depend on higher-level processes in the hierarchy of speech perception.
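The additivity logic can be made concrete with a toy computation: if consonant identity and pitch were processed independently, the double-deviant MMN should approximate the sum of the two single-deviant MMNs. The amplitudes below are invented for illustration only, not data from the study.

```python
import numpy as np

# Hypothetical per-participant MMN peak amplitudes in microvolts
# (more negative = larger MMN); values are invented for illustration.
rng = np.random.default_rng(0)
n = 20
mmn_consonant = rng.normal(-2.0, 0.6, n)   # consonant-only deviant
mmn_pitch     = rng.normal(-2.1, 0.6, n)   # pitch-only deviant
mmn_double    = rng.normal(-2.2, 0.6, n)   # double deviant

# Additivity prediction under independent processing:
predicted = mmn_consonant + mmn_pitch      # ~ -4 uV if processing is independent

additivity_index = mmn_double - predicted  # ~0 would indicate additivity
print(f"observed double-deviant MMN:    {mmn_double.mean():+.2f} uV")
print(f"additive prediction:            {predicted.mean():+.2f} uV")
print(f"mean deviation from additivity: {additivity_index.mean():+.2f} uV")
```

In the study the double-deviant MMN matched the single-deviant MMNs rather than their sum, i.e., the deviation from additivity was large, which is what points to integrated rather than independent processing.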

4.
The perception of vowels was studied in chimpanzees and humans, using a reaction time task in which reaction times for the discrimination of vowels were taken as an index of similarity between vowels. The vowels used were five synthetic and natural Japanese vowels and eight natural French vowels. The chimpanzees required long reaction times to discriminate synthetic [i] from [u] and [e] from [o]; that is, they needed long latencies to discriminate between vowels that differ in the frequency of the second formant. A similar tendency was observed for the discrimination of natural [i] from [u]. The human subject required long reaction times for discrimination between vowels along the first-formant axis. These differences can be explained by differences in auditory sensitivity between the two species and by the motor theory of speech perception. A vowel pronounced by different speakers has different acoustic properties, yet humans perceive these speech sounds as the same vowel. This phenomenon of perceptual constancy in speech perception was studied in chimpanzees using natural vowels and a synthetic [o]-[a] continuum. The chimpanzees ignored the difference in the sex of the speakers and showed a capacity for vocal tract normalization.

5.
Automatic speech recognition (ASR) is currently used in many assistive technologies, such as helping individuals with speech impairments to communicate. One challenge in ASR for speech-impaired individuals is the difficulty of obtaining a good speech database of impaired speakers for building an effective acoustic model. Because the few existing databases of impaired speech are limited in size, the obvious way to build an acoustic model of impaired speech is to employ adaptation techniques. However, two issues have not been addressed in existing studies of adaptation for speech impairment: (1) identifying the most effective adaptation technique for impaired speech; and (2) the use of suitable source models to build an effective impaired-speech acoustic model. This research investigates both issues for dysarthria, a type of speech impairment affecting millions of people. We applied both unimpaired and impaired speech as the source model, with well-known adaptation techniques such as maximum likelihood linear regression (MLLR) and constrained MLLR (C-MLLR). The recognition accuracy of each impaired-speech acoustic model is measured in terms of word error rate (WER), with further assessment of phoneme insertion, substitution and deletion rates. Unimpaired speech, when combined with limited high-quality impaired-speech data, improves the performance of ASR systems in recognising severely impaired dysarthric speech. The C-MLLR adaptation technique was also found to be better than MLLR at recognising mildly and moderately impaired speech, based on statistical analysis of the WER. Phoneme substitution was the biggest contributor to the WER in dysarthric speech at all levels of severity. The results show that speech acoustic models derived from suitable adaptation techniques improve the performance of ASR systems in recognising impaired speech with limited adaptation data.
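WER, the evaluation measure used here, is the word-level edit distance between reference and hypothesis transcripts, normalised by reference length. A minimal sketch that also returns the substitution, insertion and deletion counts the paper analyses:

```python
def word_error_rate(reference, hypothesis):
    """Return WER plus substitution/insertion/deletion counts
    via dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (cost, subs, ins, dels) for aligning ref[:i] with hyp[:j]
    dp = [[(j, 0, j, 0) for j in range(len(hyp) + 1)]]
    for i in range(1, len(ref) + 1):
        row = [(i, 0, 0, i)]
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                row.append(dp[i - 1][j - 1])          # exact match, no cost
            else:
                sub = dp[i - 1][j - 1]                # substitute ref word
                ins = row[j - 1]                      # insert hyp word
                dele = dp[i - 1][j]                   # delete ref word
                row.append(min(
                    (sub[0] + 1, sub[1] + 1, sub[2], sub[3]),
                    (ins[0] + 1, ins[1], ins[2] + 1, ins[3]),
                    (dele[0] + 1, dele[1], dele[2], dele[3] + 1)))
        dp.append(row)
    cost, subs, ins, dels = dp[-1][-1]
    return cost / len(ref), subs, ins, dels

wer, s, i, d = word_error_rate("please call stella", "please all stellar now")
print(f"WER={wer:.2f} substitutions={s} insertions={i} deletions={d}")
```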

6.
Virtually every human faculty engages with imitation. One of the most natural yet unexplored objects for the study of the mimetic elements in language is onomatopoeia, as it implies an imitation-driven transformation of a sound of nature into a word. Notably, simple sounds are transformed into complex strings of vowels and consonants, making it difficult to identify what is acoustically preserved in this operation. In this work we propose a definition of vocal imitation by which sounds are transformed into the speech elements that minimize their spectral difference within the constraints of the vocal system. To test this definition, we use a computational model that allows recovering anatomical features of the vocal system from experimental sound data. We explore the vocal configurations that best reproduce non-speech sounds, like striking blows on a door or the sharp sounds generated by pressing light switches or computer mouse buttons. From the anatomical point of view, the configurations obtained are readily associated with co-articulated consonants, and we show perceptual evidence that these consonants are positively associated with the original sounds. Moreover, the vowel-consonant pairs that compose these co-articulations correspond to the most stable syllables found in the knock and click onomatopoeias across languages, suggesting a mechanism by which vocal imitation naturally embeds single sounds into more complex speech structures. Other mimetic forces, such as cross-modal associations between speech and visual categories, have received extensive attention from the scientific community. The present approach helps build a global view of the mimetic forces acting on language and opens a new avenue for the quantitative study of word formation in terms of vocal imitation.
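The proposed definition (transform a sound into the speech elements that minimize its spectral difference) amounts to a constrained nearest-neighbour search over producible sounds. A toy sketch, with invented spectral templates standing in for the paper's computational vocal-system model:

```python
import numpy as np

rng = np.random.default_rng(1)
freqs = np.linspace(100, 8000, 64)          # coarse frequency grid in Hz

def band_spectrum(center, width):
    """Toy magnitude spectrum: a single Gaussian energy band."""
    return np.exp(-0.5 * ((freqs - center) / width) ** 2)

# Hypothetical spectral templates for a few producible speech elements.
phoneme_spectra = {
    "k": band_spectrum(3000, 900),    # compact mid-high burst
    "t": band_spectrum(4500, 1200),   # diffuse high-frequency burst
    "p": band_spectrum(800, 600),     # low-frequency burst
}

# "Sound of nature" to imitate, e.g. a sharp click: high-frequency burst.
target = band_spectrum(4200, 1000) + 0.05 * rng.standard_normal(freqs.size)

def best_imitation(target, inventory):
    """Pick the element minimising the spectral (Euclidean) difference."""
    return min(inventory, key=lambda p: np.linalg.norm(inventory[p] - target))

print("closest speech element:", best_imitation(target, phoneme_spectra))
```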

7.
We address the hypothesis that postures adopted during grammatical pauses in speech production are more “mechanically advantageous” than absolute rest positions for facilitating efficient postural motor control of vocal tract articulators. We quantify vocal tract posture corresponding to inter-speech pauses, absolute rest intervals, and vowel and consonant intervals using automated analysis of video captured with real-time magnetic resonance imaging during production of read and spontaneous speech by 5 healthy speakers of American English. We then use locally-weighted linear regression to estimate the articulatory forward map from low-level articulator variables to high-level task/goal variables for these postures. We quantify the overall magnitude of the first derivative of the forward map as a measure of mechanical advantage. We find that postures assumed during grammatical pauses in speech, as well as speech-ready postures, are significantly more mechanically advantageous than postures assumed during absolute rest. Further, these postures represent empirical extremes of mechanical advantage, between which lie the postures assumed during the various vowels and consonants. Relative mechanical advantage of different postures may thus be an important physical constraint influencing the planning and control of speech production.
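The estimation idea can be sketched with scalar toy variables: fit a locally-weighted linear regression from an articulator variable to a task variable around a given posture, and take the magnitude of the fitted slope (the local first derivative of the forward map) as the mechanical-advantage measure. The data, kernel, and bandwidth below are assumptions for illustration, not the paper's variables.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy forward map: a task variable (e.g., constriction degree) as a
# nonlinear function of one articulator variable (e.g., jaw angle).
articulator = rng.uniform(-1, 1, 300)
task = np.tanh(2.0 * articulator) + 0.02 * rng.standard_normal(300)

def local_slope(x0, x, y, bandwidth=0.15):
    """Locally-weighted linear regression at x0; return |dy/dx| there."""
    w = np.exp(-0.5 * ((x - x0) / bandwidth) ** 2)   # Gaussian kernel weights
    X = np.column_stack([np.ones_like(x), x])        # intercept + slope design
    WX = X * w[:, None]
    beta = np.linalg.solve(X.T @ WX, WX.T @ y)       # weighted least squares
    return abs(beta[1])                              # slope magnitude

# "Mechanical advantage" of two hypothetical postures:
for posture in (0.0, 0.9):                           # mid-range vs. extreme
    print(f"posture {posture:+.1f}: |d task / d articulator| = "
          f"{local_slope(posture, articulator, task):.2f}")
```

Postures where small articulator changes produce large task changes score higher on this measure.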

8.
The perception of consonants followed by the vowel [a] was studied in chimpanzees and humans, using a reaction time task in which reaction times for the discrimination of syllables were taken as an index of similarity between consonants. The consonants used were 20 natural French consonants and six natural and synthetic Japanese stop consonants. Cluster and MDSCAL analyses of reaction times for discrimination of the French consonants suggested that the manner of articulation is the major determinant of the structure of consonant perception in chimpanzees. Discrimination of stop consonants suggested that the major grouping in the chimpanzees was by voicing. The place of articulation, from the lips to the velum, was reproduced only in the perception of the synthetic unvoiced stop consonants in the two-dimensional MDSCAL space. The phoneme-boundary effect (categorical perception) for the voicing and place-of-articulation features was also examined in a chimpanzee using synthetic [ga]-[ka] and [ba]-[da] continua, respectively. The chimpanzee showed enhanced discriminability at or near the phonetic boundaries between the velar voiced and unvoiced stops and between the voiced bilabial and alveolar stops. These results suggest that the basic mechanism for the identification of consonants in chimpanzees is similar to that in humans, although chimpanzees are less accurate than humans in the discrimination of consonants.

9.
10.
Models of speech production typically assume that control over the timing of speech movements is governed by the selection of higher-level linguistic units, such as segments or syllables. This study used real-time magnetic resonance imaging of the vocal tract to investigate the anticipatory movements speakers make prior to producing a vocal response. Two factors were varied: preparation (whether or not speakers had foreknowledge of the target response) and pre-response constraint (whether or not speakers were required to maintain a specific vocal tract posture prior to the response). In prepared responses, many speakers were observed to produce pre-response anticipatory movements with a variety of articulators, showing that speech movements can be readily dissociated from higher-level linguistic units. Substantial variation was observed across speakers with regard to the articulators used for anticipatory posturing and the contexts in which anticipatory movements occurred. The findings of this study have important consequences for models of speech production and for our understanding of the normal range of variation in anticipatory speech behaviors.

11.
Nasality is a very important characteristic of several languages, European Portuguese being one of them. This paper addresses the challenge of nasality detection in speech interfaces based on surface electromyography (EMG). We explore whether the EMG signal carries useful information about velum movement, assess whether muscles deeper in the face and neck region can be measured with surface electrodes, and determine the best electrode locations for doing so. The procedure we adopted uses real-time magnetic resonance imaging (RT-MRI), collected from a set of speakers, as a method to interpret the EMG data. By ensuring compatible data recording conditions and proper time alignment between the EMG and RT-MRI data, we are able to accurately estimate when the velum moves and the type of movement that occurs for a nasal vowel. The combination of these two sources revealed interesting and distinct characteristics in the EMG signal when a nasal vowel is uttered, which motivated a classification experiment. The overall results of this experiment provide evidence that it is possible to detect velum movement using sensors positioned below the ear, between the mastoid process and the mandible, in the upper neck region. In a frame-based classification scenario, error rates as low as 32.5% across all speakers and 23.4% for the best speaker were achieved for nasal vowel detection. This encouraging result lays the ground for deeper exploration of the proposed approach as a promising route to an EMG-based speech interface for languages with strong nasal characteristics.
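The frame-based classification scenario can be sketched as follows; the synthetic features, frame counts, and choice of classifier are assumptions for illustration, not the paper's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)

# Synthetic per-frame EMG features (e.g., RMS energy per channel) for
# frames labelled nasal (1) vs non-nasal (0); all values are invented.
n_frames, n_channels = 2000, 4
labels = rng.integers(0, 2, n_frames)
features = rng.standard_normal((n_frames, n_channels)) + 0.8 * labels[:, None]

X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.3, random_state=0)

# Train a per-frame classifier and report the frame-level error rate,
# the metric quoted in the abstract (32.5% / 23.4%).
clf = LogisticRegression().fit(X_train, y_train)
frame_error_rate = 1.0 - clf.score(X_test, y_test)
print(f"frame-level error rate: {frame_error_rate:.1%}")
```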

12.
MOTIVATION: The global alignment of protein sequence pairs is often used in the classification and analysis of full-length sequences. Calculating a Z-score for the comparison gives a length- and composition-corrected measure of the similarity between the sequences. However, the Z-score alone does not indicate the likely biological significance of the similarity. In this paper, all pairs of domains from 250 sequences belonging to different SCOP folds were aligned and Z-scores calculated. The distribution of Z-scores was fitted with a peak distribution, from which the probability of obtaining a given Z-score from the global alignment of two protein sequences of unrelated fold was calculated. A similar analysis was applied to subsequence pairs found by the Smith-Waterman algorithm. These analyses allow the probability that two protein sequences share the same fold to be estimated by global sequence alignment. RESULTS: The relationship between Z-score and probability varied little over the matrix/gap-penalty combinations examined. However, an average shift of +4.7 was observed for Z-scores derived from global alignment of locally aligned subsequences compared to global alignment of the full-length sequences. This shift was shown to be the result of pre-selection by local alignment, rather than any structural similarity in the subsequences. The search ability of both methods was benchmarked against the SCOP superfamily classification and showed that global alignment Z-scores generated from the entire sequence are as effective as SSEARCH at low error rates and more effective at higher error rates. However, global alignment Z-scores generated from the best locally aligned subsequence were significantly less effective than SSEARCH. The method of estimating statistical significance described here was shown to give values similar to SSEARCH and BLAST, providing confidence in the significance estimation. AVAILABILITY: Software to apply the statistics to global alignments is available from http://barton.ebi.ac.uk. CONTACT: geoff@ebi.ac.uk
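The core Z-score computation (comparing the observed global alignment score against scores for shuffled sequences of the same length and composition) can be sketched as below; the toy aligner and scoring parameters are illustrative stand-ins, not the paper's substitution matrices or its fitted peak distribution.

```python
import random

def global_align_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score with a linear gap penalty."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]

def z_score(a, b, n_shuffles=200, seed=0):
    """Length/composition-corrected similarity: Z over shuffled comparisons."""
    rng = random.Random(seed)
    observed = global_align_score(a, b)
    scores = []
    for _ in range(n_shuffles):
        shuffled = list(b)
        rng.shuffle(shuffled)            # preserves length and composition
        scores.append(global_align_score(a, "".join(shuffled)))
    mean = sum(scores) / n_shuffles
    sd = (sum((s - mean) ** 2 for s in scores) / (n_shuffles - 1)) ** 0.5
    return (observed - mean) / sd

# Toy sequences for illustration only.
print(f"Z = {z_score('MKVLITGAGSGIG', 'MKVLLTGASSGIG'):.1f}")
```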

13.
Universal linguistic constraints seem to govern the organization of sound sequences in words. However, our understanding of the origin and development of these constraints is incomplete. One possibility is that the development of neuromuscular control of the articulators acts as a constraint on the emergence of sound sequences in words. Repetitions of the same consonant observed in early infancy, and an increase in the variety of consonantal sequences over months of age, have been interpreted as consequences of the development of neuromuscular control. Yet it is not clear how the sequential coordination of articulators such as the lips, tongue apex and tongue dorsum constrains sequences of labial, coronal and dorsal consonants in words over the course of development. We examined the longitudinal development of consonant-vowel-consonant(-vowel) sequences produced by Japanese children between 7 and 60 months of age. The sequences were classified according to the places of articulation of the corresponding consonants. Analyses of individual and group data show that infants prefer repetitive and fronting articulations, as shown in previous studies. Furthermore, we reveal that serial ordering of different places of articulation within the same organ appears earlier and then develops gradually, whereas serial ordering of different articulatory organs appears later and then develops rapidly. We also analyzed sequences produced by English children in the same way and obtained similar developmental trends. These results suggest that the development of intra- and inter-articulator coordination constrains the acquisition of serial order in speech with the complexity that characterizes adult language.

14.
The relationship between the motor and acoustic similarity of song was examined in brown thrashers (Toxostoma rufum) and grey catbirds (Dumetella carolinensis) (family Mimidae), which have very large song repertoires and sometimes mimic other species. Motor similarity was assessed by cross-correlation of the syringeal airflows and air sac pressures that accompany sound production. Although most syllables were sung only once in the song analyzed, some were repeated, either immediately, forming a couplet, or after a period of intervening song, as a distant repetition. Both couplets and distant repetitions are produced by distinctive, stereotyped motor patterns. Their motor similarity does not decrease as the time interval between repetitions increases, suggesting that repeated syllables are stored in memory as fixed motor programs. The acoustic similarity between nonrepeated syllables, as indicated by correlation of their spectrograms, has a significant positive correlation with their motor similarity. This correlation is weak, however, suggesting that there is no simple linear relationship between motor action and acoustic output and that similar sounds may sometimes be produced by different motor mechanisms. When compared without regard to the sequence in which they are sung, syllables paired for maximum spectral similarity form a continuum with repeated syllables in terms of their acoustic and motor similarity. The prominence of couplets in the “syntax” of normal song is enhanced by the dissimilarity of the successive nonrepeated syllables that make up the remainder of the song.
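Motor similarity by cross-correlation can be sketched as the peak of the normalized cross-correlation between two physiological traces; the signals below are synthetic stand-ins for measured airflow recordings.

```python
import numpy as np

rng = np.random.default_rng(4)
t = np.linspace(0, 1, 500)

# Synthetic "syringeal airflow" traces: the same motor pattern repeated
# with a small time shift and noise, standing in for a repeated syllable.
pattern = np.sin(2 * np.pi * 5 * t) * np.exp(-3 * t)
flow_1 = pattern + 0.05 * rng.standard_normal(t.size)
flow_2 = np.roll(pattern, 12) + 0.05 * rng.standard_normal(t.size)

def motor_similarity(x, y):
    """Peak of the normalized cross-correlation (1.0 = identical shape)."""
    x = (x - x.mean()) / x.std()
    y = (y - y.mean()) / y.std()
    xcorr = np.correlate(x, y, mode="full") / x.size
    return xcorr.max()

print(f"motor similarity: {motor_similarity(flow_1, flow_2):.2f}")
```

Taking the peak over all lags makes the measure insensitive to small timing offsets between renditions, which matters when comparing repetitions separated by intervening song.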

15.
This study examined whether rapid temporal auditory processing, verbal working memory capacity, non-verbal intelligence, executive functioning, musical ability and prior foreign language experience predicted how well native English speakers (N = 120) discriminated Norwegian tonal and vowel contrasts as well as a non-speech analogue of the tonal contrast and a native vowel contrast presented over noise. Results confirmed a male advantage for temporal and tonal processing, and also revealed that temporal processing was associated with both non-verbal intelligence and speech processing. In contrast, effects of musical ability on non-native speech-sound processing and of inhibitory control on vowel discrimination were not mediated by temporal processing. These results suggest that individual differences in non-native speech-sound processing are to some extent determined by temporal auditory processing ability, in which males perform better, but are also determined by a host of other abilities that are deployed flexibly depending on the characteristics of the target sounds.

16.
Investigation into the evolution of human language has involved evidence of many different kinds and approaches from many different disciplines. For full modern language, humans must have evolved a range of physical abilities for the production of our complex speech sounds, as well as sophisticated cognitive abilities. Human speech involves free-flowing, intricately varied, rapid sound sequences suitable for the fast transfer of complex, highly flexible communication. Some aspects of human speech, such as our ability to manipulate the vocal tract to produce the wide range of sounds that form vowels and consonants, have attracted considerable attention from those interested in the evolution of language.1,2 However, one very important contributory skill, the human ability to attain very fine control of breathing during speech, has been neglected. Here we present evidence of the importance of breathing control to human speech, as well as evidence that our capabilities greatly exceed those of nonhuman primates. Human speech breathing demands fine neurological control of the respiratory muscles, integrated with cognitive processes and other factors. Evidence from comparison of the vertebral canals of fossil hominids and those of extant primates suggests that a major increase in thoracic innervation evolved later in hominid evolution, providing enhanced breathing control. If so, then earlier hominids would have had quite restricted speech patterns, whereas more recent hominids, with human-like breath control abilities, would have been capable of faster, more varied speech sequences.

17.
Zoology (Jena, Germany) 2014, 117(5): 329-336
Many insects exhibit secondary defence mechanisms upon contact with a predator, such as defensive sound production or regurgitation of gut contents. In the tettigoniid Poecilimon ornatus, both males and females are capable of sound production and of regurgitation. However, the wing stridulatory structures used for intraspecific acoustic communication evolved independently in males and females, and may produce different defence sounds. Here we investigate whether secondary defence behaviours in P. ornatus, in particular defence sounds, show sex-specific differences. The male defence sound differs significantly from the male calling song in that it has a longer syllable duration and a higher number of impulses per syllable. In females, the defence sound syllables are also significantly longer than the syllables of their response song to the male calling song. In addition, the acoustic disturbance stridulation differs notably between females and males, with the two sexes exhibiting different temporal patterns of the defence sound. Furthermore, males use defence sounds more often than females. The higher proportion of male disturbance stridulation is consistent with a male-biased predation risk during calling and phonotactic behaviour. The temporal structures of the female and male defence sounds support a deimatic function of the startling sound in both sexes, rather than an adaptation to a particular temporal pattern. Despite the clear differences in sound defence, the sexes do not differ in regurgitation of gut contents.

18.
Many studies have shown that during the first year of life infants start learning the prosodic, phonetic and phonotactic properties of their native language. In parallel, infants start associating sound sequences with semantic representations. However, the question of how these two processes interact remains largely open. The current study explores whether (and when) the relative phonotactic probability of a sound sequence in the native language has an impact on infants’ word learning. We exploit the facts that Labial-Coronal (LC) words are more frequent than Coronal-Labial (CL) words in French, and that French-learning infants prefer LC over CL sequences at 10 months of age, to explore the possibility that LC structures might be learned more easily, and thus at an earlier age, than CL structures. Eye movements of French-learning 14- and 16-month-olds were recorded while they watched animated cartoons in a word learning task. The experiment involved four trials testing LC sequences and four trials testing CL sequences. Our data reveal that 16-month-olds were able to learn both the LC and the CL words, while 14-month-olds were only able to learn the LC words, the words with the more frequent phonotactic pattern. The present results provide evidence that infants’ knowledge of their native language’s phonotactic patterns influences their word learning: words with a frequent phonotactic structure can be acquired at an earlier age than those with a lower probability. Developmental changes are discussed and integrated with previous findings.

19.
Beat gestures—spontaneously produced biphasic movements of the hand—are among the most frequently encountered co-speech gestures in human communication. They are closely temporally aligned to the prosodic characteristics of the speech signal, typically occurring on lexically stressed syllables. Despite their prevalence across speakers of the world’s languages, how beat gestures impact spoken word recognition is unclear. Can these simple ‘flicks of the hand’ influence speech perception? Across a range of experiments, we demonstrate that beat gestures influence the explicit and implicit perception of lexical stress (e.g. distinguishing OBject from obJECT), and in turn can influence what vowels listeners hear. Thus, we provide converging evidence for a manual McGurk effect: relatively simple and widely occurring hand movements influence which speech sounds we hear.

20.
This paper presents a text-independent speaker verification system based on an online Radial Basis Function (RBF) network referred to as a Minimal Resource Allocation Network (MRAN). MRAN is a sequential-learning RBF network in which hidden neurons are added or removed as training progresses. LP-derived cepstral coefficients are used as feature vectors during the training and verification phases. The performance of MRAN is compared with other well-known RBF- and Elliptical Basis Function (EBF)-based speaker verification methods in terms of error rates and computational complexity in a series of speaker verification experiments. The experiments use data from 258 speakers from the phonetically balanced continuous-speech corpus TIMIT. The results show that MRAN produces error rates comparable to those of other methods with much less computational complexity.
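A heavily simplified sketch of the sequential-growth idea behind MRAN: add a hidden unit when an input is both novel (far from all existing centers) and poorly predicted, otherwise adjust the existing weights. Real MRAN also prunes inactive units and uses sliding-window growth criteria; the thresholds and the toy task below are assumptions, not the paper's configuration.

```python
import numpy as np

class GrowingRBF:
    """Minimal sequential RBF: grow a neuron on novel, badly-predicted inputs."""

    def __init__(self, err_thresh=0.3, dist_thresh=0.5, width=0.4, lr=0.05):
        self.centers, self.weights = [], []
        self.err_thresh, self.dist_thresh = err_thresh, dist_thresh
        self.width, self.lr = width, lr

    def _phi(self, x):
        """Gaussian activations of all hidden units for input x."""
        return np.exp(-np.sum((np.array(self.centers) - x) ** 2, axis=1)
                      / self.width ** 2)

    def predict(self, x):
        if not self.centers:
            return 0.0
        return float(np.dot(self.weights, self._phi(x)))

    def observe(self, x, y):
        err = y - self.predict(x)
        dist = min((np.linalg.norm(c - x) for c in self.centers),
                   default=np.inf)
        if abs(err) > self.err_thresh and dist > self.dist_thresh:
            self.centers.append(np.array(x, dtype=float))  # allocate new unit
            self.weights.append(err)                       # cancel current error
        elif self.centers:
            phi = self._phi(x)
            for k in range(len(self.weights)):             # LMS weight update
                self.weights[k] += self.lr * err * phi[k]

rng = np.random.default_rng(5)
net = GrowingRBF()
for _ in range(500):                       # learn y = sin(3x) online
    x = rng.uniform(-1, 1, 1)
    net.observe(x, float(np.sin(3 * x[0])))
print(f"hidden neurons: {len(net.centers)}, "
      f"f(0.5) ~ {net.predict(np.array([0.5])):.2f} (true {np.sin(1.5):.2f})")
```

Because units are allocated only for novel, badly-predicted inputs, the network stays small, which is the source of the low computational complexity the abstract reports.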

