Research on Speaker Recognition for Whispered Speech Based on Joint Factor Analysis (基于聯(lián)合因子分析的耳語音說話人識別研究)
Published: 2018-08-01 17:13
【Abstract】: Speaker recognition, as an important branch of biometric identification, is widely applicable in public security and forensics, biomedical engineering, military security systems, and other fields. With the rapid development of computer and network technology, speaker recognition has made considerable progress. Whispering is a special mode of spoken communication that is used in many situations. Because whispered speech differs substantially from normally phonated speech, speaker recognition in the whispered mode cannot simply reuse the methods developed for normal speech, and many problems remain to be solved.
This thesis takes text-independent speaker recognition for whispered speech as its research object and explores it in depth. The main problems are as follows. First, whispered-speech corpora are incomplete: for normal speech, the U.S. National Institute of Standards and Technology provides standard corpora for speaker recognition research, whereas comparable resources for whispered speech are scarce. Second, feature representation is difficult: because of the particular way whispers are produced, some commonly used acoustic parameters cannot be extracted, and spectral parameters are harder to obtain than for normal speech. Third, whispered speech is produced with a breathy, unvoiced excitation at a low sound level, so it is easily corrupted by noise; it is also often used during mobile-phone calls and is therefore sensitive to channel conditions. Finally, whispering is constrained by the speaking environment, which limits emotional expression, and the speaker's phonation state and psychological condition vary, so whispered speech is more strongly affected by the speaker's mental state, emotion, and manner of production. Compared with normal phonation, the main difficulties for speaker recognition in the whispered mode are therefore that feature parameters are harder to extract, that the speech is more affected by the speaker's own state, and that it is more sensitive to channel variation.
To address these problems, this thesis carries out the following work:
1. A parameter extraction algorithm that captures speaker characteristics of whispered speech is proposed. Whispered speech has no fundamental frequency and its source characteristics are hard to observe, so a reliable algorithm for extracting the formants, which characterize the vocal tract, is especially important. This thesis proposes a formant extraction algorithm for whispered speech based on spectral segmentation: the spectrum is segmented dynamically, filter parameters are obtained by selective linear prediction, and the formants are obtained with parallel inverse filter control. The method offers an effective way to handle the formant shifting, merging, and flattening caused by whispered phonation. In addition, exploiting the fact that the centroid and flatness of a spectrum measure the stability of a signal, and drawing on a model of human auditory perception, the thesis introduces the Bark subband spectral centroid and Bark subband spectral flatness; together with other spectral variables they form a feature set that effectively characterizes the speaker in the whispered mode. A minimal computation of these two Bark-band statistics is sketched below.
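To make the Bark subband spectral centroid and flatness concrete, the following Python sketch computes both statistics for one analysis frame. The Bark band edges, window, sampling rate, and FFT size are illustrative assumptions not taken from the thesis; the two quantities themselves follow their standard definitions (power-weighted mean frequency of a band, and ratio of geometric to arithmetic mean of the band power).

```python
import numpy as np

# Zwicker's critical-band (Bark) edges in Hz, truncated for 8 kHz speech (assumption).
BARK_EDGES_HZ = [0, 100, 200, 300, 400, 510, 630, 770, 920, 1080,
                 1270, 1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400]

def bark_subband_centroid_flatness(frame, fs=8000, n_fft=512, eps=1e-12):
    """Per-Bark-band spectral centroid (Hz) and spectral flatness of one frame."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft))
    power = spec ** 2
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / fs)
    centroids, flatness = [], []
    for lo, hi in zip(BARK_EDGES_HZ[:-1], BARK_EDGES_HZ[1:]):
        if lo >= fs / 2:
            break
        band = (freqs >= lo) & (freqs < min(hi, fs / 2))
        p = power[band] + eps
        f = freqs[band]
        # Spectral centroid: power-weighted mean frequency of the band.
        centroids.append(float(np.sum(f * p) / np.sum(p)))
        # Spectral flatness: geometric over arithmetic mean (1 = flat/noisy, near 0 = peaky).
        flatness.append(float(np.exp(np.mean(np.log(p))) / np.mean(p)))
    return np.array(centroids), np.array(flatness)
```

Frame-level values would then be pooled over an utterance (for example, means and variances) and appended to the other spectral features mentioned above.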
2. A speaker recognition method for whispered speech under atypical emotion based on feature mapping and speaker model synthesis is proposed; it largely resolves the mismatch in emotional state between training and test speech. Because whispered speech conveys emotion less effectively than normal speech and cannot be assigned clear-cut emotion categories, the thesis classifies the speaker state of whispered speech along the A and V (activation-valence) dimensions, relaxing the one-to-one correspondence with specific emotions. In the test stage, as a front-end step, the speaker state of each utterance is identified, after which compensation is applied in the feature domain or the model domain. Experiments show that this speaker-state compensation based on feature mapping and speaker model synthesis both reflects the particular nature of whispered speech and effectively improves recognition accuracy for whispered speech under atypical emotion. A sketch of the feature-domain branch is given below.
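The feature-domain branch can be illustrated with a classical GMM feature mapping from a state-dependent model to a state-independent root model. This is a sketch of the general technique, not the thesis's exact mapping rule; the DiagGMM container and function names are hypothetical.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class DiagGMM:
    weights: np.ndarray    # (M,)   mixture weights
    means: np.ndarray      # (M, D) mixture means
    variances: np.ndarray  # (M, D) diagonal covariances

    def top_component(self, x):
        """Index of the mixture with the highest posterior for frame x."""
        diff = x - self.means
        log_lik = (np.log(self.weights)
                   - 0.5 * np.sum(np.log(2 * np.pi * self.variances), axis=1)
                   - 0.5 * np.sum(diff ** 2 / self.variances, axis=1))
        return int(np.argmax(log_lik))

def map_features(frames, root_gmm, state_gmm):
    """Map frames produced in a given speaker state toward the state-independent
    (root) feature space, using the top-scoring mixture of each frame."""
    mapped = np.empty_like(frames)
    for t, x in enumerate(frames):
        m = state_gmm.top_component(x)
        scale = np.sqrt(root_gmm.variances[m] / state_gmm.variances[m])
        mapped[t] = (x - state_gmm.means[m]) * scale + root_gmm.means[m]
    return mapped
```

At test time the front end would first decide the A-V state of the utterance, select the corresponding state_gmm, and pass the mapped frames to the speaker models; in the model-domain variant (speaker model synthesis), the state-dependent shifts are instead applied to the speaker model parameters.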
3. A speaker recognition method for whispered speech under atypical emotion based on latent factor analysis is proposed, providing an effective route to speaker-state compensation. Factor analysis is not concerned with the physical meaning of the common factors; it only seeks representative factors among many variables, and the complexity of the algorithm can be tuned by increasing or decreasing the number of factors. Following latent factor theory, the whispered-speech feature supervector is decomposed into a speaker supervector and a speaker-state supervector; the speaker and speaker-state subspaces are estimated from balanced training speech, and in the test stage the speaker factors of each utterance are estimated and the decision is based on them. The latent factor approach avoids speaker-state classification at test time and, compared with compensation algorithms that depend on such a classification, further improves whispered-speech speaker recognition accuracy. The assumed decomposition is sketched below.
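A minimal statement of the decomposition assumed in this summary is given below; the symbols are illustrative and the thesis may use different notation.

```latex
% GMM mean supervector of utterance h, speaker s(h), speaker state e(h):
\begin{equation}
  \mathbf{M}_h \;=\; \mathbf{m} \;+\; \mathbf{V}\,\mathbf{y}_{s(h)} \;+\; \mathbf{U}\,\mathbf{x}_{e(h)}
\end{equation}
% m : speaker- and state-independent supervector (from the UBM)
% V : low-rank speaker subspace,        y : speaker factors
% U : low-rank speaker-state subspace,  x : speaker-state factors
```

During enrollment and testing only a point estimate of the speaker factors y is needed for each utterance, which is why no explicit speaker-state classification is required at test time.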
4. A speaker recognition method based on joint factor analysis (JFA) is proposed for whispered speech under atypical emotion over multiple channels, realizing joint compensation of channel and speaker state. Following the basic idea of JFA, the speech feature supervector is decomposed into a speaker supervector, a speaker-state supervector, and a channel supervector. Because the whispered training data are insufficient to estimate the speaker, speaker-state, and channel subspaces simultaneously, the method first trains a universal background model (UBM), computes the Baum-Welch statistics of the speech, estimates the speaker subspace, and then estimates the speaker-state and channel subspaces in parallel. In the test stage, the channel and speaker-state offsets are subtracted from the feature vectors, and the transformed features are used for speaker recognition. Experimental results show that the JFA-based method compensates for channel and speaker state simultaneously and achieves better recognition performance than the other algorithms. The Baum-Welch statistics used after UBM training are sketched below.
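As a concrete illustration of the statistics computed after UBM training, the sketch below accumulates the zeroth-order and centered first-order Baum-Welch statistics of an utterance against a diagonal-covariance UBM (reusing the hypothetical DiagGMM container from the earlier sketch). These are the standard sufficient statistics for estimating the JFA subspaces; the thesis's exact accumulation may differ.

```python
import numpy as np

def baum_welch_stats(frames, ubm):
    """Zeroth-order (N) and centered first-order (F) statistics of an utterance
    (frames: T x D array) against a diagonal-covariance UBM (DiagGMM)."""
    diff = frames[:, None, :] - ubm.means[None, :, :]                          # (T, M, D)
    log_lik = (np.log(ubm.weights)[None, :]
               - 0.5 * np.sum(np.log(2 * np.pi * ubm.variances), axis=1)[None, :]
               - 0.5 * np.sum(diff ** 2 / ubm.variances[None, :, :], axis=2))  # (T, M)
    log_post = log_lik - np.logaddexp.reduce(log_lik, axis=1, keepdims=True)
    gamma = np.exp(log_post)                  # per-frame mixture posteriors
    N = gamma.sum(axis=0)                     # (M,)   occupation counts
    F = np.einsum('tm,tmd->md', gamma, diff)  # (M, D) centered first-order stats
    return N, F
```

These statistics drive the estimation of the speaker, speaker-state, and channel subspaces; at recognition time the estimated state and channel terms are removed so that scoring depends only on the speaker term.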
【Degree-Granting Institution】: 蘇州大學 (Soochow University)
【Degree Level】: Doctorate
【Year Conferred】: 2014
【Classification Number】: TN912.34
Document ID: 2158273
Link: http://www.sikaile.net/kejilunwen/wltx/2158273.html