Research on Context-Based Annotation and Management of Mobile Multimedia Information and Its Key Technologies
Published: 2019-06-11 01:00
【Abstract】: In recent years, with the rapid development of computer communication and multimedia compression technology, the steady decline of storage costs, and especially the popularity of smartphones and the rise of social networking sites, visual data such as video and images have grown explosively, and managing and retrieving these data effectively has become a pressing problem. To enable direct access to such data through text-based management and retrieval techniques, semantic annotation of video and images has gradually developed. Because manual annotation is inefficient, costly, and subjective, the common solution is automatic annotation by computer. Automatic annotation based on semantic concepts is one of the prevailing techniques; despite some success, problems such as dependence on training data and the limitations of visual semantics still hinder its further development. This thesis approaches the automatic annotation of visual data from a new angle. In essence, visual data such as video and images are carriers produced by visual sensors to describe real-world entities and events; annotation tries to parse the original semantics from the visual description and restore them as linguistic descriptions, so as to facilitate organization and management. A visual sensor records only the visual appearance of the targets within its range, while a large amount of contextual information related to the targets' semantics is discarded. Research in this field still focuses on mining the semantic information contained in the visual data themselves; in contrast, this thesis focuses on the process by which visual data are produced. With the development of Internet of Things technology, wearable sensing devices are becoming widespread. This thesis aims to use wearable sensors to collect and exploit contextual information about visual targets, in order to assist the semantic parsing of visual data. The main contributions are as follows:
· Conventional face detection and tracking must process every frame of a video. This thesis proposes a fast face detection and tracking algorithm that uses sensor-collected context information to filter out the many frames containing no faces, reducing processing time as well as false positives and misses, and improving both the performance and the efficiency of face detection and tracking.
· Building on sensor-assisted fast face recognition, and by exploiting the consistency of the target's body motion direction across different sensing modalities, a method for recognizing frontal face images in video is proposed. As with the identity recognition described below, introducing wearable sensors frees the recognition process from dependence on sample data; experiments show the method is more robust.
· To guarantee accuracy, traditional methods for identifying targets in video must collect a large amount of high-quality sample data for each target. This thesis proposes an identification method based on motion matching: it exploits the inherent consistency of the same target's motion features across different sensing modalities, introducing wearable sensors to help solve the target-identification problem in video. The method bypasses the traditional processing pipeline and removes the dependence on sample data; it is logically simple, computationally cheap, and highly reliable.
· An automatic video annotation method is proposed. It performs action recognition separately on two different kinds of sensing data, fuses the decisions from the different modalities to reveal the target's identity, and finally annotates video content in the form of time, place, person, and action.
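The sensor-gated filtering idea behind the first contribution can be sketched as follows. This is an illustrative sketch, not the thesis's implementation: `context_says_person_present` (here the `context` dict) stands in for whatever wearable-sensor readings indicate that a person is in view, and `detect_faces` stands in for an arbitrary frame-level face detector; both names are assumptions introduced for the example.

```python
def filter_and_detect(frames, context, detect_faces):
    """Run the (expensive) face detector only on frames whose sensor
    context indicates a person is in view, skipping the rest."""
    results = {}
    for i, frame in enumerate(frames):
        if not context.get(i, False):  # wearable context says: nobody in view
            continue                   # skip the faceless frame entirely
        faces = detect_faces(frame)
        if faces:
            results[i] = faces
    return results

# Toy usage: frames are placeholders; context marks frames 2 and 3 as
# "person present", so only those two frames reach the detector.
frames = ["f0", "f1", "f2", "f3", "f4"]
context = {2: True, 3: True}
detector = lambda f: ["face"] if f in ("f2", "f3") else []
hits = filter_and_detect(frames, context, detector)  # → {2: ['face'], 3: ['face']}
```

The speed-up comes purely from the skipped frames: the detector's per-frame cost is unchanged, but it is paid only for the context-approved subset.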
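The motion-matching identification of the third contribution rests on correlating motion observed in the video with motion reported by each candidate's wearable sensor: the tracked subject is identified as the wearer whose signal agrees best, with no per-person training samples. A minimal sketch, under the assumption (not stated in this abstract) that both modalities are reduced to comparable 1-D motion time series; the function names and data are hypothetical.

```python
def correlate(a, b):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb) if sa and sb else 0.0

def identify(track_motion, wearable_motions):
    """Assign a video track to the wearer whose sensed motion series
    matches it best (highest correlation)."""
    return max(wearable_motions,
               key=lambda pid: correlate(track_motion, wearable_motions[pid]))

# Toy usage: the track's motion pattern matches bob's sensor-derived series.
track = [0.1, 0.9, 0.2, 0.8, 0.1]
wearables = {
    "alice": [0.5, 0.5, 0.5, 0.5, 0.5],
    "bob":   [0.0, 1.0, 0.1, 0.9, 0.0],
}
who = identify(track, wearables)  # → "bob"
```

Because the decision is a cross-modal match rather than a classification, there is no model to train, which is the sense in which the method "gets rid of the dependence on sample data".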
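The final contribution, fusing action decisions from two sensing modalities to recover identity and emit a time/place/person/action annotation, can be sketched in the same spirit. Everything here is an assumption for illustration: the abstract does not specify the fusion rule, so this sketch uses the simplest one, agreement between the video-recognized action and a wearer's sensor-recognized action.

```python
def annotate(clip_time, clip_place, video_action, sensor_actions):
    """Fuse video-based and wearable-based action recognition: the wearer
    whose sensed action matches the action seen in the video is taken to
    be the person on screen, yielding a time/place/person/action tag."""
    for person, action in sensor_actions.items():
        if action == video_action:
            return {"time": clip_time, "place": clip_place,
                    "person": person, "action": action}
    return None  # no modality agreement: leave the clip untagged

# Toy usage: only bob's wearable reports the action seen in the video.
tag = annotate("2015-03-01 10:00", "lab", "walking",
               {"alice": "sitting", "bob": "walking"})
# → {'time': '2015-03-01 10:00', 'place': 'lab', 'person': 'bob', 'action': 'walking'}
```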
【Degree-granting institution】: Beijing University of Posts and Telecommunications
【Degree level】: Doctorate
【Year conferred】: 2015
【Classification number】: TP391.41
Article No.: 2496875
Link: http://www.sikaile.net/shoufeilunwen/xxkjbs/2496875.html