
## Challenge of Multi-Kernel Visual-Auditory Representation Learning


## 2 Related works

As previously discussed, visual-auditory search belongs to the area of cross-media retrieval, and our paper mainly focuses on the challenge of multi-kernel visual-auditory representation learning. Therefore, in this section, we discuss related works from the perspective of cross-media retrieval [32, 33, 35, 36] and multiple kernel distance metric learning [10, 34].

### 2.1 Cross-media retrieval

Cross-media retrieval originates from content-based multimedia analysis and retrieval, a long-standing research topic in computer vision [30]. As previously discussed, most content-based multimedia retrieval works focus on multimedia data of a single modality and aim to bridge the semantic gap between low-level features and high-level semantics [15, 29]; Content-Based Image Retrieval (CBIR) [9, 31] is a representative example. Considering the content gap between different kinds of multimedia data, cross-media retrieval aims to build a flexible retrieval framework in which users can search multimedia data with a query example of a different modality [32, 35]. For example, in a cross-media retrieval system, we can obtain relevant image and audio results by submitting either an image query example or an audio query example. The main challenge for cross-media retrieval is how to measure similarity across different kinds of low-level feature spaces: although an image and an audio clip may represent similar semantics, it is very difficult to measure the similarity between the visual features of images and the auditory features of audio clips.

In the past few years, researchers have proposed a number of cross-media retrieval algorithms that provide possible solutions for bridging the content gap. Most of this research falls into three categories: context-based cross-media retrieval, cross-modal video data analysis and retrieval, and content-based cross-media retrieval. In the first group, context correlations, such as web links, inclusion relations and text comments, are explored and used to estimate the cross-media similarity between multimedia data of different modalities. For example, Yang et al. proposed a distance measure between heterogeneous Multimedia Documents (MMDs), each consisting of text, image or audio samples, and constructed an MMD semantic subspace for cross-media retrieval [34]. An MMD is a typical cross-media data environment with rich context correlations: if an image and an audio clip are included in the same MMD, we can assume that the two multimedia objects represent similar semantics. Web pages and PPT documents are examples of MMDs.

Multimed Tools Appl (2016) 75:9169–9184 9171

Secondly, video data contains different tracks of information, including key-frame images, sounds and voices, text subtitles, etc. Different tracks of low-level video features, such as visual features of key frames, auditory features of speakers and caption features, are frequently analyzed together, and a great deal of research is dedicated to cross-modal retrieval between the different tracks of video data [10, 26]. For example, paper [13] proposed a topic model that learned probabilistic connections between high-frequency semantic concepts (keywords) and multimedia objects, so that users could retrieve news of different types.

Besides, a few researchers focus on how to analyze content-level statistical cross-media correlation with labeled and unlabeled data [36, 37, 39]. Although multimedia data of different modalities may "look" different in their visual and auditory representations, they may share a statistical content-level correlation that can be explored and used for retrieval. For example, paper [36] proposed an isomorphic cross-media subspace mapping algorithm, which calculated and maintained the underlying canonical correlation between the visual feature matrix of images and the auditory feature matrix of audio clips during subspace mapping.

### 2.2 Multiple kernel distance metric learning

Kernel methods typically consist of two parts. The first part maps the input feature space into another space, often of much higher or even infinite dimensionality, by applying a nonlinear function; the second part usually applies a linear method in the high-dimensional space. Kernel-based methods are not new to multimedia retrieval; for example, kernel SVM algorithms have been successfully introduced into CBIR tasks [20]. In the kernel-based multimedia representation and distance metric learning literature, several algorithms have been proposed for similarity learning in CBIR. Connections between representation learning and kernel learning, which can provide kernelization for a set of metric learning methods, have been revealed in recent studies [6].
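As a concrete illustration of the implicit mapping step, the sketch below computes a Gram matrix with the RBF kernel, whose feature space is infinite-dimensional. This is our own minimal example, not code from the paper; the function name and toy data are illustrative.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||X[i] - Y[j]||^2).

    The inner product phi(x)^T phi(y) in the implicit (here
    infinite-dimensional) feature space is evaluated without
    ever constructing phi explicitly -- the "kernel trick".
    """
    sq_dists = (
        np.sum(X**2, axis=1)[:, None]
        + np.sum(Y**2, axis=1)[None, :]
        - 2.0 * X @ Y.T
    )
    return np.exp(-gamma * sq_dists)

X = np.array([[0.0, 0.0], [1.0, 0.0]])
K = rbf_kernel(X, X)  # 2 x 2 symmetric PSD matrix with ones on the diagonal
```

A linear method applied to `K` (an SVM, or the canonical correlation analysis used later in this paper) then operates in the high-dimensional space at the cost of n × n kernel evaluations.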

Multiple kernel learning (MKL) [8, 16] is now an active research topic in machine learning. It has been used with great success in various studies and applications, such as bioinformatics, computer vision, and natural language processing. Paper [8] found the optimal combination of multiple kernels for learning classifiers for a given classification task. In addition, several recent studies address multiple kernel learning for multi-class and multi-labeled data so as to improve system efficiency and generality [7, 22, 23]. Compared to single-kernel methods such as SVM, MKL attempts to achieve better results by combining several base kernels instead of using only one specific kernel [21]. MKL allows the practitioner to optimize over linear combinations of kernels, and research has focused both on the learning formulations and on the corresponding optimization. Since different applications need different formulations, the existing MKL methods use different learning functions for determining the kernel combinations [5].

In terms of combination functions, most MKL studies work with linear combinations, which fall into two basic categories: unweighted sums and weighted sums. In the unweighted case, the sum or mean of the kernels serves as the combined kernel; in the weighted case, the weight of each kernel is optimized. There are also nonlinear combination studies which apply nonlinear functions of kernels (e.g., multiplication, powers and exponentiation). As for target functions, MKL algorithms are typically categorized into three groups: similarity-based functions, structural risk functions and Bayesian functions. All MKL algorithms share the goal of learning the optimal combination of multiple kernels; the difference between our method and others is that we aim to learn a kernel-based similarity function for image retrieval, while conventional MKL studies usually handle classification tasks.
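The two basic linear-combination categories above can be sketched in a few lines. The helper below is our illustration, not code from any MKL package; the toy Gram matrices are invented.

```python
import numpy as np

def combine_kernels(kernels, weights=None):
    """Linear combination of base Gram matrices.

    weights=None  -> unweighted mean of the kernels;
    weights given -> weighted sum, with one weight per base kernel
                     (the quantity MKL methods optimize).
    """
    kernels = [np.asarray(K, dtype=float) for K in kernels]
    if weights is None:
        weights = [1.0 / len(kernels)] * len(kernels)
    combined = np.zeros_like(kernels[0])
    for w, K in zip(weights, kernels):
        combined += w * K
    return combined

K1 = np.array([[1.0, 0.2], [0.2, 1.0]])  # e.g., an RBF Gram matrix
K2 = np.array([[1.0, 0.8], [0.8, 1.0]])  # e.g., a linear Gram matrix
K_mean = combine_kernels([K1, K2])              # unweighted case
K_wsum = combine_kernels([K1, K2], [0.3, 0.7])  # weighted case
```

A nonnegative weighted sum of PSD Gram matrices is again PSD, which is why linear combinations are the default choice.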

### 2.3 Discussion

The related works above have obtained satisfactory results on multimedia representation and retrieval. Our approach of multiple kernel visual-auditory representation and retrieval differs from most related works in the following aspects: we aim to learn a kernel-based similarity function for visual-auditory retrieval, while conventional MKL studies often handle single-modality multimedia data analysis tasks. On the other hand, content-based multimedia analysis and retrieval works mostly focus on single-modality data and ignore the issues of cross-media correlation analysis and semantic understanding, which are addressed in this paper.

## 3 Multiple kernel visual-auditory representation learning

We aim to learn a general visual-auditory representation framework in which different types of multimedia data are represented in an isomorphic subspace and cross-media correlation can be easily measured for ranking query results. Figure 1 illustrates the flowchart of the proposed Multiple Kernel Visual-Auditory Representation Learning (MKVARL) method. The main idea of our approach is as follows: first, we map the audio feature matrix and the image feature matrix into k Hilbert spaces respectively; second, we analyze the canonical correlations between each pair of audio and image Hilbert spaces; third, we map both image samples and audio samples from the Hilbert spaces into the Isomorphic Visual-Auditory Subspace (IVA-Subspace), where the original canonical correlations are maximally preserved. In the IVA-Subspace, we propose a cross-media distance metric to estimate the visual-auditory correlation for retrieval. In this way we can return the image samples or audio samples most similar to the query example a user submits.

Fig. 1 The framework of the proposed MKVARL method


### 3.1 Visual-auditory kernel canonical correlation analysis and mapping

Suppose $X_{n \times p} = (x_1, x_2, \cdots, x_n)^T$ and $Y_{n \times q} = (y_1, y_2, \cdots, y_n)^T$ are the original low-level feature matrices of images and audio clips respectively, where $n$ is the number of samples and $p, q$ are the feature dimensions. Let $\varphi_x(x) = (\varphi_x(x_1), \varphi_x(x_2), \cdots, \varphi_x(x_n))$ denote the transformed Hilbert space $H_x$ for the image feature matrix $X_{n \times p}$, and $\varphi_y(y) = (\varphi_y(y_1), \varphi_y(y_2), \cdots, \varphi_y(y_n))$ denote the transformed Hilbert space $H_y$ for the audio feature matrix $Y_{n \times q}$. Motivated by canonical correlation analysis, we seek two projection vectors $w_x$ ($p \times m$) and $w_y$ ($q \times m$) with which the underlying correlations between $H_x$ and $H_y$ are maximally maintained in the $m$-dimensional mutual subspace, named the Isomorphic Visual-Auditory Subspace (IVA-Subspace). Let $u = w_x^T \varphi_x(x)$ and $v = w_y^T \varphi_y(y)$ denote the IVA-Subspace mapping; $w_x$ and $w_y$ can be found by solving the following Lagrangian function:

$$
L(w_x, w_y, \lambda_x, \lambda_y) = E\big[(u - E(u))(v - E(v))\big] - \frac{\lambda_x}{2} E\big[(u - E(u))^2\big] - \frac{\lambda_y}{2} E\big[(v - E(v))^2\big] + L_0 \tag{1}
$$

where $L_0 = \frac{\eta}{2}\big(\|w_x\|^2 + \|w_y\|^2\big)$ and $\eta$ is a regularization constant. $L_0$ is used because the dimensionalities of the Hilbert spaces are large; without $L_0$, Eq. (1) may yield meaningless projection vectors. Based on the reproducing kernel theory [4, 18], we have:

$$
w_x = \sum_i \alpha_i \varphi_x(x_i), \qquad w_y = \sum_i \beta_i \varphi_y(y_i) \tag{2}
$$

where $\alpha_i, \beta_i$ are weight parameters. Thus, we can rewrite $u$ and $v$ as:

$$
u = \sum_i \alpha_i \varphi_x(x_i)^T \varphi_x(x) \tag{3}
$$

$$
v = \sum_i \beta_i \varphi_y(y_i)^T \varphi_y(y) \tag{4}
$$

Then $u$ and $v$ can be computed using only inner products in the Hilbert spaces. In practice, since we do not need an explicit form of $\varphi(x)$, we first determine a kernel $k_x$ that can be decomposed in the form of an inner product; by Mercer's theorem, a symmetric positive definite kernel $k_x$ admits such a decomposition. We define the kernel functions $k_x(x_i, x_j)$ and $k_y(y_i, y_j)$ as:

$$
k_x(x_i, x_j) = \varphi_x(x_i)^T \varphi_x(x_j), \qquad k_y(y_i, y_j) = \varphi_y(y_i)^T \varphi_y(y_j) \tag{5}
$$

The corresponding kernel matrices are $(K_x)_{ij} = k_x(x_i, x_j)$ and $(K_y)_{ij} = k_y(y_i, y_j)$. Furthermore, we can obtain

$$
M\beta = \lambda L\alpha, \qquad M^T\alpha = \lambda N\beta \tag{6}
$$

$$
M = \frac{1}{n} K_x^T J K_y, \quad L = \frac{1}{n} K_x^T J K_x + \eta_1 K_x, \quad N = \frac{1}{n} K_y^T J K_y + \eta_2 K_y, \quad J = I - \frac{1}{n}\mathbf{1}\mathbf{1}^T \tag{7}
$$

Based on Eq. (6), we can obtain

$$
L^{-1} M N^{-1} M^T \alpha = \lambda^2 \alpha, \qquad N^{-1} M^T L^{-1} M \beta = \lambda^2 \beta \tag{8}
$$
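Eqs. (6)–(8) can be checked numerically. The sketch below is our own minimal implementation under stated assumptions: linear base kernels on toy paired data, a single regularizer value for η₁ = η₂, and pseudo-inverses in place of the inverses in Eq. (8), since rank-deficient toy Gram matrices make L and N singular.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50
# Toy paired data: a shared latent signal plus modality-specific noise.
z = rng.standard_normal((n, 1))
X = np.hstack([z, 0.1 * rng.standard_normal((n, 2))])  # "visual" features
Y = np.hstack([z, 0.1 * rng.standard_normal((n, 3))])  # "auditory" features

Kx = X @ X.T                          # one choice of kernel matrices (linear)
Ky = Y @ Y.T
J = np.eye(n) - np.ones((n, n)) / n   # centering matrix J of Eq. (7)

eta1 = eta2 = 0.1
M = Kx.T @ J @ Ky / n                 # Eq. (7)
L = Kx.T @ J @ Kx / n + eta1 * Kx
N = Ky.T @ J @ Ky / n + eta2 * Ky

# Eq. (8): L^{-1} M N^{-1} M^T alpha = lambda^2 alpha
A = np.linalg.pinv(L) @ M @ np.linalg.pinv(N) @ M.T
eigvals, eigvecs = np.linalg.eig(A)
top = np.argmax(eigvals.real)
lam2 = eigvals[top].real              # leading squared canonical correlation
alpha = eigvecs[:, top].real          # dual projection for the image side
u = Kx @ alpha                        # image coordinates on the first
                                      # canonical direction
```

The audio-side coefficients then follow from Eq. (6) as $\beta = N^{-1} M^T \alpha / \lambda$; stacking the leading $m$ eigenvectors gives the $m$-dimensional IVA-Subspace mapping.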


Therefore, the visual-auditory kernel canonical correlation analysis and mapping process is as below:

### 3.2 Extension to multiple kernel visual-auditory analysis

As previously defined, $X_{n \times p}$ and $Y_{n \times q}$ are the original image and audio feature matrices respectively. Let $x_i = (x_{i1}, x_{i2}, \cdots, x_{ip})$ ($x_{ik} \in \mathbb{R}$) and $y_i = (y_{i1}, y_{i2}, \cdots, y_{iq})$ ($y_{ik} \in \mathbb{R}$) denote the visual and auditory feature vectors respectively. Suppose $K_{x,y}^d$ ($d = 1, 2, \cdots, k$) are $k$ kernel functions, each associated with a Hilbert space $H_d$. First, we map $X_{n \times p}$ and $Y_{n \times q}$ into the Hilbert spaces $I_d$ and $A_d$ with the kernel function $K_{x,y}^d$. Then we calculate the canonical correlation between each pair of image and audio Hilbert spaces and obtain the corresponding projection vectors $w_x$ and $w_y$. In this way, we transform the kernel matrices into the $m$-dimensional IVA-Subspace, where the cross-media correlations between image and audio kernel features are preserved.

We define $x_i^d = (x_{i1}^d, x_{i2}^d, \cdots, x_{im}^d)$ with $x_{ij}^d = a + b \times i$ ($a, b \in \mathbb{R}$), obtained from the Hilbert space $I_d$, as the image feature vector in the IVA-Subspace. Likewise, for the audio representation we have $m$-dimensional vectors $y_i^d$.

To estimate cross-media distance in the IVA-Subspace, we transform the complex numbers in $x_i^d$ into a polar-coordinate representation:

$$
x_{ij}^d = \big(\beta_{ij}, |x_{ij}^d|\big), \qquad \beta_{ij} = \arctan(b/a), \qquad |x_{ij}^d| = \sqrt{a^2 + b^2} \tag{9}
$$

We perform the same polar-coordinate transformation on all the vectors in $y_i^d$, and define the distance between the image $x_i^d$ and the audio $y_i^d$ as:

$$
dis(x_i^d, y_i^d) = \sqrt{\sum_{j=1}^{m} |x_{ij}^d|^2 + |y_{ij}^d|^2 - 2\,|x_{ij}^d|\,|y_{ij}^d| \cos\big|\beta_{ij}^x - \beta_{ij}^y\big|} \tag{10}
$$

where $\beta_{ij}^x$ and $\beta_{ij}^y$ are the polar angles of $x_{ij}^d$ and $y_{ij}^d$ respectively.

Thus, the similarity between an image $x_i$ and an audio clip $y_i$ is:

$$
S(x_i, y_i) = \sum_{d=1}^{k} \eta_d \, dis(x_i^d, y_i^d) \tag{11}
$$

where $\eta_d$ are the combination weights.
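Eqs. (9)–(11) can be sketched directly; the function names `dis` and `similarity` and the toy vectors are our own. Since Eq. (10) is the law of cosines applied coordinate-wise, the per-kernel distance coincides with the Euclidean distance between the complex vectors.

```python
import numpy as np

def dis(x, y):
    """Per-kernel distance of Eq. (10) between complex IVA-Subspace
    vectors x and y, via the polar form of Eq. (9)."""
    beta_x, r_x = np.angle(x), np.abs(x)   # Eq. (9): (angle, modulus)
    beta_y, r_y = np.angle(y), np.abs(y)
    return np.sqrt(np.sum(
        r_x**2 + r_y**2 - 2 * r_x * r_y * np.cos(np.abs(beta_x - beta_y))
    ))

def similarity(xs, ys, etas):
    """Eq. (11): weighted combination over the k kernels; xs[d], ys[d]
    are the image and audio vectors from the d-th kernel's subspace."""
    return sum(eta * dis(x, y) for eta, x, y in zip(etas, xs, ys))

x = np.array([1.0 + 1.0j, 2.0 + 0.0j])
y = np.array([0.0 + 1.0j, 1.0 + 1.0j])
d = dis(x, y)  # equals np.linalg.norm(x - y) by the law of cosines
```

A smaller $S(x_i, y_i)$ thus indicates a closer visual-auditory pair, and the ranking for retrieval sorts candidates by this combined score.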


Based on the above analysis, we discuss how to enable cross-media retrieval in two situations: the query example inside the database and the query example outside the database. If the query example is outside the database, we use the method from our previous work to estimate its coordinates in the IVA-Subspace [39], and then measure the cross-media correlation in the same way as for database samples. Our MKVARL algorithm is described below:

## 4 Experiments

### 4.1 Experimental setup

We conduct a set of experiments to evaluate the performance of the proposed algorithm in cross-media retrieval. We use the Mean Average Precision (MAP) and top-k retrieval accuracy for performance evaluation. Since there is no benchmark cross-media database available to evaluate the proposed MKVARL approach, we collected an image-audio dataset crawled from websites, including Flickr, http://image.baidu.com, http://encarta.msn.com, http://www.animalbehaviorarchive.org, etc. Some additional audio clips were extracted from movies. The collected dataset consists of 10 semantic categories, such as bird, car, dog and violin. Each category contains 100 images and 70 audio clips. We randomly select 60 images and 60 audio
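For reference, the MAP measure used for evaluation can be computed as below; the function names and the toy ranking are ours, and relevance here means that a retrieved result shares the query's semantic category.

```python
import numpy as np

def average_precision(relevant):
    """AP of one ranked result list; relevant[r] is True when the
    result at rank r+1 belongs to the query's semantic category."""
    relevant = np.asarray(relevant, dtype=bool)
    if not relevant.any():
        return 0.0
    hits = np.cumsum(relevant)                       # relevant items so far
    precision_at = hits / np.arange(1, len(relevant) + 1)
    return float(np.mean(precision_at[relevant]))    # mean over relevant ranks

def mean_average_precision(rankings):
    """MAP: mean of AP over all query result lists."""
    return float(np.mean([average_precision(r) for r in rankings]))

# Relevant results at ranks 1 and 3 -> AP = (1/1 + 2/3) / 2 = 5/6
ap = average_precision([True, False, True, False])
```

Top-k retrieval accuracy is simpler still: the fraction of queries whose top k results contain at least one relevant item.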

