
## Framework of the Proposed MKVARL Method


Fig. 1 The framework of the proposed MKVARL method

Multimed Tools Appl (2016) 75:9169–9184 9173

3.1 Visual-auditory kernel canonical correlation analysis and mapping

Suppose $X_{n \times p} = (x_1, x_2, \cdots, x_n)^T$ and $Y_{n \times q} = (y_1, y_2, \cdots, y_n)^T$ are the original low-level feature matrices of images and audio clips respectively, where $n$ is the number of samples and $p, q$ are the feature dimensions. Let $\varphi_x(x) = (\varphi_x(x_1), \varphi_x(x_2), \cdots, \varphi_x(x_n))$ denote the transform into the Hilbert space $H_x$ for the image feature matrix $X_{n \times p}$, and $\varphi_y(y) = (\varphi_y(y_1), \varphi_y(y_2), \cdots, \varphi_y(y_n))$ denote the transform into the Hilbert space $H_y$ for the audio feature matrix $Y_{n \times q}$.

Motivated by the canonical correlation analysis method, we aim to find two projection vectors $w_x$ ($p \times m$) and $w_y$ ($q \times m$) with which the underlying correlations between $H_x$ and $H_y$ are maximally maintained in an $m$-dimensional mutual subspace, named the Isomorphic Visual-Auditory Subspace (IVA-Subspace). Let $u = w_x^T \varphi_x(x)$ and $v = w_y^T \varphi_y(y)$ denote the IVA-Subspace mapping; $w_x$ and $w_y$ can be found by solving the following Lagrangian function:

$$L(w_x, w_y, \lambda_x, \lambda_y) = E\big[(u - E(u))(v - E(v))\big] - \frac{\lambda_x}{2} E\big[(u - E(u))^2\big] - \frac{\lambda_y}{2} E\big[(v - E(v))^2\big] + L_0 \qquad (1)$$

where $L_0 = \frac{\eta}{2}\left(\lVert w_x \rVert^2 + \lVert w_y \rVert^2\right)$ and $\eta$ is a regularization constant. $L_0$ is needed because the dimensionality of the Hilbert spaces is large; without it, Eq. (1) may yield meaningless projection vectors. Based on the reproducing kernel theory [4, 18], we have:

$$w_x = \sum_i \alpha_i \varphi_x(x_i), \qquad w_y = \sum_i \beta_i \varphi_y(y_i) \qquad (2)$$

where $\alpha_i, \beta_i$ are weight parameters. Thus, we can rewrite $u$ and $v$ as:

$$u = \sum_i \alpha_i \varphi_x(x_i)^T \varphi_x(x) \qquad (3)$$

$$v = \sum_i \beta_i \varphi_y(y_i)^T \varphi_y(y) \qquad (4)$$

Then $u$ and $v$ can be computed using only inner products in the Hilbert spaces. In practice, since we do not need an explicit form of $\varphi(x)$, we instead choose a kernel $k_x$ that can be decomposed in inner-product form; by Mercer's theorem, any symmetric positive definite kernel admits such a decomposition. We define the kernel functions $k_x(x_i, x_j)$ and $k_y(y_i, y_j)$ as:

$$k_x(x_i, x_j) = \varphi_x(x_i)^T \varphi_x(x_j), \qquad k_y(y_i, y_j) = \varphi_y(y_i)^T \varphi_y(y_j) \qquad (5)$$
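As a concrete illustration of Eq. (5) (a generic NumPy sketch, not code from the paper), the degree-2 homogeneous polynomial kernel on 2-D inputs has an explicit finite-dimensional feature map, so the kernel value and the inner product in $H$ can be compared directly:

```python
import numpy as np

def poly_kernel(x, y):
    """Degree-2 homogeneous polynomial kernel k(x, y) = <x, y>^2."""
    return np.dot(x, y) ** 2

def phi(x):
    """Explicit feature map for this kernel in 2-D:
    phi(x) = (x1^2, x2^2, sqrt(2)*x1*x2), so phi(x).phi(y) = <x, y>^2."""
    return np.array([x[0] ** 2, x[1] ** 2, np.sqrt(2) * x[0] * x[1]])

x = np.array([1.0, 2.0])
y = np.array([3.0, 0.5])

# Both sides of Eq. (5) agree: the Gram entry equals the inner product in H.
assert np.isclose(poly_kernel(x, y), phi(x) @ phi(y))

# Building a full kernel matrix (K_x)_{ij} = k_x(x_i, x_j) from n samples:
X = np.random.default_rng(0).normal(size=(5, 2))
Kx = (X @ X.T) ** 2   # n x n Gram matrix, no explicit phi needed
```

This is exactly why the derivation below never needs $\varphi$ explicitly: only the Gram matrices $K_x, K_y$ enter the computation.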

The corresponding kernel matrices are $(K_x)_{ij} = k_x(x_i, x_j)$ and $(K_y)_{ij} = k_y(y_i, y_j)$. Furthermore, we can get

$$M\beta = \lambda L \alpha, \qquad M^T \alpha = \lambda N \beta \qquad (6)$$

$$M = \frac{1}{n} K_x^T J K_y, \quad L = \frac{1}{n} K_x^T J K_x + \eta_1 K_x, \quad N = \frac{1}{n} K_y^T J K_y + \eta_2 K_y, \quad J = I - \frac{1}{n} \mathbf{1}\mathbf{1}^T \qquad (7)$$

where $\mathbf{1}$ is the $n$-dimensional all-ones vector, so $J$ is the centering matrix.

Based on Eq. (6), we can obtain

$$L^{-1} M N^{-1} M^T \alpha = \lambda^2 \alpha, \qquad N^{-1} M^T L^{-1} M \beta = \lambda^2 \beta \qquad (8)$$


Therefore, the visual-auditory kernel canonical correlation analysis and mapping process amounts to solving the eigenproblems in Eq. (8) for $\alpha$ and $\beta$, and then projecting samples into the IVA-Subspace via Eqs. (3) and (4).
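Eq. (8) is a pair of eigenproblems in the dual weights $\alpha$ and $\beta$. The following NumPy sketch shows one way to compute them; the regularization values `eta1`, `eta2` and the small jitter term are illustrative choices, not values from the paper:

```python
import numpy as np

def kcca(Kx, Ky, eta1=1e-3, eta2=1e-3, m=2):
    """Sketch of the kernel CCA solution of Eqs. (6)-(8).

    Kx, Ky: n x n kernel matrices (Eq. (5)).
    Returns dual weight matrices alpha, beta of shape (n, m)."""
    n = Kx.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n                     # centering matrix of Eq. (7)
    M = Kx.T @ J @ Ky / n
    L = Kx.T @ J @ Kx / n + eta1 * Kx + 1e-8 * np.eye(n)    # jitter for numerical stability
    N = Ky.T @ J @ Ky / n + eta2 * Ky + 1e-8 * np.eye(n)
    # Eq. (8): L^{-1} M N^{-1} M^T alpha = lambda^2 alpha
    A = np.linalg.solve(L, M @ np.linalg.solve(N, M.T))
    vals, vecs = np.linalg.eig(A)
    order = np.argsort(-vals.real)[:m]                      # keep the top-m correlations
    alpha = vecs[:, order].real
    lam = np.sqrt(np.abs(vals.real[order]))
    # beta follows from Eq. (6): beta = N^{-1} M^T alpha / lambda
    beta = np.linalg.solve(N, M.T @ alpha) / lam
    return alpha, beta
```

A sample $x$ is then mapped into the IVA-Subspace as $u = \alpha^T k_x(\cdot, x)$, per Eq. (3).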

3.2 Extension to multiple kernel visual-auditory analysis

As previously defined, $X_{n \times p}$ and $Y_{n \times q}$ are the original image and audio feature matrices respectively. Let $x_i = (x_{i1}, x_{i2}, \cdots, x_{ip})$ ($x_{ik} \in \mathbb{R}$) and $y_i = (y_{i1}, y_{i2}, \cdots, y_{iq})$ ($y_{ik} \in \mathbb{R}$) denote the visual and auditory feature vectors respectively. Suppose $k^d_{x,y}$ ($d = 1, 2, \cdots, k$) are $k$ kernel functions, each associated with a Hilbert space $H_d$. First, we map $X_{n \times p}$ and $Y_{n \times q}$ into the Hilbert spaces $I_d$ and $A_d$ with the kernel function $k^d_{x,y}$. Then we compute the canonical correlation between each pair of image and audio Hilbert spaces and obtain the corresponding projection vectors $w_x$ and $w_y$. Therefore, we transform the kernel features into the $m$-dimensional IVA-Subspace, where the cross-media correlations between image and audio kernel features are retained.

We define $x_i^d = (x_{i1}^d, x_{i2}^d, \cdots, x_{im}^d)$ with $x_{ij}^d = a + b\,\mathrm{i}$ ($a, b \in \mathbb{R}$), obtained from the Hilbert space $I_d$, as the (complex-valued) image feature vector in the IVA-Subspace. Likewise, for audio we have $m$-dimensional representations $y_i^d$. To estimate cross-media distance in the IVA-Subspace, we transform the complex numbers in $x_i^d$ into a polar-coordinate representation:

$$x_{ij}^d = \left(\beta_{ij}, \left|x_{ij}^d\right|\right), \qquad \beta_{ij} = \arctan\frac{b}{a}, \qquad \left|x_{ij}^d\right| = \sqrt{a^2 + b^2} \qquad (9)$$

We perform the same polar-coordinate transformation on all the entries of $y_i^d$, and define the distance between image $x_i^d$ and audio $y_i^d$ as:

$$\mathrm{dis}\left(x_i^d, y_i^d\right) = \sqrt{\sum_{j=1}^{m} \left( \left|x_{ij}^d\right|^2 + \left|y_{ij}^d\right|^2 - 2\left|x_{ij}^d\right| \left|y_{ij}^d\right| \cos\left|\beta_{ij}^x - \beta_{ij}^y\right| \right)} \qquad (10)$$

where $\beta_{ij}^x$ and $\beta_{ij}^y$ are the polar angles of $x_{ij}^d$ and $y_{ij}^d$ respectively.

Thus, the similarity of an image $x_i$ and an audio clip $y_i$ is:

$$S(x_i, y_i) = \sum_{d=1}^{k} \eta_d \, \mathrm{dis}\left(x_i^d, y_i^d\right) \qquad (11)$$

where $\eta_d$ are the combination weights.
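Eqs. (9)-(11) translate directly into NumPy (a sketch with hypothetical helper names; the inputs are assumed to be the complex IVA-Subspace vectors defined above, one per kernel):

```python
import numpy as np

def polar(z):
    """Eq. (9): complex entries -> (polar angle, magnitude)."""
    return np.angle(z), np.abs(z)

def dis(xd, yd):
    """Eq. (10): law-of-cosines distance between two complex vectors."""
    bx, rx = polar(np.asarray(xd))
    by, ry = polar(np.asarray(yd))
    return np.sqrt(np.sum(rx**2 + ry**2 - 2 * rx * ry * np.cos(np.abs(bx - by))))

def score(x_per_kernel, y_per_kernel, eta):
    """Eq. (11): weighted combination over the k kernels."""
    return sum(w * dis(xd, yd)
               for w, xd, yd in zip(eta, x_per_kernel, y_per_kernel))
```

Since each summand in Eq. (10) is the law of cosines, `dis` coincides with the Euclidean distance between the complex vectors; and because Eq. (11) aggregates distances, smaller values of $S$ indicate closer cross-media matches under this definition.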


Based on the above analysis, we discuss how to enable cross-media retrieval in two situations: query example inside the database, and query example outside the database. If the query example is outside the database, we use the method from our previous work [39] to estimate its coordinates in the IVA-Subspace, and then measure the cross-media correlation in the same way as for database samples.

4 Experiments

4.1 Experimental setup

We conduct a set of experiments to evaluate the performance of the proposed algorithm in cross-media retrieval, using Mean Average Precision (MAP) and top-k retrieval accuracy for performance evaluation. Since no benchmark cross-media database is available to evaluate the proposed MKVARL approach, we collected an image-audio dataset crawled from websites including Flickr, http://image.baidu.com, http://encarta.msn.com, http://www.animalbehaviorarchive.org, etc.; some additional audio clips were extracted from movies. The collected dataset consists of 10 semantic categories, such as bird, car, dog, and violin. Each category contains 100 images and 70 audio clips. We randomly select 60 images and 60 audio


clips from each category as training data, and the rest are used as new media objects to test the performance of mapping new media objects into the IVA-Subspace.

The extracted visual features include Color Histogram (in HSV space), Edge Histogram, texture features based on the gray-level co-occurrence matrix, Speeded Up Robust Features (SURF), and GIST. Auditory features consist of Centroid, Rolloff, Spectral Flux, and Root Mean Square. We concatenate the different visual features into high-dimensional vectors as input. Since audio is a kind of time-series data, the dimensionalities of the auditory feature vectors are inconsistent; we therefore apply fuzzy clustering to the auditory features during preprocessing to obtain isomorphic audio feature indexes [39]. As described in Section 3, we use three kinds of kernels for the visual-auditory correlation analysis: the radial basis function in (12), the polynomial kernel in (13), and the sigmoid function in (14).

$$k(x, y) = \exp\left(-\frac{\lVert x - y \rVert^2}{\gamma \sigma^2}\right) \qquad (12)$$

$$k(x, y) = \left(\gamma \langle x, y \rangle + c\right)^n \qquad (13)$$

$$k(x, y) = \tanh\left(\gamma \langle x, y \rangle + c\right) \qquad (14)$$

where we choose empirically optimal values of $\gamma = 2$, $\sigma = 2.4$ in (12); $\gamma = 1$, $c = 1$, $n = 4.2$ in (13); and $\gamma = 0.6$, $c = 1.9$ in (14); and empirically optimal combination weights $\eta = (0.35, 0.2, 0.45)$ in (11).
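The three kernel families in (12)-(14), with the empirical parameter values quoted above, can be written directly as a NumPy sketch. Note that the fractional degree $n = 4.2$ in (13) requires the base $\gamma\langle x, y\rangle + c$ to be non-negative to stay real-valued:

```python
import numpy as np

# Eqs. (12)-(14) with the empirical parameters reported in the text
# (gamma, sigma, c, n as stated there; not re-tuned here).
def rbf(x, y, gamma=2.0, sigma=2.4):            # Eq. (12)
    return np.exp(-np.sum((x - y) ** 2) / (gamma * sigma ** 2))

def poly(x, y, gamma=1.0, c=1.0, n=4.2):        # Eq. (13); base must be >= 0
    return (gamma * np.dot(x, y) + c) ** n

def sigmoid(x, y, gamma=0.6, c=1.9):            # Eq. (14)
    return np.tanh(gamma * np.dot(x, y) + c)

KERNELS = [rbf, poly, sigmoid]
ETA = (0.35, 0.2, 0.45)   # combination weights for Eq. (11)
```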

4.2 Performance comparison results

To evaluate the efficacy of the proposed algorithm, we compare the image-audio retrieval performance of the proposed MKVARL approach with the PCA [25], CCA [17], and KCCA [14] methods. When a user submits an image query example from the training set, relevant audio clips are retrieved and returned, and vice versa. In our experiments, a returned result is regarded as correct if it belongs to the same semantic category as the query example, and precision is defined as the percentage of correctly retrieved samples among the top-k returned results.
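The precision@k just defined, and the MAP used throughout this section, can be sketched in a few lines of Python (`relevant` marks which ranked results share the query's category):

```python
def precision_at_k(relevant, k):
    """Fraction of correct results among the top-k returns,
    matching the definition above. `relevant` is a boolean ranking."""
    return sum(relevant[:k]) / k

def average_precision(relevant):
    """AP for one query: mean of precision@i over each rank i that holds
    a relevant item; MAP averages AP over all queries."""
    hits, total = 0, 0.0
    for i, rel in enumerate(relevant, start=1):
        if rel:
            hits += 1
            total += hits / i
    return total / hits if hits else 0.0

# Example ranking: the 1st, 2nd and 4th results share the query's category.
ranked = [True, True, False, True, False]
print(precision_at_k(ranked, 3))   # 2/3
print(average_precision(ranked))   # (1/1 + 2/2 + 3/4) / 3 = 0.9166...
```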

Figure 2 shows the Mean Average Precision (MAP) of the different algorithms, and Fig. 3 shows the recall comparison results. In Figs. 2 and 3, the MAP and recall values are averaged over 10 queries per semantic category (5 queries of images with audio examples and 5 queries of audio with image examples), with query examples selected at random. From Figs. 2 and 3 we can see that CCA, KCCA, and MKVARL perform much better than PCA.

Meanwhile, KCCA outperforms CCA, and our proposed MKVARL algorithm achieves the best performance. These results are probably obtained because: (1) the projection vectors of CCA, KCCA, and MKVARL are computed from the underlying relevance between image features and audio features, which better reflects high-level semantics; (2) the kernel function in KCCA makes it more appropriate for nonlinear correlations; and (3) different kernels correspond to different notions of similarity between data samples. In particular, in a high-dimensional feature space it is not optimal to choose one kernel for all datasets: a single type of kernel function may fail to exploit all the potential correlations, whereas multiple types of kernel functions can explore them better, which validates the importance of the proposed method. Our approach generally returns more relevant results, which verifies its effectiveness.

Figure 4 shows a specific example of image-audio retrieval. The query example is a 5-second audio clip from the violin category. We compute the similarity score between the query audio and the images in the database and return the top 15 relevant images. The numbers below the returned images are the correlation values between the images and the audio query example. As Fig. 4 shows, 12 of the top 15 returned results are violin images.

4.3 Performance evaluation of new media objects

To test image-audio retrieval performance when query examples are outside the training set, we first use the method from our previous work [39] to estimate their coordinates in the IVA-Subspace, and

Fig. 2 MAP performance comparison results of image-audio retrieval

Fig. 3 Recall performance comparison results of image-audio retrieval


then use the cosine distance metric to compute the cross-media correlation scores. Figures 5 and 6 show the experimental results with new query examples, including querying images by new audio and querying audio by new images. From Figs. 5 and 6 we can make a similar observation: the overall retrieval performance with new multimedia data is good. When querying images by an

Fig. 4 An example of image-audio retrieval

Fig. 5 Querying image by new audio


example of new audio, there are on average 8.58 correct results among the top 20 returns. The performance of querying audio by a new image is similar to that of querying images by new audio.
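The cosine score used above for new media objects is straightforward once the IVA-Subspace coordinates have been estimated (a minimal NumPy sketch; the vectors are assumed to be estimated coordinate vectors):

```python
import numpy as np

def cosine_score(u, v):
    """Cosine similarity between two IVA-Subspace coordinate vectors,
    as used to rank database items against an out-of-sample query."""
    u, v = np.asarray(u, float), np.asarray(v, float)
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(cosine_score([1, 0, 1], [1, 1, 0]))   # 0.5
```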

5 Conclusions

Different from most existing multimedia representation learning methods, this paper proposes a multiple kernel visual-auditory representation learning framework, which learns a general representation model from the visual and auditory feature spaces by explicitly learning statistical cross-media correlations in high-dimensional kernel spaces. In addition, we design a distance metric learning strategy in the mutual subspace.

The performance of our approach is tested on cross-media retrieval between image and audio data. Experiments and comparisons verify the validity, superiority, and applicability of our approach from different aspects. The main limitation is that the image-audio database is comparatively small (many web image galleries are unusable because it is difficult to find suitable audio). Future work includes further study on large-scale social media datasets.

Acknowledgments This research is supported by the National Natural Science Foundation of China (No. 61003127, No. 61373109, No. 61440016) and the China Scholarship Council (201508420248).

References

1. Chang X, Yang Y, Hauptmann AG, Xing E, Yu Y (2015) Semantic concept discovery for large-scale zero-shot event detection. International Joint Conference on Artificial Intelligence (IJCAI)
2. Chang X, Yang Y, Xing E, Yu Y (2015) Complex event detection using semantic saliency and nearly-isotonic SVM. International Conference on Machine Learning (ICML)
3. Chang X, Yu Y, Yang Y, Hauptmann A (2015) Searching persuasively: joint event detection and evidence justification with limited supervision. ACM MM
4. Gao DD, Huang RB (2000) Some results on canonical correlation and their application to a linear model. Linear Algebra Appl 321:47–59
5. Gonen M, Alpaydın E (2011) Multiple kernel learning algorithms. J Mach Learn Res 12:2211–2268
6. Jain P, Kulis B, Davis JV, Dhillon IS (2012) Metric and kernel learning using a linear transformation. J Mach Learn Res 13:519–547
7. Jain A, Vishwanathan SVN, Varma M (2012) SPG-GMKL: generalized multiple kernel learning with a million kernels. In: Proceedings of the ACM SIGKDD conference on knowledge discovery and data mining
8. Lanckriet GRG, Cristianini N, Bartlett P, Ghaoui LE, Jordan MI (2004) Learning the kernel matrix with semi-definite programming. J Mach Learn Res 5:27–72
9. Lew MS, Sebe N, Djeraba C, Jain R (2006) Content-based multimedia information retrieval: state of the art and challenges. ACM Trans Multimed Comput Commun Appl 2(1):1–19
10. Liu Y, Wu F, Zhuang Y, Xiao J (2008) Active post-refined multimodality video semantic concept detection with tensor representation. ACM International Conference on Multimedia, pp 91–100
11. Liu G, Yan Y, Gao C, Tong W, Hauptmann AG, Sebe N (2014) The mystery of faces: investigating face contribution for multimedia event detection. ICMR
12. Liu H, Yu L (2005) Toward integrating feature selection algorithms for classification and clustering. IEEE Trans Knowl Data Eng 17(4):491–502
13. Ma Q, Akiyo N, Katsumi T (2006) Complementary information retrieval for cross-media news content. Inf Syst 31(7):659–678
14. Melzer T, Reiter M, Bischof H (2003) Appearance models based on kernel canonical correlation analysis. Pattern Recogn 36:1961–1971
15. Shen H, Yan Y, Xu S, Ballas N, Chen W (2015) Evaluation of semi-supervised learning method on action recognition. Multimedia Tools and Applications 74(2):523–542
16. Sonnenburg S, Rätsch G, Schafer C, Scholkopf B (2006) Large scale multiple kernel learning. J Mach Learn Res 7:1531–1565
17. Sun T, Chen S (2007) Locality preserving CCA with applications to data visualization and pose estimation. Image Vis Comput 25:531–543
18. Thomas M, Michael R, Horst B (2003) Appearance models based on kernel canonical correlation analysis. Pattern Recogn 27(2):1–8
19. Tolias G, Bursuc A, Furon T, Jégou H (2015) Rotation and translation covariant match kernels for image retrieval. Comp Vis Image Underst 140:9–20
20. Tong S, Chang E (2001) Support vector machine active learning for image retrieval. ACM International Conference on Multimedia, pp 107–118
21. Vapnik V (1997) The nature of statistical learning theory. IEEE Trans Neural Netw 8(6)
22. Varma M, Babu BR (2009) More generality in efficient multiple kernel learning. In: Proceedings of International Conference on Machine Learning, pp 1065–1072
23. Vishwanathan SVN, Sun Z, Ampornpunt N, Varma M (2010) Multiple kernel learning and the SMO algorithm. In: NIPS, pp 2361–2369
24. Wang D, Hoi SC, He Y, Zhu J, Mei T, Luo J (2014) Retrieval-based face annotation by weak label regularized local coordinate coding. IEEE Trans Pattern Anal Mach Intell 36(3):550–563
25. Wu Y, Chang EY, Chang CC, Smith JR (2004) Optimal multimodal fusion for multimedia data analysis. In: ACM Multimedia Conference, pp 572–579
26. Wu Y, Chang EY, Chen-Chuan Chang K, Smith JR (2004) Optimal multimodal fusion for multimedia data analysis. ACM International Conference on Multimedia, pp 572–579
27. Xia H, Hoi SC, Jin R, Zhao P (2012) Online multiple kernel similarity learning for visual search. IEEE Trans Pattern Anal Mach Intell 1(1)
28. Yan Y, Ricci E, Liu G, Sebe N (2015) Egocentric daily activity recognition via multitask clustering. IEEE Trans Image Process 24(10):2984–2995
29. Yan Y, Ricci E, Subramanian R, Liu G, Lanz O, Sebe N. A multi-task learning framework for head pose estimation under target motion. IEEE Trans Pattern Anal Mach Intell, in press
30. Yan Y, Shen H, Liu G, Ma Z, Gao C, Sebe N (2014) GLocal tells you more: coupling glocal structural for feature selection with sparsity for image and video classification. Comp Vis Image Underst 124(7):99–109

