Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Alfred Kobsa University of California, Irvine, CA, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen TU Dortmund University, Germany Madhu Sudan Microsoft Research, Cambridge, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Gerhard Weikum Max Planck Institute for Informatics, Saarbruecken, Germany
6524
Kuo-Tien Lee Wen-Hsiang Tsai Hong-Yuan Mark Liao Tsuhan Chen Jun-Wei Hsieh Chien-Cheng Tseng (Eds.)
Advances in Multimedia Modeling 17th International Multimedia Modeling Conference, MMM 2011 Taipei, Taiwan, January 5-7, 2011 Proceedings, Part II
Volume Editors

Kuo-Tien Lee, Jun-Wei Hsieh
National Taiwan Ocean University, Keelung, Taiwan
E-mail: {po,shieh}@mail.ntou.edu.tw

Wen-Hsiang Tsai
National Chiao Tung University, Hsinchu, Taiwan
E-mail: [email protected]

Hong-Yuan Mark Liao
Academia Sinica, Taipei, Taiwan
E-mail: [email protected]

Tsuhan Chen
Cornell University, Ithaca, NY, USA
E-mail: [email protected]

Chien-Cheng Tseng
National Kaohsiung First University of Science and Technology, Kaohsiung, Taiwan
E-mail: [email protected]
Library of Congress Control Number: 2010940989
CR Subject Classification (1998): H.5.1, I.5, H.3, H.4, I.4, H.2.8
LNCS Sublibrary: SL 3 – Information Systems and Applications, incl. Internet/Web and HCI
ISSN: 0302-9743
ISBN-10: 3-642-17828-6 Springer Berlin Heidelberg New York
ISBN-13: 978-3-642-17828-3 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2011 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
Welcome to the proceedings of the 17th Multimedia Modeling Conference (MMM 2011), held in Taipei, Taiwan, during January 5–7, 2011. Following the success of the 16 preceding conferences, the 17th MMM brought together researchers, developers, practitioners, and educators in the field of multimedia. Both practical systems and theories were presented at this conference, thanks to the support of Microsoft Research Asia, the Industrial Technology Research Institute, the Institute for Information Industry, the National Museum of Natural Science, and the Image Processing and Pattern Recognition Society of Taiwan.

MMM 2011 featured a comprehensive program including keynote speeches, regular paper presentations, posters, and special sessions. We received 450 papers in total. Among these submissions, we accepted 75 oral presentations and 21 poster presentations. Six special sessions were organized by world-leading researchers. We sincerely acknowledge the Program Committee members, who contributed much of their precious time during the paper reviewing process.

We would also like to sincerely thank our strong Organizing Committee and Advisory Committee for their support. Special thanks go to Jun-Wei Hsieh, Tun-Wen Pai, Shyi-Chyi Cheng, Hui-Huang Hsu, Tao Mei, Meng Wang, Chih-Min Chao, Chun-Chao Yeh, Shu-Hsin Liao, and Chin-Chun Chang. This conference would never have happened without their help.
January 2011
Wen-Hsiang Tsai Mark Liao Kuo-Tien Lee
Organization
MMM 2011 was hosted and organized by the Department of Computer Science and Engineering, National Taiwan Ocean University, Taiwan. The conference was held at the National Taiwan Science Education Center, Taipei, during January 5–7, 2011.
Conference Committee Steering Committee
Conference Co-chairs
Program Co-chairs
Special Session Co-chairs Demo Co-chair Local Organizing Co-chairs
Publication Chair Publicity Chair
Yi-Ping Phoebe Chen (La Trobe University, Australia) Tat-Seng Chua (National University of Singapore, Singapore) Tosiyasu L. Kunii (University of Tokyo, Japan) Wei-Ying Ma (Microsoft Research Asia, China) Nadia Magnenat-Thalmann (University of Geneva, Switzerland) Patrick Senac (ENSICA, France) Kuo-Tien Lee (National Taiwan Ocean University, Taiwan) Wen-Hsiang Tsai (National Chiao Tung University, Taiwan) Hong-Yuan Mark Liao (Academia Sinica, Taiwan) Tsuhan Chen (Cornell University, USA) Jun-Wei Hsieh (National Taiwan Ocean University, Taiwan) Chien-Cheng Tseng (National Kaohsiung First University of Science and Technology, Taiwan) Hui-Huang Hsu (Tamkang University, Taiwan) Tao Mei (Microsoft Research Asia, China) Meng Wang (Microsoft Research Asia, China) Tun-Wen Pai (National Taiwan Ocean University, Taiwan) Shyi-Chyi Cheng (National Taiwan Ocean University, Taiwan) Chih-Min Chao (National Taiwan Ocean University, Taiwan) Shu-Hsin Liao (National Taiwan Ocean University, Taiwan)
US Liaison: Qi Tian (University of Texas at San Antonio, USA)
Asian Liaison: Tat-Seng Chua (National University of Singapore, Singapore)
European Liaison: Susanne Boll (University of Oldenburg, Germany)
Webmaster: Chun-Chao Yeh (National Taiwan Ocean University, Taiwan)
Program Committee Allan Hanbury Andreas Henrich Bernard Merialdo Brigitte Kerherve Cathal Gurrin Cees Snoek Cha Zhang Chabane Djeraba Changhu Wang Changsheng Xu Chia-Wen Lin Chong-Wah Ngo Christian Timmerer Colum Foley Daniel Thalmann David Vallet Duy-Dinh Le Fernando Pereira Francisco Jose Silva Mata Georg Thallinger Guntur Ravindra Guo-Jun Qi Harald Kosch Hui-Huang Hsu Jen-Chin Jiang Jia-hung Ye Jianmin Li Jianping Fan Jiebo Luo Jing-Ming Guo Jinhui Tang
Vienna University of Technology, Austria University of Bamberg, Germany EURECOM, France University of Quebec, Canada Dublin City University, Ireland University of Amsterdam, The Netherlands Microsoft Research University of Sciences and Technologies of Lille, France University of Science and Technology of China NLPR, Chinese Academy of Science, China National Tsing Hua University, Taiwan City University of Hong Kong, Hong Kong University of Klagenfurt, Austria Dublin City University, Ireland EPFL, Swiss Universidad Aut´ onoma de Madrid, Spain National Institute of Informatics, Japan Technical University of Lisbon, Portugal Centro de Aplicaciones de Tecnologias de Avanzada, Cuba Joanneum Research, Austria Applied Research & Technology Center, Motorola, Bangalore University of Science and Technology of China Passau University, Germany Tamkang University, Taiwan National Dong Hwa University, Taiwan National Sun Yat-sen University, Taiwan Tsinghua University, China University of North Carolina, USA Kodak Research, USA National Taiwan University of Science and Technology, Taiwan University of Science and Technology of China
Jinjun Wang Jiro Katto Joemon Jose Jonathon Hare Joo Hwee Lim Jose Martinez Keiji Yanai Koichi Shinoda Lap-Pui Chau Laura Hollink Laurent Amsaleg Lekha Chaisorn Liang-Tien Chia Marcel Worring Marco Bertini Marco Paleari Markus Koskela Masashi Inoue Matthew Cooper Matthias Rauterberg Michael Lew Michel Crucianu Michel Kieffer Ming-Huei Jin Mohan Kankanhalli Neil O’Hare Nicholas Evans Noel O’Connor Nouha Bouteldja Ola Stockfelt Paul Ferguson Qi Tian Raphael Troncy Roger Zimmermann Selim Balcisoy Sengamedu Srinivasan Seon Ho Kim Shen-wen Shr Shingo Uchihashi Shin’ichi Satoh
NEC Laboratories America, Inc., USA Waseda University, Japan University of Glasgow, UK University of Southampton, UK Institute for Infocomm Research, Singapore UAM, Spain University of Electro-Communications, Japan Tokyo Institute of Technology, Japan Nanyang Technological University, Singapore Vrije Universiteit Amsterdam, The Netherlands CNRS-IRISA, France Institute for Infocomm Research, Singapore Nanyang Technological University, Singapore University of Amsterdam, The Netherlands University of Florence, Italy EURECOM, France Helsinki University of Technology, Finland Yamagata University, Japan FX Palo Alto Lab, Inc., Germany Technical University Eindhoven, The Netherlands Leiden University, The Netherlands Conservatoire National des Arts et M´etiers, France Laboratoire des Signaux et Syst`emes, CNRS-Sup´elec, France Institute for Information Industry, Taiwan National University of Singapore Dublin City University, Ireland EURECOM, France Dublin City University, Ireland Conservatoire National des Arts et M´etiers, France Gothenburg University, Sweden Dublin City University, Ireland University of Texas at San Antonio, USA CWI, The Netherlands University of Southern California, USA Sabanci University, Turkey Yahoo! India University of Denver, USA National Chi Nan University, Taiwan Fuji Xerox Co., Ltd., Japan National Institute of Informatics, Japan
Shiuan-Ting Jang Shuicheng Yan Shu-Yuan Chen Sid-Ahmed Berrani Stefano Bocconi Susu Yao Suzanne Little Tao Mei Taro Tezuka Tat-Seng Chua Thierry Pun Tong Zhang Valerie Gouet-Brunet Vincent Charvillat Vincent Oria Wai-tian Tan Wei Cheng Weiqi Yan Weisi Lin Wen-Hung Liau Werner Bailer William Grosky Winston Hsu Wolfgang H¨ urst Xin-Jing Wang Yannick Pri´e Yan-Tao Zheng Yea-Shuan Huang Yiannis Kompatsiaris Yijuan Lu Yongwei Zhu Yun Fu Zha Zhengjun Zheng-Jun Zha Zhongfei Zhang Zhu Li
National Yunlin University of Science and Technology, Taiwan National University of Singapore Yuan Ze University, Taiwan Orange Labs - France Telecom Universit`a degli studi di Torino, Italy Institute for Infocomm Research, Singapore Open University, UK Microsoft Research Asia, China Ritsumeikan University, Japan National University of Singapore University of Geneva, Switzerland HP Labs Conservatoire National des Arts et Metiers, France University of Toulouse, France NJIT, USA Hewlett-Packard, USA University of Michigan, USA Queen’s University Belfast, UK Nanyang Technological University, Singapore National Chengchi University, Taiwan Joanneum Research, Austria University of Michigan, USA National Taiwan University, Taiwan Utrecht University, The Netherlands Microsoft Research Asia, China LIRIS, France National University of Singapore, Singapore Chung-Hua University, Taiwan Informatics and Telematics Institute Centre for Research and Technology Hellas, Greece Texas State University, USA Institute for Infocomm Research Asia, Singapore University at Buffalo (SUNY), USA National University of Singapore, Singapore National University of Singapore, Singapore State University of New York at Binghamton, USA Hong Kong Polytechnic University, Hong Kong
Sponsors Microsoft Research Industrial Technology Research Institute Institute For Information Industry National Taiwan Science Education Center National Taiwan Ocean University Bureau of Foreign Trade National Science Council
Table of Contents – Part II
Special Session Papers Content Analysis for Human-Centered Multimedia Applications Generative Group Activity Analysis with Quaternion Descriptor . . . . . . . Guangyu Zhu, Shuicheng Yan, Tony X. Han, and Changsheng Xu
1
Grid-Based Retargeting with Transformation Consistency Smoothing . . . Bing Li, Ling-Yu Duan, Jinqiao Wang, Jie Chen, Rongrong Ji, and Wen Gao
12
Understanding Video Sequences through Super-Resolution . . . . . . . . . . . . Yu Peng, Jesse S. Jin, Suhuai Luo, and Mira Park
25
Facial Expression Recognition on Hexagonal Structure Using LBP-Based Histogram Variances . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lin Wang, Xiangjian He, Ruo Du, Wenjing Jia, Qiang Wu, and Wei-chang Yeh
35
Mining Social Relationship from Media Collections Towards More Precise Social Image-Tag Alignment . . . . . . . . . . . . . . . . . . . Ning Zhou, Jinye Peng, Xiaoyi Feng, and Jianping Fan Social Community Detection from Photo Collections Using Bayesian Overlapping Subspace Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peng Wu, Qiang Fu, and Feng Tang Dynamic Estimation of Family Relations from Photos . . . . . . . . . . . . . . . . Tong Zhang, Hui Chao, and Dan Tretter
46
57 65
Large Scale Rich Media Data Management Semi-automatic Flickr Group Suggestion . . . . . . . . . . . . . . . . . . . . . . . . . . . . Junjie Cai, Zheng-Jun Zha, Qi Tian, and Zengfu Wang A Visualized Communication System Using Cross-Media Semantic Association . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xinming Zhang, Yang Liu, Chao Liang, and Changsheng Xu
77
88
Effective Large Scale Text Retrieval via Learning Risk-Minimization and Dependency-Embedded Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sheng Gao and Haizhou Li
99
Efficient Large-Scale Image Data Set Exploration: Visual Concept Network and Image Summarization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chunlei Yang, Xiaoyi Feng, Jinye Peng, and Jianping Fan
111
Multimedia Understanding for Consumer Electronics A Study in User-Centered Design and Evaluation of Mental Tasks for BCI . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Danny Plass-Oude Bos, Mannes Poel, and Anton Nijholt Video CooKing: Towards the Synthesis of Multimedia Cooking Recipes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Keisuke Doman, Cheng Ying Kuai, Tomokazu Takahashi, Ichiro Ide, and Hiroshi Murase
122
135
Snap2Read: Automatic Magazine Capturing and Analysis for Adaptive Mobile Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yu-Ming Hsu, Yen-Liang Lin, Winston H. Hsu, and Brian Wang
146
Multimodal Interaction Concepts for Mobile Augmented Reality Applications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wolfgang H¨ urst and Casper van Wezel
157
Image Object Recognition and Compression Morphology-Based Shape Adaptive Compression . . . . . . . . . . . . . . . . . . . . . Jian-Jiun Ding, Pao-Yen Lin, Jiun-De Huang, Tzu-Heng Lee, and Hsin-Hui Chen People Tracking in a Building Using Color Histogram Classifiers and Gaussian Weighted Individual Separation Approaches . . . . . . . . . . . . . . . . Che-Hung Lin, Sheng-Luen Chung, and Jing-Ming Guo Human-Centered Fingertip Mandarin Input System Using Single Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chih-Chang Yu, Hsu-Yung Cheng, Bor-Shenn Jeng, Chien-Cheng Lee, and Wei-Tyng Hong Automatic Container Code Recognition Using Compressed Sensing Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chien-Cheng Tseng and Su-Ling Lee
168
177
187
196
Combining Histograms of Oriented Gradients with Global Feature for Human Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shih-Shinh Huang, Hsin-Ming Tsai, Pei-Yung Hsiao, Meng-Qui Tu, and Er-Liang Jian
208
Interactive Image and Video Search Video Browsing Using Object Trajectories . . . . . . . . . . . . . . . . . . . . . . . . . . Felix Lee and Werner Bailer Size Matters! How Thumbnail Number, Size, and Motion Influence Mobile Video Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wolfgang H¨ urst, Cees G.M. Snoek, Willem-Jan Spoel, and Mate Tomin An Information Foraging Theory Based User Study of an Adaptive User Interaction Framework for Content-Based Image Retrieval . . . . . . . . Haiming Liu, Paul Mulholland, Dawei Song, Victoria Uren, and Stefan R¨ uger
219
230
241
Poster Session Papers Generalized Zigzag Scanning Algorithm for Non-square Blocks . . . . . . . . . Jian-Jiun Ding, Pao-Yen Lin, and Hsin-Hui Chen
252
The Interaction Ontology Model: Supporting the Virtual Director Orchestrating Real-Time Group Interaction . . . . . . . . . . . . . . . . . . . . . . . . . Rene Kaiser, Claudia Wagner, Martin Hoeffernig, and Harald Mayer
263
CLUENET: Enabling Automatic Video Aggregation in Social Media Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhuhua Liao, Jing Yang, Chuan Fu, and Guoqing Zhang
274
Pedestrian Tracking Based on Hidden-Latent Temporal Markov Chain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peng Zhang, Sabu Emmanuel, and Mohan Kankanhalli
285
Motion Analysis via Feature Point Tracking Technology . . . . . . . . . . . . . . . Yu-Shin Lin, Shih-Ming Chang, Joseph C. Tsai, Timothy K. Shih, and Hui-Huang Hsu Traffic Monitoring and Event Analysis at Intersection Based on Integrated Multi-video and Petri Net Process . . . . . . . . . . . . . . . . . . . . . . . . Chang-Lung Tsai and Shih-Chao Tai Baseball Event Semantic Exploring System Using HMM . . . . . . . . . . . . . . Wei-Chin Tsai, Hua-Tsung Chen, Hui-Zhen Gu, Suh-Yin Lee, and Jen-Yu Yu
296
304 315
Robust Face Recognition under Different Facial Expressions, Illumination Variations and Partial Occlusions . . . . . . . . . . . . . . . . . . . . . . . Shih-Ming Huang and Jar-Ferr Yang Localization and Recognition of the Scoreboard in Sports Video Based on SIFT Point Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jinlin Guo, Cathal Gurrin, Songyang Lao, Colum Foley, and Alan F. Smeaton 3D Model Search Using Stochastic Attributed Relational Tree Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Naoto Nakamura, Shigeru Takano, and Yoshihiro Okada A Novel Horror Scene Detection Scheme on Revised Multiple Instance Learning Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bin Wu, Xinghao Jiang, Tanfeng Sun, Shanfeng Zhang, Xiqing Chu, Chuxiong Shen, and Jingwen Fan
326
337
348
359
Randomly Projected KD-Trees with Distance Metric Learning for Image Retrieval . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pengcheng Wu, Steven C.H. Hoi, Duc Dung Nguyen, and Ying He
371
A SAQD-Domain Source Model Unified Rate Control Algorithm for H.264 Video Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mingjing Ai and Lili Zhao
383
A Bi-objective Optimization Model for Interactive Face Retrieval . . . . . . Yuchun Fang, Qiyun Cai, Jie Luo, Wang Dai, and Chengsheng Lou
393
Multi-symbology and Multiple 1D/2D Barcodes Extraction Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daw-Tung Lin and Chin-Lin Lin
401
Wikipedia Based News Video Topic Modeling for Information Extraction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sujoy Roy, Mun-Thye Mak, and Kong Wah Wan
411
Advertisement Image Recognition for a Location-Based Reminder System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Siying Liu, Yiqun Li, Aiyuan Guo, and Joo Hwee Lim
421
Flow of Qi: System of Real-Time Multimedia Interactive Application of Calligraphy Controlled by Breathing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kuang-I Chang, Mu-Yu Tsai, Yu-Jen Su, Jyun-Long Chen, and Shu-Min Wu Measuring Bitrate and Quality Trade-Off in a Fast Region-of-Interest Based Video Coding . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Salahuddin Azad, Wei Song, and Dian Tjondronegoro
432
442
Image Annotation with Concept Level Feature Using PLSA+CCA . . . . . Yu Zheng, Tetsuya Takiguchi, and Yasuo Ariki Multi-actor Emotion Recognition in Movies Using a Bimodal Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ruchir Srivastava, Sujoy Roy, Shuicheng Yan, and Terence Sim
454
465
Demo Session Papers RoboGene: An Image Retrieval System with Multi-Level Log-Based Relevance Feedback Scheme . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Huanchen Zhang, Haojie Li, Shichao Dong, and Weifeng Sun Query Difficulty Guided Image Retrieval System . . . . . . . . . . . . . . . . . . . . . Yangxi Li, Yong Luo, Dacheng Tao, and Chao Xu HeartPlayer: A Smart Music Player Involving Emotion Recognition, Expression and Recommendation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Songchun Fan, Cheng Tan, Xin Fan, Han Su, and Jinyu Zhang Immersive Video Conferencing Architecture Using Game Engine Technology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chris Poppe, Charles-Frederik Hollemeersch, Sarah De Bruyne, Peter Lambert, and Rik Van de Walle Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
476 479
483
486
489
Table of Contents – Part I
Regular Papers Audio, Image, Video Processing, Coding and Compression A Generalized Coding Artifacts and Noise Removal Algorithm for Digitally Compressed Video Signals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ling Shao, Hui Zhang, and Yan Liu
1
Efficient Mode Selection with BMA Based Pre-processing Algorithms for H.264/AVC Fast Intra Mode Decision . . . . . . . . . . . . . . . . . . . . . . . . . . . Chen-Hsien Miao and Chih-Peng Fan
10
Perceptual Motivated Coding Strategy for Quality Consistency . . . . . . . . Like Yu, Feng Dai, Yongdong Zhang, and Shouxun Lin Compressed-Domain Shot Boundary Detection for H.264/AVC Using Intra Partitioning Maps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sarah De Bruyne, Jan De Cock, Chris Poppe, Charles-Frederik Hollemeersch, Peter Lambert, and Rik Van de Walle
21
29
Adaptive Orthogonal Transform for Motion Compensation Residual in Video Compression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zhouye Gu, Weisi Lin, Bu-sung Lee, and Chiew Tong Lau
40
Parallel Deblocking Filter for H.264/AVC on the TILERA Many-Core Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chenggang Yan, Feng Dai, and Yongdong Zhang
51
Image Distortion Estimation by Hash Comparison . . . . . . . . . . . . . . . . . . . Li Weng and Bart Preneel
62
Media Content Browsing and Retrieval Sewing Photos: Smooth Transition between Photos . . . . . . . . . . . . . . . . . . . Tzu-Hao Kuo, Chun-Yu Tsai, Kai-Yin Cheng, and Bing-Yu Chen
73
Employing Aesthetic Principles for Automatic Photo Book Layout . . . . . Philipp Sandhaus, Mohammad Rabbath, and Susanne Boll
84
Video Event Retrieval from a Small Number of Examples Using Rough Set Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kimiaki Shirahama, Yuta Matsuoka, and Kuniaki Uehara
96
Community Discovery from Movie and Its Application to Poster Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yan Wang, Tao Mei, and Xian-Sheng Hua
107
A BOVW Based Query Generative Model . . . . . . . . . . . . . . . . . . . . . . . . . . . Reede Ren, John Collomosse, and Joemon Jose
118
Video Sequence Identification in TV Broadcasts . . . . . . . . . . . . . . . . . . . . . Klaus Schoeffmann and Laszlo Boeszoermenyi
129
Content-Based Multimedia Retrieval in the Presence of Unknown User Preferences . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christian Beecks, Ira Assent, and Thomas Seidl
140
Multi-Camera, Multi-View, and 3D Systems People Localization in a Camera Network Combining Background Subtraction and Scene-Aware Human Detection . . . . . . . . . . . . . . . . . . . . . Tung-Ying Lee, Tsung-Yu Lin, Szu-Hao Huang, Shang-Hong Lai, and Shang-Chih Hung
151
A Novel Depth-Image Based View Synthesis Scheme for Multiview and 3DTV . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xun He, Xin Jin, Minghui Wang, and Satoshi Goto
161
Egocentric View Transition for Video Monitoring in a Distributed Camera Network . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kuan-Wen Chen, Pei-Jyun Lee, and Yi-Ping Hung
171
A Multiple Camera System with Real-Time Volume Reconstruction for Articulated Skeleton Pose Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Zheng Zhang, Hock Soon Seah, Chee Kwang Quah, Alex Ong, and Khalid Jabbar A New Two-Omni-Camera System with a Console Table for Versatile 3D Vision Applications and Its Automatic Adaptation to Imprecise Camera Setups . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shen-En Shih and Wen-Hsiang Tsai 3D Face Recognition Based on Local Shape Patterns and Sparse Representation Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Di Huang, Karima Ouji, Mohsen Ardabilian, Yunhong Wang, and Liming Chen An Effective Approach to Pose Invariant 3D Face Recognition . . . . . . . . . Dayong Wang, Steven C.H. Hoi, and Ying He
182
193
206
217
Multimedia Indexing and Mining Score Following and Retrieval Based on Chroma and Octave Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wei-Ta Chu and Meng-Luen Li
229
Incremental Multiple Classifier Active Learning for Concept Indexing in Images and Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bahjat Safadi, Yubing Tong, and Georges Qu´enot
240
A Semantic Higher-Level Visual Representation for Object Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ismail El Sayad, Jean Martinet, Thierry Urruty, and Chabane Dejraba Mining Travel Patterns from GPS-Tagged Photos . . . . . . . . . . . . . . . . . . . . Yan-Tao Zheng, Yiqun Li, Zheng-Jun Zha, and Tat-Seng Chua Augmenting Image Processing with Social Tag Mining for Landmark Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Amogh Mahapatra, Xin Wan, Yonghong Tian, and Jaideep Srivastava News Shot Cloud: Ranking TV News Shots by Cross TV-Channel Filtering for Efficient Browsing of Large-Scale News Video Archives . . . . Norio Katayama, Hiroshi Mo, and Shin’ichi Satoh
251
262
273
284
Multimedia Content Analysis (I) Speaker Change Detection Using Variable Segments for Video Indexing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . King Yiu Tam, Jose Lay, and David Levy Correlated PLSA for Image Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Peng Li, Jian Cheng, Zechao Li, and Hanqing Lu Genre Classification and the Invariance of MFCC Features to Key and Tempo . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tom L.H. Li and Antoni B. Chan Combination of Local and Global Features for Near-Duplicate Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yue Wang, ZuJun Hou, Karianto Leman, Nam Trung Pham, TeckWee Chua, and Richard Chang Audio Tag Annotation and Retrieval Using Tag Count Information . . . . . Hung-Yi Lo, Shou-De Lin, and Hsin-Min Wang
296 307
317
328
339
Similarity Measurement for Animation Movies . . . . . . . . . . . . . . . . . . . . . . . Alexandre Benoit, Madalina Ciobotaru, Patrick Lambert, and Bogdan Ionescu
350
Multimedia Content Analysis (II) A Feature Sequence Kernel for Video Concept Classification . . . . . . . . . . . Werner Bailer
359
Bottom-Up Saliency Detection Model Based on Amplitude Spectrum . . . Yuming Fang, Weisi Lin, Bu-Sung Lee, Chiew Tong Lau, and Chia-Wen Lin
370
L2 -Signature Quadratic Form Distance for Efficient Query Processing in Very Large Multimedia Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Christian Beecks, Merih Seran Uysal, and Thomas Seidl Generating Representative Views of Landmarks via Scenic Theme Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yi-Liang Zhao, Yan-Tao Zheng, Xiangdong Zhou, and Tat-Seng Chua Regularized Semi-supervised Latent Dirichlet Allocation for Visual Concept Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Liansheng Zhuang, Lanbo She, Jingjing Huang, Jiebo Luo, and Nenghai Yu Boosted Scene Categorization Approach by Adjusting Inner Structures and Outer Weights of Weak Classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xueming Qian, Zhe Yan, and Kaiyu Hang A User-Centric System for Home Movie Summarisation . . . . . . . . . . . . . . . Saman H. Cooray, Hyowon Lee, and Noel E. O’Connor
381
392
403
413 424
Multimedia Signal Processing and Communications Image Super-Resolution by Vectorizing Edges . . . . . . . . . . . . . . . . . . . . . . . . Chia-Jung Hung, Chun-Kai Huang, and Bing-Yu Chen
435
Vehicle Counting without Background Modeling . . . . . . . . . . . . . . . . . . . . . Cheng-Chang Lien, Ya-Ting Tsai, Ming-Hsiu Tsai, and Lih-Guong Jang
446
Effective Color-Difference-Based Interpolation Algorithm for CFA Image Demosaicking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yea-Shuan Huang and Sheng-Yi Cheng
457
Utility Max-Min Fair Rate Allocation for Multiuser Multimedia Communications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Qing Zhang, Guizhong Liu, and Fan Li
470
Multimedia Applications Adaptive Model for Robust Pedestrian Counting . . . . . . . . . . . . . . . . . . . . . Jingjing Liu, Jinqiao Wang, and Hanqing Lu
481
Multi Objective Optimization Based Fast Motion Detector . . . . . . . . . . . . Jia Su, Xin Wei, Xiaocong Jin, and Takeshi Ikenaga
492
Narrative Generation by Repurposing Digital Videos . . . . . . . . . . . . . . . . . Nick C. Tang, Hsiao-Rong Tyan, Chiou-Ting Hsu, and Hong-Yuan Mark Liao
503
A Coordinate Transformation System Based on the Human Feature Information . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shih-Ming Chang, Joseph Tsai, Timothy K. Shih, and Hui-Huang Hsu An Effective Illumination Compensation Method for Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yea-Shuan Huang and Chu-Yung Li Shape Stylized Face Caricatures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nguyen Kim Hai Le, Yong Peng Why, and Golam Ashraf i-m-Breath: The Effect of Multimedia Biofeedback on Learning Abdominal Breath . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Meng-Chieh Yu, Jin-Shing Chen, King-Jen Chang, Su-Chu Hsu, Ming-Sui Lee, and Yi-Ping Hung Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
514
525 536
548
559
Generative Group Activity Analysis with Quaternion Descriptor

Guangyu Zhu (1), Shuicheng Yan (1), Tony X. Han (2), and Changsheng Xu (3)

(1) Electrical and Computer Engineering, National University of Singapore, Singapore
(2) Electrical and Computer Engineering, University of Missouri, USA
(3) Institute of Automation, Chinese Academy of Sciences, China
{elezhug,eleyans}@nus.edu.sg, [email protected], [email protected]

Abstract. Activity understanding plays an essential role in video content analysis and remains a challenging open problem. Most previous research is limited by the use of excessively localized features that do not sufficiently encapsulate the interaction context, or focuses on purely discriminative models while ignoring the interaction patterns altogether. In this paper, a new approach is proposed to recognize human group activities. First, we design a new quaternion descriptor that describes the interactive insight of activities in terms of appearance, dynamics, causality and feedback, respectively. The designed descriptor is capable of delineating the individual and pairwise interactions in the activities. Second, considering both activity category and interaction variety, we propose an extended pLSA (probabilistic Latent Semantic Analysis) model with two hidden variables. This extended probabilistic graphical paradigm, constructed on the quaternion descriptors, facilitates the effective inference of activity categories as well as the exploration of activity interaction patterns. Experiments on realistic movie and human activity databases validate that the proposed approach outperforms the state-of-the-art results.

Keywords: Activity analysis, generative modeling, video description.
1 Introduction
Video-based human activity analysis is one of the most promising applications of computer vision and pattern recognition. In [1], Turaga et al. presented a recent survey of the major approaches pursued over the last two decades. A large amount of the existing work on this problem has focused on relatively simple activities of a single person [10,9,5,12,4], e.g., sitting, walking and handwaving, and has achieved particular success. In recent years, recognition of group activities with multiple participants (e.g., fighting and gathering) has gained an increasing amount of interest [19,18,15,17]. Under the definition given in [1], where an activity refers to a complex sequence of actions performed by several objects that may interact with each other, the interactions among the participants reflect the elementary characteristics of different activities. An effective interaction descriptor is therefore essential for developing sophisticated approaches to activity recognition.
Most previous research stems from local representations in image processing. Although the widely used local descriptors have been demonstrated to allow for the recognition of activities in scenes with occlusions and dynamic cluttered backgrounds, they are solely representations of appearance and motion patterns. An effective feature descriptor for activity recognition should have the capacity of describing the video in terms of object appearance, dynamic motion, and interactive properties.

Given an activity descriptor, how to classify the activity category from the corresponding feature representation is another key issue for activity recognition. Two types of approach are widely used: approaches based on generative models [5,4] and approaches based on discriminative models [10,9,19,18,15,12,17]. Considering the mechanism of human perception for group activity, the interactions between objects are first distinguished and then synthesized into the activity recognition result. Although discriminative models have been extensively employed because they are much easier to build, their construction essentially focuses on the differences among the activity classes and ignores the interactive properties involved. Therefore, discriminative models cannot facilitate interaction analysis or reveal the insight of the interactive relations in the activities.

In this paper, we first investigate how to effectively represent video activities in the interaction context. A new feature descriptor, namely the quaternion descriptor, consists of four types of components in terms of the appearance, individual dynamics, pairwise causalities, and feedbacks of the active objects in the video, respectively. The components of the descriptor describe the appearance and motion patterns as well as encode the interaction properties of the activities. Resorting to the bag-of-words method, the video is represented as a compact bag-of-quaternion feature vector. To recognize the activity category and facilitate the exploration of interaction patterns, we then propose to model and classify the activities in a generative framework based on an extended pLSA model. Interactions are modeled in the generative framework, which is able to explicitly infer the activity patterns.
2 Quaternion Descriptor for Activity Representation
We propose to construct the quaternion descriptor by extracting the trajectory atoms and then modeling the spatio-temporal interaction information within these trajectories.

2.1 Appearance Component of Quaternion

It has been demonstrated that the appearance information encoded in the image frames can provide critical implications about the semantic categories [21,11]. For video activity recognition over a frame sequence, this source of information is also very useful in describing the semantics of activities.
Generative Group Activity Analysis with Quaternion Descriptor
3
In recent years, the well-known SIFT feature [6] has been acknowledged as one of the most powerful appearance descriptors and has achieved overwhelming success in object categorization and recognition. In our approach, the appearance component of the quaternion descriptor is computed as the average of all the SIFT features extracted at the salient points residing on the trajectory. For a motion trajectory with temporal length k, the SIFT average descriptor S is computed from all the SIFT descriptors {S_1, S_2, . . . , S_k} along the trajectory. The essential idea of this appearance representation is two-fold. First, the tracking process ensures that the local image patches on the same trajectory are relatively stable, and therefore the resultant SIFT average descriptor provides a robust representation for a certain aspect of the visual content in the activity footage. Second, the SIFT average descriptor can also partially encode the temporal context information, which contributes to the recognition task [13].
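As an illustration, a minimal sketch of this averaging step is given below; it assumes the 128-D SIFT descriptors of the salient points along one trajectory have already been extracted with any standard SIFT implementation, and the final L2 normalization is an implementation choice rather than a detail prescribed above.

import numpy as np

def appearance_component(sift_descriptors):
    # sift_descriptors: (k, 128) SIFT descriptors sampled along one trajectory
    S = np.asarray(sift_descriptors, dtype=np.float64)
    s_mean = S.mean(axis=0)                               # SIFT average descriptor
    return s_mean / (np.linalg.norm(s_mean) + 1e-12)      # optional L2 normalization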
2.2 Dynamic Component of Quaternion
We propose to calculate the Markov stationary distribution [23] as the dynamic representation in the quaternion. The Markov chain is a powerful tool for modeling the dynamic properties of a system as a compact representation. We consider each trajectory as a dynamic system and extract such a compact representation to measure the spatio-temporal interactions in the activities. Fig. 1 shows the extraction procedure of dynamic component.
Fig. 1. The procedure for dynamic component extraction. (a) Displacement vector quantization; (b) State transition diagram; (c) Occurrence matrix; (d) Markov stationary distribution.
The existing work [20] has demonstrated that a trajectory can be encoded by the Markov stationary distribution π if it can be converted into an ergodic finite-state Markov chain. To facilitate this conversion, a finite number of states are chosen for quantization. Given points P and P′ within two consecutive frames on the same trajectory, the displacement vector of the two points is denoted D (the vector from P to P′). To perform a comprehensive quantization on D, both the magnitude and the orientation are considered, as shown in Fig. 1(a). We translate the sequential relations between the displacement vectors into a directed graph, which is similar to the state diagram of a Markov chain (Fig. 1(b)). Further, we establish the equivalent matrix presentation of the graph and perform row-normalization on the matrix to obtain a valid transition matrix P for a certain Markov chain (Fig. 1(c)). Finally, we use the iterative algorithm in [20] to compute the Markov stationary distribution π (Fig. 1(d)), which is
A_n = (1/(n+1)) (I + P + · · · + P^n) ,    (1)

where I is an identity matrix and n = 100 in our experiments. To further reduce the approximation error from using a finite n, π is calculated as the column average of A_n. More details about the extraction of the Markov stationary distribution can be found in [20].
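The following sketch in Python/NumPy illustrates the dynamic component; the number of orientation bins, the magnitude threshold for the "static" state and the small count prior that keeps the chain ergodic are illustrative assumptions, since the exact quantization granularity is not specified above.

import numpy as np

def dynamic_component(trajectory, n_orient=8, mag_thresh=1.0, n_iter=100):
    # trajectory: (T, 2) array of (x, y) positions of one trajectory atom
    pts = np.asarray(trajectory, dtype=np.float64)
    disp = np.diff(pts, axis=0)                           # displacement vectors D
    mag = np.linalg.norm(disp, axis=1)
    ang = np.arctan2(disp[:, 1], disp[:, 0])              # orientation in [-pi, pi]
    bins = np.floor((ang + np.pi) / (2 * np.pi) * n_orient).astype(int) % n_orient
    states = np.where(mag < mag_thresh, 0, 1 + bins)      # state 0 = (nearly) static
    n_states = 1 + n_orient
    C = np.full((n_states, n_states), 1e-6)               # occurrence matrix with a tiny
    for s, t in zip(states[:-1], states[1:]):             # prior that keeps P ergodic
        C[s, t] += 1.0
    P = C / C.sum(axis=1, keepdims=True)                  # row-normalized transition matrix
    A = np.eye(n_states)                                  # A_n = (1/(n+1))(I + P + ... + P^n)
    Pk = np.eye(n_states)
    for _ in range(n_iter):
        Pk = Pk @ P
        A += Pk
    A /= (n_iter + 1)
    return A.mean(axis=0)                                 # pi = column average of A_n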
2.3 Causality and Feedback Components
To describe the causality and feedback properties, we propose a representation scheme based on the Granger causality test (GCT) [8] and the time-to-frequency transform [24]. Given a concurrent motion trajectory pair T_a = [T_a(1), . . . , T_a(n), . . .] and T_b = [T_b(1), . . . , T_b(n), . . .], we assume that the interaction between the two trajectories is a stationary process, i.e., the prediction functions P(T_a(n) | T_a(1 : n−l), T_b(1 : n−l)) and P(T_b(n) | T_a(1 : n−l), T_b(1 : n−l)) do not change within a short time period, where T_a(1 : n−l) = [T_a(1), . . . , T_a(n−l)] and similarly for T_b(1 : n−l), and l is a time lag that avoids the overfitting issue in prediction. To model P(T_a(n) | T_a(1 : n−l), T_b(1 : n−l)), we can use a kth-order linear predictor:

T_a(n) = Σ_{i=1}^{k} [ β(i) T_a(n−i−l) + γ(i) T_b(n−i−l) ] + ε_a(n) ,    (2)

where β(i) and γ(i) are the regression coefficients and ε_a(n) is Gaussian noise with standard deviation σ(T_a(n) | T_a(1 : n−l), T_b(1 : n−l)). We use the same form of linear predictor to model P(T_a(n) | T_a(1 : n−l)), and the standard deviation of its noise signal is denoted as σ(T_a(n) | T_a(1 : n−l)). According to GCT theory, we can calculate two measurements, namely the causality ratio r_c,

r_c = σ(T_a(n) | T_a(1 : n−l)) / σ(T_a(n) | T_a(1 : n−l), T_b(1 : n−l)) ,    (3)

which measures the relative strength of the causality, and the feedback ratio r_f,

r_f = σ(T_b(n) | T_b(1 : n−l)) / σ(T_b(n) | T_a(1 : n−l), T_b(1 : n−l)) ,    (4)
which measures the relative strength of the feedback. We then calculate the z-transforms for both sides of Eq. (2). Afterwards, the magnitudes and the phases of the z-transform function at a set of evenly sampled frequencies are employed to describe the digital filter for the style of the pairwise causality/feedback. In our approach, we employ the magnitudes of the frequency response at {0, π/4, π/2, 3π/4, π} and the phases of the frequency response at {π/4, π/2, 3π/4} to form the feature vector fba . Similarly, we can define the feature vector fab by considering Ta as the input and Tb as the output of the digital filter, which characterizes how the object with the trajectory Ta affects the motion of the object with the trajectory Tb . The causality ratio and feedback ratio characterize the strength of one object affecting another one, while the extracted frequency response fab and fba convey
how one object affects another one. These mutually complementary features are hence combined to form the causality and feedback components of quaternion descriptor in the pairwise interaction context.
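A compact sketch of the causality ratio r_c and feedback ratio r_f of Eqs. (3) and (4) is given below, operating on one coordinate of a trajectory pair (a 2D trajectory would be handled per coordinate); the predictor order k, the lag l and the plain least-squares fit are illustrative choices, and a full implementation would also derive the frequency-response features f_ab and f_ba from the fitted coefficients.

import numpy as np

def _residual_std(y, regressors):
    # std of the residual of a least-squares linear prediction of y
    X = np.column_stack(regressors + [np.ones(len(y))])
    coef, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
    return float(np.std(y - X @ coef))

def causality_feedback_ratios(Ta, Tb, k=3, l=1):
    # Ta, Tb: equal-length 1-D signals (e.g., one coordinate of a trajectory pair)
    Ta, Tb = np.asarray(Ta, float), np.asarray(Tb, float)
    n0 = k + l                                            # first index with a full history
    ya, yb = Ta[n0:], Tb[n0:]
    lag_a = [Ta[n0 - i - l: len(Ta) - i - l] for i in range(1, k + 1)]
    lag_b = [Tb[n0 - i - l: len(Tb) - i - l] for i in range(1, k + 1)]
    r_c = _residual_std(ya, lag_a) / (_residual_std(ya, lag_a + lag_b) + 1e-12)   # Eq. (3)
    r_f = _residual_std(yb, lag_b) / (_residual_std(yb, lag_a + lag_b) + 1e-12)   # Eq. (4)
    return r_c, r_f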
3 Generative Activity Analysis
Given a collection of unlabeled video sequences, we would like to discover a set of classes from them. Each of these classes would correspond to an activity category. Additionally, we would like to be able to understand activities that are composed of a mixture of interaction varieties. This resembles the problem of automatic topic discovery, which can be addressed by latent topic analysis. In the following, we introduce a new generative method based on pLSA modeling [22], which is able to both infer the activity categories and discover the interaction patterns.

3.1 Generative Activity Modeling
Fig. 2 shows the extended pLSA graphical model, which is employed with the consideration of both activity category and interaction variety. Compared with the traditional philosophy, the interaction distribution is modeled in our method and integrated into the graphical framework as a new hidden variable.
Fig. 2. The extended pLSA model with two hidden variables. Nodes are random variables. Shaded ones are observed and unshaded ones are unobserved (hidden). The plates indicate repetitions. d represents video sequence, z is the activity category, r is the interaction variety and w is the activity representation bag-of-word. The parameters of this model are learnt in an unsupervised manner using an improved EM algorithm.
Suppose we have a set of M (j = 1, . . . , M) video sequences containing the bags-of-words of interaction representations quantized from the vocabulary of size V (i = 1, . . . , V). The corpus of videos is summarized in a V-by-M co-occurrence table M, where m(w_i, d_j) is the number of occurrences of a word w_i ∈ W = {w_1, . . . , w_V} in video d_j ∈ D = {d_1, . . . , d_M}. In addition, there are two latent topic variables z ∈ Z = {z_1, . . . , z_K} and r ∈ R = {r_1, . . . , r_R}, which represent the activity category and the interaction variety residing in a certain activity. The variable r_t is sequentially associated with each occurrence of a word w_i in video d_j. Extending the traditional pLSA model, the joint probability P(w, d, z, r) which translates the inference process in Fig. 2 is expressed as follows:

P(d, w) = P(d) Σ_{z∈Z} Σ_{r∈R} P(z|d) P(r|z) P(w|r) .    (5)
It is worth noticing that an equivalent symmetric version of the model can be obtained by inverting the conditional probability P(z|d) with the help of Bayes' rule, which results in

P(d, w) = Σ_{z∈Z} Σ_{r∈R} P(z) P(d|z) P(r|z) P(w|r) .    (6)
The standard procedure for maximum likelihood estimation in latent variable models is Expectation Maximization (EM). For the proposed pLSA model in the symmetric parametrization, Bayes' rule yields the E-step as

P(z, r|d, w) = P(z) P(d|z) P(r|z) P(w|r) / Σ_{z′} Σ_{r′} P(z′) P(d|z′) P(r′|z′) P(w|r′) ,    (7)

P(z|d, w) = Σ_{r′} P(z, r′|d, w) ,    P(r|d, w) = Σ_{z′} P(z′, r|d, w) .    (8)
By standard calculations, one arrives at the following M-step re-estimation equations:

P(d|z) = Σ_w m(d, w) P(z|d, w) / Σ_{d′,w} m(d′, w) P(z|d′, w) ,    P(w|r) = Σ_d m(d, w) P(r|d, w) / Σ_{d,w′} m(d, w′) P(r|d, w′) ,    (9)

P(z) = (1/R) Σ_{d,w} m(d, w) P(z|d, w) ,    P(z, r) = (1/R) Σ_{d,w} m(d, w) P(z, r|d, w) ,    (10)

P(r|z) = P(z, r) / P(z) ,    R ≡ Σ_{d,w} m(d, w) .    (11)
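For concreteness, a dense NumPy sketch of the EM updates of Eqs. (7)–(11) is given below; the random initialization, the smoothing constants and the fixed iteration count are assumptions, and the (D, V, K, R) posterior tensor is kept explicit for clarity rather than efficiency.

import numpy as np

def em_extended_plsa(counts, K, R, n_iter=50, seed=0):
    # counts: (D, V) co-occurrence table m(d, w); K activity topics, R interaction patterns
    rng = np.random.default_rng(seed)
    D, V = counts.shape
    norm = lambda x, ax: x / (x.sum(axis=ax, keepdims=True) + 1e-12)
    Pz = np.full(K, 1.0 / K)                              # P(z)
    Pd_z = norm(rng.random((D, K)), 0)                    # P(d|z)
    Pr_z = norm(rng.random((R, K)), 0)                    # P(r|z)
    Pw_r = norm(rng.random((V, R)), 0)                    # P(w|r)
    total = counts.sum()                                  # the normalizer of Eq. (11)
    for _ in range(n_iter):
        # E-step, Eq. (7): joint posterior P(z, r | d, w), shape (D, V, K, R)
        joint = (Pz[None, None, :, None] * Pd_z[:, None, :, None] *
                 Pr_z.T[None, None, :, :] * Pw_r[None, :, None, :])
        joint /= joint.sum(axis=(2, 3), keepdims=True) + 1e-12
        Pz_dw = joint.sum(axis=3)                         # P(z|d,w), Eq. (8)
        Pr_dw = joint.sum(axis=2)                         # P(r|d,w), Eq. (8)
        # M-step, Eqs. (9)-(11)
        Pd_z = norm(np.einsum('dv,dvk->dk', counts, Pz_dw), 0)
        Pw_r = norm(np.einsum('dv,dvr->vr', counts, Pr_dw), 0)
        Pz = np.einsum('dv,dvk->k', counts, Pz_dw) / total
        Pzr = np.einsum('dv,dvkr->kr', counts, joint) / total
        Pr_z = (Pzr / (Pz[:, None] + 1e-12)).T            # P(r|z) = P(z,r)/P(z)
    return Pz, Pd_z, Pr_z, Pw_r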
3.2 Generative Activity Recognition
Given that our algorithm has learnt the activity category models using the extended pLSA, our goal is to categorize new video sequences. We have obtained the activity-category-specific interaction distribution P(r|z) and the interaction-pattern-specific video-word distribution P(w|r) from a different set of training sequences at the learning stage. When given a new video clip, the unseen video is "projected" onto the simplex spanned by the learnt P(r|z) and P(w|r). We need to find the mixing coefficients P(z_k|d_test) such that the Kullback-Leibler divergence between the measured empirical word distribution of d_test and P(w|d_test) = Σ_{k=1}^{K} P(z_k|d_test) P(r|z_k) P(w|r) is minimized [22]. Similar to the learning scenario, we apply the EM algorithm to find the solution. The sole difference between recognition and learning is that the learnt P(r|z) and P(w|r) are never updated during inference. Thus, a categorization decision is made by selecting the activity category that best explains the observation, that is

Activity Category = arg max_k P(z_k|d_test) .    (12)
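A sketch of this fold-in step, reusing the distributions produced by the EM sketch in Sect. 3.1, could look as follows; test_hist is the bag-of-quaternion histogram of the unseen clip, and the fixed iteration count is an assumption.

import numpy as np

def classify(test_hist, Pr_z, Pw_r, n_iter=30):
    # test_hist: (V,) bag-of-quaternion histogram of the unseen clip
    Pw_z = Pw_r @ Pr_z                                    # P(w|z) = sum_r P(w|r) P(r|z)
    K = Pw_z.shape[1]
    Pz_d = np.full(K, 1.0 / K)                            # mixing coefficients P(z|d_test)
    for _ in range(n_iter):                               # fold-in EM: only P(z|d_test) updated
        post = Pw_z * Pz_d[None, :]
        post /= post.sum(axis=1, keepdims=True) + 1e-12   # P(z | d_test, w) per word
        Pz_d = post.T @ test_hist
        Pz_d /= Pz_d.sum() + 1e-12
    return int(np.argmax(Pz_d))                           # activity category, Eq. (12)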
3.3 Interaction Pattern Exploration and Discovery Based on Generative Model
Generative models facilitate the inference of the dependence among the different distributions in the recognition flow. In the extended pLSA paradigm, the distribution of the interaction patterns is modeled as a hidden variable r bridging the category topic z and the visual-word observation w, explicitly encoded by P(r|z) and P(w|r), respectively. Two tasks can be achieved by investigating one of these distributions, P(w|r), namely interaction amount discovery and pattern exploration. The aim of the discovery is to infer the optimal amount of interaction patterns in the activities. The strategy is to traverse the sampled amounts of interaction patterns and observe the corresponding recognition performance. We define K = {1, . . . , R} as the candidate set of interaction pattern amounts. Given an interaction pattern amount k ∈ K, the corresponding extended pLSA model is learnt and the recognition performance is denoted as m_k. Therefore, we can obtain the optimal interaction pattern amount OptInterNo in the underlying activities as

OptInterNo = arg max_k {m_k} .    (13)
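The selection of Eq. (13) amounts to a simple search over candidate pattern amounts; the sketch below assumes hypothetical helpers (em_extended_plsa from the earlier sketch, and an evaluate function returning, e.g., average recognition accuracy on held-out data), and train_counts, test_counts, test_labels and num_classes are placeholders.

import numpy as np

candidate_amounts = [8, 16, 32, 64, 128, 256]             # sampled interaction pattern amounts
scores = []
for R in candidate_amounts:
    model = em_extended_plsa(train_counts, K=num_classes, R=R)   # sketch from Sect. 3.1
    scores.append(evaluate(model, test_counts, test_labels))     # hypothetical helper
opt_inter_no = candidate_amounts[int(np.argmax(scores))]         # Eq. (13)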
4 Experiments
To demonstrate the effectiveness of our approach, we performed thorough experiments on two realistic human activity databases: the HOHA-2 database of movie videos used in [19] and the HGA database of surveillance videos used in [18]. These two databases are chosen for evaluation because they exhibit the difficulties in recognizing realistic human activities with multiple participants, in contrast to the controlled settings of other related databases. The HOHA-2 database is composed of 8 single activities (i.e., AnswerPhone, DriveCar, Eat, GetOutCar, Run, SitDown, SitUp, and StandUp) and 4 group activities (i.e., FightPerson, HandShake, HugPerson, and Kiss), of which the 4 group activities are selected as the evaluation set. The HGA database consists of 6 group activities, all of whose samples are employed for evaluation. A brief summary of the two databases used in the experiments is provided in Table 1. More details about the databases can be found in [19,18]. To facilitate efficient processing, we employ the bag-of-words method to describe the quaternion descriptors in one activity video footage as a compact feature vector, namely the bag-of-quaternion. We construct a visual vocabulary with 3000 words by the K-means method over the sampled quaternion descriptors. Then, each quaternion descriptor is assigned to its closest (in the sense of Euclidean distance) visual word.
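A minimal sketch of this vocabulary construction and assignment step could look as follows; the use of scikit-learn's KMeans and the normalization of the histogram are implementation choices not prescribed above.

import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(sampled_descriptors, vocab_size=3000, seed=0):
    # sampled_descriptors: (N, dim) quaternion descriptors pooled over the training videos
    return KMeans(n_clusters=vocab_size, n_init=4, random_state=seed).fit(sampled_descriptors)

def bag_of_quaternion(video_descriptors, vocab):
    # assign each descriptor to its closest visual word and build a normalized histogram
    words = vocab.predict(video_descriptors)
    hist = np.bincount(words, minlength=vocab.n_clusters).astype(float)
    return hist / (hist.sum() + 1e-12)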
4.1 Recognition Performance on HOHA Database
In the pre-processing of this evaluation, the trajectory atoms in every shot are first generated by salient point matching using SIFT features.

Table 1. A summary of the databases for human group activity recognition

Database             HOHA-2 group subset [19]    HGA [18]
Data source          Movie clips                 Surveillance recorders
# Class category     4                           6
# Training sample    823                         4 of 5 collected sessions
# Testing sample     884                         1 of 5 collected sessions

It has been
demonstrated that this trajectory generation method is effective for motion capture in movie footage [20]. In the experiments, the appearance and dynamic representations are extracted from every salient point trajectory. The salient points residing in the same region may share similar appearance and motion patterns, so extracting the causality and feedback descriptors on the raw trajectory atoms is unnecessary and computation-intensive. For efficiency, we perform spectral clustering based on the normalized cut algorithm [3] on the set of raw trajectory atoms in one video shot. The average trajectory atom, calculated as the representative of the corresponding cluster, is employed as the input for the extraction of the causality and feedback descriptors. To construct the graph for clustering, each node represents a trajectory atom and the similarity matrix W = [e_{i,j}] is formed with the similarities defined as

e_{i,j} = (P_i^T · P_j) · (S_i^T · S_j) ,    (14)
where i and j represent the indices of trajectory atoms in the video shot, P = {(x_1, y_1), (x_2, y_2), . . . , (x_n, y_n)} is the set of spatial positions of a trajectory atom and S = {s_1, s_2, . . . , s_n} is its SIFT descriptor set. To quantitatively evaluate the performance, we calculate the Average Precision (AP) as the evaluation metric. Note that the AP metric is calculated on the whole database for an equitable comparison with the previous work, although we only investigate the 4 group activities. Fig. 3 shows the recognition results obtained by using different types of interaction features as well as their combination, compared with the state-of-the-art performance in [19]. In [19], SIFT, HoF and HoG descriptors are extracted from the spatio-temporal salient points detected by 2D and 3D Harris detectors; bag-of-words features are built as the compact representation for the video activities, which are the input of an SVM classifier. From Fig. 3, we can conclude that our quaternion descriptor and generative model yield higher AP performance than the latest reported results. More specifically, the Mean AP is improved from the previously reported 37.8% [19] to 44.9%.
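As a recap of the trajectory grouping step above, the affinity of Eq. (14) and the normalized-cut clustering can be sketched as below; scikit-learn's SpectralClustering is used as a stand-in for the normalized cut algorithm of [3], and the fixed group count, the flattening/L2 normalization of the inputs and the clipping of negative affinities are assumptions.

import numpy as np
from sklearn.cluster import SpectralClustering

def group_trajectory_atoms(positions, sift_means, n_groups=10):
    # positions: (N, 2L) flattened, L2-normalized coordinates of N equal-length atoms
    # sift_means: (N, 128) L2-normalized SIFT average descriptors
    W = (positions @ positions.T) * (sift_means @ sift_means.T)   # Eq. (14)
    W = np.clip(W, 0.0, None)                 # non-negative affinities for spectral clustering
    return SpectralClustering(n_clusters=n_groups,
                              affinity='precomputed').fit_predict(W)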
Fig. 3. Comparison of our approach with the salient point features and discriminative model in [19] on HOHA-2 database
Another observation from the results is that the pairwise causality and feedback features outperform the other components of the quaternion descriptor, which demonstrates that the interaction features are indispensable for the task of group activity recognition.

4.2 Recognition Performance on HGA Database
The HGA database is mainly proposed for human-trajectory-based group activity analysis. The humans in this database appear much smaller, so the appearance features do not contribute to the recognition task. Consequently, the trajectory atoms of the HGA database are generated by blob tracking. Each human in the activity video is considered as a 2D blob, and the task is then to locate its positions in the frame sequence. Our tracking method is based on the CONDENSATION algorithm [7] with manual initializations. About 100 particles were used in the experiments as a tradeoff between accuracy and computational cost. Accordingly, the dynamic and causality/feedback components of the proposed quaternion descriptor are employed to describe the activity video footage. Fig. 4 lists the recognition accuracies in terms of confusion matrices obtained by using different types of interaction descriptors as well as their combination. From Fig. 4(a) and Fig. 4(b), we can observe that the causality/feedback component outperforms the dynamic component on the HGA database. This is easy to understand: for two different activity categories, e.g., walking-in-group and gathering, the motion trajectory segments of a specific person may be similar while the interactive relations are different, and the latter can be easily differentiated by the pairwise representation. Therefore, when combining the two types of interaction descriptors, the recognition performance can be further improved, as shown in Fig. 4(c). Compared with the results reported in [18], in which the best performance is 74% average accuracy over all the activities, the proposed work achieves a better result with 87% average accuracy. Note that the confusions are reasonable in the sense that most of the misclassifications occur between very similar motions; for instance, there is a confusion between run-in-group and walk-in-group.
Fig. 4. The confusion matrices of HGA database recognition results with different interaction representations
4.3 Interaction Pattern Exploration and Discovery
We further evaluate the capacity of the proposed extended pLSA for exploring and discovering the interaction patterns in the HGA database. The reason we select the HGA database as the evaluation set is that the trajectory atoms in HGA are the human blob locations in spatio-temporal space, which bear intuitive semantics for visualization. Taking the pairwise causality and feedback interactions as the example, we explored different numbers of interaction patterns, varying from 8 to 256. The corresponding pLSA model was learnt against each number on the training sessions and then evaluated on the testing session. Fig. 5 shows the exploration results of the recognition performance against the varying amount of interaction patterns. From Fig. 5, we can observe some insight into the relation between the assumed pattern amount and the recognition performance. The performance is significantly improved, by 17.2% in average recognition accuracy, when increasing the pattern number from 8 to 32, because more and more interactions can be covered by the learnt model. However, the performance degenerates and drops by 10.5% as the pattern amount increases to 256. This is due to the
Fig. 5. The exploration results of the recognition performance against the number of pairwise interaction patterns on HGA database
This is due to the fact that a model learnt with a larger pattern amount tends to overfit the training data, resulting in a less generalizable model. Therefore, the amount of interaction patterns in the HGA database is inferred to be 32.
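For concreteness, the sketch below sweeps the assumed number of interaction patterns and scores each choice on held-out videos; it uses scikit-learn's LDA topic model as a stand-in for the extended pLSA and a linear SVM as the classifier, both of which are assumptions rather than the paper's exact pipeline.

    import numpy as np
    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.svm import LinearSVC
    from sklearn.metrics import accuracy_score

    def sweep_pattern_number(train_counts, train_labels, test_counts, test_labels,
                             pattern_numbers=(8, 16, 32, 64, 128, 256)):
        """Explore the number of latent interaction patterns.
        train_counts/test_counts: (n_videos, n_interaction_words) count matrices."""
        results = {}
        for k in pattern_numbers:
            topics = LatentDirichletAllocation(n_components=k, random_state=0)
            z_train = topics.fit_transform(train_counts)   # per-video pattern mixture
            z_test = topics.transform(test_counts)
            clf = LinearSVC().fit(z_train, train_labels)
            results[k] = accuracy_score(test_labels, clf.predict(z_test))
        return results   # pick the k with the best held-out accuracy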
References
1. Turaga, P., Chellappa, R., Subrahmanian, V.S., Udrea, O.: Machine recognition of human activities: a survey. T-CSVT (2008)
2. Bobick, A., Davis, J.: The recognition of human movement using temporal templates. T-PAMI (2001)
3. Shi, J., Malik, J.: Normalized cuts and image segmentation. T-PAMI (2000)
4. Wang, Y., Mori, G.: Human action recognition by semilatent topic models. T-PAMI (2009)
5. Niebles, J.C., Wang, H., Li, F.F.: Unsupervised learning of human action categories using spatial-temporal words. IJCV (2008)
6. Lowe, D.: Distinctive image features from scale-invariant keypoints. IJCV (2004)
7. Isard, M., Blake, A.: CONDENSATION - conditional density propagation for visual tracking. IJCV (1998)
8. Granger, C.W.J.: Investigating causal relations by econometric models and cross-spectral methods. Econometrica (1969)
9. Liu, J., Luo, J., Shah, M.: Recognizing realistic actions from videos in the wild. In: CVPR (2009)
10. Laptev, I., Lindeberg, T.: Space-time interest points. In: ICCV (2003)
11. Torralba, A., Murphy, K., Freeman, W., Rubin, M.: Context-based vision system for place and object recognition. In: ICCV (2003)
12. Schuldt, C., Laptev, I., Caputo, B.: Recognizing human actions: a local SVM approach. In: ICPR (2004)
13. Liu, Z., Sarkar, S.: Simplest representation yet for gait recognition: averaged silhouette. In: ICPR (2004)
14. Andrade, E., Blunsden, S., Fisher, R.: Modelling crowd scenes for event detection. In: ICPR (2006)
15. Ryoo, M.S., Aggarwal, J.K.: Hierarchical recognition of human activities interacting with objects. In: CVPR (2007)
16. Turaga, P., Veeraraghavan, A., Chellappa, R.: From videos to verbs: mining videos for activities using a cascade of dynamical systems. In: CVPR (2007)
17. Zhou, Y., Yan, S., Huang, T.: Pair-activity classification by bi-trajectory analysis. In: CVPR (2008)
18. Ni, B., Yan, S., Kassim, A.: Recognizing human group activities with localized causalities. In: CVPR (2009)
19. Marszalek, M., Laptev, I., Schmid, C.: Actions in context. In: CVPR (2009)
20. Sun, J., Wu, X., Yan, S., Cheong, L.F., Chua, T.S., Li, J.: Hierarchical spatio-temporal context modeling for action recognition. In: CVPR (2009)
21. Mortensen, E., Deng, H., Shapiro, L.: A SIFT descriptor with global context. In: CVPR (2005)
22. Hofmann, T.: Probabilistic latent semantic indexing. In: ACM SIGIR (1999)
23. Breiman, L.: Probability. Society for Industrial Mathematics (1992)
24. Jury, E.I.: Sampled-data Control Systems. John Wiley & Sons, Chichester (1958)
Grid-Based Retargeting with Transformation Consistency Smoothing
Bing Li1,2, Ling-Yu Duan2, Jinqiao Wang3, Jie Chen2, Rongrong Ji2,4, and Wen Gao1,2
1 Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
2 Institute of Digital Media, School of Electronic Engineering and Computer Science, Peking University, Beijing 100871, China
3 National Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
4 Visual Intelligence Laboratory, Department of Computer Science, Harbin Institute of Technology, Heilongjiang 150001, China
Abstract. Effective and efficient retargeting is critical to improve the user browsing experience on mobile devices. One important issue in previous works lies in the semantic gap in modeling user focus and intention from low-level features, which results in data noise in their importance map construction. Towards noise-tolerant learning for effective retargeting, we propose a generalized content-aware framework from a supervised learning viewpoint. Our main idea is to revisit the retargeting process as working out an optimal mapping function to approximate the output (desirable pixel-wise or region-wise changes) from the training data. Therefore, we adopt a prediction error decomposition strategy to measure the effectiveness of previous retargeting methods. In addition, taking into account the data noise in importance maps, we also propose a grid-based retargeting model, which is robust and effective against data noise in real-time retargeting function learning. Finally, using different mapping functions, our framework is generalized to explain previous works, such as seam carving [9,13] and mesh-based methods [3,18]. Extensive experimental comparison to state-of-the-art works has shown promising results for the proposed framework.
1 Introduction
More and more consumers prefer to watch images on versatile mobile devices. As image resolutions vary widely and the aspect ratios of mobile displays differ from each other, properly adapting images to a target display is useful to make wise use of expensive display resources. Image retargeting aims to maximize the viewer experience when the size or aspect ratio of a display is different from the original one. Undoubtedly, users are sensitive to any noticeable distortion of retargeted pictures, so preserving the consistency and continuity of images is important. We therefore propose a generic approach to effective and efficient image retargeting, which is applicable to mobile devices.
Fig. 1. Illustrating image/video retargeting from a supervised learning viewpoint
Many content-aware retargeting methods have been proposed, such as cropping [4,5,1], seam carving [13,9,15,11], multi-operator [10], and mesh-based retargeting [12,18,3,7,17]. Cropping [4,5,1] applies an attention model to detect important regions and then crops the most important region for display. Seam carving [13,9,15] tries to carve out a group of optimal seams iteratively based on an energy map computed from images/videos. Rubinstein proposes to combine different retargeting methods, including scaling, cropping, and seam carving, in [10]. In addition, mesh-based methods [12,18,3,7] partition source images/videos into meshes, where more or less deformation is allowed by adjusting the shape of a mesh, while, for important regions, the shapes of the relevant meshes are committed to be kept well. Generally speaking, content-aware retargeting may be considered as a sort of supervised learning process. Under the supervision of a visual importance map, content-aware retargeting aims to figure out a mapping function in charge of removing, shrinking, or stretching less important regions, as well as preserving the shape of important regions, as illustrated in Fig. 1. Either a user study or an image similarity measurement applies to evaluate the effectiveness of a retargeting method. On the other hand, the result of content-aware methods heavily relies on the quality of the importance map. Most importance maps are generated from low-level visual features such as gradient, color contrast, and so on. Due to the lack of high-level features, an importance map cannot recover the meaningful object regions exactly so as to assign proper values to objects. As the importance map cannot truly represent a user's attention, content-aware retargeting guided by a noisy importance map is actually a weakly supervised learning process. From a learning viewpoint, a good model should avoid overfitting, where low variance and high bias are preferred to deal with data noise. However, the seam carving method [9] removes the 8-connected seam containing the lowest energy each time. It can be considered as a sort of local approximation to keep the shape of salient regions. As a result, a seam carving method has high variance and low bias, and it is very sensitive to noise. For example, when the seams crossing an object produce the lowest energy, removing seams tends to fragment the object in the resulting images. Similarly, by global optimization, mesh-based methods have lower variance to reduce the negative influence of noisy data, similar to filter smoothing. Their resulting images are smoother than those of pixel-wise methods. Unfortunately, serious shape transformation leads to a model that is too complex, involving many degrees of freedom. When an object covers several meshes with each mesh assigned a different importance value, object inconsistency would happen, e.g.,
big head and small body, or a skewed structural object. As a variant of mesh-based methods, Wang [18] uses the vertices of the mesh to describe quad deformation in an objective function. However, it is not easy to well control the shape transformation of quad grids in the optimization; moreover, most grids are irregular quadrilaterals in their results. So the resulting grids may fail to preserve the structure of complex backgrounds, although efforts have been made to minimize the bending of grid lines. To summarize, in the case of a certain amount of noisy training data, the existing retargeting methods are sensitive due to the models' higher variance. Undoubtedly, too many degrees of freedom in a retargeting model lead to spatial inconsistency of salient objects and discontinuity of less important regions. Thus, we propose a grid-based optimization approach to retargeting an image. The basic motivation is to reduce the model variance by constraining the grid-based shape transformation over rectangular grids. Then the aspect ratio of a display can be characterized by arranging a set of rectangles, where the change of the grids' aspect ratio is used to measure distortion energy. A nonlinear objective function is employed to allocate unavoidable distortion to unimportant regions so as to reduce the discontinuity within less important regions. In addition, as the nonlinear optimization model to be built is a convex program, a global optimal solution can be obtained by an active-set method. Overall, as our model confines the degrees of freedom, our method is effective in accommodating the weak supervision of noisy importance maps from low-level feature computing. This makes our method more generic in a sense. Our major contributions can be summarized as follows:
1. We propose a generalized retargeting framework from a supervised learning viewpoint, which introduces an optimized retargeting strategy selection approach in terms of adapting to the training data quality. By adopting different learning functions, previous retargeting approaches, such as seam carving [9] and mesh-based approaches [18][3], can be derived from our model.
2. We present a grid-based model to effectively reduce the mapping function complexity, which is robust to importance map noise (from cluttered backgrounds) and inferiority (in delineating the salient objects). By a quadratic programming approximation, the complexity of our objective function optimization can be linear in the training data.
3. Our proposed objective function makes the best use of the unimportant regions for optimizing consistency between meaningful objects (in important regions) and content continuity in non-important regions. Also, it enables parameter adjustment to favor desirable results with user preferences (shown in Fig. 4).
2 Visual Retargeting Framework
In this section, we come up with a general content-aware retargeting framework from a supervised learning point of view. To well keep important regions at the cost of distorting less important ones, retargeting methods work out an optimal mapping function g such that

    g : IM → SP   s.t.   boundary constraints                                    (1)
IM is the importance of pixels/regions, and SP denotes the desirable pixel- or region-wise changes such as removing, shrinking, stretching, and preserving.
2.1 Retargeting Transformation in Either Local or Global Manner
Our framework aims to abstract any retargeting transformation in a unified way from the local and global points of view. A typical learning problem involves training data (im_1, sp_1), ..., (im_n, sp_n) containing an input vector im and a corresponding output sp. The mapping function g(im) approximates the output from the training data. The approximation can be a local or a global one.

Local Methods. For a local method, the input/output data come from a local region as {(im_{k1,e1}, sp_{k1,e1}), ..., (im_{kn,en}, sp_{kn,en})}, where (k_1, e_1), ..., (k_n, e_n) span the local region and (k_i, e_i) is the position of a pixel. The function g is a local approximation of the output, similar to K-nearest neighbors. In the training data, sp may be set to several values for different region operations such as removing, shrinking, etc. For example, an image is partitioned into several regions according to importance measurements, where the regions can be determined in different ways, such as detecting objects, locating seams with the lowest energy, spotting a window with large importance, and so on. As a local approximation, the mapping function leads to a sort of independent retargeting based on each individual region. In other words, the process of keeping the important regions is independent of shrinking/stretching the less important regions. For the sake of simplicity, we set sp_r = 1 for a region/pixel requiring good local preservation, and sp_r = -1 otherwise. The function can be simply defined as:

    sp̂_{k,e} = g(im_{k,e}) = { -1,  (k, e) ∈ unimportant region
                                1,  (k, e) ∈ important region                    (2)

Global Methods. For a global method, the input/output come from the whole image as {(im_{r1}, sp_{r1}), ..., (im_{rn}, sp_{rn})}, where the whole image is partitioned into regions or pixels with r_1 ∪ r_2 ∪ ... ∪ r_n = source image. The mapping function approximates the output in a global manner. To accomplish a satisfactory global fitting of the training data, the empirical risk R_emp(g) is defined as

    R_emp(g) = Σ_{r_i ∈ image} L(sp_{r_i}, g(im_{r_i}))                           (3)
L(sp_{r_i}, g(im_{r_i})) calculates a weighted discrepancy between an original region and a target region. The mapping g(im) is thus obtained by minimizing R_emp(g). As a typical global approach, mesh-based methods impose a mesh-based partition on the source image, and g(im) measures each region's original shape. In mesh-based methods, L(sp_{r_i}, g(im_{r_i})) is defined as:

    L(sp_{r_i}, g(im_{r_i})) = D(r_i) · w(im_{r_i})                               (4)

where D(r_i) measures the distortion of the meshes and w(im_{r_i}) is a weighting function of the importance of region r_i. The distortion of meshes can be increased or reduced by adjusting w(im_{r_i}) accordingly.
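As a toy illustration of the two views above, the snippet below builds the local ±1 labeling of Eq. (2) from a thresholded importance map and evaluates the global risk of Eqs. (3)-(4) for a candidate set of region distortions; the threshold and the distortion measure are illustrative assumptions only.

    import numpy as np

    def local_labels(importance, threshold):
        """Eq. (2): +1 for pixels/regions to preserve, -1 otherwise."""
        return np.where(importance >= threshold, 1, -1)

    def global_risk(distortions, importances):
        """Eqs. (3)-(4): weighted sum of per-region distortions D(r_i) * w(im_ri)."""
        return float(np.sum(np.asarray(distortions) * np.asarray(importances)))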
2.2 On the Effectiveness of a Retargeting Method
To measure effectiveness is an important issue in designing a good retargeting method. This is closely related to how to measure the performance of a mapping function. The performance of a mapping function strongly depends on correctly choosing important regions in the training data. In the case of noisy data, a too-complex mapping function would lead to overfitting, such as distortion of an object. To select a good model, a performance measurement of the mapping function should be provided so that the best model can be selected for different types of noisy or clean data. [14] presented the prediction error to measure the effectiveness of a mapping function. A learnt function's prediction error is related to the sum of the bias and the variance of a learning algorithm [6], which can be formulated as in [14]. For the training data Q and any im, the prediction error decomposition is:

    E_Q[(g(im; Q) - E[sp|im])²] = (E_Q[g(im; Q)] - E[sp|im])² + E_Q[(g(im; Q) - E_Q[g(im; Q)])²]          (5)
(E_Q[g(im; Q)] - E[sp|im])² is the bias, and E_Q[(g(im; Q) - E_Q[g(im; Q)])²] is the variance. To avoid overfitting and preserve the generality of the retargeting function, our goal is to decrease the variance in fitting the particular input im. For a local method, the mapping function variance depends on the pixel number k of each local region. When k is too small, the function has higher variance but lower bias. Such a function exhibits higher complexity and incurs many degrees of freedom. Thus, retargeting is sensitive to noise and may produce artifacts in objects with rich structure, as in seam carving [13,9]. When k is large (e.g., all the pixels of a cropping window), the variance is lower, so the impact of noisy data is decreased; however, taking cropping methods [4,5,1] as an example, some objects or their parts would be discarded when several important objects are far from each other. For global methods [18,3], the mapping function depends not only on region importance but also on its distribution, for which the model would be more complex. Overall, the mapping function is smoother than that of a local method.

2.3 An Instance of Our Proposed Framework
As discussed above, a good retargeting method has to seek a tradeoff between bias and variance based on the quality of the training data. In this section, we come up with an instance by taking into account the quality of the importance map. As a visual importance map cannot recover the regions of salient objects exactly, the actual training data is noisy. Therefore, we would like to choose a mapping function with lower variance to reduce the negative influence of the noisy data, so our instance prefers a global method. Our instance is committed to maintaining lower variance: an optimization approach with lower variance is applied to reduce the influence of noisy data. We constrain the grid-based shape transformation over rectangular grids. The change of the grids' aspect ratio is used to measure the distortion energy in retargeting. This is advantageous over Wang's model [18], in which too many degrees of freedom often lead
to deformation of objects. Moreover, we provide the user with a few parameters to optimize the use of unimportant regions to keep the shape of important regions. In the case of noisy data, lower variance can reduce the influence of the noisy data, but the shape transformation of unimportant regions is less flexible, which would affect the preservation of the important regions' shape. So we introduce user input to alleviate this disadvantage: through setting a few parameters in our objective function, we may amplify the difference between the importance values of important and unimportant regions. The objective function is described as follows.

Objective Function. We use the edges of grids rather than the coordinates of vertices to measure the distortion energy of each grid. A nonlinear objective function is employed to reallocate distortion to a large proportion of (all) unimportant regions to avoid discontinuity. To minimize the grid distortion energy, the objective function is defined as:

    min Σ_{i,j} (y_i(t) - a_rs · x_j(t))^m · s_ij^n                               (6)
where m ≥ 2 is an even number and n > 1. a_rs is the aspect ratio of the original grid, and x_j, y_i are the width and height of the target grid g_ij, respectively. s_ij is the importance value of grid g_ij. The weight s_ij^n controls the distortion of grid g_ij: the larger s_ij^n is, the better the grid is preserved. As our approach has high bias and low variance, restricting the shape transformation of grids comes at the expense of less flexible adjustment of unimportant regions, which would in turn affect the shape preservation of the remaining important regions. To optimize the use of the unimportant regions' shape, we introduce user input to alleviate this disadvantage. By adjusting the parameters n and m, we may obtain two types of retargeting effects: 1) the adapted results tend to preserve important regions more; or 2) more smoothness is allowed between grids within unimportant regions. More details can be found in the subsequent section.
3 Grid-Based Image Retargeting
In this section, we introduce our grid-based image retargeting, involving rectangular-grid shape transformation constraints as well as a nonlinear objective function to reallocate distortion to less important regions. Our method contains three basic stages. First, we calculate a gradient map and a visual attention map to determine important regions. Second, we divide the source image into grids, each grid generating an importance measure; the grid optimization model is solved at the granularity of grids, and the optimal solution is applied to transform source grids to target grids. Finally, the image retargeting is accomplished by a grid-based texture mapping algorithm [2].

3.1 Importance Map
We combine a gradient map and a visual attention map [16] to generate the importance map. The importance map is defined as:

    IMS_{k,e} = α · GS_{k,e} + (1 - α) · AS_{k,e},   α ≥ 0                        (7)
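As a rough illustration of Eq. (7), the snippet below combines a gradient map GS with an attention map AS into an importance map; the Sobel-based gradient and the externally supplied saliency map are stand-ins (assumptions), since the paper uses the attention model of [16].

    import numpy as np
    from scipy import ndimage

    def importance_map(gray, attention, alpha=0.5):
        """Combine a gradient map GS and an attention map AS as in Eq. (7).
        gray, attention: 2D float arrays of the same shape, values in [0, 1]."""
        gx = ndimage.sobel(gray, axis=1)
        gy = ndimage.sobel(gray, axis=0)
        gs = np.hypot(gx, gy)
        gs = gs / (gs.max() + 1e-12)        # normalized gradient magnitude
        return alpha * gs + (1.0 - alpha) * attention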
Fig. 2. Influence of the importance variant's power on the resulting weights (curves for n = 1, 2, 3, 4; x-axis: importance, y-axis: weight)
GS_{k,e} and AS_{k,e} are the importance values of the pixel at position (k, e) in the gradient map and the attention map, respectively. For different types of images, such as nature and buildings, different empirical parameters α can be set to obtain good results. It is noted that the distortion of homogeneous regions is rarely perceived by humans after transformation, while for irregular textured regions, i.e., those with higher values in the gradient map, people can also tolerate such distortions. Consequently, the addition of an attention map reduces the overall importance values in irregular textured regions.

3.2 Grid-Based Resizing Model
The grids are divided as follows. An image is divided into N × N grids, denoted by M = (V, E), in which V is the set of 2D grid coordinates and E is the set of grid edges. Each grid is denoted by G = {g_11, g_12, ..., g_ij, ..., g_NN} with its location (i, j). Owing to the constraint of rectangular grids, all the grids in each row have the same height while the grids in each column have the same width. So the edges are simply denoted by E = {(x_1, y_1), ..., (x_i, y_j), ..., (x_N, y_N)}, where x_i and y_j are the width and the height of grid g_ij, respectively. Clearly, our model has fewer parameters to optimize than [18].

Computing the importance of a grid. With the uniform division into N × N grids, the importance of grid g_ij can be calculated as follows:

    s_ij = ( Σ_{k,e ∈ g_ij} IMS_{k,e} / S_total ) × N²                            (8)

S_total is the sum of the importance values of all the pixels in the image. A mean importance value of 1 is imposed to isolate important grids from unimportant ones, instead of any value between 0 and 1 as in [18]. A grid is considered unimportant if its importance value is less than 1. From Fig. 2, for an important grid g_ij, the larger n is, the bigger s_ij^n is; while, for an unimportant grid g_ij, the larger n is, the smaller s_ij^n is. Thus increasing n can further raise an important grid's weight and decrease an unimportant grid's weight. However, n cannot be too large, otherwise it would break the visual continuity in unimportant regions. Empirically, the range of n is [1, 7].
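A minimal sketch of the grid importance of Eq. (8), assuming the importance map from the previous sketch and a uniform N × N division (border cells may differ by a pixel).

    import numpy as np

    def grid_importance(ims, n_grids):
        """Compute s_ij of Eq. (8) for an N x N uniform grid division.
        ims: 2D importance map; n_grids: N."""
        h, w = ims.shape
        s_total = ims.sum()
        rows = np.array_split(np.arange(h), n_grids)
        cols = np.array_split(np.arange(w), n_grids)
        s = np.empty((n_grids, n_grids))
        for i, r in enumerate(rows):
            for j, c in enumerate(cols):
                s[i, j] = ims[np.ix_(r, c)].sum() / s_total * n_grids ** 2
        return s   # the mean value is 1, separating important from unimportant grids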
Fig. 3. The effects of choosing different m on the retargeted image (a_rs = 2). Panels: distortion of a grid for m = 2 and m = 4, the trend of feasible solutions under the equality constraint x1 + x2 = 12, and 2D projections of the objective function for m = 2 and m = 4 (x-axis: variable x, y-axis: value of objective).
Boundary constraints. We introduce the constraints as follows:

    Σ_{i=1}^{N} y_i(t) = H_T,   Σ_{j=1}^{N} x_j(t) = W_T                          (9)
H_T and W_T are the height and width of the target image, respectively. Note that the minimum height or width of a grid is set to one pixel, as adjacent grids should not overlap each other.

The Objective Function. To minimize the sum of the grid distortion energies, we employ the objective function (6) introduced in Section 2. Increasing m improves the continuity of the whole image: with increasing m, grids with similar weights are subject to similar shape changes, and the edge disparity between large-weighted grids and small-weighted grids also becomes smaller. To clarify, we provide a three-dimensional graph of the objective function in Fig. 3, taking m = 2 and m = 4 respectively. For simplicity, the image is divided into 2 × 2 grids and the height of the grids remains unchanged; the 3D graph is projected to a 2D curve to indicate the flat trend for m = 2 and the dramatic trend for m = 4. A two-way arrow and a hollow circle are used to illustrate the movement of the solutions under the equality constraint when increasing m. In addition, m cannot be too large; otherwise, the effect is similar to scaling the image. We empirically set m = 2, 4, 6.

Global Solution. To obtain a global solution, we employ an active-set method to solve this optimization problem. The nonlinear program is a convex program, and any local solution of a convex program is actually a global solution. Each term (y_i(t) - a_rs · x_j(t))^m · s_ij^n is a convex function, so our objective function Σ (y_i(t) - a_rs · x_j(t))^m · s_ij^n is convex. Moreover, the equality constraints are linear functions and the inequality constraints can be seen as concave functions, so the solutions satisfying the equality and inequality constraints form a convex set. When a local solution is found, the global solution is obtained.
For this convex program, the Hessian matrix of the objective function is positive semidefinite. The complexity is similar to that of a linear program and depends on the number of grids instead of the actual resolution of the image.
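To make the optimization concrete, the sketch below minimizes Eq. (6) under the boundary constraints of Eq. (9), using SciPy's SLSQP solver as a stand-in for the active-set method; the function name and parameter defaults are assumptions for illustration, not the authors' implementation.

    import numpy as np
    from scipy.optimize import minimize

    def retarget_grid(s, src_w, src_h, tgt_w, tgt_h, m=4, n=3):
        """Solve for target grid heights y_i and widths x_j (Eqs. 6 and 9).
        s: (N, N) grid importance matrix from Eq. (8)."""
        N = s.shape[0]
        a_rs = (src_w / N) / (src_h / N)          # aspect ratio of an original grid
        sn = s ** n                               # importance weights s_ij^n

        def energy(v):                            # v = [y_1..y_N, x_1..x_N]
            y, x = v[:N], v[N:]
            d = y[:, None] - a_rs * x[None, :]    # y_i - a_rs * x_j for every grid
            return np.sum(d ** m * sn)            # m is even, so every term is nonnegative

        cons = [{"type": "eq", "fun": lambda v: v[:N].sum() - tgt_h},   # heights sum to H_T
                {"type": "eq", "fun": lambda v: v[N:].sum() - tgt_w}]   # widths sum to W_T
        bounds = [(1.0, None)] * (2 * N)          # each grid edge is at least one pixel
        v0 = np.concatenate([np.full(N, tgt_h / N), np.full(N, tgt_w / N)])
        res = minimize(energy, v0, method="SLSQP", bounds=bounds, constraints=cons)
        return res.x[:N], res.x[N:]               # target heights, target widths

The returned edge lengths then drive the grid-based texture mapping that warps each source grid into its target rectangle.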
4 Experiments
To evaluate the effectiveness and efficiency of our retargeting method, we conducted experiments on a variety of images. In total, 120 images are collected to cover six typical classes: landscape, structure (e.g., indoor/outdoor architecture), computer graphics & painting, and daily life (with a human/animal as the subject), where daily life is further classified into long shot, medium shot, and close up. To avoid confusion in picture classification, we apply two rules: (1) an image is primarily classified into daily life as long as it contains a human/animal; otherwise, (2) an image containing a building is classified into structure, no matter whether the image belongs to computer graphics/painting or not. These categories may be related to different types of noisy or clean data in a sense. For instance, computer graphics & painting or a long shot is comparatively cleaner (i.e., the importance map tends to delineate the salient object exactly) than the other classes.

Efficiency. Our algorithm was implemented on a standard laptop with a 2.26 GHz duo-core CPU and 2 GB memory. Our dataset consists of images with diverse sizes and aspect ratios. The largest image size is 1920×1200, while the smallest is 288×240. Each optimization process costs less than 0.015 s for 20×20 grids. As the complexity of our algorithm relies on the grid division instead of the real image resolution, our algorithm is much more efficient than seam carving.

Effectiveness. Fig. 4 illustrates the effects of the parameters m and n on the retargeting results. Note that the girl's head and the volleyball have higher importance values, so they are kept well in Fig. 4(c)(d)(e)(f). By comparing the effects in Fig. 4(c)(d)(e), we can find that for the same m, increasing n better preserves the shape of the important region (e.g., the girl's head), while the unimportant regions (background) are distorted more. By comparing Fig. 4(d)(f), we can find that for the same n, increasing m improves the visual content continuity in unimportant regions. In the subsequent user study, most subjects prefer (f), since its overall consistency is best, even though the grids covering the girl are somewhat squeezed. Furthermore, our method is compared with scaling and two other representative methods [18,9]. Note that when the noisy importance maps of source images (especially for structure and close up) cannot delineate important objects from non-important regions, the retargeting problem becomes more challenging. In our empirical dataset, each type has at least 20 images. Fig. 5 demonstrates that our method is more effective in preserving the consistency of objects and the continuity of unimportant regions, even in the case of serious noise (e.g., the sixth row in Fig. 5). In contrast, the seam carving method [9] brings about considerable shape artifacts of an object, especially for more structured ones (as indicated in rows 1, 2, 3, 4, 6, and 7 of Fig. 5), since, at object regions, some seams
Fig. 4. Results for a daily life picture at medium shot. (a) Original image. (b) Scaling. (c) m = 2, n = 1. (d) m = 2, n = 3. (e) m = 2, n = 5. (f) m = 4, n = 3.
with lower importance are falsely removed. Wang's method [18] distorts objects, as indicated in rows 1, 2, 3, and 4 of Fig. 5. When an important object covers several grids but the importance values of these grids differ from each other, most grids with low importance become irregular quadrilaterals after retargeting, so Wang's method may fail to preserve the structure of the object.

User Study. A subjective evaluation is further performed through a user study. The results of several popular methods are presented to subjects, and by means of user preference and scoring evaluation, the effectiveness is measured quantitatively. In total, 10 students participated in the user study. We showed each participant an original image and a randomly ordered sequence of retargeting results from different methods, including scaling, non-homogeneous resizing [8], seam carving [9], Wang's method [18], and our method. Each participant was required to choose the result most visually similar to the original image. As listed in Table 1, most participants preferred our results. Referring to Fig. 5, let us investigate the results. For landscape, our results are comparable to Wang's method, but seam carving may greatly alter the content of an image, sometimes without any noticeable distortion (see the 7th row in Fig. 5). For a long shot, except for seam carving and scaling, most methods produce similar results, because the unimportant region occupies a large portion of the image and has low values in the computed importance map; however, seam carving tends to change the depth of field (see the fifth row of Fig. 5). For a close-up, subjects prefer our method and scaling, since the calculated importance maps are often seriously noisy. In such cases, the major part of the image is important, whereas these parts have low values in the importance map, and the unimportant part is not large enough to adapt the image to the target size. Consequently, non-homogeneous resizing, seam carving, and Wang's method distort important regions that have low importance values, so the objects are distorted, while our method produces smoother results. It is easy to observe that, for structure, medium shot, and computer graphics & painting, users generally prefer our results, as discussed above.

Limitations. Like all content-aware retargeting methods, our method is still affected by the quality of the importance map. Our results may reduce to those of scaling when the major part of an image is considered important by the visual content computation; due to the importance map, such retargeting results are not actually preferred by users. If we could lower the importance values in irregular textured regions, more space would be saved for important objects. Ideally, if some
Fig. 5. Comparison results. Columns from left to right: (a) source image, (b) scaling, (c) Rubinstein et al.'s results [9], (d) Wang et al.'s results [18], (e) our results. Rows from top to bottom: (1) computer graphics, (2) architecture, (3) indoor, (4) medium shot, (5) long shot, (6) close up, (7) landscape.
Table 1. Preference statistics of ten participants in user study (%)

Method                   Landscape  Structure  CG & Painting  Daily: Long  Daily: Medium  Daily: Close up
Scaling                  2.1        3.5        4.5            5.10         1.30           24.40
Non-homo resizing [8]    8.4        6.2        8.9            22.4         11.8           20.2
Seam Carving [9]         15.2       10.6       9.4            11.2         12.7           1.4
Wang's method [18]       30.1       10.7       23.7           25.8         17.3           19.3
Our results              44.2       79.5       53.5           35.6         56.9           34.7
descriptors are available that distinguish irregular textured regions from objects, our method would be able to produce more desirable results.
5 Conclusion and Discussions
We proposed a general content-aware retargeting framework from a supervised learning viewpoint. We proposed to measure retargeting performance based on the prediction error decomposition [14]. We further proposed a grid-based retargeting model that ensures transformation consistency in model learning; this grid-based model is optimized by solving a nonlinear programming problem. There are two merits in the proposed framework: (1) our framework is generalized from previous works, in that by incorporating different learning functions, many state-of-the-art retargeting methods can be derived from our framework; (2) based on our grid-based learning structure, our model is suitable for real-time applications on mobile devices.

Acknowledgements. This work was supported in part by the National Basic Research Program of China (973 Program) 2009CB320902, in part by the National Natural Science Foundation of China No. 60902057 and No. 60905008, and in part by the CADAL Project and the NEC-PKU Joint Project.
References
1. Santella, A., Agrawala, M., DeCarlo, D., Salesin, D., Cohen, M.: Gaze-based interaction for semi-automatic photo cropping. In: SIGCHI Conference on Human Factors in Computing Systems (2006)
2. Hearn, D., Baker, M.: Computer Graphics with OpenGL (2003)
3. Guo, Y., Liu, F., Shi, J., Zhou, Z., Gleicher, M.: Image retargeting using mesh parametrization. IEEE Transactions on Multimedia (2009)
4. Chen, L., Xie, X., Fan, X., Ma, W., Zhang, H., Zhou, H.: A visual attention model for adapting images on small displays. Multimedia Systems (2003)
5. Liu, H., Xie, X., Ma, W., Zhang, H.: Automatic browsing of large pictures on mobile devices. In: ACM Multimedia (2003)
6. James, G.M.: Variance and bias for general loss functions. Mach. Learn. (2003)
7. Shi, L., Wang, J., Duan, L., Lu, H.: Consumer video retargeting: context assisted spatial-temporal grid optimization. In: ACM Multimedia (2009)
8. Wolf, L., Guttmann, M., Cohen-Or, D.: Non-homogeneous content-driven video-retargeting. In: ICCV (2007)
9. Rubinstein, M., Shamir, A., Avidan, S.: Improved seam carving for video retargeting. ACM Transactions on Graphics (2008)
10. Rubinstein, M., Shamir, A., Avidan, S.: Multi-operator media retargeting. ACM Transactions on Graphics (2009)
11. Grundmann, M., Kwatra, V., Han, M., Essa, I.: Discontinuous seam-carving for video retargeting. In: CVPR (2010)
12. Gal, R., Sorkine, O., Cohen-Or, D.: Feature-aware texturing. In: Eurographics Symposium on Rendering (2006)
13. Avidan, S., Shamir, A.: Seam carving for content-aware image resizing. ACM Transactions on Graphics (2007)
14. Geman, S., Bienenstock, E., Doursat, R.: Neural networks and the bias/variance dilemma. Neural Comput. (1992)
15. Kopf, S., Kiess, J., Lemelson, H., Effelsberg, W.: FSCAV: fast seam carving for size adaptation of videos. In: ACM Multimedia (2009)
16. Liu, T., Sun, J., Zheng, N., Tang, X., Shum, H.: Learning to detect a salient object. In: CVPR (2007)
17. Niu, Y., Liu, F., Li, X., Gleicher, M.: Warp propagation for video resizing. In: CVPR (2010)
18. Wang, Y., Tai, C., Sorkine, O., Lee, T.: Optimized scale-and-stretch for image resizing. ACM Transactions on Graphics (2008)
Understanding Video Sequences through Super-Resolution Yu Peng, Jesse S. Jin, Suhuai Luo, and Mira Park The School of Design, Communication and IT, University of Newcastle, Australia {yu.peng,jesse.jin,suhuai.luo,mira.park}@uon.edu.au
Abstract. Human-centred multimedia applications are a set of activities in which humans directly interact with multimedia, which comes in different forms. Among all multimedia, video is an essential resource by which people obtain sensory information. Owing to limitations in the capacity of imaging devices as well as shooting conditions, we usually cannot acquire video records of the desired quality. This problem can be addressed by super-resolution. We propose a novel scheme for the super-resolution problem in the present paper, and make three contributions: (1) in the image registration stage of previous approaches, the reference image is picked out by observation or at random; we utilise a simple but efficient method to select the base image; (2) a median-value image, rather than the average image used previously, is adopted as the initialization for the super-resolution estimate; (3) we adapt the traditional Cross Validation (CV) to a weighted version in the process of learning parameters from the input observations. Experiments on synthetic and real data are provided to illustrate the effectiveness of our approach.
Keywords: Super-resolution, reference image selection, median-value image, Weighted Cross Validation.
1 Introduction

With the dramatic development of digital imaging as well as the Internet, multimedia that seamlessly blends text, pictures, animation, and movies has become one of the most dazzling and fastest growing areas in the field of information technology [1]. Multimedia is widely used in entertainment, business, education, training, science simulations, digital publications, exhibits and much more, in all of which human interaction and perception are crucial. Low-level features, however, cannot engender accurate understanding of sensory information. As an important medium, video, which provides people with visual information, is often associated with low-quality problems because of limitations of the imaging device itself and the shooting conditions. For example, people cannot recognize faces or identify car license plates in a video record with low resolution and noise. We usually tackle this problem through super-resolution, which is a powerful technique for reconstructing a high-resolution image from a set of observed low-resolution images, at the same time performing de-blurring and noise removal. The "super" here refers to the good characteristics close to the original scene.
In order to put it in perspective, a brief review of super-resolution is needed. As a well-studied field, comprehensive reviews were carried out by Borman and Stevenson [2] in 1998 and Park [3] in 2003. Although there are many approaches to super-resolution, the formulation of the problem usually falls into either the frequency or the spatial domain. Historically, the earliest super-resolution methods were raised in the frequency domain [4, 5, 6], as the underlying mechanism of resolution improvement is the restoration of frequency components beyond the Nyquist limit of the individual observed image samples [3]. However, the unrealistic assumption of global translation in these methods is too limited for wide use. On the other hand, spatial methods [7, 8, 9, 10] soon dominated because of their better handling of noise and a more natural treatment of the point spread blur in imaging. Many super-resolution techniques are derived from Maximum Likelihood (ML) methods, which seek a super-resolution image that maximizes the probability of the observed low-resolution input images under a given model, or Maximum a Posteriori (MAP) methods, which use assumed prior information about the high-resolution image. Recently, Pickup [10] proposed a completely simultaneous algorithm for super-resolution. In this MAP-framework-based method, the parameters of the prior distributions are learnt automatically from the input data, rather than being set by trial and error. Additionally, superior results are obtained by optimizing over both the registrations and the image pixels. In this paper, we take an adaptive method for super-resolution based on this simultaneous framework. Three improvements are presented. At the beginning of super-resolution, image registration is ultimately crucial for accurately reconstructed results. Most previous approaches ignored the selection of the reference image, which is the base frame for the other sensed images in the registration process; in our method, we introduce an effective evaluation for reference image selection. As the second development, a median-value image is utilized as the initialization in place of the average image, and its better performance is demonstrated. Finally, we adopt a Weighted Cross Validation (WCV) to learn prior parameters; as we show, it is more efficient than the traditional Cross Validation. The paper is organized as follows. Notation for reasoning about the super-resolution problem is laid out in Section 2, which then goes on to present popular methods for handling the super-resolution problem. In Section 3, we give particular consideration to image registration and prior information, which are ultimately crucial in the whole process. Section 4 introduces the adaptive super-resolution scheme based on the simultaneous model with our three developments. All the experiments and results are given in Section 5, while Section 6 concludes the paper and highlights several promising directions for further research.
2 The Notation of Super-Resolution

As we know, in the process of recording video, some spatial resolution is unavoidably lost for a number of reasons. The most popular choice is to describe this information-loss process through a generative model, which is a parameterized, probabilistic forward process model of observed data generation. Several comprehensive
generative models with subtle differences were proposed [2,10,11,12,13]. Generally, these models describe the corruptions with respect to geometric transformation, illumination variation, blur engendered by limited shutter speed or the sensor point spread effect, down-sampling, and also noise that occurs within the sensor or during transmission. As in [8,14], the warping, blurring and sub-sampling of the original high-resolution image x, which has N pixels, is modelled by a sparse matrix W_k; x is assumed to generate a set of K low-resolution images y_k, each of which has M pixels. We could use the following equation to express this model:

    y_k = λ_α(k) · W_k x + λ_β(k) + ε_k                                           (2.1)

where the scalars λ_α(k) and λ_β(k) characterize the global affine photometric correction resulting from multiplication and addition across all pixels, respectively, and ε_k is the noise occurring in this process. In terms of the generative model, the simple Maximum Likelihood (ML) solution to the super-resolution problem is the super-resolution image which maximizes the likelihood of the observed data. Although ML super-resolution is really fast, it is an ill-conditioned problem, some of whose solutions are subjectively very implausible to the human viewer. The observations are corrupted by noise, the probability of which is also maximized when we maximize the probability of the observations. The Maximum a Posteriori (MAP) solution is introduced to circumvent this deficit. The MAP solution is derived from Bayes' rule with a prior distribution over x to avoid infeasible solutions; it is given as:

    x̂_MAP = argmax_x p(x | y_1, ..., y_K) = argmax_x p(x) ∏_k p(y_k | x)          (2.2)
More recently, Pickup [10] proposed a simultaneous method based on the MAP framework, by which we find the super-resolution image at the same time as optimizing the registration, rather than reconstructing the super-resolution image after fixing the registration of the low-resolution frames.
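For intuition, the snippet below sketches the forward model of Eq. (2.1) for one observation, approximating W_k by a Gaussian blur followed by decimation; the warp is omitted and all parameter values are illustrative assumptions.

    import numpy as np
    from scipy import ndimage

    def generate_lr(x, zoom=2, psf_std=0.375, lam_a=1.0, lam_b=0.0, noise_std=0.01,
                    rng=np.random.default_rng(0)):
        """Forward model of Eq. (2.1): y = lam_a * (W x) + lam_b + noise.
        Here W is approximated by Gaussian blur plus down-sampling."""
        blurred = ndimage.gaussian_filter(x, sigma=psf_std * zoom)
        y = blurred[::zoom, ::zoom]                 # down-sample by the zoom factor
        return lam_a * y + lam_b + rng.normal(0.0, noise_std, y.shape)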
3 Image Registration and Prior Distribution

Almost all multi-frame image super-resolution algorithms need an estimate of the motion relating the low-resolution inputs with sub-pixel accuracy, in order to set up the constraints on the high-resolution image pixel intensities that are necessary for estimating the super-resolution image. Generally speaking, image registration is the process of overlaying two or more images of the same scene taken under different conditions. It geometrically aligns two images—the reference image and the sensed image. The majority of registration methods consist of the following four steps: feature detection, feature matching, transform-model estimation, and image re-sampling with transformation [15]. In the context of super-resolution, a typical method for registering images is to find interest points in the low-resolution image set, then use robust methods such as RANSAC [16] to estimate the point correspondences and compute homographies between images.
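As a rough sketch of this geometric registration step, the snippet below matches ORB features and estimates a homography with RANSAC using OpenCV; OpenCV and the specific detector are assumed tools here, not necessarily those used by the authors.

    import cv2
    import numpy as np

    def register_pair(reference, sensed):
        """Estimate the homography that maps `sensed` onto `reference`."""
        orb = cv2.ORB_create(1000)
        kp_ref, des_ref = orb.detectAndCompute(reference, None)
        kp_sen, des_sen = orb.detectAndCompute(sensed, None)
        matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
        matches = sorted(matcher.match(des_sen, des_ref), key=lambda m: m.distance)
        src = np.float32([kp_sen[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
        dst = np.float32([kp_ref[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
        H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)
        return H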
With standard super-resolution methods, the super-resolution image is estimated after fixing the registration between the low-resolution observations. Accurate registration of the low-resolution images is critical for the success of super-resolution algorithms. Even very small errors in the registration of the low-resolution frames can have negative consequences for the estimation of the other super-resolution components. In many super-resolution methods, the authors consider image registration and super-resolution image estimation as two distinct stages. However, we prefer to consider that image registration should be optimized at the same time as recovering the super-resolution image, as these two stages are not truly independent: the super-resolution problem is akin to a neural network in which the components and their parameters are cross-connected with each other. Besides registration, the prior distribution is another core part of super-resolution. We need it to constrain the original high-resolution image in order to steer this inverse problem away from infeasible solutions. In practice, the exact selection of image priors is a tricky job, as we usually must balance image reconstruction accuracy against computational cost, since some priors are much more expensive to evaluate than others. In general, authors propose to use a Markov Random Field (MRF) with a smoothness function over the original image to enable robust reconstruction. The practicality of MRF models rests on the fact that, through the information contained in the local physical structure of images, we can obtain a sufficiently good global image representation [14]. Therefore, MRFs allow representation of an image in terms of local characterizations. A typical MRF model assumes that the image is locally smooth, except at natural discontinuities such as boundaries or edges, to avoid the excessive noise amplification at high frequencies seen in the ML solution. As well, we aim to make the MAP solution a unique optimal super-resolution image. As the basic ML solution is already convex, convexity-preserving smoothness priors become attractive. Gaussian image priors give a closed-form solution quickly, with a least-squares-style penalty term on image gradient values. Natural images contain edges, however, where there are locally high image gradients that are undesirable to smooth out. As a good substitute, the Huber function is quadratic for small values of its input but linear for large values, so it benefits from penalizing edges less severely than a Gaussian prior, whose potential function is purely quadratic. The smoothness functions vary with different parameters. Therefore, appropriate prior parameters would considerably benefit the super-resolution results. There are several methods available that allow us to learn the values of the prior parameters, to which we will return later.
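For reference, a minimal sketch of the Huber potential applied to image gradients, as it would appear in a Huber-MRF prior term; the threshold value passed in is an arbitrary choice of the caller.

    import numpy as np

    def huber(t, alpha):
        """Huber potential: quadratic for |t| <= alpha, linear beyond."""
        t = np.abs(t)
        return np.where(t <= alpha, t ** 2, 2 * alpha * t - alpha ** 2)

    def huber_prior_energy(x, alpha):
        """Sum of Huber penalties over horizontal and vertical image gradients."""
        dx = np.diff(x, axis=1)
        dy = np.diff(x, axis=0)
        return huber(dx, alpha).sum() + huber(dy, alpha).sum()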
4 The Adaptive Simultaneous Approach

4.1 Overview of This Adaptive Simultaneous Method

This simultaneous model consists of two distinct stages. As in Fig. 1, the first stage covers a set of initializations: the image registration, the point spread function, the prior parameters, and a coarse estimate of the super-resolution image. The second stage is a big optimization loop, which incorporates an outer loop that updates the prior parameters, the super-resolution image, and the registration parameters, and two inner loops; if the maximum absolute change in any parameter updated in the outer loop is beyond the preset convergence thresholds, the process returns to the beginning of the second stage.
Fig. 1. Basic structure of the adaptive simultaneous method. The first part is initialization as the start point for super-resolution. The second part is optimizing for accurate solutions.
4.2 Reference Image Selection

Image registration, as the first step of this simultaneous method, is crucial for the final super-resolution image. As interpreted above, photometric registration depends on geometric registration, which we usually achieve by a standard algorithm—RANSAC [16]. The problem of reference image selection, however, is too often ignored in current super-resolution techniques. In the procedure of geometric registration, the other images are aligned according to the reference image. Therefore, an inappropriate selection of the reference image would bring poor quality to the registration results, and then to the super-resolution image. In the vast majority of papers, the authors either pick it out randomly or omit the step. In our proposed super-resolution scheme, we handle this problem via a simple but effective measure, the Standard Mean Deviation (SMD), with which the information of a low-resolution image is estimated as:

    SMD = Σ_{k=1}^{R} Σ_{e=2}^{C} |I(k, e) − I(k, e−1)| + Σ_{k=2}^{R} Σ_{e=1}^{C} |I(k, e) − I(k−1, e)|      (4.1)
where R and C represent the number of row and column pixels in each input observation, respectively, and I(k, e) is the intensity of the pixel at (k, e). The approach we propose here for reference image selection is practically sensible. In most super-resolution cases, the reference image is hard to identify by observing the image content, since there is no huge difference but only a subtle shift between the low-resolution images. Through (4.1), a higher SMD means ample intensity changes, which implies that the image carries much information. We can keep as much detail as possible when the most informative image is chosen as the reference image, which provides a fine foundation for the following operations in the whole scheme.

4.3 Median-Value Image

In traditional methods, an average image is always chosen as the starting point for the super-resolution estimation. The average image is formed by a simple re-sampling
scheme applied to the registered input images. Every pixel in the average image is a weighted combination of pixels in the observations, which are generated from the original high-resolution image according to the weights in W. Although it is very robust to noise in the observations, the average image is too smooth, which brings inaccuracy into the super-resolution estimate. We overcome this problem by choosing a median-value image instead, each pixel of which is obtained by:
    x̃_j = median_k { x̂_j^(k) }                                                   (4.2)

    x̂_j^(k) = ( y_i^(k) − λ_β(k) ) / λ_α(k)                                       (4.3)

where x̂^(k) is recovered from the low-resolution pixels in observation k, each pixel of which is generated from the super-pixels contained within the pixel's Point Spread Function (PSF) according to the weights in W; therefore, x̂^(k) depends on the point spread function. The process of acquiring the median image is illustrated in Fig. 2. We obtain each pixel of the median-value image by selecting the median of the values recovered from all low-resolution images, as in (4.2). In (4.3), the lighting change is taken care of.
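A rough sketch of this initialization: each observation is photometrically corrected and resampled back to the super-resolution frame (plain bicubic zooming stands in for inverting W; an assumption), and the per-pixel median is taken.

    import numpy as np
    from scipy import ndimage

    def median_value_image(lr_images, lam_a, lam_b, zoom=2):
        """Median-value initialization for super-resolution.
        lr_images: list of registered low-resolution frames.
        lam_a, lam_b: per-frame photometric gain/offset estimates."""
        stack = []
        for y, a, b in zip(lr_images, lam_a, lam_b):
            corrected = (y - b) / a                    # undo the lighting change
            stack.append(ndimage.zoom(corrected, zoom, order=3))
        return np.median(np.stack(stack), axis=0)      # per-pixel median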
Fig. 2. Obtaining the median-value image. Going from the original high-resolution image to the low-resolution images is the imaging process. A pixel in the median-value image depends on certain pixels in the low-resolution images, which are created from the original high-resolution image according to the specific point spread function.
Fig. 3. Comparing the median image with the average image. The left is the ground-truth image. The center and the right are the average and median images, respectively, which are obtained from 5 synthetic low-resolution images. The median image outperforms the average one, which is overly smooth.
Note that, among the resulting images shown in Fig. 3, the average image is overly smooth compared with the original high-resolution image, while the resulting median-value image preserves edge information as well as removing noise.

4.4 Learning Prior Parameters via Weighted Cross-Validation

As interpreted in Section 3, an appropriate prior distribution, acting as a regularizer, is crucial to keep the super-resolution image away from infeasible solutions. Different super-resolution results are obtained for varying prior parameters, such as the prior strength and the function "shape" in the Huber-MRF. Rather than choosing the prior parameters empirically, learning them from the inputs is more sensible. Cross validation is a typical method to estimate these parameters. Recently, Pickup [10] raised a pixel-wise validation approach, which has a clear advantage over cross-validation performed whole-image-wise. Leave-one-out is the basic philosophy of Cross-Validation (CV); in the case of super-resolution, the idea is that a good selection of prior parameters should predict the missing low-resolution image among the inputs. That is, if an arbitrary observation of all the low-resolution images is left out, then the regularized super-resolution image, obtained with the selected prior parameters, should be able to predict this one fairly well. In previous CV for super-resolution, the input observations are split into two sets, a training set and a validation set; from the former we obtain the super-resolution image, and on the latter the error is evaluated. We would make mistakes, however, when some validation images are misregistered. Instead, in the pixel-wise method, validation pixels are selected at random from the collection of all the pixels comprising the input observations. The cross-validation error is measured by holding back a small percentage of the low-resolution pixels in each image and performing Huber-MAP super-resolution using the remaining pixels. The obtained super-resolution image is then projected down into the low-resolution frames under the corresponding generative model, and the mean absolute error is recorded. By this recorded error, we determine whether the super-resolution estimate has improved and is therefore worth considering in the gradient descent when we optimize the prior parameters. We leave out a small percentage of pixels from each low-resolution image and seek the value of the prior parameters that minimizes the prediction error, measured by the CV function:
    ε_CV(ν) = Σ_k | y_k^(v) − W_k^(v) x̂_ν |                                       (4.5)

where the prior operator is chosen as the Huber-MRF in our experiment and the prior parameter ν is a scalar; x̂_ν denotes the regularized solution obtained with prior parameter ν, and y_k^(v), W_k^(v) pick out the held-back validation pixels of the kth observation. Note that the lighting change is not taken into account, as the global photometric variation does not contribute to the gradient direction. Although this developed CV overcomes the misregistration problem, we would lose many valuable pixels in each observation that could otherwise be in the training set. In this
paper, we utilize the Weighted Cross Validation (WCV) [17] method in the process of learning the prior parameters, which avoids the misregistration problem while keeping as many pixels as possible in the training set. The WCV method is a "leave-weighted-one-out" prediction method: we leave out part of the pixels of an observation in turn and optimize the value of the prior parameter that minimizes the prediction error. The WCV procedure for one image is illustrated in Fig. 4; note that the operation has to be carried out for every low-resolution image in each iteration. In fact, in leaving out the whole ith observation, the derivation seeks to minimize the prediction error | y_i − W_i x |, where x is the minimizer of

    Σ_{k≠i} | y_k − W_k x |² + ν | Ω x |²                                          (4.6)
Fig. 4. Weighted cross validation for one low-resolution image. For each input observation, we first leave out part of its pixels as the validation set; the remaining pixels, together with all the other images, are taken as the training set for super-resolution. We then project the obtained super-resolution image into the low-resolution frame, and the error is recorded.
We can define a diagonal matrix whose jth entry is 0 and whose remaining entries are 1; the above leave-one-out minimization is then equivalent to a masked minimization in which the contribution of the jth observation is zeroed out by this matrix.
The weighted cross validation method is derived in a similar manner, but we instead use a weighted "leave-one-out" philosophy: we define a new diagonal matrix in which the jth diagonal entry is replaced by √ω and the remaining entries are 1. By using the WCV method, we seek a super-resolution estimate that solves the corresponding weighted minimization problem.
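A schematic of the pixel-wise validation loop described above: a fraction of pixels is held back per frame and a candidate prior parameter is scored by the held-out absolute error. solve_huber_map and project_down are hypothetical placeholders for the Huber-MAP solver and the generative projection, which are not given here.

    import numpy as np

    def cv_score(nu, lr_images, solve_huber_map, project_down, holdout=0.2,
                 rng=np.random.default_rng(0)):
        """Score a prior parameter `nu` by pixel-wise cross validation.
        solve_huber_map(images, masks, nu) -> super-resolution estimate.
        project_down(x, k) -> synthetic low-resolution frame k from estimate x."""
        masks = [rng.random(y.shape) > holdout for y in lr_images]   # True = training pixel
        x_hat = solve_huber_map(lr_images, masks, nu)
        errors = []
        for k, (y, mask) in enumerate(zip(lr_images, masks)):
            pred = project_down(x_hat, k)
            errors.append(np.abs(pred[~mask] - y[~mask]).mean())     # held-out MAE
        return float(np.mean(errors))

    # The prior parameter is then chosen to minimize this score, e.g.
    # best_nu = min(candidates, key=lambda nu: cv_score(nu, lr_images, solve_huber_map, project_down))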
5 Experimental Results

A set of selected frames from a clip of a video record is used to test the proposed approach. Fig. 5 shows one of the sequences, consisting of 9 frames, and the area of interest, the logo of Newcastle University, Australia. The logo sequences are unclear as a result of down-sampling, warping, and noise. In our experiment, we compare the result obtained by the proposed method with the reconstructions via ML and traditional MAP. We choose a Gaussian PSF with std = 0.375 and a zoom factor of 2 for all three methods, a Huber prior with parameters 0.8 and 0.04 for MAP, and Weighted Cross Validation with w = 0.2 for learning the prior parameters. As the three super-resolution images shown in Fig. 6 demonstrate, the proposed method outperforms the others.
Fig. 5. Input low-resolution frames from a video record. The top is one of the 9 frames from the clip. The bottom row shows the areas of interest from all 9 frames.
Fig. 6. Super-resolution results obtained through the three methods. The left is the ML result, in which the corruption by amplified noise is clearly visible. The middle is the reconstruction from MAP. The right is the super-resolution result of the adaptive simultaneous method, in which significant improvement can be seen.
6 Conclusion A novel simultaneous approach to super-resolution is presented in this paper. We make three contributions. First, we introduce an effective method for reference image selection in image registration, a step which is usually ignored. Next, a median-value image is taken in place of the average image as the initialization, and we show its advantages. Finally, we utilize Weighted Cross-Validation for learning the prior parameters; this method keeps as many pixels in the training set as possible while avoiding the
misregistration problem. Finally, experimental results demonstrate that this adaptive method outperforms the alternatives. Further work will include more robust methods for reference image selection and super-resolution with object tracking.
References
1. Multimedia Design and Entertainment, http://www.clearleadinc.com
2. Borman, S., Stevenson, R.L.: Spatial Resolution Enhancement of Low-resolution Image Sequences. Technical report, Department of Electrical Engineering, University of Notre Dame, Notre Dame, Indiana, USA (1998)
3. Park, S.C., Park, M.K., Kang, M.G.: Super Resolution Image Reconstruction: A Technical Overview. IEEE Signal Processing Magazine, 1053–5888 (2003)
4. Tsai, R.Y., Huang, T.S.: Uniqueness and Estimation of Three Dimensional Motion Parameters of Rigid Objects with Curved Surfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 6, 13–27 (1984)
5. Tom, B.C., Katsaggelos, A.K., Galatsanos, N.P.: Reconstruction of a High Resolution Image from Registration and Restoration of Low Resolution Images. In: Proceedings of the IEEE International Conference on Image Processing (ICIP), pp. 553–557 (1994)
6. Nguyen, N., Milanfar, P., Golub, G.: Efficient Generalized Cross Validation with Applications to Parametric Image Restoration and Resolution Enhancement. IEEE Transactions on Image Processing 10(9), 1299–1308 (2001)
7. Peleg, S., Keren, D., Schweitzer, L.: Improving Image Resolution Using Subpixel Motion. Pattern Recognition Letters 5(3), 223–226 (1987)
8. Irani, M., Peleg, S.: Motion Analysis for Image Enhancement: Resolution, Occlusion, and Transparency. Journal of Visual Communication and Image Representation 4, 324–355 (1993)
9. Capel, D.P.: Image Mosaicing and Super-resolution. PhD thesis, University of Oxford (2001)
10. Pickup, L.: Machine Learning in Multi-frame Image Super-resolution. PhD thesis, University of Oxford (2001)
11. Farsiu, S., Robinson, M.D., Elad, M., Milanfar, P.: Fast and Robust Multi-Frame Super Resolution. IEEE Transactions on Image Processing 13(10), 1327–1344 (2004)
12. Irani, M., Peleg, S.: Improving Resolution by Image Registration. Graphical Models and Image Processing 53, 231–239 (1991)
13. Suresh, K.V., Mahesh Kumar, G., Rajagopalan, A.N.: Super-resolution of License Plates in Real Traffic Videos. IEEE Transactions on Intelligent Transportation Systems 8(2) (2007)
14. Zhao, W., Sawhney, H.S.: Is Super-resolution with Optical Flow Feasible? In: Heyden, A., Sparr, G., Nielsen, M., Johansen, P. (eds.) ECCV 2002. LNCS, vol. 2350, pp. 599–613. Springer, Heidelberg (2002)
15. Zitova, B., Flusser, J.: Image Registration Methods: A Survey. Image and Vision Computing 21, 977–1000 (2003)
16. Hartley, R.I., Zisserman, A.: Multiple View Geometry in Computer Vision, 2nd edn. Cambridge University Press, Cambridge (2004)
17. Chung, J., Nagy, J.G., O'Leary, D.: Weighted-GCV Method for Lanczos-Hybrid Regularization. Electronic Transactions on Numerical Analysis 28, 149–167 (2008)
Facial Expression Recognition on Hexagonal Structure Using LBP-Based Histogram Variances Lin Wang1 , Xiangjian He1,2 , Ruo Du2 , Wenjing Jia2 , Qiang Wu2 , and Wei-chang Yeh3 1
Video Surveillance Laboratory Guizhou University for Nationalities Guiyang, China [email protected] 2 Centre for Innovation in IT Services and Applications (iNEXT) University of Technology, Sydney Australia {Xiangjian.He,Wenjing.Jia-1,Qiang.Wu}@uts.edu.au, [email protected] 3 Department of Industrial Engineering and Engineering Management National Tsing Hua University Taiwan [email protected]
Abstract. In our earlier work, we proposed an HVF (Histogram Variance Face) approach and proved its effectiveness for facial expression recognition. In this paper, we extend the HVF approach and present a novel approach for facial expression recognition. We take into account the human perspective and understanding of facial expressions. For the first time, we propose to use the Local Binary Pattern (LBP) defined on the hexagonal structure to extract local, dynamic facial features from facial expression images. The dynamic LBP features are used to construct a static image, namely the Hexagonal Histogram Variance Face (HHVF), for the video representing a facial expression. We show that HHVFs representing the same facial expression (e.g., surprise, happiness, sadness, etc.) are similar even when the performers and frame rates differ. Therefore, the proposed approach can be utilised for dynamic expression recognition. We have tested our approach on the well-known Cohn-Kanade AU-Coded Facial Expression database and found improved accuracy of HHVF-based classification compared with the HVF-based approach. Keywords: Histogram Variance Face, Action Unit, Hexagonal structure, PCA, SVM.
1 Introduction Human facial expressions reflect human emotions, moods, attitudes and feelings. Recognising expressions can help computers learn more about human mental activities and react more sophisticatedly, so it has enormous potential in the field of human-computer interaction (HCI). Explicitly, expressions are facial muscular movements. Six basic emotions (happiness, sadness, fear, disgust, surprise and anger) were
defined in the Facial Action Coding System (FACS) [4]. FACS consists of 46 action units (AU) which depict basic facial muscular movements. Basically, capturing the expression features precisely is vital for expression recognition. Approaches to obtaining expression features can be divided into two categories: spatial and spatio-temporal. Because spatial approaches [5][10][24] do not model the dynamics of facial expressions, spatio-temporal approaches have dominated recent research on facial expression recognition. In their spatio-temporal approaches, Maja et al. [17][18][23] detected AUs by using individual-feature GentleBoost [22] templates built from Gabor wavelet features and tracked temporal AUs based on Particle Filter (PF); SVMs [2][19][20] were then applied for classification. Petar et al. [1] treated the facial expression as a dynamic process and proposed that the performance of an automatic facial expression recognition system could be improved by modeling the reliability of different streams of facial expression information using multistream Hidden Markov Models (HMMs) [1]. Although the above spatio-temporal approaches take dynamic features into account, the model parameters are often hard to obtain accurately. In our previous approach [3], the extraction of expression features saved the dynamic features into a Histogram Variance Face (HVF) image by computing the texture histogram variances among the frames of a face video. The common Local Binary Pattern (LBP) [12][13] was employed to extract the face texture. We classified the HVFs using Support Vector Machines (SVMs) [2][19][20] after applying Principal Component Analysis (PCA) [15][16] for dimensionality reduction. The accuracy of HVF classification was very encouraging and highly matched human perception of the original videos. In this paper, we extend our work in [3]. We, for the first time, apply LBPs defined on the hexagonal structure [6] to extract Hexagonal Histogram Variance Faces (HHVFs). This novel approach not only greatly reduces the computation costs but also improves the accuracy of facial expression recognition. The rest of the paper is organised as follows. Section 2 introduces LBPs on the hexagonal structure and describes the approach to generating the HHVF image from an input video based on the LBP operator. Section 3 presents the dimensionality reduction using PCA, and the training and recognition steps using SVMs. Experimental results are demonstrated in Section 4. Section 5 concludes the paper.
2 Hexagonal Histogram Variance Face (HHVF) The HVF image is a representation of the dynamic features in a face video [3]. An extension of the HVF represented on the hexagonal structure, namely the HHVF, is presented in this section. 2.1 Fiducial Point Detection and Face Alignment For different expression videos, the scales and locations of human faces in the frames normally vary. To make all the HHVFs have the same scale and location, it is critical to detect the face fiducial points. Bilinear interpolation is used to scale the face images to the same size. To detect the fiducial points, we apply the Viola-Jones face detector [21], a real-time face detection scheme based on Haar-like features and AdaBoost learning. We detect and locate the positions of the eyes for the images in the Cohn-Kanade
expression database [9]. Each face image is cut and scaled according to the eyes' positions and the distance between the eyes. 2.2 Conversion from Square Structure to Hexagonal Structure We follow the work shown in [7] to represent images on the hexagonal structure. As shown in Figure 1, the hexagonal pixels appear only on the columns where the square pixels are located. As illustrated in Figure 1, for a given hexagonal pixel (denoted by X but not shown in the figure), there exist two square pixels (denoted by A and B but again not shown in the figure), lying on two consecutive rows and the same column as X, such that point X falls between A and B. Therefore, we can use the linear interpolation algorithm to obtain the light intensity value of X from the intensities of A and B. When we display the image on the hexagonal structure, every two hexagonal rows as shown in Figure 1 are combined into one single square row with their columns unchanged.
Fig. 1. A 9x8 square structure and a constructed 14x8 hexagonal structure [7]
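As an illustration of the interpolation just described, the sketch below resamples a square-structure image onto hexagonal rows column by column. The 9-to-14 row ratio is taken from the Fig. 1 example and the uniform row spacing is a simplifying assumption; the exact hexagonal addressing of [7] is more involved.

```python
import numpy as np

def square_to_hex(img, n_hex_rows=None):
    """Resample a square-structure image onto hexagonal rows by linear
    interpolation along each column (columns are kept unchanged)."""
    n_rows, n_cols = img.shape
    if n_hex_rows is None:
        # Fig. 1 maps a 9x8 square grid to a 14x8 hexagonal grid; the same
        # row ratio is assumed here.
        n_hex_rows = int(round(n_rows * 14 / 9))
    hex_img = np.empty((n_hex_rows, n_cols), dtype=np.float64)
    # Vertical positions of the hexagonal rows, expressed in square-row units.
    ys = np.linspace(0, n_rows - 1, n_hex_rows)
    for r, y in enumerate(ys):
        a = int(np.floor(y))              # square pixel row A above X
        b = min(a + 1, n_rows - 1)        # square pixel row B below X
        t = y - a                         # fractional position of X between A and B
        hex_img[r] = (1.0 - t) * img[a] + t * img[b]
    return hex_img

# Example: the 9x8 grid from Fig. 1 becomes a 14x8 hexagonal grid.
square = np.arange(72, dtype=float).reshape(9, 8)
hexa = square_to_hex(square)
print(square.shape, "->", hexa.shape)   # (9, 8) -> (14, 8)
```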
2.3 Preprocessing and LBP Texturising After each input face image is aligned, cut, rescaled and normalized, we replace the values of the pixels outside the elliptic area around the face with 255 and keep the pixel values unchanged within the elliptic face area. To eliminate illumination interference, we employ an LBP operator [6][12][13] to extract the texture (i.e., LBP) values in each masked face. LBP was originally introduced by Ojala et al. in [11] as a texture descriptor and was defined on the traditional square image structure. The basic form of an LBP operator labels the pixels of an image by thresholding the 3 × 3 neighborhood of each pixel by the grey value of the pixel at the centre. An illustration of the basic LBP operator is
Fig. 2. An example of computing LBP in a 3 × 3 neighborhood on square structure [6]
Fig. 3. An example of computing HLBP in a 7-pixel hexagonal cluster [6]
shown in Figure 2. Similar to the construction of the basic LBP on the square structure, the basic LBP on the hexagonal structure, called the Hexagonal LBP (HLBP), is constructed as shown in Figure 3 [6] on a cluster of 7 hexagonal pixels. By defining the HLBPs on the hexagonal structure, the number of different patterns is reduced from 2^8 = 256 on the square structure to 2^6 = 64 on the hexagonal structure. More importantly, because all neighboring pixels of a reference pixel have the same distance to it, the grey values of the neighboring pixels make the same contributions to the reference pixel on the hexagonal structure. 2.4 Earth Mover's Distance (EMD) Earth Mover's Distance (EMD) [14] is a cross-bin approach and is able to address the shift problem caused by noise, because slight histogram shifts do not affect the EMD much. EMD is consistent with human vision in that two histograms will in most cases have a greater EMD value if they look more different. EMD can be formalized as the following linear programming problem. Let

P = \{(p_1, w_{p_1}), \ldots, (p_m, w_{p_m})\}   (1)
be the first signature with m clusters, where p_i is the cluster representative and w_{p_i} is the weight of the cluster. Let

Q = \{(q_1, w_{q_1}), \ldots, (q_n, w_{q_n})\}   (2)
be the second signature with n clusters. Let also D = [d_{ij}] be the ground distance matrix, where d_{ij} is the ground distance between clusters p_i and q_j. Then EMD is to find a flow F = [f_{ij}], where f_{ij} is the flow between p_i and q_j, that minimizes the overall cost [14]

WORK(P, Q, F) = \sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij}   (3)

subject to the following constraints [14]:
1. f_{ij} ≥ 0; i ∈ [1, m], j ∈ [1, n],
2. \sum_{j=1}^{n} f_{ij} ≤ w_{p_i}; i ∈ [1, m],
3. \sum_{i=1}^{m} f_{ij} ≤ w_{q_j}; j ∈ [1, n], and
4. \sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij} = \min\big( \sum_{i=1}^{m} w_{p_i}, \; \sum_{j=1}^{n} w_{q_j} \big).

EMD is defined as the resulting work normalized by the total flow [14]:

EMD(P, Q) = \frac{\sum_{i=1}^{m} \sum_{j=1}^{n} d_{ij} f_{ij}}{\sum_{i=1}^{m} \sum_{j=1}^{n} f_{ij}}   (4)
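For illustration, the snippet below solves exactly this transportation problem for two small grey-level histograms with a squared-difference ground distance, using a generic LP solver; it is a sketch rather than the implementation used by the authors.

```python
import numpy as np
from scipy.optimize import linprog

def emd(w_p, w_q, p_vals, q_vals):
    """Earth Mover's Distance with squared-difference ground distance."""
    m, n = len(w_p), len(w_q)
    d = (p_vals[:, None] - q_vals[None, :]) ** 2          # ground distance matrix
    c = d.ravel()                                          # minimize sum d_ij * f_ij
    # Row constraints: sum_j f_ij <= w_p_i ; column constraints: sum_i f_ij <= w_q_j.
    A_ub = np.zeros((m + n, m * n))
    for i in range(m):
        A_ub[i, i * n:(i + 1) * n] = 1.0
    for j in range(n):
        A_ub[m + j, j::n] = 1.0
    b_ub = np.concatenate([w_p, w_q])
    # Total flow equals the smaller of the two total weights (constraint 4).
    A_eq = np.ones((1, m * n))
    b_eq = [min(w_p.sum(), w_q.sum())]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=(0, None), method="highs")
    flow = res.x.reshape(m, n)
    return (d * flow).sum() / flow.sum()                   # Eq. (4)

# Two toy 8-bin grey-level histograms over pixel values 0..7.
vals = np.arange(8, dtype=float)
h1 = np.array([4, 3, 2, 1, 0, 0, 0, 0], dtype=float)
h2 = np.array([0, 0, 0, 1, 2, 3, 4, 0], dtype=float)
print("EMD =", emd(h1, h2, vals, vals))
```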
In general, the ground distance d_{ij} can be any distance and will be chosen according to the problem in question. We employ EMD to measure the distance between two histograms when calculating histogram variances in the temporal direction. In our case, p_i and q_j are the grayscale pixel values, which are in [0, 63]; w_{p_i} and w_{q_j} are the pixel distributions at p_i and q_j respectively. The ground distance d_{ij} that we choose is the square of the Euclidean distance between p_i and q_j, i.e., d_{ij} = (p_i - q_j)^2. 2.5 Histogram Variances The following steps for computing the histogram variance are similar to the ones shown in [3].
1. Suppose that a sequence consists of P face texture images. We first break down each image evenly into M × N blocks, denoted by B_{x,y;k}, where x is the row index, y is the column index and k indicates the kth frame in the sequence. Here, the block size, rows and columns correspond to those of the images displayed on the square structure as described in Subsection 2.2. We then calculate the gray-value histogram of every B_{x,y;k}, denoted by H_{x,y;k}, where x = 0, 1, ..., M-1; y = 0, 1, ..., N-1; k = 0, 1, ..., P-1.
2. Calculate the histogram variance var(x, y):

var(x, y) = \frac{1}{P} \sum_{k=0}^{P-1} EMD(H_{x,y;k}, µ_{x,y}),   (5)

where µ_{x,y} is the mean histogram

µ_{x,y} = \frac{1}{P} \sum_{k=0}^{P-1} H_{x,y;k}   (6)

and EMD(H_{x,y;k}, µ_{x,y}) is the Earth Mover's Distance between H_{x,y;k} and µ_{x,y}.
3. Construct an M × N 8-bit grayscale image as our HHVF. Suppose that hhvf(x, y) denotes the pixel value at coordinate (x, y) in an HHVF image. Then, hhvf(x, y) is computed by

hhvf(x, y) = 255 - \frac{255 \cdot var(x, y)}{\max\{var(x, y)\}}.   (7)

Figure 4 shows some HHVF examples extracted from happiness, surprise and sadness videos respectively. To examine whether different block partitions affect the recognition, we obtain HHVFs with 3 × 3 blocks in our experiments, and then with 6 × 6 and 12 × 12 blocks.
Fig. 4. Examples of HHVF images
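A compact sketch of steps 1–3 above is given below. It takes pre-computed HLBP texture frames as input, and for brevity uses SciPy's 1-D Wasserstein distance (absolute-difference ground distance) in place of the squared-distance EMD above; the frame array, block grid and 64-level range are assumptions made for the example.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def hhvf(frames, M=6, N=6, levels=64):
    """Build an M x N Hexagonal Histogram Variance Face from texture frames.

    frames: array of shape (P, H, W) holding HLBP codes in [0, levels).
    """
    P, H, W = frames.shape
    bh, bw = H // M, W // N
    bins = np.arange(levels + 1)
    var = np.zeros((M, N))
    for x in range(M):
        for y in range(N):
            block = frames[:, x * bh:(x + 1) * bh, y * bw:(y + 1) * bw]
            # One normalized grey-value histogram H_{x,y;k} per frame.
            hists = np.stack([np.histogram(block[k], bins=bins, density=True)[0]
                              for k in range(P)])
            mu = hists.mean(axis=0)                      # mean histogram (Eq. 6)
            # Cross-bin distance of each frame's histogram to the mean (Eq. 5).
            var[x, y] = np.mean([
                wasserstein_distance(np.arange(levels), np.arange(levels),
                                     hists[k], mu)
                for k in range(P)])
    # Map variances to an 8-bit image (Eq. 7).
    return np.uint8(255 - 255 * var / var.max())

# Toy example: 9 random 60x60 "HLBP" frames.
rng = np.random.default_rng(1)
frames = rng.integers(0, 64, size=(9, 60, 60))
print(hhvf(frames))
```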
3 Classifying HHVF Images Using PCA+SVMs An HHVF records the dynamic features of an expression. As we can see in Figure 4, for the expressions of happiness, surprise and sadness, the homogeneous HHVFs look similar, and HHVFs belonging to different expressions have very distinct features. To verify the performance of the HHVF image features, we utilise the typical face recognition technologies PCA+SVMs. 3.1 PCA Dimensionality Reduction In our experiments, all pixel values of an HHVF image form an n × 1 column vector z_i ∈ R^n, and an n × l matrix Z = \{z_1, z_2, ..., z_l\} denotes the training set, which consists of l sample HHVF images. The PCA algorithm finds a linear orthonormal transformation matrix W_{n × r} (n ≫ r), projecting the original high n-dimensional feature space into a much lower r-dimensional feature subspace. x_i denotes the new feature vector:

x_i = W^T \cdot z_i \quad (i = 1, 2, ..., l).   (8)
The columns of matrix W (i.e., the eigenfaces [16]) are the eigenvectors corresponding to the largest r eigenvalues of the scatter matrix S:

S = \sum_{i=1}^{l} (z_i - µ)(z_i - µ)^T   (9)

where µ is the mean image of all HHVF samples, µ = \frac{1}{l} \sum_{i=1}^{l} z_i. 3.2 SVMs Training and Recognition
SVM [2][20][19] is an effective supervised classification algorithm, and its essence is to find a hyperplane that separates the positive and negative feature points with maximum margin in the feature space. Suppose α_i (i = 1, 2, ..., l) denotes the Lagrange parameters that describe the separating hyperplane ω in an SVM. Finding the hyperplane involves obtaining the nonzero solutions α_i (i = 1, 2, ..., l) of a Lagrangian dual problem. Once we have found all α_i for a given labeled training set \{(x_i, y_i) | i = 1, 2, ..., l\}, the decision function becomes:

f(x) = sgn\Big( \sum_{i=1}^{l} α_i y_i K(x, x_i) + b \Big),   (10)

where b is the bias of the hyperplane, l is the number of training samples, x_i (i = 1, 2, ..., l) is the vector of PCA projection coefficients of an HHVF, y_i is the label of the training datum x_i, and K(x, x_i) is the 'kernel mapping'. In this paper, we use linear SVMs, so

K(x, x_i) = \langle x, x_i \rangle,   (11)

where \langle x, x_i \rangle represents the dot product of x and x_i.
where b is the bias of the hyperplane, l is the number of training samples, xi (i = 1, 2, ..., l) is the vector of PCA projection coefficients of HHVFs, yi are the labels of train data xi for (i = 1, 2, ..., l), and K(x, xi ) is the ‘kernel mapping’. In this paper, we use linear SVMs, so K(x, xi ) = x, xi , (11) where x, xi represents the dot product of x and xi . Since the SVM is basically a two-class classification algorithm, here we adopt the pairwise classification (one-versus-one) for multi-class. In the pairwise classification approach, there is a two-class SVM for each pair of classes to separate the members of one class from the members of the others. There are maximum C62 = 15 two-class SVM classifiers for the classification of six classes of expressions. For recognising a new HHVF image, all C62 = 15 two-class classifiers are applied and the winner class is the one that receives the most votes.
4 Experiments 4.1 Dataset Our experiments use the Cohn-Kanade AU-Coded Facial Expression Database [9]. We select 49 subjects randomly from the database. Each subject has up to 6 expressions. The total number of expression images is 241. The image sequences belonging to the same expression have similar durations, but their frame rates vary from 15 to 30 frames per second. For each expression, we use about 80% of the HHVFs for PCA+SVMs training and use all HHVFs for classification (i.e., testing).
4.2 HHVFs Generation The faces of the selected subjects are detected and cut. These faces are then scaled to a size of 300 × 300 and aligned. The cut and rescaled images are then converted to images on the hexagonal structure; the size of the new images is 259 × 300. The HLBP operator is then applied to the images on the hexagonal structure. The final data dimension is reduced to 220 after the PCA operation. 4.3 Training and Recognition For supervised learning, the training data (HHVFs) are labeled according to their classes. Since the Cohn-Kanade database contains only the AU-coded combinations for image sequences instead of expression definitions (i.e., surprise, happy, anger, etc.), we need to label each HHVF with an expression definition manually according to FACS before feeding it to the SVMs. In terms of human perception, we can recognise the original image sequences of happiness and surprise with confidence, so the training data for these two classes can be labeled with high accuracy. This implies that these two expressions have evidently unique features. Our experimental results (Table 1) confirm this point with high recognition rates when we train and test only these two expressions. In Table 1, FPR stands for false positive rate.

Table 1. Recognition rates of happy and surprise HHVFs

          3 × 3 blocks            6 × 6 blocks            12 × 12 blocks
          Recognition rate  FPR   Recognition rate  FPR   Recognition rate  FPR
Happy     100%              2.17% 97.78%            2.17% 97.78%            2.17%
Surprise  97.82%            0.00% 97.82%            2.22% 97.82%            2.22%
When we manually label the HHVFs of anger, disgust, fear and sadness, nearly half of them are very challenging to classify correctly. Neither AU-coded combinations nor human perception can satisfactorily classify these facial expression sequences, especially the anger and sadness sequences. In fact, the AU-coded prototypes in FACS (2002 version) [9] create large overlaps among these expressions. From the perspective of human vision, one expression appearance of a person may reflect several different expressions, so the image sequence of a facial expression can often be interpreted as different expressions. An investigation [8] of facial expression recognition by humans indicates that, compared with the expressions of happy and surprise, the expressions of anger, fear, disgust and sadness are much more difficult for people to recognise (see Table 2). Despite the above-mentioned difficulties, we have tried our best to manually label the expressions. We have also conducted the following experiments: 1. We test only the anger, disgust, fear and sadness classes. For these four tough expressions, we train a set of classifiers consisting of C_4^2 = 6 two-class classifiers and test the HHVFs based on majority voting. Table 3 shows our results.
Table 2. A recent investigation of facial expression recognition by humans in [8]

                      Happy  Surprise  Angry  Disgust  Fear  Sadness
Recognition rate (%)  97.8   79.3      55.9   60.2     36.8  46.9
Table 3. Recognition rates of anger, disgust, fear and sadness HHVFs

          3 × 3 blocks             6 × 6 blocks             12 × 12 blocks
          Recognition rate  FPR    Recognition rate  FPR    Recognition rate  FPR
Angry     74.36%            13.51% 76.92%            13.51% 74.36%            14.41%
Disgust   73.68%            10.71% 73.68%             9.82% 71.05%            10.71%
Fear      68.57%             8.69% 71.43%             7.83% 68.57%             8.69%
Sadness   68.42%             5.36% 68.42%             5.36% 68.42%             5.35%
2. We consider all HHVFs and train C_6^2 = 15 two-class classifiers. We use this set of classifiers to recognise all HHVFs based on voting. We obtain the experimental results shown in Table 4.

Table 4. Recognition rates of all sorts of HHVFs

          3 × 3 blocks            6 × 6 blocks            12 × 12 blocks
          Recognition rate  FPR   Recognition rate  FPR   Recognition rate  FPR
Happy     93.33%            0.0%  95.56%            0.0%  95.56%            0.0%
Surprise  91.3%             0.51% 93.48%            0.51% 93.48%            0.51%
Angry     82.05%            7.43% 82.05%            5.94% 76.92%            5.94%
Disgust   78.95%            4.93% 81.58%            4.43% 81.58%            4.93%
Fear      77.14%            5.34% 80.00%            5.34% 77.14%            5.34%
Sadness   78.95%            0.49% 78.95%            0.49% 78.95%            1.48%
4.4 Discussion 1. From Table 1, we can see that both happy and surprise HHVFs produce very high recognition rates. For example, the happy HHVFs reach a 100% recognition rate with only a 2.17% false positive rate (FPR). These results coincide with our observations on the original image sequences, as humans can also easily identify the original happy and surprise sequences from the Cohn-Kanade database. 2. From Table 3, the recognition rates for anger, fear, disgust and sadness HHVFs are much lower. This reflects the challenges we encountered when labeling the training data: nearly half of the training data for these four expressions could not be labeled with full confidence because of the entanglement of expression features. 3. Table 4 shows the recognition results when all six expressions are fed to the SVMs for training. We can see that the happy and surprise HHVFs still stand out, while the rest are hampered by the entanglement of features. Taking into account our difficulties in labeling the training data, the recognition rates in Table 4 make sense.
4. The size of blocks is not critical to our results, although the segmentation into 6 × 6 blocks has the best performance in our experiments as shown in Table 4.
5 Conclusions Our experiments demonstrate that the HHVF is an effective representation of the dynamic and internal features of a face video or an image sequence. They show that the accuracy is improved compared with the results in [3]: the accuracy increases on average from 83.68% (see Table 5 in [3]) to 85.27% (see Table 4 above) when considering only the block size of 6 × 6. The HHVF is able to integrate the dynamic features of a certain duration of expression into a static image, through which static face recognition approaches can be utilised to recognise dynamic expressions. The application of HHVFs fills the gap between expression recognition and face recognition.
Acknowledgements This work is supported by the Houniao Program through Guizhou University for Nationalities, China.
References
1. Aleksic, P.S., Katsaggelos, A.K.: Automatic facial expression recognition using facial animation parameters and multistream HMMs. In: Information Forensics and Security, vol. 1, pp. 3–11 (2006)
2. Boser, B., Guyon, I., Vapnik, V.: A training algorithm for optimal margin classifiers. In: Fifth Annual Workshop on Computational Learning Theory, pp. 144–152 (1992)
3. Du, R., Wu, Q., He, X., Jia, W., Wei, D.: Local binary patterns for human detection on hexagonal structure. In: IEEE Workshop on Applications of Computer Vision, pp. 341–347 (2009)
4. Ekman, P., Friesen, W.: Facial Action Coding System. Consulting Psychologists Press, Palo Alto (1978)
5. Feng, X., Pietikäinen, M., Hadid, A.: Facial expression recognition with local binary patterns and linear programming. Pattern Recognition and Image Analysis 15(2), 546–548 (2005)
6. He, X., Li, J., Chen, Y., Wu, Q., Jia, W.: Local binary patterns for human detection on hexagonal structure. In: IEEE International Symposium on Multimedia, pp. 65–71 (2007)
7. He, X., Wei, D., Lam, K.-M., Li, J., Wang, L., Jia, W., Wu, Q.: Local binary patterns for human detection on hexagonal structure. In: Advanced Concepts for Intelligent Vision Systems (to appear, 2010)
8. Jinghai, T., Zilu, Y., Youwei, Z.: The contrast analysis of facial expression recognition by human and computer. In: ICSP, pp. 1649–1653 (2006)
9. Kanade, T., Cohn, J.F., Tian, Y.: Comprehensive database for facial expression analysis. In: Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, Grenoble, France, pp. 46–53 (2000)
10. Littlewort, G., Bartlett, M., Fasel, I., Susskind, J., Movellan, J.: Dynamics of facial expression extracted automatically from video. In: CVPR (2004)
11. Ojala, T., Pietikäinen, M., Harwood, D.: A comparative study of texture measures with classification based on feature distributions. Pattern Recognition 29, 51–59 (1996)
12. Ojala, T., Pietikäinen, M., Mäenpää, T.: Gray scale and rotation invariant texture classification with local binary patterns. In: Vernon, D. (ed.) ECCV 2000. LNCS, vol. 1842, pp. 404–420. Springer, Heidelberg (2000)
13. Ojala, T., Pietikäinen, M., Mäenpää, T.: Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 971–987 (2002)
14. Rubner, Y., Tomasi, C., Guibas, L.J.: The earth mover's distance as a metric for image retrieval. International Journal of Computer Vision 40(2), 99–121 (2000)
15. Turk, M., Pentland, A.: Eigenfaces for recognition. Journal of Cognitive Neuroscience 3(1), 71–86 (1991)
16. Turk, M., Pentland, A.: Face recognition using eigenfaces. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 586–591 (1991)
17. Valstar, M., Patras, I., Pantic, M.: Facial action unit detection using probabilistic actively learned support vector machines on tracked facial point data. In: 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (2005)
18. Valstar, M., Pantic, M.: Fully automatic facial action unit detection and temporal analysis. In: Computer Vision and Pattern Recognition Workshop, vol. 3, June 17-22, p. 149 (2006)
19. Vapnik, V.N.: Statistical Learning Theory. Wiley Interscience, Hoboken (1998)
20. Vapnik, V.: The Nature of Statistical Learning Theory. Springer, Heidelberg (1995)
21. Viola, P., Jones, M.J.: Robust real-time object detection. In: ICCV (2001)
22. Vukadinovic, D., Pantic, M.: Fully automatic facial feature point detection using Gabor feature based boosted classifiers. In: SMC, pp. 1692–1698 (2005)
23. Vukadinovic, D., Pantic, M.: Fully automatic facial feature point detection using Gabor feature based boosted classifiers. In: Systems, Man and Cybernetics (ICSMC), vol. 2, pp. 1692–1698 (2005)
24. Ying, Z., Fang, X.: Combining LBP and AdaBoost for facial expression recognition. In: ICSP, October 26-29, pp. 1461–1464 (2008)
Towards More Precise Social Image-Tag Alignment Ning Zhou1,2 , Jinye Peng1 , Xiaoyi Feng1 , and Jianping Fan1,2 1
School of Electronics and Information, Northwestern Polytechnical University, Xi’an, P.R. China {jinyepeng,fengxiao}@nwpu.edu.cn 2 Dept. of Computer Science, UNC-Charlotte, Charlotte, NC 28223, USA {nzhou,jfan}@uncc.edu
Abstract. Large-scale collections of user-contributed images with tags are increasingly available on the Internet. However, the uncertainty of the relatedness between the images and the tags prevents them from being precisely accessible to the public and from being leveraged for computer vision tasks. In this paper, a novel algorithm is proposed to better align images with their social tags. First, image clustering is performed to group the images into a set of image clusters based on their visual similarity contexts. By clustering images into different groups, the uncertainty of the relatedness between images and tags can be significantly reduced. Second, random walk is adopted to re-rank the tags based on a cross-modal tag correlation network which harnesses both image visual similarity contexts and tag co-occurrences. We have evaluated the proposed algorithm on a large-scale Flickr data set and achieved very positive results. Keywords: Image-tag alignment, relevance re-ranking, tag correlation network.
1
Introduction
Recently, as online photo sharing platforms (e.g., Flickr [7], Photostuff [9], etc.) have become increasingly popular, massive numbers of images have been uploaded onto the Internet and have been collaboratively tagged by a large population of real-world users. User-contributed tags provide semantically meaningful descriptors of the images, which are essential for large-scale tag-based retrieval systems to work in practice [14]. In such a collaborative image tagging system, users tag the images according to their own social or cultural backgrounds, personal knowledge and interpretation. Without controlling the tagging vocabulary and behavior, many tags in the user-provided tag list might be synonyms or polysemes or even spam. Also, many tags are personal tags which are weakly related or even irrelevant to the image content. A recent work reveals that many Flickr tags are imprecise and only around 50% of the tags are actually related to the image contents [11]. Therefore, the collaboratively tagged images are weakly-tagged images, because the social tags may not have exact correspondences with the underlying
image semantics. The synonyms, polysemes, spam and content-irrelevant tags attached to these weakly-tagged images may either return incomplete sets of relevant images or result in large amounts of ambiguous images or even junk images [4,6,5]. It is important to better align the images with the social tags, both to make them more searchable and to leverage these large-scale weakly-tagged images for computer vision tasks. By achieving more accurate alignment between the weakly-tagged images and their tags, we envision many potential applications: 1. Enabling Social Images to be More Accessible: The success of Flickr proves that users are willing to manually annotate their images with the motivation of making them more accessible to the general public [1]. However, the tags are provided freely, without a controlled vocabulary, based on users' own personal perceptions. Many social tags are synonyms or polysemes or even spam, which poses a big limitation for users who want to locate the images of interest accurately and efficiently. To enable more effective social image organization and retrieval, it is very attractive to develop new algorithms that achieve a more precise alignment between the images and their social tags. 2. Creating Labeled Images for Classifier Training: For many computer vision tasks, e.g. object detection and scene recognition, machine learning models are often used to learn the classifiers from a set of labeled training images [2]. To achieve satisfactory performance, large-scale training sets are required, because (a) the number of object classes and scenes of interest could be very large; (b) the learning complexity for some object classes and scenes could be very high because of visual ambiguity; and (c) a small number of labeled training images is incomplete or insufficient to interpret the diverse visual properties of large amounts of unseen test images. Unfortunately, hiring professionals to label large amounts of training images is costly, which is a key limitation on the practical use of some advanced computer vision techniques. On the other hand, the increasing availability of large-scale user-contributed digital images with tags on the Internet has provided new opportunities to harvest large-scale image sets for computer vision tasks. In this paper, we propose a clustering-based framework to align images and social tags. Having crawled a large number of weakly-tagged images from Flickr, we first cluster them into different groups based on visual similarities. For each group, we aggregate all the tags associated with the images within the cluster to form a larger set of tags. A random walk process is then applied to this tag set to further improve the alignment between images and tags, based on a tag correlation network which is constructed by combining both image visual similarity contexts and tag co-occurrences. We evaluated our proposed algorithm on a Flickr database with 260,000 images and achieved very positive results.
2
Image Similarity Characterization
To achieve more effective image content representation, four grid resolutions (whole image, 64 × 64, 32 × 32 and 16 × 16 patches) are used for image partition
and feature extraction. To sufficiently characterize various visual properties of the diverse image contents, we have extracted three types of visual features: (a) grid-based color histograms; (b) Gabor texture features; and (c) SIFT (scale invariant feature transform) features. For the color features, one color histogram is extracted for each image grid, so there are \sum_{r=0}^{3} 2^r × 2^r = 85 grid-based color histograms. Each grid-based color histogram consists of 36 RGB bins to represent the color distribution in the corresponding image grid. To extract the Gabor texture features, a Gabor filter bank, which contains twelve 21 × 21 Gabor filters in 3 scales and 4 orientations, is used. The Gabor filter is generated by using a Gabor function class. To apply the Gabor filters to an image, we need to calculate the convolutions of the filters and the image. We transform both the filters and the image into the frequency domain to obtain the products and then transform them back to the spatial domain, which calculates the Gabor-filtered images more efficiently. Finally, the mean values and standard deviations are calculated from the 12 filtered images, making up the 24-dimensional Gabor texture features. By using multiple types of visual features for image content representation, we are able to characterize the various visual properties of the images more sufficiently. Because each type of visual feature is used to characterize one certain type of visual property of the images, the visual similarity contexts between the images are more homogeneous and can be approximated more precisely by using one particular type of base kernel. Thus one specific base kernel is constructed for each type of visual feature. For two images u and v, their color similarity relationship can be defined as:

κ_c(u, v) = \frac{1}{2^R} D_0(u, v) + \sum_{r=0}^{R-1} \frac{1}{2^{R-r+1}} D_r(u, v)   (1)

where R = 4 is the total number of grid resolutions for image partition, D_0(u, v) is the color similarity relationship according to the full-resolution (image-based) color histograms, and D_r(u, v) is the grid-based color similarity at the rth resolution:

D_r(u, v) = \sum_{i=1}^{36} D(H_i^r(u), H_i^r(v))   (2)

where H_i^r(u) and H_i^r(v) are the ith components of the grid-based color histograms at the rth image partition resolution. The local similarity relationship of u and v can be defined as:

κ_s(u, v) = e^{-d_s(u,v)/σ_s}   (3)

d_s(u, v) = \frac{\sum_i \sum_j ω_i(u) ω_j(v) d_1(s_i(u), s_j(v))}{\sum_i \sum_j ω_i(u) ω_j(v)}   (4)

where σ_s is the mean value of d_s(u, v) over our images, and ω_i and ω_j are the Hessian values of the ith and jth interest points. Their textural similarity relationship can be defined as:

κ_t(u, v) = e^{-d_t(u,v)/σ_t}, \qquad d_t(u, v) = d_1(g_i(u), g_j(v))   (5)
where σ_t is the mean value of d_t(u, v) over our images, and d_1(g_i(u), g_j(v)) is the L1-norm distance between two Gabor textural descriptors. The diverse visual similarity contexts between the online images can be characterized more precisely by using a mixture of these three base image kernels (i.e., mixture-of-kernels) [3]:

κ(u, v) = \sum_{i=1}^{3} β_i κ_i(u, v), \qquad \sum_{i=1}^{3} β_i = 1   (6)

where u and v are two images, and β_i ≥ 0 is the importance factor for the ith base kernel κ_i(u, v). Combining multiple base kernels allows us to achieve a more precise characterization of the diverse visual similarity contexts between the two images.
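A minimal sketch of the mixture-of-kernels idea is shown below. The per-feature distances, the exponential base-kernel form and the β values are placeholders for illustration; in the paper the base kernels follow Eqs. (1)–(5) and the βs would be chosen, e.g., by validation.

```python
import numpy as np

def base_kernel(feats, sigma=None):
    """Base kernel exp(-d/sigma) from pairwise L1 distances between features."""
    d = np.abs(feats[:, None, :] - feats[None, :, :]).sum(axis=2)
    if sigma is None:
        sigma = d.mean() + 1e-12          # sigma set to the mean distance
    return np.exp(-d / sigma)

# Toy per-image features standing in for colour histograms, Gabor statistics
# and aggregated SIFT descriptors (their real extraction is described above).
rng = np.random.default_rng(3)
n_images = 50
color = rng.dirichlet(np.ones(36), size=n_images)      # 36-bin colour histograms
gabor = rng.normal(size=(n_images, 24))                 # 24-D Gabor means/stds
sift = rng.normal(size=(n_images, 128))                 # pooled SIFT descriptor

betas = np.array([0.4, 0.3, 0.3])                       # importance factors, sum to 1
kernels = [base_kernel(color), base_kernel(gabor), base_kernel(sift)]
K = sum(b * k for b, k in zip(betas, kernels))          # mixture-of-kernels, Eq. (6)
print(K.shape, K.diagonal()[:3])
```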
3
Image Clustering
To achieve more effective image clustering, a graph is first constructed for organizing the social images according to their visual similarity contexts [8], where each node on the graph is one particular image and the edge between two nodes characterizes the visual similarity context between the two images, κ(·, ·). Taking this image graph as the input of pairwise image similarities, automatic image clustering is achieved by passing messages between the nodes of the image graph through affinity propagation [8]. After the images are partitioned into a set of image clusters according to their visual similarity contexts, the cumulative inter-cluster visual similarity context s(G_i, G_j) between two image clusters G_i and G_j is defined as:

s(G_i, G_j) = \sum_{u ∈ G_i} \sum_{v ∈ G_j} κ(u, v)   (7)

The cumulative intra-cluster visual similarity contexts s(G_i, G_i) and s(G_j, G_j) are defined as:

s(G_i, G_i) = \sum_{u ∈ G_i} \sum_{v ∈ G_i} κ(u, v), \qquad s(G_j, G_j) = \sum_{u ∈ G_j} \sum_{v ∈ G_j} κ(u, v)   (8)

By using the cumulative intra-cluster visual similarity contexts to normalize the cumulative inter-cluster visual similarity context, the inter-cluster correlation c(G_i, G_j) between the image clusters G_i and G_j is defined as:

c(G_i, G_j) = \frac{2 s(G_i, G_j)}{s(G_i, G_i) + s(G_j, G_j)}   (9)

Three experimental results for image clustering are given in Fig. 1. From these experimental results, one can observe that visual-based image clustering can provide a good summarization of large amounts of images and discover comprehensive knowledge effectively. The images in the same cluster share similar visual properties, and their semantics can be effectively described by using the same set of tags.
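Because affinity propagation takes the pairwise similarities κ(u, v) directly as input, the clustering step and the inter-cluster correlation of Eqs. (7)–(9) can be sketched as follows; the toy kernel matrix stands in for the mixture-of-kernels.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def cluster_and_correlate(K):
    """Cluster images with affinity propagation and compute c(G_i, G_j)."""
    labels = AffinityPropagation(affinity="precomputed", random_state=0).fit(K).labels_
    clusters = [np.flatnonzero(labels == c) for c in np.unique(labels)]

    def s(a, b):                       # cumulative similarity, Eqs. (7)-(8)
        return K[np.ix_(a, b)].sum()

    n = len(clusters)
    c = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            c[i, j] = 2 * s(clusters[i], clusters[j]) / (
                s(clusters[i], clusters[i]) + s(clusters[j], clusters[j]))  # Eq. (9)
    return labels, c

# Toy symmetric similarity matrix standing in for the mixture-of-kernels K.
rng = np.random.default_rng(4)
X = rng.normal(size=(40, 5))
K = np.exp(-np.linalg.norm(X[:, None] - X[None, :], axis=2))
labels, corr = cluster_and_correlate(K)
print("clusters:", np.unique(labels).size)
print(np.round(corr, 2))
```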
Fig. 1. Image clustering results and their associated tags
4
Image and Tag Alignment
In order to filter out misspelled and content-irrelevant tags, we adopt a knowledge-based method [12] to discern content-related tags from content-unrelated ones and retain only the content-related tags to form our tag vocabulary. In particular, we first choose a set of high-level categories, including "organism", "artifact", "thing", "color" and "natural phenomenon", as a taxonomy of the domain knowledge in the computer vision area. The content relatedness of a particular tag is then determined by resorting to the WordNet [15] lexicon, which preserves a semantic structure among words. Specifically, for each tag, if it is in the noun set of WordNet, we traverse the path of its hypernyms until one of the pre-defined categories is matched (a sketch of this check is given below). If a match is found, the tag is considered content-related; otherwise it is regarded as content-unrelated. Rather than indexing the visually similar images in the same cluster by loosely using all of these tags, a novel tag ranking algorithm is developed for aligning the tags with the images according to their relevance scores. The alignment algorithm works as follows: (a) the initial relevance scores for the tags are calculated via tag-image co-occurrence probability estimation; (b) the relevance scores for these tags are then refined according to their inter-tag cross-modal similarity contexts based on the tag correlation network; and (c) the most relevant tags (the top k tags with the largest relevance scores) are selected automatically for image semantics description. Our contributions lie in integrating the tag correlation network and random walk for automatic relevance score refinement.
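A sketch of the WordNet filtering step using NLTK is given below. The category lemma list is a placeholder for the five high-level categories above, and matching hypernym paths by lemma name is an assumption about how the comparison is done.

```python
# Requires: pip install nltk, then nltk.download("wordnet")
from nltk.corpus import wordnet as wn

# Placeholder lemmas standing in for the five high-level categories.
CATEGORY_LEMMAS = {"organism", "artifact", "thing", "color", "natural_phenomenon"}

def is_content_related(tag):
    """A tag is content-related if some noun sense reaches a category via hypernyms."""
    for synset in wn.synsets(tag, pos=wn.NOUN):
        # Walk every hypernym path from this sense up to the WordNet root.
        for path in synset.hypernym_paths():
            lemmas = {lemma for s in path for lemma in s.lemma_names()}
            if lemmas & CATEGORY_LEMMAS:
                return True
    return False          # not a noun in WordNet, or no category matched

for tag in ["tiger", "sunset", "lol", "bridge"]:
    print(tag, "->", is_content_related(tag))
```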
4.1 Tag Correlation Network
When people tag images, they may use multiple different tags with similar meanings to alternatively describe the semantics of the relevant images. On the other hand, some tags have multiple senses under different contexts. Thus the tags are strongly inter-related, and such inter-related tags and their relevance scores should be considered jointly. Based on this observation, a tag correlation network is generated automatically for characterizing such inter-tag similarity contexts more precisely and for providing a good environment to refine the relevance
Fig. 2. Different views of our tag correlation network
scores for the inter-related tags automatically. In our tag correlation network, each node denotes a tag, and each edge indicates a pairwise inter-tag correlation. The inter-tag correlations consist of two components: (1) inter-tag co-occurrences; and (2) inter-tag visual similarity contexts between their relevant images. Rather than constructing such a tag correlation network manually, an automatic algorithm is developed for this aim. For two given tags (one tag pair) t_i and t_j, their visual similarity context γ(t_i, t_j) is defined as:

γ(t_i, t_j) = \frac{1}{|C_i||C_j|} \sum_{u ∈ C_i} \sum_{v ∈ C_j} κ(u, v)   (10)

where C_i and C_j are the image sets for the tags t_i and t_j, |C_i| and |C_j| are the numbers of web images in C_i and C_j, and κ(u, v) is the kernel-based visual similarity context between two images u and v within C_i and C_j respectively. The co-occurrence β(t_i, t_j) between two tags t_i and t_j is defined as:

β(t_i, t_j) = -P(t_i, t_j) \log \frac{P(t_i, t_j)}{P(t_i) + P(t_j)}   (11)

where P(t_i, t_j) is the co-occurrence probability for the two tags t_i and t_j, and P(t_i) and P(t_j) are the occurrence probabilities for the tags t_i and t_j. Finally, we define the cross-modal inter-tag correlation between t_i and t_j by combining their visual similarity context and co-occurrence:

φ(t_i, t_j) = α · γ(t_i, t_j) + (1 - α) · β(t_i, t_j)   (12)

where α is a weighting factor determined through cross-validation. Such a combined cross-modal inter-tag correlation provides a powerful framework for re-ranking the relevance scores between the images and their tags.
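The construction of φ(·, ·) can be sketched as follows; the visual kernel matrix, the tag-to-image index and the value of α are assumed inputs for the example.

```python
import numpy as np

def tag_correlations(K, tag_images, alpha=0.5):
    """Build the cross-modal inter-tag correlation matrix phi (Eqs. 10-12).

    K          : n_images x n_images visual kernel matrix kappa(u, v)
    tag_images : dict mapping each tag to the indices of its images
    """
    tags = sorted(tag_images)
    n_img = K.shape[0]
    phi = np.zeros((len(tags), len(tags)))
    occ = {t: len(ims) / n_img for t, ims in tag_images.items()}    # P(t)
    for a, ti in enumerate(tags):
        for b, tj in enumerate(tags):
            Ci, Cj = tag_images[ti], tag_images[tj]
            gamma = K[np.ix_(Ci, Cj)].mean()                        # Eq. (10)
            p_ij = len(set(Ci) & set(Cj)) / n_img                   # co-occurrence prob.
            beta = 0.0
            if p_ij > 0:
                beta = -p_ij * np.log(p_ij / (occ[ti] + occ[tj]))   # Eq. (11)
            phi[a, b] = alpha * gamma + (1 - alpha) * beta          # Eq. (12)
    return tags, phi

# Toy example with 6 images and 3 tags.
rng = np.random.default_rng(5)
K = np.abs(rng.normal(size=(6, 6))); K = (K + K.T) / 2
tags, phi = tag_correlations(K, {"beach": [0, 1, 2], "sea": [1, 2, 3], "car": [4, 5]})
print(tags); print(np.round(phi, 3))
```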
The tag correlation network for tagged image collections from Flickr is shown in Fig. 2, where each tag is linked with its most relevant tags whose values of φ(·, ·) are over a threshold. By seamlessly integrating the visual similarity contexts between the images and the semantic similarity contexts between the tags for tag correlation network construction, the tag correlation network provides a good environment for addressing the issues of polysemy and synonyms more effectively and for disambiguating the image senses precisely, which may allow us to find more suitable tags for more precise image-tag alignment.
4.2 Random Walk for Relevance Refinement
In order to take advantage of the tag correlation network to achieve more precise alignment between the images and their tags, a random walk process is performed for automatic relevance score refinement [13,10,16]. Given the tag correlation network with the n most frequent tags, we use ρ_k(t) to denote the relevance score for the tag t at the kth iteration. The relevance scores for all these tags at the kth iteration form a column vector \vec{ρ}(t) ≡ [ρ_k(t)]_{n × 1}. We further define Φ as an n × n transition matrix, whose element φ_{ij} defines the probability of the transition from the tag i to its inter-related tag j. φ_{ij} is defined as:

φ_{ij} = \frac{φ(i, j)}{\sum_{k} φ(i, k)}   (13)

where φ(i, j) is the pairwise inter-tag cross-modal similarity context between i and j as defined in (12). The random walk process is thus formulated as:

ρ_k(t) = θ \sum_{j ∈ Ω_j} ρ_{k-1}(j) φ_{tj} + (1 - θ) ρ(C, t)   (14)
where Ω_j is the set of first-order nearest neighbors of the tag j on the tag correlation network, ρ(C, t) is the initial relevance score for the given tag t, and θ is a weight parameter. This random walk process promotes the tags that have many nearest neighbors on the tag correlation network, i.e., the tags which have close visual-based interpretations of their semantics and higher co-occurrence probabilities. On the other hand, this random walk process also weakens the isolated tags on the tag correlation network, i.e., the tags which have weak visual correlations and low co-occurrence probabilities with other tags. The random walk process is terminated when the relevance scores converge. For a given image cluster C, all its tags are re-ranked according to their relevance scores. By performing the random walk over the tag correlation network, the refinement algorithm can leverage both the co-occurrence similarity and the visual similarity simultaneously to re-rank the tags more precisely. The top-k tags, which have the highest relevance scores with respect to the image semantics, are then selected as the keywords to interpret the images in the given image cluster C. Such an image-tag alignment process provides a better understanding of the cross-media information (images and tags) as it couples different sources of information
Fig. 3. Image-tag alignment: (a) image cluster; (b) ranked tags before performing random walk; (c) re-ranked tags after performing random walk
together and allows us to resolve the ambiguities that may arise from single-media analysis. Some experimental results for re-ranking the tags are given in Fig. 3. From these experimental results, one can observe that our image-tag alignment algorithm can effectively find the most relevant keywords to better align images with their corresponding tags.
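In matrix form, one iteration of Eq. (14) multiplies the previous score vector by the row-normalized transition matrix of Eq. (13); a sketch of the whole refinement loop is given below, with a made-up correlation matrix and uniform initial scores.

```python
import numpy as np

def rerank(phi, rho0, theta=0.5, tol=1e-8, max_iter=1000):
    """Random-walk refinement of tag relevance scores (Eqs. 13-14)."""
    # Row-normalize phi into the transition matrix, Eq. (13).
    P = phi / phi.sum(axis=1, keepdims=True)
    rho = rho0.copy()
    for _ in range(max_iter):
        new = theta * (P @ rho) + (1 - theta) * rho0      # Eq. (14)
        if np.abs(new - rho).max() < tol:                 # stop at convergence
            return new
        rho = new
    return rho

# Toy correlation matrix for 4 tags and uniform initial scores.
phi = np.array([[0.0, 0.8, 0.6, 0.1],
                [0.8, 0.0, 0.7, 0.1],
                [0.6, 0.7, 0.0, 0.1],
                [0.1, 0.1, 0.1, 0.0]])
scores = rerank(phi, rho0=np.full(4, 0.25))
print("re-ranked tags (best first):", np.argsort(-scores))
```

Here the well-connected tags gain score while the isolated fourth tag is weakened, mirroring the behaviour described above.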
5
Algorithm Evaluation
All our experiments are conducted on a data set of 260,000 weakly-tagged images collected from Flickr by using the Flickr API. Our algorithm evaluation focuses on assessing the effectiveness of image clustering and random walk within our algorithm for image-tag alignment. The accuracy rate ϱ is used to assess the effectiveness of the algorithms for image-tag alignment, given as:

ϱ = \frac{\sum_{i=1}^{N} δ(L_i, R_i)}{N}   (15)

where N is the total number of images, L_i is the set of the most relevant tags for the ith image obtained by the automatic image-tag alignment algorithms, and R_i is the set of keywords for the ith image given by a benchmark image set. δ(x, y) is a delta function,

δ(x, y) = \begin{cases} 1, & x = y, \\ 0, & \text{otherwise} \end{cases}   (16)
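The accuracy measure of Eqs. (15)–(16) simply averages exact matches between the aligned tag sets and the benchmark keyword sets, as in the small sketch below (the example sets are made up).

```python
def accuracy(predicted, benchmark):
    """Fraction of images whose aligned tag set equals the benchmark set (Eqs. 15-16)."""
    assert len(predicted) == len(benchmark)
    hits = sum(1 for L, R in zip(predicted, benchmark) if set(L) == set(R))
    return hits / len(predicted)

L_sets = [["tiger", "zoo"], ["bridge", "river"], ["sunset"]]
R_sets = [["tiger", "zoo"], ["bridge"], ["sunset"]]
print(accuracy(L_sets, R_sets))   # 2 of 3 images match exactly -> 0.667
```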
Fig. 4. Image-tag alignment accuracy rate, where top 20 images are evaluated interactively. Average accuracy: Without Clustering = 0.6516, Without Random Walk = 0.7489, Integration = 0.8086.
Fig. 5. Image-tag alignment accuracy rate, where top 40 images are evaluated interactively. Average accuracy: Without Clustering = 0.5828, Without Random Walk = 0.6936, Integration = 0.7373.
It is hard to obtain a suitably large benchmark image set for our algorithm evaluation task. To address this problem, an interactive image navigation system is designed to allow users to provide their assessments of the relevance between the images and the ranked tags. To assess the effectiveness of image clustering and random walk for image-tag alignment, we have compared the accuracy rates of our image-tag alignment algorithm under three different scenarios: (a) image clustering is not performed to reduce the uncertainty of the relatedness between the images and their tags; (b) random walk is not performed for relevance re-ranking; (c) both image clustering and random walk are performed to achieve more precise alignment between the images and their most relevant social tags. As shown in Fig. 4 and Fig. 5, incorporating image clustering for uncertainty reduction and performing random walk for relevance re-ranking significantly improves the accuracy rates for image-tag alignment.
6
Conclusions
In this paper, we have proposed a cluster-based framework to provide more precise image-tag alignment. By clustering visually similar images into clusters, the uncertainty between images and social tags is reduced dramatically. In order to further refine the alignment, we adopt a random walk process to re-rank the tags based on a cross-modal tag correlation network which is generated by using both image visual similarity and tag co-occurrences. Experimental results on a real Flickr data set have empirically justified our developed algorithm. The proposed research on image-tag alignment has two potential applications: (a) enabling more effective tag-based web social image retrieval with higher precision rates by finding more suitable tags for social image indexing; and (b) by harvesting socially tagged images from the Internet, our proposed research can create more representative image sets for training a large number of object and concept classifiers more accurately, which is a long-term goal of the multimedia research community. Acknowledgments. This research is partly supported by NSFC-61075014 and NSFC-60875016, by the Program for New Century Excellent Talents in University under Grants NCET-07-0693, NCET-08-0458 and NCET-10-0071, and by the Research Fund for the Doctoral Program of Higher Education of China (Grant No. 20096102110025).
References
1. Ames, M., Naaman, M.: Why we tag: motivations for annotation in mobile and online media. In: CHI, pp. 971–980 (2007)
2. Datta, R., Joshi, D., Li, J., Wang, J.Z.: Image retrieval: Ideas, influences, and trends of the new age. ACM Comput. Surv. 40(2) (2008)
3. Fan, J., Gao, Y., Luo, H.: Integrating concept ontology and multitask learning to achieve more effective classifier training for multilevel image annotation. IEEE Transactions on Image Processing 17(3), 407–426 (2008)
4. Fan, J., Luo, H., Shen, Y., Yang, C.: Integrating visual and semantic contexts for topic network generation and word sense disambiguation. In: CIVR (2009)
5. Fan, J., Shen, Y., Zhou, N., Gao, Y.: Harvesting large-scale weakly-tagged image databases from the web. In: Proc. of CVPR 2010 (2010)
6. Fan, J., Yang, C., Shen, Y., Babaguchi, N., Luo, H.: Leveraging large-scale weakly-tagged images to train inter-related classifiers for multi-label annotation. In: Proceedings of the First ACM Workshop on Large-scale Multimedia Retrieval and Mining, LS-MMRM 2009, pp. 27–34. ACM, New York (2009)
7. Flickr. Yahoo! (2005), http://www.flickr.com
8. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. Science 315, 972–976 (2007)
9. Halaschek-Wiener, C., Golbeck, J., Schain, A., Grove, M., Parsia, B., Hendler, J.: PhotoStuff – an image annotation tool for the semantic web. In: Gil, Y., Motta, E., Benjamins, V.R., Musen, M.A. (eds.) ISWC 2005. LNCS, vol. 3729. Springer, Heidelberg (2005)
10. Hsu, W.H., Kennedy, L.S., Chang, S.-F.: Video search reranking through random walk over document-level context graph. In: ACM Multimedia, pp. 971–980 (2007)
11. Kennedy, L.S., Chang, S.-F., Kozintsev, I.: To search or to label?: predicting the performance of search-based automatic image classifiers. In: Multimedia Information Retrieval, pp. 249–258 (2006)
12. Liu, D., Hua, X.-S., Wang, M., Zhang, H.-J.: Retagging social images based on visual and semantic consistency. In: WWW, pp. 1149–1150 (2010)
13. Liu, D., Hua, X.-S., Yang, L., Wang, M., Zhang, H.-J.: Tag ranking. In: WWW, pp. 351–360 (2009)
14. Sigurbjörnsson, B., van Zwol, R.: Flickr tag recommendation based on collective knowledge. In: Proceedings of the 17th International Conference on World Wide Web (WWW 2008), Beijing, China, April 21-25, pp. 327–336 (2008)
15. Stark, M.M., Riesenfeld, R.F.: WordNet: An electronic lexical database. In: Proceedings of the 11th Eurographics Workshop on Rendering. MIT Press, Cambridge (1998)
16. Tan, H.-K., Ngo, C.-W., Wu, X.: Modeling video hyperlinks with hypergraph for web video reranking. In: Proceedings of the 16th ACM International Conference on Multimedia, MM 2008, pp. 659–662. ACM, New York (2008)
Social Community Detection from Photo Collections Using Bayesian Overlapping Subspace Clustering Peng Wu1, Qiang Fu2, and Feng Tang1 1
Multimedia Interaction and Understanding Lab, HP Labs 1501 Page Mill Road, Palo Alto, CA, USA {peng.wu,feng.tang}@hp.com 2 Dept. of Computer Science & Engineering University of Minnesota, Twin Cities [email protected]
Abstract. We investigate the discovery of social clusters from consumer photo collections. People's participation in various social activities is the basis on which social clusters are formed. The photos that record those social activities can reflect the social structure of the people to a certain degree, depending on the extent to which the photos cover the social activities. In this paper, we propose to use the Bayesian Overlapping Subspace Clustering (BOSC) technique to detect such social structure. We first define a social closeness measurement that takes into account people's co-appearance in photos, the frequency of co-appearances, etc., from which a social distance matrix can be derived. BOSC is then applied to this distance matrix for community detection. BOSC possesses two merits that fit well with the social community context: one is that it allows overlapping clusters, i.e., one data item can be assigned multiple memberships; the other is that it can distinguish insignificant individuals and exclude them from cluster formation. The experimental results demonstrate that, compared with a partition-based clustering approach, this technique can reveal a more sensible community structure. Keywords: Clustering, Social Community.
1 Introduction Social relationships are formed based on people's social activities. An individual's daily activities define, evolve and reflect his/her social relationship with the rest of the world. Given the tremendous business value embedded in the knowledge of such social relationships, many industrial and academic efforts have been devoted to revealing (partial) social relationships by studying certain types of social activities, such as email, online social networking, and instant messaging. In this paper, we present our work on discovering people's social clusters and relationship closeness from analyzing photo collections. The work presented in [1] focuses on human evaluation of the social relationships in photos but does not construct the relationships through photo analysis. In [5], the social relationships are considered known and are used to improve face recognition performance.
Outside the scope of photo media, many works have been reported that discover social relationships in other kinds of social activities, such as [4], which constructs social structure by analyzing emails. In [6], a graph partition based clustering algorithm is proposed to address social community detection from a collection of photos. Given a collection of photos, the faces identified through tagging or face detection are first grouped into different people identities. These identities form the vertices in the social graph, and the strength of the connection among them is measured by a social distance metric that takes a number of factors into account. Given the vertices of people and the distance measure between vertices, an undirected graph is constructed. Social community detection is then formulated as a graph clustering problem, and an eigenvector-based graph clustering approach [3] is adopted to detect the communities. Although the eigenvector-based graph clustering algorithm detects a reasonable set of communities, there are two prominent shortcomings of the graph partition based clustering approach for detecting social communities. First, one vertex can only belong to one cluster. In a social context, that is equivalent to enforcing that each person plays only one social role in society, which is almost universally untrue in real life. Second, every vertex has to belong to a cluster. It is common that a photo collection contains not only the active members of communities, but also some individuals who just happen to be present or are passing-through attendees of social events. However, as long as they are captured in the social graph, they will be assigned to a certain community, although they should instead be treated as "noise" data from an analysis perspective. These shortcomings motivate us to use the Bayesian Overlapping Subspace Clustering technique [7] to address the social community detection challenge, as it has intrinsic treatment of noise data entries and support for overlapping memberships.

The rest of the paper is organized as follows: Section 2 introduces a distance measurement of the social closeness of people in a photo collection and the resulting distance matrix; Section 3 provides an overview of the BOSC framework and, in particular, its application to the distance matrix; experiment results are presented in Section 4; and Section 5 concludes the paper with a short discussion.
2 Social Closeness

Denote all the photos in the collection as the set I = {I_1, I_2, …, I_M}, and all the people that appear in the photos as the set P = {P_1, P_2, …, P_N}. We identify people in the photo collection through two channels: 1) automatic face detection and clustering; 2) manual face tagging. If a person's face was manually tagged, face detection and clustering are skipped to respect the subjective indication. Otherwise, the algorithms proposed in [2] are applied to produce people identifiers. The end result of this identification process is a list of rectangles {R(P_i, I_j)} for any P_i ∈ P that is found in I_j ∈ I.

For any two people P_i and P_j, we define a function to indicate the social closeness of the pair, formed by taking the following heuristics into account: 1) if P_i's and P_j's face locations are close/closer in photos, they are also close/closer in relationship; 2) the more faces are found in a photo, the less trustworthy the face location distance is; 3) the more co-appearances of P_i and P_j, the more trustworthy the face location distance is. These assumptions are captured in the following formula:

d(P_i, P_j) = \left[ \frac{1}{q} \sum_{l=1}^{q} \left( d_{I_l}(P_i, P_j) \ast f_{I_l} - 1 \right) \right]    (1)

where I_l, l = 1, …, q, are all the photos that contain both P_i and P_j, f_{I_l} is the number of faces detected in photo I_l, and d_{I_l}(P_i, P_j) is the distance between the faces of P_i and P_j captured in photo I_l. Using the above distance measure, we derive a distance matrix X of dimension N × N, with each entry x_{uv} = d(P_u, P_v) = x_{vu}.
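To make the computation of X concrete, here is a minimal Python sketch of Eq. (1). It assumes face rectangles are given as (x, y, w, h) tuples and takes the in-photo face distance d_{I_l} to be the Euclidean distance between rectangle centers normalized by the photo diagonal; the paper leaves that normalization unspecified, so treat it as an illustrative choice.

```python
import numpy as np

def face_center(rect):
    # rect = (x, y, w, h) in pixels
    x, y, w, h = rect
    return np.array([x + w / 2.0, y + h / 2.0])

def social_distance_matrix(photos, people):
    """photos: list of dicts {'size': (W, H), 'faces': {person_id: rect}}.
    people: list of person ids. Returns the N x N matrix X of Eq. (1)."""
    N = len(people)
    X = np.zeros((N, N))
    for u in range(N):
        for v in range(u + 1, N):
            terms = []
            for photo in photos:
                faces = photo['faces']
                if people[u] in faces and people[v] in faces:
                    W, H = photo['size']
                    diag = np.hypot(W, H)
                    d_il = np.linalg.norm(face_center(faces[people[u]]) -
                                          face_center(faces[people[v]])) / diag
                    f_il = len(faces)              # number of faces in the photo
                    terms.append(d_il * f_il - 1)  # per-photo term of Eq. (1)
            if terms:                              # q = number of co-appearances
                X[u, v] = X[v, u] = np.mean(terms)
    return X
```

The resulting symmetric matrix X can be fed directly to the BOSC step described in the next section.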
3 Community Detection Using BOSC

In this section, we first give an overview of the Bayesian Overlapping Subspace Clustering (BOSC) model and algorithm. We then modify the BOSC algorithm to handle distance matrices and detect overlapping communities.

3.1 BOSC Overview

Given a data matrix, the BOSC algorithm aims to find potentially overlapping dense sub-blocks and noise entries. Suppose the data matrix X has m rows, n columns and k sub-blocks. The BOSC model assumes that each sub-block j, [j]_1^k (i.e., j = 1, …, k), is modeled using a parametric distribution p_j(· | θ_j) from a suitable exponential family. The noise entries are modeled using another distribution p(· | θ_{k+1}) from the same family. The main idea behind the BOSC model is as follows. Each row u and each column v respectively have k-dimensional latent bit vectors z_r^u and z_c^v which indicate their sub-block memberships. The sub-block membership for any entry x_{uv} in the matrix is obtained by an element-wise (Hadamard) product of the corresponding row and column bit vectors, i.e., z = z_r^u ⊙ z_c^v. Given the sub-block membership z and the sub-block distributions, the actual observation x_{uv} is assumed to be generated by a multiplicative mixture model so that

p(x_{uv} \mid z_r^u, z_c^v, \theta) = \begin{cases} \frac{1}{c(z)} \prod_{j=1}^{k} p_j(x_{uv} \mid \theta_j)^{z_j} & \text{if } z \neq 0 \\ p(x_{uv} \mid \theta_{k+1}) & \text{otherwise} \end{cases}    (2)

where c(z) is a normalization factor that guarantees p(· | z_r^u, z_c^v, θ) is a valid distribution. If z = z_r^u ⊙ z_c^v = 0, the all-zeros vector, then x_{uv} is assumed to be generated from the noise component p(· | θ_{k+1}). The Hadamard product ensures that the matrix has uniform/dense sub-blocks with possible overlap while treating certain rows/columns as noise.

Since it can be tricky to work directly with the latent bit vectors, the BOSC model places suitable Bayesian priors on the sub-block memberships. In particular, it assumes that there are k Beta distributions Beta(α_r^j, β_r^j), [j]_1^k, corresponding to the rows and k Beta distributions Beta(α_c^j, β_c^j), [j]_1^k, corresponding to the columns. Let π_r^{u,j} denote the Bernoulli parameter sampled from Beta(α_r^j, β_r^j) for row u and sub-block j, where [u]_1^m and [j]_1^k. Similarly, let π_c^{v,j} denote the Bernoulli parameter sampled from Beta(α_c^j, β_c^j) for column v and sub-block j, where [v]_1^n and [j]_1^k. The Beta-Bernoulli distributions are assumed to be the priors for the latent row and column membership vectors z_r^u and z_c^v. In particular, the generative process is shown in Fig. 1.

Let Z_r and Z_c be m × k and n × k binary matrices that hold the latent row and column sub-block assignments for each row and column. Given the matrix X, the learning task is to infer the joint posterior distribution of (Z_r, Z_c) and compute the model parameters (α_r^*, β_r^*, α_c^*, β_c^*, θ^*) that maximize log p(X | α_r, β_r, α_c, β_c, θ). We can then draw samples from the posterior distribution and compute the dense-block assignment for each entry.

The BOSC algorithm is an EM-like algorithm. In the E-step, given the model parameters (α_r, β_r, α_c, β_c, θ), the goal is to estimate the expectation of the log-likelihood E[log p(X | α_r, β_r, α_c, β_c, θ)], where the expectation is taken with respect to the posterior probability p(Z_r, Z_c | X, α_r, β_r, α_c, β_c, θ). The BOSC algorithm uses Gibbs sampling to approximate this expectation. Specifically, it computes the conditional probabilities of each row (column) variable z_r^{u,j} (z_c^{v,j}) and constructs a Markov chain based on the conditional probabilities. On convergence, the chain draws samples from the joint posterior distribution of (Z_r, Z_c), which in turn can be used to obtain an approximate estimate of the expected log-likelihood. In the M-step, the BOSC algorithm estimates (α_r^*, β_r^*, α_c^*, β_c^*, θ^*) that maximize this expectation.
Fig. 1. BOSC generative process
Fig. 2. BOSC generative process for distance matrix
3.2 BOSC for Distance Matrix

The BOSC framework deals with general matrices. However, distance matrices are symmetric, and the rows and columns represent the same set of users in the social network, which implies that the row and column cluster assignments have to be identical, i.e., Z_r = Z_c. Suppose there are N individuals in the social network and the distance matrix is X. We can slightly modify the BOSC generative process to support the symmetric nature of the distance matrix, as shown in Fig. 2. Note that because the matrices are symmetric, we only need k Beta distributions Beta(α^j, β^j), [j]_1^k, for the individuals in the social network. The learning task now becomes inferring the joint posterior distribution of Z and computing the model parameters (α^*, β^*, θ^*) that maximize log p(X | α, β, θ). The EM-like process is similar to the description in Section 3.1.
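As an illustration of the symmetric membership structure used here (not of the Gibbs-sampling inference itself), the following sketch shows how a single binary membership matrix Z, with Z_r = Z_c = Z, determines which entries of the distance matrix belong to which (possibly overlapping) sub-blocks and which individuals fall to the noise component; the example matrix Z is made up purely for demonstration.

```python
import numpy as np

# Z[u, j] = 1 if individual u belongs to community j (Z_r = Z_c = Z).
Z = np.array([[1, 0],    # person 0: community 0
              [1, 1],    # person 1: overlaps communities 0 and 1
              [0, 1],    # person 2: community 1
              [0, 0]])   # person 3: noise (no community)

N, k = Z.shape
for u in range(N):
    for v in range(u + 1, N):
        z = Z[u] * Z[v]          # Hadamard product of row/column bit vectors
        if z.any():
            blocks = np.nonzero(z)[0]
            print(f"entry ({u},{v}) lies in sub-block(s) {blocks.tolist()}")
        else:
            print(f"entry ({u},{v}) is modeled by the noise component")
```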
4 Experiment Results

The image set used in the experiment contains 4 photo collections, each corresponding to a distinct social event. The 4 social events are "Birthday party" (78 photos), "Easter party" (37 photos), "Indiana gathering" (38 photos) and "Tennessee gathering" (106 photos). There are 54 individuals captured in those photos. The core people that establish linkages among these events are a couple (Dan and Oanh) and their son (Blake). The "Birthday party" is held for the son, and attended by his kid friends, their accompanying parents, members of Oanh's family, and the couple. The "Easter party" is attended by members of the wife's family, the couple and the son. Both the "Indiana gathering" and the "Tennessee gathering" are attended by members of the husband's family, in addition to the couple and the son. Table 1 lists the number of photos in each collection and the individuals who appear in each collection.
The clustering results using the graph partition based method [6] can be found in Table 2. Table 3 shows the clustering results using BOSC. In the BOSC implementation, we set k = 4. Several observations emerge from comparing the results in Table 2 and Table 3:

• Handling of noise data: The most noticeable difference is that Cluster 4 in Table 2 is no longer a cluster in Table 3. In fact, the two individuals "adult 2" and "adult 3" do not appear in any of the clusters in Table 3. Examining the photo collection shows that "adult 2" and "adult 3" co-appear in just one photo and only with each other. Apparently, excluding such individuals from the community construction is more sensible than treating them as a singular social cluster. In addition, a few other individuals, such as "adult 8", who appears in just one photo with one active member in the "Birthday party" collection, are also excluded from the community formation in Table 3. This difference is due to BOSC's capability of treating some data entries as noise, which makes the community detection less susceptible to interference from noise.

• Overlapping cluster membership: Another key observation is that the clusters are overlapping. As seen in Table 3, Dan appears in all 4 clusters, Blake in 3 clusters and Oanh in 2 clusters. This result reflects the pivotal role played by the family in associating the different communities around them. This insight is revealed by BOSC's overlapping clustering capability, and is not achievable through the graph partition approach.

• Modeling capability and limitation: As shown in Table 3, Clusters 1 and 3 both consist of people who attended the "Indiana gathering" and the "Tennessee gathering", and Cluster 3 is actually a subset of Cluster 1. Clusters 2 and 4 consist of most of the people who attended the "Birthday party" and the "Easter party", and Cluster 2 is almost a subset of Cluster 4. One can clearly see two camps of social clusters: one is Dan's family members and the other is Blake's friends and Oanh's family members. Considering that there are twice as many "Birthday party" photos as "Easter party" photos, and the "Birthday party" is attended by both Blake's friends and Oanh's family members, such a partition is reasonable. Another factor that may contribute to the weaker distinction of Blake's friends from Oanh's family members is the EM-like algorithm's convergence to a local maximum, which can be remedied algorithmically.

Overall, the experiment results validate the merits of the BOSC algorithm in treating noise data and supporting overlapping membership.

Table 1. Photo collections and the people appearing in each collection
Birthday party (78 photos): adult 1, adult 4, adult 5, adult 6, adult 7, adult 8, adult 9, adult 10, Alec, Anh, Blake, Blake's friend 1, Blake's friend 2, Blake's friend 3, Blake's friend 4, Blake's friend 5, Blake's friend 6, Blake's friend 7, Blake's friend 8, Blake's friend 9, Dan, Jo, Landon, Mel, Nicolas, Oanh
Easter party (37 photos): Alec, Anh, Anthony, Bill, Blake, Calista, Dan, Jo, Landon, Mel, Nicolas, Nini, Oanh, adult 4, adult 2, adult 3
Indiana gathering (38 photos): Allie, Amanda, Blake, Dad, Dan, Hannelore, Jennifer, Lauren, Mom, Rachel, Stan, Tom, Tracy
Tennessee gathering (106 photos): Allie, Amanda, Blake, Bret, Cindy, Dad, Dan, Grace, Hannelore, Jennifer, Jillian, Katherine, Kevin, Kurt, Lauren, Mom, Oanh, Tracy, Phil, Rachel, Reid, Rich, Sandra, Stan, Tom
Table 2. Detected social clusters

Cluster 1: Stan, Lauren, Dad, Reid, Amanda, Bret, Kevin, Hannelore, Cindy, Jennifer, Tom, Rachel, Allie, Sandra, Jillian, Grace, Oanh, Phil, Tracy, Kurt, Dan, Rich, Katharine, Mom
Cluster 2: adult 10, Blake's friend 2, Blake's friend 1, Blake's friend 5, adult 6, adult 9, Nicolas, Blake's friend 3, Blake's friend 4, Blake's friend 6, Blake's friend 7, Alec, Blake's friend 9, Blake's friend 8, adult 1, adult 5, Blake
Cluster 3: Jo, Landon, Anthony, Anh, Bill, adult 4, adult 7, Nini, Calista, Mel, adult 8
Cluster 4: adult 3, adult 2
Table 3. Detected social clusters using BOSC

Cluster 1: Lauren, Reid, Amanda, Bret, Kevin, Cindy, Tom, Rachel, Sandra, Oanh, Jillian, Phil, Kurt, Dan, Rich, Blake, Mom, Katherine, Dad, Stan, Hannelore, Jennifer, Allie, Grace, Tracy
Cluster 2: adult 10, adult 9, Dan, Blake, Blake's friend 2, Blake's friend 1, Blake's friend 4, Blake's friend 6, Blake's friend 8
Cluster 3: Lauren, Reid, Amanda, Bret, Cindy, Tom, Rachel, Jillian, Phil, Kurt, Dan, Mom, Dad, Jennifer, Allie, Grace
Cluster 4: adult 10, Landon, adult 7, Mel, Oanh, Anh, adult 1, Calista, Dan, Blake, Anthony, Jo, Bill, Nini, Nicolas, Alec, Blake's friend 1, Blake's friend 2, Blake's friend 5, Blake's friend 6, Blake's friend 3, Blake's friend 4, Blake's friend 9, Blake's friend 7
5 Conclusions

In this work, we first described a metric to measure people's social distance by examining their co-appearances in photo collections. Then a subspace clustering algorithm was applied to the social distance matrix of people to detect the social communities embedded in the photo collections. The experiment results illustrate that meaningful social clusters within photo collections can be revealed effectively by the proposed approach.
References 1. Golder, S.: Measuring social networks with digital photograph collections. In: Proceedings of the Nineteenth ACM Conference on Hypertext and Hypermedia (June 2008) 2. Gu, L., Zhang, T., Ding, X.: Clustering consumer photos based on face recognition. In: Proc. of IEEE International Conference on Multimedia and Expo, Beijing, pp. 1998–2001 (July 2007) 3. Newman, M.E.J.: Finding community structure in networks using the eigenvectors of matrices. Phys. Rev. E 74, 036104 (2006)
4. Rowe, R., Creamer, G., Hershkop, S., Stolfo, S.J.: Automated social hierarchy detection through email network analysis. In: Proceedings of the 9th WebKDD and 1st SNA-KDD 2007 Workshop on Web Mining and Social Network Analysis (August 2007) 5. Stone, Z., Zickler, T., Darrell, T.: Autotagging Facebook: Social network context improves photo annotation. In: Computer Vision and Pattern Recognition Workshops (June 2008) 6. Wu, P., Tretter, D.: Close & closer: social cluster and closeness from photo collections. In: ACM Multimedia 2009, pp. 709–712 (2009) 7. Fu, Q., Banerjee, A.: Bayesian Overlapping Subspace Clustering. In: ICDM 2009, pp. 776– 781 (2009)
Dynamic Estimation of Family Relations from Photos

Tong Zhang, Hui Chao, and Dan Tretter

Hewlett-Packard Labs, 1501 Page Mill Road, Palo Alto, CA 94304, USA
{tong.zhang,hui.chao,dan.tretter}@hp.com
Abstract. In this paper, we present an approach to estimate dynamic relations among the major characters in family photo collections. This is a fully automated procedure which first identifies major characters in photo collections through face clustering. Then, based on demographic estimation, facial similarity, co-appearance information and people's positions in photos, social relations such as husband/wife, parents/kids, siblings, relatives and friends can be derived. A workflow is proposed to integrate the information from different aspects in order to give the optimal result. In particular, based on the timestamps of photos, dynamic relation trees are displayed to show the evolution of people's relations over time. Keywords: consumer photo collections, family relation estimation, dynamic relation tree, face analysis, face clustering.
1 Introduction

We can often figure out people's relations after looking at somebody's family photo collections for a while. Then, what about letting the computer automatically identify important people and their relations, and even form a family relation tree like the one shown in Fig. 1, through analyzing the photos in a family collection? Such a technology would have a wide range of applications. First of all, the major people involved in images and their relations form important semantic information which is undoubtedly useful in browsing and searching images. For example, the information may be used in generating automatic recommendations of pictures for making photo products such as photo albums and calendars. Secondly, it may have other social and business values and broader uses in social networking, personalized advertisement and even homeland security. Moreover, a dynamic relation tree that shows the evolution of family relations over a relatively long period of time (e.g., multiple years) provides one kind of life log that offers at-a-glance perspectives on people, events and activities.

Existing work on discovering people's relations based on image content analysis is quite rare. Golder presented a preliminary investigation on measuring social closeness in consumer photos, where a weighted graph was formed with people co-appearance information [1]. A similar approach was taken in [2], but a graph clustering algorithm was employed to detect social clusters embedded in photo collections.
Fig. 1. One example of family relation tree automatically estimated based on face analysis
Compared with such prior work, our proposed approach aims at integrating information from multiple aspects of image analysis and revealing more details of the social relationships among the people involved in an image collection.
2 Estimation of People's Relations

2.1 What Can Be Obtained with Image Analysis

With results from previously developed image analysis techniques, including face recognition and clustering, demographic estimation and face similarity measurement, as well as contextual information such as co-appearance of people in photos, people's relative positions in photos and photo timestamps, the following clues can be obtained to discover people's relations.

• Major characters in a family photo collection
Based on state-of-the-art face recognition technology, we developed a face clustering algorithm in earlier work which automatically divides a photo collection into a number of clusters, with each cluster containing photos of one particular person [3]. As a result, major clusters (that is, those having a relatively large number of photos) corresponding to frequently appearing people may be deemed to contain the main characters of the collection. Shown in Fig. 2 are the major clusters obtained from one consumer image collection by applying face clustering. Each cluster is represented by one face image in the cluster, called a face bubble, and the size of the bubble is proportional to the size of the cluster. In such a figure, it is straightforward to identify the people who appear frequently in the photo collection.
Fig. 2. Main characters identified in a photo collection through face clustering
• Age group and gender of major characters
We applied learning-based algorithms developed in prior work to estimate the gender and age of each main character in a photo collection [4][5]. In age estimation, a person is categorized into one of five groups: baby (0-1), child (1-10), youth (10-18), adult (18-60) and senior (>60). Even though each classifier only has an accuracy rate of around 90% for an individual face image, it can be much more reliable when applied to a cluster with a large enough number of face images. Table 1 shows the number of faces classified into each gender or age group within each face cluster in a photo set, where M and F indicate male and female, respectively, and B, C, A and S indicate baby, child, adult and senior, respectively. Since the photos span a period of several years, some subjects may belong to multiple age groups, such as B/C and A/S. For each cluster, the gender/age group is estimated by majority vote, and the second dominant age group is also recorded if available. As can be seen, while there are mistakes in estimating gender and age for individual faces in each cluster, cluster-wise, all the classification results match the ground truth correctly.
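A minimal sketch of the cluster-level majority vote described above; the per-face prediction format and the rule of also keeping a second dominant age group are illustrative assumptions consistent with Table 1.

```python
from collections import Counter

def cluster_demographics(face_predictions):
    """face_predictions: list of (gender, age_group) predictions for every face
    image in one person's cluster, e.g. [('F', 'baby'), ('F', 'child'), ...].
    Returns the cluster-level estimate by majority vote."""
    genders = Counter(g for g, _ in face_predictions)
    ages = Counter(a for _, a in face_predictions)
    gender, _ = genders.most_common(1)[0]
    # Keep the top two age groups, since a collection spanning years may
    # legitimately cover neighboring groups (e.g. baby and child).
    top_ages = [a for a, _ in ages.most_common(2)]
    return gender, top_ages
```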
• Who look similar to each other
Since each face cluster contains images of one person, the similarity measure between two clusters indicates how similar the two people look. In our work, each cluster is represented by all of its member faces, with which every face in another
cluster is compared, and the distance of the two clusters is determined by the average distance of the K nearest neighbors. It has been found that clusters of adults with blood relations, such as parents/children and siblings, normally have high similarity measures with each other.

Table 1. Estimation of gender and age group of main characters in a photo collection

Face Cluster   Ground Truth   Detected Faces   Gender (M / F)   Age (B / C / A / S)
No.1           F, B/C         436              178 / 258        94 / 274 / 61 / 7
No.2           F, B/C         266              109 / 157        88 / 162 / 14 / 2
No.3           F, A           247               37 / 210         2 / 30 / 186 / 29
No.4           M, A           215              171 / 44          4 / 44 / 115 / 52
No.5           M, C            86               46 / 40         12 / 71 / 2 / 1
No.6           F, A            72               13 / 59          1 / 1 / 59 / 11
No.7           M, S            65               49 / 16          1 / 7 / 11 / 46
No.8           M, S            62               41 / 21          2 / 1 / 3 / 56
No.9           F, S            57               17 / 40          0 / 6 / 11 / 40
No.10          M, A/S          40               22 / 18          1 / 4 / 8 / 27
No.11          F, A            33                2 / 31          0 / 0 / 33 / 0
No.12          F, S            30                4 / 26          0 / 4 / 4 / 22
No.13          M, A            18               16 / 2           0 / 4 / 12 / 2
No.14          M, C            17               13 / 4           1 / 13 / 3 / 0
• Who are close with each other (appear in the same photo, and how often)
Whether and how often two people appear together in photos reveals how close they are with each other. A co-appearance matrix containing the number of co-occurrences between people can be obtained for the major clusters in a photo collection. It provides complementary information to the similarity matrix of clusters introduced above. As shown in Fig. 3, for one person, the people who look most like him (e.g., siblings, parents) and the people who are closest to him (e.g., wife, kids) are listed separately.
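For illustration, a small sketch of building such a co-appearance matrix from per-photo lists of detected people; the input format (a set of person identifiers per photo) is an assumption, not something the paper prescribes.

```python
import numpy as np
from itertools import combinations

def co_appearance_matrix(photos, people):
    """photos: list of sets of person ids appearing in each photo;
    people: list of the major characters (cluster ids).
    Returns an N x N matrix of co-occurrence counts."""
    index = {p: i for i, p in enumerate(people)}
    C = np.zeros((len(people), len(people)), dtype=int)
    for present in photos:
        for a, b in combinations(sorted(present & set(index)), 2):
            C[index[a], index[b]] += 1
            C[index[b], index[a]] += 1
    return C
```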
• Who are in the same social circle (appear in the same event)
By applying clustering to the co-appearance matrix and using photo timestamps, we can find groups of people who appear together in the same event, and thus figure out social circles in a photo dataset. In particular, people who appear in a series of group photos taken at one event (e.g., a family reunion, company outing or alumni reunion) may be recognized as belonging to one circle of relatives, colleagues or classmates.

• Who are intimate with each other
In a family photo collection, couples, sibling kids and nuclear family members often have exclusive photos of themselves. Besides that, people’s positions in photos also
provide useful cues regarding intimate relations. For example, couples usually stand or sit next to each other in group photos. Parents/grandparents tend to hold the family kids in photos. Touching-faces position (two faces that touch each other) usually only happens between husband and wife, lovers, parents/grandparents and kids, or siblings.
Fig. 3. Face clustering result of CMU Gallagher’s dataset [6]. On the left: face clusters in order of cluster size. On the right: first row – clusters most similar to selected cluster in face; second row – clusters having most co-appearances with selected cluster. Lower-right panel: images in selected cluster.
• Evolution of relations
People's relations evolve over time. With the help of photo timestamps, for a photo collection spanning a relatively long period of time (e.g., from a few years to dozens of years), changes in the major characters can be detected. Events such as the addition or passing away of family members and occasional visits of extended family members, relatives or friends can be discovered. Also, people who appeared at different stages of one's life may be identified.

2.2 Constructing a Relation Tree

We propose a workflow to identify people's relations and derive a relation tree from photo collections using the clues introduced above. Members of the nuclear family of the photo owner are first recognized, followed by extended family members, and then other relatives and friends.

• Identifying nuclear family members
From all major clusters, which are defined as clusters containing more than M images (M is empirically determined, e.g. M=4 or 5), the most significant clusters are selected through a GMM procedure in the kid’s clusters (baby, child, junior) and adult clusters
(adult, senior), respectively, in order to find candidates for the kids and parents of the nuclear family [7]. Among these candidates, a link is computed between each pair of people based on the number of co-appearances, with weights added for cases such as exclusive photos, close positions in group photos, face-touching positions and baby-holding positions. An example is shown in Fig. 4, where four of the candidates have strong links between each pair of them, while the fifth person has only light links with some of the others. Thus, the nuclear family members were identified as shown on the right.

Fig. 4. Nuclear family estimation. Left: working sheet (the number under each face is the size of the face cluster of that character); Right: identified nuclear family members.
• Identifying extended family members
Extended family members are defined as the parents and siblings of the husband or wife, as well as the siblings' families. These people are identified through facial similarity, co-appearances and positions in photos. As shown in Fig. 5, major senior characters that have high facial similarity with the husband or wife and strong links with nuclear family members are found. Each senior couple can then be recognized as the parents of the husband or of the wife according to the link between them and their links with the husband or wife. A sibling of the husband/wife usually has strong links not only with the husband/wife but also with the corresponding parents in terms of facial similarity and co-appearances. He/she also often has co-occurrences with the nuclear family kids in intimate positions. A sibling's family can be identified as those who have strong links with the sibling, as well as co-appearances with nuclear family members and the corresponding parents.

• Identifying relatives and friends
Other major characters are determined to be relatives or friends of the family. People are identified as relatives if they co-appear not only with the nuclear family, but also with the extended family members, either in photos of different events or in a relatively large number of photos of one event. The remaining major characters are determined to be friends. In Fig. 6, a group of nine people were found to appear together in dozens of photos (including a number of group photos) taken within two events. The nuclear family and the husband's sister's family are also in the group photos. Therefore, these people are identified as belonging to one relative circle on the husband's side.
Fig. 5. Extended family estimation. Left upper: identifying parents of husband/wife; Left lower: identifying a sibling of the husband and her family. Right: estimated extended family.
Fig. 6. Identifying circles of relatives and friends
3 Estimation of the Evolution in People's Relations

3.1 Dynamic Relation Tree – One Kind of Life Log

With people's relations identified in the above-described process, a dynamic relation tree is built which places the people in their corresponding positions in the relation circles, and the view of the tree changes when different time periods are selected.
Fig. 7. Snapshots of a dynamic relation tree which was constructed from a family photo collection. This collection contains photos spanning seven years. One view of the tree was generated for each of the years from 2001 to 2007.
That is, while people and their relations are identified by analyzing all the images in the photo collection, only those people who appear in the specified period of time (e.g., a certain year or a certain event) are shown in one view of the dynamic tree. It thus reveals the evolution of relations over time, and provides one kind of life log with at-a-glance views that reminds one of the events and people involved in one's past. Fig. 7 contains snapshots of a dynamic relation tree derived from a family photo collection containing photos spanning 7 years. One view is included for each of the years 2001–2007, showing the people appearing that year with face images taken during that time. Many stories can be told from viewing these snapshots. For example, it was at the end of 2001 that the family got their first digital camera, so only a few people appeared that year. Grandparents on both sides got together with the nuclear family every year, except that the wife's parents did not make it in 2006. The family went to see the great-grandmother in 2003; in that same year, the husband's sister came to visit them. We can see the daughter's favorite teacher in 2004. Two neighbor families appear in almost every year, and the kids have grown up together. If we zoom into specific events in certain years, more details may be discovered, such as the wife's parents being with the family during Thanksgiving in 2004, while the husband's parents spent Christmas with them that year. From these snapshots, we can also see how the looks of the kids changed as they grew from pre-schoolers to early teenagers. Compared with a static tree showing who the major characters are and how they connect with each other [7], a dynamic tree presents additional details about the activities and changes in the people and their relations.
Fig. 8. Determining a kid’s age based on photo timestamps and age group classification
3.2 Discovery of Current Status and Changes in the Family Circle

With the help of photo timestamps and the estimation of people's relations, many events in the family circle can be discovered, such as the birth of a baby or the passing away of a family member. Here, in particular, we propose a method that predicts the current age of an individual based on the images in his/her face cluster over multiple years. One example is shown in Fig. 8. The age group classifier is applied to each face image in a person's cluster. The images are then divided into one-month periods, and within each period, the percentage of images classified into each age group is computed, as plotted in the figure. Next, for each age class, a polynomial function is estimated to fit the data. The polynomial curves of the baby class and the child class over 4 years are shown. The crossing points of the polynomial functions for any two neighboring classes are found, and if a crossing point has a value larger than a threshold (e.g., 0.35), it is considered to be a significant transitional point that indicates the time at which the person transitions from one age group to the next. In this example, the transitional time from the baby to the child class is around June 2008; thus the person was estimated to be around one year old at that time, and his/her current age can be estimated accordingly.
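A minimal sketch of the transition-point estimation described above, assuming the per-month fractions have already been computed for two neighboring age groups; the polynomial degree is an illustrative choice, while the 0.35 threshold follows the text.

```python
import numpy as np

def transition_month(months, frac_a, frac_b, deg=3, thresh=0.35):
    """months: month indices; frac_a/frac_b: per-month fractions of faces
    classified into two neighboring age groups (e.g. baby vs. child).
    Returns the month where the fitted curves cross with value > thresh."""
    pa = np.poly1d(np.polyfit(months, frac_a, deg))
    pb = np.poly1d(np.polyfit(months, frac_b, deg))
    grid = np.linspace(min(months), max(months), 500)
    diff = pa(grid) - pb(grid)
    for i in range(len(grid) - 1):
        if diff[i] > 0 >= diff[i + 1]:      # curve of the younger group falls below
            if pa(grid[i]) > thresh:        # significant transitional point
                return grid[i]
    return None
```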
4 Experiments and Discussions

We tested the proposed approach on three typical family photo collections. Each spans 7–9 years and contains photos of a large number of major people, including nuclear family members (husband, wife and kids), extended families on both sides, and other relatives and friends. Using the approach described in this paper, the estimated relations match the ground truth quite accurately except for a few extreme cases.
Fig. 9. Relation tree automatically derived from a family photo collection
For example, the final relation tree derived from the collection introduced in Section 2.2 is shown in Fig. 9. In this tree, one person identified as a "family relative or friend" is actually the portrait of Chairman Mao in Beijing's TianAnMen Square. This false alarm happened because the family took quite a few photos there at different times. Another issue is that the estimated gender is wrong for three people in this collection, showing that our gender estimation may not be reliable for young kids and seniors. Furthermore, the concept of a dynamic tree helps to resolve some confusing cases in a static tree. In this same collection, one senior man is close with two senior women (one is his late wife, the other is his current wife), and both women have strong links with the nuclear family, which makes a difficult case for a static tree. However, as the two women's appearances do not overlap in time, it is straightforward to place them in the dynamic tree. There are also cases that need extra rules to reach the right result. For instance, in one collection, an adult male has a number of co-appearances with the wife's sister, including two exclusive ones; however, he also appears in photos with the husband's family, and he and the husband are highest on each other's facial similarity rankings. We had to add the rule that blood relation estimation has higher priority than that of the significant-other relation, and thus assign this person as a sibling of the husband. We believe that, with experiments on more family photo collections, new cases will keep appearing that require rules to be added or expanded to accommodate all the different variations of relations.
5 Conclusions and Future Work

We presented an approach to automatically figure out the main characters and their relations in a photo set based on face analysis technologies and image contextual information. On top of this, a dynamic relation tree is built in which the appearing people and their looks change over time to reveal the evolution and events in the family's life over multiple years. Experiments have shown that with existing techniques, quite accurate results can be obtained on typical family photo collections. Only preliminary work has been done with this approach. In future work, more family image datasets will be collected and tested on to make the rule-based system more robust. More learning elements will be added into the workflow to replace hard-coded rules so that the system can adapt to different relation cases by itself. We will also investigate use cases of the approach and produce more useful relation trees.
References

1. Golder, S.: Measuring Social Networks with Digital Photograph Collections. In: 19th ACM Conference on Hypertext and Hypermedia, pp. 43–47 (June 2008)
2. Wu, P., Tretter, D.: Close & Closer: Social Cluster and Closeness from Photo Collections. In: ACM Conf. on Multimedia, Beijing, pp. 709–712 (October 2009)
3. Zhang, T., Xiao, J., Wen, D., Ding, X.: Face Based Image Navigation and Search. In: ACM Conf. on Multimedia, Beijing, pp. 597–600 (October 2009)
4. Gao, W., Ai, H.: Face Gender Classification on Consumer Images in a Multiethnic Environment. In: The 3rd IAPR Conf. on Biometrics, Univ. of Sassari, Italy, June 2-5 (2009) 5. Gao, F., Ai, H.: Face Age Classification on Consumer Images with Gabor Feature and Fuzzy LDA Method. In: The 3rd IAPR International Conference on Biometrics, Univ. of Sassari, Italy, June 2-5 (2009) 6. Gallagher, A.C., Chen, T.: Using Context to Recognize People in Consumer Images. IPSJ Trans. on Computer Vision and Applications 1, 115–126 (2009) 7. Zhang, T., Chao, H., et al.: Consumer Image Retrieval by Estimating Relation Tree from Family Photo Collections. In: ACM Conf. on Image and Video Retrieval, Xi’an, China, pp. 143–150 (July 2010)
Semi-automatic Flickr Group Suggestion

Junjie Cai¹, Zheng-Jun Zha², Qi Tian³, and Zengfu Wang¹

¹ University of Science and Technology of China, Hefei, Anhui, 230027, China
² National University of Singapore, Singapore, 639798
³ University of Texas at San Antonio, USA, TX 78249
[email protected], [email protected], [email protected], [email protected]
Abstract. Flickr groups are self-organized communities for sharing photos and conversations around a common interest, and they have gained massive popularity. Users in Flickr have to manually assign each image to an appropriate group. Manual assignment requires users to be familiar with the existing images in each group; it is intractable and tedious, and it therefore prohibits users from exploiting the relevant groups. As a solution to this problem, group suggestion has attracted increasing attention recently; it aims to suggest groups to the user for a specific image. Existing works pose group suggestion as an automatic group prediction problem, with the purpose of predicting the groups of each image automatically. Despite dramatic progress in automatic group prediction, the prediction results are still not accurate enough. In this paper, we propose a semi-automatic group suggestion approach with Human-in-the-Loop. Given a user's image collection, we employ pre-built group classifiers to predict the group of each image. These predictions are used as the initial group suggestions. We then select a small number of representative images from the user's collection and ask the user to assign their groups. Once the user's feedback on the representative images is obtained, we infer the groups of the remaining images through group propagation over multiple sparse graphs among the images. We conduct experiments on 15 Flickr groups with 127,500 images. The experimental results demonstrate that the proposed framework is able to provide accurate group suggestions with quite a small amount of user effort. Keywords: Flickr Group, Semi-automatic, Group Suggestion.
1 Introduction

In the Web 2.0 era, social networking is a popular way for people to connect, express themselves, and share interests. Popular social networking websites include MySpace (http://www.myspace.com/), Facebook (http://www.facebook.com/), LinkedIn (http://www.linkedin.com/), and Orkut (http://www.orkut.com/) for finding and organizing
Fig. 1. Home page of the group ”cars”
contacts, LiveJournal (http://www.livejournal.com/) and BlogSpot (http://googleblog.blogspot.com/) for sharing blogs, Flickr (http://www.flickr.com/) and Youtube (http://www.youtube.com/) for sharing images or videos, and so on. One important social connection feature on these websites is the Group, which refers to a self-organized community with a declared, common interest. In particular, groups in Flickr are communities where users gather to share their common interest in certain types of events or topics captured by photos. For example, the Flickr group "cars" is illustrated in Fig. 1. Most of the activities within a Flickr group start from sharing images: a user contributes his/her own images to the related group, comments on other members' images, and discusses related photographic techniques [1][3]. Flickr currently obligates users to manually assign each image to the appropriate group. Manual assignment requires users to be familiar with the existing images in each group and to match the subject of each image in the user's collection with the topics of various groups. This work is intractable and tedious, and thus prohibits users from exploiting the relevant groups [1][2]. To tackle this problem, group suggestion has been proposed recently, which aims to suggest appropriate groups to the user for a specific image.
[Fig. 2 pipeline stages: User Image Collection → Group Classifiers based on Multiple Kernel Learning → Group Prediction → Representative Image Selection and User Labeling → Group Inference → Group Suggestion]
Fig. 2. Flowchart of our approach. The green rectangles indicate final group suggestions corrected by our method.
For example, Perez and Negoescu analyzed the relationship between image tags and the tags in groups to automatically suggest groups to users [6]. In addition to the tags used in [6], visual content has also been utilized to predict the groups of images. Duan et al. presented a framework that integrates a PLSA-based image annotation model with a style model to provide users with groups for their images [4]. Chen et al. developed a system named SheepDog to automatically add images into appropriate groups by matching the semantics of images with the groups, where image semantics are predicted through image classification techniques [8]. Recently, as reported in [1][3], Yu et al. converted group suggestion into a group classification problem. They integrated both visual content and textual annotations (i.e., tags) to predict the events or topics of the images. Specifically, they first trained an SVM classifier to predict the group of each image and then refined the predictions through sparse-graph-based group propagation over images within the same user collection. Although encouraging advances have been achieved in automatic group suggestion, the suggestion results are still not accurate enough.

Motivated by the above observations, in this paper we propose a semi-automatic group suggestion approach with Human-in-the-Loop. Fig. 2 illustrates the flowchart of our approach. Given a user's image collection, we employ pre-built group classifiers to predict the group of each image. These group predictions are used as the initial suggestions for the user. As aforementioned, the automatic group prediction is not accurate enough. We thus introduce the user in the loop [15] and conduct the following two steps repeatedly: (a) sample images are selected from the user's collection and the corresponding suggestions from the group classifiers are presented to the user, who amends the wrong ones, and (b) the groups of the remaining images are inferred based on the user's feedback and the group predictions from the last round. From a technical perspective, the semi-automatic group suggestion framework contains three components: (a) group classifier building, (b) representative image selection, and (c) group inference. Specifically, we employ the Multiple Kernel Learning (MKL) technique [9] to build an SVM classifier for each group with a one-vs-all strategy. To select representative images, we incorporate sample uncertainty [12] into the Affinity Propagation algorithm and select the images with high uncertainty and representativeness. After obtaining the user's feedback
on the selected images, we infer the groups of the remaining images through group propagation over multiple sparse graphs of the images. We conduct experiments on 15 Flickr groups with 127,500 images. The experimental results demonstrate that the proposed framework is able to provide accurate group suggestions with quite a small amount of user effort.

The rest of this paper is organized as follows. The proposed semi-automatic group suggestion framework is elaborated in Section 2. The experimental results are reported in Section 3, followed by the conclusions in Section 4.
2 Approach

2.1 Image Features
Three popular visual descriptors, i.e., GIST, CEDD and Color Histogram, are extracted to represent image content [1].

GIST. The GIST descriptor [5] has recently received increasing attention in the context of scene recognition and classification tasks. To compute the color GIST description, the image is segmented by a 4-by-4 grid for which orientation histograms are extracted. Our implementation takes as input a square image of fixed size and produces a 960-dimensional feature vector.

CEDD. The Color and Edge Directivity Descriptor (CEDD) [10] is a low-level feature which incorporates color and texture information into a histogram. The CEDD size is limited to 54 bytes per image, rendering this descriptor suitable for use in large image databases. A 144-dimensional CEDD feature vector is extracted for each image.

Color Histogram. The color histogram is a widely used visual feature. We extract a 512-bin RGB color histogram by dividing the RGB color space into 8×8×8 bins; thus, a 512-dimensional color feature vector is extracted for each image.
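For reference, a minimal sketch of the 8×8×8-bin RGB color histogram; the L1 normalization is an assumption, since the paper does not state how the histogram is normalized.

```python
import numpy as np

def rgb_histogram(image, bins_per_channel=8):
    """image: H x W x 3 uint8 RGB array. Returns a 512-dimensional histogram
    (8 x 8 x 8 bins over the RGB cube), L1-normalized."""
    pixels = image.reshape(-1, 3)
    hist, _ = np.histogramdd(pixels, bins=(bins_per_channel,) * 3,
                             range=((0, 256),) * 3)
    hist = hist.flatten()
    return hist / max(hist.sum(), 1)
```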
2.2 Group Classifier Building
The curse of dimensionality has always been a critical problem for many machine learning tasks. Directly concatenating different types of visual features into a long vector may lead to poor statistical modeling and high computational cost. To bypass this problem, we employ the Multiple Kernel Learning (MKL) method [9] to build an SVM classifier for each group. Denoting the kernel similarity between samples x and y over the j-th feature by K_j(x, y), we combine multiple kernels as a convex combination:

K(x, y) = \sum_{j=1}^{K} \beta_j K_j(x, y), \quad \text{with } \beta_j \geq 0, \; \sum_{j=1}^{K} \beta_j = 1.    (1)

The kernel combination weights β_j and the parameters of the SVM can be jointly learned by solving a convex but non-smooth objective function. We follow the implementation at http://asi.insa-rouen.fr/enseignants/arakotom/.
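The sketch below illustrates how, for given weights β, the kernels of Eq. (1) are convexly combined and used to train an SVM on a precomputed kernel; learning β itself is left to the MKL solver referenced above, so the fixed weights here are only a placeholder.

```python
import numpy as np
from sklearn.svm import SVC

def combine_kernels(kernel_list, beta):
    """Convex combination of kernel matrices as in Eq. (1)."""
    beta = np.asarray(beta, dtype=float)
    assert np.all(beta >= 0) and np.isclose(beta.sum(), 1.0)
    return sum(b * K for b, K in zip(beta, kernel_list))

def train_group_classifier(kernel_list, beta, labels):
    # One-vs-all SVM trained on the combined (precomputed) kernel.
    K_train = combine_kernels(kernel_list, beta)
    clf = SVC(kernel='precomputed')
    return clf.fit(K_train, labels)

# Example with linear kernels K_j = X_j X_j^T for three feature modalities
# (X_gist, X_cedd, X_hist are (n_samples, d_g) arrays; y is a binary label vector):
#   kernels = [X @ X.T for X in (X_gist, X_cedd, X_hist)]
#   model = train_group_classifier(kernels, beta=[1/3, 1/3, 1/3], labels=y)
```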
2.3 Representative Image Selection
We use the Affinity Propagation (AP) method to identify a small number of images that accurately represent the user's image collection [11][14]. Different from typical AP, sample uncertainty is incorporated into AP as the prior (preference) of the images. This modified AP is thus able to select images with high representativeness as well as high uncertainty. Entropy has been widely used to measure sample uncertainty [12]. The uncertainty in a binary classification problem is defined as

H(x) = -\sum_{y \in \{0,1\}} p(y \mid x) \log p(y \mid x),    (2)

where p(y | x) represents the distribution of estimated class membership. We extend Eq. (2) to compute sample uncertainty in multiple-group prediction as

H(x) = -\sum_{i=1}^{k} \sum_{y_i \in \{0,1\}} p(y_i \mid x) \log p(y_i \mid x).    (3)
The resulting H(x) is used as the preference of sample x in the AP algorithm. AP aims to cluster the image set I = {I_i}_{i=1}^N into M (M < N) clusters based on the sample similarity s(I_i, I_j). Each cluster is represented by its most representative image, called an "exemplar". In AP, all the images are considered as potential exemplars, and each of them is regarded as a node in a network. Real-valued messages are recursively transmitted via the edges of the network until a good set of exemplars and their corresponding clusters emerge. Let I_e = {I_{e_i}}_{i=1}^M denote the final exemplars and e(I) represent the exemplar of image I. In brief, the AP algorithm propagates two kinds of information between images: 1) the "responsibility" r(i, j), transmitted from image i to image j, which measures how well-suited I_j is to serve as the exemplar for I_i while simultaneously considering other potential exemplars for I_i; and 2) the "availability" a(i, j), sent from candidate exemplar I_j to I_i, which reflects how appropriate it would be for I_i to choose I_j as its exemplar while simultaneously considering other images that may choose I_j as their exemplar. This information is iteratively updated by

r(i, j) \leftarrow s(I_i, I_j) - \max_{j' \neq j} \{ a(i, j') + s(I_i, I_{j'}) \},
a(i, j) \leftarrow \min\{0, r(j, j)\} + \sum_{i' \notin \{i, j\}} \max\{0, r(i', j)\}.    (4)

The "self-availability" a(j, j) is updated by a(j, j) := \sum_{i' \neq j} \max\{0, r(i', j)\}. The above information is iteratively propagated until convergence. Then, the exemplar e(I_i) of image I_i is chosen as e(I_i) = I_j by solving \arg\max_j \{ r(i, j) + a(i, j) \}.
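A minimal sketch of this modified selection step using scikit-learn's AffinityPropagation, with per-sample preferences set to the uncertainty of Eq. (3). Note that the entropy values may need rescaling to be comparable with the similarity scale, and ranking exemplars by uncertainty is an assumption about the otherwise unspecified significance score.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def uncertainty(probs):
    """probs: (n_samples, k) per-group membership probabilities.
    Multi-group entropy of Eq. (3): sum of binary entropies over the k groups."""
    p = np.clip(probs, 1e-12, 1 - 1e-12)
    return -(p * np.log(p) + (1 - p) * np.log(1 - p)).sum(axis=1)

def select_representatives(similarity, probs, n_query=5):
    """similarity: (N, N) pairwise similarity matrix s(I_i, I_j);
    probs: classifier outputs, whose entropy is used as the AP preference."""
    scores = uncertainty(probs)
    ap = AffinityPropagation(affinity='precomputed', preference=scores,
                             random_state=0)
    labels = ap.fit_predict(similarity)
    exemplars = ap.cluster_centers_indices_
    # Rank exemplars by their uncertainty and query the top few.
    order = np.argsort(-scores[exemplars])
    return exemplars[order][:n_query], labels
```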
2.4 Group Inference
After obtaining the user's feedback on the selected images, our task is to infer the groups of the remaining images. It is reasonable to assume that many images in a user image
collection are usually similar, and similar images should be assigned to the same group. Therefore, the groups of the remaining images can be inferred by propagating the user's feedback to these images. Let I = {I_1, …, I_l, I_{l+1}, …, I_N} denote the images in a certain user's collection, containing l labeled samples and N − l unlabeled samples. X_g = {x_{g,1}, …, x_{g,l}, x_{g,l+1}, …, x_{g,N}} denotes the feature vectors on the g-th modality, where x_{g,i} ∈ R^{d_g} represents the i-th sample. Here, we infer the groups of {I_{l+1}, …, I_N} by resorting to group propagation over multiple sparse graphs among the images I. Let G_g = {I, W_g} denote the sparse graph on the g-th modality. W_g is the affinity matrix, in which W_{g,ij} indicates the affinity between samples i and j. W_g can be obtained by solving the following optimization problem [13]:

W_g = \arg\min \|W_g\|_1, \quad \text{s.t. } x_{g,i} = A_{g,i} W_g,    (5)

where the matrix A_{g,i} = [x_{g,1}, …, x_{g,i-1}, x_{g,i+1}, …, x_{g,N}]. Afterwards, we conduct group propagation over the K sparse graphs {G_g}_{g=1}^K. Wang et al. [7] have proposed an optimized multi-graph-based label propagation algorithm, which is able to integrate multiple complementary graphs into a regularization framework. The typical graph-based propagation framework estimates the class labels (i.e., groups in our case) of images over the graphs such that they satisfy two properties: (a) they should be close to the given labels on the labeled samples, and (b) they should be smooth on the whole graphs. Here, we extend Wang et al.'s method to further require that the estimated groups be consistent with the initial predictions from our group classifiers. The group inference problem can then be formulated as the following optimization problem:

f^* = \arg\min_f \Big\{ \sum_{g=1}^{K} \alpha_g \Big( \sum_{i,j} W_{g,ij} \Big| \tfrac{f_i}{\sqrt{D_{g,ii}}} - \tfrac{f_j}{\sqrt{D_{g,jj}}} \Big|^2 + \mu \sum_i |f_i - y_i|^2 + \nu \sum_i |f_i - f_i^0|^2 \Big) \Big\},    (6)
where D_{g,ii} = \sum_j W_{g,ij}, f_i is the to-be-learned confidence score of sample i with respect to a certain group, f_i^0 is the initial prediction from the corresponding group classifier, and y_i is the user's feedback on sample i. α = {α_1, α_2, …, α_K} are the graph weights, which satisfy α_g > 0 and \sum_{g=1}^{K} α_g = 1. The regularization framework consists of three components: a loss term that corresponds to the first property and penalizes deviation from the user's feedback; a regularizer that addresses label smoothness over the graphs; and a regularizer that prefers consistency between the estimated group assignment and the initial prediction. If the weights α are fixed, we can derive the optimal f as

f = \Big(I + \frac{1}{\mu}\sum_{g=1}^{K}\alpha_g L_g + \frac{\nu}{\mu}\Big)^{-1} Y + \Big(I + \frac{1}{\nu}\sum_{g=1}^{K}\alpha_g L_g + \frac{\mu}{\nu}\Big)^{-1} f_0,    (7)

where L_g = D_g^{-1/2}(D_g - W_g)D_g^{-1/2} is the normalized graph Laplacian. However, the weights α, which reflect the utilities of the different graphs, are crucial to the propagation performance. Thus, α should also be optimized automatically to reflect the utility of the multiple graphs. To achieve this, we make a
relaxation by changing α_g to α_g^τ, τ > 1 [7]. Note that \sum_g \alpha_g^\tau achieves its minimum when α_g = 1/K under the constraint \sum_g \alpha_g = 1. We then solve the joint optimization of f and α using an alternating optimization technique: we iteratively optimize f with α fixed and then optimize α with f fixed until convergence [7]. The solutions of α and f can be obtained as

\alpha_g = \frac{\left(\frac{1}{f^T L_g f + \mu|f-Y|^2 + \nu|f-f_0|^2}\right)^{\frac{1}{\tau-1}}}{\sum_{g=1}^{K}\left(\frac{1}{f^T L_g f + \mu|f-Y|^2 + \nu|f-f_0|^2}\right)^{\frac{1}{\tau-1}}}    (8)

f = \Big(I + \frac{\sum_{g=1}^{K}\alpha_g^\tau L_g}{\mu \sum_{g=1}^{K}\alpha_g^\tau} + \frac{\nu}{\mu}\Big)^{-1} Y + \Big(I + \frac{\sum_{g=1}^{K}\alpha_g^\tau L_g}{\nu \sum_{g=1}^{K}\alpha_g^\tau} + \frac{\mu}{\nu}\Big)^{-1} f_0    (9)
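A simplified sketch of the alternating updates of Eqs. (8) and (9) for a single group's confidence vector; it assumes the sparse affinity matrices W_g from Eq. (5) are already built, and it treats Y as a dense vector with zeros at unlabeled positions, which is a simplification of the formulation above.

```python
import numpy as np

def normalized_laplacian(W):
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(d, 1e-12)))
    return D_inv_sqrt @ (np.diag(d) - W) @ D_inv_sqrt

def multi_graph_propagation(Ws, Y, f0, mu=50.0, nu=5.0, tau=2.0, n_iter=10):
    """Ws: list of K affinity matrices; Y: feedback vector (zeros where unlabeled);
    f0: initial classifier predictions. Alternates the f-update of Eq. (9)
    and the alpha-update of Eq. (8)."""
    Ls = [normalized_laplacian(W) for W in Ws]
    N = len(f0)
    alpha = np.full(len(Ls), 1.0 / len(Ls))
    I = np.eye(N)
    f = f0.copy()
    for _ in range(n_iter):
        # f-update (Eq. 9): alpha^tau-weighted Laplacian, normalized by sum(alpha^tau).
        a_tau = alpha ** tau
        L_bar = sum(a * L for a, L in zip(a_tau, Ls)) / a_tau.sum()
        f = (np.linalg.inv(I + L_bar / mu + (nu / mu) * I) @ Y +
             np.linalg.inv(I + L_bar / nu + (mu / nu) * I) @ f0)
        # alpha-update (Eq. 8): inverse per-graph cost, normalized.
        costs = np.array([f @ L @ f + mu * np.sum((f - Y) ** 2) +
                          nu * np.sum((f - f0) ** 2) for L in Ls])
        inv = (1.0 / np.maximum(costs, 1e-12)) ** (1.0 / (tau - 1.0))
        alpha = inv / inv.sum()
    return f, alpha
```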
3 Experiments

3.1 Data and Methodologies
We collect 127,500 user images from 15 groups automatically via the Flickr Group API (http://www.flickr.com/groups/). All of the groups are related to popular visual concepts: "baby", "bird", "animal", "architecture", "car", "flower", "green", "music", "night", "tree", "red", "wedding", "sky", "snow" and "sunset". Each group contributes 8,500 images to our dataset on average. For each group, we assign a user and his/her images to the testing subset if the user contributes more than 100 images to that group. As a result, the testing subset contains 203 users with 24,367 images. The remaining 103,133 images are used as training samples. Some sample images from each group are illustrated in Fig. 3.
Fig. 3. Sample images from our dataset
In the experiment, we utilize a simple but effective linear kernel K(x_i, x_j) = x_i^T x_j for the SVM classifiers. We employ the modified AP algorithm to select representative images from each user's collection. The visual similarity between two images is calculated as \sum_{g=1}^{K} \exp(-\|x_{g,i} - x_{g,j}\|^2), where x_{g,i} is the feature vector of image I_i. For each representative image selected by AP, there is a score reflecting the significance of that image. In each round, we choose the five representative images with the highest scores and query the user for their groups.

Table 1. The comparison of average accuracy on our dataset

Visual Feature Descriptor    Average Accuracy
Color Histogram              39.0%
GIST                         43.7%
CEDD                         51.2%
MKL                          54.0%

3.2 Evaluation of Group Classifiers
We first evaluate our group classifiers and compare them against SVMs with each of the three features described in Section 2.1. Table 1 shows the comparison of average prediction accuracy, while Fig. 4 illustrates the comparison of accuracy over each class. We can see that our MKL classifiers achieve the best performance among the four methods, with around 15.0%, 10.3% and 2.8% improvements compared to SVMs with the Color Histogram, GIST and CEDD visual feature descriptors, respectively.
Fig. 4. Classification performance comparison on our dataset
3.3 Evaluation of Group Inference
We evaluate the effectiveness of group inference, which is achieved through multi-sparse-graph group propagation based on the initial predictions from the group classifiers as well as the user's feedback on selected images. The weights in Eq. (6) are initialized as α_1 = α_2 = α_3 = 1/3. The parameters µ and ν in Eq. (7) are set empirically to 50 and 5, respectively.
Fig. 5. Group suggestion performance comparison over ten iterations
Fig. 6. Prediction accuracy comparison on our dataset (Initial Label Propagation vs. Multi-Graph Label Propagation, 10 iterations)
Fig. 5 shows the group suggestion performance over ten iterations. We can see that group inference plays an active role in the loop of group suggestion and effectively improves the group prediction accuracy. Compared to the predictions from the group classifiers, group inference improves the prediction accuracy by 8% and 24% in the first and last iteration, respectively. In each iteration, it takes only 0.5 seconds to infer the groups of images and select new representative images for the next iteration. Fig. 6 provides the detailed comparison of prediction accuracy over each group. It shows that the average prediction accuracy is significantly improved from 0.62 in the first iteration to 0.78 in the last iteration. Fig. 7 illustrates the suggested groups for some sample images within three users’ collections. The first image in each collection is the selected representative image. The arrow indicates the change of group suggestions: the group on the left side is the initial suggestion from the group classifiers, while the group on the right side is the one input by the user for a selected image or the final suggestion generated by group inference. From the above experimental results, we can see that our semi-automatic group suggestion approach outperforms the fully automatic approach while requiring only a small amount of user effort.
Fig. 7. Final group suggestions for sample images. Initial prediction results lie on the left of the arrow, while suggestion results are located on the right of the arrow.
4 Conclusion
In this paper, we have proposed a semi-automatic group suggestion framework with a human in the loop. The framework contains three components: group classifier building, representative image selection, and group inference. Specifically, we employ the pre-built group classifiers to predict the group of each image in each user’s collection. After obtaining the user’s feedback on some selected images, we infer the groups of the remaining images through group propagation over multiple sparse graphs of the images. The extensive experiments demonstrate that our proposed framework is able to provide accurate group suggestions with minimal user effort.
References 1. Yu, J., Jin, X., Han, J., Luo, J.: Mining Personal Image Collection for Social Group Suggestion. In: IEEE International Conference on Data Mining Workshops, Washington, DC, USA, pp. 202–207 (2009) 2. Yu, J., Joshi, D., Luo, J.: Connecting people in photo-sharing sites by photo content and user annotations. In: Proceeding of International Conference on Multimedia and Expo, New York, USA, pp. 1464–1467 (2009) 3. Yu, J., Jin, X., Han, J., Luo, J.: Social Group Suggestion from User Image Collections. In: Proceedings of the 19th International Conference on World Wide Web, Raleigh, North Carolina, USA, pp. 1215–1216 (2010) 4. Duan, M., UIges, A., Breuel, T.M., Wu, X.: Style Modeling for Tagging Personal Photo Collections. In: Proceeding of the International Conference on Image and Video Retrieval, Santorini, Fira, Greece, pp. 1–8 (2009) 5. Douze, M., Jegou, H., Sandhawalia, H., Amsaleg, L., Schmid, C.: Evaluation of GIST descriptors for web-scale image search. In: Proceeding of the International Conference on Image and Video Retrieval, Santorini, Fira, Greece, pp. 1–8 (2009) 6. Negoescu, R.A., Gatica-Perez, D.: Analyzing Flickr Groups. In: Proceeding of the International Conference on Image and Video Retrieval, Niagara Falls, Canada, pp. 417–426 (2008)
7. Wang, M., Hua, X.-S., Yuan, X., Song, Y., Dai, L.-R.: Optimizing Multi-Graph Learning: Towards A Unified Video Annotation Scheme. In: Proceedings of the 15th ACM International Conference on Multimedia, Augsburg, Germany, pp. 862– 871 (2007) 8. Chen, H.-M., Chang, M.-H., Chang, P.-C., Tien, M.-C., Hsu, W., Wu, J.-L.: SheepDog-Group and Tag Recommendation for Flickr Photos by Automatic Search-based Learning. In: Proceeding of the 16th ACM International Conference on Multimedia, Canada, pp. 737–740 (2008) 9. Rakotomamonjy, A., Bach, F., Canu, S., Grandvalet, Y.: SimpleMKL. Journal of Machine Learning Research 9, 2491–2521 (2008) 10. Chatzichristofis, S., Boutalis, Y.: CEDD: Color and Edge Directivity Descriptor. A Compact Descriptor for Image Indexing and Retrieval. In: Computer Vision System, pp. 312–322 (2008) 11. Frey, B., Dueck, D.: Clustering by Passing messages Between Data Points. Science, 319–726 (2007) 12. Lewis, D.D., Gale, W.A.: A Sequential Algorithm for Training Text Classifiers. In: Proceeding of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, pp. 3–12 (1994) 13. Wright, J., Yang, A., Ganesh, A., Sastry, S., Ma, Y.: Robust Face Recognition via Sparse Representation. IEEE Transaction on Pattern Analysis and Machine Intelligence 31(2), 210–227 (2009) 14. Zha, Z.-J., Yang, L., Mei, T., Wang, M., Wang, Z.: Visual Query Suggestion. In: Proceeding of the 17th ACM International Conference on Multimedia, Beijing, China, pp. 15–24 (2009) 15. Liu, D., Wang, M., Hua, X.-S., Zhang, H.-J.: Smart batch Tagging of Photo Albums. In: Proceeding of the 17th ACM International Conference on Multimedia, Beijing, China, pp. 809–812 (2009)
A Visualized Communication System Using Cross-Media Semantic Association

Xinming Zhang1,2, Yang Liu1,2, Chao Liang1,2, and Changsheng Xu1,2

1 National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China
2 China-Singapore Institute of Digital Media, Singapore
{xmzhang,liuyang,cliang,csxu}@nlpr.ia.ac.cn
Abstract. Can you imagine that two people who have different native languages and cannot understand other’s language are able to communicate with each other without professional interpreter? In this paper, a visualized communication system is designed to facilitate such people chatting with each other via visual information. Differing from the online instant message tools such as MSN, Google talk and ICQ, which are mostly based on textual information, the visualized communication system resorts to the vivid images which are relevant to the conversation context aside from text to jump the language obstacle. The multi-phase visual concept detection strategy is applied to associate the text with the corresponding web images. Then, a re-ranking algorithm attempts to return the most related and highest quality images at top positions. In addition, sentiment analysis is performed to help people understand the emotion of each other to further reduce the language obstacle. A number of daily conversation scenes are implemented in the experiments and the performance is evaluated by user study. The experimental results show that the visualized communication system is able to effectively help people with language obstacle to better understand each other. Keywords: Visualized Communication, Sentiment Analysis, Semantic Concept Detection.
1 Introduction

The growing trend toward globalization has brought a lot of opportunities and favorable conditions for transnational trade and traveling abroad. It is inevitable that people have to communicate with foreigners frequently. Besides face-to-face chatting, many online instant messaging tools such as MSN, Google Talk and ICQ are designed to help people communicate with each other regardless of where and when they are. However, it is difficult for such tools to enable people who have different native languages and do not understand each other’s language to communicate smoothly. The existing instant messaging systems mentioned above purely transmit textual information, and the language barrier makes them of little use for conversations between foreigners. Machine translation techniques can help such users understand each other, but the translated result may sometimes mislead the users because of its inaccuracy. Therefore, it is necessary to provide a solution for overcoming the language barrier.
In daily communication with foreigners, visual information such as hand signs, body gestures is more understandable when people do not know the language with each other. Since such visual information is a kind of common experience for the people coming from all over the world, for example, if you perform the body gesture of “running”, anybody knows that you want to express the “run” meaning. It enlightens us that it is possible to assist people in communication resorting to visual information such as images, graphic signs. To the best of our knowledge, few studies have been investigated on assisting foreigners’ communications. Some research [1][2][3] aims at detecting chatting topic via the chatter textual information, whose performance mainly lies on the quality of the chatting textual information, instead of making conversation understandable and easy. In this paper, we propose a visualized communication system to provide a solution for foreigners’ conversation with multimedia information. Different from traditional chatting system, our system focuses on the intuitive visual information rather than the textual information. A multi-phase semantic concept detection strategy is proposed to associate the textual information with images collected from the web. Then, the image quality assessment is conducted based on a re-ranking strategy to prune the unsatisfied images and make sure the good quality images at the top positions in the retrieval list. In addition, the sentiment analysis is applied to express the chatter’s emotion with corresponding graphic signs to make the conversation much more understandable. Finally, the representative images and sentiment graphic signs are displayed in proper organization to assist the chatters in understanding the conversation. Our contributions can be summarized as follows: 1. Visualization is a novel technique to assist human communication, which is able to easily jump the language obstacle to make human understand each other. 2. Multi-phase semantic concept detectors not only pay attention to the objects in sentence, but also focus on the activities referred in the conversation, which make the visualized system more precise and vivid. 3. Embedding the sentiment analysis and images re-ranking can significantly prune the ambiguity in conversation and express the chatters’ intention. The remaining sections are organized as follows: Section 2 briefly review the related work. Section 3 introduces the framework of our system. The technical details of each component are described in the Section 4. The experimental results are reported in section 5. We conclude the paper with future work in Section 6.
2 Related Work The rapid growth of Internet leads to more and more chatting tools such as MSN and ICQ. These tools are usually hard to help people of different countries to understand each other due to the language barrier. To tackle this problem, various cartoon expressions or animations are applied to vividly convey the speaker’s intention or emotion. However, current image assistant function is rather limited as well as inconvenient. Specifically, the available image resource in current chatting tools is quite limited and the users have to manually specify the proper image. Motivated by the web
images analysis and annotation, we believe a visualized communication system can greatly promote the mutual understanding between people. To the best of our knowledge, little work has been directly conducted on such image facilitated mediation system. In the following, we will briefly review related work on topic detection in users’ chat messages, semantic concept extraction and re-ranking with sentiment analysis. Existing research on instant message mainly concentrates on processing the textual information. The approach in [3] used Extended Vector Space Model to cluster the instant messages based on the users’ chat content for topic detection. The methods in [1] aimed to extract certain topics from users’ talk and both of them were implemented in an offline manner. In contrast, Dong et.al. [2] analyzed the structure of the chat based on online techniques for topic detection. In the conversation among foreigners where talkers do not understand other’s language, these schemes can do little to facilitate people’s mutual understanding. . In the contrary to such limited expression ability of textual language, images are extremely useful in such cross language mediation. As long as semantically related images are presented, people with different native languages can accurately catch the other’s meaning. To implement such visualized communication, an ideal system should recommend some images closely related to users’ conversation. To extract semantics from images/videos, a series famous concept detectors are proposed, such as Columbia374 [8], VIREO-374 [9] and MediaMill-101 [7]. In [7], the proposed challenging problem for automatic video detection is to pay attention to intermediate analysis steps playing an important role in video indexing. Then they proposed a method to generate the concept detectors, namely MediaMill-101. But the main problem in these handy detectors is that they are all the concepts belonging to the noun domain. The noun detectors are indeed helpful to the low-level feature representing the high-level semantic, but they are limited to important nouns pattern ignoring that lots of necessary modes in people’s chatting include concepts companying the action. It is more meaningful to construct concept detector for the verb-noun phrase to detect uses’ real meaning during the dialogue. At the same time, the task in [4] concentrated on the keypoint-based semantic concept detection. In their work, it covered 5 representation choices that can influence the performance of keypoint-based semantic concept detection. These choices consist of the size of visual word vocabulary, weight scheme, stop word removal, feature selection and spatial information. Its experiment showed that if we choose appropriate configuration of the above items, we can derive a good performance in keypoint-based semantic concept detection. The other fact involves the effect in users’ visualization of the images filtered by semantic concept detectors. Re-ranking and sentiment analysis can help to solve this problem. Current re-ranking techniques can be partitioned into two categories. One is primarily depending on visual cues extracted from the images [Zha et al. MM09]. These visual cues in the re-ranking part are usually different from the visual information during the original search results (e.g. [11]). Another re-ranking technique is co-reranking method utilizing jointly the textual and visual information (e.g. [12]), but these work could not prove the quality of the image. Ke et.al. 
[10] proposed a novel algorithm to assess the quality of a photo. Their work can judge whether a photo is professional or amateur. Finally sentiment analysis can be considered as an additional function for enrich users’ chatting experience.
3 Framework The framework of our proposed visualized communication system is shown in Figure 1. The system consists of two parts: offline training phase highlighted by the yellow background and online communication phase highlighted by the dark green background.
Fig. 1. The framework of our Visualized Communication System
The purpose of offline phase is to associate the proper images with the specific text information by resorting to semantic concept detectors. We collect the image sets for the predefined concepts which are usually appeared in the human daily conversation content. Then, the semantic concept detectors are trained using low-level visual features by SVM classifier. Four modules are contained in the online phase, which are the natural language processing (NLP), semantic concept detection, re-ranking and sentiment analysis. At first, the NLP module extracts the objects and activity keywords referred in the content from the conversation. At the same time, the translated sentence pair will be directly transmitted to the users’ interface. Then, we retrieve the relevant images by querying the keywords on Google image. The concept detection is applied to filter the noisy images. An image quality based re-ranking module is adopted to express the conversation means with high quality images. The sentiment analysis module is to help people to clearly understand the conversation mood of each other. Finally, the recommended images and sentiment graphic sign are organized and displayed to assist the conversation. The details of each part will be described in section 4.
4 Image Recommendation In this section, we introduce the technical details of the four modules in online phase in Figure 1. 4.1 Google Translation and NLP It is intuitive to use translation software or online translation service (e.g. Google Translation) to help people speaking different languages to communicate with each other. However, the state-of-the-art machine translation techniques are far from the real applications in general domains. One of the problems for machine translation is the translation ambiguity which may mismatch users’ original intent. For example, a Chinese student who cannot speak English and is going to visit his old friend living in Nanyang Technological University does not know the meaning of the phrase in a road sign saying “Go straightly by 500m to NTU”, and then he will seek help on the Google translation website. However, the translation result mismatches the users’ original intent “Nanyang Technological University” with “National Taiwan University” in Chinese. Therefore, after reading the translation, the Chinese student may be confused and may think that “Is this direction right?” From this example, the translated result has its ambiguity. To solve this problem, adding some images to associate with the specific word or sentence to be translated is necessary. If the word “NTU” can be sent to Google image search engine, the images both describing the Nanyang Technological University and National Taiwan University are returned. Thus it is easy to understand that NTU in above example means “Nanyang Technological University”. To the extracted keywords which are assigned by the images, we can utilize the NLP tools [13] to analyze the structure of a specific sentence. In our work, NLP tools can help remove stop words at first and Part of Speech (POS) tool will return the part of speech of each word in the sentence. Then we can extract the nouns pattern and the combination of transitive verb and nouns. This combination represents the activity of the object depicted by the sentence. 4.2 Multi-phrase Concept Detector For the visualized communication system, the crucial issue is how to visualize the conversation content. In other word, the system should automatically associate the visual information such as images with textual information at the semantic level. Therefore, the high-level concept detection is applied to tackle this problem. Semantic concept detection is a hot research topic as it provides semantic filters to help analysis and search of multimedia data [4]. It is essentially a classification task that determines whether an image is relevant to a give semantic concept, which also can be defined by text keywords. In our system, the semantic concept detection is applied for associating the conversation content with the images. However, traditional concept detectors usually focus on the noun phrases, which mostly denote the objects in an image. In human daily conversation, many activities will be appeared in the conversation. Therefore, it is necessary to extend current concept detection scope from pure nouns to using the activity concept detectors to clearly express the actions in the conversation. Figure 2 gives an example about the difference between the noun concept and activity concept.
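As a rough illustration of the keyword-extraction step of Sect. 4.1 above (not the authors' implementation: NLTK is used here in place of the Stanford tagger of [13], and the stop-word list and tag patterns are our assumptions), noun keywords and verb+noun activity phrases can be pulled out of a chat sentence as follows.

```python
import nltk
from nltk.corpus import stopwords
# requires the NLTK 'punkt', 'averaged_perceptron_tagger' and 'stopwords' data packages

def extract_keywords(sentence):
    """Return noun keywords and verb+noun activity phrases from one chat sentence."""
    stop = set(stopwords.words("english"))
    words = [w for w in nltk.word_tokenize(sentence)
             if w.lower() not in stop and w.isalpha()]
    tagged = nltk.pos_tag(words)                       # Part-of-Speech tagging
    nouns = [w for w, t in tagged if t.startswith("NN")]
    activities = [f"{w1} {w2}"                         # transitive verb followed by a noun
                  for (w1, t1), (w2, t2) in zip(tagged, tagged[1:])
                  if t1.startswith("VB") and t2.startswith("NN")]
    return nouns, activities

print(extract_keywords("Do you know how to drive a car?"))
# e.g. nouns like 'car' and activity phrases like 'drive car' (tagger-dependent)
```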
Fig. 2. The difference between the noun and activity concepts
From Figure 2, we can see that the images that describing the noun and activity concept are apparently different. Suppose that a man likes car but cannot drive a car. He says to anther foreign person via a chatting system “Do you know how to drive a car?” If a chatting system is mainly based on noun concepts, it will detect the concept “car” and then return the images possibly like the images in the first row in Figure 2. After watching the images, the foreigner still cannot get the meaning of driving a car. However, if we trained not only the noun concept detectors but also more activity concept detectors representing the human’s activity, the images lied in the second row in Figure 2 can be presented to the user. The main difference between the noun concept detector and activity concept detector lies in that the activity concept detector can express the human’s action. In order to implement the function of detecting the action in the users’ conversation, two tasks should be involved. First we pre-define most usual the transitive with noun phrases in our chat and collect the corresponding web images to train for the concept detectors representing the human activity. We follow the approach in [4]. Once these concept detectors are prepared, they can judge whether a new image contains the concept they required. 4.3 Re-ranking Due to the various qualities returned by the search engine, the re-ranking algorithms may be not a good choice if applied in our system. Although maybe these re-ranking methods can correlate the images to the query, the performance of the re-ranked list may be aggravated if these top images are blurry. We follow the three criteria proposed in [10], which is aim to assess the quality of the photos taken by different people, to conduct re-ranking of searched images. This is different from the traditional re-ranking techniques, which rank the image list again based on another visual cue. Now the image set filtered by semantic concept detectors can be derived in section 4.2, but these images only mean they are related to the noun concept or activity concept. Therefore, if utilizing the traditional re-ranking approaches, it does not seem useful for proving the images with high quality ranked at the top positions. Take the noun concept “volleyball” as an example. If one of the speakers wants to learn what the volleyball looks like, the “volleyball” concept detector can return the images after filtering the images excluding the volleyball. Suppose that the images
Fig. 3. The professional and amateurish images
in the first row in Figure 3 are obtained after concept detector filtering, we can see clearly that the first two images are closer to the speaker’s intent, while the third one depicts volleyball in a volleyball match which seems disorganized to the user’s intention. Therefore, if we recommend the third image to the user, the user must be very unsatisfactory. To assess whether an image is professional, there exist three factors. The first one is Simplicity. Compared with snapshots which are often unstructured, busy, and cluttered, it is easy to separate the subject from the background in professional photos. The second one is realism which is another quality to differentiate the snapshots and professional photos. Snapshots look “real” while professional photos look “surreal”. The last one is about photographers’ basic techniques. It is extremely rare for an entire photo taken by a professional to be blurry. 4.4 Sentiment Analysis Sometimes the same sentence attached to different users’ attitudes will generate different meanings. If two or more users cannot know others’ attitude, they may be misunderstood by the text sentence without expressing others’ emotion. Therefore, it is essential to augment the sentimental information into the system.
Fig. 4. The predefined four sentiment sets
Here we mainly want to visualize the mood of a user’s conversation via some images. We predefine four sentiment sets, shown in Figure 4, representing the approving opinion, the opposing opinion, a happy mood and an angry mood, respectively. These four sets contain most of the daily used words that can express the emotion of a person, as well as some punctuation. At the same time, we correlate some emotional pictures with these sets. When we detect elements belonging to one or more of the sets, the corresponding pictures are presented to the users.
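A minimal sketch of this lookup is shown below; the word lists and picture file names are purely illustrative placeholders, since the paper does not enumerate its four sets.

```python
import re

# Illustrative sentiment sets; the actual word lists and pictures are not published.
SENTIMENT_SETS = {
    "approve": {"yes", "sure", "agree", "ok"},
    "oppose":  {"no", "disagree", "never", "impossible"},
    "happy":   {"happy", "wonderful", "glad", "!"},
    "angry":   {"angry", "annoyed", "terrible"},
}
SENTIMENT_PICTURES = {name: f"{name}.png" for name in SENTIMENT_SETS}

def suggest_mood_pictures(message):
    """Return the mood pictures triggered by words/punctuation in the message."""
    tokens = set(re.findall(r"[a-z']+|[!?]", message.lower()))
    hits = [name for name, words in SENTIMENT_SETS.items() if tokens & words]
    return [SENTIMENT_PICTURES[name] for name in hits]

print(suggest_mood_pictures("Great, I agree!"))   # -> ['approve.png', 'happy.png']
```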
5 Experiments

In our experiment, we define 15 scenes (namely traveling, asking the way, and so on), which involve about 90 concept detectors in total, consisting of 30 noun semantic concept detectors and 60 activity semantic concept detectors, respectively. The data are collected via the Google image search engine, Flickr and other sources. In order to train the concept detectors, we follow the rule that the combination of local and global features can boost the performance of the semantic concept detectors [4]. Therefore, we utilize three low-level features: bag-of-visual-words (BOW), color moment (CM) and wavelet texture (WT). In BOW, we generate a 500-dimensional vector for each image. In CM, the first 3 moments of the 3 channels in HSV color space over an 8 × 8 grid partition are used to form a 384-dimensional color moment vector. For WT, we use 3 × 3 grids and each grid is represented by the variances in 9 Haar wavelet sub-bands to generate an 81-dimensional wavelet texture vector. The raw outputs from the SVM classifiers are then converted to posterior probabilities using Platt’s method. The probabilities are combined by average fusion into a score, which indicates the confidence of detecting a concept in an image. We divide our experiments into 2 parts. The first part (Section 5.1) gives the accuracy of 20 activity concept detectors selected from all the scenes. An interactive user interface of our system is shown in Section 5.2. We conduct a user study to evaluate the visualized communication system in Section 5.3.

5.1 The Accuracy of the Semantic Activity Concept Detectors
We totally selected 6 semantic concept detectors from the traveling scene. The accuracy of human action detection of the traveling scene is shown in Figure 5. There are 6 groups along the horizontal axis, each consisting of 4 bars representing the accuracy of the CM, WT, BOW and the average fusion of them respectively. In general, the performance of BoW outperforms those of CM and WT due to the superiority of the local feature. The CM achieves the comparable result with the WT. Moreover, the fusion result achieves the best in most cases, which can be attributed to the complementary superiority collaborating both local and global features. In some specified cases, the results of CM are the best, such as diving and go surfing, which is mainly due to the homogeneous background in images related to these concepts. 5.2 User Interface of Visualized Communication System
The user Interface of our system is shown in Figure 6. It can be divided into 2 parts. The first part in the title region lists users’ name and the current system time. Users can select the scene in this region.
Fig. 5. The accuracy of the semantic concept detectors
Fig. 6. User Interface of Visualized Communication System
The other part consists of two sections. The left section recommends the images related to the users’ conversation. The right section shows the current user list and the conversation. Our system can be used by the users speaking different languages. The translated sentences will be displayed under the original sentence. 5.3 The User Study
We invited 20 foreigners to evaluate our visualized communication system. We predefine five rating levels for our proposed system, which stand for the degree of satisfaction, namely “Very satisfied”, “Satisfied”, “Just so-so”, “Not satisfied” and “Disappointed”. The evaluation scores are shown in Figure 7.
From the evaluation results, we can see that the users are satisfied in most cases. However, there are still some disappointed votes for our system. We think the reasons may come from two aspects. One is the inherent difficulty of associating textual information with visual information: it is hard to find a proper image to represent every keyword due to the semantic gap. The other is that a poor concept detection result will degrade the performance of the image recommendation. Nevertheless, the result in Figure 7 shows that our system can enhance the quality of chatting between users speaking different languages by resorting to the sentiment graphics and recommended images.
Fig. 7. The Evaluation Result of User Study
6 Conclusion The visualized communication system incorporating the multimedia items via visual information can really help the users who have an obstacle to communicate with each other. The multi-phase visual concept detection strategy can detect the most actions related to the conversation so that they can provide some assisting and relevant images for the users. The experimental results show that the visualized communication system is able to effectively help people with language obstacle to better understand each other. In the future, we plan to further study the user’s profile information, such as age, gender and education background, to provide rich multimedia images/video to facilitate the mediation process. In addition, user feedback technology will also be utilized to improve the system’s utility with less operation but more suitable accommodation.
References 1. Adams, P.H., Martell, C.H.: Topic Detection and Extraction in Chat. In: 2008 IEEE International Conference on Semantic Computing, pp. 581–588 (2008) 2. Dong, H., Hui, S.C., He, Y.: Structural analysis of chat messages for topic detection. Online Information Review, 496–516 (2006) 3. Wang, L., Jia, Y., Han, W.: Instant message clustering based on extended vector space model. In: Proceedings of the 2nd International Conference on Advances in Computation and Intelligence, pp. 435–443 (2007)
4. Jiang, Y.-G., Yang, J., Ngo, C.-W., Hauptmann, A.G.: Representations of Keypoint-Based Semantic Concept Detection: A Comprehensive Study. IEEE Transitions on Multimedia, 42–53 (2009) 5. Jiang, Y.G., Ngo, C.W., Chang, S.F.: Semantic context transfer across heterogeneous sources for domain adaptive video search. In: Proceedings of the Seventeen ACM International Conference on Multimedia, pp. 155–164 (2009) 6. Snoek, C.G.M., Huurnink, B., Hollink, L., de Rijke, M., Schreiber, G., Worring, M.: Adding semantics to detectors for video retrieval. IEEE Transaction on Multimedia 9(5), 975–986 (2007) 7. Snoek, C.G.M., Worring, M., Van Gemert, J.C., Geusebroek, J.M., Smeulders, A.W.M.: The challenge problem for automated detection of 101 semantic concepts in multimedia. In: Proceedings of the 14th Annual ACM International Conference on Multimedia, p. 430 (2006) 8. Yanagawa, A., Chang, S.-F., Kennedy, L., Hsu, W.: Columbia university’s baseline detectors for 374 lscom semantic visual concepts. In: Columbia University ADVENT Technical Report #222-2006-8 (2007) 9. Jiang, Y.-G., Ngo, C.-W., Yang, J.: Towards optimal bag-of-features for object categorization and semantic video retrieval. In: Proceedings of the 6th ACM International Conference on Image and Video Retrieval, p. 501 (2007) 10. Ke, Y., Tang, X., Jing, F.: The design of high-level features for photo quality assessment. In: CVPR 2006 (2006) 11. Natsev, A., Haubold, A., Tesic, J., Xie, L., Yan, R.: Semantic concept-based query expansion and re-ranking for multimedia retrieval. In: ACM Multimedia, p. 1000 (2007) 12. Yao, T., Mei, T., Ngo, C.W.: Co-reranking by Mutual Reinforcement for Image Search. In: Proceedings of the 6th ACM International Conference on Image and Video Retrieval (2010) 13. http://nlp.stanford.edu/software/tagger.shtml 14. Shih, J.-L., Chen, L.-H.: Color image retrieval based on primitives of color moments. In: IEE Proceedings-Vision, Image, and Signal Processing, p. 370 (2002) 15. Van de Wouwer, G., Scheunders, P., Dyck, D.V.: Statistical Texture Characterization from Discrete Wavelet Representations. IEEE Transactions on Image Processing, 592-598 (1999) 16. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. IJCV, 213–238 (2007) 17. Keshtkar, F., Inkpen, D.: Using Sentiment Orientation Features for Mood Classification in Blogs. IEEE, Los Alamitos (2009) 18. Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: sentiment classification using machine learning techniques. In: Proceedings of the ACL 2002 Conference on Empirical Methods in Natural Language Processing, vol. 10 (2002) 19. Zha, Z.-J., Yang, L., Mei, T., Wang, M., Wang, Z.: Visual query suggestion. In: Proceedings of ACM International Conference on Multimedia, pp. 15–24 (2009)
Effective Large Scale Text Retrieval via Learning Risk-Minimization and Dependency-Embedded Model Sheng Gao and Haizhou Li Institute for Infocomm Research (I2R), A-Star, Singapore, 138632 {gaosheng,hli}@i2r.a-star.edu.sg
Abstract. In this paper we present a learning algorithm to estimate a risksensitive and document-relation embedded ranking function so that the ranking score can reflect both the query-document relevance degree and the risk of estimating relevance when the document relation is considered. With proper assumptions, an analytic form of the ranking function is attainable with a ranking score being a linear combination among the expectation of relevance score, the variance of relevance estimation and the covariance with the other documents. We provide a systematic framework to study the roles of the relevance, the variance and the covariance in ranking documents and their relations with the different performance metrics. The experiments show that incorporating the variance in ranking score improves both the relevance and diversity. Keywords: Risk minimization; Diversity search; Language Model.
1 Introduction

The task of an information retrieval (IR) system is to retrieve documents relevant to the information needs of users and to rank them with respect to their relevance to the query. Thus, the design of the ranking function (or ranker) that computes the relevance score has been a central topic in the past decades. One of the popular ranking principles is based on Bayesian theory, where the documents are ranked according to the odds of the relevance and irrelevance probabilities for a query-document pair (e.g. [6, 7, 9, 10]). Many rankers are thus derived with different assumptions on the query model and document model, such as Okapi [12], Kullback-Leibler divergence and query log-likelihood [6, 7, 8], and cosine distance with tf-idf features [2]. These functions share three common properties: 1) the uncertainty of parameter estimation on the query or document models is ignored; 2) the ranking score thus ignores the uncertainty of its calculation; and 3) the relevance score, often calculated for each query-document pair independently, excludes the document relationship. For example, the document or query models in LM are point estimates under the ML criterion [6, 7, 8], and so is the ranking score. When we use the point estimate of relevance in ranking, a risk arises due to the uncertainty of estimation. In addition, because of the independence assumption on documents when calculating the relevance score, the diversity related to the query topic is degraded. Recent research in diversity search tries to address this problem so that the top-N documents deliver as many relevant sub-topics as possible [3, 4, 19, 20, 21].
However, none of the methods can calculate the ranking score in a principle way when the document dependency is embedded, most of them using an ad-hoc method to interpolate the relevance and divergence scores. In the paper we present a principle method to learn a ranking function when the document relation is embedded. The novel ranker is derived by minimizing a risk function, which measures the expected loss between the true ranking score and the predicted one on a set of documents interested (For clarity, we use the ranking score to refer to the final value for ranking documents. It is the sum of relevance score and uncertainty score. The relevance score holds its traditional meaning that refers to query-document similarity while the uncertainty score measure the risk in estimating similarity score and related to the diversity.). With suitable assumptions (See section 2.), an analytic form is resulted in and the ranking score becomes a linear combination among the expectation of the relevance score, the uncertainty (variance) of relevance estimation, and the abstract of the covariance between the document and the others, which measures the document dependency. The evaluation is done on the ImageCLEF 2008 set used for the ad-hoc photo image retrieval for diversity search. We study the roles of variance and covariance in ranking and their effects on mean average precision (MAP, relevance measure) and cluster recall at the top-N (CR@N, diversity measure). Our analysis will show that incorporating the variance improves both the relevance performance and diversity performance, and the addition of the covariance further improves the diversity performance at the cost of relevance performance. In the next Section, the proposed risk-minimization and dependency-embedded ranking function (RDRF) is introduced. Then we report the experiments in Section 3. The related work and our findings are discussed in Sections 4 and 5 respectively.
2 Learning RDRF Ranker

In this section, we elaborate the RDRF ranker using a statistical methodology. Firstly, the ranking function is treated as a random variable rather than a deterministic value; secondly, an objective function is defined, which measures the expected loss between the true ranking scores and the predicted ones over the documents; lastly, a particular RDRF ranker is derived based on the query log-likelihood ranking function.

2.1 Ranking Functions as Random Variables

In LM based IR, the relevance score between a query q = (q_1, …, q_{|V|}) (q_k: the k-th term frequency in the query; |V|: the size of the vocabulary) and a document d = (d_1, …, d_{|V|}) (d_k: the k-th term frequency in the document) can be calculated from the query log-likelihood generated by the document model θ_d = (θ_{d,1}, …, θ_{d,|V|}), or from the discrepancy between the query LM and the document LM. Here the query log-likelihood function (see Eq. (1)) is chosen as the ranking function to develop the RDRF ranker, while the principle can be applied to the others:

    R(q, d) = Σ_{k=1}^{|V|} q_k log θ_{d,k}.    (1)
In this case, the query model is the query term frequency and the document model is a unigram LM. Conventionally, the LM θ_d is computed using the ML criterion before calculating the ranking scores. Obviously, the ranking score is one value of Eq. (1) at the estimated point of the document model. To improve the robustness of score estimation, we can sample as many points as possible from the distribution p(θ_d | d) to collect a large number of scores for each query-document pair, and then use their average as the ranking score. Such an estimate should be more robust than the point estimate; however, it is time consuming. Fortunately, we will soon find that sampling is not necessary. Because θ_d is random, R(q, d) is also a random function. Thus, for each query-document pair we have a random variable to characterize the distribution of the query-document relevance score, in which θ_d is integrated out under some suitable assumptions.

2.2 Risk-Minimization Based RDRF Ranker

For each query-document pair, we have a random variable to characterize the distribution of its relevance score. Thus, given a set of documents (or the corpus), we have a set of such variables to describe the joint distribution of the set of relevance scores. The RDRF algorithm tries to measure and minimize the loss between the true ranking scores and the predicted ones on the documents in order to find the optimal one.

2.2.1 Loss Function on the Set

We first define a few notations used in the discussion.
Ω: the corpus with |Ω| documents; d_i: the i-th document in Ω.
R_i: the ranking function of d_i (see Eq. (1)).
p(R_i | q, d_i, Ω_{−i}): the distribution of R_i depending on the query q, the document d_i and the set of documents Ω_{−i} excluding d_i.
R̂_i: the predicted ranking score of d_i.
R: the overall ranking score on the set of documents. The whole corpus is used here for discussion; the findings are applicable in the other cases.
p(R | q, Ω): the distribution of R.

Now the overall ranking score is defined as (w_i: the weight of R_i)

    R = Σ_{i=1}^{|Ω|} w_i R_i.    (2)

We try to seek a set of estimates R̂_i so that Eq. (2) is maximized. It is noted that R_i is connected with d_i by Eq. (1), so p(R_i | q, d_i, Ω_{−i}) should be known in theory as a function of the document models, which are estimated from the documents. We denote the predicted overall ranking score as R̂. Its optimal estimate is found by minimizing Eq. (3),

    R̂* = argmin_{R̂} E[ l(R̂, R) | q, Ω ].    (3)

Here E[· | ·] denotes the expected loss and l(·) is the loss function, which is the LINEX function in Eq. (4) [13, 15, 18] (b is the risk weight):

    l(R̂, R) = e^{b(R̂ − R)} − b(R̂ − R) − 1.    (4)

The optimal Bayesian estimate R̂* is

    R̂* = −(1/b) ln E[ e^{−bR} ] = Σ_{n≥1} (−b)^{n−1} κ_n / n! ≈ κ_1 − (b/2) κ_2 + (b²/6) κ_3 − ⋯.    (5)
∑|
|
∑|
∑|
|
|
(6) (7)
is the mean of and is the covariance between and . ReHere placing Eq.(6) and Eq.(7) into Eq. (5), we can get the overall ranking score as, ∑|
|
⁄ ∑|
⁄
|
(8)
In Eq. (8) there are three components in the rectangle bracket. The first one is the expectation of relevance score for . The second one is its variance, an uncertainty measure coming from the document itself. And the last one is the covariance coming from the other documents, an uncertainty measure due to dependency on the other . Here documents. The last two summarizes the overall uncertainty in estimating the variance and covariance are treated separately because of their different effects on the ranking relevance (See section 3 for experimental studies.) Now we separate the term related to and get a novel ranking function as, ⁄ ∑|
⁄
̃
|
(9)
and . However, it is not easy. In the To calculate ̃ , we need to know following, we discuss two practical ways for computation of Eq.(9). A) Documents are independent of each other when calculating . and Therefore, for the query log-likelihood function, the mean is calculated as,
|
∑|
|
∑|
|
log
|
log
(10)
is the posterior distribution. Here it is the Dirichlet distribution,
| |
∏
∏|
|
|
(11)
. Γ · : gamma function.
the document length.
log
.
According to the properties of the Dirichlet distribution, (12)
Effective Large Scale Text Retrieval
· : the digamma function),
Therefore, the mean is calculated as ( ∑|
Similarly,
|
is derived as in Eq. (14) ( ∑|
|
∑|
(13)
· : trigamma function),
B) Documents are dependent when calculating covariance To get the covariance, we calculate from Eq.(15). | ,
log
In order to get log log , we need to estimate to the Bayes’ rule, it is found that, |
,
103
(14) .
log
(15)
,
,
. According (16)
is independent of given . Thus, we When we derive Eq. (16), we assume induce an order in the document pair, i.e. Eq.(16) measuring the information flow to . In general, it is not equal to the information flowed from to . We from will soon discuss its effect on the covariance. According to the properties of the Dirichlet distribution, Eq.(17) is derived to calculate
log
log
),
(
log
: a pseudo-document with the term frequency log
Now the covariance is computed as, ∑|
|
log
log
log ∑|
|
log
(17)
log
(18)
In Eq. (18), the first sum is the expectation of relevance score of . The second sum is the difference between the expectation of relevance score for conditioned on , ) and the expectation of (Document model is (Document model is | , , , Eq. (18) is ). Since asymmetric. Strictly saying, it is not a covariance. But here we still use the term to measure the dependency between the documents. 2.2.3 Discussions Substituting Eqs.(13, 14, 18) into Eq.(9), we will get the ranking score for each querydocument pair in case of the document dependency included. It is obvious that the value of the covariance (plus variance) is not comparable in the scale to that of expectation. It is tough to balance them by adjusting the risk weight , because the range of
104
S. Gao and H. Li
the risk weights depend on the size of document set. To normalize such effect, herein we introduce 3 types of methods to calculate the covariance abstract. A) Covariance average The average of covariance (plus variance) is used as the covariance abstract, i.e., ̃
| |
̃ will have the comparable range with noted that . The prior weights
∑|
|
(19)
for the variable size of | |. It is are discarded in the paper.
B) Maximal covariance The average covariance can smooth the effects of the uncertainties coming from different documents. Like in MMR [3] to use the maximal margin as the measure of diversity, we can also use the maximal covariance (plus variance) as the abstract, i.e., ̃
max
,| |
(20)
Thus, if a document has the higher covariance with the others, i.e. more similar in the content with others, it will get larger penalty in calculating ranking score. It means that the ranker will promote the documents which contain novel information. C) Minimal covariance Similarly, the minimal covariance can also be used. ̃
min
,| |
(21)
In the above discussions, we know that we need to find a working set for each document in order to calculate the covariance. For simplicity, the whole corpus is selected in the above. In practice, the working set may depend on the individual document and thus vary according to the applications. For example, if we use the rankers in Eqs. (19-21) to rank all documents in the corpus, the working set is the corpus. But if we want to re-rank the documents in a ranking list in a top-down manner, the working set for a document, saying , may only include the documents that are ranked higher than in the ranking list.
3 Experiments We implement the proposed rankers based on the Lemur toolkit1. Lemur is the representative of the up-to-date technologies developed in IR models. It is used to develop the benchmark system and our ranking systems. 3.1 Evaluation Sets The experiments are carried out on the ImageCLEF 2008 (CLEF08) set, which is officially used for evaluating the task of the ad-hoc image retrieval. It has 20,000 documents with the average length 19.33 words. In the paper we do experiments only on the text modality. The query contains the terms in the field of title with an average 1
http://www.lemurproject.org
Effective Large Scale Text Retrieval
105
length 2.41 terms. Totally 39 queries are designed. Because the queries are designed to evaluate the diversity, they have much ambiguity. 3.2 System Setup and Evaluation Metrics The Jelinek-Mercer LM2 is chosen for document representation. The baseline system is built on the traditional query log-likelihood function (See Eq. (1)). Although the proposed rankers can be applied to retrieve and rank the documents from the whole corpus, considering the computation cost, we run our rankers on a subset which contains the top-1000 documents in the initial ranking list generated by the baseline. The performances of different systems are compared in terms of multiple metrics including mean average precision (MAP) for the relevance performance and cluster recall at the top 5 (CR@5) and 20 (CR@20) for the diversity performance. The cluster recall at the top-N documents measures the accuracy of the sub-topics in the top-N documents for a query. It is calculated by the number of sub-topics covered in the topN documents divided by the total sub-topics of the query. In the ImageCLEF08 set, the sub-topic label is tagged for each document and query in the pooled evaluation set besides the relevance label3 . 3.3 Result Analysis Now we study the behaviors of the RDRF rankers. From the discussion in Section 2, we know that the RDRF based ranking scores contain 3 components: expectation of the relevance score, variance and covariance. In addition, there is a tuning ter . In the following, we will study the performance as a function of the risk weight in the following conditions: 1) variance without document relation (See section 3.3.1), 2) covariance (See section 3.3.2), 3) various covariance abstract methods (See section 3.3.3) and 4) working set selection (See section 3.3.4). In the first two studies, (Eq. (20)) is chosen while the working set is same in the first 3 stuthe ranker dies4. 3.3.1 Effects of Variance Figure 1 depicts the changing performance as a function of the increasing risk weight in the case of only the variance being considered. Obviously, the performances are improving as the weight is increasing. At some points, they reach their maximum and then drops. The maximal MAP, CR@5 and CR@20 are 0.1310 ( ), 0.1340 ( ) and 0.2316 ( ), respectively. In comparison with the baseline performance, which has 0.1143 for MAP, 0.1182 for CR@5 and 0.1990 for CR@20, the significant improvements are observed5. The relatively improvements are 14.6% (MAP), 13.4% (CR@5) and 16.4% (CR@20), respectively. Results are reported here based on the smoothing parameter . . Findings are similar for others. 3 http://imageclef.org/2008/photo 4 The working set for a document contains the documents ranked higher than it in the initial ranking list. For computational consideration, currently only the top-100 documents are reranked using the learned rankers and the other documents in the initial ranking list keep their ranking positions. 5 We test statistical significance using t-test (one-tail critical value at the significance level 0.05).
2
106
S. Gao and H. Li
Therefore, the Bayesian estimation of ranking score gets the better performance than the traditional point estimation based method. The addition of variance further improves the performance. In the experiment, the performances with the zero risk weight are 0.1195 (MAP), 0.1244 (CR@5) and 0.2022 (CR@20) respectively, which are worse than the performances obtained with the optimal risk weights. Our investigations on Eqs.(9, 14, 20) reveal that the variance functions as a penalty added to the relevance estimation. With the same term frequency, the term penalty in a long document is higher than that in a short document. The trigamma function in Eq.(14) also normalizes the effect of the document lengths on the term contribution. Due to the normalization of term and document length, the estimated ranking score become robust and has the positive effect on the performances. MAP
CR@5
CR@20
0.2 0.1 0 0
2
4
6
8
10
Fig. 1. Performance vs. risk weight b (only the variance is considered) VAR MAX_COV
MAX_COV
0.2
0.2
0.15
0.15
0.1
0.1
0.05
0.05
0
0 0
2
4
(a) MAP
6
8 10
VAR MAX_COV
VAR 0.3 0.2 0.1 0 0
2
4
(b) CR@5
6
8
10
0
2
4
6
8
10
(c) CR@20
Fig. 2. Performance vs. risk weight b for the covariance based ranker (MAX_COV) and the variance only ranker (VAR)
3.3.2 Effects of Covariance We now add the covariance into the ranking score. Its effect on the performance is illustrated in Figure 2 and is compared with the performance when the dependency is ignored (i.e. only variance, see section 3.3.1). We have the following findings. 1) The inclusion of covariance improves the diversity performance. In the experiments, adding the dependency obviously improves the diversity performance compared with the case where only the variance is included. We study
Effective Large Scale Text Retrieval
107
maximum of CR@5 and CR@20 for the covariance based ranker (MAX_COV) and the variance based ranker (VAR). The CR@5 has a relative increment 12%, which reaches 0.1501, while CR@20 improves 4.7% which achieves 0.2424. Since the covariance measures the document dependency, it is not surprising to see it improve the diversity performance. 2) Incorporating the covariance decreases MAP, which coincides with the observations in [3, 4, 19, 20]. Since the diversity only concerns the novelty among the documents rather than the similarity, we can understand that in most of times, the diversity algorithms might have negative effects on the relevance metric. 3.3.3 Effects of Covariance Abstract Methods Figure 3 illustrates the performances for the rankers based on 3 covariance abstract methods, i.e. covariance average (AVG_COV), maximal covariance (MAX_COV) and minimal covariance (MIN_COV) (see Sec. 2.2.3 for details). AVG_COV MAX_COV MIN_COV
0.15
AVG_COV MAX_COV MIN_COV
AVG_COV MAX_COV MIN_COV
0.3
0.15
0.05
0.1
0.05 0
2
4
6
8 10
0
(a) MAP
2
4
6
8
0
10
(b) CR@5
2
4
6
8
10
(c) CR@20
Fig. 3. Performance vs. risk weight b for the rankers with covariance average (AVG_COV), maximal covariance (MAX_COV) and minimal covariance (MIN_COV) MAX_COV_N MAX_COV
0.2
MAX_COV_N MAX_COV
S. Gao and H. Li
found for CR@20 which reaches 20% relative increment. For CR@5, MAX_COV obtained 8.2% increment. 3.3.4 Working Set Selection In the above experiments, we collect all documents that are ranked higher than the document of interest as the working set. Thus, each document has a different working set. We call it the rank-listed working set. But if the ranking order is not available, we can use all documents as the working set. Figure 4 compares the performances between two working set schemes. One is used in the above experiments (MAX_COV) and another is the working set which contains all documents to be ranked (MAX_COV_N). In the experiments, the latter working set includes top-100 documents to be re-ranked for each document. Figure 4 shows that the rank-listed working set works better. But their maximal performances are similar. It is seen that the performance of MAX_COV_N drops quicker than that of MAX_COV off the optimal setting. In other words, MAX_COV is more stable. This is because that the MAX_COV_N includes more documents in its working set than MAX_COV set. Thus, the MAX_COV_N incurs more noises in calculating the document dependency measure.
4 Related Work Lafferty & Zhai [6, 16] presented the risk minimization framework for IR model, where the user browsing documents is treated as the sequential decision. They proved that the commonly used ranking functions are the specials of the framework with the particular choices of loss functions, query and document models and presentation schemes. Although finding the optimal decision is formulated as minimizing the Bayesian loss, in practice they approximated the objective function using the point estimation in their work. In our work the full Bayesian approach is exploited by integrating out the model parameters. Thus, our work results in a novel ranker, which contains: 1) the similarity expectation, a measure of the average relevance and 2) the covariance (plus variance), a measure of the risk of the expectation estimation that is related to diversity performance. The above novelties also make our work different from the risk-aware rankers presented by [13, 18]. In their work, each term is treated as an individual ranker and the uncertainty of the term ranker is estimated independently. Therefore, they need to combine all term rankers to get the query-document ranking score. In comparison, we directly estimate the expectation and the uncertainty. This gives us a single value for each query-document pair to measure the expected relevance score and the uncertainty. Incorporating the covariance into the ranking score is now a natural result in our risk-minimization and document-dependency embedded framework. In contrary, the alternative method is used to calculate the covariance in [13]. When applied to diversity search, our work is quite different from the methods such as [3, 4, 19, 20]. In their works, they developed the methods to estimate the diversity degree of the documents and then linearly combined the diversity scores with the similarity scores in the original ranking list. This means that their similarity scores
and diversity scores are estimated separately and may follow different criteria, whereas in our work both are derived under a unified risk-minimization framework.
5 Conclusion
We have presented a learning algorithm for a ranking function in which the uncertainty of the relevance estimation and the relations between documents are embedded. Learning the ranking function is formulated in the framework of Bayesian risk minimization. With proper assumptions, an analytic form of the ranking function is attainable. The resulting ranking score is a linear combination of the expectation of the relevance score, the variance of the expectation estimate, and the covariance with other documents. The presented algorithm provides a systematic way to study the relation among relevance, diversity, and risk. The roles of the variance and covariance in ranking are studied empirically. Including the variance in the ranker improves both relevance and diversity performance, and incorporating the covariance further improves diversity at the cost of some relevance (MAP). The tunable risk weight allows us to balance relevance against diversity. In the future, we will investigate how to embed an adaptive query model in the framework.
References
1. Allan, J., Van Rijsbergen, C.J.: Probabilistic models of information retrieval based on measuring the divergence from randomness. ACM Trans. on Information Systems 20(4), 357–389 (2002)
2. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval. Addison-Wesley, Reading (1999)
3. Carbonell, J., Goldstein, J.: The use of MMR, diversity-based reranking for reordering documents and producing summaries. In: Proc. of SIGIR 1998 (1998)
4. Chen, H., Karger, D.R.: Less is more: probabilistic models for retrieving fewer relevant documents. In: Proc. of SIGIR 2006 (2006)
5. Jelinek, F., Mercer, R.: Interpolated estimation of Markov source parameters from sparse data. Pattern Recognition in Practice, 381–402 (1980)
6. Lafferty, J.D., Zhai, C.: Document language models, query models and risk minimization for information retrieval. In: Proc. of SIGIR 2001 (2001)
7. Lavrenko, V., Croft, W.B.: Relevance-based language models. In: Proc. of SIGIR 2001 (2001)
8. Ponte, J.M., Croft, W.B.: A language modeling approach to information retrieval. In: Proc. of SIGIR 1998 (1998)
9. Robertson, S.E., Sparck Jones, K.: Relevance weighting of search terms. Journal of the American Society for Information Science 27(3), 129–146 (1976)
10. Robertson, S.E., Walker, S.: Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In: Proc. of SIGIR 1994 (1994)
11. Robertson, S.E.: The probability ranking principle in IR. Readings in Information Retrieval, 281–286 (1997)
12. Robertson, S.E., Walker, S., Hancock-Beaulieu, M., Gatford, M., Payne, A.: Okapi at TREC-4. In: Proc. of Text Retrieval Conference (TREC) (1995)
13. Wang, J., Zhu, J.H.: Portfolio theory of information retrieval. In: Proc. of SIGIR 2009 (2009)
14. Zaragoza, H., Hiemstra, D., Tipping, M., Robertson, S.E.: Bayesian extension to the language model for ad hoc information retrieval. In: Proc. of SIGIR 2003 (2003)
15. Zellner, A.: Bayesian estimation and prediction using asymmetric loss functions. Journal of the American Statistical Association 81(394), 446–451 (1986)
16. Zhai, C., Lafferty, J.D.: A risk minimization framework for information retrieval. Information Processing and Management 42(1), 31–55 (2006)
17. Zhai, C., Lafferty, J.D.: A study of smoothing methods for language models applied to information retrieval. ACM Trans. on Information Systems 22(2), 179–214 (2004)
18. Zhu, J.H., Wang, J., Cox, I., Taylor, M.: Risky business: modeling and exploiting uncertainty in information retrieval. In: Proc. of SIGIR 2009 (2009)
19. Zhai, C., Cohen, W., Lafferty, J.: Beyond independent relevance: methods and evaluation metrics for subtopic retrieval. In: Proc. of SIGIR 2003 (2003)
20. Gollapudi, S., Sharma, A.: An axiomatic approach for result diversification. In: Proc. of WWW 2009 (2009)
21. Radlinski, F., Kleinberg, R., Joachims, T.: Learning diverse rankings with multi-armed bandits. In: Proc. of ICML 2008 (2008)
Efficient Large-Scale Image Data Set Exploration: Visual Concept Network and Image Summarization
Chunlei Yang 1,2, Xiaoyi Feng 1, Jinye Peng 1, and Jianping Fan 1,2
1 School of Electronics and Information, Northwestern Polytechnical University, Xi'an, P.R. China
2 Dept. of Computer Science, UNC-Charlotte, Charlotte, NC 28223, USA
Abstract. When large-scale online images come into view, it is very important to construct a framework for efficient data exploration. In this paper, we build exploration models based on two considerations: inter-concept visual correlation and intra-concept image summarization. For inter-concept visual correlation, we have developed an automatic algorithm to generate a visual concept network, which is characterized by the visual correlation between image concept pairs. To incorporate reliable inter-concept correlation contexts, multiple kernels are combined and a kernel canonical correlation analysis algorithm is used to characterize the diverse visual similarity contexts between the image concepts. For intra-concept image summarization, we propose a greedy algorithm that sequentially picks the best representatives of the image concept set. The quality score for each candidate summary is computed from the clustering result and considers the relevancy, orthogonality, and uniformity terms at the same time. Visualization techniques are developed to assist users in assessing the coherence between concept pairs and in investigating the visual properties within a concept. We have conducted experiments and user studies to evaluate both algorithms; we observed very good results and received positive feedback.
1 Introduction
With the exponential availability of high-quality digital images, there is an urgent need to develop new frameworks for image summarization, interactive image navigation, and category exploration [1-2]. The Large-Scale Concept Ontology for Multimedia (LSCOM) project is the first effort of its kind to facilitate more effective end-user access to large-scale image/video collections in a large semantic space [3-4]. There are more than 2000 concepts and 61901 labels for each concept in the LSCOM project, which is still a small subset compared to all the available online resources. Commercial image collection web sites such as Flickr.com host 2 billion images, and the number is still increasing. Considering the scale of the problem we are dealing with, effectively organizing the relationships between large numbers of concepts and better summarizing the large amount of data within each concept are the focus of our work. To organize the relationships between concept pairs, a concept ontology can be used to navigate and explore large-scale image/video collections at the concept level according to hierarchical inter-concept relationships such as "IS-A" and "part-of" [4]. However, the following issues make most existing techniques for concept ontology construction unable to support effective navigation and exploration of large-scale image collections:
(a) Only the hierarchical inter-concept relationships are exploited for concept ontology construction [5-6]. When large-scale online image collections come into view, the inter-concept similarity relationships can be more complex than hierarchical ones (i.e., a concept network) [7]. (b) Only the inter-concept semantic relationships are exploited for concept ontology construction [5-6], so the concept ontology cannot allow users to navigate large-scale online image collections according to their visual similarity contexts at the semantic level. It is well accepted that the visual properties of images are very important for users searching for images [1-4, 7]. Thus it is very attractive to develop a new algorithm for visual concept network generation that can exploit more precise inter-concept visual similarity contexts for image summarization and exploration. To reduce the scale of the image set of each concept while maintaining the visual perspectives of the concept as much as possible, image summarization methods can be used to select a subset of images with the most representative visual properties. Jing et al. [8] used "local coherence" information to find the most representative image, i.e., the one with the maximum number of edge connections in the group. Simon et al. [9] proposed a greedy algorithm to find representative images iteratively and also considered likelihood and orthogonality scores of the images. Unfortunately, neither algorithm took into account the global clustering information, such as the distribution of the clusters and the size, mean, and variance of each cluster. Based on these observations, this paper focuses on: (a) integrating multiple kernels and incorporating kernel canonical correlation analysis (KCCA) to enable more accurate characterization of inter-concept visual similarity contexts and generate a more precise visual concept network; (b) supporting similarity-preserving visual concept network visualization and exploration to assist users in perceptual coherence assessment; (c) iterative image summarization that considers the global clustering information, specifically characterized by the relevancy, orthogonality, and uniformity scores; and (d) generating visualizations of the image summarization results within each concept. The remainder of this paper is organized as follows. Section 2 introduces our approach for image content representation, similarity determination, and kernel design. Section 3 introduces our work on automatic visual concept network generation. Section 4 describes our intra-concept image summarization algorithm. We visualize and evaluate our visual concept network and image summarization results in Section 5. We conclude the paper in Section 6.
2 Data Collection, Feature Extraction and Similarity Measurement
The images used in this work are collected from the internet. In total, 1000 keywords are used to construct our data set, with some of the keywords derived directly from Caltech-256 [13] and the LSCOM concept list. It is not easy to determine meaningful text terms for crawling images from the internet. Many people use the text-term architecture from WordNet; unfortunately, most text terms from WordNet may not be meaningful for image concept interpretation, especially when a large number of keywords is required.
Fig. 1. The taxonomy for text term determination for image crawling
Fig. 2. Feature extraction frameworks: image-based, grid-based, and segment-based
Other people use the most popular tags from Flickr or other commercial image-sharing web sites as query keywords. The problem, again, is that these tags may not represent a concrete object or something that has a visual form; examples are "2010", "California", "Friend", "Music", etc. Based on the above analysis, we have developed a taxonomy for natural object and scene interpretation [10], and we follow this pre-defined taxonomy to determine the meaningful keywords for image crawling, as shown in Fig. 1. Because there is no explicit correspondence between the image semantics and the keywords extracted from the associated text documents, the returned images are sometimes junk images or only weakly related. We apply the algorithms introduced in [11] for cleansing the images crawled from the internet (i.e., filtering out junk images and removing weakly-tagged images). For each keyword, or concept as used in this paper, approximately 1000 images are kept after cleansing. For image classification applications, the underlying framework for image content representation should: (a) characterize the image contents as effectively as possible; and (b) keep the computational cost of feature extraction at a tolerable level. Based on these observations, we consider three frameworks for image content representation and feature extraction, as shown in Fig. 2: (1) the image-based framework; (2) the segment-based framework; and (3) the grid-based framework. The segment-based framework has the most discriminative power and the best representation of the image, but it suffers from a large computational burden and may over-segment the image. On the other hand, the image-based framework is the most computationally efficient but is too coarse to model local information. As a tradeoff, we find
the grid-based framework most suitable for our system in terms of both efficiency and effectiveness. Feature extraction is conducted on the grids of the images as described above. Global visual features such as color histograms provide global region statistics and the perceptual properties of the entire region, but they may not capture object-level (i.e., local) information within the region [12,14]. On the other hand, local visual features such as SIFT (Scale Invariant Feature Transform) allow object-level recognition against cluttered backgrounds [12,14]. In our implementation, we incorporate three types of visual features, color histogram, Gabor texture, and SIFT, described in detail as follows.
Color histogram: We extract the histogram vectors in the HSV color space, which outperforms the RGB color space through its invariance to changes in illuminance. The histogram is computed over 18 bins of the Hue component and 4 bins of the Saturation component, which yields a 72-dimensional feature vector.
Gabor texture: We apply Gabor filters at different orientations and scales of the image region, and the homogeneous texture of the region is represented by the mean and standard deviation of the transformed coefficients. The extraction is conducted at 3 scales and 4 orientations, which yields a 24-dimensional feature vector.
SIFT: We use the SURF (Speeded-Up Robust Features) [14] descriptor, which is inspired by the SIFT descriptor but is much faster to compute in practice. The parameters are configured with a blob response threshold of 0.001 and a maximum number of interpolation steps of 5.
The similarity measurement for color histogram and Gabor texture is defined as

Similarity(i, j) = max_{i,j ∈ X} −||x_i − x_j||²    (1)
where i and j are drawn from the grid set X, which is composed of 5 regional grids. The similarity score is calculated for each pair of grids, and the maximum score is taken as the similarity between the two images. The similarity measurement for the SIFT feature is defined as

Similarity(i, j) = (total # of matches) / (total # of interest points)    (2)
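As a concrete illustration of the two measures, the following sketch (not the authors' code) computes Eq. (1) over the five grid-level feature vectors of two images and Eq. (2) from pre-computed SURF match counts.

```python
# Illustrative sketch of the two similarity measures. Assumes each image is
# represented by 5 grid-level feature vectors (Eq. 1) and by SURF/SIFT match
# statistics obtained from some matcher (Eq. 2).
import numpy as np

def grid_similarity(grids_a, grids_b):
    """Eq. (1): negative squared distance, maximised over all grid pairs."""
    return max(-np.sum((xa - xb) ** 2)
               for xa in grids_a for xb in grids_b)

def sift_similarity(num_matches, num_interest_points):
    """Eq. (2): fraction of interest points that found a match."""
    return num_matches / max(num_interest_points, 1)

# Example with random 72-d color-histogram grids
rng = np.random.default_rng(0)
a = [rng.random(72) for _ in range(5)]
b = [rng.random(72) for _ in range(5)]
print(grid_similarity(a, b))
print(sift_similarity(num_matches=40, num_interest_points=150))
```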
We have also studied the statistical properties of the images under each feature subset introduced above, and this knowledge has been used to design a basic image kernel for each feature subset. Because the different basic image kernels play different roles in characterizing the diverse visual similarity relationships between images, the optimal kernel for diverse image similarity characterization can be approximated more accurately by a linear combination of these basic kernels with different importance weights. For a given image concept C_j, the diverse visual similarity contexts between its images can therefore be characterized more precisely by using a mixture of these basic image kernels (i.e., mixture-of-kernels) [11, 15-17].
κ(u, v) = Σ_{i=1}^{5} α_i κ_i(u_i, v_i),   with   Σ_{i=1}^{5} α_i = 1    (3)
where u and v are the visual features of two images in the given image concept C_j, u_i and v_i are their ith feature subsets, and α_i ≥ 0 is the importance factor for the ith basic image kernel κ_i(u_i, v_i).
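The sketch below illustrates the mixture-of-kernels of Eq. (3); it is not the authors' implementation, and the RBF form of the basic kernels is only an assumption for illustration.

```python
# Illustrative sketch of Eq. (3): a combined kernel value as a convex
# combination of 5 basic per-feature-subset kernels.
import numpy as np

def rbf(x, y, gamma=1.0):
    # Simple RBF kernel, used here only as a stand-in basic kernel.
    return float(np.exp(-gamma * np.sum((x - y) ** 2)))

def mixture_kernel(u_subsets, v_subsets, alphas):
    """u_subsets, v_subsets: lists of 5 per-feature-subset vectors;
    alphas: non-negative weights summing to 1."""
    assert len(u_subsets) == len(v_subsets) == len(alphas) == 5
    assert abs(sum(alphas) - 1.0) < 1e-9 and all(a >= 0 for a in alphas)
    return sum(a * rbf(u, v) for a, u, v in zip(alphas, u_subsets, v_subsets))

rng = np.random.default_rng(1)
u = [rng.random(8) for _ in range(5)]
v = [rng.random(8) for _ in range(5)]
print(mixture_kernel(u, v, alphas=[0.3, 0.2, 0.2, 0.2, 0.1]))
```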
3 Visual Concept Network Generation
We determine the inter-concept visual similarity contexts for automatic visual concept network generation using the image features and kernels introduced above. The inter-concept visual similarity context γ(C_i, C_j) between the image concepts C_i and C_j can be determined by performing kernel canonical correlation analysis (KCCA) [18] on their image sets S_i and S_j:

γ(C_i, C_j) = max_{θ,ϑ} [ θ^T κ(S_i) κ(S_j) ϑ ] / √[ θ^T κ²(S_i) θ · ϑ^T κ²(S_j) ϑ ]    (4)
A detailed explanation of the parameters can be found in our previous report [22]. When large numbers of image concepts and their inter-concept visual similarity contexts are available, they are used to construct a visual concept network. However, the strength of the inter-concept visual similarity contexts between some image concepts may be very weak, so it is not necessary for each image concept to be linked with all the other image concepts on the visual concept network. Eliminating the weak inter-concept links not only increases the visibility of the image concepts of interest dramatically, but also allows our visual concept network to concentrate on the most significant inter-concept visual similarity contexts. Based on this understanding, each image concept is automatically linked with the most relevant image concepts, i.e., those with larger values of the inter-concept visual similarity context γ(·,·) (their values of γ(·,·) are above a threshold δ = 0.65 on a scale from 0 to 1). Compared with the Flickr distance [19], our algorithm for inter-concept visual similarity context determination has several advantages: (a) it deals with the sparse distribution problem more effectively by using a mixture-of-kernels to achieve a more precise characterization of diverse image similarity contexts in the high-dimensional multi-modal feature space; (b) by projecting the image sets of the image concepts into the same kernel space, our KCCA technique achieves a more precise characterization of the inter-concept visual similarity contexts.
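The following sketch (illustrative only; the data structures are assumptions) shows how the concept network can be assembled once the pairwise KCCA similarities γ(·,·) are available, using the threshold δ = 0.65 mentioned above.

```python
# Illustrative sketch: linking concept pairs whose KCCA similarity gamma
# exceeds the threshold delta = 0.65. `gamma` is assumed to be a dict mapping
# concept pairs to similarity values in [0, 1].

def build_concept_network(concepts, gamma, delta=0.65):
    edges = {c: [] for c in concepts}
    for (ci, cj), g in gamma.items():
        if g >= delta:
            edges[ci].append((cj, g))
            edges[cj].append((ci, g))
    return edges

gamma = {("cat", "dog"): 0.81, ("cat", "pizza"): 0.30, ("dog", "wolf"): 0.77}
net = build_concept_network(["cat", "dog", "pizza", "wolf"], gamma)
print(net["cat"])  # [('dog', 0.81)]
```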
4 Image Summarization Algorithm
When the user explores inside a concept, there are still thousands of images to be displayed. In order to summarize the data set, we need to find the most representative images, i.e., a subset of the original data set. The summarization problem is essentially a subset selection problem and can be stated formally as follows. Given an image data set V of N images, our goal is to
find a subset S ⊂ V that best represents the original data set V. We introduce a quality term Q_v for each v ∈ V expressing its capability to represent the entire data set. The v with the highest value is considered the best candidate for summarizing V and is added to S; we call S the "summarization pool". Traditional image summarization models partition or cluster the data set and find a summary image within each cluster. We propose a model that not only makes use of the cluster information but also considers the inter-cluster relationships of the data, and then builds a global objective function. We apply the affinity propagation algorithm to cluster the data set into several clusters and record the size, mean, and variance of each cluster. Affinity propagation has demonstrated its superiority in determining the number of clusters automatically, converging faster, and producing more accurate results than methods such as k-means. The similarity measure used in affinity propagation is defined in Eqn (1). In our proposed image summarization algorithm, we go through each element in V and add the one with the best Q_v as a candidate summary. The representativeness of an image v is reflected by the following three aspects:
1. Relevancy: the relevance score of v is determined by the size, mean, and variance of the cluster c(v) that v belongs to. A candidate summary should come from a cluster with large size, small variance, and a small distance d(v, v_mean) to the cluster mean; in other words, it should be most similar to the other images.
2. Orthogonality: the orthogonality score penalizes candidates from the same cluster, i.e., candidates that are too similar to each other. Selecting multiple candidates from one cluster is not recommended, because such candidates tend to bring redundancy into the final summary.
3. Uniformity: the uniformity score penalizes candidates that appear to be outliers of the original data set. Although outliers always show a unique perspective of the original set (in other words, a high relevancy score), they should not be chosen as summaries.
Based on the above criteria, we formulate the final quality score as a linear combination of the three terms:

Q_v = R(v) + α O(v, S) + β U(v, Ŝ)    (5)
where R(v), O(v, S), and U(v, Ŝ) denote the Relevancy, Orthogonality, and Uniformity scores, respectively. We further define the three terms as follows. For the Relevancy score:

R(v) = ( |c(v)| · L_{c(v)} ) / ( σ_{c(v)} + d(v, µ_{c(v)}) + ε_1 )    (6)
where c(v) denotes the cluster that contains v (i.e., v ∈ c(v)), |c(v)| is the number of elements in c(v), µ_{c(v)} and σ_{c(v)} are the mean and standard deviation of the cluster, and L_{c(v)} is the number of links from v. Within each cluster, similar images are linked together; the similarity measure is the SIFT-based one defined in Eqn (2), and a similarity score above a pre-set threshold (0.6) is counted as a match. Matched image pairs are linked while unmatched pairs are not, so L_{c(v)} can also be seen as the degree of v in the match graph. For the Orthogonality score:
O(v, S) = 0,                                        if J(v′) = ∅
O(v, S) = − Σ_{v′ ∈ J(v′)} 1 / ( d(v, v′) + ε_2 ),   if J(v′) ≠ ∅    (7)
where J(v′) = {v′ | c(v′) = c(v), v′ ∈ S}. J(v′) is empty if none of the elements v′ in S comes from the same cluster as v; otherwise J(v′) is non-empty and a penalty term is applied. For the Uniformity score:

U(v, Ŝ) = − 1 / ( d(v, µ_Ŝ) + ε_3 )    (8)
where Ŝ = V \ S and µ_Ŝ is the mean value of Ŝ. In the above terms, d(·,·) is the Euclidean distance and ε_i, i = 1, 2, 3, is a small positive number ensuring that the denominators are non-zero. Applying the formulation of Q_v gives the best summary of the concept set. For a fixed number of summaries, |S| = k, we iterate the calculation of Q_v k times. The proposed procedure is essentially a greedy algorithm and can be summarized as follows:

Algorithm 1. Cluster-based Image Summarization Algorithm
1: Initialization: S = ∅
2: For each image v ∈ V, compute Q_v = R(v) + α O(v, S) + β U(v, Ŝ)
3: Add the v with the maximum quality Q_v into S: S = S ∪ {v}
4: If the stop criterion is satisfied, stop. Otherwise, set V = V \ {v} and repeat from step 2.
At each iteration, Q_v is calculated for every v in V, and the best v is found and added into S. For the fixed-size summarization problem, the stop criterion is that S has grown to the pre-defined size. For the automatic summarization problem, the stop criterion is that Q_v stays above a pre-defined value, which is 0 in our implementation. The parameters α and β are determined experimentally; in our implementation α = 0.3 and β = 0.4.
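A compact sketch of Algorithm 1 is given below. It is not the authors' code: the clustering is assumed to be given (e.g., from affinity propagation), the link-count factor L_c(v) of Eq. (6) is omitted, and the mean of Ŝ is approximated by the mean of the remaining pool, so this is a simplification of the full score.

```python
# Illustrative, simplified sketch of the greedy cluster-based summarization.
import numpy as np

EPS = 1e-6

def relevancy(v, cluster):                      # simplified Eq. (6)
    mu, sigma, size = cluster["mu"], cluster["sigma"], cluster["size"]
    return size / (sigma + np.linalg.norm(v - mu) + EPS)

def orthogonality(v, cluster_id, summary):      # Eq. (7)
    same = [s for s, cid in summary if cid == cluster_id]
    return -sum(1.0 / (np.linalg.norm(v - s) + EPS) for s in same)

def uniformity(v, rest_mean):                   # Eq. (8), pool mean as proxy
    return -1.0 / (np.linalg.norm(v - rest_mean) + EPS)

def summarize(images, cluster_of, clusters, k, alpha=0.3, beta=0.4):
    pool, summary = list(range(len(images))), []
    while pool and len(summary) < k:
        rest_mean = np.mean([images[i] for i in pool], axis=0)
        def quality(i):
            cid = cluster_of[i]
            return (relevancy(images[i], clusters[cid])
                    + alpha * orthogonality(images[i], cid, summary)
                    + beta * uniformity(images[i], rest_mean))
        best = max(pool, key=quality)
        summary.append((images[best], cluster_of[best]))
        pool.remove(best)
    return [s for s, _ in summary]
```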
5 System Visualization and Evaluation
For inter-concept exploration, to allow users to assess the coherence between the visual similarity contexts determined by our algorithm and their own perceptions, it is very important to enable graphical representation and visualization of the visual concept network, so that users can obtain a good global overview of the visual similarity contexts between the image concepts at first glance. It is also very attractive to enable
Fig. 3. System User Interface: left: global visual concept network; right: cluster of the selected concept node
interactive visual concept network navigation and exploration according to the inherent inter-concept visual similarity contexts, so that users can easily assess the coherence with their perceptions. Based on these observations, our approach to visual concept network visualization exploits hyperbolic geometry [20]. The essence of our approach is to project the visual concept network onto a hyperbolic plane according to the inter-concept visual similarity contexts, and to lay out the visual concept network by mapping the relevant image concept nodes onto a circular display region. Our visualization scheme thus takes the following steps: (a) the image concept nodes of the visual concept network are projected onto a hyperbolic plane according to their inter-concept visual similarity contexts by performing multi-dimensional scaling (MDS) [21]; (b) after this similarity-preserving projection of the image concept nodes is obtained, the Poincaré disk model [20] is used to map the image concept nodes on the hyperbolic plane onto 2D display coordinates. The Poincaré disk model maps the entire hyperbolic space onto an open unit circle, producing a non-uniform mapping of the image concept nodes to the 2D display coordinates. The visualization results of our visual concept network are shown in Fig. 3, where each image concept is linked with multiple relevant image concepts having larger values of γ(·,·). By visualizing large numbers of image concepts according to their inter-concept visual similarity contexts, our visual concept network allows users to navigate large numbers of image concepts interactively according to their visual similarity contexts. For algorithm evaluation, we focus on assessing whether our visual similarity characterization techniques (i.e., mixture-of-kernels and KCCA) have good coherence with human perception. We have conducted both subjective and objective evaluations. For the subjective evaluation, users are involved to explore our visual concept network and assess the visual similarity contexts between the concept pairs. In such an interactive visual concept network exploration procedure, users can score the coherence between the inter-concept visual similarity contexts provided by our visual concept network and their perceptions. For the user study listed in Table 2, 10 sample concept pairs were selected equidistantly from the indexed sequence of concept pairs. One can observe that our visual concept network has good coherence with human perception on the underlying inter-concept visual similarity contexts.
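The layout step can be sketched as follows. This is not the authors' implementation: MDS is applied to 1 − γ as a dissimilarity, and the radial tanh compression standing in for the Poincaré disk mapping is an assumption, since the exact mapping is not spelled out here.

```python
# Illustrative sketch: similarity-preserving layout of concept nodes inside
# the open unit disk. gamma_matrix holds pairwise visual similarities in [0, 1].
import numpy as np
from sklearn.manifold import MDS

def layout_concepts(gamma_matrix):
    dissim = 1.0 - np.asarray(gamma_matrix)        # similarity -> dissimilarity
    np.fill_diagonal(dissim, 0.0)
    pos = MDS(n_components=2, dissimilarity="precomputed",
              random_state=0).fit_transform(dissim)
    r = np.linalg.norm(pos, axis=1, keepdims=True)
    scale = np.tanh(r) / np.maximum(r, 1e-12)      # compress into unit disk
    return pos * scale

gamma = [[1.0, 0.8, 0.2],
         [0.8, 1.0, 0.4],
         [0.2, 0.4, 1.0]]
print(layout_concepts(gamma))
```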
For the objective evaluation, we take center concepts and their first-order neighbors as clusters. By clustering similar image concepts into the same concept cluster, we can deal effectively with the issue of synonymous concepts, e.g., multiple image concepts may share the same meaning for object and scene interpretation. Because only the inter-concept visual similarity contexts are used for concept clustering, some clusters may not be semantically meaningful to human beings; it is therefore very attractive to integrate both the inter-concept visual similarity contexts and the inter-concept semantic similarity contexts for concept clustering. As shown in Table 2, we have also compared our KCCA-based approach with the Flickr distance approach [19] for inter-concept visual similarity context determination. The normalized distances to human perception are 0.92 and 1.42, respectively, in terms of Euclidean distance, which means the KCCA-based approach performs 54% better than the Flickr distance on the randomly selected sample data.
Table 1. Image concept clustering results
Group 1: urban-road, street-view, city-building, fire-engine, moped, brandenberg-gate, buildings
Group 2: knife, humming-bird, cruiser, spaghetti, sushi, grapes, escalator, chimpanzee
Group 3: electric-guitar, suv-car, fresco, crocodile, horse, billboard, waterfall, golf-cart
Group 4: bus, earing, t-shirt, school-bus, screwdriver, hammock, abacus, light-bulb, mosquito
Table 2. Evaluation results of perception coherence for inter-concept visual similarity context determination: KCCA and Flickr distance
Concept pair           User score   KCCA (γ)   Flickr Distance
urbanroad-streetview   0.76         0.99       0.0
cat-dog                0.78         0.81       1.0
frisbee-pizza          0.56         0.80       0.26
moped-bus              0.50         0.75       0.37
dolphin-cruiser        0.34         0.73       0.47
habor-outview          0.42         0.71       0.09
monkey-humanface       0.52         0.71       0.32
guitar-violin          0.72         0.71       0.54
lightbulb-firework     0.48         0.69       0.14
mango-broccoli         0.48         0.69       0.34
For intra-concept exploration, we simply display the top 5 image summarization results for each concept, as shown in Fig. 4, giving the user a direct impression of the visual properties of the concept. Access to the full concept set is also provided, as shown in Fig. 3. For the image summarization evaluation task, efficiency, effectiveness, and satisfaction are three important metrics. We designed a user study based on these three metrics: 1. A group of 10 users were asked to find the top 5 summaries of three concepts, "Building", "Spaghetti" and "Bus"; 2. Their manual selections were gathered to determine the ground-truth summaries of each concept by picking the image views with the highest votes.
Fig. 4. Top 5 summarization results for “Building”, “Spaghetti” and “Bus”
3. Users were guided to evaluate our summarization system and give satisfaction feedback compared with their own understanding of the summary of the concept. As a result, users took around 50 seconds on average to find the top 5 summaries from a concept data set. Comparatively, given the affinity propagation clustering results, the calculation of the top image summaries can be finished almost in real time, which is a big advantage for large-scale image summarization. On average, 2 out of the 5 images generated by our algorithm coincided with the ground-truth summaries derived from the users. We define "coincide" as a strong visual closeness, even if the images are not identical. Considering the size of the concept set and the limited number of summaries derived, the performance is quite persuasive. After exploring our system, the users also provided positive feedback on the user-friendly operation interface, fast response speed, and reasonable returned results.
6 Conclusion
To deal with the large-scale image collection exploration problem, we have proposed novel algorithms for inter-concept visual network generation and intra-concept image summarization. The visual network reflects the diverse inter-concept visual similarity relationships more precisely in a high-dimensional multi-modal feature space by incorporating multiple kernels and kernel canonical correlation analysis. The image summarization iteratively generates the most representative images of the set by incorporating the global clustering information while computing the relevancy, orthogonality, and uniformity terms of the candidate summaries. We designed an interactive system to run the algorithms on a self-collected image data set. The experiments showed very good results and the user study provided positive feedback. This research is partly supported by NSFC-61075014 and NSFC-60875016, by the Program for New Century Excellent Talents in University under Grants NCET-07-0693, NCET-08-0458 and NCET-10-0071, and by the Research Fund for the Doctoral Program of Higher Education of China (Grant No. 20096102110025).
References
1. Smeulders, A.W.M., Worring, M., Santini, S., Gupta, A., Jain, R.: Content-based image retrieval at the end of the early years. IEEE Trans. PAMI 22(12), 1349–1380 (2000)
2. Hauptmann, A., Yan, R., Lin, W.-H., Christel, M., Wactlar, H.: Can high-level concepts fill the semantic gap in video retrieval? A case study with broadcast news. IEEE Trans. on Multimedia 9(5), 958–966 (2007)
3. Benitez, A.B., Smith, J.R., Chang, S.-F.: MediaNet: A multimedia information network for knowledge representation. In: Proc. SPIE, vol. 4210 (2000)
4. Naphade, M., Smith, J.R., Tesic, J., Chang, S.-F., Hsu, W., Kennedy, L., Hauptmann, A., Curtis, J.: Large-scale concept ontology for multimedia. IEEE Multimedia (2006)
5. Cilibrasi, R., Vitanyi, P.: The Google similarity distance. IEEE Trans. Knowledge and Data Engineering 19 (2007)
6. Fellbaum, C.: WordNet: An Electronic Lexical Database. MIT Press, Boston (1998)
7. Wu, L., Hua, X.-S., Yu, N., Ma, W.-Y., Li, S.: Flickr distance. In: ACM Multimedia (2008)
8. Jing, Y., Baluja, S., Rowley, H.: Canonical image selection from the web. In: Proc. of the 6th ACM International CIVR, Amsterdam, The Netherlands, pp. 280–287 (2007)
9. Simon, I., Snavely, N., Seitz, S.M.: Scene summarization for online image collections. In: ICCV 2007 (2007)
10. Fan, J., Gao, Y., Luo, H.: Hierarchical classification for automatic image annotation. In: ACM SIGIR, Amsterdam, pp. 11–118 (2007)
11. Gao, Y., Peng, J., Luo, H., Keim, D., Fan, J.: An interactive approach for filtering out junk images from keyword-based Google search results. IEEE Trans. on Circuits and Systems for Video Technology 19(10) (2009)
12. Lowe, D.: Distinctive image features from scale invariant keypoints. Intl. Journal of Computer Vision 60, 91–110 (2004)
13. Griffin, G., Holub, A., Perona, P.: Caltech-256 object category dataset. Technical Report 7694, California Institute of Technology (2007)
14. Bay, H., Ess, A., Tuytelaars, T., Van Gool, L.: Speeded-up robust features (SURF). Comput. Vis. Image Underst. 110(3), 346–359 (2008)
15. Sonnenburg, S., Rätsch, G., Schäfer, C., Schölkopf, B.: Large scale multiple kernel learning. Journal of Machine Learning Research 7, 1531–1565 (2006)
16. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. Intl. Journal of Computer Vision 73(2), 213–238 (2007)
17. Torralba, A., Murphy, K.P., Freeman, W.T.: Sharing features: efficient boosting procedures for multiclass object detection. In: IEEE CVPR (2004)
18. Huang, J., Kumar, S.R., Zabih, R.: An automatic hierarchical image classification scheme. In: ACM Multimedia, Bristol, UK (1998)
19. Vasconcelos, N.: Image indexing with mixture hierarchies. In: IEEE CVPR (2001)
20. Barnard, K., Forsyth, D.: Learning the semantics of words and pictures. In: IEEE ICCV, pp. 408–415 (2001)
21. Naphade, M., Huang, T.S.: A probabilistic framework for semantic video indexing, filtering and retrieval. IEEE Trans. on Multimedia 3(1), 141–151 (2001)
22. Yang, C., Luo, H., Fan, J.: Generating visual concept network from large-scale weakly-tagged images. In: Advances in Multimedia Modeling (2010)
A Study in User-Centered Design and Evaluation of Mental Tasks for BCI
Danny Plass-Oude Bos, Mannes Poel, and Anton Nijholt
University of Twente, Faculty of EEMCS, P.O. Box 217, 7500 AE Enschede, The Netherlands
{d.plass,m.poel,a.nijholt}@ewi.utwente.nl
Abstract. Current brain-computer interfacing (BCI) research focuses on detection performance, speed, and bit rates. However, this is only a small part of what is important to the user. From human-computer interaction (HCI) research, we can apply the paradigms of user-centered design and evaluation, to improve the usability and user experience. Involving the users in the design process may also help in moving beyond the limited mental tasks that are currently common in BCI systems. To illustrate the usefulness of these methods to BCI, we involved potential users in the design process of a BCI system, resulting in three new mental tasks. The experience of using these mental tasks was then evaluated within a prototype BCI system using a commercial online role-playing game. Results indicate that user preference for certain mental tasks is primarily based on the recognition of brain activity by the system, and secondly on the ease of executing the task.
Keywords: user-centered design, evaluation, brain-computer interfacing, multimodal interaction, games.
1 Introduction
The research field of brain-computer interfaces (BCI) originates from the wish to provide fully-paralyzed people with a new output channel to enable them to interact with the outside world, despite their handicap. As the technology gets better, the question arises whether BCI could also be beneficial for healthy users in some way, for example, by improving quality of life or by providing private, hands-free interaction [11,18]. There are still many issues to solve, such as delays, poor mental-task recognition rates, long training times, and cumbersome hardware [8]. Current BCI research concentrates on improving recognition accuracy and speed, two important factors in how BCI systems are experienced. On the other hand, there is a lot of interest in making BCI a more usable technology for healthy users [1,13]. But in order for this technology to be accepted by the general public, other factors of usability and user experience have to be taken into account as well [14,19]. There is some tentative research in this direction, such as Ko et al., who evaluated the convenience, fun and intuitiveness of a BCI game they developed [6], and Friedman
et al., who looked into the level of presence experienced in two different navigation experiments [3]. But a lot of research still needs to be conducted. In this paper we focus on applying HCI principles to BCI design. In Section 2 we apply a user-centered design method to the selection of mental tasks for shape-shifting in the very popular massively-multiplayer online role-playing game World of Warcraft®, developed by Blizzard Entertainment, Inc. Within the large group of healthy users, gamers are an interesting target group. Fed by a hunger for novelty and challenges, gamers are often early adopters of new paradigms [12]. Besides, it has been suggested that users are able to stay motivated and focused for longer periods if a BCI experiment is presented in a game format [4]. Afterwards, in Sections 3 and 4 the selected mental tasks are evaluated in a user study; in this evaluation the focus was on the user preferences – which among others include recognition, immersion, effort, and ease of use – for the designed mental tasks. The main research questions we try to answer are: Which mental tasks do the users prefer, and why? How may this preference be influenced by the detection performance of the system? The results are discussed in Section 5 and the conclusions of this user-centered design and evaluation can be found in Section 6.
2 User-Centered Design: What Users Want
One of the problems facing BCI research is the discovery of usable mental tasks that trigger detectable brain activity. The tasks (by convention indicated by the name of the corresponding brain activity) that are currently most popular are: slow cortical potentials, imaginary movement, P300, and steady-state visually-evoked potentials [15,17]. Users regularly indicate that these tasks are either too slow, nonintuitive, cumbersome, or just annoying to use for control BCIs [10,12]. Current commercial applications are a lot more complex and offer far more interaction possibilities than the applications used in BCI research. Whereas current game controllers have over twelve dimensions of input, BCI games are generally limited to one- or two-dimensional controls. Also, the mental tasks that are available are limited in their applicability for intuitive interaction. New mental tasks are needed that can be mapped in an intuitive manner onto system actions. One way to discover mental tasks that are suitable from a user perspective is to simply ask users what they would like to do to trigger certain actions. In World of Warcraft®, the user can play an elf druid who can shape-shift into animal forms. As an elf, the player can cast spells to attack or to heal. When in bear form, the player can no longer use most spells, but is stronger and better protected against direct attacks, which is good for close combat. In an open interview, we asked four World of Warcraft® players of varying expertise and ages what mental tasks they would prefer to use to shape-shift from the initial elf form to bear, and back again. The participants were not informed about the limits of current BCI systems, but most people did need an introduction to start thinking about tasks that would have a mental component.
They were asked to think of using the action in the game, what it means to them, what it means to their character in the game, what they think about when doing it, and what they think when they want to use the action, and then to come up with mental tasks that fit naturally with their gameplay. The ideas that the players came up with can be grouped into three categories. For the user evaluation of these mental tasks in Section 3, these three categories were translated into concrete mental tasks, mapped to the in-game action of shape-shifting. Each category consists of a task and its reverse, to accommodate the shape-shifting action in the directions of both bear and elf form.
1. Inner speech: recite a mental spell to change into one form or the other. The texts of the spells subsequently used were derived from expressions already used in the game world. The user had to mentally recite "I call upon the great bear spirit" to change to bear. "Let the balance be restored" was the expression used to change back to elf form.
2. Association: think about or feel like the form you want to become. Concretely, this means the user had to feel like a bear to change into a bear, and to feel like an elf to change into an elf.
3. Mental state: automatically change into a bear when the situation demands it. When you are attacked, the resulting stress could function as a trigger. For the next step of this research, this had to be translated into a task that the users could also perform consciously. To change to bear form the users had to make themselves feel stressed; to shift into elf form, relaxed.
3 User Evaluation Methodology
The goal of the user evaluation was to answer the following questions in this game context: Which mental tasks do the users prefer, and why? How may this preference be influenced by the detection performance of the system? Fourteen healthy participants (average age 27, ranging from 15 to 56; 4 female) took part in the experiment voluntarily. All but one of the participants were right-handed. Highest finished education ranged from elementary school to a master's degree. Experience with World of Warcraft® ranged from "I never play any games" to "I raid daily with my level 80 druid". Three participants were actively playing on a weekly basis. Written informed consent was obtained from all participants. The general methodology to answer these questions was as follows. In order to measure the influence of the detection performance of the system, the participants were divided into two groups, a so-called "real-BCI" and a "utopia-BCI" group. The group that played World of Warcraft® with "utopia-BCI" decided for themselves whether they had performed the mental task correctly, and pressed the button to shape-shift when they had. In this way a BCI system with 100% detection performance (a utopia) was simulated. The group that played World of Warcraft® with "real-BCI" actually controlled their shape-shifting action with their mental tasks, at least insofar as the system could detect them.
The participants came in for experiments once a week for five weeks, in order to track potential changes over time. During an experiment, for each pair of mental tasks (inner speech, association, and mental state; change to bear, change to elf), the participant underwent a training and game session and filled in questionnaires to evaluate the user experience. The following sections explain each part of the methods in more detail.
3.1 Weekly Sessions and Measurements
The participants took part in five experiments, lasting about two hours each, over five weeks. The mental tasks mentioned above were evaluated in random order to eliminate any potential order effects, for example due to fatigue or user learning. For each task pair, a training session was conducted. The purpose of the training session was manifold: it gathered clean data to evaluate the recognizability of the brain activity related to the mental tasks, the user was trained in performing the mental tasks, the system was trained for those participants who played the game with the real BCI system, and the user experience could be evaluated outside the game context. A training session consisted of two sets of trials with a break in between, during which the participant could relax. Each set started with four watch-only trials (two per mental task), followed by 24 do-task trials (twelve per mental task). The trial sequence consisted of five seconds of watching the character in their start form, followed by two seconds during which the shape-shifting task was presented. After this the participant had ten seconds to perform the mental task repeatedly until the time was up, or just watch if it was a watch-only trial. At the end of these ten seconds, the participant saw the character transform. See Figure 1 for a visualization of the trial sequence. During the watch-only trials, the participant saw exactly what they would see during the do-task trials, but they were asked only to watch the sequence. This watch-only data was recorded so it could function as a baseline: it is possible that simply watching the sequence already induces certain discriminable brain activity, and this data makes it possible to test that. The character in the videos was viewed from the back, similar to the way the participant would see the avatar in the game. At the end of the training session, the participant was asked to fill in forms to evaluate the user experience. The user experience questionnaire was loosely based on the Game Experience Questionnaire [5]. It contained statements for which the participants had to indicate their amount of agreement on a five-point Likert scale, for example: "I could perform these mental tasks easily", "It was tiring to use these mental tasks", and "It was fun to use these mental tasks". The statements can be categorized into the following groups: whether the task was easy, doable, fun, intuitive, or tiring to execute, whether they felt positive or negative about the task, and whether the mapping to the in-game action made sense or not. After the training session, the participant had roughly eight minutes to try the set of mental tasks in the game environment. The first experiment consisted only
Fig. 1. Training session trial sequence: first the character is shown in their start form, then the task is presented, after which there is a period during which this task can be performed. At the end the animation for the shape-shift is shown.
Fig. 2. Orange feedback bar with thresholds. The user has to go below 0.3 to change to elf, and above 0.7 to change to bear. In between no action is performed.
of a training session. For weeks two up to and including four, the participants were split into the "real-BCI" and "utopia-BCI" groups. Groups were fixed for the whole experiment. During the last week all participants followed a training session and played the game with the real BCI system. The "real-BCI" group received feedback on the recognition of their mental tasks in the form of an orange bar in the game (see Figure 2). The smaller the bar, the more the system had detected the mental task related to elf form; the larger, the more the system had interpreted the brain activity as related to bear form. When the thresholds were crossed, the shape-shift action was executed automatically. The "utopia-BCI" group participants had to interact with a BCI system with (near) 100% performance. Since this is not technically feasible yet, one could rely on a Wizard of Oz technique [16]: the users would wear the EEG cap and perform the mental tasks when they wanted to shape-shift, and the Wizard would decide when a task was performed correctly. In this case, however, the Wizard would have no way of knowing what the user is doing, as there is no external expression of the task. The only option left to simulate a perfect system is to let the participants evaluate for themselves whether or not they had performed the task correctly, after which they pressed the shape-shift button in the game manually. At the end of the session, the user experience questionnaire was repeated, to determine potential differences between the training and game sessions. The game session questionnaire contained an extra question to determine the perceived detection performance of the mental tasks. A final form concerning the experiment session as a whole was filled in at the end of the session. The participants were asked to put the mental tasks in order of preference, and to indicate why they chose this particular ordering.
3.2 EEG Analysis and Mapping
The EEG analysis pipeline, programmed in Python, was kept very general, as there was no certainty about how to detect the selected mental tasks. Common average reference was used as a spatial filter, in order to improve the signal-to-noise ratio [9]. The bandpass filter was set to 1–80 Hz. The data gathered during the training session was sliced into 10-second windows. These samples were whitened [2], and the variance was computed as an indication of the power in the window. A support vector machine (SVM) classifier trained on the training session data provided different weights for each of the EEG channels. To make the BCI control more robust to artifacts, two methods were applied to the mapping of classification results to in-game actions. A short dwell was required to trigger the shape-shift, so that it would not be activated by quick peaks in power. Secondly, hysteresis was applied: the threshold that needed to be crossed to change into a bear was higher than the threshold required to revert back to elf form. In between these two thresholds was a neutral zone in which no action was performed; see also Figure 2.
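Since the pipeline was programmed in Python, a schematic sketch is given below. It is not the authors' code: the filter design, window handling, classifier set-up, and the dwell/threshold values are illustrative assumptions, and the whitening step is omitted for brevity.

```python
# Schematic sketch of the described pipeline: CAR, 1-80 Hz band-pass,
# 10 s windows, per-channel variance as a power proxy, SVM, and a
# dwell/hysteresis mapping of scores to shape-shift actions.
import numpy as np
from scipy.signal import butter, filtfilt
from sklearn.svm import SVC

def preprocess(eeg, fs):
    car = eeg - eeg.mean(axis=0, keepdims=True)        # common average reference
    b, a = butter(4, [1, 80], btype="bandpass", fs=fs)  # 1-80 Hz band-pass
    return filtfilt(b, a, car, axis=1)

def window_features(eeg, fs, win_sec=10):
    step = int(win_sec * fs)
    feats = [eeg[:, i:i + step].var(axis=1)             # per-channel power proxy
             for i in range(0, eeg.shape[1] - step + 1, step)]
    return np.array(feats)

def map_to_action(scores, low=0.3, high=0.7, dwell=3):
    """Hysteresis + dwell: switch to 'elf' below `low`, 'bear' above `high`,
    only after the score has stayed there for `dwell` consecutive outputs."""
    state, streak, actions = None, 0, []
    for s in scores:
        target = "bear" if s > high else "elf" if s < low else None
        streak = streak + 1 if target and target != state else 0
        if target and target != state and streak >= dwell:
            state, streak = target, 0
            actions.append(state)
    return actions

# Training would look roughly like:
# clf = SVC(kernel="linear").fit(window_features(train_eeg, fs), labels)
```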
4 Results
4.1 Which Mental Tasks Do Users Prefer and Why?
In the post-experiment questionnaire, the participants were asked to list the mental tasks in order of preference. The place in this list was used as a preference score, where a value of 1 indicated the first choice and 6 the least preferable. These values were rescaled to match the user experience questionnaire values, ranging from 1 to 5 with 5 most preferable, so that 3.0 indicates a neutral disposition in the preference order. Sixty-nine measurements were obtained from 14 participants over five weeks; in one week a participant had to leave early and could not fill in the preference questionnaire. The average preference scores show a general preference for the association tasks, while the mental state tasks seem to be disliked the most. But this paints a very simplistic picture, as there are large differences between the "real-BCI" and "utopia-BCI" groups. To better understand the effects of the different aspects, Figure 3 shows the preference and user experience scores for each of the three mental task pairs, separately for the two participant groups. Whereas for the "real-BCI" group the mental state tasks are most liked, for the "utopia-BCI" group they are most disliked. Similarly, the "utopia-BCI" group most preferred inner speech, which was least preferred by the "real-BCI" group. Because of these large differences, these two groups need to be investigated as two separate conditions.
4.2 What Is the Influence of Recognition Performance on Task Preference?
Although it is not possible to completely separate the influence of the recognition performance from the other aspects that differ between the participant groups, based
Fig. 3. User experience, preference, and perceived performance scores for the "utopia-BCI" and "real-BCI" groups, separate for the three mental task pairs, averaged for weeks 2 to 4. The bar plot is annotated with significant differences between task pairs (association, inner speech, mental state; with a line with star above the two pairs) and game conditions ("utopia-BCI", "real-BCI"; indicated by a star above the bar).
on the user experience scores, recognition perception scores, and the words the participants used to describe their reasoning for their preference, it is possible to explain the discrepancy between the two conditions and get an idea of the influence of recognition performance. Inner speech is preferred by the “utopia-BCI” group, mainly because it is considered easy and doable. Although the inner speech tasks were rated highly by both groups, the system recognition had a heavy impact: it is the least preferred task pair for the “real-BCI” participants. The association tasks are valued mostly for their intuitiveness and the mapping with the in-game task, by the “utopia-BCI” group. Where the bad detection of inner speech mainly affected the preference scores, for association the user experience is significantly different on multiple aspects: easy, intuitive, and positive. The opposite happens for mental state. This task pair scores low across the board by both groups, yet it was preferred by the “real-BCI” group. It was also the task that was best recognized by the system, which is reflected in the perceived recognition scores. Based on these results, it seems that the recognition performance has a strong influence on the user preference, which is the most important consideration for the “real-BCI” group. For the “utopia-BCI” group different considerations emerge, where the ease of the task pair seems to play a dominant role, followed by the intuitiveness.
Fig. 4. Counts for the categories of words used to describe the reasoning behind the participant’s preference ranking, total for weeks 2 to 4
This view is confirmed by looking at the reasons the participants described for their preference ranking, see Figure 4. The words they used were categorized, and the number of occurrences within each category was used as an indication of how important that category was to the reasoning. To reduce the number of categories, words that indicated a direct opposite and words that indicated a similar reason were merged into one category. For example, difficult was recoded to easy, and tiring was recoded to effort. In the words used by the "real-BCI"
group, recognition performance is the category used most often (n = 15), more than twice as often as any other word category.

The information in the presenting users' records includes comments, ratings, and so on.
Fig. 2. Data structure related to a video
Note that the clues center can detect newly published videos and their duplicates in social media networks at all times, according to the video ID. When a user sends out a request for a remote video file through the clue reporter, the clues center probes the states of all duplicates and redirects the user to the nearest available duplicate. Based on the duplicates and their social information, fraudulent duplicates can be detected.
3 Automatic Aggregation Model
3.1 User Query
In the architecture, a newly published video aggregates and maintains multiple types of important information, such as duplicates, similar content, and social information. These contents strengthen the video's annotation and the power of users' queries. With the help of the clues center, when a user sends a query the system can: (1) recommend sequential and similar videos to the user according to the video's contextual clues; (2) integrate a video with its social information, strengthening video sharing in social media networks; (3) choose the nearest video duplicate, or replace an invalid video with a valid duplicate, which optimizes transmission in social media networks.
3.2 The Automatic Aggregation Model
In general, video aggregation depends on the user's query and selection. In the architecture, we therefore develop an automatic aggregation model based on a Dynamic Petri Net (DPN) [11] to automatically generate an aggregated multimedia document for presenting videos and closely related contents. The generated DPN can be used to deal with user interactions and requests for video aggregation, and serves as a guideline to lay out the presentation. It also specifies the temporal and spatial formatting for each media object. A Dynamic Petri Net structure, S, is a 10-tuple:

S = (P, T, I, O, τ, Pd{F}, N, F, P{F}, Oc{F})    (1)
1) P = {p1, p2, ..., pα}, where α ≥ 0, is a finite set of places;
2) T = {t1, t2, ..., tβ}, where β ≥ 0, is a finite set of transitions, such that P ∩ T = φ, i.e., the sets of places and transitions are disjoint;
3) I: P^∞ → T is the input arc, a mapping from bags of places to transitions;
4) O: T → P^∞ is the output arc, a mapping from transitions to bags of places;
5) τ = {(p1, τ_1, s1), (p2, τ_2, s2), ..., (pα, τ_α, sα)}, where τ_i and s_i represent the video's playing time and spatial location for the related objects represented by the places p_i;
6) N = {n1, n2, ..., nγ}, where γ ≥ 0, is a finite set of persistent control variables. These variables are persistent through every marking of the net;
7) F = {f0, f1, ..., fη}, where η ≥ 0, is a finite set of control functions that perform functions based on any control variable in N;
8) P{F}, where P{F} ⊆ P, is a finite set of (static) control places that execute any control function F;
9) Oc{F}, where Oc{F} ⊆ O, is a finite set of (static) control output arcs that may be disabled or enabled according to any control function F;
10) Pd{F}, where Pd{F} ⊆ P, is a finite set of dynamic places that take their value from a control function F.

• Temporal Layout

The temporal layout is mainly dominated by the playing video after the user has queried and chosen it. The layout is composed of the playing video and related contents, which have to be synchronized with the playing video. The temporal layout Lmain is represented as {Lv, Ld}. The layout Lv represents the playing video and Ld represents the contents related to the playing video. Together they represent the entire layout of the presentation. Lv is represented as {Lv1, Lv2, ...}, where Lvi = (<B_i^0, E_i^0>, S_i), B_i^0 is the start time of the i-th playing video, E_i^0 is the terminating time of the i-th playing video (1 ≤ i), and S_i represents the spatial location of the i-th playing video. The set Ld is used to store the information that includes all related contents and is represented as {Ld1, Ld2, ...}, where Ldi = {D_i1, D_i2, ..., D_in} and D_ij = (<B_j^i, E_j^i>, S_j). Here 1 ≤ i, 1 ≤ j ≤ n, B_i^0 ≤ B_j^i < E_j^i ≤ E_i^0, and S_j represents the spatial location of the j-th related data item.
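For illustration, the interval-based layout above might be represented as follows; the class and field names are assumptions, and the check simply encodes the constraint B_i^0 ≤ B_j^i < E_j^i ≤ E_i^0 stated above.

```python
from dataclasses import dataclass

@dataclass
class MediaInterval:
    begin: float     # start time within the presentation
    end: float       # terminating time
    region: tuple    # assumed spatial location (x, y, w, h) in the layout

def is_synchronized(video: MediaInterval, related: MediaInterval) -> bool:
    """Related content must be presented inside the playing interval of its video."""
    return video.begin <= related.begin < related.end <= video.end

# L_main = {L_v, L_d}: the playing videos and the related contents per video
layout = {
    "L_v": [MediaInterval(0.0, 120.0, (0, 0, 640, 360))],
    "L_d": [[MediaInterval(10.0, 30.0, (640, 0, 320, 180))]],  # contents of the 1st video
}
```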
• Modeling of automatic video aggregation
Fig. 3. The automatic aggregation model
In the video aggregation, the playing video object is the dominant object. The control functions control the aggregation operations and determine at what time the aggregated media objects are presented. Considering the representation of the video sequence in an interactive manner, the aggregation is modeled as shown in Fig. 3. The media objects are represented as places (P), while the synchronization points are modeled as transitions (T) in the Petri net. Control Functions: -query(): the function is associated with the output arc that triggers the start of the aggregation.
-has_video(): the function checks whether any videos are returned after the query.

Algorithm 1. Function: query(q)
  list = query_video(q)
  Seqlist = null
  Simlist = null
  n1 = size(list)

Algorithm 2. Function: has_video()
  if n1 > 0 then
    enable Oac, disable Obc
  else
    disable Oac, enable Obc
  end

-choose_a(): when the returned video set contains more than one video, the control function allows the user to choose one video for presentation.
-aggregating(): the system aggregates the contents related to the playing video from the clues center, such as similar videos and sequential segments.

Algorithm 3. Function: choose_a()
  if user has chosen list[k] OR Seqlist[k] OR Simlist[k] then
    v_chose = the video chosen by the user
  end

Algorithm 4. Function: aggregating()
  Seqlist = Discovery_sequence(v_chose)
  Simlist = Match_similar(v_chose)
  Datalist = Get_RelatedData(v_chose)
  j = 0
  n2 = size(Seqlist)

-is_relatedvideo(): assesses whether there exist videos related to the playing video according to its similarity clues.
-is_relateddata(): assesses whether there exists related data (e.g., text, images) for the playing video according to its related-data clues.

Algorithm 5. Function: is_relatedvideo()
  if Seqlist > 1 OR Simlist > 1 then
    enable Occ
  else
    disable Occ
  end

Algorithm 6. Function: is_relateddata()
  if Datalist > 0 then
    enable Oec
  else
    disable Oec
  end
-auto_a(): when the playing video is about to reach its terminating time, the system automatically chooses a successor video as the next playing video, which is also proposed to the user.

For the aggregation model in Fig. 3, the DPN is instantiated as:
τ = { ..., (P3, <B_j^0, E_j^0>, S_j), (P4, <B_l^0, E_l^0>, S_l), (P5, <B_k^0, E_k^0>, S_k) }
N = {n1, n2}
F = {has_video, is_relatedvideo, is_relateddata, next, is_terminated, is_null, no_relatedvideo}
P{F} = {P1, P2, P3, P4, P5}
Oc{F} = { (t10, P1){has_video}, (t10, Pend){is_null}, (t11, P4){is_relatedvideo}, (t11, P5){is_relateddata}, (t11, Pend){no_relatedvideo} }
Pd{F} = { P6^d }
Note that the clues reporter also provides an interface to lay out similar videos, sequential segments, and social information while a user is playing a video. The user can also forward a new query to the clues center if the recommended contents do not meet his or her needs.
4 Implementation and Measurement

The objective of the implementation was to develop a multimedia aggregation and presentation system that delivers the following capabilities: (1) multimedia aggregation and sharing in social media networks; (2) synchronous presentation of videos and related data; (3) optimization of video transmission in social media networks. The application also allows prefetching and buffering of videos to obtain a reasonable QoS before playing. We executed a web crawler to collect a set of videos and their descriptions. We collected relevant information about 1,260 video files from Youku (see http://www.youku.com) and Tudou (see http://www.tudou.com), two popular video-distributing websites in China.
To demonstrate that CLUENET can efficiently aggregate video sequences, similar videos, and related contents, we developed the clues reporter in the open-source Ambulant player [13] and built the clues center on the Internet. In addition, we developed a cmf file packager for encapsulating a newly published video file into a cmf file, and an unpackager for extracting the original video file from a cmf file. For aggregating videos and related contents, the system automatically generates the media document based on the Scalable MSTI model [14] at the places P4 and P5 of the Petri net. The media document is generated based on a SMIL template that we predefined. Based on this video data repository, we conducted a test in two phases. First, we encapsulated these videos into cmf files and tested the clues collected by the clues reporter. The test shows that our system can not only detect some user behaviors, such as access time and the adding of ratings and comments, but also track the sequence of videos accessed by the user while the user visits a set of videos. Second, to verify the construction of the videos' clues network, we used the Multimedia PageRank [15], VSM, and Apriori II algorithms to refine the clues in the clues center. The test shows that the clues center can recommend high-relevance contents and top-N frequent video sequences to the user while a video is playing. After the user sends a query to the system, it can aggregate a set of related videos and data and present them to the user according to the DPN model. In short, CLUENET can not only aggregate relevant contents and synthesize all textual information and video sequences for presentation to the user, but also keep track of newly published videos in social networks and redirect users to nearby duplicates. On the other hand, aggregating all-around related contents enables the system to retrieve more comprehensive knowledge and enhances its dynamic adaptability in social networks.
5 Conclusions

Inspired by SOA, we propose a novel framework for automatically aggregating videos and relevant contents in social media networks. The goal of our framework is to build a world-wide mesh of videos based on the clues network, which can not only aggregate single videos and their context and present related rich-semantic media, but also guide users to retrieve coherent videos that are deeply associated and enhance the trustworthiness of results through clues in social networks. In the future we expect to continue to retrieve deeper knowledge of videos by analyzing the clues, to explore mechanisms that alleviate the extent to which clue collection depends on users, and to further study the scalability and address the security issues of the framework.

Acknowledgements. We acknowledge the National High-Tech Research and Development Plan of China under Grant No. 2008AA01Z203 for funding our research.
References

1. Murthy, D., Zhang, A.: Webview: A Multimedia Database Resource Integration and Search System over Web. In: WebNet: World Conference of the WWW, Internet and Intranet (1997)
2. Cao, J., Zhang, Y.D., et al.: VideoMap: An Interactive Video Retrieval System of MCG-ICT-CAS. In: CIVR (July 2009)
3. Sprott, D., Wilkes, L.: Understanding Service Oriented Architecture. Microsoft Architecture Journal (1) (2004)
4. Sakkopoulos, E., et al.: Semantic mining and web service discovery techniques for media resources management. Int. J. Metadata, Semantics and Ontologies 1(1) (2006)
5. Agrawal, R., Srikant, R.: Mining sequential patterns. In: Proc. of the 11th International Conference on Data Engineering, Taipei (1995)
6. Bianchini, M., Gori, M., Scarselli, F.: Inside PageRank. ACM Transactions on Internet Technology 5(1) (2005)
7. Dongwon, L., Hung-sik, K., Eun Kyung, K., et al.: LeeDeo: Web-Crawled Academic Video Search Engine. In: Tenth IEEE International Symposium on Multimedia (2008)
8. Crescenzi, V., Mecca, G., Merialdo, P.: RoadRunner: Towards automatic data extraction from large web sites. In: VLDB (2001)
9. Wu, X., Hauptmann, A.G., Ngo, C.W.: Practical elimination of near-duplicates from web video search. In: ACM Multimedia, MM 2007 (2007)
10. Amer-Yahia, S., Lakshmanan, L., Yu, C.: SocialScope: Enabling Information Discovery on Social Content Sites. In: CIDR (2009)
11. Tan, R., Guan, S.U.: A dynamic Petri net model for iterative and interactive distributed multimedia presentation. IEEE Transactions on Multimedia 7(5), 869–879 (2005)
12. Massimo, M.: A basis for information retrieval in context. ACM Transactions on Information Systems (TOIS) 26(3) (June 2008)
13. Bulterman, D.C.A., Rutledge, L.R.: SMIL 3.0: Interactive Multimedia for the Web, Mobile Devices and Daisy Talking Books. Springer, Heidelberg (2008)
14. Pellan, B., Concolato, C.: Authoring of scalable multimedia documents. Multimedia Tools and Applications 43(3) (2009)
15. Yang, C.C., Chan, K.Y.: Retrieving Multimedia Web Objects Based on PageRank Algorithm. In: WWW, May 10-14 (2005)
16. Cha, M., Kwak, H., et al.: I tube, you tube, everybody tubes: analyzing the world's largest user generated content video system. In: IMC 2007. ACM, New York (2007)
Pedestrian Tracking Based on Hidden-Latent Temporal Markov Chain

Peng Zhang1, Sabu Emmanuel1, and Mohan Kankanhalli2

1 Nanyang Technological University, 639798, Singapore {zh0036ng,asemmanuel}@ntu.edu.sg
2 National University of Singapore, 117417, Singapore [email protected]
Abstract. Robust, accurate and efficient pedestrian tracking in surveillance scenes is a critical task in many intelligent visual security systems and robotic vision applications. The usual Markov chain based tracking algorithms suffer from error accumulation problem in which the tracking drifts from the objects as time passes. To minimize the accumulation of tracking errors, in this paper we propose to incorporate the semantic information about each observation in the Markov chain model. We thus obtain pedestrian tracking as a temporal Markov chain with two hidden states, called hidden-latent temporal Markov chain (HL-TMC). The hidden state is used to generate the estimated observations during the Markov chain transition process and the latent state represents the semantic information about each observation. The hidden state and the latent state information are then used to obtain the optimum observation, which is the pedestrian. Use of latent states and the probabilistic latent semantic analysis (pLSA) handles the tracking error accumulation problem and improves the accuracy of tracking. Further, the proposed HL-TMC method can effectively track multiple pedestrians in real time. The performance evaluation on standard benchmarking datasets such as CAVIAR, PETS2006 and AVSS2007 shows that the proposed approach minimizes the accumulation of tracking errors and is able to track multiple pedestrians in most of the surveillance situations. Keywords: Tracking, Hidden-Latent, Temporal Markov Chain, Error Accumulation, Surveillance.
1 Introduction
Pedestrian tracking has numerous applications in visual surveillance systems, robotics, assisting systems for the visually impaired, content-based indexing, and intelligent transport systems, among others. However, tracking pedestrians robustly, accurately, and efficiently is hard because of many challenges such as changing target appearance, non-rigid motion, varying illumination, and occlusions. To achieve successful tracking by resolving these challenges, a lot of works [2][8][4][9][11] employing mechanisms such as more distinctive features, integrated models, and subspace decomposition analysis with Markov
chain or Bayesian inference have been proposed. These works approximate a target object by employing a random posterior state estimation in every temporal stage during the tracking to obtain a more accurate appearance representation of the target object. However, this mechanism introduces approximation error when 'estimation optimization' is performed. Eventually, the errors accumulated from each temporal stage lead to the problem of 'error accumulation', causing the tracking to fail as time passes. In this paper, to address the error accumulation problem in tracking, we propose to incorporate probabilistic latent semantic analysis (pLSA) into the temporal Markov chain (TMC) tracking model. The latent information is discovered by training the pLSA with histogram of oriented gradients (HoG) features. The tracking incorporates the pLSA for the observation optimization during the maximum a posteriori (MAP) computation of the tracking process. The pLSA is a time-independent operation, and thus past errors have no influence on the current result of the optimization. Thus, by combining a time-independent operation with the particle filters of the TMC for observation optimization, the proposed method can avoid the error accumulation problem by calibrating the errors that occur during each time stage. In this way, we model tracking as a temporal Markov chain with two hidden states, called a hidden-latent temporal Markov chain (HL-TMC). In order to distinguish the hidden states, we call the hidden state which is part of the pLSA 'latent'. The hidden state is used to generate the estimated observations from the current state during the Markov chain transition process, and the latent state helps to determine the most accurate observation. Further, HL-TMC accurately tracks multiple pedestrians by simultaneously considering the diffusion distance [10] of low-level image patches and the high-level semantic meaning between observations and models. Additionally, the effect of pose, viewpoint, and illumination changes can be alleviated by using semantic information learned from the HoG features [5]. The rest of the paper is organized as follows. In Section 2, we describe the related works. In Section 3, we present the proposed HL-TMC mechanism for pedestrian tracking in detail. The experimental results and the analysis are given in Section 4. Finally, we conclude the paper in Section 5 with our observations and conclusions.
2 Related Works
Black et. al. [2] utilized a pre-trained view-based subspace and a robust error norm to model the appearance variations to handle the error accumulation problem. However, the robust performance of this algorithm is at the cost of large amount of off-line training images which cannot be obtained in many realistic visual surveillance scenarios. Further, the error caused by template matching process has not been considered in this work and its accumulation can fail the tracking process. Ramesh et. al. proposed a kernel based object tracking method called ‘mean-shift’ [4]. Instead of directly matching the templates as
in [2], ‘mean-shift’ based tracking improves the tracking performance by performing a gradient-based estimation-optimization process with histogram feature of isotropic kernel. However, these mean-shift based methods still introduce tracking errors under intrinsic variation such as pose changing. When introduced errors accumulate to some level, they would finally make the tracking algorithm to fail. Therefore, more flexible on-line learning models are needed to model the appearance variations of targets to be tracked. To explicitly model the appearance changes during the tracking, a mixture model via online expectation-maximization (EM ) algorithm was proposed by Jepson et al. [9]. Its treatment of tracking objects as sets of pixels makes the tracking fail when the background pixels are modeled differently from foreground during the tracking. In order to treat the tracking targets as abstract objects rather than sets of independent pixels, Ross et al. proposed an online incremental subspace learning mechanism (ISL) with temporal Markov chain (TMC) inference for tracking in [14]. The subspace learning-update mechanism facilitates to reduce the tracking error when varying pose/illumination and partial occlusions occur, and the target appearance modeling can be effectively achieved. However, when the target size is small or if there is strong noise in the captured frame, the error still occurs because the optimization is based on the image pixel/patch-level (low-level) distance metric using particle filters. Therefore, as the TMC inference passes, the ISL also suffers from error accumulation problem and cause the tracking to fail at the end. Grabner et. al. [6] proposed a semi-supervised detector as ‘time independent’ optimization method to calibrate the accumulated errors. However, it performs satisfactorily only for the scenarios where the target is leaving out of the surveillance scenes [1]. In this paper, we simultaneously consider pixel-level and semantic level distance metrics (probability) in each temporal optimization process and the error accumulation problem is resolved by maximizing their joint probability.
3 Pedestrian Tracking Based on Hidden-Latent Temporal Markov Chain Inference (HL-TMC)
We model the pedestrian tracking problem as an inference task in a temporal Markov chain with two hidden state variables. The proposed model is given in Fig. 1. The details are given below. For a pedestrian P, let X_t describe its affine motion parameters (and thereby the location) at time t. Let I_t = {I_1, . . . , I_t}, where I_t = {d_t^1, d_t^2, . . . , d_t^n} denotes a collection of the estimated image patches of P at time t and n is a predefined number of sample estimates. Let z = {z_1, z_2, . . . , z_m} be a collection of m latent semantic topics associated with the pedestrians and w = {w_1, w_2, . . . , w_k} be a collection of k codewords. We consider the temporal Markov chain with hidden states X_t and z. For the HL-TMC, it is assumed that the hidden state X_t is independent of the latent semantic topics z and the codewords w [7]. The whole tracking process is accomplished by maximizing the following probability at each time stage:
288
P. Zhang, S. Emmanuel, and M. Kankanhalli
Fig. 1. Proposed Hidden-Latent Temporal Markov Chain (HL-TMC)
p(X_t | I_t, z, w) · p(I_t | z, w) · p(z | w) ∝ p(X_t | I_t) · p(I_t | z, w).

Based on the Bayes theorem, the posterior probability can be obtained as:

p(X_t | I_t) ∝ p(I_t | X_t) ∫ p(X_t | X_{t-1}) p(X_{t-1} | I_{t-1}) dX_{t-1} = Y_t.

Since the latent semantic analysis process is not a temporal inference, the maximum probability for each time stage t can be obtained as follows:

max p(X_t | I_t, z, w) · p(I_t | z, w) = max Y_t · p(I_t | z, w) = max { Y_t · p(d_t^1 | z, w), . . . , Y_t · p(d_t^n | z, w) }.

Thus, tracking at each time stage t is achieved by maximizing the following quantity for each i ∈ [1, n]:

p(d_t^i | z, w) p(I_t | X_t) ∫ p(X_t | X_{t-1}) p(X_{t-1} | I_{t-1}) dX_{t-1}.    (1)
We estimate the three probabilities p(X_t | X_{t-1}), p(I_t | X_t) and p(d_t^i | z, w) in Expression (1) above by adopting the following three probabilistic models: the time-varying motion model is used to describe the state transfer probability p(X_t | X_{t-1}); the temporal-inference observation model estimates the relationship p(I_t | X_t) between the observations I_t and the hidden states X_t; and the probabilistic latent semantic analysis (pLSA) model is employed in the testing phase to find the maximal pedestrian likelihood probability p(d_t^i | z, w), for 1 ≤ i ≤ n. The pLSA model helps to improve the accuracy of tracking, reduces the tracking error accumulation, and deals with pose, viewpoint, and illumination variations. We now discuss these models in detail below.
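As a schematic sketch of one HL-TMC time stage (not the authors' implementation), the three models can be combined as below; motion_sample, observation_likelihood, and plsa_likelihood stand for the three models described in the following subsections, and the particle-filter bookkeeping is an assumption.

```python
import numpy as np

def track_one_stage(particles, weights, motion_sample, observation_likelihood, plsa_likelihood):
    # Propagate particles with the time-varying motion model p(X_t | X_{t-1})
    new_particles = [motion_sample(x) for x in particles]
    # Pixel-level term: p(I_t | X_t), weighted by the previous posterior weights
    pixel = np.array([observation_likelihood(x) for x in new_particles]) * np.asarray(weights)
    # Semantic term: pLSA pedestrian likelihood p(d_t^i | z, w) of the patch at each state
    semantic = np.array([plsa_likelihood(x) for x in new_particles])
    joint = pixel * semantic
    best = int(np.argmax(joint))                  # sample maximizing Expression (1)
    new_weights = pixel / (pixel.sum() + 1e-12)   # normalized weights for the next stage
    return new_particles, new_weights, new_particles[best]
```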
3.1 Time-Varying Motion Model
We represent each pedestrian as an affine image warp, which is represented by the hidden state X_t composed of six parameters: X_t = (x_t, y_t, θ_t, s_t, α_t, ϕ_t), where x_t, y_t, θ_t, s_t, α_t, ϕ_t denote the x, y translation, rotation angle, scale, aspect ratio, and skew direction at time t, respectively. As in [8], the distribution of each parameter of X_t is assumed to be Gaussian centered around X_{t-1}, and the corresponding diagonal covariance matrix Ψ of the Gaussian distribution is made up of six parameters denoting the variances of the affine parameters, σ_x^2, σ_y^2, σ_θ^2, σ_s^2, σ_α^2 and σ_ϕ^2. If we assume that the variance of each affine parameter does not change over time, the time-varying motion model can be formulated as

p(X_t | X_{t-1}) = N(X_t; X_{t-1}, Ψ).    (2)

3.2 Temporal-Inference Observation Model
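A minimal numpy sketch of Equation (2): candidate states are drawn from a Gaussian centered at X_{t-1} with the fixed diagonal covariance Ψ. The variance values below are illustrative placeholders, not the values used in the paper.

```python
import numpy as np

# X = (x, y, theta, s, alpha, phi); the diagonal of Psi holds per-parameter variances
PSI_DIAG = np.array([5.0, 5.0, 0.01, 0.001, 0.001, 0.001])  # illustrative values only

def sample_motion(x_prev: np.ndarray, n_samples: int, rng=np.random.default_rng()):
    """Draw candidate states X_t ~ N(X_{t-1}, Psi) for the particle filter."""
    std = np.sqrt(PSI_DIAG)
    return x_prev + rng.normal(size=(n_samples, 6)) * std
```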
The relationship p(I_t | X_t) between the observations I_t and the hidden states X_t is estimated using this model. We use I_t to denote a collection of the estimated image patches from the hidden state X_t. Here, principal component analysis (PCA) is employed in a stochastic manner. Suppose that the sample I_t is drawn from a subspace spanned by U (obtained by SVD of the centered data matrix [14]) and centered at µ, such that the distance ℓ from the sample to µ is inversely proportional to the probability of this sample being generated from this subspace. Let Σ denote the matrix of singular values corresponding to the columns of U, let I denote the identity matrix, and let εI denote the additive Gaussian noise in the observation process. Then, as in [14], the probability of a sample drawn from the subspace is estimated as:

p(I_t | X_t) = N(I_t; µ, U U^⊤ + εI) · N(I_t; µ, U Σ^{-2} U^⊤).    (3)

3.3 Time-Independent Probabilistic Latent Semantic Analysis (pLSA) Model
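A sketch of how Equation (3) might be evaluated, assuming µ, U, and the singular values come from the incremental PCA of [14]; the value is computed only up to normalizing constants, and the function name is a placeholder.

```python
import numpy as np

def observation_likelihood(patch, mu, U, sigma, eps=1e-2):
    """p(I_t|X_t) ∝ N(I; mu, U U^T + eps*I) * N(I; mu, U Sigma^-2 U^T), up to constants."""
    d = patch.ravel() - mu                      # centered sample
    proj = U.T @ d                              # coordinates inside the subspace
    residual = d - U @ proj                     # component orthogonal to the subspace
    dist_to_subspace = np.sum(residual ** 2) / eps      # distance-to-subspace term
    within_subspace = np.sum((proj / sigma) ** 2)        # Mahalanobis-style term
    return np.exp(-0.5 * (dist_to_subspace + within_subspace))
```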
Using the above two probability models, one can decide which estimated sample has the shortest distance to the hidden states. However, this computation is based on low-level (pixel-level) processing of the samples. This mechanism may cause the problem that the estimated sample with the shortest distance is not exactly what we really want to track, because the background pixels inside the sample can affect the distance heavily. Therefore, we need the tracking method to be more intelligent and to understand what is being tracked by working at the latent semantic level. Therefore, we employ the pLSA model [7] in the testing phase to find the maximal pedestrian likelihood probability (the likelihood of whether the estimated sample is a pedestrian or not from time t − 1 to t) p(d_t^i | z, w), for 1 ≤ i ≤ n. In the pLSA model of [7], the variable d_t^i denotes a document, while in our case it denotes an estimated image patch, which is the observation. The variables z ∈ z = {z_1, . . . , z_m} are the unobserved latent topics associated with each observation, which is defined as 'pedestrian'. The variables w ∈ w = {w_1, w_2, . . . , w_k} represent the codewords, which are the clustering centers of the HoG feature vectors extracted from the training dataset images. We assume that d and w are independent and conditioned on the state of the associated latent variable z. Let w_t^i ⊂ w be the collection of codewords generated from d_t^i. Each codeword in w_t^i is obtained from the extracted HoG features of d_t^i by vector quantization. As in [7], during testing the likelihood of each estimated sample (observation) at time t being a pedestrian is obtained as

p(d_t^i | z, w) ∝ p(d_t^i | z) ∝ Σ_{w ∈ w_t^i} p(z | w) p(w | d_t^i).    (4)

Since the latent semantic analysis process is not a temporal inference process, by using all of the above three models in Expression (1), the tracking can be performed by maximizing that probability at each time stage.
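A sketch of Equation (4), assuming the m × k matrix p(z|w) learned by pLSA and a codeword histogram obtained by vector quantization of the patch's HoG features; the optional topic weights anticipate the λ weighting introduced in Section 4. All names are placeholders.

```python
import numpy as np

def plsa_likelihood(codeword_hist, p_z_given_w, topic_weights=None):
    """codeword_hist: length-k histogram of the patch's vector-quantized HoG features.
    p_z_given_w:   m x k matrix of p(z|w) learned by pLSA.
    topic_weights: optional length-m weights for the topics describing 'pedestrian'."""
    p_w_given_d = codeword_hist / (codeword_hist.sum() + 1e-12)
    per_topic = p_z_given_w @ p_w_given_d      # sum_w p(z_j|w) p(w|d) for each topic j
    if topic_weights is None:
        return float(per_topic.sum())
    return float(np.dot(topic_weights, per_topic))
```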
4 Experiments and Discussion
The implementation of the proposed mechanism consists of two tasks, one is off-line learning and the other is tracking. For the feature selection in our experiments, we use the histogram of oriented gradients (HoG) because it contains the shape, context information as well as the texture information, which is substantially descriptive for representing the characteristics of a pedestrian. The effectiveness of the HoG feature has been verified in [13] and [15]. Another advantage of the HoG feature is its computational efficiency, which is critical for the real-time requirement of the visual surveillance tracking system. For the training dataset selection, we use the NICTA pedestrian dataset for our training. The reason we choose this dataset is because it provides us about 25K unique pedestrian images at different resolutions and also a sample negative set. And also, each image of NICTA has the size 64 × 80 making it suitable for efficient generation of HoG features. In the pLSA learning process we first perform the HoG feature extraction on the training pedestrian dataset NICTA for each image. Then the generated HoG features are clustered by k-means clustering algorithm to obtain the codebook w = {w1 , w2 , . . . , wk } [12]. Next, the vector quantization (VQ) is carried out for each HoG of each training image based on the codebook to obtain its histogram of codewords which is the ‘bag-of-words’ for learning requirement. The results of learning is the m × k size association probability matrix p(z|w). In our experiments, the size of the codebook k and the number of topics m are both pre-defined as k = 300 and m = 20. For the learned topics, not all of them equally denote the meaning of a ‘pedestrian’. Hence we need to assign weights to each topic based on their descriptive ability for a pedestrian. To obtain this weights-topics histogram, we used a dataset called Pedestrian Seg. The important characteristic of this dataset is that, each image in this dataset only contains the foreground (pedestrian) without the background. The segmentation work of the Pedestrian Seg dataset is done manually by using Adobe Photoshop. We first extracted the HoG features
Fig. 2. Comparison between HL-TMC tracking and ISL tracking on CAVIAR dataset
of each image in the Pedestrian Seg dataset. Then another collection of codewords w_ps is obtained by VQ on the codebook w. The weights λ_1, . . . , λ_m for each topic are computed as

λ'_i = Σ_{w ∈ w_ps} p(w | z_i),    λ_i = λ'_i / Σ_{j=1}^{m} λ'_j.
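A small sketch of this weighting step, assuming a k × m matrix of p(w|z) values from the pLSA training stage; variable names are placeholders.

```python
import numpy as np

def topic_weights(pedestrian_codewords, p_w_given_z):
    """pedestrian_codewords: iterable of codeword indices observed on the segmented set.
    p_w_given_z: k x m matrix of p(w|z) from pLSA training."""
    m = p_w_given_z.shape[1]
    lam_raw = np.zeros(m)
    for w in pedestrian_codewords:            # lambda'_i = sum_{w in w_ps} p(w|z_i)
        lam_raw += p_w_given_z[w, :]
    return lam_raw / (lam_raw.sum() + 1e-12)  # lambda_i = lambda'_i / sum_j lambda'_j
```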
As described in Section 3, the tracking process at each temporal state is done by maximizing p(d_t^i | z, w) · p(I_t | X_t) ∫ p(X_t | X_{t-1}) p(X_{t-1} | I_{t-1}) dX_{t-1}. For each estimation sample d_t^i, p(d_t^i | z, w) and p(I_t | X_t) ∫ p(X_t | X_{t-1}) p(X_{t-1} | I_{t-1}) dX_{t-1} are calculated independently. The quantity p(d_t^i | z, w) denotes a high/semantic-level distance value, which is computed as:

p(d_t^i | z, w) = Σ_j Σ_{w ∈ w_t^i} λ_j p(z_j | w) p(w | d_t^i).

The quantity p(I_t | X_t) ∫ p(X_t | X_{t-1}) p(X_{t-1} | I_{t-1}) dX_{t-1} represents a low/pixel-level distance value. The quantity p(I_t | X_t) is computed from the probability distributions N(I_t; µ, U U^⊤ + εI) and N(I_t; µ, U Σ^{-2} U^⊤) as described in Section 3.2. As in [15], exp(−‖(I_t − µ) − U U^⊤ (I_t − µ)‖^2) corresponds to the negative exponential distance of I_t to the subspace spanned by U. This low/pixel-level distance is proportional to the Gaussian distribution N(I_t; µ, U U^⊤ + εI). The component N(I_t; µ, U Σ^{-2} U^⊤) is modeled using the Mahalanobis distance. Thus, for each d_t^i, there is a probability product arising from the high/semantic-level and low/pixel-level distances. The estimation sample d_t^i which has the largest value for the product is regarded as the optimum estimation for the next tracking stage.
Fig. 3. Comparison between HL-TMC tracking and ISL tracking on PETS2006 dataset
4.1 Tracking Performance Comparison
To verify the effectiveness of the proposed tracking mechanism, we chose surveillance videos with multiple pedestrians for tracking. In addition, the scenes were challenging, with occlusion, illumination variations, and scale changes. We compare the proposed work with the ISL tracking of Ross et al. [14], the AMS tracking of Collins [3], and the WSL tracking of Jepson et al. [9]. Since ISL tracking has been demonstrated to be more robust and accurate than other classic tracking approaches [10], due to space limitations we give the visual comparison of the proposed method only with ISL. The quantitative comparison is provided against all three works. Visual Performance Comparison: We evaluated the proposed tracking method on popular surveillance video datasets, such as CAVIAR, PETS2006, and AVSS2007. The tracking results are shown in Figs. 2-4. Fig. 2 presents the performance of the proposed pedestrian tracking in the surveillance scene inside a hall from the CAVIAR dataset. For all the tracked pedestrians, the proposed HL-TMC tracking performs more accurately than the ISL tracking, but both methods can deal with the occlusion cases in this test. Besides halls and malls, visual surveillance systems are widely deployed in subway stations. Therefore, we test the performance of HL-TMC on these scenes as well. The comparison in Fig. 3 shows that the proposed tracking method can track multiple pedestrians more accurately in the long-distance camera shot scene of the PETS2006 dataset. Even with occlusion occurring in scenes with many moving pedestrians, the proposed method can still robustly perform the tracking. Another test is performed with a short-distance camera shot from the AVSS2007 surveillance dataset, and the results are given in Fig. 4. In this case, since the background is simple compared to the previous cases, both the proposed method and ISL perform well for the pedestrians who are close to the camera. However,
Fig. 4. Comparison between HL-TMC tracking and ISL tracking on AVSS2007 dataset
Fig. 5. Covering rate computation for quantitative analysis
while tracking the pedestrians who are far away, the ISL loses track and the HL-TMC is still able to track the pedestrians when occlusion occurs. Similarly, the proposed method outperformed the AMS and WSL tracking methods on these test videos. The AMS and WSL test results are not included due to space limitations. Quantitative Analysis: For the quantitative analysis of tracking performance, we manually label several key locations on the pedestrians as 'ground truth' that need to be covered by the tracking area. The tracking precision for each pedestrian is defined by the 'covering rate' (CR) and is computed as:
CR = (number of tracked key locations inside the bounding box) / (total number of key locations needed to be tracked).    (5)
In our experiments, we had about 409 pedestrian templates of various poses for marking these key locations for the CR calculation. The CR computations on some sample templates are shown in the Fig. 5. During the tracking process, both numbers are manually counted frame by frame for the whole video. Then the
Fig. 6. Covering rates of the HL-TMC, ISL, WSL and AMS tracking algorithms
CR for each frame is obtained using Equation 5. In addition, the likelihood computation based on local features (texture and HoG) guarantees that the tracking bounding box does not enlarge too much over the boundary of the target pedestrian, which avoids the case of CR ≡ 1 when the bounding box covers the whole frame. In our experiments, we did not employ Jepson's root mean square (RMS) error calculation for the quantitative analysis, because computing the RMS is very hard when tracking multiple pedestrians with occlusions and pose changes. Fig. 6(a) shows the comparison of CR between HL-TMC tracking and the ISL, AMS, and WSL tracking on the video "EnterExitCrossingPaths1cor" in the CAVIAR dataset. It can be seen that during the whole tracking process, HL-TMC always outperforms the other three tracking mechanisms, and the performance of ISL is better than the other two, as claimed in [14]. Fig. 6(b) shows the performance comparison on the video "PETS2006.S3-T7-A.Scene3" in the PETS 2006 dataset. The proposed HL-TMC tracking in this case also outperforms the other three methods. Notice that the WSL tracking outperforms the ISL tracking at the beginning, but it gradually loses track when heavy occlusions of pedestrians occur.
5 Conclusion
In this paper, we proposed a novel pedestrian tracking mechanism based on a temporal Markov chain model with two hidden states to handle the tracking error problem. To minimize the accumulation of tracking errors, we proposed to incorporate the semantic information about each observation in the general temporal Markov chain tracking model. By employing the HoG features to find the latent semantic cues denoting the pedestrians, the proposed method can search the meaning of a “pedestrian” to find the most likely (“pedestrian” observation) sample accurately and efficiently for updating the target appearance. The experimental results on different popular surveillance datasets, such as CAVIAR,
PETS2006 and AVSS2007 demonstrated that the proposed method can robustly and accurately track the pedestrians under various complex surveillance scenarios and it outperforms the existing algorithms.
References 1. Babenko, B., Yang, M.H., Belongie, S.: Visual tracking with online multiple instance learning. In: IEEE Conference on Computer Vision and Pattern Recognition, pp. 983–990 (June 2009) 2. Black, M., Jepson, A.: Eigentracking: Robust matching and tracking of articulated objects using a view-based representation. International Journal of Computer Vision 26(1), 63–84 (1998) 3. Collins, R.: Mean-shift blob tracking through scale space. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. II–234–40 (June 2003) 4. Comaniciu, D., Ramesh, V., Meer, P.: Kernel-based object tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(5), 564–577 (2003) 5. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 886– 893 (June 2005) 6. Grabner, H., Leistner, C., Bischof, H.: Semi-supervised on-line boosting for robust tracking. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008, Part I. LNCS, vol. 5302, pp. 234–247. Springer, Heidelberg (2008) 7. Hofmann, T.: Probabilistic latent semantic indexing. In: Proc. International ACM SIGIR Conference, pp. 50–57 (1999) 8. Isard, M., Blake, A.: Condensation – conditional density propagation for visual tracking. International Journal of Computer Vision 29(1), 5–28 (1998) 9. Jepson, A., Fleet, D., El-Maraghi, T.: Robust online appearance models for visual tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(10), 1296–1311 (2003) 10. Kwon, J., Lee, K.: Visual tracking decomposition. In: IEEE Conference on Computer Vision and Pattern Recognition (2010) 11. Lim, J., Ross, D., Lin, R., Yang, M.: Incremental learning for visual tracking. In: Advances in Neural Information Processing Systems (NIPS), vol. 17, pp. 793–800 (2005) 12. Niebles, J., Wang, H., Li, F.: Unsupervised learning of human action categories using spatial-temporal words. International Journal of Computer Vision 79(3), 299– 318 (2008) 13. Zhu, Q., Yeh, M.C., Cheng, K., Avidan, S.: Fast human detection using a cascade of histograms of oriented gradients. In: IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 1491–1498 (2006) 14. Ross, D., Lim, J., Lin, R., Yang, M.: Incremental learning for robust visual tracking. International Journal of Computer Vision 77(1-3), 125–141 (2008) 15. Zhang, J., Marszalek, M., Lazebnik, S., Schmid, C.: Local features and kernels for classification of texture and object categories: A comprehensive study. International Journal of Computer Vision 73(2), 213–238 (2007)
Motion Analysis via Feature Point Tracking Technology

Yu-Shin Lin1, Shih-Ming Chang1, Joseph C. Tsai1, Timothy K. Shih2, and Hui-Huang Hsu1

1 Department of Computer Science and Information Engineering, Tamkang University, Taipei County, 25137, Taiwan [email protected]
2 Department of Computer Science and Information Engineering, National Central University, Taoyuan County, 32001, Taiwan
Abstract. In this paper, we propose a tracking method based on the SIFT algorithm for recording the trajectory of human motion in an image sequence. Instead of using a human model that represents the whole human body to analyze motion, we extract only two feature points from the local region of a trunk: one for the joint and one for the limb. We calculate the similarity between the features of two trajectories. The similarity computation is based on the "motion vector" and the "angle". The angle is given by the line connecting the joint to the limb in a plane whose center is the core of the object. The proposed method consists of two parts. The first part tracks the feature points and outputs a file that records the motion trajectory. The second part analyzes the features of the trajectories and adopts DTW (Dynamic Time Warping) to calculate a score that shows the similarity between two trajectories. Keywords: motion analysis, object tracking, SIFT.
1 Introduction

Motion analysis is a very important approach in several research areas, such as the kinematics of the human body or 3D object generation in 3D games. It records the trajectory of body motion or physical gestures with a camera. After this information is obtained, the motion form can be completed. This kind of technology is widely applied to human motion analysis [1]. Before analyzing, we have to know how the object moves. Object tracking is a useful technique for finding the location of a moving object over time in image processing. In order to match the similarity between two videos, we have to extract the features of the object in the images [2, 3]. For tracking objects, we use the SIFT algorithm [4] to describe the features of objects. The features are based on the local appearance of the objects, their scale and orientation are invariant, and the feature matching is robust. In recent years, many studies on object tracking or recognition have been based on modified SIFT algorithms [5, 6]. For human motion capture and recording of the trajectory of body movement [7, 8, 9, 10], such trajectories are important
content for analyzing motion. In our approach, DTW [11, 12, 13] is used to match human behavior sequences. In this paper, our system combines tracking and trajectory matching. The remainder of this paper is organized as follows. Section 2 is an overview of the SIFT algorithm. Section 3 briefly introduces the application of the SIFT algorithm. Section 4 describes the representation of trajectories and how they are analyzed. Section 5 explains our system architecture. Experimental results are given in Section 6. Finally, a short conclusion is given in Section 7.
2 SIFT Algorithm and Application

In order to get an exact tracking result between two continuous frames, we choose the SIFT algorithm for this work. Although many approaches can produce a very precise tracking result, we need a result that includes both angle and location information, so we choose the SIFT algorithm in this study. The major stages of SIFT are as follows.

Scale-space extrema detection. In the first stage, candidate tracking points are extracted from the image over location and scale. The scale space of an image is computed with the Gaussian function

G(x, y, σ) = (1 / (2πσ^2)) · exp(−(x^2 + y^2) / (2σ^2)).    (1)
In each octave of the difference-of-Gaussian scale space, extrema are detected by comparing a pixel to its 26 neighbors in 3-by-3 regions at the current and adjacent scales.

Orientation assignment. To achieve invariance to image rotation, a consistent orientation is assigned to each keypoint based on local image properties. The scale of the keypoint is used to select the Gaussian-smoothed image, and then the gradient magnitude m(x, y) and orientation θ(x, y) are computed:

m(x, y) = sqrt( (L(x+1, y) − L(x−1, y))^2 + (L(x, y+1) − L(x, y−1))^2 ),
θ(x, y) = tan^{-1}( (L(x, y+1) − L(x, y−1)) / (L(x+1, y) − L(x−1, y)) ),    (2)
where L is the smoothed image. An orientation histogram with 36 bins is formed, with each bin covering 10 degrees. The highest peaks in the orientation histogram correspond to the dominant directions of the local gradients.

Keypoint descriptor. After finding keypoint locations at particular scales and assigning orientations to them, this step computes a descriptor vector for each keypoint. A keypoint
Fig. 1. This figure shows the Keypoint descriptor of SIFT
descriptor is created by first computing the gradient magnitude and orientation at each image sample point in a region around the keypoint location. These are weighted by a Gaussian window, indicated by the overlaid circle. The samples are then accumulated into orientation histograms summarizing the contents over 4-by-4 subregions. Each subregion has a histogram with 8 orientations. Therefore, the feature vector has 4 × 4 × 8 elements. The descriptor is formed from a vector containing the values of all the orientation histogram entries.

2.1 Application of SIFT Algorithm

The SIFT keypoints of objects are extracted from scale space. In practice, the location of a detected keypoint is not necessarily in the region of the object that we want to track. Therefore, we replace the extrema detection method with initial points entered by hand. In this way, we can set the region that we want to analyze. In the orientation assignment and keypoint descriptor stages, the scale of the keypoint is normally used to select the Gaussian-smoothed image with the closest scale before computing the gradient magnitude and orientation. Because the initial point is not detected automatically, we cannot determine the scale level, so each standard deviation of the Gaussian function is set to 1. There are two reasons for this setting: one is that the scale of the image is close to the original image; the other is that the Gaussian function can reduce the amount of noise.
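A minimal numpy sketch of the orientation-assignment computation in Equation (2): image gradients give a magnitude and an orientation per pixel, which are accumulated into a 36-bin, 10-degree-per-bin histogram. This is an illustration, not the authors' code.

```python
import numpy as np

def orientation_histogram(L):
    """L: Gaussian-smoothed image patch (2-D array). Returns a 36-bin orientation
    histogram weighted by gradient magnitude, used to pick a dominant orientation."""
    dx = L[1:-1, 2:] - L[1:-1, :-2]                 # L(x+1, y) - L(x-1, y)
    dy = L[2:, 1:-1] - L[:-2, 1:-1]                 # L(x, y+1) - L(x, y-1)
    mag = np.sqrt(dx ** 2 + dy ** 2)
    theta = np.degrees(np.arctan2(dy, dx)) % 360.0
    bins = np.clip((theta // 10).astype(int), 0, 35)  # 36 bins of 10 degrees each
    return np.bincount(bins.ravel(), weights=mag.ravel(), minlength=36)
```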
3 Trajectory of Object

3.1 Presentation of Trajectory

A motion path of an object is recorded from consecutive frames by the method in the last section; this path is the trajectory of the moving object. In this system, we only record the absolute coordinate of the feature point in the region of the object in each frame. Therefore, we cannot directly calculate the similarity of two trajectories. In this paper, we define the "motion vector" and the "angle" as the features of the data points that belong to a trajectory:

Angle. A human's continuous movement consists of different postures. The transformations between these postures are expressed by the transformation of the angle between the limbs and a joint in the region of the body.
Fig. 2. The angle representation in our experiment. We set the neck as the center point of the upper body and the hip as the center point of the lower body. We also build a four-quadrant coordinate axis to record the angle and position of the feature points such as the hands, head, and feet.
The representation uses the concept of a polar coordinate system: the joint is the pole, and the angle of a point in the region of the limb is measured counterclockwise from the polar axis. An instance is shown in Fig. 2.

Motion Vector. The direction of the moving object is defined as the motion vector. It is represented by 8 directions; Fig. 3 shows the directional representation.
Fig. 3. The motion vectors are shown in the figure. There are eight directions in total, starting from the right. We assign a number to each direction. In this way, we know the motion direction very clearly.
A number is assigned to each direction for easy identification. In addition, 0 is assigned to the center, which means the object did not move.
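As an illustration of this coding (the exact sector boundaries and numbering in Fig. 3 are assumed), a displacement between consecutive frames can be quantized as follows:

```python
import math

def motion_vector_code(dx: float, dy: float, eps: float = 0.5) -> int:
    """Quantize a displacement (dx, dy) into 8 directions (1..8); 0 means no motion."""
    if abs(dx) < eps and abs(dy) < eps:
        return 0
    angle = math.degrees(math.atan2(dy, dx)) % 360.0     # 0 degrees = moving right
    return int(((angle + 22.5) % 360.0) // 45.0) + 1      # 45-degree sectors
```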
3.2 Similarity between Trajectories

Dynamic time warping is an algorithm for measuring the similarity between two sequences. Based on the "motion vector" and the "angle", we modify the DTW algorithm into a score-path equation:

SP(E_i, C_j) = SP(E_{i−1}, C_{j−1}) + mvS_{i,j} + anS_{i,j},                          if θ < 1,
SP(E_i, C_j) = MAX( SP(E_i, C_{j−1}), SP(E_{i−1}, C_j) ) + mvS_{i,j} + anS_{i,j},     otherwise,    (3)
where E and C are the trajectories and i and j index the data points of the respective trajectories. The term mvS is the score obtained from the similarity of the "motion vector", and anS is the score obtained from the similarity of the "angle". According to the equation, we obtain a total score that represents the similarity between the two trajectories.
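A sketch of Equation (3) filled in dynamic-programming order; mv_score, an_score, and angle_diff are placeholder functions for the motion-vector score, the angle score, and the angle difference in degrees.

```python
def trajectory_score(E, C, mv_score, an_score, angle_diff):
    """E, C: lists of trajectory points (each carrying a motion-vector code and an angle)."""
    n, m = len(E), len(C)
    SP = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            gain = mv_score(E[i - 1], C[j - 1]) + an_score(E[i - 1], C[j - 1])
            if angle_diff(E[i - 1], C[j - 1]) < 1.0:   # theta < 1: extend the matched path
                SP[i][j] = SP[i - 1][j - 1] + gain
            else:                                       # otherwise take the better neighbor
                SP[i][j] = max(SP[i][j - 1], SP[i - 1][j]) + gain
    return SP[n][m]
```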
4 System Architecture

Our system consists of two parts. The first part is object tracking; its main goal is to track the object and record the moving path. In this step, we obtain a file that records the angle and location of the keypoints in the video stream. We then use the tracking results to analyze two motions. The second part analyzes and compares the trajectories. After analyzing two motions, we can assign a score to the evaluated motion video; the score is the similarity between the two motions. Figure 4 illustrates the system architecture.

In the tracking process, we first enter the initial points as the feature points to be tracked. A SIFT descriptor is computed for each point via the SIFT operator. We then perform motion estimation by a modified full search: the feature point is taken as the center of a block, and a full search is processed against the next frame. The pixels in this region are keypoint candidates. We use a block-based full search to find the candidate area that is most similar to the feature point; the result is the feature point to be tracked. After the tracking process, the system outputs a file that records the trajectory coordinates of the feature points. The trajectory coordinates are the positions of the target object in the consecutive frames. The features of a trajectory are formed from the "Motion Vector" and the "Angle", and the trajectories are matched by analyzing these two features. Finally, we use Dynamic Time Warping (DTW) to score the similarity of two trajectories and show the result.

In the tracking process, the detected object location may not be as expected because of the block-matching algorithm; such errors arise when the object rotates or changes size. To solve this problem, we add an adjustment mechanism to the tracking process, with the four steps listed below. The system can show the tracking result for the current frame in real time, and an adjustment is required when such a situation occurs. The four steps are as follows:
1. Pause the process immediately.
2. Delete the record of the wrong point.
3. Add the new, correct point.
4. Continue the processing.
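A rough sketch of the block-matching full search described above, using the sum of squared differences; the block and search-window sizes are assumptions, and the point is assumed to lie far enough from the frame border.

```python
import numpy as np

def full_search(prev_frame, next_frame, point, block=8, search=16):
    """Return the updated (x, y) of the feature point in next_frame."""
    x, y = point
    ref = prev_frame[y - block:y + block, x - block:x + block].astype(float)
    best, best_xy = np.inf, (x, y)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            cx, cy = x + dx, y + dy
            cand = next_frame[cy - block:cy + block, cx - block:cx + block].astype(float)
            if cand.shape != ref.shape:
                continue                      # candidate block falls outside the frame
            ssd = np.sum((cand - ref) ** 2)   # sum of squared differences
            if ssd < best:
                best, best_xy = ssd, (cx, cy)
    return best_xy
```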
Fig. 4. The system flowchart. The left part shows the tracking algorithm. First, we enter the initial keypoints and track them through all frames of the video stream. When it finishes, we obtain a file recording the tracking result. The right part is the matching algorithm. We load the files produced by the first part and analyze them. In the end, we obtain the matching results and the scores of the object motion.
After the adjustment, we get a more precise tracking result. We use an evaluation measure named dPSNR to check the accuracy of the result. dPSNR is a modified PSNR in which the pixel error is replaced by the distance between the tracked and true positions. The values are shown in Table 1.

Table 1. A comparison of dPSNR values

trajectory     dPSNR (without adjusting)   dPSNR (with adjusting)
neck           51.5986                     52.6901
left hand      42.8803                     52.1102
right hand     46.6695                     50.0016
5 Experiments

Yoga exercises are used as the study videos in this article. We cut a clip from a film in which the motion is raising both arms parallel to the floor. The result of the analysis is shown in Figure 5. The green trajectory represents the standard motion and the blue trajectory represents the imitated motion. The green circle represents the joint, which is on the
neck. The squares are the head, the right foot, and the left foot, respectively. The head and feet did not move, so in this case there are no trajectories for them. The two joints are located at the same coordinate, and we can see the similarity between the two trajectories very clearly in the figure. Red lines are drawn between data points when they are very similar to each other. Similar here means that the motion vectors are identical and the angle difference is less than 1 degree.
Fig. 5. The matching result for the two-hands example. The red lines connect the two trajectories; the points in each line are the pairs that are compared when computing the score.
6 Conclusion

In this paper, we use the SIFT algorithm and a full-search approach to find the trajectory of an object in consecutive images. We represent a trajectory by its "motion vector" and "angle" features and then compute the similarity based on a modified Dynamic Time Warping. Finally, we show the data and the image that illustrate the difference between the trajectories. A trajectory consists of points, and in practice each point is a pixel in a digital image. Hence, we can display subtle human actions and easily differentiate the differences.
References 1. Hernández, P.C., Czyz, J., Marqués, F., Umeda, T., Marichal, X., Macq, B.: Bayesian Approach for Morphology-Based 2-D Human Motion Capture. IEEE Transactions on Multimedia, 754–765 (June 2007) 2. Zhao, W.-L., Ngo, C.-W.: Scale-Rotation Invariant Pattern Entropy for Keypoint- Based Near-Duplicate Detection. IEEE Transactions on Image Processing, 412–423 (February 2009)
3. Tang, F., Tao, H.: Probabilistic Object Tracking With Dynamic Attributed Relational Feature Graph. IEEE Transactions on Circuits and Systems for Video Technology, pp. 1064–1074 (August 2008)
4. Lowe, D.G.: Distinctive Image Features from Scale-Invariant Keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
5. Chen, A.-H., Zhu, M., Wang, Y.-h., Xue, C.: Mean Shift Tracking Combining SIFT. In: 9th International Conference on Signal Processing, ICSP 2008, pp. 1532–1535 (2008)
6. Li, Z., Imai, J.-i., Kaneko, M.: Facial Feature Localization Using Statistical Models and SIFT Descriptors. In: The 18th IEEE International Symposium on Robot and Human Interactive Communication, pp. 961–966 (2009)
7. Lu, Y., Wang, L., Hartley, R., Li, H., Shen, C.: Multi-view Human Motion Capture with An Improved Deformation Skin Model. In: Computing: Techniques and Applications, pp. 420–427 (2008)
8. Huang, C.-H.: Classification and Retrieval on Human Kinematical Movements. Tamkang University (June 2007)
9. Chao, S.-P., Chiu, C.-Y., Chao, J.-H., Yang, S.-N., Lin, T.-K.: Motion Retrieval And Its Application To Motion Synthesis. In: Proceedings, 24th International Conference on Distributed Computing Systems Workshops, pp. 254–259 (2004)
10. Lai, Y.-C., Liao, H.-Y.M., Lin, C.-C., Chen, J.-R., Peter Luo, Y.-F.: A Local Feature-based Human Motion Recognition Framework. In: IEEE International Symposium on Circuits and Systems, May 24-27, pp. 722–725 (2009)
11. Shin, C.-B., Chang, J.-W.: Spatio-temporal Representation and Retrieval Using Moving Object's Trajectories. In: International Multimedia Conference, Proceedings of the 2000 ACM Workshops on Multimedia, pp. 209–212 (2000)
12. Chen, Y., Wu, Q., He, X.: Using Dynamic Programming to Match Human Behavior Sequences. In: 10th International Conference on Control, Automation, Robotics and Vision, ICARCV 2008, pp. 1498–1503 (2008)
13. Yabe, T., Tanaka, K.: Similarity Retrieval of Human Motion as Multi-Stream Time Series Data. In: Proc. International Symposium on Database Applications in Non-Traditional Environments, pp. 279–286 (1999)
Traffic Monitoring and Event Analysis at Intersection Based on Integrated Multi-video and Petri Net Process

Chang-Lung Tsai and Shih-Chao Tai

Department of Computer Science, Chinese Culture University, 55, Hwa-Kang Road, Taipei, 1114, Taiwan, R.O.C. [email protected]
Abstract. Decreasing traffic accidents and events is one of the most significant responsibilities of most governments in the world. Nevertheless, it is hard to precisely predict traffic conditions. To comprehend the root causes of traffic accidents and to reconstruct the occurrence of traffic events, a traffic monitoring and event analysis mechanism based on multi-video processing and Petri net analysis techniques is proposed. Traffic information is collected through cameras deployed at intersections in heavy-traffic areas. All of the collected information is then used to construct a multi-viewpoint traffic model. Significant features are extracted for traffic analysis through Petri nets and motion vector detection. Finally, a decision is output after integrated traffic information and event analysis. Experimental results demonstrate the feasibility and validity of our proposed mechanism; it can be applied as a traffic management system. Keywords: Traffic monitor, video process, intelligent traffic system, Petri net, motion detection.
1 Introduction

As technology advances, the desire to establish an intelligent lifestyle has become very popular and is strongly advocated. Thus, the physical construction of an ITS (intelligent traffic system) through traffic monitoring and management has become a significant issue in most countries. The goal is to decrease the occurrence of traffic incidents. Therefore, high-resolution video cameras have been mounted on all freeways, heavy-traffic highways, tunnels, and important traffic intersections of major cities to monitor and detect traffic conditions. In addition, the related information can be supplied as evidence for traffic enforcement. In the past few years, most traffic conditions have been difficult to predict. However, as scientific techniques advance, traffic flow management becomes easier than before. In addition, some vehicles have been equipped with high-resolution cameras to provide early warnings and avoid accidents. The infrastructure for monitoring and managing traffic on freeways, heavy-traffic highways, main traffic intersections, and tunnels has obviously improved and has gradually been accomplished in most critical cities worldwide. However, as regards early warning systems and traffic restoration systems, the development of related applications and products is still needed.
Life is an invaluable treasure; that is why most drivers drive carefully. However, traffic accidents still occur every day. Therefore, understanding the behavior of drivers well is very important. The goal of traffic management is not only to detect traffic events and control flow, but also to detect aggressive and violating drivers in order to decrease traffic accidents. Regarding video surveillance for traffic detection, several researchers have developed different types of video detection systems (VDS) [2]. In [3], Wang et al. proposed a video image vehicle detection system (VIVDS) to detect vehicles of different colors. In [4], Ki et al. proposed a vision-based traffic accident detection system based on an inter-frame difference algorithm for traffic detection at intersections. Although many different types of VIVDS have been developed, a single camera mounted on a roadside pole or traffic light is not sufficient to capture the entire intersection and provide the complete traffic information needed. Some evidence shows that even the popular advanced Traficon systems can have problems with false positive and false negative signals at intersections due to weather and lighting conditions [2]. In order to understand the root causes of traffic accidents and to reconstruct the occurrence of traffic events, a traffic monitoring and event analysis mechanism based on multi-viewpoint and 3D video processing techniques is proposed in this paper. The rest of this paper is organized as follows. In Section 2, the rationale of the Petri net and the multi-viewpoint traffic model is introduced. Traffic monitoring and event analysis are addressed in Section 3. Experimental results are demonstrated in Section 4 to verify the feasibility and validity of our proposed mechanism. Finally, concluding remarks are given in Section 5.
2 Rationale of Petri Net and Multi-viewpoint Traffic Model
2.1 Rationale of Petri Net
Generally, a Petri net [5] is defined by places, transitions, and directed arcs that express their relations, flow, and transition status. An arc represents a variation and runs only from a place to a transition or from a transition to a place; it never runs between two places or between two transitions. A traditional Petri net graph is shown in Figure 1.
Fig. 1. A traditional graph of Petri Net
The Petri net graph applied in this paper is a 5-tuple (S, T, W, I0, M), where S is a finite set of places, representing where a vehicle is located in each frame of the traffic video; T is a finite set of transitions, representing the variation of each vehicle; W is a multiset of arcs, defined and operated as W: (S×T) ∪ (T×S) → N;
I0 is the initial state of each vehicle; and M is the current state of each vehicle. The 5-tuple is assigned and validated only when a vehicle enters the focused traffic intersection blocks shown in Figure 2. Petri net analysis for traffic is based on vehicular routes and interactivity. After vehicle shaping, the center of each shape represents the current place of that vehicle, and its driving route is then detected. When a vehicle enters the intersection area, indicated by gray in Figure 2, it is assigned a token, and the token is valid only inside the intersection area. To record the timestamp of each token, the token's timing is synchronized with the fps (frames per second) of the video. Once the vehicle moves outside the intersection area, its token expires. Under normal conditions, a vehicle should pass through the intersection area only when the green light for its driving direction is on; whether going straight, turning left, or turning right, it has to wait for the corresponding traffic light. Normally, the inner lanes are reserved and marked for vehicles taking a left turn, while the other lanes are used for driving straight or turning right.
Fig. 2. The focused traffic intersection area for Petri net analysis
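To make the token bookkeeping concrete, the following minimal Python sketch tracks tokens in the spirit of the Petri net analysis described above. It is an illustrative assumption, not the authors' implementation: the `Token` class, the intersection polygon, and the per-frame vehicle centers (assumed to come from the vehicle-shaping step) are all hypothetical names.

```python
from dataclasses import dataclass, field

@dataclass
class Token:
    vehicle_id: int
    enter_frame: int                              # timestamp in frame units, synced to the video fps
    places: list = field(default_factory=list)    # visited places (shape centers) inside the intersection

def point_in_polygon(pt, polygon):
    """Ray-casting point-in-polygon test; polygon is a list of (x, y) vertices of the gray intersection area."""
    x, y = pt
    inside = False
    for i in range(len(polygon)):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % len(polygon)]
        if (y1 > y) != (y2 > y) and x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
            inside = not inside
    return inside

def update_tokens(frame_idx, vehicle_centers, intersection, active_tokens):
    """vehicle_centers: {vehicle_id: (x, y)}; returns the tokens that expired in this frame."""
    expired = {}
    for vid, center in vehicle_centers.items():
        if point_in_polygon(center, intersection):
            if vid not in active_tokens:                   # token becomes valid on entry
                active_tokens[vid] = Token(vid, frame_idx)
            active_tokens[vid].places.append(center)       # record the current place
        elif vid in active_tokens:                         # token expires once the vehicle leaves
            expired[vid] = active_tokens.pop(vid)
    return expired
```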
2.2 Rationale of 3D Video and Multi-viewpoint Traffic Model
Multi-viewpoint video is an ultimate image medium for recording dynamic visual events in the real world [6]. For example, one can record a 3D object shape with high-fidelity surface properties such as color and texture under time-varying sampling. In [7], Matsuyama et al. proposed a 3D video processing model in which the following techniques are developed:
1. Reconstruct dynamic 3D object actions from multi-viewpoint video images in real time.
2. Reconstruct an accurate 3D object shape by deforming a 3D mesh model.
3. Render natural-looking texture on the 3D object surface from the multi-viewpoint video images.
Shown in Figure 3 is the deployment of sensors at a road intersection. In the figure, the red spot indicates a virtual target for the cameras to focus on and
record the related information based on time-varying sampling. The yellow arrows represent the deployed cameras and their recording directions. How many cameras are optimal for recording the real traffic condition is left implementation-defined, because no two intersections share exactly the same traffic situation; however, at least 4 to 5 cameras are preferable for constructing the multi-viewpoint traffic model. In addition, every deployed sensor must have the same specifications, such as the same color calibration, resolution, and so on.
Fig. 3. Deployment of sensors at a road intersection
Fig. 4. Recorded frames at one traffic intersection from five different directions, according to Figure 3
Shown in Figures 4(a) to 4(e) are frames recorded in the same traffic intersection area from different viewpoints according to Figure 3.
In these frames, the same bus, extracted from videos recorded in different directions, is marked by rectangular blocks. To construct a perspective multi-view from multiple videos, the procedure is as follows:
Step 1: Evaluate the importance of a traffic intersection.
Step 2: Decide how many cameras are sufficient and suitable for collecting the information needed to construct the multi-viewpoint traffic model.
Step 3: Determine the mounting locations.
Step 4: Record the physical traffic information from different directions, such as the front-side viewpoint, the right-hand side viewpoints at the front and rear positions, and the left-hand side viewpoints at the front and rear positions.
Step 5: Perform video preprocessing such as noise cleaning.
Step 6: Integrate the videos and construct the 3D traffic video or multi-viewpoint traffic model.
3 Traffic Monitor and Event Analysis
3.1 Preprocessing of Traffic Videos
To fully comprehend the real traffic condition, correctly perform the desired traffic detection, and precisely analyze the driving behavior of each vehicle in the monitored area, the raw traffic video is preprocessed to reduce unwanted interference and noise. The preprocessing diagram is shown in Figure 5, and the complete flowchart with the different modules of traffic video processing is shown in Figure 6. The detailed procedure is as follows.
Step 1: The input video is decomposed into frames.
Step 2: Noise cleaning is performed to remove unnecessary noise.
Step 3: Background segmentation is performed for further processing.
Step 4: Three tasks are performed: multi-viewpoint traffic model construction and 3D video presentation, motion vector detection, and vehicle detection, in which the Fourier descriptor (FD) is applied for vehicle shaping, representation, recognition, and further tracking.
Step 5: Petri net analysis is performed to track the status of each vehicle, and the integrated traffic information is analyzed in this step.
Step 6: Two tasks are performed: violation detection, which is determined from the information generated in Step 4, and continued monitoring and recording of the subsequent traffic condition.
Input Videos → Frame Extraction → Noise Cleaning → Image Segmentation
Fig. 5. The preprocessing pipeline for traffic video
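As a rough illustration of the preprocessing stages in Figure 5, the OpenCV-based sketch below extracts frames, cleans noise, and segments the foreground; the Gaussian filter and MOG2 background subtractor are stand-ins chosen for the sketch, not necessarily the filters used by the authors.

```python
import cv2

def preprocess_traffic_video(path):
    """Yield (frame, foreground_mask) pairs: frame extraction -> noise cleaning -> background segmentation."""
    cap = cv2.VideoCapture(path)
    bg = cv2.createBackgroundSubtractorMOG2(history=300, detectShadows=True)
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    while True:
        ok, frame = cap.read()                          # Step 1: decompose the video into frames
        if not ok:
            break
        denoised = cv2.GaussianBlur(frame, (5, 5), 0)   # Step 2: suppress noise
        fg_mask = bg.apply(denoised)                    # Step 3: background segmentation
        fg_mask = cv2.morphologyEx(fg_mask, cv2.MORPH_OPEN, kernel)   # remove small speckles
        yield frame, fg_mask
    cap.release()
```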
Input Videos → Video Preprocess → (Multi-view & 3D Model, Motion Vector Detection, Vehicle Shaping) → Petri Net Analysis → (Violation Detection, Keep on Monitoring)
Fig. 6. Flowchart of traffic video processing
Fig. 7. Detected motion vectors of traffic flows
3.2 Combination of Petri Net and Motion Vector Analysis
The recorded Petri net information of each vehicle is collected and combined with its motion vector for integrated analysis. Figure 7(b) shows the motion vectors detected from Figure 7(a), and Figure 7(d) shows the motion vectors detected from Figure 7(c); both are detected inside the traffic intersection area at different times. Vehicles moving in the same direction, or taking a left turn, right turn, or even U-turn from other directions, are also collected and integrated for further analysis. Through this analysis, the interactivity of each vehicle within the same monitored intersection area can be used to determine whether its driving is categorized as possibly aggressive, dangerous, or violating. If the motion of a vehicle is recognized as violating or aggressive, a violation alarm is issued; otherwise, the system keeps monitoring and analyzing the traffic until violating or aggressive driving occurs. The motion vector of each vehicle is measured from the movement of the macroblock with the most likely color and, simultaneously, the highest similarity of vehicle shape traced by the Fourier descriptor, to avoid false tracking induced by considering color only; examples are shown in Figures 7(b) and 7(d). Moreover, the traffic light is also considered as a feature in evaluating whether there is a violation.
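The sketch below illustrates one possible way to combine these cues: the motion vector of a vehicle's macroblock is estimated by block matching, and a violation is flagged when the vehicle's token is active inside the intersection while the light for its movement direction is red. The `light_state` lookup and the direction labels are hypothetical placeholders, not part of the paper's specification.

```python
import numpy as np

def block_motion_vector(prev_gray, curr_gray, bbox, search=8):
    """Estimate the motion vector of a vehicle's macroblock by exhaustive block matching
    (sum of absolute differences); the bbox (x, y, w, h) is assumed to lie inside both frames.
    In the full system this would be combined with the color/shape checks from the FD step."""
    x, y, w, h = bbox
    template = prev_gray[y:y + h, x:x + w].astype(np.int32)
    best_sad, best_dxdy = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            yy, xx = y + dy, x + dx
            if yy < 0 or xx < 0 or yy + h > curr_gray.shape[0] or xx + w > curr_gray.shape[1]:
                continue
            cand = curr_gray[yy:yy + h, xx:xx + w].astype(np.int32)
            sad = np.abs(cand - template).sum()
            if best_sad is None or sad < best_sad:
                best_sad, best_dxdy = sad, (dx, dy)
    return best_dxdy

def is_violation(token_active, movement_direction, light_state):
    """light_state: e.g. {'left_turn': 'red', 'straight': 'green'} for the vehicle's approach."""
    return token_active and light_state.get(movement_direction) == 'red'
```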
4 Experimental Results and Discussion
4.1 Experimental Results
Figure 8 shows some frames extracted from the traffic videos. Figures 9(a) to 9(h) show sequential frames extracted from the video recorded at the same intersection as in Figure 8 from the right-side direction, while Figure 8(c) is a frame extracted from the front-side video. In Figure 9(a), the traffic light for the straight direction has turned green; however, the vehicles turning left from the west to the north direction, such as the bus marked by the red line, are still moving, and several vehicles just behind that bus also keep moving.
Fig. 8. Video frames recorded from different viewpoints: (a) left-side, (b) right-side, (c) front-side direction
Fig. 9. Sequential frames recorded at the same intersection as in Figure 8, from the right side
In Figures 9(c) and 9(d), one can clearly see that some motorcycles, marked by the green line, are moving straight from the north to the south direction because their traffic light has already turned green; however, the left-turning vehicles have not yet finished their left-turn route. Figure 10 shows some frames extracted from the video recorded at the same intersection as in Figures 8 and 9 from the front-side direction.
4.2 Discussion
By inspecting Figures 9(c) to 9(h), one can estimate that some vehicles may have run a red light as they turned left from the west to the north direction. The behavior of those vehicles is quite dangerous and aggressive toward vehicles on the straight route. With only a single camera or sensor, it is difficult to judge and fully comprehend the real traffic situation. Fortunately, in this paper, three cameras record the traffic from three different directions, so the exact traffic situation can be completely established or restored, and the violating vehicles can be cited for enforcement.
Fig. 10. Frames recorded at the same intersection as in Figures 8 and 9, from the front side
To avoid traffic incidents, motion vector processing is adopted in this paper, so the exact route of each vehicle can easily be extracted from the three recorded videos. Constructing a 3D traffic video is much more difficult than constructing 3D images of a static object: because the traffic information is recorded in a time-varying system, the techniques for image compensation and interpolation significantly affect the final performance. Regarding the construction of a 3D traffic model that can be presented from different viewpoints, some preparations should be made, such as the following:
1. Many videos should be taped from different viewpoints focusing on the same object.
2. As much information as possible should be collected to avoid insufficient information; the taping directions must at least include the forward, right-side, and left-side viewpoints.
3. The 3D model of an object can be constructed from fragile segmented images.
4. Compensation techniques must be applied to remedy insufficient information for 3D model construction.
Since the goal is to construct the multi-viewpoint traffic model and 3D video automatically, the preprocessing of image normalization and compensation for the extracted frames is quite important, because the preprocessing results significantly affect the subsequent construction of the 3D video. In addition, proper interpolation is another key factor for optimally presenting the real
traffic environment. Moreover, the decision for image segmentation can be a difficult task in an automatic process, so the quality of preprocessing must be strictly controlled.
5 Conclusion
Decreasing traffic events and preventing traffic accidents are important responsibilities of governments in all nations. Since most heavy traffic bottlenecks occur at main intersections, highways, and freeways, many traffic monitoring or early warning systems are mounted in those important areas; however, many traffic incidents still occur every day, and no agency can provide a system that dramatically cuts down or even stops the occurrence of traffic events or accidents. In order to correctly find the root causes of traffic accidents, completely restore the process by which traffic events occur, and understand the key factors of traffic management, a traffic monitoring and event analysis mechanism based on multiple videos for 3D video processing is introduced in this paper. The proposed scheme consists of three modules: information collection, multi-viewpoint traffic model and 3D video processing, and an expert system. Traffic information is recorded and collected by deploying many video monitors around each intersection of heavy-traffic areas along five directions: three from the front, right-front, and left-front, and two from the rear-left and rear-right. Each vehicle is therefore monitored by many detectors. All collected information is transferred to the 3D processing module to construct the 3D model according to options such as vehicles, directions, and time; the significant features for traffic analysis are also extracted in this module. In addition, the rationale of Petri nets is adopted for vehicular tracking and interactivity analysis. Finally, the expert module is responsible for integrated traffic information and event analysis and for decision making such as early warning or enforcement. Experimental results demonstrate the feasibility and validity of the proposed mechanism. However, constructing a 3D traffic video is much more difficult than constructing 3D images of a static object, because the traffic information is recorded in a time-varying system. To present traffic as a 3D model, future work will focus on improving the image normalization, compensation, and interpolation processes.
References 1. Starck, J., Maki, A., Nobuhara, S., Hilton, A., Matsuyama, T.: The Multiple-Camera 3-D Production Studio. IEEE Transactions on Circuits and Systems for Video Technology 19, 856–869 (2009) 2. Misener, J., et al.: California Intersection Decision Support: A Systems Approach to Achieve Nationally Interoperable Solutions II. California PATH Research Report UCB-ITSPRR-2007-01 (2007)
3. Wang, Y., Zou, Y., Sri, H., Zhao, H.: Video Image Vehicle Detection System for Signaled Traffic Intersection. In: 9th International Conference on Hybrid Intelligent Systems, pp. 222–227 (2009) 4. Ki, Y.K., Lee, D.Y.: A traffic accident recording and reporting model at intersections. IEEE Transactions on Intelligent Transportation Systems 8, 188–194 (2007) 5. Petri Net, http://en.wikipedia.org/wiki/Petri_net (accessed on August 1, 2010) 6. Moezzi, S., Tai, L., Gerard, P.: Virtual view generation for 3d digital video. IEEE Multimedia, 18–26 (1997) 7. Matsuyama, T., Wu, X., Takai, T., Nobuhara, S.: Real-Time Generation and High Fidelity Visualization of 3D Video. In: Proceedings of MIRAG 2003, pp. 1–10 (2003)
Baseball Event Semantic Exploring System Using HMM
Wei-Chin Tsai1, Hua-Tsung Chen1, Hui-Zhen Gu1, Suh-Yin Lee1, and Jen-Yu Yu2
1 Department of Computer Science, National Chiao-Tung University, Hsinchu, Taiwan
{wagin,huatsung,hcku,sylee}@cs.nctu.edu.tw
2 ICL/Industrial Technology Research Institute, Hsinchu, Taiwan
[email protected]
Abstract. Despite considerable research effort in baseball video processing in recent years, little work has been done on detailed semantic baseball event detection. This paper presents an effective and efficient baseball event classification system for broadcast baseball videos. Utilizing the specifications of the baseball field and the regularity of shot transitions, the system recognizes highlights in video clips and identifies which semantic baseball event is proceeding. First, a video is segmented into highlights, each starting with a PC (pitcher and catcher) shot and ending with certain specific shots. Before each baseball event classifier is designed, several novel schemes are applied, including the extraction of specific features such as soil percentage and of objects such as first base. The extracted mid-level cues are used to develop baseball event classifiers based on an HMM (hidden Markov model). Owing to the detection of these specific features, more hitting events are detected, and the simulation results show that the classification of twelve significant baseball events is very promising.
Keywords: Training step, classification step, object detection, play region classification, semantic event, hitting baseball event, hidden Markov model.
1 Introduction
In recent years, the amount of multimedia information has grown rapidly, which has driven the development of efficient sports video analysis. Applications of sports video analysis have been found in almost all sports, among which baseball is a quite popular one. However, a whole game is very long, while the highlights form only a small portion of it. Based on this motivation, the development of a semantic event exploring system for baseball games is our focus. Because camera positions are fixed during a game and the ways of presenting game progress are similar across TV channels, each category of semantic baseball event usually has a similar shot transition. A baseball highlight starts with a PC shot, so PC shot detection [1] plays an important role in baseball semantic event detection. In addition, a baseball semantic event is composed of a sequence of several play regions, so play region classification [2] is also an essential task in baseball event classification. Many methods have been applied to semantic baseball event detection, such as HMM [3][4][5], temporal feature detection [6], BBN (Bayesian Belief
Network) and scoreboard information [7][8]. Chang et al. [3] assume that most highlights in baseball games consist of certain shot types and that these shots have similar transitions in time. Mochizuki et al. [4] provide a baseball indexing method based on patternizing baseball scenes using a set of rectangles with image features and a motion vector. Fleischman et al. [6] record objects or features such as field type, speech, and the start and end times of camera motion to find frequent temporal patterns for highlight classification. Hung et al. [8] combine scoreboard information with a few shot types for event detection. Even though these previous works report good results on highlight classification, they do not address the variety of hitting event types such as left foul ball and ground out. In this paper, we aim at exploring hitting baseball events. Twelve semantic baseball event types are defined and detected in the proposed system: (1) single, (2) double, (3) pop up, (4) fly out, (5) ground out, (6) two-base hit, (7) right foul ball, (8) left foul ball, (9) foul out, (10) double play, (11) home run, and (12) home base out. With the proposed framework, event classification in baseball videos becomes more powerful and practical, since comprehensive, detailed, and explicit events of the game can be presented to users.
2 Proposed Framework
In this paper, a novel framework is proposed for event classification of the batting content in baseball videos. The process is divided into two steps, a training step and a classification step, as shown in Figure 1.
Fig. 1. Overview of the training step and classification step in proposed system
As illustrated in Figure 1, in the training step, indexed baseball clips of each event type listed in Table 2 are input as training data for the corresponding baseball event classifier. In the classification step, when the observation symbol sequence of an unknown clip is input, each highlight classifier evaluates how well its model predicts the given observation sequence. In both steps, with baseball domain knowledge, the spatial patterns of field lines and field objects together with color features are recognized to classify play region types, such as infield left, outfield right, audience, etc. Finally, from each field shot, a symbol sequence describing the transition of play regions is generated as HMM training data or as input data for event classification. Details of the proposed approaches are described in the following sections. Section 3 introduces the color conversion. Section 4 describes object and feature detection. Section 5 describes play region classification. Sections 6 and 7 describe HMM training and the classification of baseball events.
3 Color Conversion
In image processing and analysis, color is an important feature for the proposed object detection and feature extraction (the percentages of grass and soil). However, the colors in the frames of each baseball game may vary because of different viewing angles and lighting conditions. To obtain the color distributions of grass and soil in video frames, several baseball clips containing grass and soil from different video sources are input to produce color histograms in the RGB and HSI color spaces. Figure 2 takes two baseball clips from different sources as examples. Owing to its discriminative power, the hue value in the HSI color space is selected as the color feature, and the dominant color ranges of grass (green) and soil (brown), [Ha1, Hb1] and [Ha2, Hb2], are set.
Fig. 2. The color space of RGB and HSI of two baseball clips
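A small sketch of the hue thresholding described above, using OpenCV's HSV conversion as a stand-in for HSI; the hue ranges below are placeholders for [Ha1, Hb1] and [Ha2, Hb2], which in the paper are set from the training histograms.

```python
import cv2

# Placeholder hue ranges (OpenCV hue is 0-179); the real [Ha1, Hb1], [Ha2, Hb2]
# would come from the grass/soil histograms of the training clips.
GRASS_HUE = (35, 85)   # greenish
SOIL_HUE = (5, 25)     # brownish

def grass_soil_percentages(frame_bgr):
    """Return (grass_ratio, soil_ratio) of a frame based on dominant-hue masks."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    hue = hsv[:, :, 0]
    grass = (hue >= GRASS_HUE[0]) & (hue <= GRASS_HUE[1])
    soil = (hue >= SOIL_HUE[0]) & (hue <= SOIL_HUE[1])
    total = float(hue.size)
    return grass.sum() / total, soil.sum() / total
```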
4 Object Detection
The baseball field has a well-defined layout, as described in Figure 3. Furthermore, the important lines and the bases are white, and the auditorium (AT) has high texture and no dominant color, as shown in Figure 3(b).
Fig. 3. (a) Full view of real baseball field (b) Illustration of objects and features in baseball field
Each object is elaborated as follows.
(1) Back auditorium (AT): the top area with high texture and no dominant color is considered the auditorium; it is marked as the black area above the white horizontal line in Figure 4(a).
(2) Left auditorium (L-AT) and right auditorium (R-AT): the left and right areas with high texture and no dominant color are considered the left and right auditoriums, shown as the left and right black areas delimited by the white vertical lines in Figures 4(b) and 4(c).
Fig. 4. Illustration of (a) back auditorium (b) left auditorium (c) right auditorium
(3) Left line (LL) and right line (RL): a RANSAC algorithm, which estimates the line parameters of line segments [9], is applied to the line pixels to find the left and right lines.
(4) Pitcher's mound (PM): an elliptical soil region surrounded by a grass region is recognized as the pitcher's mound, as marked with a red rectangle in Figure 5.
(5) First base (1B) and third base (3B): a square region located on the right line within the soil region, if detected, is identified as first base, as shown in Figure 5. Similarly, a square region located on the left line within the soil region, if detected, is identified as third base.
(6) Second base (2B): within the soil region, a white square region lying on neither field line is identified as second base, as shown in Figure 5.
(7) Home base (HB): home base is located at the intersection of the left line and the right line, as shown in Figure 5.
Fig. 5. Illustration of the objects 1B, 2B, HB, LL, RL, and PM
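For the line objects in item (3), a compact RANSAC line fit in the spirit of [9] could look like the sketch below; the iteration count and inlier tolerance are illustrative values, and the input is assumed to be the white line pixels already extracted from the field.

```python
import numpy as np

def ransac_line(points, n_iter=200, inlier_tol=2.0, seed=0):
    """Fit a 2D line to (x, y) points with RANSAC; returns (point_on_line, direction, inlier_mask)."""
    rng = np.random.default_rng(seed)
    pts = np.asarray(points, dtype=float)
    best_p, best_d = pts[0], np.array([1.0, 0.0])
    best_inliers = np.zeros(len(pts), dtype=bool)
    for _ in range(n_iter):
        i, j = rng.choice(len(pts), size=2, replace=False)   # minimal sample: two points
        p, q = pts[i], pts[j]
        d = q - p
        norm = np.linalg.norm(d)
        if norm < 1e-6:
            continue
        d = d / norm
        normal = np.array([-d[1], d[0]])
        dist = np.abs((pts - p) @ normal)                    # point-to-line distance
        inliers = dist < inlier_tol
        if inliers.sum() > best_inliers.sum():
            best_p, best_d, best_inliers = p, d, inliers
    return best_p, best_d, best_inliers
```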
5 Play Region Classification
Sixteen play region types are defined and classified based on the positions and percentages of the objects and features described in Sections 3 and 4. The sixteen typical region types are: IL (infield left), IC (infield center), IR (infield right), B1 (first base), B2 (second base), B3 (third base), OL (outfield left), OC (outfield center), OR (outfield right), PS (play in soil), PG (play in grass), AD (audience), RAD (right audience), LAD (left audience), CU (close-up), and TB (touch base), as shown in Figure 6.
Fig. 6. Sixteen typical play region types
The rules of play region type classification are listed in Table 1, modified from [2]. The symbols in the first column are the sixteen defined play region types shown in Figure 6. Wf is the frame width; the function P(Area) returns the percentage of the area Area in a frame, X(Obj) returns the x-coordinate of the center of the field object Obj, and W(Obj) returns true if the object Obj exists. Each play region is classified into one of the sixteen play region types using the rule table. For example, a field frame is classified as the B1 frame type if it meets the following conditions: the percentage of AT is no more than 10%, the object PM does not exist, the objects RL and 1B exist, and the percentage of soil is more than 30%. After play region classification, each frame in a video clip outputs a symbol representing its play region type, so a video clip is represented as a symbol sequence. For example, a video clip of a ground out would output the symbol sequence IL→IC→PS→IR→B1.
Table 1. Rules of play region type classification
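The B1 rule quoted above maps directly to code. In the sketch below, the feature dictionary keys (e.g. P_AT, has_PM) are hypothetical names for the detector outputs, and the rules for the remaining fifteen region types would be filled in from Table 1 in the same style.

```python
def classify_play_region(f):
    """f: per-frame features, e.g. {'P_AT': 0.05, 'P_soil': 0.4, 'has_PM': False,
    'has_RL': True, 'has_1B': True, ...} produced by the object/feature detectors."""
    # Example rule from the text: B1 (first base) frame type.
    if (f['P_AT'] <= 0.10 and not f['has_PM']
            and f['has_RL'] and f['has_1B'] and f['P_soil'] > 0.30):
        return 'B1'
    # ... rules for the other fifteen types (IL, IC, IR, B2, B3, OL, OC, OR,
    # PS, PG, AD, RAD, LAD, CU, TB) would follow Table 1 in the same style.
    return 'CU'   # fallback, e.g. close-up when no field layout is detected

# A clip then becomes a symbol sequence, e.g. ['IL', 'IC', 'PS', 'IR', 'B1'] for a ground out.
```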
6 HMM Training for Baseball Events
One HMM is created for each baseball event to recognize time-sequential observed symbols. In the proposed method, the twelve baseball events listed in Table 2 are defined, so there are twelve HMMs. Given a set of training data for each type of baseball event, we want to estimate the model parameters λ = (A, B, π) that best describe that event. Matrix A contains the transition probabilities between states, matrix B contains the output symbol probabilities of each state, and π contains the initial probabilities of the states. First, the segmental K-means algorithm [10] is used to create an initial HMM parameter set λ, and then the Baum-Welch algorithm [10] is applied to re-estimate the parameters λ̄ = (Ā, B̄, π̄) of each baseball event's HMM.
Table 2. List of twelve baseball events
Single          Right foul ball
Double          Left foul ball
Pop up          Foul out
Fly out         Double play
Ground out      Home run
Two-base out    Home base out
In the proposed method, two features (grass and soil) and the ten objects shown in Figure 3(b) are used as observations, represented as a 1×12 vector recording whether each object appears. To apply an HMM to time-sequential video, the extracted features, represented as a vector sequence, must first be transformed into a symbol sequence via the rule table in Table 1 for later baseball event recognition. Conventional implementation issues of HMMs include (1) the number of states, (2) initialization, and (3) the observation distribution at each state. The number of states is determined empirically and differs for each baseball event. Initialization can be approached by random initialization or by the segmental K-means algorithm [10]. Finally, the observation distribution can be chosen by trying several models, such as the Gaussian model, and selecting the best one; in our approach, we choose the Gaussian distribution. The essential elements are described in detail as follows.
State S: the number of states is selected empirically depending on the baseball event, and each hidden state represents a shot type.
Observation O: the symbol mapped from the rule table.
Observation distribution matrix B: obtained with the K-means algorithm, choosing a Gaussian distribution at each state [10].
Transition probability matrix A: the state transition probabilities, which can be learned by the Baum-Welch algorithm [10].
Initial state probability matrix π: the probability of occurrence of the first state, initialized by the segmental K-means algorithm [10] after the number of states is determined.
After determining the number of states and setting the initial tuple λ, we use the Baum-Welch algorithm [10] to re-estimate the HMM parameters λ̄ so as to maximize the probability of the observation sequence given the model.
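As a simplified, self-contained illustration of one event model λ = (A, B, π) and of scoring a symbol sequence, the sketch below uses a discrete emission matrix over the play-region symbols in place of the per-state Gaussians described above, and it omits the segmental K-means initialization and Baum-Welch re-estimation themselves.

```python
import numpy as np

class DiscreteHMM:
    """One lambda = (A, B, pi) per baseball event, over the play-region symbol alphabet."""
    def __init__(self, n_states, n_symbols, seed=0):
        rng = np.random.default_rng(seed)
        self.pi = np.full(n_states, 1.0 / n_states)                # initial state probabilities
        self.A = rng.dirichlet(np.ones(n_states), size=n_states)   # state transition matrix (rows sum to 1)
        self.B = rng.dirichlet(np.ones(n_symbols), size=n_states)  # per-state symbol probabilities

    def log_likelihood(self, symbols):
        """log P(O | lambda) via the scaled forward algorithm; `symbols` are integer symbol indices."""
        alpha = self.pi * self.B[:, symbols[0]]
        log_prob = 0.0
        for t in range(len(symbols)):
            if t > 0:
                alpha = (alpha @ self.A) * self.B[:, symbols[t]]
            scale = alpha.sum()
            log_prob += np.log(scale)
            alpha = alpha / scale
        return log_prob
```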
7 Baseball Event Classification
The idea behind using HMMs is to construct one model for each baseball event that we want to recognize; HMMs give a state-based representation of each highlight. After training each baseball event model λi, we calculate the probability P(O|λi) of a given unknown symbol sequence O for each highlight model λi, where the index i ranges over the baseball event HMMs. We then recognize the baseball event as the one whose HMM gives the highest probability.
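Continuing the `DiscreteHMM` sketch from Section 6, classification reduces to scoring the unknown clip against all twelve trained models and taking the most likely one; the event names and the `models` dictionary are illustrative.

```python
EVENT_NAMES = ['single', 'double', 'pop up', 'fly out', 'ground out', 'two-base out',
               'right foul ball', 'left foul ball', 'foul out', 'double play',
               'home run', 'home base out']

def classify_event(symbol_sequence, models):
    """models: {event_name: trained DiscreteHMM}; symbol_sequence: list of integer symbol indices.
    Returns the event with the highest P(O | lambda_i)."""
    scores = {name: hmm.log_likelihood(symbol_sequence) for name, hmm in models.items()}
    return max(scores, key=scores.get)
```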
8 Experiments
To test the performance of baseball event classification, we implement a system capable of recognizing twelve different types of baseball events. All video sources are Major League Baseball (MLB) broadcasts. We use 120 baseball clips from three different MLB video sources as training data and 122 baseball clips from two different MLB video sources as test data. Each video source is digitized at 352×240 pixel resolution. The experimental results are shown in Table 3.
Table 3. Recognition of baseball events
Both the precision and recall are about 80% except for the precision of double, double play and the recall of double, two-base out. The low recall rate of baseball
events double and two-base out might result from missed detection of the field object 2B. The low precision of the event double might be because the transitions of double and home run are similar when the batter hits the ball to the audience wall. The low precision of the event double play might be because the transitions of double play and ground out are similar when the batter hits the ball near second base, as shown in Figure 7. Figure 8 shows misclassification between right foul ball and home run due to their similar shot transitions. Figure 9 shows ambiguities inherent in some baseball events, such as ground out and left foul ball, even when judged by people. Figure 10 shows misclassification between single and ground out because the player at first base does not catch the ball and the ball object is not detected.
Fig. 7. Comparison between (a) ground out and (b) double play
Fig. 8. Comparison between (a) right foul ball and (b) home run
Fig. 9. Ambiguity of (a) left foul ball (b) replay of left foul ball
Fig. 10. Ambiguity of ground out and single
The misclassification of highlights can be attributed to four causes: (1) similar shot transitions, (2) missed object detection, (3) insufficient detected objects, and (4) inherent ambiguity. These could be improved by detecting the ball and the players, or by adding additional information such as scoreboard information. Overall, we still achieve good performance.
9 Conclusions
In this paper, a novel framework is proposed for baseball event classification. In the training step, the spatial patterns of the field objects and lines in each field frame of the video clips of each baseball event type are first recognized based on the distributions of dominant colors and white pixels. With baseball domain knowledge, each field frame is classified into one of the sixteen typical play region types using rules on the spatial patterns. After play region classification, the output symbol sequences of each baseball event type are used as training data for the corresponding baseball event HMM. In the classification step, the observation symbol sequence generated by play region classification of a video clip is used as input to each baseball event HMM, and each classifier evaluates how well its model predicts the given symbol sequence. Finally, we recognize the baseball event as the one whose HMM gives the highest probability.
References 1. Kumano, M., Ariki, Y., Tsukada, K., Hamaguchi, S., Kiyose, H.: Automatic Extraction of PC Scenes Based on Feature Mining for a Real Time Delivery System of Baseball Highlight Scenes. In: IEEE international Conference on Multimedia and Expo, vol. 1, pp. 277– 280 (2004) 2. Chen, H.T., Hsiao, M.H., Chen, H.S., Tsai, W.J., Lee, S.Y.: A baseball exploration system using spatial pattern recognition. In: Proc. IEEE International Symposium on Circuits and Systems, pp. 3522–3525 (May 2008) 3. Chang, P., Han, M., Gong, Y.: Extract Highlight From Baseball Game Video With Hidden Markov Models. In: International Conference on Image Processing, vol. 1, pp. 609–612 (2002) 4. Mochizuki, T., Tadenuma, M., Yagi, N.: Baseball Video Indexing Using Patternization Of scenes and Hidden Markov Model. In: IEEE International Conference on Image Processing, vol. 3, pp. III -1212-15 (2005) 5. Bach, H., Shinoda, K., Furui, S.: Robust Highlight Extraction Using Multi-stream Hidden Markov Model For Baseball Video. In: IEEE International Conference on Image Processing, vol. 3, pp. III- 173-6 (2005)
6. Fleischman, M., Roy, B., Roy, D.: Temporal Feature Induction for Baseball Highlight Classification. In: Proc. ACM Multimedia Conference, pp. 333–336 (2007) 7. Hung, H., Hsieh, C.H., Kuo, C.M.: Rule-based Event Detection of Broadcast Baseball Videos Using Mid-level Cues. In: Proceedings of IEEE International Conference on Innovative Computing Information and Control, pp. 240–244 (2007) 8. Hung, H., Hsieh, C.H.: Event Detection of Broadcast Baseball Videos. IEEE Trans. on Circuits and Systems for Video Technology 18(12), 1713–1726 (2008) 9. Farin, D., Han, J., Peter, H.N.: Fast Camera Calibration for the Analysis of Sport Sequences. In: IEEE International Conference on Multimedia & Expo, pp. 482–485 (2005) 10. Rabiner, R., Juang, B.H.: An Introduction to Hidden Markov Models. IEEE Signal Processing Magazine 3(1), 4–16 (1986)
Robust Face Recognition under Different Facial Expressions, Illumination Variations and Partial Occlusions
Shih-Ming Huang and Jar-Ferr Yang
Institute of Computer and Communication Engineering, Department of Electrical Engineering, National Cheng Kung University, Tainan, 70101, Taiwan
[email protected], [email protected]
Abstract. In this paper, a robust face recognition system is presented that can perform precise face recognition under facial expression variations, illumination changes, and partial occlusions. An embedded hidden Markov model based face classifier is applied for identity recognition, in which the proposed observation extraction performs local binary patterns before applying a delta operation on the discrete cosine transform coefficients of consecutive blocks. Experimental results show that the proposed face recognition system achieves high recognition accuracies of 99%, 96.6%, and 98% under neutral faces, expression variations, and illumination changes, respectively. In particular, under partial occlusions, the system achieves recognition rates of 81.6% and 86.6% for wearing sunglasses and a scarf, respectively.
Keywords: Robust face recognition, delta discrete cosine transform coefficient, local binary pattern, embedded hidden Markov model.
1 Introduction
Face recognition [1] aims to distinguish a specific identity among unknown subjects characterized by face images. In realistic situations, such as video surveillance applications, face recognition may encounter great challenges such as different facial expressions, illumination variations, and even partial occlusions, which can make face recognition systems unreliable [2-3]. In particular, under partial occlusion, portions of the face may be covered or modified, for instance by a pair of sunglasses or a scarf. Partial occlusion can lead to a significant deterioration in face recognition performance because some facial features, e.g. the eyes or mouth, disappear. Thus, a robust face recognition system that achieves correct identification not only under different facial expressions and illumination variations but also under partial occlusions is necessary for practical face recognition applications with reliable performance. In the literature, numerous approaches have been proposed to achieve face recognition successfully under certain conditions. Generally, they can be categorized into holistic and local feature approaches. In holistic approaches, one observation signal describes the entire face. In local feature approaches, a face is represented by a set of
observation vectors, each describing a part of the face. Conventionally, appearance-based approaches [4-6], including principal component analysis (PCA), linear discriminant analysis (LDA), and two-dimensional PCA (2D-PCA), are popular for face recognition. Additionally, many other methods, including hidden Markov model (HMM) [7] based approaches [8-13], have been introduced. The HMM is a stochastic modeling technique that has been widely and effectively used in speech recognition [7] and handwritten word recognition [14]. Samaria and Young [8] first introduced a luminance-based 1D-HMM to solve face recognition problems. Nefian and Hayes III [9] then introduced 1D-HMMs with 2D-DCT coefficients to reduce the computational cost; they further proposed the embedded HMM (EHMM) and the embedded Bayesian network (EBN) to successfully enhance face recognition performance [10-12]. Furthermore, a low-complexity 2D HMM (LC-2D HMM) [13] has been presented to process face images in a 2D manner without forming the observation signals into 1D sequences. Thus, we aim at developing an HMM-based face recognition system that can successfully achieve identity recognition under facial expression variations, illumination changes, and even partial occlusions. To resolve partial occlusion problems, several recent approaches [15-16] have shown excellent results on the AR face database [17]. A local probabilistic approach proposed by Martinez [15] analyzes separated local regions in isolation. Inspired by compressive sensing, Wright et al. proposed sparse representation-based classification (SRC), which exploits the sparse nature of occluded images for face recognition [16].
Face Image → Local Binary Pattern → Sliding Block → 2D-DCT → Delta → Embedded HMM → Identity Recognition
Fig. 1. The Proposed Robust Face Recognition System based on embedded HMM classifier
In this study, we present a robust face recognition system based on an embedded HMM with the proposed robust observation vectors, as depicted in Fig. 1. First, the input face image is processed by the local binary pattern (LBP) [18] operation. Next, the LBP image is transformed block by block, in a sliding-block manner, by the 2D-DCT to obtain the DCT coefficients most commonly used in HMM-based face recognition [9, 19]. We then suggest performing a delta operation to construct delta DCT observation vectors from consecutive blocks [20]. Finally, the extracted observation vectors are used for model training and testing in the EHMM classifier. To explain the whole process and the recognition performance of the proposed system in detail, the rest of the paper is organized as follows. Section 2 introduces the observation extraction method. The embedded HMM is introduced in Section 3. Section 4 presents the experimental results and discussion. Finally, conclusions are drawn in Section 5.
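A rough sketch of the observation-vector extraction in Fig. 1 is given below. The block size, scan step, number of retained DCT coefficients, and the use of scikit-image's LBP are assumptions made for illustration; the delta is taken as the plain difference of DCT coefficient vectors between consecutive blocks, as suggested by the abstract.

```python
import numpy as np
from scipy.fftpack import dct
from skimage.feature import local_binary_pattern

def extract_observations(face_gray, block=(8, 8), step=4, n_coeffs=10):
    """LBP preprocessing -> sliding-block 2D-DCT -> delta between consecutive blocks."""
    lbp = local_binary_pattern(face_gray, P=8, R=1)          # LBP image (8 neighbours, radius 1)
    h, w = lbp.shape
    bh, bw = block
    coeffs = []
    for y in range(0, h - bh + 1, step):                     # sliding blocks in raster-scan order
        for x in range(0, w - bw + 1, step):
            blk = lbp[y:y + bh, x:x + bw]
            c = dct(dct(blk, axis=0, norm='ortho'), axis=1, norm='ortho')   # 2D-DCT of the block
            coeffs.append(c.flatten()[:n_coeffs])            # keep a few low-order coefficients
    coeffs = np.asarray(coeffs)
    return coeffs[1:] - coeffs[:-1]                          # delta DCT observation vectors
```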
2 Observation Vector Extraction
2.1 Local Binary Patterns (LBP)
Illumination variation is one of the major challenges in realistic face recognition applications. In the literature, to eliminate the impact of lighting, Heusch et al. [21] suggested using the local binary pattern (LBP) [18, 22] as image preprocessing for face authentication. The LBP achieves better performance than the histogram equalization (HE) used by Wang et al. [23] and the illumination normalization approach (GROSS) proposed by Gross and Brajovic [24]. Accordingly, in our study, we also adopt the LBP as a preprocessing step prior to generating observation vectors.
Fig. 2. Illustration of LBP operation flow
The LBP is a local texture descriptor that is computationally simple. Moreover, it is invariant to monotonic grayscale transformations, so the LBP representation is less sensitive to illumination variations. As illustrated in Fig. 2, a typical LBP computation takes the pixels in a 3 × 3 block, in which the center pixel is used as the reference: a binary 1 is generated if a neighbor is greater than or equal to the reference, and a binary 0 otherwise. The eight neighbors of the reference can then be represented by an 8-bit unsigned integer; in other words, an element-wise multiplication is done with a weighting block. The LBP value for the pixel (x, y) is calculated as follows:
LBP(x, y) = \sum_{n=0}^{7} s(i_n - i_c) \cdot 2^n    (1)
where i_c is the grey value of the center pixel of the 3 × 3 block, i_n is the grey value of the n-th surrounding pixel, and the function s(x) is defined as:
s(x) = \begin{cases} 0, & \text{if } x < 0 \\ 1, & \text{if } x \ge 0 \end{cases}
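For reference, equation (1) for a single pixel can be computed directly as in the sketch below; the clockwise ordering of the eight neighbours (and hence the assignment of the bit weights 2^n) is an assumption, since the actual ordering is fixed by the weighting block in Fig. 2.

```python
import numpy as np

def lbp_pixel(img, x, y):
    """LBP value of pixel (x, y) of a 2D numpy array of grey values, per equation (1)."""
    ic = int(img[y, x])
    # Eight neighbours of the reference pixel; the ordering is a convention.
    neighbours = [img[y - 1, x - 1], img[y - 1, x], img[y - 1, x + 1],
                  img[y,     x + 1], img[y + 1, x + 1], img[y + 1, x],
                  img[y + 1, x - 1], img[y,     x - 1]]
    # s(i_n - i_c) = 1 when the neighbour is >= the centre, weighted by 2^n.
    return sum(int(int(i_n) - ic >= 0) << n for n, i_n in enumerate(neighbours))
```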