318 63 16MB
English Pages XIV, 678 [691] Year 2020
LNCS 12273
Donghyun Kim R. N. Uma Zhipeng Cai Dong Hoon Lee (Eds.)
Computing and Combinatorics 26th International Conference, COCOON 2020 Atlanta, GA, USA, August 29–31, 2020 Proceedings
Lecture Notes in Computer Science Founding Editors Gerhard Goos Karlsruhe Institute of Technology, Karlsruhe, Germany Juris Hartmanis Cornell University, Ithaca, NY, USA
Editorial Board Members Elisa Bertino Purdue University, West Lafayette, IN, USA Wen Gao Peking University, Beijing, China Bernhard Steffen TU Dortmund University, Dortmund, Germany Gerhard Woeginger RWTH Aachen, Aachen, Germany Moti Yung Columbia University, New York, NY, USA
12273
More information about this series at http://www.springer.com/series/7407
Donghyun Kim R. N. Uma Zhipeng Cai Dong Hoon Lee (Eds.) •
•
•
Computing and Combinatorics 26th International Conference, COCOON 2020 Atlanta, GA, USA, August 29–31, 2020 Proceedings
123
Editors Donghyun Kim Georgia State University Atlanta, USA Zhipeng Cai Computer Science Georgia State University Atlanta, GA, USA
R. N. Uma Department of Mathematics and Physics North Carolina Central University Durham, NC, USA Dong Hoon Lee Korea University Seoul, South Korea
ISSN 0302-9743 ISSN 1611-3349 (electronic) Lecture Notes in Computer Science ISBN 978-3-030-58149-7 ISBN 978-3-030-58150-3 (eBook) https://doi.org/10.1007/978-3-030-58150-3 LNCS Sublibrary: SL1 – Theoretical Computer Science and General Issues © Springer Nature Switzerland AG 2020 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, expressed or implied, with respect to the material contained herein or for any errors or omissions that may have been made. The publisher remains neutral with regard to jurisdictional claims in published maps and institutional affiliations. This Springer imprint is published by the registered company Springer Nature Switzerland AG The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
Preface
The 26th International Computing and Combinatorics Conference (COCOON 2020) was held during August 29–31, 2020. This year, COCOON 2020 was organized as a fully online conference. COCOON 2020 provided an excellent venue for researchers working in the area of algorithms, theory of computation, computational complexity, and combinatorics related to computing. The technical program of the conference included 54 regular papers selected by the Program Committee from 126 full submissions received in response to the call for papers. Each submission was carefully reviewed by Program Committee members and/or external reviewers. Some of the papers were selected for publication in special issues of Theoretical Computer Science and the Journal of Combinatorial Optimization. It is expected that the journal version of the papers will appear in a more complete form. We thank everyone who made this meeting possible: the authors for submitting papers, the Program Committee members, and external reviewers for volunteering their time to review conference papers. We also appreciate the financial sponsorship from Springer for the Best Paper Award. We would also like to extend special thanks to the chairs and conference Organizing Committee for their work in making COCOON 2020 a successful event. August 2020
Donghyun Kim R. N. Uma Zhipeng Cai Dong Hoon Lee
Organization
Steering Committee Members Lian Li (Chair) Zhipeng Cai Xiaoming Sun My Thai Cong Tian Jianping Yin Guochuan Zhang Zhao Zhang
Hefei University of Technology, China Georgia State University, USA Chinese Academy of Sciences, China University of Florida, USA Xidian University, China Dongguan University of Technology, China Zhejiang University, China Zhejiang Normal University, China
General Chairs Zhipeng Cai Dong Hoon Lee
Georgia State University, USA Korea University, South Korea
Program Chairs Donghyun Kim R. N. Uma
Georgia State University, USA North Carolina Central University, USA
Publication Chairs Yan Huang Yingshu Li
Kennesaw State University, USA Georgia State University, USA
Web/Submission System Chairs Yongmin Ju Yi Liang Vannel Zeufack
Georgia State University, USA Georgia State University, USA Kennesaw State University, USA
Financial Chairs Yongmin Ju Daehee Seo
Georgia State University, USA Sangmyung University, South Korea
viii
Organization
Local Organization Chairs Jeongsu Park Daehee Seo
Korea University, South Korea Sangmyung University, South Korea
Program Committee Hans-Joachim Boeckenhauer Zhi-Zhong Chen Rajesh Chitnis Ovidiu Daescu Bhaskar Dasgupta Thang Dinh Stanley Fung Xiaofeng Gao Anirban Ghosh Raffaele Giancarlo Yan Huang Michael Khachay Sung-Sik Kwon Daniel Lee Joong-Lyul Lee Shuai Cheng Li Wei Li Xianyue Li Bei Liu Bodo Manthey Manki Min Rashika Mishra N. S. Narayanaswamy Nam Nguyen Pavel Skums Xiaoming Sun Ryuhei Uehara Jianxin Wang Wei Wang Gerhard J. Woeginger Wen Xu Boting Yang Chee Yap
ETH Zurich, Switzerland Tokyo Denki University, Japan University of Birmingham, UK The University of Texas at Dallas, USA University of Illinois at Chicago, USA Virginia Commonwealth University, USA University of Leicester, UK Shanghai Jiao Tong University, China University of North Florida, USA Universita di Palermo, Italy Kennesaw State University, USA Krasovsky Institute of Mathematics and Mechanics, Russia North Carolina Central University, USA Simon Fraser University, Canada The University of North Carolina at Pembroke, USA City University of Hong Kong, Hong Kong Georgia State University, USA Lanzhou University, China Northwest University, USA University of Twente, The Netherlands Louisiana Tech University, USA Lenovo, USA IIT Madras, India Towson University, USA Georgia State University, USA Institute of Computing Technology, Chinese Academy of Sciences, China Japan Advanced Institute of Science and Technology, Japan Central South University, China Xi’an Jiaotong University, China RWTH Aachen University, Germany Texas Woman’s University, USA University of Regina, Canada New York University, USA
Organization
Peng Zhang Zhao Zhang Yi Zhu Yuqing Zhu Martin Ziegler
Shandong University, China Zhejiang Normal University, China Hawaii Pacific University, USA California State University Los Angeles, USA KAIST, South Korea
ix
Contents
Subspace Approximation with Outliers . . . . . . . . . . . . . . . . . . . . . . . . . . . Amit Deshpande and Rameshwar Pratap
1
Linear-Time Algorithms for Eliminating Claws in Graphs . . . . . . . . . . . . . . Flavia Bonomo-Braberman, Julliano R. Nascimento, Fabiano S. Oliveira, Uéverton S. Souza, and Jayme L. Szwarcfiter
14
A New Lower Bound for the Eternal Vertex Cover Number of Graphs . . . . . Jasine Babu and Veena Prabhakaran
27
Bounded-Degree Spanners in the Presence of Polygonal Obstacles . . . . . . . . André van Renssen and Gladys Wong
40
End-Vertices of AT-free Bigraphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jan Gorzny and Jing Huang
52
Approaching Optimal Duplicate Detection in a Sliding Window . . . . . . . . . . Rémi Géraud-Stewart, Marius Lombard-Platet, and David Naccache
64
Computational Complexity Characterization of Protecting Elections from Bribery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lin Chen, Ahmed Sunny, Lei Xu, Shouhuai Xu, Zhimin Gao, Yang Lu, Weidong Shi, and Nolan Shah
85
Coding with Noiseless Feedback over the Z-Channel. . . . . . . . . . . . . . . . . . Christian Deppe, Vladimir Lebedev, Georg Maringer, and Nikita Polyanskii
98
Path-Monotonic Upward Drawings of Graphs . . . . . . . . . . . . . . . . . . . . . . . Seok-Hee Hong and Hiroshi Nagamochi
110
Seamless Interpolation Between Contraction Hierarchies and Hub Labels for Fast and Space-Efficient Shortest Path Queries in Road Networks . . . . . . Stefan Funke
123
Visibility Polygon Queries Among Dynamic Polygonal Obstacles in Plane. . . Sanjana Agrawal and R. Inkulu
136
How Hard Is Completeness Reasoning for Conjunctive Queries? . . . . . . . . . Xianmin Liu, Jianzhong Li, and Yingshu Li
149
Imbalance Parameterized by Twin Cover Revisited . . . . . . . . . . . . . . . . . . . Neeldhara Misra and Harshil Mittal
162
xii
Contents
Local Routing in a Tree Metric 1-Spanner . . . . . . . . . . . . . . . . . . . . . . . . . Milutin Brankovic, Joachim Gudmundsson, and André van Renssen
174
Deep Specification Mining with Attention . . . . . . . . . . . . . . . . . . . . . . . . . Zhi Cao and Nan Zhang
186
Constructing Independent Spanning Trees in Alternating Group Networks . . . Jie-Fu Huang and Sun-Yuan Hsieh
198
W[1]-Hardness of the k-Center Problem Parameterized by the Skeleton Dimension. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Johannes Blum
210
An Optimal Lower Bound for Hierarchical Universal Solutions for TSP on the Plane . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Patrick Eades and Julián Mestre
222
Quantum Speedup for the Minimum Steiner Tree Problem . . . . . . . . . . . . . . Masayuki Miyamoto, Masakazu Iwamura, Koichi Kise, and François Le Gall Access Structure Hiding Secret Sharing from Novel Set Systems and Vector Families . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Vipin Singh Sehrawat and Yvo Desmedt
234
246
Approximation Algorithms for Car-Sharing Problems . . . . . . . . . . . . . . . . . Kelin Luo and Frits C. R. Spieksma
262
Realization Problems on Reachability Sequences. . . . . . . . . . . . . . . . . . . . . Matthew Dippel, Ravi Sundaram, and Akshar Varma
274
Power of Decision Trees with Monotone Queries . . . . . . . . . . . . . . . . . . . . Prashanth Amireddy, Sai Jayasurya, and Jayalal Sarma
287
Computing a Maximum Clique in Geometric Superclasses of Disk Graphs. . . Nicolas Grelier
299
Shortest Watchman Tours in Simple Polygons Under Rotated Monotone Visibility. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Bengt J. Nilsson, David Orden, Leonidas Palios, Carlos Seara, and Paweł Żyliński Tight Approximation for the Minimum Bottleneck Generalized Matching Problem. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Julián Mestre and Nicolás E. Stier Moses Graph Classes and Approximability of the Happy Set Problem . . . . . . . . . . . Yuichi Asahiro, Hiroshi Eto, Tesshu Hanaka, Guohui Lin, Eiji Miyano, and Ippei Terabaru
311
324 335
Contents
xiii
A Simple Primal-Dual Approximation Algorithm for 2-Edge-Connected Spanning Subgraphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stephan Beyer, Markus Chimani, and Joachim Spoerhase
347
Uniqueness of DP-Nash Subgraphs and D-sets in Weighted Graphs of Netflix Games . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Gregory Gutin, Philip R. Neary, and Anders Yeo
360
On the Enumeration of Minimal Non-pairwise Compatibility Graphs . . . . . . . Naveed Ahmed Azam, Aleksandar Shurbevski, and Hiroshi Nagamochi
372
Constructing Tree Decompositions of Graphs with Bounded Gonality . . . . . . Hans L. Bodlaender, Josse van Dobben de Bruyn, Dion Gijswijt, and Harry Smit
384
Election Control Through Social Influence with Unknown Preferences . . . . . Mohammad Abouei Mehrizi, Federico Corò, Emilio Cruciani, and Gianlorenzo D’Angelo
397
k-Critical Graphs in P5 -Free Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kathie Cameron, Jan Goedgebeur, Shenwei Huang, and Yongtang Shi
411
New Symmetry-less ILP Formulation for the Classical One Dimensional Bin-Packing Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Khadija Hadj Salem and Yann Kieffer
423
On the Area Requirements of Planar Greedy Drawings of Triconnected Planar Graphs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Giordano Da Lozzo, Anthony D’Angelo, and Fabrizio Frati
435
On the Restricted 1-Steiner Tree Problem. . . . . . . . . . . . . . . . . . . . . . . . . . Prosenjit Bose, Anthony D’Angelo, and Stephane Durocher Computational Complexity of Synchronization Under Regular Commutative Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Stefan Hoffmann
448
460
Approximation Algorithms for General Cluster Routing Problem . . . . . . . . . Xiaoyan Zhang, Donglei Du, Gregory Gutin, Qiaoxia Ming, and Jian Sun
472
Hardness of Sparse Sets and Minimal Circuit Size Problem . . . . . . . . . . . . . Bin Fu
484
Succinct Monotone Circuit Certification: Planarity and Parameterized Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Mateus Rodrigues Alves, Mateus de Oliveira Oliveira, Janio Carlos Nascimento Silva, and Uéverton dos Santos Souza
496
xiv
Contents
On Measures of Space over Real and Complex Numbers . . . . . . . . . . . . . . . Om Prakash and B. V. Raghavendra Rao Parallelized Maximization of Nonsubmodular Function Subject to a Cardinality Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hongxiang Zhang, Dachuan Xu, Longkun Guo, and Jingjing Tan An Improved Bregman k-means++ Algorithm via Local Search . . . . . . . . . . Xiaoyun Tian, Dachuan Xu, Longkun Guo, and Dan Wu Approximating Maximum Acyclic Matchings by Greedy and Local Search Strategies. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Julien Baste, Maximilian Fürst, and Dieter Rautenbach
508
520 532
542
On the Complexity of Directed Intersection Representation of DAGs. . . . . . . Andrea Caucchiolo and Ferdinando Cicalese
554
On the Mystery of Negations in Circuits: Structure vs Power . . . . . . . . . . . . Prashanth Amireddy, Sai Jayasurya, and Jayalal Sarma
566
Even Better Fixed-Parameter Algorithms for Bicluster Editing . . . . . . . . . . . Manuel Lafond
578
Approximate Set Union via Approximate Randomization . . . . . . . . . . . . . . . Bin Fu, Pengfei Gu, and Yuming Zhao
591
A Non-Extendibility Certificate for Submodularity and Applications . . . . . . . Umang Bhaskar and Gunjan Kumar
603
Parameterized Complexity of MAXIMUM EDGE COLORABLE SUBGRAPH . . . . . . . . Akanksha Agrawal, Madhumita Kundu, Abhishek Sahu, Saket Saurabh, and Prafullkumar Tale
615
Approximation Algorithms for the Lower-Bounded k-Median and Its Generalizations. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Lu Han, Chunlin Hao, Chenchen Wu, and Zhenning Zhang A Survey for Conditional Diagnosability of Alternating Group Networks . . . . Nai-Wen Chang and Sun-Yuan Hsieh Fixed Parameter Tractability of Graph Deletion Problems over Data Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Arijit Bishnu, Arijit Ghosh, Sudeshna Kolay, Gopinath Mishra, and Saket Saurabh
627 640
652
Mixing of Markov Chains for Independent Sets on Chordal Graphs with Bounded Separators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Ivona Bezáková and Wenbo Sun
664
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
677
Subspace Approximation with Outliers Amit Deshpande1 and Rameshwar Pratap2(B) 1
Microsoft Research, Bengaluru, India [email protected] 2 IIT Mandi, Mandi, India [email protected]
Abstract. The subspace approximation problem with outliers, for given n points in d dimensions x1 , x2 , . . . , xn ∈ Rd , an integer 1 ≤ k ≤ d, and an outlier parameter 0 ≤ α ≤ 1, is to find a k-dimensional linear subspace of Rd that minimizes the sum of squared distances to its nearest (1 − α)n points. More generally, the p subspace approximation problem with outliers minimizes the sum of p-th powers of distances instead of the sum of squared distances. Even the case of p = 2 or robust PCA is non-trivial, and previous work requires additional assumptions on the input or generative models for it. Any multiplicative approximation algorithm for the subspace approximation problem with outliers must solve the robust subspace recovery problem, a special case in which the (1 − α)n inliers in the optimal solution are promised to lie exactly on a k-dimensional linear subspace. However, robust subspace recovery is Small Set Expansion (SSE)-hard, and known algorithmic results for robust subspace recovery require strong assumptions on the input, e.g., any d outliers must be linearly independent. In this paper, we show how to extend dimension reduction techniques and bi-criteria approximations based on sampling and coresets to the problem of subspace approximation with outliers. To get around the SSE-hardness of robust subspace recovery, we assume that the squared distance error of the optimal k-dimensional subspace summed over the optimal (1 − α)n inliers is at least δ times its squared-error summed over all n points, for some 0 < δ ≤ 1 − α. Under this assumption, we give an efficient algorithm to find a weak coreset or a subset of poly(k/) log(1/δ) log log(1/δ) points whose span contains a kdimensional subspace that gives a multiplicative (1+)-approximation to the optimal solution. The running time of our algorithm is linear in n and d. Interestingly, our results hold even when the fraction of outliers α is large, as long as the obvious condition 0 < δ ≤ 1−α is satisfied. We show similar results for subspace approximation with p error or more general M-estimator loss functions, and also give an additive approximation for the affine subspace approximation problem.
Full version of paper is available here https://arxiv.org/pdf/2006.16573.pdf. c Springer Nature Switzerland AG 2020 D. Kim et al. (Eds.): COCOON 2020, LNCS 12273, pp. 1–13, 2020. https://doi.org/10.1007/978-3-030-58150-3_1
2
1
A. Deshpande and R. Pratap
Introduction
Finding low-dimensional representations of large, high-dimensional input data is an important first step for several problems in computational geometry, data mining, machine learning, and statistics. For given input points x1 , x2 , . . . , xn ∈ Rd , a positive integer 1 ≤ k ≤ d and 1 ≤ p < ∞, the p subspace approximation problem asks to find a k-dimensional linear subspace V of Rd that essentially minimizes the sum of p-th powers of the distances of all the points to the subspace V , or to be precise, it minimizes the p error 1/p n n p d(xi , V ) or equivalently d(xi , V )p . i=1
i=1
For p = 2, the optimal subspace is spanned by the top k right singular vectors of the matrix X ∈ Rn×d formed by x1 , x2 , . . . , xn as its rows. The optimal solution for p = 2 canbe computed efficiently by the Singular Value Decomposition (SVD) in time O min{nd2 , n2 d} . Liberty’s deterministic matrix sketching [15] and subsequent work [9] provide a faster, deterministic algorithm that runs in O (nd · poly(k/)) time and gives a multiplicative (1 + )-approximation to the optimum. There is also a long line of work on randomized algorithms [17,19] that sample a subset of points and output a subspace from their span, giving a multiplicative (1 + )-approximation in running time O(nnz(X)) + (n + d)poly(k/), where nnz(X) is the number of non-zero entries in X. These are especially useful on sparse data. For p = 2, unlike the p = 2 case, we do not know any simple description of the optimal subspace. For any p ≥ 1, Shyamalkumar and Varadarajan [18]give a (1+ )-approximation algorithm that runs in time O nd · exp((k/)O(p) ) . Building upon this, Deshpande and Varadarajan [4] give a bi-criteria (1+)-approximation by finding a subset of s = (k/)O(p) points in time O (nd · poly(k/)) such that their s-dimensional linear span gives a (1 + )-approximation to the optimal k-dimensional subspace. The subset they find is basically a weak coreset, and projecting onto its span also gives dimension-reduction result for subspace approximation. Feldman et al. [6] improve the running time to nd · poly(k/) + (n + d) · exp(poly(k/)) for p = 1. Feldman and Langberg [5] extend this result to achieve a running time of nd · poly(k/) + exp((k/)O(p) ) for any p ≥ 1. Clarkson and Woodruff [3] improve this running time to O(nnz(X) + (n + d) · poly(k/) + exp(poly(k/))) for any p ∈ [1, 2). The case p ∈ [1, 2), especially p = 1, is important because the 1 error (i.e., the sum of distances) is more robust to outliers than the 2 error (i.e., the sum of squared distances). We consider the following variant of p subspace approximation in the presence of outliers. Given points x1 , x2 , . . . , xn ∈ Rd , an integer 1 ≤ k ≤ d, 1 ≤ p < ∞, and an outlier parameter 0 ≤ α ≤ 1, find a k-dimensional linear subspace V that minimizes the sum of p-th powers of distances of the (1 − α)n points nearest to it. In other words, let Nα (V ) ⊆ [n] consist of the indices of the points to V among x1 , x2 , . . . , xn . We want to minimize nearest (1 − α)n p i∈Nα (V ) d(xi , V ) .
Subspace Approximation with Outliers
3
The robust subspace recovery problem is a special case in which the optimal error for the subspace approximation problem with outliers is promised to be zero, that is, the optimal subspace V is promised to go through some (1 − α)n points among x1 , x2 , . . . , xn . Thus, any multiplicative approximation must also have zero error and recover the optimal subspace. Khachiyan [11] proved that it is NP-hard to find a (d − 1)-dimensional subspace that contains at least (1 − )(1 − 1/d)n points. Hardt and Moitra [10] study robust subspace recovery and define an (, δ)-Gap-Inlier problem of distinguishing between these two cases: (a) there exists a subspace of dimension δn containing (1 − )δn points and (b) every subspace of dimension δn contains at most δn points. They show a polynomial time reduction from the (, δ)-Gap-Small-Subset-Expansion problem to the (, δ)-Gap-Inlier problem. For more on Small Set Expansion conjecture and its connections to Unique Games, please see [16]. Under a strong assumption on the data (that requires any d or fewer outliers to be linearly independent), Hardt and Moitra give an efficient algorithms for finding k-dimensional subspace containing (1 − k/d)n points. This naturally leaves open the question of finding other more reasonable approximations to the subspace approximation problem with outliers. In recent independent work, Bhaskara and Kumar (see Theorem 12 in [1]) showed that if (, δ)-Gap-Small-Subset-Expansion problem is NP-hard, then there exists an instance of subspace approximation with outliers where the optimal inliers lie on a k-dimensional subspace but it is NP-hard to find even a sub√ space of dimension O(k/ ) that contains all but (1 + δ/4) times more points than the optimal number of outliers. This showed that even bi-criteria approximation for subspace recovery is a challenging problem. We compare and contrast our results with the result of Bhaskara and Kumar [1]. Their algorithm throws more outlier than the optimal solution, while we don’t throw any extra outlier. Also, their bi-criteria approximation depends on the “rank-k condition” number which is a somewhat stronger assumption than ours. The problem of clustering using points and lines in the presence of outliers has been studied in special cases of k-median and k-means clustering [2,13], and points and line clustering [7]. Krishnaswamy et al. [13] give a constant factor approximation for k-median and k-means clustering with outliers, whereas Feldman and Schulman give (1 + )-approximations for k-median with outliers and k-line median with outliers that run in time linear in n and d. Another recent line of research on robust regression considers data coming from an underlying distribution where a fraction of it is arbitrarily corrupted [12,14]. The problem we study is different as we do not assume any generative model for the input.
2
Our Contributions
– We assume that the p error of the optimal subspace summed over the optimal (1 − α)n inliers is at least δ times its total p error summed over all n points, for some δ > 0. Under this assumption, we give an algorithm to efficiently find
4
–
–
–
– –
3
A. Deshpande and R. Pratap
a subset of poly(pk/) · log(1/δ) log log(1/δ) points from x1 , x2 , . . . , xn such that the span of this subset contains a k-dimensional linear subspace whose p error over its nearest (1 − α)n points is within (1 + ) of the optimum. The running time of our algorithm is linear in n and d. Note that even for δ as small as 1/poly(n), our algorithm outputs a fairly small subset of poly(pk/) · log n log log n points. The running time of our sampling-based algorithm is linear in n and d. Alternatively, the entire span of the above subset is a linear subspace of dimension poly(pk/) · log(1/δ) log log(1/δ) that gives a bi-criteria multiplicative (1 + )-approximation to the optimal k-dimensional solution to the p subspace approximation problem with outliers. Interestingly, this holds even when the fraction of outliers α is large, as long as the obvious condition 0 < δ ≤ 1 − α is satisfied. Our assumption that the p error of the optimal subspace summed over the optimal (1 − α)n inliers is at least δ times its total p error summed over all n points, for some δ > 0, is more reasonable and realistic than the assumptions used in previous work on subspace approximation with outliers. Without this assumption, our problem (even its special case of subspace recovery) is known to be Small Set Expansion (SSE)-hard [10]. The technical contribution of our work is in showing that the sampling-based weak coreset constructions and dimension reduction results for the subspace approximation problem without outliers [4] also extend to its robust version for data with outliers. If we know the inlier-outlier partition of the data, then the result of [4] can easily be extended for the outlier version of the problem. However, if we don’t know such a partitioning, then a brute-force n subsets and picks the best solution. approach has to go over all (1−α)n This is certainly not an efficient approach as the number of such subsets is exponential in n. Further, on inputs that satisfy our assumption (stated above), it is easy to see that solving the subspace approximation problem without outliers gives a multiplicative 1/δ-approximation to the subspace approximation problem with outliers. Our contribution lies in showing that this approximation guarantee can be improved significantly in a small number of additional sampling steps. We show immediate extensions of our results to more general M-estimator loss functions as previously considered by [3]. We show that our multiplicative approximation for the linear subspace approximation problem under 2 error implies an additive approximation for the affine subspace approximation problem under 2 error. The running time of this algorithm is also linear in n and d.
Warm-Up: Least Squared Error Line Approximation with Outliers
As a warm-up towards the main proof, we first consider the case k = 1 and p = 2, that is, for a given 0 < α < 1, we want to find the best line that minimizes
Subspace Approximation with Outliers
5
the sum of squared distances summed over its nearest (1 − α)n points. Let x1 , x2 , . . . , xn ∈ Rd be the given points and let l∗ be the optimal line. Let I ⊆ [n] consist of the indices of the nearest (1 − α)n points to l∗ among x1 , x2 , . . . , xn . Our algorithm iteratively builds a subset S ⊆ [n] by starting from S = ∅ and in each step samples with replacement poly(k/) i.i.d. points where each point xi is picked with probability proportional to its squared distance to the span of the current subset d(xi , span (S))2 . We abuse the notation as span (S) to denote the linear subspace spanned by {xi : i ∈ S}. These sampled points are added to S and the sampling algorithm is repeated poly(k/) times. 3.1
Additive Approximation
We are looking for a small subset S ⊆ [n] of size poly(1/) that contains a close additive approximation to the optimal subspace over the optimal inliers, that is, for the projection of l∗ onto span (S) denoted by PS (l∗ ),
d(xi , PS (l∗ ))2 ≤
i∈I
d(xi , l∗ )2 +
n
2
xi
(1)
i=1
i∈I
This immediately implies that there exists a line lS in span (S) such that
d(xi , lS )2 ≤
i∈Nα (lS )
d(xi , l∗ )2 +
n
2
xi ,
i=1
i∈I
where Nα (lS ) ⊆ [n] consists of the indices of the nearest (1 − α)n points from x1 , x2 , . . . , xn to lS . Given any subset S ⊆ [n], define the set of bad points as a subset of inliers I whose error w.r.t. PS (l∗ ) is somewhat larger than their error w.r.t. the optimal line l∗ , that is, B(S) = {i ∈ I : d(xi , PS (l∗ ))2 > (1 + /2) d(xi , l∗ )2 } and good points as G(S) = I \ B(S). The following lemma shows that sampling points 2 with probability proportional to their squared lengths xi picks a bad point from B(S) with probability at least /2. Its proof is deferred to the full version. n 2 2 Lemma 1. If S ⊆ [n] does not satisfy (1), then i∈B(S) xi ≥ 2 i=1 xi . Below we show that a bad point sampled by squared-length sampling can be used to get another line closer to the optimal solution by a multiplicative factor, and repeat this. A proof of the following theorem is deferred to the full version. Theorem 1. For any given x1 , x2 , . . . , xn ∈ Rd , let S be an i.i.d. sample of 2 O (1/ ) log(1/) points picked by squared-length sampling. Let I ⊆ [n] be the set of optimal (1 − α)n inliers and l∗ be the optimal line that minimizes their squared distance. Then i∈I
d(xi , PS (l∗ ))2 ≤
i∈I
d(xi , l∗ )2 +
n i=1
2
xi ,
with a constant probability.
6
A. Deshpande and R. Pratap
3.2
Multiplicative Approximation
To turn this into a multiplicative (1+) guarantee, we need to use this adaptively by treating the projections of x1 , x2 , . . . , xn orthogonal to span (S) as our new points, and repeating the squared length sampling on these new points. Here is the modified statement of the additive approximation that we need. Theorem 2. For any given points x1 , x2 , . . . , xn ∈ Rd and any initial subset S0 , let S be an i.i.d. sample of O (1/2 ) log(1/) points sampled with probability proportional to d(xi , span (S0 ))2 . Let I ⊆ [n] be the set of optimal (1−α)n inliers and l∗ be the optimal line that minimizes their squared distance. Then, with a constant probability, we have
d(xi , PS∪S0 (l∗ ))2 ≤
i∈I
d(xi , l∗ )2 +
n
d(xi , span (S0 ))2 .
i=1
i∈I
Proof. Similar to the proof of Theorem 1 but using the projections of xi ’s orthogonal to span (S0 ) as the point set. In particular, we apply Lemma 1 to the projections of xi ’s orthogonal span (S0 ) instead of xi ’s. Note that Theorem 1 is a special case with S0 = ∅. Repeating this squared-distance sampling adaptively for multiple rounds brings the additive approximation error down exponentially in the number of rounds. We defer a proof of the following theorem to the full version. d Theorem 3. For any given points x1 , x2 , . . . , xn ∈ R ,2any initial subsetS0 and positive integer T , let St be an i.i.d. sample of O (1/ ) log(1/) log T points sampled with probability proportional to d(xi , span (St−1 ))2 , for 1 ≤ t ≤ T . Let I ⊆ [n] be the set of optimal (1 − α)n inliers and l∗ be the optimal line that minimizes their squared distance. Then, with a constant probability,
d(xi , PS0 ∪S1 ∪...∪ST (l∗ ))2 ≤ (1 + )
i∈I
d(xi , l∗ )2 + T
i∈I
n
d(xi , span (S0 ))2 .
i=1
∗ Now assume that the optimal for l is at least δ times its error inlier error n ∗ 2 ∗ 2 over the entire data, that is, i∈I d(xi , l ) ≥ δ i=1 d(xi , l ) . In that case, we can show a much stronger multiplicative (1 + )-approximation instead of additive one using similar analysis as of [8]. We defer its proof to the full version of the paper.
Theorem 4. For any given points x1 , x2 , . . . , xn ∈ Rd , let I ⊆ [n] be the set of line that minimizes their squared optimal (1 − α)n inliers and l∗ be the optimal n distance. Suppose i∈I d(xi , l∗ )2 ≥ δ i=1 d(xi , l∗ )2 . For any 0 < < 1, we can efficiently find a subset S of size O (1/2 ) log(1/) log(1/δ) log log(1/δ) s.t. i∈I
d(xi , PS (l∗ ))2 ≤ (1 + )
i∈I
d(xi , l∗ )2 ,
with a constant probability.
Subspace Approximation with Outliers
4
7
p subspace approximation with outliers
Given an instance of k-dimensional subspace approximation with outliers as points x1 , x2 , . . . , xn ∈ Rd , a positive integer 1 ≤ k ≤ d, a real number p ≥ 1, and an outlier parameter 0 < α < 1, let the optimal k-dimensional linear subspace be V ∗ that minimizes the p error summed over its nearest (1 − α)n points from x1 , x2 , . . . , xn . Let I ⊆ [n] denote the subset of indices of these nearest (1 − α)n points to V ∗ . In other words, I = Nα (V ∗ ) consists of the indices of the optimal inliers. Given any subset S ⊆ [n], let lS be the line or direction in it that makes the smallest angle with V ∗ , and define the subspace WS as the rotation of V ∗ along this angle so as to contain lS . To be precise, let l∗ be the projection of lS onto V ∗ and let W ∗ be the orthogonal complement of l∗ in V ∗ . Observe that WS is the k-dimensional linear subspace spanned by lS and W ∗ . We say that S contains a line lS useful for additive approximation if
d(xi , WS )p ≤
i∈I
d(xi , V ∗ )p +
n
p
xi .
(2)
i=1
i∈I
Define the set of bad points as the subset of inliers I whose error w.r.t. WS is somewhat larger than their error w.r.t. the optimal subspace V ∗ , that is, B(S) = {i ∈ I : d(xi , WS )p > (1 + /2) d(xi , V ∗ )p } and good points as G(S) = I \ B(S). The following lemma shows that sampling i-th points with p probability proportional to xi picks a bad point i ∈ B(S) with probability at least /2. A proof of the following lemma is deferred to the full version of the paper. n p p Lemma 2. If S ⊆ [n] does not satisfy (2), then i∈B(S) xi ≥ 2 i=1 xi . 4.1
Additive Approximation: One Dimension at a Time
Below we show that a bad point i ∈ B(S) can be used to improve WS , or in other words, span (S ∪ {i}) contains a line lS∪{i} that is much closer to V ∗ than lS . We defer a proof of the following theorem to the full version of the paper. d Theorem 2 25. For anygiven points x1 , x2 , . . . , xn ∈ R , let S be an i.i.d. sample p of O (p / ) log(1/) points picked with probabilities proportional to xi . Let ∗ I ⊆ [n] be the set of optimal (1 − α)n inliers and V be the optimal subspace that minimizes the p error over the inliers. Also let WS be defined as in the beginning of Sect. 4. Then, with a constant probability, we have
i∈I
d(xi , WS )p ≤
i∈I
d(xi , V ∗ )p +
n
p
xi .
i=1
Note that we can start with any given initial subspace S0 and prove a similar result for sampling points with probability proportional to d(xi , span (S0 ))p .
8
A. Deshpande and R. Pratap
d Theorem 6. For any given points x1 , x2 , . . . , xn ∈ R and an initial subspace S0 , let S be an i.i.d. sample of O (p2 /2 ) log(1/) points picked with probabilities proportional to d(xi , span (S0 ))p . Let I ⊆ [n] be the set of optimal (1 − α)n inliers and V ∗ be the optimal subspace that minimizes the p error over the inliers. Also let WS be defined as in the beginning of Sect. 4. Then, with a constant probability, we have
d(xi , WS∪S0 )p ≤
i∈I
d(xi , V ∗ )p +
n
d(xi , span (S0 ))p .
i=1
i∈I
Proof. Similar to the proof of Theorem 5 above. Once we have a line lS that is close to V ∗ , we can project orthogonal to it and repeat the sampling again. The caveat is, we do not know lS . One can get around this by projecting all the points x1 , x2 , . . . , xn to span (S) of the current sample S, and repeat. Theorem 7. For any given points x1 , x2 , . . . , xn ∈ Rd , let S = S1 ∪ S2 ∪ . . . ∪ ˜ p2 k 2 /2 points picked as follows: S1 be an i.i.d. sample Sk be a sample of O p with probability proportional to xi , of O (p2 k/2 ) log(k/) points 2 picked S2 be an i.i.d. sample of O (p k/2 ) log(k/) points picked with probability proportional to d(xi , span (S1 ))p , and so on. Let I ⊆ [n] be the set of optimal (1 − α)n inliers and V ∗ be the optimal subspace that minimizes the p error over the inliers. Then, with a constant probability, span (S) contains a k-dimensional subspace VS such that i∈I
d(xi , VS )p ≤
d(xi , V ∗ )p +
i∈I
n
p
xi .
i=1
Proof. Similar to the proof of Theorem