书 | Springer | 时空无监督学习：使用基于图技术和深度神经网络的现代计算机视觉方法_守望者白狼_监督学习的神经网络

网络 02-07 5430

书名： Unsupervised Learning in Space and Time: A Modern Approach for Computer Vision using Graph-based Techniques and Deep Neural Networks

作者： Marius Leordeanu

单位： Computer Science and Engineering Department, Polytechnic University of Bucharest, Bucharest, Romania

出版时间： 2020

出版商： Springer

在空间和时间的无监督学习中，我们解决了人工智能中最重要且仍未解决的问题之一，即从通常免费获得的大量时空视觉数据中以无监督的方式学习。这本书涵盖了我们的主要科学发现和成果，同时从原始和有见地的角度关注该领域的最新进展。文本具有连贯的结构，它在逻辑上将原始数学公式和有效的计算解决方案深度连接起来，用于许多不同的无监督学习任务，例如图和超图匹配和聚类、特征选择、分类器学习、对象发现、识别和分割在视频中。这些任务在脑海中呈现出一个统一的画面，它将不同的任务放在一起并在多个层次上关联起来，这些任务汇聚成更普遍的无监督学习问题。我们从一组直观的无监督学习的原则，然后逐步建立必要的数学模型和算法工具。我们最终达到了一个用于无监督学习的通用计算模型，它将图和深度神经网络结合在一起。总体而言，本书深深植根于我们与卡内基梅隆大学机器人研究所、英特尔和谷歌研究院、罗马尼亚科学院数学研究所“Simion Stolow”的教授、同事和博士生共同开发的科学工作和布加勒斯特理工大学。对于我们在无监督学习方面的工作，我们在 2014 年获得了罗马尼亚学院颁发的最高数学奖“Gigore Moisil 奖”。

本书分为八章，将读者从一组无监督学习的初始直觉和常识原则，带到不同的任务、计算模型和解决方案，这些都被介绍和集成在一起，如下所示：

第一章，我们介绍了无监督学习的七个原则，然后简要介绍与这些基本原则和概念密切相关的下一章所涉及的主题。第1章逐渐勾勒出本书的整体轮廓，没有涵盖最后一章中介绍的概念和模型。

第二章，我们介绍了图和超图匹配的问题，从初始动机和直觉到优化和无监督学习的有效算法。在本章中，我们将介绍光谱图匹配算法，该算法将与第6章中介绍的视频中无监督目标分割的方法相关。我们还提出了整数投影不动点(IPFP)方法，该方法的聚类扩展(第3章)后来被用于超图聚类(第3章)，无监督特征选择和分类器学习(第4章)，以及视频中的描述符学习和对象发现(第5章)。我们广泛地将我们的方法与许多其他图和超图匹配方法进行了比较，并展示了显著的提高。我们还展示了图匹配的无监督学习如何显著提高性能。

第三章，我们将第二章的公式和算法推广到图和超图的聚类中。这两个问题密切相关，相似的模型和算法可以解决这两个问题。第二章提出了一种基于整数投影不动点法的高效聚类算法。然后将ipfp聚类方法应用于第4章和第5章中定义的任务。

第四章，我们提出了一种高效的线性分类器学习方法。该公式导致了特征选择和分类，为此我们还提供了一个无监督的解决方案。我们引入了特征符号的概念，并表明，通过知道这个符号，我们可以在不知道样本标签的情况下学习。学习算法与第3章的clustering-IPFP方法相同。在视频分类方面，我们与包括支持向量机(SVM)在内的许多线性分类方法进行了比较，并从有限的训练数据中显示出明显更强大的泛化能力。

第五章，我们把迄今为止介绍的所有构建模块放在一起。通过遵循第一章的初始无监督学习原则，我们创建了一个完全无监督的视频对象分割系统，它通过几代分类器学习，使用从简单像素到从整个图像提取的深度特征。我们在大量的实验中表明，在几个具有挑战性的数据集上，我们的方法比以前发表的方法更有效和准确。

第六章，我们继续探索视频中的无监督对象发现，并提出了一个对象发现的原始公式，作为时空图中的分割，其中每个像素视频都是一个节点。我们引入了一种新颖的特征运动矩阵，它优雅地结合了运动和外观，并证明了通过有效地计算特征运动矩阵的特征向量可以在时空图中发现作为强聚类的主要目标。因此，数学公式和解与第一章的谱图匹配方法直接相关。我们的光谱聚类方法在空间和时间的目标发现是快速和完全无监督的，同时也能够适应任何类型的预先训练的特征，如果需要。我们在三个具有挑战性的数据集上进行测试，并优于其他无监督的方法。我们还提高了其他监督方法的性能，包括他们的输出到特征-运动公式。

第七章，我们将进入下一个层次的无监督学习。我们引入了一个教师-学生系统，它以一种自我监督的方式学习，这样在一次迭代中训练的学生ConvNets人口在下一代中来自教师。这些想法建立在迄今为止所展示的材料的基础上，但这个系统是原创的，它展示了如何在没有任何人类监督的情况下从视频中学习，将物体分割成单个图像。虽然前几章更多地关注经典图模型，而不是深度神经网络，但在第7章，我们将重点转向深度学习。

第八章,我们将图和神经网络模型合并到一个新的轮回时空图(RSTG)神经网络模型,利用的特点和好处,包括学习能力在深厚层特性和多个规模,以及迭代发送消息的能力跨越空间和时间。RSTG模型充分利用了时空域，并证明了其在高级任务上的有效性，如学习识别复杂的运动模式和人类活动。在最后一部分，我们引入了视觉故事网络的概念，这是一个通用的无监督学习机器，它通过多种预测途径学习，在不同的世界解释之间，通过优化自己的自我共识。

Contents

1 Unsupervised Visual Learning: From Pixels to Seeing 1.1 What Does It Mean to See? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.2 What Is Unsupervised Visual Learning? . . . . . . . . . . . . . . . . . . . 2

1.3 Visual Learning in Space and Time . . . . . . . . . . . . . . . . . . . . . . . 3

1.3.1 Current Trends in Unsupervised Learning . . . . . . . . . . . . . . . . . . . . .5

1.3.2 Relation to Gestalt Psychology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8

1.4 Principles of Unsupervised Learning . . . . . . . . . . . . . . . . . . . . . . 9

1.4.1 Object Versus Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

1.4.2 Learning with Highly Probable Positive Features . . . . . . . . . . . . . . .14

1.5 Unsupervised Learning for Graph Matching . . . . . . . . . . . . . . . 19

1.5.1 Graph Matching: Problem Formulation . . . . . . . . . . . . . . . . . . . . . . 20

1.5.2 Spectral Graph Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

1.5.3 Integer Projected Fixed Point Method for Graph Matching . . . . . . 24

1.5.4 Learning Graph Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .27

1.5.5 Supervised Learning for Graph Matching . . . . . . . . . . . . . . . . . . . . . 27

1.5.6 Unsupervised Learning for Graph Matching. . . . . . . . . . . . . . . . . . . 28

1.6 Unsupervised Clustering Meets Classifier Learning . . . . . . . . . 30

1.6.1 Integer Projected Fixed Point Method for Graph Clustering . . . . . 30

1.6.2 Feature Selection as a Graph Clustering Problem . . . . . . . . . . . . . . 32

1.7 Unsupervised Learning for Object Segmentation in Video . . .38

1.8 Space-Time Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .40

1.8.1 Optimization Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42

1.8.2 Learning Unsupervised Segmentation over Multiple Teacher-Student Generations . . . . . 43

1.8.3 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47

1.9 Next Steps . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48

2 Unsupervised Learning of Graph and Hypergraph Matching 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .53 2.1.1 Relation to Principles of Unsupervised Learning . . . . . . . . . . . . . . . 55 2.2 Graph Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57 2.3 Hypergraph Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58 2.4 Solving Graph Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59 2.4.1 Spectral Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60 2.4.2 Integer Projected Fixed Point Algorithm . . . . . . . . . . . . . . . . . . . . . . 61 2.5 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63 2.6 Solving Hypergraph Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . 65 2.7 Learning Graph Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 2.7.1 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68 2.7.2 Supervised Learning for Graph Matching . . . . . . . . . . . . . . . . . . . . . 71 2.7.3 Unsupervised and Semi-supervised Learning for Graph Matching73 2.7.4 Learning Pairwise Conditional Random Fields . . . . . . . . . . . . . . . . . 73 2.8 Learning Hypergraph Matching . . . . . . . . . . . . . . . . . . . . . . . . . . 75 2.9 Experiments on Graph Matching . . . . . . . . . . . . . . . . . . . . . . . . . 77 2.9.1 Learning with Unlabeled Correspondences . . . . . . . . . . . . . . . . . . . 80 2.9.2 Learning for Different Graph Matching Algorithms . . . . . . . . . . . . 84 2.9.3 Experiments on Conditional Random Fields. . . . . . . . . . . . . . . . . . . 86 2.10 Experiments on Hypergraph Matching . . . . . . . . . . . . . . . . . . 89 2.10.1 Synthetic Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90 2.10.2 Experiments on Real Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93 2.10.3 Matching People . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94 2.10.4 Supervised Versus Unsupervised Learning . . . . . . . . . . . . . . . . . . . 96 2.11 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . 96 2.11.1 Efficient Optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98 2.11.2 Higher Order Relationships . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

3 Unsupervised Learning of Graph and Hypergraph Clustering 3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .107 3.2 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .108 3.3 Algorithm: IPFP for Hypergraph Clustering . . . . . . . . . . . . . . . 109 3.4 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111 3.4.1 Computational Complexity . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .112 3.5 Learning Graph and Hypergraph Clustering . . . . . . . . . . . . . . 113 3.6 Experiments on Third-Order Hypergraph Clustering . . . . . . . 115 3.6.1 Line Clustering . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116 3.6.2 Affine-Invariant Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .118 3.7 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . .121 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122

4 Feature Selection Meets Unsupervised Learning 4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .125 4.1.1 Relation to Principles of Unsupervised Learning . . . . . . . . . . . . . . 128 4.1.2 Why Labeling the Features and Not the Samples . . . . . . . . . . . . . .129 4.2 Mathematical Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130 4.2.1 Supervised Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 131 4.2.2 Unsupervised Case: Labeling the Features Not the Samples. . . . 131 4.2.3 Intuition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 132 4.3 Feature Selection and Learning by Clustering with IPFP . . . . 133 4.3.1 Theoretical Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134 4.4 Experimental Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .137 4.4.1 Comparative Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140 4.4.2 Additional Comparisons with SVM . . . . . . . . . . . . . . . . . . . . . . . . . 143 4.5 The Effect of Limited Training Data . . . . . . . . . . . . . . . . . . . . . . 144 4.5.1 Estimating Feature Signs from Limited Data . . . . . . . . . . . . . . . . . .145 4.5.2 Varying the Amount of Unsupervised Data . . . . . . . . . . . . . . . . . . 147 4.6 Intuition Regarding the Selected Features . . . . . . . . . . . . . . . .150 4.6.1 Location Distribution of Selected Features . . . . . . . . . . . . . . . . . . . 152 4.7 Concluding Remarks and Future Work . . . . . . . . . . . . . . . . . . .153 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154

5 Unsupervised Learning of Object Segmentation in Video with Highly Probable Positive Features 5.1 From Simple Features to Unsupervised Segmentation in Video. . . . . . . . . . . .157 5.2 A Simple Approach to Unsupervised Image Segmentation . 160 5.2.1 A Fast Color Segmentation Algorithm . . . . . . . . . . . . . . . . . . . . . . 163 5.3 VideoPCA: Unsupervised Background Subtraction in Video. 168 5.3.1 Soft Foreground Segmentation with VideoPCA . . . . . . . . . . . . . . . 169 5.4 Unsupervised Segmentation in Video Using HPP Features. . 170 5.4.1 Learning with Highly Probable Positive Features . . . . . . . . . . . . . . 173 5.4.2 Descriptor Learning with IPFP . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 175 5.4.3 Combining Appearance and Motion . . . . . . . . . . . . . . . . . . . . . . . . 178 5.5 Experimental Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .179 5.5.1 Tests on YouTube-Objects Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . 179 5.5.2 Tests on SegTrack V2 Dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181 5.5.3 Computation Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .182 5.6 Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . .183 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183

6 Coupling Appearance and Motion: Unsupervised Clustering for Object Segmentation Through Space and Time 6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .187 6.1.1 Relation to Principles of Unsupervised Learning . . . . . . . . . . . . . . 188 6.1.2 Scientific Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189 6.2 Our Spectral Approach to Segmentation . . . . . . . . . . . . . . . . . 190 6.2.1 Creating the Space-Time Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . 190 6.2.2 Segmentation as Spectral Clustering . . . . . . . . . . . . . . . . . . . . . . . . 192 6.2.3 Optimization by Power Iteration Method . . . . . . . . . . . . . . . . . . . . 193 6.3 Theoretical Properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194 6.3.1 Convergence Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 194 6.3.2 Feature-Motion Matrix . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195 6.4 Experimental Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .196 6.4.1 The Role of Segmentation Initialization . . . . . . . . . . . . . . . . . . . . . .197 6.4.2 The Role of Node Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .199 6.4.3 The Role of Optical Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .200 6.4.4 Complexity Analysis and Computational Cost . . . . . . . . . . . . . . . . 200 6.4.5 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 201 6.5 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .205 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207

7 Unsupervised Learning in Space and Time over Several Generations of Teacher and Student Networks 7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .211 7.1.1 Relation to Unsupervised Learning Principles . . . . . . . . . . . . . . . . 214 7.2 Scientific Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216 7.3 Learning over Multiple Teacher-Student Generations . . . . . . 217 7.4 Our Teacher-Student System Architecture . . . . . . . . . . . . . . . . 218 7.4.1 Student Pathway: Single-Image Segmentation . . . . . . . . . . . . . . . 219 7.4.2 Teacher Pathway: Unsupervised Object Discovery . . . . . . . . . . . . 222 7.4.3 Unsupervised Soft Masks Selection . . . . . . . . . . . . . . . . . . . . . . . . .223 7.4.4 Implementation Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 227 7.5 Experimental Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .228 7.5.1 Ablation Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230 7.5.2 Tests on Foreground Segmentation . . . . . . . . . . . . . . . . . . . . . . . . . 236 7.5.3 Tests on Transfer Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 239 7.5.4 Concluding Remarks on Experiments . . . . . . . . . . . . . . . . . . . . . . . 245 7.6 Concluding Discussion on Unsupervised Learning . . . . . . . . .245 7.7 Overall Conclusions and Future Work . . . . . . . . . . . . . . . . . . . .246 7.7.1 Towards a Universal Visual Learning Machine . . . . . . . . . . . . . . . . .247 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247

8 Unsupervised Learning Towards the Future 8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .253 8.2 Recurrent Graph Neural Networks in Space and Time . . . . . .254 8.2.1 Scientific Context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .255 8.2.2 Recurrent Space-Time Graph Model . . . . . . . . . . . . . . . . . . . . . . . . .257 8.2.3 Experiments: Learning Patterns of Movements and Shapes . . . . .261 8.2.4 Experiments: Learning Complex Human-Object Interactions . . . .264 8.3 Putting Things Together . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 266 8.3.1 Agreements at the Geometric Level . . . . . . . . . . . . . . . . . . . . . . . . .267 8.3.2 Agreements at the Semantic Level . . . . . . . . . . . . . . . . . . . . . . . . . .268 8.3.3 Agreements as Highly Probable Positive (HPP) Features . . . . . . . 269 8.3.4 Motion Patterns as HPP Features . . . . . . . . . . . . . . . . . . . . . . . . . . . 271 8.3.5 Learning over Multiple Teacher-Student Generations . . . . . . . . . .271 8.3.6 Building Blocks of the Visual Story Network . . . . . . . . . . . . . . . . . .272 8.4 The Dawn of the Visual Story Graph Neural Network . . . . . . 272 8.4.1 Classifiers Should Be Highly Interconnected . . . . . . . . . . . . . . . . . 273 8.4.2 Relation to Adaptive Resonance Theory . . . . . . . . . . . . . . . . . . . . . 274 8.4.3 Multiple Layers of Interpretation: Depth, Motion, and Meaning .275 8.4.4 Local Objects and Their Global Roles in the Story. . . . . . . . . . . . . .278 8.4.5 Unsupervised Learning in the Visual Story Network . . . . . . . . . . . 279 8.4.6 Learning Evolution over Multiple Generations . . . . . . . . . . . . . . . . 280 8.4.7 Learning New Categories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 280 8.5 Visual Stories Towards Language and Beyond . . . . . . . . . . . . .281 8.5.1 Learning from Language . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 284 8.5.2 Unsupervised Learning by Surprise . . . . . . . . . . . . . . . . . . . . . . . . . 288 8.5.3 Discover Itself . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .289 8.5.4 Dreams of Tomorrow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291 References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 292 Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 297

阅读原文