commit 7cec2d9ba0acf19eeb6a6b7a8a7fd17dd40cc56a Author: wubw <879367232@qq.com> Date: Wed Apr 23 15:46:42 2025 +0800 first commit diff --git a/202221090225_武博文_中期答辩.pptx b/202221090225_武博文_中期答辩.pptx new file mode 100644 index 0000000..a4ed39c Binary files /dev/null and b/202221090225_武博文_中期答辩.pptx differ diff --git a/202221090225_武博文_开题.pptx b/202221090225_武博文_开题.pptx new file mode 100644 index 0000000..65574f8 Binary files /dev/null and b/202221090225_武博文_开题.pptx differ diff --git a/202221090225_武博文_开题报告表.docx b/202221090225_武博文_开题报告表.docx new file mode 100644 index 0000000..e4ae7dd --- /dev/null +++ b/202221090225_武博文_开题报告表.docx @@ -0,0 +1,231 @@ + 电 子 科 技 大 学 + 学术学位研究生学位论文开题报告表 + 攻读学位级别: □博士 硕士 + 学科专业: 软件工程 + 学 院: 信息与软件工程学院 + 学 号: 202221090225 + 姓 名: 武博文 + 论文题目: 室外动态场景下基于实例 + 分割的视觉SLAM研究 + 指导教师: 王春雨 + 填表日期: 2023 年 12 月 15 日 + 电子科技大学研究生院 + + 学位论文研究内容 + 课题类型 +□基础研究 □应用基础研究 应用研究 + 课题来源 +□纵向 □横向 自拟 + 学 + 位 + 论 + 文 + 研 + 究 + 内 + 容 +学位论文的研究目标、研究内容及拟解决的关键性问题(可续页) + 研究目标 +目前机器人SLAM算法主要分为激光SLAM和视觉SLAM,区别在于传感器分别是激光雷达和相机。随着移动机器人的普及以及应用场景的增多,激光SLAM由于激光雷达的高价格,难以应用在小电器以及低成本机器人上,而视觉SLAM凭借相机价格较低,体积较少,能够采集多维度信息等优势,逐渐成为目前SLAM算法中研究的主流方向。 +视觉同步定位与建图(Visual Simultaneous Localization And Mapping,V-SLAM)在机器人视觉感知领域中占有重要地位。最先进的V-SLAM算法提供了高精度定位和场景重建的能力[[1][]]。然而,它们大多忽略了动态对象所产生的不良影响。在这些研究中,环境被认为是完全静止的,这种强假设使得系统在复杂的动态环境中会产生严重的误差导致位姿估计误差较大,甚至导致定位失败。因此研究动态场景下的相机运动和物体运动是十分有必要的。 +拟在存在动态物体的室外场景下,使用相机作为传感器,研究如何区分真正移动的动态物体和潜在运动但是静止的物体,更好地利用静态特征点,提高相机运动估计的准确性和SLAM系统的鲁棒性。 + 研究内容 + 动态场景作为V-SLAM走向实际应用的一大阻碍,具有较大的难度和挑战性。也是许多学者研究的内容。本文拟研究在室外动态场景下如何识别动态物体,设计动态物体识别算法,将动态物体对相机位姿估计的影响降低,获得较为精准的相机位姿。在获得较为精准的相机位姿后,跟踪动态物体,建立动态物体跟踪集合,对新出现的物体和消失的物体记录。最后,将观测量,如相机位姿和物体位姿等传入后端,建立全局优化,根据优化后的地图点建立地图。 + 针对如何识别室外动态物体的问题,研究深度学习和几何约束相结合的动态点判定方法,设计识别运动物体的算法,去除语义信息未包括的动点,正确恢复相机位姿。 + 针对运动物体跟踪,研究在语义信息中的不同物体的跟踪方法,设计区分不同物体以及其运动,恢复运动物体的位姿。 + 针对后端优化,研究应用动态物体信息的优化方法,同时优化相机位姿和物体位姿,得到更精确的相机位姿。 + 拟解决的关键性问题 + 动态物体判别问题 + 
动态物体判别是整个动态SLAM问题要解决的一个关键环节,其最终解决的效果好坏直接影响到相机位姿的估计和后端的建图效果。该问题解决的是在相机不同的两帧之间,物体的空间位置是否发生了移动的问题。只使用实例分割获得的语义信息只能判定已知语义的物体是动态物体,但是不能确定在当前图像物体是否真的发生了移动。同时在处理未知运动物体方面,语义信息会失效,需要结合几何信息设计一种算法来判定物体是否真正运动,以将动态物体的特征点与静态背景特征点做区分。 + 动态物体跟踪问题 + 对动态点的处理常常是在判定为动态点后,直接将其从特征点中去除,不再考虑这些特征点的意义。但是这些特征点也是地图中的点,对于动态物体存在跟踪的价值,因此研究动态物体所产生的特征点的存储和利用是关键点。在动态场景下,动态特征点非常可能不是来源一个物体,即在一个图像中可能存在多个动态物体,因此需要研究不同物体在不同帧间的关联关系,建立唯一的匹配,实现动态物体的分别跟踪。 + 同步跟踪和优化问题 + 在求解相机位姿后,跟踪动态物体的运动,获得运动物体的位姿,物体运动信息是预测得来的信息,可以经过局部优化或全局优化获得更精准的信息。但一般的优化只进行线性优化或者只对相机位姿优化,忽略了动态物体点的有效信息。因此拟设计一种优化的过程,确定优化变量,实现更准确的位姿估计,生成更准确的地图点,解决动态物体有效信息不完全利用的问题。 + + + 学位论文研究依据 +学位论文的选题依据和研究意义,国内外研究现状和发展态势,主要参考文献,以及已有的工作积累和研究成果。(应有2000字) + 选题依据和研究意义 + 同步定位与地图构建(SLAM)是搭载激光雷达、IMU(Inertial Measurement Unit)、相机等传感器的移动载体在未知环境下同步进行定位与地图构建的过程[[][2][]]。SLAM一般可分为激光SLAM和视觉SLAM。激光SLAM利用激光雷达、编码器和惯性测量单元(IMU)等多种传感器相结合,已在理论和应用方面相对成熟。然而,激光雷达具有较高的价格使其难以普及到个人小型设备,并且雷达信息获取量有限。视觉SLAM利用视觉传感器,如单目、双目和RGB-D(带有深度信息的彩色图像)相机等,来构建环境地图。相机能够获取丰富的图像信息,并且视觉传感器具有低廉的价格,简单的结构和小巧便携的特点,因此成为近年来研究者们关注的热点,也成为SLAM技术中的主要研究方向。视觉SLAM能够广泛应用于无人驾驶,自主机器人,导盲避障等领域,对视觉SLAM的研究具有现实意义。 + 经过近二十年的发展,视觉同时定位与建图(Visual Simultaneous Localization And Mapping,V-SLAM)框架已趋于成熟。现阶段,V-SLAM系统大多数建立在非动态环境的假设上,即假设移动载体在跟踪过程中不存在动态物体。然而,这种假设是一种强假设,在现实场景中很难成立。在室内场景中,常出现移动的人和桌椅等等;在室外场景中,常常出现移动的车和动物等等,这些动态物体的出现对V-SLAM系统的影响巨大,尤其是对V-SLAM中的前端模块的影响。SLAM前端求解存在两种方案,直接法和特征点法。直接法基于光度不变假设来描述像素随时间在图像之间的运动方式,每个像素在两帧之间的运动是一致的,通过此估计相机的运动。然而由于相机获得的图像受到光线,噪声等影响,光度不变假设往往不成立,如果再出现动态物体,直接使用此方法更会影响相机的运动估计。特征点法是一种间接的方法,它首先提取图像的特征点,然后通过两帧间特征点的匹配和位置变化求解相机运动。特征点的选择与使用大幅提高了V-SLAM系统定位的准确性,例如著名开源视觉SLAM框架ORB-SLAM2[[3]]、ORB-SLAM3[[4]]、VINS-Mono[[5]]都采用了特征点法。但是,一旦出现动态物体,这些特征点中会包含动态物体上的点,动态物体的移动造成了特征点移动的不一致性,从而对相机运动的估计造成了巨大影响。这种影响会导致后端失效,定位精度大幅减弱,不能忽视。随着视觉SLAM技术的发展,如何解决动态影响受到广泛关注,具有重要的研究价值。 + 国内外研究现状和发展态势 + 2.1视觉SLAM研究现状 + 视觉SLAM问题最早可追溯到滤波技术的提出,Smith等人提出了采用状态估计理论的方法处理机器人在定位和建图等方面的问题[[][6][]]。随后出现各种基于滤波算法的SLAM系统,例如粒子滤波[[][7][]]和卡尔曼滤波[[][8][]]。2007年视觉SLAM取得重大突破,A. J. 
Davison等人提出第一个基于单目相机的视觉SLAM系统MonoSLAM[9]。该系统基于扩展卡尔曼滤波算法(Extended Kalman Filter, EKF),是首个达到实时效果的单目视觉SLAM系统,在此之前其他的算法都是对预先拍好的视频进行处理,无法做到同步。同年,Klein等人提出了PTAM(Parallel Tracking and Mapping)[10],创新地以并行的方式运行跟踪和建图线程,这种并行的方式也是当下SLAM框架的主流。PTAM应用了关键帧和非线性优化理论而非当时多数的滤波方案,为后续基于非线性优化的视觉SLAM开辟了道路。
+ 2014年慕尼黑工业大学计算机视觉组Jakob Engel等人[11]提出LSD-SLAM,该方案是一种基于直接法的单目视觉SLAM算法,不需要计算特征点,通过最小化光度误差进行图像像素信息的匹配,实现了效果不错的建图。该方案的出现证明了基于直接法的视觉SLAM系统的有效性,为后续的研究奠定了基础。同年SVO被Forster等人提出[12]。这是一种基于稀疏直接法的视觉SLAM方案,结合了特征点和直接法:使用特征点,但不计算特征点的描述子,特征点的匹配利用其周围像素以直接法完成。
+ 2015年Mur-Artal等人参考PTAM关键帧和并行线程的方案,提出了ORB-SLAM框架[13]。该框架是一种完全基于特征点法的单目视觉SLAM系统,包括跟踪、建图和回环检测三个并行线程。最为经典的是该系统采用的ORB特征点,能实现提取速度和效果的平衡。但是该系统只适用于单目相机,精度低且应用场景受限。随着相机的进步,2017年Mur-Artal等人对ORB-SLAM进行了改进,扩展了对双目和RGB-D相机的支持,提出ORB-SLAM2[3]。相比于原版,该系统支持三种相机,同时新增重定位、全局优化和地图复用等功能,更具鲁棒性。
+ 2017年,香港科技大学Qin Tong等人[14]提出VINS-Mono系统,该系统在单目相机中融合IMU传感器,在视觉信息短暂失效时可利用IMU估计位姿,视觉信息在优化时可以修正IMU数据的漂移,两者的结合表现出了优良的性能。2019年提出改进版系统VINS-Fusion[15],新增对双目相机和GPS传感器的支持,融合后的系统效果更优。
+ 2020年Carlos Campos等[4]提出了ORB-SLAM3,该系统在ORB-SLAM2的基础上,加入了对视觉惯性传感器融合的支持,并在社区开源。系统对算法的多个环节进行改进优化,例如加入了多地图系统和新的重定位模块,能够适应更多的场景,同时精度相比上一版提高2-3倍。在2021年底,系统更新了V1.0版本,继承了ORB-SLAM2的优良性能,成为现阶段最有代表性的视觉SLAM系统之一。
+ 2.2 动态SLAM研究现状
+ 针对动态物体的影响,已经有许多研究人员开展了相关工作,尝试解决动态场景下的视觉SLAM问题。解决这一问题的主要挑战就是如何高效地检测到动态物体和其特征点,并将动态特征点剔除以恢复相机运动。
+ 最早的解决思路是根据几何约束来筛除动态物体的特征点,如WANG等[16]首次使用K-Means将由RGB-D相机计算的3D点聚类,并使用连续图像之间的极线约束计算区域中内点关键点数量的变化,内点数量较少的区域被认定是动态的。Fang[17]使用光流法检测图像之间的动态物体所在位置,对其特征点进行滤除。该方法利用光流提高检测的精度,有效地降低了帧之间极线约束的误差。尽管基于几何约束的方法可以在一定程度上消除动态特征点的影响,但随着深度学习的发展,图像中的语义信息逐渐被重视和利用起来。
+ 现阶段有许多优秀的深度学习网络,如YOLO[18]、SegNet[19]、Mask R-CNN[20]等等。这些神经网络有着强大的特征提取能力和语义信息提取能力,可以帮助SLAM系统更轻松地辨别出动态物体的存在,从而消除其影响。Fangwei
Zhong等人提出的Detect-SLAM[21],利用目标检测网络获取环境中动态的人和车等,为了实时性,只在关键帧中进行目标检测,最后去除所有检测到的动态点来恢复相机位姿。LIU和MIURA[22]提出了RDS-SLAM。基于ORB-SLAM3[4]的RDS-SLAM框架使用模型的分割结果初始化移动对象的移动概率,将概率传播到随后的帧,以此来区分动静点。这种只基于深度学习的方法仅能提供图像中的语义信息,但无法判断图像中的物体是否真的在运动,比如静止的人或者路边停靠的汽车。若根据语义信息将其标记为动态物体后直接去除其特征点,这种方法会导致系统丢失有用的特征点,对相机的运动估计有所影响。因此仅利用深度学习不能很好解决动态物体对SLAM系统的影响。
+ 许多研究开始探索语义信息和几何信息的结合。例如清华大学Chao Yu等提出的DS-SLAM[23],该系统首先利用SegNet网络进行语义分割,再利用极线约束过滤移动的物体,达到了不错的效果。Berta Bescos等人首次利用Mask R-CNN网络进行实例分割,提出了DynaSLAM[24]。该系统结合基于多视几何深度的动态物体分割和区域生长算法,大幅降低了位姿估计的误差。
+ 利用深度学习得来的语义信息和几何信息结合来解决SLAM中的动态场景问题渐渐成了一种主流,但是上述大多系统只是为了恢复相机的位姿而剔除动态物体的特征点,而没有估计动态物体的位姿。同时估计相机运动和跟踪动态物体运动,将动态物体的点加入优化步骤正在发展为一种趋势。Ballester等人提出的DOT SLAM(Dynamic Object Tracking for Visual SLAM)[25]主要工作在前端,结合实例分割为动态对象生成掩码,通过最小化光度重投影误差跟踪物体。AirDOS被卡内基梅隆大学Yuheng Qiu等人提出[26],将刚性和运动约束引入以建模铰接物体,通过联合优化相机位姿、物体运动和物体三维结构,来纠正相机位姿估计。VDO SLAM[27]利用Mask R-CNN掩码和光流区分动静点,将动态环境下的SLAM表示为整体的图优化,同时估计相机位姿和物体位姿。
+ 总体来说,目前动态场景下的视觉SLAM问题的解决需要借助几何信息和深度学习的语义信息,语义信息提供更准确的物体,几何信息提供物体真实的运动状态,两者结合来估计相机运动和跟踪物体。
+ 主要参考文献
+ J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 40(3): 611-625.
+ 孔德磊, 方正. 基于事件的视觉传感器及其应用综述[J]. 信息与控制, 2021, 50(1): 1-19. KONG D L, FANG Z. A review of event-based vision sensors and their applications[J]. Information and Control, 2021, 50(1): 1-19.
+ Mur-Artal R, Tardós J D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras[J]. IEEE Transactions on Robotics, 2017.
+ Campos C, Elvira R, Rodriguez J, et al. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial, and Multimap SLAM[J]. IEEE Transactions on Robotics, 2021, 37(6): 1874-1890.
+ Qin T, Li P, Shen S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator[J]. IEEE Transactions on Robotics, 2018.
+ Smith R, Self M, Cheeseman P.
Estimating Uncertain Spatial Relationships in Robotics [J]. Machine Intelligence & Pattern Recognition, 1988, 5(5):435-461. + Grisetti G, Stachniss C, Burgard W. Improved Techniques for Grid Mapping With Rao-Blackwellized Particle Filters [J]. IEEE Transactions on Robotics, 2007, 23(1):34-46. + Kalman R E. A New Approach To Linear Filtering and Prediction Problems[J]. Journal of Basic Engineering, 1960, 82D:35-45.DOI:10.1115/1.3662552. + Davison A J, Reid I D, Molton N D, et al. MonoSLAM: Real-Time Single Camera SLAM[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6):1052-1067. + Klein G, Murray D. Parallel Tracking and Mapping for Small AR Workspaces[C]. IEEE and ACM International Symposium on Mixed and Augmented Reality. IEEE, 2007:1-10. + ENGEL J, SCHOPS T, CREMERS D, LSD-SLAM: Large-scale direct monocular SLAM[C]. European Conference on Computer Vision(ECCV), 2014:834 - 849. + FORSTER C, PIZZOLI M, SCARAMUZZA D. SVO: Fast semi-direct monocular visual odometry[C]. Hong Kong, China: IEEE International Conference on Robotics and Automation (ICRA), 2014: 15-22. + MURARTAL R, MONTIEL J M, TARDOS J D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System[J]. IEEE Transactions on Robotics, 2015, 31(5):1147-1163. + TONG Q, PEILIANG L, SHAOJIE S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator[J]. IEEE Transactions on Robotics, 2017,99:1-17. + QIN T, PAN J, CAO S, et al. A general optimization-based framework for local odometry estimation with multiple sensors[J]. ArXiv, 2019:1901.03638. + WANG R, WAN W, WANG Y, et al. A new RGB-D SLAM method with moving object detection for dynamic indoor scenes[J]. Remote Sensing, 2019, 11(10): 1143. + Fang Y, Dai B. An improved moving target detecting and tracking based on Optical Flow technique and Kalman filter[J]. IEEE, 2009.DOI:10.1109/ICCSE.2009.5228464. + Redmon J, Divvala S, Girshick R, et al. 
You Only Look Once: Unified, Real-Time Object Detection[C]//Computer Vision & Pattern Recognition. IEEE, 2016.DOI:10.1109/CVPR.2016.91. + Badrinarayanan V, Kendall A, Cipolla R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(12): 2481-2495. + Gkioxari G, He K, Piotr Dollár, et al. Mask R-CNN [J]. IEEE transactions on pattern analysis and machine intelligence, 2020, 42(2): 386-397. + Zhong F, Wang S, Zhang Z, et al. Detect-SLAM: Making Object Detection and SLAM Mutually Beneficial[C]//2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018.DOI:10.1109/WACV.2018.00115. + LIU Y, MIURA J. RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods[J]. IEEE Access, 2021, 9: 23772-23785. + C. Yu, et al. DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments[A]. //2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 1168-1174. + B. Bescos, J. M. Fácil, J. Civera and J. Neira, DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes[J]. IEEE Robotics and Automation Letters, 2018, 3(4)4076-4083. + Ballester I, Fontan A, Civera J, et al. DOT: Dynamic Object Tracking for Visual SLAM[J]. 2020.DOI:10.48550/arXiv.2010.00052. + Qiu Y, Wang C, Wang W, et al. AirDOS: Dynamic SLAM benefits from articulated objects[C]//2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022: 8047-8053. + Zhang J, Henein M, Mahony R, et al. VDO-SLAM: A Visual Dynamic Object-aware SLAM System[J]. 2020.DOI:10.48550/arXiv.2005.11052. + 高翔, 张涛等. 视觉SLAM十四讲[M]. 第二版. 北京:电子工业出版社, 2019. 
+ 已有的工作积累和研究成果
+ 工作积累
+研究生期间学习主要以《视觉SLAM十四讲》[28]为主,阅读了大量SLAM相关文献,在虚拟机环境下测试过ORB-SLAM2、VDO-SLAM等多种框架在公开数据集KITTI序列上的性能。掌握框架的主要函数,可以通过编程实现环境的搭建和算法的编写测试。
+ 研究成果
+暂无
+ 学位论文研究计划及预期目标
+1.拟采取的主要理论、研究方法、技术路线和实施方案(可续页)
+1.1 主要理论和研究方法
+一个典型的视觉SLAM系统一般可以分为五个子模块,包括传感器、前端、后端优化、回环检测和建图,如图3-1所示。
+ 图3-1 SLAM模块图
+对于视觉SLAM而言,传感器为相机,前端又称为视觉里程计,主要根据相机信息估计相邻两个时刻内的运动(即位姿变化)。后端优化位姿;回环检测是检测相机是否经过相同的场景,与建图有着密切的联系。本文的主要工作集中在前端和后端。在光照变化不明显、没有动态物体的场景下,SLAM基本模块已经很完善。要解决动态场景下的问题,需要在此模块的基础上,结合深度学习模型来实现语义级别的SLAM。
+在后端优化方面,本文基于因子图优化。因子图是应用贝叶斯定律的估计模型:给定观测Z求解状态X的概率,表示为后验概率P(X|Z),它正比于似然P(Z|X)与先验P(X)的乘积,如公式(1)所示。
+P(X|Z) = P(Z|X)P(X)/P(Z) = k·P(Z|X)P(X)    (1)
+贝叶斯定律左侧称为后验概率,右侧的P(Z|X)称为似然,P(X)称为先验。直接求后验分布是困难的,但是求一个状态最优估计,使得在该状态下后验概率最大化是可行的,如公式(2)所示。因此求解最大后验概率,等价于求解最大化似然和先验的乘积。
+X* = arg max P(X|Z) = arg max P(Z|X)P(X)    (2)
+求解最大似然估计时,考虑观测数据的条件概率满足高斯分布,可以使用最小化负对数来求高斯分布的最大似然,这样就可以得到一个最小二乘问题,如公式(3)所示,它的解等价于状态的最大似然估计。其中公式(3)中的f(x)为噪声符合高斯分布的X的误差项。
+X* = arg max P(X|Z) = arg max log P(X|Z) = arg min ||f(x)||₂²    (3)
+在SLAM问题中,每一个观测变量在贝叶斯网络中都是相互独立的,因此所有条件概率是乘积的形式,且可分解,对应于因子图中的每一项。因子图包含节点和边:节点为状态变量节点,表示待估计的变量,如位姿、3D点等;边为误差项,即因子,表示变量之间的约束。因子图还会包含一个先验因子,用来固定系统的解,以保证可解。因子图的求解就是寻找使所有因子的乘积最大化的状态量,该步骤可转化为最小二乘问题,最终解得的系统状态是在概率上最可能的系统状态。
+在研究时,从主要理论出发,阅读大量室外动态场景下的视觉SLAM文献,对文献总结和理解,学习方法的异同,优化自己的算法。从实践出发,多写代码尝试不同的算法,测试算法性能,通过实验得到良好的解决方案。
+1.2 技术路线和实施方案
+本文预计的技术路线和实施方案如图3-2所示:
+ 图3-2 技术路线和实施方案
+在室外动态场景下基于实例分割的SLAM算法首先需要解决深度学习模型的数据预处理,然后应用得到的语义信息和几何约束设计算法来实现动静点判定。根据静点估计相机的运动,根据动点估计运动物体的运动,不同的运动物体分别跟踪。最终研究相机位姿、运动物体位姿和地图点的全局优化,实现建图。
+本文预计的详细技术路线和实施方案如下:
+ 基于实例分割和聚类的动态物体判别方法
+在室外动态场景下,提出一种基于实例分割和超像素聚类的动态物体识别算法。通过实例分割得到物体掩码,将掩码内的点作为动点候选点,通过特征提取的点与动点候选点做差,得到静点候选点。静点候选点通过聚类后重投影到前一帧,计算点误差,提出一种基于误差比的动点判断方法,解决语义未知的动态物体判定问题。对于语义已知的掩码物体,同样使用该方法判定是否真的在运动。研究思路如图3-3所示。
+ 图3-3 基于实例分割和聚类的动态物体判别方法
+ 依赖掩码内动点集合的动态物体跟踪方法
+研究具有掩码的动态物体的运动,提出一种在室外场景下全局的动态物体跟踪方法。首先通过掩码稠密地提取像素点,每隔2个点取一个点,以保证物体跟踪时特征点的数量。再通过运动判定和语义标签得到真的在运动的物体,设计一个存储集合来管理这些物体像素点,同时利用提取的像素点估计不同物体的位姿,物体位姿的求解建立在刚体假设之上。研究思路如图3-4所示。
+ 图3-4 动态物体跟踪方法
+ 因子图优化方法
+研究基于因子图的相机位姿和物体位姿优化,该方法将动态SLAM问题表述为一个图优化问题,以构建全局一致的地图。因子图中的变量节点是要估计求解的变量(如相机位姿、物体位姿和地图点),变量之间的观测作为因子节点,构成约束。拟设计的因子图如图3-5所示。
+ 图3-5 因子图
+2.研究计划可行性,研究条件落实情况,可能存在的问题及解决办法(可续页)
+2.1 可行性分析
+ 得益于视觉SLAM的逐渐发展,动态物体问题已经有了不少解决思路,尤其是前端部分的研究更多,每年都有一定的论文产出,可作为参考。其次,随着深度学习的模型逐渐完善,实例分割技术和光流检测等技术也能有比较好的效果,对动态SLAM问题的解决有所助益。因此,在理论上和实践上,本论文的研究方向均具有可行性。
+2.2 研究条件
+ (1) 教研室的科研氛围,指导老师和教研室老师们的意见,师兄们的帮助。教研室已经发表了不少相关论文和专利;
+ (2) 教研室完备的硬件环境,包括服务器、移动小车和各种摄像头等硬件设施;
+ (3) 研究内容相关的论文和书籍,有足够的理论基础支撑研究。
+2.3 可能存在的问题及解决办法
+ (1) 全局优化的结果不如原始数据
+ 在将预测值进行全局优化时,不确定预测值的误差大小,会导致一些误差较大的预测值加入全局优化,使得优化后的效果不如原始数据。针对这样的问题,首先考虑优化对象的选择,增加或删除优化值,以获得更准确的效果。其次考虑在加入优化前对预测值做处理,比如绝对阈值或相对阈值处理。
+ (2) 实施方案未能达到较好的效果
+ 若出现这样的问题,则需要和导师师兄交流,讨论原因并做好记录,找到问题所在,并根据实际情况调整技术路线,设计新的方案来达到效果。
+3.研究计划及预期成果
+ 研究计划
+ 起止年月  完成内容
+ 2023.12-2024.02  研究动态物体判别方法
+ 2024.02-2024.04  研究动态物体跟踪方法
+ 2024.04-2024.06  研究包含动态物体的局部优化和全局优化
+ 2024.06-2024.08  验证地图精度指标,改进算法
+ 2024.08-2024.11  测试数据集,做实验
+ 2024.11-2025.03  撰写硕士学位论文
+ 预期创新点及成果形式
+ 预期创新点
+ 设计基于实例分割和聚类的动态物体判别方法
+ 提出基于掩码的动态物体同步跟踪方法
+ 设计因子图,实现更优的全局优化
+ 成果形式
+ 学术论文:发表一篇学术论文
+ 专利:申请发明专利1-2项
+ 论文:撰写硕士学位论文1篇
+ 开题报告审查意见
+1.导师对学位论文选题和论文计划可行性意见,是否同意开题:
+
+导师(组)签字: 年 月 日
+2.开题报告考评组意见
+ 开题日期
+ 开题地点
+ 考评专家
+ 考评成绩
+合格 票 基本合格 票 不合格 票
+ 结 论
+□通过 □原则通过 □不通过
+通过:表决票均为合格
+原则通过:表决票中有1票为基本合格或不合格,其余为合格和基本合格
+不通过:表决票中有2票及以上为不合格
+考评组对学位论文的选题、研究计划及方案实施的可行性的意见和建议:
+
+考评组签名:
+ 年 月 日
+3.学院意见:
+
+负责人签名: 年 月 日
 diff --git a/202221090225_武博文_文献综述.docx b/202221090225_武博文_文献综述.docx new file mode 100644 index 0000000..c3e3df2 --- /dev/null +++ b/202221090225_武博文_文献综述.docx @@ -0,0 +1,65 @@
+ 电子科技大学学术学位硕士研究生学位论文文献综述
+姓名:武博文
+ 学号:202221090225
+ 学科:软件工程
+综述题目:室外动态场景下基于实例分割的视觉SLAM研究
+
+导师意见:
+
+导师签字:
+日期:
+ 选题依据和研究意义
+同步定位与地图构建(SLAM)是搭载激光雷达、IMU(Inertial Measurement
Unit)、相机等传感器的移动载体在未知环境下同步进行定位与地图构建的过程[[][1][]]。SLAM一般可分为激光SLAM和视觉SLAM。激光SLAM利用激光雷达、编码器和惯性测量单元(IMU)等多种传感器相结合,已在理论和应用方面相对成熟。然而,激光雷达具有较高的价格使其难以普及到个人小型设备,并且雷达信息获取量有限。视觉SLAM利用视觉传感器,如单目、双目和RGB-D(带有深度信息的彩色图像)相机等,来构建环境地图。相机能够获取丰富的图像信息,并且视觉传感器具有低廉的价格,简单的结构和小巧便携的特点,因此成为近年来研究者们关注的热点,也成为SLAM技术中的主要研究方向。视觉SLAM能够广泛应用于无人驾驶,自主机器人,导盲避障等领域,对视觉SLAM的研究具有现实意义。 +经过近二十年的发展,视觉同时定位与建图(Visual Simultaneous Localization And Mapping,V-SLAM)框架已趋于成熟,在机器人视觉感知领域中占有重要地位,最先进的V-SLAM算法提供了高精度定位和场景重建的能力[[][2][]]。现阶段,V-SLAM系统大多数建立在非动态环境的假设上,即假设移动载体在跟踪过程中不存在动态物体。然而,这种假设是一种强假设,在现实场景中很难成立。在室内场景中,常出现移动的人和桌椅等等;在室外场景中,常常出现移动的车和动物等等,这些动态物体的出现对V-SLAM系统的影响巨大,尤其是对V-SLAM中的前端模块的影响。SLAM前端求解存在两种方案,直接法和特征点法。直接法基于光度不变假设来描述像素随时间在图像之间的运动方式,每个像素在两帧之间的运动是一致的,通过此估计相机的运动。然而由于相机获得的图像受到光线,噪声等影响,光度不变假设往往不成立,如果再出现动态物体,直接使用此方法更会影响相机的运动估计。特征点法是一种间接的方法,它首先提取图像的特征点,然后通过两帧间特征点的匹配和位置变化求解相机运动。特征点的选择与使用大幅提高了V-SLAM系统定位的准确性,例如著名开源视觉SLAM框架ORB-SLAM2[[3]]、ORB-SLAM3[[4]]、VINS-Mono[[5]]都采用了特征点法。但是,一旦出现动态物体,这些特征点中会包含动态物体上的点,动态物体的移动造成了特征点移动的不一致性,从而对相机运动的估计造成了巨大影响。这种影响会导致后端失效,定位精度大幅减弱,不能忽视。随着视觉SLAM技术的发展,如何解决动态影响受到广泛关注,具有重要的研究价值。 + 国内外研究现状和发展态势 +2.1 视觉SLAM研究现状 +视觉SLAM问题最早可追溯到滤波技术的提出,Smith等人提出了采用状态估计理论的方法处理机器人在定位和建图等方面的问题[[6]]。随后出现各种基于滤波算法的SLAM系统,例如粒子滤波[[7]]和卡尔曼滤波[[8]]。2007年视觉SLAM取得重大突破,A. J. 
Davison等人提出第一个基于单目相机的视觉SLAM系统MonoSLAM[9]。该系统基于扩展卡尔曼滤波算法(Extended Kalman Filter, EKF),是首个达到实时效果的单目视觉SLAM系统,在此之前其他的算法都是对预先拍好的视频进行处理,无法做到同步。MonoSLAM的发布标志着视觉SLAM的研究从理论层面转到了实际应用,具有里程碑式意义。同年,Klein等人提出了PTAM(Parallel Tracking and Mapping)[10],创新地以并行的方式运行跟踪和建图线程,解决了MonoSLAM计算复杂度高的问题,这种并行的方式也是当下SLAM框架的主流。PTAM应用了关键帧和非线性优化理论而非当时多数的滤波方案,为后续基于非线性优化的视觉SLAM开辟了道路。
+2014年慕尼黑工业大学计算机视觉组Jakob Engel等人[11]提出LSD-SLAM,该方案是一种基于直接法的单目视觉SLAM算法,不需要计算特征点,通过最小化光度误差进行图像像素信息的匹配,实现了效果不错的建图,可以生成半稠密的深度图。该方案的出现证明了基于直接法的视觉SLAM系统的有效性,为后续的研究奠定了基础。但该方案仍旧存在尺度不确定性问题,以及在相机快速移动时容易丢失目标等问题。同年SVO(semi-direct monocular visual odometry)被Forster等人提出[12]。这是一种基于稀疏直接法的视觉SLAM方案,结合了特征点和直接法:使用特征点,但不计算特征点的描述子,特征点的匹配利用其周围像素以直接法完成。SVO有着较快的计算速度,但是缺少后端的功能,对相机的运动估计有较为明显的累积误差,应用场景受限。
+2015年Mur-Artal等人参考PTAM关键帧和并行线程的方案,提出了ORB-SLAM框架[13]。该框架是一种完全基于特征点法的单目视觉SLAM系统,包括跟踪、建图和回环检测三个并行线程。跟踪线程负责提取ORB[14](oriented FAST and rotated BRIEF)特征点,这是该系统最为经典的一部分,采用的ORB特征点具有良好的尺度不变性和旋转不变性,能实现提取速度和效果的平衡。跟踪线程还完成估计位姿的工作,并且适时选出新的关键帧来实现建图。建图线程接收跟踪线程选出的关键帧,删除冗余的关键帧和地图点,再进行全局优化。回环线程接收建图线程筛选后的关键帧,与其他关键帧进行回环检测,然后更新相机位姿和地图。ORB-SLAM因为回环检测线程的加入,有效消除了累积误差的影响,提高了定位和建图的准确性。但是该系统只适用于单目相机,精度低且应用场景受限。随着相机的进步,2017年Mur-Artal等人对ORB-SLAM进行了改进,扩展了对双目和RGB-D相机的支持,提出ORB-SLAM2[3]。相比于原版,该系统支持三种相机,同时新增重定位、全局优化和地图复用等功能,更具鲁棒性。
+2017年,香港科技大学Qin Tong等人[15]提出VINS-Mono系统,该系统在单目相机中融合IMU传感器,在视觉信息短暂失效时可利用IMU估计位姿,视觉信息在优化时可以修正IMU数据的漂移,两者的结合表现出了优良的性能。2019年提出改进版系统VINS-Fusion[16],新增对双目相机和GPS传感器的支持,融合后的系统效果更优。
+2020年Carlos Campos等提出了ORB-SLAM3[4],该系统在ORB-SLAM2的基础上,加入了对视觉惯性传感器融合的支持,并在社区开源。系统对算法的多个环节进行改进优化,例如加入了多地图系统和新的重定位模块,能够适应更多的场景,同时精度相比上一版提高2-3倍。在2021年底,系统更新了V1.0版本,继承了ORB-SLAM2的优良性能,成为现阶段最有代表性的视觉SLAM系统之一。
+2.2 动态SLAM研究现状
+针对动态物体的影响,已经有许多研究人员开展了相关工作,尝试解决动态场景下的视觉SLAM问题。解决这一问题的主要挑战就是如何高效地检测到动态物体和其特征点,并将动态特征点剔除以恢复相机运动。
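上文所述"检测动态特征点并剔除"的几何思路,最基本的形式就是对极约束检验:静态点的匹配应落在对极线附近,偏离对极线较远的匹配被视为动点候选。下面给出一个仅依赖NumPy的最小示意(假设:基础矩阵F、匹配点和阈值均为人为构造的示例数据,并非上述任何系统的实现):

```python
import numpy as np

def epipolar_distances(F, pts1, pts2):
    """计算 pts2 中各点到对极线 l = F @ p1 的点线距离(像素)。"""
    ones = np.ones((pts1.shape[0], 1))
    p1 = np.hstack([pts1, ones])            # N x 3 齐次坐标
    p2 = np.hstack([pts2, ones])
    lines = p1 @ F.T                        # 第二帧中的对极线 (a, b, c)
    num = np.abs(np.sum(lines * p2, axis=1))
    den = np.sqrt(lines[:, 0] ** 2 + lines[:, 1] ** 2)
    return num / den

# 合成例子:相机纯水平平移时 F = [e]x,e ~ (1, 0, 0),对极线为水平线
F = np.array([[0., 0.,  0.],
              [0., 0., -1.],
              [0., 1.,  0.]])
pts1 = np.array([[10., 20.], [30., 40.]])
pts2_static = pts1 + np.array([5., 0.])     # 沿对极线移动:静态点,距离为 0
pts2_moving = pts1 + np.array([5., 7.])     # 偏离对极线 7 像素:动点候选

d_static = epipolar_distances(F, pts1, pts2_static)
d_moving = epipolar_distances(F, pts1, pts2_moving)
threshold = 1.0                             # 示意阈值,实际需按噪声水平选取
is_dynamic = d_moving > threshold
```

实际系统中F通常由RANSAC从特征匹配估计;并且正如正文指出的,沿对极线方向运动的物体仅靠该约束无法检出,这正是需要结合光流或语义信息的原因。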
+最早的解决思路是根据几何约束来筛除动态物体的特征点,如WANG等[17]首次使用K-Means将由RGB-D相机计算的3D点聚类,并使用连续图像之间的极线约束计算区域中内点关键点数量的变化,内点数量较少的区域被认定是动态的。利用极线约束是一种判断动态物体特征点的常见方法,但是如果相邻帧间存在高速移动物体或者运动物体沿着极线方向移动,这种方法效果会大大减弱。为了更好地利用几何信息,研究人员提出借助光流信息来提高动态物体的检测。Fang[18]使用光流法检测图像之间的动态物体所在位置,对其特征点进行滤除。该方法利用光流提高检测的精度,有效地降低了帧之间极线约束的误差。尽管基于几何约束的方法可以在一定程度上消除动态特征点的影响,但随着深度学习的发展,图像中的语义信息逐渐被重视和利用起来。
+现阶段有许多优秀的深度学习网络,如YOLO[19]、SegNet[20]、Mask R-CNN[21]等等。这些神经网络有着强大的特征提取能力和语义信息提取能力,可以帮助SLAM系统更轻松地辨别出动态物体的存在,提供语义先验信息,从而消除其影响。Fangwei Zhong等人提出的Detect-SLAM[22],利用目标检测网络获取环境中动态的人和车等,为了实时性,只在关键帧中进行目标检测,最后去除所有检测到的动态点来恢复相机位姿。LIU和MIURA[23]提出了RDS-SLAM。基于ORB-SLAM3[4]的RDS-SLAM框架使用模型的分割结果初始化移动对象的移动概率,将概率传播到随后的帧,以此来区分动静点。这种只基于深度学习的方法仅能提供图像中的语义信息,但无法判断图像中的物体是否真的在运动,比如静止的人或者路边停靠的汽车。若根据语义信息将其标记为动态物体后直接去除其特征点,这种方法会导致系统丢失有用的特征点,对相机的运动估计有所影响。因此仅利用深度学习不能很好解决动态物体对SLAM系统的影响。
+许多研究开始探索语义信息和几何信息的结合。清华大学Chao Yu等提出的DS-SLAM[24],首先利用SegNet网络进行语义分割,再利用极线约束过滤移动的物体,达到了不错的效果。Berta Bescos等人首次利用Mask R-CNN网络进行实例分割,提出了DynaSLAM[25]。该系统结合基于多视几何深度的动态物体分割和区域生长算法,大幅降低了位姿估计的误差。Runz等人提出了MaskFusion,一种考虑物体语义和运动的动态RGB-D SLAM系统[26]。该系统基于Mask R-CNN语义分割和几何分割,将语义分割和SLAM放在两个线程以保证整个系统的实时性。但是该系统物体边界分割常包含背景,仍有改善空间。Ran等人提出RS-SLAM,一种使用RGB-D相机解决动态环境不良影响的SLAM[27]。该系统采用语义分割识别动态对象,通过动态对象和可移动对象的几何关系来判断可移动对象是否移动。动态内容随后被剔除,跟踪模块对剔除过的静态背景图像帧进行ORB特征提取并估计相机位姿。
+利用深度学习得来的语义信息和几何信息结合来解决SLAM中的动态场景问题渐渐成了一种主流,但是上述大多系统只是为了恢复相机的位姿而剔除动态物体的特征点,而没有估计动态物体的位姿。同时估计相机运动和跟踪动态物体运动,将动态物体的点加入优化步骤正在发展为一种趋势。Henein等人提出一种新的基于特征的、无模型的动态SLAM算法Dynamic SLAM(Dynamic SLAM: The Need For Speed)[28]。该方法借助语义分割跟踪场景中刚体物体的运动,并提取运动物体的速度,有效性在各种虚拟和真实数据集上得到了验证。Ballester等人提出的DOT SLAM(Dynamic Object Tracking for Visual SLAM)[29]主要工作在前端,结合实例分割为动态对象生成掩码,通过最小化光度重投影误差跟踪物体。AirDOS被卡内基梅隆大学Yuheng Qiu等人提出[30],将刚性和运动约束引入以建模铰接物体,通过联合优化相机位姿、物体运动和物体三维结构,来纠正相机位姿估计。VDO SLAM[31]利用Mask R-CNN掩码和光流区分动静点,将动态环境下的SLAM表示为整体的图优化,同时估计相机位姿和物体位姿。
+总体来说,目前动态场景下的视觉SLAM问题的解决需要借助几何信息和深度学习的语义信息,语义信息提供更准确的物体,几何信息提供物体真实的运动状态,两者结合来估计相机运动和跟踪物体。
+
+ 参考文献
+ 孔德磊, 方正.
基于事件的视觉传感器及其应用综述[J]. 信息与控制, 2021, 50(1): 1-19. KONG D L, FANG Z. A review of event-based vision sensors and their applications[J]. Information and Control, 2021, 50(1): 1-19. + J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 611 - 625, Mar. 2016. + Mur-Artal R , JD Tardós. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras[J]. IEEE Transactions on Robotics, 2017. + Campos C, Elvira R, Rodriguez J, et al. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual - Inertial, and Multimap SLAM[J]. IEEE Transactions on Robotics: A publication of the IEEE Robotics and Automation Society, 2021, 37(6): 1874-1890. + Tong, Qin, Peiliang, et al. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator[J]. IEEE Transactions on Robotics, 2018. + Smith R, Self M, Cheeseman P. Estimating Uncertain Spatial Relationships in Robotics [J]. Machine Intelligence & Pattern Recognition, 1988, 5(5):435-461. + Grisetti G, Stachniss C, Burgard W. Improved Techniques for Grid Mapping With Rao-Blackwellized Particle Filters [J]. IEEE Transactions on Robotics, 2007, 23(1):34-46. + Kalman R E. A New Approach To Linear Filtering and Prediction Problems[J]. Journal of Basic Engineering, 1960, 82D:35-45.DOI:10.1115/1.3662552. + Davison A J, Reid I D, Molton N D, et al. MonoSLAM: Real-Time Single Camera SLAM[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6):1052-1067. + Klein G, Murray D. Parallel Tracking and Mapping for Small AR Workspaces[C]. IEEE and ACM International Symposium on Mixed and Augmented Reality. IEEE, 2007:1-10. + ENGEL J, SCHOPS T, CREMERS D, LSD-SLAM: Large-scale direct monocular SLAM[C]. European Conference on Computer Vision(ECCV), 2014:834 - 849. + FORSTER C, PIZZOLI M, SCARAMUZZA D. SVO: Fast semi-direct monocular visual odometry[C]. 
Hong Kong, China: IEEE International Conference on Robotics and Automation (ICRA), 2014: 15-22. + MURARTAL R, MONTIEL J M, TARDOS J D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System[J]. IEEE Transactions on Robotics, 2015, 31(5):1147-1163. + Rublee E,Rabaud V,Konolige K,et al.ORB:An efficient alternative to SIFT or SURF[C].2011 International conference on computer vision. IEEE, 2011:2564-2571. + TONG Q, PEILIANG L, SHAOJIE S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator[J]. IEEE Transactions on Robotics, 2017,99:1-17. + QIN T, PAN J, CAO S, et al. A general optimization-based framework for local odometry estimation with multiple sensors[J]. ArXiv, 2019:1901.03638. + WANG R, WAN W, WANG Y, et al. A new RGB-D SLAM method with moving object detection for dynamic indoor scenes[J]. Remote Sensing, 2019, 11(10): 1143. + Fang Y, Dai B. An improved moving target detecting and tracking based on Optical Flow technique and Kalman filter[J]. IEEE, 2009.DOI:10.1109/ICCSE.2009.5228464. + Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, Real-Time Object Detection[C]//Computer Vision & Pattern Recognition. IEEE, 2016.DOI:10.1109/CVPR.2016.91. + Badrinarayanan V, Kendall A, Cipolla R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(12): 2481-2495. + Gkioxari G, He K, Piotr Dollár, et al. Mask R-CNN [J]. IEEE transactions on pattern analysis and machine intelligence, 2020, 42(2): 386-397. + Zhong F, Wang S, Zhang Z, et al. Detect-SLAM: Making Object Detection and SLAM Mutually Beneficial[C]//2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018.DOI:10.1109/WACV.2018.00115. + LIU Y, MIURA J. RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods[J]. IEEE Access, 2021, 9: 23772-23785. + C. Yu, et al. DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments[A]. 
//2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 1168-1174. + B. Bescos, J. M. Fácil, J. Civera and J. Neira, DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes[J]. IEEE Robotics and Automation Letters, 2018, 3(4)4076-4083. + Runz M, Buffier M, Agapito L. MaskFusion: real-time recognition, tracking and reconstruction of multiple moving objects[J]. 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2018, pp. 10-20. + T. Ran, L. Yuan, J. Zhang, D. Tang and L. He. RS-SLAM: A Robust Semantic SLAM in Dynamic Environments Based on RGB-D Sensor[J]. IEEE Sensors Journal, 2021, vol. 21, no. 18, pp. 20657-20664. + M. Henein, J. Zhang, R. Mahony and V. Ila. Dynamic SLAM: The Need For Speed[C]. 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020: 2123-2129. + Ballester I, Fontan A, Civera J, et al. DOT: Dynamic Object Tracking for Visual SLAM[J]. 2020.DOI:10.48550/arXiv.2010.00052. + Qiu Y, Wang C, Wang W, et al. AirDOS: Dynamic SLAM benefits from articulated objects[C]//2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022: 8047-8053. + Zhang J, Henein M, Mahony R, et al. VDO-SLAM: A Visual Dynamic Object-aware SLAM System[J]. 2020.DOI:10.48550/arXiv.2005.11052. 
diff --git a/docker/wbw-slam/Dockerfile b/docker/wbw-slam/Dockerfile new file mode 100644 index 0000000..5826354 --- /dev/null +++ b/docker/wbw-slam/Dockerfile @@ -0,0 +1,3 @@ +FROM nvidia/cuda:11.1.1-devel-ubuntu18.04 + +WORKDIR /root \ No newline at end of file diff --git a/docker/wbw-slam/run.txt b/docker/wbw-slam/run.txt new file mode 100644 index 0000000..5f082c8 --- /dev/null +++ b/docker/wbw-slam/run.txt @@ -0,0 +1,10 @@ + +docker run --name wbw-slam --gpus=all -it -v /home/wbw/data:/data -e DISPLAY -e WAYLAND_DISPLAY -e XDG_RUNTIME_DIR -e PULSE_SERVER -p 8080:5901 -p 8081:20 wbw-slam /bin/bash + + +docker run --name wbw-docker --gpus=all -it -v /home/wbw/data:/data -e DISPLAY -e WAYLAND_DISPLAY -e XDG_RUNTIME_DIR -e PULSE_SERVER -p 8083:5901 -p 8084:20 wbw-docker /bin/bash +// 启动docker +docker start wbw-slam +docker exec -it wbw-slam bash + + diff --git a/docker/wbw_docker_export.tar b/docker/wbw_docker_export.tar new file mode 100644 index 0000000..4425e4a Binary files /dev/null and b/docker/wbw_docker_export.tar differ diff --git a/动态slam/06_tar.txt b/动态slam/06_tar.txt new file mode 100644 index 0000000..6480af9 --- /dev/null +++ b/动态slam/06_tar.txt @@ -0,0 +1,1101 @@ +2.220446e-16 0.0 -1.110223e-16 0.0 0.0 0.0 1.0 +1.198998 -0.01401751 -0.02820321 -0.00035986356967330357 6.49054580573347e-05 -0.00034352059264449364 0.9999998741395396 +2.384181 -0.02787365 -0.0560812 -0.0007155286094661527 0.00012929831495668684 -0.0006830403431447477 0.9999995023781983 +3.575628 -0.0418032 -0.08410629 -0.0010730197134110313 0.00019427286969106458 -0.0010243109083292836 0.9999988808363287 +4.767291 -0.05573543 -0.1121362 -0.0014305218763574284 0.00025949949677692467 -0.0013655992458171734 0.9999980107009575 +5.962283 -0.06970674 -0.1402442 -0.0017889690255053106 0.00032515025427644703 -0.0017077967747923602 0.9999968886438168 +7.152769 -0.08362554 -0.1682458 -0.002146010347958631 0.00039079257280655213 -0.0020486601625632228 0.9999955224461207 +8.3454 -0.09756958 
-0.1962977 -0.002503641285496061 0.00045679391386660456 -0.0023900930922495997 0.9999939052687492 +9.536762 -0.1114989 -0.2243193 -0.002860836791111922 0.0005229646662447424 -0.0027311180076956434 0.9999920415259517 +10.72832 -0.1254307 -0.2523452 -0.0032180356859406976 0.0005893864406748412 -0.00307215346553219 0.9999899293208067 +11.91983 -0.1393621 -0.2803697 -0.0035751674902217787 0.000656045413982819 -0.0034131317238107425 0.99998756907957 +13.11191 -0.1533004 -0.3084073 -0.0039324143727152705 0.0007229763697410812 -0.003754227693729973 0.9999849594852883 +14.30348 -0.1671408 -0.3362948 -0.004221199569135658 0.0007775326433186569 -0.00407839981343691 0.9999824716324521 +15.49659 -0.181814 -0.3654237 -0.005111837105190033 0.0009433322831451939 -0.0045504186947271635 0.9999761361829173 +16.68758 -0.1950729 -0.3924445 -0.004975825921469052 0.0009150478953510533 -0.004775125564258965 0.9999758007169943 +17.88104 -0.2081761 -0.4191514 -0.004690864663502994 0.0007505806471832133 -0.005072272635004193 0.9999758519422941 +19.07663 -0.2220687 -0.4450814 -0.00452155249754079 0.0005498399673708305 -0.005316702723358468 0.999975492655282 +20.27019 -0.2372951 -0.4715143 -0.004640924132021398 0.0010122291741614564 -0.005662007298249387 0.9999726890713838 +21.46468 -0.2531559 -0.4977817 -0.004512736565285698 0.001532656673956377 -0.0059070602522334785 0.9999711959908588 +22.65883 -0.2707637 -0.5247441 -0.004814594710778253 0.0012356936499101766 -0.006313858871244178 0.9999677134413534 +23.85296 -0.2887908 -0.5527301 -0.004932776240605864 0.0011715336446188685 -0.0066750169286301396 0.99996486907115 +25.04684 -0.3071147 -0.5812326 -0.004937456600778406 0.0015563988969339428 -0.007130184190804922 0.9999611790555641 +26.2431 -0.3273512 -0.6069839 -0.005068149234804969 0.0013898935019328099 -0.007546006475853068 0.9999577190289864 +27.44168 -0.3472488 -0.6329432 -0.004630086244265993 0.00030384521350993474 -0.007903069481223837 0.9999580048543197 +28.63712 -0.3677985 -0.6565642 
[Data file hunk: an estimated camera trajectory, one pose per line. Each record appears to hold seven floats — a translation `tx ty tz` (meters) followed by a unit quaternion `qx qy qz qw` — with the diff's `+` markers fused into the numbers. The x coordinate runs from roughly 29.8 m out to about 300 m and back to about 204 m while the orientation swings through a sign flip, consistent with a KITTI-style outdoor sequence containing a U-turn. The raw numeric dump is omitted here.]
0.9992680312879243 0.00801044838439257 +202.8318 -18.86476 -3.487637 0.021174881430153225 0.03058095317804986 0.9992760964230868 0.007982030855542112 +201.3684 -18.83712 -3.453039 0.021124371307022355 0.03107980016490761 0.999261581774203 0.00800613194433103 +199.9091 -18.80904 -3.419171 0.020607555749851448 0.03127398480729472 0.9992660267562021 0.008042032775541967 +198.4404 -18.78081 -3.384783 0.021672492836287034 0.03153766324617793 0.9992369222704941 0.007826367140790371 +196.9762 -18.75193 -3.350644 0.02306236433968062 0.031203896157916872 0.9992192043325783 0.007444857862896781 +195.5153 -18.72357 -3.313855 0.023258874433074096 0.030653896388685417 0.999235370337627 0.006930949611976875 +194.0556 -18.69718 -3.276436 0.02317002280227061 0.030301485906365448 0.9992518182678782 0.006385427436590589 +192.6001 -18.67123 -3.241513 0.02346119080822332 0.029714656261177388 0.9992651499946745 0.005980947689184805 +191.1441 -18.64803 -3.205048 0.023224155583236688 0.029785742997909946 0.9992715385898586 0.0054626253546038225 +189.6915 -18.62849 -3.170622 0.022658616048869536 0.030257681669795155 0.9992725195483534 0.005048910199743968 +188.2429 -18.61037 -3.138275 0.022510447108165545 0.030590471171176144 0.9992671524487181 0.004760344687756483 +186.7945 -18.59411 -3.107703 0.022851676297995772 0.03052139626880785 0.9992626676056516 0.004512913672016687 +185.3525 -18.57686 -3.075839 0.022518697605652595 0.03044984439649871 0.9992741691173024 0.004104895772007059 +183.9085 -18.56047 -3.044236 0.023017013356528906 0.03033499155987721 0.9992674522317885 0.0038163193521951735 +182.4687 -18.546 -3.011131 0.023823700779981416 0.030199034492800717 0.9992539519695486 0.0034625236588046947 +181.0316 -18.5334 -2.977267 0.02423744096870438 0.03022528659470345 0.9992450470605335 0.002883475428866504 +179.601 -18.52315 -2.942176 0.02347640473901481 0.030159817346966798 0.9992675649625606 0.0018914153225483793 +178.1753 -18.51524 -2.908289 0.023016280783132115 0.029990339247609174 
0.999284726967864 0.000929956712429785 +176.7507 -18.51072 -2.873542 0.023780609083700135 0.03028526567883062 0.999258362147097 -0.00010484976047998332 +175.3373 -18.51025 -2.838653 0.024738464534040685 0.03069574191832978 0.9992219902528396 -0.001092701118936926 +173.9296 -18.51326 -2.803213 0.0242545878484337 0.031221241069351188 0.9992158059418094 -0.0021730693620919425 +172.5375 -18.51829 -2.769552 0.02425821295745384 0.031271267854314916 0.9992114929481748 -0.003168480893753708 +171.1562 -18.52671 -2.736025 0.0250488779073001 0.03136462349309932 0.9991852596249269 -0.004198935220834025 +169.7919 -18.5371 -2.701364 0.025796812057990635 0.031381707132243905 0.9991606762944607 -0.005258886945601355 +168.4471 -18.55169 -2.66695 0.026470642977990275 0.031760200072460075 0.9991247668812122 -0.006347830830475623 +167.1265 -18.5681 -2.633527 0.026024062297567818 0.031864017951324576 0.999127463745663 -0.0071933115245263295 +165.8218 -18.58704 -2.599237 0.026074075627066713 0.03231374257544871 0.9991054154704629 -0.00802081034280045 +164.539 -18.60503 -2.565877 0.027586061535681623 0.031870672041445045 0.9990744664427101 -0.008572046126235605 +163.2792 -18.62542 -2.533595 0.027597007859322946 0.03207829949938859 0.9990651825177727 -0.008840188865049841 +162.0471 -18.64444 -2.502804 0.026629970379085335 0.03191381038025019 0.9990967600749096 -0.008832745464512359 +160.8371 -18.66211 -2.473852 0.026295752834982433 0.03154891962315254 0.9991178111638803 -0.008763473543656152 +159.6532 -18.68207 -2.447586 0.0264193804752979 0.032025006560499 0.9991002767735243 -0.00866326965807682 +158.4929 -18.70043 -2.422217 0.026374453977839333 0.032073678303046714 0.999100893367633 -0.00854822844109998 +157.3598 -18.71715 -2.396857 0.026866295100109426 0.03186084793932756 0.9990964357850154 -0.008330699722081851 +156.2503 -18.73344 -2.370798 0.02776022648957001 0.03210023090101984 0.9990669475862852 -0.008011194790538483 +155.1689 -18.74892 -2.347373 0.02826871028042105 
0.03179497690246805 0.9990644346677727 -0.007759822473265262 +154.1173 -18.76231 -2.322647 0.02784507625035552 0.03120077420428958 0.9990969115287831 -0.0075182971018481135 +153.0881 -18.7779 -2.297427 0.02730757137658647 0.031507736461417205 0.9991035849763122 -0.007320216941544603 +152.0873 -18.79239 -2.274461 0.02679782718875282 0.03169946924723332 0.9991136619568395 -0.006993611270287818 +151.1111 -18.80437 -2.252345 0.02685430385795093 0.031242079083768106 0.999128206630834 -0.006753189874563879 +150.1621 -18.81642 -2.23022 0.0273810117466828 0.03164133379689621 0.9991028722265826 -0.006523564966686604 +149.2397 -18.82738 -2.209938 0.02768833306280547 0.03152381083267158 0.9990997585604373 -0.006267216867543746 +148.3393 -18.83766 -2.187753 0.027792238598300274 0.03168043749014281 0.9990930248127053 -0.006088441896496239 +147.4612 -18.84754 -2.164801 0.027947388762334897 0.03181068737728931 0.9990866341811456 -0.005737685120396552 +146.5989 -18.85476 -2.144196 0.027731939727689257 0.031087466741466792 0.9991172726388596 -0.005402262972555021 +145.7538 -18.86279 -2.123511 0.026551665327942925 0.03157355688470166 0.9991349757405659 -0.005236394356830439 +144.9223 -18.87376 -2.100874 0.025425258852973837 0.033058263051751795 0.9991167499498191 -0.005140761224859557 +144.1037 -18.88239 -2.078341 0.024922391447189305 0.0337588067461529 0.9991070825346446 -0.004924936622906657 +143.3024 -18.88627 -2.059412 0.025089634212012858 0.03318016455697401 0.9991221000843047 -0.004961457266586779 +142.5198 -18.89195 -2.042113 0.024894964307351115 0.033289590101182376 0.9991240352417781 -0.004817275704915314 +141.7522 -18.90126 -2.02299 0.023968331428663225 0.034396103828676125 0.9991096082939945 -0.004734738065980972 +140.9929 -18.90872 -2.005121 0.023645379415746976 0.03468304329170034 0.999107278160899 -0.0047570228327925585 +140.2422 -18.91348 -1.989403 0.02397671251025961 0.034521250141487145 0.9991040211948229 -0.00495533835417049 +139.4994 -18.91945 -1.974562 
0.024442852650577027 0.034547652274933796 0.9990903014999241 -0.0052512974952304845 +138.7654 -18.92677 -1.960534 0.0241958185938109 0.034919341099560186 0.9990819328745414 -0.005522081438946455 +138.044 -18.93511 -1.944374 0.023722443194805664 0.03559006450576434 0.9990681289454304 -0.005785043026765156 +137.3298 -18.94438 -1.928158 0.023192079348272228 0.036168984047620596 0.9990585457830419 -0.005996177644146775 +136.6214 -18.95464 -1.914991 0.02217098677411288 0.03688971776178573 0.9990548781475247 -0.006078364792657228 +135.917 -18.96431 -1.905325 0.021063482823542937 0.03712568189943476 0.9990689686938006 -0.006261727233763424 +135.2161 -18.97372 -1.895274 0.019935109855169827 0.037642794299362896 0.999070494572484 -0.006615006204863639 +134.5169 -18.97911 -1.881347 0.018844870039049364 0.03687984862499836 0.9991184395825156 -0.006862311914212331 +133.8197 -18.98311 -1.86792 0.018126984206735988 0.03596404583750702 0.9991632921573653 -0.0071214784856745645 +133.1238 -18.99097 -1.854861 0.018710991735742536 0.03642347913938374 0.999134377485489 -0.007329712295337043 +132.4286 -19.00389 -1.83902 0.020110912698156687 0.03802807344188823 0.9990464122711656 -0.0074621008340527905 +131.7402 -19.0082 -1.829069 0.020053505909874352 0.0361708060608633 0.9991150949724138 -0.007652234176665759 +131.0599 -19.00738 -1.823556 0.018992035475059215 0.033236204064814485 0.9992361143189863 -0.007864169922492802 +130.3774 -19.01169 -1.813487 0.01675769123051702 0.03185692307348292 0.9993220364698615 -0.007731989565544769 +129.6926 -19.02656 -1.799997 0.012599234690348696 0.03382434662554971 0.9993192573122885 -0.0076285532958875754 +128.9952 -19.02976 -1.790406 0.011665965829619809 0.03188758113580042 0.9993943370857702 -0.00761881955842492 +128.2867 -19.02968 -1.783236 0.012789432504612045 0.029354733145144783 0.999460901685562 -0.007255071334634666 +127.5639 -19.03291 -1.773597 0.014438909179784454 0.027944264466700987 0.9994816654842141 -0.006858304893669716 +126.8317 
-19.03689 -1.762349 0.01590780058308891 0.02708734905266844 0.9994882734305193 -0.006033960290473813 +126.0904 -19.04093 -1.751742 0.016518778374032095 0.027058867072917964 0.9994844861039371 -0.0050704744658064335 +125.3386 -19.0431 -1.739549 0.01611505755149827 0.026979920054611688 0.9994995163377945 -0.003620176027960146 +124.5754 -19.0406 -1.721883 0.015065238704164492 0.026555697344912303 0.9995321347868356 -0.001828947725667791 +123.7928 -19.03398 -1.704964 0.015106712511449157 0.02557775902141944 0.9995586410497446 -0.0002976295120139031 +122.9911 -19.02609 -1.689113 0.016744772187845907 0.024919415198532097 0.9995482600743657 0.0013824372856111671 +122.1724 -19.01737 -1.671608 0.019202333718555384 0.024985624414766636 0.9994984384365563 0.003140145737203849 +121.3372 -19.00788 -1.651373 0.020569056569252318 0.025807247821167185 0.9994437421138588 0.004806893091651971 +120.4889 -18.99551 -1.630463 0.020858303511653287 0.026519855043347574 0.9994084646420767 0.006659524372107838 +119.6223 -18.9811 -1.606831 0.02014903722754052 0.02742235573136862 0.9993846217937705 0.008509314136155539 +118.7404 -18.96236 -1.583863 0.018859519419673766 0.02814283438545422 0.9993701180839945 0.010567236194718194 +117.841 -18.94089 -1.56236 0.018318159616608477 0.02924022153698594 0.9993266511738543 0.012477929570980217 +116.9209 -18.91249 -1.539442 0.018459668898010463 0.029850097244410897 0.9992784927980377 0.014515720786697316 +115.9843 -18.88164 -1.51858 0.019515878322156614 0.03059488977532522 0.9992053700550357 0.01648367878021526 +115.0276 -18.84566 -1.49574 0.020660187262335303 0.03126764912232277 0.9991230726642429 0.018670202211298004 +114.0573 -18.80738 -1.472691 0.02054495886654235 0.032789911162079896 0.9990365633120774 0.020704384964265382 +113.0721 -18.76261 -1.449607 0.019535317074973416 0.03357368713042368 0.9989822645075674 0.022926275724183674 +112.0705 -18.71096 -1.425819 0.017841269478838143 0.034154547402804186 0.9989453009583076 0.02496881431725127 
+111.0501 -18.65467 -1.402779 0.017036014523116084 0.03458562701597711 0.9988941165094898 0.02691008390558835 +110.0104 -18.59273 -1.382089 0.017109289171547814 0.03455316912624032 0.9988544658602502 0.02833913827927968 +108.9547 -18.52665 -1.362588 0.01693264828255048 0.034629487341958094 0.9988278410990891 0.02927503840997116 +107.8847 -18.45845 -1.343346 0.015689095076267683 0.034138657812213216 0.9988515951452674 0.029730375278882407 +106.7965 -18.38974 -1.32609 0.014163752742018702 0.03388643201538302 0.9988780862343754 0.0298942582159427 +105.6929 -18.32377 -1.307555 0.01258611056081788 0.0348993598543519 0.9988669807762981 0.029805691026207365 +104.5622 -18.25676 -1.288139 0.014145652248754784 0.035636256833428866 0.9988288184861546 0.029511846484277297 +103.4111 -18.18725 -1.269054 0.017470700286308514 0.03541989789627457 0.9987882410976148 0.029364177319591213 +102.2464 -18.11737 -1.244731 0.018629272873207516 0.03571170201320184 0.9987628302632492 0.029162191551388866 +101.074 -18.04504 -1.219976 0.01744909448200695 0.0352511125567185 0.99880683180503 0.028944790634409533 +99.88171 -17.972 -1.195663 0.01658254821139547 0.0348143408666297 0.9988483040502866 0.02854901505665557 +98.67175 -17.89756 -1.172886 0.016853473542105462 0.034119511118000524 0.998875240719814 0.02828555934691294 +97.4478 -17.82503 -1.151222 0.01652730356736687 0.03407543165729259 0.9988861870345944 0.028144245338090106 +96.20855 -17.75374 -1.126795 0.016094603855796582 0.034499433343690385 0.9988883511662479 0.027799545501030566 +94.95275 -17.68421 -1.102525 0.017594865975409008 0.03465248203317205 0.9988740592944434 0.02720734917056519 +93.67749 -17.6127 -1.077492 0.018879023913128925 0.03404039738146705 0.9988936801246587 0.026386542198915917 +92.39323 -17.54366 -1.049972 0.01783898504921177 0.033754326667827246 0.998946839501164 0.025448534225984924 +91.09028 -17.4771 -1.022258 0.017861599565137335 0.033378954888393625 0.9989893419756295 0.02423021359030954 +89.77477 -17.40999 
-0.9990056 0.01759510951991081 0.03179172149444711 0.999076047362708 0.022951038135777653 +88.45413 -17.35065 -0.9753403 0.0127243991599314 0.032880638642939775 0.9991485476962082 0.021427386798419767 +87.10974 -17.29334 -0.9584784 0.01058582552613299 0.032848196770890487 0.9992022031036463 0.020097103762752533 +85.73751 -17.23975 -0.9440549 0.014633289070184654 0.0328658943625804 0.9991772661129686 0.01872139743048158 +84.35171 -17.18834 -0.9218292 0.018014392448969525 0.03254208390262956 0.9991565029131237 0.017400492130561992 +82.95643 -17.13847 -0.8925921 0.018332181018760964 0.03207941586399653 0.9991878424745282 0.01607786266944726 +81.55322 -17.09087 -0.8621183 0.017347030545706705 0.031618194566973866 0.9992387523378959 0.01487568922021156 +80.12646 -17.04828 -0.8330614 0.01688041039901714 0.031592092895822564 0.9992643484648684 0.013702309962358264 +78.68382 -17.00912 -0.8053784 0.017142455159685724 0.03202005218862654 0.9992601668059312 0.012647984974623617 +77.22778 -16.97362 -0.7771295 0.017235651620196524 0.03286491204083717 0.9992429022856969 0.011681271396236509 +75.75752 -16.9389 -0.7478602 0.018060018547462774 0.033005860933920106 0.9992345617777074 0.010711648938839484 +74.28435 -16.90538 -0.7187593 0.01924826332501231 0.032639717980615914 0.9992357335476563 0.009596977159197677 +72.80871 -16.87364 -0.688618 0.020718903269736515 0.032003440041364004 0.9992364080085627 0.008550308939642658 +71.33452 -16.84487 -0.6588975 0.022468459423008583 0.03145133269328972 0.9992242766907536 0.007538360220424405 +69.86187 -16.82112 -0.6266962 0.02296843833251295 0.031734216681613775 0.9992101918672692 0.00666204178499671 +68.39401 -16.79851 -0.5936207 0.022348110417726823 0.031752269054467826 0.999227873589278 0.006001000991465331 +66.92462 -16.77917 -0.561778 0.02190569508238906 0.03203674411330987 0.9992321510296022 0.00537549051994799 +65.45693 -16.76019 -0.5300473 0.022041416274824612 0.0320556117030206 0.9992310349637248 0.0048941283461601355 +63.98723 
-16.74127 -0.4983233 0.022482203694010574 0.03157241831676313 0.9992382364091458 0.004547506639104229 +62.52254 -16.72374 -0.4662392 0.022297839520469837 0.03145889690606758 0.999247363431623 0.004225024629815623 +61.05928 -16.70738 -0.4342216 0.020953053874894732 0.03155055353437131 0.9992748630967826 0.003908975300134383 +59.5885 -16.69261 -0.4028967 0.021432400922989238 0.03196114999092366 0.9992523922002994 0.0037139956802225396 +58.11853 -16.67728 -0.3696661 0.023605948203460767 0.03217945640862144 0.9991965584960969 0.0036714144708727223 +56.651 -16.66067 -0.331875 0.02507685086812287 0.032301081587404516 0.9991565305375241 0.003744217279942751 +55.18855 -16.64289 -0.2928006 0.02478496109910151 0.032096912270468035 0.9991690012333274 0.004098890144043372 +53.73519 -16.62286 -0.2553499 0.022959449003049522 0.0322271527188174 0.9992062673810589 0.004594513633270268 +52.27704 -16.60088 -0.2238776 0.02121339438337727 0.031655548726466386 0.999260987580889 0.005039527004388365 +50.82074 -16.57777 -0.1950878 0.02081957476746884 0.03156922148182963 0.9992700430735101 0.005413924399021327 +49.36683 -16.55446 -0.1667537 0.02158214721224716 0.03120525050211799 0.9992633982799533 0.005753618062442453 +47.91138 -16.53154 -0.1361189 0.022953101762249843 0.03179027899926177 0.9992124568145327 0.00608271545500048 +46.46111 -16.5059 -0.1018661 0.024675224454393126 0.0314925888852821 0.9991786487142963 0.006432580889569971 +45.01382 -16.47825 -0.06682977 0.02472132493202827 0.030701987899528674 0.9992010893993585 0.006589914700324977 +43.57573 -16.45171 -0.03390796 0.022437376361668253 0.030651172100623337 0.9992566491202318 0.006574114379042892 +42.13696 -16.42535 -0.004286154 0.020726594949216557 0.030546729942434268 0.9992981309478974 0.006367969509501153 +40.69789 -16.40183 0.02267275 0.021001770904048672 0.030870655058956448 0.9992839663757428 0.006122321413316164 +39.26177 -16.38003 0.0513988 0.022497541519638856 0.03147846625834446 0.9992334080805372 
0.005963469070024312 +37.82483 -16.35679 0.08060552 0.023228067411014696 0.031226470374165982 0.9992257254607267 0.0057718289826668955 +36.39684 -16.33245 0.1118254 0.02250045242605646 0.030618250998587244 0.9992629247103558 0.005464398019547973 +34.97003 -16.30758 0.1414094 0.022174631964380017 0.029698863073291143 0.9992994353892304 0.005186681035470868 +33.54441 -16.28474 0.1706823 0.02180090562587617 0.029256171808690112 0.9993221330180895 0.004906259794169456 +32.12471 -16.264 0.1998713 0.021281782282354354 0.02965809474313164 0.9993222151430415 0.004753260007471801 +30.70783 -16.24608 0.2279988 0.021197430188197974 0.0305886406435069 0.999296361803354 0.004667472932874278 +29.28474 -16.22987 0.257281 0.02193508218178756 0.03185060010102543 0.9992412759475635 0.004611278084026782 +27.8676 -16.21276 0.2860073 0.02241450163483873 0.032119587809520296 0.9992222010670172 0.004573301891851275 +26.45737 -16.19488 0.3156601 0.02175028512100806 0.03212492318950204 0.9992368064694642 0.00455181325737315 +25.05017 -16.17482 0.3455316 0.021719966807737755 0.031319824916383694 0.9992637301045899 0.004394235603513558 +23.6461 -16.15363 0.3743586 0.02177700809460173 0.030318814551547783 0.9992941943846936 0.0042005323094430344 +22.24839 -16.13407 0.4041376 0.021690267848528785 0.02983658906915148 0.9993112047980764 0.004052924671710977 +20.85659 -16.11623 0.4323654 0.022358126233984067 0.029738221631291552 0.9992997576927033 0.0039682037268662 +19.47255 -16.10044 0.4615578 0.022546588372237032 0.03016287589520217 0.9992831237960457 0.003884683631975494 +18.08843 -16.08658 0.4926781 0.022879252550762643 0.03083294620971204 0.9992555497772618 0.0037703408452715863 +16.70842 -16.07258 0.5236603 0.02344267119055308 0.031005322197163798 0.9992372306911201 0.003750728414993696 +15.33299 -16.0573 0.5539333 0.022206131788650676 0.03084187319580101 0.9992711645157402 0.0035785941535179285 +13.9623 -16.04039 0.5822458 0.01996555246639013 0.029762906047671987 0.999351994500645 
0.0033372482514966123 +12.59501 -16.01956 0.6052059 0.018917466223599216 0.027744174473193668 0.9994306056228737 0.0032946621641649645 +11.22152 -15.99987 0.6277813 0.019788318875534772 0.026113804138032593 0.999457318106204 0.003400140213976403 +9.852998 -15.98184 0.653636 0.02127842360597966 0.02558553613324546 0.9994398876271741 0.0035383681832939595 +8.487382 -15.96753 0.6821976 0.022036424904129655 0.026199100972797127 0.999407572902363 0.0035364263133636537 +7.122797 -15.95764 0.7126679 0.02335073301968515 0.028292302656660193 0.9993201101952943 0.0036886633205085576 +5.763789 -15.94956 0.7443265 0.024123592959603073 0.030624874681817738 0.9992325823894714 0.0037967887366320337 +4.400318 -15.93877 0.7787213 0.02370667102759514 0.03186990508351989 0.9992035593295723 0.0038144359833160485 +3.038918 -15.9259 0.8125999 0.02261539179059871 0.03256271033684962 0.9992062063046354 0.0038950265691465873 +1.676097 -15.90918 0.8446651 0.021258459834429633 0.0322291167211062 0.9992473672972325 0.003749782713040033 +0.3089439 -15.88963 0.8754178 0.020977440131755745 0.031132416533832676 0.9992886346768396 0.0035978121101940736 +-1.063172 -15.87121 0.9045038 0.02120005384129237 0.030274248959713732 0.9993106802166793 0.00349170043520793 +-2.436523 -15.85397 0.9332548 0.021010641559918834 0.029882175192407958 0.9993269135128069 0.0033657801440120975 +-3.813595 -15.83843 0.9601236 0.020081961197486647 0.029976573187819173 0.9993433878663041 0.003303486381958705 +-5.192326 -15.8232 0.9870965 0.01983754468867219 0.029732496687530603 0.9993557772399603 0.0032374308588353153 +-6.57647 -15.80825 1.014088 0.02016940192868718 0.029641428314405006 0.9993519118759732 0.0032151490102785454 +-7.966034 -15.79123 1.041956 0.01997172000516704 0.028999546107997036 0.9993748684068092 0.0031668789743476223 +-9.357139 -15.7739 1.070725 0.02003228102174497 0.028023911207160537 0.9994021787500078 0.0029416368668567137 +-10.75522 -15.75586 1.10059 0.02142410515067714 0.026793509992555448 
0.9994073292974124 0.0028470488905359644 +-12.14993 -15.73852 1.129671 0.021524247167213226 0.025942627041639046 0.9994279866375609 0.002717795366556129 +-13.54746 -15.72233 1.159922 0.01989364826624101 0.025711534374179886 0.999468258141441 0.0025220480760894354 +-14.95103 -15.70577 1.188722 0.019397472755654607 0.02511019535051045 0.9994932234030673 0.002551962284026683 +-16.35914 -15.6909 1.214563 0.01998608408548817 0.025005214902775408 0.9994837969780853 0.0027267652747028043 +-17.76435 -15.67519 1.238904 0.01895043108745312 0.024419538103863825 0.9995183016580017 0.0027806422095335817 +-19.17351 -15.65947 1.258378 0.017022736651195892 0.023915868376489846 0.9995648891171041 0.0028792570246226512 +-20.58415 -15.64353 1.279114 0.01658493110512659 0.023634129016058443 0.999578630267338 0.002988296967018817 +-22.00393 -15.62986 1.298206 0.017683239452542385 0.024026221113070238 0.9995500189513215 0.0031310311245522932 +-23.42439 -15.61782 1.320134 0.01802599062003311 0.025475784341950875 0.9995072022618314 0.003376492418953509 +-24.84504 -15.60601 1.342475 0.017633933632639976 0.02641089625638073 0.9994889997928118 0.0036398127340520267 +-26.26783 -15.59027 1.362532 0.017786385760014848 0.02618142911227566 0.9994915536780354 0.00384855524862023 +-27.68798 -15.57331 1.381668 0.01677390674864704 0.02592692104158377 0.9995158300178613 0.0038126580487955163 +-29.11536 -15.55401 1.401278 0.016970533299113983 0.025374670282571284 0.999526490611797 0.00386285812058895 +-30.54783 -15.53676 1.423844 0.01917137839318833 0.025160159233112357 0.9994909993428874 0.004143292220662105 +-31.97309 -15.5214 1.450596 0.019473236832053733 0.026350486911963598 0.999453242839156 0.004433989747421006 +-33.39613 -15.50772 1.474487 0.018726599220297437 0.02772535915903161 0.9994282617512766 0.0048753003976000145 +-34.81999 -15.49326 1.496909 0.018981946751468973 0.028701502352172368 0.9993942073857205 0.005208426261371306 +-36.24448 -15.47662 1.519635 0.01964774288568167 
0.029379684309609456 0.9993607201314444 0.005380651236316782 +-37.66558 -15.46088 1.542757 0.019781490227387555 0.030372490986008575 0.9993279027192167 0.005472411031745277 +-39.08217 -15.4441 1.566849 0.020001712960150516 0.031487421412858964 0.9992881553226935 0.005626402331582135 +-40.49901 -15.42668 1.593713 0.020757082478989063 0.03238579255820387 0.9992430215213566 0.005804128548465199 +-41.91388 -15.40927 1.621146 0.021249464955299818 0.033318806847166614 0.9992017660966201 0.005843627194630488 +-43.32259 -15.39245 1.647343 0.01964238352135774 0.034462536599135174 0.9991958087625201 0.0058520160770826985 +-44.72465 -15.37523 1.670185 0.017051001337550532 0.035666195890867204 0.9992004022547727 0.0059784578246959525 +-46.13041 -15.35699 1.68978 0.016608936862169767 0.035966311942438496 0.99919657418428 0.006064137198846929 +-47.53795 -15.33783 1.710257 0.018053001601641444 0.03605490068880934 0.9991680777327859 0.006106202523719637 +-48.94398 -15.3171 1.733637 0.01847335397119575 0.03584828380730592 0.9991678065725514 0.006109832249914256 +-50.3466 -15.29715 1.757778 0.018031664255783928 0.03563600980960708 0.9991839822782795 0.006025234230722762 +-51.7485 -15.27696 1.78328 0.01782837191249632 0.03572858743511733 0.9991842109751501 0.0060439831905350315 +-53.15272 -15.25676 1.808697 0.018626423689654233 0.03538846959055769 0.9991819028389982 0.006019767405006135 +-54.54794 -15.23749 1.833987 0.017503132561295327 0.03565532029888657 0.9991946894177722 0.005684287488074441 +-55.9495 -15.21771 1.859642 0.01695787510770998 0.03515499424430761 0.9992240975302719 0.005268753773696039 +-57.35212 -15.19945 1.883021 0.017594767595326233 0.03504146716665346 0.9992198002411145 0.004723403241530531 +-58.75141 -15.1841 1.905888 0.017662865381489343 0.03547946801371712 0.9992048103640424 0.004356315109943966 +-60.15069 -15.17008 1.927806 0.016917646229211037 0.03596909253507103 0.9992020965793958 0.003897155322317364 +-61.54936 -15.15697 1.949198 0.016823974755130706 
0.03579683442344435 0.9992112436651946 0.0035257696683517717 +-62.95149 -15.14226 1.969588 0.017355930777542074 0.035494011054918305 0.9992143379564117 0.003107035308496088 +-64.35076 -15.12801 1.98995 0.017598764570644767 0.035123844491605695 0.9992243966730685 0.002684050494812029 +-65.75074 -15.11338 2.0103 0.01779289052417658 0.03453695642166122 0.9992423614112957 0.0023053956494129358 +-67.14653 -15.10121 2.031551 0.01782577430234554 0.03452251891710764 0.9992429876595055 0.0019720728803227926 +-68.53878 -15.09141 2.052437 0.017877857689930682 0.03446333141376637 0.9992444583174647 0.0017814359713769997 +-69.93462 -15.08303 2.071234 0.017547919459178888 0.03493758517116097 0.999233869901623 0.0017632091448127108 +-71.32326 -15.07453 2.08736 0.016138877110256838 0.03518659999175084 0.9992485575316847 0.0019390970805598185 +-72.70839 -15.06438 2.102811 0.015865155127753102 0.03518040276968461 0.9992529273982056 0.002055042542537473 +-74.09527 -15.05383 2.117081 0.016578892826746322 0.0355901001433936 0.9992262708075182 0.0023118850582934804 +-75.47624 -15.04316 2.130723 0.017473000220101877 0.03638263620895787 0.9991815733111143 0.0026798509886250123 +-76.85642 -15.03084 2.146554 0.018677554235864532 0.03686579788627925 0.9991404581916425 0.003225944601030539 +-78.22646 -15.01678 2.16346 0.01844121998633743 0.0372325448700663 0.9991291409882855 0.003823432479455407 +-79.59325 -14.99833 2.179823 0.016618801621000802 0.03715396722466116 0.999162750777298 0.004146759131404652 +-80.9536 -14.9793 2.191194 0.0151963130783136 0.03623283264719445 0.9992185168911063 0.004313862436875343 +-82.31728 -14.95957 2.200916 0.015895614549677736 0.035570199134069634 0.99923124619619 0.004359701464547636 +-83.68053 -14.94128 2.214391 0.018267599964107605 0.03561772335770302 0.9991890140379014 0.004357384574706263 +-85.03495 -14.92467 2.232509 0.019607651622533532 0.035873387083454375 0.9991543058300487 0.004394682955970009 +-86.38401 -14.90817 2.250863 0.01952424618373312 
[Trajectory data truncated: several hundred estimated camera poses, one record per frame, each consisting of seven floating-point fields — a translation `tx ty tz` (meters) followed by a unit orientation quaternion `qx qy qz qw`.]
-1.889354863450131e-05 -0.004351795534977472 0.9999818361340093 +202.9698 -1.705984 -4.106859 -0.004302080236342997 -0.0002223496994798571 -0.00440959252899705 0.9999809988994688 +204.3226 -1.717412 -4.139885 -0.004650438396579134 -0.00022901769832201367 -0.00437144195339372 0.9999796055264636 +205.6689 -1.727574 -4.171862 -0.0047003260383536045 7.488944897706169e-05 -0.004468454665033961 0.9999789668986094 +207.0122 -1.737778 -4.204811 -0.004547744961394204 0.0002842958301201656 -0.004560212970367722 0.9999792206087645 +208.3506 -1.746442 -4.238781 -0.0043302180110784015 0.0006647169719032937 -0.004623150447977305 0.9999797167159238 +209.6908 -1.757196 -4.270954 -0.004342836745440385 0.00029333588551749476 -0.004659892437938883 0.9999796693560962 +211.0251 -1.768399 -4.303818 -0.004589617163271408 -5.707627434395361e-05 -0.0047714764019583175 0.9999780823445777 +212.3505 -1.780005 -4.335485 -0.004375833986804754 9.594370218112898e-05 -0.00491358737916557 0.9999783495310253 +213.6668 -1.792488 -4.366042 -0.004250008268879994 1.1779266817557034e-05 -0.005213913102728992 0.9999773759446363 +214.9667 -1.802728 -4.397364 -0.00367742042841126 -0.0006789744767759679 -0.005686916723456945 0.9999768370071545 +216.2478 -1.814957 -4.428604 -0.0036294134657329107 -0.0017031546393686627 -0.006451628642203999 0.9999711511388878 +217.5072 -1.826714 -4.458316 -0.0031482807164717533 -0.002834793741774322 -0.007217102659677795 0.999964982237964 +218.7426 -1.842513 -4.486884 -0.003435214585678441 -0.004063584524082948 -0.007935874814574395 0.9999543531944313 +219.9508 -1.859734 -4.515014 -0.0039300710813878455 -0.004085445906694353 -0.008660391541802908 0.999946429210776 +221.1287 -1.87789 -4.540479 -0.00395087016309762 -0.0035890410384786754 -0.009396653603507065 0.999941604450197 +222.2852 -1.895168 -4.569335 -0.003735797193859575 -0.0038084560044598648 -0.01052837578893123 0.9999303439667363 +223.4108 -1.917244 -4.598297 -0.0037313498758303194 -0.003685824613298926 
-0.012197227546154965 0.9999118557979055 +224.5051 -1.94167 -4.625178 -0.0033207431614200303 -0.0034470566273997177 -0.014079084168578322 0.9998894288142252 +225.5672 -1.971966 -4.651833 -0.0037599798244474084 -0.0032941522967861146 -0.015841660926882634 0.9998620169260573 +226.6035 -2.004581 -4.679495 -0.0040773511427864536 -0.003122249980256115 -0.017276859240585436 0.9998375552546024 +227.6035 -2.040151 -4.706339 -0.005226128672949349 -0.0032506052258088345 -0.018345993069284548 0.999812755311244 +228.5742 -2.073778 -4.731016 -0.005212419363128533 -0.0038746585121623043 -0.019035507984075262 0.9997977131107999 +229.5125 -2.106014 -4.755443 -0.005216993161700781 -0.004797530796961254 -0.019722096010096724 0.9997803786880254 +230.4146 -2.139617 -4.778138 -0.005141651558309187 -0.005032250135780837 -0.02043325243995957 0.9997653335020918 +231.2868 -2.17419 -4.799162 -0.00530925356053649 -0.0049554000143027304 -0.021290832861494188 0.9997469461185626 +232.1263 -2.207888 -4.818374 -0.004981601111739464 -0.004739977517867859 -0.022262015998466524 0.9997285226036005 +232.9333 -2.243669 -4.838575 -0.005550045971521183 -0.003936008022376021 -0.0231508606885917 0.9997088288496503 +233.7189 -2.280068 -4.860372 -0.0063447770776162505 -0.002673314816609195 -0.023462522754625032 0.9997010089110221 +234.4921 -2.314971 -4.881633 -0.007146112175537706 -0.0006653583406530917 -0.02284895862670062 0.9997131665976638 +235.2654 -2.346879 -4.903083 -0.007988122722651531 0.0010221613278802427 -0.021012535715947217 0.9997467771512812 +236.0372 -2.372559 -4.923693 -0.007864207592357068 0.0015428968877048278 -0.018579346375119852 0.9997952698409869 +236.8075 -2.391165 -4.943065 -0.007098149161328073 0.001190372377912095 -0.016449689050093152 0.9998387905168716 +237.5762 -2.40634 -4.961684 -0.0062441453524815504 0.0006183670391566793 -0.01439790129737417 0.9998766567478476 +238.3384 -2.42154 -4.979606 -0.005589043939301585 9.646465343892369e-05 -0.01224686722376506 0.999909379656786 
+239.1021 -2.438172 -4.997815 -0.00559643574668683 -3.758471129258567e-05 -0.009968738992025146 0.9999346492332545 +239.8593 -2.449188 -5.01612 -0.005437798996150165 5.1027223651815144e-05 -0.007697949370881683 0.9999555836704864 +240.6138 -2.457471 -5.032857 -0.005459624573872094 8.40402967771624e-05 -0.0055975837956608 0.9999694257788042 +241.3652 -2.464848 -5.049016 -0.005482832435611237 -9.559489111473524e-05 -0.00369193451015952 0.9999781492761092 +242.113 -2.471118 -5.064022 -0.005381833762726796 -0.0006099678102152499 -0.0019047371648805106 0.9999835177546446 +242.8585 -2.47794 -5.076752 -0.005517103882406049 -0.0012776743337648434 -0.00032969291146410376 0.9999839100783731 +243.6012 -2.481229 -5.090455 -0.0051468151069383366 -0.0015100057853576051 0.000628169916818358 0.9999854176833475 +244.3383 -2.485941 -5.103671 -0.004914912681250389 -0.0009937161785989403 0.0011757840931717104 0.9999867367586724 +245.0752 -2.487604 -5.118722 -0.0049864540743464305 0.0003533733846354315 0.0015831422715725364 0.9999862519372771 +245.8164 -2.493249 -5.131839 -0.005436862077213867 0.002254829121294674 0.0019588136774827034 0.9999807594775831 +246.5623 -2.49473 -5.147361 -0.005435868951980338 0.003359009094162382 0.0021825887687127 0.999977202086582 +247.3157 -2.497272 -5.164438 -0.005023549763908254 0.0034056833291604316 0.0019626207028765373 0.9999796564874747 +248.0723 -2.498606 -5.182585 -0.004732006477907364 0.0032021965786981237 0.0016913882667059876 0.999982246471154 +248.8329 -2.503085 -5.199298 -0.0052410017653757345 0.003077582257910577 0.0013698330982699835 0.9999805917842726 +249.5991 -2.506486 -5.218219 -0.0054122897028467225 0.003828302688367552 0.001092788923830244 0.9999774282607917 +250.3665 -2.508505 -5.237784 -0.0045826722319296295 0.00504629388391841 0.0012753301768844022 0.9999759534939786 +251.1395 -2.505563 -5.256841 -0.003225485166293301 0.005940301457055155 0.001670844741192291 0.9999757583771179 +251.9216 -2.503453 -5.273084 -0.0026299286256550535 
0.004852364238184509 0.002278267107665933 0.9999821736089649 +252.7116 -2.508541 -5.289235 -0.004711754018578084 0.0019099650784319506 0.002742785956896514 0.9999833141271219 +253.5088 -2.507121 -5.306929 -0.004824560076717001 -0.0010845146864033746 0.0034486573509592703 0.9999818269400883 +254.3037 -2.498934 -5.321532 -0.0038145462024468456 -0.0009598637215940122 0.004397340355187277 0.9999825954968946 +255.099 -2.491981 -5.336196 -0.003994993128337885 0.002062161742000586 0.005960648679795896 0.9999721287046807 +255.9 -2.485561 -5.351123 -0.004796561029817668 0.0045169030643075715 0.007517807922964464 0.9999500353282799 +256.7068 -2.475778 -5.3702 -0.005612387631943082 0.004923281742983772 0.008675215758401782 0.9999344993715804 +257.5259 -2.461593 -5.39272 -0.005562715006197702 0.004126253566910604 0.009834619112070767 0.9999276526330202 +258.3481 -2.442073 -5.414196 -0.004415152221408841 0.002857985916314043 0.010871765314702513 0.9999270688736784 +259.1759 -2.418788 -5.432422 -0.0026146608160637136 0.002208208915103914 0.011740998179220752 0.9999252153656091 +260.0072 -2.398099 -5.451575 -0.0023029409173612085 0.0025559304891376883 0.012752790102114992 0.9999127612082352 +260.8436 -2.3794 -5.472421 -0.0032614361369239864 0.003720928152911217 0.013702866788221734 0.9998938689529955 +261.6878 -2.357178 -5.492976 -0.003966880638307105 0.005392550904201476 0.014410589589869767 0.9998737516108819 +262.5408 -2.331649 -5.513461 -0.004087672727956619 0.006390651196631106 0.015189674554177874 0.9998558517585874 +263.4078 -2.305211 -5.536213 -0.004538530284909171 0.0056210009082556705 0.016189706908188682 0.9998428373908917 +264.2827 -2.278446 -5.560785 -0.005112991479589846 0.004600104307079007 0.017275659270124085 0.9998271090319947 +265.167 -2.24888 -5.58502 -0.005301363320306526 0.004378555903907105 0.017989979360907418 0.9998145250183839 +266.0593 -2.217026 -5.60834 -0.005284721028860163 0.004610398592097925 0.018232529470735465 0.9998091772021138 +266.9604 
-2.181048 -5.632766 -0.004234279825377067 0.004716359282874845 0.018276174867779327 0.9998128866251313 +267.8693 -2.144562 -5.657825 -0.003517251090219671 0.004207383175084425 0.01831029273033074 0.9998173133386499 +268.7872 -2.107091 -5.684566 -0.002968226161870879 0.0041022540620476635 0.018465827821355793 0.9998166703690905 +269.7117 -2.072995 -5.710527 -0.0025951922859569535 0.0038327845091246544 0.018507780637666272 0.999818001836221 +270.6435 -2.039857 -5.735484 -0.002340291269136965 0.0033789372637785843 0.018407774202823347 0.9998221140126081 +271.578 -2.008137 -5.759967 -0.0026258219195686313 0.002477615242992138 0.01825923338381998 0.9998267684345072 +272.5156 -1.977183 -5.786935 -0.002481355510887373 0.00214908450925928 0.01781139906174545 0.9998359757350522 +273.4511 -1.948577 -5.813039 -0.0025282710647775426 0.0022739337321695968 0.016965970360044706 0.9998502852530207 +274.3868 -1.923694 -5.837506 -0.002650141978318813 0.0027849133271064656 0.015813394287244636 0.9998675700143351 +275.322 -1.903306 -5.862752 -0.0033837211473557623 0.003147385275300177 0.014502444650734561 0.9998841550381115 +276.2556 -1.886301 -5.888992 -0.004215415565174309 0.0032251218254860854 0.013162615702802923 0.9998992821322976 +277.1894 -1.871958 -5.914599 -0.004810677758635455 0.002941066469456641 0.01178342801298534 0.9999146755257609 +278.1217 -1.860171 -5.940087 -0.0053171323165317795 0.0022407606168909477 0.010514596028559644 0.9999280725962955 +279.0549 -1.850165 -5.965237 -0.0053460976010341575 0.001223379212258305 0.009269072653673781 0.9999420017560439 +279.9822 -1.84113 -5.989292 -0.005418973154136164 0.0005348331756265014 0.008157844894638888 0.9999518979681492 +280.9067 -1.834878 -6.011766 -0.005537934759043605 0.0001675988270577558 0.007105218367803964 0.9999594087067653 +281.8284 -1.829538 -6.033194 -0.00573560102215185 2.7408065048187213e-05 0.0061672813399026565 0.9999645327563309 +282.7472 -1.824423 -6.05424 -0.005303278331526704 7.71670197215794e-05 
0.005360757466301482 0.9999715653775245 +283.6626 -1.820378 -6.076007 -0.005150161549096235 0.0003111863453735988 0.00463752995162213 0.9999759358679711 +284.5738 -1.819188 -6.097253 -0.005172886674369616 0.000436649540538892 0.003903946143233418 0.9999789046700652 +285.4729 -1.820078 -6.118356 -0.005560881546158259 4.9562236454426417e-05 0.003171319289468756 0.999979508227033 +286.3559 -1.821293 -6.140076 -0.005503369270736444 -0.0004938601576802682 0.0025464655903669575 0.9999814920996347 +287.2165 -1.824665 -6.162957 -0.006128368382255615 -0.00029061201298678676 0.0020643132449836817 0.9999790484087434 +288.0602 -1.831132 -6.186044 -0.007361025916745047 0.0007905429200520044 0.0017763957526850526 0.9999710169587289 +288.8905 -1.835812 -6.20897 -0.008631048904592532 0.0023079725538300106 0.0016563028942316459 0.9999587166069507 +289.7169 -1.838573 -6.232745 -0.009461461581921977 0.0034262367643627064 0.0016376433357821012 0.99994802853482 +290.5416 -1.841599 -6.257807 -0.010316437835547839 0.0037337323458844305 0.0018110642228688506 0.9999381732884965 +291.3657 -1.843827 -6.281585 -0.011303544920266848 0.0040174851343701915 0.002067771690342757 0.9999259042577442 +292.1857 -1.844053 -6.305431 -0.011881819979992308 0.004415544077263771 0.0025246228799578434 0.9999164723134522 +293.0004 -1.843243 -6.332348 -0.012858453071016393 0.0049163189828790995 0.00329451677193526 0.9998998130570474 +293.8122 -1.840209 -6.360716 -0.013458276917570025 0.0050742597201222026 0.004083078672120882 0.9998882213223936 +294.6199 -1.836186 -6.387429 -0.013941011877324477 0.004491014587157527 0.004697940818840592 0.9998816971661582 +295.4284 -1.829836 -6.414405 -0.01370061734422586 0.0035193935137974555 0.005148825886130237 0.9998866918534702 +296.2317 -1.823832 -6.440313 -0.01312004963372362 0.002423255966845758 0.005081527885411473 0.9998980799073861 +297.0309 -1.81794 -6.46419 -0.012496453601313236 0.0013795939142901106 0.004445965805628529 0.999911080424693 +297.8283 -1.814744 
-6.486179 -0.012110832418403357 0.0005711841122432597 0.0036839357952715675 0.9999197118288545 +298.6243 -1.812035 -6.506511 -0.011683835588696677 0.00017611145831383709 0.0029392593505399717 0.9999274062276525 +299.4226 -1.809517 -6.524789 -0.011108068369766778 -0.0003836599889766203 0.002219462785821458 0.9999357667405682 +300.2232 -1.807621 -6.541554 -0.010223099399367618 -0.0006464734259319517 0.0012326256396453285 0.9999467740559057 diff --git a/动态slam/2020年-2022年开源动态SLAM.zip b/动态slam/2020年-2022年开源动态SLAM.zip new file mode 100644 index 0000000..6b87990 Binary files /dev/null and b/动态slam/2020年-2022年开源动态SLAM.zip differ diff --git a/动态slam/2020年-2022年开源动态SLAM/2020-2022年开源动态SLAM.docx b/动态slam/2020年-2022年开源动态SLAM/2020-2022年开源动态SLAM.docx new file mode 100644 index 0000000..82b382c --- /dev/null +++ b/动态slam/2020年-2022年开源动态SLAM/2020-2022年开源动态SLAM.docx @@ -0,0 +1,38 @@ + 2020-2023年开源的动态SLAM论文 +一、2020年 +1.Zhang J, Henein M, Mahony R, et al. VDO-SLAM: a visual dynamic object-aware SLAM system[J]. arXiv preprint arXiv:2005.11052, 2020. +https://github.com/halajun/vdo_slam +2.Bescos B, Cadena C, Neira J. Empty cities: A dynamic-object-invariant space for visual SLAM[J]. IEEE Transactions on Robotics, 2020, 37(2): 433-451. +https://github.com/bertabescos/EmptyCities_SLAM +3.Vincent J, Labbé M, Lauzon J S, et al. Dynamic object tracking and masking for visual SLAM[C]//2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020: 4974-4979. +https://github.com/introlab/dotmask + +二、2021年 +1.Liu Y, Miura J. RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods[J]. Ieee Access, 2021, 9: 23772-23785. + https://github.com/yubaoliu/RDS-SLAM/ + 2.Bao R, Komatsu R, Miyagusuku R, et al. Stereo camera visual SLAM with hierarchical masking and motion-state classification at outdoor construction sites containing large dynamic objects[J]. Advanced Robotics, 2021, 35(3-4): 228-241. 
+https://github.com/RunqiuBao/kenki-positioning-vSLAM
+3. Wimbauer F, Yang N, Von Stumberg L, et al. MonoRec: Semi-supervised dense reconstruction in dynamic environments from a single moving camera[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 6112-6122.
+https://github.com/Brummi/MonoRec
+4. Wang W, Hu Y, Scherer S. TartanVO: A generalizable learning-based VO[C]//Conference on Robot Learning. PMLR, 2021: 1761-1772.
+https://github.com/castacks/tartanvo
+5. Zhan H, Weerasekera C S, Bian J W, et al. DF-VO: What should be learnt for visual odometry?[J]. arXiv preprint arXiv:2103.00933, 2021.
+https://github.com/Huangying-Zhan/DF-VO
+
+III. 2022
+1. Liu J, Li X, Liu Y, et al. RGB-D inertial odometry for a resource-restricted robot in dynamic environments[J]. IEEE Robotics and Automation Letters, 2022, 7(4): 9573-9580.
+https://github.com/HITSZ-NRSL/Dynamic-VINS
+2. Song S, Lim H, Lee A J, et al. DynaVINS: A visual-inertial SLAM for dynamic environments[J]. IEEE Robotics and Automation Letters, 2022, 7(4): 11523-11530.
+https://github.com/url-kaist/dynavins
+3. Wang H, Ko J Y, Xie L. Multi-modal semantic SLAM for complex dynamic environments[J]. arXiv preprint arXiv:2205.04300, 2022.
+https://github.com/wh200720041/MMS_SLAM
+4. Qiu Y, Wang C, Wang W, et al. AirDOS: Dynamic SLAM benefits from articulated objects[C]//2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022: 8047-8053.
+https://github.com/haleqiu/AirDOS
+5. Cheng S, Sun C, Zhang S, et al. SG-SLAM: A real-time RGB-D visual SLAM toward dynamic scenes with semantic and geometric information[J]. IEEE Transactions on Instrumentation and Measurement, 2022, 72: 1-12.
+https://github.com/silencht/SG-SLAM
+6. Esparza D, Flores G. The STDyn-SLAM: A stereo vision and semantic segmentation approach for VSLAM in dynamic outdoor environments[J]. IEEE Access, 2022, 10: 18201-18209.
+https://github.com/DanielaEsparza/STDyn-SLAM
+7. Shen S, Cai Y, Wang W, et al.
DytanVO: Joint refinement of visual odometry and motion segmentation in dynamic environments[C]//2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023: 4048-4055. +https://github.com/castacks/DytanVO + + diff --git a/动态slam/2020年-2022年开源动态SLAM/2020年/Dynamic object tracking and masking for visual SLAM.pdf b/动态slam/2020年-2022年开源动态SLAM/2020年/Dynamic object tracking and masking for visual SLAM.pdf new file mode 100644 index 0000000..3578993 --- /dev/null +++ b/动态slam/2020年-2022年开源动态SLAM/2020年/Dynamic object tracking and masking for visual SLAM.pdf @@ -0,0 +1,381 @@ + Dynamic Object Tracking and Masking for Visual SLAM + + Jonathan Vincent, Mathieu Labbe´, Jean-Samuel Lauzon, Franc¸ois Grondin, + Pier-Marc Comtois-Rivet, Franc¸ois Michaud + +arXiv:2008.00072v1 [cs.CV] 31 Jul 2020 Abstract— In dynamic environments, performance of visual the proposed method. Our research hypothesis is that a + SLAM techniques can be impaired by visual features taken deep learning algorithm can be used to semantically segment + from moving objects. One solution is to identify those objects object instances in images using a priori semantic knowledge + so that their visual features can be removed for localization and of dynamic objects, enabling the identification, tracking and + mapping. This paper presents a simple and fast pipeline that removal of dynamic objects from the scenes using extended + uses deep neural networks, extended Kalman filters and visual Kalman filters to improve both localization and mapping in + SLAM to improve both localization and mapping in dynamic vSLAM. By doing so, the approach, referred to as Dynamic + environments (around 14 fps on a GTX 1080). 
Results on the Object Tracking and Masking for vSLAM (DOTMask)1 + dynamic sequences from the TUM dataset using RTAB-Map aims at providing six benefits: 1) increased visual odometry + as visual SLAM suggest that the approach achieves similar performance; 2) increased quality of loop closure detection; + localization performance compared to other state-of-the-art 3) produce 3D maps free of dynamic objects; 4) tracking of + methods, while also providing the position of the tracked dynamic objects; 5) modular and fast pipeline. + dynamic objects, a 3D map free of those dynamic objects, better + loop closure detection with the whole pipeline able to run on a The paper is organized as follows. Section II presents re- + robot moving at moderate speed. lated work of approaches taking into consideration dynamic + objects during localization and during mapping. Section III + I. INTRODUCTION describes our approach applied as a pre-processing module + to RTAB-Map [5], a vSLAM approach. Section IV presents + To perform tasks effectively and safely, autonomous mo- the experimental setup, and Section V provides comparative + bile robots need accurate and reliable localization from their results on dynamic sequences taken from the TUM dataset. + representation of the environment. Compared to LIDARs + (Light Detection And Ranging sensors) and GPS (Global II. RELATED WORK + Positioning System), using visual images for Simultaneous + Localization and Mapping (SLAM) adds significant infor- Some approaches take into consideration dynamic objects + mation about the environment [1], such as color, textures, during localization. For instance, BaMVO [6] uses a RGB- + surface composition that can be used for semantic interpre- D camera to estimate ego-motion. It uses a background + tation of the environment. 
Standard visual SLAM (vSLAM) model estimator combined with an energy-based dense visual + techniques perform well in static environments by being odometry technique to estimate the motion of the camera. Li + able to extract stable visual features from images. However, et al. [7] developed a static point weighting method which + in environments with dynamic objects (e.g., people, cars, calculates a weight for each edge point in a keyframe. This + animals), performance decreases significantly because visual weight indicates the likelihood of that specific edge point + features may come from those objects, making localization being part of the static environment. Weights are determined + less reliable [1]. Deep learning architectures have recently by the movement of a depth edge point between two frames + demonstrated interesting capabilities to achieve semantic seg- and are added to an Intensity Assisted Iterative Closest Point + mentation from images, outperforming traditional techniques (IA-ICP) method used to perform the registration task in + in tasks such as image classification [2]. For instance, Segnet SLAM. Sun et al. [8] present a motion removal approach to + [3] is commonly used for semantic segmentation [4]. It uses increase the localization reliability in dynamic environments. + an encoder and a decoder to achieve pixel wise semantic It consists of three steps: 1) detecting moving objects’ motion + segmentation of a scene. based on ego-motion compensated using image differencing; + 2) using a particle filter for tracking; and 3) applying a + This paper introduces a simple and fast pipeline that Maximum-A-Posterior (MAP) estimator on depth images + uses neural networks, extended Kalman filters and vSLAM to determine the foreground. This approach is used as the + algorithm to deal with dynamic objects. Experiments con- frontend of Dense Visual Odometry (DVO) SLAM [9]. Sun + ducted on the TUM dataset demonstrate the robustness of et al. 
[10] uses a similar foreground technique but instead + of using a MAP they use a foreground model which is + This work was supported by the Institut du ve´hicule innovant (IVI), updated on-line. All of these approaches demonstrate good + Mitacs, InnovE´ E´ and NSERC. J. Vincent, M. Labbe´, J.-S. Lauzon, localization results using the Technical University of Munich + F. Grondin and F. Michaud are with the Interdisciplinary Institute for (TUM) dataset [11], however, mapping is yet to be addressed. + Technological Innovation (3IT), Dept. Elec. Eng. and Comp. Eng., + Universite´ de Sherbrooke, 3000 boul. de l’Universite´, Que´bec (Canada) 1https://github.com/introlab/dotmask + J1K 0A5. P.-M. Comtois-Rivet is with the Institut du Ve´hicule Innovant + (IVI), 25, boul. Maisonneuve, Saint-Je´roˆme, Que´bec (Canada), J5L 0A1. + {Jonathan.Vincent2, Mathieu.m.Labbe, Jean-Samuel.Lauzon, + Francois.Grondin2, Francois.Michaud}@USherbrooke.ca, + Pmcrivet@ivisolutions.ca + Depth Image RGB Image Instance segmentation Dynamic is then applied to the original depth image, resulting in a + DOS Object masked depth image (MDI). The DOS is also sent to the + Classes Tracking module. After computing a 3D centroid for each + masked object, the Tracking module predict the position and + MDI velocity of the objects. This information is then used by the + Moving Object Classification module (MOC) to classify the +MO-MDI Tracking/MOC Camera object as idle or not based on its class, its estimated velocity + World and its shape deformation. Moving objects are removed + Pose from the original depth image, resulting in the Moving + Object Masked Depth Image (MO-MDI). The original RGB + vSLAM image, the MDI and the MO-MDI are used by the vSLAM + algorithm. It uses the depth images as a mask for feature + Odometry extraction thus ignoring features from the masked regions. 
+ The MO-MDI is used by the visual odometry algorithm of + Map the vSLAM approach while the MDI is used by both its + mapping and loop closure algorithms, resulting in a map free + Fig. 1: Architecture of DOTMask of dynamic objects while still being able to use the features + of the idle objects for visual odometry. The updated camera + SLAM++ [12] and Semantic Fusion [13] focus on pose is then used in the Tracking module to estimate the +the mapping aspect of SLAM in dynamic environments. position and velocity of the dynamic objects resulting in a +SLAM++ [12] is an object-oriented SLAM which achieves closed loop. +efficient semantic scene description using 3D object recog- +nition. SLAM++ defines objects using areas of interest A. Instance Segmentation +to subsequently locate and map them. However, it needs +predefined 3D object models to work. Semantic Fusion Deep learning algorithms such as Mask R-CNN recently +[13] creates a semantic segmented 3D map in real time proved to be useful to accomplish instance semantic seg- +using RGB-CNN [14], a convolutional deep learning neural mentation [4]. A recent and interesting architecture for +network, and a dense SLAM algorithm. However, SLAM++ fast instance segmentation is the YOLACT [18] and its +and Semantic Fusion do not address SLAM localization update YOLACT++ [19]. This network aims at providing +accuracy in dynamic environments, neither do they remove similar results as the Mask-RCNN or the Fully Convolutional +dynamic objects in the 3D map. Instance-aware Semantic Segmentation (FCIS) [20] but at a + much lower computational cost. YOLACT and YOLACT++ + Other approaches use deep learning algorithm to provide can achieve real-time instance segmentation. Development in +improved localisation and mapping. Fusion++ [15] and MID- neural networks has been incredibly fast in the past few years +Fusion [16] uses object-level octree-based volumetric repre- and probably will be in the years to come. 
sentation to estimate both the camera pose and the object positions. They use deep learning techniques to segment object instances. DynaSLAM [17] proposes to combine multi-view geometry models and deep-learning-based algorithms to detect dynamic objects and to remove them from the images prior to a vSLAM algorithm. They also use inpainting to recreate the image without object occlusion. DynaSLAM achieves impressive results on the TUM dataset. However, these approaches are not optimized for real-time operation.

III. DYNAMIC OBJECT TRACKING AND MASKING FOR VSLAM

The objective of our work is to provide a fast and complete solution for visual SLAM in dynamic environments. Figure 1 illustrates the DOTMask pipeline. As a general overview of the approach, a set of objects of interest (OOI) is defined using a priori knowledge and understanding of the dynamic object classes that can be found in the environment. Instance segmentation is done using a neural network trained to identify the object classes from an RGB image. For each dynamic object instance, its bounding box, class type and binary mask are grouped for convenience and referred to as the dynamic object state (DOS). The binary mask of the DOS […]

DOTMask was designed to be modular and can easily change the neural network used in the pipeline. In its current state, DOTMask works with Mask-RCNN, YOLACT and YOLACT++. YOLACT is much faster than the other two and its loss in precision does not impact our results, which is why this architecture is used in our tests. The Instance Segmentation module takes the input RGB image and outputs the bounding box, class and binary mask for each instance.

B. Tracking Using EKF

Using the DOS from the Instance Segmentation module and odometry from vSLAM, the Tracking module predicts the pose and velocity of the objects in the world frame. This is useful when the camera is moving at a speed similar to the objects to track (e.g., moving cars on the highway, a robot following a pedestrian) or when idle objects have a high amount of features (e.g., a person wearing a plaid shirt).

First, the Tracking module receives the DOS and the original depth image as a set, defined as Dk = {dk1, ..., dkL}, where dki = (Tk, Bki, ζki) is the object instance detected by the Instance Segmentation module, with i ∈ I, I = {1, ..., L}, L being the total number of object detections in the frame at time k. T ∈ R^(m×n) is the depth image, B ∈ Z2^(m×n) is the binary mask and ζ ∈ J is the class ID, with J = {1, ..., W}, W being the total number of classes trained in the Instance Segmentation module.

The DOS and the original depth image are used by the EKF to estimate the dynamic objects' positions and velocities. The EKF provides steady tracking of each object instance corresponding to the object type detected by the neural network. An EKF is instantiated for each new object, and a priori knowledge from the set of dynamic object classes defines some of the filter's parameters. This instantiation is made using the following parameters: the class of the object, its binary mask and its 3D centroid position. The 3D centroid is defined as the center of the corresponding bounding box. If the tracked object is observed in the DOS, its position is updated accordingly; otherwise its position predicted by the EKF is used. If no observations of the object are made for a given number of frames, the object is considered removed from the scene and therefore the filter is discarded. The Tracking module outputs the estimated velocity of the objects to the MOC module. The MOC module classifies the objects as idle or not based on the object class, the filter velocity estimation and the object deformation.

To explain further how the Tracking module works, the following subsections present in more detail the Prediction and Update steps of the EKF used by DOTMask.

1) Prediction: Let us define the hidden state x ∈ R^(6×1) as the 3D position and velocity of an object referenced in the global map in Cartesian coordinates. The a priori estimate of the state at time k ∈ N is predicted based on the previous state at time k − 1 as in (1):

    x̂k|k−1 = F x̂k−1|k−1   with   F = [ I3  ∆tI3 ; 03  I3 ]   (1)

where F ∈ R^(6×6) is the state transition matrix, ∆t ∈ R+ is the time between each prediction, 03 is a 3 × 3 zero matrix and I3 is a 3 × 3 identity matrix. Note that the value of ∆t is redefined before each processing cycle.

The a priori estimate of the state covariance Pk|k−1 ∈ R^(6×6) at time k is predicted based on the previous state at time k − 1 as given by (2):

    Pk|k−1 = F Pk−1|k−1 Fᵀ + Q   (2)

where Q ∈ R^(6×6) is the process noise covariance matrix defined using the random acceleration model (3):

    Q = Γ Σ Γᵀ   with   Γ = [ (∆t²/2) I3   ∆t I3 ]ᵀ   (3)

where Γ ∈ R^(6×3) is the mapping between the random acceleration vector a ∈ R³ and the state x, and Σ ∈ R^(3×3) is the covariance matrix of a. The acceleration components ax, ay and az are assumed to be uncorrelated.

The dynamics of every detected object may vary greatly depending on its class. For instance, a car does not have the same dynamics as a mug. To better track different types of objects, a covariance matrix is defined for each class to better represent its respective process noise.

2) Update: In the EKF, the Update step starts by evaluating the innovation ỹk defined as (4):

    ỹk = zk − ĥ(x̂k|k−1)   (4)

where zk ∈ R³ is a 3D observation of a masked object in reference to the camera for each object instance, with z = [zx zy zz]ᵀ, zx = (µx − Cx)zz/fx and zy = (µy − Cy)zz/fy, where Cx and Cy are the principal point coordinates and fx and fy are the focal lengths expressed in pixels. zz is approximated using the average depth of the masked region on the depth image. The expressions µx and µy stand for the center of the bounding box.

To simplify the following equations, (s, c) represent respectively the sine and cosine of the Euler angles φ, θ, ψ (roll, pitch, yaw). h(xk) ∈ R⁴ is the observation function which maps the true state space xk to the observed state space zk; ĥ(xk) is the first three terms of h(xk). However, in our case, the transform between those spaces is not linear, justifying the use of the EKF. The non-linear rotation used to transform the estimated state x̂k into the observed state zk follows the (x, y, z) Tait-Bryan convention and is given by h(x̂k) = [hφ hθ hψ 1], where:

    hφ = (cφcθ)x̂x + (cφsθsψ − cψsφ)x̂y + (sφsψ + cφcψsθ)x̂z + cx
    hθ = (cθsφ)x̂x + (cφcψ + sφsθsψ)x̂y + (cψsφsθ − cφsψ)x̂z + cy
    hψ = −(sθ)x̂x + (cθsψ)x̂y + (cθcψ)x̂z + cz   (5)

and cx, cy and cz are the coordinates of the camera referenced to the world, derived using the vSLAM odometry.

The innovation covariance Sk ∈ R^(3×3) is defined as follows, where Hk ∈ R^(3×6) stands for the Jacobian of h(x̂k):

    Sk = Hk Pk|k−1 Hkᵀ + Rk   (6)

where Rk ∈ R^(3×3) is the covariance of the observation noise; its diagonal terms stand for the imprecision of the RGB-D camera. The near-optimal Kalman gain Kk ∈ R^(6×3) is defined as follows:

    Kk = Pk|k−1 Hkᵀ (Sk)⁻¹   (7)

Finally, the updated state estimate x̂k|k and the covariance estimate are given respectively by (8) and (9):

    x̂k|k = x̂k|k−1 + Kk ỹk   (8)

    Pk|k = (I6 − Kk Hk) Pk|k−1   (9)

C. Moving Object Classification

The MOC module classifies dynamic objects as either moving or idle. It takes as inputs the dynamic object's class, velocity and mask. The object velocity comes from the Tracking module estimation; the object class and mask are directly obtained from the DOS. The object class defines if the object is rigid or not. The deformation of a non-rigid object is computed using the intersection over union (IoU) of the masks of the object at time k and k − 1.
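The per-object constant-velocity filter of equations (1)-(9) can be sketched numerically. This is a minimal illustration, not the DOTMask source code: the camera rotation R and position c are treated as known inputs, the observation model is written compactly as h(x) = R·x_pos + c following (5), and all names are ours.

```python
import numpy as np

def predict(x, P, Sigma, dt):
    """A priori state and covariance, eqs (1)-(3), random-acceleration model."""
    I3 = np.eye(3)
    F = np.block([[I3, dt * I3], [np.zeros((3, 3)), I3]])  # transition matrix (1)
    Gamma = np.vstack([0.5 * dt**2 * I3, dt * I3])         # acceleration mapping (3)
    Q = Gamma @ Sigma @ Gamma.T                            # process noise (3)
    return F @ x, F @ P @ F.T + Q                          # (1) and (2)

def update(x, P, z, R_cam, c, R_noise):
    """Update step: innovation (4), gain (6)-(7), posterior (8)-(9)."""
    H = np.hstack([R_cam, np.zeros((3, 3))])  # Jacobian of h: only position observed
    y = z - (R_cam @ x[:3] + c)               # innovation (4), h from (5)
    S = H @ P @ H.T + R_noise                 # innovation covariance (6)
    K = P @ H.T @ np.linalg.inv(S)            # Kalman gain (7), K in R^(6x3)
    x_new = x + K @ y                         # state update (8)
    P_new = (np.eye(6) - K @ H) @ P           # covariance update (9)
    return x_new, P_new
```

Running predict with the class-specific Σ from Table I and then update with the 3D centroid observation pulls the position estimate toward the measurement while the velocity block accumulates the object's motion, which is what the MOC module later thresholds.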
The IoU algorithm takes two arbitrary convex shapes Mk−1, Mk and is defined as IoU = |Mk ∩ Mk−1| / |Mk ∪ Mk−1|, where | · | denotes the cardinality of the set. A dynamic object is classified as moving if its velocity is higher than a predefined threshold or if it is a non-rigid object with an IoU above another predefined threshold. The original depth image is then updated, resulting in the MO-MDI. The MO-MDI is sent to the vSLAM odometry to update the camera pose.

IV. EXPERIMENTAL SETUP

To test our DOTMask approach, we chose to use the TUM dataset because it presents challenging indoor dynamic RGB-D sequences with ground truth to evaluate visual odometry techniques. Also, TUM is commonly used to compare with other state-of-the-art techniques. We used sequences in low dynamic and highly dynamic environments.

For our experimental setup, ROS is used as a middleware to make the interconnections between the input images, the segmentation network, the EKF and RTAB-Map. The deep learning library PyTorch is used for the instance segmentation algorithm. The ResNet-50-FPN backbone is used for the YOLACT architecture because this configuration achieves the best results at a higher framerate [18]. Our Instance Segmentation module is based on the implementation of YOLACT by dbolya² and its pre-trained weights. The network is trained on all 91 classes of the COCO dataset. The COCO dataset is often used to compare state-of-the-art instance segmentation approaches, which is why we chose to use it in our trials. In our tests, person, chair, cup and bottle are the OOI used because of their presence in the TUM dataset and in our in-house tests. The RTAB-Map library [5] is also used, which includes various state-of-the-art visual odometry algorithms, a loop closure detection approach and a 3D map renderer.

Table I presents the parameters used for DOTMask in our trials, based on empirical observations in the evaluated TUM sequences and our understanding of the nature of the objects. A probability threshold p and a maximum instance number m are used to reduce the number of object instances to feed into the pipeline. Only detections with a score above p are used and, at maximum, m object detections are processed. This provides faster and more robust tracking.

TABLE I: Experimental Parameters

Description                              | Value
Frames to terminate object tracking      | 10
Score threshold (s)                      | 0.1
Maximum number of observations (m)       | 5
Velocity threshold for a person          | 0.01 m/s
Velocity threshold for the other objects | 0.1 m/s
Random acceleration for a person         | 0.62 m/s²
Random acceleration for other objects    | 1.0 m/s²

V. RESULTS

Trials were conducted in comparison with the approaches by Kim and Kim [6], Sun et al. [8], Bescos et al. [17] and RTAB-Map, the latter being also used with DOTMask. Figure 2a shows two original RGB frames in the TUM dataset, along with their superimposed RGB and depth images with the features used by RTAB-Map (Fig. 2b) and with DOTMask (Fig. 2c). Using the depth image as a mask to filter outlying features, dynamic objects (i.e., humans and chairs in this case) are filtered out because the MDI includes the semantic mask. The MO-MDI is used by RTAB-Map to compute visual odometry, keeping only the features from static objects as seen in Fig. 2c (left vs right), with the colored dots representing visual features used for visual odometry. In the left image of Fig. 2c, the man on the left is classified by the Tracking module as moving, while the man on the right is classified as being idle, resulting in keeping his visual features. In the right image of Fig. 2c, the man on the right is also classified as moving because he is standing up, masking his visual features. Figure 3 illustrates the influence of the MDI, which contains the depth mask of all the dynamic objects, either idle or not, to generate a map free of dynamic objects. This has two benefits: it creates a more visually accurate 3D rendered map, and it improves loop closure detection. The differences in the 3D generated maps between RTAB-Map without and with DOTMask are very apparent: there are fewer artifacts of dynamic objects and less drifting. The fr3/walking static sequence shows improved quality in the map, while the fr3/walking rpy sequence presents some undesirable artifacts. These artifacts are caused either by the mask failing to identify dynamic objects that are tilted or upside down, or by the time delay between the RGB image and its corresponding depth image. The fr3/sitting static sequence shows the result when masking idle objects, resulting in completely removing the dynamic objects from the scene.

Table II characterizes the overall SLAM quality in terms of absolute trajectory error (ATE). In almost all cases, DOTMask improves the ATE compared to RTAB-Map alone (as seen in the last column of the table). While DynaSLAM is better in almost every sequence, DOTMask is not far off, with closer values compared to the other techniques.

TABLE II: Absolute Translational Error (ATE) RMSE in cm

TUM Seqs        | BaMVO | Sun et al. | DynaSLAM | RTAB-Map | DOTMask | Impr. (%)
fr3/sit static  | 2.48  | –          | –        | 1.70     | 0.60    | 64.71
fr3/sit xyz     | 4.82  | 3.17       | 1.5      | 1.60     | 1.80    | −12.50
fr3/wlk static  | 13.39 | 0.60       | 2.61     | 10.7     | 0.80    | 92.52
fr3/wlk xyz     | 23.26 | 9.32       | 1.50     | 24.50    | 2.10    | 91.42
fr3/wlk rpy     | 35.84 | 13.33      | 3.50     | 22.80    | 5.30    | 76.75
fr3/wlk halfsph | 17.38 | 12.52      | 2.50     | 14.50    | 4.00    | 72.41

Table III presents the number of loop closure detections (Nb loop), the mean translational error (Terr) and the mean rotational error (Rerr) on each sequence, both with and without DOTMask. In all sequences, DOTMask helps RTAB-Map make more loop closures while also lowering both mean errors. Since loop closure features are computed from the depth image (MDI), using DOTMask forces RTAB-Map to use only features from static objects, hence providing better loop closures.

TABLE III: Loop Closure Analysis

                |        RTAB-Map         |        DOTMask
TUM Seqs        | Nb loop | Terr (cm) | Rerr (deg) | Nb loop | Terr (cm) | Rerr (deg)
fr3/sit static  | 33      | 1.80      | 0.26       | 1246    | 0.60      | 0.21
fr3/sit xyz     | 288     | 2.10      | 0.42       | 1486    | 2.50      | 0.45
fr3/wlk static  | 105     | 9.00      | 0.18       | 1260    | 7.00      | 0.15
fr3/wlk xyz     | 55      | 6.5       | 0.99       | 1516    | 2.9       | 0.45
fr3/wlk halfs.  | 121     | 5.90      | 0.84       | 964     | 4.90      | 0.79
fr3/wlk rpy     | 94      | 6.7       | 1.06       | 965     | 6.00      | 1.04

On the fr3/sitting xyz sequence, RTAB-Map alone provides better performance in both ATE and loop closure detection. In this entire sequence, the dynamic objects do not move. While the MO-MDI enables features from idle dynamic objects to be used by the odometry algorithm, the MDI does not enable those same features for the loop closure algorithm. Since nothing is moving in this particular sequence, all features help to provide a better localisation. However, this case is not representative of dynamic environments.

Table IV presents the average computation time to process a frame for each approach, without the vSLAM and odometry algorithms. Results are processed on a computer equipped with a GTX 1080 GPU and an i5-8600K CPU. DOTMask was also tested on a laptop with a GTX 1050, where it achieved an average of 8 frames per second. At 70 ms, it can run on a mobile robot operating at a moderate speed. The fastest method is BaMVO with only a 42.6 ms cycle time.

TABLE IV: Timing Analysis

Approach   | Img. Res. | Avg. Time | CPU       | GPU
BaMVO      | 320×240   | 42.6 ms   | i7 3.3GHz | –
Sun et al. | 640×480   | 500 ms    | i5        | –
DynaSLAM   | 640×480   | 500 ms    | –         | –
DOTMask    | 640×480   | 70 ms     | i5-8600K  | GTX1080
DOTMask    | 640×480   | 125 ms    | i7-8750H  | GTX1050

Figure 4 shows the tracked dynamic objects in the ROS visualizer RViz. DOTMask generates ROS transforms to track the position of the objects. Those transforms could easily be used in other ROS applications. Figure 5 shows the difference between RTAB-Map and DOTMask in a real scene where a robot moves at a similar speed as the dynamic objects (chairs and humans). The pink and blue lines represent the odometry of RTAB-Map without and with DOTMask. These results suggest qualitatively that DOTMask improves the odometry and the 3D map.

VI. CONCLUSION

This paper presents DOTMask, a fast and modular pipeline that uses a deep learning algorithm to semantically segment images, enabling the tracking and masking of dynamic objects in scenes to improve both localization and mapping in vSLAM. Our approach aims at providing a simple and complete pipeline to allow mobile robots to operate in dynamic environments. Results on the TUM dataset suggest that using DOTMask with RTAB-Map provides similar performance compared to other state-of-the-art localization approaches while providing an improved 3D map, dynamic object tracking and higher loop closure detection. While DOTMask does not outperform DynaSLAM on the TUM dataset or outrun BaMVO, it reveals itself to be a good compromise for robotic applications. Because the DOTMask pipeline is highly modular, it can also evolve with future improvements of deep learning architectures and new sets of dynamic object classes. In future work, we want to use the tracked dynamic objects to create a global 3D map with object permanence, and explore more complex neural networks³ to add body keypoint tracking, which could significantly improve human feature extraction. We would also like to explore techniques to detect outlier segmentations from the neural network to improve robustness.

Fig. 2: RTAB-Map features (colored dots) not appearing on moving objects with DOTMask. (a) Original RGB image. (b) RGB and depth image superposed without DOTMask. (c) RGB and depth image superposed with DOTMask.

Fig. 3: RTAB-Map 3D rendered map from the TUM sequences, without (top) and with (bottom) DOTMask. (a) fr3/sitting static. (b) fr3/walking static. (c) fr3/walking rpy.

Fig. 4: Position of tracked dynamic objects shown in RViz.

Fig. 5: 3D map and odometry improved with DOTMask. (a) RTAB-Map alone. (b) RTAB-Map with DOTMask.

² https://github.com/dbolya/yolact
³ https://github.com/daijucug/Mask-RCNN-TF detection-human segment-body keypoint-regression

REFERENCES

[1] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha, "Visual simultaneous localization and mapping: A survey," Artificial Intelligence Review, vol. 43, no. 1, pp. 55–81, 2015.
[2] D. Ciregan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012, pp. 3642–3649.
[3] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[4] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez, "A review on deep learning techniques applied to semantic segmentation," arXiv preprint arXiv:1704.06857, 2017.
[5] M. Labbé and F. Michaud, "Online global loop closure detection for large-scale multi-session graph-based SLAM," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2014, pp. 2661–2666.
[6] D.-H. Kim and J.-H. Kim, "Effective background model-based RGB-D dense visual odometry in a dynamic environment," IEEE Trans. Robotics, vol. 32, no. 6, pp. 1565–1573, 2016.
[7] S. Li and D. Lee, "RGB-D SLAM in dynamic environments using static point weighting," IEEE Robotics and Automation Letters, vol. 2, no. 4, pp. 2263–2270, 2017.
[8] Y. Sun, M. Liu, and M. Q.-H. Meng, "Improving RGB-D SLAM in dynamic environments: A motion removal approach," Robotics and Autonomous Systems, vol. 89, pp. 110–122, 2017.
[9] C. Kerl, J. Sturm, and D. Cremers, "Dense visual SLAM for RGB-D cameras," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2013, pp. 2100–2106.
[10] Y. Sun, M. Liu, and M. Q.-H. Meng, "Motion removal for reliable RGB-D SLAM in dynamic environments," Robotics and Autonomous Systems, vol. 108, pp. 115–128, 2018.
[11] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Oct. 2012.
[12] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, "SLAM++: Simultaneous localisation and mapping at the level of objects," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2013, pp. 1352–1359.
[13] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, "SemanticFusion: Dense 3D semantic mapping with convolutional neural networks," in Proc. IEEE Int. Conf. Robotics and Automation, 2017, pp. 4628–4635.
[14] H. Noh, S. Hong, and B. Han, "Learning deconvolution network for semantic segmentation," in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 1520–1528.
[15] J. McCormac, R. Clark, M. Bloesch, A. Davison, and S. Leutenegger, "Fusion++: Volumetric object-level SLAM," in Proc. Int. Conf. 3D Vision (3DV), 2018, pp. 32–41.
[16] B. Xu, W. Li, D. Tzoumanikas, M. Bloesch, A. Davison, and S. Leutenegger, "MID-Fusion: Octree-based object-level multi-instance dynamic SLAM," in Proc. Int. Conf. Robotics and Automation (ICRA), 2019, pp. 5231–5237.
[17] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4076–4083, 2018.
[18] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT: Real-time instance segmentation," in Proc. IEEE/CVF Int. Conf. Computer Vision (ICCV), 2019.
[19] ——, "YOLACT++: Better real-time instance segmentation," 2019.
[20] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, "Fully convolutional instance-aware semantic segmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2017, pp. 2359–2367.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2020年/Empty_Cities_A_Dynamic-Object-Invariant_Space_for_Visual_SLAM.pdf b/动态slam/2020年-2022年开源动态SLAM/2020年/Empty_Cities_A_Dynamic-Object-Invariant_Space_for_Visual_SLAM.pdf
new file mode 100644
index 0000000..79db8f5
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2020年/Empty_Cities_A_Dynamic-Object-Invariant_Space_for_Visual_SLAM.pdf

IEEE TRANSACTIONS ON ROBOTICS, VOL. 37, NO. 2, APRIL 2021

Empty Cities: A Dynamic-Object-Invariant Space for Visual SLAM

Berta Bescos, Cesar Cadena, and José Neira

Fig. 1. Dynamic images are first converted one by one into static with an end-to-end deep learning model. Such images allow us to compute an accurate camera trajectory estimation that is not damaged by the dynamic objects' motion, as well as to build dense static maps that are useful for long-term applications. (a) Input of our system: urban images with dynamic content. (b) Output of our system: dynamic objects have been removed. (c) Static map built with the images preprocessed by our framework.

Abstract—In this article, we present a data-driven approach to obtain the static image of a scene, eliminating dynamic objects that might have been present at the time of traversing the scene with a camera. The general objective is to improve vision-based localization and mapping tasks in dynamic environments, where the presence (or absence) of different dynamic objects in different moments makes these tasks less robust. We introduce an end-to-end
+deep learning framework to turn images of an urban environment +that include dynamic content, such as vehicles or pedestrians, +into realistic static frames suitable for localization and mapping. +This objective faces two main challenges: detecting the dynamic +objects, and inpainting the static occluded background. The first +challenge is addressed by the use of a convolutional network that +learns a multiclass semantic segmentation of the image. The second +challenge is approached with a generative adversarial model that, +taking as input the original dynamic image and the computed +dynamic/static binary mask, is capable of generating the final static +image. This framework makes use of two new losses, one based on +image steganalysis techniques, useful to improve the inpainting +quality, and another one based on ORB features, designed to +enhance feature matching between real and hallucinated image +regions. To validate our approach, we perform an extensive +evaluation on different tasks that are affected by dynamic entities, +i.e.,visual odometry, place recognition, and multiview stereo, +with the hallucinated images. Code has been made available on +https://github.com/bertabescos/EmptyCities_SLAM. + + Index Terms—Visual SLAM, Inpainting, Dynamic objects, +GANs. + + I. INTRODUCTION + +M OST vision-based localization systems are conceived to small fractions of dynamic content, but tend to compute dynamic + work in static environments [1]–[3]. They can deal with objects motion as camera ego-motion. Thus, their performance is + compromised. Building stable maps is also of key importance for + Manuscript received April 19, 2020; revised July 10, 2020; accepted Septem- long-term autonomy. Mapping dynamic objects prevents vision- +ber 8, 2020. Date of publication November 2, 2020; date of current version based robotic systems from recognizing already visited places +April 2, 2021. 
This work was supported in part by the Spanish Ministry of and reusing precomputed maps. +Economy and Competitiveness under Project PID2019-108398GB-I00 and FPI +Grant BES-2016-077836, in part by the Aragón regional government (Grupos To deal with dynamic objects, some approaches include in +DGA T45-17R, T45-20R), in part by the EU H2020 research project under Grant their model the behavior of the observed dynamic content [4], +688652, and in part by the Swiss State Secretariat for Education, Research and [5]. Such strategy is needed when the majority of the observed +Innovation (SERI) under Grant 15.0284 and NVIDIA, through the donation of a scene is not rigid. However, when scenes are mainly rigid, as in +Titan X GPU. This article was recommended for publication by Associate Editor Fig. 1(a), the standard strategy consists of detecting the dynamic +S. Huang and Editor F. Chaumette upon evaluation of the reviewers’ comments. objects within the images and not to use them for localization and +(Corresponding author: Berta Bescos.) mapping [6]–[9]. To address mainly rigid scenes, we propose + to instead modify these images so that dynamic content is + Berta Bescos is with the Department of Computer Science and Sys- eliminated and the scene is converted realistically into static. We +tem Engineering, University of Zaragoza, 50018 Zaragoza, Spain (e-mail: consider that the combination of experience and context allows +bbescos@unizar.es). to hallucinate, i.e., inpaint, a geometrically and semantically + + Cesar Cadena is with the Mechanical and Process Engineering, ETH Zurich, +8090 Zurich, Switzerland (e-mail: cesarcadena.lerma@gmail.com). + + José Neira is with the Instituto de Investigación en Ingeniería de Aragón, +Universidad de Zaragoza, 50018 Zaragoza, Spain (e-mail: jneira@unizar.es). + + Color versions of one or more of the figures in this article are available online +at http://ieeexplore.ieee.org. 
+ + Digital Object Identifier 10.1109/TRO.2020.3031267 + +1552-3098 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. + See https://www.ieee.org/publications/rights/index.html for more information. + +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:20:49 UTC from IEEE Xplore. Restrictions apply. + 434 IEEE TRANSACTIONS ON ROBOTICS, VOL. 37, NO. 2, APRIL 2021 + +consistent appearance of the rigid and static structure behind maps (see Fig. 1), as well as for street-view imagery suppliers +dynamic objects. This hallucinated structure can be used by as a privacy measure to replace faces and license plates blurring. +the simultaneous localization and mapping (SLAM) system to We provide an extensive evaluation on robotic applications, such +provide it with robustness against dynamic objects. as visual odometry, place recognition, and mapping to prove the + validity of our framework. + Turning images that contain dynamic objects into realistic +static frames reveals several challenges. II. RELATED WORK + + 1) Detecting such dynamic content in the image. By this, we A. Dynamic Objects Detection + mean to detect not only those objects that are known to + move such as vehicles, people, and animals, but also the The vast majority of SLAM systems assume a static environ- + shadows and reflections that they might generate, since ment. As a consequence, they can only manage small fractions + they also change the image appearance. of dynamic content by classifying them as spurious data or + outliers to such static model. The most typical outlier rejection + 2) Inpainting the resulting space left by the detected dynamic algorithms are RANSAC (e.g., in ORB-SLAM [1], [23]) and + content with plausible imagery. The resulting image would robust cost functions (e.g., in PTAM [24]). 
+ succeed in being realistic if the inpainted areas are both + semantically and geometrically consistent with the static There are several SLAM systems that address more specifi- + content of the image. cally the dynamic scene content. Tan et al. [9] detect changes that + take place in the scene by projecting the map features into the + The first challenge can be addressed with geometrical ap- current frame for appearance and structure validation. Wang and +proaches if an image sequence is available. This procedure usu- Huang [8] segment the dynamic objects in the scene using the +ally consists in studying the optical flow consistency along the RGB optical flow. Alcantarilla et al. [7] detect moving objects +images [7], [8]. In the case in which only one frame is available, by means of a scene flow representation with stereo cameras. +deep learning is the approach that excels at this task by the More recently, thanks to the boost of deep learning, integrating +use of convolutional neural networks (CNNs) [10], [11]. These semantics information into SLAM has allowed to deal with +frameworks are trained with the previous knowledge of what dynamic content in a different manner [6], [25]. This idea allows +classes are dynamic and which ones are not. Recent works show the clustering of map points belonging to independent objects +that it is possible to acquire this knowledge in a self-supervised with different dynamics, as well as the possibility of detecting +way [12], [13]. dynamic objects in just one shot. + + Regarding the second challenge, some recent image inpaint- B. Sequence-Based Inpainting +ing approaches use image statistics of the remaining image to fill +in the holes [14], [15]. The former work estimates the pixel value Previous works on SLAM in dynamic scenes have attempted +with the normalized weighted sum of all the known pixels in the to reconstruct the background occluded by dynamic objects in +neighborhood. 
While this approach generally produces smooth the images with information from previous frames [6], [26]. +results, it is limited by the available image statistics and has Such works need perpixel depth information and only make use +no concept of visual semantics. Neural networks learn semantic of the static content of the prebuilt map to create the inpainted +priors and meaningful hidden representations in an end-to-end frames, but do not add semantic consistency. The work by Grana- +fashion, which have been used for recent image inpainting dos et al. [27] removes marked dynamic objects from videos by +efforts [16]–[19]. These networks employ convolutional filters aligning other candidate frames in which parts of the missing +on images, replacing the removed content with inpainted areas region are visible, assuming that the scene can be approximated +that have geometrical and semantic consistency with the whole using piecewise planar geometry. The recent work by Uitten- +image. bogaard et al. [28] utilizes a generative adversarial network + (GAN) to learn to use information from different viewpoints + Both challenges can also be seen as one single task: translating and select imagery information from those views to generate a +a dynamic image into a corresponding static image. In this plausible inpainting, which is similar to the ground-truth static +direction, Isola et al. [20] propose a general-purpose solution background. Eventually, if only one frame is available, the static +for image-to-image translation. Our previous work [21] builds occluded background can only be reconstructed by utilizing +on top of this idea and reformulates the framework objectives image-based inpainting techniques. +to take advantage of a precomputed dynamic object mask, +seeking a more inpainting-oriented framework. In this work, C. 
Image-Based Inpainting +we follow this idea of transforming images with dynamic con- +tent into realistic static frames, while optimizing for localiza- Among the nonlearning approaches to image inpainting, prop- +tion and mapping performance. For such task, we introduce agating appearance information from neighboring pixels to the +a new loss that combined with the integration of a semantic target region is the usual procedure [14]. Accordingly, these +segmentation network achieves the final objective of creating a methods succeed in dealing with narrow holes, where color and +dynamic-object-invariant space. This loss is based on steganal- texture vary smoothly, but fail when handling big holes, resulting +ysis techniques and on ORB features detection, orientation, and in oversmoothing. Differently, patch-based methods iteratively +descriptor maps [22]. Such loss allows the inpainted images to search for relevant patches from the rest of the image [29]. +be realistic and suitable for localization and mapping. These +images provide a richer understanding of the stationary scene, +and could also be of interest for the creation of high-detail road + +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:20:49 UTC from IEEE Xplore. Restrictions apply. + BESCOS et al.: EMPTY CITIES: A DYNAMIC-OBJECT-INVARIANT SPACE FOR VISUAL SLAM 435 + +These approaches are computationally expensive and, hence, Fig. 2. Our generator G adopts a UResNet-like architecture. It employs three +not fast enough for real-time applications. Yet, they do not make down-convolutional layers with a stride of 2, six ResNet blocks, and three +semantically aware patch selections. up-convolutional layers with a fractional stride of 1/2, with skip connections + between corresponding down- and up-convolutional layers. Only two ResNet + Deep-learning-based methods usually initialize the image blocks are shown for simplicity. 
holes with a constant value and further pass them through a CNN. Context Encoders [19] were among the first to successfully use a standard pixelwise reconstruction loss, as well as an adversarial loss, for image inpainting tasks. Due to the resulting artifacts, Yang et al. [30] take their results as input and then propagate the texture information from nonhole regions to fill the hole regions as postprocessing. Song et al. [31] use a refinement network in which a blurry initial hole-filling result is used as the input and is then iteratively replaced with patches from the closest nonhole regions in the feature space. Iizuka et al. [18] extend Context Encoders by defining global and local discriminators, then applying a postprocessing step. Following this work, Yu et al. [17] replaced the postprocessing with a refinement network powered by contextual attention layers. The recent work of Liu et al. [16] obtains excellent results by using partial convolutions.

In contrast, the work by Ulyanov et al. [32] proves that there is no need for external dataset training. The generative network itself can rely on its structure to complete the corrupted image. However, this approach usually requires several iterations (~50 000) to obtain good and detailed results.

D. Image Inpainting for a Dynamic-Object-Invariant Space

This work builds on our previous work Empty Cities [21], which bins the image sequences and treats the frames independently. It makes use of deep learning to segment out the a priori moving objects (vehicles, animals, and pedestrians), and also of image-based inpainting. It does not perform pure inpainting but image-to-image translation with the help of a dynamic objects' mask, which is the outcome of a semantic segmentation network. This choice is justified by the fact that the dynamic objects' mask might be inaccurate or may not include their shadows. The adoption of an image-to-image translation framework allows us to slightly modify the image nonhole regions to better accommodate the reconstructed areas. Differently to inpainting methods, the "holes" cannot be initialized with placeholder values because we do not want the framework to modify only those values; hence, our inpainting network input consists of the dynamic original image concatenated with the dynamic/static mask. Concisely, utilizing an image-to-image translation approach allows us to have the image hole regions inpainted, and the nonhole regions slightly modified to better accommodate the reconstructed areas, coping with imprecise masks or with the dynamic objects' possible shadows and reflections.

III. IMAGE-TO-IMAGE TRANSLATION

Our work makes use of the successful image-to-image translation framework by Isola et al. [20]. For the sake of completeness, we summarize the basis of their approach.

A GAN is a generative model that learns a mapping from a random noise vector z to an output image y, G : z → y [33]. In contrast, a conditional GAN (cGAN) learns a mapping from an observed image x and an optional random noise vector z to y, G : {x, z} → y [34], or G : x → y [20]. The generator G is trained to produce outputs indistinguishable from the "real" images by an adversarially trained discriminator D, which is trained to do as well as possible at detecting the generator's "fakes." The objective of a cGAN can be expressed as

LcGAN(G, D) = Ex,y[log D(x, y)] + Ex[log (1 − D(x, G(x)))]   (1)

where G tries to minimize this objective against an adversarial D that tries to maximize it. Previous approaches have found it beneficial to mix the GAN objective with a more traditional appearance loss, such as the L1 or L2 distance [19]. The discriminator's job remains unchanged, but the generator is tasked not only with fooling the discriminator, but also with being near the ground truth in an L1 sense, as expressed in

G∗ = arg min_G max_D LcGAN(G, D) + λ1 · LL1(G)   (2)

where LL1(G) = Ex,y[||y − G(x)||1]. The recent work of Isola et al. [20] shows that cGANs are suitable for image-to-image translation tasks, where the output image is conditioned on its corresponding input image, i.e., it translates an image from one space into another (semantic labels to RGB appearance, RGB appearance to drawings, day to night, etc.). The realism of their results is also enhanced by their generator architecture. They employ a U-Net [35], which allows low-level information to shortcut across the network. In our previous work [21], we made use of this same architecture with 256 × 256 resolution images. However, visual localization systems see their accuracy degraded when working with low-resolution images. For this objective, we hereby employ a UResNet [36] as the architecture for our generator G, see Fig. 2. This architecture uses residual blocks [37] and has shown impressive results for superresolution images [38].

It is well known that L2 and L1 losses produce blurry results on image generation problems, i.e., they can capture the low frequencies but fail to encourage high-frequency crispness. This motivates restricting the GAN discriminator to only model high-frequency structures. Following this idea, Isola et al. [20] adopt a discriminator architecture that classifies each N × N patch in an image as real or fake, rather than classifying the image as a whole. Due to their excellent results, we adopt this same architecture for our discriminator.

IEEE TRANSACTIONS ON ROBOTICS, VOL. 37, NO. 2, APRIL 2021

Fig. 3. Block diagram of our proposal. We first compute the segmentation of the RGB dynamic image, as well as its loss against its ground truth. Both the dynamic/static binary mask and the dynamic image are used to obtain the static image. A loss based on ORB features, together with an appearance and an adversarial loss, are obtained and back-propagated up to the RGB dynamic image. The striped blocks are differentiable layers that are fixed and, hence, not modified during training time. The adversarial discriminator is not shown here for simplicity.

IV. OUR PROPOSAL

Our proposed system turns images of an urban environment that show dynamic content, such as vehicles or pedestrians, into realistic static frames which are suitable for localization and mapping. We first obtain the pixelwise semantic segmentation of the RGB dynamic image (see Fig. 3). Then, the segmentation of only the dynamic objects is obtained with the convolutional network DynSS. Once we have this mask, we convert the RGB dynamic image to grayscale and we compute the static image, also in grayscale, with the use of the generator G, which has been trained in an adversarial way. For simplicity, the discriminator is not shown in this diagram. To fully exploit the capabilities of this framework for localization and mapping, inpainting is enriched with a loss based on ORB feature detection, orientation, and descriptors between the ground-truth and computed static images. Another feature of our framework for localization and mapping is the fact that we perform the inpainting in grayscale rather than in RGB. The motivation for this is that many visual localization applications only need the images' grayscale information. The different stages are described in Sections IV-A–IV-E.

A. From Image-to-Image Translation to Inpainting

For our objective, dynamic object masks are specially considered to reformulate the training objectives of the general-purpose image-to-image translation work by Isola et al. [20]. We adopt a variant of the cGAN that learns a mapping from an observed image x and a dynamic/static binary mask m to y, G : {x, m} → y. Also, the discriminator D learns to classify patches of ŷ = G(x, m) as "fake" from ŷ, m, and x, and patches of y as "real" from y, m, and x, D : {x, y/ŷ, m} → real/fake.

In most of the training dataset images, the relationship between the static and dynamic region sizes is unbalanced, i.e., static regions usually occupy a much bigger area. This leads us to believe that the influence of dynamic regions on the final loss is significantly reduced. As a solution to this problem, we propose to reformulate the cGAN and L1 losses so that there is more emphasis on the main areas that have to be inpainted, according to (3) and (4). The weights w are computed as w = N/Ndyn if m = 1 (dynamic object), and as w = N/(N − Ndyn) if m = 0 (background). N stands for the number of elements in the binary mask m, and Ndyn is the number of pixels where m = 1:

LL1(G) = Ex,y[w · ||y − G(x, m)||1]   (3)

LcGAN(G, D) = Ex,y[w · log D(x, y, m)] + Ex[w · log (1 − D(x, G(x, m), m))].   (4)

An important feature that we have also incorporated into the framework is the computation of our output and target images' "noise." This is motivated by the use of the noise domain in steganalysis to detect whether an image has been tampered with. Fig. 4 shows an example of why working in the noise domain is helpful for detecting "fake" images. While the static generated image [see Fig. 4(b)] looks visually similar to its target image [see Fig. 4(d)], their computed noises [see Fig. 4(c) and (e)] are very different. It would be very easy for us, humans, to tell what parts of the original image [see Fig.
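The balancing weights used in (3) and (4) can be illustrated with a short sketch. Interpreting the expectation as a per-pixel mean is an assumption made here for illustration, and the helper names (`balance_weights`, `weighted_l1`) are hypothetical:

```python
import numpy as np

def balance_weights(m):
    """Per-pixel weights w from the dynamic/static mask m (1 = dynamic).

    w = N / N_dyn on dynamic pixels and w = N / (N - N_dyn) on background,
    so the usually much smaller dynamic area is not drowned out in the loss.
    """
    n = m.size
    n_dyn = int(m.sum())
    w = np.where(m == 1, n / max(n_dyn, 1), n / max(n - n_dyn, 1))
    return w.astype(float)

def weighted_l1(y, y_hat, m):
    """Weighted L1 loss in the spirit of Eq. (3): mean of w * |y - G(x, m)|."""
    return float(np.mean(balance_weights(m) * np.abs(y - y_hat)))
```

With a mask whose dynamic area covers a quarter of the image, dynamic pixels receive weight 4 and background pixels 4/3, so both regions contribute equally to the total loss.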
4(a)] have been changed by analyzing their noise mapping. In the same way, the discriminator could more easily learn to distinguish "real" from "fake" images if it can take their noise as input. This idea is explained in more depth in Section IV-B, and the whole training procedure is diagrammed in Fig. 5. To the best of our knowledge, steganalysis noise features have not been used before in the context of GANs.

Fig. 4. (a) Original image. (b) Image generated by our framework when taking (a) as input. (c) Computed noise of (b). (d) Static objective image. (e) Computed noise of (d). (b) and (d) are visually similar, but their computed noises [(c) and (e), respectively] clearly show which image, and which parts of it, have been modified the most. The noise magnitude has been amplified (×10) for visualization.

This GAN training setup leads to good inpainting results. However, despite the efforts of the discriminator to catch the high frequency of the "real" images, the outputs of our framework are still slightly blurry. One of the objectives of this work is to use our images for localization tasks; therefore, if the inpainted regions are somewhat blurry, features would not be extracted in these areas. Image features are important for localization since many visual SLAM systems rely on them as their core (ORB-SLAM [23]). Having blurriness in inpainted areas could be seen as a good feature of our framework for navigation, because it would allow feature-based localization systems to work with our images without any modification in their architecture, and "fake" features would not be introduced. This would be equivalent to modifying the utilized localization system to work with the raw images and the dynamic/static binary masks. We have proved with our localization experiments (see Section VI-A) that not utilizing moving objects' features leads to worse tracking results than working with fully static images. For that reason, we want to exploit our framework both to obtain high-quality inpainting results and to succeed in generating reliable features for visual localization tasks. Fortunately, these two assignments are highly related: solving one of them leads to having the other one tackled. Therefore, we have implemented a new loss based on ORB features [22]. That is, we want the output of our generator G to have the same ORB features as its target image, while keeping it realistic and close to its target in an L1 sense. By the same ORB features, we mean the same detected keypoints with the same orientation and descriptors, following ORB's implementation to the extent possible. This procedure is further described in Section IV-C.

B. Steganalysis-Based Loss

With the advances of image editing techniques, tampered or manipulated image generation processes have become widely available. As a result, distinguishing authentic images from tampered images has become increasingly challenging. What our framework is actually trying to achieve is to eliminate certain regions from an authentic image followed by inpainting, i.e., removal, one of the most common image manipulation techniques. It is the discriminator's job to classify the generated image patches as tampered (fake) or real.

Images have a low-frequency component dependent on their content, and a high-frequency component dependent on their source camera. These high-frequency components are known as noise features or noise residuals, and can be extracted using linear and nonlinear high-pass filters. Recent works on image forensics utilize noise features [39], [40] as clues to classify a specific patch or pixel in an image as tampered or not, and to localize the tampered regions. The intuition behind this idea is that when an object is removed from one image (source) and the gap is inpainted (target), the noise features between the source and target are unlikely to match.

To provide the discriminator with better clues to distinguish real from fake inputs, we first extract the noise features from our images and concatenate them to the grayscale images, as depicted in Fig. 5. The cGAN objective is reformulated as

LcGAN(G, D) = Ex,y[w · log D(x, y, m, n)] + Ex[w · log (1 − D(x, ŷ, m, n̂))]   (5)

where n = SRM(y) and n̂ = SRM(ŷ). There are many ways to produce noise features from an image. Inspired by recent progress on steganalysis rich models (SRM) for image manipulation detection [39], we use SRM filter kernels to extract the local noise features from the static images as the input to our discriminator. The SRM uses statistics of neighboring noise residual samples as features to capture the dependence changes caused by embedding. Zhou et al. [40] use SRM residuals together with the RGB image to detect and localize corrupted regions in images. They use only 3 SRM kernels, instead of 30 (as in the original work of Fridrich and Kodovsky [39]), and claim that they achieve comparable performance. Similarly, we use these same three filters (see Fig. 6), setting the kernel size of the SRM filter layer to 5 × 5 × 3.

C. ORB-Features-Based Loss

ORB features allow real-time detection and description, and provide good invariance to changes in viewpoint and illumination. Furthermore, they are useful for visual SLAM and place recognition, as demonstrated in the popular ORB-SLAM [23] and its binary bag-of-words [41]. The following sections summarize how the ORB feature detector, descriptors, and orientation are computed, and how we have adapted them into a new loss.

1) Detector: The ORB detector is based on the FAST algorithm [42]. It takes one parameter, the intensity threshold t between the center pixel p, with intensity Ip, and those in a circular ring around the center. If there exists a set of contiguous pixels in the circle which are all brighter than Ip + t, or all darker than Ip − t, the pixel p is a keypoint candidate. Then, the Harris corner measure is computed for each of these candidates, and the target N keypoints with the highest Harris measure are finally selected. FAST does not produce multiscale features; therefore, ORB uses a scale pyramid of the image and extracts FAST features at each level of the pyramid.

Fig. 5. The discriminator D has to learn to distinguish between the real images y and the images generated by the generator, G(x, m). D makes a better decision (real/fake) by seeing the inputs of the generator x and m, and by seeing the SRM noise features of G(x, m) and y. The striped blocks are convolutional layers whose weights do not require updating during training time.

Fig. 6. The three utilized SRM kernels to extract noise features. The left kernel is useful in regions with a strong gradient. The middle and the rightmost kernels provide the layer with a high shift-invariance.

Fig. 7. Subset of the 16 kernels used to obtain corner responses in the images. The 12 black pixels have a value of −1/12, the gray pixels are set to 0, and the white pixel is set to 1. A very positive or a very negative response will be obtained when convolving these kernels with a corner area in an image.

To bring this to a differentiable solution, we have defined a convolution capable of detecting corners in an image in the same way that FAST does. We have approximated the FAST corner detection and have used instead a convolution with the kernels in Fig. 7. These images show some of the kernels used for corner detection for a circular ring of three pixels around the center. By convolving the image with these kernels for different kernel sizes, we obtain its corner response for the different image pyramid levels. We keep the maximum score per pixel and per level and raise each element to its second power to equally leverage positive and negative responses. We then subtract a value equivalent to the FAST threshold t, and apply a sigmoid operation. Its output is the probability of a pixel being a FAST feature and could also be seen as the Harris corner measure. Features for the output and target images are computed following this procedure. We define this network as det, and the corresponding loss Ldet(G) can be expressed as

Ldet(G) = −Ex,y[wdet · (det(y) · log(det(ŷ)) + (1 − det(y)) · log(1 − det(ŷ)))]   (6)

where ŷ = G(x, m) and wdet is calculated following (7). This weight definition allows us to leverage the uneven distribution of nonfeature and feature pixels, and to affect only those image regions with a wrong feature response. N stands for the number of pixels in the feature map, and Nf represents the number of pixels in the response map det(y) where det(y) > 0.5, i.e., the number of FAST features in the current objective frame:

wdet = N/Nf if det(y) > 0.5 and det(ŷ) ≤ 0.5; N/(N − Nf) if det(y) ≤ 0.5 and det(ŷ) > 0.5; 0 otherwise.   (7)

According to our results, the optimum number of image pyramid levels for this objective is 1. More levels lead to a greater training time, and the results are barely influenced. This is coherent with the idea that we want to maximize the sharpness of small features rather than that of the big corners. These convolutions have been applied with a stride of 5, offering a good tradeoff between computational training time and good-quality results.

Other approaches have tried before to include a similar loss inside a CycleGAN framework [43]. The work by Porav et al. [43] uses the SURF detector [44], which is already differentiable, but does not compute a binary loss. They compute a more traditional L1 loss between the blob responses of the output and ground-truth images. Computing a binary loss as in (7) allows us to put more emphasis on the high-gradient areas.

2) Orientation: Once FAST features have been detected, the original ORB work extracts their orientation to provide them with rotation invariance. This is done by computing the orientation θ = atan2(m01, m10) and the intensity centroid C = (m10/m00, m01/m00), where mpq = Σx,y x^p y^q I(x, y) are the moments of an image patch. More precisely, the three utilized patch moments are m10 = Σx,y x · I(x, y), m01 = Σx,y y · I(x, y), and m00 = Σx,y I(x, y). We have created three 14-pixel-radius circular kernels with the values x, y, and 1, respectively, for m10, m01, and m00 (centered in 0), so that when convolving the image with them, we obtain the respective patch moments m10, m01, and m00. We define this network as ori, and the objective of its corresponding loss is that the "fake" static image detected features det(G(x, m)) have the same orientation parameters m01, m10, and m00 as the ground-truth static image detected features det(y). This loss can be expressed as

Lori(G) = −Ex,y[wori · ||ori(y) − ori(ŷ)||1].   (8)
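The intensity-centroid computation that the orientation loss compares can be written down directly. As a simplification, the sketch below evaluates the moments m00, m10, and m01 as plain sums over a single patch instead of convolving the whole image with circular kernels; the function name is our own:

```python
import math
import numpy as np

def patch_orientation(patch):
    """Intensity-centroid orientation of a square patch, ORB-style.

    Computes the moments m00, m10, m01 with coordinates centered on the
    patch middle, then theta = atan2(m01, m10) and the centroid
    C = (m10 / m00, m01 / m00).
    """
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs = xs - (w - 1) / 2.0  # center coordinates on the patch middle
    ys = ys - (h - 1) / 2.0
    m00 = float(patch.sum())
    m10 = float((xs * patch).sum())
    m01 = float((ys * patch).sum())
    centroid = (m10 / m00, m01 / m00)
    theta = math.atan2(m01, m10)
    return theta, centroid
```

For a patch whose mass sits entirely to the right of the center, the centroid lies on the positive x-axis and the orientation is zero, which matches the intuition behind the rotation-invariance mechanism described above.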
Even though these convolutions are applied to the whole image with a stride of 5, as in the detection loss, the weighting term wori in (8) has a value of 1 if a FAST feature has been detected in either the ground-truth static image or the output image, i.e., if det(y) > 0.5 or det(ŷ) > 0.5. Otherwise, the weighting term wori is set to 0.

3) Descriptor: The ORB descriptor is a bit string description of an image patch constructed from a set of binary intensity tests. Consider a smoothed image patch p; a binary test τ is defined by

τ(p; x, y) = 1 if p(x) < p(y); 0 if p(x) ≥ p(y)   (9)

where p(x) is the intensity of p at a point x. The feature is defined as a vector of n binary tests

fn(p) = Σ_{1≤i≤n} 2^(i−1) · τ(p; xi, yi).   (10)

As in Rublee et al.'s work [22], we use a Gaussian distribution around the center of the patch and a vector length n = 256. This can be achieved in a differentiable and convolutional manner by creating n kernels with all values set to 0 except for those in the positions x and y:

k(z) = 1 if z = x; −1 if z = y; 0 otherwise   (11)

where k(z) is the value of the kernel k at a point z. Convolving an image with these n kernels yields each pixel's ORB descriptor (a negative output corresponds to the bit value 0 and a positive one to 1). This convolution is followed by a sigmoid activation function. We define this network as desc, and the corresponding loss Ldesc(G) can be expressed as

Ldesc(G) = −Ex,y[wdesc · (desc(y) · log(desc(ŷ)) + (1 − desc(y)) · log(1 − desc(ŷ)))]   (12)

where the weights wdesc are defined in (13). This descriptor loss is back-propagated to the whole image, whether a feature has been detected or not, as it helps keep the image statistics:

wdesc = 1 if desc(y) > 0.5 and desc(ŷ) ≤ 0.5; 1 if desc(y) ≤ 0.5 and desc(ŷ) > 0.5; 0 otherwise.   (13)

All these losses are combined into one loss LORB(G), which is computed as in (14). The values of the weights of the different losses λdet, λori, and λdesc have been chosen empirically, and they are set to 10, 0.1, and 1, respectively:

LORB(G) = λdet · Ldet(G) + λori · Lori(G) + λdesc · Ldesc(G).   (14)

The feature detection, orientation, and descriptor maps can be computed in parallel to decrease the training time, since their computation is not necessarily sequential.

Finally, the generator's job can be expressed as in the following equation:

G∗ = arg min_G max_D LcGAN(G, D) + λ1 · LL1(G) + LORB(G).   (15)

As an implementation detail, we have first trained the whole system without the ORB loss for 125 epochs, and have then fine-tuned it, including this loss, for another 25 epochs.

D. Semantic Segmentation

Semantic segmentation is a challenging task that addresses most of the perception needs of intelligent vehicles in a unified way. Deep neural networks excel at this task, as they can be trained end-to-end to accurately classify multiple object categories in an image at the pixel level. However, very few architectures offer a good tradeoff between high quality and computational resources. The recent work of Romera et al. [11] runs in real time while providing accurate semantic segmentation. The core of their architecture (ERFNet) uses residual connections and factorized convolutions to remain efficient while retaining remarkable accuracy.

Romera et al. [11] have made public some of their trained models [45]. As in our preliminary work, we use for our approach the ERFNet model with encoder and decoder both trained from scratch on the Cityscapes train set [46]. We have fine-tuned their model to adjust it to our inpainting approach by back-propagating the loss of the semantic segmentation LCE(SS), calculated with the cross-entropy criterion using the class weights wSS they suggest, and the adversarial loss of our final inpainting model LcGAN(G, D). The semantic segmentation network's (SS) job can hence be expressed as

SS∗ = arg min_SS max_D LcGAN(G, D) + λ2 · LCE(SS)   (16)

where LCE(SS) = wSS[class] · (log(Σj exp(ySS[j])) − ySS[class]). Its objective is to produce an accurate semantic segmentation ySS, but also to fool the discriminator D. The latter objective might occasionally lead the network to recognize not only dynamic objects but also their shadows.

E. Dynamic Objects Semantic Segmentation

Once the semantic segmentation of the RGB image is done, we can select those classes known to be dynamic (vehicles and pedestrians). This has been done by applying a SoftMax layer, followed by a convolutional layer with a kernel of n × 1 × 1, where n is the number of classes, and with the weights of the dynamic and static channels set to wdyn and wstat, respectively. With wdyn = (n − ndyn)/n and wstat = −ndyn/n, where ndyn stands for the number of existing dynamic classes, a positive output corresponds to a dynamic object, whereas a negative one corresponds to a static one. The resulting output passes through a hyperbolic-tangent-type activation function to obtain the desired dynamic/static mask. Note that the defined weights wdyn and wstat are not changed during training time. This segmentation stage has been adopted from our preliminary work [21] without any new modifications.

V. IMAGE-BASED EXPERIMENTS

A. Data Generation

We have analyzed the performance of our method using CARLA [47]. CARLA is an open-source simulator for autonomous driving research, which provides open digital assets (urban layouts, buildings, vehicles, pedestrians, etc.). The simulation platform supports flexible specification of sensor suites and environmental conditions. We have generated over 12 000 image pairs consisting of a target image captured with neither vehicles nor pedestrians, and a corresponding input image captured at the same pose with the same illumination conditions, but with cars, trucks, and people moving around. These images have been recorded using a front and a rear RGB camera mounted on a car. Their ground-truth semantic segmentation has also been captured. CARLA offers two different towns, which we have used for training and testing, respectively. Our dataset, together with more information about our framework, is available at https://bertabescos.github.io/EmptyCities_SLAM/.

At present, we are limited to training this framework on synthetic datasets since, to our knowledge, no real-world dataset exists that provides RGB images captured under the same illumination conditions at identical poses, with and without dynamic objects. In order to render our framework trained on synthetic data transferable to real-world data, we have fine-tuned our models with data from the Cityscapes and KITTI semantic segmentation training datasets [46], [48]. These datasets are semantically similar to the ones synthesized with CARLA. Nonetheless, their image statistics are different. We further explain this fine-tuning process in Section V-C.

B. Inpainting

In this section, we report the improvements achieved by our framework for inpainting. Table I describes the ablation study of our work for the different reported inputs and losses. The existence of many possible solutions makes it difficult to define a metric to evaluate image inpainting [17]. Nevertheless, we follow previous works and report the L1, PSNR, and SSIM errors [49], as well as a feature-based metric, Feat. This last metric computes the FAST feature detections, as explained in Section IV-C, for the output and ground-truth images, and compares them, similarly to (6).

TABLE I
QUANTITATIVE EVALUATIONS OF OUR CONTRIBUTIONS IN THE INPAINTING TASK ON THE TEST SYNTHETIC IMAGES

Bold entries mark the best performing system. The best results for almost all the inpainting metrics (L1, PSNR, and SSIM) are obtained with the generator G(x, m)|w and the discriminator D(x, y, m, n)|w. More correct features (Feat metric) are detected, though, when adding the feature-based loss G(x, m)|w ORB. "Full image" designates the per-pixel error considering the whole image. By "In" and "Out," we refer, respectively, to the per-pixel error considering the masked and unmasked pixels.

Adding the dynamic/static mask as input for both the generator and the discriminator helps to obtain better inpainting results within the image hole regions (In), at the expense of worse quality results in the nonhole regions (Out). Leveraging the unbalanced quantity of static and dynamic data within the dataset with w, (3) and (4), helps to obtain better results too. Providing the GAN's discriminator with the images' noise makes it learn better to distinguish between real and fake images, and therefore, the generator learns to produce more realistic images. The ORB-based loss leads to slightly worse inpainting results according to the L1, PSNR, and SSIM metrics, but renders this approach more useful for both localization and mapping tasks since more correct features are created.

1) Baselines for Inpainting: We compare our "inpainting" method qualitatively and quantitatively with four other state-of-the-art approaches:
1) Geo1, Geo2: two nonlearning approaches [14], [15];
2) Lea1, Lea2: two deep-learning-based methods [17], [18].
For a fair comparison, we have trained the approach by Yu et al. (Lea1) with our same training data. Iizuka et al. (Lea2) do not have their training code available. We have directly used their released model [18] trained on the Places2 dataset [51]. This dataset contains images of urban streets from a car perspective similar to ours. A more direct comparison is not possible. We provide them with the same mask as our method to generate the holes in the images. We evaluate qualitatively on the 3000 images from our synthetic test dataset, on the 500 validation images from the Cityscapes dataset [46], and on the images from the Oxford RobotCar dataset [50]. We can see in Figs. 8–10 the qualitative comparisons on these three datasets. Note that the results with both Lea1 and Lea2 have been generated with the color images and then converted to grayscale for visual comparison. Visually, we see that our method obtains a more realistic output (these results are computed without the ORB loss for an inpainting-oriented comparison). Also, it is the only one capable of removing the shadows generated by the dynamic objects even though they are not included in the dynamic/static mask (see Fig. 8 row 2 and Fig. 10 row 1). The utilized masks are included in the images in Figs. 8(a) and 10(a), respectively. Table II describes the quantitative comparison of our method against Geo1, Geo2, Lea1, and Lea2 on our CARLA dataset.
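The Full/In/Out error split reported in Table I can be reproduced with a small helper. The dictionary keys, the [0, 1] intensity range, and the use of per-pixel means are our own conventions for this sketch; SSIM is omitted:

```python
import math
import numpy as np

def inpaint_metrics(pred, target, hole_mask):
    """Per-pixel L1 over the full image and inside/outside the hole mask,
    mirroring the Full/In/Out split used in the ablation table, plus PSNR.
    Images are floats in [0, 1]; hole_mask is True on inpainted pixels.
    """
    err = np.abs(pred - target)

    def l1(region):  # mean absolute error over a boolean region
        return float(err[region].mean())

    metrics = {
        "L1_full": l1(np.ones_like(hole_mask, dtype=bool)),
        "L1_in": l1(hole_mask),
        "L1_out": l1(~hole_mask),
    }
    mse = float(((pred - target) ** 2).mean())
    metrics["PSNR"] = float("inf") if mse == 0 else 10 * math.log10(1.0 / mse)
    return metrics
```

Reporting the error inside and outside the mask separately is what exposes the tradeoff noted above: the mask input improves the "In" error while slightly degrading the "Out" error.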
BESCOS et al.: EMPTY CITIES: A DYNAMIC-OBJECT-INVARIANT SPACE FOR VISUAL SLAM 441

Fig. 8. (a) Input. (b) Geo1 [14]. (c) Lea1 [17]. (d) Lea2 [18]. (e) Ours. (f) Ground-truth. Qualitative comparison of our method (e) against other inpainting techniques (b), (c), (d) on our synthetic dataset. Our results are semantically and geometrically coherent, and do not show the dynamic objects' shadows, even if they are not included in the input mask.

TABLE II: QUANTITATIVE RESULTS OF OUR METHOD AGAINST OTHER INPAINTING APPROACHES IN OUR CARLA DATASET. Bold entries denote the best performing system. For a fair comparison, we only report the different errors within the images' hole regions, since the other methods are conceived to only significantly modify such parts.

It is not possible to quantitatively measure the performance of the different methods on the Cityscapes and Oxford Robotcar datasets, since no ground truth exists. Following these results, we can claim that our method outperforms the other approaches both qualitatively and quantitatively.

As seen in Fig. 10 row 1, the fact that our method does not perform pure inpainting but image-to-image translation with the help of a dynamic/static mask allows us to modify not only the dynamic objects themselves but also their shadows or reflections. We believe that the main underlying reason for this is the direct supervision for image-to-image translation. Also, since the segmentation masks are not 100% accurate during the training with real-world data, the model learns that it has to modify mainly the areas of the mask and, in case a smooth representation of the world is not obtained, also their surroundings. We believe that, had the training been performed with perfect masks that also cover the shadows, the model would not have learned to handle the shadows of dynamic objects or the inaccuracies of segmentation.

We want to highlight the importance of the inpainting robustness to inaccurate segmentation masks since, in practice, partial or missing segmentation happens frequently. Empty Cities cannot handle missing detections but can cope with partial segmentations covering at least 85% of the object image.

We hereby report some metrics evaluating how our framework behaves with the dynamic objects' shadows. Thresholding the difference between the dynamic and static image, and subtracting the dynamic-objects mask, yields the dynamic objects' shadows and reflections mask. We have first generated the shadows ground truth of our CARLA dataset, and then computed the shadows masks for our inpainted images in the same way. The intersection over union of the estimated shadows against the ground truth is 42.8%. Following recent works in shadow detection [52], we also report our method's shadow, nonshadow, and total accuracy (59.7%, 99.8%, and 99.5%, respectively). With this framework, we can remove almost 50% of the shadows of the dynamic objects of our CARLA test dataset. Admitting this could be improved, our method's nonshadow accuracy is almost 100%, which means that it does not modify other objects' shadows.

C. Transfer to Real Data

Models trained on synthetic data can be useful for real-world vision tasks [53]-[56]. Accordingly, we provide a study of synthetic-to-real transfer learning using data from the Cityscapes dataset [46], which offers a variety of urban real-world environments similar to the synthetic ones.

442 IEEE TRANSACTIONS ON ROBOTICS, VOL. 37, NO. 2, APRIL 2021

Fig. 9. (a) Input. (b) Geo1 [14]. (c) Lea1 [17]. (d) Ours. Comparison of our method (d) against other image inpainting approaches (b), (c) on the Cityscapes validation dataset [46]. (c) and (d) show results when real images have been incorporated into our training set together with the synthetic images with a ratio of 1/10.
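The shadows-and-reflections mask described above, thresholding the dynamic/static difference and then removing the dynamic-object mask, together with its intersection over union, can be sketched as follows. Images are flattened grayscale lists and the threshold value is an illustrative assumption, not the authors' setting:

```python
def shadow_mask(dynamic, static, obj_mask, thresh=25):
    """Pixels that differ between the dynamic and static images but are
    not part of the dynamic-object mask: shadows and reflections."""
    return [int(abs(d - s) > thresh and not m)
            for d, s, m in zip(dynamic, static, obj_mask)]

def iou(a, b):
    """Intersection over union of two binary masks."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0

dynamic = [200, 90, 90, 40, 40]
static = [120, 120, 120, 40, 40]
obj_mask = [1, 0, 0, 0, 0]  # the dynamic object itself
est = shadow_mask(dynamic, static, obj_mask)
print(est)  # [0, 1, 1, 0, 0] -> darkened pixels next to the object
```

Running the same computation on the inpainted image instead of the static ground truth gives the estimated shadow mask whose IoU is reported above.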
When testing our method on real data, we see qualitatively that the synthesized images show some artifacts. This happens because such data have different statistics than the real ones and, therefore, cannot be easily used. The combination of real and synthetic data is possible during training despite the lack of ground-truth static real images. In the case of the real images, the network only learns the texture and the style of the static real world by encoding its information and decoding back the original image's nonhole regions. The synthetic data are substantially more plentiful and carry information about the inpainting process. The rendering, however, is far from realistic. Thus, the chosen representation attempts to bridge the reality gap encountered when using simulated data and to remove the need for domain adaptation.

We provide implementation details: we have finetuned our model with real data for 25 epochs with a real/synthetic images ratio of 1/10. On the one hand, for every ten images, there are nine synthetic images that provide our model with information about the inpainting task. On the other hand, one image out of those ten is a real image from the Cityscapes train dataset. There is ground truth for its semantic information, but there is no ground truth for its static representation. In such cases, we do backpropagation of the loss derivative only on those image areas that we consider static. This way, the model can learn both the inpainting task and the static real-world texture. Once the model is adapted to real-world data, it can be directly used in completely new real-world scenarios, e.g., the Oxford Robotcar dataset [50].

VI. EXPERIMENTS

A. Visual Odometry

We have evaluated Empty Cities on 20 CARLA synthetic sequences and on nine sequences from new real-world environments. For these VO experiments, we have chosen the state-of-the-art feature-based system ORB-SLAM [23]¹ and the direct method DSO [2]. The former is ideal to test the influence of our ORB features loss, and the latter is useful to prove that different systems can also benefit from this approach.

¹ORB-SLAM acts as visual odometry in trajectories without loop closures.

Fig. 10. (a) Input. (b) Geo1 [14]. (c) Lea1 [17]. (d) Ours. Comparison of our method (d) against other image inpainting approaches (b), (c) on the Oxford Robotcar dataset [50]. (c) and (d) show results when Cityscapes images have been incorporated into our training set together with the synthetic images with a 1/10 ratio. The binary dynamic/static mask computed for every input image has been added in the top-left corner of every row.

Fig. 11. Vertical axis shows the different sequences from our CARLA dataset in which we have tested our model, and the horizontal axes show the ATE [m] obtained by ORB-SLAM. (a) ORB-SLAM absolute trajectory RMSE [m] for the raw dynamic images. (b) ATE computed by DynaSLAM [6]: this system is based on ORB-SLAM and computes the dynamic objects' masks in every frame so as not to use features belonging to them. (c) ORB-SLAM ATE when using our inpainted frames (the ones obtained with the ORB-features-based loss). (d) ORB-SLAM ATE for the ground-truth static images.

1) Baselines for VO in Our Synthetic Dataset: Fig. 11 displays the ORB-SLAM absolute trajectory RMSE [m] computed for 20 CARLA sequences of approximately 100 m long without loop closures. Fig. 11(a) shows the results when many vehicles and pedestrians are moving independently.
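The absolute trajectory RMSE (ATE) used as the error metric throughout these experiments boils down to the root-mean-square of per-frame position errors. A minimal sketch, assuming the two trajectories are already associated frame-to-frame and aligned (e.g., by a Umeyama-style registration, which is omitted here):

```python
import math

def ate_rmse(est, gt):
    """Absolute trajectory error (RMSE) between two aligned trajectories,
    given as lists of (x, y, z) positions, one entry per frame."""
    sq = [sum((e - g) ** 2 for e, g in zip(pe, pg))
          for pe, pg in zip(est, gt)]
    return math.sqrt(sum(sq) / len(sq))

est = [(0, 0, 0), (1.1, 0, 0), (2.0, 0.2, 0)]
gt = [(0, 0, 0), (1.0, 0, 0), (2.0, 0.0, 0)]
print(round(ate_rmse(est, gt), 4))
```

Repeating this over several runs of a sequence yields the boxplot statistics shown in the following figures.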
More precisely, the number of vehicles and pedestrians has been set to the maximum allowed by CARLA. Fig. 11(d) shows the same odometry results for the ground-truth static sequences. We can see that dynamic objects have in many sequences a big influence on ORB-SLAM's performance (sequences 02, 07, 08, etc.). Fig. 11(b) shows the trajectory error obtained with our previous system DynaSLAM [6]. This system is based on ORB-SLAM and uses the semantic segmentation network Mask R-CNN [10] to detect the moving objects and not extract ORB features within them. Even though better odometry results are obtained compared with using the raw dynamic images, this experiment shows that using static images leads to a more accurate camera tracking. One reason for this is that dynamic objects might occlude the nearby regions in the scene, which are the most reliable for camera pose estimation. Another reason might be that using dynamic-object masks no longer yields a homogeneous feature distribution within the image. ORB-SLAM looks for a uniform distribution of image features; pose optimization could be degraded and drift could increase if the features do not follow such a distribution. Finally, Fig. 11(c) shows the ORB-SLAM error when using our inpainted images. Our odometry results show that better results are usually obtained when using our inpainted images. The inpainting is realistic enough to provide the visual odometry system with consistent features that are useful for localization.

Fig. 12. Vertical axis shows the different sequences from our CARLA dataset, and the horizontal axes show the ORB-SLAM ATE [m]. (a) and (b) ORB-SLAM absolute trajectory RMSE [m] for our images obtained with the model trained without and with the ORB loss term, respectively. (c) ATE computed by DynaSLAM for our images obtained with the model trained with the ORB loss term. On the right-most side, one can find the percentage of keypoints extracted in the inpainted areas w.r.t. the ground-truth keypoints.

We want to highlight the importance of the influence of using the ORB loss during training (see Fig. 12). Fig. 12(a) and (b) present the ATE obtained by ORB-SLAM with our inpainted images without and with the ORB loss, respectively. The estimated errors are smaller and more constant when using this loss. This performance gain can be due to features in the regions that originally contained dynamic objects, and to more stable features in the static-content regions. Fig. 12(c) presents the DynaSLAM ATE with the inpainted images generated with our model trained with this loss term. We expect these errors to be very similar to those shown in Fig. 11(b), and slightly bigger than those in Fig. 12(b). This experiment shows that our model barely damages the static content of the scene, keeping the static features as they used to be. It also demonstrates that the hallucinated features are useful to estimate the camera SE3 pose. To support this claim, on the right-most side of Fig. 12, one can find the percentage of keypoints extracted in the inpainted areas w.r.t. the ground-truth keypoints. We show in Fig. 13 two examples of the visual influence of such loss on enhancing high frequencies in inpainted areas.

Fig. 13. (a) Input. (b) Output. (c) Output. Visual comparison of the improvements achieved by utilizing the ORB-based loss (c) against not using it (b). The reconstructed curbs in (c) are sharper and straighter than those in (b).

Fig. 14 shows the DSO error for the same 20 CARLA sequences for the different input images: with dynamic content [see Fig. 14(a)], without dynamic content [see Fig. 14(d)], and the images obtained by our framework without and with the ORB-based loss [see Fig. 14(b) and (c)]. Even though direct systems are more robust to dynamic objects within the scene, utilizing our approach also yields a higher tracking accuracy. Despite the fact that our feature-based loss follows the ORB implementation, better results are also obtained with other visual odometry systems that do not rely on ORB features.

2) Baselines for Inpainting: We compared in Section V the quality of our results against four other methods with respect to the inpainting metrics. We now want to compare how our approach fares against them w.r.t. visual odometry metrics. Among these four other methods, we have chosen two for this evaluation: Geo1 [14] and Lea1 [17]. The first choice is motivated by its performance on our inpainting test dataset: this method performs the best among the two nonlearning-based approaches. The second choice is, however, motivated by the fact that we have not been able to train the model by Iizuka et al. [18] with our training data for a direct comparison. The evaluation and the different results can be seen in Fig. 15.

The images inpainted by Telea's method are usually very smooth and no features are extracted within the inpainted areas. The behavior of ORB-SLAM when using such sequences [Fig. 15(a)] is very similar to using DynaSLAM. However, the learning-based method by Yu et al. [17] tends to inpaint the images with low-frequency patterns found in the image's static content, generating many crispy artifacts (see examples in Fig. 10). A higher ATE is observed when using such images [see Fig. 15(b)]. Our method seems to be more suitable for the VO task: the inpainting is neither too smooth nor generates crispy artifacts.
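The percentage of keypoints recovered in the inpainted areas, reported on the right-most side of Fig. 12, can be sketched as follows. For simplicity, this toy version matches keypoints by exact pixel location, whereas a real evaluation would use a spatial tolerance and descriptor checks:

```python
def keypoint_recovery(pred_kps, gt_kps, hole_mask):
    """Percentage of ground-truth keypoints inside the inpainted (hole)
    region that also appear, at the same pixel, among the predicted
    keypoints. Keypoints are (x, y) tuples; hole_mask is row-major."""
    def inside(kp):
        x, y = kp
        return hole_mask[y][x] == 1

    gt_in = {kp for kp in gt_kps if inside(kp)}
    pred_in = {kp for kp in pred_kps if inside(kp)}
    if not gt_in:
        return 100.0
    return 100.0 * len(gt_in & pred_in) / len(gt_in)

mask = [[0, 1, 1],
        [0, 1, 1]]
gt = [(0, 0), (1, 0), (2, 1)]  # two of these lie in the hole
pred = [(1, 0), (1, 1)]
print(keypoint_recovery(pred, gt, mask))  # 50.0
```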
Fig. 14. Vertical axis shows the different sequences from our CARLA dataset, and the horizontal axes show the DSO ATE [m]. We have computed the boxplots' minimum, maximum, and quartiles with the results from ten repetitions of every test. (a) DSO absolute trajectory RMSE [m] for the raw dynamic images. (b) and (c) DSO trajectory errors when using our inpainted frames without and with the ORB-features-based loss, respectively. (d) DSO trajectory errors for the ground-truth static images.

Fig. 15. Vertical axis shows the different sequences from our CARLA dataset, and the horizontal axes show the ATE [m] obtained by ORB-SLAM for ten repetitions. (a) and (b) Absolute trajectory RMSE [m] for the images inpainted with the method by Telea [14] and Yu et al. [17], respectively. (c) ORB-SLAM trajectory error for our images.

3) Baselines for VO in Real-World New Scenarios: For the evaluation of Empty Cities on real-world environments w.r.t. visual odometry, we have chosen the KITTI [48] and the Oxford Robotcar [50] datasets. Dynamic objects in the KITTI dataset do not represent a big inconvenience for camera pose estimation, as was shown in our last work [6]. Most of the vehicles that appear are not moving and lie in nearby scene areas; thus, their features happen to be helpful to compute the sensor odometry. Also, the few moving pedestrians and cars along the sequences do not represent a big region within the images. The Oxford Robotcar dataset, though, has many sequences with representative moving objects (driving cars), as well as sequences with only stationary objects (parked cars). Note that the reported results are not for the whole sequence, since the authors provide their VO solution as ground truth, stating that it is accurate over hundreds of metres. Hence, the sequences we use are between 100 and 300 m long.

To perform the KITTI experiment, we have retrained our network with 256 × 768 resolution images for a better adaptation. The CARLA camera intrinsics have also been modified to match the ones used in KITTI. In this case, we have swapped the previously used semantic segmentation model trained only on Cityscapes for the ERFNet model with encoder trained on ImageNet and decoder trained on the Cityscapes train set, and have finetuned it with the KITTI semantic segmentation training dataset. The generator and discriminator have also been finetuned with such data, as explained in Section V-C.

Fig. 16 shows the evaluation of our method's performance with the sequences from the Oxford Robotcar and the KITTI datasets. The asterisk at the beginning of some sequence names means that most of the observed vehicles are either not moving or parked. The other sequences, though, present many moving vehicles. In the former type of sequences (*), the highest accuracy should be observed in the case in which the raw images are used [see Fig. 16(a)]. Removing the features from such vehicles, as in Fig. 16(b), leads to a lower accuracy since the most nearby features are no longer used. Inpainting the static scene behind these vehicles [see Fig. 16(c)] would still remove nearby features but would create new static features a little bit further away. That is, the visual odometry accuracy should be lower than in Fig. 16(a) but a little higher than or similar to Fig. 16(b). This is the case of our performance on the Oxford Robotcar sequences and the KITTI sequence 03. However, DynaSLAM achieves a better result in the KITTI sequence 07 than the proposed Empty Cities. After the first half of the sequence, there are a few consecutive frames in which a truck covers almost 75% of the image. The task of inpainting becomes especially difficult, thus worsening the estimation of the camera's trajectory. Regarding the performance on the second type of sequences, removing the features from moving vehicles and pedestrians should lead to a lower ATE [see Fig. 16(b) compared to Fig. 16(a)]. Empty Cities adds in these sequences an important number of features for pose estimation that usually leads to a slightly better trajectory estimation [see Fig. 16(c)]. Note that the results given here for the KITTI sequences might not match the ones reported by ORB-SLAM and DynaSLAM, respectively, because of the utilized image resolution (256 × 768).

Fig. 16. Vertical axis shows the KITTI and Oxford Robotcar dataset sequences in which we have tested our model, and the horizontal axes show the boxplots of the absolute trajectory RMSE [m] with the results from ten repetitions of every test. (a) and (b) ORB-SLAM and DynaSLAM trajectory errors, respectively, for the raw dynamic images. (c) ORB-SLAM results when our framework is employed. The asterisk at the beginning of some sequence names means that most of the observed vehicles are either not moving or parked.

Fig. 17. (a) and (b) show the precision and recall curves for the VPR results with BoW and NetVLAD, respectively. We report the results for the dynamic, the inpainted, and the ground-truth-static sequences, as well as the results obtained when masking out the dynamic objects as in DynaSLAM [only in (a)]. The precision and recall curves for the inpainting methods Geo1 [14] and Lea1 [17] are also presented.

B. Visual Place Recognition

Visual place recognition (VPR) is an important task for visual SLAM. Such algorithms are useful when revisiting places to perform loop closure and correct the accumulated drift along long trajectories. Bags of visual words (BoW) is the approach widely used to perform such a task, as can be seen in ORB-SLAM [23] and LDSO [57]. Lately, thanks to the boost of deep learning, learnt global image descriptors are also used for VPR [58]. In our previous work [21], we showed preliminary results proving the benefits of our solution for VPR by using descriptors from an off-the-shelf CNN [59].

1) Baseline for VPR in Our Synthetic Dataset: In this section, we show a VPR experiment performed with the bag-of-words work by Mur and Tardós [41], [60]. It is ideal to test our model, since it is based on ORB features. We also show an experiment with one of the strongest learning-based baselines, NetVLAD [58], which is trained for the specific task of VPR. This comparison can provide a broader understanding of how and when end-to-end task-specific learning becomes more or less suitable than an explicit use of semantics-based visual description, which forms the primary pitch of this article.

We have generated two CARLA sequences with loop closures, with and without dynamic objects. Two images are defined as the same place if they are less than 10 m apart.

The precision-recall curves for the BoW experiments are depicted in Fig. 17(a). We have extracted the visual words of every frame along the trajectories and have tried to match every two images as a function of the number of common visual words. For both trajectories, the results obtained with the dynamic images with and without masks are similar. This is congruent with the idea that the database of visual words mostly contains static and long-term stable words. The first trajectory is a good example of how the VPR recall drops fast in the presence of dynamic objects: a place is better represented with words from the whole static image. Mostly, it leads to fewer false positives and fewer false negatives. Even though our method slightly brings closer in the bag-of-words space images from the same place with different dynamic objects, we would have expected our results to be closer to those of the ground-truth static images. Our intuition behind these results is that the synthetic features, despite being useful for feature matching, do not fully fall on any visual words in the BoW space.

The precision-recall curves for the NetVLAD experiments are shown in Fig. 17(b). We have extracted the learnt descriptors of every frame along the trajectories and tried to match every two images as a function of their descriptors' Euclidean distance.² Despite NetVLAD's incredible performance on VPR in the face of illumination, viewpoint, and clutter changes, we show that its performance is slightly degraded by dynamic objects.

²With the Python-Tensorflow implementation by Cieslewski et al. [61].

TABLE III: QUANTITATIVE MAPPING RESULTS FOR FIGS. 19 AND 20. Bold entries denote the best performing system. We report the Euclidean fitness score given by the ICP algorithm. This score has been computed between the different point clouds w.r.t. the point cloud built with the ground-truth static images. Note that, since the maps do not have scale, the reported scores are up-to-scale.

Fig. 18. (a) Ref. (b) Query. (c) Empty Ref. (d) Empty Query. (a) and (b) show the same location with different viewpoints and object setups. NetVLAD [58] fails to match them, but it succeeds when our framework is previously employed [(c) and (d)].
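The VPR matching protocol, matching every two images by descriptor distance and calling a match correct when the camera positions are within 10 m, can be sketched as follows. The descriptors and thresholds here are illustrative stand-ins for the BoW or NetVLAD representations:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def precision_recall(desc, pos, thresh, same_place_dist=10.0):
    """Match every image pair whose descriptor distance is below `thresh`;
    a match is correct if the camera positions are within 10 m (the
    same-place criterion used for the CARLA loop-closure sequences)."""
    tp = fp = fn = 0
    n = len(desc)
    for i in range(n):
        for j in range(i + 1, n):
            predicted = euclidean(desc[i], desc[j]) < thresh
            actual = euclidean(pos[i], pos[j]) < same_place_dist
            if predicted and actual:
                tp += 1
            elif predicted:
                fp += 1
            elif actual:
                fn += 1
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 1.0
    return prec, rec

desc = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
pos = [(0.0, 0.0), (3.0, 0.0), (100.0, 0.0)]
print(precision_recall(desc, pos, thresh=1.0))  # (1.0, 1.0)
```

Sweeping `thresh` from small to large traces out precision-recall curves like the ones in Fig. 17.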
Empty Cities brings closer together the descriptors of the same place, and pulls apart the descriptors of different places with the same dynamic objects, leading to higher precision and recall. That is, the hidden semantic representations of our model match the ones learnt by NetVLAD. We can conclude that our method brings more relevant improvements in VPR if it is used in conjunction with a learning-based method. Finally, Fig. 18 shows a case in which NetVLAD fails at matching two dynamic frames of the same place, but manages to match them when dynamic objects are inpainted with our framework.

2) Baseline for Inpainting: We compare our approach to the inpainting methods (Geo1 [14] and Lea1 [17]) w.r.t. the VPR metrics in Fig. 17. For the BoW curves, the conclusion is similar to what we have seen in the previous experiment. Even if the extracted synthetic features were useful for matching, they seem to be of less help for place recognition with BoW. Few synthetic ORB features match any existing visual word in the BoW space. Our method, though, creates more useful visual words than Lea1 and Geo1. As for the results with NetVLAD, the use of the geometric method Geo1 brings little improvement. However, the learning-based method Lea1 decreases NetVLAD's performance on the second trajectory. NetVLAD can, up to some extent, ignore the dynamic classes' clues, but cannot ignore the transformed regions. That is, the hidden semantic representations of these inpainted regions do not always match the static scene representation learnt by NetVLAD.

C. Mapping

Another important application of our framework is the creation of high-detail road maps. Our inpainting framework allows us to create long-term and reusable maps that show neither dynamic objects nor the empty spaces left by them. To this end, we use the MVS and SfM software COLMAP [62], [63].

1) Baseline for Mapping in Our Synthetic Dataset: Fig. 19 shows the dense map of a simulated city environment (CARLA) with the original dynamic sequence, with the images processed by our framework, and with the ground-truth static images. The map seen in Fig. 19(a) is not useful for future use since it shows dynamic objects that might not be there any more. Fig. 19(c) shows the map built with the ground-truth static images, and Fig. 19(b) shows the map computed with our generated images. The areas which have been consistently inpainted along the frames are mapped, even if they have never been seen. When inpainting fails or is not consistent along the sequence, the photometric and geometric epipolar constraints are not met and such areas cannot be reconstructed. This idea makes our framework suitable to build stable maps.

To give a quantitative experiment on the validity of our maps, we choose the standard iterative closest point (ICP) algorithm [64] based on the Euclidean fitness score. That is, having the point cloud built with the ground-truth static images as a fixed reference, for each of its points the algorithm searches for the closest point in the target point cloud and calculates the distance based on the result of the search. To have a baseline for our experiment, we compute the point cloud with the dynamic images and with the CARLA segmentation. That is, the pixels belonging to the dynamic objects have not been used in the multiview-stereo pipeline. The results are described in Table III. Since the map has no scale, the ICP Euclidean fitness score is also up-to-scale. Even though the similarity score is improved when masking out dynamic objects, the map built with our images has triangulated more 3-D points from the inpainted regions, and such points have a low error.

2) Baseline for Inpainting: We also compare our approach with the two other state-of-the-art inpainting methods (Geo1 [14] and Lea1 [17]) w.r.t. the map quality. The qualitative results are depicted in Fig. 20, and the ICP scores are described in Table III. It can be seen that with both other inpainting methods the shadows of the dynamic objects are reconstructed, as well as their prolongation into the inpainted image regions. This leads to an ICP score that is higher than that of the map built with dynamic objects.

3) Baseline for Mapping in Real-World Environments: Fig. 21 shows an example of the computed dense maps for both types of inputs with the sequence 04 from the KITTI dataset. These maps have been computed with the camera poses given by ORB-SLAM using, respectively, the dynamic and inpainted images. Our framework, when able to inpaint a coherent context along the sequence, allows us to densely reconstruct unseen areas. When the inpainting is not coherent along the sequence, the epipolar constraints are not met and, therefore, such areas cannot be reconstructed. Note that this map is shown in RGB only for visualization purposes.³ We do recommend using the grayscale images for both localization and mapping purposes, since a lower reconstruction error is usually achieved.

³Since our framework offers enough flexibility, we have retrained our model with RGB images just as explained in Section IV. The only difference is that, since the features have to be extracted in grayscale images, we add a convolutional layer to convert the RGB output images to grayscale.

Fig. 19. Dense maps of a CARLA city environment with the COLMAP Multi-View-Stereo software. The upper row shows some of the sequence images used to build such maps. (a) Case in which the original dynamic images have been used. (b) Resulting map with the images previously processed by our framework. (c) Resulting map created with the ground-truth static images. All maps are computed with the ground-truth camera poses.

Fig. 20. Dense maps of the same CARLA city environment as in Fig. 19 with the COLMAP Multi-View-Stereo software. (a) Case in which our inpainted images have been used. (b) and (c) Resulting maps with the images previously processed by the frameworks of Telea [14] and Yu et al. [17], respectively. All maps are computed with the ground-truth camera poses.
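The ICP Euclidean fitness score described above can be sketched as follows. This assumes the two point clouds are already expressed in a common (up-to-scale) frame, and it uses brute-force nearest-neighbor search instead of the k-d tree a real implementation would use:

```python
def euclidean_fitness(reference, target):
    """For every point of the reference (ground-truth-static) cloud, find
    the closest target point and average the squared distances, as in the
    fitness score returned by common ICP implementations."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return sum(min(d2(p, q) for q in target) for p in reference) / len(reference)

ref = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
tgt = [(0, 0.1, 0), (1, 0, 0), (2, -0.2, 0)]
print(round(euclidean_fitness(ref, tgt), 4))
```

A lower score means the evaluated map lies closer to the ground-truth-static map, which is how the entries of Table III are ranked.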
Fig. 21. Dense maps built with images from the KITTI dataset [48]. The upper row shows some of the sequence images used to build each map. (a) Case in which the original dynamic images have been used. (b) Resulting map with the images previously processed by our framework. Both maps are computed with the camera poses that ORB-SLAM estimates for the different sequences.

D. Timing Analysis

Reporting our framework's efficiency is crucial to judge its suitability for autonomous driving and robotic tasks in general. The end-to-end pipeline runs at 100 frames per second on an NVIDIA Titan Xp 12 GB with images of a 512 × 512 resolution. Out of the 10 ms it takes to process one frame, 8 ms are invested in obtaining its semantic segmentation, and 2 ms are used for the inpainting task. Other than to deal with dynamic objects, the semantic segmentation may be needed for many other tasks involved in autonomous navigation. In such cases, our framework would only add two extra milliseconds per frame. Based on our analysis, we consider that the inpainting task is not a bottleneck.

VII. FAILURE MODES AND FUTURE WORK

The aim of this section is to provide the reader with an understanding of the benefits and limitations of our proposal, and of how to integrate it into a VO, SLAM, or MVS pipeline.

Having a static representation of the scene leads to better visual odometry results than just excluding the features belonging to moving objects. The presented inpainting approach has, though, some weaknesses: the bigger the image's dynamic region is, the lower the inpainting quality of the resulting image. Empty Cities would be suitable in setups in which approximately less than 15% of the camera field of view is covered by dynamic objects. In such setups, the reconstruction L1 error is acceptable and usually lies between 1% and 10%. The L1 error goes above 10% when more than 15% of the image pixels are covered.⁴ Work remains to be done to tackle extreme situations. Also, developing a system that processes the sequence as a whole, rather than binning it into independent frames, would result in a more consistent image inpainting along time.

Our system processes the image streams outside of the application pipeline and, hence, can be used naturally as a front end to many existing systems. We hereby want to discuss other application-dependent possibilities to boost its performance.

Visual odometry: Removing the features that belong to stationary objects certainly damages VO. However, e.g., a car can change from static to moving from one frame to another. Had we a movement detector, we would use the static objects' features and the inpainted ones behind moving objects.

Place recognition: Using the features of stationary dynamic objects damages the performance of place recognition algorithms. For example, two frames of the same place with a different setup of parked cars can be incorrectly tagged as a different place. Also, two frames of different places with the same parked-cars setup can be wrongly matched as the same place. Only the features of objects that remain stable in the long term (buildings, sidewalks, etc.) would benefit VPR.

Mapping: A map containing information about dynamic objects would be useless for future reuse. Only the information belonging to objects that remain stable in the long term, as well as the most likely static representation of the static scene behind dynamic objects, should be included in the map.

Ideally, one would use a movement detector to identify the status of the different observed instances and also to allow the discovery of new dynamic classes on the fly [13]. The features belonging to static instances would be used for visual odometry, and the corresponding inpainted features would be used for place recognition and mapping. This approach would bring the highest accuracy but would entail a series of modifications to the existing pipeline. Our method would currently pose problems in the case in which one wanted to inpaint the static scene behind a car that is currently moving, and this static scene contained parked cars. Our model would fail to reconstruct the unseen parts of the parked cars. It would be interesting for future work to include such scenarios in our training data. Our suggestion is, for now, inpainting all instances, regardless of their current dynamic status.

⁴As a practical example, the Oxford Robotcar sequence 2014-05-06-12-54-54 has 56% of images with less than 5% of covered pixels, 30% with a percentage of dynamic pixels between 5% and 10%, 12% between 10% and 15%, and 2% between 15% and 20%. This sequence is not Manhattan at 11 A.M., but it shows cars parked on both sides of the road and cars driving nearby.

VIII. CONCLUSION

We have presented an end-to-end deep learning framework that translates images containing dynamic objects within a city environment, such as vehicles or pedestrians, into realistic images with only static content. These images are suitable for visual odometry, place recognition, and mapping tasks, thanks to a new loss based on steganalysis techniques and on ORB feature maps, descriptors, and orientations. We motivated this extra complexity by showing quantitatively that the systems ORB-SLAM and DSO obtain a higher accuracy when utilizing the images synthesized with this loss. Also, mapping systems can benefit from this approach since not only would they not map dynamic objects, but they would also map the plausible static scene behind them. Finally, an architectural nicety is that our system processes the image streams outside of the localization pipeline, either offline or online and, hence, can be used naturally as a front end to many existing systems.

REFERENCES

[1] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Trans. Robot., vol. 31, no. 5, pp. 1147-1163, Oct. 2015.
[2] J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 611-625, Mar. 2018.
[3] C. Forster, M. Pizzoli, and D. Scaramuzza, "SVO: Fast semi-direct monocular visual odometry," in Proc. IEEE Int. Conf. Robot. Autom., 2014, pp. 15-22.
[4] A. Agudo, F. Moreno-Noguer, B. Calvo, and J. M. M. Montiel, "Sequential non-rigid structure from motion using physical priors," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 5, pp. 979-994, May 2015.
[5] J. Lamarca, S. Parashar, A. Bartoli, and J. Montiel, "DefSLAM: Tracking and mapping of deforming scenes from monocular sequences," IEEE Trans. Robot., 2020.
[6] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Mapping, tracking and inpainting in dynamic scenes," IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 4076-4083, Oct. 2018.
[7] P. F. Alcantarilla, J. J. Yebes, J. Almazán, and L. M. Bergasa, "On combining visual SLAM and dense scene flow to increase the robustness of localization and mapping in dynamic environments," in Proc. IEEE Int. Conf. Robot. Autom., 2012, pp. 1290-1297.
[8] Y. Wang and S. Huang, "Motion segmentation based robust RGB-D SLAM," in Proc. 11th World Congr. Intell. Control Autom., 2014, pp. 3122-3127.
[9] W. Tan, H. Liu, Z. Dong, G. Zhang, and H. Bao, "Robust monocular SLAM in dynamic environments," in Proc. Int. Symp. Mixed Augmented Reality, 2013, pp. 209-218.
[10] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2980-2988.
[11] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, "ERFNet: Efficient residual factorized convnet for real-time semantic segmentation," IEEE Trans. Intell. Transport. Syst., vol. 19, no. 1, pp. 263-272, Jan. 2018.
[12] D. Barnes, W. Maddern, G. Pascoe, and I. Posner, "Driven to distraction: Self-supervised distractor learning for robust monocular visual odometry in urban environments," in Proc. IEEE Int. Conf. Robot. Autom., 2018, pp. 1894-1900.
[13] G. Zhou, B. Bescos, M. Dymczyk, M. Pfeiffer, J. Neira, and R. Siegwart, "Dynamic objects segmentation for visual localization in urban environments," 2018, arXiv:1807.02996.
[14] A. Telea, "An image inpainting technique based on the fast marching method," J. Graph. Tools, vol. 9, no. 1, pp. 23-34, 2004.
[15] M. Bertalmio, A. L. Bertozzi, and G. Sapiro, "Navier-Stokes, fluid dynamics, and image and video inpainting," in Proc. IEEE Comput. Soc. Conf. [...]
[40] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis, "Learning rich features for image manipulation detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1053-1061.
[41] D. Gálvez-López and J. D. [...]
[16] G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro, "Image inpainting for irregular holes using partial convolutions," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 89–105.
[17] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, "Generative image inpainting with contextual attention," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5505–5514.
[18] S. Iizuka, E. Simo-Serra, and H. Ishikawa, "Globally and locally consistent image completion," ACM Trans. Graph., vol. 36, no. 4, 2017, Art. no. 107.
[19] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2536–2544.
[20] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1125–1134.
[21] B. Bescos, R. Siegwart, J. Neira, and C. Cadena, "Empty cities: Image inpainting for a dynamic-object-invariant space," in Proc. IEEE Int. Conf. Robot. Autom., 2019, pp. 5460–5466.
[22] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 2564–2571.
[23] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, Oct. 2017.
[24] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Proc. Int. Symp. Mixed Augmented Reality, 2007, pp. 225–234.
[25] M. Runz, M. Buffier, and L. Agapito, "MaskFusion: Real-time recognition, tracking and reconstruction of multiple moving objects," in Proc. IEEE Int. Symp. Mixed Augmented Reality, 2018, pp. 10–20.
[26] R. Scona, M. Jaimez, Y. R. Petillot, M. Fallon, and D. Cremers, "StaticFusion: Background reconstruction for dense RGB-D SLAM in dynamic environments," in Proc. IEEE Int. Conf. Robot. Autom., 2018, pp. 3849–3856.
[27] M. Granados, K. I. Kim, J. Tompkin, J. Kautz, and C. Theobalt, "Background inpainting for videos with dynamic objects and a free-moving camera," in Proc. Eur. Conf. Comput. Vis., 2012, pp. 682–695.
[28] R. Uittenbogaard, D. Gavrila, C. Sebastian, and J. Vijverberg, "Moving object detection and image inpainting in street-view imagery," Master's thesis, Delft Univ. Technol., Delft, The Netherlands, 2018.
[29] A. A. Efros and W. T. Freeman, "Image quilting for texture synthesis and transfer," in Proc. 28th Annu. Conf. Comput. Graph. Interactive Techn., 2001, pp. 341–346.
[30] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li, "High-resolution image inpainting using multi-scale neural patch synthesis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, vol. 1, pp. 4076–4084.
[31] Y. Song, C. Yang, Z. L. Lin, H. Li, Q. Huang, and C.-C. J. Kuo, "Contextual-based image inpainting: Infer, match, and translate," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 3–19.
[32] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Deep image prior," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9446–9454.
[33] I. Goodfellow et al., "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[34] J. Gauthier, "Conditional generative adversarial nets for convolutional face generation," Class Project, Stanford CS231N: Convolutional Neural Netw. Vis. Recognit., Winter semester, vol. 2014, no. 5, p. 2, 2014.
[35] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervention, 2015, pp. 234–241.
[36] R. Guerrero et al., "White matter hyperintensity and stroke lesion segmentation and differentiation using convolutional neural networks," NeuroImage: Clin., vol. 17, pp. 918–934, 2018.
[37] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[38] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. Int. Conf. Learn. Representations, 2013.
[39] J. Fridrich and J. Kodovsky, "Rich models for steganalysis of digital images," IEEE Trans. Inf. Forensics Secur., vol. 7, no. 3, pp. 868–882, Jun. 2012.
[40] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis, "Learning rich features for image manipulation detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1053–1061.
[41] D. Gálvez-López and J. D. Tardós, "Bags of binary words for fast place recognition in image sequences," IEEE Trans. Robot., vol. 28, no. 5, pp. 1188–1197, Oct. 2012.
[42] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in Proc. Eur. Conf. Comput. Vis., 2006, pp. 430–443.
[43] H. Porav, W. Maddern, and P. Newman, "Adversarial training for adverse conditions: Robust metric localisation using appearance transfer," in Proc. IEEE Int. Conf. Robot. Autom., 2018, pp. 1011–1018.
[44] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Comput. Vis. Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
[45] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, ERFNet, 2017. [Online]. Available: https://github.com/Eromera/erfnet
[46] M. Cordts et al., "The Cityscapes dataset for semantic urban scene understanding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3213–3223.
[47] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "CARLA: An open urban driving simulator," in Proc. 1st Annu. Conf. Robot Learn., 2017, pp. 1–16.
[48] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," Int. J. Robot. Res., vol. 32, no. 11, pp. 1231–1237, 2013.
[49] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[50] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, "1 year, 1000 km: The Oxford RobotCar dataset," Int. J. Robot. Res., vol. 36, no. 1, pp. 3–15, 2017. [Online]. Available: http://dx.doi.org/10.1177/0278364916679498
[51] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: A 10 million image database for scene recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1452–1464, Jun. 2018.
[52] S. Hosseinzadeh, M. Shakeri, and H. Zhang, "Fast shadow detection from a single image using a patched convolutional neural network," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst., 2018, pp. 3124–3129.
[53] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, "Virtual worlds as proxy for multi-object tracking analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4340–4349.
[54] M. Peris, S. Martull, A. Maki, Y. Ohkawa, and K. Fukui, "Towards a simulation driven stereo vision system," in Proc. 21st Int. Conf. Pattern Recognit., 2012, pp. 1038–1042.
[55] J. Skinner, S. Garg, N. Sünderhauf, P. Corke, B. Upcroft, and M. Milford, "High-fidelity simulation for evaluating robotic vision performance," in Proc. IEEE Int. Conf. Intell. Robot. Syst., 2016, pp. 2737–2744.
[56] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in Proc. IEEE Int. Conf. Intell. Robot. Syst., 2017, pp. 23–30.
[57] X. Gao, R. Wang, N. Demmel, and D. Cremers, "LDSO: Direct sparse odometry with loop closure," in Proc. Int. Conf. Intell. Robot. Syst., 2018, pp. 2198–2204.
[58] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, "NetVLAD: CNN architecture for weakly supervised place recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1437–1451.
[59] D. Olid, J. M. Fácil, and J. Civera, "Single-view place recognition under seasonal changes," 2018, arXiv:1808.06516.
[60] R. Mur-Artal and J. D. Tardós, "Fast relocalisation and loop closing in keyframe-based SLAM," in Proc. IEEE Int. Conf. Robot. Autom., 2014, pp. 846–853.
[61] T. Cieslewski, S. Choudhary, and D. Scaramuzza, "Data-efficient decentralized visual SLAM," in Proc. IEEE Int. Conf. Robot. Autom., 2018, pp. 2466–2473.
[62] J. L. Schönberger and J.-M. Frahm, "Structure-from-motion revisited," in Proc. Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4104–4113.
[63] J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm, "Pixelwise view selection for unstructured multi-view stereo," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 501–518.
[64] P. J. Besl and N. D. McKay, "Method for registration of 3-D shapes," Sensor Fusion IV: Control Paradigms Data Struct., vol. 1611, pp. 586–606, 1992.

BESCOS et al.: EMPTY CITIES: A DYNAMIC-OBJECT-INVARIANT SPACE FOR VISUAL SLAM 451

Berta Bescos was born in Zaragoza, Spain, in 1993. She received the bachelor's and M.S. degrees in industrial engineering with mention in robotics and computer vision from the University of Zaragoza, Zaragoza, Spain, where she is currently working toward the Ph.D. degree with the I3A Robotics, Perception and Real Time Group, with her Ph.D. topic dealing with dynamic objects in SLAM for a better scene understanding. Her research interests include the intersection between perception and learning for robotics.

José Neira was born in Bogotá, Colombia, in 1963. He received the M.S. degree from the Universidad de los Andes, Bogotá, Colombia, in 1986, and the Ph.D. degree from the University of Zaragoza, Zaragoza, Spain, in 1993, both in computer science. Since 2010, he has been a Full Professor with the Departamento de Informática e Ingeniería de Sistemas, University of Zaragoza, where he is in charge of courses in compiler theory, computer vision, machine learning, and mobile robotics. His current research interests are centered around robust, life-long simultaneous localization and mapping. He also coordinates the university's Master Program in robotics, graphics and computer vision.

Cesar Cadena received the Ph.D. degree in computer science from the University of Zaragoza, Zaragoza, Spain, in 2011. He is a Senior Researcher with ETH Zurich, Zurich, Switzerland. He is particularly interested in how to provide machines the capability of understanding this ever-changing world through the sensory information they can gather. He has worked intensively on robotic scene understanding, both geometry and semantics, covering semantic mapping, data association and place recognition tasks, simultaneous localization and mapping problems, as well as persistent mapping in dynamic environments. His main research interests include the intersection of perception and learning in robotics.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2020年/VDO-SLAM a visual dynamic object aware SLAM system.pdf b/动态slam/2020年-2022年开源动态SLAM/2020年/VDO-SLAM a visual dynamic object aware SLAM system.pdf
new file mode 100644
index 0000000..f342bfd
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2020年/VDO-SLAM a visual dynamic object aware SLAM system.pdf
@@ -0,0 +1,1136 @@

MANUSCRIPT ONLY 1

VDO-SLAM: A Visual Dynamic Object-aware SLAM System

Jun Zhang[co], Mina Henein[co], Robert Mahony and Viorela Ila

arXiv:2005.11052v3 [cs.RO] 14 Dec 2021

Abstract—Combining Simultaneous Localisation and Mapping
(SLAM) estimation and dynamic scene modelling can highly benefit robot autonomy in dynamic environments. Robot path planning and obstacle avoidance tasks rely on accurate estimates of the motion of dynamic objects in the scene. This paper presents VDO-SLAM, a robust visual dynamic object-aware SLAM system that exploits semantic information to enable accurate motion estimation and tracking of dynamic rigid objects in the scene without any prior knowledge of the objects' shapes or geometric models. The proposed approach identifies and tracks the dynamic objects and the static structure in the environment and integrates this information into a unified SLAM framework. This results in highly accurate estimates of the robot's trajectory and the full SE(3) motion of the objects, as well as a spatiotemporal map of the environment. The system is able to extract linear velocity estimates from objects' SE(3) motion, providing an important functionality for navigation in complex dynamic environments. We demonstrate the performance of the proposed system on a number of real indoor and outdoor datasets, and the results show consistent and substantial improvements over state-of-the-art algorithms. An open-source version of the source code is available∗.

Index Terms—SLAM, dynamic scene, object motion estimation, multiple object tracking.

Jun Zhang, Mina Henein and Robert Mahony are with the Australian National University (ANU), 0020 Canberra, Australia. {jun.zhang2,mina.henein,robert.mahony}@anu.edu.au
Viorela Ila is with the University of Sydney (USyd), 2006 Sydney, Australia. viorela.ila@sydney.edu.au
[co]: The two authors contributed equally to this work.
∗https://github.com/halajun/vdo_slam

Fig. 1: Results of our VDO-SLAM system. (Top) A full map including the camera trajectory in red, static background points in black, and points on moving objects colour-coded by their instance. (Bottom) Detected 3D points on the static background and the objects' bodies, and the estimated object speed. Black circles represent static points, and each object is shown in a different colour.

I. INTRODUCTION

The ability of a robot to build a model of the environment, often called a map, and to localise itself within this map is a key factor in enabling autonomous robots to operate in real-world environments. Creating these maps is achieved by fusing multiple sensor measurements into a consistent representation using estimation techniques such as Simultaneous Localisation And Mapping (SLAM). SLAM is a mature research topic and has already revolutionised a wide range of applications, from mobile robotics, inspection, entertainment and film production to exploration and monitoring of natural environments, amongst many others. However, most of the existing solutions to SLAM rely heavily on the assumption that the environment is predominantly static.

The conventional techniques to deal with dynamics in SLAM either treat any sensor data associated with moving objects as outliers and remove them from the estimation process ([1]–[5]), or detect moving objects and track them separately using traditional multi-target tracking approaches ([6]–[9]). The former technique excludes information about dynamic objects in the scene and generates static-only maps. The accuracy of the latter depends on the camera pose estimation, which is more susceptible to failure in complex dynamic environments. The increased presence of autonomous systems in dynamic environments is driving the community to challenge the static-world assumption that underpins most existing open-source SLAM algorithms. In this paper, we redefine the term "mapping" in SLAM to be concerned with a spatiotemporal representation of the world, as opposed to the concept of a static map that has long been the emphasis of classical SLAM algorithms. Our approach focuses on accurately estimating the motion of all dynamic entities in the environment, including the robot and other moving objects in the scene, this information being highly relevant in the context of robot path planning and navigation in dynamic environments.

Existing scene motion estimation techniques mainly rely on optical flow estimation ([10]–[13]) and scene flow estimation ([14]–[17]). Optical flow records the scene motion by estimating the velocities associated with the movement of brightness patterns on an image plane. Scene flow, on the other hand, describes the 3D motion field of a scene observed at different instants of time. These techniques only estimate the linear translation of individual pixels or 3D points in the scene, and do not exploit the collective behaviour of points on rigid objects, failing to describe the full SE(3) motion of objects in the scene. In this paper we exploit this collective behaviour of points on individual objects to obtain accurate and robust motion estimation of the objects in the scene while simultaneously localising the robot and mapping the environment.

A typical SLAM system consists of a front-end module, which processes the raw data from the sensors, and a back-end module, which integrates the obtained information (raw and higher-level) into a probabilistic estimation framework. Simple primitives such as 3D locations of salient features are commonly used to represent the environment. This is largely a consequence of the fact that points are easy to detect, track, and integrate within the SLAM estimation problem.

Feature tracking has become more reliable and robust with advances in deep learning, which provide algorithms that can reliably estimate, in a dense manner, the 2D optical flow associated with the apparent motion of every pixel in an image. This task is particularly important for data association and has otherwise been challenging in dynamic environments using classical feature tracking methods.

Other primitives such as lines and planes ([18]–[21]) or even objects ([22]–[24]) have been considered in order to provide richer map representations. To incorporate such information into existing geometric SLAM algorithms, either a dataset of 3D models of every object in the scene must be available a priori ([23], [25]), or the front end must explicitly provide object pose information in addition to detection and segmentation ([26]–[28]), adding a layer of complexity to the problem. The requirement for accurate 3D models severely limits the potential domains of application, while, to the best of our knowledge, multiple object tracking and 3D pose estimation remain a challenge for learning techniques. There is a clear need for an algorithm that can exploit the powerful detection and segmentation capabilities of modern deep learning algorithms ([29], [30]) without relying on additional pose estimation or object model priors, an algorithm that operates at feature level with the awareness of an object concept.

While the problems of SLAM and object motion tracking/estimation have long been studied in isolation in the literature, recent approaches try to solve the two problems in a unified framework ([31], [32]). However, they both focus on the SLAM back end instead of a full system, resulting in severely limited performance in real-world scenarios. In this paper, we carefully integrate our previous works ([31], [33]) and propose VDO-SLAM, a novel feature-based stereo/RGB-D dynamic SLAM system that leverages image-based semantic information to simultaneously localise the robot, map the static and dynamic structure, and track the motions of rigid objects in the scene. Different to [31], we rely on a denser object feature representation to ensure robust tracking, and propose new factors to smooth the motion of rigid objects in urban driving scenarios. Different to [33], an improved robust feature and object tracking method is proposed, with the ability to handle indirect occlusions resulting from the failure of semantic object segmentation. In summary, the contributions of this work are:

• a novel formulation to model dynamic scenes in a unified estimation framework over robot poses, static and dynamic 3D points, and object motions;
• accurate estimation of the SE(3) motion of dynamic objects that outperforms state-of-the-art algorithms, as well as a way to extract objects' velocities in the scene;
• a robust method for tracking moving objects that exploits semantic information, with the ability to handle indirect occlusions resulting from the failure of semantic object segmentation;
• a demonstrable full system in complex and compelling real-world scenarios.

To the best of our knowledge, this is the first full dynamic SLAM system that is able to achieve motion segmentation and dynamic object tracking, estimate the camera poses along with the static and dynamic structure and the full SE(3) pose change of every rigid object in the scene, extract velocity information, and be demonstrable in real-world outdoor scenarios (see Fig. 1). We demonstrate the performance of our algorithm on real datasets and show the capability of the proposed system to resolve rigid object motion estimation and yield motion results that are comparable to the camera pose estimation in accuracy and that outperform state-of-the-art algorithms by an order of magnitude in urban driving scenarios.

The remainder of this paper is structured as follows. In Section II we discuss the related work. In Sections III and IV we describe the proposed algorithm and system. We introduce the experimental setup, followed by the results and evaluations, in Section V. We summarise and offer concluding remarks in Section VI.

II. RELATED WORK

In the past two decades, the study of SLAM for dynamic environments has become more and more popular in the community, with a considerable number of algorithms being proposed to solve the dynamic SLAM problem. Motivated by the different goals to be achieved, solutions in the literature can mainly be divided into three categories.

The first category aims at robust SLAM in dynamic environments. Early methods in this category ([2], [34], [35]) normally detect and remove the information drawn from the dynamic foreground, which is seen as degrading SLAM performance. More recent methods on this track tend to go further by not just removing the dynamic foreground, but also inpainting or reconstructing the static background that is occluded by moving targets. [5] present DynaSLAM, which combines classic geometry and deep learning-based models to detect and remove dynamic objects, then inpaints the occluded background with multi-view information of the scene. Similarly, a Light Field SLAM front end is proposed by [36] to reconstruct the occluded static scene via Synthetic Aperture Imaging (SAI) techniques. Different from [5], features on the reconstructed static background are also tracked and used
The above state-of- Both methods succeed to exploit object information in a +the-art solutions achieve robust and accurate estimation by dense RGB-D SLAM framework, without prior knowledge of +discarding the dynamic information. However, we argue that object model. Their main interest, however, is the 3D object +this information has potential benefits for SLAM if it is prop- segmentation and consistent fusion of the dense map rather +erly modelled. Furthermore, understanding dynamic scenes in than the estimation of the motion of the objects. +addition to SLAM is crucial for many other robotics tasks such +as planning, control and obstacle avoidance, to name a few. Lately, the use of basic geometric models to represent + objects becomes a popular solution due to the less complexity + Approaches of the second category performs SLAM and and easy integration into a SLAM framework. In Quadric- +Moving Objects Tracking (MOT) separately, as an extension SLAM [46], detected objects are represented as ellipsoids to +to conventional SLAM for dynamic scene understanding ([9], compactly parametrise the size and 3D pose of an object. In +[37]–[39]). [37] developed a theory for performing SLAM this way, the quadric parameters are directly constrained as +with Moving Objects Tracking (SLAMMOT). In the latest geometric error and formulated together with camera poses +version of their SLAM with detection and tracking of mov- in a factor graph SLAM for joint estimation. [24] propose to +ing objects, the estimation problem is decomposed into two combine 2D and 3D object detection with SLAM for both +separate estimators (moving and stationary objects) to make static and dynamic environments. Objects are represented as +it feasible to update both filters in real time. [9] tackle the high-quality cuboids and optimized together with points and +SLAM problem with dynamic objects by solving the problems cameras through multi-view bundle adjustment. 
While both +of Structure from Motion (SfM) and tracking of moving methods prove the mutual benefit between detected object and +objects in parallel, and unifying the output of the system SLAM, their main focus is on object detection and SLAM +into a 3D dynamic map containing the static structure and primarily for static scenarios. In this paper, we take this +the trajectories of moving objects. Later in [38], the authors direction further to tackle the challenging problem of dynamic +propose to integrate semantic constraints to further improve the object tracking within a SLAM framework, and exploit the +3D reconstruction. The more recent work [39] present a stereo- relationships between moving objects and agent robot, static +based dense mapping algorithm in a SLAM framework, with and dynamic structures for potential advantages. +the advantage of accurately and efficiently reconstructing both +static background and moving objects in large scale dynamic Apart from the dynamic SLAM categories, the literature of +environments. The listed algorithms above have proven that 6-DoF object motion estimation is also crucial for dynamic +combining multiple objects tracking with SLAM is doable SLAM problem. Quite a few methods have been proposed in +and applicable for dynamic scene exploration. To take a step the literature to estimate SE(3) motion of objects in a visual +further by proper exploiting and establishing the spatial and odometry or SLAM framework ([50]–[52]). [50] present a +temporal relationships between the robot, static background, model-free method for detecting and tracking moving objects +stationary and dynamic objects, we show in this paper that in 3D LiDAR scans. The method sequentially estimates mo- +the problems of SLAM and multi-object tracking are mutually tion models using RANSAC [53], then segments and tracks +beneficial. multiple objects based on the models by a proposed Bayesian + approach. 
In [51], the authors address the problem of simul- + The last and most active category is object SLAM, which taneous estimation of ego and third-party SE(3) motions in +usually includes both static and dynamic objects. Algorithms complex dynamic scenes using cameras. They apply multi- +in this class normally require specific modelling and repre- model fitting techniques into a visual odometry pipeline and +sentation of 3D object, such as 3D shape ([40]–[42]), sur- estimate all rigid motions within a scene. In later work, [52] +fel [43] or volumetric [44] model, geometric model such as present ClusterVO that is able to perform online processing +ellipsoid ([45], [46]) or 3D bounding box ([24], [47]–[49]), for multiple motion estimations. To achieve this, a multi-level +etc., to extract high-level primitive (e.g., object pose) and probabilistic association mechanism is proposed to efficiently +integrate into a SLAM framework. [40] is one of the earliest track features and detections, then a heterogeneous Conditional +works to introduce an object-oriented SLAM paradigm, which Random Field (CRF) clustering approach is applied to jointly +represents cluttered scene in object level and constructs an infer cluster segmentations, with a sliding-window optimiza- +explicit graph between camera and object poses to achieve tion for clusters in the end. While the above proposed methods +joint pose-graph optimisation. Later, [41] propose a novel 3D represent an important step forward to the Multi-motion Visual +object recognition algorithm to ensure the system robustness Odometry (MVO) task, the study of spacial and temporal +and improve the accuracy of estimated object pose. The high- relationships is not fully explored but is arguably important. +level scene representation enables real-time 3D recognition Therefore, by carefully considering the pros and cons in the +and significant compression of map storage for SLAM. 
Never- literature of SLAM+MOT, object SLAM and MVO, this paper +theless, a database of pre-scanned or pre-trained object models proposes a visual dynamic object-aware SLAM system that is +has to be created in advance. To avoid prebuilt database, able to achieve robust ego and object motion tracking, as well +representing objects using surfel or voxel element in a dense as consistent static and dynamic mapping in a novel SLAM +manner starts to gain popularity, along with RGB-D cameras formulation. +becoming widely used. [43] present MaskFusion that adopts +surfel representation to model, track and reconstruct objects in III. METHODOLOGY +the scene, while [44] apply an octree-based volumetric model +to objects and build multi-object dynamic SLAM system. Before discussing details of the proposed system pipeline, + as shown in Fig. 4, this section covers the mathematical details + MANUSCRIPT ONLY 4 + +of the core components in the system. Variables and notations (5) is crucially important as it relates the same 3D point +are first introduced, including the novel way of modelling the +motion of a rigid-object in a model free manner. Then we on a rigid object in motion at consecutive time steps by +show how the camera pose and object motion are estimated Lk−1 +in the tracking component of the system. Finally, a factor a homogeneous transformation k−01Hk := 0Lk−1 k−1 Hk 0L−k−11 . +graph optimisation is proposed and applied in the mapping +component, to refine the camera poses and object motions, This equation represents a frame change of a pose transforma- +and build a global consistent map including static and dynamic +structure. tion [54], and shows how the body-fixed frame pose change + Lk−1 + k−1 Hk relates to the global reference frame pose change + + k−10Hk. The point motion in global reference frame is then + + expressed as: + + 0mik = k−10Hk 0mik−1 . (6) + +A. 
Background and Notation Equation (6) is at the core of our motion estimation approach, + as it expresses the rigid object pose change in terms of the + 1) Coordinate Frames: Let 0Xk,0 Lk ∈ SE(3) be the points that reside on the object in a model-free manner without + the need to include the object 3D pose as a random variable +robot/camera and the object 3D pose respectively, at time k in the estimation. Section III-B2 details how this rigid object + pose change is estimated based on the above equation. Here +in a global reference frame 0, with k ∈ T the set of time k−10Hk ∈ SE(3) represents the object point motion in global + reference frame; for the remainder of this document, we refer +steps. Note that calligraphic capital letters are used in our to this quantity as the object pose change or the object motion + for ease of reading. +notation to represent sets of indices. Fig. 2 shows these pose + B. Camera Pose and Object Motion Estimation +transformations as solid curves. + 2) Points: Let 0mki be the homogeneous coordinates of the The cost function chosen to estimate the camera pose and + +ith 3D point at time k, with 0mi = mix, myi , miz, 1 ∈ IE3 and object motion is associated with the 3D-2D re-projection error +i ∈ M the set of points. We write a point in robot/camera +frame as Xk mik =0 Xk−1 0mik. and is defined on the image plane. Since the noise is better + + Define Ik the reference frame associated with the image characterised in image plane, this yields more accurate results + +captured by the camera at time k chosen at the top left for camera localisation [55]. 
Moreover, based on this error +corner of the image, and let Ik pik = ui, vi, 1 ∈ IE2 be the pixel +location on frame Ik corresponding to the homogeneous 3D term, we propose a novel formulation to jointly optimise the +point Xk mki , which is obtained via the projection function π(·) +as follows: optical flow along with the camera pose and the object motion, + + Ik pik = π(Xk mik) = K Xk mik , (1) to ensure a robust tracking of points. In the mapping module, a + +where K is the camera intrinsics matrix. 3D error cost function is used in global optimization to ensure + + The camera and/or object motions both produce an optical best results of 3D structure and object motions estimation as +flow Ik φ i ∈ IR2 that is the displacement vector indicating the +motion of pixel Ik−1 pik−1 from image frame Ik−1 to Ik, and is later described in Section III-C. +given by: 1) Camera Pose Estimation: Given a set of static 3D + + Ik φ i = Ik p˜ ik − Ik−1 pik−1 . (2) points {0mik−1 | i ∈ M , k ∈ T } observed at time k − 1 in + global reference frame, and the set of 2D correspondences +Here Ik p˜ ik is the correspondence of Ik−1 pik−1 in Ik. Note that, {Ik p˜ ki | i ∈ M , k ∈ T } in image Ik, the camera pose 0Xk is +we overload the same notation to represent the 2D pixel estimated via minimizing the re-projection error: +coordinates ∈ IR2. In this work, we leverage optical flow to + ei(0Xk) = Ik p˜ ik − π(0X−k 1 0mik−1) . (7) + +find correspondences between consecutive frames. We parameterise the SE(3) camera pose by elements of the + Lie-algebra xk ∈ se(3): +3) Object and 3D Point Motions: The object motion be- + +tween times k − 1 and k is described by the homogeneous + Lk−1 +transformation k−1 Hk ∈ SE(3) according to: 0Xk = exp(0xk) , (8) + + Lk−1 Hk =0 L−k−11 0Lk . (3) and define 0x∨k ∈ IR6 with the vee operator a mapping from + k−1 se(3) to IR6. Using the Lie-algebra parameterisation of SE(3) + with the substitution of (8) into (7), the solution of the least +Fig. 
2 shows these motion transformations as dashed curves. squares cost is given by: + +We write a point in its corresponding object frame as nb +Lk mik = 0L−k 1 0mki (shown as a dashed vector from the object ∑ 0x∗k∨ = argmin ρh ei (0xk) Σ−p 1 ei(0xk) +reference frame to the red dot in Fig. 2), substituting the object (9) + +pose at time k from (3), this becomes: 0x∨k i + +0mik = 0Lk Lk mik = 0Lk−1 Lk−1 Hk Lk mik . (4) for all nb visible 3D-2D static background point correspon- + k−1 + dences between consecutive frames. Here ρh is the Huber +Note that for rigid body objects, Lk mki stays constant at Lmi, +and Lmi = 0Lk−1 0mki = 0L−k+1n 0mik+n for any integer n ∈ Z. function [56], and Σp is the covariance matrix associated with +Then, for rigid objects with n = −1, (4) becomes: + the re-projection error. The estimated camera pose is given by +0mik = 0Lk−1 Lk−1 Hk 0L−k−11 0mki −1 . (5) 0Xk∗ = exp(0x∗k) and is found using the Levenberg-Marquardt + k−1 algorithm to solve for (9). + MANUSCRIPT ONLY 5 + + −2 −1 + + −2 −1 −1 + −2 −2 + −1 −1 + 0 + 0 + −1 + −2 0 + + 0 0 + + −2 −1 + + 0 + + + + {0} + + {−2 } −2 −2 −1 −1 + + −2 −2 {−1 } { } + + 0 −1 −1 + + −2 + + −2 −1 + + −2 −1 −1 + + 0 0 + + −1 + +Fig. 2: Notation and coordinate frames. Solid curves represent camera and object poses in inertial frame; 0X and 0L +respectively, and dashed curves their respective motions in body-fixed frame. Solid lines represent 3D points in inertial frame, +and dashed lines represent 3D points in camera frames. + + 2) Object Motion Estimation: Analogous to the camera function: + +pose estimation, a cost function based on re-projection error nb +is constructed to solve for the object motion k−10Hk. 
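To make the camera-pose step in (7)-(9) concrete, the following is a minimal numerical sketch: it minimises the 3D-2D re-projection error over an $se(3)$ parameterisation. It is a stand-alone illustration, not the paper's implementation: the Huber weighting and Levenberg-Marquardt damping of (9) are replaced by plain Gauss-Newton with a numerical Jacobian, and all function names are ours.

```python
import numpy as np

def hat(w):
    """3-vector -> skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_se3(x):
    """se(3) vector x = (rho, phi) -> 4x4 SE(3) matrix, cf. (8)."""
    rho, phi = x[:3], x[3:]
    theta = np.linalg.norm(phi)
    W = hat(phi)
    if theta < 1e-10:
        R, V = np.eye(3), np.eye(3)
    else:
        R = (np.eye(3) + np.sin(theta) / theta * W
             + (1 - np.cos(theta)) / theta**2 * (W @ W))
        V = (np.eye(3) + (1 - np.cos(theta)) / theta**2 * W
             + (theta - np.sin(theta)) / theta**3 * (W @ W))
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ rho
    return T

def project(K, m_cam):
    """Projection pi(.) of a homogeneous camera-frame point, cf. (1)."""
    p = K @ m_cam[:3]
    return p[:2] / p[2]

def residuals(x, K, pts_world, obs):
    """Stacked re-projection errors e_i = p_i - pi(X_k^-1 m_i), cf. (7)."""
    T_inv = np.linalg.inv(exp_se3(x))          # X_k^-1
    r = [p - project(K, T_inv @ np.append(m, 1.0))
         for m, p in zip(pts_world, obs)]
    return np.concatenate(r)

def estimate_pose(K, pts_world, obs, iters=15):
    """Gauss-Newton over the 6-dof pose parameters, initialised at identity."""
    x = np.zeros(6)
    for _ in range(iters):
        r = residuals(x, K, pts_world, obs)
        J = np.zeros((r.size, 6))
        for j in range(6):                      # forward-difference Jacobian
            dx = np.zeros(6)
            dx[j] = 1e-6
            J[:, j] = (residuals(x + dx, K, pts_world, obs) - r) / 1e-6
        # damped normal equations; step reduces ||r||^2
        x = x - np.linalg.solve(J.T @ J + 1e-9 * np.eye(6), J.T @ r)
    return x
```

On exact synthetic correspondences this recovers the ground-truth pose to high precision; a real tracking front end would add the Huber weight, analytic Jacobians, and a Levenberg-Marquardt damping schedule.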
Using (6), the error term between the re-projection of an object 3D point and the corresponding 2D point in image $I_k$ is:

  $e_i({}^0_{k-1}H_k) := {}^{I_k}\tilde{p}^i_k - \pi({}^0X_k^{-1}\, {}^0_{k-1}H_k\, {}^0m^i_{k-1}) = {}^{I_k}\tilde{p}^i_k - \pi({}^0_{k-1}G_k\, {}^0m^i_{k-1})$ ,  (10)

where ${}^0_{k-1}G_k \in SE(3)$. Parameterising ${}^0_{k-1}G_k := \exp({}^0_{k-1}g_k)$ with ${}^0_{k-1}g_k \in se(3)$, the optimal solution is found by minimising:

  ${}^0_{k-1}g^{*\vee}_k = \arg\min_{{}^0_{k-1}g^{\vee}_k} \sum_i^{n_d} \rho_h\big(e_i^\top({}^0_{k-1}g_k)\, \Sigma_p^{-1}\, e_i({}^0_{k-1}g_k)\big)$  (11)

given all $n_d$ visible 3D-2D dynamic point correspondences on an object between frames $k-1$ and $k$. The object motion ${}^0_{k-1}H_k = {}^0X_k\, {}^0_{k-1}G_k$ can be recovered afterwards.

3) Joint Estimation with Optical Flow: The camera pose and object motion estimation both rely on good image correspondences. Tracking points on moving objects can be very challenging due to occlusions, large relative motions and large camera-object distances. In order to ensure a robust tracking of points, we follow our earlier work [33] and refine the estimation of the optical flow jointly with the motion estimation.

For camera pose estimation, the error term in (7) is reformulated considering (2) as:

  $e_i({}^0X_k, {}^{I_k}\phi^i) = {}^{I_{k-1}}p^i_{k-1} + {}^{I_k}\phi^i - \pi({}^0X_k^{-1}\, {}^0m^i_{k-1})$ .  (12)

Applying the Lie-algebra parameterisation of the SE(3) element, the optimal solution is obtained by minimising the cost function:

  $\{{}^0x^{*\vee}_k, {}^{I_k}\Phi^*\} = \arg\min_{\{{}^0x^{\vee}_k, {}^{I_k}\Phi\}} \sum_i^{n_b} \big\{ \rho_h\big(e_i^\top({}^{I_k}\phi^i)\, \Sigma_\phi^{-1}\, e_i({}^{I_k}\phi^i)\big) + \rho_h\big(e_i^\top({}^0x_k, {}^{I_k}\phi^i)\, \Sigma_p^{-1}\, e_i({}^0x_k, {}^{I_k}\phi^i)\big) \big\}$ ,  (13)

where $\rho_h(e_i^\top({}^{I_k}\phi^i)\, \Sigma_\phi^{-1}\, e_i({}^{I_k}\phi^i))$ is the regularisation term, with

  $e_i({}^{I_k}\phi^i) = {}^{I_k}\hat{\phi}^i - {}^{I_k}\phi^i$ .  (14)

Here ${}^{I_k}\hat{\Phi} = \{{}^{I_k}\hat{\phi}^i \mid i \in \mathcal{M}, k \in \mathcal{T}\}$ is the initial optical flow obtained through classical or learning-based methods, and $\Sigma_\phi$ is the associated covariance matrix. Analogously, the cost function for the object motion in (11), combining optical-flow refinement, is given by:

  $\{{}^0_{k-1}g^{*\vee}_k, {}^{I_k}\Phi^*\} = \arg\min_{\{{}^0_{k-1}g^{\vee}_k, {}^{I_k}\Phi\}} \sum_i^{n_d} \big\{ \rho_h\big(e_i^\top({}^{I_k}\phi^i)\, \Sigma_\phi^{-1}\, e_i({}^{I_k}\phi^i)\big) + \rho_h\big(e_i^\top({}^0_{k-1}g_k, {}^{I_k}\phi^i)\, \Sigma_p^{-1}\, e_i({}^0_{k-1}g_k, {}^{I_k}\phi^i)\big) \big\}$ .  (15)

C. Graph Optimisation

The proposed approach formulates dynamic SLAM as a graph optimisation problem, to refine the camera poses and object motions and to build a globally consistent map including static and dynamic structure. We model the dynamic SLAM problem as a factor graph, as demonstrated in Fig. 3. The factor-graph formulation is highly intuitive and has the advantage that it allows for efficient implementations of batch ([57], [58]) and incremental ([59]-[61]) solvers.

Four types of measurements/observations are integrated into a joint optimisation problem: the 3D point measurements, the visual odometry measurements, the motions of points on dynamic objects, and the object smooth-motion observations.

The 3D point measurement model error $e_{i,k}({}^0X_k, {}^0m^i_k)$ is defined as:

  $e_{i,k}({}^0X_k, {}^0m^i_k) = {}^0X_k^{-1}\, {}^0m^i_k - z^i_k$ .  (16)

Here $z = \{z^i_k \mid i \in \mathcal{M}, k \in \mathcal{T}\}$ is the set of all 3D point measurements at all time steps, with cardinality $n_z$ and $z^i_k \in \mathbb{R}^3$. The 3D point measurement factors are shown as white circles in Fig. 3.

The tracking component of the system provides a high-quality ego-motion estimate via 3D-2D error minimisation, which can be used as an odometry measurement to constrain camera poses in the graph. The visual odometry model error $e_k({}^0X_{k-1}, {}^0X_k)$ is defined as:

  $e_k({}^0X_{k-1}, {}^0X_k) = ({}^0X_{k-1}^{-1}\, {}^0X_k)^{-1}\, {}^{X_{k-1}}_{k-1}T_k$ ,  (17)

where $T = \{{}^{X_{k-1}}_{k-1}T_k \mid k \in \mathcal{T}\}$ is the odometry measurement set, with ${}^{X_{k-1}}_{k-1}T_k \in SE(3)$ and cardinality $n_o$. The odometry factors are shown as orange circles in Fig. 3.

The motion model error of points on dynamic objects $e_{i,l,k}({}^0m^i_k, {}^0_{k-1}H^l_k, {}^0m^i_{k-1})$ is defined as:

  $e_{i,l,k}({}^0m^i_k, {}^0_{k-1}H^l_k, {}^0m^i_{k-1}) = {}^0m^i_k - {}^0_{k-1}H^l_k\, {}^0m^i_{k-1}$ .  (18)

The motions of all points on a detected rigid object $l$ are characterised by the same pose transformation ${}^0_{k-1}H^l_k \in SE(3)$ given by (6), and the corresponding factor, shown as magenta circles in Fig. 3, is a ternary factor which we call the motion model of a point on a rigid body.

It has been shown that incorporating prior knowledge about the motion of objects in the scene is highly valuable in dynamic SLAM ([31], [37]). Motivated by the camera frame rate and the physical laws governing the motion of relatively large objects (vehicles), which prevent their motions from changing abruptly, we introduce smooth-motion factors to minimise the change in consecutive object motions, with the error term defined as:

  $e_{l,k}({}^0_{k-2}H^l_{k-1}, {}^0_{k-1}H^l_k) = {}^0_{k-2}H^{l\;-1}_{k-1}\, {}^0_{k-1}H^l_k$ .  (19)

The object smooth-motion factor $e_{l,k}({}^0_{k-2}H^l_{k-1}, {}^0_{k-1}H^l_k)$ is used to minimise the change between the object motions at consecutive time steps, and is shown as cyan circles in Fig. 3.

Fig. 3: Factor graph representation of an object-aware SLAM with a moving object. Black squares stand for the camera poses at different time steps, blue for static points, red for the same dynamic point on an object (dashed box) at different time steps, and green for the object pose change between time steps. For ease of visualisation, only one dynamic point is drawn here. A prior factor is shown as a black circle, odometry factors are shown as orange, point measurement factors as white and point motion factors as magenta. A smooth-motion factor is shown as a cyan circle.

Let $\theta_M = \{{}^0m^i_k \mid i \in \mathcal{M}, k \in \mathcal{T}\}$ be the set of all 3D points, and $\theta_X = \{{}^0x^{\vee}_k \mid k \in \mathcal{T}\}$ the set of all camera poses. We parameterise the SE(3) object motion ${}^0_{k-1}H^l_k$ by elements ${}^0_{k-1}h^l_k \in se(3)$, the Lie algebra of SE(3):

  ${}^0_{k-1}H^l_k = \exp({}^0_{k-1}h^l_k)$ ,  (20)

and define $\theta_H = \{{}^0_{k-1}h^{l\vee}_k \mid k \in \mathcal{T}, l \in \mathcal{L}\}$ as the set of all object motions, with ${}^0_{k-1}h^{l\vee}_k \in \mathbb{R}^6$ and $\mathcal{L}$ the set of all object labels. Given $\theta = \theta_X \cup \theta_M \cup \theta_H$ as all the nodes in the graph, with the Lie-algebra parameterisation of SE(3) for $X$ and $H$ (substituting (8) into (16) and (17), and (20) into (18) and (19)), the solution of the least-squares cost is given by:

  $\theta^* = \arg\min_\theta \Big\{ \sum_{i,k}^{n_z} \rho_h\big(e_{i,k}^\top({}^0x_k, {}^0m^i_k)\, \Sigma_z^{-1}\, e_{i,k}({}^0x_k, {}^0m^i_k)\big) + \sum_k^{n_o} \rho_h\big(\log(e_k({}^0x_{k-1}, {}^0x_k))^\top\, \Sigma_o^{-1}\, \log(e_k({}^0x_{k-1}, {}^0x_k))\big) + \sum_{i,l,k}^{n_g} \rho_h\big(e_{i,l,k}^\top({}^0m^i_k, {}^0_{k-1}h^l_k, {}^0m^i_{k-1})\, \Sigma_g^{-1}\, e_{i,l,k}({}^0m^i_k, {}^0_{k-1}h^l_k, {}^0m^i_{k-1})\big) + \sum_{l,k}^{n_s} \rho_h\big(\log(e_{l,k}({}^0_{k-2}h^l_{k-1}, {}^0_{k-1}h^l_k))^\top\, \Sigma_s^{-1}\, \log(e_{l,k}({}^0_{k-2}h^l_{k-1}, {}^0_{k-1}h^l_k))\big) \Big\}$ ,  (21)

where $\Sigma_z$ is the 3D point measurement noise covariance matrix, $\Sigma_o$ is the odometry noise covariance matrix, $\Sigma_g$ is the motion noise covariance matrix, with $n_g$ the total number of ternary object motion factors, and $\Sigma_s$ is the smooth-motion covariance matrix, with $n_s$ the total number of smooth-motion factors. The non-linear least-squares problem in (21) is solved using the Levenberg-Marquardt method.

IV. SYSTEM

In this section, we propose a novel object-aware dynamic SLAM system that robustly estimates both camera and object motions, along with the static and dynamic structure of the environment. The full system overview is shown in Fig. 4. The system consists of three main components: image pre-processing, tracking and mapping.

The input to the system is stereo or RGB-D images. For stereo images, as a first step, we extract depth information by applying the stereo depth estimation method described in [62] to generate depth maps, and the resulting data is treated as RGB-D.

Although this system was initially designed to be an RGB-D system, as an attempt to fully exploit image-based semantic information, we apply single-image depth estimation to obtain depth information from a monocular camera. Our "learning-based monocular" system is monocular in the sense that only RGB images are used as input to the system; however, the estimation problem is formulated using RGB-D data, where the depth is obtained using single-image depth estimation.

A. Pre-processing

There are two challenging aspects that this module needs to fulfil: first, to robustly separate the static background and the objects, and second, to ensure long-term tracking of dynamic objects. To achieve this, we leverage recent advances in computer vision techniques for instance-level semantic segmentation and dense optical flow estimation, in order to ensure efficient object motion segmentation and robust object tracking.

1) Object Instance Segmentation: Instance-level semantic segmentation is used to segment and identify potentially movable objects in the scene. Semantic information constitutes an important prior in the process of separating static and moving object points; e.g., buildings and roads are always static, but cars can be static or dynamic. Instance segmentation helps to further divide the semantic foreground into different instance masks, which makes it easier to track each individual object. Moreover, segmentation masks provide a "precise" boundary of the object body, which ensures robust tracking of points on the object.

2) Optical Flow Estimation: The dense optical flow is used to maximise the number of tracked points on moving objects. Most moving objects occupy only a small portion of the image; therefore, using sparse feature matching does not guarantee robust or long-term feature tracking. Our approach makes use of dense optical flow to considerably increase the number of object points by sampling from all the points within the semantic mask. Dense optical flow is also used to consistently track multiple objects, by propagating a unique object identifier assigned to every point on an object mask. Moreover, it allows us to recover object masks if semantic segmentation fails, a task that is extremely difficult to achieve using sparse feature matching.

B. Tracking

The tracking component includes two modules: the camera ego-motion tracking, with sub-modules of feature detection and camera pose estimation, and the object motion tracking, with sub-modules of dynamic object tracking and object motion estimation.

1) Feature Detection: To achieve fast camera pose estimation, we detect a sparse set of corner features and track them with optical flow. At each frame, only the inlier feature points that fit the estimated camera motion are saved into the map and used to track correspondences in the next frame. New features are detected and added if the number of inlier tracks falls below a certain level (1200 by default). These sparse features are detected on the static background, i.e., image regions excluding the segmented objects.

2) Camera Pose Estimation: The camera pose is computed using (13) for all detected 3D-2D static point correspondences. To ensure robust estimation, a motion-model generation method is applied for initialisation. Specifically, the method generates two models and compares their inlier numbers based on the re-projection error. One model is generated by propagating the previous camera motion, while the other is obtained by computing a new motion transform with the P3P [63] algorithm and RANSAC. The motion model that generates the most inliers is then selected for initialisation.

3) Dynamic Object Tracking: The process of object motion tracking consists of two steps. In the first step, segmented objects are classified into static and dynamic. Then, we associate the dynamic objects across pairs of consecutive frames.

• Instance-level object segmentation allows us to separate objects from the background. Although the algorithm is capable of estimating the motions of all the segmented objects, dynamic object identification helps reduce the computational cost of the proposed system. This is done based on scene flow estimation. Specifically, after obtaining the camera pose ${}^0X_k$, the scene flow vector $f^i_k$ describing the motion of a 3D point ${}^0m^i$ between frames $k-1$ and $k$ can be calculated as in [64]:

  $f^i_k = {}^0m^i_{k-1} - {}^0m^i_k = {}^0m^i_{k-1} - {}^0X_k\, {}^{X_k}m^i_k$ .  (22)

Unlike optical flow, scene flow, which is ideally caused only by scene motion, can directly decide whether some structure is moving or not. Ideally, the magnitude of the scene flow vector should be zero for all static 3D points. However, noise or error in depth and matching complicates the situation in real scenarios. To handle this robustly, we compute the scene flow magnitude of all the sampled points on each object. If the magnitude of the scene flow of a certain point is greater than a predefined threshold, the point is considered dynamic. This threshold was set to 0.12 in all experiments carried out in this work. An object is then recognised as dynamic if the proportion of "dynamic" points is above a certain level (30% of the total number of points), and as static otherwise. These thresholds were deliberately chosen to be conservative: the system is flexible enough to model a static object as dynamic and estimate a zero motion for it at every time step, whereas the opposite would degrade the system's performance.

• Instance-level object segmentation only provides single-image object labels. Objects then need to be tracked across frames and their motion models propagated over time. We propose to use optical flow to associate point labels across frames. A point label is the same as the unique object identifier of the object on which the point was sampled. We maintain a finite tracking label set $\mathcal{L} \subset \mathbb{N}$, where $l \in \mathcal{L}$ starts from $l = 1$ for the first detected moving object in the scene. The number of elements in $\mathcal{L}$ increases as more moving objects are detected. Static objects and the background are labelled with $l = 0$.
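As a concrete illustration of the scene-flow test in (22), the sketch below classifies an object as static or dynamic from sampled point correspondences, using the 0.12 point threshold and the 30% object-level ratio quoted above. This is a simplified stand-alone sketch with our own function names, not the system's code.

```python
import numpy as np

def scene_flow(T_cam_k, pts_world_prev, pts_cam_k):
    """f_k^i = 0m_{k-1}^i - 0X_k {X_k}m_k^i for each sampled point, cf. (22)."""
    pts_h = np.hstack([pts_cam_k, np.ones((len(pts_cam_k), 1))])
    pts_world_k = (T_cam_k @ pts_h.T).T[:, :3]   # points in world frame at k
    return pts_world_prev - pts_world_k

def is_object_dynamic(T_cam_k, pts_world_prev, pts_cam_k,
                      point_thresh=0.12, ratio_thresh=0.30):
    """Object is dynamic if enough of its points have large scene flow."""
    flow = scene_flow(T_cam_k, pts_world_prev, pts_cam_k)
    mags = np.linalg.norm(flow, axis=1)
    return np.mean(mags > point_thresh) > ratio_thresh
```

For a static object the two world-frame point sets coincide up to noise, so the flow magnitudes stay below the point threshold and the ratio test fails; for a moving object most sampled points exceed it.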
+ Ideally, for each detected object in frame k, the labels of all + its points should be uniquely aligned with the labels of their + correspondences in frame k − 1. However, in practice this is + affected by the noise, image boundaries and occlusions. To + overcome this, we assign all the points with the label that + MANUSCRIPT ONLY 8 + +Fig. 4: Overview of our VDO-SLAM system. Input images are first pre-processed to generate instance-level object +segmentation and dense optical flow. These are then used to track features on static background structure and dynamic objects. +Camera poses and object motions estimated from feature tracks are then refined in a global batch optimisation, and a local +map is maintained and updated with every new frame. The system outputs camera poses, static structure, tracks of dynamic +objects, and estimates of their pose changes over time. + +appears most in their correspondences. For a dynamic object, similarly, a factor graph optimisation is performed to refine all +if the most frequent label in the previous frame is 0, it means the variables within the local map, and then update them back +that the object starts to move, appears in the scene at the into the global map. +boundary, or reappears from occlusion. In this case, the object +is assigned a new tracking label. 2) Global Batch Optimisation: The output of the tracking + component and the local batch optimisation consists of the + 4) Object Motion Estimation: As mentioned above, objects camera pose, the object motions and the inlier structure. These +normally appear in small portions in the scene, which makes are saved in a global map that is constructed with all the +it hard to get sufficient sparse features to track and estimate previous time steps and is continually updated with every +their motions robustly. We sample every third point within new frame. A factor graph is constructed based on the global +an object mask, and track them across frames. 
Similar to the map after all input frames have been processed. To effectively +camera pose estimation, only inlier points are saved into the explore the temporal constraints, only points that have been +map and used for tracking in the next frame. When the number tracked for more than 3 instances are added into the factor +of tracked object points decreases below a certain level, new graph. The graph is formulated as an optimisation problem as +object points are sampled and added. We follow the same described in Section III-C. The optimisation results serve as +method as discussed in Section IV-B2 to generate an initial the output of the whole system. +object motion model. + 3) From Mapping to Tracking: Maintaining the map pro- +C. Mapping vides history information to the estimate of the current state in + the tracking module, as shown in Fig. 4 with blue arrows going + In the mapping component, a global map is constructed from the global map to multiple components in the tracking +and maintained. Meanwhile, a local map is extracted from module of the system. Inlier points from the last frame are +the global map, which is based on the current time step and leveraged to track correspondences in the current frame and +a window of previous time steps. Both maps are updated via estimate camera pose and object motions. The last camera +a batch optimisation process. and object motion also serve as possible prior models to + initialise the current estimation as described in Section IV-B2 + 1) Local Batch Optimisation: We maintain and update a and IV-B4. Furthermore, object points help associate semantic +local map. The goal of the local batch optimisation is to masks across frames to ensure robust tracking of objects, +ensure accurate camera pose estimates are provided to the by propagating their previously segmented masks in case of +global batch optimisation. 
The camera pose estimation has a “indirect occlusion” resulting from the failure of semantic +big influence on the accuracy of the object motion estimation object segmentation. +and the overall performance of the algorithm. The local map +is built using a fixed-size sliding window containing the V. EXPERIMENTS +information of the last nw frames, where nw is the window size +and is set to 20 in this paper. Local maps share some common We evaluate VDO-SLAM in terms of camera motion, object +information; this defines the overlap between the different motion and velocity, as well as object tracking performance. +windows. We choose to only locally optimise the camera The evaluation is done on the Oxford Multimotion Dataset [65] +poses and static structure within the window size, as locally for indoor, and KITTI Tracking dataset [66] for outdoor +optimising the dynamic structure does not bring any benefit scenarios, with comparison to other state-of-the-art methods, +to the optimisation unless a hard constraint (e.g. a constant including MVO [51], ClusterVO [52], DynaSLAM II [49] +object motion) is assumed within the window. However, the and CubeSLAM [24]. Due to the non-deterministic nature in +system is able to incorporate static and dynamic structure in running the proposed system, such as RANSAC processing, +the local mapping if needed. When a local map is constructed, we run each sequence 5 times and take median values as the + MANUSCRIPT ONLY 9 + +demonstrating results. All the results are obtained by running Then the speed error Es between the estimated vˆ and the +the proposed system in default parameter setup. Our open- ground truth v velocities can be calculated as: Es = |vˆ| − |v|. +source implementation includes the demo YAML files and +instructions to run the system in both datasets. C. Oxford Multimotion Dataset + +A. 
Deep Model Setup The recent Oxford Multimotion Dataset [65] contains se- + quences from a moving stereo or RGB-D camera sensor + We adopt a learning-based instance-level object segmen- observing multiple swinging boxes or toy cars in an indoor +tation, Mask R-CNN [67], to generate object segmentation scenario. Ground truth trajectories of the camera and moving +masks. The model of this method is trained on COCO objects are obtained via a Vicon motion capture system. We +dataset [68], and is directly used in this work without any fine- only choose the swinging boxes sequence (500 frames) for +tuning. For dense optical flow, we leverage a state-of-the-art evaluation, since results of real driving scenarios are evaluated +method; PWC-Net [12]. The model is trained on FlyingChairs on KITTI dataset. Note that, the trained model for instance +dataset [69], and then fine-tuned on Sintel [70] and KITTI segmentation cannot be applied to this dataset directly, since +training datasets [71]. To generate depth maps for a “monocu- the training data (COCO) does not contain the class of +lar” version of our proposed system, we apply a learning-based square box. Instead, we use Otsu’s method [77], together with +monocular depth estimation method, MonoDepth2 [72]. The color information and multi-label processing to segment the +model is trained on Depth Eigen split [73] excluding the tested boxes, which works very well for the simple setup of this +data in this paper. Feature detection is done using FAST [74] dataset (color boxes that are highly distinguishable from the +implemented in [75]. All the above methods are applied using background). Table I shows results compared to the state-of- +the default parameters. the-art MVO [51] and ClusterVO [52], with data provided by + the authors, respectively. As they are both visual odometry +B. 
Error Metrics systems without global refinement, we switch off the batch + optimisation module in our system and generate our results + We use a pose change error metric to evaluate the estimated for fair comparison. We use the error metrics described in +SE(3) motion, i.e., given a ground truth motion transform T Section V-B. +and a corresponding estimated motion Tˆ , where T ∈ SE(3) + Compared to MVO, our proposed method achieves better +could be either a camera relative pose or an object motion. accuracy in the estimation of camera pose (35%) and motion +The pose change error is computed as: E = Tˆ −1 T. This is of the swinging boxes, top-left (15%) and bottom-left (40%). + We obtain slightly higher errors when there is spinning ro- +similar to Relative Pose Error [76], while we set the time tational motion of the object observed, in particular the top- +interval ∆ = 1 (per frame), because the trajectory of different right swinging and rotating box (in translation only), and the +object in a sequence varies from each other and are normally bottom-right rotating box. We believe that this is due to using + an optical flow algorithm that is not well optimised for self- +much shorter than the camera trajectory. rotating objects. The consequence of this is poor estimation of +The translational error Et (meter) is computed as the L2 norm point motion and consequent degradation of the overall object +of the translational component of E. The rotational error Er tracking performance. Even with the associated performance +(degree) is calculated as the angle of rotation in an axis-angle loss for rotating objects, the benefits of dense optical flow +representation of the rotational component of E. For different motion estimation is clear in the other metrics. Our method + performs slightly worse than ClusterVO in the estimate of +camera time steps and different objects in a sequence, we camera pose, and the translation of bottom-right rotating box. 
+ Other than that, we achieve more than twice improvements +compute the root mean squared error (RMSE) for camera against ClusterVO in the estimate of object motions. + +poses and object motions, respectively. The object pose change An illustrative result of the trajectory output of our algo- + rithm on Oxford Multimotion Dataset is shown in Fig. 5. +in body-fixed frame is obtained by transforming the pose Tracks of dynamic features on swinging boxes visually corre- +change k−01Hk in the inertial frame into the body frame using spond to the actual motion of the boxes. This can be clearly +the object pose ground-truth seen in the swinging motion of the bottom-left box shown with + purple color in Fig. 5. + Lk−1 Hk =0 Lk−−11 0 Hk 0Lk−1. (23) + k−1 k−1 D. KITTI Tracking Dataset + + We also evaluate the object speed error. The linear velocity The KITTI Tracking Dataset [66] contains 21 sequences in + total with ground truth information about camera and object +of a point on the object, expressed in the inertial frame, can poses. Among these sequences, some are not included in the + evaluation of our system; as they contain no moving objects +be estimated by applying the pose change 0 Hk and taking (static only scenes) or only contain pedestrians that are non- + k−1 rigid objects, which is outside the scope of this work. Note + that, as only rotation around Y-axis is provided in the ground +the difference + + v ≈0 mik −0 mki −1 = 0 Hk − I4 0 mki −1 + k−1 + + = k−10tk − (I3 − k−10Rk) 0mki −1. (24) + +To get a more reliable measurement, we average over all points + +on an object at a certain time. Define ck−1 := 1 ∑ mik−1 for all + n + +n points on an object at time k − 1. Then + + ∑ 1 n k−01tk − (I3 − k−10Rk) 0mik−1 + + v≈ + n i=1 + + = 0 tk − (I3 − 0 Rk ) ck−1. 
TABLE I: Comparison versus MVO [51] and ClusterVO [52] for camera pose and object motion estimation accuracy on the swinging_4_unconstrained sequence of the Oxford Multimotion Dataset. Bold numbers indicate the better results.

                                         VDO-SLAM             MVO              ClusterVO
                                      Er (deg)  Et (m)   Er (deg)  Et (m)   Er (deg)  Et (m)
 Camera                                0.7709   0.0112    1.1948   0.0314    0.7665   0.0066
 Top-left swinging box                 1.1889   0.0207    1.4553   0.0288    3.2537   0.0673
 Top-right swinging and rotating box   0.7631   0.0132    0.8992   0.0130    3.5308   0.0256
 Bottom-left swinging box              0.9153   0.0149    1.4949   0.0261    4.9146   0.0763
 Bottom-right rotating box             0.8469   0.0192    0.7815   0.0115    4.0675   0.0144

Fig. 5: Qualitative results of our method on the Oxford Multimotion Dataset. (Left) The 3D trajectories of the camera (red) and the centres of the four boxes. (Right) Detected points on the static background and object bodies. Black corresponds to static points, and features on each object are shown in a different colour.

1) Camera Pose and Object Motion: Table II demonstrates the results of both camera pose and object motion estimation on nine sequences, compared to DynaSLAM II [49] and CubeSLAM [24]. Results of DynaSLAM II are obtained directly from their paper, where only the evaluation of camera pose is available. We initially tried to evaluate CubeSLAM ourselves with the default provided parameters; however, the errors were much higher, and hence we only report results on the five sequences provided by the authors of CubeSLAM after some correspondence. As CubeSLAM is designed for a monocular camera, we also compute results of a learning-based monocular version of our proposed method (as mentioned in Section IV) for a fair comparison.

Our proposed method achieves competitive and high accuracy in comparison with DynaSLAM II for the estimation of camera pose. In particular, our method obtains slightly lower rotational errors but higher translational errors than DynaSLAM II. We believe the difference in accuracy is due to the underlying formulations used in estimating camera pose. When compared to CubeSLAM, our RGB-D version obtains lower errors in camera pose, while our learning-based monocular version is slightly higher. We believe the weak performance of the monocular version is because the model does not capture the scale of depth accurately with only monocular input. Nevertheless, both versions obtain consistently lower errors in object motion estimation. In particular, as demonstrated in Fig. 6, the translation and rotation errors in CubeSLAM are all above 3 meters and 3 degrees, with errors reaching 32 meters and 5 degrees in extreme cases. In contrast, our translation errors vary between 0.1-0.3 meters and rotation errors between 0.2-1.5 degrees in the RGB-D case, and 0.1-0.3 meters and 0.4-3.1 degrees in the learning-based monocular case, which indicates that our object motion estimation achieves an order of magnitude improvement in most cases. In general, the results suggest that point-based object motion/pose estimation methods are more robust and accurate than those using high-level geometric models, probably because geometric model extraction can lose information and introduce more uncertainty.

Fig. 6: Accuracy of object motion estimation of our method compared to CubeSLAM [24]. The colour bars refer to translation error, corresponding to the left Y-axis in log scale. The circles refer to rotation error, corresponding to the right Y-axis in linear scale.

2) Object Tracking and Velocity: We also demonstrate the performance of tracking dynamic objects, and show results of object speed estimation, which is important information for autonomous driving applications. Fig. 7 illustrates results of object tracking length and object speed for some selected objects (tracked for over 20 frames) in all the tested sequences. Our system is able to track most objects for more than 80% of their occurrence in the sequence. Moreover, our estimated object speeds are consistently close to the ground truth.

3) Qualitative Results: Fig. 8 illustrates the output of our system for three of the KITTI sequences. The proposed system is able to output the camera poses, along with the static structure and the dynamic tracks of every detected moving object in the scene, in a spatiotemporal map representation.

TABLE II: Comparison versus DynaSLAM II [49] and CubeSLAM [24] for camera pose and object motion estimation accuracy on nine sequences with moving objects drawn from the KITTI dataset. Er in degrees, Et in metres. Bold numbers indicate the better result.

       DynaSLAM II        VDO-SLAM (RGB-D)                VDO-SLAM (Monocular)            CubeSLAM
         Camera          Camera          Object          Camera          Object          Camera          Object
 Seq   Er      Et      Er      Et      Er      Et      Er      Et      Er      Et      Er      Et      Er      Et
 00    0.06    0.04    0.0741  0.0674  1.0520  0.1077  0.1830  0.1847  2.0021  0.3827  -       -       -       -
 01    0.04    0.05    0.0382  0.1220  0.9051  0.1573  0.1772  0.4982  1.1833  0.3589  -       -       -       -
 02    0.02    0.04    0.0182  0.0445  1.2359  0.2801  0.0496  0.0963  1.6833  0.4121  -       -       -       -
 03    0.04    0.06    0.0311  0.0816  0.2919  0.0965  0.1065  0.1505  0.4570  0.2032  0.0498  0.0929  3.6085  4.5947
 04    0.06    0.07    0.0482  0.1114  0.8288  0.1937  0.1741  0.4951  3.1156  0.5310  0.0708  0.1159  5.5803  32.5379
 05    0.03    0.06    0.0219  0.0932  0.3705  0.1140  0.0506  0.1368  0.6464  0.2669  0.0342  0.0696  3.2610  6.4851
 06    0.04    0.02    0.0488  0.0186  1.0803  0.1158  0.0671  0.0451  2.0977  0.2394  -       -       -       -
 18    0.02    0.05    0.0211  0.0749  0.2453  0.0825  0.1236  0.3551  0.5559  0.2774  0.0433  0.0510  3.1876  3.7948
 20    0.04    0.07    0.0271  0.1662  0.3663  0.0824  0.3029  1.3821  1.1081  0.3693  0.1348  0.1888  3.4206  5.6986
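The Er/Et entries in Tables I and II measure the residual between estimated and ground-truth pose changes, and the RMSE mentioned earlier aggregates such per-frame errors. A minimal sketch of these metrics (illustrative helpers, not the paper's evaluation code; 4x4 homogeneous matrices are assumed, and the paper's exact averaging convention may differ):

```python
import numpy as np

def relative_pose_error(T_est, T_gt):
    """Rotational error (degrees) and translational error (metres)
    between an estimated and a ground-truth SE(3) pose change."""
    E = np.linalg.inv(T_est) @ T_gt                      # residual transform
    cos_r = (np.trace(E[:3, :3]) - 1.0) / 2.0            # angle from rotation trace
    e_r = np.degrees(np.arccos(np.clip(cos_r, -1.0, 1.0)))
    e_t = np.linalg.norm(E[:3, 3])
    return e_r, e_t

def rmse(errors):
    """Root mean squared error over a sequence of per-frame errors."""
    errors = np.asarray(errors, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))
```

An identical pair of poses yields zero error in both components, and a purely translational residual shows up only in Et, which makes the two columns easy to sanity-check independently.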
Fig. 7: Tracking performance and speed estimation. Results of object tracking length and object speed for a selection of objects (tracked for over 20 frames; only a subset is shown due to limited space). The colour bars represent the length of object tracks, corresponding to the left Y-axis; the circles represent object speeds, corresponding to the right Y-axis. The X-axis lists Sequence-Object IDs (00-1 through 20-3). GT refers to ground truth, and EST. refers to estimated values.

E. Discussion

Apart from the extensive evaluation in Sections V-C and V-D, we also provide detailed experimental results to demonstrate the effectiveness of key modules in our proposed system. Finally, the computational cost of the proposed system is discussed.

1) Robust Tracking of Points: The graph optimisation explores the spatial and temporal information to refine the camera poses and the object motions, as well as the static and dynamic structure. This process requires robust tracking of good points in terms of both quantity and quality. This was achieved by refining the estimated optical flow jointly with the motion estimation, as discussed in Section III-B3. The effectiveness of the joint optimisation is shown by comparing a baseline method that only optimises for the motion (Motion Only), using (9) for camera motion or (11) for object motion, against the improved method that optimises for both the motion and the optical flow (Joint), using (13) or (15). Table III demonstrates that the joint method obtains considerably more points that are tracked for long periods.

TABLE III: The number of points tracked for more than five frames on the nine sequences of the KITTI dataset. Bold numbers indicate the better results. Underlined bold numbers indicate an order of magnitude increase in number.

           Background             Object
 Seq   Motion Only   Joint    Motion Only   Joint
 00        1798      12812        1704       7162
 01         237       5075         907       4583
 02        7642      10683          52       1442
 03         778      12317         343       3354
 04        9913      25861         339       2802
 05         713      11627        2363       2977
 06        7898      11048         482       5934
 18        4271      22503        5614      14989
 20        9838      49261        9282      13434

Using the tracked points given by the joint estimation process leads to better estimation of both camera pose and object motion. As demonstrated in Table IV, an improvement of about 10% (camera) and 25% (object) in both translation and rotation errors was observed over the nine sequences of the KITTI dataset shown above.

TABLE IV: Average camera pose and object motion errors over the nine sequences of the KITTI dataset. Bold numbers indicate the better results.

             Motion Only            Joint
          Er (deg)  Et (m)    Er (deg)  Et (m)
 Camera    0.0412   0.0987     0.0365   0.0866
 Object    1.0179   0.1853     0.7085   0.1367

2) Robustness against Non-direct Occlusion: The mask segmentation may fail in some cases, due to direct or indirect occlusions (illumination change, etc.). Thanks to the mask propagation method described in Section IV-C3, our proposed system is able to handle mask failure cases caused by indirect occlusions. Fig. 9 demonstrates an example of tracking a white van for 80 frames, where the mask segmentation fails in 33 frames. Despite the segmentation failure, our system is still able to continuously track the van and estimate its speed, with an average error of 2.64 km/h across the whole sequence. Speed errors in the second half of the sequence are higher due to partial direct occlusions and the increased distance to the object as it gets farther away from the camera.
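The idea of propagating an instance mask along optical flow when segmentation fails can be illustrated with a nearest-neighbour forward warp. This is a simplified sketch, not the paper's Section IV-C3 implementation; it assumes an integer instance mask and dense flow from the previous to the current frame, and it ignores occlusions and the holes a forward warp can leave:

```python
import numpy as np

def propagate_mask(prev_mask, flow_prev_to_cur):
    """Warp the previous frame's instance mask into the current frame
    along dense optical flow.
    prev_mask: (h, w) integer instance labels (0 = background).
    flow_prev_to_cur: (h, w, 2) flow in (x, y) pixel offsets."""
    h, w = prev_mask.shape
    cur_mask = np.zeros_like(prev_mask)
    ys, xs = np.nonzero(prev_mask)                       # labelled pixels only
    x2 = np.round(xs + flow_prev_to_cur[ys, xs, 0]).astype(int)
    y2 = np.round(ys + flow_prev_to_cur[ys, xs, 1]).astype(int)
    ok = (x2 >= 0) & (x2 < w) & (y2 >= 0) & (y2 < h)     # keep in-bounds targets
    cur_mask[y2[ok], x2[ok]] = prev_mask[ys[ok], xs[ok]]
    return cur_mask
```

In practice a backward warp with interpolation and hole filling would be preferable; the forward nearest-neighbour version above is only meant to convey the mechanism.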
Fig. 8: Illustration of system output: a dynamic map with camera poses, static background structure, and tracks of dynamic objects. Sample results of VDO-SLAM on KITTI sequences. Black represents the static background, and each detected object is shown in a different colour. The top-left figure represents Seq.01, with a zoom-in on the intersection at the end of the sequence; the top-right figure represents Seq.06; and the bottom figure represents Seq.03.

3) Global Refinement on Object Motion: The initial object motion estimation (in the tracking component of the system) is independent between frames, since it is purely related to the sensor measurements. As illustrated in Fig. 10, the blue curve describes an initial object speed estimate of a wagon observed for 55 frames in sequence 03 of the KITTI tracking dataset. As seen in the figure, the speed estimation is not smooth, and large errors occur towards the second half of the sequence. This is mainly caused by the increased distance to the object as it gets farther away from the camera, with its structure only occupying a small portion of the scene. In this case, the object motion estimation from sensor measurements alone becomes challenging and error-prone. Therefore, we formulate a factor graph and refine the motions together with the static and dynamic structure, as discussed in Section III-C. The green curve in Fig. 10 shows the object speed results after the global refinement, which become smoother in the first half of the sequence and significantly improved in the second half.

Fig. 11 demonstrates the average improvement for all objects in each sequence of the KITTI dataset. With graph optimization, the errors can be reduced by up to 39% in translation and 55% in rotation. Interestingly, the translation errors in Seq.18 and Seq.20 increase slightly. We believe this is because the vehicles keep alternating between acceleration and deceleration due to the heavy traffic jams in both sequences, which strongly violates the smooth motion constraint that is set for general cases.

Fig. 11: Improvement on object motion after graph optimization. The numbers in the heatmap show the ratio of decrease in error on the nine sequences of the KITTI dataset (Seq.00-06, 18, 20). Translation: 0.27, 0.27, 0.11, 0.39, 0.1, 0.16, 0.02, -0.03, -0.04. Rotation: 0.2, 0.22, 0.06, 0.54, 0.26, 0.55, 0.04, 0.34, 0.12.

4) Computational Analysis: Finally, we provide the computational analysis of our system. The experiments are carried out on an Intel Core i7 2.6 GHz laptop computer with 16 GB RAM. The object semantic segmentation and dense optical flow computation times depend on the GPU power and the CNN model complexity; many current state-of-the-art algorithms can run in real time ([30], [78]). In this paper, the semantic segmentation and optical flow results are produced off-line as input to the system. The SLAM system is implemented in C++ on CPU, using a modified version of g2o as a back-end [79]. We show the computational times in Table V for both datasets. Overall, the tracking part of our proposed system is able to run at a frame rate of 5-8 fps, depending on the number of detected moving objects, which could be improved by employing a parallel implementation. The runtime of the global batch optimisation strongly depends on the number of camera poses (number of frames) and on the objects present in the scene (density in terms of the number of dynamic objects observed per frame).

TABLE V: Runtime of different system components for both datasets. The time cost of every component is averaged over all frames and sequences, except for the dynamic object tracking and object motion estimation, which are averaged over the number of objects.

 Dataset  Task                                    Runtime (ms)
 KITTI    Feature Detection                          16.2550
          Camera Pose Estimation                     52.6542
          Dynamic Object Tracking (avg/object)        8.2980
          Object Motion Estimation (avg/object)      22.9081
          Map and Mask Updating                      22.1830
          Local Batch Optimisation                   18.2828
 OMD      Feature Detection                           7.5220
          Camera Pose Estimation                     32.0909
          Dynamic Object Tracking (avg/object)        7.0134
          Object Motion Estimation (avg/object)      19.5280
          Map and Mask Updating                      30.3153
          Local Batch Optimisation                   15.3414

Fig. 9: Robustness in tracking performance and speed estimation in case of semantic segmentation failure. An example of tracking performance and speed estimation for a white van (ground-truth average speed 20 km/h) in Seq.00. (Top) Blue bars represent a successful object segmentation, and green curves refer to the object speed error. (Bottom-left) An illustration of semantic segmentation failure on the van. (Bottom-right) Result of propagating the previously tracked features on the van by our system.

Fig. 10: Global refinement effect on object speed estimation. The initial (blue) and refined (green) estimated speeds of a wagon in Seq.03, travelling along a straight road, compared to the ground-truth speed (red). Note the ground-truth speed fluctuates slightly; we believe this is due to the ground-truth object poses being approximated from lidar scans.

VI. CONCLUSION

In this paper, we have presented VDO-SLAM, a novel dynamic feature-based SLAM system that exploits image-based semantic information in the scene, with no additional knowledge of the object pose or geometry, to achieve simultaneous localisation, mapping and tracking of dynamic objects. The system consistently shows robust and accurate results on indoor and challenging outdoor datasets, and achieves state-of-the-art performance in object motion estimation. We believe the high accuracy achieved in object motion estimation is due to the fact that our system is feature-based. Feature points remain the easiest to detect, track and integrate within a SLAM system, and they require the front-end neither to have additional knowledge about the object model nor to explicitly provide any information about its pose.

An important issue to be addressed is the computational complexity of SLAM with dynamic objects. In long-term applications, different techniques can be applied to limit the growth of the graph ([80], [81]). In fact, history summarisation/deletion of map points pertaining to dynamic objects observed far in the past seems a natural step towards a long-term SLAM system in highly dynamic environments.

ACKNOWLEDGEMENTS

This research is supported by the Australian Research Council through the Australian Centre of Excellence for Robotic Vision (CE140100016), and the Sydney Institute for Robotics and Intelligent Systems. The authors would like to thank Mr. Ziang Cheng and Mr. Huangying Zhan for providing help in preparing the testing datasets.

REFERENCES

[1] D. Hahnel, D. Schulz, and W. Burgard, "Map Building with Mobile Robots in Populated Environments," in International Conference on Intelligent Robots and Systems (IROS), vol. 1. IEEE, 2002, pp. 496–501.
[2] D. Hahnel, R. Triebel, W. Burgard, and S. Thrun, "Map Building with Mobile Robots in Dynamic Environments," in International Conference on Robotics and Automation (ICRA), vol. 2. IEEE, 2003, pp. 1557–1563.
[3] D. F. Wolf and G. S. Sukhatme, "Mobile Robot Simultaneous Localization and Mapping in Dynamic Environments," Autonomous Robots, vol. 19, no. 1, pp. 53–65, 2005.
[4] H. Zhao, M. Chiba, R. Shibasaki, X. Shao, J. Cui, and H. Zha, "SLAM in a Dynamic Large Outdoor Environment using a Laser Scanner," in International Conference on Robotics and Automation (ICRA). IEEE, 2008, pp. 1455–1462.
[5] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes," Robotics and Automation Letters (RAL), vol. 3, no. 4, pp. 4076–4083, 2018.
[6] C.-C. Wang, C. Thorpe, and S.
Thrun, "Online Simultaneous Localization and Mapping with Detection and Tracking of Moving Objects: Theory and Results from a Ground Vehicle in Crowded Urban Areas," in International Conference on Robotics and Automation (ICRA), vol. 1. IEEE, 2003, pp. 842–849.
[7] I. Miller and M. Campbell, "Rao-Blackwellized Particle Filtering for Mapping Dynamic Environments," in International Conference on Robotics and Automation (ICRA). IEEE, 2007, pp. 3862–3869.
[8] J. G. Rogers, A. J. Trevor, C. Nieto-Granda, and H. I. Christensen, "SLAM with Expectation Maximization for Moveable Object Tracking," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2010, pp. 2077–2082.
[9] A. Kundu, K. M. Krishna, and C. Jawahar, "Realtime Multibody Visual SLAM with a Smoothly Moving Monocular Camera," in International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2080–2087.
[10] K. Yamaguchi, D. McAllester, and R. Urtasun, "Efficient Joint Segmentation, Occlusion Labeling, Stereo and Flow Estimation," in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 756–771.
[11] D. Sun, S. Roth, and M. J. Black, "Secrets of Optical Flow Estimation and Their Principles," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010, pp. 2432–2439.
[12] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
[13] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 2462–2470.
[14] C. Vogel, K. Schindler, and S. Roth, "Piecewise Rigid Scene Flow," in International Conference on Computer Vision (ICCV). IEEE, 2013, pp. 1377–1384.
[15] M. Menze and A. Geiger, "Object Scene Flow for Autonomous Vehicles," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015, pp. 3061–3070.
[16] X. Liu, C. R. Qi, and L. J. Guibas, "FlowNet3D: Learning Scene Flow in 3D Point Clouds," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019, pp. 529–537.
[17] H. Jiang, D. Sun, V. Jampani, Z. Lv, E. Learned-Miller, and J. Kautz, "SENSE: A Shared Encoder Network for Scene-flow Estimation," in International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 3195–3204.
[18] P. de la Puente and D. Rodríguez-Losada, "Feature Based Graph-SLAM in Structured Environments," Autonomous Robots, vol. 37, no. 3, pp. 243–260, 2014.
[19] M. Kaess, "Simultaneous Localization and Mapping with Infinite Planes," in International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 4605–4611.
[20] M. Henein, M. Abello, V. Ila, and R. Mahony, "Exploring the Effect of Meta-structural Information on the Global Consistency of SLAM," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 1616–1623.
[21] M. Hsiao, E. Westman, G. Zhang, and M. Kaess, "Keyframe-based Dense Planar SLAM," in International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 5110–5117.
[22] B. Mu, S.-Y. Liu, L. Paull, J. Leonard, and J. P. How, "SLAM with Objects using a Nonparametric Pose Graph," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 4602–4609.
[23] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, "SLAM++: Simultaneous Localisation and Mapping at the Level of Objects," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2013, pp. 1352–1359.
[24] S. Yang and S. Scherer, "CubeSLAM: Monocular 3-D Object SLAM," Transactions on Robotics (T-RO), vol. 35, no. 4, pp. 925–938, 2019.
[25] D. Gálvez-López, M. Salas, J. D. Tardós, and J. Montiel, "Real-time Monocular Object SLAM," Robotics and Autonomous Systems, vol. 75, pp. 435–449, 2016.
[26] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, "MOT16: A Benchmark for Multi-Object Tracking," arXiv:1603.00831 [cs], Mar. 2016. [Online]. Available: http://arxiv.org/abs/1603.00831
[27] A. Byravan and D. Fox, "SE3-Nets: Learning Rigid Body Motion using Deep Neural Networks," in International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 173–180.
[28] P. Wohlhart and V. Lepetit, "Learning Descriptors for Object Recognition and 3D Pose Estimation," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3109–3118.
[29] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2018.
[30] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT: Real-time Instance Segmentation," in International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 9157–9166.
[31] M. Henein, J. Zhang, R. Mahony, and V. Ila, "Dynamic SLAM: The Need for Speed," in International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 2123–2129.
[32] J. Huang, S. Yang, Z. Zhao, Y. Lai, and S. Hu, "ClusterSLAM: A SLAM Backend for Simultaneous Rigid Body Clustering and Motion Estimation," in International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 5874–5883.
[33] J. Zhang, M. Henein, R. Mahony, and V. Ila, "Robust Ego and Object 6-DoF Motion Estimation and Tracking," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 5017–5023.
[34] P. F. Alcantarilla, J. J. Yebes, J. Almazán, and L. M. Bergasa, "On Combining Visual SLAM and Dense Scene Flow to Increase the Robustness of Localization and Mapping in Dynamic Environments," in International Conference on Robotics and Automation (ICRA). IEEE, 2012, pp. 1290–1297.
[35] W. Tan, H. Liu, Z. Dong, G. Zhang, and H. Bao, "Robust Monocular SLAM in Dynamic Environments," in International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2013, pp. 209–218.
[36] P. Kaveti and H. Singh, "A Light Field Front-end for Robust SLAM in Dynamic Environments," arXiv preprint arXiv:2012.10714, 2020.
[37] C.-C. Wang, C. Thorpe, S. Thrun, M. Hebert, and H. Durrant-Whyte, "Simultaneous Localization, Mapping and Moving Object Tracking," International Journal of Robotics Research (IJRR), vol. 26, no. 9, pp. 889–916, 2007.
[38] N. D. Reddy, P. Singhal, V. Chari, and K. M. Krishna, "Dynamic Body VSLAM with Semantic Constraints," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 1897–1904.
[39] I. A. Bârsan, P. Liu, M. Pollefeys, and A. Geiger, "Robust Dense Mapping for Large-Scale Dynamic Environments," in International Conference on Robotics and Automation (ICRA). IEEE, 2018.
[40] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, "SLAM++: Simultaneous Localisation and Mapping at the Level of Objects," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2013, pp. 1352–1359.
[41] K. Tateno, F. Tombari, and N. Navab, "When 2.5D is Not Enough: Simultaneous Reconstruction, Segmentation and Recognition on Dense SLAM," in International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 2295–2302.
[42] E. Sucar, K. Wada, and A. Davison, "NodeSLAM: Neural Object Descriptors for Multi-View Shape Reconstruction," arXiv preprint arXiv:2004.04485, 2020.
[43] M. Runz, M. Buffier, and L. Agapito, "MaskFusion: Real-time Recognition, Tracking and Reconstruction of Multiple Moving Objects," in International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2018, pp. 10–20.
[44] B. Xu, W. Li, D. Tzoumanikas, M. Bloesch, A. Davison, and S. Leutenegger, "MID-Fusion: Octree-based Object-level Multi-instance Dynamic SLAM," in International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 5231–5237.
[45] M. Hosseinzadeh, K. Li, Y. Latif, and I. Reid, "Real-time Monocular Object-model Aware Sparse SLAM," in International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 7123–7129.
[46] L. Nicholson, M. Milford, and N. Sünderhauf, "QuadricSLAM: Dual Quadrics from Object Detections as Landmarks in Object-oriented SLAM," Robotics and Automation Letters (RAL), vol. 4, no. 1, pp. 1–8, 2018.
[47] P. Li, T. Qin, et al., "Stereo Vision-based Semantic 3D Object and Ego-motion Tracking for Autonomous Driving," in European Conference on Computer Vision (ECCV), 2018, pp. 646–661.
[48] P. Li, J. Shi, and S. Shen, "Joint Spatial-temporal Optimization for Stereo 3D Object Tracking," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020, pp. 6877–6886.
[49] B. Bescos, C. Campos, J. D. Tardós, and J. Neira, "DynaSLAM II: Tightly-coupled Multi-object Tracking and SLAM," Robotics and Automation Letters (RAL), vol. 6, no. 3, pp. 5191–5198, 2021.
[50] A. Dewan, T. Caselitz, G. D. Tipaldi, and W. Burgard, "Motion-based Detection and Tracking in 3D Lidar Scans," in International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 4508–4513.
[51] K. M. Judd, J. D. Gammell, and P. Newman, "Multimotion Visual Odometry (MVO): Simultaneous Estimation of Camera and Third-party Motions," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3949–3956.
[52] J. Huang, S. Yang, T.-J. Mu, and S.-M. Hu, "ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020, pp. 2168–2177.
[53] M. A. Fischler and R. C. Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography," Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[54] G. S. Chirikjian, R. Mahony, S. Ruan, and J. Trumpf, "Pose Changes from a Different Point of View," in The ASME International Design Engineering Technical Conferences (IDETC). ASME, 2017.
[55] D. Nistér, O. Naroditsky, and J. Bergen, "Visual Odometry," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1. IEEE, 2004, pp. I–I.
[56] P. J. Huber, "Robust Estimation of a Location Parameter," in Breakthroughs in Statistics. Springer, 1992, pp. 492–518.
[57] F. Dellaert and M. Kaess, "Square Root SAM: Simultaneous Localization and Mapping via Square Root Information Smoothing," International Journal of Robotics Research (IJRR), vol. 25, no. 12, pp. 1181–1203, 2006.
[58] S. Agarwal, K. Mierle, and Others, "Ceres Solver," http://ceres-solver.org, 2012.
[59] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert, "iSAM2: Incremental Smoothing and Mapping using the Bayes Tree," International Journal of Robotics Research (IJRR), 2011.
[60] L. Polok, V. Ila, M. Solony, P. Smrz, and P. Zemcik, "Incremental Block Cholesky Factorization for Nonlinear Least Squares in Robotics," in Robotics: Science and Systems (RSS), Berlin, Germany, June 2013.
[61] V. Ila, L. Polok, M. Šolony, and P. Svoboda, "SLAM++ - A Highly Efficient and Temporally Scalable Incremental SLAM Framework," International Journal of Robotics Research (IJRR), vol. Online First, no. 0, pp. 1–21, 2017.
[62] K. Yamaguchi, D. McAllester, and R. Urtasun, "Efficient Joint Segmentation, Occlusion Labeling, Stereo and Flow Estimation," in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 756–771.
[63] T. Ke and S. I. Roumeliotis, "An Efficient Algebraic Solution to the Perspective-three-point Problem," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
[64] Z. Lv, K. Kim, A. Troccoli, J. Rehg, and J. Kautz, "Learning Rigidity in Dynamic Scenes with a Moving Camera for 3D Motion Field Estimation," in European Conference on Computer Vision (ECCV). Springer, 2018.
[65] K. M. Judd and J. D. Gammell, "The Oxford Multimotion Dataset: Multiple SE(3) Motions with Ground Truth," Robotics and Automation Letters (RAL), vol. 4, no. 2, pp. 800–807, 2019.
[66] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision Meets Robotics: The KITTI Dataset," International Journal of Robotics Research (IJRR), vol. 32, no. 11, pp. 1231–1237, 2013.
[67] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 2980–2988.
[68] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context," in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 740–755.
[69] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, "A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, pp. 4040–4048.
[70] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, "A Naturalistic Open Source Movie for Optical Flow Evaluation," in European Conference on Computer Vision (ECCV). Springer, 2012, pp. 611–625.
[71] A. Geiger, P. Lenz, and R. Urtasun, "Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012.
[72] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, "Digging into Self-supervised Monocular Depth Estimation," in International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 3828–3838.
[73] D. Eigen, C. Puhrsch, and R. Fergus, "Depth Map Prediction from a Single Image using a Multi-scale Deep Network," in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2366–2374.
[74] E. Rosten and T. Drummond, "Machine Learning for High-speed Corner Detection," in European Conference on Computer Vision (ECCV). Springer, 2006, pp. 430–443.
[75] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An Efficient Alternative to SIFT or SURF," in International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2564–2571.
[76] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A Benchmark for the Evaluation of RGB-D SLAM Systems," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2012, pp. 573–580.
[77] N. Otsu, "A Threshold Selection Method from Gray-level Histograms," Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
[78] T.-W. Hui, X. Tang, and C. C. Loy, "A Lightweight Optical Flow CNN - Revisiting Data Fidelity and Regularization." IEEE, 2020. [Online]. Available: http://mmlab.ie.cuhk.edu.hk/projects/LiteFlowNet/
[79] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, "g2o: A General Framework for Graph Optimization," in International Conference on Robotics and Automation (ICRA). IEEE, 2011, pp. 3607–3613.
[80] H. Strasdat, A. J. Davison, J. M. Montiel, and K. Konolige, "Double Window Optimisation for Constant Time Visual SLAM," in International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2352–2359.
[81] V. Ila, J. M. Porta, and J. Andrade-Cetto, "Information-based Compact Pose SLAM," Transactions on Robotics (T-RO), vol. 26, no. 1, pp. 78–93, 2010.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2021年/DF_VO What should be learnt for visual odometry.pdf b/动态slam/2020年-2022年开源动态SLAM/2021年/DF_VO What should be learnt for visual odometry.pdf
new file mode 100644
index 0000000..5c1d534
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2021年/DF_VO What should be learnt for visual odometry.pdf
@@ -0,0 +1,1367 @@

UNDER REVIEW manuscript No.
(will be inserted by the editor)

DF-VO: What Should Be Learnt for Visual Odometry?

Huangying Zhan, Chamara Saroj Weerasekera, Jia-Wang Bian, Ravi Garg, Ian Reid

arXiv:2103.00933v1 [cs.CV] 1 Mar 2021
(the date of receipt and acceptance should be inserted later)

Fig. 1 Inputs and intermediate CNN outputs of the system. (a, b) Current and previous input images with examples of auto-selected 2D-2D matches; (c) single-view depth prediction; (d, e) forward and backward optical flow predictions; (f) flow consistency between optical flow and rigid flow; (g) forward-backward flow consistency. In (f) and (g), red/blue means high/low inconsistency.

Abstract  Multi-view geometry-based methods have dominated monocular Visual Odometry over the last few decades owing to their superior performance, yet they remain vulnerable to dynamic and low-texture scenes. More importantly, monocular methods suffer from a scale-drift issue, i.e., errors accumulate over time. Recent studies show that deep neural networks can learn scene depths and relative camera motion in a self-supervised manner without acquiring ground-truth labels. More surprisingly, they show that well-trained networks enable scale-consistent predictions over long videos, while the accuracy is still inferior to traditional methods because geometric information is ignored. Building on top of recent progress in computer vision, we design a simple yet robust VO system by integrating multi-view geometry and deep learning on Depth and optical Flow, namely DF-VO.
In this work, a) we + propose a method to carefully sample high-quality corre- The ability of an autonomous robot to localize itself and + spondences from deep flows and recover accurate camera know its surroundings is vital for different robotic tasks + poses with a geometric module; b) we address the scale- such as navigation and object manipulation. Vision-based + drift issue by aligning geometrically triangulated depths methods are often the preferred choice because of factors + to the scale-consistent deep depths, where the dynamic such as cost-saving, low power requirements, and useful + scenes are taken into account. Comprehensive ablation complementary information can be provided to other sen- + studies show the effectiveness of the proposed method, sors such as IMU, GPS, laser scanners. We address the + and extensive evaluation results show the state-of-the-art monocular Visual Odometry (VO) problem in this paper, + performance of our system, e.g., Ours (1.652% ) v.s. ORB- where the goal is to estimate 6DoF motions of a moving + SLAM (3.247% ) in terms of translation error in KITTI camera. + Odometry benchmark. Source code is publicly available + at: DF-VO. + + Keywords Visual Odometry, Self-supervised Learning, + Depth Estimation, Optical Flow Estimation + + All authors are with the University of Adelaide, and Australian + Centre for Robotic Vision + 2 Zhan et al. + + Geometry-based Visual Odometry has shown domi- era motions are estimated using well-studied multi-view +nating performance in the last few decades, while they are geometry in the proposed system. +only reliable and accurate under a restrictive setup, such +as when static scenes comprising well-textured Lamber- To summarize, the contributions of this paper include: +tian surfaces are captured with sufficient uniform illumi- +nation enabling to establish good correspondences (Bian – we propose a hybrid system, DF-VO, which leverages +et al. 2019a; Lowe 2004; Rublee et al. 2011). 
The tra- both deep learning and multi-view geometry for Vi- +ditional correspondence search pipeline usually detects sual Odometry. Especially, self-supervised learning is +sparse feature points firstly and then matches extracted used for training networks so expensive ground truth +features, resulting in a limited number of high-quality data is not required and it enables online finetuning. +correspondences because of the aforementioned assump- +tions. The accuracy and diversity of the correspondences – we propose to sample accurate sparse correspondences +are of the utmost importance in solving Visual Odome- from dense optical flow predictions for camera track- +try problems. In contrast, we propose to extract accurate ing, and a bi-directional consistency based sampling +correspondences diversely from the dense predictions of method is presented. +an optical flow network using the consistency constraint +between bi-directional flows. Then the selected correspon- – we propose to use scale-consistent monocular depth +dences are fed into geometry-based trackers (Epipolar predictions for maintaining a consistent scale over long +Geometry based tracker and Prospective-n-point based video for Visual Odometry, and propose an iterative +tracker) for accurate and robust VO estimation, as de- scale recovery method for better performance in dy- +scribed in Sec. 4. namic scenarios. + + Most monocular systems suffer from a depth-translation – the comprehensive evaluation shows that the proposed +scale ambiguity issue, which means the predictions (struc- DF-VO system achieves state-of-the-art performance +ture and motion) are up-to-scale. The scale ambiguity in standard benchmarks, and we conduct a detailed +leads to a scale drift issue that accumulated over time. ablation study for evaluating the effect of different +Resolving scale-drift usually relies on keeping a scale- factors in our system. 
+consistent map for map-to-frame tracking, performing an +expensive global bundle adjustment for scale optimiza- A preliminary version of DF-VO was presented in +tion or additional prior assumptions, like constant camera (Zhan et al. 2020). We extend the system in the following +height from the known ground plane. Recently deep learn- four aspects (1) clearer presentation and more details of +ing methods have made possible end-to-end learning of the proposed system (2) improving the system in dynamic +structure-and-motion from unlabelled videos. The trained environments with an iterative correspondence selection +single-view depth models give scale-consistent predictions scheme; (3) improving the adaptation ability in new en- +with the use of stereo-based training (Garg et al. 2016; vironments by introducing an online adaptation scheme; +Godard et al. 2019; Zhan et al. 2018) or scale-consistency (4) more comprehensive experiments and ablation stud- +constraint in monocular-based training (Bian et al. 2019b). ies. +In this work, we propose to use the scale-consistent single- +view depths as the reference to maintain a consistent scale 2 Related Work +over long videos. The scale-consistent depths are used in +two circumstances: (1) scale recovery when the transla- Geometry based VO: Camera tracking is a fundamen- +tion scale is missed in the Epipolar Geometry tracker; (2) tal and well-studied problem in computer vision, with dif- +establishing scale-consistent 3D-2D correspondences in ferent pose estimation methods based on multiple-view +the PnP tracker. Besides, we propose an iterative method geometry been established (Hartley & Zisserman 2003; +for robust scale recovery, which is especially effective in Scaramuzza & Fraundorfer 2011). Early work in VO dates +highly dynamic scenes by removing the extracted corre- back to the 1980s (Scaramuzza & Fraundorfer 2011; Ull- +spondences (i.e. outliers) on dynamic regions. 
man 1979), with a successful application of it in the Mars + exploration rover in 2004 (Matthies et al. 2007), albeit + Although recent deep pose networks can learn camera with a stereo camera. Two dominant methods for geometry- +motions directly from videos (Bian et al. 2019b; Godard based VO/SLAM are feature-based (Geiger et al. 2011; +et al. 2019; Wang et al. 2017; Zhan et al. 2018; Zhou Klein & Murray 2007; Mur-Artal & Tard´os 2016) and di- +et al. 2017), the accuracy is limited because of neglect- rect methods (Engel et al. 2017; Newcombe et al. 2011). +ing to incorporate geometric knowledge in inference time. The former involves explicit correspondence estimation, +In contrast, correspondences and scene scales are learnt and the latter takes the form of an energy minimization +in our proposed framework (Fig. 1). Thus accurate cam- problem based on the image colour/feature warp error, + parameterized by pose and map parameters. There are + also hybrid approaches that make use of the good prop- + erties of both (Engel et al. 2014; Forster et al. 2014, + DF-VO: What Should Be Learnt for Visual Odometry? 3 + +2016). One of the most successful and accurate full SLAM ometry to varying extent and degree of success. CNN- +systems using a sparse (ORB) feature-based approach SLAM (Tateno et al. 2017) fuse single view CNN depths +is ORB-SLAM2 (Mur-Artal & Tardo´s 2016), along with in a direct SLAM system, and CNN-SVO (Loo et al. +DSO (Engel et al. 2017), a direct keyframe-based sparse 2019) initialize the depth at a feature location with CNN +SLAM method. VISO2 (Geiger et al. 2011) on the other provided depth for reducing the uncertainty in the ini- +hand is a feature-based VO system that only tracks against tial map. Yang et al.(Yang et al. 2018) feed depth pre- +a local map created by the previous two frames. All these dictions into DSO (Engel et al. 2017) as virtual stereo +methods suffer from the previously mentioned issues (in- measurements. Li et al.(Li et al. 
2019) refine their pose +cluding scale-drift) common to monocular geometry-based predictions via pose-graph optimisation. In contrast to +systems. Various techniques have been developed for re- the above methods, we effectively utilize CNNs for both +solving the scale drift issue. For example, an expensive single-view depth prediction and correspondence estima- +global bundle adjustment is performed for global scale tion, on top of standard multi-view geometry to create a +optimization based on loop-closure detection, which does simple yet effective VO system. +not always exist (Mur-Artal et al. 2015b); or additional +prior assumptions are introduced like constant camera 3 Preliminaries +height from the known ground plane (Geiger et al. 2011; +Zhou et al. 2019). In this work, with the aid of depth We revisit geometry-based pose estimation methods, in- +estimations from a consistent-scale deep network, scale cluding Epipolar Geometry and Perspective-n-Point in +estimation is performed with respect to the depth pre- this section to understand the principle and the underly- +dictions such that a single consistent scale is maintained ing limitations of each method. +(Sec. 4.4). + 3.1 Epipolar Geometry + Deep learning for VO: For supervised learning, +Agrawal et al.(Agrawal et al. 2015) propose to learn Epipolar Geometry can be employed for camera motion +good visual features from an ego-motion estimation task, estimation from two images (Ii, Ij) Suppose we have ob- +in which the model is capable of relative camera pose es- tained a set of 2D-2D correspondences (pi, pj) from the +timation. Wang et al.(Wang et al. 2017) propose a recur- image pair. Epipolar constraint is employed for solving +rent network for learning VO from videos. Ummenhofer fundamental matrix, F , or essential matrix, E, which +et al.(Ummenhofer et al. 2017) and Zhou et al.(Zhou are related by the camera intrinsic K such that F = +et al. 2018) propose to learn monocular depth estimation K−T EK−1. 
Thus, the camera motion [R, t] can be re- +and VO together in an end-to-end fashion by formulating covered by decomposing F or E (Bian et al. 2019c; Hart- +structure from motion as a supervised learning problem. ley 1995; Nister 2003; Zhang 1998). +Dharmasiri et al.(Dharmasiri et al. 2018) train a depth +network and extend the depth system for predicting opti- pjT K−T EK−1pi = 0, where E = [t]×R (1) +cal flows and camera motion. Recent works suggest that +both tasks can be jointly learnt in a self-supervised man- However, the general viewpoint and general structure are +ner using a photometric warp loss to replace a super- assumed in such geometry guided tracking. Problems arise +vised loss based on ground truth. SfM-Learner (Zhou with Epipolar Geometry while frames in the sequence +et al. 2017) is the first self-supervised method for jointly and/or scene structure do not conform to these assump- +learning camera motion and depth estimation. SC-SfM- tions(Torr et al. 1999). +Learner (Bian et al. 2019b) is a very recent work which +solves the scale inconsistent issue in SfM-Learner by en- – Motion degeneracy: motion degeneracy happens when +forcing depth consistency. (Ranjan et al. 2019; Yin & Shi the camera does not translate between frames, i.e. re- +2018) improve SfM-Learner by incorporating optical flow covering R becomes unsolvable if the camera motion +in their joint training framework for dynamics reasoning. is a pure rotation. +Some prior works solve both scale ambiguity and incon- +sistency issue by using stereo sequences in training (Li – Structure degeneracy: viewed scene structure is pla- +et al. 2017; Zhan et al. 2018), which address the issue of nar. +metric scale. + Solving fundamental/essential matrix becomes unstable + The issue with the above learning-based methods is in practice when the camera baseline is small relative to +that they do not explicitly account for the multi-view the scene size. 
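As a concrete illustration of Eq. (1) and the linear solve behind an E-tracker, here is a minimal NumPy sketch (my own illustrative code, not the authors' implementation; it uses a plain eight-point solve on noiseless, normalized coordinates, i.e. K = I):

```python
import numpy as np

def eight_point_essential(p_i, p_j):
    """Linear estimate of E from N >= 8 normalized 2D-2D matches (N x 2),
    so that hom(p_j)^T E hom(p_i) = 0 (the epipolar constraint, Eq. (1))."""
    ones = np.ones(len(p_i))
    # each correspondence contributes one row of A vec(E) = 0
    A = np.column_stack([
        p_j[:, 0] * p_i[:, 0], p_j[:, 0] * p_i[:, 1], p_j[:, 0],
        p_j[:, 1] * p_i[:, 0], p_j[:, 1] * p_i[:, 1], p_j[:, 1],
        p_i[:, 0],             p_i[:, 1],             ones])
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)                  # null-space solution
    U, _, Vt = np.linalg.svd(E)               # project onto the essential
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt  # manifold (sing. values 1, 1, 0)

# synthetic two-view geometry: small yaw plus sideways translation
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(-1, 1, 50),
                     rng.uniform(-1, 1, 50),
                     rng.uniform(4, 8, 50)])   # 3D points in front of camera i
th = 0.1
R = np.array([[np.cos(th), 0, np.sin(th)],
              [0, 1, 0],
              [-np.sin(th), 0, np.cos(th)]])
t = np.array([0.5, 0.0, 0.1])
X_j = X @ R.T + t
p_i = X[:, :2] / X[:, 2:]                      # perspective projection
p_j = X_j[:, :2] / X_j[:, 2:]

E = eight_point_essential(p_i, p_j)
hom = lambda p: np.column_stack([p, np.ones(len(p))])
residual = np.abs(np.einsum('ij,jk,ik->i', hom(p_j), E, hom(p_i)))
# residual is the epipolar error of Eq. (1) per match; near zero here
```

The recovered E is only defined up to scale and sign, and decomposing it into [R, t̂] yields a unit-norm translation direction, which is exactly why a separate scale recovery step (Sec. 4.4) is needed.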
Moreover, the translation recovered from the essential matrix is up-to-scale because of scale ambiguity.

Fig. 2  DF-VO pipeline. For an image pair, (forward and backward) optical flows and single-view depths are predicted. A forward-backward flow consistency is computed as a criterion to establish good correspondences (2D-2D; 3D-2D). We have two alternative trackers, out of which one is selected by the data-driven model selection module. The first tracker (E-tracker) uses 2D-2D correspondences to estimate and decompose an essential matrix to find rotation and translation direction, which is followed by a translational scale recovery step to estimate metric VO. The second tracker (PnP-tracker) uses single-view depth estimates in conjunction with 3D-2D registration via PnP.

3.2 Perspective-n-Point

Perspective-n-Point (PnP) solves the camera pose given known 3D-2D correspondences. In a two-view problem, suppose we have obtained a set of 3D-2D correspondences, including the 3D points in the i-th view and the corresponding projections in the j-th view, (X_i, p_j). PnP can be employed to estimate the camera pose by minimizing the reprojection error

e = Σ_x ‖K(R X_i[x] + t) − p_j[x]‖²,   (2)

where [x] is pixel-coordinate indexing. Solving a PnP problem requires an accurate estimate of the 3D structure of the scene, which can be obtained from depth-sensor measurements or mature stereo reconstruction methods, while it is a more challenging problem in the monocular case.

4 DF-VO: Depth and Flow for Visual Odometry

4.1 System Overview

A standard Visual Odometry pipeline includes feature extraction and matching to establish correspondences, followed by pose estimation from the correspondences. We follow this pipeline and present DF-VO, which is illustrated in Fig. 2 and Alg. 1. Two types of correspondences (2D-2D and 3D-2D) are considered in this system. To obtain the correspondences, (1) an optical flow network is trained to predict dense correspondences between images for 2D-2D correspondence establishment; (2) a single-view depth network is used to estimate 3D structure, so that 3D-2D correspondences can be established by combining the optical flow estimation. Accurate sparse correspondences are then selected with a carefully designed mechanism. The two trackers used for pose estimation are named E-tracker and PnP-tracker, which employ Epipolar Geometry with a scale recovery module and Perspective-n-Point, respectively. Note that the scale recovery module is associated with the E-tracker for solving the well-known scale-ambiguity and scale-drift issues. To decide a suitable tracker for each input pair, a robust model selection method using the Geometric Robust Information Criterion is used. In order to achieve minimal training and supervision, and high-quality predictions from the deep networks, we explore a variety of training schemes for the depth network and the flow network. Building on top of advanced deep networks and classic geometry methods, we present a simple yet effective and robust monocular Visual Odometry system.

4.2 Deep Predictions

In order to form 2D-2D/3D-2D correspondences from an image pair, specifically (p_i, p_j) or (X_i, p_j), we propose to use an optical flow network and a single-view depth network to establish the correspondences.

Algorithm 1  DF-VO: Depth and Flow for Visual Odometry
Require: Depth-CNN: M_d; Flow-CNN: M_f
Input: Image sequence: [I_1, I_2, ..., I_k]
Output: Camera poses: [T_1, T_2, ..., T_k]
1:  Initialization: T_1 = I; i = 2
2:  while i ≤ k do
3:    Get CNN predictions: D_i and the forward/backward flows (F_{i-1,i}, F_{i,i-1})
4:    Compute the forward-backward flow inconsistency from (F_{i-1,i}, F_{i,i-1})
5:    Correspondence selection: form matches (P_i, P_{i-1}) from the filtered flows based on flow inconsistency
6:    Model selection: estimate E and H from (P_i, P_{i-1}) and compute GRIC scores for the trackers
7:    if E-tracker then
8:      Recover [R, t̂] from the estimated essential matrix
9:      Triangulate (P_i, P_{i-1}) to get D'_i
10:     Scale recovery to estimate s
11:     T_i^{i-1} = [R, s t̂]
12:   else if PnP-tracker then
13:     Form 3D-2D correspondences from (D_i, P_i, P_{i-1})
14:     Estimate [R, t] using PnP
15:     T_i^{i-1} = [R, t]
16:   end if
17:   T_i ← T_{i-1} T_i^{i-1}
18: end while

Optical flow: The 2D-2D correspondences are extracted from dense optical flow predictions. Given an image pair (I_i, I_j), optical flow describes the pixel movements in I_i, which gives the correspondences of all the pixels of I_i in I_j. Though state-of-the-art deep optical flow networks have shown high average accuracy, not all pixels share the same high accuracy. Therefore, we propose a correspondence selection scheme in Sec. 4.3 to pick good predictions robustly.

Single-view depth: In order to establish 3D-2D correspondences between two views, (X_i, p_j), we need to obtain the 3D structure of the i-th view and the correspondences between the 3D landmarks and the 2D landmarks. Traditional approaches establish the correspondences via feature matching between 3D landmarks and 2D feature points. In this work, we use a deep depth network as our "depth sensor" to estimate the 3D structure of the i-th view, X_i. Through the 2D-2D correspondences established by optical flow, we can directly get a set of 3D-2D correspondences and solve the relative camera pose by solving PnP.

Unfortunately, current state-of-the-art single-view depth estimation methods are still insufficient for recovering very accurate 3D structure (about 10% relative error) for accurate camera pose estimation, as shown in Tab. 3. On the other hand, optical flow estimation is a more generic task. The state-of-the-art deep learning methods are accurate and have good generalization ability. Therefore, we mainly use the 2D-2D matches for solving the pose from the essential matrix, while the depth predictions are used for scale recovery and the PnP-tracker. As a result, the PnP-tracker is used as an auxiliary tracker when the E-tracker tends to fail.

4.3 Correspondence Selection

Most deep learning-based optical flow models predict dense optical flows, i.e., every pixel is associated with a predicted flow vector. There can be a lot of matches formed by the optical flows, some of which are very accurate. It is time-consuming if all matches are taken into consideration when solving a VO problem, since only sparse matches are required to solve the problem in theory. The vanilla way is to sample the optical flows randomly/uniformly from the dense predictions.

However, we have observed that not all the flow predictions share the same high accuracy. Some regions in the images have worse optical flow predictions, for instance out-of-view regions, where no correspondences can be found in the other view, and dynamic-object regions, which are usually associated with occlusion. In order to filter out the outliers and pick good optical flows, we propose a correspondence selection scheme based on bi-directional flow consistency; see the example in Fig. 3.

Flow consistency: Given an image pair (I_i, I_j), both forward and backward optical flows, F_ij and F_ji, are predicted by the flow network. Thus we compute the forward-backward flow inconsistency as a measure to choose good 2D-2D correspondences. The inconsistency is computed by

C = ‖F_ij + w(F_ji, p_f(F_ij))‖.   (3)

The warping process at a pixel x is described as

w(F_ji, p_f(F_ij))[x] = F_ji[x + F_ij[x]].   (4)

As x + F_ij[x] does not necessarily fall on the regular grid, the resulting flow is interpolated from the flow vectors at the 4 corners (Jaderberg et al. 2015). We use the flow consistency to select correspondences with higher accuracy; the hypothesis we make is that optical flows with better consistency tend to have higher accuracy, which is verified with an experiment in Sec. 6.

Fig. 3  (Top) Filtered 2D correspondences established by the optical flow prediction; (Bottom left) optical flow prediction; (Bottom right) bidirectional flow consistency (high consistency is shown in blue) shows that sufficient correspondences can be established in the overexposure case.

Best-N selection: After computing the forward-backward flow inconsistency, we choose the optical flows with the least inconsistency F̃ to form the best-N 2D-2D matches (P_i, P_j) (Zhan et al. 2020), where N equals 2000 in most experiments. This correspondence selection scheme is able to reject a lot of inaccurate flows. As shown in (Zhan et al. 2020), DF-VO with this correspondence selection scheme has already outperformed existing VO/SLAM baselines. However, there are still some potential issues with the scheme.

– Model under-fitting: if the chosen best-N matches do not have enough location diversity, the estimated pose model can be an under-fitting model.

– Structure degeneracy: if all the chosen matches lie on a planar region, structure degeneracy happens and leads to failure in estimating the essential matrix (Torr et al. 1999).

Local best-K selection: On top of the best-N selection, we want to increase the location diversity of the matches. We divide the image into M (M = 10 × 10) regions and choose the best-K matches from each region. However, there might be cases with severely inaccurate flow predictions (e.g., margin regions, which are usually out-of-view), where the flow predictions should not be used. Therefore, we first filter the flows such that only flows with inconsistency less than a threshold can be picked. As a result, the final correspondences (P_i, P_j) formed from F̃ are a union of the best-K matches in each region. The value K in the j-th region is defined as K_j = min(N/M, Q_j), where Q_j is the number of valid flows after thresholding. Since the correspondence quality is vital, we further check the number of valid correspondences and the number of regions with valid correspondences to determine whether sufficiently good correspondences are used. If insufficient correspondences are found, which rarely happens (mostly when the image quality is very poor, such as extreme under-/over-exposure), we use a constant motion model instead of the E/PnP-tracker.

The advantages of performing local best-K selection are two-fold: (1) increasing location diversity, as described; (2) speeding up the correspondence selection process, since part of the flows are rejected in the first place and sorting flow inconsistency is performed in a local image region instead of the whole image.

Compared to traditional feature-based methods, which only use salient feature points for matching and tracking, any pixel in the dense optical flow can be a candidate for tracking. Moreover, traditional features usually gather visual information from local regions, while a CNN gathers more visual information (larger receptive field) and higher-level contextual information, which gives more accurate and robust correspondences.

After selecting good 2D-2D correspondences, the essential matrix can be solved using Epipolar Geometry as described in Sec. 3.1. Then the camera motion, consisting of rotation R and translation t̂, can be decomposed from the essential matrix. However, the recovered motion is up-to-scale; specifically, the translation is a unit vector representing the translation direction only. In order to recover and maintain a consistent scale over the monocular footage, a consistent scale recovery process is required.

4.4 Scale Recovery

In the traditional monocular VO pipeline, the per-frame scale is recovered by aligning triangulated 3D landmarks with existing 3D landmarks, which accumulates error.

Simple alignment: In this work, we use the predicted depths D_i to provide 3D structure as a reference for scale recovery. After recovering [R, t̂] from solving the essential matrix, triangulation is performed for (P_i, P_j) to recover up-to-scale depths D'_i.
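The scale factor for this alignment can be estimated robustly. Below is an illustrative NumPy sketch (my own simplification of the idea behind Sec. 4.4 and Alg. 2: a median depth-ratio with inlier re-selection, without the pose re-estimation step; the function name and thresholds are assumptions):

```python
import numpy as np

def align_scale(d_tri, d_cnn, delta=0.1, iters=10):
    """Estimate s so that s * d_tri ~ d_cnn, down-weighting outliers
    (e.g. points on dynamic objects) by re-selecting inliers each round."""
    s = np.median(d_cnn / d_tri)
    for _ in range(iters):
        inlier = np.abs(s * d_tri - d_cnn) / d_cnn < delta
        if inlier.sum() < 10:
            break
        s_new = np.median(d_cnn[inlier] / d_tri[inlier])
        if abs(s_new - s) < 1e-9:
            break
        s = s_new
    return s

# synthetic check: true scale 2.5, with 20% of points corrupted by outliers
rng = np.random.default_rng(2)
d_cnn = rng.uniform(5, 20, 500)            # "CNN" depths, metric scale
d_tri = d_cnn / 2.5                        # triangulated, up-to-scale
d_tri[:100] *= rng.uniform(1.5, 3.0, 100)  # outliers break the depth ratio
s = align_scale(d_tri, d_cnn)
```

On this synthetic data the outliers mimic triangulated points on dynamic objects, which violate the static-scene depth ratio and are filtered out by the inlier re-selection.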
A scaling factor, s, can be estimated by aligning the triangulated depth map D'_i with the CNN depth map D_i. An important advantage of using the depth CNN is that we can get rid of the scale-drift issue, for the following reasons.

– The depth CNN predicts per-frame 3D structures, which are scale-consistent. We show that we can train scale-consistent depth networks (Sec. 4.6).

– Scale drift is introduced by accumulated error in creating new 3D landmarks. We do not create new 3D landmarks but recover scale w.r.t. a single network.

Iterative alignment: Aligning the 3D landmarks triangulated from the selected optical flow matches with the CNN depths is simple and sufficient to recover an accurate scale in general cases. However, in a highly dynamic environment the selected optical flows can lie on dynamic regions, which is problematic for depth alignment. Moreover, similar to the optical flow predictions, not all predicted depths are highly accurate; pixels with high forward-backward flow consistency are not guaranteed to have high depth accuracy. Therefore, we propose an iterative scheme, Alg. 2.

Algorithm 2  Iterative Scale Recovery
Input: [R, t̂], F̃, D_i, s_{t-1}
1:  Initialization: s = s_{t-1}
2:  while s has not converged do
3:    Pose hypothesis: T = [R, s t̂]
4:    Compute the rigid flow F_rigid from T and D_i
5:    Compute flow inconsistency: F_diff ← ‖F̃ − F_rigid‖_2
6:    Select depth-flow pairs (D_i, P_1, P_2)_sel with F_diff < δ_rigid
7:    Estimate a new pose [R, t̂] from (P_1, P_2)_sel
8:    Triangulate (P_1, P_2)_sel to get D'_i
9:    Estimate the scaling factor s_new by comparing (D'_{i,sel}, D_{i,sel})
10:   s ← s_new
11: end while

The key is to select depths and filtered optical flows (Sec. 4.3) that are consistent with each other. Given that the filtered optical flows generally establish good correspondences, a pixel whose depth is consistent with the optical flow means that (1) the pixel belongs to a static region of the environment, and (2) the depth is likely to be accurate. However, depth and optical flow are related by a camera pose only for a static scene. Since the camera pose [R, t̂] is up-to-scale and does not share the same scale as the depth prediction, we therefore propose an iterative approach to select depth-flow pairs (Alg. 2). We first initialize the relative pose T with a pose T_0. The rigid flow is then computed from the current relative pose by

F_rigid = K T K^{-1} x D_i[x] − x,   (5)

where x ranges over the pixel coordinates of the selected optical flows. The consistency between the filtered optical flow F̃ and the rigid flow is then measured by ‖F̃ − F_rigid‖_2. Only depth-flow pairs with small optical-rigid flow inconsistency are selected as new matches. Thus, we update T with the new scaled pose and iterate the process until reaching the stopping condition (convergence, or n iterations reached). The scale initialization for the first image pair is set to zero, while the scale at time (t−1) is used as the scale initialization at time t.

4.5 Model Selection

We have presented a camera tracking method integrating Epipolar Geometry with deep predictions. However, as mentioned in Sec. 3.1, there are some known issues with Epipolar Geometry, i.e., motion degeneracy and unstable solutions when the motion is small. Since we have both 3D-2D and 2D-2D correspondences available, we can instead solve a PnP problem using the correspondences obtained in Sec. 4.3 when Epipolar Geometry tends to fail. In this section, we show that a suitable tracker/model can be selected in two possible ways.

Flow magnitude: We measure the magnitude of the flow predictions and solve the essential matrix only when the average flow magnitude is large enough. This avoids small camera motions, which usually come with small optical flows (Zhan et al. 2020). However, this naïve approach has some issues: (1) it does not resolve motion degeneracy (pure rotation), which also causes large optical flows; (2) it does not take outliers into account, e.g., dynamic objects, which cause optical flows even when the camera is stationary. Therefore, we adopt a more robust measure for model selection.

Geometric Robust Information Criterion: Torr et al. (1999) discuss the degeneracy cases (motion and structure) and their influence on geometry-guided camera motion estimation. Two robust strategies for tackling such degeneracies are proposed: (1) a statistical model selection test, named the Geometric Robust Information Criterion (GRIC), is used to identify cases where degeneracies occur; (2) multiple motion models are used to overcome the degeneracies. In this work, we follow the first approach to identify when the E-tracker tends to fail and switch to the PnP-tracker. (Torr et al. 1999) estimates both the fundamental matrix F and the homography matrix H and chooses the model with the lower GRIC score. The model that explains the data best, i.e., the one with the lower GRIC score, is indicated as most likely. GRIC computes a score for each model from the number of matches, n, and the residuals of the matches, e_i, together with several model constants.
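The GRIC comparison can be sketched as a small function (illustrative code following the score form of Eqs. (6)-(7); the toy residuals are made up):

```python
import numpy as np

def gric(residuals, sigma, d, k, r=4):
    """GRIC score, Eqs. (6)-(7); the lower score indicates the model that
    better explains the matches. d: structure dimension (3 for E/F, 2 for H);
    k: number of motion-model parameters; r: data dimension (4 for two views)."""
    n = len(residuals)
    lam1, lam2, lam3 = np.log(4), np.log(4 * n), 2.0
    rho = np.minimum(residuals**2 / sigma**2, lam3 * (r - d))  # robust residual
    return rho.sum() + lam1 * d * n + lam2 * k

# toy example: the epipolar model fits the matches well, a homography does not
res_E = np.full(200, 0.1)   # small epipolar residuals
res_H = np.full(200, 3.0)   # large homography residuals
gric_E = gric(res_E, sigma=1.0, d=3, k=5)
gric_H = gric(res_H, sigma=1.0, d=2, k=8)
keep_E_tracker = gric_E < gric_H   # True here: the E-tracker would be kept
```

The robust clamp in ρ(·) caps the contribution of gross outliers, so the comparison is driven by how well each model explains the bulk of the matches rather than by a few bad correspondences.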
The scale initialization for the first image – standard deviation of the measurement error, σ +pair is set as zero while the scale at time-(t-1) is used as – data dimension, r (4 for two views) +the scale initialization at time-(t). – number of motion model parameters, k (5 for E, 7 for + + F , 8 for H) + – dimension of the structure, d (3 for F , 2 for H) + + GRIC = ρ(ei2) + λ1dn + λ2k (6) + 8 Zhan et al. + +where ρ(e2i ) is a robust function of the residuals: 4.6.1 Training overview + +ρ(e2) = min e2 . (7) In this work, we jointly train the depth network and the + σ2 , λ3(r − d) pose network by minimizing the mean of the following + +The value of the parameters are λ1 = log4, λ2 = log4n, per-pixel objective function over the whole image. The +λ3 = 2. Different from (Torr et al. 1999), since we have per-pixel loss is + +both 3D-2D and 2D-2D correspondences, we can choose = min Lpe(Ii, Iji ) + λdsLds(Di, Ii)+ +PnP-Tracker instead of Homography-Tracker when E-Tracker L + j + +tends to fail. min λdcLdc(Di , Dji ), (8) + + j + +Cheirality condition In addition to the two methods in- where Lpe is photometric loss; Lds is depth smoothness +troduced above, we check for cheirality condition as well. loss; Ldc is depth consistency loss; and [λds, λdc] are loss +There are 4 possible solutions for [R, ˆt] by decomposing weightings. +E. To find the correct unique solution, cheirality condi- +tion, i.e. the triangulated 3D points must be in front of 4.6.2 Photometric loss +both cameras, is checked to remove the other solutions. +We further use the number of points satisfying cheirality Lpe is the photometric error by computing the differ- +condition as a reference to determine if the solution is ence between the reference image Ii and the synthesized +stable. view Iji warped from the source image Ij, where j ∈ + [i − n, i + n, s]. 
[i − n, i + n] are neighbouring views of + Therefore, we choose PnP-Tracker when GRICE is Ii while s is stereo pair if stereo sequences are used in +higher than GRICH or cheirality check condition is not training. As proposed in (Godard et al. 2019), instead of +fulfilled. Otherwise, E-Tracker is employed for solving averaging the photometric errors between the reference +frame-to-frame camera motion. To robustify the system, pixel and the synthesized pixels from multiple views, (Go- +we wrap the trackers in RANSAC loops. dard et al. 2019) only counts the photometric error be- + tween the reference pixel and the synthesized pixel with +4.6 Jointly learning of depths and pose the minimum error. The rationale is to overcome the is- + sues related to out-of-view pixels and occlusions. +Various depth training frameworks can be employed de- +pending on the availability of data (monocular/stereo se- Lpe(Ii, Iji) = α 1 − SSIM(Ii, Iji) + (1 − α)|Ii − Iji| (9) +quences, depth sensor measurements). The most trivial 2 +way is using a supervised training framework (Eigen et al. +2014; Fu et al. 2018; Kendall & Gal 2017; Laina et al. Iji = w Ij , pre(K, Di, Tij ) , (10) +2016; Liu et al. 2015, 2016; Nekrasov et al. 2019), but +ground truth depths are not always available for any sce- where SSIM (Wang et al. 2004) is a robust measurement +nario. Some recent works suggest that jointly learning +single-view depths and camera motion in a self-supervised for image similarity and α = 0.85 balances the SSIM +manner is feasible using monocular sequences (Bian et al. +2019b; Godard et al. 2019; Yin & Shi 2018; Zhou et al. error and the simple color intensity error. w(I, p) is a +2017), or stereo sequences (Garg et al. 2016; Godard et al. +2017, 2019; Zhan et al. 2018). Instead of using ground differentiable warping function (Jaderberg et al. 
2015) which warps image I according to the pixel locations p. p_re(K, Di, Tij) establishes the pixel coordinates reprojected from view-i to view-j, where K is the camera intrinsics, Di is the predicted depth map of view-i, and Tij is the relative pose between the pair. The reprojection of a pixel x from view-i to view-j is represented by

p_re(K, Di, Tij) = K Tij Di[x] K⁻¹ x    (11)

4.6.3 Depth smoothness regularization

Following the approach in (Godard et al. 2017), we encourage depth to be locally smooth by introducing an edge-aware depth smoothness term. A depth discontinuity is penalized if colour continuity is present in the same local region. The smoothness regularization is formulated as

Lds(Di, Ii) = |∂x Di| e^(−|∂x Ii|) + |∂y Di| e^(−|∂y Ii|),    (12)

where ∂x(·) and ∂y(·) are gradients in the horizontal and vertical directions respectively. Note that we regularize inverse depth instead of depth.
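Eqn (12), applied to inverse depth as noted above, can be sketched as follows; the forward-difference discretization is an assumption for illustration:

```python
import numpy as np

def depth_smoothness_loss(depth, image):
    """Edge-aware smoothness (Eqn 12), applied to inverse depth.

    depth : (H, W) predicted depth
    image : (H, W) grayscale reference image
    """
    inv = 1.0 / depth
    # forward-difference gradients in x and y
    dx_d, dy_d = np.abs(np.diff(inv, axis=1)), np.abs(np.diff(inv, axis=0))
    dx_i, dy_i = np.abs(np.diff(image, axis=1)), np.abs(np.diff(image, axis=0))
    # down-weight depth gradients where the image itself has strong edges
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()
```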
4.6.4 Training without scaling issues

Similar to traditional monocular 3D reconstruction, scale ambiguity and scale inconsistency issues exist when monocular videos are used for training. Since monocular training usually uses short image snippets (usually 2 or 3 frames), the training does not guarantee a consistent learnt scale across snippets, which creates the scale inconsistency issue (Bian et al. 2019b).

One solution to both scale problems is using stereo sequences during training (Godard et al. 2019; Li et al. 2017; Zhan et al. 2018): the deep predictions are aligned with real-world scale and are scale-consistent because of the constraint introduced by the known stereo baseline. Even though stereo sequences are used during training, only monocular images are required at inference time for depth prediction.

Another solution to the scale inconsistency issue is the temporal geometry consistency regularization proposed in (Bian et al. 2019b; Zhan et al. 2019), which constrains the depth consistency across multiple views. As the depth predictions become consistent across different views, and thus across different snippets, the scale inconsistency issue is resolved. Using the rigid-scene assumption, as the cameras move in space over time we want the predicted depths at view-i to be consistent with the respective predictions at view-j. This is done by correctly transforming the scene geometry from frame-j to frame-i, much like the image warping. Specifically, we adopt the inverse depth consistency proposed in (Zhan et al. 2019),

Ldc(Di, Di^j) = |1/Di − 1/Di^j|    (13)

Inspired by (Godard et al. 2019), we use the minimum error over multi-view pairs instead of averaging the depth consistency error over all source views, in order to avoid occlusions and out-of-view scenes.

4.7 Learning of optical flows

Many deep learning-based methods have been proposed for estimating optical flow (Dosovitskiy et al. 2015; Hui et al. 2018; Ilg et al. 2017; Meister et al. 2018; Sun et al. 2018). In this work, we choose LiteFlowNet (Hui et al. 2018) as our backbone network for optical flow prediction since LiteFlowNet is fast, lightweight, and accurate. LiteFlowNet consists of a two-stream network for feature extraction and a cascaded network for flow inference and regularization. We refer readers to (Hui et al. 2018) for more details. LiteFlowNet shows good generalization ability: trained on a synthetic dataset (Scene Flow (Dosovitskiy et al. 2015)), it generalizes well to real-world scenarios, though artifacts are sometimes present in some regions.

In this work, we mainly use the model trained on Scene Flow. However, we also show that self-supervised finetuning can be performed to help the model better adapt to unseen environments and remove the artifacts. Two finetuning schemes are tested and compared, namely offline finetuning and online finetuning (Sec. 6.2). Similar to the self-supervised training of the depth network, the optical flow network is trained by minimizing the mean of the following per-pixel loss function over the whole image.

L = min_j Lpe(Ii, Ii^j) + λfs Lfs(||Fij||₂, Ii) + λfc Lfc(||Fij + w(Fji, pf(Fij))||)    (14)

Ii^j = w(Ij, pf(Fij)),    (15)

Different from Eqn. 10, pf(·) establishes the correspondences between view-i and view-j via the flow field instead of the reprojection defined in Eqn. 10. For a pixel x on view-i, the corresponding pixel position pf(Fij)[x] on view-j is x + Fij[x].

We also regularize the optical flow to be smooth using an edge-aware flow smoothness loss Lfs(·), similar to the depth smoothness loss defined in Eqn. 12. Similar to Meister et al. (2018), we estimate both forward and backward optical flow and constrain the bidirectional predictions to be consistent via the loss Lfc.
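The flow-based correspondence pf and the bidirectional consistency measured by Lfc can be sketched as follows; nearest-neighbour sampling stands in for the bilinear, differentiable warp used in training:

```python
import numpy as np

def flow_warp(field, flow):
    """Sample `field` at x + flow[x] (nearest-neighbour p_f warp for brevity)."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return field[yt, xt]

def fb_consistency(flow_ij, flow_ji):
    """Per-pixel forward-backward inconsistency ||F_ij + F_ji(x + F_ij[x])||.

    For consistent bidirectional flow, the backward flow sampled at the
    forward-flow target cancels the forward flow, giving values near zero.
    """
    return np.linalg.norm(flow_ij + flow_warp(flow_ji, flow_ij), axis=-1)
```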
5 Implementation and Benchmarking

5.1 Dataset

We train and test our method on popular benchmarking datasets, KITTI (Geiger et al. 2013, 2012) and Oxford Robotcar (Maddern et al. 2017), which are large-scale outdoor driving datasets. KITTI provides various splits for several tasks, e.g. depth estimation, odometry, and object tracking. In this work, we select the following three splits to evaluate our method.

KITTI Odometry. The Odometry split contains 11 driving sequences with publicly available ground-truth camera poses. Most of the sequences are long, and some contain loop closures. Following (Zhou et al. 2017), we train our networks on sequences 00-08. The dataset contains 36,671 training pairs, [Ii, Ii−1, Ii+1, Ii,s].

KITTI Tracking. The Tracking split contains 21 sequences with available ground truths. The split is primarily used for object tracking benchmarking, so these sequences contain more dynamic objects than the Odometry split but are generally shorter. Following (Zhang et al. 2020), we choose 9 of the 21 sequences with a considerable number of dynamic objects to test the robustness of our system in dynamic environments. These sequences are challenging for most monocular VO/SLAM systems, since most systems assume static scenarios.

KITTI Flow. The KITTI Flow 2012/2015 splits contain 194/200 image pairs with high-quality optical flow labels. We use this split to evaluate the performance of the optical flow models in this work.

Oxford Robotcar. To further test the generalization ability of the system, we test the proposed system on the Oxford Robotcar dataset. Following (Loo et al. 2019), 8 sequences are selected for evaluation, and the first 200 frames¹ are skipped in the evaluation due to the extremely overexposed images at the beginning of the sequences.

¹ Our system can operate even without skipping the frames. The 200 frames are skipped in the evaluation for a fair comparison.

Fig. 4 Qualitative VO results on KITTI: (Top) Seq.09 and (Bottom) Seq.10 against deep learning-based and geometry-based methods (shown separately). Compared methods: GT, SfM-Learner, Depth-VO-Feat, SC-SfM-Learner, VISO2, ORB-SLAM2 (w/ and w/o LC), Ours (M-SC-Train.), and Ours (S-Train.).

5.2 Deep network training

We train our networks with the PyTorch (Paszke et al. 2017) framework. All self-supervised experiments are trained with the Adam optimizer (Kingma & Ba 2014) for 20 epochs. For KITTI, images with a size of 640 × 192 are used for training. The learning rate is set to 10⁻⁴ for the first 15 epochs and then dropped to 10⁻⁵ for the remaining epochs. The loss weightings are [λds, λdc] = [10⁻³, 5] for jointly learning depths and camera motion, and [λfs, λfc] = [10⁻¹, 5 × 10⁻³] for the optical flow experiments.

5.3 Visual Odometry Benchmarking

Evaluation criterion. Some common evaluation criteria are adopted for a detailed analysis. The KITTI Odometry criterion reports the average translational error terr (%) and rotational error rerr (°/100m) by evaluating all possible sub-sequences of length (100, 200, ..., 800) meters. Absolute trajectory error (ATE) measures the root-mean-square error between the predicted camera positions [x, y, z] and the ground truth. Relative pose error (RPE) measures the frame-to-frame relative pose error. Since most of the methods are monocular and lack a scaling factor to match the real-world scale, during evaluation we scale and align (7DoF optimization) the predictions to the associated ground-truth poses by minimizing the ATE (Umeyama 1991). For methods using stereo depth models (Ours (Stereo Train.), Depth-VO-Feat) or a known scale prior (VISO2), whose predictions are already aligned with the real-world scale, we instead perform a 6DoF optimization w.r.t. the ATE for a fair comparison.

KITTI Odometry. We provide a detailed comparison between our VO system and prior arts on the KITTI Odometry split, including pure deep learning methods (Zhou et al. 2017)², (Zhan et al. 2018), (Bian et al. 2019b), and geometry-based methods including DSO (Engel et al. 2017)³, VISO2 (Geiger et al. 2011), and ORB-SLAM2 (Mur-Artal et al. 2015a), w/ and w/o loop closure. ORB-SLAM2 occasionally suffers from tracking failure or unsuccessful initialization, so we run ORB-SLAM2 three times and report the run with the least trajectory error. The quantitative and qualitative results are shown in Tab. 1, Fig. 4, and Fig. 5. Seq.01 is excluded when computing the average error, since a sub-sequence of Seq.01 contains no trackable close features and most methods fail on it.

² SfM-Learner (Zhou et al. 2017): the updated model on Github is evaluated.
³ Result taken from (Loo et al. 2019).

Fig. 5 DF-VO and ORB-SLAM2 (monocular, w/ and w/o loop-closure) trajectories in sequences 00, 02, 03, 04, 05, 06, 07 and 08 from the KITTI odometry benchmark. Note that Seq. 08 does not contain loops, and ORB-SLAM2 (w/ LC) undergoes severe scale drifting while DF-VO does not.

Table 1 Quantitative results on KITTI Odometry Seq. 00-10 (best result in bold, second best underlined). Methods compared, grouped as Deep VO, Full SLAM / VO with optimization, and VO: SfM-Learner (Zhou et al. 2017), Depth-VO-Feat (Zhan et al. 2018), SC-SfMLearner (Bian et al. 2019b), DSO (Engel et al. 2017), ORB-SLAM2 w/o and w/ LC (Mur-Artal & Tardós 2016), VISO2 (Geiger et al. 2011), Ours (Mono-SC Train.), and Ours (Stereo Train.). Metrics per sequence: terr, rerr, ATE, RPE (m), and RPE (°), plus the average error over sequences.
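The 7DoF alignment used in the evaluation protocol (Umeyama 1991) is a closed-form similarity fit between the predicted and ground-truth trajectories; a NumPy sketch:

```python
import numpy as np

def umeyama_alignment(X, Y, with_scale=True):
    """Least-squares similarity transform with Y ≈ s * R @ X + t (Umeyama 1991).

    X, Y : (3, N) predicted / ground-truth positions.
    with_scale=True gives the 7DoF alignment; False gives the 6DoF variant.
    """
    mu_x = X.mean(axis=1, keepdims=True)
    mu_y = Y.mean(axis=1, keepdims=True)
    Xc, Yc = X - mu_x, Y - mu_y
    n = X.shape[1]
    cov = Yc @ Xc.T / n
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1  # reflection correction
    R = U @ S @ Vt
    var_x = (Xc ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_x if with_scale else 1.0
    t = mu_y - s * R @ mu_x
    return s, R, t
```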
Ours (Mono-SC Train.) uses a depth model trained with monocular videos and the inverse depth consistency term to ensure scale consistency. Ours (Stereo Train.) uses a depth model trained with stereo videos. Note that even though stereo sequences are used during training, monocular sequences are used in testing; Ours (Stereo Train.) is therefore still a monocular VO system. We show that our methods outperform pure deep learning methods, which rely on a PoseCNN for camera motion estimation, by a large margin in all metrics. Under the KITTI Odometry criterion, ORB-SLAM2 shows less rotation drift rerr but higher translation drift terr due to its scale drift issue, which is also visible in Fig. 4. The drift can sometimes be resolved by loop closing with expensive global bundle adjustment, but it persists when no loop closure is detected. Different from other methods, we use a single depth network as our "reference map": the translation scales are recovered w.r.t. the scale-consistent depth predictions. As a result, we mitigate the scale drift issue present in most monocular VO/SLAM systems and show less translation drift over long sequences. More importantly, our method shows consistently smaller relative errors, both translational and rotational, which allows our system to serve as a robust module for frame-to-frame tracking.

Table 2 Visual odometry evaluation on the Oxford Robotcar dataset; the Absolute Trajectory Error (metres) is the evaluation criterion. Methods compared: SVO (Forster et al. 2016), CNN-SVO (Loo et al. 2019), DSO (Engel et al. 2017), ORB-SLAM (w/o LC) (Mur-Artal et al. 2015b), and Ours, over sequences 2014-05-06-12-54-54, 2014-05-06-13-09-52, 2014-05-06-13-14-58, 2014-05-06-13-17-51, 2014-05-14-13-46-12, 2014-05-14-13-53-47, 2014-05-14-13-59-05, and 2014-06-25-16-22-15 (X denotes failure).

Fig. 6 Qualitative VO results on Oxford Robotcar: (Left) 2014-05-06-12-54-54 and (Right) 2014-06-25-16-22-15. Note that there is in fact a loop closure in the left sequence, but the "Ground truth" is not accurate enough, as mentioned in the Robotcar official documentation.
KITTI Tracking. To show the robustness of our system in dynamic environments, we compare our system against ORB-SLAM2 on the KITTI Tracking dataset. The results are shown in Tab. 5. Since the Tracking split contains relatively short sequences compared to the Odometry split, the KITTI Odometry criterion is not a suitable measurement here; we therefore report the frame-to-frame RPE (translation) for the Tracking split as a reference. Note that sequence 2011/10/03-47 is the most difficult of the 9 sequences due to its highly dynamic highway environment. ORB-SLAM2 is well known for its superior ability in removing outliers, but its performance still degrades significantly on this sequence, while our method performs robustly.

Table 3 Ablation study on the KITTI Odometry dataset regarding different components. Variants evaluated against the Reference Model: Tracker (PnP), Flow (Self-Flow offline, Self-Flow online), Depth (Mono-SC, Mono.), Correspondences (Uniform, Best-N), Scale (Iterative), Model Sel. (Flow), and Img. Res. (Full), reporting terr and rerr on Seq. 09 and 10.
Oxford Robotcar. We also test the generalization ability of the system on Oxford Robotcar (Maddern et al. 2017). The results⁴ are reported in Tab. 2 and illustrated in Fig. 6. Note that there are some overexposed frames in the middle of the sequences (e.g. Fig. 3), which are so challenging for visual odometry/SLAM algorithms that many of those listed in Tab. 2 fail to run the sequences. However, the deep optical flow network still predicts sufficiently good correspondences for pose estimation (Fig. 3). The optical flow network rarely fails to give sufficiently good correspondences; when it does, the number of valid correspondences reflects the failure, and a constant motion model is employed in such cases. The results show that our system outperforms the others. More importantly, they demonstrate that sampling correspondences from deep optical flow is more robust than matching hand-crafted features.

⁴ The results of the other methods are taken from (Loo et al. 2019).

6 Ablation study

In this section, we present an extensive ablation study (Tab. 3) to understand the effect of the components proposed in this work. We use a Reference Model with the following settings and study the components in the following categories.

– Tracker: hybrid (E-tracker and PnP-tracker)
– Depth model: trained with stereo sequences
– Flow model: LiteFlowNet trained on a synthetic dataset
– Correspondence selection: local best-K selection
– Scale recovery: simple alignment
– Model selection: GRIC
– Image resolution: down-sampled size (640 × 192)
6.1 Tracker

DF-VO consists of two trackers, the E-tracker and the PnP-tracker. The E-tracker is the main tracker when general motion (sufficient translation) and general structure (non-planar) can be assumed. The PnP-tracker is used when the E-tracker fails to estimate the motion, as introduced in Sec. 4.5. Using the E-tracker alone potentially fails when motion degeneracy or structure degeneracy happens, as described in Sec. 3.1. Therefore, we only compare the Reference Model to the case in which the PnP-tracker alone is used. PnP relies on the accuracy of both the depth and optical flow predictions for establishing accurate 3D-2D correspondences. There is no straightforward way to sample depth predictions good enough for accurate 3D-2D correspondences in 6DoF pose estimation, whereas the depth predictions are sufficient for the 1DoF scale recovery problem in the E-tracker.
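For reference, the cheirality check that the E-tracker uses to disambiguate the four [R, t̂] decompositions of E (Sec. 4.5) can be sketched with a linear triangulation; this is an illustrative NumPy sketch under normalized image coordinates, not the exact implementation:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one correspondence.

    x1, x2 : normalized image coordinates (3,) with z = 1
    P1, P2 : (3, 4) projection matrices
    """
    A = np.vstack([x1[0] * P1[2] - P1[0],
                   x1[1] * P1[2] - P1[1],
                   x2[0] * P2[2] - P2[0],
                   x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def cheirality_count(R, t, xs1, xs2):
    """Number of correspondences whose triangulated point lies in front of
    both cameras; used to pick among the 4 [R, t] solutions from E and to
    judge whether the chosen solution is stable."""
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([R, t.reshape(3, 1)])
    good = 0
    for x1, x2 in zip(xs1, xs2):
        X = triangulate(P1, P2, x1, x2)
        z1 = X[2]                      # depth in camera 1
        z2 = (R @ X + t.ravel())[2]    # depth in camera 2
        good += (z1 > 0) and (z2 > 0)
    return good
```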
6.2 Flow model

LiteFlowNet trained with synthetic data shows acceptable generalization from synthetic to real. However, there are still some regions with significantly erroneous flow predictions. We find that with self-supervised finetuning, the model adapts better to the real-world sequences and the optical flow prediction accuracy is improved (Tab. 4).

Offline vs. online. We perform two types of self-supervised finetuning for the optical flow network. The offline method finetunes the flow network on sequences 00-08 using monocular videos, while the online method finetunes the model on-the-run on the current sequence. We test various amounts of data for online finetuning and evaluate the corresponding odometry results. The relationship is shown in Fig. 7: finetuning on a small amount of data (10%) is sufficient for the optical flow network to adapt to unseen scenarios.

Flow evaluation. We evaluate the quality of the optical flows on KITTI 2012/2015, two benchmark datasets for optical flow evaluation. The results are shown in Tab. 4. With self-supervised finetuning (offline), the accuracy of the flow prediction is significantly improved, especially in the percentage of outliers. One noticeable result is that self-supervised training increases the end-point error on KITTI 2015 from 4.785 to 4.987. The reason is that the self-supervised model is trained on the KITTI Odometry split, which contains long driving sequences without many dynamic objects, whereas KITTI 2015 contains many dynamic objects; we observe that the flow estimation error on these dynamic objects is larger for the self-supervised model, which raises the average error. The Scene Flow model, on the other hand, is trained in highly dynamic synthetic environments and is thus able to estimate the large flow magnitudes caused by moving objects. The synthetic model, however, generates artifacts in some regions when used on real-world data, so there are more outliers, as shown in Tab. 4. Nevertheless, the correspondence selection module effectively removes the bad flows predicted by the self-supervised model, and the overall flow accuracy improves over the Scene Flow model. Since better correspondences are estimated, the odometry result using Self-Flow improves as well.

6.3 Depth model

Training depth models with monocular videos comes with a scale inconsistency issue (Bian et al. 2019b). We use the inverse depth consistency proposed in (Zhan et al. 2019) to enforce consistent depth predictions (Sec. 4.6.4). Using a scale-consistent depth CNN for translation scale recovery helps mitigate the scale drift issue, which usually occurs after long travelling. Here we compare three depth models trained with different strategies. We train two models using monocular videos: the Mono. model is trained without the depth consistency term, while the Mono-SC model is trained with it. Models trained with monocular videos are always up-to-scale, i.e. the metric scale is unknown. Therefore, we also train a model using stereo sequences; note that this model does not include the depth consistency term. The predictions in stereo training are always associated with one and only one scale, i.e. the real-world scale, due to the constraint set by the known stereo baseline. Therefore, no scale ambiguity/inconsistency issues exist in this training scheme.
We can see that both the Reference Model (stereo) and Mono-SC show less terr and rerr after long travelling, aided by the scale-consistent depth predictions.

We also explored an online adaptation scheme for the depth network. However, the depth network training is unstable in online finetuning: the scale of the depth predictions fluctuates during training due to the scale ambiguity inherent in monocular training.

Table 4 Optical flow evaluation on the KITTI 2012/2015 optical flow splits. Average end-point error (AEPE) and the percentage of pixels with error larger than 1 (Out-1) are evaluated on non-occluded regions. SF (Super.): supervised training on Scene Flow. KITTI (Self.): self-supervised training on KITTI. BestN: bidirectional flow consistency thresholding applied.

Network      Dataset & Method                     KITTI 2012            KITTI 2015
                                                  AEPE (px)  Out-1 (%)  AEPE (px)  Out-1 (%)
LiteFlowNet  SF (Super.)                          1.593      26.1       4.785      39.6
LiteFlowNet  SF (Super.) + KITTI (Self.)          1.467      19.7       4.987      32.7
LiteFlowNet  SF (Super.) + BestN                  0.478       7.6       0.711      10.5
LiteFlowNet  SF (Super.) + KITTI (Self.) + BestN  0.422       5.7       0.628       7.7

Table 5 Quantitative results on the KITTI Tracking sequences; the frame-to-frame RPE (m) is reported for ORB-SLAM2 and for DF-VO (Ours) with simple and iterative scale recovery, over sequences 2011/09/26-05, -09, -11, -13, -14, -15, -18, 2011/09/29-04, and 2011/10/03-47, together with each sequence length (m) and the average.

Fig. 7 Effect of self-supervised online finetuning. The x-axis is the percentage of data used in the online finetuning.
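The Local Best-K selection used by the Reference Model (and the BestN thresholding of Tab. 4) keeps, per grid cell, the flows with the lowest forward-backward inconsistency; a sketch, with the grid layout and names chosen for illustration:

```python
import numpy as np

def local_best_k(consistency, k=10, grid=(10, 10)):
    """Grid-wise selection of sparse matches from a dense flow field.

    consistency : (H, W) per-pixel forward-backward flow inconsistency
    Returns an (M, 2) array of selected (row, col) pixel locations,
    keeping the k most consistent pixels inside each grid cell.
    """
    h, w = consistency.shape
    gh, gw = grid
    picked = []
    for gy in range(gh):
        for gx in range(gw):
            y0, y1 = gy * h // gh, (gy + 1) * h // gh
            x0, x1 = gx * w // gw, (gx + 1) * w // gw
            cell = consistency[y0:y1, x0:x1]
            # indices of the k lowest-inconsistency flows inside this cell
            flat = np.argsort(cell, axis=None)[:k]
            ys, xs = np.unravel_index(flat, cell.shape)
            picked.extend(zip(ys + y0, xs + x0))
    return np.array(picked)
```

Selecting per cell rather than globally spreads the correspondences over the image, which benefits the geometric solvers.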
6.4 Correspondence selection

Since only sparse matches are required for DF-VO, a naïve way to extract sparse matches from the dense optical flow prediction is to sample matches uniformly or randomly. We uniformly sampled 2000 flows to form the correspondences, and the resulting odometry is worse than with either the Best-N or the Local Best-K selection method. To verify the effectiveness of the forward-backward flow inconsistency, which is used for correspondence selection in both Best-N and Local Best-K selection, we evaluate the optical flow performance with and without the selection (Tab. 4). Instead of evaluating the best N points, we alternatively set an inconsistency threshold such that only the flows with inconsistency less than δfc are evaluated. We show that the accuracy of the selected flows improves significantly compared to the average over all optical flows.

6.5 Scale recovery

We propose two scale recovery methods in this work, namely simple alignment and iterative alignment. Simple alignment aligns the triangulated depths of the filtered optical flows with their corresponding depth predictions. However, the filtered optical flows can fall onto dynamic-object regions, and the depth predictions may not be accurate there. Iterative alignment is proposed for more robust scale recovery in dynamic environments: only depth points and filtered optical flows that are consistent with each other are used for scale recovery, which eliminates both bad depth predictions and optical flows on dynamic objects. Iterative alignment improves only slightly over simple alignment on the KITTI Odometry split, likely because these sequences are not very dynamic. However, in a highly dynamic environment like the KITTI Tracking split, especially Seq. 2011/10/03-47, a highway sequence with one-third of the image occupied by moving cars, iterative scale recovery shows a better result than simple alignment and works more robustly than ORB-SLAM2 (Tab. 5).

6.6 Model selection

Two model selection methods are proposed and tested in this work. The flow magnitude-based method (Zhan et al. 2020) is straightforward, but it has some potential failure cases, as explained in Sec. 4.5, and it requires a flow-magnitude threshold that must be found empirically. GRIC-based model selection, in contrast, is a parameter-free method that calculates a score function for each motion model, and it shows a more robust result than the flow-based method.

6.7 Image resolution

Down-sampled images are used in the Reference Model because this is the size used for training the deep networks. However, simply increasing the image size to full resolution allows the optical flow network to predict more accurate correspondences, so the odometry result can be boosted easily.

7 Conclusion

In this paper, we have presented a robust monocular VO system leveraging deep learning and geometry methods. We explore the integration of deep predictions with classic geometry methods.
Specifically, we use optical flow and single-view depth predictions from deep networks as intermediate outputs to establish 2D-2D/3D-2D correspondences for camera pose estimation. We show that the deep models can be trained or finetuned in a self-supervised manner, and we explore the effect of various training schemes. Depth models with a consistent scale can be used for scale recovery, which mitigates the scale drift issue in most monocular VO/SLAM systems. Instead of learning a complete VO system in an end-to-end manner, which does not perform competitively with geometry-based methods, we believe that integrating deep predictions with geometry gains the best of both domains. Compared to our previous conference version (Zhan et al. 2020), we robustify different components of the system and systematically evaluate the variants. Moreover, we integrate an online adaptation scheme into the system for better adaptation to unseen scenarios. A detailed ablation study is provided to verify the effectiveness of the choices in each module, including the original choices (Zhan et al. 2020) and the new components of this work. With these improvements, the current version shows more robust performance, especially in highly dynamic environments. Some prior arts (Tang et al. 2019; Tateno et al. 2017; Yang et al. 2018) show that a local optimization module can further improve the VO result, which is a possible future direction for our system. The current pipeline involves a single-view depth network, which is less accurate than multi-view stereo (MVS) networks; an MVS network could replace the depth network for better accuracy and possible online adaptation.

Acknowledgment

This work was supported by the UoA Scholarship to HZ, the ARC Laureate Fellowship FL130100102 to IR, and the Australian Centre of Excellence for Robotic Vision CE140100016.

References

Agrawal, P., Carreira, J., & Malik, J. (2015). Learning to see by moving. In IEEE International Conference on Computer Vision (ICCV), pp. 37–45.

Bian, J., Lin, W.-Y., Liu, Y., Zhang, L., Yeung, S.-K., Cheng, M.-M., et al. (2019a). GMS: Grid-based motion statistics for fast, ultra-robust feature correspondence. International Journal on Computer Vision (IJCV).

Bian, J.-W., Li, Z., Wang, N., Zhan, H., Shen, C., Cheng, M.-M., et al. (2019b). Unsupervised scale-consistent depth and ego-motion learning from monocular video. In Neural Information Processing Systems (NeurIPS).

Bian, J.-W., Wu, Y.-H., Zhao, J., Liu, Y., Zhang, L., Cheng, M.-M., et al. (2019c). An evaluation of feature matchers for fundamental matrix estimation. In British Machine Vision Conference (BMVC).

Dharmasiri, T., Spek, A., & Drummond, T. (2018). ENG: End-to-end neural geometry for robust depth and pose estimation using CNNs. arXiv preprint arXiv:1807.05705.

Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., et al. (2015). FlowNet: Learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision (ICCV), pp. 2758–2766.

Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In Neural Information Processing Systems (NeurIPS), pp. 2366–2374.

Engel, J., Koltun, V., & Cremers, D. (2017). Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).

Engel, J., Schöps, T., & Cremers, D. (2014). LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision (ECCV), Springer, pp. 834–849.

Forster, C., Pizzoli, M., & Scaramuzza, D. (2014). SVO: Fast semi-direct monocular visual odometry. In IEEE International Conference on Robotics and Automation (ICRA), pp. 15–22.

Forster, C., Zhang, Z., Gassner, M., Werlberger, M., & Scaramuzza, D. (2016). SVO: Semidirect visual odometry for monocular and multicamera systems. IEEE Transactions on Robotics (TRO), 33(2), pp. 249–265.

Fu, H., Gong, M., Wang, C., Batmanghelich, K., & Tao, D. (2018). Deep ordinal regression network for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2002–2011.

Garg, R., B G, V. K., Carneiro, G., & Reid, I. (2016). Unsupervised CNN for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision (ECCV), Springer, pp. 740–756.

Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR).

Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Geiger, A., Ziegler, J., & Stiller, C. (2011). StereoScan: Dense 3D reconstruction in real-time. In Intelligent Vehicles Symposium (IV).

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Klein, G., & Murray, D. (2007). Parallel tracking and mapping for small AR workspaces. In 6th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2007), IEEE, pp. 225–234.

Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., & Navab, N. (2016). Deeper depth prediction with fully convolutional residual networks. In International Conference on 3D Vision (3DV), IEEE, pp. 239–248.

Li, R., Wang, S., Long, Z., & Gu, D. (2017). UnDeepVO: Monocular visual odometry through unsupervised deep learning. arXiv preprint arXiv:1709.06841.

Li, Y., Ushiku, Y., & Harada, T. (2019). Pose graph optimization for unsupervised monocular visual odometry. In IEEE International Conference on Robotics and Automation (ICRA).

Liu, F., Shen, C., & Lin, G. (2015). Deep convolutional neural fields for depth estimation from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5162–5170.

Liu, F., Shen, C., Lin, G., & Reid, I. (2016). Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(10), pp. 2024–2039.

Loo, S. Y., Amiri, A. J., Mashohor, S., Tang, S. H., & Zhang, H. (2019). CNN-SVO: Improving the mapping in semi-direct visual odometry using single-image depth prediction. In IEEE International Conference on Robotics and Automation (ICRA).

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal on Computer Vision (IJCV), 60(2), pp. 91–110.

Maddern, W., Pascoe, G., Linegar, C., & New-
man, P. (2017). 1 Year, 1000km: The Oxford + RobotCar Dataset. The International Journal +Godard, C., Mac Aodha, O., & Brostow, G. (2017). Un- of Robotics Research (IJRR), 36 (1), pp. 3–15. + http://ijr.sagepub.com/content/early/2016/ + supervised monocular depth estimation with left-right 11/28/0278364916679498.full.pdf+html, URL + http://dx.doi.org/10.1177/0278364916679498. + consistency. In IEEE Conference on Computer Vision Matthies, L., Maimone, M., Johnson, A., Cheng, Y., Will- + son, R., Villalpando, C., et al. (2007). Computer vision + and Pattern Recognition (CVPR). IEEE, (pp. 6602– on mars. International Journal on Computer Vision + (IJCV), 75 (1), pp. 67–92. + 6611). Meister, S., Hur, J., & Roth, S. (2018). Unflow: Unsuper- + vised learning of optical flow with a bidirectional census +Godard, C., Mac Aodha, O., Firman, M., & Brostow, loss. In Association for the Advancement of Artificial + Intelligence (AAAI). + G. J. (2019). Digging into self-supervised monocular Mur-Artal, R., Montiel, J. M. M., & Tardos, J. D. + (2015a). ORB-SLAM: a versatile and accurate monoc- + depth prediction. In IEEE International Conference ular slam system. IEEE Transactions on Robotics + (TRO), 31 (5), pp. 1147–1163. + on Computer Vision (ICCV). + +Hartley, R., & Zisserman, A. (2003). Multiple View Ge- + + ometry in Computer Vision. New York, NY, USA: + + Cambridge University Press, 2 edition. + +Hartley, R. I. (1995). In defence of the 8-point algorithm. + + In IEEE International Conference on Computer Vision + + (ICCV). IEEE, (pp. 1064–1070). + +Hui, T.-W., Tang, X., & Loy, C. C. (2018). Liteflownet: + + A lightweight convolutional neural network for opti- + + cal flow estimation. In IEEE Conference on Computer + + Vision and Pattern Recognition (CVPR). (pp. 8981– + + 8989). + +Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, + + A., & Brox, T. (2017). Flownet 2.0: Evolution of opti- + + cal flow estimation with deep networks. 
In IEEE Con- + + ference on Computer Vision and Pattern Recognition + + (CVPR). (pp. 2462–2470). + +Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). + + Spatial transformer networks. In Neural Information + + Processing Systems (NeurIPS). (pp. 2017–2025). + +Kendall, A., & Gal, Y. (2017). What uncertainties do + + we need in bayesian deep learning for computer vision? + + In Neural Information Processing Systems (NeurIPS). + + (pp. 5580–5590). + +Kingma, D., & Ba, J. (2014). Adam: A method + + for stochastic optimization. arXiv preprint + DF-VO: What Should Be Learnt for Visual Odometry? 17 + +Mur-Artal, R., Montiel, J. M. M., & Tardo´s, J. D. Torr, P. H., Fitzgibbon, A. W., & Zisserman, A. (1999). + (2015b). Orb-slam: A versatile and accurate monocular The problem of degeneracy in structure and motion + slam system. IEEE Transactions on Robotics (TRO), recovery from uncalibrated image sequences. Interna- + 31 (5), pp. 1147–1163. tional Journal of Computer Vision, 32 (1), pp. 27–44. + +Mur-Artal, R., & Tardo´s, J. D. (2016). ORB-SLAM2: an Ullman, S. (1979). The interpretation of structure from + open-source SLAM system for monocular, stereo and motion. Proceedings of the Royal Society of London. + RGB-D cameras. CoRR, abs/1610.06475. Series B. Biological Sciences, 203 (1153), pp. 405–426. + +Nekrasov, V., Dharmasiri, T., Spek, A., Drummond, T., Umeyama, S. (1991). Least-squares estimation of trans- + Shen, C., & Reid, I. (2019). Real-time joint seman- formation parameters between two point patterns. + tic segmentation and depth estimation using asymmet- IEEE Transactions on Pattern Recognition and Ma- + ric annotations. IEEE International Conference on chine Intelligence (TPAMI), (4), pp. 376–380. + Robotics and Automation (ICRA). + Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., +Newcombe, R. A., Lovegrove, S. J., & Davison, A. J. Dosovitskiy, A., et al. (2017). Demon: Depth and mo- + (2011). 
Dtam: Dense tracking and mapping in real- tion network for learning monocular stereo. In IEEE + time. In Computer Vision (ICCV), 2011 IEEE Inter- Conference on Computer Vision and Pattern Recogni- + national Conference on. IEEE, (pp. 2320–2327). tion (CVPR). + +Nister, D. (2003). An efficient solution to the five-point Wang, S., Clark, R., Wen, H., & Trigoni, N. (2017). + relative pose problem. In IEEE Conference on Com- Deepvo: Towards end-to-end visual odometry with + puter Vision and Pattern Recognition (CVPR). (pp. deep recurrent convolutional neural networks. In IEEE + II–195). International Conference on Robotics and Automation + (ICRA). IEEE, (pp. 2043–2050). +Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., + DeVito, Z., et al. (2017). Automatic differentiation in Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. + PyTorch. In NIPS Autodiff Workshop. (2004). Image quality assessment: from error visibility + to structural similarity. IEEE transactions on image +Ranjan, A., Jampani, V., Balles, L., Kim, K., Sun, D., processing, 13 (4), pp. 600–612. + Wulff, J., et al. (2019). Competitive collaboration: + Joint unsupervised learning of depth, camera motion, Yang, N., Wang, R., Stueckler, J., & Cremers, D. (2018). + optical flow and motion segmentation. In IEEE Con- Deep virtual stereo odometry: Leveraging deep depth + ference on Computer Vision and Pattern Recognition prediction for monocular direct sparse odometry. In + (CVPR). (pp. 12240–12249). European Conference on Computer Vision (ECCV). + +Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Yin, Z., & Shi, J. (2018). Geonet: Unsupervised learning + Convolutional networks for biomedical image segmen- of dense depth, optical flow and camera pose. In IEEE + tation. In International Conference on Medical Im- Conference on Computer Vision and Pattern Recogni- + age Computing and Computer-Assisted Intervention. tion (CVPR). (pp. 1983–1992). + Springer, (pp. 234–241). 
+ Zhan, H., Garg, R., Weerasekera, C. S., Li, K., Agarwal, +Rublee, E., Rabaud, V., Konolige, K., & Bradski, G. H., & Reid, I. (2018). Unsupervised learning of monoc- + (2011). Orb: An efficient alternative to sift or surf. ular depth estimation and visual odometry with deep + In Computer Vision (ICCV), 2011 IEEE international feature reconstruction. In IEEE Conference on Com- + conference on. IEEE, (pp. 2564–2571). puter Vision and Pattern Recognition (CVPR). IEEE, + (pp. 340–349). +Scaramuzza, D., & Fraundorfer, F. (2011). Visual odom- + etry: Part i: The first 30 years and fundamentals. IEEE Zhan, H., Weerasekera, C. S., Bian, J., & Reid, I. (2020). + Robotics & Automation Magazine, 18 (4), pp. 80–92. Visual odometry revisited: What should be learnt? + Robotics and Automation (ICRA), 2020 IEEE Inter- +Sun, D., Yang, X., Liu, M.-Y., & Kautz, J. (2018). Pwc- national Conference on. + net: Cnns for optical flow using pyramid, warping, and + cost volume. In IEEE Conference on Computer Vision Zhan, H., Weerasekera, C. S., Garg, R., & Reid, I. D. + and Pattern Recognition (CVPR). (pp. 8934–8943). (2019). Self-supervised learning for single view depth + and surface normal estimation. In IEEE International +Tang, J., Ambrus, R., Guizilini, V., Pillai, S., Kim, H., & Conference on Robotics and Automation (ICRA). (pp. + Gaidon, A. (2019). Self-Supervised 3D Keypoint Learn- 4811–4817). + ing for Ego-motion Estimation. 1912.03426. + Zhang, J., Henein, M., Mahony, R., & Ila, V. (2020). +Tateno, K., Tombari, F., Laina, I., & Navab, N. (2017). Vdo-slam: A visual dynamic object-aware slam system. + CNN-SLAM: Real-time dense monocular slam with arXiv preprint arXiv:2005.11052. + learned depth prediction. In IEEE Conference on Com- + puter Vision and Pattern Recognition (CVPR). (pp. Zhang, Z. (1998). Determining the epipolar geometry + 6243–6252). and its uncertainty: A review. International Journal + 18 Zhan et al. + + on Computer Vision (IJCV), 27 (2), pp. 161–195. rescue. 
In European Conference on Computer Vision (ECCV). +Zhou, D., Dai, Y., & Li, H. (2019). Ground-plane-based Springer, (pp. 740–756). + Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision + absolute scale estimation for monocular visual odome- meets robotics: The kitti dataset. International Journal of + try. IEEE Transactions on Intelligent Transportation Robotics Research (IJRR). + Systems. Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for +Zhou, H., Ummenhofer, B., & Brox, T. (2018). Deep- autonomous driving? the kitti vision benchmark suite. In + tam: Deep tracking and mapping. arXiv preprint IEEE Conference on Computer Vision and Pattern Recognition + arXiv:1808.01900. (CVPR). +Zhou, T., Brown, M., Snavely, N., & Lowe, D. G. (2017). Geiger, A., Ziegler, J., & Stiller, C. (2011). Stereoscan: Dense 3d + Unsupervised learning of depth and ego-motion from reconstruction in real-time. In Intelligent Vehicles Symposium + video. In IEEE Conference on Computer Vision and (IV). + Pattern Recognition (CVPR). Godard, C., Mac Aodha, O., & Brostow, G. (2017). Unsuper- + vised monocular depth estimation with left-right consistency. +References In IEEE Conference on Computer Vision and Pattern Recogni- + tion (CVPR). IEEE, (pp. 6602–6611). +Agrawal, P., Carreira, J., & Malik, J. (2015). Learning to see Godard, C., Mac Aodha, O., Firman, M., & Brostow, G. J. + by moving. In IEEE International Conference on Computer (2019). Digging into self-supervised monocular depth predic- + Vision (ICCV). (pp. 37–45). tion. In IEEE International Conference on Computer Vision + (ICCV). +Bian, J., Lin, W.-Y., Liu, Y., Zhang, L., Yeung, S.-K., Cheng, Hartley, R., & Zisserman, A. (2003). Multiple View Geometry in + M.-M., et al. (2019a). GMS: Grid-based motion statistics Computer Vision. New York, NY, USA: Cambridge Univer- + for fast, ultra-robust feature correspondence. International sity Press, 2 edition. + Journal on Computer Vision (IJCV). Hartley, R. I. (1995). 
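The GRIC-based selection described in Sec. 6.6 can be made concrete with a small sketch. It scores each candidate motion model (homography vs. essential matrix) from its squared residuals, following Torr's GRIC formulation; the constants, the noise variance, and the synthetic residuals below are illustrative assumptions, not values taken from the paper:

```python
import math

def gric(residuals_sq, sigma2, d, k, r=4.0, lam3=2.0):
    """Geometric Robust Information Criterion for one motion model.

    residuals_sq: squared geometric errors over the n correspondences
    sigma2:       assumed measurement noise variance
    d:            dimension of the model manifold (2: homography, 3: essential)
    k:            number of model parameters (8 for H, 5 for E)
    r:            dimension of the data (4 for two-view correspondences)
    """
    n = len(residuals_sq)
    lam1, lam2 = math.log(r), math.log(r * n)
    # Robustified data term: residuals are capped so that outliers
    # cannot dominate the score.
    rho = sum(min(e / sigma2, lam3 * (r - d)) for e in residuals_sq)
    return rho + lam1 * d * n + lam2 * k

def select_motion_model(res_h_sq, res_e_sq, sigma2=1.0):
    """Pick the model with the lower (better) GRIC score; parameter-free in
    the sense that no empirical flow-magnitude threshold is required."""
    g_h = gric(res_h_sq, sigma2, d=2, k=8)
    g_e = gric(res_e_sq, sigma2, d=3, k=5)
    return ("homography", g_h) if g_h < g_e else ("essential", g_e)

# Synthetic example: the essential matrix explains the matches far better
# than a homography (e.g. a translating camera in a non-planar scene).
model, _ = select_motion_model([5.0] * 100, [0.1] * 100)  # -> "essential"
```

The flow magnitude-based alternative would instead compare the mean optical-flow magnitude against the empirically tuned threshold mentioned in Sec. 6.6; the GRIC score needs no such threshold.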
In defence of the 8-point algorithm. In + IEEE International Conference on Computer Vision (ICCV). +Bian, J.-W., Li, Z., Wang, N., Zhan, H., Shen, C., Cheng, M.- IEEE, (pp. 1064–1070). + M., et al. (2019b). Unsupervised scale-consistent depth and Hui, T.-W., Tang, X., & Loy, C. C. (2018). Liteflownet: A + ego-motion learning from monocular video. In Neural Infor- lightweight convolutional neural network for optical flow es- + mation Processing Systems (NeurIPS). timation. In IEEE Conference on Computer Vision and Pattern + Recognition (CVPR). (pp. 8981–8989). +Bian, J.-W., Wu, Y.-H., Zhao, J., Liu, Y., Zhang, L., Cheng, Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., & + M.-M., et al. (2019c). An evaluation of feature matchers for Brox, T. (2017). Flownet 2.0: Evolution of optical flow esti- + fundamental matrix estimation. In British Machine Vision mation with deep networks. In IEEE Conference on Computer + Conference (BMVC). Vision and Pattern Recognition (CVPR). (pp. 2462–2470). + Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spa- +Dharmasiri, T., Spek, A., & Drummond, T. (2018). Eng: End- tial transformer networks. In Neural Information Processing + to-end neural geometry for robust depth and pose estimation Systems (NeurIPS). (pp. 2017–2025). + using cnns. arXiv preprint arXiv:1807.05705. Kendall, A., & Gal, Y. (2017). What uncertainties do we need + in bayesian deep learning for computer vision? In Neural +Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Information Processing Systems (NeurIPS). (pp. 5580–5590). + Golkov, V., et al. (2015). Flownet: Learning optical flow with Kingma, D., & Ba, J. (2014). Adam: A method for stochastic + convolutional networks. In IEEE International Conference on optimization. arXiv preprint arXiv:1412.6980. + Computer Vision (ICCV). (pp. 2758–2766). Klein, G., & Murray, D. (2007). Parallel tracking and mapping + for small ar workspaces. 
In Mixed and Augmented Reality, +Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map predic- 2007. ISMAR 2007. 6th IEEE and ACM International Sympo- + tion from a single image using a multi-scale deep network. In sium on. IEEE, (pp. 225–234). + Neural Information Processing Systems (NeurIPS). (pp. 2366– Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., & Navab, + 2374). N. (2016). Deeper depth prediction with fully convolutional + residual networks. In International Conference on 3D Vision +Engel, J., Koltun, V., & Cremers, D. (2017). Direct sparse (3DV). IEEE, (pp. 239–248). + odometry. IEEE Transactions on Pattern Recognition and Ma- Li, R., Wang, S., Long, Z., & Gu, D. (2017). Undeepvo: Monoc- + chine Intelligence (TPAMI). ular visual odometry through unsupervised deep learning. + arXiv preprint arXiv:1709.06841. +Engel, J., Scho¨ps, T., & Cremers, D. (2014). LSD-SLAM: Large- Li, Y., Ushiku, Y., & Harada, T. (2019). Pose graph opti- + scale direct monocular slam. In European Conference on Com- mization for unsupervised monocular visual odometry. IEEE + puter Vision (ECCV). Springer, (pp. 834–849). International Conference on Robotics and Automation (ICRA). + Liu, F., Shen, C., & Lin, G. (2015). Deep convolutional neural +Forster, C., Pizzoli, M., & Scaramuzza, D. (2014). Svo: Fast fields for depth estimation from a single image. In IEEE Con- + semi-direct monocular visual odometry. In IEEE Interna- ference on Computer Vision and Pattern Recognition (CVPR). + tional Conference on Robotics and Automation (ICRA). (pp. (pp. 5162–5170). + 15–22). Liu, F., Shen, C., Lin, G., & Reid, I. (2016). Learning depth + from single monocular images using deep convolutional neu- +Forster, C., Zhang, Z., Gassner, M., Werlberger, M., & Scara- ral fields. IEEE Transactions on Pattern Recognition and Ma- + muzza, D. (2016). SVO: Semidirect visual odometry for chine Intelligence (TPAMI), 38 (10), pp. 2024–2039. + monocular and multicamera systems. 
IEEE Transactions on Loo, S. Y., Amiri, A. J., Mashohor, S., Tang, S. H., & Zhang, + Robotics (TRO), 33 (2), pp. 249–265. H. (2019). CNN-SVO: Improving the mapping in semi-direct + +Fu, H., Gong, M., Wang, C., Batmanghelich, K., & Tao, D. + (2018). Deep ordinal regression network for monocular depth + estimation. In IEEE Conference on Computer Vision and Pat- + tern Recognition (CVPR). (pp. 2002–2011). + +Garg, R., B G, V. K., Carneiro, G., & Reid, I. (2016). Unsuper- + vised cnn for single view depth estimation: Geometry to the + DF-VO: What Should Be Learnt for Visual Odometry? 19 + + visual odometry using single-image depth prediction. IEEE Tateno, K., Tombari, F., Laina, I., & Navab, N. (2017). CNN- + International Conference on Robotics and Automation (ICRA). SLAM: Real-time dense monocular slam with learned depth +Lowe, D. G. (2004). Distinctive image features from scale- prediction. In IEEE Conference on Computer Vision and Pat- + invariant keypoints. International Journal on Computer Vision tern Recognition (CVPR). (pp. 6243–6252). + (IJCV), 60 (2), pp. 91–110. +Maddern, W., Pascoe, G., Linegar, C., & Newman, P. (2017). Torr, P. H., Fitzgibbon, A. W., & Zisserman, A. (1999). The + 1 Year, 1000km: The Oxford RobotCar Dataset. The In- problem of degeneracy in structure and motion recovery from + ternational Journal of Robotics Research (IJRR), 36 (1), pp. uncalibrated image sequences. International Journal of Com- + 3–15. http://ijr.sagepub.com/content/early/2016/11/28/ puter Vision, 32 (1), pp. 27–44. + 0278364916679498.full.pdf+html, URL http://dx.doi.org/ + 10.1177/0278364916679498. Ullman, S. (1979). The interpretation of structure from motion. +Matthies, L., Maimone, M., Johnson, A., Cheng, Y., Willson, Proceedings of the Royal Society of London. Series B. Biological + R., Villalpando, C., et al. (2007). Computer vision on mars. Sciences, 203 (1153), pp. 405–426. + International Journal on Computer Vision (IJCV), 75 (1), pp. + 67–92. Umeyama, S. (1991). 
Least-squares estimation of transfor- +Meister, S., Hur, J., & Roth, S. (2018). Unflow: Unsupervised mation parameters between two point patterns. IEEE + learning of optical flow with a bidirectional census loss. In As- Transactions on Pattern Recognition and Machine Intelligence + sociation for the Advancement of Artificial Intelligence (AAAI). (TPAMI), (4), pp. 376–380. +Mur-Artal, R., Montiel, J. M. M., & Tardos, J. D. (2015a). + ORB-SLAM: a versatile and accurate monocular slam sys- Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Doso- + tem. IEEE Transactions on Robotics (TRO), 31 (5), pp. 1147– vitskiy, A., et al. (2017). Demon: Depth and motion network + 1163. for learning monocular stereo. In IEEE Conference on Com- +Mur-Artal, R., Montiel, J. M. M., & Tard´os, J. D. (2015b). puter Vision and Pattern Recognition (CVPR). + Orb-slam: A versatile and accurate monocular slam system. + IEEE Transactions on Robotics (TRO), 31 (5), pp. 1147–1163. Wang, S., Clark, R., Wen, H., & Trigoni, N. (2017). Deepvo: To- +Mur-Artal, R., & Tardo´s, J. D. (2016). ORB-SLAM2: an open- wards end-to-end visual odometry with deep recurrent con- + source SLAM system for monocular, stereo and RGB-D cam- volutional neural networks. In IEEE International Conference + eras. CoRR, abs/1610.06475. on Robotics and Automation (ICRA). IEEE, (pp. 2043–2050). +Nekrasov, V., Dharmasiri, T., Spek, A., Drummond, T., Shen, + C., & Reid, I. (2019). Real-time joint semantic segmentation Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. + and depth estimation using asymmetric annotations. IEEE (2004). Image quality assessment: from error visibility to + International Conference on Robotics and Automation (ICRA). structural similarity. IEEE transactions on image processing, +Newcombe, R. A., Lovegrove, S. J., & Davison, A. J. (2011). 13 (4), pp. 600–612. + Dtam: Dense tracking and mapping in real-time. In Computer + Vision (ICCV), 2011 IEEE International Conference on. 
IEEE, Yang, N., Wang, R., Stueckler, J., & Cremers, D. (2018). Deep + (pp. 2320–2327). virtual stereo odometry: Leveraging deep depth prediction +Nister, D. (2003). An efficient solution to the five-point relative for monocular direct sparse odometry. In European Conference + pose problem. In IEEE Conference on Computer Vision and on Computer Vision (ECCV). + Pattern Recognition (CVPR). (pp. II–195). +Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., De- Yin, Z., & Shi, J. (2018). Geonet: Unsupervised learning of + Vito, Z., et al. (2017). Automatic differentiation in PyTorch. dense depth, optical flow and camera pose. In IEEE Con- + In NIPS Autodiff Workshop. ference on Computer Vision and Pattern Recognition (CVPR). +Ranjan, A., Jampani, V., Balles, L., Kim, K., Sun, D., Wulff, J., (pp. 1983–1992). + et al. (2019). Competitive collaboration: Joint unsupervised + learning of depth, camera motion, optical flow and motion Zhan, H., Garg, R., Weerasekera, C. S., Li, K., Agarwal, H., & + segmentation. In IEEE Conference on Computer Vision and Reid, I. (2018). Unsupervised learning of monocular depth + Pattern Recognition (CVPR). (pp. 12240–12249). estimation and visual odometry with deep feature reconstruc- +Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: tion. In IEEE Conference on Computer Vision and Pattern + Convolutional networks for biomedical image segmentation. Recognition (CVPR). IEEE, (pp. 340–349). + In International Conference on Medical Image Computing and + Computer-Assisted Intervention. Springer, (pp. 234–241). Zhan, H., Weerasekera, C. S., Bian, J., & Reid, I. (2020). Visual +Rublee, E., Rabaud, V., Konolige, K., & Bradski, G. (2011). odometry revisited: What should be learnt? Robotics and + Orb: An efficient alternative to sift or surf. In Computer Automation (ICRA), 2020 IEEE International Conference on. + Vision (ICCV), 2011 IEEE international conference on. IEEE, + (pp. 2564–2571). Zhan, H., Weerasekera, C. 
S., Garg, R., & Reid, I. D. (2019). +Scaramuzza, D., & Fraundorfer, F. (2011). Visual odometry: Self-supervised learning for single view depth and surface + Part i: The first 30 years and fundamentals. IEEE Robotics normal estimation. In IEEE International Conference on + & Automation Magazine, 18 (4), pp. 80–92. Robotics and Automation (ICRA). (pp. 4811–4817). +Sun, D., Yang, X., Liu, M.-Y., & Kautz, J. (2018). Pwc-net: + Cnns for optical flow using pyramid, warping, and cost vol- Zhang, J., Henein, M., Mahony, R., & Ila, V. (2020). Vdo-slam: + ume. In IEEE Conference on Computer Vision and Pattern A visual dynamic object-aware slam system. arXiv preprint + Recognition (CVPR). (pp. 8934–8943). arXiv:2005.11052. +Tang, J., Ambrus, R., Guizilini, V., Pillai, S., Kim, H., & + Gaidon, A. (2019). Self-Supervised 3D Keypoint Learning Zhang, Z. (1998). Determining the epipolar geometry and its + for Ego-motion Estimation. 1912.03426. uncertainty: A review. International Journal on Computer Vi- + sion (IJCV), 27 (2), pp. 161–195. + + Zhou, D., Dai, Y., & Li, H. (2019). Ground-plane-based abso- + lute scale estimation for monocular visual odometry. IEEE + Transactions on Intelligent Transportation Systems. + + Zhou, H., Ummenhofer, B., & Brox, T. (2018). Deeptam: Deep + tracking and mapping. arXiv preprint arXiv:1808.01900. + + Zhou, T., Brown, M., Snavely, N., & Lowe, D. G. (2017). Un- + supervised learning of depth and ego-motion from video. In + IEEE Conference on Computer Vision and Pattern Recognition + (CVPR). 
+ diff --git a/动态slam/2020年-2022年开源动态SLAM/2021年/MonoRec_Semi-Supervised_Dense_Reconstruction_in_Dynamic_Environments_from_a_Single_Moving_Camera.pdf b/动态slam/2020年-2022年开源动态SLAM/2021年/MonoRec_Semi-Supervised_Dense_Reconstruction_in_Dynamic_Environments_from_a_Single_Moving_Camera.pdf new file mode 100644 index 0000000..359a826 --- /dev/null +++ b/动态slam/2020年-2022年开源动态SLAM/2021年/MonoRec_Semi-Supervised_Dense_Reconstruction_in_Dynamic_Environments_from_a_Single_Moving_Camera.pdf @@ -0,0 +1,679 @@ + 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) + +2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | 978-1-6654-4509-2/21/$31.00 ©2021 IEEE | DOI: 10.1109/CVPR46437.2021.00605 MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments + from a Single Moving Camera + + Felix Wimbauer1, Nan Yang1,2, Lukas von Stumberg1 Niclas Zeller1,2 Daniel Cremers1,2 + 1 Technical University of Munich, 2 Artisense + + {wimbauer, yangn, stumberg, zellern, cremers}@in.tum.de + + Abstract + + In this paper, we propose MonoRec, a semi-supervised Figure 1: MonoRec can deliver high-quality dense recon- + monocular dense reconstruction architecture that predicts struction from a single moving camera. The figure shows + depth maps from a single moving camera in dynamic en- an example of a large-scale outdoor point cloud reconstruc- + vironments. MonoRec is based on a multi-view stereo set- tion (KITTI Odometry sequence 07) by simply accumulat- + ting which encodes the information of multiple consecutive ing predicted depth maps. Please refer to our project page + images in a cost volume. To deal with dynamic objects in for the video of the entire reconstruction of the sequence. + the scene, we introduce a MaskModule that predicts mov- + ing object masks by leveraging the photometric inconsisten- creasing demand of reducing the total number of sensors. + cies encoded in the cost volumes. 
Unlike other multi-view Over the past years, researchers have therefore put a lot of + stereo methods, MonoRec is able to reconstruct both static effort into solving the problem of perception with only a sin- + and moving objects by leveraging the predicted masks. Fur- gle monocular camera. Considering recent achievements in + thermore, we present a novel multi-stage training scheme monocular visual odometry (VO) [8, 58, 51], with respect to + with a semi-supervised loss formulation that does not re- ego-motion estimation, this was certainly successful. Nev- + quire LiDAR depth values. We carefully evaluate MonoRec ertheless, reliable dense 3D mapping of the static environ- + on the KITTI dataset and show that it achieves state-of-the- ment and moving objects is still an open research topic. + art performance compared to both multi-view and single- + view methods. With the model trained on KITTI, we further- To tackle the problem of dense 3D reconstruction based + more demonstrate that MonoRec is able to generalize well on a single moving camera, there are basically two paral- + to both the Oxford RobotCar dataset and the more chal- + lenging TUM-Mono dataset recorded by a handheld cam- + era. Code and related materials are available at https: + //vision.in.tum.de/research/monorec. + + 1. Introduction + + 1.1. Real-world Scene Capture from Video + + Obtaining a 3D understanding of the entire static and dy- + namic environment can be seen as one of the key-challenges + in robotics, AR/VR, and autonomous driving. State of to- + day, this is achieved based on the fusion of multiple sen- + sor sources (incl. cameras, LiDARs, RADARs and IMUs). + This guarantees dense coverage of the vehicle’s surround- + ings and accurate ego-motion estimation. However, driven + by the high cost as well as the challenge to maintain cross- + calibration of such a complex sensor suite, there is an in- + + Indicates equal contribution. 
978-1-6654-4509-2/21/$31.00 ©2021 IEEE    DOI 10.1109/CVPR46437.2021.00605

lel lines of research. On one side, there are dense multi-view stereo (MVS) methods, which evolved over the last decade [39, 45, 2] and saw a great improvement through the use of convolutional neural networks (CNNs) [23, 61, 57]. On the other side, there are monocular depth prediction methods which purely rely on deep learning [7, 16, 58]. Though all these methods show impressive performance, both types also have their respective shortcomings. For MVS the overall assumption is a stationary environment to be reconstructed, so the presence of dynamic objects deteriorates their performance. Monocular depth prediction methods, in contrast, perform very well in reconstructing moving objects, as predictions are made only based on individual images. At the same time, due to their use of a single image only, they strongly rely on the perspective appearance of objects as observed with specific camera intrinsics and extrinsics and therefore do not generalize well to other datasets.

1.2. Contribution

To combine the advantages of both deep MVS and monocular depth prediction, we propose MonoRec, a novel monocular dense reconstruction architecture that consists of a MaskModule and a DepthModule. We encode the information from multiple consecutive images using cost volumes which are constructed based on the structural similarity index measure (SSIM) [54] instead of the sum of absolute differences (SAD) like prior works. The MaskModule is able to identify moving pixels and downweights the corresponding voxels in the cost volume. Thereby, in contrast to other MVS methods, MonoRec does not suffer from artifacts on moving objects and therefore delivers depth estimations on both static and dynamic objects.

With the proposed multi-stage training scheme, MonoRec achieves state-of-the-art performance compared to other MVS and monocular depth prediction methods on the KITTI dataset [14]. Furthermore, we validate the generalization capabilities of our network on the Oxford RobotCar dataset [35] and the TUM-Mono dataset [9]. Figure 1 shows a dense point cloud reconstructed by our method on one of our test sequences of KITTI.

2. Related Work

2.1. Multi-view Stereo

Multi-view stereo (MVS) methods estimate a dense representation of the 3D environment based on a set of images with known poses. Over the past years, several methods have been developed to solve the MVS problem [46, 28, 30, 2, 47, 49, 39, 13, 45, 60] based on classical optimization. Recently, due to the advance of deep neural networks (DNNs), different learning-based approaches were proposed. This representation can be volumetric [26, 27, 36] or 3D point cloud based [3, 12]. Most popular are still depth map representations predicted from a 3D cost volume [23, 53, 61, 66, 22, 56, 41, 24, 33, 62, 19, 64, 57]. Huang et al. [23] proposed one of the first cost-volume based approaches. They compute a set of image-pair-wise plane-sweep volumes with respect to a reference image and use a CNN to predict one single depth map based on this set. Zhou et al. [66] also use the photometric cost volumes as the inputs of the deep neural networks and employ a two-stage approach for dense depth prediction. Yao et al. [61] instead calculate a single cost volume using deep features of all input images.

2.2. Dense Depth Estimation in Dynamic Scenes

Reconstructing dynamic scenes is challenging since the moving objects violate the static-world assumption of classical multi-view stereo methods. Russell et al. [43] and Ranftl et al. [40] build on motion segmentation and perform classical optimization. Li et al. [32] proposed to estimate dense depth maps from scenes with moving people. All these methods need additional inputs, e.g., optical flow, object masks, etc., for the inference, while MonoRec requires only the posed images as inputs. Another line of research is monocular depth estimation [7, 6, 29, 31, 11, 59, 16, 48, 67, 63, 65, 52, 18, 17, 58]. These methods are not affected by moving objects, but the depth estimation is not necessarily accurate, especially in unseen scenarios. Luo et al. [34] proposed a test-time optimization method which is not real-time capable. In a concurrent work, Watson et al. [55] address moving objects with the consistency between monocular depth estimation and multi-view stereo, while MonoRec predicts the dynamic masks explicitly by the proposed MaskModule.

2.3. Dense SLAM

Several of the methods cited above solve both the problem of dense 3D reconstruction and camera pose estimation [48, 67, 63, 65, 66, 59, 58]. Nevertheless, these methods either solve both problems independently or only integrate one into the other (e.g. [66, 58]). Newcombe et al. [37] instead jointly optimize the 6DoF camera pose and the dense 3D scene structure. However, due to its volumetric map representation it is only applicable to small-scale scenes. Recently, Bloesch et al. [1] proposed a learned code representation which can be optimized jointly with the 6DoF camera poses. This idea is pursued by Czarnowski et al. [5] and integrated into a full SLAM system. All the above-mentioned methods, however, do not address the issue of moving objects. Instead, the proposed MonoRec network explicitly deals with moving objects and achieves superior accuracy both on moving and on static structures. Furthermore, prior works show that the accuracy of camera tracking does not necessarily improve with more points [8, 10]. MonoRec therefore focuses solely on delivering dense reconstruction using poses from a sparse VO system and shows state-of-the-art results on public benchmarks. Note that, this way, MonoRec can be easily combined with any VO system with arbitrary sensor setups.

Figure 2: MonoRec Architecture: It first constructs a photometric cost volume from multiple input frames. Unlike prior works, we use the SSIM [54] metric instead of SAD to measure the photometric consistency. The MaskModule aims to detect inconsistencies between the different input frames to determine moving objects. The multi-frame cost volume C is multiplied with the predicted mask and then passed to the DepthModule which predicts a dense inverse depth map. In both the decoders of MaskModule and DepthModule, the cost volume features are concatenated with pre-trained ResNet-18 features.
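The data flow summarized in the Figure 2 caption (cost volume → predicted mask → masked cost volume → depth) can be illustrated with a small sketch. This is not the authors' code: the toy `decode_depth` argmax stands in for the real DepthModule, the shapes are illustrative, and the mask is applied as (1 − M_t) so that moving-object regions retain no strong maxima, matching the effect the caption describes.

```python
import numpy as np

def mask_cost_volume(C, mask):
    """Down-weight every depth slice of the cost volume C (shape (M, H, W))
    by the per-pixel moving-object probability mask (shape (H, W))."""
    return C * (1.0 - mask)[None, :, :]

def decode_depth(C, depths):
    """Toy stand-in for DepthModule: per pixel, pick the depth step with
    the highest photometric consistency."""
    return depths[np.argmax(C, axis=0)]

M, H, W = 8, 4, 4
depths = np.linspace(5.0, 40.0, M)          # hypothetical depth steps
rng = np.random.default_rng(0)
C = rng.uniform(-1.0, 1.0, size=(M, H, W))  # C(x, d) lies in [-1, 1]
mask = np.zeros((H, W))
mask[1, 1] = 1.0                            # one pixel flagged as moving
C_masked = mask_cost_volume(C, mask)
depth_map = decode_depth(C_masked, depths)
```

After masking, the flagged pixel carries a flat (all-zero) consistency profile, so a real DepthModule has to infer its depth from image features and surrounding context instead.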
3. The MonoRec Network

MonoRec uses a set of consecutive frames and the corresponding camera poses to predict a dense depth map for the given keyframe. The MonoRec architecture combines a MaskModule and a DepthModule. MaskModule predicts moving object masks that improve depth accuracy and allow us to eliminate noise in 3D reconstructions. DepthModule predicts a depth map from the masked cost volume. In this section, we first describe the different modules of our architecture, and then discuss the specialized multi-stage semi-supervised training scheme.

3.1. Preliminaries

Our method aims to predict a dense inverse depth map D_t of the selected keyframe from a set of consecutive frames {I_1, ..., I_N}. We denote the selected keyframe as I_t and the others as I_{t'} (t' ∈ {1, ..., N} \ {t}). Given the camera intrinsics, the inverse depth map D_t, and the relative camera pose T_{t'}^t ∈ SE(3) between I_t and I_{t'}, we can perform the reprojection from I_{t'} to I_t as

    I_{t'}^t = I_{t'} ⟨ proj(D_t, T_{t'}^t) ⟩,    (1)

where proj() is the projection function and ⟨·⟩ is the differentiable sampler [25]. This reprojection formulation is important for both the cost volume formation (Sec. 3.2) and the self-supervised loss term (Sec. 3.4).

In the following, we refer to the consecutive frames as temporal stereo (T) frames. During training, we use an additional static stereo (S) frame I_{t_S} for each sample, which was captured by a synchronized stereo camera at the same time as the respective keyframe. In the following sections, we denote cost volumes calculated based on the keyframe I_t and only one non-keyframe I_{t'} by C_{t'}(x, d) where applicable.

3.2. Cost Volume

A cost volume encodes geometric information from the different frames in a tensor that is suited as input for neural networks. For a number of discrete depth steps, the temporal stereo frames are reprojected to the keyframe and a pixel-wise photometric error is computed. Ideally, the lower the photometric error, the better the depth step approximates the real depth at a given pixel. Our cost volume follows the general formulation of the prior works [37, 66]. Nevertheless, unlike the previous works that define the photometric error pe() as a patch-wise SAD, we propose to use the SSIM as follows:

    pe(x, d) = (1 − SSIM(I_{t'}^t(x, d), I_t(x))) / 2    (2)

with 3 × 3 patch size. Here I_{t'}^t(x, d) defines the intensity at pixel x of the image I_{t'} warped with constant depth d. In practice, we clamp the error to [0, 1]. The cost volume C stores at C(x, d) the aggregated photometric consistency for pixel x and depth d:

    C(x, d) = 1 − 2 · (1 / Σ_{t'} ω_{t'}(x)) · Σ_{t'} pe_{t'}(x, d) · ω_{t'}(x),    (3)

where d ∈ {d_i | d_i = d_min + (i / M) · (d_max − d_min)} ranges over the M discrete depth steps. The weighting term ω_{t'}(x) assigns a high weight to pixels where the optimal depth step yields a distinctly lower photometric error, while pixels with an ambiguous error profile are weighted lower:

    ω_{t'}(x) = 1 − (1 / (M − 1)) · Σ_{d ≠ d*} exp(−α · (pe_{t'}(x, d) − pe_{t'}(x, d*))²)    (4)

with d*_{t'} = argmin_d pe_{t'}(x, d). Note that C(x, d) has the range [−1, 1], where −1/1 indicates the lowest/highest photometric consistency.
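A minimal NumPy sketch of Eqs. (2)-(4), under simplifying assumptions: a crude 3×3 box-filtered SSIM, and photometric errors `pe_frames[t']` of shape (M, H, W) assumed precomputed per non-keyframe from images already warped at each depth step (the real system warps with poses and a differentiable sampler). The small epsilon guarding the normalization of Eq. (3) is our addition for the all-ambiguous case where every weight is zero.

```python
import numpy as np

def box3(img):
    """3x3 box filter with edge padding (stand-in for the SSIM window)."""
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified per-pixel SSIM over 3x3 neighborhoods."""
    mu_a, mu_b = box3(a), box3(b)
    var_a = box3(a * a) - mu_a ** 2
    var_b = box3(b * b) - mu_b ** 2
    cov = box3(a * b) - mu_a * mu_b
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def photometric_error(warped, key):
    """Eq. (2): pe = (1 - SSIM) / 2, clamped to [0, 1]."""
    return np.clip((1.0 - ssim(warped, key)) / 2.0, 0.0, 1.0)

def frame_weight(pe, alpha=10.0):
    """Eq. (4): weight is high where the best depth step is distinct."""
    m = pe.shape[0]
    best = np.min(pe, axis=0)  # pe at d* = argmin_d pe
    # sum over d != d*: subtract the d = d* term, whose exp(0) equals 1
    return 1.0 - (np.exp(-alpha * (pe - best) ** 2).sum(axis=0) - 1.0) / (m - 1)

def cost_volume(pe_frames, alpha=10.0, eps=1e-8):
    """Eq. (3): weighted aggregation over frames t', mapped to [-1, 1]."""
    w = np.stack([frame_weight(pe, alpha) for pe in pe_frames])  # (T, H, W)
    pe = np.stack(pe_frames)                                     # (T, M, H, W)
    agg = (pe * w[:, None]).sum(axis=0) / (w.sum(axis=0)[None] + eps)
    return 1.0 - 2.0 * agg
```

With a single non-keyframe the weights cancel (up to the epsilon guard) and Eq. (3) reduces to C = 1 − 2·pe; the weighting only matters when aggregating several frames.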
3.3. Network Architecture

As shown in Figure 2, the proposed network architecture contains two sub-modules, namely, MaskModule and DepthModule.

Figure 3: Auxiliary Training Masks: Examples of auxiliary training masks from the training set that are used as reference.

MaskModule. MaskModule aims to predict a mask M_t where M_t(x) ∈ [0, 1] indicates the probability of a pixel x in I_t belonging to a moving object. Determining moving objects from I_t alone is an ambiguous task and hard to generalize. Therefore, we propose to use the set of cost volumes {C_{t'} | t' ∈ {1, ..., N} \ {t}}, which encode the geometric priors between I_t and {I_{t'} | t' ∈ {1, ..., N} \ {t}} respectively. We use C_{t'} instead of C since the inconsistent geometric information from the different C_{t'} is a strong prior for moving object prediction – dynamic pixels yield inconsistent optimal depth steps in the different C_{t'}. However, geometric priors alone are not enough to predict moving objects, since poorly-textured or non-Lambertian surfaces can lead to inconsistencies as well. Furthermore, for objects that move at constant speed, the cost volumes tend to reach a consensus on wrong depths that semantically do not fit into the context of the scene. Therefore, we further leverage pre-trained ResNet-18 [21] features of I_t to encode semantic priors in addition to the geometric ones. The network adopts a U-Net architecture design [42] with skip connections. All cost volumes are passed through encoders with shared weights. The features from the different cost volumes are aggregated using max-pooling and then passed through the decoder. In this way, MaskModule can be applied to different numbers of frames without retraining.

DepthModule. DepthModule predicts a dense pixel-wise inverse depth map D_t of I_t. To this end, the module receives the complete cost volume C concatenated with the keyframe I_t. Unlike MaskModule, here we use C instead of the C_{t'} since multi-frame cost volumes in general lead to higher depth accuracy and robustness against photometric noise [37]. To eliminate wrong depth predictions for moving objects, we perform a pixel-wise multiplication between M_t and the cost volume C for every depth step d. This way, no maxima (i.e. strong priors) are left in regions of moving objects, such that DepthModule has to rely on the information from the image features and the surroundings to infer the depth of moving objects. We employ a U-Net architecture with multi-scale depth outputs from the decoder [17]. Finally, DepthModule outputs an interpolation factor between d_min and d_max. In practice, we use s = 4 scales of depth prediction.

3.4. Multi-stage Training

In this section, we propose a multi-stage training scheme for the networks. Specifically, the bootstrapping stage, the MaskModule refinement stage and the DepthModule refinement stage are executed successively.

Bootstrapping. In the bootstrapping stage, MaskModule and DepthModule are trained separately. DepthModule takes the non-masked C as input and predicts D_t. The training objective of DepthModule is defined as a multi-scale (s ∈ [0, 3]) semi-supervised loss. It combines a self-supervised photometric loss and an edge-aware smoothness term, as proposed in [17], with a supervised sparse depth loss:

    L_depth = Σ_{s=0}^{3} L_{self,s} + α L_{sparse,s} + β L_{smooth,s}.    (5)

The self-supervised loss is computed from the photometric errors between the keyframe and the reprojected temporal stereo and static stereo frames:

    L_{self,s} = min_{t' ∈ {1,...,N}\{t} ∪ {t_S}} ( λ (1 − SSIM(I_{t'}^t, I_t)) / 2 + (1 − λ) ||I_{t'}^t − I_t||_1 ),    (6)

where λ = 0.85. Note that L_{self,s} takes the per-pixel minimum, which has been shown to be superior compared to the per-pixel average [17]. The sparse supervised depth loss is defined as

    L_{sparse,s} = ||D_t − D_VO||_1,    (7)

where the ground-truth sparse depth maps D_VO are obtained by a visual odometry system [59]. Note that all the supervision signals of DepthModule are generated from either the images themselves or the visual odometry system, without any manual labeling or LiDAR depth.

MaskModule is trained with the mask loss L_mask, which is the weighted binary cross entropy between the predicted mask M_t and the auxiliary ground-truth moving object mask M_aux. We generate M_aux by leveraging a pre-trained Mask-RCNN and the trained DepthModule as explained above. We firstly define the movable object classes, e.g., cars, cyclists, etc., and then obtain the instance segmentations of these object classes for the training images. A movable instance is classified as a moving instance if it has a high ratio of photometrically inconsistent pixels between temporal stereo and static stereo. Specifically, for each image, we predict its depth maps D_t and D_t^S using the cost volumes formed by the temporal stereo images C and the static stereo images C^S, respectively. Then a pixel x is regarded as a moving pixel if two of the following three metrics are above predefined thresholds: (1) the static stereo photometric error using D_t, i.e., pe_{t_S}(x, D_t(x)); (2) the average temporal stereo photometric error using D_t^S, i.e., pe_{t'}(x, D_t^S(x)); (3) the difference between D_t(x) and D_t^S(x). Please refer to our supplementary materials for more details. Figure 3 shows some examples of the generated auxiliary ground-truth moving object masks.

MaskModule Refinement. The bootstrapping stage for MaskModule is limited in two ways: (1) heavy augmentation is needed, since mostly only a very small percentage of pixels in an image belongs to moving objects; (2) the auxiliary masks are not necessarily related to the geometric prior in the cost volume, which slows down the convergence. Therefore, to improve the mask prediction, we utilize the trained DepthModule from the bootstrapping stage. We leverage the fact that the depth prediction for moving objects, and consequently the photometric consistency, should be better with a static stereo prediction than with a temporal stereo one. Therefore, similar to the classification of moving pixels as explained in the previous section, we obtain D_t^S and D_t from two forward passes using C^S and C as inputs, respectively. Then we compute the static stereo photometric error L_{self,s}^S using D_t^S as depth and the temporal stereo photometric error L_{self,s}^T using D_t as depth. To train M_t, we interpret it as pixel-wise interpolation factors between L_{self,s}^S and L_{self,s}^T, and minimize the summation:

    L_{m_ref} = Σ_{s=0}^{3} ( M_t L_{depth,s}^S + (1 − M_t) L_{depth,s}^T ) + L_mask.    (8)

Figure 4(a) shows the diagram illustrating the different loss terms. Note that we still add the supervised mask loss L_mask as a regularizer to stabilize the training. This way, the new gradients are directly related to the geometric structure in the cost volume and help to improve the mask prediction accuracy and alleviate the danger of overfitting.

DepthModule Refinement. The bootstrapping stage does not distinguish between moving and static pixels when training DepthModule. Therefore, we aim to refine DepthModule such that it is able to predict proper depths also for moving objects. The key idea is that, by utilizing M_t, only the static stereo loss is backpropagated for moving pixels, while for static pixels the temporal stereo, static stereo and sparse depth losses are backpropagated. Because moving objects make up only a small percentage of all pixels in a keyframe, the gradients from the photometric error are rather weak. To solve this, we perform a further static stereo forward pass and use the resulting depth map D_t^S as a prior for moving objects. Therefore, as shown in Figure 4(b), the loss for refining DepthModule is defined as

    L_{d_ref,s} = (1 − M_t)(L_{self,s} + α L_{sparse,s}) + M_t (L_{self,s}^S + γ ||D_t − D_t^S||_1) + β L_{smooth,s}.    (9)

Figure 4: Refinement Losses: a) MaskModule refinement and b) DepthModule refinement loss functions. Dashed outlines denote that no gradient is being computed for the respective forward pass in the module.

3.4.1 Implementation Details

The networks are implemented in PyTorch [38] with image size 512 × 256. For the bootstrapping stage, we train DepthModule for 70 epochs with learning rate lr = 1e−4 for the first 65 epochs and lr = 1e−5 for the remaining ones. MaskModule is trained for 60 epochs with lr = 1e−4. During MaskModule refinement, we train for 32 epochs with lr = 1e−4, and during DepthModule refinement we train for 15 epochs with lr = 1e−4 and another 4 epochs at lr = 1e−5. The hyperparameters α, β and γ are set to 4, 10⁻³ × 2⁻ˢ and 4, respectively. For inference, MonoRec can achieve 10 fps with batch size 1 using 2 GB of memory.

4. Experiments

To evaluate the proposed method, we first compare against state-of-the-art monocular depth prediction and MVS methods with our train/test split of the KITTI dataset [15]. Then, we perform extensive ablation studies to show the efficacy of our design choices. In the end, we demonstrate the generalization capabilities of the different methods on Oxford RobotCar [35] and TUM-Mono [9] using the model trained on KITTI.

Figure 5: Qualitative Results on KITTI: The upper part of the figure shows the results for a selected number of frames from the KITTI test set. The compared PackNet model was trained in a semi-supervised fashion using LiDAR as the ground truth. Besides the depth maps, we also show the 3D point clouds obtained by reprojecting the depth and viewing from two different perspectives. For comparison we show the LiDAR ground truth from the corresponding perspectives. Our method clearly shows the best prediction quality. The lower part of the figure shows large-scale reconstructions as point clouds accumulated from multiple frames. The red insets depict the reconstructed artifacts from moving objects. With the proposed MaskModule, we can effectively filter out the moving objects to avoid those artifacts in the final reconstruction.
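The quantitative comparisons below report the standard monocular-depth error metrics (Abs Rel, Sq Rel, RMSE, RMSE log, and the threshold accuracies δ < 1.25^k). For reference, a typical implementation is sketched here, following the common KITTI evaluation conventions (valid-pixel masking and an 80 m depth cap, as mentioned in Sec. 4.1); exact masking details vary between papers.

```python
import numpy as np

def depth_metrics(pred, gt, cap=80.0):
    """Abs Rel, Sq Rel, RMSE, RMSE log and threshold accuracies
    (delta < 1.25^k) over pixels with valid ground truth."""
    valid = (gt > 0) & (gt <= cap)
    p = np.clip(pred[valid], 1e-3, cap)
    g = gt[valid]
    thresh = np.maximum(p / g, g / p)
    return {
        "abs_rel":  np.mean(np.abs(p - g) / g),
        "sq_rel":   np.mean((p - g) ** 2 / g),
        "rmse":     np.sqrt(np.mean((p - g) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2)),
        "d1": np.mean(thresh < 1.25),
        "d2": np.mean(thresh < 1.25 ** 2),
        "d3": np.mean(thresh < 1.25 ** 3),
    }
```

Note that the threshold accuracies are symmetric in prediction and ground truth, so a uniform 30% over-prediction fails δ < 1.25 but still passes δ < 1.25².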
| Method                    | Training | Dataset        | Input | Abs Rel | Sq Rel | RMSE  | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³ |
|---------------------------|----------|----------------|-------|---------|--------|-------|----------|--------|---------|---------|
| Colmap [44] (geometric)   | –        | –              | KF+2  | 0.099   | 3.451  | 5.632 | 0.184    | 0.952  | 0.979   | 0.986   |
| Colmap [44] (photometric) | –        | –              | KF+2  | 0.190   | 6.826  | 7.781 | 0.531    | 0.893  | 0.932   | 0.947   |
| Monodepth2 [17]           | MS       | Eigen Split    | KF    | 0.082   | 0.405  | 3.129 | 0.127    | 0.931  | 0.985   | 0.996   |
| PackNet [20]              | MS       | CS+Eigen Split | KF    | 0.080   | 0.331  | 2.914 | 0.124    | 0.929  | 0.987   | 0.997   |
| PackNet [20]              | MS, D    | CS+Eigen Split | KF    | 0.077   | 0.290  | 2.688 | 0.118    | 0.935  | 0.988   | 0.997   |
| DORN [11]                 | D        | Eigen Split    | KF    | 0.077   | 0.290  | 2.723 | 0.113    | 0.949  | 0.988   | 0.996   |
| DeepMVS [23]              | D        | Eigen Split    | KF+2  | 0.103   | 1.160  | 3.968 | 0.166    | 0.896  | 0.947   | 0.978   |
| DeepMVS [23] (pretr.)     | D        | Odom. Split    | KF+2  | 0.088   | 0.644  | 3.191 | 0.146    | 0.914  | 0.955   | 0.982   |
| DeepTAM [66] (only FB)    | MS, D*   | Odom. Split    | KF+2  | 0.059   | 0.474  | 2.769 | 0.096    | 0.964  | 0.987   | 0.994   |
| DeepTAM [66] (1x Ref.)    | MS, D*   | Odom. Split    | KF+2  | 0.053   | 0.351  | 2.480 | 0.089    | 0.971  | 0.990   | 0.995   |
| MonoRec                   | MS, D*   | Odom. Split    | KF+2  | 0.050   | 0.295  | 2.266 | 0.082    | 0.973  | 0.991   | 0.996   |

Table 1: Quantitative Results on KITTI: Comparison between MonoRec and other methods on our KITTI test set. The Dataset column shows the training dataset used by the corresponding method; please note that the Eigen split is a superset of our odometry split. Best / second best results are marked bold / underlined. The evaluation shows that our method achieves overall the best performance. Legend: M: Monocular images, S: Stereo images, D: GT depth, D*: Depths from DVSO, KF: Keyframe, KF + 2: Keyframe + 2 mono frames, CS: Cityscapes [4], pretr.: Pretrained network, FB: Fixed band module of DeepTAM, Ref.: Narrow band refinement module of DeepTAM.

Figure 6: Qualitative Improvement: Effects of cost volume masking and depth refinement. (a) Keyframe, (b) w/o MaskModule, (c) MaskModule, (d) MaskModule + D. Ref.

4.1. The KITTI Dataset

The Eigen split [6] is the most popular training/test split for evaluating depth estimation on KITTI. We cannot make use of it directly, since MonoRec requires temporally continuous images with estimated poses. Hence, we select our training/testing splits as the intersection between the KITTI Odometry benchmark and the Eigen split, which results in 13714/8634 samples for training/testing. We obtain the relative poses between the images from the monocular VO system DVSO [59]. During training, we also leverage the point clouds generated by DVSO as the sparse depth supervision signals. For training MaskModule we only use images that contain moving objects in the generated auxiliary masks, 2412 in total. For all the following evaluation results we use the improved ground truth [50] and cap depths at 80 m.

We first compare our method against the recent state of the art, including an optimization-based method (Colmap), self-supervised monocular methods (MonoDepth2 and PackNet), a semi-supervised monocular method using sparse LiDAR data (PackNet), a supervised monocular method (DORN) and MVS methods (DeepMVS and DeepTAM), shown in Table 1. Note that the training code of DeepTAM was not published; we therefore implemented it ourselves for training and testing using our split to deliver a fair comparison. Our method outperforms all the other methods by a notable margin despite relying on images only, without using LiDAR ground truth for training.

This is also clearly reflected in the qualitative results shown in Figure 5. Compared with the monocular depth estimation methods, our method delivers very sharp edges in the depth maps and can recover finer details. In comparison to the other MVS methods, it can better deal with moving objects, which is further illustrated in Figure 7.

A single depth map usually cannot really reflect the quality of a large-scale reconstruction. We therefore also visualize the accumulated points using the depth maps from multiple frames in the lower part of Figure 5. We can see that our method can deliver very high quality reconstructions and, due to our MaskModule, is able to remove the artifacts caused by moving objects. We urge readers to watch the supplementary video for more convincing comparisons.

Ablation Studies. We also investigated the contribution of the different components towards the method's performance. Table 2 shows the quantitative results of our ablation studies, which confirm that all our proposed contributions improve the depth prediction over the baseline method. Furthermore, Figure 6 demonstrates the qualitative improvement achieved by MaskModule and the refinement training.

| Model    | SSIM | MaskModule | D. Ref. | M. Ref. | Abs Rel | Sq Rel | RMSE  | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³ |
|----------|------|------------|---------|---------|---------|--------|-------|----------|--------|---------|---------|
| Baseline |      |            |         |         | 0.056   | 0.342  | 2.624 | 0.092    | 0.965  | 0.990   | 0.994   |
| Baseline | ✓    |            |         |         | 0.054   | 0.346  | 2.444 | 0.088    | 0.970  | 0.989   | 0.995   |
| MonoRec  | ✓    | ✓          |         |         | 0.054   | 0.306  | 2.372 | 0.087    | 0.970  | 0.990   | 0.995   |
| MonoRec  | ✓    | ✓          | ✓       |         | 0.051   | 0.346  | 2.361 | 0.085    | 0.972  | 0.990   | 0.995   |
| MonoRec  | ✓    | ✓          |         | ✓       | 0.052   | 0.302  | 2.303 | 0.087    | 0.969  | 0.990   | 0.995   |
| MonoRec  | ✓    | ✓          | ✓       | ✓       | 0.050   | 0.295  | 2.266 | 0.082    | 0.973  | 0.991   | 0.996   |

Table 2: Ablation Study: The baseline consists of only DepthModule using the unmasked cost volume (CV). The baseline without SSIM uses a 5×5 patch that has the same receptive field as SSIM. Using SSIM to form the CV gives a significant improvement. For MonoRec, only the addition of MaskModule without refinement does not yield significant improvements. The DepthModule refinement gives a major improvement. The best performance is achieved by combining all the proposed components.

Figure 7: Comparison on Moving Objects Depth Estimation: In comparison to other MVS methods, MonoRec is able to predict plausible depths. Furthermore, the depth prediction has less noise and artifacts in static regions of the scene.

4.2. Oxford RobotCar and TUM-Mono

To demonstrate the generalization capabilities of MonoRec, we test our KITTI model on the Oxford RobotCar dataset and the TUM-Mono dataset. Oxford RobotCar is a street view dataset and shows a similar motion pattern and view perspective to KITTI. TUM-Mono, however, is recorded by a handheld monochrome camera, so it demonstrates very different motion and image quality compared to KITTI. The results are shown in Figure 8. The monocular methods struggle to generalize to a new context. The compared MVS methods show more artifacts and cannot predict plausible depths for the moving objects. In contrast, our method is able to generalize well to the new scenes for both depth and moving object predictions. Since Oxford RobotCar also provides LiDAR depth data, we further show a quantitative evaluation in the supplementary material.

Figure 8: Oxford RobotCar and TUM-Mono: All results are obtained by the respective best-performing variant in Table 1. MonoRec shows stronger generalization capability than the monocular methods. Compared to DeepMVS and DeepTAM, MonoRec delivers depth maps with fewer artifacts and additionally predicts the moving object masks.
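The accumulated point clouds of the kind shown in Figure 5 can be produced by back-projecting each keyframe's depth map with the camera intrinsics and dropping pixels flagged by the moving-object mask. This is an illustrative sketch with hypothetical intrinsics; in the full pipeline the per-frame points would additionally be transformed into a common world frame using the camera poses.

```python
import numpy as np

def backproject(depth, K, moving_mask=None):
    """Back-project a depth map (H, W) into 3D camera-frame points using
    intrinsics K (3x3); optionally drop pixels flagged as moving."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))           # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    rays = np.linalg.inv(K) @ pix                            # rays at depth 1
    pts = (rays * depth.reshape(-1)).T                       # (H*W, 3)
    if moving_mask is None:
        return pts
    return pts[~moving_mask.reshape(-1)]

K = np.array([[100.0,   0.0, 2.0],
              [  0.0, 100.0, 2.0],
              [  0.0,   0.0, 1.0]])  # hypothetical intrinsics
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1, 1] = True                    # drop one "moving" pixel
pts = backproject(depth, K, mask)
```

The pixel at the principal point (u = v = 2) maps to the point (0, 0, 2), i.e. straight ahead at its depth, and the masked pixel contributes no point to the cloud.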
5. Conclusion

We have presented MonoRec, a deep architecture that estimates accurate dense 3D reconstructions from only a single moving camera. We first propose to use SSIM as the photometric measurement to construct the cost volumes. To deal with dynamic objects, we propose a novel MaskModule which predicts moving object masks from the input cost volumes. With the predicted masks, the proposed DepthModule is able to estimate accurate depths for both static and dynamic objects. Additionally, we propose a novel multi-stage training scheme together with a semi-supervised loss formulation for training the depth prediction. All combined, MonoRec is able to outperform state-of-the-art MVS and monocular depth prediction methods both qualitatively and quantitatively on KITTI, and also shows strong generalization capability on Oxford RobotCar and TUM-Mono. We believe that this capacity to recover accurate dense 3D reconstructions from a single moving camera will help to establish the camera as the lead sensor for autonomous systems.

Acknowledgement. This work was supported by the Munich Center for Machine Learning and by the ERC Advanced Grant SIMULACRON.

References

[1] M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. J. Davison. CodeSLAM – learning a compact, optimisable representation for dense visual SLAM. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2560–2568, 2018.
[2] Neill D. F. Campbell, George Vogiatzis, Carlos Hernández, and Roberto Cipolla. Using multiple hypotheses to improve depth-maps for multi-view stereo. In European Conference on Computer Vision (ECCV), pages 766–779, 2008.
[3] Rui Chen, Songfang Han, Jing Xu, and Hao Su. Point-based multi-view stereo network. In International Conference on Computer Vision (ICCV), 2019.
[4] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016.
[5] Jan Czarnowski, Tristan Laidlow, Ronald Clark, and Andrew J. Davison. DeepFactors: Real-time probabilistic dense monocular SLAM. IEEE Robotics and Automation Letters (RA-L), 5(2):721–728, 2020.
[6] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In International Conference on Computer Vision (ICCV), pages 2650–2658, 2015.
[7] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Neural Information Processing Systems (NIPS), 2014.
[8] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 40(3):611–625, 2018.
[9] Jakob Engel, Vladyslav Usenko, and Daniel Cremers. A photometrically calibrated benchmark for monocular visual odometry. In arXiv, July 2016.
[10] Alejandro Fontan, Javier Civera, and Rudolph Triebel. Information-driven direct RGB-D odometry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4929–4937, 2020.
[11] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2002–2011, 2018.
[12] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pages 1362–1376, 2010.
[13] Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In International Conference on Computer Vision (ICCV), 2015.
[14] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), pages 1229–1235, 2013.
[15] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361. IEEE, 2012.
[16] Clement Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[17] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J. Brostow. Digging into self-supervised monocular depth estimation. In International Conference on Computer Vision (ICCV), pages 3828–3838, 2019.
[18] Ariel Gordon, Hanhan Li, Rico Jonschkowski, and Anelia Angelova. Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In International Conference on Computer Vision (ICCV), 2019.
[19] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[20] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3D packing for self-supervised monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2485–2494, 2020.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[22] Yuxin Hou, Juho Kannala, and Arno Solin. Multi-view stereo by temporal nonparametric fusion. In International Conference on Computer Vision (ICCV), 2019.
[23] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. DeepMVS: Learning multi-view stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2821–2830, 2018.
[24] Sunghoon Im, Hae-Gon Jeon, Stephen Lin, and In So Kweon. DPSNet: End-to-end deep plane sweep stereo. In International Conference on Learning Representations (ICLR), 2019.
[25] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In Neural Information Processing Systems (NIPS), pages 2017–2025, 2015.
[26] Mengqi Ji, Jürgen Gall, Haitian Zheng, Yebin Liu, and Lu Fang. SurfaceNet: An end-to-end 3D neural network for multiview stereopsis. In International Conference on Computer Vision (ICCV), pages 2326–2334, 2017.
[27] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In Neural Information Processing Systems (NIPS), pages 364–375, 2017.
[28] Kiriakos N. Kutulakos and Steven M. Seitz. A theory of shape by space carving. In International Conference on Computer Vision (ICCV), 1999.
[29] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In International Conference on 3D Vision (3DV), 2016.
[30] Maxime Lhuillier and Long Quan. A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pages 418–433, 2005.
[31] Bo Li, Chunhua Shen, Yuchao Dai, Anton van den Hengel, and Mingyi He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[32] Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Noah Snavely, Ce Liu, and William T. Freeman. Learning the depths of moving people by watching frozen people. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4521–4530, 2019.
[33] Keyang Luo, Tao Guan, Lili Ju, Haipeng Huang, and Yawei Luo. P-MVSNet: Learning patch-wise matching confidence aggregation for multi-view stereo. In International Conference on Computer Vision (ICCV), 2019.
[34] Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation. ACM Transactions on Graphics, 39(4), 2020.
[35] Will Maddern, Geoff Pascoe, Chris Linegar, and Paul Newman. 1 year, 1000 km: The Oxford RobotCar dataset. International Journal of Robotics Research (IJRR), 36(1):3–15, 2017.
[36] Zak Murez, Tarrence van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-end 3D scene reconstruction from posed images. In European Conference on Computer Vision (ECCV), 2020.
[37] Richard A. Newcombe, Steven J. Lovegrove, and Andrew J. Davison. DTAM: Dense tracking and mapping in real-time. In International Conference on Computer Vision (ICCV), 2011.
[38] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library.
[42] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 234–241. Springer, 2015.
[43] Chris Russell, Rui Yu, and Lourdes Agapito. Video pop-up: Monocular 3D reconstruction of dynamic scenes. In European Conference on Computer Vision (ECCV), pages 583–598. Springer, 2014.
[44] Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016.
[45] Johannes L. Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), pages 501–518, 2016.
[46] Steven M. Seitz and Charles R. Dyer. Photorealistic scene reconstruction by voxel coloring. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1997.
[47] Jan Stühmer, Stefan Gumhold, and Daniel Cremers. Real-time dense geometry from a handheld camera. In DAGM Conference on Pattern Recognition, pages 11–20, 2010.
[48] Keisuke Tateno, Federico Tombari, Iro Laina, and Nassir Navab. CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[49] Engin Tola, Christoph Strecha, and Pascal Fua. Efficient large-scale multi-view stereo for ultra high-resolution image sets. Machine Vision and Applications (MVA), pages 903–920, 2011.
[50] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant CNNs. In International Conference on 3D Vision (3DV), pages 11–20. IEEE, 2017.
[51] Vladyslav Usenko, Nikolaus Demmel, David Schubert, Jörg Stückler, and Daniel Cremers. Visual-inertial mapping with non-linear factor recovery. IEEE Robotics and Automation Letters (RA-L), 5(2):422–429, 2020.
[52] Chaoyang Wang, Jose Miguel Buenaposada, Rui Zhu, and Simon Lucey. Learning depth from monocular videos using direct methods. In IEEE Conference on Computer Vision and
In Pattern Recognition (CVPR), 2018. + Advances in neural information processing systems, pages + 8026–8037, 2019. [53] Kaixuan Wang and Shaojie Shen. MVDepthNet: Real-time + multiview depth estimation neural network. In International +[39] Matia Pizzoli, Christian Forster, and Davide Scaramuzza. Conference on 3D Vision (3DV), 2018. + REMODE: Probabilistic, monocular dense reconstruction in + real time. In IEEE International Conference on Robotics and [54] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si- + Automation (ICRA), 2014. moncelli. Image quality assessment: from error visibility to + structural similarity. IEEE transactions on image processing, +[40] Rene Ranftl, Vibhav Vineet, Qifeng Chen, and Vladlen 13(4):600–612, 2004. + Koltun. Dense monocular depth estimation in complex dy- + namic scenes. In IEEE Conference on Computer Vision and [55] Jamie Watson, Oisin Mac Aodha, Victor Prisacariu, Gabriel + Pattern Recognition (CVPR), pages 4058–4066, 2016. Brostow, and Michael Firman. The temporal opportunist: + Self-supervised multi-frame monocular depth. In IEEE +[41] Andrea Romanoni and Matteo Matteucci. TAPA-MVS: Conference on Computer Vision and Pattern Recognition + Textureless-aware PAtchMatch multi-view stereo. In Inter- (CVPR), 2021. + national Conference on Computer Vision (ICCV), 2019. + [56] Youze Xue, Jiansheng Chen, Weitao Wan, Yiqing Huang, +[42] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- Cheng Yu, Tianpeng Li, and Jiayu Bao. MVSCRF: Learning + Net: Convolutional networks for biomedical image segmen- multi-view stereo with conditional random fields. In Inter- + tation. In International Conference on Medical Image Com- national Conference on Computer Vision (ICCV), 2019. + + 6117 +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:26:54 UTC from IEEE Xplore. Restrictions apply. + [57] Jiayu Yang, Wei Mao, Jose M. Alvarez, and Miaomiao Liu. 
+ Cost volume pyramid based depth inference for multi-view + stereo. In IEEE Conference on Computer Vision and Pattern + Recognition (CVPR), 2020. + + [58] Nan Yang, Lukas von Stumberg, Rui Wang, and Daniel Cre- + mers. D3VO: Deep depth, deep pose and deep uncertainty + for monocular visual odometry. In IEEE Conference on + Computer Vision and Pattern Recognition (CVPR), 2020. + + [59] Nan Yang, Rui Wang, Jo¨rg Stu¨ckler, and Daniel Cremers. + Deep virtual stereo odometry: Leveraging deep depth predic- + tion for monocular direct sparse odometry. In European Con- + ference on Computer Vision (ECCV), pages 817–833, 2018. + + [60] Yao Yao, Shiwei Li, Siyu Zhu, Hanyu Deng, Tian Fang, and + Long Quan. Relative camera refinement for accurate dense + reconstruction. In International Conference on 3D Vision + (3DV), 2017. + + [61] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long + Quan. MVSNet: Depth inference for unstructured multi- + view stereo. In European Conference on Computer Vision + (ECCV), pages 785–801, 2018. + + [62] Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, + and Long Quan. Recurrent MVSNet for high-resolution + multi-view stereo depth inference. In IEEE Conference on + Computer Vision and Pattern Recognition (CVPR), 2019. + + [63] Zhichao Yin and Jianping Shi. GeoNet: Unsupervised learn- + ing of dense depth, optical flow and camera pose. In IEEE + Conference on Computer Vision and Pattern Recognition + (CVPR), 2018. + + [64] Zehao Yu and Shenghua Gao. Fast-MVSNet: Sparse-to- + dense multi-view stereo with learned propagation and gauss- + newton refinement. In IEEE Conference on Computer Vision + and Pattern Recognition (CVPR), 2020. + + [65] Huangying Zhan, Ravi Garg, Chamara Saroj Weerasekera, + Kejie Li, Harsh Agarwal, and Ian M. Reid. Unsupervised + learning of monocular depth estimation and visual odometry + with deep feature reconstruction. In IEEE Conference on + Computer Vision and Pattern Recognition (CVPR), 2018. 
diff --git a/动态slam/2020年-2022年开源动态SLAM/2021年/RDS-SLAM_Real-Time_Dynamic_SLAM_Using_Semantic_Segmentation_Methods.pdf b/动态slam/2020年-2022年开源动态SLAM/2021年/RDS-SLAM_Real-Time_Dynamic_SLAM_Using_Semantic_Segmentation_Methods.pdf
new file mode 100644
index 0000000..dc7da67
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2021年/RDS-SLAM_Real-Time_Dynamic_SLAM_Using_Semantic_Segmentation_Methods.pdf

Received December 21, 2020, accepted January 6, 2021, date of publication January 11, 2021, date of current version February 10, 2021.
Digital Object Identifier 10.1109/ACCESS.2021.3050617

RDS-SLAM: Real-Time Dynamic SLAM Using Semantic Segmentation Methods

YUBAO LIU AND JUN MIURA, (Member, IEEE)
Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi 441-8580, Japan
Corresponding author: Yubao Liu (yubao.liu.ra@tut.jp)
This work was supported in part by the Japan Society for the Promotion of Science (JSPS) KAKENHI under Grant 17H01799.

ABSTRACT The scene rigidity is a strong assumption in typical visual Simultaneous Localization and Mapping (vSLAM) algorithms. Such a strong assumption limits the use of most vSLAM systems in dynamic real-world environments, which are the target of several relevant applications such as augmented reality, semantic mapping, unmanned autonomous vehicles, and service robotics.
Many solutions have been proposed that use different kinds of semantic segmentation methods (e.g., Mask R-CNN, SegNet) to detect dynamic objects and remove outliers. However, as far as we know, such methods wait for the semantic results in the tracking thread, and their processing time depends on the segmentation method used. In this paper, we present RDS-SLAM, a real-time visual dynamic SLAM algorithm that is built on ORB-SLAM3 and adds a semantic thread and a semantic-based optimization thread for robust tracking and mapping in dynamic environments in real time. These novel threads run in parallel with the others, and therefore the tracking thread no longer needs to wait for the semantic information. Besides, we propose an algorithm to obtain the latest semantic information possible, thereby making it possible to use segmentation methods with different speeds in a uniform way. We update and propagate semantic information using the moving probability, which is saved in the map and used to remove outliers from tracking using a data association algorithm. Finally, we evaluate the tracking accuracy and real-time performance using the public TUM RGB-D datasets and a Kinect camera in dynamic indoor scenarios. Source code and demo: https://github.com/yubaoliu/RDS-SLAM.git

INDEX TERMS Dynamic SLAM, ORB SLAM, Mask R-CNN, SegNet, real-time.

I. INTRODUCTION
Simultaneous localization and mapping (SLAM) [1] is a fundamental technique for many applications such as augmented reality (AR), robotics, and unmanned autonomous vehicles (UAV). Visual SLAM (vSLAM) [2] uses the camera as the input and is useful in scene understanding and decision making. However, the strong assumption of scene rigidity limits the use of most vSLAM in real-world environments. Dynamic objects will cause many bad or unstable data associations that accumulate drift during the SLAM process. In Fig. 1, for example, assume m1 is on a person and its position changes in the scene. The bad or unstable data associations (the red lines in Fig. 1) will lead to incorrect camera ego-motion estimation in dynamic environments. Usually, there are two basic requirements for vSLAM: robustness in tracking and real-time performance. Therefore, how to detect dynamic objects in a populated scene and prevent the tracking algorithm from using data associations related to such dynamic objects in real time is the challenge to allow vSLAM to be deployed in the real world.

FIGURE 1. Example of data association in vSLAM under a dynamic scene. Ft (t ≥ 0) is the frame and KFt is the selected keyframe. mi, i ∈ {0, 1, ...} is the map point. Assume m1 moved to a new position m1′ because it belongs to a moving object. The red lines indicate the unstable or bad data associations.

We classify the solutions into two classes: pure geometric-based [3]–[7] and semantic-based [8]–[13] methods. The geometric-based approaches cannot remove all potential dynamic objects, e.g., people who are sitting. Features on such objects are unreliable and also need to be removed from tracking and mapping. The semantic-based methods use semantic segmentation or object detection approaches to obtain pixel-wise masks or bounding boxes of potential dynamic objects. Sitting people can be detected and removed from tracking and mapping using the semantic information, and a map of static objects can be built. Usually, in semantic-based methods, geometric checks, such as Random Sample Consensus (RANSAC) [14] and multi-view geometry, are also used to remove outliers.

These semantic-based methods first detect or segment objects and then remove outliers from tracking. The tracking thread has to wait for semantic information before tracking (camera ego-motion estimation), which is called the blocked model in this paper (as shown in Fig. 2). Their processing speed is limited by the speed of the semantic segmentation method used. For example, Mask R-CNN requires about 200 ms [15] to segment one image, and this will limit the real-time performance of the entire system.

FIGURE 2. Blocked model. The semantic model can use different kinds of segmentation methods, e.g., Mask R-CNN and SegNet. Note that this is not exactly the same as the semantic-based methods mentioned in [8]–[13]. The tracking process is blocked to wait for the results of the semantic model.

Our main challenge is how to execute vSLAM in real time under dynamic scenes with various pixel-wise semantic segmentation methods that run at different speeds, such as SegNet and Mask R-CNN. We propose a semantic thread to wait for the semantic information. It runs in parallel with the tracking thread, and the tracking thread does not need to wait for the segmentation result. Therefore, the tracking thread can execute in real time. We call this the non-blocked model in this paper. Faster segmentation methods (e.g., SegNet) can update semantic information more frequently than slower methods (e.g., Mask R-CNN). Although we cannot control the segmentation speed, we can use a strategy to obtain the latest semantic information possible to remove outliers from the current frame.

Because the semantic thread runs in parallel with the tracking thread, we use the map points to save and share the semantic information. As shown in Fig. 1, we update and propagate semantic information using the moving probability and classify map points into three categories, static, dynamic, and unknown, according to the moving probability thresholds. These classified map points will be used to select data associations that are as stable as possible in tracking.

The main contributions of this paper are:
(1) We propose a novel semantic-based real-time dynamic vSLAM algorithm, RDS-SLAM, in which the tracking thread does not need to wait for the semantic results anymore. This method efficiently and effectively uses semantic segmentation results for dynamic object detection and outlier removal while keeping the algorithm's real-time nature.
(2) We propose a keyframe selection strategy that uses the latest semantic information possible for outlier removal with semantic segmentation methods of different speeds in a uniform way.
(3) We show that the real-time performance of the proposed method is better than that of existing similar methods using the TUM dataset.

The rest of the paper is structured as follows. Section II discusses related work. Section III describes a system overview. Sections IV, V, and VI detail the implementation of the proposed methods. Section VII shows experimental results, and Section VIII presents the conclusions and discusses future work.

The associate editor coordinating the review of this manuscript and approving it for publication was Heng Wang.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. RELATED WORK
A. VISUAL SLAM
vSLAM [2] can be classified into feature-based methods and direct methods. Mur-Artal et al. presented ORB-SLAM2 [16], a complete SLAM system for monocular, stereo, and RGB-D cameras, which works in real time on standard CPUs in a wide variety of environments. This system estimates the ego-motion of the camera by matching the corresponding ORB [17] features between the current frame and previous frames, and has three parallel threads: tracking, local mapping, and loop closing. Carlos et al. proposed the latest version, ORB-SLAM3 [18], mainly adding two novelties: 1) a feature-based tightly-integrated visual-inertial SLAM that fully relies on maximum-a-posteriori (MAP) estimation; 2) a multiple map system (ATLAS [19]) that relies on a new place recognition method with improved recall. In contrast to feature-based methods, direct methods operate on pixel intensities directly. For example, Kerl et al. proposed a dense visual SLAM method, DVO [20], for RGB-D cameras that minimizes both the photometric and the depth error over all pixels. However, none of the above methods can address the common problem of dynamic objects. Detecting and dealing with dynamic objects in a dynamic scene in real time is a challenging task in vSLAM.
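The blocked vs. non-blocked distinction described in the introduction is essentially a producer/consumer question: tracking produces keyframes, and a separate semantic thread consumes them at its own pace. The following is only a rough illustration of that idea, not the paper's code; all names (SharedMap, run_demo, the 0.9 probability) are our placeholders.

```python
import queue
import threading
import time

class SharedMap:
    """Stand-in for the atlas: map points carry a moving probability."""
    def __init__(self):
        self.moving_prob = {}               # keyframe id -> probability
        self._lock = threading.Lock()

    def update(self, kf_id, prob):
        with self._lock:
            self.moving_prob[kf_id] = prob

def semantic_worker(kf_queue, shared_map, stop):
    """Semantic thread: drains pending keyframes at its own (slower) pace."""
    while not stop.is_set() or not kf_queue.empty():
        try:
            kf_id = kf_queue.get(timeout=0.01)
        except queue.Empty:
            continue
        time.sleep(0.005)                   # stand-in for a slow segmentation call
        shared_map.update(kf_id, 0.9)       # e.g. features that fall on a person

def run_demo(n_frames=20, kf_every=2):
    shared_map = SharedMap()
    kf_queue = queue.Queue()
    stop = threading.Event()
    worker = threading.Thread(target=semantic_worker,
                              args=(kf_queue, shared_map, stop))
    worker.start()
    tracked = []
    for frame_id in range(n_frames):        # tracking loop: never blocks on semantics
        if frame_id % kf_every == 0:
            kf_queue.put(frame_id)          # hand the keyframe to the semantic thread
        tracked.append(frame_id)            # ego-motion estimation would happen here
    stop.set()
    worker.join()
    return tracked, shared_map.moving_prob
```

The point of the sketch is only that the tracking loop finishes all frames without ever waiting on the segmentation call; the semantic results arrive later through the shared map.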
Our work follows the implementation of ORB-SLAM3 [18]. The ORB-SLAM3 concepts of keyframe, covisibility graph, ATLAS, and bundle adjustment (BA) are also used in our implementation.

1) KEYFRAME
Keyframes [18] are a subset of selected frames used to avoid unnecessary redundancy in tracking and optimization. Each keyframe stores 1) a rigid body transformation of the camera pose that transforms points from the world to the camera coordinate system; 2) the ORB features, whether or not they are associated with a map point. In this paper, keyframes are selected by the same policy as in ORB-SLAM3; a keyframe is selected if all the following conditions are met: 1) 20 frames have passed since the last global relocalization or the last keyframe insertion; 2) the local mapping thread is idle; 3) the current frame tracks at least 50 points, but fewer than 90% of the points tracked by the reference keyframe.

2) COVISIBILITY GRAPH
The covisibility graph [16] is represented as an undirected weighted graph, in which each node is a keyframe and each edge holds the number of commonly observed map points.

3) ATLAS
The Atlas [19] is a multi-map representation that handles an unlimited number of sub-maps. Two kinds of maps, the active map and non-active maps, are managed in the atlas. When camera tracking is considered lost and relocalization has failed for a few frames, the active map becomes a non-active map, and a new map is initialized. In the atlas, keyframes and map points are managed using the covisibility graph and the spanning tree.

4) BUNDLE ADJUSTMENT (BA)
BA [21] is the problem of refining a visual reconstruction to produce jointly optimal 3D structure and viewing parameter estimates. Local BA is used in the local mapping thread to optimize only the camera pose. Loop closing launches a thread to perform full BA after the pose-graph optimization to jointly optimize the camera pose and the corresponding landmarks.
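The keyframe policy listed under 1) KEYFRAME can be condensed into a single predicate. This is one plausible reading of the three conditions (in particular, we read condition 3 as a conjunction); the function and argument names are ours, not ORB-SLAM3's.

```python
def should_insert_keyframe(frames_since_last_kf: int,
                           frames_since_reloc: int,
                           local_mapping_idle: bool,
                           tracked_points: int,
                           ref_kf_points: int) -> bool:
    """Sketch of the ORB-SLAM3-style keyframe policy summarized above."""
    # 1) enough frames since the last relocalization and keyframe insertion
    enough_frames = frames_since_last_kf >= 20 and frames_since_reloc >= 20
    # 3a) tracking is still healthy enough to anchor a keyframe
    enough_points = tracked_points >= 50
    # 3b) the view has drifted away from the reference keyframe
    losing_overlap = tracked_points < 0.9 * ref_kf_points
    # 2) local mapping must be idle
    return enough_frames and local_mapping_idle and enough_points and losing_overlap
```

For example, a frame tracking 60 points against a 100-point reference keyframe, 25 frames after the last keyframe, qualifies; the same frame with the local mapping thread busy does not.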
B. GEOMETRIC-BASED SOLUTIONS
Li et al. [5] proposed a real-time depth edge-based RGB-D SLAM system for dynamic environments based on frame-to-keyframe registration. They only use depth edge points, which have an associated weight indicating their probability of belonging to a dynamic object. Sun et al. [6] classify pixels using the segmentation of the quantized depth image and calculate the difference in intensity between consecutive RGB images. Tan et al. [3] propose a novel online keyframe representation and updating method to adaptively model dynamic environments. The camera pose can reliably be estimated even in challenging situations using a novel prior-based adaptive RANSAC algorithm to efficiently remove outliers.

Although geometric-based vSLAM solutions can restrict the effect of dynamic objects to some extent, they have some limitations: 1) they cannot detect potential dynamic objects that temporarily keep static; 2) they lack semantic information, so dynamic objects cannot be judged using prior knowledge of the scene.

C. SEMANTIC-BASED SOLUTIONS
DS-SLAM [10], implemented on ORB-SLAM2 [16], combines a semantic segmentation network (SegNet [22]) with a moving consistency check to reduce the impact of dynamic objects and produce a dense semantic octree map [23]. DS-SLAM assumes that the feature points on people are most likely to be outliers. If a person is determined to be static, then matching points on the person can also be used to predict the pose of the camera.

DynaSLAM [9], also built on ORB-SLAM2, is robust in dynamic scenarios for monocular, stereo, and RGB-D datasets, adding the capabilities of dynamic object detection and background inpainting. It can detect moving objects either by multi-view geometry, deep learning, or both, and inpaint the frame background that has been occluded by dynamic objects using a static map of the scene. It uses Mask R-CNN to segment out all the a priori dynamic objects, such as people or vehicles. DynaSLAM II [24] tightly integrates multi-object tracking capability, but it only works for rigid objects; in the dynamic scenes of the TUM [25] dataset, however, people change their shape by sometimes standing and sometimes sitting.

Detect-SLAM [12], also built on ORB-SLAM2, integrates visual SLAM with a single-shot multi-box detector (SSD) [26] to make the two functions mutually beneficial. They call the probability of a feature point belonging to a moving object the moving probability. They distinguish keypoints into four states: high-confidence static, low-confidence static, low-confidence dynamic, and high-confidence dynamic. Considering the delay of detection and the spatio-temporal consistency of successive frames, they only run SSD on the color images of keyframes, meanwhile propagating the probability frame by frame in the tracking thread. Once the detection result is obtained, they insert the keyframe into the local map and update the moving probability of the 3D points in the local map that matched with the keyframe.

DM-SLAM [11] combines Mask R-CNN, optical flow, and an epipolar constraint to judge outliers. Its Ego-motion Estimation module estimates the initial pose of the camera, similar to the Low-cost tracking module in DynaSLAM. DM-SLAM also uses features on a priori dynamic objects, if they are not moving heavily, to reduce the feature-less cases caused by removing all a priori dynamic objects.

Fan et al. [8] proposed a novel semantic SLAM system with a more accurate point cloud map in dynamic environments; they use BlitzNet [27] to obtain the masks and bounding boxes of the dynamic objects in the image.

All these methods use the blocked model: they wait for the semantic results of every frame or keyframe before estimating the camera pose. As a result, their processing speeds are limited by the specific CNN models they use. In this paper, we propose RDS-SLAM, which uses the non-blocked model, and show its real-time performance by comparing it with those methods.

FIGURE 3. System architecture. Models in orange are modified blocks based on ORB-SLAM3. Models in magenta are newly added features. Blocks in blue are important data structures.

III. SYSTEM OVERVIEW
Each frame first passes through the tracking thread. The initial camera pose is estimated for the current frame after being tracked against the last frame, and is further optimized by being tracked against the local map. Then, keyframes are selected, and they are useful in the semantic tracking, semantic-based optimization, and local mapping threads. We modify several models in the tracking and the local mapping threads to remove outliers from camera ego-motion estimation using the semantic information. In the tracking thread, we propose a data association algorithm to use as many features on static objects as possible.

The semantic thread runs in parallel with the others, so as not to block the tracking thread, and saves the semantic information into the atlas. Semantic labels are used to generate the mask image of the a priori dynamic objects. The moving probability of the map points matched with features in the keyframes is updated using the semantic information. Finally, the camera pose is optimized using the semantic information in the atlas.

We will introduce the new features and modified models in the following sections. We skip the detailed explanations of the modules that are the same as those of ORB-SLAM3.
IV. SEMANTIC THREAD
The semantic thread is responsible for generating semantic information and updating it into the atlas map. Before we introduce the detailed implementation of the semantic thread, we use a simple example to explain the general flow, as shown in Fig. 4. Assume that keyframes are selected every two frames. The keyframes are selected by ORB-SLAM3, and we insert them into a keyframe list KF sequentially. Assume that, at time t = 12, KF2-KF6 are inside KF. The next step is to select keyframes from KF to request semantic labels from the semantic server. We call this process the semantic keyframe selection process in this paper. We take one keyframe from the head of KF (KF2) and one from the back of KF (KF6) to request the semantic labels. Then, we calculate the mask of the a priori dynamic objects using semantic labels S2 and S6. Next, we update the moving probability of the map points stored in the atlas. The moving probability will be used later to remove outliers in the tracking thread.

FIGURE 4. Semantic tracking example. Assume keyframe KFn is selected every two frames Fn and inserted into keyframe list KF. We choose keyframes from KF to request semantic labels Sn. Then we update the moving probability into the atlas using the mask image of dynamic objects reproduced from the semantic label. Blue circles stand for static map points and red circles for dynamic map points. Others, marked in green, are unknown.

Alg. 1 shows the detailed implementation of the semantic thread. The first step is to select semantic keyframes from the keyframe list KF (Line 2). Next, we request semantic labels from the semantic model, which returns the semantic labels SLs (Line 3). Lines 4-8 save and process the semantic results for each item returned: Line 6 generates the mask image of dynamic objects, and Line 7 updates the moving probability stored in the atlas. We will introduce each sub-module of the semantic thread sequentially (see Fig. 3).

Algorithm 1 Semantic Tracking Thread
Require: KeyFrame list: KF
1: while not_request_finish() do
2:   SK = semantic_keyframe_selection(KF)
3:   SLs = request_segmentation(SK)
4:   for i = 0; i < SLs.size(); i++ do
5:     KeyFrame kf = SK[i]
6:     kf->mask = GenerateMaskImage(SLs[i])
7:     kf->UpdatePrioriMovingProbability()
8:   end for
9: end while

A. SEMANTIC KEYFRAME SELECTION ALGORITHM
The semantic keyframe selection algorithm selects keyframes for requesting the semantic labels later. We need to keep the real-time performance while using different kinds of semantic segmentation methods. However, some of them, such as Mask R-CNN, are time-consuming, and the current frame in tracking may not obtain new semantic information if we segment every keyframe sequentially.

To evaluate this distance quantitatively, we define the semantic delay as the distance between the id of the latest frame that has a semantic label (St), which holds the latest semantic information, and the id of the current frame (Ft), as follows:

d = FrameID(Ft) − FrameID(St). (1)

Fig. 5 shows the semantic delay for several cases. The general idea is to segment each frame or keyframe sequentially, according to the time sequence, as shown in Fig. 5 (a). We call this kind of model the sequential segmentation model. However, it monotonically increases the time delay when a time-consuming segmentation method is used, as shown by the blue line in Fig. 6. For instance, at time t = 10 (F10), the semantic model has completed the segmentation of KF0 (F0) and the semantic delay is d = 10. Similarly, at time 40 (F40), the semantic delay becomes 34. That is, the last frame that has semantic information is 34 frames behind the current frame.
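The sequential-model numbers above (d = 10 at t = 10, d = 34 at t = 40, with a keyframe every two frames and one segmentation taking ten frame-times) follow directly from Eq. (1). The small calculation below reproduces them; the function and parameter names are ours, chosen only for this sketch.

```python
def sequential_semantic_delay(t, kf_interval=2, seg_frames=10):
    """Semantic delay d = FrameID(F_t) - FrameID(S_t) under the sequential model.

    Keyframe KF_k is frame F_{k * kf_interval}; segmentations run one after
    another, each taking seg_frames frame-times, so KF_k finishes at time
    (k + 1) * seg_frames.
    """
    finished = t // seg_frames          # segmentations completed by time t
    if finished == 0:
        return None                     # no semantic information available yet
    latest_kf_index = finished - 1      # most recently finished keyframe
    return t - latest_kf_index * kf_interval
```

With the default parameters the delay grows linearly, matching the blue line described for Fig. 6: 10 at t = 10, then 18, 26, 34 at t = 20, 30, 40.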
The current frame cannot obtain the latest semantic information.

To shorten this distance, suppose that we segment two frames at the same time (Fig. 5 (b)). Then the delay becomes 12 − 2 = 10 if KF0 and KF1 are segmented at the same time. The delay still grows linearly, as shown by the red line in Fig. 6.

To further shorten the semantic delay, we use a bi-directional model. We do not segment keyframes sequentially. Instead, we do semantic segmentation using keyframes both from the front and the back of the list to use the latest semantic information possible, as shown in Fig. 5 (c) and by the yellow line in Fig. 6. The semantic delay becomes a constant value. In practice, the delay in the bi-directional model is not always 10. The distance is influenced by the segmentation method used, the frequency of keyframe selection, and the processing speed of the related threads.

FIGURE 5. Bi-directional model vs. sequential model. Assume we use Mask R-CNN (200 ms) and ORB-SLAM3 (20 ms), and a keyframe is selected every two frames. There is a delay of about 200/20 = 10 frames while waiting for the semantic result.

FIGURE 6. Semantic delay of the sequential model vs. the bi-directional model.

The left side of Fig. 7 shows a semantic keyframe selection example, and the right side of Fig. 7 shows the timeline of requesting semantic information from the semantic model/server. We take keyframes from both the head and the back of KF to request the semantic label. (Round 1) At time t = 2, two keyframes, KF0 and KF1, are selected. Segmentation finishes at t = 12. By this time, new keyframes have been selected and inserted into KF (see Round 2). Then we take two elements, KF2 from the front and KF6 from the back, to request the semantic label. At time t = 22, we receive the semantic result and continue with the next round (Round 3).

FIGURE 7. Semantic timeline. The left side shows the contents of the keyframe list KF, and the right side shows the timeline of requesting semantic labels. A keyframe in green means the item already obtained its semantic information in a previous round.

We can obtain relatively new information if we segment the keyframe at the tail of the KF list. Then why do we also need to segment the keyframe at the front of the list? Different from the blocked model, there is no semantic information for the first few frames (about 10 frames if Mask R-CNN is used) in our method. Since the processing speed of the tracking thread is usually faster than that of the semantic thread, vSLAM may have already accumulated large errors because of the dynamic objects. Therefore, we need to correct these drift errors using the semantic information by popping out and feeding the keyframes at the front of the KF list sequentially to the semantic-based optimization thread to correct/optimize the camera poses.
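The head-and-tail selection described above (KF2 from the front and KF6 from the back in the running example) can be sketched as a small operation on a double-ended queue. The helper name is ours, not the paper's.

```python
from collections import deque

def select_semantic_keyframes(kf_list: deque) -> list:
    """Bi-directional semantic keyframe selection (sketch).

    Takes one keyframe from the head, to correct drift accumulated on the
    oldest pending keyframes, and one from the tail, to obtain the newest
    semantic information possible.
    """
    selected = []
    if kf_list:
        selected.append(kf_list.popleft())   # oldest pending keyframe
    if kf_list:
        selected.append(kf_list.pop())       # newest keyframe
    return selected
```

On the running example, with KF2-KF6 pending, the call returns [2, 6] and leaves KF3-KF5 in the list for later rounds.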
We did not refine + the network using the TUM dataset because SLAM usually + runs in an unknown environment. + +FIGURE 6. Semantic delay of sequential model vs bi-direction model. C. SEMANTIC MASK GENERATION + We merge all the binary mask images of instance segmenta- +back of KF to request the semantic label. (Round 1) At time tion results into one mask image that is used to generate the +t = 2, two keyframes KF0 and KF1 are selected. Segmen- mask image (Fig. 8) of people. Then we calculate the priori +tation finished at t = 12. By this time, new keyframes are moving probability of map points using the mask. In practice, +selected and then inserted into KF (see Round 2). Then we since the segmentation on object boundaries are sometimes +take two elements KF2 from the front and KF6 from this back unreliable, the features on the boundaries cannot be detected +to request the semantic label. At the time t = 22, we received if directly apply the mask image, as shown in Fig. 9 (a). +the semantic result and continue the next round (Round 3). + 1https://github.com/matterport/Mask_RCNN +VOLUME 9, 2021 2https://github.com/alexgkendall/SegNet-Tutorial + + 23777 + Y. Liu, J. Miura: RDS-SLAM: Real-Time Dynamic SLAM Using Semantic Segmentation Methods + + FIGURE 10. Segmentation failure case. Some features on the body on the + person (a) cannot be identified as outliers using unsound mask + (c) generated by semantic result (b). Therefore, those features are + wrongly labeled as static in this frame. + + FIGURE 11. Moving probability. θs is the static threshold and θd is the + dynamic threshold value. + +FIGURE 8. Semantic information. ‘‘M’’ stands for Mask R-CNN and ‘‘S’’ for moving probability is used to detect and remove outliers from +‘‘SegNet’’. (e) shows the outliers that marked in red color, which are tracking. +detected using the mask image. + 1) DEFINITION OF MOVING PROBABILITY +FIGURE 9. Mask dilation. 
Remove outliers on the edge of dynamic As we know, vSLAM is usually running in an unknown +objects. environment, the semantic result is not always robust if the + CNN network is not well trained or refined according to +Therefore, we dilate the mask using a morphological filter to the current environment (Fig. 10). To detect outliers, it is +include the edge of dynamic objects, as shown in Fig. 9 (b). more reasonable to consider the spatio-temporal consistency +D. MOVING PROBABILITY UPDATE of frames, rather than just use the semantic result of one +In order not to wait for the semantic information in the frame. Therefore, we use the moving probability to leverage +tracking thread, we isolate the semantic segmentation from the semantic information of successive keyframes. +tracking. We use the moving probability to convey semantic +information from semantic thread to tracking thread. The We define the moving probability (p(mti ), mti ∈ M ) of each + map point i at the current time as shown in Fig. 11. The +23778 status of the map point is more likely dynamic if its moving + probability is closer to one. The more static the map point + is if it is more closer to zero. To simplify, we abbreviate the + moving probability of map point i at time t (p(mti )) to p(mt ). + Each map point has two status (M ), dynamic and static, and + the initial probability (initial belief) is set to 0.5 (bel(m0)). + + M = {static(s), dynamic(d) } + bel(m0 = d) = bel(m0 = s) = 0.5 + + 2) DEFINITION OF OBSERVED MOVING PROBABILITY + Considering the fact that the semantic segmentation is not + 100% accurate, we define the observe moving probability as: + + p(zt = d|mt = d) = α, + p(zt = s|mt = d) = 1 − α, + p(zt = s|mt = s) = β, and + p(zt = d|mt = s) = 1 − β. + + The values α and β are manually given and it is related to the + accuracy of semantic segmentation. In the experiment, we set + α and β to 0.9 by supping the semantic segmentation is fairly + reliable. + + VOLUME 9, 2021 + Y. Liu, J. 
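As a side illustration of the semantic-delay analysis above (Eq. (1), Figs. 5 and 6), the growth of the delay under the two scheduling models can be simulated with a minimal Python sketch. The timings (20 ms per frame, 200 ms per segmentation, a keyframe every two frames) are the ones assumed in Fig. 5; the function and policy names are illustrative, not part of the paper's implementation, and the ''newest_first'' policy models only the back-of-list pick of the bi-directional model, which is what bounds its delay.

```python
FRAME_MS, SEG_MS, KF_EVERY = 20, 200, 2   # timings assumed in Fig. 5

def simulate(policy, rounds):
    """Semantic delay d = FrameID(F_t) - FrameID(S_t) after each segmentation."""
    clock, next_kf, pending, delays = 0, 0, [], []
    for _ in range(rounds):
        while next_kf * FRAME_MS <= clock:       # keyframes selected so far
            pending.append(next_kf)
            next_kf += KF_EVERY
        # "sequential" segments the oldest pending keyframe; "newest_first"
        # models the back-of-list pick of the bi-directional model
        kf = pending.pop(0) if policy == "sequential" else pending.pop()
        clock += SEG_MS                          # segmentation of kf finishes now
        delays.append(clock // FRAME_MS - kf)    # Eq. (1)
    return delays
```

Running the sketch reproduces the qualitative behavior of Fig. 6: the sequential delay grows linearly while the newest-first delay stays at the constant 200/20 = 10 frames.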
3) MOVING PROBABILITY UPDATE
The moving probability at the current time, bel(mt), is predicted based on the observations z1:t (the semantic segmentation results) and the initial status m0. We formulate the moving probability update as a Bayesian filter [30] problem:

    bel(mt) = p(mt | z1:t, m0)
            = η p(zt | mt, z1:t−1, m0) p(mt | z1:t−1, m0)
            = η p(zt | mt) p(mt | z1:t−1, m0)
            = η p(zt | mt) bel−(mt)    (2)

Eq. (2) exploits Bayes' rule and the conditional independence assumption that the current observation zt relies only on the current status mt; η is a normalizing constant. The prediction bel−(mt) is calculated by:

    bel−(mt) = ∫ p(mt | mt−1, z1:t−1) p(mt−1 | z1:t−1) dmt−1
             = ∫ p(mt | mt−1) bel(mt−1) dmt−1    (3)

In Eq. (3), we exploit the assumption that our state is complete. This implies that if we know the previous state mt−1, past measurements convey no information regarding the state mt. We assume the state transition probabilities p(mt = d | mt−1 = s) = 0 and p(mt = d | mt−1 = d) = 1 because we cannot detect sudden status changes of objects. η is the constant that normalizes the belief so that bel(mt = d) + bel(mt = s) = 1. The predicted probability of a map point being dynamic is calculated by:

    bel−(mt = d) = p(mt = d | mt−1 = d) bel(mt−1 = d)    (4)

Algorithm 2 Robust Data Association Algorithm
Require: Current frame: Ft; last frame: Ft−1; unknown subset: Unknown; static subset: Static; thresholds: θd, θs, τ = 20
 1: for i = 0; i < Ft−1.Features.size(); i++ do
 2:   MapPoint* m = Ft−1.MapPoints[i]
 3:   f = FindMatchedFeatures(Ft, m)
 4:   if p(m) > θd then
 5:     continue
 6:   end if
 7:   if p(m) < θs then
 8:     Static.insert(f, m)
 9:   end if
10:   if θs ≤ p(m) ≤ θd then
11:     Unknown.insert(f, m)
12:   end if
13: end for
14: for it = Static.begin(); it != Static.end(); it++ do
15:   Ft.MapPoints[it->first] = it->second
16: end for
17: if Static.size() < τ then
18:   for it = Unknown.begin(); it != Unknown.end(); it++ do
19:     Ft.MapPoints[it->first] = it->second
20:   end for
21: end if

4) JUDGEMENT OF STATIC AND DYNAMIC POINTS
Whether a point is dynamic or static is judged using the predefined probability thresholds θd and θs (see Fig. 11), set to 0.6 and 0.4, respectively, in the experiment:

    Status(mt) = dynamic,  if p(mt) > θd;
                 static,   if p(mt) < θs;    (5)
                 unknown,  otherwise.

V. TRACKING THREAD
The tracking thread runs in real time and tends to accumulate drift error due to the incorrect or unstable data associations between 3D map points and 2D features in each frame caused by dynamic objects. We modify the Track Last Frame model and the Track Local Map model of the ORB-SLAM3 tracking thread to remove outliers (see Fig. 3). We propose a data association algorithm that uses data associations that are as good as possible, based on the moving probability stored in the atlas.

A. TRACK LAST FRAME
Alg. 2 shows the data association algorithm in the Track Last Frame model. For each feature i in the last frame, we first get its matched map point m (Line 2). Next, we find the matched feature in the current frame by comparing the descriptor distances of the ORB features (Line 3). After that, in order to remove the bad influence of dynamic map points, we skip the map points that have a high moving probability (Lines 4-6). Two kinds of map points are then left: static and unknown. We want to use only static map points as far as possible, so we classify the remaining map points into two subsets, a static subset and an unknown subset, according to their moving probability (Lines 7-12). Finally, we use the selected, relatively good matches. We first use all the good data stored in the static subset (Lines 14-16). If these data are too few (fewer than the threshold τ = 20, the value used in ORB-SLAM3), we also use the data in the unknown subset (Lines 17-21).

We try to exclude outliers from tracking using the moving probability stored in the atlas. How well the outliers are removed has a great influence on the tracking accuracy. We show the results for a few frames in Fig. 12. All the features in the first few frames are shown in green because no semantic information can be used yet and the moving probability of all map points is 0.5, the initial value.
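The update rule of Eqs. (2)-(4), the thresholding of Eq. (5), and the fallback logic of Alg. 2 can be sketched together in a few lines. This is a minimal Python illustration, not the paper's C++ implementation; it uses the values stated in the text (α = β = 0.9, θd = 0.6, θs = 0.4, τ = 20).

```python
ALPHA = BETA = 0.9            # observation model of Sec. IV-D.2
THETA_D, THETA_S, TAU = 0.6, 0.4, 20

def update_belief(bel_d, z):
    """One Bayes-filter step (Eqs. (2)-(4)); z is 'd' or 's' from segmentation.
    With p(d|d) = 1 and p(d|s) = 0, the prediction equals the previous belief."""
    like_d = ALPHA if z == "d" else 1 - ALPHA          # p(z | m = d)
    like_s = (1 - BETA) if z == "d" else BETA          # p(z | m = s)
    num_d, num_s = like_d * bel_d, like_s * (1 - bel_d)
    return num_d / (num_d + num_s)                     # eta normalizes

def status(p):
    """Eq. (5): classify a map point by its moving probability."""
    return "dynamic" if p > THETA_D else "static" if p < THETA_S else "unknown"

def associate(matches):
    """Alg. 2 in miniature: matches is a list of (feature, p(m)) pairs.
    Dynamic points are skipped; unknown points are used only when static
    points are scarce (fewer than TAU)."""
    static = [f for f, p in matches if status(p) == "static"]
    unknown = [f for f, p in matches if status(p) == "unknown"]
    return static if len(static) >= TAU else static + unknown
```

Starting from the initial belief 0.5, a single ''dynamic'' observation raises the moving probability to 0.9, immediately pushing the point past θd, which matches the fast convergence visible in Fig. 12.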
FIGURE 12. Results after tracking the last frame. ''M'' stands for Mask R-CNN and ''S'' for SegNet. The features in red are not used in tracking. Blue features belong to the static subset and green features to the unknown subset.

The features in red belong to dynamic objects, and they are harder to match with the last frame than the static features (blue). The green features almost disappear because the map points obtain semantic information over time. We use only the features in the static subset if its size is sufficient to estimate the camera ego-motion.

B. TRACK LOCAL MAP
The basic idea of the data association algorithm in the Track Local Map model is similar to Alg. 2. The difference is that here we use all the map points in the local map to find good data associations. The data association result after tracking the local map is shown in Fig. 13. More map points are matched in this model than in tracking the last frame. The features on the people are almost all successfully detected or left unmatched/unused.

FIGURE 13. Results after tracking the local map. ''M'' stands for Mask R-CNN and ''S'' for SegNet.

VI. OPTIMIZATION
A. SEMANTIC-BASED OPTIMIZATION
We optimize the camera pose using the keyframes given by the semantic keyframe selection algorithm. Considering that the tracking thread runs much faster than the semantic thread, drift has already accumulated to some extent under the influence of dynamic objects. Therefore, we try to correct the camera pose using semantic information. We modify the error term used in ORB-SLAM3 by using the moving probability of map points for weighting, as shown below. In the experiment, we use only the matched static map points for optimization.

Assume Xjw ∈ R3 is the 3D position of map point j in the world coordinate system, and Tiw ∈ SE(3) is the pose of the i-th keyframe in the world coordinate system. The camera pose Tiw is optimized by minimizing the reprojection error with respect to the matched keypoint xij ∈ R2 of the map point. The error term for the observation of map point j in keyframe i is:

    e(i, j) = (xij − πi(Tiw, Xjw))(1 − p(mj)),    (6)

where πi is the projection function that projects a 3D map point into a 2D pixel in keyframe i. The larger the moving probability, the smaller the contribution to the error. The cost function to be optimized is:

    C = Σ_{i,j} ρ( e(i,j)^T Σ(i,j)^{-1} e(i,j) ),    (7)

where ρ is the Huber robust cost function and Σ(i,j)^{-1} is the inverse covariance matrix.
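The down-weighting of Eq. (6) and the robust cost of Eq. (7) can be sketched as follows. This is an illustrative pure-Python sketch, not the paper's optimizer: it assumes a simple pinhole model with intrinsics K = (fx, fy, cx, cy), a unit covariance in place of Σ(i,j), and a standard Huber ρ applied to the residual norm.

```python
def weighted_residual(x_obs, R, t, X, K, p_m):
    """Eq. (6): e = (x_ij - pi_i(T_iw, X_jw)) * (1 - p(m_j)).
    R, t: world-to-camera rotation (3x3 nested lists) and translation;
    K = (fx, fy, cx, cy) are illustrative pinhole intrinsics."""
    Xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    u = K[0] * Xc[0] / Xc[2] + K[2]     # pinhole projection pi_i
    v = K[1] * Xc[1] / Xc[2] + K[3]
    w = 1.0 - p_m                       # likely-dynamic points contribute less
    return ((x_obs[0] - u) * w, (x_obs[1] - v) * w)

def huber(e, delta=1.0):
    """Huber robust cost rho applied to the residual norm (unit covariance)."""
    n = (e[0] ** 2 + e[1] ** 2) ** 0.5
    return 0.5 * n * n if n <= delta else delta * (n - 0.5 * delta)
```

Note how a point with p(m) = 1 contributes nothing to the cost, which is how confirmed dynamic points are effectively excluded from the optimization.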
B. BUNDLE ADJUSTMENT IN LOCAL MAPPING THREAD
We modify the local BA model to reduce the influence of dynamic map points using semantic information. What we modified are: 1) the error term, in which the moving probability is used, as shown in Eq. 6; and 2) only keyframes that have already obtained semantic information are used for BA.

VII. EXPERIMENTAL RESULTS
We evaluate the tracking accuracy using the TUM [25] indoor dataset and demonstrate the real-time performance by comparing with state-of-the-art vSLAM methods using, where possible, the results from the original papers.

A. SYSTEM SETUP
Our system is evaluated using a GeForce RTX 2080Ti GPU, CUDA 11.1, and Docker.3 Docker is used to deploy different kinds of semantic segmentation methods on the same machine. We also use a Kinect v2 camera4 for evaluation in a real environment.

B. TRACKING ACCURACY EVALUATION
The proposed method was compared against ORB-SLAM3 and similar semantic-based algorithms to quantify the tracking performance of our proposal in dynamic scenarios.

The TUM RGB-D dataset contains color and depth images along the ground-truth trajectory of the sensor. In the sequences named ''fr3/walking_*'' (labeled f3/w/*), two people walk through an office. This is intended to evaluate the robustness of vSLAM to quickly moving dynamic objects in large parts of the visible scene. Four types of camera motion are included in the walking sequences: 1) ''xyz'', where the Asus Xtion camera is manually moved along three directions (x, y, z); 2) ''static'', where the camera is kept in place manually; 3) ''halfsphere'', where the camera is moved on a small half-sphere of approximately one meter in diameter; and 4) ''rpy'', where the camera is rotated along the principal axes (roll, pitch, yaw). In the experiment, the person is treated as the only priori dynamic object in the TUM dataset.

We compared the camera trajectory with ORB-SLAM3,5 DS-SLAM,6 and DynaSLAM.7 Fig. 14 compares the trajectories obtained using their source code; therefore, the trajectories are not exactly the same as the ones in their original papers. We evaluated our system using both Mask R-CNN (M) and SegNet (S). The trajectory of DynaSLAM using Mask R-CNN is very similar to that of our Mask R-CNN version, as shown in Fig. 14 (m-p) and Fig. 14 (q-t). The performance of our SegNet version (Fig. 14 (i and j)) is similar to that of DS-SLAM (Fig. 14 (e and f)).

The error in the estimated trajectory was calculated by comparing it with the ground truth using two prominent measures: absolute trajectory error (ATE) and relative pose error (RPE) [25], which are well suited for measuring vSLAM performance. The root mean squared error (RMSE) and the standard deviation (S.D.) of ATE and RPE are compared. Each sequence was run at least five times, as dynamic objects are prone to increase non-deterministic effects. We compared our method with ORB-SLAM3 [18], DS-SLAM [10], DynaSLAM [9], SLAM-PCD [8], DM-SLAM [11], and Detect-SLAM [12]. The comparison results are summarized in Tables 1, 2, and 3. DynaSLAM reported that it obtained the best performance using the combination of Mask R-CNN and a geometric model. In this paper, we mainly focus on the time cost caused by semantic segmentation. Contrary to the very heavy geometric model that DynaSLAM uses, we use only very light geometric checks, such as RANSAC and the photometric error, to deal with outliers that do not come from the priori dynamic objects.

Our proposal outperforms the original ORB-SLAM3 (RGB-D mode, without IMU) and obtains performance similar to DynaSLAM, SLAM-PCD, and DM-SLAM, whose tracking error is already very small. Different from them, we use the non-blocked model. The first few frames do not have any semantic information, and the number of keyframes that have a semantic label is smaller than in the blocked model because the tracking thread runs much faster than the semantic segmentation (especially for the heavy model, Mask R-CNN). However, we achieve similar tracking performance using less semantic information.

C. REAL ENVIRONMENT EVALUATION
We test our system using a Kinect2 RGB-D camera, as shown in Fig. 15. All features are in the initial status in the first few frames because they have not yet obtained any semantic information. Static features are increasingly detected over time and used to estimate the camera pose. The features on the person are detected and excluded from tracking. The algorithm runs at around 30 Hz, as shown in Table 4.

D. EXECUTION TIME
Tab. 4 compares the execution time of the vSLAM algorithms. In the blocked model, the tracking thread needs to wait for the semantic label, so the speed of the other methods depends on the semantic segmentation method used: the heavier the semantic model, the higher the total time consumption. Although DynaSLAM achieves good tracking performance, its processing time is long due to Mask R-CNN; as is well known, DynaSLAM is not a real-time algorithm. DS-SLAM is the second fastest algorithm because it uses a lightweight semantic segmentation method, SegNet. However, its architecture is also a blocked model, and its execution time will increase if a more time-consuming method is used.

FIGURE 14. Trajectory comparison frame by frame. ''M'' stands for ''Mask R-CNN'' and ''S'' for ''SegNet''.

TABLE 1. Results of absolute trajectory error on TUM (m). Ours (1) and (3) are results evaluated using only keyframes.
TABLE 2. Results of translational relative pose error (RPE) (m). Ours (1) and (3) are results evaluated using only keyframes.
TABLE 3. Results of rotational relative pose error (RPE) (°). Ours (1) and (3) are results evaluated using only keyframes.
TABLE 4. Execution time comparison on the TUM dataset. We use the data from the original papers where possible; where not provided, we approximate the processing time.
TABLE 5. Semantic keyframe number comparison (Mask R-CNN).

3 https://docs.docker.com/
4 https://github.com/code-iai/iai_kinect2
5 https://github.com/UZ-SLAMLab/ORB_SLAM3
6 https://github.com/ivipsourcecode/DS-SLAM
7 https://github.com/BertaBescos/DynaSLAM
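For reference, the ATE RMSE reported in tables like Table 1 is commonly computed as the root mean square of the per-pose translational distances between the estimated and ground-truth trajectories. The sketch below assumes the trajectories are already time-associated and aligned; the real TUM evaluation tools additionally associate poses by timestamp and align the trajectories (e.g., with Horn's method) before computing the error.

```python
from math import sqrt

def ate_rmse(est, gt):
    """RMSE of the absolute trajectory error over paired 3D positions,
    assuming the trajectories are already time-associated and aligned."""
    errs = [sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
            for p, q in zip(est, gt)]            # per-pose Euclidean distance
    return sqrt(sum(e * e for e in errs) / len(errs))
```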
Our method uses the non-blocked model and runs at an almost constant speed regardless of the segmentation method.

We evaluate the error metrics on the TUM dataset at 15 Hz, by manually adding some delay in the tracking thread, because the TUM sequences are very short and only very little semantic information can be obtained in such a short time. We compare the runtime and the number of keyframes that obtained a semantic label (semantic keyframe number) in Tab. 5. We compared only the Mask R-CNN version because SegNet is faster and can segment almost all the keyframes in each sequence. We assume the time cost of Mask R-CNN is 0.2 s for segmenting each frame. The total time of running the fr3/w/xyz sequence is about 57.3 s at 15 Hz but only 28.3 s at 30 Hz. In this short time, the number of semantic keyframes at 30 Hz (143) is half of that at 15 Hz (286). Usually, the more keyframes are segmented, the better the tracking accuracy that can be achieved; this depends on the specific application and the segmentation method used.

In the bi-direction model, we select two keyframes at a time. We offer two strategies to segment them: 1) infer the images at the same time as a batch on the same GPU, or 2) infer the images on the same GPU sequentially (one by one). We suggest using (1) if the GPU can infer a batch of images at the same time. Our Mask R-CNN version uses (1) because we found that we need 0.3-0.4 s in case (1) and 0.2 s in case (2). Our SegNet version is evaluated using strategy (2) because SegNet is very fast and the images can be segmented sequentially.

E. SEMANTIC DELAY EVALUATION
We have analyzed the semantic delay by assuming that a keyframe is selected every two frames (see Fig. 6). In the experiment, we follow the keyframe selection policy used in ORB-SLAM3, and we compare the semantic delay of the Mask R-CNN case and the SegNet case using the TUM dataset, as shown in Fig. 16. The semantic delay is influenced by the following factors: 1) the segmentation speed; 2) the keyframe selection policy; 3) the undetermined influence of the different running speeds of multiple threads (e.g., the Loop Closing thread); and 4) the hardware configuration. In the fr3/w/xyz sequence, the camera sometimes moves very slowly and sometimes moves forward or backward. As a result, this changes the keyframe selection frequency and causes the variance of the semantic delay.

FIGURE 15. Result in a real environment. The green features are in the initial status and their moving probability is 0.5; the blue features are static features and the red ones are outliers. (a) shows the originally detected ORB features, (b) the output after the tracking-last-frame process, and (c) the result after the tracking-local-map process.

FIGURE 16. Semantic delay on the TUM w/xyz sequence. The average value is 10 for the Mask R-CNN case and 5 for SegNet.

VIII. CONCLUSION
A novel vSLAM system, semantic-based real-time visual SLAM (RDS-SLAM) for dynamic environments using an RGB-D camera, is presented. We modify ORB-SLAM3 and add a semantic tracking thread and a semantic-based optimization thread to remove the influence of dynamic objects using semantic information. These new threads run in parallel with the tracking thread, so the tracking thread is not blocked waiting for semantic information. We propose a keyframe selection strategy for semantic segmentation that obtains semantic information as recent as possible and that can deal with segmentation methods of different speeds. We update and propagate semantic information using the moving probability, which is used to detect and remove outliers from tracking via a data association algorithm. We evaluated the tracking performance and the processing time using the TUM dataset.
The comparison against state-of-the-art vSLAMs shows that our method achieves good tracking performance and can track each frame in real time. The fastest speed of the system is about 30 Hz, which is similar to the tracking speed of ORB-SLAM3. In future work, we will try to 1) deploy our system on a real robot, 2) extend our system to stereo and monocular camera systems, and 3) build a semantic map.

REFERENCES

[1] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, ''Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,'' IEEE Trans. Robot., vol. 32, no. 6, pp. 1309–1332, Dec. 2016.
[2] T. Taketomi, H. Uchiyama, and S. Ikeda, ''Visual SLAM algorithms: A survey from 2010 to 2016,'' IPSJ Trans. Comput. Vis. Appl., vol. 9, no. 1, pp. 1–11, Dec. 2017, doi: 10.1186/s41074-017-0027-2.
[3] W. Tan, H. Liu, Z. Dong, G. Zhang, and H. Bao, ''Robust monocular SLAM in dynamic environments,'' in Proc. IEEE Int. Symp. Mixed Augmented Reality (ISMAR), Oct. 2013, pp. 209–218.
[4] W. Dai, Y. Zhang, P. Li, and Z. Fang, ''RGB-D SLAM in dynamic environments using points correlations,'' IEEE Robot. Autom. Lett., vol. 2, no. 4, pp. 2263–2270, Nov. 2018.
[5] S. Li and D. Lee, ''RGB-D SLAM in dynamic environments using static point weighting,'' IEEE Robot. Autom. Lett., vol. 2, no. 4, pp. 2263–2270, Oct. 2017.
[6] Y. Sun, M. Liu, and M. Q.-H. Meng, ''Improving RGB-D SLAM in dynamic environments: A motion removal approach,'' Robot. Auto. Syst., vol. 89, pp. 110–122, Mar. 2017.
[7] D.-H. Kim, S.-B. Han, and J.-H. Kim, ''Visual odometry algorithm using an RGB-D sensor and IMU in a highly dynamic environment,'' in Robot Intelligence Technology and Applications 3, vol. 345. New York, NY, USA: Springer-Verlag, 2015, pp. 11–26.
[8] Y. Fan, Q. Zhang, S. Liu, Y. Tang, X. Jing, J. Yao, and H. Han, ''Semantic SLAM with more accurate point cloud map in dynamic environments,'' IEEE Access, vol. 8, pp. 112237–112252, 2020.
[9] B. Bescos, J. M. Facil, J. Civera, and J. Neira, ''DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes,'' IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 4076–4083, Oct. 2018.
[10] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, ''DS-SLAM: A semantic visual SLAM towards dynamic environments,'' in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 1168–1174.
[11] J. Cheng, Z. Wang, H. Zhou, L. Li, and J. Yao, ''DM-SLAM: A feature-based SLAM system for rigid dynamic scenes,'' ISPRS Int. J. Geo-Inf., vol. 9, no. 4, pp. 1–18, 2020.
[12] F. Zhong, S. Wang, Z. Zhang, C. Chen, and Y. Wang, ''Detect-SLAM: Making object detection and SLAM mutually beneficial,'' in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2018, pp. 1001–1010.
[13] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou, ''Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment,'' Robot. Auto. Syst., vol. 117, pp. 1–16, Jul. 2019.
[14] M. A. Fischler and R. Bolles, ''Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,'' Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
[15] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, ''Mask R-CNN,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[16] R. Mur-Artal and J. D. Tardós, ''ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,'' IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, Oct. 2017.
[17] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, ''ORB: An efficient alternative to SIFT or SURF,'' in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 2564–2571.
[18] C. Campos, R. Elvira, J. J. Gómez Rodríguez, J. M. M. Montiel, and J. D. Tardós, ''ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM,'' 2020, arXiv:2007.11898.
[19] R. Elvira, J. D. Tardós, and J. M. M. Montiel, ''ORBSLAM-Atlas: A robust and accurate multi-map system,'' in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Nov. 2019, pp. 6253–6259.
[20] C. Kerl, J. Sturm, and D. Cremers, ''Dense visual SLAM for RGB-D cameras,'' in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Nov. 2013, pp. 2100–2106.
[21] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, ''Bundle adjustment—A modern synthesis,'' in Proc. Int. Workshop Vis. Algorithms, 2000, pp. 298–372.
[22] V. Badrinarayanan, A. Kendall, and R. Cipolla, ''SegNet: A deep convolutional encoder-decoder architecture for image segmentation,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017.
[23] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, ''OctoMap: An efficient probabilistic 3D mapping framework based on octrees,'' Auto. Robots, vol. 34, no. 3, pp. 189–206, Apr. 2013.
[24] B. Bescos, C. Campos, J. D. Tardós, and J. Neira, ''DynaSLAM II: Tightly-coupled multi-object tracking and SLAM,'' 2020, arXiv:2010.07820.
[25] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, ''A benchmark for the evaluation of RGB-D SLAM systems,'' in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2012, pp. 573–580.
[26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, ''SSD: Single shot multibox detector,'' in Computer Vision—ECCV 2016 (Lecture Notes in Computer Science), vol. 9905. Cham, Switzerland: Springer, 2016, pp. 21–37.
[27] N. Dvornik, K. Shmelkov, J. Mairal, and C. Schmid, ''BlitzNet: A real-time deep network for scene understanding,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4174–4182.
[28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, ''Microsoft COCO: Common objects in context,'' in Computer Vision—ECCV 2014 (Lecture Notes in Computer Science), vol. 8693. Cham, Switzerland: Springer, 2014, pp. 740–755.
[29] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, ''The Pascal visual object classes (VOC) challenge,'' Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010.
[30] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. Cambridge, MA, USA: MIT Press, 2005.

YUBAO LIU received the bachelor's degree in computer science from Qufu Normal University, Qufu, China, in 2012, and the master's degree in computer science from Capital Normal University, Beijing, China, in 2015. He is currently pursuing the Ph.D. degree with the Toyohashi University of Technology, Toyohashi, Japan. In 2015, he joined the Intel Research Center, Beijing, and he transferred to Isoftstone, Beijing, in 2016, as a Senior Software Engineer, working on computer vision and AR. His research interests include pattern recognition and SLAM for AR and smart robotics.

JUN MIURA (Member, IEEE) received the B.Eng. degree in mechanical engineering and the M.Eng. and Dr.Eng. degrees in information engineering from The University of Tokyo, Tokyo, Japan, in 1984, 1986, and 1989, respectively. In 1989, he joined the Department of Computer-Controlled Mechanical Systems, Osaka University, Suita, Japan. Since April 2007, he has been a Professor with the Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi, Japan. From March 1994 to February 1995, he was a Visiting Scientist with the Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, USA. He has published over 220 articles in international journals and conferences in the areas of intelligent robotics, mobile service robots, robot vision, and artificial intelligence. He received several awards, including the Best Paper Award from the Robotics Society of Japan, in 1997, the Best Paper Award Finalist at ICRA-1995, and the Best Service Robotics Paper Award Finalist at ICRA-2013.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2021年/Stereo camera visual SLAM with hierarchical masking and motion_state classification at outdoor construction sites containing large dynamic object.pdf b/动态slam/2020年-2022年开源动态SLAM/2021年/Stereo camera visual SLAM with hierarchical masking and motion_state classification at outdoor construction sites containing large dynamic object.pdf
new file mode 100644
index 0000000..f1273ab

STEREO CAMERA VISUAL SLAM WITH HIERARCHICAL MASKING AND MOTION-STATE CLASSIFICATION AT OUTDOOR CONSTRUCTION SITES CONTAINING LARGE DYNAMIC OBJECTS

arXiv:2101.06563v1 [cs.RO] 17 Jan 2021

Runqiu Bao, Dept. of Precision Engineering, The University of Tokyo, Tokyo, Japan, bao@robot.t.u-tokyo.ac.jp
Ren Komatsu, Dept. of Precision Engineering, The University of Tokyo, Tokyo, Japan, komatsu@robot.t.u-tokyo.ac.jp
Renato Miyagusuku, Dept. of Mechanical and Intelligent Engineering, Utsunomiya University, Utsunomiya, Tochigi, Japan, miyagusuku@cc.utsunomiya-u.ac.jp
Masaki Chino, Construction Division, HAZAMA ANDO CORPORATION, Tokyo, Japan, chino.masaki@ad-hzm.co.jp
Atsushi Yamashita, Dept. of Precision Engineering, The University of Tokyo, Tokyo, Japan, yamashita@robot.t.u-tokyo.ac.jp
Hajime Asama, Dept. of Precision Engineering, The University of Tokyo, Tokyo, Japan, asama@robot.t.u-tokyo.ac.jp

January 19, 2021

ABSTRACT

At modern construction sites, utilizing GNSS (Global Navigation Satellite System) to measure the real-time location and orientation (i.e. pose) of construction machines and navigate them is very common. However, GNSS is not always available.
Replacing GNSS with on-board cameras and visual simultaneous localization and mapping (visual SLAM) to navigate the machines is a cost-effective solution. Nevertheless, at construction sites, multiple construction machines will usually work together and side-by-side, causing large dynamic occlusions in the cameras' view. Standard visual SLAM cannot handle large dynamic occlusions well. In this work, we propose a motion segmentation method to efficiently extract static parts from crowded dynamic scenes to enable robust tracking of camera ego-motion. Our method utilizes semantic information combined with object-level geometric constraints to quickly detect the static parts of the scene. Then, we perform a two-step coarse-to-fine ego-motion tracking with reference to the static parts. This leads to a novel dynamic visual SLAM formation. We test our proposals through a real implementation based on ORB-SLAM2 and datasets we collected from real construction sites. The results show that when standard visual SLAM fails, our method can still retain accurate camera ego-motion tracking in real-time. Compared to state-of-the-art dynamic visual SLAM methods, ours shows outstanding efficiency and competitive trajectory accuracy.

∗ Code available at: https://github.com/RunqiuBao/kenki-positioning-vSLAM
† Corresponding author. Email: bao@robot.t.u-tokyo.ac.jp

A PREPRINT - JANUARY 19, 2021

Keywords: dynamic visual SLAM, motion segmentation, hierarchical masking, object motion-state classification, ego-motion tracking

1 Introduction

Knowledge of the real-time location and orientation (i.e. pose) of construction machines, such as bulldozers, excavators, and vibration rollers, is essential for the automation of construction sites. Currently, RTK-GNSS (Real-Time Kinematic Global Navigation Satellite System) is widely used because of its centimeter-level location accuracy.
However, in addition to the high price, the location output of RTK-GNSS can be unstable due to loss of satellite signals underground, near mountains and trees, and between tall buildings. Therefore, replacing RTK-GNSS with on-board cameras and visual SLAM (vSLAM) has been proposed [1]. Assuming the machine's starting pose is known in a global coordinate system, the relative pose outputs from vSLAM can be used to navigate the machine.

However, at construction sites, several machines usually work together and side-by-side (Figure 1), which results in large dynamic occlusions in the view of the cameras. Such dynamic occlusions can occupy more than 50% of the image, leading to a dramatic drop in tracking accuracy or even tracking failure when using standard vSLAM. We introduce this problem distinctly in the context of dynamic vSLAM and propose an original stereo camera dynamic vSLAM formation.

To deal with dynamic occlusions, our idea is to first detect static objects and backgrounds, and then track ego-motion with reference to them. To achieve this, we need to estimate the real motion-states of objects. We use learning-based object detection and instance segmentation combined with object-wise geometric measurement in stereo frames to label the motion-states of object instances and generate occlusion masks for dynamic objects. Additionally, two types of occlusion masks are applied to balance accuracy and computation cost: bounding box masks for small occlusions and pixel-wise masks for large occlusions. Pixel-wise masks describe the boundaries of objects more accurately, while bounding boxes are faster to predict but less precise.

In a nutshell, our contributions in this work include: (1) a semantic-geometric approach to detect static objects and static backgrounds for stereo vSLAM, (2) a masking technique for dynamic objects called hierarchical masking, and (3) a novel stereo camera dynamic visual SLAM system for construction sites.
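The mask-type trade-off described above (cheap bounding-box masks by default, pixel-wise masks only when occlusions grow large) can be sketched as follows. This is our illustration, not code from the paper's repository; the function names and the exact threshold placement are assumptions:

```python
import numpy as np

def masked_area_ratio(boxes, img_w, img_h):
    """Fraction of the image covered by the union of bounding boxes
    (boxes given as (x, y, w, h) with (x, y) the top-left corner)."""
    mask = np.zeros((img_h, img_w), dtype=bool)
    for x, y, w, h in boxes:
        mask[y:y + h, x:x + w] = True  # union handles overlapping boxes
    return mask.sum() / float(img_w * img_h)

def choose_mask_type(boxes, img_w, img_h, tau_mar=0.5):
    """Switch to costly pixel-wise segmentation only when the frame is crowded."""
    if masked_area_ratio(boxes, img_w, img_h) >= tau_mar:
        return "pixel-wise"
    return "bounding-box"
```

With a 200 × 200 image, a single 100 × 100 box covers a quarter of the frame and keeps the cheap bounding-box mask, while a 200 × 120 box crosses the 0.5 threshold and triggers pixel-wise segmentation.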
The remainder of this work is organized as follows: In Section 2, we summarize the existing research on dynamic visual SLAM and motion segmentation methods, and describe the features of this work. In Section 3, the system structure and our original proposals (two algorithms) are introduced. In Section 4, to test the performance of our proposals, we conducted experiments at real construction sites and built datasets for algorithm evaluation. We used the Absolute Trajectory RMSE [2] to evaluate the accuracy of the location outputs of the vSLAM system. Finally, Section 5 contains the conclusions and future work plan.

Figure 1: Simultaneous working of construction machines causing large-area moving occlusions in on-board cameras' view.

Figure 2: Cameras are mounted on top of our construction machine facing the sides, and RTK-GNSS is used to collect ground truth positions.

2 Related Work

2.1 Dynamic Visual SLAM

Standard visual SLAM (vSLAM) assumes that the environment is static. Correspondingly, vSLAM for dynamic environments (dynamic vSLAM or robust vSLAM) distinguishes static and dynamic features and computes pose estimates based solely on static features.

Depending on the application, dynamic vSLAM can be categorized into two classes. One solely builds a static background model, ignoring moving objects [3, 4, 2]. The other aims at not only creating a static background map, but also simultaneously maintaining sub-maps of moving objects [5, 6, 7]. Our task, i.e. positioning of construction machines, requires fast and accurate camera ego-motion tracking and thus belongs to the first class.

The real-time positioning task at construction sites brings a new problem to vSLAM. Specifically, we found that at a busy construction site, there are often many machines, trucks and persons moving around, which become large dynamic occlusions (occlusion rate >50% from time to time) in the camera view.
Besides, such occlusions usually contain more salient feature points than earthen ground and cause chaos in feature-based camera ego-motion tracking. Even existing dynamic vSLAM solutions may suffer from various issues and are thus not the optimal solution for this task. For example, [8, 9, 10, 11] proposed very fast methods for dealing with dynamic objects. Yet, they did not explicitly consider overly-large dynamic occlusions and thus might suffer from an accuracy drop. [2] and [6] proposed very robust methods for masking dynamic occlusions, but both of them require heavy computation and are not suitable for the real-time positioning task. Therefore, we propose our own dynamic vSLAM solution for real-time positioning at dynamic construction sites.

In a dynamic vSLAM system, there are two major modules: (1) motion segmentation and (2) localization and mapping [12]. Motion segmentation is the key part that distinguishes an outstanding dynamic vSLAM system from the rest.

2.2 Motion Segmentation

Motion segmentation is aimed at detecting moving parts in the image and classifying the features into two groups, static and dynamic features.

Standard visual SLAM achieves this by applying robust statistical approaches to the estimation of geometric models, such as Random Sample Consensus (RANSAC) [13]. However, such an approach may fail when large dynamic occlusions exist and static features are not in the majority. Other approaches leverage external sensors such as inertial measurement units (IMU) to fix camera ego-motion. In the following, we focus on visual-only approaches to distinguishing static and dynamic features. Muhamad et al. [12] summarize this research area well; for more details, please refer to that survey.

The most intuitive approach for motion segmentation is using semantic information to separate object instances that may move in the scene. To obtain semantic information, Bârsan et al.
[6] used learning-based instance segmentation to generate pixel-wise masks for object instances. Cui et al. [14] proposed using only bounding boxes obtained from YOLO v3 [15] to filter dynamic objects, which can reduce computation cost. However, these works simply assume that movable objects are dynamic. End-to-end learning-based methods for motion segmentation (without prior information about the environment) are still scarce [12].

Another common strategy for motion segmentation is utilizing geometric constraints. It leverages the fact that dynamic features will violate constraints defined in multi-view geometry for static scenes. Kundu et al. [16] detected dynamic features by checking whether the points lie on the epipolar line in the subsequent view and used the Flow Vector Bound (FVB) to distinguish the motion-states of 3D points moving along the epipolar line. Migliore et al. [17] kept checking the intersection between three projected viewing rays in three different views to confirm static points. Tan et al. [18] projected existing map points into the current frame to check whether a feature is dynamic. It is difficult for us to evaluate these methods. However, one obvious drawback is that they require complicated modifications to the low-level components of a standard visual SLAM algorithm once the static environment assumption is dropped. We argue that such modifications are not good for the modularity of a vSLAM system.

As a novel hybrid approach, Bescos et al. [2], in their work named DynaSLAM, proposed to combine learning-based instance segmentation with multi-view geometry to refine masks for objects that are not a priori dynamic, but movable. Our system follows the hybrid fashion of DynaSLAM, but we treat motion segmentation as an object-level classification problem.
Our idea is that, by triangulating and measuring the positions of points inside the bounding boxes and comparing them between frames, we can estimate an object-level motion-state for every bounding box (assuming all objects are rigid). If we know the motion-state of every bounding box, the surroundings can easily be divided into static and dynamic parts.

Besides, bounding boxes of large dynamic occlusions reduce the available static features. We will show that it is essential to keep the overall masked area under a certain threshold if possible. Hence, we designed an algorithm named hierarchical masking that refines a pixel-wise mask inside the bounding box when the overall masked area extends past a threshold, to save scarce static features. This hierarchical masking algorithm is also an original proposal of ours.

3 Stereo Camera Dynamic Visual SLAM Robust against Large Dynamic Occlusions

The core problem in this research is to achieve fast and accurate camera ego-motion tracking when there are large occlusions in the camera's view. Subsection 3.1 is a general introduction of the system pipeline. In Subsection 3.2, the principle of feature-based camera ego-motion tracking with occlusion masks for dynamic occlusions is introduced. In order to balance computation speed and accuracy in occlusion mask generation, a hierarchical masking approach is proposed in Subsection 3.3. Last, through stereo triangulation and comparison, object instances in the current frame are assigned a predicted motion-state label, static or dynamic, which leads to further mask refining and a second round of tracking.

3.1 System Overview

The system installation is illustrated in Figure 2 and the system pipeline is shown in Figure 3. Inputs are stereo frames (left image and right image) captured by a stereo camera. Then semantic information, including object labels and bounding boxes, is extracted using learning-based object detection.
In addition, a hierarchical mask generation approach is proposed to balance mask accuracy and generation speed. Object masks exclude suspicious dynamic objects from the static background. The features in the static background are then used in the initial tracking of the camera pose.

After initial tracking, a rough pose of the new frame is known, with which we distinguish static objects from other objects. This is done by triangulating object-level 3D key points in the reference and current frames and comparing the 3D position errors to determine whether the object is moving or not. Large static objects can provide more salient static features for improving tracking accuracy. Dynamic objects are kept masked in the second ego-motion tracking. This two-round coarse-to-fine tracking scheme helps detect static objects and improve pose estimation accuracy.

After the second round of tracking, there are mapping and pose graph optimization steps, as in most state-of-the-art vSLAM algorithms.

3.2 Feature-based Camera Ego-motion Tracking by Masking Dynamic Occlusions

The camera ego-motion tracking framework used here is based on ORB-SLAM2 stereo [19]. When a new frame comes in, first, a constant velocity motion model is used to predict the new camera pose, with which we can search for matches between map points and 2D feature points. After enough matches are found, a new pose can be re-estimated by the Perspective-n-Point (PnP) algorithm [20]. Motion-only bundle adjustment (BA) is then used for further pose optimization. Motion-only BA estimates the camera pose of the new stereo frame, including orientation R ∈ SO(3) and position t ∈ R^3, by minimizing the reprojection error between matched 3D points x_i ∈ R^3 in the SLAM coordinates and feature points p_i^(.) in the new frame, where i = 1, 2, ..., N. p_i^(.) includes monocular feature points p_i^m ∈ R^2 and stereo feature points p_i^s ∈ R^3.
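The monocular and rectified-stereo pinhole projections used in this reprojection error (Eqs. (2) and (3) below) can be sketched in NumPy as follows; parameter names follow the paper's notation, but the functions themselves are our illustration:

```python
import numpy as np

def project_mono(X, fx, fy, cx, cy):
    """Monocular pinhole projection: 3D point (X, Y, Z) -> (u_l, v_l)."""
    x, y, z = X
    return np.array([fx * x / z + cx, fy * y / z + cy])

def project_stereo(X, fx, fy, cx, cy, b):
    """Rectified stereo projection: 3D point -> (u_l, v_l, u_r), where u_r
    is the horizontal coordinate seen from a camera shifted by baseline b."""
    x, y, z = X
    return np.array([fx * x / z + cx,
                     fy * y / z + cy,
                     fx * (x - b) / z + cx])
```

For f_x = f_y = 100, c_x = c_y = 50 and b = 0.5, the point (1, 1, 2) projects to (100, 100) monocularly and (100, 100, 75) in the stereo model: the disparity u_l − u_r = f_x b / Z is what makes depth recoverable from a rectified pair.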
Now suppose M out of N 3D points are on a rigid-body dynamic object that had a pose change R′, t′ in the physical world, and their 3D coordinates change from x_i to x′_i, for i = 1, 2, ..., M. The rigid body transformation can be expressed as x′_i = R′x_i + t′. Pose change estimation can then be expressed as:

{R, t, R′, t′} = argmin_{R, t, R′, t′} Σ_{i=1}^{M} ρ( ‖p_i^(.) − π^(.)(R(R′x_i + t′) + t)‖²_Σ ) + Σ_{i=M+1}^{N} ρ( ‖p_i^(.) − π^(.)(Rx_i + t)‖²_Σ ),   (1)

where ρ is the robust Huber cost function that controls the error growth of the quadratic function, and Σ is the covariance matrix associated with the scale of the feature point. The projection functions π^(.) include the monocular π_m and rectified stereo π_s, as defined in [19]:

π_m([X, Y, Z]^T) = [f_x X/Z + c_x, f_y Y/Z + c_y]^T = [u_l, v_l]^T,   (2)

π_s([X, Y, Z]^T) = [f_x X/Z + c_x, f_y Y/Z + c_y, f_x (X − b)/Z + c_x]^T = [u_l, v_l, u_r]^T,   (3)

where (f_x, f_y) is the focal length, (c_x, c_y) is the principal point, and b the baseline. (u_l, v_l) represents the monocular feature points and (u_l, v_l, u_r) the stereo feature points.

Figure 3: An overview of the proposed system. Inputs are stereo frames (all processing is on the left image; the right image is only used for triangulating 3D points). After semantic information extraction, occlusion masks of objects are generated and used to filter potential dynamic features. The initial ego-motion tracking is based purely on the static background. Then more static objects are found and used as references in the second round of tracking to get more accurate results. The final output is the camera pose R and t of the current frame in the SLAM coordinates.

However, solving equation (1) is not easy, not to mention that there can be more than one dynamic object in the real world. If we only want to estimate R, t, equation (1) can be simplified to:

{R, t} = argmin_{R, t} Σ_{i=M+1}^{N} ρ( ‖p_i^(.) − π^(.)(Rx_i + t)‖²_Σ ),   (4)

which means using only static points in the scene to estimate the camera pose. If dynamic feature points are not excluded as moving outliers, the estimation result will be wrong.

To separate static and dynamic feature points, our approach is to use a binary image as a mask (for the left image of the input stereo frame). The mask has the same size as the input image; pixels with value 0 indicate static areas, while pixels with value 1 indicate dynamic areas. Suppose that I_mask(u, v) refers to a pixel in the mask image I_mask, S_p is the set of static pixels, and D_p is the set of dynamic pixels; then

I_mask(u, v) = { 0, if I_mask(u, v) ∈ S_p;  1, if I_mask(u, v) ∈ D_p }.   (5)

Figure 4 shows examples of masks (with alpha blending). To generate a mask, we first get bounding boxes or pixel-wise segmentation results from learning-based object detection and instance segmentation (Subsection 3.3). Then, for those objects with an a priori dynamic semantic label such as "car", "person", "truck", etc., we set the pixels' value to 1 in the mask image, while keeping the others at 0. We also apply geometric measurement and calculate a motion-state label for every object bounding box. Inside a static bounding box, we set the pixels' value to 0 whatever it was (Subsection 3.4). Later, during the ego-motion tracking period, only the areas where the mask value equals 0 are used to estimate the camera pose, as described by Equation (4).

3.3 Hierarchical Object Masking

The switching between two types of masks forms a hierarchical masking strategy that balances computation speed and mask accuracy.

To reduce computation cost, we first use object detectors, e.g. EfficientDet [21], to predict object instances and recognize their bounding boxes. Such a learning-based object detector is a deep neural network, which can predict all the bounding boxes, class labels, and class probabilities directly from an image in one evaluation.
A bounding box only represents a rough boundary of the object, so when using it as an object mask, background feature points inside the rectangle are also classified as "object". It is, therefore, only a rough boundary description.

There were cases when bounding boxes occupied most of the area in the image, which led to a shortage of available static features, and thus the accuracy of the ego-motion tracking declined. In such cases, we perform pixel-wise segmentation on the image to save more static features. For pixel-wise segmentation, we also use deep learning approaches, such as Mask R-CNN [22]. Pixel-wise segmentation takes more time and slows down the system output rate. Thus, pixel-wise segmentation should be performed only in extreme cases, when the frame is crowded with object bounding boxes.

Figure 4: Two kinds of masks and masked features.

Algorithm 1: Hierarchical Masking
Input: stereo images in the current frame, I_cl, I_cr; Masked Area Ratio threshold, τ_mar.
Output: image mask for the left image in the current frame, I_mask.
Initialisation: a blank image mask, I_mask; initial Masked Area Ratio mar = 0;
1: I_mask = objectDetectionAndMasking(I_cl);
2: mar = calMaskedAreaRatio(I_mask);
3: if (mar ≥ τ_mar) then
4:   I_mask = pixelwiseSegmentationAndMasking(I_cl);
5: end if
6: return I_mask

The switching to pixel-wise segmentation is controlled by an index named Masked Area Ratio (mar). If A_m is the total area of bounding boxes in pixels and A_f is the total area of the image in pixels, then we have

mar = A_m / A_f.   (6)

If mar is larger than the threshold τ_mar, it means the current frame is quite crowded and pixel-wise segmentation is necessary.

Hierarchical object masking is summarized as follows: when we get one frame of input, we first run the object detector to perform object detection and obtain bounding boxes. Then mar is calculated.
If mar is higher than a pre-set threshold τ_mar, we perform pixel-wise segmentation and output the pixel-wise object mask. If mar is smaller than the threshold, the bounding box masks are directly forwarded as the object mask. This algorithm is summarized in Algorithm 1.

3.4 Objects' Motion-state Classification for Further Mask Refinement

After the first ego-motion tracking, with reference to the background, we roughly know the pose of the current frame. Based on the current pose, we triangulate object-level 3D points on all the detected object instances in the current frame and a selected reference frame and determine whether they have moved. Feature points inside static bounding boxes are then unmasked and used as valid static references in the second round of tracking. This algorithm (Algorithm 2), named motion-state classification, is detailed in the following.

To classify objects' motion-states, first, a reference frame needs to be selected from the previous frames. In this work, we used the N-th frame before the current frame as the reference frame. N is determined based on the machines' velocity. For example, for vibration rollers moving mostly at 4 km/h, FPS/3 to FPS/2 can be selected as N (FPS stands for the frame rate of camera recording, namely frames per second). For domestic automobiles running at higher speeds, N should be selected smaller so that there is an appropriate visual change between the current and reference frames. This strategy is simple but effective, given the simple moving pattern of construction machines. There are more sophisticated methods for selecting the best reference frame, as stated in [2] and [18].

Then, suppose there are objects {obj_1, obj_2, ..., obj_m} in the reference frame (RF) and objects {obj′_1, obj′_2, ..., obj′_n} in the current frame (CF). We associate the m objects in RF with the n objects in CF by feature matching.
If the object instances are associated successfully between two frames, which means the object is co-visible in both frames, we triangulate 3D points within the bounding boxes in both frames in SLAM coordinates and calculate point-wise position errors. The position errors of 3D points on the static background are assumed to obey a zero-mean Gaussian distribution. The standard deviation, σ_bkg, is determined beforehand and used as the threshold for classification. For static objects, in principle, all 3D points' position errors should be less than three times σ_bkg. But considering the inaccuracy of a bounding box, we loosened the condition to 70%, i.e. objects are classified as "static" when more than 70% of their 3D points have a position error smaller than (3 × σ_bkg). However, outliers of feature matching usually result in very large position errors. We only keep points with position errors smaller than the median to exclude outliers.

Figure 5: Associate bounding boxes between the Reference Frame (RF) and Current Frame (CF) using feature matching. Triangulate object-level 3D points in RF, then triangulate corresponding 3D points in CF and compare their positions in the two measurements. If most of the point-wise position errors of an object (bounding box) are smaller than three times the standard deviation of static background points, the object is labeled as 'static' during the camera pose change from RF to CF.

Figure 6: Algorithm 2: Objects' Motion-state Classification.

Figure 7: Experiment setting. (a) Construction site bird view. (b) Vibration roller.

Figure 5 shows the principle of the geometric constraint; the left one is a dynamic object and the right one is a static object. Figure 6 shows the input and output as well as the main ideas of Algorithm 2. Details about how to implement this algorithm can be found in our code repository.
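The per-object test just described (keep errors up to the median, then require 70% of the remaining 3D points to have moved less than 3σ_bkg) can be condensed into a short sketch. This is our simplified rendering, not the repository code, and it assumes point association between the two frames is already done:

```python
import numpy as np

def classify_motion_state(pts_ref, pts_cur, sigma_bkg, static_ratio=0.7):
    """Label one associated object 'static' or 'dynamic' from point-wise
    position errors between its triangulated 3D points in the reference
    frame (pts_ref) and current frame (pts_cur), both (N, 3) arrays in
    SLAM coordinates."""
    err = np.linalg.norm(pts_cur - pts_ref, axis=1)
    err = err[err <= np.median(err)]          # drop matching outliers above the median
    frac_static = np.mean(err < 3.0 * sigma_bkg)
    return "static" if frac_static > static_ratio else "dynamic"
```

With the paper's σ_bkg = 0.12, points that agree to within a centimeter or two vote "static", while a half-meter displacement between frames immediately pushes the object to "dynamic".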
4 Experimental Evaluations

4.1 Testing Environments and Datasets

To evaluate our proposed approaches, we conducted experiments at two construction sites in Japan with a machine called a vibration roller, as shown in Figure 7(b). A vibration roller is used to flatten the earthen basement of structures and facilities. For efficiency of work, there are usually multiple rollers running simultaneously and side by side; thus, large moving occlusions become a serious problem for visual SLAM.

In all experiments, a stereo camera was mounted on the cabin top of a roller facing the side. The baseline of the stereo camera was about 1 m. The roller moved along a typical trajectory (Figure 7(a)) with a maximum speed of 11 km/h. The ground truth trajectories were recorded using RTK-GNSS. We synchronized the ground truth and estimated camera poses by minimizing the Absolute Trajectory RMSE ([2, 19, 23]) and choosing appropriate time offsets between the GNSS's and the camera's timers. The estimated camera trajectories were then aligned with the ground truth trajectories by the Umeyama algorithm [24]. We evaluate the accuracy of the camera pose outputs of the vSLAM system with reference to the associated ground truth by the Absolute Trajectory RMSE (AT-RMSE).

Video data were collected at the site and evaluated in the lab. Image resolution was 3840 × 2160, and the frame rate was 60 fps. For efficient evaluation, we downsampled the image sequences to 960 × 540 and 6 fps. We eventually collected five image sequences: three with dynamic machines inside, the 4th containing only two static machines, and the 5th without any occlusions. The specifications of the computer used were an Intel Core i7-9700K CPU and an NVIDIA GeForce RTX 2080 Ti GPU. We used a tool provided by [25] for trajectory accuracy evaluation and visualization.
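The evaluation protocol (align the estimated trajectory to ground truth, then compute AT-RMSE) can be sketched as below. We use the rotation-plus-translation closed form via SVD; the paper cites the Umeyama algorithm [24], which additionally estimates a scale factor that a metric stereo system does not need. Function names are our own:

```python
import numpy as np

def align_rigid(est, gt):
    """Closed-form rigid (rotation + translation) alignment of an estimated
    trajectory est to ground truth gt, both (N, 3) arrays, via SVD."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # reflection guard: keep det(R) = +1
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = mu_g - R @ mu_e
    return (R @ est.T).T + t

def ate_rmse(est, gt):
    """Absolute Trajectory RMSE after rigid alignment."""
    aligned = align_rigid(est, gt)
    return np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))
```

Any rigid offset between the GNSS frame and the SLAM frame is removed by the alignment, so the reported RMSE reflects only trajectory shape errors, which is exactly why the paper notes that small AT-RMSE differences can understate the real accuracy gap.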
When evaluating our vSLAM system implementation, all the masks, including bounding box and pixel-wise masks, are generated beforehand using EfficientDet [21] and the Detectron2 [26] version of Mask R-CNN [22]. EfficientDet is reported to be able to prioritize detection speed or detection accuracy through configuration. In our implementation, we used EfficientDet-D0, with weights trained on the MS COCO dataset [27]. The weights for Mask R-CNN are also trained on the MS COCO dataset [27]. Without fine-tuning, they are already good enough for this study. Besides, when calculating the overall computation time per frame, we record the time consumption of the vSLAM tracking part and the mask generation part separately, and then add them together. Note that in hierarchical masking, the additional time caused by pixel-wise segmentation is averaged over all the frames.

Figure 8: Quantitative evaluation for the estimated trajectory of image sequence 1 "kumamoto1". (a) Absolute position error of every camera pose. (b) Camera trajectory with colormapped position error.

Table 1: Details about the five image sequences.

Dataset details | kumamoto1 | kumamoto2 | chiba1 | chiba2 | chiba3
Max. occlusion ratio | 0.493 | 0.445 | 0.521 | 0.633 | 0.0
MAR>0.5 frames | 0/1263 | 0/1186 | 12/647 | 69/668 | 0/708
Machines' speed | 0 to 4 km/h | 0 to 4 km/h | 0 to 4 km/h | 0 to 4 km/h | 0 to 4 km/h
Occlusions & their motion-states | 1 roller (dynamic) | 1 roller (dynamic) | 1 roller (dynamic), 1 roller (static), 7 color cones (static), 1 checkerboard (static) | 2 rollers (static), 7 color cones (static), 1 checkerboard (static) | no occlusions

4.2 Performance Evaluation with Our Construction Site Datasets

Figure 8(a) shows the absolute position error of every camera pose between the estimated trajectory using the proposed system and the ground truth of a sequence (kumamoto2).
Figure 8(b) is a bird's eye view of the camera trajectory with colormapped absolute position error. In total, five sequences were prepared, and we repeated the evaluation 10 times for each sequence. The details of the five sequences are described in Table 1.

Figure 9: Performance comparison on our construction site datasets. (a) Estimated trajectory accuracy (lower is better). (b) Averaged computation speed (lower is better).

Figure 10: Dynamic scene and hierarchical masking example. (a) Three machines working parallel to each other. (b) From the viewpoint of the on-board camera.

Figure 9(a) shows the distribution of the Absolute Trajectory RMSE of all five sequences. We compare our proposed system with a simple baseline system, with DynaSLAM [2], and with the original ORB-SLAM2 stereo. The baseline system is also based on ORB-SLAM2 but is able to detect and remove moving objects. Its "moving object removal" method is derived from Detect-SLAM [9], which performs bounding box detection and masks all movable bounding boxes detected. In the results, our proposed system shows better trajectory accuracy than the baseline in three sequences out of five: kumamoto1, chiba1 and chiba3. If the baseline represents fast and efficient handling of dynamic objects, DynaSLAM is much heavier computationally. But the motion segmentation method in DynaSLAM is pixel-level precise and indeed the current state of the art. The experimental results show that DynaSLAM does have slightly better trajectory accuracy in some sequences, including kumamoto1 and chiba1. The original ORB-SLAM2 stereo can only survive chiba2 and chiba3, which are completely static. In addition, the trajectory accuracies of chiba2 and chiba3 are generally better than those of the dynamic sequences, no matter which method is used. Dynamic occlusions do cause an irreversible influence on camera ego-motion tracking.
Averaged computation speed comparisons are shown in Figure 9(b). Our proposed system is slower than the baseline and ORB-SLAM2 stereo at first. However, our method can be significantly accelerated by utilizing parallel computing such as GPU acceleration. In the implementation named "ours_gpu" in Figure 9, we enabled GPU acceleration for all the ORB feature extractions, and the speed improved notably. However, the trajectory accuracy differed from "ours" to a certain extent, although theoretically they should be the same. We are still looking for the root cause. Finally, the time cost of DynaSLAM (tracking only, without background inpainting) is 2 to 3 times that of ours_gpu. Large computation latency is not preferable, since our targeted task is real-time positioning and navigation of a construction machine.

4.3 Ablation Study

4.3.1 Hierarchical Object Masking

Hierarchical masking aims to efficiently propose an appropriate initial mask in case there are overly-large dynamic occlusions in the image. Figure 10(a) shows a scene where the machine was working along with two other machines and thus had two large occlusions in the camera view. Figure 10(b) shows a sample image recorded by the on-board camera. Notice that the two rectangles labeled as truck are bounding boxes detected by the object detection algorithm, and the color masks inside the bounding boxes are from pixel-wise segmentation. Besides, ORB feature points are extracted and marked on this image. Green points are static features on the static background, blue points are those included by bounding boxes but not included by pixel-wise masks, and red points are features masked by pixel-wise masks. It is obvious that the bounding box mask causes many innocent static features to be treated as dynamic. Through a toy experiment, we can see how this causes a shortage of available feature points and leads to worse pose tracking accuracy.
Then, with a real example from our datasets, we explain the effectiveness of hierarchical masking.

Figure 11: A toy experiment: estimated trajectory accuracy when putting different sizes of occlusions on the 4th image sequence "chiba2".

Table 2: Tracking accuracy of "chiba2" with three different mask types.

Mask type | AT-RMSE, m (average of 10 trials) | Max. occlusion ratio
B-box mask | 0.0437 | 0.63
Hierarchical mask | 0.0404 | 0.50
Pixel-wise mask | 0.0397 | 0.32

(1) A toy experiment

We put a fake constant dynamic occlusion at the center of the mask images of the 4th image sequence, chiba2 (a static scene), and adjusted the size of this area to simulate different occlusion ratios and observe how the resulting trajectory accuracy changes. The result is plotted in Figure 11. Before the occlusion ratio reaches 0.6, the trajectory error only varies over a small range; when the occlusion ratio exceeds 0.7, the RMSE increases exponentially due to a shortage of available features. Therefore, when the occlusion ratio of the image approaches the critical point of 0.6, we define it as a large occlusion condition, requiring the refinement of the bounding box mask to a pixel-wise mask to suppress the growing error. Besides, when the occlusion ratio is larger than 0.6, tracking loss will frequently happen, which is not preferred when navigating a construction machine. To avoid tracking loss and relocalization, we set the threshold (τ_mar in Section 3.3) to 0.5 as a safety limit.

However, when the occlusion ratio is far smaller than 0.6, a bounding box mask is enough and also faster to get. With our computer, generating bounding box masks for one image frame takes 0.0207 seconds on average, while a pixel-wise mask takes 0.12 seconds.

(2) An overly large occlusion case

In order to demonstrate the effectiveness of hierarchical masking when facing overly large occlusions, we show an example from the sequence "chiba2".
From the 3500th frame to the 4500th frame (1000 frames in the original 60 fps sequence) of "chiba2", we encountered an overly large occlusion. As Table 2 shows, when changing from the bounding-box mask to the pixel-wise mask, the maximum masked-area ratio drops from 0.63 to 0.32 and, correspondingly, the trajectory error decreases. Hierarchical masking benefits trajectory accuracy, and it costs much less time than using only the pixel-wise mask: in this example, only 2/3 of the frames during this period need a pixel-wise mask, and the maximum masked-area ratio is constrained within 0.5. Note that although the Absolute Trajectory RMSE difference between 0.0404 and 0.0437 in Table 2 seems trivial, this is partially due to the trajectory alignment algorithm [24] used for evaluation; the actual accuracy difference can be larger.

4.3.2 Objects' Motion-state Classification

Not all a priori dynamic objects are moving. Ignoring static objects leads to a loss of information, especially when they are salient and occupy a large area in the image. Therefore, we designed the objects' motion-state classification algorithm to detect static objects and unmask them for ego-motion tracking. Figure 12 shows dynamic and static objects detected in the image sequences, together with scores indicating the possibility of each being dynamic. We also show an example

Figure 12: Illustration of the classification result. In the left column, the third row shows one machine classified as "moving" and another classified as "static" in this frame. The second row shows the position errors of 3D points on these two machines between this frame and the reference frame; points on the "moving" machine have higher position errors. Similarly, in the right column, there are "moving" machines (two parts of one machine) and a "static" color cone.

Table 3: Tracking accuracy with motion-state classification.
Mask type                 AT-RMSE, m   Max. occlusion ratio
All objects masked        0.04973      0.63
Static objects unmasked   0.04198      0.0

of using the proposed algorithm in visual SLAM. Again, we use the 3500th frame to the 4500th frame (1000 frames) of the "chiba2" sequence; since the machines are totally static during this period, they are detected as static and unmasked. Table 3 shows how this influences the tracking accuracy.

However, there is still one bottleneck in this algorithm. σbkg is an essential parameter for the performance of motion-state classification. For all the evaluations above with the four image sequences, we set σbkg to 0.12, a value determined empirically. To explore the influence of this parameter on system performance, we vary σbkg between 0 and 0.6 and evaluate the classifier in terms of ROC (Receiver Operating Characteristic). Since the final target is to find static objects, "static" is regarded as positive and "dynamic" as negative, ignoring objects that cannot be classified. The ROC curve is shown in Figure 13. The true positive rate (TPR, sensitivity) on the y axis is the ratio of true positives to the sum of true positives and false negatives; the false positive rate on the x axis is the ratio of false positives to the sum of false positives and true negatives. According to this curve, the Area Under the Curve (AUC) reaches 0.737, showing the classifier to be valid. The red dot in the plot marks the position where σbkg = 0.12.

4.4 Evaluation with KITTI Dataset

The KITTI Dataset [28] provides stereo camera sequences in outdoor urban and highway environments. It has become a widespread benchmark for evaluating vSLAM system performance, especially trajectory accuracy; works such as [2, 19] all provide evaluation results on KITTI. Some KITTI sequences contain normal-size dynamic occlusions, so KITTI is also appropriate for evaluating our method. Table 4 shows the evaluation results.
The comparison includes four systems: our proposed system, the baseline, DynaSLAM, and ORB-SLAM2 stereo, the same as in Section 4.2. For the baseline, DynaSLAM, and ORB-SLAM2 stereo, all settings remain as before. For our system, τmar (Section 3.3) remains 0.5 and σbkg (Section 3.4) remains 0.12. However, N (Section 3.4) is changed to 2, since the frame rate of KITTI is 10 fps and the cars are much faster than our construction machines. We ran each sequence 10 times with each system and recorded the averaged Absolute Trajectory RMSE (AT-RMSE, m) as well as the averaged computation time per frame (s). For our system, we recorded results both with GPU acceleration (w A) and without it (w/o A). Among the four systems, the best AT-RMSE for each sequence is marked in bold and the best computation time in bold italics. Note that the AT-RMSE results of DynaSLAM

Figure 13: ROC curve for the motion-state classification when σbkg was between 0 and 0.6, estimated with the 3rd image sequence "chiba1". The Area Under Curve (AUC) reached 0.737. The red dot marks the position where σbkg = 0.12.

Table 4: Trajectory accuracy and time consumption evaluation on the KITTI Dataset.
             ours                                  baseline             dynaslam (tracking)  orb-slam2
Sequence     AT-RMSE (m)     time per frame (s)    AT-RMSE  time per    AT-RMSE  time per    AT-RMSE  time per
             w/o A   w A     w/o A   w A           (m)      frame (s)   (m)      frame (s)   (m)      frame (s)
KITTI 00     2.1290  1.7304  0.2018  0.1565        2.0173   0.0912      3.9691   0.3354      1.7304   0.0703
KITTI 01     2.1718  2.2651  0.2076  0.1546        9.1271   0.0917      21.8982  0.3273      8.7620   0.0734
KITTI 02     8.4940  8.7620  0.1860  0.1305        4.9280   0.0935      5.9401   0.3243      4.9994   0.0771
KITTI 03     1.2323  1.3159  0.1791  0.1337        3.1174   0.0898      4.7770   0.3459      3.0735   0.0723
KITTI 04     5.1759  4.7338  0.1764  0.1194        0.9970   0.0864      1.3371   0.3420      1.0079   0.0672
KITTI 05     4.5641  5.2294  0.1945  0.1445        2.0528   0.0923      1.7644   0.3482      1.9751   0.0717
KITTI 06     3.2169  3.4246  0.1462  0.0983        1.9338   0.0943      2.0627   0.3434      1.8793   0.0752
KITTI 07     4.9692  5.8698  0.1760  0.1231        1.1799   0.0843      1.1285   0.3493      0.9733   0.0632
KITTI 08     1.0835  1.2937  0.1811  0.1297        4.7857   0.0882      3.7062   0.3488      4.6483   0.0675
KITTI 09     2.5849  2.6375  0.1522  0.1022        7.1441   0.0865      4.2753   0.3463      5.9788   0.0657
KITTI 10     2.1243  2.2529  0.1915  0.1382        2.6986   0.0912      2.2028   0.3466      2.6699   0.0631

and ORB-SLAM2 stereo differ from the original papers. This is because we only align the trajectory with the ground truth, without adjusting scale, before calculating the trajectory error, since our target is online positioning with vSLAM.

From Table 4, we see that in terms of computation speed, ORB-SLAM2 stereo is always the best, because it adopts the static-environment assumption; DynaSLAM is the slowest. Ours is slightly slower than the baseline and ORB-SLAM2 stereo; however, GPU acceleration does improve the speed to a tolerable level. In terms of AT-RMSE, the results vary, but DynaSLAM and ORB-SLAM2 stereo achieved the best accuracy on most sequences. In the KITTI dataset, there are moving automobiles, bicycles, and persons in some frames, but they are not overly large.
In fact, there are only 6 frames in "07" in which the occlusion ratio exceeds 0.5. Moreover, automobiles on the street do not carry as many salient feature points as construction machines; most of them have texture-less, smooth surfaces. Therefore, our proposed system holds no particular advantage on KITTI.

5 Conclusions & Future Work

We presented a stereo vSLAM system for dynamic outdoor construction sites. The key contributions are, first, a hierarchical masking strategy that can promptly refine overly large occlusion masks in an efficient way, and second, a semantic-geometric approach for objects' motion-state classification together with a two-step coarse-to-fine ego-motion tracking scheme. Our system accurately retrieved the motion trajectories of a stereo camera at construction sites, and most of the surrounding objects' motion-states in the scene were correctly predicted. Hierarchical object masking also proved to be a simple but useful strategy. Our proposed system can handle dynamic and crowded environments in which standard vSLAM systems may fail to keep tracking.

In future work, the method for selecting reference frames can be optimized to enable more robust object motion-state classification. Moreover, we plan to combine vSLAM with an inertial measurement unit (IMU) for higher-accuracy positioning. However, the fierce, high-frequency vibration of the vibration roller may cause severe noise in the IMU measurements, resulting in worse pose accuracy. We will therefore look into this problem while also exploring other visual SLAM research topics at construction sites.

References

 [1] Runqiu Bao, Ren Komatsu, Renato Miyagusuku, Masaki Chino, Atsushi Yamashita, and Hajime Asama. Cost-effective and robust visual based localization with consumer-level cameras at construction sites.
In Proceedings of the 2019 IEEE Global Conference on Consumer Electronics (GCCE 2019), pages 983–985, 2019.

 [2] Berta Bescos, José M. Fácil, Javier Civera, and José Neira. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robotics and Automation Letters, 3(4):4076–4083, 2018.

 [3] Mariano Jaimez, Christian Kerl, Javier Gonzalez-Jimenez, and Daniel Cremers. Fast odometry and scene flow from RGB-D cameras based on geometric clustering. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA 2017), pages 3992–3999, 2017.

 [4] Dan Barnes, Will Maddern, Geoffrey Pascoe, and Ingmar Posner. Driven to distraction: Self-supervised distractor learning for robust monocular visual odometry in urban environments. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA 2018), pages 1894–1900, 2018.

 [5] Binbin Xu, Wenbin Li, Dimos Tzoumanikas, Michael Bloesch, Andrew Davison, and Stefan Leutenegger. MID-Fusion: Octree-based object-level multi-instance dynamic SLAM. In Proceedings of the 2019 IEEE International Conference on Robotics and Automation (ICRA 2019), pages 5231–5237, 2019.

 [6] Ioan Andrei Bârsan, Peidong Liu, Marc Pollefeys, and Andreas Geiger. Robust dense mapping for large-scale dynamic environments. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA 2018), pages 7510–7517, 2018.

 [7] Martin Runz, Maud Buffier, and Lourdes Agapito. MaskFusion: Real-time recognition, tracking and reconstruction of multiple moving objects. In Proceedings of the 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR 2018), pages 10–20, 2018.

 [8] Chao Yu, Zuxin Liu, Xin-Jun Liu, Fugui Xie, Yi Yang, Qi Wei, and Qiao Fei. DS-SLAM: A semantic visual SLAM towards dynamic environments.
In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2018), pages 1168–1174. IEEE, 2018.

 [9] Fangwei Zhong, Sheng Wang, Ziqi Zhang, and Yizhou Wang. Detect-SLAM: Making object detection and SLAM mutually beneficial. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV 2018), pages 1001–1010. IEEE, 2018.

[10] Linhui Xiao, Jinge Wang, Xiaosong Qiu, Zheng Rong, and Xudong Zou. Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment. Robotics and Autonomous Systems, 117:1–16, 2019.

[11] João Carlos Virgolino Soares, Marcelo Gattass, and Marco Antonio Meggiolaro. Visual SLAM in human populated environments: Exploring the trade-off between accuracy and speed of YOLO and Mask R-CNN. In Proceedings of the 2019 International Conference on Advanced Robotics (ICAR 2019), pages 135–140. IEEE, 2019.

[12] Muhamad Risqi U. Saputra, Andrew Markham, and Niki Trigoni. Visual SLAM and structure from motion in dynamic environments: A survey. ACM Computing Surveys (CSUR), 51(2):1–36, 2018.

[13] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

[14] Zhaopeng Cui, Lionel Heng, Ye Chuan Yeo, Andreas Geiger, Marc Pollefeys, and Torsten Sattler. Real-time dense mapping for self-driving vehicles using fisheye cameras. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA 2019), pages 6087–6093, 2019.

[15] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[16] Abhijit Kundu, K. Madhava Krishna, and Jayanthi Sivaswamy. Moving object detection by multi-view geometric techniques from a single camera mounted robot.
In Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2009), pages 4306–4312, 2009.

[17] Davide Migliore, Roberto Rigamonti, Daniele Marzorati, Matteo Matteucci, and Domenico G. Sorrenti. Use a single camera for simultaneous localization and mapping with mobile object tracking in dynamic environments. In Proceedings of the 2009 ICRA Workshop on Safe Navigation in Open and Dynamic Environments: Application to Autonomous Vehicles, pages 12–17, 2009.

[18] Wei Tan, Haomin Liu, Zilong Dong, Guofeng Zhang, and Hujun Bao. Robust monocular SLAM in dynamic environments. In Proceedings of the 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR 2013), pages 209–218, 2013.

[19] Raul Mur-Artal and Juan D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.

[20] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.

[21] Mingxing Tan, Ruoming Pang, and Quoc V. Le. EfficientDet: Scalable and efficient object detection. arXiv preprint arXiv:1911.09070, 2019.

[22] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV 2017), pages 2961–2969, 2017.

[23] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2012), pages 573–580. IEEE, 2012.

[24] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis & Machine Intelligence, (4):376–380, 1991.

[25] Michael Grupp.
evo: Python package for the evaluation of odometry and SLAM. https://github.com/MichaelGrupp/evo, 2017.

[26] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.

[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the 2014 European Conference on Computer Vision (ECCV 2014), pages 740–755, 2014.

[28] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2021年/Tartanvo A generalizable learning_based vo.pdf b/动态slam/2020年-2022年开源动态SLAM/2021年/Tartanvo A generalizable learning_based vo.pdf
new file mode 100644
index 0000000..de9ebaa
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2021年/Tartanvo A generalizable learning_based vo.pdf
@@ -0,0 +1,724 @@

TartanVO: A Generalizable Learning-based VO

Wenshan Wang*  Yaoyu Hu  Sebastian Scherer
Carnegie Mellon University

arXiv:2011.00359v1 [cs.CV] 31 Oct 2020

Abstract: We present the first learning-based visual odometry (VO) model that generalizes to multiple datasets and real-world scenarios, and outperforms geometry-based methods in challenging scenes. We achieve this by leveraging the SLAM dataset TartanAir, which provides a large amount of diverse synthetic data in challenging environments. Furthermore, to make our VO model generalize across datasets, we propose an up-to-scale loss function and incorporate the camera intrinsic parameters into the model.
Experiments show that a single model, TartanVO, trained only on synthetic data and without any finetuning, generalizes to real-world datasets such as KITTI and EuRoC, demonstrating significant advantages over geometry-based methods on challenging trajectories. Our code is available at https://github.com/castacks/tartanvo.

Keywords: Visual Odometry, Generalization, Deep Learning, Optical Flow

1 Introduction

Visual SLAM (Simultaneous Localization and Mapping) is becoming increasingly important for autonomous robotic systems due to the ubiquitous availability and information richness of images [1]. Visual odometry (VO) is one of the fundamental components of a visual SLAM system. Impressive progress has been made in both geometry-based methods [2, 3, 4, 5] and learning-based methods [6, 7, 8, 9]. However, developing a robust and reliable VO method for real-world applications remains challenging.

On one hand, geometry-based methods are not robust enough in many real-life situations [10, 11]. On the other hand, although learning-based methods demonstrate robust performance on many visual tasks, including object recognition, semantic segmentation, depth reconstruction, and optical flow, we have not yet seen the same success in VO.

It is widely accepted that, by leveraging a large amount of data, deep-neural-network-based methods can learn a better feature extractor than engineered ones, resulting in a more capable and robust model. But why haven't deep learning models outperformed geometry-based methods yet? We argue that there are two main reasons. First, existing VO models are trained with insufficient diversity, which is critical for learning-based methods to generalize. By diversity, we mean diversity both in scenes and in motion patterns.
For example, a VO model trained only on outdoor scenes is unlikely to generalize to an indoor environment. Similarly, a model trained with data collected by a camera fixed on a ground robot, with limited pitch and roll motion, is unlikely to be applicable to drones. Second, most current learning-based VO models neglect fundamental aspects of the problem that are well formulated in geometry-based VO theory. From the theory of multi-view geometry, we know that recovering the camera pose from a sequence of monocular images suffers from scale ambiguity. Besides, recovering the pose must take into account the camera intrinsic parameters (referred to as the intrinsics ambiguity later). Without explicitly dealing with the scale problem and the camera intrinsics, a model learned from one dataset would likely fail on another dataset, no matter how good the feature extractor is.

To this end, we propose a learning-based method that solves the above two problems and generalizes across datasets. Our contributions are threefold. First, we demonstrate the crucial effect of data diversity on the generalization ability of a VO model by comparing performance across different quantities of training data. Second, we design an up-to-scale loss function to deal with the

∗Corresponding author: wenshanw@andrew.cmu.edu

4th Conference on Robot Learning (CoRL 2020), Cambridge MA, USA.

scale ambiguity of monocular VO. Third, we create an intrinsics layer (IL) in our VO model, enabling generalization across different cameras. To our knowledge, our model is the first learning-based VO that achieves competitive performance on various real-world datasets without finetuning. Furthermore, compared to geometry-based methods, our model is significantly more robust in challenging scenes.
A demo video can be found at: https://www.youtube.com/watch?v=NQ1UEh3thbU

2 Related Work

Beyond early studies of learning-based VO models [12, 13, 14, 15], more and more end-to-end learning-based VO models have been studied, with improved accuracy and robustness. The majority of recent end-to-end models adopt an unsupervised-learning design [6, 16, 17, 18], due to the complexity and high cost of collecting ground-truth data. However, supervised models trained on labeled odometry data still perform better [19, 20].

To improve performance, end-to-end VO models tend to have auxiliary outputs related to camera motion, such as depth and optical flow. With depth prediction, models obtain supervision signals by imposing depth consistency between temporally consecutive images [17, 21]. This procedure can be interpreted as matching the temporal observations in 3D space. A similar effect of temporal matching can be achieved by producing optical flow; e.g., [16, 22, 18] jointly predict depth, optical flow, and camera motion.

Optical flow can also be treated as an intermediate representation that explicitly expresses the 2D matching. Camera motion estimators can then process the optical flow data rather than working directly on raw images [20, 23]. If designed this way, the components estimating camera motion can even be trained separately on available optical flow data [19]. We follow these designs and use optical flow as an intermediate representation.

It is well known that monocular VO systems have scale ambiguity. Nevertheless, most supervised learning models do not handle this issue and directly use the difference between the model prediction and the true camera motion as supervision [20, 24, 25]. In [19], the scale is handled by dividing the optical flow into sub-regions and imposing consistency among the motion predictions of these regions.
In non-learning methods, scale ambiguity can be resolved if a 3D map is available [26]. Ummenhofer et al. [20] introduce depth prediction to correct scale drift. Tateno et al. [27] and Sheng et al. [28] ameliorate the scale problem by leveraging the key-frame selection technique from SLAM systems. Recently, Zhan et al. [29] used PnP techniques to explicitly solve for the scale factor. These methods add extra complexity to the VO system; however, the scale ambiguity is not fully eliminated for monocular setups, especially at evaluation time. Instead, some models choose to produce only up-to-scale predictions. Wang et al. [30] reduce the scale ambiguity in monocular depth estimation by normalizing the depth prediction before computing the loss function. Similarly, we focus on predicting the translation direction rather than recovering the full scale from monocular images, by defining a new up-to-scale loss function.

Learning-based models suffer from generalization issues when tested on images from a new environment or a new camera. Most VO models are trained and tested on the same dataset [16, 17, 31, 18]. Some multi-task models [6, 20, 32, 22] test their generalization ability only on depth prediction, not on camera pose estimation. Recent efforts, such as [33], use model adaptation to deal with new environments; however, additional training is needed on a per-environment or per-camera basis. In this work, we propose a novel approach to achieve cross-camera/dataset generalization by incorporating the camera intrinsics directly into the model.

Figure 1: The two-stage network architecture. The model consists of a matching network, which estimates optical flow from two consecutive RGB images, followed by a pose network predicting camera motion from the optical flow.
3 Approach

3.1 Background

We focus on the monocular VO problem, which takes two consecutive undistorted images {I_t, I_{t+1}} and estimates the relative camera motion δ_t^{t+1} = (T, R), where T ∈ R^3 is the 3D translation and R ∈ so(3) denotes the 3D rotation. According to epipolar geometry theory [34], geometry-based VO proceeds in two steps. First, visual features are extracted from I_t and I_{t+1} and matched. Then, using the matching results, the essential matrix is computed, leading to the recovery of the up-to-scale camera motion δ_t^{t+1}.

Following the same idea, our model consists of two sub-modules. One is the matching module M_θ(I_t, I_{t+1}), estimating the dense matching result F_t^{t+1} (i.e., optical flow) from two consecutive RGB images. The other is a pose module P_φ(F_t^{t+1}) that recovers the camera motion δ_t^{t+1} from the matching result (Fig. 1). This modular design is also widely used in other learning-based methods, especially in unsupervised VO [13, 19, 16, 22, 18].

3.2 Training on large-scale diverse data

Generalization capability has always been one of the most critical issues for learning-based methods. Most previous supervised models are trained on the KITTI dataset, which contains 11 labeled sequences and 23,201 image frames in the driving scenario [35]. Wang et al. [8] presented training and testing results on the EuRoC dataset [36], collected by a micro aerial vehicle (MAV). They reported that performance was limited by the lack of training data and the more complex dynamics of a flying robot. Surprisingly, most unsupervised methods also train their models only on very uniform scenes (e.g., KITTI and Cityscapes [37]). To our knowledge, no learning-based model has yet shown the capability of running on multiple types of scenes (car/MAV, indoor/outdoor). To achieve this, we argue that the training data has to cover diverse scenes and motion patterns.
TartanAir [11] is a large-scale dataset with highly diverse scenes and motion patterns, containing more than 400,000 data frames. It provides multi-modal ground-truth labels including depth, segmentation, optical flow, and camera pose. The scenes include indoor, outdoor, urban, nature, and sci-fi environments. The data is collected with a simulated pinhole camera, which moves with random and rich 6DoF motion patterns in 3D space.

We take advantage of the monocular image sequences {I_t}, the optical flow labels {F_t^{t+1}}, and the ground-truth camera motions {δ_t^{t+1}} in our task. Our objective is to jointly minimize the optical flow loss L_f and the camera motion loss L_p. The end-to-end loss is defined as:

L = λ L_f + L_p = λ ‖M_θ(I_t, I_{t+1}) − F_t^{t+1}‖ + ‖P_φ(F̂_t^{t+1}) − δ_t^{t+1}‖    (1)

where λ is a hyper-parameter balancing the two losses. We use ˆ· to denote a variable estimated by our model.

Since TartanAir is purely synthetic, the biggest question is: can a model learned from simulation data generalize to real-world scenes? As discussed by Wang et al. [11], a large number of studies show that a model trained purely in simulation, but with broad diversity, can be easily transferred to the real world. This is also known as domain randomization [38, 39]. In our experiments, we show that the diverse simulated data indeed enables the VO model to generalize to real-world data.

3.3 Up-to-scale loss function

The motion scale is unobservable from a monocular image sequence. In geometry-based methods, the scale is usually recovered from other sources of information, ranging from known object sizes or camera height to extra sensors such as an IMU. However, most existing learning-based VO studies neglect the scale problem and try to recover the motion with scale. This is feasible if the model is trained and tested with the same camera and in the same type of scenario.
For example, in the KITTI dataset, the camera is mounted at a fixed height above the ground and with a fixed orientation. A model can learn to memorize the scale in this particular setup. Obviously, the model will have huge problems when tested with a different camera configuration. Imagine that the

Figure 2: a) Illustration of the FoV and image resolution in the TartanAir, EuRoC, and KITTI datasets. b) Calculation of the intrinsics layer.

camera in KITTI moves a little upwards and becomes higher above the ground: the same camera motion would cause a smaller optical flow value on the ground, which is inconsistent with the training data. Although the model could potentially learn to pick up other clues such as object size, it is still not fully reliable across different scenes or environments.

Following the geometry-based methods, we recover only an up-to-scale camera motion from the monocular sequences. Knowing that the scale ambiguity affects only the translation T, we design a new loss function for T and keep the loss for rotation R unchanged. We propose two up-to-scale loss functions for L_p: the cosine similarity loss L_p^cos and the normalized distance loss L_p^norm. L_p^cos is defined by the cosine angle between the estimated T̂ and the label T:

L_p^cos = (T̂ · T) / max(‖T̂‖ · ‖T‖, ε) + ‖R̂ − R‖    (2)

Similarly, for L_p^norm, we normalize the translation vectors before calculating the distance between the estimation and the label:

L_p^norm = ‖ T̂ / max(‖T̂‖, ε) − T / max(‖T‖, ε) ‖ + ‖R̂ − R‖    (3)

where ε = 1e-6 is used to avoid division by zero. In our preliminary empirical comparison, the two formulations performed similarly. In the following sections, we use Eq. 3 to replace L_p in Eq. 1. Later, we show by experiments that the proposed up-to-scale loss function is crucial for the model's generalization ability.
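The normalized-distance loss of Eq. (3) can be spelled out in a few lines. This is a pure-Python sketch for illustration only; the paper's actual implementation operates on PyTorch tensors, and the list-based helper here is an assumption made to keep the example dependency-free:

```python
import math

EPS = 1e-6  # guards against division by zero for near-zero translations

def _norm(v):
    """Euclidean norm of a plain-list vector."""
    return math.sqrt(sum(x * x for x in v))

def up_to_scale_loss(t_hat, r_hat, t_gt, r_gt):
    """Normalized-distance up-to-scale motion loss (Eq. 3).

    t_hat, t_gt -- predicted / ground-truth 3D translations
    r_hat, r_gt -- predicted / ground-truth rotations (so(3) vectors)
    Both translations are normalized to unit length before comparison, so only
    the translation *direction* is penalized; the rotation term is a plain
    Euclidean distance, unchanged from the ordinary supervised loss.
    """
    tn_hat = [x / max(_norm(t_hat), EPS) for x in t_hat]
    tn_gt = [x / max(_norm(t_gt), EPS) for x in t_gt]
    t_term = _norm([a - b for a, b in zip(tn_hat, tn_gt)])
    r_term = _norm([a - b for a, b in zip(r_hat, r_gt)])
    return t_term + r_term
```

A prediction that differs from the label by only a positive scale factor incurs zero translation penalty, which is exactly the up-to-scale behavior the loss is designed to enforce.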
3.4 Cross-camera generalization by encoding camera intrinsics

In epipolar geometry theory, the camera intrinsics are required when recovering the camera pose from the essential matrix (assuming the images are undistorted). In fact, learning-based methods are unlikely to generalize to data with different camera intrinsics. Imagine a simple case in which the camera switches to a lens with a larger focal length. Assuming the image resolution remains the same, the same camera motion will introduce larger optical flow values; we call this the intrinsics ambiguity.

A tempting solution to the intrinsics ambiguity is warping the input images to match the camera intrinsics of the training data. However, this is not practical, especially when the cameras differ too much. As shown in Fig. 2-a, if a model is trained on TartanAir, a warped KITTI image covers only a small part of TartanAir's field of view (FoV). During training, a model learns to exploit cues from all possible positions in the FoV and the interrelationships among those cues. Some of these cues no longer exist in the warped KITTI images, leading to drastic performance drops.

3.4.1 Intrinsics layer

We propose to train a model that takes both RGB images and camera intrinsics as input, so that the model can directly handle images coming from various camera settings. Specifically, instead of recovering the camera motion δ_t^{t+1} only from the feature matching F_t^{t+1}, we design a new pose network P_φ(F_t^{t+1}, K), which also depends on the camera intrinsic parameters K = {f_x, f_y, o_x, o_y}, where f_x and f_y are the focal lengths, and o_x and o_y denote the position of the principal point.

Figure 3: The data augmentation procedure of random cropping and resizing. In this way we generate a wide range of camera intrinsics (FoV 40° to 90°).
As for the implementation, we concatenate an IL (intrinsics layer) K^c ∈ R^{2×H×W} (H and W are the image height and width, respectively) to F_t^{t+1} before it enters P_φ. To compose K^c, we first generate two index matrices X_ind and Y_ind for the x and y axes of the 2D image frame (Fig. 2-b). Then the two channels of K^c are calculated as follows:

K^c_x = (X_ind − o_x) / f_x
K^c_y = (Y_ind − o_y) / f_y    (4)

The concatenation of F_t^{t+1} and K^c augments the optical flow estimation with 2D position information. Similar to how geometry-based methods must know the 2D coordinates of the matched features, K^c provides the necessary position information. In this way, the intrinsics ambiguity is explicitly handled by coupling the 2D positions with the matching estimation (F_t^{t+1}).

3.4.2 Data generation for various camera intrinsics

To make a model generalizable across different cameras, we need training data with various camera intrinsics. TartanAir has only one set of camera intrinsics, where f_x = f_y = 320, o_x = 320, and o_y = 240. We simulate various intrinsics by randomly cropping and resizing (RCR) the input images. As shown in Fig. 3, we first crop the image at a random location with a random size. Next, we resize the cropped image to the original size. One advantage of the IL is that during RCR, we can crop and resize the IL together with the image, without recomputing it. To cover typical cameras with FoV between 40° and 90°, we find that random resizing factors up to 2.5 are sufficient during RCR. Note that the ground-truth optical flow should also be scaled with the resizing factor. We use very aggressive cropping and shifting in our training, which means the optical center can be far from the image center. Although the resulting intrinsic parameters are uncommon in modern cameras, we find that this improves generalization.
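Equation (4) translates directly into code. The following is a dependency-free sketch, not the authors' implementation: it builds the two IL channels as nested lists, whereas the actual model would stack them as a 2 × H × W tensor and concatenate them with the optical flow along the channel axis:

```python
def intrinsics_layer(height, width, fx, fy, ox, oy):
    """Build the two channels of the intrinsics layer K^c of Eq. (4).

    Each pixel (u, v) stores its normalized image-plane coordinate:
        kx[v][u] = (u - ox) / fx,   ky[v][u] = (v - oy) / fy
    which is exactly the 2D position information the pose network needs
    alongside the optical flow.
    """
    kx = [[(u - ox) / fx for u in range(width)] for _ in range(height)]
    ky = [[(v - oy) / fy for _ in range(width)] for v in range(height)]
    return kx, ky
```

With TartanAir's intrinsics (f_x = f_y = 320, o_x = 320, o_y = 240) on a 640 × 480 image, the x channel spans roughly [−1, 1) and the y channel roughly [−0.75, 0.75). Because each entry depends only on its own pixel coordinates, the same crop-and-resize applied to the image during RCR can be applied to these channels directly, which is why the IL never needs to be recomputed.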
4 Experimental Results

4.1 Network structure and training details

Network We use the pre-trained PWC-Net [40] as the matching network M_θ, and a modified ResNet50 [41] as the pose network P_φ. We remove the batch normalization layers from the ResNet and add two output heads for the translation and rotation, respectively. The PWC-Net outputs optical flow at 1/4 resolution (H/4 × W/4), so P_φ is trained on 1/4-size input, consuming very little GPU memory. The overall inference time (including both M_θ and P_φ) is 40 ms on an NVIDIA GTX 1080 GPU.

Training Our model is implemented in PyTorch [42] and trained on 4 NVIDIA GTX 1080 GPUs. There are two training stages. First, P_φ is trained separately using ground-truth optical flow and camera motions for 100,000 iterations with a batch size of 100. In the second stage, P_φ and M_θ are connected and jointly optimized for 50,000 iterations with a batch size of 64. In both stages, the learning rate is set to 1e-4 with a decay factor of 0.2 applied at 1/2 and 7/8 of the total training steps. The RCR is applied to the optical flow, the RGB images, and the IL (Sec 3.4.2).

4.2 How the training data quantity affects the generalization ability

To show the effects of data diversity, we compare the generalization ability of models trained with different amounts of data. We use 20 environments from the TartanAir dataset and set aside 3 environments (Seaside-town, Soul-city, and Hongkong-alley) only for testing, which results in

Figure 4: Generalization ability with respect to different quantities of training data. Model P_φ is trained on true optical flow. Blue: training loss; orange: testing loss on three unseen environments. The testing loss drops consistently with an increasing quantity of training data.

Figure 5: Comparison of the loss curves w/ and w/o the up-to-scale loss function. a) The training and testing loss w/o the up-to-scale loss. b) The translation and rotation losses of a).
A big gap exists between the training and testing translation losses (orange arrow in b)). c) The training and testing losses w/ the up-to-scale loss. d) The translation and rotation losses of c). The translation loss gap decreases.

more than 400,000 training frames and about 40,000 testing frames. As a comparison, the KITTI and EuRoC datasets provide 23,201 and 26,604 pose-labeled frames, respectively. Besides, the data in KITTI and EuRoC are much more uniform in terms of scene type and motion pattern. As shown in Fig. 4, we set up three experiments using 20,000 (comparable to KITTI and EuRoC), 100,000, and 400,000 frames of data to train the pose network P_φ. The experiments show that the generalization ability, measured by the gap between the training loss and the testing loss on unseen environments, improves consistently with increasing training data.

4.3 Up-to-scale loss function

Without the up-to-scale loss, we observe a gap between the training and testing losses even when training with a large amount of data (Fig. 5-a). When we plot the translation and rotation losses separately (Fig. 5-b), it shows that the translation error is the main contributor to the gap. After we apply the up-to-scale loss function described in Sec 3.3, the translation loss gap decreases (Fig. 5-c,d). During testing, we align the translation with the ground truth to recover the scale, in the same way as described in [16, 6].

4.4 Camera intrinsics layer

The IL is critical to the generalization ability across datasets. Before moving to other datasets, we first design an experiment to investigate the properties of the IL using the pose network P_φ. As shown in Table 1, in the first two columns, where the data has no RCR augmentation, the training and testing losses are low, but these two models output nonsense values on data with RCR augmentation. One interesting finding is that adding the IL does not help in the case of only one type of intrinsics.
This indicates that the network has learned a very different algorithm from the geometry-based methods, where the intrinsics are necessary to recover the motion. The last two columns show that the IL is critical when the input data is augmented by RCR (i.e., various intrinsics). Another interesting observation is that training a model with RCR and the IL leads to a lower testing loss (last column) than training on only one type of intrinsics (first two columns). This indicates that by generating data with various intrinsics, we learn a more robust model for the VO task.

Table 1: Training and testing losses with four combinations of RCR and IL settings. The IL is critical in the presence of RCR. The model trained with RCR reaches a lower testing loss than those without RCR.

Training configuration     w/o RCR, w/o IL  w/o RCR, w/ IL  w/ RCR, w/o IL  w/ RCR, w/ IL
Training loss              0.0325           0.0311          0.1534          0.0499
Test loss on data w/ RCR   -                -               0.1999          0.0723
Test loss on data w/o RCR  0.0744           0.0714          0.1630          0.0549

Table 2: Comparison of translation and rotation on the KITTI dataset. DeepVO [43] is a supervised method trained on Seq 00, 02, 08, 09. It contains an RNN module, which accumulates information from multiple frames. Wang et al. [9] is a supervised method trained on Seq 00-08 and uses the semantic information of multiple frames to optimize the trajectory. UnDeepVO [44] and GeoNet [16] are trained on Seq 00-08 in an unsupervised manner. VISO2-M [45] and ORB-SLAM [3] are geometry-based monocular VO. ORB-SLAM uses bundle adjustment on multiple frames to optimize the trajectory. Our method works in a pure VO manner (it only takes two frames). It has never seen any KITTI data before testing, and yet achieves competitive results.

Seq                06             07             09             10             Ave
                   trel   rrel    trel   rrel    trel   rrel    trel   rrel    trel   rrel
DeepVO [43]*†      5.42   5.82    3.91   4.60    -      -       8.11   8.83    5.81   6.41
Wang et al. [9]*†  -      -       -      -       8.04   1.51    6.23   0.97    7.14   1.24
UnDeepVO [44]*     6.20   1.98    3.15   2.48    -      -       10.63  4.65    6.66   3.04
GeoNet [16]*       9.28   4.34    8.27   5.93    26.93  9.54    20.73  9.04    16.3   7.21
VISO2-M [45]       7.30   6.14    23.61  19.11   4.04   1.43    25.2   3.8     15.04  7.62
ORB-SLAM [3]†      18.68  0.26    10.96  0.37    15.3   0.26    3.71   0.3     12.16  0.3
TartanVO (ours)    4.72   2.95    4.32   3.41    6.00   3.11    6.89   2.73    5.48   3.05

trel: average translational RMSE drift (%) on lengths of 100–800 m.
rrel: average rotational RMSE drift (°/100 m) on lengths of 100–800 m.
*: the starred methods are trained or finetuned on the KITTI dataset.
†: these methods use multiple frames to optimize the trajectory after the VO process.

4.5 Generalize to real-world data without finetuning

KITTI dataset The KITTI dataset is one of the most influential datasets for VO/SLAM tasks. We compare our model, TartanVO, with two supervised learning models (DeepVO [43], Wang et al. [9]), two unsupervised models (UnDeepVO [44], GeoNet [16]), and two geometry-based methods (VISO2-M [45], ORB-SLAM [3]). All the learning-based methods except ours are trained on the KITTI dataset. Note that our model has not been finetuned on KITTI and is trained purely on a synthetic dataset. Besides, many of these algorithms use multiple frames to further optimize the trajectory, whereas our model only takes two consecutive images. As listed in Table 2, TartanVO achieves comparable performance, even though no finetuning or backend optimization is performed.

EuRoC dataset The EuRoC dataset contains 11 sequences collected by a MAV in an indoor environment, with 3 levels of difficulty with respect to the motion pattern and the lighting condition. Few learning-based methods have been tested on EuRoC due to the lack of training data. The changing lighting conditions and aggressive rotations also pose real challenges to geometry-based methods.
In Table 3, we compare with geometry-based methods including SVO [46], ORB-SLAM [3], DSO [5], and LSD-SLAM [2]. Note that all these geometry-based methods perform some type of backend optimization on selected keyframes along the trajectory. In contrast, our model only estimates the frame-by-frame camera motion and could be considered as the frontend module of these geometry-based methods. In Table 3, we show the absolute trajectory error (ATE) of 6 medium and difficult trajectories. Our method shows the best performance on the two most difficult trajectories, VR1-03 and VR2-03, where the MAV undergoes very aggressive motion. A visualization of the trajectories is shown in Fig. 6.

Challenging TartanAir data TartanAir provides 16 very challenging testing trajectories2 that cover many extremely difficult cases, including changing illumination, dynamic objects, fog and rain effects, lack of features, and large motion. As listed in Table 4, we compare our model with ORB-SLAM using the ATE. Our model shows more robust performance in these challenging cases.

2https://github.com/castacks/tartanair_tools#download-the-testing-data-for-the-cvpr-visual-slam-challenge

Table 3: Comparison of ATE on the EuRoC dataset. We are among the very few learning-based methods that can be tested on this dataset. Like the geometry-based methods, our model has never seen the EuRoC data before testing. We show the best performance on the two difficult sequences VR1-03 and VR2-03. Note that our method does not contain any backend optimization module.

Seq.                              MH-04  MH-05  VR1-02  VR1-03  VR2-02  VR2-03
Geometry-based*  SVO [46]         1.36   0.51   0.47    x       0.47    x
                 ORB-SLAM [3]     0.20   0.19   x       x       0.07    x
                 DSO [5]          0.25   0.11   0.11    0.93    0.13    1.16
                 LSD-SLAM [2]     2.13   0.85   1.11    x       x       x
Learning-based†  TartanVO (ours)  0.74   0.68   0.45    0.64    0.67    1.04

* These results are from [46]. † Other learning-based methods [36] did not report numerical results.
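The ATE reported in Tables 3 and 4 is, in essence, the RMSE between corresponding ground-truth and estimated camera positions. A minimal sketch, assuming the two trajectories are already time-associated and aligned (monocular methods additionally require scale alignment, as noted in Sec 4.3); `ate_rmse` is our own helper, not the official evaluation code:

```python
import numpy as np

def ate_rmse(gt: np.ndarray, est: np.ndarray) -> float:
    """Absolute Trajectory Error (RMSE) between two aligned
    Nx3 position sequences."""
    diff = gt - est
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

# Toy example: a constant 0.1 m offset along x gives an ATE of 0.1 m.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
est = gt + np.array([0.1, 0.0, 0.0])
assert abs(ate_rmse(gt, est) - 0.1) < 1e-9
```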
Figure 6: Visualization of the 6 EuRoC trajectories in Table 3. Black: ground truth trajectory; orange: estimated trajectory.

Table 4: Comparison of ATE on the TartanAir dataset. These trajectories are not contained in the training set. We repeatedly run ORB-SLAM 5 times and report the best result.

Seq              MH000  MH001  MH002  MH003  MH004  MH005  MH006  MH007
ORB-SLAM [3]     1.3    0.04   2.37   2.45   x      x      21.47  2.73
TartanVO (ours)  4.88   0.26   2.00   0.94   1.07   3.19   1.00   2.04

Figure 7: TartanVO outputs competitive results on D435i IR data compared to the T265 (equipped with a fish-eye stereo camera and an IMU). a) The hardware setup. b) Trial 1: smooth and slow motion. c) Trial 2: smooth and medium speed. d) Trial 3: aggressive and fast motion. See videos for details.

RealSense Data Comparison We test TartanVO using data collected by a customized sensor setup. As shown in Fig. 7 a), a RealSense D435i is fixed on top of a RealSense T265 tracking camera. We use the left near-infrared (IR) images of the D435i in our model and compare against the trajectories provided by the T265 tracking camera. We present 3 loopy trajectories following similar paths with increasing motion difficulty. From Fig. 7 b) to d), we observe that although TartanVO has never seen real-world images or IR data during training, it still generalizes well and predicts odometry closely matching the output of the T265, a dedicated device that estimates camera motion with a pair of fish-eye stereo cameras and an IMU.

5 Conclusions

We presented TartanVO, a generalizable learning-based visual odometry. By training our model with a large amount of data, we showed the effectiveness of diverse data for model generalization. A smaller gap between training and testing losses can be expected with the newly defined up-to-scale loss, further increasing the generalization capability.
We show through extensive experiments that, equipped with the intrinsics layer designed explicitly for handling different cameras, TartanVO can generalize to unseen datasets and achieve performance even better than dedicated learning models trained directly on those datasets. Our work opens up many exciting research directions, such as generalizable learning-based VIO, stereo VO, and multi-frame VO.

Acknowledgments

This work was supported by ARL award #W911NF1820218. Special thanks to Yuheng Qiu and Huai Yu from Carnegie Mellon University for preparing simulation results and experimental setups.

References

[1] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha. Visual simultaneous localization and mapping: a survey. Artificial Intelligence Review, 43(1):55–81, 2015.

[2] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In ECCV, 2014.

[3] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.

[4] C. Forster, M. Pizzoli, and D. Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In ICRA, pages 15–22. IEEE, 2014.

[5] J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3):611–625, 2017.

[6] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.

[7] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki. SfM-Net: Learning of structure and motion from video. arXiv:1704.07804, 2017.

[8] S. Wang, R. Clark, H. Wen, and N. Trigoni. End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks. The International Journal of Robotics Research, 37(4-5):513–542, 2018.

[9] X. Wang, D. Maturana, S. Yang, W. Wang, Q. Chen, and S. Scherer.
Improving learning-based ego- + motion estimation with homomorphism-based losses and drift correction. In 2019 IEEE/RSJ International + Conference on Intelligent Robots and Systems (IROS), pages 970–976. IEEE, 2019. + +[10] G. Younes, D. Asmar, E. Shammas, and J. Zelek. Keyframe-based monocular slam: design, survey, and + future directions. Robotics and Autonomous Systems, 98:67–88, 2017. + +[11] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer. Tartanair: A + dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots + and Systems (IROS), 2020. + +[12] R. Roberts, H. Nguyen, N. Krishnamurthi, and T. Balch. Memory-based learning for visual odometry. + In Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on, pages 47–52. IEEE, + 2008. + +[13] V. Guizilini and F. Ramos. Semi-parametric models for visual odometry. In Robotics and Automation + (ICRA), 2012 IEEE International Conference on, pages 3482–3489. IEEE, 2012. + +[14] T. A. Ciarfuglia, G. Costante, P. Valigi, and E. Ricci. Evaluation of non-geometric methods for visual + odometry. Robotics and Autonomous Systems, 62(12):1717–1730, 2014. + +[15] K. Tateno, F. Tombari, I. Laina, and N. Navab. Cnn-slam: Real-time dense monocular slam with learned + depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, + pages 6243–6252, 2017. + +[16] Z. Yin and J. Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In + Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, + 2018. + +[17] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid. Unsupervised learning of monocular + depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE + Conference on Computer Vision and Pattern Recognition, pages 340–349, 2018. + +[18] A. Ranjan, V. Jampani, L. Balles, K. 
Kim, D. Sun, J. Wulff, and M. J. Black. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[19] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia. Exploring representation learning with CNNs for frame-to-frame ego-motion estimation. RAL, 1(1):18–25, 2016.

[20] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[21] N. Yang, L. v. Stumberg, R. Wang, and D. Cremers. D3VO: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[22] Y. Zou, Z. Luo, and J.-B. Huang. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.

[23] H. Zhou, B. Ummenhofer, and T. Brox. DeepTAM: Deep tracking and mapping. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.

[24] C. Tang and P. Tan. BA-Net: Dense bundle adjustment network. arXiv preprint arXiv:1806.04807, 2018.

[25] R. Clark, M. Bloesch, J. Czarnowski, S. Leutenegger, and A. J. Davison. LS-Net: Learning to solve nonlinear least squares for monocular stereo. arXiv preprint arXiv:1809.02966, 2018.

[26] H. Li, W. Chen, J. Zhao, J.-C. Bazin, L. Luo, Z. Liu, and Y.-H. Liu. Robust and efficient estimation of absolute camera pose for monocular visual odometry. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), May 2020.

[27] K. Tateno, F. Tombari, I. Laina, and N. Navab.
CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[28] L. Sheng, D. Xu, W. Ouyang, and X. Wang. Unsupervised collaborative learning of keyframe detection and visual odometry towards monocular deep SLAM. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

[29] H. Zhan, C. S. Weerasekera, J.-W. Bian, and I. Reid. Visual odometry revisited: What should be learnt? In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), May 2020.

[30] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[31] Y. Wang, P. Wang, Z. Yang, C. Luo, Y. Yang, and W. Xu. UnOS: Unified unsupervised optical-flow and stereo-depth estimation by watching videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[32] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[33] S. Li, X. Wang, Y. Cao, F. Xue, Z. Yan, and H. Zha. Self-supervised deep visual odometry with online adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[34] D. Nistér. An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):756–770, 2004.

[35] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.

[36] M. Burri, J. Nikolic, P. Gohl, T.
Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart. The + euroc micro aerial vehicle datasets. The International Journal of Robotics Research, 35(10):1157–1163, + 2016. + +[37] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and + B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE + conference on computer vision and pattern recognition, pages 3213–3223, 2016. + +[38] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transfer- + ring deep neural networks from simulation to the real world. In IROS, pages 23–30. IEEE, 2017. + +[39] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, + and S. Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain ran- + domization. In CVPR Workshops, pages 969–977, 2018. + +[40] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and + cost volume. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages + 8934–8943, 2018. + +[41] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the + IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. + +[42] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and + A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017. + +[43] S. Wang, R. Clark, H. Wen, and N. Trigoni. Deepvo: Towards end-to-end visual odometry with deep + recurrent convolutional neural networks. In Robotics and Automation (ICRA), 2017 IEEE International + Conference on, pages 2043–2050. IEEE, 2017. + +[44] R. Li, S. Wang, Z. Long, and D. Gu. Undeepvo: Monocular visual odometry through unsupervised deep + learning. 
In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7286–7291. IEEE, 2018.

[45] S. Song, M. Chandraker, and C. Guest. High accuracy monocular SFM and scale correction for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2015.

[46] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza. SVO: Semidirect visual odometry for monocular and multicamera systems. IEEE Transactions on Robotics, 33(2):249–265, 2016.

A Additional experimental details

In this section, we provide additional details of the experiments, including the network structure, training parameters, and qualitative and quantitative results.

A.1 Network Structure

Our network consists of two sub-modules, namely the matching network M_θ and the pose network P_φ. As mentioned in the paper, we employ PWC-Net as the matching network, which takes in two consecutive images of size 640 × 448 (PWC-Net only accepts image sizes that are multiples of 64). The output optical flow, which is 160 × 112 in size, is fed into the pose network. The structure of the pose network is detailed in Table 5. The overall inference time (including both M_θ and P_φ) is 40 ms on an NVIDIA GTX 1080 GPU.

Table 5: Parameters of the proposed pose network. The layers in each residual block are listed in brackets, multiplied by the number of stacked blocks. Downsampling is performed by Conv1 and at the beginning of each residual block. After the residual blocks, the feature map is reshaped into a one-dimensional vector, which goes through three fully connected layers in the translation head and the rotation head, respectively.
Name             Layer setting       Output dimension
Input            -                   1/4 H × 1/4 W × 2      (112 × 160)
Conv1            3 × 3, 32           1/8 H × 1/8 W × 32     (56 × 80)
Conv2            3 × 3, 32           1/8 H × 1/8 W × 32     (56 × 80)
Conv3            3 × 3, 32           1/8 H × 1/8 W × 32     (56 × 80)
Block1           [3 × 3, 64] × 3     1/16 H × 1/16 W × 64   (28 × 40)
Block2           [3 × 3, 128] × 4    1/32 H × 1/32 W × 128  (14 × 20)
Block3           [3 × 3, 128] × 6    1/64 H × 1/64 W × 128  (7 × 10)
Block4           [3 × 3, 256] × 7    1/128 H × 1/128 W × 256 (4 × 5)
Block5           [3 × 3, 256] × 3    1/256 H × 1/256 W × 256 (2 × 3)
Trans head fc1   (256 · 6) × 128     128
Trans head fc2   128 × 32            32
Trans head fc3   32 × 3              3
Rot head fc1     (256 · 6) × 128     128
Rot head fc2     128 × 32            32
Rot head fc3     32 × 3              3

Table 6: Comparison of ORB-SLAM and TartanVO on the TartanAir dataset using the ATE metric. These trajectories are not contained in the training set. We repeatedly run ORB-SLAM for 5 times and report the best result.

Seq              SH000  SH001  SH002  SH003  SH004  SH005  SH006  SH007
ORB-SLAM         x      3.5    x      x      x      x      x      x
TartanVO (ours)  2.52   1.61   3.65   0.29   3.36   4.74   3.72   3.06

A.2 Testing Results on TartanAir

TartanAir provides 16 challenging testing trajectories. We reported 8 trajectories in the experiment section; the remaining 8 trajectories are shown in Table 6. We compare TartanVO against the ORB-SLAM monocular algorithm. Due to the randomness in ORB-SLAM, we run ORB-SLAM for 5 trials and report the best result. We consider a trial a failure if ORB-SLAM tracks less than 80% of the trajectory. A visualization of all 16 trajectories (including the 8 shown in the experiment section) is given in Figure 8.

Figure 8: Visualization of the 16 testing trajectories in the TartanAir dataset. The black dashed line represents the ground truth. The trajectories estimated by TartanVO and the ORB-SLAM monocular algorithm are shown in orange and blue lines, respectively.
The ORB-SLAM algorithm frequently loses tracking in these challenging cases: it fails in 9/16 testing trajectories. Note that we run a full-fledged ORB-SLAM with local bundle adjustment, global bundle adjustment, and loop closure components. In contrast, although TartanVO only takes in two images, it is much more robust than ORB-SLAM.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2022年/AirDOS_Dynamic_SLAM_benefits_from_Articulated_Objects.pdf b/动态slam/2020年-2022年开源动态SLAM/2022年/AirDOS_Dynamic_SLAM_benefits_from_Articulated_Objects.pdf
new file mode 100644
index 0000000..2177561
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2022年/AirDOS_Dynamic_SLAM_benefits_from_Articulated_Objects.pdf
@@ -0,0 +1,518 @@

2022 IEEE International Conference on Robotics and Automation (ICRA)
May 23-27, 2022. Philadelphia, PA, USA

AirDOS: Dynamic SLAM benefits from Articulated Objects

Yuheng Qiu1, Chen Wang1, Wenshan Wang1, Mina Henein2, and Sebastian Scherer1

DOI: 10.1109/ICRA46639.2022.9811667

Abstract— Dynamic Object-aware SLAM (DOS) exploits object-level information to enable robust motion estimation in dynamic environments. Existing methods mainly focus on identifying and excluding dynamic objects from the optimization. In this paper, we show that feature-based visual SLAM systems can also benefit from the presence of dynamic articulated objects by taking advantage of two observations: (1) the 3D structure of each rigid part of an articulated object remains consistent over time; (2) the points on the same rigid part follow the same motion. In particular, we present AirDOS, a dynamic object-aware system that introduces rigidity and motion constraints to model articulated objects. By jointly optimizing the camera pose, object motion, and the object's 3D structure, we can rectify the camera pose estimation, prevent tracking loss, and generate 4D spatio-temporal maps of both dynamic objects and static scenes. Experiments show that our algorithm improves the robustness of visual SLAM algorithms in challenging crowded urban environments. To the best of our knowledge, AirDOS is the first dynamic object-aware SLAM system demonstrating that camera pose estimation can be improved by incorporating dynamic articulated objects.

Fig. 1. (a) Example of a highly dynamic environment (Shibuya, Tokyo) cluttered with humans, which represents a challenge for visual SLAM; existing dynamic SLAM algorithms often fail in this scenario. (b) Example of the TartanAir Shibuya Dataset. (c) Example of the estimated full map with dynamic objects and static background (KITTI tracking dataset, training sequence 19).

I. INTRODUCTION

Simultaneous localization and mapping (SLAM) is a fundamental research problem in many robotic applications. Despite its success in static environments, the performance degradation and lack of robustness in the dynamic world have become a major hurdle for its practical applications [1], [2]. To address the challenges of dynamic environments, most SLAM algorithms adopt an elimination strategy that treats moving objects as outliers and estimates the camera pose based only on the measurements of static landmarks [3], [4]. This strategy can handle environments with a small number of dynamics, but cannot address challenging cases where dynamic objects cover a large field of view, as in Fig. 1(a).

Some efforts have been made to include dynamic objects in the SLAM process. Very few methods try to estimate the pose of simple rigid objects [5], [6] or estimate their motion model [7], [8]. For example, CubeSLAM [6] introduces a simple 3D cuboid to model rigid objects, and Dynamic SLAM [9] estimates the 3D motions of dynamic objects. However, these methods can only cover special rigid objects, e.g., cubes [6] and quadrics [5], and do not show that camera pose estimation can be improved by the introduction of dynamic objects [7]–[9]. This introduces our main question:

Can we make use of moving objects in SLAM to improve camera pose estimation rather than filtering them out?

In this paper, we extend simple rigid objects to general articulated objects, defined as objects composed of one or more rigid parts (links) connected by joints allowing rotational motion [10], e.g., the vehicles and humans in Fig. 2, and utilize the properties of articulated objects to improve the camera pose estimation. Namely, we jointly optimize (1) the 3D structural information and (2) the motion of articulated objects. To this end, we introduce (1) a rigidity constraint, which assumes that the distance between any two points located on the same rigid part remains constant over time, and (2) a motion constraint, which assumes that feature points on the same rigid part follow the same 3D motion. This allows us to build a 4D spatio-temporal map including both dynamic and static structures.

In summary, the main contributions of this paper are:
• A new pipeline, named AirDOS, is introduced for stereo SLAM to jointly optimize the camera poses, the trajectories of dynamic objects, and the map of the environment.
• We introduce simple yet efficient rigidity and motion constraints for general dynamic articulated objects.
• We introduce a new benchmark, TartanAir Shibuya, on which we demonstrate, for the first time, that dynamic articulated objects can benefit the camera pose estimation in visual SLAM.

*This work was supported by the Sony award #A023367. Source Code: https://github.com/haleqiu/AirDOS.
+1Yuheng Qiu, Chen Wang, Wenshan Wang, and Sebastian Scherer are with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA {yuhengq, wenshanw, basti}@andrew.cmu.edu; chenwang@dr.com
+2Mina Henein is with the Systems, Theory and Robotics Lab, Australian National University. mina.henein@anu.edu.au
+978-1-7281-9680-0/22/$31.00 ©2022 IEEE 8047
+Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on July 04,2023 at 02:38:40 UTC from IEEE Xplore. Restrictions apply.
+Wang et al. [18] introduce a simultaneous localization, mapping, and moving object tracking (SLAMMOT) algorithm, which tracks moving objects with a learned motion model based on a dynamic Bayesian network. Reddy et al. [19] use optical flow to segment moving objects, and apply a smooth trajectory constraint to enforce the smoothness of objects' motion. Judd et al. [8] propose multi-motion visual odometry (MVO), which simultaneously estimates the camera pose and the object motion. The work by Henein et al. [7], [20], [21], of which the most recent is VDO-SLAM [20], generates a map of dynamic and static structure and estimates velocities of rigid moving objects using motion constraints. Rosinol et al. [22] propose 3D dynamic scene graphs to detect and track dense human meshes in dynamic scenes. This method constrains the humans' maximum walking speed for a consistency check.
+Fig. 2. An example of the articulated dynamic objects' point-segment model. In an urban environment, we can model rigid objects like vehicles and semi-rigid objects like pedestrians as articulated objects. p_i^k and p_j^k are the i-th and j-th dynamic features on the moving objects at time k. p_i^{k+1} and p_j^{k+1} are the dynamic features after the motion ^lT^k at time k+1. In this model, the segment s_ij is invariant over time and motion.
+C. Rigidity Constraint
+The rigidity constraint assumes that pair-wise distances of points on the same rigid body remain the same over time. It was applied to segment moving objects in dynamic environments dating back to the 1980s. Zhang et al. [23] propose to use the rigidity constraint to match moving rigid bodies. Thompson et al. [24] use a similar idea of rigidity constraint and propose a rigidity geometry test for moving rigid object matching. Previous research utilized the rigidity assumption to segment moving rigid objects, while in this paper, we use the rigidity constraint to recover objects' structure.
+To model rigid objects, SLAM++ [25] introduced pre-defined CAD models into the object matching and pose optimization. QuadricSLAM [5] utilizes dual quadrics as a 3D object representation, to represent the orientation and scale of object landmarks. Yang and Scherer [6] propose a monocular object SLAM system named CubeSLAM for 3D cuboid object detection and multi-view object SLAM. As mentioned earlier, the above methods can only model simple rigid objects, e.g., cubes, while we target more general objects, i.e., articulated objects, which can cover common dynamic objects such as vehicles and humans.
+II. RELATED WORK
+Recent works on dynamic SLAM roughly fall into three categories: elimination strategy, motion constraint, and rigidity constraint, which will be reviewed, respectively.
+A. Elimination Strategy
+Algorithms in this category filter out the dynamic objects and only utilize the static structures of the environment for pose estimation. Therefore, most of the algorithms in this category apply elimination strategies like RANSAC [11] and robust loss functions [12] to eliminate the effects of dynamic objects. For example, ORB-SLAM [3] applies RANSAC to select and remove points that cannot converge to a stable pose estimation. DynaSLAM [13] detects the moving objects by multi-view geometry and deep learning modules. This allows inpainting the frame background that has been occluded by dynamic objects.
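The elimination strategy described above can be illustrated with a toy RANSAC sketch. The 2D-translation motion model, point counts, and threshold below are hypothetical and not taken from any of the cited systems; the idea is only that features disagreeing with the dominant (camera-induced) motion are flagged as outliers:

```python
import numpy as np

def ransac_translation(p_prev, p_curr, iters=200, thresh=0.05, seed=0):
    """Toy RANSAC: fit a 2D translation between matched feature sets and
    flag features that disagree with it (e.g., points on moving objects)."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(p_prev), dtype=bool)
    for _ in range(iters):
        i = rng.integers(len(p_prev))          # minimal sample: one match
        t = p_curr[i] - p_prev[i]              # candidate camera-induced shift
        resid = np.linalg.norm(p_curr - (p_prev + t), axis=1)
        inliers = resid < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    t = (p_curr[best_inliers] - p_prev[best_inliers]).mean(axis=0)
    return t, best_inliers

# 30 static features moved by the "camera", 5 dynamic ones by their own motion
rng = np.random.default_rng(1)
p_prev = np.vstack([rng.random((30, 2)), rng.random((5, 2))])
p_curr = p_prev + np.array([0.1, 0.0])          # global image shift
p_curr[30:] += np.array([0.3, -0.2])            # extra motion of dynamic points

t, inliers = ransac_translation(p_prev, p_curr)
print(inliers[:30].all(), inliers[30:].any())   # → True False
```

The static points are retained as inliers while the independently moving points are rejected, which is exactly the behavior an elimination-strategy front-end relies on.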
+Bârsan et al. [14] use both instance-aware semantic segmentation and sparse scene flow to classify objects as either background, moving, or potentially moving objects. Dai et al. [15] utilize the distance correlation of map points to segment dynamic objects from the static background. To reduce the computational cost, Ji et al. [16] combine semantic segmentation and geometry modules, which cluster the depth image into a few regions and identify dynamic regions via reprojection errors.
+B. Motion Constraint
+Most algorithms in this category estimate the motion of dynamic objects but do not show that the motion constraint can contribute to the camera pose estimation, and would thus suffer in highly dynamic environments. For example, Hahnel et al. [17] track the dynamic objects in the SLAM system.
+III. METHODOLOGY
+A. Background and Notation
+Visual SLAM in static environments is often formulated as a factor graph optimization [26]. The objective (1) is to find the robot states x_k ∈ X, k ∈ [0, n_x] and the static landmarks p_i ∈ P_s, i ∈ [0, n_{p_s}] that best fit the observations of the landmarks z_i^k ∈ Z, where n_x denotes the total number of robot states and n_{p_s} denotes the number of static landmarks. This is often based on a reprojection error minimization e_{i,k} = h(x_k, p_i) − z_i^k with:
+X*, P* = argmin_{X,P_s} Σ_{i,k} e_{i,k}^T Ω_{i,k}^{-1} e_{i,k},    (1)
+where h(x_k, p_i) denotes the 3D point observation function and Ω_{i,k} denotes the observation covariance matrix.
+C. Motion Constraint
+We adopt the motion constraint from [7], which does not need a prior geometric model.
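Objective (1) above is a standard nonlinear least-squares problem once data association is fixed. A minimal numeric sketch, assuming a hypothetical 2D observation function h(x_k, p_i) = p_i − x_k, known landmarks, and identity covariances (so the optimum has a closed form):

```python
import numpy as np

# Toy version of objective (1): h(x_k, p_i) = p_i - x_k observes landmark i
# from robot state x_k in 2D. With landmarks known and identity covariance,
# the least-squares robot state is the mean of p_i - z_i.
rng = np.random.default_rng(0)
landmarks = rng.random((8, 2)) * 10.0                  # static landmarks p_i
x_true = np.array([2.0, 3.0])                          # robot state x_k
z = landmarks - x_true + rng.normal(0, 0.01, (8, 2))   # noisy observations

x_hat = np.mean(landmarks - z, axis=0)   # argmin of sum ||(p_i - x_k) - z_i||^2
print(np.linalg.norm(x_hat - x_true))    # small residual error
```

Real systems optimize many poses and landmarks jointly with an iterative solver; this only illustrates the residual structure of (1).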
+For every feature point on the same rigid part of an articulated object l, we have
+^l p̄_i^{k+1} = ^l T ^l p̄_i^k,    (4)
+where ^l T ∈ SE(3) is a motion transform associated with the object l and ¯· indicates homogeneous coordinates. Therefore, we can define the loss function for the motion constraint as:
+e_m = || ^l p̄_i^{k+1} − ^l T ^l p̄_i^k ||.    (5)
+The motion constraint simultaneously estimates the objects' motion ^l T and enforces each point ^l p_i^k to follow the same motion pattern [7]. This motion model ^l T assumes that the object is rigid; thus, for articulated objects, we apply the motion constraint on each rigid part of the articulated object. In Fig. 3(c) we show the factor graph of the motion constraint.
+Fig. 3. (a) Factor graph of the rigidity constraint. Black nodes represent the camera pose, blue nodes the dynamic points, and red nodes indicate the rigid segment length. Cyan and red rectangles represent the measurements of points and rigidity, respectively. (b) A human can be modeled with points and segments based on the body parts' rigidity (key points: nose, neck, shoulders, elbows, hands, knees, and feet). (c) Factor graph of the motion constraint. The orange node is the estimated motion and the green rectangles denote the motion constraints.
+In highly dynamic environments, even if we filter out the moving objects, the tracking of static features is easily interrupted by the moving objects. By enforcing the motion constraints, dynamic objects will be able to contribute to the motion estimation of the camera pose. Therefore, when the static features are not reliable enough, moving objects can correct the camera pose estimation, preventing tracking loss.
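The motion constraint in (4)–(5) can be sketched numerically. The SE(3) motion, points, and drift below are illustrative values, not from the paper; the point is that features following the shared rigid motion get a zero residual, while a feature that moves differently is exposed:

```python
import numpy as np

# Sketch of the motion constraint (4)-(5): every point on the same rigid part
# must follow one SE(3) motion lT; a point violating it gets a large residual.
theta = np.deg2rad(10.0)
T = np.eye(4)
T[:2, :2] = [[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]]
T[0, 3] = 0.5                                   # object motion lT (rotation + shift)

p_k = np.array([[1.0, 0.0, 0.0, 1.0],
                [1.0, 1.0, 0.0, 1.0],
                [0.0, 1.0, 0.5, 1.0]])          # homogeneous points at time k
p_k1 = (T @ p_k.T).T                            # eq. (4): consistent motion
p_k1[2, :3] += [0.4, 0.0, 0.0]                  # third point drifts off the rigid part

e_m = np.linalg.norm(p_k1 - (T @ p_k.T).T, axis=1)   # eq. (5), per point
print(e_m.round(2))                             # → [0.  0.  0.4]
```

In the actual system lT is an optimization variable estimated jointly with the points, rather than being given as here.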
+In dynamic SLAM, the reprojection error e_p of dynamic feature points is also considered:
+e_p = h(x_k, ^l p_i^k) − ^l z_i^k,    (2)
+where ^l p_i^k ∈ P_d are the dynamic points and ^l z_i^k are the corresponding observations of the dynamic points.
+B. Rigidity Constraint
+Let s_ij be the segment length between two feature points ^l p_i^k and ^l p_j^k. The rigidity constraint is that s_ij is invariant over time, i.e., s_ij^k = s_ij^{k+1}, if ^l p_i^k and ^l p_j^k are on the same rigid part of an articulated object, as shown in Fig. 2. Inspired by this, we model the dynamic articulated object using a rigidity constraint, and thus we can define the rigidity error e_r as
+e_r = || ^l p_i^k − ^l p_j^k || − s_ij.    (3)
+Fig. 3(a) shows the factor graph of the rigidity constraint, where the length of segment s_ij is invariant after the motion. The benefits of involving the rigidity error (3) are two-fold. First, it offers a temporal geometric constraint for dynamic points, which is able to correct the scale and 3D structure of dynamic objects.
+D. Bundle Adjustment
+The bundle adjustment (BA) jointly optimizes the static points p_i, dynamic points ^l p_i^k, segments s_ij, camera poses x_k, and dynamic object motions ^l T. This can be formulated as the factor graph optimization:
+X*, P*, S*, T* = argmin_{X,P,S,T} Σ [ e_r^T Ω_{i,j}^{-1} e_r + e_m^T Ω_{i,l}^{-1} e_m + e_p^T Ω_{i,k}^{-1} e_p ],    (6)
+where P is the union set of P_s and P_d. This problem can be solved using the Levenberg-Marquardt algorithm.
+IV. SYSTEM OVERVIEW
+We propose the framework AirDOS, shown in Fig. 4, for dynamic stereo visual SLAM, which consists of three modules: pre-processing, tracking, and back-end bundle adjustment. In the pre-processing and tracking modules, we first extract ORB features [28] and perform an instance-level segmentation [29] to identify potential moving objects. We then estimate the initial ego-motion by tracking the static features. For articulated objects like humans, we perform Alpha-Pose [27] to extract the human key points and calculate their
+3D positions by triangulating the corresponding key points from stereo images. We then track the moving humans using the optical flow generated by PWC-Net [30]. The tracking module provides a reliable initialization for the camera pose and also the object poses of dynamic objects.
+Second, the rigidity error (3) provides a geometric check, which eliminates the incorrectly matched points. We model humans as a special articulated object shown in Fig. 3(b), where each human can be described by 14 key points, including the nose, shoulders, elbows, hands, waist, knees, and feet. In the experiments, we detect the human key points using the off-the-shelf algorithm Alpha-Pose [27].
+In the back-end optimization, we construct a global map consisting of camera poses, static points, dynamic points, and the motion of objects. We perform local bundle adjustment with dynamic objects in the co-visibility graph [31] built from the co-visible landmarks for the sake of efficiency. Similar to the strategy of RANSAC, we eliminate the factors and edges which contribute a large error based on the rigidity constraint (3) and motion constraint (5). This process helps to identify the mismatched or falsely estimated human poses. Visual SLAM algorithms usually only perform bundle adjustment on selected key-frames due to the repeated static feature observations. However, in highly dynamic environments, like the ones presented in this paper, this might easily result in loss of dynamic object tracking; therefore we perform bundle adjustment on every frame to capture the full trajectory.
+Fig. 4. The framework of AirDOS, which is composed of three modules, i.e., pre-processing, tracking, and back-end optimization. [Module blocks: stereo image; instance-level segmentation; human pose detection; optical flow estimation; static feature extractor; ego-motion estimation; 3D human pose triangulation; dynamic object tracking; motion estimation; local and global bundle adjustment; map of camera poses, static points, dynamic points, object rigidity, and motion.]
+V. EXPERIMENTS
+A. Metric, Baseline, and Implementation
+We use the Absolute Translation Error (ATE) to evaluate our algorithm. Our method is compared against the state-of-the-art methods: ORB-SLAM [3], (1) with and (2) without the masking of potential dynamic objects, and the RGB-D dynamic SLAM algorithm [20]. Similar to the setup described in Section IV, we modified ORB-SLAM to perform BA on every frame with the observations from dynamic features, so as to capture the full trajectory of the moving objects. In the experiments, we applied the same parameters to AirDOS and ORB-SLAM, i.e., the number of feature points extracted per frame, the threshold for RANSAC, and the covariance of the reprojection error.
+B. Performance on KITTI Tracking Dataset
+The KITTI Tracking dataset [32] contains 50 sequences (29 for testing, 21 for training) with multiple moving objects. We select 6 sequences that contain moving pedestrians. For evaluation, we generate the ground truth using IMU and GPS. As shown in Table I, the ATEs of both our method and ORB-SLAM are small in all sequences, which means that both methods perform well in these sequences. The main reason is that the moving objects are relatively far and small, and there are plentiful static features in these sequences. Moreover, most sequences have a simple translational movement, which makes these cases very simple.
+TABLE I: PERFORMANCE ON KITTI DATASETS BASED ON ATE (m).
+Sequence | W/ Mask: AirDOS | W/ Mask: ORB-SLAM | W/O Mask: AirDOS | W/O Mask: ORB-SLAM
+Test 18  | 0.933 | 0.934 | 0.937 | 0.948
+Test 28  | 2.033 | 2.027 | 2.031 | 2.021
+Train 13 | 1.547 | 1.618 | 1.551 | 1.636
+Train 14 | 0.176 | 0.172 | 0.174 | 0.169
+Train 15 | 0.240 | 0.234 | 0.240 | 0.234
+Train 19 | 2.633 | 2.760 | 2.642 | 2.760
+Fig. 5. Qualitative analysis of the KITTI Tracking datasets in training 19. Applying the rigidity constraint and motion constraint improves the estimation of the objects' structure.
+TABLE II: EXPERIMENTS ON TARTAN-AIR DATASET WITH AND WITHOUT MASK. Results show Absolute Trajectory Error (ATE) in meters (m). '-' means that SLAM failed in this sequence.
+Datasets | Sequence | W/ Mask: AirDOS | W/ Mask: ORB-SLAM | W/ Mask: VDO-SLAM [20] | W/O Mask: AirDOS | W/O Mask: ORB-SLAM
+Standing Human | I   | 0.0606 | 0.0788 | 0.0994 | 0.0469 | 0.1186
+Standing Human | II  | 0.0193 | 0.0060 | 0.6129 | -      | -
+Road Crossing (Easy) | III | 0.0951 | 0.0657 | 0.3813 | 0.0278 | 0.0782
+Road Crossing (Easy) | IV  | 0.0331 | 0.0196 | 0.3879 | 0.1106 | 0.0927
+Road Crossing (Easy) | V   | 0.0206 | 0.0148 | 0.2175 | 0.0149 | 0.0162
+Road Crossing (Hard) | VI  | 0.2230 | 1.0984 | 0.2400 | 3.6700 | 4.3907
+Road Crossing (Hard) | VII | 0.5625 | 0.8476 | 0.6628 | 1.1572 | 1.4632
+Overall |  | 0.1449 | 0.3044 | 0.3717 | 0.8379 | 1.0226
+Fig. 6. (a) Example of the TartanAir datasets, where almost everyone is standing. (b) Example of moving humans in road crossing.
+Although the camera trajectory is similar, our algorithm recovers a better human model, as shown in Fig. 5. ORB-SLAM generates noisy human poses when the human is far away from the camera. That's because the rigidity constraint helps to recover the structure of the moving articulated objects.
+Also, the motion constraint can improve the accuracy of the dynamic objects' trajectory. Given the observations from the entire trajectory, our algorithm recovers the human pose and eliminates the mismatched dynamic feature points.
+Fig. 7. Qualitative analysis of the TartanAir sequence IV. The moving objects tracked by ORB-SLAM are noisy, while our proposed method generates a smooth trajectory. We show that dynamic objects and the camera pose can benefit each other in visual SLAM.
+C. Performance on TartanAir Shibuya Dataset
+1) Evaluation: To test the robustness of our system when the visual odometry is interrupted by dynamic objects, or in cases where the segmentation might fail due to indirect occlusions such as illumination changes, we evaluate the performance in two settings: with and without masking the dynamic features during ego-motion estimation.
+We notice that the moving objects in the KITTI dataset only cover a small field of view. To address the challenges of the highly dynamic environment, we build the TartanAir Shibuya dataset as shown in Fig. 6, and demonstrate that our method outperforms the existing dynamic SLAM algorithms on this benchmark. Our previous work TartanAir [33] is a very challenging visual SLAM dataset consisting of binocular RGB-D sequences together with additional per-frame information such as camera poses, optical flow, and semantic annotations. In this paper, we use the same pipeline to generate TartanAir Shibuya, which simulates the world's busiest road intersection at Shibuya, Tokyo, shown in Fig. 1.
+It covers much more challenging viewpoints and diverse motion patterns for articulated objects than TartanAir.
+We separate the TartanAir Shibuya dataset into two groups: Standing Humans in Fig. 6(a) and Road Crossing in Fig. 6(b), with easy and difficult categories. Each sequence contains 100 frames and more than 30 tracked moving humans. In the sequences of Standing Human, most of the humans stand still, while a few of them move around the space. In Road Crossing, there are multiple moving humans coming from different directions. In the difficult sequences, dynamic objects often enter the scene abruptly, in which case the visual odometry of traditional methods will fail easily.
+As shown in Table II, with human masks, our algorithm obtains 39.5% and 15.2% improvements compared to ORB-SLAM [3] and VDO-SLAM [20] in the overall performance. In Sequences II, IV, and V, both ORB-SLAM and our algorithm show a good performance, where all ATEs are lower than 0.04. We notice that the performance of VDO-SLAM is not as good as ORB-SLAM. This may be because VDO-SLAM relies heavily on the optical flow for feature matching; it is likely to confuse background features with dynamic features.
+Our algorithm also outperforms ORB-SLAM without masking the potential moving objects. As shown in Sequences I, III, V, and VI of Table II, our method obtains a higher accuracy than ORB-SLAM by 0.0717, 0.050, 0.721, and 0.306. Overall, we achieve an improvement of 18.1%. That's because moving objects can easily lead the traditional visual odometry to fail, but we take the observations from the moving articulated objects to rectify the camera poses and filter out the mismatched dynamic features.
+It can be seen in Fig. 7 that ORB-SLAM was interrupted by the moving humans and failed when making a large rotation. By tracking moving humans, our method outperforms ORB-SLAM when making a turn. Also, a better camera pose estimation can in turn benefit the moving objects' trajectory. As can be seen, the objects' trajectories generated by ORB-SLAM are noisy and inconsistent, while ours are smoother. In general, the proposed motion constraint and rigidity constraint have a significant impact on the difficult sequences, where ORB-SLAM outputs inaccurate trajectories due to dynamic objects.
+VI. ABLATION STUDY
+We perform an ablation study to show the effects of the introduced rigidity and motion constraints. Specifically, we demonstrate that the motion constraint and rigidity constraint improve the camera pose estimation via bundle adjustment.
+A. Implementation
+We simulate dynamic articulated objects that follow a simple constant motion pattern, and initialize the robot's state with Gaussian noise of σ = 0.05 m on translation and σ = 2.9° on rotation. We also generate static features around the path of the robot, and simulate a sensor with a finite field of view. The measurement of a point also has a noise of σ = 0.05 m. We generate 4 groups of sequences with different lengths, and each group consists of 10 sequences that are initialized with the same number of static and dynamic features. We set the ratio of static to dynamic landmarks to 1:1.8.
+B. Results
+We evaluate the performance of (a) bundle adjustment with static features only, (b) bundle adjustment without the motion constraint, (c) bundle adjustment without the rigidity constraint, and (d) bundle adjustment with both the motion constraint and rigidity constraint. We use the Absolute Translation Error (ATE) and the Relative Pose Error of Rotation (RPE-R) and Translation (RPE-T) as our evaluation metrics.
+As shown in Table III, both motion and rigidity constraints are able to improve the camera pose estimation, while the best performance is obtained when the two constraints are applied together. An interesting phenomenon is that the rigidity constraint can also benefit the objects' trajectory estimation. In Group I, we evaluate the estimation of dynamic points with settings (b), (c), and (d), with 100 repeated experiments. We find that the ATE of dynamic object feature points in setting (c) is 5.68 ± 0.30 lower than in setting (b), while setting (d) is 5.71 ± 0.31 lower than (b). This is because the motion constraint assumes that every dynamic feature on the same object follows the same motion pattern, which requires the object to be rigid. From another point of view, the rigidity constraint provides a good initialization of the objects' 3D structure, and so indirectly improves the estimation of the objects' trajectory. In general, the ablation study proves that applying motion and rigidity constraints to dynamic articulated objects can benefit the camera pose estimation.
+TABLE III: ABLATION STUDY ON SIMULATED DATASET. Results show RPE-T and ATE in centimeters (cm) and RPE-R in degrees (°). Entries list RPE-R / RPE-T / ATE per group; cells and one row label lost in extraction are marked "–".
+Groups | I | II | III | IV | Overall
+Before BA | 0.4898 / – / 83.441 | 0.6343 / – / 109.968 | 1.1003 / – / 138.373 | 0.7925 / – / 168.312 | 0.7908 / – / 125.024
+BA w/ static point | 0.0989 / 15.991 / 15.002 | 0.1348 / 17.728 / 25.796 | 0.2028 / 21.070 / 17.085 | 0.1389 / 19.242 / 35.521 | 0.1537 / 18.8328 / –
+– | 0.0988 / 3.3184 / 15.019 | 0.1349 / 3.7146 / 25.708 | 0.2035 / 4.2522 / 16.985 | 0.1388 / 3.5074 / 35.269 | 0.1538 / 3.7540 / 23.351
+BA w/o motion | 0.0962 / 3.3176 / 14.881 | 0.1282 / 3.7176 / 25.704 | 0.1871 / 4.2631 / 16.921 | 0.1226 / 3.5069 / 35.426 | 0.1410 / 3.7565 / 23.245
+BA w/o rigidity | 0.0958 / 3.2245 / 14.879 | 0.1276 / 3.4984 / 25.703 | 0.1870 / 4.0387 / 16.914 | 0.1215 / 3.2397 / 35.412 | 0.1407 / 3.5148 / 23.233
+BA in Equation (6) | – / 3.2177 / – | – / 3.4824 / – | – / 4.0372 / – | – / 3.2227 / – | – / 3.5085 / 23.227
+C. Computational Analysis
+Finally, we evaluate the running time of the rigidity constraint and motion constraint in the optimization. The back-end optimization is implemented in C++ with a modified g2o [34] solver. With the same setup as Section VI-A, we randomly initialized 10 different sequences with 18 frames. In each frame, we can observe 8 static landmarks and 12 dynamic landmarks from one moving object. In Table IV, we show the (i) convergence time and (ii) runtime per iteration of group I in the ablation study. Our method takes 53.54 ms to converge, which is comparable to 39.22 ms for the optimization with the reprojection error only. In this paper, the semantic masks [29] and human poses [27] are pre-processed as an input to the system. The experiments are carried out on an Intel Core i7 with 16 GB RAM.
+TABLE IV: TIME ANALYSIS OF BUNDLE ADJUSTMENT
+Setting | Convergence Time (ms) | Runtime/iter (ms)
+BA w/ reprojection error | 39.22 | 4.024
+BA w/o Rigidity | 45.47 | 4.078
+BA w/o Motion | 45.37 | 4.637
+BA in Equation (6) | 53.54 | 4.792
+CONCLUSION
+In this paper, we introduce the rigidity constraint and motion constraint to model dynamic articulated objects. We propose a new pipeline, AirDOS, for stereo SLAM, which jointly optimizes the trajectories of dynamic objects, the map of the environment, and the camera poses, improving the robustness and accuracy in dynamic environments. We evaluate our algorithm on the KITTI tracking and TartanAir Shibuya datasets, and demonstrate that camera pose estimation and dynamic objects can benefit each other, especially when there is an aggressive rotation or static features are not enough to support the visual odometry.
+REFERENCES
+[24] W. B. Thompson, P. Lechleider, and E. R.
Stuck, “Detecting moving + objects using the rigidity constraint,” IEEE Transactions on Pattern + [1] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, Analysis and Machine Intelligence, vol. 15, no. 2, pp. 162–166, 1993. + I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous + localization and mapping: Toward the robust-perception age,” IEEE [25] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and + Transactions on robotics, vol. 32, no. 6, pp. 1309–1332, 2016. A. J. Davison, “Slam++: Simultaneous localisation and mapping at the + level of objects,” in Proceedings of the IEEE conference on computer + [2] C. Wang, J. Yuan, and L. Xie, “Non-iterative SLAM,” in International vision and pattern recognition, 2013, pp. 1352–1359. + Conference on Advanced Robotics (ICAR). IEEE, 2017, pp. 83–90. + [26] M. Kaess, A. Ranganathan, and F. Dellaert, “isam: Incremental + [3] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: a smoothing and mapping,” IEEE Transactions on Robotics, vol. 24, + versatile and accurate monocular slam system,” IEEE transactions on no. 6, pp. 1365–1378, 2008. + robotics, vol. 31, no. 5, pp. 1147–1163, 2015. + [27] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, “RMPE: Regional multi- + [4] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE person pose estimation,” in ICCV, 2017. + transactions on pattern analysis and machine intelligence, vol. 40, + no. 3, pp. 611–625, 2017. [28] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An + efficient alternative to sift or surf,” in 2011 International conference + [5] L. Nicholson, M. Milford, and N. Sünderhauf, “Quadricslam: Dual on computer vision. Ieee, 2011, pp. 2564–2571. + quadrics from object detections as landmarks in object-oriented slam,” + IEEE Robotics and Automation Letters, vol. 4, no. 1, pp. 1–8, 2018. [29] K. He, G. Gkioxari, P. Dollár, and R. 
Girshick, “Mask r-cnn,” in + Proceedings of the IEEE international conference on computer vision, + [6] S. Yang and S. Scherer, “Cubeslam: Monocular 3-d object slam,” IEEE 2017, pp. 2961–2969. + Transactions on Robotics, vol. 35, no. 4, pp. 925–938, 2019. + [30] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for optical + [7] M. Henein, G. Kennedy, R. Mahony, and V. Ila, “Exploiting rigid body flow using pyramid, warping, and cost volume,” in Proceedings of the + motion for slam in dynamic environments,” environments, vol. 18, IEEE conference on computer vision and pattern recognition, 2018, + p. 19, 2018. pp. 8934–8943. + + [8] K. M. Judd, J. D. Gammell, and P. Newman, “Multimotion visual [31] C. Mei, G. Sibley, and P. Newman, “Closing loops without places,” + odometry (mvo): Simultaneous estimation of camera and third-party in 2010 IEEE/RSJ International Conference on Intelligent Robots and + motions,” in 2018 IEEE/RSJ International Conference on Intelligent Systems. IEEE, 2010, pp. 3738–3744. + Robots and Systems (IROS). IEEE, 2018, pp. 3949–3956. + [32] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: + [9] M. Henein, J. Zhang, R. Mahony, and V. Ila, “Dynamic slam: The The kitti dataset,” The International Journal of Robotics Research, + need for speed,” in 2020 IEEE International Conference on Robotics vol. 32, no. 11, pp. 1231–1237, 2013. + and Automation (ICRA). IEEE, 2020, pp. 2123–2129. + [33] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, +[10] G. Stamou, M. Krinidis, E. Loutas, N. Nikolaidis, and I. Pitas, “4.11- and S. Scherer, “Tartanair: A dataset to push the limits of visual + 2d and 3d motion tracking in digital video,” Handbook of Image and slam,” in IEEE/RSJ International Conference on Intelligent Robots + Video Processing, 2005. and Systems (IROS), 2020. + +[11] M. A. Fischler and R. C. Bolles, “Random sample consensus: a [34] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. 
Burgard, + paradigm for model fitting with applications to image analysis and “g 2 o: A general framework for graph optimization,” in 2011 IEEE + automated cartography,” Communications of the ACM, vol. 24, no. 6, International Conference on Robotics and Automation. IEEE, 2011, + pp. 381–395, 1981. pp. 3607–3613. + +[12] C. Kerl, J. Sturm, and D. Cremers, “Robust odometry estimation for + rgb-d cameras,” in 2013 IEEE International Conference on Robotics + and Automation. IEEE, 2013, pp. 3748–3754. + +[13] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, “Dynaslam: Tracking, + mapping, and inpainting in dynamic scenes,” IEEE Robotics and + Automation Letters, vol. 3, no. 4, pp. 4076–4083, 2018. + +[14] I. A. Bârsan, P. Liu, M. Pollefeys, and A. Geiger, “Robust dense + mapping for large-scale dynamic environments,” in 2018 IEEE In- + ternational Conference on Robotics and Automation (ICRA). IEEE, + 2018, pp. 7510–7517. + +[15] W. Dai, Y. Zhang, P. Li, Z. Fang, and S. Scherer, “Rgb-d slam in + dynamic environments using point correlations,” IEEE Transactions + on Pattern Analysis and Machine Intelligence, 2020. + +[16] T. Ji, C. Wang, and L. Xie, “Towards real-time semantic rgb-d slam in + dynamic environments,” in 2021 International Conference on Robotics + and Automation (ICRA), 2021. + +[17] D. Hahnel, R. Triebel, W. Burgard, and S. Thrun, “Map building with + mobile robots in dynamic environments,” in 2003 IEEE International + Conference on Robotics and Automation (Cat. No. 03CH37422), + vol. 2. IEEE, 2003, pp. 1557–1563. + +[18] C.-C. Wang, C. Thorpe, S. Thrun, M. Hebert, and H. Durrant-Whyte, + “Simultaneous localization, mapping and moving object tracking,” The + International Journal of Robotics Research, vol. 26, no. 9, pp. 889– + 916, 2007. + +[19] N. D. Reddy, P. Singhal, V. Chari, and K. M. Krishna, “Dynamic body + vslam with semantic constraints,” in 2015 IEEE/RSJ International + Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, + pp. 
Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on July 04,2023 at 02:38:40 UTC from IEEE Xplore. Restrictions apply.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2022年/DynaVINS_A_Visual-Inertial_SLAM_for_Dynamic_Environments.pdf b/动态slam/2020年-2022年开源动态SLAM/2022年/DynaVINS_A_Visual-Inertial_SLAM_for_Dynamic_Environments.pdf
new file mode 100644
index 0000000..cb06d7b
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2022年/DynaVINS_A_Visual-Inertial_SLAM_for_Dynamic_Environments.pdf
@@ -0,0 +1,663 @@
IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 7, NO. 4, OCTOBER 2022 11523

DynaVINS: A Visual-Inertial SLAM for Dynamic Environments

Seungwon Song, Hyungtae Lim, Graduate Student Member, IEEE, Alex Junho Lee, and Hyun Myung, Senior Member, IEEE

Abstract—Visual inertial odometry and SLAM algorithms are widely used in various fields, such as service robots, drones, and autonomous vehicles. Most of the SLAM algorithms are based on the assumption that landmarks are static. However, in the real world, various dynamic objects exist, and they degrade the pose estimation accuracy. In addition, temporarily static objects, which are static during observation but move when they are out of sight, trigger false positive loop closings. To overcome these problems, we propose a novel visual-inertial SLAM framework, called DynaVINS, which is robust against both dynamic objects and temporarily static objects. In our framework, we first present a robust bundle adjustment that can reject the features from dynamic objects by leveraging pose priors estimated by the IMU preintegration. Then, a keyframe grouping and a multi-hypothesis-based constraints grouping method are proposed to reduce the effect of temporarily static objects in the loop closing. Subsequently, we evaluated our method on a public dataset that contains numerous dynamic objects. Finally, the experimental results corroborate that our DynaVINS has promising performance compared with other state-of-the-art methods by successfully rejecting the effect of dynamic and temporarily static objects.

Index Terms—Visual-inertial SLAM, SLAM, visual tracking.

Fig. 1. Our algorithm, DynaVINS, in various dynamic environments. (a)–(b) Feature rejection results in the city_day sequence of the VIODE dataset [13]. Even if most features are dynamic, DynaVINS can discard the effect of the dynamic features. (c) Separation of feature matching results into multiple hypotheses in the E-shape sequence of our dataset. Even if a temporarily static object exists, only a hypothesis from static objects is determined as true positive. Features with high and low weights are denoted as green circles and red crosses, respectively, in both cases.

I. INTRODUCTION

SIMULTANEOUS localization and mapping (SLAM) algorithms have been widely exploited in various robotic applications that require precise positioning or navigation in environments where GPS signals are blocked. Various types of sensors have been used in SLAM algorithms. In particular, visual sensors such as monocular cameras [1], [2], [3] and stereo cameras [4], [5], [6] are widely used because of their relatively low cost and weight with rich information.

Various visual SLAM methods have been studied for more than a decade. However, most researchers have assumed that landmarks are implicitly static; thus, many visual SLAM methods still have potential risks when interacting with real-world environments that contain various dynamic objects. Only recently have several studies focused on dealing with dynamic objects solely using visual sensors.

Most of the studies [7], [8], [9] address the problems by detecting the regions of dynamic objects via depth clustering, feature reprojection, or deep learning. Moreover, some researchers incorporate the dynamics of the objects into the optimization framework [10], [11], [12]. However, geometry-based methods require accurate camera poses; hence they can only deal with limited fractions of dynamic objects. In addition, deep-learning-aided methods have the limitation of solely working for predefined objects.

In the meanwhile, visual-inertial SLAM (VI-SLAM) frameworks [2], [3], [4], [5], [6] have been proposed by integrating an inertial measurement unit (IMU) into the visual SLAM. Unlike the visual SLAMs, a motion prior from the IMU helps the VI-SLAM algorithms tolerate scenes with dynamic objects to some degree.

Manuscript received 27 April 2022; accepted 22 August 2022. Date of publication 31 August 2022; date of current version 6 September 2022. This letter was recommended for publication by Associate Editor M. Magnusson and Editor S. Behnke upon evaluation of the reviewers' comments. This work was
supported in part by the "Indoor Robot Spatial AI Technology Development" project funded by KT, KT award under Grant B210000715, and in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) under Grant 2020-0-00440, Development of Artificial Intelligence Technology that Continuously Improves Itself as the Situation Changes in the Real World. The students are supported by the BK21 FOUR from the Ministry of Education (Republic of Korea). (Corresponding author: Hyun Myung.)

Seungwon Song, Hyungtae Lim, and Hyun Myung are with the School of Electrical Engineering, KAIST, Daejeon 34141, Republic of Korea (e-mail: sswan55@kaist.ac.kr; shapelim@kaist.ac.kr; hmyung@kaist.ac.kr).

Alex Junho Lee is with the Department of Civil and Environmental Engineering, KAIST, Daejeon 34141, Republic of Korea (e-mail: alex_jhlee@kaist.ac.kr).

Our code is available: https://github.com/url-kaist/dynaVINS

This letter has supplementary downloadable material available at https://doi.org/10.1109/LRA.2022.3203231, provided by the authors.

Digital Object Identifier 10.1109/LRA.2022.3203231

2377-3766 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:36:06 UTC from IEEE Xplore. Restrictions apply.

However, if the dominant dynamic objects occlude most of the view as shown in Fig. 1(b), the problem cannot be solved solely using the motion prior.

In addition, in real-world applications, temporarily static objects are static while being observed but in motion when they are not under observation. These objects may lead to a critical failure in the loop closure process due to false positives, as shown in Fig. 1(c). To deal with temporarily static objects, robust back-end methods [14], [15], [16], [17] have been proposed to reduce the effect of false positive loop closures in the optimization. However, since they focus on instantaneous false positive loop closures, they cannot deal with the persistent false positive loop closures caused by temporarily static objects.

In this study, to address the aforementioned problems, we propose a robust VI-SLAM framework, called DynaVINS, which is robust against dynamic and temporarily static objects. Our contributions are summarized as follows:

- The robust VI-SLAM approach is proposed to handle dominant, undefined dynamic objects that cannot be handled solely by learning-based or vision-only methods.
- A novel bundle adjustment (BA) pipeline is proposed for simultaneously estimating camera poses and discarding the features from the dynamic objects that deviate significantly from the motion prior.
- A robust global optimization with constraints grouped into multiple hypotheses is proposed to reject persistent loop closures from the temporarily static objects.

In the remainder of this letter, we introduce the robust BA method for optimizing moving windows in Section III and the methods for the robust global optimization in Section IV, and we compare our proposed method with other state-of-the-art (SOTA) methods in various environments in Section V.

II. RELATED WORKS

A. Visual-Inertial SLAM

As mentioned earlier, to address the limitations of the visual SLAM framework, VI-SLAM algorithms have recently been proposed to correct the scale and camera poses by adopting the IMU. MSCKF [3] was proposed as an extended Kalman filter (EKF)-based VI-SLAM algorithm. ROVIO [6] also used an EKF, but proposed a fully robocentric and direct VI-SLAM framework running in real time.

There are other approaches using optimization. OKVIS [5] proposed a keyframe-based framework and fuses the IMU preintegration residual and the reprojection residual in an optimization. ORB-SLAM3 [4] used an ORB descriptor for the feature matching, and poses and feature positions are corrected through an optimization. VINS-Fusion [2], an extended version of VINS-Mono, supports a stereo camera and adopts feature tracking rather than descriptor matching, which makes the algorithm faster and more robust.

However, the VI-SLAM methods described above still have potential limitations in handling the dominant dynamic objects and the temporarily static objects.

B. Dynamic Objects Rejection in Visual and VI SLAM

Numerous researchers have proposed various methods to handle dynamic objects in visual and VI SLAM algorithms. Fan et al. [8] proposed a multi-view geometry-based method using an RGB-D camera. After obtaining camera poses by minimizing the reprojection error, the type of each feature point is determined as dynamic or static by the geometric relationship between the camera movement and the feature. Canovas et al. [9] proposed a similar method, but adopted a surfel, similar to a polygon, to enable real-time performance by reducing the number of items to be computed. However, multi-view geometry-based algorithms assume that the camera pose estimation is accurate enough, leading to failure when the camera pose estimation is inaccurate owing to the dominant dynamic objects.

One of the solutions to this problem is to employ a wheel encoder. G2P-SLAM [18] rejected loop closure matching results with a high Mahalanobis distance from the pose estimated by the wheel odometry, which is invariant to the effect of dynamic and temporarily static objects. Despite the advantages of the wheel encoder, these methods are highly dependent on it, limiting their applicability.

Another feasible approach is to adopt deep learning networks to identify predefined dynamic objects. In DynaSLAM [7], masked areas of the predefined dynamic objects obtained by a deep learning network were eliminated, and the remainder was determined via multi-view geometry. In Dynamic SLAM [19], a compensation method was adopted to make up for missed detections in a few keyframes using sequential data. Although the deep learning methods can successfully discard the dynamic objects even if they are temporarily static, these methods are somewhat problematic for the following two reasons: a) the types of dynamic objects have to be predefined, and b) sometimes only a part of the dynamic object is visible, as shown in Fig. 1(b). For these reasons, the objects may occasionally not be detected.

On the other hand, methods for tracking a dynamic object's motion have been proposed. RigidFusion [10] assumed that only a single dynamic object is in the environment and estimated the motion of that object. Qiu et al. [12] combined a deep learning method and VINS-Mono [2] to track the poses of the camera and the object simultaneously. DynaSLAM II [11] identified dynamic objects, similar to DynaSLAM [7]; then, within the BA factor graph, the poses of static features and the camera were estimated while the motion of the dynamic objects was estimated simultaneously.

C. Robust Back-End

In the graph SLAM field, several researchers have attempted to discard incorrectly created constraints. For instance, max-mixture [14] employed a single integrated Bayesian framework to eliminate the incorrect loop closures, while the switchable constraint [15] was proposed to adjust the weight of each constraint to eliminate false positive loop closures in the optimization. However, false-positive loop closures can be expected to be consistent and to occur persistently due to the temporarily static objects. These robust kernels are not appropriate for handling such persistent loop closures.

On the other hand, the Black-Rangarajan (B-R) duality [20] was proposed to unify robust estimation and the outlier rejection process. Some methods [16], [17] utilize the B-R duality in point cloud registration and pose graph optimization (PGO) to reduce the effect of false-positive matches even if they are dominant. These methods are useful for rejecting outliers in a PGO. However, repeatedly detected false-positive loop closures from similar objects are not considered. Moreover, the B-R duality has not yet been utilized in the BA of VI-SLAM.

To address the aforementioned limitations, we improve the VI-SLAM to minimize the effect of the dynamic and temporarily static objects by adopting the B-R duality not only in the graph structure but also in the BA framework by reflecting the IMU prior and the feature tracking information.

SONG et al.: DYNAVINS: A VISUAL-INERTIAL SLAM FOR DYNAMIC ENVIRONMENTS 11525

Fig. 2. The pipeline of our robust visual inertial SLAM. Features are tracked in mono or stereo images and IMU data are preintegrated in the sensor preprocessing step. Then, the robust BA is applied to discard tracked features from dynamic objects, and only the features from static objects will remain. Keyframes are grouped using the number of tracked features, and loop closures detected in current keyframe groups are clustered into hypotheses. Each hypothesis with its weight is used or rejected in the selective optimization. Using the proposed framework, a trajectory robust against dynamic and temporarily static objects can be obtained.

III. ROBUST BUNDLE ADJUSTMENT

A. Notation

In this letter, the following notations are defined. The i-th camera frame and the j-th tracked feature are denoted as $C_i$ and $f_j$, respectively. For two frames $C_A$ and $C_B$, $T^B_A \in SE(3)$ denotes the pose of $C_A$ relative to $C_B$, and the pose of $C_A$ in the world frame $W$ can be denoted as $T^W_A$.

$\mathcal{B}$ is a set of indices of the IMU preintegrations, and $\mathcal{P}$ is a set of visual pairs $(i, j)$ where $i$ corresponds to the frame $C_i$ and $j$ to the feature $f_j$. Because the feature $f_j$ is tracked across multiple camera frames, different camera frames can contain the same feature $f_j$. Thus, the set of indices of all tracked features in the current moving window is denoted as $\mathcal{F}_\mathcal{P}$, and the set of indices of the camera frames that contain the feature $f_j$ is denoted as $\mathcal{P}(f_j)$.

In the visual-inertial optimization framework of the current sliding window, $\mathcal{X}$ represents the full state vector that contains the sets of poses and velocities of the keyframes, the biases of the IMU, i.e., acceleration and gyroscope biases, and the estimated depths of the features as in [2].

B. Conventional Bundle Adjustment

In the conventional visual-inertial state estimator [2], the visual-inertial BA formulation is defined as follows:

\min_{\mathcal{X}} \Bigl\{ \bigl\| r_p - H_p \mathcal{X} \bigr\|^2 + \sum_{k \in \mathcal{B}} \bigl\| r_I\bigl(\hat{z}^{b_{k+1}}_{b_k}, \mathcal{X}\bigr) \bigr\|^2_{P^{b_{k+1}}_{b_k}} + \sum_{(i,j) \in \mathcal{P}} \rho_H\Bigl( \bigl\| r_P\bigl(\hat{z}^{C_i}_j, \mathcal{X}\bigr) \bigr\|^2_{P^{C_i}_j} \Bigr) \Bigr\},   (1)

where $\rho_H(\cdot)$ denotes the Huber loss [21]; $r_p$, $r_I$, and $r_P$ represent the residuals for the marginalization, IMU, and visual reprojection measurements, respectively; $\hat{z}^{b_{k+1}}_{b_k}$ and $\hat{z}^{C_i}_j$ stand for the observations of the IMU and the feature points; $H_p$ denotes a measurement estimation matrix of the marginalization, and $P$ denotes the covariance of each term. For convenience, $r_I(\hat{z}^{b_{k+1}}_{b_k}, \mathcal{X})$ and $r_P(\hat{z}^{C_i}_j, \mathcal{X})$ are simplified as $r^k_I$ and $r^{\mathcal{P}}_{j,i}$, respectively.

The Huber loss does not work successfully once the ratio of outliers increases. This is because the Huber loss does not entirely reject the residuals from outliers [22]. On the other hand, the redescending M-estimators, such as Geman-McClure (GMC) [23], ignore the outliers perfectly once the residuals are over a specific range owing to their zero gradients. Unfortunately, this truncation triggers a problem that features considered as outliers would never become inliers even though the features originated from static objects.

To address these problems, our BA method consists of two parts: a) a regularization factor that leverages the IMU preintegration, and b) a momentum factor that considers the previous state of each weight to cover the case where the preintegration becomes temporarily inaccurate.

C. Regularization Factor

First, to reject the outlier features while robustly estimating the poses, we propose a novel loss term inspired by the B-R duality [20] as follows:

\rho\bigl(w_j, r^{\mathcal{P}}_j\bigr) = w_j^2\, r^{\mathcal{P}}_j + \lambda_w \Phi^2(w_j),   (2)

where $r^{\mathcal{P}}_j$ denotes $\sum_{i \in \mathcal{P}(f_j)} \| r^{\mathcal{P}}_{j,i} \|^2$ for simplicity; $w_j \in [0, 1]$ denotes the weight corresponding to each feature $f_j$, and an $f_j$ with $w_j$ close to 1 is determined to be a static feature; $\lambda_w \in \mathbb{R}^+$ is a constant parameter; $\Phi(w_j)$ denotes the regularization factor of the weight $w_j$ and is defined as follows:

\Phi(w_j) = 1 - w_j.   (3)

Then, $\rho(w_j, r^{\mathcal{P}}_j)$ in (2) is adopted instead of the Huber norm in the visual reprojection term in (1). Hence, the BA formulation can be expressed as:

\min_{\mathcal{X},\mathcal{W}} \Bigl\{ \| r_p - H_p \mathcal{X} \|^2 + \sum_{k \in \mathcal{B}} \| r^k_I \|^2 + \sum_{j \in \mathcal{F}_\mathcal{P}} \rho\bigl(w_j, r^{\mathcal{P}}_j\bigr) \Bigr\},   (4)

where $\mathcal{W} = \{ w_j \mid j \in \mathcal{F}_\mathcal{P} \}$ represents the set of all weights. By adopting the weight and regularization factor inspired by the B-R duality, the influence of features with a high reprojection error compared to the estimated state can be reduced while maintaining the state estimation performance. The details will be covered in the remainder of this subsection.

(4) is solved using an alternating optimization [20]. Because the current state $\mathcal{X}$ can be estimated from the IMU preintegration and the previously optimized state, unlike other methods [16], [17], $\mathcal{W}$ is updated first with $\mathcal{X}$ fixed. Then, $\mathcal{X}$ is optimized with $\mathcal{W}$ fixed. While optimizing $\mathcal{W}$, all terms except the weights are constants. Hence, the formulation for optimizing the weights can be expressed as follows:

\min_{\mathcal{W}} \Bigl\{ \sum_{j \in \mathcal{F}_\mathcal{P}} \rho\bigl(w_j, r^{\mathcal{P}}_j\bigr) \Bigr\}.   (5)

Because the weights $w_j$ are independent of each other, (5) can be optimized independently for each $w_j$ as follows:

\min_{w_j \in [0,1]} \Bigl\{ w_j^2 \Bigl( \sum_{i \in \mathcal{P}(f_j)} \| r^{\mathcal{P}}_{j,i} \|^2 \Bigr) + \lambda_w \Phi^2(w_j) \Bigr\}.   (6)

Because the terms in (6) are in a quadratic form w.r.t. $w_j$, the optimal $w_j$ can be derived as follows:

w_j = \frac{\lambda_w}{\lambda_w + r^{\mathcal{P}}_j}.   (7)

As mentioned previously, the weights are first optimized based on the estimated state. Thus the weights of features with high reprojection errors start with small values. However, as shown in Fig. 3(a), the loss of the feature $\rho(w_j, r^{\mathcal{P}}_j)$ is a convex function unless the weight is zero; there is a non-zero gradient not only in the loss of an inlier feature but also in the loss of an outlier feature, which means that a new feature affects the BA regardless of its type at first.

While the optimization step is repeated until the states and the weights converge, the weights of the outlier features are lowered and their losses are flattened further. As a result, the losses of the outlier features approach zero gradient and cannot affect the BA.

After convergence, the weight can be expressed using the reprojection error as in (7). Thus the converged loss $\bar{\rho}(r^{\mathcal{P}}_j)$ can be derived by applying (7) to (2) as follows:

\bar{\rho}\bigl(r^{\mathcal{P}}_j\bigr) = \frac{\lambda_w\, r^{\mathcal{P}}_j}{\lambda_w + r^{\mathcal{P}}_j}.   (8)

As shown in Fig. 3(b), increasing $\lambda_w$ affects $\bar{\rho}(r^{\mathcal{P}}_j)$ in two directions: increasing the gradient value and the convexity. By increasing the gradient value, the visual reprojection residuals affect the BA more than the marginalization and IMU preintegration residuals. And by increasing the convexity, some of the outlier features can affect the BA.

To sum up, the proposed factor benefits from both the Huber loss and GMC by adjusting the weights in an adaptive way; our method efficiently filters out outliers, but does not entirely ignore outliers in the optimization at first either.

Fig. 3. Changes of the loss functions w.r.t. various parameters. (a) $\rho(w_j, r^{\mathcal{P}}_j)$ w.r.t. $w_j$ in the alternating optimization for $\lambda_w = 1$. $\bar{\rho}(r^{\mathcal{P}}_j)$ represents the converged loss. (b) $\bar{\rho}(r^{\mathcal{P}}_j)$ w.r.t. $\lambda_w$. (c) $\bar{\rho}_m(r^{\mathcal{P}}_j)$ w.r.t. $\bar{w}_j$ for $n_j = 5$. (d) $\bar{\rho}_m(r^{\mathcal{P}}_j)$ w.r.t. $n_j$ for $\bar{w}_j = 0$.

Fig. 4. Framework of the robust BA. Each feature has a weight and is used in the visual residual. Each weight is optimized through the regularization factor and the weight momentum factor. Preintegrated IMU data are used in the IMU residual term. All parameters are optimized in the robust BA.

D. Weight Momentum Factor

When the motion becomes aggressive, the IMU preintegration becomes imprecise, and thus the estimated state becomes inaccurate. In this case, the reprojection residuals of the features from the static objects become larger; hence, by the regularization factor, those features will be ignored in the BA process even though their previous weights were close to one.

If $\lambda_w$ is increased to solve this problem, even the features with high reprojection residuals caused by dynamic objects are used; therefore, the result of the BA will be inaccurate. Thus, increasing $\lambda_w$ is not enough to cope with this problem.

To solve this issue, an additional factor, a weight momentum factor, is proposed to make the previously estimated feature weights unaffected by an aggressive motion. Because the features are continuously tracked, each feature $f_j$ is optimized $n_j$ times with its previous weight $\bar{w}_j$. In order to make the current weight tend to remain at $\bar{w}_j$, and to increase the degree of this tendency as $n_j$ increases, the weight momentum factor $\Psi(w_j)$ is designed as follows:

\Psi(w_j) = n_j(\bar{w}_j - w_j).   (9)

Then, adding (9) to (2), the modified loss term can be derived as follows:

\rho_m\bigl(w_j, r^{\mathcal{P}}_j\bigr) = w_j^2 \sum_{i \in \mathcal{P}(f_j)} \| r^{\mathcal{P}}_{j,i} \|^2 + \lambda_w \Phi^2(w_j) + \lambda_m \Psi^2(w_j),   (10)

where $\lambda_m \in \mathbb{R}^+$ represents a constant parameter to adjust the effect of the momentum factor on the BA.

In summary, the proposed robust BA can be illustrated as in Fig. 4. The previous weights of the tracked features are used in the weight momentum factor, and the weights of all features in the current window are used in the regularization factor. As a result, the robust BA is expressed as follows:

\min_{\mathcal{X},\mathcal{W}} \Bigl\{ \| r_p - H_p \mathcal{X} \|^2 + \sum_{k \in \mathcal{B}} \| r^k_I \|^2 + \sum_{j \in \mathcal{F}_\mathcal{P}} \rho_m\bigl(w_j, r^{\mathcal{P}}_j\bigr) \Bigr\}.   (11)

(11) can be solved by using the alternating optimization in the same way as (4). The alternating optimization is iterated until $\mathcal{X}$ and $\mathcal{W}$ converge. Then, the converged loss $\bar{\rho}_m(r^{\mathcal{P}}_j)$ can be derived. $\bar{\rho}_m(r^{\mathcal{P}}_j)$ w.r.t. $\bar{w}_j$ and $n_j$ is shown in Fig. 3(c) and (d), respectively.

As shown in Fig. 3(c), if $\bar{w}_j$ is low, the gradient of the loss is small even when $r^{\mathcal{P}}_j$ is close to 0. Thus, the features presumably originating from dynamic objects do not have much impact on the BA even if their reprojection errors are low in the current step.
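The per-feature weight step above has closed forms: (7) for the plain regularization factor, and, if the derivative of (10) w.r.t. $w_j$ is set to zero with the residual held fixed, a momentum-augmented analogue. The following is a minimal numerical sketch, assuming the residual is constant during the weight step; the function names are ours, not from the DynaVINS code:

```python
def weight_update(r, lam_w):
    # Eq. (7): closed-form minimizer of (2) over w in [0, 1],
    # w = lam_w / (lam_w + r), with r the summed squared
    # reprojection residual of one tracked feature.
    return lam_w / (lam_w + r)

def weight_update_momentum(r, lam_w, lam_m, n, w_prev):
    # Stationary point of the momentum-augmented loss (10):
    # d/dw [w^2 r + lam_w (1 - w)^2 + lam_m n^2 (w_prev - w)^2] = 0
    # => w = (lam_w + lam_m n^2 w_prev) / (r + lam_w + lam_m n^2).
    w = (lam_w + lam_m * n ** 2 * w_prev) / (r + lam_w + lam_m * n ** 2)
    return min(1.0, max(0.0, w))

# A feature with a small residual keeps a weight near 1; a feature on a
# dynamic object (large residual) is suppressed toward 0.
print(weight_update(0.01, 1.0))   # ~0.990
print(weight_update(100.0, 1.0))  # ~0.0099
# A long-tracked outlier (w_prev = 0, n = 5) stays suppressed even when
# its current residual is momentarily small, mirroring Fig. 3(c)-(d).
print(weight_update_momentum(0.01, 1.0, 1.0, 5, 0.0))  # ~0.038
```

In the full system these updates alternate with a fixed-weight BA over the state, as in (4) and (11); the sketch only illustrates why static features recover high weights while persistent outliers stay near zero.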
In addition, the gradient of the loss increases for features whose $\bar{w}_j$ is close to 1, so even though the current residual is high, the optimization is performed in the direction of reducing the residual rather than the weight.

Furthermore, as shown in Fig. 3(d), if $\bar{w}_j$ is zero, the gradient gets smaller as $n_j$ increases; hence the tracked outlier feature has less effect on the BA, and the longer it is tracked, the less it affects the BA.

For the stereo camera configuration, in addition to the reprojection on one camera, reprojections on the other camera in the same keyframe, $r^{\mathcal{P}}_{stereo}$, or in another keyframe, $r^{\mathcal{P}}_{another}$, exist. In that case, weights are also applied to the reprojection $r^{\mathcal{P}}_{another}$ because it is also affected by the movement of features, while $r^{\mathcal{P}}_{stereo}$ is invariant to the movement of features and is only adopted as the criterion for the depth estimation.

Fig. 5. The procedure of the multiple hypotheses clustering. (a) Keyframes that share the minimum number of tracked features are grouped. (b) There are two types of features used for matchings: static and temporarily static features. ${}^{k,m}T^W_i$, the estimated pose of $C_i$, can be estimated using the matching result $T^k_m$ and the local relative pose $T^i_k$. An accurate keyframe pose can be estimated if static features are used for the matching. (c) The temporarily static feature has moved from its previous position. However, the matching result is based on the previous position of the feature. Thus, the estimated keyframe pose will be inaccurate. Finally, the feature matching results with similar $T^W_i$ are clustered based on the Euclidean distance.

IV. SELECTIVE GLOBAL OPTIMIZATION

In the VIO framework, the drift is inevitably cumulative along the trajectory because the optimization is performed only within the moving window. Hence, a loop closure detection, e.g., using DBoW2 [24], is necessary to optimize all trajectories.

In a typical visual SLAM, all loop closures are exploited even if some of them are from temporarily static objects. Those false positive loop closures may lead to the failure of the SLAM framework. Moreover, features from the temporarily static objects and from the static objects may exist in the same keyframe. Therefore, in this section, we propose a method to eliminate the false positive loop closures while maintaining the true positive loop closures.

A. Keyframe Grouping

Unlike conventional methods that treat loop closures individually, in this study, loop closures from the same features are grouped, even if they are from different keyframes. As a result, only one weight per group is used, allowing for an effective optimization.

As shown in Fig. 5(a), before grouping the loop closures, adjacent keyframes that share at least a minimum number of tracked features have to be grouped. The group starting from the i-th camera frame $C_i$ is defined as follows:

\mathrm{Group}(C_i) = \bigl\{ C_k \,\big|\, |F^k_i| \ge \alpha,\; k \ge i \bigr\},   (12)

where $\alpha$ represents a minimum number of tracked features, and $F^k_i$ represents the set of features tracked from $C_i$ to $C_k$. For simplicity, $\mathrm{Group}(C_i)$ will be denoted as $G_i$ hereinafter.

B. Multiple Hypotheses Clustering

After the keyframes are grouped as in the previous subsection, DBoW2 is employed to identify a similar keyframe $C_m$ for each keyframe $C_k$ in the current group $G_i$ starting from $C_i$ ($C_k \in G_i$ and $m < i$). Note that $C_k$ is skipped if there is no similar keyframe. After identifying up to three different $m$ for each $k$, a feature matching is conducted between $C_k$ and these keyframes, and the relative pose $T^k_m$ can be obtained. Using $T^k_m$, the estimated pose of $C_k$ in the world frame, ${}^{m}T^W_k$, can be obtained as follows:

{}^{m}T^W_k = T^k_m \cdot T^W_m,   (13)

where $T^W_m$ represents the pose of $C_m$ in the world frame.

However, it is difficult to directly compute the similarity between the loop closures from different keyframes in the current group. Assuming that the relative pose $T^i_k$ between $C_k$ and $C_i$ is sufficiently accurate, the estimated pose of $C_i$ in the world frame can be expressed as follows:

{}^{k,m}T^W_i = T^i_k \cdot {}^{m}T^W_k.   (14)

If the features used for the matchings are from the same object, the estimated $T^W_i$ of the matchings will be located close to each other, even if $C_k$ and $C_m$ of the matchings are different. Hence, after calculating the Euclidean distances between the loop closures' estimated $T^W_i$, similar loop closures with a small Euclidean distance can be clustered, as shown in Fig. 5(c).

Depending on which loop closure cluster is selected, the trajectory result from the graph optimization varies. Therefore, each cluster can be called a hypothesis. To reduce the computational cost, the top-two hypotheses are adopted by comparing the cardinality of the loop closures within each hypothesis. These two hypotheses of the current group $G_i$ are denoted as $H^0_i$ and $H^1_i$. However, it is not yet possible to distinguish between true and false positive hypotheses. Hence, the method for determining the true positive hypothesis among the candidate hypotheses is described in the next section.

C. Selective Optimization for Constraint Groups

Most of the recent visual SLAM algorithms use a graph optimization. Let $\mathcal{C}$, $\mathcal{T}$, $\mathcal{L}$, and $\mathcal{W}$ denote the sets of keyframes, poses, loop closures, and all weights, respectively. Then the graph optimization can be denoted as:

\min_{\mathcal{T}} \Bigl\{ \sum_{i \in \mathcal{C}} \underbrace{\| r(T^{i+1}_i, \mathcal{T}) \|^2_{P_{T^{i+1}_i}}}_{\text{local edge}} + \sum_{(j,k) \in \mathcal{L}} \underbrace{\rho\bigl( \| r(T^j_k, \mathcal{T}) \|^2_{P_L} \bigr)}_{\text{loop closure edge}} \Bigr\},   (15)

where $T^{i+1}_i$ represents the local pose between two adjacent keyframes $C_i$ and $C_{i+1}$; $T^j_k$ is the relative pose between $C_j$ and $C_k$ from the loop closure; $P_{T^{i+1}_i}$ and $P_L$ denote the covariances of the local pose and the loop closure, respectively.

For the two hypotheses of group $G_i$, the weights are denoted as $w^0_i$ and $w^1_i$, the sum of the weights as $w_i$, and the set of hypotheses as $\mathcal{H}$. Using a similar procedure as in Section III-C, the Black-Rangarajan duality is applied to (15) as follows:

\min_{\mathcal{T},\mathcal{W}} \Bigl\{ \sum_{i \in \mathcal{C}} \| r(T^{i+1}_i, \mathcal{T}) \|^2_{P_{T^{i+1}_i}} + \sum_{H_i \in \mathcal{H}} \Bigl[ \underbrace{\sum_{(j,k) \in H^0_i} \frac{w^0_i}{|H^0_i|} \| r(T^j_k, \mathcal{T}) \|^2_{P_L}}_{\text{residual for hypothesis 0}} + \underbrace{\sum_{(j,k) \in H^1_i} \frac{w^1_i}{|H^1_i|} \| r(T^j_k, \mathcal{T}) \|^2_{P_L}}_{\text{residual for hypothesis 1 (optional)}} + \underbrace{\lambda_l \Phi_l^2(w_i)}_{\text{hypothesis regularization}} \Bigr] \Bigr\},   (16)

where $\lambda_l \in \mathbb{R}^+$ is a constant parameter. The regularization factor for the loop closure, $\Phi_l$, is defined as follows:

\Phi_l(w_i) = 1 - w_i = 1 - (w^0_i + w^1_i),   (17)

where $w^0_i, w^1_i \in [0, 1]$. To ensure that the weights are not affected by the number of loop closures in the hypothesis, the weights are divided by the cardinality of each hypothesis. Then, (16) is optimized in the same manner as (11).

TABLE I — ABLATION EXPERIMENT

A. Dataset

VIODE Dataset: The VIODE dataset [13] is a simulated dataset that contains many moving objects, such as cars or trucks, compared with conventional datasets. In addition, the dataset includes overall occlusion situations, where most parts of the image are occluded by dominant dynamic objects, as shown in Fig. 1. Note that the sub-sequence names, none to high, indicate how many dynamic objects exist in the scene.

Our Dataset: Unfortunately, the VIODE dataset does not contain harsh loop closing situations caused by temporarily static objects. Accordingly, we recorded our own dataset with four sequences to evaluate our global optimization. First, the Static sequence validates the dataset. In the Dynamic follow sequence, a dominant dynamic object moves in front of the camera. Next, in the Temporal static sequence, the same object is observed from multiple locations. In other words, the object is static while being observed, and then it moves to a different position. Finally, in the E-shape sequence, the camera moves along the shape of the letter E. The checkerboard is moved while not being observed; thus it is observed at the three end-vertices of the E-shaped trajectory in the camera perspective, which triggers false-positive loop closures. Note that the feature-rich checkerboard is used in the experiment to address the effect of false loop closures.

B. Error Metrics

The accuracy of the estimated trajectory of each algorithm is measured by the Absolute Trajectory Error (ATE) [25], which directly measures the difference between points of the ground truth and the aligned estimated trajectory. In addition, for the VIODE dataset, the degradation rate [13], $r_d = \mathrm{ATE}_{high} / \mathrm{ATE}_{none}$, is calculated to determine the robustness of the algorithm.

C. Evaluation on the VIODE Dataset

First, the effects of the proposed factors on BA time cost and accuracy are analyzed, as shown in Table I. Ours with only the regularization factor achieves a better result than VINS-Fusion, but together with the momentum factor, it not only outperforms VINS-Fusion but also takes less time owing to the previous information. Moreover, although the BA time of ours was increased due to additional optimizations, it is sufficient for high-level control of robots.
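Returning to Section IV, the keyframe grouping of (12) and the Euclidean-distance clustering of Fig. 5(c) can be sketched in a few lines. This is a simplified illustration under our own assumptions: poses are reduced to 3-D positions, a greedy single-linkage pass stands in for the full SE(3) machinery, and the data-structure names are ours:

```python
def group_keyframes(tracked, i, alpha):
    # Keyframe grouping in the spirit of eq. (12): starting from frame i,
    # extend the group while frames still share at least `alpha` features
    # tracked continuously from C_i. `tracked[k]` is the set of feature
    # ids visible in frame k (a simplification of F_i^k).
    group = [i]
    shared = set(tracked[i])
    for k in range(i + 1, len(tracked)):
        shared &= set(tracked[k])
        if len(shared) < alpha:
            break
        group.append(k)
    return group

def cluster_hypotheses(positions, eps):
    # Cluster loop closures whose estimated poses of C_i land close to
    # each other (Fig. 5(c)); each entry of `positions` is the translation
    # part of one matching's estimated pose of C_i.
    clusters = []
    for idx, p in enumerate(positions):
        for c in clusters:
            if any(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 < eps
                   for q in c["pts"]):
                c["pts"].append(p)
                c["ids"].append(idx)
                break
        else:
            clusters.append({"pts": [p], "ids": [idx]})
    # Keep the top-two hypotheses by cardinality, as in the letter.
    clusters.sort(key=lambda c: len(c["ids"]), reverse=True)
    return clusters[:2]

tracked = [{1, 2, 3, 4}, {2, 3, 4}, {3, 4}, {9}]
print(group_keyframes(tracked, 0, 2))  # [0, 1, 2]
hyps = cluster_hypotheses([(0.0, 0, 0), (0.1, 0, 0), (5.0, 0, 0)], eps=1.0)
print([h["ids"] for h in hyps])        # [[0, 1], [2]]
```

The two matchings that agree on where $C_i$ sits form the larger hypothesis; the outlying one, e.g. from a temporarily static object, ends up in its own cluster and can then be down-weighted by (16).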
Ac- tional optimizations, it is sufficient for high-level control of +cordingly, only the hypothesis with a high weight is adopted robots. +in the optimization. In addition, all weights can be close to +0 when all hypotheses are false positives due to the multiple As shown in Table II and Fig. 6, the SOTA methods show +temporarily static objects. Hence, the failure caused by false precise pose estimation results in static environments. However, +positive hypotheses can be prevented. they struggle with the effect of dominant dynamic objects. In + particular, even though DynaSLAM employs a semantic seg- + Because keyframe poses are changed after the optimization, mentation module, DynaSLAM tends to diverge or shows large +the hypothesis clustering in Section IV-B is conducted again for ATE compared with other methods as the number of dynamic +all groups for the next optimization. objects increases (from none to high). This performance + degradation is due to the overall occlusion situations, leading to + V. EXPERIMENTAL RESULTS + + To evaluate the proposed algorithm, we compare ours with +SOTA algorithms, namely, VINS-Fusion [2], ORB-SLAM3 [4], +and DynaSLAM [7]. Each algorithm is tested in a mono- +inertial (-M-I) and a stereo-inertial (-S-I) mode. Note that +an IMU is not used in DynaSLAM, so it is only tested in a +stereo (-S) mode and compared with the -S-I mode of other +algorithms. It could be somewhat unfair, but the comparison is +conducted to stress the necessity for an IMU when dealing with +dynamic environments. + +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:36:06 UTC from IEEE Xplore. Restrictions apply. + SONG et al.: DYNAVINS: A VISUAL-INERTIAL SLAM FOR DYNAMIC ENVIRONMENTS 11529 + + TABLE II + COMPARISON WITH STATE-OF-THE-ART METHODS (RMSE OF ATE IN [M]) + +Fig. 6. ATE results of state-of-the-art algorithms and ours on the city_day sequences of the VIODE dataset [13]. 
Note that the y-axis is expressed in logarithmic +scale. Our algorithm shows promising performance with less performance degeneration compared with the other state-of-the-art methods. + +the failure of the semantic segmentation module and the absence Fig. 7. Results of the state-of-the-art algorithms and ours on the park- +of features from static objects. ing_lot high sequence of the VIODE dataset [13]. (a) Trajectory of each + algorithm in the 3D feature map, which is the result of our proposed algorithm. + Similarly, although ORB-SLAM3 tries to reject the frames Features with low weight are depicted in red. (b) Enlarged view of (a). All +with inaccurate features, it diverges when dominant dynamic other algorithms except our algorithm lost track or had noisy trajectories while +objects exist in parking_lot mid, high and city_day observing dynamic objects and as in (c) feature weighting result of our algorithm, +high sequences. However, especially in parking_lot low features from dynamic objects (red crosses) have low weight while robust +sequence, there is only one vehicle that is far from the camera, features (green circles) have high weight. +and it occludes an unnecessary background environment. As +a consequence, ORB-SLAM3-S-I outperforms other algo- TABLE III +rithms. COMPARISON OF DEGRADATION RATE rd + + VINS-Fusion is less hindered by the dynamic objects because +it tries to remove the features with an incorrectly estimated depth +(negative or far) after BA. However, those features have affected +the BA before they are removed. As a result, as the number of +the features from dynamic objects increases, the trajectory error +of VINS-Fusion gets higher. + + In contrast, our proposed method shows promising perfor- +mance in both mono-inertial and stereo-inertial modes. For +example, in parking_lot high sequence as shown in +Fig. 7(a)–(b), ours performs stable pose estimation even when +other algorithms are influenced by dynamic objects. 
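The adaptive, per-hypothesis weighting of Eqs. (16)–(17) can be illustrated with a toy computation. The residual values, the `loop_closure_cost` helper, and the λ_l value below are hypothetical; this is a sketch of the weighting mechanism, not the DynaVINS implementation:

```python
import numpy as np

def loop_closure_cost(res_h0, res_h1, w, lam=1.0):
    """Weighted loop-closure cost for one hypothesis group, after Eq. (16).

    res_h0, res_h1 : arrays of residual norms for hypothesis 0 / hypothesis 1
    w              : (w_i0, w_i1), each in [0, 1]
    lam            : regularization constant lambda_l (hypothetical value)
    """
    w0, w1 = w
    # residuals are averaged over the cardinality of each hypothesis so that
    # the weights are not biased by the number of loop closures it contains
    c0 = w0 * np.mean(res_h0 ** 2)
    c1 = w1 * np.mean(res_h1 ** 2) if len(res_h1) else 0.0
    phi = 1.0 - (w0 + w1)          # Eq. (17): hypothesis regularization factor
    return c0 + c1 + lam * phi ** 2

res_h0 = np.array([0.1, 0.2])      # consistent loop closures
res_h1 = np.array([3.0, 2.8])      # false-positive candidates
print(loop_closure_cost(res_h0, res_h1, (1.0, 0.0)))  # low cost: hypothesis 0 is kept
print(loop_closure_cost(res_h0, res_h1, (0.0, 1.0)))  # high cost: hypothesis 1 is penalized
```

Minimizing over the weights drives the weight of the inconsistent hypothesis toward 0, while the regularization term keeps a consistent hypothesis from being discarded along with it.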
In contrast, our proposed method shows promising performance in both mono-inertial and stereo-inertial modes. For example, in the parking_lot high sequence, as shown in Fig. 7(a)–(b), ours performs stable pose estimation even when the other algorithms are influenced by dynamic objects. Moreover, even though the number of dynamic objects increases, the performance degradation remains small compared to the other methods in all scenes. This confirms that our method overcomes the problems caused by dynamic objects owing to our robust BA method, which is also supported by Table III. In other words, our proposed method successfully rejects all dynamic features by adjusting the weights in an adaptive way. Also, our method is even robust against the overall occlusion situations, as shown in Fig. 1(b).

Interestingly, our proposed robust BA method also provides robustness against changes in illuminance by rejecting inconsistent features (e.g., the low-weight features in the dark area of Fig. 7(c)). Accordingly, our method shows remarkable performance compared with the SOTA methods in the city_night scenes, where not only do dynamic objects exist, but there is also a lack of illuminance. Note that the -M-I mode of ours has a better result than -S-I. This is because the stereo reprojection residual, r_P^stereo, can be inaccurate in low-light conditions.

11530 IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 7, NO. 4, OCTOBER 2022

Fig. 8. Results of the algorithms on the E-shape sequence. (a) Trajectory results. The other algorithms are inaccurate due to false-positive loop closures. (b) A loop closure rejection result of our algorithm. Constraints with low weight (red lines) do not contribute to the optimized trajectory.

D. Evaluation on Our Dataset

In the static case, all algorithms have low ATE values. This sequence validates that our dataset is correctly obtained.

However, in Dynamic follow, the other algorithms tried to track the occluding object. Hence, not only failures of BA but also false-positive loop closures are triggered. Consequently, all algorithms except ours have higher ATEs.

Furthermore, in Temporal static, ORB-SLAM3 and VINS-Fusion can eliminate the false-positive loop closure in the stereo-inertial case. However, in the mono-inertial case, due to inaccurate depth estimation, they cannot reject the false-positive loop closures. Additionally, VINS-Fusion with Switchable Constraints [15] can also reject the false-positive loop closures, but ours performs better, as shown in Table II.

Finally, in the E-shape case, the other algorithms fail to optimize the trajectory, as illustrated in Fig. 8(a), owing to the false-positive loop closures. VINS-Fusion with Switchable Constraints also cannot reject the false-positive loop closures, which are continuously generated. However, ours optimizes the weight of each hypothesis, not of individual loop closures. Hence, false-positive loop closures are rejected in the optimization irrespective of their number, as illustrated in Fig. 8(b). Ours does not use any object-wise information from the image; hence the features from the same object can be divided into different hypotheses, as depicted in Fig. 1(c).

VI. CONCLUSION

In this study, DynaVINS has been proposed: a robust visual-inertial SLAM framework based on the robust BA and the selective global optimization in dynamic environments. The experimental evidence corroborated that our algorithm works better than other algorithms in simulations and in actual environments with various dynamic objects. In future works, we plan to improve the speed and the performance. Moreover, we will adapt the concept of DynaVINS to the LiDAR-Visual-Inertial (LVI) SLAM framework.

REFERENCES

[1] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM: A versatile and accurate monocular SLAM system,” IEEE Trans. Robot., vol. 31, no. 5, pp. 1147–1163, Oct. 2015.
[2] T. Qin, P. Li, and S. Shen, “VINS-Mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Trans. Robot., vol. 34, no. 4, pp. 1004–1020, Aug. 2018.
[3] A. I. Mourikis and S. I. Roumeliotis, “A multi-state constraint Kalman filter for vision-aided inertial navigation,” in Proc. IEEE Int. Conf. Robot. Automat., 2007, pp. 3565–3572.
[4] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM,” IEEE Trans. Robot., vol. 37, no. 6, pp. 1874–1890, Dec. 2021.
[5] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual-inertial odometry using nonlinear optimization,” Int. J. Robot. Res., vol. 34, no. 3, pp. 314–334, 2015.
[6] M. Bloesch, S. Omari, M. Hutter, and R. Siegwart, “Robust visual inertial odometry using a direct EKF-based approach,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2015, pp. 298–304.
[7] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, “DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes,” IEEE Robot. Automat. Lett., vol. 3, no. 4, pp. 4076–4083, Oct. 2018.
[8] Y. Fan, H. Han, Y. Tang, and T. Zhi, “Dynamic objects elimination in SLAM based on image fusion,” Pattern Recognit. Lett., vol. 127, pp. 191–201, 2019.
[9] B. Canovas, M. Rombaut, A. Nègre, D. Pellerin, and S. Olympieff, “Speed and memory efficient dense RGB-D SLAM in dynamic scenes,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2020, pp. 4996–5001.
[10] R. Long, C. Rauch, T. Zhang, V. Ivan, and S. Vijayakumar, “RigidFusion: Robot localisation and mapping in environments with large dynamic rigid objects,” IEEE Robot. Automat. Lett., vol. 6, no. 2, pp. 3703–3710, Apr. 2021.
[11] B. Bescos, C. Campos, J. D. Tardós, and J. Neira, “DynaSLAM II: Tightly-coupled multi-object tracking and SLAM,” IEEE Robot. Automat. Lett., vol. 6, no. 3, pp. 5191–5198, Jul. 2021.
[12] K. Qiu, T. Qin, W. Gao, and S. Shen, “Tracking 3-D motion of dynamic objects using monocular visual-inertial sensing,” IEEE Trans. Robot., vol. 35, no. 4, pp. 799–816, Aug. 2019.
[13] K. Minoda, F. Schilling, V. Wüest, D. Floreano, and T. Yairi, “VIODE: A simulated dataset to address the challenges of visual-inertial odometry in dynamic environments,” IEEE Robot. Automat. Lett., vol. 6, no. 2, pp. 1343–1350, Apr. 2021.
[14] E. Olson and P. Agarwal, “Inference on networks of mixtures for robust robot mapping,” Int. J. Robot. Res., vol. 32, no. 7, pp. 826–840, 2013.
[15] N. Sünderhauf and P. Protzel, “Switchable constraints for robust pose graph SLAM,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2012, pp. 1879–1884.
[16] H. Yang, P. Antonante, V. Tzoumas, and L. Carlone, “Graduated non-convexity for robust spatial perception: From non-minimal solvers to global outlier rejection,” IEEE Robot. Automat. Lett., vol. 5, no. 2, pp. 1127–1134, Apr. 2020.
[17] Q.-Y. Zhou, J. Park, and V. Koltun, “Fast global registration,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 766–782.
[18] S. Song, H. Lim, S. Jung, and H. Myung, “G2P-SLAM: Generalized RGB-D SLAM framework for mobile robots in low-dynamic environments,” IEEE Access, vol. 10, pp. 21370–21383, 2022.
[19] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou, “Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment,” Robot. Auton. Syst., vol. 117, pp. 1–16, 2019.
[20] M. J. Black and A. Rangarajan, “On the unification of line processes, outlier rejection, and robust statistics with applications in early vision,” Int. J. Comput. Vis., vol. 19, no. 1, pp. 57–91, 1996.
[21] P. J. Huber, “Robust estimation of a location parameter,” in Breakthroughs Statist., 1992, pp. 492–518.
[22] P. Babin, P. Giguère, and F. Pomerleau, “Analysis of robust functions for registration algorithms,” in Proc. IEEE Int. Conf. Robot. Automat., 2019, pp. 1451–1457.
[23] S. Geman, D. E. McClure, and D. Geman, “A nonlinear filter for film restoration and other problems in image processing,” CVGIP: Graph. Models Image Process., vol. 54, no. 4, pp. 281–289, 1992.
[24] D. Gálvez-López and J. D. Tardós, “Bags of binary words for fast place recognition in image sequences,” IEEE Trans. Robot., vol. 28, no. 5, pp. 1188–1197, Oct. 2012.
[25] Z. Zhang and D. Scaramuzza, “A tutorial on quantitative trajectory evaluation for visual(-inertial) odometry,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2018, pp. 7244–7251.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2022年/DytanVO_Joint_Refinement_of_Visual_Odometry_and_Motion_Segmentation_in_Dynamic_Environments.pdf b/动态slam/2020年-2022年开源动态SLAM/2022年/DytanVO_Joint_Refinement_of_Visual_Odometry_and_Motion_Segmentation_in_Dynamic_Environments.pdf
new file mode 100644
index 0000000..9d059c7
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2022年/DytanVO_Joint_Refinement_of_Visual_Odometry_and_Motion_Segmentation_in_Dynamic_Environments.pdf
@@ -0,0 +1,476 @@

2023 IEEE International Conference on Robotics and Automation (ICRA 2023)
May 29 - June 2, 2023. London, UK

DytanVO: Joint Refinement of Visual Odometry and Motion Segmentation in Dynamic Environments

Shihao Shen, Yilin Cai, Wenshan Wang, Sebastian Scherer

2023 IEEE International Conference on Robotics and Automation (ICRA) | 979-8-3503-2365-8/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICRA48891.2023.10161306

Fig. 1: An overview of DytanVO. (a) Input frames at times t0 and t1. (b) Optical flow output from the matching network. (c) Motion segmentation output after iterations. (d) Trajectory estimation on sequence RoadCrossing VI from the AirDOS-Shibuya dataset, which is a highly dynamic environment cluttered with humans. Ours is the only learning-based VO that keeps track.
Abstract— Learning-based visual odometry (VO) algorithms achieve remarkable performance on common static scenes, benefiting from high-capacity models and massive annotated data, but tend to fail in dynamic, populated environments. Semantic segmentation is largely used to discard dynamic associations before estimating camera motions, but at the cost of discarding static features, and it is hard to scale up to unseen categories. In this paper, we leverage the mutual dependence between camera ego-motion and motion segmentation and show that both can be jointly refined in a single learning-based framework. In particular, we present DytanVO, the first supervised learning-based VO method that deals with dynamic environments. It takes two consecutive monocular frames in real time and predicts camera ego-motion in an iterative fashion. Our method achieves an average improvement of 27.7% in ATE over state-of-the-art VO solutions in real-world dynamic environments, and even performs competitively among dynamic visual SLAM systems which optimize the trajectory on the backend. Experiments on plentiful unseen environments also demonstrate our method's generalizability.

I. INTRODUCTION

Visual odometry (VO), one of the most essential components for pose estimation in the visual Simultaneous Localization and Mapping (SLAM) system, has attracted significant interest in robotic applications over the past few years [1]. A lot of research work has been conducted to develop an accurate and robust monocular VO system using geometry-based methods [2], [3]. However, they require significant engineering effort for each module to be carefully designed and finetuned [4], which makes it difficult for them to be readily deployed in the open world with complex environmental dynamics, changes of illumination or inevitable sensor noises.

On the other hand, recent learning-based methods [4]–[7] are able to outperform geometry-based methods in more challenging environments such as large motion, fog or rain effects, and lack of features. However, they easily fail in dynamic environments if they do not take into consideration independently moving objects that cause unpredictable changes in illumination or occlusions. To this end, recent works utilize abundant unlabeled data and adopt either self-supervised learning [8], [9] or unsupervised learning [10], [11] to handle dynamic scenes. Although they achieve outstanding performance on particular tasks, such as autonomous driving, they produce worse results if applied to very different data distributions, such as micro air vehicles (MAV) that operate with aggressive and frequent rotations that cars do not have. Learning without supervision is hindered from generalizing due to biased data with simple motion patterns. Therefore, we approach the dynamic VO problem as supervised learning so that the model can map inputs to complex ego-motion ground truth and be more generalizable.

Code is available at https://github.com/Geniussh/DytanVO
S. Shen, Y. Cai, W. Wang, and S. Scherer are with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. {shihaosh, yilincai, wenshanw, basti}@andrew.cmu.edu

To identify dynamic objects, object detection or semantic segmentation techniques are largely relied on to mask all movable objects, such as pedestrians and vehicles [12]–[15]. Their associated features are discarded before applying geometry-based methods. However, there are two issues with utilizing semantic information in dynamic VO. First, class-specific detectors for semantic segmentation heavily depend on appearance cues, but not every object that can move is present in the training categories, leading to false negatives. Second, even if all moving objects in a scene are within the categories, algorithms cannot distinguish between "actually moving" and "static but able to move". In dynamic VO, where static features are crucial to robust ego-motion estimation, one should segment objects based on pure motion (motion segmentation) rather than heuristic appearance cues.

Motion segmentation utilizes the relative motion between consecutive frames to remove the effect of camera movement from the 2D motion fields and calculates residual optical flow to account for moving regions. But paradoxically, ego-motion cannot be correctly estimated in dynamic scenes without a robust segmentation. There exists such a mutual dependence between motion segmentation and ego-motion estimation that has never been explored in supervised learning methods. Therefore, motivated by jointly refining the VO and motion segmentation, we propose our learning-based dynamic VO (DytanVO). To the best of our knowledge, our work is the first supervised learning-based VO for dynamic environments. The main contributions of this paper are threefold:

• A novel learning-based VO is introduced to leverage the interdependence among camera ego-motion, optical flow and motion segmentation.
• We introduce an iterative framework where both ego-motion estimation and motion segmentation can converge quickly within time constraints for real-time applications.
• Among learning-based VO solutions, our method achieves state-of-the-art performance in real-world dynamic scenes without finetuning. Furthermore, our method performs even comparably with visual SLAM solutions that optimize trajectories on the backend.

II. RELATED WORK

Learning-based VO solutions aim to avoid hard-coded modules that require significant engineering effort for design and finetuning in classic pipelines [1], [16]. For example, Valada [17] applies auxiliary learning to leverage relative pose information to constrain the search space and produce consistent motion estimation. Another class of learning-based methods relies on dense optical flow to estimate pose, as it provides more robust and redundant modalities for feature association in VO [5], [18], [19]. However, their frameworks are built on the assumption of photometric consistency, which only holds in a static environment without independently moving objects. They easily fail when dynamic objects unpredictably cause occlusions or illumination changes.

Semantic information is largely used by earlier works in VO or visual SLAM to handle dynamic objects in the scene, obtained by either a feature-based method or a learning-based method. Feature-based methods utilize hand-designed features to recognize semantic entities [20]. An exemplary system proposed by [21] computes SIFT descriptors from monocular image sequences in order to recognize semantic objects. On the other hand, data-driven CNN-based semantic methods have been widely used to improve the performance, such as DS-SLAM [22] and SemanticFusion [23]. A few works on semantic VO/SLAM have fused the semantic information from recognition modules to enhance motion estimation and vice versa [24], [25]. However, all these methods are prone to limited semantic categories, which leads to false negatives when scaling to unusual real-world applications such as offroad driving or MAVs, and requires continuous effort in ground-truth labeling.

Instead of utilizing appearance cues for segmentation, efforts have been made to segment based on geometry cues. FlowFusion [26] iteratively refines its ego-motion estimation by computing residual optical flow. GeoNet [10] divides its system into two sub-tasks by separately predicting static scene structure and dynamic motions. However, both depend on geometric constraints arising from epipolar geometry and rigid transformations, which are vulnerable to motion ambiguities: objects moving in a direction colinear to the camera are indistinguishable from the background given only ego-motion and optical flow. On the other hand, MaskVO [8] and SimVODIS++ [9] approach the problem by learning to mask dynamic feature points in a self-supervised manner. CC [11] couples motion segmentation, flow, depth and camera motion models, which are jointly solved in an unsupervised way. Nevertheless, these self-supervised or unsupervised methods are trained on self-driving vehicle data dominated by pure translational motions with little rotation, which makes them difficult to generalize to completely different data distributions such as handheld cameras or drones. Our work introduces a framework that jointly refines camera ego-motion and motion segmentation in an iterative way that is robust against motion ambiguities and generalizes to the open world.

III. METHODOLOGY

A. Datasets

Built on TartanVO [5], our method retains its generalization capability while handling dynamic environments in multiple types of scenes, such as car, MAV, indoor and outdoor. Besides taking camera intrinsics as an extra layer into the network to adapt to various camera settings, as explored in [5], we train our model on large amounts of synthetic data with broad diversity, which is shown capable of facilitating easy adaptation to the real world [27]–[29].

Our model is trained on both TartanAir [27] and SceneFlow [30]. The former contains more than 400,000 data frames with ground truth of optical flow and camera pose in static environments only. The latter provides 39,000 frames in highly dynamic environments, with each trajectory having backward/forward passes, different objects and motion characteristics. Although SceneFlow does not provide ground truth for motion segmentation, we are able to recover it by making use of its ground truth of disparity, optical flow and disparity change maps.

B. Architecture

Our network architecture is illustrated in Fig. 2 and is based on TartanVO. Our method takes in two consecutive undistorted images It, It+1 and outputs the relative camera motion δ_t^{t+1} = (R|T), where T ∈ R³ is the 3D translation and R ∈ SO(3) is the 3D rotation. Our framework consists of three sub-modules: a matching network, a motion segmentation network, and a pose network. We estimate dense optical flow F_t^{t+1} with a matching network, Mθ(It, It+1), from the two consecutive images. The network is built based on PWC-Net [31]. The motion segmentation network Uγ, based on a lightweight U-Net [32], takes in the relative camera motion output R|T, the optical flow from Mθ, and the original input frames. It outputs a probability map, z_t^{t+1}, of every pixel belonging to a dynamic object or not, which is thresholded and turned into a binary segmentation mask, S_t^{t+1}. The optical flow is then stacked with the mask and the intrinsics layer KC, followed by setting all optical flow inside the masked regions to zero, i.e., F̃_t^{t+1}. The last module is a pose network Pϕ, with ResNet50 [33] as the backbone, which takes in the previous stack and outputs the camera motion.

Fig. 2: Overview of our three-stage network architecture. It consists of a matching network which estimates optical flow from two consecutive images, a pose network that estimates pose based on optical flow without dynamic movements, and a motion segmentation network that outputs a probability mask of the dynamicness. The matching network is forwarded only once, while the pose network and the segmentation network are iterated to jointly refine the pose estimate and the motion segmentation. In the first iteration, we randomly initialize the segmentation mask. In each iteration, optical flow is set to zero inside masked regions.

C. Motion segmentation

Earlier dynamic VO methods that use motion segmentation rely on purely geometric constraints arising from epipolar geometry and rigid transformations [12], [26], so that they can threshold the residual optical flow that is designed to account for moving regions. However, they are prone to catastrophic failures in two cases: (1) points in 3D moving along epipolar lines cannot be identified from the background given only monocular cues; (2) pure geometry methods leave no tolerance for noisy optical flow and less accurate camera motion estimates, which in our framework is very likely to happen in the first few iterations. Therefore, following [34], to deal with the ambiguities above, we explicitly model cost maps as inputs to the segmentation network after upgrading the 2D optical flow to 3D through optical expansion [35], which estimates relative depth based on the scale change of overlapping image patches. The cost maps are tailored to the coplanar and colinear motion ambiguities that cause segmentation failures in geometry-based motion segmentation. More details can be found in [34].

D. Iteratively refining camera motion

We provide an overview of our iterative framework in Algorithm 1. During inference, the matching network is forwarded only once, while the pose network and the segmentation network are iterated to jointly refine the ego-motion estimation and the motion segmentation. In the first iteration, the segmentation mask is initialized randomly using [36]. The criterion to stop iterating is straightforward: the rotational and translational differences of R|T between two iterations must be smaller than prefixed thresholds ϵ. Instead of using a fixed constant to threshold probability maps into segmentation masks, we predetermine a decaying parameter that empirically reduces the input threshold over time, in order to discourage inaccurate masks in earlier iterations while embracing refined masks in later ones.

Algorithm 1: Inference with Iterations
  Given two consecutive frames It, It+1 and intrinsics K
  Initialize iteration number: i ← 1
  Initialize difference in output camera motions: δR|T ← ∞
  iF_t^{t+1} ← OpticalFlow(It, It+1)
  while δR|T ≥ stopping criterion ϵ do
      if i = 1 then
          iS_t^{t+1} ← getCowMask(It)
      else
          iz_t^{t+1} ← MotionSegmentation(iF_t^{t+1}, It, iR|iT)
          iS_t^{t+1} ← mask(iz_t^{t+1} ≥ z_threshold)
      iF̃_t^{t+1} ← set iF_t^{t+1} = 0 wherever iS_t^{t+1} = 1
      iR|iT ← PoseNetwork(iF̃_t^{t+1}, iS_t^{t+1}, K)
      δR|T ← iR|iT − (i−1)R|(i−1)T
      i ← i + 1

Intuitively, during early iterations, the estimated motion is less accurate, which leads to false positives in the segmentation output (assigning high probabilities to static areas). However, because the optical flow map still provides enough correspondences regardless of cutting out non-dynamic regions, Pϕ is able to robustly leverage the segmentation mask S_t^{t+1} concatenated with F̃_t^{t+1} and output a reasonable camera motion. In later iterations, Uγ is expected to output increasingly precise probability maps, such that static regions in the optical flow map are no longer "wasted", and hence Pϕ can be improved accordingly.
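The control flow of Algorithm 1 can be paraphrased as a short Python loop. The `pose_net` and `seg_net` callables, the threshold schedule, and the tensor shapes below are stand-ins; this is a sketch of the iteration logic, not the released DytanVO code:

```python
import numpy as np

def dytanvo_iterate(flow, img_t, img_t1, K, pose_net, seg_net, init_mask,
                    eps=1e-3, max_iters=3, z0=0.9, decay=0.7):
    """Jointly refine ego-motion and motion segmentation (cf. Algorithm 1)."""
    mask = init_mask                  # first iteration: random (cow) mask
    pose = None
    z_thresh = z0                     # decaying probability threshold
    for _ in range(max_iters):
        # zero out optical flow inside masked (dynamic) regions
        flow_masked = np.where(mask[..., None], 0.0, flow)
        new_pose = pose_net(flow_masked, mask, K)
        # re-segment using the refined ego-motion estimate
        prob = seg_net(flow, img_t, img_t1, new_pose)
        mask = prob >= z_thresh
        z_thresh *= decay             # embrace refined masks in later iterations
        if pose is not None and np.linalg.norm(new_pose - pose) < eps:
            pose = new_pose
            break
        pose = new_pose
    return pose, mask
```

The loop terminates as soon as two consecutive pose estimates agree within `eps`, mirroring the stopping criterion of Algorithm 1.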
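The geometry-only residual-flow segmentation that Section III-C improves upon can be sketched as follows: compute the flow a static scene would induce under the estimated camera motion and depth, then mark pixels whose observed flow deviates from it. The pinhole model, the threshold, and the function names are assumptions for illustration; the paper's cost maps additionally use optical expansion [35] to resolve colinear ambiguities:

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Flow induced by camera motion (R, t) on a static scene with known depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x N
    pts = np.linalg.inv(K) @ pix * depth.reshape(-1)                     # back-project
    pts2 = R @ pts + t[:, None]                                          # move to frame t+1
    proj = K @ pts2
    proj = proj[:2] / proj[2]                                            # re-project
    return (proj - pix[:2]).T.reshape(h, w, 2)

def motion_mask(flow_obs, depth, K, R, t, thresh=1.0):
    """Pixels whose observed flow deviates from the rigid flow are dynamic."""
    resid = np.linalg.norm(flow_obs - rigid_flow(depth, K, R, t), axis=-1)
    return resid > thresh
```

As the text notes, such a purely geometric residual breaks down when the ego-motion estimate is noisy or when objects move along epipolar lines, which is why DytanVO feeds richer cost maps to a learned segmentation network instead.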
In practice, we find that 3 iterations are more than enough to get both the camera motion and the segmentation refined. To clear up any ambiguity, a 1-iteration pass is composed of one Mθ forward pass and one Pϕ forward pass with a random mask, while a 3-iteration pass consists of one Mθ forward pass, two Uγ forward passes and three Pϕ forward passes. In Fig. 3 we illustrate how segmentation masks evolve over three iterations on unseen data. The mask at the first iteration contains a significant amount of false positives but quickly converges beyond the second iteration. This verifies our assumption that the pose network is robust against false positives in segmentation results.

E. Supervision

We train our pose network to be robust against large areas of false positives. On training data without any dynamic object, we adopt the cow-mask [36] to create sufficiently random yet locally connected segmentation patterns, since a motion segmentation could occur at any size, any shape and any position in an image while exhibiting locally explainable structures corresponding to the types of moving objects. In addition, we apply curriculum learning to the pose network, where we gradually increase the maximum percentage of dynamic areas in SceneFlow from 15%, 20%, 30%, 50% to 100%. Since TartanAir only contains static scenes, we adjust the size of the cow-masks accordingly.

We supervise our network on the camera motion loss LP. Under the monocular setting, we only recover an up-to-scale camera motion. We follow [5] and normalize the translation vector before calculating the distance to the ground truth. Given ground truth motion R|T,

$$L_P = \left\|\frac{\hat{T}}{\max(\|\hat{T}\|,\,\epsilon)} - \frac{T}{\max(\|T\|,\,\epsilon)}\right\| + \left\|\hat{R} - R\right\|, \tag{1}$$

where ϵ = 1e-6 prevents numerical instability and ˆ· denotes estimated quantities.

Our framework can also be trained in an end-to-end fashion, in which case the objective becomes an aggregated loss of the optical flow loss LM, the camera motion loss LP and the motion segmentation loss LU, where LM is the L1 norm between the predicted flow and the ground truth flow, and LU is the binary cross-entropy loss between the predicted probability and the segmentation label:

$$L = \lambda_1 L_M + \lambda_2 L_U + L_P. \tag{2}$$

From a preliminary empirical comparison, end-to-end training gives similar performance to training the pose network only, because we use λ1 and λ2 to regularize the objective such that the training is biased toward mainly improving the odometry rather than optimizing the other two tasks. This is ideal, since the pose network is very tolerant of false positives in segmentation results (shown in III-D). In the following section, we show our results of supervising only on Eq. (1) while fixing the motion segmentation network.

Fig. 3: Motion segmentation output at each iteration when testing on unseen data. (a) Running inference with our segmentation network on the hardest sequence in AirDOS-Shibuya, with multiple people moving in different directions. (b) Inference on the sequence from FlyingThings3D where dynamic objects take up more than 60% of the area. The ground truth (GT) mask on Shibuya is generated by the segmentation network with GT ego-motion as input.

IV. EXPERIMENTAL RESULTS

A. Implementation details

1) Network: We initialize the matching network Mθ with the pre-trained model from TartanVO [5], and fix the motion segmentation network Uγ with the pre-trained weights from Yang et al. [34]. The pose network Pϕ uses ResNet50 [33] as the backbone, removes the batch normalization layers, and adds two output heads for rotation R and translation T. Mθ outputs optical flow at a size of H/4 × W/4. Pϕ takes a 5-channel input, i.e., F̃_t^{t+1} ∈ R^{2×H/4×W/4}, S_t^{t+1} ∈ R^{H/4×W/4} and KC ∈ R^{2×H/4×W/4}. The concatenation of F̃_t^{t+1} and KC augments the optical flow input with 2D positional information, while concatenating F̃_t^{t+1} with S_t^{t+1} encourages the network to learn dynamic representations.

2) Training: Our method is implemented in PyTorch [43] and trained on 2 NVIDIA A100 Tensor Core GPUs. We train the network in two stages on TartanAir, which includes only static scenes, and SceneFlow [30]. In the first stage, we train Pϕ independently using ground truth optical flow, camera motion, and motion segmentation masks in a curriculum-learning fashion. We generate random cow-masks [36] on TartanAir as the motion segmentation input. Each curriculum is initialized with the weights from the previous curriculum and takes 100,000 iterations with a batch size of 256. In the second stage, Pϕ and Mθ are jointly optimized for another 100,000 iterations with a batch size of 64. During curriculum learning, the learning rate starts at 2e-4, while the second stage uses a learning rate of 2e-5. Both stages apply a decay rate of 0.2 to the learning rate every 50,000 iterations. Random cropping and resizing (RCR) [5] as well as frame skipping are applied to both datasets.
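The up-to-scale camera-motion loss of Eq. (1) is straightforward to express directly. Below is a NumPy sketch standing in for the PyTorch training code; `pose_loss` is a hypothetical name, and rotation is treated as a plain vector for simplicity:

```python
import numpy as np

def pose_loss(T_hat, R_hat, T_gt, R_gt, eps=1e-6):
    """Up-to-scale camera-motion loss, after Eq. (1).

    Translations are normalized before comparison because monocular VO only
    recovers translation up to scale; eps guards against division by a
    near-zero norm.
    """
    t_hat = T_hat / max(np.linalg.norm(T_hat), eps)
    t_gt = T_gt / max(np.linalg.norm(T_gt), eps)
    return np.linalg.norm(t_hat - t_gt) + np.linalg.norm(R_hat - R_gt)
```

Because both translations are normalized, any positive rescaling of the estimated translation leaves the loss unchanged, which is exactly the monocular scale ambiguity the loss is designed to tolerate.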
                                  StandingHuman    RoadCrossing (Easy)        RoadCrossing (Hard)
                                  I        II      III      IV       V        VI       VII

SLAM    DROID-SLAM [37]           0.0051   0.0073  0.0103   0.0120   0.2778   0.0253   0.5788
method  AirDOS w/ mask [38]       0.0606   0.0193  0.0951   0.0331   0.0206   0.2230   0.5625
        ORB-SLAM w/ mask [39]     0.0788   0.0060  0.0657   0.0196   0.0148   1.0984   0.8476
        VDO-SLAM [40]             0.0994   0.6129  0.3813   0.3879   0.2175   0.2400   0.6628
        DynaSLAM [41]             -        0.8836  0.3907   0.4196   0.4925   0.6446   0.6539

VO      DeepVO [4]                0.3956   0.6351  0.7788   0.3436   0.5434   0.7223   0.9633
method  TrianFlow [42]            0.9743   1.3835  1.3348   1.6172   1.4769   1.7154   1.9075
        CC [11]                   0.4527   0.7714  0.5406   0.6345   0.5411   0.8558   1.0896
        TartanVO [5]              0.0600   0.1605  0.2762   0.1814   0.2174   0.3228   0.5009
        Ours                      0.0327   0.1017  0.0608   0.0516   0.0755   0.0365   0.0660

3) Runtime: Although our method iterates multiple times to refine both segmentation and camera motion, we find in practice that 3 iterations are more than enough due to the robustness of Pϕ as shown in Fig. 3. On an NVIDIA RTX 2080 GPU, inference takes 40 ms with 1 iteration, 100 ms with 2 iterations and 160 ms with 3 iterations.

4) Evaluation: We use the Absolute Trajectory Error (ATE) to evaluate our algorithm against other state-of-the-art methods, including both VO and visual SLAM. We evaluate our method on the AirDOS-Shibuya dataset [38] and the KITTI Odometry dataset [44]. Additionally, in the supplemental material, we test our method on data collected in a cluttered intersection to demonstrate that our method can scale to real-world dynamic scenes competitively.

B. Performance on AirDOS-Shibuya Dataset

We first provide an ablation study of the number of iterations (iter) in Tab. III using three sequences from AirDOS-Shibuya [38]. The quantitative results are consistent with Fig. 3, where the pose network quickly converges after the first iteration. We also compare the 3-iteration finetuned model after jointly optimizing Pϕ and Mθ (second stage), which shows less improvement because the optical flow estimation on AirDOS-Shibuya already has high quality.

TABLE III: Experiments on number of iterations in ATE (m)

             Standing I   RoadCrossing III   RoadCrossing VII
 1 iter      0.0649       0.1666             0.3157
 2 iter      0.0315       0.0974             0.0658
 3 iter      0.0327       0.0608             0.0660
 Finetuned   0.0384       0.0631             0.0531

We then compare our method with others on the seven sequences from AirDOS-Shibuya in Tab. I and demonstrate that our method outperforms existing state-of-the-art VO algorithms. This benchmark covers much more challenging viewpoints and diverse motion patterns for articulated objects than our training data. The seven sequences are categorized into three levels of difficulty: most humans stand still in Standing Human with few of them moving around, Road Crossing (Easy) contains multiple humans moving in and out of the camera's view, and in Road Crossing (Hard) humans enter the camera's view abruptly. Besides VO methods, we also compare ours with SLAM methods that are able to handle dynamic scenes. DROID-SLAM [37] is a learning-based SLAM trained on TartanAir. AirDOS [38], VDO-SLAM [40] and DynaSLAM [41] are three feature-based SLAM methods targeting dynamic scenes. We provide the performance of AirDOS and ORB-SLAM [39] after masking the dynamic features during their ego-motion estimation. DeepVO [4], TartanVO and TrianFlow [42] are three learning-based VO methods not targeting dynamic scenes, while CC [11] is an unsupervised VO resolving dynamic scenes through motion segmentation.

Our model achieves the best performance in all sequences among VO baselines and is competitive even among SLAM methods. DeepVO, TrianFlow and CC perform badly on the AirDOS-Shibuya dataset because they are trained on KITTI only and are not able to generalize. TartanVO performs better, but it is still susceptible to the disturbance of dynamic objects. On RoadCrossing V, as shown in Fig. 1, all VO baselines fail except ours. In hard sequences, where there are more aggressive camera movements and abundant moving objects, ours outperforms dynamic SLAM methods such as AirDOS, VDO-SLAM and DynaSLAM by more than 80%. While DROID-SLAM remains competitive most of the time, it loses track on RoadCrossing V and VII as soon as a walking person occupies a large area in the image. Note that ours only takes 0.16 seconds per inference with 3 iterations, but DROID-SLAM takes an extra 4.8 seconds to optimize the trajectory. More qualitative results are in the supplemental material.

C. Performance on KITTI

We also evaluated our method against others on sequences from the KITTI Odometry dataset [44] in Tab. II. Our method outperforms other VO baselines in 6 out of 8 dynamic sequences, with an improvement of 27.7% on average against the second best method. DeepVO, TrianFlow and CC are trained on some of the sequences in KITTI, while ours has not been finetuned on KITTI and is trained purely using synthetic

TABLE II: Results of ATE (m) on dynamic sequences from KITTI Odometry. Original sequences are trimmed into shorter ones that contain dynamic objects1. DeepVO [4], TrianFlow [42] and CC [11] are trained on KITTI, while ours has not been finetuned on KITTI and is trained purely using synthetic data. Without backend optimization unlike SLAM, we achieve the best performance on 00, 02, 04, and competitive performance on the rest among all methods including SLAM.
                                  00        01        02        03       04       07       08        10

SLAM    DROID-SLAM [37]           0.0148    49.193    0.1064    0.0119   0.0374   0.1939   0.9713    0.0368
method  ORB-SLAM w/ mask [39]     0.0187    -         0.0796    0.1519   0.0198   0.2108   1.0479    0.0246
        DynaSLAM [41]             0.0138    -         0.1046    0.1450   -        0.3187   1.0559    0.0264

VO      DeepVO [4]                (0.0206)  1.2896    (0.2975)  0.0783   0.0506   1.5540   (3.8984)  0.2545
method  TrianFlow [42]            0.6966    (8.2127)  (1.8759)  1.6862   1.2950   0.6789   (1.0411)  (0.0346)
        CC [11]                   0.0253    (0.3060)  (0.2559)  0.0505   0.0337   0.7108   0.9776    0.1024
        TartanVO [5]              0.0345    4.7080    0.1049    0.2832   0.0743   0.6367   1.0344    0.0280
        Ours                      0.0126    0.4081    0.0594    0.0406   0.0180   0.7262   0.6547    0.1042

We use (·) to denote that the sequence is in the training set of the corresponding method.

1 Sequences listed are trimmed into lengths of 28, 133, 67, 31, 40, 136, 51 and 59 respectively, which contain moving pedestrians, vehicles and cyclists.

Fig. 4: Qualitative results on dynamic sequences in KITTI Odometry 01, 03, 04 and 10. The first row is our segmentation outputs of moving objects. The second row is the visualization after aligning the scales of trajectories with ground truth all at once. Ours produces precise odometry given large areas in the image being dynamic, even among methods that are trained on KITTI. Note that the trajectories do not always reflect the ATE results due to alignment.

data. Moreover, we achieve the best ATE on 3 sequences among both VO and SLAM without any optimization. We provide qualitative results in Fig. 4 on four challenging sequences with fast-moving vehicles or dynamic objects occupying large areas in images. Note that on sequence 01, which starts with a high-speed vehicle passing by, both ORB-SLAM and DynaSLAM fail to initialize, while DROID-SLAM loses track from the beginning. Even though CC uses 01 in its training set, ours gives only 0.1 higher ATE while being 0.88 lower than the third best baseline. On sequence 10, where a huge van takes up significant areas in the center of the image, ours is the only VO that keeps track robustly.

D. Diagnostics

While we observe that our method is robust to heavily dynamic scenes with as much as 70% dynamic objects in the image, it still fails when all foreground objects are moving, leaving only a textureless background. This is most likely to happen when dynamic objects take up large areas in the image. For example, when testing on the test set of FlyingThings3D [30], where 80% of the image is dynamic, our method masks almost the entire optical flow map as zeros, leading to the divergence of motion estimation and segmentation. Future work could hence consider incorporating dynamic object-awareness into the framework and utilizing dynamic cues instead of fully discarding them. Additionally, learning-based VO tends to overfit on simple translational movements such as in KITTI, which is resolved in our method by training on datasets with broad diversity, but our method gives worse performance when there is little or zero camera motion, caused by the bias in currently available datasets. One should consider training on zero-motion inputs in addition to frame skipping.

V. CONCLUSION

In this paper, we propose a learning-based dynamic VO (DytanVO) which can jointly refine the estimation of camera pose and the segmentation of dynamic objects. We demonstrate that both ego-motion estimation and motion segmentation can converge quickly within time constraints for real-time applications. We evaluate our method on the KITTI Odometry and AirDOS-Shibuya datasets, and demonstrate state-of-the-art performance in dynamic environments without finetuning or optimization on the backend. Our work introduces new directions for dynamic visual SLAM algorithms.
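The joint refinement described in this paper can be summarized structurally. The sketch below is an illustration only: the three networks Mθ (matching), Pϕ (pose) and Uγ (motion segmentation) are replaced by placeholder stand-ins, not the authors' released code, but the loop reproduces the iteration counts stated earlier (with 3 iterations: one Mθ pass, three Pϕ passes, two Uγ passes, starting from a random mask).

```python
import numpy as np

def estimate_flow(img1, img2):
    """Stand-in for the matching network Mθ (runs once per frame pair)."""
    return np.zeros((2, 4, 4))

def estimate_pose(flow, seg):
    """Stand-in for the pose network Pϕ (trained to tolerate bad masks)."""
    return np.eye(4)

def segment_motion(flow, pose):
    """Stand-in for the motion segmentation network Uγ."""
    return np.zeros(flow.shape[1:], dtype=bool)

def dytanvo_step(img1, img2, iters=3):
    """Jointly refine camera pose and motion segmentation for one pair."""
    flow = estimate_flow(img1, img2)
    # The first iteration uses a random mask, exploiting the pose network's
    # robustness against large areas of false positives.
    seg = np.random.rand(*flow.shape[1:]) > 0.5
    pose = None
    for i in range(iters):
        pose = estimate_pose(flow, seg)
        if i < iters - 1:  # no segmentation pass after the final pose update
            seg = segment_motion(flow, pose)
    return pose, seg
```

Because Pϕ converges after the first one or two iterations (Tab. III), the loop can be truncated early at inference time.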
REFERENCES

[1] D. Scaramuzza and F. Fraundorfer, "Visual odometry [tutorial]," IEEE Robotics & Automation Magazine, vol. 18, no. 4, pp. 80–92, 2011.
[2] J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 611–625, 2017.
[3] C. Forster, M. Pizzoli, and D. Scaramuzza, "SVO: Fast semi-direct monocular visual odometry," in 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 15–22, IEEE, 2014.
[4] S. Wang, R. Clark, H. Wen, and N. Trigoni, "DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks," in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2043–2050, IEEE, 2017.
[5] W. Wang, Y. Hu, and S. Scherer, "TartanVO: A generalizable learning-based VO," arXiv preprint arXiv:2011.00359, 2020.
[6] H. Zhou, B. Ummenhofer, and T. Brox, "DeepTAM: Deep tracking and mapping," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 822–838, 2018.
[7] S. Li, X. Wang, Y. Cao, F. Xue, Z. Yan, and H. Zha, "Self-supervised deep visual odometry with online adaptation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6339–6348, 2020.
[8] W. Xuan, R. Ren, S. Wu, and C. Chen, "MaskVO: Self-supervised visual odometry with a learnable dynamic mask," in 2022 IEEE/SICE International Symposium on System Integration (SII), pp. 225–231, IEEE, 2022.
[9] U.-H. Kim, S.-H. Kim, and J.-H. Kim, "SimVODIS++: Neural semantic visual odometry in dynamic environments," IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 4244–4251, 2022.
[10] Z. Yin and J. Shi, "GeoNet: Unsupervised learning of dense depth, optical flow and camera pose," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1983–1992, 2018.
[11] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black, "Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12240–12249, 2019.
[12] H. Liu, G. Liu, G. Tian, S. Xin, and Z. Ji, "Visual SLAM based on dynamic object removal," in 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 596–601, IEEE, 2019.
[13] B. Xu, W. Li, D. Tzoumanikas, M. Bloesch, A. Davison, and S. Leutenegger, "MID-Fusion: Octree-based object-level multi-instance dynamic SLAM," in 2019 International Conference on Robotics and Automation (ICRA), pp. 5231–5237, IEEE, 2019.
[14] S. Li and D. Lee, "RGB-D SLAM in dynamic environments using static point weighting," IEEE Robotics and Automation Letters, vol. 2, no. 4, pp. 2263–2270, 2017.
[15] Y. Sun, M. Liu, and M. Q.-H. Meng, "Improving RGB-D SLAM in dynamic environments: A motion removal approach," Robotics and Autonomous Systems, vol. 89, pp. 110–122, 2017.
[16] F. Fraundorfer and D. Scaramuzza, "Visual odometry: Part II: Matching, robustness, optimization, and applications," IEEE Robotics & Automation Magazine, vol. 19, no. 2, pp. 78–90, 2012.
[17] A. Valada, N. Radwan, and W. Burgard, "Deep auxiliary learning for visual localization and odometry," in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6939–6946, IEEE, 2018.
[18] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia, "Exploring representation learning with CNNs for frame-to-frame ego-motion estimation," IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. 18–25, 2015.
[19] H. Zhan, C. S. Weerasekera, J.-W. Bian, and I. Reid, "Visual odometry revisited: What should be learnt?," in 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 4203–4210, IEEE, 2020.
[20] D.-H. Kim and J.-H. Kim, "Effective background model-based RGB-D dense visual odometry in a dynamic environment," IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1565–1573, 2016.
[21] S. Pillai and J. Leonard, "Monocular SLAM supported object recognition," arXiv preprint arXiv:1506.01732, 2015.
[22] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, "DS-SLAM: A semantic visual SLAM towards dynamic environments," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1168–1174, IEEE, 2018.
[23] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, "SemanticFusion: Dense 3D semantic mapping with convolutional neural networks," in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4628–4635, IEEE, 2017.
[24] L. An, X. Zhang, H. Gao, and Y. Liu, "Semantic segmentation–aided visual odometry for urban autonomous driving," International Journal of Advanced Robotic Systems, vol. 14, no. 5, p. 1729881417735667, 2017.
[25] K.-N. Lianos, J. L. Schonberger, M. Pollefeys, and T. Sattler, "VSO: Visual semantic odometry," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 234–250, 2018.
[26] T. Zhang, H. Zhang, Y. Li, Y. Nakamura, and L. Zhang, "FlowFusion: Dynamic dense RGB-D SLAM based on optical flow," in 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 7322–7328, IEEE, 2020.
[27] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, "TartanAir: A dataset to push the limits of visual SLAM," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4909–4916, IEEE, 2020.
[28] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30, IEEE, 2017.
[29] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield, "Training deep networks with synthetic data: Bridging the reality gap by domain randomization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 969–977, 2018.
[30] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, "A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048, 2016.
[31] D. Sun, X. Yang, M. Liu, and J. Kautz, "PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume," CoRR, vol. abs/1709.02371, 2017.
[32] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241, Springer, 2015.
[33] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.
[34] G. Yang and D. Ramanan, "Learning to segment rigid motions from two frames," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1266–1275, 2021.
[35] G. Yang and D. Ramanan, "Upgrading optical flow to 3D scene flow through optical expansion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1334–1343, 2020.
[36] G. French, A. Oliver, and T. Salimans, "Milking CowMask for semi-supervised image classification," arXiv preprint arXiv:2003.12022, 2020.
[37] Z. Teed and J. Deng, "DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras," Advances in Neural Information Processing Systems, vol. 34, pp. 16558–16569, 2021.
[38] Y. Qiu, C. Wang, W. Wang, M. Henein, and S. Scherer, "AirDOS: Dynamic SLAM benefits from articulated objects," in 2022 International Conference on Robotics and Automation (ICRA), pp. 8047–8053, IEEE, 2022.
[39] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[40] J. Zhang, M. Henein, R. Mahony, and V. Ila, "VDO-SLAM: A visual dynamic object-aware SLAM system," arXiv preprint arXiv:2005.11052, 2020.
[41] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4076–4083, 2018.
[42] W. Zhao, S. Liu, Y. Shu, and Y.-J. Liu, "Towards better generalization: Joint depth-pose learning without PoseNet," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9151–9161, 2020.
[43] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., "PyTorch: An imperative style, high-performance deep learning library," Advances in Neural Information Processing Systems, vol. 32, 2019.
[44] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
diff --git a/动态slam/2020年-2022年开源动态SLAM/2022年/Multi_modal Semantic SLAM for Complex Dynamic Environment.pdf b/动态slam/2020年-2022年开源动态SLAM/2022年/Multi_modal Semantic SLAM for Complex Dynamic Environment.pdf
new file mode 100644
index 0000000..dbee05c
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2022年/Multi_modal Semantic SLAM for Complex Dynamic Environment.pdf
@@ -0,0 +1,510 @@

Multi-modal Semantic SLAM for Complex Dynamic Environments

Han Wang*, Jing Ying Ko* and Lihua Xie, Fellow, IEEE

arXiv:2205.04300v1 [cs.RO] 9 May 2022

Abstract— Simultaneous Localization and Mapping (SLAM) is one of the most essential techniques in many real-world robotic applications. The assumption of static environments is common in most SLAM algorithms, which, however, is not the case for most applications. Recent work on semantic SLAM aims to understand the objects in an environment and distinguish dynamic information from a scene context by performing image-based segmentation. However, the segmentation results are often imperfect or incomplete, which can subsequently reduce the quality of mapping and the accuracy of localization. In this paper, we present a robust multi-modal semantic framework to solve the SLAM problem in complex and highly dynamic environments. We propose to learn a more powerful object feature representation and deploy the mechanism of looking and thinking twice to the backbone network, which leads to a better recognition result for our baseline instance segmentation model.
Moreover, both geometric-only clustering and visual semantic information are combined to reduce the effect of segmentation error due to small-scale objects, occlusion and motion blur. Thorough experiments have been conducted to evaluate the performance of the proposed method. The results show that our method can precisely identify dynamic objects under recognition imperfection and motion blur. Moreover, the proposed SLAM framework is able to efficiently build a static dense map at a processing rate of more than 10 Hz, which can be implemented in many practical applications. Both the training data and the proposed method are open-sourced1.

Fig. 1: System overview of the proposed multi-modal semantic SLAM. Compared to traditional semantic SLAM, we propose to use a multi-modal method to improve the efficiency and accuracy of the existing SLAM methods in the complex and dynamic environment. Our method significantly reduces the localization drifts caused by dynamic objects and performs dense semantic mapping in real time.

I. INTRODUCTION

Simultaneous Localization and Mapping (SLAM) is one of the most significant capabilities in many robot applications such as self-driving cars, unmanned aerial vehicles, etc. Over the past few decades, SLAM algorithms have been extensively studied in both visual SLAM, such as ORB-SLAM [1], and LiDAR-based SLAM, such as LOAM [2] and LeGO-LOAM [3]. Unfortunately, many existing SLAM algorithms assume the environment to be static, and cannot handle dynamic environments well. The localization is often achieved via visual or geometric features such as feature points, lines and planes, without including semantic information to represent the surrounding environment, which can only work well under static environments. However, the real world is generally complex and dynamic. In the presence of moving objects, pose estimation might suffer from drifting, which may cause system failure if there are wrong correspondences or insufficient matching features [4]. The presence of dynamic objects can greatly degrade the accuracy of localization and the reliability of the mapping during the SLAM process.

Advancements in deep learning have enabled the development of various instance segmentation networks based on 2D images [5]–[6]. Most existing semantic SLAMs leverage the success of deep learning-based image segmentation, e.g., Dynamic-SLAM [7] and DS-SLAM [8]. However, the segmentation results are not ideal under dynamic environments. Various factors such as small-scale objects, objects under occlusion and motion blur contribute to challenges in 2D instance segmentation. For example, an object is only partially recognized under motion blur or when it is near the border of the image. These can degrade the accuracy of localization and the reliability of the mapping. Some recent works target to perform deep learning on 3D point clouds to achieve semantic recognition [9]–[10]. However, 3D point cloud instance segmentation does not perform as well as its 2D counterpart, due to its smaller scale of training data and high computational cost. There are several reasons: 1) 3D point cloud instance segmentation such as PointGroup takes a long computation time (491 ms) [11]; 2) it is much less efficient to label a point cloud, since the geometric information is not as straightforward as the visual information; 3) it is inevitable to change the viewpoint in order to label a point cloud [12], which increases the labeling time.

*Jing Ying Ko and Han Wang contribute equally to this paper and are considered as joint first authors.
The research is supported by the National Research Foundation, Singapore under its Medium Sized Center for Advanced Robotics Technology Innovation.
Jing Ying Ko, Han Wang and Lihua Xie are with the School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798. e-mail: {hwang027, E170043}@e.ntu.edu.sg; elhxie@ntu.edu.sg
1 https://github.com/wh200720041/MMS_SLAM

In this paper, we propose a robust and computationally efficient multi-modal semantic SLAM framework to tackle the limitation of existing SLAM methods in dynamic environments. We modify the existing backbone network to learn a more powerful object feature representation and deploy the mechanism of looking and thinking twice to the backbone network, which leads to a better recognition result for our baseline instance segmentation model. Moreover, we combine geometric-only clustering and visual semantic information to reduce the effect of motion blur. Eventually, the multi-modal semantic recognition is integrated into the SLAM framework, which is able to provide real-time localization in different dynamic environments. The experiment results show that segmentation errors due to misclassification, small-scale objects and occlusion can be well solved with our proposed method. The main contributions of this paper are summarized as follows:

• We propose a robust and fast multi-modal semantic SLAM framework that targets to solve the SLAM problem in complex and dynamic environments. Specifically, we combine geometric-only clustering and visual semantic information to reduce the effect of segmentation error due to small-scale objects, occlusion and motion blur.
• We propose to learn a more powerful object feature representation and deploy the mechanism of looking and thinking twice to the backbone network, which leads to a better recognition result for our baseline instance segmentation model.
• A thorough evaluation of the proposed method is presented. The results show that our method is able to provide reliable localization and a semantic dense map.

The rest of the paper is organized as follows: Section II presents an overview of the related works regarding the three main SLAM methods in dynamic environments. Section III describes the details of the proposed SLAM framework. Section IV provides quantitative and qualitative experimental results in dynamic environments. Section V concludes this paper.

II. RELATED WORK

In this section, we present the existing works that address SLAM problems in dynamic environments. The existing dynamic SLAM can be categorized into three main methods: the feature consistency verification method, the deep learning-based method and the multi-modal-based method.

A. Feature Consistency Verification

Dai et al. [13] present a segmentation method using the correlation between points to distinguish moving objects from the stationary scene, which has a low computational requirement. Lee et al. [14] introduce a real-time depth edge-based RGB-D SLAM system to deal with a dynamic environment. A static weighting method is proposed to measure the likelihood of an edge point being part of the static environment and is further used for the registration of the frame-to-keyframe point cloud. These methods generally can achieve real-time implementation without increasing the computational complexity. Additionally, they need no prior knowledge about the dynamic objects. However, they are unable to continuously track potential dynamic objects; e.g., a person who stops at a location temporarily between moves is considered as a static object in their work.

B. Deep Learning-Based Dynamic SLAM

Deep learning-based dynamic SLAM usually performs better than feature consistency verification, as it provides conceptual knowledge of the surrounding environment to perform the SLAM tasks. Xun et al. [15] propose a feature-based visual SLAM algorithm based on ORB-SLAM2, where a front-end semantic segmentation network is introduced to filter out dynamic feature points and subsequently fine-tune the camera pose estimation, thus making the tracking algorithm more robust. Reference [16] combines a semantic segmentation network with a moving consistency check method to reduce the impact of dynamic objects and generate a dense semantic octree map. A visual SLAM system proposed by [17] develops a dynamic object detector with multi-view geometry and background inpainting, which aims to estimate a static map and reuse it in long-term applications. However, Mask R-CNN is considered computationally intensive; as a result, the whole framework can only be performed offline.

Deep learning-based LiDAR SLAM in dynamic environments is relatively less popular than visual SLAM. Reference [18] integrates semantic information by using a fully convolutional neural network to embed these labels into a dense surfel-based map representation. However, the adopted segmentation network is based on 3D point clouds, which is less effective as compared to 2D segmentation networks. Reference [19] develops a laser-inertial odometry and mapping method which consists of four sequential modules to perform real-time and robust pose estimation for large-scale highway environments. Reference [20] presents a dynamic-objects-free LOAM system by overlapping segmented images onto LiDAR scans. Although deep learning-based methods can effectively alleviate the impact of dynamic objects on SLAM performance, they are normally difficult to operate in real time due to the implementation of deep-learning neural networks, which possess high computational complexity.
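The multi-modal alternative surveyed next rests on one basic operation: transferring a 2D segmentation mask onto LiDAR points by projecting each point into the image. A minimal pinhole-camera sketch of that operation (illustrative naming and intrinsics; the paper's actual fusion additionally applies geometric clustering):

```python
import numpy as np

def label_points_with_mask(points_cam, K, mask):
    """Mark 3D points as dynamic by projecting them into a 2D instance mask.

    points_cam: (N, 3) LiDAR points already transformed into the camera
    frame (z forward, assumed nonzero). K: (3, 3) pinhole intrinsics.
    mask: (H, W) boolean segmentation of dynamic objects.
    Returns a boolean array: True where a point lands on a dynamic pixel.
    """
    H, W = mask.shape
    uvw = points_cam @ K.T            # homogeneous pixel coords [u*z, v*z, z]
    u = uvw[:, 0] / uvw[:, 2]
    v = uvw[:, 1] / uvw[:, 2]
    # Keep only points in front of the camera that fall inside the image.
    inside = (uvw[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    dynamic = np.zeros(len(points_cam), dtype=bool)
    dynamic[inside] = mask[v[inside].astype(int), u[inside].astype(int)]
    return dynamic
```

Points labeled dynamic can then be filtered out before scan registration, while the static remainder feeds the localization module.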
Fig. 2: Flow chart of the proposed method. Our system consists of four modules: (a) instance segmentation module; (b) multi-modal fusion module; (c) localization module; (d) global optimization and mapping module.

C. Multi-modal-based Dynamic SLAM

Multi-modal approaches are also explored to deal with dynamic environments. Reference [21] introduces a multi-modal sensor-based semantic mapping algorithm to improve the semantic 3D map in large-scale as well as featureless environments. Although that work is similar to our proposed method, it incurs a higher computational cost. A LiDAR-camera SLAM system [22] applies a sparse-subspace-clustering-based motion segmentation method to build a static map in dynamic environments. Reference [23] incorporates the information of a monocular camera and a laser range finder to remove the feature outliers related to dynamic objects. However, both [22] and [23] only work well in low-dynamic environments.

III. METHODOLOGY

In this section, the proposed method is discussed in detail. Fig. 2 illustrates an overview of our framework. It is mainly composed of four modules, namely the instance segmentation module, the multi-modal fusion module, the localization module, and the global optimization & mapping module. The instance segmentation module uses a real-time instance segmentation network to extract the semantic information of all potential dynamic objects present in an RGB image. The convolutional neural network is trained offline and is later deployed online to achieve real-time performance. Concurrently, the multi-modal fusion module transfers relevant semantic data to the LiDAR through sensor fusion and subsequently uses the multi-modal information to further strengthen the segmentation results. The static information is used in the localization module to find the robot pose, while both static and dynamic information are utilized in the global optimization and mapping module to build a 3D dense semantic map.

A. Instance Segmentation & Semantic Learning

A recent 2D instance segmentation framework [24] is employed in our work due to its ability to outperform other state-of-the-art instance segmentation models in both segmentation accuracy and inference speed. Given an input image I, our adopted instance segmentation network predicts a set {Ci, Mi}, i = 1, ..., n, where Ci is a class label, Mi is a binary mask, and n is the number of instances in the image. The image is spatially separated into N × N grid cells. If the center of an object falls into a grid cell, that grid cell is responsible for predicting the semantic category Cij and the semantic mask Mij of the object in the category branch Bc and the mask branch Pm respectively:

    Bc(I, θc) : I → {Cij ∈ R^λ | i, j = 0, 1, ..., N},   (1a)
    Pm(I, θm) : I → {Mij ∈ R^φ | i, j = 0, 1, ..., N},   (1b)

where θc and θm are the parameters of the category branch Bc and the mask branch Pm respectively, λ is the number of classes, and φ is the total number of grid cells. The category branch and the mask branch are implemented with a Fully Connected Network (FCN). Cij has a total of λ elements; each element of Cij indicates the class probability for the object instance at grid cell (i, j). In parallel with the category branch, Mij has a total of N² elements [24]. Each positive grid cell (i, j) generates its instance mask in the k-th element, where k = i · N + j. Since our proposed SLAM system is intentionally designed for real-world robotics applications, the computational cost of performing instance segmentation is our primary concern. Therefore, we use a light-weight version of SOLOv2 with lower accuracy to achieve real-time instance segmentation. To improve the segmentation accuracy, several methods have been implemented to build a more effective and robust feature representation discriminator in the backbone network. Firstly, we modify our backbone architecture from the original Feature Pyramid Network (FPN) to the Recursive Feature Pyramid (RFP) [25]. Theoretically, RFP instills the idea of looking twice or more by integrating additional feedback from the FPN into the bottom-up backbone layers. This recursively strengthens the existing FPN and provides increasingly stronger feature representations.

Fig. 3: Comparison of the original SOLOv2 with the proposed method. Our segmentation results achieve higher accuracy: in (1b), our method preserves a more detailed mask for the rider on a motorcycle compared to the SOLOv2 result in (1a); in (2b), we handle an occluded object that is not detected in (2a); in (3b), our method accurately predicts the mask for a handbag compared to (3a).
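The grid-cell bookkeeping behind Eqs. (1a)–(1b) — an object's center selects a cell (i, j), and that positive cell writes its instance mask into channel k = i · N + j — can be sketched as below. The helper names are ours, not from the paper:

```python
def grid_cell_of_center(cx, cy, img_w, img_h, N):
    """Map an object's center pixel (cx, cy) to its N x N grid cell (i, j)."""
    i = int(cy / img_h * N)   # row index of the grid
    j = int(cx / img_w * N)   # column index of the grid
    return i, j

def mask_channel(i, j, N):
    """A positive grid cell (i, j) writes its instance mask into channel k = i*N + j."""
    return i * N + j

# Toy example: a 640x480 image divided into a 12x12 grid.
N = 12
i, j = grid_cell_of_center(cx=320, cy=240, img_w=640, img_h=480, N=N)
k = mask_channel(i, j, N)   # cell (6, 6), mask channel 78
```

The category branch would then hold a λ-way class probability vector for cell (i, j), and the mask branch a binary mask in channel k.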
By offsetting richer information with a small receptive field in the lower-level feature maps, we are able to improve the segmentation performance on small objects. Meanwhile, the ability of RFP to adaptively strengthen and suppress neuron activations enables the instance segmentation network to handle occluded objects more efficiently. On the other hand, we replace the convolutional layers in the backbone architecture with Switchable Atrous Convolution (SAC). SAC operates as a soft switch function that collects the outputs of convolutional computations with different atrous rates. Therefore, we are able to learn the optimal coefficients from SAC and adaptively select the size of the receptive field. This allows SOLOv2 to efficiently extract important spatial information.

The outputs are pixel-wise instance masks for each dynamic object, as well as their corresponding bounding boxes and class types. To better integrate the dynamic information into the SLAM algorithm, the output binary masks are combined into a single image containing all pixel-wise instance masks in the scene. A pixel covered by a mask is considered to be in the "dynamic state" and otherwise in the "static state". The binary mask is then passed to the semantic fusion module to generate a 3D dynamic mask.

B. Multi-Modal Fusion

1) Motion Blur Compensation: Instance segmentation has achieved good performance on public datasets such as the COCO dataset and the Objects365 dataset [24]–[26]. However, in practice the target may be partially recognized or incomplete due to motion blur on moving objects, resulting in ambiguous boundaries of a moving object. Moreover, the motion blur effect is further enlarged when projecting the 2D pixel-wise semantic mask of a dynamic object to a 3D semantic label, leading to point misalignment and inconsistent feature point extraction. In our experiments, we find that the ambiguous boundaries of dynamic targets degrade the localization accuracy and produce noise when performing a mapping task. Therefore, we first apply morphological dilation, convolving the 2D pixel-wise mask image with a structuring element to gradually expand the boundaries of the dynamic object regions. The morphological dilation result marks the ambiguous boundaries around the dynamic objects. We take both the dynamic objects and their boundaries as the dynamic information, which is further refined in the multi-modal fusion step.

2) Geometric Clustering & Semantic Fusion: Compensation via connectivity analysis in Euclidean space [27] is also implemented in our work. The instance segmentation network has excellent recognition capability in most practical situations; however, motion blur limits the segmentation performance due to ambiguous pixels between regions, leading to undesirable segmentation errors. Therefore, we combine the point cloud clustering results and the segmentation results to better refine the dynamic objects. In particular, we perform the connectivity analysis on the geometric information and merge it with the vision-based segmentation results.

A raw LiDAR scan often contains tens of thousands of points. To increase efficiency, the 3D point cloud is first downsampled to reduce the scale of the data and used as the input for point cloud clustering. The instance segmentation results are then projected into the point cloud coordinate frame to label each point. A point cloud cluster is considered dynamic when most of its points (90%) carry the dynamic label. A static point is re-labeled with the dynamic tag when it is close to a dynamic point cluster, and a dynamic point is re-labeled as static when there is no dynamic point cluster nearby.
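The two operations above — dilating the 2D mask to absorb motion-blurred boundaries, and tagging a point cluster as dynamic when at least 90% of its points carry the dynamic label — can be sketched as follows. This is a NumPy-only illustration under our own naming, not the authors' implementation:

```python
import numpy as np

def dilate(mask, r=1):
    """Morphological dilation of a boolean mask with a (2r+1)x(2r+1) square
    structuring element, implemented by shifting and OR-ing the mask."""
    out = np.zeros_like(mask)
    h, w = mask.shape
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            shifted = np.zeros_like(mask)
            ys = slice(max(-dy, 0), h + min(-dy, 0))  # source rows
            xs = slice(max(-dx, 0), w + min(-dx, 0))  # source cols
            yd = slice(max(dy, 0), h + min(dy, 0))    # destination rows
            xd = slice(max(dx, 0), w + min(dx, 0))    # destination cols
            shifted[yd, xd] = mask[ys, xs]
            out |= shifted
    return out

def cluster_is_dynamic(labels, ratio=0.9):
    """A point-cloud cluster is tagged dynamic when at least `ratio`
    of its points carry the dynamic label (True)."""
    return labels.mean() >= ratio
```

For example, dilating a single dynamic pixel with r=1 grows it into a 3x3 dynamic region, marking the ambiguous boundary around the original mask.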
C. Localization & Pose Estimation

1) Feature Extraction: After applying multi-modal dynamic segmentation, the point cloud is divided into a dynamic point cloud PD and a static point cloud PS. The static point cloud is subsequently used for the localization and mapping module based on our previous work [28]. Compared to existing SLAM approaches such as LOAM [2], the framework proposed in [28] supports real-time performance at 30 Hz, which is a few times faster. It is also resistant to illumination variation compared to visual SLAM systems such as ORB-SLAM [1] and VINS-Mono [29]. For each static point pk ∈ PS, we can search for its set of nearby static points Sk by radius search in Euclidean space. Let |S| be the cardinality of a set S; the local smoothness is then defined by:

    σk = (1/|Sk|) · Σ_{pi ∈ Sk} (||pk|| − ||pi||).   (2)

The edge features are defined by the points with large σk and the planar features are defined by the points with small σk.

2) Data Association: The final robot pose is calculated by minimizing the point-to-edge and point-to-plane distances. An edge feature point pE ∈ PE can be transformed into the local map coordinate frame by p̂E = T · pE, where T ∈ SE(3) is the current pose. We can search for the 2 nearest edge features pE1 and pE2 in the local edge feature map, and the point-to-edge residual is defined by [28]:

    fE(p̂E) = ||(p̂E − pE1) × (p̂E − pE2)|| / ||pE1 − pE2||,   (3)

where × is the cross product. Similarly, given a planar feature point pL ∈ PL and its transformed point p̂L = T · pL, we can search for the 3 nearest points pL1, pL2, and pL3 in the local planar map. The point-to-plane residual is defined by:

    fL(p̂L) = (p̂L − pL1)ᵀ · ((pL1 − pL2) × (pL1 − pL3)) / ||(pL1 − pL2) × (pL1 − pL3)||.   (4)

3) Pose Estimation: The final robot pose is calculated by minimizing the sum of the point-to-edge and point-to-plane residuals:

    T* = arg min_T [ Σ_{pE ∈ PE} fE(p̂E) + Σ_{pL ∈ PL} fL(p̂L) ].   (5)

Fig. 4: Different types of AGVs used in our warehouse environment: (a) the grabbing AGV with a robot arm; (b) forklift AGV; (c) scanning AGV; (d) the Pioneer robot; (e) the transportation AGV with conveyor belt; (f) warehouse environment.

D. Global Map Building

The semantic map is separated into a static map and a dynamic map. Note that the visual information given previously is also used to construct the colored dense static map. Specifically, the visual information can be obtained by re-projecting 3D points into the image plane. After each update, the map is down-sampled using a 3D voxelized grid approach [30] in order to prevent memory overflow. The dynamic map is built from PD and is used to reveal the dynamic objects. The dynamic information can be used for high-level tasks such as motion planning.

IV. EXPERIMENT EVALUATION

In this section, experimental results will be presented to demonstrate the effectiveness of our proposed method. First, our experimental setup will be discussed in detail. Second, we elaborate how we acquire the data of potential moving objects in a warehouse environment. Third, we evaluate the segmentation performance of our adopted instance segmentation model. Subsequently, we explain how we perform dense mapping and dynamic tracking. Lastly, we evaluate the performance of our proposed method regarding localization drift in dynamic environments.

A. Experimental Setup

For our experimental setup, the Robot Operating System (ROS) is utilized as the interface for the integration of the semantic learning module and the SLAM algorithm, as shown in Fig. 1. An Intel RealSense LiDAR camera L515 is used to capture RGB and point cloud data at a fixed frame rate. All the experiments are performed on a computer with an Intel i7 CPU and an Nvidia GeForce RTX 2080 Ti GPU.
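The geometric quantities in Eqs. (2)–(4) — local smoothness for edge/planar classification, the point-to-line distance, and the point-to-plane distance — can be sketched numerically as follows; the function names are ours:

```python
import numpy as np

def local_smoothness(pk, neighbors):
    """Eq. (2): sigma_k = (1/|S_k|) * sum_i (||p_k|| - ||p_i||).
    Large sigma -> edge feature; small sigma -> planar feature."""
    return float(np.mean([np.linalg.norm(pk) - np.linalg.norm(pi)
                          for pi in neighbors]))

def point_to_edge(p_hat, e1, e2):
    """Eq. (3): distance from the transformed point p_hat to the line
    through the two nearest edge features e1, e2."""
    return np.linalg.norm(np.cross(p_hat - e1, p_hat - e2)) / np.linalg.norm(e1 - e2)

def point_to_plane(p_hat, q1, q2, q3):
    """Eq. (4): signed distance from p_hat to the plane spanned by the
    three nearest planar features q1, q2, q3 (normal via cross product)."""
    n = np.cross(q1 - q2, q1 - q3)
    return float(np.dot(p_hat - q1, n) / np.linalg.norm(n))
```

Summing these residuals over all static edge and planar features gives the cost of Eq. (5), which is then minimized over T by Gauss-Newton iterations.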
This non-linear optimization problem can be solved by the Gauss-Newton method, and we can derive an optimal robot pose based on the static information.

4) Feature Map Update & Key Frame Selection: Once the optimal pose is derived, the features are updated to the local edge map and local plane map respectively, which will be used for data association on the next frame. Note that building and updating a global dense map is often very computationally costly. Hence, the global static map is updated based on keyframes. A keyframe is selected when the translational change of the robot pose is greater than a predefined translation threshold, or the rotational change of the robot pose is greater than a predefined rotation threshold.

Fig. 5: Static map creation and final semantic mapping result: (a) static map built by the proposed SLAM framework; (b) final semantic mapping result. The instance segmentation is shown on the left. Human operators are labeled by red bounding boxes and AGVs are labeled by green bounding boxes.

B. Data Acquisition

Humans are often considered dynamic objects in many scenarios such as autonomous driving and smart warehouse logistics. Therefore we choose 5,000 human images from the COCO dataset. In the experiment, the proposed method is evaluated in the warehouse environment shown in Fig. 4. Besides humans, an advanced factory requires human-to-robot and robot-to-robot collaboration, so Automated Guided Vehicles (AGVs) are also potential dynamic objects. Hence a total of 3,000 AGV images are collected to train the instance segmentation network; some of the AGVs are shown in Fig. 4.

In order to address the small-dataset problem, we implement the copy-paste augmentation method proposed by [31] to enhance the generalization ability of the network and directly improve its robustness. To be specific, this method generates new images by applying random scale jittering to two random training images and randomly choosing a subset of object instances from one image to paste onto the other.

C. Evaluation on Instance Segmentation Performance

In this part, we evaluate the segmentation performance on the COCO dataset with regard to the segmentation loss and mean Average Precision (mAP). The purpose of this evaluation is to compare our adopted instance segmentation network, SOLOv2, with the proposed method. The results are illustrated in Table I. Our adopted instance segmentation network, SOLOv2, is built on MMDetection 2.0 [32], an open-source object detection toolbox based on PyTorch. We trained SOLOv2 on the COCO dataset, which consists of 81 classes. We choose ResNet-50 as our backbone architecture since this configuration satisfies our requirements for real-world robotics applications. Instead of training the network from scratch, we make use of ResNet-50 parameters pre-trained on ImageNet. For a fair comparison, all the models are trained under the same configuration: synchronized stochastic gradient descent with a total of 8 images per mini-batch for 36 epochs.

For SOLOv2 with the Recursive Feature Pyramid (RFP), we modify our backbone architecture from the Feature Pyramid Network (FPN) to the RFP network. In this experiment, we set the number of stages to 2, allowing SOLOv2 to look at the image twice. As illustrated in Table I, the RFP network brings a significant improvement in segmentation performance. On the other hand, replacing all 3x3 convolutional layers in the backbone network with Switchable Atrous Convolution (SAC) increases the segmentation accuracy by 2.3%. By implementing both SAC and the RFP network in SOLOv2, the segmentation performance is further improved by 5.9% with only a 17 ms increase in inference time. Overall, SOLOv2 learns to look at the image twice with adaptive receptive fields, and is therefore able to highlight important semantic information for the instance segmentation network. The segmentation result is further visualized in Fig. 3.

TABLE I: Performance comparison of instance segmentation.

    Model                       Segmentation Loss   Mean AP (%)   Inference Time (ms)
    SOLOv2                      0.52                38.8          54.0
    SOLOv2 + RFP                0.36                41.2          64.0
    SOLOv2 + SAC                0.39                39.8          59.0
    SOLOv2 + DetectoRS (Ours)   0.29                43.4          71.0

TABLE II: Ablation study of localization drifts under dynamic environments.

    Methods                             ATDE (cm)   MTDE (cm)
    W/O Semantic Recognition            4.834       1.877
    Vision-based Semantic Recognition   1.273       0.667
    Multi-Modal Recognition (Ours)      0.875       0.502

Fig. 6: Localization comparison in a dynamic environment. The ground truth, the original localization result without filtering, and the localization result with our proposed multi-modal semantic filtering are plotted in red, green, and orange respectively.

D. Dense Mapping and Dynamic Tracking

To evaluate the performance of our multi-modal semantic SLAM in dynamic environments, the proposed method is implemented on the warehouse AGVs shown in Fig. 4. In a smart manufacturing factory, both human operators and different types of AGVs (e.g., forklift AGVs, transportation AGVs, and robot-arm-equipped AGVs) are supposed to work in a collaborative manner. Therefore, the capability of each AGV to localize itself among moving human operators and other AGVs is an essential technology towards Industry 4.0. In many warehouse environments, the remaining objects, such as operating machines or tables, can be taken as a static environment. Hence we only consider humans and AGVs as dynamic objects in order to reduce the computational cost. In the experiment, an AGV is manually controlled to move around and build the warehouse environment map simultaneously, while human operators walk frequently in the warehouse. The localization result is shown in Fig. 6, where we compare the ground truth, the proposed SLAM method, and the original SLAM without our filtering approach. It can be seen that when the dynamic object appears (in blue), the proposed multi-modal semantic SLAM is more robust and stable than traditional SLAM. The mapping results are shown in Fig. 5. The proposed method is able to efficiently identify the potential dynamic objects and separate them from the static map. Although the human operators walk frequently in front of the robot, they are totally removed from the static map. All potential dynamic objects are enclosed by bounding boxes and added into a final semantic map to visualize the status of each object in real time, where the moving humans are colored in red and the AGVs in green. Our method is able to identify and locate multiple targets in a complex dynamic environment.

E. Ablation Study of Localization Drifts

Fig. 7: Ablation study of localization drifts. (a) original image view; (b) the visual semantic recognition result based on the proposed method; (c) localization drifts observed due to the moving objects. The localization drifts are highlighted in red circles.

To further evaluate the performance of localization under dynamic profiles, we compare the localization drifts of different dynamic filtering approaches. Firstly, we keep the robot still and let a human operator walk frequently in front of the robot. The localization drifts are recorded in order to evaluate the performance under dynamic objects. Specifically, we calculate the Average Translational Drift Error (ATDE) and Maximum Translational Drift Error (MTDE) to verify the localization, where ATDE is the average translational error of each frame and MTDE is the maximum translational drift caused by the walking human. The results are shown in Table II. We first remove the semantic recognition module from the SLAM system and evaluate the performance. Then we use visual semantic recognition (SOLOv2) to remove the dynamic information. The results are compared with the proposed semantic multi-modal SLAM. It can be seen that, compared to the original SLAM, the proposed method significantly reduces the localization drift. Compared to the vision-only filtering method, the proposed multi-modal semantic SLAM is more stable and accurate in the presence of dynamic objects.

V. CONCLUSION

In this paper, we have presented a semantic multi-modal framework to tackle the SLAM problem in dynamic environments, which is able to effectively reduce the impact of dynamic objects in complex dynamic environments. Our approach aims to provide a modular pipeline to enable real-world applications in dynamic environments. Meanwhile, a 3D dense stationary map is constructed with the removal of dynamic information. To verify the effectiveness of the proposed method in a complex dynamic environment, our method is evaluated on warehouse AGVs used for smart manufacturing. The results show that our proposed method can significantly improve the existing semantic SLAM algorithm in terms of robustness and accuracy.

REFERENCES
[1] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
[2] J. Zhang and S. Singh, "LOAM: Lidar odometry and mapping in real-time," in Robotics: Science and Systems, vol. 2, no. 9, 2014.
[3] T. Shan and B. Englot, "LeGO-LOAM: Lightweight and ground-optimized lidar odometry and mapping on variable terrain," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 4758–4765.
[4] W. Tan, H. Liu, Z. Dong, G. Zhang, and H. Bao, "Robust monocular SLAM in dynamic environments," IEEE International Symposium on Mixed and Augmented Reality, vol. 1, pp. 209–218, 2013.
[5] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," IEEE International Conference on Computer Vision, 2017.
[6] D. Bolya, Z. Chong, F. Xiao, and Y. J. Lee, "YOLACT: Real-time instance segmentation," IEEE International Conference on Computer Vision, 2019.
[7] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou, "Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment," Robotics and Autonomous Systems, vol. 117, pp. 1–16, 2019.
[8] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, "DS-SLAM: A semantic visual SLAM towards dynamic environments," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 1168–1174.
[9] L. Han, T. Zheng, L. Xu, and L. Fang, "OccuSeg: Occupancy-aware 3D instance segmentation," IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[10] J. Li, H. Zhao, S. Shi, S. Liu, C.-W. Fu, and J. Jia, "PointGroup: Dual-set point grouping for 3D instance segmentation," IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[11] L. Jiang, H. Zhao, S. Shi, S. Liu, C.-W. Fu, and J. Jia, "PointGroup: Dual-set point grouping for 3D instance segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4867–4876.
[12] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, "SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences," in Proc. of the IEEE/CVF International Conf. on Computer Vision (ICCV), 2019.
[13] W. Dai, Y. Zhang, P. Li, Z. Fang, and S. Scherer, "RGB-D SLAM in dynamic environments using point correlations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 1, 2020.
[14] S. Li and D. Lee, "RGB-D SLAM in dynamic environments using static point weighting," IEEE Robotics and Automation Letters, vol. 2, no. 4, pp. 2262–2270, 2017.
[15] Y. Xun and C. Song, "SaD-SLAM: A visual SLAM based on semantic and depth information," IEEE International Conference on Intelligent Robots and Systems, 2021.
[16] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, "DS-SLAM: A semantic visual SLAM towards dynamic environments," IEEE International Conference on Intelligent Robots and Systems, pp. 1168–1174, 2018.
[17] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4076–4083, 2018.
[18] X. Chen, A. Milioto, E. Palazzolo, P. Giguère, and C. Stachniss, "SuMa++: Efficient LiDAR-based semantic SLAM," IEEE International Conference on Intelligent Robots and Systems, 2019.
[19] S. Zhao, Z. Fang, H. Li, and S. Scherer, "A robust laser-inertial odometry and mapping method for large-scale highway environments," IEEE International Conference on Intelligent Robots and Systems, 2019.
[20] R. Jian, W. Su, R. Li, S. Zhang, J. Wei, B. Li, and R. Huang, "A semantic segmentation based lidar SLAM system towards dynamic environments," IEEE International Conference on Intelligent Robotics and Applications, pp. 582–590, 2019.
[21] J. Jeong, T. S. Yoon, and J. B. Park, "Towards a meaningful 3D map using a 3D lidar and a camera," Sensors, vol. 18, no. 8, 2018.
[22] C. Jiang, D. P. Paudel, Y. Fougerolle, D. Fofi, and C. Demonceaux, "Static-map and dynamic object reconstruction in outdoor scenes using 3-D motion segmentation," IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. 324–331, 2016.
[23] X. Zhang, A. B. Rad, and Y.-K. Wong, "Sensor fusion of monocular cameras and laser rangefinders for line-based simultaneous localization and mapping (SLAM) tasks in autonomous mobile robots," Sensors, vol. 12, pp. 429–452, 2012.
[24] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, "SOLOv2: Dynamic and fast instance segmentation," IEEE Computer Vision and Pattern Recognition, 2020.
[25] S. Qiao, L.-C. Chen, and A. Yuille, "DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution," IEEE Computer Vision and Pattern Recognition, 2020.
[26] G. Ghiasi, Y. Cui, A. Srinivas, R. Qian, T.-Y. Lin, E. D. Cubuk, Q. V. Le, and B. Zoph, "Simple copy-paste is a strong data augmentation method for instance segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2918–2928.
[27] R. B. Rusu, "Semantic 3D object maps for everyday manipulation in human living environments," KI-Künstliche Intelligenz, vol. 24, no. 4, pp. 345–348, 2010.
[28] H. Wang, C. Wang, and L. Xie, "Lightweight 3-D localization and mapping for solid-state lidar," IEEE Robotics and Automation Letters, 2020.
[29] T. Qin, P. Li, and S. Shen, "VINS-Mono: A robust and versatile monocular visual-inertial state estimator," IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
[30] R. B. Rusu and S. Cousins, "3D is here: Point Cloud Library (PCL)," in 2011 IEEE International Conference on Robotics and Automation, 2011, pp. 1–4.
[31] G. Ghiasi, Y. Cui, A. Srinivas, R. Qian, T.-Y. Lin, E. D. Cubuk, Q. V. Le, and B. Zoph, "Simple copy-paste is a strong data augmentation method for instance segmentation," IEEE Computer Vision and Pattern Recognition, 2020.
[32] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin, "MMDetection: Open MMLab detection toolbox and benchmark," IEEE Computer Vision and Pattern Recognition, 2019.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2022年/RGB_D_Inertial_Odometry_for_a_Resource Restricted_Robot_in_Dynamic_Environments.pdf b/动态slam/2020年-2022年开源动态SLAM/2022年/RGB_D_Inertial_Odometry_for_a_Resource Restricted_Robot_in_Dynamic_Environments.pdf
new file mode 100644
index 0000000..47e5ced
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2022年/RGB_D_Inertial_Odometry_for_a_Resource Restricted_Robot_in_Dynamic_Environments.pdf
@@ -0,0 +1,478 @@

IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 7, NO. 4, OCTOBER 2022 9573

RGB-D Inertial Odometry for a Resource-Restricted Robot in Dynamic Environments

Jianheng Liu, Xuanfu Li, Yueqian Liu, and Haoyao Chen, Member, IEEE

Abstract—Current simultaneous localization and mapping (SLAM) algorithms perform well in static environments but easily fail in dynamic environments. Recent works introduce deep learning-based semantic information to SLAM systems to reduce the influence of dynamic objects. However, it is still challenging to apply a robust localization in dynamic environments for resource-restricted robots. This paper proposes a real-time RGB-D inertial odometry system for resource-restricted robots in dynamic environments named Dynamic-VINS. Three main threads run in parallel: object detection, feature tracking, and state optimization. The proposed Dynamic-VINS combines object detection and depth information for dynamic feature recognition and achieves performance comparable to semantic segmentation. Dynamic-VINS adopts grid-based feature detection and proposes a fast and efficient method to extract high-quality FAST feature points. IMU is applied to predict motion for feature tracking and moving consistency check. The proposed method is evaluated on both public datasets and real-world applications and shows competitive localization accuracy and robustness in dynamic environments. Yet, to the best of our knowledge, it is the best-performance real-time RGB-D inertial odometry for resource-restricted platforms in dynamic environments for now. The proposed system is open source at: https://github.com/HITSZ-NRSL/Dynamic-VINS.git

Index Terms—Localization, visual-inertial SLAM.

Manuscript received 25 February 2022; accepted 20 June 2022. Date of publication 15 July 2022; date of current version 26 July 2022. This letter was recommended for publication by Associate Editor L. Paull and Editor J. Civera upon evaluation of the reviewers' comments. This work was supported in part by the National Natural Science Foundation of China under Grants U21A20119 and U1713206 and in part by the Shenzhen Science and Innovation Committee under Grants JCYJ20200109113412326, JCYJ20210324120400003, JCYJ20180507183837726, and JCYJ20180507183456108. (Corresponding Author: Haoyao Chen.)

Jianheng Liu, Yueqian Liu, and Haoyao Chen are with the School of Mechanical Engineering and Automation, Harbin Institute of Technology Shenzhen, Shenzhen, Guangdong 518055, China (e-mail: liujianhengchris@qq.com; yueqianliu@outlook.com; hychen5@hit.edu.cn).

Xuanfu Li is with the Department of HiSilicon Research, Huawei Technology Co., Ltd, Shenzhen, Guangdong 518129, China (e-mail: lixuanfu@huawei.com).

This letter has supplementary downloadable material available at https://doi.org/10.1109/LRA.2022.3191193, provided by the authors.

Digital Object Identifier 10.1109/LRA.2022.3191193

2377-3766 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14, 2023 at 12:34:48 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

SIMULTANEOUS localization and mapping (SLAM) is a foundational capability for many emerging applications, such as autonomous mobile robots and augmented reality. Cameras as portable sensors are commonly equipped on mobile robots and devices. Therefore, visual SLAM (vSLAM) has received tremendous attention over the past decades. Lots of works [1]–[4] are proposed to improve visual SLAM systems' performance. Most of the existing vSLAM systems depend on a static world assumption. Stable features in the environment are used to form a solid constraint for Bundle Adjustment [5]. However, in real-world scenarios like shopping malls and subways, dynamic objects such as moving people, vehicles, and unknown objects have an adverse impact on pose optimization. Although some approaches like RANSAC [6] can suppress the influence of dynamic features to a certain extent, they become overwhelmed when a vast number of dynamic objects appear in the scene.

Therefore, it is necessary for the system to consciously reduce dynamic objects' influence on the estimation results. Pure geometric methods [7]–[9] are widely used to handle dynamic objects, but they are unable to cope with latent or slightly moving objects. With the development of deep learning, many researchers have tried combining multi-view geometric methods with semantic information [10]–[13] to implement a robust SLAM system in dynamic environments. To avoid the accidental deletion of stable features through object detection [14], recent dynamic SLAM systems [15], [16] exploit the advantages of pixel-wise semantic segmentation for a better recognition of dynamic features. Due to the expensive computing resource consumption of semantic segmentation, it is difficult for a semantic-segmentation-based SLAM system to run in real-time. Therefore, some researchers have tried to perform semantic segmentation only on keyframes and track moving objects via moving probability propagation [17], [18] or a direct method [19] on each frame. In the case of missed detections or object tracking failures, the pose optimization is imprecise. Moreover, since semantic segmentation is performed after keyframe selection, real-time precise pose estimation is inaccessible, and unstable dynamic features in the original frame may also cause redundant keyframe creation and unnecessary computational burdens.

The above systems still require too many computing resources to perform robust real-time localization in dynamic environments for Size, Weight, and Power (SWaP) restricted mobile robots or devices. Some researchers [20]–[22] try to run visual odometry in real-time on embedded computing devices, yet keyframe-based visual odometry is not performed [23], which makes their accuracy unsatisfactory. At the same time, increasingly many embedded computing platforms are equipped with NPU/GPU computing units, such as the HUAWEI Atlas 200 and NVIDIA Jetson. This enables lightweight deep learning networks to run on embedded computing platforms in real-time. Some studies [14], [24] implemented a keyframe-based dynamic SLAM system running on embedded computing
The contributing modules are highlighted and surrounded by dash lines with different colors. Three main threads run in parallel in Dynamic-VINS. Features are tracked and detected in the feature tracking thread. The object detection thread detects dynamic objects in each frame in real-time. The state optimization thread summarizes the feature information, object detection results, and depth image to recognize the dynamic features. Finally, stable features and IMU preintegration results are used for pose estimation.

platforms. However, these works still find it difficult to balance efficiency and accuracy for mobile robot applications.

To address all these issues, this paper proposes a real-time RGB-D inertial odometry for resource-restricted robots in dynamic environments, named Dynamic-VINS. It enables edge computing devices to provide instant robust state feedback for mobile platforms with little computation burden. An efficient dynamic feature recognition module that does not require a high-precision depth camera can be used in mobile devices equipped with depth-measurement modules. The main contributions of this paper are as follows:
1) An efficient optimization-based RGB-D inertial odometry is proposed to provide real-time state estimation results for resource-restricted robots in dynamic and complex environments.
2) Lightweight feature detection and tracking are proposed to cut the computing burden. In addition, dynamic feature recognition modules combining object detection and depth information are proposed to provide robust dynamic feature recognition in complex and outdoor environments.
3) Validation experiments are performed to show the proposed system's competitive accuracy, robustness, and efficiency on resource-restricted platforms in dynamic environments.

II. SYSTEM OVERVIEW

The proposed SLAM system in this paper is extended from VINS-Mono [2] and VINS-RGBD [25]; our framework is shown in Fig. 1, and the contributing modules are highlighted with different colors. For efficiency, three main threads (surrounded by dash lines) run in parallel in Dynamic-VINS: object detection, feature tracking, and state optimization. Color images are passed to both the object detection thread and the feature tracking thread. IMU measurements between two consecutive frames are preintegrated [26] for feature tracking, moving consistency check, and state optimization.

In the feature tracking thread, features are tracked with the help of IMU preintegration and detected by grid-based feature detection. The object detection thread detects dynamic objects in each frame in real-time. Then, the state optimization thread summarizes the feature information, object detection results, and depth image to recognize the dynamic features. A missed detection compensation module is conducted in case of missed detection. The moving consistency check procedure combines the IMU preintegration and historical pose estimation results to identify potential dynamic features. Finally, stable features and IMU preintegration results are used for the pose estimation, and the propagation of the IMU is responsible for an IMU-rate pose estimation result. Loop closure is also supported in this system, but this paper pays more attention to localization independent of loop closure.

III. METHODOLOGY

This study proposes lightweight, high-quality feature tracking and detection methods to accelerate the system. Semantic and geometric information from the input RGB-D images and IMU preintegration are applied for dynamic feature recognition and moving consistency check. The missed detection compensation module plays a subsidiary role to object detection in case of missed detection. Dynamic features on unknown objects are further identified by the moving consistency check. The proposed methods are divided into five parts for a detailed description.

A. Feature Matching

For each incoming image, the feature points are tracked using the KLT sparse optical flow method [27]. In this paper, the IMU measurements between frames are used to predict the motion of features. A better initial position estimate of the features improves the efficiency of feature tracking by reducing the number of optical flow pyramid layers, and it can effectively discard unstable features such as noise and dynamic features with inconsistent motion. The basic idea is illustrated in Fig. 2.

In the previous frame, stable features are colored red, and newly detected features are colored blue. When the current frame arrives, the IMU measurements between the current and previous frames are used to predict the feature positions (green) in the current frame. Optical flow uses the predicted feature position as the initial position to look for a matching feature in the current frame. The successfully tracked features are turned red, while those that failed to be tracked are marked as unstable features (purple). In order to avoid the repetition and aggregation of feature detection, an orange circular mask centered on each stable feature is set; the region where the unstable features are located is considered an unstable feature detection region and is masked with a purple circle to avoid unstable feature detection. According to the mask, new features are detected from unmasked areas in the current frame and colored blue.

The above means can obtain uniformly distributed features to capture comprehensive constraints and avoid repeatedly extracting unstable features in areas with blur or weak texture. Long-term feature tracking can reduce the time consumption with the help of the grid-based feature detection described in the following.

LIU et al.: RGB-D INERTIAL ODOMETRY FOR A RESOURCE-RESTRICTED ROBOT IN DYNAMIC ENVIRONMENTS 9575

Fig. 2. Illustration of feature tracking and detection. Stable features and new features are colored red and blue, respectively. The green circles denote the prediction for optical flow. The successfully tracked features turn red; otherwise, the features turn purple. The orange and purple dash-line circles as masks are set for a uniform feature distribution and reliable feature detection. New feature points are detected from unmasked areas in the current frame.

Fig. 3. Illustration of semantic mask setting for dynamic feature recognition when all pixels' depths are available (d > 0). The left scene represents the case when an object bounding box's farthest corner depth is bigger than the center depth by more than a threshold, and a semantic mask with weighted depth is set between them to separate features on dynamic objects from the background. Otherwise, the semantic mask is set behind the bounding box's center at a distance of ε, as shown on the right.

B. Grid-Based Feature Detection

The system maintains a minimum number of features for stability. Therefore, feature points need to be extracted from the frame constantly. This study adopts grid-based feature detection. The image is divided into grids, and the boundary of each grid is padded to prevent the features at the edge of the grid from being ignored; the padding enables the current grid to obtain adjacent pixel information for feature detection. Unlike traversing the whole image to detect features, only the grids with insufficient matched features conduct feature detection. A grid cell that fails to detect features due to weak texture or because it is covered by the mask is skipped in the next detection frame to avoid repeated useless detection. The thread pool technique is used to exploit the parallel performance of grid-based feature detection. Thus, the time consumption of feature detection is significantly reduced without loss.

The FAST feature detector [28] can efficiently extract feature points but easily treats noise as features and extracts similar clustered features. Therefore, the mask idea of Section III-A and Non-Maximum Suppression are combined to select high-quality and uniformly distributed FAST features.

C. Dynamic Feature Recognition

Most feature points can be stably tracked through the above improvement. However, long-term tracked features on dynamic objects always come with abnormal motion and introduce wrong constraints to the system. For the sake of efficiency and computational cost, a real-time single-stage object detection method, YOLOv3 [11], is used to detect many kinds of dynamic scene elements like people and vehicles. If a detected bounding box covers a large region of the image, blindly deleting feature points in the bounding box might result in no available features to provide constraints. Therefore, semantic-segmentation-like masks are helpful to keep the system running by tracking features not occluded by dynamic objects.

This paper combines object detection and depth information for highly efficient dynamic feature recognition to achieve performance comparable to semantic segmentation. The farther the depth camera measures, the worse its accuracy is. This problem makes some methods, such as Seed Filling, DBSCAN, and K-Means, which make full use of the depth information, exhibit poor performance with a low-accuracy depth camera, as shown in Fig. 5(a). Therefore, a set of points in the detected bounding box and the depth information are integrated to obtain performance comparable to semantic segmentation, as illustrated in Fig. 3.

A pixel's depth d is available if d > 0; otherwise, d = 0. The bounding box corners of most dynamic objects correspond to background points, and dynamic objects commonly have a relatively large depth gap with the background. The K-th dynamic object's largest background depth ^K d_max is therefore obtained as

  ^K d_max = max(^K d_tl, ^K d_tr, ^K d_bl, ^K d_br),   (1)

where ^K d_tl, ^K d_tr, ^K d_bl, ^K d_br are the depth values of the K-th object detection bounding box's corners, respectively.
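The bookkeeping behind the grid-based detection of Section III-B can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation; the grid dimensions, the per-cell feature quota, and the function name are assumptions.

```python
# Hypothetical sketch of grid-based feature detection bookkeeping
# (Section III-B): only cells with too few tracked features trigger
# detection, and cells that just failed are skipped for one frame.

def cells_to_detect(tracked, cols, rows, cell_w, cell_h,
                    min_per_cell=2, skip=frozenset()):
    """Return the set of (cx, cy) grid cells that should run detection.

    tracked -- iterable of (u, v) pixel positions of tracked features
    skip    -- cells that failed detection last frame (weak texture or
               masked) and are skipped to avoid repeated useless detection
    """
    counts = {}
    for (u, v) in tracked:
        cell = (int(u // cell_w), int(v // cell_h))
        counts[cell] = counts.get(cell, 0) + 1
    need = set()
    for cx in range(cols):
        for cy in range(rows):
            cell = (cx, cy)
            if cell in skip:
                continue  # skipped this frame after a failed detection
            if counts.get(cell, 0) < min_per_cell:
                need.add(cell)  # under-populated: run FAST here
    return need
```

In this spirit, a full frame is never traversed: well-populated cells cost nothing, and each candidate cell can be handed to a thread pool independently, which is what makes the per-frame detection cost small.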
Next, the K-th bounding box's depth threshold ^K d̄ is defined as

  ^K d̄ = (1/2)(^K d_max + ^K d_c),  if ^K d_max − ^K d_c > ε and ^K d_c > 0;
         ^K d_c + ε,                if ^K d_max − ^K d_c ≤ ε and ^K d_c > 0;
         ^K d_max,                  if ^K d_max > 0 and ^K d_c = 0;
         ∞,                         otherwise,                              (2)

where ^K d_c is the depth value of the bounding box's center, and ε > 0 is a predefined distance chosen according to the size of the most common dynamic objects in scenes. The depth threshold ^K d̄ is placed in the middle between the center's depth ^K d_c and the deepest background depth ^K d_max. When the dynamic object has a close connection with the background or is behind another object (^K d_max − ^K d_c ≤ ε), the depth threshold is set at distance ε from the dynamic object. If the depth is unavailable, a conservative strategy is adopted and an infinite depth is chosen as the threshold.

On the semantic mask, the area covered by the K-th dynamic object's bounding box is set to the weighted depth ^K d̄; the area without dynamic objects is set to 0. Each incoming feature's depth d is compared with the corresponding pixel's depth threshold d̄ on the semantic mask. If d < d̄, the feature is considered dynamic; otherwise, it is considered stable. Therefore, the region where the depth value is smaller than the weighted depth d̄ constitutes the generalized semantic mask, as shown in Figs. 4 and 5(b).

Considering that dynamic objects may exist in the field of view for a long time, the dynamic features are tracked but not used for pose estimation, unlike directly deleting dynamic features. According to its recorded information, each incoming feature point from the feature tracking thread is judged to be a historical dynamic feature or not. The above methods avoid blindly deleting feature points while ensuring efficiency. They save the time of detecting features on dynamic objects, are robust to missed object detections, and recycle false-positive dynamic features, as illustrated in Section III-E.

Fig. 4. Results of missed detection compensation. The dynamic feature recognition results are shown in the first row. The green box shows the dynamic object's position from the object detection results. The second row shows the generated semantic mask. With the help of missed detection compensation, even if object detection failed in (b) and (d), a semantic mask including all dynamic objects could still be built.

Fig. 5. Results of dynamic feature recognition. The stable features are circled in yellow. The dynamic feature recognition results generated by Seed Filling and by the proposed method are shown in (a) and (b), respectively. The weighted depth d̄ is colored gray; brighter means a bigger value. Feature points in the white area will be marked as dynamic features.

D. Missed Detection Compensation

Since object detection might sometimes fail, the proposed Dynamic-VINS utilizes the previous detection results to predict the following detection result and compensate for missed detections. It is assumed that the dynamic objects in adjacent frames have a consistent motion. Once a dynamic object is detected, its pixel velocity and bounding box are updated. Assuming that j is the current detected frame and j − 1 is the previous detected frame, the pixel velocity ^K v_{c_j} (pixel/frame) of the K-th dynamic object between frames is defined as

  ^K v_{c_j} = ^K u_{c_j} − ^K u_{c_{j−1}},   (3)

where ^K u_{c_j} and ^K u_{c_{j−1}} represent the pixel locations of the K-th object detection bounding box's center in the j-th and (j − 1)-th frames, respectively. A weighted predicted velocity ^K v̂ is defined as

  ^K v̂_{c_{j+1}} = (1/2)(^K v_{c_j} + ^K v̂_{c_j}).   (4)

As the update goes on, the velocities of older frames have a lower weight in ^K v̂. If the object fails to be detected in the next frame, the bounding box ^K Box, containing the corners' pixel locations ^K u_tl, ^K u_tr, ^K u_bl, and ^K u_br, is updated based on the predicted velocity ^K v̂ as follows:

  ^K B̂ox_{c_{j+1}} = ^K Box_{c_j} + ^K v̂_{c_{j+1}}.   (5)

When the missed detection time exceeds a threshold, this dynamic object's compensation is abandoned. The result is shown in Fig. 4. It improves the recall rate of object detection and is helpful for a more consistent dynamic feature recognition.
Dynamic-VINS combines the pose has an 8-core A55 Arm CPU (1.6 GHz), 8 GB of RAM, and +predicted by IMU and the optimized pose in the sliding windows a 2-core HUAWEI DaVinci NPU. Jetson AGX Xavier has an +to recognize dynamic features. 8-core ARMv8.2 64-bit CPU (2.25 GHz), 16 GB of RAM, + and a 512-core Nvidia Volta GPU. And the results tested on + Consider the kth feature is first observed in the ith image and both devices are named Dynamic-VINS-Atlas and Dynamic- +is observed by other m images in sliding windows. The average VINS-Jetson, respectively. Yet, to the best of our knowledge, the +reprojection residual rk of the feature observation in the sliding proposed method is the best-performance real-time RGB-D iner- +windows is defined as tial odometry for dynamic environments on resource-restricted + embedded platforms. +rk = 1 ukci − π TcbTwbi Tbwj TbcPkcj , (6) + m A. OpenLORIS-Scene Dataset + j=i + OpenLORIS-Scene [3] is a real-world indoor dataset with +where ukci is the observation of kth feature in the ith frame; Pkcj a large variety of challenging scenarios like dynamic scenes, +is the 3D location of kth feature in the jth frame; Tcb and Twbj featureless frames, and dim illumination. The results on the +are the transforms from camera frame to body frame and from OpenLORIS-Scene dataset are shown in Fig. 6, including the +jth body frame to world frame, respecvtively; π represents the results of VINS-Mono, ORB-SLAM2, and DS-SLAM from [3] +camera projection model. When the rk is over a preset threshold, as baselines. +the kth feature is considered as a dynamic feature. + The OpenLORIS dataset includes five scenes and 22 se- + As shown in Fig. 7, the moving consistency check (MCC) quences in total. The proposed Dynamic-VINS shows the best +module can find out unstable features. However, some stable robustness among the tested algorithms. 
In office scenes that +features are misidentified (top left image), and features on are primarily static environments, all the algorithms can track +standing people are not recognized (bottom right image). A low successfully and achieve a decent accuracy. It is challenging for +threshold holds a high recall rate of unstable features. Further, the pure visual SLAM systems to track stable features in home +a misidentified unstable feature with more observations will be and corridor scenes that contain a large area of textureless walls +recycled if its reprojection error is lower than the threshold. and dim lighting. Thanks to the IMU sensor, the VINS systems + show robustness superiority when the camera is unreliable. The + IV. EXPERIMENTAL RESULTS scenarios of home and caf e contain a number of sitting people + with a bit of motion, and market exists lots of moving pedes- + Quantitative experiments1 are performed to evaluate the pro- trians and objects with unpredictable motion. And the market +posed system’s accuracy, robustness, and efficiency. Public scenes cover the largest area and contain highly dynamic objects, +SLAM evaluation datasets, OpenLORIS-Scene [29] and TUM as shown in Fig. 5. Although DS-SLAM is able to filter out +RGB-D [30], provide sensor data and ground truth to evaluate some dynamic features, its performance is still unsatisfactory. +SLAM system in complex dynamic environments. Since our sys- VINS-RGBD has a similar performance with Dynamic-VINS +tem is built on VINS-Mono [2] and VINS-RGBD [25], they are in relative static scenes, while VINS-RGBD’s accuracy drops in +used as the baselines to demonstrate our improvement. VINS- highly dynamic market scenes. The proposed Dynamic-VINS +Mono [2] provides robust and accurate visual-inertial odometry can effectively deal with complex dynamic environments and +by fusing IMU preintegration and feature observations. VINS- improve robustness and accuracy. 
+RGBD [25] integrates RGB-D camera based on VINS-Mono +for better performance. Furthermore, DS-SLAM [15] and Ji B. TUM RGB-D Dataset +et al.[24], state-of-the-art semantic algorithms based on ORB- +SLAM2 [4], are also included for comparison. The TUM RGB-D dataset [30] offers several sequences con- + taining dynamic objects in indoor environments. The highly + The accuracy is evaluated by Root-Mean-Square-Error dynamic f r3_walking sequences are chosen for evaluation +(RMSE) of Absolute Trajectory Error (ATE), Translational Rel- where two people walk around a desk and change chairs’ +ative Pose Error (T.RPE), and Rotational Relative Pose Error positions while the camera moves in different motions. As +(R.RPE). Correct Rate (CR) [29] measuring the correct rate the VINS system does not support VO mode and the TUM +over the whole period of data is used to evaluate the robustness. RGB-D dataset does not provide IMU measurements, a VO +The RMSE of an algorithm is calculated only for its success- mode is implemented by simply disabling modules relevant to +ful tracking outputs. Therefore, the longer an algorithm tracks IMU in Dynamic-VINS for experiments. The results are shown +successfully, the more error is likely to accumulate. It implies in Table I. The compared methods’ results are included from +that evaluating algorithms purely by ATE could be mislead- their original published papers. The algorithms based on ORB- +ing. On the other hand, considering only CR could also be SLAM2 and semantic segmentation perform better. Although +misleading. + +1The experimental video is available at https://youtu.be/y0U1IVtFBwY. + +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:34:48 UTC from IEEE Xplore. Restrictions apply. + 9578 IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 7, NO. 4, OCTOBER 2022 + +Fig. 6. Per-sequence testing results with the OpenLORIS-Scene datasets. 
Each black dot on the top line represents the start of one data sequence. For each +algorithm, blue dots indicate successful initialization moments, and blue lines indicate successful tracking span. The percentage value on the top left of each scene +is the average correct rate; the higher the correct rate of an algorithm, the more robust it is. The float value on the first line below is average ATE RMSE and the +values on the second line below are T.RPE and R.RPE from left to right, and smaller means more accurate. + + TABLE I + RESULTS OF RMSE OF ATE [m], T.RPE [m/s], AND R.RPE [◦/s] ON TUM RGB-D f r3_walking DATASETS + + TABLE II + ABLATION EXPERIMENT RESULTS OF RMSE OF ATE [m], T.RPE [m/s], AND R.RPE [◦/s] ON TUM RGB-D f r3_walking DATASETS + + to extract evenly distributed stable features, which seriously + degrades the accuracy performance. Without the object detec- + tion (W/O OBJECT DETECTION), dynamic features introduce + wrong constraints to impair the system’s accuracy. Dynamic- + VINS-W/O-SEG-LIKE-MASK shows the results that mask all + features in the bounding boxes. The background features help the + system maintain as many stable features as possible to provide + more visual constraints. The moving consistency check plays + an important role when object detection fails, as shown in the + column W/O-MCC. + +Fig. 7. Results of Moving Consistency Check. Features without yellow circu- C. Runtime Analysis +lar are the outliers marked by the Moving Consistency Check module. + This part compares VINS-Mono, VINS-RGBD, and +Dynamic-VINS is not designed for pure visual odometry, it still Dynamic-VINS for runtime analysis. These methods are ex- +shows competitive performance and has a significant improve- pected to track and detect 130 feature points, and the frames +ment over ORB-SLAM2. in Dynamic-VINS are divided into 7x8 grids. The object detec- + tion runs on the NPU/GPU parallel to the CPU. 
The average + To validate the effectiveness of each module in Dynamic- computation times of each module and thread are calculated on +VINS, ablation experiments are conducted as shown in Table II. OpenLORIS market scenes; the results run on both embedded +The system without applying circular masks (W/O CIRCU- platforms are shown in Table III. It should be noted that the +LAR MASK) from the Section III-A and Section III-B fails average computation time is only to be updated when the module + is used. Specifically, in VINS architecture, the feature detection + is executed at a consistent frequency with the state optimization + +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:34:48 UTC from IEEE Xplore. Restrictions apply. + LIU et al.: RGB-D INERTIAL ODOMETRY FOR A RESOURCE-RESTRICTED ROBOT IN DYNAMIC ENVIRONMENTS 9579 + + TABLE III + AVERAGE COMPUTATION TIME [ms] OF EACH MODULE AND THREAD ON OPENLORIS market SCENES + +* Tracking Thread, Optimization Thread and Object Detection correspond to the three different threads shown in Fig. 1, respectively. +† Dynamic Feature Recognition Modules sum up the Dynamic Feature Recognition, Missed Detection Compensation, and Moving Consistency Check modules. + +Fig. 8. A compact aerial robot equipped with an RGB-D camera, an autopilot Fig. 9. The estimated trajectories in the outdoor environment aligned with the +with IMUs, an onboard computer, and an embedded edge computing device. Google map. The green line is the estimated trajectory from Dynamic-VINS, the +The whole size is about 255 × 165 mm. red line is from VINS-RGBD, and the yellow line represents the loop closure + that happened at the end of the dataset. +thread, which means the frequency of feature detection is lower +than that of Feature Tracking Thread. Fig. 10. Results of dynamic feature recognition in outdoor environments. 
The + dynamic feature recognition modules are still able to segment dynamic objects + On edge computing devices with AI accelerator modules, but with a larger mask region. +the single-stage object detection method is computed by an +NPU or GPU without costing the CPU resources and can out- handheld aerial robot above for safety. The total path lengths +put inference results in real-time. With the same parameters, are approximately 800 m and 1220 m, respectively. The dataset +Dynamic-VINS shows significant improvement in feature de- has a similar scene at the beginning and the end for loop +tection efficiency in both embedded platforms and is the one able closure, while loop closure fails in the THUSZ campus dataset. +to achieve instant feature tracking and detection in HUAWEI At- VINS-RGBD and Dynamic-VINS run the dataset on NVIDIA +las200 DK. The dynamic feature recognition modules (Dynamic Jetson AGX Xavier. The estimated trajectories and loop closure +Feature Recognition, Missed Detection Compensation, Moving trajectory aligned with the Google map are shown in Fig. 9. +Consistency Check) to recognize dynamic features only take In outdoor environments, the depth camera is limited in range +a tiny part of the consuming time. For real-time application, and affected by the sunlight. The dynamic feature recognition +the system is able to output a faster frame-to-frame pose and a modules can still segment dynamic objects but with a larger +higher-frequency imu-propagated pose rather than waiting for mask region, as shown in Fig. 10. Compared with loop closure +the complete optimization result. results, Dynamic-VINS could provide a robust and stable pose + estimation with little drift. +D. Real-World Experiments + + A compact aerial robot is shown in Fig. 8. An RGB-D camera +(Intel Realsense D455) provides 30 Hz color and aligned depth +images. An autopilot (CUAV X7pro) with an onboard IMU +(ADIS16470, 200 Hz) is used to provide IMU measurements. 
+The aerial robot is equipped with an onboard computer (Intel +NUC, i7-5557 U CPU) and an embedded edge computing de- +vice (HUAWEI Atlas200 DK). These two computation resource +providers play different roles in the aerial robot. The onboard +computer charges for peripheral management and other core +functions requiring more CPU resources, such as planning and +mapping. The edge computing device as auxiliary equipment +offers instant state feedback and object detection results to the +onboard computer. + + Large-scale outdoor datasets with moving people and vehi- +cles on the HITSZ and THUSZ campus are recorded by the + +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:34:48 UTC from IEEE Xplore. Restrictions apply. + 9580 IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 7, NO. 4, OCTOBER 2022 + + V. CONCLUSION [12] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep con- + volutional encoder-decoder architecture for image segmentation,” IEEE + This paper presents a real-time RGB-D inertial odometry Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, +for resource-restricted robots in dynamic environments. Cost- Dec. 2017. +efficient feature tracking and detection methods are proposed to +cut down the computing burden. A lightweight object-detection- [13] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” in Proc. +based method is introduced to deal with dynamic features in IEEE Int. Conf. Comput. Vis., 2017, pp. 2961–2969. +real-time. Validation experiments show the proposed system’s +competitive accuracy, robustness, and efficiency in dynamic [14] L. Xiao et al., “Dynamic-SLAM: Semantic monocular visual localization +environments. Furthermore, Dynamic-VINS is able to run on and mapping based on deep learning in dynamic environment,” Robot. +resource-restricted platforms to output an instant pose estima- Auton. Syst., vol. 117, pp. 1–16, 2019. +tion. 
In the future, the proposed approaches are expected to +be validated on the existing popular SLAM frameworks. The [15] C. Yu et al., “DS-SLAM: A semantic visual SLAM towards dynamic +missed detection compensation module is expected to develop environments,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2018, +into a moving object tracking module, and semantic information pp. 1168–1174. +will be further introduced for high-level guidance on mobile +robots or mobile devices in complex dynamic environments. [16] B. Bescos,, J. M. Facil, J. Civera, and J. Neira, “DynaSLAM: Tracking, + mapping, and inpainting in dynamic scenes,” IEEE Robot. Automat. Lett., + REFERENCES vol. 3, no. 4, pp. 4076–4083, Oct. 2018. + + [1] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE [17] F. Zhong, S. Wang, Z. Zhang, C. Chen, and Y. Wang, “Detect-SLAM: + Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 611–625, Mar. 2018. Making object detection and SLAM mutually beneficial,” in Proc. IEEE + Winter Conf. Appl. Comput. Vis., 2018, pp. 1001–1010. + [2] T. Qin, P. Li, and S. Shen, “VINS-Mono: A robust and versatile monoc- + ular visual-inertial state estimator,” IEEE Trans. Robot., vol. 34, no. 4, [18] Y. Liu and J. Miura, “RDS-SLAM: Real-time dynamic SLAM using + pp. 1004–1020, Aug. 2018. semantic segmentation methods,” IEEE Access, vol. 9, pp. 23 772–23 785, + 2021. + [3] P. Geneva, K. Eckenhoff, W. Lee, Y. Yang, and G. Huang, “OpenVINS: A + research platform for visual-inertial estimation,” in Proc. IEEE Int. Conf. [19] I. Ballester, A. Fontán, J. Civera, K. H. Strobl, and R. Triebel, “DOT: + Robot. Automat., 2020, pp. 4666–4672. Dynamic object tracking for visual SLAM,” in Proc. IEEE Int. Conf. Robot. + Automat., 2021, pp. 11 705–11 711. + [4] R. Mur-Artal and J. D. Tardos, “ORB-SLAM2: An open-source SLAM + system for monocular, stereo, and RGB-D cameras,” IEEE Trans. Robot., [20] K. Schauwecker, N. R. Ke, S. A. Scherer, and A. 
Zell, “Markerless visual + vol. 33, no. 5, pp. 1255–1262, Oct. 2017. control of a quad-rotor micro aerial vehicle by means of on-board stereo + processing,” in Proc. Auton. Mobile Syst., 2012, pp. 11–20. + [5] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle + adjustment—a modern synthesis,” in Proc. Int. Workshop Vis. Algorithms, [21] Z. Z. Nejad and A. Hosseininaveh Ahmadabadian, “ARM-VO: An efficient + 1999, pp. 298–372. monocular visual odometry for ground vehicles on ARM CPUs,” Mach. + Vis. Appl., vol. 30, no. 6, pp. 1061–1070, 2019. + [6] M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm + for model fitting with applications to image analysis and automated car- [22] S. Bahnam, S. Pfeiffer, and G. C. H. E. de Croon, “Stereo visual iner- + tography,” Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981. tial odometry for robots with limited computational resources,” in Proc. + IEEE/RSJ Int. Conf. Intell. Robots Syst., 2021, pp. 9154–9159. + [7] Y. Sun, M. Liu, and M.Q.-H. Meng, “Improving RGB-D SLAM in dynamic + environments: A motion removal approach,” Robot. Auton. Syst., vol. 89, [23] G. Younes et al., “Keyframe-based monocular SLAM: Design, survey, and + pp. 110–122, 2017. future directions,” Robot. Auton. Syst., vol. 98, pp. 67–88, 2017. + + [8] E. Palazzolo,, J. Behley, P. Lottes, P. Gigu, and C. Stachniss, “ReFusion: [24] T. Ji, C. Wang, and L. Xie, “Towards real-time semantic RGB-D SLAM in + 3D reconstruction in dynamic environments for RGB-D cameras exploit- dynamic environments,” in Proc. IEEE Int. Conf. Robot. Automat., 2021, + ing residuals,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2019, pp. 11 175–11 181. + pp. 7855–7862. + [25] Z. Shan, R. Li, and S. Schwertfeger, “RGBD-inertial trajectory estima- + [9] W. Dai et al., “RGB-D SLAM in dynamic environments using point tion and mapping for ground robots,” Sensors, vol. 19, no. 10, 2019, + correlations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 
44, no. 1, Art. no. 2251. + pp. 373–389, Jan. 2022. + [26] C. Forster et al., “IMU preintegration on manifold for efficient visual- +[10] W. Liu et al., “SSD: Single shot MultiBox detector,” in Eur. Conf. Comp. inertial maximum-a-posteriori estimation,” in Proc. Robot.: Sci. Syst., + Vis., 2016, pp. 21–37. 2015. + +[11] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” [27] B. D. Lucas et al., “An iterative image registration technique with an appli- + 2018, arXiv:1804.02767. cation to stereo vision,” in Proc. DARPA Image Understanding Workshop, + 1981, pp. 121–130. + + [28] E. Rosten and T. Drummond, “Machine learning for high-speed corner + detection,” in Proc. Eur. Conf. Comput. Vis., 2006, pp. 430–443. + + [29] X. Shi et al., “Are we ready for service robots? The OpenLORIS-Scene + datasets for lifelong SLAM,” in Proc. IEEE Int. Conf. Robot. Automat., + 2020, pp. 3139–3145. + + [30] J. Sturm,, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A bench- + mark for the evaluation of RGB-D SLAM systems,” in Proc. IEEE/RSJ + Int. Conf. Intell. Robots Syst., 2012, pp. 573–580. + +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:34:48 UTC from IEEE Xplore. Restrictions apply. + diff --git a/动态slam/2020年-2022年开源动态SLAM/2022年/SG-SLAM_A_Real-Time_RGB-D_Visual_SLAM_Toward_Dynamic_Scenes_With_Semantic_and_Geometric_Information.pdf b/动态slam/2020年-2022年开源动态SLAM/2022年/SG-SLAM_A_Real-Time_RGB-D_Visual_SLAM_Toward_Dynamic_Scenes_With_Semantic_and_Geometric_Information.pdf new file mode 100644 index 0000000..0065e5b --- /dev/null +++ b/动态slam/2020年-2022年开源动态SLAM/2022年/SG-SLAM_A_Real-Time_RGB-D_Visual_SLAM_Toward_Dynamic_Scenes_With_Semantic_and_Geometric_Information.pdf @@ -0,0 +1,665 @@ +IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 
72, 2023 7501012

SG-SLAM: A Real-Time RGB-D Visual SLAM Toward Dynamic Scenes With Semantic and Geometric Information

Shuhong Cheng, Changhe Sun, Shijun Zhang, Student Member, IEEE, and Dianfan Zhang

Abstract— Simultaneous localization and mapping (SLAM) is one of the fundamental capabilities for intelligent mobile robots to perform state estimation in unknown environments. However, most visual SLAM systems rely on the static scene assumption and consequently have severely reduced accuracy and robustness in dynamic scenes. Moreover, the metric maps constructed by many systems lack semantic information, so the robots cannot understand their surroundings at a human cognitive level. In this article, we propose SG-SLAM, a real-time RGB-D semantic visual SLAM system based on the ORB-SLAM2 framework. First, SG-SLAM adds two new parallel threads: an object detecting thread to obtain 2-D semantic information and a semantic mapping thread. Then, a fast dynamic feature rejection algorithm combining semantic and geometric information is added to the tracking thread. Finally, the 3-D point clouds and 3-D semantic objects generated in the semantic mapping thread are published to the robot operating system (ROS) for visualization. We performed an experimental evaluation on the TUM dataset, the Bonn dataset, and the OpenLORIS-Scene dataset. The results show that SG-SLAM is not only one of the most real-time, accurate, and robust systems in dynamic scenes but also allows the creation of intuitive semantic metric maps.

Index Terms— Dynamic scenes, geometric constraint, semantic metric map, visual-based measurement, visual simultaneous localization and mapping (SLAM).

I. INTRODUCTION

SIMULTANEOUS localization and mapping (SLAM) has an important role in the state perception of mobile robots. It can help a robot in an unknown environment with an unknown pose to incrementally build a globally consistent map and simultaneously measure its pose in this map [1]. Due to the continuing and rapid development of cameras and computing systems, we have access to cheaper, faster, higher quality, and smaller vision-based sensors. This also helps vision-based measurement (VBM) become more ubiquitous and applicable [2]. Hence, in the past years, a large number of excellent visual SLAM systems have emerged, such as PTAM [3], ORB-SLAM2 [4], DVO [5], and Kimera [6]. Some of these visual SLAM systems are quite mature and have achieved good performance under certain specific environmental conditions.

As SLAM enters the age of robust perception [7], the system has higher requirements in terms of robustness and high-level understanding. However, many classical vision-based SLAM systems still fall short of these requirements in some practical scenarios. On the one hand, most visual SLAM systems work based on the static scene assumption, which makes them less accurate and less robust in real dynamic scenes (e.g., scenes containing walking people and moving vehicles). On the other hand, most existing SLAM systems only construct a globally consistent metric map of the robot's working environment [8]. However, the metric map does not help the robot to understand its surroundings at a higher semantic level.

Most visual SLAM algorithms rely on the static scene assumption, which is why the presence of dynamic objects can cause these algorithms to produce wrong data correlations. The outliers obtained from dynamic objects can seriously impair the accuracy and stability of the algorithms. Even though these algorithms show superior performance in some specific scenarios, it is difficult to extend them to actual production and living scenarios containing dynamic objects. Some recent works, such as [9], [10], [11], and [12], have used methods that combine geometric and semantic information to eliminate the adverse effects of dynamic objects. These algorithms, mainly based on deep learning, achieve significant improvements in experimental accuracy, but they suffer from shortcomings in scene generalizability or real-time performance due to various factors. Therefore, how to skillfully detect and process dynamic objects in the scene is crucial for the system to operate accurately, robustly, and in real time.

Traditional SLAM systems construct only a sparse metric map [3], [4]. This metric map consists of simple geometries (points, lines, and surfaces), and every pose is strictly related to the global coordinate system. Enabling a robot to perform advanced tasks with intuitive human–robot interaction requires it to understand its surroundings at a human

Manuscript received 25 August 2022; revised 31 October 2022; accepted 23 November 2022. Date of publication 9 December 2022; date of current version 17 January 2023. This work was supported in part by the National Key Research and Development Program under Grant 2021YFB3202303, in part by the S&T Program of Hebei under Grant 20371801D, in part by the Hebei Provincial Department of Education for Cultivating Innovative Ability of Postgraduate Students under Grant CXZZBS2022145, and in part by the Hebei Province Natural Science Foundation Project under Grant E2021203018. The Associate Editor coordinating the review process was Dr. Jae-Ho Han. (Corresponding authors: Shijun Zhang; Dianfan Zhang.)

Shuhong Cheng and Changhe Sun are with the School of Electrical Engineering, Yanshan University, Qinhuangdao 066000, China (e-mail: shhcheng@ysu.edu.cn; silencht@qq.com).

Shijun Zhang is with the School of Mechanical Engineering, Yanshan University, Qinhuangdao 066000, China (e-mail: 980871977@qq.com).
Dianfan Zhang is with the Key Laboratory of Special Delivery Equipment, Yanshan University, Qinhuangdao 066004, China (e-mail: zdf@ysu.edu.cn).

Digital Object Identifier 10.1109/TIM.2022.3228006

1557-9662 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on July 06,2023 at 09:28:10 UTC from IEEE Xplore. Restrictions apply.

Fig. 1. Overview of the framework of the SG-SLAM system. The original work of ORB-SLAM2 is presented on an aqua-green background, while our main new or modified work is presented on a red background.

cognitive level. However, the metric map lacks the necessary semantic information and therefore cannot provide this capability. With the rapid development of deep learning in recent years, some neural networks can effectively capture the semantic information in the scenes. Therefore, the metric map can be extended to the semantic metric map by integrating semantic information. The semantic information contained in the semantic metric map can provide the robot with the capability to understand its surroundings at a higher level.

This article focuses on a dynamic feature rejection algorithm that integrates semantic and geometric information, which not only significantly improves the accuracy of system localization but also has excellent computational efficiency. Thus, our algorithm is very useful from an instrumentation and measurement point of view [2]. This article also focuses on how to construct the semantic metric map to improve the perceptual level of the robot in understanding the surrounding scenes. The overall framework of the SG-SLAM system is shown in Fig. 1.

The main contributions of this article include the following.

1) A complete real-time RGB-D visual SLAM system called SG-SLAM is proposed using ORB-SLAM2 as a framework. Compared to ORB-SLAM2, it has higher accuracy and robustness in dynamic scenes and can publish a semantic metric map through the robot operating system (ROS) [13].

2) A fast dynamic feature rejection algorithm is proposed by combining geometric information and semantic information. The geometric information is calculated from the epipolar constraint between image frames. The semantic information about dynamic objects is obtained through an NCNN-based [14] object detection network in a new thread. The algorithm speed is greatly improved by appropriate modifications and a combination of classical methods while maintaining accuracy.

3) An independent semantic metric mapping thread that can generate semantic objects and Octo maps [15] using the ROS interface is embedded in SG-SLAM. These maps can be useful in subsequent localization, navigation, and object capture tasks.

The remaining sections of this article are organized as follows. The work related to this system is described in Section II. Section III shows the details related to the implementation of this system. Section IV provides an experimental evaluation and an analysis of the results. The conclusions and future works of this article are presented in Section V.

CHENG et al.: SG-SLAM: A REAL-TIME RGB-D VISUAL SLAM TOWARD DYNAMIC SCENES 7501012

II. RELATED WORKS

A. SLAM in Dynamic Scenes

Most current visual SLAM systems assume that the working scene is static and rigid. When these systems work in dynamic scenes, erroneous data associations due to the static scene assumption can seriously weaken the accuracy and stability of the system. The presence of dynamic objects in the scene divides all features into two categories: static features and dynamic features. How to detect and reject dynamic features is the key to solving the problem. Previous research work can be divided into three categories: the geometric information method, the semantic information method, and the method combining geometric and semantic information.

The main idea of the geometric information method is to assume that only static features can satisfy the geometric constraints of the algorithm. A remarkable early monocular dynamic object detection system comes from the work of Kundu et al. [16]. The system creates two geometric constraints to detect dynamic objects based on multiview geometry [17]. One of the most important is the epipolar constraint defined by the fundamental matrix. The idea is that a static feature point in the current image must lie on the epipolar line corresponding to the same feature point in the previous image. A feature point is considered dynamic if its distance from the corresponding epipolar line exceeds an empirical threshold. The fundamental matrix of the system is calculated with the help of an odometer. In a purely visual system, the fundamental matrix can be calculated by the seven-point method based on RANSAC [18]. The algorithm of Kundu et al. [16] has the advantages of fast speed and strong scene generalization. However, it lacks a high-level understanding of the scene, so the empirical threshold is difficult to select and the accuracy is not high. In addition, some works use the direct method for motion detection of scenes, such as [19], [20], [21], and [22]. The direct method algorithms are faster and can utilize more image information. However, they are less robust in complex environments because they are based on the gray-scale invariance assumption.

The main idea of the semantic information method is to brutally reject features in dynamic regions that are obtained a priori using deep learning techniques. Zhang et al. [23] used the YOLO [24] object detection method to obtain the semantic information of dynamic objects in the working scene and then rejected the dynamic feature points based on the semantic information to improve the accuracy of the system. However, the way YOLO extracts semantic information by bounding box causes a part of the static feature points to be wrongly regarded as outliers and eliminated. Similarly, Dynamic-SLAM proposed by Xiao et al. [25] has the same problem of directly rejecting all features within the bounding box. Liu and Miura [26] adopted a semantic segmentation method to detect dynamic objects and remove outliers in keyframes. The semantic segmentation method solves the problem of wrong recognition due to bounding boxes to a certain extent. However, the semantic information method relies heavily on the quality of the neural network, so it is difficult to meet the requirements of speed and accuracy at the same time.

Recently, much work has taken the approach of combining geometric and semantic information. For the RGB-D camera, Bescos et al. [9] used the semantic segmentation results of Mask R-CNN [27] combined with multiview geometry to detect dynamic objects and reject outliers. Yu et al. [10] used an optical flow-based moving consistency check method to examine all feature points and simultaneously performed semantic segmentation of the image using SegNet [28] in an independent thread. If the moving consistency checking method detects more than a certain percentage of dynamic points within the range of a human object, all feature points that lie inside the object are directly rejected. Wu et al. [11] used YOLO to detect a priori dynamic objects in the scene and then combined it with the depth-RANSAC method to reject the feature points inside the range of dynamic objects. Chang et al. [12] segmented the dynamic objects by YOLACT and then removed the outliers inside the objects. Then, geometric constraints are introduced to further filter the missed dynamic points.

The above methods have achieved quite good results in terms of accuracy improvement. Nevertheless, all of these methods rely heavily on semantic information and only to a lesser extent on geometric information. Thus, more or less all of them have the following shortcomings.

1) Inability to correctly handle dynamic features outside of the prior objects [10], [11], [23], [25], [26]. For example, chairs are static objects by default but become dynamic while being moved by a person; moving cats may appear in the scene while the neural network has not been trained on the category of cats; and the detection algorithm may suffer from low recall.

2) When an a priori dynamic object remains stationary, the feature points in its range are still brutally rejected, resulting in less available association data [11], [12], [23], [25], [26]. For example, a person who is sitting still is nevertheless considered a dynamic object.

3) The real-time performance is weak [9], [10], [11], [12]. The average frame rate of the system is low due to factors such as complex semantic segmentation networks or an unreasonable system architecture.

We propose an efficient dynamic feature rejection algorithm combining geometric and semantic information to solve the above problems. Unlike most current work that relies heavily on deep learning, our algorithm uses mainly geometric information and then supplements it with semantic information. This shift in thinking allows our algorithm to avoid the shortcomings associated with relying too much on deep learning.
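To make the bounding-box flavor of the semantic information method concrete, the following sketch rejects every feature point that falls inside a detected a priori dynamic object's box. This is an illustrative reconstruction, not code from any of the cited systems; the class list, box format, and point layout are hypothetical, and real systems operate on ORB features inside a full SLAM pipeline.

```python
# A priori dynamic classes, as assumed by detector-based rejection (illustrative).
DYNAMIC_CLASSES = {"person", "car"}

def reject_semantic(points, detections):
    """Drop feature points that fall inside any a priori dynamic object's box.

    points: list of (u, v) pixel coordinates.
    detections: list of (class_name, (x_min, y_min, x_max, y_max)) tuples.
    Returns the retained (assumed static) points.
    """
    def in_dynamic_box(u, v):
        return any(cls in DYNAMIC_CLASSES and x0 <= u <= x1 and y0 <= v <= y1
                   for cls, (x0, y0, x1, y1) in detections)
    return [(u, v) for u, v in points if not in_dynamic_box(u, v)]
```

Note that every point inside a "person" box is discarded even when that person is actually sitting still, which is exactly shortcoming 2) listed above.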
B. Semantic Mapping

Many current visual SLAM systems only provide a metric map that satisfies the basic localization and navigation functions of mobile robots, such as the sparse feature point map constructed by ORB-SLAM2. If a mobile robot is to perceive its surroundings at the human conceptual level, it is necessary to incorporate semantic information into the metric map to form a semantic map. The semantic metric map can help robots act according to human rules, execute high-level tasks, and communicate with humans at the conceptual level.

In an earlier study, Mozos et al. [29] used the hidden Markov model to partition the metric map into different functional locations (rooms, corridors, and doorways). The work of Nieto-Granda et al. [30] deployed a mapping module based on the Rao–Blackwellized particle filtering technique on ROS [13] and used the Gaussian model to partition the map into marked semantic regions. Subsequently, the development of deep learning has greatly contributed to the advancement of object detection and semantic segmentation algorithms. Sünderhauf et al. [31] used SSD [32] to detect objects in each RGB keyframe and then assigned a 3-D point cloud to each object using an adaptive 3-D unsupervised segmentation method. This work is based on a data association mechanism of ICP-like matching scores to decide whether to create new objects in the semantic map or to associate them with existing objects. Zhang et al. [23] acquired semantic maps of the working scene through the YOLO object detection module and the localization module in an RGB-D SLAM system. In summary, many works only stop at using SLAM to help with semantic mapping and do not fully utilize the acquired semantic information to help tracking. DS-SLAM, a semantic mapping system proposed by Yu et al. [10], adopted semantic segmentation information to build semantic maps. However, DS-SLAM only attaches semantic labels to the metric map for visual display. The lack of position coordinates for the objects described in mathematical form limits the system's ability to perform advanced task planning.

III. SYSTEM OVERVIEW

In this section, we will introduce the technical details of the SG-SLAM system from five aspects. First, we introduce the framework and the basic flow of the system. Second, we give information about the object detecting thread. Then, the geometric principle of the epipolar constraint method for judging dynamic features is illustrated. Subsequently, the dynamic feature rejection strategy is proposed. Finally, we propose methods to acquire semantic objects and build semantic maps.

A. System Framework

The SG-SLAM proposed in this article is developed based on the ORB-SLAM2 system, which is a classical feature point-based visual SLAM system. ORB-SLAM2 consists of three main parallel threads: tracking, local mapping, and loop closing. According to evaluations on many popular public datasets, ORB-SLAM2 is one of the systems that achieve state-of-the-art accuracy. Therefore, SG-SLAM selects ORB-SLAM2 as the base framework to provide global localization and mapping functions.

As shown in Fig. 1, the SG-SLAM system adds two more parallel threads: the object detecting thread and the semantic mapping thread. The multithreading mechanism improves the system's operating efficiency. The purpose of adding an object detecting thread is to use the neural network to obtain 2-D semantic information. This 2-D semantic information then provides a priori dynamic object information for the dynamic feature rejection strategy. The semantic mapping thread integrates the 2-D semantic information and 3-D point cloud information from keyframes to generate a 3-D semantic object database. An intuitive semantic metric map is obtained by publishing the 3-D point cloud, 3-D semantic objects, and camera pose to the ROS system. Compared with the sparse feature point maps of ORB-SLAM2, the semantic metric maps can help mobile robots understand their surroundings and perform advanced tasks at a higher cognitive level.

When the SG-SLAM system is running, the image frames captured from the RGB-D camera are first fed together to the tracking thread and the object detecting thread. The object detecting thread starts to perform object recognition on the input RGB images. At the same time, the tracking thread starts to extract ORB feature points from the input frames. After the extraction is completed, the iterative Lucas–Kanade optical flow method with pyramids is used to match the sparse feature points between the current frame and previous frames. Then, the seven-point method based on RANSAC is used to compute the fundamental matrix between the two frames. This reduces the adverse effects due to incorrect data correlation in dynamic regions. Compared with feature extraction and fundamental matrix computation, the object detection task is more time-consuming. In other words, when the fundamental matrix has been computed, the tracking thread needs to wait for the result of the object detecting thread. Since the system adopts object detection rather than semantic segmentation, the blocking time is not too long [26]. This enhances the real-time performance of the system. Next, the tracking thread combines the epipolar constraint and the 2-D semantic information to reject the dynamic feature points. The camera pose is computed and published to ROS according to the remaining static feature points.

The new keyframes are fed into the local mapping thread and the loop closing thread for pose optimization, which is the same as in the original ORB-SLAM2 system. The difference is that the depth image of the new keyframe is used to generate a 3-D point cloud in the semantic mapping thread. Next, the 3-D point cloud is combined with the 2-D semantic information to generate a 3-D semantic object database. Semantic map construction suffers from problems such as high computational effort and redundant information between normal frames. Thus, processing only keyframe data here improves the efficiency of mapping. The reuse of 2-D semantic information also improves the real-time performance of the system. Finally, the 3-D point cloud and the 3-D semantic object data are published to the 3-D visualization tool Rviz for map display using the interface of the ROS system.

The adoption of object detection networks (rather than semantic segmentation), multithreading, keyframe-based mapping, and data reuse mechanisms overcomes the real-time performance shortcomings listed in Section II-A.

B. Object Detection

Due to the limitations in battery life, mobile robots generally choose ARM architecture processors with high performance per watt. NCNN is a high-performance neural network inference computing framework optimized for mobile platforms. Since NCNN is implemented in pure C++ with no third-party dependencies, it can be easily integrated into SLAM systems. Thus, we choose it as the base framework for the object detecting thread.

Many SLAM systems, such as [9], [10], [11], and [12], run slowly due to complex semantic segmentation networks or unreasonable system architectures. SLAM, as a fundamental component for state estimation of mobile robots, must have good real-time performance to ensure the smooth operation of upper-level tasks. To improve the object detection speed as much as possible, the single-shot multibox detector SSD is chosen as the detection head. In addition, we use MobileNetV3 [33] as a drop-in replacement for the backbone feature extractor in SSDLite. Finally, the network was trained using the PASCAL VOC 2007 dataset [34]. In reality, other detectors can be used flexibly depending on the hardware performance to achieve a balance between accuracy and speed.

C. Epipolar Constraints

SG-SLAM uses geometric information obtained from the epipolar constraint to determine whether feature points are dynamic or not. The judgment pipeline of the epipolar constraint is very straightforward. First, match the ORB feature points of two consecutive frames. Next, solve the fundamental matrix. Finally, calculate the distance between each feature point of the current frame and its corresponding epipolar line. The larger the distance, the more likely the feature point is dynamic.

To solve the fundamental matrix, it is necessary to have correct data associations between the feature points. However, the purpose of solving the fundamental matrix is precisely to judge whether the data associations are correct or not. This becomes a classic chicken-or-egg problem. ORB-SLAM2 takes the Bag-of-Words method to accelerate feature matching, and the continued use of this method cannot eliminate the adverse effect of outliers. Hence, to obtain a relatively accurate fundamental matrix, SG-SLAM uses the pyramidal iterative Lucas–Kanade optical flow method to calculate the matching point set of features. Inspired by Yu et al. [10], the matching point pairs located at the edges of images or with excessive differences in appearance are then removed to further reduce erroneous data associations. Then, the seven-point method based on RANSAC is used to calculate the fundamental matrix between the two frames. In general, the proportion of dynamic regions is relatively small compared to the whole image. Thus, the RANSAC algorithm can effectively reduce the adverse effects of wrong data associations in dynamic regions.

Fig. 2. Epipolar constraints.

According to the pinhole camera model, as shown in Fig. 2, the camera observes the same spatial point P from different angles. O1 and O2 denote the optical centers of the camera. P1 and P2 are the matching feature points to which the spatial point P maps in the previous frame and the current frame, respectively. The short dashed lines L1 and L2 are the epipolar lines in the two frames. The homogeneous coordinate forms of P1 and P2 are denoted as follows:

    P1 = [x1, y1, 1],  P2 = [x2, y2, 1]                     (1)

where x and y denote the coordinate values of the feature points in the image pixel coordinate system. Then, the epipolar line L2 in the current frame can be calculated from the fundamental matrix (denoted as F) as follows:

    L2 = [X, Y, Z]ᵀ = F P1 = F [x1, y1, 1]ᵀ                 (2)

where X, Y, and Z are the components of the line vector. According to [16], the epipolar constraint can be formulated as follows:

    P2ᵀ F P1 = P2ᵀ L2 = 0.                                  (3)

Next, the distance between the feature point Pi (i = 2, 4) and the corresponding epipolar line is defined as the offset distance, denoted by the symbol d. The offset distance can be described as follows:

    di = |Piᵀ F P1| / √(X² + Y²).                           (4)

If the point P is a static space point, then jointly with (3) and (4), the offset distance of the point P2 is

    d2 = |P2ᵀ F P1| / √(X² + Y²) = 0.                       (5)

Equation (5) demonstrates that, in the ideal case, the feature point P2 in the current frame falls exactly on the epipolar line L2. In reality, however, the offset distance is generally greater than zero but below an empirical threshold ε due to the influence of various types of noise.

Algorithm 1 Dynamic Feature Rejection Strategy
Input: Previous frame, F1; Current frame, F2; Previous frame's feature points, P1; Current frame's feature points, P2; Standard empirical threshold, εstd
Output: The set of static feature points in the current frame, S
 1: P1 = CalcOpticalFlowPyrLK(F2, F1, P2)
 2: Remove matched pairs that are located at the edges or have too much variation in appearance
 3: FundamentalMatrix = FindFundamentalMat(P2, P1, 7-point method based on RANSAC)
 4: for each matched pair p1, p2 in P1, P2 do
 5:   if DynamicObjectsExist && IsInDynamicRegion(p2) then
 6:     if CalcEpiLineDistance(p2, p1, FundamentalMatrix) × GetDynamicWeightValue(p2) < εstd then
 7:       Append p2 to S
 8:     end if
 9:   else
10:     if CalcEpiLineDistance(p2, p1, FundamentalMatrix) < εstd then
11:       Append p2 to S
12:     end if
13:   end if
14: end for

If the point P is not a static spatial point, as shown in Fig. 2, when the camera moves from the previous frame to the current frame, the point P also moves to P′. In this case, the point P1 is matched with the point P4 mapped from P′ to the current frame. If point P moves without degeneration [16], then in general, the offset distance of P4 is greater than the threshold ε.
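The offset-distance test of (1)–(5) and the weighted comparison used in Algorithm 1 can be sketched numerically as follows. This is an illustrative reconstruction, not the authors' implementation; the helper names are hypothetical, a real system would estimate F with a RANSAC-based seven-point solver as in line 3 of Algorithm 1, and the example fundamental matrix in the usage note assumes identity intrinsics and a pure sideways camera translation.

```python
import math

def offset_distance(F, p1, p2):
    """Offset distance d between p2 and the epipolar line L2 = F * P1, eqs. (2)-(4).

    F is a 3x3 fundamental matrix (nested lists); p1 and p2 are (x, y) pixel
    coordinates, promoted to homogeneous form [x, y, 1] as in eq. (1).
    """
    x1, y1 = p1
    x2, y2 = p2
    # Epipolar line L2 = [X, Y, Z]^T = F * P1, eq. (2).
    X, Y, Z = (sum(F[r][c] * v for c, v in enumerate((x1, y1, 1.0)))
               for r in range(3))
    # d = |P2^T * F * P1| / sqrt(X^2 + Y^2), eq. (4).
    return abs(x2 * X + y2 * Y + Z) / math.sqrt(X * X + Y * Y)

def is_static(F, p1, p2, eps_std, weight=1.0):
    """Weighted test from Algorithm 1: a point is kept as static if d * w < eps_std.

    weight is the a priori dynamic weight w (w = 1 outside any detected
    dynamic region); eps_std is the standard empirical threshold.
    """
    return offset_distance(F, p1, p2) * weight < eps_std
```

For a camera translating along the x-axis with identity intrinsics, F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]] gives horizontal epipolar lines, so a point that only slides sideways has d = 0, while any vertical displacement shows up directly as the offset distance.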
+In other words, the feature points can be judged as dynamic The ROS [13] is a set of software tool libraries that +or not by comparing the offset distance with the empirical help developers quickly build robot applications. Rviz is a +threshold ε. visualization tool in the ROS. In addition to the tracking thread + that publishes camera poses to the ROS, the semantic mapping +D. Dynamic Feature Rejection Strategy thread also publishes two kinds of data: 3-D point clouds and + 3-D semantic objects. These data are then processed by rviz + To avoid the shortcomings of relying heavily on deep to display an intuitive map interface. +learning for dynamic feature judgment, our algorithm relies +mainly on geometric information. The geometric information For efficiency, only keyframes are used to construct seman- +method judges whether a feature is dynamic by comparing the tic metric maps. When a new keyframe arrives, the semantic +offset distance d with an empirical threshold ε. However, the mapping thread immediately uses its depth image and pose to +threshold ε value is very difficult to set [12]: setting it too generate a 3-D ordered point cloud. The 3-D point cloud is +small will make many static feature points wrongly judged as subsequently published to the ROS, and a global Octo-map +dynamic points and setting it too large will miss many true is built incrementally by the Octomap_server package. The +dynamic feature points. This is because the purely geometric global Octo-map has the advantages of being updatable, +method cannot understand the scene at the semantic level and flexible, and compact, which can easily serve navigation +can only mechanically process all feature points using a fixed and obstacle avoidance tasks. However, the Octo-map lacks +threshold. semantic information, so it limits the capability of advanced + task planning between mobile robots and semantic objects. 
+ To solve the above problem, all objects that can be detected Hence, a map with semantic objects with their coordinates +by the object detecting thread are first classified as static is also necessary. The semantic mapping thread generates the +objects and dynamic objects based on a priori knowledge. Any 3-D semantic objects by combining 2-D semantic information +object with moving properties is defined as a dynamic object with 3-D point clouds, and the main process is described as +(e.g., a person or car); otherwise, it is a static object. Then, follows. +both weight values w are defined. The standard empirical +threshold εstd is set in a very straightforward way: just make The 2-D object bounding box is captured in the dynamic +sure that only obvious true dynamic feature points are rejected feature rejection algorithm stage. Fetch the 3-D point clouds in +when using it. The dynamic weight value w is an a priori in the bounding box region to calculate the 3-D semantic object +the range of 1–5, which is set according to the probability information. Yet, since the bounding box contains some noisy +of the object moving. For example, a human normally moves regions of nontarget objects, it cannot accurately segment the +with a high probability, and then, w = 5; a chair normally semantic object outline. To acquire relatively accurate position +does not move, and then, w = 2. and size information of the objects, the bounding box is + first reduced appropriately. Next, we calculate the average + +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on July 06,2023 at 09:28:10 UTC from IEEE Xplore. Restrictions apply. 
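The weighted epipolar test at the core of Algorithm 1 above can be sketched in pure Python. This is a minimal illustration, not the authors' code: the fundamental matrix, the dynamic-region test, and the per-object weights are assumed to be given, and all function names are illustrative (the paper's implementation uses OpenCV's optical flow and fundamental-matrix routines).

```python
import math

def epiline_distance(p1, p2, F):
    """Distance from current-frame point p2 to the epipolar line F @ [p1, 1]."""
    x1, y1 = p1
    # Epipolar line l = F @ [x1, y1, 1]^T in the current image.
    l = [F[i][0] * x1 + F[i][1] * y1 + F[i][2] for i in range(3)]
    x2, y2 = p2
    return abs(l[0] * x2 + l[1] * y2 + l[2]) / math.hypot(l[0], l[1])

def reject_dynamic_features(pairs, F, eps_std,
                            weight_of=lambda p: 1.0,
                            in_dynamic_region=lambda p: False):
    """Return the static subset S of current-frame points (Algorithm 1 sketch)."""
    static = []
    for p1, p2 in pairs:
        d = epiline_distance(p1, p2, F)
        # Inside a detected dynamic region the distance is scaled by an
        # a priori weight w, so points on likely movers are rejected sooner.
        if in_dynamic_region(p2):
            d *= weight_of(p2)
        if d < eps_std:
            static.append(p2)
    return static

# Toy fundamental matrix for a pure x-translation with identity intrinsics:
# epipolar lines are horizontal, so static points keep their y-coordinate.
F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]
pairs = [((10, 5), (14, 5)),    # static: moved along its epipolar line
         ((20, 8), (20, 11))]   # dynamic: 3 px off its epipolar line
print(reject_dynamic_features(pairs, F, eps_std=1.0))  # -> [(14, 5)]

# With a person-region prior (w = 5), even a subtle 0.3 px offset is rejected:
print(reject_dynamic_features([((0, 0), (5, 0.3))], F, 1.0,
                              weight_of=lambda p: 5.0,
                              in_dynamic_region=lambda p: True))  # -> []
```

The weighting reproduces the intuition in the text: a single threshold εstd is kept loose enough to spare static points, while semantic priors tighten it only where movers are likely.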
CHENG et al.: SG-SLAM: A REAL-TIME RGB-D VISUAL SLAM TOWARD DYNAMIC SCENES 7501012

TABLE I
RESULTS OF METRIC ROTATIONAL DRIFT (RPE)

TABLE II
RESULTS OF METRIC TRANSLATIONAL DRIFT (RPE)

TABLE III
RESULTS OF METRIC ABSOLUTE TRAJECTORY ERROR (ATE)

Next, we calculate the average depth of the points corresponding to the bounding box region. Then, the depth of each point in the original bounding box is compared with the average depth and is rejected if the difference is too large. Eventually, we filter the remaining point cloud and calculate the objects' sizes and spatial centroid coordinates.

The above operation is performed for each piece of 2-D semantic information (except dynamic objects, e.g., people and dogs) in the current keyframe to obtain the 3-D semantic object data. During the operation of the system, the 3-D semantic object database can be continuously merged or updated according to the object class, centroid, and size information. By publishing this database through the ROS interface, the semantic metric maps can be visualized.

IV. EXPERIMENTAL RESULTS

In this section, we experimentally evaluate and demonstrate the SG-SLAM system in four aspects. First, the tracking performance is evaluated on two public datasets. Second, we demonstrate the effectiveness of the dynamic feature rejection strategy and analyze the advantages of the fusion algorithm over the individual algorithms. Next, the system's real-time performance is evaluated. Finally, the visualization of the semantic objects and the global Octo-map is shown. The experiments were performed mainly on the NVIDIA Jetson AGX Xavier development kit with Ubuntu 18.04 as the system environment.

A. Performance Evaluation on TUM RGB-D Dataset

The TUM RGB-D dataset [35] is a large dataset provided by the Technical University of Munich Computer Vision Group to create a novel benchmark for visual odometry and SLAM systems. To evaluate the accuracy and robustness of the SG-SLAM system in dynamic scenes, the experiments mainly use five sequences under the dynamic objects category of the dataset. The first four are high dynamic scene sequences; the fifth, as a supplement, is a low dynamic scene sequence.

There are two main error evaluation metrics for the experiment. One is the absolute trajectory error (ATE), which directly measures the difference between the ground-truth trajectory and the estimated trajectory. The other is the relative pose error (RPE), which is mainly used to measure rotational drift and translational drift. To evaluate the improvement in performance relative to the original system, the experimental results of SG-SLAM were compared with ORB-SLAM2. The evaluation comparison results on the five dynamic scene sequences are shown in Tables I–III.

The experimental results in Tables I–III show that our system improves by more than 93% in most metrics on the high dynamic sequences compared to the ORB-SLAM2 system. Figs. 3 and 4 show the ATE and RPE results for the two systems on the five sequences with an RGB-D camera input. As shown in the figures, the accuracy of the estimation results of our system on the high dynamic scene sequences [Figs. 3(a)–(d) and 4(a)–(d)] is significantly higher than that of ORB-SLAM2. In the experiments with low dynamic

Fig. 3. ATE results of SG-SLAM and ORB-SLAM2 running five sequences. (a) fr3/walking_xyz. (b) fr3/walking_static. (c) fr3/walking_rpy. (d) fr3/walking_halfsphere. (e) fr3/sitting_static.

Fig. 4. RPE results of SG-SLAM and ORB-SLAM2 running five sequences. (a) fr3/walking_xyz. (b) fr3/walking_static. (c) fr3/walking_rpy. (d) fr3/walking_halfsphere.
(e) fr3/sitting_static.

scene sequences [Figs. 3(e) and 4(e)], the accuracy improvement is only 31.03% because the area and magnitude of dynamic object activity are small.

To further evaluate the effectiveness of the proposed algorithm, it is also compared with M-removal DVO [22], RDS-SLAM [26], ORB-SLAM3 [36], and other similar algorithms. The results are shown in Table IV. Although the DynaSLAM system, which uses pixel-level semantic segmentation, achieves a slight lead in individual sequence results, its real-time performance is weak (as shown in Table VII). The other methods have difficulty achieving the highest accuracy because of the shortcomings described in Section II. Overall, the experimental results show that SG-SLAM achieves a state-of-the-art level in terms of average accuracy improvement over all sequences.

B. Performance Evaluation on Bonn RGB-D Dataset

The Bonn RGB-D Dynamic Dataset [37], provided by Bonn University in 2019, contains 24 dynamic sequences for the evaluation of RGB-D SLAM. To validate the generalization performance of the dynamic feature rejection algorithm, we performed another experimental evaluation using this dataset.

The experiment mainly selected nine representative sequences from the dataset. Among them, the "crowd" sequences are scenes of three people walking randomly in a room. The "moving no box" sequences show a person moving a box from the floor to a desk. The "person tracking" sequences are scenes where the camera is tracking a walking person. The "synchronous" sequences present scenes of several people jumping together in the same direction over and over again. To evaluate the accuracy of our system, it is mainly compared with the original ORB-SLAM2 system and the current state-of-the-art YOLO-SLAM system.

The evaluation comparison results on the nine dynamic scene sequences are shown in Table V. Only in the two "synchronous" sequences does SG-SLAM not perform as well as YOLO-SLAM. The main reason is that the human jump direction in the scene is similar to the epipolar line direction, leading to different degrees of degeneration of the algorithm [16]. The results in Table V show that our algorithm outperforms the other algorithms in most sequences. This not only proves once again that the SG-SLAM system achieves state-of-the-art accuracy and robustness in dynamic scenes but also proves its generalizability.

TABLE IV
RESULTS OF METRIC ATE

Fig. 5. Dynamic feature rejection effect demonstration. The empirical threshold ε in (b) is 0.2 and in (c) is 1.0. (a) ORB-SLAM2. (b) and (c) SG-SLAM (G). (d) SG-SLAM (S). (e) SG-SLAM (S + G).

C. Effectiveness of Dynamic Feature Rejection Strategy

SG-SLAM combines geometric and semantic information to reject dynamic features, drawing on the advantages and avoiding the disadvantages of both methods. To validate the effectiveness of the fusion of geometric and semantic information, we designed comparative experiments. Fig. 5 shows the results of these methods for detecting dynamic points. First, SG-SLAM (S) denotes a semantic-information-only algorithm for rejecting dynamic feature points. Next, SG-SLAM (G) uses only the geometric algorithm based on the epipolar constraint. Finally, SG-SLAM (S + G) uses the fusion algorithm based on geometric and semantic information. The experimental results are shown in Table VI.

Fig. 5(a) shows the results of ORB-SLAM2 extracting feature points: essentially no dynamic regions are processed. Fig. 5(b) and (c) shows the results of using only the epipolar constraint method at different empirical thresholds. At the low threshold [see Fig. 5(b)], many static feature points are misdetected and rejected (e.g., feature points at the corners of the TV monitor); at the high threshold [see Fig. 5(c)], some dynamic feature points on walking people are missed. Next, Fig. 5(d) shows the results of feature point extraction using only the semantic information method: all feature points around the human body are brutally rejected. Finally, the results of the SG-SLAM system combining semantic and geometric information are shown in Fig. 5(e). SG-SLAM rejects all feature points on the human body and retains as many static feature points outside the human body as possible, and its rejection effect is better than that of the first two algorithms. The two single-information algorithms are each superior in some sequences and inferior in others, whereas the algorithm combining both pieces of information shows the most accurate results in all sequences. From Table VI, the experimental data of each algorithm match the intuitive rejection effect in Fig. 5. This proves the effectiveness of fusing geometric and semantic information.

D. Timing Analysis

As a basic component of robot state estimation, the speed of SLAM directly affects the smooth execution of higher-level tasks. Thus, we tested the average time cost of processing each frame when the system is running and compared it with other systems.

The timing results and hardware platforms are shown in Table VII. Since systems such as DS-SLAM, DynaSLAM, and YOLACT-based SLAM use pixel-level semantic segmentation networks, their average time cost per frame is expensive. YOLO-SLAM uses the end-to-end YOLO fast object detection algorithm, but it is very slow due to limitations such as system architecture optimization and hardware performance. The SG-SLAM system significantly increases frame processing speed by using multithreading, the SSD object detection algorithm, and data multiplexing mechanisms. Compared to ORB-SLAM2, our work increases the average processing time per frame by less than 10 ms, which can meet the real-time performance requirements of mobile robots.

E. Semantic Mapping

To show the actual semantic mapping effect, the SG-SLAM system conducts mapping experiments on the TUM RGB-D dataset and the OpenLORIS-Scene dataset [38]. OpenLORIS-Scene is a dataset recorded by robots in real scenes, using a motion capture system to obtain real trajectories. This dataset is intended to help evaluate the maturity of SLAM and scene understanding algorithms in real deployments.

TABLE V
RESULTS OF METRIC ATE

TABLE VI
RESULTS OF METRIC ATE

TABLE VII
TIME ANALYSIS

Fig. 6. Semantic object map for the fr3_walking_xyz sequence.

Fig. 7. (a) Semantic object map and (b) global Octo-map for the cafe1-2 sequence of the OpenLORIS-Scene dataset.

Fig. 6 shows the semantic object mapping effect of SG-SLAM in the fr3_walking_xyz sequence of the TUM RGB-D dataset. Fig. 7(a) and (b) shows the semantic object map and the global Octo-map built in the cafe1-2 sequence of the OpenLORIS-Scene dataset, respectively. The coordinates of the objects shown in the map are transformed from the origin point where the SLAM system is running.
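The continuous merge-or-update of the 3-D semantic object database described earlier (matching by object class, centroid, and size) can be sketched as follows. The merge rule here (average nearby centroids of the same class, keep the larger extent) is an illustrative assumption, not the paper's exact policy:

```python
def update_object_db(db, obj, merge_dist=0.5):
    """Merge a newly observed 3-D semantic object into the database, or append it.

    db:  list of dicts {"cls": str, "centroid": (x, y, z), "size": (w, h, d)}
    obj: new observation in the same format (field names are illustrative).
    """
    def dist2(a, b):
        return sum((i - j) ** 2 for i, j in zip(a, b))

    for entry in db:
        # Same class seen near a known object: fuse the two estimates.
        if entry["cls"] == obj["cls"] and \
           dist2(entry["centroid"], obj["centroid"]) < merge_dist ** 2:
            entry["centroid"] = tuple((i + j) / 2
                                      for i, j in zip(entry["centroid"], obj["centroid"]))
            entry["size"] = tuple(max(i, j)
                                  for i, j in zip(entry["size"], obj["size"]))
            return db
    db.append(obj)  # previously unseen object
    return db

db = []
update_object_db(db, {"cls": "monitor", "centroid": (1.0, 0.0, 2.0), "size": (0.5, 0.3, 0.1)})
update_object_db(db, {"cls": "monitor", "centroid": (1.2, 0.0, 2.0), "size": (0.5, 0.3, 0.1)})
update_object_db(db, {"cls": "chair",   "centroid": (3.0, 0.0, 1.0), "size": (0.6, 0.9, 0.6)})
print(len(db), db[0]["centroid"])  # -> 2 (1.1, 0.0, 2.0)
```

Publishing such a database over a ROS topic is what lets rviz render the object map alongside the Octo-map.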
The semantic metric map and the global Octo-map not only enable mobile robots to navigate and avoid obstacles but also enable them to understand scenes at a higher level and perform advanced tasks.

V. CONCLUSION

This article presents SG-SLAM, a real-time semantic visual SLAM toward dynamic scenes with an RGB-D camera input. SG-SLAM adds two new threads on top of ORB-SLAM2: the object detecting thread and the semantic mapping thread. The system significantly improves real-time performance, accuracy, and robustness in dynamic scenes with the dynamic feature rejection algorithm. The semantic mapping thread reuses the 2-D semantic information to build the semantic object map with object coordinates and the global Octo-map. Experiments prove that improved traditional algorithms can achieve superior performance when deep learning is introduced and coupled with proper engineering implementations.

There are still some disadvantages of the system that need to be addressed in the future: for example, the degeneration problem of dynamic objects moving along the epipolar line direction, which can cause the dynamic feature rejection algorithm to fail; improving the precision of the semantic metric map; quantitative experimental analysis; and so on.

REFERENCES

[1] H. Durrant-Whyte and T. Bailey, "Simultaneous localization and mapping: Part I," IEEE Robot. Autom. Mag., vol. 13, no. 2, pp. 99–110, Jun. 2006.
[2] S. Shirmohammadi and A. Ferrero, "Camera as the instrument: The rising trend of vision based measurement," IEEE Instrum. Meas. Mag., vol. 17, no. 3, pp. 41–47, Jun. 2014.
[3] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Proc. 6th IEEE ACM Int. Symp. Mixed Augmented Reality, Nov. 2007, pp. 225–234.
[4] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, Oct. 2017.
[5] C. Kerl, J. Sturm, and D. Cremers, "Dense visual SLAM for RGB-D cameras," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Nov. 2013, pp. 2100–2106.
[6] A. Rosinol, M. Abate, Y. Chang, and L. Carlone, "Kimera: An open-source library for real-time metric-semantic localization and mapping," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020, pp. 1689–1696.
[7] C. Cadena et al., "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Trans. Robot., vol. 32, no. 6, pp. 1309–1332, Dec. 2016.
[8] I. Kostavelis and A. Gasteratos, "Semantic mapping for mobile robotics tasks: A survey," Robot. Auton. Syst., vol. 66, pp. 86–103, Apr. 2015.
[9] B. Bescos, J. M. Fácil, J. Civera, and J. L. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 4076–4083, Oct. 2018.
[10] C. Yu et al., "DS-SLAM: A semantic visual SLAM towards dynamic environments," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 1168–1174.
[11] W. Wu, L. Guo, H. Gao, Z. You, Y. Liu, and Z. Chen, "YOLO-SLAM: A semantic SLAM system towards dynamic environment with geometric constraint," Neural Comput. Appl., vol. 34, pp. 1–16, Apr. 2022.
[12] J. Chang, N. Dong, and D. Li, "A real-time dynamic object segmentation framework for SLAM system in dynamic scenes," IEEE Trans. Instrum. Meas., vol. 70, pp. 1–9, 2021.
[13] M. Quigley et al., "ROS: An open-source robot operating system," in Proc. ICRA Workshop Open Source Softw., Kobe, Japan, 2009, vol. 3, no. 3, p. 5.
[14] Tencent. (2017). NCNN. [Online]. Available: https://github.com/Tencent/ncnn
[15] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, "OctoMap: An efficient probabilistic 3D mapping framework based on octrees," Auton. Robots, vol. 34, no. 3, pp. 189–206, 2013.
[16] A. Kundu, K. M. Krishna, and J. Sivaswamy, "Moving object detection by multi-view geometric techniques from a single camera mounted robot," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2009, pp. 4306–4312.
[17] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge, U.K.: Cambridge Univ. Press, 2003.
[18] M. A. Fischler and R. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
[19] M. Piaggio, R. Fornaro, A. Piombo, L. Sanna, and R. Zaccaria, "An optical-flow person following behaviour," in Proc. IEEE Int. Symp. Intell. Control (ISIC), IEEE Int. Symp. Comput. Intell. Robot. Autom. (CIRA), Intell. Syst. Semiotics (ISAS), 1998, pp. 301–306.
[20] D. Nguyen, C. Hughes, and J. Horgan, "Optical flow-based moving-static separation in driving assistance systems," in Proc. IEEE 18th Int. Conf. Intell. Transp. Syst., Sep. 2015, pp. 1644–1651.
[21] T. Zhang, H. Zhang, Y. Li, Y. Nakamura, and L. Zhang, "FlowFusion: Dynamic dense RGB-D SLAM based on optical flow," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020, pp. 7322–7328.
[22] Y. Sun, M. Liu, and M. Q.-H. Meng, "Motion removal for reliable RGB-D SLAM in dynamic environments," Robot. Auton. Syst., vol. 108, pp. 115–128, Oct. 2018.
[23] L. Zhang, L. Wei, P. Shen, W. Wei, G. Zhu, and J. Song, "Semantic SLAM based on object detection and improved octomap," IEEE Access, vol. 6, pp. 75545–75559, 2018.
[24] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7263–7271.
[25] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou, "Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment," Robot. Auton. Syst., vol. 117, pp. 1–16, Jul. 2019.
[26] Y. Liu and J. Miura, "RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods," IEEE Access, vol. 9, pp. 23772–23785, 2021.
[27] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. ICCV, Jun. 2017, pp. 2961–2969.
[28] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder–decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Jan. 2017.
[29] Ó. M. Mozos, R. Triebel, P. Jensfelt, A. Rottmann, and W. Burgard, "Supervised semantic labeling of places using information extracted from sensor data," Robot. Auton. Syst., vol. 55, no. 5, pp. 391–402, May 2007.
[30] C. Nieto-Granda, J. G. Rogers, A. J. B. Trevor, and H. I. Christensen, "Semantic map partitioning in indoor environments using regional analysis," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2010, pp. 1451–1456.
[31] N. Sunderhauf, T. T. Pham, Y. Latif, M. Milford, and I. Reid, "Meaningful maps with object-oriented semantic mapping," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Sep. 2017, pp. 5079–5085.
[32] W. Liu et al., "SSD: Single shot MultiBox detector," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 21–37.
[33] A. Howard et al., "Searching for MobileNetV3," in Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2019, pp. 1314–1324.
[34] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes challenge 2007 results," 2008. [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
[35] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2012, pp. 573–580.
[36] C. Campos, R. Elvira, J. J. G. Rodriguez, J. M. M. Montiel, and J. D. Tardos, "ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM," IEEE Trans. Robot., vol. 37, no. 6, pp. 1874–1890, Dec. 2021.
[37] E. Palazzolo, J. Behley, P. Lottes, P. Giguere, and C. Stachniss, "ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Nov. 2019, pp. 7855–7862.
[38] X. Shi et al., "Are we ready for service robots? The OpenLORIS-scene datasets for lifelong SLAM," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020, pp. 3139–3145.

Shuhong Cheng was born in Daqing, Heilongjiang, China, in 1978. She received the B.S., M.S., and Ph.D. degrees from Yanshan University, Qinhuangdao, China, in 2001, 2007, and 2012, respectively. She studied as a Visiting Scholar at the University of Reading, Reading, U.K., in 2014. After her Ph.D. degree, she has been working as a Professor at Yanshan University since 2019. She has published about 50 papers in journals and international conferences and holds eight computer software copyrights. She has been granted more than four Chinese invention patents. Since 2012, she has presided over and undertaken more than ten national projects. Her current research interests are in rehabilitation robots, assistive robots for the disabled and the elderly, and computer vision.

Shijun Zhang (Student Member, IEEE) was born in Lianyungang, China, in 1993. He received the bachelor's and master's degrees in control engineering from Yanshan University, Qinhuangdao, China, in 2016 and 2019, respectively, where he is currently pursuing the Ph.D. degree in mechanical engineering. His main research directions include mobile robot control and perception, computer vision, and deep learning.

Changhe Sun was born in Tangshan, China, in 1996. He received the bachelor's degree in communication engineering from the Chongqing University of Technology, Chongqing, China, in 2019. He is currently pursuing the master's degree with the School of Electrical Engineering, Yanshan University, Qinhuangdao, China. His main research interests include simultaneous localization and mapping (SLAM), computer vision, and robotics.

Dianfan Zhang was born in Jilin, China, in 1978. He received the bachelor's and master's degrees in control engineering and the Ph.D. degree from Yanshan University, Qinhuangdao, China, in 2001, 2006, and 2010, respectively. His main research directions include mobile robot control and signal processing.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2022年/The_STDyn-SLAM_A_Stereo_Vision_and_Semantic_Segmentation_Approach_for_VSLAM_in_Dynamic_Outdoor_Environments.pdf b/动态slam/2020年-2022年开源动态SLAM/2022年/The_STDyn-SLAM_A_Stereo_Vision_and_Semantic_Segmentation_Approach_for_VSLAM_in_Dynamic_Outdoor_Environments.pdf
new file mode 100644
index 0000000..d43aefc
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2022年/The_STDyn-SLAM_A_Stereo_Vision_and_Semantic_Segmentation_Approach_for_VSLAM_in_Dynamic_Outdoor_Environments.pdf
@@ -0,0 +1,520 @@
Received January 10, 2022, accepted January 27, 2022, date of publication February 7, 2022, date of current version February 18, 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3149885

The STDyn-SLAM: A Stereo Vision and Semantic Segmentation Approach for VSLAM in Dynamic Outdoor Environments

DANIELA ESPARZA AND GERARDO FLORES, (Member, IEEE)
Laboratorio de Percepción y Robótica (LAPyR), Centro de Investigaciones en Óptica (CIO), León, Guanajuato 37150, Mexico
Corresponding author: Gerardo Flores (gflores@cio.mx)
This work was supported in part by the Consejo Nacional de Ciencia y Tecnología (CONACYT), Fondo Institucional de Fomento Regional para el Desarrollo Científico, Tecnológico y de Innovación (FORDECYT) under Grant 292399.

ABSTRACT Visual Simultaneous Localization and Mapping (VSLAM) is a system based on the scene's features to estimate a map and the system pose. Commonly, VSLAM algorithms are focused on a static environment; however, dynamic objects are present in the vast majority of real-world applications. This work presents a feature-based SLAM system focused on dynamic environments that uses convolutional neural networks, optical flow, and depth maps to detect objects in the scene. The proposed system employs a stereo camera as the primary sensor to capture the scene. The neural network is responsible for object detection and segmentation to avoid erroneous maps and wrong system locations. Moreover, the proposed system's processing time is fast, and it can run in real time in outdoor and indoor environments. The proposed approach has been compared with the state-of-the-art; besides, we present several outdoor experimental results that corroborate the approach's effectiveness. Our code is available online.

INDEX TERMS VSLAM, dynamic environment, stereo vision, neural network.

I. INTRODUCTION

Simultaneous Localization and Mapping (SLAM) systems are strategic for developing upcoming navigation techniques. This is mainly due to their fundamental utility in solving autonomous exploration tasks in unknown environments such as mines, highways, farmlands, underwater/aerial environments, and, in broad terms, indoor and outdoor scenes. The problem of SLAM for indoor environments has been investigated for years, where usually RGB-D cameras or Lidars are the primary sensors to capture scenes [1]–[3]. Indoors, dynamic objects are usually more controllable, unlike outdoors, where dynamic objects are inherent to the scene.

On the other hand, the vast majority of SLAM systems are focused on the assumption of static environments, such as HECTOR-SLAM [4], Kintinuous [5], MonoSLAM [6], PTAM [7], SVO [8], and LSD-SLAM [9], among others. Since this assumption is strong, such systems are restricted to work in static environments. However, in dynamic environments, moving objects can generate an erroneous map and wrong poses, because dynamic features cause a bad pose estimation and incorrect data. For this reason, new approaches have arisen for solving the dynamic environment problem, such as NeuroSLAM [10], hierarchical Outdoor SLAM [11], and Large-Scale Outdoor SLAM [12].

In this work, we propose a method called STDyn-SLAM for solving the VSLAM problem in dynamic outdoor environments using stereo vision [19]. Fig. 1 depicts a sketch of our proposal in real experiments. The first row shows the input images, where a potentially dynamic object is present in the scene and is detected by a semantic segmentation neural network. Fig. 1d depicts the 3D reconstruction excluding dynamic objects. To evaluate our system, we carried out experiments in different outdoor scenes, and we qualitatively compared the 3D reconstructions taking into account the exclusion of dynamic objects. We conducted experiments using sequences from the KITTI dataset, and they are compared with state-of-the-art systems. Furthermore, our approach is implemented in ROS, in which we use the depth image from a stereo camera to make the 3D reconstruction using the octomap. Also, we analyzed the processing time using different datasets. Further, we publish our code on GitHub.1 Also, a video is available on YouTube. The main contributions are itemized as follows:

• We propose a stereo SLAM for dynamic environments using a semantic segmentation neural network and geometrical constraints to eliminate the dynamic objects.
• We use the depth image from a stereo camera to make the 3D reconstruction using the octomap. The depth image is not necessary for the SLAM process.
• This work was tested using the KITTI and EuRoC MAV datasets, and we compared our system with the stereo-configuration systems from the state-of-the-art. In addition, we obtained results from outdoor and indoor environments of our own sequences.
• Some results are shown in a YouTube video, and the STDyn-SLAM is available as a GitHub repo.

The rest of the paper is structured as follows. Section II mentions the related work on SLAM in dynamic environments. Then, in Section III, we show the main results and the STDyn-SLAM algorithm. Section IV presents the real-time experiments of STDyn-SLAM in outdoor environments with moving objects; we compare our approach with state-of-the-art methods using the KITTI dataset. Finally, the conclusions and future work are given in Section V.

The associate editor coordinating the review of this manuscript and approving it for publication was Sudipta Roy.

1 https://github.com/DanielaEsparza/STDyn-SLAM

VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

D. Esparza, G. Flores: STDyn-SLAM: Stereo Vision and Semantic Segmentation Approach for VSLAM

TABLE 1. This table shows the state-of-the-art SLAM problem considering dynamic environments.

FIGURE 1. The STDyn-SLAM results in scenes with moving objects. First row: input images with two dynamic objects. Second row: 3D reconstruction performed by the STDyn-SLAM discarding moving objects.

II. RELATED WORK

A. CLASSIC APPROACHES

The classical methods do not consider artificial intelligence. Some of these approaches are based on optical flow, epipolar geometry, or a combination of the two. For example, in [20], Yang et al. propose a SLAM system using an RGB-D camera and two encoders for estimating the pose and building an OctoMap. The dynamic pixels are removed using an object detector and K-means to segment the point cloud. On the other hand, in [21], Gimenez et al. present a CP-SLAM based on continuous probabilistic mapping and a Markov random field; they use the iterated conditional modes. Wang et al. [22] propose a SLAM system for indoor environments based on an RGB-D camera.
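As an aside from the survey: like the RGB-D pipelines above, STDyn-SLAM turns depth images into 3-D points before feeding the octomap. A minimal pinhole back-projection sketch follows; the toy intrinsics and function name are illustrative, not the authors' implementation:

```python
def backproject(depth, fx, fy, cx, cy, scale=1.0):
    """Back-project a row-major depth image into a list of (X, Y, Z) points.

    fx, fy, cx, cy are the pinhole intrinsics; `scale` converts stored depth
    values to metres (e.g. 1000.0 for millimetre depth maps).
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:          # invalid / missing depth
                continue
            z = z / scale
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

# 2x2 toy depth image with unit intrinsics centred at the image origin:
pts = backproject([[1.0, 2.0], [0.0, 4.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0)
print(pts)  # -> [(0.0, 0.0, 1.0), (2.0, 0.0, 2.0), (4.0, 4.0, 4.0)]
```

Each back-projected point, expressed in the camera frame, is then transformed by the estimated pose before being inserted into the occupancy map.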
They use the number of features on the static scene and assume that the parallax between consecutive images is a movement constraint. In [23], Cheng, Sun, and Meng implement an optical-flow and five-point-algorithm approach to obtain dynamic features. In [24], Ma and Jia proposed a visual SLAM for dynamic environments, detecting the moving objects in the scene using optical flow. Furthermore, they use the RANSAC algorithm to improve the computation of the homography matrix. In [25], Sun et al. proposed an RGB-D system for detecting moving objects based on ego-motion, compensating for the camera movement and then obtaining the frame difference; the frame difference helps to detect the moving object. After that, Sun et al. proposed in [26] an RGB-D system for motion removal based on a foreground model. This system does not require prior information.

B. ARTIFICIAL-INTELLIGENCE-BASED APPROACHES

Thanks to the growing use of deep learning, researchers have proposed several SLAM systems using artificial-intelligence-based approaches. Table 1 summarizes the state-of-the-art in this regard. Some works, such as Dosovitskiy et al. [27], Ilg et al. [28], and Mayer et al. [29], used optical flow and supervised learning for detecting and segmenting moving objects.

In [30], Xu et al. proposed an instance segmentation of the objects in the scene based on the COCO dataset [31]. The geometric and motion properties are detected and used to improve the mask boundaries. Also, they tracked the visible and moving objects and estimated the system's pose. Several works are based on RGB-D cameras, such as [15], [17], and [18]. Cui and Ma [15] proposed SOF-SLAM, an RGB-D system based on ORB-SLAM2, which combines a neural network for semantic segmentation with optical flow for removing dynamic features. Zhao et al. [17] proposed an RGB-D framework for dynamic scenes, where they combined Mask R-CNN, edge refinement, and optical flow to detect the probably dynamic objects. Henein et al. [18] proposed a system based on an RGB-D camera and proprioceptive sensors for tackling the SLAM problem. They employ a factor-graph model and an instance-level object segmentation algorithm for the classification of objects and the tracking of features. The proprioceptive sensors are used to estimate the camera pose. Also, some works use a monocular camera, for instance, the DSOD-SLAM presented in [16]. Ma et al. employ a semantic segmentation network, a depth prediction network, and geometry properties to improve the results in dynamic environments. Our work is built on the well-known ORB-SLAM2 [32], taking some ideas from the DS-SLAM system [33]. In DS-SLAM, the authors used stored images from an RGB-D camera for solving the SLAM problem in indoor dynamic environments. Nevertheless, the depth map obtained from an RGB-D camera is hard to use in external environments. In [34], Cheng et al. proposed a SLAM system for building a semantic map in dynamic environments using a CRF-RNN for segmenting objects. Bescos et al. in [14] proposed a system for object detection using Mask R-CNN, and their method inpaints the background using the information from previous images. An update of [14] is [35], where Bescos et al. proposed a visual SLAM based on the trajectories of the objects and a bundle adjustment.

FIGURE 2. A block diagram showing the algorithm steps of the STDyn-SLAM.

III. METHODS

In this section, we present and describe the framework of the STDyn-SLAM with all the parts that compose it. A block diagram describing the framework's pipeline is depicted in Fig. 2, where the inputs at the time instant t are the stereo pair, the depth image, and the left image captured at t − 1 (aka the previous left image). The process starts with extracting ORB features in the stereo pair and the past left image. Then, it follows the optical flow and epipolar geometry image processing. Next, the neural network segments potentially natural dynamic objects among all the objects in the scene. It is here where the NN depicted in Fig. 2 is introduced. In the NN block of that figure, a semantic segmentation neural network is shown, with the left image as input and a segmented image with the object of interest as output. This NN is a pixel-wise classification and segmentation framework.
The STDyn-SLAM implements a particular NN +moving objects parallelly in the current left image. To remove of this kind called SegNet [37], which is an encoder-decoder +outliers (features inside dynamic objects) and estimate the network based on the VGG-16 model [38]. The encoder +visual odometry, it is necessary to computation the semantic of this NN architecture counts with thirteen convolutional +information and the movement checking process. Finally, the layers with batch normalization, a ReLU non-linearity +3D reconstruction is computed from the segmented image, divided into five encoders, and five non-overlapping max- +visual odometry, the current left frame, and the depth image. pooling and sub-sampling layers located at the end of each +These processes are explained in detail in the following encoder. Since each encoder is connected to a corresponding +subsections. decoder, the decoder architecture has the same number + of layers as encoder architecture, and every decoder has +A. STEREO PROCESS an upsampling layer at first. The last layer is a softmax +Motivated by the vast applications of robotics outdoors, classifier. SegNet classifies the pixel-wise using a model +where dynamic objects are presented, we proposed that based on the PASCAL VOC dataset [39], which consists +our STDyn-SLAM system be focused on stereo vision. of twenty classes. The pixel-wise can be classified into +A considerable advantage of this is that the depth estimation one of the following classes: airplane, bicycle, bird, boat, +from a stereo camera is directly given as a distance measure. bottle, bus, car, cat, chair, cow, dining table, dog, horse, +The process described in this part is depicted in Fig. 2, motorbike, person, potted plant, sheep, sofa, train and +where three main tasks are developed: feature extraction, TV/monitor. +optical flow, and epipolar geometry. Let’s begin with the +former. 
Notwithstanding those above, not all feature points in the + left frame are matched in the right frame. For that reason and + The first step of the stereo process is acquiring the left, to save computing resources, the SegNet classifies the objects +right, and depth frames from a stereo camera. Then, a local of interest only on the left input image. +feature detector is applied in the stereo pair and the previous +left image. As a feature detector, we use the Oriented fast 1) OUTLIERS REMOVAL +and Rotated Brief (ORB) feature detector, which throws the +well-known ORB features [36]. Once the ORB features are Once all the previous steps have been accomplished, a thresh- +found, optical flow and a process using epipolar geometry are +conducted. old is selected to determine the features as inlier or outlier. + + To avoid dynamic objects not classified by the neural Fig. 3 depicts the three cases of a mapped feature. Let x1, x2, +network (explained in the following subsection), the STDyn- and x3 denote the ORB features from the previous left image; +SLAM computes optical flow using the previous and current x1, x2, and x3 are the corresponding features from the current +left frames. This step employs a Harris detector to compute left image; X and X represent the homogeneous coordinates +the optical flow. Remember, these features are different from +the ORB ones. The Harris points pair is discarded if at least of x and x , respectively; F is the fundamental matrix; and +one of the points is on the edge corner or close to it. + l1 = FX1, l2 = FX2, and l3 = FX3 are the epipolar lines. + From the fundamental matrix, ORB features, and optical The first and second cases correspond to inliers, x1 is over +flow, we compute the epipolar lines. Thus, we can map l1, and the distance from x2 to l2 is less than the threshold. +the matched features from the current left frame into the The third case is an outlier because the distance from x3 +previous left frame. 
The distance from the corresponding to l3 is greater than the threshold. To compute the distance +epipolar line to the mapped feature into the past left image between the point x and the epipolar line, l , we proceed as +determines an inlier or outlier. Please refer to the remove +outliers section in Fig. 2. Notice that the orb features of the follows, +car in the left image were removed, but the points on the +right frame remain unchanged. This is because removing d(X , l ) = X T FX (1) +the points in the right images adds computational cost and is +unnecessary. (FX )21 + (FX )22 + +B. ARTIFICIAL NEURAL NETWORK’s ARCHITECTURE where the subindex from (FX )1 and (FX )2 denotes the +The approach we use is eliminating the ORB features on element of the epipolar line. If the distance is larger than +dynamic objects. To address this, we need to discern the the threshold, the feature point is considered an outlier, i.e., + a dynamic feature. +18204 + Remember that the SegNet, described before, semantically + segments the left image in object classes. The semantic + segmentation enhances the rejection of ORB features on + the possible dynamic objects. The ORB features inside + + VOLUME 10, 2022 + D. Esparza, G. Flores: STDyn-SLAM: Stereo Vision and Semantic Segmentation Approach for VSLAM + +FIGURE 3. The cases of inliers and outliers. Green: the x1 and x2 are +inliers; the distance from the point to their corresponding epipolar line l +is less than a threshold. Red: x3 is an outlier, since the distance is greater +than the threshold. + + FIGURE 5. The STDyn-SLAM when a static object becomes dynamic. + Images a) and b) corresponds to the left images from a sequence. Image + c) is the 3D reconstruction of the environment; in red dots is the + trajectory. The OctoMap node fills empty areas along the sequence of + images. + +FIGURE 4. Diagram of the ROS nodes of the STDyn-SLAM required to +generate the trajectory and 3D reconstruction. 
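The point-to-epipolar-line test of Eq. (1) can be sketched as follows. This is an illustrative NumPy version, not the authors' ROS/C++ implementation, and the 1-pixel default threshold is an assumed value:

```python
import numpy as np

def epipolar_distance(x_prev, x_curr, F):
    """Distance from the current-frame feature x' to the epipolar line
    l = F X induced by the previous-frame feature x (Eq. (1)).
    Both features are given as (u, v) pixel coordinates."""
    X = np.array([x_prev[0], x_prev[1], 1.0])   # homogeneous coords X
    Xp = np.array([x_curr[0], x_curr[1], 1.0])  # homogeneous coords X'
    l = F @ X                                   # epipolar line in the current frame
    # |X'^T F X| normalized by the first two components of the line
    return abs(Xp @ l) / np.sqrt(l[0] ** 2 + l[1] ** 2)

def is_dynamic(x_prev, x_curr, F, threshold=1.0):
    """Flag the feature pair as a dynamic outlier when the distance
    exceeds the threshold (the 1-pixel default is an assumption)."""
    return epipolar_distance(x_prev, x_curr, F) > threshold
```

For example, with the fundamental matrix of a purely horizontal camera translation, the epipolar lines are the image rows, so a feature that only slides along its row passes the test, while one that drifts vertically between frames is flagged as dynamic.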
FIGURE 4. Diagram of the ROS nodes of the STDyn-SLAM required to generate the trajectory and the 3D reconstruction. The circles represent each process's ROS node, and the arrows are the ROS topics published by the ROS nodes; the continuous arrows depict the final ROS topics.

The ORB features inside segmented objects, and thus possible moving objects, are rejected. The remaining points are matched with the ORB features from the right image.

C. VISUAL ODOMETRY
Because the system is based on ORB-SLAM2, the VSLAM computes the odometry visually. The next step needs the ORB features to estimate the depth for each feature pair. The features are classified as mono and stereo and are necessary to track the camera's pose. Again, this step is merely a process inherited from ORB-SLAM2.

D. 3D RECONSTRUCTION
Finally, the STDyn-SLAM builds a 3D reconstruction from the left, segmented, and depth images using the visual odometry. First, the 3D reconstruction process checks each pixel of the segmented image to reject the points corresponding to the classes of objects selected as dynamic in Section III-B. Then, if the pixel is not considered part of a dynamic object, the equivalent pixel from the depth image is added to the point cloud, and the color assigned to the point is obtained from the left frame. This stage builds a local point cloud only at the current pose of the system; then, the octomap [40] joins and updates the local point clouds into a full point cloud.

FIGURE 6. The 3D reconstruction from STDyn-SLAM in an indoor environment. A moving person appears in the scene, crossing from left to right. The VSLAM system considers the person a dynamic object.

Remark 1: It is essential to mention that we apply the semantic segmentation, optical flow, and geometric constraints only to the left image to avoid increasing the execution time. Moreover, segmenting the right-hand-side frame is unnecessary because the feature selection rejects the ORB features inside dynamic objects from the left image, so the corresponding points from the right frame will not be matched.
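The per-pixel rejection in the 3D reconstruction step can be sketched as a back-projection loop. This is a minimal NumPy illustration under assumed conventions (a pinhole intrinsic matrix K, a depth image in meters, an integer label image from the segmentation, and a hypothetical set of dynamic class IDs); it is not the actual octomap-based implementation:

```python
import numpy as np

def build_local_cloud(depth, labels, color, K, dynamic_classes):
    """Back-project every pixel that is neither invalid nor labeled
    as a dynamic class; the point color is taken from the left frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    points, colors = [], []
    h, w = depth.shape
    for v in range(h):
        for u in range(w):
            z = depth[v, u]
            if z <= 0.0 or labels[v, u] in dynamic_classes:
                continue  # skip invalid depth and dynamic-object pixels
            x = (u - cx) * z / fx  # pinhole back-projection
            y = (v - cy) * z / fy
            points.append((x, y, z))
            colors.append(tuple(color[v, u]))
    return np.array(points), np.array(colors)
```

Each local cloud built this way corresponds to one camera pose; in the paper's pipeline, the octomap node then fuses the successive local clouds into the full point cloud.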
FIGURE 7. The 3D reconstruction with the presence of static objects (two parked cars) and dynamic objects (a person and two dogs). Notice that the person and dogs are not visualized in the scene due to the effect of the STDyn-SLAM. Fig. a) depicts the static objects. Nevertheless, the vehicles are potentially dynamic objects; thus, in Fig. b), the STDyn-SLAM excludes their bodies, considering their possible movement.

IV. EXPERIMENTS
This section tests our algorithm STDyn-SLAM in real-time scenes and on the KITTI datasets. Our system was compared with other state-of-the-art systems to evaluate the 3D reconstruction and the odometry. The results of the 3D map were measured qualitatively because of the nature of the experiment. We employ the Absolute Pose Error (APE) metric for the odometry.

A. HARDWARE AND SOFTWARE SETUP
We tested our system on an Intel Core i7-7820HK laptop computer with 32 GB of RAM and a GeForce GTX 1070 GPU. Moreover, we used as input a ZED camera, a stereo camera developed by Stereolabs. We selected the HD720 resolution. The ZED camera resolutions are WVGA (672 × 376), HD720 (1280 × 720), HD1080 (1920 × 1080), and 2.2K (2208 × 1242).

The STDyn-SLAM is developed natively on ROS. Our system's main inputs are the left and right images, while the depth map is needed to build the point cloud. However, if the depth map is not available, it is possible to execute the STDyn-SLAM with only the stereo images and then obtain the trajectory. The STDyn node in ROS generates two main topics: Odom and ORB_SLAM2_PointMap_SegNetM/Point_Clouds. The point cloud topic is the input of the octomap_server node; this node publishes the joined point cloud of the scene.

Fig. 4 depicts the ROS nodes required by the STDyn-SLAM to generate the trajectory and the 3D reconstruction. The camera node publishes the stereo images and computes the depth map from the left and right frames. Then, the STDyn-SLAM calculates the odometry and the local point cloud. The OctoMap combines and updates the current local point cloud with the previous global map to visualize the global point cloud. It is worth mentioning that the user can choose the maximum depth of the local point cloud. All the ROS topics can be shown through the viewer.

B. REAL-TIME EXPERIMENTS
We present real-time experiments under three different scenarios, explained next.

First, we test the STDyn-SLAM in an outdoor environment where a car is parked and then moves forward. In this case, a static object (a car) becomes dynamic; see Fig. 5. This figure shows the 3D reconstruction: the car appears static in the first images of the sequence (Fig. 5 a). Then, the car becomes a dynamic object when it moves forward (Fig. 5 b), and the STDyn-SLAM is capable of filling the empty zone once the scene is covered again, as in Fig. 5 c).

The second experiment tests our system in an indoor environment. The scene consists of a moving person crossing from left to right. Subfigures a and b of Fig. 6 depict the left and right images, and subfigure c shows the 3D reconstruction. The area occupied by the moving person is filled after the zone becomes visible again.

The third experiment consists of a scene sequence with two parked cars, a walking person, and a dog. Even though the vehicles are static, the rest of the objects move. Fig. 7a shows the scene taking into account the potentially dynamic entities. However, since a car can change its position, the STDyn-SLAM excludes the probable moving bodies (the parked cars) to avoid plotting them multiple times throughout the reconstruction. This is depicted in Fig. 7b.

As a fourth experiment, we compared the point clouds from RTABMAP and STDyn-SLAM. The sequence was carried out outdoors with a walking person and two dogs. Since RTABMAP generates a point cloud of the scene, we decided to compare it with our system. To build the 3D reconstructions from RTABMAP, we provided left and depth images, camera info, and odometry as inputs. We used stereo and depth images; the intrinsic parameters are saved in a text file in the ORB-SLAM2 package. Fig. 8 shows the 3D reconstructions. In Fig. 8a, our system excludes the dynamic objects. On the other hand, in Fig. 8b, RTABMAP plotted the dynamic objects on different sides of the scene, resulting in an incorrect map of the environment.

FIGURE 8. Experiment comparison between the STDyn-SLAM and RTABMAP [41]. Image a) shows the 3D reconstruction given by STDyn-SLAM; it eliminates the dynamic objects' effect on the mapping. Image b) shows the point cloud created by RTABMAP; notice how dynamic objects are mapped along the trajectory. This is undesirable behavior.

TABLE 2. Comparison of Absolute Pose Error (APE) on the KITTI dataset.
TABLE 3. Comparison of Absolute Pose Error (APE) on the Euroc-Mav dataset.
TABLE 4. Comparison of Relative Pose Error (RPE) on the KITTI dataset.
TABLE 5. Comparison of Relative Pose Error (RPE) on the Euroc-Mav dataset.

C. COMPARISON OF THE STATE OF THE ART AND OUR SLAM USING THE KITTI AND EurocMav DATASETS
We compare our VSLAM with the DynaSLAM1 [14] and ORB-SLAM2 approaches. We selected sequences with dynamic objects, with and without loop closure, to evaluate the SLAM systems. Therefore, we chose the 00−10 sequences from the KITTI odometry dataset [42], as well as all sequences from the EurocMav dataset except V1_03 and V2_03. Moreover, we employed the EVO tools [43] to evaluate the Absolute Pose Error (APE) and the Relative Pose Error (RPE), and the RGB-D tools [44] to calculate the Absolute Trajectory Error (ATE).

We present the results of APE, RPE, and ATE in different tables, divided by the dataset evaluated. Tables 2 and 3 show the APE experiments on the KITTI and EurocMav datasets, respectively. Tables 4 and 5 correspond to the RPE, and Tables 6 and 7 present the ATE results. We did not evaluate the EurocMav dataset with DynaSLAM1 due to the excessive processing time required to compute the trajectories.

To evaluate the significance of the differences in the ATE evaluation, we computed the Score Sρ [45] over the sequences of the EurocMav and KITTI datasets of Tables 6 and 7. The results in Table 8 show an improvement of our system over ORB-SLAM2 on the trajectories of the EurocMav dataset. On the KITTI dataset, STDyn-SLAM and ORB-SLAM2 are not significantly different. In the evaluation of our system against DynaSLAM1, DynaSLAM1 is slightly better.

TABLE 6. Comparison of Absolute Trajectory Error (ATE) on the KITTI dataset.
TABLE 7. Comparison of Absolute Trajectory Error (ATE) on the Euroc-Mav dataset.
TABLE 8. Comparison of Score Sρ(a, b) on the datasets.
TABLE 9. Processing time.

D. PROCESSING TIME
In this section, we analyze the processing time of this work. For the study, we evaluate several datasets with different types of images. The analysis consists of obtaining the processing time of each sequence with the same characteristics and calculating the average of the sequences' means. Table 9 shows the times obtained with the datasets. We use the KITTI and EurocMav datasets for the RGB and Gray columns. Since those sequences do not provide a depth image, we did not map a 3D reconstruction. For the last column, we utilized our own sequences. Our dataset contains depth images, so we plotted a 3D reconstruction; for this reason, the processing time is longer.

V. CONCLUSION
This work presents the STDyn-SLAM system for outdoor and indoor environments where dynamic objects are present. The STDyn-SLAM is based on images captured by a stereo pair for the 3D reconstruction of scenes, where the possible dynamic objects are discarded from the map; this allows a trustworthy point cloud. The system's capability to compute a reconstruction and localization in real time depends on the computer's processing power, since a GPU is necessary to support the processing. However, with a medium-range computer, the algorithms work correctly.

In the future, we plan to implement an optical-flow approach based on the latest generation of neural networks to improve dynamic object detection. The implementation of neural networks allows replacing classic methods such as geometric constraints. Furthermore, we plan to increase the size of the 3D map to reconstruct larger areas and obtain longer reconstructions of the scenes. The next step is implementing the algorithm on an aerial manipulator constructed in the lab.

SUPPLEMENTARY MATERIAL
The implementation of our system is released on GitHub and is available under the following link: https://github.com/DanielaEsparza/STDyn-SLAM
Besides, this letter has supplementary video material, provided by the authors, available at https://youtu.be/3tnkwvRnUss

REFERENCES
[1] J. Castellanos, J. Montiel, J. Neira, and J. Tardos, "The SPmap: A probabilistic framework for simultaneous localization and map building," IEEE Trans. Robot. Autom., vol. 15, no. 5, pp. 948–952, 1999.
[2] G. Dissanayake, H. Durrant-Whyte, and T. Bailey, "A computationally efficient solution to the simultaneous localisation and map building (SLAM) problem," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2000, pp. 1009–1014.
[3] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, "FastSLAM: A factored solution to the simultaneous localization and mapping problem," in Proc. AAAI Nat. Conf. Artif. Intell., 2002, pp. 593–598.
[4] S. Kohlbrecher, O. von Stryk, J. Meyer, and U. Klingauf, "A flexible and scalable SLAM system with full 3D motion estimation," in Proc. IEEE Int. Symp. Saf., Secur., Rescue Robot., Nov. 2011, pp. 155–160.
[5] T. Whelan, J. McDonald, M. Kaess, M. Fallon, H. Johannsson, and J. J. Leonard, "Kintinuous: Spatially extended KinectFusion," in Proc. RSS Workshop RGB-D: Adv. Reasoning with Depth Cameras, Jul. 2012, pp. 1–10.
[6] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, "MonoSLAM: Real-time single camera SLAM," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 1052–1067, Jun. 2007.
[7] Y. Kameda, "Parallel tracking and mapping for small AR workspaces (PTAM) augmented reality," J. Inst. Image Inf. Telev. Engineers, vol. 66, no. 1, pp. 45–51, 2012.
[8] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, "SVO: Semidirect visual odometry for monocular and multicamera systems," IEEE Trans. Robot., vol. 33, no. 2, pp. 249–265, Apr. 2017.
[9] J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: Large-scale direct monocular SLAM," in Proc. Eur. Conf. Comput. Vis. (ECCV), Cham, Switzerland: Springer, 2014, pp. 834–849.
[10] F. Yu, J. Shang, Y. Hu, and M. Milford, "NeuroSLAM: A brain-inspired SLAM system for 3D environments," Biol. Cybern., vol. 113, nos. 5–6, pp. 515–545, Dec. 2019.
[11] D. Schleicher, L. M. Bergasa, M. Ocana, R. Barea, and M. E. Lopez, "Real-time hierarchical outdoor SLAM based on stereovision and GPS fusion," IEEE Trans. Intell. Transp. Syst., vol. 10, no. 3, pp. 440–452, Sep. 2009.
[12] R. Ren, H. Fu, and M. Wu, "Large-scale outdoor SLAM based on 2D LiDAR," Electronics, vol. 8, no. 6, p. 613, May 2019.
[13] S. Yang and S. Scherer, "CubeSLAM: Monocular 3-D object SLAM," IEEE Trans. Robot., vol. 35, no. 4, pp. 925–938, Aug. 2019.
[14] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 4076–4083, Oct. 2018.
[15] L. Cui and C. Ma, "SOF-SLAM: A semantic visual SLAM for dynamic environments," IEEE Access, vol. 7, pp. 166528–166539, 2019.
[16] P. Ma, Y. Bai, J. Zhu, C. Wang, and C. Peng, "DSOD: DSO in dynamic environments," IEEE Access, vol. 7, pp. 178300–178309, 2019.
[17] L. Zhao, Z. Liu, J. Chen, W. Cai, W. Wang, and L. Zeng, "A compatible framework for RGB-D SLAM in dynamic scenes," IEEE Access, vol. 7, pp. 75604–75614, 2019.
[18] M. Henein, J. Zhang, R. Mahony, and V. Ila, "Dynamic SLAM: The need for speed," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), 2020, pp. 2123–2129.
[19] S. Trejo, K. Martinez, and G. Flores, "Depth map estimation methodology for detecting free-obstacle navigation areas," in Proc. Int. Conf. Unmanned Aircr. Syst. (ICUAS), Jun. 2019, pp. 916–922.
[20] D. Yang, S. Bi, W. Wang, C. Yuan, W. Wang, X. Qi, and Y. Cai, "DRE-SLAM: Dynamic RGB-D encoder SLAM for a differential-drive robot," Remote Sens., vol. 11, no. 4, p. 380, Feb. 2019.
[21] J. Gimenez, A. Amicarelli, J. M. Toibero, F. di Sciascio, and R. Carelli, "Continuous probabilistic SLAM solved via iterated conditional modes," Int. J. Autom. Comput., vol. 16, no. 6, pp. 838–850, Aug. 2019.
[22] R. Wang, W. Wan, Y. Wang, and K. Di, "A new RGB-D SLAM method with moving object detection for dynamic indoor scenes," Remote Sens., vol. 11, no. 10, p. 1143, May 2019.
[23] J. Cheng, Y. Sun, and M. Q.-H. Meng, "Improving monocular visual SLAM in dynamic environments: An optical-flow-based approach," Adv. Robot., vol. 33, no. 12, pp. 576–589, Jun. 2019.
[24] Y. Ma and Y. Jia, "Robust SLAM algorithm in dynamic environment using optical flow," in Proc. Chin. Intell. Syst. Conf., Singapore: Springer, 2020, pp. 681–689.
[25] Y. Sun, M. Liu, and M. Q.-H. Meng, "Improving RGB-D SLAM in dynamic environments: A motion removal approach," Robot. Auton. Syst., vol. 89, pp. 110–122, Mar. 2017.
[26] Y. Sun, M. Liu, and M. Q.-H. Meng, "Motion removal for reliable RGB-D SLAM in dynamic environments," Robot. Auton. Syst., vol. 108, pp. 115–128, Oct. 2018.
[27] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. V. D. Smagt, D. Cremers, and T. Brox, "FlowNet: Learning optical flow with convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2758–2766.
[28] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of optical flow estimation with deep networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2462–2470.
[29] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, "A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4040–4048.
[30] B. Xu, W. Li, D. Tzoumanikas, M. Bloesch, A. Davison, and S. Leutenegger, "MID-fusion: Octree-based object-level multi-instance dynamic SLAM," in Proc. Int. Conf. Robot. Automat. (ICRA), May 2019, pp. 5231–5237.
[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. (ECCV), Cham: Springer, 2014, pp. 740–755.
[32] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, Oct. 2017.
[33] C. Yu, Z. Liu, X. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, "DS-SLAM: A semantic visual SLAM towards dynamic environments," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 1168–1174.
[34] J. Cheng, Y. Sun, and M. Q.-H. Meng, "Robust semantic mapping in challenging environments," Robotica, vol. 38, no. 2, pp. 256–270, Feb. 2020.
[35] B. Bescos, C. Campos, J. D. Tardos, and J. Neira, "DynaSLAM II: Tightly-coupled multi-object tracking and SLAM," IEEE Robot. Autom. Lett., vol. 6, no. 3, pp. 5191–5198, Jul. 2021.
[36] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. Int. Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 2564–2571.
[37] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017.
[38] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, 2015, pp. 1–14.
[39] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Sep. 2010.
[40] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, "OctoMap: An efficient probabilistic 3D mapping framework based on octrees," Auton. Robots, vol. 34, no. 3, pp. 189–206, Apr. 2013. [Online]. Available: https://octomap.github.io
[41] M. Labbé and F. Michaud, "Long-term online multi-session graph-based SPLAM with memory management," Auton. Robots, vol. 42, no. 6, pp. 1133–1150, 2018.
[42] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 3354–3361.
[43] EVO: Python Package for the Evaluation of Odometry and SLAM. (2017). [Online]. Available: https://github.com/MichaelGrupp/evo
[44] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2012, pp. 573–580.
[45] R. Muñoz-Salinas and R. Medina-Carnicer, "UcoSLAM: Simultaneous localization and mapping by fusion of keypoints and squared planar markers," Pattern Recognit., vol. 101, May 2020, Art. no. 107193.

DANIELA ESPARZA received the B.S. degree in robotic engineering from the Universidad Politécnica del Bicentenario, México, in 2017, and the master's degree in optomechatronics from the Center for Research in Optics, in 2019, where she is currently pursuing the Ph.D. degree in mechatronics and mechanical design. Her research interests include artificial vision, such as 3D reconstruction and deep learning applied to SLAM, developed on platforms such as mobile robots.

GERARDO FLORES (Member, IEEE) received the B.S. degree (Hons.) in electronic engineering from the Instituto Tecnológico de Saltillo, Mexico, in 2007, the M.S. degree in automatic control from CINVESTAV-IPN, Mexico City, in 2010, and the Ph.D. degree in systems and information technology from the Heudiasyc Laboratory, Université de Technologie de Compiègne–Sorbonne Universités, France, in October 2014. Since August 2016, he has been a full-time Researcher and the Head of the Perception and Robotics Laboratory, Center for Research in Optics, León, Guanajuato, Mexico. His current research interests include the theoretical and practical problems arising from the development of autonomous robotic and vision systems. He has been an Associate Editor of Mathematical Problems in Engineering since 2020.

diff --git a/动态slam/2020年-2022年开源动态SLAM/~$20-2022年开源动态SLAM.docx b/动态slam/2020年-2022年开源动态SLAM/~$20-2022年开源动态SLAM.docx
new file mode 100644
index 0000000..80e1aae
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/~$20-2022年开源动态SLAM.docx
@@ -0,0 +1,3 @@
+
+junwen Lai
+junwen Lai77ҵe2yiy
\ No newline at end of file
diff --git a/动态slam/df_vo创建conda环境报错.txt b/动态slam/df_vo创建conda环境报错.txt
new file mode 100644
index 0000000..0755048
--- /dev/null
+++ b/动态slam/df_vo创建conda环境报错.txt
@@ -0,0 +1,846 @@
+jinja2=2.10 -> markupsafe[version='>=0.23|>=0.23,<2']
+_anaconda_depends=2019.03 -> jinja2 -> markupsafe[version='<2.0|>=0.23|>=0.23,<2|>=0.23,<2.1|>=2.0|>=2.0.0rc2|>=2.1.1']
+jupyter=1.0.0 -> nbconvert -> markupsafe[version='>=2.0']
+
+Package pycairo conflicts for:
+nltk=3.4 -> matplotlib -> pycairo
+anaconda=custom -> _anaconda_depends -> pycairo
+_anaconda_depends=2019.03 -> pycairo
+seaborn=0.9.0 -> matplotlib[version='>=1.4.3'] -> pycairo
+scikit-image=0.15.0 -> matplotlib[version='>=2.0.0'] -> pycairo
+
+Package isort conflicts for:
+pylint=2.3.1 -> isort[version='>=4.2.5']
+spyder=3.3.3 -> pylint -> isort[version='>=4.2.5|>=4.2.5,<5|>=4.2.5,<6']
+isort=4.3.16
+_anaconda_depends=2019.03 -> pylint -> isort[version='>=4.2.5|>=4.2.5,<5|>=4.2.5,<6']
+anaconda=custom -> _anaconda_depends -> isort
+_anaconda_depends=2019.03 -> isort
+
+Package pyflakes conflicts for:
+spyder=3.3.3 -> pyflakes
+anaconda=custom -> _anaconda_depends -> pyflakes
+pyflakes=2.1.1
+_anaconda_depends=2019.03 -> pyflakes
+
+Package pycurl conflicts for:
+anaconda=custom -> _anaconda_depends -> pycurl
+pycurl=7.43.0.2
+_anaconda_depends=2019.03 -> pycurl
+
+Package pycodestyle conflicts for:
+spyder=3.3.3 -> pycodestyle
+_anaconda_depends=2019.03 -> pycodestyle
+pycodestyle=2.5.0
+anaconda=custom -> _anaconda_depends -> pycodestyle
+
+Package singledispatch conflicts for:
+distributed=1.26.0 -> singledispatch +ipykernel=5.1.0 -> tornado[version='>=4.0'] -> singledispatch==3.4.0.3 +nltk=3.4 -> singledispatch +matplotlib=3.0.3 -> tornado -> singledispatch==3.4.0.3 +_anaconda_depends=2019.03 -> singledispatch +terminado=0.8.1 -> tornado[version='>=4'] -> singledispatch==3.4.0.3 +jupyter_client=5.2.4 -> tornado[version='>=4.1'] -> singledispatch==3.4.0.3 +bokeh=1.0.4 -> tornado[version='>=4.3'] -> singledispatch==3.4.0.3 +numba=0.43.1 -> singledispatch +dask=1.1.4 -> distributed[version='>=1.26.0'] -> singledispatch +spyder=3.3.3 -> pylint -> singledispatch +anaconda=custom -> _anaconda_depends -> singledispatch +_anaconda_depends=2019.03 -> astroid -> singledispatch==3.4.0.3 +singledispatch=3.4.0.3 +notebook=5.7.8 -> tornado[version='>=4.1,<7'] -> singledispatch==3.4.0.3 +anaconda-project=0.8.2 -> tornado[version='>=4.2'] -> singledispatch==3.4.0.3 +distributed=1.26.0 -> tornado[version='<6.2'] -> singledispatch==3.4.0.3 + +Package gast conflicts for: +gast=0.2.2 +tensorflow=1.13.1 -> gast[version='>=0.2.0'] + +Package cudnn conflicts for: +torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> cudnn[version='7.3.*|>=7.6,<8.0a0|>=7.6.5.32,<8.0a0|>=8.4.1.50,<9.0a0|>=8.8.0.121,<9.0a0|>=8.2.1.32,<9.0a0|>=8.1.0.77,<9.0a0|>=8.9,<9.0a0|>=8.9.2.26,<9.0a0|>=8.2,<9.0a0|>=8.2.1,<9.0a0|>=7.6.5,<8.0a0|>=7.6.4,<8.0a0|>=7.3.1,<8.0a0|>=7.3.0,<=8.0a0'] +cupy=6.0.0 -> cudnn[version='>=7.1.3,<8.0a0|>=7.3.1,<8.0a0'] +pytorch=1.1.0 -> cudnn[version='>=7.3.1,<8.0a0'] +cudnn=7.6.0 +tensorflow=1.13.1 -> tensorflow-base==1.13.1=gpu_py27h8f37b9b_0 -> cudnn[version='>=7.3.1,<8.0a0'] + +Package libdeflate conflicts for: +anaconda=custom -> _anaconda_depends -> libdeflate +_anaconda_depends=2019.03 -> libtiff -> libdeflate[version='>=1.10,<1.11.0a0|>=1.12,<1.13.0a0|>=1.13,<1.14.0a0|>=1.14,<1.15.0a0|>=1.16,<1.17.0a0|>=1.17,<1.18.0a0|>=1.18,<1.19.0a0|>=1.19,<1.20.0a0|>=1.8,<1.9.0a0|>=1.7,<1.8.0a0'] +pillow=6.0.0 -> libtiff[version='>=4.0.9,<4.4.0a0'] -> 
libdeflate[version='>=1.10,<1.11.0a0|>=1.8,<1.9.0a0|>=1.7,<1.8.0a0|>=1.19,<1.20.0a0|>=1.18,<1.19.0a0|>=1.17,<1.18.0a0|>=1.16,<1.17.0a0|>=1.14,<1.15.0a0|>=1.13,<1.14.0a0|>=1.12,<1.13.0a0'] + +Package smart_open conflicts for: +anaconda=custom -> _anaconda_depends -> smart_open +nltk=3.4 -> gensim -> smart_open[version='>=1.2.1|>=1.8.1'] + +Package gmp conflicts for: +nbconvert=5.4.1 -> pandoc[version='>=1.12.1,<2.0.0'] -> gmp=6.1 +mpc=1.1.0 -> mpfr[version='>=4.0.2,<5.0a0'] -> gmp[version='>=6.2.1,<7.0a0'] +gmpy2=2.0.8 -> gmp[version='>=6.1.2|>=6.1.2,<7.0a0'] +gmp=6.1.2 +mpc=1.1.0 -> gmp[version='>=5.0.1,<7|>=6.1.2,<7.0a0|>=6.2.0,<7.0a0|>=6.1.2'] +pandoc=2.2.3.2 -> gmp +gmpy2=2.0.8 -> mpc[version='>=1.1.0,<2.0a0'] -> gmp[version='>=5.0.1,<7|>=6.2.0,<7.0a0|>=6.2.1,<7.0a0'] +mpfr=4.0.1 -> gmp[version='>=6.1.2|>=6.1.2,<7.0a0'] +sympy=1.3 -> gmpy2[version='>=2.0.8'] -> gmp[version='>=6.1.2|>=6.1.2,<7.0a0|>=6.2.0,<7.0a0|>=6.2.1,<7.0a0'] +anaconda=custom -> _anaconda_depends -> gmp +_anaconda_depends=2019.03 -> gmp +_anaconda_depends=2019.03 -> gmpy2 -> gmp[version='6.1.*|>=5.0.1,<7|>=6.1.2|>=6.1.2,<7.0a0|>=6.2.0,<7.0a0|>=6.2.1,<7.0a0'] + +Package numexpr conflicts for: +anaconda=custom -> _anaconda_depends -> numexpr +seaborn=0.9.0 -> pandas[version='>=0.14.0'] -> numexpr[version='>=2.6.8|>=2.7.0|>=2.7.1|>=2.7.3|>=2.8.0'] +_anaconda_depends=2019.03 -> pandas -> numexpr[version='2.0.*|2.1.*|2.2.*|2.3.*|2.4.*|2.5.*|>=2.6.2|>=2.6.8|>=2.7.0|>=2.7.1|>=2.7.3|>=2.8.0'] +_anaconda_depends=2019.03 -> numexpr +numexpr=2.6.9 +dask=1.1.4 -> pandas[version='>=0.19.0,<2.0.0a0'] -> numexpr[version='>=2.6.8|>=2.7.0|>=2.7.1|>=2.7.3|>=2.8.0'] +statsmodels=0.9.0 -> pandas[version='>=0.14'] -> numexpr[version='>=2.6.8|>=2.7.0|>=2.7.1|>=2.7.3|>=2.8.0'] +bkcharts=0.2 -> pandas -> numexpr[version='>=2.6.8|>=2.7.0|>=2.7.1|>=2.7.3|>=2.8.0'] +pytables=3.5.1 -> numexpr + +Package iniconfig conflicts for: +pytest-astropy=0.5.0 -> pytest[version='>=3.1'] -> iniconfig +anaconda=custom -> 
_anaconda_depends -> iniconfig +pytest-remotedata=0.3.1 -> pytest[version='>=3.1'] -> iniconfig +pytest-doctestplus=0.3.0 -> pytest[version='>=3.0'] -> iniconfig +pytest-openfiles=0.3.2 -> pytest[version='>=2.8.0'] -> iniconfig +_anaconda_depends=2019.03 -> pytest -> iniconfig +pytest-arraydiff=0.3 -> pytest -> iniconfig + +Package contextlib2 conflicts for: +contextlib2=0.5.5 +anaconda=custom -> _anaconda_depends -> contextlib2 +_anaconda_depends=2019.03 -> contextlib2 +importlib_metadata=0.8 -> contextlib2 +path.py=11.5.0 -> importlib_metadata[version='>=0.5'] -> contextlib2 + +Package sympy conflicts for: +sympy=1.3 +_anaconda_depends=2019.03 -> sympy +anaconda=custom -> _anaconda_depends -> sympy +torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> sympy + +Package pyodbc conflicts for: +anaconda=custom -> _anaconda_depends -> pyodbc +_anaconda_depends=2019.03 -> pyodbc +pyodbc=4.0.26 + +Package pytorch conflicts for: +torchvision=0.3.0 -> pytorch[version='1.1.*|>=1.1.0'] +pytorch=1.1.0 + +Package qtawesome conflicts for: +anaconda=custom -> _anaconda_depends -> qtawesome +_anaconda_depends=2019.03 -> qtawesome +qtawesome=0.5.7 +spyder=3.3.3 -> qtawesome[version='>=0.4.1'] +_anaconda_depends=2019.03 -> spyder -> qtawesome[version='>=0.4.1|>=0.5.7|>=1.0.2|>=1.2.1'] + +Package exceptiongroup conflicts for: +jupyter_console=6.0.0 -> ipython -> exceptiongroup +ipykernel=5.1.0 -> ipython[version='>=5.0'] -> exceptiongroup +pytest-astropy=0.5.0 -> pytest[version='>=3.1'] -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8'] +pytest-arraydiff=0.3 -> pytest -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8'] +pytest-doctestplus=0.3.0 -> pytest[version='>=3.0'] -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8'] +pytest-remotedata=0.3.1 -> pytest[version='>=3.1'] -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8'] +_anaconda_depends=2019.03 -> ipython -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8'] +ipywidgets=7.4.2 -> ipython[version='>=4.0.0'] -> exceptiongroup +pytest-openfiles=0.3.2 
-> pytest[version='>=2.8.0'] -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8'] + +Package dbus conflicts for: +keyring=18.0.0 -> secretstorage -> dbus[version='>=1.13.18,<2.0a0'] +anaconda=custom -> _anaconda_depends -> dbus +_anaconda_depends=2019.03 -> dbus +_anaconda_depends=2019.03 -> pyqt -> dbus[version='>=1.10.22,<2.0a0|>=1.12.2,<2.0a0|>=1.13.12,<2.0a0|>=1.13.6,<2.0a0|>=1.13.2,<2.0a0|>=1.13.0,<2.0a0|>=1.13.18,<2.0a0'] +pyqt=5.9.2 -> dbus[version='>=1.12.2,<2.0a0|>=1.13.12,<2.0a0|>=1.13.6,<2.0a0|>=1.13.2,<2.0a0'] +matplotlib=3.0.3 -> pyqt -> dbus[version='>=1.10.22,<2.0a0|>=1.12.2,<2.0a0|>=1.13.12,<2.0a0|>=1.13.6,<2.0a0|>=1.13.2,<2.0a0'] +qt=5.9.7 -> dbus[version='>=1.13.2,<2.0a0|>=1.13.6,<2.0a0'] +secretstorage=3.1.1 -> dbus +dbus=1.13.6 +spyder=3.3.3 -> pyqt[version='>=5.6,<5.7'] -> dbus[version='>=1.10.22,<2.0a0|>=1.13.6,<2.0a0|>=1.13.12,<2.0a0|>=1.13.2,<2.0a0|>=1.12.2,<2.0a0'] +qtconsole=4.4.3 -> pyqt -> dbus[version='>=1.10.22,<2.0a0|>=1.12.2,<2.0a0|>=1.13.12,<2.0a0|>=1.13.6,<2.0a0|>=1.13.2,<2.0a0'] + +Package greenlet conflicts for: +_anaconda_depends=2019.03 -> greenlet +anaconda=custom -> _anaconda_depends -> greenlet +gevent=1.4.0 -> greenlet[version='>=0.4.14'] +_anaconda_depends=2019.03 -> bokeh -> greenlet[version='!=0.4.17|0.4.*|>=2.0.0|>=1.1.3,<2.0|>=1.1.0,<2.0|>=0.4.17,<2.0|>=0.4.17|>=0.4.14|>=0.4.13|>=0.4.10|>=0.4.9'] +greenlet=0.4.15 + +Package graphite2 conflicts for: +pango=1.42.4 -> harfbuzz[version='>=2.7.2,<3.0a0'] -> graphite2[version='1.3.*|>=1.3.11,<2.0a0|>=1.3.10,<2.0a0'] +anaconda=custom -> _anaconda_depends -> graphite2 +_anaconda_depends=2019.03 -> graphite2 +pango=1.42.4 -> graphite2[version='>=1.3.12,<2.0a0|>=1.3.13,<2.0a0|>=1.3.14,<2.0a0'] +harfbuzz=1.8.8 -> graphite2[version='>=1.3.11,<2.0a0'] +graphite2=1.3.13 +_anaconda_depends=2019.03 -> harfbuzz -> graphite2[version='1.3.*|>=1.3.14,<2.0a0|>=1.3.13,<2.0a0|>=1.3.11,<2.0a0|>=1.3.10,<2.0a0|>=1.3.12,<2.0a0'] + +Package pthread-stubs conflicts for: +qt=5.9.7 -> libxcb -> 
pthread-stubs +libxcb=1.13 -> pthread-stubs +gst-plugins-base=1.14.0 -> libxcb[version='>=1.14,<2.0a0'] -> pthread-stubs +harfbuzz=1.8.8 -> libxcb[version='>=1.13,<2.0a0'] -> pthread-stubs +cairo=1.14.12 -> libxcb -> pthread-stubs +_anaconda_depends=2019.03 -> libxcb -> pthread-stubs + +Package astropy conflicts for: +astropy=3.1.2 +anaconda=custom -> _anaconda_depends -> astropy +_anaconda_depends=2019.03 -> astropy + +Package pyasn1 conflicts for: +urllib3=1.24.1 -> cryptography[version='>=1.3.4'] -> pyasn1[version='>=0.1.8'] +anaconda=custom -> _anaconda_depends -> pyasn1 +_anaconda_depends=2019.03 -> cryptography -> pyasn1[version='0.1.7|0.1.9|>=0.1.8'] +secretstorage=3.1.1 -> cryptography -> pyasn1[version='0.1.7|0.1.9|>=0.1.8'] + +Package ninja conflicts for: +torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> ninja +ninja=1.9.0 +pytorch=1.1.0 -> ninja + +Package tensorboard conflicts for: +tensorboard=1.13.1 +tensorflow=1.13.1 -> tensorboard[version='1.13.*|>=1.13.0,<1.14.0a0|>=1.13.0,<1.14.0'] + +Package bokeh conflicts for: +anaconda=custom -> _anaconda_depends -> bokeh +dask=1.1.4 -> bokeh[version='>=0.13.0|>=0.13.0,<3.0.0a0'] +_anaconda_depends=2019.03 -> bokeh +bokeh=1.0.4 +_anaconda_depends=2019.03 -> dask -> bokeh[version='<3.0a0|>=0.13.0,<3.0.0a0|>=1.0.0,!=2.0.0,<3.0.0a0|>=2.1.1,<3.0.0a0|>=2.4.2,<3.0.0a0|>=2.4.2|>=2.4.2,!=3.0.*|>=2.4.2,<3|>=1.0.0,<3.0.0a0|>=2.4.2,<3.0|>=2.1.1|>=1.0.0,!=2.0.0|>=1.0.0|>=0.13.0|>=0.12.3|>=0.12.1'] + +Package future conflicts for: +path.py=11.5.0 -> backports.os -> future +_anaconda_depends=2019.03 -> future +backports.os=0.1.1 -> future +pytorch=1.1.0 -> future +anaconda=custom -> _anaconda_depends -> future +torchvision=0.3.0 -> future + +Package path.py conflicts for: +_anaconda_depends=2019.03 -> path.py +ipython=7.4.0 -> pickleshare -> path.py +anaconda=custom -> _anaconda_depends -> path.py +spyder=3.3.3 -> pickleshare -> path.py +path.py=11.5.0 + +Package dbus-python conflicts for: +keyring=18.0.0 -> secretstorage 
-> dbus-python +_anaconda_depends=2019.03 -> secretstorage -> dbus-python + +Package _ipython_minor_entry_point conflicts for: +jupyter_console=6.0.0 -> ipython -> _ipython_minor_entry_point=8.7.0 +ipywidgets=7.4.2 -> ipython[version='>=4.0.0'] -> _ipython_minor_entry_point=8.7.0 +ipykernel=5.1.0 -> ipython[version='>=5.0'] -> _ipython_minor_entry_point=8.7.0 +_anaconda_depends=2019.03 -> ipython -> _ipython_minor_entry_point=8.7.0 + +Package gmpy2 conflicts for: +sympy=1.3 -> gmpy2[version='>=2.0.8'] +sympy=1.3 -> mpmath[version='>=0.19'] -> gmpy2 +_anaconda_depends=2019.03 -> sympy -> gmpy2[version='>=2.0.8'] +anaconda=custom -> _anaconda_depends -> gmpy2 +_anaconda_depends=2019.03 -> gmpy2 +gmpy2=2.0.8 + +Package fonttools conflicts for: +scikit-image=0.15.0 -> matplotlib-base[version='>=2.0.0'] -> fonttools[version='>=4.22.0'] +seaborn=0.9.0 -> matplotlib-base -> fonttools[version='>=4.22.0'] +anaconda=custom -> _anaconda_depends -> fonttools + +Package blis conflicts for: +numpy=1.16.2 -> libblas[version='>=3.8.0,<4.0a0'] -> blis[version='0.5.1.*|>=0.5.2,<0.5.3.0a0|>=0.6.0,<0.6.1.0a0|>=0.6.1,<0.6.2.0a0|>=0.7.0,<0.7.1.0a0|>=0.8.0,<0.8.1.0a0|>=0.8.1,<0.8.2.0a0|>=0.9.0,<0.9.1.0a0'] +scipy=1.2.1 -> libblas[version='>=3.8.0,<4.0a0'] -> blis[version='0.5.1.*|>=0.5.2,<0.5.3.0a0|>=0.6.0,<0.6.1.0a0|>=0.6.1,<0.6.2.0a0|>=0.7.0,<0.7.1.0a0|>=0.8.0,<0.8.1.0a0|>=0.8.1,<0.8.2.0a0|>=0.9.0,<0.9.1.0a0'] + +Package qtconsole conflicts for: +_anaconda_depends=2019.03 -> spyder -> qtconsole[version='>=4.2|>=4.6.0|>=4.7.7|>=5.0.1|>=5.0.3|>=5.1.0|>=5.1.0,<5.2.0|>=5.2.1,<5.3.0|>=5.3.0,<5.4.0|>=5.3.2,<5.4.0|>=5.4.0,<5.5.0|>=5.4.2,<5.5.0|>=5.5.0,<5.6.0'] +qtconsole=4.4.3 +anaconda=custom -> _anaconda_depends -> qtconsole +spyder=3.3.3 -> qtconsole[version='>=4.2'] +jupyter=1.0.0 -> qtconsole +_anaconda_depends=2019.03 -> qtconsole + +Package filelock conflicts for: +torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> filelock +anaconda=custom -> _anaconda_depends -> filelock 
+_anaconda_depends=2019.03 -> filelock + +Package libnghttp2 conflicts for: +_anaconda_depends=2019.03 -> libcurl -> libnghttp2[version='>=1.41.0,<2.0a0|>=1.43.0,<2.0a0|>=1.47.0,<2.0a0|>=1.51.0,<2.0a0|>=1.52.0,<2.0a0|>=1.57.0|>=1.57.0,<2.0a0|>=1.52.0|>=1.46.0|>=1.46.0,<2.0a0'] +tensorflow=1.13.1 -> libcurl[version='>=7.64.1,<9.0a0'] -> libnghttp2[version='>=1.41.0,<2.0a0|>=1.43.0,<2.0a0|>=1.47.0,<2.0a0|>=1.51.0,<2.0a0|>=1.52.0,<2.0a0|>=1.57.0|>=1.57.0,<2.0a0|>=1.52.0|>=1.46.0|>=1.46.0,<2.0a0'] +pycurl=7.43.0.2 -> libcurl[version='>=7.64.1,<9.0a0'] -> libnghttp2[version='>=1.41.0,<2.0a0|>=1.43.0,<2.0a0|>=1.47.0,<2.0a0|>=1.51.0,<2.0a0|>=1.52.0,<2.0a0|>=1.57.0|>=1.57.0,<2.0a0|>=1.52.0|>=1.46.0|>=1.46.0,<2.0a0'] +anaconda=custom -> _anaconda_depends -> libnghttp2 + +Package secretstorage conflicts for: +secretstorage=3.1.1 +spyder=3.3.3 -> keyring -> secretstorage[version='>=3|>=3.2'] +_anaconda_depends=2019.03 -> secretstorage +keyring=18.0.0 -> secretstorage +anaconda=custom -> _anaconda_depends -> secretstorage +_anaconda_depends=2019.03 -> keyring -> secretstorage[version='>=3|>=3.2'] + +Package pyobjc-framework-cocoa conflicts for: +_anaconda_depends=2019.03 -> send2trash -> pyobjc-framework-cocoa +notebook=5.7.8 -> send2trash -> pyobjc-framework-cocoa + +Package astroid conflicts for: +spyder=3.3.3 -> pylint -> 
astroid[version='1.0.1|1.1.0|1.1.1|1.2.1|1.3.2|1.3.4|1.4.4|2.5.6|>=2.11.0,<=2.12.0|>=2.11.2,<=2.12.0|>=2.11.3,<=2.12.0|>=2.11.5,<2.12.0|>=2.11.6,<2.12.0|>=2.12.10,<2.14.0-dev0|>=2.12.11,<2.14.0-dev0|>=2.12.12,<2.14.0-dev0|>=2.12.13,<2.14.0-dev0|>=2.14.1,<2.16.0-dev0|>=2.14.2,<2.16.0-dev0|>=2.15.0,<2.17.0-dev0|>=2.15.2,<2.17.0-dev0|>=2.15.4,<2.17.0-dev0|>=2.15.6,<2.17.0-dev0|>=2.15.7,<2.17.0-dev0|>=2.15.8,<2.17.0-dev0|>=3.0.0,<3.1.0-dev0|>=3.0.1,<3.1.0-dev0|>=2.12.9,<2.14.0-dev0|>=2.12.4,<2.14.0-dev0|>=2.9.0,<2.10|>=2.8.0,<2.9|>=2.7.2,<2.8|>=2.6.5,<2.7|>=2.6.4,<2.7|>=2.6.2,<2.7|>=2.6.1,<2.7|>=2.5.7,<2.7|>=2.5.1,<2.6|>=2.4.0,<=2.5|>=2.4.0,<2.5|>=2.3.0,<2.4|>=2.2.0,<3|>=2.2.0|>=2.0.0|>=1.6,<2.0|>=1.5.1|>=1.4.5,<1.5.0|>=2.14.2,<=2.16.0|>=2.6.5,<=2.7|>=2.6.2,<=2.7|>=2.5.8,<=2.7|>=1.4.1,<1.5.0'] +_anaconda_depends=2019.03 -> astroid +pylint=2.3.1 -> astroid[version='>=2.2.0'] +anaconda=custom -> _anaconda_depends -> astroid +astroid=2.2.5 +_anaconda_depends=2019.03 -> pylint -> astroid[version='1.0.1|1.1.0|1.1.1|1.2.1|1.3.2|1.3.4|1.4.4|2.5.6|>=2.11.0,<=2.12.0|>=2.11.2,<=2.12.0|>=2.11.3,<=2.12.0|>=2.11.5,<2.12.0|>=2.11.6,<2.12.0|>=2.12.10,<2.14.0-dev0|>=2.12.11,<2.14.0-dev0|>=2.12.12,<2.14.0-dev0|>=2.12.13,<2.14.0-dev0|>=2.14.1,<2.16.0-dev0|>=2.14.2,<2.16.0-dev0|>=2.15.0,<2.17.0-dev0|>=2.15.2,<2.17.0-dev0|>=2.15.4,<2.17.0-dev0|>=2.15.6,<2.17.0-dev0|>=2.15.7,<2.17.0-dev0|>=2.15.8,<2.17.0-dev0|>=3.0.0,<3.1.0-dev0|>=3.0.1,<3.1.0-dev0|>=2.12.9,<2.14.0-dev0|>=2.12.4,<2.14.0-dev0|>=2.9.0,<2.10|>=2.8.0,<2.9|>=2.7.2,<2.8|>=2.6.5,<2.7|>=2.6.4,<2.7|>=2.6.2,<2.7|>=2.6.1,<2.7|>=2.5.7,<2.7|>=2.5.1,<2.6|>=2.4.0,<=2.5|>=2.4.0,<2.5|>=2.3.0,<2.4|>=2.2.0,<3|>=2.2.0|>=2.0.0|>=1.6,<2.0|>=1.5.1|>=1.4.5,<1.5.0|>=2.14.2,<=2.16.0|>=2.6.5,<=2.7|>=2.6.2,<=2.7|>=2.5.8,<=2.7|>=1.4.1,<1.5.0'] + +Package xorg-libice conflicts for: +cairo=1.14.12 -> xorg-libsm -> xorg-libice[version='1.0.*|>=1.1.1,<2.0a0'] +cairo=1.14.12 -> xorg-libice + +Package anaconda-project conflicts for: +anaconda=custom -> 
_anaconda_depends -> anaconda-project +anaconda-project=0.8.2 +_anaconda_depends=2019.03 -> anaconda-client -> anaconda-project[version='>=0.9.1'] +_anaconda_depends=2019.03 -> anaconda-project + +Package parso conflicts for: +spyder=3.3.3 -> jedi[version='>=0.9'] -> parso[version='0.1.0|>=0.1.0,<0.2|>=0.2.0,<0.8.0a0|>=0.3.0,<0.8.0a0|>=0.5.0,<0.8.0a0|>=0.5.2,<0.8.0a0|>=0.7.0,<0.8.0a0|>=0.7.0,<0.8.0|>=0.8.0,<0.9.0|>=0.8.3,<0.9.0|>=0.7.0|>=0.5.2|>=0.5.0|>=0.3.0|>=0.2.0'] +ipython=7.4.0 -> jedi[version='>=0.10'] -> parso[version='0.1.0|>=0.1.0,<0.2|>=0.2.0,<0.8.0a0|>=0.3.0,<0.8.0a0|>=0.5.0,<0.8.0a0|>=0.5.2,<0.8.0a0|>=0.7.0,<0.8.0a0|>=0.7.0,<0.8.0|>=0.8.0,<0.9.0|>=0.8.3,<0.9.0|>=0.7.0|>=0.5.2|>=0.5.0|>=0.3.0|>=0.2.0'] +_anaconda_depends=2019.03 -> parso +_anaconda_depends=2019.03 -> jedi -> parso[version='0.1.0|>=0.1.0,<0.2|>=0.2.0,<0.8.0a0|>=0.3.0,<0.8.0a0|>=0.5.0,<0.8.0a0|>=0.5.2,<0.8.0a0|>=0.7.0,<0.8.0a0|>=0.7.0,<0.8.0|>=0.8.0,<0.9.0|>=0.8.3,<0.9.0|>=0.7.0|>=0.5.2|>=0.5.0|>=0.3.0|>=0.2.0|>=0.7.0,<0.9.0|0.7.0.*|0.5.2.*'] +parso=0.3.4 +jedi=0.13.3 -> parso[version='>=0.3.0|>=0.3.0,<0.8.0a0'] +anaconda=custom -> _anaconda_depends -> parso + +Package typing conflicts for: +spyder=3.3.3 -> sphinx -> typing +torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> typing +anaconda=custom -> _anaconda_depends -> typing +_anaconda_depends=2019.03 -> typing +numpydoc=0.8.0 -> sphinx -> typing + +Package clyent conflicts for: +clyent=1.2.2 +anaconda-project=0.8.2 -> anaconda-client -> clyent[version='>=1.2.0|>=1.2.2'] +anaconda-client=1.7.2 -> clyent[version='>=1.2.0|>=1.2.2'] +_anaconda_depends=2019.03 -> clyent +anaconda=custom -> _anaconda_depends -> clyent +_anaconda_depends=2019.03 -> anaconda-client -> clyent[version='>=1.2.0|>=1.2.2'] + +Package jupyterlab_pygments conflicts for: +anaconda=custom -> _anaconda_depends -> jupyterlab_pygments +notebook=5.7.8 -> nbconvert -> jupyterlab_pygments +jupyter=1.0.0 -> nbconvert -> jupyterlab_pygments +spyder=3.3.3 -> nbconvert -> 
jupyterlab_pygments +_anaconda_depends=2019.03 -> nbconvert -> jupyterlab_pygments + +Package pytest conflicts for: +pytest-doctestplus=0.3.0 -> pytest[version='>=2.8|>=3.0'] +anaconda=custom -> _anaconda_depends -> pytest +_anaconda_depends=2019.03 -> pytest +pytest=4.3.1 +pytest-astropy=0.5.0 -> pytest[version='>=3.1'] +pytest-openfiles=0.3.2 -> pytest[version='>=2.8.0'] +pytest-remotedata=0.3.1 -> pytest[version='>=3.1'] +_anaconda_depends=2019.03 -> astropy -> pytest[version='<3.7|<4|>=2.8|>=4.6|>=3.1|>=3.1.0|>=4.0|>=3.0|>=2.8.0'] +pytest-astropy=0.5.0 -> pytest-arraydiff[version='>=0.1'] -> pytest[version='>=2.8.0|>=2.8|>=3.0|>=4.0|>=4.6'] +astropy=3.1.2 -> pytest-astropy -> pytest[version='>=3.1.0|>=3.1|>=4.6'] +pytest-arraydiff=0.3 -> pytest + +Package jsonschema conflicts for: +anaconda=custom -> _anaconda_depends -> jsonschema +_anaconda_depends=2019.03 -> jsonschema +jsonschema=3.0.1 +ipywidgets=7.4.2 -> nbformat[version='>=4.2.0'] -> jsonschema[version='>=2.4,!=2.5.0|>=2.6'] +nbformat=4.4.0 -> jsonschema[version='>=2.4,!=2.5.0'] +anaconda-client=1.7.2 -> nbformat[version='>=4.4.0'] -> jsonschema[version='2.4.0|>=2.0,!=2.5.0|>=2.4,!=2.5.0|>=2.6'] +nbconvert=5.4.1 -> nbformat[version='>=4.4'] -> jsonschema[version='>=2.4,!=2.5.0|>=2.6'] +notebook=5.7.8 -> nbformat -> jsonschema[version='2.4.0|>=2.0,!=2.5.0|>=2.4,!=2.5.0|>=2.6'] +_anaconda_depends=2019.03 -> jupyterlab_server -> jsonschema[version='2.4.0|>=2.0,!=2.5.0|>=2.4,!=2.5.0|>=2.6|>=3.0.1|>=4.17.3|>=4.18|>=4.18.0|>=3.2.0'] + +Package tblib conflicts for: +tblib=1.3.2 +_anaconda_depends=2019.03 -> distributed -> tblib[version='>=1.6.0'] +dask=1.1.4 -> distributed[version='>=1.26.0'] -> tblib[version='>=1.6.0'] +distributed=1.26.0 -> tblib +_anaconda_depends=2019.03 -> tblib +anaconda=custom -> _anaconda_depends -> tblib + +Package sphinxcontrib-websupport conflicts for: +sphinxcontrib-websupport=1.1.0 +_anaconda_depends=2019.03 -> sphinxcontrib-websupport +numpydoc=0.8.0 -> sphinx -> 
sphinxcontrib-websupport +anaconda=custom -> _anaconda_depends -> sphinxcontrib-websupport +spyder=3.3.3 -> sphinx -> sphinxcontrib-websupport + +Package tqdm conflicts for: +_anaconda_depends=2019.03 -> anaconda-client -> tqdm[version='>=4.56.0'] +anaconda=custom -> _anaconda_depends -> tqdm +_anaconda_depends=2019.03 -> tqdm +anaconda-project=0.8.2 -> anaconda-client -> tqdm[version='>=4.56.0'] +tqdm=4.32.2 + +Package brotli-python conflicts for: +anaconda-client=1.7.2 -> urllib3[version='<2.0.0a'] -> brotli-python[version='>=1.0.9'] +_anaconda_depends=2019.03 -> urllib3 -> brotli-python[version='>=1.0.9'] + +Package jdcal conflicts for: +jdcal=1.4 +anaconda=custom -> _anaconda_depends -> jdcal +_anaconda_depends=2019.03 -> jdcal +_anaconda_depends=2019.03 -> openpyxl -> jdcal==1.0 +openpyxl=2.6.1 -> jdcal + +Package werkzeug conflicts for: +anaconda=custom -> _anaconda_depends -> werkzeug +_anaconda_depends=2019.03 -> werkzeug +flask=1.0.2 -> werkzeug[version='>=0.14|>=0.15,<2.0'] +werkzeug=0.14.1 +tensorboard=1.13.1 -> werkzeug[version='>=0.11.10|>=0.11.15'] +_anaconda_depends=2019.03 -> flask -> werkzeug[version='0.8.3|>=0.14|>=0.15|>=0.15,<2.0|>=2.0|>=2.2.0|>=2.2.2|>=2.3.0|>=2.3.3|>=2.3.7|>=3.0.0|>=0.7|>=0.7,<1.0.0'] +tensorflow=1.13.1 -> tensorboard[version='>=1.13.0,<1.14.0a0'] -> werkzeug[version='>=0.11.10|>=0.11.15'] + +Package sphinxcontrib-qthelp conflicts for: +numpydoc=0.8.0 -> sphinx -> sphinxcontrib-qthelp +_anaconda_depends=2019.03 -> sphinx -> sphinxcontrib-qthelp +anaconda=custom -> _anaconda_depends -> sphinxcontrib-qthelp +spyder=3.3.3 -> sphinx -> sphinxcontrib-qthelp + +Package cairo conflicts for: +pango=1.42.4 -> harfbuzz[version='>=1.7.6,<2.0a0'] -> cairo[version='1.14.*|>=1.14.12,<2.0.0a0'] +pango=1.42.4 -> cairo[version='>=1.14.12,<2.0a0|>=1.16.0,<2.0.0a0'] +anaconda=custom -> _anaconda_depends -> cairo +_anaconda_depends=2019.03 -> cairo +_anaconda_depends=2019.03 -> harfbuzz -> 
cairo[version='1.12.*|1.14.*|>=1.14.12,<2.0.0a0|>=1.16.0,<2.0.0a0|>=1.16.0,<2.0a0|>=1.18.0,<2.0a0|>=1.14.12,<2.0a0|>=1.14.10,<2.0a0|>=1.12.10|>=1.14.10,<2.0.0a0'] +cairo=1.14.12 +harfbuzz=1.8.8 -> cairo[version='>=1.14.12,<2.0.0a0|>=1.14.12,<2.0a0'] + +Package qtpy conflicts for: +spyder=3.3.3 -> qtpy[version='>=1.5.0'] +qtpy=1.7.0 +spyder=3.3.3 -> qtawesome[version='>=0.4.1'] -> qtpy[version='>=2.0.1|>=2.4.0'] +jupyter=1.0.0 -> qtconsole-base -> qtpy[version='>=2.0.1|>=2.4.0'] +_anaconda_depends=2019.03 -> qtconsole -> qtpy[version='>=1.1|>=1.2.0|>=1.5.0|>=2.0.1|>=2.4.0|>=2.1.0'] +qtawesome=0.5.7 -> qtpy +anaconda=custom -> _anaconda_depends -> qtpy +_anaconda_depends=2019.03 -> qtpy + +Package pycparser conflicts for: +anaconda=custom -> _anaconda_depends -> pycparser +_anaconda_depends=2019.03 -> pycparser +pycparser=2.19 +gevent=1.4.0 -> cffi[version='>=1.11.5'] -> pycparser +cffi=1.12.2 -> pycparser +pytorch=1.1.0 -> cffi -> pycparser +cryptography=2.6.1 -> cffi[version='>=1.7'] -> pycparser + +Package mpi conflicts for: +hdf5=1.10.4 -> openmpi[version='>=3.1,<3.2.0a0'] -> mpi==1.0[build='openmpi|mpich'] +anaconda=custom -> _anaconda_depends -> mpi +h5py=2.9.0 -> openmpi[version='>=3.1.4,<3.2.0a0'] -> mpi==1.0[build='openmpi|mpich'] + +Package cycler conflicts for: +_anaconda_depends=2019.03 -> matplotlib -> cycler[version='>=0.10|>=0.10.0'] +matplotlib=3.0.3 -> cycler[version='>=0.10'] +_anaconda_depends=2019.03 -> cycler +scikit-image=0.15.0 -> matplotlib-base[version='>=2.0.0'] -> cycler[version='>=0.10|>=0.10.0'] +anaconda=custom -> _anaconda_depends -> cycler +cycler=0.10.0 +seaborn=0.9.0 -> matplotlib-base -> cycler[version='>=0.10|>=0.10.0'] +nltk=3.4 -> matplotlib -> cycler[version='>=0.10|>=0.10.0'] + +Package cached-property conflicts for: +_anaconda_depends=2019.03 -> h5py -> cached-property +keras-applications=1.0.7 -> h5py -> cached-property +anaconda=custom -> _anaconda_depends -> cached-property + +Package boto conflicts for: +anaconda=custom -> 
_anaconda_depends -> boto +boto=2.49.0 +_anaconda_depends=2019.03 -> boto + +Package wheel conflicts for: +anaconda=custom -> _anaconda_depends -> wheel +_anaconda_depends=2019.03 -> wheel +pip=19.0.3 -> wheel +wheel=0.33.1 +python=3.6.8 -> pip -> wheel + +Package wurlitzer conflicts for: +_anaconda_depends=2019.03 -> spyder-kernels -> wurlitzer[version='>=1.0.3'] +spyder-kernels=0.4.2 -> wurlitzer +_anaconda_depends=2019.03 -> wurlitzer +wurlitzer=1.0.2 +spyder=3.3.3 -> spyder-kernels[version='>=0.4.2,<1'] -> wurlitzer +anaconda=custom -> _anaconda_depends -> wurlitzer + +Package get_terminal_size conflicts for: +ipywidgets=7.4.2 -> ipython[version='>=4.0.0'] -> get_terminal_size +ipykernel=5.1.0 -> ipython[version='>=5.0'] -> get_terminal_size +jupyter_console=6.0.0 -> ipython -> get_terminal_size +get_terminal_size=1.0.0 +anaconda=custom -> _anaconda_depends -> get_terminal_size +_anaconda_depends=2019.03 -> get_terminal_size + +Package pyqtchart conflicts for: +spyder=3.3.3 -> pyqt=5 -> pyqtchart==5.12[build='py36h7ec31b9_6|py38h7400c14_7|py37he336c9b_7|py39h0fcd23e_7|py310hfcd6d55_8|py39h0fcd23e_8|py37he336c9b_8|py38h7400c14_8|py36h7ec31b9_7|py39h0fcd23e_6|py38h7400c14_6|py37he336c9b_6|py37he336c9b_5'] +matplotlib=3.0.3 -> pyqt -> pyqtchart==5.12[build='py36h7ec31b9_6|py38h7400c14_7|py37he336c9b_7|py39h0fcd23e_7|py310hfcd6d55_8|py39h0fcd23e_8|py37he336c9b_8|py38h7400c14_8|py36h7ec31b9_7|py39h0fcd23e_6|py38h7400c14_6|py37he336c9b_6|py37he336c9b_5'] +qtconsole=4.4.3 -> pyqt -> pyqtchart==5.12[build='py36h7ec31b9_6|py38h7400c14_7|py37he336c9b_7|py39h0fcd23e_7|py310hfcd6d55_8|py39h0fcd23e_8|py37he336c9b_8|py38h7400c14_8|py36h7ec31b9_7|py39h0fcd23e_6|py38h7400c14_6|py37he336c9b_6|py37he336c9b_5'] +_anaconda_depends=2019.03 -> pyqt -> pyqtchart==5.12[build='py36h7ec31b9_6|py38h7400c14_7|py37he336c9b_7|py39h0fcd23e_7|py310hfcd6d55_8|py39h0fcd23e_8|py37he336c9b_8|py38h7400c14_8|py36h7ec31b9_7|py39h0fcd23e_6|py38h7400c14_6|py37he336c9b_6|py37he336c9b_5'] + +Package 
numba conflicts for: +_anaconda_depends=2019.03 -> numba +anaconda=custom -> _anaconda_depends -> numba +numba=0.43.1 + +Package mccabe conflicts for: +_anaconda_depends=2019.03 -> mccabe +_anaconda_depends=2019.03 -> pylint -> mccabe[version='>=0.6,<0.7|>=0.6,<0.8'] +spyder=3.3.3 -> pylint -> mccabe[version='>=0.6,<0.7|>=0.6,<0.8'] +mccabe=0.6.1 +pylint=2.3.1 -> mccabe +anaconda=custom -> _anaconda_depends -> mccabe + +Package jaraco.itertools conflicts for: +_anaconda_depends=2019.03 -> zipp -> jaraco.itertools +importlib_metadata=0.8 -> zipp[version='>=0.3.2'] -> jaraco.itertools + +Package pycrypto conflicts for: +_anaconda_depends=2019.03 -> pycrypto +anaconda=custom -> _anaconda_depends -> pycrypto +pycrypto=2.6.1 + +Package _anaconda_depends conflicts for: +_anaconda_depends=2019.03 +anaconda=custom -> _anaconda_depends + +Package pkg-config conflicts for: +dbus=1.13.6 -> glib -> pkg-config +_anaconda_depends=2019.03 -> glib -> pkg-config + +Package jupyter conflicts for: +_anaconda_depends=2019.03 -> jupyter +anaconda=custom -> _anaconda_depends -> jupyter +jupyter=1.0.0 + +Package scikit-image conflicts for: +scikit-image=0.15.0 +anaconda=custom -> _anaconda_depends -> scikit-image +_anaconda_depends=2019.03 -> scikit-image + +Package tensorflow-estimator conflicts for: +tensorflow-estimator=1.13.0 +tensorflow=1.13.1 -> tensorflow-estimator[version='>=1.13.0,<1.14.0a0|>=1.13.0,<1.14.0rc0'] + +Package dataclasses conflicts for: +anaconda=custom -> _anaconda_depends -> dataclasses +torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> dataclasses +nltk=3.4 -> gensim -> dataclasses +tensorboard=1.13.1 -> werkzeug[version='>=0.11.10'] -> dataclasses +_anaconda_depends=2019.03 -> werkzeug -> dataclasses +flask=1.0.2 -> werkzeug[version='>=0.14'] -> dataclasses + +Package jupyter-lsp conflicts for: +_anaconda_depends=2019.03 -> jupyterlab -> jupyter-lsp[version='>=2.0.0'] +jupyter=1.0.0 -> jupyterlab -> jupyter-lsp[version='>=2.0.0'] + +Package et_xmlfile conflicts 
for: +et_xmlfile=1.0.1 +openpyxl=2.6.1 -> et_xmlfile +anaconda=custom -> _anaconda_depends -> et_xmlfile +_anaconda_depends=2019.03 -> et_xmlfile + +Package heapdict conflicts for: +_anaconda_depends=2019.03 -> heapdict +heapdict=1.0.0 +distributed=1.26.0 -> zict[version='>=0.1.3'] -> heapdict +zict=0.1.4 -> heapdict +anaconda=custom -> _anaconda_depends -> heapdict + +Package spyder conflicts for: +anaconda=custom -> _anaconda_depends -> spyder +_anaconda_depends=2019.03 -> spyder +spyder=3.3.3 + +Package notebook-shim conflicts for: +_anaconda_depends=2019.03 -> jupyterlab -> notebook-shim[version='>=0.2|>=0.2,<0.3'] +jupyterlab_server=0.2.0 -> notebook -> notebook-shim[version='>=0.2,<0.3'] +jupyter=1.0.0 -> notebook -> notebook-shim[version='>=0.2|>=0.2,<0.3'] +widgetsnbextension=3.4.2 -> notebook[version='>=4.4.1'] -> notebook-shim[version='>=0.2,<0.3'] +jupyterlab=0.35.4 -> notebook[version='>=4.3.1'] -> notebook-shim[version='>=0.2,<0.3'] + +Package xlsxwriter conflicts for: +anaconda=custom -> _anaconda_depends -> xlsxwriter +_anaconda_depends=2019.03 -> xlsxwriter +xlsxwriter=1.1.5 + +Package qtconsole-base conflicts for: +jupyter=1.0.0 -> qtconsole-base +_anaconda_depends=2019.03 -> jupyter -> qtconsole-base[version='>=5.2.2,<5.2.3.0a0|>=5.3.0,<5.3.1.0a0|>=5.3.1,<5.3.2.0a0|>=5.3.2,<5.3.3.0a0|>=5.4.0,<5.4.1.0a0|>=5.4.1,<5.4.2.0a0|>=5.4.2,<5.4.3.0a0|>=5.4.3,<5.4.4.0a0|>=5.4.4,<5.4.5.0a0|>=5.5.0,<5.5.1.0a0|>=5.5.1,<5.5.2.0a0'] +spyder=3.3.3 -> qtconsole[version='>=4.2'] -> qtconsole-base[version='>=5.2.2,<5.2.3.0a0|>=5.3.0,<5.3.1.0a0|>=5.3.1,<5.3.2.0a0|>=5.3.2,<5.3.3.0a0|>=5.4.0,<5.4.1.0a0|>=5.4.1,<5.4.2.0a0|>=5.4.2,<5.4.3.0a0|>=5.4.3,<5.4.4.0a0|>=5.4.4,<5.4.5.0a0|>=5.5.0,<5.5.1.0a0|>=5.5.1,<5.5.2.0a0'] +jupyter=1.0.0 -> qtconsole -> 
qtconsole-base[version='>=5.2.2,<5.2.3.0a0|>=5.3.0,<5.3.1.0a0|>=5.3.1,<5.3.2.0a0|>=5.3.2,<5.3.3.0a0|>=5.4.0,<5.4.1.0a0|>=5.4.1,<5.4.2.0a0|>=5.4.2,<5.4.3.0a0|>=5.4.3,<5.4.4.0a0|>=5.4.4,<5.4.5.0a0|>=5.5.0,<5.5.1.0a0|>=5.5.1,<5.5.2.0a0'] + +Package pycosat conflicts for: +_anaconda_depends=2019.03 -> pycosat +pycosat=0.6.3 +anaconda=custom -> _anaconda_depends -> pycosat + +Package xyzservices conflicts for: +dask=1.1.4 -> bokeh[version='>=0.13.0'] -> xyzservices[version='>=2021.09.1'] +_anaconda_depends=2019.03 -> bokeh -> xyzservices[version='>=2021.09.1'] + +Package brotlipy conflicts for: +anaconda-client=1.7.2 -> urllib3[version='<2.0.0a'] -> brotlipy[version='>=0.6.0'] +anaconda=custom -> _anaconda_depends -> brotlipy +_anaconda_depends=2019.03 -> urllib3 -> brotlipy[version='>=0.6.0'] + +Package libtool conflicts for: +_anaconda_depends=2019.03 -> libtool +anaconda=custom -> _anaconda_depends -> libtool +libtool=2.4.6 + +Package backports.os conflicts for: +anaconda=custom -> _anaconda_depends -> backports.os +_anaconda_depends=2019.03 -> backports.os +path.py=11.5.0 -> backports.os +backports.os=0.1.1 + +Package tbb4py conflicts for: +anaconda=custom -> _anaconda_depends -> tbb4py +mkl_random=1.0.2 -> numpy-base[version='>=1.0.2,<2.0a0'] -> tbb4py +_anaconda_depends=2019.03 -> numpy-base -> tbb4py + +Package libllvm8 conflicts for: +numba=0.43.1 -> llvmlite[version='>=0.28.0'] -> libllvm8[version='>=8.0.1,<8.1.0a0'] +_anaconda_depends=2019.03 -> llvmlite -> libllvm8[version='>=8.0.1,<8.1.0a0'] + +Package anaconda-anon-usage conflicts for: +anaconda-project=0.8.2 -> anaconda-client -> anaconda-anon-usage[version='>=0.4.0'] +_anaconda_depends=2019.03 -> anaconda-client -> anaconda-anon-usage[version='>=0.4.0'] + +Package pcre2 conflicts for: +dbus=1.13.6 -> libglib[version='>=2.70.2,<3.0a0'] -> pcre2[version='>=10.37,<10.38.0a0|>=10.40,<10.41.0a0|>=10.42,<10.43.0a0'] +pango=1.42.4 -> libglib[version='>=2.64.6,<3.0a0'] -> 
pcre2[version='>=10.37,<10.38.0a0|>=10.40,<10.41.0a0|>=10.42,<10.43.0a0'] + +Package cupti conflicts for: +tensorflow=1.13.1 -> tensorflow-base==1.13.1=gpu_py27h8f37b9b_0 -> cupti +torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> cupti + +Package unicodecsv conflicts for: +anaconda=custom -> _anaconda_depends -> unicodecsv +unicodecsv=0.14.1 +_anaconda_depends=2019.03 -> unicodecsv + +Package dill conflicts for: +_anaconda_depends=2019.03 -> dask -> dill[version='0.2.2|0.2.3|0.2.4|>=0.3.7|>=0.3.6|>=0.2'] +spyder=3.3.3 -> pylint -> dill[version='>=0.2|>=0.3.6|>=0.3.7'] + +Package cryptography-vectors conflicts for: +_anaconda_depends=2019.03 -> cryptography -> cryptography-vectors[version='2.3.*|2.3.1.*'] +urllib3=1.24.1 -> cryptography[version='>=1.3.4'] -> cryptography-vectors[version='2.3.*|2.3.1.*'] +pyopenssl=19.0.0 -> cryptography[version='>=2.2.1'] -> cryptography-vectors[version='2.3.*|2.3.1.*'] +secretstorage=3.1.1 -> cryptography -> cryptography-vectors[version='2.3.*|2.3.1.*'] + +Package xlrd conflicts for: +anaconda=custom -> _anaconda_depends -> xlrd +_anaconda_depends=2019.03 -> xlrd +xlrd=1.2.0 + +Package seaborn conflicts for: +anaconda=custom -> _anaconda_depends -> seaborn +_anaconda_depends=2019.03 -> seaborn +seaborn=0.9.0 + +Package mpi4py conflicts for: +keras-applications=1.0.7 -> h5py -> mpi4py[version='>=3.0'] +h5py=2.9.0 -> mpi4py +_anaconda_depends=2019.03 -> h5py -> mpi4py[version='>=3.0'] + +Package selectors2 conflicts for: +spyder-kernels=0.4.2 -> wurlitzer -> selectors2 +_anaconda_depends=2019.03 -> wurlitzer -> selectors2 + +Package referencing conflicts for: +_anaconda_depends=2019.03 -> jsonschema -> referencing[version='>=0.28.4'] +nbformat=4.4.0 -> jsonschema[version='>=2.4,!=2.5.0'] -> referencing[version='>=0.28.4'] + +Package pyside conflicts for: +nltk=3.4 -> matplotlib -> pyside[version='1.1.2|1.2.1'] +_anaconda_depends=2019.03 -> matplotlib -> pyside[version='1.1.2|1.2.1'] + +Package gevent conflicts for: 
+_anaconda_depends=2019.03 -> bokeh -> gevent==1.0.1 +anaconda=custom -> _anaconda_depends -> gevent +_anaconda_depends=2019.03 -> gevent +gevent=1.4.0 + +Package pbr conflicts for: +pytables=3.5.1 -> mock -> pbr[version='1.3.0|>=1.3'] +tensorflow=1.13.1 -> mock[version='>=2.0.0'] -> pbr[version='>=1.3'] +tensorflow-estimator=1.13.0 -> mock[version='>=2.0.0'] -> pbr[version='>=1.3'] + +Package keras-base conflicts for: +keras-applications=1.0.7 -> keras[version='>=2.1.6'] -> keras-base[version='2.2.0.*|2.2.2.*|2.2.4.*|2.3.1.*|2.4.3.*'] +keras-preprocessing=1.0.9 -> keras[version='>=2.1.6'] -> keras-base[version='2.2.0.*|2.2.2.*|2.2.4.*|2.3.1.*|2.4.3.*'] + +Package openpyxl conflicts for: +anaconda=custom -> _anaconda_depends -> openpyxl +_anaconda_depends=2019.03 -> openpyxl +openpyxl=2.6.1 + +Package distribute conflicts for: +_anaconda_depends=2019.03 -> pip -> distribute +python=3.6.8 -> pip -> distributeThe following specifications were found to be incompatible with your system: + + - feature:/linux-64::__cuda==11.7=0 + - feature:/linux-64::__glibc==2.27=0 + - feature:/linux-64::__linux==5.4.0=0 + - feature:/linux-64::__unix==0=0 + - feature:|@/linux-64::__cuda==11.7=0 + - feature:|@/linux-64::__glibc==2.27=0 + - feature:|@/linux-64::__linux==5.4.0=0 + - feature:|@/linux-64::__unix==0=0 + - _anaconda_depends=2019.03 -> click -> __unix + - _anaconda_depends=2019.03 -> click -> __win + - _anaconda_depends=2019.03 -> gst-plugins-base -> __glibc[version='>=2.17|>=2.17,<3.0.a0'] + - _anaconda_depends=2019.03 -> ipykernel -> __linux + - astropy=3.1.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - bitarray=0.8.3 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - blosc=1.15.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - bottleneck=1.2.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - bzip2=1.0.6 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - c-ares=1.15.0 -> 
libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - cairo=1.14.12 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - cffi=1.12.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - cryptography=2.6.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - cudatoolkit=9 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17'] + - cupy=6.0.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - curl=7.64.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - cython=0.29.6 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - cytoolz=0.9.0.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - dbus=1.13.6 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17'] + - distributed=1.26.0 -> click[version='>=6.6'] -> __unix + - distributed=1.26.0 -> click[version='>=6.6'] -> __win + - expat=2.2.6 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - fastcache=1.0.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - fastrlock=0.4 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - flask=1.0.2 -> click[version='>=5.1'] -> __unix + - flask=1.0.2 -> click[version='>=5.1'] -> __win + - fontconfig=2.13.0 -> libgcc-ng[version='>=4.9'] -> __glibc[version='>=2.17'] + - freetype=2.9.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - fribidi=1.0.5 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - gevent=1.4.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - glib=2.56.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - gmp=6.1.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - gmpy2=2.0.8 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - graphite2=1.3.13 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17'] + - greenlet=0.4.15 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - grpcio=1.16.1 -> libgcc-ng[version='>=7.3.0'] -> 
__glibc[version='>=2.17'] + - gst-plugins-base=1.14.0 -> gstreamer[version='>=1.14.0,<2.0a0'] -> __glibc[version='>=2.17|>=2.17,<3.0.a0'] + - gstreamer=1.14.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - h5py=2.9.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - harfbuzz=1.8.8 -> libgcc-ng[version='>=4.9'] -> __glibc[version='>=2.17'] + - hdf5=1.10.4 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - icu=58.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - ipykernel=5.1.0 -> ipython[version='>=5.0'] -> __linux + - ipykernel=5.1.0 -> ipython[version='>=5.0'] -> __unix + - ipykernel=5.1.0 -> ipython[version='>=5.0'] -> __win + - ipywidgets=7.4.2 -> ipykernel[version='>=4.5.1'] -> __linux + - ipywidgets=7.4.2 -> ipykernel[version='>=4.5.1'] -> __osx + - ipywidgets=7.4.2 -> ipykernel[version='>=4.5.1'] -> __win + - ipywidgets=7.4.2 -> ipython[version='>=4.0.0'] -> __unix + - jbig=2.1 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17'] + - jpeg=9b -> libgcc-ng[version='>=7.2.0'] -> __glibc[version='>=2.17'] + - jupyter=1.0.0 -> ipykernel -> __linux + - jupyter=1.0.0 -> ipykernel -> __win + - jupyter_console=6.0.0 -> ipykernel -> __linux + - jupyter_console=6.0.0 -> ipykernel -> __win + - jupyter_console=6.0.0 -> ipython -> __unix + - kiwisolver=1.0.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - krb5=1.16.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - lazy-object-proxy=1.3.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libcurl=7.64.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libedit=3.1.20181209 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libffi=3.2.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libpng=1.6.36 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libprotobuf=3.8.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libsodium=1.0.16 -> 
libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libssh2=1.8.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libtiff=4.0.10 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libtool=2.4.6 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17'] + - libuuid=1.0.3 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17'] + - libxcb=1.13 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17'] + - libxml2=2.9.9 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libxslt=1.1.33 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17'] + - llvmlite=0.28.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - lxml=4.3.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - lzo=2.10 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17'] + - markupsafe=1.1.1 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17'] + - matplotlib=3.0.3 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - mistune=0.8.4 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17'] + - mkl-service=1.1.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - mkl_fft=1.0.10 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - mkl_random=1.0.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - mpc=1.1.0 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17'] + - mpfr=4.0.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - msgpack-python=0.6.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - nccl=1.3.5 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - ncurses=6.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - ninja=1.9.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - notebook=5.7.8 -> ipykernel -> __linux + - notebook=5.7.8 -> ipykernel -> __win + - numba=0.43.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - numexpr=2.6.9 -> 
libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - numpy-base=1.16.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - numpy=1.16.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - openssl=1.1.1c -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pandas=0.24.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pango=1.42.4 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17'] + - pcre=8.43 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pillow=6.0.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pixman=0.38.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - psutil=5.6.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pycosat=0.6.3 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17'] + - pycrypto=2.6.1 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17'] + - pycurl=7.43.0.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pyodbc=4.0.26 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pyqt=5.9.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pyrsistent=0.14.11 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17'] + - pytables=3.5.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - python=3.6.8 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pytorch=1.1.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pywavelets=1.0.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pyyaml=5.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pyzmq=18.0.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - qt=5.9.7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - qtconsole=4.4.3 -> ipykernel[version='>=4.1'] -> __linux + - qtconsole=4.4.3 -> ipykernel[version='>=4.1'] -> __win + - readline=7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - 
ruamel_yaml=0.15.46 -> libgcc-ng[version='>=4.9'] -> __glibc[version='>=2.17'] + - scikit-image=0.15.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - scikit-learn=0.20.3 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - scipy=1.2.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - sip=4.19.8 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - snappy=1.1.7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - spyder-kernels=0.4.2 -> ipykernel[version='>4.9.0'] -> __linux + - spyder-kernels=0.4.2 -> ipykernel[version='>4.9.0'] -> __win + - sqlalchemy=1.3.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - sqlite=3.27.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - statsmodels=0.9.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - tensorboard=1.13.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - tensorflow=1.13.1 -> libgcc-ng[version='>=5.4.0'] -> __glibc[version='>=2.17'] + - tk=8.6.8 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - torchvision=0.3.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17|>=2.17,<3.0.a0'] + - torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> __cuda[version='>=11.8'] + - tornado=6.0.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - typed-ast=1.3.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - unixodbc=2.3.7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - urllib3=1.24.1 -> pysocks[version='>=1.5.6,<2.0,!=1.5.7'] -> __unix + - urllib3=1.24.1 -> pysocks[version='>=1.5.6,<2.0,!=1.5.7'] -> __win + - wrapt=1.11.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - xz=5.2.4 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - yaml=0.1.7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - zeromq=4.3.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - zlib=1.2.11 -> 
libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17'] + - zstd=1.3.7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + +Your installed version is: not available diff --git a/动态slam/run.txt b/动态slam/run.txt new file mode 100644 index 0000000..d394924 --- /dev/null +++ b/动态slam/run.txt @@ -0,0 +1,9 @@ +python evaluation.py --result_dir=./data/ --eva_seqs=../pose_est/06/06_pred + +python evaluate_kitti.py ./pose_gt/06.txt ./06_est.txt + +python tartanair_evaluator.py + + + +conda env create -f requirement.yml -p /root/miniconda3/envs/dfvo \ No newline at end of file diff --git a/动态slam/tartan.pdf b/动态slam/tartan.pdf new file mode 100644 index 0000000..de9ebaa --- /dev/null +++ b/动态slam/tartan.pdf @@ -0,0 +1,724 @@ + TartanVO: A Generalizable Learning-based VO + + Wenshan Wang∗ Yaoyu Hu Sebastian Scherer + + Carnegie Mellon University Carnegie Mellon University Carnegie Mellon University + +arXiv:2011.00359v1 [cs.CV] 31 Oct 2020 Abstract: We present the first learning-based visual odometry (VO) model, + which generalizes to multiple datasets and real-world scenarios, and outperforms + geometry-based methods in challenging scenes. We achieve this by leveraging + the SLAM dataset TartanAir, which provides a large amount of diverse synthetic + data in challenging environments. Furthermore, to make our VO model generalize + across datasets, we propose an up-to-scale loss function and incorporate the cam- + era intrinsic parameters into the model. Experiments show that a single model, + TartanVO, trained only on synthetic data, without any finetuning, can be general- + ized to real-world datasets such as KITTI and EuRoC, demonstrating significant + advantages over the geometry-based methods on challenging trajectories. Our + code is available at https://github.com/castacks/tartanvo. 
+ + Keywords: Visual Odometry, Generalization, Deep Learning, Optical Flow + + 1 Introduction + + Visual SLAM (Simultaneous Localization and Mapping) becomes more and more important for + autonomous robotic systems due to its ubiquitous availability and the information richness of im- + ages [1]. Visual odometry (VO) is one of the fundamental components in a visual SLAM system. + Impressive progress has been made in both geometric-based methods [2, 3, 4, 5] and learning-based + methods [6, 7, 8, 9]. However, it remains a challenging problem to develop a robust and reliable VO + method for real-world applications. + + On one hand, geometric-based methods are not robust enough in many real-life situations [10, 11]. + On the other hand, although learning-based methods demonstrate robust performance on many vi- + sual tasks, including object recognition, semantic segmentation, depth reconstruction, and optical + flow, we have not yet seen the same story happening to VO. + + It is widely accepted that by leveraging a large amount of data, deep-neural-network-based methods + can learn a better feature extractor than engineered ones, resulting in a more capable and robust + model. But why haven’t we seen the deep learning models outperform geometry-based methods yet? + We argue that there are two main reasons. First, the existing VO models are trained with insufficient + diversity, which is critical for learning-based methods to be able to generalize. By diversity, we + mean diversity both in the scenes and motion patterns. For example, a VO model trained only on + outdoor scenes is unlikely to be able to generalize to an indoor environment. Similarly, a model + trained with data collected by a camera fixed on a ground robot, with limited pitch and roll motion, + will unlikely be applicable to drones. Second, most of the current learning-based VO models neglect + some fundamental nature of the problem which is well formulated in geometry-based VO theories. 
From the theory of multi-view geometry, we know that recovering the camera pose from a sequence of monocular images has scale ambiguity. Besides, recovering the pose needs to take account of the camera intrinsic parameters (referred to as the intrinsics ambiguity later). Without explicitly dealing with the scale problem and the camera intrinsics, a model learned from one dataset would likely fail in another dataset, no matter how good the feature extractor is.
+
+To this end, we propose a learning-based method that can solve the above two problems and can generalize across datasets. Our contributions come in three folds. First, we demonstrate the crucial effects of data diversity on the generalization ability of a VO model by comparing performance on different quantities of training data. Second, we design an up-to-scale loss function to deal with the scale ambiguity of monocular VO. Third, we create an intrinsics layer (IL) in our VO model enabling generalization across different cameras. To our knowledge, our model is the first learning-based VO that has competitive performance in various real-world datasets without finetuning. Furthermore, compared to geometry-based methods, our model is significantly more robust in challenging scenes. A demo video can be found at: https://www.youtube.com/watch?v=NQ1UEh3thbU
+
+∗Corresponding author: wenshanw@andrew.cmu.edu
+4th Conference on Robot Learning (CoRL 2020), Cambridge MA, USA.
+
+2 Related Work
+
+Besides early studies of learning-based VO models [12, 13, 14, 15], more and more end-to-end learning-based VO models have been studied with improved accuracy and robustness. The majority of the recent end-to-end models adopt the unsupervised-learning design [6, 16, 17, 18], due to the complexity and the high cost associated with collecting ground-truth data. However, supervised models trained on labeled odometry data still have a better performance [19, 20].
+ +To improve the performance, end-to-end VO models tend to have auxiliary outputs related to camera +motions, such as depth and optical flow. With depth prediction, models obtain supervision signals +by imposing depth consistency between temporally consecutive images [17, 21]. This procedure can +be interpreted as matching the temporal observations in the 3D space. A similar effect of temporal +matching can be achieved by producing the optical flow, e.g., [16, 22, 18] jointly predict depth, +optical flow, and camera motion. + +Optical flow can also be treated as an intermediate representation that explicitly expresses the 2D +matching. Then, camera motion estimators can process the optical flow data rather than directly +working on raw images[20, 23]. If designed this way, components for estimating the camera motion +can even be trained separately on available optical flow data [19]. We follow these designs and use +the optical flow as an intermediate representation. + +It is well known that monocular VO systems have scale ambiguity. Nevertheless, most of the super- +vised learning models did not handle this issue and directly use the difference between the model +prediction and the true camera motion as the supervision [20, 24, 25]. In [19], the scale is handled +by dividing the optical flow into sub-regions and imposing a consistency of the motion predictions +among these regions. In non-learning methods, scale ambiguity can be solved if a 3D map is avail- +able [26]. Ummenhofer et al. [20] introduce the depth prediction to correcting the scale-drift. Tateno +et al. [27] and Sheng et al. [28] ameliorate the scale problem by leveraging the key-frame selection +technique from SLAM systems. Recently, Zhan et al. [29] use PnP techniques to explicitly solve +for the scale factor. The above methods introduce extra complexity to the VO system, however, the +scale ambiguity is not totally suppressed for monocular setups especially in the evaluation stage. 
+Instead, some models choose to only produce up-to-scale predictions. Wang et al. [30] reduce the +scale ambiguity in the monocular depth estimation task by normalizing the depth prediction before +computing the loss function. Similarly, we will focus on predicting the translation direction rather +than recovering the full scale from monocular images, by defining a new up-to-scale loss function. + +Learning-based models suffer from generalization issues when tested on images from a new en- +vironment or a new camera. Most of the VO models are trained and tested on the same dataset +[16, 17, 31, 18]. Some multi-task models [6, 20, 32, 22] only test their generalization ability on the +depth prediction, not on the camera pose estimation. Recent efforts, such as [33], use model adap- +tation to deal with new environments, however, additional training is needed on a per-environment +or per-camera basis. In this work, we propose a novel approach to achieve cross-camera/dataset +generalization, by incorporating the camera intrinsics directly into the model. + +Figure 1: The two-stage network architecture. The model consists of a matching network, which +estimates optical flow from two consecutive RGB images, followed by a pose network predicting +camera motion from the optical flow. + + 2 + 3 Approach + +3.1 Background + +We focus on the monocular VO problem, which takes two consecutive undistorted images {It, It+1}, +and estimates the relative camera motion δtt+1 = (T, R), where T ∈ R3 is the 3D translation and +R ∈ so(3) denotes the 3D rotation. According to the epipolar geometry theory [34], the geometry- +based VO comes in two folds. Firstly, visual features are extracted and matched from It and It+1. +Then using the matching results, it computes the essential matrix leading to the recovery of the +up-to-scale camera motion δtt+1. + +Following the same idea, our model consists of two sub-modules. 
One is the matching module +Mθ(It, It+1), estimating the dense matching result Ftt+1 from two consecutive RGB images (i.e. +optical flow). The other is a pose module Pφ(Ftt+1) that recovers the camera motion δtt+1 from the +matching result (Fig. 1). This modular design is also widely used in other learning-based methods, +especially in unsupervised VO [13, 19, 16, 22, 18]. + +3.2 Training on large scale diverse data + +The generalization capability has always been one of the most critical issues for learning-based +methods. Most of the previous supervised models are trained on the KITTI dataset, which contains +11 labeled sequences and 23,201 image frames in the driving scenario [35]. Wang et al. [8] presented +the training and testing results on the EuRoC dataset [36], collected by a micro aerial vehicle (MAV). +They reported that the performance is limited by the lack of training data and the more complex +dynamics of a flying robot. Surprisingly, most unsupervised methods also only train their models in +very uniform scenes (e.g., KITTI and Cityscape [37]). To our knowledge, no learning-based model +has yet shown the capability of running on multiple types of scenes (car/MAV, indoor/outdoor). To +achieve this, we argue that the training data has to cover diverse scenes and motion patterns. + +TartanAir [11] is a large scale dataset with highly diverse scenes and motion patterns, containing +more than 400,000 data frames. It provides multi-modal ground truth labels including depth, seg- +mentation, optical flow, and camera pose. The scenes include indoor, outdoor, urban, nature, and +sci-fi environments. The data is collected with a simulated pinhole camera, which moves with ran- +dom and rich 6DoF motion patterns in the 3D space. + +We take advantage of the monocular image sequences {It}, the optical flow labels {Ftt+1}, and the +ground truth camera motions {δtt+1} in our task. 
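The two-module design of Sec. 3.1 can be read as a simple composition; a minimal sketch with stand-in functions (`matching_net`, `pose_net`, and `estimate_motion` are hypothetical stubs, not the paper's PWC-Net/ResNet50 modules):

```python
import numpy as np

def matching_net(img_t, img_t1):
    """Stand-in for M_theta: two RGB frames (H, W, 3) -> dense optical flow (H, W, 2)."""
    h, w, _ = img_t.shape
    return np.zeros((h, w, 2))  # stub; a real matching network regresses the flow

def pose_net(flow):
    """Stand-in for P_phi: optical flow -> relative motion (T, R) as 6 numbers."""
    return np.zeros(6)  # stub; (tx, ty, tz) up to scale, plus 3 rotation parameters

def estimate_motion(img_t, img_t1):
    # Stage 1: dense matching (optical flow) between consecutive frames.
    flow = matching_net(img_t, img_t1)
    # Stage 2: recover the up-to-scale camera motion from the flow alone.
    return pose_net(flow)

delta = estimate_motion(np.zeros((480, 640, 3)), np.zeros((480, 640, 3)))
```

The point of the split is that the pose module never sees raw pixels, only the matching result, which is what makes the later intrinsics-layer trick possible.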
Our objective is to jointly minimize the optical flow loss Lf and the camera motion loss Lp. The end-to-end loss is defined as:
+
+  L = \lambda L_f + L_p = \lambda \| M_\theta(I_t, I_{t+1}) - F_t^{t+1} \| + \| P_\phi(\hat{F}_t^{t+1}) - \delta_t^{t+1} \|   (1)
+
+where λ is a hyper-parameter balancing the two losses. We use ˆ· to denote the estimated variable from our model.
+
+Since TartanAir is purely synthetic, the biggest question is: can a model learned from simulation data generalize to real-world scenes? As discussed by Wang et al. [11], a large number of studies show that by training purely in simulation but with broad diversity, the learned model can be easily transferred to the real world. This is also known as domain randomization [38, 39]. In our experiments, we show that the diverse simulated data indeed enable the VO model to generalize to real-world data.
+
+Figure 2: a) Illustration of the FoV and image resolution in TartanAir, EuRoC, and KITTI datasets. b) Calculation of the intrinsics layer.
+
+3.3 Up-to-scale loss function
+
+The motion scale is unobservable from a monocular image sequence. In geometry-based methods, the scale is usually recovered from other sources of information, ranging from known object size or camera height to extra sensors such as an IMU. However, in most existing learning-based VO studies, the models generally neglect the scale problem and try to recover the motion with scale. This is feasible if the model is trained and tested with the same camera and in the same type of scenario. For example, in the KITTI dataset, the camera is mounted at a fixed height above the ground with a fixed orientation. A model can learn to remember the scale in this particular setup. Obviously, the model will have huge problems when tested with a different camera configuration. Imagine if the
+
+camera in KITTI moves a little upwards and becomes higher from the ground; the same amount of camera motion would cause a smaller optical flow value on the ground, which is inconsistent with the training data. Although the model could potentially learn to pick up other clues such as object size, it is still not fully reliable across different scenes or environments.
+
+Following the geometry-based methods, we only recover an up-to-scale camera motion from the monocular sequences. Knowing that the scale ambiguity only affects the translation T, we design a new loss function for T and keep the loss for rotation R unchanged. We propose two up-to-scale loss functions for Lp: the cosine similarity loss L_p^{cos} and the normalized distance loss L_p^{norm}. L_p^{cos} is defined by the cosine angle between the estimated T̂ and the label T:
+
+  L_p^{cos} = \frac{\hat{T} \cdot T}{\max(\|\hat{T}\| \cdot \|T\|, \epsilon)} + \|\hat{R} - R\|   (2)
+
+Similarly, for L_p^{norm}, we normalize the translation vector before calculating the distance between the estimation and the label:
+
+  L_p^{norm} = \left\| \frac{\hat{T}}{\max(\|\hat{T}\|, \epsilon)} - \frac{T}{\max(\|T\|, \epsilon)} \right\| + \|\hat{R} - R\|   (3)
+
+where ε = 1e-6 is used to avoid division by zero. From our preliminary empirical comparison, the two formulations have similar performance. In the following sections, we use Eq. (3) in place of Lp in Eq. (1). Later, we show by experiments that the proposed up-to-scale loss function is crucial for the model's generalization ability.
+
+3.4 Cross-camera generalization by encoding camera intrinsics
+
+In epipolar geometry theory, the camera intrinsics are required when recovering the camera pose from the essential matrix (assuming the images are undistorted). In fact, learning-based methods are unlikely to generalize to data with different camera intrinsics. Imagine a simple case where the camera switches to a lens with a larger focal length.
Assume the resolution of the image remains the same; the same amount of camera motion will then introduce bigger optical flow values, which we call the intrinsics ambiguity.
+
+A tempting solution for the intrinsics ambiguity is warping the input images to match the camera intrinsics of the training data. However, this is not quite practical, especially when the cameras differ too much. As shown in Fig. 2-a, if a model is trained on TartanAir, the warped KITTI image only covers a small part of TartanAir's field of view (FoV). After training, a model learns to exploit cues from all possible positions in the FoV and the interrelationship among those cues. Some cues no longer exist in the warped KITTI images, leading to drastic performance drops.
+
+3.4.1 Intrinsics layer
+
+We propose to train a model that takes both RGB images and camera intrinsics as input, so that the model can directly handle images coming from various camera settings. Specifically, instead of recovering the camera motion T_t^{t+1} only from the feature matching F_t^{t+1}, we design a new pose network Pφ(F_t^{t+1}, K), which depends also on the camera intrinsic parameters K = {fx, fy, ox, oy}, where fx and fy are the focal lengths, and ox and oy denote the position of the principal point.
+
+Figure 3: The data augmentation procedure of random cropping and resizing. In this way we generate a wide range of camera intrinsics (FoV 40◦ to 90◦).
+
+As for the implementation, we concatenate an IL (intrinsics layer) K^c ∈ R^{2×H×W} (H and W are the image height and width, respectively) to F_t^{t+1} before going into Pφ. To compose K^c, we first generate two index matrices X_ind and Y_ind for the x and y axes in the 2D image frame (Fig. 2-b). Then the two channels of K^c are calculated from the following formula:
+
+  K^c_x = (X_ind − o_x)/f_x
+  K^c_y = (Y_ind − o_y)/f_y   (4)
+
+The concatenation of F_t^{t+1} and K^c augments the optical flow estimation with 2D position information. Similar to the situation where geometry-based methods have to know the 2D coordinates of the matched features, K^c provides the necessary position information. In this way, the intrinsics ambiguity is explicitly handled by coupling the 2D positions and the matching estimations (F_t^{t+1}).
+
+3.4.2 Data generation for various camera intrinsics
+
+To make a model generalizable across different cameras, we need training data with various camera intrinsics. TartanAir only has one set of camera intrinsics, where fx = fy = 320, ox = 320, and oy = 240. We simulate various intrinsics by randomly cropping and resizing (RCR) the input images. As shown in Fig. 3, we first crop the image at a random location with a random size. Next, we resize the cropped image to the original size. One advantage of the IL is that during RCR, we can crop and resize the IL with the image, without recomputing the IL. To cover typical cameras with FoV between 40◦ and 90◦, we find that using random resizing factors up to 2.5 is sufficient during RCR. Note that the ground truth optical flow should also be scaled with respect to the resizing factor. We use very aggressive cropping and shifting in our training, which means the optical center could be way off the image center. Although the resulting intrinsic parameters will be uncommon in modern cameras, we find the generalization is improved.
+
+4 Experimental Results
+
+4.1 Network structure and training detail
+
+Network We utilize the pre-trained PWC-Net [40] as the matching network Mθ, and a modified ResNet50 [41] as the pose network Pφ. We remove the batch normalization layers from the ResNet, and add two output heads for the translation and rotation, respectively. The PWC-Net outputs optical flow at size H/4 × W/4, so Pφ is trained on 1/4-size input, consuming very little GPU memory. The overall inference time (including both Mθ and Pφ) is 40 ms on an NVIDIA GTX 1080 GPU.
+
+Training Our model is implemented in PyTorch [42] and trained on 4 NVIDIA GTX 1080 GPUs. There are two training stages. First, Pφ is trained separately using ground truth optical flow and camera motions for 100,000 iterations with a batch size of 100. In the second stage, Pφ and Mθ are connected and jointly optimized for 50,000 iterations with a batch size of 64. During both training stages, the learning rate is set to 1e-4 with a decay rate of 0.2 at 1/2 and 7/8 of the total training steps. The RCR is applied on the optical flow, RGB images, and the IL (Sec 3.4.2).
+
+4.2 How the training data quantity affects the generalization ability
+
+Figure 4: Generalization ability with respect to different quantities of training data. Model Pφ is trained on true optical flow. Blue: training loss, orange: testing loss on three unseen environments. Testing loss drops constantly with increasing quantity of training data.
+
+Figure 5: Comparison of the loss curve w/ and w/o the up-to-scale loss function. a) The training and testing loss w/o the up-to-scale loss. b) The translation and rotation losses of a). A big gap exists between the training and testing translation losses (orange arrow in b)). c) The training and testing losses w/ the up-to-scale loss. d) The translation and rotation losses of c). The translation loss gap decreases.
+
+To show the effects of data diversity, we compare the generalization ability of the model trained with different amounts of data. We use 20 environments from the TartanAir dataset, and set aside 3 environments (Seaside-town, Soul-city, and Hongkong-alley) only for testing, which results in more than 400,000 training frames and about 40,000 testing frames. As a comparison, the KITTI and EuRoC datasets provide 23,201 and 26,604 pose-labeled frames, respectively. Besides, data in KITTI and EuRoC are much more uniform in the sense of scene type and motion pattern. As shown in Fig.
4, we set up three experiments using 20,000 (comparable to KITTI and EuRoC), 100,000, and 400,000 frames of data for training the pose network Pφ. The experiments show that the generalization ability, measured by the gap between the training loss and the testing loss on unseen environments, improves constantly with increasing training data.

4.3 Up-to-scale loss function

Without the up-to-scale loss, we observe a gap between the training and testing loss even when training with a large amount of data (Fig. 5-a). When we plot the translation loss and rotation loss separately (Fig. 5-b), it shows that the translation error is the main contributor to the gap. After we apply the up-to-scale loss function described in Sec 3.3, the translation loss gap decreases (Fig. 5-c,d). During testing, we align the translation with the ground truth to recover the scale in the same way as described in [16, 6].

4.4 Camera intrinsics layer

The IL is critical to the generalization ability across datasets. Before moving to other datasets, we first design an experiment to investigate the properties of the IL using the pose network Pφ. As shown in Table 1, in the first two columns, where the data has no RCR augmentation, the training and testing losses are low. But these two models output nonsense values on data with RCR augmentation. One interesting finding is that adding the IL does not help in the case of only one type of intrinsics. This indicates that the network has learned a very different algorithm from the geometry-based methods, where the intrinsics are necessary to recover the motion. The last two columns show that the IL is critical when the input data is augmented by RCR (i.e., various intrinsics). Another interesting observation is that training a model with RCR and the IL leads to a lower testing loss (last column) than training on only one type of intrinsics (first two columns).
This indicates that by generating data with various intrinsics, we learn a more robust model for the VO task.

Table 1: Training and testing losses with four combinations of RCR and IL settings. The IL is critical in the presence of RCR. The model trained with RCR reaches a lower testing loss than those without RCR.

Training configuration     w/o RCR, w/o IL   w/o RCR, w/ IL   w/ RCR, w/o IL   w/ RCR, w/ IL
Training loss              0.0325            0.0311           0.1534           0.0499
Test loss on data w/ RCR   -                 -                0.1999           0.0723
Test loss on data w/o RCR  0.0744            0.0714           0.1630           0.0549

Table 2: Comparison of translation and rotation on the KITTI dataset. DeepVO [43] is a supervised method trained on Seq 00, 02, 08, 09. It contains an RNN module, which accumulates information from multiple frames. Wang et al. [9] is a supervised method trained on Seq 00-08 that uses the semantic information of multiple frames to optimize the trajectory. UnDeepVO [44] and GeoNet [16] are trained on Seq 00-08 in an unsupervised manner. VISO2-M [45] and ORB-SLAM [3] are geometry-based monocular VO. ORB-SLAM uses bundle adjustment on multiple frames to optimize the trajectory. Our method works in a pure VO manner (it only takes two frames). It has never seen any KITTI data before testing, and yet achieves competitive results.

                     Seq 06          Seq 07          Seq 09          Seq 10          Ave
                     trel    rrel    trel    rrel    trel    rrel    trel    rrel    trel    rrel
DeepVO [43]*†        5.42    5.82    3.91    4.60    -       -       8.11    8.83    5.81    6.41
Wang et al. [9]*†    -       -       -       -       8.04    1.51    6.23    0.97    7.14    1.24
UnDeepVO [44]*       6.20    1.98    3.15    2.48    -       -       10.63   4.65    6.66    3.04
GeoNet [16]*         9.28    4.34    8.27    5.93    26.93   9.54    20.73   9.04    16.3    7.21
VISO2-M [45]         7.3     6.14    23.61   19.11   4.04    1.43    25.2    3.8     15.04   7.62
ORB-SLAM [3]†        18.68   0.26    10.96   0.37    15.3    0.26    3.71    0.3     12.16   0.3
TartanVO (ours)      4.72    2.95    4.32    3.41    6.0     3.11    6.89    2.73    5.48    3.05

trel: average translational RMSE drift (%) on lengths of 100–800 m.
rrel: average rotational RMSE drift (°/100 m) on lengths of 100–800 m.
*: starred methods are trained or finetuned on the KITTI dataset.
†: these methods use multiple frames to optimize the trajectory after the VO process.

4.5 Generalization to real-world data without finetuning

KITTI dataset The KITTI dataset is one of the most influential datasets for VO/SLAM tasks. We compare our model, TartanVO, with two supervised learning models (DeepVO [43], Wang et al. [9]), two unsupervised models (UnDeepVO [44], GeoNet [16]), and two geometry-based methods (VISO2-M [45], ORB-SLAM [3]). All the learning-based methods except ours are trained on the KITTI dataset. Note that our model has not been finetuned on KITTI and is trained purely on a synthetic dataset. Moreover, many algorithms use multiple frames to further optimize the trajectory; in contrast, our model only takes two consecutive images. As listed in Table 2, TartanVO achieves comparable performance even though neither finetuning nor backend optimization is performed.

EuRoC dataset The EuRoC dataset contains 11 sequences collected by a MAV in an indoor environment. There are three levels of difficulty with respect to the motion pattern and lighting conditions. Few learning-based methods have ever been tested on EuRoC due to the lack of training data. The changing lighting conditions and aggressive rotations pose real challenges to geometry-based methods as well. In Table 3, we compare with geometry-based methods including SVO [46], ORB-SLAM [3], DSO [5], and LSD-SLAM [2]. Note that all these geometry-based methods perform some type of backend optimization on selected keyframes along the trajectory. In contrast, our model only estimates the frame-by-frame camera motion, and can be considered the frontend module of these geometry-based methods. In Table 3, we show the absolute trajectory error (ATE) of 6 medium and difficult trajectories.
Our method shows the best performance on the two most difficult trajectories, VR1-03 and VR2-03, where the MAV has very aggressive motion. A visualization of the trajectories is shown in Fig. 6.

Challenging TartanAir data TartanAir provides 16 very challenging testing trajectories² that cover many extremely difficult cases, including changing illumination, dynamic objects, fog and rain effects, lack of features, and large motion. As listed in Table 4, we compare our model with ORB-SLAM using the ATE. Our model shows more robust performance in these challenging cases.

²https://github.com/castacks/tartanair_tools#download-the-testing-data-for-the-cvpr-visual-slam-challenge

Table 3: Comparison of ATE on the EuRoC dataset. We are among the very few learning-based methods that can be tested on this dataset. As with the geometry-based methods, our model has never seen the EuRoC data before testing. We show the best performance on the two difficult sequences VR1-03 and VR2-03. Note that our method does not contain any backend optimization module.

                   Seq.              MH-04   MH-05   VR1-02   VR1-03   VR2-02   VR2-03
Geometry-based *   SVO [46]          1.36    0.51    0.47     x        0.47     x
                   ORB-SLAM [3]      0.20    0.19    x        x        0.07     x
                   DSO [5]           0.25    0.11    0.11     0.93     0.13     1.16
                   LSD-SLAM [2]      2.13    0.85    1.11     x        x        x
Learning-based †   TartanVO (ours)   0.74    0.68    0.45     0.64     0.67     1.04

* These results are from [46]. † Other learning-based methods [36] did not report numerical results.

Figure 6: Visualization of the 6 EuRoC trajectories in Table 3. Black: ground truth trajectory; orange: estimated trajectory.

Table 4: Comparison of ATE on the TartanAir dataset. These trajectories are not contained in the training set. We repeatedly run ORB-SLAM 5 times and report the best result.
Seq               MH000   MH001   MH002   MH003   MH004   MH005   MH006   MH007
ORB-SLAM [3]      1.3     0.04    2.37    2.45    x       x       21.47   2.73
TartanVO (ours)   4.88    0.26    2       0.94    1.07    3.19    1       2.04

Figure 7: TartanVO outputs competitive results on D435i IR data compared to the T265 (equipped with a fish-eye stereo camera and an IMU). a) The hardware setup. b) Trial 1: smooth and slow motion. c) Trial 2: smooth and medium speed. d) Trial 3: aggressive and fast motion. See videos for details.

RealSense Data Comparison We test TartanVO using data collected by a customized sensor setup. As shown in Fig. 7 a), a RealSense D435i is fixed on top of a RealSense T265 tracking camera. We use the left near-infrared (IR) image of the D435i in our model and compare the result with the trajectories provided by the T265 tracking camera. We present 3 loopy trajectories following similar paths with increasing motion difficulty. From Fig. 7 b) to d), we observe that although TartanVO has never seen real-world images or IR data during training, it still generalizes well and predicts odometry closely matching the output of the T265, a dedicated device that estimates camera motion with a fish-eye stereo camera pair and an IMU.

5 Conclusions

We presented TartanVO, a generalizable learning-based visual odometry. By training our model with a large amount of data, we showed the effectiveness of diverse data for model generalization. A smaller gap between training and testing losses can be expected with the newly defined up-to-scale loss, further increasing the generalization capability. We showed through extensive experiments that, equipped with the intrinsics layer designed explicitly for handling different cameras, TartanVO can generalize to unseen datasets and achieve performance even better than dedicated learning models trained directly on those datasets. Our work opens up many exciting future research directions, such as generalizable learning-based VIO, stereo VO, and multi-frame VO.
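The up-to-scale loss highlighted in the conclusions (Sec. 4.3) can be made concrete with a short sketch. This is an illustrative numpy version assuming a unit-normalization form with an ε guard on the predicted translation; the exact distance and weighting used in the paper's Sec. 3.3 may differ.

```python
import numpy as np

def up_to_scale_loss(t_pred, t_gt, r_pred, r_gt, eps=1e-6):
    """Compare translations only up to scale: both the predicted and the
    ground-truth translation are normalized to unit length before taking
    the distance, so the (unobservable) monocular scale is not penalized.
    The rotation term is an ordinary distance."""
    t_pred_n = t_pred / max(np.linalg.norm(t_pred), eps)
    t_gt_n = t_gt / max(np.linalg.norm(t_gt), eps)
    return float(np.linalg.norm(t_pred_n - t_gt_n) + np.linalg.norm(r_pred - r_gt))
```

Under this loss, a prediction that differs from the ground truth only by a scale factor incurs zero translation penalty, which is exactly why the train/test translation gap in Fig. 5 shrinks.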
Acknowledgments

This work was supported by ARL award #W911NF1820218. Special thanks to Yuheng Qiu and Huai Yu from Carnegie Mellon University for preparing simulation results and experimental setups.

References

[1] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha. Visual simultaneous localization and mapping: a survey. Artificial Intelligence Review, 43(1):55–81, 2015.

[2] J. Engel, T. Schops, and D. Cremers. LSD-SLAM: Large-scale direct monocular slam. In ECCV, 2014.

[3] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE transactions on robotics, 31(5):1147–1163, 2015.

[4] C. Forster, M. Pizzoli, and D. Scaramuzza. Svo: Fast semi-direct monocular visual odometry. In ICRA, pages 15–22. IEEE, 2014.

[5] J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry. IEEE transactions on pattern analysis and machine intelligence, 40(3):611–625, 2017.

[6] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.

[7] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki. Sfm-net: Learning of structure and motion from video. In arXiv:1704.07804, 2017.

[8] S. Wang, R. Clark, H. Wen, and N. Trigoni. End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks. The International Journal of Robotics Research, 37(4-5):513–542, 2018.

[9] X. Wang, D. Maturana, S. Yang, W. Wang, Q. Chen, and S. Scherer. Improving learning-based ego-motion estimation with homomorphism-based losses and drift correction. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 970–976. IEEE, 2019.

[10] G. Younes, D. Asmar, E. Shammas, and J. Zelek. Keyframe-based monocular slam: design, survey, and future directions. Robotics and Autonomous Systems, 98:67–88, 2017.

[11] W. Wang, D. Zhu, X. Wang, Y.
Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer. Tartanair: A dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.

[12] R. Roberts, H. Nguyen, N. Krishnamurthi, and T. Balch. Memory-based learning for visual odometry. In Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on, pages 47–52. IEEE, 2008.

[13] V. Guizilini and F. Ramos. Semi-parametric models for visual odometry. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 3482–3489. IEEE, 2012.

[14] T. A. Ciarfuglia, G. Costante, P. Valigi, and E. Ricci. Evaluation of non-geometric methods for visual odometry. Robotics and Autonomous Systems, 62(12):1717–1730, 2014.

[15] K. Tateno, F. Tombari, I. Laina, and N. Navab. Cnn-slam: Real-time dense monocular slam with learned depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6243–6252, 2017.

[16] Z. Yin and J. Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2018.

[17] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 340–349, 2018.

[18] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[19] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia. Exploring representation learning with cnns for frame-to-frame ego-motion estimation.
RAL, 1(1):18–25, 2016.

[20] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. Demon: Depth and motion network for learning monocular stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[21] N. Yang, L. v. Stumberg, R. Wang, and D. Cremers. D3vo: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[22] Y. Zou, Z. Luo, and J.-B. Huang. Df-net: Unsupervised joint learning of depth and flow using cross-task consistency. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.

[23] H. Zhou, B. Ummenhofer, and T. Brox. Deeptam: Deep tracking and mapping. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.

[24] C. Tang and P. Tan. Ba-net: Dense bundle adjustment network. arXiv preprint arXiv:1806.04807, 2018.

[25] R. Clark, M. Bloesch, J. Czarnowski, S. Leutenegger, and A. J. Davison. Ls-net: Learning to solve nonlinear least squares for monocular stereo. arXiv preprint arXiv:1809.02966, 2018.

[26] H. Li, W. Chen, J. Zhao, J.-C. Bazin, L. Luo, Z. Liu, and Y.-H. Liu. Robust and efficient estimation of absolute camera pose for monocular visual odometry. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), May 2020.

[27] K. Tateno, F. Tombari, I. Laina, and N. Navab. Cnn-slam: Real-time dense monocular slam with learned depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[28] L. Sheng, D. Xu, W. Ouyang, and X. Wang. Unsupervised collaborative learning of keyframe detection and visual odometry towards monocular deep slam. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

[29] H. Zhan, C. S. Weerasekera, J.-W.
Bian, and I. Reid. Visual odometry revisited: What should be learnt? In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), May 2020.

[30] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[31] Y. Wang, P. Wang, Z. Yang, C. Luo, Y. Yang, and W. Xu. Unos: Unified unsupervised optical-flow and stereo-depth estimation by watching videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[32] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[33] S. Li, X. Wang, Y. Cao, F. Xue, Z. Yan, and H. Zha. Self-supervised deep visual odometry with online adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[34] D. Nistér. An efficient solution to the five-point relative pose problem. IEEE transactions on pattern analysis and machine intelligence, 26(6):756–770, 2004.

[35] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.

[36] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart. The euroc micro aerial vehicle datasets. The International Journal of Robotics Research, 35(10):1157–1163, 2016.

[37] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.

[38] J.
Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In IROS, pages 23–30. IEEE, 2017.

[39] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In CVPR Workshops, pages 969–977, 2018.

[40] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8934–8943, 2018.

[41] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[42] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.

[43] S. Wang, R. Clark, H. Wen, and N. Trigoni. Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 2043–2050. IEEE, 2017.

[44] R. Li, S. Wang, Z. Long, and D. Gu. Undeepvo: Monocular visual odometry through unsupervised deep learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7286–7291. IEEE, 2018.

[45] S. Song, M. Chandraker, and C. Guest. High accuracy monocular SFM and scale correction for autonomous driving. IEEE Transactions on Pattern Analysis & Machine Intelligence, pages 1–1, 2015.

[46] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza. Svo: Semidirect visual odometry for monocular and multicamera systems. IEEE Transactions on Robotics, 33(2):249–265, 2016.
A Additional experimental details

In this section, we provide additional details of the experiments, including the network structure, training parameters, qualitative results, and quantitative results.

A.1 Network Structure

Our network consists of two sub-modules, namely the matching network Mθ and the pose network Pφ. As mentioned in the paper, we employ PWC-Net as the matching network, which takes in two consecutive images of size 640 × 448 (PWC-Net only accepts image sizes that are multiples of 64). The output optical flow, which is 160 × 112 in size, is fed into the pose network. The structure of the pose network is detailed in Table 5. The overall inference time (including both Mθ and Pφ) is 40 ms on an NVIDIA GTX 1080 GPU.

Table 5: Parameters of the proposed pose network. Constructions of residual blocks are designated in brackets, multiplied by the number of stacked blocks. Downsampling is performed by Conv1 and at the beginning of each residual block. After the residual blocks, we reshape the feature map into a one-dimensional vector, which goes through three fully connected layers in the translation head and the rotation head, respectively.
Name      Layer setting       Output dimension
Input     -                   1/4 H × 1/4 W × 2       (112 × 160)
Conv1     3 × 3, 32           1/8 H × 1/8 W × 32      (56 × 80)
Conv2     3 × 3, 32           1/8 H × 1/8 W × 32      (56 × 80)
Conv3     3 × 3, 32           1/8 H × 1/8 W × 32      (56 × 80)

ResBlock
Block1    [3 × 3, 64] × 3     1/16 H × 1/16 W × 64    (28 × 40)
Block2    [3 × 3, 128] × 4    1/32 H × 1/32 W × 128   (14 × 20)
Block3    [3 × 3, 128] × 6    1/64 H × 1/64 W × 128   (7 × 10)
Block4    [3 × 3, 256] × 7    1/128 H × 1/128 W × 256 (4 × 5)
Block5    [3 × 3, 256] × 3    1/256 H × 1/256 W × 256 (2 × 3)

FC trans                          FC rot
Trans head fc1   256 · 6 × 128    Rot head fc1   256 · 6 × 128
Trans head fc2   128 × 32         Rot head fc2   128 × 32
Trans head fc3   32 × 3           Rot head fc3   32 × 3
Output           3                Output         3

Table 6: Comparison of ORB-SLAM and TartanVO on the TartanAir dataset using the ATE metric. These trajectories are not contained in the training set. We repeatedly run ORB-SLAM for 5 times and report the best result.

Seq               SH000   SH001   SH002   SH003   SH004   SH005   SH006   SH007
ORB-SLAM          x       3.5     x       x       x       x       x       x
TartanVO (ours)   2.52    1.61    3.65    0.29    3.36    4.74    3.72    3.06

A.2 Testing Results on TartanAir

TartanAir provides 16 challenging testing trajectories. We reported 8 trajectories in the experiment section; the remaining 8 trajectories are shown in Table 6. We compare TartanVO against the ORB-SLAM monocular algorithm. Due to the randomness in ORB-SLAM, we repeatedly run ORB-SLAM for 5 trials and report the best result. We consider a trial a failure if ORB-SLAM tracks less than 80% of the trajectory. A visualization of all 16 trajectories (including the 8 trajectories shown in the experiment section) is shown in Figure 8.

Figure 8: Visualization of the 16 testing trajectories in the TartanAir dataset. The black dashed line represents the ground truth. The estimated trajectories of TartanVO and the ORB-SLAM monocular algorithm are shown in orange and blue, respectively.
The ORB-SLAM algorithm frequently loses tracking in these challenging cases; it fails on 9 of the 16 testing trajectories. Note that we run full-fledged ORB-SLAM with the local bundle adjustment, global bundle adjustment, and loop closure components. In contrast, although TartanVO only takes in two images, it is much more robust than ORB-SLAM.

diff --git a/动态slam/tartanvo average time.txt b/动态slam/tartanvo average time.txt
new file mode 100644
index 0000000..4578211
--- /dev/null
+++ b/动态slam/tartanvo average time.txt
tartanvo

shibuya_Standing01
sum: 99
total time: 8.106080770492554
average time: 0.08458585690970373

KITTI 04 sequence
sum: 270
total time: 20.52476716041565
average time: 0.07601765614968758

diff --git a/武博文-学术学位研究生学位论文中期考评表.docx b/武博文-学术学位研究生学位论文中期考评表.docx
new file mode 100644
index 0000000..1a36032
--- /dev/null
+++ b/武博文-学术学位研究生学位论文中期考评表.docx
University of Electronic Science and Technology of China
Mid-term Evaluation Form for the Academic Degree Postgraduate Thesis
Degree level: □ Doctoral   Master
Major: Software Engineering
School: School of Information and Software Engineering
Student ID: 202221090225
Name: Wu Bowen
Thesis title: Research on Visual SLAM Based on Instance Segmentation in Outdoor Dynamic Scenes
Supervisor: Wang Chunyu
Date: September 15, 2024
Graduate School of UESTC

Main work completed
1. Thesis proposal passed on: December 21, 2023
2. Coursework
Have the credit requirements of the training program been met?
□ Yes   No
3.
Thesis research progress
Summarize the theoretical analysis or computation and the experimental (or empirical) work (may continue on additional pages)

I. A moving-object discrimination algorithm based on instance segmentation and optical flow detection

Theoretical analysis

Moving-object discrimination is a key step in the whole dynamic SLAM problem: how well it is solved directly affects the camera pose estimation and the back-end mapping quality. The problem is to decide whether an object's spatial position has moved between two camera frames. Using only the semantic information obtained from instance segmentation, one can only label objects of known dynamic classes as dynamic; one cannot determine whether an object has actually moved in the current image. Moreover, semantic information fails for unknown moving objects. Therefore, on the basis of ORB_SLAM2, a moving-object discrimination method based on real-time optical flow detection and instance segmentation is designed.

Optical flow detection is a technique for estimating pixel motion in an image sequence, i.e., the motion trajectories of pixels between consecutive frames. Instance segmentation not only detects the objects in an image but also produces a precise pixel-level mask for each object instance; it must distinguish not only different classes but also different instances of the same class.

The designed moving-object discrimination method is as follows; the algorithm flow is shown in Fig. 1-1:

Fig. 1-1 Flow chart of the moving-object discrimination algorithm

First, candidate moving objects are determined: instance segmentation yields the object masks of the current frame, and the nonzero pixels of instance O_i are taken as candidate dynamic points p_io. Meanwhile, optical flow detection gives the flow of each pixel p_i in the current frame, with a component f_ix in the x direction and f_iy in the y direction; a nonzero component indicates motion of the pixel in that direction. The two components are therefore combined into a single flow magnitude f_i, computed as in Eq. (1):

    f_i = sqrt(f_ix² + f_iy²)    (1)

A flow threshold Th_f is set; when the flow magnitude f_i exceeds the threshold, the pixel is considered to exhibit flow motion and is taken as a flow-dynamic point p_if; otherwise it is a static point. Th_f is set to 0.12 in the system. For the candidate dynamic points of instance O_i, the flow magnitudes and the number of flow-dynamic points are computed. The motion state D_i of instance O_i can be expressed via Eq. (2):

    r_d = p_if / p_io    (2)

where r_d is the proportion of flow-dynamic points among the candidate dynamic points of the instance, p_io is the total number of candidate dynamic points of instance O_i, and p_if is the total number of flow-dynamic points; in this way the optical flow information is fused with the semantic information. Finally, D_i determines the motion state of the object, 0 for static and 1 for moving, as in Eq. (3):

    D_i = 1 if r_d ≥ Th_d, 0 if r_d < Th_d    (3)
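The per-instance discrimination described by Eqs. (1)–(3) can be sketched as follows. This is an illustrative numpy version: the instance masks and flow field would come from the instance-segmentation and optical-flow networks, Th_f = 0.12 is the value stated above, and the Th_d value is an assumed placeholder since the report does not state it.

```python
import numpy as np

def instance_motion_state(flow, masks, th_f=0.12, th_d=0.5):
    """Label each segmented instance as moving (1) or static (0).
    flow:  H x W x 2 optical-flow field (f_ix, f_iy per pixel).
    masks: list of H x W boolean instance masks from instance segmentation.
    th_f:  flow-magnitude threshold (0.12 in the report).
    th_d:  ratio threshold Th_d (assumed value; not given in the report)."""
    f = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)   # Eq. (1)
    states = []
    for mask in masks:
        p_io = int(mask.sum())                  # candidate dynamic points
        if p_io == 0:
            states.append(0)
            continue
        p_if = int((f[mask] > th_f).sum())      # flow-dynamic points
        r_d = p_if / p_io                       # Eq. (2)
        states.append(1 if r_d >= th_d else 0)  # Eq. (3)
    return states
```

An instance whose flow-dynamic ratio reaches Th_d is treated as truly moving, and its feature points can then be excluded from (or tracked separately to) the camera pose estimation.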