commit 7cec2d9ba0acf19eeb6a6b7a8a7fd17dd40cc56a Author: wubw <879367232@qq.com> Date: Wed Apr 23 15:46:42 2025 +0800 first commit diff --git a/202221090225_武博文_中期答辩.pptx b/202221090225_武博文_中期答辩.pptx new file mode 100644 index 0000000..a4ed39c Binary files /dev/null and b/202221090225_武博文_中期答辩.pptx differ diff --git a/202221090225_武博文_开题.pptx b/202221090225_武博文_开题.pptx new file mode 100644 index 0000000..65574f8 Binary files /dev/null and b/202221090225_武博文_开题.pptx differ diff --git a/202221090225_武博文_开题报告表.docx b/202221090225_武博文_开题报告表.docx new file mode 100644 index 0000000..e4ae7dd --- /dev/null +++ b/202221090225_武博文_开题报告表.docx @@ -0,0 +1,231 @@ + 电 子 科 技 大 学 + 学术学位研究生学位论文开题报告表 + 攻读学位级别: □博士 硕士 + 学科专业: 软件工程 + 学 院: 信息与软件工程学院 + 学 号: 202221090225 + 姓 名: 武博文 + 论文题目: 室外动态场景下基于实例 + 分割的视觉SLAM研究 + 指导教师: 王春雨 + 填表日期: 2023 年 12 月 15 日 + 电子科技大学研究生院 + + 学位论文研究内容 + 课题类型 +□基础研究 □应用基础研究 应用研究 + 课题来源 +□纵向 □横向 自拟 + 学 + 位 + 论 + 文 + 研 + 究 + 内 + 容 +学位论文的研究目标、研究内容及拟解决的关键性问题(可续页) + 研究目标 +目前机器人SLAM算法主要分为激光SLAM和视觉SLAM,区别在于传感器分别是激光雷达和相机。随着移动机器人的普及以及应用场景的增多,激光SLAM由于激光雷达的高价格,难以应用在小电器以及低成本机器人上,而视觉SLAM凭借相机价格较低,体积较少,能够采集多维度信息等优势,逐渐成为目前SLAM算法中研究的主流方向。 +视觉同步定位与建图(Visual Simultaneous Localization And Mapping,V-SLAM)在机器人视觉感知领域中占有重要地位。最先进的V-SLAM算法提供了高精度定位和场景重建的能力[[1][]]。然而,它们大多忽略了动态对象所产生的不良影响。在这些研究中,环境被认为是完全静止的,这种强假设使得系统在复杂的动态环境中会产生严重的误差导致位姿估计误差较大,甚至导致定位失败。因此研究动态场景下的相机运动和物体运动是十分有必要的。 +拟在存在动态物体的室外场景下,使用相机作为传感器,研究如何区分真正移动的动态物体和潜在运动但是静止的物体,更好地利用静态特征点,提高相机运动估计的准确性和SLAM系统的鲁棒性。 + 研究内容 + 动态场景作为V-SLAM走向实际应用的一大阻碍,具有较大的难度和挑战性。也是许多学者研究的内容。本文拟研究在室外动态场景下如何识别动态物体,设计动态物体识别算法,将动态物体对相机位姿估计的影响降低,获得较为精准的相机位姿。在获得较为精准的相机位姿后,跟踪动态物体,建立动态物体跟踪集合,对新出现的物体和消失的物体记录。最后,将观测量,如相机位姿和物体位姿等传入后端,建立全局优化,根据优化后的地图点建立地图。 + 针对如何识别室外动态物体的问题,研究深度学习和几何约束相结合的动态点判定方法,设计识别运动物体的算法,去除语义信息未包括的动点,正确恢复相机位姿。 + 针对运动物体跟踪,研究在语义信息中的不同物体的跟踪方法,设计区分不同物体以及其运动,恢复运动物体的位姿。 + 针对后端优化,研究应用动态物体信息的优化方法,同时优化相机位姿和物体位姿,得到更精确的相机位姿。 + 拟解决的关键性问题 + 动态物体判别问题 + 
动态物体判别是整个动态SLAM问题要解决的一个关键环节,其最终解决的效果好坏直接影响到相机位姿的估计和后端的建图效果。该问题解决的是在相机不同的两帧之间,物体的空间位置是否发生了移动的问题。只使用实例分割获得的语义信息只能判定已知语义的物体是动态物体,但是不能确定在当前图像物体是否真的发生了移动。同时在处理未知运动物体方面,语义信息会失效,需要结合几何信息设计一种算法来判定物体是否真正运动,以将动态物体的特征点与静态背景特征点做区分。 + 动态物体跟踪问题 + 对动态点的处理常常是在判定为动态点后,直接将其从特征点中去除,不再考虑这些特征点的意义。但是这些特征点也是地图中的点,对于动态物体存在跟踪的价值,因此研究动态物体所产生的特征点的存储和利用是关键点。在动态场景下,动态特征点非常可能不是来源一个物体,即在一个图像中可能存在多个动态物体,因此需要研究不同物体在不同帧间的关联关系,建立唯一的匹配,实现动态物体的分别跟踪。 + 同步跟踪和优化问题 + 在求解相机位姿后,跟踪动态物体的运动,获得运动物体的位姿,物体运动信息是预测得来的信息,可以经过局部优化或全局优化获得更精准的信息。但一般的优化只进行线性优化或者只对相机位姿优化,忽略了动态物体点的有效信息。因此拟设计一种优化的过程,确定优化变量,实现更准确的位姿估计,生成更准确的地图点,解决动态物体有效信息不完全利用的问题。 + + + 学位论文研究依据 +学位论文的选题依据和研究意义,国内外研究现状和发展态势,主要参考文献,以及已有的工作积累和研究成果。(应有2000字) + 选题依据和研究意义 + 同步定位与地图构建(SLAM)是搭载激光雷达、IMU(Inertial Measurement Unit)、相机等传感器的移动载体在未知环境下同步进行定位与地图构建的过程[[][2][]]。SLAM一般可分为激光SLAM和视觉SLAM。激光SLAM利用激光雷达、编码器和惯性测量单元(IMU)等多种传感器相结合,已在理论和应用方面相对成熟。然而,激光雷达具有较高的价格使其难以普及到个人小型设备,并且雷达信息获取量有限。视觉SLAM利用视觉传感器,如单目、双目和RGB-D(带有深度信息的彩色图像)相机等,来构建环境地图。相机能够获取丰富的图像信息,并且视觉传感器具有低廉的价格,简单的结构和小巧便携的特点,因此成为近年来研究者们关注的热点,也成为SLAM技术中的主要研究方向。视觉SLAM能够广泛应用于无人驾驶,自主机器人,导盲避障等领域,对视觉SLAM的研究具有现实意义。 + 经过近二十年的发展,视觉同时定位与建图(Visual Simultaneous Localization And Mapping,V-SLAM)框架已趋于成熟。现阶段,V-SLAM系统大多数建立在非动态环境的假设上,即假设移动载体在跟踪过程中不存在动态物体。然而,这种假设是一种强假设,在现实场景中很难成立。在室内场景中,常出现移动的人和桌椅等等;在室外场景中,常常出现移动的车和动物等等,这些动态物体的出现对V-SLAM系统的影响巨大,尤其是对V-SLAM中的前端模块的影响。SLAM前端求解存在两种方案,直接法和特征点法。直接法基于光度不变假设来描述像素随时间在图像之间的运动方式,每个像素在两帧之间的运动是一致的,通过此估计相机的运动。然而由于相机获得的图像受到光线,噪声等影响,光度不变假设往往不成立,如果再出现动态物体,直接使用此方法更会影响相机的运动估计。特征点法是一种间接的方法,它首先提取图像的特征点,然后通过两帧间特征点的匹配和位置变化求解相机运动。特征点的选择与使用大幅提高了V-SLAM系统定位的准确性,例如著名开源视觉SLAM框架ORB-SLAM2[[3]]、ORB-SLAM3[[4]]、VINS-Mono[[5]]都采用了特征点法。但是,一旦出现动态物体,这些特征点中会包含动态物体上的点,动态物体的移动造成了特征点移动的不一致性,从而对相机运动的估计造成了巨大影响。这种影响会导致后端失效,定位精度大幅减弱,不能忽视。随着视觉SLAM技术的发展,如何解决动态影响受到广泛关注,具有重要的研究价值。 + 国内外研究现状和发展态势 + 2.1视觉SLAM研究现状 + 视觉SLAM问题最早可追溯到滤波技术的提出,Smith等人提出了采用状态估计理论的方法处理机器人在定位和建图等方面的问题[[][6][]]。随后出现各种基于滤波算法的SLAM系统,例如粒子滤波[[][7][]]和卡尔曼滤波[[][8][]]。2007年视觉SLAM取得重大突破,A. J. 
Davison等人提出第一个基于单目相机的视觉SLAM系统MonoSLAM[9]。该系统基于扩展卡尔曼滤波算法(Extended Kalman Filter, EKF),是首个达到实时效果的单目视觉SLAM系统,在此之前其他的算法都是对预先拍好的视频进行处理,无法做到同步。同年,Klein等人提出了PTAM(Parallel Tracking and Mapping)[10],创新地以并行的方式运行跟踪和建图线程,这种并行的方式也是当下SLAM框架的主流。PTAM应用了关键帧和非线性优化理论而非当时多数的滤波方案,为后续基于非线性优化的视觉SLAM开辟了道路。
+ 2014年慕尼黑工业大学计算机视觉组Jakob Engel等人[11]提出LSD-SLAM,该方案是一种基于直接法的单目视觉SLAM算法,不需要计算特征点,通过最小化光度误差进行图像像素信息的匹配,实现了效果不错的建图。该方案的出现证明了基于直接法的视觉SLAM系统的有效性,为后续的研究奠定了基础。同年SVO被Forster等人提出[12]。这是一种基于稀疏直接法的视觉SLAM方案,结合了特征点和直接法:使用特征点,但不计算特征点的描述子,特征点的匹配利用其周围像素以直接法完成。
+ 2015年Mur-Artal等人参考PTAM关键帧和并行线程的方案,提出了ORB-SLAM框架[13]。该框架是一种完全基于特征点法的单目视觉SLAM系统,包括跟踪、建图和回环检测三个并行线程。最为经典的是该系统采用的ORB特征点,能实现提取速度和效果的平衡。但是该系统只适用于单目相机,精度低且应用场景受限。随着相机的进步,2017年Mur-Artal等人对ORB-SLAM进行了改进,扩展了对双目和RGB-D相机的支持,提出ORB-SLAM2[3]。相比于原版,该系统支持三种相机,同时新增重定位、全局优化和地图复用等功能,更具鲁棒性。
+ 2017年,香港科技大学Qin Tong等人[14]提出VINS-Mono系统,该系统在单目相机中融合IMU传感器,在视觉信息短暂失效时可利用IMU估计位姿,视觉信息在优化时可以修正IMU数据的漂移,两者的结合表现出了优良的性能。2019年提出改进版系统VINS-Fusion[15],新增对双目相机和GPS传感器的支持,融合后的系统效果更优。
+ 2020年Carlos Campos等[4]提出了ORB-SLAM3,该系统在ORB-SLAM2的基础上,加入了对视觉惯性传感器融合的支持,并在社区开源。系统对算法的多个环节进行改进优化,例如加入了多地图系统和新的重定位模块,能够适应更多的场景,同时精度相比上一版提高2-3倍。在2021年底,系统更新了V1.0版本,继承了ORB-SLAM2的优良性能,成为现阶段最有代表性的视觉SLAM系统之一。
+ 2.2 动态SLAM研究现状
+ 针对动态物体的影响,已经有许多研究人员开展了相关工作,尝试解决动态场景下的视觉SLAM问题。解决这一问题的主要挑战就是如何高效地检测到动态物体和其特征点,并将动态特征点剔除以恢复相机运动。
+ 最早的解决思路是根据几何约束来筛除动态物体的特征点,如WANG等[16]首次使用K-Means将由RGB-D相机计算的3D点聚类,并使用连续图像之间的极线约束计算区域中内点关键点数量的变化,内点数量较少的区域被认定是动态的。Fang[17]使用光流法检测图像之间的动态物体所在位置,对其特征点进行滤除。该方法利用光流提高检测的精度,有效地降低了帧之间极线约束的误差。尽管基于几何约束的方法可以在一定程度上消除动态特征点的影响,但随着深度学习的发展,图像中的语义信息逐渐被重视和利用起来。
+ 现阶段有许多优秀的深度学习网络,如YOLO[18]、SegNet[19]、Mask R-CNN[20]等等。这些神经网络有着强大的特征提取能力和语义信息提取能力,可以帮助SLAM系统更轻松地辨别出动态物体的存在,从而消除其影响。Fangwei
Zhong等人提出的Detect-SLAM[21],利用目标检测网络获取环境中动态的人和车等,为了实时性,只在关键帧中进行目标检测,最后去除所有检测到的动态点来恢复相机位姿。LIU和MIURA[22]提出了RDS-SLAM。基于ORB-SLAM3[4]的RDS-SLAM框架使用模型的分割结果初始化移动对象的移动概率,将概率传播到随后的帧,以此来区分动静点。这种只基于深度学习的方法仅能提供图像中的语义信息,但无法判断图像中的物体是否真的在运动,比如静止的人或者路边停靠的汽车。若根据语义信息将其标记为动态物体后直接去除其特征点,这种方法会导致系统丢失有用的特征点,对相机的运动估计有所影响。因此仅利用深度学习不能很好解决动态物体对SLAM系统的影响。
+ 许多研究开始探索语义信息和几何信息的结合。例如清华大学Chao Yu等提出的DS-SLAM[23],该系统首先利用SegNet网络进行语义分割,再利用极线约束过滤移动的物体,达到了不错的效果。Berta Bescos等人首次利用Mask R-CNN网络进行实例分割,提出了DynaSLAM[24]。该系统结合基于多视几何深度的动态物体分割和区域生长算法,大幅降低了位姿估计的误差。
+ 利用深度学习得来的语义信息和几何信息结合来解决SLAM中的动态场景问题渐渐成了一种主流,但是上述大多系统只是为了恢复相机的位姿而剔除动态物体的特征点,而没有估计动态物体的位姿。同时估计相机运动和跟踪动态物体运动,将动态物体的点加入优化步骤正在发展为一种趋势。Ballester等人提出的DOT SLAM(Dynamic Object Tracking for Visual SLAM)[25]主要工作在前端,结合实例分割为动态对象生成掩码,通过最小化光度重投影误差跟踪物体。AirDOS被卡内基梅隆大学Yuheng Qiu等人提出[26],将刚性和运动约束引入以建模铰接物体,通过联合优化相机位姿、物体运动和物体三维结构,来纠正相机位姿估计。VDO SLAM[27]利用Mask R-CNN掩码和光流区分动静点,将动态环境下的SLAM表示为整体的图优化,同时估计相机位姿和物体位姿。
+ 总体来说,目前动态场景下的视觉SLAM问题的解决需要借助几何信息和深度学习的语义信息,语义信息提供更准确的物体,几何信息提供物体真实的运动状态,两者结合来估计相机运动和跟踪物体。
+ 主要参考文献
+ J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 40(3): 611-625.
+ 孔德磊, 方正. 基于事件的视觉传感器及其应用综述[J]. 信息与控制, 2021, 50(1): 1-19. KONG D L, FANG Z. A review of event-based vision sensors and their applications[J]. Information and Control, 2021, 50(1): 1-19.
+ Mur-Artal R, Tardós J D. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras[J]. IEEE Transactions on Robotics, 2017.
+ Campos C, Elvira R, Rodriguez J, et al. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual-Inertial, and Multimap SLAM[J]. IEEE Transactions on Robotics, 2021, 37(6): 1874-1890.
+ Qin T, Li P, Shen S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator[J]. IEEE Transactions on Robotics, 2018.
+ Smith R, Self M, Cheeseman P.
Estimating Uncertain Spatial Relationships in Robotics [J]. Machine Intelligence & Pattern Recognition, 1988, 5(5):435-461. + Grisetti G, Stachniss C, Burgard W. Improved Techniques for Grid Mapping With Rao-Blackwellized Particle Filters [J]. IEEE Transactions on Robotics, 2007, 23(1):34-46. + Kalman R E. A New Approach To Linear Filtering and Prediction Problems[J]. Journal of Basic Engineering, 1960, 82D:35-45.DOI:10.1115/1.3662552. + Davison A J, Reid I D, Molton N D, et al. MonoSLAM: Real-Time Single Camera SLAM[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6):1052-1067. + Klein G, Murray D. Parallel Tracking and Mapping for Small AR Workspaces[C]. IEEE and ACM International Symposium on Mixed and Augmented Reality. IEEE, 2007:1-10. + ENGEL J, SCHOPS T, CREMERS D, LSD-SLAM: Large-scale direct monocular SLAM[C]. European Conference on Computer Vision(ECCV), 2014:834 - 849. + FORSTER C, PIZZOLI M, SCARAMUZZA D. SVO: Fast semi-direct monocular visual odometry[C]. Hong Kong, China: IEEE International Conference on Robotics and Automation (ICRA), 2014: 15-22. + MURARTAL R, MONTIEL J M, TARDOS J D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System[J]. IEEE Transactions on Robotics, 2015, 31(5):1147-1163. + TONG Q, PEILIANG L, SHAOJIE S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator[J]. IEEE Transactions on Robotics, 2017,99:1-17. + QIN T, PAN J, CAO S, et al. A general optimization-based framework for local odometry estimation with multiple sensors[J]. ArXiv, 2019:1901.03638. + WANG R, WAN W, WANG Y, et al. A new RGB-D SLAM method with moving object detection for dynamic indoor scenes[J]. Remote Sensing, 2019, 11(10): 1143. + Fang Y, Dai B. An improved moving target detecting and tracking based on Optical Flow technique and Kalman filter[J]. IEEE, 2009.DOI:10.1109/ICCSE.2009.5228464. + Redmon J, Divvala S, Girshick R, et al. 
You Only Look Once: Unified, Real-Time Object Detection[C]//Computer Vision & Pattern Recognition. IEEE, 2016.DOI:10.1109/CVPR.2016.91. + Badrinarayanan V, Kendall A, Cipolla R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(12): 2481-2495. + Gkioxari G, He K, Piotr Dollár, et al. Mask R-CNN [J]. IEEE transactions on pattern analysis and machine intelligence, 2020, 42(2): 386-397. + Zhong F, Wang S, Zhang Z, et al. Detect-SLAM: Making Object Detection and SLAM Mutually Beneficial[C]//2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018.DOI:10.1109/WACV.2018.00115. + LIU Y, MIURA J. RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods[J]. IEEE Access, 2021, 9: 23772-23785. + C. Yu, et al. DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments[A]. //2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 1168-1174. + B. Bescos, J. M. Fácil, J. Civera and J. Neira, DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes[J]. IEEE Robotics and Automation Letters, 2018, 3(4)4076-4083. + Ballester I, Fontan A, Civera J, et al. DOT: Dynamic Object Tracking for Visual SLAM[J]. 2020.DOI:10.48550/arXiv.2010.00052. + Qiu Y, Wang C, Wang W, et al. AirDOS: Dynamic SLAM benefits from articulated objects[C]//2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022: 8047-8053. + Zhang J, Henein M, Mahony R, et al. VDO-SLAM: A Visual Dynamic Object-aware SLAM System[J]. 2020.DOI:10.48550/arXiv.2005.11052. + 高翔, 张涛等. 视觉SLAM十四讲[M]. 第二版. 北京:电子工业出版社, 2019. 
+ 已有的工作积累和研究成果
+ 工作积累
+研究生期间学习主要以《视觉SLAM十四讲》[28]为主,阅读了大量SLAM相关文献,在虚拟机环境下测试过ORB-SLAM2、VDO-SLAM等多种框架在公开数据集KITTI序列上的性能。掌握框架的主要函数,可以通过编程实现环境的搭建和算法的编写测试。
+ 研究成果
+暂无
+ 学位论文研究计划及预期目标
+1.拟采取的主要理论、研究方法、技术路线和实施方案(可续页)
+1.1 主要理论和研究方法
+一个典型的视觉SLAM系统一般可以分为五个子模块,包括传感器、前端、后端优化、回环检测和建图,如图3-1所示。
+ 图3-1 SLAM模块图
+对于视觉SLAM而言,传感器为相机,前端又称为视觉里程计,主要根据相机信息估计相邻两个时刻内的运动(即位姿变化)。后端优化位姿;回环检测是检测相机是否经过相同的场景,与建图有着密切的联系。本文的主要工作集中在前端和后端。在光照变化不明显、没有动态物体的场景下,SLAM基本模块已经很完善。要解决动态场景下的问题,需要在此模块的基础上,结合深度学习模型来实现语义级别的SLAM。
+在后端优化方面,本文基于因子图优化。因子图是应用贝叶斯定律的估计模型:给定观测Z求解状态X的概率,表示为后验概率P(X|Z),它正比于似然P(Z|X)与先验P(X)的乘积,如公式(1)所示。
+P(X|Z) = P(Z|X)P(X)/P(Z) = k·P(Z|X)P(X)    (1)
+贝叶斯定律左侧称为后验概率,右侧的P(Z|X)称为似然,P(X)称为先验。直接求后验分布是困难的,但是求一个状态最优估计,使得在该状态下后验概率最大化是可行的,如公式(2)所示。因此求解最大后验概率,等价于求解最大化似然和先验的乘积。
+X* = arg max P(X|Z) = arg max P(Z|X)P(X)    (2)
+求解最大似然估计时,考虑观测数据的条件概率满足高斯分布,可以使用最小化负对数来求高斯分布的最大似然,这样就可以得到一个最小二乘问题,如公式(3)所示,它的解等价于状态的最大似然估计。其中公式(3)中的f(x)为噪声符合高斯分布的X的误差项。
+X* = arg max P(X|Z) = arg max log P(X|Z) = arg min ||f(x)||₂²    (3)
+在SLAM问题中,每一个观测变量在贝叶斯网络中都是相互独立的,因此所有条件概率是乘积的形式,且可分解,对应于因子图中的每一项。因子图包含节点和边:节点为状态变量节点,表示待估计的变量,如位姿、3D点等;边为误差项,即因子,表示变量之间的约束。因子图还会包含一个先验因子,用来固定系统的解,以保证可解。因子图的求解就是寻找使所有因子的乘积最大化的状态量,该步骤可转化为最小二乘问题,最终解得的系统状态是在概率上最可能的系统状态。
+在研究时,从主要理论出发,阅读大量室外动态场景下的视觉SLAM文献,对文献总结和理解,学习方法的异同,优化自己的算法。从实践出发,多写代码尝试不同的算法,测试算法性能,通过实验得到良好的解决方案。
+1.2 技术路线和实施方案
+本文预计的技术路线和实施方案如图3-2所示:
+ 图3-2 技术路线和实施方案
+在室外动态场景下基于实例分割的SLAM算法首先需要解决深度学习模型的数据预处理,然后应用得到的语义信息和几何约束设计算法来实现动静点判定。根据静点估计相机的运动,根据动点估计运动物体的运动,不同的运动物体分别跟踪。最终研究相机位姿、运动物体位姿和地图点的全局优化,实现建图。
+本文预计的详细技术路线和实施方案如下:
+ 基于实例分割和聚类的动态物体判别方法
+在室外动态场景下,提出一种基于实例分割和超像素聚类的动态物体识别算法。通过实例分割得到物体掩码,将掩码内的点作为动点候选点,通过特征提取的点与动点候选点做差,得到静点候选点。静点候选点通过聚类后重投影到前一帧,计算点误差,提出一种基于误差比的动点判断方法,解决语义未知的动态物体判定问题。对于语义已知的掩码物体,同样使用该方法判定是否真的在运动。研究思路如图3-3所示。
+ 图3-3 基于实例分割和聚类的动态物体判别方法
+ 依赖掩码内动点集合的动态物体跟踪方法
+研究具有掩码的动态物体的运动,提出一种在室外场景下全局的动态物体跟踪方法。首先通过掩码稠密地提取像素点,每隔2个点取一个点,以保证物体跟踪时特征点的数量。再通过运动判定和语义标签得到真的在运动的物体,设计一个存储集合来管理这些物体像素点,同时利用提取的像素点估计不同物体的位姿,物体位姿的求解建立在刚体假设之上。研究思路如图3-4所示。
+ 图3-4 动态物体跟踪方法
+ 因子图优化方法
+研究基于因子图的相机位姿和物体位姿优化,该方法将动态SLAM问题表述为一个图优化问题,以构建全局一致的地图。因子图中的变量节点是要估计求解的变量(如相机位姿、物体位姿和地图点),变量之间的观测作为因子节点,构成约束。拟设计的因子图如图3-5所示。
+ 图3-5 因子图
+2.研究计划可行性,研究条件落实情况,可能存在的问题及解决办法(可续页)
+2.1 可行性分析
+ 得益于视觉SLAM的逐渐发展,动态物体问题已经有了不少解决思路,尤其是前端部分的研究更多,每年都有一定的论文产出,可作为参考。其次,随着深度学习的模型逐渐完善,实例分割技术和光流检测等技术也能有比较好的效果,对动态SLAM问题的解决有所助益。因此,在理论上和实践上,本论文的研究方向均具有可行性。
+2.2 研究条件
+ (1) 教研室的科研氛围,指导老师和教研室老师们的意见,师兄们的帮助。教研室已经发表了不少相关论文和专利;
+ (2) 教研室完备的硬件环境,包括服务器、移动小车和各种摄像头等硬件设施;
+ (3) 研究内容相关的论文和书籍,有足够的理论基础支撑研究。
+2.3 可能存在的问题及解决办法
+ (1) 全局优化的结果不如原始数据
+ 在将预测值进行全局优化时,不确定预测值的误差大小,会导致一些误差较大的预测值加入全局优化,使得优化后的效果不如原始数据。针对这样的问题,首先考虑优化对象的选择,增加或删除优化值,以获得更准确的效果。其次考虑在加入优化前对预测值做处理,比如绝对阈值或相对阈值处理。
+ (2) 实施方案未能达到较好的效果
+ 若出现这样的问题,则需要和导师师兄交流,讨论原因并做好记录,找到问题所在,并根据实际情况调整技术路线,设计新的方案来达到效果。
+3.研究计划及预期成果
+ 研究计划
+ 起止年月  完成内容
+ 2023.12-2024.02  研究动态物体判别方法
+ 2024.02-2024.04  研究动态物体跟踪方法
+ 2024.04-2024.06  研究包含动态物体的局部优化和全局优化
+ 2024.06-2024.08  验证地图精度指标,改进算法
+ 2024.08-2024.11  测试数据集,做实验
+ 2024.11-2025.03  撰写硕士学位论文
+ 预期创新点及成果形式
+ 预期创新点
+ 设计基于实例分割和聚类的动态物体判别方法
+ 提出基于掩码的动态物体同步跟踪方法
+ 设计因子图,实现更优的全局优化
+ 成果形式
+ 学术论文:发表一篇学术论文
+ 专利:申请发明专利1-2项
+ 论文:撰写硕士学位论文1篇
+ 开题报告审查意见
+1.导师对学位论文选题和论文计划可行性意见,是否同意开题:
+
+导师(组)签字: 年 月 日
+2.开题报告考评组意见
+ 开题日期
+ 开题地点
+ 考评专家
+ 考评成绩
+合格 票 基本合格 票 不合格 票
+ 结 论
+□通过 □原则通过 □不通过
+通过:表决票均为合格
+原则通过:表决票中有1票为基本合格或不合格,其余为合格和基本合格
+不通过:表决票中有2票及以上为不合格
+考评组对学位论文的选题、研究计划及方案实施的可行性的意见和建议:
+
+考评组签名:
+ 年 月 日
+3.学院意见:
+
+负责人签名: 年 月 日
 diff --git a/202221090225_武博文_文献综述.docx b/202221090225_武博文_文献综述.docx new file mode 100644 index 0000000..c3e3df2 --- /dev/null +++ b/202221090225_武博文_文献综述.docx @@ -0,0 +1,65 @@
+ 电子科技大学学术学位硕士研究生学位论文文献综述
+姓名:武博文
+ 学号:202221090225
+ 学科:软件工程
+综述题目:室外动态场景下基于实例分割的视觉SLAM研究
+
+导师意见:
+
+导师签字:
+日期:
+ 选题依据和研究意义
+同步定位与地图构建(SLAM)是搭载激光雷达、IMU(Inertial Measurement
Unit)、相机等传感器的移动载体在未知环境下同步进行定位与地图构建的过程[[][1][]]。SLAM一般可分为激光SLAM和视觉SLAM。激光SLAM利用激光雷达、编码器和惯性测量单元(IMU)等多种传感器相结合,已在理论和应用方面相对成熟。然而,激光雷达具有较高的价格使其难以普及到个人小型设备,并且雷达信息获取量有限。视觉SLAM利用视觉传感器,如单目、双目和RGB-D(带有深度信息的彩色图像)相机等,来构建环境地图。相机能够获取丰富的图像信息,并且视觉传感器具有低廉的价格,简单的结构和小巧便携的特点,因此成为近年来研究者们关注的热点,也成为SLAM技术中的主要研究方向。视觉SLAM能够广泛应用于无人驾驶,自主机器人,导盲避障等领域,对视觉SLAM的研究具有现实意义。 +经过近二十年的发展,视觉同时定位与建图(Visual Simultaneous Localization And Mapping,V-SLAM)框架已趋于成熟,在机器人视觉感知领域中占有重要地位,最先进的V-SLAM算法提供了高精度定位和场景重建的能力[[][2][]]。现阶段,V-SLAM系统大多数建立在非动态环境的假设上,即假设移动载体在跟踪过程中不存在动态物体。然而,这种假设是一种强假设,在现实场景中很难成立。在室内场景中,常出现移动的人和桌椅等等;在室外场景中,常常出现移动的车和动物等等,这些动态物体的出现对V-SLAM系统的影响巨大,尤其是对V-SLAM中的前端模块的影响。SLAM前端求解存在两种方案,直接法和特征点法。直接法基于光度不变假设来描述像素随时间在图像之间的运动方式,每个像素在两帧之间的运动是一致的,通过此估计相机的运动。然而由于相机获得的图像受到光线,噪声等影响,光度不变假设往往不成立,如果再出现动态物体,直接使用此方法更会影响相机的运动估计。特征点法是一种间接的方法,它首先提取图像的特征点,然后通过两帧间特征点的匹配和位置变化求解相机运动。特征点的选择与使用大幅提高了V-SLAM系统定位的准确性,例如著名开源视觉SLAM框架ORB-SLAM2[[3]]、ORB-SLAM3[[4]]、VINS-Mono[[5]]都采用了特征点法。但是,一旦出现动态物体,这些特征点中会包含动态物体上的点,动态物体的移动造成了特征点移动的不一致性,从而对相机运动的估计造成了巨大影响。这种影响会导致后端失效,定位精度大幅减弱,不能忽视。随着视觉SLAM技术的发展,如何解决动态影响受到广泛关注,具有重要的研究价值。 + 国内外研究现状和发展态势 +2.1 视觉SLAM研究现状 +视觉SLAM问题最早可追溯到滤波技术的提出,Smith等人提出了采用状态估计理论的方法处理机器人在定位和建图等方面的问题[[6]]。随后出现各种基于滤波算法的SLAM系统,例如粒子滤波[[7]]和卡尔曼滤波[[8]]。2007年视觉SLAM取得重大突破,A. J. 
Davison等人提出第一个基于单目相机的视觉SLAM系统MonoSLAM[9]。该系统基于扩展卡尔曼滤波算法(Extended Kalman Filter, EKF),是首个达到实时效果的单目视觉SLAM系统,在此之前其他的算法都是对预先拍好的视频进行处理,无法做到同步。MonoSLAM的发布标志着视觉SLAM的研究从理论层面转到了实际应用,具有里程碑式意义。同年,Klein等人提出了PTAM(Parallel Tracking and Mapping)[10],创新地以并行的方式运行跟踪和建图线程,解决了MonoSLAM计算复杂度高的问题,这种并行的方式也是当下SLAM框架的主流。PTAM应用了关键帧和非线性优化理论而非当时多数的滤波方案,为后续基于非线性优化的视觉SLAM开辟了道路。
+2014年慕尼黑工业大学计算机视觉组Jakob Engel等人[11]提出LSD-SLAM,该方案是一种基于直接法的单目视觉SLAM算法,不需要计算特征点,通过最小化光度误差进行图像像素信息的匹配,实现了效果不错的建图,可以生成半稠密的深度图。该方案的出现证明了基于直接法的视觉SLAM系统的有效性,为后续的研究奠定了基础。但该方案仍旧存在尺度不确定性问题,以及在相机快速移动时容易丢失目标等问题。同年SVO(semi-direct monocular visual odometry)被Forster等人提出[12]。这是一种基于稀疏直接法的视觉SLAM方案,结合了特征点和直接法:使用特征点,但不计算特征点的描述子,特征点的匹配利用其周围像素以直接法完成。SVO有着较快的计算速度,但是缺少后端的功能,对相机的运动估计有较为明显的累积误差,应用场景受限。
+2015年Mur-Artal等人参考PTAM关键帧和并行线程的方案,提出了ORB-SLAM框架[13]。该框架是一种完全基于特征点法的单目视觉SLAM系统,包括跟踪、建图和回环检测三个并行线程。跟踪线程负责提取ORB[14](oriented FAST and rotated BRIEF)特征点,这是该系统最为经典的一部分,采用的ORB特征点具有良好的尺度不变性和旋转不变性,能实现提取速度和效果的平衡。跟踪线程还完成估计位姿的工作,并且适时选出新的关键帧来实现建图。建图线程接收跟踪线程选出的关键帧,删除冗余的关键帧和地图点,再进行全局优化。回环线程接收建图线程筛选后的关键帧,与其他关键帧进行回环检测,然后更新相机位姿和地图。ORB-SLAM因为回环检测线程的加入,有效消除了累积误差的影响,提高了定位和建图的准确性。但是该系统只适用于单目相机,精度低且应用场景受限。随着相机的进步,2017年Mur-Artal等人对ORB-SLAM进行了改进,扩展了对双目和RGB-D相机的支持,提出ORB-SLAM2[3]。相比于原版,该系统支持三种相机,同时新增重定位、全局优化和地图复用等功能,更具鲁棒性。
+2017年,香港科技大学Qin Tong等人[15]提出VINS-Mono系统,该系统在单目相机中融合IMU传感器,在视觉信息短暂失效时可利用IMU估计位姿,视觉信息在优化时可以修正IMU数据的漂移,两者的结合表现出了优良的性能。2019年提出改进版系统VINS-Fusion[16],新增对双目相机和GPS传感器的支持,融合后的系统效果更优。
+2020年Carlos Campos等提出了ORB-SLAM3[4],该系统在ORB-SLAM2的基础上,加入了对视觉惯性传感器融合的支持,并在社区开源。系统对算法的多个环节进行改进优化,例如加入了多地图系统和新的重定位模块,能够适应更多的场景,同时精度相比上一版提高2-3倍。在2021年底,系统更新了V1.0版本,继承了ORB-SLAM2的优良性能,成为现阶段最有代表性的视觉SLAM系统之一。
+2.2 动态SLAM研究现状
+针对动态物体的影响,已经有许多研究人员开展了相关工作,尝试解决动态场景下的视觉SLAM问题。解决这一问题的主要挑战就是如何高效地检测到动态物体和其特征点,并将动态特征点剔除以恢复相机运动。
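上文所述"检测动态特征点并剔除"的几何思路,最基本的形式就是对极约束检验:静态点的匹配应落在对极线附近,偏离对极线较远的匹配被视为动点候选。下面给出一个仅依赖NumPy的最小示意(假设:基础矩阵F、匹配点和阈值均为人为构造的示例数据,并非上述任何系统的实现):

```python
import numpy as np

def epipolar_distances(F, pts1, pts2):
    """计算 pts2 中各点到对极线 l = F @ p1 的点线距离(像素)。"""
    ones = np.ones((pts1.shape[0], 1))
    p1 = np.hstack([pts1, ones])            # N x 3 齐次坐标
    p2 = np.hstack([pts2, ones])
    lines = p1 @ F.T                        # 第二帧中的对极线 (a, b, c)
    num = np.abs(np.sum(lines * p2, axis=1))
    den = np.sqrt(lines[:, 0] ** 2 + lines[:, 1] ** 2)
    return num / den

# 合成例子:相机纯水平平移时 F = [e]x,e ~ (1, 0, 0),对极线为水平线
F = np.array([[0., 0.,  0.],
              [0., 0., -1.],
              [0., 1.,  0.]])
pts1 = np.array([[10., 20.], [30., 40.]])
pts2_static = pts1 + np.array([5., 0.])     # 沿对极线移动:静态点,距离为 0
pts2_moving = pts1 + np.array([5., 7.])     # 偏离对极线 7 像素:动点候选

d_static = epipolar_distances(F, pts1, pts2_static)
d_moving = epipolar_distances(F, pts1, pts2_moving)
threshold = 1.0                             # 示意阈值,实际需按噪声水平选取
is_dynamic = d_moving > threshold
```

实际系统中F通常由RANSAC从特征匹配估计;并且正如正文指出的,沿对极线方向运动的物体仅靠该约束无法检出,这正是需要结合光流或语义信息的原因。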
+最早的解决思路是根据几何约束来筛除动态物体的特征点,如WANG等[17]首次使用K-Means将由RGB-D相机计算的3D点聚类,并使用连续图像之间的极线约束计算区域中内点关键点数量的变化,内点数量较少的区域被认定是动态的。利用极线约束是一种判断动态物体特征点的常见方法,但是如果相邻帧间存在高速移动物体或者运动物体沿着极线方向移动,这种方法效果会大大减弱。为了更好地利用几何信息,研究人员提出借助光流信息来提高动态物体的检测。Fang[18]使用光流法检测图像之间的动态物体所在位置,对其特征点进行滤除。该方法利用光流提高检测的精度,有效地降低了帧之间极线约束的误差。尽管基于几何约束的方法可以在一定程度上消除动态特征点的影响,但随着深度学习的发展,图像中的语义信息逐渐被重视和利用起来。
+现阶段有许多优秀的深度学习网络,如YOLO[19]、SegNet[20]、Mask R-CNN[21]等等。这些神经网络有着强大的特征提取能力和语义信息提取能力,可以帮助SLAM系统更轻松地辨别出动态物体的存在,提供语义先验信息,从而消除其影响。Fangwei Zhong等人提出的Detect-SLAM[22],利用目标检测网络获取环境中动态的人和车等,为了实时性,只在关键帧中进行目标检测,最后去除所有检测到的动态点来恢复相机位姿。LIU和MIURA[23]提出了RDS-SLAM。基于ORB-SLAM3[4]的RDS-SLAM框架使用模型的分割结果初始化移动对象的移动概率,将概率传播到随后的帧,以此来区分动静点。这种只基于深度学习的方法仅能提供图像中的语义信息,但无法判断图像中的物体是否真的在运动,比如静止的人或者路边停靠的汽车。若根据语义信息将其标记为动态物体后直接去除其特征点,这种方法会导致系统丢失有用的特征点,对相机的运动估计有所影响。因此仅利用深度学习不能很好解决动态物体对SLAM系统的影响。
+许多研究开始探索语义信息和几何信息的结合。清华大学Chao Yu等提出的DS-SLAM[24],首先利用SegNet网络进行语义分割,再利用极线约束过滤移动的物体,达到了不错的效果。Berta Bescos等人首次利用Mask R-CNN网络进行实例分割,提出了DynaSLAM[25]。该系统结合基于多视几何深度的动态物体分割和区域生长算法,大幅降低了位姿估计的误差。Runz等人提出了MaskFusion,一种考虑物体语义和运动的动态RGB-D SLAM系统[26]。该系统基于Mask R-CNN语义分割和几何分割,将语义分割和SLAM放在两个线程以保证整个系统的实时性。但是该系统物体边界分割常包含背景,仍有改善空间。Ran等人提出RS-SLAM,一种使用RGB-D相机解决动态环境不良影响的SLAM[27]。该系统采用语义分割识别动态对象,通过动态对象和可移动对象的几何关系来判断可移动对象是否移动。动态内容随后被剔除,跟踪模块对剔除过的静态背景图像帧进行ORB特征提取并估计相机位姿。
+利用深度学习得来的语义信息和几何信息结合来解决SLAM中的动态场景问题渐渐成了一种主流,但是上述大多系统只是为了恢复相机的位姿而剔除动态物体的特征点,而没有估计动态物体的位姿。同时估计相机运动和跟踪动态物体运动,将动态物体的点加入优化步骤正在发展为一种趋势。Henein等人提出一种新的基于特征的、无模型的动态SLAM算法Dynamic SLAM(Dynamic SLAM: The Need For Speed)[28]。该方法借助语义分割跟踪场景中刚体物体的运动,并提取运动物体的速度,有效性在各种虚拟和真实数据集上得到了验证。Ballester等人提出的DOT SLAM(Dynamic Object Tracking for Visual SLAM)[29]主要工作在前端,结合实例分割为动态对象生成掩码,通过最小化光度重投影误差跟踪物体。AirDOS被卡内基梅隆大学Yuheng Qiu等人提出[30],将刚性和运动约束引入以建模铰接物体,通过联合优化相机位姿、物体运动和物体三维结构,来纠正相机位姿估计。VDO SLAM[31]利用Mask R-CNN掩码和光流区分动静点,将动态环境下的SLAM表示为整体的图优化,同时估计相机位姿和物体位姿。
+总体来说,目前动态场景下的视觉SLAM问题的解决需要借助几何信息和深度学习的语义信息,语义信息提供更准确的物体,几何信息提供物体真实的运动状态,两者结合来估计相机运动和跟踪物体。
+
+ 参考文献
+ 孔德磊, 方正.
基于事件的视觉传感器及其应用综述[J]. 信息与控制, 2021, 50(1): 1-19. KONG D L, FANG Z. A review of event-based vision sensors and their applications[J]. Information and Control, 2021, 50(1): 1-19. + J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 611 - 625, Mar. 2016. + Mur-Artal R , JD Tardós. ORB-SLAM2: An Open-Source SLAM System for Monocular, Stereo, and RGB-D Cameras[J]. IEEE Transactions on Robotics, 2017. + Campos C, Elvira R, Rodriguez J, et al. ORB-SLAM3: An Accurate Open-Source Library for Visual, Visual - Inertial, and Multimap SLAM[J]. IEEE Transactions on Robotics: A publication of the IEEE Robotics and Automation Society, 2021, 37(6): 1874-1890. + Tong, Qin, Peiliang, et al. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator[J]. IEEE Transactions on Robotics, 2018. + Smith R, Self M, Cheeseman P. Estimating Uncertain Spatial Relationships in Robotics [J]. Machine Intelligence & Pattern Recognition, 1988, 5(5):435-461. + Grisetti G, Stachniss C, Burgard W. Improved Techniques for Grid Mapping With Rao-Blackwellized Particle Filters [J]. IEEE Transactions on Robotics, 2007, 23(1):34-46. + Kalman R E. A New Approach To Linear Filtering and Prediction Problems[J]. Journal of Basic Engineering, 1960, 82D:35-45.DOI:10.1115/1.3662552. + Davison A J, Reid I D, Molton N D, et al. MonoSLAM: Real-Time Single Camera SLAM[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(6):1052-1067. + Klein G, Murray D. Parallel Tracking and Mapping for Small AR Workspaces[C]. IEEE and ACM International Symposium on Mixed and Augmented Reality. IEEE, 2007:1-10. + ENGEL J, SCHOPS T, CREMERS D, LSD-SLAM: Large-scale direct monocular SLAM[C]. European Conference on Computer Vision(ECCV), 2014:834 - 849. + FORSTER C, PIZZOLI M, SCARAMUZZA D. SVO: Fast semi-direct monocular visual odometry[C]. 
Hong Kong, China: IEEE International Conference on Robotics and Automation (ICRA), 2014: 15-22. + MURARTAL R, MONTIEL J M, TARDOS J D. ORB-SLAM: A Versatile and Accurate Monocular SLAM System[J]. IEEE Transactions on Robotics, 2015, 31(5):1147-1163. + Rublee E,Rabaud V,Konolige K,et al.ORB:An efficient alternative to SIFT or SURF[C].2011 International conference on computer vision. IEEE, 2011:2564-2571. + TONG Q, PEILIANG L, SHAOJIE S. VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator[J]. IEEE Transactions on Robotics, 2017,99:1-17. + QIN T, PAN J, CAO S, et al. A general optimization-based framework for local odometry estimation with multiple sensors[J]. ArXiv, 2019:1901.03638. + WANG R, WAN W, WANG Y, et al. A new RGB-D SLAM method with moving object detection for dynamic indoor scenes[J]. Remote Sensing, 2019, 11(10): 1143. + Fang Y, Dai B. An improved moving target detecting and tracking based on Optical Flow technique and Kalman filter[J]. IEEE, 2009.DOI:10.1109/ICCSE.2009.5228464. + Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, Real-Time Object Detection[C]//Computer Vision & Pattern Recognition. IEEE, 2016.DOI:10.1109/CVPR.2016.91. + Badrinarayanan V, Kendall A, Cipolla R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation[J]. IEEE transactions on pattern analysis and machine intelligence, 2017, 39(12): 2481-2495. + Gkioxari G, He K, Piotr Dollár, et al. Mask R-CNN [J]. IEEE transactions on pattern analysis and machine intelligence, 2020, 42(2): 386-397. + Zhong F, Wang S, Zhang Z, et al. Detect-SLAM: Making Object Detection and SLAM Mutually Beneficial[C]//2018 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE, 2018.DOI:10.1109/WACV.2018.00115. + LIU Y, MIURA J. RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods[J]. IEEE Access, 2021, 9: 23772-23785. + C. Yu, et al. DS-SLAM: A Semantic Visual SLAM towards Dynamic Environments[A]. 
//2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 1168-1174. + B. Bescos, J. M. Fácil, J. Civera and J. Neira, DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes[J]. IEEE Robotics and Automation Letters, 2018, 3(4)4076-4083. + Runz M, Buffier M, Agapito L. MaskFusion: real-time recognition, tracking and reconstruction of multiple moving objects[J]. 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR), 2018, pp. 10-20. + T. Ran, L. Yuan, J. Zhang, D. Tang and L. He. RS-SLAM: A Robust Semantic SLAM in Dynamic Environments Based on RGB-D Sensor[J]. IEEE Sensors Journal, 2021, vol. 21, no. 18, pp. 20657-20664. + M. Henein, J. Zhang, R. Mahony and V. Ila. Dynamic SLAM: The Need For Speed[C]. 2020 IEEE International Conference on Robotics and Automation (ICRA), 2020: 2123-2129. + Ballester I, Fontan A, Civera J, et al. DOT: Dynamic Object Tracking for Visual SLAM[J]. 2020.DOI:10.48550/arXiv.2010.00052. + Qiu Y, Wang C, Wang W, et al. AirDOS: Dynamic SLAM benefits from articulated objects[C]//2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022: 8047-8053. + Zhang J, Henein M, Mahony R, et al. VDO-SLAM: A Visual Dynamic Object-aware SLAM System[J]. 2020.DOI:10.48550/arXiv.2005.11052. 
diff --git a/docker/wbw-slam/Dockerfile b/docker/wbw-slam/Dockerfile new file mode 100644 index 0000000..5826354 --- /dev/null +++ b/docker/wbw-slam/Dockerfile @@ -0,0 +1,3 @@ +FROM nvidia/cuda:11.1.1-devel-ubuntu18.04 + +WORKDIR /root \ No newline at end of file diff --git a/docker/wbw-slam/run.txt b/docker/wbw-slam/run.txt new file mode 100644 index 0000000..5f082c8 --- /dev/null +++ b/docker/wbw-slam/run.txt @@ -0,0 +1,10 @@ + +docker run --name wbw-slam --gpus=all -it -v /home/wbw/data:/data -e DISPLAY -e WAYLAND_DISPLAY -e XDG_RUNTIME_DIR -e PULSE_SERVER -p 8080:5901 -p 8081:20 wbw-slam /bin/bash + + +docker run --name wbw-docker --gpus=all -it -v /home/wbw/data:/data -e DISPLAY -e WAYLAND_DISPLAY -e XDG_RUNTIME_DIR -e PULSE_SERVER -p 8083:5901 -p 8084:20 wbw-docker /bin/bash +// 启动docker +docker start wbw-slam +docker exec -it wbw-slam bash + + diff --git a/docker/wbw_docker_export.tar b/docker/wbw_docker_export.tar new file mode 100644 index 0000000..4425e4a Binary files /dev/null and b/docker/wbw_docker_export.tar differ diff --git a/动态slam/06_tar.txt b/动态slam/06_tar.txt new file mode 100644 index 0000000..6480af9 --- /dev/null +++ b/动态slam/06_tar.txt @@ -0,0 +1,1101 @@ +2.220446e-16 0.0 -1.110223e-16 0.0 0.0 0.0 1.0 +1.198998 -0.01401751 -0.02820321 -0.00035986356967330357 6.49054580573347e-05 -0.00034352059264449364 0.9999998741395396 +2.384181 -0.02787365 -0.0560812 -0.0007155286094661527 0.00012929831495668684 -0.0006830403431447477 0.9999995023781983 +3.575628 -0.0418032 -0.08410629 -0.0010730197134110313 0.00019427286969106458 -0.0010243109083292836 0.9999988808363287 +4.767291 -0.05573543 -0.1121362 -0.0014305218763574284 0.00025949949677692467 -0.0013655992458171734 0.9999980107009575 +5.962283 -0.06970674 -0.1402442 -0.0017889690255053106 0.00032515025427644703 -0.0017077967747923602 0.9999968886438168 +7.152769 -0.08362554 -0.1682458 -0.002146010347958631 0.00039079257280655213 -0.0020486601625632228 0.9999955224461207 +8.3454 -0.09756958 
-0.1962977 -0.002503641285496061 0.00045679391386660456 -0.0023900930922495997 0.9999939052687492 +9.536762 -0.1114989 -0.2243193 -0.002860836791111922 0.0005229646662447424 -0.0027311180076956434 0.9999920415259517 +10.72832 -0.1254307 -0.2523452 -0.0032180356859406976 0.0005893864406748412 -0.00307215346553219 0.9999899293208067 +11.91983 -0.1393621 -0.2803697 -0.0035751674902217787 0.000656045413982819 -0.0034131317238107425 0.99998756907957 +13.11191 -0.1533004 -0.3084073 -0.0039324143727152705 0.0007229763697410812 -0.003754227693729973 0.9999849594852883 +14.30348 -0.1671408 -0.3362948 -0.004221199569135658 0.0007775326433186569 -0.00407839981343691 0.9999824716324521 +15.49659 -0.181814 -0.3654237 -0.005111837105190033 0.0009433322831451939 -0.0045504186947271635 0.9999761361829173 +16.68758 -0.1950729 -0.3924445 -0.004975825921469052 0.0009150478953510533 -0.004775125564258965 0.9999758007169943 +17.88104 -0.2081761 -0.4191514 -0.004690864663502994 0.0007505806471832133 -0.005072272635004193 0.9999758519422941 +19.07663 -0.2220687 -0.4450814 -0.00452155249754079 0.0005498399673708305 -0.005316702723358468 0.999975492655282 +20.27019 -0.2372951 -0.4715143 -0.004640924132021398 0.0010122291741614564 -0.005662007298249387 0.9999726890713838 +21.46468 -0.2531559 -0.4977817 -0.004512736565285698 0.001532656673956377 -0.0059070602522334785 0.9999711959908588 +22.65883 -0.2707637 -0.5247441 -0.004814594710778253 0.0012356936499101766 -0.006313858871244178 0.9999677134413534 +23.85296 -0.2887908 -0.5527301 -0.004932776240605864 0.0011715336446188685 -0.0066750169286301396 0.99996486907115 +25.04684 -0.3071147 -0.5812326 -0.004937456600778406 0.0015563988969339428 -0.007130184190804922 0.9999611790555641 +26.2431 -0.3273512 -0.6069839 -0.005068149234804969 0.0013898935019328099 -0.007546006475853068 0.9999577190289864 +27.44168 -0.3472488 -0.6329432 -0.004630086244265993 0.00030384521350993474 -0.007903069481223837 0.9999580048543197 +28.63712 -0.3677985 -0.6565642 
[Data file hunk: an estimated camera trajectory, one pose per line. Each record appears to hold seven floats — a translation `tx ty tz` (meters) followed by a unit quaternion `qx qy qz qw` — with the diff's `+` markers fused into the numbers. The x coordinate runs from roughly 29.8 m out to about 300 m and back to about 204 m while the orientation swings through a sign flip, consistent with a KITTI-style outdoor sequence containing a U-turn. The raw numeric dump is omitted here.]
0.9992680312879243 0.00801044838439257 +202.8318 -18.86476 -3.487637 0.021174881430153225 0.03058095317804986 0.9992760964230868 0.007982030855542112 +201.3684 -18.83712 -3.453039 0.021124371307022355 0.03107980016490761 0.999261581774203 0.00800613194433103 +199.9091 -18.80904 -3.419171 0.020607555749851448 0.03127398480729472 0.9992660267562021 0.008042032775541967 +198.4404 -18.78081 -3.384783 0.021672492836287034 0.03153766324617793 0.9992369222704941 0.007826367140790371 +196.9762 -18.75193 -3.350644 0.02306236433968062 0.031203896157916872 0.9992192043325783 0.007444857862896781 +195.5153 -18.72357 -3.313855 0.023258874433074096 0.030653896388685417 0.999235370337627 0.006930949611976875 +194.0556 -18.69718 -3.276436 0.02317002280227061 0.030301485906365448 0.9992518182678782 0.006385427436590589 +192.6001 -18.67123 -3.241513 0.02346119080822332 0.029714656261177388 0.9992651499946745 0.005980947689184805 +191.1441 -18.64803 -3.205048 0.023224155583236688 0.029785742997909946 0.9992715385898586 0.0054626253546038225 +189.6915 -18.62849 -3.170622 0.022658616048869536 0.030257681669795155 0.9992725195483534 0.005048910199743968 +188.2429 -18.61037 -3.138275 0.022510447108165545 0.030590471171176144 0.9992671524487181 0.004760344687756483 +186.7945 -18.59411 -3.107703 0.022851676297995772 0.03052139626880785 0.9992626676056516 0.004512913672016687 +185.3525 -18.57686 -3.075839 0.022518697605652595 0.03044984439649871 0.9992741691173024 0.004104895772007059 +183.9085 -18.56047 -3.044236 0.023017013356528906 0.03033499155987721 0.9992674522317885 0.0038163193521951735 +182.4687 -18.546 -3.011131 0.023823700779981416 0.030199034492800717 0.9992539519695486 0.0034625236588046947 +181.0316 -18.5334 -2.977267 0.02423744096870438 0.03022528659470345 0.9992450470605335 0.002883475428866504 +179.601 -18.52315 -2.942176 0.02347640473901481 0.030159817346966798 0.9992675649625606 0.0018914153225483793 +178.1753 -18.51524 -2.908289 0.023016280783132115 0.029990339247609174 
0.999284726967864 0.000929956712429785 +176.7507 -18.51072 -2.873542 0.023780609083700135 0.03028526567883062 0.999258362147097 -0.00010484976047998332 +175.3373 -18.51025 -2.838653 0.024738464534040685 0.03069574191832978 0.9992219902528396 -0.001092701118936926 +173.9296 -18.51326 -2.803213 0.0242545878484337 0.031221241069351188 0.9992158059418094 -0.0021730693620919425 +172.5375 -18.51829 -2.769552 0.02425821295745384 0.031271267854314916 0.9992114929481748 -0.003168480893753708 +171.1562 -18.52671 -2.736025 0.0250488779073001 0.03136462349309932 0.9991852596249269 -0.004198935220834025 +169.7919 -18.5371 -2.701364 0.025796812057990635 0.031381707132243905 0.9991606762944607 -0.005258886945601355 +168.4471 -18.55169 -2.66695 0.026470642977990275 0.031760200072460075 0.9991247668812122 -0.006347830830475623 +167.1265 -18.5681 -2.633527 0.026024062297567818 0.031864017951324576 0.999127463745663 -0.0071933115245263295 +165.8218 -18.58704 -2.599237 0.026074075627066713 0.03231374257544871 0.9991054154704629 -0.00802081034280045 +164.539 -18.60503 -2.565877 0.027586061535681623 0.031870672041445045 0.9990744664427101 -0.008572046126235605 +163.2792 -18.62542 -2.533595 0.027597007859322946 0.03207829949938859 0.9990651825177727 -0.008840188865049841 +162.0471 -18.64444 -2.502804 0.026629970379085335 0.03191381038025019 0.9990967600749096 -0.008832745464512359 +160.8371 -18.66211 -2.473852 0.026295752834982433 0.03154891962315254 0.9991178111638803 -0.008763473543656152 +159.6532 -18.68207 -2.447586 0.0264193804752979 0.032025006560499 0.9991002767735243 -0.00866326965807682 +158.4929 -18.70043 -2.422217 0.026374453977839333 0.032073678303046714 0.999100893367633 -0.00854822844109998 +157.3598 -18.71715 -2.396857 0.026866295100109426 0.03186084793932756 0.9990964357850154 -0.008330699722081851 +156.2503 -18.73344 -2.370798 0.02776022648957001 0.03210023090101984 0.9990669475862852 -0.008011194790538483 +155.1689 -18.74892 -2.347373 0.02826871028042105 
0.03179497690246805 0.9990644346677727 -0.007759822473265262 +154.1173 -18.76231 -2.322647 0.02784507625035552 0.03120077420428958 0.9990969115287831 -0.0075182971018481135 +153.0881 -18.7779 -2.297427 0.02730757137658647 0.031507736461417205 0.9991035849763122 -0.007320216941544603 +152.0873 -18.79239 -2.274461 0.02679782718875282 0.03169946924723332 0.9991136619568395 -0.006993611270287818 +151.1111 -18.80437 -2.252345 0.02685430385795093 0.031242079083768106 0.999128206630834 -0.006753189874563879 +150.1621 -18.81642 -2.23022 0.0273810117466828 0.03164133379689621 0.9991028722265826 -0.006523564966686604 +149.2397 -18.82738 -2.209938 0.02768833306280547 0.03152381083267158 0.9990997585604373 -0.006267216867543746 +148.3393 -18.83766 -2.187753 0.027792238598300274 0.03168043749014281 0.9990930248127053 -0.006088441896496239 +147.4612 -18.84754 -2.164801 0.027947388762334897 0.03181068737728931 0.9990866341811456 -0.005737685120396552 +146.5989 -18.85476 -2.144196 0.027731939727689257 0.031087466741466792 0.9991172726388596 -0.005402262972555021 +145.7538 -18.86279 -2.123511 0.026551665327942925 0.03157355688470166 0.9991349757405659 -0.005236394356830439 +144.9223 -18.87376 -2.100874 0.025425258852973837 0.033058263051751795 0.9991167499498191 -0.005140761224859557 +144.1037 -18.88239 -2.078341 0.024922391447189305 0.0337588067461529 0.9991070825346446 -0.004924936622906657 +143.3024 -18.88627 -2.059412 0.025089634212012858 0.03318016455697401 0.9991221000843047 -0.004961457266586779 +142.5198 -18.89195 -2.042113 0.024894964307351115 0.033289590101182376 0.9991240352417781 -0.004817275704915314 +141.7522 -18.90126 -2.02299 0.023968331428663225 0.034396103828676125 0.9991096082939945 -0.004734738065980972 +140.9929 -18.90872 -2.005121 0.023645379415746976 0.03468304329170034 0.999107278160899 -0.0047570228327925585 +140.2422 -18.91348 -1.989403 0.02397671251025961 0.034521250141487145 0.9991040211948229 -0.00495533835417049 +139.4994 -18.91945 -1.974562 
0.024442852650577027 0.034547652274933796 0.9990903014999241 -0.0052512974952304845 +138.7654 -18.92677 -1.960534 0.0241958185938109 0.034919341099560186 0.9990819328745414 -0.005522081438946455 +138.044 -18.93511 -1.944374 0.023722443194805664 0.03559006450576434 0.9990681289454304 -0.005785043026765156 +137.3298 -18.94438 -1.928158 0.023192079348272228 0.036168984047620596 0.9990585457830419 -0.005996177644146775 +136.6214 -18.95464 -1.914991 0.02217098677411288 0.03688971776178573 0.9990548781475247 -0.006078364792657228 +135.917 -18.96431 -1.905325 0.021063482823542937 0.03712568189943476 0.9990689686938006 -0.006261727233763424 +135.2161 -18.97372 -1.895274 0.019935109855169827 0.037642794299362896 0.999070494572484 -0.006615006204863639 +134.5169 -18.97911 -1.881347 0.018844870039049364 0.03687984862499836 0.9991184395825156 -0.006862311914212331 +133.8197 -18.98311 -1.86792 0.018126984206735988 0.03596404583750702 0.9991632921573653 -0.0071214784856745645 +133.1238 -18.99097 -1.854861 0.018710991735742536 0.03642347913938374 0.999134377485489 -0.007329712295337043 +132.4286 -19.00389 -1.83902 0.020110912698156687 0.03802807344188823 0.9990464122711656 -0.0074621008340527905 +131.7402 -19.0082 -1.829069 0.020053505909874352 0.0361708060608633 0.9991150949724138 -0.007652234176665759 +131.0599 -19.00738 -1.823556 0.018992035475059215 0.033236204064814485 0.9992361143189863 -0.007864169922492802 +130.3774 -19.01169 -1.813487 0.01675769123051702 0.03185692307348292 0.9993220364698615 -0.007731989565544769 +129.6926 -19.02656 -1.799997 0.012599234690348696 0.03382434662554971 0.9993192573122885 -0.0076285532958875754 +128.9952 -19.02976 -1.790406 0.011665965829619809 0.03188758113580042 0.9993943370857702 -0.00761881955842492 +128.2867 -19.02968 -1.783236 0.012789432504612045 0.029354733145144783 0.999460901685562 -0.007255071334634666 +127.5639 -19.03291 -1.773597 0.014438909179784454 0.027944264466700987 0.9994816654842141 -0.006858304893669716 +126.8317 
-19.03689 -1.762349 0.01590780058308891 0.02708734905266844 0.9994882734305193 -0.006033960290473813 +126.0904 -19.04093 -1.751742 0.016518778374032095 0.027058867072917964 0.9994844861039371 -0.0050704744658064335 +125.3386 -19.0431 -1.739549 0.01611505755149827 0.026979920054611688 0.9994995163377945 -0.003620176027960146 +124.5754 -19.0406 -1.721883 0.015065238704164492 0.026555697344912303 0.9995321347868356 -0.001828947725667791 +123.7928 -19.03398 -1.704964 0.015106712511449157 0.02557775902141944 0.9995586410497446 -0.0002976295120139031 +122.9911 -19.02609 -1.689113 0.016744772187845907 0.024919415198532097 0.9995482600743657 0.0013824372856111671 +122.1724 -19.01737 -1.671608 0.019202333718555384 0.024985624414766636 0.9994984384365563 0.003140145737203849 +121.3372 -19.00788 -1.651373 0.020569056569252318 0.025807247821167185 0.9994437421138588 0.004806893091651971 +120.4889 -18.99551 -1.630463 0.020858303511653287 0.026519855043347574 0.9994084646420767 0.006659524372107838 +119.6223 -18.9811 -1.606831 0.02014903722754052 0.02742235573136862 0.9993846217937705 0.008509314136155539 +118.7404 -18.96236 -1.583863 0.018859519419673766 0.02814283438545422 0.9993701180839945 0.010567236194718194 +117.841 -18.94089 -1.56236 0.018318159616608477 0.02924022153698594 0.9993266511738543 0.012477929570980217 +116.9209 -18.91249 -1.539442 0.018459668898010463 0.029850097244410897 0.9992784927980377 0.014515720786697316 +115.9843 -18.88164 -1.51858 0.019515878322156614 0.03059488977532522 0.9992053700550357 0.01648367878021526 +115.0276 -18.84566 -1.49574 0.020660187262335303 0.03126764912232277 0.9991230726642429 0.018670202211298004 +114.0573 -18.80738 -1.472691 0.02054495886654235 0.032789911162079896 0.9990365633120774 0.020704384964265382 +113.0721 -18.76261 -1.449607 0.019535317074973416 0.03357368713042368 0.9989822645075674 0.022926275724183674 +112.0705 -18.71096 -1.425819 0.017841269478838143 0.034154547402804186 0.9989453009583076 0.02496881431725127 
+111.0501 -18.65467 -1.402779 0.017036014523116084 0.03458562701597711 0.9988941165094898 0.02691008390558835 +110.0104 -18.59273 -1.382089 0.017109289171547814 0.03455316912624032 0.9988544658602502 0.02833913827927968 +108.9547 -18.52665 -1.362588 0.01693264828255048 0.034629487341958094 0.9988278410990891 0.02927503840997116 +107.8847 -18.45845 -1.343346 0.015689095076267683 0.034138657812213216 0.9988515951452674 0.029730375278882407 +106.7965 -18.38974 -1.32609 0.014163752742018702 0.03388643201538302 0.9988780862343754 0.0298942582159427 +105.6929 -18.32377 -1.307555 0.01258611056081788 0.0348993598543519 0.9988669807762981 0.029805691026207365 +104.5622 -18.25676 -1.288139 0.014145652248754784 0.035636256833428866 0.9988288184861546 0.029511846484277297 +103.4111 -18.18725 -1.269054 0.017470700286308514 0.03541989789627457 0.9987882410976148 0.029364177319591213 +102.2464 -18.11737 -1.244731 0.018629272873207516 0.03571170201320184 0.9987628302632492 0.029162191551388866 +101.074 -18.04504 -1.219976 0.01744909448200695 0.0352511125567185 0.99880683180503 0.028944790634409533 +99.88171 -17.972 -1.195663 0.01658254821139547 0.0348143408666297 0.9988483040502866 0.02854901505665557 +98.67175 -17.89756 -1.172886 0.016853473542105462 0.034119511118000524 0.998875240719814 0.02828555934691294 +97.4478 -17.82503 -1.151222 0.01652730356736687 0.03407543165729259 0.9988861870345944 0.028144245338090106 +96.20855 -17.75374 -1.126795 0.016094603855796582 0.034499433343690385 0.9988883511662479 0.027799545501030566 +94.95275 -17.68421 -1.102525 0.017594865975409008 0.03465248203317205 0.9988740592944434 0.02720734917056519 +93.67749 -17.6127 -1.077492 0.018879023913128925 0.03404039738146705 0.9988936801246587 0.026386542198915917 +92.39323 -17.54366 -1.049972 0.01783898504921177 0.033754326667827246 0.998946839501164 0.025448534225984924 +91.09028 -17.4771 -1.022258 0.017861599565137335 0.033378954888393625 0.9989893419756295 0.02423021359030954 +89.77477 -17.40999 
-0.9990056 0.01759510951991081 0.03179172149444711 0.999076047362708 0.022951038135777653 +88.45413 -17.35065 -0.9753403 0.0127243991599314 0.032880638642939775 0.9991485476962082 0.021427386798419767 +87.10974 -17.29334 -0.9584784 0.01058582552613299 0.032848196770890487 0.9992022031036463 0.020097103762752533 +85.73751 -17.23975 -0.9440549 0.014633289070184654 0.0328658943625804 0.9991772661129686 0.01872139743048158 +84.35171 -17.18834 -0.9218292 0.018014392448969525 0.03254208390262956 0.9991565029131237 0.017400492130561992 +82.95643 -17.13847 -0.8925921 0.018332181018760964 0.03207941586399653 0.9991878424745282 0.01607786266944726 +81.55322 -17.09087 -0.8621183 0.017347030545706705 0.031618194566973866 0.9992387523378959 0.01487568922021156 +80.12646 -17.04828 -0.8330614 0.01688041039901714 0.031592092895822564 0.9992643484648684 0.013702309962358264 +78.68382 -17.00912 -0.8053784 0.017142455159685724 0.03202005218862654 0.9992601668059312 0.012647984974623617 +77.22778 -16.97362 -0.7771295 0.017235651620196524 0.03286491204083717 0.9992429022856969 0.011681271396236509 +75.75752 -16.9389 -0.7478602 0.018060018547462774 0.033005860933920106 0.9992345617777074 0.010711648938839484 +74.28435 -16.90538 -0.7187593 0.01924826332501231 0.032639717980615914 0.9992357335476563 0.009596977159197677 +72.80871 -16.87364 -0.688618 0.020718903269736515 0.032003440041364004 0.9992364080085627 0.008550308939642658 +71.33452 -16.84487 -0.6588975 0.022468459423008583 0.03145133269328972 0.9992242766907536 0.007538360220424405 +69.86187 -16.82112 -0.6266962 0.02296843833251295 0.031734216681613775 0.9992101918672692 0.00666204178499671 +68.39401 -16.79851 -0.5936207 0.022348110417726823 0.031752269054467826 0.999227873589278 0.006001000991465331 +66.92462 -16.77917 -0.561778 0.02190569508238906 0.03203674411330987 0.9992321510296022 0.00537549051994799 +65.45693 -16.76019 -0.5300473 0.022041416274824612 0.0320556117030206 0.9992310349637248 0.0048941283461601355 +63.98723 
-16.74127 -0.4983233 0.022482203694010574 0.03157241831676313 0.9992382364091458 0.004547506639104229 +62.52254 -16.72374 -0.4662392 0.022297839520469837 0.03145889690606758 0.999247363431623 0.004225024629815623 +61.05928 -16.70738 -0.4342216 0.020953053874894732 0.03155055353437131 0.9992748630967826 0.003908975300134383 +59.5885 -16.69261 -0.4028967 0.021432400922989238 0.03196114999092366 0.9992523922002994 0.0037139956802225396 +58.11853 -16.67728 -0.3696661 0.023605948203460767 0.03217945640862144 0.9991965584960969 0.0036714144708727223 +56.651 -16.66067 -0.331875 0.02507685086812287 0.032301081587404516 0.9991565305375241 0.003744217279942751 +55.18855 -16.64289 -0.2928006 0.02478496109910151 0.032096912270468035 0.9991690012333274 0.004098890144043372 +53.73519 -16.62286 -0.2553499 0.022959449003049522 0.0322271527188174 0.9992062673810589 0.004594513633270268 +52.27704 -16.60088 -0.2238776 0.02121339438337727 0.031655548726466386 0.999260987580889 0.005039527004388365 +50.82074 -16.57777 -0.1950878 0.02081957476746884 0.03156922148182963 0.9992700430735101 0.005413924399021327 +49.36683 -16.55446 -0.1667537 0.02158214721224716 0.03120525050211799 0.9992633982799533 0.005753618062442453 +47.91138 -16.53154 -0.1361189 0.022953101762249843 0.03179027899926177 0.9992124568145327 0.00608271545500048 +46.46111 -16.5059 -0.1018661 0.024675224454393126 0.0314925888852821 0.9991786487142963 0.006432580889569971 +45.01382 -16.47825 -0.06682977 0.02472132493202827 0.030701987899528674 0.9992010893993585 0.006589914700324977 +43.57573 -16.45171 -0.03390796 0.022437376361668253 0.030651172100623337 0.9992566491202318 0.006574114379042892 +42.13696 -16.42535 -0.004286154 0.020726594949216557 0.030546729942434268 0.9992981309478974 0.006367969509501153 +40.69789 -16.40183 0.02267275 0.021001770904048672 0.030870655058956448 0.9992839663757428 0.006122321413316164 +39.26177 -16.38003 0.0513988 0.022497541519638856 0.03147846625834446 0.9992334080805372 
0.005963469070024312 +37.82483 -16.35679 0.08060552 0.023228067411014696 0.031226470374165982 0.9992257254607267 0.0057718289826668955 +36.39684 -16.33245 0.1118254 0.02250045242605646 0.030618250998587244 0.9992629247103558 0.005464398019547973 +34.97003 -16.30758 0.1414094 0.022174631964380017 0.029698863073291143 0.9992994353892304 0.005186681035470868 +33.54441 -16.28474 0.1706823 0.02180090562587617 0.029256171808690112 0.9993221330180895 0.004906259794169456 +32.12471 -16.264 0.1998713 0.021281782282354354 0.02965809474313164 0.9993222151430415 0.004753260007471801 +30.70783 -16.24608 0.2279988 0.021197430188197974 0.0305886406435069 0.999296361803354 0.004667472932874278 +29.28474 -16.22987 0.257281 0.02193508218178756 0.03185060010102543 0.9992412759475635 0.004611278084026782 +27.8676 -16.21276 0.2860073 0.02241450163483873 0.032119587809520296 0.9992222010670172 0.004573301891851275 +26.45737 -16.19488 0.3156601 0.02175028512100806 0.03212492318950204 0.9992368064694642 0.00455181325737315 +25.05017 -16.17482 0.3455316 0.021719966807737755 0.031319824916383694 0.9992637301045899 0.004394235603513558 +23.6461 -16.15363 0.3743586 0.02177700809460173 0.030318814551547783 0.9992941943846936 0.0042005323094430344 +22.24839 -16.13407 0.4041376 0.021690267848528785 0.02983658906915148 0.9993112047980764 0.004052924671710977 +20.85659 -16.11623 0.4323654 0.022358126233984067 0.029738221631291552 0.9992997576927033 0.0039682037268662 +19.47255 -16.10044 0.4615578 0.022546588372237032 0.03016287589520217 0.9992831237960457 0.003884683631975494 +18.08843 -16.08658 0.4926781 0.022879252550762643 0.03083294620971204 0.9992555497772618 0.0037703408452715863 +16.70842 -16.07258 0.5236603 0.02344267119055308 0.031005322197163798 0.9992372306911201 0.003750728414993696 +15.33299 -16.0573 0.5539333 0.022206131788650676 0.03084187319580101 0.9992711645157402 0.0035785941535179285 +13.9623 -16.04039 0.5822458 0.01996555246639013 0.029762906047671987 0.999351994500645 
0.0033372482514966123 +12.59501 -16.01956 0.6052059 0.018917466223599216 0.027744174473193668 0.9994306056228737 0.0032946621641649645 +11.22152 -15.99987 0.6277813 0.019788318875534772 0.026113804138032593 0.999457318106204 0.003400140213976403 +9.852998 -15.98184 0.653636 0.02127842360597966 0.02558553613324546 0.9994398876271741 0.0035383681832939595 +8.487382 -15.96753 0.6821976 0.022036424904129655 0.026199100972797127 0.999407572902363 0.0035364263133636537 +7.122797 -15.95764 0.7126679 0.02335073301968515 0.028292302656660193 0.9993201101952943 0.0036886633205085576 +5.763789 -15.94956 0.7443265 0.024123592959603073 0.030624874681817738 0.9992325823894714 0.0037967887366320337 +4.400318 -15.93877 0.7787213 0.02370667102759514 0.03186990508351989 0.9992035593295723 0.0038144359833160485 +3.038918 -15.9259 0.8125999 0.02261539179059871 0.03256271033684962 0.9992062063046354 0.0038950265691465873 +1.676097 -15.90918 0.8446651 0.021258459834429633 0.0322291167211062 0.9992473672972325 0.003749782713040033 +0.3089439 -15.88963 0.8754178 0.020977440131755745 0.031132416533832676 0.9992886346768396 0.0035978121101940736 +-1.063172 -15.87121 0.9045038 0.02120005384129237 0.030274248959713732 0.9993106802166793 0.00349170043520793 +-2.436523 -15.85397 0.9332548 0.021010641559918834 0.029882175192407958 0.9993269135128069 0.0033657801440120975 +-3.813595 -15.83843 0.9601236 0.020081961197486647 0.029976573187819173 0.9993433878663041 0.003303486381958705 +-5.192326 -15.8232 0.9870965 0.01983754468867219 0.029732496687530603 0.9993557772399603 0.0032374308588353153 +-6.57647 -15.80825 1.014088 0.02016940192868718 0.029641428314405006 0.9993519118759732 0.0032151490102785454 +-7.966034 -15.79123 1.041956 0.01997172000516704 0.028999546107997036 0.9993748684068092 0.0031668789743476223 +-9.357139 -15.7739 1.070725 0.02003228102174497 0.028023911207160537 0.9994021787500078 0.0029416368668567137 +-10.75522 -15.75586 1.10059 0.02142410515067714 0.026793509992555448 
0.9994073292974124 0.0028470488905359644 +-12.14993 -15.73852 1.129671 0.021524247167213226 0.025942627041639046 0.9994279866375609 0.002717795366556129 +-13.54746 -15.72233 1.159922 0.01989364826624101 0.025711534374179886 0.999468258141441 0.0025220480760894354 +-14.95103 -15.70577 1.188722 0.019397472755654607 0.02511019535051045 0.9994932234030673 0.002551962284026683 +-16.35914 -15.6909 1.214563 0.01998608408548817 0.025005214902775408 0.9994837969780853 0.0027267652747028043 +-17.76435 -15.67519 1.238904 0.01895043108745312 0.024419538103863825 0.9995183016580017 0.0027806422095335817 +-19.17351 -15.65947 1.258378 0.017022736651195892 0.023915868376489846 0.9995648891171041 0.0028792570246226512 +-20.58415 -15.64353 1.279114 0.01658493110512659 0.023634129016058443 0.999578630267338 0.002988296967018817 +-22.00393 -15.62986 1.298206 0.017683239452542385 0.024026221113070238 0.9995500189513215 0.0031310311245522932 +-23.42439 -15.61782 1.320134 0.01802599062003311 0.025475784341950875 0.9995072022618314 0.003376492418953509 +-24.84504 -15.60601 1.342475 0.017633933632639976 0.02641089625638073 0.9994889997928118 0.0036398127340520267 +-26.26783 -15.59027 1.362532 0.017786385760014848 0.02618142911227566 0.9994915536780354 0.00384855524862023 +-27.68798 -15.57331 1.381668 0.01677390674864704 0.02592692104158377 0.9995158300178613 0.0038126580487955163 +-29.11536 -15.55401 1.401278 0.016970533299113983 0.025374670282571284 0.999526490611797 0.00386285812058895 +-30.54783 -15.53676 1.423844 0.01917137839318833 0.025160159233112357 0.9994909993428874 0.004143292220662105 +-31.97309 -15.5214 1.450596 0.019473236832053733 0.026350486911963598 0.999453242839156 0.004433989747421006 +-33.39613 -15.50772 1.474487 0.018726599220297437 0.02772535915903161 0.9994282617512766 0.0048753003976000145 +-34.81999 -15.49326 1.496909 0.018981946751468973 0.028701502352172368 0.9993942073857205 0.005208426261371306 +-36.24448 -15.47662 1.519635 0.01964774288568167 
0.029379684309609456 0.9993607201314444 0.005380651236316782 +-37.66558 -15.46088 1.542757 0.019781490227387555 0.030372490986008575 0.9993279027192167 0.005472411031745277 +-39.08217 -15.4441 1.566849 0.020001712960150516 0.031487421412858964 0.9992881553226935 0.005626402331582135 +-40.49901 -15.42668 1.593713 0.020757082478989063 0.03238579255820387 0.9992430215213566 0.005804128548465199 +-41.91388 -15.40927 1.621146 0.021249464955299818 0.033318806847166614 0.9992017660966201 0.005843627194630488 +-43.32259 -15.39245 1.647343 0.01964238352135774 0.034462536599135174 0.9991958087625201 0.0058520160770826985 +-44.72465 -15.37523 1.670185 0.017051001337550532 0.035666195890867204 0.9992004022547727 0.0059784578246959525 +-46.13041 -15.35699 1.68978 0.016608936862169767 0.035966311942438496 0.99919657418428 0.006064137198846929 +-47.53795 -15.33783 1.710257 0.018053001601641444 0.03605490068880934 0.9991680777327859 0.006106202523719637 +-48.94398 -15.3171 1.733637 0.01847335397119575 0.03584828380730592 0.9991678065725514 0.006109832249914256 +-50.3466 -15.29715 1.757778 0.018031664255783928 0.03563600980960708 0.9991839822782795 0.006025234230722762 +-51.7485 -15.27696 1.78328 0.01782837191249632 0.03572858743511733 0.9991842109751501 0.0060439831905350315 +-53.15272 -15.25676 1.808697 0.018626423689654233 0.03538846959055769 0.9991819028389982 0.006019767405006135 +-54.54794 -15.23749 1.833987 0.017503132561295327 0.03565532029888657 0.9991946894177722 0.005684287488074441 +-55.9495 -15.21771 1.859642 0.01695787510770998 0.03515499424430761 0.9992240975302719 0.005268753773696039 +-57.35212 -15.19945 1.883021 0.017594767595326233 0.03504146716665346 0.9992198002411145 0.004723403241530531 +-58.75141 -15.1841 1.905888 0.017662865381489343 0.03547946801371712 0.9992048103640424 0.004356315109943966 +-60.15069 -15.17008 1.927806 0.016917646229211037 0.03596909253507103 0.9992020965793958 0.003897155322317364 +-61.54936 -15.15697 1.949198 0.016823974755130706 
0.03579683442344435 0.9992112436651946 0.0035257696683517717 +-62.95149 -15.14226 1.969588 0.017355930777542074 0.035494011054918305 0.9992143379564117 0.003107035308496088 +-64.35076 -15.12801 1.98995 0.017598764570644767 0.035123844491605695 0.9992243966730685 0.002684050494812029 +-65.75074 -15.11338 2.0103 0.01779289052417658 0.03453695642166122 0.9992423614112957 0.0023053956494129358 +-67.14653 -15.10121 2.031551 0.01782577430234554 0.03452251891710764 0.9992429876595055 0.0019720728803227926 +-68.53878 -15.09141 2.052437 0.017877857689930682 0.03446333141376637 0.9992444583174647 0.0017814359713769997 +-69.93462 -15.08303 2.071234 0.017547919459178888 0.03493758517116097 0.999233869901623 0.0017632091448127108 +-71.32326 -15.07453 2.08736 0.016138877110256838 0.03518659999175084 0.9992485575316847 0.0019390970805598185 +-72.70839 -15.06438 2.102811 0.015865155127753102 0.03518040276968461 0.9992529273982056 0.002055042542537473 +-74.09527 -15.05383 2.117081 0.016578892826746322 0.0355901001433936 0.9992262708075182 0.0023118850582934804 +-75.47624 -15.04316 2.130723 0.017473000220101877 0.03638263620895787 0.9991815733111143 0.0026798509886250123 +-76.85642 -15.03084 2.146554 0.018677554235864532 0.03686579788627925 0.9991404581916425 0.003225944601030539 +-78.22646 -15.01678 2.16346 0.01844121998633743 0.0372325448700663 0.9991291409882855 0.003823432479455407 +-79.59325 -14.99833 2.179823 0.016618801621000802 0.03715396722466116 0.999162750777298 0.004146759131404652 +-80.9536 -14.9793 2.191194 0.0151963130783136 0.03623283264719445 0.9992185168911063 0.004313862436875343 +-82.31728 -14.95957 2.200916 0.015895614549677736 0.035570199134069634 0.99923124619619 0.004359701464547636 +-83.68053 -14.94128 2.214391 0.018267599964107605 0.03561772335770302 0.9991890140379014 0.004357384574706263 +-85.03495 -14.92467 2.232509 0.019607651622533532 0.035873387083454375 0.9991543058300487 0.004394682955970009 +-86.38401 -14.90817 2.250863 0.01952424618373312 
[Trajectory data truncated: several hundred estimated camera poses, one record per frame, each consisting of seven floating-point fields — a translation `tx ty tz` (meters) followed by a unit orientation quaternion `qx qy qz qw`.]
-1.889354863450131e-05 -0.004351795534977472 0.9999818361340093 +202.9698 -1.705984 -4.106859 -0.004302080236342997 -0.0002223496994798571 -0.00440959252899705 0.9999809988994688 +204.3226 -1.717412 -4.139885 -0.004650438396579134 -0.00022901769832201367 -0.00437144195339372 0.9999796055264636 +205.6689 -1.727574 -4.171862 -0.0047003260383536045 7.488944897706169e-05 -0.004468454665033961 0.9999789668986094 +207.0122 -1.737778 -4.204811 -0.004547744961394204 0.0002842958301201656 -0.004560212970367722 0.9999792206087645 +208.3506 -1.746442 -4.238781 -0.0043302180110784015 0.0006647169719032937 -0.004623150447977305 0.9999797167159238 +209.6908 -1.757196 -4.270954 -0.004342836745440385 0.00029333588551749476 -0.004659892437938883 0.9999796693560962 +211.0251 -1.768399 -4.303818 -0.004589617163271408 -5.707627434395361e-05 -0.0047714764019583175 0.9999780823445777 +212.3505 -1.780005 -4.335485 -0.004375833986804754 9.594370218112898e-05 -0.00491358737916557 0.9999783495310253 +213.6668 -1.792488 -4.366042 -0.004250008268879994 1.1779266817557034e-05 -0.005213913102728992 0.9999773759446363 +214.9667 -1.802728 -4.397364 -0.00367742042841126 -0.0006789744767759679 -0.005686916723456945 0.9999768370071545 +216.2478 -1.814957 -4.428604 -0.0036294134657329107 -0.0017031546393686627 -0.006451628642203999 0.9999711511388878 +217.5072 -1.826714 -4.458316 -0.0031482807164717533 -0.002834793741774322 -0.007217102659677795 0.999964982237964 +218.7426 -1.842513 -4.486884 -0.003435214585678441 -0.004063584524082948 -0.007935874814574395 0.9999543531944313 +219.9508 -1.859734 -4.515014 -0.0039300710813878455 -0.004085445906694353 -0.008660391541802908 0.999946429210776 +221.1287 -1.87789 -4.540479 -0.00395087016309762 -0.0035890410384786754 -0.009396653603507065 0.999941604450197 +222.2852 -1.895168 -4.569335 -0.003735797193859575 -0.0038084560044598648 -0.01052837578893123 0.9999303439667363 +223.4108 -1.917244 -4.598297 -0.0037313498758303194 -0.003685824613298926 
-0.012197227546154965 0.9999118557979055 +224.5051 -1.94167 -4.625178 -0.0033207431614200303 -0.0034470566273997177 -0.014079084168578322 0.9998894288142252 +225.5672 -1.971966 -4.651833 -0.0037599798244474084 -0.0032941522967861146 -0.015841660926882634 0.9998620169260573 +226.6035 -2.004581 -4.679495 -0.0040773511427864536 -0.003122249980256115 -0.017276859240585436 0.9998375552546024 +227.6035 -2.040151 -4.706339 -0.005226128672949349 -0.0032506052258088345 -0.018345993069284548 0.999812755311244 +228.5742 -2.073778 -4.731016 -0.005212419363128533 -0.0038746585121623043 -0.019035507984075262 0.9997977131107999 +229.5125 -2.106014 -4.755443 -0.005216993161700781 -0.004797530796961254 -0.019722096010096724 0.9997803786880254 +230.4146 -2.139617 -4.778138 -0.005141651558309187 -0.005032250135780837 -0.02043325243995957 0.9997653335020918 +231.2868 -2.17419 -4.799162 -0.00530925356053649 -0.0049554000143027304 -0.021290832861494188 0.9997469461185626 +232.1263 -2.207888 -4.818374 -0.004981601111739464 -0.004739977517867859 -0.022262015998466524 0.9997285226036005 +232.9333 -2.243669 -4.838575 -0.005550045971521183 -0.003936008022376021 -0.0231508606885917 0.9997088288496503 +233.7189 -2.280068 -4.860372 -0.0063447770776162505 -0.002673314816609195 -0.023462522754625032 0.9997010089110221 +234.4921 -2.314971 -4.881633 -0.007146112175537706 -0.0006653583406530917 -0.02284895862670062 0.9997131665976638 +235.2654 -2.346879 -4.903083 -0.007988122722651531 0.0010221613278802427 -0.021012535715947217 0.9997467771512812 +236.0372 -2.372559 -4.923693 -0.007864207592357068 0.0015428968877048278 -0.018579346375119852 0.9997952698409869 +236.8075 -2.391165 -4.943065 -0.007098149161328073 0.001190372377912095 -0.016449689050093152 0.9998387905168716 +237.5762 -2.40634 -4.961684 -0.0062441453524815504 0.0006183670391566793 -0.01439790129737417 0.9998766567478476 +238.3384 -2.42154 -4.979606 -0.005589043939301585 9.646465343892369e-05 -0.01224686722376506 0.999909379656786 
+239.1021 -2.438172 -4.997815 -0.00559643574668683 -3.758471129258567e-05 -0.009968738992025146 0.9999346492332545 +239.8593 -2.449188 -5.01612 -0.005437798996150165 5.1027223651815144e-05 -0.007697949370881683 0.9999555836704864 +240.6138 -2.457471 -5.032857 -0.005459624573872094 8.40402967771624e-05 -0.0055975837956608 0.9999694257788042 +241.3652 -2.464848 -5.049016 -0.005482832435611237 -9.559489111473524e-05 -0.00369193451015952 0.9999781492761092 +242.113 -2.471118 -5.064022 -0.005381833762726796 -0.0006099678102152499 -0.0019047371648805106 0.9999835177546446 +242.8585 -2.47794 -5.076752 -0.005517103882406049 -0.0012776743337648434 -0.00032969291146410376 0.9999839100783731 +243.6012 -2.481229 -5.090455 -0.0051468151069383366 -0.0015100057853576051 0.000628169916818358 0.9999854176833475 +244.3383 -2.485941 -5.103671 -0.004914912681250389 -0.0009937161785989403 0.0011757840931717104 0.9999867367586724 +245.0752 -2.487604 -5.118722 -0.0049864540743464305 0.0003533733846354315 0.0015831422715725364 0.9999862519372771 +245.8164 -2.493249 -5.131839 -0.005436862077213867 0.002254829121294674 0.0019588136774827034 0.9999807594775831 +246.5623 -2.49473 -5.147361 -0.005435868951980338 0.003359009094162382 0.0021825887687127 0.999977202086582 +247.3157 -2.497272 -5.164438 -0.005023549763908254 0.0034056833291604316 0.0019626207028765373 0.9999796564874747 +248.0723 -2.498606 -5.182585 -0.004732006477907364 0.0032021965786981237 0.0016913882667059876 0.999982246471154 +248.8329 -2.503085 -5.199298 -0.0052410017653757345 0.003077582257910577 0.0013698330982699835 0.9999805917842726 +249.5991 -2.506486 -5.218219 -0.0054122897028467225 0.003828302688367552 0.001092788923830244 0.9999774282607917 +250.3665 -2.508505 -5.237784 -0.0045826722319296295 0.00504629388391841 0.0012753301768844022 0.9999759534939786 +251.1395 -2.505563 -5.256841 -0.003225485166293301 0.005940301457055155 0.001670844741192291 0.9999757583771179 +251.9216 -2.503453 -5.273084 -0.0026299286256550535 
0.004852364238184509 0.002278267107665933 0.9999821736089649 +252.7116 -2.508541 -5.289235 -0.004711754018578084 0.0019099650784319506 0.002742785956896514 0.9999833141271219 +253.5088 -2.507121 -5.306929 -0.004824560076717001 -0.0010845146864033746 0.0034486573509592703 0.9999818269400883 +254.3037 -2.498934 -5.321532 -0.0038145462024468456 -0.0009598637215940122 0.004397340355187277 0.9999825954968946 +255.099 -2.491981 -5.336196 -0.003994993128337885 0.002062161742000586 0.005960648679795896 0.9999721287046807 +255.9 -2.485561 -5.351123 -0.004796561029817668 0.0045169030643075715 0.007517807922964464 0.9999500353282799 +256.7068 -2.475778 -5.3702 -0.005612387631943082 0.004923281742983772 0.008675215758401782 0.9999344993715804 +257.5259 -2.461593 -5.39272 -0.005562715006197702 0.004126253566910604 0.009834619112070767 0.9999276526330202 +258.3481 -2.442073 -5.414196 -0.004415152221408841 0.002857985916314043 0.010871765314702513 0.9999270688736784 +259.1759 -2.418788 -5.432422 -0.0026146608160637136 0.002208208915103914 0.011740998179220752 0.9999252153656091 +260.0072 -2.398099 -5.451575 -0.0023029409173612085 0.0025559304891376883 0.012752790102114992 0.9999127612082352 +260.8436 -2.3794 -5.472421 -0.0032614361369239864 0.003720928152911217 0.013702866788221734 0.9998938689529955 +261.6878 -2.357178 -5.492976 -0.003966880638307105 0.005392550904201476 0.014410589589869767 0.9998737516108819 +262.5408 -2.331649 -5.513461 -0.004087672727956619 0.006390651196631106 0.015189674554177874 0.9998558517585874 +263.4078 -2.305211 -5.536213 -0.004538530284909171 0.0056210009082556705 0.016189706908188682 0.9998428373908917 +264.2827 -2.278446 -5.560785 -0.005112991479589846 0.004600104307079007 0.017275659270124085 0.9998271090319947 +265.167 -2.24888 -5.58502 -0.005301363320306526 0.004378555903907105 0.017989979360907418 0.9998145250183839 +266.0593 -2.217026 -5.60834 -0.005284721028860163 0.004610398592097925 0.018232529470735465 0.9998091772021138 +266.9604 
-2.181048 -5.632766 -0.004234279825377067 0.004716359282874845 0.018276174867779327 0.9998128866251313 +267.8693 -2.144562 -5.657825 -0.003517251090219671 0.004207383175084425 0.01831029273033074 0.9998173133386499 +268.7872 -2.107091 -5.684566 -0.002968226161870879 0.0041022540620476635 0.018465827821355793 0.9998166703690905 +269.7117 -2.072995 -5.710527 -0.0025951922859569535 0.0038327845091246544 0.018507780637666272 0.999818001836221 +270.6435 -2.039857 -5.735484 -0.002340291269136965 0.0033789372637785843 0.018407774202823347 0.9998221140126081 +271.578 -2.008137 -5.759967 -0.0026258219195686313 0.002477615242992138 0.01825923338381998 0.9998267684345072 +272.5156 -1.977183 -5.786935 -0.002481355510887373 0.00214908450925928 0.01781139906174545 0.9998359757350522 +273.4511 -1.948577 -5.813039 -0.0025282710647775426 0.0022739337321695968 0.016965970360044706 0.9998502852530207 +274.3868 -1.923694 -5.837506 -0.002650141978318813 0.0027849133271064656 0.015813394287244636 0.9998675700143351 +275.322 -1.903306 -5.862752 -0.0033837211473557623 0.003147385275300177 0.014502444650734561 0.9998841550381115 +276.2556 -1.886301 -5.888992 -0.004215415565174309 0.0032251218254860854 0.013162615702802923 0.9998992821322976 +277.1894 -1.871958 -5.914599 -0.004810677758635455 0.002941066469456641 0.01178342801298534 0.9999146755257609 +278.1217 -1.860171 -5.940087 -0.0053171323165317795 0.0022407606168909477 0.010514596028559644 0.9999280725962955 +279.0549 -1.850165 -5.965237 -0.0053460976010341575 0.001223379212258305 0.009269072653673781 0.9999420017560439 +279.9822 -1.84113 -5.989292 -0.005418973154136164 0.0005348331756265014 0.008157844894638888 0.9999518979681492 +280.9067 -1.834878 -6.011766 -0.005537934759043605 0.0001675988270577558 0.007105218367803964 0.9999594087067653 +281.8284 -1.829538 -6.033194 -0.00573560102215185 2.7408065048187213e-05 0.0061672813399026565 0.9999645327563309 +282.7472 -1.824423 -6.05424 -0.005303278331526704 7.71670197215794e-05 
0.005360757466301482 0.9999715653775245 +283.6626 -1.820378 -6.076007 -0.005150161549096235 0.0003111863453735988 0.00463752995162213 0.9999759358679711 +284.5738 -1.819188 -6.097253 -0.005172886674369616 0.000436649540538892 0.003903946143233418 0.9999789046700652 +285.4729 -1.820078 -6.118356 -0.005560881546158259 4.9562236454426417e-05 0.003171319289468756 0.999979508227033 +286.3559 -1.821293 -6.140076 -0.005503369270736444 -0.0004938601576802682 0.0025464655903669575 0.9999814920996347 +287.2165 -1.824665 -6.162957 -0.006128368382255615 -0.00029061201298678676 0.0020643132449836817 0.9999790484087434 +288.0602 -1.831132 -6.186044 -0.007361025916745047 0.0007905429200520044 0.0017763957526850526 0.9999710169587289 +288.8905 -1.835812 -6.20897 -0.008631048904592532 0.0023079725538300106 0.0016563028942316459 0.9999587166069507 +289.7169 -1.838573 -6.232745 -0.009461461581921977 0.0034262367643627064 0.0016376433357821012 0.99994802853482 +290.5416 -1.841599 -6.257807 -0.010316437835547839 0.0037337323458844305 0.0018110642228688506 0.9999381732884965 +291.3657 -1.843827 -6.281585 -0.011303544920266848 0.0040174851343701915 0.002067771690342757 0.9999259042577442 +292.1857 -1.844053 -6.305431 -0.011881819979992308 0.004415544077263771 0.0025246228799578434 0.9999164723134522 +293.0004 -1.843243 -6.332348 -0.012858453071016393 0.0049163189828790995 0.00329451677193526 0.9998998130570474 +293.8122 -1.840209 -6.360716 -0.013458276917570025 0.0050742597201222026 0.004083078672120882 0.9998882213223936 +294.6199 -1.836186 -6.387429 -0.013941011877324477 0.004491014587157527 0.004697940818840592 0.9998816971661582 +295.4284 -1.829836 -6.414405 -0.01370061734422586 0.0035193935137974555 0.005148825886130237 0.9998866918534702 +296.2317 -1.823832 -6.440313 -0.01312004963372362 0.002423255966845758 0.005081527885411473 0.9998980799073861 +297.0309 -1.81794 -6.46419 -0.012496453601313236 0.0013795939142901106 0.004445965805628529 0.999911080424693 +297.8283 -1.814744 
-6.486179 -0.012110832418403357 0.0005711841122432597 0.0036839357952715675 0.9999197118288545 +298.6243 -1.812035 -6.506511 -0.011683835588696677 0.00017611145831383709 0.0029392593505399717 0.9999274062276525 +299.4226 -1.809517 -6.524789 -0.011108068369766778 -0.0003836599889766203 0.002219462785821458 0.9999357667405682 +300.2232 -1.807621 -6.541554 -0.010223099399367618 -0.0006464734259319517 0.0012326256396453285 0.9999467740559057 diff --git a/动态slam/2020年-2022年开源动态SLAM.zip b/动态slam/2020年-2022年开源动态SLAM.zip new file mode 100644 index 0000000..6b87990 Binary files /dev/null and b/动态slam/2020年-2022年开源动态SLAM.zip differ diff --git a/动态slam/2020年-2022年开源动态SLAM/2020-2022年开源动态SLAM.docx b/动态slam/2020年-2022年开源动态SLAM/2020-2022年开源动态SLAM.docx new file mode 100644 index 0000000..82b382c --- /dev/null +++ b/动态slam/2020年-2022年开源动态SLAM/2020-2022年开源动态SLAM.docx @@ -0,0 +1,38 @@ + 2020-2023年开源的动态SLAM论文 +一、2020年 +1.Zhang J, Henein M, Mahony R, et al. VDO-SLAM: a visual dynamic object-aware SLAM system[J]. arXiv preprint arXiv:2005.11052, 2020. +https://github.com/halajun/vdo_slam +2.Bescos B, Cadena C, Neira J. Empty cities: A dynamic-object-invariant space for visual SLAM[J]. IEEE Transactions on Robotics, 2020, 37(2): 433-451. +https://github.com/bertabescos/EmptyCities_SLAM +3.Vincent J, Labbé M, Lauzon J S, et al. Dynamic object tracking and masking for visual SLAM[C]//2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020: 4974-4979. +https://github.com/introlab/dotmask + +二、2021年 +1.Liu Y, Miura J. RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods[J]. Ieee Access, 2021, 9: 23772-23785. + https://github.com/yubaoliu/RDS-SLAM/ + 2.Bao R, Komatsu R, Miyagusuku R, et al. Stereo camera visual SLAM with hierarchical masking and motion-state classification at outdoor construction sites containing large dynamic objects[J]. Advanced Robotics, 2021, 35(3-4): 228-241. 
+https://github.com/RunqiuBao/kenki-positioning-vSLAM
+3. Wimbauer F, Yang N, Von Stumberg L, et al. MonoRec: Semi-supervised dense reconstruction in dynamic environments from a single moving camera[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 6112-6122.
+https://github.com/Brummi/MonoRec
+4. Wang W, Hu Y, Scherer S. TartanVO: A generalizable learning-based VO[C]//Conference on Robot Learning. PMLR, 2021: 1761-1772.
+https://github.com/castacks/tartanvo
+5. Zhan H, Weerasekera C S, Bian J W, et al. DF-VO: What should be learnt for visual odometry?[J]. arXiv preprint arXiv:2103.00933, 2021.
+https://github.com/Huangying-Zhan/DF-VO
+
+III. 2022
+1. Liu J, Li X, Liu Y, et al. RGB-D inertial odometry for a resource-restricted robot in dynamic environments[J]. IEEE Robotics and Automation Letters, 2022, 7(4): 9573-9580.
+https://github.com/HITSZ-NRSL/Dynamic-VINS
+2. Song S, Lim H, Lee A J, et al. DynaVINS: A visual-inertial SLAM for dynamic environments[J]. IEEE Robotics and Automation Letters, 2022, 7(4): 11523-11530.
+https://github.com/url-kaist/dynavins
+3. Wang H, Ko J Y, Xie L. Multi-modal semantic SLAM for complex dynamic environments[J]. arXiv preprint arXiv:2205.04300, 2022.
+https://github.com/wh200720041/MMS_SLAM
+4. Qiu Y, Wang C, Wang W, et al. AirDOS: Dynamic SLAM benefits from articulated objects[C]//2022 International Conference on Robotics and Automation (ICRA). IEEE, 2022: 8047-8053.
+https://github.com/haleqiu/AirDOS
+5. Cheng S, Sun C, Zhang S, et al. SG-SLAM: A real-time RGB-D visual SLAM toward dynamic scenes with semantic and geometric information[J]. IEEE Transactions on Instrumentation and Measurement, 2022, 72: 1-12.
+https://github.com/silencht/SG-SLAM
+6. Esparza D, Flores G. The STDyn-SLAM: A stereo vision and semantic segmentation approach for VSLAM in dynamic outdoor environments[J]. IEEE Access, 2022, 10: 18201-18209.
+https://github.com/DanielaEsparza/STDyn-SLAM
+7. Shen S, Cai Y, Wang W, et al.
DytanVO: Joint refinement of visual odometry and motion segmentation in dynamic environments[C]//2023 IEEE International Conference on Robotics and Automation (ICRA). IEEE, 2023: 4048-4055. +https://github.com/castacks/DytanVO + + diff --git a/动态slam/2020年-2022年开源动态SLAM/2020年/Dynamic object tracking and masking for visual SLAM.pdf b/动态slam/2020年-2022年开源动态SLAM/2020年/Dynamic object tracking and masking for visual SLAM.pdf new file mode 100644 index 0000000..3578993 --- /dev/null +++ b/动态slam/2020年-2022年开源动态SLAM/2020年/Dynamic object tracking and masking for visual SLAM.pdf @@ -0,0 +1,381 @@ + Dynamic Object Tracking and Masking for Visual SLAM + + Jonathan Vincent, Mathieu Labbe´, Jean-Samuel Lauzon, Franc¸ois Grondin, + Pier-Marc Comtois-Rivet, Franc¸ois Michaud + +arXiv:2008.00072v1 [cs.CV] 31 Jul 2020 Abstract— In dynamic environments, performance of visual the proposed method. Our research hypothesis is that a + SLAM techniques can be impaired by visual features taken deep learning algorithm can be used to semantically segment + from moving objects. One solution is to identify those objects object instances in images using a priori semantic knowledge + so that their visual features can be removed for localization and of dynamic objects, enabling the identification, tracking and + mapping. This paper presents a simple and fast pipeline that removal of dynamic objects from the scenes using extended + uses deep neural networks, extended Kalman filters and visual Kalman filters to improve both localization and mapping in + SLAM to improve both localization and mapping in dynamic vSLAM. By doing so, the approach, referred to as Dynamic + environments (around 14 fps on a GTX 1080). 
Results on the Object Tracking and Masking for vSLAM (DOTMask)1 + dynamic sequences from the TUM dataset using RTAB-Map aims at providing six benefits: 1) increased visual odometry + as visual SLAM suggest that the approach achieves similar performance; 2) increased quality of loop closure detection; + localization performance compared to other state-of-the-art 3) produce 3D maps free of dynamic objects; 4) tracking of + methods, while also providing the position of the tracked dynamic objects; 5) modular and fast pipeline. + dynamic objects, a 3D map free of those dynamic objects, better + loop closure detection with the whole pipeline able to run on a The paper is organized as follows. Section II presents re- + robot moving at moderate speed. lated work of approaches taking into consideration dynamic + objects during localization and during mapping. Section III + I. INTRODUCTION describes our approach applied as a pre-processing module + to RTAB-Map [5], a vSLAM approach. Section IV presents + To perform tasks effectively and safely, autonomous mo- the experimental setup, and Section V provides comparative + bile robots need accurate and reliable localization from their results on dynamic sequences taken from the TUM dataset. + representation of the environment. Compared to LIDARs + (Light Detection And Ranging sensors) and GPS (Global II. RELATED WORK + Positioning System), using visual images for Simultaneous + Localization and Mapping (SLAM) adds significant infor- Some approaches take into consideration dynamic objects + mation about the environment [1], such as color, textures, during localization. For instance, BaMVO [6] uses a RGB- + surface composition that can be used for semantic interpre- D camera to estimate ego-motion. It uses a background + tation of the environment. 
Standard visual SLAM (vSLAM) model estimator combined with an energy-based dense visual + techniques perform well in static environments by being odometry technique to estimate the motion of the camera. Li + able to extract stable visual features from images. However, et al. [7] developed a static point weighting method which + in environments with dynamic objects (e.g., people, cars, calculates a weight for each edge point in a keyframe. This + animals), performance decreases significantly because visual weight indicates the likelihood of that specific edge point + features may come from those objects, making localization being part of the static environment. Weights are determined + less reliable [1]. Deep learning architectures have recently by the movement of a depth edge point between two frames + demonstrated interesting capabilities to achieve semantic seg- and are added to an Intensity Assisted Iterative Closest Point + mentation from images, outperforming traditional techniques (IA-ICP) method used to perform the registration task in + in tasks such as image classification [2]. For instance, Segnet SLAM. Sun et al. [8] present a motion removal approach to + [3] is commonly used for semantic segmentation [4]. It uses increase the localization reliability in dynamic environments. + an encoder and a decoder to achieve pixel wise semantic It consists of three steps: 1) detecting moving objects’ motion + segmentation of a scene. based on ego-motion compensated using image differencing; + 2) using a particle filter for tracking; and 3) applying a + This paper introduces a simple and fast pipeline that Maximum-A-Posterior (MAP) estimator on depth images + uses neural networks, extended Kalman filters and vSLAM to determine the foreground. This approach is used as the + algorithm to deal with dynamic objects. Experiments con- frontend of Dense Visual Odometry (DVO) SLAM [9]. Sun + ducted on the TUM dataset demonstrate the robustness of et al. 
[10] uses a similar foreground technique but instead + of using a MAP they use a foreground model which is + This work was supported by the Institut du ve´hicule innovant (IVI), updated on-line. All of these approaches demonstrate good + Mitacs, InnovE´ E´ and NSERC. J. Vincent, M. Labbe´, J.-S. Lauzon, localization results using the Technical University of Munich + F. Grondin and F. Michaud are with the Interdisciplinary Institute for (TUM) dataset [11], however, mapping is yet to be addressed. + Technological Innovation (3IT), Dept. Elec. Eng. and Comp. Eng., + Universite´ de Sherbrooke, 3000 boul. de l’Universite´, Que´bec (Canada) 1https://github.com/introlab/dotmask + J1K 0A5. P.-M. Comtois-Rivet is with the Institut du Ve´hicule Innovant + (IVI), 25, boul. Maisonneuve, Saint-Je´roˆme, Que´bec (Canada), J5L 0A1. + {Jonathan.Vincent2, Mathieu.m.Labbe, Jean-Samuel.Lauzon, + Francois.Grondin2, Francois.Michaud}@USherbrooke.ca, + Pmcrivet@ivisolutions.ca + Depth Image RGB Image Instance segmentation Dynamic is then applied to the original depth image, resulting in a + DOS Object masked depth image (MDI). The DOS is also sent to the + Classes Tracking module. After computing a 3D centroid for each + masked object, the Tracking module predict the position and + MDI velocity of the objects. This information is then used by the + Moving Object Classification module (MOC) to classify the +MO-MDI Tracking/MOC Camera object as idle or not based on its class, its estimated velocity + World and its shape deformation. Moving objects are removed + Pose from the original depth image, resulting in the Moving + Object Masked Depth Image (MO-MDI). The original RGB + vSLAM image, the MDI and the MO-MDI are used by the vSLAM + algorithm. It uses the depth images as a mask for feature + Odometry extraction thus ignoring features from the masked regions. 
+ The MO-MDI is used by the visual odometry algorithm of + Map the vSLAM approach while the MDI is used by both its + mapping and loop closure algorithms, resulting in a map free + Fig. 1: Architecture of DOTMask of dynamic objects while still being able to use the features + of the idle objects for visual odometry. The updated camera + SLAM++ [12] and Semantic Fusion [13] focus on pose is then used in the Tracking module to estimate the +the mapping aspect of SLAM in dynamic environments. position and velocity of the dynamic objects resulting in a +SLAM++ [12] is an object-oriented SLAM which achieves closed loop. +efficient semantic scene description using 3D object recog- +nition. SLAM++ defines objects using areas of interest A. Instance Segmentation +to subsequently locate and map them. However, it needs +predefined 3D object models to work. Semantic Fusion Deep learning algorithms such as Mask R-CNN recently +[13] creates a semantic segmented 3D map in real time proved to be useful to accomplish instance semantic seg- +using RGB-CNN [14], a convolutional deep learning neural mentation [4]. A recent and interesting architecture for +network, and a dense SLAM algorithm. However, SLAM++ fast instance segmentation is the YOLACT [18] and its +and Semantic Fusion do not address SLAM localization update YOLACT++ [19]. This network aims at providing +accuracy in dynamic environments, neither do they remove similar results as the Mask-RCNN or the Fully Convolutional +dynamic objects in the 3D map. Instance-aware Semantic Segmentation (FCIS) [20] but at a + much lower computational cost. YOLACT and YOLACT++ + Other approaches use deep learning algorithm to provide can achieve real-time instance segmentation. Development in +improved localisation and mapping. Fusion++ [15] and MID- neural networks has been incredibly fast in the past few years +Fusion [16] uses object-level octree-based volumetric repre- and probably will be in the years to come. 
sentation to estimate both the camera pose and the object positions. They use deep learning techniques to segment object instances. DynaSLAM [17] proposes to combine multi-view geometry models and deep-learning-based algorithms to detect dynamic objects and to remove them from the images prior to a vSLAM algorithm. They also use inpainting to recreate the image without object occlusion. DynaSLAM achieves impressive results on the TUM dataset. However, these approaches are not optimized for real-time operation.

III. DYNAMIC OBJECT TRACKING AND MASKING FOR VSLAM

The objective of our work is to provide a fast and complete solution for visual SLAM in dynamic environments. Figure 1 illustrates the DOTMask pipeline. As a general overview of the approach, a set of objects of interest (OOI) is defined using a priori knowledge and understanding of the dynamic object classes that can be found in the environment. Instance segmentation is done using a neural network trained to identify the object classes from an RGB image. For each dynamic object instance, its bounding box, class type and binary mask are grouped for convenience and referred to as the dynamic object state (DOS). The binary mask of the DOS […]

DOTMask was designed to be modular and can easily change the neural network used in the pipeline. In its current state, DOTMask works with Mask-RCNN, YOLACT and YOLACT++. YOLACT is much faster than the other two and its loss in precision does not impact our results, which is why this architecture is used in our tests. The Instance Segmentation module takes the input RGB image and outputs the bounding box, class and binary mask for each instance.

B. Tracking Using EKF

Using the DOS from the Instance Segmentation module and odometry from vSLAM, the Tracking module predicts the pose and velocity of the objects in the world frame. This is useful when the camera is moving at a speed similar to the objects to track (e.g., moving cars on the highway, a robot following a pedestrian) or when idle objects have a high amount of features (e.g., a person wearing a plaid shirt).

First, the Tracking module receives the DOS and the original depth image as a set, defined as Dk = {dk1, ..., dkL}, where dki = (Tk, Bki, ζki) is the object instance detected by the Instance Segmentation module, with i ∈ I, I = {1, ..., L}, L being the total number of object detections in the frame at time k. T ∈ R^(m×n) is the depth image, B ∈ Z2^(m×n) is the binary mask and ζ ∈ J is the class ID, with J = {1, ..., W}, W being the total number of classes trained in the Instance Segmentation module.

The DOS and the original depth image are used by the EKF to estimate the dynamic objects' positions and velocities. The EKF provides steady tracking of each object instance corresponding to the object type detected by the neural network. An EKF is instantiated for each new object, and a priori knowledge from the set of dynamic object classes defines some of the filter's parameters. This instantiation is made using the following parameters: the class of the object, its binary mask and its 3D centroid position. The 3D centroid is defined as the center of the corresponding bounding box. If the tracked object is observed in the DOS, its position is updated accordingly; otherwise its position predicted by the EKF is used. If no observations of the object are made for a given number of frames, the object is considered removed from the scene and therefore the filter is discarded. The Tracking module outputs the estimated velocity of the objects to the MOC module. The MOC module classifies the objects as idle or not based on the object class, the filter velocity estimation and the object deformation.

To explain further how the Tracking module works, the following subsections present in more detail the Prediction and Update steps of the EKF used by DOTMask.

1) Prediction: Let us define the hidden state x ∈ R^(6×1) as the 3D position and velocity of an object referenced in the global map in Cartesian coordinates. The a priori estimate of the state at time k ∈ N is predicted based on the previous state at time k − 1 as in (1):

    x̂k|k−1 = F x̂k−1|k−1   with   F = [ I3  ∆tI3 ; 03  I3 ]   (1)

where F ∈ R^(6×6) is the state transition matrix, ∆t ∈ R+ is the time between each prediction, 03 is a 3 × 3 zero matrix and I3 is a 3 × 3 identity matrix. Note that the value of ∆t is redefined before each processing cycle.

The a priori estimate of the state covariance Pk|k−1 ∈ R^(6×6) at time k is predicted based on the previous state at time k − 1 as given by (2):

    Pk|k−1 = F Pk−1|k−1 Fᵀ + Q   (2)

where Q ∈ R^(6×6) is the process noise covariance matrix defined using the random acceleration model (3):

    Q = Γ Σ Γᵀ   with   Γ = [ (∆t²/2) I3   ∆t I3 ]ᵀ   (3)

where Γ ∈ R^(6×3) is the mapping between the random acceleration vector a ∈ R³ and the state x, and Σ ∈ R^(3×3) is the covariance matrix of a. The acceleration components ax, ay and az are assumed to be uncorrelated.

The dynamics of every detected object may vary greatly depending on its class. For instance, a car does not have the same dynamics as a mug. To better track different types of objects, a covariance matrix is defined for each class to better represent its respective process noise.

2) Update: In the EKF, the Update step starts by evaluating the innovation ỹk defined as (4):

    ỹk = zk − ĥ(x̂k|k−1)   (4)

where zk ∈ R³ is a 3D observation of a masked object in reference to the camera for each object instance, with z = [zx zy zz]ᵀ, zx = (µx − Cx)zz/fx and zy = (µy − Cy)zz/fy, where Cx and Cy are the principal point coordinates and fx and fy are the focal lengths expressed in pixels. zz is approximated using the average depth of the masked region on the depth image. The expressions µx and µy stand for the center of the bounding box.

To simplify the following equations, (s, c) represent respectively the sine and cosine of the Euler angles φ, θ, ψ (roll, pitch, yaw). h(xk) ∈ R⁴ is the observation function which maps the true state space xk to the observed state space zk; ĥ(xk) is the first three terms of h(xk). However, in our case, the transform between those spaces is not linear, justifying the use of the EKF. The non-linear rotation used to transform the estimated state x̂k into the observed state zk follows the (x, y, z) Tait-Bryan convention and is given by h(x̂k) = [hφ hθ hψ 1], where:

    hφ = (cφcθ)x̂x + (cφsθsψ − cψsφ)x̂y + (sφsψ + cφcψsθ)x̂z + cx
    hθ = (cθsφ)x̂x + (cφcψ + sφsθsψ)x̂y + (cψsφsθ − cφsψ)x̂z + cy
    hψ = −(sθ)x̂x + (cθsψ)x̂y + (cθcψ)x̂z + cz   (5)

and cx, cy and cz are the coordinates of the camera referenced to the world, derived using the vSLAM odometry.

The innovation covariance Sk ∈ R^(3×3) is defined as follows, where Hk ∈ R^(3×6) stands for the Jacobian of h(x̂k):

    Sk = Hk Pk|k−1 Hkᵀ + Rk   (6)

where Rk ∈ R^(3×3) is the covariance of the observation noise; its diagonal terms stand for the imprecision of the RGB-D camera. The near-optimal Kalman gain Kk ∈ R^(6×3) is defined as follows:

    Kk = Pk|k−1 Hkᵀ (Sk)⁻¹   (7)

Finally, the updated state estimate x̂k|k and the covariance estimate are given respectively by (8) and (9):

    x̂k|k = x̂k|k−1 + Kk ỹk   (8)

    Pk|k = (I6 − Kk Hk) Pk|k−1   (9)

C. Moving Object Classification

The MOC module classifies dynamic objects as either moving or idle. It takes as inputs the dynamic object's class, velocity and mask. The object velocity comes from the Tracking module estimation; the object class and mask are directly obtained from the DOS. The object class defines if the object is rigid or not. The deformation of a non-rigid object is computed using the intersection over union (IoU) of the masks of the object at time k and k − 1.
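The per-object constant-velocity filter of equations (1)-(9) can be sketched numerically. This is a minimal illustration, not the DOTMask source code: the camera rotation R and position c are treated as known inputs, the observation model is written compactly as h(x) = R·x_pos + c following (5), and all names are ours.

```python
import numpy as np

def predict(x, P, Sigma, dt):
    """A priori state and covariance, eqs (1)-(3), random-acceleration model."""
    I3 = np.eye(3)
    F = np.block([[I3, dt * I3], [np.zeros((3, 3)), I3]])  # transition matrix (1)
    Gamma = np.vstack([0.5 * dt**2 * I3, dt * I3])         # acceleration mapping (3)
    Q = Gamma @ Sigma @ Gamma.T                            # process noise (3)
    return F @ x, F @ P @ F.T + Q                          # (1) and (2)

def update(x, P, z, R_cam, c, R_noise):
    """Update step: innovation (4), gain (6)-(7), posterior (8)-(9)."""
    H = np.hstack([R_cam, np.zeros((3, 3))])  # Jacobian of h: only position observed
    y = z - (R_cam @ x[:3] + c)               # innovation (4), h from (5)
    S = H @ P @ H.T + R_noise                 # innovation covariance (6)
    K = P @ H.T @ np.linalg.inv(S)            # Kalman gain (7), K in R^(6x3)
    x_new = x + K @ y                         # state update (8)
    P_new = (np.eye(6) - K @ H) @ P           # covariance update (9)
    return x_new, P_new
```

Running predict with the class-specific Σ from Table I and then update with the 3D centroid observation pulls the position estimate toward the measurement while the velocity block accumulates the object's motion, which is what the MOC module later thresholds.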
The IoU algorithm takes two arbitrary convex shapes Mk−1, Mk and is defined as IoU = |Mk ∩ Mk−1| / |Mk ∪ Mk−1|, where | · | denotes the cardinality of the set. A dynamic object is classified as moving if its velocity is higher than a predefined threshold or if it is a non-rigid object with an IoU above another predefined threshold. The original depth image is then updated, resulting in the MO-MDI. The MO-MDI is sent to the vSLAM odometry to update the camera pose.

IV. EXPERIMENTAL SETUP

To test our DOTMask approach, we chose to use the TUM dataset because it presents challenging indoor dynamic RGB-D sequences with ground truth to evaluate visual odometry techniques. Also, TUM is commonly used to compare with other state-of-the-art techniques. We used sequences in low dynamic and highly dynamic environments.

For our experimental setup, ROS is used as a middleware to make the interconnections between the input images, the segmentation network, the EKF and RTAB-Map. The deep learning library PyTorch is used for the instance segmentation algorithm. The ResNet-50-FPN backbone is used for the YOLACT architecture because this configuration achieves the best results at a higher framerate [18]. Our Instance Segmentation module is based on the implementation of YOLACT by dbolya² and its pre-trained weights. The network is trained on all 91 classes of the COCO dataset. The COCO dataset is often used to compare state-of-the-art instance segmentation approaches, which is why we chose to use it in our trials. In our tests, person, chair, cup and bottle are the OOI used because of their presence in the TUM dataset and in our in-house tests. The RTAB-Map library [5] is also used, which includes various state-of-the-art visual odometry algorithms, a loop closure detection approach and a 3D map renderer.

Table I presents the parameters used for DOTMask in our trials, based on empirical observations in the evaluated TUM sequences and our understanding of the nature of the objects. A probability threshold p and a maximum instance number m are used to reduce the number of object instances to feed into the pipeline. Only detections with a score above p are used and, at maximum, m object detections are processed. This provides faster and more robust tracking.

TABLE I: Experimental Parameters

Description                              | Value
Frames to terminate object tracking      | 10
Score threshold (s)                      | 0.1
Maximum number of observations (m)       | 5
Velocity threshold for a person          | 0.01 m/s
Velocity threshold for the other objects | 0.1 m/s
Random acceleration for a person         | 0.62 m/s²
Random acceleration for other objects    | 1.0 m/s²

V. RESULTS

Trials were conducted in comparison with the approaches by Kim and Kim [6], Sun et al. [8], Bescos et al. [17] and RTAB-Map, the latter being also used with DOTMask. Figure 2a shows two original RGB frames in the TUM dataset, along with their superimposed RGB and depth images with the features used by RTAB-Map (Fig. 2b) and with DOTMask (Fig. 2c). Using the depth image as a mask to filter outlying features, dynamic objects (i.e., humans and chairs in this case) are filtered out because the MDI includes the semantic mask. The MO-MDI is used by RTAB-Map to compute visual odometry, keeping only the features from static objects as seen in Fig. 2c (left vs right), with the colored dots representing visual features used for visual odometry. In the left image of Fig. 2c, the man on the left is classified by the Tracking module as moving, while the man on the right is classified as being idle, resulting in keeping his visual features. In the right image of Fig. 2c, the man on the right is also classified as moving because he is standing up, masking his visual features. Figure 3 illustrates the influence of the MDI, which contains the depth mask of all the dynamic objects, either idle or not, to generate a map free of dynamic objects. This has two benefits: it creates a more visually accurate 3D rendered map, and it improves loop closure detection. The differences in the 3D generated maps between RTAB-Map without and with DOTMask are very apparent: there are fewer artifacts of dynamic objects and less drifting. The fr3/walking static sequence shows improved quality in the map, while the fr3/walking rpy sequence presents some undesirable artifacts. These artifacts are caused either by the mask failing to identify dynamic objects that are tilted or upside down, or by the time delay between the RGB image and its corresponding depth image. The fr3/sitting static sequence shows the result when masking idle objects, resulting in completely removing the dynamic objects from the scene.

Table II characterizes the overall SLAM quality in terms of absolute trajectory error (ATE). In almost all cases, DOTMask improves the ATE compared to RTAB-Map alone (as seen in the last column of the table). While DynaSLAM is better in almost every sequence, DOTMask is not far off, with closer values compared to the other techniques.

TABLE II: Absolute Translational Error (ATE) RMSE in cm

TUM Seqs        | BaMVO | Sun et al. | DynaSLAM | RTAB-Map | DOTMask | Impr. (%)
fr3/sit static  | 2.48  | –          | –        | 1.70     | 0.60    | 64.71
fr3/sit xyz     | 4.82  | 3.17       | 1.5      | 1.60     | 1.80    | −12.50
fr3/wlk static  | 13.39 | 0.60       | 2.61     | 10.7     | 0.80    | 92.52
fr3/wlk xyz     | 23.26 | 9.32       | 1.50     | 24.50    | 2.10    | 91.42
fr3/wlk rpy     | 35.84 | 13.33      | 3.50     | 22.80    | 5.30    | 76.75
fr3/wlk halfsph | 17.38 | 12.52      | 2.50     | 14.50    | 4.00    | 72.41

Table III presents the number of loop closure detections (Nb loop), the mean translational error (Terr) and the mean rotational error (Rerr) on each sequence, both with and without DOTMask. In all sequences, DOTMask helps RTAB-Map make more loop closures while also lowering both mean errors. Since loop closure features are computed from the depth image (MDI), using DOTMask forces RTAB-Map to use only features from static objects, hence providing better loop closures.

TABLE III: Loop Closure Analysis

                |        RTAB-Map         |        DOTMask
TUM Seqs        | Nb loop | Terr (cm) | Rerr (deg) | Nb loop | Terr (cm) | Rerr (deg)
fr3/sit static  | 33      | 1.80      | 0.26       | 1246    | 0.60      | 0.21
fr3/sit xyz     | 288     | 2.10      | 0.42       | 1486    | 2.50      | 0.45
fr3/wlk static  | 105     | 9.00      | 0.18       | 1260    | 7.00      | 0.15
fr3/wlk xyz     | 55      | 6.5       | 0.99       | 1516    | 2.9       | 0.45
fr3/wlk halfs.  | 121     | 5.90      | 0.84       | 964     | 4.90      | 0.79
fr3/wlk rpy     | 94      | 6.7       | 1.06       | 965     | 6.00      | 1.04

On the fr3/sitting xyz sequence, RTAB-Map alone provides better performance in both ATE and loop closure detection. In this entire sequence, the dynamic objects do not move. While the MO-MDI enables features from idle dynamic objects to be used by the odometry algorithm, the MDI does not enable those same features for the loop closure algorithm. Since nothing is moving in this particular sequence, all features help to provide a better localisation. However, this case is not representative of dynamic environments.

Table IV presents the average computation time to process a frame for each approach, without the vSLAM and odometry algorithms. Results are processed on a computer equipped with a GTX 1080 GPU and an i5-8600K CPU. DOTMask was also tested on a laptop with a GTX 1050, where it achieved an average of 8 frames per second. At 70 ms, it can run on a mobile robot operating at a moderate speed. The fastest method is BaMVO with only a 42.6 ms cycle time.

TABLE IV: Timing Analysis

Approach   | Img. Res. | Avg. Time | CPU       | GPU
BaMVO      | 320×240   | 42.6 ms   | i7 3.3GHz | –
Sun et al. | 640×480   | 500 ms    | i5        | –
DynaSLAM   | 640×480   | 500 ms    | –         | –
DOTMask    | 640×480   | 70 ms     | i5-8600K  | GTX1080
DOTMask    | 640×480   | 125 ms    | i7-8750H  | GTX1050

Figure 4 shows the tracked dynamic objects in the ROS visualizer RViz. DOTMask generates ROS transforms to track the position of the objects. Those transforms could easily be used in other ROS applications. Figure 5 shows the difference between RTAB-Map and DOTMask in a real scene where a robot moves at a similar speed as the dynamic objects (chairs and humans). The pink and blue lines represent the odometry of RTAB-Map without and with DOTMask. These results suggest qualitatively that DOTMask improves the odometry and the 3D map.

VI. CONCLUSION

This paper presents DOTMask, a fast and modular pipeline that uses a deep learning algorithm to semantically segment images, enabling the tracking and masking of dynamic objects in scenes to improve both localization and mapping in vSLAM. Our approach aims at providing a simple and complete pipeline to allow mobile robots to operate in dynamic environments. Results on the TUM dataset suggest that using DOTMask with RTAB-Map provides similar performance compared to other state-of-the-art localization approaches while providing an improved 3D map, dynamic object tracking and higher loop closure detection. While DOTMask does not outperform DynaSLAM on the TUM dataset or outrun BaMVO, it reveals itself to be a good compromise for robotic applications. Because the DOTMask pipeline is highly modular, it can also evolve with future improvements of deep learning architectures and new sets of dynamic object classes. In future work, we want to use the tracked dynamic objects to create a global 3D map with object permanence, and explore more complex neural networks³ to add body keypoint tracking, which could significantly improve human feature extraction. We would also like to explore techniques to detect outlier segmentations from the neural network to improve robustness.

Fig. 2: RTAB-Map features (colored dots) not appearing on moving objects with DOTMask. (a) Original RGB image. (b) RGB and depth image superposed without DOTMask. (c) RGB and depth image superposed with DOTMask.

Fig. 3: RTAB-Map 3D rendered map from the TUM sequences, without (top) and with (bottom) DOTMask. (a) fr3/sitting static. (b) fr3/walking static. (c) fr3/walking rpy.

Fig. 4: Position of tracked dynamic objects shown in RViz.

Fig. 5: 3D map and odometry improved with DOTMask. (a) RTAB-Map alone. (b) RTAB-Map with DOTMask.

² https://github.com/dbolya/yolact
³ https://github.com/daijucug/Mask-RCNN-TF detection-human segment-body keypoint-regression

REFERENCES

[1] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha, "Visual simultaneous localization and mapping: A survey," Artificial Intelligence Review, vol. 43, no. 1, pp. 55–81, 2015.
[2] D. Ciregan, U. Meier, and J. Schmidhuber, "Multi-column deep neural networks for image classification," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2012, pp. 3642–3649.
[3] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 39, no. 12, pp. 2481–2495, 2017.
[4] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez, "A review on deep learning techniques applied to semantic segmentation," arXiv preprint arXiv:1704.06857, 2017.
[5] M. Labbé and F. Michaud, "Online global loop closure detection for large-scale multi-session graph-based SLAM," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2014, pp. 2661–2666.
[6] D.-H. Kim and J.-H. Kim, "Effective background model-based RGB-D dense visual odometry in a dynamic environment," IEEE Trans. Robotics, vol. 32, no. 6, pp. 1565–1573, 2016.
[7] S. Li and D. Lee, "RGB-D SLAM in dynamic environments using static point weighting," IEEE Robotics and Automation Letters, vol. 2, no. 4, pp. 2263–2270, 2017.
[8] Y. Sun, M. Liu, and M. Q.-H. Meng, "Improving RGB-D SLAM in dynamic environments: A motion removal approach," Robotics and Autonomous Systems, vol. 89, pp. 110–122, 2017.
[9] C. Kerl, J. Sturm, and D. Cremers, "Dense visual SLAM for RGB-D cameras," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, 2013, pp. 2100–2106.
[10] Y. Sun, M. Liu, and M. Q.-H. Meng, "Motion removal for reliable RGB-D SLAM in dynamic environments," Robotics and Autonomous Systems, vol. 108, pp. 115–128, 2018.
[11] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ Int. Conf. Intelligent Robots and Systems, Oct. 2012.
[12] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, "SLAM++: Simultaneous localisation and mapping at the level of objects," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2013, pp. 1352–1359.
[13] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, "SemanticFusion: Dense 3D semantic mapping with convolutional neural networks," in Proc. IEEE Int. Conf. Robotics and Automation, 2017, pp. 4628–4635.
[14] H. Noh, S. Hong, and B. Han, "Learning deconvolution network for semantic segmentation," in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 1520–1528.
[15] J. McCormac, R. Clark, M. Bloesch, A. Davison, and S. Leutenegger, "Fusion++: Volumetric object-level SLAM," in Proc. Int. Conf. 3D Vision (3DV), 2018, pp. 32–41.
[16] B. Xu, W. Li, D. Tzoumanikas, M. Bloesch, A. Davison, and S. Leutenegger, "MID-Fusion: Octree-based object-level multi-instance dynamic SLAM," in Proc. Int. Conf. Robotics and Automation (ICRA), 2019, pp. 5231–5237.
[17] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4076–4083, 2018.
[18] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT: Real-time instance segmentation," in Proc. IEEE/CVF Int. Conf. Computer Vision (ICCV), 2019.
[19] ——, "YOLACT++: Better real-time instance segmentation," 2019.
[20] Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei, "Fully convolutional instance-aware semantic segmentation," in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2017, pp. 2359–2367.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2020年/Empty_Cities_A_Dynamic-Object-Invariant_Space_for_Visual_SLAM.pdf b/动态slam/2020年-2022年开源动态SLAM/2020年/Empty_Cities_A_Dynamic-Object-Invariant_Space_for_Visual_SLAM.pdf
new file mode 100644
index 0000000..79db8f5
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2020年/Empty_Cities_A_Dynamic-Object-Invariant_Space_for_Visual_SLAM.pdf

IEEE TRANSACTIONS ON ROBOTICS, VOL. 37, NO. 2, APRIL 2021

Empty Cities: A Dynamic-Object-Invariant Space for Visual SLAM

Berta Bescos, Cesar Cadena, and José Neira

Fig. 1. Dynamic images are first converted one by one into static with an end-to-end deep learning model. Such images allow us to compute an accurate camera trajectory estimation that is not damaged by the dynamic objects' motion, as well as to build dense static maps that are useful for long-term applications. (a) Input of our system: urban images with dynamic content. (b) Output of our system: dynamic objects have been removed. (c) Static map built with the images preprocessed by our framework.

Abstract—In this article, we present a data-driven approach to obtain the static image of a scene, eliminating dynamic objects that might have been present at the time of traversing the scene with a camera. The general objective is to improve vision-based localization and mapping tasks in dynamic environments, where the presence (or absence) of different dynamic objects in different moments makes these tasks less robust. We introduce an end-to-end
+deep learning framework to turn images of an urban environment +that include dynamic content, such as vehicles or pedestrians, +into realistic static frames suitable for localization and mapping. +This objective faces two main challenges: detecting the dynamic +objects, and inpainting the static occluded background. The first +challenge is addressed by the use of a convolutional network that +learns a multiclass semantic segmentation of the image. The second +challenge is approached with a generative adversarial model that, +taking as input the original dynamic image and the computed +dynamic/static binary mask, is capable of generating the final static +image. This framework makes use of two new losses, one based on +image steganalysis techniques, useful to improve the inpainting +quality, and another one based on ORB features, designed to +enhance feature matching between real and hallucinated image +regions. To validate our approach, we perform an extensive +evaluation on different tasks that are affected by dynamic entities, +i.e.,visual odometry, place recognition, and multiview stereo, +with the hallucinated images. Code has been made available on +https://github.com/bertabescos/EmptyCities_SLAM. + + Index Terms—Visual SLAM, Inpainting, Dynamic objects, +GANs. + + I. INTRODUCTION + +M OST vision-based localization systems are conceived to small fractions of dynamic content, but tend to compute dynamic + work in static environments [1]–[3]. They can deal with objects motion as camera ego-motion. Thus, their performance is + compromised. Building stable maps is also of key importance for + Manuscript received April 19, 2020; revised July 10, 2020; accepted Septem- long-term autonomy. Mapping dynamic objects prevents vision- +ber 8, 2020. Date of publication November 2, 2020; date of current version based robotic systems from recognizing already visited places +April 2, 2021. 
This work was supported in part by the Spanish Ministry of and reusing precomputed maps. +Economy and Competitiveness under Project PID2019-108398GB-I00 and FPI +Grant BES-2016-077836, in part by the Aragón regional government (Grupos To deal with dynamic objects, some approaches include in +DGA T45-17R, T45-20R), in part by the EU H2020 research project under Grant their model the behavior of the observed dynamic content [4], +688652, and in part by the Swiss State Secretariat for Education, Research and [5]. Such strategy is needed when the majority of the observed +Innovation (SERI) under Grant 15.0284 and NVIDIA, through the donation of a scene is not rigid. However, when scenes are mainly rigid, as in +Titan X GPU. This article was recommended for publication by Associate Editor Fig. 1(a), the standard strategy consists of detecting the dynamic +S. Huang and Editor F. Chaumette upon evaluation of the reviewers’ comments. objects within the images and not to use them for localization and +(Corresponding author: Berta Bescos.) mapping [6]–[9]. To address mainly rigid scenes, we propose + to instead modify these images so that dynamic content is + Berta Bescos is with the Department of Computer Science and Sys- eliminated and the scene is converted realistically into static. We +tem Engineering, University of Zaragoza, 50018 Zaragoza, Spain (e-mail: consider that the combination of experience and context allows +bbescos@unizar.es). to hallucinate, i.e., inpaint, a geometrically and semantically + + Cesar Cadena is with the Mechanical and Process Engineering, ETH Zurich, +8090 Zurich, Switzerland (e-mail: cesarcadena.lerma@gmail.com). + + José Neira is with the Instituto de Investigación en Ingeniería de Aragón, +Universidad de Zaragoza, 50018 Zaragoza, Spain (e-mail: jneira@unizar.es). + + Color versions of one or more of the figures in this article are available online +at http://ieeexplore.ieee.org. 
+ + Digital Object Identifier 10.1109/TRO.2020.3031267 + +1552-3098 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. + See https://www.ieee.org/publications/rights/index.html for more information. + +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:20:49 UTC from IEEE Xplore. Restrictions apply. + 434 IEEE TRANSACTIONS ON ROBOTICS, VOL. 37, NO. 2, APRIL 2021 + +consistent appearance of the rigid and static structure behind maps (see Fig. 1), as well as for street-view imagery suppliers +dynamic objects. This hallucinated structure can be used by as a privacy measure to replace faces and license plates blurring. +the simultaneous localization and mapping (SLAM) system to We provide an extensive evaluation on robotic applications, such +provide it with robustness against dynamic objects. as visual odometry, place recognition, and mapping to prove the + validity of our framework. + Turning images that contain dynamic objects into realistic +static frames reveals several challenges. II. RELATED WORK + + 1) Detecting such dynamic content in the image. By this, we A. Dynamic Objects Detection + mean to detect not only those objects that are known to + move such as vehicles, people, and animals, but also the The vast majority of SLAM systems assume a static environ- + shadows and reflections that they might generate, since ment. As a consequence, they can only manage small fractions + they also change the image appearance. of dynamic content by classifying them as spurious data or + outliers to such static model. The most typical outlier rejection + 2) Inpainting the resulting space left by the detected dynamic algorithms are RANSAC (e.g., in ORB-SLAM [1], [23]) and + content with plausible imagery. The resulting image would robust cost functions (e.g., in PTAM [24]). 
+ succeed in being realistic if the inpainted areas are both + semantically and geometrically consistent with the static There are several SLAM systems that address more specifi- + content of the image. cally the dynamic scene content. Tan et al. [9] detect changes that + take place in the scene by projecting the map features into the + The first challenge can be addressed with geometrical ap- current frame for appearance and structure validation. Wang and +proaches if an image sequence is available. This procedure usu- Huang [8] segment the dynamic objects in the scene using the +ally consists in studying the optical flow consistency along the RGB optical flow. Alcantarilla et al. [7] detect moving objects +images [7], [8]. In the case in which only one frame is available, by means of a scene flow representation with stereo cameras. +deep learning is the approach that excels at this task by the More recently, thanks to the boost of deep learning, integrating +use of convolutional neural networks (CNNs) [10], [11]. These semantics information into SLAM has allowed to deal with +frameworks are trained with the previous knowledge of what dynamic content in a different manner [6], [25]. This idea allows +classes are dynamic and which ones are not. Recent works show the clustering of map points belonging to independent objects +that it is possible to acquire this knowledge in a self-supervised with different dynamics, as well as the possibility of detecting +way [12], [13]. dynamic objects in just one shot. + + Regarding the second challenge, some recent image inpaint- B. Sequence-Based Inpainting +ing approaches use image statistics of the remaining image to fill +in the holes [14], [15]. The former work estimates the pixel value Previous works on SLAM in dynamic scenes have attempted +with the normalized weighted sum of all the known pixels in the to reconstruct the background occluded by dynamic objects in +neighborhood. 
While this approach generally produces smooth the images with information from previous frames [6], [26]. +results, it is limited by the available image statistics and has Such works need perpixel depth information and only make use +no concept of visual semantics. Neural networks learn semantic of the static content of the prebuilt map to create the inpainted +priors and meaningful hidden representations in an end-to-end frames, but do not add semantic consistency. The work by Grana- +fashion, which have been used for recent image inpainting dos et al. [27] removes marked dynamic objects from videos by +efforts [16]–[19]. These networks employ convolutional filters aligning other candidate frames in which parts of the missing +on images, replacing the removed content with inpainted areas region are visible, assuming that the scene can be approximated +that have geometrical and semantic consistency with the whole using piecewise planar geometry. The recent work by Uitten- +image. bogaard et al. [28] utilizes a generative adversarial network + (GAN) to learn to use information from different viewpoints + Both challenges can also be seen as one single task: translating and select imagery information from those views to generate a +a dynamic image into a corresponding static image. In this plausible inpainting, which is similar to the ground-truth static +direction, Isola et al. [20] propose a general-purpose solution background. Eventually, if only one frame is available, the static +for image-to-image translation. Our previous work [21] builds occluded background can only be reconstructed by utilizing +on top of this idea and reformulates the framework objectives image-based inpainting techniques. +to take advantage of a precomputed dynamic object mask, +seeking a more inpainting-oriented framework. In this work, C. 
Image-Based Inpainting +we follow this idea of transforming images with dynamic con- +tent into realistic static frames, while optimizing for localiza- Among the nonlearning approaches to image inpainting, prop- +tion and mapping performance. For such task, we introduce agating appearance information from neighboring pixels to the +a new loss that combined with the integration of a semantic target region is the usual procedure [14]. Accordingly, these +segmentation network achieves the final objective of creating a methods succeed in dealing with narrow holes, where color and +dynamic-object-invariant space. This loss is based on steganal- texture vary smoothly, but fail when handling big holes, resulting +ysis techniques and on ORB features detection, orientation, and in oversmoothing. Differently, patch-based methods iteratively +descriptor maps [22]. Such loss allows the inpainted images to search for relevant patches from the rest of the image [29]. +be realistic and suitable for localization and mapping. These +images provide a richer understanding of the stationary scene, +and could also be of interest for the creation of high-detail road + +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:20:49 UTC from IEEE Xplore. Restrictions apply. + BESCOS et al.: EMPTY CITIES: A DYNAMIC-OBJECT-INVARIANT SPACE FOR VISUAL SLAM 435 + +These approaches are computationally expensive and, hence, Fig. 2. Our generator G adopts a UResNet-like architecture. It employs three +not fast enough for real-time applications. Yet, they do not make down-convolutional layers with a stride of 2, six ResNet blocks, and three +semantically aware patch selections. up-convolutional layers with a fractional stride of 1/2, with skip connections + between corresponding down- and up-convolutional layers. Only two ResNet + Deep-learning-based methods usually initialize the image blocks are shown for simplicity. 
holes with a constant value and further pass them through a CNN. Context Encoders [19] were among the first to successfully use a standard pixelwise reconstruction loss, as well as an adversarial loss, for image inpainting tasks. Due to the resulting artifacts, Yang et al. [30] take their results as input and then propagate the texture information from nonhole regions to fill the hole regions as postprocessing. Song et al. [31] use a refinement network in which a blurry initial hole-filling result is used as the input and is then iteratively replaced with patches from the closest nonhole regions in the feature space. Iizuka et al. [18] extend Context Encoders by defining global and local discriminators, then applying a postprocessing step. Following this work, Yu et al. [17] replaced the postprocessing with a refinement network powered by contextual attention layers. The recent work of Liu et al. [16] obtains excellent results by using partial convolutions.

In contrast, the work by Ulyanov et al. [32] proves that there is no need for external dataset training. The generative network itself can rely on its structure to complete the corrupted image. However, this approach usually requires several iterations (~50 000) to obtain good and detailed results.

D. Image Inpainting for a Dynamic-Object-Invariant Space

This work builds on our previous work Empty Cities [21], which bins the image sequences and treats the frames independently. It makes use of deep learning to segment out the a priori moving objects (vehicles, animals, and pedestrians), and also of image-based inpainting. It does not perform pure inpainting but image-to-image translation with the help of a dynamic objects' mask, which is the outcome of a semantic segmentation network. This choice is justified by the fact that the dynamic objects' mask might be inaccurate or may not include their shadows. The adoption of an image-to-image translation framework allows us to slightly modify the image nonhole regions to better accommodate the reconstructed areas. Differently to inpainting methods, the "holes" cannot be initialized with placeholder values because we do not want the framework to modify only those values; hence, our inpainting network input consists of the dynamic original image concatenated with the dynamic/static mask. Concisely, utilizing an image-to-image translation approach allows us to have the image hole regions inpainted, and the nonhole regions slightly modified to better accommodate the reconstructed areas, coping with imprecise masks or with the dynamic objects' possible shadows and reflections.

III. IMAGE-TO-IMAGE TRANSLATION

Our work makes use of the successful image-to-image translation framework by Isola et al. [20]. For the sake of completeness, we summarize the basis of their approach.

A GAN is a generative model that learns a mapping from a random noise vector z to an output image y, G : z → y [33]. In contrast, a conditional GAN (cGAN) learns a mapping from an observed image x and an optional random noise vector z to y, G : {x, z} → y [34], or G : x → y [20]. The generator G is trained to produce outputs indistinguishable from the "real" images by an adversarially trained discriminator D, which is trained to do as well as possible at detecting the generator's "fakes." The objective of a cGAN can be expressed as

LcGAN(G, D) = Ex,y[log D(x, y)] + Ex[log (1 − D(x, G(x)))]   (1)

where G tries to minimize this objective against an adversarial D that tries to maximize it. Previous approaches have found it beneficial to mix the GAN objective with a more traditional appearance loss, such as the L1 or L2 distance [19]. The discriminator's job remains unchanged, but the generator is tasked not only with fooling the discriminator, but also with being near the ground truth in an L1 sense, as expressed in

G∗ = arg min_G max_D LcGAN(G, D) + λ1 · LL1(G)   (2)

where LL1(G) = Ex,y[||y − G(x)||1]. The recent work of Isola et al. [20] shows that cGANs are suitable for image-to-image translation tasks, where the output image is conditioned on its corresponding input image, i.e., it translates an image from one space into another (semantic labels to RGB appearance, RGB appearance to drawings, day to night, etc.). The realism of their results is also enhanced by their generator architecture. They employ a U-Net [35], which allows low-level information to shortcut across the network. In our previous work [21], we made use of this same architecture with 256 × 256 resolution images. However, visual localization systems see their accuracy degraded when working with low-resolution images. For this objective, we hereby employ a UResNet [36] as the architecture for our generator G, see Fig. 2. This architecture uses residual blocks [37] and has shown impressive results for superresolution images [38].

It is well known that L2 and L1 losses produce blurry results on image generation problems, i.e., they can capture the low frequencies but fail to encourage high-frequency crispness. This motivates restricting the GAN discriminator to only model high-frequency structures. Following this idea, Isola et al. [20] adopt a discriminator architecture that classifies each N × N patch in an image as real or fake, rather than classifying the image as a whole. Due to their excellent results, we adopt this same architecture for our discriminator.

IEEE TRANSACTIONS ON ROBOTICS, VOL. 37, NO. 2, APRIL 2021

Fig. 3. Block diagram of our proposal. We first compute the segmentation of the RGB dynamic image, as well as its loss against its ground truth. Both the dynamic/static binary mask and the dynamic image are used to obtain the static image. A loss based on ORB features, together with an appearance and an adversarial loss, are obtained and back-propagated up to the RGB dynamic image. The striped blocks are differentiable layers that are fixed and, hence, not modified during training time. The adversarial discriminator is not shown here for simplicity.

IV. OUR PROPOSAL

Our proposed system turns images of an urban environment that show dynamic content, such as vehicles or pedestrians, into realistic static frames which are suitable for localization and mapping. We first obtain the pixelwise semantic segmentation of the RGB dynamic image (see Fig. 3). Then, the segmentation of only the dynamic objects is obtained with the convolutional network DynSS. Once we have this mask, we convert the RGB dynamic image to grayscale and we compute the static image, also in grayscale, with the use of the generator G, which has been trained in an adversarial way. For simplicity, the discriminator is not shown in this diagram. To fully exploit the capabilities of this framework for localization and mapping, inpainting is enriched with a loss based on ORB feature detection, orientation, and descriptors between the ground-truth and computed static images. Another feature of our framework for localization and mapping is the fact that we perform the inpainting in grayscale rather than in RGB. The motivation for this is that many visual localization applications only need the images' grayscale information. The different stages are described in Sections IV-A–IV-E.

A. From Image-to-Image Translation to Inpainting

For our objective, dynamic object masks are specially considered to reformulate the training objectives of the general-purpose image-to-image translation work by Isola et al. [20]. We adopt a variant of the cGAN that learns a mapping from an observed image x and a dynamic/static binary mask m to y, G : {x, m} → y. Also, the discriminator D learns to classify patches of ŷ = G(x, m) as "fake" from ŷ, m, and x, and patches of y as "real" from y, m, and x, D : {x, y/ŷ, m} → real/fake.

In most of the training dataset images, the relationship between the static and dynamic region sizes is unbalanced, i.e., static regions usually occupy a much bigger area. This leads us to believe that the influence of dynamic regions on the final loss is significantly reduced. As a solution to this problem, we propose to reformulate the cGAN and L1 losses so that there is more emphasis on the main areas that have to be inpainted, according to (3) and (4). The weights w are computed as w = N/Ndyn if m = 1 (dynamic object), and as w = N/(N − Ndyn) if m = 0 (background). N stands for the number of elements in the binary mask m, and Ndyn is the number of pixels where m = 1:

LL1(G) = Ex,y[w · ||y − G(x, m)||1]   (3)

LcGAN(G, D) = Ex,y[w · log D(x, y, m)] + Ex[w · log (1 − D(x, G(x, m), m))].   (4)

An important feature that we have also incorporated into the framework is the computation of our output and target images' "noise." This is motivated by the use of the noise domain in steganalysis to detect whether an image has been tampered with. Fig. 4 shows an example of why working in the noise domain is helpful for detecting "fake" images. While the static generated image [see Fig. 4(b)] looks visually similar to its target image [see Fig. 4(d)], their computed noises [see Fig. 4(c) and (e)] are very different. It would be very easy for us, humans, to tell what parts of the original image [see Fig.
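The balancing weights used in (3) and (4) can be illustrated with a short sketch. Interpreting the expectation as a per-pixel mean is an assumption made here for illustration, and the helper names (`balance_weights`, `weighted_l1`) are hypothetical:

```python
import numpy as np

def balance_weights(m):
    """Per-pixel weights w from the dynamic/static mask m (1 = dynamic).

    w = N / N_dyn on dynamic pixels and w = N / (N - N_dyn) on background,
    so the usually much smaller dynamic area is not drowned out in the loss.
    """
    n = m.size
    n_dyn = int(m.sum())
    w = np.where(m == 1, n / max(n_dyn, 1), n / max(n - n_dyn, 1))
    return w.astype(float)

def weighted_l1(y, y_hat, m):
    """Weighted L1 loss in the spirit of Eq. (3): mean of w * |y - G(x, m)|."""
    return float(np.mean(balance_weights(m) * np.abs(y - y_hat)))
```

With a mask whose dynamic area covers a quarter of the image, dynamic pixels receive weight 4 and background pixels 4/3, so both regions contribute equally to the total loss.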
4(a)] have been changed by analyzing their noise mapping. In the same way, the discriminator could more easily learn to distinguish "real" from "fake" images if it can take their noise as input. This idea is explained in more depth in Section IV-B, and the whole training procedure is diagrammed in Fig. 5. To the best of our knowledge, steganalysis noise features have not been used before in the context of GANs.

Fig. 4. (a) Original image. (b) Image generated by our framework when taking (a) as input. (c) Computed noise of (b). (d) Static objective image. (e) Computed noise of (d). (b) and (d) are visually similar, but their computed noises [(c) and (e), respectively] clearly show which image, and which parts of it, have been modified the most. The noise magnitude has been amplified (×10) for visualization.

This GAN training setup leads to good inpainting results. However, despite the efforts of the discriminator to catch the high frequency of the "real" images, the outputs of our framework are still slightly blurry. One of the objectives of this work is to use our images for localization tasks; therefore, if the inpainted regions are somewhat blurry, features would not be extracted in these areas. Image features are important for localization since many visual SLAM systems rely on them as their core (ORB-SLAM [23]). Having blurriness in inpainted areas could be seen as a good feature of our framework for navigation, because it would allow feature-based localization systems to work with our images without any modification in their architecture, and "fake" features would not be introduced. This would be equivalent to modifying the utilized localization system to work with the raw images and the dynamic/static binary masks. We have proved with our localization experiments (see Section VI-A) that not utilizing moving objects' features leads to worse tracking results than working with fully static images. For that reason, we want to exploit our framework both to obtain high-quality inpainting results and to succeed in generating reliable features for visual localization tasks. Fortunately, these two assignments are highly related: solving one of them leads to having the other one tackled. Therefore, we have implemented a new loss based on ORB features [22]. That is, we want the output of our generator G to have the same ORB features as its target image, while keeping it realistic and close to its target in an L1 sense. By the same ORB features, we mean the same detected keypoints with the same orientation and descriptors, following ORB's implementation to the extent possible. This procedure is further described in Section IV-C.

B. Steganalysis-Based Loss

With the advances of image editing techniques, tampered or manipulated image generation processes have become widely available. As a result, distinguishing authentic images from tampered images has become increasingly challenging. What our framework is actually trying to achieve is to eliminate certain regions from an authentic image followed by inpainting, i.e., removal, one of the most common image manipulation techniques. It is the discriminator's job to classify the generated image patches as tampered (fake) or real.

Images have a low-frequency component dependent on their content, and a high-frequency component dependent on their source camera. These high-frequency components are known as noise features or noise residuals, and can be extracted using linear and nonlinear high-pass filters. Recent works on image forensics utilize noise features [39], [40] as clues to classify a specific patch or pixel in an image as tampered or not, and to localize the tampered regions. The intuition behind this idea is that when an object is removed from one image (source) and the gap is inpainted (target), the noise features between the source and target are unlikely to match.

To provide the discriminator with better clues to distinguish real from fake inputs, we first extract the noise features from our images and concatenate them to the grayscale images, as depicted in Fig. 5. The cGAN objective is reformulated as

LcGAN(G, D) = Ex,y[w · log D(x, y, m, n)] + Ex[w · log (1 − D(x, ŷ, m, n̂))]   (5)

where n = SRM(y) and n̂ = SRM(ŷ). There are many ways to produce noise features from an image. Inspired by recent progress on steganalysis rich models (SRM) for image manipulation detection [39], we use SRM filter kernels to extract the local noise features from the static images as the input to our discriminator. The SRM uses statistics of neighboring noise residual samples as features to capture the dependence changes caused by embedding. Zhou et al. [40] use SRM residuals together with the RGB image to detect and localize corrupted regions in images. They use only 3 SRM kernels, instead of 30 (as in the original work of Fridrich and Kodovsky [39]), and claim that they achieve comparable performance. Similarly, we use these same three filters (see Fig. 6), setting the kernel size of the SRM filter layer to 5 × 5 × 3.

C. ORB-Features-Based Loss

ORB features allow real-time detection and description, and provide good invariance to changes in viewpoint and illumination. Furthermore, they are useful for visual SLAM and place recognition, as demonstrated in the popular ORB-SLAM [23] and its binary bag-of-words [41]. The following sections summarize how the ORB feature detector, descriptors, and orientation are computed, and how we have adapted them into a new loss.

1) Detector: The ORB detector is based on the FAST algorithm [42]. It takes one parameter, the intensity threshold t between the center pixel p, with intensity Ip, and those in a circular ring around the center. If there exists a set of contiguous pixels in the circle which are all brighter than Ip + t, or all darker than Ip − t, the pixel p is a keypoint candidate. Then, the Harris corner measure is computed for each of these candidates, and the target N keypoints with the highest Harris measure are finally selected. FAST does not produce multiscale features; therefore, ORB uses a scale pyramid of the image and extracts FAST features at each level of the pyramid.

Fig. 5. The discriminator D has to learn to distinguish between the real images y and the images generated by the generator, G(x, m). D makes a better decision (real/fake) by seeing the inputs of the generator x and m, and by seeing the SRM noise features of G(x, m) and y. The striped blocks are convolutional layers whose weights do not require updating during training time.

Fig. 6. The three utilized SRM kernels to extract noise features. The left kernel is useful in regions with a strong gradient. The middle and the rightmost kernels provide the layer with a high shift-invariance.

Fig. 7. Subset of the 16 kernels used to obtain corner responses in the images. The 12 black pixels have a value of −1/12, the gray pixels are set to 0, and the white pixel is set to 1. A very positive or a very negative response will be obtained when convolving these kernels with a corner area in an image.

To bring this to a differentiable solution, we have defined a convolution capable of detecting corners in an image in the same way that FAST does. We have approximated the FAST corner detection and have used instead a convolution with the kernels in Fig. 7. These images show some of the kernels used for corner detection for a circular ring of three pixels around the center. By convolving the image with these kernels for different kernel sizes, we obtain its corner response for the different image pyramid levels. We keep the maximum score per pixel and per level and raise each element to its second power to equally leverage positive and negative responses. We then subtract a value equivalent to the FAST threshold t, and apply a sigmoid operation. Its output is the probability of a pixel being a FAST feature and could also be seen as the Harris corner measure. Features for the output and target images are computed following this procedure. We define this network as det, and the corresponding loss Ldet(G) can be expressed as

Ldet(G) = −Ex,y[wdet · (det(y) · log(det(ŷ)) + (1 − det(y)) · log(1 − det(ŷ)))]   (6)

where ŷ = G(x, m) and wdet is calculated following (7). This weight definition allows us to leverage the uneven distribution of nonfeature and feature pixels, and to affect only those image regions with a wrong feature response. N stands for the number of pixels in the feature map, and Nf represents the number of pixels in the response map det(y) where det(y) > 0.5, i.e., the number of FAST features in the current objective frame:

wdet = N/Nf if det(y) > 0.5 and det(ŷ) ≤ 0.5; N/(N − Nf) if det(y) ≤ 0.5 and det(ŷ) > 0.5; 0 otherwise.   (7)

According to our results, the optimum number of image pyramid levels for this objective is 1. More levels lead to a greater training time, and the results are barely influenced. This is coherent with the idea that we want to maximize the sharpness of small features rather than that of the big corners. These convolutions have been applied with a stride of 5, offering a good tradeoff between computational training time and good-quality results.

Other approaches have tried before to include a similar loss inside a CycleGAN framework [43]. The work by Porav et al. [43] uses the SURF detector [44], which is already differentiable, but does not compute a binary loss. They compute a more traditional L1 loss between the blob responses of the output and ground-truth images. Computing a binary loss as in (7) allows us to put more emphasis on the high-gradient areas.

2) Orientation: Once FAST features have been detected, the original ORB work extracts their orientation to provide them with rotation invariance. This is done by computing the orientation θ = atan2(m01, m10) and the intensity centroid C = (m10/m00, m01/m00), where mpq = Σx,y x^p y^q I(x, y) are the moments of an image patch. More precisely, the three utilized patch moments are m10 = Σx,y x · I(x, y), m01 = Σx,y y · I(x, y), and m00 = Σx,y I(x, y). We have created three 14-pixel-radius circular kernels with the values x, y, and 1, respectively, for m10, m01, and m00 (centered in 0), so that when convolving the image with them, we obtain the respective patch moments m10, m01, and m00. We define this network as ori, and the objective of its corresponding loss is that the "fake" static image detected features det(G(x, m)) have the same orientation parameters m01, m10, and m00 as the ground-truth static image detected features det(y). This loss can be expressed as

Lori(G) = −Ex,y[wori · ||ori(y) − ori(ŷ)||1].   (8)
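The intensity-centroid computation that the orientation loss compares can be written down directly. As a simplification, the sketch below evaluates the moments m00, m10, and m01 as plain sums over a single patch instead of convolving the whole image with circular kernels; the function name is our own:

```python
import math
import numpy as np

def patch_orientation(patch):
    """Intensity-centroid orientation of a square patch, ORB-style.

    Computes the moments m00, m10, m01 with coordinates centered on the
    patch middle, then theta = atan2(m01, m10) and the centroid
    C = (m10 / m00, m01 / m00).
    """
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    xs = xs - (w - 1) / 2.0  # center coordinates on the patch middle
    ys = ys - (h - 1) / 2.0
    m00 = float(patch.sum())
    m10 = float((xs * patch).sum())
    m01 = float((ys * patch).sum())
    centroid = (m10 / m00, m01 / m00)
    theta = math.atan2(m01, m10)
    return theta, centroid
```

For a patch whose mass sits entirely to the right of the center, the centroid lies on the positive x-axis and the orientation is zero, which matches the intuition behind the rotation-invariance mechanism described above.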
Even though these convolutions are applied to the whole image with a stride of 5, as in the detection loss, the weighting term wori in (8) has a value of 1 if a FAST feature has been detected in either the ground-truth static image or the output image, i.e., if det(y) > 0.5 or det(ŷ) > 0.5. Otherwise, the weighting term wori is set to 0.

3) Descriptor: The ORB descriptor is a bit string description of an image patch constructed from a set of binary intensity tests. Consider a smoothed image patch p; a binary test τ is defined by

τ(p; x, y) = 1 if p(x) < p(y); 0 if p(x) ≥ p(y)   (9)

where p(x) is the intensity of p at a point x. The feature is defined as a vector of n binary tests

fn(p) = Σ_{1≤i≤n} 2^(i−1) · τ(p; xi, yi).   (10)

As in Rublee et al.'s work [22], we use a Gaussian distribution around the center of the patch and a vector length n = 256. This can be achieved in a differentiable and convolutional manner by creating n kernels with all values set to 0 except for those in the positions x and y:

k(z) = 1 if z = x; −1 if z = y; 0 otherwise   (11)

where k(z) is the value of the kernel k at a point z. Convolving an image with these n kernels yields each pixel's ORB descriptor (a negative output corresponds to the bit value 0 and a positive one to 1). This convolution is followed by a sigmoid activation function. We define this network as desc, and the corresponding loss Ldesc(G) can be expressed as

Ldesc(G) = −Ex,y[wdesc · (desc(y) · log(desc(ŷ)) + (1 − desc(y)) · log(1 − desc(ŷ)))]   (12)

where the weights wdesc are defined in (13). This descriptor loss is back-propagated to the whole image, whether a feature has been detected or not, as it helps keep the image statistics:

wdesc = 1 if desc(y) > 0.5 and desc(ŷ) ≤ 0.5; 1 if desc(y) ≤ 0.5 and desc(ŷ) > 0.5; 0 otherwise.   (13)

All these losses are combined into one loss LORB(G), which is computed as in (14). The values of the weights of the different losses λdet, λori, and λdesc have been chosen empirically, and they are set to 10, 0.1, and 1, respectively:

LORB(G) = λdet · Ldet(G) + λori · Lori(G) + λdesc · Ldesc(G).   (14)

The feature detection, orientation, and descriptor maps can be computed in parallel to decrease the training time, since their computation is not necessarily sequential.

Finally, the generator's job can be expressed as in the following equation:

G∗ = arg min_G max_D LcGAN(G, D) + λ1 · LL1(G) + LORB(G).   (15)

As an implementation detail, we have first trained the whole system without the ORB loss for 125 epochs, and have then fine-tuned it, including this loss, for another 25 epochs.

D. Semantic Segmentation

Semantic segmentation is a challenging task that addresses most of the perception needs of intelligent vehicles in a unified way. Deep neural networks excel at this task, as they can be trained end-to-end to accurately classify multiple object categories in an image at the pixel level. However, very few architectures offer a good tradeoff between high quality and computational resources. The recent work of Romera et al. [11] runs in real time while providing accurate semantic segmentation. The core of their architecture (ERFNet) uses residual connections and factorized convolutions to remain efficient while retaining remarkable accuracy.

Romera et al. [11] have made public some of their trained models [45]. As in our preliminary work, we use for our approach the ERFNet model with encoder and decoder both trained from scratch on the Cityscapes train set [46]. We have fine-tuned their model to adjust it to our inpainting approach by back-propagating the loss of the semantic segmentation LCE(SS), calculated with the cross-entropy criterion using the class weights wSS they suggest, and the adversarial loss of our final inpainting model LcGAN(G, D). The semantic segmentation network's (SS) job can hence be expressed as

SS∗ = arg min_SS max_D LcGAN(G, D) + λ2 · LCE(SS)   (16)

where LCE(SS) = wSS[class] · (log(Σj exp(ySS[j])) − ySS[class]). Its objective is to produce an accurate semantic segmentation ySS, but also to fool the discriminator D. The latter objective might occasionally lead the network to recognize not only dynamic objects but also their shadows.

E. Dynamic Objects Semantic Segmentation

Once the semantic segmentation of the RGB image is done, we can select those classes known to be dynamic (vehicles and pedestrians). This has been done by applying a SoftMax layer, followed by a convolutional layer with a kernel of n × 1 × 1, where n is the number of classes, and with the weights of the dynamic and static channels set to wdyn and wstat, respectively. With wdyn = (n − ndyn)/n and wstat = −ndyn/n, where ndyn stands for the number of existing dynamic classes, a positive output corresponds to a dynamic object, whereas a negative one corresponds to a static one. The resulting output passes through a hyperbolic-tangent-type activation function to obtain the desired dynamic/static mask. Note that the defined weights wdyn and wstat are not changed during training time. This segmentation stage has been adopted from our preliminary work [21] without any new modifications.

V. IMAGE-BASED EXPERIMENTS

A. Data Generation

We have analyzed the performance of our method using CARLA [47]. CARLA is an open-source simulator for autonomous driving research, which provides open digital assets (urban layouts, buildings, vehicles, pedestrians, etc.). The simulation platform supports flexible specification of sensor suites and environmental conditions. We have generated over 12 000 image pairs consisting of a target image captured with neither vehicles nor pedestrians, and a corresponding input image captured at the same pose with the same illumination conditions, but with cars, trucks, and people moving around. These images have been recorded using a front and a rear RGB camera mounted on a car. Their ground-truth semantic segmentation has also been captured. CARLA offers two different towns, which we have used for training and testing, respectively. Our dataset, together with more information about our framework, is available at https://bertabescos.github.io/EmptyCities_SLAM/.

At present, we are limited to training this framework on synthetic datasets since, to our knowledge, no real-world dataset exists that provides RGB images captured under the same illumination conditions at identical poses, with and without dynamic objects. In order to render our framework trained on synthetic data transferable to real-world data, we have fine-tuned our models with data from the Cityscapes and KITTI semantic segmentation training datasets [46], [48]. These datasets are semantically similar to the ones synthesized with CARLA. Nonetheless, their image statistics are different. We further explain this fine-tuning process in Section V-C.

B. Inpainting

In this section, we report the improvements achieved by our framework for inpainting. Table I describes the ablation study of our work for the different reported inputs and losses. The existence of many possible solutions makes it difficult to define a metric to evaluate image inpainting [17]. Nevertheless, we follow previous works and report the L1, PSNR, and SSIM errors [49], as well as a feature-based metric, Feat. This last metric computes the FAST feature detections, as explained in Section IV-C, for the output and ground-truth images, and compares them, similarly to (6).

TABLE I
QUANTITATIVE EVALUATIONS OF OUR CONTRIBUTIONS IN THE INPAINTING TASK ON THE TEST SYNTHETIC IMAGES

Bold entries mark the best performing system. The best results for almost all the inpainting metrics (L1, PSNR, and SSIM) are obtained with the generator G(x, m)|w and the discriminator D(x, y, m, n)|w. More correct features (Feat metric) are detected, though, when adding the feature-based loss G(x, m)|w ORB. "Full image" designates the per-pixel error considering the whole image. By "In" and "Out," we refer, respectively, to the per-pixel error considering the masked and unmasked pixels.

Adding the dynamic/static mask as input for both the generator and the discriminator helps to obtain better inpainting results within the image hole regions (In), at the expense of worse quality results in the nonhole regions (Out). Leveraging the unbalanced quantity of static and dynamic data within the dataset with w, (3) and (4), helps to obtain better results too. Providing the GAN's discriminator with the images' noise makes it learn better to distinguish between real and fake images, and therefore, the generator learns to produce more realistic images. The ORB-based loss leads to slightly worse inpainting results according to the L1, PSNR, and SSIM metrics, but renders this approach more useful for both localization and mapping tasks since more correct features are created.

1) Baselines for Inpainting: We compare our "inpainting" method qualitatively and quantitatively with four other state-of-the-art approaches:
1) Geo1, Geo2: two nonlearning approaches [14], [15];
2) Lea1, Lea2: two deep-learning-based methods [17], [18].
For a fair comparison, we have trained the approach by Yu et al. (Lea1) with our same training data. Iizuka et al. (Lea2) do not have their training code available. We have directly used their released model [18] trained on the Places2 dataset [51]. This dataset contains images of urban streets from a car perspective similar to ours. A more direct comparison is not possible. We provide them with the same mask as our method to generate the holes in the images. We evaluate qualitatively on the 3000 images from our synthetic test dataset, on the 500 validation images from the Cityscapes dataset [46], and on the images from the Oxford RobotCar dataset [50]. We can see in Figs. 8–10 the qualitative comparisons on these three datasets. Note that the results with both Lea1 and Lea2 have been generated with the color images and then converted to grayscale for visual comparison. Visually, we see that our method obtains a more realistic output (these results are computed without the ORB loss for an inpainting-oriented comparison). Also, it is the only one capable of removing the shadows generated by the dynamic objects even though they are not included in the dynamic/static mask (see Fig. 8 row 2 and Fig. 10 row 1). The utilized masks are included in the images in Figs. 8(a) and 10(a), respectively. Table II describes the quantitative comparison of our method against Geo1, Geo2, Lea1, and Lea2 on our CARLA dataset.
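The Full/In/Out error split reported in Table I can be reproduced with a small helper. The dictionary keys, the [0, 1] intensity range, and the use of per-pixel means are our own conventions for this sketch; SSIM is omitted:

```python
import math
import numpy as np

def inpaint_metrics(pred, target, hole_mask):
    """Per-pixel L1 over the full image and inside/outside the hole mask,
    mirroring the Full/In/Out split used in the ablation table, plus PSNR.
    Images are floats in [0, 1]; hole_mask is True on inpainted pixels.
    """
    err = np.abs(pred - target)

    def l1(region):  # mean absolute error over a boolean region
        return float(err[region].mean())

    metrics = {
        "L1_full": l1(np.ones_like(hole_mask, dtype=bool)),
        "L1_in": l1(hole_mask),
        "L1_out": l1(~hole_mask),
    }
    mse = float(((pred - target) ** 2).mean())
    metrics["PSNR"] = float("inf") if mse == 0 else 10 * math.log10(1.0 / mse)
    return metrics
```

Reporting the error inside and outside the mask separately is what exposes the tradeoff noted above: the mask input improves the "In" error while slightly degrading the "Out" error.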
BESCOS et al.: EMPTY CITIES: A DYNAMIC-OBJECT-INVARIANT SPACE FOR VISUAL SLAM 441

Fig. 8. (a) Input. (b) Geo1 [14]. (c) Lea1 [17]. (d) Lea2 [18]. (e) Ours. (f) Ground-truth. Qualitative comparison of our method (e) against other inpainting techniques (b), (c), (d) on our synthetic dataset. Our results are semantically and geometrically coherent, and do not show the dynamic objects' shadows, even if they are not included in the input mask.

TABLE II: QUANTITATIVE RESULTS OF OUR METHOD AGAINST OTHER INPAINTING APPROACHES IN OUR CARLA DATASET. Bold entries denote the best performing system. For a fair comparison, we only report the different errors within the images' hole regions, since the other methods are conceived to only significantly modify such parts.

It is not possible to quantitatively measure the performance of the different methods on the Cityscapes and Oxford Robotcar datasets, since no ground truth exists. Following these results, we can claim that our method outperforms the other approaches both qualitatively and quantitatively.

As seen in Fig. 10 row 1, the fact that our method does not perform pure inpainting but image-to-image translation with the help of a dynamic/static mask allows us to modify not only the dynamic objects themselves but also their shadows or reflections. We believe that the main underlying reason for this is the direct supervision for image-to-image translation. Also, since the segmentation masks are not 100% accurate during the training with real-world data, the model learns that it has to modify mainly the areas of the mask and, in case a smooth representation of the world is not obtained, also their surroundings. We believe that, had the training been performed with perfect masks that also cover the shadows, the model would not have learned to handle the shadows of dynamic objects or the inaccuracies of segmentation.

We want to highlight the importance of the inpainting robustness to inaccurate segmentation masks since, in practice, partial or missing segmentation happens frequently. Empty Cities cannot handle missing detections but can cope with partial segmentations covering at least 85% of the object image.

We hereby report some metrics evaluating how our framework behaves with the dynamic objects' shadows. Thresholding the difference between the dynamic and static image, and subtracting the dynamic-objects mask, yields the dynamic objects' shadows and reflections mask. We have first generated the shadows ground truth of our CARLA dataset, and then computed the shadows masks for our inpainted images in the same way. The intersection over union of the estimated shadows against the ground truth is 42.8%. Following recent works in shadow detection [52], we also report our method's shadow, nonshadow, and total accuracy (59.7%, 99.8%, and 99.5%, respectively). With this framework, we can remove almost 50% of the shadows of the dynamic objects of our CARLA test dataset. Admitting this could be improved, our method's nonshadow accuracy is almost 100%, which means that it does not modify other objects' shadows.

C. Transfer to Real Data

Models trained on synthetic data can be useful for real-world vision tasks [53]-[56]. Accordingly, we provide a study of synthetic-to-real transfer learning using data from the Cityscapes dataset [46], which offers a variety of urban real-world environments similar to the synthetic ones.

442 IEEE TRANSACTIONS ON ROBOTICS, VOL. 37, NO. 2, APRIL 2021

Fig. 9. (a) Input. (b) Geo1 [14]. (c) Lea1 [17]. (d) Ours. Comparison of our method (d) against other image inpainting approaches (b), (c) on the Cityscapes validation dataset [46]. (c) and (d) show results when real images have been incorporated into our training set together with the synthetic images with a ratio of 1/10.
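The shadows-and-reflections mask described above, thresholding the dynamic/static difference and then removing the dynamic-object mask, together with its intersection over union, can be sketched as follows. Images are flattened grayscale lists and the threshold value is an illustrative assumption, not the authors' setting:

```python
def shadow_mask(dynamic, static, obj_mask, thresh=25):
    """Pixels that differ between the dynamic and static images but are
    not part of the dynamic-object mask: shadows and reflections."""
    return [int(abs(d - s) > thresh and not m)
            for d, s, m in zip(dynamic, static, obj_mask)]

def iou(a, b):
    """Intersection over union of two binary masks."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return inter / union if union else 1.0

dynamic = [200, 90, 90, 40, 40]
static = [120, 120, 120, 40, 40]
obj_mask = [1, 0, 0, 0, 0]  # the dynamic object itself
est = shadow_mask(dynamic, static, obj_mask)
print(est)  # [0, 1, 1, 0, 0] -> darkened pixels next to the object
```

Running the same computation on the inpainted image instead of the static ground truth gives the estimated shadow mask whose IoU is reported above.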
When testing our method on real data, we see qualitatively that the synthesized images show some artifacts. This happens because such data have different statistics than the real ones and, therefore, cannot be easily used. The combination of real and synthetic data is possible during training despite the lack of ground-truth static real images. In the case of the real images, the network only learns the texture and the style of the static real world by encoding its information and decoding back the original image's nonhole regions. The synthetic data are substantially more plentiful and carry information about the inpainting process. The rendering, however, is far from realistic. Thus, the chosen representation attempts to bridge the reality gap encountered when using simulated data and to remove the need for domain adaptation.

We provide implementation details: we have finetuned our model with real data for 25 epochs with a real/synthetic images ratio of 1/10. On the one hand, for every ten images, there are nine synthetic images that provide our model with information about the inpainting task. On the other hand, one image out of those ten is a real image from the Cityscapes train dataset. There is ground truth for its semantic information, but there is no ground truth for its static representation. In such cases, we do backpropagation of the loss derivative only on those image areas that we consider static. This way, the model can learn both the inpainting task and the static real-world texture. Once the model is adapted to real-world data, it can be directly used in completely new real-world scenarios, e.g., the Oxford Robotcar dataset [50].

VI. EXPERIMENTS

A. Visual Odometry

We have evaluated Empty Cities on 20 CARLA synthetic sequences and on nine sequences from new real-world environments. For these VO experiments, we have chosen the state-of-the-art feature-based system ORB-SLAM [23]¹ and the direct method DSO [2]. The former is ideal to test the influence of our ORB features loss, and the latter is useful to prove that different systems can also benefit from this approach.

¹ORB-SLAM acts as visual odometry in trajectories without loop closures.

Fig. 10. (a) Input. (b) Geo1 [14]. (c) Lea1 [17]. (d) Ours. Comparison of our method (d) against other image inpainting approaches (b), (c) on the Oxford Robotcar dataset [50]. (c) and (d) show results when Cityscapes images have been incorporated into our training set together with the synthetic images with a 1/10 ratio. The binary dynamic/static mask computed for every input image has been added in the top-left corner of every row.

Fig. 11. Vertical axis shows the different sequences from our CARLA dataset in which we have tested our model, and the horizontal axes show the ATE [m] obtained by ORB-SLAM. (a) ORB-SLAM absolute trajectory RMSE [m] for the raw dynamic images. (b) ATE computed by DynaSLAM [6]: this system is based on ORB-SLAM and computes the dynamic objects' masks in every frame so as not to use features belonging to them. (c) ORB-SLAM ATE when using our inpainted frames (the ones obtained with the ORB-features-based loss). (d) ORB-SLAM ATE for the ground-truth static images.

1) Baselines for VO in Our Synthetic Dataset: Fig. 11 displays the ORB-SLAM absolute trajectory RMSE [m] computed for 20 CARLA sequences of approximately 100 m long without loop closures. Fig. 11(a) shows the results when many vehicles and pedestrians are moving independently.
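The absolute trajectory RMSE (ATE) used as the error metric throughout these experiments boils down to the root-mean-square of per-frame position errors. A minimal sketch, assuming the two trajectories are already associated frame-to-frame and aligned (e.g., by a Umeyama-style registration, which is omitted here):

```python
import math

def ate_rmse(est, gt):
    """Absolute trajectory error (RMSE) between two aligned trajectories,
    given as lists of (x, y, z) positions, one entry per frame."""
    sq = [sum((e - g) ** 2 for e, g in zip(pe, pg))
          for pe, pg in zip(est, gt)]
    return math.sqrt(sum(sq) / len(sq))

est = [(0, 0, 0), (1.1, 0, 0), (2.0, 0.2, 0)]
gt = [(0, 0, 0), (1.0, 0, 0), (2.0, 0.0, 0)]
print(round(ate_rmse(est, gt), 4))
```

Repeating this over several runs of a sequence yields the boxplot statistics shown in the following figures.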
More precisely, the number of vehicles and pedestrians has been set to the maximum allowed by CARLA. Fig. 11(d) shows the same odometry results for the ground-truth static sequences. We can see that dynamic objects have in many sequences a big influence on ORB-SLAM's performance (sequences 02, 07, 08, etc.). Fig. 11(b) shows the trajectory error obtained with our previous system DynaSLAM [6]. This system is based on ORB-SLAM and uses the semantic segmentation network Mask R-CNN [10] to detect the moving objects and not extract ORB features within them. Even though better odometry results are obtained compared with using the raw dynamic images, this experiment shows that using static images leads to a more accurate camera tracking. One reason for this is that dynamic objects might occlude the nearby regions in the scene, which are the most reliable for camera pose estimation. Another reason might be that using dynamic-object masks no longer yields a homogeneous feature distribution within the image. ORB-SLAM looks for a uniform distribution of image features; pose optimization could be degraded and drift could increase if the features do not follow such a distribution. Finally, Fig. 11(c) shows the ORB-SLAM error when using our inpainted images. Our odometry results show that better results are usually obtained when using our inpainted images. The inpainting is realistic enough to provide the visual odometry system with consistent features that are useful for localization.

Fig. 12. Vertical axis shows the different sequences from our CARLA dataset, and the horizontal axes show the ORB-SLAM ATE [m]. (a) and (b) ORB-SLAM absolute trajectory RMSE [m] for our images obtained with the model trained without and with the ORB loss term, respectively. (c) ATE computed by DynaSLAM for our images obtained with the model trained with the ORB loss term. On the right-most side, one can find the percentage of keypoints extracted in the inpainted areas w.r.t. the ground-truth keypoints.

We want to highlight the importance of the influence of using the ORB loss during training (see Fig. 12). Fig. 12(a) and (b) present the ATE obtained by ORB-SLAM with our inpainted images without and with the ORB loss, respectively. The estimated errors are smaller and more constant when using this loss. This performance gain can be due to features in the regions that originally contained dynamic objects, and to more stable features in the static-content regions. Fig. 12(c) presents the DynaSLAM ATE with the inpainted images generated with our model trained with this loss term. We expect these errors to be very similar to those shown in Fig. 11(b), and slightly bigger than those in Fig. 12(b). This experiment shows that our model barely damages the static content of the scene, keeping the static features as they used to be. It also demonstrates that the hallucinated features are useful to estimate the camera SE3 pose. To support this claim, on the right-most side of Fig. 12, one can find the percentage of keypoints extracted in the inpainted areas w.r.t. the ground-truth keypoints. We show in Fig. 13 two examples of the visual influence of such loss on enhancing high frequencies in inpainted areas.

Fig. 13. (a) Input. (b) Output. (c) Output. Visual comparison of the improvements achieved by utilizing the ORB-based loss (c) against not using it (b). The reconstructed curbs in (c) are sharper and straighter than those in (b).

Fig. 14 shows the DSO error for the same 20 CARLA sequences for the different input images: with dynamic content [see Fig. 14(a)], without dynamic content [see Fig. 14(d)], and the images obtained by our framework without and with the ORB-based loss [see Fig. 14(b) and (c)]. Even though direct systems are more robust to dynamic objects within the scene, utilizing our approach also yields a higher tracking accuracy. Despite the fact that our feature-based loss follows the ORB implementation, better results are also obtained with other visual odometry systems that do not rely on ORB features.

2) Baselines for Inpainting: We compared in Section V the quality of our results against four other methods with respect to the inpainting metrics. We now want to compare how our approach fares against them w.r.t. visual odometry metrics. Among these four other methods, we have chosen two for this evaluation: Geo1 [14] and Lea1 [17]. The first choice is motivated by its performance on our inpainting test dataset: this method performs the best among the two nonlearning-based approaches. The second choice is, however, motivated by the fact that we have not been able to train the model by Iizuka et al. [18] with our training data for a direct comparison. The evaluation and the different results can be seen in Fig. 15.

The images inpainted by Telea's method are usually very smooth and no features are extracted within the inpainted areas. The behavior of ORB-SLAM when using such sequences [Fig. 15(a)] is very similar to using DynaSLAM. However, the learning-based method by Yu et al. [17] tends to inpaint the images with low-frequency patterns found in the image's static content, generating many crispy artifacts (see examples in Fig. 10). A higher ATE is observed when using such images [see Fig. 15(b)]. Our method seems to be more suitable for the VO task: the inpainting is neither too smooth nor generates crispy artifacts.
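The percentage of keypoints recovered in the inpainted areas, reported on the right-most side of Fig. 12, can be sketched as follows. For simplicity, this toy version matches keypoints by exact pixel location, whereas a real evaluation would use a spatial tolerance and descriptor checks:

```python
def keypoint_recovery(pred_kps, gt_kps, hole_mask):
    """Percentage of ground-truth keypoints inside the inpainted (hole)
    region that also appear, at the same pixel, among the predicted
    keypoints. Keypoints are (x, y) tuples; hole_mask is row-major."""
    def inside(kp):
        x, y = kp
        return hole_mask[y][x] == 1

    gt_in = {kp for kp in gt_kps if inside(kp)}
    pred_in = {kp for kp in pred_kps if inside(kp)}
    if not gt_in:
        return 100.0
    return 100.0 * len(gt_in & pred_in) / len(gt_in)

mask = [[0, 1, 1],
        [0, 1, 1]]
gt = [(0, 0), (1, 0), (2, 1)]  # two of these lie in the hole
pred = [(1, 0), (1, 1)]
print(keypoint_recovery(pred, gt, mask))  # 50.0
```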
Fig. 14. Vertical axis shows the different sequences from our CARLA dataset, and the horizontal axes show the DSO ATE [m]. We have computed the boxplots' minimum, maximum, and quartiles with the results from ten repetitions of every test. (a) DSO absolute trajectory RMSE [m] for the raw dynamic images. (b) and (c) DSO trajectory errors when using our inpainted frames without and with the ORB-features-based loss, respectively. (d) DSO trajectory errors for the ground-truth static images.

Fig. 15. Vertical axis shows the different sequences from our CARLA dataset, and the horizontal axes show the ATE [m] obtained by ORB-SLAM for ten repetitions. (a) and (b) Absolute trajectory RMSE [m] for the images inpainted with the method by Telea [14] and Yu et al. [17], respectively. (c) ORB-SLAM trajectory error for our images.

3) Baselines for VO in Real-World New Scenarios: For the evaluation of Empty Cities on real-world environments w.r.t. visual odometry, we have chosen the KITTI [48] and the Oxford Robotcar [50] datasets. Dynamic objects in the KITTI dataset do not represent a big inconvenience for camera pose estimation, as was shown in our last work [6]. Most of the vehicles that appear are not moving and lie in nearby scene areas; thus, their features happen to be helpful to compute the sensor odometry. Also, the few moving pedestrians and cars along the sequences do not represent a big region within the images. The Oxford Robotcar dataset, though, has many sequences with representative moving objects (driving cars), as well as sequences with only stationary objects (parked cars). Note that the reported results are not for the whole sequence, since the authors provide their VO solution as ground truth, stating that it is accurate over hundreds of metres. Hence, the sequences we use are between 100 and 300 m long.

To perform the KITTI experiment, we have retrained our network with 256 × 768 resolution images for a better adaptation. The CARLA camera intrinsics have also been modified to match the ones used in KITTI. In this case, we have swapped the previously used semantic segmentation model trained only on Cityscapes for the ERFNet model with encoder trained on ImageNet and decoder trained on the Cityscapes train set, and have finetuned it with the KITTI semantic segmentation training dataset. The generator and discriminator have also been finetuned with such data, as explained in Section V-C.

Fig. 16 shows the evaluation of our method's performance with the sequences from the Oxford Robotcar and the KITTI datasets. The asterisk at the beginning of some sequence names means that most of the observed vehicles are either not moving or parked. The other sequences, though, present many moving vehicles. In the former type of sequences (*), the highest accuracy should be observed in the case in which the raw images are used [see Fig. 16(a)]. Removing the features from such vehicles, as in Fig. 16(b), leads to a lower accuracy since the most nearby features are no longer used. Inpainting the static scene behind these vehicles [see Fig. 16(c)] would still remove nearby features but would create new static features a little bit further away. That is, the visual odometry accuracy should be lower than in Fig. 16(a) but a little higher than or similar to Fig. 16(b). This is the case of our performance on the Oxford Robotcar sequences and the KITTI sequence 03. However, DynaSLAM achieves a better result in the KITTI sequence 07 than the proposed Empty Cities. After the first half of the sequence, there are a few consecutive frames in which a truck covers almost 75% of the image. The task of inpainting becomes especially difficult, thus worsening the estimation of the camera's trajectory. Regarding the performance on the second type of sequences, removing the features from moving vehicles and pedestrians should lead to a lower ATE [see Fig. 16(b) compared to Fig. 16(a)]. Empty Cities adds in these sequences an important number of features for pose estimation that usually leads to a slightly better trajectory estimation [see Fig. 16(c)]. Note that the results given here for the KITTI sequences might not match the ones reported by ORB-SLAM and DynaSLAM, respectively, because of the utilized image resolution (256 × 768).

Fig. 16. Vertical axis shows the KITTI and Oxford Robotcar dataset sequences in which we have tested our model, and the horizontal axes show the boxplots of the absolute trajectory RMSE [m] with the results from ten repetitions of every test. (a) and (b) ORB-SLAM and DynaSLAM trajectory errors, respectively, for the raw dynamic images. (c) ORB-SLAM results when our framework is employed. The asterisk at the beginning of some sequence names means that most of the observed vehicles are either not moving or parked.

Fig. 17. (a) and (b) show the precision and recall curves for the VPR results with BoW and NetVLAD, respectively. We report the results for the dynamic, the inpainted, and the ground-truth-static sequences, as well as the results obtained when masking out the dynamic objects as in DynaSLAM [only in (a)]. The precision and recall curves for the inpainting methods Geo1 [14] and Lea1 [17] are also presented.

B. Visual Place Recognition

Visual place recognition (VPR) is an important task for visual SLAM. Such algorithms are useful when revisiting places to perform loop closure and correct the accumulated drift along long trajectories. Bags of visual words (BoW) is the approach widely used to perform such a task, as can be seen in ORB-SLAM [23] and LDSO [57]. Lately, thanks to the boost of deep learning, learnt global image descriptors are also used for VPR [58]. In our previous work [21], we showed preliminary results proving the benefits of our solution for VPR by using descriptors from an off-the-shelf CNN [59].

1) Baseline for VPR in Our Synthetic Dataset: In this section, we show a VPR experiment performed with the bag-of-words work by Mur and Tardós [41], [60]. It is ideal to test our model, since it is based on ORB features. We also show an experiment with one of the strongest learning-based baselines, NetVLAD [58], which is trained for the specific task of VPR. This comparison can provide a broader understanding of how and when end-to-end task-specific learning becomes more or less suitable than an explicit use of semantics-based visual description, which forms the primary pitch of this article.

We have generated two CARLA sequences with loop closures, with and without dynamic objects. Two images are defined as the same place if they are less than 10 m apart.

The precision-recall curves for the BoW experiments are depicted in Fig. 17(a). We have extracted the visual words of every frame along the trajectories and have tried to match every two images as a function of the number of common visual words. For both trajectories, the results obtained with the dynamic images with and without masks are similar. This is congruent with the idea that the database of visual words mostly contains static and long-term stable words. The first trajectory is a good example of how the VPR recall drops fast in the presence of dynamic objects: a place is better represented with words from the whole static image. Mostly, it leads to fewer false positives and fewer false negatives. Even though our method slightly brings closer in the bag-of-words space images from the same place with different dynamic objects, we would have expected our results to be closer to those of the ground-truth static images. Our intuition behind these results is that the synthetic features, despite being useful for feature matching, do not fully fall on any visual words in the BoW space.

The precision-recall curves for the NetVLAD experiments are shown in Fig. 17(b). We have extracted the learnt descriptors of every frame along the trajectories and tried to match every two images as a function of their descriptors' Euclidean distance.² Despite NetVLAD's incredible performance on VPR in the face of illumination, viewpoint, and clutter changes, we show that its performance is slightly degraded by dynamic objects.

²With the Python-Tensorflow implementation by Cieslewski et al. [61].

TABLE III: QUANTITATIVE MAPPING RESULTS FOR FIGS. 19 AND 20. Bold entries denote the best performing system. We report the Euclidean fitness score given by the ICP algorithm. This score has been computed between the different point clouds w.r.t. the point cloud built with the ground-truth static images. Note that, since the maps do not have scale, the reported scores are up-to-scale.

Fig. 18. (a) Ref. (b) Query. (c) Empty Ref. (d) Empty Query. (a) and (b) show the same location with different viewpoints and object setups. NetVLAD [58] fails to match them, but it succeeds when our framework is previously employed [(c) and (d)].
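The VPR matching protocol, matching every two images by descriptor distance and calling a match correct when the camera positions are within 10 m, can be sketched as follows. The descriptors and thresholds here are illustrative stand-ins for the BoW or NetVLAD representations:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def precision_recall(desc, pos, thresh, same_place_dist=10.0):
    """Match every image pair whose descriptor distance is below `thresh`;
    a match is correct if the camera positions are within 10 m (the
    same-place criterion used for the CARLA loop-closure sequences)."""
    tp = fp = fn = 0
    n = len(desc)
    for i in range(n):
        for j in range(i + 1, n):
            predicted = euclidean(desc[i], desc[j]) < thresh
            actual = euclidean(pos[i], pos[j]) < same_place_dist
            if predicted and actual:
                tp += 1
            elif predicted:
                fp += 1
            elif actual:
                fn += 1
    prec = tp / (tp + fp) if tp + fp else 1.0
    rec = tp / (tp + fn) if tp + fn else 1.0
    return prec, rec

desc = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
pos = [(0.0, 0.0), (3.0, 0.0), (100.0, 0.0)]
print(precision_recall(desc, pos, thresh=1.0))  # (1.0, 1.0)
```

Sweeping `thresh` from small to large traces out precision-recall curves like the ones in Fig. 17.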
Empty Cities brings closer together the descriptors of the same place, and pulls apart the descriptors of different places with the same dynamic objects, leading to higher precision and recall. That is, the hidden semantic representations of our model match the ones learnt by NetVLAD. We can conclude that our method brings more relevant improvements in VPR if it is used in conjunction with a learning-based method. Finally, Fig. 18 shows a case in which NetVLAD fails at matching two dynamic frames of the same place, but manages to match them when dynamic objects are inpainted with our framework.

2) Baseline for Inpainting: We compare our approach to the inpainting methods (Geo1 [14] and Lea1 [17]) w.r.t. the VPR metrics in Fig. 17. For the BoW curves, the conclusion is similar to what we have seen in the previous experiment. Even if the extracted synthetic features were useful for matching, they seem to be of less help for place recognition with BoW. Few synthetic ORB features match any existing visual word in the BoW space. Our method, though, creates more useful visual words than Lea1 and Geo1. As for the results with NetVLAD, the use of the geometric method Geo1 brings little improvement. However, the learning-based method Lea1 decreases NetVLAD's performance on the second trajectory. NetVLAD can, up to some extent, ignore the dynamic classes' clues, but cannot ignore the transformed regions. That is, the hidden semantic representations of these inpainted regions do not always match the static scene representation learnt by NetVLAD.

C. Mapping

Another important application of our framework is the creation of high-detail road maps. Our inpainting framework allows us to create long-term and reusable maps that show neither dynamic objects nor the empty spaces left by them. To this end, we use the MVS and SfM software COLMAP [62], [63].

1) Baseline for Mapping in Our Synthetic Dataset: Fig. 19 shows the dense map of a simulated city environment (CARLA) with the original dynamic sequence, with the images processed by our framework, and with the ground-truth static images. The map seen in Fig. 19(a) is not useful for future use since it shows dynamic objects that might not be there any more. Fig. 19(c) shows the map built with the ground-truth static images, and Fig. 19(b) shows the map computed with our generated images. The areas which have been consistently inpainted along the frames are mapped, even if they have never been seen. When inpainting fails or is not consistent along the sequence, the photometric and geometric epipolar constraints are not met and such areas cannot be reconstructed. This idea makes our framework suitable to build stable maps.

To give a quantitative experiment on the validity of our maps, we choose the standard iterative closest point (ICP) algorithm [64] based on the Euclidean fitness score. That is, having the point cloud built with the ground-truth static images as a fixed reference, for each of its points the algorithm searches for the closest point in the target point cloud and calculates the distance based on the result of the search. To have a baseline for our experiment, we compute the point cloud with the dynamic images and with the CARLA segmentation. That is, the pixels belonging to the dynamic objects have not been used in the multiview-stereo pipeline. The results are described in Table III. Since the map has no scale, the ICP Euclidean fitness score is also up-to-scale. Even though the similarity score is improved when masking out dynamic objects, the map built with our images has triangulated more 3-D points from the inpainted regions, and such points have a low error.

2) Baseline for Inpainting: We also compare our approach with the two other state-of-the-art inpainting methods (Geo1 [14] and Lea1 [17]) w.r.t. the map quality. The qualitative results are depicted in Fig. 20, and the ICP scores are described in Table III. It can be seen that with both other inpainting methods the shadows of the dynamic objects are reconstructed, as well as their prolongation into the inpainted image regions. This leads to an ICP score that is higher than that of the map built with dynamic objects.

3) Baseline for Mapping in Real-World Environments: Fig. 21 shows an example of the computed dense maps for both types of inputs with the sequence 04 from the KITTI dataset. These maps have been computed with the camera poses given by ORB-SLAM using, respectively, the dynamic and inpainted images. Our framework, when able to inpaint a coherent context along the sequence, allows us to densely reconstruct unseen areas. When the inpainting is not coherent along the sequence, the epipolar constraints are not met and, therefore, such areas cannot be reconstructed. Note that this map is shown in RGB only for visualization purposes.³ We do recommend using the grayscale images for both localization and mapping purposes, since a lower reconstruction error is usually achieved.

³Since our framework offers enough flexibility, we have retrained our model with RGB images just as explained in Section IV. The only difference is that, since the features have to be extracted in grayscale images, we add a convolutional layer to convert the RGB output images to grayscale.

Fig. 19. Dense maps of a CARLA city environment with the COLMAP Multi-View-Stereo software. The upper row shows some of the sequence images used to build such maps. (a) Case in which the original dynamic images have been used. (b) Resulting map with the images previously processed by our framework. (c) Resulting map created with the ground-truth static images. All maps are computed with the ground-truth camera poses.

Fig. 20. Dense maps of the same CARLA city environment as in Fig. 19 with the COLMAP Multi-View-Stereo software. (a) Case in which our inpainted images have been used. (b) and (c) Resulting maps with the images previously processed by the frameworks of Telea [14] and Yu et al. [17], respectively. All maps are computed with the ground-truth camera poses.
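The ICP Euclidean fitness score described above can be sketched as follows. This assumes the two point clouds are already expressed in a common (up-to-scale) frame, and it uses brute-force nearest-neighbor search instead of the k-d tree a real implementation would use:

```python
def euclidean_fitness(reference, target):
    """For every point of the reference (ground-truth-static) cloud, find
    the closest target point and average the squared distances, as in the
    fitness score returned by common ICP implementations."""
    def d2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return sum(min(d2(p, q) for q in target) for p in reference) / len(reference)

ref = [(0, 0, 0), (1, 0, 0), (2, 0, 0)]
tgt = [(0, 0.1, 0), (1, 0, 0), (2, -0.2, 0)]
print(round(euclidean_fitness(ref, tgt), 4))
```

A lower score means the evaluated map lies closer to the ground-truth-static map, which is how the entries of Table III are ranked.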
Fig. 21. Dense maps built with images from the KITTI dataset [48]. The upper row shows some of the sequence images used to build each map. (a) Case in which the original dynamic images have been used. (b) Resulting map with the images previously processed by our framework. Both maps are computed with the camera poses that ORB-SLAM estimates for the different sequences.

D. Timing Analysis

Reporting our framework's efficiency is crucial to judge its suitability for autonomous driving and robotic tasks in general. The end-to-end pipeline runs at 100 frames per second on an NVIDIA Titan Xp 12 GB with images of a 512 × 512 resolution. Out of the 10 ms it takes to process one frame, 8 ms are invested in obtaining its semantic segmentation, and 2 ms are used for the inpainting task. Other than to deal with dynamic objects, the semantic segmentation may be needed for many other tasks involved in autonomous navigation. In such cases, our framework would only add two extra milliseconds per frame. Based on our analysis, we consider that the inpainting task is not a bottleneck.

VII. FAILURE MODES AND FUTURE WORK

The aim of this section is to provide the reader with an understanding of the benefits and limitations of our proposal, and of how to integrate it into a VO, SLAM, or MVS pipeline.

Having a static representation of the scene leads to better visual odometry results than just excluding the features belonging to moving objects. The presented inpainting approach has, though, some weaknesses: the bigger the image's dynamic region is, the lower the inpainting quality of the resulting image. Empty Cities would be suitable in setups in which approximately less than 15% of the camera field of view is covered by dynamic objects. In such setups, the reconstruction L1 error is acceptable and usually lies between 1% and 10%. The L1 error goes above 10% when more than 15% of the image pixels are covered.⁴ Work remains to be done to tackle extreme situations. Also, developing a system that processes the sequence as a whole, rather than binning it into independent frames, would result in a more consistent image inpainting along time.

Our system processes the image streams outside of the application pipeline and, hence, can be used naturally as a front end to many existing systems. We hereby want to discuss other application-dependent possibilities to boost its performance.

Visual odometry: Removing the features that belong to stationary objects certainly damages VO. However, e.g., a car can change from static to moving from one frame to another. Had we a movement detector, we would use the static objects' features and the inpainted ones behind moving objects.

Place recognition: Using the features of stationary dynamic objects damages the performance of place recognition algorithms. For example, two frames of the same place with a different setup of parked cars can be incorrectly tagged as a different place. Also, two frames of different places with the same parked-cars setup can be wrongly matched as the same place. Only the features of objects that remain stable in the long term (buildings, sidewalks, etc.) would benefit VPR.

Mapping: A map containing information about dynamic objects would be useless for future reuse. Only the information belonging to objects that remain stable in the long term, as well as the most likely static representation of the static scene behind dynamic objects, should be included in the map.

Ideally, one would use a movement detector to identify the status of the different observed instances and also to allow the discovery of new dynamic classes on the fly [13]. The features belonging to static instances would be used for visual odometry, and the corresponding inpainted features would be used for place recognition and mapping. This approach would bring the highest accuracy but would entail a series of modifications to the existing pipeline. Our method would currently pose problems in the case in which one wanted to inpaint the static scene behind a car that is currently moving, and this static scene contained parked cars. Our model would fail to reconstruct the unseen parts of the parked cars. It would be interesting for future work to include such scenarios in our training data. Our suggestion is, for now, inpainting all instances, regardless of their current dynamic status.

⁴As a practical example, the Oxford Robotcar sequence 2014-05-06-12-54-54 has 56% of images with less than 5% of covered pixels, 30% with a percentage of dynamic pixels between 5% and 10%, 12% between 10% and 15%, and 2% between 15% and 20%. This sequence is not Manhattan at 11 A.M., but it shows cars parked on both sides of the road and cars driving nearby.

VIII. CONCLUSION

We have presented an end-to-end deep learning framework that translates images containing dynamic objects within a city environment, such as vehicles or pedestrians, into realistic images with only static content. These images are suitable for visual odometry, place recognition, and mapping tasks, thanks to a new loss based on steganalysis techniques and on ORB feature maps, descriptors, and orientations. We motivated this extra complexity by showing quantitatively that the systems ORB-SLAM and DSO obtain a higher accuracy when utilizing the images synthesized with this loss. Also, mapping systems can benefit from this approach since not only would they not map dynamic objects, but they would also map the plausible static scene behind them. Finally, an architectural nicety is that our system processes the image streams outside of the localization pipeline, either offline or online and, hence, can be used naturally as a front end to many existing systems.

REFERENCES

[1] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Trans. Robot., vol. 31, no. 5, pp. 1147-1163, Oct. 2015.
[2] J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 611-625, Mar. 2018.
[3] C. Forster, M. Pizzoli, and D. Scaramuzza, "SVO: Fast semi-direct monocular visual odometry," in Proc. IEEE Int. Conf. Robot. Autom., 2014, pp. 15-22.
[4] A. Agudo, F. Moreno-Noguer, B. Calvo, and J. M. M. Montiel, "Sequential non-rigid structure from motion using physical priors," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 5, pp. 979-994, May 2015.
[5] J. Lamarca, S. Parashar, A. Bartoli, and J. Montiel, "DefSLAM: Tracking and mapping of deforming scenes from monocular sequences," IEEE Trans. Robot., 2020.
[6] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Mapping, tracking and inpainting in dynamic scenes," IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 4076-4083, Oct. 2018.
[7] P. F. Alcantarilla, J. J. Yebes, J. Almazán, and L. M. Bergasa, "On combining visual SLAM and dense scene flow to increase the robustness of localization and mapping in dynamic environments," in Proc. IEEE Int. Conf. Robot. Autom., 2012, pp. 1290-1297.
[8] Y. Wang and S. Huang, "Motion segmentation based robust RGB-D SLAM," in Proc. 11th World Congr. Intell. Control Autom., 2014, pp. 3122-3127.
[9] W. Tan, H. Liu, Z. Dong, G. Zhang, and H. Bao, "Robust monocular SLAM in dynamic environments," in Proc. Int. Symp. Mixed Augmented Reality, 2013, pp. 209-218.
[10] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. IEEE Int. Conf. Comput. Vis., 2017, pp. 2980-2988.
[11] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, "ERFNet: Efficient residual factorized convnet for real-time semantic segmentation," IEEE Trans. Intell. Transport. Syst., vol. 19, no. 1, pp. 263-272, Jan. 2018.
[12] D. Barnes, W. Maddern, G. Pascoe, and I. Posner, "Driven to distraction: Self-supervised distractor learning for robust monocular visual odometry in urban environments," in Proc. IEEE Int. Conf. Robot. Autom., 2018, pp. 1894-1900.
[13] G. Zhou, B. Bescos, M. Dymczyk, M. Pfeiffer, J. Neira, and R. Siegwart, "Dynamic objects segmentation for visual localization in urban environments," 2018, arXiv:1807.02996.
[14] A. Telea, "An image inpainting technique based on the fast marching method," J. Graph. Tools, vol. 9, no. 1, pp. 23-34, 2004.
[15] M. Bertalmio, A. L. Bertozzi, and G. Sapiro, "Navier-Stokes, fluid dynamics, and image and video inpainting," in Proc. IEEE Comput. Soc. Conf. [...]
[40] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis, "Learning rich features for image manipulation detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1053-1061.
[41] D. Gálvez-López and J. D. [...]
[16] G. Liu, F. A. Reda, K. J. Shih, T.-C. Wang, A. Tao, and B. Catanzaro, "Image inpainting for irregular holes using partial convolutions," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 89–105.
[17] J. Yu, Z. Lin, J. Yang, X. Shen, X. Lu, and T. S. Huang, "Generative image inpainting with contextual attention," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 5505–5514.
[18] S. Iizuka, E. Simo-Serra, and H. Ishikawa, "Globally and locally consistent image completion," ACM Trans. Graph., vol. 36, no. 4, 2017, Art. no. 107.
[19] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. A. Efros, "Context encoders: Feature learning by inpainting," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2536–2544.
[20] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1125–1134.
[21] B. Bescos, R. Siegwart, J. Neira, and C. Cadena, "Empty cities: Image inpainting for a dynamic-object-invariant space," in Proc. IEEE Int. Conf. Robot. Autom., 2019, pp. 5460–5466.
[22] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. IEEE Int. Conf. Comput. Vis., 2011, pp. 2564–2571.
[23] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, Oct. 2017.
[24] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Proc. Int. Symp. Mixed Augmented Reality, 2007, pp. 225–234.
[25] M. Runz, M. Buffier, and L. Agapito, "MaskFusion: Real-time recognition, tracking and reconstruction of multiple moving objects," in Proc. IEEE Int. Symp. Mixed Augmented Reality, 2018, pp. 10–20.
[26] R. Scona, M. Jaimez, Y. R. Petillot, M. Fallon, and D. Cremers, "StaticFusion: Background reconstruction for dense RGB-D SLAM in dynamic environments," in Proc. IEEE Int. Conf. Robot. Autom., 2018, pp. 3849–3856.
[27] M. Granados, K. I. Kim, J. Tompkin, J. Kautz, and C. Theobalt, "Background inpainting for videos with dynamic objects and a free-moving camera," in Proc. Eur. Conf. Comput. Vis., 2012, pp. 682–695.
[28] R. Uittenbogaard, D. Gavrila, C. Sebastian, and J. Vijverberg, "Moving object detection and image inpainting in street-view imagery," Master's thesis, Delft Univ. Technol., Delft, The Netherlands, 2018.
[29] A. A. Efros and W. T. Freeman, "Image quilting for texture synthesis and transfer," in Proc. 28th Annu. Conf. Comput. Graph. Interactive Techn., 2001, pp. 341–346.
[30] C. Yang, X. Lu, Z. Lin, E. Shechtman, O. Wang, and H. Li, "High-resolution image inpainting using multi-scale neural patch synthesis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, vol. 1, pp. 4076–4084.
[31] Y. Song, C. Yang, Z. L. Lin, H. Li, Q. Huang, and C.-C. J. Kuo, "Contextual-based image inpainting: Infer, match, and translate," in Proc. Eur. Conf. Comput. Vis., 2018, pp. 3–19.
[32] D. Ulyanov, A. Vedaldi, and V. Lempitsky, "Deep image prior," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 9446–9454.
[33] I. Goodfellow et al., "Generative adversarial nets," in Proc. Adv. Neural Inf. Process. Syst., 2014, pp. 2672–2680.
[34] J. Gauthier, "Conditional generative adversarial nets for convolutional face generation," Class Project, Stanford CS231N: Convolutional Neural Netw. Vis. Recognit., Winter semester, vol. 2014, no. 5, p. 2, 2014.
[35] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assisted Intervention, 2015, pp. 234–241.
[36] R. Guerrero et al., "White matter hyperintensity and stroke lesion segmentation and differentiation using convolutional neural networks," NeuroImage: Clin., vol. 17, pp. 918–934, 2018.
[37] G. E. Hinton and R. R. Salakhutdinov, "Reducing the dimensionality of data with neural networks," Science, vol. 313, no. 5786, pp. 504–507, 2006.
[38] D. P. Kingma and M. Welling, "Auto-encoding variational Bayes," in Proc. Int. Conf. Learn. Representations, 2013.
[39] J. Fridrich and J. Kodovsky, "Rich models for steganalysis of digital images," IEEE Trans. Inf. Forensics Secur., vol. 7, no. 3, pp. 868–882, Jun. 2012.
[40] P. Zhou, X. Han, V. I. Morariu, and L. S. Davis, "Learning rich features for image manipulation detection," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1053–1061.
[41] D. Gálvez-López and J. D. Tardós, "Bags of binary words for fast place recognition in image sequences," IEEE Trans. Robot., vol. 28, no. 5, pp. 1188–1197, Oct. 2012.
[42] E. Rosten and T. Drummond, "Machine learning for high-speed corner detection," in Proc. Eur. Conf. Comput. Vis., 2006, pp. 430–443.
[43] H. Porav, W. Maddern, and P. Newman, "Adversarial training for adverse conditions: Robust metric localisation using appearance transfer," in Proc. IEEE Int. Conf. Robot. Autom., 2018, pp. 1011–1018.
[44] H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool, "Speeded-up robust features (SURF)," Comput. Vis. Image Understanding, vol. 110, no. 3, pp. 346–359, 2008.
[45] E. Romera, J. M. Alvarez, L. M. Bergasa, and R. Arroyo, ERFNet, 2017. [Online]. Available: https://github.com/Eromera/erfnet
[46] M. Cordts et al., "The Cityscapes dataset for semantic urban scene understanding," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 3213–3223.
[47] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun, "CARLA: An open urban driving simulator," in Proc. 1st Annu. Conf. Robot Learn., 2017, pp. 1–16.
[48] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," Int. J. Robot. Res., vol. 32, no. 11, pp. 1231–1237, 2013.
[49] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Trans. Image Process., vol. 13, no. 4, pp. 600–612, Apr. 2004.
[50] W. Maddern, G. Pascoe, C. Linegar, and P. Newman, "1 year, 1000 km: The Oxford RobotCar dataset," Int. J. Robot. Res., vol. 36, no. 1, pp. 3–15, 2017. [Online]. Available: http://dx.doi.org/10.1177/0278364916679498
[51] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba, "Places: A 10 million image database for scene recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 6, pp. 1452–1464, Jun. 2018.
[52] S. Hosseinzadeh, M. Shakeri, and H. Zhang, "Fast shadow detection from a single image using a patched convolutional neural network," in Proc. IEEE/RSJ Int. Conf. Intell. Robot. Syst., 2018, pp. 3124–3129.
[53] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig, "Virtual worlds as proxy for multi-object tracking analysis," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4340–4349.
[54] M. Peris, S. Martull, A. Maki, Y. Ohkawa, and K. Fukui, "Towards a simulation driven stereo vision system," in Proc. 21st Int. Conf. Pattern Recognit., 2012, pp. 1038–1042.
[55] J. Skinner, S. Garg, N. Sünderhauf, P. Corke, B. Upcroft, and M. Milford, "High-fidelity simulation for evaluating robotic vision performance," in Proc. IEEE Int. Conf. Intell. Robot. Syst., 2016, pp. 2737–2744.
[56] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in Proc. IEEE Int. Conf. Intell. Robot. Syst., 2017, pp. 23–30.
[57] X. Gao, R. Wang, N. Demmel, and D. Cremers, "LDSO: Direct sparse odometry with loop closure," in Proc. Int. Conf. Intell. Robot. Syst., 2018, pp. 2198–2204.
[58] R. Arandjelović, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, "NetVLAD: CNN architecture for weakly supervised place recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 1437–1451.
[59] D. Olid, J. M. Fácil, and J. Civera, "Single-view place recognition under seasonal changes," 2018, arXiv:1808.06516.
[60] R. Mur-Artal and J. D. Tardós, "Fast relocalisation and loop closing in keyframe-based SLAM," in Proc. IEEE Int. Conf. Robot. Autom., 2014, pp. 846–853.
[61] T. Cieslewski, S. Choudhary, and D. Scaramuzza, "Data-efficient decentralized visual SLAM," in Proc. IEEE Int. Conf. Robot. Autom., 2018, pp. 2466–2473.
[62] J. L. Schönberger and J.-M. Frahm, "Structure-from-motion revisited," in Proc. Conf. Comput. Vis. Pattern Recognit., 2016, pp. 4104–4113.
[63] J. L. Schönberger, E. Zheng, M. Pollefeys, and J.-M. Frahm, "Pixelwise view selection for unstructured multi-view stereo," in Proc. Eur. Conf. Comput. Vis., 2016, pp. 501–518.
[64] P. J. Besl and N. D. McKay, "Method for registration of 3-D shapes," Sensor Fusion IV: Control Paradigms Data Struct., vol. 1611, pp. 586–606, 1992.

BESCOS et al.: EMPTY CITIES: A DYNAMIC-OBJECT-INVARIANT SPACE FOR VISUAL SLAM 451

Berta Bescos was born in Zaragoza, Spain, in 1993. She received the bachelor's and M.S. degrees in industrial engineering with mention in robotics and computer vision from the University of Zaragoza, Zaragoza, Spain, where she is currently working toward the Ph.D. degree with the I3A Robotics, Perception and Real Time Group, with her Ph.D. topic dealing with dynamic objects in SLAM for a better scene understanding. Her research interests include the intersection between perception and learning for robotics.

José Neira was born in Bogotá, Colombia, in 1963. He received the M.S. degree from the Universidad de los Andes, Bogotá, Colombia, in 1986, and the Ph.D. degree from the University of Zaragoza, Zaragoza, Spain, in 1993, both in computer science. Since 2010, he has been a Full Professor with the Departamento de Informática e Ingeniería de Sistemas, University of Zaragoza, where he is in charge of courses in compiler theory, computer vision, machine learning, and mobile robotics. His current research interests are centered around robust, life-long simultaneous localization and mapping. He also coordinates the university's Master Program in robotics, graphics and computer vision.

Cesar Cadena received the Ph.D. degree in computer science from the University of Zaragoza, Zaragoza, Spain, in 2011. He is a Senior Researcher with ETH Zurich, Zurich, Switzerland. He is particularly interested in how to provide machines the capability of understanding this ever-changing world through the sensory information they can gather. He has worked intensively on robotic scene understanding, both geometry and semantics, covering semantic mapping, data association and place recognition tasks, simultaneous localization and mapping problems, as well as persistent mapping in dynamic environments. His main research interests include the intersection of perception and learning in robotics.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2020年/VDO-SLAM a visual dynamic object aware SLAM system.pdf b/动态slam/2020年-2022年开源动态SLAM/2020年/VDO-SLAM a visual dynamic object aware SLAM system.pdf
new file mode 100644
index 0000000..f342bfd
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2020年/VDO-SLAM a visual dynamic object aware SLAM system.pdf
@@ -0,0 +1,1136 @@

MANUSCRIPT ONLY 1

VDO-SLAM: A Visual Dynamic Object-aware SLAM System

Jun Zhang[co], Mina Henein[co], Robert Mahony and Viorela Ila

arXiv:2005.11052v3 [cs.RO] 14 Dec 2021

Abstract—Combining Simultaneous Localisation and Mapping
(SLAM) estimation and dynamic scene modelling can highly benefit robot autonomy in dynamic environments. Robot path planning and obstacle avoidance tasks rely on accurate estimates of the motion of dynamic objects in the scene. This paper presents VDO-SLAM, a robust visual dynamic object-aware SLAM system that exploits semantic information to enable accurate motion estimation and tracking of dynamic rigid objects in the scene without any prior knowledge of the objects' shapes or geometric models. The proposed approach identifies and tracks the dynamic objects and the static structure in the environment and integrates this information into a unified SLAM framework. This results in highly accurate estimates of the robot's trajectory and the full SE(3) motion of the objects, as well as a spatiotemporal map of the environment. The system is able to extract linear velocity estimates from objects' SE(3) motion, providing an important functionality for navigation in complex dynamic environments. We demonstrate the performance of the proposed system on a number of real indoor and outdoor datasets, and the results show consistent and substantial improvements over state-of-the-art algorithms. An open-source version of the source code is available∗.

Index Terms—SLAM, dynamic scene, object motion estimation, multiple object tracking.

Jun Zhang, Mina Henein and Robert Mahony are with the Australian National University (ANU), 0020 Canberra, Australia. {jun.zhang2,mina.henein,robert.mahony}@anu.edu.au
Viorela Ila is with the University of Sydney (USyd), 2006 Sydney, Australia. viorela.ila@sydney.edu.au
[co]: The two authors contributed equally to this work.
∗https://github.com/halajun/vdo_slam

Fig. 1: Results of our VDO-SLAM system. (Top) A full map including the camera trajectory in red, static background points in black, and points on moving objects colour-coded by their instance. (Bottom) Detected 3D points on the static background and the objects' bodies, and the estimated object speed. Black circles represent static points, and each object is shown in a different colour.

I. INTRODUCTION

The ability of a robot to build a model of the environment, often called a map, and to localise itself within this map is a key factor in enabling autonomous robots to operate in real-world environments. Creating these maps is achieved by fusing multiple sensor measurements into a consistent representation using estimation techniques such as Simultaneous Localisation And Mapping (SLAM). SLAM is a mature research topic and has already revolutionised a wide range of applications, from mobile robotics, inspection, entertainment and film production to exploration and monitoring of natural environments, amongst many others. However, most of the existing solutions to SLAM rely heavily on the assumption that the environment is predominantly static.

The conventional techniques to deal with dynamics in SLAM either treat any sensor data associated with moving objects as outliers and remove them from the estimation process ([1]–[5]), or detect moving objects and track them separately using traditional multi-target tracking approaches ([6]–[9]). The former technique excludes information about dynamic objects in the scene and generates static-only maps. The accuracy of the latter depends on the camera pose estimation, which is more susceptible to failure in complex dynamic environments. The increased presence of autonomous systems in dynamic environments is driving the community to challenge the static-world assumption that underpins most existing open-source SLAM algorithms. In this paper, we redefine the term "mapping" in SLAM to be concerned with a spatiotemporal representation of the world, as opposed to the concept of a static map that has long been the emphasis of classical SLAM algorithms. Our approach focuses on accurately estimating the motion of all dynamic entities in the environment, including the robot and other moving objects in the scene, this information being highly relevant in the context of robot path planning and navigation in dynamic environments.

Existing scene motion estimation techniques mainly rely on optical flow estimation ([10]–[13]) and scene flow estimation ([14]–[17]). Optical flow records the scene motion by estimating the velocities associated with the movement of brightness patterns on an image plane. Scene flow, on the other hand, describes the 3D motion field of a scene observed at different instants of time. These techniques only estimate the linear translation of individual pixels or 3D points in the scene, and do not exploit the collective behaviour of points on rigid objects, failing to describe the full SE(3) motion of objects in the scene. In this paper we exploit this collective behaviour of points on individual objects to obtain accurate and robust motion estimation of the objects in the scene while simultaneously localising the robot and mapping the environment.

A typical SLAM system consists of a front-end module, which processes the raw data from the sensors, and a back-end module, which integrates the obtained information (raw and higher-level) into a probabilistic estimation framework. Simple primitives such as 3D locations of salient features are commonly used to represent the environment. This is largely a consequence of the fact that points are easy to detect, track, and integrate within the SLAM estimation problem.

Feature tracking has become more reliable and robust with advances in deep learning, which provide algorithms that can reliably estimate, in a dense manner, the 2D optical flow associated with the apparent motion of every pixel in an image. This task is particularly important for data association and has otherwise been challenging in dynamic environments using classical feature tracking methods.

Other primitives such as lines and planes ([18]–[21]) or even objects ([22]–[24]) have been considered in order to provide richer map representations. To incorporate such information into existing geometric SLAM algorithms, either a dataset of 3D models of every object in the scene must be available a priori ([23], [25]), or the front end must explicitly provide object pose information in addition to detection and segmentation ([26]–[28]), adding a layer of complexity to the problem. The requirement for accurate 3D models severely limits the potential domains of application, while, to the best of our knowledge, multiple object tracking and 3D pose estimation remain a challenge for learning techniques. There is a clear need for an algorithm that can exploit the powerful detection and segmentation capabilities of modern deep learning algorithms ([29], [30]) without relying on additional pose estimation or object model priors, an algorithm that operates at feature level with the awareness of an object concept.

While the problems of SLAM and object motion tracking/estimation have long been studied in isolation in the literature, recent approaches try to solve the two problems in a unified framework ([31], [32]). However, they both focus on the SLAM back end instead of a full system, resulting in severely limited performance in real-world scenarios. In this paper, we carefully integrate our previous works ([31], [33]) and propose VDO-SLAM, a novel feature-based stereo/RGB-D dynamic SLAM system that leverages image-based semantic information to simultaneously localise the robot, map the static and dynamic structure, and track the motions of rigid objects in the scene. Different to [31], we rely on a denser object feature representation to ensure robust tracking, and propose new factors to smooth the motion of rigid objects in urban driving scenarios. Different to [33], an improved robust feature and object tracking method is proposed, with the ability to handle indirect occlusions resulting from the failure of semantic object segmentation. In summary, the contributions of this work are:

• a novel formulation to model dynamic scenes in a unified estimation framework over robot poses, static and dynamic 3D points, and object motions;
• accurate estimation of the SE(3) motion of dynamic objects that outperforms state-of-the-art algorithms, as well as a way to extract objects' velocities in the scene;
• a robust method for tracking moving objects that exploits semantic information, with the ability to handle indirect occlusions resulting from the failure of semantic object segmentation;
• a demonstrable full system in complex and compelling real-world scenarios.

To the best of our knowledge, this is the first full dynamic SLAM system that is able to achieve motion segmentation and dynamic object tracking, estimate the camera poses along with the static and dynamic structure and the full SE(3) pose change of every rigid object in the scene, extract velocity information, and be demonstrable in real-world outdoor scenarios (see Fig. 1). We demonstrate the performance of our algorithm on real datasets and show the capability of the proposed system to resolve rigid object motion estimation and yield motion results that are comparable to the camera pose estimation in accuracy and that outperform state-of-the-art algorithms by an order of magnitude in urban driving scenarios.

The remainder of this paper is structured as follows. In Section II we discuss the related work. In Sections III and IV we describe the proposed algorithm and system. We introduce the experimental setup, followed by the results and evaluations, in Section V. We summarise and offer concluding remarks in Section VI.

II. RELATED WORK

In the past two decades, the study of SLAM for dynamic environments has become more and more popular in the community, with a considerable number of algorithms being proposed to solve the dynamic SLAM problem. Motivated by the different goals to be achieved, solutions in the literature can mainly be divided into three categories.

The first category aims at robust SLAM in dynamic environments. Early methods in this category ([2], [34], [35]) normally detect and remove the information drawn from the dynamic foreground, which is seen as degrading SLAM performance. More recent methods on this track tend to go further by not just removing the dynamic foreground, but also inpainting or reconstructing the static background that is occluded by moving targets. [5] present DynaSLAM, which combines classic geometry and deep learning-based models to detect and remove dynamic objects, then inpaints the occluded background with multi-view information of the scene. Similarly, a Light Field SLAM front end is proposed by [36] to reconstruct the occluded static scene via Synthetic Aperture Imaging (SAI) techniques. Different from [5], features on the reconstructed static background are also tracked and used
The above state-of- Both methods succeed to exploit object information in a +the-art solutions achieve robust and accurate estimation by dense RGB-D SLAM framework, without prior knowledge of +discarding the dynamic information. However, we argue that object model. Their main interest, however, is the 3D object +this information has potential benefits for SLAM if it is prop- segmentation and consistent fusion of the dense map rather +erly modelled. Furthermore, understanding dynamic scenes in than the estimation of the motion of the objects. +addition to SLAM is crucial for many other robotics tasks such +as planning, control and obstacle avoidance, to name a few. Lately, the use of basic geometric models to represent + objects becomes a popular solution due to the less complexity + Approaches of the second category performs SLAM and and easy integration into a SLAM framework. In Quadric- +Moving Objects Tracking (MOT) separately, as an extension SLAM [46], detected objects are represented as ellipsoids to +to conventional SLAM for dynamic scene understanding ([9], compactly parametrise the size and 3D pose of an object. In +[37]–[39]). [37] developed a theory for performing SLAM this way, the quadric parameters are directly constrained as +with Moving Objects Tracking (SLAMMOT). In the latest geometric error and formulated together with camera poses +version of their SLAM with detection and tracking of mov- in a factor graph SLAM for joint estimation. [24] propose to +ing objects, the estimation problem is decomposed into two combine 2D and 3D object detection with SLAM for both +separate estimators (moving and stationary objects) to make static and dynamic environments. Objects are represented as +it feasible to update both filters in real time. [9] tackle the high-quality cuboids and optimized together with points and +SLAM problem with dynamic objects by solving the problems cameras through multi-view bundle adjustment. 
While both +of Structure from Motion (SfM) and tracking of moving methods prove the mutual benefit between detected object and +objects in parallel, and unifying the output of the system SLAM, their main focus is on object detection and SLAM +into a 3D dynamic map containing the static structure and primarily for static scenarios. In this paper, we take this +the trajectories of moving objects. Later in [38], the authors direction further to tackle the challenging problem of dynamic +propose to integrate semantic constraints to further improve the object tracking within a SLAM framework, and exploit the +3D reconstruction. The more recent work [39] present a stereo- relationships between moving objects and agent robot, static +based dense mapping algorithm in a SLAM framework, with and dynamic structures for potential advantages. +the advantage of accurately and efficiently reconstructing both +static background and moving objects in large scale dynamic Apart from the dynamic SLAM categories, the literature of +environments. The listed algorithms above have proven that 6-DoF object motion estimation is also crucial for dynamic +combining multiple objects tracking with SLAM is doable SLAM problem. Quite a few methods have been proposed in +and applicable for dynamic scene exploration. To take a step the literature to estimate SE(3) motion of objects in a visual +further by proper exploiting and establishing the spatial and odometry or SLAM framework ([50]–[52]). [50] present a +temporal relationships between the robot, static background, model-free method for detecting and tracking moving objects +stationary and dynamic objects, we show in this paper that in 3D LiDAR scans. The method sequentially estimates mo- +the problems of SLAM and multi-object tracking are mutually tion models using RANSAC [53], then segments and tracks +beneficial. multiple objects based on the models by a proposed Bayesian + approach. 
In [51], the authors address the problem of simul- + The last and most active category is object SLAM, which taneous estimation of ego and third-party SE(3) motions in +usually includes both static and dynamic objects. Algorithms complex dynamic scenes using cameras. They apply multi- +in this class normally require specific modelling and repre- model fitting techniques into a visual odometry pipeline and +sentation of 3D object, such as 3D shape ([40]–[42]), sur- estimate all rigid motions within a scene. In later work, [52] +fel [43] or volumetric [44] model, geometric model such as present ClusterVO that is able to perform online processing +ellipsoid ([45], [46]) or 3D bounding box ([24], [47]–[49]), for multiple motion estimations. To achieve this, a multi-level +etc., to extract high-level primitive (e.g., object pose) and probabilistic association mechanism is proposed to efficiently +integrate into a SLAM framework. [40] is one of the earliest track features and detections, then a heterogeneous Conditional +works to introduce an object-oriented SLAM paradigm, which Random Field (CRF) clustering approach is applied to jointly +represents cluttered scene in object level and constructs an infer cluster segmentations, with a sliding-window optimiza- +explicit graph between camera and object poses to achieve tion for clusters in the end. While the above proposed methods +joint pose-graph optimisation. Later, [41] propose a novel 3D represent an important step forward to the Multi-motion Visual +object recognition algorithm to ensure the system robustness Odometry (MVO) task, the study of spacial and temporal +and improve the accuracy of estimated object pose. The high- relationships is not fully explored but is arguably important. +level scene representation enables real-time 3D recognition Therefore, by carefully considering the pros and cons in the +and significant compression of map storage for SLAM. 
Never- literature of SLAM+MOT, object SLAM and MVO, this paper +theless, a database of pre-scanned or pre-trained object models proposes a visual dynamic object-aware SLAM system that is +has to be created in advance. To avoid prebuilt database, able to achieve robust ego and object motion tracking, as well +representing objects using surfel or voxel element in a dense as consistent static and dynamic mapping in a novel SLAM +manner starts to gain popularity, along with RGB-D cameras formulation. +becoming widely used. [43] present MaskFusion that adopts +surfel representation to model, track and reconstruct objects in III. METHODOLOGY +the scene, while [44] apply an octree-based volumetric model +to objects and build multi-object dynamic SLAM system. Before discussing details of the proposed system pipeline, + as shown in Fig. 4, this section covers the mathematical details + MANUSCRIPT ONLY 4 + +of the core components in the system. Variables and notations (5) is crucially important as it relates the same 3D point +are first introduced, including the novel way of modelling the +motion of a rigid-object in a model free manner. Then we on a rigid object in motion at consecutive time steps by +show how the camera pose and object motion are estimated Lk−1 +in the tracking component of the system. Finally, a factor a homogeneous transformation k−01Hk := 0Lk−1 k−1 Hk 0L−k−11 . +graph optimisation is proposed and applied in the mapping +component, to refine the camera poses and object motions, This equation represents a frame change of a pose transforma- +and build a global consistent map including static and dynamic +structure. tion [54], and shows how the body-fixed frame pose change + Lk−1 + k−1 Hk relates to the global reference frame pose change + + k−10Hk. The point motion in global reference frame is then + + expressed as: + + 0mik = k−10Hk 0mik−1 . (6) + +A. 
Background and Notation Equation (6) is at the core of our motion estimation approach, + as it expresses the rigid object pose change in terms of the + 1) Coordinate Frames: Let 0Xk,0 Lk ∈ SE(3) be the points that reside on the object in a model-free manner without + the need to include the object 3D pose as a random variable +robot/camera and the object 3D pose respectively, at time k in the estimation. Section III-B2 details how this rigid object + pose change is estimated based on the above equation. Here +in a global reference frame 0, with k ∈ T the set of time k−10Hk ∈ SE(3) represents the object point motion in global + reference frame; for the remainder of this document, we refer +steps. Note that calligraphic capital letters are used in our to this quantity as the object pose change or the object motion + for ease of reading. +notation to represent sets of indices. Fig. 2 shows these pose + B. Camera Pose and Object Motion Estimation +transformations as solid curves. + 2) Points: Let 0mki be the homogeneous coordinates of the The cost function chosen to estimate the camera pose and + +ith 3D point at time k, with 0mi = mix, myi , miz, 1 ∈ IE3 and object motion is associated with the 3D-2D re-projection error +i ∈ M the set of points. We write a point in robot/camera +frame as Xk mik =0 Xk−1 0mik. and is defined on the image plane. Since the noise is better + + Define Ik the reference frame associated with the image characterised in image plane, this yields more accurate results + +captured by the camera at time k chosen at the top left for camera localisation [55]. 
Moreover, based on this error +corner of the image, and let Ik pik = ui, vi, 1 ∈ IE2 be the pixel +location on frame Ik corresponding to the homogeneous 3D term, we propose a novel formulation to jointly optimise the +point Xk mki , which is obtained via the projection function π(·) +as follows: optical flow along with the camera pose and the object motion, + + Ik pik = π(Xk mik) = K Xk mik , (1) to ensure a robust tracking of points. In the mapping module, a + +where K is the camera intrinsics matrix. 3D error cost function is used in global optimization to ensure + + The camera and/or object motions both produce an optical best results of 3D structure and object motions estimation as +flow Ik φ i ∈ IR2 that is the displacement vector indicating the +motion of pixel Ik−1 pik−1 from image frame Ik−1 to Ik, and is later described in Section III-C. +given by: 1) Camera Pose Estimation: Given a set of static 3D + + Ik φ i = Ik p˜ ik − Ik−1 pik−1 . (2) points {0mik−1 | i ∈ M , k ∈ T } observed at time k − 1 in + global reference frame, and the set of 2D correspondences +Here Ik p˜ ik is the correspondence of Ik−1 pik−1 in Ik. Note that, {Ik p˜ ki | i ∈ M , k ∈ T } in image Ik, the camera pose 0Xk is +we overload the same notation to represent the 2D pixel estimated via minimizing the re-projection error: +coordinates ∈ IR2. In this work, we leverage optical flow to + ei(0Xk) = Ik p˜ ik − π(0X−k 1 0mik−1) . (7) + +find correspondences between consecutive frames. We parameterise the SE(3) camera pose by elements of the + Lie-algebra xk ∈ se(3): +3) Object and 3D Point Motions: The object motion be- + +tween times k − 1 and k is described by the homogeneous + Lk−1 +transformation k−1 Hk ∈ SE(3) according to: 0Xk = exp(0xk) , (8) + + Lk−1 Hk =0 L−k−11 0Lk . (3) and define 0x∨k ∈ IR6 with the vee operator a mapping from + k−1 se(3) to IR6. Using the Lie-algebra parameterisation of SE(3) + with the substitution of (8) into (7), the solution of the least +Fig. 
2 shows these motion transformations as dashed curves. squares cost is given by: + +We write a point in its corresponding object frame as nb +Lk mik = 0L−k 1 0mki (shown as a dashed vector from the object ∑ 0x∗k∨ = argmin ρh ei (0xk) Σ−p 1 ei(0xk) +reference frame to the red dot in Fig. 2), substituting the object (9) + +pose at time k from (3), this becomes: 0x∨k i + +0mik = 0Lk Lk mik = 0Lk−1 Lk−1 Hk Lk mik . (4) for all nb visible 3D-2D static background point correspon- + k−1 + dences between consecutive frames. Here ρh is the Huber +Note that for rigid body objects, Lk mki stays constant at Lmi, +and Lmi = 0Lk−1 0mki = 0L−k+1n 0mik+n for any integer n ∈ Z. function [56], and Σp is the covariance matrix associated with +Then, for rigid objects with n = −1, (4) becomes: + the re-projection error. The estimated camera pose is given by +0mik = 0Lk−1 Lk−1 Hk 0L−k−11 0mki −1 . (5) 0Xk∗ = exp(0x∗k) and is found using the Levenberg-Marquardt + k−1 algorithm to solve for (9). + MANUSCRIPT ONLY 5 + + −2 −1 + + −2 −1 −1 + −2 −2 + −1 −1 + 0 + 0 + −1 + −2 0 + + 0 0 + + −2 −1 + + 0 + + + + {0} + + {−2 } −2 −2 −1 −1 + + −2 −2 {−1 } { } + + 0 −1 −1 + + −2 + + −2 −1 + + −2 −1 −1 + + 0 0 + + −1 + +Fig. 2: Notation and coordinate frames. Solid curves represent camera and object poses in inertial frame; 0X and 0L +respectively, and dashed curves their respective motions in body-fixed frame. Solid lines represent 3D points in inertial frame, +and dashed lines represent 3D points in camera frames. + + 2) Object Motion Estimation: Analogous to the camera function: + +pose estimation, a cost function based on re-projection error nb +is constructed to solve for the object motion k−10Hk. 
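To make the camera-pose step in (7)-(9) concrete, the following is a minimal numerical sketch: it minimises the 3D-2D re-projection error over an $se(3)$ parameterisation. It is a stand-alone illustration, not the paper's implementation: the Huber weighting and Levenberg-Marquardt damping of (9) are replaced by plain Gauss-Newton with a numerical Jacobian, and all function names are ours.

```python
import numpy as np

def hat(w):
    """3-vector -> skew-symmetric matrix."""
    return np.array([[0.0, -w[2], w[1]],
                     [w[2], 0.0, -w[0]],
                     [-w[1], w[0], 0.0]])

def exp_se3(x):
    """se(3) vector x = (rho, phi) -> 4x4 SE(3) matrix, cf. (8)."""
    rho, phi = x[:3], x[3:]
    theta = np.linalg.norm(phi)
    W = hat(phi)
    if theta < 1e-10:
        R, V = np.eye(3), np.eye(3)
    else:
        R = (np.eye(3) + np.sin(theta) / theta * W
             + (1 - np.cos(theta)) / theta**2 * (W @ W))
        V = (np.eye(3) + (1 - np.cos(theta)) / theta**2 * W
             + (theta - np.sin(theta)) / theta**3 * (W @ W))
    T = np.eye(4)
    T[:3, :3], T[:3, 3] = R, V @ rho
    return T

def project(K, m_cam):
    """Projection pi(.) of a homogeneous camera-frame point, cf. (1)."""
    p = K @ m_cam[:3]
    return p[:2] / p[2]

def residuals(x, K, pts_world, obs):
    """Stacked re-projection errors e_i = p_i - pi(X_k^-1 m_i), cf. (7)."""
    T_inv = np.linalg.inv(exp_se3(x))          # X_k^-1
    r = [p - project(K, T_inv @ np.append(m, 1.0))
         for m, p in zip(pts_world, obs)]
    return np.concatenate(r)

def estimate_pose(K, pts_world, obs, iters=15):
    """Gauss-Newton over the 6-dof pose parameters, initialised at identity."""
    x = np.zeros(6)
    for _ in range(iters):
        r = residuals(x, K, pts_world, obs)
        J = np.zeros((r.size, 6))
        for j in range(6):                      # forward-difference Jacobian
            dx = np.zeros(6)
            dx[j] = 1e-6
            J[:, j] = (residuals(x + dx, K, pts_world, obs) - r) / 1e-6
        # damped normal equations; step reduces ||r||^2
        x = x - np.linalg.solve(J.T @ J + 1e-9 * np.eye(6), J.T @ r)
    return x
```

On exact synthetic correspondences this recovers the ground-truth pose to high precision; a real tracking front end would add the Huber weight, analytic Jacobians, and a Levenberg-Marquardt damping schedule.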
Using (6), the error term between the re-projection of an object 3D point and the corresponding 2D point in image $I_k$ is:

  $e_i({}^0_{k-1}H_k) := {}^{I_k}\tilde{p}^i_k - \pi({}^0X_k^{-1}\, {}^0_{k-1}H_k\, {}^0m^i_{k-1}) = {}^{I_k}\tilde{p}^i_k - \pi({}^0_{k-1}G_k\, {}^0m^i_{k-1})$ ,  (10)

where ${}^0_{k-1}G_k \in SE(3)$. Parameterising ${}^0_{k-1}G_k := \exp({}^0_{k-1}g_k)$ with ${}^0_{k-1}g_k \in se(3)$, the optimal solution is found by minimising:

  ${}^0_{k-1}g^{*\vee}_k = \arg\min_{{}^0_{k-1}g^{\vee}_k} \sum_i^{n_d} \rho_h\big(e_i^\top({}^0_{k-1}g_k)\, \Sigma_p^{-1}\, e_i({}^0_{k-1}g_k)\big)$  (11)

given all $n_d$ visible 3D-2D dynamic point correspondences on an object between frames $k-1$ and $k$. The object motion ${}^0_{k-1}H_k = {}^0X_k\, {}^0_{k-1}G_k$ can be recovered afterwards.

3) Joint Estimation with Optical Flow: The camera pose and object motion estimation both rely on good image correspondences. Tracking points on moving objects can be very challenging due to occlusions, large relative motions and large camera-object distances. In order to ensure a robust tracking of points, we follow our earlier work [33] and refine the estimation of the optical flow jointly with the motion estimation.

For camera pose estimation, the error term in (7) is reformulated considering (2) as:

  $e_i({}^0X_k, {}^{I_k}\phi^i) = {}^{I_{k-1}}p^i_{k-1} + {}^{I_k}\phi^i - \pi({}^0X_k^{-1}\, {}^0m^i_{k-1})$ .  (12)

Applying the Lie-algebra parameterisation of the SE(3) element, the optimal solution is obtained by minimising the cost function:

  $\{{}^0x^{*\vee}_k, {}^{I_k}\Phi^*\} = \arg\min_{\{{}^0x^{\vee}_k, {}^{I_k}\Phi\}} \sum_i^{n_b} \big\{ \rho_h\big(e_i^\top({}^{I_k}\phi^i)\, \Sigma_\phi^{-1}\, e_i({}^{I_k}\phi^i)\big) + \rho_h\big(e_i^\top({}^0x_k, {}^{I_k}\phi^i)\, \Sigma_p^{-1}\, e_i({}^0x_k, {}^{I_k}\phi^i)\big) \big\}$ ,  (13)

where $\rho_h(e_i^\top({}^{I_k}\phi^i)\, \Sigma_\phi^{-1}\, e_i({}^{I_k}\phi^i))$ is the regularisation term, with

  $e_i({}^{I_k}\phi^i) = {}^{I_k}\hat{\phi}^i - {}^{I_k}\phi^i$ .  (14)

Here ${}^{I_k}\hat{\Phi} = \{{}^{I_k}\hat{\phi}^i \mid i \in \mathcal{M}, k \in \mathcal{T}\}$ is the initial optical flow obtained through classical or learning-based methods, and $\Sigma_\phi$ is the associated covariance matrix. Analogously, the cost function for the object motion in (11), combining optical-flow refinement, is given by:

  $\{{}^0_{k-1}g^{*\vee}_k, {}^{I_k}\Phi^*\} = \arg\min_{\{{}^0_{k-1}g^{\vee}_k, {}^{I_k}\Phi\}} \sum_i^{n_d} \big\{ \rho_h\big(e_i^\top({}^{I_k}\phi^i)\, \Sigma_\phi^{-1}\, e_i({}^{I_k}\phi^i)\big) + \rho_h\big(e_i^\top({}^0_{k-1}g_k, {}^{I_k}\phi^i)\, \Sigma_p^{-1}\, e_i({}^0_{k-1}g_k, {}^{I_k}\phi^i)\big) \big\}$ .  (15)

C. Graph Optimisation

The proposed approach formulates dynamic SLAM as a graph optimisation problem, to refine the camera poses and object motions and to build a globally consistent map including static and dynamic structure. We model the dynamic SLAM problem as a factor graph, as demonstrated in Fig. 3. The factor-graph formulation is highly intuitive and has the advantage that it allows for efficient implementations of batch ([57], [58]) and incremental ([59]-[61]) solvers.

Four types of measurements/observations are integrated into a joint optimisation problem: the 3D point measurements, the visual odometry measurements, the motions of points on dynamic objects, and the object smooth-motion observations.

The 3D point measurement model error $e_{i,k}({}^0X_k, {}^0m^i_k)$ is defined as:

  $e_{i,k}({}^0X_k, {}^0m^i_k) = {}^0X_k^{-1}\, {}^0m^i_k - z^i_k$ .  (16)

Here $z = \{z^i_k \mid i \in \mathcal{M}, k \in \mathcal{T}\}$ is the set of all 3D point measurements at all time steps, with cardinality $n_z$ and $z^i_k \in \mathbb{R}^3$. The 3D point measurement factors are shown as white circles in Fig. 3.

The tracking component of the system provides a high-quality ego-motion estimate via 3D-2D error minimisation, which can be used as an odometry measurement to constrain camera poses in the graph. The visual odometry model error $e_k({}^0X_{k-1}, {}^0X_k)$ is defined as:

  $e_k({}^0X_{k-1}, {}^0X_k) = ({}^0X_{k-1}^{-1}\, {}^0X_k)^{-1}\, {}^{X_{k-1}}_{k-1}T_k$ ,  (17)

where $T = \{{}^{X_{k-1}}_{k-1}T_k \mid k \in \mathcal{T}\}$ is the odometry measurement set, with ${}^{X_{k-1}}_{k-1}T_k \in SE(3)$ and cardinality $n_o$. The odometry factors are shown as orange circles in Fig. 3.

The motion model error of points on dynamic objects $e_{i,l,k}({}^0m^i_k, {}^0_{k-1}H^l_k, {}^0m^i_{k-1})$ is defined as:

  $e_{i,l,k}({}^0m^i_k, {}^0_{k-1}H^l_k, {}^0m^i_{k-1}) = {}^0m^i_k - {}^0_{k-1}H^l_k\, {}^0m^i_{k-1}$ .  (18)

The motions of all points on a detected rigid object $l$ are characterised by the same pose transformation ${}^0_{k-1}H^l_k \in SE(3)$ given by (6), and the corresponding factor, shown as magenta circles in Fig. 3, is a ternary factor which we call the motion model of a point on a rigid body.

It has been shown that incorporating prior knowledge about the motion of objects in the scene is highly valuable in dynamic SLAM ([31], [37]). Motivated by the camera frame rate and the physical laws governing the motion of relatively large objects (vehicles), which prevent their motions from changing abruptly, we introduce smooth-motion factors to minimise the change in consecutive object motions, with the error term defined as:

  $e_{l,k}({}^0_{k-2}H^l_{k-1}, {}^0_{k-1}H^l_k) = {}^0_{k-2}H^{l\;-1}_{k-1}\, {}^0_{k-1}H^l_k$ .  (19)

The object smooth-motion factor $e_{l,k}({}^0_{k-2}H^l_{k-1}, {}^0_{k-1}H^l_k)$ is used to minimise the change between the object motions at consecutive time steps, and is shown as cyan circles in Fig. 3.

Fig. 3: Factor graph representation of an object-aware SLAM with a moving object. Black squares stand for the camera poses at different time steps, blue for static points, red for the same dynamic point on an object (dashed box) at different time steps, and green for the object pose change between time steps. For ease of visualisation, only one dynamic point is drawn here. A prior factor is shown as a black circle, odometry factors are shown as orange, point measurement factors as white and point motion factors as magenta. A smooth-motion factor is shown as a cyan circle.

Let $\theta_M = \{{}^0m^i_k \mid i \in \mathcal{M}, k \in \mathcal{T}\}$ be the set of all 3D points, and $\theta_X = \{{}^0x^{\vee}_k \mid k \in \mathcal{T}\}$ the set of all camera poses. We parameterise the SE(3) object motion ${}^0_{k-1}H^l_k$ by elements ${}^0_{k-1}h^l_k \in se(3)$, the Lie algebra of SE(3):

  ${}^0_{k-1}H^l_k = \exp({}^0_{k-1}h^l_k)$ ,  (20)

and define $\theta_H = \{{}^0_{k-1}h^{l\vee}_k \mid k \in \mathcal{T}, l \in \mathcal{L}\}$ as the set of all object motions, with ${}^0_{k-1}h^{l\vee}_k \in \mathbb{R}^6$ and $\mathcal{L}$ the set of all object labels. Given $\theta = \theta_X \cup \theta_M \cup \theta_H$ as all the nodes in the graph, with the Lie-algebra parameterisation of SE(3) for $X$ and $H$ (substituting (8) into (16) and (17), and (20) into (18) and (19)), the solution of the least-squares cost is given by:

  $\theta^* = \arg\min_\theta \Big\{ \sum_{i,k}^{n_z} \rho_h\big(e_{i,k}^\top({}^0x_k, {}^0m^i_k)\, \Sigma_z^{-1}\, e_{i,k}({}^0x_k, {}^0m^i_k)\big) + \sum_k^{n_o} \rho_h\big(\log(e_k({}^0x_{k-1}, {}^0x_k))^\top\, \Sigma_o^{-1}\, \log(e_k({}^0x_{k-1}, {}^0x_k))\big) + \sum_{i,l,k}^{n_g} \rho_h\big(e_{i,l,k}^\top({}^0m^i_k, {}^0_{k-1}h^l_k, {}^0m^i_{k-1})\, \Sigma_g^{-1}\, e_{i,l,k}({}^0m^i_k, {}^0_{k-1}h^l_k, {}^0m^i_{k-1})\big) + \sum_{l,k}^{n_s} \rho_h\big(\log(e_{l,k}({}^0_{k-2}h^l_{k-1}, {}^0_{k-1}h^l_k))^\top\, \Sigma_s^{-1}\, \log(e_{l,k}({}^0_{k-2}h^l_{k-1}, {}^0_{k-1}h^l_k))\big) \Big\}$ ,  (21)

where $\Sigma_z$ is the 3D point measurement noise covariance matrix, $\Sigma_o$ is the odometry noise covariance matrix, $\Sigma_g$ is the motion noise covariance matrix, with $n_g$ the total number of ternary object motion factors, and $\Sigma_s$ is the smooth-motion covariance matrix, with $n_s$ the total number of smooth-motion factors. The non-linear least-squares problem in (21) is solved using the Levenberg-Marquardt method.

IV. SYSTEM

In this section, we propose a novel object-aware dynamic SLAM system that robustly estimates both camera and object motions, along with the static and dynamic structure of the environment. The full system overview is shown in Fig. 4. The system consists of three main components: image pre-processing, tracking and mapping.

The input to the system is stereo or RGB-D images. For stereo images, as a first step, we extract depth information by applying the stereo depth estimation method described in [62] to generate depth maps, and the resulting data is treated as RGB-D.

Although this system was initially designed to be an RGB-D system, as an attempt to fully exploit image-based semantic information, we apply single-image depth estimation to obtain depth information from a monocular camera. Our "learning-based monocular" system is monocular in the sense that only RGB images are used as input to the system; however, the estimation problem is formulated using RGB-D data, where the depth is obtained using single-image depth estimation.

A. Pre-processing

There are two challenging aspects that this module needs to fulfil: first, to robustly separate the static background and the objects, and second, to ensure long-term tracking of dynamic objects. To achieve this, we leverage recent advances in computer vision techniques for instance-level semantic segmentation and dense optical flow estimation, in order to ensure efficient object motion segmentation and robust object tracking.

1) Object Instance Segmentation: Instance-level semantic segmentation is used to segment and identify potentially movable objects in the scene. Semantic information constitutes an important prior in the process of separating static and moving object points; e.g., buildings and roads are always static, but cars can be static or dynamic. Instance segmentation helps to further divide the semantic foreground into different instance masks, which makes it easier to track each individual object. Moreover, segmentation masks provide a "precise" boundary of the object body, which ensures robust tracking of points on the object.

2) Optical Flow Estimation: The dense optical flow is used to maximise the number of tracked points on moving objects. Most moving objects occupy only a small portion of the image; therefore, using sparse feature matching does not guarantee robust or long-term feature tracking. Our approach makes use of dense optical flow to considerably increase the number of object points by sampling from all the points within the semantic mask. Dense optical flow is also used to consistently track multiple objects, by propagating a unique object identifier assigned to every point on an object mask. Moreover, it allows us to recover object masks if semantic segmentation fails, a task that is extremely difficult to achieve using sparse feature matching.

B. Tracking

The tracking component includes two modules: the camera ego-motion tracking, with sub-modules of feature detection and camera pose estimation, and the object motion tracking, with sub-modules of dynamic object tracking and object motion estimation.

1) Feature Detection: To achieve fast camera pose estimation, we detect a sparse set of corner features and track them with optical flow. At each frame, only the inlier feature points that fit the estimated camera motion are saved into the map and used to track correspondences in the next frame. New features are detected and added if the number of inlier tracks falls below a certain level (1200 by default). These sparse features are detected on the static background, i.e., image regions excluding the segmented objects.

2) Camera Pose Estimation: The camera pose is computed using (13) for all detected 3D-2D static point correspondences. To ensure robust estimation, a motion-model generation method is applied for initialisation. Specifically, the method generates two models and compares their inlier numbers based on the re-projection error. One model is generated by propagating the previous camera motion, while the other is obtained by computing a new motion transform with the P3P [63] algorithm and RANSAC. The motion model that generates the most inliers is then selected for initialisation.

3) Dynamic Object Tracking: The process of object motion tracking consists of two steps. In the first step, segmented objects are classified into static and dynamic. Then, we associate the dynamic objects across pairs of consecutive frames.

• Instance-level object segmentation allows us to separate objects from the background. Although the algorithm is capable of estimating the motions of all the segmented objects, dynamic object identification helps reduce the computational cost of the proposed system. This is done based on scene flow estimation. Specifically, after obtaining the camera pose ${}^0X_k$, the scene flow vector $f^i_k$ describing the motion of a 3D point ${}^0m^i$ between frames $k-1$ and $k$ can be calculated as in [64]:

  $f^i_k = {}^0m^i_{k-1} - {}^0m^i_k = {}^0m^i_{k-1} - {}^0X_k\, {}^{X_k}m^i_k$ .  (22)

Unlike optical flow, scene flow, which is ideally caused only by scene motion, can directly decide whether some structure is moving or not. Ideally, the magnitude of the scene flow vector should be zero for all static 3D points. However, noise or error in depth and matching complicates the situation in real scenarios. To handle this robustly, we compute the scene flow magnitude of all the sampled points on each object. If the magnitude of the scene flow of a certain point is greater than a predefined threshold, the point is considered dynamic. This threshold was set to 0.12 in all experiments carried out in this work. An object is then recognised as dynamic if the proportion of "dynamic" points is above a certain level (30% of the total number of points), and as static otherwise. These thresholds were deliberately chosen to be conservative: the system is flexible enough to model a static object as dynamic and estimate a zero motion for it at every time step, whereas the opposite would degrade the system's performance.

• Instance-level object segmentation only provides single-image object labels. Objects then need to be tracked across frames and their motion models propagated over time. We propose to use optical flow to associate point labels across frames. A point label is the same as the unique object identifier of the object on which the point was sampled. We maintain a finite tracking label set $\mathcal{L} \subset \mathbb{N}$, where $l \in \mathcal{L}$ starts from $l = 1$ for the first detected moving object in the scene. The number of elements in $\mathcal{L}$ increases as more moving objects are detected. Static objects and the background are labelled with $l = 0$.
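As a concrete illustration of the scene-flow test in (22), the sketch below classifies an object as static or dynamic from sampled point correspondences, using the 0.12 point threshold and the 30% object-level ratio quoted above. This is a simplified stand-alone sketch with our own function names, not the system's code.

```python
import numpy as np

def scene_flow(T_cam_k, pts_world_prev, pts_cam_k):
    """f_k^i = 0m_{k-1}^i - 0X_k {X_k}m_k^i for each sampled point, cf. (22)."""
    pts_h = np.hstack([pts_cam_k, np.ones((len(pts_cam_k), 1))])
    pts_world_k = (T_cam_k @ pts_h.T).T[:, :3]   # points in world frame at k
    return pts_world_prev - pts_world_k

def is_object_dynamic(T_cam_k, pts_world_prev, pts_cam_k,
                      point_thresh=0.12, ratio_thresh=0.30):
    """Object is dynamic if enough of its points have large scene flow."""
    flow = scene_flow(T_cam_k, pts_world_prev, pts_cam_k)
    mags = np.linalg.norm(flow, axis=1)
    return np.mean(mags > point_thresh) > ratio_thresh
```

For a static object the two world-frame point sets coincide up to noise, so the flow magnitudes stay below the point threshold and the ratio test fails; for a moving object most sampled points exceed it.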
+ Ideally, for each detected object in frame k, the labels of all + its points should be uniquely aligned with the labels of their + correspondences in frame k − 1. However, in practice this is + affected by the noise, image boundaries and occlusions. To + overcome this, we assign all the points with the label that + MANUSCRIPT ONLY 8 + +Fig. 4: Overview of our VDO-SLAM system. Input images are first pre-processed to generate instance-level object +segmentation and dense optical flow. These are then used to track features on static background structure and dynamic objects. +Camera poses and object motions estimated from feature tracks are then refined in a global batch optimisation, and a local +map is maintained and updated with every new frame. The system outputs camera poses, static structure, tracks of dynamic +objects, and estimates of their pose changes over time. + +appears most in their correspondences. For a dynamic object, similarly, a factor graph optimisation is performed to refine all +if the most frequent label in the previous frame is 0, it means the variables within the local map, and then update them back +that the object starts to move, appears in the scene at the into the global map. +boundary, or reappears from occlusion. In this case, the object +is assigned a new tracking label. 2) Global Batch Optimisation: The output of the tracking + component and the local batch optimisation consists of the + 4) Object Motion Estimation: As mentioned above, objects camera pose, the object motions and the inlier structure. These +normally appear in small portions in the scene, which makes are saved in a global map that is constructed with all the +it hard to get sufficient sparse features to track and estimate previous time steps and is continually updated with every +their motions robustly. We sample every third point within new frame. A factor graph is constructed based on the global +an object mask, and track them across frames. 
Similar to the map after all input frames have been processed. To effectively +camera pose estimation, only inlier points are saved into the explore the temporal constraints, only points that have been +map and used for tracking in the next frame. When the number tracked for more than 3 instances are added into the factor +of tracked object points decreases below a certain level, new graph. The graph is formulated as an optimisation problem as +object points are sampled and added. We follow the same described in Section III-C. The optimisation results serve as +method as discussed in Section IV-B2 to generate an initial the output of the whole system. +object motion model. + 3) From Mapping to Tracking: Maintaining the map pro- +C. Mapping vides history information to the estimate of the current state in + the tracking module, as shown in Fig. 4 with blue arrows going + In the mapping component, a global map is constructed from the global map to multiple components in the tracking +and maintained. Meanwhile, a local map is extracted from module of the system. Inlier points from the last frame are +the global map, which is based on the current time step and leveraged to track correspondences in the current frame and +a window of previous time steps. Both maps are updated via estimate camera pose and object motions. The last camera +a batch optimisation process. and object motion also serve as possible prior models to + initialise the current estimation as described in Section IV-B2 + 1) Local Batch Optimisation: We maintain and update a and IV-B4. Furthermore, object points help associate semantic +local map. The goal of the local batch optimisation is to masks across frames to ensure robust tracking of objects, +ensure accurate camera pose estimates are provided to the by propagating their previously segmented masks in case of +global batch optimisation. 
The camera pose estimation has a “indirect occlusion” resulting from the failure of semantic +big influence on the accuracy of the object motion estimation object segmentation. +and the overall performance of the algorithm. The local map +is built using a fixed-size sliding window containing the V. EXPERIMENTS +information of the last nw frames, where nw is the window size +and is set to 20 in this paper. Local maps share some common We evaluate VDO-SLAM in terms of camera motion, object +information; this defines the overlap between the different motion and velocity, as well as object tracking performance. +windows. We choose to only locally optimise the camera The evaluation is done on the Oxford Multimotion Dataset [65] +poses and static structure within the window size, as locally for indoor, and KITTI Tracking dataset [66] for outdoor +optimising the dynamic structure does not bring any benefit scenarios, with comparison to other state-of-the-art methods, +to the optimisation unless a hard constraint (e.g. a constant including MVO [51], ClusterVO [52], DynaSLAM II [49] +object motion) is assumed within the window. However, the and CubeSLAM [24]. Due to the non-deterministic nature in +system is able to incorporate static and dynamic structure in running the proposed system, such as RANSAC processing, +the local mapping if needed. When a local map is constructed, we run each sequence 5 times and take median values as the + MANUSCRIPT ONLY 9 + +demonstrating results. All the results are obtained by running Then the speed error Es between the estimated vˆ and the +the proposed system in default parameter setup. Our open- ground truth v velocities can be calculated as: Es = |vˆ| − |v|. +source implementation includes the demo YAML files and +instructions to run the system in both datasets. C. Oxford Multimotion Dataset + +A. 
Deep Model Setup The recent Oxford Multimotion Dataset [65] contains se- + quences from a moving stereo or RGB-D camera sensor + We adopt a learning-based instance-level object segmen- observing multiple swinging boxes or toy cars in an indoor +tation, Mask R-CNN [67], to generate object segmentation scenario. Ground truth trajectories of the camera and moving +masks. The model of this method is trained on COCO objects are obtained via a Vicon motion capture system. We +dataset [68], and is directly used in this work without any fine- only choose the swinging boxes sequence (500 frames) for +tuning. For dense optical flow, we leverage a state-of-the-art evaluation, since results of real driving scenarios are evaluated +method; PWC-Net [12]. The model is trained on FlyingChairs on KITTI dataset. Note that, the trained model for instance +dataset [69], and then fine-tuned on Sintel [70] and KITTI segmentation cannot be applied to this dataset directly, since +training datasets [71]. To generate depth maps for a “monocu- the training data (COCO) does not contain the class of +lar” version of our proposed system, we apply a learning-based square box. Instead, we use Otsu’s method [77], together with +monocular depth estimation method, MonoDepth2 [72]. The color information and multi-label processing to segment the +model is trained on Depth Eigen split [73] excluding the tested boxes, which works very well for the simple setup of this +data in this paper. Feature detection is done using FAST [74] dataset (color boxes that are highly distinguishable from the +implemented in [75]. All the above methods are applied using background). Table I shows results compared to the state-of- +the default parameters. the-art MVO [51] and ClusterVO [52], with data provided by + the authors, respectively. As they are both visual odometry +B. 
Error Metrics systems without global refinement, we switch off the batch + optimisation module in our system and generate our results + We use a pose change error metric to evaluate the estimated for fair comparison. We use the error metrics described in +SE(3) motion, i.e., given a ground truth motion transform T Section V-B. +and a corresponding estimated motion Tˆ , where T ∈ SE(3) + Compared to MVO, our proposed method achieves better +could be either a camera relative pose or an object motion. accuracy in the estimation of camera pose (35%) and motion +The pose change error is computed as: E = Tˆ −1 T. This is of the swinging boxes, top-left (15%) and bottom-left (40%). + We obtain slightly higher errors when there is spinning ro- +similar to Relative Pose Error [76], while we set the time tational motion of the object observed, in particular the top- +interval ∆ = 1 (per frame), because the trajectory of different right swinging and rotating box (in translation only), and the +object in a sequence varies from each other and are normally bottom-right rotating box. We believe that this is due to using + an optical flow algorithm that is not well optimised for self- +much shorter than the camera trajectory. rotating objects. The consequence of this is poor estimation of +The translational error Et (meter) is computed as the L2 norm point motion and consequent degradation of the overall object +of the translational component of E. The rotational error Er tracking performance. Even with the associated performance +(degree) is calculated as the angle of rotation in an axis-angle loss for rotating objects, the benefits of dense optical flow +representation of the rotational component of E. For different motion estimation is clear in the other metrics. Our method + performs slightly worse than ClusterVO in the estimate of +camera time steps and different objects in a sequence, we camera pose, and the translation of bottom-right rotating box. 
+ Other than that, we achieve more than twice improvements +compute the root mean squared error (RMSE) for camera against ClusterVO in the estimate of object motions. + +poses and object motions, respectively. The object pose change An illustrative result of the trajectory output of our algo- + rithm on Oxford Multimotion Dataset is shown in Fig. 5. +in body-fixed frame is obtained by transforming the pose Tracks of dynamic features on swinging boxes visually corre- +change k−01Hk in the inertial frame into the body frame using spond to the actual motion of the boxes. This can be clearly +the object pose ground-truth seen in the swinging motion of the bottom-left box shown with + purple color in Fig. 5. + Lk−1 Hk =0 Lk−−11 0 Hk 0Lk−1. (23) + k−1 k−1 D. KITTI Tracking Dataset + + We also evaluate the object speed error. The linear velocity The KITTI Tracking Dataset [66] contains 21 sequences in + total with ground truth information about camera and object +of a point on the object, expressed in the inertial frame, can poses. Among these sequences, some are not included in the + evaluation of our system; as they contain no moving objects +be estimated by applying the pose change 0 Hk and taking (static only scenes) or only contain pedestrians that are non- + k−1 rigid objects, which is outside the scope of this work. Note + that, as only rotation around Y-axis is provided in the ground +the difference + + v ≈0 mik −0 mki −1 = 0 Hk − I4 0 mki −1 + k−1 + + = k−10tk − (I3 − k−10Rk) 0mki −1. (24) + +To get a more reliable measurement, we average over all points + +on an object at a certain time. Define ck−1 := 1 ∑ mik−1 for all + n + +n points on an object at time k − 1. Then + + ∑ 1 n k−01tk − (I3 − k−10Rk) 0mik−1 + + v≈ + n i=1 + + = 0 tk − (I3 − 0 Rk ) ck−1. 
TABLE I: Comparison versus MVO [51] and ClusterVO [52] for camera pose and object motion estimation accuracy on the swinging_4_unconstrained sequence of the Oxford Multimotion Dataset. Bold numbers indicate the better results.

                                         VDO-SLAM             MVO              ClusterVO
                                      Er (deg)  Et (m)   Er (deg)  Et (m)   Er (deg)  Et (m)
 Camera                                0.7709   0.0112    1.1948   0.0314    0.7665   0.0066
 Top-left swinging box                 1.1889   0.0207    1.4553   0.0288    3.2537   0.0673
 Top-right swinging and rotating box   0.7631   0.0132    0.8992   0.0130    3.5308   0.0256
 Bottom-left swinging box              0.9153   0.0149    1.4949   0.0261    4.9146   0.0763
 Bottom-right rotating box             0.8469   0.0192    0.7815   0.0115    4.0675   0.0144

Fig. 5: Qualitative results of our method on the Oxford Multimotion Dataset. (Left) The 3D trajectories of the camera (red) and the centres of the four boxes. (Right) Detected points on the static background and object bodies. Black corresponds to static points, and features on each object are shown in a different colour.

1) Camera Pose and Object Motion: Table II demonstrates the results of both camera pose and object motion estimation on nine sequences, compared to DynaSLAM II [49] and CubeSLAM [24]. Results of DynaSLAM II are obtained directly from their paper, where only the evaluation of camera pose is available. We initially tried to evaluate CubeSLAM ourselves with the default provided parameters; however, the errors were much higher, and hence we only report results on the five sequences provided by the authors of CubeSLAM after some correspondence. As CubeSLAM is designed for a monocular camera, we also compute results of a learning-based monocular version of our proposed method (as mentioned in Section IV) for a fair comparison.

Our proposed method achieves competitive and high accuracy in comparison with DynaSLAM II for the estimation of camera pose. In particular, our method obtains slightly lower rotational errors but higher translational errors than DynaSLAM II. We believe the difference in accuracy is due to the underlying formulations used in estimating camera pose. When compared to CubeSLAM, our RGB-D version obtains lower errors in camera pose, while our learning-based monocular version is slightly higher. We believe the weak performance of the monocular version is because the model does not capture the scale of depth accurately with only monocular input. Nevertheless, both versions obtain consistently lower errors in object motion estimation. In particular, as demonstrated in Fig. 6, the translation and rotation errors in CubeSLAM are all above 3 meters and 3 degrees, with errors reaching 32 meters and 5 degrees in extreme cases. In contrast, our translation errors vary between 0.1-0.3 meters and rotation errors between 0.2-1.5 degrees in the RGB-D case, and 0.1-0.3 meters and 0.4-3.1 degrees in the learning-based monocular case, which indicates that our object motion estimation achieves an order of magnitude improvement in most cases. In general, the results suggest that point-based object motion/pose estimation methods are more robust and accurate than those using high-level geometric models, probably because geometric model extraction can lose information and introduce more uncertainty.

Fig. 6: Accuracy of object motion estimation of our method compared to CubeSLAM [24]. The colour bars refer to translation error, corresponding to the left Y-axis in log scale. The circles refer to rotation error, corresponding to the right Y-axis in linear scale.

2) Object Tracking and Velocity: We also demonstrate the performance of tracking dynamic objects, and show results of object speed estimation, which is important information for autonomous driving applications. Fig. 7 illustrates results of object tracking length and object speed for some selected objects (tracked for over 20 frames) in all the tested sequences. Our system is able to track most objects for more than 80% of their occurrence in the sequence. Moreover, our estimated object speeds are consistently close to the ground truth.

3) Qualitative Results: Fig. 8 illustrates the output of our system for three of the KITTI sequences. The proposed system is able to output the camera poses, along with the static structure and the dynamic tracks of every detected moving object in the scene, in a spatiotemporal map representation.

TABLE II: Comparison versus DynaSLAM II [49] and CubeSLAM [24] for camera pose and object motion estimation accuracy on nine sequences with moving objects drawn from the KITTI dataset. Er in degrees, Et in metres. Bold numbers indicate the better result.

       DynaSLAM II        VDO-SLAM (RGB-D)                VDO-SLAM (Monocular)            CubeSLAM
         Camera          Camera          Object          Camera          Object          Camera          Object
 Seq   Er      Et      Er      Et      Er      Et      Er      Et      Er      Et      Er      Et      Er      Et
 00    0.06    0.04    0.0741  0.0674  1.0520  0.1077  0.1830  0.1847  2.0021  0.3827  -       -       -       -
 01    0.04    0.05    0.0382  0.1220  0.9051  0.1573  0.1772  0.4982  1.1833  0.3589  -       -       -       -
 02    0.02    0.04    0.0182  0.0445  1.2359  0.2801  0.0496  0.0963  1.6833  0.4121  -       -       -       -
 03    0.04    0.06    0.0311  0.0816  0.2919  0.0965  0.1065  0.1505  0.4570  0.2032  0.0498  0.0929  3.6085  4.5947
 04    0.06    0.07    0.0482  0.1114  0.8288  0.1937  0.1741  0.4951  3.1156  0.5310  0.0708  0.1159  5.5803  32.5379
 05    0.03    0.06    0.0219  0.0932  0.3705  0.1140  0.0506  0.1368  0.6464  0.2669  0.0342  0.0696  3.2610  6.4851
 06    0.04    0.02    0.0488  0.0186  1.0803  0.1158  0.0671  0.0451  2.0977  0.2394  -       -       -       -
 18    0.02    0.05    0.0211  0.0749  0.2453  0.0825  0.1236  0.3551  0.5559  0.2774  0.0433  0.0510  3.1876  3.7948
 20    0.04    0.07    0.0271  0.1662  0.3663  0.0824  0.3029  1.3821  1.1081  0.3693  0.1348  0.1888  3.4206  5.6986
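The Er/Et entries in Tables I and II measure the residual between estimated and ground-truth pose changes, and the RMSE mentioned earlier aggregates such per-frame errors. A minimal sketch of these metrics (illustrative helpers, not the paper's evaluation code; 4x4 homogeneous matrices are assumed, and the paper's exact averaging convention may differ):

```python
import numpy as np

def relative_pose_error(T_est, T_gt):
    """Rotational error (degrees) and translational error (metres)
    between an estimated and a ground-truth SE(3) pose change."""
    E = np.linalg.inv(T_est) @ T_gt                      # residual transform
    cos_r = (np.trace(E[:3, :3]) - 1.0) / 2.0            # angle from rotation trace
    e_r = np.degrees(np.arccos(np.clip(cos_r, -1.0, 1.0)))
    e_t = np.linalg.norm(E[:3, 3])
    return e_r, e_t

def rmse(errors):
    """Root mean squared error over a sequence of per-frame errors."""
    errors = np.asarray(errors, dtype=float)
    return float(np.sqrt(np.mean(errors ** 2)))
```

An identical pair of poses yields zero error in both components, and a purely translational residual shows up only in Et, which makes the two columns easy to sanity-check independently.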
Fig. 7: Tracking performance and speed estimation. Results of object tracking length and object speed for a selection of objects (tracked for over 20 frames; only a subset is shown due to limited space). The colour bars represent the length of object tracks, corresponding to the left Y-axis; the circles represent object speeds, corresponding to the right Y-axis. The X-axis lists Sequence-Object IDs (00-1 through 20-3). GT refers to ground truth, and EST. refers to estimated values.

E. Discussion

Apart from the extensive evaluation in Sections V-C and V-D, we also provide detailed experimental results to demonstrate the effectiveness of key modules in our proposed system. Finally, the computational cost of the proposed system is discussed.

1) Robust Tracking of Points: The graph optimisation explores the spatial and temporal information to refine the camera poses and the object motions, as well as the static and dynamic structure. This process requires robust tracking of good points in terms of both quantity and quality. This was achieved by refining the estimated optical flow jointly with the motion estimation, as discussed in Section III-B3. The effectiveness of the joint optimisation is shown by comparing a baseline method that only optimises for the motion (Motion Only), using (9) for camera motion or (11) for object motion, against the improved method that optimises for both the motion and the optical flow (Joint), using (13) or (15). Table III demonstrates that the joint method obtains considerably more points that are tracked for long periods.

TABLE III: The number of points tracked for more than five frames on the nine sequences of the KITTI dataset. Bold numbers indicate the better results. Underlined bold numbers indicate an order of magnitude increase in number.

           Background             Object
 Seq   Motion Only   Joint    Motion Only   Joint
 00        1798      12812        1704       7162
 01         237       5075         907       4583
 02        7642      10683          52       1442
 03         778      12317         343       3354
 04        9913      25861         339       2802
 05         713      11627        2363       2977
 06        7898      11048         482       5934
 18        4271      22503        5614      14989
 20        9838      49261        9282      13434

Using the tracked points given by the joint estimation process leads to better estimation of both camera pose and object motion. As demonstrated in Table IV, an improvement of about 10% (camera) and 25% (object) in both translation and rotation errors was observed over the nine sequences of the KITTI dataset shown above.

TABLE IV: Average camera pose and object motion errors over the nine sequences of the KITTI dataset. Bold numbers indicate the better results.

             Motion Only            Joint
          Er (deg)  Et (m)    Er (deg)  Et (m)
 Camera    0.0412   0.0987     0.0365   0.0866
 Object    1.0179   0.1853     0.7085   0.1367

2) Robustness against Non-direct Occlusion: The mask segmentation may fail in some cases, due to direct or indirect occlusions (illumination change, etc.). Thanks to the mask propagation method described in Section IV-C3, our proposed system is able to handle mask failure cases caused by indirect occlusions. Fig. 9 demonstrates an example of tracking a white van for 80 frames, where the mask segmentation fails in 33 frames. Despite the segmentation failure, our system is still able to continuously track the van and estimate its speed, with an average error of 2.64 km/h across the whole sequence. Speed errors in the second half of the sequence are higher due to partial direct occlusions and the increased distance to the object as it gets farther away from the camera.
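The idea of propagating an instance mask along optical flow when segmentation fails can be illustrated with a nearest-neighbour forward warp. This is a simplified sketch, not the paper's Section IV-C3 implementation; it assumes an integer instance mask and dense flow from the previous to the current frame, and it ignores occlusions and the holes a forward warp can leave:

```python
import numpy as np

def propagate_mask(prev_mask, flow_prev_to_cur):
    """Warp the previous frame's instance mask into the current frame
    along dense optical flow.
    prev_mask: (h, w) integer instance labels (0 = background).
    flow_prev_to_cur: (h, w, 2) flow in (x, y) pixel offsets."""
    h, w = prev_mask.shape
    cur_mask = np.zeros_like(prev_mask)
    ys, xs = np.nonzero(prev_mask)                       # labelled pixels only
    x2 = np.round(xs + flow_prev_to_cur[ys, xs, 0]).astype(int)
    y2 = np.round(ys + flow_prev_to_cur[ys, xs, 1]).astype(int)
    ok = (x2 >= 0) & (x2 < w) & (y2 >= 0) & (y2 < h)     # keep in-bounds targets
    cur_mask[y2[ok], x2[ok]] = prev_mask[ys[ok], xs[ok]]
    return cur_mask
```

In practice a backward warp with interpolation and hole filling would be preferable; the forward nearest-neighbour version above is only meant to convey the mechanism.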
Fig. 8: Illustration of system output: a dynamic map with camera poses, static background structure, and tracks of dynamic objects. Sample results of VDO-SLAM on KITTI sequences. Black represents the static background, and each detected object is shown in a different colour. The top-left figure represents Seq.01, with a zoom-in on the intersection at the end of the sequence; the top-right figure represents Seq.06; and the bottom figure represents Seq.03.

3) Global Refinement on Object Motion: The initial object motion estimation (in the tracking component of the system) is independent between frames, since it is purely related to the sensor measurements. As illustrated in Fig. 10, the blue curve describes an initial object speed estimate of a wagon observed for 55 frames in sequence 03 of the KITTI tracking dataset. As seen in the figure, the speed estimation is not smooth, and large errors occur towards the second half of the sequence. This is mainly caused by the increased distance to the object as it gets farther away from the camera, with its structure only occupying a small portion of the scene. In this case, the object motion estimation from sensor measurements alone becomes challenging and error-prone. Therefore, we formulate a factor graph and refine the motions together with the static and dynamic structure, as discussed in Section III-C. The green curve in Fig. 10 shows the object speed results after the global refinement, which become smoother in the first half of the sequence and significantly improved in the second half.

Fig. 11 demonstrates the average improvement for all objects in each sequence of the KITTI dataset. With graph optimization, the errors can be reduced by up to 39% in translation and 55% in rotation. Interestingly, the translation errors in Seq.18 and Seq.20 increase slightly. We believe this is because the vehicles keep alternating between acceleration and deceleration due to the heavy traffic jams in both sequences, which strongly violates the smooth motion constraint that is set for general cases.

Fig. 11: Improvement on object motion after graph optimization. The numbers in the heatmap show the ratio of decrease in error on the nine sequences of the KITTI dataset (Seq.00-06, 18, 20). Translation: 0.27, 0.27, 0.11, 0.39, 0.1, 0.16, 0.02, -0.03, -0.04. Rotation: 0.2, 0.22, 0.06, 0.54, 0.26, 0.55, 0.04, 0.34, 0.12.

4) Computational Analysis: Finally, we provide the computational analysis of our system. The experiments are carried out on an Intel Core i7 2.6 GHz laptop computer with 16 GB RAM. The object semantic segmentation and dense optical flow computation times depend on the GPU power and the CNN model complexity; many current state-of-the-art algorithms can run in real time ([30], [78]). In this paper, the semantic segmentation and optical flow results are produced off-line as input to the system. The SLAM system is implemented in C++ on CPU, using a modified version of g2o as a back-end [79]. We show the computational times in Table V for both datasets. Overall, the tracking part of our proposed system is able to run at a frame rate of 5-8 fps, depending on the number of detected moving objects, which could be improved by employing a parallel implementation. The runtime of the global batch optimisation strongly depends on the number of camera poses (number of frames) and on the objects present in the scene (density in terms of the number of dynamic objects observed per frame).

TABLE V: Runtime of different system components for both datasets. The time cost of every component is averaged over all frames and sequences, except for the dynamic object tracking and object motion estimation, which are averaged over the number of objects.

 Dataset  Task                                    Runtime (ms)
 KITTI    Feature Detection                          16.2550
          Camera Pose Estimation                     52.6542
          Dynamic Object Tracking (avg/object)        8.2980
          Object Motion Estimation (avg/object)      22.9081
          Map and Mask Updating                      22.1830
          Local Batch Optimisation                   18.2828
 OMD      Feature Detection                           7.5220
          Camera Pose Estimation                     32.0909
          Dynamic Object Tracking (avg/object)        7.0134
          Object Motion Estimation (avg/object)      19.5280
          Map and Mask Updating                      30.3153
          Local Batch Optimisation                   15.3414

Fig. 9: Robustness in tracking performance and speed estimation in case of semantic segmentation failure. An example of tracking performance and speed estimation for a white van (ground-truth average speed 20 km/h) in Seq.00. (Top) Blue bars represent a successful object segmentation, and green curves refer to the object speed error. (Bottom-left) An illustration of semantic segmentation failure on the van. (Bottom-right) Result of propagating the previously tracked features on the van by our system.

Fig. 10: Global refinement effect on object speed estimation. The initial (blue) and refined (green) estimated speeds of a wagon in Seq.03, travelling along a straight road, compared to the ground-truth speed (red). Note the ground-truth speed fluctuates slightly; we believe this is due to the ground-truth object poses being approximated from lidar scans.

VI. CONCLUSION

In this paper, we have presented VDO-SLAM, a novel dynamic feature-based SLAM system that exploits image-based semantic information in the scene, with no additional knowledge of the object pose or geometry, to achieve simultaneous localisation, mapping and tracking of dynamic objects. The system consistently shows robust and accurate results on indoor and challenging outdoor datasets, and achieves state-of-the-art performance in object motion estimation. We believe the high accuracy achieved in object motion estimation is due to the fact that our system is feature-based. Feature points remain the easiest to detect, track and integrate within a SLAM system, and they require the front-end neither to have additional knowledge about the object model nor to explicitly provide any information about its pose.

An important issue to be addressed is the computational complexity of SLAM with dynamic objects. In long-term applications, different techniques can be applied to limit the growth of the graph ([80], [81]). In fact, history summarisation/deletion of map points pertaining to dynamic objects observed far in the past seems a natural step towards a long-term SLAM system in highly dynamic environments.

ACKNOWLEDGEMENTS

This research is supported by the Australian Research Council through the Australian Centre of Excellence for Robotic Vision (CE140100016), and the Sydney Institute for Robotics and Intelligent Systems. The authors would like to thank Mr. Ziang Cheng and Mr. Huangying Zhan for providing help in preparing the testing datasets.

REFERENCES

[1] D. Hahnel, D. Schulz, and W. Burgard, "Map Building with Mobile Robots in Populated Environments," in International Conference on Intelligent Robots and Systems (IROS), vol. 1. IEEE, 2002, pp. 496–501.
[2] D. Hahnel, R. Triebel, W. Burgard, and S. Thrun, "Map Building with Mobile Robots in Dynamic Environments," in International Conference on Robotics and Automation (ICRA), vol. 2. IEEE, 2003, pp. 1557–1563.
[3] D. F. Wolf and G. S. Sukhatme, "Mobile Robot Simultaneous Localization and Mapping in Dynamic Environments," Autonomous Robots, vol. 19, no. 1, pp. 53–65, 2005.
[4] H. Zhao, M. Chiba, R. Shibasaki, X. Shao, J. Cui, and H. Zha, "SLAM in a Dynamic Large Outdoor Environment using a Laser Scanner," in International Conference on Robotics and Automation (ICRA). IEEE, 2008, pp. 1455–1462.
[5] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, Mapping, and Inpainting in Dynamic Scenes," Robotics and Automation Letters (RAL), vol. 3, no. 4, pp. 4076–4083, 2018.
[6] C.-C. Wang, C. Thorpe, and S.
Thrun, "Online Simultaneous Localization and Mapping with Detection and Tracking of Moving Objects: Theory and Results from a Ground Vehicle in Crowded Urban Areas," in International Conference on Robotics and Automation (ICRA), vol. 1. IEEE, 2003, pp. 842–849.
[7] I. Miller and M. Campbell, "Rao-Blackwellized Particle Filtering for Mapping Dynamic Environments," in International Conference on Robotics and Automation (ICRA). IEEE, 2007, pp. 3862–3869.
[8] J. G. Rogers, A. J. Trevor, C. Nieto-Granda, and H. I. Christensen, "SLAM with Expectation Maximization for Moveable Object Tracking," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2010, pp. 2077–2082.
[9] A. Kundu, K. M. Krishna, and C. Jawahar, "Realtime Multibody Visual SLAM with a Smoothly Moving Monocular Camera," in International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2080–2087.
[10] K. Yamaguchi, D. McAllester, and R. Urtasun, "Efficient Joint Segmentation, Occlusion Labeling, Stereo and Flow Estimation," in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 756–771.
[11] D. Sun, S. Roth, and M. J. Black, "Secrets of Optical Flow Estimation and Their Principles," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2010, pp. 2432–2439.
[12] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, "PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2018.
[13] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017, pp. 2462–2470.
[14] C. Vogel, K. Schindler, and S. Roth, "Piecewise Rigid Scene Flow," in International Conference on Computer Vision (ICCV). IEEE, 2013, pp. 1377–1384.
[15] M. Menze and A. Geiger, "Object Scene Flow for Autonomous Vehicles," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2015, pp. 3061–3070.
[16] X. Liu, C. R. Qi, and L. J. Guibas, "FlowNet3D: Learning Scene Flow in 3D Point Clouds," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2019, pp. 529–537.
[17] H. Jiang, D. Sun, V. Jampani, Z. Lv, E. Learned-Miller, and J. Kautz, "SENSE: A Shared Encoder Network for Scene-flow Estimation," in International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 3195–3204.
[18] P. de la Puente and D. Rodríguez-Losada, "Feature Based Graph-SLAM in Structured Environments," Autonomous Robots, vol. 37, no. 3, pp. 243–260, 2014.
[19] M. Kaess, "Simultaneous Localization and Mapping with Infinite Planes," in International Conference on Robotics and Automation (ICRA). IEEE, 2015, pp. 4605–4611.
[20] M. Henein, M. Abello, V. Ila, and R. Mahony, "Exploring the Effect of Meta-structural Information on the Global Consistency of SLAM," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2017, pp. 1616–1623.
[21] M. Hsiao, E. Westman, G. Zhang, and M. Kaess, "Keyframe-based Dense Planar SLAM," in International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 5110–5117.
[22] B. Mu, S.-Y. Liu, L. Paull, J. Leonard, and J. P. How, "SLAM with Objects using a Nonparametric Pose Graph," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2016, pp. 4602–4609.
[23] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, "SLAM++: Simultaneous Localisation and Mapping at the Level of Objects," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2013, pp. 1352–1359.
[24] S. Yang and S. Scherer, "CubeSLAM: Monocular 3-D Object SLAM," Transactions on Robotics (T-RO), vol. 35, no. 4, pp. 925–938, 2019.
[25] D. Gálvez-López, M. Salas, J. D. Tardós, and J. Montiel, "Real-time Monocular Object SLAM," Robotics and Autonomous Systems, vol. 75, pp. 435–449, 2016.
[26] A. Milan, L. Leal-Taixé, I. Reid, S. Roth, and K. Schindler, "MOT16: A Benchmark for Multi-Object Tracking," arXiv:1603.00831 [cs], Mar. 2016. [Online]. Available: http://arxiv.org/abs/1603.00831
[27] A. Byravan and D. Fox, "SE3-Nets: Learning Rigid Body Motion using Deep Neural Networks," in International Conference on Robotics and Automation (ICRA). IEEE, 2017, pp. 173–180.
[28] P. Wohlhart and V. Lepetit, "Learning Descriptors for Object Recognition and 3D Pose Estimation," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3109–3118.
[29] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," Transactions on Pattern Analysis and Machine Intelligence (PAMI), 2018.
[30] D. Bolya, C. Zhou, F. Xiao, and Y. J. Lee, "YOLACT: Real-time Instance Segmentation," in International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 9157–9166.
[31] M. Henein, J. Zhang, R. Mahony, and V. Ila, "Dynamic SLAM: The Need for Speed," in International Conference on Robotics and Automation (ICRA). IEEE, 2020, pp. 2123–2129.
[32] J. Huang, S. Yang, Z. Zhao, Y. Lai, and S. Hu, "ClusterSLAM: A SLAM Backend for Simultaneous Rigid Body Clustering and Motion Estimation," in International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 5874–5883.
[33] J. Zhang, M. Henein, R. Mahony, and V. Ila, "Robust Ego and Object 6-DoF Motion Estimation and Tracking," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2020, pp. 5017–5023.
[34] P. F. Alcantarilla, J. J. Yebes, J. Almazán, and L. M. Bergasa, "On Combining Visual SLAM and Dense Scene Flow to Increase the Robustness of Localization and Mapping in Dynamic Environments," in International Conference on Robotics and Automation (ICRA). IEEE, 2012, pp. 1290–1297.
[35] W. Tan, H. Liu, Z. Dong, G. Zhang, and H. Bao, "Robust Monocular SLAM in Dynamic Environments," in International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2013, pp. 209–218.
[36] P. Kaveti and H. Singh, "A Light Field Front-end for Robust SLAM in Dynamic Environments," arXiv preprint arXiv:2012.10714, 2020.
[37] C.-C. Wang, C. Thorpe, S. Thrun, M. Hebert, and H. Durrant-Whyte, "Simultaneous Localization, Mapping and Moving Object Tracking," International Journal of Robotics Research (IJRR), vol. 26, no. 9, pp. 889–916, 2007.
[38] N. D. Reddy, P. Singhal, V. Chari, and K. M. Krishna, "Dynamic Body VSLAM with Semantic Constraints," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, pp. 1897–1904.
[39] I. A. Bârsan, P. Liu, M. Pollefeys, and A. Geiger, "Robust Dense Mapping for Large-Scale Dynamic Environments," in International Conference on Robotics and Automation (ICRA). IEEE, 2018.
[40] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and A. J. Davison, "SLAM++: Simultaneous Localisation and Mapping at the Level of Objects," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2013, pp. 1352–1359.
[41] K. Tateno, F. Tombari, and N. Navab, "When 2.5D is Not Enough: Simultaneous Reconstruction, Segmentation and Recognition on Dense SLAM," in International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 2295–2302.
[42] E. Sucar, K. Wada, and A. Davison, "NodeSLAM: Neural Object Descriptors for Multi-View Shape Reconstruction," arXiv preprint arXiv:2004.04485, 2020.
[43] M. Runz, M. Buffier, and L. Agapito, "MaskFusion: Real-time Recognition, Tracking and Reconstruction of Multiple Moving Objects," in International Symposium on Mixed and Augmented Reality (ISMAR). IEEE, 2018, pp. 10–20.
[44] B. Xu, W. Li, D. Tzoumanikas, M. Bloesch, A. Davison, and S. Leutenegger, "MID-Fusion: Octree-based Object-level Multi-instance Dynamic SLAM," in International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 5231–5237.
[45] M. Hosseinzadeh, K. Li, Y. Latif, and I. Reid, "Real-time Monocular Object-model Aware Sparse SLAM," in International Conference on Robotics and Automation (ICRA). IEEE, 2019, pp. 7123–7129.
[46] L. Nicholson, M. Milford, and N. Sünderhauf, "QuadricSLAM: Dual Quadrics from Object Detections as Landmarks in Object-oriented SLAM," Robotics and Automation Letters (RAL), vol. 4, no. 1, pp. 1–8, 2018.
[47] P. Li, T. Qin, et al., "Stereo Vision-based Semantic 3D Object and Ego-motion Tracking for Autonomous Driving," in European Conference on Computer Vision (ECCV), 2018, pp. 646–661.
[48] P. Li, J. Shi, and S. Shen, "Joint Spatial-temporal Optimization for Stereo 3D Object Tracking," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020, pp. 6877–6886.
[49] B. Bescos, C. Campos, J. D. Tardós, and J. Neira, "DynaSLAM II: Tightly-coupled Multi-object Tracking and SLAM," Robotics and Automation Letters (RAL), vol. 6, no. 3, pp. 5191–5198, 2021.
[50] A. Dewan, T. Caselitz, G. D. Tipaldi, and W. Burgard, "Motion-based Detection and Tracking in 3D Lidar Scans," in International Conference on Robotics and Automation (ICRA). IEEE, 2016, pp. 4508–4513.
[51] K. M. Judd, J. D. Gammell, and P. Newman, "Multimotion Visual Odometry (MVO): Simultaneous Estimation of Camera and Third-party Motions," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 3949–3956.
[52] J. Huang, S. Yang, T.-J. Mu, and S.-M. Hu, "ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020, pp. 2168–2177.
[53] M. A. Fischler and R. C. Bolles, "Random Sample Consensus: A Paradigm for Model Fitting with Applications to Image Analysis and Automated Cartography," Communications of the ACM, vol. 24, no. 6, pp. 381–395, 1981.
[54] G. S. Chirikjian, R. Mahony, S. Ruan, and J. Trumpf, "Pose Changes from a Different Point of View," in The ASME International Design Engineering Technical Conferences (IDETC). ASME, 2017.
[55] D. Nistér, O. Naroditsky, and J. Bergen, "Visual Odometry," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), vol. 1. IEEE, 2004, pp. I–I.
[56] P. J. Huber, "Robust Estimation of a Location Parameter," in Breakthroughs in Statistics. Springer, 1992, pp. 492–518.
[57] F. Dellaert and M. Kaess, "Square Root SAM: Simultaneous Localization and Mapping via Square Root Information Smoothing," International Journal of Robotics Research (IJRR), vol. 25, no. 12, pp. 1181–1203, 2006.
[58] S. Agarwal, K. Mierle, and Others, "Ceres Solver," http://ceres-solver.org, 2012.
[59] M. Kaess, H. Johannsson, R. Roberts, V. Ila, J. J. Leonard, and F. Dellaert, "iSAM2: Incremental Smoothing and Mapping using the Bayes Tree," International Journal of Robotics Research (IJRR), 2011.
[60] L. Polok, V. Ila, M. Solony, P. Smrz, and P. Zemcik, "Incremental Block Cholesky Factorization for Nonlinear Least Squares in Robotics," in Robotics: Science and Systems (RSS), Berlin, Germany, June 2013.
[61] V. Ila, L. Polok, M. Šolony, and P. Svoboda, "SLAM++ - A Highly Efficient and Temporally Scalable Incremental SLAM Framework," International Journal of Robotics Research (IJRR), vol. Online First, no. 0, pp. 1–21, 2017.
[62] K. Yamaguchi, D. McAllester, and R. Urtasun, "Efficient Joint Segmentation, Occlusion Labeling, Stereo and Flow Estimation," in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 756–771.
[63] T. Ke and S. I. Roumeliotis, "An Efficient Algebraic Solution to the Perspective-three-point Problem," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2017.
[64] Z. Lv, K. Kim, A. Troccoli, J. Rehg, and J. Kautz, "Learning Rigidity in Dynamic Scenes with a Moving Camera for 3D Motion Field Estimation," in European Conference on Computer Vision (ECCV). Springer, 2018.
[65] K. M. Judd and J. D. Gammell, "The Oxford Multimotion Dataset: Multiple SE(3) Motions with Ground Truth," Robotics and Automation Letters (RAL), vol. 4, no. 2, pp. 800–807, 2019.
[66] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision Meets Robotics: The KITTI Dataset," International Journal of Robotics Research (IJRR), vol. 32, no. 11, pp. 1231–1237, 2013.
[67] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in International Conference on Computer Vision (ICCV). IEEE, 2017, pp. 2980–2988.
[68] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common Objects in Context," in European Conference on Computer Vision (ECCV). Springer, 2014, pp. 740–755.
[69] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, "A Large Dataset to Train Convolutional Networks for Disparity, Optical Flow, and Scene Flow Estimation," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2016, pp. 4040–4048.
[70] D. J. Butler, J. Wulff, G. B. Stanley, and M. J. Black, "A Naturalistic Open Source Movie for Optical Flow Evaluation," in European Conference on Computer Vision (ECCV). Springer, 2012, pp. 611–625.
[71] A. Geiger, P. Lenz, and R. Urtasun, "Are We Ready for Autonomous Driving? The KITTI Vision Benchmark Suite," in Computer Society Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2012.
[72] C. Godard, O. Mac Aodha, M. Firman, and G. J. Brostow, "Digging into Self-supervised Monocular Depth Estimation," in International Conference on Computer Vision (ICCV). IEEE, 2019, pp. 3828–3838.
[73] D. Eigen, C. Puhrsch, and R. Fergus, "Depth Map Prediction from a Single Image using a Multi-scale Deep Network," in Advances in Neural Information Processing Systems (NIPS), 2014, pp. 2366–2374.
[74] E. Rosten and T. Drummond, "Machine Learning for High-speed Corner Detection," in European Conference on Computer Vision (ECCV). Springer, 2006, pp. 430–443.
[75] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An Efficient Alternative to SIFT or SURF," in International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2564–2571.
[76] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A Benchmark for the Evaluation of RGB-D SLAM Systems," in International Conference on Intelligent Robots and Systems (IROS). IEEE, 2012, pp. 573–580.
[77] N. Otsu, "A Threshold Selection Method from Gray-level Histograms," Transactions on Systems, Man, and Cybernetics, vol. 9, no. 1, pp. 62–66, 1979.
[78] T.-W. Hui, X. Tang, and C. C. Loy, "A Lightweight Optical Flow CNN - Revisiting Data Fidelity and Regularization." IEEE, 2020. [Online]. Available: http://mmlab.ie.cuhk.edu.hk/projects/LiteFlowNet/
[79] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. Burgard, "g2o: A General Framework for Graph Optimization," in International Conference on Robotics and Automation (ICRA). IEEE, 2011, pp. 3607–3613.
[80] H. Strasdat, A. J. Davison, J. M. Montiel, and K. Konolige, "Double Window Optimisation for Constant Time Visual SLAM," in International Conference on Computer Vision (ICCV). IEEE, 2011, pp. 2352–2359.
[81] V. Ila, J. M. Porta, and J. Andrade-Cetto, "Information-based Compact Pose SLAM," Transactions on Robotics (T-RO), vol. 26, no. 1, pp. 78–93, 2010.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2021年/DF_VO What should be learnt for visual odometry.pdf b/动态slam/2020年-2022年开源动态SLAM/2021年/DF_VO What should be learnt for visual odometry.pdf
new file mode 100644
index 0000000..5c1d534
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2021年/DF_VO What should be learnt for visual odometry.pdf
@@ -0,0 +1,1367 @@

UNDER REVIEW manuscript No.
(will be inserted by the editor)

DF-VO: What Should Be Learnt for Visual Odometry?

Huangying Zhan, Chamara Saroj Weerasekera, Jia-Wang Bian, Ravi Garg, Ian Reid

arXiv:2103.00933v1 [cs.CV] 1 Mar 2021
(the date of receipt and acceptance should be inserted later)

Fig. 1 Inputs and intermediate CNN outputs of the system. (a, b) Current and previous input images with examples of auto-selected 2D-2D matches; (c) single-view depth prediction; (d, e) forward and backward optical flow predictions; (f) flow consistency between optical flow and rigid flow; (g) forward-backward flow consistency. In (f) and (g), red/blue means high/low inconsistency.

Abstract  Multi-view geometry-based methods have dominated monocular Visual Odometry over the last few decades owing to their superior performance, yet they remain vulnerable to dynamic and low-texture scenes. More importantly, monocular methods suffer from a scale-drift issue, i.e., errors accumulate over time. Recent studies show that deep neural networks can learn scene depths and relative camera motion in a self-supervised manner without acquiring ground-truth labels. More surprisingly, they show that well-trained networks enable scale-consistent predictions over long videos, while the accuracy is still inferior to traditional methods because geometric information is ignored. Building on top of recent progress in computer vision, we design a simple yet robust VO system by integrating multi-view geometry and deep learning on Depth and optical Flow, namely DF-VO.
In this work, a) we + propose a method to carefully sample high-quality corre- The ability of an autonomous robot to localize itself and + spondences from deep flows and recover accurate camera know its surroundings is vital for different robotic tasks + poses with a geometric module; b) we address the scale- such as navigation and object manipulation. Vision-based + drift issue by aligning geometrically triangulated depths methods are often the preferred choice because of factors + to the scale-consistent deep depths, where the dynamic such as cost-saving, low power requirements, and useful + scenes are taken into account. Comprehensive ablation complementary information can be provided to other sen- + studies show the effectiveness of the proposed method, sors such as IMU, GPS, laser scanners. We address the + and extensive evaluation results show the state-of-the-art monocular Visual Odometry (VO) problem in this paper, + performance of our system, e.g., Ours (1.652% ) v.s. ORB- where the goal is to estimate 6DoF motions of a moving + SLAM (3.247% ) in terms of translation error in KITTI camera. + Odometry benchmark. Source code is publicly available + at: DF-VO. + + Keywords Visual Odometry, Self-supervised Learning, + Depth Estimation, Optical Flow Estimation + + All authors are with the University of Adelaide, and Australian + Centre for Robotic Vision + 2 Zhan et al. + + Geometry-based Visual Odometry has shown domi- era motions are estimated using well-studied multi-view +nating performance in the last few decades, while they are geometry in the proposed system. +only reliable and accurate under a restrictive setup, such +as when static scenes comprising well-textured Lamber- To summarize, the contributions of this paper include: +tian surfaces are captured with sufficient uniform illumi- +nation enabling to establish good correspondences (Bian – we propose a hybrid system, DF-VO, which leverages +et al. 2019a; Lowe 2004; Rublee et al. 2011). 
The tra- both deep learning and multi-view geometry for Vi- +ditional correspondence search pipeline usually detects sual Odometry. Especially, self-supervised learning is +sparse feature points firstly and then matches extracted used for training networks so expensive ground truth +features, resulting in a limited number of high-quality data is not required and it enables online finetuning. +correspondences because of the aforementioned assump- +tions. The accuracy and diversity of the correspondences – we propose to sample accurate sparse correspondences +are of the utmost importance in solving Visual Odome- from dense optical flow predictions for camera track- +try problems. In contrast, we propose to extract accurate ing, and a bi-directional consistency based sampling +correspondences diversely from the dense predictions of method is presented. +an optical flow network using the consistency constraint +between bi-directional flows. Then the selected correspon- – we propose to use scale-consistent monocular depth +dences are fed into geometry-based trackers (Epipolar predictions for maintaining a consistent scale over long +Geometry based tracker and Prospective-n-point based video for Visual Odometry, and propose an iterative +tracker) for accurate and robust VO estimation, as de- scale recovery method for better performance in dy- +scribed in Sec. 4. namic scenarios. + + Most monocular systems suffer from a depth-translation – the comprehensive evaluation shows that the proposed +scale ambiguity issue, which means the predictions (struc- DF-VO system achieves state-of-the-art performance +ture and motion) are up-to-scale. The scale ambiguity in standard benchmarks, and we conduct a detailed +leads to a scale drift issue that accumulated over time. ablation study for evaluating the effect of different +Resolving scale-drift usually relies on keeping a scale- factors in our system. 
+consistent map for map-to-frame tracking, performing an +expensive global bundle adjustment for scale optimiza- A preliminary version of DF-VO was presented in +tion or additional prior assumptions, like constant camera (Zhan et al. 2020). We extend the system in the following +height from the known ground plane. Recently deep learn- four aspects (1) clearer presentation and more details of +ing methods have made possible end-to-end learning of the proposed system (2) improving the system in dynamic +structure-and-motion from unlabelled videos. The trained environments with an iterative correspondence selection +single-view depth models give scale-consistent predictions scheme; (3) improving the adaptation ability in new en- +with the use of stereo-based training (Garg et al. 2016; vironments by introducing an online adaptation scheme; +Godard et al. 2019; Zhan et al. 2018) or scale-consistency (4) more comprehensive experiments and ablation stud- +constraint in monocular-based training (Bian et al. 2019b). ies. +In this work, we propose to use the scale-consistent single- +view depths as the reference to maintain a consistent scale 2 Related Work +over long videos. The scale-consistent depths are used in +two circumstances: (1) scale recovery when the transla- Geometry based VO: Camera tracking is a fundamen- +tion scale is missed in the Epipolar Geometry tracker; (2) tal and well-studied problem in computer vision, with dif- +establishing scale-consistent 3D-2D correspondences in ferent pose estimation methods based on multiple-view +the PnP tracker. Besides, we propose an iterative method geometry been established (Hartley & Zisserman 2003; +for robust scale recovery, which is especially effective in Scaramuzza & Fraundorfer 2011). Early work in VO dates +highly dynamic scenes by removing the extracted corre- back to the 1980s (Scaramuzza & Fraundorfer 2011; Ull- +spondences (i.e. outliers) on dynamic regions. 
man 1979), with a successful application of it in the Mars + exploration rover in 2004 (Matthies et al. 2007), albeit + Although recent deep pose networks can learn camera with a stereo camera. Two dominant methods for geometry- +motions directly from videos (Bian et al. 2019b; Godard based VO/SLAM are feature-based (Geiger et al. 2011; +et al. 2019; Wang et al. 2017; Zhan et al. 2018; Zhou Klein & Murray 2007; Mur-Artal & Tard´os 2016) and di- +et al. 2017), the accuracy is limited because of neglect- rect methods (Engel et al. 2017; Newcombe et al. 2011). +ing to incorporate geometric knowledge in inference time. The former involves explicit correspondence estimation, +In contrast, correspondences and scene scales are learnt and the latter takes the form of an energy minimization +in our proposed framework (Fig. 1). Thus accurate cam- problem based on the image colour/feature warp error, + parameterized by pose and map parameters. There are + also hybrid approaches that make use of the good prop- + erties of both (Engel et al. 2014; Forster et al. 2014, + DF-VO: What Should Be Learnt for Visual Odometry? 3 + +2016). One of the most successful and accurate full SLAM ometry to varying extent and degree of success. CNN- +systems using a sparse (ORB) feature-based approach SLAM (Tateno et al. 2017) fuse single view CNN depths +is ORB-SLAM2 (Mur-Artal & Tardo´s 2016), along with in a direct SLAM system, and CNN-SVO (Loo et al. +DSO (Engel et al. 2017), a direct keyframe-based sparse 2019) initialize the depth at a feature location with CNN +SLAM method. VISO2 (Geiger et al. 2011) on the other provided depth for reducing the uncertainty in the ini- +hand is a feature-based VO system that only tracks against tial map. Yang et al.(Yang et al. 2018) feed depth pre- +a local map created by the previous two frames. All these dictions into DSO (Engel et al. 2017) as virtual stereo +methods suffer from the previously mentioned issues (in- measurements. Li et al.(Li et al. 
2019) refine their pose +cluding scale-drift) common to monocular geometry-based predictions via pose-graph optimisation. In contrast to +systems. Various techniques have been developed for re- the above methods, we effectively utilize CNNs for both +solving the scale drift issue. For example, an expensive single-view depth prediction and correspondence estima- +global bundle adjustment is performed for global scale tion, on top of standard multi-view geometry to create a +optimization based on loop-closure detection, which does simple yet effective VO system. +not always exist (Mur-Artal et al. 2015b); or additional +prior assumptions are introduced like constant camera 3 Preliminaries +height from the known ground plane (Geiger et al. 2011; +Zhou et al. 2019). In this work, with the aid of depth We revisit geometry-based pose estimation methods, in- +estimations from a consistent-scale deep network, scale cluding Epipolar Geometry and Perspective-n-Point in +estimation is performed with respect to the depth pre- this section to understand the principle and the underly- +dictions such that a single consistent scale is maintained ing limitations of each method. +(Sec. 4.4). + 3.1 Epipolar Geometry + Deep learning for VO: For supervised learning, +Agrawal et al.(Agrawal et al. 2015) propose to learn Epipolar Geometry can be employed for camera motion +good visual features from an ego-motion estimation task, estimation from two images (Ii, Ij) Suppose we have ob- +in which the model is capable of relative camera pose es- tained a set of 2D-2D correspondences (pi, pj) from the +timation. Wang et al.(Wang et al. 2017) propose a recur- image pair. Epipolar constraint is employed for solving +rent network for learning VO from videos. Ummenhofer fundamental matrix, F , or essential matrix, E, which +et al.(Ummenhofer et al. 2017) and Zhou et al.(Zhou are related by the camera intrinsic K such that F = +et al. 2018) propose to learn monocular depth estimation K−T EK−1. 
Thus, the camera motion [R, t] can be re- +and VO together in an end-to-end fashion by formulating covered by decomposing F or E (Bian et al. 2019c; Hart- +structure from motion as a supervised learning problem. ley 1995; Nister 2003; Zhang 1998). +Dharmasiri et al.(Dharmasiri et al. 2018) train a depth +network and extend the depth system for predicting opti- pjT K−T EK−1pi = 0, where E = [t]×R (1) +cal flows and camera motion. Recent works suggest that +both tasks can be jointly learnt in a self-supervised man- However, the general viewpoint and general structure are +ner using a photometric warp loss to replace a super- assumed in such geometry guided tracking. Problems arise +vised loss based on ground truth. SfM-Learner (Zhou with Epipolar Geometry while frames in the sequence +et al. 2017) is the first self-supervised method for jointly and/or scene structure do not conform to these assump- +learning camera motion and depth estimation. SC-SfM- tions(Torr et al. 1999). +Learner (Bian et al. 2019b) is a very recent work which +solves the scale inconsistent issue in SfM-Learner by en- – Motion degeneracy: motion degeneracy happens when +forcing depth consistency. (Ranjan et al. 2019; Yin & Shi the camera does not translate between frames, i.e. re- +2018) improve SfM-Learner by incorporating optical flow covering R becomes unsolvable if the camera motion +in their joint training framework for dynamics reasoning. is a pure rotation. +Some prior works solve both scale ambiguity and incon- +sistency issue by using stereo sequences in training (Li – Structure degeneracy: viewed scene structure is pla- +et al. 2017; Zhan et al. 2018), which address the issue of nar. +metric scale. + Solving fundamental/essential matrix becomes unstable + The issue with the above learning-based methods is in practice when the camera baseline is small relative to +that they do not explicitly account for the multi-view the scene size. 
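As a concrete illustration of Eq. (1) and the linear solve behind an E-tracker, here is a minimal NumPy sketch (my own illustrative code, not the authors' implementation; it uses a plain eight-point solve on noiseless, normalized coordinates, i.e. K = I):

```python
import numpy as np

def eight_point_essential(p_i, p_j):
    """Linear estimate of E from N >= 8 normalized 2D-2D matches (N x 2),
    so that hom(p_j)^T E hom(p_i) = 0 (the epipolar constraint, Eq. (1))."""
    ones = np.ones(len(p_i))
    # each correspondence contributes one row of A vec(E) = 0
    A = np.column_stack([
        p_j[:, 0] * p_i[:, 0], p_j[:, 0] * p_i[:, 1], p_j[:, 0],
        p_j[:, 1] * p_i[:, 0], p_j[:, 1] * p_i[:, 1], p_j[:, 1],
        p_i[:, 0],             p_i[:, 1],             ones])
    _, _, Vt = np.linalg.svd(A)
    E = Vt[-1].reshape(3, 3)                  # null-space solution
    U, _, Vt = np.linalg.svd(E)               # project onto the essential
    return U @ np.diag([1.0, 1.0, 0.0]) @ Vt  # manifold (sing. values 1, 1, 0)

# synthetic two-view geometry: small yaw plus sideways translation
rng = np.random.default_rng(0)
X = np.column_stack([rng.uniform(-1, 1, 50),
                     rng.uniform(-1, 1, 50),
                     rng.uniform(4, 8, 50)])   # 3D points in front of camera i
th = 0.1
R = np.array([[np.cos(th), 0, np.sin(th)],
              [0, 1, 0],
              [-np.sin(th), 0, np.cos(th)]])
t = np.array([0.5, 0.0, 0.1])
X_j = X @ R.T + t
p_i = X[:, :2] / X[:, 2:]                      # perspective projection
p_j = X_j[:, :2] / X_j[:, 2:]

E = eight_point_essential(p_i, p_j)
hom = lambda p: np.column_stack([p, np.ones(len(p))])
residual = np.abs(np.einsum('ij,jk,ik->i', hom(p_j), E, hom(p_i)))
# residual is the epipolar error of Eq. (1) per match; near zero here
```

The recovered E is only defined up to scale and sign, and decomposing it into [R, t̂] yields a unit-norm translation direction, which is exactly why a separate scale recovery step (Sec. 4.4) is needed.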
Moreover, the translation recovered from the essential matrix is up-to-scale because of scale ambiguity.

Fig. 2  DF-VO pipeline. For an image pair, (forward and backward) optical flows and single-view depths are predicted. A forward-backward flow consistency is computed as a criterion to establish good correspondences (2D-2D; 3D-2D). We have two alternative trackers, out of which one is selected by the data-driven model selection module. The first tracker (E-tracker) uses 2D-2D correspondences to estimate and decompose an essential matrix to find rotation and translation direction, which is followed by a translational scale recovery step to estimate metric VO. The second tracker (PnP-tracker) uses single-view depth estimates in conjunction with 3D-2D registration via PnP.

3.2 Perspective-n-Point

Perspective-n-Point (PnP) solves the camera pose given known 3D-2D correspondences. In a two-view problem, suppose we have obtained a set of 3D-2D correspondences, including the 3D points in the i-th view and the corresponding projections in the j-th view, (X_i, p_j). PnP can be employed to estimate the camera pose by minimizing the reprojection error

e = Σ_x ‖K(R X_i[x] + t) − p_j[x]‖²,   (2)

where [x] is pixel-coordinate indexing. Solving a PnP problem requires an accurate estimate of the 3D structure of the scene, which can be obtained from depth-sensor measurements or mature stereo reconstruction methods, while it is a more challenging problem in the monocular case.

4 DF-VO: Depth and Flow for Visual Odometry

4.1 System Overview

A standard Visual Odometry pipeline includes feature extraction and matching to establish correspondences, followed by pose estimation from the correspondences. We follow this pipeline and present DF-VO, which is illustrated in Fig. 2 and Alg. 1. Two types of correspondences (2D-2D and 3D-2D) are considered in this system. To obtain the correspondences, (1) an optical flow network is trained to predict dense correspondences between images for 2D-2D correspondence establishment; (2) a single-view depth network is used to estimate 3D structure, so that 3D-2D correspondences can be established by combining the optical flow estimation. Accurate sparse correspondences are then selected with a carefully designed mechanism. The two trackers used for pose estimation are named E-tracker and PnP-tracker, which employ Epipolar Geometry with a scale recovery module and Perspective-n-Point, respectively. Note that the scale recovery module is associated with the E-tracker for solving the well-known scale-ambiguity and scale-drift issues. To decide a suitable tracker for each input pair, a robust model selection method using the Geometric Robust Information Criterion is used. In order to achieve minimal training and supervision, and high-quality predictions from the deep networks, we explore a variety of training schemes for the depth network and the flow network. Building on top of advanced deep networks and classic geometry methods, we present a simple yet effective and robust monocular Visual Odometry system.

4.2 Deep Predictions

In order to form 2D-2D/3D-2D correspondences from an image pair, specifically (p_i, p_j) or (X_i, p_j), we propose to use an optical flow network and a single-view depth network to establish the correspondences.

Algorithm 1  DF-VO: Depth and Flow for Visual Odometry
Require: Depth-CNN: M_d; Flow-CNN: M_f
Input: Image sequence: [I_1, I_2, ..., I_k]
Output: Camera poses: [T_1, T_2, ..., T_k]
1:  Initialization: T_1 = I; i = 2
2:  while i ≤ k do
3:    Get CNN predictions: D_i and the forward/backward flows (F_{i-1,i}, F_{i,i-1})
4:    Compute the forward-backward flow inconsistency from (F_{i-1,i}, F_{i,i-1})
5:    Correspondence selection: form matches (P_i, P_{i-1}) from the filtered flows based on flow inconsistency
6:    Model selection: estimate E and H from (P_i, P_{i-1}) and compute GRIC scores for the trackers
7:    if E-tracker then
8:      Recover [R, t̂] from the estimated essential matrix
9:      Triangulate (P_i, P_{i-1}) to get D'_i
10:     Scale recovery to estimate s
11:     T_i^{i-1} = [R, s t̂]
12:   else if PnP-tracker then
13:     Form 3D-2D correspondences from (D_i, P_i, P_{i-1})
14:     Estimate [R, t] using PnP
15:     T_i^{i-1} = [R, t]
16:   end if
17:   T_i ← T_{i-1} T_i^{i-1}
18: end while

Optical flow: The 2D-2D correspondences are extracted from dense optical flow predictions. Given an image pair (I_i, I_j), optical flow describes the pixel movements in I_i, which gives the correspondences of all the pixels of I_i in I_j. Though state-of-the-art deep optical flow networks have shown high average accuracy, not all pixels share the same high accuracy. Therefore, we propose a correspondence selection scheme in Sec. 4.3 to pick good predictions robustly.

Single-view depth: In order to establish 3D-2D correspondences between two views, (X_i, p_j), we need to obtain the 3D structure of the i-th view and the correspondences between the 3D landmarks and the 2D landmarks. Traditional approaches establish the correspondences via feature matching between 3D landmarks and 2D feature points. In this work, we use a deep depth network as our "depth sensor" to estimate the 3D structure of the i-th view, X_i. Through the 2D-2D correspondences established by optical flow, we can directly get a set of 3D-2D correspondences and solve the relative camera pose by solving PnP.

Unfortunately, current state-of-the-art single-view depth estimation methods are still insufficient for recovering very accurate 3D structure (about 10% relative error) for accurate camera pose estimation, as shown in Tab. 3. On the other hand, optical flow estimation is a more generic task. The state-of-the-art deep learning methods are accurate and have good generalization ability. Therefore, we mainly use the 2D-2D matches for solving the pose from the essential matrix, while the depth predictions are used for scale recovery and the PnP-tracker. As a result, the PnP-tracker is used as an auxiliary tracker when the E-tracker tends to fail.

4.3 Correspondence Selection

Most deep learning-based optical flow models predict dense optical flows, i.e., every pixel is associated with a predicted flow vector. There can be a lot of matches formed by the optical flows, some of which are very accurate. It is time-consuming if all matches are taken into consideration when solving a VO problem, since only sparse matches are required to solve the problem in theory. The vanilla way is to sample the optical flows randomly/uniformly from the dense predictions.

However, we have observed that not all the flow predictions share the same high accuracy. Some regions in the images have worse optical flow predictions, for instance out-of-view regions, where no correspondences can be found in the other view, and dynamic-object regions, which are usually associated with occlusion. In order to filter out the outliers and pick good optical flows, we propose a correspondence selection scheme based on bi-directional flow consistency; see the example in Fig. 3.

Flow consistency: Given an image pair (I_i, I_j), both forward and backward optical flows, F_ij and F_ji, are predicted by the flow network. Thus we compute the forward-backward flow inconsistency as a measure to choose good 2D-2D correspondences. The inconsistency is computed by

C = ‖F_ij + w(F_ji, p_f(F_ij))‖.   (3)

The warping process at a pixel x is described as

w(F_ji, p_f(F_ij))[x] = F_ji[x + F_ij[x]].   (4)

As x + F_ij[x] does not necessarily fall on the regular grid, the resulting flow is interpolated from the flow vectors at the 4 corners (Jaderberg et al. 2015). We use the flow consistency to select correspondences with higher accuracy; the hypothesis we make is that optical flows with better consistency tend to have higher accuracy, which is verified with an experiment in Sec. 6.

Fig. 3  (Top) Filtered 2D correspondences established by the optical flow prediction; (Bottom left) optical flow prediction; (Bottom right) bidirectional flow consistency (high consistency is shown in blue) shows that sufficient correspondences can be established in the overexposure case.

Best-N selection: After computing the forward-backward flow inconsistency, we choose the optical flows with the least inconsistency F̃ to form the best-N 2D-2D matches (P_i, P_j) (Zhan et al. 2020), where N equals 2000 in most experiments. This correspondence selection scheme is able to reject a lot of inaccurate flows. As shown in (Zhan et al. 2020), DF-VO with this correspondence selection scheme has already outperformed existing VO/SLAM baselines. However, there are still some potential issues with the scheme.

– Model under-fitting: if the chosen best-N matches do not have enough location diversity, the estimated pose model can be an under-fitting model.

– Structure degeneracy: if all the chosen matches lie on a planar region, structure degeneracy happens and leads to failure in estimating the essential matrix (Torr et al. 1999).

Local best-K selection: On top of the best-N selection, we want to increase the location diversity of the matches. We divide the image into M (M = 10 × 10) regions and choose the best-K matches from each region. However, there might be cases with severely inaccurate flow predictions (e.g., margin regions, which are usually out-of-view), where the flow predictions should not be used. Therefore, we first filter the flows such that only flows with inconsistency less than a threshold can be picked. As a result, the final correspondences (P_i, P_j) formed from F̃ are a union of the best-K matches in each region. The value K in the j-th region is defined as K_j = min(N/M, Q_j), where Q_j is the number of valid flows after thresholding. Since the correspondence quality is vital, we further check the number of valid correspondences and the number of regions with valid correspondences to determine whether sufficiently good correspondences are used. If insufficient correspondences are found, which rarely happens (mostly when the image quality is very poor, such as extreme under-/over-exposure), we use a constant motion model instead of the E/PnP-tracker.

The advantages of performing local best-K selection are two-fold: (1) increasing location diversity, as described; (2) speeding up the correspondence selection process, since part of the flows are rejected in the first place and sorting flow inconsistency is performed in a local image region instead of the whole image.

Compared to traditional feature-based methods, which only use salient feature points for matching and tracking, any pixel in the dense optical flow can be a candidate for tracking. Moreover, traditional features usually gather visual information from local regions, while a CNN gathers more visual information (larger receptive field) and higher-level contextual information, which gives more accurate and robust correspondences.

After selecting good 2D-2D correspondences, the essential matrix can be solved using Epipolar Geometry as described in Sec. 3.1. Then the camera motion, consisting of rotation R and translation t̂, can be decomposed from the essential matrix. However, the recovered motion is up-to-scale; specifically, the translation is a unit vector representing the translation direction only. In order to recover and maintain a consistent scale over the monocular footage, a consistent scale recovery process is required.

4.4 Scale Recovery

In the traditional monocular VO pipeline, the per-frame scale is recovered by aligning triangulated 3D landmarks with existing 3D landmarks, which accumulates error.

Simple alignment: In this work, we use the predicted depths D_i to provide 3D structure as a reference for scale recovery. After recovering [R, t̂] from solving the essential matrix, triangulation is performed for (P_i, P_j) to recover up-to-scale depths D'_i.
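The scale factor for this alignment can be estimated robustly. Below is an illustrative NumPy sketch (my own simplification of the idea behind Sec. 4.4 and Alg. 2: a median depth-ratio with inlier re-selection, without the pose re-estimation step; the function name and thresholds are assumptions):

```python
import numpy as np

def align_scale(d_tri, d_cnn, delta=0.1, iters=10):
    """Estimate s so that s * d_tri ~ d_cnn, down-weighting outliers
    (e.g. points on dynamic objects) by re-selecting inliers each round."""
    s = np.median(d_cnn / d_tri)
    for _ in range(iters):
        inlier = np.abs(s * d_tri - d_cnn) / d_cnn < delta
        if inlier.sum() < 10:
            break
        s_new = np.median(d_cnn[inlier] / d_tri[inlier])
        if abs(s_new - s) < 1e-9:
            break
        s = s_new
    return s

# synthetic check: true scale 2.5, with 20% of points corrupted by outliers
rng = np.random.default_rng(2)
d_cnn = rng.uniform(5, 20, 500)            # "CNN" depths, metric scale
d_tri = d_cnn / 2.5                        # triangulated, up-to-scale
d_tri[:100] *= rng.uniform(1.5, 3.0, 100)  # outliers break the depth ratio
s = align_scale(d_tri, d_cnn)
```

On this synthetic data the outliers mimic triangulated points on dynamic objects, which violate the static-scene depth ratio and are filtered out by the inlier re-selection.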
A scaling factor, s, can be estimated by aligning the triangulated depth map D'_i with the CNN depth map D_i. An important advantage of using the depth CNN is that we can get rid of the scale-drift issue, for the following reasons.

– The depth CNN predicts per-frame 3D structures, which are scale-consistent. We show that we can train scale-consistent depth networks (Sec. 4.6).

– Scale drift is introduced by accumulated error in creating new 3D landmarks. We do not create new 3D landmarks but recover scale w.r.t. a single network.

Iterative alignment: Aligning the 3D landmarks triangulated from the selected optical flow matches with the CNN depths is simple and sufficient to recover an accurate scale in general cases. However, in a highly dynamic environment the selected optical flows can lie on dynamic regions, which is problematic for depth alignment. Moreover, similar to the optical flow predictions, not all predicted depths are highly accurate; pixels with high forward-backward flow consistency are not guaranteed to have high depth accuracy. Therefore, we propose an iterative scheme, Alg. 2.

Algorithm 2  Iterative Scale Recovery
Input: [R, t̂], F̃, D_i, s_{t-1}
1:  Initialization: s = s_{t-1}
2:  while s has not converged do
3:    Pose hypothesis: T = [R, s t̂]
4:    Compute the rigid flow F_rigid from T and D_i
5:    Compute flow inconsistency: F_diff ← ‖F̃ − F_rigid‖_2
6:    Select depth-flow pairs (D_i, P_1, P_2)_sel with F_diff < δ_rigid
7:    Estimate a new pose [R, t̂] from (P_1, P_2)_sel
8:    Triangulate (P_1, P_2)_sel to get D'_i
9:    Estimate the scaling factor s_new by comparing (D'_{i,sel}, D_{i,sel})
10:   s ← s_new
11: end while

The key is to select depths and filtered optical flows (Sec. 4.3) that are consistent with each other. Given that the filtered optical flows generally establish good correspondences, a pixel whose depth is consistent with the optical flow means that (1) the pixel belongs to a static region of the environment, and (2) the depth is likely to be accurate. However, depth and optical flow are related by a camera pose only for a static scene. Since the camera pose [R, t̂] is up-to-scale and does not share the same scale as the depth prediction, we therefore propose an iterative approach to select depth-flow pairs (Alg. 2). We first initialize the relative pose T with a pose T_0. The rigid flow is then computed from the current relative pose by

F_rigid = K T K^{-1} x D_i[x] − x,   (5)

where x ranges over the pixel coordinates of the selected optical flows. The consistency between the filtered optical flow F̃ and the rigid flow is then measured by ‖F̃ − F_rigid‖_2. Only depth-flow pairs with small optical-rigid flow inconsistency are selected as new matches. Thus, we update T with the new scaled pose and iterate the process until reaching the stopping condition (convergence, or n iterations reached). The scale initialization for the first image pair is set to zero, while the scale at time (t−1) is used as the scale initialization at time t.

4.5 Model Selection

We have presented a camera tracking method integrating Epipolar Geometry with deep predictions. However, as mentioned in Sec. 3.1, there are some known issues with Epipolar Geometry, i.e., motion degeneracy and unstable solutions when the motion is small. Since we have both 3D-2D and 2D-2D correspondences available, we can instead solve a PnP problem using the correspondences obtained in Sec. 4.3 when Epipolar Geometry tends to fail. In this section, we show that a suitable tracker/model can be selected in two possible ways.

Flow magnitude: We measure the magnitude of the flow predictions and solve the essential matrix only when the average flow magnitude is large enough. This avoids small camera motions, which usually come with small optical flows (Zhan et al. 2020). However, this naïve approach has some issues: (1) it does not resolve motion degeneracy (pure rotation), which also causes large optical flows; (2) it does not take outliers into account, e.g., dynamic objects, which cause optical flows even when the camera is stationary. Therefore, we adopt a more robust measure for model selection.

Geometric Robust Information Criterion: Torr et al. (1999) discuss the degeneracy cases (motion and structure) and their influence on geometry-guided camera motion estimation. Two robust strategies for tackling such degeneracies are proposed: (1) a statistical model selection test, named the Geometric Robust Information Criterion (GRIC), is used to identify cases where degeneracies occur; (2) multiple motion models are used to overcome the degeneracies. In this work, we follow the first approach to identify when the E-tracker tends to fail and switch to the PnP-tracker. (Torr et al. 1999) estimates both the fundamental matrix F and the homography matrix H and chooses the model with the lower GRIC score. The model that explains the data best, i.e., the one with the lower GRIC score, is indicated as most likely. GRIC computes a score for each model from the number of matches, n, and the residuals of the matches, e_i, together with several model constants.
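The GRIC comparison can be sketched as a small function (illustrative code following the score form of Eqs. (6)-(7); the toy residuals are made up):

```python
import numpy as np

def gric(residuals, sigma, d, k, r=4):
    """GRIC score, Eqs. (6)-(7); the lower score indicates the model that
    better explains the matches. d: structure dimension (3 for E/F, 2 for H);
    k: number of motion-model parameters; r: data dimension (4 for two views)."""
    n = len(residuals)
    lam1, lam2, lam3 = np.log(4), np.log(4 * n), 2.0
    rho = np.minimum(residuals**2 / sigma**2, lam3 * (r - d))  # robust residual
    return rho.sum() + lam1 * d * n + lam2 * k

# toy example: the epipolar model fits the matches well, a homography does not
res_E = np.full(200, 0.1)   # small epipolar residuals
res_H = np.full(200, 3.0)   # large homography residuals
gric_E = gric(res_E, sigma=1.0, d=3, k=5)
gric_H = gric(res_H, sigma=1.0, d=2, k=8)
keep_E_tracker = gric_E < gric_H   # True here: the E-tracker would be kept
```

The robust clamp in ρ(·) caps the contribution of gross outliers, so the comparison is driven by how well each model explains the bulk of the matches rather than by a few bad correspondences.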
The scale initialization for the first image – standard deviation of the measurement error, σ +pair is set as zero while the scale at time-(t-1) is used as – data dimension, r (4 for two views) +the scale initialization at time-(t). – number of motion model parameters, k (5 for E, 7 for + + F , 8 for H) + – dimension of the structure, d (3 for F , 2 for H) + + GRIC = ρ(ei2) + λ1dn + λ2k (6) + 8 Zhan et al. + +where ρ(e2i ) is a robust function of the residuals: 4.6.1 Training overview + +ρ(e2) = min e2 . (7) In this work, we jointly train the depth network and the + σ2 , λ3(r − d) pose network by minimizing the mean of the following + +The value of the parameters are λ1 = log4, λ2 = log4n, per-pixel objective function over the whole image. The +λ3 = 2. Different from (Torr et al. 1999), since we have per-pixel loss is + +both 3D-2D and 2D-2D correspondences, we can choose = min Lpe(Ii, Iji ) + λdsLds(Di, Ii)+ +PnP-Tracker instead of Homography-Tracker when E-Tracker L + j + +tends to fail. min λdcLdc(Di , Dji ), (8) + + j + +Cheirality condition In addition to the two methods in- where Lpe is photometric loss; Lds is depth smoothness +troduced above, we check for cheirality condition as well. loss; Ldc is depth consistency loss; and [λds, λdc] are loss +There are 4 possible solutions for [R, ˆt] by decomposing weightings. +E. To find the correct unique solution, cheirality condi- +tion, i.e. the triangulated 3D points must be in front of 4.6.2 Photometric loss +both cameras, is checked to remove the other solutions. +We further use the number of points satisfying cheirality Lpe is the photometric error by computing the differ- +condition as a reference to determine if the solution is ence between the reference image Ii and the synthesized +stable. view Iji warped from the source image Ij, where j ∈ + [i − n, i + n, s]. 
[i − n, i + n] are neighbouring views of + Therefore, we choose PnP-Tracker when GRICE is Ii while s is stereo pair if stereo sequences are used in +higher than GRICH or cheirality check condition is not training. As proposed in (Godard et al. 2019), instead of +fulfilled. Otherwise, E-Tracker is employed for solving averaging the photometric errors between the reference +frame-to-frame camera motion. To robustify the system, pixel and the synthesized pixels from multiple views, (Go- +we wrap the trackers in RANSAC loops. dard et al. 2019) only counts the photometric error be- + tween the reference pixel and the synthesized pixel with +4.6 Jointly learning of depths and pose the minimum error. The rationale is to overcome the is- + sues related to out-of-view pixels and occlusions. +Various depth training frameworks can be employed de- +pending on the availability of data (monocular/stereo se- Lpe(Ii, Iji) = α 1 − SSIM(Ii, Iji) + (1 − α)|Ii − Iji| (9) +quences, depth sensor measurements). The most trivial 2 +way is using a supervised training framework (Eigen et al. +2014; Fu et al. 2018; Kendall & Gal 2017; Laina et al. Iji = w Ij , pre(K, Di, Tij ) , (10) +2016; Liu et al. 2015, 2016; Nekrasov et al. 2019), but +ground truth depths are not always available for any sce- where SSIM (Wang et al. 2004) is a robust measurement +nario. Some recent works suggest that jointly learning +single-view depths and camera motion in a self-supervised for image similarity and α = 0.85 balances the SSIM +manner is feasible using monocular sequences (Bian et al. +2019b; Godard et al. 2019; Yin & Shi 2018; Zhou et al. error and the simple color intensity error. w(I, p) is a +2017), or stereo sequences (Garg et al. 2016; Godard et al. +2017, 2019; Zhan et al. 2018). Instead of using ground differentiable warping function (Jaderberg et al. 
2015) which warps image I according to the pixel locations p. p_re(K, Di, Tij) establishes the pixel coordinates reprojected from view-i to view-j, where K is the camera intrinsics, Di is the predicted depth map of view-i, and Tij is the relative pose between the pair. The reprojection of a pixel x from view-i to view-j is represented by

p_re(K, Di, Tij) = K Tij Di[x] K⁻¹ x    (11)

4.6.3 Depth smoothness regularization

Following the approach in (Godard et al. 2017), we encourage depth to be locally smooth by introducing an edge-aware depth smoothness term. A depth discontinuity is penalized if colour continuity is present in the same local region. The smoothness regularization is formulated as

Lds(Di, Ii) = |∂x Di| e^(−|∂x Ii|) + |∂y Di| e^(−|∂y Ii|),    (12)

where ∂x(·) and ∂y(·) are gradients in the horizontal and vertical directions respectively. Note that we regularize inverse depth instead of depth.
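Eqn (12), applied to inverse depth as noted above, can be sketched as follows; the forward-difference discretization is an assumption for illustration:

```python
import numpy as np

def depth_smoothness_loss(depth, image):
    """Edge-aware smoothness (Eqn 12), applied to inverse depth.

    depth : (H, W) predicted depth
    image : (H, W) grayscale reference image
    """
    inv = 1.0 / depth
    # forward-difference gradients in x and y
    dx_d, dy_d = np.abs(np.diff(inv, axis=1)), np.abs(np.diff(inv, axis=0))
    dx_i, dy_i = np.abs(np.diff(image, axis=1)), np.abs(np.diff(image, axis=0))
    # down-weight depth gradients where the image itself has strong edges
    return (dx_d * np.exp(-dx_i)).mean() + (dy_d * np.exp(-dy_i)).mean()
```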
4.6.4 Training without scaling issues

Similar to traditional monocular 3D reconstruction, scale ambiguity and scale inconsistency issues exist when monocular videos are used for training. Since monocular training usually uses short image snippets (usually 2 or 3 frames), the training does not guarantee a consistent learnt scale across snippets, which creates the scale inconsistency issue (Bian et al. 2019b).

One solution to both scale problems is using stereo sequences during training (Godard et al. 2019; Li et al. 2017; Zhan et al. 2018): the deep predictions are aligned with real-world scale and are scale-consistent because of the constraint introduced by the known stereo baseline. Even though stereo sequences are used during training, only monocular images are required at inference time for depth prediction.

Another solution to the scale inconsistency issue is the temporal geometry consistency regularization proposed in (Bian et al. 2019b; Zhan et al. 2019), which constrains the depth consistency across multiple views. As the depth predictions become consistent across different views, and thus across different snippets, the scale inconsistency issue is resolved. Using the rigid-scene assumption, as the cameras move in space over time we want the predicted depths at view-i to be consistent with the respective predictions at view-j. This is done by correctly transforming the scene geometry from frame-j to frame-i, much like the image warping. Specifically, we adopt the inverse depth consistency proposed in (Zhan et al. 2019),

Ldc(Di, Di^j) = |1/Di − 1/Di^j|    (13)

Inspired by (Godard et al. 2019), we use the minimum error over multi-view pairs instead of averaging the depth consistency error over all source views, in order to avoid occlusions and out-of-view scenes.

4.7 Learning of optical flows

Many deep learning-based methods have been proposed for estimating optical flow (Dosovitskiy et al. 2015; Hui et al. 2018; Ilg et al. 2017; Meister et al. 2018; Sun et al. 2018). In this work, we choose LiteFlowNet (Hui et al. 2018) as our backbone network for optical flow prediction since LiteFlowNet is fast, lightweight, and accurate. LiteFlowNet consists of a two-stream network for feature extraction and a cascaded network for flow inference and regularization. We refer readers to (Hui et al. 2018) for more details. LiteFlowNet shows good generalization ability: trained on a synthetic dataset (Scene Flow (Dosovitskiy et al. 2015)), it generalizes well to real-world scenarios, though artifacts are sometimes present in some regions.

In this work, we mainly use the model trained on Scene Flow. However, we also show that self-supervised finetuning can be performed to help the model better adapt to unseen environments and remove the artifacts. Two finetuning schemes are tested and compared, namely offline finetuning and online finetuning (Sec. 6.2). Similar to the self-supervised training of the depth network, the optical flow network is trained by minimizing the mean of the following per-pixel loss function over the whole image.

L = min_j Lpe(Ii, Ii^j) + λfs Lfs(||Fij||₂, Ii) + λfc Lfc(||Fij + w(Fji, pf(Fij))||)    (14)

Ii^j = w(Ij, pf(Fij)),    (15)

Different from Eqn. 10, pf(·) establishes the correspondences between view-i and view-j via the flow field instead of the reprojection defined in Eqn. 10. For a pixel x on view-i, the corresponding pixel position pf(Fij)[x] on view-j is x + Fij[x].

We also regularize the optical flow to be smooth using an edge-aware flow smoothness loss Lfs(·), similar to the depth smoothness loss defined in Eqn. 12. Similar to Meister et al. (2018), we estimate both forward and backward optical flow and constrain the bidirectional predictions to be consistent via the loss Lfc.
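The flow-based correspondence pf and the bidirectional consistency measured by Lfc can be sketched as follows; nearest-neighbour sampling stands in for the bilinear, differentiable warp used in training:

```python
import numpy as np

def flow_warp(field, flow):
    """Sample `field` at x + flow[x] (nearest-neighbour p_f warp for brevity)."""
    h, w = flow.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    xt = np.clip(np.round(xs + flow[..., 0]).astype(int), 0, w - 1)
    yt = np.clip(np.round(ys + flow[..., 1]).astype(int), 0, h - 1)
    return field[yt, xt]

def fb_consistency(flow_ij, flow_ji):
    """Per-pixel forward-backward inconsistency ||F_ij + F_ji(x + F_ij[x])||.

    For consistent bidirectional flow, the backward flow sampled at the
    forward-flow target cancels the forward flow, giving values near zero.
    """
    return np.linalg.norm(flow_ij + flow_warp(flow_ji, flow_ij), axis=-1)
```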
5 Implementation and Benchmarking

5.1 Dataset

We train and test our method on popular benchmarking datasets, KITTI (Geiger et al. 2013, 2012) and Oxford Robotcar (Maddern et al. 2017), which are large-scale outdoor driving datasets. KITTI provides various splits for several tasks, e.g. depth estimation, odometry, and object tracking. In this work, we select the following three splits to evaluate our method.

KITTI Odometry. The Odometry split contains 11 driving sequences with publicly available ground-truth camera poses. Most of the sequences are long, and some contain loop closures. Following (Zhou et al. 2017), we train our networks on sequences 00-08. The dataset contains 36,671 training pairs, [Ii, Ii−1, Ii+1, Ii,s].

KITTI Tracking. The Tracking split contains 21 sequences with available ground truths. The split is primarily used for object tracking benchmarking, so these sequences contain more dynamic objects than the Odometry split but are generally shorter. Following (Zhang et al. 2020), we choose 9 of the 21 sequences with a considerable number of dynamic objects to test the robustness of our system in dynamic environments. These sequences are challenging for most monocular VO/SLAM systems, since most systems assume static scenarios.

KITTI Flow. The KITTI Flow 2012/2015 splits contain 194/200 image pairs with high-quality optical flow labels. We use this split to evaluate the performance of the optical flow models in this work.

Oxford Robotcar. To further test the generalization ability of the system, we test the proposed system on the Oxford Robotcar dataset. Following (Loo et al. 2019), 8 sequences are selected for evaluation, and the first 200 frames¹ are skipped in the evaluation due to the extremely overexposed images at the beginning of the sequences.

¹ Our system can operate even without skipping the frames. The 200 frames are skipped in the evaluation for a fair comparison.

Fig. 4 Qualitative VO results on KITTI: (Top) Seq.09 and (Bottom) Seq.10 against deep learning-based and geometry-based methods (shown separately). Compared methods: GT, SfM-Learner, Depth-VO-Feat, SC-SfM-Learner, VISO2, ORB-SLAM2 (w/ and w/o LC), Ours (M-SC-Train.), and Ours (S-Train.).

5.2 Deep network training

We train our networks with the PyTorch (Paszke et al. 2017) framework. All self-supervised experiments are trained with the Adam optimizer (Kingma & Ba 2014) for 20 epochs. For KITTI, images with a size of 640 × 192 are used for training. The learning rate is set to 10⁻⁴ for the first 15 epochs and then dropped to 10⁻⁵ for the remaining epochs. The loss weightings are [λds, λdc] = [10⁻³, 5] for jointly learning depths and camera motion, and [λfs, λfc] = [10⁻¹, 5 × 10⁻³] for the optical flow experiments.

5.3 Visual Odometry Benchmarking

Evaluation criterion. Some common evaluation criteria are adopted for a detailed analysis. The KITTI Odometry criterion reports the average translational error terr (%) and rotational error rerr (°/100m) by evaluating all possible sub-sequences of length (100, 200, ..., 800) meters. Absolute trajectory error (ATE) measures the root-mean-square error between the predicted camera positions [x, y, z] and the ground truth. Relative pose error (RPE) measures the frame-to-frame relative pose error. Since most of the methods are monocular and lack a scaling factor to match the real-world scale, during evaluation we scale and align (7DoF optimization) the predictions to the associated ground-truth poses by minimizing the ATE (Umeyama 1991). For methods using stereo depth models (Ours (Stereo Train.), Depth-VO-Feat) or a known scale prior (VISO2), whose predictions are already aligned with the real-world scale, we instead perform a 6DoF optimization w.r.t. the ATE for a fair comparison.

KITTI Odometry. We provide a detailed comparison between our VO system and prior arts on the KITTI Odometry split, including pure deep learning methods (Zhou et al. 2017)², (Zhan et al. 2018), (Bian et al. 2019b), and geometry-based methods including DSO (Engel et al. 2017)³, VISO2 (Geiger et al. 2011), and ORB-SLAM2 (Mur-Artal et al. 2015a), w/ and w/o loop closure. ORB-SLAM2 occasionally suffers from tracking failure or unsuccessful initialization, so we run ORB-SLAM2 three times and report the run with the least trajectory error. The quantitative and qualitative results are shown in Tab. 1, Fig. 4, and Fig. 5. Seq.01 is excluded when computing the average error, since a sub-sequence of Seq.01 contains no trackable close features and most methods fail on it.

² SfM-Learner (Zhou et al. 2017): the updated model on Github is evaluated.
³ Result taken from (Loo et al. 2019).

Fig. 5 DF-VO and ORB-SLAM2 (monocular, w/ and w/o loop-closure) trajectories in sequences 00, 02, 03, 04, 05, 06, 07 and 08 from the KITTI odometry benchmark. Note that Seq. 08 does not contain loops, and ORB-SLAM2 (w/ LC) undergoes severe scale drifting while DF-VO does not.

Table 1 Quantitative results on KITTI Odometry Seq. 00-10 (best result in bold, second best underlined). Methods compared, grouped as Deep VO, Full SLAM / VO with optimization, and VO: SfM-Learner (Zhou et al. 2017), Depth-VO-Feat (Zhan et al. 2018), SC-SfMLearner (Bian et al. 2019b), DSO (Engel et al. 2017), ORB-SLAM2 w/o and w/ LC (Mur-Artal & Tardós 2016), VISO2 (Geiger et al. 2011), Ours (Mono-SC Train.), and Ours (Stereo Train.). Metrics per sequence: terr, rerr, ATE, RPE (m), and RPE (°), plus the average error over sequences.
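The 7DoF alignment used in the evaluation protocol (Umeyama 1991) is a closed-form similarity fit between the predicted and ground-truth trajectories; a NumPy sketch:

```python
import numpy as np

def umeyama_alignment(X, Y, with_scale=True):
    """Least-squares similarity transform with Y ≈ s * R @ X + t (Umeyama 1991).

    X, Y : (3, N) predicted / ground-truth positions.
    with_scale=True gives the 7DoF alignment; False gives the 6DoF variant.
    """
    mu_x = X.mean(axis=1, keepdims=True)
    mu_y = Y.mean(axis=1, keepdims=True)
    Xc, Yc = X - mu_x, Y - mu_y
    n = X.shape[1]
    cov = Yc @ Xc.T / n
    U, D, Vt = np.linalg.svd(cov)
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1  # reflection correction
    R = U @ S @ Vt
    var_x = (Xc ** 2).sum() / n
    s = np.trace(np.diag(D) @ S) / var_x if with_scale else 1.0
    t = mu_y - s * R @ mu_x
    return s, R, t
```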
Ours (Mono-SC Train.) uses a depth model trained with monocular videos and the inverse depth consistency term to ensure scale consistency. Ours (Stereo Train.) uses a depth model trained with stereo videos. Note that even though stereo sequences are used during training, monocular sequences are used in testing; Ours (Stereo Train.) is therefore still a monocular VO system. We show that our methods outperform pure deep learning methods, which rely on a PoseCNN for camera motion estimation, by a large margin in all metrics. Under the KITTI Odometry criterion, ORB-SLAM2 shows less rotation drift rerr but higher translation drift terr due to its scale drift issue, which is also visible in Fig. 4. The drift can sometimes be resolved by loop closing with expensive global bundle adjustment, but it persists when no loop closure is detected. Different from other methods, we use a single depth network as our "reference map": the translation scales are recovered w.r.t. the scale-consistent depth predictions. As a result, we mitigate the scale drift issue present in most monocular VO/SLAM systems and show less translation drift over long sequences. More importantly, our method shows consistently smaller relative errors, both translational and rotational, which allows our system to serve as a robust module for frame-to-frame tracking.

Table 2 Visual odometry evaluation on the Oxford Robotcar dataset; the Absolute Trajectory Error (metres) is the evaluation criterion. Methods compared: SVO (Forster et al. 2016), CNN-SVO (Loo et al. 2019), DSO (Engel et al. 2017), ORB-SLAM (w/o LC) (Mur-Artal et al. 2015b), and Ours, over sequences 2014-05-06-12-54-54, 2014-05-06-13-09-52, 2014-05-06-13-14-58, 2014-05-06-13-17-51, 2014-05-14-13-46-12, 2014-05-14-13-53-47, 2014-05-14-13-59-05, and 2014-06-25-16-22-15 (X denotes failure).

Fig. 6 Qualitative VO results on Oxford Robotcar: (Left) 2014-05-06-12-54-54 and (Right) 2014-06-25-16-22-15. Note that there is in fact a loop closure in the left sequence, but the "Ground truth" is not accurate enough, as mentioned in the Robotcar official documentation.
KITTI Tracking. To show the robustness of our system in dynamic environments, we compare our system against ORB-SLAM2 on the KITTI Tracking dataset. The results are shown in Tab. 5. Since the Tracking split contains relatively short sequences compared to the Odometry split, the KITTI Odometry criterion is not a suitable measurement here; we therefore report the frame-to-frame RPE (translation) for the Tracking split as a reference. Note that sequence 2011/10/03-47 is the most difficult of the 9 sequences due to its highly dynamic highway environment. ORB-SLAM2 is well known for its superior ability in removing outliers, but its performance still degrades significantly on this sequence, while our method performs robustly.

Table 3 Ablation study on the KITTI Odometry dataset regarding different components. Variants evaluated against the Reference Model: Tracker (PnP), Flow (Self-Flow offline, Self-Flow online), Depth (Mono-SC, Mono.), Correspondences (Uniform, Best-N), Scale (Iterative), Model Sel. (Flow), and Img. Res. (Full), reporting terr and rerr on Seq. 09 and 10.
Oxford Robotcar. We also test the generalization ability of the system on Oxford Robotcar (Maddern et al. 2017). The results⁴ are reported in Tab. 2 and illustrated in Fig. 6. Note that there are some overexposed frames in the middle of the sequences (e.g. Fig. 3), which are so challenging for visual odometry/SLAM algorithms that many of those listed in Tab. 2 fail to run the sequences. However, the deep optical flow network still predicts sufficiently good correspondences for pose estimation (Fig. 3). The optical flow network rarely fails to give sufficiently good correspondences; when it does, the number of valid correspondences reflects the failure, and a constant motion model is employed in such cases. The results show that our system outperforms the others. More importantly, they demonstrate that sampling correspondences from deep optical flow is more robust than matching hand-crafted features.

⁴ The results of the other methods are taken from (Loo et al. 2019).

6 Ablation study

In this section, we present an extensive ablation study (Tab. 3) to understand the effect of the components proposed in this work. We use a Reference Model with the following settings and study the components in the following categories.

– Tracker: hybrid (E-tracker and PnP-tracker)
– Depth model: trained with stereo sequences
– Flow model: LiteFlowNet trained on a synthetic dataset
– Correspondence selection: local best-K selection
– Scale recovery: simple alignment
– Model selection: GRIC
– Image resolution: down-sampled size (640 × 192)
6.1 Tracker

DF-VO consists of two trackers, the E-tracker and the PnP-tracker. The E-tracker is the main tracker when general motion (sufficient translation) and general structure (non-planar) can be assumed. The PnP-tracker is used when the E-tracker fails to estimate the motion, as introduced in Sec. 4.5. Using the E-tracker alone potentially fails when motion degeneracy or structure degeneracy happens, as described in Sec. 3.1. Therefore, we only compare the Reference Model to the case in which the PnP-tracker alone is used. PnP relies on the accuracy of both the depth and optical flow predictions for establishing accurate 3D-2D correspondences. There is no straightforward way to sample depth predictions good enough for accurate 3D-2D correspondences in 6DoF pose estimation, whereas the depth predictions are sufficient for the 1DoF scale recovery problem in the E-tracker.
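For reference, the cheirality check that the E-tracker uses to disambiguate the four [R, t̂] decompositions of E (Sec. 4.5) can be sketched with a linear triangulation; this is an illustrative NumPy sketch under normalized image coordinates, not the exact implementation:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of one correspondence.

    x1, x2 : normalized image coordinates (3,) with z = 1
    P1, P2 : (3, 4) projection matrices
    """
    A = np.vstack([x1[0] * P1[2] - P1[0],
                   x1[1] * P1[2] - P1[1],
                   x2[0] * P2[2] - P2[0],
                   x2[1] * P2[2] - P2[1]])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

def cheirality_count(R, t, xs1, xs2):
    """Number of correspondences whose triangulated point lies in front of
    both cameras; used to pick among the 4 [R, t] solutions from E and to
    judge whether the chosen solution is stable."""
    P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
    P2 = np.hstack([R, t.reshape(3, 1)])
    good = 0
    for x1, x2 in zip(xs1, xs2):
        X = triangulate(P1, P2, x1, x2)
        z1 = X[2]                      # depth in camera 1
        z2 = (R @ X + t.ravel())[2]    # depth in camera 2
        good += (z1 > 0) and (z2 > 0)
    return good
```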
6.2 Flow model

LiteFlowNet trained with synthetic data shows acceptable generalization from synthetic to real. However, there are still some regions with significantly erroneous flow predictions. We find that with self-supervised finetuning, the model adapts better to the real-world sequences and the optical flow prediction accuracy is improved (Tab. 4).

Offline vs. online. We perform two types of self-supervised finetuning for the optical flow network. The offline method finetunes the flow network on sequences 00-08 using monocular videos, while the online method finetunes the model on-the-run on the current sequence. We test various amounts of data for online finetuning and evaluate the corresponding odometry results. The relationship is shown in Fig. 7: finetuning on a small amount of data (10%) is sufficient for the optical flow network to adapt to unseen scenarios.

Flow evaluation. We evaluate the quality of the optical flows on KITTI 2012/2015, two benchmark datasets for optical flow evaluation. The results are shown in Tab. 4. With self-supervised finetuning (offline), the accuracy of the flow prediction is significantly improved, especially in the percentage of outliers. One noticeable result is that self-supervised training increases the end-point error on KITTI 2015 from 4.785 to 4.987. The reason is that the self-supervised model is trained on the KITTI Odometry split, which contains long driving sequences without many dynamic objects, whereas KITTI 2015 contains many dynamic objects; we observe that the flow estimation error on these dynamic objects is larger for the self-supervised model, which raises the average error. The Scene Flow model, on the other hand, is trained in highly dynamic synthetic environments and is thus able to estimate the large flow magnitudes caused by moving objects. The synthetic model, however, generates artifacts in some regions when used on real-world data, so there are more outliers, as shown in Tab. 4. Nevertheless, the correspondence selection module effectively removes the bad flows predicted by the self-supervised model, and the overall flow accuracy improves over the Scene Flow model. Since better correspondences are estimated, the odometry result using Self-Flow improves as well.

6.3 Depth model

Training depth models with monocular videos comes with a scale inconsistency issue (Bian et al. 2019b). We use the inverse depth consistency proposed in (Zhan et al. 2019) to enforce consistent depth predictions (Sec. 4.6.4). Using a scale-consistent depth CNN for translation scale recovery helps mitigate the scale drift issue, which usually occurs after long travelling. Here we compare three depth models trained with different strategies. We train two models using monocular videos: the Mono. model is trained without the depth consistency term, while the Mono-SC model is trained with it. Models trained with monocular videos are always up-to-scale, i.e. the metric scale is unknown. Therefore, we also train a model using stereo sequences; note that this model does not include the depth consistency term. The predictions in stereo training are always associated with one and only one scale, i.e. the real-world scale, due to the constraint set by the known stereo baseline. Therefore, no scale ambiguity/inconsistency issues exist in this training scheme.
We can see that both the Reference Model (stereo) and Mono-SC show less terr and rerr after long travelling, aided by the scale-consistent depth predictions.

We also explored an online adaptation scheme for the depth network. However, the depth network training is unstable in online finetuning: the scale of the depth predictions fluctuates during training due to the scale ambiguity inherent in monocular training.

Table 4 Optical flow evaluation on the KITTI 2012/2015 optical flow splits. Average end-point error (AEPE) and the percentage of pixels with error larger than 1 (Out-1) are evaluated on non-occluded regions. SF (Super.): supervised training on Scene Flow. KITTI (Self.): self-supervised training on KITTI. BestN: bidirectional flow consistency thresholding applied.

Network      Dataset & Method                     KITTI 2012            KITTI 2015
                                                  AEPE (px)  Out-1 (%)  AEPE (px)  Out-1 (%)
LiteFlowNet  SF (Super.)                          1.593      26.1       4.785      39.6
LiteFlowNet  SF (Super.) + KITTI (Self.)          1.467      19.7       4.987      32.7
LiteFlowNet  SF (Super.) + BestN                  0.478       7.6       0.711      10.5
LiteFlowNet  SF (Super.) + KITTI (Self.) + BestN  0.422       5.7       0.628       7.7

Table 5 Quantitative results on the KITTI Tracking sequences; the frame-to-frame RPE (m) is reported for ORB-SLAM2 and for DF-VO (Ours) with simple and iterative scale recovery, over sequences 2011/09/26-05, -09, -11, -13, -14, -15, -18, 2011/09/29-04, and 2011/10/03-47, together with each sequence length (m) and the average.

Fig. 7 Effect of self-supervised online finetuning. The x-axis is the percentage of data used in the online finetuning.
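The Local Best-K selection used by the Reference Model (and the BestN thresholding of Tab. 4) keeps, per grid cell, the flows with the lowest forward-backward inconsistency; a sketch, with the grid layout and names chosen for illustration:

```python
import numpy as np

def local_best_k(consistency, k=10, grid=(10, 10)):
    """Grid-wise selection of sparse matches from a dense flow field.

    consistency : (H, W) per-pixel forward-backward flow inconsistency
    Returns an (M, 2) array of selected (row, col) pixel locations,
    keeping the k most consistent pixels inside each grid cell.
    """
    h, w = consistency.shape
    gh, gw = grid
    picked = []
    for gy in range(gh):
        for gx in range(gw):
            y0, y1 = gy * h // gh, (gy + 1) * h // gh
            x0, x1 = gx * w // gw, (gx + 1) * w // gw
            cell = consistency[y0:y1, x0:x1]
            # indices of the k lowest-inconsistency flows inside this cell
            flat = np.argsort(cell, axis=None)[:k]
            ys, xs = np.unravel_index(flat, cell.shape)
            picked.extend(zip(ys + y0, xs + x0))
    return np.array(picked)
```

Selecting per cell rather than globally spreads the correspondences over the image, which benefits the geometric solvers.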
6.4 Correspondence selection

Since only sparse matches are required for DF-VO, a naïve way to extract sparse matches from the dense optical flow prediction is to sample matches uniformly or randomly. We uniformly sampled 2000 flows to form the correspondences, and the resulting odometry is worse than with either the Best-N or the Local Best-K selection method. To verify the effectiveness of the forward-backward flow inconsistency, which is used for correspondence selection in both Best-N and Local Best-K selection, we evaluate the optical flow performance with and without the selection (Tab. 4). Instead of evaluating the best N points, we alternatively set an inconsistency threshold such that only the flows with inconsistency less than δfc are evaluated. We show that the accuracy of the selected flows improves significantly compared to the average over all optical flows.

6.5 Scale recovery

We propose two scale recovery methods in this work, namely simple alignment and iterative alignment. Simple alignment aligns the triangulated depths of the filtered optical flows with their corresponding depth predictions. However, the filtered optical flows can fall onto dynamic-object regions, and the depth predictions may not be accurate there. Iterative alignment is proposed for more robust scale recovery in dynamic environments: only depth points and filtered optical flows that are consistent with each other are used for scale recovery, which eliminates both bad depth predictions and optical flows on dynamic objects. Iterative alignment improves only slightly over simple alignment on the KITTI Odometry split, likely because these sequences are not very dynamic. However, in a highly dynamic environment like the KITTI Tracking split, especially Seq. 2011/10/03-47, a highway sequence with one-third of the image occupied by moving cars, iterative scale recovery shows a better result than simple alignment and works more robustly than ORB-SLAM2 (Tab. 5).

6.6 Model selection

Two model selection methods are proposed and tested in this work. The flow magnitude-based method (Zhan et al. 2020) is straightforward, but it has some potential failure cases, as explained in Sec. 4.5, and it requires a flow-magnitude threshold that must be found empirically. GRIC-based model selection, in contrast, is a parameter-free method that calculates a score function for each motion model, and it shows a more robust result than the flow-based method.

6.7 Image resolution

Down-sampled images are used in the Reference Model because this is the size used for training the deep networks. However, simply increasing the image size to full resolution allows the optical flow network to predict more accurate correspondences, so the odometry result can be boosted easily.

7 Conclusion

In this paper, we have presented a robust monocular VO system leveraging deep learning and geometry methods. We explore the integration of deep predictions with classic geometry methods.
Specifically, we use optical flow and single-view depth predictions from deep networks as intermediate outputs to establish 2D-2D/3D-2D correspondences for camera pose estimation. We show that the deep models can be trained or finetuned in a self-supervised manner, and we explore the effect of various training schemes. Depth models with a consistent scale can be used for scale recovery, which mitigates the scale drift issue in most monocular VO/SLAM systems. Instead of learning a complete VO system in an end-to-end manner, which does not perform competitively with geometry-based methods, we believe that integrating deep predictions with geometry gains the best of both domains. Compared to our previous conference version (Zhan et al. 2020), we robustify different components of the system and systematically evaluate the variants. Moreover, we integrate an online adaptation scheme into the system for better adaptation to unseen scenarios. A detailed ablation study is provided to verify the effectiveness of the choices in each module, including the original choices (Zhan et al. 2020) and the new components of this work. With these improvements, the current version shows more robust performance, especially in highly dynamic environments. Some prior arts (Tang et al. 2019; Tateno et al. 2017; Yang et al. 2018) show that a local optimization module can further improve the VO result, which is a possible future direction for our system. The current pipeline involves a single-view depth network, which is less accurate than multi-view stereo (MVS) networks; an MVS network could replace the depth network for better accuracy and possible online adaptation.

Acknowledgment

This work was supported by the UoA Scholarship to HZ, the ARC Laureate Fellowship FL130100102 to IR, and the Australian Centre of Excellence for Robotic Vision CE140100016.

References

Agrawal, P., Carreira, J., & Malik, J. (2015). Learning to see by moving. In IEEE International Conference on Computer Vision (ICCV), pp. 37–45.

Bian, J., Lin, W.-Y., Liu, Y., Zhang, L., Yeung, S.-K., Cheng, M.-M., et al. (2019a). GMS: Grid-based motion statistics for fast, ultra-robust feature correspondence. International Journal on Computer Vision (IJCV).

Bian, J.-W., Li, Z., Wang, N., Zhan, H., Shen, C., Cheng, M.-M., et al. (2019b). Unsupervised scale-consistent depth and ego-motion learning from monocular video. In Neural Information Processing Systems (NeurIPS).

Bian, J.-W., Wu, Y.-H., Zhao, J., Liu, Y., Zhang, L., Cheng, M.-M., et al. (2019c). An evaluation of feature matchers for fundamental matrix estimation. In British Machine Vision Conference (BMVC).

Dharmasiri, T., Spek, A., & Drummond, T. (2018). ENG: End-to-end neural geometry for robust depth and pose estimation using CNNs. arXiv preprint arXiv:1807.05705.

Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Golkov, V., et al. (2015). FlowNet: Learning optical flow with convolutional networks. In IEEE International Conference on Computer Vision (ICCV), pp. 2758–2766.

Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map prediction from a single image using a multi-scale deep network. In Neural Information Processing Systems (NeurIPS), pp. 2366–2374.

Engel, J., Koltun, V., & Cremers, D. (2017). Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI).

Engel, J., Schöps, T., & Cremers, D. (2014). LSD-SLAM: Large-scale direct monocular SLAM. In European Conference on Computer Vision (ECCV), Springer, pp. 834–849.

Forster, C., Pizzoli, M., & Scaramuzza, D. (2014). SVO: Fast semi-direct monocular visual odometry. In IEEE International Conference on Robotics and Automation (ICRA), pp. 15–22.

Forster, C., Zhang, Z., Gassner, M., Werlberger, M., & Scaramuzza, D. (2016). SVO: Semidirect visual odometry for monocular and multicamera systems. IEEE Transactions on Robotics (TRO), 33(2), pp. 249–265.

Fu, H., Gong, M., Wang, C., Batmanghelich, K., & Tao, D. (2018). Deep ordinal regression network for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2002–2011.

Garg, R., B G, V. K., Carneiro, G., & Reid, I. (2016). Unsupervised CNN for single view depth estimation: Geometry to the rescue. In European Conference on Computer Vision (ECCV), Springer, pp. 740–756.

Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR).

Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

Geiger, A., Ziegler, J., & Stiller, C. (2011). StereoScan: Dense 3D reconstruction in real-time. In Intelligent Vehicles Symposium (IV).

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Klein, G., & Murray, D. (2007). Parallel tracking and mapping for small AR workspaces. In 6th IEEE and ACM International Symposium on Mixed and Augmented Reality (ISMAR 2007), IEEE, pp. 225–234.

Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., & Navab, N. (2016). Deeper depth prediction with fully convolutional residual networks. In International Conference on 3D Vision (3DV), IEEE, pp. 239–248.

Li, R., Wang, S., Long, Z., & Gu, D. (2017). UnDeepVO: Monocular visual odometry through unsupervised deep learning. arXiv preprint arXiv:1709.06841.

Li, Y., Ushiku, Y., & Harada, T. (2019). Pose graph optimization for unsupervised monocular visual odometry. In IEEE International Conference on Robotics and Automation (ICRA).

Liu, F., Shen, C., & Lin, G. (2015). Deep convolutional neural fields for depth estimation from a single image. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5162–5170.

Liu, F., Shen, C., Lin, G., & Reid, I. (2016). Learning depth from single monocular images using deep convolutional neural fields. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 38(10), pp. 2024–2039.

Loo, S. Y., Amiri, A. J., Mashohor, S., Tang, S. H., & Zhang, H. (2019). CNN-SVO: Improving the mapping in semi-direct visual odometry using single-image depth prediction. In IEEE International Conference on Robotics and Automation (ICRA).

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal on Computer Vision (IJCV), 60(2), pp. 91–110.

Maddern, W., Pascoe, G., Linegar, C., & New-
man, P. (2017). 1 Year, 1000km: The Oxford + RobotCar Dataset. The International Journal +Godard, C., Mac Aodha, O., & Brostow, G. (2017). Un- of Robotics Research (IJRR), 36 (1), pp. 3–15. + http://ijr.sagepub.com/content/early/2016/ + supervised monocular depth estimation with left-right 11/28/0278364916679498.full.pdf+html, URL + http://dx.doi.org/10.1177/0278364916679498. + consistency. In IEEE Conference on Computer Vision Matthies, L., Maimone, M., Johnson, A., Cheng, Y., Will- + son, R., Villalpando, C., et al. (2007). Computer vision + and Pattern Recognition (CVPR). IEEE, (pp. 6602– on mars. International Journal on Computer Vision + (IJCV), 75 (1), pp. 67–92. + 6611). Meister, S., Hur, J., & Roth, S. (2018). Unflow: Unsuper- + vised learning of optical flow with a bidirectional census +Godard, C., Mac Aodha, O., Firman, M., & Brostow, loss. In Association for the Advancement of Artificial + Intelligence (AAAI). + G. J. (2019). Digging into self-supervised monocular Mur-Artal, R., Montiel, J. M. M., & Tardos, J. D. + (2015a). ORB-SLAM: a versatile and accurate monoc- + depth prediction. In IEEE International Conference ular slam system. IEEE Transactions on Robotics + (TRO), 31 (5), pp. 1147–1163. + on Computer Vision (ICCV). + +Hartley, R., & Zisserman, A. (2003). Multiple View Ge- + + ometry in Computer Vision. New York, NY, USA: + + Cambridge University Press, 2 edition. + +Hartley, R. I. (1995). In defence of the 8-point algorithm. + + In IEEE International Conference on Computer Vision + + (ICCV). IEEE, (pp. 1064–1070). + +Hui, T.-W., Tang, X., & Loy, C. C. (2018). Liteflownet: + + A lightweight convolutional neural network for opti- + + cal flow estimation. In IEEE Conference on Computer + + Vision and Pattern Recognition (CVPR). (pp. 8981– + + 8989). + +Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, + + A., & Brox, T. (2017). Flownet 2.0: Evolution of opti- + + cal flow estimation with deep networks. 
In IEEE Con- + + ference on Computer Vision and Pattern Recognition + + (CVPR). (pp. 2462–2470). + +Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). + + Spatial transformer networks. In Neural Information + + Processing Systems (NeurIPS). (pp. 2017–2025). + +Kendall, A., & Gal, Y. (2017). What uncertainties do + + we need in bayesian deep learning for computer vision? + + In Neural Information Processing Systems (NeurIPS). + + (pp. 5580–5590). + +Kingma, D., & Ba, J. (2014). Adam: A method + + for stochastic optimization. arXiv preprint + DF-VO: What Should Be Learnt for Visual Odometry? 17 + +Mur-Artal, R., Montiel, J. M. M., & Tardo´s, J. D. Torr, P. H., Fitzgibbon, A. W., & Zisserman, A. (1999). + (2015b). Orb-slam: A versatile and accurate monocular The problem of degeneracy in structure and motion + slam system. IEEE Transactions on Robotics (TRO), recovery from uncalibrated image sequences. Interna- + 31 (5), pp. 1147–1163. tional Journal of Computer Vision, 32 (1), pp. 27–44. + +Mur-Artal, R., & Tardo´s, J. D. (2016). ORB-SLAM2: an Ullman, S. (1979). The interpretation of structure from + open-source SLAM system for monocular, stereo and motion. Proceedings of the Royal Society of London. + RGB-D cameras. CoRR, abs/1610.06475. Series B. Biological Sciences, 203 (1153), pp. 405–426. + +Nekrasov, V., Dharmasiri, T., Spek, A., Drummond, T., Umeyama, S. (1991). Least-squares estimation of trans- + Shen, C., & Reid, I. (2019). Real-time joint seman- formation parameters between two point patterns. + tic segmentation and depth estimation using asymmet- IEEE Transactions on Pattern Recognition and Ma- + ric annotations. IEEE International Conference on chine Intelligence (TPAMI), (4), pp. 376–380. + Robotics and Automation (ICRA). + Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., +Newcombe, R. A., Lovegrove, S. J., & Davison, A. J. Dosovitskiy, A., et al. (2017). Demon: Depth and mo- + (2011). 
Dtam: Dense tracking and mapping in real- tion network for learning monocular stereo. In IEEE + time. In Computer Vision (ICCV), 2011 IEEE Inter- Conference on Computer Vision and Pattern Recogni- + national Conference on. IEEE, (pp. 2320–2327). tion (CVPR). + +Nister, D. (2003). An efficient solution to the five-point Wang, S., Clark, R., Wen, H., & Trigoni, N. (2017). + relative pose problem. In IEEE Conference on Com- Deepvo: Towards end-to-end visual odometry with + puter Vision and Pattern Recognition (CVPR). (pp. deep recurrent convolutional neural networks. In IEEE + II–195). International Conference on Robotics and Automation + (ICRA). IEEE, (pp. 2043–2050). +Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., + DeVito, Z., et al. (2017). Automatic differentiation in Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. + PyTorch. In NIPS Autodiff Workshop. (2004). Image quality assessment: from error visibility + to structural similarity. IEEE transactions on image +Ranjan, A., Jampani, V., Balles, L., Kim, K., Sun, D., processing, 13 (4), pp. 600–612. + Wulff, J., et al. (2019). Competitive collaboration: + Joint unsupervised learning of depth, camera motion, Yang, N., Wang, R., Stueckler, J., & Cremers, D. (2018). + optical flow and motion segmentation. In IEEE Con- Deep virtual stereo odometry: Leveraging deep depth + ference on Computer Vision and Pattern Recognition prediction for monocular direct sparse odometry. In + (CVPR). (pp. 12240–12249). European Conference on Computer Vision (ECCV). + +Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: Yin, Z., & Shi, J. (2018). Geonet: Unsupervised learning + Convolutional networks for biomedical image segmen- of dense depth, optical flow and camera pose. In IEEE + tation. In International Conference on Medical Im- Conference on Computer Vision and Pattern Recogni- + age Computing and Computer-Assisted Intervention. tion (CVPR). (pp. 1983–1992). + Springer, (pp. 234–241). 
+ Zhan, H., Garg, R., Weerasekera, C. S., Li, K., Agarwal, +Rublee, E., Rabaud, V., Konolige, K., & Bradski, G. H., & Reid, I. (2018). Unsupervised learning of monoc- + (2011). Orb: An efficient alternative to sift or surf. ular depth estimation and visual odometry with deep + In Computer Vision (ICCV), 2011 IEEE international feature reconstruction. In IEEE Conference on Com- + conference on. IEEE, (pp. 2564–2571). puter Vision and Pattern Recognition (CVPR). IEEE, + (pp. 340–349). +Scaramuzza, D., & Fraundorfer, F. (2011). Visual odom- + etry: Part i: The first 30 years and fundamentals. IEEE Zhan, H., Weerasekera, C. S., Bian, J., & Reid, I. (2020). + Robotics & Automation Magazine, 18 (4), pp. 80–92. Visual odometry revisited: What should be learnt? + Robotics and Automation (ICRA), 2020 IEEE Inter- +Sun, D., Yang, X., Liu, M.-Y., & Kautz, J. (2018). Pwc- national Conference on. + net: Cnns for optical flow using pyramid, warping, and + cost volume. In IEEE Conference on Computer Vision Zhan, H., Weerasekera, C. S., Garg, R., & Reid, I. D. + and Pattern Recognition (CVPR). (pp. 8934–8943). (2019). Self-supervised learning for single view depth + and surface normal estimation. In IEEE International +Tang, J., Ambrus, R., Guizilini, V., Pillai, S., Kim, H., & Conference on Robotics and Automation (ICRA). (pp. + Gaidon, A. (2019). Self-Supervised 3D Keypoint Learn- 4811–4817). + ing for Ego-motion Estimation. 1912.03426. + Zhang, J., Henein, M., Mahony, R., & Ila, V. (2020). +Tateno, K., Tombari, F., Laina, I., & Navab, N. (2017). Vdo-slam: A visual dynamic object-aware slam system. + CNN-SLAM: Real-time dense monocular slam with arXiv preprint arXiv:2005.11052. + learned depth prediction. In IEEE Conference on Com- + puter Vision and Pattern Recognition (CVPR). (pp. Zhang, Z. (1998). Determining the epipolar geometry + 6243–6252). and its uncertainty: A review. International Journal + 18 Zhan et al. + + on Computer Vision (IJCV), 27 (2), pp. 161–195. rescue. 
In European Conference on Computer Vision (ECCV). +Zhou, D., Dai, Y., & Li, H. (2019). Ground-plane-based Springer, (pp. 740–756). + Geiger, A., Lenz, P., Stiller, C., & Urtasun, R. (2013). Vision + absolute scale estimation for monocular visual odome- meets robotics: The kitti dataset. International Journal of + try. IEEE Transactions on Intelligent Transportation Robotics Research (IJRR). + Systems. Geiger, A., Lenz, P., & Urtasun, R. (2012). Are we ready for +Zhou, H., Ummenhofer, B., & Brox, T. (2018). Deep- autonomous driving? the kitti vision benchmark suite. In + tam: Deep tracking and mapping. arXiv preprint IEEE Conference on Computer Vision and Pattern Recognition + arXiv:1808.01900. (CVPR). +Zhou, T., Brown, M., Snavely, N., & Lowe, D. G. (2017). Geiger, A., Ziegler, J., & Stiller, C. (2011). Stereoscan: Dense 3d + Unsupervised learning of depth and ego-motion from reconstruction in real-time. In Intelligent Vehicles Symposium + video. In IEEE Conference on Computer Vision and (IV). + Pattern Recognition (CVPR). Godard, C., Mac Aodha, O., & Brostow, G. (2017). Unsuper- + vised monocular depth estimation with left-right consistency. +References In IEEE Conference on Computer Vision and Pattern Recogni- + tion (CVPR). IEEE, (pp. 6602–6611). +Agrawal, P., Carreira, J., & Malik, J. (2015). Learning to see Godard, C., Mac Aodha, O., Firman, M., & Brostow, G. J. + by moving. In IEEE International Conference on Computer (2019). Digging into self-supervised monocular depth predic- + Vision (ICCV). (pp. 37–45). tion. In IEEE International Conference on Computer Vision + (ICCV). +Bian, J., Lin, W.-Y., Liu, Y., Zhang, L., Yeung, S.-K., Cheng, Hartley, R., & Zisserman, A. (2003). Multiple View Geometry in + M.-M., et al. (2019a). GMS: Grid-based motion statistics Computer Vision. New York, NY, USA: Cambridge Univer- + for fast, ultra-robust feature correspondence. International sity Press, 2 edition. + Journal on Computer Vision (IJCV). Hartley, R. I. (1995). 
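The GRIC-based selection described in Sec. 6.6 can be made concrete with a small sketch. It scores each candidate motion model (homography vs. essential matrix) from its squared residuals, following Torr's GRIC formulation; the constants, the noise variance, and the synthetic residuals below are illustrative assumptions, not values taken from the paper:

```python
import math

def gric(residuals_sq, sigma2, d, k, r=4.0, lam3=2.0):
    """Geometric Robust Information Criterion for one motion model.

    residuals_sq: squared geometric errors over the n correspondences
    sigma2:       assumed measurement noise variance
    d:            dimension of the model manifold (2: homography, 3: essential)
    k:            number of model parameters (8 for H, 5 for E)
    r:            dimension of the data (4 for two-view correspondences)
    """
    n = len(residuals_sq)
    lam1, lam2 = math.log(r), math.log(r * n)
    # Robustified data term: residuals are capped so that outliers
    # cannot dominate the score.
    rho = sum(min(e / sigma2, lam3 * (r - d)) for e in residuals_sq)
    return rho + lam1 * d * n + lam2 * k

def select_motion_model(res_h_sq, res_e_sq, sigma2=1.0):
    """Pick the model with the lower (better) GRIC score; parameter-free in
    the sense that no empirical flow-magnitude threshold is required."""
    g_h = gric(res_h_sq, sigma2, d=2, k=8)
    g_e = gric(res_e_sq, sigma2, d=3, k=5)
    return ("homography", g_h) if g_h < g_e else ("essential", g_e)

# Synthetic example: the essential matrix explains the matches far better
# than a homography (e.g. a translating camera in a non-planar scene).
model, _ = select_motion_model([5.0] * 100, [0.1] * 100)  # -> "essential"
```

The flow magnitude-based alternative would instead compare the mean optical-flow magnitude against the empirically tuned threshold mentioned in Sec. 6.6; the GRIC score needs no such threshold.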
In defence of the 8-point algorithm. In + IEEE International Conference on Computer Vision (ICCV). +Bian, J.-W., Li, Z., Wang, N., Zhan, H., Shen, C., Cheng, M.- IEEE, (pp. 1064–1070). + M., et al. (2019b). Unsupervised scale-consistent depth and Hui, T.-W., Tang, X., & Loy, C. C. (2018). Liteflownet: A + ego-motion learning from monocular video. In Neural Infor- lightweight convolutional neural network for optical flow es- + mation Processing Systems (NeurIPS). timation. In IEEE Conference on Computer Vision and Pattern + Recognition (CVPR). (pp. 8981–8989). +Bian, J.-W., Wu, Y.-H., Zhao, J., Liu, Y., Zhang, L., Cheng, Ilg, E., Mayer, N., Saikia, T., Keuper, M., Dosovitskiy, A., & + M.-M., et al. (2019c). An evaluation of feature matchers for Brox, T. (2017). Flownet 2.0: Evolution of optical flow esti- + fundamental matrix estimation. In British Machine Vision mation with deep networks. In IEEE Conference on Computer + Conference (BMVC). Vision and Pattern Recognition (CVPR). (pp. 2462–2470). + Jaderberg, M., Simonyan, K., Zisserman, A., et al. (2015). Spa- +Dharmasiri, T., Spek, A., & Drummond, T. (2018). Eng: End- tial transformer networks. In Neural Information Processing + to-end neural geometry for robust depth and pose estimation Systems (NeurIPS). (pp. 2017–2025). + using cnns. arXiv preprint arXiv:1807.05705. Kendall, A., & Gal, Y. (2017). What uncertainties do we need + in bayesian deep learning for computer vision? In Neural +Dosovitskiy, A., Fischer, P., Ilg, E., Hausser, P., Hazirbas, C., Information Processing Systems (NeurIPS). (pp. 5580–5590). + Golkov, V., et al. (2015). Flownet: Learning optical flow with Kingma, D., & Ba, J. (2014). Adam: A method for stochastic + convolutional networks. In IEEE International Conference on optimization. arXiv preprint arXiv:1412.6980. + Computer Vision (ICCV). (pp. 2758–2766). Klein, G., & Murray, D. (2007). Parallel tracking and mapping + for small ar workspaces. 
In Mixed and Augmented Reality, +Eigen, D., Puhrsch, C., & Fergus, R. (2014). Depth map predic- 2007. ISMAR 2007. 6th IEEE and ACM International Sympo- + tion from a single image using a multi-scale deep network. In sium on. IEEE, (pp. 225–234). + Neural Information Processing Systems (NeurIPS). (pp. 2366– Laina, I., Rupprecht, C., Belagiannis, V., Tombari, F., & Navab, + 2374). N. (2016). Deeper depth prediction with fully convolutional + residual networks. In International Conference on 3D Vision +Engel, J., Koltun, V., & Cremers, D. (2017). Direct sparse (3DV). IEEE, (pp. 239–248). + odometry. IEEE Transactions on Pattern Recognition and Ma- Li, R., Wang, S., Long, Z., & Gu, D. (2017). Undeepvo: Monoc- + chine Intelligence (TPAMI). ular visual odometry through unsupervised deep learning. + arXiv preprint arXiv:1709.06841. +Engel, J., Scho¨ps, T., & Cremers, D. (2014). LSD-SLAM: Large- Li, Y., Ushiku, Y., & Harada, T. (2019). Pose graph opti- + scale direct monocular slam. In European Conference on Com- mization for unsupervised monocular visual odometry. IEEE + puter Vision (ECCV). Springer, (pp. 834–849). International Conference on Robotics and Automation (ICRA). + Liu, F., Shen, C., & Lin, G. (2015). Deep convolutional neural +Forster, C., Pizzoli, M., & Scaramuzza, D. (2014). Svo: Fast fields for depth estimation from a single image. In IEEE Con- + semi-direct monocular visual odometry. In IEEE Interna- ference on Computer Vision and Pattern Recognition (CVPR). + tional Conference on Robotics and Automation (ICRA). (pp. (pp. 5162–5170). + 15–22). Liu, F., Shen, C., Lin, G., & Reid, I. (2016). Learning depth + from single monocular images using deep convolutional neu- +Forster, C., Zhang, Z., Gassner, M., Werlberger, M., & Scara- ral fields. IEEE Transactions on Pattern Recognition and Ma- + muzza, D. (2016). SVO: Semidirect visual odometry for chine Intelligence (TPAMI), 38 (10), pp. 2024–2039. + monocular and multicamera systems. 
IEEE Transactions on Loo, S. Y., Amiri, A. J., Mashohor, S., Tang, S. H., & Zhang, + Robotics (TRO), 33 (2), pp. 249–265. H. (2019). CNN-SVO: Improving the mapping in semi-direct + +Fu, H., Gong, M., Wang, C., Batmanghelich, K., & Tao, D. + (2018). Deep ordinal regression network for monocular depth + estimation. In IEEE Conference on Computer Vision and Pat- + tern Recognition (CVPR). (pp. 2002–2011). + +Garg, R., B G, V. K., Carneiro, G., & Reid, I. (2016). Unsuper- + vised cnn for single view depth estimation: Geometry to the + DF-VO: What Should Be Learnt for Visual Odometry? 19 + + visual odometry using single-image depth prediction. IEEE Tateno, K., Tombari, F., Laina, I., & Navab, N. (2017). CNN- + International Conference on Robotics and Automation (ICRA). SLAM: Real-time dense monocular slam with learned depth +Lowe, D. G. (2004). Distinctive image features from scale- prediction. In IEEE Conference on Computer Vision and Pat- + invariant keypoints. International Journal on Computer Vision tern Recognition (CVPR). (pp. 6243–6252). + (IJCV), 60 (2), pp. 91–110. +Maddern, W., Pascoe, G., Linegar, C., & Newman, P. (2017). Torr, P. H., Fitzgibbon, A. W., & Zisserman, A. (1999). The + 1 Year, 1000km: The Oxford RobotCar Dataset. The In- problem of degeneracy in structure and motion recovery from + ternational Journal of Robotics Research (IJRR), 36 (1), pp. uncalibrated image sequences. International Journal of Com- + 3–15. http://ijr.sagepub.com/content/early/2016/11/28/ puter Vision, 32 (1), pp. 27–44. + 0278364916679498.full.pdf+html, URL http://dx.doi.org/ + 10.1177/0278364916679498. Ullman, S. (1979). The interpretation of structure from motion. +Matthies, L., Maimone, M., Johnson, A., Cheng, Y., Willson, Proceedings of the Royal Society of London. Series B. Biological + R., Villalpando, C., et al. (2007). Computer vision on mars. Sciences, 203 (1153), pp. 405–426. + International Journal on Computer Vision (IJCV), 75 (1), pp. + 67–92. Umeyama, S. (1991). 
Least-squares estimation of transfor- +Meister, S., Hur, J., & Roth, S. (2018). Unflow: Unsupervised mation parameters between two point patterns. IEEE + learning of optical flow with a bidirectional census loss. In As- Transactions on Pattern Recognition and Machine Intelligence + sociation for the Advancement of Artificial Intelligence (AAAI). (TPAMI), (4), pp. 376–380. +Mur-Artal, R., Montiel, J. M. M., & Tardos, J. D. (2015a). + ORB-SLAM: a versatile and accurate monocular slam sys- Ummenhofer, B., Zhou, H., Uhrig, J., Mayer, N., Ilg, E., Doso- + tem. IEEE Transactions on Robotics (TRO), 31 (5), pp. 1147– vitskiy, A., et al. (2017). Demon: Depth and motion network + 1163. for learning monocular stereo. In IEEE Conference on Com- +Mur-Artal, R., Montiel, J. M. M., & Tard´os, J. D. (2015b). puter Vision and Pattern Recognition (CVPR). + Orb-slam: A versatile and accurate monocular slam system. + IEEE Transactions on Robotics (TRO), 31 (5), pp. 1147–1163. Wang, S., Clark, R., Wen, H., & Trigoni, N. (2017). Deepvo: To- +Mur-Artal, R., & Tardo´s, J. D. (2016). ORB-SLAM2: an open- wards end-to-end visual odometry with deep recurrent con- + source SLAM system for monocular, stereo and RGB-D cam- volutional neural networks. In IEEE International Conference + eras. CoRR, abs/1610.06475. on Robotics and Automation (ICRA). IEEE, (pp. 2043–2050). +Nekrasov, V., Dharmasiri, T., Spek, A., Drummond, T., Shen, + C., & Reid, I. (2019). Real-time joint semantic segmentation Wang, Z., Bovik, A. C., Sheikh, H. R., & Simoncelli, E. P. + and depth estimation using asymmetric annotations. IEEE (2004). Image quality assessment: from error visibility to + International Conference on Robotics and Automation (ICRA). structural similarity. IEEE transactions on image processing, +Newcombe, R. A., Lovegrove, S. J., & Davison, A. J. (2011). 13 (4), pp. 600–612. + Dtam: Dense tracking and mapping in real-time. In Computer + Vision (ICCV), 2011 IEEE International Conference on. 
IEEE, Yang, N., Wang, R., Stueckler, J., & Cremers, D. (2018). Deep + (pp. 2320–2327). virtual stereo odometry: Leveraging deep depth prediction +Nister, D. (2003). An efficient solution to the five-point relative for monocular direct sparse odometry. In European Conference + pose problem. In IEEE Conference on Computer Vision and on Computer Vision (ECCV). + Pattern Recognition (CVPR). (pp. II–195). +Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., De- Yin, Z., & Shi, J. (2018). Geonet: Unsupervised learning of + Vito, Z., et al. (2017). Automatic differentiation in PyTorch. dense depth, optical flow and camera pose. In IEEE Con- + In NIPS Autodiff Workshop. ference on Computer Vision and Pattern Recognition (CVPR). +Ranjan, A., Jampani, V., Balles, L., Kim, K., Sun, D., Wulff, J., (pp. 1983–1992). + et al. (2019). Competitive collaboration: Joint unsupervised + learning of depth, camera motion, optical flow and motion Zhan, H., Garg, R., Weerasekera, C. S., Li, K., Agarwal, H., & + segmentation. In IEEE Conference on Computer Vision and Reid, I. (2018). Unsupervised learning of monocular depth + Pattern Recognition (CVPR). (pp. 12240–12249). estimation and visual odometry with deep feature reconstruc- +Ronneberger, O., Fischer, P., & Brox, T. (2015). U-net: tion. In IEEE Conference on Computer Vision and Pattern + Convolutional networks for biomedical image segmentation. Recognition (CVPR). IEEE, (pp. 340–349). + In International Conference on Medical Image Computing and + Computer-Assisted Intervention. Springer, (pp. 234–241). Zhan, H., Weerasekera, C. S., Bian, J., & Reid, I. (2020). Visual +Rublee, E., Rabaud, V., Konolige, K., & Bradski, G. (2011). odometry revisited: What should be learnt? Robotics and + Orb: An efficient alternative to sift or surf. In Computer Automation (ICRA), 2020 IEEE International Conference on. + Vision (ICCV), 2011 IEEE international conference on. IEEE, + (pp. 2564–2571). Zhan, H., Weerasekera, C. 
S., Garg, R., & Reid, I. D. (2019). +Scaramuzza, D., & Fraundorfer, F. (2011). Visual odometry: Self-supervised learning for single view depth and surface + Part i: The first 30 years and fundamentals. IEEE Robotics normal estimation. In IEEE International Conference on + & Automation Magazine, 18 (4), pp. 80–92. Robotics and Automation (ICRA). (pp. 4811–4817). +Sun, D., Yang, X., Liu, M.-Y., & Kautz, J. (2018). Pwc-net: + Cnns for optical flow using pyramid, warping, and cost vol- Zhang, J., Henein, M., Mahony, R., & Ila, V. (2020). Vdo-slam: + ume. In IEEE Conference on Computer Vision and Pattern A visual dynamic object-aware slam system. arXiv preprint + Recognition (CVPR). (pp. 8934–8943). arXiv:2005.11052. +Tang, J., Ambrus, R., Guizilini, V., Pillai, S., Kim, H., & + Gaidon, A. (2019). Self-Supervised 3D Keypoint Learning Zhang, Z. (1998). Determining the epipolar geometry and its + for Ego-motion Estimation. 1912.03426. uncertainty: A review. International Journal on Computer Vi- + sion (IJCV), 27 (2), pp. 161–195. + + Zhou, D., Dai, Y., & Li, H. (2019). Ground-plane-based abso- + lute scale estimation for monocular visual odometry. IEEE + Transactions on Intelligent Transportation Systems. + + Zhou, H., Ummenhofer, B., & Brox, T. (2018). Deeptam: Deep + tracking and mapping. arXiv preprint arXiv:1808.01900. + + Zhou, T., Brown, M., Snavely, N., & Lowe, D. G. (2017). Un- + supervised learning of depth and ego-motion from video. In + IEEE Conference on Computer Vision and Pattern Recognition + (CVPR). 
+ diff --git a/动态slam/2020年-2022年开源动态SLAM/2021年/MonoRec_Semi-Supervised_Dense_Reconstruction_in_Dynamic_Environments_from_a_Single_Moving_Camera.pdf b/动态slam/2020年-2022年开源动态SLAM/2021年/MonoRec_Semi-Supervised_Dense_Reconstruction_in_Dynamic_Environments_from_a_Single_Moving_Camera.pdf new file mode 100644 index 0000000..359a826 --- /dev/null +++ b/动态slam/2020年-2022年开源动态SLAM/2021年/MonoRec_Semi-Supervised_Dense_Reconstruction_in_Dynamic_Environments_from_a_Single_Moving_Camera.pdf @@ -0,0 +1,679 @@ + 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) + +2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) | 978-1-6654-4509-2/21/$31.00 ©2021 IEEE | DOI: 10.1109/CVPR46437.2021.00605 MonoRec: Semi-Supervised Dense Reconstruction in Dynamic Environments + from a Single Moving Camera + + Felix Wimbauer1, Nan Yang1,2, Lukas von Stumberg1 Niclas Zeller1,2 Daniel Cremers1,2 + 1 Technical University of Munich, 2 Artisense + + {wimbauer, yangn, stumberg, zellern, cremers}@in.tum.de + + Abstract + + In this paper, we propose MonoRec, a semi-supervised Figure 1: MonoRec can deliver high-quality dense recon- + monocular dense reconstruction architecture that predicts struction from a single moving camera. The figure shows + depth maps from a single moving camera in dynamic en- an example of a large-scale outdoor point cloud reconstruc- + vironments. MonoRec is based on a multi-view stereo set- tion (KITTI Odometry sequence 07) by simply accumulat- + ting which encodes the information of multiple consecutive ing predicted depth maps. Please refer to our project page + images in a cost volume. To deal with dynamic objects in for the video of the entire reconstruction of the sequence. + the scene, we introduce a MaskModule that predicts mov- + ing object masks by leveraging the photometric inconsisten- creasing demand of reducing the total number of sensors. + cies encoded in the cost volumes. 
Unlike other multi-view Over the past years, researchers have therefore put a lot of + stereo methods, MonoRec is able to reconstruct both static effort into solving the problem of perception with only a sin- + and moving objects by leveraging the predicted masks. Fur- gle monocular camera. Considering recent achievements in + thermore, we present a novel multi-stage training scheme monocular visual odometry (VO) [8, 58, 51], with respect to + with a semi-supervised loss formulation that does not re- ego-motion estimation, this was certainly successful. Nev- + quire LiDAR depth values. We carefully evaluate MonoRec ertheless, reliable dense 3D mapping of the static environ- + on the KITTI dataset and show that it achieves state-of-the- ment and moving objects is still an open research topic. + art performance compared to both multi-view and single- + view methods. With the model trained on KITTI, we further- To tackle the problem of dense 3D reconstruction based + more demonstrate that MonoRec is able to generalize well on a single moving camera, there are basically two paral- + to both the Oxford RobotCar dataset and the more chal- + lenging TUM-Mono dataset recorded by a handheld cam- + era. Code and related materials are available at https: + //vision.in.tum.de/research/monorec. + + 1. Introduction + + 1.1. Real-world Scene Capture from Video + + Obtaining a 3D understanding of the entire static and dy- + namic environment can be seen as one of the key-challenges + in robotics, AR/VR, and autonomous driving. State of to- + day, this is achieved based on the fusion of multiple sen- + sor sources (incl. cameras, LiDARs, RADARs and IMUs). + This guarantees dense coverage of the vehicle’s surround- + ings and accurate ego-motion estimation. However, driven + by the high cost as well as the challenge to maintain cross- + calibration of such a complex sensor suite, there is an in- + + Indicates equal contribution. 
978-1-6654-4509-2/21/$31.00 ©2021 IEEE    DOI 10.1109/CVPR46437.2021.00605

lel lines of research. On one side, there are dense multi-view stereo (MVS) methods, which evolved over the last decade [39, 45, 2] and saw a great improvement through the use of convolutional neural networks (CNNs) [23, 61, 57]. On the other side, there are monocular depth prediction methods which purely rely on deep learning [7, 16, 58]. Though all these methods show impressive performance, both types also have their respective shortcomings. For MVS the overall assumption is a stationary environment to be reconstructed, so the presence of dynamic objects deteriorates their performance. Monocular depth prediction methods, in contrast, perform very well in reconstructing moving objects, as predictions are made only based on individual images. At the same time, due to their use of a single image only, they strongly rely on the perspective appearance of objects as observed with specific camera intrinsics and extrinsics and therefore do not generalize well to other datasets.

1.2. Contribution

To combine the advantages of both deep MVS and monocular depth prediction, we propose MonoRec, a novel monocular dense reconstruction architecture that consists of a MaskModule and a DepthModule. We encode the information from multiple consecutive images using cost volumes which are constructed based on the structural similarity index measure (SSIM) [54] instead of the sum of absolute differences (SAD) like prior works. The MaskModule is able to identify moving pixels and downweights the corresponding voxels in the cost volume. Thereby, in contrast to other MVS methods, MonoRec does not suffer from artifacts on moving objects and therefore delivers depth estimations on both static and dynamic objects.

With the proposed multi-stage training scheme, MonoRec achieves state-of-the-art performance compared to other MVS and monocular depth prediction methods on the KITTI dataset [14]. Furthermore, we validate the generalization capabilities of our network on the Oxford RobotCar dataset [35] and the TUM-Mono dataset [9]. Figure 1 shows a dense point cloud reconstructed by our method on one of our test sequences of KITTI.

2. Related Work

2.1. Multi-view Stereo

Multi-view stereo (MVS) methods estimate a dense representation of the 3D environment based on a set of images with known poses. Over the past years, several methods have been developed to solve the MVS problem [46, 28, 30, 2, 47, 49, 39, 13, 45, 60] based on classical optimization. Recently, due to the advance of deep neural networks (DNNs), different learning-based approaches were proposed. This representation can be volumetric [26, 27, 36] or 3D point cloud based [3, 12]. Most popular are still depth map representations predicted from a 3D cost volume [23, 53, 61, 66, 22, 56, 41, 24, 33, 62, 19, 64, 57]. Huang et al. [23] proposed one of the first cost-volume based approaches. They compute a set of image-pair-wise plane-sweep volumes with respect to a reference image and use a CNN to predict one single depth map based on this set. Zhou et al. [66] also use the photometric cost volumes as the inputs of the deep neural networks and employ a two-stage approach for dense depth prediction. Yao et al. [61] instead calculate a single cost volume using deep features of all input images.

2.2. Dense Depth Estimation in Dynamic Scenes

Reconstructing dynamic scenes is challenging since the moving objects violate the static-world assumption of classical multi-view stereo methods. Russell et al. [43] and Ranftl et al. [40] build on motion segmentation and perform classical optimization. Li et al. [32] proposed to estimate dense depth maps from scenes with moving people. All these methods need additional inputs, e.g., optical flow, object masks, etc., for the inference, while MonoRec requires only the posed images as inputs. Another line of research is monocular depth estimation [7, 6, 29, 31, 11, 59, 16, 48, 67, 63, 65, 52, 18, 17, 58]. These methods are not affected by moving objects, but the depth estimation is not necessarily accurate, especially in unseen scenarios. Luo et al. [34] proposed a test-time optimization method which is not real-time capable. In a concurrent work, Watson et al. [55] address moving objects with the consistency between monocular depth estimation and multi-view stereo, while MonoRec predicts the dynamic masks explicitly by the proposed MaskModule.

2.3. Dense SLAM

Several of the methods cited above solve both the problem of dense 3D reconstruction and camera pose estimation [48, 67, 63, 65, 66, 59, 58]. Nevertheless, these methods either solve both problems independently or only integrate one into the other (e.g. [66, 58]). Newcombe et al. [37] instead jointly optimize the 6DoF camera pose and the dense 3D scene structure. However, due to its volumetric map representation it is only applicable to small-scale scenes. Recently, Bloesch et al. [1] proposed a learned code representation which can be optimized jointly with the 6DoF camera poses. This idea is pursued by Czarnowski et al. [5] and integrated into a full SLAM system. All the above-mentioned methods, however, do not address the issue of moving objects. Instead, the proposed MonoRec network explicitly deals with moving objects and achieves superior accuracy both on moving and on static structures. Furthermore, prior works show that the accuracy of camera tracking does not necessarily improve with more points [8, 10]. MonoRec therefore focuses solely on delivering dense reconstruction using poses from a sparse VO system and shows state-of-the-art results on public benchmarks. Note that, this way, MonoRec can be easily combined with any VO system with arbitrary sensor setups.

Figure 2: MonoRec Architecture: It first constructs a photometric cost volume from multiple input frames. Unlike prior works, we use the SSIM [54] metric instead of SAD to measure the photometric consistency. The MaskModule aims to detect inconsistencies between the different input frames to determine moving objects. The multi-frame cost volume C is multiplied with the predicted mask and then passed to the DepthModule which predicts a dense inverse depth map. In both the decoders of MaskModule and DepthModule, the cost volume features are concatenated with pre-trained ResNet-18 features.
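The data flow summarized in the Figure 2 caption (cost volume → predicted mask → masked cost volume → depth) can be illustrated with a small sketch. This is not the authors' code: the toy `decode_depth` argmax stands in for the real DepthModule, the shapes are illustrative, and the mask is applied as (1 − M_t) so that moving-object regions retain no strong maxima, matching the effect the caption describes.

```python
import numpy as np

def mask_cost_volume(C, mask):
    """Down-weight every depth slice of the cost volume C (shape (M, H, W))
    by the per-pixel moving-object probability mask (shape (H, W))."""
    return C * (1.0 - mask)[None, :, :]

def decode_depth(C, depths):
    """Toy stand-in for DepthModule: per pixel, pick the depth step with
    the highest photometric consistency."""
    return depths[np.argmax(C, axis=0)]

M, H, W = 8, 4, 4
depths = np.linspace(5.0, 40.0, M)          # hypothetical depth steps
rng = np.random.default_rng(0)
C = rng.uniform(-1.0, 1.0, size=(M, H, W))  # C(x, d) lies in [-1, 1]
mask = np.zeros((H, W))
mask[1, 1] = 1.0                            # one pixel flagged as moving
C_masked = mask_cost_volume(C, mask)
depth_map = decode_depth(C_masked, depths)
```

After masking, the flagged pixel carries a flat (all-zero) consistency profile, so a real DepthModule has to infer its depth from image features and surrounding context instead.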
3. The MonoRec Network

MonoRec uses a set of consecutive frames and the corresponding camera poses to predict a dense depth map for the given keyframe. The MonoRec architecture combines a MaskModule and a DepthModule. MaskModule predicts moving object masks that improve depth accuracy and allow us to eliminate noise in 3D reconstructions. DepthModule predicts a depth map from the masked cost volume. In this section, we first describe the different modules of our architecture, and then discuss the specialized multi-stage semi-supervised training scheme.

3.1. Preliminaries

Our method aims to predict a dense inverse depth map D_t of the selected keyframe from a set of consecutive frames {I_1, ..., I_N}. We denote the selected keyframe as I_t and the others as I_{t'} (t' ∈ {1, ..., N} \ {t}). Given the camera intrinsics, the inverse depth map D_t, and the relative camera pose T_{t'}^t ∈ SE(3) between I_t and I_{t'}, we can perform the reprojection from I_{t'} to I_t as

    I_{t'}^t = I_{t'} ⟨ proj(D_t, T_{t'}^t) ⟩,    (1)

where proj() is the projection function and ⟨·⟩ is the differentiable sampler [25]. This reprojection formulation is important for both the cost volume formation (Sec. 3.2) and the self-supervised loss term (Sec. 3.4).

In the following, we refer to the consecutive frames as temporal stereo (T) frames. During training, we use an additional static stereo (S) frame I_{t_S} for each sample, which was captured by a synchronized stereo camera at the same time as the respective keyframe. In the following sections, we denote cost volumes calculated based on the keyframe I_t and only one non-keyframe I_{t'} by C_{t'}(x, d) where applicable.

3.2. Cost Volume

A cost volume encodes geometric information from the different frames in a tensor that is suited as input for neural networks. For a number of discrete depth steps, the temporal stereo frames are reprojected to the keyframe and a pixel-wise photometric error is computed. Ideally, the lower the photometric error, the better the depth step approximates the real depth at a given pixel. Our cost volume follows the general formulation of the prior works [37, 66]. Nevertheless, unlike the previous works that define the photometric error pe() as a patch-wise SAD, we propose to use the SSIM as follows:

    pe(x, d) = (1 − SSIM(I_{t'}^t(x, d), I_t(x))) / 2    (2)

with 3 × 3 patch size. Here I_{t'}^t(x, d) defines the intensity at pixel x of the image I_{t'} warped with constant depth d. In practice, we clamp the error to [0, 1]. The cost volume C stores at C(x, d) the aggregated photometric consistency for pixel x and depth d:

    C(x, d) = 1 − 2 · (1 / Σ_{t'} ω_{t'}(x)) · Σ_{t'} pe_{t'}(x, d) · ω_{t'}(x),    (3)

where d ∈ {d_i | d_i = d_min + (i / M) · (d_max − d_min)} ranges over the M discrete depth steps. The weighting term ω_{t'}(x) assigns a high weight to pixels where the optimal depth step yields a distinctly lower photometric error, while pixels with an ambiguous error profile are weighted lower:

    ω_{t'}(x) = 1 − (1 / (M − 1)) · Σ_{d ≠ d*} exp(−α · (pe_{t'}(x, d) − pe_{t'}(x, d*))²)    (4)

with d*_{t'} = argmin_d pe_{t'}(x, d). Note that C(x, d) has the range [−1, 1], where −1/1 indicates the lowest/highest photometric consistency.
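A minimal NumPy sketch of Eqs. (2)-(4), under simplifying assumptions: a crude 3×3 box-filtered SSIM, and photometric errors `pe_frames[t']` of shape (M, H, W) assumed precomputed per non-keyframe from images already warped at each depth step (the real system warps with poses and a differentiable sampler). The small epsilon guarding the normalization of Eq. (3) is our addition for the all-ambiguous case where every weight is zero.

```python
import numpy as np

def box3(img):
    """3x3 box filter with edge padding (stand-in for the SSIM window)."""
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def ssim(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified per-pixel SSIM over 3x3 neighborhoods."""
    mu_a, mu_b = box3(a), box3(b)
    var_a = box3(a * a) - mu_a ** 2
    var_b = box3(b * b) - mu_b ** 2
    cov = box3(a * b) - mu_a * mu_b
    return ((2 * mu_a * mu_b + c1) * (2 * cov + c2)) / \
           ((mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2))

def photometric_error(warped, key):
    """Eq. (2): pe = (1 - SSIM) / 2, clamped to [0, 1]."""
    return np.clip((1.0 - ssim(warped, key)) / 2.0, 0.0, 1.0)

def frame_weight(pe, alpha=10.0):
    """Eq. (4): weight is high where the best depth step is distinct."""
    m = pe.shape[0]
    best = np.min(pe, axis=0)  # pe at d* = argmin_d pe
    # sum over d != d*: subtract the d = d* term, whose exp(0) equals 1
    return 1.0 - (np.exp(-alpha * (pe - best) ** 2).sum(axis=0) - 1.0) / (m - 1)

def cost_volume(pe_frames, alpha=10.0, eps=1e-8):
    """Eq. (3): weighted aggregation over frames t', mapped to [-1, 1]."""
    w = np.stack([frame_weight(pe, alpha) for pe in pe_frames])  # (T, H, W)
    pe = np.stack(pe_frames)                                     # (T, M, H, W)
    agg = (pe * w[:, None]).sum(axis=0) / (w.sum(axis=0)[None] + eps)
    return 1.0 - 2.0 * agg
```

With a single non-keyframe the weights cancel (up to the epsilon guard) and Eq. (3) reduces to C = 1 − 2·pe; the weighting only matters when aggregating several frames.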
3.3. Network Architecture

As shown in Figure 2, the proposed network architecture contains two sub-modules, namely, MaskModule and DepthModule.

Figure 3: Auxiliary Training Masks: Examples of auxiliary training masks from the training set that are used as reference.

MaskModule. MaskModule aims to predict a mask M_t where M_t(x) ∈ [0, 1] indicates the probability of a pixel x in I_t belonging to a moving object. Determining moving objects from I_t alone is an ambiguous task and hard to generalize. Therefore, we propose to use the set of cost volumes {C_{t'} | t' ∈ {1, ..., N} \ {t}}, which encode the geometric priors between I_t and {I_{t'} | t' ∈ {1, ..., N} \ {t}} respectively. We use C_{t'} instead of C since the inconsistent geometric information from the different C_{t'} is a strong prior for moving object prediction – dynamic pixels yield inconsistent optimal depth steps in the different C_{t'}. However, geometric priors alone are not enough to predict moving objects, since poorly-textured or non-Lambertian surfaces can lead to inconsistencies as well. Furthermore, for objects that move at constant speed, the cost volumes tend to reach a consensus on wrong depths that semantically do not fit into the context of the scene. Therefore, we further leverage pre-trained ResNet-18 [21] features of I_t to encode semantic priors in addition to the geometric ones. The network adopts a U-Net architecture design [42] with skip connections. All cost volumes are passed through encoders with shared weights. The features from the different cost volumes are aggregated using max-pooling and then passed through the decoder. In this way, MaskModule can be applied to different numbers of frames without retraining.

DepthModule. DepthModule predicts a dense pixel-wise inverse depth map D_t of I_t. To this end, the module receives the complete cost volume C concatenated with the keyframe I_t. Unlike MaskModule, here we use C instead of the C_{t'} since multi-frame cost volumes in general lead to higher depth accuracy and robustness against photometric noise [37]. To eliminate wrong depth predictions for moving objects, we perform a pixel-wise multiplication between M_t and the cost volume C for every depth step d. This way, no maxima (i.e. strong priors) are left in regions of moving objects, such that DepthModule has to rely on the information from the image features and the surroundings to infer the depth of moving objects. We employ a U-Net architecture with multi-scale depth outputs from the decoder [17]. Finally, DepthModule outputs an interpolation factor between d_min and d_max. In practice, we use s = 4 scales of depth prediction.

3.4. Multi-stage Training

In this section, we propose a multi-stage training scheme for the networks. Specifically, the bootstrapping stage, the MaskModule refinement stage and the DepthModule refinement stage are executed successively.

Bootstrapping. In the bootstrapping stage, MaskModule and DepthModule are trained separately. DepthModule takes the non-masked C as input and predicts D_t. The training objective of DepthModule is defined as a multi-scale (s ∈ [0, 3]) semi-supervised loss. It combines a self-supervised photometric loss and an edge-aware smoothness term, as proposed in [17], with a supervised sparse depth loss:

    L_depth = Σ_{s=0}^{3} L_{self,s} + α L_{sparse,s} + β L_{smooth,s}.    (5)

The self-supervised loss is computed from the photometric errors between the keyframe and the reprojected temporal stereo and static stereo frames:

    L_{self,s} = min_{t' ∈ {1,...,N}\{t} ∪ {t_S}} ( λ (1 − SSIM(I_{t'}^t, I_t)) / 2 + (1 − λ) ||I_{t'}^t − I_t||_1 ),    (6)

where λ = 0.85. Note that L_{self,s} takes the per-pixel minimum, which has been shown to be superior compared to the per-pixel average [17]. The sparse supervised depth loss is defined as

    L_{sparse,s} = ||D_t − D_VO||_1,    (7)

where the ground-truth sparse depth maps D_VO are obtained by a visual odometry system [59]. Note that all the supervision signals of DepthModule are generated from either the images themselves or the visual odometry system, without any manual labeling or LiDAR depth.

MaskModule is trained with the mask loss L_mask, which is the weighted binary cross entropy between the predicted mask M_t and the auxiliary ground-truth moving object mask M_aux. We generate M_aux by leveraging a pre-trained Mask-RCNN and the trained DepthModule as explained above. We firstly define the movable object classes, e.g., cars, cyclists, etc., and then obtain the instance segmentations of these object classes for the training images. A movable instance is classified as a moving instance if it has a high ratio of photometrically inconsistent pixels between temporal stereo and static stereo. Specifically, for each image, we predict its depth maps D_t and D_t^S using the cost volumes formed by the temporal stereo images C and the static stereo images C^S, respectively. Then a pixel x is regarded as a moving pixel if two of the following three metrics are above predefined thresholds: (1) the static stereo photometric error using D_t, i.e., pe_{t_S}(x, D_t(x)); (2) the average temporal stereo photometric error using D_t^S, i.e., pe_{t'}(x, D_t^S(x)); (3) the difference between D_t(x) and D_t^S(x). Please refer to our supplementary materials for more details. Figure 3 shows some examples of the generated auxiliary ground-truth moving object masks.

MaskModule Refinement. The bootstrapping stage for MaskModule is limited in two ways: (1) heavy augmentation is needed, since mostly only a very small percentage of pixels in an image belongs to moving objects; (2) the auxiliary masks are not necessarily related to the geometric prior in the cost volume, which slows down the convergence. Therefore, to improve the mask prediction, we utilize the trained DepthModule from the bootstrapping stage. We leverage the fact that the depth prediction for moving objects, and consequently the photometric consistency, should be better with a static stereo prediction than with a temporal stereo one. Therefore, similar to the classification of moving pixels as explained in the previous section, we obtain D_t^S and D_t from two forward passes using C^S and C as inputs, respectively. Then we compute the static stereo photometric error L_{self,s}^S using D_t^S as depth and the temporal stereo photometric error L_{self,s}^T using D_t as depth. To train M_t, we interpret it as pixel-wise interpolation factors between L_{self,s}^S and L_{self,s}^T, and minimize the summation:

    L_{m_ref} = Σ_{s=0}^{3} ( M_t L_{depth,s}^S + (1 − M_t) L_{depth,s}^T ) + L_mask.    (8)

Figure 4(a) shows the diagram illustrating the different loss terms. Note that we still add the supervised mask loss L_mask as a regularizer to stabilize the training. This way, the new gradients are directly related to the geometric structure in the cost volume and help to improve the mask prediction accuracy and alleviate the danger of overfitting.

DepthModule Refinement. The bootstrapping stage does not distinguish between moving and static pixels when training DepthModule. Therefore, we aim to refine DepthModule such that it is able to predict proper depths also for moving objects. The key idea is that, by utilizing M_t, only the static stereo loss is backpropagated for moving pixels, while for static pixels the temporal stereo, static stereo and sparse depth losses are backpropagated. Because moving objects make up only a small percentage of all pixels in a keyframe, the gradients from the photometric error are rather weak. To solve this, we perform a further static stereo forward pass and use the resulting depth map D_t^S as a prior for moving objects. Therefore, as shown in Figure 4(b), the loss for refining DepthModule is defined as

    L_{d_ref,s} = (1 − M_t)(L_{self,s} + α L_{sparse,s}) + M_t (L_{self,s}^S + γ ||D_t − D_t^S||_1) + β L_{smooth,s}.    (9)

Figure 4: Refinement Losses: a) MaskModule refinement and b) DepthModule refinement loss functions. Dashed outlines denote that no gradient is being computed for the respective forward pass in the module.

3.4.1 Implementation Details

The networks are implemented in PyTorch [38] with image size 512 × 256. For the bootstrapping stage, we train DepthModule for 70 epochs with learning rate lr = 1e−4 for the first 65 epochs and lr = 1e−5 for the remaining ones. MaskModule is trained for 60 epochs with lr = 1e−4. During MaskModule refinement, we train for 32 epochs with lr = 1e−4, and during DepthModule refinement we train for 15 epochs with lr = 1e−4 and another 4 epochs at lr = 1e−5. The hyperparameters α, β and γ are set to 4, 10⁻³ × 2⁻ˢ and 4, respectively. For inference, MonoRec can achieve 10 fps with batch size 1 using 2 GB of memory.

4. Experiments

To evaluate the proposed method, we first compare against state-of-the-art monocular depth prediction and MVS methods with our train/test split of the KITTI dataset [15]. Then, we perform extensive ablation studies to show the efficacy of our design choices. In the end, we demonstrate the generalization capabilities of the different methods on Oxford RobotCar [35] and TUM-Mono [9] using the model trained on KITTI.

Figure 5: Qualitative Results on KITTI: The upper part of the figure shows the results for a selected number of frames from the KITTI test set. The compared PackNet model was trained in a semi-supervised fashion using LiDAR as the ground truth. Besides the depth maps, we also show the 3D point clouds obtained by reprojecting the depth and viewing from two different perspectives. For comparison we show the LiDAR ground truth from the corresponding perspectives. Our method clearly shows the best prediction quality. The lower part of the figure shows large-scale reconstructions as point clouds accumulated from multiple frames. The red insets depict the reconstructed artifacts from moving objects. With the proposed MaskModule, we can effectively filter out the moving objects to avoid those artifacts in the final reconstruction.
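The quantitative comparisons below report the standard monocular-depth error metrics (Abs Rel, Sq Rel, RMSE, RMSE log, and the threshold accuracies δ < 1.25^k). For reference, a typical implementation is sketched here, following the common KITTI evaluation conventions (valid-pixel masking and an 80 m depth cap, as mentioned in Sec. 4.1); exact masking details vary between papers.

```python
import numpy as np

def depth_metrics(pred, gt, cap=80.0):
    """Abs Rel, Sq Rel, RMSE, RMSE log and threshold accuracies
    (delta < 1.25^k) over pixels with valid ground truth."""
    valid = (gt > 0) & (gt <= cap)
    p = np.clip(pred[valid], 1e-3, cap)
    g = gt[valid]
    thresh = np.maximum(p / g, g / p)
    return {
        "abs_rel":  np.mean(np.abs(p - g) / g),
        "sq_rel":   np.mean((p - g) ** 2 / g),
        "rmse":     np.sqrt(np.mean((p - g) ** 2)),
        "rmse_log": np.sqrt(np.mean((np.log(p) - np.log(g)) ** 2)),
        "d1": np.mean(thresh < 1.25),
        "d2": np.mean(thresh < 1.25 ** 2),
        "d3": np.mean(thresh < 1.25 ** 3),
    }
```

Note that the threshold accuracies are symmetric in prediction and ground truth, so a uniform 30% over-prediction fails δ < 1.25 but still passes δ < 1.25².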
| Method                    | Training | Dataset        | Input | Abs Rel | Sq Rel | RMSE  | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³ |
|---------------------------|----------|----------------|-------|---------|--------|-------|----------|--------|---------|---------|
| Colmap [44] (geometric)   | –        | –              | KF+2  | 0.099   | 3.451  | 5.632 | 0.184    | 0.952  | 0.979   | 0.986   |
| Colmap [44] (photometric) | –        | –              | KF+2  | 0.190   | 6.826  | 7.781 | 0.531    | 0.893  | 0.932   | 0.947   |
| Monodepth2 [17]           | MS       | Eigen Split    | KF    | 0.082   | 0.405  | 3.129 | 0.127    | 0.931  | 0.985   | 0.996   |
| PackNet [20]              | MS       | CS+Eigen Split | KF    | 0.080   | 0.331  | 2.914 | 0.124    | 0.929  | 0.987   | 0.997   |
| PackNet [20]              | MS, D    | CS+Eigen Split | KF    | 0.077   | 0.290  | 2.688 | 0.118    | 0.935  | 0.988   | 0.997   |
| DORN [11]                 | D        | Eigen Split    | KF    | 0.077   | 0.290  | 2.723 | 0.113    | 0.949  | 0.988   | 0.996   |
| DeepMVS [23]              | D        | Eigen Split    | KF+2  | 0.103   | 1.160  | 3.968 | 0.166    | 0.896  | 0.947   | 0.978   |
| DeepMVS [23] (pretr.)     | D        | Odom. Split    | KF+2  | 0.088   | 0.644  | 3.191 | 0.146    | 0.914  | 0.955   | 0.982   |
| DeepTAM [66] (only FB)    | MS, D*   | Odom. Split    | KF+2  | 0.059   | 0.474  | 2.769 | 0.096    | 0.964  | 0.987   | 0.994   |
| DeepTAM [66] (1x Ref.)    | MS, D*   | Odom. Split    | KF+2  | 0.053   | 0.351  | 2.480 | 0.089    | 0.971  | 0.990   | 0.995   |
| MonoRec                   | MS, D*   | Odom. Split    | KF+2  | 0.050   | 0.295  | 2.266 | 0.082    | 0.973  | 0.991   | 0.996   |

Table 1: Quantitative Results on KITTI: Comparison between MonoRec and other methods on our KITTI test set. The Dataset column shows the training dataset used by the corresponding method; please note that the Eigen split is a superset of our odometry split. Best / second best results are marked bold / underlined. The evaluation shows that our method achieves overall the best performance. Legend: M: Monocular images, S: Stereo images, D: GT depth, D*: Depths from DVSO, KF: Keyframe, KF + 2: Keyframe + 2 mono frames, CS: Cityscapes [4], pretr.: Pretrained network, FB: Fixed band module of DeepTAM, Ref.: Narrow band refinement module of DeepTAM.

Figure 6: Qualitative Improvement: Effects of cost volume masking and depth refinement. (a) Keyframe, (b) w/o MaskModule, (c) MaskModule, (d) MaskModule + D. Ref.

4.1. The KITTI Dataset

The Eigen split [6] is the most popular training/test split for evaluating depth estimation on KITTI. We cannot make use of it directly, since MonoRec requires temporally continuous images with estimated poses. Hence, we select our training/testing splits as the intersection between the KITTI Odometry benchmark and the Eigen split, which results in 13714/8634 samples for training/testing. We obtain the relative poses between the images from the monocular VO system DVSO [59]. During training, we also leverage the point clouds generated by DVSO as the sparse depth supervision signals. For training MaskModule we only use images that contain moving objects in the generated auxiliary masks, 2412 in total. For all the following evaluation results we use the improved ground truth [50] and cap depths at 80 m.

We first compare our method against the recent state of the art, including an optimization-based method (Colmap), self-supervised monocular methods (MonoDepth2 and PackNet), a semi-supervised monocular method using sparse LiDAR data (PackNet), a supervised monocular method (DORN) and MVS methods (DeepMVS and DeepTAM), shown in Table 1. Note that the training code of DeepTAM was not published; we therefore implemented it ourselves for training and testing using our split to deliver a fair comparison. Our method outperforms all the other methods by a notable margin despite relying on images only, without using LiDAR ground truth for training.

This is also clearly reflected in the qualitative results shown in Figure 5. Compared with the monocular depth estimation methods, our method delivers very sharp edges in the depth maps and can recover finer details. In comparison to the other MVS methods, it can better deal with moving objects, which is further illustrated in Figure 7.

A single depth map usually cannot really reflect the quality of a large-scale reconstruction. We therefore also visualize the accumulated points using the depth maps from multiple frames in the lower part of Figure 5. We can see that our method can deliver very high quality reconstructions and, due to our MaskModule, is able to remove the artifacts caused by moving objects. We urge readers to watch the supplementary video for more convincing comparisons.

Ablation Studies. We also investigated the contribution of the different components towards the method's performance. Table 2 shows the quantitative results of our ablation studies, which confirm that all our proposed contributions improve the depth prediction over the baseline method. Furthermore, Figure 6 demonstrates the qualitative improvement achieved by MaskModule and the refinement training.

| Model    | SSIM | MaskModule | D. Ref. | M. Ref. | Abs Rel | Sq Rel | RMSE  | RMSE log | δ<1.25 | δ<1.25² | δ<1.25³ |
|----------|------|------------|---------|---------|---------|--------|-------|----------|--------|---------|---------|
| Baseline |      |            |         |         | 0.056   | 0.342  | 2.624 | 0.092    | 0.965  | 0.990   | 0.994   |
| Baseline | ✓    |            |         |         | 0.054   | 0.346  | 2.444 | 0.088    | 0.970  | 0.989   | 0.995   |
| MonoRec  | ✓    | ✓          |         |         | 0.054   | 0.306  | 2.372 | 0.087    | 0.970  | 0.990   | 0.995   |
| MonoRec  | ✓    | ✓          | ✓       |         | 0.051   | 0.346  | 2.361 | 0.085    | 0.972  | 0.990   | 0.995   |
| MonoRec  | ✓    | ✓          |         | ✓       | 0.052   | 0.302  | 2.303 | 0.087    | 0.969  | 0.990   | 0.995   |
| MonoRec  | ✓    | ✓          | ✓       | ✓       | 0.050   | 0.295  | 2.266 | 0.082    | 0.973  | 0.991   | 0.996   |

Table 2: Ablation Study: The baseline consists of only DepthModule using the unmasked cost volume (CV). The baseline without SSIM uses a 5×5 patch that has the same receptive field as SSIM. Using SSIM to form the CV gives a significant improvement. For MonoRec, only the addition of MaskModule without refinement does not yield significant improvements. The DepthModule refinement gives a major improvement. The best performance is achieved by combining all the proposed components.

Figure 7: Comparison on Moving Objects Depth Estimation: In comparison to other MVS methods, MonoRec is able to predict plausible depths. Furthermore, the depth prediction has less noise and artifacts in static regions of the scene.

4.2. Oxford RobotCar and TUM-Mono

To demonstrate the generalization capabilities of MonoRec, we test our KITTI model on the Oxford RobotCar dataset and the TUM-Mono dataset. Oxford RobotCar is a street view dataset and shows a similar motion pattern and view perspective to KITTI. TUM-Mono, however, is recorded by a handheld monochrome camera, so it demonstrates very different motion and image quality compared to KITTI. The results are shown in Figure 8. The monocular methods struggle to generalize to a new context. The compared MVS methods show more artifacts and cannot predict plausible depths for the moving objects. In contrast, our method is able to generalize well to the new scenes for both depth and moving object predictions. Since Oxford RobotCar also provides LiDAR depth data, we further show a quantitative evaluation in the supplementary material.

Figure 8: Oxford RobotCar and TUM-Mono: All results are obtained by the respective best-performing variant in Table 1. MonoRec shows stronger generalization capability than the monocular methods. Compared to DeepMVS and DeepTAM, MonoRec delivers depth maps with fewer artifacts and additionally predicts the moving object masks.
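The accumulated point clouds of the kind shown in Figure 5 can be produced by back-projecting each keyframe's depth map with the camera intrinsics and dropping pixels flagged by the moving-object mask. This is an illustrative sketch with hypothetical intrinsics; in the full pipeline the per-frame points would additionally be transformed into a common world frame using the camera poses.

```python
import numpy as np

def backproject(depth, K, moving_mask=None):
    """Back-project a depth map (H, W) into 3D camera-frame points using
    intrinsics K (3x3); optionally drop pixels flagged as moving."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))           # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T
    rays = np.linalg.inv(K) @ pix                            # rays at depth 1
    pts = (rays * depth.reshape(-1)).T                       # (H*W, 3)
    if moving_mask is None:
        return pts
    return pts[~moving_mask.reshape(-1)]

K = np.array([[100.0,   0.0, 2.0],
              [  0.0, 100.0, 2.0],
              [  0.0,   0.0, 1.0]])  # hypothetical intrinsics
depth = np.full((4, 4), 2.0)
mask = np.zeros((4, 4), dtype=bool)
mask[1, 1] = True                    # drop one "moving" pixel
pts = backproject(depth, K, mask)
```

The pixel at the principal point (u = v = 2) maps to the point (0, 0, 2), i.e. straight ahead at its depth, and the masked pixel contributes no point to the cloud.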
5. Conclusion

We have presented MonoRec, a deep architecture that estimates accurate dense 3D reconstructions from only a single moving camera. We first propose to use SSIM as the photometric measurement to construct the cost volumes. To deal with dynamic objects, we propose a novel MaskModule which predicts moving object masks from the input cost volumes. With the predicted masks, the proposed DepthModule is able to estimate accurate depths for both static and dynamic objects. Additionally, we propose a novel multi-stage training scheme together with a semi-supervised loss formulation for training the depth prediction. All combined, MonoRec is able to outperform state-of-the-art MVS and monocular depth prediction methods both qualitatively and quantitatively on KITTI, and also shows strong generalization capability on Oxford RobotCar and TUM-Mono. We believe that this capacity to recover accurate dense 3D reconstructions from a single moving camera will help to establish the camera as the lead sensor for autonomous systems.

Acknowledgement. This work was supported by the Munich Center for Machine Learning and by the ERC Advanced Grant SIMULACRON.

References

[1] M. Bloesch, J. Czarnowski, R. Clark, S. Leutenegger, and A. J. Davison. CodeSLAM – learning a compact, optimisable representation for dense visual SLAM. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2560–2568, 2018.
[2] Neill D. F. Campbell, George Vogiatzis, Carlos Hernández, and Roberto Cipolla. Using multiple hypotheses to improve depth-maps for multi-view stereo. In European Conference on Computer Vision (ECCV), pages 766–779, 2008.
[3] Rui Chen, Songfang Han, Jing Xu, and Hao Su. Point-based multi-view stereo network. In International Conference on Computer Vision (ICCV), 2019.
[4] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3213–3223, 2016.
[5] Jan Czarnowski, Tristan Laidlow, Ronald Clark, and Andrew J. Davison. DeepFactors: Real-time probabilistic dense monocular SLAM. IEEE Robotics and Automation Letters (RA-L), 5(2):721–728, 2020.
[6] David Eigen and Rob Fergus. Predicting depth, surface normals and semantic labels with a common multi-scale convolutional architecture. In International Conference on Computer Vision (ICCV), pages 2650–2658, 2015.
[7] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Neural Information Processing Systems (NIPS), 2014.
[8] Jakob Engel, Vladlen Koltun, and Daniel Cremers. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 40(3):611–625, 2018.
[9] Jakob Engel, Vladyslav Usenko, and Daniel Cremers. A photometrically calibrated benchmark for monocular visual odometry. In arXiv, July 2016.
[10] Alejandro Fontan, Javier Civera, and Rudolph Triebel. Information-driven direct RGB-D odometry. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4929–4937, 2020.
[11] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2002–2011, 2018.
[12] Yasutaka Furukawa and Jean Ponce. Accurate, dense, and robust multiview stereopsis. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pages 1362–1376, 2010.
[13] Silvano Galliani, Katrin Lasinger, and Konrad Schindler. Massively parallel multiview stereopsis by surface normal diffusion. In International Conference on Computer Vision (ICCV), 2015.
[14] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR), pages 1229–1235, 2013.
[15] Andreas Geiger, Philip Lenz, and Raquel Urtasun. Are we ready for autonomous driving? The KITTI vision benchmark suite. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3354–3361. IEEE, 2012.
[16] Clement Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[17] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J. Brostow. Digging into self-supervised monocular depth estimation. In International Conference on Computer Vision (ICCV), pages 3828–3838, 2019.
[18] Ariel Gordon, Hanhan Li, Rico Jonschkowski, and Anelia Angelova. Depth from videos in the wild: Unsupervised monocular depth learning from unknown cameras. In International Conference on Computer Vision (ICCV), 2019.
[19] Xiaodong Gu, Zhiwen Fan, Siyu Zhu, Zuozhuo Dai, Feitong Tan, and Ping Tan. Cascade cost volume for high-resolution multi-view stereo and stereo matching. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
[20] Vitor Guizilini, Rares Ambrus, Sudeep Pillai, Allan Raventos, and Adrien Gaidon. 3D packing for self-supervised monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2485–2494, 2020.
[21] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
[22] Yuxin Hou, Juho Kannala, and Arno Solin. Multi-view stereo by temporal nonparametric fusion. In International Conference on Computer Vision (ICCV), 2019.
[23] Po-Han Huang, Kevin Matzen, Johannes Kopf, Narendra Ahuja, and Jia-Bin Huang. DeepMVS: Learning multi-view stereopsis. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2821–2830, 2018.
[24] Sunghoon Im, Hae-Gon Jeon, Stephen Lin, and In So Kweon. DPSNet: End-to-end deep plane sweep stereo. In International Conference on Learning Representations (ICLR), 2019.
[25] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. In Neural Information Processing Systems (NIPS), pages 2017–2025, 2015.
[26] Mengqi Ji, Jürgen Gall, Haitian Zheng, Yebin Liu, and Lu Fang. SurfaceNet: An end-to-end 3D neural network for multiview stereopsis. In International Conference on Computer Vision (ICCV), pages 2326–2334, 2017.
[27] Abhishek Kar, Christian Häne, and Jitendra Malik. Learning a multi-view stereo machine. In Neural Information Processing Systems (NIPS), pages 364–375, 2017.
[28] Kiriakos N. Kutulakos and Steven M. Seitz. A theory of shape by space carving. In International Conference on Computer Vision (ICCV), 1999.
[29] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In International Conference on 3D Vision (3DV), 2016.
[30] Maxime Lhuillier and Long Quan. A quasi-dense approach to surface reconstruction from uncalibrated images. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), pages 418–433, 2005.
[31] Bo Li, Chunhua Shen, Yuchao Dai, Anton van den Hengel, and Mingyi He. Depth and surface normal estimation from monocular images using regression on deep features and hierarchical CRFs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
[32] Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Noah Snavely, Ce Liu, and William T. Freeman. Learning the depths of moving people by watching frozen people. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4521–4530, 2019.
[33] Keyang Luo, Tao Guan, Lili Ju, Haipeng Huang, and Yawei Luo. P-MVSNet: Learning patch-wise matching confidence aggregation for multi-view stereo. In International Conference on Computer Vision (ICCV), 2019.
[34] Xuan Luo, Jia-Bin Huang, Richard Szeliski, Kevin Matzen, and Johannes Kopf. Consistent video depth estimation. ACM Transactions on Graphics, 39(4), 2020.
[35] Will Maddern, Geoff Pascoe, Chris Linegar, and Paul Newman. 1 year, 1000 km: The Oxford RobotCar dataset. International Journal of Robotics Research (IJRR), 36(1):3–15, 2017.
[36] Zak Murez, Tarrence van As, James Bartolozzi, Ayan Sinha, Vijay Badrinarayanan, and Andrew Rabinovich. Atlas: End-to-end 3D scene reconstruction from posed images. In European Conference on Computer Vision (ECCV), 2020.
[37] Richard A. Newcombe, Steven J. Lovegrove, and Andrew J. Davison. DTAM: Dense tracking and mapping in real-time. In International Conference on Computer Vision (ICCV), 2011.
[38] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. PyTorch: An imperative style, high-performance deep learning library.
[42] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-Net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer Assisted Intervention (MICCAI), pages 234–241. Springer, 2015.
[43] Chris Russell, Rui Yu, and Lourdes Agapito. Video pop-up: Monocular 3D reconstruction of dynamic scenes. In European Conference on Computer Vision (ECCV), pages 583–598. Springer, 2014.
[44] Johannes L. Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4104–4113, 2016.
[45] Johannes L. Schönberger, Enliang Zheng, Jan-Michael Frahm, and Marc Pollefeys. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), pages 501–518, 2016.
[46] Steven M. Seitz and Charles R. Dyer. Photorealistic scene reconstruction by voxel coloring. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1997.
[47] Jan Stühmer, Stefan Gumhold, and Daniel Cremers. Real-time dense geometry from a handheld camera. In DAGM Conference on Pattern Recognition, pages 11–20, 2010.
[48] Keisuke Tateno, Federico Tombari, Iro Laina, and Nassir Navab. CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
[49] Engin Tola, Christoph Strecha, and Pascal Fua. Efficient large-scale multi-view stereo for ultra high-resolution image sets. Machine Vision and Applications (MVA), pages 903–920, 2011.
[50] Jonas Uhrig, Nick Schneider, Lukas Schneider, Uwe Franke, Thomas Brox, and Andreas Geiger. Sparsity invariant CNNs. In International Conference on 3D Vision (3DV), pages 11–20. IEEE, 2017.
[51] Vladyslav Usenko, Nikolaus Demmel, David Schubert, Jörg Stückler, and Daniel Cremers. Visual-inertial mapping with non-linear factor recovery. IEEE Robotics and Automation Letters (RA-L), 5(2):422–429, 2020.
[52] Chaoyang Wang, Jose Miguel Buenaposada, Rui Zhu, and Simon Lucey. Learning depth from monocular videos using direct methods. In IEEE Conference on Computer Vision and
In Pattern Recognition (CVPR), 2018. + Advances in neural information processing systems, pages + 8026–8037, 2019. [53] Kaixuan Wang and Shaojie Shen. MVDepthNet: Real-time + multiview depth estimation neural network. In International +[39] Matia Pizzoli, Christian Forster, and Davide Scaramuzza. Conference on 3D Vision (3DV), 2018. + REMODE: Probabilistic, monocular dense reconstruction in + real time. In IEEE International Conference on Robotics and [54] Zhou Wang, Alan C Bovik, Hamid R Sheikh, and Eero P Si- + Automation (ICRA), 2014. moncelli. Image quality assessment: from error visibility to + structural similarity. IEEE transactions on image processing, +[40] Rene Ranftl, Vibhav Vineet, Qifeng Chen, and Vladlen 13(4):600–612, 2004. + Koltun. Dense monocular depth estimation in complex dy- + namic scenes. In IEEE Conference on Computer Vision and [55] Jamie Watson, Oisin Mac Aodha, Victor Prisacariu, Gabriel + Pattern Recognition (CVPR), pages 4058–4066, 2016. Brostow, and Michael Firman. The temporal opportunist: + Self-supervised multi-frame monocular depth. In IEEE +[41] Andrea Romanoni and Matteo Matteucci. TAPA-MVS: Conference on Computer Vision and Pattern Recognition + Textureless-aware PAtchMatch multi-view stereo. In Inter- (CVPR), 2021. + national Conference on Computer Vision (ICCV), 2019. + [56] Youze Xue, Jiansheng Chen, Weitao Wan, Yiqing Huang, +[42] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U- Cheng Yu, Tianpeng Li, and Jiayu Bao. MVSCRF: Learning + Net: Convolutional networks for biomedical image segmen- multi-view stereo with conditional random fields. In Inter- + tation. In International Conference on Medical Image Com- national Conference on Computer Vision (ICCV), 2019. + + 6117 +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:26:54 UTC from IEEE Xplore. Restrictions apply. + [57] Jiayu Yang, Wei Mao, Jose M. Alvarez, and Miaomiao Liu. 
+ Cost volume pyramid based depth inference for multi-view + stereo. In IEEE Conference on Computer Vision and Pattern + Recognition (CVPR), 2020. + + [58] Nan Yang, Lukas von Stumberg, Rui Wang, and Daniel Cre- + mers. D3VO: Deep depth, deep pose and deep uncertainty + for monocular visual odometry. In IEEE Conference on + Computer Vision and Pattern Recognition (CVPR), 2020. + + [59] Nan Yang, Rui Wang, Jo¨rg Stu¨ckler, and Daniel Cremers. + Deep virtual stereo odometry: Leveraging deep depth predic- + tion for monocular direct sparse odometry. In European Con- + ference on Computer Vision (ECCV), pages 817–833, 2018. + + [60] Yao Yao, Shiwei Li, Siyu Zhu, Hanyu Deng, Tian Fang, and + Long Quan. Relative camera refinement for accurate dense + reconstruction. In International Conference on 3D Vision + (3DV), 2017. + + [61] Yao Yao, Zixin Luo, Shiwei Li, Tian Fang, and Long + Quan. MVSNet: Depth inference for unstructured multi- + view stereo. In European Conference on Computer Vision + (ECCV), pages 785–801, 2018. + + [62] Yao Yao, Zixin Luo, Shiwei Li, Tianwei Shen, Tian Fang, + and Long Quan. Recurrent MVSNet for high-resolution + multi-view stereo depth inference. In IEEE Conference on + Computer Vision and Pattern Recognition (CVPR), 2019. + + [63] Zhichao Yin and Jianping Shi. GeoNet: Unsupervised learn- + ing of dense depth, optical flow and camera pose. In IEEE + Conference on Computer Vision and Pattern Recognition + (CVPR), 2018. + + [64] Zehao Yu and Shenghua Gao. Fast-MVSNet: Sparse-to- + dense multi-view stereo with learned propagation and gauss- + newton refinement. In IEEE Conference on Computer Vision + and Pattern Recognition (CVPR), 2020. + + [65] Huangying Zhan, Ravi Garg, Chamara Saroj Weerasekera, + Kejie Li, Harsh Agarwal, and Ian M. Reid. Unsupervised + learning of monocular depth estimation and visual odometry + with deep feature reconstruction. In IEEE Conference on + Computer Vision and Pattern Recognition (CVPR), 2018. 
diff --git a/动态slam/2020年-2022年开源动态SLAM/2021年/RDS-SLAM_Real-Time_Dynamic_SLAM_Using_Semantic_Segmentation_Methods.pdf b/动态slam/2020年-2022年开源动态SLAM/2021年/RDS-SLAM_Real-Time_Dynamic_SLAM_Using_Semantic_Segmentation_Methods.pdf
new file mode 100644
index 0000000..dc7da67
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2021年/RDS-SLAM_Real-Time_Dynamic_SLAM_Using_Semantic_Segmentation_Methods.pdf

Received December 21, 2020, accepted January 6, 2021, date of publication January 11, 2021, date of current version February 10, 2021.
Digital Object Identifier 10.1109/ACCESS.2021.3050617

RDS-SLAM: Real-Time Dynamic SLAM Using Semantic Segmentation Methods

YUBAO LIU AND JUN MIURA, (Member, IEEE)
Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi 441-8580, Japan
Corresponding author: Yubao Liu (yubao.liu.ra@tut.jp)
This work was supported in part by the Japan Society for the Promotion of Science (JSPS) KAKENHI under Grant 17H01799.

ABSTRACT The scene rigidity is a strong assumption in typical visual Simultaneous Localization and Mapping (vSLAM) algorithms. Such a strong assumption limits the use of most vSLAM systems in dynamic real-world environments, which are the target of several relevant applications such as augmented reality, semantic mapping, unmanned autonomous vehicles, and service robotics.
Many solutions have been proposed that use different kinds of semantic segmentation methods (e.g., Mask R-CNN, SegNet) to detect dynamic objects and remove outliers. However, as far as we know, such methods wait for the semantic results in the tracking thread, and their processing time depends on the segmentation method used. In this paper, we present RDS-SLAM, a real-time visual dynamic SLAM algorithm that is built on ORB-SLAM3 and adds a semantic thread and a semantic-based optimization thread for robust tracking and mapping in dynamic environments in real time. These novel threads run in parallel with the others, and therefore the tracking thread no longer needs to wait for the semantic information. Besides, we propose an algorithm to obtain the latest semantic information possible, thereby making it possible to use segmentation methods with different speeds in a uniform way. We update and propagate semantic information using the moving probability, which is saved in the map and used to remove outliers from tracking using a data association algorithm. Finally, we evaluate the tracking accuracy and real-time performance using the public TUM RGB-D datasets and a Kinect camera in dynamic indoor scenarios. Source code and demo: https://github.com/yubaoliu/RDS-SLAM.git

INDEX TERMS Dynamic SLAM, ORB SLAM, Mask R-CNN, SegNet, real-time.

I. INTRODUCTION
Simultaneous localization and mapping (SLAM) [1] is a fundamental technique for many applications such as augmented reality (AR), robotics, and unmanned autonomous vehicles (UAV). Visual SLAM (vSLAM) [2] uses the camera as the input and is useful in scene understanding and decision making. However, the strong assumption of scene rigidity limits the use of most vSLAM in real-world environments. Dynamic objects will cause many bad or unstable data associations that accumulate drift during the SLAM process. In Fig. 1, for example, assume m1 is on a person and its position changes in the scene. The bad or unstable data associations (the red lines in Fig. 1) will lead to incorrect camera ego-motion estimation in dynamic environments. Usually, there are two basic requirements for vSLAM: robustness in tracking and real-time performance. Therefore, how to detect dynamic objects in a populated scene and prevent the tracking algorithm from using data associations related to such dynamic objects in real time is the challenge to allow vSLAM to be deployed in the real world.

FIGURE 1. Example of data association in vSLAM under a dynamic scene. Ft (t ≥ 0) is the frame and KFt is the selected keyframe. mi, i ∈ {0, 1, ...} is the map point. Assume m1 moved to a new position m1′ because it belongs to a moving object. The red lines indicate the unstable or bad data associations.

We classify the solutions into two classes: pure geometric-based [3]–[7] and semantic-based [8]–[13] methods. The geometric-based approaches cannot remove all potential dynamic objects, e.g., people who are sitting. Features on such objects are unreliable and also need to be removed from tracking and mapping. The semantic-based methods use semantic segmentation or object detection approaches to obtain pixel-wise masks or bounding boxes of potential dynamic objects. Sitting people can be detected and removed from tracking and mapping using the semantic information, and a map of static objects can be built. Usually, in semantic-based methods, geometric checks, such as Random Sample Consensus (RANSAC) [14] and multi-view geometry, are also used to remove outliers.

These semantic-based methods first detect or segment objects and then remove outliers from tracking. The tracking thread has to wait for semantic information before tracking (camera ego-motion estimation), which is called the blocked model in this paper (as shown in Fig. 2). Their processing speed is limited by the speed of the semantic segmentation method used. For example, Mask R-CNN requires about 200 ms [15] to segment one image, and this will limit the real-time performance of the entire system.

FIGURE 2. Blocked model. The semantic model can use different kinds of segmentation methods, e.g., Mask R-CNN and SegNet. Note that this is not exactly the same as the semantic-based methods mentioned in [8]–[13]. The tracking process is blocked to wait for the results of the semantic model.

Our main challenge is how to execute vSLAM in real time under dynamic scenes with various pixel-wise semantic segmentation methods that run at different speeds, such as SegNet and Mask R-CNN. We propose a semantic thread to wait for the semantic information. It runs in parallel with the tracking thread, and the tracking thread does not need to wait for the segmentation result. Therefore, the tracking thread can execute in real time. We call this the non-blocked model in this paper. Faster segmentation methods (e.g., SegNet) can update semantic information more frequently than slower methods (e.g., Mask R-CNN). Although we cannot control the segmentation speed, we can use a strategy to obtain the latest semantic information possible to remove outliers from the current frame.

Because the semantic thread runs in parallel with the tracking thread, we use the map points to save and share the semantic information. As shown in Fig. 1, we update and propagate semantic information using the moving probability and classify map points into three categories, static, dynamic, and unknown, according to the moving probability thresholds. These classified map points will be used to select data associations that are as stable as possible in tracking.

The main contributions of this paper are:
(1) We propose a novel semantic-based real-time dynamic vSLAM algorithm, RDS-SLAM, in which the tracking thread does not need to wait for the semantic results anymore. This method efficiently and effectively uses semantic segmentation results for dynamic object detection and outlier removal while keeping the algorithm's real-time nature.
(2) We propose a keyframe selection strategy that uses the latest semantic information possible for outlier removal with semantic segmentation methods of different speeds in a uniform way.
(3) We show that the real-time performance of the proposed method is better than that of existing similar methods using the TUM dataset.

The rest of the paper is structured as follows. Section II discusses related work. Section III describes a system overview. Sections IV, V, and VI detail the implementation of the proposed methods. Section VII shows experimental results, and Section VIII presents the conclusions and discusses future work.

The associate editor coordinating the review of this manuscript and approving it for publication was Heng Wang.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/

II. RELATED WORK
A. VISUAL SLAM
vSLAM [2] can be classified into feature-based methods and direct methods. Mur-Artal et al. presented ORB-SLAM2 [16], a complete SLAM system for monocular, stereo, and RGB-D cameras, which works in real time on standard CPUs in a wide variety of environments. This system estimates the ego-motion of the camera by matching the corresponding ORB [17] features between the current frame and previous frames, and has three parallel threads: tracking, local mapping, and loop closing. Carlos et al. proposed the latest version, ORB-SLAM3 [18], mainly adding two novelties: 1) a feature-based tightly-integrated visual-inertial SLAM that fully relies on maximum-a-posteriori (MAP) estimation; 2) a multiple map system (ATLAS [19]) that relies on a new place recognition method with improved recall. In contrast to feature-based methods, direct methods operate on pixel intensities directly. For example, Kerl et al. proposed a dense visual SLAM method, DVO [20], for RGB-D cameras that minimizes both the photometric and the depth error over all pixels. However, none of the above methods can address the common problem of dynamic objects. Detecting and dealing with dynamic objects in a dynamic scene in real time is a challenging task in vSLAM.
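The blocked vs. non-blocked distinction described in the introduction is essentially a producer/consumer question: tracking produces keyframes, and a separate semantic thread consumes them at its own pace. The following is only a rough illustration of that idea, not the paper's code; all names (SharedMap, run_demo, the 0.9 probability) are our placeholders.

```python
import queue
import threading
import time

class SharedMap:
    """Stand-in for the atlas: map points carry a moving probability."""
    def __init__(self):
        self.moving_prob = {}               # keyframe id -> probability
        self._lock = threading.Lock()

    def update(self, kf_id, prob):
        with self._lock:
            self.moving_prob[kf_id] = prob

def semantic_worker(kf_queue, shared_map, stop):
    """Semantic thread: drains pending keyframes at its own (slower) pace."""
    while not stop.is_set() or not kf_queue.empty():
        try:
            kf_id = kf_queue.get(timeout=0.01)
        except queue.Empty:
            continue
        time.sleep(0.005)                   # stand-in for a slow segmentation call
        shared_map.update(kf_id, 0.9)       # e.g. features that fall on a person

def run_demo(n_frames=20, kf_every=2):
    shared_map = SharedMap()
    kf_queue = queue.Queue()
    stop = threading.Event()
    worker = threading.Thread(target=semantic_worker,
                              args=(kf_queue, shared_map, stop))
    worker.start()
    tracked = []
    for frame_id in range(n_frames):        # tracking loop: never blocks on semantics
        if frame_id % kf_every == 0:
            kf_queue.put(frame_id)          # hand the keyframe to the semantic thread
        tracked.append(frame_id)            # ego-motion estimation would happen here
    stop.set()
    worker.join()
    return tracked, shared_map.moving_prob
```

The point of the sketch is only that the tracking loop finishes all frames without ever waiting on the segmentation call; the semantic results arrive later through the shared map.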
Our work follows the implementation of ORB-SLAM3 [18]. The ORB-SLAM3 concepts of keyframe, covisibility graph, ATLAS, and bundle adjustment (BA) are also used in our implementation.

1) KEYFRAME
Keyframes [18] are a subset of selected frames used to avoid unnecessary redundancy in tracking and optimization. Each keyframe stores 1) a rigid body transformation of the camera pose that transforms points from the world to the camera coordinate system; 2) the ORB features, whether or not they are associated with a map point. In this paper, keyframes are selected by the same policy as in ORB-SLAM3; a keyframe is selected if all the following conditions are met: 1) 20 frames have passed since the last global relocalization or the last keyframe insertion; 2) the local mapping thread is idle; 3) the current frame tracks at least 50 points, but fewer than 90% of the points tracked by the reference keyframe.

2) COVISIBILITY GRAPH
The covisibility graph [16] is represented as an undirected weighted graph, in which each node is a keyframe and each edge holds the number of commonly observed map points.

3) ATLAS
The Atlas [19] is a multi-map representation that handles an unlimited number of sub-maps. Two kinds of maps, the active map and non-active maps, are managed in the atlas. When camera tracking is considered lost and relocalization has failed for a few frames, the active map becomes a non-active map, and a new map is initialized. In the atlas, keyframes and map points are managed using the covisibility graph and the spanning tree.

4) BUNDLE ADJUSTMENT (BA)
BA [21] is the problem of refining a visual reconstruction to produce jointly optimal 3D structure and viewing parameter estimates. Local BA is used in the local mapping thread to optimize only the camera pose. Loop closing launches a thread to perform full BA after the pose-graph optimization to jointly optimize the camera pose and the corresponding landmarks.
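The keyframe policy listed under 1) KEYFRAME can be condensed into a single predicate. This is one plausible reading of the three conditions (in particular, we read condition 3 as a conjunction); the function and argument names are ours, not ORB-SLAM3's.

```python
def should_insert_keyframe(frames_since_last_kf: int,
                           frames_since_reloc: int,
                           local_mapping_idle: bool,
                           tracked_points: int,
                           ref_kf_points: int) -> bool:
    """Sketch of the ORB-SLAM3-style keyframe policy summarized above."""
    # 1) enough frames since the last relocalization and keyframe insertion
    enough_frames = frames_since_last_kf >= 20 and frames_since_reloc >= 20
    # 3a) tracking is still healthy enough to anchor a keyframe
    enough_points = tracked_points >= 50
    # 3b) the view has drifted away from the reference keyframe
    losing_overlap = tracked_points < 0.9 * ref_kf_points
    # 2) local mapping must be idle
    return enough_frames and local_mapping_idle and enough_points and losing_overlap
```

For example, a frame tracking 60 points against a 100-point reference keyframe, 25 frames after the last keyframe, qualifies; the same frame with the local mapping thread busy does not.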
B. GEOMETRIC-BASED SOLUTIONS
Li et al. [5] proposed a real-time depth edge-based RGB-D SLAM system for dynamic environments based on frame-to-keyframe registration. They only use depth edge points, which have an associated weight indicating their probability of belonging to a dynamic object. Sun et al. [6] classify pixels using the segmentation of the quantized depth image and calculate the difference in intensity between consecutive RGB images. Tan et al. [3] propose a novel online keyframe representation and updating method to adaptively model dynamic environments. The camera pose can reliably be estimated even in challenging situations using a novel prior-based adaptive RANSAC algorithm to efficiently remove outliers.

Although geometric-based vSLAM solutions can restrict the effect of dynamic objects to some extent, they have some limitations: 1) they cannot detect potential dynamic objects that temporarily keep static; 2) they lack semantic information, so dynamic objects cannot be judged using prior knowledge of the scene.

C. SEMANTIC-BASED SOLUTIONS
DS-SLAM [10], implemented on ORB-SLAM2 [16], combines a semantic segmentation network (SegNet [22]) with a moving consistency check to reduce the impact of dynamic objects and produce a dense semantic octree map [23]. DS-SLAM assumes that the feature points on people are most likely to be outliers. If a person is determined to be static, then matching points on the person can also be used to predict the pose of the camera.

DynaSLAM [9], also built on ORB-SLAM2, is robust in dynamic scenarios for monocular, stereo, and RGB-D datasets, adding the capabilities of dynamic object detection and background inpainting. It can detect moving objects either by multi-view geometry, deep learning, or both, and inpaint the frame background that has been occluded by dynamic objects using a static map of the scene. It uses Mask R-CNN to segment out all the a priori dynamic objects, such as people or vehicles. DynaSLAM II [24] tightly integrates multi-object tracking capability, but it only works for rigid objects; in the dynamic scenes of the TUM [25] dataset, however, people change their shape by sometimes standing and sometimes sitting.

Detect-SLAM [12], also built on ORB-SLAM2, integrates visual SLAM with a single-shot multi-box detector (SSD) [26] to make the two functions mutually beneficial. They call the probability of a feature point belonging to a moving object the moving probability. They distinguish keypoints into four states: high-confidence static, low-confidence static, low-confidence dynamic, and high-confidence dynamic. Considering the delay of detection and the spatio-temporal consistency of successive frames, they only run SSD on the color images of keyframes, meanwhile propagating the probability frame by frame in the tracking thread. Once the detection result is obtained, they insert the keyframe into the local map and update the moving probability of the 3D points in the local map that matched with the keyframe.

DM-SLAM [11] combines Mask R-CNN, optical flow, and an epipolar constraint to judge outliers. Its Ego-motion Estimation module estimates the initial pose of the camera, similar to the Low-cost tracking module in DynaSLAM. DM-SLAM also uses features on a priori dynamic objects, if they are not moving heavily, to reduce the feature-less cases caused by removing all a priori dynamic objects.

Fan et al. [8] proposed a novel semantic SLAM system with a more accurate point cloud map in dynamic environments; they use BlitzNet [27] to obtain the masks and bounding boxes of the dynamic objects in the image.

All these methods use the blocked model: they wait for the semantic results of every frame or keyframe before estimating the camera pose. As a result, their processing speeds are limited by the specific CNN models they use. In this paper, we propose RDS-SLAM, which uses the non-blocked model, and show its real-time performance by comparing it with those methods.

FIGURE 3. System architecture. Models in orange are modified blocks based on ORB-SLAM3. Models in magenta are newly added features. Blocks in blue are important data structures.

III. SYSTEM OVERVIEW
Each frame first passes through the tracking thread. The initial camera pose is estimated for the current frame after being tracked against the last frame, and is further optimized by being tracked against the local map. Then, keyframes are selected, and they are useful in the semantic tracking, semantic-based optimization, and local mapping threads. We modify several models in the tracking and the local mapping threads to remove outliers from camera ego-motion estimation using the semantic information. In the tracking thread, we propose a data association algorithm to use as many features on static objects as possible.

The semantic thread runs in parallel with the others, so as not to block the tracking thread, and saves the semantic information into the atlas. Semantic labels are used to generate the mask image of the a priori dynamic objects. The moving probability of the map points matched with features in the keyframes is updated using the semantic information. Finally, the camera pose is optimized using the semantic information in the atlas.

We will introduce the new features and modified models in the following sections. We skip the detailed explanations of the modules that are the same as those of ORB-SLAM3.
IV. SEMANTIC THREAD
The semantic thread is responsible for generating semantic information and updating it into the atlas map. Before we introduce the detailed implementation of the semantic thread, we use a simple example to explain the general flow, as shown in Fig. 4. Assume that keyframes are selected every two frames. The keyframes are selected by ORB-SLAM3, and we insert them into a keyframe list KF sequentially. Assume that, at time t = 12, KF2-KF6 are inside KF. The next step is to select keyframes from KF to request semantic labels from the semantic server. We call this process the semantic keyframe selection process in this paper. We take one keyframe from the head of KF (KF2) and one from the back of KF (KF6) to request the semantic labels. Then, we calculate the mask of the a priori dynamic objects using semantic labels S2 and S6. Next, we update the moving probability of the map points stored in the atlas. The moving probability will be used later to remove outliers in the tracking thread.

FIGURE 4. Semantic tracking example. Assume keyframe KFn is selected every two frames Fn and inserted into keyframe list KF. We choose keyframes from KF to request semantic labels Sn. Then we update the moving probability into the atlas using the mask image of dynamic objects reproduced from the semantic label. Blue circles stand for static map points and red circles for dynamic map points. Others, marked in green, are unknown.

Alg. 1 shows the detailed implementation of the semantic thread. The first step is to select semantic keyframes from the keyframe list KF (Line 2). Next, we request semantic labels from the semantic model, which returns the semantic labels SLs (Line 3). Lines 4-8 save and process the semantic results for each item returned: Line 6 generates the mask image of dynamic objects, and Line 7 updates the moving probability stored in the atlas. We will introduce each sub-module of the semantic thread sequentially (see Fig. 3).

Algorithm 1 Semantic Tracking Thread
Require: KeyFrame list: KF
1: while not_request_finish() do
2:   SK = semantic_keyframe_selection(KF)
3:   SLs = request_segmentation(SK)
4:   for i = 0; i < SLs.size(); i++ do
5:     KeyFrame kf = SK[i]
6:     kf->mask = GenerateMaskImage(SLs[i])
7:     kf->UpdatePrioriMovingProbability()
8:   end for
9: end while

A. SEMANTIC KEYFRAME SELECTION ALGORITHM
The semantic keyframe selection algorithm selects keyframes for requesting the semantic labels later. We need to keep the real-time performance while using different kinds of semantic segmentation methods. However, some of them, such as Mask R-CNN, are time-consuming, and the current frame in tracking may not obtain new semantic information if we segment every keyframe sequentially.

To evaluate this distance quantitatively, we define the semantic delay as the distance between the id of the latest frame that has a semantic label (St), which holds the latest semantic information, and the id of the current frame (Ft), as follows:

d = FrameID(Ft) − FrameID(St). (1)

Fig. 5 shows the semantic delay for several cases. The general idea is to segment each frame or keyframe sequentially, according to the time sequence, as shown in Fig. 5 (a). We call this kind of model the sequential segmentation model. However, it monotonically increases the time delay when a time-consuming segmentation method is used, as shown by the blue line in Fig. 6. For instance, at time t = 10 (F10), the semantic model has completed the segmentation of KF0 (F0) and the semantic delay is d = 10. Similarly, at time 40 (F40), the semantic delay becomes 34. That is, the last frame that has semantic information is 34 frames behind the current frame.
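The sequential-model numbers above (d = 10 at t = 10, d = 34 at t = 40, with a keyframe every two frames and one segmentation taking ten frame-times) follow directly from Eq. (1). The small calculation below reproduces them; the function and parameter names are ours, chosen only for this sketch.

```python
def sequential_semantic_delay(t, kf_interval=2, seg_frames=10):
    """Semantic delay d = FrameID(F_t) - FrameID(S_t) under the sequential model.

    Keyframe KF_k is frame F_{k * kf_interval}; segmentations run one after
    another, each taking seg_frames frame-times, so KF_k finishes at time
    (k + 1) * seg_frames.
    """
    finished = t // seg_frames          # segmentations completed by time t
    if finished == 0:
        return None                     # no semantic information available yet
    latest_kf_index = finished - 1      # most recently finished keyframe
    return t - latest_kf_index * kf_interval
```

With the default parameters the delay grows linearly, matching the blue line described for Fig. 6: 10 at t = 10, then 18, 26, 34 at t = 20, 30, 40.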
The current frame cannot obtain the latest semantic information.

To shorten this distance, suppose that we segment two frames at the same time (Fig. 5 (b)). Then the delay becomes 12 − 2 = 10 if KF0 and KF1 are segmented at the same time. The delay still grows linearly, as shown by the red line in Fig. 6.

To further shorten the semantic delay, we use a bi-directional model. We do not segment keyframes sequentially. Instead, we do semantic segmentation using keyframes both from the front and the back of the list to use the latest semantic information possible, as shown in Fig. 5 (c) and by the yellow line in Fig. 6. The semantic delay becomes a constant value. In practice, the delay in the bi-directional model is not always 10. The distance is influenced by the segmentation method used, the frequency of keyframe selection, and the processing speed of the related threads.

FIGURE 5. Bi-directional model vs. sequential model. Assume we use Mask R-CNN (200 ms) and ORB-SLAM3 (20 ms), and a keyframe is selected every two frames. There is a delay of about 200/20 = 10 frames while waiting for the semantic result.

FIGURE 6. Semantic delay of the sequential model vs. the bi-directional model.

The left side of Fig. 7 shows a semantic keyframe selection example, and the right side of Fig. 7 shows the timeline of requesting semantic information from the semantic model/server. We take keyframes from both the head and the back of KF to request the semantic label. (Round 1) At time t = 2, two keyframes, KF0 and KF1, are selected. Segmentation finishes at t = 12. By this time, new keyframes have been selected and inserted into KF (see Round 2). Then we take two elements, KF2 from the front and KF6 from the back, to request the semantic label. At time t = 22, we receive the semantic result and continue with the next round (Round 3).

FIGURE 7. Semantic timeline. The left side shows the contents of the keyframe list KF, and the right side shows the timeline of requesting semantic labels. A keyframe in green means the item already obtained its semantic information in a previous round.

We can obtain relatively new information if we segment the keyframe at the tail of the KF list. Then why do we also need to segment the keyframe at the front of the list? Different from the blocked model, there is no semantic information for the first few frames (about 10 frames if Mask R-CNN is used) in our method. Since the processing speed of the tracking thread is usually faster than that of the semantic thread, vSLAM may have already accumulated large errors because of the dynamic objects. Therefore, we need to correct these drift errors using the semantic information by popping out and feeding the keyframes at the front of the KF list sequentially to the semantic-based optimization thread to correct/optimize the camera poses.
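The head-and-tail selection described above (KF2 from the front and KF6 from the back in the running example) can be sketched as a small operation on a double-ended queue. The helper name is ours, not the paper's.

```python
from collections import deque

def select_semantic_keyframes(kf_list: deque) -> list:
    """Bi-directional semantic keyframe selection (sketch).

    Takes one keyframe from the head, to correct drift accumulated on the
    oldest pending keyframes, and one from the tail, to obtain the newest
    semantic information possible.
    """
    selected = []
    if kf_list:
        selected.append(kf_list.popleft())   # oldest pending keyframe
    if kf_list:
        selected.append(kf_list.pop())       # newest keyframe
    return selected
```

On the running example, with KF2-KF6 pending, the call returns [2, 6] and leaves KF3-KF5 in the list for later rounds.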
We did not refine + the network using the TUM dataset because SLAM usually + runs in an unknown environment. + +FIGURE 6. Semantic delay of sequential model vs bi-direction model. C. SEMANTIC MASK GENERATION + We merge all the binary mask images of instance segmenta- +back of KF to request the semantic label. (Round 1) At time tion results into one mask image that is used to generate the +t = 2, two keyframes KF0 and KF1 are selected. Segmen- mask image (Fig. 8) of people. Then we calculate the priori +tation finished at t = 12. By this time, new keyframes are moving probability of map points using the mask. In practice, +selected and then inserted into KF (see Round 2). Then we since the segmentation on object boundaries are sometimes +take two elements KF2 from the front and KF6 from this back unreliable, the features on the boundaries cannot be detected +to request the semantic label. At the time t = 22, we received if directly apply the mask image, as shown in Fig. 9 (a). +the semantic result and continue the next round (Round 3). + 1https://github.com/matterport/Mask_RCNN +VOLUME 9, 2021 2https://github.com/alexgkendall/SegNet-Tutorial + + 23777 + Y. Liu, J. Miura: RDS-SLAM: Real-Time Dynamic SLAM Using Semantic Segmentation Methods + + FIGURE 10. Segmentation failure case. Some features on the body on the + person (a) cannot be identified as outliers using unsound mask + (c) generated by semantic result (b). Therefore, those features are + wrongly labeled as static in this frame. + + FIGURE 11. Moving probability. θs is the static threshold and θd is the + dynamic threshold value. + +FIGURE 8. Semantic information. ‘‘M’’ stands for Mask R-CNN and ‘‘S’’ for moving probability is used to detect and remove outliers from +‘‘SegNet’’. (e) shows the outliers that marked in red color, which are tracking. +detected using the mask image. + 1) DEFINITION OF MOVING PROBABILITY +FIGURE 9. Mask dilation. 
Remove outliers on the edge of dynamic As we know, vSLAM is usually running in an unknown +objects. environment, the semantic result is not always robust if the + CNN network is not well trained or refined according to +Therefore, we dilate the mask using a morphological filter to the current environment (Fig. 10). To detect outliers, it is +include the edge of dynamic objects, as shown in Fig. 9 (b). more reasonable to consider the spatio-temporal consistency +D. MOVING PROBABILITY UPDATE of frames, rather than just use the semantic result of one +In order not to wait for the semantic information in the frame. Therefore, we use the moving probability to leverage +tracking thread, we isolate the semantic segmentation from the semantic information of successive keyframes. +tracking. We use the moving probability to convey semantic +information from semantic thread to tracking thread. The We define the moving probability (p(mti ), mti ∈ M ) of each + map point i at the current time as shown in Fig. 11. The +23778 status of the map point is more likely dynamic if its moving + probability is closer to one. The more static the map point + is if it is more closer to zero. To simplify, we abbreviate the + moving probability of map point i at time t (p(mti )) to p(mt ). + Each map point has two status (M ), dynamic and static, and + the initial probability (initial belief) is set to 0.5 (bel(m0)). + + M = {static(s), dynamic(d) } + bel(m0 = d) = bel(m0 = s) = 0.5 + + 2) DEFINITION OF OBSERVED MOVING PROBABILITY + Considering the fact that the semantic segmentation is not + 100% accurate, we define the observe moving probability as: + + p(zt = d|mt = d) = α, + p(zt = s|mt = d) = 1 − α, + p(zt = s|mt = s) = β, and + p(zt = d|mt = s) = 1 − β. + + The values α and β are manually given and it is related to the + accuracy of semantic segmentation. In the experiment, we set + α and β to 0.9 by supping the semantic segmentation is fairly + reliable. + + VOLUME 9, 2021 + Y. Liu, J. 
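As a side illustration of the semantic-delay analysis above (Eq. (1), Figs. 5 and 6), the growth of the delay under the two scheduling models can be simulated with a minimal Python sketch. The timings (20 ms per frame, 200 ms per segmentation, a keyframe every two frames) are the ones assumed in Fig. 5; the function and policy names are illustrative, not part of the paper's implementation, and the ''newest_first'' policy models only the back-of-list pick of the bi-directional model, which is what bounds its delay.

```python
FRAME_MS, SEG_MS, KF_EVERY = 20, 200, 2   # timings assumed in Fig. 5

def simulate(policy, rounds):
    """Semantic delay d = FrameID(F_t) - FrameID(S_t) after each segmentation."""
    clock, next_kf, pending, delays = 0, 0, [], []
    for _ in range(rounds):
        while next_kf * FRAME_MS <= clock:       # keyframes selected so far
            pending.append(next_kf)
            next_kf += KF_EVERY
        # "sequential" segments the oldest pending keyframe; "newest_first"
        # models the back-of-list pick of the bi-directional model
        kf = pending.pop(0) if policy == "sequential" else pending.pop()
        clock += SEG_MS                          # segmentation of kf finishes now
        delays.append(clock // FRAME_MS - kf)    # Eq. (1)
    return delays
```

Running the sketch reproduces the qualitative behavior of Fig. 6: the sequential delay grows linearly while the newest-first delay stays at the constant 200/20 = 10 frames.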
3) MOVING PROBABILITY UPDATE
The moving probability at the current time, bel(mt), is predicted based on the observations z1:t (the semantic segmentation results) and the initial status m0. We formulate the moving probability update as a Bayesian filter [30] problem:

    bel(mt) = p(mt | z1:t, m0)
            = η p(zt | mt, z1:t−1, m0) p(mt | z1:t−1, m0)
            = η p(zt | mt) p(mt | z1:t−1, m0)
            = η p(zt | mt) bel−(mt)    (2)

Eq. (2) exploits Bayes' rule and the conditional independence assumption that the current observation zt relies only on the current status mt; η is a normalizing constant. The prediction bel−(mt) is calculated by:

    bel−(mt) = ∫ p(mt | mt−1, z1:t−1) p(mt−1 | z1:t−1) dmt−1
             = ∫ p(mt | mt−1) bel(mt−1) dmt−1    (3)

In Eq. (3), we exploit the assumption that our state is complete. This implies that if we know the previous state mt−1, past measurements convey no information regarding the state mt. We assume the state transition probabilities p(mt = d | mt−1 = s) = 0 and p(mt = d | mt−1 = d) = 1 because we cannot detect sudden status changes of objects. η is the constant that normalizes the belief so that bel(mt = d) + bel(mt = s) = 1. The predicted probability of a map point being dynamic is calculated by:

    bel−(mt = d) = p(mt = d | mt−1 = d) bel(mt−1 = d)    (4)

Algorithm 2 Robust Data Association Algorithm
Require: Current frame: Ft; last frame: Ft−1; unknown subset: Unknown; static subset: Static; thresholds: θd, θs, τ = 20
 1: for i = 0; i < Ft−1.Features.size(); i++ do
 2:   MapPoint* m = Ft−1.MapPoints[i]
 3:   f = FindMatchedFeatures(Ft, m)
 4:   if p(m) > θd then
 5:     continue
 6:   end if
 7:   if p(m) < θs then
 8:     Static.insert(f, m)
 9:   end if
10:   if θs ≤ p(m) ≤ θd then
11:     Unknown.insert(f, m)
12:   end if
13: end for
14: for it = Static.begin(); it != Static.end(); it++ do
15:   Ft.MapPoints[it->first] = it->second
16: end for
17: if Static.size() < τ then
18:   for it = Unknown.begin(); it != Unknown.end(); it++ do
19:     Ft.MapPoints[it->first] = it->second
20:   end for
21: end if

4) JUDGEMENT OF STATIC AND DYNAMIC POINTS
Whether a point is dynamic or static is judged using the predefined probability thresholds θd and θs (see Fig. 11), set to 0.6 and 0.4, respectively, in the experiment:

    Status(mt) = dynamic,  if p(mt) > θd;
                 static,   if p(mt) < θs;    (5)
                 unknown,  otherwise.

V. TRACKING THREAD
The tracking thread runs in real time and tends to accumulate drift error due to the incorrect or unstable data associations between 3D map points and 2D features in each frame caused by dynamic objects. We modify the Track Last Frame model and the Track Local Map model of the ORB-SLAM3 tracking thread to remove outliers (see Fig. 3). We propose a data association algorithm that uses data associations that are as good as possible, based on the moving probability stored in the atlas.

A. TRACK LAST FRAME
Alg. 2 shows the data association algorithm in the Track Last Frame model. For each feature i in the last frame, we first get its matched map point m (Line 2). Next, we find the matched feature in the current frame by comparing the descriptor distances of the ORB features (Line 3). After that, in order to remove the bad influence of dynamic map points, we skip the map points that have a high moving probability (Lines 4-6). Two kinds of map points are then left: static and unknown. We want to use only static map points as far as possible, so we classify the remaining map points into two subsets, a static subset and an unknown subset, according to their moving probability (Lines 7-12). Finally, we use the selected, relatively good matches. We first use all the good data stored in the static subset (Lines 14-16). If these data are too few (fewer than the threshold τ = 20, the value used in ORB-SLAM3), we also use the data in the unknown subset (Lines 17-21).

We try to exclude outliers from tracking using the moving probability stored in the atlas. How well the outliers are removed has a great influence on the tracking accuracy. We show the results for a few frames in Fig. 12. All the features in the first few frames are shown in green because no semantic information can be used yet and the moving probability of all map points is 0.5, the initial value.
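The update rule of Eqs. (2)-(4), the thresholding of Eq. (5), and the fallback logic of Alg. 2 can be sketched together in a few lines. This is a minimal Python illustration, not the paper's C++ implementation; it uses the values stated in the text (α = β = 0.9, θd = 0.6, θs = 0.4, τ = 20).

```python
ALPHA = BETA = 0.9            # observation model of Sec. IV-D.2
THETA_D, THETA_S, TAU = 0.6, 0.4, 20

def update_belief(bel_d, z):
    """One Bayes-filter step (Eqs. (2)-(4)); z is 'd' or 's' from segmentation.
    With p(d|d) = 1 and p(d|s) = 0, the prediction equals the previous belief."""
    like_d = ALPHA if z == "d" else 1 - ALPHA          # p(z | m = d)
    like_s = (1 - BETA) if z == "d" else BETA          # p(z | m = s)
    num_d, num_s = like_d * bel_d, like_s * (1 - bel_d)
    return num_d / (num_d + num_s)                     # eta normalizes

def status(p):
    """Eq. (5): classify a map point by its moving probability."""
    return "dynamic" if p > THETA_D else "static" if p < THETA_S else "unknown"

def associate(matches):
    """Alg. 2 in miniature: matches is a list of (feature, p(m)) pairs.
    Dynamic points are skipped; unknown points are used only when static
    points are scarce (fewer than TAU)."""
    static = [f for f, p in matches if status(p) == "static"]
    unknown = [f for f, p in matches if status(p) == "unknown"]
    return static if len(static) >= TAU else static + unknown
```

Starting from the initial belief 0.5, a single ''dynamic'' observation raises the moving probability to 0.9, immediately pushing the point past θd, which matches the fast convergence visible in Fig. 12.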
FIGURE 12. Results after tracking the last frame. ''M'' stands for Mask R-CNN and ''S'' for SegNet. The features in red are not used in tracking. Blue features belong to the static subset and green features to the unknown subset.

The features in red belong to dynamic objects, and they are harder to match with the last frame than the static features (blue). The green features almost disappear because the map points obtain semantic information over time. We use only the features in the static subset if its size is sufficient to estimate the camera ego-motion.

B. TRACK LOCAL MAP
The basic idea of the data association algorithm in the Track Local Map model is similar to Alg. 2. The difference is that here we use all the map points in the local map to find good data associations. The data association result after tracking the local map is shown in Fig. 13. More map points are matched in this model than in tracking the last frame. The features on the people are almost all successfully detected or left unmatched/unused.

FIGURE 13. Results after tracking the local map. ''M'' stands for Mask R-CNN and ''S'' for SegNet.

VI. OPTIMIZATION
A. SEMANTIC-BASED OPTIMIZATION
We optimize the camera pose using the keyframes given by the semantic keyframe selection algorithm. Considering that the tracking thread runs much faster than the semantic thread, drift has already accumulated to some extent under the influence of dynamic objects. Therefore, we try to correct the camera pose using semantic information. We modify the error term used in ORB-SLAM3 by using the moving probability of map points for weighting, as shown below. In the experiment, we use only the matched static map points for optimization.

Assume Xjw ∈ R3 is the 3D position of map point j in the world coordinate system, and Tiw ∈ SE(3) is the pose of the i-th keyframe in the world coordinate system. The camera pose Tiw is optimized by minimizing the reprojection error with respect to the matched keypoint xij ∈ R2 of the map point. The error term for the observation of map point j in keyframe i is:

    e(i, j) = (xij − πi(Tiw, Xjw))(1 − p(mj)),    (6)

where πi is the projection function that projects a 3D map point into a 2D pixel in keyframe i. The larger the moving probability, the smaller the contribution to the error. The cost function to be optimized is:

    C = Σ_{i,j} ρ( e(i,j)^T Σ(i,j)^{-1} e(i,j) ),    (7)

where ρ is the Huber robust cost function and Σ(i,j)^{-1} is the inverse covariance matrix.
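The down-weighting of Eq. (6) and the robust cost of Eq. (7) can be sketched as follows. This is an illustrative pure-Python sketch, not the paper's optimizer: it assumes a simple pinhole model with intrinsics K = (fx, fy, cx, cy), a unit covariance in place of Σ(i,j), and a standard Huber ρ applied to the residual norm.

```python
def weighted_residual(x_obs, R, t, X, K, p_m):
    """Eq. (6): e = (x_ij - pi_i(T_iw, X_jw)) * (1 - p(m_j)).
    R, t: world-to-camera rotation (3x3 nested lists) and translation;
    K = (fx, fy, cx, cy) are illustrative pinhole intrinsics."""
    Xc = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    u = K[0] * Xc[0] / Xc[2] + K[2]     # pinhole projection pi_i
    v = K[1] * Xc[1] / Xc[2] + K[3]
    w = 1.0 - p_m                       # likely-dynamic points contribute less
    return ((x_obs[0] - u) * w, (x_obs[1] - v) * w)

def huber(e, delta=1.0):
    """Huber robust cost rho applied to the residual norm (unit covariance)."""
    n = (e[0] ** 2 + e[1] ** 2) ** 0.5
    return 0.5 * n * n if n <= delta else delta * (n - 0.5 * delta)
```

Note how a point with p(m) = 1 contributes nothing to the cost, which is how confirmed dynamic points are effectively excluded from the optimization.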
B. BUNDLE ADJUSTMENT IN LOCAL MAPPING THREAD
We modify the local BA model to reduce the influence of dynamic map points using semantic information. What we modified are: 1) the error term, in which the moving probability is used, as shown in Eq. 6; and 2) only keyframes that have already obtained semantic information are used for BA.

VII. EXPERIMENTAL RESULTS
We evaluate the tracking accuracy using the TUM [25] indoor dataset and demonstrate the real-time performance by comparing with state-of-the-art vSLAM methods using, where possible, the results from the original papers.

A. SYSTEM SETUP
Our system is evaluated using a GeForce RTX 2080Ti GPU, CUDA 11.1, and Docker.3 Docker is used to deploy different kinds of semantic segmentation methods on the same machine. We also use a Kinect v2 camera4 for evaluation in a real environment.

B. TRACKING ACCURACY EVALUATION
The proposed method was compared against ORB-SLAM3 and similar semantic-based algorithms to quantify the tracking performance of our proposal in dynamic scenarios.

The TUM RGB-D dataset contains color and depth images along the ground-truth trajectory of the sensor. In the sequences named ''fr3/walking_*'' (labeled f3/w/*), two people walk through an office. This is intended to evaluate the robustness of vSLAM to quickly moving dynamic objects in large parts of the visible scene. Four types of camera motion are included in the walking sequences: 1) ''xyz'', where the Asus Xtion camera is manually moved along three directions (x, y, z); 2) ''static'', where the camera is kept in place manually; 3) ''halfsphere'', where the camera is moved on a small half-sphere of approximately one meter in diameter; and 4) ''rpy'', where the camera is rotated along the principal axes (roll, pitch, yaw). In the experiment, the person is treated as the only priori dynamic object in the TUM dataset.

We compared the camera trajectory with ORB-SLAM3,5 DS-SLAM,6 and DynaSLAM.7 Fig. 14 compares the trajectories obtained using their source code; therefore, the trajectories are not exactly the same as the ones in their original papers. We evaluated our system using both Mask R-CNN (M) and SegNet (S). The trajectory of DynaSLAM using Mask R-CNN is very similar to that of our Mask R-CNN version, as shown in Fig. 14 (m-p) and Fig. 14 (q-t). The performance of our SegNet version (Fig. 14 (i and j)) is similar to that of DS-SLAM (Fig. 14 (e and f)).

The error in the estimated trajectory was calculated by comparing it with the ground truth using two prominent measures: absolute trajectory error (ATE) and relative pose error (RPE) [25], which are well suited for measuring vSLAM performance. The root mean squared error (RMSE) and the standard deviation (S.D.) of ATE and RPE are compared. Each sequence was run at least five times, as dynamic objects are prone to increase non-deterministic effects. We compared our method with ORB-SLAM3 [18], DS-SLAM [10], DynaSLAM [9], SLAM-PCD [8], DM-SLAM [11], and Detect-SLAM [12]. The comparison results are summarized in Tables 1, 2, and 3. DynaSLAM reported that it obtained the best performance using the combination of Mask R-CNN and a geometric model. In this paper, we mainly focus on the time cost caused by semantic segmentation. Contrary to the very heavy geometric model that DynaSLAM uses, we use only very light geometric checks, such as RANSAC and the photometric error, to deal with outliers that do not come from the priori dynamic objects.

Our proposal outperforms the original ORB-SLAM3 (RGB-D mode, without IMU) and obtains performance similar to DynaSLAM, SLAM-PCD, and DM-SLAM, whose tracking error is already very small. Different from them, we use the non-blocked model. The first few frames do not have any semantic information, and the number of keyframes that have a semantic label is smaller than in the blocked model because the tracking thread runs much faster than the semantic segmentation (especially for the heavy model, Mask R-CNN). However, we achieve similar tracking performance using less semantic information.

C. REAL ENVIRONMENT EVALUATION
We test our system using a Kinect2 RGB-D camera, as shown in Fig. 15. All features are in the initial status in the first few frames because they have not yet obtained any semantic information. Static features are increasingly detected over time and used to estimate the camera pose. The features on the person are detected and excluded from tracking. The algorithm runs at around 30 Hz, as shown in Table 4.

D. EXECUTION TIME
Tab. 4 compares the execution time of the vSLAM algorithms. In the blocked model, the tracking thread needs to wait for the semantic label, so the speed of the other methods depends on the semantic segmentation method used: the heavier the semantic model, the higher the total time consumption. Although DynaSLAM achieves good tracking performance, its processing time is long due to Mask R-CNN; as is well known, DynaSLAM is not a real-time algorithm. DS-SLAM is the second fastest algorithm because it uses a lightweight semantic segmentation method, SegNet. However, its architecture is also a blocked model, and its execution time will increase if a more time-consuming method is used.

FIGURE 14. Trajectory comparison frame by frame. ''M'' stands for ''Mask R-CNN'' and ''S'' for ''SegNet''.

TABLE 1. Results of absolute trajectory error on TUM (m). Ours (1) and (3) are results evaluated using only keyframes.
TABLE 2. Results of translational relative pose error (RPE) (m). Ours (1) and (3) are results evaluated using only keyframes.
TABLE 3. Results of rotational relative pose error (RPE) (°). Ours (1) and (3) are results evaluated using only keyframes.
TABLE 4. Execution time comparison on the TUM dataset. We use the data from the original papers where possible; where not provided, we approximate the processing time.
TABLE 5. Semantic keyframe number comparison (Mask R-CNN).

3 https://docs.docker.com/
4 https://github.com/code-iai/iai_kinect2
5 https://github.com/UZ-SLAMLab/ORB_SLAM3
6 https://github.com/ivipsourcecode/DS-SLAM
7 https://github.com/BertaBescos/DynaSLAM
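For reference, the ATE RMSE reported in tables like Table 1 is commonly computed as the root mean square of the per-pose translational distances between the estimated and ground-truth trajectories. The sketch below assumes the trajectories are already time-associated and aligned; the real TUM evaluation tools additionally associate poses by timestamp and align the trajectories (e.g., with Horn's method) before computing the error.

```python
from math import sqrt

def ate_rmse(est, gt):
    """RMSE of the absolute trajectory error over paired 3D positions,
    assuming the trajectories are already time-associated and aligned."""
    errs = [sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
            for p, q in zip(est, gt)]            # per-pose Euclidean distance
    return sqrt(sum(e * e for e in errs) / len(errs))
```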
Our method uses the non-blocked model and runs at an almost constant speed regardless of the segmentation method.

We evaluate the error metrics on the TUM dataset at 15 Hz, by manually adding some delay in the tracking thread, because the TUM sequences are very short and only very little semantic information can be obtained in such a short time. We compare the runtime and the number of keyframes that obtained a semantic label (semantic keyframe number) in Tab. 5. We compared only the Mask R-CNN version because SegNet is faster and can segment almost all the keyframes in each sequence. We assume the time cost of Mask R-CNN is 0.2 s for segmenting each frame. The total time of running the fr3/w/xyz sequence is about 57.3 s at 15 Hz but only 28.3 s at 30 Hz. In this short time, the number of semantic keyframes at 30 Hz (143) is half of that at 15 Hz (286). Usually, the more keyframes are segmented, the better the tracking accuracy that can be achieved; this depends on the specific application and the segmentation method used.

In the bi-direction model, we select two keyframes at a time. We offer two strategies to segment them: 1) infer the images at the same time as a batch on the same GPU, or 2) infer the images on the same GPU sequentially (one by one). We suggest using (1) if the GPU can infer a batch of images at the same time. Our Mask R-CNN version uses (1) because we found that we need 0.3-0.4 s in case (1) and 0.2 s in case (2). Our SegNet version is evaluated using strategy (2) because SegNet is very fast and the images can be segmented sequentially.

E. SEMANTIC DELAY EVALUATION
We have analyzed the semantic delay by assuming that a keyframe is selected every two frames (see Fig. 6). In the experiment, we follow the keyframe selection policy used in ORB-SLAM3, and we compare the semantic delay of the Mask R-CNN case and the SegNet case using the TUM dataset, as shown in Fig. 16. The semantic delay is influenced by the following factors: 1) the segmentation speed; 2) the keyframe selection policy; 3) the undetermined influence of the different running speeds of multiple threads (e.g., the Loop Closing thread); and 4) the hardware configuration. In the fr3/w/xyz sequence, the camera sometimes moves very slowly and sometimes moves forward or backward. As a result, this changes the keyframe selection frequency and causes the variance of the semantic delay.

FIGURE 15. Result in a real environment. The green features are in the initial status and their moving probability is 0.5; the blue features are static features and the red ones are outliers. (a) shows the originally detected ORB features, (b) the output after the tracking-last-frame process, and (c) the result after the tracking-local-map process.

FIGURE 16. Semantic delay on the TUM w/xyz sequence. The average value is 10 for the Mask R-CNN case and 5 for SegNet.

VIII. CONCLUSION
A novel vSLAM system, semantic-based real-time visual SLAM (RDS-SLAM) for dynamic environments using an RGB-D camera, is presented. We modify ORB-SLAM3 and add a semantic tracking thread and a semantic-based optimization thread to remove the influence of dynamic objects using semantic information. These new threads run in parallel with the tracking thread, so the tracking thread is not blocked waiting for semantic information. We propose a keyframe selection strategy for semantic segmentation that obtains semantic information as recent as possible and that can deal with segmentation methods of different speeds. We update and propagate semantic information using the moving probability, which is used to detect and remove outliers from tracking via a data association algorithm. We evaluated the tracking performance and the processing time using the TUM dataset.
The comparison against state-of-the-art vSLAMs shows that our method achieves good tracking performance and can track each frame in real time. The fastest speed of the system is about 30 Hz, which is similar to the tracking speed of ORB-SLAM3. In future work, we will try to 1) deploy our system on a real robot, 2) extend our system to stereo and monocular camera systems, and 3) build a semantic map.

REFERENCES

[1] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, I. Reid, and J. J. Leonard, ''Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age,'' IEEE Trans. Robot., vol. 32, no. 6, pp. 1309–1332, Dec. 2016.
[2] T. Taketomi, H. Uchiyama, and S. Ikeda, ''Visual SLAM algorithms: A survey from 2010 to 2016,'' IPSJ Trans. Comput. Vis. Appl., vol. 9, no. 1, pp. 1–11, Dec. 2017, doi: 10.1186/s41074-017-0027-2.
[3] W. Tan, H. Liu, Z. Dong, G. Zhang, and H. Bao, ''Robust monocular SLAM in dynamic environments,'' in Proc. IEEE Int. Symp. Mixed Augmented Reality (ISMAR), Oct. 2013, pp. 209–218.
[4] W. Dai, Y. Zhang, P. Li, and Z. Fang, ''RGB-D SLAM in dynamic environments using points correlations,'' IEEE Robot. Autom. Lett., vol. 2, no. 4, pp. 2263–2270, Nov. 2018.
[5] S. Li and D. Lee, ''RGB-D SLAM in dynamic environments using static point weighting,'' IEEE Robot. Autom. Lett., vol. 2, no. 4, pp. 2263–2270, Oct. 2017.
[6] Y. Sun, M. Liu, and M. Q.-H. Meng, ''Improving RGB-D SLAM in dynamic environments: A motion removal approach,'' Robot. Auto. Syst., vol. 89, pp. 110–122, Mar. 2017.
[7] D.-H. Kim, S.-B. Han, and J.-H. Kim, ''Visual odometry algorithm using an RGB-D sensor and IMU in a highly dynamic environment,'' in Robot Intelligence Technology and Applications 3, vol. 345. New York, NY, USA: Springer-Verlag, 2015, pp. 11–26.
[8] Y. Fan, Q. Zhang, S. Liu, Y. Tang, X. Jing, J. Yao, and H. Han, ''Semantic SLAM with more accurate point cloud map in dynamic environments,'' IEEE Access, vol. 8, pp. 112237–112252, 2020.
[9] B. Bescos, J. M. Facil, J. Civera, and J. Neira, ''DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes,'' IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 4076–4083, Oct. 2018.
[10] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, ''DS-SLAM: A semantic visual SLAM towards dynamic environments,'' in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 1168–1174.
[11] J. Cheng, Z. Wang, H. Zhou, L. Li, and J. Yao, ''DM-SLAM: A feature-based SLAM system for rigid dynamic scenes,'' ISPRS Int. J. Geo-Inf., vol. 9, no. 4, pp. 1–18, 2020.
[12] F. Zhong, S. Wang, Z. Zhang, C. Chen, and Y. Wang, ''Detect-SLAM: Making object detection and SLAM mutually beneficial,'' in Proc. IEEE Winter Conf. Appl. Comput. Vis. (WACV), Mar. 2018, pp. 1001–1010.
[13] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou, ''Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment,'' Robot. Auto. Syst., vol. 117, pp. 1–16, Jul. 2019.
[14] M. A. Fischler and R. Bolles, ''Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography,'' Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
[15] K. He, G. Gkioxari, P. Dollár, and R. B. Girshick, ''Mask R-CNN,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 2980–2988.
[16] R. Mur-Artal and J. D. Tardós, ''ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras,'' IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, Oct. 2017.
[17] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, ''ORB: An efficient alternative to SIFT or SURF,'' in Proc. Int. Conf. Comput. Vis., Nov. 2011, pp. 2564–2571.
[18] C. Campos, R. Elvira, J. J. Gómez Rodríguez, J. M. M. Montiel, and J. D. Tardós, ''ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM,'' 2020, arXiv:2007.11898.
[19] R. Elvira, J. D. Tardós, and J. M. M. Montiel, ''ORBSLAM-Atlas: A robust and accurate multi-map system,'' in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Nov. 2019, pp. 6253–6259.
[20] C. Kerl, J. Sturm, and D. Cremers, ''Dense visual SLAM for RGB-D cameras,'' in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Nov. 2013, pp. 2100–2106.
[21] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, ''Bundle adjustment—A modern synthesis,'' in Proc. Int. Workshop Vis. Algorithms, 2000, pp. 298–372.
[22] V. Badrinarayanan, A. Kendall, and R. Cipolla, ''SegNet: A deep convolutional encoder-decoder architecture for image segmentation,'' IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017.
[23] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, ''OctoMap: An efficient probabilistic 3D mapping framework based on octrees,'' Auto. Robots, vol. 34, no. 3, pp. 189–206, Apr. 2013.
[24] B. Bescos, C. Campos, J. D. Tardós, and J. Neira, ''DynaSLAM II: Tightly-coupled multi-object tracking and SLAM,'' 2020, arXiv:2010.07820.
[25] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, ''A benchmark for the evaluation of RGB-D SLAM systems,'' in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2012, pp. 573–580.
[26] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Y. Fu, and A. C. Berg, ''SSD: Single shot multibox detector,'' in Computer Vision—ECCV 2016 (Lecture Notes in Computer Science), vol. 9905. Cham, Switzerland: Springer, 2016, pp. 21–37.
[27] N. Dvornik, K. Shmelkov, J. Mairal, and C. Schmid, ''BlitzNet: A real-time deep network for scene understanding,'' in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Oct. 2017, pp. 4174–4182.
[28] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, ''Microsoft COCO: Common objects in context,'' in Computer Vision—ECCV 2014 (Lecture Notes in Computer Science), vol. 8693. Cham, Switzerland: Springer, 2014, pp. 740–755.
[29] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, ''The Pascal visual object classes (VOC) challenge,'' Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Jun. 2010.
[30] S. Thrun, W. Burgard, and D. Fox, Probabilistic Robotics. Cambridge, MA, USA: MIT Press, 2005.

YUBAO LIU received the bachelor's degree in computer science from Qufu Normal University, Qufu, China, in 2012, and the master's degree in computer science from Capital Normal University, Beijing, China, in 2015. He is currently pursuing the Ph.D. degree with the Toyohashi University of Technology, Toyohashi, Japan. In 2015, he joined the Intel Research Center, Beijing, and he transferred to Isoftstone, Beijing, in 2016, as a Senior Software Engineer, working on computer vision and AR. His research interests include pattern recognition and SLAM for AR and smart robotics.

JUN MIURA (Member, IEEE) received the B.Eng. degree in mechanical engineering and the M.Eng. and Dr.Eng. degrees in information engineering from The University of Tokyo, Tokyo, Japan, in 1984, 1986, and 1989, respectively. In 1989, he joined the Department of Computer-Controlled Mechanical Systems, Osaka University, Suita, Japan. Since April 2007, he has been a Professor with the Department of Computer Science and Engineering, Toyohashi University of Technology, Toyohashi, Japan. From March 1994 to February 1995, he was a Visiting Scientist with the Computer Science Department, Carnegie Mellon University, Pittsburgh, PA, USA. He has published over 220 articles in international journals and conferences in the areas of intelligent robotics, mobile service robots, robot vision, and artificial intelligence. He received several awards, including the Best Paper Award from the Robotics Society of Japan, in 1997, the Best Paper Award Finalist at ICRA-1995, and the Best Service Robotics Paper Award Finalist at ICRA-2013.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2021年/Stereo camera visual SLAM with hierarchical masking and motion_state classification at outdoor construction sites containing large dynamic object.pdf b/动态slam/2020年-2022年开源动态SLAM/2021年/Stereo camera visual SLAM with hierarchical masking and motion_state classification at outdoor construction sites containing large dynamic object.pdf
new file mode 100644
index 0000000..f1273ab

STEREO CAMERA VISUAL SLAM WITH HIERARCHICAL MASKING AND MOTION-STATE CLASSIFICATION AT OUTDOOR CONSTRUCTION SITES CONTAINING LARGE DYNAMIC OBJECTS

arXiv:2101.06563v1 [cs.RO] 17 Jan 2021

Runqiu Bao, Dept. of Precision Engineering, The University of Tokyo, Tokyo, Japan, bao@robot.t.u-tokyo.ac.jp
Ren Komatsu, Dept. of Precision Engineering, The University of Tokyo, Tokyo, Japan, komatsu@robot.t.u-tokyo.ac.jp
Renato Miyagusuku, Dept. of Mechanical and Intelligent Engineering, Utsunomiya University, Utsunomiya, Tochigi, Japan, miyagusuku@cc.utsunomiya-u.ac.jp
Masaki Chino, Construction Division, HAZAMA ANDO CORPORATION, Tokyo, Japan, chino.masaki@ad-hzm.co.jp
Atsushi Yamashita, Dept. of Precision Engineering, The University of Tokyo, Tokyo, Japan, yamashita@robot.t.u-tokyo.ac.jp
Hajime Asama, Dept. of Precision Engineering, The University of Tokyo, Tokyo, Japan, asama@robot.t.u-tokyo.ac.jp

January 19, 2021

ABSTRACT

At modern construction sites, utilizing GNSS (Global Navigation Satellite System) to measure the real-time location and orientation (i.e. pose) of construction machines and navigate them is very common. However, GNSS is not always available.
Replacing GNSS with on-board cameras and visual simultaneous localization and mapping (visual SLAM) to navigate the machines is a cost-effective solution. Nevertheless, at construction sites, multiple construction machines will usually work together and side-by-side, causing large dynamic occlusions in the cameras' view. Standard visual SLAM cannot handle large dynamic occlusions well. In this work, we propose a motion segmentation method to efficiently extract static parts from crowded dynamic scenes to enable robust tracking of camera ego-motion. Our method utilizes semantic information combined with object-level geometric constraints to quickly detect the static parts of the scene. Then, we perform a two-step coarse-to-fine ego-motion tracking with reference to the static parts. This leads to a novel dynamic visual SLAM formation. We test our proposals through a real implementation based on ORB-SLAM2 and datasets we collected from real construction sites. The results show that when standard visual SLAM fails, our method can still retain accurate camera ego-motion tracking in real-time. Compared to state-of-the-art dynamic visual SLAM methods, ours shows outstanding efficiency and competitive trajectory accuracy.

∗ Code available at: https://github.com/RunqiuBao/kenki-positioning-vSLAM
† Corresponding author. Email: bao@robot.t.u-tokyo.ac.jp

A PREPRINT - JANUARY 19, 2021

Keywords: dynamic visual SLAM, motion segmentation, hierarchical masking, object motion-state classification, ego-motion tracking

1 Introduction

Knowledge of the real-time location and orientation (i.e. pose) of construction machines, such as bulldozers, excavators, and vibration rollers, is essential for the automation of construction sites. Currently, RTK-GNSS (Real-Time Kinematic Global Navigation Satellite System) is widely used because of its centimeter-level location accuracy.
However, in addition to the high price, the location output of RTK-GNSS can be unstable due to loss of satellite signals underground, near mountains and trees, and between tall buildings. Therefore, replacing RTK-GNSS with on-board cameras and visual SLAM (vSLAM) has been proposed [1]. Assuming the machine's starting pose is known in a global coordinate system, the relative pose outputs from vSLAM can be used to navigate the machine.

However, at construction sites, several machines usually work together and side-by-side (Figure 1), which results in large dynamic occlusions in the view of the cameras. Such dynamic occlusions can occupy more than 50% of the image, leading to a dramatic drop in tracking accuracy or even tracking failure when using standard vSLAM. We introduce this problem distinctly in the context of dynamic vSLAM and propose an original stereo camera dynamic vSLAM formation.

To deal with dynamic occlusions, our idea is to first detect static objects and backgrounds, and then track ego-motion with reference to them. To achieve this, we need to estimate the real motion-states of objects. We use learning-based object detection and instance segmentation combined with object-wise geometric measurement in stereo frames to label the motion-states of object instances and generate occlusion masks for dynamic objects. Additionally, two types of occlusion masks are applied to balance accuracy and computation cost: bounding box masks for small occlusions and pixel-wise masks for large occlusions. Pixel-wise masks describe the boundaries of objects more accurately, while bounding boxes are faster to predict but less precise.

In a nutshell, our contributions in this work include: (1) a semantic-geometric approach to detect static objects and static backgrounds for stereo vSLAM, (2) a masking technique for dynamic objects called hierarchical masking, and (3) a novel stereo camera dynamic visual SLAM system for construction sites.
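The mask-type trade-off described above (cheap bounding-box masks by default, pixel-wise masks only when occlusions grow large) can be sketched as follows. This is our illustration, not code from the paper's repository; the function names and the exact threshold placement are assumptions:

```python
import numpy as np

def masked_area_ratio(boxes, img_w, img_h):
    """Fraction of the image covered by the union of bounding boxes
    (boxes given as (x, y, w, h) with (x, y) the top-left corner)."""
    mask = np.zeros((img_h, img_w), dtype=bool)
    for x, y, w, h in boxes:
        mask[y:y + h, x:x + w] = True  # union handles overlapping boxes
    return mask.sum() / float(img_w * img_h)

def choose_mask_type(boxes, img_w, img_h, tau_mar=0.5):
    """Switch to costly pixel-wise segmentation only when the frame is crowded."""
    if masked_area_ratio(boxes, img_w, img_h) >= tau_mar:
        return "pixel-wise"
    return "bounding-box"
```

With a 200 × 200 image, a single 100 × 100 box covers a quarter of the frame and keeps the cheap bounding-box mask, while a 200 × 120 box crosses the 0.5 threshold and triggers pixel-wise segmentation.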
The remainder of this work is organized as follows: In Section 2, we summarize the existing research on dynamic visual SLAM and motion segmentation methods, and describe the features of this work. In Section 3, the system structure and our original proposals (two algorithms) are introduced. In Section 4, to test the performance of our proposals, we conducted experiments at real construction sites and built datasets for algorithm evaluation. We used the Absolute Trajectory RMSE [2] to evaluate the accuracy of the location outputs of the vSLAM system. Finally, Section 5 contains the conclusions and future work plan.

Figure 1: Simultaneous working of construction machines causing large-area moving occlusions in on-board cameras' view.

Figure 2: Cameras are mounted on top of our construction machine facing the sides, and RTK-GNSS is used to collect ground truth positions.

2 Related Work

2.1 Dynamic Visual SLAM

Standard visual SLAM (vSLAM) assumes that the environment is static. Correspondingly, vSLAM for dynamic environments (dynamic vSLAM or robust vSLAM) distinguishes static and dynamic features and computes pose estimates based solely on static features.

Depending on the application, dynamic vSLAM can be categorized into two classes. One solely builds a static background model, ignoring moving objects [3, 4, 2]. The other aims at not only creating a static background map, but also simultaneously maintaining sub-maps of moving objects [5, 6, 7]. Our task, i.e. positioning of construction machines, requires fast and accurate camera ego-motion tracking and thus belongs to the first class.

The real-time positioning task at construction sites brings a new problem to vSLAM. Specifically, we found that at a busy construction site, there are often many machines, trucks and persons moving around, which become large dynamic occlusions (occlusion rate >50% from time to time) in the camera view.
Besides, such occlusions usually contain more salient feature points than earthen ground and cause chaos in feature-based camera ego-motion tracking. Even existing dynamic vSLAM solutions may suffer from various issues and are thus not the optimal solution for this task. For example, [8, 9, 10, 11] proposed very fast methods for dealing with dynamic objects. Yet, they did not explicitly consider overly-large dynamic occlusions and thus might suffer from an accuracy drop. [2] and [6] proposed very robust methods for masking dynamic occlusions, but both of them require heavy computation and are not suitable for the real-time positioning task. Therefore, we propose our own dynamic vSLAM solution for real-time positioning at dynamic construction sites.

In a dynamic vSLAM system, there are two major modules: (1) motion segmentation and (2) localization and mapping [12]. Motion segmentation is the key part that distinguishes an outstanding dynamic vSLAM system from the rest.

2.2 Motion Segmentation

Motion segmentation is aimed at detecting moving parts in the image and classifying the features into two groups, static and dynamic features.

Standard visual SLAM achieves this by applying robust statistical approaches to the estimation of geometric models, such as Random Sample Consensus (RANSAC) [13]. However, such an approach may fail when large dynamic occlusions exist and static features are not in the majority. Other approaches leverage external sensors such as inertial measurement units (IMU) to fix camera ego-motion. In the following, we focus on visual-only approaches to distinguishing static and dynamic features. Muhamad et al. [12] summarize this research area well; for more details, please refer to that survey.

The most intuitive approach for motion segmentation is using semantic information to separate object instances that may move in the scene. To obtain semantic information, Bârsan et al.
[6] used learning-based instance segmentation to generate pixel-wise masks for object instances. Cui et al. [14] proposed using only bounding boxes obtained from YOLO v3 [15] to filter dynamic objects, which can reduce computation cost. However, these works simply assume that movable objects are dynamic. End-to-end learning-based methods for motion segmentation (without prior information about the environment) are still scarce [12].

Another common strategy for motion segmentation is utilizing geometric constraints. It leverages the fact that dynamic features will violate constraints defined in multi-view geometry for static scenes. Kundu et al. [16] detected dynamic features by checking whether the points lie on the epipolar line in the subsequent view and used the Flow Vector Bound (FVB) to distinguish the motion-states of 3D points moving along the epipolar line. Migliore et al. [17] kept checking the intersection between three projected viewing rays in three different views to confirm static points. Tan et al. [18] projected existing map points into the current frame to check whether a feature is dynamic. It is difficult for us to evaluate these methods. However, one obvious drawback is that they require complicated modifications to the low-level components of a standard visual SLAM algorithm once the static environment assumption is dropped. We argue that such modifications are not good for the modularity of a vSLAM system.

As a novel hybrid approach, Bescos et al. [2], in their work named DynaSLAM, proposed to combine learning-based instance segmentation with multi-view geometry to refine masks for objects that are not a priori dynamic, but movable. Our system follows the hybrid fashion of DynaSLAM, but we treat motion segmentation as an object-level classification problem.
Our idea is that, by triangulating and measuring the positions of points inside the bounding boxes and comparing them between frames, we can estimate an object-level motion-state for every bounding box (assuming all objects are rigid). If we know the motion-state of every bounding box, the surroundings can easily be divided into static and dynamic parts.

Besides, bounding boxes of large dynamic occlusions reduce the available static features. We will show that it is essential to keep the overall masked area under a certain threshold if possible. Hence, we designed an algorithm named hierarchical masking that refines a pixel-wise mask inside the bounding box when the overall masked area extends past a threshold, to save scarce static features. This hierarchical masking algorithm is also an original proposal of ours.

3 Stereo Camera Dynamic Visual SLAM Robust against Large Dynamic Occlusions

The core problem in this research is to achieve fast and accurate camera ego-motion tracking when there are large occlusions in the camera's view. Subsection 3.1 is a general introduction of the system pipeline. In Subsection 3.2, the principle of feature-based camera ego-motion tracking with occlusion masks for dynamic occlusions is introduced. In order to balance computation speed and accuracy in occlusion mask generation, a hierarchical masking approach is proposed in Subsection 3.3. Last, through stereo triangulation and comparison, object instances in the current frame are assigned a predicted motion-state label, static or dynamic, which leads to further mask refining and a second round of tracking.

3.1 System Overview

The system installation is illustrated in Figure 2 and the system pipeline is shown in Figure 3. Inputs are stereo frames (left image and right image) captured by a stereo camera. Then semantic information, including object labels and bounding boxes, is extracted using learning-based object detection.
In addition, a hierarchical mask generation approach is proposed to balance mask accuracy and generation speed. Object masks exclude suspicious dynamic objects from the static background. The features in the static background are then used in the initial tracking of the camera pose.

After initial tracking, a rough pose of the new frame is known, with which we distinguish static objects from other objects. This is done by triangulating object-level 3D key points in the reference and current frames and comparing the 3D position errors to determine whether the object is moving or not. Large static objects can provide more salient static features for improving tracking accuracy. Dynamic objects are kept masked in the second ego-motion tracking. This two-round coarse-to-fine tracking scheme helps detect static objects and improve pose estimation accuracy.

After the second round of tracking, there are mapping and pose graph optimization steps, as in most state-of-the-art vSLAM algorithms.

3.2 Feature-based Camera Ego-motion Tracking by Masking Dynamic Occlusions

The camera ego-motion tracking framework used here is based on ORB-SLAM2 stereo [19]. When a new frame comes in, first, a constant velocity motion model is used to predict the new camera pose, with which we can search for matches between map points and 2D feature points. After enough matches are found, a new pose can be re-estimated by the Perspective-n-Point (PnP) algorithm [20]. Motion-only bundle adjustment (BA) is then used for further pose optimization. Motion-only BA estimates the camera pose of the new stereo frame, including orientation R ∈ SO(3) and position t ∈ R^3, by minimizing the reprojection error between matched 3D points x_i ∈ R^3 in the SLAM coordinates and feature points p_i^(.) in the new frame, where i = 1, 2, ..., N. p_i^(.) includes monocular feature points p_i^m ∈ R^2 and stereo feature points p_i^s ∈ R^3.
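The monocular and rectified-stereo pinhole projections used in this reprojection error (Eqs. (2) and (3) below) can be sketched in NumPy as follows; parameter names follow the paper's notation, but the functions themselves are our illustration:

```python
import numpy as np

def project_mono(X, fx, fy, cx, cy):
    """Monocular pinhole projection: 3D point (X, Y, Z) -> (u_l, v_l)."""
    x, y, z = X
    return np.array([fx * x / z + cx, fy * y / z + cy])

def project_stereo(X, fx, fy, cx, cy, b):
    """Rectified stereo projection: 3D point -> (u_l, v_l, u_r), where u_r
    is the horizontal coordinate seen from a camera shifted by baseline b."""
    x, y, z = X
    return np.array([fx * x / z + cx,
                     fy * y / z + cy,
                     fx * (x - b) / z + cx])
```

For f_x = f_y = 100, c_x = c_y = 50 and b = 0.5, the point (1, 1, 2) projects to (100, 100) monocularly and (100, 100, 75) in the stereo model: the disparity u_l − u_r = f_x b / Z is what makes depth recoverable from a rectified pair.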
Now suppose M out of N 3D points are on a rigid-body dynamic object that had a pose change R′, t′ in the physical world, and their 3D coordinates change from x_i to x′_i, for i = 1, 2, ..., M. The rigid body transformation can be expressed as x′_i = R′x_i + t′. Pose change estimation can then be expressed as:

{R, t, R′, t′} = argmin_{R, t, R′, t′} Σ_{i=1}^{M} ρ( ‖p_i^(.) − π^(.)(R(R′x_i + t′) + t)‖²_Σ ) + Σ_{i=M+1}^{N} ρ( ‖p_i^(.) − π^(.)(Rx_i + t)‖²_Σ ),   (1)

where ρ is the robust Huber cost function that controls the error growth of the quadratic function, and Σ is the covariance matrix associated with the scale of the feature point. The projection functions π^(.) include the monocular π_m and rectified stereo π_s, as defined in [19]:

π_m([X, Y, Z]^T) = [f_x X/Z + c_x, f_y Y/Z + c_y]^T = [u_l, v_l]^T,   (2)

π_s([X, Y, Z]^T) = [f_x X/Z + c_x, f_y Y/Z + c_y, f_x (X − b)/Z + c_x]^T = [u_l, v_l, u_r]^T,   (3)

where (f_x, f_y) is the focal length, (c_x, c_y) is the principal point, and b the baseline. (u_l, v_l) represents the monocular feature points and (u_l, v_l, u_r) the stereo feature points.

Figure 3: An overview of the proposed system. Inputs are stereo frames (all processing is on the left image; the right image is only used for triangulating 3D points). After semantic information extraction, occlusion masks of objects are generated and used to filter potential dynamic features. The initial ego-motion tracking is based purely on the static background. Then more static objects are found and used as references in the second round of tracking to get more accurate results. The final output is the camera pose R and t of the current frame in the SLAM coordinates.

However, solving equation (1) is not easy, not to mention that there can be more than one dynamic object in the real world. If we only want to estimate R, t, equation (1) can be simplified to:

{R, t} = argmin_{R, t} Σ_{i=M+1}^{N} ρ( ‖p_i^(.) − π^(.)(Rx_i + t)‖²_Σ ),   (4)

which means using only static points in the scene to estimate the camera pose. If dynamic feature points are not excluded as moving outliers, the estimation result will be wrong.

To separate static and dynamic feature points, our approach is to use a binary image as a mask (for the left image of the input stereo frame). The mask has the same size as the input image; pixels with value 0 indicate static areas, while pixels with value 1 indicate dynamic areas. Suppose that I_mask(u, v) refers to a pixel in the mask image I_mask, S_p is the set of static pixels, and D_p is the set of dynamic pixels; then

I_mask(u, v) = { 0, if I_mask(u, v) ∈ S_p;  1, if I_mask(u, v) ∈ D_p }.   (5)

Figure 4 shows examples of masks (with alpha blending). To generate a mask, we first get bounding boxes or pixel-wise segmentation results from learning-based object detection and instance segmentation (Subsection 3.3). Then, for those objects with an a priori dynamic semantic label such as "car", "person", "truck", etc., we set the pixels' value to 1 in the mask image, while keeping the others at 0. We also apply geometric measurement and calculate a motion-state label for every object bounding box. Inside a static bounding box, we set the pixels' value to 0 whatever it was (Subsection 3.4). Later, during the ego-motion tracking period, only the areas where the mask value equals 0 are used to estimate the camera pose, as described by Equation (4).

3.3 Hierarchical Object Masking

The switching between two types of masks forms a hierarchical masking strategy that balances computation speed and mask accuracy.

To reduce computation cost, we first use object detectors, e.g. EfficientDet [21], to predict object instances and recognize their bounding boxes. Such a learning-based object detector is a deep neural network, which can predict all the bounding boxes, class labels, and class probabilities directly from an image in one evaluation.
A bounding box only represents a rough boundary of the object, so when using it as an object mask, background feature points inside the rectangle are also classified as "object". It is, therefore, only a rough boundary description.

There were cases when bounding boxes occupied most of the area in the image, which led to a shortage of available static features, and thus the accuracy of the ego-motion tracking declined. In such cases, we perform pixel-wise segmentation on the image to save more static features. For pixel-wise segmentation, we also use deep learning approaches, such as Mask R-CNN [22]. Pixel-wise segmentation takes more time and slows down the system output rate. Thus, pixel-wise segmentation should be performed only in extreme cases, when the frame is crowded with object bounding boxes.

Figure 4: Two kinds of masks and masked features.

Algorithm 1: Hierarchical Masking
Input: stereo images in the current frame, I_cl, I_cr; Masked Area Ratio threshold, τ_mar.
Output: image mask for the left image in the current frame, I_mask.
Initialisation: a blank image mask, I_mask; initial Masked Area Ratio mar = 0;
1: I_mask = objectDetectionAndMasking(I_cl);
2: mar = calMaskedAreaRatio(I_mask);
3: if (mar ≥ τ_mar) then
4:   I_mask = pixelwiseSegmentationAndMasking(I_cl);
5: end if
6: return I_mask

The switching to pixel-wise segmentation is controlled by an index named Masked Area Ratio (mar). If A_m is the total area of bounding boxes in pixels and A_f is the total area of the image in pixels, then we have

mar = A_m / A_f.   (6)

If mar is larger than the threshold τ_mar, it means the current frame is quite crowded and pixel-wise segmentation is necessary.

Hierarchical object masking is summarized as follows: when we get one frame of input, we first run the object detector to perform object detection and obtain bounding boxes. Then mar is calculated.
If mar is higher than a pre-set threshold τ_mar, we perform pixel-wise segmentation and output the pixel-wise object mask. If mar is smaller than the threshold, the bounding box masks are directly forwarded as the object mask. This algorithm is summarized in Algorithm 1.

3.4 Objects' Motion-state Classification for Further Mask Refinement

After the first ego-motion tracking, with reference to the background, we roughly know the pose of the current frame. Based on the current pose, we triangulate object-level 3D points on all the detected object instances in the current frame and a selected reference frame and determine whether they have moved. Feature points inside static bounding boxes are then unmasked and used as valid static references in the second round of tracking. This algorithm (Algorithm 2), named motion-state classification, is detailed in the following.

To classify objects' motion-states, first, a reference frame needs to be selected from the previous frames. In this work, we used the N-th frame before the current frame as the reference frame. N is determined based on the machines' velocity. For example, for vibration rollers moving mostly at 4 km/h, FPS/3 to FPS/2 can be selected as N (FPS stands for the frame rate of camera recording, namely frames per second). For domestic automobiles running at higher speeds, N should be selected smaller so that there is an appropriate visual change between the current and reference frames. This strategy is simple but effective, given the simple moving pattern of construction machines. There are more sophisticated methods for selecting the best reference frame, as stated in [2] and [18].

Then, suppose there are objects {obj_1, obj_2, ..., obj_m} in the reference frame (RF) and objects {obj′_1, obj′_2, ..., obj′_n} in the current frame (CF). We associate the m objects in RF with the n objects in CF by feature matching.
If the object instances are associated successfully between two frames, which means the object is co-visible in both frames, we triangulate 3D points within the bounding boxes in both frames in SLAM coordinates and calculate point-wise position errors. The position errors of 3D points on the static background are assumed to obey a zero-mean Gaussian distribution. The standard deviation, σ_bkg, is determined beforehand and used as the threshold for classification. For static objects, in principle, all 3D points' position errors should be less than three times σ_bkg. But considering the inaccuracy of a bounding box, we loosened the condition to 70%, i.e. objects are classified as "static" when more than 70% of their 3D points have a position error smaller than (3 × σ_bkg). However, outliers of feature matching usually result in very large position errors. We only keep points with position errors smaller than the median to exclude outliers.

Figure 5: Associate bounding boxes between the Reference Frame (RF) and Current Frame (CF) using feature matching. Triangulate object-level 3D points in RF, then triangulate corresponding 3D points in CF and compare their positions in the two measurements. If most of the point-wise position errors of an object (bounding box) are smaller than three times the standard deviation of static background points, the object is labeled as 'static' during the camera pose change from RF to CF.

Figure 6: Algorithm 2: Objects' Motion-state Classification.

Figure 7: Experiment setting. (a) Construction site bird view. (b) Vibration roller.

Figure 5 shows the principle of the geometric constraint; the left one is a dynamic object and the right one is a static object. Figure 6 shows the input and output as well as the main ideas of Algorithm 2. Details about how to implement this algorithm can be found in our code repository.
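The per-object test just described (keep errors up to the median, then require 70% of the remaining 3D points to have moved less than 3σ_bkg) can be condensed into a short sketch. This is our simplified rendering, not the repository code, and it assumes point association between the two frames is already done:

```python
import numpy as np

def classify_motion_state(pts_ref, pts_cur, sigma_bkg, static_ratio=0.7):
    """Label one associated object 'static' or 'dynamic' from point-wise
    position errors between its triangulated 3D points in the reference
    frame (pts_ref) and current frame (pts_cur), both (N, 3) arrays in
    SLAM coordinates."""
    err = np.linalg.norm(pts_cur - pts_ref, axis=1)
    err = err[err <= np.median(err)]          # drop matching outliers above the median
    frac_static = np.mean(err < 3.0 * sigma_bkg)
    return "static" if frac_static > static_ratio else "dynamic"
```

With the paper's σ_bkg = 0.12, points that agree to within a centimeter or two vote "static", while a half-meter displacement between frames immediately pushes the object to "dynamic".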
4 Experimental Evaluations

4.1 Testing Environments and Datasets

To evaluate our proposed approaches, we conducted experiments at two construction sites in Japan with a machine called a vibration roller, as shown in Figure 7(b). A vibration roller is used to flatten the earthen basement of structures and facilities. For efficiency of work, there are usually multiple rollers running simultaneously and side by side; thus, large moving occlusions become a serious problem for visual SLAM.

In all experiments, a stereo camera was mounted on the cabin top of a roller facing the side. The baseline of the stereo camera was about 1 m. The roller moved along a typical trajectory (Figure 7(a)) with a maximum speed of 11 km/h. The ground truth trajectories were recorded using RTK-GNSS. We synchronized the ground truth and estimated camera poses by minimizing the Absolute Trajectory RMSE ([2, 19, 23]) and choosing appropriate time offsets between the GNSS's and the camera's timers. The estimated camera trajectories were then aligned with the ground truth trajectories by the Umeyama algorithm [24]. We evaluate the accuracy of the camera pose outputs of the vSLAM system with reference to the associated ground truth by the Absolute Trajectory RMSE (AT-RMSE).

Video data were collected at the site and evaluated in the lab. Image resolution was 3840 × 2160, and the frame rate was 60 fps. For efficient evaluation, we downsampled the image sequences to 960 × 540 and 6 fps. We eventually collected five image sequences: three with dynamic machines inside, the 4th containing only two static machines, and the 5th without any occlusions. The specifications of the computer used were an Intel Core i7-9700K CPU and an NVIDIA GeForce RTX 2080 Ti GPU. We used a tool provided by [25] for trajectory accuracy evaluation and visualization.
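The evaluation protocol (align the estimated trajectory to ground truth, then compute AT-RMSE) can be sketched as below. We use the rotation-plus-translation closed form via SVD; the paper cites the Umeyama algorithm [24], which additionally estimates a scale factor that a metric stereo system does not need. Function names are our own:

```python
import numpy as np

def align_rigid(est, gt):
    """Closed-form rigid (rotation + translation) alignment of an estimated
    trajectory est to ground truth gt, both (N, 3) arrays, via SVD."""
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    H = (est - mu_e).T @ (gt - mu_g)            # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    # reflection guard: keep det(R) = +1
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ S @ U.T
    t = mu_g - R @ mu_e
    return (R @ est.T).T + t

def ate_rmse(est, gt):
    """Absolute Trajectory RMSE after rigid alignment."""
    aligned = align_rigid(est, gt)
    return np.sqrt(np.mean(np.sum((aligned - gt) ** 2, axis=1)))
```

Any rigid offset between the GNSS frame and the SLAM frame is removed by the alignment, so the reported RMSE reflects only trajectory shape errors, which is exactly why the paper notes that small AT-RMSE differences can understate the real accuracy gap.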
When evaluating our vSLAM system implementation, all the masks, including bounding box and pixel-wise masks, are generated beforehand using EfficientDet [21] and the Detectron2 [26] version of Mask R-CNN [22]. EfficientDet is reported to be able to prioritize detection speed or detection accuracy through configuration. In our implementation, we used EfficientDet-D0, with weights trained on the MS COCO dataset [27]. The weights for Mask R-CNN are also trained on the MS COCO dataset [27]. Without fine-tuning, they are already good enough for this study. Besides, when calculating the overall computation time per frame, we record the time consumption of the vSLAM tracking part and the mask generation part separately, and then add them together. Note that in hierarchical masking, the additional time caused by pixel-wise segmentation is averaged over all the frames.

Figure 8: Quantitative evaluation for the estimated trajectory of image sequence 1 "kumamoto1". (a) Absolute position error of every camera pose. (b) Camera trajectory with colormapped position error.

Table 1: Details about the five image sequences.

Dataset details | kumamoto1 | kumamoto2 | chiba1 | chiba2 | chiba3
Max. occlusion ratio | 0.493 | 0.445 | 0.521 | 0.633 | 0.0
MAR>0.5 frames | 0/1263 | 0/1186 | 12/647 | 69/668 | 0/708
Machines' speed | 0 to 4 km/h | 0 to 4 km/h | 0 to 4 km/h | 0 to 4 km/h | 0 to 4 km/h
Occlusions & their motion-states | 1 roller (dynamic) | 1 roller (dynamic) | 1 roller (dynamic), 1 roller (static), 7 color cones (static), 1 checkerboard (static) | 2 rollers (static), 7 color cones (static), 1 checkerboard (static) | no occlusions

4.2 Performance Evaluation with Our Construction Site Datasets

Figure 8(a) shows the absolute position error of every camera pose between the estimated trajectory using the proposed system and the ground truth of a sequence (kumamoto2).
Figure 8(b) is a bird's eye view of the camera trajectory with colormapped absolute position error. In total, five sequences were prepared, and we repeated the evaluation 10 times for each sequence. The details of the five sequences are described in Table 1.

Figure 9: Performance comparison on our construction site datasets. (a) Estimated trajectory accuracy (lower is better). (b) Averaged computation speed (lower is better).

Figure 10: Dynamic scene and hierarchical masking example. (a) Three machines working parallel to each other. (b) From the viewpoint of the on-board camera.

Figure 9(a) shows the distribution of the Absolute Trajectory RMSE of all five sequences. We compare our proposed system with a simple baseline system, with DynaSLAM [2], and with the original ORB-SLAM2 stereo. The baseline system is also based on ORB-SLAM2 but is able to detect and remove moving objects. Its "moving object removal" method is derived from Detect-SLAM [9], which performs bounding box detection and masks all movable bounding boxes detected. In the results, our proposed system shows better trajectory accuracy than the baseline in three sequences out of five: kumamoto1, chiba1 and chiba3. If the baseline represents fast and efficient handling of dynamic objects, DynaSLAM is much heavier computationally. But the motion segmentation method in DynaSLAM is pixel-level precise and indeed the current state of the art. The experimental results show that DynaSLAM does have slightly better trajectory accuracy in some sequences, including kumamoto1 and chiba1. The original ORB-SLAM2 stereo can only survive chiba2 and chiba3, which are completely static. In addition, the trajectory accuracies of chiba2 and chiba3 are generally better than those of the dynamic sequences, no matter which method is used. Dynamic occlusions do cause an irreversible influence on camera ego-motion tracking.
Averaged computation speed comparisons are shown in Figure 9(b). Our proposed system is slower than the baseline and ORB-SLAM2 stereo at first. However, our method can be significantly accelerated by utilizing parallel computing such as GPU acceleration. In the implementation named "ours_gpu" in Figure 9, we enabled GPU acceleration for all the ORB feature extractions, and the speed improved notably. However, the trajectory accuracy differed from "ours" to a certain extent, although theoretically they should be the same. We are still looking for the root cause. Finally, the time cost of DynaSLAM (tracking only, without background inpainting) is 2 to 3 times that of ours_gpu. Large computation latency is not preferable, since our targeted task is real-time positioning and navigation of a construction machine.

4.3 Ablation Study

4.3.1 Hierarchical Object Masking

Hierarchical masking aims to efficiently propose an appropriate initial mask in case there are overly-large dynamic occlusions in the image. Figure 10(a) shows a scene where the machine was working along with two other machines and thus had two large occlusions in the camera view. Figure 10(b) shows a sample image recorded by the on-board camera. Notice that the two rectangles labeled as truck are bounding boxes detected by the object detection algorithm, and the color masks inside the bounding boxes are from pixel-wise segmentation. Besides, ORB feature points are extracted and marked on this image. Green points are static features on the static background, blue points are those included by bounding boxes but not included by pixel-wise masks, and red points are features masked by pixel-wise masks. It is obvious that the bounding box mask causes many innocent static features to be treated as dynamic. Through a toy experiment, we can see how this causes a shortage of available feature points and leads to worse pose tracking accuracy.
Then, with a real example from our datasets, we explain the effectiveness of hierarchical masking.

Figure 11: A toy experiment: estimated trajectory accuracy when putting different sizes of occlusions on the 4th image sequence "chiba2".

Table 2: Tracking accuracy of "chiba2" with three different mask types.

Mask type | AT-RMSE, m (average of 10 trials) | Max. occlusion ratio
B-box mask | 0.0437 | 0.63
Hierarchical mask | 0.0404 | 0.50
Pixel-wise mask | 0.0397 | 0.32

(1) A toy experiment

We put a fake constant dynamic occlusion at the center of the mask images of the 4th image sequence, chiba2 (a static scene), and adjusted the size of this area to simulate different occlusion ratios and observe how the resulting trajectory accuracy changes. The result is plotted in Figure 11. Before the occlusion ratio reaches 0.6, the trajectory error only varies over a small range; when the occlusion ratio exceeds 0.7, the RMSE increases exponentially due to a shortage of available features. Therefore, when the occlusion ratio of the image approaches the critical point of 0.6, we define it as a large occlusion condition, requiring the refinement of the bounding box mask to a pixel-wise mask to suppress the growing error. Besides, when the occlusion ratio is larger than 0.6, tracking loss will frequently happen, which is not preferred when navigating a construction machine. To avoid tracking loss and relocalization, we set the threshold (τ_mar in Section 3.3) to 0.5 as a safety limit.

However, when the occlusion ratio is far smaller than 0.6, a bounding box mask is enough and also faster to get. With our computer, generating bounding box masks for one image frame takes 0.0207 seconds on average, while a pixel-wise mask takes 0.12 seconds.

(2) An overly large occlusion case

In order to demonstrate the effectiveness of hierarchical masking when facing overly large occlusions, we show an example from the sequence "chiba2".
From the 3500th frame to the 4500th frame (1000 frames in the original 60 fps sequence) of "chiba2", we encountered an overly large occlusion. As Table 2 shows, when changing from the bounding-box mask to the pixel-wise mask, the maximum masked-area ratio drops from 0.63 to 0.32 and, correspondingly, the trajectory error decreases. Hierarchical masking benefits trajectory accuracy, and it costs much less time than using only the pixel-wise mask: in this example, only 2/3 of the frames during this period need a pixel-wise mask, and the maximum masked-area ratio is constrained within 0.5. Note that although the Absolute Trajectory RMSE difference between 0.0404 and 0.0437 in Table 2 seems trivial, this is partially due to the trajectory alignment algorithm [24] used for evaluation; the actual accuracy difference can be larger.

4.3.2 Objects' Motion-state Classification

Not all a priori dynamic objects are moving. Ignoring static objects leads to a loss of information, especially when they are salient and occupy a large area in the image. Therefore, we designed the objects' motion-state classification algorithm to detect static objects and unmask them for ego-motion tracking. Figure 12 shows dynamic and static objects detected in the image sequences, together with scores indicating the possibility of each being dynamic. We also show an example

Figure 12: Illustration of the classification result. In the left column, the third row shows one machine classified as "moving" and another classified as "static" in this frame. The second row shows the position errors of 3D points on these two machines between this frame and the reference frame; points on the "moving" machine have higher position errors. Similarly, in the right column, there are "moving" machines (two parts of one machine) and a "static" color cone.

Table 3: Tracking accuracy with motion-state classification.
Mask type                 AT-RMSE, m   Max. occlusion ratio
All objects masked        0.04973      0.63
Static objects unmasked   0.04198      0.0

of using the proposed algorithm in visual SLAM. Again, we use the 3500th frame to the 4500th frame (1000 frames) of the "chiba2" sequence; since the machines are totally static during this period, they are detected as static and unmasked. Table 3 shows how this influences the tracking accuracy.

However, there is still one bottleneck in this algorithm. σbkg is an essential parameter for the performance of motion-state classification. For all the evaluations above with the four image sequences, we set σbkg to 0.12, a value determined empirically. To explore the influence of this parameter on system performance, we vary σbkg between 0 and 0.6 and evaluate the classifier in terms of ROC (Receiver Operating Characteristic). Since the final target is to find static objects, "static" is regarded as positive and "dynamic" as negative, ignoring objects that cannot be classified. The ROC curve is shown in Figure 13. The true positive rate (TPR, sensitivity) on the y axis is the ratio of true positives to the sum of true positives and false negatives; the false positive rate on the x axis is the ratio of false positives to the sum of false positives and true negatives. According to this curve, the Area Under the Curve (AUC) reaches 0.737, showing the classifier to be valid. The red dot in the plot marks the position where σbkg = 0.12.

4.4 Evaluation with KITTI Dataset

The KITTI Dataset [28] provides stereo camera sequences in outdoor urban and highway environments. It has become a widespread benchmark for evaluating vSLAM system performance, especially trajectory accuracy; works such as [2, 19] all provide evaluation results on KITTI. Some KITTI sequences contain normal-size dynamic occlusions, so KITTI is also appropriate for evaluating our method. Table 4 shows the evaluation results.
The comparison includes four systems: our proposed system, the baseline, DynaSLAM, and ORB-SLAM2 stereo, the same as in Section 4.2. For the baseline, DynaSLAM, and ORB-SLAM2 stereo, all settings remain as before. For our system, τmar (Section 3.3) remains 0.5 and σbkg (Section 3.4) remains 0.12. However, N (Section 3.4) is changed to 2, since the frame rate of KITTI is 10 fps and the cars are much faster than our construction machines. We ran each sequence 10 times with each system and recorded the averaged Absolute Trajectory RMSE (AT-RMSE, m) as well as the averaged computation time per frame (s). For our system, we recorded results both with GPU acceleration (w A) and without it (w/o A). Among the four systems, the best AT-RMSE for each sequence is marked in bold and the best computation time in bold italics. Note that the AT-RMSE results of DynaSLAM

Figure 13: ROC curve for the motion-state classification when σbkg was between 0 and 0.6, estimated with the 3rd image sequence "chiba1". The Area Under Curve (AUC) reached 0.737. The red dot marks the position where σbkg = 0.12.

Table 4: Trajectory accuracy and time consumption evaluation on the KITTI Dataset.
             ours                                  baseline             dynaslam (tracking)  orb-slam2
Sequence     AT-RMSE (m)     time per frame (s)    AT-RMSE  time per    AT-RMSE  time per    AT-RMSE  time per
             w/o A   w A     w/o A   w A           (m)      frame (s)   (m)      frame (s)   (m)      frame (s)
KITTI 00     2.1290  1.7304  0.2018  0.1565        2.0173   0.0912      3.9691   0.3354      1.7304   0.0703
KITTI 01     2.1718  2.2651  0.2076  0.1546        9.1271   0.0917      21.8982  0.3273      8.7620   0.0734
KITTI 02     8.4940  8.7620  0.1860  0.1305        4.9280   0.0935      5.9401   0.3243      4.9994   0.0771
KITTI 03     1.2323  1.3159  0.1791  0.1337        3.1174   0.0898      4.7770   0.3459      3.0735   0.0723
KITTI 04     5.1759  4.7338  0.1764  0.1194        0.9970   0.0864      1.3371   0.3420      1.0079   0.0672
KITTI 05     4.5641  5.2294  0.1945  0.1445        2.0528   0.0923      1.7644   0.3482      1.9751   0.0717
KITTI 06     3.2169  3.4246  0.1462  0.0983        1.9338   0.0943      2.0627   0.3434      1.8793   0.0752
KITTI 07     4.9692  5.8698  0.1760  0.1231        1.1799   0.0843      1.1285   0.3493      0.9733   0.0632
KITTI 08     1.0835  1.2937  0.1811  0.1297        4.7857   0.0882      3.7062   0.3488      4.6483   0.0675
KITTI 09     2.5849  2.6375  0.1522  0.1022        7.1441   0.0865      4.2753   0.3463      5.9788   0.0657
KITTI 10     2.1243  2.2529  0.1915  0.1382        2.6986   0.0912      2.2028   0.3466      2.6699   0.0631

and ORB-SLAM2 stereo differ from the original papers. This is because we only align the trajectory with the ground truth, without adjusting scale, before calculating the trajectory error, since our target is online positioning with vSLAM.

From Table 4, we see that in terms of computation speed, ORB-SLAM2 stereo is always the best, because it adopts the static-environment assumption; DynaSLAM is the slowest. Ours is slightly slower than the baseline and ORB-SLAM2 stereo; however, GPU acceleration does improve the speed to a tolerable level. In terms of AT-RMSE, the results vary, but DynaSLAM and ORB-SLAM2 stereo achieved the best accuracy on most sequences. In the KITTI dataset, there are moving automobiles, bicycles, and persons in some frames, but they are not overly large.
In fact, there are only 6 frames in "07" in which the occlusion ratio exceeds 0.5. Moreover, automobiles on the street do not carry as many salient feature points as construction machines; most of them have texture-less, smooth surfaces. Therefore, our proposed system holds no particular advantage on KITTI.

5 Conclusions & Future Work

We presented a stereo vSLAM system for dynamic outdoor construction sites. The key contributions are, first, a hierarchical masking strategy that can promptly refine overly large occlusion masks in an efficient way, and second, a semantic-geometric approach for objects' motion-state classification together with a two-step coarse-to-fine ego-motion tracking scheme. Our system accurately retrieved the motion trajectories of a stereo camera at construction sites, and most of the surrounding objects' motion-states in the scene were correctly predicted. Hierarchical object masking also proved to be a simple but useful strategy. Our proposed system can handle dynamic and crowded environments in which standard vSLAM systems may fail to keep tracking.

In future work, the method for selecting reference frames can be optimized to enable more robust object motion-state classification. Moreover, we plan to combine vSLAM with an inertial measurement unit (IMU) for higher-accuracy positioning. However, the fierce, high-frequency vibration of the vibration roller may cause severe noise in the IMU measurements, resulting in worse pose accuracy. We will therefore look into this problem while also exploring other visual SLAM research topics at construction sites.

References

 [1] Runqiu Bao, Ren Komatsu, Renato Miyagusuku, Masaki Chino, Atsushi Yamashita, and Hajime Asama. Cost-effective and robust visual based localization with consumer-level cameras at construction sites.
In Proceedings of the 2019 IEEE Global Conference on Consumer Electronics (GCCE 2019), pages 983–985, 2019.

 [2] Berta Bescos, José M. Fácil, Javier Civera, and José Neira. DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes. IEEE Robotics and Automation Letters, 3(4):4076–4083, 2018.

 [3] Mariano Jaimez, Christian Kerl, Javier Gonzalez-Jimenez, and Daniel Cremers. Fast odometry and scene flow from RGB-D cameras based on geometric clustering. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA 2017), pages 3992–3999, 2017.

 [4] Dan Barnes, Will Maddern, Geoffrey Pascoe, and Ingmar Posner. Driven to distraction: Self-supervised distractor learning for robust monocular visual odometry in urban environments. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA 2018), pages 1894–1900, 2018.

 [5] Binbin Xu, Wenbin Li, Dimos Tzoumanikas, Michael Bloesch, Andrew Davison, and Stefan Leutenegger. MID-Fusion: Octree-based object-level multi-instance dynamic SLAM. In Proceedings of the 2019 IEEE International Conference on Robotics and Automation (ICRA 2019), pages 5231–5237, 2019.

 [6] Ioan Andrei Bârsan, Peidong Liu, Marc Pollefeys, and Andreas Geiger. Robust dense mapping for large-scale dynamic environments. In Proceedings of the 2018 IEEE International Conference on Robotics and Automation (ICRA 2018), pages 7510–7517, 2018.

 [7] Martin Runz, Maud Buffier, and Lourdes Agapito. MaskFusion: Real-time recognition, tracking and reconstruction of multiple moving objects. In Proceedings of the 2018 IEEE International Symposium on Mixed and Augmented Reality (ISMAR 2018), pages 10–20, 2018.

 [8] Chao Yu, Zuxin Liu, Xin-Jun Liu, Fugui Xie, Yi Yang, Qi Wei, and Qiao Fei. DS-SLAM: A semantic visual SLAM towards dynamic environments.
In Proceedings of the 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2018), pages 1168–1174. IEEE, 2018.

 [9] Fangwei Zhong, Sheng Wang, Ziqi Zhang, and Yizhou Wang. Detect-SLAM: Making object detection and SLAM mutually beneficial. In Proceedings of the 2018 IEEE Winter Conference on Applications of Computer Vision (WACV 2018), pages 1001–1010. IEEE, 2018.

[10] Linhui Xiao, Jinge Wang, Xiaosong Qiu, Zheng Rong, and Xudong Zou. Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment. Robotics and Autonomous Systems, 117:1–16, 2019.

[11] João Carlos Virgolino Soares, Marcelo Gattass, and Marco Antonio Meggiolaro. Visual SLAM in human populated environments: Exploring the trade-off between accuracy and speed of YOLO and Mask R-CNN. In Proceedings of the 2019 International Conference on Advanced Robotics (ICAR 2019), pages 135–140. IEEE, 2019.

[12] Muhamad Risqi U. Saputra, Andrew Markham, and Niki Trigoni. Visual SLAM and structure from motion in dynamic environments: A survey. ACM Computing Surveys (CSUR), 51(2):1–36, 2018.

[13] Martin A. Fischler and Robert C. Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.

[14] Zhaopeng Cui, Lionel Heng, Ye Chuan Yeo, Andreas Geiger, Marc Pollefeys, and Torsten Sattler. Real-time dense mapping for self-driving vehicles using fisheye cameras. In Proceedings of the 2019 International Conference on Robotics and Automation (ICRA 2019), pages 6087–6093, 2019.

[15] Joseph Redmon and Ali Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.

[16] Abhijit Kundu, K. Madhava Krishna, and Jayanthi Sivaswamy. Moving object detection by multi-view geometric techniques from a single camera mounted robot.
In Proceedings of the 2009 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2009), pages 4306–4312, 2009.

[17] Davide Migliore, Roberto Rigamonti, Daniele Marzorati, Matteo Matteucci, and Domenico G. Sorrenti. Use a single camera for simultaneous localization and mapping with mobile object tracking in dynamic environments. In Proceedings of the 2009 ICRA Workshop on Safe Navigation in Open and Dynamic Environments: Application to Autonomous Vehicles, pages 12–17, 2009.

[18] Wei Tan, Haomin Liu, Zilong Dong, Guofeng Zhang, and Hujun Bao. Robust monocular SLAM in dynamic environments. In Proceedings of the 2013 IEEE International Symposium on Mixed and Augmented Reality (ISMAR 2013), pages 209–218, 2013.

[19] Raul Mur-Artal and Juan D. Tardós. ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Transactions on Robotics, 33(5):1255–1262, 2017.

[20] Raul Mur-Artal, Jose Maria Martinez Montiel, and Juan D. Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.

[21] Mingxing Tan, Ruoming Pang, and Quoc V. Le. EfficientDet: Scalable and efficient object detection. arXiv preprint arXiv:1911.09070, 2019.

[22] Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. Mask R-CNN. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV 2017), pages 2961–2969, 2017.

[23] Jürgen Sturm, Nikolas Engelhard, Felix Endres, Wolfram Burgard, and Daniel Cremers. A benchmark for the evaluation of RGB-D SLAM systems. In Proceedings of the 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2012), pages 573–580. IEEE, 2012.

[24] Shinji Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis & Machine Intelligence, (4):376–380, 1991.

[25] Michael Grupp.
evo: Python package for the evaluation of odometry and SLAM. https://github.com/MichaelGrupp/evo, 2017.

[26] Yuxin Wu, Alexander Kirillov, Francisco Massa, Wan-Yen Lo, and Ross Girshick. Detectron2. https://github.com/facebookresearch/detectron2, 2019.

[27] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: Common objects in context. In Proceedings of the 2014 European Conference on Computer Vision (ECCV 2014), pages 740–755, 2014.

[28] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2021年/Tartanvo A generalizable learning_based vo.pdf b/动态slam/2020年-2022年开源动态SLAM/2021年/Tartanvo A generalizable learning_based vo.pdf
new file mode 100644
index 0000000..de9ebaa
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2021年/Tartanvo A generalizable learning_based vo.pdf
@@ -0,0 +1,724 @@

TartanVO: A Generalizable Learning-based VO

Wenshan Wang*  Yaoyu Hu  Sebastian Scherer
Carnegie Mellon University

arXiv:2011.00359v1 [cs.CV] 31 Oct 2020

Abstract: We present the first learning-based visual odometry (VO) model that generalizes to multiple datasets and real-world scenarios, and outperforms geometry-based methods in challenging scenes. We achieve this by leveraging the SLAM dataset TartanAir, which provides a large amount of diverse synthetic data in challenging environments. Furthermore, to make our VO model generalize across datasets, we propose an up-to-scale loss function and incorporate the camera intrinsic parameters into the model.
Experiments show that a single model, TartanVO, trained only on synthetic data and without any finetuning, generalizes to real-world datasets such as KITTI and EuRoC, demonstrating significant advantages over geometry-based methods on challenging trajectories. Our code is available at https://github.com/castacks/tartanvo.

Keywords: Visual Odometry, Generalization, Deep Learning, Optical Flow

1 Introduction

Visual SLAM (Simultaneous Localization and Mapping) is becoming increasingly important for autonomous robotic systems due to the ubiquitous availability and information richness of images [1]. Visual odometry (VO) is one of the fundamental components of a visual SLAM system. Impressive progress has been made in both geometry-based methods [2, 3, 4, 5] and learning-based methods [6, 7, 8, 9]. However, developing a robust and reliable VO method for real-world applications remains challenging.

On one hand, geometry-based methods are not robust enough in many real-life situations [10, 11]. On the other hand, although learning-based methods demonstrate robust performance on many visual tasks, including object recognition, semantic segmentation, depth reconstruction, and optical flow, we have not yet seen the same success in VO.

It is widely accepted that, by leveraging a large amount of data, deep-neural-network-based methods can learn a better feature extractor than engineered ones, resulting in a more capable and robust model. But why haven't deep learning models outperformed geometry-based methods yet? We argue that there are two main reasons. First, existing VO models are trained with insufficient diversity, which is critical for learning-based methods to generalize. By diversity, we mean diversity both in scenes and in motion patterns.
For example, a VO model trained only on outdoor scenes is unlikely to generalize to an indoor environment. Similarly, a model trained with data collected by a camera fixed on a ground robot, with limited pitch and roll motion, is unlikely to be applicable to drones. Second, most current learning-based VO models neglect fundamental aspects of the problem that are well formulated in geometry-based VO theory. From the theory of multi-view geometry, we know that recovering the camera pose from a sequence of monocular images suffers from scale ambiguity. Besides, recovering the pose must take into account the camera intrinsic parameters (referred to as the intrinsics ambiguity later). Without explicitly dealing with the scale problem and the camera intrinsics, a model learned from one dataset would likely fail on another dataset, no matter how good the feature extractor is.

To this end, we propose a learning-based method that solves the above two problems and generalizes across datasets. Our contributions are threefold. First, we demonstrate the crucial effect of data diversity on the generalization ability of a VO model by comparing performance across different quantities of training data. Second, we design an up-to-scale loss function to deal with the

∗Corresponding author: wenshanw@andrew.cmu.edu

4th Conference on Robot Learning (CoRL 2020), Cambridge MA, USA.

scale ambiguity of monocular VO. Third, we create an intrinsics layer (IL) in our VO model, enabling generalization across different cameras. To our knowledge, our model is the first learning-based VO that achieves competitive performance on various real-world datasets without finetuning. Furthermore, compared to geometry-based methods, our model is significantly more robust in challenging scenes.
A demo video can be found at: https://www.youtube.com/watch?v=NQ1UEh3thbU

2 Related Work

Beyond early studies of learning-based VO models [12, 13, 14, 15], more and more end-to-end learning-based VO models have been studied, with improved accuracy and robustness. The majority of recent end-to-end models adopt an unsupervised-learning design [6, 16, 17, 18], due to the complexity and high cost of collecting ground-truth data. However, supervised models trained on labeled odometry data still perform better [19, 20].

To improve performance, end-to-end VO models tend to have auxiliary outputs related to camera motion, such as depth and optical flow. With depth prediction, models obtain supervision signals by imposing depth consistency between temporally consecutive images [17, 21]. This procedure can be interpreted as matching the temporal observations in 3D space. A similar effect of temporal matching can be achieved by producing optical flow; e.g., [16, 22, 18] jointly predict depth, optical flow, and camera motion.

Optical flow can also be treated as an intermediate representation that explicitly expresses the 2D matching. Camera motion estimators can then process the optical flow data rather than working directly on raw images [20, 23]. If designed this way, the components estimating camera motion can even be trained separately on available optical flow data [19]. We follow these designs and use optical flow as an intermediate representation.

It is well known that monocular VO systems have scale ambiguity. Nevertheless, most supervised learning models do not handle this issue and directly use the difference between the model prediction and the true camera motion as supervision [20, 24, 25]. In [19], the scale is handled by dividing the optical flow into sub-regions and imposing consistency among the motion predictions of these regions.
In non-learning methods, scale ambiguity can be resolved if a 3D map is available [26]. Ummenhofer et al. [20] introduce depth prediction to correct scale drift. Tateno et al. [27] and Sheng et al. [28] ameliorate the scale problem by leveraging the key-frame selection technique from SLAM systems. Recently, Zhan et al. [29] used PnP techniques to explicitly solve for the scale factor. These methods add extra complexity to the VO system; however, the scale ambiguity is not fully eliminated for monocular setups, especially at evaluation time. Instead, some models choose to produce only up-to-scale predictions. Wang et al. [30] reduce the scale ambiguity in monocular depth estimation by normalizing the depth prediction before computing the loss function. Similarly, we focus on predicting the translation direction rather than recovering the full scale from monocular images, by defining a new up-to-scale loss function.

Learning-based models suffer from generalization issues when tested on images from a new environment or a new camera. Most VO models are trained and tested on the same dataset [16, 17, 31, 18]. Some multi-task models [6, 20, 32, 22] test their generalization ability only on depth prediction, not on camera pose estimation. Recent efforts, such as [33], use model adaptation to deal with new environments; however, additional training is needed on a per-environment or per-camera basis. In this work, we propose a novel approach to achieve cross-camera/dataset generalization by incorporating the camera intrinsics directly into the model.

Figure 1: The two-stage network architecture. The model consists of a matching network, which estimates optical flow from two consecutive RGB images, followed by a pose network predicting camera motion from the optical flow.
3 Approach

3.1 Background

We focus on the monocular VO problem, which takes two consecutive undistorted images {I_t, I_{t+1}} and estimates the relative camera motion δ_t^{t+1} = (T, R), where T ∈ R^3 is the 3D translation and R ∈ so(3) denotes the 3D rotation. According to epipolar geometry theory [34], geometry-based VO proceeds in two steps. First, visual features are extracted from I_t and I_{t+1} and matched. Then, using the matching results, the essential matrix is computed, leading to the recovery of the up-to-scale camera motion δ_t^{t+1}.

Following the same idea, our model consists of two sub-modules. One is the matching module M_θ(I_t, I_{t+1}), estimating the dense matching result F_t^{t+1} (i.e., optical flow) from two consecutive RGB images. The other is a pose module P_φ(F_t^{t+1}) that recovers the camera motion δ_t^{t+1} from the matching result (Fig. 1). This modular design is also widely used in other learning-based methods, especially in unsupervised VO [13, 19, 16, 22, 18].

3.2 Training on large-scale diverse data

Generalization capability has always been one of the most critical issues for learning-based methods. Most previous supervised models are trained on the KITTI dataset, which contains 11 labeled sequences and 23,201 image frames in the driving scenario [35]. Wang et al. [8] presented training and testing results on the EuRoC dataset [36], collected by a micro aerial vehicle (MAV). They reported that performance was limited by the lack of training data and the more complex dynamics of a flying robot. Surprisingly, most unsupervised methods also train their models only on very uniform scenes (e.g., KITTI and Cityscapes [37]). To our knowledge, no learning-based model has yet shown the capability of running on multiple types of scenes (car/MAV, indoor/outdoor). To achieve this, we argue that the training data has to cover diverse scenes and motion patterns.
TartanAir [11] is a large-scale dataset with highly diverse scenes and motion patterns, containing more than 400,000 data frames. It provides multi-modal ground-truth labels including depth, segmentation, optical flow, and camera pose. The scenes include indoor, outdoor, urban, nature, and sci-fi environments. The data is collected with a simulated pinhole camera, which moves with random and rich 6DoF motion patterns in 3D space.

We take advantage of the monocular image sequences {I_t}, the optical flow labels {F_t^{t+1}}, and the ground-truth camera motions {δ_t^{t+1}} in our task. Our objective is to jointly minimize the optical flow loss L_f and the camera motion loss L_p. The end-to-end loss is defined as:

L = λ L_f + L_p = λ ‖M_θ(I_t, I_{t+1}) − F_t^{t+1}‖ + ‖P_φ(F̂_t^{t+1}) − δ_t^{t+1}‖    (1)

where λ is a hyper-parameter balancing the two losses. We use ˆ· to denote a variable estimated by our model.

Since TartanAir is purely synthetic, the biggest question is: can a model learned from simulation data generalize to real-world scenes? As discussed by Wang et al. [11], a large number of studies show that a model trained purely in simulation, but with broad diversity, can be easily transferred to the real world. This is also known as domain randomization [38, 39]. In our experiments, we show that the diverse simulated data indeed enables the VO model to generalize to real-world data.

3.3 Up-to-scale loss function

The motion scale is unobservable from a monocular image sequence. In geometry-based methods, the scale is usually recovered from other sources of information, ranging from known object sizes or camera height to extra sensors such as an IMU. However, most existing learning-based VO studies neglect the scale problem and try to recover the motion with scale. This is feasible if the model is trained and tested with the same camera and in the same type of scenario.
For example, in the KITTI dataset, the camera is mounted at a fixed height above the ground and with a fixed orientation. A model can learn to memorize the scale in this particular setup. Obviously, the model will have huge problems when tested with a different camera configuration. Imagine that the

Figure 2: a) Illustration of the FoV and image resolution in the TartanAir, EuRoC, and KITTI datasets. b) Calculation of the intrinsics layer.

camera in KITTI moves a little upwards and becomes higher above the ground: the same camera motion would cause a smaller optical flow value on the ground, which is inconsistent with the training data. Although the model could potentially learn to pick up other clues such as object size, it is still not fully reliable across different scenes or environments.

Following the geometry-based methods, we recover only an up-to-scale camera motion from the monocular sequences. Knowing that the scale ambiguity affects only the translation T, we design a new loss function for T and keep the loss for rotation R unchanged. We propose two up-to-scale loss functions for L_p: the cosine similarity loss L_p^cos and the normalized distance loss L_p^norm. L_p^cos is defined by the cosine angle between the estimated T̂ and the label T:

L_p^cos = (T̂ · T) / max(‖T̂‖ · ‖T‖, ε) + ‖R̂ − R‖    (2)

Similarly, for L_p^norm, we normalize the translation vectors before calculating the distance between the estimation and the label:

L_p^norm = ‖ T̂ / max(‖T̂‖, ε) − T / max(‖T‖, ε) ‖ + ‖R̂ − R‖    (3)

where ε = 1e-6 is used to avoid division by zero. In our preliminary empirical comparison, the two formulations performed similarly. In the following sections, we use Eq. 3 to replace L_p in Eq. 1. Later, we show by experiments that the proposed up-to-scale loss function is crucial for the model's generalization ability.
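The normalized-distance loss of Eq. (3) can be spelled out in a few lines. This is a pure-Python sketch for illustration only; the paper's actual implementation operates on PyTorch tensors, and the list-based helper here is an assumption made to keep the example dependency-free:

```python
import math

EPS = 1e-6  # guards against division by zero for near-zero translations

def _norm(v):
    """Euclidean norm of a plain-list vector."""
    return math.sqrt(sum(x * x for x in v))

def up_to_scale_loss(t_hat, r_hat, t_gt, r_gt):
    """Normalized-distance up-to-scale motion loss (Eq. 3).

    t_hat, t_gt -- predicted / ground-truth 3D translations
    r_hat, r_gt -- predicted / ground-truth rotations (so(3) vectors)
    Both translations are normalized to unit length before comparison, so only
    the translation *direction* is penalized; the rotation term is a plain
    Euclidean distance, unchanged from the ordinary supervised loss.
    """
    tn_hat = [x / max(_norm(t_hat), EPS) for x in t_hat]
    tn_gt = [x / max(_norm(t_gt), EPS) for x in t_gt]
    t_term = _norm([a - b for a, b in zip(tn_hat, tn_gt)])
    r_term = _norm([a - b for a, b in zip(r_hat, r_gt)])
    return t_term + r_term
```

A prediction that differs from the label by only a positive scale factor incurs zero translation penalty, which is exactly the up-to-scale behavior the loss is designed to enforce.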
3.4 Cross-camera generalization by encoding camera intrinsics

In epipolar geometry theory, the camera intrinsics are required when recovering the camera pose from the essential matrix (assuming the images are undistorted). In fact, learning-based methods are unlikely to generalize to data with different camera intrinsics. Imagine a simple case in which the camera switches to a lens with a larger focal length. Assuming the image resolution remains the same, the same camera motion will introduce larger optical flow values; we call this the intrinsics ambiguity.

A tempting solution to the intrinsics ambiguity is warping the input images to match the camera intrinsics of the training data. However, this is not practical, especially when the cameras differ too much. As shown in Fig. 2-a, if a model is trained on TartanAir, a warped KITTI image covers only a small part of TartanAir's field of view (FoV). During training, a model learns to exploit cues from all possible positions in the FoV and the interrelationships among those cues. Some of these cues no longer exist in the warped KITTI images, leading to drastic performance drops.

3.4.1 Intrinsics layer

We propose to train a model that takes both RGB images and camera intrinsics as input, so that the model can directly handle images coming from various camera settings. Specifically, instead of recovering the camera motion δ_t^{t+1} only from the feature matching F_t^{t+1}, we design a new pose network P_φ(F_t^{t+1}, K), which also depends on the camera intrinsic parameters K = {f_x, f_y, o_x, o_y}, where f_x and f_y are the focal lengths, and o_x and o_y denote the position of the principal point.

Figure 3: The data augmentation procedure of random cropping and resizing. In this way we generate a wide range of camera intrinsics (FoV 40° to 90°).
As for the implementation, we concatenate an IL (intrinsics layer) K^c ∈ R^{2×H×W} (H and W are the image height and width, respectively) to F_t^{t+1} before it enters P_φ. To compose K^c, we first generate two index matrices X_ind and Y_ind for the x and y axes of the 2D image frame (Fig. 2-b). Then the two channels of K^c are calculated as follows:

K^c_x = (X_ind − o_x) / f_x
K^c_y = (Y_ind − o_y) / f_y    (4)

The concatenation of F_t^{t+1} and K^c augments the optical flow estimation with 2D position information. Similar to how geometry-based methods must know the 2D coordinates of the matched features, K^c provides the necessary position information. In this way, the intrinsics ambiguity is explicitly handled by coupling the 2D positions with the matching estimation (F_t^{t+1}).

3.4.2 Data generation for various camera intrinsics

To make a model generalizable across different cameras, we need training data with various camera intrinsics. TartanAir has only one set of camera intrinsics, where f_x = f_y = 320, o_x = 320, and o_y = 240. We simulate various intrinsics by randomly cropping and resizing (RCR) the input images. As shown in Fig. 3, we first crop the image at a random location with a random size. Next, we resize the cropped image to the original size. One advantage of the IL is that during RCR, we can crop and resize the IL together with the image, without recomputing it. To cover typical cameras with FoV between 40° and 90°, we find that random resizing factors up to 2.5 are sufficient during RCR. Note that the ground-truth optical flow should also be scaled with the resizing factor. We use very aggressive cropping and shifting in our training, which means the optical center can be far from the image center. Although the resulting intrinsic parameters are uncommon in modern cameras, we find that this improves generalization.
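Equation (4) translates directly into code. The following is a dependency-free sketch, not the authors' implementation: it builds the two IL channels as nested lists, whereas the actual model would stack them as a 2 × H × W tensor and concatenate them with the optical flow along the channel axis:

```python
def intrinsics_layer(height, width, fx, fy, ox, oy):
    """Build the two channels of the intrinsics layer K^c of Eq. (4).

    Each pixel (u, v) stores its normalized image-plane coordinate:
        kx[v][u] = (u - ox) / fx,   ky[v][u] = (v - oy) / fy
    which is exactly the 2D position information the pose network needs
    alongside the optical flow.
    """
    kx = [[(u - ox) / fx for u in range(width)] for _ in range(height)]
    ky = [[(v - oy) / fy for _ in range(width)] for v in range(height)]
    return kx, ky
```

With TartanAir's intrinsics (f_x = f_y = 320, o_x = 320, o_y = 240) on a 640 × 480 image, the x channel spans roughly [−1, 1) and the y channel roughly [−0.75, 0.75). Because each entry depends only on its own pixel coordinates, the same crop-and-resize applied to the image during RCR can be applied to these channels directly, which is why the IL never needs to be recomputed.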
4 Experimental Results

4.1 Network structure and training details

Network We use the pre-trained PWC-Net [40] as the matching network M_θ, and a modified ResNet50 [41] as the pose network P_φ. We remove the batch normalization layers from the ResNet and add two output heads for the translation and rotation, respectively. The PWC-Net outputs optical flow at 1/4 resolution (H/4 × W/4), so P_φ is trained on 1/4-size input, consuming very little GPU memory. The overall inference time (including both M_θ and P_φ) is 40 ms on an NVIDIA GTX 1080 GPU.

Training Our model is implemented in PyTorch [42] and trained on 4 NVIDIA GTX 1080 GPUs. There are two training stages. First, P_φ is trained separately using ground-truth optical flow and camera motions for 100,000 iterations with a batch size of 100. In the second stage, P_φ and M_θ are connected and jointly optimized for 50,000 iterations with a batch size of 64. In both stages, the learning rate is set to 1e-4 with a decay factor of 0.2 applied at 1/2 and 7/8 of the total training steps. The RCR is applied to the optical flow, the RGB images, and the IL (Sec 3.4.2).

4.2 How the training data quantity affects the generalization ability

To show the effects of data diversity, we compare the generalization ability of models trained with different amounts of data. We use 20 environments from the TartanAir dataset and set aside 3 environments (Seaside-town, Soul-city, and Hongkong-alley) only for testing, which results in

Figure 4: Generalization ability with respect to different quantities of training data. Model P_φ is trained on true optical flow. Blue: training loss; orange: testing loss on three unseen environments. The testing loss drops consistently with an increasing quantity of training data.

Figure 5: Comparison of the loss curves w/ and w/o the up-to-scale loss function. a) The training and testing loss w/o the up-to-scale loss. b) The translation and rotation losses of a).
A big gap exists between the training and testing translation losses (orange arrow in b)). c) The training and testing losses w/ the up-to-scale loss. d) The translation and rotation losses of c). The translation loss gap decreases.

more than 400,000 training frames and about 40,000 testing frames. As a comparison, the KITTI and EuRoC datasets provide 23,201 and 26,604 pose-labeled frames, respectively. Besides, the data in KITTI and EuRoC are much more uniform in terms of scene type and motion pattern. As shown in Fig. 4, we set up three experiments using 20,000 (comparable to KITTI and EuRoC), 100,000, and 400,000 frames of data to train the pose network P_φ. The experiments show that the generalization ability, measured by the gap between the training loss and the testing loss on unseen environments, improves consistently with increasing training data.

4.3 Up-to-scale loss function

Without the up-to-scale loss, we observe a gap between the training and testing losses even when training with a large amount of data (Fig. 5-a). When we plot the translation and rotation losses separately (Fig. 5-b), it shows that the translation error is the main contributor to the gap. After we apply the up-to-scale loss function described in Sec 3.3, the translation loss gap decreases (Fig. 5-c,d). During testing, we align the translation with the ground truth to recover the scale, in the same way as described in [16, 6].

4.4 Camera intrinsics layer

The IL is critical to the generalization ability across datasets. Before moving to other datasets, we first design an experiment to investigate the properties of the IL using the pose network P_φ. As shown in Table 1, in the first two columns, where the data has no RCR augmentation, the training and testing losses are low, but these two models output nonsense values on data with RCR augmentation. One interesting finding is that adding the IL does not help in the case of only one type of intrinsics.
This indicates that the network has learned a very different algorithm from the geometry-based methods, where the intrinsics are necessary to recover the motion. The last two columns show that the IL is critical when the input data is augmented by RCR (i.e., various intrinsics). Another interesting observation is that training a model with RCR and the IL leads to a lower testing loss (last column) than training on only one type of intrinsics (first two columns). This indicates that by generating data with various intrinsics, we learn a more robust model for the VO task.

Table 1: Training and testing losses with four combinations of RCR and IL settings. The IL is critical in the presence of RCR. The model trained with RCR reaches a lower testing loss than those without RCR.

Training configuration     w/o RCR, w/o IL  w/o RCR, w/ IL  w/ RCR, w/o IL  w/ RCR, w/ IL
Training loss              0.0325           0.0311          0.1534          0.0499
Test loss on data w/ RCR   -                -               0.1999          0.0723
Test loss on data w/o RCR  0.0744           0.0714          0.1630          0.0549

Table 2: Comparison of translation and rotation on the KITTI dataset. DeepVO [43] is a supervised method trained on Seq 00, 02, 08, 09. It contains an RNN module, which accumulates information from multiple frames. Wang et al. [9] is a supervised method trained on Seq 00-08 and uses the semantic information of multiple frames to optimize the trajectory. UnDeepVO [44] and GeoNet [16] are trained on Seq 00-08 in an unsupervised manner. VISO2-M [45] and ORB-SLAM [3] are geometry-based monocular VO. ORB-SLAM uses bundle adjustment on multiple frames to optimize the trajectory. Our method works in a pure VO manner (it only takes two frames). It has never seen any KITTI data before testing, and yet achieves competitive results.

Seq                06             07             09             10             Ave
                   trel   rrel    trel   rrel    trel   rrel    trel   rrel    trel   rrel
DeepVO [43]*†      5.42   5.82    3.91   4.60    -      -       8.11   8.83    5.81   6.41
Wang et al. [9]*†  -      -       -      -       8.04   1.51    6.23   0.97    7.14   1.24
UnDeepVO [44]*     6.20   1.98    3.15   2.48    -      -       10.63  4.65    6.66   3.04
GeoNet [16]*       9.28   4.34    8.27   5.93    26.93  9.54    20.73  9.04    16.3   7.21
VISO2-M [45]       7.30   6.14    23.61  19.11   4.04   1.43    25.2   3.8     15.04  7.62
ORB-SLAM [3]†      18.68  0.26    10.96  0.37    15.3   0.26    3.71   0.3     12.16  0.3
TartanVO (ours)    4.72   2.95    4.32   3.41    6.00   3.11    6.89   2.73    5.48   3.05

trel: average translational RMSE drift (%) on lengths of 100–800 m.
rrel: average rotational RMSE drift (°/100 m) on lengths of 100–800 m.
*: the starred methods are trained or finetuned on the KITTI dataset.
†: these methods use multiple frames to optimize the trajectory after the VO process.

4.5 Generalize to real-world data without finetuning

KITTI dataset The KITTI dataset is one of the most influential datasets for VO/SLAM tasks. We compare our model, TartanVO, with two supervised learning models (DeepVO [43], Wang et al. [9]), two unsupervised models (UnDeepVO [44], GeoNet [16]), and two geometry-based methods (VISO2-M [45], ORB-SLAM [3]). All the learning-based methods except ours are trained on the KITTI dataset. Note that our model has not been finetuned on KITTI and is trained purely on a synthetic dataset. Besides, many of these algorithms use multiple frames to further optimize the trajectory, whereas our model only takes two consecutive images. As listed in Table 2, TartanVO achieves comparable performance, even though no finetuning or backend optimization is performed.

EuRoC dataset The EuRoC dataset contains 11 sequences collected by a MAV in an indoor environment, with 3 levels of difficulty with respect to the motion pattern and the lighting condition. Few learning-based methods have been tested on EuRoC due to the lack of training data. The changing lighting conditions and aggressive rotations also pose real challenges to geometry-based methods.
In Table 3, we compare with geometry-based methods including SVO [46], ORB-SLAM [3], DSO [5], and LSD-SLAM [2]. Note that all these geometry-based methods perform some type of backend optimization on selected keyframes along the trajectory. In contrast, our model only estimates the frame-by-frame camera motion and could be considered as the frontend module of these geometry-based methods. In Table 3, we show the absolute trajectory error (ATE) of 6 medium and difficult trajectories. Our method shows the best performance on the two most difficult trajectories, VR1-03 and VR2-03, where the MAV undergoes very aggressive motion. A visualization of the trajectories is shown in Fig. 6.

Challenging TartanAir data TartanAir provides 16 very challenging testing trajectories2 that cover many extremely difficult cases, including changing illumination, dynamic objects, fog and rain effects, lack of features, and large motion. As listed in Table 4, we compare our model with ORB-SLAM using the ATE. Our model shows more robust performance in these challenging cases.

2https://github.com/castacks/tartanair_tools#download-the-testing-data-for-the-cvpr-visual-slam-challenge

Table 3: Comparison of ATE on the EuRoC dataset. We are among the very few learning-based methods that can be tested on this dataset. Like the geometry-based methods, our model has never seen the EuRoC data before testing. We show the best performance on the two difficult sequences VR1-03 and VR2-03. Note that our method does not contain any backend optimization module.

Seq.                              MH-04  MH-05  VR1-02  VR1-03  VR2-02  VR2-03
Geometry-based*  SVO [46]         1.36   0.51   0.47    x       0.47    x
                 ORB-SLAM [3]     0.20   0.19   x       x       0.07    x
                 DSO [5]          0.25   0.11   0.11    0.93    0.13    1.16
                 LSD-SLAM [2]     2.13   0.85   1.11    x       x       x
Learning-based†  TartanVO (ours)  0.74   0.68   0.45    0.64    0.67    1.04

* These results are from [46]. † Other learning-based methods [36] did not report numerical results.
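The ATE reported in Tables 3 and 4 is, in essence, the RMSE between corresponding ground-truth and estimated camera positions. A minimal sketch, assuming the two trajectories are already time-associated and aligned (monocular methods additionally require scale alignment, as noted in Sec 4.3); `ate_rmse` is our own helper, not the official evaluation code:

```python
import numpy as np

def ate_rmse(gt: np.ndarray, est: np.ndarray) -> float:
    """Absolute Trajectory Error (RMSE) between two aligned
    Nx3 position sequences."""
    diff = gt - est
    return float(np.sqrt((diff ** 2).sum(axis=1).mean()))

# Toy example: a constant 0.1 m offset along x gives an ATE of 0.1 m.
gt = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
est = gt + np.array([0.1, 0.0, 0.0])
assert abs(ate_rmse(gt, est) - 0.1) < 1e-9
```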
Figure 6: Visualization of the 6 EuRoC trajectories in Table 3. Black: ground truth trajectory; orange: estimated trajectory.

Table 4: Comparison of ATE on the TartanAir dataset. These trajectories are not contained in the training set. We repeatedly run ORB-SLAM 5 times and report the best result.

Seq              MH000  MH001  MH002  MH003  MH004  MH005  MH006  MH007
ORB-SLAM [3]     1.3    0.04   2.37   2.45   x      x      21.47  2.73
TartanVO (ours)  4.88   0.26   2.00   0.94   1.07   3.19   1.00   2.04

Figure 7: TartanVO outputs competitive results on D435i IR data compared to the T265 (equipped with a fish-eye stereo camera and an IMU). a) The hardware setup. b) Trial 1: smooth and slow motion. c) Trial 2: smooth and medium speed. d) Trial 3: aggressive and fast motion. See videos for details.

RealSense Data Comparison We test TartanVO using data collected by a customized sensor setup. As shown in Fig. 7 a), a RealSense D435i is fixed on top of a RealSense T265 tracking camera. We use the left near-infrared (IR) images of the D435i in our model and compare against the trajectories provided by the T265 tracking camera. We present 3 loopy trajectories following similar paths with increasing motion difficulty. From Fig. 7 b) to d), we observe that although TartanVO has never seen real-world images or IR data during training, it still generalizes well and predicts odometry closely matching the output of the T265, a dedicated device that estimates camera motion with a pair of fish-eye stereo cameras and an IMU.

5 Conclusions

We presented TartanVO, a generalizable learning-based visual odometry. By training our model with a large amount of data, we showed the effectiveness of diverse data for model generalization. A smaller gap between training and testing losses can be expected with the newly defined up-to-scale loss, further increasing the generalization capability.
We show through extensive experiments that, equipped with the intrinsics layer designed explicitly for handling different cameras, TartanVO can generalize to unseen datasets and achieve performance even better than dedicated learning models trained directly on those datasets. Our work opens up many exciting research directions, such as generalizable learning-based VIO, stereo VO, and multi-frame VO.

Acknowledgments

This work was supported by ARL award #W911NF1820218. Special thanks to Yuheng Qiu and Huai Yu from Carnegie Mellon University for preparing simulation results and experimental setups.

References

[1] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha. Visual simultaneous localization and mapping: a survey. Artificial Intelligence Review, 43(1):55–81, 2015.

[2] J. Engel, T. Schöps, and D. Cremers. LSD-SLAM: Large-scale direct monocular SLAM. In ECCV, 2014.

[3] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós. ORB-SLAM: a versatile and accurate monocular SLAM system. IEEE Transactions on Robotics, 31(5):1147–1163, 2015.

[4] C. Forster, M. Pizzoli, and D. Scaramuzza. SVO: Fast semi-direct monocular visual odometry. In ICRA, pages 15–22. IEEE, 2014.

[5] J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(3):611–625, 2017.

[6] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.

[7] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki. SfM-Net: Learning of structure and motion from video. arXiv:1704.07804, 2017.

[8] S. Wang, R. Clark, H. Wen, and N. Trigoni. End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks. The International Journal of Robotics Research, 37(4-5):513–542, 2018.

[9] X. Wang, D. Maturana, S. Yang, W. Wang, Q. Chen, and S. Scherer.
Improving learning-based ego- + motion estimation with homomorphism-based losses and drift correction. In 2019 IEEE/RSJ International + Conference on Intelligent Robots and Systems (IROS), pages 970–976. IEEE, 2019. + +[10] G. Younes, D. Asmar, E. Shammas, and J. Zelek. Keyframe-based monocular slam: design, survey, and + future directions. Robotics and Autonomous Systems, 98:67–88, 2017. + +[11] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer. Tartanair: A + dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots + and Systems (IROS), 2020. + +[12] R. Roberts, H. Nguyen, N. Krishnamurthi, and T. Balch. Memory-based learning for visual odometry. + In Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on, pages 47–52. IEEE, + 2008. + +[13] V. Guizilini and F. Ramos. Semi-parametric models for visual odometry. In Robotics and Automation + (ICRA), 2012 IEEE International Conference on, pages 3482–3489. IEEE, 2012. + +[14] T. A. Ciarfuglia, G. Costante, P. Valigi, and E. Ricci. Evaluation of non-geometric methods for visual + odometry. Robotics and Autonomous Systems, 62(12):1717–1730, 2014. + +[15] K. Tateno, F. Tombari, I. Laina, and N. Navab. Cnn-slam: Real-time dense monocular slam with learned + depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, + pages 6243–6252, 2017. + +[16] Z. Yin and J. Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In + Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, + 2018. + +[17] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid. Unsupervised learning of monocular + depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE + Conference on Computer Vision and Pattern Recognition, pages 340–349, 2018. + +[18] A. Ranjan, V. Jampani, L. Balles, K. 
Kim, D. Sun, J. Wulff, and M. J. Black. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[19] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia. Exploring representation learning with CNNs for frame-to-frame ego-motion estimation. RAL, 1(1):18–25, 2016.

[20] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. DeMoN: Depth and motion network for learning monocular stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[21] N. Yang, L. v. Stumberg, R. Wang, and D. Cremers. D3VO: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[22] Y. Zou, Z. Luo, and J.-B. Huang. DF-Net: Unsupervised joint learning of depth and flow using cross-task consistency. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.

[23] H. Zhou, B. Ummenhofer, and T. Brox. DeepTAM: Deep tracking and mapping. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.

[24] C. Tang and P. Tan. BA-Net: Dense bundle adjustment network. arXiv preprint arXiv:1806.04807, 2018.

[25] R. Clark, M. Bloesch, J. Czarnowski, S. Leutenegger, and A. J. Davison. LS-Net: Learning to solve nonlinear least squares for monocular stereo. arXiv preprint arXiv:1809.02966, 2018.

[26] H. Li, W. Chen, J. Zhao, J.-C. Bazin, L. Luo, Z. Liu, and Y.-H. Liu. Robust and efficient estimation of absolute camera pose for monocular visual odometry. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), May 2020.

[27] K. Tateno, F. Tombari, I. Laina, and N. Navab.
CNN-SLAM: Real-time dense monocular SLAM with learned depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[28] L. Sheng, D. Xu, W. Ouyang, and X. Wang. Unsupervised collaborative learning of keyframe detection and visual odometry towards monocular deep SLAM. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

[29] H. Zhan, C. S. Weerasekera, J.-W. Bian, and I. Reid. Visual odometry revisited: What should be learnt? In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), May 2020.

[30] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[31] Y. Wang, P. Wang, Z. Yang, C. Luo, Y. Yang, and W. Xu. UnOS: Unified unsupervised optical-flow and stereo-depth estimation by watching videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[32] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3D geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[33] S. Li, X. Wang, Y. Cao, F. Xue, Z. Yan, and H. Zha. Self-supervised deep visual odometry with online adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[34] D. Nistér. An efficient solution to the five-point relative pose problem. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(6):756–770, 2004.

[35] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The KITTI dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.

[36] M. Burri, J. Nikolic, P. Gohl, T.
Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart. The + euroc micro aerial vehicle datasets. The International Journal of Robotics Research, 35(10):1157–1163, + 2016. + +[37] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and + B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE + conference on computer vision and pattern recognition, pages 3213–3223, 2016. + +[38] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transfer- + ring deep neural networks from simulation to the real world. In IROS, pages 23–30. IEEE, 2017. + +[39] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, + and S. Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain ran- + domization. In CVPR Workshops, pages 969–977, 2018. + +[40] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and + cost volume. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages + 8934–8943, 2018. + +[41] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the + IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. + +[42] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and + A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017. + +[43] S. Wang, R. Clark, H. Wen, and N. Trigoni. Deepvo: Towards end-to-end visual odometry with deep + recurrent convolutional neural networks. In Robotics and Automation (ICRA), 2017 IEEE International + Conference on, pages 2043–2050. IEEE, 2017. + +[44] R. Li, S. Wang, Z. Long, and D. Gu. Undeepvo: Monocular visual odometry through unsupervised deep + learning. 
In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7286–7291. IEEE, 2018.

[45] S. Song, M. Chandraker, and C. Guest. High accuracy monocular SFM and scale correction for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2015.

[46] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza. SVO: Semidirect visual odometry for monocular and multicamera systems. IEEE Transactions on Robotics, 33(2):249–265, 2016.

A Additional experimental details

In this section, we provide additional details of the experiments, including the network structure, training parameters, and qualitative and quantitative results.

A.1 Network Structure

Our network consists of two sub-modules, namely the matching network M_θ and the pose network P_φ. As mentioned in the paper, we employ PWC-Net as the matching network, which takes in two consecutive images of size 640 × 448 (PWC-Net only accepts image sizes that are multiples of 64). The output optical flow, which is 160 × 112 in size, is fed into the pose network. The structure of the pose network is detailed in Table 5. The overall inference time (including both M_θ and P_φ) is 40 ms on an NVIDIA GTX 1080 GPU.

Table 5: Parameters of the proposed pose network. The layers in each residual block are listed in brackets, multiplied by the number of stacked blocks. Downsampling is performed by Conv1 and at the beginning of each residual block. After the residual blocks, the feature map is reshaped into a one-dimensional vector, which goes through three fully connected layers in the translation head and the rotation head, respectively.
Name             Layer setting       Output dimension
Input            -                   1/4 H × 1/4 W × 2      (112 × 160)
Conv1            3 × 3, 32           1/8 H × 1/8 W × 32     (56 × 80)
Conv2            3 × 3, 32           1/8 H × 1/8 W × 32     (56 × 80)
Conv3            3 × 3, 32           1/8 H × 1/8 W × 32     (56 × 80)
Block1           [3 × 3, 64] × 3     1/16 H × 1/16 W × 64   (28 × 40)
Block2           [3 × 3, 128] × 4    1/32 H × 1/32 W × 128  (14 × 20)
Block3           [3 × 3, 128] × 6    1/64 H × 1/64 W × 128  (7 × 10)
Block4           [3 × 3, 256] × 7    1/128 H × 1/128 W × 256 (4 × 5)
Block5           [3 × 3, 256] × 3    1/256 H × 1/256 W × 256 (2 × 3)
Trans head fc1   (256 · 6) × 128     128
Trans head fc2   128 × 32            32
Trans head fc3   32 × 3              3
Rot head fc1     (256 · 6) × 128     128
Rot head fc2     128 × 32            32
Rot head fc3     32 × 3              3

Table 6: Comparison of ORB-SLAM and TartanVO on the TartanAir dataset using the ATE metric. These trajectories are not contained in the training set. We repeatedly run ORB-SLAM for 5 times and report the best result.

Seq              SH000  SH001  SH002  SH003  SH004  SH005  SH006  SH007
ORB-SLAM         x      3.5    x      x      x      x      x      x
TartanVO (ours)  2.52   1.61   3.65   0.29   3.36   4.74   3.72   3.06

A.2 Testing Results on TartanAir

TartanAir provides 16 challenging testing trajectories. We reported 8 trajectories in the experiment section; the remaining 8 trajectories are shown in Table 6. We compare TartanVO against the ORB-SLAM monocular algorithm. Due to the randomness in ORB-SLAM, we run ORB-SLAM for 5 trials and report the best result. We consider a trial a failure if ORB-SLAM tracks less than 80% of the trajectory. A visualization of all 16 trajectories (including the 8 shown in the experiment section) is given in Figure 8.

Figure 8: Visualization of the 16 testing trajectories in the TartanAir dataset. The black dashed line represents the ground truth. The trajectories estimated by TartanVO and the ORB-SLAM monocular algorithm are shown in orange and blue lines, respectively.
The ORB-SLAM algorithm frequently loses tracking in these challenging cases: it fails in 9/16 testing trajectories. Note that we run a full-fledged ORB-SLAM with local bundle adjustment, global bundle adjustment, and loop closure components. In contrast, although TartanVO only takes in two images, it is much more robust than ORB-SLAM.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2022年/AirDOS_Dynamic_SLAM_benefits_from_Articulated_Objects.pdf b/动态slam/2020年-2022年开源动态SLAM/2022年/AirDOS_Dynamic_SLAM_benefits_from_Articulated_Objects.pdf
new file mode 100644
index 0000000..2177561
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2022年/AirDOS_Dynamic_SLAM_benefits_from_Articulated_Objects.pdf
@@ -0,0 +1,518 @@

2022 IEEE International Conference on Robotics and Automation (ICRA)
May 23-27, 2022. Philadelphia, PA, USA

AirDOS: Dynamic SLAM benefits from Articulated Objects

Yuheng Qiu1, Chen Wang1, Wenshan Wang1, Mina Henein2, and Sebastian Scherer1

DOI: 10.1109/ICRA46639.2022.9811667

Abstract— Dynamic Object-aware SLAM (DOS) exploits object-level information to enable robust motion estimation in dynamic environments. Existing methods mainly focus on identifying and excluding dynamic objects from the optimization. In this paper, we show that feature-based visual SLAM systems can also benefit from the presence of dynamic articulated objects by taking advantage of two observations: (1) the 3D structure of each rigid part of an articulated object remains consistent over time; (2) the points on the same rigid part follow the same motion. In particular, we present AirDOS, a dynamic object-aware system that introduces rigidity and motion constraints to model articulated objects. By jointly optimizing the camera pose, object motion, and the object's 3D structure, we can rectify the camera pose estimation, prevent tracking loss, and generate 4D spatio-temporal maps of both dynamic objects and static scenes. Experiments show that our algorithm improves the robustness of visual SLAM algorithms in challenging crowded urban environments. To the best of our knowledge, AirDOS is the first dynamic object-aware SLAM system demonstrating that camera pose estimation can be improved by incorporating dynamic articulated objects.

Fig. 1. (a) Example of a highly dynamic environment (Shibuya, Tokyo) cluttered with humans, which represents a challenge for visual SLAM; existing dynamic SLAM algorithms often fail in this scenario. (b) Example of the TartanAir Shibuya Dataset. (c) Example of the estimated full map with dynamic objects and static background (KITTI tracking dataset, training sequence 19).

I. INTRODUCTION

Simultaneous localization and mapping (SLAM) is a fundamental research problem in many robotic applications. Despite its success in static environments, the performance degradation and lack of robustness in the dynamic world have become a major hurdle for its practical applications [1], [2]. To address the challenges of dynamic environments, most SLAM algorithms adopt an elimination strategy that treats moving objects as outliers and estimates the camera pose based only on the measurements of static landmarks [3], [4]. This strategy can handle environments with a small number of dynamics, but cannot address challenging cases where dynamic objects cover a large field of view, as in Fig. 1(a).

Some efforts have been made to include dynamic objects in the SLAM process. Very few methods try to estimate the pose of simple rigid objects [5], [6] or estimate their motion model [7], [8]. For example, CubeSLAM [6] introduces a simple 3D cuboid to model rigid objects, and Dynamic SLAM [9] estimates the 3D motions of dynamic objects. However, these methods can only cover special rigid objects, e.g., cubes [6] and quadrics [5], and do not show that camera pose estimation can be improved by the introduction of dynamic objects [7]–[9]. This introduces our main question:

Can we make use of moving objects in SLAM to improve camera pose estimation rather than filtering them out?

In this paper, we extend simple rigid objects to general articulated objects, defined as objects composed of one or more rigid parts (links) connected by joints allowing rotational motion [10], e.g., the vehicles and humans in Fig. 2, and utilize the properties of articulated objects to improve the camera pose estimation. Namely, we jointly optimize (1) the 3D structural information and (2) the motion of articulated objects. To this end, we introduce (1) a rigidity constraint, which assumes that the distance between any two points located on the same rigid part remains constant over time, and (2) a motion constraint, which assumes that feature points on the same rigid part follow the same 3D motion. This allows us to build a 4D spatio-temporal map including both dynamic and static structures.

In summary, the main contributions of this paper are:
• A new pipeline, named AirDOS, is introduced for stereo SLAM to jointly optimize the camera poses, the trajectories of dynamic objects, and the map of the environment.
• We introduce simple yet efficient rigidity and motion constraints for general dynamic articulated objects.
• We introduce a new benchmark, TartanAir Shibuya, on which we demonstrate, for the first time, that dynamic articulated objects can benefit the camera pose estimation in visual SLAM.

*This work was supported by the Sony award #A023367. Source Code: https://github.com/haleqiu/AirDOS.
+1Yuheng Qiu, Chen Wang, Wenshan Wang, and Sebastian Scherer are with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA {yuhengq, wenshanw, basti}@andrew.cmu.edu; chenwang@dr.com
+2Mina Henein is with the Systems, Theory and Robotics Lab, Australian National University. mina.henein@anu.edu.au
+978-1-7281-9680-0/22/$31.00 ©2022 IEEE 8047
+Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on July 04,2023 at 02:38:40 UTC from IEEE Xplore. Restrictions apply.
+Wang et al. [18] introduce a simultaneous localization, mapping, and moving object tracking (SLAMMOT) algorithm, which tracks moving objects with a learned motion model based on a dynamic Bayesian network. Reddy et al. [19] use optical flow to segment moving objects, and apply a smooth trajectory constraint to enforce the smoothness of objects' motion. Judd et al. [8] propose multi-motion visual odometry (MVO), which simultaneously estimates the camera pose and the object motion. The work by Henein et al. [7], [20], [21], of which the most recent is VDO-SLAM [20], generates a map of dynamic and static structure and estimates velocities of rigid moving objects using motion constraints. Rosinol et al. [22] propose 3D dynamic scene graphs to detect and track dense human meshes in dynamic scenes. This method constrains the humans' maximum walking speed for a consistency check.
+Fig. 2. An example of the articulated dynamic objects' point-segment model. In an urban environment, we can model rigid objects like vehicles and semi-rigid objects like pedestrians as articulated objects. p_i^k and p_j^k are the i-th and j-th dynamic features on the moving objects at time k. p_i^{k+1} and p_j^{k+1} are the dynamic features after the motion ^lT^k at time k+1. In this model, the segment s_ij is invariant over time and motion.
+C. Rigidity Constraint
+The rigidity constraint assumes that pair-wise distances of points on the same rigid body remain the same over time. It was applied to segment moving objects in dynamic environments dating back to the 1980s. Zhang et al. [23] propose to use the rigidity constraint to match moving rigid bodies. Thompson et al. [24] use a similar idea of rigidity constraint and propose a rigidity geometry test for moving rigid object matching. Previous research utilized the rigidity assumption to segment moving rigid objects, while in this paper, we use the rigidity constraint to recover objects' structure.
+To model rigid objects, SLAM++ [25] introduced pre-defined CAD models into the object matching and pose optimization. QuadricSLAM [5] utilizes dual quadrics as a 3D object representation, to represent the orientation and scale of object landmarks. Yang and Scherer [6] propose a monocular object SLAM system named CubeSLAM for 3D cuboid object detection and multi-view object SLAM. As mentioned earlier, the above methods can only model simple rigid objects, e.g., cubes, while we target more general objects, i.e., articulated objects, which can cover common dynamic objects such as vehicles and humans.
+II. RELATED WORK
+Recent works on dynamic SLAM roughly fall into three categories: elimination strategy, motion constraint, and rigidity constraint, which will be reviewed, respectively.
+A. Elimination Strategy
+Algorithms in this category filter out the dynamic objects and only utilize the static structures of the environment for pose estimation. Therefore, most of the algorithms in this category apply elimination strategies like RANSAC [11] and robust loss functions [12] to eliminate the effects of dynamic objects. For example, ORB-SLAM [3] applies RANSAC to select and remove points that cannot converge to a stable pose estimation. DynaSLAM [13] detects the moving objects by multi-view geometry and deep learning modules. This allows inpainting the frame background that has been occluded by dynamic objects.
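The elimination strategy described above can be illustrated with a toy RANSAC sketch. The 2D-translation motion model, point counts, and threshold below are hypothetical and not taken from any of the cited systems; the idea is only that features disagreeing with the dominant (camera-induced) motion are flagged as outliers:

```python
import numpy as np

def ransac_translation(p_prev, p_curr, iters=200, thresh=0.05, seed=0):
    """Toy RANSAC: fit a 2D translation between matched feature sets and
    flag features that disagree with it (e.g., points on moving objects)."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(p_prev), dtype=bool)
    for _ in range(iters):
        i = rng.integers(len(p_prev))          # minimal sample: one match
        t = p_curr[i] - p_prev[i]              # candidate camera-induced shift
        resid = np.linalg.norm(p_curr - (p_prev + t), axis=1)
        inliers = resid < thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    t = (p_curr[best_inliers] - p_prev[best_inliers]).mean(axis=0)
    return t, best_inliers

# 30 static features moved by the "camera", 5 dynamic ones by their own motion
rng = np.random.default_rng(1)
p_prev = np.vstack([rng.random((30, 2)), rng.random((5, 2))])
p_curr = p_prev + np.array([0.1, 0.0])          # global image shift
p_curr[30:] += np.array([0.3, -0.2])            # extra motion of dynamic points

t, inliers = ransac_translation(p_prev, p_curr)
print(inliers[:30].all(), inliers[30:].any())   # → True False
```

The static points are retained as inliers while the independently moving points are rejected, which is exactly the behavior an elimination-strategy front-end relies on.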
+Bârsan et al. [14] use both instance-aware semantic segmentation and sparse scene flow to classify objects as either background, moving, or potentially moving objects. Dai et al. [15] utilize the distance correlation of map points to segment dynamic objects from the static background. To reduce the computational cost, Ji et al. [16] combine semantic segmentation and geometry modules, which cluster the depth image into a few regions and identify dynamic regions via reprojection errors.
+B. Motion Constraint
+Most algorithms in this category estimate the motion of dynamic objects but do not show that the motion constraint can contribute to the camera pose estimation, and would thus suffer in highly dynamic environments. For example, Hahnel et al. [17] track the dynamic objects in the SLAM system.
+III. METHODOLOGY
+A. Background and Notation
+Visual SLAM in static environments is often formulated as a factor graph optimization [26]. The objective (1) is to find the robot states x_k ∈ X, k ∈ [0, n_x] and the static landmarks p_i ∈ P_s, i ∈ [0, n_{p_s}] that best fit the observations of the landmarks z_i^k ∈ Z, where n_x denotes the total number of robot states and n_{p_s} denotes the number of static landmarks. This is often based on a reprojection error minimization e_{i,k} = h(x_k, p_i) − z_i^k with:
+X*, P* = argmin_{X,P_s} Σ_{i,k} e_{i,k}^T Ω_{i,k}^{-1} e_{i,k},    (1)
+where h(x_k, p_i) denotes the 3D point observation function and Ω_{i,k} denotes the observation covariance matrix.
+C. Motion Constraint
+We adopt the motion constraint from [7], which does not need a prior geometric model.
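Objective (1) above is a standard nonlinear least-squares problem once data association is fixed. A minimal numeric sketch, assuming a hypothetical 2D observation function h(x_k, p_i) = p_i − x_k, known landmarks, and identity covariances (so the optimum has a closed form):

```python
import numpy as np

# Toy version of objective (1): h(x_k, p_i) = p_i - x_k observes landmark i
# from robot state x_k in 2D. With landmarks known and identity covariance,
# the least-squares robot state is the mean of p_i - z_i.
rng = np.random.default_rng(0)
landmarks = rng.random((8, 2)) * 10.0                  # static landmarks p_i
x_true = np.array([2.0, 3.0])                          # robot state x_k
z = landmarks - x_true + rng.normal(0, 0.01, (8, 2))   # noisy observations

x_hat = np.mean(landmarks - z, axis=0)   # argmin of sum ||(p_i - x_k) - z_i||^2
print(np.linalg.norm(x_hat - x_true))    # small residual error
```

Real systems optimize many poses and landmarks jointly with an iterative solver; this only illustrates the residual structure of (1).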
+For every feature point on the same rigid part of an articulated object l, we have
+^l p̄_i^{k+1} = ^l T ^l p̄_i^k,    (4)
+where ^l T ∈ SE(3) is a motion transform associated with the object l and ¯· indicates homogeneous coordinates. Therefore, we can define the loss function for the motion constraint as:
+e_m = || ^l p̄_i^{k+1} − ^l T ^l p̄_i^k ||.    (5)
+The motion constraint simultaneously estimates the objects' motion ^l T and enforces each point ^l p_i^k to follow the same motion pattern [7]. This motion model ^l T assumes that the object is rigid; thus, for articulated objects, we apply the motion constraint on each rigid part of the articulated object. In Fig. 3(c) we show the factor graph of the motion constraint.
+Fig. 3. (a) Factor graph of the rigidity constraint. Black nodes represent the camera pose, blue nodes the dynamic points, and red nodes indicate the rigid segment length. Cyan and red rectangles represent the measurements of points and rigidity, respectively. (b) A human can be modeled with points and segments based on the body parts' rigidity (key points: nose, neck, shoulders, elbows, hands, knees, and feet). (c) Factor graph of the motion constraint. The orange node is the estimated motion and the green rectangles denote the motion constraints.
+In highly dynamic environments, even if we filter out the moving objects, the tracking of static features is easily interrupted by the moving objects. By enforcing the motion constraints, dynamic objects will be able to contribute to the motion estimation of the camera pose. Therefore, when the static features are not reliable enough, moving objects can correct the camera pose estimation, preventing tracking loss.
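The motion constraint in (4)–(5) can be sketched numerically. The SE(3) motion, points, and drift below are illustrative values, not from the paper; the point is that features following the shared rigid motion get a zero residual, while a feature that moves differently is exposed:

```python
import numpy as np

# Sketch of the motion constraint (4)-(5): every point on the same rigid part
# must follow one SE(3) motion lT; a point violating it gets a large residual.
theta = np.deg2rad(10.0)
T = np.eye(4)
T[:2, :2] = [[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]]
T[0, 3] = 0.5                                   # object motion lT (rotation + shift)

p_k = np.array([[1.0, 0.0, 0.0, 1.0],
                [1.0, 1.0, 0.0, 1.0],
                [0.0, 1.0, 0.5, 1.0]])          # homogeneous points at time k
p_k1 = (T @ p_k.T).T                            # eq. (4): consistent motion
p_k1[2, :3] += [0.4, 0.0, 0.0]                  # third point drifts off the rigid part

e_m = np.linalg.norm(p_k1 - (T @ p_k.T).T, axis=1)   # eq. (5), per point
print(e_m.round(2))                             # → [0.  0.  0.4]
```

In the actual system lT is an optimization variable estimated jointly with the points, rather than being given as here.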
+In dynamic SLAM, the reprojection error e_p of dynamic feature points is also considered:
+e_p = h(x_k, ^l p_i^k) − ^l z_i^k,    (2)
+where ^l p_i^k ∈ P_d are the dynamic points and ^l z_i^k are the corresponding observations of the dynamic points.
+B. Rigidity Constraint
+Let s_ij be the segment length between two feature points ^l p_i^k and ^l p_j^k. The rigidity constraint is that s_ij is invariant over time, i.e., s_ij^k = s_ij^{k+1}, if ^l p_i^k and ^l p_j^k are on the same rigid part of an articulated object, as shown in Fig. 2. Inspired by this, we model the dynamic articulated object using a rigidity constraint, and thus we can define the rigidity error e_r as
+e_r = || ^l p_i^k − ^l p_j^k || − s_ij.    (3)
+Fig. 3(a) shows the factor graph of the rigidity constraint, where the length of segment s_ij is invariant after the motion. The benefits of involving the rigidity error (3) are two-fold. First, it offers a temporal geometric constraint for dynamic points, which is able to correct the scale and 3D structure of dynamic objects.
+D. Bundle Adjustment
+The bundle adjustment (BA) jointly optimizes the static points p_i, dynamic points ^l p_i^k, segments s_ij, camera poses x_k, and dynamic object motions ^l T. This can be formulated as the factor graph optimization:
+X*, P*, S*, T* = argmin_{X,P,S,T} Σ [ e_r^T Ω_{i,j}^{-1} e_r + e_m^T Ω_{i,l}^{-1} e_m + e_p^T Ω_{i,k}^{-1} e_p ],    (6)
+where P is the union set of P_s and P_d. This problem can be solved using the Levenberg-Marquardt algorithm.
+IV. SYSTEM OVERVIEW
+We propose the framework AirDOS, shown in Fig. 4, for dynamic stereo visual SLAM, which consists of three modules: pre-processing, tracking, and back-end bundle adjustment. In the pre-processing and tracking modules, we first extract ORB features [28] and perform an instance-level segmentation [29] to identify potential moving objects. We then estimate the initial ego-motion by tracking the static features. For articulated objects like humans, we perform Alpha-Pose [27] to extract the human key points and calculate their
+3D positions by triangulating the corresponding key points from stereo images. We then track the moving humans using the optical flow generated by PWC-Net [30]. The tracking module provides a reliable initialization for the camera pose and also the object poses of dynamic objects.
+Second, the rigidity error (3) provides a geometric check, which eliminates the incorrectly matched points. We model humans as a special articulated object shown in Fig. 3(b), where each human can be described by 14 key points, including the nose, shoulders, elbows, hands, waist, knees, and feet. In the experiments, we detect the human key points using the off-the-shelf algorithm Alpha-Pose [27].
+In the back-end optimization, we construct a global map consisting of camera poses, static points, dynamic points, and the motion of objects. We perform local bundle adjustment with dynamic objects in the co-visibility graph [31] built from the co-visible landmarks for the sake of efficiency. Similar to the strategy of RANSAC, we eliminate the factors and edges which contribute a large error based on the rigidity constraint (3) and motion constraint (5). This process helps to identify the mismatched or falsely estimated human poses. Visual SLAM algorithms usually only perform bundle adjustment on selected key-frames due to the repeated static feature observations. However, in highly dynamic environments, like the ones presented in this paper, this might easily result in loss of dynamic object tracking; therefore we perform bundle adjustment on every frame to capture the full trajectory.
+Fig. 4. The framework of AirDOS, which is composed of three modules, i.e., pre-processing, tracking, and back-end optimization. [Module blocks: stereo image; instance-level segmentation; human pose detection; optical flow estimation; static feature extractor; ego-motion estimation; 3D human pose triangulation; dynamic object tracking; motion estimation; local and global bundle adjustment; map of camera poses, static points, dynamic points, object rigidity, and motion.]
+V. EXPERIMENTS
+A. Metric, Baseline, and Implementation
+We use the Absolute Translation Error (ATE) to evaluate our algorithm. Our method is compared against the state-of-the-art methods: ORB-SLAM [3], (1) with and (2) without the masking of potential dynamic objects, and the RGB-D dynamic SLAM algorithm [20]. Similar to the setup described in Section IV, we modified ORB-SLAM to perform BA on every frame with the observations from dynamic features, so as to capture the full trajectory of the moving objects. In the experiments, we applied the same parameters to AirDOS and ORB-SLAM, i.e., the number of feature points extracted per frame, the threshold for RANSAC, and the covariance of the reprojection error.
+B. Performance on KITTI Tracking Dataset
+The KITTI Tracking dataset [32] contains 50 sequences (29 for testing, 21 for training) with multiple moving objects. We select 6 sequences that contain moving pedestrians. For evaluation, we generate the ground truth using IMU and GPS. As shown in Table I, the ATEs of both our method and ORB-SLAM are small in all sequences, which means that both methods perform well in these sequences. The main reason is that the moving objects are relatively far and small, and there are plentiful static features in these sequences. Moreover, most sequences have a simple translational movement, which makes these cases very simple.
+TABLE I: PERFORMANCE ON KITTI DATASETS BASED ON ATE (m).
+Sequence | W/ Mask: AirDOS | W/ Mask: ORB-SLAM | W/O Mask: AirDOS | W/O Mask: ORB-SLAM
+Test 18  | 0.933 | 0.934 | 0.937 | 0.948
+Test 28  | 2.033 | 2.027 | 2.031 | 2.021
+Train 13 | 1.547 | 1.618 | 1.551 | 1.636
+Train 14 | 0.176 | 0.172 | 0.174 | 0.169
+Train 15 | 0.240 | 0.234 | 0.240 | 0.234
+Train 19 | 2.633 | 2.760 | 2.642 | 2.760
+Fig. 5. Qualitative analysis of the KITTI Tracking datasets in training 19. Applying the rigidity constraint and motion constraint improves the estimation of the objects' structure.
+TABLE II: EXPERIMENTS ON TARTAN-AIR DATASET WITH AND WITHOUT MASK. Results show Absolute Trajectory Error (ATE) in meters (m). '-' means that SLAM failed in this sequence.
+Datasets | Sequence | W/ Mask: AirDOS | W/ Mask: ORB-SLAM | W/ Mask: VDO-SLAM [20] | W/O Mask: AirDOS | W/O Mask: ORB-SLAM
+Standing Human | I   | 0.0606 | 0.0788 | 0.0994 | 0.0469 | 0.1186
+Standing Human | II  | 0.0193 | 0.0060 | 0.6129 | -      | -
+Road Crossing (Easy) | III | 0.0951 | 0.0657 | 0.3813 | 0.0278 | 0.0782
+Road Crossing (Easy) | IV  | 0.0331 | 0.0196 | 0.3879 | 0.1106 | 0.0927
+Road Crossing (Easy) | V   | 0.0206 | 0.0148 | 0.2175 | 0.0149 | 0.0162
+Road Crossing (Hard) | VI  | 0.2230 | 1.0984 | 0.2400 | 3.6700 | 4.3907
+Road Crossing (Hard) | VII | 0.5625 | 0.8476 | 0.6628 | 1.1572 | 1.4632
+Overall |  | 0.1449 | 0.3044 | 0.3717 | 0.8379 | 1.0226
+Fig. 6. (a) Example of the TartanAir datasets, where almost everyone is standing. (b) Example of moving humans in road crossing.
+Although the camera trajectory is similar, our algorithm recovers a better human model, as shown in Fig. 5. ORB-SLAM generates noisy human poses when the human is far away from the camera. That's because the rigidity constraint helps to recover the structure of the moving articulated objects.
+Also, the motion constraint can improve the accuracy of the dynamic objects' trajectory. Given the observations from the entire trajectory, our algorithm recovers the human pose and eliminates the mismatched dynamic feature points.
+Fig. 7. Qualitative analysis of the TartanAir sequence IV. The moving objects tracked by ORB-SLAM are noisy, while our proposed method generates a smooth trajectory. We show that dynamic objects and the camera pose can benefit each other in visual SLAM.
+C. Performance on TartanAir Shibuya Dataset
+1) Evaluation: To test the robustness of our system when the visual odometry is interrupted by dynamic objects, or in cases where the segmentation might fail due to indirect occlusions such as illumination changes, we evaluate the performance in two settings: with and without masking the dynamic features during ego-motion estimation.
+We notice that the moving objects in the KITTI dataset only cover a small field of view. To address the challenges of the highly dynamic environment, we build the TartanAir Shibuya dataset as shown in Fig. 6, and demonstrate that our method outperforms the existing dynamic SLAM algorithms on this benchmark. Our previous work TartanAir [33] is a very challenging visual SLAM dataset consisting of binocular RGB-D sequences together with additional per-frame information such as camera poses, optical flow, and semantic annotations. In this paper, we use the same pipeline to generate TartanAir Shibuya, which simulates the world's busiest road intersection at Shibuya, Tokyo, shown in Fig. 1.
+It covers much more challenging viewpoints and diverse motion patterns for articulated objects than TartanAir.
+We separate the TartanAir Shibuya dataset into two groups: Standing Humans in Fig. 6(a) and Road Crossing in Fig. 6(b), with easy and difficult categories. Each sequence contains 100 frames and more than 30 tracked moving humans. In the sequences of Standing Human, most of the humans stand still, while a few of them move around the space. In Road Crossing, there are multiple moving humans coming from different directions. In the difficult sequences, dynamic objects often enter the scene abruptly, in which case the visual odometry of traditional methods will fail easily.
+As shown in Table II, with human masks, our algorithm obtains 39.5% and 15.2% improvements compared to ORB-SLAM [3] and VDO-SLAM [20] in the overall performance. In Sequences II, IV, and V, both ORB-SLAM and our algorithm show a good performance, where all ATEs are lower than 0.04. We notice that the performance of VDO-SLAM is not as good as ORB-SLAM. This may be because VDO-SLAM relies heavily on the optical flow for feature matching; it is likely to confuse background features with dynamic features.
+Our algorithm also outperforms ORB-SLAM without masking the potential moving objects. As shown in Sequences I, III, V, and VI of Table II, our method obtains a higher accuracy than ORB-SLAM by 0.0717, 0.050, 0.721, and 0.306. Overall, we achieve an improvement of 18.1%. That's because moving objects can easily lead the traditional visual odometry to fail, but we take the observations from the moving articulated objects to rectify the camera poses and filter out the mismatched dynamic features.
+It can be seen in Fig. 7 that ORB-SLAM was interrupted by the moving humans and failed when making a large rotation. By tracking moving humans, our method outperforms ORB-SLAM when making a turn. Also, a better camera pose estimation can in turn benefit the moving objects' trajectory. As can be seen, the objects' trajectories generated by ORB-SLAM are noisy and inconsistent, while ours are smoother. In general, the proposed motion constraint and rigidity constraint have a significant impact on the difficult sequences, where ORB-SLAM outputs inaccurate trajectories due to dynamic objects.
+VI. ABLATION STUDY
+We perform an ablation study to show the effects of the introduced rigidity and motion constraints. Specifically, we demonstrate that the motion constraint and rigidity constraint improve the camera pose estimation via bundle adjustment.
+A. Implementation
+We simulate dynamic articulated objects that follow a simple constant motion pattern, and initialize the robot's state with Gaussian noise of σ = 0.05 m on translation and σ = 2.9° on rotation. We also generate static features around the path of the robot, and simulate a sensor with a finite field of view. The measurement of a point also has a noise of σ = 0.05 m. We generate 4 groups of sequences with different lengths, and each group consists of 10 sequences that are initialized with the same number of static and dynamic features. We set the ratio of static to dynamic landmarks to 1:1.8.
+B. Results
+We evaluate the performance of (a) bundle adjustment with static features only, (b) bundle adjustment without the motion constraint, (c) bundle adjustment without the rigidity constraint, and (d) bundle adjustment with both the motion constraint and rigidity constraint. We use the Absolute Translation Error (ATE) and the Relative Pose Error of Rotation (RPE-R) and Translation (RPE-T) as our evaluation metrics.
+As shown in Table III, both motion and rigidity constraints are able to improve the camera pose estimation, while the best performance is obtained when the two constraints are applied together. An interesting phenomenon is that the rigidity constraint can also benefit the objects' trajectory estimation. In Group I, we evaluate the estimation of dynamic points with settings (b), (c), and (d), with 100 repeated experiments. We find that the ATE of dynamic object feature points in setting (c) is 5.68 ± 0.30 lower than in setting (b), while setting (d) is 5.71 ± 0.31 lower than (b). This is because the motion constraint assumes that every dynamic feature on the same object follows the same motion pattern, which requires the object to be rigid. From another point of view, the rigidity constraint provides a good initialization of the objects' 3D structure, and so indirectly improves the estimation of the objects' trajectory. In general, the ablation study proves that applying motion and rigidity constraints to dynamic articulated objects can benefit the camera pose estimation.
+TABLE III: ABLATION STUDY ON SIMULATED DATASET. Results show RPE-T and ATE in centimeters (cm) and RPE-R in degrees (°). Entries list RPE-R / RPE-T / ATE per group; cells and one row label lost in extraction are marked "–".
+Groups | I | II | III | IV | Overall
+Before BA | 0.4898 / – / 83.441 | 0.6343 / – / 109.968 | 1.1003 / – / 138.373 | 0.7925 / – / 168.312 | 0.7908 / – / 125.024
+BA w/ static point | 0.0989 / 15.991 / 15.002 | 0.1348 / 17.728 / 25.796 | 0.2028 / 21.070 / 17.085 | 0.1389 / 19.242 / 35.521 | 0.1537 / 18.8328 / –
+– | 0.0988 / 3.3184 / 15.019 | 0.1349 / 3.7146 / 25.708 | 0.2035 / 4.2522 / 16.985 | 0.1388 / 3.5074 / 35.269 | 0.1538 / 3.7540 / 23.351
+BA w/o motion | 0.0962 / 3.3176 / 14.881 | 0.1282 / 3.7176 / 25.704 | 0.1871 / 4.2631 / 16.921 | 0.1226 / 3.5069 / 35.426 | 0.1410 / 3.7565 / 23.245
+BA w/o rigidity | 0.0958 / 3.2245 / 14.879 | 0.1276 / 3.4984 / 25.703 | 0.1870 / 4.0387 / 16.914 | 0.1215 / 3.2397 / 35.412 | 0.1407 / 3.5148 / 23.233
+BA in Equation (6) | – / 3.2177 / – | – / 3.4824 / – | – / 4.0372 / – | – / 3.2227 / – | – / 3.5085 / 23.227
+C. Computational Analysis
+Finally, we evaluate the running time of the rigidity constraint and motion constraint in the optimization. The back-end optimization is implemented in C++ with a modified g2o [34] solver. With the same setup as Section VI-A, we randomly initialized 10 different sequences with 18 frames. In each frame, we can observe 8 static landmarks and 12 dynamic landmarks from one moving object. In Table IV, we show the (i) convergence time and (ii) runtime per iteration of group I in the ablation study. Our method takes 53.54 ms to converge, which is comparable to 39.22 ms for the optimization with the reprojection error only. In this paper, the semantic masks [29] and human poses [27] are pre-processed as an input to the system. The experiments are carried out on an Intel Core i7 with 16 GB RAM.
+TABLE IV: TIME ANALYSIS OF BUNDLE ADJUSTMENT
+Setting | Convergence Time (ms) | Runtime/iter (ms)
+BA w/ reprojection error | 39.22 | 4.024
+BA w/o Rigidity | 45.47 | 4.078
+BA w/o Motion | 45.37 | 4.637
+BA in Equation (6) | 53.54 | 4.792
+CONCLUSION
+In this paper, we introduce the rigidity constraint and motion constraint to model dynamic articulated objects. We propose a new pipeline, AirDOS, for stereo SLAM, which jointly optimizes the trajectories of dynamic objects, the map of the environment, and the camera poses, improving the robustness and accuracy in dynamic environments. We evaluate our algorithm on the KITTI tracking and TartanAir Shibuya datasets, and demonstrate that camera pose estimation and dynamic objects can benefit each other, especially when there is an aggressive rotation or static features are not enough to support the visual odometry.
+REFERENCES
+[24] W. B. Thompson, P. Lechleider, and E. R.
Stuck, “Detecting moving + objects using the rigidity constraint,” IEEE Transactions on Pattern + [1] C. Cadena, L. Carlone, H. Carrillo, Y. Latif, D. Scaramuzza, J. Neira, Analysis and Machine Intelligence, vol. 15, no. 2, pp. 162–166, 1993. + I. Reid, and J. J. Leonard, “Past, present, and future of simultaneous + localization and mapping: Toward the robust-perception age,” IEEE [25] R. F. Salas-Moreno, R. A. Newcombe, H. Strasdat, P. H. Kelly, and + Transactions on robotics, vol. 32, no. 6, pp. 1309–1332, 2016. A. J. Davison, “Slam++: Simultaneous localisation and mapping at the + level of objects,” in Proceedings of the IEEE conference on computer + [2] C. Wang, J. Yuan, and L. Xie, “Non-iterative SLAM,” in International vision and pattern recognition, 2013, pp. 1352–1359. + Conference on Advanced Robotics (ICAR). IEEE, 2017, pp. 83–90. + [26] M. Kaess, A. Ranganathan, and F. Dellaert, “isam: Incremental + [3] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, “Orb-slam: a smoothing and mapping,” IEEE Transactions on Robotics, vol. 24, + versatile and accurate monocular slam system,” IEEE transactions on no. 6, pp. 1365–1378, 2008. + robotics, vol. 31, no. 5, pp. 1147–1163, 2015. + [27] H.-S. Fang, S. Xie, Y.-W. Tai, and C. Lu, “RMPE: Regional multi- + [4] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE person pose estimation,” in ICCV, 2017. + transactions on pattern analysis and machine intelligence, vol. 40, + no. 3, pp. 611–625, 2017. [28] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, “Orb: An + efficient alternative to sift or surf,” in 2011 International conference + [5] L. Nicholson, M. Milford, and N. Sünderhauf, “Quadricslam: Dual on computer vision. Ieee, 2011, pp. 2564–2571. + quadrics from object detections as landmarks in object-oriented slam,” + IEEE Robotics and Automation Letters, vol. 4, no. 1, pp. 1–8, 2018. [29] K. He, G. Gkioxari, P. Dollár, and R. 
Girshick, “Mask r-cnn,” in + Proceedings of the IEEE international conference on computer vision, + [6] S. Yang and S. Scherer, “Cubeslam: Monocular 3-d object slam,” IEEE 2017, pp. 2961–2969. + Transactions on Robotics, vol. 35, no. 4, pp. 925–938, 2019. + [30] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz, “Pwc-net: Cnns for optical + [7] M. Henein, G. Kennedy, R. Mahony, and V. Ila, “Exploiting rigid body flow using pyramid, warping, and cost volume,” in Proceedings of the + motion for slam in dynamic environments,” environments, vol. 18, IEEE conference on computer vision and pattern recognition, 2018, + p. 19, 2018. pp. 8934–8943. + + [8] K. M. Judd, J. D. Gammell, and P. Newman, “Multimotion visual [31] C. Mei, G. Sibley, and P. Newman, “Closing loops without places,” + odometry (mvo): Simultaneous estimation of camera and third-party in 2010 IEEE/RSJ International Conference on Intelligent Robots and + motions,” in 2018 IEEE/RSJ International Conference on Intelligent Systems. IEEE, 2010, pp. 3738–3744. + Robots and Systems (IROS). IEEE, 2018, pp. 3949–3956. + [32] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, “Vision meets robotics: + [9] M. Henein, J. Zhang, R. Mahony, and V. Ila, “Dynamic slam: The The kitti dataset,” The International Journal of Robotics Research, + need for speed,” in 2020 IEEE International Conference on Robotics vol. 32, no. 11, pp. 1231–1237, 2013. + and Automation (ICRA). IEEE, 2020, pp. 2123–2129. + [33] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, +[10] G. Stamou, M. Krinidis, E. Loutas, N. Nikolaidis, and I. Pitas, “4.11- and S. Scherer, “Tartanair: A dataset to push the limits of visual + 2d and 3d motion tracking in digital video,” Handbook of Image and slam,” in IEEE/RSJ International Conference on Intelligent Robots + Video Processing, 2005. and Systems (IROS), 2020. + +[11] M. A. Fischler and R. C. Bolles, “Random sample consensus: a [34] R. Kümmerle, G. Grisetti, H. Strasdat, K. Konolige, and W. 
Burgard, + paradigm for model fitting with applications to image analysis and “g 2 o: A general framework for graph optimization,” in 2011 IEEE + automated cartography,” Communications of the ACM, vol. 24, no. 6, International Conference on Robotics and Automation. IEEE, 2011, + pp. 381–395, 1981. pp. 3607–3613. + +[12] C. Kerl, J. Sturm, and D. Cremers, “Robust odometry estimation for + rgb-d cameras,” in 2013 IEEE International Conference on Robotics + and Automation. IEEE, 2013, pp. 3748–3754. + +[13] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, “Dynaslam: Tracking, + mapping, and inpainting in dynamic scenes,” IEEE Robotics and + Automation Letters, vol. 3, no. 4, pp. 4076–4083, 2018. + +[14] I. A. Bârsan, P. Liu, M. Pollefeys, and A. Geiger, “Robust dense + mapping for large-scale dynamic environments,” in 2018 IEEE In- + ternational Conference on Robotics and Automation (ICRA). IEEE, + 2018, pp. 7510–7517. + +[15] W. Dai, Y. Zhang, P. Li, Z. Fang, and S. Scherer, “Rgb-d slam in + dynamic environments using point correlations,” IEEE Transactions + on Pattern Analysis and Machine Intelligence, 2020. + +[16] T. Ji, C. Wang, and L. Xie, “Towards real-time semantic rgb-d slam in + dynamic environments,” in 2021 International Conference on Robotics + and Automation (ICRA), 2021. + +[17] D. Hahnel, R. Triebel, W. Burgard, and S. Thrun, “Map building with + mobile robots in dynamic environments,” in 2003 IEEE International + Conference on Robotics and Automation (Cat. No. 03CH37422), + vol. 2. IEEE, 2003, pp. 1557–1563. + +[18] C.-C. Wang, C. Thorpe, S. Thrun, M. Hebert, and H. Durrant-Whyte, + “Simultaneous localization, mapping and moving object tracking,” The + International Journal of Robotics Research, vol. 26, no. 9, pp. 889– + 916, 2007. + +[19] N. D. Reddy, P. Singhal, V. Chari, and K. M. Krishna, “Dynamic body + vslam with semantic constraints,” in 2015 IEEE/RSJ International + Conference on Intelligent Robots and Systems (IROS). IEEE, 2015, + pp. 
Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on July 04,2023 at 02:38:40 UTC from IEEE Xplore. Restrictions apply.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2022年/DynaVINS_A_Visual-Inertial_SLAM_for_Dynamic_Environments.pdf b/动态slam/2020年-2022年开源动态SLAM/2022年/DynaVINS_A_Visual-Inertial_SLAM_for_Dynamic_Environments.pdf
new file mode 100644
index 0000000..cb06d7b
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2022年/DynaVINS_A_Visual-Inertial_SLAM_for_Dynamic_Environments.pdf
@@ -0,0 +1,663 @@
IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 7, NO. 4, OCTOBER 2022 11523

DynaVINS: A Visual-Inertial SLAM for Dynamic Environments

Seungwon Song, Hyungtae Lim, Graduate Student Member, IEEE, Alex Junho Lee, and Hyun Myung, Senior Member, IEEE

Abstract—Visual inertial odometry and SLAM algorithms are widely used in various fields, such as service robots, drones, and autonomous vehicles. Most of the SLAM algorithms are based on the assumption that landmarks are static. However, in the real world, various dynamic objects exist, and they degrade the pose estimation accuracy. In addition, temporarily static objects, which are static during observation but move when they are out of sight, trigger false positive loop closings. To overcome these problems, we propose a novel visual-inertial SLAM framework, called DynaVINS, which is robust against both dynamic objects and temporarily static objects. In our framework, we first present a robust bundle adjustment that can reject the features from dynamic objects by leveraging pose priors estimated by the IMU preintegration. Then, a keyframe grouping and a multi-hypothesis-based constraints grouping method are proposed to reduce the effect of temporarily static objects in the loop closing. Subsequently, we evaluated our method on a public dataset that contains numerous dynamic objects. Finally, the experimental results corroborate that our DynaVINS has promising performance compared with other state-of-the-art methods by successfully rejecting the effect of dynamic and temporarily static objects.

Index Terms—Visual-inertial SLAM, SLAM, visual tracking.

Fig. 1. Our algorithm, DynaVINS, in various dynamic environments. (a)–(b) Feature rejection results in the city_day sequence of the VIODE dataset [13]. Even if most features are dynamic, DynaVINS can discard the effect of the dynamic features. (c) Separation of feature matching results into multiple hypotheses in the E-shape sequence of our dataset. Even if a temporarily static object exists, only a hypothesis from static objects is determined as true positive. Features with high and low weights are denoted as green circles and red crosses, respectively, in both cases.

I. INTRODUCTION

SIMULTANEOUS localization and mapping (SLAM) algorithms have been widely exploited in various robotic applications that require precise positioning or navigation in environments where GPS signals are blocked. Various types of sensors have been used in SLAM algorithms. In particular, visual sensors such as monocular cameras [1], [2], [3] and stereo cameras [4], [5], [6] are widely used because of their relatively low cost and weight with rich information.

Various visual SLAM methods have been studied for more than a decade. However, most researchers have assumed that landmarks are implicitly static; thus, many visual SLAM methods still have potential risks when interacting with real-world environments that contain various dynamic objects. Only recently have several studies focused on dealing with dynamic objects solely using visual sensors.

Most of the studies [7], [8], [9] address the problems by detecting the regions of dynamic objects via depth clustering, feature reprojection, or deep learning. Moreover, some researchers incorporate the dynamics of the objects into the optimization framework [10], [11], [12]. However, geometry-based methods require accurate camera poses; hence they can only deal with limited fractions of dynamic objects. In addition, deep-learning-aided methods have the limitation of solely working for predefined objects.

In the meanwhile, visual-inertial SLAM (VI-SLAM) frameworks [2], [3], [4], [5], [6] have been proposed by integrating an inertial measurement unit (IMU) into the visual SLAM. Unlike the visual SLAMs, a motion prior from the IMU helps the VI-SLAM algorithms tolerate scenes with dynamic objects to some degree.

Manuscript received 27 April 2022; accepted 22 August 2022. Date of publication 31 August 2022; date of current version 6 September 2022. This letter was recommended for publication by Associate Editor M. Magnusson and Editor S. Behnke upon evaluation of the reviewers' comments. This work was
supported in part by the "Indoor Robot Spatial AI Technology Development" project funded by KT, KT award under Grant B210000715, and in part by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) under Grant 2020-0-00440, Development of Artificial Intelligence Technology that Continuously Improves Itself as the Situation Changes in the Real World. The students are supported by the BK21 FOUR from the Ministry of Education (Republic of Korea). (Corresponding author: Hyun Myung.)

Seungwon Song, Hyungtae Lim, and Hyun Myung are with the School of Electrical Engineering, KAIST, Daejeon 34141, Republic of Korea (e-mail: sswan55@kaist.ac.kr; shapelim@kaist.ac.kr; hmyung@kaist.ac.kr).

Alex Junho Lee is with the Department of Civil and Environmental Engineering, KAIST, Daejeon 34141, Republic of Korea (e-mail: alex_jhlee@kaist.ac.kr).

Our code is available: https://github.com/url-kaist/dynaVINS

This letter has supplementary downloadable material available at https://doi.org/10.1109/LRA.2022.3203231, provided by the authors.

Digital Object Identifier 10.1109/LRA.2022.3203231

2377-3766 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:36:06 UTC from IEEE Xplore. Restrictions apply.

However, if the dominant dynamic objects occlude most of the view as shown in Fig. 1(b), the problem cannot be solved solely using the motion prior.

In addition, in real-world applications, temporarily static objects are static while being observed but in motion when they are not under observation. These objects may lead to a critical failure in the loop closure process due to false positives, as shown in Fig. 1(c). To deal with temporarily static objects, robust back-end methods [14], [15], [16], [17] have been proposed to reduce the effect of false positive loop closures in the optimization. However, since they focus on instantaneous false positive loop closures, they cannot deal with the persistent false positive loop closures caused by temporarily static objects.

In this study, to address the aforementioned problems, we propose a robust VI-SLAM framework, called DynaVINS, which is robust against dynamic and temporarily static objects. Our contributions are summarized as follows:

- The robust VI-SLAM approach is proposed to handle dominant, undefined dynamic objects that cannot be handled solely by learning-based or vision-only methods.
- A novel bundle adjustment (BA) pipeline is proposed for simultaneously estimating camera poses and discarding the features from the dynamic objects that deviate significantly from the motion prior.
- A robust global optimization with constraints grouped into multiple hypotheses is proposed to reject persistent loop closures from the temporarily static objects.

In the remainder of this letter, we introduce the robust BA method for optimizing moving windows in Section III and the methods for the robust global optimization in Section IV, and we compare our proposed method with other state-of-the-art (SOTA) methods in various environments in Section V.

II. RELATED WORKS

A. Visual-Inertial SLAM

As mentioned earlier, to address the limitations of the visual SLAM framework, VI-SLAM algorithms have recently been proposed to correct the scale and camera poses by adopting the IMU. MSCKF [3] was proposed as an extended Kalman filter (EKF)-based VI-SLAM algorithm. ROVIO [6] also used an EKF, but proposed a fully robocentric and direct VI-SLAM framework running in real time.

There are other approaches using optimization. OKVIS [5] proposed a keyframe-based framework and fuses the IMU preintegration residual and the reprojection residual in an optimization. ORB-SLAM3 [4] used an ORB descriptor for the feature matching, and poses and feature positions are corrected through an optimization. VINS-Fusion [2], an extended version of VINS-Mono, supports a stereo camera and adopts feature tracking rather than descriptor matching, which makes the algorithm faster and more robust.

However, the VI-SLAM methods described above still have potential limitations in handling the dominant dynamic objects and the temporarily static objects.

B. Dynamic Objects Rejection in Visual and VI SLAM

Numerous researchers have proposed various methods to handle dynamic objects in visual and VI SLAM algorithms. Fan et al. [8] proposed a multi-view geometry-based method using an RGB-D camera. After obtaining camera poses by minimizing the reprojection error, the type of each feature point is determined as dynamic or static by the geometric relationship between the camera movement and the feature. Canovas et al. [9] proposed a similar method, but adopted a surfel, similar to a polygon, to enable real-time performance by reducing the number of items to be computed. However, multi-view geometry-based algorithms assume that the camera pose estimation is accurate enough, leading to failure when the camera pose estimation is inaccurate owing to the dominant dynamic objects.

One of the solutions to this problem is to employ a wheel encoder. G2P-SLAM [18] rejected loop closure matching results with a high Mahalanobis distance from the pose estimated by the wheel odometry, which is invariant to the effect of dynamic and temporarily static objects. Despite the advantages of the wheel encoder, these methods are highly dependent on it, limiting their applicability.

Another feasible approach is to adopt deep learning networks to identify predefined dynamic objects. In DynaSLAM [7], masked areas of the predefined dynamic objects obtained by a deep learning network were eliminated, and the remainder was determined via multi-view geometry. In Dynamic SLAM [19], a compensation method was adopted to make up for missed detections in a few keyframes using sequential data. Although the deep learning methods can successfully discard the dynamic objects even if they are temporarily static, these methods are somewhat problematic for the following two reasons: a) the types of dynamic objects have to be predefined, and b) sometimes only a part of the dynamic object is visible, as shown in Fig. 1(b). For these reasons, the objects may occasionally not be detected.

On the other hand, methods for tracking a dynamic object's motion have been proposed. RigidFusion [10] assumed that only a single dynamic object is in the environment and estimated the motion of that object. Qiu et al. [12] combined a deep learning method and VINS-Mono [2] to track the poses of the camera and the object simultaneously. DynaSLAM II [11] identified dynamic objects, similar to DynaSLAM [7]; then, within the BA factor graph, the poses of static features and the camera were estimated while the motion of the dynamic objects was estimated simultaneously.

C. Robust Back-End

In the graph SLAM field, several researchers have attempted to discard incorrectly created constraints. For instance, max-mixture [14] employed a single integrated Bayesian framework to eliminate the incorrect loop closures, while the switchable constraint [15] was proposed to adjust the weight of each constraint to eliminate false positive loop closures in the optimization. However, false-positive loop closures can be expected to be consistent and to occur persistently due to the temporarily static objects. These robust kernels are not appropriate for handling such persistent loop closures.

On the other hand, the Black-Rangarajan (B-R) duality [20] was proposed to unify robust estimation and the outlier rejection process. Some methods [16], [17] utilize the B-R duality in point cloud registration and pose graph optimization (PGO) to reduce the effect of false-positive matches even if they are dominant. These methods are useful for rejecting outliers in a PGO. However, repeatedly detected false-positive loop closures from similar objects are not considered. Moreover, the B-R duality has not yet been utilized in the BA of VI-SLAM.

To address the aforementioned limitations, we improve the VI-SLAM to minimize the effect of the dynamic and temporarily static objects by adopting the B-R duality not only in the graph structure but also in the BA framework by reflecting the IMU prior and the feature tracking information.

SONG et al.: DYNAVINS: A VISUAL-INERTIAL SLAM FOR DYNAMIC ENVIRONMENTS 11525

Fig. 2. The pipeline of our robust visual inertial SLAM. Features are tracked in mono or stereo images and IMU data are preintegrated in the sensor preprocessing step. Then, the robust BA is applied to discard tracked features from dynamic objects, and only the features from static objects will remain. Keyframes are grouped using the number of tracked features, and loop closures detected in current keyframe groups are clustered into hypotheses. Each hypothesis with its weight is used or rejected in the selective optimization. Using the proposed framework, a trajectory robust against dynamic and temporarily static objects can be obtained.

III. ROBUST BUNDLE ADJUSTMENT

A. Notation

In this letter, the following notations are defined. The i-th camera frame and the j-th tracked feature are denoted as $C_i$ and $f_j$, respectively. For two frames $C_A$ and $C_B$, $T^B_A \in SE(3)$ denotes the pose of $C_A$ relative to $C_B$, and the pose of $C_A$ in the world frame $W$ can be denoted as $T^W_A$.

$\mathcal{B}$ is a set of indices of the IMU preintegrations, and $\mathcal{P}$ is a set of visual pairs $(i, j)$ where $i$ corresponds to the frame $C_i$ and $j$ to the feature $f_j$. Because the feature $f_j$ is tracked across multiple camera frames, different camera frames can contain the same feature $f_j$. Thus, the set of indices of all tracked features in the current moving window is denoted as $\mathcal{F}_\mathcal{P}$, and the set of indices of the camera frames that contain the feature $f_j$ is denoted as $\mathcal{P}(f_j)$.

In the visual-inertial optimization framework of the current sliding window, $\mathcal{X}$ represents the full state vector that contains the sets of poses and velocities of the keyframes, the biases of the IMU, i.e., acceleration and gyroscope biases, and the estimated depths of the features as in [2].

B. Conventional Bundle Adjustment

In the conventional visual-inertial state estimator [2], the visual-inertial BA formulation is defined as follows:

\min_{\mathcal{X}} \Bigl\{ \bigl\| r_p - H_p \mathcal{X} \bigr\|^2 + \sum_{k \in \mathcal{B}} \bigl\| r_I\bigl(\hat{z}^{b_{k+1}}_{b_k}, \mathcal{X}\bigr) \bigr\|^2_{P^{b_{k+1}}_{b_k}} + \sum_{(i,j) \in \mathcal{P}} \rho_H\Bigl( \bigl\| r_P\bigl(\hat{z}^{C_i}_j, \mathcal{X}\bigr) \bigr\|^2_{P^{C_i}_j} \Bigr) \Bigr\},   (1)

where $\rho_H(\cdot)$ denotes the Huber loss [21]; $r_p$, $r_I$, and $r_P$ represent the residuals for the marginalization, IMU, and visual reprojection measurements, respectively; $\hat{z}^{b_{k+1}}_{b_k}$ and $\hat{z}^{C_i}_j$ stand for the observations of the IMU and the feature points; $H_p$ denotes a measurement estimation matrix of the marginalization, and $P$ denotes the covariance of each term. For convenience, $r_I(\hat{z}^{b_{k+1}}_{b_k}, \mathcal{X})$ and $r_P(\hat{z}^{C_i}_j, \mathcal{X})$ are simplified as $r^k_I$ and $r^{\mathcal{P}}_{j,i}$, respectively.

The Huber loss does not work successfully once the ratio of outliers increases. This is because the Huber loss does not entirely reject the residuals from outliers [22]. On the other hand, the redescending M-estimators, such as Geman-McClure (GMC) [23], ignore the outliers perfectly once the residuals are over a specific range owing to their zero gradients. Unfortunately, this truncation triggers a problem that features considered as outliers would never become inliers even though the features originated from static objects.

To address these problems, our BA method consists of two parts: a) a regularization factor that leverages the IMU preintegration, and b) a momentum factor that considers the previous state of each weight to cover the case where the preintegration becomes temporarily inaccurate.

C. Regularization Factor

First, to reject the outlier features while robustly estimating the poses, we propose a novel loss term inspired by the B-R duality [20] as follows:

\rho\bigl(w_j, r^{\mathcal{P}}_j\bigr) = w_j^2\, r^{\mathcal{P}}_j + \lambda_w \Phi^2(w_j),   (2)

where $r^{\mathcal{P}}_j$ denotes $\sum_{i \in \mathcal{P}(f_j)} \| r^{\mathcal{P}}_{j,i} \|^2$ for simplicity; $w_j \in [0, 1]$ denotes the weight corresponding to each feature $f_j$, and an $f_j$ with $w_j$ close to 1 is determined to be a static feature; $\lambda_w \in \mathbb{R}^+$ is a constant parameter; $\Phi(w_j)$ denotes the regularization factor of the weight $w_j$ and is defined as follows:

\Phi(w_j) = 1 - w_j.   (3)

Then, $\rho(w_j, r^{\mathcal{P}}_j)$ in (2) is adopted instead of the Huber norm in the visual reprojection term in (1). Hence, the BA formulation can be expressed as:

\min_{\mathcal{X},\mathcal{W}} \Bigl\{ \| r_p - H_p \mathcal{X} \|^2 + \sum_{k \in \mathcal{B}} \| r^k_I \|^2 + \sum_{j \in \mathcal{F}_\mathcal{P}} \rho\bigl(w_j, r^{\mathcal{P}}_j\bigr) \Bigr\},   (4)

where $\mathcal{W} = \{ w_j \mid j \in \mathcal{F}_\mathcal{P} \}$ represents the set of all weights. By adopting the weight and regularization factor inspired by the B-R duality, the influence of features with a high reprojection error compared to the estimated state can be reduced while maintaining the state estimation performance. The details will be covered in the remainder of this subsection.

(4) is solved using an alternating optimization [20]. Because the current state $\mathcal{X}$ can be estimated from the IMU preintegration and the previously optimized state, unlike other methods [16], [17], $\mathcal{W}$ is updated first with $\mathcal{X}$ fixed. Then, $\mathcal{X}$ is optimized with $\mathcal{W}$ fixed. While optimizing $\mathcal{W}$, all terms except the weights are constants. Hence, the formulation for optimizing the weights can be expressed as follows:

\min_{\mathcal{W}} \Bigl\{ \sum_{j \in \mathcal{F}_\mathcal{P}} \rho\bigl(w_j, r^{\mathcal{P}}_j\bigr) \Bigr\}.   (5)

Because the weights $w_j$ are independent of each other, (5) can be optimized independently for each $w_j$ as follows:

\min_{w_j \in [0,1]} \Bigl\{ w_j^2 \Bigl( \sum_{i \in \mathcal{P}(f_j)} \| r^{\mathcal{P}}_{j,i} \|^2 \Bigr) + \lambda_w \Phi^2(w_j) \Bigr\}.   (6)

Because the terms in (6) are in a quadratic form w.r.t. $w_j$, the optimal $w_j$ can be derived as follows:

w_j = \frac{\lambda_w}{\lambda_w + r^{\mathcal{P}}_j}.   (7)

As mentioned previously, the weights are first optimized based on the estimated state. Thus the weights of features with high reprojection errors start with small values. However, as shown in Fig. 3(a), the loss of the feature $\rho(w_j, r^{\mathcal{P}}_j)$ is a convex function unless the weight is zero; there is a non-zero gradient not only in the loss of an inlier feature but also in the loss of an outlier feature, which means that a new feature affects the BA regardless of its type at first.

While the optimization step is repeated until the states and the weights converge, the weights of the outlier features are lowered and their losses are flattened further. As a result, the losses of the outlier features approach zero gradient and cannot affect the BA.

After convergence, the weight can be expressed using the reprojection error as in (7). Thus the converged loss $\bar{\rho}(r^{\mathcal{P}}_j)$ can be derived by applying (7) to (2) as follows:

\bar{\rho}\bigl(r^{\mathcal{P}}_j\bigr) = \frac{\lambda_w\, r^{\mathcal{P}}_j}{\lambda_w + r^{\mathcal{P}}_j}.   (8)

As shown in Fig. 3(b), increasing $\lambda_w$ affects $\bar{\rho}(r^{\mathcal{P}}_j)$ in two directions: increasing the gradient value and the convexity. By increasing the gradient value, the visual reprojection residuals affect the BA more than the marginalization and IMU preintegration residuals. And by increasing the convexity, some of the outlier features can affect the BA.

To sum up, the proposed factor benefits from both the Huber loss and GMC by adjusting the weights in an adaptive way; our method efficiently filters out outliers, but does not entirely ignore outliers in the optimization at first either.

Fig. 3. Changes of the loss functions w.r.t. various parameters. (a) $\rho(w_j, r^{\mathcal{P}}_j)$ w.r.t. $w_j$ in the alternating optimization for $\lambda_w = 1$. $\bar{\rho}(r^{\mathcal{P}}_j)$ represents the converged loss. (b) $\bar{\rho}(r^{\mathcal{P}}_j)$ w.r.t. $\lambda_w$. (c) $\bar{\rho}_m(r^{\mathcal{P}}_j)$ w.r.t. $\bar{w}_j$ for $n_j = 5$. (d) $\bar{\rho}_m(r^{\mathcal{P}}_j)$ w.r.t. $n_j$ for $\bar{w}_j = 0$.

Fig. 4. Framework of the robust BA. Each feature has a weight and is used in the visual residual. Each weight is optimized through the regularization factor and the weight momentum factor. Preintegrated IMU data are used in the IMU residual term. All parameters are optimized in the robust BA.

D. Weight Momentum Factor

When the motion becomes aggressive, the IMU preintegration becomes imprecise, and thus the estimated state becomes inaccurate. In this case, the reprojection residuals of the features from the static objects become larger; hence, by the regularization factor, those features will be ignored in the BA process even though their previous weights were close to one.

If $\lambda_w$ is increased to solve this problem, even the features with high reprojection residuals caused by dynamic objects are used; therefore, the result of the BA will be inaccurate. Thus, increasing $\lambda_w$ is not enough to cope with this problem.

To solve this issue, an additional factor, a weight momentum factor, is proposed to make the previously estimated feature weights unaffected by an aggressive motion. Because the features are continuously tracked, each feature $f_j$ is optimized $n_j$ times with its previous weight $\bar{w}_j$. In order to make the current weight tend to remain at $\bar{w}_j$, and to increase the degree of this tendency as $n_j$ increases, the weight momentum factor $\Psi(w_j)$ is designed as follows:

\Psi(w_j) = n_j(\bar{w}_j - w_j).   (9)

Then, adding (9) to (2), the modified loss term can be derived as follows:

\rho_m\bigl(w_j, r^{\mathcal{P}}_j\bigr) = w_j^2 \sum_{i \in \mathcal{P}(f_j)} \| r^{\mathcal{P}}_{j,i} \|^2 + \lambda_w \Phi^2(w_j) + \lambda_m \Psi^2(w_j),   (10)

where $\lambda_m \in \mathbb{R}^+$ represents a constant parameter to adjust the effect of the momentum factor on the BA.

In summary, the proposed robust BA can be illustrated as in Fig. 4. The previous weights of the tracked features are used in the weight momentum factor, and the weights of all features in the current window are used in the regularization factor. As a result, the robust BA is expressed as follows:

\min_{\mathcal{X},\mathcal{W}} \Bigl\{ \| r_p - H_p \mathcal{X} \|^2 + \sum_{k \in \mathcal{B}} \| r^k_I \|^2 + \sum_{j \in \mathcal{F}_\mathcal{P}} \rho_m\bigl(w_j, r^{\mathcal{P}}_j\bigr) \Bigr\}.   (11)

(11) can be solved by using the alternating optimization in the same way as (4). The alternating optimization is iterated until $\mathcal{X}$ and $\mathcal{W}$ converge. Then, the converged loss $\bar{\rho}_m(r^{\mathcal{P}}_j)$ can be derived. $\bar{\rho}_m(r^{\mathcal{P}}_j)$ w.r.t. $\bar{w}_j$ and $n_j$ is shown in Fig. 3(c) and (d), respectively.

As shown in Fig. 3(c), if $\bar{w}_j$ is low, the gradient of the loss is small even when $r^{\mathcal{P}}_j$ is close to 0. Thus, the features presumably originating from dynamic objects do not have much impact on the BA even if their reprojection errors are low in the current step.
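The per-feature weight step above has closed forms: (7) for the plain regularization factor, and, if the derivative of (10) w.r.t. $w_j$ is set to zero with the residual held fixed, a momentum-augmented analogue. The following is a minimal numerical sketch, assuming the residual is constant during the weight step; the function names are ours, not from the DynaVINS code:

```python
def weight_update(r, lam_w):
    # Eq. (7): closed-form minimizer of (2) over w in [0, 1],
    # w = lam_w / (lam_w + r), with r the summed squared
    # reprojection residual of one tracked feature.
    return lam_w / (lam_w + r)

def weight_update_momentum(r, lam_w, lam_m, n, w_prev):
    # Stationary point of the momentum-augmented loss (10):
    # d/dw [w^2 r + lam_w (1 - w)^2 + lam_m n^2 (w_prev - w)^2] = 0
    # => w = (lam_w + lam_m n^2 w_prev) / (r + lam_w + lam_m n^2).
    w = (lam_w + lam_m * n ** 2 * w_prev) / (r + lam_w + lam_m * n ** 2)
    return min(1.0, max(0.0, w))

# A feature with a small residual keeps a weight near 1; a feature on a
# dynamic object (large residual) is suppressed toward 0.
print(weight_update(0.01, 1.0))   # ~0.990
print(weight_update(100.0, 1.0))  # ~0.0099
# A long-tracked outlier (w_prev = 0, n = 5) stays suppressed even when
# its current residual is momentarily small, mirroring Fig. 3(c)-(d).
print(weight_update_momentum(0.01, 1.0, 1.0, 5, 0.0))  # ~0.038
```

In the full system these updates alternate with a fixed-weight BA over the state, as in (4) and (11); the sketch only illustrates why static features recover high weights while persistent outliers stay near zero.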
In addition, the gradient of the loss increases for features whose $\bar{w}_j$ is close to 1, so even though the current residual is high, the optimization is performed in the direction of reducing the residual rather than the weight.

Furthermore, as shown in Fig. 3(d), if $\bar{w}_j$ is zero, the gradient gets smaller as $n_j$ increases; hence the tracked outlier feature has less effect on the BA, and the longer it is tracked, the less it affects the BA.

For the stereo camera configuration, in addition to the reprojection on one camera, reprojections on the other camera in the same keyframe, $r^{\mathcal{P}}_{stereo}$, or in another keyframe, $r^{\mathcal{P}}_{another}$, exist. In that case, weights are also applied to the reprojection $r^{\mathcal{P}}_{another}$ because it is also affected by the movement of features, while $r^{\mathcal{P}}_{stereo}$ is invariant to the movement of features and is only adopted as the criterion for the depth estimation.

Fig. 5. The procedure of the multiple hypotheses clustering. (a) Keyframes that share the minimum number of tracked features are grouped. (b) There are two types of features used for matchings: static and temporarily static features. ${}^{k,m}T^W_i$, the estimated pose of $C_i$, can be estimated using the matching result $T^k_m$ and the local relative pose $T^i_k$. An accurate keyframe pose can be estimated if static features are used for the matching. (c) The temporarily static feature has moved from its previous position. However, the matching result is based on the previous position of the feature. Thus, the estimated keyframe pose will be inaccurate. Finally, the feature matching results with similar $T^W_i$ are clustered based on the Euclidean distance.

IV. SELECTIVE GLOBAL OPTIMIZATION

In the VIO framework, the drift is inevitably cumulative along the trajectory because the optimization is performed only within the moving window. Hence, a loop closure detection, e.g., using DBoW2 [24], is necessary to optimize all trajectories.

In a typical visual SLAM, all loop closures are exploited even if some of them are from temporarily static objects. Those false positive loop closures may lead to the failure of the SLAM framework. Moreover, features from the temporarily static objects and from the static objects may exist in the same keyframe. Therefore, in this section, we propose a method to eliminate the false positive loop closures while maintaining the true positive loop closures.

A. Keyframe Grouping

Unlike conventional methods that treat loop closures individually, in this study, loop closures from the same features are grouped, even if they are from different keyframes. As a result, only one weight per group is used, allowing for an effective optimization.

As shown in Fig. 5(a), before grouping the loop closures, adjacent keyframes that share at least a minimum number of tracked features have to be grouped. The group starting from the i-th camera frame $C_i$ is defined as follows:

\mathrm{Group}(C_i) = \bigl\{ C_k \,\big|\, |F^k_i| \ge \alpha,\; k \ge i \bigr\},   (12)

where $\alpha$ represents a minimum number of tracked features, and $F^k_i$ represents the set of features tracked from $C_i$ to $C_k$. For simplicity, $\mathrm{Group}(C_i)$ will be denoted as $G_i$ hereinafter.

B. Multiple Hypotheses Clustering

After the keyframes are grouped as in the previous subsection, DBoW2 is employed to identify a similar keyframe $C_m$ for each keyframe $C_k$ in the current group $G_i$ starting from $C_i$ ($C_k \in G_i$ and $m < i$). Note that $C_k$ is skipped if there is no similar keyframe. After identifying up to three different $m$ for each $k$, a feature matching is conducted between $C_k$ and these keyframes, and the relative pose $T^k_m$ can be obtained. Using $T^k_m$, the estimated pose of $C_k$ in the world frame, ${}^{m}T^W_k$, can be obtained as follows:

{}^{m}T^W_k = T^k_m \cdot T^W_m,   (13)

where $T^W_m$ represents the pose of $C_m$ in the world frame.

However, it is difficult to directly compute the similarity between the loop closures from different keyframes in the current group. Assuming that the relative pose $T^i_k$ between $C_k$ and $C_i$ is sufficiently accurate, the estimated pose of $C_i$ in the world frame can be expressed as follows:

{}^{k,m}T^W_i = T^i_k \cdot {}^{m}T^W_k.   (14)

If the features used for the matchings are from the same object, the estimated $T^W_i$ of the matchings will be located close to each other, even if $C_k$ and $C_m$ of the matchings are different. Hence, after calculating the Euclidean distances between the loop closures' estimated $T^W_i$, similar loop closures with a small Euclidean distance can be clustered, as shown in Fig. 5(c).

Depending on which loop closure cluster is selected, the trajectory result from the graph optimization varies. Therefore, each cluster can be called a hypothesis. To reduce the computational cost, the top-two hypotheses are adopted by comparing the cardinality of the loop closures within each hypothesis. These two hypotheses of the current group $G_i$ are denoted as $H^0_i$ and $H^1_i$. However, it is not yet possible to distinguish between true and false positive hypotheses. Hence, the method for determining the true positive hypothesis among the candidate hypotheses is described in the next section.

C. Selective Optimization for Constraint Groups

Most of the recent visual SLAM algorithms use a graph optimization. Let $\mathcal{C}$, $\mathcal{T}$, $\mathcal{L}$, and $\mathcal{W}$ denote the sets of keyframes, poses, loop closures, and all weights, respectively. Then the graph optimization can be denoted as:

\min_{\mathcal{T}} \Bigl\{ \sum_{i \in \mathcal{C}} \underbrace{\| r(T^{i+1}_i, \mathcal{T}) \|^2_{P_{T^{i+1}_i}}}_{\text{local edge}} + \sum_{(j,k) \in \mathcal{L}} \underbrace{\rho\bigl( \| r(T^j_k, \mathcal{T}) \|^2_{P_L} \bigr)}_{\text{loop closure edge}} \Bigr\},   (15)

where $T^{i+1}_i$ represents the local pose between two adjacent keyframes $C_i$ and $C_{i+1}$; $T^j_k$ is the relative pose between $C_j$ and $C_k$ from the loop closure; $P_{T^{i+1}_i}$ and $P_L$ denote the covariances of the local pose and the loop closure, respectively.

For the two hypotheses of group $G_i$, the weights are denoted as $w^0_i$ and $w^1_i$, the sum of the weights as $w_i$, and the set of hypotheses as $\mathcal{H}$. Using a similar procedure as in Section III-C, the Black-Rangarajan duality is applied to (15) as follows:

\min_{\mathcal{T},\mathcal{W}} \Bigl\{ \sum_{i \in \mathcal{C}} \| r(T^{i+1}_i, \mathcal{T}) \|^2_{P_{T^{i+1}_i}} + \sum_{H_i \in \mathcal{H}} \Bigl[ \underbrace{\sum_{(j,k) \in H^0_i} \frac{w^0_i}{|H^0_i|} \| r(T^j_k, \mathcal{T}) \|^2_{P_L}}_{\text{residual for hypothesis 0}} + \underbrace{\sum_{(j,k) \in H^1_i} \frac{w^1_i}{|H^1_i|} \| r(T^j_k, \mathcal{T}) \|^2_{P_L}}_{\text{residual for hypothesis 1 (optional)}} + \underbrace{\lambda_l \Phi_l^2(w_i)}_{\text{hypothesis regularization}} \Bigr] \Bigr\},   (16)

where $\lambda_l \in \mathbb{R}^+$ is a constant parameter. The regularization factor for the loop closure, $\Phi_l$, is defined as follows:

\Phi_l(w_i) = 1 - w_i = 1 - (w^0_i + w^1_i),   (17)

where $w^0_i, w^1_i \in [0, 1]$. To ensure that the weights are not affected by the number of loop closures in the hypothesis, the weights are divided by the cardinality of each hypothesis. Then, (16) is optimized in the same manner as (11).

TABLE I — ABLATION EXPERIMENT

A. Dataset

VIODE Dataset: The VIODE dataset [13] is a simulated dataset that contains many moving objects, such as cars or trucks, compared with conventional datasets. In addition, the dataset includes overall occlusion situations, where most parts of the image are occluded by dominant dynamic objects, as shown in Fig. 1. Note that the sub-sequence names, none to high, indicate how many dynamic objects exist in the scene.

Our Dataset: Unfortunately, the VIODE dataset does not contain harsh loop closing situations caused by temporarily static objects. Accordingly, we recorded our own dataset with four sequences to evaluate our global optimization. First, the Static sequence validates the dataset. In the Dynamic follow sequence, a dominant dynamic object moves in front of the camera. Next, in the Temporal static sequence, the same object is observed from multiple locations. In other words, the object is static while being observed, and then it moves to a different position. Finally, in the E-shape sequence, the camera moves along the shape of the letter E. The checkerboard is moved while not being observed; thus it is observed at the three end-vertices of the E-shaped trajectory in the camera perspective, which triggers false-positive loop closures. Note that the feature-rich checkerboard is used in the experiment to address the effect of false loop closures.

B. Error Metrics

The accuracy of the estimated trajectory of each algorithm is measured by the Absolute Trajectory Error (ATE) [25], which directly measures the difference between points of the ground truth and the aligned estimated trajectory. In addition, for the VIODE dataset, the degradation rate [13], $r_d = \mathrm{ATE}_{high} / \mathrm{ATE}_{none}$, is calculated to determine the robustness of the algorithm.

C. Evaluation on the VIODE Dataset

First, the effects of the proposed factors on BA time cost and accuracy are analyzed, as shown in Table I. Ours with only the regularization factor achieves a better result than VINS-Fusion, but together with the momentum factor, it not only outperforms VINS-Fusion but also takes less time owing to the previous information. Moreover, although the BA time of ours was increased due to additional optimizations, it is sufficient for high-level control of robots.
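Returning to Section IV, the keyframe grouping of (12) and the Euclidean-distance clustering of Fig. 5(c) can be sketched in a few lines. This is a simplified illustration under our own assumptions: poses are reduced to 3-D positions, a greedy single-linkage pass stands in for the full SE(3) machinery, and the data-structure names are ours:

```python
def group_keyframes(tracked, i, alpha):
    # Keyframe grouping in the spirit of eq. (12): starting from frame i,
    # extend the group while frames still share at least `alpha` features
    # tracked continuously from C_i. `tracked[k]` is the set of feature
    # ids visible in frame k (a simplification of F_i^k).
    group = [i]
    shared = set(tracked[i])
    for k in range(i + 1, len(tracked)):
        shared &= set(tracked[k])
        if len(shared) < alpha:
            break
        group.append(k)
    return group

def cluster_hypotheses(positions, eps):
    # Cluster loop closures whose estimated poses of C_i land close to
    # each other (Fig. 5(c)); each entry of `positions` is the translation
    # part of one matching's estimated pose of C_i.
    clusters = []
    for idx, p in enumerate(positions):
        for c in clusters:
            if any(sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 < eps
                   for q in c["pts"]):
                c["pts"].append(p)
                c["ids"].append(idx)
                break
        else:
            clusters.append({"pts": [p], "ids": [idx]})
    # Keep the top-two hypotheses by cardinality, as in the letter.
    clusters.sort(key=lambda c: len(c["ids"]), reverse=True)
    return clusters[:2]

tracked = [{1, 2, 3, 4}, {2, 3, 4}, {3, 4}, {9}]
print(group_keyframes(tracked, 0, 2))  # [0, 1, 2]
hyps = cluster_hypotheses([(0.0, 0, 0), (0.1, 0, 0), (5.0, 0, 0)], eps=1.0)
print([h["ids"] for h in hyps])        # [[0, 1], [2]]
```

The two matchings that agree on where $C_i$ sits form the larger hypothesis; the outlying one, e.g. from a temporarily static object, ends up in its own cluster and can then be down-weighted by (16).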
Ac- tional optimizations, it is sufficient for high-level control of +cordingly, only the hypothesis with a high weight is adopted robots. +in the optimization. In addition, all weights can be close to +0 when all hypotheses are false positives due to the multiple As shown in Table II and Fig. 6, the SOTA methods show +temporarily static objects. Hence, the failure caused by false precise pose estimation results in static environments. However, +positive hypotheses can be prevented. they struggle with the effect of dominant dynamic objects. In + particular, even though DynaSLAM employs a semantic seg- + Because keyframe poses are changed after the optimization, mentation module, DynaSLAM tends to diverge or shows large +the hypothesis clustering in Section IV-B is conducted again for ATE compared with other methods as the number of dynamic +all groups for the next optimization. objects increases (from none to high). This performance + degradation is due to the overall occlusion situations, leading to + V. EXPERIMENTAL RESULTS + + To evaluate the proposed algorithm, we compare ours with +SOTA algorithms, namely, VINS-Fusion [2], ORB-SLAM3 [4], +and DynaSLAM [7]. Each algorithm is tested in a mono- +inertial (-M-I) and a stereo-inertial (-S-I) mode. Note that +an IMU is not used in DynaSLAM, so it is only tested in a +stereo (-S) mode and compared with the -S-I mode of other +algorithms. It could be somewhat unfair, but the comparison is +conducted to stress the necessity for an IMU when dealing with +dynamic environments. + +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:36:06 UTC from IEEE Xplore. Restrictions apply. + SONG et al.: DYNAVINS: A VISUAL-INERTIAL SLAM FOR DYNAMIC ENVIRONMENTS 11529 + + TABLE II + COMPARISON WITH STATE-OF-THE-ART METHODS (RMSE OF ATE IN [M]) + +Fig. 6. ATE results of state-of-the-art algorithms and ours on the city_day sequences of the VIODE dataset [13]. 
Note that the y-axis is expressed in logarithmic +scale. Our algorithm shows promising performance with less performance degeneration compared with the other state-of-the-art methods. + +the failure of the semantic segmentation module and the absence Fig. 7. Results of the state-of-the-art algorithms and ours on the park- +of features from static objects. ing_lot high sequence of the VIODE dataset [13]. (a) Trajectory of each + algorithm in the 3D feature map, which is the result of our proposed algorithm. + Similarly, although ORB-SLAM3 tries to reject the frames Features with low weight are depicted in red. (b) Enlarged view of (a). All +with inaccurate features, it diverges when dominant dynamic other algorithms except our algorithm lost track or had noisy trajectories while +objects exist in parking_lot mid, high and city_day observing dynamic objects and as in (c) feature weighting result of our algorithm, +high sequences. However, especially in parking_lot low features from dynamic objects (red crosses) have low weight while robust +sequence, there is only one vehicle that is far from the camera, features (green circles) have high weight. +and it occludes an unnecessary background environment. As +a consequence, ORB-SLAM3-S-I outperforms other algo- TABLE III +rithms. COMPARISON OF DEGRADATION RATE rd + + VINS-Fusion is less hindered by the dynamic objects because +it tries to remove the features with an incorrectly estimated depth +(negative or far) after BA. However, those features have affected +the BA before they are removed. As a result, as the number of +the features from dynamic objects increases, the trajectory error +of VINS-Fusion gets higher. + + In contrast, our proposed method shows promising perfor- +mance in both mono-inertial and stereo-inertial modes. For +example, in parking_lot high sequence as shown in +Fig. 7(a)–(b), ours performs stable pose estimation even when +other algorithms are influenced by dynamic objects. 
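The adaptive, per-hypothesis weighting of Eqs. (16)–(17) can be illustrated with a toy computation. The residual values, the `loop_closure_cost` helper, and the λ_l value below are hypothetical; this is a sketch of the weighting mechanism, not the DynaVINS implementation:

```python
import numpy as np

def loop_closure_cost(res_h0, res_h1, w, lam=1.0):
    """Weighted loop-closure cost for one hypothesis group, after Eq. (16).

    res_h0, res_h1 : arrays of residual norms for hypothesis 0 / hypothesis 1
    w              : (w_i0, w_i1), each in [0, 1]
    lam            : regularization constant lambda_l (hypothetical value)
    """
    w0, w1 = w
    # residuals are averaged over the cardinality of each hypothesis so that
    # the weights are not biased by the number of loop closures it contains
    c0 = w0 * np.mean(res_h0 ** 2)
    c1 = w1 * np.mean(res_h1 ** 2) if len(res_h1) else 0.0
    phi = 1.0 - (w0 + w1)          # Eq. (17): hypothesis regularization factor
    return c0 + c1 + lam * phi ** 2

res_h0 = np.array([0.1, 0.2])      # consistent loop closures
res_h1 = np.array([3.0, 2.8])      # false-positive candidates
print(loop_closure_cost(res_h0, res_h1, (1.0, 0.0)))  # low cost: hypothesis 0 is kept
print(loop_closure_cost(res_h0, res_h1, (0.0, 1.0)))  # high cost: hypothesis 1 is penalized
```

Minimizing over the weights drives the weight of the inconsistent hypothesis toward 0, while the regularization term keeps a consistent hypothesis from being discarded along with it.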
In contrast, our proposed method shows promising performance in both mono-inertial and stereo-inertial modes. For example, in the parking_lot high sequence, as shown in Fig. 7(a)–(b), ours performs stable pose estimation even when the other algorithms are influenced by dynamic objects. Moreover, even though the number of dynamic objects increases, the performance degradation remains small compared to the other methods in all scenes. This confirms that our method overcomes the problems caused by dynamic objects owing to our robust BA method, which is also supported by Table III. In other words, our proposed method successfully rejects all dynamic features by adjusting the weights in an adaptive way. Also, our method is even robust against the overall occlusion situations, as shown in Fig. 1(b).

Interestingly, our proposed robust BA method also provides robustness against changes in illuminance by rejecting inconsistent features (e.g., the low-weight features in the dark area of Fig. 7(c)). Accordingly, our method shows remarkable performance compared with the SOTA methods in the city_night scenes, where not only do dynamic objects exist, but there is also a lack of illuminance. Note that the -M-I mode of ours has a better result than -S-I. This is because the stereo reprojection residual, r_P^stereo, can be inaccurate in low-light conditions.

11530 IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 7, NO. 4, OCTOBER 2022

Fig. 8. Results of the algorithms on the E-shape sequence. (a) Trajectory results. The other algorithms are inaccurate due to false-positive loop closures. (b) A loop closure rejection result of our algorithm. Constraints with low weight (red lines) do not contribute to the optimized trajectory.

D. Evaluation on Our Dataset

In the static case, all algorithms have low ATE values. This sequence validates that our dataset is correctly obtained.

However, in Dynamic follow, the other algorithms tried to track the occluding object. Hence, not only failures of BA but also false-positive loop closures are triggered. Consequently, all algorithms except ours have higher ATEs.

Furthermore, in Temporal static, ORB-SLAM3 and VINS-Fusion can eliminate the false-positive loop closure in the stereo-inertial case. However, in the mono-inertial case, due to inaccurate depth estimation, they cannot reject the false-positive loop closures. Additionally, VINS-Fusion with Switchable Constraints [15] can also reject the false-positive loop closures, but ours performs better, as shown in Table II.

Finally, in the E-shape case, the other algorithms fail to optimize the trajectory, as illustrated in Fig. 8(a), owing to the false-positive loop closures. VINS-Fusion with Switchable Constraints also cannot reject the false-positive loop closures, which are continuously generated. However, ours optimizes the weight of each hypothesis, not of individual loop closures. Hence, false-positive loop closures are rejected in the optimization irrespective of their number, as illustrated in Fig. 8(b). Ours does not use any object-wise information from the image; hence the features from the same object can be divided into different hypotheses, as depicted in Fig. 1(c).

VI. CONCLUSION

In this study, DynaVINS has been proposed: a robust visual-inertial SLAM framework based on the robust BA and the selective global optimization in dynamic environments. The experimental evidence corroborated that our algorithm works better than other algorithms in simulations and in actual environments with various dynamic objects. In future works, we plan to improve the speed and the performance. Moreover, we will adapt the concept of DynaVINS to the LiDAR-Visual-Inertial (LVI) SLAM framework.

REFERENCES

[1] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM: A versatile and accurate monocular SLAM system,” IEEE Trans. Robot., vol. 31, no. 5, pp. 1147–1163, Oct. 2015.
[2] T. Qin, P. Li, and S. Shen, “VINS-Mono: A robust and versatile monocular visual-inertial state estimator,” IEEE Trans. Robot., vol. 34, no. 4, pp. 1004–1020, Aug. 2018.
[3] A. I. Mourikis and S. I. Roumeliotis, “A multi-state constraint Kalman filter for vision-aided inertial navigation,” in Proc. IEEE Int. Conf. Robot. Automat., 2007, pp. 3565–3572.
[4] C. Campos, R. Elvira, J. J. G. Rodríguez, J. M. M. Montiel, and J. D. Tardós, “ORB-SLAM3: An accurate open-source library for visual, visual-inertial, and multimap SLAM,” IEEE Trans. Robot., vol. 37, no. 6, pp. 1874–1890, Dec. 2021.
[5] S. Leutenegger, S. Lynen, M. Bosse, R. Siegwart, and P. Furgale, “Keyframe-based visual-inertial odometry using nonlinear optimization,” Int. J. Robot. Res., vol. 34, no. 3, pp. 314–334, 2015.
[6] M. Bloesch, S. Omari, M. Hutter, and R. Siegwart, “Robust visual inertial odometry using a direct EKF-based approach,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2015, pp. 298–304.
[7] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, “DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes,” IEEE Robot. Automat. Lett., vol. 3, no. 4, pp. 4076–4083, Oct. 2018.
[8] Y. Fan, H. Han, Y. Tang, and T. Zhi, “Dynamic objects elimination in SLAM based on image fusion,” Pattern Recognit. Lett., vol. 127, pp. 191–201, 2019.
[9] B. Canovas, M. Rombaut, A. Nègre, D. Pellerin, and S. Olympieff, “Speed and memory efficient dense RGB-D SLAM in dynamic scenes,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2020, pp. 4996–5001.
[10] R. Long, C. Rauch, T. Zhang, V. Ivan, and S. Vijayakumar, “RigidFusion: Robot localisation and mapping in environments with large dynamic rigid objects,” IEEE Robot. Automat. Lett., vol. 6, no. 2, pp. 3703–3710, Apr. 2021.
[11] B. Bescos, C. Campos, J. D. Tardós, and J. Neira, “DynaSLAM II: Tightly-coupled multi-object tracking and SLAM,” IEEE Robot. Automat. Lett., vol. 6, no. 3, pp. 5191–5198, Jul. 2021.
[12] K. Qiu, T. Qin, W. Gao, and S. Shen, “Tracking 3-D motion of dynamic objects using monocular visual-inertial sensing,” IEEE Trans. Robot., vol. 35, no. 4, pp. 799–816, Aug. 2019.
[13] K. Minoda, F. Schilling, V. Wüest, D. Floreano, and T. Yairi, “VIODE: A simulated dataset to address the challenges of visual-inertial odometry in dynamic environments,” IEEE Robot. Automat. Lett., vol. 6, no. 2, pp. 1343–1350, Apr. 2021.
[14] E. Olson and P. Agarwal, “Inference on networks of mixtures for robust robot mapping,” Int. J. Robot. Res., vol. 32, no. 7, pp. 826–840, 2013.
[15] N. Sünderhauf and P. Protzel, “Switchable constraints for robust pose graph SLAM,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2012, pp. 1879–1884.
[16] H. Yang, P. Antonante, V. Tzoumas, and L. Carlone, “Graduated non-convexity for robust spatial perception: From non-minimal solvers to global outlier rejection,” IEEE Robot. Automat. Lett., vol. 5, no. 2, pp. 1127–1134, Apr. 2020.
[17] Q.-Y. Zhou, J. Park, and V. Koltun, “Fast global registration,” in Proc. Eur. Conf. Comput. Vis., 2016, pp. 766–782.
[18] S. Song, H. Lim, S. Jung, and H. Myung, “G2P-SLAM: Generalized RGB-D SLAM framework for mobile robots in low-dynamic environments,” IEEE Access, vol. 10, pp. 21370–21383, 2022.
[19] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou, “Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment,” Robot. Auton. Syst., vol. 117, pp. 1–16, 2019.
[20] M. J. Black and A. Rangarajan, “On the unification of line processes, outlier rejection, and robust statistics with applications in early vision,” Int. J. Comput. Vis., vol. 19, no. 1, pp. 57–91, 1996.
[21] P. J. Huber, “Robust estimation of a location parameter,” in Breakthroughs Statist., 1992, pp. 492–518.
[22] P. Babin, P. Giguère, and F. Pomerleau, “Analysis of robust functions for registration algorithms,” in Proc. IEEE Int. Conf. Robot. Automat., 2019, pp. 1451–1457.
[23] S. Geman, D. E. McClure, and D. Geman, “A nonlinear filter for film restoration and other problems in image processing,” CVGIP: Graph. Models Image Process., vol. 54, no. 4, pp. 281–289, 1992.
[24] D. Gálvez-López and J. D. Tardós, “Bags of binary words for fast place recognition in image sequences,” IEEE Trans. Robot., vol. 28, no. 5, pp. 1188–1197, Oct. 2012.
[25] Z. Zhang and D. Scaramuzza, “A tutorial on quantitative trajectory evaluation for visual(-inertial) odometry,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2018, pp. 7244–7251.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2022年/DytanVO_Joint_Refinement_of_Visual_Odometry_and_Motion_Segmentation_in_Dynamic_Environments.pdf b/动态slam/2020年-2022年开源动态SLAM/2022年/DytanVO_Joint_Refinement_of_Visual_Odometry_and_Motion_Segmentation_in_Dynamic_Environments.pdf
new file mode 100644
index 0000000..9d059c7
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2022年/DytanVO_Joint_Refinement_of_Visual_Odometry_and_Motion_Segmentation_in_Dynamic_Environments.pdf
@@ -0,0 +1,476 @@

2023 IEEE International Conference on Robotics and Automation (ICRA 2023)
May 29 - June 2, 2023. London, UK

DytanVO: Joint Refinement of Visual Odometry and Motion Segmentation in Dynamic Environments

Shihao Shen, Yilin Cai, Wenshan Wang, Sebastian Scherer

2023 IEEE International Conference on Robotics and Automation (ICRA) | 979-8-3503-2365-8/23/$31.00 ©2023 IEEE | DOI: 10.1109/ICRA48891.2023.10161306

Fig. 1: An overview of DytanVO. (a) Input frames at times t0 and t1. (b) Optical flow output from the matching network. (c) Motion segmentation output after iterations. (d) Trajectory estimation on sequence RoadCrossing VI from the AirDOS-Shibuya dataset, which is a highly dynamic environment cluttered with humans. Ours is the only learning-based VO that keeps track.
Abstract— Learning-based visual odometry (VO) algorithms achieve remarkable performance on common static scenes, benefiting from high-capacity models and massive annotated data, but tend to fail in dynamic, populated environments. Semantic segmentation is largely used to discard dynamic associations before estimating camera motions, but at the cost of discarding static features, and it is hard to scale up to unseen categories. In this paper, we leverage the mutual dependence between camera ego-motion and motion segmentation and show that both can be jointly refined in a single learning-based framework. In particular, we present DytanVO, the first supervised learning-based VO method that deals with dynamic environments. It takes two consecutive monocular frames in real time and predicts camera ego-motion in an iterative fashion. Our method achieves an average improvement of 27.7% in ATE over state-of-the-art VO solutions in real-world dynamic environments, and even performs competitively among dynamic visual SLAM systems which optimize the trajectory on the backend. Experiments on plentiful unseen environments also demonstrate our method's generalizability.

I. INTRODUCTION

Visual odometry (VO), one of the most essential components for pose estimation in the visual Simultaneous Localization and Mapping (SLAM) system, has attracted significant interest in robotic applications over the past few years [1]. A lot of research work has been conducted to develop an accurate and robust monocular VO system using geometry-based methods [2], [3]. However, they require significant engineering effort for each module to be carefully designed and finetuned [4], which makes it difficult for them to be readily deployed in the open world with complex environmental dynamics, changes of illumination or inevitable sensor noises.

On the other hand, recent learning-based methods [4]–[7] are able to outperform geometry-based methods in more challenging environments such as large motion, fog or rain effects, and lack of features. However, they easily fail in dynamic environments if they do not take into consideration independently moving objects that cause unpredictable changes in illumination or occlusions. To this end, recent works utilize abundant unlabeled data and adopt either self-supervised learning [8], [9] or unsupervised learning [10], [11] to handle dynamic scenes. Although they achieve outstanding performance on particular tasks, such as autonomous driving, they produce worse results if applied to very different data distributions, such as micro air vehicles (MAV) that operate with aggressive and frequent rotations that cars do not have. Learning without supervision is hindered from generalizing due to biased data with simple motion patterns. Therefore, we approach the dynamic VO problem as supervised learning so that the model can map inputs to complex ego-motion ground truth and be more generalizable.

Code is available at https://github.com/Geniussh/DytanVO
S. Shen, Y. Cai, W. Wang, and S. Scherer are with the Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA. {shihaosh, yilincai, wenshanw, basti}@andrew.cmu.edu

To identify dynamic objects, object detection or semantic segmentation techniques are largely relied on to mask all movable objects, such as pedestrians and vehicles [12]–[15]. Their associated features are discarded before applying geometry-based methods. However, there are two issues with utilizing semantic information in dynamic VO. First, class-specific detectors for semantic segmentation heavily depend on appearance cues, but not every object that can move is present in the training categories, leading to false negatives. Second, even if all moving objects in a scene are within the categories, algorithms cannot distinguish between "actually moving" and "static but able to move". In dynamic VO, where static features are crucial to robust ego-motion estimation, one should segment objects based on pure motion (motion segmentation) rather than heuristic appearance cues.

Motion segmentation utilizes the relative motion between consecutive frames to remove the effect of camera movement from the 2D motion fields and calculates residual optical flow to account for moving regions. But paradoxically, ego-motion cannot be correctly estimated in dynamic scenes without a robust segmentation. There exists such a mutual dependence between motion segmentation and ego-motion estimation that has never been explored in supervised learning methods. Therefore, motivated by jointly refining the VO and motion segmentation, we propose our learning-based dynamic VO (DytanVO). To the best of our knowledge, our work is the first supervised learning-based VO for dynamic environments. The main contributions of this paper are threefold:

• A novel learning-based VO is introduced to leverage the interdependence among camera ego-motion, optical flow and motion segmentation.
• We introduce an iterative framework where both ego-motion estimation and motion segmentation can converge quickly within time constraints for real-time applications.
• Among learning-based VO solutions, our method achieves state-of-the-art performance in real-world dynamic scenes without finetuning. Furthermore, our method performs even comparably with visual SLAM solutions that optimize trajectories on the backend.

II. RELATED WORK

Learning-based VO solutions aim to avoid hard-coded modules that require significant engineering effort for design and finetuning in classic pipelines [1], [16]. For example, Valada [17] applies auxiliary learning to leverage relative pose information to constrain the search space and produce consistent motion estimation. Another class of learning-based methods relies on dense optical flow to estimate pose, as it provides more robust and redundant modalities for feature association in VO [5], [18], [19]. However, their frameworks are built on the assumption of photometric consistency, which only holds in a static environment without independently moving objects. They easily fail when dynamic objects unpredictably cause occlusions or illumination changes.

Semantic information is largely used by earlier works in VO or visual SLAM to handle dynamic objects in the scene, obtained by either a feature-based method or a learning-based method. Feature-based methods utilize hand-designed features to recognize semantic entities [20]. An exemplary system proposed by [21] computes SIFT descriptors from monocular image sequences in order to recognize semantic objects. On the other hand, data-driven CNN-based semantic methods have been widely used to improve the performance, such as DS-SLAM [22] and SemanticFusion [23]. A few works on semantic VO/SLAM have fused the semantic information from recognition modules to enhance motion estimation and vice versa [24], [25]. However, all these methods are prone to limited semantic categories, which leads to false negatives when scaling to unusual real-world applications such as offroad driving or MAVs, and requires continuous effort in ground-truth labeling.

Instead of utilizing appearance cues for segmentation, efforts have been made to segment based on geometry cues. FlowFusion [26] iteratively refines its ego-motion estimation by computing residual optical flow. GeoNet [10] divides its system into two sub-tasks by separately predicting static scene structure and dynamic motions. However, both depend on geometric constraints arising from epipolar geometry and rigid transformations, which are vulnerable to motion ambiguities: objects moving in a direction colinear to the camera are indistinguishable from the background given only ego-motion and optical flow. On the other hand, MaskVO [8] and SimVODIS++ [9] approach the problem by learning to mask dynamic feature points in a self-supervised manner. CC [11] couples motion segmentation, flow, depth and camera motion models, which are jointly solved in an unsupervised way. Nevertheless, these self-supervised or unsupervised methods are trained on self-driving vehicle data dominated by pure translational motions with little rotation, which makes them difficult to generalize to completely different data distributions such as handheld cameras or drones. Our work introduces a framework that jointly refines camera ego-motion and motion segmentation in an iterative way that is robust against motion ambiguities and generalizes to the open world.

III. METHODOLOGY

A. Datasets

Built on TartanVO [5], our method retains its generalization capability while handling dynamic environments in multiple types of scenes, such as car, MAV, indoor and outdoor. Besides taking camera intrinsics as an extra layer into the network to adapt to various camera settings, as explored in [5], we train our model on large amounts of synthetic data with broad diversity, which is shown capable of facilitating easy adaptation to the real world [27]–[29].

Our model is trained on both TartanAir [27] and SceneFlow [30]. The former contains more than 400,000 data frames with ground truth of optical flow and camera pose in static environments only. The latter provides 39,000 frames in highly dynamic environments, with each trajectory having backward/forward passes, different objects and motion characteristics. Although SceneFlow does not provide ground truth for motion segmentation, we are able to recover it by making use of its ground truth of disparity, optical flow and disparity change maps.

B. Architecture

Our network architecture is illustrated in Fig. 2 and is based on TartanVO. Our method takes in two consecutive undistorted images It, It+1 and outputs the relative camera motion δ_t^{t+1} = (R|T), where T ∈ R³ is the 3D translation and R ∈ SO(3) is the 3D rotation. Our framework consists of three sub-modules: a matching network, a motion segmentation network, and a pose network. We estimate dense optical flow F_t^{t+1} with a matching network, Mθ(It, It+1), from the two consecutive images. The network is built based on PWC-Net [31]. The motion segmentation network Uγ, based on a lightweight U-Net [32], takes in the relative camera motion output R|T, the optical flow from Mθ, and the original input frames. It outputs a probability map, z_t^{t+1}, of every pixel belonging to a dynamic object or not, which is thresholded and turned into a binary segmentation mask, S_t^{t+1}. The optical flow is then stacked with the mask and the intrinsics layer KC, followed by setting all optical flow inside the masked regions to zero, i.e., F̃_t^{t+1}. The last module is a pose network Pϕ, with ResNet50 [33] as the backbone, which takes in the previous stack and outputs the camera motion.

Fig. 2: Overview of our three-stage network architecture. It consists of a matching network which estimates optical flow from two consecutive images, a pose network that estimates pose based on optical flow without dynamic movements, and a motion segmentation network that outputs a probability mask of the dynamicness. The matching network is forwarded only once, while the pose network and the segmentation network are iterated to jointly refine the pose estimate and the motion segmentation. In the first iteration, we randomly initialize the segmentation mask. In each iteration, optical flow is set to zero inside masked regions.

C. Motion segmentation

Earlier dynamic VO methods that use motion segmentation rely on purely geometric constraints arising from epipolar geometry and rigid transformations [12], [26], so that they can threshold the residual optical flow that is designed to account for moving regions. However, they are prone to catastrophic failures in two cases: (1) points in 3D moving along epipolar lines cannot be identified from the background given only monocular cues; (2) pure geometry methods leave no tolerance for noisy optical flow and less accurate camera motion estimates, which in our framework is very likely to happen in the first few iterations. Therefore, following [34], to deal with the ambiguities above, we explicitly model cost maps as inputs to the segmentation network after upgrading the 2D optical flow to 3D through optical expansion [35], which estimates relative depth based on the scale change of overlapping image patches. The cost maps are tailored to the coplanar and colinear motion ambiguities that cause segmentation failures in geometry-based motion segmentation. More details can be found in [34].

D. Iteratively refining camera motion

We provide an overview of our iterative framework in Algorithm 1. During inference, the matching network is forwarded only once, while the pose network and the segmentation network are iterated to jointly refine the ego-motion estimation and the motion segmentation. In the first iteration, the segmentation mask is initialized randomly using [36]. The criterion to stop iterating is straightforward: the rotational and translational differences of R|T between two iterations must be smaller than prefixed thresholds ϵ. Instead of using a fixed constant to threshold probability maps into segmentation masks, we predetermine a decaying parameter that empirically reduces the input threshold over time, in order to discourage inaccurate masks in earlier iterations while embracing refined masks in later ones.

Algorithm 1: Inference with Iterations
  Given two consecutive frames It, It+1 and intrinsics K
  Initialize iteration number: i ← 1
  Initialize difference in output camera motions: δR|T ← ∞
  iF_t^{t+1} ← OpticalFlow(It, It+1)
  while δR|T ≥ stopping criterion ϵ do
      if i = 1 then
          iS_t^{t+1} ← getCowMask(It)
      else
          iz_t^{t+1} ← MotionSegmentation(iF_t^{t+1}, It, iR|iT)
          iS_t^{t+1} ← mask(iz_t^{t+1} ≥ z_threshold)
      iF̃_t^{t+1} ← set iF_t^{t+1} = 0 wherever iS_t^{t+1} = 1
      iR|iT ← PoseNetwork(iF̃_t^{t+1}, iS_t^{t+1}, K)
      δR|T ← iR|iT − (i−1)R|(i−1)T
      i ← i + 1

Intuitively, during early iterations, the estimated motion is less accurate, which leads to false positives in the segmentation output (assigning high probabilities to static areas). However, because the optical flow map still provides enough correspondences regardless of cutting out non-dynamic regions, Pϕ is able to robustly leverage the segmentation mask S_t^{t+1} concatenated with F̃_t^{t+1} and output a reasonable camera motion. In later iterations, Uγ is expected to output increasingly precise probability maps, such that static regions in the optical flow map are no longer "wasted", and hence Pϕ can be improved accordingly.
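The control flow of Algorithm 1 can be paraphrased as a short Python loop. The `pose_net` and `seg_net` callables, the threshold schedule, and the tensor shapes below are stand-ins; this is a sketch of the iteration logic, not the released DytanVO code:

```python
import numpy as np

def dytanvo_iterate(flow, img_t, img_t1, K, pose_net, seg_net, init_mask,
                    eps=1e-3, max_iters=3, z0=0.9, decay=0.7):
    """Jointly refine ego-motion and motion segmentation (cf. Algorithm 1)."""
    mask = init_mask                  # first iteration: random (cow) mask
    pose = None
    z_thresh = z0                     # decaying probability threshold
    for _ in range(max_iters):
        # zero out optical flow inside masked (dynamic) regions
        flow_masked = np.where(mask[..., None], 0.0, flow)
        new_pose = pose_net(flow_masked, mask, K)
        # re-segment using the refined ego-motion estimate
        prob = seg_net(flow, img_t, img_t1, new_pose)
        mask = prob >= z_thresh
        z_thresh *= decay             # embrace refined masks in later iterations
        if pose is not None and np.linalg.norm(new_pose - pose) < eps:
            pose = new_pose
            break
        pose = new_pose
    return pose, mask
```

The loop terminates as soon as two consecutive pose estimates agree within `eps`, mirroring the stopping criterion of Algorithm 1.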
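The geometry-only residual-flow segmentation that Section III-C improves upon can be sketched as follows: compute the flow a static scene would induce under the estimated camera motion and depth, then mark pixels whose observed flow deviates from it. The pinhole model, the threshold, and the function names are assumptions for illustration; the paper's cost maps additionally use optical expansion [35] to resolve colinear ambiguities:

```python
import numpy as np

def rigid_flow(depth, K, R, t):
    """Flow induced by camera motion (R, t) on a static scene with known depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T   # 3 x N
    pts = np.linalg.inv(K) @ pix * depth.reshape(-1)                     # back-project
    pts2 = R @ pts + t[:, None]                                          # move to frame t+1
    proj = K @ pts2
    proj = proj[:2] / proj[2]                                            # re-project
    return (proj - pix[:2]).T.reshape(h, w, 2)

def motion_mask(flow_obs, depth, K, R, t, thresh=1.0):
    """Pixels whose observed flow deviates from the rigid flow are dynamic."""
    resid = np.linalg.norm(flow_obs - rigid_flow(depth, K, R, t), axis=-1)
    return resid > thresh
```

As the text notes, such a purely geometric residual breaks down when the ego-motion estimate is noisy or when objects move along epipolar lines, which is why DytanVO feeds richer cost maps to a learned segmentation network instead.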
In practice, we find that 3 iterations are more than enough to get both the camera motion and the segmentation refined. To clear up any ambiguity, a 1-iteration pass is composed of one Mθ forward pass and one Pϕ forward pass with a random mask, while a 3-iteration pass consists of one Mθ forward pass, two Uγ forward passes and three Pϕ forward passes. In Fig. 3 we illustrate how segmentation masks evolve over three iterations on unseen data. The mask at the first iteration contains a significant amount of false positives but quickly converges beyond the second iteration. This verifies our assumption that the pose network is robust against false positives in segmentation results.

E. Supervision

We train our pose network to be robust against large areas of false positives. On training data without any dynamic object, we adopt the cow-mask [36] to create sufficiently random yet locally connected segmentation patterns, since a motion segmentation could occur at any size, any shape and any position in an image while exhibiting locally explainable structures corresponding to the types of moving objects. In addition, we apply curriculum learning to the pose network, where we gradually increase the maximum percentage of dynamic areas in SceneFlow from 15%, 20%, 30%, 50% to 100%. Since TartanAir only contains static scenes, we adjust the size of the cow-masks accordingly.

We supervise our network on the camera motion loss LP. Under the monocular setting, we only recover an up-to-scale camera motion. We follow [5] and normalize the translation vector before calculating the distance to the ground truth. Given ground truth motion R|T,

$$L_P = \left\|\frac{\hat{T}}{\max(\|\hat{T}\|,\,\epsilon)} - \frac{T}{\max(\|T\|,\,\epsilon)}\right\| + \left\|\hat{R} - R\right\|, \tag{1}$$

where ϵ = 1e-6 prevents numerical instability and ˆ· denotes estimated quantities.

Our framework can also be trained in an end-to-end fashion, in which case the objective becomes an aggregated loss of the optical flow loss LM, the camera motion loss LP and the motion segmentation loss LU, where LM is the L1 norm between the predicted flow and the ground truth flow, and LU is the binary cross-entropy loss between the predicted probability and the segmentation label:

$$L = \lambda_1 L_M + \lambda_2 L_U + L_P. \tag{2}$$

From a preliminary empirical comparison, end-to-end training gives similar performance to training the pose network only, because we use λ1 and λ2 to regularize the objective such that the training is biased toward mainly improving the odometry rather than optimizing the other two tasks. This is ideal, since the pose network is very tolerant of false positives in segmentation results (shown in III-D). In the following section, we show our results of supervising only on Eq. (1) while fixing the motion segmentation network.

Fig. 3: Motion segmentation output at each iteration when testing on unseen data. (a) Running inference with our segmentation network on the hardest sequence in AirDOS-Shibuya, with multiple people moving in different directions. (b) Inference on the sequence from FlyingThings3D where dynamic objects take up more than 60% of the area. The ground truth (GT) mask on Shibuya is generated by the segmentation network with GT ego-motion as input.

IV. EXPERIMENTAL RESULTS

A. Implementation details

1) Network: We initialize the matching network Mθ with the pre-trained model from TartanVO [5], and fix the motion segmentation network Uγ with the pre-trained weights from Yang et al. [34]. The pose network Pϕ uses ResNet50 [33] as the backbone, removes the batch normalization layers, and adds two output heads for rotation R and translation T. Mθ outputs optical flow at a size of H/4 × W/4. Pϕ takes a 5-channel input, i.e., F̃_t^{t+1} ∈ R^{2×H/4×W/4}, S_t^{t+1} ∈ R^{H/4×W/4} and KC ∈ R^{2×H/4×W/4}. The concatenation of F̃_t^{t+1} and KC augments the optical flow input with 2D positional information, while concatenating F̃_t^{t+1} with S_t^{t+1} encourages the network to learn dynamic representations.

2) Training: Our method is implemented in PyTorch [43] and trained on 2 NVIDIA A100 Tensor Core GPUs. We train the network in two stages on TartanAir, which includes only static scenes, and SceneFlow [30]. In the first stage, we train Pϕ independently using ground truth optical flow, camera motion, and motion segmentation masks in a curriculum-learning fashion. We generate random cow-masks [36] on TartanAir as the motion segmentation input. Each curriculum is initialized with the weights from the previous curriculum and takes 100,000 iterations with a batch size of 256. In the second stage, Pϕ and Mθ are jointly optimized for another 100,000 iterations with a batch size of 64. During curriculum learning, the learning rate starts at 2e-4, while the second stage uses a learning rate of 2e-5. Both stages apply a decay rate of 0.2 to the learning rate every 50,000 iterations. Random cropping and resizing (RCR) [5] as well as frame skipping are applied to both datasets.
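The up-to-scale camera-motion loss of Eq. (1) is straightforward to express directly. Below is a NumPy sketch standing in for the PyTorch training code; `pose_loss` is a hypothetical name, and rotation is treated as a plain vector for simplicity:

```python
import numpy as np

def pose_loss(T_hat, R_hat, T_gt, R_gt, eps=1e-6):
    """Up-to-scale camera-motion loss, after Eq. (1).

    Translations are normalized before comparison because monocular VO only
    recovers translation up to scale; eps guards against division by a
    near-zero norm.
    """
    t_hat = T_hat / max(np.linalg.norm(T_hat), eps)
    t_gt = T_gt / max(np.linalg.norm(T_gt), eps)
    return np.linalg.norm(t_hat - t_gt) + np.linalg.norm(R_hat - R_gt)
```

Because both translations are normalized, any positive rescaling of the estimated translation leaves the loss unchanged, which is exactly the monocular scale ambiguity the loss is designed to tolerate.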
                                  StandingHuman    RoadCrossing (Easy)        RoadCrossing (Hard)
                                  I        II      III      IV       V        VI       VII

SLAM    DROID-SLAM [37]           0.0051   0.0073  0.0103   0.0120   0.2778   0.0253   0.5788
method  AirDOS w/ mask [38]       0.0606   0.0193  0.0951   0.0331   0.0206   0.2230   0.5625
        ORB-SLAM w/ mask [39]     0.0788   0.0060  0.0657   0.0196   0.0148   1.0984   0.8476
        VDO-SLAM [40]             0.0994   0.6129  0.3813   0.3879   0.2175   0.2400   0.6628
        DynaSLAM [41]             -        0.8836  0.3907   0.4196   0.4925   0.6446   0.6539

VO      DeepVO [4]                0.3956   0.6351  0.7788   0.3436   0.5434   0.7223   0.9633
method  TrianFlow [42]            0.9743   1.3835  1.3348   1.6172   1.4769   1.7154   1.9075
        CC [11]                   0.4527   0.7714  0.5406   0.6345   0.5411   0.8558   1.0896
        TartanVO [5]              0.0600   0.1605  0.2762   0.1814   0.2174   0.3228   0.5009
        Ours                      0.0327   0.1017  0.0608   0.0516   0.0755   0.0365   0.0660

3) Runtime: Although our method iterates multiple times to refine both segmentation and camera motion, we find in practice that 3 iterations are more than enough due to the robustness of Pϕ as shown in Fig. 3. On an NVIDIA RTX 2080 GPU, inference takes 40 ms with 1 iteration, 100 ms with 2 iterations and 160 ms with 3 iterations.

4) Evaluation: We use the Absolute Trajectory Error (ATE) to evaluate our algorithm against other state-of-the-art methods, including both VO and visual SLAM. We evaluate our method on the AirDOS-Shibuya dataset [38] and the KITTI Odometry dataset [44]. Additionally, in the supplemental material, we test our method on data collected in a cluttered intersection to demonstrate that our method can scale to real-world dynamic scenes competitively.

B. Performance on AirDOS-Shibuya Dataset

We first provide an ablation study of the number of iterations (iter) in Tab. III using three sequences from AirDOS-Shibuya [38]. The quantitative results are consistent with Fig. 3, where the pose network quickly converges after the first iteration. We also compare the 3-iteration finetuned model after jointly optimizing Pϕ and Mθ (second stage), which shows less improvement because the optical flow estimation on AirDOS-Shibuya already has high quality.

TABLE III: Experiments on number of iterations in ATE (m)

             Standing I   RoadCrossing III   RoadCrossing VII
 1 iter      0.0649       0.1666             0.3157
 2 iter      0.0315       0.0974             0.0658
 3 iter      0.0327       0.0608             0.0660
 Finetuned   0.0384       0.0631             0.0531

We then compare our method with others on the seven sequences from AirDOS-Shibuya in Tab. I and demonstrate that our method outperforms existing state-of-the-art VO algorithms. This benchmark covers much more challenging viewpoints and diverse motion patterns for articulated objects than our training data. The seven sequences are categorized into three levels of difficulty: most humans stand still in Standing Human with few of them moving around, Road Crossing (Easy) contains multiple humans moving in and out of the camera's view, and in Road Crossing (Hard) humans enter the camera's view abruptly. Besides VO methods, we also compare ours with SLAM methods that are able to handle dynamic scenes. DROID-SLAM [37] is a learning-based SLAM trained on TartanAir. AirDOS [38], VDO-SLAM [40] and DynaSLAM [41] are three feature-based SLAM methods targeting dynamic scenes. We provide the performance of AirDOS and ORB-SLAM [39] after masking the dynamic features during their ego-motion estimation. DeepVO [4], TartanVO and TrianFlow [42] are three learning-based VO methods not targeting dynamic scenes, while CC [11] is an unsupervised VO resolving dynamic scenes through motion segmentation.

Our model achieves the best performance in all sequences among VO baselines and is competitive even among SLAM methods. DeepVO, TrianFlow and CC perform badly on the AirDOS-Shibuya dataset because they are trained on KITTI only and are not able to generalize. TartanVO performs better, but it is still susceptible to the disturbance of dynamic objects. On RoadCrossing V, as shown in Fig. 1, all VO baselines fail except ours. In hard sequences, where there are more aggressive camera movements and abundant moving objects, ours outperforms dynamic SLAM methods such as AirDOS, VDO-SLAM and DynaSLAM by more than 80%. While DROID-SLAM remains competitive most of the time, it loses track on RoadCrossing V and VII as soon as a walking person occupies a large area in the image. Note that ours only takes 0.16 seconds per inference with 3 iterations, but DROID-SLAM takes an extra 4.8 seconds to optimize the trajectory. More qualitative results are in the supplemental material.

C. Performance on KITTI

We also evaluated our method against others on sequences from the KITTI Odometry dataset [44] in Tab. II. Our method outperforms other VO baselines in 6 out of 8 dynamic sequences, with an improvement of 27.7% on average against the second best method. DeepVO, TrianFlow and CC are trained on some of the sequences in KITTI, while ours has not been finetuned on KITTI and is trained purely using synthetic

TABLE II: Results of ATE (m) on dynamic sequences from KITTI Odometry. Original sequences are trimmed into shorter ones that contain dynamic objects1. DeepVO [4], TrianFlow [42] and CC [11] are trained on KITTI, while ours has not been finetuned on KITTI and is trained purely using synthetic data. Without backend optimization unlike SLAM, we achieve the best performance on 00, 02, 04, and competitive performance on the rest among all methods including SLAM.
                                  00        01        02        03       04       07       08        10

SLAM    DROID-SLAM [37]           0.0148    49.193    0.1064    0.0119   0.0374   0.1939   0.9713    0.0368
method  ORB-SLAM w/ mask [39]     0.0187    -         0.0796    0.1519   0.0198   0.2108   1.0479    0.0246
        DynaSLAM [41]             0.0138    -         0.1046    0.1450   -        0.3187   1.0559    0.0264

VO      DeepVO [4]                (0.0206)  1.2896    (0.2975)  0.0783   0.0506   1.5540   (3.8984)  0.2545
method  TrianFlow [42]            0.6966    (8.2127)  (1.8759)  1.6862   1.2950   0.6789   (1.0411)  (0.0346)
        CC [11]                   0.0253    (0.3060)  (0.2559)  0.0505   0.0337   0.7108   0.9776    0.1024
        TartanVO [5]              0.0345    4.7080    0.1049    0.2832   0.0743   0.6367   1.0344    0.0280
        Ours                      0.0126    0.4081    0.0594    0.0406   0.0180   0.7262   0.6547    0.1042

We use (·) to denote that the sequence is in the training set of the corresponding method.

1 Sequences listed are trimmed into lengths of 28, 133, 67, 31, 40, 136, 51 and 59 respectively, which contain moving pedestrians, vehicles and cyclists.

Fig. 4: Qualitative results on dynamic sequences in KITTI Odometry 01, 03, 04 and 10. The first row is our segmentation outputs of moving objects. The second row is the visualization after aligning the scales of trajectories with ground truth all at once. Ours produces precise odometry given large areas in the image being dynamic, even among methods that are trained on KITTI. Note that the trajectories do not always reflect the ATE results due to alignment.

data. Moreover, we achieve the best ATE on 3 sequences among both VO and SLAM without any optimization. We provide qualitative results in Fig. 4 on four challenging sequences with fast-moving vehicles or dynamic objects occupying large areas in images. Note that on sequence 01, which starts with a high-speed vehicle passing by, both ORB-SLAM and DynaSLAM fail to initialize, while DROID-SLAM loses track from the beginning. Even though CC uses 01 in its training set, ours gives only 0.1 higher ATE while being 0.88 lower than the third best baseline. On sequence 10, where a huge van takes up significant areas in the center of the image, ours is the only VO that keeps track robustly.

D. Diagnostics

While we observe that our method is robust to heavily dynamic scenes with as much as 70% dynamic objects in the image, it still fails when all foreground objects are moving, leaving only a textureless background. This is most likely to happen when dynamic objects take up large areas in the image. For example, when testing on the test set of FlyingThings3D [30], where 80% of the image is dynamic, our method masks almost the entire optical flow map as zeros, leading to the divergence of motion estimation and segmentation. Future work could hence consider incorporating dynamic object-awareness into the framework and utilizing dynamic cues instead of fully discarding them. Additionally, learning-based VO tends to overfit on simple translational movements such as in KITTI, which is resolved in our method by training on datasets with broad diversity, but our method gives worse performance when there is little or zero camera motion, caused by the bias in currently available datasets. One should consider training on zero-motion inputs in addition to frame skipping.

V. CONCLUSION

In this paper, we propose a learning-based dynamic VO (DytanVO) which can jointly refine the estimation of camera pose and the segmentation of dynamic objects. We demonstrate that both ego-motion estimation and motion segmentation can converge quickly within time constraints for real-time applications. We evaluate our method on the KITTI Odometry and AirDOS-Shibuya datasets, and demonstrate state-of-the-art performance in dynamic environments without finetuning or optimization on the backend. Our work introduces new directions for dynamic visual SLAM algorithms.
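The joint refinement described in this paper can be summarized structurally. The sketch below is an illustration only: the three networks Mθ (matching), Pϕ (pose) and Uγ (motion segmentation) are replaced by placeholder stand-ins, not the authors' released code, but the loop reproduces the iteration counts stated earlier (with 3 iterations: one Mθ pass, three Pϕ passes, two Uγ passes, starting from a random mask).

```python
import numpy as np

def estimate_flow(img1, img2):
    """Stand-in for the matching network Mθ (runs once per frame pair)."""
    return np.zeros((2, 4, 4))

def estimate_pose(flow, seg):
    """Stand-in for the pose network Pϕ (trained to tolerate bad masks)."""
    return np.eye(4)

def segment_motion(flow, pose):
    """Stand-in for the motion segmentation network Uγ."""
    return np.zeros(flow.shape[1:], dtype=bool)

def dytanvo_step(img1, img2, iters=3):
    """Jointly refine camera pose and motion segmentation for one pair."""
    flow = estimate_flow(img1, img2)
    # The first iteration uses a random mask, exploiting the pose network's
    # robustness against large areas of false positives.
    seg = np.random.rand(*flow.shape[1:]) > 0.5
    pose = None
    for i in range(iters):
        pose = estimate_pose(flow, seg)
        if i < iters - 1:  # no segmentation pass after the final pose update
            seg = segment_motion(flow, pose)
    return pose, seg
```

Because Pϕ converges after the first one or two iterations (Tab. III), the loop can be truncated early at inference time.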
REFERENCES

[1] D. Scaramuzza and F. Fraundorfer, "Visual odometry [tutorial]," IEEE Robotics & Automation Magazine, vol. 18, no. 4, pp. 80–92, 2011.
[2] J. Engel, V. Koltun, and D. Cremers, "Direct sparse odometry," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 3, pp. 611–625, 2017.
[3] C. Forster, M. Pizzoli, and D. Scaramuzza, "SVO: Fast semi-direct monocular visual odometry," in 2014 IEEE International Conference on Robotics and Automation (ICRA), pp. 15–22, IEEE, 2014.
[4] S. Wang, R. Clark, H. Wen, and N. Trigoni, "DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks," in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 2043–2050, IEEE, 2017.
[5] W. Wang, Y. Hu, and S. Scherer, "TartanVO: A generalizable learning-based VO," arXiv preprint arXiv:2011.00359, 2020.
[6] H. Zhou, B. Ummenhofer, and T. Brox, "DeepTAM: Deep tracking and mapping," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 822–838, 2018.
[7] S. Li, X. Wang, Y. Cao, F. Xue, Z. Yan, and H. Zha, "Self-supervised deep visual odometry with online adaptation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6339–6348, 2020.
[8] W. Xuan, R. Ren, S. Wu, and C. Chen, "MaskVO: Self-supervised visual odometry with a learnable dynamic mask," in 2022 IEEE/SICE International Symposium on System Integration (SII), pp. 225–231, IEEE, 2022.
[9] U.-H. Kim, S.-H. Kim, and J.-H. Kim, "SimVODIS++: Neural semantic visual odometry in dynamic environments," IEEE Robotics and Automation Letters, vol. 7, no. 2, pp. 4244–4251, 2022.
[10] Z. Yin and J. Shi, "GeoNet: Unsupervised learning of dense depth, optical flow and camera pose," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1983–1992, 2018.
[11] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black, "Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12240–12249, 2019.
[12] H. Liu, G. Liu, G. Tian, S. Xin, and Z. Ji, "Visual SLAM based on dynamic object removal," in 2019 IEEE International Conference on Robotics and Biomimetics (ROBIO), pp. 596–601, IEEE, 2019.
[13] B. Xu, W. Li, D. Tzoumanikas, M. Bloesch, A. Davison, and S. Leutenegger, "MID-Fusion: Octree-based object-level multi-instance dynamic SLAM," in 2019 International Conference on Robotics and Automation (ICRA), pp. 5231–5237, IEEE, 2019.
[14] S. Li and D. Lee, "RGB-D SLAM in dynamic environments using static point weighting," IEEE Robotics and Automation Letters, vol. 2, no. 4, pp. 2263–2270, 2017.
[15] Y. Sun, M. Liu, and M. Q.-H. Meng, "Improving RGB-D SLAM in dynamic environments: A motion removal approach," Robotics and Autonomous Systems, vol. 89, pp. 110–122, 2017.
[16] F. Fraundorfer and D. Scaramuzza, "Visual odometry: Part II: Matching, robustness, optimization, and applications," IEEE Robotics & Automation Magazine, vol. 19, no. 2, pp. 78–90, 2012.
[17] A. Valada, N. Radwan, and W. Burgard, "Deep auxiliary learning for visual localization and odometry," in 2018 IEEE International Conference on Robotics and Automation (ICRA), pp. 6939–6946, IEEE, 2018.
[18] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia, "Exploring representation learning with CNNs for frame-to-frame ego-motion estimation," IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. 18–25, 2015.
[19] H. Zhan, C. S. Weerasekera, J.-W. Bian, and I. Reid, "Visual odometry revisited: What should be learnt?," in 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 4203–4210, IEEE, 2020.
[20] D.-H. Kim and J.-H. Kim, "Effective background model-based RGB-D dense visual odometry in a dynamic environment," IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1565–1573, 2016.
[21] S. Pillai and J. Leonard, "Monocular SLAM supported object recognition," arXiv preprint arXiv:1506.01732, 2015.
[22] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, "DS-SLAM: A semantic visual SLAM towards dynamic environments," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1168–1174, IEEE, 2018.
[23] J. McCormac, A. Handa, A. Davison, and S. Leutenegger, "SemanticFusion: Dense 3D semantic mapping with convolutional neural networks," in 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 4628–4635, IEEE, 2017.
[24] L. An, X. Zhang, H. Gao, and Y. Liu, "Semantic segmentation–aided visual odometry for urban autonomous driving," International Journal of Advanced Robotic Systems, vol. 14, no. 5, p. 1729881417735667, 2017.
[25] K.-N. Lianos, J. L. Schonberger, M. Pollefeys, and T. Sattler, "VSO: Visual semantic odometry," in Proceedings of the European Conference on Computer Vision (ECCV), pp. 234–250, 2018.
[26] T. Zhang, H. Zhang, Y. Li, Y. Nakamura, and L. Zhang, "FlowFusion: Dynamic dense RGB-D SLAM based on optical flow," in 2020 IEEE International Conference on Robotics and Automation (ICRA), pp. 7322–7328, IEEE, 2020.
[27] W. Wang, D. Zhu, X. Wang, Y. Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer, "TartanAir: A dataset to push the limits of visual SLAM," in 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 4909–4916, IEEE, 2020.
[28] J. Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel, "Domain randomization for transferring deep neural networks from simulation to the real world," in 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 23–30, IEEE, 2017.
[29] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield, "Training deep networks with synthetic data: Bridging the reality gap by domain randomization," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 969–977, 2018.
[30] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, "A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4040–4048, 2016.
[31] D. Sun, X. Yang, M. Liu, and J. Kautz, "PWC-Net: CNNs for optical flow using pyramid, warping, and cost volume," CoRR, vol. abs/1709.02371, 2017.
[32] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention, pp. 234–241, Springer, 2015.
[33] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," CoRR, vol. abs/1512.03385, 2015.
[34] G. Yang and D. Ramanan, "Learning to segment rigid motions from two frames," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1266–1275, 2021.
[35] G. Yang and D. Ramanan, "Upgrading optical flow to 3D scene flow through optical expansion," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1334–1343, 2020.
[36] G. French, A. Oliver, and T. Salimans, "Milking CowMask for semi-supervised image classification," arXiv preprint arXiv:2003.12022, 2020.
[37] Z. Teed and J. Deng, "DROID-SLAM: Deep visual SLAM for monocular, stereo, and RGB-D cameras," Advances in Neural Information Processing Systems, vol. 34, pp. 16558–16569, 2021.
[38] Y. Qiu, C. Wang, W. Wang, M. Henein, and S. Scherer, "AirDOS: Dynamic SLAM benefits from articulated objects," in 2022 International Conference on Robotics and Automation (ICRA), pp. 8047–8053, IEEE, 2022.
[39] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos, "ORB-SLAM: A versatile and accurate monocular SLAM system," IEEE Transactions on Robotics, vol. 31, no. 5, pp. 1147–1163, 2015.
[40] J. Zhang, M. Henein, R. Mahony, and V. Ila, "VDO-SLAM: A visual dynamic object-aware SLAM system," arXiv preprint arXiv:2005.11052, 2020.
[41] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4076–4083, 2018.
[42] W. Zhao, S. Liu, Y. Shu, and Y.-J. Liu, "Towards better generalization: Joint depth-pose learning without PoseNet," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9151–9161, 2020.
[43] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al., "PyTorch: An imperative style, high-performance deep learning library," Advances in Neural Information Processing Systems, vol. 32, 2019.
[44] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun, "Vision meets robotics: The KITTI dataset," The International Journal of Robotics Research, vol. 32, no. 11, pp. 1231–1237, 2013.
diff --git a/动态slam/2020年-2022年开源动态SLAM/2022年/Multi_modal Semantic SLAM for Complex Dynamic Environment.pdf b/动态slam/2020年-2022年开源动态SLAM/2022年/Multi_modal Semantic SLAM for Complex Dynamic Environment.pdf
new file mode 100644
index 0000000..dbee05c
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2022年/Multi_modal Semantic SLAM for Complex Dynamic Environment.pdf
@@ -0,0 +1,510 @@

Multi-modal Semantic SLAM for Complex Dynamic Environments

Han Wang*, Jing Ying Ko* and Lihua Xie, Fellow, IEEE

arXiv:2205.04300v1 [cs.RO] 9 May 2022

Abstract— Simultaneous Localization and Mapping (SLAM) is one of the most essential techniques in many real-world robotic applications. The assumption of static environments is common in most SLAM algorithms, which, however, is not the case for most applications. Recent work on semantic SLAM aims to understand the objects in an environment and distinguish dynamic information from a scene context by performing image-based segmentation. However, the segmentation results are often imperfect or incomplete, which can subsequently reduce the quality of mapping and the accuracy of localization. In this paper, we present a robust multi-modal semantic framework to solve the SLAM problem in complex and highly dynamic environments. We propose to learn a more powerful object feature representation and deploy the mechanism of looking and thinking twice to the backbone network, which leads to a better recognition result for our baseline instance segmentation model.
Moreover, both geometric-only clustering and visual semantic information are combined to reduce the effect of segmentation error due to small-scale objects, occlusion and motion blur. Thorough experiments have been conducted to evaluate the performance of the proposed method. The results show that our method can precisely identify dynamic objects under recognition imperfection and motion blur. Moreover, the proposed SLAM framework is able to efficiently build a static dense map at a processing rate of more than 10 Hz, which can be implemented in many practical applications. Both the training data and the proposed method are open-sourced1.

Fig. 1: System overview of the proposed multi-modal semantic SLAM. Compared to traditional semantic SLAM, we propose to use a multi-modal method to improve the efficiency and accuracy of the existing SLAM methods in the complex and dynamic environment. Our method significantly reduces the localization drifts caused by dynamic objects and performs dense semantic mapping in real time.

I. INTRODUCTION

Simultaneous Localization and Mapping (SLAM) is one of the most significant capabilities in many robot applications such as self-driving cars, unmanned aerial vehicles, etc. Over the past few decades, SLAM algorithms have been extensively studied in both visual SLAM, such as ORB-SLAM [1], and LiDAR-based SLAM, such as LOAM [2] and LeGO-LOAM [3]. Unfortunately, many existing SLAM algorithms assume the environment to be static, and cannot handle dynamic environments well. The localization is often achieved via visual or geometric features such as feature points, lines and planes, without including semantic information to represent the surrounding environment, which can only work well under static environments. However, the real world is generally complex and dynamic. In the presence of moving objects, pose estimation might suffer from drifting, which may cause system failure if there are wrong correspondences or insufficient matching features [4]. The presence of dynamic objects can greatly degrade the accuracy of localization and the reliability of the mapping during the SLAM process.

Advancements in deep learning have enabled the development of various instance segmentation networks based on 2D images [5]–[6]. Most existing semantic SLAMs leverage the success of deep learning-based image segmentation, e.g., Dynamic-SLAM [7] and DS-SLAM [8]. However, the segmentation results are not ideal under dynamic environments. Various factors such as small-scale objects, objects under occlusion and motion blur contribute to challenges in 2D instance segmentation. For example, an object is only partially recognized under motion blur or when it is near the border of the image. These can degrade the accuracy of localization and the reliability of the mapping. Some recent works target to perform deep learning on 3D point clouds to achieve semantic recognition [9]–[10]. However, 3D point cloud instance segmentation does not perform as well as its 2D counterpart, due to its smaller scale of training data and high computational cost. There are several reasons: 1) 3D point cloud instance segmentation such as PointGroup takes a long computation time (491 ms) [11]; 2) it is much less efficient to label a point cloud, since the geometric information is not as straightforward as the visual information; 3) it is inevitable to change the viewpoint in order to label a point cloud [12], which increases the labeling time.

*Jing Ying Ko and Han Wang contribute equally to this paper and are considered as joint first authors.
The research is supported by the National Research Foundation, Singapore under its Medium Sized Center for Advanced Robotics Technology Innovation.
Jing Ying Ko, Han Wang and Lihua Xie are with the School of Electrical and Electronic Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore 639798. e-mail: {hwang027, E170043}@e.ntu.edu.sg; elhxie@ntu.edu.sg
1 https://github.com/wh200720041/MMS_SLAM

In this paper, we propose a robust and computationally efficient multi-modal semantic SLAM framework to tackle the limitation of existing SLAM methods in dynamic environments. We modify the existing backbone network to learn a more powerful object feature representation and deploy the mechanism of looking and thinking twice to the backbone network, which leads to a better recognition result for our baseline instance segmentation model. Moreover, we combine geometric-only clustering and visual semantic information to reduce the effect of motion blur. Eventually, the multi-modal semantic recognition is integrated into the SLAM framework, which is able to provide real-time localization in different dynamic environments. The experiment results show that segmentation errors due to misclassification, small-scale objects and occlusion can be well solved with our proposed method. The main contributions of this paper are summarized as follows:

• We propose a robust and fast multi-modal semantic SLAM framework that targets to solve the SLAM problem in complex and dynamic environments. Specifically, we combine geometric-only clustering and visual semantic information to reduce the effect of segmentation error due to small-scale objects, occlusion and motion blur.
• We propose to learn a more powerful object feature representation and deploy the mechanism of looking and thinking twice to the backbone network, which leads to a better recognition result for our baseline instance segmentation model.
• A thorough evaluation of the proposed method is presented. The results show that our method is able to provide reliable localization and a semantic dense map.

The rest of the paper is organized as follows: Section II presents an overview of the related works regarding the three main SLAM methods in dynamic environments. Section III describes the details of the proposed SLAM framework. Section IV provides quantitative and qualitative experimental results in dynamic environments. Section V concludes this paper.

II. RELATED WORK

In this section, we present the existing works that address SLAM problems in dynamic environments. The existing dynamic SLAM can be categorized into three main methods: the feature consistency verification method, the deep learning-based method and the multi-modal-based method.

A. Feature Consistency Verification

Dai et al. [13] present a segmentation method using the correlation between points to distinguish moving objects from the stationary scene, which has a low computational requirement. Lee et al. [14] introduce a real-time depth edge-based RGB-D SLAM system to deal with a dynamic environment. A static weighting method is proposed to measure the likelihood of an edge point being part of the static environment and is further used for the registration of the frame-to-keyframe point cloud. These methods generally can achieve real-time implementation without increasing the computational complexity. Additionally, they need no prior knowledge about the dynamic objects. However, they are unable to continuously track potential dynamic objects; e.g., a person who stops at a location temporarily between moves is considered as a static object in their work.

B. Deep Learning-Based Dynamic SLAM

Deep learning-based dynamic SLAM usually performs better than feature consistency verification, as it provides conceptual knowledge of the surrounding environment to perform the SLAM tasks. Xun et al. [15] propose a feature-based visual SLAM algorithm based on ORB-SLAM2, where a front-end semantic segmentation network is introduced to filter out dynamic feature points and subsequently fine-tune the camera pose estimation, thus making the tracking algorithm more robust. Reference [16] combines a semantic segmentation network with a moving consistency check method to reduce the impact of dynamic objects and generate a dense semantic octree map. A visual SLAM system proposed by [17] develops a dynamic object detector with multi-view geometry and background inpainting, which aims to estimate a static map and reuse it in long-term applications. However, Mask R-CNN is considered computationally intensive; as a result, the whole framework can only be performed offline.

Deep learning-based LiDAR SLAM in dynamic environments is relatively less popular than visual SLAM. Reference [18] integrates semantic information by using a fully convolutional neural network to embed these labels into a dense surfel-based map representation. However, the adopted segmentation network is based on 3D point clouds, which is less effective as compared to 2D segmentation networks. Reference [19] develops a laser-inertial odometry and mapping method which consists of four sequential modules to perform real-time and robust pose estimation for large-scale highway environments. Reference [20] presents a dynamic-objects-free LOAM system by overlapping segmented images onto LiDAR scans. Although deep learning-based methods can effectively alleviate the impact of dynamic objects on SLAM performance, they are normally difficult to operate in real time due to the implementation of deep-learning neural networks, which possess high computational complexity.
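The multi-modal alternative surveyed next rests on one basic operation: transferring a 2D segmentation mask onto LiDAR points by projecting each point into the image. A minimal pinhole-camera sketch of that operation (illustrative naming and intrinsics; the paper's actual fusion additionally applies geometric clustering):

```python
import numpy as np

def label_points_with_mask(points_cam, K, mask):
    """Mark 3D points as dynamic by projecting them into a 2D instance mask.

    points_cam: (N, 3) LiDAR points already transformed into the camera
    frame (z forward, assumed nonzero). K: (3, 3) pinhole intrinsics.
    mask: (H, W) boolean segmentation of dynamic objects.
    Returns a boolean array: True where a point lands on a dynamic pixel.
    """
    H, W = mask.shape
    uvw = points_cam @ K.T            # homogeneous pixel coords [u*z, v*z, z]
    u = uvw[:, 0] / uvw[:, 2]
    v = uvw[:, 1] / uvw[:, 2]
    # Keep only points in front of the camera that fall inside the image.
    inside = (uvw[:, 2] > 0) & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    dynamic = np.zeros(len(points_cam), dtype=bool)
    dynamic[inside] = mask[v[inside].astype(int), u[inside].astype(int)]
    return dynamic
```

Points labeled dynamic can then be filtered out before scan registration, while the static remainder feeds the localization module.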
Fig. 2: Flow chart of the proposed method. Our system consists of four modules: (a) instance segmentation module; (b) multi-modal fusion module; (c) localization module; (d) global optimization and mapping module.

C. Multi-modal-based Dynamic SLAM

Multi-modal approaches are also explored to deal with dynamic environments. Reference [21] introduces a multi-modal sensor-based semantic mapping algorithm to improve the semantic 3D map in large-scale as well as featureless environments. Although that work is similar to our proposed method, it incurs a higher computational cost. A LiDAR-camera SLAM system [22] applies a sparse-subspace-clustering-based motion segmentation method to build a static map in dynamic environments. Reference [23] incorporates the information of a monocular camera and a laser range finder to remove the feature outliers related to dynamic objects. However, both [22] and [23] only work well in low-dynamic environments.

III. METHODOLOGY

In this section, the proposed method is discussed in detail. Fig. 2 illustrates an overview of our framework. It is mainly composed of four modules, namely the instance segmentation module, the multi-modal fusion module, the localization module, and the global optimization & mapping module. The instance segmentation module uses a real-time instance segmentation network to extract the semantic information of all potential dynamic objects present in an RGB image. The convolutional neural network is trained offline and is later deployed online to achieve real-time performance. Concurrently, the multi-modal fusion module transfers relevant semantic data to the LiDAR through sensor fusion and subsequently uses the multi-modal information to further strengthen the segmentation results. The static information is used in the localization module to find the robot pose, while both static and dynamic information are utilized in the global optimization and mapping module to build a 3D dense semantic map.

A. Instance Segmentation & Semantic Learning

A recent 2D instance segmentation framework [24] is employed in our work due to its ability to outperform other state-of-the-art instance segmentation models in both segmentation accuracy and inference speed. Given an input image I, our adopted instance segmentation network predicts a set {Ci, Mi}, i = 1, ..., n, where Ci is a class label, Mi is a binary mask, and n is the number of instances in the image. The image is spatially separated into N × N grid cells. If the center of an object falls into a grid cell, that grid cell is responsible for predicting the semantic category Cij and the semantic mask Mij of the object in the category branch Bc and the mask branch Pm respectively:

    Bc(I, θc) : I → {Cij ∈ R^λ | i, j = 0, 1, ..., N},   (1a)
    Pm(I, θm) : I → {Mij ∈ R^φ | i, j = 0, 1, ..., N},   (1b)

where θc and θm are the parameters of the category branch Bc and the mask branch Pm respectively, λ is the number of classes, and φ is the total number of grid cells. The category branch and the mask branch are implemented with a Fully Connected Network (FCN). Cij has a total of λ elements; each element of Cij indicates the class probability for the object instance at grid cell (i, j). In parallel with the category branch, Mij has a total of N² elements [24]. Each positive grid cell (i, j) generates its instance mask in the k-th element, where k = i · N + j. Since our proposed SLAM system is intentionally designed for real-world robotics applications, the computational cost of performing instance segmentation is our primary concern. Therefore, we use a light-weight version of SOLOv2 with lower accuracy to achieve real-time instance segmentation. To improve the segmentation accuracy, several methods have been implemented to build a more effective and robust feature representation discriminator in the backbone network. Firstly, we modify our backbone architecture from the original Feature Pyramid Network (FPN) to the Recursive Feature Pyramid (RFP) [25]. Theoretically, RFP instills the idea of looking twice or more by integrating additional feedback from the FPN into the bottom-up backbone layers. This recursively strengthens the existing FPN and provides increasingly stronger feature representations.

Fig. 3: Comparison of the original SOLOv2 with the proposed method. Our segmentation results achieve higher accuracy: in (1b), our method preserves a more detailed mask for the rider on a motorcycle compared to the SOLOv2 result in (1a); in (2b), we handle an occluded object that is not detected in (2a); in (3b), our method accurately predicts the mask for a handbag compared to (3a).
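The grid-cell bookkeeping behind Eqs. (1a)–(1b) — an object's center selects a cell (i, j), and that positive cell writes its instance mask into channel k = i · N + j — can be sketched as below. The helper names are ours, not from the paper:

```python
def grid_cell_of_center(cx, cy, img_w, img_h, N):
    """Map an object's center pixel (cx, cy) to its N x N grid cell (i, j)."""
    i = int(cy / img_h * N)   # row index of the grid
    j = int(cx / img_w * N)   # column index of the grid
    return i, j

def mask_channel(i, j, N):
    """A positive grid cell (i, j) writes its instance mask into channel k = i*N + j."""
    return i * N + j

# Toy example: a 640x480 image divided into a 12x12 grid.
N = 12
i, j = grid_cell_of_center(cx=320, cy=240, img_w=640, img_h=480, N=N)
k = mask_channel(i, j, N)   # cell (6, 6), mask channel 78
```

The category branch would then hold a λ-way class probability vector for cell (i, j), and the mask branch a binary mask in channel k.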
By offsetting richer information with a small receptive field in the lower-level feature maps, we are able to improve the segmentation performance on small objects. Meanwhile, the ability of RFP to adaptively strengthen and suppress neuron activations enables the instance segmentation network to handle occluded objects more efficiently. On the other hand, we replace the convolutional layers in the backbone architecture with Switchable Atrous Convolution (SAC). SAC operates as a soft switch function that collects the outputs of convolutional computations with different atrous rates. Therefore, we are able to learn the optimal coefficients from SAC and adaptively select the size of the receptive field. This allows SOLOv2 to efficiently extract important spatial information.

The outputs are pixel-wise instance masks for each dynamic object, as well as their corresponding bounding boxes and class types. To better integrate the dynamic information into the SLAM algorithm, the output binary masks are combined into a single image containing all pixel-wise instance masks in the scene. A pixel covered by a mask is considered to be in the "dynamic state" and otherwise in the "static state". The binary mask is then passed to the semantic fusion module to generate a 3D dynamic mask.

B. Multi-Modal Fusion

1) Motion Blur Compensation: Instance segmentation has achieved good performance on public datasets such as the COCO dataset and the Objects365 dataset [24]–[26]. However, in practice the target may be partially recognized or incomplete due to motion blur on moving objects, resulting in ambiguous boundaries of a moving object. Moreover, the motion blur effect is further enlarged when projecting the 2D pixel-wise semantic mask of a dynamic object to a 3D semantic label, leading to point misalignment and inconsistent feature point extraction. In our experiments, we find that the ambiguous boundaries of dynamic targets degrade the localization accuracy and produce noise when performing a mapping task. Therefore, we first apply morphological dilation, convolving the 2D pixel-wise mask image with a structuring element to gradually expand the boundaries of the dynamic object regions. The morphological dilation result marks the ambiguous boundaries around the dynamic objects. We take both the dynamic objects and their boundaries as the dynamic information, which is further refined in the multi-modal fusion step.

2) Geometric Clustering & Semantic Fusion: Compensation via connectivity analysis in Euclidean space [27] is also implemented in our work. The instance segmentation network has excellent recognition capability in most practical situations; however, motion blur limits the segmentation performance due to ambiguous pixels between regions, leading to undesirable segmentation errors. Therefore, we combine the point cloud clustering results and the segmentation results to better refine the dynamic objects. In particular, we perform the connectivity analysis on the geometric information and merge it with the vision-based segmentation results.

A raw LiDAR scan often contains tens of thousands of points. To increase efficiency, the 3D point cloud is first downsampled to reduce the scale of the data and used as the input for point cloud clustering. The instance segmentation results are then projected into the point cloud coordinate frame to label each point. A point cloud cluster is considered dynamic when most of its points (90%) carry the dynamic label. A static point is re-labeled with the dynamic tag when it is close to a dynamic point cluster, and a dynamic point is re-labeled as static when there is no dynamic point cluster nearby.
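The two operations above — dilating the 2D mask to absorb motion-blurred boundaries, and tagging a point cluster as dynamic when at least 90% of its points carry the dynamic label — can be sketched as follows. This is a NumPy-only illustration under our own naming, not the authors' implementation:

```python
import numpy as np

def dilate(mask, r=1):
    """Morphological dilation of a boolean mask with a (2r+1)x(2r+1) square
    structuring element, implemented by shifting and OR-ing the mask."""
    out = np.zeros_like(mask)
    h, w = mask.shape
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            shifted = np.zeros_like(mask)
            ys = slice(max(-dy, 0), h + min(-dy, 0))  # source rows
            xs = slice(max(-dx, 0), w + min(-dx, 0))  # source cols
            yd = slice(max(dy, 0), h + min(dy, 0))    # destination rows
            xd = slice(max(dx, 0), w + min(dx, 0))    # destination cols
            shifted[yd, xd] = mask[ys, xs]
            out |= shifted
    return out

def cluster_is_dynamic(labels, ratio=0.9):
    """A point-cloud cluster is tagged dynamic when at least `ratio`
    of its points carry the dynamic label (True)."""
    return labels.mean() >= ratio
```

For example, dilating a single dynamic pixel with r=1 grows it into a 3x3 dynamic region, marking the ambiguous boundary around the original mask.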
C. Localization & Pose Estimation

1) Feature Extraction: After applying multi-modal dynamic segmentation, the point cloud is divided into a dynamic point cloud PD and a static point cloud PS. The static point cloud is subsequently used for the localization and mapping module based on our previous work [28]. Compared to existing SLAM approaches such as LOAM [2], the framework proposed in [28] supports real-time performance at 30 Hz, which is a few times faster. It is also resistant to illumination variation compared to visual SLAM systems such as ORB-SLAM [1] and VINS-Mono [29]. For each static point pk ∈ PS, we can search for its set of nearby static points Sk by radius search in Euclidean space. Let |S| be the cardinality of a set S; the local smoothness is then defined by:

    σk = (1/|Sk|) · Σ_{pi ∈ Sk} (||pk|| − ||pi||).   (2)

The edge features are defined by the points with large σk and the planar features are defined by the points with small σk.

2) Data Association: The final robot pose is calculated by minimizing the point-to-edge and point-to-plane distances. An edge feature point pE ∈ PE can be transformed into the local map coordinate frame by p̂E = T · pE, where T ∈ SE(3) is the current pose. We can search for the 2 nearest edge features pE1 and pE2 in the local edge feature map, and the point-to-edge residual is defined by [28]:

    fE(p̂E) = ||(p̂E − pE1) × (p̂E − pE2)|| / ||pE1 − pE2||,   (3)

where × is the cross product. Similarly, given a planar feature point pL ∈ PL and its transformed point p̂L = T · pL, we can search for the 3 nearest points pL1, pL2, and pL3 in the local planar map. The point-to-plane residual is defined by:

    fL(p̂L) = (p̂L − pL1)ᵀ · ((pL1 − pL2) × (pL1 − pL3)) / ||(pL1 − pL2) × (pL1 − pL3)||.   (4)

3) Pose Estimation: The final robot pose is calculated by minimizing the sum of the point-to-edge and point-to-plane residuals:

    T* = arg min_T [ Σ_{pE ∈ PE} fE(p̂E) + Σ_{pL ∈ PL} fL(p̂L) ].   (5)

Fig. 4: Different types of AGVs used in our warehouse environment: (a) the grabbing AGV with a robot arm; (b) forklift AGV; (c) scanning AGV; (d) the Pioneer robot; (e) the transportation AGV with conveyor belt; (f) warehouse environment.

D. Global Map Building

The semantic map is separated into a static map and a dynamic map. Note that the visual information given previously is also used to construct the colored dense static map. Specifically, the visual information can be obtained by re-projecting 3D points into the image plane. After each update, the map is down-sampled using a 3D voxelized grid approach [30] in order to prevent memory overflow. The dynamic map is built from PD and is used to reveal the dynamic objects. The dynamic information can be used for high-level tasks such as motion planning.

IV. EXPERIMENT EVALUATION

In this section, experimental results will be presented to demonstrate the effectiveness of our proposed method. First, our experimental setup will be discussed in detail. Second, we elaborate how we acquire the data of potential moving objects in a warehouse environment. Third, we evaluate the segmentation performance of our adopted instance segmentation model. Subsequently, we explain how we perform dense mapping and dynamic tracking. Lastly, we evaluate the performance of our proposed method regarding localization drift in dynamic environments.

A. Experimental Setup

For our experimental setup, the Robot Operating System (ROS) is utilized as the interface for the integration of the semantic learning module and the SLAM algorithm, as shown in Fig. 1. An Intel RealSense LiDAR camera L515 is used to capture RGB and point cloud data at a fixed frame rate. All the experiments are performed on a computer with an Intel i7 CPU and an Nvidia GeForce RTX 2080 Ti GPU.
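The geometric quantities in Eqs. (2)–(4) — local smoothness for edge/planar classification, the point-to-line distance, and the point-to-plane distance — can be sketched numerically as follows; the function names are ours:

```python
import numpy as np

def local_smoothness(pk, neighbors):
    """Eq. (2): sigma_k = (1/|S_k|) * sum_i (||p_k|| - ||p_i||).
    Large sigma -> edge feature; small sigma -> planar feature."""
    return float(np.mean([np.linalg.norm(pk) - np.linalg.norm(pi)
                          for pi in neighbors]))

def point_to_edge(p_hat, e1, e2):
    """Eq. (3): distance from the transformed point p_hat to the line
    through the two nearest edge features e1, e2."""
    return np.linalg.norm(np.cross(p_hat - e1, p_hat - e2)) / np.linalg.norm(e1 - e2)

def point_to_plane(p_hat, q1, q2, q3):
    """Eq. (4): signed distance from p_hat to the plane spanned by the
    three nearest planar features q1, q2, q3 (normal via cross product)."""
    n = np.cross(q1 - q2, q1 - q3)
    return float(np.dot(p_hat - q1, n) / np.linalg.norm(n))
```

Summing these residuals over all static edge and planar features gives the cost of Eq. (5), which is then minimized over T by Gauss-Newton iterations.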
This non-linear optimization problem can be solved by the Gauss-Newton method, and we can derive an optimal robot pose based on the static information.

4) Feature Map Update & Key Frame Selection: Once the optimal pose is derived, the features are updated to the local edge map and local plane map respectively, which will be used for data association on the next frame. Note that building and updating a global dense map is often very computationally costly. Hence, the global static map is updated based on keyframes. A keyframe is selected when the translational change of the robot pose is greater than a predefined translation threshold, or the rotational change of the robot pose is greater than a predefined rotation threshold.

Fig. 5: Static map creation and final semantic mapping result: (a) static map built by the proposed SLAM framework; (b) final semantic mapping result. The instance segmentation is shown on the left. Human operators are labeled by red bounding boxes and AGVs are labeled by green bounding boxes.

B. Data Acquisition

Humans are often considered dynamic objects in many scenarios such as autonomous driving and smart warehouse logistics. Therefore we choose 5,000 human images from the COCO dataset. In the experiment, the proposed method is evaluated in the warehouse environment shown in Fig. 4. Besides humans, an advanced factory requires human-to-robot and robot-to-robot collaboration, so Automated Guided Vehicles (AGVs) are also potential dynamic objects. Hence a total of 3,000 AGV images are collected to train the instance segmentation network; some of the AGVs are shown in Fig. 4.

In order to address the small-dataset problem, we implement the copy-paste augmentation method proposed by [31] to enhance the generalization ability of the network and directly improve its robustness. To be specific, this method generates new images by applying random scale jittering to two random training images and randomly choosing a subset of object instances from one image to paste onto the other.

C. Evaluation on Instance Segmentation Performance

In this part, we evaluate the segmentation performance on the COCO dataset with regard to the segmentation loss and mean Average Precision (mAP). The purpose of this evaluation is to compare our adopted instance segmentation network, SOLOv2, with the proposed method. The results are illustrated in Table I. Our adopted instance segmentation network, SOLOv2, is built on MMDetection 2.0 [32], an open-source object detection toolbox based on PyTorch. We trained SOLOv2 on the COCO dataset, which consists of 81 classes. We choose ResNet-50 as our backbone architecture since this configuration satisfies our requirements for real-world robotics applications. Instead of training the network from scratch, we make use of ResNet-50 parameters pre-trained on ImageNet. For a fair comparison, all the models are trained under the same configuration: synchronized stochastic gradient descent with a total of 8 images per mini-batch for 36 epochs.

For SOLOv2 with the Recursive Feature Pyramid (RFP), we modify our backbone architecture from the Feature Pyramid Network (FPN) to the RFP network. In this experiment, we set the number of stages to 2, allowing SOLOv2 to look at the image twice. As illustrated in Table I, the RFP network brings a significant improvement in segmentation performance. On the other hand, replacing all 3x3 convolutional layers in the backbone network with Switchable Atrous Convolution (SAC) increases the segmentation accuracy by 2.3%. By implementing both SAC and the RFP network in SOLOv2, the segmentation performance is further improved by 5.9% with only a 17 ms increase in inference time. Overall, SOLOv2 learns to look at the image twice with adaptive receptive fields, and is therefore able to highlight important semantic information for the instance segmentation network. The segmentation result is further visualized in Fig. 3.

TABLE I: Performance comparison of instance segmentation.

    Model                       Segmentation Loss   Mean AP (%)   Inference Time (ms)
    SOLOv2                      0.52                38.8          54.0
    SOLOv2 + RFP                0.36                41.2          64.0
    SOLOv2 + SAC                0.39                39.8          59.0
    SOLOv2 + DetectoRS (Ours)   0.29                43.4          71.0

TABLE II: Ablation study of localization drifts under dynamic environments.

    Methods                             ATDE (cm)   MTDE (cm)
    W/O Semantic Recognition            4.834       1.877
    Vision-based Semantic Recognition   1.273       0.667
    Multi-Modal Recognition (Ours)      0.875       0.502

Fig. 6: Localization comparison in a dynamic environment. The ground truth, the original localization result without filtering, and the localization result with our proposed multi-modal semantic filtering are plotted in red, green, and orange respectively.

D. Dense Mapping and Dynamic Tracking

To evaluate the performance of our multi-modal semantic SLAM in dynamic environments, the proposed method is implemented on the warehouse AGVs shown in Fig. 4. In a smart manufacturing factory, both human operators and different types of AGVs (e.g., forklift AGVs, transportation AGVs, and robot-arm-equipped AGVs) are supposed to work in a collaborative manner. Therefore, the capability of each AGV to localize itself among moving human operators and other AGVs is an essential technology towards Industry 4.0. In many warehouse environments, the remaining objects, such as operating machines or tables, can be taken as a static environment. Hence we only consider humans and AGVs as dynamic objects in order to reduce the computational cost. In the experiment, an AGV is manually controlled to move around and build the warehouse environment map simultaneously, while human operators walk frequently in the warehouse. The localization result is shown in Fig. 6, where we compare the ground truth, the proposed SLAM method, and the original SLAM without our filtering approach. It can be seen that when the dynamic object appears (in blue), the proposed multi-modal semantic SLAM is more robust and stable than traditional SLAM. The mapping results are shown in Fig. 5. The proposed method is able to efficiently identify the potential dynamic objects and separate them from the static map. Although the human operators walk frequently in front of the robot, they are totally removed from the static map. All potential dynamic objects are enclosed by bounding boxes and added into a final semantic map to visualize the status of each object in real time, where the moving humans are colored in red and the AGVs in green. Our method is able to identify and locate multiple targets in a complex dynamic environment.

E. Ablation Study of Localization Drifts

Fig. 7: Ablation study of localization drifts. (a) original image view; (b) the visual semantic recognition result based on the proposed method; (c) localization drifts observed due to the moving objects. The localization drifts are highlighted in red circles.

To further evaluate the performance of localization under dynamic profiles, we compare the localization drifts of different dynamic filtering approaches. Firstly, we keep the robot still and let a human operator walk frequently in front of the robot. The localization drifts are recorded in order to evaluate the performance under dynamic objects. Specifically, we calculate the Average Translational Drift Error (ATDE) and Maximum Translational Drift Error (MTDE) to verify the localization, where ATDE is the average translational error of each frame and MTDE is the maximum translational drift caused by the walking human. The results are shown in Table II. We first remove the semantic recognition module from the SLAM system and evaluate the performance. Then we use visual semantic recognition (SOLOv2) to remove the dynamic information. The results are compared with the proposed semantic multi-modal SLAM. It can be seen that, compared to the original SLAM, the proposed method significantly reduces the localization drift. Compared to the vision-only filtering method, the proposed multi-modal semantic SLAM is more stable and accurate in the presence of dynamic objects.

V. CONCLUSION

In this paper, we have presented a semantic multi-modal framework to tackle the SLAM problem in dynamic environments, which is able to effectively reduce the impact of dynamic objects in complex dynamic environments. Our approach aims to provide a modular pipeline to enable real-world applications in dynamic environments. Meanwhile, a 3D dense stationary map is constructed with the removal of dynamic information. To verify the effectiveness of the proposed method in a complex dynamic environment, our method is evaluated on warehouse AGVs used for smart manufacturing. The results show that our proposed method can significantly improve the existing semantic SLAM algorithm in terms of robustness and accuracy.

REFERENCES
[1] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Transactions on Robotics, vol. 33, no. 5, pp. 1255–1262, 2017.
[2] J. Zhang and S. Singh, "LOAM: Lidar odometry and mapping in real-time," in Robotics: Science and Systems, vol. 2, no. 9, 2014.
[3] T. Shan and B. Englot, "LeGO-LOAM: Lightweight and ground-optimized lidar odometry and mapping on variable terrain," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 4758–4765.
[4] W. Tan, H. Liu, Z. Dong, G. Zhang, and H. Bao, "Robust monocular SLAM in dynamic environments," IEEE International Symposium on Mixed and Augmented Reality, vol. 1, pp. 209–218, 2013.
[5] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," IEEE International Conference on Computer Vision, 2017.
[6] D. Bolya, Z. Chong, F. Xiao, and Y. J. Lee, "YOLACT: Real-time instance segmentation," IEEE International Conference on Computer Vision, 2019.
[7] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou, "Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment," Robotics and Autonomous Systems, vol. 117, pp. 1–16, 2019.
[8] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, "DS-SLAM: A semantic visual SLAM towards dynamic environments," in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2018, pp. 1168–1174.
[9] L. Han, T. Zheng, L. Xu, and L. Fang, "OccuSeg: Occupancy-aware 3D instance segmentation," IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[10] J. Li, H. Zhao, S. Shi, S. Liu, C.-W. Fu, and J. Jia, "PointGroup: Dual-set point grouping for 3D instance segmentation," IEEE Conference on Computer Vision and Pattern Recognition, 2020.
[11] L. Jiang, H. Zhao, S. Shi, S. Liu, C.-W. Fu, and J. Jia, "PointGroup: Dual-set point grouping for 3D instance segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 4867–4876.
[12] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, "SemanticKITTI: A dataset for semantic scene understanding of LiDAR sequences," in Proc. of the IEEE/CVF International Conf. on Computer Vision (ICCV), 2019.
[13] W. Dai, Y. Zhang, P. Li, Z. Fang, and S. Scherer, "RGB-D SLAM in dynamic environments using point correlations," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 1, no. 1, 2020.
[14] S. Li and D. Lee, "RGB-D SLAM in dynamic environments using static point weighting," IEEE Robotics and Automation Letters, vol. 2, no. 4, pp. 2262–2270, 2017.
[15] Y. Xun and C. Song, "SaD-SLAM: A visual SLAM based on semantic and depth information," IEEE International Conference on Intelligent Robots and Systems, 2021.
[16] C. Yu, Z. Liu, X.-J. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, "DS-SLAM: A semantic visual SLAM towards dynamic environments," IEEE International Conference on Intelligent Robots and Systems, pp. 1168–1174, 2018.
[17] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robotics and Automation Letters, vol. 3, no. 4, pp. 4076–4083, 2018.
[18] X. Chen, A. Milioto, E. Palazzolo, P. Giguère, and C. Stachniss, "SuMa++: Efficient LiDAR-based semantic SLAM," IEEE International Conference on Intelligent Robots and Systems, 2019.
[19] S. Zhao, Z. Fang, H. Li, and S. Scherer, "A robust laser-inertial odometry and mapping method for large-scale highway environments," IEEE International Conference on Intelligent Robots and Systems, 2019.
[20] R. Jian, W. Su, R. Li, S. Zhang, J. Wei, B. Li, and R. Huang, "A semantic segmentation based lidar SLAM system towards dynamic environments," IEEE International Conference on Intelligent Robotics and Applications, pp. 582–590, 2019.
[21] J. Jeong, T. S. Yoon, and J. B. Park, "Towards a meaningful 3D map using a 3D lidar and a camera," Sensors, vol. 18, no. 8, 2018.
[22] C. Jiang, D. P. Paudel, Y. Fougerolle, D. Fofi, and C. Demonceaux, "Static-map and dynamic object reconstruction in outdoor scenes using 3-D motion segmentation," IEEE Robotics and Automation Letters, vol. 1, no. 1, pp. 324–331, 2016.
[23] X. Zhang, A. B. Rad, and Y.-K. Wong, "Sensor fusion of monocular cameras and laser rangefinders for line-based simultaneous localization and mapping (SLAM) tasks in autonomous mobile robots," Sensors, vol. 12, pp. 429–452, 2012.
[24] X. Wang, R. Zhang, T. Kong, L. Li, and C. Shen, "SOLOv2: Dynamic and fast instance segmentation," IEEE Computer Vision and Pattern Recognition, 2020.
[25] S. Qiao, L.-C. Chen, and A. Yuille, "DetectoRS: Detecting objects with recursive feature pyramid and switchable atrous convolution," IEEE Computer Vision and Pattern Recognition, 2020.
[26] G. Ghiasi, Y. Cui, A. Srinivas, R. Qian, T.-Y. Lin, E. D. Cubuk, Q. V. Le, and B. Zoph, "Simple copy-paste is a strong data augmentation method for instance segmentation," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 2918–2928.
[27] R. B. Rusu, "Semantic 3D object maps for everyday manipulation in human living environments," KI-Künstliche Intelligenz, vol. 24, no. 4, pp. 345–348, 2010.
[28] H. Wang, C. Wang, and L. Xie, "Lightweight 3-D localization and mapping for solid-state lidar," IEEE Robotics and Automation Letters, 2020.
[29] T. Qin, P. Li, and S. Shen, "VINS-Mono: A robust and versatile monocular visual-inertial state estimator," IEEE Transactions on Robotics, vol. 34, no. 4, pp. 1004–1020, 2018.
[30] R. B. Rusu and S. Cousins, "3D is here: Point Cloud Library (PCL)," in 2011 IEEE International Conference on Robotics and Automation, 2011, pp. 1–4.
[31] G. Ghiasi, Y. Cui, A. Srinivas, R. Qian, T.-Y. Lin, E. D. Cubuk, Q. V. Le, and B. Zoph, "Simple copy-paste is a strong data augmentation method for instance segmentation," IEEE Computer Vision and Pattern Recognition, 2020.
[32] K. Chen, J. Wang, J. Pang, Y. Cao, Y. Xiong, X. Li, S. Sun, W. Feng, Z. Liu, J. Xu, Z. Zhang, D. Cheng, C. Zhu, T. Cheng, Q. Zhao, B. Li, X. Lu, R. Zhu, Y. Wu, J. Dai, J. Wang, J. Shi, W. Ouyang, C. C. Loy, and D. Lin, "MMDetection: Open MMLab detection toolbox and benchmark," IEEE Computer Vision and Pattern Recognition, 2019.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2022年/RGB_D_Inertial_Odometry_for_a_Resource Restricted_Robot_in_Dynamic_Environments.pdf b/动态slam/2020年-2022年开源动态SLAM/2022年/RGB_D_Inertial_Odometry_for_a_Resource Restricted_Robot_in_Dynamic_Environments.pdf
new file mode 100644
index 0000000..47e5ced
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2022年/RGB_D_Inertial_Odometry_for_a_Resource Restricted_Robot_in_Dynamic_Environments.pdf
@@ -0,0 +1,478 @@

IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 7, NO. 4, OCTOBER 2022 9573

RGB-D Inertial Odometry for a Resource-Restricted Robot in Dynamic Environments

Jianheng Liu, Xuanfu Li, Yueqian Liu, and Haoyao Chen, Member, IEEE

Abstract—Current simultaneous localization and mapping (SLAM) algorithms perform well in static environments but easily fail in dynamic environments. Recent works introduce deep learning-based semantic information to SLAM systems to reduce the influence of dynamic objects. However, it is still challenging to apply a robust localization in dynamic environments for resource-restricted robots. This paper proposes a real-time RGB-D inertial odometry system for resource-restricted robots in dynamic environments named Dynamic-VINS. Three main threads run in parallel: object detection, feature tracking, and state optimization. The proposed Dynamic-VINS combines object detection and depth information for dynamic feature recognition and achieves performance comparable to semantic segmentation. Dynamic-VINS adopts grid-based feature detection and proposes a fast and efficient method to extract high-quality FAST feature points. IMU is applied to predict motion for feature tracking and moving consistency check. The proposed method is evaluated on both public datasets and real-world applications and shows competitive localization accuracy and robustness in dynamic environments. Yet, to the best of our knowledge, it is the best-performance real-time RGB-D inertial odometry for resource-restricted platforms in dynamic environments for now. The proposed system is open source at: https://github.com/HITSZ-NRSL/Dynamic-VINS.git

Index Terms—Localization, visual-inertial SLAM.

Manuscript received 25 February 2022; accepted 20 June 2022. Date of publication 15 July 2022; date of current version 26 July 2022. This letter was recommended for publication by Associate Editor L. Paull and Editor J. Civera upon evaluation of the reviewers' comments. This work was supported in part by the National Natural Science Foundation of China under Grants U21A20119 and U1713206 and in part by the Shenzhen Science and Innovation Committee under Grants JCYJ20200109113412326, JCYJ20210324120400003, JCYJ20180507183837726, and JCYJ20180507183456108. (Corresponding Author: Haoyao Chen.)

Jianheng Liu, Yueqian Liu, and Haoyao Chen are with the School of Mechanical Engineering and Automation, Harbin Institute of Technology Shenzhen, Shenzhen, Guangdong 518055, China (e-mail: liujianhengchris@qq.com; yueqianliu@outlook.com; hychen5@hit.edu.cn).

Xuanfu Li is with the Department of HiSilicon Research, Huawei Technology Co., Ltd, Shenzhen, Guangdong 518129, China (e-mail: lixuanfu@huawei.com).

This letter has supplementary downloadable material available at https://doi.org/10.1109/LRA.2022.3191193, provided by the authors.

Digital Object Identifier 10.1109/LRA.2022.3191193

2377-3766 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14, 2023 at 12:34:48 UTC from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

SIMULTANEOUS localization and mapping (SLAM) is a foundational capability for many emerging applications, such as autonomous mobile robots and augmented reality. Cameras as portable sensors are commonly equipped on mobile robots and devices. Therefore, visual SLAM (vSLAM) has received tremendous attention over the past decades. Lots of works [1]–[4] are proposed to improve visual SLAM systems' performance. Most of the existing vSLAM systems depend on a static world assumption. Stable features in the environment are used to form a solid constraint for Bundle Adjustment [5]. However, in real-world scenarios like shopping malls and subways, dynamic objects such as moving people, vehicles, and unknown objects have an adverse impact on pose optimization. Although some approaches like RANSAC [6] can suppress the influence of dynamic features to a certain extent, they become overwhelmed when a vast number of dynamic objects appear in the scene.

Therefore, it is necessary for the system to consciously reduce dynamic objects' influence on the estimation results. Pure geometric methods [7]–[9] are widely used to handle dynamic objects, but they are unable to cope with latent or slightly moving objects. With the development of deep learning, many researchers have tried combining multi-view geometric methods with semantic information [10]–[13] to implement a robust SLAM system in dynamic environments. To avoid the accidental deletion of stable features through object detection [14], recent dynamic SLAM systems [15], [16] exploit the advantages of pixel-wise semantic segmentation for a better recognition of dynamic features. Due to the expensive computing resource consumption of semantic segmentation, it is difficult for a semantic-segmentation-based SLAM system to run in real-time. Therefore, some researchers have tried to perform semantic segmentation only on keyframes and track moving objects via moving probability propagation [17], [18] or a direct method [19] on each frame. In the case of missed detections or object tracking failures, the pose optimization is imprecise. Moreover, since semantic segmentation is performed after keyframe selection, real-time precise pose estimation is inaccessible, and unstable dynamic features in the original frame may also cause redundant keyframe creation and unnecessary computational burdens.

The above systems still require too many computing resources to perform robust real-time localization in dynamic environments for Size, Weight, and Power (SWaP) restricted mobile robots or devices. Some researchers [20]–[22] try to run visual odometry in real-time on embedded computing devices, yet keyframe-based visual odometry is not performed [23], which makes their accuracy unsatisfactory. At the same time, increasingly many embedded computing platforms are equipped with NPU/GPU computing units, such as the HUAWEI Atlas 200 and NVIDIA Jetson. This enables lightweight deep learning networks to run on embedded computing platforms in real-time. Some studies [14], [24] implemented a keyframe-based dynamic SLAM system running on embedded computing
The contributing modules are highlighted and surrounded by dash lines with different colors. Three main threads run in parallel in Dynamic-VINS. Features are tracked and detected in the feature tracking thread. The object detection thread detects dynamic objects in each frame in real-time. The state optimization thread summarizes the feature information, object detection results, and depth image to recognize the dynamic features. Finally, stable features and IMU preintegration results are used for pose estimation.

platforms. However, these works still find it difficult to balance efficiency and accuracy for mobile robot applications.

To address all these issues, this paper proposes a real-time RGB-D inertial odometry for resource-restricted robots in dynamic environments, named Dynamic-VINS. It enables edge computing devices to provide instant robust state feedback for mobile platforms with little computation burden. An efficient dynamic feature recognition module that does not require a high-precision depth camera can be used in mobile devices equipped with depth-measurement modules. The main contributions of this paper are as follows:
1) An efficient optimization-based RGB-D inertial odometry is proposed to provide real-time state estimation results for resource-restricted robots in dynamic and complex environments.
2) Lightweight feature detection and tracking are proposed to cut the computing burden. In addition, dynamic feature recognition modules combining object detection and depth information are proposed to provide robust dynamic feature recognition in complex and outdoor environments.
3) Validation experiments are performed to show the proposed system's competitive accuracy, robustness, and efficiency on resource-restricted platforms in dynamic environments.

II. SYSTEM OVERVIEW

The proposed SLAM system in this paper is extended from VINS-Mono [2] and VINS-RGBD [25]; our framework is shown in Fig. 1, and the contributing modules are highlighted with different colors. For efficiency, three main threads (surrounded by dash lines) run in parallel in Dynamic-VINS: object detection, feature tracking, and state optimization. Color images are passed to both the object detection thread and the feature tracking thread. IMU measurements between two consecutive frames are preintegrated [26] for feature tracking, moving consistency check, and state optimization.

In the feature tracking thread, features are tracked with the help of IMU preintegration and detected by grid-based feature detection. The object detection thread detects dynamic objects in each frame in real-time. Then, the state optimization thread summarizes the feature information, object detection results, and depth image to recognize the dynamic features. A missed detection compensation module is conducted in case of missed detection. The moving consistency check procedure combines the IMU preintegration and historical pose estimation results to identify potential dynamic features. Finally, stable features and IMU preintegration results are used for the pose estimation, and the propagation of the IMU is responsible for an IMU-rate pose estimation result. Loop closure is also supported in this system, but this paper pays more attention to localization independent of loop closure.

III. METHODOLOGY

This study proposes lightweight, high-quality feature tracking and detection methods to accelerate the system. Semantic and geometric information from the input RGB-D images and IMU preintegration are applied for dynamic feature recognition and moving consistency check. The missed detection compensation module plays a subsidiary role to object detection in case of missed detection. Dynamic features on unknown objects are further identified by the moving consistency check. The proposed methods are divided into five parts for a detailed description.

A. Feature Matching

For each incoming image, the feature points are tracked using the KLT sparse optical flow method [27]. In this paper, the IMU measurements between frames are used to predict the motion of features. A better initial position estimate of the features improves the efficiency of feature tracking by reducing the number of optical flow pyramid layers, and it can effectively discard unstable features such as noise and dynamic features with inconsistent motion. The basic idea is illustrated in Fig. 2.

In the previous frame, stable features are colored red, and newly detected features are colored blue. When the current frame arrives, the IMU measurements between the current and previous frames are used to predict the feature positions (green) in the current frame. Optical flow uses the predicted feature position as the initial position to look for a matching feature in the current frame. The successfully tracked features are turned red, while those that failed to be tracked are marked as unstable features (purple). In order to avoid the repetition and aggregation of feature detection, an orange circular mask centered on each stable feature is set; the region where the unstable features are located is considered an unstable feature detection region and is masked with a purple circle to avoid unstable feature detection. According to the mask, new features are detected from unmasked areas in the current frame and colored blue.

The above means can obtain uniformly distributed features to capture comprehensive constraints and avoid repeatedly extracting unstable features in areas with blur or weak texture. Long-term feature tracking can reduce the time consumption with the help of the grid-based feature detection described in the following.

LIU et al.: RGB-D INERTIAL ODOMETRY FOR A RESOURCE-RESTRICTED ROBOT IN DYNAMIC ENVIRONMENTS 9575

Fig. 2. Illustration of feature tracking and detection. Stable features and new features are colored red and blue, respectively. The green circles denote the prediction for optical flow. The successfully tracked features turn red; otherwise, the features turn purple. The orange and purple dash-line circles as masks are set for a uniform feature distribution and reliable feature detection. New feature points are detected from unmasked areas in the current frame.

Fig. 3. Illustration of semantic mask setting for dynamic feature recognition when all pixels' depths are available (d > 0). The left scene represents the case when an object bounding box's farthest corner depth is bigger than the center depth by more than a threshold, and a semantic mask with weighted depth is set between them to separate features on dynamic objects from the background. Otherwise, the semantic mask is set behind the bounding box's center at a distance of ε, as shown on the right.

B. Grid-Based Feature Detection

The system maintains a minimum number of features for stability. Therefore, feature points need to be extracted from the frame constantly. This study adopts grid-based feature detection. The image is divided into grids, and the boundary of each grid is padded to prevent the features at the edge of the grid from being ignored; the padding enables the current grid to obtain adjacent pixel information for feature detection. Unlike traversing the whole image to detect features, only the grids with insufficient matched features conduct feature detection. A grid cell that fails to detect features due to weak texture or because it is covered by the mask is skipped in the next detection frame to avoid repeated useless detection. The thread pool technique is used to exploit the parallel performance of grid-based feature detection. Thus, the time consumption of feature detection is significantly reduced without loss.

The FAST feature detector [28] can efficiently extract feature points but easily treats noise as features and extracts similar clustered features. Therefore, the mask idea of Section III-A and Non-Maximum Suppression are combined to select high-quality and uniformly distributed FAST features.

C. Dynamic Feature Recognition

Most feature points can be stably tracked through the above improvement. However, long-term tracked features on dynamic objects always come with abnormal motion and introduce wrong constraints to the system. For the sake of efficiency and computational cost, a real-time single-stage object detection method, YOLOv3 [11], is used to detect many kinds of dynamic scene elements like people and vehicles. If a detected bounding box covers a large region of the image, blindly deleting feature points in the bounding box might result in no available features to provide constraints. Therefore, semantic-segmentation-like masks are helpful to keep the system running by tracking features not occluded by dynamic objects.

This paper combines object detection and depth information for highly efficient dynamic feature recognition to achieve performance comparable to semantic segmentation. The farther the depth camera measures, the worse its accuracy is. This problem makes some methods, such as Seed Filling, DBSCAN, and K-Means, which make full use of the depth information, exhibit poor performance with a low-accuracy depth camera, as shown in Fig. 5(a). Therefore, a set of points in the detected bounding box and the depth information are integrated to obtain performance comparable to semantic segmentation, as illustrated in Fig. 3.

A pixel's depth d is available if d > 0; otherwise, d = 0. The bounding box corners of most dynamic objects correspond to background points, and dynamic objects commonly have a relatively large depth gap with the background. The K-th dynamic object's largest background depth ^K d_max is therefore obtained as

  ^K d_max = max(^K d_tl, ^K d_tr, ^K d_bl, ^K d_br),   (1)

where ^K d_tl, ^K d_tr, ^K d_bl, ^K d_br are the depth values of the K-th object detection bounding box's corners, respectively.
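The bookkeeping behind the grid-based detection of Section III-B can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation; the grid dimensions, the per-cell feature quota, and the function name are assumptions.

```python
# Hypothetical sketch of grid-based feature detection bookkeeping
# (Section III-B): only cells with too few tracked features trigger
# detection, and cells that just failed are skipped for one frame.

def cells_to_detect(tracked, cols, rows, cell_w, cell_h,
                    min_per_cell=2, skip=frozenset()):
    """Return the set of (cx, cy) grid cells that should run detection.

    tracked -- iterable of (u, v) pixel positions of tracked features
    skip    -- cells that failed detection last frame (weak texture or
               masked) and are skipped to avoid repeated useless detection
    """
    counts = {}
    for (u, v) in tracked:
        cell = (int(u // cell_w), int(v // cell_h))
        counts[cell] = counts.get(cell, 0) + 1
    need = set()
    for cx in range(cols):
        for cy in range(rows):
            cell = (cx, cy)
            if cell in skip:
                continue  # skipped this frame after a failed detection
            if counts.get(cell, 0) < min_per_cell:
                need.add(cell)  # under-populated: run FAST here
    return need
```

In this spirit, a full frame is never traversed: well-populated cells cost nothing, and each candidate cell can be handed to a thread pool independently, which is what makes the per-frame detection cost small.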
Next, the K-th bounding box's depth threshold ^K d̄ is defined as

  ^K d̄ = (1/2)(^K d_max + ^K d_c),  if ^K d_max − ^K d_c > ε and ^K d_c > 0;
         ^K d_c + ε,                if ^K d_max − ^K d_c ≤ ε and ^K d_c > 0;
         ^K d_max,                  if ^K d_max > 0 and ^K d_c = 0;
         ∞,                         otherwise,                              (2)

where ^K d_c is the depth value of the bounding box's center, and ε > 0 is a predefined distance chosen according to the size of the most common dynamic objects in scenes. The depth threshold ^K d̄ is placed in the middle between the center's depth ^K d_c and the deepest background depth ^K d_max. When the dynamic object has a close connection with the background or is behind another object (^K d_max − ^K d_c ≤ ε), the depth threshold is set at distance ε from the dynamic object. If the depth is unavailable, a conservative strategy is adopted and an infinite depth is chosen as the threshold.

On the semantic mask, the area covered by the K-th dynamic object's bounding box is set to the weighted depth ^K d̄; the area without dynamic objects is set to 0. Each incoming feature's depth d is compared with the corresponding pixel's depth threshold d̄ on the semantic mask. If d < d̄, the feature is considered dynamic; otherwise, it is considered stable. Therefore, the region where the depth value is smaller than the weighted depth d̄ constitutes the generalized semantic mask, as shown in Figs. 4 and 5(b).

Considering that dynamic objects may exist in the field of view for a long time, the dynamic features are tracked but not used for pose estimation, unlike directly deleting dynamic features. According to its recorded information, each incoming feature point from the feature tracking thread is judged to be a historical dynamic feature or not. The above methods avoid blindly deleting feature points while ensuring efficiency. They save the time of detecting features on dynamic objects, are robust to missed object detections, and recycle false-positive dynamic features, as illustrated in Section III-E.

Fig. 4. Results of missed detection compensation. The dynamic feature recognition results are shown in the first row. The green box shows the dynamic object's position from the object detection results. The second row shows the generated semantic mask. With the help of missed detection compensation, even if object detection failed in (b) and (d), a semantic mask including all dynamic objects could still be built.

Fig. 5. Results of dynamic feature recognition. The stable features are circled in yellow. The dynamic feature recognition results generated by Seed Filling and by the proposed method are shown in (a) and (b), respectively. The weighted depth d̄ is colored gray; brighter means a bigger value. Feature points in the white area will be marked as dynamic features.

D. Missed Detection Compensation

Since object detection might sometimes fail, the proposed Dynamic-VINS utilizes the previous detection results to predict the following detection result and compensate for missed detections. It is assumed that the dynamic objects in adjacent frames have a consistent motion. Once a dynamic object is detected, its pixel velocity and bounding box are updated. Assuming that j is the current detected frame and j − 1 is the previous detected frame, the pixel velocity ^K v_{c_j} (pixel/frame) of the K-th dynamic object between frames is defined as

  ^K v_{c_j} = ^K u_{c_j} − ^K u_{c_{j−1}},   (3)

where ^K u_{c_j} and ^K u_{c_{j−1}} represent the pixel locations of the K-th object detection bounding box's center in the j-th and (j − 1)-th frames, respectively. A weighted predicted velocity ^K v̂ is defined as

  ^K v̂_{c_{j+1}} = (1/2)(^K v_{c_j} + ^K v̂_{c_j}).   (4)

As the update goes on, the velocities of older frames have a lower weight in ^K v̂. If the object fails to be detected in the next frame, the bounding box ^K Box, containing the corners' pixel locations ^K u_tl, ^K u_tr, ^K u_bl, and ^K u_br, is updated based on the predicted velocity ^K v̂ as follows:

  ^K B̂ox_{c_{j+1}} = ^K Box_{c_j} + ^K v̂_{c_{j+1}}.   (5)

When the missed detection time exceeds a threshold, this dynamic object's compensation is abandoned. The result is shown in Fig. 4. It improves the recall rate of object detection and is helpful for a more consistent dynamic feature recognition.
Dynamic-VINS combines the pose has an 8-core A55 Arm CPU (1.6 GHz), 8 GB of RAM, and +predicted by IMU and the optimized pose in the sliding windows a 2-core HUAWEI DaVinci NPU. Jetson AGX Xavier has an +to recognize dynamic features. 8-core ARMv8.2 64-bit CPU (2.25 GHz), 16 GB of RAM, + and a 512-core Nvidia Volta GPU. And the results tested on + Consider the kth feature is first observed in the ith image and both devices are named Dynamic-VINS-Atlas and Dynamic- +is observed by other m images in sliding windows. The average VINS-Jetson, respectively. Yet, to the best of our knowledge, the +reprojection residual rk of the feature observation in the sliding proposed method is the best-performance real-time RGB-D iner- +windows is defined as tial odometry for dynamic environments on resource-restricted + embedded platforms. +rk = 1 ukci − π TcbTwbi Tbwj TbcPkcj , (6) + m A. OpenLORIS-Scene Dataset + j=i + OpenLORIS-Scene [3] is a real-world indoor dataset with +where ukci is the observation of kth feature in the ith frame; Pkcj a large variety of challenging scenarios like dynamic scenes, +is the 3D location of kth feature in the jth frame; Tcb and Twbj featureless frames, and dim illumination. The results on the +are the transforms from camera frame to body frame and from OpenLORIS-Scene dataset are shown in Fig. 6, including the +jth body frame to world frame, respecvtively; π represents the results of VINS-Mono, ORB-SLAM2, and DS-SLAM from [3] +camera projection model. When the rk is over a preset threshold, as baselines. +the kth feature is considered as a dynamic feature. + The OpenLORIS dataset includes five scenes and 22 se- + As shown in Fig. 7, the moving consistency check (MCC) quences in total. The proposed Dynamic-VINS shows the best +module can find out unstable features. However, some stable robustness among the tested algorithms. 
In office scenes that +features are misidentified (top left image), and features on are primarily static environments, all the algorithms can track +standing people are not recognized (bottom right image). A low successfully and achieve a decent accuracy. It is challenging for +threshold holds a high recall rate of unstable features. Further, the pure visual SLAM systems to track stable features in home +a misidentified unstable feature with more observations will be and corridor scenes that contain a large area of textureless walls +recycled if its reprojection error is lower than the threshold. and dim lighting. Thanks to the IMU sensor, the VINS systems + show robustness superiority when the camera is unreliable. The + IV. EXPERIMENTAL RESULTS scenarios of home and caf e contain a number of sitting people + with a bit of motion, and market exists lots of moving pedes- + Quantitative experiments1 are performed to evaluate the pro- trians and objects with unpredictable motion. And the market +posed system’s accuracy, robustness, and efficiency. Public scenes cover the largest area and contain highly dynamic objects, +SLAM evaluation datasets, OpenLORIS-Scene [29] and TUM as shown in Fig. 5. Although DS-SLAM is able to filter out +RGB-D [30], provide sensor data and ground truth to evaluate some dynamic features, its performance is still unsatisfactory. +SLAM system in complex dynamic environments. Since our sys- VINS-RGBD has a similar performance with Dynamic-VINS +tem is built on VINS-Mono [2] and VINS-RGBD [25], they are in relative static scenes, while VINS-RGBD’s accuracy drops in +used as the baselines to demonstrate our improvement. VINS- highly dynamic market scenes. The proposed Dynamic-VINS +Mono [2] provides robust and accurate visual-inertial odometry can effectively deal with complex dynamic environments and +by fusing IMU preintegration and feature observations. VINS- improve robustness and accuracy. 
+RGBD [25] integrates RGB-D camera based on VINS-Mono +for better performance. Furthermore, DS-SLAM [15] and Ji B. TUM RGB-D Dataset +et al.[24], state-of-the-art semantic algorithms based on ORB- +SLAM2 [4], are also included for comparison. The TUM RGB-D dataset [30] offers several sequences con- + taining dynamic objects in indoor environments. The highly + The accuracy is evaluated by Root-Mean-Square-Error dynamic f r3_walking sequences are chosen for evaluation +(RMSE) of Absolute Trajectory Error (ATE), Translational Rel- where two people walk around a desk and change chairs’ +ative Pose Error (T.RPE), and Rotational Relative Pose Error positions while the camera moves in different motions. As +(R.RPE). Correct Rate (CR) [29] measuring the correct rate the VINS system does not support VO mode and the TUM +over the whole period of data is used to evaluate the robustness. RGB-D dataset does not provide IMU measurements, a VO +The RMSE of an algorithm is calculated only for its success- mode is implemented by simply disabling modules relevant to +ful tracking outputs. Therefore, the longer an algorithm tracks IMU in Dynamic-VINS for experiments. The results are shown +successfully, the more error is likely to accumulate. It implies in Table I. The compared methods’ results are included from +that evaluating algorithms purely by ATE could be mislead- their original published papers. The algorithms based on ORB- +ing. On the other hand, considering only CR could also be SLAM2 and semantic segmentation perform better. Although +misleading. + +1The experimental video is available at https://youtu.be/y0U1IVtFBwY. + +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:34:48 UTC from IEEE Xplore. Restrictions apply. + 9578 IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 7, NO. 4, OCTOBER 2022 + +Fig. 6. Per-sequence testing results with the OpenLORIS-Scene datasets. 
Each black dot on the top line represents the start of one data sequence. For each +algorithm, blue dots indicate successful initialization moments, and blue lines indicate successful tracking span. The percentage value on the top left of each scene +is the average correct rate; the higher the correct rate of an algorithm, the more robust it is. The float value on the first line below is average ATE RMSE and the +values on the second line below are T.RPE and R.RPE from left to right, and smaller means more accurate. + + TABLE I + RESULTS OF RMSE OF ATE [m], T.RPE [m/s], AND R.RPE [◦/s] ON TUM RGB-D f r3_walking DATASETS + + TABLE II + ABLATION EXPERIMENT RESULTS OF RMSE OF ATE [m], T.RPE [m/s], AND R.RPE [◦/s] ON TUM RGB-D f r3_walking DATASETS + + to extract evenly distributed stable features, which seriously + degrades the accuracy performance. Without the object detec- + tion (W/O OBJECT DETECTION), dynamic features introduce + wrong constraints to impair the system’s accuracy. Dynamic- + VINS-W/O-SEG-LIKE-MASK shows the results that mask all + features in the bounding boxes. The background features help the + system maintain as many stable features as possible to provide + more visual constraints. The moving consistency check plays + an important role when object detection fails, as shown in the + column W/O-MCC. + +Fig. 7. Results of Moving Consistency Check. Features without yellow circu- C. Runtime Analysis +lar are the outliers marked by the Moving Consistency Check module. + This part compares VINS-Mono, VINS-RGBD, and +Dynamic-VINS is not designed for pure visual odometry, it still Dynamic-VINS for runtime analysis. These methods are ex- +shows competitive performance and has a significant improve- pected to track and detect 130 feature points, and the frames +ment over ORB-SLAM2. in Dynamic-VINS are divided into 7x8 grids. The object detec- + tion runs on the NPU/GPU parallel to the CPU. 
The average + To validate the effectiveness of each module in Dynamic- computation times of each module and thread are calculated on +VINS, ablation experiments are conducted as shown in Table II. OpenLORIS market scenes; the results run on both embedded +The system without applying circular masks (W/O CIRCU- platforms are shown in Table III. It should be noted that the +LAR MASK) from the Section III-A and Section III-B fails average computation time is only to be updated when the module + is used. Specifically, in VINS architecture, the feature detection + is executed at a consistent frequency with the state optimization + +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:34:48 UTC from IEEE Xplore. Restrictions apply. + LIU et al.: RGB-D INERTIAL ODOMETRY FOR A RESOURCE-RESTRICTED ROBOT IN DYNAMIC ENVIRONMENTS 9579 + + TABLE III + AVERAGE COMPUTATION TIME [ms] OF EACH MODULE AND THREAD ON OPENLORIS market SCENES + +* Tracking Thread, Optimization Thread and Object Detection correspond to the three different threads shown in Fig. 1, respectively. +† Dynamic Feature Recognition Modules sum up the Dynamic Feature Recognition, Missed Detection Compensation, and Moving Consistency Check modules. + +Fig. 8. A compact aerial robot equipped with an RGB-D camera, an autopilot Fig. 9. The estimated trajectories in the outdoor environment aligned with the +with IMUs, an onboard computer, and an embedded edge computing device. Google map. The green line is the estimated trajectory from Dynamic-VINS, the +The whole size is about 255 × 165 mm. red line is from VINS-RGBD, and the yellow line represents the loop closure + that happened at the end of the dataset. +thread, which means the frequency of feature detection is lower +than that of Feature Tracking Thread. Fig. 10. Results of dynamic feature recognition in outdoor environments. 
The + dynamic feature recognition modules are still able to segment dynamic objects + On edge computing devices with AI accelerator modules, but with a larger mask region. +the single-stage object detection method is computed by an +NPU or GPU without costing the CPU resources and can out- handheld aerial robot above for safety. The total path lengths +put inference results in real-time. With the same parameters, are approximately 800 m and 1220 m, respectively. The dataset +Dynamic-VINS shows significant improvement in feature de- has a similar scene at the beginning and the end for loop +tection efficiency in both embedded platforms and is the one able closure, while loop closure fails in the THUSZ campus dataset. +to achieve instant feature tracking and detection in HUAWEI At- VINS-RGBD and Dynamic-VINS run the dataset on NVIDIA +las200 DK. The dynamic feature recognition modules (Dynamic Jetson AGX Xavier. The estimated trajectories and loop closure +Feature Recognition, Missed Detection Compensation, Moving trajectory aligned with the Google map are shown in Fig. 9. +Consistency Check) to recognize dynamic features only take In outdoor environments, the depth camera is limited in range +a tiny part of the consuming time. For real-time application, and affected by the sunlight. The dynamic feature recognition +the system is able to output a faster frame-to-frame pose and a modules can still segment dynamic objects but with a larger +higher-frequency imu-propagated pose rather than waiting for mask region, as shown in Fig. 10. Compared with loop closure +the complete optimization result. results, Dynamic-VINS could provide a robust and stable pose + estimation with little drift. +D. Real-World Experiments + + A compact aerial robot is shown in Fig. 8. An RGB-D camera +(Intel Realsense D455) provides 30 Hz color and aligned depth +images. An autopilot (CUAV X7pro) with an onboard IMU +(ADIS16470, 200 Hz) is used to provide IMU measurements. 
+The aerial robot is equipped with an onboard computer (Intel +NUC, i7-5557 U CPU) and an embedded edge computing de- +vice (HUAWEI Atlas200 DK). These two computation resource +providers play different roles in the aerial robot. The onboard +computer charges for peripheral management and other core +functions requiring more CPU resources, such as planning and +mapping. The edge computing device as auxiliary equipment +offers instant state feedback and object detection results to the +onboard computer. + + Large-scale outdoor datasets with moving people and vehi- +cles on the HITSZ and THUSZ campus are recorded by the + +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:34:48 UTC from IEEE Xplore. Restrictions apply. + 9580 IEEE ROBOTICS AND AUTOMATION LETTERS, VOL. 7, NO. 4, OCTOBER 2022 + + V. CONCLUSION [12] V. Badrinarayanan, A. Kendall, and R. Cipolla, “SegNet: A deep con- + volutional encoder-decoder architecture for image segmentation,” IEEE + This paper presents a real-time RGB-D inertial odometry Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, +for resource-restricted robots in dynamic environments. Cost- Dec. 2017. +efficient feature tracking and detection methods are proposed to +cut down the computing burden. A lightweight object-detection- [13] K. He, G. Gkioxari, P. Dollar, and R. Girshick, “Mask R-CNN,” in Proc. +based method is introduced to deal with dynamic features in IEEE Int. Conf. Comput. Vis., 2017, pp. 2961–2969. +real-time. Validation experiments show the proposed system’s +competitive accuracy, robustness, and efficiency in dynamic [14] L. Xiao et al., “Dynamic-SLAM: Semantic monocular visual localization +environments. Furthermore, Dynamic-VINS is able to run on and mapping based on deep learning in dynamic environment,” Robot. +resource-restricted platforms to output an instant pose estima- Auton. Syst., vol. 117, pp. 1–16, 2019. +tion. 
In the future, the proposed approaches are expected to +be validated on the existing popular SLAM frameworks. The [15] C. Yu et al., “DS-SLAM: A semantic visual SLAM towards dynamic +missed detection compensation module is expected to develop environments,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2018, +into a moving object tracking module, and semantic information pp. 1168–1174. +will be further introduced for high-level guidance on mobile +robots or mobile devices in complex dynamic environments. [16] B. Bescos,, J. M. Facil, J. Civera, and J. Neira, “DynaSLAM: Tracking, + mapping, and inpainting in dynamic scenes,” IEEE Robot. Automat. Lett., + REFERENCES vol. 3, no. 4, pp. 4076–4083, Oct. 2018. + + [1] J. Engel, V. Koltun, and D. Cremers, “Direct sparse odometry,” IEEE [17] F. Zhong, S. Wang, Z. Zhang, C. Chen, and Y. Wang, “Detect-SLAM: + Trans. Pattern Anal. Mach. Intell., vol. 40, no. 3, pp. 611–625, Mar. 2018. Making object detection and SLAM mutually beneficial,” in Proc. IEEE + Winter Conf. Appl. Comput. Vis., 2018, pp. 1001–1010. + [2] T. Qin, P. Li, and S. Shen, “VINS-Mono: A robust and versatile monoc- + ular visual-inertial state estimator,” IEEE Trans. Robot., vol. 34, no. 4, [18] Y. Liu and J. Miura, “RDS-SLAM: Real-time dynamic SLAM using + pp. 1004–1020, Aug. 2018. semantic segmentation methods,” IEEE Access, vol. 9, pp. 23 772–23 785, + 2021. + [3] P. Geneva, K. Eckenhoff, W. Lee, Y. Yang, and G. Huang, “OpenVINS: A + research platform for visual-inertial estimation,” in Proc. IEEE Int. Conf. [19] I. Ballester, A. Fontán, J. Civera, K. H. Strobl, and R. Triebel, “DOT: + Robot. Automat., 2020, pp. 4666–4672. Dynamic object tracking for visual SLAM,” in Proc. IEEE Int. Conf. Robot. + Automat., 2021, pp. 11 705–11 711. + [4] R. Mur-Artal and J. D. Tardos, “ORB-SLAM2: An open-source SLAM + system for monocular, stereo, and RGB-D cameras,” IEEE Trans. Robot., [20] K. Schauwecker, N. R. Ke, S. A. Scherer, and A. 
Zell, “Markerless visual + vol. 33, no. 5, pp. 1255–1262, Oct. 2017. control of a quad-rotor micro aerial vehicle by means of on-board stereo + processing,” in Proc. Auton. Mobile Syst., 2012, pp. 11–20. + [5] B. Triggs, P. F. McLauchlan, R. I. Hartley, and A. W. Fitzgibbon, “Bundle + adjustment—a modern synthesis,” in Proc. Int. Workshop Vis. Algorithms, [21] Z. Z. Nejad and A. Hosseininaveh Ahmadabadian, “ARM-VO: An efficient + 1999, pp. 298–372. monocular visual odometry for ground vehicles on ARM CPUs,” Mach. + Vis. Appl., vol. 30, no. 6, pp. 1061–1070, 2019. + [6] M. A. Fischler and R. C. Bolles, “Random sample consensus: A paradigm + for model fitting with applications to image analysis and automated car- [22] S. Bahnam, S. Pfeiffer, and G. C. H. E. de Croon, “Stereo visual iner- + tography,” Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981. tial odometry for robots with limited computational resources,” in Proc. + IEEE/RSJ Int. Conf. Intell. Robots Syst., 2021, pp. 9154–9159. + [7] Y. Sun, M. Liu, and M.Q.-H. Meng, “Improving RGB-D SLAM in dynamic + environments: A motion removal approach,” Robot. Auton. Syst., vol. 89, [23] G. Younes et al., “Keyframe-based monocular SLAM: Design, survey, and + pp. 110–122, 2017. future directions,” Robot. Auton. Syst., vol. 98, pp. 67–88, 2017. + + [8] E. Palazzolo,, J. Behley, P. Lottes, P. Gigu, and C. Stachniss, “ReFusion: [24] T. Ji, C. Wang, and L. Xie, “Towards real-time semantic RGB-D SLAM in + 3D reconstruction in dynamic environments for RGB-D cameras exploit- dynamic environments,” in Proc. IEEE Int. Conf. Robot. Automat., 2021, + ing residuals,” in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., 2019, pp. 11 175–11 181. + pp. 7855–7862. + [25] Z. Shan, R. Li, and S. Schwertfeger, “RGBD-inertial trajectory estima- + [9] W. Dai et al., “RGB-D SLAM in dynamic environments using point tion and mapping for ground robots,” Sensors, vol. 19, no. 10, 2019, + correlations,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 
44, no. 1, Art. no. 2251. + pp. 373–389, Jan. 2022. + [26] C. Forster et al., “IMU preintegration on manifold for efficient visual- +[10] W. Liu et al., “SSD: Single shot MultiBox detector,” in Eur. Conf. Comp. inertial maximum-a-posteriori estimation,” in Proc. Robot.: Sci. Syst., + Vis., 2016, pp. 21–37. 2015. + +[11] J. Redmon and A. Farhadi, “YOLOv3: An incremental improvement,” [27] B. D. Lucas et al., “An iterative image registration technique with an appli- + 2018, arXiv:1804.02767. cation to stereo vision,” in Proc. DARPA Image Understanding Workshop, + 1981, pp. 121–130. + + [28] E. Rosten and T. Drummond, “Machine learning for high-speed corner + detection,” in Proc. Eur. Conf. Comput. Vis., 2006, pp. 430–443. + + [29] X. Shi et al., “Are we ready for service robots? The OpenLORIS-Scene + datasets for lifelong SLAM,” in Proc. IEEE Int. Conf. Robot. Automat., + 2020, pp. 3139–3145. + + [30] J. Sturm,, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A bench- + mark for the evaluation of RGB-D SLAM systems,” in Proc. IEEE/RSJ + Int. Conf. Intell. Robots Syst., 2012, pp. 573–580. + +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on November 14,2023 at 12:34:48 UTC from IEEE Xplore. Restrictions apply. + diff --git a/动态slam/2020年-2022年开源动态SLAM/2022年/SG-SLAM_A_Real-Time_RGB-D_Visual_SLAM_Toward_Dynamic_Scenes_With_Semantic_and_Geometric_Information.pdf b/动态slam/2020年-2022年开源动态SLAM/2022年/SG-SLAM_A_Real-Time_RGB-D_Visual_SLAM_Toward_Dynamic_Scenes_With_Semantic_and_Geometric_Information.pdf new file mode 100644 index 0000000..0065e5b --- /dev/null +++ b/动态slam/2020年-2022年开源动态SLAM/2022年/SG-SLAM_A_Real-Time_RGB-D_Visual_SLAM_Toward_Dynamic_Scenes_With_Semantic_and_Geometric_Information.pdf @@ -0,0 +1,665 @@ +IEEE TRANSACTIONS ON INSTRUMENTATION AND MEASUREMENT, VOL. 
72, 2023 7501012

SG-SLAM: A Real-Time RGB-D Visual SLAM Toward Dynamic Scenes With Semantic and Geometric Information

Shuhong Cheng, Changhe Sun, Shijun Zhang, Student Member, IEEE, and Dianfan Zhang

Abstract— Simultaneous localization and mapping (SLAM) is one of the fundamental capabilities for intelligent mobile robots to perform state estimation in unknown environments. However, most visual SLAM systems rely on the static scene assumption and consequently have severely reduced accuracy and robustness in dynamic scenes. Moreover, the metric maps constructed by many systems lack semantic information, so the robots cannot understand their surroundings at a human cognitive level. In this article, we propose SG-SLAM, a real-time RGB-D semantic visual SLAM system based on the ORB-SLAM2 framework. First, SG-SLAM adds two new parallel threads: an object detecting thread to obtain 2-D semantic information and a semantic mapping thread. Then, a fast dynamic feature rejection algorithm combining semantic and geometric information is added to the tracking thread. Finally, the 3-D point clouds and 3-D semantic objects generated in the semantic mapping thread are published to the robot operating system (ROS) for visualization. We performed an experimental evaluation on the TUM dataset, the Bonn dataset, and the OpenLORIS-Scene dataset. The results show that SG-SLAM is not only one of the most real-time, accurate, and robust systems in dynamic scenes but also allows the creation of intuitive semantic metric maps.

Index Terms— Dynamic scenes, geometric constraint, semantic metric map, visual-based measurement, visual simultaneous localization and mapping (SLAM).

I. INTRODUCTION

SIMULTANEOUS localization and mapping (SLAM) has an important role in the state perception of mobile robots. It can help a robot in an unknown environment with an unknown pose to incrementally build a globally consistent map and simultaneously measure its pose in this map [1]. Due to the continuing and rapid development of cameras and computing systems, we have access to cheaper, faster, higher quality, and smaller vision-based sensors. This also helps vision-based measurement (VBM) become more ubiquitous and applicable [2]. Hence, in the past years, a large number of excellent visual SLAM systems have emerged, such as PTAM [3], ORB-SLAM2 [4], DVO [5], and Kimera [6]. Some of these visual SLAM systems are quite mature and have achieved good performance under certain specific environmental conditions.

As SLAM enters the age of robust perception [7], the system has higher requirements in terms of robustness and high-level understanding. However, many classical vision-based SLAM systems still fall short of these requirements in some practical scenarios. On the one hand, most visual SLAM systems work based on the static scene assumption, which makes them less accurate and less robust in real dynamic scenes (e.g., scenes containing walking people and moving vehicles). On the other hand, most existing SLAM systems only construct a globally consistent metric map of the robot's working environment [8]. However, the metric map does not help the robot to understand its surroundings at a higher semantic level.

Most visual SLAM algorithms rely on the static scene assumption, which is why the presence of dynamic objects can cause these algorithms to produce wrong data correlations. The outliers obtained from dynamic objects can seriously impair the accuracy and stability of the algorithms. Even though these algorithms show superior performance in some specific scenarios, it is difficult to extend them to actual production and living scenarios containing dynamic objects. Some recent works, such as [9], [10], [11], and [12], have used methods that combine geometric and semantic information to eliminate the adverse effects of dynamic objects. These algorithms, mainly based on deep learning, achieve significant improvements in experimental accuracy, but they suffer from shortcomings in scene generalizability or real-time performance due to various factors. Therefore, how to skillfully detect and process dynamic objects in the scene is crucial for the system to operate accurately, robustly, and in real time.

Traditional SLAM systems construct only a sparse metric map [3], [4]. This metric map consists of simple geometries (points, lines, and surfaces), and every pose is strictly related to the global coordinate system. Enabling a robot to perform advanced tasks with intuitive human–robot interaction requires it to understand its surroundings at a human

Manuscript received 25 August 2022; revised 31 October 2022; accepted 23 November 2022. Date of publication 9 December 2022; date of current version 17 January 2023. This work was supported in part by the National Key Research and Development Program under Grant 2021YFB3202303, in part by the S&T Program of Hebei under Grant 20371801D, in part by the Hebei Provincial Department of Education for Cultivating Innovative Ability of Postgraduate Students under Grant CXZZBS2022145, and in part by the Hebei Province Natural Science Foundation Project under Grant E2021203018. The Associate Editor coordinating the review process was Dr. Jae-Ho Han. (Corresponding authors: Shijun Zhang; Dianfan Zhang.)

Shuhong Cheng and Changhe Sun are with the School of Electrical Engineering, Yanshan University, Qinhuangdao 066000, China (e-mail: shhcheng@ysu.edu.cn; silencht@qq.com).

Shijun Zhang is with the School of Mechanical Engineering, Yanshan University, Qinhuangdao 066000, China (e-mail: 980871977@qq.com).
Dianfan Zhang is with the Key Laboratory of Special Delivery Equipment, Yanshan University, Qinhuangdao 066004, China (e-mail: zdf@ysu.edu.cn).

Digital Object Identifier 10.1109/TIM.2022.3228006

1557-9662 © 2022 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See https://www.ieee.org/publications/rights/index.html for more information.

Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on July 06,2023 at 09:28:10 UTC from IEEE Xplore. Restrictions apply.

Fig. 1. Overview of the framework of the SG-SLAM system. The original work of ORB-SLAM2 is presented on an aqua-green background, while our main new or modified work is presented on a red background.

cognitive level. However, the metric map lacks the necessary semantic information and therefore cannot provide this capability. With the rapid development of deep learning in recent years, some neural networks can effectively capture the semantic information in the scenes. Therefore, the metric map can be extended to the semantic metric map by integrating semantic information. The semantic information contained in the semantic metric map can provide the robot with the capability to understand its surroundings at a higher level.

This article focuses on a dynamic feature rejection algorithm that integrates semantic and geometric information, which not only significantly improves the accuracy of system localization but also has excellent computational efficiency. Thus, our algorithm is very useful from an instrumentation and measurement point of view [2]. This article also focuses on how to construct the semantic metric map to improve the perceptual level of the robot in understanding the surrounding scenes. The overall framework of the SG-SLAM system is shown in Fig. 1.

The main contributions of this article include the following.

1) A complete real-time RGB-D visual SLAM system called SG-SLAM is proposed using ORB-SLAM2 as a framework. Compared to ORB-SLAM2, it has higher accuracy and robustness in dynamic scenes and can publish a semantic metric map through the robot operating system (ROS) [13].

2) A fast dynamic feature rejection algorithm is proposed by combining geometric information and semantic information. The geometric information is calculated from the epipolar constraint between image frames. The semantic information about dynamic objects is obtained through an NCNN-based [14] object detection network in a new thread. The algorithm speed is greatly improved by appropriate modifications and a combination of classical methods while maintaining accuracy.

3) An independent semantic metric mapping thread that can generate semantic objects and Octo maps [15] using the ROS interface is embedded in SG-SLAM. These maps can be useful in subsequent localization, navigation, and object capture tasks.

The remaining sections of this article are organized as follows. The work related to this system is described in Section II. Section III shows the details related to the implementation of this system. Section IV provides an experimental evaluation and an analysis of the results. The conclusions and future works of this article are presented in Section V.

CHENG et al.: SG-SLAM: A REAL-TIME RGB-D VISUAL SLAM TOWARD DYNAMIC SCENES 7501012

II. RELATED WORKS

A. SLAM in Dynamic Scenes

Most current visual SLAM systems assume that the working scene is static and rigid. When these systems work in dynamic scenes, erroneous data associations due to the static scene assumption can seriously weaken the accuracy and stability of the system. The presence of dynamic objects in the scene divides all features into two categories: static features and dynamic features. How to detect and reject dynamic features is the key to solving the problem. Previous research work can be divided into three categories: the geometric information method, the semantic information method, and the method combining geometric and semantic information.

The main idea of the geometric information method is to assume that only static features can satisfy the geometric constraints of the algorithm. A remarkable early monocular dynamic object detection system comes from the work of Kundu et al. [16]. The system creates two geometric constraints to detect dynamic objects based on multiview geometry [17]. One of the most important is the epipolar constraint defined by the fundamental matrix. The idea is that a static feature point in the current image must lie on the epipolar line corresponding to the same feature point in the previous image. A feature point is considered dynamic if its distance from the corresponding epipolar line exceeds an empirical threshold. The fundamental matrix of the system is calculated with the help of an odometer. In a purely visual system, the fundamental matrix can be calculated by the seven-point method based on RANSAC [18]. The algorithm of Kundu et al. [16] has the advantages of fast speed and strong scene generalization. However, it lacks a high-level understanding of the scene, so the empirical threshold is difficult to select and the accuracy is not high. In addition, some works use the direct method for motion detection of scenes, such as [19], [20], [21], and [22]. The direct method algorithms are faster and can utilize more image information. However, they are less robust in complex environments because they are based on the gray-scale invariance assumption.

The main idea of the semantic information method is to brutally reject features in dynamic regions that are obtained a priori using deep learning techniques. Zhang et al. [23] used the YOLO [24] object detection method to obtain the semantic information of dynamic objects in the working scene and then rejected the dynamic feature points based on the semantic information to improve the accuracy of the system. However, the way YOLO extracts semantic information by bounding box causes a part of the static feature points to be wrongly regarded as outliers and eliminated. Similarly, Dynamic-SLAM proposed by Xiao et al. [25] has the same problem of directly rejecting all features within the bounding box. Liu and Miura [26] adopted a semantic segmentation method to detect dynamic objects and remove outliers in keyframes. The semantic segmentation method solves the problem of wrong recognition due to bounding boxes to a certain extent. However, the semantic information method relies heavily on the quality of the neural network, so it is difficult to meet the requirements of speed and accuracy at the same time.

Recently, much work has taken the approach of combining geometric and semantic information. For the RGB-D camera, Bescos et al. [9] used the semantic segmentation results of Mask R-CNN [27] combined with multiview geometry to detect dynamic objects and reject outliers. Yu et al. [10] used an optical flow-based moving consistency check method to examine all feature points and simultaneously performed semantic segmentation of the image using SegNet [28] in an independent thread. If the moving consistency checking method detects more than a certain percentage of dynamic points within the range of a human object, all feature points that lie inside the object are directly rejected. Wu et al. [11] used YOLO to detect a priori dynamic objects in the scene and then combined it with the depth-RANSAC method to reject the feature points inside the range of dynamic objects. Chang et al. [12] segmented the dynamic objects by YOLACT and then removed the outliers inside the objects. Then, geometric constraints are introduced to further filter the missed dynamic points.

The above methods have achieved quite good results in terms of accuracy improvement. Nevertheless, all of these methods rely heavily on semantic information and only to a lesser extent on geometric information. Thus, more or less all of them have the following shortcomings.

1) Inability to correctly handle dynamic features outside of the prior objects [10], [11], [23], [25], [26]. For example, chairs are static objects by default but become dynamic while being moved by a person; moving cats may appear in the scene while the neural network has not been trained on the category of cats; and the detection algorithm may suffer from low recall.

2) When an a priori dynamic object remains stationary, the feature points in its range are still brutally rejected, resulting in less available association data [11], [12], [23], [25], [26]. For example, a person who is sitting still is nevertheless considered a dynamic object.

3) The real-time performance is weak [9], [10], [11], [12]. The average frame rate of the system is low due to factors such as complex semantic segmentation networks or an unreasonable system architecture.

We propose an efficient dynamic feature rejection algorithm combining geometric and semantic information to solve the above problems. Unlike most current work that relies heavily on deep learning, our algorithm uses mainly geometric information and then supplements it with semantic information. This shift in thinking allows our algorithm to avoid the shortcomings associated with relying too much on deep learning.
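To make the bounding-box flavor of the semantic information method concrete, the following sketch rejects every feature point that falls inside a detected a priori dynamic object's box. This is an illustrative reconstruction, not code from any of the cited systems; the class list, box format, and point layout are hypothetical, and real systems operate on ORB features inside a full SLAM pipeline.

```python
# A priori dynamic classes, as assumed by detector-based rejection (illustrative).
DYNAMIC_CLASSES = {"person", "car"}

def reject_semantic(points, detections):
    """Drop feature points that fall inside any a priori dynamic object's box.

    points: list of (u, v) pixel coordinates.
    detections: list of (class_name, (x_min, y_min, x_max, y_max)) tuples.
    Returns the retained (assumed static) points.
    """
    def in_dynamic_box(u, v):
        return any(cls in DYNAMIC_CLASSES and x0 <= u <= x1 and y0 <= v <= y1
                   for cls, (x0, y0, x1, y1) in detections)
    return [(u, v) for u, v in points if not in_dynamic_box(u, v)]
```

Note that every point inside a "person" box is discarded even when that person is actually sitting still, which is exactly shortcoming 2) listed above.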
B. Semantic Mapping

Many current visual SLAM systems only provide a metric map that satisfies the basic localization and navigation functions of mobile robots, such as the sparse feature point map constructed by ORB-SLAM2. If a mobile robot is to perceive its surroundings at the human conceptual level, it is necessary to incorporate semantic information into the metric map to form a semantic map. The semantic metric map can help robots act according to human rules, execute high-level tasks, and communicate with humans at the conceptual level.

In an earlier study, Mozos et al. [29] used the hidden Markov model to partition the metric map into different functional locations (rooms, corridors, and doorways). The work of Nieto-Granda et al. [30] deployed a mapping module based on the Rao–Blackwellized particle filtering technique on ROS [13] and used the Gaussian model to partition the map into marked semantic regions. Subsequently, the development of deep learning has greatly contributed to the advancement of object detection and semantic segmentation algorithms. Sünderhauf et al. [31] used SSD [32] to detect objects in each RGB keyframe and then assigned a 3-D point cloud to each object using an adaptive 3-D unsupervised segmentation method. This work is based on a data association mechanism of ICP-like matching scores to decide whether to create new objects in the semantic map or to associate them with existing objects. Zhang et al. [23] acquired semantic maps of the working scene through the YOLO object detection module and the localization module in an RGB-D SLAM system. In summary, many works only stop at using SLAM to help with semantic mapping and do not fully utilize the acquired semantic information to help tracking. DS-SLAM, a semantic mapping system proposed by Yu et al. [10], adopted semantic segmentation information to build semantic maps. However, DS-SLAM only attaches semantic labels to the metric map for visual display. The lack of position coordinates for the objects described in mathematical form limits the system's ability to perform advanced task planning.

III. SYSTEM OVERVIEW

In this section, we will introduce the technical details of the SG-SLAM system from five aspects. First, we introduce the framework and the basic flow of the system. Second, we give information about the object detecting thread. Then, the geometric principle of the epipolar constraint method for judging dynamic features is illustrated. Subsequently, the dynamic feature rejection strategy is proposed. Finally, we propose methods to acquire semantic objects and build semantic maps.

A. System Framework

The SG-SLAM proposed in this article is developed based on the ORB-SLAM2 system, which is a classical feature point-based visual SLAM system. ORB-SLAM2 consists of three main parallel threads: tracking, local mapping, and loop closing. According to evaluations on many popular public datasets, ORB-SLAM2 is one of the systems that achieve state-of-the-art accuracy. Therefore, SG-SLAM selects ORB-SLAM2 as the base framework to provide global localization and mapping functions.

As shown in Fig. 1, the SG-SLAM system adds two more parallel threads: the object detecting thread and the semantic mapping thread. The multithreading mechanism improves the system's operating efficiency. The purpose of adding an object detecting thread is to use the neural network to obtain 2-D semantic information. This 2-D semantic information then provides a priori dynamic object information for the dynamic feature rejection strategy. The semantic mapping thread integrates the 2-D semantic information and 3-D point cloud information from keyframes to generate a 3-D semantic object database. An intuitive semantic metric map is obtained by publishing the 3-D point cloud, 3-D semantic objects, and camera pose to the ROS system. Compared with the sparse feature point maps of ORB-SLAM2, the semantic metric maps can help mobile robots understand their surroundings and perform advanced tasks at a higher cognitive level.

When the SG-SLAM system is running, the image frames captured from the RGB-D camera are first fed together to the tracking thread and the object detecting thread. The object detecting thread starts to perform object recognition on the input RGB images. At the same time, the tracking thread starts to extract ORB feature points from the input frames. After the extraction is completed, the iterative Lucas–Kanade optical flow method with pyramids is used to match the sparse feature points between the current frame and previous frames. Then, the seven-point method based on RANSAC is used to compute the fundamental matrix between the two frames. This reduces the adverse effects due to incorrect data correlation in dynamic regions. Compared with feature extraction and fundamental matrix computation, the object detection task is more time-consuming. In other words, when the fundamental matrix has been computed, the tracking thread needs to wait for the result of the object detecting thread. Since the system adopts object detection rather than semantic segmentation, the blocking time is not too long [26]. This enhances the real-time performance of the system. Next, the tracking thread combines the epipolar constraint and the 2-D semantic information to reject the dynamic feature points. The camera pose is computed and published to ROS according to the remaining static feature points.

The new keyframes are fed into the local mapping thread and the loop closing thread for pose optimization, which is the same as in the original ORB-SLAM2 system. The difference is that the depth image of the new keyframe is used to generate a 3-D point cloud in the semantic mapping thread. Next, the 3-D point cloud is combined with the 2-D semantic information to generate a 3-D semantic object database. Semantic map construction suffers from problems such as high computational effort and redundant information between normal frames. Thus, processing only keyframe data here improves the efficiency of mapping. The reuse of 2-D semantic information also improves the real-time performance of the system. Finally, the 3-D point cloud and the 3-D semantic object data are published to the 3-D visualization tool Rviz for map display using the interface of the ROS system.

The adoption of object detection networks (rather than semantic segmentation), multithreading, keyframe-based mapping, and data reuse mechanisms overcomes the real-time performance shortcomings listed in Section II-A.

B. Object Detection

Due to the limitations in battery life, mobile robots generally choose ARM architecture processors with high performance per watt. NCNN is a high-performance neural network inference computing framework optimized for mobile platforms. Since NCNN is implemented in pure C++ with no third-party dependencies, it can be easily integrated into SLAM systems. Thus, we choose it as the base framework for the object detecting thread.

Many SLAM systems, such as [9], [10], [11], and [12], run slowly due to complex semantic segmentation networks or unreasonable system architectures. SLAM, as a fundamental component for state estimation of mobile robots, must have good real-time performance to ensure the smooth operation of upper-level tasks. To improve the object detection speed as much as possible, the single-shot multibox detector SSD is chosen as the detection head. In addition, we use MobileNetV3 [33] as a drop-in replacement for the backbone feature extractor in SSDLite. Finally, the network was trained using the PASCAL VOC 2007 dataset [34]. In reality, other detectors can be used flexibly depending on the hardware performance to achieve a balance between accuracy and speed.

C. Epipolar Constraints

SG-SLAM uses geometric information obtained from the epipolar constraint to determine whether feature points are dynamic or not. The judgment pipeline of the epipolar constraint is very straightforward. First, match the ORB feature points of two consecutive frames. Next, solve the fundamental matrix. Finally, calculate the distance between each feature point of the current frame and its corresponding epipolar line. The larger the distance, the more likely the feature point is dynamic.

To solve the fundamental matrix, it is necessary to have correct data associations between the feature points. However, the purpose of solving the fundamental matrix is precisely to judge whether the data associations are correct or not. This becomes a classic chicken-or-egg problem. ORB-SLAM2 takes the Bag-of-Words method to accelerate feature matching, and the continued use of this method cannot eliminate the adverse effect of outliers. Hence, to obtain a relatively accurate fundamental matrix, SG-SLAM uses the pyramidal iterative Lucas–Kanade optical flow method to calculate the matching point set of features. Inspired by Yu et al. [10], the matching point pairs located at the edges of images or with excessive differences in appearance are then removed to further reduce erroneous data associations. Then, the seven-point method based on RANSAC is used to calculate the fundamental matrix between the two frames. In general, the proportion of dynamic regions is relatively small compared to the whole image. Thus, the RANSAC algorithm can effectively reduce the adverse effects of wrong data associations in dynamic regions.

Fig. 2. Epipolar constraints.

According to the pinhole camera model, as shown in Fig. 2, the camera observes the same spatial point P from different angles. O1 and O2 denote the optical centers of the camera. P1 and P2 are the matching feature points to which the spatial point P maps in the previous frame and the current frame, respectively. The short dashed lines L1 and L2 are the epipolar lines in the two frames. The homogeneous coordinate forms of P1 and P2 are denoted as follows:

    P1 = [x1, y1, 1],  P2 = [x2, y2, 1]                     (1)

where x and y denote the coordinate values of the feature points in the image pixel coordinate system. Then, the epipolar line L2 in the current frame can be calculated from the fundamental matrix (denoted as F) as follows:

    L2 = [X, Y, Z]ᵀ = F P1 = F [x1, y1, 1]ᵀ                 (2)

where X, Y, and Z are the components of the line vector. According to [16], the epipolar constraint can be formulated as follows:

    P2ᵀ F P1 = P2ᵀ L2 = 0.                                  (3)

Next, the distance between the feature point Pi (i = 2, 4) and the corresponding epipolar line is defined as the offset distance, denoted by the symbol d. The offset distance can be described as follows:

    di = |Piᵀ F P1| / √(X² + Y²).                           (4)

If the point P is a static space point, then jointly with (3) and (4), the offset distance of the point P2 is

    d2 = |P2ᵀ F P1| / √(X² + Y²) = 0.                       (5)

Equation (5) demonstrates that, in the ideal case, the feature point P2 in the current frame falls exactly on the epipolar line L2. In reality, however, the offset distance is generally greater than zero but below an empirical threshold ε due to the influence of various types of noise.

Algorithm 1 Dynamic Feature Rejection Strategy
Input: Previous frame, F1; Current frame, F2; Previous frame's feature points, P1; Current frame's feature points, P2; Standard empirical threshold, εstd
Output: The set of static feature points in the current frame, S
 1: P1 = CalcOpticalFlowPyrLK(F2, F1, P2)
 2: Remove matched pairs that are located at the edges or have too much variation in appearance
 3: FundamentalMatrix = FindFundamentalMat(P2, P1, 7-point method based on RANSAC)
 4: for each matched pair p1, p2 in P1, P2 do
 5:   if DynamicObjectsExist && IsInDynamicRegion(p2) then
 6:     if CalcEpiLineDistance(p2, p1, FundamentalMatrix) × GetDynamicWeightValue(p2) < εstd then
 7:       Append p2 to S
 8:     end if
 9:   else
10:     if CalcEpiLineDistance(p2, p1, FundamentalMatrix) < εstd then
11:       Append p2 to S
12:     end if
13:   end if
14: end for

If the point P is not a static spatial point, as shown in Fig. 2, when the camera moves from the previous frame to the current frame, the point P also moves to P′. In this case, the point P1 is matched with the point P4 mapped from P′ to the current frame. If point P moves without degeneration [16], then in general, the offset distance of P4 is greater than the threshold ε.
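The offset-distance test of (1)–(5) and the weighted comparison used in Algorithm 1 can be sketched numerically as follows. This is an illustrative reconstruction, not the authors' implementation; the helper names are hypothetical, a real system would estimate F with a RANSAC-based seven-point solver as in line 3 of Algorithm 1, and the example fundamental matrix in the usage note assumes identity intrinsics and a pure sideways camera translation.

```python
import math

def offset_distance(F, p1, p2):
    """Offset distance d between p2 and the epipolar line L2 = F * P1, eqs. (2)-(4).

    F is a 3x3 fundamental matrix (nested lists); p1 and p2 are (x, y) pixel
    coordinates, promoted to homogeneous form [x, y, 1] as in eq. (1).
    """
    x1, y1 = p1
    x2, y2 = p2
    # Epipolar line L2 = [X, Y, Z]^T = F * P1, eq. (2).
    X, Y, Z = (sum(F[r][c] * v for c, v in enumerate((x1, y1, 1.0)))
               for r in range(3))
    # d = |P2^T * F * P1| / sqrt(X^2 + Y^2), eq. (4).
    return abs(x2 * X + y2 * Y + Z) / math.sqrt(X * X + Y * Y)

def is_static(F, p1, p2, eps_std, weight=1.0):
    """Weighted test from Algorithm 1: a point is kept as static if d * w < eps_std.

    weight is the a priori dynamic weight w (w = 1 outside any detected
    dynamic region); eps_std is the standard empirical threshold.
    """
    return offset_distance(F, p1, p2) * weight < eps_std
```

For a camera translating along the x-axis with identity intrinsics, F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]] gives horizontal epipolar lines, so a point that only slides sideways has d = 0, while any vertical displacement shows up directly as the offset distance.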
+In other words, the feature points can be judged as dynamic The ROS [13] is a set of software tool libraries that +or not by comparing the offset distance with the empirical help developers quickly build robot applications. Rviz is a +threshold ε. visualization tool in the ROS. In addition to the tracking thread + that publishes camera poses to the ROS, the semantic mapping +D. Dynamic Feature Rejection Strategy thread also publishes two kinds of data: 3-D point clouds and + 3-D semantic objects. These data are then processed by rviz + To avoid the shortcomings of relying heavily on deep to display an intuitive map interface. +learning for dynamic feature judgment, our algorithm relies +mainly on geometric information. The geometric information For efficiency, only keyframes are used to construct seman- +method judges whether a feature is dynamic by comparing the tic metric maps. When a new keyframe arrives, the semantic +offset distance d with an empirical threshold ε. However, the mapping thread immediately uses its depth image and pose to +threshold ε value is very difficult to set [12]: setting it too generate a 3-D ordered point cloud. The 3-D point cloud is +small will make many static feature points wrongly judged as subsequently published to the ROS, and a global Octo-map +dynamic points and setting it too large will miss many true is built incrementally by the Octomap_server package. The +dynamic feature points. This is because the purely geometric global Octo-map has the advantages of being updatable, +method cannot understand the scene at the semantic level and flexible, and compact, which can easily serve navigation +can only mechanically process all feature points using a fixed and obstacle avoidance tasks. However, the Octo-map lacks +threshold. semantic information, so it limits the capability of advanced + task planning between mobile robots and semantic objects. 
+ To solve the above problem, all objects that can be detected Hence, a map with semantic objects with their coordinates +by the object detecting thread are first classified as static is also necessary. The semantic mapping thread generates the +objects and dynamic objects based on a priori knowledge. Any 3-D semantic objects by combining 2-D semantic information +object with moving properties is defined as a dynamic object with 3-D point clouds, and the main process is described as +(e.g., a person or car); otherwise, it is a static object. Then, follows. +both weight values w are defined. The standard empirical +threshold εstd is set in a very straightforward way: just make The 2-D object bounding box is captured in the dynamic +sure that only obvious true dynamic feature points are rejected feature rejection algorithm stage. Fetch the 3-D point clouds in +when using it. The dynamic weight value w is an a priori in the bounding box region to calculate the 3-D semantic object +the range of 1–5, which is set according to the probability information. Yet, since the bounding box contains some noisy +of the object moving. For example, a human normally moves regions of nontarget objects, it cannot accurately segment the +with a high probability, and then, w = 5; a chair normally semantic object outline. To acquire relatively accurate position +does not move, and then, w = 2. and size information of the objects, the bounding box is + first reduced appropriately. Next, we calculate the average + +Authorized licensed use limited to: University of Electronic Science and Tech of China. Downloaded on July 06,2023 at 09:28:10 UTC from IEEE Xplore. Restrictions apply. 
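The weighted epipolar test at the core of Algorithm 1 above can be sketched in pure Python. This is a minimal illustration, not the authors' code: the fundamental matrix, the dynamic-region test, and the per-object weights are assumed to be given, and all function names are illustrative (the paper's implementation uses OpenCV's optical flow and fundamental-matrix routines).

```python
import math

def epiline_distance(p1, p2, F):
    """Distance from current-frame point p2 to the epipolar line F @ [p1, 1]."""
    x1, y1 = p1
    # Epipolar line l = F @ [x1, y1, 1]^T in the current image.
    l = [F[i][0] * x1 + F[i][1] * y1 + F[i][2] for i in range(3)]
    x2, y2 = p2
    return abs(l[0] * x2 + l[1] * y2 + l[2]) / math.hypot(l[0], l[1])

def reject_dynamic_features(pairs, F, eps_std,
                            weight_of=lambda p: 1.0,
                            in_dynamic_region=lambda p: False):
    """Return the static subset S of current-frame points (Algorithm 1 sketch)."""
    static = []
    for p1, p2 in pairs:
        d = epiline_distance(p1, p2, F)
        # Inside a detected dynamic region the distance is scaled by an
        # a priori weight w, so points on likely movers are rejected sooner.
        if in_dynamic_region(p2):
            d *= weight_of(p2)
        if d < eps_std:
            static.append(p2)
    return static

# Toy fundamental matrix for a pure x-translation with identity intrinsics:
# epipolar lines are horizontal, so static points keep their y-coordinate.
F = [[0, 0, 0], [0, 0, -1], [0, 1, 0]]
pairs = [((10, 5), (14, 5)),    # static: moved along its epipolar line
         ((20, 8), (20, 11))]   # dynamic: 3 px off its epipolar line
print(reject_dynamic_features(pairs, F, eps_std=1.0))  # -> [(14, 5)]

# With a person-region prior (w = 5), even a subtle 0.3 px offset is rejected:
print(reject_dynamic_features([((0, 0), (5, 0.3))], F, 1.0,
                              weight_of=lambda p: 5.0,
                              in_dynamic_region=lambda p: True))  # -> []
```

The weighting reproduces the intuition in the text: a single threshold εstd is kept loose enough to spare static points, while semantic priors tighten it only where movers are likely.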
CHENG et al.: SG-SLAM: A REAL-TIME RGB-D VISUAL SLAM TOWARD DYNAMIC SCENES 7501012

TABLE I
RESULTS OF METRIC ROTATIONAL DRIFT (RPE)

TABLE II
RESULTS OF METRIC TRANSLATIONAL DRIFT (RPE)

TABLE III
RESULTS OF METRIC ABSOLUTE TRAJECTORY ERROR (ATE)

Next, we calculate the average depth of the points corresponding to the bounding box region. Then, the depth of each point in the original bounding box is compared with the average depth and is rejected if the difference is too large. Eventually, we filter the remaining point cloud and calculate the objects' sizes and spatial centroid coordinates.

The above operation is performed for each piece of 2-D semantic information (except dynamic objects, e.g., people and dogs) in the current keyframe to obtain the 3-D semantic object data. During the operation of the system, the 3-D semantic object database can be continuously merged or updated according to the object class, centroid, and size information. By publishing this database through the ROS interface, the semantic metric maps can be visualized.

IV. EXPERIMENTAL RESULTS

In this section, we experimentally evaluate and demonstrate the SG-SLAM system in four aspects. First, the tracking performance is evaluated on two public datasets. Second, we demonstrate the effectiveness of the dynamic feature rejection strategy and analyze the advantages of the fusion algorithm over the individual algorithms. Next, the system's real-time performance is evaluated. Finally, the visualization of the semantic objects and the global Octo-map is shown. The experiments were performed mainly on the NVIDIA Jetson AGX Xavier development kit with Ubuntu 18.04 as the system environment.

A. Performance Evaluation on TUM RGB-D Dataset

The TUM RGB-D dataset [35] is a large dataset provided by the Technical University of Munich Computer Vision Group to create a novel benchmark for visual odometry and SLAM systems. To evaluate the accuracy and robustness of the SG-SLAM system in dynamic scenes, the experiments mainly use five sequences under the dynamic objects category of the dataset. The first four are high dynamic scene sequences; the fifth, as a supplement, is a low dynamic scene sequence.

There are two main error evaluation metrics for the experiment. One is the absolute trajectory error (ATE), which directly measures the difference between the ground-truth trajectory and the estimated trajectory. The other is the relative pose error (RPE), which is mainly used to measure rotational drift and translational drift. To evaluate the improvement in performance relative to the original system, the experimental results of SG-SLAM were compared with ORB-SLAM2. The evaluation comparison results on the five dynamic scene sequences are shown in Tables I–III.

The experimental results in Tables I–III show that our system improves by more than 93% in most metrics on the high dynamic sequences compared to the ORB-SLAM2 system. Figs. 3 and 4 show the ATE and RPE results for the two systems on the five sequences with an RGB-D camera input. As shown in the figures, the accuracy of the estimation results of our system on the high dynamic scene sequences [Figs. 3(a)–(d) and 4(a)–(d)] is significantly higher than that of ORB-SLAM2. In the experiments with low dynamic

Fig. 3. ATE results of SG-SLAM and ORB-SLAM2 running five sequences. (a) fr3/walking_xyz. (b) fr3/walking_static. (c) fr3/walking_rpy. (d) fr3/walking_halfsphere. (e) fr3/sitting_static.

Fig. 4. RPE results of SG-SLAM and ORB-SLAM2 running five sequences. (a) fr3/walking_xyz. (b) fr3/walking_static. (c) fr3/walking_rpy. (d) fr3/walking_halfsphere.
(e) fr3/sitting_static.

scene sequences [Figs. 3(e) and 4(e)], the accuracy improvement is only 31.03% because the area and magnitude of dynamic object activity are small.

To further evaluate the effectiveness of the proposed algorithm, it is also compared with M-removal DVO [22], RDS-SLAM [26], ORB-SLAM3 [36], and other similar algorithms. The results are shown in Table IV. Although the DynaSLAM system, which uses pixel-level semantic segmentation, achieves a slight lead in individual sequence results, its real-time performance is weak (as shown in Table VII). The other methods have difficulty achieving the highest accuracy because of the shortcomings described in Section II. Overall, the experimental results show that SG-SLAM achieves a state-of-the-art level in terms of average accuracy improvement over all sequences.

B. Performance Evaluation on Bonn RGB-D Dataset

The Bonn RGB-D Dynamic Dataset [37], provided by Bonn University in 2019, contains 24 dynamic sequences for the evaluation of RGB-D SLAM. To validate the generalization performance of the dynamic feature rejection algorithm, we performed another experimental evaluation using this dataset.

The experiment mainly selected nine representative sequences from the dataset. Among them, the "crowd" sequences are scenes of three people walking randomly in a room. The "moving no box" sequences show a person moving a box from the floor to a desk. The "person tracking" sequences are scenes where the camera is tracking a walking person. The "synchronous" sequences present scenes of several people jumping together in the same direction over and over again. To evaluate the accuracy of our system, it is mainly compared with the original ORB-SLAM2 system and the current state-of-the-art YOLO-SLAM system.

The evaluation comparison results on the nine dynamic scene sequences are shown in Table V. Only in the two "synchronous" sequences does SG-SLAM not perform as well as YOLO-SLAM. The main reason is that the human jump direction in the scene is similar to the epipolar line direction, leading to different degrees of degeneration of the algorithm [16]. The results in Table V show that our algorithm outperforms the other algorithms in most sequences. This not only proves once again that the SG-SLAM system achieves state-of-the-art accuracy and robustness in dynamic scenes but also proves its generalizability.

TABLE IV
RESULTS OF METRIC ATE

Fig. 5. Dynamic feature rejection effect demonstration. The empirical threshold ε in (b) is 0.2 and in (c) is 1.0. (a) ORB-SLAM2. (b) and (c) SG-SLAM (G). (d) SG-SLAM (S). (e) SG-SLAM (S + G).

C. Effectiveness of Dynamic Feature Rejection Strategy

SG-SLAM combines geometric and semantic information to reject dynamic features, drawing on the advantages and avoiding the disadvantages of both methods. To validate the effectiveness of the fusion of geometric and semantic information, we designed comparative experiments. Fig. 5 shows the results of these methods for detecting dynamic points. First, SG-SLAM (S) denotes a semantic-information-only algorithm for rejecting dynamic feature points. Next, SG-SLAM (G) uses only the geometric algorithm based on the epipolar constraint. Finally, SG-SLAM (S + G) uses the fusion algorithm based on geometric and semantic information. The experimental results are shown in Table VI.

Fig. 5(a) shows the results of ORB-SLAM2 extracting feature points: essentially no dynamic regions are processed. Fig. 5(b) and (c) shows the results of using only the epipolar constraint method at different empirical thresholds. At the low threshold [see Fig. 5(b)], many static feature points are misdetected and rejected (e.g., feature points at the corners of the TV monitor); at the high threshold [see Fig. 5(c)], some dynamic feature points on walking people are missed. Next, Fig. 5(d) shows the results of feature point extraction using only the semantic information method: all feature points around the human body are brutally rejected. Finally, the results of the SG-SLAM system combining semantic and geometric information are shown in Fig. 5(e). SG-SLAM rejects all feature points on the human body and retains as many static feature points outside the human body as possible, and its rejection effect is better than that of the first two algorithms. The two single-information algorithms are each superior in some sequences and inferior in others, whereas the algorithm combining both pieces of information shows the most accurate results in all sequences. From Table VI, the experimental data of each algorithm match the intuitive rejection effect in Fig. 5. This proves the effectiveness of fusing geometric and semantic information.

D. Timing Analysis

As a basic component of robot state estimation, the speed of SLAM directly affects the smooth execution of higher-level tasks. Thus, we tested the average time cost of processing each frame when the system is running and compared it with other systems.

The timing results and hardware platforms are shown in Table VII. Since systems such as DS-SLAM, DynaSLAM, and YOLACT-based SLAM use pixel-level semantic segmentation networks, their average time cost per frame is expensive. YOLO-SLAM uses the end-to-end YOLO fast object detection algorithm, but it is very slow due to limitations such as system architecture optimization and hardware performance. The SG-SLAM system significantly increases frame processing speed by using multithreading, the SSD object detection algorithm, and data multiplexing mechanisms. Compared to ORB-SLAM2, our work increases the average processing time per frame by less than 10 ms, which can meet the real-time performance requirements of mobile robots.

E. Semantic Mapping

To show the actual semantic mapping effect, the SG-SLAM system conducts mapping experiments on the TUM RGB-D dataset and the OpenLORIS-Scene dataset [38]. OpenLORIS-Scene is a dataset recorded by robots in real scenes, using a motion capture system to obtain real trajectories. This dataset is intended to help evaluate the maturity of SLAM and scene understanding algorithms in real deployments.

TABLE V
RESULTS OF METRIC ATE

TABLE VI
RESULTS OF METRIC ATE

TABLE VII
TIME ANALYSIS

Fig. 6. Semantic object map for the fr3_walking_xyz sequence.

Fig. 7. (a) Semantic object map and (b) global Octo-map for the cafe1-2 sequence of the OpenLORIS-Scene dataset.

Fig. 6 shows the semantic object mapping effect of SG-SLAM in the fr3_walking_xyz sequence of the TUM RGB-D dataset. Fig. 7(a) and (b) shows the semantic object map and the global Octo-map built in the cafe1-2 sequence of the OpenLORIS-Scene dataset, respectively. The coordinates of the objects shown in the map are transformed from the origin point where the SLAM system is running.
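The continuous merge-or-update of the 3-D semantic object database described earlier (matching by object class, centroid, and size) can be sketched as follows. The merge rule here (average nearby centroids of the same class, keep the larger extent) is an illustrative assumption, not the paper's exact policy:

```python
def update_object_db(db, obj, merge_dist=0.5):
    """Merge a newly observed 3-D semantic object into the database, or append it.

    db:  list of dicts {"cls": str, "centroid": (x, y, z), "size": (w, h, d)}
    obj: new observation in the same format (field names are illustrative).
    """
    def dist2(a, b):
        return sum((i - j) ** 2 for i, j in zip(a, b))

    for entry in db:
        # Same class seen near a known object: fuse the two estimates.
        if entry["cls"] == obj["cls"] and \
           dist2(entry["centroid"], obj["centroid"]) < merge_dist ** 2:
            entry["centroid"] = tuple((i + j) / 2
                                      for i, j in zip(entry["centroid"], obj["centroid"]))
            entry["size"] = tuple(max(i, j)
                                  for i, j in zip(entry["size"], obj["size"]))
            return db
    db.append(obj)  # previously unseen object
    return db

db = []
update_object_db(db, {"cls": "monitor", "centroid": (1.0, 0.0, 2.0), "size": (0.5, 0.3, 0.1)})
update_object_db(db, {"cls": "monitor", "centroid": (1.2, 0.0, 2.0), "size": (0.5, 0.3, 0.1)})
update_object_db(db, {"cls": "chair",   "centroid": (3.0, 0.0, 1.0), "size": (0.6, 0.9, 0.6)})
print(len(db), db[0]["centroid"])  # -> 2 (1.1, 0.0, 2.0)
```

Publishing such a database over a ROS topic is what lets rviz render the object map alongside the Octo-map.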
The semantic metric map and the global Octo-map not only enable mobile robots to navigate and avoid obstacles but also enable them to understand scenes at a higher level and perform advanced tasks.

V. CONCLUSION

This article presents SG-SLAM, a real-time semantic visual SLAM toward dynamic scenes with an RGB-D camera input. SG-SLAM adds two new threads on top of ORB-SLAM2: the object detecting thread and the semantic mapping thread. The system significantly improves real-time performance, accuracy, and robustness in dynamic scenes with the dynamic feature rejection algorithm. The semantic mapping thread reuses the 2-D semantic information to build the semantic object map with object coordinates and the global Octo-map. Experiments prove that improved traditional algorithms can achieve superior performance when deep learning is introduced and coupled with proper engineering implementations.

There are still some disadvantages of the system that need to be addressed in the future: for example, the degeneration problem of dynamic objects moving along the epipolar line direction, which can cause the dynamic feature rejection algorithm to fail; improving the precision of the semantic metric map; quantitative experimental analysis; and so on.

REFERENCES

[1] H. Durrant-Whyte and T. Bailey, "Simultaneous localization and mapping: Part I," IEEE Robot. Autom. Mag., vol. 13, no. 2, pp. 99–110, Jun. 2006.
[2] S. Shirmohammadi and A. Ferrero, "Camera as the instrument: The rising trend of vision based measurement," IEEE Instrum. Meas. Mag., vol. 17, no. 3, pp. 41–47, Jun. 2014.
[3] G. Klein and D. Murray, "Parallel tracking and mapping for small AR workspaces," in Proc. 6th IEEE ACM Int. Symp. Mixed Augmented Reality, Nov. 2007, pp. 225–234.
[4] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, Oct. 2017.
[5] C. Kerl, J. Sturm, and D. Cremers, "Dense visual SLAM for RGB-D cameras," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Nov. 2013, pp. 2100–2106.
[6] A. Rosinol, M. Abate, Y. Chang, and L. Carlone, "Kimera: An open-source library for real-time metric-semantic localization and mapping," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020, pp. 1689–1696.
[7] C. Cadena et al., "Past, present, and future of simultaneous localization and mapping: Toward the robust-perception age," IEEE Trans. Robot., vol. 32, no. 6, pp. 1309–1332, Dec. 2016.
[8] I. Kostavelis and A. Gasteratos, "Semantic mapping for mobile robotics tasks: A survey," Robot. Auton. Syst., vol. 66, pp. 86–103, Apr. 2015.
[9] B. Bescos, J. M. Fácil, J. Civera, and J. L. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 4076–4083, Oct. 2018.
[10] C. Yu et al., "DS-SLAM: A semantic visual SLAM towards dynamic environments," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 1168–1174.
[11] W. Wu, L. Guo, H. Gao, Z. You, Y. Liu, and Z. Chen, "YOLO-SLAM: A semantic SLAM system towards dynamic environment with geometric constraint," Neural Comput. Appl., vol. 34, pp. 1–16, Apr. 2022.
[12] J. Chang, N. Dong, and D. Li, "A real-time dynamic object segmentation framework for SLAM system in dynamic scenes," IEEE Trans. Instrum. Meas., vol. 70, pp. 1–9, 2021.
[13] M. Quigley et al., "ROS: An open-source robot operating system," in Proc. ICRA Workshop Open Source Softw., Kobe, Japan, 2009, vol. 3, no. 3, p. 5.
[14] Tencent. (2017). NCNN. [Online]. Available: https://github.com/Tencent/ncnn
[15] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, "OctoMap: An efficient probabilistic 3D mapping framework based on octrees," Auton. Robots, vol. 34, no. 3, pp. 189–206, 2013.
[16] A. Kundu, K. M. Krishna, and J. Sivaswamy, "Moving object detection by multi-view geometric techniques from a single camera mounted robot," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2009, pp. 4306–4312.
[17] R. Hartley and A. Zisserman, Multiple View Geometry in Computer Vision. Cambridge, U.K.: Cambridge Univ. Press, 2003.
[18] M. A. Fischler and R. Bolles, "Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography," Commun. ACM, vol. 24, no. 6, pp. 381–395, 1981.
[19] M. Piaggio, R. Fornaro, A. Piombo, L. Sanna, and R. Zaccaria, "An optical-flow person following behaviour," in Proc. IEEE Int. Symp. Intell. Control (ISIC), IEEE Int. Symp. Comput. Intell. Robot. Autom. (CIRA), Intell. Syst. Semiotics (ISAS), 1998, pp. 301–306.
[20] D. Nguyen, C. Hughes, and J. Horgan, "Optical flow-based moving-static separation in driving assistance systems," in Proc. IEEE 18th Int. Conf. Intell. Transp. Syst., Sep. 2015, pp. 1644–1651.
[21] T. Zhang, H. Zhang, Y. Li, Y. Nakamura, and L. Zhang, "FlowFusion: Dynamic dense RGB-D SLAM based on optical flow," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020, pp. 7322–7328.
[22] Y. Sun, M. Liu, and M. Q.-H. Meng, "Motion removal for reliable RGB-D SLAM in dynamic environments," Robot. Auton. Syst., vol. 108, pp. 115–128, Oct. 2018.
[23] L. Zhang, L. Wei, P. Shen, W. Wei, G. Zhu, and J. Song, "Semantic SLAM based on object detection and improved octomap," IEEE Access, vol. 6, pp. 75545–75559, 2018.
[24] J. Redmon and A. Farhadi, "YOLO9000: Better, faster, stronger," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 7263–7271.
[25] L. Xiao, J. Wang, X. Qiu, Z. Rong, and X. Zou, "Dynamic-SLAM: Semantic monocular visual localization and mapping based on deep learning in dynamic environment," Robot. Auton. Syst., vol. 117, pp. 1–16, Jul. 2019.
[26] Y. Liu and J. Miura, "RDS-SLAM: Real-time dynamic SLAM using semantic segmentation methods," IEEE Access, vol. 9, pp. 23772–23785, 2021.
[27] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proc. ICCV, Jun. 2017, pp. 2961–2969.
[28] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder–decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Jan. 2017.
[29] Ó. M. Mozos, R. Triebel, P. Jensfelt, A. Rottmann, and W. Burgard, "Supervised semantic labeling of places using information extracted from sensor data," Robot. Auton. Syst., vol. 55, no. 5, pp. 391–402, May 2007.
[30] C. Nieto-Granda, J. G. Rogers, A. J. B. Trevor, and H. I. Christensen, "Semantic map partitioning in indoor environments using regional analysis," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2010, pp. 1451–1456.
[31] N. Sunderhauf, T. T. Pham, Y. Latif, M. Milford, and I. Reid, "Meaningful maps with object-oriented semantic mapping," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Sep. 2017, pp. 5079–5085.
[32] W. Liu et al., "SSD: Single shot MultiBox detector," in Proc. Eur. Conf. Comput. Vis. Cham, Switzerland: Springer, 2016, pp. 21–37.
[33] A. Howard et al., "Searching for MobileNetV3," in Proc. IEEE/CVF Int. Conf. Comput. Vis., Oct. 2019, pp. 1314–1324.
[34] M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes challenge 2007 results," 2008. [Online]. Available: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html
[35] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst., Oct. 2012, pp. 573–580.
[36] C. Campos, R. Elvira, J. J. G. Rodriguez, J. M. M. Montiel, and J. D. Tardos, "ORB-SLAM3: An accurate open-source library for visual, visual–inertial, and multimap SLAM," IEEE Trans. Robot., vol. 37, no. 6, pp. 1874–1890, Dec. 2021.
[37] E. Palazzolo, J. Behley, P. Lottes, P. Giguere, and C. Stachniss, "ReFusion: 3D reconstruction in dynamic environments for RGB-D cameras exploiting residuals," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Nov. 2019, pp. 7855–7862.
[38] X. Shi et al., "Are we ready for service robots? The OpenLORIS-scene datasets for lifelong SLAM," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), May 2020, pp. 3139–3145.

Shuhong Cheng was born in Daqing, Heilongjiang, China, in 1978. She received the B.S., M.S., and Ph.D. degrees from Yanshan University, Qinhuangdao, China, in 2001, 2007, and 2012, respectively. She studied as a Visiting Scholar at the University of Reading, Reading, U.K., in 2014. After her Ph.D. degree, she has been working as a Professor at Yanshan University since 2019. She has published about 50 papers in journals and international conferences and holds eight computer software copyrights. She has been granted more than four Chinese invention patents. Since 2012, she has presided over and undertaken more than ten national projects. Her current research interests are in rehabilitation robots, assistive robots for the disabled and the elderly, and computer vision.

Shijun Zhang (Student Member, IEEE) was born in Lianyungang, China, in 1993. He received the bachelor's and master's degrees in control engineering from Yanshan University, Qinhuangdao, China, in 2016 and 2019, respectively, where he is currently pursuing the Ph.D. degree in mechanical engineering. His main research directions include mobile robot control and perception, computer vision, and deep learning.

Changhe Sun was born in Tangshan, China, in 1996. He received the bachelor's degree in communication engineering from the Chongqing University of Technology, Chongqing, China, in 2019. He is currently pursuing the master's degree with the School of Electrical Engineering, Yanshan University, Qinhuangdao, China. His main research interests include simultaneous localization and mapping (SLAM), computer vision, and robotics.

Dianfan Zhang was born in Jilin, China, in 1978. He received the bachelor's and master's degrees in control engineering and the Ph.D. degree from Yanshan University, Qinhuangdao, China, in 2001, 2006, and 2010, respectively. His main research directions include mobile robot control and signal processing.

diff --git a/动态slam/2020年-2022年开源动态SLAM/2022年/The_STDyn-SLAM_A_Stereo_Vision_and_Semantic_Segmentation_Approach_for_VSLAM_in_Dynamic_Outdoor_Environments.pdf b/动态slam/2020年-2022年开源动态SLAM/2022年/The_STDyn-SLAM_A_Stereo_Vision_and_Semantic_Segmentation_Approach_for_VSLAM_in_Dynamic_Outdoor_Environments.pdf
new file mode 100644
index 0000000..d43aefc
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/2022年/The_STDyn-SLAM_A_Stereo_Vision_and_Semantic_Segmentation_Approach_for_VSLAM_in_Dynamic_Outdoor_Environments.pdf
@@ -0,0 +1,520 @@
Received January 10, 2022, accepted January 27, 2022, date of publication February 7, 2022, date of current version February 18, 2022.
Digital Object Identifier 10.1109/ACCESS.2022.3149885

The STDyn-SLAM: A Stereo Vision and Semantic Segmentation Approach for VSLAM in Dynamic Outdoor Environments

DANIELA ESPARZA AND GERARDO FLORES, (Member, IEEE)
Laboratorio de Percepción y Robótica (LAPyR), Centro de Investigaciones en Óptica (CIO), León, Guanajuato 37150, Mexico
Corresponding author: Gerardo Flores (gflores@cio.mx)
This work was supported in part by the Consejo Nacional de Ciencia y Tecnología (CONACYT), Fondo Institucional de Fomento Regional para el Desarrollo Científico, Tecnológico y de Innovación (FORDECYT) under Grant 292399.

ABSTRACT Visual Simultaneous Localization and Mapping (VSLAM) is a system based on the scene's features to estimate a map and the system pose. Commonly, VSLAM algorithms are focused on a static environment; however, dynamic objects are present in the vast majority of real-world applications. This work presents a feature-based SLAM system focused on dynamic environments that uses convolutional neural networks, optical flow, and depth maps to detect objects in the scene. The proposed system employs a stereo camera as the primary sensor to capture the scene. The neural network is responsible for object detection and segmentation to avoid erroneous maps and wrong system locations. Moreover, the proposed system's processing time is fast, and it can run in real time in outdoor and indoor environments. The proposed approach has been compared with the state-of-the-art; besides, we present several outdoor experimental results that corroborate the approach's effectiveness. Our code is available online.

INDEX TERMS VSLAM, dynamic environment, stereo vision, neural network.

I. INTRODUCTION

Simultaneous Localization and Mapping (SLAM) systems are strategic for developing upcoming navigation techniques. This is mainly due to their fundamental utility in solving autonomous exploration tasks in unknown environments such as mines, highways, farmlands, underwater/aerial environments, and, in broad terms, indoor and outdoor scenes. The problem of SLAM for indoor environments has been investigated for years, where usually RGB-D cameras or Lidars are the primary sensors to capture scenes [1]–[3]. Indoors, dynamic objects are usually more controllable, unlike outdoors, where dynamic objects are inherent to the scene.

On the other hand, the vast majority of SLAM systems are focused on the assumption of static environments, such as HECTOR-SLAM [4], Kintinuous [5], MonoSLAM [6], PTAM [7], SVO [8], and LSD-SLAM [9], among others. Since this assumption is strong, such systems are restricted to work in static environments. However, in dynamic environments, moving objects can generate an erroneous map and wrong poses, because dynamic features cause a bad pose estimation and incorrect data. For this reason, new approaches have arisen for solving the dynamic environment problem, such as NeuroSLAM [10], hierarchical Outdoor SLAM [11], and Large-Scale Outdoor SLAM [12].

In this work, we propose a method called STDyn-SLAM for solving the VSLAM problem in dynamic outdoor environments using stereo vision [19]. Fig. 1 depicts a sketch of our proposal in real experiments. The first row shows the input images, where a potentially dynamic object is present in the scene and is detected by a semantic segmentation neural network. Fig. 1d depicts the 3D reconstruction excluding dynamic objects. To evaluate our system, we carried out experiments in different outdoor scenes, and we qualitatively compared the 3D reconstructions taking into account the exclusion of dynamic objects. We conducted experiments using sequences from the KITTI dataset, and they are compared with state-of-the-art systems. Furthermore, our approach is implemented in ROS, in which we use the depth image from a stereo camera to make the 3D reconstruction using the octomap. Also, we analyzed the processing time using different datasets. Further, we publish our code on GitHub.1 Also, a video is available on YouTube. The main contributions are itemized as follows:

• We propose a stereo SLAM for dynamic environments using a semantic segmentation neural network and geometrical constraints to eliminate the dynamic objects.
• We use the depth image from a stereo camera to make the 3D reconstruction using the octomap. The depth image is not necessary for the SLAM process.
• This work was tested using the KITTI and EuRoC MAV datasets, and we compared our system with the stereo-configuration systems from the state-of-the-art. In addition, we obtained results from outdoor and indoor environments of our own sequences.
• Some results are shown in a YouTube video, and the STDyn-SLAM is available as a GitHub repo.

The rest of the paper is structured as follows. Section II mentions the related work on SLAM in dynamic environments. Then, in Section III, we show the main results and the STDyn-SLAM algorithm. Section IV presents the real-time experiments of STDyn-SLAM in outdoor environments with moving objects; we compare our approach with state-of-the-art methods using the KITTI dataset. Finally, the conclusions and future work are given in Section V.

The associate editor coordinating the review of this manuscript and approving it for publication was Sudipta Roy.

1 https://github.com/DanielaEsparza/STDyn-SLAM

VOLUME 10, 2022. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/

D. Esparza, G. Flores: STDyn-SLAM: Stereo Vision and Semantic Segmentation Approach for VSLAM

TABLE 1. This table shows the state-of-the-art SLAM problem considering dynamic environments.

FIGURE 1. The STDyn-SLAM results in scenes with moving objects. First row: input images with two dynamic objects. Second row: 3D reconstruction performed by the STDyn-SLAM discarding moving objects.

II. RELATED WORK

A. CLASSIC APPROACHES

The classical methods do not consider artificial intelligence. Some of these approaches are based on optical flow, epipolar geometry, or a combination of the two. For example, in [20], Yang et al. propose a SLAM system using an RGB-D camera and two encoders for estimating the pose and building an OctoMap. The dynamic pixels are removed using an object detector and K-means to segment the point cloud. On the other hand, in [21], Gimenez et al. present a CP-SLAM based on continuous probabilistic mapping and a Markov random field; they use the iterated conditional modes. Wang et al. [22] propose a SLAM system for indoor environments based on an RGB-D camera.
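As an aside from the survey: like the RGB-D pipelines above, STDyn-SLAM turns depth images into 3-D points before feeding the octomap. A minimal pinhole back-projection sketch follows; the toy intrinsics and function name are illustrative, not the authors' implementation:

```python
def backproject(depth, fx, fy, cx, cy, scale=1.0):
    """Back-project a row-major depth image into a list of (X, Y, Z) points.

    fx, fy, cx, cy are the pinhole intrinsics; `scale` converts stored depth
    values to metres (e.g. 1000.0 for millimetre depth maps).
    """
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:          # invalid / missing depth
                continue
            z = z / scale
            points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

# 2x2 toy depth image with unit intrinsics centred at the image origin:
pts = backproject([[1.0, 2.0], [0.0, 4.0]], fx=1.0, fy=1.0, cx=0.0, cy=0.0)
print(pts)  # -> [(0.0, 0.0, 1.0), (2.0, 0.0, 2.0), (4.0, 4.0, 4.0)]
```

Each back-projected point, expressed in the camera frame, is then transformed by the estimated pose before being inserted into the occupancy map.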
They use the number of features on the static scene and assume that the parallax between consecutive images is a movement constraint. In [23], Cheng, Sun, and Meng implement an optical-flow and five-point-algorithm approach to obtain dynamic features. In [24], Ma and Jia proposed a visual SLAM for dynamic environments, detecting the moving objects in the scene using optical flow. Furthermore, they use the RANSAC algorithm to improve the computation of the homography matrix. In [25], Sun et al. proposed an RGB-D system for detecting moving objects based on ego-motion, compensating for the camera movement and then obtaining the frame difference; the frame difference helps to detect the moving object. After that, Sun et al. proposed in [26] an RGB-D system for motion removal based on a foreground model. This system does not require prior information.

B. ARTIFICIAL-INTELLIGENCE-BASED APPROACHES

Thanks to the growing use of deep learning, researchers have proposed several SLAM systems using artificial-intelligence-based approaches. Table 1 summarizes the state-of-the-art in this regard. Some works, such as Dosovitskiy et al. [27], Ilg et al. [28], and Mayer et al. [29], used optical flow and supervised learning for detecting and segmenting moving objects.

In [30], Xu et al. proposed an instance segmentation of the objects in the scene based on the COCO dataset [31]. The geometric and motion properties are detected and used to improve the mask boundaries. Also, they tracked the visible and moving objects and estimated the system's pose. Several works are based on RGB-D cameras, such as [15], [17], and [18]. Cui and Ma [15] proposed SOF-SLAM, an RGB-D system based on ORB-SLAM2, which combines a neural network for semantic segmentation with optical flow for removing dynamic features. Zhao et al. [17] proposed an RGB-D framework for dynamic scenes, where they combined Mask R-CNN, edge refinement, and optical flow to detect the probably dynamic objects. Henein et al. [18] proposed a system based on an RGB-D camera and proprioceptive sensors for tackling the SLAM problem. They employ a factor-graph model and an instance-level object segmentation algorithm for the classification of objects and the tracking of features. The proprioceptive sensors are used to estimate the camera pose. Also, some works use a monocular camera, for instance, the DSOD-SLAM presented in [16]. Ma et al. employ a semantic segmentation network, a depth prediction network, and geometry properties to improve the results in dynamic environments. Our work is built on the well-known ORB-SLAM2 [32], taking some ideas from the DS-SLAM system [33]. In DS-SLAM, the authors used stored images from an RGB-D camera for solving the SLAM problem in indoor dynamic environments. Nevertheless, the depth map obtained from an RGB-D camera is hard to use in external environments. In [34], Cheng et al. proposed a SLAM system for building a semantic map in dynamic environments using a CRF-RNN for segmenting objects. Bescos et al. in [14] proposed a system for object detection using Mask R-CNN, and their method inpaints the background using the information from previous images. An update of [14] is [35], where Bescos et al. proposed a visual SLAM based on the trajectories of the objects and a bundle adjustment.

FIGURE 2. A block diagram showing the algorithm steps of the STDyn-SLAM.

III. METHODS

In this section, we present and describe the framework of the STDyn-SLAM with all the parts that compose it. A block diagram describing the framework's pipeline is depicted in Fig. 2, where the inputs at the time instant t are the stereo pair, the depth image, and the left image captured at t − 1 (aka the previous left image). The process starts with extracting ORB features in the stereo pair and the past left image. Then, it follows the optical flow and epipolar geometry image processing. Next, the neural network segments potentially natural dynamic objects among all the objects in the scene. It is here where the NN depicted in Fig. 2 is introduced. In the NN block of that figure, a semantic segmentation neural network is shown, with the left image as input and a segmented image with the object of interest as output. This NN is a pixel-wise classification and segmentation framework.
The STDyn-SLAM implements a particular NN +moving objects parallelly in the current left image. To remove of this kind called SegNet [37], which is an encoder-decoder +outliers (features inside dynamic objects) and estimate the network based on the VGG-16 model [38]. The encoder +visual odometry, it is necessary to computation the semantic of this NN architecture counts with thirteen convolutional +information and the movement checking process. Finally, the layers with batch normalization, a ReLU non-linearity +3D reconstruction is computed from the segmented image, divided into five encoders, and five non-overlapping max- +visual odometry, the current left frame, and the depth image. pooling and sub-sampling layers located at the end of each +These processes are explained in detail in the following encoder. Since each encoder is connected to a corresponding +subsections. decoder, the decoder architecture has the same number + of layers as encoder architecture, and every decoder has +A. STEREO PROCESS an upsampling layer at first. The last layer is a softmax +Motivated by the vast applications of robotics outdoors, classifier. SegNet classifies the pixel-wise using a model +where dynamic objects are presented, we proposed that based on the PASCAL VOC dataset [39], which consists +our STDyn-SLAM system be focused on stereo vision. of twenty classes. The pixel-wise can be classified into +A considerable advantage of this is that the depth estimation one of the following classes: airplane, bicycle, bird, boat, +from a stereo camera is directly given as a distance measure. bottle, bus, car, cat, chair, cow, dining table, dog, horse, +The process described in this part is depicted in Fig. 2, motorbike, person, potted plant, sheep, sofa, train and +where three main tasks are developed: feature extraction, TV/monitor. +optical flow, and epipolar geometry. Let’s begin with the +former. 
Notwithstanding those above, not all feature points in the + left frame are matched in the right frame. For that reason and + The first step of the stereo process is acquiring the left, to save computing resources, the SegNet classifies the objects +right, and depth frames from a stereo camera. Then, a local of interest only on the left input image. +feature detector is applied in the stereo pair and the previous +left image. As a feature detector, we use the Oriented fast 1) OUTLIERS REMOVAL +and Rotated Brief (ORB) feature detector, which throws the +well-known ORB features [36]. Once the ORB features are Once all the previous steps have been accomplished, a thresh- +found, optical flow and a process using epipolar geometry are +conducted. old is selected to determine the features as inlier or outlier. + + To avoid dynamic objects not classified by the neural Fig. 3 depicts the three cases of a mapped feature. Let x1, x2, +network (explained in the following subsection), the STDyn- and x3 denote the ORB features from the previous left image; +SLAM computes optical flow using the previous and current x1, x2, and x3 are the corresponding features from the current +left frames. This step employs a Harris detector to compute left image; X and X represent the homogeneous coordinates +the optical flow. Remember, these features are different from +the ORB ones. The Harris points pair is discarded if at least of x and x , respectively; F is the fundamental matrix; and +one of the points is on the edge corner or close to it. + l1 = FX1, l2 = FX2, and l3 = FX3 are the epipolar lines. + From the fundamental matrix, ORB features, and optical The first and second cases correspond to inliers, x1 is over +flow, we compute the epipolar lines. Thus, we can map l1, and the distance from x2 to l2 is less than the threshold. +the matched features from the current left frame into the The third case is an outlier because the distance from x3 +previous left frame. 
The distance from the corresponding to l3 is greater than the threshold. To compute the distance +epipolar line to the mapped feature into the past left image between the point x and the epipolar line, l , we proceed as +determines an inlier or outlier. Please refer to the remove +outliers section in Fig. 2. Notice that the orb features of the follows, +car in the left image were removed, but the points on the +right frame remain unchanged. This is because removing d(X , l ) = X T FX (1) +the points in the right images adds computational cost and is +unnecessary. (FX )21 + (FX )22 + +B. ARTIFICIAL NEURAL NETWORK’s ARCHITECTURE where the subindex from (FX )1 and (FX )2 denotes the +The approach we use is eliminating the ORB features on element of the epipolar line. If the distance is larger than +dynamic objects. To address this, we need to discern the the threshold, the feature point is considered an outlier, i.e., + a dynamic feature. +18204 + Remember that the SegNet, described before, semantically + segments the left image in object classes. The semantic + segmentation enhances the rejection of ORB features on + the possible dynamic objects. The ORB features inside + + VOLUME 10, 2022 + D. Esparza, G. Flores: STDyn-SLAM: Stereo Vision and Semantic Segmentation Approach for VSLAM + +FIGURE 3. The cases of inliers and outliers. Green: the x1 and x2 are +inliers; the distance from the point to their corresponding epipolar line l +is less than a threshold. Red: x3 is an outlier, since the distance is greater +than the threshold. + + FIGURE 5. The STDyn-SLAM when a static object becomes dynamic. + Images a) and b) corresponds to the left images from a sequence. Image + c) is the 3D reconstruction of the environment; in red dots is the + trajectory. The OctoMap node fills empty areas along the sequence of + images. + +FIGURE 4. Diagram of the ROS nodes of the STDyn-SLAM required to +generate the trajectory and 3D reconstruction. 
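The point-to-epipolar-line test of Eq. (1) can be sketched as follows. This is an illustrative NumPy version, not the authors' ROS/C++ implementation, and the 1-pixel default threshold is an assumed value:

```python
import numpy as np

def epipolar_distance(x_prev, x_curr, F):
    """Distance from the current-frame feature x' to the epipolar line
    l = F X induced by the previous-frame feature x (Eq. (1)).
    Both features are given as (u, v) pixel coordinates."""
    X = np.array([x_prev[0], x_prev[1], 1.0])   # homogeneous coords X
    Xp = np.array([x_curr[0], x_curr[1], 1.0])  # homogeneous coords X'
    l = F @ X                                   # epipolar line in the current frame
    # |X'^T F X| normalized by the first two components of the line
    return abs(Xp @ l) / np.sqrt(l[0] ** 2 + l[1] ** 2)

def is_dynamic(x_prev, x_curr, F, threshold=1.0):
    """Flag the feature pair as a dynamic outlier when the distance
    exceeds the threshold (the 1-pixel default is an assumption)."""
    return epipolar_distance(x_prev, x_curr, F) > threshold
```

For example, with the fundamental matrix of a purely horizontal camera translation, the epipolar lines are the image rows, so a feature that only slides along its row passes the test, while one that drifts vertically between frames is flagged as dynamic.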
FIGURE 4. Diagram of the ROS nodes of the STDyn-SLAM required to generate the trajectory and the 3D reconstruction. The circles represent each process's ROS node, and the arrows are the ROS topics published by the ROS nodes; the continuous arrows depict the final ROS topics.

The ORB features inside segmented objects, and thus possible moving objects, are rejected. The remaining points are matched with the ORB features from the right image.

C. VISUAL ODOMETRY
Because the system is based on ORB-SLAM2, the VSLAM computes the odometry visually. The next step needs the ORB features to estimate the depth for each feature pair. The features are classified as mono and stereo and are necessary to track the camera's pose. Again, this step is merely a process inherited from ORB-SLAM2.

D. 3D RECONSTRUCTION
Finally, the STDyn-SLAM builds a 3D reconstruction from the left, segmented, and depth images using the visual odometry. First, the 3D reconstruction process checks each pixel of the segmented image to reject the points corresponding to the classes of objects selected as dynamic in Section III-B. Then, if the pixel is not considered part of a dynamic object, the equivalent pixel from the depth image is added to the point cloud, and the color assigned to the point is obtained from the left frame. This stage builds a local point cloud only at the current pose of the system; then, the octomap [40] joins and updates the local point clouds into a full point cloud.

FIGURE 6. The 3D reconstruction from STDyn-SLAM in an indoor environment. A moving person appears in the scene, crossing from left to right. The VSLAM system considers the person a dynamic object.

Remark 1: It is essential to mention that we apply the semantic segmentation, optical flow, and geometric constraints only to the left image to avoid increasing the execution time. Moreover, segmenting the right-hand-side frame is unnecessary because the feature selection rejects the ORB features inside dynamic objects from the left image, so the corresponding points from the right frame will not be matched.
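The per-pixel rejection in the 3D reconstruction step can be sketched as a back-projection loop. This is a minimal NumPy illustration under assumed conventions (a pinhole intrinsic matrix K, a depth image in meters, an integer label image from the segmentation, and a hypothetical set of dynamic class IDs); it is not the actual octomap-based implementation:

```python
import numpy as np

def build_local_cloud(depth, labels, color, K, dynamic_classes):
    """Back-project every pixel that is neither invalid nor labeled
    as a dynamic class; the point color is taken from the left frame."""
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    points, colors = [], []
    h, w = depth.shape
    for v in range(h):
        for u in range(w):
            z = depth[v, u]
            if z <= 0.0 or labels[v, u] in dynamic_classes:
                continue  # skip invalid depth and dynamic-object pixels
            x = (u - cx) * z / fx  # pinhole back-projection
            y = (v - cy) * z / fy
            points.append((x, y, z))
            colors.append(tuple(color[v, u]))
    return np.array(points), np.array(colors)
```

Each local cloud built this way corresponds to one camera pose; in the paper's pipeline, the octomap node then fuses the successive local clouds into the full point cloud.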
FIGURE 7. The 3D reconstruction with the presence of static objects (two parked cars) and dynamic objects (a person and two dogs). Notice that the person and dogs are not visualized in the scene due to the effect of the STDyn-SLAM. Fig. a) depicts the static objects. Nevertheless, the vehicles are potentially dynamic objects; thus, in Fig. b), the STDyn-SLAM excludes their bodies, considering their possible movement.

IV. EXPERIMENTS
This section tests our algorithm STDyn-SLAM in real-time scenes and on the KITTI datasets. Our system was compared with other state-of-the-art systems to evaluate the 3D reconstruction and the odometry. The results of the 3D map were measured qualitatively because of the nature of the experiment. We employ the Absolute Pose Error (APE) metric for the odometry.

A. HARDWARE AND SOFTWARE SETUP
We tested our system on an Intel Core i7-7820HK laptop computer with 32 GB of RAM and a GeForce GTX 1070 GPU. Moreover, we used as input a ZED camera, a stereo camera developed by Stereolabs. We selected the HD720 resolution. The ZED camera resolutions are WVGA (672 × 376), HD720 (1280 × 720), HD1080 (1920 × 1080), and 2.2K (2208 × 1242).

The STDyn-SLAM is developed natively on ROS. Our system's main inputs are the left and right images, while the depth map is needed to build the point cloud. However, if the depth map is not available, it is possible to execute the STDyn-SLAM with only the stereo images and then obtain the trajectory. The STDyn node in ROS generates two main topics: Odom and ORB_SLAM2_PointMap_SegNetM/Point_Clouds. The point cloud topic is the input of the octomap_server node; this node publishes the joined point cloud of the scene.

Fig. 4 depicts the ROS nodes required by the STDyn-SLAM to generate the trajectory and the 3D reconstruction. The camera node publishes the stereo images and computes the depth map from the left and right frames. Then, the STDyn-SLAM calculates the odometry and the local point cloud. The OctoMap combines and updates the current local point cloud with the previous global map to visualize the global point cloud. It is worth mentioning that the user can choose the maximum depth of the local point cloud. All the ROS topics can be shown through the viewer.

B. REAL-TIME EXPERIMENTS
We present real-time experiments under three different scenarios, explained next.

First, we test the STDyn-SLAM in an outdoor environment where a car is parked and then moves forward. In this case, a static object (a car) becomes dynamic; see Fig. 5. This figure shows the 3D reconstruction: the car appears static in the first images of the sequence (Fig. 5 a). Then, the car becomes a dynamic object when it moves forward (Fig. 5 b), and the STDyn-SLAM is capable of filling the empty zone once the scene is covered again, as in Fig. 5 c).

The second experiment tests our system in an indoor environment. The scene consists of a moving person crossing from left to right. Subfigures a and b of Fig. 6 depict the left and right images, and subfigure c shows the 3D reconstruction. The area occupied by the moving person is filled after the zone becomes visible again.

The third experiment consists of a scene sequence with two parked cars, a walking person, and a dog. Even though the vehicles are static, the rest of the objects move. Fig. 7a shows the scene taking into account the potentially dynamic entities. However, since a car can change its position, the STDyn-SLAM excludes the probable moving bodies (the parked cars) to avoid plotting them multiple times throughout the reconstruction. This is depicted in Fig. 7b.

As a fourth experiment, we compared the point clouds from RTABMAP and STDyn-SLAM. The sequence was carried out outdoors with a walking person and two dogs. Since RTABMAP generates a point cloud of the scene, we decided to compare it with our system. To build the 3D reconstructions from RTABMAP, we provided left and depth images, camera info, and odometry as inputs. We used stereo and depth images; the intrinsic parameters are saved in a text file in the ORB-SLAM2 package. Fig. 8 shows the 3D reconstructions. In Fig. 8a, our system excludes the dynamic objects. On the other hand, in Fig. 8b, RTABMAP plotted the dynamic objects on different sides of the scene, resulting in an incorrect map of the environment.

FIGURE 8. Experiment comparison between the STDyn-SLAM and RTABMAP [41]. Image a) shows the 3D reconstruction given by STDyn-SLAM; it eliminates the dynamic objects' effect on the mapping. Image b) shows the point cloud created by RTABMAP; notice how dynamic objects are mapped along the trajectory. This is undesirable behavior.

TABLE 2. Comparison of Absolute Pose Error (APE) on the KITTI dataset.
TABLE 3. Comparison of Absolute Pose Error (APE) on the Euroc-Mav dataset.
TABLE 4. Comparison of Relative Pose Error (RPE) on the KITTI dataset.
TABLE 5. Comparison of Relative Pose Error (RPE) on the Euroc-Mav dataset.

C. COMPARISON OF THE STATE OF THE ART AND OUR SLAM USING THE KITTI AND EurocMav DATASETS
We compare our VSLAM with the DynaSLAM1 [14] and ORB-SLAM2 approaches. We selected sequences with dynamic objects, with and without loop closure, to evaluate the SLAM systems. Therefore, we chose the 00−10 sequences from the KITTI odometry dataset [42], as well as all sequences from the EurocMav dataset except V1_03 and V2_03. Moreover, we employed the EVO tools [43] to evaluate the Absolute Pose Error (APE) and the Relative Pose Error (RPE), and the RGB-D tools [44] to calculate the Absolute Trajectory Error (ATE).

We present the results of APE, RPE, and ATE in different tables, divided by the dataset evaluated. Tables 2 and 3 show the APE experiments on the KITTI and EurocMav datasets, respectively. Tables 4 and 5 correspond to the RPE, and Tables 6 and 7 present the ATE results. We did not evaluate the EurocMav dataset with DynaSLAM1 due to the excessive processing time required to compute the trajectories.

To evaluate the significance of the differences in the ATE evaluation, we computed the Score Sρ [45] over the sequences of the EurocMav and KITTI datasets of Tables 6 and 7. The results in Table 8 show an improvement of our system over ORB-SLAM2 on the trajectories of the EurocMav dataset. On the KITTI dataset, STDyn-SLAM and ORB-SLAM2 are not significantly different. In the evaluation of our system against DynaSLAM1, DynaSLAM1 is slightly better.

TABLE 6. Comparison of Absolute Trajectory Error (ATE) on the KITTI dataset.
TABLE 7. Comparison of Absolute Trajectory Error (ATE) on the Euroc-Mav dataset.
TABLE 8. Comparison of Score Sρ(a, b) on the datasets.
TABLE 9. Processing time.

D. PROCESSING TIME
In this section, we analyze the processing time of this work. For the study, we evaluate several datasets with different types of images. The analysis consists of obtaining the processing time of each sequence with the same characteristics and calculating the average of the sequences' means. Table 9 shows the times obtained with the datasets. We use the KITTI and EurocMav datasets for the RGB and Gray columns. Since those sequences do not provide a depth image, we did not map a 3D reconstruction. For the last column, we utilized our own sequences. Our dataset contains depth images, so we plotted a 3D reconstruction; for this reason, the processing time is longer.

V. CONCLUSION
This work presents the STDyn-SLAM system for outdoor and indoor environments where dynamic objects are present. The STDyn-SLAM is based on images captured by a stereo pair for the 3D reconstruction of scenes, where the possible dynamic objects are discarded from the map; this allows a trustworthy point cloud. The system's capability to compute a reconstruction and localization in real time depends on the computer's processing power, since a GPU is necessary to support the processing. However, with a medium-range computer, the algorithms work correctly.

In the future, we plan to implement an optical-flow approach based on the latest generation of neural networks to improve dynamic object detection. The implementation of neural networks allows replacing classic methods such as geometric constraints. Furthermore, we plan to increase the size of the 3D map to reconstruct larger areas and obtain longer reconstructions of the scenes. The next step is implementing the algorithm on an aerial manipulator constructed in the lab.

SUPPLEMENTARY MATERIAL
The implementation of our system is released on GitHub and is available under the following link: https://github.com/DanielaEsparza/STDyn-SLAM
Besides, this letter has supplementary video material, provided by the authors, available at https://youtu.be/3tnkwvRnUss

REFERENCES
[1] J. Castellanos, J. Montiel, J. Neira, and J. Tardos, "The SPmap: A probabilistic framework for simultaneous localization and map building," IEEE Trans. Robot. Autom., vol. 15, no. 5, pp. 948–952, 1999.
[2] G. Dissanayake, H. Durrant-Whyte, and T. Bailey, "A computationally efficient solution to the simultaneous localisation and map building (SLAM) problem," in Proc. IEEE Int. Conf. Robot. Autom. (ICRA), 2000, pp. 1009–1014.
[3] M. Montemerlo, S. Thrun, D. Koller, and B. Wegbreit, "FastSLAM: A factored solution to the simultaneous localization and mapping problem," in Proc. AAAI Nat. Conf. Artif. Intell., 2002, pp. 593–598.
[4] S. Kohlbrecher, O. von Stryk, J. Meyer, and U. Klingauf, "A flexible and scalable SLAM system with full 3D motion estimation," in Proc. IEEE Int. Symp. Saf., Secur., Rescue Robot., Nov. 2011, pp. 155–160.
[5] T. Whelan, J. McDonald, M. Kaess, M. Fallon, H. Johannsson, and J. J. Leonard, "Kintinuous: Spatially extended KinectFusion," in Proc. RSS Workshop RGB-D: Adv. Reasoning with Depth Cameras, Jul. 2012, pp. 1–10.
[6] A. J. Davison, I. D. Reid, N. D. Molton, and O. Stasse, "MonoSLAM: Real-time single camera SLAM," IEEE Trans. Pattern Anal. Mach. Intell., vol. 29, no. 6, pp. 1052–1067, Jun. 2007.
[7] Y. Kameda, "Parallel tracking and mapping for small AR workspaces (PTAM) augmented reality," J. Inst. Image Inf. Telev. Engineers, vol. 66, no. 1, pp. 45–51, 2012.
[8] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza, "SVO: Semidirect visual odometry for monocular and multicamera systems," IEEE Trans. Robot., vol. 33, no. 2, pp. 249–265, Apr. 2017.
[9] J. Engel, T. Schöps, and D. Cremers, "LSD-SLAM: Large-scale direct monocular SLAM," in Proc. Eur. Conf. Comput. Vis. (ECCV), Cham, Switzerland: Springer, 2014, pp. 834–849.
[10] F. Yu, J. Shang, Y. Hu, and M. Milford, "NeuroSLAM: A brain-inspired SLAM system for 3D environments," Biol. Cybern., vol. 113, nos. 5–6, pp. 515–545, Dec. 2019.
[11] D. Schleicher, L. M. Bergasa, M. Ocana, R. Barea, and M. E. Lopez, "Real-time hierarchical outdoor SLAM based on stereovision and GPS fusion," IEEE Trans. Intell. Transp. Syst., vol. 10, no. 3, pp. 440–452, Sep. 2009.
[12] R. Ren, H. Fu, and M. Wu, "Large-scale outdoor SLAM based on 2D LiDAR," Electronics, vol. 8, no. 6, p. 613, May 2019.
[13] S. Yang and S. Scherer, "CubeSLAM: Monocular 3-D object SLAM," IEEE Trans. Robot., vol. 35, no. 4, pp. 925–938, Aug. 2019.
[14] B. Bescos, J. M. Fácil, J. Civera, and J. Neira, "DynaSLAM: Tracking, mapping, and inpainting in dynamic scenes," IEEE Robot. Autom. Lett., vol. 3, no. 4, pp. 4076–4083, Oct. 2018.
[15] L. Cui and C. Ma, "SOF-SLAM: A semantic visual SLAM for dynamic environments," IEEE Access, vol. 7, pp. 166528–166539, 2019.
[16] P. Ma, Y. Bai, J. Zhu, C. Wang, and C. Peng, "DSOD: DSO in dynamic environments," IEEE Access, vol. 7, pp. 178300–178309, 2019.
[17] L. Zhao, Z. Liu, J. Chen, W. Cai, W. Wang, and L. Zeng, "A compatible framework for RGB-D SLAM in dynamic scenes," IEEE Access, vol. 7, pp. 75604–75614, 2019.
[18] M. Henein, J. Zhang, R. Mahony, and V. Ila, "Dynamic SLAM: The need for speed," in Proc. IEEE Int. Conf. Robot. Automat. (ICRA), 2020, pp. 2123–2129.
[19] S. Trejo, K. Martinez, and G. Flores, "Depth map estimation methodology for detecting free-obstacle navigation areas," in Proc. Int. Conf. Unmanned Aircr. Syst. (ICUAS), Jun. 2019, pp. 916–922.
[20] D. Yang, S. Bi, W. Wang, C. Yuan, W. Wang, X. Qi, and Y. Cai, "DRE-SLAM: Dynamic RGB-D encoder SLAM for a differential-drive robot," Remote Sens., vol. 11, no. 4, p. 380, Feb. 2019.
[21] J. Gimenez, A. Amicarelli, J. M. Toibero, F. di Sciascio, and R. Carelli, "Continuous probabilistic SLAM solved via iterated conditional modes," Int. J. Autom. Comput., vol. 16, no. 6, pp. 838–850, Aug. 2019.
[22] R. Wang, W. Wan, Y. Wang, and K. Di, "A new RGB-D SLAM method with moving object detection for dynamic indoor scenes," Remote Sens., vol. 11, no. 10, p. 1143, May 2019.
[23] J. Cheng, Y. Sun, and M. Q.-H. Meng, "Improving monocular visual SLAM in dynamic environments: An optical-flow-based approach," Adv. Robot., vol. 33, no. 12, pp. 576–589, Jun. 2019.
[24] Y. Ma and Y. Jia, "Robust SLAM algorithm in dynamic environment using optical flow," in Proc. Chin. Intell. Syst. Conf., Singapore: Springer, 2020, pp. 681–689.
[25] Y. Sun, M. Liu, and M. Q.-H. Meng, "Improving RGB-D SLAM in dynamic environments: A motion removal approach," Robot. Auton. Syst., vol. 89, pp. 110–122, Mar. 2017.
[26] Y. Sun, M. Liu, and M. Q.-H. Meng, "Motion removal for reliable RGB-D SLAM in dynamic environments," Robot. Auton. Syst., vol. 108, pp. 115–128, Oct. 2018.
[27] A. Dosovitskiy, P. Fischer, E. Ilg, P. Hausser, C. Hazirbas, V. Golkov, P. V. D. Smagt, D. Cremers, and T. Brox, "FlowNet: Learning optical flow with convolutional networks," in Proc. IEEE Int. Conf. Comput. Vis. (ICCV), Dec. 2015, pp. 2758–2766.
[28] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox, "FlowNet 2.0: Evolution of optical flow estimation with deep networks," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jul. 2017, pp. 2462–2470.
[29] N. Mayer, E. Ilg, P. Hausser, P. Fischer, D. Cremers, A. Dosovitskiy, and T. Brox, "A large dataset to train convolutional networks for disparity, optical flow, and scene flow estimation," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2016, pp. 4040–4048.
[30] B. Xu, W. Li, D. Tzoumanikas, M. Bloesch, A. Davison, and S. Leutenegger, "MID-fusion: Octree-based object-level multi-instance dynamic SLAM," in Proc. Int. Conf. Robot. Automat. (ICRA), May 2019, pp. 5231–5237.
[31] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Proc. Eur. Conf. Comput. Vis. (ECCV), Cham: Springer, 2014, pp. 740–755.
[32] R. Mur-Artal and J. D. Tardós, "ORB-SLAM2: An open-source SLAM system for monocular, stereo, and RGB-D cameras," IEEE Trans. Robot., vol. 33, no. 5, pp. 1255–1262, Oct. 2017.
[33] C. Yu, Z. Liu, X. Liu, F. Xie, Y. Yang, Q. Wei, and Q. Fei, "DS-SLAM: A semantic visual SLAM towards dynamic environments," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2018, pp. 1168–1174.
[34] J. Cheng, Y. Sun, and M. Q.-H. Meng, "Robust semantic mapping in challenging environments," Robotica, vol. 38, no. 2, pp. 256–270, Feb. 2020.
[35] B. Bescos, C. Campos, J. D. Tardos, and J. Neira, "DynaSLAM II: Tightly-coupled multi-object tracking and SLAM," IEEE Robot. Autom. Lett., vol. 6, no. 3, pp. 5191–5198, Jul. 2021.
[36] E. Rublee, V. Rabaud, K. Konolige, and G. Bradski, "ORB: An efficient alternative to SIFT or SURF," in Proc. Int. Conf. Comput. Vis. (ICCV), Nov. 2011, pp. 2564–2571.
[37] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 12, pp. 2481–2495, Dec. 2017.
[38] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," in Proc. Int. Conf. Learn. Represent. (ICLR), San Diego, CA, USA, 2015, pp. 1–14.
[39] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int. J. Comput. Vis., vol. 88, no. 2, pp. 303–338, Sep. 2010.
[40] A. Hornung, K. M. Wurm, M. Bennewitz, C. Stachniss, and W. Burgard, "OctoMap: An efficient probabilistic 3D mapping framework based on octrees," Auton. Robots, vol. 34, no. 3, pp. 189–206, Apr. 2013. [Online]. Available: https://octomap.github.io
[41] M. Labbé and F. Michaud, "Long-term online multi-session graph-based SPLAM with memory management," Auton. Robots, vol. 42, no. 6, pp. 1133–1150, 2018.
[42] A. Geiger, P. Lenz, and R. Urtasun, "Are we ready for autonomous driving? The KITTI vision benchmark suite," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), Jun. 2012, pp. 3354–3361.
[43] EVO: Python Package for the Evaluation of Odometry and SLAM. (2017). [Online]. Available: https://github.com/MichaelGrupp/evo
[44] J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, "A benchmark for the evaluation of RGB-D SLAM systems," in Proc. IEEE/RSJ Int. Conf. Intell. Robots Syst. (IROS), Oct. 2012, pp. 573–580.
[45] R. Muñoz-Salinas and R. Medina-Carnicer, "UcoSLAM: Simultaneous localization and mapping by fusion of keypoints and squared planar markers," Pattern Recognit., vol. 101, May 2020, Art. no. 107193.

DANIELA ESPARZA received the B.S. degree in robotic engineering from the Universidad Politécnica del Bicentenario, México, in 2017, and the master's degree in optomechatronics from the Center for Research in Optics, in 2019, where she is currently pursuing the Ph.D. degree in mechatronics and mechanical design. Her research interests include artificial vision, such as 3D reconstruction and deep learning applied to SLAM, developed on platforms such as mobile robots.

GERARDO FLORES (Member, IEEE) received the B.S. degree (Hons.) in electronic engineering from the Instituto Tecnológico de Saltillo, Mexico, in 2007, the M.S. degree in automatic control from CINVESTAV-IPN, Mexico City, in 2010, and the Ph.D. degree in systems and information technology from the Heudiasyc Laboratory, Université de Technologie de Compiègne–Sorbonne Universités, France, in October 2014. Since August 2016, he has been a full-time Researcher and the Head of the Perception and Robotics Laboratory, Center for Research in Optics, León, Guanajuato, Mexico. His current research interests include the theoretical and practical problems arising from the development of autonomous robotic and vision systems. He has been an Associate Editor of Mathematical Problems in Engineering since 2020.

diff --git a/动态slam/2020年-2022年开源动态SLAM/~$20-2022年开源动态SLAM.docx b/动态slam/2020年-2022年开源动态SLAM/~$20-2022年开源动态SLAM.docx
new file mode 100644
index 0000000..80e1aae
--- /dev/null
+++ b/动态slam/2020年-2022年开源动态SLAM/~$20-2022年开源动态SLAM.docx
@@ -0,0 +1,3 @@
+
+junwen Lai
+junwen Lai77ҵe2yiy
\ No newline at end of file
diff --git a/动态slam/df_vo创建conda环境报错.txt b/动态slam/df_vo创建conda环境报错.txt
new file mode 100644
index 0000000..0755048
--- /dev/null
+++ b/动态slam/df_vo创建conda环境报错.txt
@@ -0,0 +1,846 @@
+jinja2=2.10 -> markupsafe[version='>=0.23|>=0.23,<2']
+_anaconda_depends=2019.03 -> jinja2 -> markupsafe[version='<2.0|>=0.23|>=0.23,<2|>=0.23,<2.1|>=2.0|>=2.0.0rc2|>=2.1.1']
+jupyter=1.0.0 -> nbconvert -> markupsafe[version='>=2.0']
+
+Package pycairo conflicts for:
+nltk=3.4 -> matplotlib -> pycairo
+anaconda=custom -> _anaconda_depends -> pycairo
+_anaconda_depends=2019.03 -> pycairo
+seaborn=0.9.0 -> matplotlib[version='>=1.4.3'] -> pycairo
+scikit-image=0.15.0 -> matplotlib[version='>=2.0.0'] -> pycairo
+
+Package isort conflicts for:
+pylint=2.3.1 -> isort[version='>=4.2.5']
+spyder=3.3.3 -> pylint -> isort[version='>=4.2.5|>=4.2.5,<5|>=4.2.5,<6']
+isort=4.3.16
+_anaconda_depends=2019.03 -> pylint -> isort[version='>=4.2.5|>=4.2.5,<5|>=4.2.5,<6']
+anaconda=custom -> _anaconda_depends -> isort
+_anaconda_depends=2019.03 -> isort
+
+Package pyflakes conflicts for:
+spyder=3.3.3 -> pyflakes
+anaconda=custom -> _anaconda_depends -> pyflakes
+pyflakes=2.1.1
+_anaconda_depends=2019.03 -> pyflakes
+
+Package pycurl conflicts for:
+anaconda=custom -> _anaconda_depends -> pycurl
+pycurl=7.43.0.2
+_anaconda_depends=2019.03 -> pycurl
+
+Package pycodestyle conflicts for:
+spyder=3.3.3 -> pycodestyle
+_anaconda_depends=2019.03 -> pycodestyle
+pycodestyle=2.5.0
+anaconda=custom -> _anaconda_depends -> pycodestyle
+
+Package singledispatch conflicts for:
+distributed=1.26.0 -> singledispatch +ipykernel=5.1.0 -> tornado[version='>=4.0'] -> singledispatch==3.4.0.3 +nltk=3.4 -> singledispatch +matplotlib=3.0.3 -> tornado -> singledispatch==3.4.0.3 +_anaconda_depends=2019.03 -> singledispatch +terminado=0.8.1 -> tornado[version='>=4'] -> singledispatch==3.4.0.3 +jupyter_client=5.2.4 -> tornado[version='>=4.1'] -> singledispatch==3.4.0.3 +bokeh=1.0.4 -> tornado[version='>=4.3'] -> singledispatch==3.4.0.3 +numba=0.43.1 -> singledispatch +dask=1.1.4 -> distributed[version='>=1.26.0'] -> singledispatch +spyder=3.3.3 -> pylint -> singledispatch +anaconda=custom -> _anaconda_depends -> singledispatch +_anaconda_depends=2019.03 -> astroid -> singledispatch==3.4.0.3 +singledispatch=3.4.0.3 +notebook=5.7.8 -> tornado[version='>=4.1,<7'] -> singledispatch==3.4.0.3 +anaconda-project=0.8.2 -> tornado[version='>=4.2'] -> singledispatch==3.4.0.3 +distributed=1.26.0 -> tornado[version='<6.2'] -> singledispatch==3.4.0.3 + +Package gast conflicts for: +gast=0.2.2 +tensorflow=1.13.1 -> gast[version='>=0.2.0'] + +Package cudnn conflicts for: +torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> cudnn[version='7.3.*|>=7.6,<8.0a0|>=7.6.5.32,<8.0a0|>=8.4.1.50,<9.0a0|>=8.8.0.121,<9.0a0|>=8.2.1.32,<9.0a0|>=8.1.0.77,<9.0a0|>=8.9,<9.0a0|>=8.9.2.26,<9.0a0|>=8.2,<9.0a0|>=8.2.1,<9.0a0|>=7.6.5,<8.0a0|>=7.6.4,<8.0a0|>=7.3.1,<8.0a0|>=7.3.0,<=8.0a0'] +cupy=6.0.0 -> cudnn[version='>=7.1.3,<8.0a0|>=7.3.1,<8.0a0'] +pytorch=1.1.0 -> cudnn[version='>=7.3.1,<8.0a0'] +cudnn=7.6.0 +tensorflow=1.13.1 -> tensorflow-base==1.13.1=gpu_py27h8f37b9b_0 -> cudnn[version='>=7.3.1,<8.0a0'] + +Package libdeflate conflicts for: +anaconda=custom -> _anaconda_depends -> libdeflate +_anaconda_depends=2019.03 -> libtiff -> libdeflate[version='>=1.10,<1.11.0a0|>=1.12,<1.13.0a0|>=1.13,<1.14.0a0|>=1.14,<1.15.0a0|>=1.16,<1.17.0a0|>=1.17,<1.18.0a0|>=1.18,<1.19.0a0|>=1.19,<1.20.0a0|>=1.8,<1.9.0a0|>=1.7,<1.8.0a0'] +pillow=6.0.0 -> libtiff[version='>=4.0.9,<4.4.0a0'] -> 
libdeflate[version='>=1.10,<1.11.0a0|>=1.8,<1.9.0a0|>=1.7,<1.8.0a0|>=1.19,<1.20.0a0|>=1.18,<1.19.0a0|>=1.17,<1.18.0a0|>=1.16,<1.17.0a0|>=1.14,<1.15.0a0|>=1.13,<1.14.0a0|>=1.12,<1.13.0a0'] + +Package smart_open conflicts for: +anaconda=custom -> _anaconda_depends -> smart_open +nltk=3.4 -> gensim -> smart_open[version='>=1.2.1|>=1.8.1'] + +Package gmp conflicts for: +nbconvert=5.4.1 -> pandoc[version='>=1.12.1,<2.0.0'] -> gmp=6.1 +mpc=1.1.0 -> mpfr[version='>=4.0.2,<5.0a0'] -> gmp[version='>=6.2.1,<7.0a0'] +gmpy2=2.0.8 -> gmp[version='>=6.1.2|>=6.1.2,<7.0a0'] +gmp=6.1.2 +mpc=1.1.0 -> gmp[version='>=5.0.1,<7|>=6.1.2,<7.0a0|>=6.2.0,<7.0a0|>=6.1.2'] +pandoc=2.2.3.2 -> gmp +gmpy2=2.0.8 -> mpc[version='>=1.1.0,<2.0a0'] -> gmp[version='>=5.0.1,<7|>=6.2.0,<7.0a0|>=6.2.1,<7.0a0'] +mpfr=4.0.1 -> gmp[version='>=6.1.2|>=6.1.2,<7.0a0'] +sympy=1.3 -> gmpy2[version='>=2.0.8'] -> gmp[version='>=6.1.2|>=6.1.2,<7.0a0|>=6.2.0,<7.0a0|>=6.2.1,<7.0a0'] +anaconda=custom -> _anaconda_depends -> gmp +_anaconda_depends=2019.03 -> gmp +_anaconda_depends=2019.03 -> gmpy2 -> gmp[version='6.1.*|>=5.0.1,<7|>=6.1.2|>=6.1.2,<7.0a0|>=6.2.0,<7.0a0|>=6.2.1,<7.0a0'] + +Package numexpr conflicts for: +anaconda=custom -> _anaconda_depends -> numexpr +seaborn=0.9.0 -> pandas[version='>=0.14.0'] -> numexpr[version='>=2.6.8|>=2.7.0|>=2.7.1|>=2.7.3|>=2.8.0'] +_anaconda_depends=2019.03 -> pandas -> numexpr[version='2.0.*|2.1.*|2.2.*|2.3.*|2.4.*|2.5.*|>=2.6.2|>=2.6.8|>=2.7.0|>=2.7.1|>=2.7.3|>=2.8.0'] +_anaconda_depends=2019.03 -> numexpr +numexpr=2.6.9 +dask=1.1.4 -> pandas[version='>=0.19.0,<2.0.0a0'] -> numexpr[version='>=2.6.8|>=2.7.0|>=2.7.1|>=2.7.3|>=2.8.0'] +statsmodels=0.9.0 -> pandas[version='>=0.14'] -> numexpr[version='>=2.6.8|>=2.7.0|>=2.7.1|>=2.7.3|>=2.8.0'] +bkcharts=0.2 -> pandas -> numexpr[version='>=2.6.8|>=2.7.0|>=2.7.1|>=2.7.3|>=2.8.0'] +pytables=3.5.1 -> numexpr + +Package iniconfig conflicts for: +pytest-astropy=0.5.0 -> pytest[version='>=3.1'] -> iniconfig +anaconda=custom -> 
_anaconda_depends -> iniconfig +pytest-remotedata=0.3.1 -> pytest[version='>=3.1'] -> iniconfig +pytest-doctestplus=0.3.0 -> pytest[version='>=3.0'] -> iniconfig +pytest-openfiles=0.3.2 -> pytest[version='>=2.8.0'] -> iniconfig +_anaconda_depends=2019.03 -> pytest -> iniconfig +pytest-arraydiff=0.3 -> pytest -> iniconfig + +Package contextlib2 conflicts for: +contextlib2=0.5.5 +anaconda=custom -> _anaconda_depends -> contextlib2 +_anaconda_depends=2019.03 -> contextlib2 +importlib_metadata=0.8 -> contextlib2 +path.py=11.5.0 -> importlib_metadata[version='>=0.5'] -> contextlib2 + +Package sympy conflicts for: +sympy=1.3 +_anaconda_depends=2019.03 -> sympy +anaconda=custom -> _anaconda_depends -> sympy +torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> sympy + +Package pyodbc conflicts for: +anaconda=custom -> _anaconda_depends -> pyodbc +_anaconda_depends=2019.03 -> pyodbc +pyodbc=4.0.26 + +Package pytorch conflicts for: +torchvision=0.3.0 -> pytorch[version='1.1.*|>=1.1.0'] +pytorch=1.1.0 + +Package qtawesome conflicts for: +anaconda=custom -> _anaconda_depends -> qtawesome +_anaconda_depends=2019.03 -> qtawesome +qtawesome=0.5.7 +spyder=3.3.3 -> qtawesome[version='>=0.4.1'] +_anaconda_depends=2019.03 -> spyder -> qtawesome[version='>=0.4.1|>=0.5.7|>=1.0.2|>=1.2.1'] + +Package exceptiongroup conflicts for: +jupyter_console=6.0.0 -> ipython -> exceptiongroup +ipykernel=5.1.0 -> ipython[version='>=5.0'] -> exceptiongroup +pytest-astropy=0.5.0 -> pytest[version='>=3.1'] -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8'] +pytest-arraydiff=0.3 -> pytest -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8'] +pytest-doctestplus=0.3.0 -> pytest[version='>=3.0'] -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8'] +pytest-remotedata=0.3.1 -> pytest[version='>=3.1'] -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8'] +_anaconda_depends=2019.03 -> ipython -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8'] +ipywidgets=7.4.2 -> ipython[version='>=4.0.0'] -> exceptiongroup +pytest-openfiles=0.3.2 
-> pytest[version='>=2.8.0'] -> exceptiongroup[version='>=1.0.0|>=1.0.0rc8'] + +Package dbus conflicts for: +keyring=18.0.0 -> secretstorage -> dbus[version='>=1.13.18,<2.0a0'] +anaconda=custom -> _anaconda_depends -> dbus +_anaconda_depends=2019.03 -> dbus +_anaconda_depends=2019.03 -> pyqt -> dbus[version='>=1.10.22,<2.0a0|>=1.12.2,<2.0a0|>=1.13.12,<2.0a0|>=1.13.6,<2.0a0|>=1.13.2,<2.0a0|>=1.13.0,<2.0a0|>=1.13.18,<2.0a0'] +pyqt=5.9.2 -> dbus[version='>=1.12.2,<2.0a0|>=1.13.12,<2.0a0|>=1.13.6,<2.0a0|>=1.13.2,<2.0a0'] +matplotlib=3.0.3 -> pyqt -> dbus[version='>=1.10.22,<2.0a0|>=1.12.2,<2.0a0|>=1.13.12,<2.0a0|>=1.13.6,<2.0a0|>=1.13.2,<2.0a0'] +qt=5.9.7 -> dbus[version='>=1.13.2,<2.0a0|>=1.13.6,<2.0a0'] +secretstorage=3.1.1 -> dbus +dbus=1.13.6 +spyder=3.3.3 -> pyqt[version='>=5.6,<5.7'] -> dbus[version='>=1.10.22,<2.0a0|>=1.13.6,<2.0a0|>=1.13.12,<2.0a0|>=1.13.2,<2.0a0|>=1.12.2,<2.0a0'] +qtconsole=4.4.3 -> pyqt -> dbus[version='>=1.10.22,<2.0a0|>=1.12.2,<2.0a0|>=1.13.12,<2.0a0|>=1.13.6,<2.0a0|>=1.13.2,<2.0a0'] + +Package greenlet conflicts for: +_anaconda_depends=2019.03 -> greenlet +anaconda=custom -> _anaconda_depends -> greenlet +gevent=1.4.0 -> greenlet[version='>=0.4.14'] +_anaconda_depends=2019.03 -> bokeh -> greenlet[version='!=0.4.17|0.4.*|>=2.0.0|>=1.1.3,<2.0|>=1.1.0,<2.0|>=0.4.17,<2.0|>=0.4.17|>=0.4.14|>=0.4.13|>=0.4.10|>=0.4.9'] +greenlet=0.4.15 + +Package graphite2 conflicts for: +pango=1.42.4 -> harfbuzz[version='>=2.7.2,<3.0a0'] -> graphite2[version='1.3.*|>=1.3.11,<2.0a0|>=1.3.10,<2.0a0'] +anaconda=custom -> _anaconda_depends -> graphite2 +_anaconda_depends=2019.03 -> graphite2 +pango=1.42.4 -> graphite2[version='>=1.3.12,<2.0a0|>=1.3.13,<2.0a0|>=1.3.14,<2.0a0'] +harfbuzz=1.8.8 -> graphite2[version='>=1.3.11,<2.0a0'] +graphite2=1.3.13 +_anaconda_depends=2019.03 -> harfbuzz -> graphite2[version='1.3.*|>=1.3.14,<2.0a0|>=1.3.13,<2.0a0|>=1.3.11,<2.0a0|>=1.3.10,<2.0a0|>=1.3.12,<2.0a0'] + +Package pthread-stubs conflicts for: +qt=5.9.7 -> libxcb -> 
pthread-stubs +libxcb=1.13 -> pthread-stubs +gst-plugins-base=1.14.0 -> libxcb[version='>=1.14,<2.0a0'] -> pthread-stubs +harfbuzz=1.8.8 -> libxcb[version='>=1.13,<2.0a0'] -> pthread-stubs +cairo=1.14.12 -> libxcb -> pthread-stubs +_anaconda_depends=2019.03 -> libxcb -> pthread-stubs + +Package astropy conflicts for: +astropy=3.1.2 +anaconda=custom -> _anaconda_depends -> astropy +_anaconda_depends=2019.03 -> astropy + +Package pyasn1 conflicts for: +urllib3=1.24.1 -> cryptography[version='>=1.3.4'] -> pyasn1[version='>=0.1.8'] +anaconda=custom -> _anaconda_depends -> pyasn1 +_anaconda_depends=2019.03 -> cryptography -> pyasn1[version='0.1.7|0.1.9|>=0.1.8'] +secretstorage=3.1.1 -> cryptography -> pyasn1[version='0.1.7|0.1.9|>=0.1.8'] + +Package ninja conflicts for: +torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> ninja +ninja=1.9.0 +pytorch=1.1.0 -> ninja + +Package tensorboard conflicts for: +tensorboard=1.13.1 +tensorflow=1.13.1 -> tensorboard[version='1.13.*|>=1.13.0,<1.14.0a0|>=1.13.0,<1.14.0'] + +Package bokeh conflicts for: +anaconda=custom -> _anaconda_depends -> bokeh +dask=1.1.4 -> bokeh[version='>=0.13.0|>=0.13.0,<3.0.0a0'] +_anaconda_depends=2019.03 -> bokeh +bokeh=1.0.4 +_anaconda_depends=2019.03 -> dask -> bokeh[version='<3.0a0|>=0.13.0,<3.0.0a0|>=1.0.0,!=2.0.0,<3.0.0a0|>=2.1.1,<3.0.0a0|>=2.4.2,<3.0.0a0|>=2.4.2|>=2.4.2,!=3.0.*|>=2.4.2,<3|>=1.0.0,<3.0.0a0|>=2.4.2,<3.0|>=2.1.1|>=1.0.0,!=2.0.0|>=1.0.0|>=0.13.0|>=0.12.3|>=0.12.1'] + +Package future conflicts for: +path.py=11.5.0 -> backports.os -> future +_anaconda_depends=2019.03 -> future +backports.os=0.1.1 -> future +pytorch=1.1.0 -> future +anaconda=custom -> _anaconda_depends -> future +torchvision=0.3.0 -> future + +Package path.py conflicts for: +_anaconda_depends=2019.03 -> path.py +ipython=7.4.0 -> pickleshare -> path.py +anaconda=custom -> _anaconda_depends -> path.py +spyder=3.3.3 -> pickleshare -> path.py +path.py=11.5.0 + +Package dbus-python conflicts for: +keyring=18.0.0 -> secretstorage 
-> dbus-python +_anaconda_depends=2019.03 -> secretstorage -> dbus-python + +Package _ipython_minor_entry_point conflicts for: +jupyter_console=6.0.0 -> ipython -> _ipython_minor_entry_point=8.7.0 +ipywidgets=7.4.2 -> ipython[version='>=4.0.0'] -> _ipython_minor_entry_point=8.7.0 +ipykernel=5.1.0 -> ipython[version='>=5.0'] -> _ipython_minor_entry_point=8.7.0 +_anaconda_depends=2019.03 -> ipython -> _ipython_minor_entry_point=8.7.0 + +Package gmpy2 conflicts for: +sympy=1.3 -> gmpy2[version='>=2.0.8'] +sympy=1.3 -> mpmath[version='>=0.19'] -> gmpy2 +_anaconda_depends=2019.03 -> sympy -> gmpy2[version='>=2.0.8'] +anaconda=custom -> _anaconda_depends -> gmpy2 +_anaconda_depends=2019.03 -> gmpy2 +gmpy2=2.0.8 + +Package fonttools conflicts for: +scikit-image=0.15.0 -> matplotlib-base[version='>=2.0.0'] -> fonttools[version='>=4.22.0'] +seaborn=0.9.0 -> matplotlib-base -> fonttools[version='>=4.22.0'] +anaconda=custom -> _anaconda_depends -> fonttools + +Package blis conflicts for: +numpy=1.16.2 -> libblas[version='>=3.8.0,<4.0a0'] -> blis[version='0.5.1.*|>=0.5.2,<0.5.3.0a0|>=0.6.0,<0.6.1.0a0|>=0.6.1,<0.6.2.0a0|>=0.7.0,<0.7.1.0a0|>=0.8.0,<0.8.1.0a0|>=0.8.1,<0.8.2.0a0|>=0.9.0,<0.9.1.0a0'] +scipy=1.2.1 -> libblas[version='>=3.8.0,<4.0a0'] -> blis[version='0.5.1.*|>=0.5.2,<0.5.3.0a0|>=0.6.0,<0.6.1.0a0|>=0.6.1,<0.6.2.0a0|>=0.7.0,<0.7.1.0a0|>=0.8.0,<0.8.1.0a0|>=0.8.1,<0.8.2.0a0|>=0.9.0,<0.9.1.0a0'] + +Package qtconsole conflicts for: +_anaconda_depends=2019.03 -> spyder -> qtconsole[version='>=4.2|>=4.6.0|>=4.7.7|>=5.0.1|>=5.0.3|>=5.1.0|>=5.1.0,<5.2.0|>=5.2.1,<5.3.0|>=5.3.0,<5.4.0|>=5.3.2,<5.4.0|>=5.4.0,<5.5.0|>=5.4.2,<5.5.0|>=5.5.0,<5.6.0'] +qtconsole=4.4.3 +anaconda=custom -> _anaconda_depends -> qtconsole +spyder=3.3.3 -> qtconsole[version='>=4.2'] +jupyter=1.0.0 -> qtconsole +_anaconda_depends=2019.03 -> qtconsole + +Package filelock conflicts for: +torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> filelock +anaconda=custom -> _anaconda_depends -> filelock 
+_anaconda_depends=2019.03 -> filelock + +Package libnghttp2 conflicts for: +_anaconda_depends=2019.03 -> libcurl -> libnghttp2[version='>=1.41.0,<2.0a0|>=1.43.0,<2.0a0|>=1.47.0,<2.0a0|>=1.51.0,<2.0a0|>=1.52.0,<2.0a0|>=1.57.0|>=1.57.0,<2.0a0|>=1.52.0|>=1.46.0|>=1.46.0,<2.0a0'] +tensorflow=1.13.1 -> libcurl[version='>=7.64.1,<9.0a0'] -> libnghttp2[version='>=1.41.0,<2.0a0|>=1.43.0,<2.0a0|>=1.47.0,<2.0a0|>=1.51.0,<2.0a0|>=1.52.0,<2.0a0|>=1.57.0|>=1.57.0,<2.0a0|>=1.52.0|>=1.46.0|>=1.46.0,<2.0a0'] +pycurl=7.43.0.2 -> libcurl[version='>=7.64.1,<9.0a0'] -> libnghttp2[version='>=1.41.0,<2.0a0|>=1.43.0,<2.0a0|>=1.47.0,<2.0a0|>=1.51.0,<2.0a0|>=1.52.0,<2.0a0|>=1.57.0|>=1.57.0,<2.0a0|>=1.52.0|>=1.46.0|>=1.46.0,<2.0a0'] +anaconda=custom -> _anaconda_depends -> libnghttp2 + +Package secretstorage conflicts for: +secretstorage=3.1.1 +spyder=3.3.3 -> keyring -> secretstorage[version='>=3|>=3.2'] +_anaconda_depends=2019.03 -> secretstorage +keyring=18.0.0 -> secretstorage +anaconda=custom -> _anaconda_depends -> secretstorage +_anaconda_depends=2019.03 -> keyring -> secretstorage[version='>=3|>=3.2'] + +Package pyobjc-framework-cocoa conflicts for: +_anaconda_depends=2019.03 -> send2trash -> pyobjc-framework-cocoa +notebook=5.7.8 -> send2trash -> pyobjc-framework-cocoa + +Package astroid conflicts for: +spyder=3.3.3 -> pylint -> 
astroid[version='1.0.1|1.1.0|1.1.1|1.2.1|1.3.2|1.3.4|1.4.4|2.5.6|>=2.11.0,<=2.12.0|>=2.11.2,<=2.12.0|>=2.11.3,<=2.12.0|>=2.11.5,<2.12.0|>=2.11.6,<2.12.0|>=2.12.10,<2.14.0-dev0|>=2.12.11,<2.14.0-dev0|>=2.12.12,<2.14.0-dev0|>=2.12.13,<2.14.0-dev0|>=2.14.1,<2.16.0-dev0|>=2.14.2,<2.16.0-dev0|>=2.15.0,<2.17.0-dev0|>=2.15.2,<2.17.0-dev0|>=2.15.4,<2.17.0-dev0|>=2.15.6,<2.17.0-dev0|>=2.15.7,<2.17.0-dev0|>=2.15.8,<2.17.0-dev0|>=3.0.0,<3.1.0-dev0|>=3.0.1,<3.1.0-dev0|>=2.12.9,<2.14.0-dev0|>=2.12.4,<2.14.0-dev0|>=2.9.0,<2.10|>=2.8.0,<2.9|>=2.7.2,<2.8|>=2.6.5,<2.7|>=2.6.4,<2.7|>=2.6.2,<2.7|>=2.6.1,<2.7|>=2.5.7,<2.7|>=2.5.1,<2.6|>=2.4.0,<=2.5|>=2.4.0,<2.5|>=2.3.0,<2.4|>=2.2.0,<3|>=2.2.0|>=2.0.0|>=1.6,<2.0|>=1.5.1|>=1.4.5,<1.5.0|>=2.14.2,<=2.16.0|>=2.6.5,<=2.7|>=2.6.2,<=2.7|>=2.5.8,<=2.7|>=1.4.1,<1.5.0'] +_anaconda_depends=2019.03 -> astroid +pylint=2.3.1 -> astroid[version='>=2.2.0'] +anaconda=custom -> _anaconda_depends -> astroid +astroid=2.2.5 +_anaconda_depends=2019.03 -> pylint -> astroid[version='1.0.1|1.1.0|1.1.1|1.2.1|1.3.2|1.3.4|1.4.4|2.5.6|>=2.11.0,<=2.12.0|>=2.11.2,<=2.12.0|>=2.11.3,<=2.12.0|>=2.11.5,<2.12.0|>=2.11.6,<2.12.0|>=2.12.10,<2.14.0-dev0|>=2.12.11,<2.14.0-dev0|>=2.12.12,<2.14.0-dev0|>=2.12.13,<2.14.0-dev0|>=2.14.1,<2.16.0-dev0|>=2.14.2,<2.16.0-dev0|>=2.15.0,<2.17.0-dev0|>=2.15.2,<2.17.0-dev0|>=2.15.4,<2.17.0-dev0|>=2.15.6,<2.17.0-dev0|>=2.15.7,<2.17.0-dev0|>=2.15.8,<2.17.0-dev0|>=3.0.0,<3.1.0-dev0|>=3.0.1,<3.1.0-dev0|>=2.12.9,<2.14.0-dev0|>=2.12.4,<2.14.0-dev0|>=2.9.0,<2.10|>=2.8.0,<2.9|>=2.7.2,<2.8|>=2.6.5,<2.7|>=2.6.4,<2.7|>=2.6.2,<2.7|>=2.6.1,<2.7|>=2.5.7,<2.7|>=2.5.1,<2.6|>=2.4.0,<=2.5|>=2.4.0,<2.5|>=2.3.0,<2.4|>=2.2.0,<3|>=2.2.0|>=2.0.0|>=1.6,<2.0|>=1.5.1|>=1.4.5,<1.5.0|>=2.14.2,<=2.16.0|>=2.6.5,<=2.7|>=2.6.2,<=2.7|>=2.5.8,<=2.7|>=1.4.1,<1.5.0'] + +Package xorg-libice conflicts for: +cairo=1.14.12 -> xorg-libsm -> xorg-libice[version='1.0.*|>=1.1.1,<2.0a0'] +cairo=1.14.12 -> xorg-libice + +Package anaconda-project conflicts for: +anaconda=custom -> 
_anaconda_depends -> anaconda-project +anaconda-project=0.8.2 +_anaconda_depends=2019.03 -> anaconda-client -> anaconda-project[version='>=0.9.1'] +_anaconda_depends=2019.03 -> anaconda-project + +Package parso conflicts for: +spyder=3.3.3 -> jedi[version='>=0.9'] -> parso[version='0.1.0|>=0.1.0,<0.2|>=0.2.0,<0.8.0a0|>=0.3.0,<0.8.0a0|>=0.5.0,<0.8.0a0|>=0.5.2,<0.8.0a0|>=0.7.0,<0.8.0a0|>=0.7.0,<0.8.0|>=0.8.0,<0.9.0|>=0.8.3,<0.9.0|>=0.7.0|>=0.5.2|>=0.5.0|>=0.3.0|>=0.2.0'] +ipython=7.4.0 -> jedi[version='>=0.10'] -> parso[version='0.1.0|>=0.1.0,<0.2|>=0.2.0,<0.8.0a0|>=0.3.0,<0.8.0a0|>=0.5.0,<0.8.0a0|>=0.5.2,<0.8.0a0|>=0.7.0,<0.8.0a0|>=0.7.0,<0.8.0|>=0.8.0,<0.9.0|>=0.8.3,<0.9.0|>=0.7.0|>=0.5.2|>=0.5.0|>=0.3.0|>=0.2.0'] +_anaconda_depends=2019.03 -> parso +_anaconda_depends=2019.03 -> jedi -> parso[version='0.1.0|>=0.1.0,<0.2|>=0.2.0,<0.8.0a0|>=0.3.0,<0.8.0a0|>=0.5.0,<0.8.0a0|>=0.5.2,<0.8.0a0|>=0.7.0,<0.8.0a0|>=0.7.0,<0.8.0|>=0.8.0,<0.9.0|>=0.8.3,<0.9.0|>=0.7.0|>=0.5.2|>=0.5.0|>=0.3.0|>=0.2.0|>=0.7.0,<0.9.0|0.7.0.*|0.5.2.*'] +parso=0.3.4 +jedi=0.13.3 -> parso[version='>=0.3.0|>=0.3.0,<0.8.0a0'] +anaconda=custom -> _anaconda_depends -> parso + +Package typing conflicts for: +spyder=3.3.3 -> sphinx -> typing +torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> typing +anaconda=custom -> _anaconda_depends -> typing +_anaconda_depends=2019.03 -> typing +numpydoc=0.8.0 -> sphinx -> typing + +Package clyent conflicts for: +clyent=1.2.2 +anaconda-project=0.8.2 -> anaconda-client -> clyent[version='>=1.2.0|>=1.2.2'] +anaconda-client=1.7.2 -> clyent[version='>=1.2.0|>=1.2.2'] +_anaconda_depends=2019.03 -> clyent +anaconda=custom -> _anaconda_depends -> clyent +_anaconda_depends=2019.03 -> anaconda-client -> clyent[version='>=1.2.0|>=1.2.2'] + +Package jupyterlab_pygments conflicts for: +anaconda=custom -> _anaconda_depends -> jupyterlab_pygments +notebook=5.7.8 -> nbconvert -> jupyterlab_pygments +jupyter=1.0.0 -> nbconvert -> jupyterlab_pygments +spyder=3.3.3 -> nbconvert -> 
jupyterlab_pygments +_anaconda_depends=2019.03 -> nbconvert -> jupyterlab_pygments + +Package pytest conflicts for: +pytest-doctestplus=0.3.0 -> pytest[version='>=2.8|>=3.0'] +anaconda=custom -> _anaconda_depends -> pytest +_anaconda_depends=2019.03 -> pytest +pytest=4.3.1 +pytest-astropy=0.5.0 -> pytest[version='>=3.1'] +pytest-openfiles=0.3.2 -> pytest[version='>=2.8.0'] +pytest-remotedata=0.3.1 -> pytest[version='>=3.1'] +_anaconda_depends=2019.03 -> astropy -> pytest[version='<3.7|<4|>=2.8|>=4.6|>=3.1|>=3.1.0|>=4.0|>=3.0|>=2.8.0'] +pytest-astropy=0.5.0 -> pytest-arraydiff[version='>=0.1'] -> pytest[version='>=2.8.0|>=2.8|>=3.0|>=4.0|>=4.6'] +astropy=3.1.2 -> pytest-astropy -> pytest[version='>=3.1.0|>=3.1|>=4.6'] +pytest-arraydiff=0.3 -> pytest + +Package jsonschema conflicts for: +anaconda=custom -> _anaconda_depends -> jsonschema +_anaconda_depends=2019.03 -> jsonschema +jsonschema=3.0.1 +ipywidgets=7.4.2 -> nbformat[version='>=4.2.0'] -> jsonschema[version='>=2.4,!=2.5.0|>=2.6'] +nbformat=4.4.0 -> jsonschema[version='>=2.4,!=2.5.0'] +anaconda-client=1.7.2 -> nbformat[version='>=4.4.0'] -> jsonschema[version='2.4.0|>=2.0,!=2.5.0|>=2.4,!=2.5.0|>=2.6'] +nbconvert=5.4.1 -> nbformat[version='>=4.4'] -> jsonschema[version='>=2.4,!=2.5.0|>=2.6'] +notebook=5.7.8 -> nbformat -> jsonschema[version='2.4.0|>=2.0,!=2.5.0|>=2.4,!=2.5.0|>=2.6'] +_anaconda_depends=2019.03 -> jupyterlab_server -> jsonschema[version='2.4.0|>=2.0,!=2.5.0|>=2.4,!=2.5.0|>=2.6|>=3.0.1|>=4.17.3|>=4.18|>=4.18.0|>=3.2.0'] + +Package tblib conflicts for: +tblib=1.3.2 +_anaconda_depends=2019.03 -> distributed -> tblib[version='>=1.6.0'] +dask=1.1.4 -> distributed[version='>=1.26.0'] -> tblib[version='>=1.6.0'] +distributed=1.26.0 -> tblib +_anaconda_depends=2019.03 -> tblib +anaconda=custom -> _anaconda_depends -> tblib + +Package sphinxcontrib-websupport conflicts for: +sphinxcontrib-websupport=1.1.0 +_anaconda_depends=2019.03 -> sphinxcontrib-websupport +numpydoc=0.8.0 -> sphinx -> 
sphinxcontrib-websupport +anaconda=custom -> _anaconda_depends -> sphinxcontrib-websupport +spyder=3.3.3 -> sphinx -> sphinxcontrib-websupport + +Package tqdm conflicts for: +_anaconda_depends=2019.03 -> anaconda-client -> tqdm[version='>=4.56.0'] +anaconda=custom -> _anaconda_depends -> tqdm +_anaconda_depends=2019.03 -> tqdm +anaconda-project=0.8.2 -> anaconda-client -> tqdm[version='>=4.56.0'] +tqdm=4.32.2 + +Package brotli-python conflicts for: +anaconda-client=1.7.2 -> urllib3[version='<2.0.0a'] -> brotli-python[version='>=1.0.9'] +_anaconda_depends=2019.03 -> urllib3 -> brotli-python[version='>=1.0.9'] + +Package jdcal conflicts for: +jdcal=1.4 +anaconda=custom -> _anaconda_depends -> jdcal +_anaconda_depends=2019.03 -> jdcal +_anaconda_depends=2019.03 -> openpyxl -> jdcal==1.0 +openpyxl=2.6.1 -> jdcal + +Package werkzeug conflicts for: +anaconda=custom -> _anaconda_depends -> werkzeug +_anaconda_depends=2019.03 -> werkzeug +flask=1.0.2 -> werkzeug[version='>=0.14|>=0.15,<2.0'] +werkzeug=0.14.1 +tensorboard=1.13.1 -> werkzeug[version='>=0.11.10|>=0.11.15'] +_anaconda_depends=2019.03 -> flask -> werkzeug[version='0.8.3|>=0.14|>=0.15|>=0.15,<2.0|>=2.0|>=2.2.0|>=2.2.2|>=2.3.0|>=2.3.3|>=2.3.7|>=3.0.0|>=0.7|>=0.7,<1.0.0'] +tensorflow=1.13.1 -> tensorboard[version='>=1.13.0,<1.14.0a0'] -> werkzeug[version='>=0.11.10|>=0.11.15'] + +Package sphinxcontrib-qthelp conflicts for: +numpydoc=0.8.0 -> sphinx -> sphinxcontrib-qthelp +_anaconda_depends=2019.03 -> sphinx -> sphinxcontrib-qthelp +anaconda=custom -> _anaconda_depends -> sphinxcontrib-qthelp +spyder=3.3.3 -> sphinx -> sphinxcontrib-qthelp + +Package cairo conflicts for: +pango=1.42.4 -> harfbuzz[version='>=1.7.6,<2.0a0'] -> cairo[version='1.14.*|>=1.14.12,<2.0.0a0'] +pango=1.42.4 -> cairo[version='>=1.14.12,<2.0a0|>=1.16.0,<2.0.0a0'] +anaconda=custom -> _anaconda_depends -> cairo +_anaconda_depends=2019.03 -> cairo +_anaconda_depends=2019.03 -> harfbuzz -> 
cairo[version='1.12.*|1.14.*|>=1.14.12,<2.0.0a0|>=1.16.0,<2.0.0a0|>=1.16.0,<2.0a0|>=1.18.0,<2.0a0|>=1.14.12,<2.0a0|>=1.14.10,<2.0a0|>=1.12.10|>=1.14.10,<2.0.0a0'] +cairo=1.14.12 +harfbuzz=1.8.8 -> cairo[version='>=1.14.12,<2.0.0a0|>=1.14.12,<2.0a0'] + +Package qtpy conflicts for: +spyder=3.3.3 -> qtpy[version='>=1.5.0'] +qtpy=1.7.0 +spyder=3.3.3 -> qtawesome[version='>=0.4.1'] -> qtpy[version='>=2.0.1|>=2.4.0'] +jupyter=1.0.0 -> qtconsole-base -> qtpy[version='>=2.0.1|>=2.4.0'] +_anaconda_depends=2019.03 -> qtconsole -> qtpy[version='>=1.1|>=1.2.0|>=1.5.0|>=2.0.1|>=2.4.0|>=2.1.0'] +qtawesome=0.5.7 -> qtpy +anaconda=custom -> _anaconda_depends -> qtpy +_anaconda_depends=2019.03 -> qtpy + +Package pycparser conflicts for: +anaconda=custom -> _anaconda_depends -> pycparser +_anaconda_depends=2019.03 -> pycparser +pycparser=2.19 +gevent=1.4.0 -> cffi[version='>=1.11.5'] -> pycparser +cffi=1.12.2 -> pycparser +pytorch=1.1.0 -> cffi -> pycparser +cryptography=2.6.1 -> cffi[version='>=1.7'] -> pycparser + +Package mpi conflicts for: +hdf5=1.10.4 -> openmpi[version='>=3.1,<3.2.0a0'] -> mpi==1.0[build='openmpi|mpich'] +anaconda=custom -> _anaconda_depends -> mpi +h5py=2.9.0 -> openmpi[version='>=3.1.4,<3.2.0a0'] -> mpi==1.0[build='openmpi|mpich'] + +Package cycler conflicts for: +_anaconda_depends=2019.03 -> matplotlib -> cycler[version='>=0.10|>=0.10.0'] +matplotlib=3.0.3 -> cycler[version='>=0.10'] +_anaconda_depends=2019.03 -> cycler +scikit-image=0.15.0 -> matplotlib-base[version='>=2.0.0'] -> cycler[version='>=0.10|>=0.10.0'] +anaconda=custom -> _anaconda_depends -> cycler +cycler=0.10.0 +seaborn=0.9.0 -> matplotlib-base -> cycler[version='>=0.10|>=0.10.0'] +nltk=3.4 -> matplotlib -> cycler[version='>=0.10|>=0.10.0'] + +Package cached-property conflicts for: +_anaconda_depends=2019.03 -> h5py -> cached-property +keras-applications=1.0.7 -> h5py -> cached-property +anaconda=custom -> _anaconda_depends -> cached-property + +Package boto conflicts for: +anaconda=custom -> 
_anaconda_depends -> boto +boto=2.49.0 +_anaconda_depends=2019.03 -> boto + +Package wheel conflicts for: +anaconda=custom -> _anaconda_depends -> wheel +_anaconda_depends=2019.03 -> wheel +pip=19.0.3 -> wheel +wheel=0.33.1 +python=3.6.8 -> pip -> wheel + +Package wurlitzer conflicts for: +_anaconda_depends=2019.03 -> spyder-kernels -> wurlitzer[version='>=1.0.3'] +spyder-kernels=0.4.2 -> wurlitzer +_anaconda_depends=2019.03 -> wurlitzer +wurlitzer=1.0.2 +spyder=3.3.3 -> spyder-kernels[version='>=0.4.2,<1'] -> wurlitzer +anaconda=custom -> _anaconda_depends -> wurlitzer + +Package get_terminal_size conflicts for: +ipywidgets=7.4.2 -> ipython[version='>=4.0.0'] -> get_terminal_size +ipykernel=5.1.0 -> ipython[version='>=5.0'] -> get_terminal_size +jupyter_console=6.0.0 -> ipython -> get_terminal_size +get_terminal_size=1.0.0 +anaconda=custom -> _anaconda_depends -> get_terminal_size +_anaconda_depends=2019.03 -> get_terminal_size + +Package pyqtchart conflicts for: +spyder=3.3.3 -> pyqt=5 -> pyqtchart==5.12[build='py36h7ec31b9_6|py38h7400c14_7|py37he336c9b_7|py39h0fcd23e_7|py310hfcd6d55_8|py39h0fcd23e_8|py37he336c9b_8|py38h7400c14_8|py36h7ec31b9_7|py39h0fcd23e_6|py38h7400c14_6|py37he336c9b_6|py37he336c9b_5'] +matplotlib=3.0.3 -> pyqt -> pyqtchart==5.12[build='py36h7ec31b9_6|py38h7400c14_7|py37he336c9b_7|py39h0fcd23e_7|py310hfcd6d55_8|py39h0fcd23e_8|py37he336c9b_8|py38h7400c14_8|py36h7ec31b9_7|py39h0fcd23e_6|py38h7400c14_6|py37he336c9b_6|py37he336c9b_5'] +qtconsole=4.4.3 -> pyqt -> pyqtchart==5.12[build='py36h7ec31b9_6|py38h7400c14_7|py37he336c9b_7|py39h0fcd23e_7|py310hfcd6d55_8|py39h0fcd23e_8|py37he336c9b_8|py38h7400c14_8|py36h7ec31b9_7|py39h0fcd23e_6|py38h7400c14_6|py37he336c9b_6|py37he336c9b_5'] +_anaconda_depends=2019.03 -> pyqt -> pyqtchart==5.12[build='py36h7ec31b9_6|py38h7400c14_7|py37he336c9b_7|py39h0fcd23e_7|py310hfcd6d55_8|py39h0fcd23e_8|py37he336c9b_8|py38h7400c14_8|py36h7ec31b9_7|py39h0fcd23e_6|py38h7400c14_6|py37he336c9b_6|py37he336c9b_5'] + +Package 
numba conflicts for: +_anaconda_depends=2019.03 -> numba +anaconda=custom -> _anaconda_depends -> numba +numba=0.43.1 + +Package mccabe conflicts for: +_anaconda_depends=2019.03 -> mccabe +_anaconda_depends=2019.03 -> pylint -> mccabe[version='>=0.6,<0.7|>=0.6,<0.8'] +spyder=3.3.3 -> pylint -> mccabe[version='>=0.6,<0.7|>=0.6,<0.8'] +mccabe=0.6.1 +pylint=2.3.1 -> mccabe +anaconda=custom -> _anaconda_depends -> mccabe + +Package jaraco.itertools conflicts for: +_anaconda_depends=2019.03 -> zipp -> jaraco.itertools +importlib_metadata=0.8 -> zipp[version='>=0.3.2'] -> jaraco.itertools + +Package pycrypto conflicts for: +_anaconda_depends=2019.03 -> pycrypto +anaconda=custom -> _anaconda_depends -> pycrypto +pycrypto=2.6.1 + +Package _anaconda_depends conflicts for: +_anaconda_depends=2019.03 +anaconda=custom -> _anaconda_depends + +Package pkg-config conflicts for: +dbus=1.13.6 -> glib -> pkg-config +_anaconda_depends=2019.03 -> glib -> pkg-config + +Package jupyter conflicts for: +_anaconda_depends=2019.03 -> jupyter +anaconda=custom -> _anaconda_depends -> jupyter +jupyter=1.0.0 + +Package scikit-image conflicts for: +scikit-image=0.15.0 +anaconda=custom -> _anaconda_depends -> scikit-image +_anaconda_depends=2019.03 -> scikit-image + +Package tensorflow-estimator conflicts for: +tensorflow-estimator=1.13.0 +tensorflow=1.13.1 -> tensorflow-estimator[version='>=1.13.0,<1.14.0a0|>=1.13.0,<1.14.0rc0'] + +Package dataclasses conflicts for: +anaconda=custom -> _anaconda_depends -> dataclasses +torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> dataclasses +nltk=3.4 -> gensim -> dataclasses +tensorboard=1.13.1 -> werkzeug[version='>=0.11.10'] -> dataclasses +_anaconda_depends=2019.03 -> werkzeug -> dataclasses +flask=1.0.2 -> werkzeug[version='>=0.14'] -> dataclasses + +Package jupyter-lsp conflicts for: +_anaconda_depends=2019.03 -> jupyterlab -> jupyter-lsp[version='>=2.0.0'] +jupyter=1.0.0 -> jupyterlab -> jupyter-lsp[version='>=2.0.0'] + +Package et_xmlfile conflicts 
for: +et_xmlfile=1.0.1 +openpyxl=2.6.1 -> et_xmlfile +anaconda=custom -> _anaconda_depends -> et_xmlfile +_anaconda_depends=2019.03 -> et_xmlfile + +Package heapdict conflicts for: +_anaconda_depends=2019.03 -> heapdict +heapdict=1.0.0 +distributed=1.26.0 -> zict[version='>=0.1.3'] -> heapdict +zict=0.1.4 -> heapdict +anaconda=custom -> _anaconda_depends -> heapdict + +Package spyder conflicts for: +anaconda=custom -> _anaconda_depends -> spyder +_anaconda_depends=2019.03 -> spyder +spyder=3.3.3 + +Package notebook-shim conflicts for: +_anaconda_depends=2019.03 -> jupyterlab -> notebook-shim[version='>=0.2|>=0.2,<0.3'] +jupyterlab_server=0.2.0 -> notebook -> notebook-shim[version='>=0.2,<0.3'] +jupyter=1.0.0 -> notebook -> notebook-shim[version='>=0.2|>=0.2,<0.3'] +widgetsnbextension=3.4.2 -> notebook[version='>=4.4.1'] -> notebook-shim[version='>=0.2,<0.3'] +jupyterlab=0.35.4 -> notebook[version='>=4.3.1'] -> notebook-shim[version='>=0.2,<0.3'] + +Package xlsxwriter conflicts for: +anaconda=custom -> _anaconda_depends -> xlsxwriter +_anaconda_depends=2019.03 -> xlsxwriter +xlsxwriter=1.1.5 + +Package qtconsole-base conflicts for: +jupyter=1.0.0 -> qtconsole-base +_anaconda_depends=2019.03 -> jupyter -> qtconsole-base[version='>=5.2.2,<5.2.3.0a0|>=5.3.0,<5.3.1.0a0|>=5.3.1,<5.3.2.0a0|>=5.3.2,<5.3.3.0a0|>=5.4.0,<5.4.1.0a0|>=5.4.1,<5.4.2.0a0|>=5.4.2,<5.4.3.0a0|>=5.4.3,<5.4.4.0a0|>=5.4.4,<5.4.5.0a0|>=5.5.0,<5.5.1.0a0|>=5.5.1,<5.5.2.0a0'] +spyder=3.3.3 -> qtconsole[version='>=4.2'] -> qtconsole-base[version='>=5.2.2,<5.2.3.0a0|>=5.3.0,<5.3.1.0a0|>=5.3.1,<5.3.2.0a0|>=5.3.2,<5.3.3.0a0|>=5.4.0,<5.4.1.0a0|>=5.4.1,<5.4.2.0a0|>=5.4.2,<5.4.3.0a0|>=5.4.3,<5.4.4.0a0|>=5.4.4,<5.4.5.0a0|>=5.5.0,<5.5.1.0a0|>=5.5.1,<5.5.2.0a0'] +jupyter=1.0.0 -> qtconsole -> 
qtconsole-base[version='>=5.2.2,<5.2.3.0a0|>=5.3.0,<5.3.1.0a0|>=5.3.1,<5.3.2.0a0|>=5.3.2,<5.3.3.0a0|>=5.4.0,<5.4.1.0a0|>=5.4.1,<5.4.2.0a0|>=5.4.2,<5.4.3.0a0|>=5.4.3,<5.4.4.0a0|>=5.4.4,<5.4.5.0a0|>=5.5.0,<5.5.1.0a0|>=5.5.1,<5.5.2.0a0'] + +Package pycosat conflicts for: +_anaconda_depends=2019.03 -> pycosat +pycosat=0.6.3 +anaconda=custom -> _anaconda_depends -> pycosat + +Package xyzservices conflicts for: +dask=1.1.4 -> bokeh[version='>=0.13.0'] -> xyzservices[version='>=2021.09.1'] +_anaconda_depends=2019.03 -> bokeh -> xyzservices[version='>=2021.09.1'] + +Package brotlipy conflicts for: +anaconda-client=1.7.2 -> urllib3[version='<2.0.0a'] -> brotlipy[version='>=0.6.0'] +anaconda=custom -> _anaconda_depends -> brotlipy +_anaconda_depends=2019.03 -> urllib3 -> brotlipy[version='>=0.6.0'] + +Package libtool conflicts for: +_anaconda_depends=2019.03 -> libtool +anaconda=custom -> _anaconda_depends -> libtool +libtool=2.4.6 + +Package backports.os conflicts for: +anaconda=custom -> _anaconda_depends -> backports.os +_anaconda_depends=2019.03 -> backports.os +path.py=11.5.0 -> backports.os +backports.os=0.1.1 + +Package tbb4py conflicts for: +anaconda=custom -> _anaconda_depends -> tbb4py +mkl_random=1.0.2 -> numpy-base[version='>=1.0.2,<2.0a0'] -> tbb4py +_anaconda_depends=2019.03 -> numpy-base -> tbb4py + +Package libllvm8 conflicts for: +numba=0.43.1 -> llvmlite[version='>=0.28.0'] -> libllvm8[version='>=8.0.1,<8.1.0a0'] +_anaconda_depends=2019.03 -> llvmlite -> libllvm8[version='>=8.0.1,<8.1.0a0'] + +Package anaconda-anon-usage conflicts for: +anaconda-project=0.8.2 -> anaconda-client -> anaconda-anon-usage[version='>=0.4.0'] +_anaconda_depends=2019.03 -> anaconda-client -> anaconda-anon-usage[version='>=0.4.0'] + +Package pcre2 conflicts for: +dbus=1.13.6 -> libglib[version='>=2.70.2,<3.0a0'] -> pcre2[version='>=10.37,<10.38.0a0|>=10.40,<10.41.0a0|>=10.42,<10.43.0a0'] +pango=1.42.4 -> libglib[version='>=2.64.6,<3.0a0'] -> 
pcre2[version='>=10.37,<10.38.0a0|>=10.40,<10.41.0a0|>=10.42,<10.43.0a0'] + +Package cupti conflicts for: +tensorflow=1.13.1 -> tensorflow-base==1.13.1=gpu_py27h8f37b9b_0 -> cupti +torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> cupti + +Package unicodecsv conflicts for: +anaconda=custom -> _anaconda_depends -> unicodecsv +unicodecsv=0.14.1 +_anaconda_depends=2019.03 -> unicodecsv + +Package dill conflicts for: +_anaconda_depends=2019.03 -> dask -> dill[version='0.2.2|0.2.3|0.2.4|>=0.3.7|>=0.3.6|>=0.2'] +spyder=3.3.3 -> pylint -> dill[version='>=0.2|>=0.3.6|>=0.3.7'] + +Package cryptography-vectors conflicts for: +_anaconda_depends=2019.03 -> cryptography -> cryptography-vectors[version='2.3.*|2.3.1.*'] +urllib3=1.24.1 -> cryptography[version='>=1.3.4'] -> cryptography-vectors[version='2.3.*|2.3.1.*'] +pyopenssl=19.0.0 -> cryptography[version='>=2.2.1'] -> cryptography-vectors[version='2.3.*|2.3.1.*'] +secretstorage=3.1.1 -> cryptography -> cryptography-vectors[version='2.3.*|2.3.1.*'] + +Package xlrd conflicts for: +anaconda=custom -> _anaconda_depends -> xlrd +_anaconda_depends=2019.03 -> xlrd +xlrd=1.2.0 + +Package seaborn conflicts for: +anaconda=custom -> _anaconda_depends -> seaborn +_anaconda_depends=2019.03 -> seaborn +seaborn=0.9.0 + +Package mpi4py conflicts for: +keras-applications=1.0.7 -> h5py -> mpi4py[version='>=3.0'] +h5py=2.9.0 -> mpi4py +_anaconda_depends=2019.03 -> h5py -> mpi4py[version='>=3.0'] + +Package selectors2 conflicts for: +spyder-kernels=0.4.2 -> wurlitzer -> selectors2 +_anaconda_depends=2019.03 -> wurlitzer -> selectors2 + +Package referencing conflicts for: +_anaconda_depends=2019.03 -> jsonschema -> referencing[version='>=0.28.4'] +nbformat=4.4.0 -> jsonschema[version='>=2.4,!=2.5.0'] -> referencing[version='>=0.28.4'] + +Package pyside conflicts for: +nltk=3.4 -> matplotlib -> pyside[version='1.1.2|1.2.1'] +_anaconda_depends=2019.03 -> matplotlib -> pyside[version='1.1.2|1.2.1'] + +Package gevent conflicts for: 
+_anaconda_depends=2019.03 -> bokeh -> gevent==1.0.1 +anaconda=custom -> _anaconda_depends -> gevent +_anaconda_depends=2019.03 -> gevent +gevent=1.4.0 + +Package pbr conflicts for: +pytables=3.5.1 -> mock -> pbr[version='1.3.0|>=1.3'] +tensorflow=1.13.1 -> mock[version='>=2.0.0'] -> pbr[version='>=1.3'] +tensorflow-estimator=1.13.0 -> mock[version='>=2.0.0'] -> pbr[version='>=1.3'] + +Package keras-base conflicts for: +keras-applications=1.0.7 -> keras[version='>=2.1.6'] -> keras-base[version='2.2.0.*|2.2.2.*|2.2.4.*|2.3.1.*|2.4.3.*'] +keras-preprocessing=1.0.9 -> keras[version='>=2.1.6'] -> keras-base[version='2.2.0.*|2.2.2.*|2.2.4.*|2.3.1.*|2.4.3.*'] + +Package openpyxl conflicts for: +anaconda=custom -> _anaconda_depends -> openpyxl +_anaconda_depends=2019.03 -> openpyxl +openpyxl=2.6.1 + +Package distribute conflicts for: +_anaconda_depends=2019.03 -> pip -> distribute +python=3.6.8 -> pip -> distributeThe following specifications were found to be incompatible with your system: + + - feature:/linux-64::__cuda==11.7=0 + - feature:/linux-64::__glibc==2.27=0 + - feature:/linux-64::__linux==5.4.0=0 + - feature:/linux-64::__unix==0=0 + - feature:|@/linux-64::__cuda==11.7=0 + - feature:|@/linux-64::__glibc==2.27=0 + - feature:|@/linux-64::__linux==5.4.0=0 + - feature:|@/linux-64::__unix==0=0 + - _anaconda_depends=2019.03 -> click -> __unix + - _anaconda_depends=2019.03 -> click -> __win + - _anaconda_depends=2019.03 -> gst-plugins-base -> __glibc[version='>=2.17|>=2.17,<3.0.a0'] + - _anaconda_depends=2019.03 -> ipykernel -> __linux + - astropy=3.1.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - bitarray=0.8.3 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - blosc=1.15.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - bottleneck=1.2.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - bzip2=1.0.6 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - c-ares=1.15.0 -> 
libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - cairo=1.14.12 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - cffi=1.12.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - cryptography=2.6.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - cudatoolkit=9 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17'] + - cupy=6.0.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - curl=7.64.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - cython=0.29.6 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - cytoolz=0.9.0.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - dbus=1.13.6 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17'] + - distributed=1.26.0 -> click[version='>=6.6'] -> __unix + - distributed=1.26.0 -> click[version='>=6.6'] -> __win + - expat=2.2.6 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - fastcache=1.0.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - fastrlock=0.4 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - flask=1.0.2 -> click[version='>=5.1'] -> __unix + - flask=1.0.2 -> click[version='>=5.1'] -> __win + - fontconfig=2.13.0 -> libgcc-ng[version='>=4.9'] -> __glibc[version='>=2.17'] + - freetype=2.9.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - fribidi=1.0.5 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - gevent=1.4.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - glib=2.56.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - gmp=6.1.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - gmpy2=2.0.8 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - graphite2=1.3.13 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17'] + - greenlet=0.4.15 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - grpcio=1.16.1 -> libgcc-ng[version='>=7.3.0'] -> 
__glibc[version='>=2.17'] + - gst-plugins-base=1.14.0 -> gstreamer[version='>=1.14.0,<2.0a0'] -> __glibc[version='>=2.17|>=2.17,<3.0.a0'] + - gstreamer=1.14.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - h5py=2.9.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - harfbuzz=1.8.8 -> libgcc-ng[version='>=4.9'] -> __glibc[version='>=2.17'] + - hdf5=1.10.4 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - icu=58.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - ipykernel=5.1.0 -> ipython[version='>=5.0'] -> __linux + - ipykernel=5.1.0 -> ipython[version='>=5.0'] -> __unix + - ipykernel=5.1.0 -> ipython[version='>=5.0'] -> __win + - ipywidgets=7.4.2 -> ipykernel[version='>=4.5.1'] -> __linux + - ipywidgets=7.4.2 -> ipykernel[version='>=4.5.1'] -> __osx + - ipywidgets=7.4.2 -> ipykernel[version='>=4.5.1'] -> __win + - ipywidgets=7.4.2 -> ipython[version='>=4.0.0'] -> __unix + - jbig=2.1 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17'] + - jpeg=9b -> libgcc-ng[version='>=7.2.0'] -> __glibc[version='>=2.17'] + - jupyter=1.0.0 -> ipykernel -> __linux + - jupyter=1.0.0 -> ipykernel -> __win + - jupyter_console=6.0.0 -> ipykernel -> __linux + - jupyter_console=6.0.0 -> ipykernel -> __win + - jupyter_console=6.0.0 -> ipython -> __unix + - kiwisolver=1.0.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - krb5=1.16.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - lazy-object-proxy=1.3.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libcurl=7.64.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libedit=3.1.20181209 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libffi=3.2.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libpng=1.6.36 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libprotobuf=3.8.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libsodium=1.0.16 -> 
libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libssh2=1.8.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libtiff=4.0.10 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libtool=2.4.6 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17'] + - libuuid=1.0.3 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17'] + - libxcb=1.13 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17'] + - libxml2=2.9.9 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - libxslt=1.1.33 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17'] + - llvmlite=0.28.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - lxml=4.3.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - lzo=2.10 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17'] + - markupsafe=1.1.1 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17'] + - matplotlib=3.0.3 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - mistune=0.8.4 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17'] + - mkl-service=1.1.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - mkl_fft=1.0.10 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - mkl_random=1.0.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - mpc=1.1.0 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17'] + - mpfr=4.0.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - msgpack-python=0.6.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - nccl=1.3.5 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - ncurses=6.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - ninja=1.9.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - notebook=5.7.8 -> ipykernel -> __linux + - notebook=5.7.8 -> ipykernel -> __win + - numba=0.43.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - numexpr=2.6.9 -> 
libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - numpy-base=1.16.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - numpy=1.16.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - openssl=1.1.1c -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pandas=0.24.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pango=1.42.4 -> libgcc-ng[version='>=7.5.0'] -> __glibc[version='>=2.17'] + - pcre=8.43 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pillow=6.0.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pixman=0.38.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - psutil=5.6.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pycosat=0.6.3 -> libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17'] + - pycrypto=2.6.1 -> libgcc-ng[version='>=9.4.0'] -> __glibc[version='>=2.17'] + - pycurl=7.43.0.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pyodbc=4.0.26 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pyqt=5.9.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pyrsistent=0.14.11 -> libgcc-ng[version='>=9.3.0'] -> __glibc[version='>=2.17'] + - pytables=3.5.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - python=3.6.8 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pytorch=1.1.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pywavelets=1.0.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pyyaml=5.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - pyzmq=18.0.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - qt=5.9.7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - qtconsole=4.4.3 -> ipykernel[version='>=4.1'] -> __linux + - qtconsole=4.4.3 -> ipykernel[version='>=4.1'] -> __win + - readline=7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - 
ruamel_yaml=0.15.46 -> libgcc-ng[version='>=4.9'] -> __glibc[version='>=2.17'] + - scikit-image=0.15.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - scikit-learn=0.20.3 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - scipy=1.2.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - sip=4.19.8 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - snappy=1.1.7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - spyder-kernels=0.4.2 -> ipykernel[version='>4.9.0'] -> __linux + - spyder-kernels=0.4.2 -> ipykernel[version='>4.9.0'] -> __win + - sqlalchemy=1.3.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - sqlite=3.27.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - statsmodels=0.9.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - tensorboard=1.13.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - tensorflow=1.13.1 -> libgcc-ng[version='>=5.4.0'] -> __glibc[version='>=2.17'] + - tk=8.6.8 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - torchvision=0.3.0 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17|>=2.17,<3.0.a0'] + - torchvision=0.3.0 -> pytorch[version='>=1.1.0'] -> __cuda[version='>=11.8'] + - tornado=6.0.2 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - typed-ast=1.3.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - unixodbc=2.3.7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - urllib3=1.24.1 -> pysocks[version='>=1.5.6,<2.0,!=1.5.7'] -> __unix + - urllib3=1.24.1 -> pysocks[version='>=1.5.6,<2.0,!=1.5.7'] -> __win + - wrapt=1.11.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - xz=5.2.4 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - yaml=0.1.7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - zeromq=4.3.1 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + - zlib=1.2.11 -> 
libgcc-ng[version='>=10.3.0'] -> __glibc[version='>=2.17'] + - zstd=1.3.7 -> libgcc-ng[version='>=7.3.0'] -> __glibc[version='>=2.17'] + +Your installed version is: not available diff --git a/动态slam/run.txt b/动态slam/run.txt new file mode 100644 index 0000000..d394924 --- /dev/null +++ b/动态slam/run.txt @@ -0,0 +1,9 @@ +python evaluation.py --result_dir=./data/ --eva_seqs=../pose_est/06/06_pred + +python evaluate_kitti.py ./pose_gt/06.txt ./06_est.txt + +python tartanair_evaluator.py + + + +conda env create -f requirement.yml -p /root/miniconda3/envs/dfvo \ No newline at end of file diff --git a/动态slam/tartan.pdf b/动态slam/tartan.pdf new file mode 100644 index 0000000..de9ebaa --- /dev/null +++ b/动态slam/tartan.pdf @@ -0,0 +1,724 @@ + TartanVO: A Generalizable Learning-based VO + + Wenshan Wang∗ Yaoyu Hu Sebastian Scherer + + Carnegie Mellon University Carnegie Mellon University Carnegie Mellon University + +arXiv:2011.00359v1 [cs.CV] 31 Oct 2020 Abstract: We present the first learning-based visual odometry (VO) model, + which generalizes to multiple datasets and real-world scenarios, and outperforms + geometry-based methods in challenging scenes. We achieve this by leveraging + the SLAM dataset TartanAir, which provides a large amount of diverse synthetic + data in challenging environments. Furthermore, to make our VO model generalize + across datasets, we propose an up-to-scale loss function and incorporate the cam- + era intrinsic parameters into the model. Experiments show that a single model, + TartanVO, trained only on synthetic data, without any finetuning, can be general- + ized to real-world datasets such as KITTI and EuRoC, demonstrating significant + advantages over the geometry-based methods on challenging trajectories. Our + code is available at https://github.com/castacks/tartanvo. 
+ + Keywords: Visual Odometry, Generalization, Deep Learning, Optical Flow + + 1 Introduction + + Visual SLAM (Simultaneous Localization and Mapping) becomes more and more important for + autonomous robotic systems due to its ubiquitous availability and the information richness of im- + ages [1]. Visual odometry (VO) is one of the fundamental components in a visual SLAM system. + Impressive progress has been made in both geometric-based methods [2, 3, 4, 5] and learning-based + methods [6, 7, 8, 9]. However, it remains a challenging problem to develop a robust and reliable VO + method for real-world applications. + + On one hand, geometric-based methods are not robust enough in many real-life situations [10, 11]. + On the other hand, although learning-based methods demonstrate robust performance on many vi- + sual tasks, including object recognition, semantic segmentation, depth reconstruction, and optical + flow, we have not yet seen the same story happening to VO. + + It is widely accepted that by leveraging a large amount of data, deep-neural-network-based methods + can learn a better feature extractor than engineered ones, resulting in a more capable and robust + model. But why haven’t we seen the deep learning models outperform geometry-based methods yet? + We argue that there are two main reasons. First, the existing VO models are trained with insufficient + diversity, which is critical for learning-based methods to be able to generalize. By diversity, we + mean diversity both in the scenes and motion patterns. For example, a VO model trained only on + outdoor scenes is unlikely to be able to generalize to an indoor environment. Similarly, a model + trained with data collected by a camera fixed on a ground robot, with limited pitch and roll motion, + will unlikely be applicable to drones. Second, most of the current learning-based VO models neglect + some fundamental nature of the problem which is well formulated in geometry-based VO theories. 
From the theory of multi-view geometry, we know that recovering the camera pose from a sequence of monocular images has scale ambiguity. Besides, recovering the pose needs to take account of the camera intrinsic parameters (referred to as the intrinsics ambiguity later). Without explicitly dealing with the scale problem and the camera intrinsics, a model learned from one dataset would likely fail in another dataset, no matter how good the feature extractor is.
+
+To this end, we propose a learning-based method that can solve the above two problems and can generalize across datasets. Our contributions come in three folds. First, we demonstrate the crucial effects of data diversity on the generalization ability of a VO model by comparing performance on different quantities of training data. Second, we design an up-to-scale loss function to deal with the scale ambiguity of monocular VO. Third, we create an intrinsics layer (IL) in our VO model enabling generalization across different cameras. To our knowledge, our model is the first learning-based VO that has competitive performance in various real-world datasets without finetuning. Furthermore, compared to geometry-based methods, our model is significantly more robust in challenging scenes. A demo video can be found at: https://www.youtube.com/watch?v=NQ1UEh3thbU
+
+∗Corresponding author: wenshanw@andrew.cmu.edu
+4th Conference on Robot Learning (CoRL 2020), Cambridge MA, USA.
+
+2 Related Work
+
+Besides early studies of learning-based VO models [12, 13, 14, 15], more and more end-to-end learning-based VO models have been studied with improved accuracy and robustness. The majority of the recent end-to-end models adopt the unsupervised-learning design [6, 16, 17, 18], due to the complexity and the high cost associated with collecting ground-truth data. However, supervised models trained on labeled odometry data still have a better performance [19, 20].
+ +To improve the performance, end-to-end VO models tend to have auxiliary outputs related to camera +motions, such as depth and optical flow. With depth prediction, models obtain supervision signals +by imposing depth consistency between temporally consecutive images [17, 21]. This procedure can +be interpreted as matching the temporal observations in the 3D space. A similar effect of temporal +matching can be achieved by producing the optical flow, e.g., [16, 22, 18] jointly predict depth, +optical flow, and camera motion. + +Optical flow can also be treated as an intermediate representation that explicitly expresses the 2D +matching. Then, camera motion estimators can process the optical flow data rather than directly +working on raw images[20, 23]. If designed this way, components for estimating the camera motion +can even be trained separately on available optical flow data [19]. We follow these designs and use +the optical flow as an intermediate representation. + +It is well known that monocular VO systems have scale ambiguity. Nevertheless, most of the super- +vised learning models did not handle this issue and directly use the difference between the model +prediction and the true camera motion as the supervision [20, 24, 25]. In [19], the scale is handled +by dividing the optical flow into sub-regions and imposing a consistency of the motion predictions +among these regions. In non-learning methods, scale ambiguity can be solved if a 3D map is avail- +able [26]. Ummenhofer et al. [20] introduce the depth prediction to correcting the scale-drift. Tateno +et al. [27] and Sheng et al. [28] ameliorate the scale problem by leveraging the key-frame selection +technique from SLAM systems. Recently, Zhan et al. [29] use PnP techniques to explicitly solve +for the scale factor. The above methods introduce extra complexity to the VO system, however, the +scale ambiguity is not totally suppressed for monocular setups especially in the evaluation stage. 
+Instead, some models choose to only produce up-to-scale predictions. Wang et al. [30] reduce the +scale ambiguity in the monocular depth estimation task by normalizing the depth prediction before +computing the loss function. Similarly, we will focus on predicting the translation direction rather +than recovering the full scale from monocular images, by defining a new up-to-scale loss function. + +Learning-based models suffer from generalization issues when tested on images from a new en- +vironment or a new camera. Most of the VO models are trained and tested on the same dataset +[16, 17, 31, 18]. Some multi-task models [6, 20, 32, 22] only test their generalization ability on the +depth prediction, not on the camera pose estimation. Recent efforts, such as [33], use model adap- +tation to deal with new environments, however, additional training is needed on a per-environment +or per-camera basis. In this work, we propose a novel approach to achieve cross-camera/dataset +generalization, by incorporating the camera intrinsics directly into the model. + +Figure 1: The two-stage network architecture. The model consists of a matching network, which +estimates optical flow from two consecutive RGB images, followed by a pose network predicting +camera motion from the optical flow. + + 2 + 3 Approach + +3.1 Background + +We focus on the monocular VO problem, which takes two consecutive undistorted images {It, It+1}, +and estimates the relative camera motion δtt+1 = (T, R), where T ∈ R3 is the 3D translation and +R ∈ so(3) denotes the 3D rotation. According to the epipolar geometry theory [34], the geometry- +based VO comes in two folds. Firstly, visual features are extracted and matched from It and It+1. +Then using the matching results, it computes the essential matrix leading to the recovery of the +up-to-scale camera motion δtt+1. + +Following the same idea, our model consists of two sub-modules. 
One is the matching module +Mθ(It, It+1), estimating the dense matching result Ftt+1 from two consecutive RGB images (i.e. +optical flow). The other is a pose module Pφ(Ftt+1) that recovers the camera motion δtt+1 from the +matching result (Fig. 1). This modular design is also widely used in other learning-based methods, +especially in unsupervised VO [13, 19, 16, 22, 18]. + +3.2 Training on large scale diverse data + +The generalization capability has always been one of the most critical issues for learning-based +methods. Most of the previous supervised models are trained on the KITTI dataset, which contains +11 labeled sequences and 23,201 image frames in the driving scenario [35]. Wang et al. [8] presented +the training and testing results on the EuRoC dataset [36], collected by a micro aerial vehicle (MAV). +They reported that the performance is limited by the lack of training data and the more complex +dynamics of a flying robot. Surprisingly, most unsupervised methods also only train their models in +very uniform scenes (e.g., KITTI and Cityscape [37]). To our knowledge, no learning-based model +has yet shown the capability of running on multiple types of scenes (car/MAV, indoor/outdoor). To +achieve this, we argue that the training data has to cover diverse scenes and motion patterns. + +TartanAir [11] is a large scale dataset with highly diverse scenes and motion patterns, containing +more than 400,000 data frames. It provides multi-modal ground truth labels including depth, seg- +mentation, optical flow, and camera pose. The scenes include indoor, outdoor, urban, nature, and +sci-fi environments. The data is collected with a simulated pinhole camera, which moves with ran- +dom and rich 6DoF motion patterns in the 3D space. + +We take advantage of the monocular image sequences {It}, the optical flow labels {Ftt+1}, and the +ground truth camera motions {δtt+1} in our task. 
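The two-module design of Sec. 3.1 can be read as a simple composition; a minimal sketch with stand-in functions (`matching_net`, `pose_net`, and `estimate_motion` are hypothetical stubs, not the paper's PWC-Net/ResNet50 modules):

```python
import numpy as np

def matching_net(img_t, img_t1):
    """Stand-in for M_theta: two RGB frames (H, W, 3) -> dense optical flow (H, W, 2)."""
    h, w, _ = img_t.shape
    return np.zeros((h, w, 2))  # stub; a real matching network regresses the flow

def pose_net(flow):
    """Stand-in for P_phi: optical flow -> relative motion (T, R) as 6 numbers."""
    return np.zeros(6)  # stub; (tx, ty, tz) up to scale, plus 3 rotation parameters

def estimate_motion(img_t, img_t1):
    # Stage 1: dense matching (optical flow) between consecutive frames.
    flow = matching_net(img_t, img_t1)
    # Stage 2: recover the up-to-scale camera motion from the flow alone.
    return pose_net(flow)

delta = estimate_motion(np.zeros((480, 640, 3)), np.zeros((480, 640, 3)))
```

The point of the split is that the pose module never sees raw pixels, only the matching result, which is what makes the later intrinsics-layer trick possible.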
Our objective is to jointly minimize the optical flow loss Lf and the camera motion loss Lp. The end-to-end loss is defined as:
+
+  L = \lambda L_f + L_p = \lambda \| M_\theta(I_t, I_{t+1}) - F_t^{t+1} \| + \| P_\phi(\hat{F}_t^{t+1}) - \delta_t^{t+1} \|   (1)
+
+where λ is a hyper-parameter balancing the two losses. We use ˆ· to denote the estimated variable from our model.
+
+Since TartanAir is purely synthetic, the biggest question is: can a model learned from simulation data generalize to real-world scenes? As discussed by Wang et al. [11], a large number of studies show that by training purely in simulation but with broad diversity, the learned model can be easily transferred to the real world. This is also known as domain randomization [38, 39]. In our experiments, we show that the diverse simulated data indeed enable the VO model to generalize to real-world data.
+
+Figure 2: a) Illustration of the FoV and image resolution in TartanAir, EuRoC, and KITTI datasets. b) Calculation of the intrinsics layer.
+
+3.3 Up-to-scale loss function
+
+The motion scale is unobservable from a monocular image sequence. In geometry-based methods, the scale is usually recovered from other sources of information, ranging from known object size or camera height to extra sensors such as an IMU. However, in most existing learning-based VO studies, the models generally neglect the scale problem and try to recover the motion with scale. This is feasible if the model is trained and tested with the same camera and in the same type of scenario. For example, in the KITTI dataset, the camera is mounted at a fixed height above the ground with a fixed orientation. A model can learn to remember the scale in this particular setup. Obviously, the model will have huge problems when tested with a different camera configuration. Imagine if the
+
+camera in KITTI moves a little upwards and becomes higher from the ground; the same amount of camera motion would cause a smaller optical flow value on the ground, which is inconsistent with the training data. Although the model could potentially learn to pick up other clues such as object size, it is still not fully reliable across different scenes or environments.
+
+Following the geometry-based methods, we only recover an up-to-scale camera motion from the monocular sequences. Knowing that the scale ambiguity only affects the translation T, we design a new loss function for T and keep the loss for rotation R unchanged. We propose two up-to-scale loss functions for Lp: the cosine similarity loss L_p^{cos} and the normalized distance loss L_p^{norm}. L_p^{cos} is defined by the cosine angle between the estimated T̂ and the label T:
+
+  L_p^{cos} = \frac{\hat{T} \cdot T}{\max(\|\hat{T}\| \cdot \|T\|, \epsilon)} + \|\hat{R} - R\|   (2)
+
+Similarly, for L_p^{norm}, we normalize the translation vector before calculating the distance between the estimation and the label:
+
+  L_p^{norm} = \left\| \frac{\hat{T}}{\max(\|\hat{T}\|, \epsilon)} - \frac{T}{\max(\|T\|, \epsilon)} \right\| + \|\hat{R} - R\|   (3)
+
+where ε = 1e-6 is used to avoid division by zero. From our preliminary empirical comparison, the two formulations have similar performance. In the following sections, we use Eq. (3) in place of Lp in Eq. (1). Later, we show by experiments that the proposed up-to-scale loss function is crucial for the model's generalization ability.
+
+3.4 Cross-camera generalization by encoding camera intrinsics
+
+In epipolar geometry theory, the camera intrinsics are required when recovering the camera pose from the essential matrix (assuming the images are undistorted). In fact, learning-based methods are unlikely to generalize to data with different camera intrinsics. Imagine a simple case where the camera switches to a lens with a larger focal length.
Assume the resolution of the image remains the same; the same amount of camera motion will then introduce bigger optical flow values, which we call the intrinsics ambiguity.
+
+A tempting solution for the intrinsics ambiguity is warping the input images to match the camera intrinsics of the training data. However, this is not quite practical, especially when the cameras differ too much. As shown in Fig. 2-a, if a model is trained on TartanAir, the warped KITTI image only covers a small part of TartanAir's field of view (FoV). After training, a model learns to exploit cues from all possible positions in the FoV and the interrelationship among those cues. Some cues no longer exist in the warped KITTI images, leading to drastic performance drops.
+
+3.4.1 Intrinsics layer
+
+We propose to train a model that takes both RGB images and camera intrinsics as input, so that the model can directly handle images coming from various camera settings. Specifically, instead of recovering the camera motion T_t^{t+1} only from the feature matching F_t^{t+1}, we design a new pose network Pφ(F_t^{t+1}, K), which depends also on the camera intrinsic parameters K = {fx, fy, ox, oy}, where fx and fy are the focal lengths, and ox and oy denote the position of the principal point.
+
+Figure 3: The data augmentation procedure of random cropping and resizing. In this way we generate a wide range of camera intrinsics (FoV 40◦ to 90◦).
+
+As for the implementation, we concatenate an IL (intrinsics layer) K^c ∈ R^{2×H×W} (H and W are the image height and width, respectively) to F_t^{t+1} before going into Pφ. To compose K^c, we first generate two index matrices X_ind and Y_ind for the x and y axes in the 2D image frame (Fig. 2-b). Then the two channels of K^c are calculated from the following formula:
+
+  K^c_x = (X_ind − o_x)/f_x
+  K^c_y = (Y_ind − o_y)/f_y   (4)
+
+The concatenation of F_t^{t+1} and K^c augments the optical flow estimation with 2D position information. Similar to the situation where geometry-based methods have to know the 2D coordinates of the matched features, K^c provides the necessary position information. In this way, the intrinsics ambiguity is explicitly handled by coupling the 2D positions and the matching estimations (F_t^{t+1}).
+
+3.4.2 Data generation for various camera intrinsics
+
+To make a model generalizable across different cameras, we need training data with various camera intrinsics. TartanAir only has one set of camera intrinsics, where fx = fy = 320, ox = 320, and oy = 240. We simulate various intrinsics by randomly cropping and resizing (RCR) the input images. As shown in Fig. 3, we first crop the image at a random location with a random size. Next, we resize the cropped image to the original size. One advantage of the IL is that during RCR, we can crop and resize the IL with the image, without recomputing the IL. To cover typical cameras with FoV between 40◦ and 90◦, we find that using random resizing factors up to 2.5 is sufficient during RCR. Note that the ground truth optical flow should also be scaled with respect to the resizing factor. We use very aggressive cropping and shifting in our training, which means the optical center could be way off the image center. Although the resulting intrinsic parameters will be uncommon in modern cameras, we find the generalization is improved.
+
+4 Experimental Results
+
+4.1 Network structure and training detail
+
+Network We utilize the pre-trained PWC-Net [40] as the matching network Mθ, and a modified ResNet50 [41] as the pose network Pφ. We remove the batch normalization layers from the ResNet, and add two output heads for the translation and rotation, respectively. The PWC-Net outputs optical flow at size H/4 × W/4, so Pφ is trained on 1/4-size input, consuming very little GPU memory. The overall inference time (including both Mθ and Pφ) is 40 ms on an NVIDIA GTX 1080 GPU.
+
+Training Our model is implemented in PyTorch [42] and trained on 4 NVIDIA GTX 1080 GPUs. There are two training stages. First, Pφ is trained separately using ground truth optical flow and camera motions for 100,000 iterations with a batch size of 100. In the second stage, Pφ and Mθ are connected and jointly optimized for 50,000 iterations with a batch size of 64. During both training stages, the learning rate is set to 1e-4 with a decay rate of 0.2 at 1/2 and 7/8 of the total training steps. The RCR is applied on the optical flow, RGB images, and the IL (Sec 3.4.2).
+
+4.2 How the training data quantity affects the generalization ability
+
+Figure 4: Generalization ability with respect to different quantities of training data. Model Pφ is trained on true optical flow. Blue: training loss, orange: testing loss on three unseen environments. Testing loss drops constantly with increasing quantity of training data.
+
+Figure 5: Comparison of the loss curve w/ and w/o the up-to-scale loss function. a) The training and testing loss w/o the up-to-scale loss. b) The translation and rotation losses of a). A big gap exists between the training and testing translation losses (orange arrow in b)). c) The training and testing losses w/ the up-to-scale loss. d) The translation and rotation losses of c). The translation loss gap decreases.
+
+To show the effects of data diversity, we compare the generalization ability of the model trained with different amounts of data. We use 20 environments from the TartanAir dataset, and set aside 3 environments (Seaside-town, Soul-city, and Hongkong-alley) only for testing, which results in more than 400,000 training frames and about 40,000 testing frames. As a comparison, the KITTI and EuRoC datasets provide 23,201 and 26,604 pose-labeled frames, respectively. Besides, data in KITTI and EuRoC are much more uniform in the sense of scene type and motion pattern. As shown in Fig.
4, we set up three experiments using 20,000 (comparable to KITTI and EuRoC), 100,000, and 400,000 frames of data for training the pose network Pφ. The experiments show that the generalization ability, measured by the gap between the training loss and the testing loss on unseen environments, improves constantly with increasing training data.

4.3 Up-to-scale loss function

Without the up-to-scale loss, we observe a gap between the training and testing loss even when training with a large amount of data (Fig. 5-a). When we plot the translation loss and rotation loss separately (Fig. 5-b), it shows that the translation error is the main contributor to the gap. After we apply the up-to-scale loss function described in Sec 3.3, the translation loss gap decreases (Fig. 5-c,d). During testing, we align the translation with the ground truth to recover the scale in the same way as described in [16, 6].

4.4 Camera intrinsics layer

The IL is critical to the generalization ability across datasets. Before moving to other datasets, we first design an experiment to investigate the properties of the IL using the pose network Pφ. As shown in Table 1, in the first two columns, where the data has no RCR augmentation, the training and testing losses are low. But these two models output nonsense values on data with RCR augmentation. One interesting finding is that adding the IL does not help in the case of only one type of intrinsics. This indicates that the network has learned a very different algorithm from the geometry-based methods, where the intrinsics are necessary to recover the motion. The last two columns show that the IL is critical when the input data is augmented by RCR (i.e., various intrinsics). Another interesting observation is that training a model with RCR and the IL leads to a lower testing loss (last column) than training on only one type of intrinsics (first two columns).
This indicates that by generating data with various intrinsics, we learn a more robust model for the VO task.

Table 1: Training and testing losses with four combinations of RCR and IL settings. The IL is critical in the presence of RCR. The model trained with RCR reaches a lower testing loss than those without RCR.

Training configuration     w/o RCR, w/o IL   w/o RCR, w/ IL   w/ RCR, w/o IL   w/ RCR, w/ IL
Training loss              0.0325            0.0311           0.1534           0.0499
Test loss on data w/ RCR   -                 -                0.1999           0.0723
Test loss on data w/o RCR  0.0744            0.0714           0.1630           0.0549

Table 2: Comparison of translation and rotation on the KITTI dataset. DeepVO [43] is a supervised method trained on Seq 00, 02, 08, 09. It contains an RNN module, which accumulates information from multiple frames. Wang et al. [9] is a supervised method trained on Seq 00-08 that uses the semantic information of multiple frames to optimize the trajectory. UnDeepVO [44] and GeoNet [16] are trained on Seq 00-08 in an unsupervised manner. VISO2-M [45] and ORB-SLAM [3] are geometry-based monocular VO. ORB-SLAM uses bundle adjustment on multiple frames to optimize the trajectory. Our method works in a pure VO manner (it only takes two frames). It has never seen any KITTI data before testing, and yet achieves competitive results.

                     Seq 06          Seq 07          Seq 09          Seq 10          Ave
                     trel    rrel    trel    rrel    trel    rrel    trel    rrel    trel    rrel
DeepVO [43]*†        5.42    5.82    3.91    4.60    -       -       8.11    8.83    5.81    6.41
Wang et al. [9]*†    -       -       -       -       8.04    1.51    6.23    0.97    7.14    1.24
UnDeepVO [44]*       6.20    1.98    3.15    2.48    -       -       10.63   4.65    6.66    3.04
GeoNet [16]*         9.28    4.34    8.27    5.93    26.93   9.54    20.73   9.04    16.3    7.21
VISO2-M [45]         7.3     6.14    23.61   19.11   4.04    1.43    25.2    3.8     15.04   7.62
ORB-SLAM [3]†        18.68   0.26    10.96   0.37    15.3    0.26    3.71    0.3     12.16   0.3
TartanVO (ours)      4.72    2.95    4.32    3.41    6.0     3.11    6.89    2.73    5.48    3.05

trel: average translational RMSE drift (%) on lengths of 100–800 m.
rrel: average rotational RMSE drift (°/100 m) on lengths of 100–800 m.
*: starred methods are trained or finetuned on the KITTI dataset.
†: these methods use multiple frames to optimize the trajectory after the VO process.

4.5 Generalization to real-world data without finetuning

KITTI dataset The KITTI dataset is one of the most influential datasets for VO/SLAM tasks. We compare our model, TartanVO, with two supervised learning models (DeepVO [43], Wang et al. [9]), two unsupervised models (UnDeepVO [44], GeoNet [16]), and two geometry-based methods (VISO2-M [45], ORB-SLAM [3]). All the learning-based methods except ours are trained on the KITTI dataset. Note that our model has not been finetuned on KITTI and is trained purely on a synthetic dataset. Moreover, many algorithms use multiple frames to further optimize the trajectory; in contrast, our model only takes two consecutive images. As listed in Table 2, TartanVO achieves comparable performance even though neither finetuning nor backend optimization is performed.

EuRoC dataset The EuRoC dataset contains 11 sequences collected by a MAV in an indoor environment. There are three levels of difficulty with respect to the motion pattern and lighting conditions. Few learning-based methods have ever been tested on EuRoC due to the lack of training data. The changing lighting conditions and aggressive rotations pose real challenges to geometry-based methods as well. In Table 3, we compare with geometry-based methods including SVO [46], ORB-SLAM [3], DSO [5], and LSD-SLAM [2]. Note that all these geometry-based methods perform some type of backend optimization on selected keyframes along the trajectory. In contrast, our model only estimates the frame-by-frame camera motion, and can be considered the frontend module of these geometry-based methods. In Table 3, we show the absolute trajectory error (ATE) of 6 medium and difficult trajectories.
Our method shows the best performance on the two most difficult trajectories, VR1-03 and VR2-03, where the MAV has very aggressive motion. A visualization of the trajectories is shown in Fig. 6.

Challenging TartanAir data TartanAir provides 16 very challenging testing trajectories² that cover many extremely difficult cases, including changing illumination, dynamic objects, fog and rain effects, lack of features, and large motion. As listed in Table 4, we compare our model with ORB-SLAM using the ATE. Our model shows more robust performance in these challenging cases.

²https://github.com/castacks/tartanair_tools#download-the-testing-data-for-the-cvpr-visual-slam-challenge

Table 3: Comparison of ATE on the EuRoC dataset. We are among the very few learning-based methods that can be tested on this dataset. As with the geometry-based methods, our model has never seen the EuRoC data before testing. We show the best performance on the two difficult sequences VR1-03 and VR2-03. Note that our method does not contain any backend optimization module.

                   Seq.              MH-04   MH-05   VR1-02   VR1-03   VR2-02   VR2-03
Geometry-based *   SVO [46]          1.36    0.51    0.47     x        0.47     x
                   ORB-SLAM [3]      0.20    0.19    x        x        0.07     x
                   DSO [5]           0.25    0.11    0.11     0.93     0.13     1.16
                   LSD-SLAM [2]      2.13    0.85    1.11     x        x        x
Learning-based †   TartanVO (ours)   0.74    0.68    0.45     0.64     0.67     1.04

* These results are from [46]. † Other learning-based methods [36] did not report numerical results.

Figure 6: Visualization of the 6 EuRoC trajectories in Table 3. Black: ground truth trajectory; orange: estimated trajectory.

Table 4: Comparison of ATE on the TartanAir dataset. These trajectories are not contained in the training set. We repeatedly run ORB-SLAM 5 times and report the best result.
Seq               MH000   MH001   MH002   MH003   MH004   MH005   MH006   MH007
ORB-SLAM [3]      1.3     0.04    2.37    2.45    x       x       21.47   2.73
TartanVO (ours)   4.88    0.26    2       0.94    1.07    3.19    1       2.04

Figure 7: TartanVO outputs competitive results on D435i IR data compared to the T265 (equipped with a fish-eye stereo camera and an IMU). a) The hardware setup. b) Trial 1: smooth and slow motion. c) Trial 2: smooth and medium speed. d) Trial 3: aggressive and fast motion. See videos for details.

RealSense Data Comparison We test TartanVO using data collected by a customized sensor setup. As shown in Fig. 7 a), a RealSense D435i is fixed on top of a RealSense T265 tracking camera. We use the left near-infrared (IR) image of the D435i in our model and compare the result with the trajectories provided by the T265 tracking camera. We present 3 loopy trajectories following similar paths with increasing motion difficulty. From Fig. 7 b) to d), we observe that although TartanVO has never seen real-world images or IR data during training, it still generalizes well and predicts odometry closely matching the output of the T265, a dedicated device that estimates camera motion with a fish-eye stereo camera pair and an IMU.

5 Conclusions

We presented TartanVO, a generalizable learning-based visual odometry. By training our model with a large amount of data, we showed the effectiveness of diverse data for model generalization. A smaller gap between training and testing losses can be expected with the newly defined up-to-scale loss, further increasing the generalization capability. We showed through extensive experiments that, equipped with the intrinsics layer designed explicitly for handling different cameras, TartanVO can generalize to unseen datasets and achieve performance even better than dedicated learning models trained directly on those datasets. Our work opens up many exciting future research directions, such as generalizable learning-based VIO, stereo VO, and multi-frame VO.
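The up-to-scale loss highlighted in the conclusions (Sec. 4.3) can be made concrete with a short sketch. This is an illustrative numpy version assuming a unit-normalization form with an ε guard on the predicted translation; the exact distance and weighting used in the paper's Sec. 3.3 may differ.

```python
import numpy as np

def up_to_scale_loss(t_pred, t_gt, r_pred, r_gt, eps=1e-6):
    """Compare translations only up to scale: both the predicted and the
    ground-truth translation are normalized to unit length before taking
    the distance, so the (unobservable) monocular scale is not penalized.
    The rotation term is an ordinary distance."""
    t_pred_n = t_pred / max(np.linalg.norm(t_pred), eps)
    t_gt_n = t_gt / max(np.linalg.norm(t_gt), eps)
    return float(np.linalg.norm(t_pred_n - t_gt_n) + np.linalg.norm(r_pred - r_gt))
```

Under this loss, a prediction that differs from the ground truth only by a scale factor incurs zero translation penalty, which is exactly why the train/test translation gap in Fig. 5 shrinks.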
Acknowledgments

This work was supported by ARL award #W911NF1820218. Special thanks to Yuheng Qiu and Huai Yu from Carnegie Mellon University for preparing simulation results and experimental setups.

References

[1] J. Fuentes-Pacheco, J. Ruiz-Ascencio, and J. M. Rendón-Mancha. Visual simultaneous localization and mapping: a survey. Artificial Intelligence Review, 43(1):55–81, 2015.

[2] J. Engel, T. Schops, and D. Cremers. LSD-SLAM: Large-scale direct monocular slam. In ECCV, 2014.

[3] R. Mur-Artal, J. M. M. Montiel, and J. D. Tardos. Orb-slam: a versatile and accurate monocular slam system. IEEE transactions on robotics, 31(5):1147–1163, 2015.

[4] C. Forster, M. Pizzoli, and D. Scaramuzza. Svo: Fast semi-direct monocular visual odometry. In ICRA, pages 15–22. IEEE, 2014.

[5] J. Engel, V. Koltun, and D. Cremers. Direct sparse odometry. IEEE transactions on pattern analysis and machine intelligence, 40(3):611–625, 2017.

[6] T. Zhou, M. Brown, N. Snavely, and D. G. Lowe. Unsupervised learning of depth and ego-motion from video. In CVPR, 2017.

[7] S. Vijayanarasimhan, S. Ricco, C. Schmid, R. Sukthankar, and K. Fragkiadaki. Sfm-net: Learning of structure and motion from video. In arXiv:1704.07804, 2017.

[8] S. Wang, R. Clark, H. Wen, and N. Trigoni. End-to-end, sequence-to-sequence probabilistic visual odometry through deep neural networks. The International Journal of Robotics Research, 37(4-5):513–542, 2018.

[9] X. Wang, D. Maturana, S. Yang, W. Wang, Q. Chen, and S. Scherer. Improving learning-based ego-motion estimation with homomorphism-based losses and drift correction. In 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 970–976. IEEE, 2019.

[10] G. Younes, D. Asmar, E. Shammas, and J. Zelek. Keyframe-based monocular slam: design, survey, and future directions. Robotics and Autonomous Systems, 98:67–88, 2017.

[11] W. Wang, D. Zhu, X. Wang, Y.
Hu, Y. Qiu, C. Wang, Y. Hu, A. Kapoor, and S. Scherer. Tartanair: A dataset to push the limits of visual slam. In 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2020.

[12] R. Roberts, H. Nguyen, N. Krishnamurthi, and T. Balch. Memory-based learning for visual odometry. In Robotics and Automation, 2008. ICRA 2008. IEEE International Conference on, pages 47–52. IEEE, 2008.

[13] V. Guizilini and F. Ramos. Semi-parametric models for visual odometry. In Robotics and Automation (ICRA), 2012 IEEE International Conference on, pages 3482–3489. IEEE, 2012.

[14] T. A. Ciarfuglia, G. Costante, P. Valigi, and E. Ricci. Evaluation of non-geometric methods for visual odometry. Robotics and Autonomous Systems, 62(12):1717–1730, 2014.

[15] K. Tateno, F. Tombari, I. Laina, and N. Navab. Cnn-slam: Real-time dense monocular slam with learned depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6243–6252, 2017.

[16] Z. Yin and J. Shi. Geonet: Unsupervised learning of dense depth, optical flow and camera pose. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 2, 2018.

[17] H. Zhan, R. Garg, C. S. Weerasekera, K. Li, H. Agarwal, and I. Reid. Unsupervised learning of monocular depth estimation and visual odometry with deep feature reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 340–349, 2018.

[18] A. Ranjan, V. Jampani, L. Balles, K. Kim, D. Sun, J. Wulff, and M. J. Black. Competitive collaboration: Joint unsupervised learning of depth, camera motion, optical flow and motion segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[19] G. Costante, M. Mancini, P. Valigi, and T. A. Ciarfuglia. Exploring representation learning with cnns for frame-to-frame ego-motion estimation.
RAL, 1(1):18–25, 2016.

[20] B. Ummenhofer, H. Zhou, J. Uhrig, N. Mayer, E. Ilg, A. Dosovitskiy, and T. Brox. Demon: Depth and motion network for learning monocular stereo. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[21] N. Yang, L. v. Stumberg, R. Wang, and D. Cremers. D3vo: Deep depth, deep pose and deep uncertainty for monocular visual odometry. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[22] Y. Zou, Z. Luo, and J.-B. Huang. Df-net: Unsupervised joint learning of depth and flow using cross-task consistency. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.

[23] H. Zhou, B. Ummenhofer, and T. Brox. Deeptam: Deep tracking and mapping. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.

[24] C. Tang and P. Tan. Ba-net: Dense bundle adjustment network. arXiv preprint arXiv:1806.04807, 2018.

[25] R. Clark, M. Bloesch, J. Czarnowski, S. Leutenegger, and A. J. Davison. Ls-net: Learning to solve nonlinear least squares for monocular stereo. arXiv preprint arXiv:1809.02966, 2018.

[26] H. Li, W. Chen, J. Zhao, J.-C. Bazin, L. Luo, Z. Liu, and Y.-H. Liu. Robust and efficient estimation of absolute camera pose for monocular visual odometry. In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), May 2020.

[27] K. Tateno, F. Tombari, I. Laina, and N. Navab. Cnn-slam: Real-time dense monocular slam with learned depth prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), July 2017.

[28] L. Sheng, D. Xu, W. Ouyang, and X. Wang. Unsupervised collaborative learning of keyframe detection and visual odometry towards monocular deep slam. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.

[29] H. Zhan, C. S. Weerasekera, J.-W.
Bian, and I. Reid. Visual odometry revisited: What should be learnt? In Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), May 2020.

[30] C. Wang, J. M. Buenaposada, R. Zhu, and S. Lucey. Learning depth from monocular videos using direct methods. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[31] Y. Wang, P. Wang, Z. Yang, C. Luo, Y. Yang, and W. Xu. Unos: Unified unsupervised optical-flow and stereo-depth estimation by watching videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[32] R. Mahjourian, M. Wicke, and A. Angelova. Unsupervised learning of depth and ego-motion from monocular video using 3d geometric constraints. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.

[33] S. Li, X. Wang, Y. Cao, F. Xue, Z. Yan, and H. Zha. Self-supervised deep visual odometry with online adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2020.

[34] D. Nistér. An efficient solution to the five-point relative pose problem. IEEE transactions on pattern analysis and machine intelligence, 26(6):756–770, 2004.

[35] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision meets robotics: The kitti dataset. The International Journal of Robotics Research, 32(11):1231–1237, 2013.

[36] M. Burri, J. Nikolic, P. Gohl, T. Schneider, J. Rehder, S. Omari, M. W. Achtelik, and R. Siegwart. The euroc micro aerial vehicle datasets. The International Journal of Robotics Research, 35(10):1157–1163, 2016.

[37] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3213–3223, 2016.

[38] J.
Tobin, R. Fong, A. Ray, J. Schneider, W. Zaremba, and P. Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. In IROS, pages 23–30. IEEE, 2017.

[39] J. Tremblay, A. Prakash, D. Acuna, M. Brophy, V. Jampani, C. Anil, T. To, E. Cameracci, S. Boochoon, and S. Birchfield. Training deep networks with synthetic data: Bridging the reality gap by domain randomization. In CVPR Workshops, pages 969–977, 2018.

[40] D. Sun, X. Yang, M.-Y. Liu, and J. Kautz. Pwc-net: Cnns for optical flow using pyramid, warping, and cost volume. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 8934–8943, 2018.

[41] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[42] A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch. In NIPS-W, 2017.

[43] S. Wang, R. Clark, H. Wen, and N. Trigoni. Deepvo: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In Robotics and Automation (ICRA), 2017 IEEE International Conference on, pages 2043–2050. IEEE, 2017.

[44] R. Li, S. Wang, Z. Long, and D. Gu. Undeepvo: Monocular visual odometry through unsupervised deep learning. In 2018 IEEE International Conference on Robotics and Automation (ICRA), pages 7286–7291. IEEE, 2018.

[45] S. Song, M. Chandraker, and C. Guest. High accuracy monocular SFM and scale correction for autonomous driving. IEEE Transactions on Pattern Analysis & Machine Intelligence, pages 1–1, 2015.

[46] C. Forster, Z. Zhang, M. Gassner, M. Werlberger, and D. Scaramuzza. Svo: Semidirect visual odometry for monocular and multicamera systems. IEEE Transactions on Robotics, 33(2):249–265, 2016.
A Additional experimental details

In this section, we provide additional details of the experiments, including the network structure, training parameters, qualitative results, and quantitative results.

A.1 Network Structure

Our network consists of two sub-modules, namely the matching network Mθ and the pose network Pφ. As mentioned in the paper, we employ PWC-Net as the matching network, which takes in two consecutive images of size 640 × 448 (PWC-Net only accepts image sizes that are multiples of 64). The output optical flow, which is 160 × 112 in size, is fed into the pose network. The structure of the pose network is detailed in Table 5. The overall inference time (including both Mθ and Pφ) is 40 ms on an NVIDIA GTX 1080 GPU.

Table 5: Parameters of the proposed pose network. Constructions of residual blocks are designated in brackets, multiplied by the number of stacked blocks. Downsampling is performed by Conv1 and at the beginning of each residual block. After the residual blocks, we reshape the feature map into a one-dimensional vector, which goes through three fully connected layers in the translation head and the rotation head, respectively.
Name      Layer setting       Output dimension
Input     -                   1/4 H × 1/4 W × 2       (112 × 160)
Conv1     3 × 3, 32           1/8 H × 1/8 W × 32      (56 × 80)
Conv2     3 × 3, 32           1/8 H × 1/8 W × 32      (56 × 80)
Conv3     3 × 3, 32           1/8 H × 1/8 W × 32      (56 × 80)

ResBlock
Block1    [3 × 3, 64] × 3     1/16 H × 1/16 W × 64    (28 × 40)
Block2    [3 × 3, 128] × 4    1/32 H × 1/32 W × 128   (14 × 20)
Block3    [3 × 3, 128] × 6    1/64 H × 1/64 W × 128   (7 × 10)
Block4    [3 × 3, 256] × 7    1/128 H × 1/128 W × 256 (4 × 5)
Block5    [3 × 3, 256] × 3    1/256 H × 1/256 W × 256 (2 × 3)

FC trans                          FC rot
Trans head fc1   256 · 6 × 128    Rot head fc1   256 · 6 × 128
Trans head fc2   128 × 32         Rot head fc2   128 × 32
Trans head fc3   32 × 3           Rot head fc3   32 × 3
Output           3                Output         3

Table 6: Comparison of ORB-SLAM and TartanVO on the TartanAir dataset using the ATE metric. These trajectories are not contained in the training set. We repeatedly run ORB-SLAM for 5 times and report the best result.

Seq               SH000   SH001   SH002   SH003   SH004   SH005   SH006   SH007
ORB-SLAM          x       3.5     x       x       x       x       x       x
TartanVO (ours)   2.52    1.61    3.65    0.29    3.36    4.74    3.72    3.06

A.2 Testing Results on TartanAir

TartanAir provides 16 challenging testing trajectories. We reported 8 trajectories in the experiment section; the remaining 8 trajectories are shown in Table 6. We compare TartanVO against the ORB-SLAM monocular algorithm. Due to the randomness in ORB-SLAM, we repeatedly run ORB-SLAM for 5 trials and report the best result. We consider a trial a failure if ORB-SLAM tracks less than 80% of the trajectory. A visualization of all 16 trajectories (including the 8 trajectories shown in the experiment section) is shown in Figure 8.

Figure 8: Visualization of the 16 testing trajectories in the TartanAir dataset. The black dashed line represents the ground truth. The estimated trajectories of TartanVO and the ORB-SLAM monocular algorithm are shown in orange and blue, respectively.
The ORB-SLAM algorithm frequently loses tracking in these challenging cases; it fails on 9 of the 16 testing trajectories. Note that we run full-fledged ORB-SLAM with the local bundle adjustment, global bundle adjustment, and loop closure components. In contrast, although TartanVO only takes in two images, it is much more robust than ORB-SLAM.

diff --git a/动态slam/tartanvo average time.txt b/动态slam/tartanvo average time.txt
new file mode 100644
index 0000000..4578211
--- /dev/null
+++ b/动态slam/tartanvo average time.txt
tartanvo

shibuya_Standing01
sum: 99
total time: 8.106080770492554
average time: 0.08458585690970373

KITTI 04 sequence
sum: 270
total time: 20.52476716041565
average time: 0.07601765614968758

diff --git a/武博文-学术学位研究生学位论文中期考评表.docx b/武博文-学术学位研究生学位论文中期考评表.docx
new file mode 100644
index 0000000..1a36032
--- /dev/null
+++ b/武博文-学术学位研究生学位论文中期考评表.docx
University of Electronic Science and Technology of China
Mid-term Evaluation Form for the Academic Degree Postgraduate Thesis
Degree level: □ Doctoral   Master
Major: Software Engineering
School: School of Information and Software Engineering
Student ID: 202221090225
Name: Wu Bowen
Thesis title: Research on Visual SLAM Based on Instance Segmentation in Outdoor Dynamic Scenes
Supervisor: Wang Chunyu
Date: September 15, 2024
Graduate School of UESTC

Main work completed
1. Thesis proposal passed on: December 21, 2023
2. Coursework
Have the credit requirements of the training program been met?
□ Yes   No
3.
Thesis research progress
Summarize the theoretical analysis or computation and the experimental (or empirical) work (may continue on additional pages)

I. A moving-object discrimination algorithm based on instance segmentation and optical flow detection

Theoretical analysis

Moving-object discrimination is a key step in the whole dynamic SLAM problem: how well it is solved directly affects the camera pose estimation and the back-end mapping quality. The problem is to decide whether an object's spatial position has moved between two camera frames. Using only the semantic information obtained from instance segmentation, one can only label objects of known dynamic classes as dynamic; one cannot determine whether an object has actually moved in the current image. Moreover, semantic information fails for unknown moving objects. Therefore, on the basis of ORB_SLAM2, a moving-object discrimination method based on real-time optical flow detection and instance segmentation is designed.

Optical flow detection is a technique for estimating pixel motion in an image sequence, i.e., the motion trajectories of pixels between consecutive frames. Instance segmentation not only detects the objects in an image but also produces a precise pixel-level mask for each object instance; it must distinguish not only different classes but also different instances of the same class.

The designed moving-object discrimination method is as follows; the algorithm flow is shown in Fig. 1-1:

Fig. 1-1 Flow chart of the moving-object discrimination algorithm

First, candidate moving objects are determined: instance segmentation yields the object masks of the current frame, and the nonzero pixels of instance O_i are taken as candidate dynamic points p_io. Meanwhile, optical flow detection gives the flow of each pixel p_i in the current frame, with a component f_ix in the x direction and f_iy in the y direction; a nonzero component indicates motion of the pixel in that direction. The two components are therefore combined into a single flow magnitude f_i, computed as in Eq. (1):

    f_i = sqrt(f_ix² + f_iy²)    (1)

A flow threshold Th_f is set; when the flow magnitude f_i exceeds the threshold, the pixel is considered to exhibit flow motion and is taken as a flow-dynamic point p_if; otherwise it is a static point. Th_f is set to 0.12 in the system. For the candidate dynamic points of instance O_i, the flow magnitudes and the number of flow-dynamic points are computed. The motion state D_i of instance O_i can be expressed via Eq. (2):

    r_d = p_if / p_io    (2)

where r_d is the proportion of flow-dynamic points among the candidate dynamic points of the instance, p_io is the total number of candidate dynamic points of instance O_i, and p_if is the total number of flow-dynamic points; in this way the optical flow information is fused with the semantic information. Finally, D_i determines the motion state of the object, 0 for static and 1 for moving, as in Eq. (3):

    D_i = 1 if r_d ≥ Th_d, 0 if r_d < Th_d    (3)
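The per-instance discrimination described by Eqs. (1)–(3) can be sketched as follows. This is an illustrative numpy version: the instance masks and flow field would come from the instance-segmentation and optical-flow networks, Th_f = 0.12 is the value stated above, and the Th_d value is an assumed placeholder since the report does not state it.

```python
import numpy as np

def instance_motion_state(flow, masks, th_f=0.12, th_d=0.5):
    """Label each segmented instance as moving (1) or static (0).
    flow:  H x W x 2 optical-flow field (f_ix, f_iy per pixel).
    masks: list of H x W boolean instance masks from instance segmentation.
    th_f:  flow-magnitude threshold (0.12 in the report).
    th_d:  ratio threshold Th_d (assumed value; not given in the report)."""
    f = np.sqrt(flow[..., 0] ** 2 + flow[..., 1] ** 2)   # Eq. (1)
    states = []
    for mask in masks:
        p_io = int(mask.sum())                  # candidate dynamic points
        if p_io == 0:
            states.append(0)
            continue
        p_if = int((f[mask] > th_f).sum())      # flow-dynamic points
        r_d = p_if / p_io                       # Eq. (2)
        states.append(1 if r_d >= th_d else 0)  # Eq. (3)
    return states
```

An instance whose flow-dynamic ratio reaches Th_d is treated as truly moving, and its feature points can then be excluded from (or tracked separately to) the camera pose estimation.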