Exact 3D reconstructions and measurements of indoor scenes are useful for numerous applications, e.g., augmented reality furniture placement. Recent 3D reconstruction approaches produce complex and highly detailed 3D models that are difficult to handle, since the computational cost of manipulating a model is directly related to its complexity. Consequently, displaying and manipulating such detailed models on a mobile device with limited resources is challenging. To keep processing times low, simplified approximations of highly detailed models are desirable for mobile applications. Therefore, in this thesis we present a framework for simplifying indoor scenes using multi-modal RGBD video sequences.

The framework consists of two parts: a 3D layout estimation pipeline, and an object detection and pose estimation approach. Layout segments (ground plane, walls, ceiling) are represented by 3D planes and merged over time. After determining the 2D floor plan of the fused point cloud obtained from the registered shots, a compact representation of the scene is generated by extruding the floor plan. To create semantically meaningful 3D layouts, objects are detected and subsequently replaced by synthetic CAD models using state-of-the-art 2D object detection methods and 3D point cloud descriptors. In each frame, semantic types and poses are determined. A Markov Random Field (MRF) defined over time exploits temporal coherence between consecutive frames to refine the pose estimates. The framework is trained in an offline stage on synthetically rendered point clouds obtained from CAD models downloaded from a public database.

Qualitative and quantitative experiments on various indoor video sequences show that the resulting spatial layouts outperform monocular state-of-the-art algorithms when compared against a variety of semantically labeled ground-truth scenes. The MRF optimization and the temporal fusion of multiple 3D layouts yield improvements in the pose estimates and in the accuracy of the scene dimensions. Moreover, in terms of storage demand, we achieve a data reduction rate of over 99% compared to the raw point-based representation.
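
To make the extrusion step concrete, the following is a minimal sketch (not the implementation used in this thesis) of how a 2D floor-plan polygon can be extruded into a compact 3D layout mesh. It assumes the floor plan is available as a convex polygon of corner points in counter-clockwise order and a single ceiling height; the function name, the fan triangulation, and the example room dimensions are illustrative assumptions.

```python
import numpy as np

def extrude_floor_plan(polygon_xy, height):
    """Extrude a 2D floor-plan polygon into a compact 3D layout mesh.

    polygon_xy : (N, 2) array of floor-plan corners, counter-clockwise order.
    height     : ceiling height in metres.
    Returns (vertices, faces), faces given as vertex-index triples.
    Assumes a convex polygon, so a simple fan triangulation is valid.
    """
    polygon_xy = np.asarray(polygon_xy, dtype=float)
    n = len(polygon_xy)

    # Floor vertices (z = 0) followed by ceiling vertices (z = height).
    floor = np.column_stack([polygon_xy, np.zeros(n)])
    ceiling = np.column_stack([polygon_xy, np.full(n, height)])
    vertices = np.vstack([floor, ceiling])

    faces = []
    # Walls: one quad (two triangles) per floor-plan edge.
    for i in range(n):
        j = (i + 1) % n
        faces.append((i, j, n + j))
        faces.append((i, n + j, n + i))
    # Floor and ceiling: fan triangulation from the first corner.
    for i in range(1, n - 1):
        faces.append((0, i + 1, i))
        faces.append((n, n + i, n + i + 1))

    return vertices, np.array(faces)

if __name__ == "__main__":
    # Hypothetical 4 m x 3 m rectangular room with a 2.5 m ceiling.
    corners = [(0, 0), (4, 0), (4, 3), (0, 3)]
    verts, tris = extrude_floor_plan(corners, 2.5)
    print(verts.shape, tris.shape)  # (8, 3) vertices, (12, 3) triangles
```

For a rectangular room this extruded layout amounts to only eight vertices and twelve triangles, which illustrates where the large reduction over a raw point-based representation comes from.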
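The temporal MRF can likewise be illustrated with a generic sketch. The abstract does not specify the energy terms, so the following assumes a simple chain model: one node per frame, a unary cost derived from per-frame pose hypotheses (e.g., negative detection scores), and a pairwise cost penalizing large pose changes between consecutive frames. Because the graph is a chain, the MAP labeling can be computed exactly with dynamic programming; the function name, the discretised pose angles, and the angular smoothness term are illustrative assumptions.

```python
import numpy as np

def refine_poses_chain_mrf(unary_costs, angles, smoothness=1.0):
    """MAP inference on a chain MRF over per-frame pose hypotheses.

    unary_costs : (T, K) cost of assigning hypothesis k in frame t
                  (e.g. negative log detection score).
    angles      : (K,) discretised pose angles in degrees, shared by all frames.
    smoothness  : weight of the pairwise term penalising pose changes
                  between consecutive frames.
    Returns the selected hypothesis index per frame.
    """
    unary_costs = np.asarray(unary_costs, dtype=float)
    angles = np.asarray(angles, dtype=float)
    T, K = unary_costs.shape

    # Pairwise cost: angular difference between hypotheses, wrapped to [0, 180].
    diff = np.abs(angles[:, None] - angles[None, :])
    pairwise = smoothness * np.minimum(diff, 360.0 - diff)

    # Forward pass (Viterbi): accumulate the cheapest cost per hypothesis.
    cost = unary_costs[0].copy()
    backptr = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        total = cost[:, None] + pairwise          # (K_prev, K_curr)
        backptr[t] = np.argmin(total, axis=0)
        cost = total.min(axis=0) + unary_costs[t]

    # Backward pass: recover the optimal label sequence.
    labels = np.empty(T, dtype=int)
    labels[-1] = int(np.argmin(cost))
    for t in range(T - 1, 0, -1):
        labels[t - 1] = backptr[t, labels[t]]
    return labels
```

Exact inference is possible here only because the temporal model forms a chain; richer graph structures would require approximate optimization.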