The task of visual scene understanding is to develop computer algorithms for automatically understanding and analysing the content of scene images and videos, which is a fundamental problem for computer vision and machine learning research. Existing methods for scene understanding are still far from human performance in terms of prediction accuracy, semantic richness and learning efficiency. In this project, we address a number of challenging problems for approaching human-level scene understanding. We will develop novel scene understanding methods based on the recent development of semantic segmentation methods, but go beyond the conventional category-level recognition in existing methods in terms of learning paradigm and semantic concepts for prediction. Specifically, we will focus on instance-level semantic segmentation, high-level semantics recognition including relation recognition and scene description generation, learning from web data with weak annotations, and learning from synthetic data.