Microsoft is proud to sponsor the 41st annual Conference on Computer Vision and Pattern Recognition (CVPR 2024), held June 17 to June 21. This premier conference covers a wide range of topics in the field, including 3D reconstruction and modeling, action and motion analysis, video and image processing, synthetic data generation, neural networks, and much more. This year, 63 Microsoft papers were accepted, with six selected for oral presentations. This article highlights these contributions.
The diversity of these research projects reflects the interdisciplinary approach taken by Microsoft research teams, from techniques that accurately recreate human figures and 3D perspectives in augmented reality (AR) to the combination of advanced image segmentation with synthetic data to better reproduce real-world scenarios. Other projects demonstrate how researchers combine machine learning with natural language processing and structured data, developing models that not only visualize but also interact with their environment. Collectively, these projects aim to improve machine perception and enable more precise and responsive interactions with the world.
Oral presentations
BioCLIP: A Vision Foundation Model for the Tree of Life
Samuel Stevens, Jiaman Wu, Matthew J. Thompson, Elizabeth G. Campolongo, Chan Hee Song, David Carlyn, Li Dong, Wasila Dahdul, Charles Stewart, Tanya Y. Berger-Wolf, Wei-Lun Chao, Yu Su
The growing volume of images captured from a variety of sources, from drones to smartphones, provides a rich source of biological data. To harness this potential, we introduce TreeOfLife-10M, the largest and most diverse ML-ready biological image dataset, and BioCLIP, a vision foundation model for the biological sciences. Using TreeOfLife-10M's vast array of organism images and structured knowledge, BioCLIP excels at fine-grained biological classification, outperforming existing models by significant margins and demonstrating strong generalizability.
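Because BioCLIP follows the CLIP recipe, zero-shot classification reduces to comparing an image embedding against embeddings of candidate taxonomic labels. Below is a minimal sketch of that workflow, assuming the open_clip library and the publicly released imageomics/bioclip checkpoint on the Hugging Face Hub; the image path and label set are hypothetical:

```python
import torch
from PIL import Image
import open_clip

# Load BioCLIP through open_clip (assumes the imageomics/bioclip weights
# are available on the Hugging Face Hub, as with other open_clip models).
model, _, preprocess = open_clip.create_model_and_transforms("hf-hub:imageomics/bioclip")
tokenizer = open_clip.get_tokenizer("hf-hub:imageomics/bioclip")
model.eval()

# Hypothetical candidate labels; taxonomic names play to BioCLIP's training.
labels = ["Danaus plexippus", "Vanessa cardui", "Papilio machaon"]
text = tokenizer(labels)
image = preprocess(Image.open("butterfly.jpg")).unsqueeze(0)  # hypothetical path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Cosine similarity between the image and each candidate label.
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(labels, probs[0].tolist())))
```

Because the label set is supplied at inference time, the same weights can score taxa the model never saw with explicit supervision, which is what makes the zero-shot framing useful for biology.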
EgoGen: an egocentric synthetic data generator
Gen Li, Kaifeng Zhao, Siwei Zhang, Xiaozhong Lyu, Mihai Dusmanu, Yan Zhang, Marc Pollefeys
A crucial challenge in augmented reality (AR) is simulating realistic body movements that drive cameras to authentic egocentric views. To address this, the authors developed EgoGen, a sophisticated synthetic data generator that not only improves the accuracy of training data for egocentric tasks but also refines the integration of motion and perception. It offers a practical solution for creating realistic egocentric training data and aims to serve as a useful tool for egocentric computer vision research.
Florence-2: Advancing a unified representation for a variety of vision tasks
Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan
Florence-2 introduces a unified, prompt-based vision foundation model capable of handling a variety of tasks, from captioning to object detection and segmentation. Designed to interpret text prompts as task instructions, Florence-2 generates text outputs across a spectrum of vision and vision-language tasks. The model is trained on the FLD-5B dataset, comprising 5.4 billion annotations across 126 million images, built through an iterative strategy of automated image annotation and continuous model refinement.
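Because the task is encoded in the text prompt, switching Florence-2 from captioning to detection is a one-line change. Here is a brief sketch, assuming the microsoft/Florence-2-large checkpoint and the usage pattern documented on its Hugging Face model card; the image path is hypothetical:

```python
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

# Florence-2 ships custom modeling code, hence trust_remote_code=True.
model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("street.jpg")  # hypothetical input image

# The task is part of the prompt: "<CAPTION>" for captioning, "<OD>" for
# object detection; other tags cover grounding and segmentation tasks.
for task in ("<CAPTION>", "<OD>"):
    inputs = processor(text=task, images=image, return_tensors="pt")
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs["input_ids"],
            pixel_values=inputs["pixel_values"],
            max_new_tokens=512,
        )
    raw_text = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    # post_process_generation parses the raw text into task-specific output:
    # a caption string, or bounding boxes plus labels for detection.
    print(processor.post_process_generation(raw_text, task=task, image_size=image.size))
```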
LISA: Reasoning segmentation via a large language model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, Jiaya Jia
This work presents reasoning segmentation, a new task that produces segmentation masks from complex, implicit query texts. The authors also establish a new benchmark of more than a thousand image-instruction-mask samples that require complex reasoning and world knowledge for evaluation. Finally, they present the Large Language Instructed Segmentation Assistant (LISA), a tool that combines the linguistic capabilities of large language models with the ability to produce segmentation masks. LISA handles complex queries efficiently and exhibits strong zero-shot learning capabilities, further enhanced by minimal fine-tuning.
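Conceptually, LISA has the language model emit a special segmentation token and hands that token's embedding to a mask decoder. The sketch below illustrates this embedding-as-mask idea; the ReasoningSegmenter wrapper and its components are hypothetical stand-ins, not LISA's published API:

```python
import torch

class ReasoningSegmenter(torch.nn.Module):
    """Illustrative wrapper: a multimodal LLM that can emit a <SEG> token,
    plus a mask decoder conditioned on that token's hidden state."""

    def __init__(self, llm, mask_decoder, seg_token_id):
        super().__init__()
        self.llm = llm                    # hypothetical multimodal LLM
        self.mask_decoder = mask_decoder  # hypothetical SAM-style decoder
        self.seg_token_id = seg_token_id

    def forward(self, image_embeddings, input_ids):
        # Run the LLM and keep hidden states so we can pick out the
        # representation of the <SEG> token it produced.
        out = self.llm(input_ids=input_ids, output_hidden_states=True)
        hidden = out.hidden_states[-1]                    # (batch, seq, dim)
        seg_positions = input_ids == self.seg_token_id    # where <SEG> occurs
        seg_embedding = hidden[seg_positions]             # (num_seg, dim)
        # The decoder turns the token embedding into a mask over the image,
        # so the answer to a complex query is literally a segmentation mask.
        return self.mask_decoder(image_embeddings, seg_embedding)
```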
MultiPly: Reconstruction of multiple people from monocular video in the wild
Zeren Jiang, Chen Guo, Manuel Kaufmann, Tianjian Jiang, Julien Valentin, Otmar Hilliges, Jie Song
MultiPly is a new framework for reconstructing multiple people in 3D from monocular video captured in uncontrolled, real-world settings. This technique uses a layered neural representation of the entire scene, refined through layer-wise differentiable volume rendering. Aided by hybrid instance segmentation that combines self-supervised 3D segmentation with promptable 2D segmentation, it provides reliable segmentation even under close interactions. The process uses confidence-guided optimization to alternately refine human poses and shapes, achieving consistent, high-fidelity 3D models.
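The alternating, confidence-guided refinement can be pictured as two interleaved optimization passes, each weighting the rendering loss by how much the current segmentation is trusted. The following schematic sketch is illustrative only; the scene interface and loss are stand-ins, not MultiPly's code:

```python
import torch

def alternating_refinement(scene, frames, num_rounds=10):
    """Illustrative loop: alternately refine poses and shapes per person,
    weighting the photometric loss by per-pixel segmentation confidence."""
    pose_opt = torch.optim.Adam(scene.pose_parameters(), lr=1e-3)
    shape_opt = torch.optim.Adam(scene.shape_parameters(), lr=1e-3)

    for _ in range(num_rounds):
        for frame in frames:
            # Differentiable volume rendering of the layered scene:
            # one neural layer per person plus the background.
            rendered, confidence = scene.render(frame.camera)
            loss = (confidence * (rendered - frame.image) ** 2).mean()

            # First refine poses with shapes frozen...
            pose_opt.zero_grad()
            loss.backward()
            pose_opt.step()

            # ...then refine shapes given the updated poses.
            rendered, confidence = scene.render(frame.camera)
            loss = (confidence * (rendered - frame.image) ** 2).mean()
            shape_opt.zero_grad()
            loss.backward()
            shape_opt.step()
    return scene
```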
SceneFun3D: Fine-grained functionality and affordance understanding in 3D scenes
Alexandros Delitzas, Ayça Takmaz, Federico Tombari, Robert Sumner, Marc Pollefeys, Francis Engelmann
Traditional methods for understanding 3D scenes focus heavily on semantic and 3D instance segmentation, but the real challenge lies in interacting with functional, interactive elements, such as handles, knobs, and buttons, to accomplish specific tasks. Enter SceneFun3D: a robust dataset comprising more than 14,800 precise interaction annotations across 710 high-resolution real 3D indoor scenes. The dataset enriches scene understanding with task-specific motion parameters and natural language descriptions, facilitating advanced research on functionality segmentation, task-driven affordance grounding, and 3D motion estimation.
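To make the annotation structure concrete, here is a hypothetical example of what a single SceneFun3D-style interaction record could look like; the field names are illustrative, not the dataset's actual schema:

```python
# Hypothetical shape of one SceneFun3D-style interaction annotation;
# the real dataset's schema may differ.
annotation = {
    "scene_id": "scene_0001",
    "element": "cabinet_handle_3",   # the interactive functional element
    "affordance": "pull",            # how the element is operated
    "motion": {                      # task-specific motion parameters
        "type": "translation",
        "axis": [0.0, 0.0, 1.0],
        "range_m": 0.35,
    },
    "description": "Open the top drawer of the kitchen cabinet.",
    "points": [...],                 # indices of the annotated 3D points (elided)
}
```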
Learn more about our work and contributions to CVPR 2024, including our list of publications and sessions, on our conference web page.