Zero-Shot Learning with Vision-Language Models for Estimating Building Energy Efficiency from Street View Images

The built environment contributes significantly to global energy use and CO2 emissions, yet large-scale energy efficiency evaluations are often hindered by cost and data inconsistencies. This study employs zero-shot learning with Vision-Language Models (VLMs) to classify building energy efficiency from street view images. Our methodology integrates advanced image processing with natural language understanding, enabling buildings to be classified into UK Energy Performance Certificate (EPC) grades based solely on visual inputs. By leveraging the pre-trained knowledge in VLMs, the framework identifies energy efficiency levels using descriptive attributes, without requiring prior training on labeled examples. Performance evaluation against traditional supervised learning models demonstrates that VLMs can effectively categorize buildings into energy efficiency classes. The results highlight the potential of VLMs as a scalable tool for building energy assessment to inform renovation planning and sustainable urban development.
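The zero-shot idea summarized in the abstract can be illustrated with a minimal sketch. It assumes a CLIP-style setup in which an image embedding is compared with the embeddings of one text prompt per EPC grade; the abstract does not say which VLM, prompts, or scoring rule the authors used, so the prompt template, the `scale` value, the function names, and the toy embeddings below are all hypothetical placeholders rather than the paper's method.

```python
import math

# UK EPC grades, from most (A) to least (G) efficient.
EPC_GRADES = ["A", "B", "C", "D", "E", "F", "G"]


def build_prompts(grades):
    # Hypothetical prompt template; the abstract does not give the actual wording.
    return [f"a street view photo of a building with EPC grade {g}" for g in grades]


def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_classify(image_emb, text_embs, labels, scale=100.0):
    """Pick the label whose text embedding is closest to the image embedding."""
    logits = [scale * cosine(image_emb, t) for t in text_embs]
    m = max(logits)  # subtract the max for a numerically stable softmax
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs


# Toy stand-in embeddings (NOT real model outputs): one one-hot vector per grade.
# In practice these would come from a VLM's text and image encoders.
text_embs = [[1.0 if i == j else 0.0 for j in range(7)] for i in range(7)]
image_emb = [0.1, 0.2, 0.9, 0.3, 0.1, 0.0, 0.0]  # constructed to be closest to "C"

grade, probs = zero_shot_classify(image_emb, text_embs, EPC_GRADES)
print(grade)
```

No labeled training examples appear anywhere in the sketch: the only task-specific input is the list of text prompts, which is what makes the classification zero-shot.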

Conference Name

ASCE International Conference on Computing in Civil Engineering (i3CE 2025)

Rights and licensing

Except where otherwise noted, this item's license is described as Attribution 4.0 International

Sponsorship

European Commission Horizon 2020 (H2020) Marie Skłodowska-Curie actions (101034337)

European Union’s Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No 101034337

Collections

University of Cambridge Research Outputs (Articles and Conferences)