Show simple item record

dc.contributor.author: Cangea, Catalina (en)
dc.date.accessioned: 2021-07-12T02:20:13Z
dc.date.available: 2021-07-12T02:20:13Z
dc.date.submitted: 2021-03-01 (en)
dc.identifier.uri: https://www.repository.cam.ac.uk/handle/1810/325035
dc.description.abstract: An essential aim of artificial intelligence research is to design agents that will eventually cooperate with humans in the real world. To this end, embodied learning is emerging as one of the most important efforts contributed by the machine learning community towards this goal. Rapidly developing sub-fields concern various aspects of such systems: visual reasoning, language representations, causal mechanisms, and robustness to out-of-distribution inputs, to name only a few. In particular, multimodal learning and language grounding are vital to achieving a strong understanding of the real world. Humans build internal representations by interacting with their environment, learning complex associations between visual, auditory and linguistic concepts. Since the world abounds with structure, graph-based encodings are also likely to be incorporated in reasoning and decision-making modules. Furthermore, these relational representations are rather symbolic in nature, providing advantages over other formats such as raw pixels, and can encode various types of links (temporal, causal, spatial) that can be essential for understanding and acting in the real world. This thesis presents three research works that study and develop likely aspects of future intelligent agents. The first contribution centres on vision-and-language learning, introducing a challenging embodied task that shifts the focus of an existing one to the visual reasoning problem. By extending popular visual question answering (VQA) paradigms, I also designed several models that were evaluated on the novel dataset. This produced initial performance estimates for environment understanding, through the lens of a more challenging VQA downstream task. The second work presents two ways of obtaining hierarchical representations of graph-structured data. These methods either scaled to much larger graphs than those processed by the best-performing method at the time, or incorporated theoretical properties via topological data analysis algorithms. Both approaches competed with contemporary state-of-the-art graph classification methods, even outside social domains in the second case, where the inductive bias was PageRank-driven. Finally, the third contribution delves further into relational learning, presenting a probabilistic treatment of graph representations in complex settings such as few-shot learning, multi-task learning and regimes with scarce labelled data. By adding relational inductive biases to neural processes, the resulting framework can model an entire distribution of functions that generate datasets with structure. This yielded significant performance gains, especially in the aforementioned complex scenarios, with semantically accurate uncertainty estimates that drastically improved over the neural process baseline. This type of framework may eventually contribute to developing lifelong-learning systems, owing to its ability to adapt to novel tasks and distributions. The benchmark, methods and frameworks that I devised during my doctoral studies suggest important future directions for embodied and graph representation learning research. These areas have increasingly proved their relevance to designing intelligent and collaborative agents with which we may interact in the near future. By addressing several challenges in this problem space, my contributions therefore take a few steps towards building machine learning systems to be deployed in real-life settings. (en)
dc.description.sponsorship: DREAM CDT (en)
dc.rights: All rights reserved (en)
dc.subject: machine learning (en)
dc.subject: embodied learning (en)
dc.subject: graph neural networks (en)
dc.subject: neural processes (en)
dc.subject: visual question answering (en)
dc.subject: language grounding (en)
dc.subject: graph classification (en)
dc.subject: meta-learning (en)
dc.subject: few-shot learning (en)
dc.subject: graph pooling (en)
dc.title: Exploiting multimodality and structure in world representations (en)
dc.type: Thesis
dc.type.qualificationlevel: Doctoral (en)
dc.type.qualificationname: Doctor of Philosophy (PhD) (en)
dc.publisher.institution: University of Cambridge (en)
dc.identifier.doi: 10.17863/CAM.72490
rioxxterms.licenseref.uri: http://www.rioxx.net/licenses/all-rights-reserved (en)
rioxxterms.type: Thesis (en)
dc.publisher.college: Kings
dc.type.qualificationtitle: PhD in Computer Science (en)
pubs.funder-project-id: NERC (2221169)
pubs.funder-project-id: NERC (1940784)
cam.supervisor: Lio, Pietro


Files in this item


There are no files associated with this item.
