← Back to Research

Complex Systems Physics

I am working on AI safety with a focus on mechanistic interpretability, aiming to clarify how modern neural networks represent and manipulate information. My current focus is the mechanism behind superposition, the phenomenon in which models compress more features than they have dimensions into shared directions, which I investigate using a combination of information theory, linear probes, and targeted interventions. By examining how different kinds of information, such as redundant, unique, and synergistic components, are distributed and interact within these systems, I aim to uncover structural principles that make model behaviour more transparent and ultimately safer.
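As a toy illustration of the two ideas mentioned above, superposition and linear probing, the sketch below packs more features than dimensions into a shared space and then fits a least-squares probe to read one feature back out. All numbers, shapes, and sparsity levels here are hypothetical choices for the demo, not details of the actual project.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_dims, n_samples = 8, 4, 2000  # more features than dimensions

# Random unit directions: with n_features > n_dims, directions must overlap,
# which is the basic geometry of superposition.
W = rng.normal(size=(n_features, n_dims))
W /= np.linalg.norm(W, axis=1, keepdims=True)

# Sparse feature activations: each feature is active ~20% of the time.
x = rng.random((n_samples, n_features)) * (rng.random((n_samples, n_features)) < 0.2)

# Model-like representation: all features written into one shared low-dim space.
h = x @ W

# Linear probe for feature 0: a least-squares readout from the representation.
probe, *_ = np.linalg.lstsq(h, x[:, 0], rcond=None)
pred = h @ probe

# The readout correlates with the true feature but is imperfect, because
# overlapping directions interfere with each other.
corr = np.corrcoef(pred, x[:, 0])[0, 1]
print(round(float(corr), 3))
```

Running this gives a correlation well above chance but clearly below 1, which is the interference signature one probes for in real models.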

This project is carried out in collaboration with Prof. Pedro Mediano at Imperial College London and his team, with whom I explore both theoretical aspects and practical methods for interpreting high-dimensional learned representations.