SSL4EO-2024: A review of the 1st summer school on “Self-Supervised Learning for Earth Observation”
The SSL4EO-2024 cohort with 40 participants from 17 institutions (sold out!)
Learning representations is at the heart of modern machine learning. While supervised learning has achieved major breakthroughs, many real-world applications with limited reference data cannot easily profit from these advances. Self-supervised learning (SSL), a research direction that aims to learn semantic representations from unlabeled data, has seen major progress and is a promising avenue for better understanding Earth observation (EO) data such as optical satellite images, synthetic aperture radar, or climate data.
EO data are key to understanding important processes on Earth. How do human activities affect our ecosystems? How does climate change influence crop harvests? How do natural hazards such as wildfires, droughts, heat waves, floods, landslides, tropical cyclones, volcanic activity, earthquakes, and avalanches impact our society?
All these major questions can be better understood with the help of EO data. However, while the raw observations provide an abundant source of unlabeled data (we are talking petabytes), we need efficient, scalable, and robust methods to extract the information we need.
In July 2024, the SSL4EO summer school brought together leading experts working on SSL and EO to teach recent advances and discuss open research questions at this intersection, with the first cohort of PhD students joining this format. For a full week, 40 participants attended the school hosted in Copenhagen, hearing from 8 invited speakers and working on mini-projects to gain hands-on experience with the methods discussed. With the generous support of the Danish e-infrastructure Consortium (DeiC), which provided access to its GPU cluster during the course, the participants studied the role of augmentations, learning objectives, architectural design, and sampling strategies.
Randall Balestriero introducing the core principles of self-supervised learning
Randall Balestriero (Brown University, Meta AI Research), the first author of the Cookbook of SSL, kicked off the course with a deep introduction to SSL and an extensive summary of small “tricks” that add up to large performance gains. Randall’s perspective is that “SSL is a superset of supervised and unsupervised learning”. Puzhao Zhang (DHI) summarized important EO sensors and their characteristics for several remote sensing applications and discussed the opportunities and challenges for machine learning. These introductions laid the foundation for looking at the potential of EO data and its metadata for learning representations. Nico Lang (University of Copenhagen) introduced key ideas from prior works such as Geography-Aware Self-Supervised Learning (GASSL) and Seasonal Contrast: Unsupervised Pre-Training from Uncurated Remote Sensing Data (SeCo), which exploit the geospatial and temporal aspects of EO data to design better augmentations and thus positive pairs for SSL. The third key aspect of EO is multi-modality. EO data from different sensors and map products can be aligned at virtually no cost using geolocation and time information, as shown in MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning. Datasets like MMEarth can, for example, be used to train multi-modal, cross-modal, or single-sensor encoders that aim to yield general-purpose representations that generalize to diverse downstream tasks.
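To make the idea of “natural” positive pairs more concrete, here is a minimal sketch (not the official GASSL or SeCo code) of how spatially aligned acquisitions from different seasons can replace hand-crafted augmentations in a contrastive (InfoNCE) objective. All names, shapes, and the toy encoder are illustrative assumptions.

```python
# Sketch: co-located patches from different seasons form positive pairs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallEncoder(nn.Module):
    """Toy CNN encoder standing in for a ResNet backbone."""
    def __init__(self, in_channels=3, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, dim),
        )
    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def info_nce(z1, z2, temperature=0.1):
    """Contrastive loss: z1[i] and z2[i] are views of the same location."""
    logits = z1 @ z2.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(z1.size(0))      # positives sit on the diagonal
    return F.cross_entropy(logits, targets)

# Two batches of co-located patches: same place, different seasons/times.
# In practice these would be cropped from a geo-referenced image archive.
spring_patches = torch.randn(8, 3, 64, 64)
autumn_patches = torch.randn(8, 3, 64, 64)

encoder = SmallEncoder()
loss = info_nce(encoder(spring_patches), encoder(autumn_patches))
print(loss.item())
```

The same pairing logic extends to multi-modality: instead of two seasons, the two batches could come from two co-registered sensors or map products, which is the setting explored in MMEarth.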
One focus topic was deep location encoders and how geospatial data can be represented in neural networks. Marc Rußwurm (Wageningen University) gave a lecture on the foundations of storing geospatial data and on lessons learned from the literature on implicit neural representations, which has been strongly driven by the development of Neural Radiance Fields (NeRFs). He showed how these techniques impact real-world applications such as global-scale species distribution modelling and presented their recent work, Geographic Location Encoding with Spherical Harmonics and Sinusoidal Representation Networks, which introduces principles from geodesy. Konstantin Klemmer (Microsoft Research) presented how such geographic location encoders can be learned in a self-supervised way by leveraging techniques known from CLIP (Contrastive Language-Image Pre-training), in their approach SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery.
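As a rough intuition for what a deep location encoder does, the sketch below maps longitude/latitude to multi-scale sine/cosine features and passes them through a small MLP. This is an illustrative assumption on my part, not the spherical-harmonics + SirenNet parameterization from the paper nor the SatCLIP implementation.

```python
# Sketch: a simple sinusoidal location encoder for (lon, lat) coordinates.
import math
import torch
import torch.nn as nn

class SinusoidalLocationEncoder(nn.Module):
    def __init__(self, num_frequencies=8, dim=64):
        super().__init__()
        # Frequencies double at each scale, capturing coarse-to-fine patterns.
        self.freqs = 2.0 ** torch.arange(num_frequencies).float()
        self.mlp = nn.Sequential(
            nn.Linear(4 * num_frequencies, 128), nn.ReLU(),
            nn.Linear(128, dim),
        )

    def forward(self, lonlat_deg):
        # lonlat_deg: (B, 2) with longitude and latitude in degrees.
        rad = lonlat_deg * math.pi / 180.0
        # sin/cos features per coordinate and per frequency.
        scaled = rad.unsqueeze(-1) * self.freqs          # (B, 2, F)
        feats = torch.cat([scaled.sin(), scaled.cos()], dim=-1).flatten(1)
        return self.mlp(feats)                           # (B, dim)

encoder = SinusoidalLocationEncoder()
embeddings = encoder(torch.tensor([[12.57, 55.69], [-122.42, 37.77]]))
print(embeddings.shape)  # torch.Size([2, 64])
```

In SatCLIP, a location embedding of this kind is aligned with the embedding of the satellite image observed at that location via a CLIP-style contrastive loss, so the location encoder absorbs what the imagery reveals about a place.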
Since SSL strategies are also driving the development of large pre-trained models, also termed foundation models, another big topic was the development of such models for EO data. We had the pleasure of hearing Xiaoxiang Zhu’s (TUM) perspective On the Foundations of Earth and Climate Foundation Models, covering questions like “Why do we need them?” and “What does the ideal Earth FM look like?”. One such model, DOFA (Dynamic One-For-All), which can be applied to data from different sensors, was presented by Zhitong Xiong (TUM) based on their work Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation. The key idea is to condition the model on the sensors’ wavelengths and pre-train on a multi-sensor dataset covering a range of wavelengths from the visible to the microwave spectrum (SAR data).
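The wavelength-conditioning idea can be sketched as a small hypernetwork that generates per-band projection weights from each band’s central wavelength, so one model can ingest sensors with different band sets. This is a minimal illustration of the principle under my own simplifying assumptions, not the DOFA architecture.

```python
# Sketch: project arbitrary spectral bands into a shared feature space,
# with the projection weights generated from the bands' wavelengths.
import torch
import torch.nn as nn

class WavelengthConditionedProjection(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        # Maps a (log-)wavelength scalar to a weight vector for that band.
        self.hyper = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, dim))

    def forward(self, pixels, wavelengths_um):
        # pixels: (B, C, H, W) image with C spectral bands
        # wavelengths_um: (C,) central wavelength of each band in micrometers
        # Log scale keeps optical and microwave wavelengths numerically comparable.
        band_weights = self.hyper(torch.log(wavelengths_um).unsqueeze(-1))  # (C, dim)
        # Mix the spectral dimension into a sensor-agnostic feature space.
        return torch.einsum("bchw,cd->bdhw", pixels, band_weights)

proj = WavelengthConditionedProjection()
# A 4-band optical image (blue, green, red, near-infrared, in micrometers):
optical = proj(torch.randn(2, 4, 32, 32), torch.tensor([0.49, 0.56, 0.66, 0.84]))
# ...and a 2-band SAR-like input (roughly C-band, ~5.6 cm) through the same module:
sar = proj(torch.randn(2, 2, 32, 32), torch.tensor([56_000.0, 56_000.0]))
print(optical.shape, sar.shape)  # both torch.Size([2, 64, 32, 32])
```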
“So what?” This is how Bruno Sanchez-Andrade Nuño (Clay) started his session to push our cohort out of its comfort zone: to stop thinking about technical challenges for a moment and think beyond the models. “The model itself is important, but the product and stories around it are what makes it work.” For a detailed read, Bruno summarized his thoughts in the post If you think “AI for Earth” is about AI on Earth data, you are not paying attention, and Ankit Kariryaa (KU) responded in his Reflections on the PhD Course SSL4EO.
Following this discussion, Jan Dirk Wegner (University of Zurich) provided some answers to this question in his keynote, sharing recent advances in crucial applications: country-scale snow depth estimation, high-resolution species distribution modelling, and remote monitoring of armed conflicts.
During this energetic week, we not only gained new knowledge but also enjoyed several social events, dinners, museum visits, and a boat cruise through the waters of Copenhagen. A new little community was born.
To close this review, I would like to leave you with a piece called “EPOCH”, a visual representation of Earth created by Kevin McGloughlin:
Epoch is a visual representation of our connection to earth and its vulnerable glory.
Our time here is esoteric, limited and intangible.
The fragility which exists in all aspects of life is one thing that is certain.
We are brittle, and so is Mother Earth.
Acknowledgement: This course was supported by the University of Copenhagen, the Danish e-infrastructure Consortium (DeiC), and the Pioneer Centre for AI.
Resources
Course website: https://ankitkariryaa.github.io/ssl4eo/
Recordings: [youtube]
Slides: [google drive]
Waiting list: Would you be interested in a future edition of this summer school? Please sign up for the waiting list and let us know what topic you would be interested in. [form]
Reading list
A Cookbook of Self-Supervised Learning
VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
SimCLR: A Simple Framework for Contrastive Learning of Visual Representations
BYOL: Bootstrap your own latent: A new approach to self-supervised Learning
Barlow Twins: Self-Supervised Learning via Redundancy Reduction
MAE: Masked Autoencoders Are Scalable Vision Learners
ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders
Benchmarking Representation Learning for Natural World Image Collections
Geography-Aware Self-Supervised Learning
Seasonal Contrast: Unsupervised Pre-Training from Uncurated Remote Sensing Data
MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning
Geographic Location Encoding with Spherical Harmonics and Sinusoidal Representation Networks
Spatial Implicit Neural Representations for Global-Scale Species Mapping
SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery
On the Foundations of Earth and Climate Foundation Models
Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation
Snow depth estimation at country-scale with high spatial and temporal resolution
Sat-SINR: High-Resolution Species Distribution Models Through Satellite Imagery
An Open-Source Tool for Mapping War Destruction at Scale in Ukraine using Sentinel-1 Time Series