SSL4EO-2024: A review of the 1st summer school on “Self-Supervised Learning for Earth Observation”

The SSL4EO-2024 cohort with 40 participants from 17 institutions (sold out!)

Learning representations is at the heart of modern machine learning. While supervised learning has driven major breakthroughs, many real-world applications with limited reference data cannot easily benefit from them. Self-supervised learning (SSL), a research direction that aims to learn semantic representations from unlabeled data, has seen major advances and is a promising avenue for better understanding Earth observation (EO) data such as optical satellite images, synthetic aperture radar, or climate data.

EO data are key to understanding important processes on Earth. How do human activities affect our ecosystems? How does climate change influence the harvest of crops? How do natural hazards such as wildfires, droughts, heat waves, floods, landslides, tropical cyclones, volcanic activity, earthquakes, and avalanches impact our society?

All these major questions can be better understood with the help of EO data. However, while the raw observations provide an abundant source of unlabeled data (we are talking petabytes), we need efficient, scalable, and robust methods to extract the information we need.

In July 2024, the SSL4EO summer school brought together leading experts working on SSL and EO to teach recent advances and discuss open research questions at this intersection, with the first cohort of PhD students joining this format. For a full week, 40 participants attended the school hosted in Copenhagen, hearing from 8 invited speakers and working on mini-projects to gain hands-on experience with the methods discussed. With generous support from the Danish e-infrastructure Consortium (DeiC), which provided access to its GPU cluster during the course, the participants studied the role of augmentations, learning objectives, architectural design, and sampling strategies.

Randall Balestriero introducing the core principles of self-supervised learning

Randall Balestriero (Brown University, Meta AI Research), the first author of the Cookbook of SSL, kicked off the course with a deep introduction to SSL and an extensive summary of small “tricks” that add up to large performance gains. Randall’s perspective is that “SSL is a superset of supervised and unsupervised learning”. Puzhao Zhang (DHI) summarized important EO sensors and their characteristics for several remote sensing applications and discussed the opportunities and challenges for machine learning. These introductions set the foundation for looking at the potential of EO data and its metadata for learning representations. Nico Lang (University of Copenhagen) introduced key ideas of prior works such as Geography-Aware Self-Supervised Learning (GASSL) and Seasonal Contrast: Unsupervised Pre-Training from Uncurated Remote Sensing Data (SeCo), which exploit the geospatial and temporal aspects of EO data to design better augmentations and thus positive pairs for SSL. The third key aspect of EO is multi-modality. EO data from different sensors and map products can be aligned at virtually no cost using geolocation and time information, as shown in MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning. Datasets like MMEarth can, for example, be used to train multi-modal, cross-modal, or single-sensor encoders that aim to yield general-purpose representations that generalize to diverse downstream tasks.
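To make the idea of metadata-driven positive pairs concrete, here is a minimal NumPy sketch of a SeCo-style setup: two views of the same location captured in different seasons are treated as a positive pair, and all other locations in the batch act as negatives under an InfoNCE loss. The function name and toy data are illustrative, not taken from any of the papers’ codebases.

```python
import numpy as np

def info_nce(anchor, positive, temperature=0.1):
    """InfoNCE loss over a batch: anchor[i] and positive[i] form a positive
    pair (e.g. two crops of the same location at different seasons);
    all other pairs in the batch serve as negatives."""
    # L2-normalize so the dot product is a cosine similarity
    a = anchor / np.linalg.norm(anchor, axis=1, keepdims=True)
    p = positive / np.linalg.norm(positive, axis=1, keepdims=True)
    logits = a @ p.T / temperature                      # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)         # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the matching index as the target class
    return -np.mean(np.diag(log_probs))

# Toy batch: 4 locations, 8-dim embeddings of two seasonal views each
rng = np.random.default_rng(0)
z_summer = rng.normal(size=(4, 8))
z_winter = z_summer + 0.1 * rng.normal(size=(4, 8))     # correlated views
loss = info_nce(z_summer, z_winter)
```

In a real pipeline the embeddings would come from an image encoder, and the sampling of seasonal views, not the loss, is where the geospatial metadata does the work.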

One focus topic was deep location encoders and how geospatial data can be represented in neural networks. Marc Rußwurm (Wageningen University) gave a lecture on the foundations of encoding geospatial data and the lessons learned from the literature on implicit neural representations, which is largely driven by the development of Neural Radiance Fields (NeRFs). He showed how these techniques impact real-world applications such as global-scale species distribution modelling and presented their recent work, Geographic Location Encoding with Spherical Harmonics and Sinusoidal Representation Networks, which introduces principles from geodesy. Konstantin Klemmer (Microsoft Research) presented how such geographic location encoders can be learned in a self-supervised way by leveraging techniques known from CLIP (Contrastive Language-Image Pre-training), in their approach called SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery.
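A key point in this line of work is that naive (lon, lat) inputs behave badly at the poles and the dateline, which motivates periodic basis functions. As a simplified stand-in for the spherical-harmonic basis used in the paper, here is a multi-scale sinusoidal location encoding in NumPy; the function name and frequency schedule are illustrative assumptions.

```python
import numpy as np

def sinusoidal_location_encoding(lon_deg, lat_deg, num_freqs=4):
    """Encode a geographic coordinate as multi-scale sine/cosine features.
    Because the features are periodic in longitude, the encoding is
    continuous across the +/-180 degree dateline."""
    lon = np.radians(lon_deg)
    lat = np.radians(lat_deg)
    feats = []
    for k in range(num_freqs):
        f = 2.0 ** k                       # doubling frequencies per scale
        feats += [np.sin(f * lon), np.cos(f * lon),
                  np.sin(f * lat), np.cos(f * lat)]
    return np.array(feats)

enc = sinusoidal_location_encoding(12.57, 55.69)   # Copenhagen
```

Such a feature vector is what a downstream MLP (a SIREN in the paper) consumes; spherical harmonics additionally respect the sphere’s geometry rather than treating lon/lat as a flat grid.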

Since SSL strategies also drive the development of large pre-trained models, also termed foundation models, another big topic was the development of such models for EO data. We had the pleasure to hear Xiaoxiang Zhu’s (TUM) perspective On the Foundations of Earth and Climate Foundation Models, covering questions like “Why do we need them?” or “What does the ideal Earth FM look like?”. One such model called DOFA (Dynamic One-For-All) that can be applied to data from different sensors was presented by Zhitong Xiong (TUM) from their work Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation. The key idea is to condition the model on the sensors’ wavelengths and pre-train on a multi-sensor dataset covering a range of wavelengths from the visible to the microwave spectrum (SAR data).

“So what?” This is how Bruno Sanchez-Andrade Nuño (Clay) started his session to push our cohort out of its comfort zone: to stop thinking about technical challenges for a moment and think beyond the models. “The model itself is important, but the product and stories around it are what makes it work.” Bruno summarized his thoughts for a detailed read in this post: If you think “AI for Earth” is about AI on Earth data, you are not paying attention, and Ankit Kariryaa (KU) responded in his Reflections on the PhD Course SSL4EO.

Following this discussion, Jan Dirk Wegner (University of Zurich) provided some answers to this question in his keynote sharing recent advances in crucial applications covering snow depth estimation at country-scale, high-resolution species distribution modelling, and remote monitoring of armed conflicts.

During this energetic week, we not only gained new knowledge, but also enjoyed several social events, dinners, museum visits, and a boat cruise through the waters of Copenhagen. A new little community was born.

To close this review, I would like to leave you with a piece called “EPOCH”, a visual representation of Earth created by Kevin McGloughlin:

Epoch is a visual representation of our connection to earth and its vulnerable glory.
Our time here is esoteric, limited and intangible.
The fragility which exists in all aspects of life is one thing that is certain.
We are brittle, and so is Mother Earth.

Acknowledgement: This course was supported by the University of Copenhagen, Danish e-infrastructure Consortium (DeiC), and the Pioneer Centre for AI.

Resources

Reading list

A Cookbook of Self-Supervised Learning

VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning

SimCLR: A Simple Framework for Contrastive Learning of Visual Representations

BYOL: Bootstrap your own latent: A new approach to self-supervised Learning

Barlow Twins: Self-Supervised Learning via Redundancy Reduction

MAE: Masked Autoencoders Are Scalable Vision Learners

ConvNeXt V2: Co-designing and Scaling ConvNets with Masked Autoencoders

Benchmarking Representation Learning for Natural World Image Collections

Geography-Aware Self-Supervised Learning

Seasonal Contrast: Unsupervised Pre-Training from Uncurated Remote Sensing Data

MMEarth: Exploring Multi-Modal Pretext Tasks For Geospatial Representation Learning

Geographic Location Encoding with Spherical Harmonics and Sinusoidal Representation Networks

Spatial Implicit Neural Representations for Global-Scale Species Mapping

SatCLIP: Global, General-Purpose Location Embeddings with Satellite Imagery

On the Foundations of Earth and Climate Foundation Models

Neural Plasticity-Inspired Multimodal Foundation Model for Earth Observation

Snow depth estimation at country-scale with high spatial and temporal resolution

Sat-SINR: High-Resolution Species Distribution Models Through Satellite Imagery

An Open-Source Tool for Mapping War Destruction at Scale in Ukraine using Sentinel-1 Time Series

Impressions