2.2.

Key Self-Supervised Learning Models and Techniques

(Approx. 30 min reading)

SSL has rapidly gained popularity in the machine learning community due to its ability to leverage large amounts of unlabeled data to learn useful feature representations. Unlike traditional supervised learning, SSL does not require manual annotation during pretraining. Instead, it constructs artificial tasks, known as pretext tasks, that provide supervision signals derived from the data itself. After pretraining, these models can be fine-tuned with a small amount of labeled data, making them highly suitable for domains like remote sensing where annotated data is scarce, expensive, or difficult to obtain.

In this section, we explore five influential SSL models that have shown significant promise in remote sensing applications: SimSiam, SimCLR, MoCo, Barlow Twins, and VICReg. Each of these methods addresses different technical challenges in SSL, such as the need for negative samples, reliance on large batch sizes, or redundancy in feature representations. Understanding these models allows practitioners to select and adapt the best strategies for agricultural monitoring tasks like crop classification, land cover mapping, and environmental monitoring.

SimSiam: Learning without Negative Samples

SimSiam (Simple Siamese representation learning) introduced a minimalist framework that challenges the assumption that negative samples or momentum encoders are required for contrastive-style learning. SimSiam uses two branches that share a single encoder (a backbone plus a projection head); a prediction head is applied to one branch, and the two branches are trained to produce similar outputs from two augmented views of the same image. What sets SimSiam apart is that it eliminates the need for negative pairs entirely. Instead, it prevents representation collapse (where the network outputs the same feature for all inputs) by applying a stop-gradient to the target branch, so gradients flow only through the predictor side. This simplicity makes SimSiam computationally efficient and easy to implement, especially for remote sensing datasets where large batch sizes are hard to manage due to high image resolution or limited memory. SimSiam has been adapted for agricultural imagery by pretraining on large unlabeled satellite archives and then fine-tuning on small crop classification datasets. The approach shows competitive performance while reducing training complexity, making it a strong choice for practical SSL applications in resource-constrained environments.
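To make the stop-gradient idea concrete, here is a minimal numeric sketch of SimSiam's symmetric negative-cosine loss in plain Python. It is framework-free and the function names are illustrative: in PyTorch, stopping the gradient on the target embedding would be done with `z.detach()`.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def neg_cosine(p, z):
    # z is treated as a constant (the stop-gradient side); here we
    # only compute the loss value, not gradients.
    p, z = l2_normalize(p), l2_normalize(z)
    return -sum(a * b for a, b in zip(p, z))

def simsiam_loss(p1, z2, p2, z1):
    """Symmetric SimSiam loss over two augmented views:
    predictor output of one branch vs. (detached) projection of the other."""
    return 0.5 * neg_cosine(p1, z2) + 0.5 * neg_cosine(p2, z1)
```

When both predictions match their targets exactly, the loss reaches its minimum of -1; orthogonal embeddings give 0.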

SimCLR: Contrastive Learning with Augmentations

SimCLR (Simple Framework for Contrastive Learning of Visual Representations) is one of the most influential SSL models. It relies on contrastive learning, where two augmented versions of the same image are treated as a positive pair, and all other images in the batch are treated as negatives. The goal is to bring representations of similar inputs closer while pushing dissimilar ones apart. The training pipeline includes a backbone encoder (e.g., ResNet), a projection head, and heavy use of data augmentations such as random cropping, flipping, color jittering, and Gaussian blur. One limitation of SimCLR is that it typically requires large batch sizes to ensure enough negative samples for learning meaningful representations. Despite this requirement, SimCLR has proven effective in remote sensing tasks such as scene classification and semantic segmentation. By adapting the augmentations to fit remote sensing data—e.g., including rotation, spectral shifts, or band dropout—SimCLR can be applied to satellite or hyperspectral imagery. It has been particularly successful in settings with seasonal variability, enabling better generalization across time and location.
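SimCLR's contrastive objective, the NT-Xent (normalized temperature-scaled cross-entropy) loss, can be sketched for a single anchor as follows. This is a simplified, framework-free illustration rather than the batched implementation used in practice.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def nt_xent(anchor, positive, negatives, temperature=0.5):
    """NT-Xent loss for one anchor: pull the positive view close,
    push the negatives away, with similarities scaled by temperature."""
    pos = math.exp(cosine(anchor, positive) / temperature)
    neg = sum(math.exp(cosine(anchor, n) / temperature) for n in negatives)
    return -math.log(pos / (pos + neg))
```

The loss shrinks as the anchor aligns with its positive view and grows when negatives are similar to the anchor, which is why a large pool of negatives (hence large batches) helps the model learn discriminative features.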

MoCo: Momentum Contrast with Dynamic Dictionaries

MoCo (Momentum Contrast) was introduced to address some of the limitations of SimCLR, particularly the need for large batch sizes. Instead of relying on a large number of in-batch negative samples, MoCo maintains a dynamic queue of encoded representations (called a dictionary) that is updated slowly over time via a momentum encoder. This queue serves as a memory bank from which negative samples can be drawn efficiently. This design makes MoCo more memory-efficient and scalable to high-resolution remote sensing data. In addition, the momentum update mechanism stabilizes training, which is valuable when dealing with diverse remote sensing datasets that span different seasons, resolutions, or regions. MoCo has been applied in agricultural monitoring to pretrain models using unlabeled satellite images from various regions. When fine-tuned on a small number of labeled crop samples, these models can outperform fully supervised counterparts trained only on the labeled data. MoCo's ability to generalize across domains has made it a popular choice for small-data remote sensing tasks.
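The two mechanisms that distinguish MoCo, the momentum update of the key encoder and the FIFO queue of negatives, can be sketched as follows. This is a simplified illustration with parameters as plain numbers; a real encoder applies the same update to every weight tensor.

```python
from collections import deque

def momentum_update(key_params, query_params, m=0.999):
    """Move the key encoder slowly toward the query encoder:
    key <- m * key + (1 - m) * query."""
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]

class KeyQueue:
    """Fixed-size FIFO dictionary of encoded keys used as negatives."""
    def __init__(self, size):
        self.queue = deque(maxlen=size)

    def enqueue(self, keys):
        # Oldest keys fall out automatically once the queue is full.
        self.queue.extend(keys)

    def negatives(self):
        return list(self.queue)
```

Because negatives come from the queue rather than the current batch, the effective number of negatives is decoupled from the batch size, which is what makes MoCo practical on memory-limited hardware.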

Barlow Twins: Redundancy Reduction for Representation Learning

Barlow Twins introduced a novel perspective in SSL by removing the need for negative pairs altogether and instead focusing on redundancy reduction. The method creates two augmented views of an image and passes them through the same encoder. The resulting embeddings are then compared via a cross-correlation matrix computed across the batch. The training objective pushes this matrix toward the identity matrix: corresponding dimensions of the two embeddings are encouraged to be highly correlated (diagonal entries close to 1), while different dimensions are decorrelated (off-diagonal entries close to 0). This ensures that each feature carries unique information, which is particularly useful when training on datasets with limited diversity or coverage. For remote sensing, Barlow Twins offers two major advantages: it does not require large batch sizes or memory banks, and it produces more diverse and disentangled features. These features can improve downstream tasks such as classification, segmentation, and object detection. When applied to hyperspectral imagery, Barlow Twins has shown that it can learn spectral-spatial patterns effectively, even from limited data.
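A minimal sketch of the Barlow Twins objective in plain Python: each embedding dimension is standardized over the batch, and the cross-correlation matrix between the two views is penalized for deviating from the identity. Here `lam` plays the role of the paper's off-diagonal weight λ, and the helper names are our own, not from any reference implementation.

```python
def standardize_columns(z):
    """Normalize each embedding dimension to zero mean, unit std over the batch."""
    n, d = len(z), len(z[0])
    out = [[0.0] * d for _ in range(n)]
    for j in range(d):
        col = [row[j] for row in z]
        mean = sum(col) / n
        std = (sum((x - mean) ** 2 for x in col) / n) ** 0.5 or 1.0
        for i in range(n):
            out[i][j] = (z[i][j] - mean) / std
    return out

def barlow_twins_loss(za, zb, lam=5e-3):
    """Penalize deviation of the views' cross-correlation matrix from identity."""
    za, zb = standardize_columns(za), standardize_columns(zb)
    n, d = len(za), len(za[0])
    loss = 0.0
    for i in range(d):
        for j in range(d):
            # Cross-correlation between dim i of view A and dim j of view B.
            c = sum(za[k][i] * zb[k][j] for k in range(n)) / n
            if i == j:
                loss += (1.0 - c) ** 2      # push diagonal toward 1
            else:
                loss += lam * c ** 2        # push off-diagonal toward 0
    return loss
```

Identical, already-decorrelated views give a loss near zero, while a collapsed view (all embeddings equal) leaves the diagonal terms at their maximum, which is how the objective avoids collapse without negatives.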

VICReg: Variance-Invariance-Covariance Regularization

VICReg builds on Barlow Twins but makes its training objectives more explicit and better balanced. The method incorporates three key loss components:

  • Invariance: Ensures that embeddings from different augmentations of the same image are close in the feature space.
  • Variance: Forces the model to maintain diversity across feature dimensions, avoiding collapse to constant vectors.
  • Covariance: Penalizes redundancy between different dimensions to encourage informative representations.

VICReg achieves high performance without needing negative pairs or large batch sizes. This makes it suitable for small-data agricultural settings where computational resources may be limited and where diversity in training data is often restricted.
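The three terms above can be sketched in plain Python as follows. This is a simplified version of the published objective: the variance term is a hinge that keeps each dimension's standard deviation near 1, the weighting coefficients between terms are omitted, and all names are illustrative.

```python
def vicreg_terms(za, zb, eps=1e-4):
    """Return (invariance, variance, covariance) terms for two batches
    of paired embeddings za, zb (lists of equal-length rows)."""
    n, d = len(za), len(za[0])
    # Invariance: mean squared distance between paired embeddings.
    inv = sum(sum((a - b) ** 2 for a, b in zip(ra, rb))
              for ra, rb in zip(za, zb)) / n
    var = 0.0
    cov = 0.0
    for z in (za, zb):
        means = [sum(row[j] for row in z) / n for j in range(d)]
        centered = [[row[j] - means[j] for j in range(d)] for row in z]
        # Variance: hinge loss pushing each dimension's std toward >= 1.
        stds = [(sum(c[j] ** 2 for c in centered) / n + eps) ** 0.5
                for j in range(d)]
        var += sum(max(0.0, 1.0 - s) for s in stds) / d
        # Covariance: squared off-diagonal entries of the covariance matrix.
        for i in range(d):
            for j in range(d):
                if i != j:
                    cij = sum(c[i] * c[j] for c in centered) / (n - 1)
                    cov += cij ** 2 / d
    return inv, var, cov
```

Well-spread, matching embeddings drive all three terms toward zero, whereas a collapsed batch is penalized by the variance hinge, so no negatives are needed to prevent collapse.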

VICReg has demonstrated promising results in classification tasks using satellite data, particularly when fine-tuned with few labeled samples. It also performs well in transfer learning scenarios, where models pretrained with VICReg on a large unlabeled dataset can be adapted to new regions or tasks with minimal additional data.

Table 2. Summary of Model Characteristics
Model        | Negative pairs needed | Batch size requirement | Redundancy reduction | Efficient for remote sensing
SimSiam      | No                    | Small                  | Implicit             | Yes
SimCLR       | Yes                   | Large                  | No                   | Yes, with tuning
MoCo         | Yes                   | Small                  | No                   | Yes
Barlow Twins | No                    | Moderate               | Yes                  | Yes
VICReg       | No                    | Moderate               | Yes                  | Yes

Each of these SSL methods contributes differently to tackling small-data challenges in remote sensing. Choosing the right one depends on available computational resources, dataset characteristics, and the specific downstream task. In the next section, we will explore how these models can be integrated with traditional deep learning workflows to improve generalization, particularly in agricultural applications.