1.1. Introduction to the small data problem in Remote Sensing for agriculture

(Approx. 10 min reading)

Remote sensing has become a vital tool for monitoring and managing agricultural systems at a wide range of spatial and temporal scales. With the increasing availability of Earth Observation (EO) platforms such as Sentinel, Landsat, MODIS, and various commercial or drone-based sensors, researchers and practitioners now have access to an unprecedented volume of high-resolution imagery and time-series data (Fig. 1). These data support a variety of agricultural applications, including crop classification, yield estimation, soil moisture mapping, and assessments of land degradation.

Fig. 1. Earth Observation platforms and drone-based sensors.

Despite the growing volume of available EO data, one of the most significant obstacles to applying deep learning effectively in agricultural remote sensing is the lack of labeled training data, widely known as the small data problem. Remote sensing data provide a wealth of spatial and temporal information, but most of the imagery is unlabeled. Deep learning models, particularly those based on supervised learning, require a substantial amount of high-quality labeled data to perform well. In many real-world scenarios, especially in agriculture, such labeled data are scarce, expensive to obtain, and often limited in geographic or temporal scope.

The small data problem in agricultural remote sensing refers to the limited availability of the annotated data needed to train machine learning models. Although remote sensing platforms generate a massive volume of image data daily, most of it cannot be used directly for supervised learning because it lacks corresponding ground truth labels. According to Safonova et al. (2023), this issue is not just a technical inconvenience: it is one of the most persistent challenges in applying deep learning to Earth observation, particularly in complex socio-environmental domains such as agriculture.

Supervised deep learning models, such as convolutional neural networks (CNNs), are known for strong performance on image classification tasks when trained on large, diverse datasets. For example, models trained on datasets like ImageNet, which contains millions of labeled images, have achieved high accuracy in various visual tasks. In contrast, most agricultural remote sensing applications rely on relatively small datasets that may include only hundreds or a few thousand labeled samples. These samples are often collected through manual fieldwork, surveys, or government databases, all of which come with significant limitations in spatial coverage, consistency, and timeliness.

Several structural factors contribute to the persistence of the small data problem in agricultural remote sensing. First, field data collection is labor-intensive and costly. Generating accurate ground truth labels for variables such as crop type, yield, or soil condition typically requires physical access to the fields, expert judgment, and precise geolocation, all of which add logistical and financial burdens. This makes large-scale data collection challenging, particularly in remote or resource-limited regions.

Second, agricultural landscapes are highly heterogeneous. Factors such as crop diversity, management practices, weather conditions, and soil characteristics vary significantly not only between countries but also within regions. This heterogeneity makes it difficult to apply a model trained in one area to another without losing predictive performance. This issue, commonly referred to as domain shift, limits the transferability of models across space and time, even when similar crops or sensors are used.
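One common way to make domain shift visible is to evaluate a model on a region it never saw during training. The following is a minimal sketch of a leave-one-region-out split, not a method taken from the cited literature. The sample records, the `region` field, and the function name `leave_one_region_out` are all hypothetical and serve only to illustrate the idea.

```python
def leave_one_region_out(samples):
    """Yield (held_out_region, train, test) splits, one per region.

    Each split trains on all regions except one and tests on the
    held-out region, so the test score reflects spatial transfer.
    """
    regions = sorted({s["region"] for s in samples})
    for held_out in regions:
        train = [s for s in samples if s["region"] != held_out]
        test = [s for s in samples if s["region"] == held_out]
        yield held_out, train, test


# Hypothetical labeled field samples (illustrative values only).
samples = [
    {"id": 1, "region": "north", "crop": "wheat"},
    {"id": 2, "region": "north", "crop": "maize"},
    {"id": 3, "region": "south", "crop": "maize"},
    {"id": 4, "region": "south", "crop": "rice"},
    {"id": 5, "region": "east", "crop": "wheat"},
]

splits = list(leave_one_region_out(samples))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

A large gap between scores on randomly held-out samples and scores on held-out regions would suggest that the model does not transfer well across space.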

Third, datasets are often imbalanced. In agricultural data, a few dominant crops like maize, wheat, or rice may account for most of the labeled samples, while less common or regionally specific crops are underrepresented. This imbalance can cause models to perform poorly on minority classes, even if those classes are critical from a policy or environmental perspective.
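A simple and widely used mitigation for class imbalance is to weight each class inversely to its frequency when computing the training loss. The sketch below assumes a hypothetical label list and a helper named `inverse_frequency_weights`. It illustrates one possible weighting scheme and is not a recommendation from the source.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per class, proportional to 1 / class frequency.

    Weights are scaled as n_samples / (n_classes * count), so every
    class in a perfectly balanced dataset receives a weight of 1.0.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical label distribution: two dominant crops and one rare crop.
labels = ["maize"] * 80 + ["wheat"] * 15 + ["quinoa"] * 5
weights = inverse_frequency_weights(labels)

for cls, w in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{cls}: {w:.2f}")
```

With such weights, an error on a rare crop contributes more to the loss than an error on a dominant crop, which can partly offset the imbalance. It does not add new information about the rare class.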

Fourth, the quality of available labels may be compromised by noise and inconsistencies. Labels derived from old maps, participatory mapping, crowdsourcing, or coarse-resolution surveys may not accurately align with the imagery or current ground conditions. This results in label noise, which negatively affects model performance, especially in low-data regimes where every sample carries more weight.

Fifth, there are often legal, ethical, or institutional barriers to accessing high-quality labeled datasets. In some countries, crop or land use data are considered sensitive or proprietary, limiting their distribution. Regulatory frameworks such as the General Data Protection Regulation (GDPR) may restrict access to high-resolution imagery linked to individuals or private properties, even if those data are technically available.

Remote sensing is often described as a big data field, and in terms of raw image volume that description is accurate. The issue, however, is not the number of pixels available but the lack of labeled, contextually relevant data that can be used to train supervised models. This mismatch between data availability and data usability defines the small data problem. It creates a paradox in which the potential of remote sensing appears vast, but the practical limitations of data annotation make that potential difficult to realize in many real-world agricultural applications.

The consequences of the small data problem are not limited to lower model accuracy. Deep learning models trained on small or biased datasets are at risk of overfitting, where they memorize the training data without learning generalizable features. They may also exhibit systematic bias, performing well on dominant classes or regions while failing elsewhere. In agriculture, this has real-world consequences. Models that cannot generalize across regions or seasons may lead to poor decision-making, misallocation of resources, or incorrect policy recommendations.
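The systematic bias described above can be hidden by a single accuracy figure. As a minimal sketch, the example below compares overall accuracy with per-class recall for a hypothetical model that always predicts the dominant crop. The labels, predictions, and function names are illustrative assumptions, not results from any real model.

```python
from collections import defaultdict


def overall_accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def per_class_recall(y_true, y_pred):
    """Fraction of each true class that the model recovers."""
    total = defaultdict(int)
    hits = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            hits[t] += 1
    return {cls: hits[cls] / total[cls] for cls in total}


# Hypothetical test set: 90 maize fields and 10 millet fields.
y_true = ["maize"] * 90 + ["millet"] * 10
# Hypothetical biased model that predicts the dominant class everywhere.
y_pred = ["maize"] * 100

acc = overall_accuracy(y_true, y_pred)
recall = per_class_recall(y_true, y_pred)
print(acc, recall)
```

In this constructed case the overall accuracy looks high even though the model never identifies the minority crop, which is why per-class metrics are worth reporting alongside aggregate ones.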

Safonova et al. (2023) emphasize that the small data problem is not just a limitation to be tolerated. It is a fundamental driver of innovation in remote sensing and machine learning. It has spurred the development of a range of data-efficient deep learning techniques designed to work effectively even with limited, noisy, or imbalanced data. These include transfer learning, self-supervised learning, semi-supervised learning, few-shot and zero-shot learning, active learning, weakly supervised learning, and process-aware learning. Each of these methods approaches the small data problem from a different angle: by making better use of unlabeled data, by reducing the need for labels altogether, or by integrating external knowledge into the learning process.

Understanding the small data problem is a necessary first step in the design of robust, scalable, and fair remote sensing models for agriculture. The challenge is not going away, especially in low-income regions or applications that require specialized field data. The focus, therefore, must shift from collecting ever more labeled samples to developing smarter algorithms that can learn from limited supervision. In the following sections of this lecture, we will explore the broader implications of limited ground truth data in socio-environmental applications and examine in detail the deep learning strategies that are helping to address this central challenge.