Ground cover (also referred to as FCOVER, the fraction of green vegetation cover), the percentage of ground covered by vegetation in a specified area, is a commonly used metric in agronomic studies, for example to estimate crop yield. Baret and Guyot showed that ground cover is correlated with biomass yield (measured by weighing dried biomass). The estimation of ground cover can be used in smart agriculture applications to treat crops automatically and effectively, for instance by finding an optimal distribution of water or pesticides. This is especially helpful for farms in developing countries, which have a lot to gain in terms of crop yield.
Farmers in developing regions such as the Bangladesh delta struggle to cultivate their land during the dry season because soil moisture levels are often too low. Ground water is often salty or briny, which makes it unfit for raising crops. Farmers need water pumps to use fresh surface water from rivers and channels, but the pumps are expensive. Therefore, irrigation needs to be optimized.
Effective use of irrigation systems depends largely on the current ground cover and soil water balance. Accurate ground cover analysis could therefore help in developing an advisory system for irrigation scheduling, thereby increasing crop yields.
Ground cover analysis, including estimation (the percentage of an area covered by vegetation), classification (instance-based labelling of vegetation), and segmentation (demarcation and extraction of regions of vegetation), is most often done using image-based methods (though other methods such as thermal analysis also exist). A wide range of computer vision and image processing techniques are applied to monitor and study agricultural changes using both ground level RGB cameras and multi-spectral air- and space-borne radiometers.
Successful ground level image-based ground cover analysis relies on overcoming a number of common obstacles found in agrarian imagery. Lighting conditions can vary extremely over time. Images obtained during sunny weather can result in different ground cover estimations compared to images of the same field obtained in overcast or rainy weather. This is due to specular reflection in sunny conditions, or differences in refraction when a canopy is covered in raindrops. Another obstacle is the canopy casting shadows onto itself. Parts of the vegetation that are cloaked in the shadow of leaves or other objects may be classified as soil instead. Especially for large crops that have a high number of leaves, this problem can severely affect classification and segmentation accuracy. Finally, the soil area may be littered with objects that do not fall into either of the two categories, soil or vegetation. Residual waste from other nearby vegetation or animals may influence ground cover analysis, as would small stones or pieces of wood.
Figure 1. Top: Segmentation of rice crops. Bottom: The three types of crops present in our dataset. Mung bean (left); Maize (middle); and Wheat (right). Notice common obstacles for good segmentation: shade casting, hard lighting, overlapping leaves, and residue on the ground.
An example of crop segmentation is shown in Figure 1 above (top), together with an example of adverse lighting effects and overlapping in our dataset (bottom).
Ground cover analysis and crop segmentation have been active fields of research for decades. We will mainly consider the segmentation problem here, because it envelops the classification problem to a degree (if a segmentation map can be generated, pixel-wise classification has either been done already, or is trivial to perform), and can be used to perform estimation as well (though segmentation is not a requirement for cover estimation, since we can transform it into a regression problem).
Ground cover segmentation is essentially a two-class problem. Pixels in an image belong either to a vegetation class, or to a soil/ground class. Hamuda et al. provide an extensive survey of image-based vegetation segmentation techniques, broadly categorizing these techniques as colour index based, threshold based, and machine learning based. We will evaluate existing methods keeping to this categorization, but grouping threshold based methods with colour index based methods because the former are mainly performed using colour indices as well.
To a human observer, the most noticeable distinction between soil and vegetation is the colour of these categories. Whilst soil is for the most part brown or grey, vegetation is often mainly green or yellow. This difference in colour is also used in crop segmentation techniques. Most conventional cameras used in ground level field photography will generate RGB images.
One problem with RGB images is that they are dynamic, meaning that the colour of soil varies throughout an image. The soil may be less moist in one place and very moist in another, affecting its RGB values. The same holds for different shades of green on leaves. To give more prominence to the soil/vegetation colour difference this RGB colour space is converted to another colour space, for example by attributing different weightings to the R, B, and G values, resulting in a colour index. In extreme cases, this may directly lead to a binary image that can be used for segmentation. These indices are often generated using either expert knowledge, or machine assisted methods such as fuzzy classifiers.
There exist many proposed alternative colour spaces for estimating various qualities of vegetation. An overview is presented in the following Table.
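|ExG||$\mathrm{ExG} = 2g - r - b$, with $r$, $g$, and $b$ the percentage of R, G, and B pixels respectively (normalized).|
|CANO||$R/G < P_1$ and $B/G < P_2$ and $2G - R - B > P_3$, with $P_1 = 0.95$, $P_2 = 0.95$ and $P_3 = 20$|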
Table 1. Various colour indices found in literature.
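To make the idea of a colour index concrete, the following is a minimal sketch of the excess green index in its usual formulation, ExG = 2g - r - b on normalized chromatic coordinates. The function name `excess_green` and the two example pixels are hypothetical and only serve as illustration; this is not the implementation used in our experiments.

```python
import numpy as np

def excess_green(rgb):
    """Return the ExG index (2g - r - b) for an H x W x 3 RGB image."""
    rgb = rgb.astype(np.float64)
    total = rgb.sum(axis=2)
    total[total == 0] = 1.0  # avoid division by zero on pure black pixels
    # Normalized chromatic coordinates: each channel divided by R + G + B.
    r, g, b = (rgb[..., i] / total for i in range(3))
    return 2.0 * g - r - b

# Hypothetical 1 x 2 image: one green "leaf" pixel and one brown "soil" pixel.
image = np.array([[[40, 160, 40], [120, 90, 60]]], dtype=np.uint8)
exg = excess_green(image)
print(exg)  # the leaf pixel scores clearly higher than the soil pixel
```

The index is a per-pixel linear combination, which is why such methods are fast but cannot use any spatial context.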
Colour index based methods of vegetation segmentation have the advantage that they are simple, fast and computationally light, but struggle to perform accurately when lighting conditions are bad (i.e. in overcast or sunny weather, or with shadows). This is apparent when one observes the large number of different indices used throughout literature for various purposes. The indices found in the Table above also show such a variation in performance based on the type of images they are applied to.
Usually, these colour indices are used in combination with a set threshold to generate a binary image that can be used for segmentation. The Otsu method is one of the most common approaches to thresholding. It calculates the optimal value to separate two classes by maximizing the inter-class variance (based on a foreground/background histogram). Equation 1 shows the formula of the weighted sum of variances (or intra-class variance) of the two classes that need to be separated:

$$\sigma_w^2(t) = \omega_0(t)\,\sigma_0^2(t) + \omega_1(t)\,\sigma_1^2(t) \qquad (1)$$

Here $\omega_0(t)$ and $\omega_1(t)$ are the probabilities of the two classes separated by threshold $t$, and $\sigma_0^2(t)$ and $\sigma_1^2(t)$ are their variances. Otsu's method then searches for a threshold that minimizes this intra-class variance (and in doing so maximizes inter-class variance). The class probabilities are calculated from the image histograms.
Other techniques include local dynamic thresholding, hysteresis thresholding, and homogeneity thresholding (deriving a local threshold by calculating local homogeneity from converted greyscale intensity images). Dynamic thresholds increase an algorithm's complexity (and therefore influence execution time and memory usage), but often provide better results compared to using only simple colour index based methods with fixed thresholds.
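As an illustration of the search described above, here is a minimal NumPy sketch of Otsu's method. It maximizes the between-class variance, which is equivalent to minimizing the weighted intra-class variance. The function name `otsu_threshold`, the bin count, and the toy data are assumptions made for this example.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Return the threshold that maximizes the between-class variance."""
    hist, edges = np.histogram(values.ravel(), bins=bins)
    centers = (edges[:-1] + edges[1:]) / 2.0
    prob = hist / hist.sum()

    w0 = np.cumsum(prob)                  # probability of the class below the cut
    w1 = 1.0 - w0                         # probability of the class above the cut
    cum_mean = np.cumsum(prob * centers)  # cumulative first moment
    total_mean = cum_mean[-1]

    # Between-class variance for every candidate cut; undefined cuts stay at 0.
    valid = (w0 > 0) & (w1 > 0)
    between = np.zeros(bins)
    between[valid] = (total_mean * w0[valid] - cum_mean[valid]) ** 2 / (
        w0[valid] * w1[valid]
    )
    # Use the upper edge of the best bin so that the whole bin falls in class 0.
    return edges[np.argmax(between) + 1]

# Toy index values with two clear modes (soil around 0.1, vegetation around 0.9).
index_values = np.concatenate([np.full(50, 0.1), np.full(50, 0.9)])
threshold = otsu_threshold(index_values)
vegetation = index_values >= threshold
```

For indices where low values indicate vegetation, the comparison in the last line would be reversed.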
Machine learning (ML), programming computers to learn from experience, has been applied in a vast range of data-driven research areas for classification, clustering, segmentation, and regression. In supervised learning, the model is iteratively corrected according to the labels corresponding to the input data, through which a function that maps the input data to the desired output data emerges.
More recently, deep learning methods, a branch of artificial neural networks (ANNs), have gained increasing interest in the machine learning community, and in mainstream media due to the success of AlphaGo (the first AI to beat a professional human Go player). ANNs are a type of model inspired by their biological counterpart, hence the name. They are structured as a network of nodes and directed weighted edges, and incorporate an activation function similar to that of the natural neuron. The input for a given node can consist of one or many connections to other nodes. The input from all input nodes is combined using a transfer function (e.g. summation), and tested against an activation function containing a set threshold. When the threshold is reached, the node will fire, after which its output is available to succeeding nodes. Eventually the network will calculate an output value (or multiple, depending on the number of output nodes), based on its input, weights, transfer functions and activation functions.
Backpropagation is then used to allow the model to learn. When the network has calculated its output value, it can be right or wrong with a certain error margin, which can be defined as an error function. By backwardly propagating this error into the network, it can adjust its weights to provide a more accurate output value in the next iteration. This is done using Gradient Descent, an optimization algorithm that minimizes the error function stepwise by iterating over the training set.
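The weight update can be sketched for the simplest possible case, a single sigmoid neuron trained with batch gradient descent. The toy data, the learning rate, the number of epochs, and the choice of the log loss are all assumptions for this example; a real network repeats the same idea layer by layer via the chain rule.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_neuron(x, y, lr=1.0, epochs=5000):
    """Fit one sigmoid neuron with batch gradient descent on the log loss."""
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        output = sigmoid(x @ w + b)        # forward pass
        error = output - y                 # gradient of the log loss w.r.t. the pre-activation
        w -= lr * (x.T @ error) / len(y)   # propagate the error back to the weights
        b -= lr * error.mean()             # and to the bias
    return w, b

# Toy data: one "greenness" feature; high values are vegetation (1), low are soil (0).
x = np.array([[0.1], [0.3], [0.7], [0.9]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w, b = train_neuron(x, y)
pred = sigmoid(x @ w + b) > 0.5
```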
ANNs can be built using many layers, allowing them to learn complex relationships between input data and evaluation data. There are many types of neural networks: Feedforward networks such as autoencoders, and restricted Boltzmann machines (RBMs); Convolutional networks such as AlexNet or R-CNN that have layers that convolve kernels with the image; Recurrent networks such as LSTM (often used in time series analysis); and Recursive networks (recursive autoencoders, for example). Often, a combination of these networks is used, as is the case in convolutional networks that combine fully connected autoencoders, or stack RBMs.
Deep neural networks (DNNs) are a subclass of ANNs that incorporate a multitude of hidden layers, allowing them to learn more complex functions. They are often called Deep Learning methods. DNNs show state-of-the-art results in various domains including natural language processing (NLP), speech recognition, and computer vision tasks like concept detection and visual question answering (VQA).
The prime component of a convolutional neural network (CNN) is the convolutional layer. This convolutional layer consists of a kernel (smaller than the original input layer) that is convolved with the image, computing dot products. The main advantage of a convolutional layer is that it is able to learn both spatial patterns across multiple input pixels and patterns between pixel channels (R, G, B for example), instead of only the latter.
Often, many of these layers are sequenced together, intermixed with activation functions. Activation functions define the output of a neuron given its input. A simple example of this is the rectified linear unit (ReLU) activation function ($f(x) = \max(0, x)$), but more complex functions are used as well (SoftPlus, Gaussian). This results in an activation map per kernel.
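The following sketch shows how one kernel combined with a ReLU yields an activation map. The hand-picked kernel and the 4 x 4 patch are hypothetical; in a trained network the kernel weights are learned. As in most deep learning libraries, the operation below is technically a cross-correlation (the kernel is not flipped).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide `kernel` over an H x W x C image, taking a dot product at each position."""
    kh, kw, _ = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw, :] * kernel)
    return out

def relu(x):
    return np.maximum(0.0, x)

# 4 x 4 RGB patch: left half soil-like, right half green.
patch = np.zeros((4, 4, 3))
patch[:, :2] = [0.5, 0.4, 0.3]
patch[:, 2:] = [0.2, 0.8, 0.2]

# 1 x 2 x 3 kernel that responds to an increase in green from left to right.
kernel = np.zeros((1, 2, 3))
kernel[0, 0, 1] = -1.0
kernel[0, 1, 1] = 1.0

activation_map = relu(conv2d_valid(patch, kernel))
print(activation_map)  # non-zero only in the column where soil meets vegetation
```

The kernel looks at two neighbouring pixels and one channel at once, which is exactly the kind of spatial information a per-pixel colour index cannot use.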
Figure 2. Visualization of features in a convolutional neural network. Activation maps of high scoring layers (right side) are shown for random inputs (left side). Images obtained from: https://arxiv.org/abs/1311.2901.
A feature visualization of convolutional networks is shown in Figure 2. Fully connected (or dense) layers contain neurons that are connected to all input neurons (as is the case with normal neural networks). Pooling layers, such as (soft-)max pooling, can be used as well to downsample and reduce the size of feature representations. Dropout layers, layers that remove nodes based on a certain probability distribution, have also been shown to improve performance because they can prevent overfitting (especially in fully connected networks).
One of the main disadvantages of neural networks is that training them requires a lot of labelled data, which may not always be readily available. In addition, localization is not the prime strength of neural networks, so accurate pixel-wise segmentation is a challenge. To overcome this problem, Ronneberger et al. propose an effective network architecture (U-net) that relies on data augmentation and allows state-of-the-art pixel-wise classification. U-net consists of a contracting, downsampling pathway that incorporates a number of convolutional layers, and an expanding, upsampling pathway that allows for localization.
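Both layer types are simple to express. The sketch below shows 2 x 2 max pooling and a common dropout variant ("inverted" dropout, where the surviving activations are rescaled during training). The rescaling, the function names, and the example values are assumptions for illustration.

```python
import numpy as np

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling on a 2-D feature map."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # drop rows/columns that do not fill a block
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

def dropout(activations, rate, rng):
    """Inverted dropout: zero each unit with probability `rate`, rescale the rest."""
    keep = rng.random(activations.shape) >= rate
    return activations * keep / (1.0 - rate)

feature_map = np.arange(16, dtype=float).reshape(4, 4)
pooled = max_pool2d(feature_map)  # 4 x 4 map reduced to 2 x 2
dropped = dropout(pooled, rate=0.5, rng=np.random.default_rng(0))
```

Dropout is only applied during training; at test time all units are kept.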
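The interplay between the two pathways can be sketched at the level of array shapes. This is only a rough illustration: the actual U-net uses learned convolutions and up-convolutions between these steps, which are omitted here, and the feature map size and channel count below are hypothetical.

```python
import numpy as np

def downsample(x):
    """2 x 2 max pooling on an H x W x C feature map (H and W must be even)."""
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).max(axis=(1, 3))

def upsample(x):
    """Nearest-neighbour 2x upsampling (a stand-in for a learned up-convolution)."""
    return x.repeat(2, axis=0).repeat(2, axis=1)

features = np.random.default_rng(0).random((256, 256, 10))

contracted = downsample(features)   # contracting path: coarser, more context
expanded = upsample(contracted)     # expanding path: back to full resolution
# Skip connection: copy the high-resolution features and concatenate the channels.
merged = np.concatenate([features, expanded], axis=2)

print(contracted.shape, expanded.shape, merged.shape)
```

The concatenation is what lets the expanding path recover the fine spatial detail that pooling discards.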
Although DNNs are used for semantic segmentation, not much research has been done on ground cover segmentation using a deep learning approach.
Figure 3. U-net architecture. The blue boxes represent multi-channel feature maps. The number of channels is denoted at the top of the boxes, and the shape data is denoted at the bottom. The white boxes in the upsampling pathway correspond to the copied feature maps. Arrows denote various layer operations (see legend). Image taken from Ronneberger et al.
To train and evaluate our machine learning methods for ground cover analysis, we used the extensive labelled dataset produced in the framework of two projects by NWO through an Applied Research Fund (ARF), and the STARS project, a research consortium consisting of the Bangladesh Institute of ICT in Development (BIID); the International Maize and Wheat Improvement Center (CIMMYT); and the Geo-Information Science and Earth Observation (ITC). The STARS project is funded by the Bill and Melinda Gates Foundation. The NWO projects are funded by the Dutch government.
An 8GB dataset consisting of 2564 images of varying quality and dimensions (around 2000x1500 pixel JPEGs) was hand-annotated by the team in Bangladesh using CAN-EYE imaging software. This dataset is the basis of our ground cover research.
Figure 4. Sample data of vegetation RGB imagery taken using a smartphone, and hand-annotated ground truth data using CAN-EYE software.
Using the CAN-EYE software, the Bangladesh team extracted canopy structure characteristics, such as Leaf Area Index (LAI) and vegetation cover fraction (FCOVER). Various plant species, including Wheat, Maize, and Mung bean, were photographed (see Figure 1 for examples of these crop types, each covering roughly 30% of the dataset). Per plot, around 9 to 10 images were taken in a square formation. Some small overlap is present in these images, but this does not cause any problems for our proposed method since all images are annotated. This is the ground-truth data to which we compare our method. From here on, we will refer to this dataset as CROPS.
Figure 5. Frequency of FCOVER value occurrence per plant type.
To obtain a score for the colour indices methods, we implemented all algorithms listed in Table 1, namely NDI, ExG, ExR, ExGR, MExG, CIVE, VEG, COM1, COM2, and CANO(PEO), and applied them to our CROPS dataset. Since these algorithms are basic colour space transformations, they require a threshold to obtain a binary segmentation map. For ExGR the threshold is set at 0, in accordance with literature. For VEG and MExG, we use the mean value of the resulting image as threshold. For the other algorithms, we apply Otsu thresholding. For ExR and CIVE, due to their formulation, the Otsu threshold is taken as an upper bound. For the others, NDI, ExG, COM1, and COM2, this threshold is a lower bound. CANO's formulation already ensures the generation of a binary map. We leave the P1, P2, and P3 parameters at their default values as reported in the CANOPEO paper.
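The thresholding rules above can be summarized in a short sketch. The helper names and the toy index map are hypothetical, and treating values above the threshold as vegetation for ExGR, VEG, and MExG is an assumption of this example; the Otsu threshold itself would come from a separate routine.

```python
import numpy as np

# Which side of the Otsu threshold counts as vegetation, following the text above.
VEGETATION_ABOVE = {"NDI": True, "ExG": True, "COM1": True, "COM2": True,
                    "ExR": False, "CIVE": False}

def binarise(index_map, threshold, vegetation_above=True):
    """Turn a continuous colour-index map into a boolean vegetation mask."""
    if vegetation_above:
        return index_map > threshold  # threshold acts as a lower bound
    return index_map < threshold      # threshold acts as an upper bound

def fixed_rule_mask(name, index_map):
    """Masks for the indices that use a fixed rule rather than Otsu."""
    if name == "ExGR":
        return binarise(index_map, 0.0)
    if name in ("VEG", "MExG"):
        return binarise(index_map, index_map.mean())
    raise ValueError(f"{name} uses an Otsu threshold instead")

# Toy index map standing in for the output of a colour space transformation.
index_map = np.array([[-0.2, 0.1], [0.4, -0.1]])
exgr_mask = fixed_rule_mask("ExGR", index_map)
mexg_mask = fixed_rule_mask("MExG", index_map)
exr_mask = binarise(index_map, 0.0, VEGETATION_ABOVE["ExR"])
```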
Applying these methods results in a binary segmentation map of the input image. From these binary images, we can determine the estimation and segmentation errors. The estimation error is determined by calculating the percentage of plant pixels in the binary map, and comparing this to the ground truth FCOVER. The segmentation error can be determined by evaluation of the True Positives, True Negatives, False Positives, and False Negatives.
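A minimal sketch of these two evaluations is given below, assuming boolean masks in which `True` marks a plant pixel. The function names and the toy masks are hypothetical. The example is chosen so that the estimation error is zero while the segmentation is not perfect, which shows why both errors are reported.

```python
import numpy as np

def fcover(mask):
    """Fraction of plant pixels in a binary map."""
    return float(np.mean(mask))

def segmentation_scores(pred, truth):
    """Accuracy and precision from the pixel-wise confusion counts."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.sum(pred & truth)
    tn = np.sum(~pred & ~truth)
    fp = np.sum(pred & ~truth)
    fn = np.sum(~pred & truth)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    return accuracy, precision

def rmse(estimates, references):
    """Root mean square error over a set of per-image FCOVER estimates."""
    diff = np.asarray(estimates, dtype=float) - np.asarray(references, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

truth = np.array([[1, 1, 0, 0], [1, 0, 0, 0]], dtype=bool)
pred = np.array([[1, 0, 0, 0], [1, 1, 0, 0]], dtype=bool)

accuracy, precision = segmentation_scores(pred, truth)
estimation_rmse = rmse([fcover(pred)], [fcover(truth)])
```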
We also use the AP-HI method proposed by Yu et al. for evaluation, due to its high reported performance. After we reached out to the researchers, they were kind enough to share their MATLAB implementation. The pipeline was easily adapted to run on our CROPS dataset. The AP-HI algorithm produces a binary segmentation image, which can be evaluated in the same way as the colour indices methods described in the previous section.
Additionally, we implemented SVM and Random Forest models (for regression only, i.e. calculating the FCOVER percentage), and the previously highlighted U-net neural network (for segmentation). Our network architecture consists of 23 convolutional layers. The final step is a 1 by 1 convolution and Otsu thresholding to produce a binary segmentation map. In our experiments we reduce the number of features to 10 in the first convolutional layer, and increase the number further down the pipeline (up to a 3x3x80 convolution). We employ the logistic (log) loss, as used in logistic regression, as our loss function.
To deal with the high-resolution data, we split the input into 256x256 pixel tiles, and train/test the network on these samples. The input for the network therefore consists of an 8-bit 3-channel 256x256 pixel RGB image. The output is a single-channel 256x256 pixel binary map. Since we cannot divide the original image into tiles of this shape perfectly, some overlap is present. It can be argued that the overlap will introduce some overfitting, but this overfitting is marginal. The samples are augmented by performing 5 independent semi-random elastic transformations on the original image, and 4 rotations. This boosts the size of our training set, and decreases variance. This results in a total set of 5,101,076 input images, to be split into training and validation data using K-fold cross validation.
We also trained the network using scaled images where no transformations are performed. This has the added benefit of negating any imprecise annotations by downsampling the errors together with the image, while the FCOVER remains the same by definition. By downsampling the original images using a nearest neighbour resampling filter, and applying the same transformation to our ground truth images, we obtain input images of 256x194x3 pixels. We verified that the FCOVER does not change significantly upon downsampling, and found that the mean difference in FCOVER between the original and scaled down version is around 0.0009512. This corresponds to a difference of roughly 0.095 percentage points, which we deemed acceptable for our experiments, since we use the scaled down FCOVER in these cases.
We varied the network's hyperparameters, such as number of layers and number of features, as part of our experiment. Since the literature on the subject of tweaking deep learning architectures does not provide an optimal strategy for this, we varied the parameters until observing diminishing returns. In a similar manner we also evaluated the effect of Dropout at different positions, since this has been shown to increase performance in DNNs.
In addition to a single network for all plant types in our CROPS dataset, we also evaluated the DNN's performance by training it on each type individually. This resulted in three networks: a Wheat, a Maize, and a Mung bean segmentation network.
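One possible way to produce such tiles is sketched below. Shifting the last tile in each row and column back so that it ends at the image border is an assumption of this example (it is one way the overlap can arise), and the image size is only a typical value. The elastic transformations are omitted; only the four 90-degree rotations are shown.

```python
import numpy as np

def tile_starts(length, tile):
    """Start offsets covering `length` with `tile`-sized windows.

    The last window is shifted back to end at the border, which creates overlap
    whenever `length` is not a multiple of `tile`.
    """
    if length <= tile:
        return [0]
    starts = list(range(0, length - tile, tile))
    starts.append(length - tile)
    return starts

def split_into_tiles(image, tile=256):
    rows = tile_starts(image.shape[0], tile)
    cols = tile_starts(image.shape[1], tile)
    return [image[r:r + tile, c:c + tile] for r in rows for c in cols]

image = np.zeros((1500, 2000, 3), dtype=np.uint8)  # placeholder for one photograph
tiles = split_into_tiles(image)

# The four rotations of a single tile (0, 90, 180, and 270 degrees).
rotations = [np.rot90(tiles[0], k) for k in range(4)]
print(len(tiles), tiles[0].shape)
```

The same offsets must be applied to the ground truth map so that every input tile keeps its matching label tile.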
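The FCOVER check can be sketched as follows, assuming a simple index-based nearest neighbour resampler and a synthetic rectangular ground truth. Both are stand-ins for illustration; an image library's nearest neighbour filter may pick slightly different source pixels.

```python
import numpy as np

def downsample_nearest(array, out_h, out_w):
    """Nearest neighbour resampling by picking one source pixel per output pixel."""
    rows = (np.arange(out_h) * array.shape[0] / out_h).astype(int)
    cols = (np.arange(out_w) * array.shape[1] / out_w).astype(int)
    return array[rows][:, cols]

# Synthetic ground truth: a rectangular "canopy" on a 2000 x 1500 pixel image.
truth = np.zeros((1500, 2000), dtype=bool)
truth[300:1100, 400:1500] = True

small = downsample_nearest(truth, 194, 256)
difference = abs(truth.mean() - small.mean())
print(small.shape, difference)
```

Because nearest neighbour resampling never blends pixels, the downsampled ground truth stays strictly binary and needs no re-thresholding.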
We implemented a number of different methods to be used as benchmark for our machine learning methods. For some of these methods, we conducted additional experiments to gauge the effects of certain parameters on the resulting segmentation or estimation. Figure 6 shows examples of segmentation results for each method.
Figure 6. Example segmentations of Maize, Mung bean, and Wheat crops for the given segmentation algorithms (scaled images).
We evaluated the segmentations on root mean square error (RMSE), and precision and accuracy scores. Overall, Maize seems to be the easiest crop type to segment using these methods. Figure 7 shows the algorithms' precision plotted against their accuracy, split on crop type. The algorithms score highest on Maize images, and lowest on Wheat images. We see that the DNN outperforms all conventional methods, although there are instances where another method has a slight edge on the DNN: for example, COM2 shows a minor improvement in accuracy for Wheat images. We see that in most cases, RMSE scores correlate with accuracy and precision scores. However, this is not always the case. For instance, we can observe that COM2 has quite a high precision score for Maize images, while the corresponding RMSE is relatively high. This discrepancy is due to COM2 strongly underestimating the cover in Maize images (which, in turn, is caused by the index not being able to recognize very light leaf areas). Since COM2 does not classify many pixels as plant pixels, it does not generate many false positives either.
Figure 7. Top: RMSE and standard deviation of scaled images for ground cover estimates for all segmentation methods. Bottom: Precision and accuracy scores for methods of ground cover segmentation of scaled images.
The colour indices methods we implemented provide varying segmentation results. We see the detrimental effects of changing lighting in the leaves, clutter on the ground, and shadows cast by other leaves. We see MExG performing best overall for scaled images. ExG, NDI, VEG, COM1, and COM2 have the biggest trouble with hard lighting changes, as can be seen in the examples above. The other indices, ExR, ExGR, MExG, CIVE, and CANO, overcome this problem. However, this feature also makes them tend to overestimate the crop area, as can be seen in the Wheat examples most prominently due to the density of the crops in this type of vegetation. Their ability to pick up light surfaces has the downside of not being able to distinguish between darker surfaces adequately.
Strikingly, the colour index that performs best is surprisingly simple. The MExG index is a weighted linear combination of the R, G, and B planes. We find that MExG only really struggles when dealing with very hard lighting, and near-white surfaces. An example of this can be found in the Maize column, in the area on the large leaf on the left side of the image. We see the error rate increase in images of Wheat especially, since the spikelets are highly reflective. The high blue values in white pixels cause problems for this index. Red values are relatively high even in green leaves, which is why MExG penalizes them strongly. Since colour indices methods do not take into account other information than the R, G, and B values of a single pixel, they cannot infer that white pixels with high RGB values actually belong to a leaf element in the image, rather than background noise.
The colour indices that perform the worst, most notably ExGR, drastically underestimate the vegetation area, due to their inability to deal with big changes in leaf colour when the light conditions are extreme. On the other hand, ExGR sometimes labels shadows as being part of the plant, resulting in an overestimation of the FCOVER.
The AP-HI method as proposed by Yu et al. performs relatively well for the original images, but not quite as well as, for example, MExG or the DNN. Especially when segmenting images of Wheat plants, it lacks the power to identify the yellowish, lighter spikelets on the top of the plant as being part of the plant. The AP-HI method, like some of the colour indices, struggles to appropriately classify strong highlights, albeit to a much lesser degree. AP-HI performs somewhat worse on our scaled CROPS dataset than it does on the original set. We believe that due to the decrease in pixel availability in smaller images, the sample size of the hue intensity lookup table decreases as well. In addition, the affinity propagation clustering algorithm has less data to work with, leading to non-optimal results.
For original images, we observed that in many cases the DNN overestimates the plant region, which could be caused by a combination of the input images being rather small, and the large variety in lighting conditions in our CROPS dataset. We did also observe the potential of the DNN to overcome intra-leaf lighting changes. Note that this was with a single network for all 3 plant types.
When we investigate the resulting segmentation map of our scaled images, we see that the convolutional neural network provides segmentations that overcome the problems that hinder conventional colour indices methods. Comparing the DNN to the other methods we clearly see that DNN outperforms conventional colour indices methods, and the AP-HI method. Not only is it the best performing segmentation method overall, we also see top performance for each individual plant type. Especially for Maize and Mung bean we can observe exceptional performance. The DNN obtains similar performance to the well performing colour indices methods for the Wheat type.
|Single||0.0901 (0.0175)||0.0418 (0.0113)||0.0494 (0.0046)||0.1653 (0.0240)|
|Typewise||0.0779 (0.0193)||0.0351 (0.0053)||0.0444 (0.0040)||0.1423 (0.0351)|
Table 2. MSE results (standard deviation) of a single network versus the scores of the three separately trained networks.
The DNN is able to handle extreme changes in lighting conditions, more so than conventional methods. This is due to the added information of other pixels surrounding the target pixel. The convolutional layer helps in learning these gradual and abrupt hue and intensity differences in a leaf. This does lead to rounded edges in most segmentation maps. In Mung bean images, this is an advantage since the leaves are relatively round themselves. In Maize, but especially in Wheat, this leads to misclassification of a large number of pixels, but the DNN is still able to outperform the other methods.
When looking at the precision and accuracy of the DNN scaled segmentations, we again see that it outperforms all methods. Per individual plant type, there is only one other method that obtains better accuracy: COM2 on Wheat images.
We have trained an array of different network setups. The DNN results shown in Figure 7 show the best performance of our method, which corresponds to three separately trained networks, one for each plant type.
We have explained how segmentation algorithms can also provide an estimation of ground cover. A number of algorithms are able to provide an estimate of the percentage of vegetation in our images without creating a binary map first. They are machine learning methods that are trained on the ground truth cover fraction. Together with the segmentation methods, we show their estimates here. Figure 8 shows the RMSE for each regression method, and the DNN (on scaled images).
Figure 8. RMSE and standard deviation of scaled images for ground cover estimates for all regression methods.
Overall, all regression methods outperform the DNN. However, the difference in performance is small. In fact, the DNN provides better estimation results for Maize type vegetation, and is on par with the SVM and MLP for estimation of images containing Mung bean plants. The regression methods do significantly outperform the DNN method on Wheat imagery, however.
It is hard to gauge exactly why the regression methods do so much better on Wheat images, since we converted the original image to a percentage based feature vector. We know that our DNN underperforms on Wheat images because it is prone to generating rounded, thick edges, quite the opposite of Wheat leaves. Methods that do direct regression on the FCOVER value do not encounter this problem, which might help to explain the discrepancy in scores.
Machine learning is an effective tool that can be applied to perform more accurate ground cover analysis than conventional methods. The methods proposed in this work outperform existing methods of ground cover analysis, both in RMSE, as well as accuracy and precision. Out of our machine learning methods, the SVM, RF, MLP, and DNN, our DNN architecture shows the best segmentation performance while also being able to provide accurate ground cover estimation.
We must note that our neural network approach also has a number of disadvantages. First of all, neural networks, convolutional neural networks in particular, are heavily reliant on properly labelled data. This means that a lot of preliminary work and preprocessing is involved in using this type of method for ground cover analysis. In addition, the learning phase is quite slow, especially compared to the colour indices methods, which were also able to handle high resolution input. Training time was around an hour on an 8-core i7-4700MQ with an NVIDIA Quadro K1100M. This means that for regression tasks, it is probably preferable to use an SVM-based approach, considering the algorithm's speed and very accurate estimation results. Second, the architecture we used was not able to handle very large input images. The original image resolution of around 2000x1500px was too large for our network and machine, and we were forced to scale down our input images. Initially, we split our original images into smaller blocks and used those as input images; however, we found that our ground truth data was not always optimal.
We must also address that while we get very good results on networks trained separately on each crop type, this will have implications for the operational use of our methods, and introduce additional costs. A final disadvantage of DNNs is that while they can achieve good performance, it is hard to observe why they perform so well. However, methods of visualizing layer activations are currently being researched to overcome this problem.
This brings us to the limitations of our experiments. When we performed visual inspection on our original ground truth, we found that there were many small errors. This was an artefact of annotating with the CAN-EYE software, and was difficult to resolve. We expect that the non-optimal ground truth labelling is the reason that the DNN did not perform accurately on the original split images. By scaling the original image, and with it the ground truth, we removed many small errors through nearest-neighbour resampling. This leads to a better ground truth, and much better results. We expect that with more accurate ground truth, the DNN will also be able to obtain better results on the original (split) images. In addition, we expect that with a better machine, larger input images can also be processed.
 Hamuda, Esmael, Martin Glavin, and Edward Jones. "A survey of image processing techniques for plant extraction and segmentation in the field." Computers and electronics in agriculture 125 (2016): 184-199.
 Ronneberger, Olaf, Philipp Fischer, and Thomas Brox. "U-net: Convolutional networks for biomedical image segmentation." Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18. Springer International Publishing, 2015.