Tree species classification from complex laser scanning data in Mediterranean forests using deep learning

Recent advances in terrestrial laser scanning (TLS) technology have enabled the automatic capture of three‐dimensional vegetation structure at high resolution, but the scalability of using these data for large‐scale forest monitoring is limited by reliance on intensive manual data processing, including the use of stem maps generated in the field to determine tree species. New methods from data science have the capacity to automate this identification process, reducing the hurdles towards automated inventories with TLS. In particular, contemporary developments in point cloud processing methods, alongside large increases in the computing power of consumer‐level graphics processing units, provide new opportunities. Here, we apply a deep learning‐based approach, based on joint classification from multiple viewpoints for each stem, to automatically classify tree species directly from laser scanning data obtained in structurally complex Mediterranean forests. We also explore the use of data augmentation techniques to maximise performance for a fixed number of manually labelled stems. Our method does not require expensive pre‐processing such as leaf‐wood separation or quantitative reconstructions. Using modern network architectures and data augmentation techniques, and without extensive pre‐processing, we are able to achieve high overall and per‐species accuracy that is comparable to or higher than that in existing work, while using data from a water‐limited ecosystem complicated by structural convergence and multi‐stem trees. Our findings demonstrate the power of deep learning to remove a major TLS data processing obstacle—individual species identification—and to minimise the bottleneck created by manual data labelling requirements in the use of TLS for standard forest monitoring.


| INTRODUCTION
Reliable large-scale data on the state of forests are needed to monitor many aspects of forest structure and function, including ecosystem health (Pommerening & Grabarnik, 2019), carbon storage (Kurz et al., 2002) and the impacts of global change (Peñuelas et al., 2017).
Such information can be obtained via plot data and national inventories (Burley et al., 2004), and typically comprises simple measurements such as diameter at breast height (DBH) and, perhaps, height, crown radius and depth (Burley et al., 2004). Allometric equations can then be used to estimate quantities of interest such as basal area and stand biomass (Kebede & Soromessa, 2018). These measurements and estimates may further be used, for example, to calculate merchantable timber (Kangas & Maltamo, 2006) or aboveground carbon storage (Talbot et al., 2014). Although these ground measurements are widely used, they come with many limitations; calculated quantities likely have substantial uncertainty due to the implicit approximations made by allometric equations (Wang, 2006), and they cannot capture fine-scale information such as variation in crown shape with depth, accurate cross-sectional shapes and branching structure. Detailed information on tree crown morphology is important for the study of forest dynamics such as competition between individual trees (Owen et al., 2021), and for accurate estimation of quantities such as branch biomass.
Recent advances in terrestrial laser scanning (TLS) mean that automatic capture of three-dimensional (3D) vegetation structure at high resolution is now possible (Wilkes et al., 2017). Because of its demonstrated high accuracy (Maas et al., 2008; Mengesha et al., 2014) and the additional metrics it can be used to calculate, there is substantial interest in the application of TLS to carry out inventories (Bienert et al., 2006; Liang et al., 2018; Piermattei et al., 2019; Simonse et al., 2003). However, while the use of TLS data to calculate simple metrics such as DBH, basal area and height, among others, is relatively straightforward once data are segmented into individual trees (Liang et al., 2016), major hurdles to large-scale implementation of TLS for surveying remain. One particular challenge is that TLS data give only structural information, and the automated classification of stems into species or genus is not currently possible from raw point cloud data. This means that studies using TLS data must rely on stem maps generated in the field to identify species in order to calculate any properties reliant on species information, such as diversity, competitive interactions or specific wood density for carbon storage estimates. To address this, attempts have been made to classify tree species from TLS based on geometric features extracted after extensive data processing. For example, Terryn et al. (2020) used manually extracted structural features generated from quantitative structural models (QSMs; Raumonen et al., 2013) to classify five species of five different genera collected from a semi-natural western European woodland, obtaining a best overall classification accuracy of approximately 82%.
However, while the use of manually specified features may increase the interpretability of classification, this approach has drawbacks limiting both its scalability and wide applicability: reducing information content to interpretable features in this way may remove characteristics that improve classification, and the construction of QSMs is complex, relying on high point density raw data and manual refinement (Lau et al., 2018). In this study, we apply deep learning-the automated learning of the best features for downstream classification-to automate species identification from whole-tree TLS point clouds extracted from a dataset collected in Mediterranean woodland (Owen et al., 2021). Compared to a QSM-based approach, pre-processing requirements are substantially reduced and operator bias avoided, since feature choice is not pre-determined. We demonstrate that computer vision techniques (Hartley & Zisserman, 2003) and deep learning (LeCun et al., 2015) can be used to classify tree species from TLS point clouds from structurally mixed Mediterranean forests, with structurally complex features including multi-stems, potential structural convergence as a result of water limitation and significant variation in topographic conditions between individual plots.
TLS scanning generates point clouds-unordered sets of points in space representing a 3D shape or object-that are used in many areas of computer vision, including autonomous navigation (Zeng et al., 2018), robotics (Whitty et al., 2010) and forestry (Rahlf et al., 2021). Due to their wide range of applications, point clouds such as those generated by TLS systems have attracted increasing attention in point cloud learning research over the last decade, but applying machine learning to 3D data brings unique challenges compared to two-dimensional (2D) images (the standard domain of computer vision research): 3D point clouds are constructed by sampling irregularly in Cartesian space, so are inherently non-Euclidean, meaning common deep learning techniques from 2D computer vision are not directly applicable. Although methods operating directly on point cloud data exist (Qi et al., 2017; Wang et al., 2019), they are often conceptually complex, difficult to implement and do not always outperform methods based on applying existing convolutional neural networks (CNNs) to projected images (Goyal et al., 2021; Seidel et al., 2021). We instead use a method based on joint classification of features extracted simultaneously from multiple 2D projections, henceforth referred to as 'SimpleView', which obviates the issue of building a network for point cloud data. This conceptually simple approach has been found to obtain state-of-the-art performance classifying exemplar laser scanning datasets (Goyal et al., 2021), and is likely agnostic to sensor resolution due to the abstraction of the input data from 3D point cloud to low-resolution 2D image.
Deep learning-based classification-the application of an unmodified architecture to a single orthographically projected image (mapping of a point cloud to a flat image)-has been applied for species classification from TLS data in plantation and unmanaged forests, with existing work focused on datasets with a small number of species per genus. In managed plantations, Zou et al. (2017) classified species from TLS point cloud data by applying a single deep belief net (Hinton et al., 2006) to multiple projected views separately, achieving a maximum accuracy of 95.6% on data comprising eight species from eight distinct genera. Seidel et al. (2021) used a similar approach to classify TLS data from national parks and unmanaged forests in Germany and the United States into seven species of six genera, with augmented samples appended to underrepresented classes before training. A simple network based on LeNet-5 (Lecun et al., 1998) achieved 86% overall accuracy, and compared positively to a pointwise approach, PointNet (Qi et al., 2017). Xi et al. (2020) benchmarked several models, including voxel-based and point-based deep learning methods, on a dataset of monospecific plots comprising nine species of five genera from Canada and Finland, with random augmentations applied during training, and achieved up to 95.8% overall accuracy with the best model. Existing work, however, is limited by its application to structurally simple data and by labour-intensive labelling requirements. Plantation forests such as those in Zou et al. (2017) lack the structural complexity of unmanaged forests, and the use of monospecific plots by Xi et al. (2020) may have reduced structural variation in the data. Xi et al. (2020) achieved the best accuracy (95.8%), but used leaf-wood separated training data, which required between 1 and 10 h of manual labelling per tree, making this an impractical approach at scale.
In contrast, our approach requires minimal pre-processing and is not reliant on leaf-wood separated point clouds. We (1) assess whether our method can achieve high overall classification accuracy; (2) compare classification accuracies for individual species and assess whether higher classification accuracy is achieved between rather than within genera; and (3) assess the extent to which data augmentation based on randomised transformations at train-time can improve performance and overcome unevenness in species sample sizes characteristic of forest data.

| Data
Our dataset comprises 2478 individual trees (Table 1). See the main text and supplementary material of Owen et al. (2021) for further details. Plots were scanned using a Leica HDS6200 scanner, with scanner resolution set to 3.1 mm. In all, 16 scans spaced at 10 m in a grid pattern were used for each 30 × 30 m plot. Trees were automatically segmented from the data using the treeseg package (Burt et al., 2018), followed by manual refinement on canopy trees; for full details of data collection and processing, see Owen et al. (2021). We discarded dead trees, species with low counts (fewer than five samples) and stems that were not identified or identified only to genus level, leaving five species-three Pinus and two Quercus.

| Classification approach
2.2.1 | Camera projection and data pre-processing

Following Goyal et al. (2021), we used six camera-projected images for each tree. This approach, where projection lines are drawn from the world point (the point in 3D coordinates projected onto the image) to the camera point (the point at which the simulated camera is located in 3D coordinates), has been found to outperform orthographic projection, in which all projection lines are parallel to the image plane, when used for shape classification with this network architecture (Goyal et al., 2021). See Supplement 10 for projection details.
Individual tree point clouds were zero-centred by subtracting the mean and uniformly scaled to lie within [−1, 1]³. The six camera projections were taken at a distance of 1.4 units. Axes are defined by the internal coordinate system of the scanner, where the z-axis is vertical and the x- and y-axes have no canonical direction (although they are consistent within each plot). Camera field of view (FOV) was set to 90° (as in Goyal et al., 2021) and image resolution was set to 256 × 256.

TABLE 1 Distribution of tree species from our dataset used to train/test the classifier. Individuals are classified as multi-stem if they bifurcated below 1.3 m (Owen et al., 2021)
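As a concrete illustration of these pre-processing and projection steps, the following NumPy sketch zero-centres a point cloud, scales it into [−1, 1]³ and perspective-projects it onto a 256 × 256 depth image for one of the six viewpoints. The function names and the single-viewpoint geometry are our own simplifications for illustration, not the authors' implementation:

```python
import numpy as np

def normalise(points):
    """Zero-centre a point cloud and scale it uniformly into [-1, 1]^3."""
    pts = points - points.mean(axis=0)          # subtract the centroid
    pts = pts / np.abs(pts).max()               # uniform scale into [-1, 1]^3
    return pts

def project(points, cam=np.array([0.0, -1.4, 0.0]), res=256, fov_deg=90.0):
    """Perspective-project points onto a res x res depth image for a camera
    1.4 units away on the -y axis, looking towards the origin (one of the
    six viewpoints; the others follow by symmetry)."""
    rel = points - cam                          # coordinates relative to camera
    depth = rel[:, 1]                           # distance along the view axis
    keep = depth > 1e-6                         # points in front of the camera
    rel, depth = rel[keep], depth[keep]
    f = 1.0 / np.tan(np.radians(fov_deg) / 2)   # focal length for the given FOV
    u = f * rel[:, 0] / depth                   # perspective divide
    v = f * rel[:, 2] / depth
    img = np.zeros((res, res))
    ix = np.clip(((u + 1) / 2 * (res - 1)).astype(int), 0, res - 1)
    iy = np.clip(((1 - v) / 2 * (res - 1)).astype(int), 0, res - 1)
    # keep the nearest point per pixel, encoding depth as intensity
    for x, y, d in zip(ix, iy, depth):
        if img[y, x] == 0 or d < img[y, x]:
            img[y, x] = d
    return img
```

In the full pipeline this projection would be repeated for each of the six camera positions, yielding the six single-channel images passed to the network.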

| Network architecture and protocol
We used the network architecture and training protocol SimpleView (Goyal et al., 2021), a simple method based on orthogonal camera projections that has been demonstrated to achieve near state-of-the-art performance on exemplar datasets of laser-scanned objects (Wu et al., 2015). Our architecture takes as input six images generated by camera projection (Figure 2).

| Loss functions
Loss functions are used during the training of the network to assess current performance. Loss functions must be differentiable, and can be thought of as the 'distance' to the ground truth labels, with more accurate classifiers typically, but not always, having a lower loss. If a classifier frequently makes marginal, but incorrect, decisions between the correct label and another, it is possible to obtain low loss simultaneously with low accuracy. In practice, this does not prevent the use of such functions to train networks. The value of the loss function is used as the minimisation target by the optimiser at each iteration during training. We used the smooth-loss (cross entropy with label smoothing) function from Goyal et al. (2021). For smooth-loss, where K is the number of classes, t is the vector truth label and p the vector of softmax probabilities, the loss is the cross entropy

ℓ(t, p) = −∑_{k=1}^{K} t_k log p_k.

Label smoothing is applied to a one-hot truth vector, so that the entry for the true class becomes 1 − ε + ε/K and every other entry becomes ε/K, where ε is the smoothing parameter.

FIGURE 1 Example point clouds of each of the species in our dataset, before pre-processing; from left to right: Quercus faginea, Quercus ilex, Pinus nigra, Pinus pinaster, Pinus sylvestris.
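A minimal NumPy sketch of cross entropy with label smoothing, following the definitions above. The smoothing parameter `eps = 0.2` is an assumed illustrative value, not necessarily the one used in the study:

```python
import numpy as np

def smooth_labels(y, n_classes, eps=0.2):
    """Replace a one-hot truth vector for true class y with a smoothed one:
    the true class gets 1 - eps + eps/K, every other class gets eps/K."""
    t = np.full(n_classes, eps / n_classes)
    t[y] += 1.0 - eps
    return t

def smooth_loss(logits, y, eps=0.2):
    """Cross entropy between softmax probabilities and smoothed labels."""
    z = logits - logits.max()                   # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum()
    t = smooth_labels(y, len(logits), eps)
    return -(t * np.log(p)).sum()
```

A confident correct prediction yields a lower loss than a confident incorrect one, but because the smoothed target is never exactly one-hot, the loss does not vanish even for a perfect classifier.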

FIGURE 2 Overview of the projection and SimpleView network architecture. (a) Camera viewpoints for projection, shown with a Quercus faginea point cloud (downsampled); (b) example projections for a single tree. Six projections are made in total, with three not shown in this figure. Projection images are single channel and coloured yellow-green-blue for illustration only, with yellow points being further away from the camera; (c) CNN (ResNet-18/4) taking six single-channel projected images as input; (d) the CNN outputs vectors of extracted features (learned from the data); (e) feature vectors are concatenated into a single vector six times larger, and classification is performed based on this; (f) the output layer is of length five, with each entry corresponding to the probability of the point cloud belonging to the corresponding species in our dataset. The species with the maximum probability is chosen as the prediction.

Goyal et al. (2021) found smoothed labels outperform one-hot labels in all configurations on ModelNet classification, so we chose to use this loss function.

| Data augmentation
We perform data augmentation on-the-fly as stems are sampled from the dataset at train-time. This allows for a greater variety of augmented samples without increasing memory use, and may be particularly useful in ecological applications with many samples from common species and few from rare ones. When a stem is sampled from the dataset, the following random transformations are applied to the point cloud before projection: (1) rotation about the vertical axis by an angle of up to Rmax; (2) translation by up to Tmax and (3) uniform rescaling by a factor between kmin and kmax.
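The on-the-fly scheme can be sketched as below (a hypothetical NumPy implementation; the transformation bounds shown are illustrative defaults, whereas in the study the maximum extents were tuned by Bayesian optimisation):

```python
import numpy as np

def augment(points, rng, r_max=2 * np.pi, t_max=0.1, k_min=0.8, k_max=1.2):
    """Apply a random rotation about the vertical (z) axis, a random uniform
    rescaling and a random translation to an (N, 3) point cloud. A fresh set
    of parameters is drawn each time a stem is sampled during training."""
    theta = rng.uniform(0, r_max)               # rotation angle about z
    c, s = np.cos(theta), np.sin(theta)
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    pts = points @ rot.T                        # rotate about the vertical axis
    pts = pts * rng.uniform(k_min, k_max)       # uniform rescaling
    pts = pts + rng.uniform(-t_max, t_max, 3)   # random translation
    return pts
```

Because the transformed cloud is re-projected after each draw, every epoch sees a slightly different set of images for the same stem, at no extra memory cost.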

| Experimental configuration
A randomly selected train-validation-test split of 70%/15%/15% was used throughout. The same individual samples were used for training, validation and testing in all experiments, with the proportion of each species approximately equal across the three sets.
At train-time, weighted sampling with replacement was used to balance the dataset so that each species was equally represented. The Adam algorithm (Kingma & Ba, 2015) was used to optimise model parameters over 150 epochs. The learning rate was decayed using a step scheduler with initial value l0, decay factor γ and step size s, which were optimised jointly with the data augmentation hyperparameters.
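In PyTorch these two mechanisms would typically be `WeightedRandomSampler` and `StepLR`; the underlying arithmetic can be sketched without dependencies as follows (the values for l0, γ and s here are illustrative, not the tuned ones):

```python
import numpy as np

def step_lr(epoch, l0=1e-3, decay=0.5, step=50):
    """Step scheduler: multiply the learning rate by `decay`
    every `step` epochs."""
    return l0 * decay ** (epoch // step)

def class_weights(labels):
    """Per-sample weights for sampling with replacement so that each
    species is drawn equally often, regardless of how many stems it has."""
    labels = np.asarray(labels)
    counts = np.bincount(labels)                # stems per species
    return 1.0 / counts[labels]                 # rare species weighted up
```

Sampling stems with probability proportional to these weights makes the expected number of draws per species equal, which complements (but, as discussed later, does not fully replace) data augmentation for unbalanced data.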
We compare the accuracy (fraction of trees classified correctly) of the network trained with and without data augmentation applied, and use the same optimiser settings for both schemes. Typical train-times were of the order of 30 min with pre-computed projections, and 5 h with on-the-fly projections (see system specifications below).
Since we apply transformations whose parameters are sampled from a continuous space at train-time, rather than fixing these parameters in advance of training, the projections must also be computed at train-time; the majority of train-time was therefore spent performing projections rather than training the network.
Training hyperparameters (l0, γ, s) were optimised jointly with augmentation hyperparameters (Rmax, Tmax, kmin, kmax) using Bayesian optimisation (Frazier, 2018) to maximise the highest model accuracy obtained on the validation dataset at any training epoch. For each model, we then report the best (at any epoch during training) model accuracy, as measured by performance on the test set (following Goyal et al., 2021). We also report the highest minimum producer accuracy obtained by any model during training. We calculate genus accuracies by aggregation: predicting at the species level, then scoring a prediction of any species in the same genus as the true label as correct. For example, a prediction of Q. faginea for a stem with true label Q. ilex is judged to be correct under this scheme.
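Producer accuracy (per-species recall) and the genus-level aggregation scheme can be computed as in this sketch; the function names are our own, and `genus_of` maps each species index to a genus index:

```python
import numpy as np

def producer_accuracies(y_true, y_pred, n_classes):
    """Producer accuracy per class: the fraction of stems of each true
    species that were classified correctly (per-class recall)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    return np.array([(y_pred[y_true == k] == k).mean()
                     for k in range(n_classes)])

def genus_accuracy(y_true, y_pred, genus_of):
    """Genus-level accuracy by aggregation: a species prediction counts as
    correct if it belongs to the same genus as the true label."""
    g = np.asarray(genus_of)
    return (g[np.asarray(y_pred)] == g[np.asarray(y_true)]).mean()
```

The minimum producer accuracy reported in the text is simply the smallest entry of `producer_accuracies(...)`.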

| Hyperparameters
The optimised hyperparameters-parameters that govern the behaviour of the gradient descent algorithm and the maximum extent of transformations applied for data augmentation-obtained using Bayesian optimisation are shown in Table 2. Hyperparameters were rounded to two decimal places, or to the optimisation bounds if the final value was within 5%. Augmentation was found to produce the greatest increase in accuracy for rotation angles distributed around an entire turn rather than small angles.

| Classification accuracy
We compared results from the same architecture with and without data augmentation, and found augmentation produced significant improvements. Corresponding overall, minimum producer and genus accuracies for the best overall and final models, and the model with the best minimum producer accuracy, are shown in Table 3. The highest overall accuracy obtained without augmentation was 69.3%, rising to 80.6% with data augmentation, an increase of 11.3 percentage points. With augmentation, minimum producer accuracy increased by 50 percentage points, from 12.5% to 62.5%. We also calculated genus-level accuracies by aggregating the relevant entries of the confusion matrices in Figure 3 (for example, for any true label in genus Pinus, a prediction of any Pinus species is considered to be correct, and likewise for Quercus) and found very high accuracy: 93.4% for Pinus and 91.2% for Quercus for the final model, which had the highest minimum producer accuracy at the genus level.

TABLE 2 Optimised hyperparameters. Optimal hyperparameters for overall accuracy were the same as for the best producer accuracies.
Confusion matrices for individual species prediction are shown in Figure 3 with and without data augmentation applied. Without augmentation, a large number of individuals were mistaken for Q. faginea, the most common species in our dataset. For the best overall model without augmentation, Q. faginea was the most frequent misclassification for two of the four remaining species, and for three of the four remaining species for the final and best minimum producer models.

TABLE 3 Tabulated accuracy results for our classification approach. Within either the non-augmented or augmented schemes, all three models are from the same training run (so use the same hyperparameters). Model selection metric therefore refers to the metric used to select models (sets of optimised network weights) to be extracted from particular epochs during training. The best results for each metric under both the non-augmented and augmented schemes are shown in bold. We also calculated genus-level accuracies by aggregating the relevant entries of the confusion matrices in Figure 3: for example, for any true label in genus Pinus, a prediction of any Pinus species is considered to be correct, and likewise for Quercus. To calculate the number of correct Pinus predictions from the confusion matrices, all entries in the top 3 × 3 sub-matrix are added, appropriately weighted by the number of samples in each species. The model achieving the best minimum species accuracy does not achieve the best minimum genus accuracy in our case.

Our results (Table 4) were comparable to those requiring additional processing and subjective feature selection (Terryn et al., 2020), despite our use of data with multiple species per genus; classification based on learned features extracted automatically appears to yield better accuracies, as well as requiring substantially less manual input at the pre-processing stage.
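The sub-matrix aggregation described in the Table 3 caption can be sketched as follows (assuming a species-level confusion matrix of raw counts; row-normalised matrices would first need re-weighting by species sample counts, and the function name is our own):

```python
import numpy as np

def genus_level(conf, groups):
    """Collapse a species-level confusion matrix of raw counts into a
    genus-level one by summing the relevant sub-matrices. `groups` maps
    each species index to a genus index (e.g. [0, 0, 0, 1, 1] for three
    Pinus species followed by two Quercus species)."""
    groups = np.asarray(groups)
    n_g = groups.max() + 1
    out = np.zeros((n_g, n_g), dtype=conf.dtype)
    for i, gi in enumerate(groups):             # true-label axis
        for j, gj in enumerate(groups):         # predicted-label axis
            out[gi, gj] += conf[i, j]
    return out
```

The diagonal of the collapsed matrix then gives the number of genus-correct predictions, from which the genus accuracies follow directly.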

Requiring manual data processing presents substantial barriers to the uptake of TLS systems in standard forest monitoring; in addition to increasing labour requirements, tasks including the tuning of models such as QSMs require additional technical expertise. This creates a need for additional training or personnel as part of the forest monitoring process. Our approach presents fewer barriers to uptake of TLS systems in standard forest monitoring, and our results are comparable to those of others (e.g. Seidel et al., 2021), despite the fact that our data are from an environment with significant abiotic variation that may create more intraspecific structural variation. Water limitation may affect tree structure by limiting height (Fajardo et al., 2019) and altering crown dimensions (Ding et al., 2020; Lines et al., 2012), leading to greater intraspecific variability across our plot network. The high topographic variability of our study area also means wind exposure varies between plots, inducing changes to tree structure (Watt et al., 2005), irrespective of drought conditions (Niez et al., 2019), potentially causing structural convergence between all species (MacFarlane & Kane, 2017). Individuals growing without abiotic stressors or exposed to homogeneous competitive conditions can grow into an architectural form that closely aligns with their phenotype and inherited developmental programme (Horn, 1971; Pugnaire & Valladares, 2007), and studies in plantation forests (Zou et al., 2017) and with monospecific plots (Xi et al., 2020) are unlikely to exhibit the same structural convergence due to this lack of extrinsic stresses (MacFarlane & Kane, 2017; Martin-Ducup et al., 2020).

TABLE 4 Tabulated comparison of our classifier versus existing work. 'Simultaneous multi-view perspective projections' refers to classification of multiple viewpoints based on extracting features from all projections without classifying each individual projection, then producing an overall classification based on considering features (e.g. the classifier may learn features such as height or crown width) from all projected images simultaneously. 'Sequential multi-view orthographic projections' refers to classifying each individual projection as a species, then taking a majority vote.
Data size, forest type and pre-processing efforts are likely to have driven the gap in performance between our work and that of Zou et al. (2017) and Xi et al. (2020). In the former case, substantially more data were used-an order of magnitude more than in our work, and two orders of magnitude more than in other work (Seidel et al., 2021;Terryn et al., 2020). Although obtaining and using large amounts of data is usually well advised when making use of deep learning, it is infeasible to gather such large datasets in many ecosystems.
Furthermore, the data used in this case were gathered using mobile LiDAR mounted to a moving vehicle (Guan et al., 2015), which is not possible in most forests. The results presented by Xi et al. (2020) rely on leaf-wood segmented data, and their approach relies on labelled training data that required 4-10 h of manual labelling per tree without assistance from a model, and 1-2 h per tree when model-assisted (Xi et al., 2020). Supervised machine learning approaches to species classification from TLS data can only reduce the amount of required manual labour if producing the necessary volume of labelled training data takes less time than labelling the entire dataset by hand.
Data augmentation substantially reduced the impact of uneven species distribution in our data, improving accuracy of classification of less common species and maximising performance for the same number of manually labelled stems. Our findings suggest that this approach minimises the required data collection for a given level of performance and reduces bias introduced to predictions by unbalanced data. Using data augmentation, we were able to increase overall predictive accuracy (by 13%) without the need for additional data collection or processing. In the case of the model with the highest minimum producer accuracy, the minimum accuracy was increased from 12% to 62% (P. pinaster), with producer accuracies increasing for all but the most common species. The accuracy with which Q. faginea was correctly predicted decreased; it is overpredicted when data augmentation is not applied, owing to being overrepresented in the training set. This occurs despite the use of weighted sampling during training. After augmentation, confusion is more often between species of the same genus; we therefore suggest that the application of data augmentation reduces the effect of dataset bias on predictions, producing a model where predictive confusion is reflective of the functional similarity between species (which itself may reflect phylogenetic distance between species; Letten & Cornwell, 2015), so is inherent to the problem rather than introduced artificially by unbalanced data. Minimising the required data labelling to achieve a given level of accuracy is an important step towards operationalising TLS in large-scale standard forest monitoring, and lower data labelling requirements mean our approach is also applicable at small to medium scales, unlike previous approaches relying on very large amounts of labelled data (Zou et al., 2017) or leaf-wood separated data (Xi et al., 2020). It may be possible to further reduce data requirements through the use of transfer learning, either by training on existing data from other forest ecosystems and fine-tuning on the target data, or by use of more sophisticated techniques such as Model-Agnostic Meta-Learning (Finn et al., 2017) if data from several ecosystems could be obtained collaboratively.
Currently, labour-intensive labelled and segmented TLS data are not widely available. Making these data publicly available would be a significant step forward for the development of automated TLS processing methods, enabling the direct comparison of results across multiple ecosystems, in addition to transfer learning schemes. We make the data used in this work available for public use.
We found that the convergence of the classifier was somewhat unstable: large deviations in accuracy can be seen for each species across the two confusion matrices in Figure 3, despite the two models being selected from the same training run, using the same hyperparameters (Table 2). This instability may be exaggerated by the relatively small batch size used during training. We were limited to a batch size of 128 trees (each with six images) by graphics processing unit memory constraints; increasing the batch size may improve the convergence of methods based on stochastic gradient descent (as gradient estimates based on small batches may be noisy), although this may come at the cost of decreased image resolution, which is likely to reduce classification accuracy (Koziarski & Cyganek, 2018; Sabottke & Spieler, 2020). Very large batches may also decrease the ability of the classifier to generalise from training to test data (Hoffer et al., 2017).

| CONCLUSIONS
Automatic species identification is one barrier to widespread uptake of TLS for forest monitoring, but others remain. For example, the full automation of the TLS data processing pipeline from plot to individual tree point clouds remains a major aim of the field. Although methods are available for semi-automatic segmentation of individual stems using generic point cloud processing techniques such as shape-fitting and clustering (Burt et al., 2018), a task-focused approach using deep learning may improve accuracy. A point-based approach would be adaptable to segmentation in addition to classification. For example, Lv et al. (2021) achieved 86.6% classification accuracy on 1330 stems of four species from four genera by augmenting point cloud airborne laser scanning data with custom feature descriptors and applying a point-based network, PointNet++ (Qi et al., 2017), demonstrating that the use of point-based architectures is feasible with some modification. Xi et al. (2020) and Krisanski et al. (2021) make significant progress towards this goal, but are reliant on labour-intensive leaf-wood separated training data.
To the best of our knowledge, there is no work performing instance segmentation on complex forest point clouds to jointly segment and classify individual trees by species in a single stage-although it is unclear whether this would offer benefits above a two-stage approach (separate segmentation and species classification), assuming both stages are automated.
The high classification accuracies achieved in this study demonstrate the strength of our approach-using deep learning architectures and data augmentation approaches from the contemporary machine learning literature-and are an important step towards fully automated forest inventory, which would enable scalable long-term resource management and ecosystem conservation, improved inventory-based carbon storage estimates and greater understanding of ecological processes such as canopy and regeneration dynamics. Although untested here, our substantial downsampling of data from high-density TLS point clouds to 256 × 256 images suggests that lower quality TLS data may produce similar results. A method that is sensor agnostic and robust to varying data quality would be a major step forward for the scalability of TLS forest inventory, allowing faster data collection methods-such as backpack laser scanners-as well as simultaneously lowering the cost of data acquisition. We point to Krisanski et al. (2021) as an example of sensor agnostic methods for TLS data processing, although this work does not extend to species identification. The achieved high level of accuracy on structurally complex data is promising when considering whether this method will generalise to other ecosystems, although further investigation is required regarding performance in ecosystems with a very large number of species each with a smaller number of samples, such as tropical forests. Large-scale data sharing initiatives would enable work on the automated processing of TLS data to be extended easily to multiple ecosystems. The cost of data acquisition could be further reduced if species identification methods could be adapted to not require stem maps as training labels, perhaps by transfer learning using openly available task-adjacent data such as RGB imagery (e.g. the iNaturalist dataset; Van Horn et al., 2017), rather than directly using projections from TLS data during training.

AUTHOR CONTRIBUTIONS
All authors conceived the idea; Matthew J. Allen and Emily R. Lines designed the methodology; Harry J. F. Owen collected and pre-processed the data; Matthew J. Allen analysed the data and led the writing of the manuscript. All authors contributed critically to the drafts and gave final approval for publication.

CONFLICT OF INTEREST
We declare that this research was conducted without any commercial or financial relationships that could be considered as conflicts of interest.

PEER REVIEW

The peer review history for this article is available at https://publons.com/publon/10.1111/2041-210X.13981.

DATA AVAILABILITY STATEMENT

Our data (Owen et al., 2021) are available for public use, and can be found at https://zenodo.org/record/6962717 (Owen et al., 2022).