Nonparametric, tuning-free estimation of S-shaped functions

We consider the nonparametric estimation of an S-shaped regression function. The least squares estimator provides a very natural, tuning-free approach, but results in a non-convex optimisation problem, since the inflection point is unknown. We show that the estimator may nevertheless be regarded as a projection onto a finite union of convex cones, which allows us to propose a mixed primal-dual bases algorithm for its efficient, sequential computation. After developing a projection framework that demonstrates the consistency and robustness to misspecification of the estimator, our main theoretical results provide sharp oracle inequalities that yield worst-case and adaptive risk bounds for the estimation of the regression function, as well as a rate of convergence for the estimation of the inflection point. These results reveal not only that the estimator achieves the minimax optimal rate of convergence for both the estimation of the regression function and its inflection point (up to a logarithmic factor in the latter case), but also that it is able to achieve an almost-parametric rate when the true regression function is piecewise affine with not too many affine pieces. Simulations and a real data application to air pollution modelling also confirm the desirable finite-sample properties of the estimator, and our algorithm is implemented in the R package Sshaped.


| INTRODUCTION
We call a function f : [0, 1] → ℝ S-shaped if it is non-decreasing and there exists an inflection point m ∈ [0, 1] such that f is convex on [0, m] and concave on [m, 1]; see Figure 1. In many areas of applied science, there are domain-specific reasons to model the regression of a response variable on a covariate as an S-shaped function. For instance, development curves for individuals or populations often exhibit S-shaped behaviour in the context of biological growth (Archontoulis & Miguez, 2015; Cao et al., 2019; Zeidi, 1993) or skill proficiency (Gibbs, 2000). Further examples where time is the covariate can be found in audio signal processing (Smith, 2010) and sociology (Tarde, 1903). In agronomy, the van Genuchten-Gupta model (van Genuchten & Gupta, 1993) postulates an inverted S-shaped relationship between crop yield and soil salinity, and S-shaped trends are also observed for the production levels of commercial goods as labour or other resources are scaled up (Ginsberg, 1974). For the latter, economic principles such as the Regular Ultra Passum law (Frisch, 1964) have been formulated to describe scenarios where marginal gains (i.e. returns to scale) increase up to a point of maximal productivity and then taper off.
In some of the examples above, for instance when population or disease dynamics can be modelled by some governing differential equation, it may be natural to confine attention to certain parametric subclasses of S-shaped functions, such as those consisting of sigmoidal (i.e. logistic) functions of the form

f(x) = A∕{1 + e −a(x−b) },   (1)

with A, a > 0 and b ∈ ℝ; see also Jarne et al. (2007). However, in many other settings, such domain-specific knowledge is often lacking, and parametric assumptions may be excessively restrictive. To illustrate this effect, see Figure 2, where we compare two popular parametric fits of an S-shaped regression function with the estimator we propose in this paper. The first parametric method fits a logistic curve of the form (1) using nonlinear least squares. The second uses segmented linear regression with two kinks, fitted using least squares and a search over the locations of the kinks. Although these parametric fits appear to the naked eye to be satisfactory, it turns out that their estimation performance, as measured by the squared error loss on the training data, is roughly six times worse than that of our proposal (on average 0.38 and 0.43 compared with 0.067, over 100 repetitions). If the noise standard deviation is halved, then these parametric methods become 17 times and 19 times worse than our proposal respectively. Notice also that our S-shaped estimator is sufficiently flexible to be able to capture the discontinuity of the regression function, whereas the parametric methods struggle in this respect. The benefits of our nonparametric approach are also apparent in the analysis of real data: see Section 5.3, where we study the way that a quantity related to atmospheric mercury concentration varies with distance from an experimental device close to a geothermal power station.
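For concreteness, a logistic fit of this parametric kind can be computed by nonlinear least squares; the following Python sketch (with illustrative synthetic data and parameter values, not those of the paper's experiments) uses scipy:

```python
import numpy as np
from scipy.optimize import curve_fit

def logistic(x, A, a, b):
    """Sigmoid of the form (1): A / (1 + exp(-a(x - b))), with A, a > 0."""
    return A / (1.0 + np.exp(-a * (x - b)))

# Illustrative synthetic data; the true curve has inflection point b = 0.5
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 200)
y = logistic(x, 1.0, 12.0, 0.5) + 0.05 * rng.standard_normal(x.size)

# Nonlinear least squares over (A, a, b), with lower bounds enforcing A, a > 0
popt, _ = curve_fit(logistic, x, y, p0=[1.0, 5.0, 0.5],
                    bounds=([1e-6, 1e-6, -np.inf], np.inf))
A_hat, a_hat, b_hat = popt
```

A fit of this kind is only as good as the parametric assumption behind it, which is precisely the limitation the nonparametric approach of this paper avoids.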
Motivated by the limitations described in the previous paragraph, the goal of this paper is to introduce a flexible framework for nonparametric estimation of S-shaped functions. The main challenges in removing the parametric restrictions are two-fold: first, the class  of S-shaped functions on [0, 1] is infinite-dimensional; and second, since the inflection point is unknown, the family  is non-convex. Despite this non-convexity, we are able to develop methodology based on suitably defined L 2 -'projections' of general distributions onto . The significant advantage of working in this additional generality is that, having established continuity properties of the projection, results on the consistency and robustness under misspecification of the estimator follow as simple corollaries of basic facts about convergence of empirical distributions. Nevertheless, since the fully general statements are fairly involved, we defer this formal presentation to Section S3 of the supplementary material (Feng et al., 2021b), and focus in Section 2 on the special case of projections of the empirical distribution of data of the form (x 1 , Y 1 ), …, (x n , Y n ) ∈ [0, 1] × ℝ with x 1 < ⋯ < x n . This allows us to prove that an S-shaped least squares estimator always exists, and to study its uniqueness properties. Moreover, when the design is fixed and the errors are independent and identically distributed with mean zero and finite variance, we present a basic consistency result that follows from the general theory in Section S3.
In Section 3, we take up the challenge of computing the S-shaped least squares estimator. Since its inflection point occurs at one of the design points, a naive strategy would be to fit, for each choice of m ∈ {x 1 , …, x n }, the least squares estimate over the class of S-shaped functions with inflection point m, before selecting a solution that minimizes the residual sum of squares. The individual constrained estimates are straightforward to compute using, for example, active set methods (Dümbgen et al., 2007; Nocedal & Wright, 2006, Chapters 12 and 16.5), but it can be time-consuming to run the active set method n times. We show how a simple refinement of the search strategy can improve the running time by a factor of around 4, but our major contribution here begins with the observation that the global S-shaped least squares estimate can be obtained as a concatenation of a convex increasing least squares estimate to the left of an estimated inflection point, with a concave increasing least squares estimate to the right. This enables us to pursue a sequential approach, where we reveal new observations one by one, and update the least squares fits using a mixed primal-dual bases algorithm (Fraser & Massam, 1989; Meyer, 1999). Our algorithm, which is available in the R package Sshaped (Feng et al., 2021a), is shown to be around 40 times faster than the naive strategy in examples; see Figure 5.
Our main theoretical contributions are presented in Section 4, under an independent and sub-Gaussian error assumption. Here, we derive worst-case and adaptive sharp oracle inequalities for the S-shaped least squares estimator. When combined with our corresponding minimax lower bounds, this theory reveals in particular that the S-shaped least squares estimator attains the optimal worst-case risk of order n −2∕5 with respect to L 2 -loss, in the case where the design points are not too irregularly spaced. These results apply both when the S-shaped regression function hypothesis is correctly specified, and where it is misspecified, provided in the latter case that we interpret the loss as the distance to the projection of the signal onto . For adversarially chosen design configurations, we show that the risk bound can deteriorate to n −1∕3 in the worst case. Moreover, the S-shaped least squares estimator adaptively attains the parametric rate of order n −1∕2 (up to a logarithmic factor), when the projection of the signal is piecewise affine with a relatively small number of affine pieces. Finally, we study the delicate problem of estimating the true inflection point m 0 , which represents the boundary between the convex and concave parts of the signal. Under an appropriate local smoothness assumption indexed by a parameter α > 0, we show that the inflection point m̂ n of the least squares estimator converges to m 0 at rate O p ((n −1 log n) 1∕(2α+1) ), which matches our local asymptotic minimax lower bound, up to the logarithmic factor. Interestingly, the combination of the monotonicity with the convexity/concavity means that our S-shaped estimator is sufficiently regularized to avoid boundary problems at the endpoints {0, 1} of the covariate domain; other common shape-constrained methods are known to lead to boundary estimation inconsistency (Balabdaoui et al., 2011; Balázs et al., 2015; Cule et al., 2010; Han & Kato, 2021; Kulikov & Lopuhaä, 2006; Samworth, 2018).
In Section 5, we study the empirical properties of our S-shaped least squares estimator, comparing both its running time and statistical performance with those of alternative approaches on simulated data. We also present a real data application of these techniques in air pollution modelling, which highlights the convenience and efficacy of our proposal. We conclude by discussing some possible directions for future research in Section 6. The appendix (Section A) provides further details of the mixed primal-dual bases algorithm that we use to compute our estimator. The proofs of our main results are deferred to the supplementary material (Feng et al., 2021b), in which the results and sections appear with an 'S' before the relevant label number.
Previous work on nonparametric estimation of S-shaped functions includes Yagi et al. (2019, 2020), who, in the context of production theory in economics, apply a method known as shape-constrained kernel least squares to estimate multivariate production functions that are S-shaped along one-dimensional rays. Kachouie and Schwartzman (2013) use local polynomial regression techniques to identify an inflection point of a smooth signal from corrupted observations. In both of these works, kernel bandwidths must be chosen carefully to control the bias-variance tradeoff and (for the approach of Kachouie and Schwartzman (2013) in particular) to ensure that the fitted curve does not have multiple inflection points. Liao and Meyer (2017) instead estimate univariate convex-concave functions using cubic splines defined with respect to a number of user-specified knots, and establish rates of convergence for the inflection points of the resulting estimators. Their method is implemented in the R package ShapeChange (Liao & Meyer, 2016), which Lee et al. (2020) subsequently used in combination with the scam (Shape Constrained Additive Models) package of Pya and Wood (2015) to estimate S-shaped disease trajectories of patients with Huntington's disease. We also mention the extremum distance estimator and extremum surface estimator proposed by Christopoulos (2016), with the aim of locating the inflection point of a smooth function based on its geometric properties. We provide a numerical comparison of our procedure with those of Liao and Meyer (2017), Yagi et al. (2019, 2020) and Christopoulos (2016) in Section 5.2.

| Notation
For n ∈ ℕ, we write [n] := {1, …, n}, and given 0 ≤ x 1 < ⋯ < x n ≤ 1, define  ≡ [x 1 , …, x n ] to be the set of continuous, piecewise affine f : [0, 1] → ℝ whose kinks all lie in {x 1 , …, x n }. If f̂ n minimizes the residual sum of squares S n (f ) := ∑ i=1 n {Y i − f (x i )} 2 over some class  of functions on [0, 1], we say that f̂ n is a least squares estimator (LSE) over  based on {(x i , Y i ) : 1 ≤ i ≤ n}. We write a n ≲ b n to mean that there exists a universal constant C > 0 such that a n ≤ Cb n for all n.

| S-SHAPED LEAST SQUARES ESTIMATORS
The purpose of this section is to study the existence, uniqueness and consistency of S-shaped least squares estimators. We will see later that in a suitable sense, these estimators can be regarded as L 2 -projections onto  of the empirical distribution of the data. As such, the results in this section turn out to be special cases of a much more general theory, presented in Section S3, concerning the existence and continuity of L 2 -projections of arbitrary distributions on [0, 1] × ℝ having finite variance. The generality of this projection framework remains of importance to statisticians, particularly in terms of providing results on the robustness of S-shaped least squares estimators to model misspecification; however, the results are of a more technical nature, so to facilitate understanding of the main ideas, we focus on the well-specified case here.
Suppose we have observations (x 1 , Y 1 ), …, (x n , Y n ) ∈ [0, 1] × ℝ with x 1 < ⋯ < x n ; Proposition 1 establishes that an LSE over  then always exists. A straightforward and direct proof of this result is given in Section S1. As part of the projection framework in Section S3, we obtain generalizations of Proposition 1 in Corollaries S10(d) and S14(a). Since our objective criterion only measures the error incurred at the design points, it is no surprise that any LSE f̂ n over  is not uniquely determined away from the design points.
In order to present a basic consistency result, we introduce a model where we regard our observations {(x n1 , Y n1 ), …, (x nn , Y nn )} as being realized from a triangular array sampling scheme

Y ni = f 0 (x ni ) + ε ni , i ∈ [n],   (2)

where f 0 : [0, 1] → ℝ is a Borel measurable regression function, where ε n1 , …, ε nn are independent noise variables with mean zero and finite variance for each n, and where 0 ≤ x n1 < ⋯ < x nn ≤ 1 are fixed design points. We write ℙ n := n −1 ∑ i=1 n δ (x ni , Y ni ) and ℙ n X := n −1 ∑ i=1 n δ x ni for the joint and X-marginal empirical distributions respectively.
For a finite Borel measure ν on [0, 1], we denote by supp ν the support of ν, which is defined as the smallest closed set A such that ν(A c ) = 0, or equivalently the set of all x ∈ [0, 1] with the property that ν(U) > 0 for any open neighbourhood U of x in [0, 1].

Proposition 2  In model (2), assume that f 0 ∈  has a unique inflection point m 0 ∈ [0, 1] and that ε n1 , …, ε nn are independent and identically distributed for each n. For each n ∈ ℕ, let f̂ m 0 n and f̂ n denote LSEs over  m 0 and  respectively. Suppose further that (ℙ n X ) converges weakly to a distribution P 0 X on [0, 1] satisfying supp P 0 X = [0, 1] and P 0 X ({m}) = 0 for all m ∈ [0, 1]. Then, for ĝ n ∈ {f̂ m 0 n , f̂ n } and with m̂ n denoting any inflection point of ĝ n , we have
Proposition 2 follows from Proposition S16 in Section S3, which handles the more general case where f 0 need not belong to , and where it may have multiple inflection points. A proof of the latter result is given in Section S6.

| COMPUTATION OF S-SHAPED LEAST SQUARES ESTIMATORS
Returning to the setting of data (x 1 , Y 1 ), …, (x n , Y n ) ∈ [0, 1] × ℝ with x 1 < ⋯ < x n , we now consider the problem of computing an S-shaped LSE over . In light of the non-uniqueness discussion in Section 2, we will take as our target the LSE f̂ n := f̂ m̂ n n , where m̂ n := x ĵ n and ĵ n := sargmin 1≤j≤n S n (f̂ x j n ); here and below, sargmin denotes the smallest element of the argmin.
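In code, this tie-breaking convention corresponds to taking the first index attaining the minimum, which is what numpy's argmin already returns:

```python
import numpy as np

# Hypothetical residual sums of squares over candidate inflection indices;
# the minimum is tied at indices 1 and 3, and sargmin must pick index 1.
rss = np.array([3.0, 1.0, 2.0, 1.0])
j_hat = int(np.argmin(rss))  # np.argmin returns the first minimising index
```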
One of the main challenges here is that, in general, the function j ↦ S n (f̂ x j n ) may have multiple local minima; see Figure 4. A 'brute-force' method that we call ScanAll, then, is to compute each of the LSEs f̂ x 1 n , …, f̂ x n n directly by solving n separate constrained least squares problems. In each instance, we can run the support reduction algorithm (Groeneboom et al., 2008) or a generic active set algorithm (Dümbgen et al., 2007; Nocedal & Wright, 2006, Chapters 12 and 16.5) on the whole dataset {(x i , Y i ) : 1 ≤ i ≤ n}, but it is computationally expensive to repeat this n times, even when n is only moderately large; see Section 5.1.
To improve the overall efficiency of this procedure, it would therefore be desirable both to refine the initial search strategy and to exploit any common structure underlying the individual minimization problems. For instance, we might hope to be able to obtain f̂ x j n via a faster update step that takes as input the previous LSE f̂ x j−1 n , but it is not immediately clear how this can be done. We now describe and justify an alternative approach that achieves both of the above objectives. For j ∈ [n − 1], let ĥ j n denote the concatenation of the increasing convex LSE f 1,j based on {(x 1 , Y 1 ), …, (x j , Y j )} and the increasing concave LSE f n,j+1 based on {(x j+1 , Y j+1 ), …, (x n , Y n )}. In other words, ĥ j n is obtained by partitioning the data into the two disjoint subsets {(x 1 , Y 1 ), …, (x j , Y j )} and {(x j+1 , Y j+1 ), …, (x n , Y n )}, and then fitting separate increasing convex and increasing concave LSEs on the left and right pieces respectively. In general, ĥ j n is not guaranteed to be S-shaped or even increasing on [0, 1], in which case ĥ j n does not coincide with the LSE f̂ x j n . Nevertheless, observe that ĥ j n is the LSE over a larger subclass of [x 1 , …, x n ] that contains  x j . Together with Equation (3), this immediately implies Proposition 3 below, a key fact that we will exploit in our algorithm.
In addition, we have the following crucial result for all global S-shaped LSEs over .

Proposition 4 Given any S-shaped LSE
We explain in the final example of Section S1 that Proposition 4 is a consequence of Proposition S4(c, d, e), whose proof also reveals why ĥ j n =f x j n is not guaranteed to hold for a pre-specified j ∈ [n − 1]. A further remark is that the localization property for f n in Proposition 4 is only valid for particular choices of partition of our data into subintervals, namely where the split occurs at the smallest or largest inflection points of f n . In other words, if for example x j is chosen to be a kink of f n that is strictly to the left of the smallest inflection point, then f n is not guaranteed to agree with the increasing convex LSE f 1,j on [x 1 , x j ]. This presents a substantial additional difficulty for both computation and theory in comparison with the problem of unimodal regression (Shoung & Zhang, 2001;Stout, 2008), where, for every jump x j of the unimodal LSE g n to the left of its mode, it is the case that g n agrees on [x 1 , x j ] with the increasing LSE based on {(x i , Y i ) : 1 ≤ i ≤ j} . These issues are discussed in greater depth in Section S1.
Propositions 3 and 4 motivate the following generic procedure as an improvement on ScanAll:

Algorithm 1
Step I. Discard all indices j ∈ [n − 1] for which Y j > Y j+1 .
Step II. For each remaining index j, fit the increasing convex LSE f 1,j to {(x i , Y i ) : 1 ≤ i ≤ j} and the increasing concave LSE f n,j+1 to {(x i , Y i ) : j + 1 ≤ i ≤ n}, and form the concatenation ĥ j n .
Step III. Let  denote the set of remaining indices j for which ĥ j n is S-shaped with inflection point x j , compute j̃ := sargmin j∈ S n (ĥ j n ), and output (x j̃ , ĥ j̃ n ).

To see that the output (x j̃ , ĥ j̃ n ) of Algorithm 1 is indeed (m̂ n , f̂ n ), note first that by Proposition 3, the set  in Step III consists precisely of those j ∈ [n − 1] for which ĥ j n = f̂ x j n . In addition, by Proposition 4, ĵ n = sargmin 1≤j≤n S n (f̂ x j n ) ∈  since m̂ n = x ĵ n is the smallest inflection point of f̂ n = f̂ m̂ n n . Thus, j̃ = sargmin j∈ S n (f̂ x j n ) = ĵ n , and hence x j̃ = m̂ n and ĥ j̃ n = f̂ n , as desired. The most obvious implementation of Step II of Algorithm 1 simply computes f 1,j and f n,j+1 from scratch for each different j; we refer to this as the ScanSelected algorithm. Even this naive modification has two significant advantages over ScanAll: (i) In advance of carrying out any least squares minimization, we can restrict the set of candidates for ĵ n on the basis of just n − 1 pairwise comparisons. If (x 1 , Y 1 ), …, (x n , Y n ) are drawn according to a regression model (2) featuring a continuous f 0 and independent and identically distributed errors with zero mean, then Step I typically screens out about half of the indices in [n] when n is reasonably large.

(ii) For the remaining indices j in Step II, we do not attempt to compute the S-shaped function f̂ x j n based on all n data points, but instead fit the increasing convex LSE f 1,j and the increasing concave LSE f n,j+1 using j and n − j observations respectively.
The main drawback of the ScanSelected algorithm, however, is that it fails to exploit the commonalities in the computation of f 1,j for different j (and similarly of f n,j+1 for different j). Our main computational contribution, then, is to show that for k ∈ [j − 1], it is possible to obtain f 1,j by modifying f 1,k appropriately when the observations {(x i , Y i ) : k < i ≤ j} are introduced. We can therefore proceed in a sequential manner and hence make significant computational gains.
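To make the scan-based decomposition concrete, here is a hypothetical Python sketch (not the authors' Sshaped R implementation): each increasing convex or increasing concave piece is fitted by nonnegative least squares over a hinge basis, a simplified monotonicity check stands in for the exact condition of Proposition 3, and boundary inflection points are omitted for brevity.

```python
import numpy as np
from scipy.optimize import nnls

def _fit_piece(x, y, shape):
    """Least squares fit of an increasing convex ('convex') or increasing
    concave ('concave') piecewise-affine function via nonnegative least
    squares over a hinge basis:
      convex:  f(t) = c + b0*(t - x[0]) + sum_j bj * max(t - x[j], 0)
      concave: f(t) = c + b0*t          + sum_j bj * min(t, x[j])
    with all b-coefficients >= 0 and a free intercept c = c1 - c2."""
    n = len(x)
    cols = [x - x[0] if shape == "convex" else x.astype(float)]
    for j in range(1, n - 1):
        cols.append(np.maximum(x - x[j], 0.0) if shape == "convex"
                    else np.minimum(x, x[j]))
    cols += [np.ones(n), -np.ones(n)]  # free intercept split as c1 - c2
    A = np.column_stack(cols)
    coef, _ = nnls(A, y)
    return A @ coef

def s_shaped_lse(x, y):
    """Scan over candidate inflection indices j: fit an increasing convex
    LSE to the first j points and an increasing concave LSE to the rest,
    and keep the best concatenation whose fitted values are nondecreasing
    (a simplified stand-in for the exact S-shape condition)."""
    n = len(x)
    best_rss, best_fit = np.inf, None
    for j in range(1, n):
        fit = np.concatenate([_fit_piece(x[:j], y[:j], "convex"),
                              _fit_piece(x[j:], y[j:], "concave")])
        if np.any(np.diff(fit) < -1e-8):
            continue  # concatenation is not increasing, so not S-shaped
        rss = float(np.sum((y - fit) ** 2))
        if rss < best_rss:
            best_rss, best_fit = rss, fit
    return best_fit, best_rss
```

For example, on a fine grid of noiseless logistic values, the fitted curve interpolates the data, since splitting at the inflection point makes both pieces exactly representable.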
Recall that for j ∈ [n] and a closed, convex cone Λ ⊆ ℝ j , there exists a unique L 2 -projection Π Λ : ℝ j → Λ, given by Π Λ (v) := argmin u∈Λ ‖v − u‖. The key to our approach is to develop a mixed primal-dual bases algorithm (Fraser & Massam, 1989; Meyer, 1999) that allows us to compute Π Λ (L) when L ⊆ ℝ j is a line segment and Λ is a polyhedral convex cone. An important observation is that, writing v(t) := (1 − t)v(0) + tv(1) for t ∈ [0, 1], the map t ↦ Π Λ (v(t)) is continuous and piecewise linear on [0, 1], where the individual linear pieces correspond to projections onto different faces of Λ; see Remark 1 in Appendix A. This enables us to compute Π Λ (v(1)) when Π Λ (v(0)) is known. Indeed, we give a detailed description of a general procedure for this task in Algorithm 2 in Appendix A, and we focus here on its application to increasing convex regression (increasing concave regression for the right-hand end can be handled very similarly). In this case, the cones of particular interest to us are those of increasing convex sequences based on x 1 , …, x j for some j ∈ [n], which we denote by Λ j := {(u 1 , …, u j ) ∈ ℝ j : 0 ≤ (u 2 − u 1 )∕(x 2 − x 1 ) ≤ ⋯ ≤ (u j − u j−1 )∕(x j − x j−1 )}. Given k ∈ [j − 1] and supposing that we have already fitted the increasing convex LSE f 1,k (which is linear on [x k−1 , 1]), an appropriate choice of v(0), v(1) is v(0) := (Y 1 , …, Y k , f 1,k (x k+1 ), …, f 1,k (x j )) and v(1) := (Y 1 , …, Y j ). Indeed, Π Λ j (v(1)) = (f 1,j (x 1 ), …, f 1,j (x j )) is what we seek to compute, and moreover we claim that Π Λ j (v(0)) = (f 1,k (x 1 ), …, f 1,k (x j )) (which is known). To establish this claim, observe that for any u ≡ (u 1 , …, u j ) ∈ Λ j , we have (u 1 , …, u k ) ∈ Λ k , so that

‖v(0) − u‖ 2 ≥ ∑ i=1 k (Y i − u i ) 2 ≥ ∑ i=1 k {Y i − f 1,k (x i )} 2 = ‖v(0) − (f 1,k (x 1 ), …, f 1,k (x j ))‖ 2 ,   (8)

and (f 1,k (x 1 ), …, f 1,k (x j )) ∈ Λ j . In fact, we will apply this version of the mixed primal-dual bases algorithm with k = j − 1, so that the observations Y 1 , …, Y n are introduced sequentially. Note that when Y j ≥ f 1,j−1 (x j ), we have f 1,j (x i ) = f 1,j−1 (x i ) for i ∈ [j − 1] and f 1,j (x j ) = Y j by the same argument as in Equation (8), so no calculations are required. We refer to this sequential implementation of Algorithm 1 as SeqConReg.
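As a toy illustration of cone projection (for the monotone cone rather than Λ j , which additionally encodes convexity), the pool-adjacent-violators algorithm computes the projection directly and exhibits the orthogonality property ⟨v − Π Λ (v), Π Λ (v)⟩ = 0 that characterizes projections onto closed convex cones; a hypothetical Python sketch:

```python
import numpy as np

def project_monotone(y):
    """L2 projection of y onto the cone {u : u_1 <= ... <= u_n}
    via the pool-adjacent-violators algorithm (PAVA)."""
    levels, counts = [], []
    for v in y:
        levels.append(float(v))
        counts.append(1)
        # Merge adjacent blocks while the monotonicity constraint is violated
        while len(levels) > 1 and levels[-2] > levels[-1]:
            c2, v2 = counts.pop(), levels.pop()
            c1, v1 = counts.pop(), levels.pop()
            levels.append((c1 * v1 + c2 * v2) / (c1 + c2))
            counts.append(c1 + c2)
    return np.repeat(levels, counts)

v = np.array([3.0, 1.0, 2.0])
p = project_monotone(v)  # -> [2., 2., 2.]
r = v - p                # residual is orthogonal to the projection
```

In this language, the mixed primal-dual bases algorithm tracks how the active constraints (the merged blocks above) change as the point being projected moves along a line segment, which is what makes the sequential updates cheap.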

| Worst-case and adaptive sharp oracle inequalities
Our first main results of this section consist of worst-case and adaptive sharp oracle inequalities for S-shaped least squares estimators. These reveal not only risk bounds when our S-shaped regression function hypothesis is correctly specified, but also control the way in which the performance of the estimators deteriorates as the model becomes increasingly misspecified. We will work in the setting of model (2), and now make the following assumption on the errors:

Assumption 1  {ε i ≡ ε ni : 1 ≤ i ≤ n} is a collection of independent sub-Gaussian random variables with parameter 1, so that 𝔼(e tε ni ) ≤ e t 2 ∕2 for all t ∈ ℝ and i ∈ [n].
For fixed n ∈ ℕ and f : [0, 1] → ℝ, we write x i ≡ x ni for i ∈ [n] and let ‖f ‖ n := {n −1 ∑ i=1 n f (x i ) 2 } 1∕2 . For piecewise affine f, we denote by k(f ) the number of affine pieces of f, so that k(f ) is the smallest k ∈ [n] with the property that f is affine on each of k subintervals I 1 , …, I k that partition [0, 1].
Theorem 1 For fixed n ≥ 2, suppose that Assumption 1 holds and let f n be any LSE over . Let R : = n −1 (x n − x 1 )∕min 2≤i≤n (x i − x i−1 ). Then there exists a universal constant C > 0 such that for every f 0 : [0, 1] → ℝ and t > 0, we have with probability at least 1 − e −t .
By integrating this tail bound, we obtain the worst-case risk bound
In the special case where f 0 ∈ , we may take f = f 0 in Theorem 1 to conclude that, when R and V (f 0 ) are of constant order, we obtain a worst-case risk bound of order n −2∕5 . More generally, Equations (9) and (10) reveal the impact of both non-equispaced design and the range of the signal. In fact, an alternative, more complicated definition of R is possible, and this further refines our bounds for certain designs; see the discussion following the proof of Theorem 1 in Section S2.1. To see that the rate of order n −2∕5 cannot in general be attained for arbitrary configurations of design points, we appeal to Bellec (2018, Theorem 4.5) for a suitable minimax lower bound: for any V ≥ n −1∕2 , there exist design points x 1 < ⋯ < x n that depend on V such that if ε 1 , …, ε n iid ∼ N(0, 1) in Equation (2), then a corresponding minimax lower bound holds, in which the infimum is taken over all estimators g̃ n ≡ g̃ n (x 1 , Y 1 , …, x n , Y n ), and c, C > 0 are universal constants.
Another very attractive aspect of Theorem 1 is that, in cases where f 0 ∉ , we can control the performance of an LSE f̂ n over  via approximation error and estimation error terms. The fact that the approximation error term ‖f − f 0 ‖ n has leading constant 1 (which is the best possible) is the reason that Equations (9) and (11) are referred to as sharp oracle inequalities.
To complement the worst-case sharp oracle inequality in Equation (10), we now consider the more favourable situation where f 0 is well approximated by a piecewise affine function with not too many affine pieces. The fact that an LSE f n over  can approximate such a signal with a relatively small number of kinks suggests that we may be able to obtain improved sharp oracle inequalities in such cases.
Theorem 2 For fixed n ≥ 2, suppose that Assumption 1 holds, and let f n be any LSE over . Then for every f 0 : [0, 1] → ℝ and t > 0, we have with probability at least 1 − e −t .
As with Theorem 1, we can integrate the tail bound from Equation (11) to obtain a corresponding risk bound. In particular, we see from Equation (12) that if f 0 ∈  has k affine pieces, then any LSE f̂ n over  attains the parametric rate k 1∕2 ∕n 1∕2 , up to a logarithmic factor. Adaptation to signals of low complexity is one of the particularly intriguing aspects of shape-constrained estimators (Guntuboyina & Sen, 2018; Samworth, 2018). For instance, Guntuboyina and Sen (2013), Chatterjee et al. (2015) and Chatterjee and Lafferty (2019) investigated the adaptive behaviour of univariate convex, isotonic and unimodal LSEs respectively when the truth is well approximated by a function with a small number of affine or constant pieces. For multivariate extensions of these results, see for example Han and Wellner (2016), Kur et al. (2020) and Han (2021) among others. Sharp oracle inequalities of a similar flavour to Theorem 2 have been obtained for a variety of LSEs (Bellec, 2018), including multivariate isotonic LSEs (Han et al., 2019; Pananjady & Samworth, 2021). In log-concave density estimation, adaptation results of this type were established for the log-concave maximum likelihood estimator by Kim et al. (2018) and Feng et al. (2021) in univariate and multivariate settings respectively. Finally, Baraud and Birgé (2016) introduced a ρ-estimation framework for univariate shape-constrained estimation and studied its adaptation properties.

| Inflection point estimation
A particular feature of S-shaped function estimation that differentiates it from other shape-constrained estimation problems is the existence of an inflection point m 0 . In some respects, this is like a boundary point, because it represents the point of transition from convex to concave parts of the function, and the behaviour of the function is therefore less regulated there (in particular, the derivative of an S-shaped function may diverge to infinity as we approach the inflection point). When m 0 ∈ (0, 1), we may well have design points on either side of m 0 , and in that sense the inflection point may be regarded as an interior point. The distinguished nature of the inflection point means that its location is often of interest in applications such as the modelling of economic growth (e.g. Jarne et al., 2007) and disease progression in longitudinal studies (e.g. Lee et al., 2020). For instance, in the latter work, S-shaped functions were used to model the deterioration in motor function associated with Huntington's disease, and the estimated inflection points from a nonparametric procedure were seen to be clinically useful indicators of the onset of severe motor dysfunction, in the sense of having the potential to facilitate timely diagnosis and intervention.
In studying the inflection point estimation problem, we will assume that f 0 ∈  and that the following additional conditions hold:

Assumption 2  Suppose that f 0 ∈  has a unique inflection point m 0 ∈ (0, 1), and that there exist B > 0 and α ∈ (0, 1) ∪ (1, ∞) such that the local smoothness condition (13) holds as x → m 0 . In the regression model (2), suppose also that x ni = i∕n and that each ε ni has the same distribution as ξ for all n ∈ ℕ and i ∈ [n], where ξ is a sub-Gaussian random variable with parameter 1.
We mention that Liao and Meyer (2017) study a least squares estimator over a subclass of  consisting of cubic splines (where the number of knots is of order n 1∕9 ); they show that its inflection point converges to the true m 0 at rate O p (n −8∕63 ) in a random design setting where f 0 satisfies (a stronger version of) Equation (13) with α = 3. The proof of their Theorem 2 relies on a quantitative result on the quality of local approximations to f 0 near m 0 by convex or concave functions (Liao & Meyer, 2017, Lemma 2), as well as a global rate of convergence for their spline-based estimator.
In our setting, Theorem 3 shows that the inflection point estimator m̂ n (based on an LSE f̂ n over the entire class ) converges to m 0 at rate O p ((n∕log n) −1∕7 ) when α = 3. The proof of Theorem 3, which is given in Section S2, is lengthy and broken up into several steps, each of which requires some delicate technical arguments; see Figure S1 for an illustration. The crucial Step 2a exploits the observation that if m̂ n is a long way from m 0 , then there is a long interval between the two on which one of f 0 , f̂ n is convex and the other is concave. On such an interval, we show that f̂ n has a long affine piece, as would be intuitively expected, and thereby quantify the approximation error due to misspecification; see Lemma S6. Another important aspect of our proof strategy is that we find a suitable way to localize the analysis of f̂ n to a neighbourhood of m 0 , rather than rely on global considerations that would lead to a suboptimal bound. As we explain in Section S1, our localization technique for convex or S-shaped LSEs relies on non-trivial 'boundary adjustments' that are not needed for isotonic or unimodal LSEs. Nevertheless, a simpler version of the proof of Theorem 3 allows us to recover the result of Shoung and Zhang (2001) on the rate of convergence of the mode of the LSE of a unimodal regression function, at least under our sub-Gaussian assumption on the errors ε ni and their local smoothness condition (1.3).
The rate of convergence of m n to m 0 in Theorem 3 matches that in the following complementary local asymptotic minimax lower bound, up to a logarithmic factor. For r > 0, let

| SIMULATIONS AND REAL DATA EXAMPLE
In this section, we first investigate the computation time and empirical performance of our S-shaped estimator in some numerical experiments. We then demonstrate the use of our estimator in a real data application to air pollution modelling.

| Computation time
We compare the running time of our sequential cone projection algorithm (Algorithm 2), denoted SeqConReg, with two other possible approaches. The first, which we call ScanAll, relies on a brute-force search that scans through all possible inflection points m ∈ {x 1 , …, x n } as described in the introduction, performing least squares over  m , and determining the candidate that minimizes the residual sum of squares. Here the active set least squares procedure used for each m is based on a simple modification of the R package scar (Chen & Samworth, 2014, 2016). The second approach, which we call ScanSelected, is based on the observation in Step I of Algorithm 1 that there is no need to scan through all design points. Instead, we restrict attention to those indices j for which Y j ≤ Y j+1 , fitting an increasing convex function to {(x i , Y i ) : 1 ≤ i ≤ j} and an increasing concave function to {(x i , Y i ) : j + 1 ≤ i ≤ n} (both using scar), before finding the smallest j that minimizes the residual sum of squares. For n ∈ {100, 200, 500, 1000, 2000}, we set x i = i∕(n + 1) and Y i = sin(π(x i − 0.5)) + σε i for i = 1, …, n, where ε 1 , …, ε n are independent standard normal random variables. To examine the impact of the signal-to-noise ratio on the running time, we vary σ ∈ {1, 0.1, 0.01}, and plot the average running time of the different approaches in Figure 5. We see that SeqConReg is the fastest of the three approaches, being approximately 10 times faster than ScanSelected and 40 times faster than ScanAll. The ratio of the timings becomes larger as the signal-to-noise ratio increases, because the resulting fitted function has more knots, which makes algorithms of a sequential nature, such as SeqConReg, all the more appealing.
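The data-generating process for this timing experiment is straightforward to reproduce; a sketch (the seed and the choice σ = 0.1 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma = 500, 0.1
x = np.arange(1, n + 1) / (n + 1)   # equispaced design on (0, 1)
signal = np.sin(np.pi * (x - 0.5))  # S-shaped signal with inflection at 0.5
y = signal + sigma * rng.standard_normal(n)
```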

| Statistical performance
We compare our estimator (denoted by LSE below) with the following alternatives:
• Spline: the method of Liao and Meyer (2017), based on cubic B-splines with shape constraints, which is implemented in the R package ShapeChange (Liao & Meyer, 2016);
• SCKLS: the shape-constrained kernel least squares method of Yagi et al. (2019, 2020), based on local linear kernels, with M = 50 evaluation points and kernel bandwidths selected according to the method of Ruppert et al. (1995);
• BEDE and BESE: the bisection extremum distance estimator and bisection extremum surface estimator of Christopoulos (2016), both derived from the geometric properties of the inflection point of a smooth function and implemented in the R package inflection (Christopoulos, 2019).
For LSE, Spline and SCKLS, we assess performance via both the average L₂(ℙₙ) loss and the mean absolute error of the estimated inflection point location, while for BEDE and BESE we compute only the latter. All results are based on numerical experiments with 1000 repetitions.
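The two performance measures can be rendered concretely as follows (a minimal sketch; the function names are our own, and the estimator itself is abstracted as a vector of fitted values and an estimated inflection point):

```python
import numpy as np

def l2_pn_loss(fhat_vals, f0_vals):
    """Empirical L2(P_n) loss between fitted and true values at the design points."""
    fhat_vals, f0_vals = np.asarray(fhat_vals), np.asarray(f0_vals)
    return np.sqrt(np.mean((fhat_vals - f0_vals) ** 2))

def inflection_mae(m_hats, m0):
    """Mean absolute error of the estimated inflection point over repetitions."""
    return np.mean(np.abs(np.asarray(m_hats) - m0))
```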
For n ∈ {100, 200, 500, 1000} and design points x_1, …, x_n, we set Y_i = f_j(x_i) + 0.1ε_i for i = 1, …, n, where ε_1, …, ε_n are independent N(0, 1) random variables, for four different choices of signal function f_j; these signals are plotted in Figure 6. The signals are designed so that their ranges over [0, 1] are roughly the same. Furthermore, they all belong to the class of S-shaped functions and have a unique inflection point at m_0 = 0.3. Note that f_1 satisfies Assumption 2 with α = 1/2, f_2 and f_3 do not satisfy Assumption 2 for any α > 0, while f_4 satisfies the assumption with α = 3.
We consider two different designs by setting x_i = F⁻¹(i/(n + 1)) for i = 1, …, n, where F is the distribution function of either the U[0, 1] or the Beta(4, 8) distribution. In the second setting, the design points are not equally spaced, and m_0 = 0.3 is the mode of the Beta(4, 8) distribution. The results are shown in Figures 7 and 8.
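The quantile-spaced designs can be generated as follows, assuming scipy's frozen distribution objects:

```python
import numpy as np
from scipy import stats

def quantile_design(n, dist):
    """Design points x_i = F^{-1}(i/(n+1)) for a frozen scipy distribution."""
    return dist.ppf(np.arange(1, n + 1) / (n + 1))

x_unif = quantile_design(100, stats.uniform())   # equally spaced in (0, 1)
x_beta = quantile_design(100, stats.beta(4, 8))  # denser near the Beta(4,8) mode 0.3
```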
For the estimation of the regression function, the LSE performs well in all cases; in particular, it is able to adapt to inhomogeneous smoothness levels and asymmetric designs. The spline- and kernel-based approaches struggle in this regard, and perform much worse for signals f_1 and f_3 especially. In fact, the spline-based method appears to be inconsistent for signals f_1 and f_3, and the kernel-based approach seems to suffer the same problem for signal f_3. For the estimation of the inflection point, the story has some similarities, but also some differences: for signals f_1, f_2 and f_3, the least squares approach provides more reliable estimates, for two main reasons. First, it is able to adapt to a much wider range of local smoothnesses around m_0. Second, comparing Figure 8 with Figure 7, we see that the least squares approach is also able to take advantage of the additional design points near m_0 under the beta design to obtain improved estimation performance relative to the uniform design. For signal f_4, the other methods are able to exploit the homogeneity of the signal across the entire domain (and the symmetry of the signal around the inflection point), and tend to have a smaller mean absolute error than the least squares approach. We recall Figure 2, which further illustrates the dangers of assuming smoothness of an S-shaped signal when it is not present.

| Real data example
In this subsection, we apply our nonparametric S-shaped procedure to n = 221 LIDAR (light detection and ranging) measurements for determining atmospheric concentrations of mercury emissions from the Bella Vista geothermal power station in Italy. This dataset, which is of interest from an air pollution modelling perspective, is discussed at length by Ruppert et al. (2003) and included in the R package SemiPar (Wand, 2018).
To explain the rationale behind the use of the S-shaped regression model (2) in this context, we begin by briefly outlining the physical background and experimental setup; see Edner et al. (1989, 1992) and Holst et al. (1996, Section 2) for further details.² In this instance, the LIDAR equipment was set up at a fixed location downwind of the power station, at a distance of 390–720 m from the bulk of the mercury plume. The DIAL (differential absorption LIDAR) technique involves firing two laser beams in quick succession in the same direction towards the plume, where the first beam contains light at the resonant wavelength λ_on = 253.6 nm of mercury, while the second 'reference' beam is set to a slightly different 'off-resonant' wavelength λ_off. The light in both beams is scattered (or reflected back) to roughly the same extent by particles and aerosols in the atmosphere, but the light at wavelength λ_on is absorbed much more strongly by atoms of mercury, the pollutant of interest. The LIDAR apparatus records the intensity (i.e. power) of the reflected signals from both incident beams as a function of the time elapsed, which is proportional to the distance travelled by the light before it is reflected back towards the source; the latter is the independent variable range in the dataset. The intensity curves from 100 pairs of laser shots in the same direction were then averaged to produce power estimates P(r_i; λ_on) and P(r_i; λ_off) for n = 221 equispaced values r_i of range between 390 and 720 m (at intervals of 1.5 m). In view of the physical reasons outlined above, the relative sizes of these two quantities for different r_i can be used to estimate how the atmospheric concentration g_0(r) of mercury (in ng/m³) varies with the distance r (in metres) along the path of the laser beams.
More precisely, based on an approximation of the governing equation for LIDAR scattering, Holst et al. (1996, Section 3) consider a regression model for the log-ratio values,

log{P(r_i; λ_on)/P(r_i; λ_off)} = f_0(r_i) + ε_i,  i = 1, …, n,

where, on physical grounds, f_0(r) = −C ∫_0^r g_0(s) ds is defined for r ≥ 0 as the integral of the concentration function g_0 over [0, r] multiplied by −C ≡ −C(λ_on, λ_off) = −1.6 × 10⁻⁵ ng⁻¹ m². Since mercury concentration is always non-negative and would generally be expected to decrease with distance from the interior of the plume, g_0 can reasonably be modelled as a non-negative unimodal function, in which case its antiderivative satisfies our definition of an S-shaped function. The data, shown in Figure 9, do indeed appear to support modelling f_0 as an inverted S-shaped regression function. Moreover, Holst et al. (1996, Figure 4) present plots of suitably normalized residuals against range, as well as the sample autocorrelations at different lags, which provide some empirical justification for the assumption that the errors ε_1, …, ε_n are independent.

The different panels of Figure 9 illustrate least squares fits over different classes of regression functions. In the top-left panel, we plot the fit of a logistic function; here we see the limitations of the parametric model in terms of its inability to capture the behaviour of the regression function in the range 390–550 m. The segmented linear regression fits shown in the two bottom panels require the choice of a set of knots, and the left and right panels use 4 and 16 knots respectively. We see that the selection of the set of knots can have quite a significant influence; moreover, the fits are not guaranteed to be S-shaped, or even monotone. Interestingly, despite the overfitting that is apparent in the bottom-right plot of the figure, the residual sum of squares remains higher than that of the S-shaped LSE³ illustrated in the top-right panel. Moreover, the S-shaped LSE selects the number and location of its knots adaptively, with no input required from the practitioner. Another attractive feature of the S-shaped LSE is that its theoretical guarantees presented in Theorems 1 and 2 allow for heteroscedasticity, which is clearly present in this dataset. Finally, we note that the inflection point of this LSE at range = 586 m yields an estimate of the distance from the LIDAR equipment to the central part of the plume, where the mercury concentration is highest.

²For additional graphical illustrations, see for example http://www.nist.gov/programs-projects/differential-absorption-lidar-detection-and-quantification-greenhouse-gases as well as http://dialtechnology.info/history.html.
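Since f_0(r) = −C ∫_0^r g_0(s) ds, a fitted regression function yields a crude plug-in estimate of the concentration profile via g(r) = −f′(r)/C. The finite-difference sketch below is our own illustration of this rescaling, not part of the paper's methodology:

```python
import numpy as np

C = 1.6e-5  # ng^{-1} m^2, the constant C(lambda_on, lambda_off) from the text

def concentration_from_fit(r, fhat):
    """Estimate g(r) = -f'(r)/C at interval midpoints by finite differences.

    r    : increasing array of range values (metres)
    fhat : fitted values of the (inverted S-shaped) regression function at r
    """
    slopes = np.diff(fhat) / np.diff(r)       # piecewise slopes of the fit
    midpoints = 0.5 * (r[:-1] + r[1:])
    return midpoints, -slopes / C             # concentration in ng/m^3
```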

| DISCUSSION
In this paper, we have developed a framework for the estimation of S-shaped regression functions and their inflection points via nonparametric least squares. In spite of the challenges of working with a non-convex shape-constrained function class, we have proposed and implemented an efficient sequential algorithm for the computation of S-shaped least squares estimators, and also established theoretical guarantees on the consistency, robustness and rates of convergence of our estimators. We conclude by discussing some variations and possible extensions of our S-shaped regression problem that may prove to be interesting avenues for future research. First, while our monotonicity requirement for S-shaped functions is natural in many practical applications, and useful for regulating the boundary behaviour of the least squares estimator at the endpoints of the covariate domain, much of our methodology and theory can be adapted straightforwardly to handle functions that are convex on [0, m_0] and concave on [m_0, 1], but not necessarily increasing on [0, 1]. On the computational side, our sequential strategy SeqConReg would still be applicable after the obvious small modifications to Step II of Algorithm 1. This modified algorithm would be justified by analogues of Propositions 3 and 4, and we could still use the mixed primal-dual bases algorithm (Algorithm 2) to compute convex LSEs sequentially. The theoretical results in Section 4 would also go through with some minor alterations (e.g. to the smoothness condition (13) in Assumption 2). The proofs of the oracle inequalities would be broadly the same, and the current localization argument for the inflection point result does not rely in any essential way on monotonicity near m_0. Some properties of our projection framework may need more significant adjustment, however, in order to handle potential boundary issues.
In another direction, one could consider the estimation of 'symmetric' S-shaped regression functions, by which we mean S-shaped functions f_0 with inflection point m_0 ∈ (0, 1) such that f_0(x) = 2f_0(m_0) − f_0(2m_0 − x) for x ∈ [0 ∨ (2m_0 − 1), (2m_0) ∧ 1]. We believe that this additional symmetry constraint is likely to bring about considerable challenges when it comes to developing theory and algorithms for the LSE that minimizes the residual sum of squares over all symmetric S-shaped functions. In particular, unlike in our Proposition 1, it is not clear whether the global minimum in the least squares procedure can be attained at some symmetric S-shaped function with inflection point in {x_1, …, x_n}. Moreover, the sequential strategy that underpins our current algorithm may no longer be valid, because in contrast to the conclusion of Proposition 4, the symmetric S-shaped LSE may not coincide with increasing convex or increasing concave LSEs on any subinterval. Theoretically, although the global risk bounds in Section 4.1 are likely to carry over even with the additional symmetry constraint, the rate of convergence of the inflection point estimator m̂_n may be very different to that in Theorem 3, and may even be (nearly) parametric.
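The symmetry condition above can be made concrete with a small numerical check; the logistic function, whose graph is symmetric about its inflection point, satisfies it exactly (the grid size, tolerance and illustrative functions are our own choices):

```python
import numpy as np

def is_symmetric_s_shape(f, m0, tol=1e-10):
    """Check f(x) = 2 f(m0) - f(2 m0 - x) on [0 v (2 m0 - 1), (2 m0) ^ 1]."""
    lo, hi = max(0.0, 2 * m0 - 1.0), min(2 * m0, 1.0)
    x = np.linspace(lo, hi, 201)
    return np.max(np.abs(f(x) - (2 * f(m0) - f(2 * m0 - x)))) < tol

# Logistic function with inflection point at m0 = 0.3
logistic = lambda x: 1.0 / (1.0 + np.exp(-20.0 * (x - 0.3)))
```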
A further topic for future research could be to seek quantitative versions of the continuity result (Proposition S12) for our L₂-projection onto the class of S-shaped functions, in the spirit of the recent work of Barber and Samworth (2021) on the log-concave projection. Such a result could, for instance, provide insight into the rate at which the estimated inflection point converges to the inflection point of the projected regression function under model misspecification.
Finally, under local curvature conditions on an S-shaped function f_0 similar to those in Assumption 2, it would be of methodological and theoretical interest to be able to carry out (uniformly) asymptotically valid inference for f_0(x) at fixed x ∈ [0, 1], as well as for the inflection point m_0. For x ≠ m_0, defining [ũ_n(x), ṽ_n(x)] to be the largest interval containing x on which the LSE f̂_n is linear, we anticipate that the techniques of Deng et al. (2020) can be applied to obtain a limiting distribution for √(n(ṽ_n(x) − ũ_n(x))) (f̂_n(x) − f_0(x)) that does not depend on f_0, and hence to construct asymptotically valid confidence intervals for f_0(x). On the other hand, since m_0 marks the boundary between the convex and concave parts of f_0, we expect the problem of uncertainty quantification for m_0 and f_0(m_0) to be more challenging and of a qualitatively different character. With this end in view, it is natural to seek tractable asymptotic distributions for m̂_n and f̂_n(m̂_n). As an initial step, one would need to refine the results in Section 4.2 by closing the logarithmic gap between the upper and lower bounds therein on the rate of convergence of m̂_n to m_0. A satisfactory solution to this problem would ideally also settle the analogous problem for the plug-in mode estimator based on the unimodal LSE (Shoung & Zhang, 2001), and is likely to require significant further technical developments.
extracting the columns of U := (u_0 u_1 ⋯ u_{j−1}) ∈ ℝ^{j×j} indexed by A ∪ {0}. By taking v′ = v″ in Lemma 2, we recover a version of Ghosal and Sen (2017, Proposition 2.1): suppose that we are given v ∈ ℝ^j and have oracle knowledge of A ≡ A(Π_{Λ_j}(v)), that is, the locations of the knots of Π_{Λ_j}(v). Then to compute Π_{Λ_j}(v), we can note that Π_{Λ_j}(v) = P_A v = Σ_{ℓ=0}^{j−1} λ̂_ℓ u_ℓ, where the coefficients λ̂_ℓ ≡ λ̂_{ℓ,A}(v) for 0 ≤ ℓ ≤ j − 1 satisfy λ̂_ℓ = 0 for all ℓ ∉ A ∪ {0}, and the remaining subvector of coefficients solves an ordinary (unconstrained) least squares problem.
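With oracle knowledge of the active set, the cone projection thus reduces to ordinary least squares on the selected columns; a generic numpy sketch of this reduction is given below (the basis U in the test is an arbitrary illustrative choice, not the paper's specific construction):

```python
import numpy as np

def project_given_active_set(U, v, active):
    """Compute P_A v: OLS projection of v onto the columns of U indexed by
    A ∪ {0}; coefficients outside A ∪ {0} are set to zero."""
    cols = sorted(set(active) | {0})
    coef_sub, *_ = np.linalg.lstsq(U[:, cols], v, rcond=None)
    coef = np.zeros(U.shape[1])
    coef[cols] = coef_sub
    return U @ coef
```

The projection is idempotent and its residual is orthogonal to the selected columns, as one expects of an orthogonal projection onto their span.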
(iv) The algorithm does not get stuck at any of the thresholds t_r; that is, when t = t_r for some r, there is always a subsequent iteration of (II) that strictly increases t.
(v) At distinct thresholds t_r, the corresponding 'active sets' A_r are distinct subsets of [j − 1].
We will justify (iv) and (v) in Section S4 in the supplementary material, where we also exploit the specific structure of Λ j to handle the degeneracies mentioned in Stage (IV)(b); see in particular modification (IV') and Proposition S18.