Bivariate and Multivariate Data Cloning through Non Linear Regression Models

Nonlinear regression analysis is widely used in mathematics, engineering, and the social sciences. Disciplines such as finance, biology, and chemistry have made broad use of nonlinear regression models (NLRMs). Cloned datasets, which provide the same bivariate and multivariate nonlinear regression fit as the actual datasets, are important in these areas. This article presents a sequence of cloned datasets that give exactly the same fit of bivariate and multivariate nonlinear regression models.


INTRODUCTION
If genuine data are private and cannot be shown, a matching or alternative dataset is required that provides the same summary statistics as the actual data. Cloned data refers to such an alternative or matching dataset, produced through mathematical techniques, which allows rapid provisioning in testing and development. Data cloning is also significant as an alternative method for protecting confidential information and databases.

Table 1 shows four fictitious, distinct cloned datasets (CDSs) created by Anscombe [1] to demonstrate the significance of graphs in statistical analysis. The summary statistics (mean, standard deviation, and correlation), the parameter estimates of the fitted regression equation, $R^2$, and the estimated standard deviation of the residuals are identical across these four CDSs; however, their scatter plots are vastly different, as shown in Figure 1. Dataset I is strongly linear with a single outlier and dataset II appears to follow a parabolic curve, whereas dataset III appears to follow a noisy linear regression model (LRM) and dataset IV appears to follow a vertical line, with the regression thrown off by a single outlier. The datasets in Table 1 are frequently used to show how important visual methods are, and they are well known for their use in education. However, the method used to create the datasets was not explained in [1].

A genetic algorithm-based approach was proposed by Chatterjee and Firat [2], who generated 1,000 random datasets with comparable summary statistics and graphics for the basic datasets. Govindaraju and Haslett [3] devised a method for producing datasets by regressing the response on the covariate in the direction of their unconditional sample means, while maintaining identical LRM estimates. As a result, the variability in the response and the covariate decreased in each subsequent cloned dataset. In Haslett and Govindaraju [4], the method for creating matched datasets was extended to the multiple linear regression model, ensuring that the matched datasets have an identical fit to the original data. The idea of data cloning emerged from both biostatistics [5,6] and financial time series [7].
Cloning for maximum likelihood estimation using Bayesian software was achieved by the simple device of replicating the original data many times [6]. Fung et al. [8] noted that creating CDSs to anonymize sensitive data is another application of datasets with the same statistical properties, as discussed in [3]. In this setting, it is critical that individual data points are altered while the overall structure of the data remains unchanged.
Haslett and Govindaraju [9] described a straightforward approach for modifying LRM data while still obtaining the same fitted regression parameters. Ponciano et al. [10] showed how structural parameter non-identifiability can be diagnosed with data cloning (DC) and distinguished from other parameter estimability issues, such as when parameters are structurally identifiable but not estimable in a given dataset, or when they are identifiable but only weakly estimable. Bayesian phylogenetics software can be used to diagnose non-identifiability with the DC approach. Additionally, it was demonstrated that DC can be used to examine and eliminate the influence of priors, particularly when prior elicitation is difficult. Finally, when applied to phylogenetic inference, DC can be used to investigate at least two significant statistical issues: developing effective sampling strategies for computationally expensive posterior densities, and evaluating the identifiability of discrete parameters, such as the topology of the tree.
Data confidentiality is one of the design goals of tunable encrypted deduplication; see Amvrosiadis and Bhadkamkar [11]. It also reduces the risk of data leakage through frequency analysis. Furthermore, better ways of seeing and exploring data lead to better insights. The "Datasaurus" dataset was created by Alberto Cairo [12]. Like Anscombe's Quartet, it emphasizes the significance of data visualization: despite the dataset's ordinary summary statistics, its plot depicts a dinosaur. Subsequent work started from the Datasaurus and created additional datasets with the same summary statistics. Cairo's Datasaurus thus cautions against relying solely on the summary statistics of the data.
Consequently, according to [2], datasets should be as graphically distinct as possible. In contrast, the datasets in [3,4,9], which have different standard deviations but identical means and LRM estimates, are intended to be graphically comparable. Matejka and Fitzmaurice [13] developed a novel method for creating datasets that are identical across a variety of statistical properties but visually distinct during data exploration. To address the primary empirical facts of financial time series, numerous complex parametric stochastic volatility models were proposed in the subsequent literature. The models proposed by Mao et al. [14,15] incorporated a broader asymmetric volatility function.
Hussain et al. [16] used a simple procedure to clone data for nonlinear regression models having linearizable or nonlinearizable regression functions, such as $aX^b$, $ab^X$, $ae^{bX}$, and $k + ab^X$, among others. They found that cloned data generated by linearizable or nonlinearizable estimable functions of the parameters leave the estimates unchanged. The procedure increased the sample size of the cloned data without changing the parameter estimates: for $n$ original sample points $(x, y)$, it generated $n^2$ observations by adding $[a_i: i = 1, 2, \dots, n]$ to the data points $y$, subject to $\sum_{i=1}^{n} a_i = 0$. Because of the increased sample size, the cloned estimates showed smaller standard errors than the original estimates. The procedure in [16] is practical for the first iteration only; in subsequent iterations it becomes tedious. It is useful for modeling but not for confidentializing or encrypting data, because the variables in the design matrix remain unchanged. Here, "confidentializing" means ensuring that the values of particular variables for particular individuals cannot be deduced from the data. Our goal in this article is to create datasets with the same fit for nonlinear regression models (NLRMs). The methods of [3,4] were used to generate these cloned datasets. To get around the problem in [16], this article focuses on nonlinear regression models with linearizable regression functions.
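The sample-expansion idea attributed to [16] can be illustrated with a minimal sketch. It assumes one particular reading of the description above: every original point $(x_j, y_j)$ is paired with every offset $a_i$, so that $n$ points become $n^2$ rows, and the fit is taken on the linearized scale as a straight line. The function name `expand_with_offsets` and the numbers are hypothetical and are not taken from [16].

```python
import numpy as np

def expand_with_offsets(x, y, offsets):
    """Pair every (x_j, y_j) with every offset a_i; the offsets must sum to zero."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    offsets = np.asarray(offsets, dtype=float)
    if not np.isclose(offsets.sum(), 0.0):
        raise ValueError("offsets must sum to zero")
    x_big = np.repeat(x, offsets.size)               # x_j repeated once per offset
    y_big = (y[:, None] + offsets[None, :]).ravel()  # y_j + a_i, in the same row order
    return x_big, y_big

# Illustrative data on the linearized scale (assumed values)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8])
offsets = np.array([-0.3, 0.1, 0.4, -0.2])  # n offsets that sum to 0

x_big, y_big = expand_with_offsets(x, y, offsets)
fit_small = np.polyfit(x, y, deg=1)         # [slope, intercept] from n points
fit_big = np.polyfit(x_big, y_big, deg=1)   # [slope, intercept] from n^2 points
```

Because the offsets sum to zero, the least-squares line is the same for both datasets, while the covariate values are simply repeated, which is why this kind of expansion does not hide the design matrix.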

The Linear Regression Model
In a multiple LRM (MLRM), the regression function is linear in the parameters.

The Nonlinear Regression Model
In an NLRM, the regression function cannot be written as linear in the parameters. In this case, there are infinitely many ways to describe the deterministic part of the model.

Linearizable Regression Functions (LRFs)
In some NLRMs, the regression function can be linearized by transforming the variable of interest and the explanatory variables. A regression function is therefore called linearizable if it can be converted into a function that is linear in the parameters.
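As a minimal sketch of linearization, the power curve $Y = AX^B$ becomes a straight line after taking logarithms of both variables, so its parameters can be estimated by ordinary least squares on the log-log scale. The helper name `fit_power_curve` and the noise-free data are assumptions made for illustration.

```python
import numpy as np

def fit_power_curve(x, y):
    """Estimate A and B in Y = A * X**B by least squares on ln(Y) versus ln(X)."""
    slope, intercept = np.polyfit(np.log(x), np.log(y), deg=1)
    return np.exp(intercept), slope  # A = exp(intercept), B = slope

# Noise-free illustrative data generated from A = 3, B = 0.5
x = np.array([1.0, 2.0, 4.0, 8.0])
y = 3.0 * x**0.5

A_hat, B_hat = fit_power_curve(x, y)
```

This requires strictly positive values of both variables, since the logarithm is applied to each.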

Linearizable Regression Function Model
School of Science Volume 7 Issue 3, 2023

An LRM with the LRF of the referred example is based on a model whose errors follow the normal distribution on the transformed scale. When this model is back-transformed, the resulting errors follow a lognormal distribution and contribute multiplicatively. The assumptions about the random deviations are therefore appreciably different from those of a model fitted on the original scale, whose random deviations follow a normal distribution and contribute additively.
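To make the contrast concrete, the two error structures can be written out for the power curve $Y = AX^B$, which is used here only as a stand-in for the referred example. The symbols $E_i$, $\tilde{E}_i$, $E_i^{*}$, $\sigma$, and $\sigma_{*}$ are assumed notation.

```latex
\begin{align*}
\ln Y_i &= \ln A + B \ln x_i + E_i, & E_i &\sim N(0, \sigma^2)
  && \text{(linearized model)} \\
Y_i &= A\, x_i^{B}\, \tilde{E}_i, & \tilde{E}_i &= e^{E_i}
  && \text{(back-transformed: lognormal, multiplicative errors)} \\
Y_i &= A\, x_i^{B} + E_i^{*}, & E_i^{*} &\sim N(0, \sigma_{*}^2)
  && \text{(original scale: normal, additive errors)}
\end{align*}
```

The first two lines describe the same model on two scales, whereas the third is a different model, so least-squares estimates from the linearized fit and from a direct nonlinear fit need not coincide.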

DATA CLONING BY REGRESSING Y ON X AND X ON Y
Assume $n$ paired observations of $X$ and $Y$, say $(x_i, y_i)$, $i = 1, 2, \dots, n$.
The following procedure from [3] generates a sequence of CDSs that yield the same fitted NLRM equations.

Procedure for Bivariate Nonlinear Regression Model $Y = AX^B$
The simple NLRM (a geometric or power curve) $Y = AX^B$ is linearizable by taking logarithms of both $Y$ and $X$.

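Procedure for Bivariate Nonlinear Regression Model $Y = AB^X$
A simple nonlinear regression model (an exponential curve), $Y = AB^X$, is linearizable by a logarithmic transformation as $\tilde{Y} = a + bX$, where $\tilde{Y} = \ln Y$, $a = \ln A$, and $b = \ln B$.
1. First, fit the regression of $\tilde{Y}$ on $X$, specifically $\tilde{Y}_1 = a + bX$. Also, fit the inverse regression (IR) of $X$ on $\tilde{Y}$, specifically $X_1 = c + d\tilde{Y}$.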
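One cloning step of this kind can be sketched as follows, assuming the fitted values of the two regressions are taken as the new response and the new covariate. The helper names `ols_line` and `clone_once`, the coefficient names `a`, `b`, `c`, `d`, and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def ols_line(u, v):
    """Return (intercept, slope) of the least-squares line of v on u."""
    slope, intercept = np.polyfit(u, v, deg=1)
    return intercept, slope

def clone_once(x, y_t):
    """One cloning step on the linearized scale, where y_t = ln(Y)."""
    a, b = ols_line(x, y_t)   # regression of y_t on x
    c, d = ols_line(y_t, x)   # inverse regression of x on y_t
    y_next = a + b * x        # fitted values become the cloned response
    x_next = c + d * y_t      # fitted values become the cloned covariate
    return x_next, y_next

# Synthetic data from Y = A * B**X with multiplicative noise (assumed values)
rng = np.random.default_rng(1)
x = np.linspace(1.0, 10.0, 12)
y = 2.0 * 1.3**x * np.exp(rng.normal(0.0, 0.1, x.size))
y_t = np.log(y)

x1, y1 = clone_once(x, y_t)
original_fit = ols_line(x, y_t)   # (a, b) from the original data
cloned_fit = ols_line(x1, y1)     # (a, b) refitted on the cloned data
```

Regressing the cloned response on the cloned covariate returns the same intercept and slope, because the slope of `y1` on `x1` reduces algebraically to `b` and both sets of fitted values keep the original sample means.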

Procedure for Bivariate Nonlinear Regression Model $Y = Ae^{BX}$
The simple nonlinear regression model (an exponential curve) $Y = Ae^{BX}$ is likewise linearizable by a logarithmic transformation.
3. The approach used above can be iterated with $\tilde{Y}_2$ and $X_2$.
Here, the CDSs shown in Table 4 were produced using steps 1-4 discussed earlier, and they give the same fitted NLRM equation as in (Eq. 3.3).

Table 4. Cloned Data Sets Having the Same Nonlinear Regression Fit $Y = Ae^{BX}$
We generated cloned datasets for four further nonlinear regression models, including forms involving $\sqrt{X}$ and $X^2$, using the procedure given by [3], and presented them in the corresponding tables.

In accordance with [4], after discussing data that are independent and identically distributed (iid), the current approach was extended to an arbitrary error covariance structure. Let the multiple NLRM be given as in (Eq. 4.1).
The problem here is to create a new response vector, $Y_{\text{clone}}$, and a new covariate data matrix, $X_{\text{clone}}$. This can be accomplished by transforming back to $\tilde{Y}$ and $\tilde{X}$. Equivalently, multivariate CDSs $(Y_{\text{clone}}, X^{(1)}_{\text{clone}}, X^{(2)}_{\text{clone}})$ are required that produce the same multiple NLRM equation as the original dataset $(Y, X^{(1)}, X^{(2)})$.
Returning to the iid case, CDSs can be generated by manipulating any one covariate, say $X_j$, where $j = 1, 2$, using the steps below.
1) Initially, fit the multiple linearizable regression model (Eq. 4.4) using the MCtD data.
2) Select a covariate, $X_2$.
For the CDSs in Table 10, steps 1-9 specified above were used (X2 was manipulated first), and the fitted multiple NLRM equation was exactly the same as in (Eq. 4.5). Figure 2 presents a matrix plot of the raw and cloned data in Table 10, which shows the effect on X2 of the orthogonal manipulation described in steps 1-9 of the algorithm. The bivariate relationship between X2,clone and Yclone is much weaker than that between X2 and Y. However, this is not the case for X1,clone and Yclone, because X1 was not manipulated.
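The following sketch does not reproduce steps 1-9 of the algorithm. It shows a different, simpler construction, swapped in only to illustrate that one covariate and the response can be altered while the fitted multiple model stays exactly the same. It assumes a two-covariate multiplicative model fitted on the log scale; the function name `fit_loglinear`, the perturbation `v`, and all numbers are hypothetical.

```python
import numpy as np

def fit_loglinear(y, x1, x2):
    """Least-squares fit of ln(y) on [1, ln(x1), ln(x2)]; returns design and coefficients."""
    design = np.column_stack([np.ones_like(y), np.log(x1), np.log(x2)])
    coef, *_ = np.linalg.lstsq(design, np.log(y), rcond=None)
    return design, coef

# Synthetic data from a multiplicative two-covariate model (assumed values)
rng = np.random.default_rng(7)
n = 40
x1 = rng.uniform(1.0, 10.0, n)
x2 = rng.uniform(1.0, 10.0, n)
y = 2.0 * x1**0.5 * x2**1.5 * np.exp(rng.normal(0.0, 0.1, n))

design, coef = fit_loglinear(y, x1, x2)
resid = np.log(y) - design @ coef

# Build a perturbation v orthogonal to the design columns and to the residuals
basis = np.column_stack([design, resid])
z = rng.normal(size=n)
proj, *_ = np.linalg.lstsq(basis, z, rcond=None)
v = z - basis @ proj
v *= 0.3 / np.std(v)  # perturbation size on the log scale (arbitrary choice)

# Shift ln(x2) by v and ln(y) by coef[2] * v, then transform back
x2_clone = np.exp(np.log(x2) + v)
y_clone = np.exp(np.log(y) + coef[2] * v)

_, coef_clone = fit_loglinear(y_clone, x1, x2_clone)
```

The residuals are unchanged and remain orthogonal to the new design matrix, so the least-squares coefficients of the clone equal those of the original data.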

DISCUSSION
This study showed that the parameter estimates of the original datasets discussed in this article and of their generated cloned datasets are identical. As a result, data cloning has the potential to be used in a wide range of applications, including data encryption, visualization, and smoothing. The encryption application is particularly intriguing because it can be extended to databases in general, even when regression modeling is not the goal. In prior literature, cloned datasets were generated for linear regression models. However, generating them for nonlinear regression models is equally important. In this context, new methods can be developed for cloning datasets or databases under nonlinear regression models. The cloned datasets for the multiple model $\beta_0 (X^{(1)})^{\beta_1} (X^{(2)})^{\beta_2}$ have exactly the same nonlinear regression coefficients as the original data. For bivariate LRFs, the response and the covariate of the CDSs collapsed towards their means, giving smaller variability than in the original dataset.

Figure 2. Matrix Plot of raw and cloned data

3. The above method can be iterated with $\tilde{Y}_2$ and $\tilde{X}_2$, as done in step 1, to obtain cloned datasets having the identical linearizable regression equation (LRE).
If preferred, transform back to obtain CDSs having the same NLRM coefficients. It was noted that the variability in Y and X of the cloned datasets fluctuated after every iteration (see Table 2).

Table 2. Cloned Data Sets Having the Same Nonlinear Regression Fit $Y = AX^B$

2. The regression of $\tilde{Y}_1$ on $X_1$ would be $\tilde{Y}_2 = a + bX_1$, maintaining the identical parameter estimates. Similarly, $X_2 = c + d\tilde{Y}_1$.
3. The above procedure can be iterated with $\tilde{Y}_2$ and $X_2$, as in step 1, to obtain CDSs having the same LRE.
If preferred, convert back to produce a sequence of CDSs, all with the same NLRM coefficients. It was observed that the variability in Y of the cloned datasets fluctuated after every iteration (see Table 3).

Table 3. Cloned Data Sets Having the Same Nonlinear Regression Fit $Y = AB^X$

Steps 1-4 described above generate the CDSs presented in Table 3, which have exactly the same fitted NLRM equation as in (Eq. 3.2).
Iterating with $\tilde{Y}_2$ and $X_2$ as in step 1 gives CDSs with the same LRE.
If preferred, transform back to obtain a sequence of CDSs, all with the same NLRM coefficients. It was observed that the variability in Y of the cloned datasets fluctuated after every iteration (see Table 4).
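The iteration and back-transformation can be sketched for the exponential curve $Y = Ae^{BX}$ as below, assuming the same intercepts and slopes are reused at every iteration because each clone reproduces them. The names `ols_line` and `clone_sequence` and the synthetic data are assumptions made for illustration.

```python
import numpy as np

def ols_line(u, v):
    """Return (intercept, slope) of the least-squares line of v on u."""
    slope, intercept = np.polyfit(u, v, deg=1)
    return intercept, slope

def clone_sequence(x, y, n_iter=3):
    """Return the fitted (A, B) and a list of cloned (x, y) pairs on the original scale."""
    y_t = np.log(y)                 # linearize: ln(Y) = ln(A) + B * X
    a, b = ols_line(x, y_t)         # regression of ln(Y) on X
    c, d = ols_line(y_t, x)         # inverse regression of X on ln(Y)
    clones = []
    for _ in range(n_iter):
        # Simultaneous update: both right-hand sides use the previous pair
        x, y_t = c + d * y_t, a + b * x
        clones.append((x, np.exp(y_t)))  # transform the response back
    return (np.exp(a), b), clones

# Synthetic data from Y = A * exp(B * X) with multiplicative noise (assumed values)
rng = np.random.default_rng(3)
x = np.linspace(0.5, 5.0, 15)
y = 1.5 * np.exp(0.6 * x) * np.exp(rng.normal(0.0, 0.15, x.size))

(A_hat, B_hat), clones = clone_sequence(x, y)

# Refit each clone on the log scale and record the spread of ln(Y)
refits = []
for x_c, y_c in clones:
    a_c, b_c = ols_line(x_c, np.log(y_c))
    refits.append((np.exp(a_c), b_c))
log_sds = [np.std(np.log(y_c)) for _, y_c in clones]
```

In this synthetic example every clone returns the same estimates of $A$ and $B$, and the standard deviation of $\ln Y$ shrinks from one clone to the next, since each step multiplies its variance by the squared correlation between $X$ and $\ln Y$.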

Table 5. Cloned Data Sets Having the Same Nonlinear Regression

Table 6. Cloned Data Sets Having the Same Nonlinear Regression Fit

Table 10. Cloned Data Sets Having the Same Multiple Nonlinear Regression Fit $Y = \beta_0 (X^{(1)})^{\beta_1} (X^{(2)})^{\beta_2}$

Cloned datasets have been presented for bivariate and multivariate NLRMs that have linearizable regression functions, including $AX^B$, $AB^X$, and $Ae^{BX}$.