In table 1 we can see that the estimates are close to their true values, within 2\%. This is expected: our estimation model is the same as the model used to generate the data, we have a sufficient number of points, and the assumptions of OLS are satisfied. In this situation we can expect good results from OLS estimation.
If we compare the estimates with those of question 1.2, we see that the estimate of the intercept is not close to the true value, with a difference of 4.
The bias in the estimate of $\beta_0$ arises because this new model has a new error term: $\beta_2x_{2i}+ u_i$. This error term no longer has an expected value of 0, but in fact $\beta_2E(x_{2i})+ E(u_i)=4$, which is very close to the bias we find. For $\beta_1$, there is little to no bias. This can be explained by the fact that $x_2$ and $u$ are stochastically independent of $x_1$. The standard error is bigger because $\beta_2x_{2i}+ u_i$ has a bigger variance than just $u_i$.
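The argument above can be written out as a short sketch. It assumes that $E(u_i)=0$ and that $x_{2i}$ and $u_i$ are uncorrelated (both hold if the OLS assumptions are satisfied in the full model of question 1.2). The symbols $y_i$ for the dependent variable and $v_i$ for the centered error term are notation introduced here. Adding and subtracting $\beta_2E(x_{2i})$ in the model without $x_2$ gives
\[
y_i = \bigl(\beta_0 + \beta_2E(x_{2i})\bigr) + \beta_1x_{1i} + v_i, \qquad v_i = \beta_2\bigl(x_{2i}-E(x_{2i})\bigr) + u_i .
\]
Now $E(v_i)=0$, so OLS estimates the intercept $\beta_0+\beta_2E(x_{2i})$ rather than $\beta_0$, while the slope $\beta_1$ is unaffected as long as $x_2$ is independent of $x_1$. For the variance of the new error term we get
\[
Var(v_i) = \beta_2^2Var(x_{2i}) + Var(u_i) \geq Var(u_i),
\]
which is consistent with the larger standard errors.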
Which model would you choose? If I have sufficient computing power, I would choose model 1.2, as it is much more accurate. However, in a very resource-constrained situation, model 1.3 might give acceptable estimates.
The estimation results are similar to those in question 1.2; there is very little bias. It appears that $x_2^{new}$ is sufficiently independent of $x_1$. We expected very little bias because $x_2^{new}$ has a large component that is independent of $x_1$. The standard errors of the estimates of $\beta_1$ and $\beta_2$ are about 25\% higher, which can be explained partly by the lower standard deviation of $x_2^{new}$.
The OLS estimators for the slope coefficients are biased. We see that the estimate of $\beta_1$ is $-3$ instead of the true value of $-4$. We can explain this bias in the following way. Let us start from the model.
We find in table \ref{tab::table_1_6} that the parameter estimates are essentially unbiased but have a bigger standard error for the intercept and $\beta_1$. The standard error of $\beta_1$ is about 6 times bigger (from 0.016 to 0.10). We see no difference in the estimates of $\beta_2$, because nothing has changed in $x_2$.
We expected an estimation result similar to that of 1.2 because nothing has changed except the standard deviation of $x_1$. This means that the OLS assumptions are equally valid and we expect unbiased estimates.
We can explain the difference in the standard error of the estimate of $\beta_1$ using the formula for $Var(\beta_1)$.
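For reference, a sketch of this formula, assuming homoskedastic errors with variance $\sigma^2$ and a sample of $n$ observations. The symbols $\hat{\beta}_1$ (the OLS estimate of $\beta_1$), $\sigma^2$, $SST_1$ and $R_1^2$ are notation introduced here:
\[
Var(\hat{\beta}_1) = \frac{\sigma^2}{SST_1\,(1-R_1^2)}, \qquad SST_1 = \sum_{i=1}^{n}(x_{1i}-\bar{x}_1)^2,
\]
where $R_1^2$ is the $R^2$ of a regression of $x_1$ on the other regressors. With $\sigma^2$ and $R_1^2$ unchanged, a smaller standard deviation of $x_1$ lowers $SST_1$ and therefore gives a larger standard error for the estimate of $\beta_1$.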
The lwage histogram in fig \ref{fig::question_2_2_lwage} is nicely centered, so there is no need to remove any outliers. It is also close to a normal distribution. The wage histogram in fig \ref{fig::question_2_2_wage} is not symmetric but leans to the left; it is clearly not normally distributed.
We can see in table \ref{tab::table_2_3} that there is a positive correlation between wage and school: people who go to school longer tend to earn a higher wage. There is a negative correlation between age and school: the younger generation is more highly educated than the older generation. Chinese citizens are better paid than Malay citizens, and being Indian is negatively correlated with wage.
We estimate a regression for lwage using the variables chinese and indian. The effect for malay can be calculated from the results.
\begin{table}[ht]
\centering
\input{results_24}
\caption{Linear model lwage}
\label{tab::results_24}
\end{table}
$R^{2}=0.0255$, which means that only a very weak correlation is found. If we look at the coefficients in table \ref{tab::results_24}, we see a negative value of 0.17 for indian and a positive value of 0.14 for chinese. This gives a slightly positive value of 0.03 for malay, since $0.14-0.17+0.03=0$.
These results suggest that there is a wage gap based on ethnicity.
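To get a sense of the size of this gap, here is a rough sketch. It assumes that chinese and indian are 0/1 dummy variables and that lwage is the natural logarithm of wage. The estimated difference in lwage between the Chinese and Indian groups is
\[
0.14-(-0.17)=0.31, \qquad e^{0.31}\approx 1.36,
\]
which corresponds to a wage that is roughly 36\% higher for the Chinese group than for the Indian group, before controlling for any other variables.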
\subsection{Question 2.5}
We estimate a regression for lwage using the variables chinese, indian and school.
\begin{table}[ht]
\centering
\input{results_25}
\caption{Linear model lwage/school}
\label{tab::results_25}
\end{table}
$R^{2}=0.224$, which means that a weak correlation is found. If we look at the coefficients in table \ref{tab::results_25}, we see a negative value of 0.07 for indian and a positive value of 0.18 for chinese. This gives a negative value of $-0.11$ for malay, since $0.18-0.07-0.11=0$.
To see whether the relation between lwage and years of schooling is nonlinear, we plot it: