diff --git a/data/assignment2025.dta b/data/assignment2025.dta
new file mode 100644
index 0000000..3c6a4be
Binary files /dev/null and b/data/assignment2025.dta differ
diff --git a/figures/question_1_1.png b/figures/question_1_1.png
new file mode 100644
index 0000000..4fbffef
Binary files /dev/null and b/figures/question_1_1.png differ
diff --git a/figures/question_1_4.png b/figures/question_1_4.png
new file mode 100644
index 0000000..f445c44
Binary files /dev/null and b/figures/question_1_4.png differ
diff --git a/figures/question_1_6.png b/figures/question_1_6.png
new file mode 100644
index 0000000..e85b320
Binary files /dev/null and b/figures/question_1_6.png differ
diff --git a/report/Assignment.tex b/report/Assignment.tex
index c80c985..8441b0d 100644
--- a/report/Assignment.tex
+++ b/report/Assignment.tex
@@ -112,9 +112,7 @@ Hendrik Marcel W Tillemans\\
 \section{Simulation Study}
-\subsection{Question 1.2}
- Are the estimates of $\beta_0$, $\beta_1$ and $\beta_2$ close to their true values? Why (not)?
-
+\subsection{Question 1.1}
 We investigate a linear model with noise
@@ -122,9 +120,9 @@ We investigate a linear model with noise
 where
-\[x1 \sim \mathcal{N}(3,\,6)\]
-\[x2 \sim \mathcal{N}(3,\,6)\]
-\[u \sim \mathcal{N}(0,\,3)\]
+\[x_1 \sim \mathcal{N}(3,\,36)\]
+\[x_2 \sim \mathcal{N}(2,\,25)\]
+\[u \sim \mathcal{N}(0,\,9)\]
 In figure \ref{fig::plot_1_1} we have a 3D representation of the generated model.
@@ -135,59 +133,130 @@ In figure \ref{fig::plot_1_1} we have a 3D representation of the generated model
 \end{figure}
+\subsection{Question 1.2}
-\subsection{1.2: Linear Fit on Generated Data}
-
+ We now estimate the parameters $\beta_0$, $\beta_1$ and $\beta_2$ using the \textbf{Ordinary Least Squares} (OLS) method.
With model:
+\[y_i=\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i\]
+
 \begin{table}[h]
+\centering
 \input{table_1_2}
 \caption{Linear Fit on Generated Data}
 \label{tab::table_1_2}
 \end{table}
+ In table \ref{tab::table_1_2} we can see that the estimates are within 2\% of their true values. This is expected: our estimation model is the same as the model used to generate the data, we have a sufficient number of points, and the assumptions of OLS are satisfied.
+
 \subsection{Question 1.3}
-Compare your estimates with those of question 1.2. Wich model do you choose? Discuss in terms of $\beta_1$ and model prediction.
+If we compare the estimates with those of question 1.2, we see that the estimate of the intercept is not close to the true value: it differs by about 4.
+We can explain the bias of $\beta_0$ because this new model has a new error term: $\beta_2x_{2i} + u_i$. This error term no longer has an expected value of 0, but in fact $\beta_2E(x_{2i}) + E(u_i)= 4$, which is very close to the bias we find. For $\beta_1$, there is little to no bias. This can be explained because $x_2$ and $u$ are stochastically independent of $x_1$. The standard error is bigger because $\beta_2x_{2i} + u_i$ has a bigger variance than $u_i$ alone.
+
+Which model would you choose? With sufficient computing power, I would choose model 1.2, as it is much more accurate. However, in a very resource-constrained situation, model 1.3 might give acceptable estimates.
+
 \begin{table}[h]
+\centering
 \input{table_1_3}
 \caption{Linear Fit with 1 Variable}
 \label{tab::table_1_3}
 \end{table}
 \subsection{Question 1.4}
-Do the results confirm what you would have expected to change in your estimation results compared to the results in question 1.2? Why (not)? How about the standard errors of the estimates of $\beta_1$ and $\beta_2$?
+
+In figure \ref{fig::plot_1_4} we have a 3D representation of the generated model.
+
+\begin{figure}[h]
+\includegraphics[width=0.6\paperwidth]{../figures/question_1_4}
+\caption{Generated points for Question 1.4.}
+\label{fig::plot_1_4}
+\end{figure}
+
+The estimation results are similar to those in question 1.2: there is very little bias. We expected this, because both regressors are in the model and $x_2^{new}$ has a large component that is independent of $x_1$, so there is no perfect collinearity. The standard errors of the estimates of $\beta_1$ and $\beta_2$ are about 25\% higher, which can be explained by the correlation between $x_1$ and $x_2^{new}$: only the part of a regressor that is independent of the other regressor helps to estimate its coefficient, and the independent part of $x_2^{new}$ has a lower standard deviation (4 instead of 5).
+
 \begin{table}[h]
+\centering
 \input{table_1_4}
 \caption{New Linear Fit on Generated Data}
 \label{tab::table_1_4}
 \end{table}
+
 \subsection{Question 1.5}
-Are the OLS estimators for the slope coefficients biased? Why (not)?
+
+As in question 1.3, we estimate the parameters with a single independent variable.
+
 \begin{table}[h]
+\centering
 \input{table_1_5}
 \caption{Linear Fit with 1 Variable}
 \label{tab::table_1_5}
 \end{table}
+The OLS estimator for the slope coefficient is biased: the estimate of $\beta_1$ is about $-3$ instead of the true value of $-4$. We can explain this bias as follows. We start from the model:
+\[y^{new}=\beta_0 + \beta_1 x_1 + \beta_2 x_2^{new} + u_i\]
+
+We now have:
+
+\[x_2^{new} = 0.5\,x_1 + x_2^{'}\]
+
+Where:
+
+\[x_2^{'}\sim \mathcal{N}(5,\,16)\]
+
+Substituting in the model:
+
+\[\Longrightarrow y^{new}=\beta_0 + \beta_1 x_1 + \beta_2(0.5\,x_1 + x_2^{'}) + u_i\]
+
+Let's fill in the betas with the actual values:
+
+\[\Longrightarrow y^{new}= 3 - 4 x_1 + 2(0.5\,x_1 + x_2^{'}) + u_i\]
+
+\[\Leftrightarrow y^{new}= 3 - 4 x_1 + x_1 + 2x_2^{'} + u_i\]
+
+\[\Leftrightarrow y^{new}= 3 - 3 x_1 + 2x_2^{'} + u_i\]
+
+This shows why the OLS estimator in table \ref{tab::table_1_5} finds $-3$ as the estimate for $\beta_1$: $x_2^{'}$ is independent of $x_1$, so the slope on $x_1$ picks up $\beta_1 + 0.5\beta_2 = -3$.
+
+As in question 1.3, we can explain the bias on the intercept: the new error term $2x_2^{'} + u_i$ has expected value $2E(x_2^{'}) = 10$ instead of 0.
+
 \subsection{Question 1.6}
-Do the results confirm what you would have
-expected to change in your estimation results compared to the results in question 1.2?
-Why (not)? How about the standard errors of the estimates of $\beta_1$ ? Use the formula
-Var$\beta_1$ to motivate your answer. What would happen if the standard deviation of x1
-is equal to 0 instead of equals 1? Discuss in terms of the assumptions of the Multiple
-Linear Regression mode.
+
+Now we replace $x_1$ in the original model with
+\[x_1 \sim \mathcal{N}(3,\,1)\]
+
+If we now estimate the parameters we find:
 \begin{table}[h]
+\centering
 \input{table_1_6}
 \caption{Generate Data with Small Variance on x1}
 \label{tab::table_1_6}
 \end{table}
-\begin{figure}[hb]
+We find in table \ref{tab::table_1_6} that the estimates are essentially unbiased, but the standard errors of the intercept and $\beta_1$ are larger. The standard error of $\beta_1$ is about 6 times larger (from 0.016 to 0.10). The estimate of $\beta_2$ shows no difference, because nothing has changed in $x_2$.
+
+\begin{figure}[h]
 \includegraphics[width=0.6\paperwidth]{../figures/question_1_6}
 \caption{Generated points for Question 1.6.}
 \label{fig::plot_1_6}
 \end{figure}
+We expected a similar estimation result as in 1.2 because there are no changes except for the standard deviation of $x_1$. This means that the OLS assumptions are equally valid and we expect unbiased estimates.
+
+We can explain the difference in standard error of the estimates of $\beta_1$ using the formula of $Var(\beta_1)$.
+
+\[Var(\beta_1) = \sigma^2\left[(X^tX)^{-1}\right]_{11}\]
+
+ Because $x_1$ and $x_2$ are independent, we can write this approximately as
+
+ \[Var(\beta_1) \approx \frac{\sigma^2}{n\,Var(x_1)}\]
+
+where $n$ is the number of observations. This means that $Var(\beta_1) \propto 1/Var(x_1)$.
+
+Because $Var(x_1)$ changed from 36 to 1, we expect the standard error to be $\sqrt{36} = 6$ times bigger, which closely matches what we found.
+
+If the standard deviation of $x_1$ changes to 0, $\beta_1$ cannot be calculated: $x_1$ is then constant and perfectly collinear with the intercept, which violates the no perfect multicollinearity assumption.
+
+
 \section{examples}
 Some greek letters:
diff --git a/scripts/simulation.py b/scripts/simulation.py
index d656c24..d4e1972 100644
--- a/scripts/simulation.py
+++ b/scripts/simulation.py
@@ -165,6 +165,18 @@ m = sm.OLS(y_new, X)
 results = m.fit()
 results_to_latex_table_file('table_1_4.tex', results, beta)
+# plot the resulting data
+fig = plt.figure()
+ax = fig.add_subplot(projection='3d')
+
+ax.scatter(x1, x2_new, y_new, marker='o')
+
+ax.set_xlabel('x1')
+ax.set_ylabel('x2_new')
+ax.set_zlabel('y_new')
+
+plt.savefig(FIGURE_DIR + "question_1_4.png")
+plt.show()
 # -----------------------------------------------------------------------------
 # 1.5
 # -----------------------------------------------------------------------------
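The two quantitative claims in the report (the slope of about $-3$ in Question 1.5 and the roughly sixfold standard error in Question 1.6) can be checked with a small standalone simulation. The following is only a sketch and is not part of this change set: it uses the Python standard library rather than the `sm.OLS` call in `scripts/simulation.py`. The function names, sample sizes and seeds are assumptions chosen for illustration, and the coefficients $\beta_0 = 3$, $\beta_1 = -4$, $\beta_2 = 2$ are taken from the derivation in the report.

```python
import random
import statistics

# True coefficients as used in the report's derivation.
BETA0, BETA1, BETA2 = 3.0, -4.0, 2.0


def simple_slope(x, y):
    """OLS slope of y on x (with intercept): cov(x, y) / var(x)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def omitted_variable_slope(n=20000, seed=0):
    """Question 1.5: regress y_new on x1 only, leaving x2_new out."""
    rng = random.Random(seed)
    x1 = [rng.gauss(3, 6) for _ in range(n)]          # N(3, 36)
    x2_new = [0.5 * a + rng.gauss(5, 4) for a in x1]  # 0.5*x1 + N(5, 16)
    y_new = [BETA0 + BETA1 * a + BETA2 * b + rng.gauss(0, 3)
             for a, b in zip(x1, x2_new)]
    return simple_slope(x1, y_new)


def full_model_beta1(sd_x1, n=5000, seed=1):
    """Question 1.6: fit y ~ 1 + x1 + x2, return (beta1_hat, se(beta1_hat))."""
    rng = random.Random(seed)
    x1 = [rng.gauss(3, sd_x1) for _ in range(n)]
    x2 = [rng.gauss(2, 5) for _ in range(n)]          # N(2, 25)
    y = [BETA0 + BETA1 * a + BETA2 * b + rng.gauss(0, 3)
         for a, b in zip(x1, x2)]

    # Centered sums of squares and cross-products (2x2 normal equations).
    m1, m2, my = statistics.fmean(x1), statistics.fmean(x2), statistics.fmean(y)
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2

    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    b0 = my - b1 * m1 - b2 * m2

    # Residual variance with n - 3 degrees of freedom (three parameters).
    rss = sum((c - b0 - b1 * a - b2 * b) ** 2 for a, b, c in zip(x1, x2, y))
    sigma2_hat = rss / (n - 3)
    se_b1 = (sigma2_hat * s22 / det) ** 0.5
    return b1, se_b1


if __name__ == "__main__":
    print("slope with x2_new omitted:", omitted_variable_slope())
    _, se_wide = full_model_beta1(sd_x1=6.0)
    _, se_narrow = full_model_beta1(sd_x1=1.0)
    print("se ratio (sd 1 vs sd 6):", se_narrow / se_wide)
```

Under these assumptions the first value should land near $\beta_1 + 0.5\beta_2 = -3$ and the ratio near $\sqrt{36} = 6$. The exact numbers depend on the seed and sample size, so they will not reproduce the tables in the report.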