\title{\color{report_main}{Assignment Econometrics 2024}} % Title
\author{Hendrik Marcel W Tillemans} % Author
\date{\today} % Date

\begin{titlepage}
\centering
\vspace*{0.5 cm}
\includegraphics[scale = 0.95]{../figures/vub.png}\\[1.0 cm] % University Logo
\textsc{\LARGE \newline\newline Free University Brussels}\\[2.0 cm] % University Name
\textsc{\Large \color{report_main}{Class: Econometrics}}\\[0.5 cm] % Course Code
\rule{\linewidth}{0.2 mm} \\[0.4 cm]
{ \huge \bfseries \thetitle}\\
\rule{\linewidth}{0.2 mm} \\[1.5 cm]

\begin{minipage}{0.5\textwidth}
\begin{flushleft} \large
\emph{Professor:}\\
Jeroen Kerkhof\\
Faculty of Economic Sciences\\
\end{flushleft}
\end{minipage}~
\begin{minipage}{0.4\textwidth}
\begin{flushright} \large
\emph{Group:} \\
Hendrik Marcel W Tillemans\\
\end{flushright}
\end{minipage}\\[2 cm]

% takes the current date
\thedate

\end{titlepage}

\tableofcontents
\pagebreak

\listoffigures
\listoftables
\pagebreak

\section{Simulation Study}

\subsection{Question 1.1}

We investigate a linear model with noise

\[y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + u\]

where

\[x_1 \sim \mathcal{N}(3,\,36)\]
\[x_2 \sim \mathcal{N}(2,\,25)\]
\[u \sim \mathcal{N}(0,\,9)\]

In figure \ref{fig::plot_1_1} we have a 3D representation of the generated model.

\begin{figure}[hb]
\includegraphics[width=0.6\paperwidth]{../figures/question_1_1}
\caption{Generated points for Question 1.1.}
\label{fig::plot_1_1}
\end{figure}

\subsection{Question 1.2}

We now estimate the parameters $\beta_0$, $\beta_1$ and $\beta_2$ using the \textbf{Ordinary Least Squares} (OLS) method, with the model:

\[y_i=\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i\]

\begin{table}[ht]
\centering
\input{table_1_2}
\caption{Linear Fit on Generated Data}
\label{tab::table_1_2}
\end{table}

In table \ref{tab::table_1_2} we can see that the estimates are within 2\% of their true values. This is what we expect from OLS here: the estimation model is the same as the model used to generate the data, we have a sufficient number of points, and the assumptions of OLS are satisfied.

\subsection{Question 1.3}

If we compare the estimates in table \ref{tab::table_1_3} with those of question 1.2, we see that the estimate of the intercept differs from the true value by about 4. We can explain this bias in $\beta_0$ because the new model, which omits $x_2$, has a new error term: $\beta_2x_{2i} + u_i$. This error term no longer has an expected value of 0, but $\beta_2E(x_{2i}) + E(u_i)= 4$, which is very close to the bias we find.

For $\beta_1$, there is little to no bias. This can be explained because $x_2$ and $u$ are stochastically independent of $x_1$. The standard error is bigger because $\beta_2x_{2i} + u_i$ has a bigger variance than $u_i$ alone.

Which model would you choose? If I have sufficient computing power, I would choose the model of question 1.2, as it is much more accurate. However, in a very resource-constrained situation, the model of question 1.3 might give acceptable estimates.

\begin{table}[ht]
\centering
\input{table_1_3}
\caption{Linear Fit with 1 Variable}
\label{tab::table_1_3}
\end{table}

\subsection{Question 1.4}

In figure \ref{fig::plot_1_4} we have a 3D representation of the generated model.

\begin{figure}[ht]
\includegraphics[width=0.6\paperwidth]{../figures/question_1_4}
\caption{Generated points for Question 1.4.}
\label{fig::plot_1_4}
\end{figure}

The estimation results are similar to those in question 1.2: there is very little bias.
It appears that $x_2^{new}$ is sufficiently independent of $x_1$. We expected very little bias because $x_2^{new}$ has a large part that is independent of $x_1$. The standard errors of the estimates of $\beta_1$ and $\beta_2$ are about 25\% higher, which can be explained partly by the lower standard deviation of the part of $x_2^{new}$ that is independent of $x_1$.

\begin{table}[ht]
\centering
\input{table_1_4}
\caption{New Linear Fit on Generated Data}
\label{tab::table_1_4}
\end{table}

\subsection{Question 1.5}

As in question 1.3, we estimate the model with a single independent variable, $x_1$.

\begin{table}[ht]
\centering
\input{table_1_5}
\caption{Linear Fit with 1 Variable}
\label{tab::table_1_5}
\end{table}

The OLS estimator for the slope coefficient is biased. We see that the estimate of $\beta_1$ is about $-3$ instead of the true value of $-4$. We can explain this bias as follows. We start from the model:

\[y^{new}=\beta_0 + \beta_1 x_1 + \beta_2 x_2^{new} + u\]

We now have:

\[x_2^{new} = 0.5 x_1 + x_2^{'}\]

where:

\[x_2^{'}\sim \mathcal{N}(5,\,16)\]

Substituting in the model:

\[\Longrightarrow y^{new}=\beta_0 + \beta_1 x_1 + \beta_2(0.5 x_1 + x_2^{'}) + u\]

Let us fill in the true values of the coefficients:

\[\Longrightarrow y^{new}= 3 - 4 x_1 + 2(0.5 x_1 + x_2^{'}) + u\]
\[\Leftrightarrow y^{new}= 3 - 4 x_1 + x_1 + 2x_2^{'} + u\]
\[\Leftrightarrow y^{new}= 3 - 3 x_1 + 2x_2^{'} + u\]

This shows why the OLS estimator in table \ref{tab::table_1_5} finds approximately $-3$ as the estimate for $\beta_1$. As in question 1.3, the bias on the intercept comes from the non-zero expected value of the new error term.
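
The same numbers can be checked against the general omitted-variable formula. This is only a sketch, assuming that $x_2^{'}$ and $u$ are independent of $x_1$; the symbols $\hat{\beta}_1^{short}$ and $\hat{\beta}_0^{short}$, for the slope and intercept of the one-variable regression, are introduced here for illustration and do not appear in the assignment.

\[\mathrm{plim}\,\hat{\beta}_1^{short} = \beta_1 + \beta_2 \frac{Cov(x_1, x_2^{new})}{Var(x_1)} = -4 + 2 \cdot \frac{0.5 \cdot 36}{36} = -3\]

Under the same assumption, the intercept of the one-variable regression absorbs the expected value of the omitted independent part:

\[\mathrm{plim}\,\hat{\beta}_0^{short} = \beta_0 + \beta_2 E(x_2^{'}) = 3 + 2 \cdot 5 = 13\]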
\subsection{Question 1.6}

Now we replace $x_1$ in the original model with

\[x_1 \sim \mathcal{N}(3,\,1)\]

If we now estimate the parameters we find:

\begin{table}[ht]
\centering
\input{table_1_6}
\caption{Generated Data with Small Variance on $x_1$}
\label{tab::table_1_6}
\end{table}

We find in table \ref{tab::table_1_6} that the estimates are essentially unbiased but have a bigger standard error for the intercept and $\beta_1$. The standard error of $\beta_1$ is 6 times bigger (from 0.016 to 0.10). We see no difference in the estimate of $\beta_2$, because nothing has changed in $x_2$.

\begin{figure}[ht]
\includegraphics[width=0.6\paperwidth]{../figures/question_1_6}
\caption{Generated points for Question 1.6.}
\label{fig::plot_1_6}
\end{figure}

We expected a similar estimation result as in question 1.2 because nothing has changed except the standard deviation of $x_1$. This means that the OLS assumptions are equally valid and we expect unbiased estimates.

We can explain the difference in the standard error of the estimate of $\beta_1$ using the formula for $Var(\hat{\beta}_1)$:

\[Var(\hat{\beta}_1) = \sigma^2(X^tX)^{-1}_{11}\]

Because $x_1$ and $x_2$ are independent, this is approximately

\[Var(\hat{\beta}_1) \approx \frac{\sigma^2}{n\,Var(x_1)}\]

where $n$ is the number of observations. This means that $Var(\hat{\beta}_1) \propto 1/Var(x_1)$. Because $Var(x_1)$ changed from 36 to 1, we expect the standard error to be $\sqrt{36} = 6$ times bigger, which is exactly what we found.

If the standard deviation of $x_1$ goes to 0, $\beta_1$ cannot be calculated: $x_1$ is then perfectly collinear with the constant, which violates the no-multicollinearity assumption.

\section{Empirical Investigation}

\subsection{Question 2.1}

We retain 2510 observations.
\begin{table}[ht]
\centering
\input{summary_stats}
\caption{Summary Statistics}
\label{tab::summary_stats}
\end{table}

\subsection{Question 2.2}

\begin{figure}[ht]
\includegraphics[width=0.6\paperwidth]{../figures/question_2_2_wage}
\caption{Histogram wage}
\label{fig::question_2_2_wage}
\end{figure}

\begin{figure}[ht]
\includegraphics[width=0.6\paperwidth]{../figures/question_2_2_lwage}
\caption{Histogram lwage}
\label{fig::question_2_2_lwage}
\end{figure}

The lwage histogram in figure \ref{fig::question_2_2_lwage} is nicely centered, so there is no need to remove any outliers. It is also close to a normal distribution. The wage histogram in figure \ref{fig::question_2_2_wage} is not symmetric but leans to the left. It is clearly not normally distributed.

\subsection{Question 2.3}

We investigate the correlation between the variables wage, age, school, man, malay, chinese and indian.

\begin{table}[ht]
\centering
\input{table_2_3}
\caption{Correlation matrix}
\label{tab::table_2_3}
\end{table}

We can see in table \ref{tab::table_2_3} that there is a positive correlation between wage and school: people who stay in school longer tend to earn a higher wage. There is a negative correlation between age and school: the younger generation is more highly educated than the older generation. Chinese citizens are better paid than malay citizens, and indian has a negative correlation with wage.

\subsection{Question 2.4}

We estimate a regression for lwage using the variables chinese and indian. We can calculate the malay influence from the results.

\begin{table}[ht]
\centering
\input{results_24}
\caption{Linear model lwage}
\label{tab::results_24}
\end{table}

$R^{2} = 0.0255$, which means that the model explains only about 2.6\% of the variation in lwage. If we look at the coefficients in table \ref{tab::results_24}, we see a negative value of 0.17 for indian and a positive value of 0.14 for chinese.
If the three ethnicity effects are taken to sum to zero, this gives a slightly positive value of 0.03 for malay:

\[0.14 - 0.17 + 0.03 = 0\]

These results imply that there is a wage gap based on ethnicity.

\subsection{Question 2.5}

We estimate a regression for lwage using the variables chinese, indian and school.

\begin{table}[ht]
\centering
\input{results_25}
\caption{Linear model lwage/school}
\label{tab::results_25}
\end{table}

$R^{2} = 0.224$, which means that the model explains about 22\% of the variation in lwage, still a weak fit. If we look at the coefficients in table \ref{tab::results_25}, we see a negative value of 0.07 for indian and a positive value of 0.18 for chinese. If the three ethnicity effects are again taken to sum to zero, this gives a negative value of $-0.11$ for malay:

\[0.18 - 0.07 - 0.11 = 0\]

To check whether the relationship between lwage and years of schooling is non-linear, we plot it:

\begin{figure}[ht]
\includegraphics[width=0.6\paperwidth]{../figures/question_2_5}
\caption{lwage vs school}
\label{fig::question_2_5}
\end{figure}

In figure \ref{fig::question_2_5} there is no obvious non-linearity.

\subsection{Question 2.6}

We estimate a regression for lwage using the variables chinese, indian, school and age.

\begin{table}[ht]
\centering
\input{results_26}
\caption{Linear model lwage/age}
\label{tab::results_26}
\end{table}

$R^{2} = 0.370$, which means that the model explains about 37\% of the variation in lwage. The coefficients are shown in table \ref{tab::results_26}. To check whether the relationship between lwage and age is non-linear, we plot it:

\begin{figure}[ht]
\includegraphics[width=0.6\paperwidth]{../figures/question_2_6}
\caption{lwage vs age}
\label{fig::question_2_6}
\end{figure}

In figure \ref{fig::question_2_6} there is a banana-shaped pattern, indicating a non-linear relationship with peak earnings around age 40. We can use \textbf{agesq} to do a parabolic fit. If we run this model we get:

\begin{table}[ht]
\centering
\input{results_26b}
\caption{Parabolic model lwage/age}
\label{tab::results_26b}
\end{table}

We find $R^{2} = 0.429$, which is higher than without agesq.
Table \ref{tab::results_26b} shows a negative coefficient for agesq, which explains the parabolic shape with a maximum.

\subsection{Question 2.8}

\begin{table}[ht]
\centering
\input{results_28}
\caption{Linear model 2.8}
\label{tab::results_28}
\end{table}

From table \ref{tab::results_28} we can conclude that the coefficient for age does not differ substantially between the models. For school we see a small difference.

\end{document}