
\documentclass[12pt]{article}
\usepackage{natbib}
\usepackage{url}
\usepackage[utf8x]{inputenc}
\usepackage{mathtools}% 
\usepackage{graphicx}
\usepackage{parskip}
\usepackage{xcolor}%
\usepackage{fancyhdr}
\usepackage{vmargin}
\usepackage{booktabs}%
\usepackage{sectsty}% for coloring sections
\setmarginsrb{3 cm}{2.5 cm}{3 cm}{2.5 cm}{1 cm}{1.5 cm}{1 cm}{1.5 cm}

% define your own custom colors
% If you want to change the colors you would need to update the RGB code in the 
% last brackets. Better not change the name of the color as it is used elsewhere
\definecolor{report_main}{HTML}{200045}
\definecolor{report_second}{HTML}{F39912}
\definecolor{report_third}{HTML}{8B0010}

\title{\color{report_main}{Assignment Econometrics 2024}}		% Title
\author{Hendrik Marcel W Tillemans}						% Author
\date{\today}											% Date

\makeatletter
\let\thetitle\@title
\let\theauthor\@author
\let\thedate\@date
\makeatother

\pagestyle{fancy}
\fancyhf{}
\rhead{\theauthor} % header on the right
\lhead{\thetitle}  % header on the left
\cfoot{\thepage}   % footer in the center

\sectionfont{\color{report_main}}
\subsectionfont{\color{report_third}}

%% Add pagebreak before each section
\let\oldsection\section
\renewcommand\section{\clearpage\oldsection}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This is where the actual document starts
% 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{document}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This section details the group information
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\begin{titlepage}
	\centering
    \vspace*{0.5 cm}
    \includegraphics[scale = 0.95]{../figures/vub.png}\\[1.0 cm]	% University Logo
    \textsc{\LARGE \newline\newline Free University Brussels}\\[2.0 cm]	% University Name
	\textsc{\Large \color{report_main}{Class: Econometrics}}\\[0.5 cm]				% Course Code
	\rule{\linewidth}{0.2 mm} \\[0.4 cm]
	{ \huge \bfseries \thetitle}\\
	\rule{\linewidth}{0.2 mm} \\[1.5 cm]
	
	\begin{minipage}{0.5\textwidth}
		\begin{flushleft} \large
			\emph{Professor:}\\
			Jeroen Kerkhof\\
            Faculty of Economic Sciences\\
			\end{flushleft}
			\end{minipage}~
			\begin{minipage}{0.4\textwidth}
            
			\begin{flushright} \large
			\emph{Group:} \\
Hendrik Marcel W Tillemans\\
             
		\end{flushright}
        
	\end{minipage}\\[2 cm]

	% takes the current date	
    \thedate
        	
\end{titlepage}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This details the inclusion (or not) of the table of contents
% and list of figures and tables. 
% You can add/remove page breaks as you see fit.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\tableofcontents

\pagebreak

\listoffigures

\listoftables

\pagebreak


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% This is the start of the actual document content
% You can just write text in here as you would in any other word processor.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%


\section{Simulation Study}


\subsection{Question 1.1}
 
We investigate a linear model with noise

\[y=\beta_0 + \beta_1 x_1 + \beta_2 x_2 + u\]

where

\[x_1 \sim \mathcal{N}(3,\,36)\]
\[x_2 \sim \mathcal{N}(2,\,25)\]
\[u \sim \mathcal{N}(0,\,9)\]

In figure \ref{fig::plot_1_1} we have a 3D representation of the generated model.

\begin{figure}[hb]
\includegraphics[width=0.6\paperwidth]{../figures/question_1_1}
\caption{Generated points for Question 1.1.}
\label{fig::plot_1_1}
\end{figure}


\subsection{Question 1.2}

We now estimate the parameters $\beta_0$, $\beta_1$ and $\beta_2$ using the \textbf{Ordinary Least Squares} (OLS) method, with the model:
\[y_i=\beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + u_i\]
 
\begin{table}[ht]
\centering
\input{table_1_2}
\caption{Linear Fit on Generated Data}
\label{tab::table_1_2}
\end{table}

In table \ref{tab::table_1_2} we can see that the estimates are within 2\% of their true values. This is what we expect from OLS here: our estimation model is the same as the model used to generate the data, we have a sufficient number of points, and the assumptions of OLS are satisfied.
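
As a reminder of why OLS works well in this setting, the argument can be sketched in matrix notation. The notation is an assumption made here for illustration: $X$ is the matrix with a column of ones, $x_1$ and $x_2$, and $\beta$ and $u$ are the stacked parameters and errors.

\[\hat\beta = (X^tX)^{-1}X^ty = \beta + (X^tX)^{-1}X^tu\]
\[E(\hat\beta \mid X) = \beta + (X^tX)^{-1}X^tE(u \mid X) = \beta\]

The last step uses $E(u \mid X) = 0$, which holds here because $u$ is generated independently of $x_1$ and $x_2$ with mean zero.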

\subsection{Question 1.3}
If we compare the estimates with those of question 1.2, we see that the estimate of the intercept is not close to the true value: it differs by about 4.
We can explain the bias of $\beta_0$ because this new model has a new error term: $\beta_2x_{2i} + u_i$. This error term no longer has an expected value of 0, but $\beta_2E(x_{2i}) + E(u_i)= 4$, which is very close to the bias we find. For $\beta_1$, there is little to no bias. This can be explained because $x_2$ and $u$ are stochastically independent of $x_1$. The standard error is bigger because $\beta_2x_{2i} + u_i$ has a bigger variance than just $u_i$.
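
The argument can be made explicit by rewriting the error term so that it has mean zero again. This is only a sketch: it assumes the true values $\beta_0 = 3$ and $\beta_2 = 2$ (the values filled in for question 1.5), uses $E(x_2) = 2$, $Var(x_2) = 25$ and $Var(u) = 9$ from question 1.1, and introduces the symbol $v_i$ purely for illustration.

\begin{align*}
y_i &= \beta_0 + \beta_1 x_{1i} + (\beta_2 x_{2i} + u_i)\\
    &= \underbrace{\beta_0 + \beta_2 E(x_{2i})}_{3 + 2\cdot 2 = 7} + \beta_1 x_{1i} + v_i,
    \qquad v_i = \beta_2\,(x_{2i} - E(x_{2i})) + u_i
\end{align*}
\[E(v_i) = 0, \qquad Var(v_i) = \beta_2^2\,Var(x_{2i}) + Var(u_i) = 4\cdot 25 + 9 = 109\]

Under these assumptions the one-variable regression estimates an intercept near 7 instead of 3, and the standard deviation of its error term is $\sqrt{109} \approx 10.4$ instead of 3, so we would expect standard errors that are roughly 3.5 times larger.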

Which model would you choose? If I have sufficient computational power, I would choose model 1.2, as it is much more accurate. However, in a very resource-constrained situation, model 1.3 might give acceptable estimates.


\begin{table}[ht]
\centering
\input{table_1_3}
\caption{Linear Fit with 1 Variable}
\label{tab::table_1_3}
\end{table}

\subsection{Question 1.4}

In figure \ref{fig::plot_1_4} we have a 3D representation of the generated model.

\begin{figure}[ht]
\includegraphics[width=0.6\paperwidth]{../figures/question_1_4}
\caption{Generated points for Question 1.4.}
\label{fig::plot_1_4}
\end{figure}

The estimation results are similar to those in question 1.2: there is very little bias. It appears that $x_2^{new}$ is sufficiently independent of $x_1$. We expected very little bias because $x_2^{new}$ has a large component that is independent of $x_1$. The standard errors of the estimates of $\beta_1$ and $\beta_2$ are about 25\% higher, which can be explained partly by the lower standard deviation of the part of $x_2^{new}$ that is independent of $x_1$.
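
The size of this increase can be checked with a back-of-the-envelope calculation. The sketch assumes the construction $x_2^{new} = 0.5\,x_1 + x_2^{'}$ with $x_2^{'} \sim \mathcal{N}(5,\,16)$ independent of $x_1$ (the definition used in question 1.5) and the large-sample variance of an OLS slope with two regressors; $\rho$ and $n$ are notation introduced here for the correlation between $x_1$ and $x_2^{new}$ and the number of observations.

\[Var(x_2^{new}) = 0.25\cdot 36 + 16 = 25, \qquad Cov(x_1, x_2^{new}) = 0.5\cdot 36 = 18\]
\[\rho^2 = \frac{18^2}{36\cdot 25} = 0.36\]
\[Var(\hat\beta_j) \approx \frac{\sigma^2}{n\,Var(x_j)\,(1-\rho^2)}
\quad\Longrightarrow\quad
\frac{se_{1.4}(\hat\beta_j)}{se_{1.2}(\hat\beta_j)} \approx \frac{1}{\sqrt{1-0.36}} = 1.25\]

Since the variances of both regressors are the same as in question 1.2 (36 and 25), only the factor $1-\rho^2$ changes, which is consistent with standard errors that are about 25\% higher for both slopes.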

\begin{table}[ht]
\centering
\input{table_1_4}
\caption{New Linear Fit on Generated Data}
\label{tab::table_1_4}
\end{table}


\subsection{Question 1.5}

As in question 1.3, we estimate the parameters with a single independent variable.

\begin{table}[ht]
\centering
\input{table_1_5}
\caption{Linear Fit with 1 Variable}
\label{tab::table_1_5}
\end{table}

The OLS estimator for the slope coefficient is biased. We see that the estimate of $\beta_1$ is $-3$ instead of the true value of $-4$. We can explain this bias in the following way. Let us start from the model:
\[y^{new}=\beta_0 + \beta_1 x_1 + \beta_2 x_2^{new} + u\]

We now have:

\[x_2^{new} = 0.5\, x_1 + x_2^{'}\]

where

\[x_2^{'}\sim \mathcal{N}(5,\,16)\]

Substituting in the model:

\[\Longrightarrow y^{new}=\beta_0 + \beta_1 x_1 + \beta_2(0.5\, x_1 + x_2^{'}) + u\]

Let us fill in the betas with the actual values:

\[\Longrightarrow y^{new}= 3 - 4 x_1 + 2(0.5\, x_1 + x_2^{'}) + u\]

\[\Leftrightarrow y^{new}= 3 - 4 x_1 + x_1 + 2x_2^{'} + u\]

\[\Leftrightarrow y^{new}= 3 - 3 x_1 + 2x_2^{'} + u\]

Because $x_2^{'}$ is independent of $x_1$, leaving it out does not bias the slope any further, so the OLS estimator finds $-3$ as the estimate for $\beta_1$, as we can see in table \ref{tab::table_1_5}.

As in question 1.3, we can explain the bias on the intercept.
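
The same result follows from the general omitted-variable formula. This is a sketch using the values of the derivation above together with $Var(x_1) = 36$ from question 1.1; $\tilde\beta_0$ and $\tilde\beta_1$ are notation introduced here for the coefficients that the one-variable regression converges to.

\[\tilde\beta_1 = \beta_1 + \beta_2\,\frac{Cov(x_1, x_2^{new})}{Var(x_1)} = -4 + 2\cdot\frac{0.5\cdot 36}{36} = -3\]
\[\tilde\beta_0 = \beta_0 + \beta_2\,E(x_2^{'}) = 3 + 2\cdot 5 = 13\]

Under these assumptions we would expect the intercept estimate to be close to 13 rather than 3.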

\subsection{Question 1.6}

Now we replace $x_1$ in the original model with 
\[x_1 \sim \mathcal{N}(3,\,1)\]

If we now estimate the parameters we find:
\begin{table}[ht]
\centering
\input{table_1_6}
\caption{Generated Data with Small Variance on $x_1$}
\label{tab::table_1_6}
\end{table}

We find in table \ref{tab::table_1_6} that the estimates are essentially unbiased but have a bigger standard error for the intercept and $\beta_1$. The standard error of $\beta_1$ is 6 times bigger (from 0.016 to 0.10). We see no difference in the estimate of $\beta_2$, because nothing has changed in $x_2$.

\begin{figure}[ht]
\includegraphics[width=0.6\paperwidth]{../figures/question_1_6}
\caption{Generated points for Question 1.6.}
\label{fig::plot_1_6}
\end{figure}

We expected an estimation result similar to that of question 1.2 because nothing has changed except the standard deviation of $x_1$. This means that the OLS assumptions are equally valid and we expect unbiased estimates.

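We can explain the difference in the standard error of the estimate of $\beta_1$ using the formula for $Var(\hat\beta_1)$:

\[Var(\hat\beta_1) = \sigma^2\left[(X^tX)^{-1}\right]_{11}\]

where the subscript denotes the diagonal element that belongs to $\beta_1$. Because $x_1$ and $x_2$ are independent, we can write this approximately as

\[Var(\hat\beta_1) \approx \frac{\sigma^2}{n\,Var(x_1)}\]

with $n$ the number of observations. This means that $Var(\hat\beta_1) \propto 1/Var(x_1)$.

Because $Var(x_1)$ changed from 36 to 1, we expect the standard error to be $\sqrt{36} = 6$ times bigger, which is exactly what we found.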
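
The step from the matrix formula to the dependence on $Var(x_1)$ can be sketched as follows. The assumptions are made here for illustration: there are $n$ observations, the regressors are measured in deviation from their sample means (which leaves the slope estimates unchanged), and $X_c$ denotes the resulting matrix with the two centered regressors as columns.

\[X_c^tX_c \approx n\begin{pmatrix} Var(x_1) & 0\\ 0 & Var(x_2)\end{pmatrix}
\quad\Longrightarrow\quad
Var(\hat\beta_1) \approx \frac{\sigma^2}{n\,Var(x_1)}\]

The off-diagonal elements are approximately zero because $x_1$ and $x_2$ are generated independently. With the variances of questions 1.2 and 1.6 this gives

\[\frac{se_{1.6}(\hat\beta_1)}{se_{1.2}(\hat\beta_1)} \approx \sqrt{\frac{36}{1}} = 6, \qquad 6\cdot 0.016 = 0.096 \approx 0.10\]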

If the standard deviation of $x_1$ goes to 0, $\beta_1$ cannot be calculated: $x_1$ is then constant and perfectly collinear with the intercept, which violates the no-multicollinearity assumption.


\section{Empirical Investigation}


\subsection{Question 2.1}

We retain 2510 observations.

\begin{table}[ht]
\centering
\input{summary_stats}
\caption{Summary statistics}
\label{tab::summary_stats}
\end{table}


\subsection{Question 2.2}

\begin{figure} [ht]
\includegraphics[width=0.6\paperwidth]{../figures/question_2_2_wage}
\caption{Histogram wage}
\label{fig::question_2_2_wage}
\end{figure}

\begin{figure} [ht]
\includegraphics[width=0.6\paperwidth]{../figures/question_2_2_lwage}
\caption{Histogram lwage}
\label{fig::question_2_2_lwage}
\end{figure}

The lwage histogram in figure \ref{fig::question_2_2_lwage} is nicely centered, so there is no need to remove any outliers. It is also close to a normal distribution. The wage histogram in figure \ref{fig::question_2_2_wage} is not symmetric but leans to the left, so wage is clearly not normally distributed.

\subsection{Question 2.3}

We are going to investigate the correlation between the variables wage, age, school, man, malay, chinese and indian.

\begin{table}[ht]
\centering
\input{table_2_3}
\caption{Correlation matrix}
\label{tab::table_2_3}
\end{table}

We can see in table \ref{tab::table_2_3} that there is a positive correlation between wage and school: people who stay in school longer tend to earn a higher wage. There is a negative correlation between age and school: the younger generation is more highly educated than the older generation. Chinese citizens are better paid than malay citizens, and being indian is negatively correlated with wage.

\subsection{Question 2.4}

We estimate a regression for lwage using the variables chinese and indian. Malay is left out, so its influence follows from the results as the reference level.

\begin{table}[ht]
\centering
\input{results_24}
\caption{Linear model lwage}
\label{tab::results_24}
\end{table}

$R^{2} = 0.0255$, which means that ethnicity alone explains only about 2.5\% of the variation in lwage. If we look at the coefficients in table \ref{tab::results_24}, we see a negative value of 0.17 for indian and a positive value of 0.14 for chinese. Malay is the omitted reference category, so both coefficients measure the difference in lwage relative to malay, whose own level is captured by the intercept.
These results suggest that there is a wage gap based on ethnicity.
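
To give a feeling for the size of this gap, the coefficients can be translated into percentage differences in wage. This is a sketch that assumes lwage is the natural logarithm of wage and uses the rounded coefficients quoted above; the differences are relative to the group without its own dummy variable.

\[100\,(e^{0.14}-1) \approx 15.0\%, \qquad 100\,(e^{-0.17}-1) \approx -15.6\%\]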

\subsection{Question 2.5} 
We estimate a regression for lwage using the variables chinese, indian and school.

\begin{table}[ht]
\centering
\input{results_25}
\caption{Linear model lwage/school}
\label{tab::results_25}
\end{table}

$R^{2} = 0.224$, which means that the model now explains about 22\% of the variation in lwage. If we look at the coefficients in table \ref{tab::results_25}, we see a negative value of 0.07 for indian and a positive value of 0.18 for chinese, again measured relative to malay as the omitted reference category.

To check whether the relationship between lwage and years of schooling is non-linear, we plot it:

\begin{figure} [ht]
\includegraphics[width=0.6\paperwidth]{../figures/question_2_5}
\caption{lwage vs school}
\label{fig::question_2_5}
\end{figure}

In figure \ref{fig::question_2_5} there is no obvious non-linearity.

\subsection{Question 2.6} 
We estimate a regression for lwage using the variables chinese, indian, school and age.

\begin{table}[ht]
\centering
\input{results_26}
\caption{Linear model lwage/age}
\label{tab::results_26}
\end{table}

$R^{2} = 0.370$, which means that the model explains about 37\% of the variation in lwage. The coefficients are reported in table \ref{tab::results_26}.

To check whether the relationship between lwage and age is non-linear, we plot it:

\begin{figure} [ht]
\includegraphics[width=0.6\paperwidth]{../figures/question_2_6}
\caption{lwage vs age}
\label{fig::question_2_6}
\end{figure}

In figure \ref{fig::question_2_6} there is a banana-shaped pattern indicating a non-linear relationship, with earnings peaking around age 40.
We can use \textbf{agesq} to do a parabolic fit. If we run this model we get:
 
\begin{table}[ht]
\centering
\input{results_26b}
\caption{parabolic model lwage/age}
\label{tab::results_26b}
\end{table}

We find $R^{2} = 0.429$, which is higher than without agesq.
Table \ref{tab::results_26b} shows a negative coefficient for agesq, which explains the parabolic shape with a maximum.
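
The location of this maximum follows directly from the estimated coefficients. In the sketch below, $\beta_{age}$ and $\beta_{agesq}$ are notation introduced here for the coefficients on age and agesq, and agesq is assumed to be the square of age.

\[\frac{\partial\,\mathit{lwage}}{\partial\,\mathit{age}} = \beta_{age} + 2\,\beta_{agesq}\,\mathit{age} = 0
\quad\Longrightarrow\quad
\mathit{age}^{*} = -\frac{\beta_{age}}{2\,\beta_{agesq}}\]

With $\beta_{agesq} < 0$ this turning point is a maximum. Plugging in the estimates from the table should give an age close to the peak seen in the plot.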

\subsection{Question 2.8} 

\begin{table}[ht]
\centering
\input{results_28}
\caption{Linear model 2.8}
\label{tab::results_28}
\end{table}

From table \ref{tab::results_28} we can conclude that the coefficient for age does not differ substantially between the tables. For school we see a small difference.

\end{document}