Econ 103: Introduction to Econometrics

Author: Lucas Zhang.

Disclaimer: This notebook is intended as a template for students to get started on empirical exercises in the introductory econometrics class. The examples use data from published research papers. As Python is an open-source, all-purpose coding language, there are many different packages that serve the same purpose; those contained in this notebook represent just one possibility. All errors are mine.

Some Q&As

A Starting Point

As an open-source language, Python has a huge number of packages written by professionals and amateurs alike. This is both a blessing and a curse: how do we tell which packages are reliable? Luckily for us, most of the packages we will be using are well maintained and have stable releases, i.e., they have been "peer-reviewed" and are constantly updated to keep up with users' demands.

As you will see, the first thing we do is import all the packages we will be using:
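
A typical import cell looks like the following (this particular set of packages is just one possibility, matching the tools used in this notebook):

```python
import numpy as np                      # numerical computations
import pandas as pd                     # data loading and manipulation
import matplotlib.pyplot as plt         # plotting
import seaborn as sns                   # statistical visualization
import statsmodels.formula.api as smf   # formula-based regression
```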

Comments: In the above code cell, we import some of the most popular packages. Note that we didn't simply import them; we also gave them short "nicknames". You will see that these nicknames make the coding considerably simpler.

Comments: Each of the packages serves its own purpose:

- numpy (np): numerical computations, such as means, logs, and other mathematical functions;
- pandas (pd): loading and manipulating datasets;
- matplotlib.pyplot (plt): plotting;
- seaborn (sns): statistical visualization built on top of matplotlib;
- statsmodels.formula.api (smf): regression and other statistical models.

For details, search for them on Google (or your search engine of choice); each package has its own dedicated documentation page.

Load Data

Without further ado, you need to load your data into the environment. If you have already uploaded the datasets to the Jupyter server, you can simply load them using the following code:
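
For example, if the data are stored in a CSV file, pandas can read it directly. A minimal sketch, assuming the (hypothetical) file name wages.csv and the DataFrame name df used throughout this notebook:

```python
# Read the CSV file into a pandas DataFrame named df
# ("wages.csv" is a placeholder; substitute your actual file name)
df = pd.read_csv('wages.csv')
```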


Take a Look at the Data

Oftentimes, especially when the datasets are large, you won't know what the data look like in advance. As a good habit, you should take a look at a dataset once you have loaded it into the system.
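
A minimal sketch, assuming the data were loaded into the DataFrame df as above:

```python
# Display the first five rows of the dataset
df.head()
```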


Most likely, your data will come with many variables you won't use. Here's one way of choosing only the variables that you care about. For example, I want a new dataset with only two variables: education (educ) and hourly wage (hrwage):
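
One way to do this is to index the DataFrame with a list of column names. A sketch, assuming the full dataset is stored in df:

```python
# Create a new DataFrame df1 containing only educ and hrwage
df1 = df[['educ', 'hrwage']]
df1.head()
```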


We can also select specific rows of data. For example, suppose we want to create a dataset consisting of only female observations:
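
A sketch, assuming female is coded as 1 for female observations (the name df_f is our choice):

```python
# Keep only the rows where the binary variable female equals 1
df_f = df[df['female'] == 1]
```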


We can also create a dataset that contains non-female observations only; call it df_nf:
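
The same idea works with the condition reversed:

```python
# Keep only the rows where female equals 0
df_nf = df[df['female'] == 0]
```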

Summarize the Data

Summary Statistics

The first few rows of data are far from representative of the entire dataset. We can take a look at the summary statistics of the variables.
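
For example, using numpy on the variables in df1 (a sketch):

```python
# Sample mean and standard deviation of hourly wage
print(np.mean(df1['hrwage']))
print(np.std(df1['hrwage']))
```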


Here are some other functions in numpy (np) you can use, applied below to hourly wage for illustration:
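
```python
np.median(df1['hrwage'])          # median
np.var(df1['hrwage'])             # variance
np.min(df1['hrwage'])             # minimum
np.max(df1['hrwage'])             # maximum
np.percentile(df1['hrwage'], 25)  # 25th percentile
```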

In fact, numpy has many mathematical functions built in, such as sine/cosine, exponential, logarithm, etc. You can find a partial list at https://numpy.org/doc/stable/reference/routines.math.html. As we mentioned, Python is open-source, so if you want code for a specific purpose, just search for it; chances are that someone has already written it.

What if we want them all together in one table? It's simple. One way to do this is to use the method .describe() on the dataset df1. See below:
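
```python
# Summary statistics for every variable in df1 in one table
df1.describe()
```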

What if we want to find the covariance between the two variables educ and hrwage? Numpy has a function for that too:
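
A sketch using np.cov, which returns the full covariance matrix:

```python
# np.cov returns the 2x2 covariance matrix of the two variables;
# the off-diagonal entries are the covariance between educ and hrwage
np.cov(df['educ'], df['hrwage'])
```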


Histograms

In addition to the statistical summaries of the data, it is sometimes useful to visualize the data. We start with a single variable: the first thing to do is plot its histogram.
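
A minimal sketch using matplotlib, applied to hourly wage:

```python
# Histogram of hourly wage (the number of bins is a choice)
plt.hist(df['hrwage'], bins=30)
plt.xlabel('hourly wage')
plt.ylabel('frequency')
plt.show()
```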


Scatter Plots

To visualize the relationship between two variables, we can use a scatter plot. That is, we plot one variable against the other.
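
For example, plotting hourly wage against years of education (a sketch):

```python
# Scatter plot of hourly wage against years of education
plt.scatter(df['educ'], df['hrwage'])
plt.xlabel('years of education')
plt.ylabel('hourly wage')
plt.show()
```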


Simple Linear Regression with One Variable

In lecture, we have seen how the theory for simple linear regression works. How do we do it in Python?

Regression Output

There are many libraries that deal with regression analysis. We will use statsmodels.formula.api, as it is a well-maintained package and very intuitive to use. Recall that we imported this package as smf.

In this section, we are going to regress hourly wage hrwage on educational level educ:

$$ \text{hourly wage} = \beta_0 + \beta_1 \text{educ} + \text{residual}$$

and the code is as follows:
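
```python
# Estimate hrwage = b0 + b1*educ by OLS; we store the fitted model
# under the name "model" so that we can reuse it below
model = smf.ols('hrwage ~ educ', data=df).fit()
print(model.summary())
```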


What information is summarized in the table?

Interpretation of Regression Output

Recall our regression output is

\begin{align*} \widehat{\text{hrwage}} &= \underset{(3.597)}{-12.5953} + \underset{(0.276)}{1.9266}\times \text{educ} \end{align*}

where in the above expression, the value inside parentheses under each coefficient is the standard error of that coefficient. hrwage is the hourly wage and educ is the years of education.

Plot the Regression Line

In the previous section, we created a scatter plot of hourly wage against years of education. Now we are going to add the fitted regression line to that scatter plot. There are many different ways of plotting the regression line. Here we give you two possibilities:

Use Seaborn Library
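
A minimal sketch using seaborn's regplot function:

```python
# Scatter plot with the fitted regression line overlaid
sns.regplot(x='educ', y='hrwage', data=df, ci=None)
plt.xlabel('years of education')
plt.ylabel('hourly wage')
plt.show()
```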


Do It Yourself
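
Alternatively, we can draw the line ourselves from the estimated coefficients. A sketch, reusing the fitted model from the regression above:

```python
# Recover the estimated intercept and slope from the fitted model
b0, b1 = model.params

# Plot the data, then draw the fitted line over the range of educ
plt.scatter(df['educ'], df['hrwage'])
xs = np.linspace(df['educ'].min(), df['educ'].max(), 100)
plt.plot(xs, b0 + b1 * xs, color='red')
plt.xlabel('years of education')
plt.ylabel('hourly wage')
plt.show()
```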


Plot the Residuals


Below is the standard code for creating the scatter plot that we have seen before. Are the residuals homoskedastic or heteroskedastic?
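
A sketch, reusing the fitted model from above (the column name resid is our choice):

```python
# Add the residuals from the fitted model to the dataset
df['resid'] = model.resid

# Plot the residuals against the regressor
plt.scatter(df['educ'], df['resid'])
plt.axhline(0, color='red')   # reference line at zero
plt.xlabel('years of education')
plt.ylabel('residual')
plt.show()
```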

Regression with Multiple Regressors

Recall our regression model

$$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i $$

As the explicit formulas for the OLS estimators of $\beta_0,\beta_1,\cdots,\beta_k$ are complicated, in practice we rely on statistical software. In this section, we are going to extend our previous code to the multiple regression setting.

Additional Regressors

We will continue to use our original dataset. Are there factors other than education that are associated with wage? We start by adding the regressor age to the model:
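
A sketch, extending the earlier formula with one more regressor:

```python
# Regress hourly wage on education and age
model2 = smf.ols('hrwage ~ educ + age', data=df).fit()
print(model2.summary())
```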


The interpretation of the regression table is exactly the same as in the simple regression case. Refer to section 5.1 for details.

Adding Binary Regressor

The variable female is a binary variable: it equals 1 if the individual is female and 0 otherwise.

We can include this binary variable in the regression as well:
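
One possible specification, adding female to the previous regression (a sketch):

```python
# Include the binary regressor female
model3 = smf.ols('hrwage ~ educ + age + female', data=df).fit()
print(model3.summary())
```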

Bonus Question: How would you interpret the regression coefficient on female?

Transformation of Dependent Variable and Regressors

In this section, we are going to take a look at how to transform the dependent variable and the regressors in the regression. There are two ways:

  1. edit the data directly, which you have seen when we added the residuals to the dataset;

  2. modify the regression formula (won't be discussed in this class).

The first method is foolproof, and we will focus on it. The second method is optional.

Add Transformed Variables to the Dataset

(1) Suppose we want to add the natural log of wage to the dataset:
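
A sketch (the column name lwage is our choice; any valid name works):

```python
# Create the natural log of hourly wage as a new column
# (the name "lwage" is a choice, not required by any package)
df['lwage'] = np.log(df['hrwage'])
```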


(2) Suppose we want to create a new variable, "potential experience", which is defined as

$$ \text{potential experience} = \text{age} - \text{years of education} - 7 $$

Let's create the potential experience variable and name it pexp:
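
```python
# Potential experience: age minus years of education minus 7
df['pexp'] = df['age'] - df['educ'] - 7
```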


Commonly used arithmetic operations: + (addition), - (subtraction), * (multiplication), / (division), and ** (power).

Let's create another variable, squared potential experience, and name it pexp2:
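
```python
# Squared potential experience, using ** for the power operation
df['pexp2'] = df['pexp'] ** 2
```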

Mincer Regression

In labor economics, the most famous equation is the so-called "Mincer equation", which explains wages as a function of education and experience. It is specified as follows:

$$ \text{log wage} = \beta_0 + \beta_1\text{education} + \beta_2\text{potential experience} + \beta_3(\text{potential experience})^2 + \text{residual} $$

Let's use our transformed variables from the previous section to run the Mincer regression:
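
A sketch, using the columns lwage, pexp, and pexp2 created above:

```python
# Mincer regression: log wage on education, experience, experience squared
mincer = smf.ols('lwage ~ educ + pexp + pexp2', data=df).fit()
print(mincer.summary())
```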

Bonus Questions:

Interaction Between Regressors

In lecture, we have seen that we can include interaction terms to allow for differences in intercepts and differences in slopes across groups in the regression. How do we do this in practice?

To best illustrate this, we will conduct a case study on gender-wage differential.

Gender-Wage Differentials

We will run the following regression specification:

$$\text{hourly wage} = \beta_0 + \beta_1\text{educ} + \beta_2\text{female} + \beta_3(\text{educ}\times\text{female}) + \text{residual} $$

Using the direct method

First, let's start with the direct method by creating the interaction term as a new variable, which we name educ_f.
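
A sketch of the direct method:

```python
# Create the interaction term educ * female as a new variable
df['educ_f'] = df['educ'] * df['female']

# Run the regression with the interaction term included
inter1 = smf.ols('hrwage ~ educ + female + educ_f', data=df).fit()
print(inter1.summary())
```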


Using the package option

Second, we use the option provided by the function smf.ols():
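
In a formula, a colon between two variable names asks smf.ols to construct the interaction itself (a sketch):

```python
# educ:female is the interaction term, built by the formula interface
inter2 = smf.ols('hrwage ~ educ + female + educ:female', data=df).fit()
print(inter2.summary())
```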


Compare the regression outputs from the two methods: they are entirely equivalent.

Bonus Question: How would you interpret the regression coefficient on the interaction term?

Visualization

We can also visualize this in a single plot:
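
One way is seaborn's lmplot, which fits a separate line for each group (a sketch):

```python
# Scatter plot with separate fitted lines for female and non-female observations
sns.lmplot(x='educ', y='hrwage', hue='female', data=df, ci=None)
plt.xlabel('years of education')
plt.ylabel('hourly wage')
plt.show()
```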

Does the graph match the regression results?

Joint Hypothesis Testing

Recall our regression model

$$ Y_i = \beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \cdots + \beta_k X_{ki} + u_i $$

and we are interested in testing joint hypotheses, for example $H_0: \beta_1 = \beta_2 = 0$ vs. $H_1:$ at least one of $\beta_1, \beta_2$ is nonzero.

How do we test hypotheses like this in Python? Luckily, we have a well-written function from statsmodels. We made up an example below to show you how to test such hypotheses.

Suppose we want to run the following specification:

$$ \text{log wage} = \beta_0 + \beta_1 \text{educ} + \beta_2 \text{potential experience} + \beta_3 (\text{potential experience})^2 + \beta_4 \text{female} + \beta_5 \text{white} + \text{residual} $$

and we want to test the hypotheses $H_0: \beta_1 = 0, \beta_2 = 2, \beta_4 = \beta_5$ vs. $H_1:~\text{at least one of the constraints in the null is false}$.
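
A sketch using the f_test method of a fitted model (this assumes the dataset contains a binary column named white):

```python
# Run the regression with all five regressors
res = smf.ols('lwage ~ educ + pexp + pexp2 + female + white', data=df).fit()

# Joint test: the coefficient on educ is 0, the coefficient on pexp is 2,
# and the coefficients on female and white are equal
print(res.f_test('educ = 0, pexp = 2, female = white'))
```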


White Test for Heteroskedasticity

Suppose we want to do a White test for heteroskedasticity for the following regression:

$$ \text{log wage}_i = \beta_0 + \beta_1 \text{educ}_i + \beta_2 \text{potential experience}_i + u_i $$

Step 1: Run the main regression, and save squared residuals
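
A sketch (the column name u2 is our choice):

```python
# Run the main regression
main = smf.ols('lwage ~ educ + pexp', data=df).fit()

# Save the squared residuals as a new column (named "u2" by choice)
df['u2'] = main.resid ** 2
```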


Step 2: Run the auxiliary regression

$$ u^2_i = \alpha_0 + \alpha_1 \text{educ}_i + \alpha_2 \text{potential experience}_i + \alpha_3\text{educ}^2_i + \alpha_4 \text{potential experience}^2_i + \alpha_5 (\text{potential experience}_i\times \text{educ}_i) + e_i $$

That is, regress the squared residuals on the regressors from the main regression, the squared regressors, and the interaction of the regressors.

We want to do an F-test of the null hypothesis $H_0: \alpha_1 = \alpha_2 = \cdots = \alpha_5 = 0$. If the null is rejected, we conclude that heteroskedasticity is present.
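
A sketch using the direct method to build the squared and interaction terms (pexp2 already exists from before; the names educ2 and pexp_educ are our choices). Conveniently, the overall F-statistic reported for the auxiliary regression tests exactly the null that all slope coefficients are zero:

```python
# Squared education and the interaction of the two regressors
df['educ2'] = df['educ'] ** 2
df['pexp_educ'] = df['pexp'] * df['educ']

# Auxiliary regression of the squared residuals
aux = smf.ols('u2 ~ educ + pexp + educ2 + pexp2 + pexp_educ', data=df).fit()

# F-statistic and p-value for H0: all slope coefficients equal zero
print(aux.fvalue, aux.f_pvalue)
```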


From the results of the F-test, we see that the p-value is $p = 0.004$, which is less than $1\%$, so we reject the null at the $1\%$ significance level and conclude that heteroskedasticity is present.

Difference in Differences

Card and Krueger 1994 (Minimum Wage)

In their famous 1994 minimum wage study, Card and Krueger compared employment at fast-food restaurants in New Jersey and Pennsylvania before and after a minimum wage increase in New Jersey (there was no such change in Pennsylvania). Their results suggest that, contrary to the prediction of the textbook model of the minimum wage, the exogenous increase in the minimum wage did not reduce employment. Below, we are going to "replicate" this study using their original dataset, modified for simplicity.

In the did dataset:

- fte: full-time-equivalent employment at the restaurant;
- d: a dummy equal to 1 for observations after the minimum wage increase;
- nj: a dummy equal to 1 for restaurants in New Jersey;
- d_nj: the interaction of d and nj.

We run the baseline regression

$$ \text{fte}_i = \beta_0 + \beta_1 \text{d}_i + \beta_2 \text{nj}_i + \beta_3\text{d_nj}_i + u_i$$
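
A sketch, assuming the data were loaded into a DataFrame named did:

```python
# Baseline difference-in-differences regression
did_ols = smf.ols('fte ~ d + nj + d_nj', data=did).fit()
print(did_ols.summary())
```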

As we showed in lecture, $\beta_3$ is the treatment effect. We can also include control variables in the DiD model. In the data, we have the following binary variables as controls:

Note that each fast-food restaurant belongs to exactly one location (centralj, southj, pa1, pa2) and one brand (bk, kfc, roys, wendys).
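
One possible specification with brand controls, leaving out bk as the base category (a sketch; which dummies to include is a modeling choice):

```python
# DiD regression with brand dummies; bk is the omitted base category
did_ctrl = smf.ols('fte ~ d + nj + d_nj + kfc + roys + wendys', data=did).fit()
print(did_ctrl.summary())
```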

Bonus Question: Why didn't we include Burger King bk in the regression?