ECONOMETRICS FOR EC307 Oriana Bandiera Imran Rasul February 22, 2000

WARNING: The sole purpose of these notes is to give you a “road map” of the econometric issues and techniques you are likely to see in the class papers. They are BY NO MEANS a substitute for an econometric book or course. You should use them as a reference to recall concepts you already know. ALWAYS refer to a book (Dougherty or Greene) for a comprehensive analysis. MEMORISING THESE NOTES WILL NOT HELP YOU IN THE EXAM.

TOPICS:

1. The Linear Regression model

2. Inference in the OLS model

3. Problems for the OLS model (heteroscedasticity, autocorrelation)

4. GLS

5. Panel Data

6. Simultaneous Equations (endogeneity bias)

7. 2SLS

8. Multicollinearity

9. Omitted Variable Bias

10. Including Irrelevant Variables

11. Measurement Error

12. Limited Dependent Variables: Probit and Logit

13. Fixed Effects (again)

14. FIML and LIML

15. Non Parametric Estimation

0.1 Types of Econometric Data

The unit of each observation, i, can be an individual, family, school, firm, region, country etc. These data can be;

(a) Cross sectional : collected for a sample of units at a given moment in time

(b) Time series : collected for a given unit, over several time periods

(c) Panel : collected for a sample of units over a period of time. If this period is

the same for all i, this is a balanced panel, otherwise it is an unbalanced panel.

1 The Linear Regression Model

The underlying idea in a multiple regression model is that there is some relationship

between a ‘dependent’ variable, y, and a set of ‘explanatory’ or ‘independent’

variables, $x_1, x_2, \ldots, x_K$;

$$y = f(x_1, x_2, \ldots, x_K) \tag{1}$$

In a sense we are identifying a causal relationship between the x variables on the RHS and the y on the LHS. The basic assumption of the model is that the sample observations on y may be expressed as a linear combination of the sample observations on the explanatory x variables plus a disturbance vector, u;

$$y = \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_K x_K + u \tag{2}$$

More precisely, if we have N observations on y and the corresponding x's, then for observation i;

$$y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \ldots + \beta_K x_{Ki} + u_i \tag{3}$$

The disturbance term reflects the fact that no empirical relationship is ever exact, but on average, a relationship defined by (3) is expected to hold, so on average we expect our disturbances to be zero, hence;

$$E(u_i) = 0 \text{ for each } i$$

Graphically we can illustrate what we are trying to do in the simple regression

case;

[Figure: Fitting a Regression Line (axes: x, y)]

Our linear regression line is the line which “best fits” the sample data. Heuristically, this line is that which minimises the distance between itself and the actual y values observed for each x observation. The gap between the actual y observation and the fitted, or predicted, y, $\hat{y}$, from our regression line is called the “residual”;

$$\text{residual}: \quad e_i = y_i - \hat{y}_i$$

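[Figure: Residuals, $e_i$, shown as the vertical gaps between each observed $y_i$ and the fitted $\hat{y}_i$ (axes: x, y)]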

The “ordinary least squares” (OLS) method of fitting a regression line thus solves the following in the simple regression case (with only one explanatory variable plus the intercept);

$$\min_{\beta} \sum_i e_i^2 \tag{4}$$

We choose $\sum_i e_i^2$ rather than $\sum_i e_i$ because the latter may equal zero even though the fit is very poor, since huge positive and negative $e_i$'s cancel out.
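As an illustration of the least squares idea, here is a minimal sketch in Python (NumPy assumed available). The data are made up for the example and the helper name `fit_simple_ols` is hypothetical; it fits the intercept and slope that minimise the sum of squared residuals and shows that the plain sum of residuals is zero at the optimum, which is why it cannot be used as a criterion of fit.

```python
import numpy as np

def fit_simple_ols(x, y):
    """Return (a, b) minimising sum((y - a - b*x)**2) for one regressor."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Slope = sample covariance of (x, y) divided by sample variance of x.
    b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    # The fitted line passes through the point of sample means.
    a = y.mean() - b * x.mean()
    return a, b

# Made-up illustrative data (not taken from the notes).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

a, b = fit_simple_ols(x, y)
residuals = y - (a + b * x)           # e_i = y_i - yhat_i
rss = float(np.sum(residuals ** 2))   # the quantity OLS minimises

print("intercept:", a, "slope:", b)
print("sum of residuals:", residuals.sum())
print("sum of squared residuals:", rss)
```

Any other choice of intercept and slope gives a larger sum of squared residuals on the same data.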

1.1 Gauss-Markov Conditions

A1: Each disturbance term on average is equal to zero

$E(u) = 0 \Rightarrow E(y) = X\beta$, which will be satisfied automatically if a constant term is included, since the role of the constant term is to pick up any systematic tendency in y not accounted for by the explanatory variables in the regression.

A2: Each disturbance term has the same variance (around mean zero)

$\Rightarrow$ (a) each $u_i$ distribution has the same variance (homoscedasticity)

$\Rightarrow$ (b) all disturbances are pairwise uncorrelated (no serial correlation)

(b) means that the size of the disturbance term for individual i has no influence on the size of the disturbance for individual j, for $i \neq j$.

A3: All the explanatory variables contribute something in explaining the variation in the data

$\Rightarrow$ the explanatory variables do not form a linearly dependent set.

A4: The explanatory variables are fixed and can be taken as given

X is a non stochastic matrix $\Rightarrow$ in our sample, the only source of variation is in the u vector and hence in the y vector, $\Rightarrow cov(X, u) = 0$.

A5: The disturbance term follows a normal distribution

u has a multivariate normal distribution $\Rightarrow$ by A1, A2 and A5, $u \sim N(0, \sigma^2 I)$.

Theorem 1 (Gauss Markov): Under the Gauss Markov assumptions A1-A5 the OLS estimator is BLUE (best linear unbiased estimator). Strictly, only A1-A4 are needed for this result; normality (A5) is what justifies the exact t and F distributions used for inference below.

1.2 Exogenous and Endogenous Variables

We have two types of variable that we consider with respect to a given model. An

exogenous variable is one that is determined outside of the model under consideration,

and so its value is taken as given. An endogenous variable is one whose value is

explained from within the model. In our model above, X is exogenous, and y is

endogenous.

1.3 Unbiasedness, Efficiency and Consistency

There are three principal properties we look for in any estimator;

• Unbiasedness: on average, we expect the estimated parameter to equal the true population value of the parameter (in the OLS case, $\hat{\beta}_{OLS}$ is unbiased as $E(\hat{\beta}_{OLS}) = \beta$).

• Efficiency: an estimator, $\hat{\theta}_1$, is said to be more efficient than another estimator, $\hat{\theta}_2$, if $Var(\hat{\theta}_1) < Var(\hat{\theta}_2)$

• Consistency: an estimator, $\hat{\theta}_1$, is said to be consistent if, as the sample size increases to infinity, $\lim_{n \to \infty} Var(\hat{\theta}_1) = 0$ and $\lim_{n \to \infty} E(\hat{\theta}_1) = \theta$.

[Figure: Consistency, Efficiency and Bias (sampling distributions of estimators around $\beta$; labels: unbiased, more efficient)]

Hence if OLS is BLUE, this implies that no other unbiased linear estimator has a smaller variance (is more efficient) than the OLS estimator under the Gauss Markov assumptions.

1.4 Goodness of Fit

A summary statistic of the goodness of fit of our regression model is given by the $R^2$ statistic. Clearly,

$$0 \le R^2 \le 1$$

and as $R^2 \to 1$ the fit of the model is said to improve.

One word of warning is that as the number of explanatory variables, K, increases, it can be shown that $R^2$ never falls and so can be brought arbitrarily close to one simply by including more explanatory variables into the regression, even if these explanatory variables are found to be insignificant.

In order to prevent this problem, an “adjusted $R^2$” or $\bar{R}^2$ statistic is often reported, defined as;

$$\bar{R}^2 = \frac{(N - 1)R^2 - K}{N - K - 1} < R^2 \tag{5}$$

where K here counts the explanatory variables excluding the intercept. This does not necessarily rise with K. It can be shown that the addition of a new variable to the regression will cause $\bar{R}^2$ to rise iff its t-statistic is greater than one in absolute value (which still does not imply that the variable is significant), therefore a rise in $\bar{R}^2$ does not necessarily establish that the specification of the model has improved, although it is a better indicator of this than $R^2$, just not a perfect one.
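The behaviour of $R^2$ and $\bar{R}^2$ can be checked numerically. The following is a sketch on simulated data (the data, the seed and the helper names `r_squared` and `adjusted_r_squared` are assumptions for illustration only): it adds an irrelevant regressor and compares both statistics, using equation (5) with K counted excluding the intercept.

```python
import numpy as np

def r_squared(y, X):
    """R^2 from an OLS fit of y on X (X must already contain a column of ones)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - float(np.sum(resid ** 2) / tss)

def adjusted_r_squared(r2, n, k):
    """Equation (5); k counts explanatory variables excluding the intercept."""
    return ((n - 1) * r2 - k) / (n - k - 1)

rng = np.random.default_rng(0)
n = 50
x1 = rng.normal(size=n)
junk = rng.normal(size=n)                  # irrelevant regressor, unrelated to y
y = 1.0 + 2.0 * x1 + rng.normal(size=n)

X_small = np.column_stack([np.ones(n), x1])
X_big = np.column_stack([X_small, junk])

r2_small, r2_big = r_squared(y, X_small), r_squared(y, X_big)
adj_small = adjusted_r_squared(r2_small, n, 1)
adj_big = adjusted_r_squared(r2_big, n, 2)

print("R2:         ", r2_small, "->", r2_big)     # never falls
print("adjusted R2:", adj_small, "->", adj_big)   # may rise or fall
```

Whether the adjusted statistic rises or falls here depends on the t-statistic of the added variable in the particular simulated sample.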

1.5 Interpretation of the Estimated Parameters, $\hat{\beta}$

The estimated parameters, $\hat{\beta} = (\hat{\beta}_1, \hat{\beta}_2, \ldots, \hat{\beta}_K)$, have a very simple interpretation. Let us consider one particular element of $\hat{\beta}$, $\hat{\beta}_i$;

• the standard interpretation of $\hat{\beta}_i$ is that if $x_i$ increases by one unit, holding the other explanatory variables constant, then this causes y to increase by $\hat{\beta}_i$ units

• if the LHS variable y is in log form (but X is not) then the equation is in semi-logarithmic form. In this case the interpretation is that, for small $\beta_i$, if $x_i$ increases by one unit this leads to approximately a $100\beta_i$% increase in y.

• if all variables are in logarithmic form then the interpretation is that $\beta_i$ corresponds to the elasticity of y with respect to $x_i$: a 1% change in $x_i$ leads to a $\beta_i$% change in y.

2 Inference in the OLS Model

2.1 t-tests

A hypothesis that we commonly want to test is;

Null hypothesis- $H_0: \beta_i = 0$

Alternative hypothesis- $H_1: \beta_i \neq 0$

If we accept $H_0$ $\Rightarrow$ the explanatory variable corresponding to $\beta_i$, namely $x_i$, is not important (or ‘insignificant’) in explaining y. We statistically test for this using the t-test statistic;

$$t = \frac{\hat{\beta}_i}{se(\hat{\beta}_i)} \overset{H_0}{\underset{a}{\sim}} t(N - K) \tag{6}$$

where $\hat{\beta}_i$ refers to the OLS estimate of $\beta_i$, and

$$se(\hat{\beta}_i) = \sqrt{Var(\hat{\beta}_i)} = \sqrt{\sigma^2 a_{ii}} = \sigma\sqrt{a_{ii}} \tag{7}$$

where $a_{ii}$ is the i-th diagonal element of $(X'X)^{-1}$, N = number of observations, and K = number of explanatory variables including the intercept.

In general to test $H_0: \beta_i = \tilde{\beta}$ we use the test statistic;

$$t = \frac{\hat{\beta}_i - \tilde{\beta}}{se(\hat{\beta}_i)} \overset{H_0}{\underset{a}{\sim}} t(N - K) \tag{8}$$

This test statistic can be easily computed once we have performed OLS, and will give us a value for each $\beta_i$. This value is compared against the ‘critical’ value given by the $t(N - K)$ distribution.

Accept $H_0$ iff

$$\left| \frac{\hat{\beta}_i}{se(\hat{\beta}_i)} \right| < t_{crit}(N - K) \tag{9}$$

Do not accept $H_0$ iff

$$\left| \frac{\hat{\beta}_i}{se(\hat{\beta}_i)} \right| > t_{crit}(N - K) \tag{10}$$

where $t_{crit}(N - K)$ is derived from tables for the appropriate significance level, e.g. 5%, 1%, and is approximately equal to two at the 5% level unless the sample is very small. A rough and ready calculation that you can do is to reject $H_0$ iff the t-statistic is more than two in absolute value. If (10) holds, then the variable $x_i$ is said to be ‘significant’. If (9) holds, then the variable $x_i$ is said to be ‘insignificant’. Either the t-statistic (in absolute value) or the standard error will be reported with coefficient estimates.

2.1.1 Degrees of Freedom

The test statistic for the t-test follows a $t(N - K)$ distribution. $(N - K)$ is referred to as the ‘degrees of freedom’ for the test. To get some intuition of where this comes from consider the simple regression model, $y = \alpha + \beta x + u$. Here the t-test for $H_0: \beta = 0$ uses the $t(N - 2)$ distribution. We have to take off two degrees of freedom from the sample size because the first two observations give us no information on the quality of the fit: a line can always be drawn exactly through two points;

[Figure: Degrees of Freedom (panel labels: no information; perfect relationship (untrue); regression)]

2.1.2 Confidence Intervals

Consider the simple regression line: $\hat{y} = 95.3 + \underset{(0.08)}{2.53}\, t$, where t is time, and (0.08) is the se on the slope estimate, estimated from a sample size of 23. To test $H_0: \beta = 0$ we use the test statistic;

$$t = \frac{\hat{\beta}}{se(\hat{\beta})} = \frac{2.53}{0.08} = 31.625$$

With n = 23 there are n - 2 = 21 degrees of freedom, so the critical value for this test is $t_{n-2;\,0.025} = t_{21;\,0.025} = 2.080$ (5% significance level, two-sided). Clearly, as 31.625 > 2.080, we reject $H_0$ and conclude that time does matter. We can construct a confidence interval from our t-statistic. We know $\hat{\beta}$ (written b below) is just an estimate of $\beta$, so in what range might we reasonably expect the true $\beta$ to lie? For large n, the t distribution is very similar to the normal distribution. Using this fact we can construct the following confidence intervals;

$$\text{To cover 95\% of the distribution}: \quad \beta \in [b \pm 1.96\, se(b)] \tag{11}$$

$$\text{To cover 99\% of the distribution}: \quad \beta \in [b \pm 2.58\, se(b)]$$

[Figure: 95% Confidence Interval, running from $b - 1.96\, se(b)$ to $b + 1.96\, se(b)$, with 2.5% of the distribution in each tail]

In practice we need to take account of the fact that we don't know $\sigma_u^2$. Hence we use the t-distribution to form the confidence intervals;

$$\text{To cover 95\% of the distribution}: \quad \beta \in [b \pm t_{n-2;\,0.025}\, se(b)] \tag{12}$$

$$\text{To cover 99\% of the distribution}: \quad \beta \in [b \pm t_{n-2;\,0.005}\, se(b)]$$

Above we rejected $H_0: \beta = 0$ at the 5% significance level. This is equivalent to saying that 0 does not lie in the 95% confidence interval.
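The arithmetic of this example can be reproduced in a few lines. This is a sketch only: the estimate, standard error and sample size come from the example above, while the critical value 2.080 is the two-sided 5% value for n - 2 = 21 degrees of freedom copied from standard t tables and hard-coded here as an assumption, so that no statistics library is needed.

```python
# Numbers from the worked example: slope estimate, its standard error, sample size.
b, se_b, n = 2.53, 0.08, 23

# Two-sided 5% critical value of the t distribution with n - 2 = 21 degrees
# of freedom, taken from standard tables (hard-coded assumption).
T_CRIT_21_DF = 2.080

t_stat = b / se_b                         # test statistic for H0: beta = 0
ci_low = b - T_CRIT_21_DF * se_b          # lower end of the 95% interval
ci_high = b + T_CRIT_21_DF * se_b         # upper end of the 95% interval

reject_h0 = abs(t_stat) > T_CRIT_21_DF
zero_outside_ci = not (ci_low <= 0.0 <= ci_high)

print("t statistic:", t_stat)
print("95% confidence interval:", (ci_low, ci_high))
print("reject H0:", reject_h0, "| zero outside interval:", zero_outside_ci)
```

The last line illustrates the equivalence stated in the text: rejecting at the 5% level and finding zero outside the 95% interval are the same event.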

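2.2 F-tests

Another commonly reported test is that which tests the joint significance of all the slope coefficients (taking $\beta_1$ to be the intercept);

$$H_0: \beta_2 = \beta_3 = \ldots = \beta_K = 0$$

To test $H_0$ we use the test statistic;

$$F = \frac{ESS/(K - 1)}{RSS/(N - K)} = \frac{R^2/(K - 1)}{(1 - R^2)/(N - K)} \overset{H_0}{\underset{a}{\sim}} F[(K - 1), (N - K)] \tag{13}$$

Again we have to choose an appropriate critical value for $F[(K - 1), (N - K)]$ to set the significance level. We can also use the F-test to test whether a group of explanatory variables are jointly significant. Suppose the smaller model has K explanatory variables and the larger model has M > K (in both cases excluding the intercept $\alpha$);

$$H_0: \beta_{K+1} = \beta_{K+2} = \ldots = \beta_M = 0$$

So under $H_0$ the regression model is;

$$y = \alpha + \beta_1 x_1 + \ldots + \beta_K x_K + u \quad \to \quad RSS_K$$

Under $H_1$ the regression model is;

$$y = \alpha + \beta_1 x_1 + \ldots + \beta_K x_K + \beta_{K+1} x_{K+1} + \ldots + \beta_M x_M + u \quad \to \quad RSS_M$$

The test statistic to use is;

$$F = \frac{(RSS_K - RSS_M)/(M - K)}{RSS_M/(N - M - 1)} \overset{H_0}{\underset{a}{\sim}} F[(M - K), (N - M - 1)] \tag{14}$$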
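A sketch of the F statistic in (14) on simulated data follows. Here k and m denote the number of explanatory variables (excluding the intercept) in the smaller and the larger model respectively; the data, the seed and the helper names `rss` and `nested_f_stat` are assumptions made for the illustration.

```python
import numpy as np

def rss(y, X):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def nested_f_stat(y, X_restricted, X_unrestricted):
    """F statistic of (14); both design matrices include a column of ones."""
    n = len(y)
    k = X_restricted.shape[1] - 1      # regressors in the smaller model
    m = X_unrestricted.shape[1] - 1    # regressors in the larger model
    rss_k = rss(y, X_restricted)
    rss_m = rss(y, X_unrestricted)
    f = ((rss_k - rss_m) / (m - k)) / (rss_m / (n - m - 1))
    return f, (m - k, n - m - 1)

rng = np.random.default_rng(3)
n = 60
x1, x2, x3 = rng.normal(size=(3, n))
y = 1.0 + 2.0 * x1 + 0.5 * x2 + rng.normal(size=n)

ones = np.ones(n)
X_small = np.column_stack([ones, x1])            # k = 1
X_large = np.column_stack([ones, x1, x2, x3])    # m = 3

f_stat, dof = nested_f_stat(y, X_small, X_large)
print("F =", f_stat, "with degrees of freedom", dof)
```

The statistic would then be compared with the tabulated critical value of the F distribution with the reported degrees of freedom. Taking the restricted model to be the intercept alone reproduces the overall significance test in (13).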

2.3 Type I and Type II Errors

The significance level refers to the probability that you will not accept $H_0$ even though the true $\beta_i$ accords with $H_0$. Hence the significance level (which is typically 1% or 5%) gives the probability of wrongly rejecting a true null hypothesis ($H_0$), known as a ‘type I error’.

Clearly we want to minimise this as much as possible, but as we decrease the probability of a type I error (by changing our critical value, $t_{crit}(N - K)$, accordingly), we necessarily increase the probability of a type II error, namely, accepting a false null hypothesis, $H_0$. Hence we face a trade-off between the two types of error.

              H0 accepted       H0 rejected
H0 true       (correct)         type I error
H0 false      type II error     (correct)

It is standard practice in the literature to fix the probability of a type I error, the significance level, at either 1%, 5% or 10%. The exact probability of a type II error will then depend on the specific test that we are using. A ‘good’ test will give a low probability of a type II error when the probability of a type I error is low. In that case the test is said to have ‘high power’.

[Figure: Type I and Type II Errors (axes: prob type I error, prob type II error; curves: standard trade-off, high power test)]

2.4 Dummy Variables

Note that in the last example, one of the x’s was the variable ‘male’. This is an

example of a ‘dummy’ variable which takes the following form;

$$male_i = \begin{cases} 1 & \text{if individual } i \text{ is male} \\ 0 & \text{otherwise} \end{cases} \tag{15}$$

We can define dummies for all such dichotomous variables, e.g. race, seasonals.

2.5 Interaction Variables

Consider a wage equation;

$$w = \beta_1 + \beta_2\, age + \beta_3 d + u$$

$$d = \begin{cases} 1 & \text{if college graduate} \\ 0 & \text{otherwise} \end{cases}$$

Now suppose we want to examine the hypothesis that “not only are the salaries of college graduates higher than those of non college graduates at any given age, but they rise faster as the individuals get older”. To test this we require the inclusion of an interaction term;

$$w = \beta_1 + \beta_2\, age + \beta_3 d + \beta_4 (d \cdot age) + u \tag{16}$$
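As a worked step (a sketch that simply substitutes d = 0 and d = 1 into (16), assuming $E(u \mid age, d) = 0$), the two expected wage profiles are;

$$E(w \mid age, d = 0) = \beta_1 + \beta_2\, age$$

$$E(w \mid age, d = 1) = (\beta_1 + \beta_3) + (\beta_2 + \beta_4)\, age$$

So the graduate premium at a given age is $\beta_3 + \beta_4\, age$, and the claim that graduate salaries rise faster with age corresponds to $\beta_4 > 0$, which can be examined with a t-test on $\hat{\beta}_4$.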

[Figure: Interaction Terms (axes: age, wage). Non graduates: intercept $\beta_1$, slope $\beta_2$. Graduates: intercept shifted by $\beta_3$, slope $\beta_2 + \beta_4$]

2.6 Chow Test

It sometimes happens that your sample observations potentially contain two subsamples, e.g. male and female. Do we run separate or combined regressions? Sometimes we can combine the subsamples using a dummy variable, e.g. a male dummy, or we can allow for interaction terms that don't restrict the coefficients to be the same for each subsample;

[Figure: Chow Test. Panels: Combined Regression; Separate Regression (axes: x, y)]

Suppose you have two subsamples, A and B, giving residual sums of squares $U_A$ and $U_B$ from the separate regressions.

If you run a pooled regression, P, the residual sum of squares is $U_P = U_A^P + U_B^P$ (where $U_i^P$ is the contribution to the RSS from subsample i).

The subsample regressions must fit the data at least as well as the pooled sample $\Rightarrow U_A \le U_A^P$ and $U_B \le U_B^P$ and therefore $(U_A + U_B) \le U_P$.

$(U_A + U_B) = U_P$ only in the case when there is no need to split the samples.

There is a price to pay for the improved fit using subsamples: we lose degrees of freedom as (k + 1) extra parameters are estimated (k = # explanatory variables). Hence we have (2k + 2) parameters to estimate (k slope coefficients and an intercept in each of the two subsamples). Is the improvement in fit significant? We use the following F-statistic;

$$\text{Chow Test}: \quad F = \frac{\text{Improvement in fit / dof used up}}{\text{Unexplained / dof remaining}} \tag{17}$$

$$= \frac{(U_P - U_A - U_B)/(k + 1)}{(U_A + U_B)/(n - 2k - 2)} \overset{H_0}{\underset{a}{\sim}} F[(k + 1), (n - 2k - 2)]$$
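The Chow statistic can be computed directly from the three residual sums of squares. Below is a sketch on simulated data in which the two subsamples deliberately follow different lines; the data, the seed and the helper names `rss` and `chow_test` are assumptions for illustration.

```python
import numpy as np

def rss(y, X):
    """Residual sum of squares from an OLS fit of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

def chow_test(y_a, X_a, y_b, X_b):
    """Chow F statistic; X_a and X_b include a column of ones."""
    k = X_a.shape[1] - 1                     # number of explanatory variables
    n = len(y_a) + len(y_b)
    u_a, u_b = rss(y_a, X_a), rss(y_b, X_b)  # separate regressions
    u_p = rss(np.concatenate([y_a, y_b]), np.vstack([X_a, X_b]))  # pooled
    f = ((u_p - u_a - u_b) / (k + 1)) / ((u_a + u_b) / (n - 2 * k - 2))
    return f, (k + 1, n - 2 * k - 2)

rng = np.random.default_rng(11)
x_a = rng.uniform(0, 10, 30)
x_b = rng.uniform(0, 10, 30)
y_a = 1.0 + 2.0 * x_a + rng.normal(size=30)   # subsample A
y_b = 5.0 - 1.0 * x_b + rng.normal(size=30)   # subsample B: different line
X_a = np.column_stack([np.ones(30), x_a])
X_b = np.column_stack([np.ones(30), x_b])

f_stat, dof = chow_test(y_a, X_a, y_b, X_b)
print("Chow F =", f_stat, "with degrees of freedom", dof)
```

Because the two simulated subsamples have very different intercepts and slopes, the pooled regression fits much worse than the separate ones and the statistic is large.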

3 Problems for the OLS Model

The two extensions to the OLS model that we shall consider both arise from a failure of the Gauss Markov assumptions to hold in full. In particular, the assumption of ‘spherical’ disturbances;

A2: $Var(u) = E(uu') = \sigma^2 I$ no longer holds in each of these cases.

3.1 Heteroscedasticity

When A2 holds, the disturbance term is said to be homoscedastic, i.e. each observation i has the same variability in its disturbance term, $\sigma^2$. When this is not the case, u is said to be “heteroscedastic”. This can be illustrated graphically in the simple regression case;

[Figure: Heteroscedasticity (axes: x, y)]

Clearly the dispersion of the disturbance terms appears to be increasing with x here. This case of heteroscedasticity still has that the disturbances are pairwise uncorrelated. The consequences of heteroscedasticity are;

• a, b from OLS ($y = \alpha + \beta x + u$) are still unbiased and consistent

• a, b are no longer BLUE

• se(a), se(b) are invalid (as they are constructed under the incorrect assumption of homoscedasticity).

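3.1.1 The Goldfeld-Quandt Test

$H_0: Var(u_i) = \sigma^2$ for all i (homoscedasticity)

$H_1: Var(u_i) = f(x_i)$, $f'(x_i) \neq 0$ (some functional relationship with x)

For example, suppose we believe $Var(u_i)$ is increasing in $x_i$;

[Figure: Goldfeld-Quandt Test (axes: $x_i$, $y_i$). The observations are ordered by $x_i$ and split into the first n' observations, the middle n - 2n', and the last n']

Ordering the observations by $x_i$, $RSS_1$ and $RSS_2$ are the residual sums of squares from separate regressions on the first n' and the last n' observations (the middle n - 2n' observations are dropped);

$$\text{Test Statistic}: \quad F = \frac{RSS_2}{RSS_1} \overset{H_0}{\underset{a}{\sim}} F[(n' - k - 1), (n' - k - 1)] \tag{18}$$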

3.1.2 What Can you Do About Heteroscedasticity?

Suppose $Var(u_i) = \sigma_i^2$. If we know $\sigma_i^2$ for all i we can eliminate the heteroscedasticity by dividing through by $\sigma_i$ for each observation so that the transformed model becomes;

$$\left(\frac{y_i}{\sigma_i}\right) = \alpha \left(\frac{1}{\sigma_i}\right) + \beta \left(\frac{x_i}{\sigma_i}\right) + \left(\frac{u_i}{\sigma_i}\right)$$

$$E\left(\frac{u_i}{\sigma_i}\right) = 0; \quad Var\left(\frac{u_i}{\sigma_i}\right) = \frac{\sigma_i^2}{\sigma_i^2} = 1 \quad \text{i.e. homoscedastic errors}$$

So running this transformed model will give more efficient estimates. Note that because the constant term is $\left(\frac{\alpha}{\sigma_i}\right)$, the ‘constant’ differs across observations: in the transformed model $1/\sigma_i$ enters as a regressor with coefficient $\alpha$, and there is no separate intercept. The intuition behind this is that those observations with the smallest $\sigma_i^2$ will be most useful for locating the true regression line. We take advantage by using weighted least squares in the transformed model, which gives the greatest weights to the highest quality observations (lowest $\sigma_i^2$), whereas OLS is inefficient because it gives all observations an equal weighting.
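The transformation just described can be sketched as follows, under the (strong) assumption that each $\sigma_i$ is known. The simulated data, the seed and the helper name `wls` are illustrative assumptions. Each observation is divided through by $\sigma_i$, and $1/\sigma_i$ takes the place of the usual column of ones.

```python
import numpy as np

def wls(y, x, sigma):
    """Weighted least squares for y = alpha + beta*x + u with known sd sigma_i."""
    y_star = y / sigma
    # In the transformed model 1/sigma_i is a regressor; there is no intercept.
    X_star = np.column_stack([1.0 / sigma, x / sigma])
    coef, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)
    return coef                                  # (alpha_hat, beta_hat)

rng = np.random.default_rng(1)
n = 200
x = np.linspace(1.0, 10.0, n)
sigma = 0.5 * x                                  # assumed known: sd grows with x
y = 2.0 + 3.0 * x + sigma * rng.normal(size=n)   # true alpha = 2, beta = 3

alpha_wls, beta_wls = wls(y, x, sigma)
beta_ols, alpha_ols = np.polyfit(x, y, 1)        # unweighted OLS for comparison

print("WLS:", alpha_wls, beta_wls)
print("OLS:", alpha_ols, beta_ols)
```

If all the $\sigma_i$ are equal the weights cancel and the procedure collapses to ordinary OLS.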

3.2 Autocorrelation

Suppose now that i refers to a time period, not an observation on an individual entity. Then in many economic applications we may expect to see a relationship between disturbances in adjacent time periods. This can again be illustrated in the case of the simple regression model;

[Figure: Autocorrelation (axes: x, y)]

A common form of such disturbances is the “autoregressive” (AR) structure;

$$AR(1): \quad u_t = \rho u_{t-1} + \varepsilon_t, \text{ where } \varepsilon \sim N(0, \sigma_\varepsilon^2 I), \text{ and } |\rho| < 1 \tag{19}$$

This is denoted AR(1) because of the presence of one lag on the RHS. This means that for $\rho > 0$, if we experience a large disturbance in period t, then we expect a similarly large disturbance (dampened by a factor $\rho$) in the subsequent period. If $|\rho| > 1$, then this would imply that the disturbances diverge over time, which is not typically observed in economic data.

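It can be shown that the var-covariance structure in the case of AR(1) disturbances is such that E(u) = 0 still but;

$$Var(u) = \begin{pmatrix} var(u_1) & cov(u_1, u_2) & \cdots & cov(u_1, u_N) \\ cov(u_2, u_1) & var(u_2) & \cdots & cov(u_2, u_N) \\ \vdots & \vdots & \ddots & \vdots \\ cov(u_N, u_1) & cov(u_N, u_2) & \cdots & var(u_N) \end{pmatrix} \tag{20}$$

$$= \sigma^2 \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{N-1} \\ \rho & 1 & \rho & \ddots & \vdots \\ \rho^2 & \rho & 1 & \ddots & \rho^2 \\ \vdots & \ddots & \ddots & \ddots & \rho \\ \rho^{N-1} & \cdots & \rho^2 & \rho & 1 \end{pmatrix} \quad \text{where } \sigma^2 = \frac{\sigma_\varepsilon^2}{1 - \rho^2}$$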
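The pattern in this matrix is simply $\sigma^2 \rho^{|i - j|}$ in row i, column j. A minimal sketch that builds it (the function name `ar1_covariance` and the parameter values are assumptions for illustration):

```python
import numpy as np

def ar1_covariance(n, rho, sigma_eps):
    """N x N covariance matrix of AR(1) disturbances with |rho| < 1."""
    if abs(rho) >= 1:
        raise ValueError("the AR(1) process requires |rho| < 1")
    sigma2 = sigma_eps ** 2 / (1.0 - rho ** 2)       # common variance of each u_t
    idx = np.arange(n)
    lags = np.abs(idx[:, None] - idx[None, :])       # |i - j| for every cell
    return sigma2 * rho ** lags

V = ar1_covariance(4, 0.5, 1.0)
print(V)
```

The diagonal holds the common variance, and the covariances die away geometrically as observations get further apart in time.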

The consequences of autocorrelation are;

• regression coefficients remain unbiased

• estimates appear “too efficient”: the se's are wrongly calculated, being biased downwards (the typical case, with positive autocorrelation).

3.2.1 The Durbin-Watson Test

We can run tests to test for the existence of autocorrelation. The most commonly reported test is the Durbin-Watson test statistic, which is calculated from the OLS residuals. To derive the test statistic note that for the AR(1) model;

$$u_t = \rho u_{t-1} + \varepsilon_t$$

so $u_t$ depends on its own lagged value. If we just run an OLS regression on this equation, using the fitted residuals in place of the unobserved disturbances;

$$\hat{\rho} = \frac{cov(e_t, e_{t-1})}{var(e_{t-1})} \quad \text{where } e_t = \hat{u}_t \text{ (fitted residuals)} \tag{21}$$

$$= \frac{E(e_t e_{t-1}) - E(e_t)E(e_{t-1})}{E(e_{t-1}^2) - [E(e_{t-1})]^2}$$

As $T \to \infty$, $E(e_t) = E(e_{t-1}) = 0$ so;

$$\hat{\rho} \approx \frac{E(e_t e_{t-1})}{E(e_{t-1}^2)} = \frac{\frac{1}{T}\sum_t e_t e_{t-1}}{\frac{1}{T}\sum_t e_{t-1}^2} = \frac{\sum_t e_t e_{t-1}}{\sum_t e_{t-1}^2} \tag{22}$$

The DW statistic is based on this and is calculated as;

$$DW = \frac{\sum (e_t - e_{t-1})^2}{\sum e_t^2} \simeq 2(1 - r), \text{ where } r = corr(e_t, e_{t-1}) \tag{23}$$

The critical values for this test are reported in Johnston (1991). In large samples;

$$DW \to (2 - 2\rho)$$

$H_0$: no autocorrelation

$H_1$: AR(1)

$\Rightarrow$ if $\rho = 0 \Rightarrow DW \approx 2$. Hence the further DW is from 2 (in either direction), the more likely it is that AR(1) is present. The problem with this test is that the actual critical values depend on the explanatory variables in the regression, so in tables, only an upper and lower bound on critical values can be reported. This means that there is a region of values for the DW statistic where the test is indeterminate;
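A sketch of the calculation in (22) and (23) is given below. The two residual series are made up for illustration (they are not from any regression in the notes) and the helper names are hypothetical: one series drifts smoothly, as under positive autocorrelation, and the other alternates in sign, as under negative autocorrelation.

```python
import numpy as np

def durbin_watson(e):
    """DW statistic of (23): sum of squared first differences over sum of squares."""
    e = np.asarray(e, dtype=float)
    return float(np.sum(np.diff(e) ** 2) / np.sum(e ** 2))

def lag1_rho(e):
    """Estimate of rho as in (22): regression of e_t on e_{t-1} without intercept."""
    e = np.asarray(e, dtype=float)
    return float(np.sum(e[1:] * e[:-1]) / np.sum(e[:-1] ** 2))

# Made-up residual series for illustration.
e_smooth = [1.0, 0.8, 0.5, 0.1, -0.4, -0.7, -0.9, -0.5, 0.1]   # slow drift
e_zigzag = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]                   # sign flips

dw_smooth, dw_zigzag = durbin_watson(e_smooth), durbin_watson(e_zigzag)
print("smooth series: DW =", dw_smooth, " rho_hat =", lag1_rho(e_smooth))
print("zigzag series: DW =", dw_zigzag, " rho_hat =", lag1_rho(e_zigzag))
```

The smooth series gives a DW value well below 2 and the alternating series a value well above 2, in line with DW being approximately $2 - 2\rho$. In short series such as these the approximation is rough.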

[Figure: The Durbin-Watson Test (scale: 0, $d_l$, $d_u$, 2, 4; regions labelled $H_1$, indeterminate, $H_0$)]

4 Generalised Least Squares

Both heteroscedastic and autocorrelated errors imply that A2 no longer holds. In either of these cases we can write the var-covariance matrix of the error terms as;

$$Var(u) = E(uu') = \Omega \neq \sigma^2 I \tag{24}$$

This violates the Gauss Markov assumption A2 and so OLS will no longer be appropriate. The best estimator for $\beta$ in this model;

$$y = X\beta + u; \quad Var(u) = \Omega$$

is the “generalised least squares” (GLS) estimator.

Proposition 2 The GLS estimator is

$$\hat{\beta}_{GLS} = (X'\Omega^{-1}X)^{-1}X'\Omega^{-1}y \tag{25}$$

$E(\hat{\beta}_{GLS}) = \beta$ so GLS is still unbiased,

$Var(\hat{\beta}_{GLS}) = (X'\Omega^{-1}X)^{-1}$

Under normality of u (so A5 still holds),

$$\hat{\beta}_{GLS} \sim N(\beta, (X'\Omega^{-1}X)^{-1}) \tag{26}$$
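Formula (25) translates almost line for line into matrix code. The following is a sketch on simulated heteroscedastic data, assuming the covariance matrix of the disturbances is known (in practice it has to be estimated); the data, the seed and the function name `gls` are assumptions for illustration.

```python
import numpy as np

def gls(y, X, omega):
    """GLS estimate (25) and its covariance matrix for a known omega = Var(u)."""
    omega_inv = np.linalg.inv(omega)
    xt_oi = X.T @ omega_inv
    cov = np.linalg.inv(xt_oi @ X)       # (X' omega^-1 X)^-1
    beta = cov @ (xt_oi @ y)             # (X' omega^-1 X)^-1 X' omega^-1 y
    return beta, cov

rng = np.random.default_rng(5)
n = 40
X = np.column_stack([np.ones(n), rng.normal(size=n)])
sigma = np.linspace(0.5, 2.0, n)                 # assumed known standard deviations
y = X @ np.array([1.0, 2.0]) + sigma * rng.normal(size=n)

omega = np.diag(sigma ** 2)                      # heteroscedastic, uncorrelated
beta_gls, cov_gls = gls(y, X, omega)
print("GLS estimates:", beta_gls)
print("standard errors:", np.sqrt(np.diag(cov_gls)))
```

With a diagonal covariance matrix GLS is the same as weighted least squares, and with a covariance matrix proportional to the identity it reduces to OLS. The explicit inverse is kept for readability; for large N a linear solver would be the more usual choice.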

5 Panel Data

When our data contain repeated observations on each individual, the resulting panel data open up a number of possibilities that are not available in a single cross section. In particular, the opportunity to compare the same individual over time allows us the possibility of using that individual as his or her own control. This enables us to get closer to an ideal experimental situation.

5.1 An Introduction to Panel Data

An increasing number of developing countries are collecting survey data, usually across households, for a period of time. Amongst the best known such panel data sets are the ICRISAT data set from India and the LSMS data sets from the World Bank. There are two dimensions to panel data;

$$y_{it}; \quad i = 1 \ldots N; \; t = 1 \ldots T \tag{27}$$

The standard linear panel data model is of the form;

$$y_{it} = \beta' x_{it} + \varepsilon_{it} \text{ where } \varepsilon_{it} \sim N(0, V) \tag{28}$$

Due to the two dimensions present in the data there is likely to be serial correlation present in $\varepsilon$, i.e. the disturbances are not all independent of each other. The simplest case to model is the “one-factor” model;

$$\varepsilon_{it} = \alpha_i + \nu_{it} \tag{29}$$

This treats as negligible the error correlations across individuals, but focuses instead on the relationship between the disturbances of the same individual in different time periods. In the one-factor model;

(A) $\alpha$'s and $\nu$'s are independent across all i and t

(B) $\nu_{it} \sim N(0, \sigma_\nu^2)$

(C) $\alpha_i \sim N(0, \sigma_\alpha^2)$

The last two assumptions $\Rightarrow$ homoscedastic disturbances.

Hence, one interpretation of $\alpha_i$ is that it picks up idiosyncratic disturbances of each individual, e.g. expensive tastes. Of course, the observation i could refer to a particular household, country etc. Hence, for $t \neq s$;

$$cov(\varepsilon_{it}, \varepsilon_{is}) = cov(\alpha_i + \nu_{it}, \alpha_i + \nu_{is}) = \sigma_\alpha^2 \neq 0 \tag{30}$$

so that the errors in the panel data model;

$$y_{it} = \beta' x_{it} + \varepsilon_{it} = \beta' x_{it} + \alpha_i + \nu_{it} \tag{31}$$

are serially correlated. This violates the Gauss Markov assumption;

A2: $Var(\varepsilon) = \sigma^2 I$ which implies that;

• each disturbance has the same variance (still true here; (B) and (C) above ensure homoscedastic errors)

• all disturbances are pairwise uncorrelated (this is violated here)

Hence OLS is no longer efficient and its standard errors are invalid; if, in addition, $\alpha_i$ is correlated with $x_{it}$, OLS will lead to inconsistent estimates of $\beta$. What solutions can be proposed?

5.2 Method 1: GLS/Random Effects

In the simplest linear regression model we have seen that when A2 does not hold, because of heteroscedasticity or autocorrelation, we can use GLS to obtain consistent and efficient estimates of $\beta$. However, recall that to operationalise GLS we need to calculate the inverse of the var-covariance matrix of the disturbances, $Var(\varepsilon) = \Omega$. In this case this will be an $(NT \times NT)$ matrix which is potentially huge so even with modern day computing power, it is often not practical to use GLS directly.

5.3 Method 2: Fixed Effects

The root cause of our problem is the presence of the idiosyncratic factor $\alpha_i$ in the error which makes OLS invalid. One way to transform the regression equation to remove the $\alpha_i$'s is by subtracting the time average of each variable. This is the “fixed effects” transformation.

To estimate $(\beta, \alpha_1, \alpha_2, \ldots, \alpha_N)$ using OLS we use a two step procedure;

Step One: Take the fixed effects transformation;

$$y_{it} - \bar{y}_{i.} = \beta'(x_{it} - \bar{x}_{i.}) + (\nu_{it} - \bar{\nu}_{i.}) \tag{32}$$

Here $\bar{y}_{i.}$, $\bar{x}_{i.}$, $\bar{\nu}_{i.}$ refer to time averages;

$$\bar{y}_{i.} = \frac{1}{T}\sum_{t=1}^{T} y_{it}; \quad \bar{x}_{i.} = \frac{1}{T}\sum_{t=1}^{T} x_{it}; \quad \bar{\nu}_{i.} = \frac{1}{T}\sum_{t=1}^{T} \nu_{it} \tag{33}$$

Note that as $\alpha_i$ is the same for all t (is a fixed effect), $\bar{\alpha}_{i.} = \alpha_i$ so

$$(\alpha_i - \bar{\alpha}_{i.}) = 0 \tag{34}$$

Doing OLS on (32) gets us consistent estimates of $\beta$, denoted $\hat{\beta}_{FE}$.

Step Two: To recover $(\alpha_1, \alpha_2, \ldots, \alpha_N)$;

$$\hat{\alpha}_i = \bar{y}_{i.} - \hat{\beta}_{FE}'\, \bar{x}_{i.} \tag{35}$$
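The two steps can be sketched for the case of a single regressor as follows. The tiny noise-free panel and the function name `fixed_effects` are made up for illustration, so that the procedure recovers the assumed parameters exactly and the effect of the transformation is easy to see.

```python
import numpy as np

def fixed_effects(y, x, ids):
    """Within (fixed effects) estimator for y_it = alpha_i + beta*x_it + v_it."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    groups, inverse = np.unique(np.asarray(ids), return_inverse=True)
    inverse = inverse.ravel()
    counts = np.bincount(inverse)
    y_bar = np.bincount(inverse, weights=y) / counts    # time averages per unit
    x_bar = np.bincount(inverse, weights=x) / counts
    # Step one: subtract each unit's time average, then run OLS.
    y_dm = y - y_bar[inverse]
    x_dm = x - x_bar[inverse]
    beta = float(np.sum(x_dm * y_dm) / np.sum(x_dm ** 2))
    # Step two: recover the individual effects as in (35).
    alpha = y_bar - beta * x_bar
    return beta, dict(zip(groups.tolist(), alpha.tolist()))

# Made-up balanced panel: two units, three periods, no noise.
ids = np.array([0, 0, 0, 1, 1, 1])
x = np.array([1.0, 2.0, 3.0, 2.0, 4.0, 6.0])
true_alpha = np.array([10.0, -5.0])
y = true_alpha[ids] + 2.0 * x                 # assumed true beta = 2

beta_fe, alpha_hat = fixed_effects(y, x, ids)
pooled_slope = np.polyfit(x, y, 1)[0]         # OLS ignoring the alpha_i

print("fixed effects beta:", beta_fe)
print("recovered alphas:", alpha_hat)
print("pooled OLS slope:", pooled_slope)
```

In this constructed example the individual effects are correlated with x, so the pooled slope is badly misleading while the within estimator is not.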

5.3.1 Fixed Effects and Random Effects

$$y_{it} = x_{it}'\beta + \varepsilon_{it}$$

$$\varepsilon_{it} = \alpha_i + \nu_{it}$$

Fixed effects estimates $\hat{\beta}$ conditional on the $\alpha$'s (just as we normally do estimation conditional on the x's). Random effects treats each $\alpha_i$ as an observation arising from some underlying distribution. The FE parameters are thus $(\beta, \alpha_1, \ldots, \alpha_N, \sigma_\nu^2)$ and the RE parameters are $(\beta, \sigma_\alpha^2, \sigma_\nu^2)$. Note the significantly different interpretation of these models.

6 Simultaneous Equations (Endogeneity Bias)

Here we investigate violation of the fourth G-M condition;

A4: X is a nonstochastic matrix $\Rightarrow$ in our sample, the only source of variation is in u and hence y, therefore, $cov(X, u) = 0$.

Consider the following Keynesian income determination model;

$$C_t = \alpha + \beta Y_t + u_t \tag{36}$$

$$Y_t = C_t + I_t$$

Hence;

$$Y_t = \frac{\alpha}{1 - \beta} + \frac{I_t}{1 - \beta} + \frac{u_t}{1 - \beta} \tag{37}$$

The $\frac{1}{1 - \beta}$ term is the multiplier. The important point to note is that $Y_t$ depends on $u_t$, the disturbance term from the consumption equation. Clearly then $Y_t$ is correlated with the disturbance term in (36) which violates the Gauss Markov assumption A4. If we try to estimate $\alpha$ and $\beta$ from (36) by OLS our estimates will be biased and the se's will be invalid. In most cases, OLS will also be inconsistent.

Fortunately, the problem of simultaneous equations bias can often be mitigated by replacing OLS by a different estimation technique. These fall into two types;

• single equation estimation

• systems equation estimation

The latter method is more efficient, but it is also harder to implement.

6.1 Instrumental Variables

The problems arise here because $cov(X, u) \neq 0$. Hence we aim to find an “instrument” for X, W, with two desirable properties;

• W should be (highly) correlated with what it is instrumenting for, i.e. $cov(W, X) \neq 0$

• W should not be correlated with the disturbance term, i.e. $cov(W, u) = 0$

In this case, the model itself provides us with a suitable instrument for $Y_t$: $I_t$ is correlated with $Y_t$ through the identity $Y_t = C_t + I_t$, and it cannot be correlated with the disturbance term because it is an exogenous variable. The estimator we use is the instrumental variables estimator defined as;

$$b_{IV} = \frac{Cov(I_t, C_t)}{Cov(I_t, Y_t)} \tag{38}$$

As a general rule, if an equation in a simultaneous equations model is exactly identified, IV will yield exactly the same coefficient estimates as ILS (indirect least squares) if the exogenous variables in the model are used as instruments. However, any variable that satisfies the two conditions can be a potential instrument for X.
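The contrast between OLS and IV in the Keynesian model can be illustrated by simulation. This is a sketch under assumed parameter values (all numbers, the seed and the helper name `cov` are chosen purely for illustration): investment is drawn independently of the disturbance, income is generated from the reduced form (37), and consumption from (36).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5000
alpha, beta = 10.0, 0.6                    # assumed "true" parameters

invest = rng.normal(50.0, 10.0, n)         # exogenous investment I_t
u = rng.normal(0.0, 5.0, n)                # consumption disturbance u_t
income = (alpha + invest + u) / (1.0 - beta)     # reduced form (37)
consumption = alpha + beta * income + u          # consumption function (36)

def cov(a, b):
    """Sample covariance of two series."""
    return float(np.mean((a - a.mean()) * (b - b.mean())))

b_ols = cov(income, consumption) / cov(income, income)   # biased: Y depends on u
b_iv = cov(invest, consumption) / cov(invest, income)    # estimator (38)

print("true beta:", beta)
print("OLS estimate:", b_ols)
print("IV estimate: ", b_iv)
```

Because income is positively correlated with the disturbance, OLS overstates the marginal propensity to consume in this setup, while the IV estimate stays close to the assumed value.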

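6.2 Underidentification

Consider the following supply and demand model;

$$y_d = \alpha + \beta p + \gamma x + u_d \tag{39}$$

$$y_s = \delta + \varepsilon p + u_s$$

where x = per capita income, assumed exogenous. (p, y) are the endogenous variables, determined by the market clearing process. When the market clears, $y_d = y_s = y$. Solving for the RF (reduced form) equations;

$$p = \frac{\alpha - \delta}{\varepsilon - \beta} + \frac{\gamma}{\varepsilon - \beta}\, x + \frac{u_d - u_s}{\varepsilon - \beta} \tag{40}$$

$$y = \frac{\alpha\varepsilon - \beta\delta}{\varepsilon - \beta} + \frac{\gamma\varepsilon}{\varepsilon - \beta}\, x + \frac{\varepsilon u_d - \beta u_s}{\varepsilon - \beta}$$

p depends on $u_d$ (and on $u_s$) so fitting OLS to the demand equation in (39) would lead to biased and inconsistent estimates. Similarly for the supply equation. We rewrite the RF equations as;

$$p = \alpha' + \beta' x + \nu_p \tag{41}$$

$$y = \delta' + \varepsilon' x + \nu_y$$

6.2.1 IV

x is the only exogenous variable in the model. We should be able to use it to instrument for p. This works in the supply equation but not in the demand equation where x already enters. Hence we can only obtain estimates of the supply equation parameters; the demand equation is underidentified.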

6.3 Overidentification

Consider the following supply and demand equation system where demand is also a function of time (perhaps due to evolving habit formation);

$$y_{dt} = \alpha + \beta p_t + \gamma x_t + \rho t + u_{dt} \tag{42}$$

$$y_{st} = \delta + \varepsilon p_t + u_{st}$$

6.3.1 IV

There are two exogenous variables in the model, $x_t$ and t, both already in the demand equation so neither can be used to instrument for $p_t$ in that equation. Hence the demand equation is underidentified.

The supply equation is overidentified because there are more exogenous variables that can be used as instruments than we actually need. We can use either $x_t$ or t as instruments for $p_t$. They will give us different estimates of $\delta$ and $\varepsilon$, although both will be consistent. As a first pass, we might prefer to use the instrument which is more correlated with $p_t$. However, the optimal method to use is “two stage least squares” where we use a linear combination of potential instruments.

7 Two Stage Least Squares (2SLS)

We have seen how the supply equation is overidentified because both $x_t$ and t were available as instruments for $p_t$. The optimal estimation method in this case is 2SLS where we use a linear combination of the potential instruments;

$$z_t = h_0 + h_1 x_t + h_2 t \tag{43}$$

We want an instrument which is as highly correlated as possible with $p_t$ so we want to maximise $corr(p_t, z_t)$. This is exactly what OLS estimation of the reduced form equation for $p_t$ does: the fitted value $\hat{p}_t$ is a linear combination of $x_t$ and t. When we run that OLS regression we are doing three things at the same time;

1 minimising the sum of squares of residuals in the reduced form equation

2 maximising the value of $R^2$ (goodness of fit)

3 maximising the correlation between the predicted and actual values of $p_t$, i.e. $corr(p_t, z_t)$

It is 3 that we exploit here. Hence we have a two stage procedure;

• regress each endogenous explanatory variable on all the exogenous variables (the RF equations) and calculate the predicted values

• use the predicted values as instruments for the actual values

This procedure produces consistent estimates. Note that when an equation is exactly identified, 2SLS offers no advantage over ILS or standard IV.
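The two stages can be sketched on simulated data from the supply and demand system with a time trend in demand. All parameter values, the seed and the helper names `ols` and `two_stage_least_squares` are assumptions made for the illustration; the supply equation is the one being estimated, with $x_t$ and t as instruments for $p_t$.

```python
import numpy as np

def ols(y, X):
    """OLS coefficients of y on X."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def two_stage_least_squares(y, endog, instruments):
    """2SLS for y = const + coef*endog + error, given a matrix of instruments."""
    n = len(y)
    W = np.column_stack([np.ones(n), instruments])
    endog_hat = W @ ols(endog, W)                     # stage 1: fitted values
    X_hat = np.column_stack([np.ones(n), endog_hat])
    return ols(y, X_hat)                              # stage 2

rng = np.random.default_rng(2024)
n = 4000
x = rng.normal(10.0, 2.0, n)               # per capita income (exogenous)
t = np.arange(n) / 100.0                   # time trend (exogenous)
u_d = rng.normal(size=n)
u_s = rng.normal(size=n)

# Assumed structural parameters (demand: alpha, beta, gamma, rho; supply: delta, eps).
alpha, beta, gamma, rho = 20.0, -1.0, 0.5, 0.1
delta, eps = 2.0, 1.5

# Market clearing price and quantity implied by the two equations.
p = ((alpha - delta) + gamma * x + rho * t + u_d - u_s) / (eps - beta)
y = delta + eps * p + u_s

supply_ols = ols(y, np.column_stack([np.ones(n), p]))
supply_2sls = two_stage_least_squares(y, p, np.column_stack([x, t]))

print("true supply slope:", eps)
print("OLS slope: ", supply_ols[1])
print("2SLS slope:", supply_2sls[1])
```

Only the coefficients are computed here. The standard errors from a naive second stage regression are not the correct 2SLS standard errors, so a full implementation would need to adjust them.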

7.1 An Overidentification Test

The most intuitive way to test the validity of the instruments, namely whether they conform to the conditions;

(A) $cov(x_i, z_i) \neq 0$

(B) $cov(z_i, u_i) = 0$

is to use the test statistic;

$$NR^2 \sim \chi^2(K' - K) \tag{44}$$

where N = sample size, K' = number of instruments, K = number of explanatory variables. $R^2$ is the $R^2$ from the following regression;

$$y - X\hat{\beta} = W\pi + u \tag{45}$$

$\Rightarrow$ $R^2$ is from the regression $e = W\pi + u$ where W is the set of K' instruments, $\pi$ is a vector of coefficients and e are the residuals $y - X\hat{\beta}$.

This procedure tells us whether the instruments play a direct role in determining y, not just an indirect role through the predicted x's, $\hat{X}$. If the test fails, one or more of the instruments are invalid and ought to be included in the explanation of y.

8 Multicollinearity

Multicollinearity is the problem that arises when an approximately linear relationship among the explanatory variables leads to unreliable regression estimates, i.e. because two explanatory variables are highly correlated, you will not be able to precisely estimate the contribution from each variable. As the standard errors rise there is a greater probability of incorrectly finding the variable not to be significant in the regression.

All regressions suffer multicollinearity to some degree, but some more so than others, especially in time series data. Symptoms of multicollinearity are;

• small changes in the data can produce wide swings in parameter estimates

• coefficients may have high se's and low significance levels despite being jointly highly significant and the $R^2$ being high

• coefficients have the wrong sign or implausible magnitude

8.1 What Can You do About It?

There are two responses – direct attempts to improve the conditions for the reliability

of regression estimates, or the use of extraneus information.

8.1.1 Direct Measures

• increase the number of observations, e.g. switch from annual to quarterly data (the problem is that this might make measurement error or autocorrelation worse)

• reduce $\sigma_u^2$ by including more explanatory variables

8.1.2 Extraneous Information

• theoretical restrictions, e.g. in a Cobb-Douglas production function, $Y = A K^{\alpha} L^{\beta} e^{rt} \nu$, we may impose the restriction of constant returns to scale (CRTS), which implies $\alpha + \beta = 1$

• empirical estimates, i.e. use previous studies to impose a restriction on a particular parameter, e.g. an intertemporal elasticity of substitution $\approx 0.3$

9 Omitted Variables Bias

Suppose a dependent variable depends on two variables $x_1$ and $x_2$ according to the DGP;

$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + u$ (46)

but you omit $x_2$ from the regression and run;

$y = \alpha + \beta_1 x_1 + u$ (47)

Your fitted regression is;

$\hat{y} = a + b_1 x_1, \quad b_1 = \dfrac{\mathrm{cov}(x_1, y)}{\mathrm{var}(x_1)}$ (48)

If (46) is the true DGP then it can be shown that;

$E(b_1) = E\left[\dfrac{\mathrm{cov}(x_1, y)}{\mathrm{var}(x_1)}\right] = \beta_1 + \beta_2 \dfrac{\mathrm{cov}(x_1, x_2)}{\mathrm{var}(x_1)}$ (49)

and so your estimate will now be biased. This is “omitted variables bias”. Note that this bias can go in either direction, depending on the signs of $\beta_2$ and $\mathrm{cov}(x_1, x_2)$. The bias term arises because $x_1$ is being asked to also pick up the omitted effects of $x_2$;

[Figure: Omitted Variables Bias. Diagram linking x1 and x2 to y, with the labels "Direct effect of x1 holding x2 constant", "True effect of x2" and "Apparent effect of x1 picking up x2".]

Only in the special case where $\mathrm{cov}(x_1, x_2) = 0$ (or where $\beta_2 = 0$, so that $x_2$ does not belong in the model) will omitted variables bias not occur. A further consequence of omitting a relevant variable is that the standard errors and tests become invalid.
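The formula in (49) is easy to check numerically. The following is a minimal sketch with invented parameter values (beta1 = 1, beta2 = 2, and a population cov(x1, x2) of 0.5 with var(x1) = 1), so the short regression should give a slope near 1 + 2 × 0.5 rather than 1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
beta1, beta2 = 1.0, 2.0                 # hypothetical true parameters
x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(size=n)      # cov(x1, x2) = 0.5, var(x1) = 1
y = 3.0 + beta1 * x1 + beta2 * x2 + rng.normal(size=n)

def slope(x, y):
    """cov(x, y) / var(x): the slope of a simple regression of y on x."""
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

# Short regression that omits x2, as in (47)-(48).
b1_short = slope(x1, y)
# What (49) predicts, using the sample cov(x1, x2) / var(x1).
predicted = beta1 + beta2 * slope(x1, x2)
# Long regression that includes x2, as in (46).
X = np.column_stack([np.ones(n), x1, x2])
b_long = np.linalg.lstsq(X, y, rcond=None)[0]

print("b1 omitting x2:      ", b1_short)
print("predicted by (49):   ", predicted)
print("b1 when x2 included: ", b_long[1])
```

Flipping the sign of either beta2 or the 0.5 in the line that builds x2 flips the direction of the bias.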

10 Including Irrelevant Variables

Suppose that the DGP is given by;

$y = \alpha + \beta_1 x_1 + u$

but the econometrician estimates;

$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + u$

so that $b_1$ is estimated from this multiple regression using (4) instead of $b_1 = \mathrm{cov}(x_1, y)/\mathrm{var}(x_1)$. In this case;

$E(b_1) = \beta_1$ (50)

Hence our estimate is still unbiased, but in general it will be inefficient because it does not exploit the information that $\beta_2 = 0$. This is demonstrated below;

[Figure: Irrelevant Variables. Comparison of $b_1$ estimated using $\mathrm{cov}(x_1, y)/\mathrm{var}(x_1)$ with $b_1$ estimated using (4), which does not exploit the information that $\beta_2 = 0$.]

The level of inefficiency rises as more explanatory variables are added, and the closer the correlation coefficient between $x_1$ and $x_2$ is to $\pm 1$. Only in the special case where $\mathrm{corr}(x_1, x_2) = 0$ is there no loss of efficiency.
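A small Monte Carlo experiment makes the "unbiased but inefficient" point concrete. This is a sketch with assumed values (sample size 50, 2000 replications, true slope 2, and an irrelevant x2 with correlation 0.9 to x1); none of these numbers come from the notes.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps, rho = 50, 2000, 0.9            # hypothetical design
b_short = np.empty(reps)                # b1 from regressing y on x1 only
b_long = np.empty(reps)                 # b1 when the irrelevant x2 is included

for r in range(reps):
    x1 = rng.normal(size=n)
    x2 = rho * x1 + np.sqrt(1 - rho**2) * rng.normal(size=n)
    y = 1.0 + 2.0 * x1 + rng.normal(size=n)     # x2 plays no role in the DGP
    X_short = np.column_stack([np.ones(n), x1])
    X_long = np.column_stack([np.ones(n), x1, x2])
    b_short[r] = np.linalg.lstsq(X_short, y, rcond=None)[0][1]
    b_long[r] = np.linalg.lstsq(X_long, y, rcond=None)[0][1]

print("mean of b1:", b_short.mean(), b_long.mean())   # both should be near 2
print("s.d. of b1:", b_short.std(), b_long.std())     # the second is larger
```

Setting `rho = 0.0` should make the two standard deviations nearly equal, which is the special case mentioned above.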

11 Measurement Error

Economic variables are often measured with error, e.g. the correlation between actual

years of schooling and reported years of schooling is typically around 0.9 in most

datasets.

11.1 Measurement Error in the Explanatory Variables

Suppose we have the relationship;

$y = \alpha + \beta z + \nu, \quad \nu \sim (0, \sigma_\nu^2)$ (51)

$z$ cannot be measured accurately; let $x$ denote its measured value, so that for each observation $z_i$;

$x_i = z_i + w_i, \quad w \sim (0, \sigma_w^2), \quad w, \nu$ independent (52)

Substituting (52) into (51);

$y = \alpha + \beta x + \nu - \beta w$ (53)

Denote by $u$ the composite error in this equation, so

$u = \nu - \beta w$ (54)

Hence (53) can be written as;

$y = \alpha + \beta x + u$ (55)


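We regress $y$ on $x$ (rather than what we really wanted to do, which was to regress it on $z$) and obtain our OLS estimator;

$b = \dfrac{\mathrm{cov}(x, y)}{\mathrm{var}(x)} = \beta + \dfrac{\mathrm{cov}(x, u)}{\mathrm{var}(x)}$ (56)

Note that $x$ and $u$ are correlated: $x$ depends positively on $w$ while, for $\beta > 0$, $u$ depends negatively on $w$, so the correlation is negative and our estimator is biased downwards. In fact, provided $w$ is also uncorrelated with $z$;

$\mathrm{plim}\, b = \beta - \dfrac{\beta \sigma_w^2}{\sigma_z^2 + \sigma_w^2}$

Hence the bias in $b$ is;

$-\dfrac{\beta \sigma_w^2}{\sigma_z^2 + \sigma_w^2}$

More generally, the estimate is pulled towards zero whatever the sign of $\beta$. The implication is that the bigger the population variance of the measurement error, $\sigma_w^2$, relative to the population variance of $z$, $\sigma_z^2$, the bigger will be the downwards bias. Graphically, the effects of measurement error are illustrated below;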

[Figure: Measurement Error. Plot of y against x showing the true line and the regression line with measurement error.]

The standard solution to this problem is to find an instrument for the variable that is measured with error. This instrument must be correlated with $z$ and uncorrelated with $w$ (and with $\nu$). If more than one instrument is available then use 2SLS.
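Both the downward bias and the IV fix can be illustrated with simulated data. In the sketch below every number is an assumption (beta = 2, sigma_z = 1, sigma_w = 0.5), and the instrument is a hypothetical second, independent noisy measurement of z, which is correlated with z but not with the error w in the first measurement.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000
alpha, beta = 1.0, 2.0                  # hypothetical true parameters
sigma_z, sigma_w = 1.0, 0.5

z = rng.normal(scale=sigma_z, size=n)               # true regressor (unobserved)
x = z + rng.normal(scale=sigma_w, size=n)           # measured with error w
x_second = z + rng.normal(scale=sigma_w, size=n)    # second measurement = instrument
y = alpha + beta * z + rng.normal(size=n)

def cov(a, b):
    return np.cov(a, b)[0, 1]

# OLS of y on the mismeasured x.
b_ols = cov(x, y) / cov(x, x)
# Probability limit from the text: beta - beta * s_w^2 / (s_z^2 + s_w^2).
plim_ols = beta - beta * sigma_w**2 / (sigma_z**2 + sigma_w**2)
# Simple IV estimator using the second measurement as the instrument.
b_iv = cov(x_second, y) / cov(x_second, x)

print("OLS:", b_ols, " plim from formula:", plim_ols)
print("IV: ", b_iv, " true beta:", beta)
```

Raising `sigma_w` relative to `sigma_z` should push the OLS slope further towards zero while leaving the IV estimate centred on beta.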

11.2 Measurement Error in the Dependent Variable

On the whole this does not matter as much. It is undesirable because it will tend to decrease the precision of the estimates, but it will not cause the estimates to become biased. Let the true dependent variable be $q$ so that;

$q = \alpha + \beta x + \nu$

If $q$ is measured with error such that;

$y_i = q_i + r_i$

then, since $y - r = \alpha + \beta x + \nu$;

$y = \alpha + \beta x + (\nu + r) = \alpha + \beta x + u$ (57)

The only difference from the usual regression is that the disturbance term has two components. The explanatory variables $x$ have not been affected, so as long as $r$ is unrelated to $x$ the estimates remain unbiased.


12. LIMITED DEPENDENT VARIABLES.

12.1 PROBIT AND LOGIT

In many cases the dependent variable we want to explain is discrete rather than continuous. In these notes we’ll talk about discrete variables that can only take two values. We have seen many of these during the course; examples include the type of tenancy contract (fixed rent or sharecropping), the decision to send kids to school (yes or no), or to plant trees (again: yes or no). Without loss of generality we can give the dependent variable the values 1 and 0, depending on whether the event occurs or not (e.g. 1 if the farmer plants trees, 0 if he doesn’t). Let’s call our dependent variable Y, and assume that it is 1 with probability P and 0 with probability (1-P). The expected value of Y is P*1+(1-P)*0 = P. So P is the probability that the event will occur. The theory will suggest a set of explanatory variables X that affect P. Then P is a function of X and of unknown parameters B, which measure the effect of X on P. B are the parameters we want to estimate. In math we write:

P=Probability (Y=1) = F(XB) (1)

What does F look like?

The simplest thing is to assume F is linear. We then have:

Y = XB + e (2)

where e is, as usual, a stochastic error term, so that P = E(Y|X) = XB. This model is called the linear probability model and can be estimated like the standard linear regression you already know.

SO WHAT? The problem is that the linear probability model does not take into account the fact that what it predicts is a probability and, as such, should always be between 0 and 1. Estimation of (2) can give you predicted values larger than 1 (or smaller than 0), which make very little sense. The trick is then to use a function which only takes values between 0 and 1. Good candidates are cumulative distribution functions like that of the normal, which, by definition, are bounded between 0 and 1. If we assume F(.) is the cumulative distribution function of a normal with mean 0 and variance 1, (1) would look like:

$P = \Phi(XB) = \displaystyle\int_{-\infty}^{XB} \frac{1}{\sqrt{2\pi}}\, e^{-u^2/2}\, du$

This is the probit model.

If we assume F(.) is the standard logistic distribution, (1) would look like:

$P = \Lambda(XB) = \dfrac{e^{XB}}{1 + e^{XB}}$

This is called a logit model.

The results obtained with probit and logit are generally very similar [1], so you shouldn’t worry too much about which one is used.

Probits and logits cannot be estimated by least squares. Instead we use a method

called maximum likelihood (ML). Simply put, this method consists in finding the

parameters of the distribution that maximise the likelihood that our data comes from

precisely that distribution. You don’t have to be able to maximise the likelihood

yourself, generally only computers can.
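For the curious, this is roughly what the computer does. The sketch below fits a logit by maximum likelihood using Newton's method, one standard way of climbing the log-likelihood (statistical packages may use other algorithms). The data are simulated, and the "true" parameters (-0.5 and 1.0) are made up for the illustration.

```python
import numpy as np

def logistic(v):
    """Standard logistic distribution function."""
    return 1.0 / (1.0 + np.exp(-v))

def log_likelihood(B, X, y):
    """Log-likelihood of the logit model: sum of y*log(P) + (1-y)*log(1-P)."""
    p = logistic(X @ B)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def fit_logit(X, y, iterations=25):
    """Maximise the logit log-likelihood with Newton's method."""
    B = np.zeros(X.shape[1])
    for _ in range(iterations):
        p = logistic(X @ B)
        gradient = X.T @ (y - p)                          # first derivatives
        hessian = -(X * (p * (1 - p))[:, None]).T @ X     # second derivatives
        B = B - np.linalg.solve(hessian, gradient)        # Newton step
    return B

rng = np.random.default_rng(4)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=n)])
B_true = np.array([-0.5, 1.0])                            # hypothetical values
y = (rng.uniform(size=n) < logistic(X @ B_true)).astype(float)

B_hat = fit_logit(X, y)
print("estimated B:", B_hat)
print("log-likelihood at B_hat:", log_likelihood(B_hat, X, y))
```

Each Newton step moves B uphill on the likelihood surface; at the maximum the gradient is zero, which is exactly the first order condition a package solves for you.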

12.2 THINGS YOU NEED TO KNOW TO READ PROBIT AND LOGIT RESULTS

a. The estimated coefficients don’t mean much.

Pretty depressing at first. To see what this means it is better to think about OLS first.

In an OLS regression coefficients have an intuitive interpretation. Say you estimate

consumption as a function of income and the income coefficient is 0.8. This means

that if income increases by 100, consumption will increase by 80. Now say you use a

probit to estimate the probability of going to college as a function of income and you

get 0.8 again. Does it mean that when income increases by 100 the probability

increases by 80? Obviously not! If you look closely at the model you see that the

coefficients are inside a complicated function and to find the marginal effect of a

variable you need to take the derivative of the function with respect to that variable.

This will depend on the coefficient of the variable but also on the coefficients and on

the sample values of all the other variables in the regression. Clearly, you don’t have

to do this in the exam. Generally authors who estimate probit models will kindly

provide you with three numbers: the coefficient, the t-statistic (or standard error) and

the marginal effect (typically in brackets, below the t-stat).
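To make the derivative argument concrete: in a probit, the marginal effect of variable k is the standard normal density evaluated at XB, times the coefficient on k. The sketch below uses invented numbers (a constant of -2, an income coefficient of 0.8, and a made-up income sample) and checks the formula against a numerical derivative. Authors usually report either the effect at the sample means or the average over the sample; both are shown.

```python
import math
import numpy as np

def norm_pdf(v):
    """Standard normal density."""
    return np.exp(-0.5 * v**2) / math.sqrt(2.0 * math.pi)

def norm_cdf(v):
    """Standard normal distribution function, built from math.erf."""
    return 0.5 * (1.0 + np.vectorize(math.erf)(v / math.sqrt(2.0)))

# Hypothetical probit estimates: constant and income coefficient.
B = np.array([-2.0, 0.8])
rng = np.random.default_rng(5)
income = rng.uniform(0.5, 4.0, size=1000)          # made-up sample
X = np.column_stack([np.ones_like(income), income])

index = X @ B
me_each = norm_pdf(index) * B[1]                   # dP/d(income), one per observation
average_me = me_each.mean()                        # average marginal effect
me_at_mean = norm_pdf(X.mean(axis=0) @ B) * B[1]   # marginal effect at the sample means

# Check against a finite-difference derivative of P = Phi(XB).
h = 1e-5
X_up = X.copy()
X_up[:, 1] += h
numeric = (norm_cdf(X_up @ B) - norm_cdf(index)) / h

print("coefficient:            ", B[1])
print("average marginal effect:", average_me)
print("effect at sample means: ", me_at_mean)
```

The marginal effect is always smaller in absolute value than the coefficient, and it differs from one observation to the next because it depends on where XB sits on the normal curve.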

b. T-stats are t-stats no matter what

That is, to see whether a coefficient is statistically different from zero at the 5% level, compare the t-stat (= coefficient/standard error) to 1.96. If the t-stat is LARGER than 1.96 in absolute value, the coefficient is significant.

You might encounter something called ‘robust’ or ‘White’ standard errors [2]. Don’t worry: the critical values for the t-stat are always the same.

c. It’s rare to see Instrumental Variables with probit models.

You can do a two stage estimation similar to IV with OLS, but the standard errors come out wrong and it is a bit complicated to correct them. That’s why you rarely see it. If there is a problem of endogeneity, people tend to use linear probability models (with which you can do IV).

[1] The coefficients B will differ because they come from different functions. To compare them one has to multiply the probit coefficients by about 1.6. This is quite irrelevant for the purpose of this course.

[2] These are sometimes used to correct for heteroscedasticity.

d. Mis-specification

Omitted variables, heteroscedasticity, measurement error and endogeneity all create problems for probit and logit estimates. The consequences of mis-specification are even more serious than in OLS. For instance, if there is an omitted variable, the coefficients on the other variables are biased even if these are not correlated with the omitted variable itself. If the disturbances are heteroscedastic, the probit and logit estimators are inconsistent.

13. FIXED EFFECT ESTIMATION

Assume you want to estimate the effect of contractual structure on farm productivity. You expect plots cultivated under fixed rent contracts to be more productive than plots which are sharecropped (why?). Say you have data on 1000 plots, you run your regression and find that the coefficient on the dummy variable that identifies fixed rent contracts is indeed positive and significant. Unfortunately that’s not enough to prove your theory right (actually nothing can “prove” a theory right, but here we’re really far from it!). I might argue that behind all of this there is an omitted variable that affects both contracts and productivity. I say that very good (smart, educated, entrepreneurial, whatever) farmers are obviously more productive and that, at the same time, they prefer fixed rent contracts (they are confident they’d do a good job so they are willing to take all the risk). So the results might have nothing to do with your theory; they are just the product of farmers’ heterogeneity. If you can’t find a proxy for farmers’ ability, what can you do?

There is a way out if each farmer cultivates more than one plot. The way out, as you might imagine, is fixed effect estimation. Say the 1000 plots are cultivated by 200 farmers, 5 plots each. To do fixed effects you create 199 dummies, one per farmer with one farmer left out as the base category, and plug them in the regression with the other explanatory variables. For instance the first dummy is equal to 1 if the plot is tilled by Mr. Blue, 0 otherwise; the second dummy is equal to 1 if the plot is tilled by Mr. Pink, 0 otherwise; and so on. This allows you to estimate the effect of contracts on productivity conditional on farmers’ identity, which controls for all of the farmers’ unobserved characteristics that do not vary across plots. Now you’re checking whether, given the farmer’s identity (and hence his skills etc.), fixed rent contracts foster productivity. If the coefficient on the contract variable is still positive and significant, now you know it was not so because of farmers’ heterogeneity. [3]
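The farmer story can be simulated. In the sketch below all numbers are made up for illustration (200 farmers with 5 plots each, a true contract effect of 0.5, and unobserved ability that raises both output and the chance of choosing fixed rent). It compares the pooled regression, the regression with 199 farmer dummies, and the equivalent shortcut of subtracting each farmer's own mean from every variable.

```python
import numpy as np

rng = np.random.default_rng(6)
n_farmers, plots_each = 200, 5                     # hypothetical sample
farmer = np.repeat(np.arange(n_farmers), plots_each)
ability = rng.normal(size=n_farmers)[farmer]       # unobserved by the econometrician
# Abler farmers are more likely to hold fixed rent contracts.
fixed_rent = (ability + rng.normal(size=farmer.size) > 0).astype(float)
true_effect = 0.5
output = (1.0 + true_effect * fixed_rent + 2.0 * ability
          + rng.normal(scale=0.5, size=farmer.size))

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

const = np.ones(farmer.size)

# Pooled regression: ability is omitted, so the contract dummy picks it up.
b_pooled = ols(np.column_stack([const, fixed_rent]), output)[1]

# Fixed effects: a constant plus 199 farmer dummies (farmer 0 is the base).
dummies = (farmer[:, None] == np.arange(1, n_farmers)[None, :]).astype(float)
b_dummies = ols(np.column_stack([const, fixed_rent, dummies]), output)[1]

# Same estimate without dummies: demean each variable farmer by farmer.
def demean(v):
    means = np.bincount(farmer, weights=v) / np.bincount(farmer)
    return v - means[farmer]

x_w, y_w = demean(fixed_rent), demean(output)
b_within = (x_w @ y_w) / (x_w @ x_w)

print("pooled OLS:       ", b_pooled)
print("with 199 dummies: ", b_dummies)
print("within (demeaned):", b_within)
```

Setting `plots_each = 1` makes the demeaned contract variable identically zero, so there is nothing left to estimate the effect from.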

[3] If you have understood how this works it should be trivial to see that you can’t do fixed effects if each farmer only cultivates one plot.

14. FIML & LIML

In some cases we find a model within a larger model. For example in the group lending paper (Pitt et al.) the authors wanted to explain some economic outcomes (expenditure, kids’ schooling etc.) as a function of the amount of credit received through group lending. At the same time the amount of credit received was itself a function of other variables.

To keep matters simple say you want to estimate y1 as a function of (x, θ) and y2 as a function of (z, λ, y1). Here x and z are the independent variables and θ and λ are the parameters to be estimated. Notice that y1 only depends on θ while y2 depends on λ AND θ (through y1). You have a choice:

– With full information maximum likelihood (FIML) estimation you form the joint distribution f(y1, y2 | x, z, λ, θ) and then maximise the full likelihood function.

– With limited information maximum likelihood (LIML) estimation you first estimate θ (since y1 is a function of θ only) and then you use the estimated value of θ to estimate λ. This method is very convenient because you don’t have to form the joint distribution and maximise the full likelihood function (which is computationally very complex). Rather, you take the distribution f(y1 | x, θ), choose θ to maximise its likelihood, and then employ that value of θ (let’s call it θ*) to form the second distribution f(y2 | z, λ, (x, θ*)) and the likelihood that you have to maximise only with respect to λ.

15. NON PARAMETRIC ESTIMATION

15.1 Description

When we want to estimate the relationship between, say, X and Y, we generally assume that this relationship has a specific functional form. We have often seen LINEAR relationships of the form Y=a+bX. If we think that’s a good specification (as in the Keynesian consumption function), all we need to do is find the appropriate values of a and b, i.e. those such that the predicted Y is as close as possible to the observed Y, given X. This kind of regression is called PARAMETRIC [4]. In some cases we have no reason to think that the relationship between X and Y should take a specific functional form and we want to learn that from the data, that is, we want to estimate the “shape” of the relationship between X and Y. Since we do not estimate the parameters of a specific functional form (remember we haven’t specified one), this kind of analysis is called NON-PARAMETRIC. Intuitively, asking the data to tell us about the shape, as opposed to the parameters of a given shape, is much more demanding. It follows that non-parametric estimation is feasible only if we have a large number of data points.

Remember that a regression of Y on X is the conditional expectation of Y given X, that is E(Y|X). With OLS we assume that the regression function is linear. Under non-parametric estimation we assume nothing about its shape. How can we get E(Y|X)? If we had infinite data points, E(Y|X) would simply be the average of all the values of Y for a given X. If (as is always the case!) the data set contains a finite number of Xs, no matter how large, things get messier. The idea, however, is similar. Non-parametric estimation does the following:

1. Divide up the range of X into an (evenly spaced) grid of points (100 in Deaton’s paper).

2. Choose a symmetric interval around each grid point: the size of the interval is called the “bandwidth” [5].

3. Choose a function that gives weights to the different data points within the interval, such that points farther away from the central X get less weight, points just inside the band get almost no weight and points outside it get none. (Remember that we are trying to approximate the procedure we’d follow if we had infinite Xs: in that case we’d take one X at a time; here we take a bunch of them but give more weight to the ones in the middle.) This weighting function is called a “kernel” function and comes in a couple of different forms. (The one in Deaton’s paper is a “quartic” kernel function.) Then do either:

4a. KERNEL ESTIMATION: For each band, compute the average of all the Ys that correspond to the Xs within the band, giving less weight to those corresponding to Xs that are farther away (as dictated by the kernel function specified above).

OR

4b. SMOOTH LOCAL REGRESSION: For each band, run a weighted OLS regression of Y on X using the weights given by the kernel function defined above.

[4] Since we estimate parameters given an assumed functional form, any kind of regression where the functional form is specified, even if it’s not a linear function, is called parametric. During the course we have seen many different functional forms: for instance the Strauss and Thomas paper assumed that productivity was a concave function of calorie intake.
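Steps 1 to 4 can be sketched in a few lines. The example below is an illustration on simulated data, not a reproduction of Deaton's procedure: the true relationship (a sine curve), the sample size, and the choice of `h = 0.5` as the half-width of the band are all assumptions. The kernel is the standard quartic (biweight) form, proportional to (1 - t²)² inside the band and zero outside.

```python
import numpy as np

def quartic_kernel(t):
    """Quartic (biweight) weights: largest at t = 0, zero for |t| >= 1."""
    return np.where(np.abs(t) < 1, (15.0 / 16.0) * (1.0 - t**2) ** 2, 0.0)

def kernel_estimate(x, y, grid, h):
    """Step 4a: kernel-weighted average of y around each grid point."""
    w = quartic_kernel((x[None, :] - grid[:, None]) / h)
    return (w @ y) / w.sum(axis=1)

def local_regression(x, y, grid, h):
    """Step 4b: weighted OLS of y on x around each grid point, evaluated there."""
    fitted = np.empty(grid.size)
    for j, g in enumerate(grid):
        w = quartic_kernel((x - g) / h)
        X = np.column_stack([np.ones_like(x), x - g])   # centred at the grid point
        XtW = X.T * w
        intercept, _ = np.linalg.solve(XtW @ X, XtW @ y)
        fitted[j] = intercept                           # fitted value at x = g
    return fitted

rng = np.random.default_rng(7)
x = rng.uniform(0.0, 10.0, size=5000)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)      # "unknown" shape is a sine

grid = np.linspace(0.5, 9.5, 100)   # step 1: evenly spaced grid of 100 points
h = 0.5                             # step 2: half-width of the band (assumed)
m_kernel = kernel_estimate(x, y, grid, h)
m_local = local_regression(x, y, grid, h)
print(np.max(np.abs(m_kernel - np.sin(grid))), np.max(np.abs(m_local - np.sin(grid))))
```

Neither function is told that the relationship is a sine curve; both recover it from local averages alone, which is why they need so many observations per band.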

15.2 PROS & CONS

The obvious advantage of non-parametric vs. parametric estimation is that with the former we let the data choose the shape of the relationship. Unfortunately, this extra flexibility comes at a cost. In particular:

1. We need a very big sample (otherwise you need to specify a very wide bandwidth, which results in very imprecise estimates that are practically useless).

2. It is quite difficult to condition Y on more than one variable (you’d need many, many more data points).

3. Non-parametric estimation cannot deal with simultaneity (endogeneity), measurement error, selection bias etc.

[5] You have to choose the size of the bandwidth bearing in mind that the smaller the bandwidth, the more precise (less smoothed over neighbouring points) but also the more variable the estimates will be. (If it were zero we’d recover the “true” value at each point, but this makes sense only if the data were infinite.)