ECONOMETRICS FOR EC307
Oriana Bandiera, Imran Rasul
February 22, 2000

WARNING: The sole purpose of these notes is to give you a “road map” of the econometric issues and techniques you are likely to see in the class papers. They are BY NO MEANS a substitute for an econometric book or course. You should use them as a reference to recall concepts you already know. ALWAYS refer to a book (Dougherty or Greene) for a comprehensive analysis. MEMORISING THESE NOTES WILL NOT HELP YOU IN THE EXAM.
1. The Linear Regression model
2. Inference in the OLS model
3. Problems for the OLS model (heteroscedasticity, autocorrelation)
4. GLS
5. Panel Data
6. Simultaneous Equations (endogeneity bias)
7. 2SLS
8. Multicollinearity
9. Omitted Variable Bias
10. Including Irrelevant Variables
11. Measurement Error
12. Limited Dependent Variables: Probit and Logit
13. Fixed Effects (again)
14. FIML and LIML
15. Non Parametric Estimation
0.1 Types of Econometric Data
The unit of each observation, i, can be an individual, family, school, firm, region, country etc. The data can be;
(a) Cross sectional: collected for a sample of units at a given moment in time
(b) Time series: collected for a given unit, over several time periods
(c) Panel: collected for a sample of units over a period of time. If this period is the same for all i, this is a balanced panel, otherwise it is an unbalanced panel.
1 The Linear Regression Model
The underlying idea in a multiple regression model is that there is some relationship between a ‘dependent’ variable, y, and a set of ‘explanatory’ or ‘independent’ variables, $x_1, x_2, \dots, x_K$;
$$ y = f(x_1, x_2, \dots, x_K) \tag{1} $$
In a sense we are identifying a causal relationship between the x variables on the RHS and the y on the LHS. The basic assumption of the model is that the sample observations on y may be expressed as a linear combination of the sample observations on the explanatory x variables plus a disturbance vector, u;
$$ y = \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_K x_K + u \tag{2} $$
More precisely, if we have N observations on y and the corresponding x’s, then for observation i;
$$ y_i = \beta_1 x_{1i} + \beta_2 x_{2i} + \dots + \beta_K x_{Ki} + u_i \tag{3} $$
The disturbance term reflects the fact that no empirical relationship is ever exact, but on average, a relationship defined by (3) is expected to hold, so on average we expect our disturbances to be zero, hence;
$$ E(u_i) = 0 \text{ for each } i $$
Graphically we can illustrate what we are trying to do in the simple regression case;
[Figure: Fitting a Regression Line]
Our linear regression line is the line which “best fits” the sample data. Heuristically, this line is that which minimises the distance between itself and the actual y values observed for each x observation. The gap between the actual y observation and the fitted, or predicted, y, $\hat{y}$, from our regression line is called the “residual”;
$$ \text{residual}: \; e_i = y_i - \hat{y}_i $$
The “ordinary least squares” (OLS) method of fitting a regression line thus solves the following in the simple regression case (with only one explanatory variable plus the intercept);
$$ \min_{a,\,b} \sum_{i=1}^{N} e_i^2 = \sum_{i=1}^{N} (y_i - a - b x_i)^2 \tag{4} $$
We choose $\sum e_i^2$ rather than $\sum e_i$ because the latter may equal zero even though the fit is very poor, since huge positive and negative $e_i$’s cancel out.
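As a rough illustration of the least squares problem, the sketch below fits a simple regression by the usual closed-form solution and checks that the residuals sum to zero while the sum of squared residuals gets worse if the fitted line is moved. The data and the helper names `ols_simple` and `rss` are hypothetical and are not taken from these notes.

```python
def ols_simple(x, y):
    """Return (a, b) minimising the sum of squared residuals of y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b


def rss(a, b, x, y):
    """Sum of squared residuals for the line a + b*x."""
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical sample (assumed for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = ols_simple(x, y)
residuals = [yi - (a + b * xi) for xi, yi in zip(x, y)]
sum_e = sum(residuals)                 # zero up to rounding error
best_rss = rss(a, b, x, y)             # smallest achievable sum of squares
shifted_rss = rss(a + 0.1, b, x, y)    # any other line does worse
```

The sum of residuals is zero for the fitted line, which is why it cannot be used on its own as a criterion of fit.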
1.1 Gauss-Markov Conditions
A1: Each disturbance term on average is equal to zero
$E(u) = 0 \Rightarrow E(y) = X\beta$, which will be satisfied automatically if a constant term is included, since the role of the constant term is to pick up any systematic tendency in y not accounted for by the explanatory variables in the regression.
A2: Each disturbance term has the same variance (around mean zero)
$\Rightarrow$ (a) each $u_i$ distribution has the same variance (homoscedasticity)
$\Rightarrow$ (b) all disturbances are pairwise uncorrelated (no serial correlation)
(b) means that the size of the disturbance term for individual i has no influence on the size of the disturbance for individual j, for $i \neq j$.
A3: All the explanatory variables contribute something in explaining the variation in the data
$\Rightarrow$ the explanatory variables do not form a linearly dependent set.
A4: The explanatory variables are fixed and can be taken as given
X is a non stochastic matrix $\Rightarrow$ in our sample, the only source of variation is in the u vector and hence in the y vector, $\Rightarrow cov(X, u) = 0$.
A5: The disturbance term follows a normal distribution
u has a multivariate normal distribution $\Rightarrow$ by A1, A2 and A5, $u \sim N(0, \sigma^2 I)$.
Theorem 1 (Gauss Markov): Under the Gauss Markov assumptions A1-A5 the OLS estimator is BLUE (best linear unbiased estimator). Normality (A5) is not needed for this result; it is used for exact inference.
1.2 Exogenous and Endogenous Variables
We have two types of variable that we consider with respect to a given model. An exogenous variable is one that is determined outside of the model under consideration, and so its value is taken as given. An endogenous variable is one whose value is explained from within the model. In our model above, X is exogenous, and y is endogenous.
1.3 Unbiasedness, Efficiency and Consistency
There are three principal properties we look for in any estimator;
• Unbiasedness: on average, we expect the estimated parameter to equal the true population value of the parameter (in the OLS case, $\hat\beta_{OLS}$ is unbiased as $E(\hat\beta_{OLS}) = \beta$).
• Efficiency: an estimator $\hat\theta_1$ is said to be more efficient than another estimator $\hat\theta_2$ if $Var(\hat\theta_1) < Var(\hat\theta_2)$.
• Consistency: an estimator $\hat\theta_1$ is said to be consistent if, as the sample size increases to infinity, $\lim Var(\hat\theta_1) \to 0$ and $\lim E(\hat\theta_1) \to \theta$.
[Figure: Consistency, Efficiency and Bias]
Hence if OLS is BLUE, this implies that no other unbiased linear estimator has a smaller variance (is more efficient) than the OLS estimator under the Gauss Markov assumptions.
1.4 Goodness of Fit
A summary statistic of the goodness of fit of our regression model is given by the $R^2$ statistic. Clearly,
$$ 0 \le R^2 \le 1 $$
and as $R^2 \to 1$ the fit of the model is said to improve.
One word of warning is that as the number of explanatory variables, K, increases, it can be shown that $R^2$ never decreases and so can be brought arbitrarily close to one simply by including more explanatory variables in the regression, even if these explanatory variables are found to be insignificant.
In order to prevent this problem, an “adjusted $R^2$” or $\bar{R}^2$ statistic is often reported, defined as;
$$ \bar{R}^2 = \frac{(N - 1)R^2 - K}{N - K - 1} < R^2 \tag{5} $$
(here K counts the explanatory variables excluding the intercept). This does not necessarily rise with K. It can be shown that the addition of a new variable to the regression will cause $\bar{R}^2$ to rise iff its t-statistic is greater than one in absolute value (which still does not imply that the variable is significant). Therefore a rise in $\bar{R}^2$ does not necessarily establish that the specification of the model has improved, although it is a better indicator of this than $R^2$.
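To make the penalty in the adjusted statistic concrete, the sketch below evaluates equation (5) for a fixed $R^2$ and an increasing number of regressors. The function name and the numbers are hypothetical, chosen only for illustration.

```python
def adjusted_r_squared(r2, n, k):
    """Equation (5): n observations, k explanatory variables excluding the intercept."""
    return ((n - 1) * r2 - k) / (n - k - 1)


# Hypothetical values: the same R^2 of 0.5 from 30 observations,
# reported for regressions with 3 and with 10 explanatory variables.
r2, n = 0.5, 30
adj_small = adjusted_r_squared(r2, n, 3)
adj_large = adjusted_r_squared(r2, n, 10)

# Equivalent textbook form: 1 - (1 - R^2)(N - 1)/(N - K - 1)
alt_small = 1 - (1 - r2) * (n - 1) / (n - 3 - 1)
```

For a given $R^2$, the adjusted statistic falls as more explanatory variables are used to achieve it.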
1.5 Interpretation of the Estimated Parameters, $\hat\beta$
The estimated parameters, $\hat\beta = (\hat\beta_1, \hat\beta_2, \dots, \hat\beta_K)$, have a very simple interpretation. Let us consider one particular element of $\hat\beta$, $\hat\beta_i$;
• the standard interpretation of $\hat\beta_i$ is that if $x_i$ increases by one unit, holding the other explanatory variables constant, then y increases by $\hat\beta_i$ units
• if the LHS variable y is in log form (but X is not) then the equation is in semi-logarithmic form. In this case the interpretation is that, for small $\beta_i$, if $x_i$ increases by one unit this leads to approximately a $100\beta_i$% increase in y.
• if all variables are in logarithmic form then $\beta_i$ is the elasticity of y with respect to $x_i$: a 1% change in $x_i$ leads to a $\beta_i$% change in y.
2 Inference in the OLS Model
2.1 t-tests
A hypothesis that we commonly want to test is;
Null hypothesis: $H_0: \beta_i = 0$
Alternative hypothesis: $H_1: \beta_i \neq 0$
If we accept $H_0$ $\Rightarrow$ the explanatory variable corresponding to $\beta_i$, namely $x_i$, is not important (or ‘insignificant’) in explaining y. We statistically test for this using the t-test statistic;
$$ t = \frac{\hat\beta_i}{se(\hat\beta_i)} \overset{H_0}{\sim} t(N - K) \tag{6} $$
where $\hat\beta_i$ refers to the OLS estimate of $\beta_i$, and
$$ se(\hat\beta_i) = \sqrt{Var(\hat\beta_i)} = \sqrt{\sigma^2 a_{ii}} = \sigma\sqrt{a_{ii}} \tag{7} $$
where $a_{ii}$ is the i-th diagonal element of $(X'X)^{-1}$, N = number of observations and K = number of explanatory variables including the intercept.
In general to test $H_0: \beta_i = \tilde\beta$ we use the test statistic;
$$ t = \frac{\hat\beta_i - \tilde\beta}{se(\hat\beta_i)} \overset{H_0}{\sim} t(N - K) \tag{8} $$
This test statistic can be easily computed once we have performed OLS, and will give us a value for each $\beta_i$. This value is compared against the ‘critical’ value given by the $t(N - K)$ distribution.
$$ \text{Accept } H_0 \text{ iff } \left| \frac{\hat\beta_i}{se(\hat\beta_i)} \right| < t_{crit}(N - K) \tag{9} $$
$$ \text{Do not accept } H_0 \text{ iff } \left| \frac{\hat\beta_i}{se(\hat\beta_i)} \right| > t_{crit}(N - K) \tag{10} $$
where $t_{crit}(N - K)$ is derived from tables for the appropriate significance level, e.g. 5%, 1%; at the 5% level in reasonably large samples it is approximately equal to two. A rough and ready calculation that you can do is to reject $H_0$ iff the t-statistic is more than two in absolute value. If (10) holds, then the variable $x_i$ is said to be ‘significant’. If (9) holds, then the variable $x_i$ is said to be ‘insignificant’. Either the t-statistic (in absolute value) or the standard error will be reported with coefficient estimates.
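The t-statistic in (6) can be sketched for the simple regression case, where K = 2 and the standard error of the slope reduces to $\sqrt{s^2 / \sum (x_i - \bar{x})^2}$ with $s^2 = RSS/(N - 2)$ used in place of the unknown $\sigma^2$. The data and the function name `slope_t_stat` are hypothetical.

```python
import math


def slope_t_stat(x, y):
    """Return (b, se_b, t) for the slope of y = a + b*x, testing H0: beta = 0."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    a = y_bar - b * x_bar
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    s2 = rss / (n - 2)              # estimate of sigma^2 with N - K = N - 2 dof
    se_b = math.sqrt(s2 / s_xx)     # standard error of the slope
    return b, se_b, b / se_b


# Hypothetical sample (assumed for illustration only)
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1, 6.1, 6.8, 8.2]

b, se_b, t_stat = slope_t_stat(x, y)
significant = abs(t_stat) > 2.0     # rough and ready rule from the text
```

With real data the comparison should use the tabulated critical value for N - K degrees of freedom rather than the rule of thumb of two.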
2.1.1 Degrees of Freedom
The test statistic for the t-test follows a $t(N - K)$ distribution; $(N - K)$ is referred to as the ‘degrees of freedom’ for the test. To get some intuition of where this comes from consider the simple regression model, $y = \alpha + \beta x + u$. The t-test for $H_0: \beta = 0$ uses the $t(N - 2)$ distribution. We have to take off two degrees of freedom from the sample size because the first two observations give us no information on the quality of the line of best fit: one observation does not determine a line at all, and two observations are always fitted perfectly;
[Figure: Degrees of Freedom. One observation: no information; two observations: perfect relationship]
2.1.2 Confidence Intervals
Consider the simple regression line $\hat{y} = 95.3 + 2.53t$, where t is time and 0.08 is the se on the slope estimate, estimated from a sample size of 23. To test $H_0: \beta = 0$ we use the test statistic;
$$ t = \frac{\hat\beta}{se(\hat\beta)} = \frac{2.53}{0.08} = 31.625 $$
The critical value for this test is $t_{n-2,\,0.025} = t_{21,\,0.025} = 2.080$ (5% significance level, two-sided). Clearly, as 31.625 > 2.080 we reject $H_0$ and conclude that time does matter. We can construct a confidence interval from our t-statistic. We know $\hat\beta$ is just an estimate of $\beta$, so in what range might we reasonably expect the true $\beta$ to lie? For large n, the t distribution is very similar to the normal distribution. Using this fact we can construct the following confidence intervals, writing b for $\hat\beta$;
$$ \text{To cover 95\% of the distribution}: \; \beta \in [b \pm 1.96\, se(b)] \tag{11} $$
$$ \text{To cover 99\% of the distribution}: \; \beta \in [b \pm 2.58\, se(b)] $$
[Figure: 95% Confidence Interval, from b − 1.96 se(b) to b + 1.96 se(b), with 2.5% in each tail]
In practice we need to take account of the fact that we don’t know $\sigma_u^2$. Hence we use the t-distribution to form the confidence intervals;
$$ \text{To cover 95\% of the distribution}: \; \beta \in [b \pm t_{n-2,\,0.025}\, se(b)] \tag{12} $$
$$ \text{To cover 99\% of the distribution}: \; \beta \in [b \pm t_{n-2,\,0.005}\, se(b)] $$
Above we rejected $H_0: \beta = 0$ at the 5% significance level. This is equivalent to saying that 0 does not lie in the 95% confidence interval.
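The arithmetic of the time-trend example can be checked with a few lines. This is only a sketch: the critical value is typed in by hand as the 5% two-sided table value for n − 2 = 21 degrees of freedom (about 2.080), which is an assumption of the example rather than something computed here.

```python
# Slope estimate and standard error from the example in the text
b, se_b = 2.53, 0.08

t_stat = b / se_b

# Assumed table value: two-sided 5% critical value of the t distribution, 21 dof
t_crit = 2.080

ci_low = b - t_crit * se_b
ci_high = b + t_crit * se_b

reject_h0 = abs(t_stat) > t_crit          # test of H0: beta = 0
zero_in_ci = ci_low <= 0.0 <= ci_high     # equivalent check via the interval
```

The two checks always agree: $H_0: \beta = 0$ is rejected at the 5% level exactly when zero lies outside the 95% interval.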
2.2 F-tests
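Another commonly reported test is that which tests the joint significance of all the explanatory variables;
$$ H_0: \beta_2 = \beta_3 = \dots = \beta_K = 0 $$
i.e. all coefficients other than the intercept $\beta_1$ are zero. To test $H_0$ we use the test statistic;
$$ F = \frac{ESS/(K - 1)}{RSS/(N - K)} = \frac{R^2/(K - 1)}{(1 - R^2)/(N - K)} \overset{H_0}{\sim} F[(K - 1), (N - K)] \tag{13} $$
where ESS is the explained sum of squares and RSS the residual sum of squares. Again we have to choose an appropriate critical value for $F[(K - 1), (N - K)]$ to set the significance level. We can also use the F-test to test whether a group of explanatory variables is jointly significant. Suppose the full model has M explanatory variables and we ask whether the last M − K of them can be dropped;
$$ H_0: \beta_{K+1} = \beta_{K+2} = \dots = \beta_M = 0 $$
So under $H_0$ the regression model is;
$$ y = \alpha + \beta_1 x_1 + \dots + \beta_K x_K + u \;\to\; RSS_K $$
Under $H_1$ the regression model is;
$$ y = \alpha + \beta_1 x_1 + \dots + \beta_K x_K + \beta_{K+1} x_{K+1} + \dots + \beta_M x_M + u \;\to\; RSS_M $$
The test statistic to use is;
$$ F = \frac{(RSS_K - RSS_M)/(M - K)}{RSS_M/(N - M - 1)} \overset{H_0}{\sim} F[(M - K), (N - M - 1)] \tag{14} $$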
2.3 Type I and Type II Errors
The significance level refers to the probability that you will not accept $H_0$ even though the true $\beta_i$ accords with $H_0$. Hence the significance level (which is typically 1% or 5%) gives the probability of wrongly rejecting a true null hypothesis ($H_0$), known as a ‘type I error’.
Clearly we want to minimise this as much as possible, but as we decrease the probability of a type I error (by changing our critical value, $t_{crit}(N - K)$, accordingly), we necessarily increase the probability of a type II error, namely, accepting a false null hypothesis, $H_0$. Hence we face a trade-off between the two types of error.

             H0 accepted        H0 rejected
H0 true      correct decision   type I error
H0 false     type II error      correct decision

It is standard practice in the literature to fix the probability of a type I error, the significance level, at either 1%, 5% or 10%. The exact probability of a type II error will then depend on the specific test that we are using. A ‘good’ test will give a low probability of a type II error when the probability of a type I error is low. In that case the test is said to have ‘high power’.
[Figure: Type I and Type II Errors. Probability of a type II error against probability of a type I error, showing the standard trade-off and a high power test]
2.4 Dummy Variables
Note that in the last example, one of the x’s was the variable ‘male’. This is an
example of a ‘dummy’ variable which takes the following form;
$$ male_i = \begin{cases} 1 & \text{if individual } i \text{ is male} \\ 0 & \text{otherwise} \end{cases} \tag{15} $$
We can define dummies for all such dichotomous variables, e.g. race, seasonals.
2.5 Interaction Variables
Consider a wage equation;
$$ w = \beta_1 + \beta_2\, age + \beta_3 d + u, \qquad d = \begin{cases} 1 & \text{if college graduate} \\ 0 & \text{otherwise} \end{cases} $$
Now suppose we want to examine the hypothesis that “not only are the salaries of college graduates higher than those of non college graduates at any given age, but they rise faster as the individuals get older”. To test this we require the inclusion of an interaction term;
$$ w = \beta_1 + \beta_2\, age + \beta_3 d + \beta_4 (d \cdot age) + u \tag{16} $$
[Figure: Interaction Terms. Non graduates: intercept $b_1$, slope $b_2$; graduates: intercept shifted by $b_3$, slope $b_2 + b_4$]
2.6 Chow Test
It sometimes happens that your sample observations potentially contain two subsamples, e.g. male and female. Do we run separate or combined regressions? Sometimes we can combine the subsamples using a dummy variable, e.g. a male dummy, or we can allow for interaction terms that don’t restrict the coefficients to be the same for each subsample;
[Figures: Chow Test: Combined; Chow Test: Separate Regression]
Suppose you have two subsamples, A and B $\to$ RSS $U_A$, $U_B$ from the separate regressions.
If you run a pooled regression, P $\to$ RSS $U_P = U_P^A + U_P^B$ from the pooled regression (where $U_P^i$ is the contribution to the RSS from subsample i).
The subsample regressions must fit the data at least as well as the pooled sample $\Rightarrow U_A \le U_P^A$ and $U_B \le U_P^B$ and therefore $(U_A + U_B) \le U_P$. $(U_A + U_B) = U_P$ only in the case when there is no need to split the samples.
There is a price to pay for the improved fit using subsamples: we lose degrees of freedom as (k + 1) extra parameters are estimated (k = number of explanatory variables). Hence we have (2k + 2) parameters to estimate (k slope coefficients and an intercept for each of A and B). Is the improvement in fit significant? We use the following F-statistic;
$$ \text{Chow test}: \; F = \frac{\text{Improvement in fit / dof used up}}{\text{Unexplained / dof remaining}} = \frac{(U_P - U_A - U_B)/(k + 1)}{(U_A + U_B)/(n - 2k - 2)} \overset{H_0}{\sim} F[(k + 1), (n - 2k - 2)] \tag{17} $$
3 Problems for the OLS Model
The two extensions to the OLS model that we shall consider both arise because not all of the Gauss Markov assumptions hold. In particular, the assumption of ‘spherical’ disturbances,
A2: $Var(u) = E(uu') = \sigma^2 I$, no longer holds in each of these cases.
3.1 Heteroscedasticity
When A2 holds, the disturbance term is said to be homoscedastic, i.e. each observation i has the same variability in its disturbance term, $\sigma^2$. When this is not the case, u is said to be “heteroscedastic”. This can be illustrated graphically in the simple regression case;
Clearly the disturbance terms appear to be increasing with x here. In this case of heteroscedasticity the disturbances are still pairwise uncorrelated. The consequences of heteroscedasticity are;
• a, b from OLS ($y = \alpha + \beta x + u$) are still unbiased and consistent
• a, b are no longer BLUE
• se(a), se(b) are invalid (as they are constructed under the incorrect assumption of homoscedasticity).
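3.1.1 The Goldfeld-Quandt Test
$H_0: Var(u_i) = \sigma^2$ for all i (homoscedasticity)
$H_1: Var(u_i) = f(x_i)$, $f'(x_i) \neq 0$ (some functional relationship with x)
For example, suppose we believe $Var(u_i)$ is increasing in $x_i$. In the standard form of the test the observations are ordered by $x_i$ and separate regressions are run on the first n' and the last n' observations, omitting the middle n − 2n', giving residual sums of squares $RSS_1$ and $RSS_2$;
[Figure: Goldfeld-Quandt Test. Sample ordered by x and split into n', n − 2n' and n' observations]
$$ \text{Test statistic}: \; F = \frac{RSS_2}{RSS_1} \overset{H_0}{\sim} F[(n' - k - 1), (n' - k - 1)] \tag{18} $$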
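A minimal sketch of the Goldfeld-Quandt calculation, assuming the standard form of the test: order the observations by x, fit separate simple regressions to the first n' and last n' observations, and take the ratio of the residual sums of squares. The data are constructed deterministically so that the disturbance grows with x; the names `rss_simple` and `n_prime` are hypothetical.

```python
def rss_simple(x, y):
    """Residual sum of squares from an OLS fit of y = a + b*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    a = y_bar - b * x_bar
    return sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))


# Hypothetical data, already ordered by x: the disturbance alternates in sign
# and its size is proportional to x, so Var(u_i) increases with x_i.
x = [float(i) for i in range(1, 31)]
u = [0.1 * xi * (1 if i % 2 == 0 else -1) for i, xi in enumerate(x)]
y = [2.0 + 3.0 * xi + ui for xi, ui in zip(x, u)]

n_prime = 11                                   # roughly 3/8 of the sample
rss_1 = rss_simple(x[:n_prime], y[:n_prime])   # low-x subsample
rss_2 = rss_simple(x[-n_prime:], y[-n_prime:]) # high-x subsample

f_stat = rss_2 / rss_1
dof = n_prime - 1 - 1                          # n' - k - 1 with k = 1
```

The statistic would be compared with the critical value of F[dof, dof]; a value well above one points towards a variance that increases with x.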
3.1.2 What Can you Do About Heteroscedasticity?
Suppose $Var(u_i) = \sigma_i^2$. If we know $\sigma_i^2$ for all i we can eliminate the heteroscedasticity by dividing through by $\sigma_i$ for each observation so that the transformed model becomes;
$$ \frac{y_i}{\sigma_i} = \alpha \frac{1}{\sigma_i} + \beta \frac{x_i}{\sigma_i} + \frac{u_i}{\sigma_i}, \qquad E\left(\frac{u_i}{\sigma_i}\right) = 0, \quad Var\left(\frac{u_i}{\sigma_i}\right) = 1 $$
i.e. homoscedastic errors.
So running this transformed model will give more efficient estimates. Note that because the constant term is now $\alpha/\sigma_i$, it differs across observations: the transformed regression has no common intercept, and $1/\sigma_i$ enters as a regressor with coefficient $\alpha$. The intuition behind this is that those observations with the smallest $\sigma_i^2$ will be most useful for locating the true regression line. We take advantage by using weighted least squares in the transformed model, which gives the greatest weights to the highest quality observations (lowest $\sigma_i^2$), whereas OLS is inefficient because it gives all observations an equal weighting.
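The transformation can be sketched directly: divide every term by $\sigma_i$ and run least squares without a common intercept on the two transformed regressors $1/\sigma_i$ and $x_i/\sigma_i$. The function name `wls` and the data are hypothetical, and the $\sigma_i$ are assumed known, as in the text.

```python
def wls(x, y, sigma):
    """Weighted least squares for y = alpha + beta*x with known sigma_i.

    Regresses y/sigma on 1/sigma and x/sigma (no common intercept) by
    solving the 2x2 normal equations of the transformed model.
    """
    z1 = [1.0 / s for s in sigma]               # transformed constant
    z2 = [xi / s for xi, s in zip(x, sigma)]    # transformed regressor
    ys = [yi / s for yi, s in zip(y, sigma)]    # transformed dependent variable
    s11 = sum(a * a for a in z1)
    s12 = sum(a * b for a, b in zip(z1, z2))
    s22 = sum(b * b for b in z2)
    t1 = sum(a * c for a, c in zip(z1, ys))
    t2 = sum(b * c for b, c in zip(z2, ys))
    det = s11 * s22 - s12 ** 2
    alpha = (s22 * t1 - s12 * t2) / det
    beta = (s11 * t2 - s12 * t1) / det
    return alpha, beta


# Hypothetical data: sigma_i assumed known and rising with x
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
sigma = [1.0, 1.0, 2.0, 2.0, 4.0, 4.0]

# Sanity check on data with no disturbance: WLS recovers the true line
y_exact = [2.0 + 3.0 * xi for xi in x]
alpha_hat, beta_hat = wls(x, y_exact, sigma)

# With disturbances scaled by sigma_i the noisy observations get less weight
y_noisy = [yi + s * d for yi, s, d in zip(y_exact, sigma, [0.3, -0.2, 0.1, -0.4, 0.5, -0.3])]
alpha_w, beta_w = wls(x, y_noisy, sigma)
```

In practice $\sigma_i$ is rarely known and has to be modelled or estimated, which this sketch does not attempt.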
3.2 Autocorrelation
Suppose now that i refers to a time period, not an observation on an individual entity. Then in many economic applications we may expect to see a relationship between disturbances in adjacent time periods. This can again be illustrated in the case of the simple regression model;
A common form of such disturbances is the first-order “autoregressive” (AR(1)) structure;
$$ AR(1): \; u_t = \rho u_{t-1} + \varepsilon_t, \text{ where } \varepsilon_t \sim N(0, \sigma_\varepsilon^2) \text{ and } |\rho| < 1 \tag{19} $$
This is denoted AR(1) because of the presence of one lag on the RHS. This means that for $\rho > 0$, if we experience a large disturbance in period t, then we expect a similarly large disturbance (dampened by a factor $\rho$) in the subsequent period. If $|\rho| > 1$, then this would imply that the disturbances diverge over time, which is not typically observed in economic data.
It can be shown that the var-covariance structure in the case of AR(1) disturbances is such that E(u) = 0 still but;
$$ Var(u) = \begin{pmatrix} var(u_1) & cov(u_1, u_2) & \cdots & cov(u_1, u_N) \\ cov(u_2, u_1) & var(u_2) & \cdots & cov(u_2, u_N) \\ \vdots & \vdots & \ddots & \vdots \\ cov(u_N, u_1) & cov(u_N, u_2) & \cdots & var(u_N) \end{pmatrix} = \sigma^2 \begin{pmatrix} 1 & \rho & \rho^2 & \cdots & \rho^{N-1} \\ \rho & 1 & \rho & \cdots & \rho^{N-2} \\ \rho^2 & \rho & 1 & \cdots & \rho^{N-3} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^{N-1} & \rho^{N-2} & \cdots & \rho & 1 \end{pmatrix}, \quad \text{where } \sigma^2 = \frac{\sigma_\varepsilon^2}{1 - \rho^2} \tag{20} $$
The consequences of autocorrelation are;
• regression coefficients remain unbiased
• estimates appear “too efficient”: the conventional se’s are wrongly calculated and (for $\rho > 0$) are typically biased downwards.
3.2.1 The Durbin-Watson Test
We can run tests to test for the existence of autocorrelation. The most commonly reported test is the Durbin-Watson test statistic, which is calculated from the OLS residuals. To derive the test statistic note that for the AR(1) model;
$$ u_t = \rho u_{t-1} + \varepsilon_t $$
so $u_t$ depends on its own lagged value. If we just run an OLS regression on this, using the fitted residuals in place of the disturbances, we obtain;
$$ \hat\rho = \frac{cov(e_t, e_{t-1})}{var(e_{t-1})} = \frac{E(e_t e_{t-1}) - E(e_t)E(e_{t-1})}{E(e_{t-1}^2) - [E(e_t)]^2}, \text{ where } e_t = \hat{u}_t \text{ (fitted residuals)} \tag{21} $$
As $t \to \infty$, $E(e_t) = E(e_{t-1}) = 0$ so;
$$ \hat\rho = \frac{\sum e_t e_{t-1}}{\sum e_{t-1}^2} \tag{22} $$
The DW statistic is based on this and is calculated as;
$$ DW = \frac{\sum_{t=2}^{T} (e_t - e_{t-1})^2}{\sum_{t=1}^{T} e_t^2} \simeq 2(1 - r), \text{ where } r = corr(e_t, e_{t-1}) \tag{23} $$
The critical values for this test are reported in Johnston (1991). In large samples;
$$ DW \to (2 - 2\rho) $$
$H_0$: no autocorrelation
$H_1$: AR(1)
$\Rightarrow$ if $\rho = 0 \Rightarrow DW \approx 2$. Hence the further DW is from 2 (in either direction), the more likely it is that AR(1) is present. The problem with this test is that the actual critical values depend on the explanatory variables in the regression, so in tables, only an upper and lower bound on critical values can be reported. This means that there is a region of values for the DW statistic where the test is indeterminate;
[Figure: The Durbin-Watson Test. DW axis from 0 to 4 with $d_L$, $d_U$ and 2 marked; $H_1$ below $d_L$, $H_0$ above $d_U$]
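The DW statistic in (23) is simple to compute from a list of residuals. The sketch below uses two artificial residual series (hypothetical, chosen to make the arithmetic transparent): one with perfectly persistent residuals and one with residuals that alternate in sign.

```python
def durbin_watson(e):
    """DW statistic: sum of squared first differences over sum of squares."""
    num = sum((e[t] - e[t - 1]) ** 2 for t in range(1, len(e)))
    den = sum(et ** 2 for et in e)
    return num / den


# Perfect positive autocorrelation: residuals never change, so DW = 0
dw_positive = durbin_watson([1.0, 1.0, 1.0, 1.0])

# Perfect negative autocorrelation: numerator 3 * 4 = 12, denominator 4, so DW = 3
dw_negative = durbin_watson([1.0, -1.0, 1.0, -1.0])
```

For the alternating series of length T the statistic equals 4(T − 1)/T, which approaches the upper limit of 4 as T grows; this is the sense in which the approximation 2(1 − r) holds only in large samples.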
4 Generalised Least Squares
Both heteroscedastic and autocorrelated errors imply that A2 no longer holds. In either of these cases we can write the var-covariance matrix of the error terms as;
$$ Var(u) = E(uu') = \Omega \neq \sigma^2 I \tag{24} $$
This violates the Gauss Markov assumption A2 and so OLS will no longer be appropriate. The best estimator for $\beta$ in the model
$$ y = X\beta + u, \quad Var(u) = \Omega $$
is the “generalised least squares” (GLS) estimator.
Proposition 2 The GLS estimator is
$$ \hat\beta_{GLS} = (X'\Omega^{-1}X)^{-1} X'\Omega^{-1} y \tag{25} $$
$E(\hat\beta_{GLS}) = \beta$ so GLS is still unbiased,
$Var(\hat\beta_{GLS}) = (X'\Omega^{-1}X)^{-1}$.
Under normality of u (so A5 still holds),
$$ \hat\beta_{GLS} \sim N(\beta, (X'\Omega^{-1}X)^{-1}) \tag{26} $$
5 Panel Data
When our data contain repeated observations on each individual, say, the resulting panel data open up a number of possibilities that are not available in a single cross section. In particular, the opportunity to compare the same individual over time allows us the possibility of using that individual as his or her own control. This enables us to get closer to an ideal experimental situation.
5.1 An Introduction to Panel Data
An increasing number of developing countries are collecting survey data, usually across households, for a period of time. Amongst the best known such panel data sets are the ICRISAT data set from India and the LSMS data sets from the World Bank.
There are two dimensions to panel data;
$$ y_{it}; \quad i = 1, \dots, N; \quad t = 1, \dots, T \tag{27} $$
The standard linear panel data model is of the form;
$$ y_{it} = \beta' x_{it} + \varepsilon_{it}, \text{ where } \varepsilon_{it} \sim N(0, V) \tag{28} $$
Due to the two dimensions present in the data there is likely to be serial correlation present in $\varepsilon$, i.e. the disturbances are not all independent of each other. The simplest case to model is the “one-factor” model;
$$ \varepsilon_{it} = \alpha_i + \nu_{it} \tag{29} $$
This treats as negligible the error correlations across individuals, but focuses instead on the relationship between the errors of the same individual in different time periods. In the one-factor model;
(A) $\alpha$’s and $\nu$’s are independent across all i and t
(B) $\nu_{it} \sim N(0, \sigma_\nu^2)$
(C) $\alpha_i \sim N(0, \sigma_\alpha^2)$
The last two assumptions $\Rightarrow$ homoscedastic disturbances.
Hence, one interpretation of $\alpha_i$ is that it picks up idiosyncratic characteristics of each individual, e.g. expensive tastes. Of course, the observation i could refer to a particular household, country etc. Hence, for $t \neq s$;
$$ cov(\varepsilon_{it}, \varepsilon_{is}) = cov(\alpha_i + \nu_{it}, \alpha_i + \nu_{is}) = \sigma_\alpha^2 \neq 0 \tag{30} $$
so that the errors in the panel data model;
$$ y_{it} = \beta' x_{it} + \varepsilon_{it} = \beta' x_{it} + \alpha_i + \nu_{it} \tag{31} $$
are serially correlated. This violates the Gauss Markov assumption;
A2: $Var(\varepsilon) = \sigma^2 I$ which implies that;
• each disturbance has the same variance (still true here; (B) and (C) above ensure homoscedastic errors)
• all disturbances are pairwise uncorrelated (this is violated here)
Hence the OLS standard errors will be invalid and, if $\alpha_i$ is correlated with $x_{it}$, using OLS will lead to inconsistent estimates of $\beta$. What solutions are available?
5.2 Method 1: GLS/Random Effects
In the simplest linear regression model we have seen that when A2 does not hold, because of heteroscedasticity or autocorrelation, we can use GLS to obtain consistent estimates of $\beta$. However, recall that to operationalise GLS we need to calculate the inverse of the var-covariance matrix of the disturbances, $Var(\varepsilon) = \Omega$. In this case this will be an $(NT \times NT)$ matrix, which is potentially huge, so even with modern day computing power it is often not practical to use GLS.
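5.3 Method 2: Fixed Effects
The root cause of our problem is the presence of the idiosyncratic factor $\alpha_i$ in the error, which makes OLS invalid. One way to proceed is to transform the regression equation to remove the $\alpha_i$’s by subtracting the time average of each variable. This is the “fixed effects” transformation.
To estimate $(\beta, \alpha_1, \alpha_2, \dots, \alpha_N)$ using OLS we use a two step procedure;
Step One: Take the fixed effects transformation;
$$ y_{it} - \bar{y}_{i.} = \beta'(x_{it} - \bar{x}_{i.}) + (\nu_{it} - \bar{\nu}_{i.}) \tag{32} $$
where $\bar{y}_{i.}$, $\bar{x}_{i.}$ and $\bar{\nu}_{i.}$ refer to time averages;
$$ \bar{y}_{i.} = \frac{1}{T}\sum_{t=1}^{T} y_{it}, \quad \bar{x}_{i.} = \frac{1}{T}\sum_{t=1}^{T} x_{it}, \quad \bar{\nu}_{i.} = \frac{1}{T}\sum_{t=1}^{T} \nu_{it} \tag{33} $$
Note that as $\alpha_i$ is the same for all t (is a fixed effect), $\bar{\alpha}_{i.} = \alpha_i$ so
$$ \alpha_i - \bar{\alpha}_{i.} = 0 \tag{34} $$
Doing OLS on (32) gets us consistent estimates of $\beta$, denoted $\hat\beta_{FE}$.
Step Two: To recover $(\alpha_1, \alpha_2, \dots, \alpha_N)$;
$$ \hat\alpha_i = \bar{y}_{i.} - \hat\beta_{FE}'\, \bar{x}_{i.} \tag{35} $$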
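The two steps can be sketched for the case of a single explanatory variable. The function name `fixed_effects`, the panel layout (a dictionary of per-unit lists) and the data are all hypothetical; the data are generated without disturbances so that the true values are recovered exactly.

```python
def fixed_effects(panel):
    """Within estimator for y_it = alpha_i + beta * x_it + v_it, one regressor.

    panel maps each unit i to a list of (x_it, y_it) pairs over time.
    Returns (beta_hat, {i: alpha_hat_i}).
    """
    num = 0.0
    den = 0.0
    means = {}
    # Step one: subtract unit-specific time averages, then pool and run OLS
    for i, obs in panel.items():
        x_bar = sum(x for x, _ in obs) / len(obs)
        y_bar = sum(y for _, y in obs) / len(obs)
        means[i] = (x_bar, y_bar)
        for x, y in obs:
            num += (x - x_bar) * (y - y_bar)
            den += (x - x_bar) ** 2
    beta_hat = num / den
    # Step two: recover each alpha_i from the unit means
    alphas = {i: y_bar - beta_hat * x_bar for i, (x_bar, y_bar) in means.items()}
    return beta_hat, alphas


# Hypothetical panel with beta = 2, alpha_A = 5, alpha_B = 1 and no disturbance
panel = {
    "A": [(1.0, 7.0), (2.0, 9.0), (3.0, 11.0)],
    "B": [(2.0, 5.0), (4.0, 9.0), (6.0, 13.0)],
}

beta_fe, alpha_fe = fixed_effects(panel)
```

Because the demeaned variables have zero mean within each unit, the step one regression needs no intercept.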
5.3.1 Fixed Effects and Random Effects
$$ y_{it} = x_{it}'\beta + \varepsilon_{it}, \qquad \varepsilon_{it} = \alpha_i + \nu_{it} $$
Fixed effects estimates $\hat\beta$ conditional on the $\alpha$’s (just as we normally do estimation conditional on the x’s). Random effects treats each $\alpha_i$ as an observation arising from some underlying distribution. The FE parameters are thus $(\beta, \alpha_1, \dots, \alpha_N, \sigma_\nu^2)$ and the RE parameters are $(\beta, \sigma_\alpha^2, \sigma_\nu^2)$. Note the significantly different interpretation of these models.
6 Simultaneous Equations (Endogeneity Bias)
Here we investigate violation of the fourth G-M condition;
A4: X is a non stochastic matrix $\Rightarrow$ in our sample, the only source of variation is in u and hence y, therefore $cov(X, u) = 0$.
Consider the following Keynesian income determination model;
$$ C_t = \alpha + \beta Y_t + u_t \tag{36} $$
$$ Y_t = C_t + I_t $$
$$ \Rightarrow Y_t = \frac{\alpha}{1 - \beta} + \frac{1}{1 - \beta} I_t + \frac{1}{1 - \beta} u_t $$
The $\frac{1}{1 - \beta}$ term is the multiplier. The important point to note is that $Y_t$ depends on $u_t$, the disturbance term from the consumption equation. Clearly then $Y_t$ is correlated with the disturbance term in the consumption equation (36), which violates the Gauss Markov assumption A4. If we try to estimate $\alpha$ and $\beta$ from (36) by OLS our estimates will be biased and the se’s will be invalid. In most cases, OLS will also be inconsistent.
Fortunately, the problem of simultaneous equations bias can often be mitigated by replacing OLS by a different estimation technique. These fall into two types;
• single equation estimation
• systems equation estimation
The latter method is more efficient, but it is also harder to implement.
6.1 Instrumental Variables
The problems arise here because $cov(X, u) \neq 0$. Hence we aim to find an “instrument”, W, for X with two desirable properties;
• W should be (highly) correlated with what it is instrumenting for, i.e. $cov(W, X) \neq 0$
• W should not be correlated with the disturbance term, i.e. $cov(W, u) = 0$
In this case, the model itself provides us with a suitable instrument for $Y_t$, namely $I_t$. $I_t$ is correlated with $Y_t$ through the income identity, and it cannot be correlated with the disturbance term because it is an exogenous variable. The estimator we use is the instrumental variables estimator defined as;
$$ b_{IV} = \frac{Cov(I_t, C_t)}{Cov(I_t, Y_t)} $$
As a general rule, if an equation in a simultaneous equations model is exactly identified, IV will yield exactly the same coefficient estimates as indirect least squares (ILS) if the exogenous variables in the model are used as instruments. However, any variable that satisfies the two conditions can be a potential instrument for X.
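A small simulation can illustrate the difference between OLS and IV in the Keynesian model. This is a sketch under assumed parameter values ($\alpha = 10$, $\beta = 0.6$, investment and disturbances drawn from normal distributions with unit variance); none of these numbers come from the notes, and the helper `cov` is hypothetical.

```python
import random


def cov(a, b):
    """Sample covariance (dividing by n)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / n


random.seed(0)
alpha, beta, n = 10.0, 0.6, 5000          # assumed true parameters

invest = [random.gauss(5.0, 1.0) for _ in range(n)]   # exogenous I_t
u = [random.gauss(0.0, 1.0) for _ in range(n)]        # consumption disturbance

# Income from the reduced form, consumption from the structural equation
income = [(alpha + i_t + u_t) / (1.0 - beta) for i_t, u_t in zip(invest, u)]
cons = [alpha + beta * y_t + u_t for y_t, u_t in zip(income, u)]

b_ols = cov(income, cons) / cov(income, income)   # biased: Y_t is correlated with u_t
b_iv = cov(invest, cons) / cov(invest, income)    # uses I_t as the instrument
```

With these assumed variances the OLS slope settles well above the true marginal propensity to consume, while the IV estimate is close to it.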
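6.2 Underidentification
Consider the following supply and demand model;
$$ y_d = \alpha + \beta p + \gamma x + u_d \tag{39} $$
$$ y_s = \delta + \varepsilon p + u_s $$
where x = per capita income, assumed exogenous. (p, y) are the endogenous variables, determined by the market clearing process. When the market clears, $y_d = y_s = y$. Solving for the RF (reduced form) equations;
$$ p = \frac{\alpha - \delta}{\varepsilon - \beta} + \frac{\gamma}{\varepsilon - \beta}\, x + \frac{u_d - u_s}{\varepsilon - \beta} \tag{40} $$
$$ y = \frac{\alpha\varepsilon - \beta\delta}{\varepsilon - \beta} + \frac{\gamma\varepsilon}{\varepsilon - \beta}\, x + \frac{\varepsilon u_d - \beta u_s}{\varepsilon - \beta} $$
p depends on $u_d$, so fitting OLS to the demand equation would lead to biased and inconsistent estimates. Similarly for the supply equation, since p also depends on $u_s$. We rewrite the RF equations as;
$$ p = \alpha' + \beta' x + \nu_p \tag{41} $$
$$ y = \delta' + \varepsilon' x + \nu_y $$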
6.2.1 IV
x is the only exogenous variable in the model. We should be able to use it to instrument for p. This works in the supply equation but not in the demand equation, where x already enters. Hence we can only obtain estimates of the supply equation; the demand equation is underidentified.
6.3 Overidentification
Consider the following supply and demand equation system where demand is also a function of time (perhaps due to evolving habit formation);
$$ y_{dt} = \alpha + \beta p_t + \gamma x_t + \rho t + u_{dt} \tag{42} $$
$$ y_{st} = \delta + \varepsilon p_t + u_{st} $$
6.3.1 IV
There are two exogenous variables in the model, $x_t$ and t, both already in the demand equation, so neither can be used to instrument for $p_t$ in that equation. Hence the demand equation is underidentified.
The supply equation is overidentified because there are more exogenous variables that can be used as instruments than we actually need. We can use either $x_t$ or t as instruments for $p_t$. They will give us different estimates of $\delta$ and $\varepsilon$, although both will be consistent. As a first pass, we might prefer to use the instrument which is more correlated with $p_t$. However, the optimal method to use is “two stage least squares”, where we use a linear combination of potential instruments.
7 Two Stage Least Squares (2SLS)
We have seen how the supply equation is overidentified because both $x_t$ and t were available as instruments for $p_t$. The optimal estimation method in this case is 2SLS where we use a linear combination of the potential instruments;
$$ z_t = h_0 + h_1 x_t + h_2 t \tag{43} $$
We want an instrument which is as highly correlated as possible with $p_t$, so we want to maximise $corr(p_t, z_t)$. This is what OLS estimation of the RF equation for $p_t$ does: it gives the fitted value $\hat{p}_t$ as a linear combination of $x_t$ and t. When we run that OLS regression we are doing three things at the same time;
1 minimising the sum of squares of residuals in the RF equation
2 maximising the value of $R^2$ (goodness of fit)
3 maximising the correlation between the predicted and actual values of $p_t$, i.e. $corr(p_t, z_t)$
It is 3 that we want here. Hence we have a two stage procedure;
• estimate the RF equations by OLS and calculate the predicted values of the endogenous variables
• use the predicted values as instruments for the actual values
This procedure produces consistent estimates. Note that when an equation is exactly identified, 2SLS offers no advantage over ILS or standard IV.
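The two stages can be sketched for the overidentified supply equation, with price instrumented by income and time. All parameter values and the helper functions `solve` and `ols` are hypothetical, chosen only so that the simulated market has a known supply slope of 1.5.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [mr - factor * mc for mr, mc in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def ols(rows, y):
    """OLS coefficients; each row lists the regressors, including a leading 1.0."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve(xtx, xty)


random.seed(0)
n = 1000
delta, eps = 2.0, 1.5                             # assumed supply parameters
alpha, beta, gamma, rho = 20.0, -1.0, 1.0, 0.01   # assumed demand parameters

x = [random.gauss(10.0, 2.0) for _ in range(n)]   # per capita income
t = [float(s) for s in range(1, n + 1)]           # time
u_d = [random.gauss(0.0, 1.0) for _ in range(n)]
u_s = [random.gauss(0.0, 1.0) for _ in range(n)]

# Market clearing: reduced form for price, then quantity from the supply equation
p = [(alpha - delta + gamma * xi + rho * ti + ud - us) / (eps - beta)
     for xi, ti, ud, us in zip(x, t, u_d, u_s)]
y = [delta + eps * pi + us for pi, us in zip(p, u_s)]

# Stage 1: regress p on a constant, x and t, and form the fitted values
h = ols([[1.0, xi, ti] for xi, ti in zip(x, t)], p)
p_hat = [h[0] + h[1] * xi + h[2] * ti for xi, ti in zip(x, t)]

# Stage 2: regress y on a constant and the fitted price
delta_2sls, eps_2sls = ols([[1.0, ph] for ph in p_hat], y)
```

The standard errors from a mechanical second stage regression are not the correct 2SLS standard errors, so this sketch is only meant to show where the coefficient estimates come from.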
7.1 An Overidentification Test
The most intuitive way to test the validity of the instruments, namely whether they conform to the conditions;
(A) $cov(x_i, z_i) \neq 0$
(B) $cov(z_i, u_i) = 0$
is to use the test statistic;
$$ NR^2 \sim \chi^2(K' - K) \tag{44} $$
where N = sample size, K' = number of instruments, K = number of explanatory variables. $R^2$ is the $R^2$ from the following regression;
$$ y - X\hat\beta = W\pi + u \tag{45} $$
$\Rightarrow R^2$ is from the regression $e = W\pi + u$, where W is the set of K' instruments, $\pi$ is a vector of coefficients and e are the residuals.
This procedure tells us whether the instruments play a direct role in determining y, not just an indirect role through the predicted x’s, $\hat{X}$. If the test rejects, one or more of the instruments are invalid and ought to be included in the explanation of y.
8 Multicollinearity
Multicollinearity is the problem that arises when an approximately linear relationship among the explanatory variables leads to unreliable regression estimates, i.e. because two explanatory variables are highly correlated, you will not be able to estimate precisely the contribution from each variable. As the standard errors rise there is a greater probability of incorrectly finding a variable not to be significant in the regression.
All regressions suffer from multicollinearity to some degree, but some more so than others, especially in time series data. Symptoms of multicollinearity are;
• small changes in the data can produce wide swings in parameter estimates
• coefficients may have high se’s and low significance levels despite being jointly highly significant and the $R^2$ being high
• coefficients have the wrong sign or implausible magnitude
8.1 What Can You do About It?
There are two responses: direct attempts to improve the conditions for the reliability of regression estimates, or the use of extraneous information.
8.1.1 Direct Measures
• increase the number of observations, e.g. switch from annual to quarterly data (the problem is that this might make measurement errors or autocorrelation worse)
• reduce $\sigma_u^2$ by including more explanatory variables
8.1.2 Extraneous Information
• theoretical restrictions, e.g. in a Cobb Douglas production function, $Y = AK^{\alpha}L^{\beta}e^{rt}\nu$, we may impose the restriction of CRTS $\Rightarrow \alpha + \beta = 1$.
• empirical estimates: use previous studies to impose a restriction on a particular parameter, e.g. intertemporal elasticity of substitution $\approx 0.3$
9 Omitted Variables Bias
Suppose a dependent variable depends on two variables $x_1$ and $x_2$ according to the relationship;
$$ y = \alpha + \beta_1 x_1 + \beta_2 x_2 + u \tag{46} $$
but you omit $x_2$ from the regression and run;
$$ y = \alpha + \beta_1 x_1 + u \tag{47} $$
Your fitted regression is;
$$ \hat{y} = a + b_1 x_1; \qquad b_1 = \frac{cov(x_1, y)}{var(x_1)} \tag{48} $$
If (46) is the true DGP then it can be shown that;
$$ E(b_1) = E\left[\frac{cov(x_1, y)}{var(x_1)}\right] = \beta_1 + \beta_2 \frac{cov(x_1, x_2)}{var(x_1)} \tag{49} $$
and so your estimate will now be biased. This is “omitted variables bias”. Note that this bias can go in either direction depending on the signs of $\beta_2$ and $cov(x_1, x_2)$. This bias term arises because $x_1$ is being asked to also pick up the omitted effects of $x_2$;
[Figure: Omitted Variables Bias. Direct effect of $x_1$ holding $x_2$ constant; true effect of $x_2$; apparent effect of $x_1$ picking up $x_2$]
Only in the special case where $cov(x_1, x_2) = 0$ will omitted variables bias not occur. A further consequence of omitted variables bias is that the se’s and tests become invalid.
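The bias formula can be verified numerically. In the sketch below the data are hypothetical and generated without a disturbance term, so the slope from the short regression equals $\beta_1 + \beta_2\, cov(x_1, x_2)/var(x_1)$ exactly rather than only in expectation.

```python
def cov(a, b):
    """Sample covariance (dividing by n)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    return sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b)) / n


# Hypothetical regressors, positively correlated with each other
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]

# True relationship with beta_1 = 2 and beta_2 = 3 (no disturbance, for clarity)
beta_1, beta_2 = 2.0, 3.0
y = [1.0 + beta_1 * a + beta_2 * b for a, b in zip(x1, x2)]

# Short regression that omits x2
b1_short = cov(x1, y) / cov(x1, x1)

# Value predicted by the omitted variables bias formula
b1_predicted = beta_1 + beta_2 * cov(x1, x2) / cov(x1, x1)
```

Here $\beta_2 > 0$ and $cov(x_1, x_2) > 0$, so the short regression overstates the effect of $x_1$.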
10 Including Irrelevant Variables
Suppose that the DGP is given by;
$y = \alpha + \beta_1 x_1 + u$
but the econometrician estimates;
$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + u$
so that b1 is estimated from this two-variable regression instead of using
$b_1 = \frac{cov(x_1, y)}{var(x_1)}$. In this case;
$E(b_1) = \beta_1$ (50)
Hence our estimate is still unbiased but in general it will be inefficient because it
does not exploit the information that $\beta_2 = 0$. This is demonstrated below;
[Figure: Irrelevant Variables. Labels: b1 estimated using $b_1 = cov(x_1, y)/var(x_1)$; b1 estimated using the two-variable regression, which does not exploit the information that $\beta_2 = 0$.]
The level of inefficiency rises as more explanatory variables are added, and the
closer the correlation coefficient between x1 and x2 is to ±1. Only in the special case
where corr(x1, x2) = 0 is there no loss of efficiency.
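A small Monte Carlo makes the "unbiased but inefficient" point concrete. In the sketch below (all numbers and helper names are assumptions for illustration) x2 is irrelevant but highly correlated with x1. The regressors are held fixed, the disturbance is redrawn many times, and b1 is computed with and without x2 in the regression.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance (divides by n); cov(a, a) is the variance."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def slope_short(y, x1):
    """b1 from regressing y on x1 only."""
    return cov(x1, y) / cov(x1, x1)

def slope_long(y, x1, x2):
    """b1 from regressing y on x1 and the irrelevant x2."""
    v11, v22, c12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    det = v11 * v22 - c12 ** 2
    return (cov(x1, y) * v22 - cov(x2, y) * c12) / det

random.seed(2)
ALPHA, BETA1 = 1.0, 2.0        # assumed true values; beta2 = 0 in the DGP
n, reps = 50, 2000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [0.9 * a + (1 - 0.81) ** 0.5 * random.gauss(0, 1) for a in x1]  # corr about 0.9

short, long_ = [], []
for _ in range(reps):
    y = [ALPHA + BETA1 * a + random.gauss(0, 1) for a in x1]
    short.append(slope_short(y, x1))
    long_.append(slope_long(y, x1, x2))

mean_short, mean_long = mean(short), mean(long_)
var_short, var_long = cov(short, short), cov(long_, long_)
print(f"mean of b1: short={mean_short:.3f}  long={mean_long:.3f}")
print(f"variance of b1: short={var_short:.4f}  long={var_long:.4f}")
```

Both averages sit close to the true value, but the estimates from the longer regression are far more dispersed; lowering the correlation between x1 and x2 in the simulation narrows the gap.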
11 Measurement Error
Economic variables are often measured with error, e.g. the correlation between actual
years of schooling and reported years of schooling is typically around 0.9 in most
11.1 Measurement Error in the Explanatory Variables
Suppose we have the relationship;
$y = \alpha + \beta z + \nu, \quad \nu \sim (0, \sigma_\nu^2)$ (51)
z cannot be measured accurately; let x denote its measured value, so for each observation
$x_i = z_i + w_i, \quad w \sim (0, \sigma_w^2), \quad w, \nu$ independent (52)
Substituting (52) into (51);
$y = \alpha + \beta x + \nu - \beta w$ (53)
Denote by u the composite error in this equation so
$u = \nu - \beta w$ (54)
Hence (53) can be written as;
$y = \alpha + \beta x + u$ (55)
We regress y on x (rather than what we really wanted to do which was to regress
it on z) and obtain our OLS estimator;
$b = \frac{cov(x, y)}{var(x)} = \beta + \frac{cov(x, u)}{var(x)}$
Note that x and u are negatively correlated (for $\beta > 0$), as x depends positively on w
while u depends negatively on w, so our estimator is downwards biased. In fact, writing
$\sigma_z^2$ for the population variance of z;
$plim\ b = \beta + \frac{-\beta \sigma_w^2}{\sigma_z^2 + \sigma_w^2}$
Hence we underestimate $\beta$ by;
$\frac{\beta \sigma_w^2}{\sigma_z^2 + \sigma_w^2}$
The implication of this is that the bigger the population variance of the measurement
error, $\sigma_w^2$, relative to the population variance of z, $\sigma_z^2$, the bigger will be the
downwards bias. Graphically the effects of measurement error are illustrated below;
[Figure: Measurement Error. Vertical axis: y. Lines: true line; regression line with measurement error (ME).]
The standard solution to this problem is to find an instrument for the variable that
is measured with error. This instrument must be correlated with z and uncorrelated
with w and ν. If more than one instrument is available then use 2SLS.
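The attenuation result and the instrumental variable fix can both be seen in a simulation. The sketch below is illustrative only. The parameter values are made up, and the instrument is a hypothetical second, independently mismeasured report of z (so it is correlated with z but not with w or ν); real applications have to argue for their instrument. With a single instrument the IV estimator is cov(instrument, y)/cov(instrument, x).

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance (divides by n); cov(a, a) is the variance."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

random.seed(3)
ALPHA, BETA = 1.0, 2.0            # assumed true parameters
SIGMA_Z, SIGMA_W = 1.0, 0.5       # assumed s.d. of z and of the measurement error
n = 20000

z = [random.gauss(0, SIGMA_Z) for _ in range(n)]
x = [zi + random.gauss(0, SIGMA_W) for zi in z]   # what we observe instead of z
m = [zi + random.gauss(0, SIGMA_W) for zi in z]   # hypothetical second report, used as instrument
y = [ALPHA + BETA * zi + random.gauss(0, 1) for zi in z]

# OLS of y on the mismeasured x.
b_ols = cov(x, y) / cov(x, x)

# Probability limit implied by the formula in the text.
plim_ols = BETA - BETA * SIGMA_W ** 2 / (SIGMA_Z ** 2 + SIGMA_W ** 2)

# Simple IV estimator with one instrument.
b_iv = cov(m, y) / cov(m, x)

print(f"true beta = {BETA:.3f}")
print(f"OLS       = {b_ols:.3f}   (formula plim = {plim_ols:.3f})")
print(f"IV        = {b_iv:.3f}")
```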
11.2 Measurement Error in the Dependent Variable
On the whole this does not matter as much. It is undesirable because it will tend to
decrease the precision of the estimates, but it will not cause the estimates to become
biased. Let the true dependent variable be q so that;
$q = \alpha + \beta x + \nu$
If q is measured with error such that;
$y_i = q_i + r_i$
then
$y - r = \alpha + \beta x + \nu$
$y = \alpha + \beta x + (\nu + r) = \alpha + \beta x + u$ (57)
The only difference from the usual regression is that the disturbance term has
two components. The explanatory variables x have not been affected so, provided r is
uncorrelated with x, the estimates remain unbiased.
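A quick Monte Carlo check of this claim, under assumed parameter values: the regressor is held fixed, the disturbance ν and the measurement error r are redrawn many times, and the slope is estimated once with the true q and once with the mismeasured y.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance (divides by n); cov(a, a) is the variance."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

random.seed(4)
ALPHA, BETA = 1.0, 2.0     # assumed true parameters
n, reps = 50, 2000
x = [random.gauss(0, 1) for _ in range(n)]

b_with_q, b_with_y = [], []
for _ in range(reps):
    q = [ALPHA + BETA * xi + random.gauss(0, 1) for xi in x]   # true dependent variable
    y = [qi + random.gauss(0, 2) for qi in q]                  # measured with error r
    b_with_q.append(cov(x, q) / cov(x, x))
    b_with_y.append(cov(x, y) / cov(x, x))

mean_q, mean_y = mean(b_with_q), mean(b_with_y)
var_q, var_y = cov(b_with_q, b_with_q), cov(b_with_y, b_with_y)
print(f"mean slope: using q = {mean_q:.3f}, using y = {mean_y:.3f}")
print(f"variance of slope: using q = {var_q:.4f}, using y = {var_y:.4f}")
```

Both averages are close to the true slope; only the spread of the estimates grows when the dependent variable is noisy.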
12 Limited Dependent Variables: Probit and Logit
In many cases the dependent variable we want to explain is discrete rather than
continuous. In these notes we’ll talk about discrete variables that can only take two
values. We have seen many of these during the course; examples include the type of
tenancy contract (fixed rent or sharecropping), the decision to send kids to school (yes
or no), to plant trees (again: yes or no). Without loss of generality we can give the
dependent variable values of 1 and 0, depending on whether the event occurs or not
(e.g. 1 if the farmer plants trees, 0 if he doesn’t). Assume that the variable is 1 with
probability p and 0 with probability (1-p). Let’s call our dependent variable y. The
expected value of y is p*1+(1-p)*0 = p. So p is the probability that the event will
occur. The theory will suggest a set of explanatory variables X that affect p. Then P is
a function of X and of unknown parameters B, which measure the effect of X on P. B
are the parameters we want to estimate. In math we write:
P=Probability (Y=1) = F(XB) (1)
What does F look like?
The simplest thing is to assume F is linear, we then have:
Y = XB + e, so that P = E(Y|X) = XB (2)
where e is, as usual, a stochastic error term. This model is called the linear probability
model and can be estimated like the standard linear regression you already know.
SO WHAT? The problem is that the linear probability model does not take into account
the fact that the dependent variable is a probability and, as such, should always be
between 0 and 1. Estimation of (2) can give you predicted values larger than 1 (or
smaller than 0) which make very little sense. The trick is then to use a function which
only takes values between 0 and 1. Good candidates are cumulative distribution functions
like that of the normal, which, by definition, are bounded between 0 and 1. If we assume
F(.) is the cumulative distribution function of a normal with mean 0 and variance 1, (1)
would look like:
$P = \int_{-\infty}^{XB} \frac{1}{\sqrt{2\pi}} e^{-u^2/2}\, du = \Phi(XB)$
This is the probit model.
If we assume F(.) is the standard logistic distribution, (1) would look like:
$P = \frac{e^{XB}}{1 + e^{XB}} = \Lambda(XB)$
This is called a logit model.
The results obtained with probit and logit are generally very similar (see footnote 1),
so you shouldn’t worry too much about which one is used.
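To see the three choices of F side by side, here is a small sketch that evaluates the linear, probit and logit functions at a few values of the index XB (the values are arbitrary). It also checks how close the probit curve is to the logit curve once the index is rescaled by the factor of 1.6 mentioned in footnote 1.

```python
import math

def lpm(index):
    """Linear probability model: the predicted 'probability' is the index itself."""
    return index

def probit(index):
    """Standard normal CDF, Phi(index)."""
    return 0.5 * (1.0 + math.erf(index / math.sqrt(2.0)))

def logit(index):
    """Standard logistic CDF, Lambda(index)."""
    return 1.0 / (1.0 + math.exp(-index))

for xb in (-2.0, -0.5, 0.0, 0.5, 1.0, 2.0):
    print(f"XB={xb:5.1f}  linear={lpm(xb):6.3f}  "
          f"probit={probit(xb):6.3f}  logit={logit(xb):6.3f}")

# Largest gap between Phi(v) and Lambda(1.6 v) on a grid from -4 to 4.
grid = [i / 100.0 for i in range(-400, 401)]
max_gap = max(abs(probit(v) - logit(1.6 * v)) for v in grid)
print(f"largest gap between probit(v) and logit(1.6 v): {max_gap:.4f}")
```

The linear column leaves the unit interval as soon as the index does, while the other two never can. The small gap in the last line is why the choice between probit and logit rarely matters in practice.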
Probits and logits cannot be estimated by least squares. Instead we use a method
called maximum likelihood (ML). Simply put, this method consists in finding the
parameters of the distribution that maximise the likelihood that our data comes from
precisely that distribution. You don’t have to be able to maximise the likelihood
yourself, generally only computers can.
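For the curious, this is roughly what the computer does. The sketch below fits a logit with a constant and one explanatory variable by Newton-Raphson, which is one common way to maximise the log-likelihood; statistical packages may use other algorithms. The data are simulated and the "true" coefficients are assumptions chosen for the example.

```python
import math
import random

def logistic(v):
    return 1.0 / (1.0 + math.exp(-v))

def log_likelihood(b0, b1, x, y):
    """Sum over observations of log P(y_i | x_i) under the logit model."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = logistic(b0 + b1 * xi)
        total += math.log(p) if yi == 1 else math.log(1.0 - p)
    return total

def fit_logit(x, y, steps=25):
    """Maximise the logit log-likelihood by Newton-Raphson, starting from zero."""
    b0 = b1 = 0.0
    for _ in range(steps):
        p = [logistic(b0 + b1 * xi) for xi in x]
        # Gradient of the log-likelihood.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for yi, pi, xi in zip(y, p, x))
        # Minus the Hessian (a 2x2 matrix), built from the weights p(1-p).
        w = [pi * (1.0 - pi) for pi in p]
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 ** 2
        # Newton step: solve the 2x2 system (minus Hessian) * step = gradient.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

random.seed(5)
B0, B1 = -0.5, 1.0          # assumed true coefficients
n = 5000
x = [random.gauss(0, 1) for _ in range(n)]
y = [1 if random.random() < logistic(B0 + B1 * xi) else 0 for xi in x]

b0_hat, b1_hat = fit_logit(x, y)
print(f"b0_hat={b0_hat:.3f}  b1_hat={b1_hat:.3f}")
print(f"log-likelihood at estimates: {log_likelihood(b0_hat, b1_hat, x, y):.1f}")
print(f"log-likelihood at zero:      {log_likelihood(0.0, 0.0, x, y):.1f}")
```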
a. The estimated coefficients don’t mean much.
Pretty depressing at first. To see what this means it is better to think about OLS first.
In an OLS regression coefficients have an intuitive interpretation. Say you estimate
consumption as a function of income and the income coefficient is 0.8. This means
that if income increases by 100, consumption will increase by 80. Now say you use a
probit to estimate the probability of going to college as a function of income and you
get 0.8 again. Does it mean that when income increases by 100 the probability
increases by 80? Obviously not! If you look closely at the model you see that the
coefficients are inside a complicated function and to find the marginal effect of a
variable you need to take the derivative of the function with respect to that variable.
This will depend on the coefficient of the variable but also on the coefficients and on
the sample values of all the other variables in the regression. Clearly, you don’t have
to do this in the exam. Generally authors who estimate probit models will kindly
provide you with three numbers: the coefficient, the t-statistic (or standard error) and
the marginal effect (typically in brackets, below the t-stat).
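To make the point about marginal effects concrete: in a probit, the derivative of $\Phi(XB)$ with respect to one variable is the normal density evaluated at XB times that variable's coefficient. The sketch below uses made-up coefficients (a constant of -1.0 and an income coefficient of 0.8) and shows that the same coefficient implies different marginal effects at different income levels.

```python
import math

def normal_pdf(v):
    return math.exp(-0.5 * v * v) / math.sqrt(2.0 * math.pi)

def normal_cdf(v):
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def probit_marginal_effect(coefs, values, k):
    """dP/dx_k for a probit: phi(XB) * B_k, evaluated at the given values of X."""
    index = sum(c * v for c, v in zip(coefs, values))
    return normal_pdf(index) * coefs[k]

coefs = [-1.0, 0.8]          # hypothetical estimates: constant, income
low_income = [1.0, 0.5]      # first entry is the constant term
high_income = [1.0, 3.0]

me_low = probit_marginal_effect(coefs, low_income, 1)
me_high = probit_marginal_effect(coefs, high_income, 1)
print(f"coefficient on income:            {coefs[1]:.3f}")
print(f"marginal effect at income = 0.5:  {me_low:.3f}")
print(f"marginal effect at income = 3.0:  {me_high:.3f}")
```

Papers usually report the marginal effect at one point only, often the sample means of the explanatory variables.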
b. T-stats are t-stats no matter what
That is, to see whether a coefficient is statistically different from zero at the 5%
level, compare the absolute value of the t-stat (= coefficient/standard error) to 1.96.
If it is LARGER than 1.96 the coefficient is significant.
You might encounter something called ‘robust’ or ‘White’ standard errors (see footnote 2);
don’t worry, the critical values for the t-stat are always the same.
c. It’s rare to see Instrumental Variables with probit models.
You can do a two stage estimation similar to the IV with OLS but the standard errors
come out wrong and it is a bit complicated to correct them. That’s why you rarely see
it. If there is a problem of endogeneity, people tend to use linear probability models
(with which you can do IV).
d. Mis-specification
Omitted variables, heteroscedasticity, measurement error and endogeneity all create
problems for probit and logit estimates. The consequences of mis-specification are
even more serious than in OLS. For instance if there is an omitted variable, the
coefficients on the other variables are biased even if these are not correlated with the
omitted variable itself. If the disturbances are heteroscedastic the probit and logit
estimators are inconsistent.
1 The coefficients B will differ because they come from different functions. To compare them one has
to multiply the probit coefficients by 1.6. This is quite irrelevant for the purpose of this course.
2 These are sometimes used to correct for heteroscedasticity.
13 Fixed Effects (again)
Assume you want to estimate the effect of contractual structure on farm productivity.
You expect plots cultivated under fixed rent contracts to be more productive than
plots which are sharecropped (why?). Say you have data on 1000 plots, you run your
regression and find that indeed the coefficient on the dummy variable that
identifies fixed rent contracts is positive and significant. Unfortunately that’s not
enough to prove your theory right (actually nothing can “prove” a theory right but
here we’re really far from it!). I might argue that behind all of this there is an omitted
variable that affects both contracts and productivity. I say that very good (smart,
educated, entrepreneurial, whatever) farmers are obviously more productive and that,
at the same time, they prefer fixed rent contracts (they are confident they’d do a good
job so they are willing to take all the risk). So the results might have nothing to do
with your theory, they are just a product of farmers’ heterogeneity. If you can’t find a
proxy for farmers’ ability, what can you do?
There is a way out if each farmer cultivates more than one plot. The way out, as you
might imagine, is fixed effects estimation. Say there are 200 farmers, cultivating 5
plots each. To do fixed effects you create 199 dummies, which identify each farmer,
and plug them in the regression with the other explanatory variables. For instance the
first dummy is equal to 1 if the plot is tilled by Mr. Blue, 0 otherwise; the second
dummy is equal to 1 if the plot is tilled by Mr. Pink, 0 otherwise and so on. This
allows you to estimate the effect of contracts on productivity, conditional on farmers’
identity, which controls for all farmers’ unobserved characteristics. Now you’re
checking whether, given the farmer’s identity (and hence his skills etc.), fixed rent
contracts foster productivity. If the coefficient on the contract variable is still positive
and significant, you now know it was not so because of farmers’ heterogeneity (see footnote 3).
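The farmer example can be mimicked with simulated data. In the sketch below every number is an assumption: unobserved ability raises productivity and also makes a fixed rent contract more likely, so the pooled regression overstates the contract effect. Instead of literally creating 199 dummies, the code subtracts each farmer's own mean from his plots' contract dummy and output (the "within" transformation). This gives the same coefficient on the contract variable as the dummy variable regression.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance (divides by n); cov(a, a) is the variance."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def demean_by_group(values, groups):
    """Subtract each group's mean from its members' values."""
    sums, counts = {}, {}
    for v, g in zip(values, groups):
        sums[g] = sums.get(g, 0.0) + v
        counts[g] = counts.get(g, 0) + 1
    return [v - sums[g] / counts[g] for v, g in zip(values, groups)]

random.seed(6)
N_FARMERS, PLOTS_EACH = 200, 5
TRUE_EFFECT = 1.0            # assumed true effect of a fixed rent contract

farmer, fixed_rent, output = [], [], []
for i in range(N_FARMERS):
    ability = random.gauss(0, 1)          # unobserved by the econometrician
    for _ in range(PLOTS_EACH):
        # Abler farmers are more likely to hold a fixed rent contract.
        contract = 1.0 if ability + random.gauss(0, 1) > 0 else 0.0
        farmer.append(i)
        fixed_rent.append(contract)
        output.append(TRUE_EFFECT * contract + 2.0 * ability + random.gauss(0, 0.5))

# Pooled OLS ignores who the farmer is.
b_pooled = cov(fixed_rent, output) / cov(fixed_rent, fixed_rent)

# Fixed effects via the within transformation.
x_within = demean_by_group(fixed_rent, farmer)
y_within = demean_by_group(output, farmer)
b_fe = sum(a * b for a, b in zip(x_within, y_within)) / sum(a * a for a in x_within)

print(f"true effect   = {TRUE_EFFECT:.2f}")
print(f"pooled OLS    = {b_pooled:.2f}")
print(f"fixed effects = {b_fe:.2f}")
```

Farmers whose plots are all under the same contract have zero demeaned values and contribute nothing to the fixed effects estimate. The same thing happens to every farmer if each one cultivates a single plot.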
14 FIML and LIML
In some cases we find a model within a larger model. For example in the group
lending paper (Pitt et al.) the authors wanted to explain some economic outcomes
(expenditure, kids’ schooling etc.) as a function of the amount of credit received
through group lending. At the same time the amount of credit received was itself a
function of other variables.
3 If you have understood how this works it should be trivial to see that you can’t do fixed effects if each
farmer only cultivates one plot.
To keep matters simple say you want to estimate y1 as a function of (x, θ) and y2 as a
function of (z, λ, y1). Here x and z are the independent variables and θ and λ are the
parameters to be estimated. Notice that y1 only depends on θ while y2 depends on λ
AND θ (through y1). You have a choice:
– With full information maximum likelihood (FIML) estimation you form the
joint distribution f(y1, y2 | x, z, λ, θ) and then maximise the full likelihood function.
– With limited information maximum likelihood (LIML) estimation you first
estimate θ (since y1 is a function of θ only) and then you use the estimated value
of θ to estimate λ. This method is very convenient because you don’t have to form
the joint distribution and maximise the full likelihood function (which is
computationally very complex); rather you get the distribution f(y1 | x, θ), choose θ
to maximise the likelihood and then employ that value of θ (let’s call it θ*) to
form the second distribution f(y2 | z, λ, (x, θ*)) and the likelihood that you have to
maximise only with respect to λ.
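A toy version of the two-step logic is sketched below. It is not the estimator used in the group lending paper, and the functional forms are assumptions: both equations are linear with normal errors, in which case maximising each likelihood amounts to least squares. Step 1 estimates θ from the y1 equation alone; step 2 plugs the predicted value of y1 into the y2 equation and estimates λ. For comparison, the code also runs the naive regression of y2 on the observed y1, which is biased here because the two disturbances are correlated.

```python
import random

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance (divides by n); cov(a, a) is the variance."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

def ols2(y, x1, x2):
    """OLS of y on a constant, x1 and x2, via the two-regressor closed form."""
    v11, v22, c12 = cov(x1, x1), cov(x2, x2), cov(x1, x2)
    c1y, c2y = cov(x1, y), cov(x2, y)
    det = v11 * v22 - c12 ** 2
    b1 = (c1y * v22 - c2y * c12) / det
    b2 = (c2y * v11 - c1y * c12) / det
    return mean(y) - b1 * mean(x1) - b2 * mean(x2), b1, b2

random.seed(7)
THETA, LAMBDA_Z, LAMBDA_1 = 1.5, 1.0, 0.5   # assumed true parameters
n = 20000
x = [random.gauss(0, 1) for _ in range(n)]
z = [random.gauss(0, 1) for _ in range(n)]

y1, y2 = [], []
for xi, zi in zip(x, z):
    common = random.gauss(0, 1)              # shock shared by both equations
    e1 = common + random.gauss(0, 0.5)
    e2 = common + random.gauss(0, 0.5)
    v1 = THETA * xi + e1
    y1.append(v1)
    y2.append(LAMBDA_Z * zi + LAMBDA_1 * v1 + e2)

# Step 1: estimate theta from the y1 equation only.
theta_star = cov(x, y1) / cov(x, x)

# Step 2: use (x, theta*) in place of y1 and estimate lambda.
y1_predicted = [theta_star * xi for xi in x]
_, lambda_z_hat, lambda_1_hat = ols2(y2, z, y1_predicted)

# Naive alternative: put the observed y1 straight into the second equation.
_, _, lambda_1_naive = ols2(y2, z, y1)

print(f"theta* = {theta_star:.3f}")
print(f"two-step: lambda_z = {lambda_z_hat:.3f}, lambda_1 = {lambda_1_hat:.3f}")
print(f"naive lambda_1 = {lambda_1_naive:.3f}")
```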
15 Non Parametric Estimation
15.1 Description
Note that when we want to estimate the relationship between, say, X and Y we
generally assume that this relationship has a specific functional form. We have often
seen LINEAR relationships of the form Y=a+bX. If we think that’s a good
specification (as in the Keynesian consumption function) all we need to do is to find
the appropriate values of a and b, i.e. those such that the predicted Y is as close as
possible to the observed Y, given X. This kind of regression is called
PARAMETRIC (see footnote 4). In some cases we have no reason to think that the relationship
between X and Y should take a specific functional form and we want to learn that
from the data, that is we want to estimate the “shape” of the relationship between X
and Y. Since we do not estimate parameters of a specific functional form (remember
we haven’t specified one) this kind of analysis is called NON-PARAMETRIC.
Intuitively, asking the data to tell us about the shape, as opposed to the parameters of
a given shape, is much more demanding. It follows that non-parametric estimation is
feasible only if we have a large number of data points.
4 Since we estimate parameters given an assumed functional form, any kind of regression where the
functional form is specified, even if it’s not a linear function, is called parametric. During the course
we have seen many different functional forms: for instance the Strauss and Thomas paper assumed that
productivity was a concave function of calorie intake.
Remember that a regression of Y on X is the conditional expectation of Y given X,
that is E(Y|X). With OLS we assume that the regression function is linear. Under non
parametric estimation we assume nothing. How can we get E(Y|X)? If we had infinite
data points, E(Y|X) would simply be the average of all the values of Y for a given X.
If (as is always the case!) the data set contains a finite, no matter how large, number
of Xs, things get messier. The idea, however, is similar. Non parametric estimation
does the following:
1. divide up the range of X into an (evenly spaced) grid of points (100 in Deaton’s paper)
2. choose a symmetric interval around each X: the size of the interval is called
the bandwidth (see footnote 5)
3. choose a function that gives weights to different data points within the interval such
that points farther away from the central X get less weight and points at the edge of
the band, and outside it, have zero weight (remember that we are trying to
approximate the procedure we’d follow if we had infinite Xs: in that case we’d take
one X at a time, here we take a bunch of them but give more weight to the one in
the middle). This weighting function is called a “kernel” function and comes in a
couple of different forms (the one in Deaton’s paper is a “quartic” kernel
function). Then do either:
4a KERNEL ESTIMATION: For each interval, compute the average of all the Ys that
correspond to the Xs within the interval, giving less weight to those corresponding
to Xs that are farther away (as dictated by the kernel function specified above).
4b SMOOTH LOCAL REGRESSION: For each interval run a weighted OLS
regression of Y on X using the weights defined by the kernel function above.
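The steps above can be sketched in a few lines. The example below is only an illustration on simulated data. The true relationship (a sine curve), the sample size and the bandwidth of 0.5 are all assumptions, and the quartic kernel is written in its usual form (15/16)(1 - u²)² for |u| < 1, which may differ in scaling from the one in Deaton's paper. `kernel_mean` implements 4a and `local_linear` implements 4b.

```python
import math
import random

def quartic_kernel(u):
    """Quartic (biweight) kernel: positive inside the band, zero at the edge and beyond."""
    return (15.0 / 16.0) * (1.0 - u * u) ** 2 if abs(u) < 1.0 else 0.0

def kernel_mean(x0, xs, ys, h):
    """Step 4a: kernel-weighted average of the Ys whose Xs lie within h of x0."""
    w = [quartic_kernel((xi - x0) / h) for xi in xs]
    return sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)

def local_linear(x0, xs, ys, h):
    """Step 4b: weighted OLS of Y on X around x0, evaluated at x0."""
    w = [quartic_kernel((xi - x0) / h) for xi in xs]
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, xs)) / sw
    my = sum(wi * yi for wi, yi in zip(w, ys)) / sw
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, xs))
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, xs, ys))
    return my + (sxy / sxx) * (x0 - mx)

random.seed(8)
n = 2000
xs = [random.uniform(0, 10) for _ in range(n)]
ys = [math.sin(xi) + random.gauss(0, 0.3) for xi in xs]   # assumed true shape: sin(x)

h = 0.5                                                   # bandwidth (step 2)
lo, hi = min(xs), max(xs)
grid = [lo + (hi - lo) * i / 99 for i in range(100)]      # 100 evenly spaced points (step 1)

fit_4a = [kernel_mean(g, xs, ys, h) for g in grid]
fit_4b = [local_linear(g, xs, ys, h) for g in grid]

for g, a, b in list(zip(grid, fit_4a, fit_4b))[::20]:
    print(f"x={g:5.2f}  true={math.sin(g):6.3f}  kernel mean={a:6.3f}  local regression={b:6.3f}")
```

Re-running with a much smaller or much larger `h` shows the trade-off: a narrow band gives a wiggly curve, a wide band flattens the true shape.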
15.2 PROS & CONS
The obvious advantage of non parametric vs. parametric estimation is that with the
former we let the data choose the shape of the relationship. Unfortunately, non
parametric estimation is both more flexible and more costly. In particular:
1. We need a very big sample (otherwise you need to specify a very wide bandwidth,
which results in very imprecise estimates, which are practically useless)
2. It is quite difficult to condition Y on more than one variable (you’d need many
many more data points)
3. Non parametric estimation cannot deal with simultaneity (endogeneity),
measurement error, selection bias etc.
5 You have to choose the size of the bandwidth bearing in mind that estimates will be less biased but
also more variable the smaller is the bandwidth (if it were zero we’d know the “true” density at each
point, but this makes sense only if the data were infinite)