# airline data

1. [10 points] The following questions are based on airline data collected from routes in the U.S. for the year 2000. You are interested in examining the determinants of ticket prices.
You decide to generate a scatter diagram to visually assess if the average one-way fare for a route is related to the distance of that route.

Based on this scatter diagram, would OLS be BLUE if we ran a regression of the average one-way fare for a route on the distance of that route? Explain why or why not and if this would affect any estimated coefficients.

2. [28 points] Answer the following questions with respect to the estimated model of gasoline consumption below. It is based on data collected for the 48 contiguous states.
$\hat{Y}_i = 377.29 - 34.79T_i + 1336.45D_i - 0.07I_i - 0.002M_i$
$R^2 = 0.68$
Where:
Y = gasoline consumption (gallons per person)
T = gasoline tax (cents per gallon)
D = proportion of the population with driver’s licenses
I = per capita personal income (in dollars)
M = length of roads in the state (in miles)

A. [5 points] Does the estimated value for the intercept term in this model have any economic meaning? If yes, what is it? If not, why not and why is it included?
B. [5 points] Explain what it means that $R^2 = 0.68$. What is an alternative measure that can be used for the same purpose?
C. [5 points] If Massachusetts increased its gasoline tax by \$0.15, everything else unchanged, how much would the per person consumption of gasoline change in the state?
D. [8 points] An economist might look at this estimated equation and be concerned about the potential for specification error due to an omitted variable. What variable would you recommend be included, why, and in what direction is its exclusion biasing any of your estimates?
E. [5 points] What is the meaning of the estimated coefficient on I? Based on this information, what type of good is gasoline as it relates to I? [Hint: remember what you learned in microeconomic theory]
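As an illustration of how a slope in the estimated equation above is read, the following Python sketch evaluates the fitted equation for a hypothetical state and then raises the tax by one cent with everything else held fixed. The input values and the helper name `predicted_consumption` are invented for the example and do not come from the data.

```python
# Coefficients copied from the estimated gasoline consumption equation.
COEF = {"const": 377.29, "T": -34.79, "D": 1336.45, "I": -0.07, "M": -0.002}


def predicted_consumption(T, D, I, M):
    """Fitted gasoline consumption (gallons per person) for one state."""
    return (COEF["const"] + COEF["T"] * T + COEF["D"] * D
            + COEF["I"] * I + COEF["M"] * M)


# Hypothetical state, invented purely for illustration (not from the data).
base = dict(T=8.0, D=0.6, I=3000.0, M=10000.0)
# Raise the tax by one cent per gallon, holding D, I and M fixed.
bumped = dict(base, T=base["T"] + 1.0)

change = predicted_consumption(**bumped) - predicted_consumption(**base)
print(round(change, 2))  # the ceteris paribus change equals the coefficient on T
```

Because the model is linear in T, the change does not depend on the hypothetical values chosen for the other variables.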
3. [23 points] Based on annual data from 1990-2010, the following regressions were obtained:
Model A: $\hat{Y}_i = \underset{(0.122)}{2.69} - \underset{(0.114)}{0.48}\,X_i$, $R^2 = 0.66$

Model B: $\ln(\hat{Y}_i) = \underset{(0.0115)}{0.78} - \underset{(0.049)}{0.25}\,\ln(X_i)$, $R^2 = 0.74$


Where:
Yi = cups of coffee consumed by the ith person per day
Xi = price of a cup of coffee in dollars
ln( ) = natural log of ( )
*Standard errors are reported in parentheses
A. [5 points] Provide the economic meaning of the slope coefficients in the two models.
B. [5 points] Test the significance of the independent variable in each model using $\alpha = 0.05$.
C. [8 points] Construct a 95% confidence interval for the coefficient on the independent variable in each model. Explain what this interval tells us and how it can be affected by the consistency of your estimator.
D. [5 points] Since the $R^2$ is larger in Model B than in Model A, is this evidence that Model B is superior to Model A? Explain why or why not.
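
The test statistics and intervals in parts B and C follow from the reported slope coefficients and standard errors. A minimal Python sketch, assuming 21 annual observations (1990 through 2010 inclusive), hence 19 degrees of freedom and a two-sided 5% critical value of about 2.093. The helper names `t_stat` and `conf_int` are illustrative.

```python
# Assumed: two-sided 5% critical value of the t distribution with 21 - 2 = 19 df.
T_CRIT = 2.093


def t_stat(coef, se):
    """t statistic for H0: coefficient = 0."""
    return coef / se


def conf_int(coef, se, t_crit=T_CRIT):
    """Confidence interval: coefficient plus or minus t_crit * standard error."""
    half_width = t_crit * se
    return coef - half_width, coef + half_width


# (slope coefficient, standard error) as reported for each model.
models = {"A": (-0.48, 0.114), "B": (-0.25, 0.049)}

for name, (b, se) in models.items():
    t = t_stat(b, se)
    lo, hi = conf_int(b, se)
    print(f"Model {name}: t = {t:.2f}, |t| > {T_CRIT}: {abs(t) > T_CRIT}, "
          f"95% CI = ({lo:.3f}, {hi:.3f})")
```

If the sample size differs from the assumed 21, only `T_CRIT` needs to change.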

4. [18 points] Suppose that you are interested in studying the effect parental involvement has on children’s grades. You plan to measure parental involvement by asking the students a few survey questions.
After asking several prospective schools in Massachusetts to participate, Millford Academy, ranked one of the most prestigious private high schools in the state, allows you to conduct your study there. They have agreed to send a participation consent form to each student’s parents or guardians asking whether or not they consent to have their child take the survey and release their grades for each course.
A. [6 points] Provide 2 questions that you would want to ask on your survey in order to collect information on parental involvement and justify their use. Make sure these provide concrete measures that can be used in your model.
B. [12 points] If you estimate your model using OLS, do you believe that any of our Gauss-Markov assumptions would be violated? If so, explain which ones, why they are violated, and what potential problems that could pose for your estimation.

5. [21 points] You wish to predict the sale price of single-family residences in Massachusetts using property features (commonly called a “hedonic pricing model”). You collect price and property features data on each property sold in the state for the years 2008 – 2014.
A. [5 points] Taking into account the structure of the data, how would you categorize this dataset? What is an advantage and disadvantage of using this type of dataset?
B. [6 points] You decide to examine the distribution of prices in 2008 to gain a sense of the data and produce the following histogram (note that the data is plotted alongside a normal curve):
Does the distribution of prices affect any of our CLM assumptions? If we are collecting sales data for the entire state of Massachusetts for 7 years, will this still pose a problem for estimation or inference? Explain why or why not.

C. [4 points] Using the above data, you plan to estimate the following model:
$\text{Price}_i = \beta_0 + \beta_1\,\text{lotsize}_i + \beta_2\,\text{houseage}_i + \beta_3\,\text{bedrooms}_i + \beta_4\,\text{bathrooms}_i + \mu_i$
Where:
lotsize = size of the house (in square feet)
houseage = age of the house (in years)
bedrooms = number of bedrooms in the house
bathrooms = number of bathrooms in the house

What sign would you expect each of these coefficients to take? Explain why.
D. [6 points] After running the regression, you discover that both bedrooms and bathrooms are statistically insignificant at the $\alpha = 0.10$ level. What problem might there be with your equation that could explain this? How could you fix this problem, and how else could you test for the statistical significance of bedrooms and bathrooms?
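
One way to examine two coefficients together rather than one at a time is a joint F-test that compares the fit of the model with and without both variables. The Python sketch below uses the $R^2$ form of the statistic, which applies when both regressions share the same dependent variable. Every number in it is a hypothetical placeholder rather than a result from the housing data, and the function name `joint_f_stat` is illustrative.

```python
def joint_f_stat(r2_unrestricted, r2_restricted, n, k, q):
    """F statistic for H0: the q excluded coefficients are all zero.

    n = number of observations
    k = number of slope coefficients in the unrestricted model
    q = number of restrictions (variables dropped in the restricted model)
    """
    numerator = (r2_unrestricted - r2_restricted) / q
    denominator = (1.0 - r2_unrestricted) / (n - k - 1)
    return numerator / denominator


# Hypothetical values only: R^2 with all four regressors, R^2 after dropping
# bedrooms and bathrooms, and an invented sample size.
f = joint_f_stat(r2_unrestricted=0.60, r2_restricted=0.55, n=1000, k=4, q=2)
print(round(f, 2))
```

The resulting statistic would be compared with the critical value of the F distribution with q and n - k - 1 degrees of freedom.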