Linear Regression
Data Production, Sampling Methods, LLN/FTS
Probability & Process Distributions
Normal Distributions: Linear Combos & Transformations
100

A correlation, r = 0.9, was found between the number of fire trucks at a scene and damage amount in dollars. Explain why this doesn't imply causation.

The high correlation arises because both variables are driven by fire severity, a confounding variable. More severe fires bring more trucks to the scene and also cause more damage; the trucks themselves do not cause the damage.

100

A researcher surveys shoppers at a mall on a weekday afternoon about their work-life balance. Identify the type of haphazard sampling method and a proper sampling alternative.

Haphazard sampling method: Convenience sampling. The researcher asks whoever happens to be in a high-traffic area during typical work hours, so people who are at work at that time are left out.

Sampling alternative: a stratified random sample that draws individuals at random from each employment status (and work schedule).

100

State the three methods that can be used to check if two events, A and B, are independent. 

1. P(A and B) = P(A)*P(B)

2. P(B|A) = P(B)

3. P(A|B) = P(A)
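
The three checks can be run numerically. The sketch below uses made-up probabilities (P(A) = 0.5, P(B) = 0.4, P(A and B) = 0.2 are illustrative assumptions, not values from the question) and compares with a tolerance because of floating-point rounding.

```python
import math

# Hypothetical probabilities, chosen only to illustrate the three checks.
p_a = 0.5
p_b = 0.4
p_a_and_b = 0.2

# Conditional probabilities from the definition P(B|A) = P(A and B) / P(A).
p_b_given_a = p_a_and_b / p_a
p_a_given_b = p_a_and_b / p_b

check_1 = math.isclose(p_a_and_b, p_a * p_b)   # P(A and B) = P(A)*P(B)
check_2 = math.isclose(p_b_given_a, p_b)       # P(B|A) = P(B)
check_3 = math.isclose(p_a_given_b, p_a)       # P(A|B) = P(A)

print(check_1, check_2, check_3)
```

For these numbers all three checks agree, which is expected: when P(A) and P(B) are nonzero, the three conditions are equivalent, so verifying any one of them is enough.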

100

X ~ N(100, 15), and Y = 2X + 10

Find mean(Y)

2*100 + 10 = 210

200

A student regresses GPA over weekly hours studied and obtains residuals that resemble a U-shape. Why should we be cautious about fitting a LSQ line on this data?

The relationship likely isn't linear! A residual plot should show a random scatter of points around zero; a pattern such as a U-shape means the line is missing curvature in the data.
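
A small sketch of how a U-shaped residual pattern can arise. The data here are made up (a perfectly curved relationship, not real GPA data), and the least-squares line is computed by hand from the usual formulas.

```python
# Hypothetical curved data: y depends on x through a parabola.
xs = list(range(11))
ys = [(x - 5) ** 2 for x in xs]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept.
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
    (x - x_bar) ** 2 for x in xs
)
b0 = y_bar - b1 * x_bar

# Residual = observed y minus the value predicted by the line.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
print([round(e, 1) for e in residuals])
```

The residuals are positive at both ends and negative in the middle, which is the U-shape that signals a straight line is the wrong model.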

200

What does the Law of Large Numbers guarantee about the sample mean?

It guarantees that as the sample size increases, the sample mean converges to the population/process mean.
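
A simulation sketch of this behavior, assuming a hypothetical normal process with mean 100 and sd 15 (both values and the seed are arbitrary choices for illustration).

```python
import random

rng = random.Random(1)               # arbitrary seed, for repeatability only
process_mean, process_sd = 100, 15   # hypothetical process parameters

sample_means = {}
for n in (10, 1_000, 100_000):
    sample = [rng.gauss(process_mean, process_sd) for _ in range(n)]
    sample_means[n] = sum(sample) / n
    print(n, sample_means[n])
```

The sample mean for the largest n should land much closer to 100 than the one for n = 10 typically does, though any single small sample can get lucky.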

200

In a warehouse, 60% of products are High Quality (H) and 15% were created by Manufacturer A. Of those created by Manufacturer A, 40% are High Quality. A product is randomly selected.

Compute P(A or H), the probability the randomly selected product was created by Manuf A or is of High Quality.

P(A and H) = P(A) * P(H|A) = 0.15*0.40 = 0.06

P(A or H) = P(A) + P(H) - P(A and H)

= 0.15 + 0.60 - 0.06 = 0.69
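
The same arithmetic as a short sketch (variable names are illustrative):

```python
p_a = 0.15          # P(A): made by Manufacturer A
p_h = 0.60          # P(H): High Quality
p_h_given_a = 0.40  # P(H|A)

# Multiplication rule, then the general addition rule.
p_a_and_h = p_a * p_h_given_a
p_a_or_h = p_a + p_h - p_a_and_h

print(round(p_a_and_h, 2), round(p_a_or_h, 2))  # 0.06 0.69
```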

200

X ~ N(100, 15), and Y = 2X + 10

Find sd(Y)

sd(Y) = 2*15 = 30
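
A minimal sketch of the rules for a linear transformation Y = a*X + b, reading the second parameter of N(100, 15) as the standard deviation (as the answer above does). The names a and b are illustrative.

```python
mean_x, sd_x = 100, 15   # X ~ N(100, 15), second parameter taken as the sd
a, b = 2, 10             # Y = a*X + b

mean_y = a * mean_x + b  # the shift b moves the center
sd_y = abs(a) * sd_x     # the shift b does not change the spread

print(mean_y, sd_y)      # 210 30
```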

300

Consider the following 2+2+1 stats:

x_bar = 5 | y_bar = 72.5 | sx = 2.582 | sy = 6.455 | r = 1 | n = 852

Assuming you regress Y over X, compute the LSQ line.

b1 = 1 * 6.455 / 2.582 = 2.5

b0 = 72.5 - 2.5*5 = 60

Y_hat = 2.5X + 60
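
The slope and intercept formulas as a short sketch, using the summary statistics from the question (variable names are illustrative):

```python
x_bar, y_bar = 5, 72.5
s_x, s_y = 2.582, 6.455
r = 1

b1 = r * s_y / s_x       # slope = r * sy / sx
b0 = y_bar - b1 * x_bar  # the line passes through (x_bar, y_bar)

print(round(b1, 2), round(b0, 2))  # 2.5 60.0
```

Note that n is not needed: the least-squares line depends only on the two means, the two standard deviations, and r.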

300

Consider the population of ALL current students in the Big-10 conference. 

TRUE OR FALSE: If I take a convenience sample of n = 15,000 students who attend certain Big-10 football games and record their GPAs, then, because the sample size is large enough and by the Fundamental Theorem of Statistics, the sample distribution of GPA values will look essentially like the population distribution of GPA values.

False - this sample is taken haphazardly and not randomly; therefore, the link between the sample data and population data has been broken.

300

A game show has two boxes:

- Box A contains 70% gold coins and 30% silver coins.

- Box B contains 20% gold coins and 80% silver coins.

A contestant randomly chooses one of the boxes (each box is equally likely to be chosen). They then draw one coin from the chosen box and it turns out to be gold.

Find the probability that the contestant picked Box A, given that they drew a gold coin.

P(A) = P(B) = 0.50

P(G|A) = 0.7

P(G|B) = 0.2

P(A|G) = P(G|A)*P(A) / [P(G|A)*P(A) + P(G|B)*P(B)]

= 0.7*0.5 / (0.7*0.5 + 0.2*0.5) = 0.35 / 0.45 = 0.78
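
Bayes' rule for this problem as a short sketch (the dictionary names are illustrative):

```python
p_box = {"A": 0.5, "B": 0.5}             # prior: each box equally likely
p_gold_given_box = {"A": 0.7, "B": 0.2}  # chance of gold from each box

# Total probability of drawing gold, summed over both boxes.
p_gold = sum(p_box[box] * p_gold_given_box[box] for box in p_box)

# Bayes' rule: P(A|G) = P(G|A) * P(A) / P(G).
p_a_given_gold = p_box["A"] * p_gold_given_box["A"] / p_gold

print(round(p_gold, 2), round(p_a_given_gold, 2))  # 0.45 0.78
```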

300

Suppose X and Y are INDEPENDENT with normal distributions:

X ~ N(8, 4)

Y ~ N(-5, 12)

Compute mean(3Y + X) and sd(3Y + X).

mean(3Y + X) = 3*(-5) + 8 = -7

sd(3Y + X) = sqrt[3^2*12^2 + 1^2*4^2] = sqrt[1312] = 36.22

*sd of a linear combo of independent variables = sqrt[b^2*sx^2 + c^2*sy^2]
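
The same calculation as a sketch. It relies on X and Y being independent, as the question states; the names coef_y and coef_x are illustrative.

```python
import math

mean_x, sd_x = 8, 4      # X ~ N(8, 4)
mean_y, sd_y = -5, 12    # Y ~ N(-5, 12)
coef_y, coef_x = 3, 1    # the combination 3Y + X

mean_combo = coef_y * mean_y + coef_x * mean_x

# Variances (not sds) add for independent variables, so square first.
sd_combo = math.sqrt(coef_y**2 * sd_y**2 + coef_x**2 * sd_x**2)

print(mean_combo, round(sd_combo, 2))  # -7 36.22
```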

400

A basketball player scores an average of 25 points per game with a standard deviation of 5 points. The correlation between points scored in consecutive games is r = 0.40. In game 1, the player scored 35 points (which is much higher than their average ppg). 

Is the player likely to score 35 points again? Why or why not? Predict the number of points scored in game 2. 

Y_hat(Game 2 | Extreme Game 1) = x_bar + r * k * sx, where k is the number of standard deviations Game 1 is above the mean

k = (35 - 25) / 5 = 2

Y_hat = 25 + 0.4 * 2 * 5 = 29

The player is unlikely to score 35 points in game 2 since, according to the regression to the mean concept, extreme performances are likely followed by more-average outcomes. A more likely prediction is 29 points. 
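
The prediction as a short sketch (variable names are illustrative). It assumes game 2 has the same mean and sd as game 1, which is how the question is set up.

```python
mean_pts, sd_pts = 25, 5   # average and sd of points per game
r = 0.40                   # correlation between consecutive games
game_1 = 35

# How many sds above the mean was game 1?
k = (game_1 - mean_pts) / sd_pts

# The prediction is only r * k sds above the mean, not k sds.
predicted_game_2 = mean_pts + r * k * sd_pts

print(k, round(predicted_game_2, 1))
```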

400

TRUE OR FALSE: If I toss coins (i.e., if I carry out RE1 repeatedly) until I see 12 heads in a row, then the selected data, "12 heads in a row," gives good evidence that the coin is not fair because, if it were fair, the chance of seeing these data is just (0.5)^12 = 0.000244.

False - even highly unlikely events are bound to happen if you continue a process until they occur (e.g., monkeys and Shakespeare).
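
A simulation sketch of the stopping rule with a coin that is fair by construction. The function name, seed, and number of repetitions are arbitrary choices for illustration.

```python
import random

def tosses_until_streak(streak_len, rng):
    """Toss a fair coin until streak_len heads in a row appear; return the toss count."""
    run = 0
    tosses = 0
    while run < streak_len:
        tosses += 1
        if rng.random() < 0.5:   # heads
            run += 1
        else:                    # tails resets the streak
            run = 0
    return tosses

rng = random.Random(0)  # arbitrary seed, for repeatability only
counts = [tosses_until_streak(12, rng) for _ in range(200)]

print(len(counts), min(counts), sum(counts) / len(counts))
```

Every repetition ends with 12 heads in a row even though the coin is fair, so observing the streak under this stopping rule says nothing about fairness; it only takes a few thousand tosses on average to get there.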

400

X = # of injuries in Plant A | Y = # of injuries in Plant B

      | X = 1 | X = 2 | X = 3
Y = 1 | 0.10  | 0.15  | 0.05
Y = 2 | 0.20  | 0.25  | 0.25

Find Var(X:RE).


Marginal distribution of X (column sums): P(X = 1) = 0.30, P(X = 2) = 0.40, P(X = 3) = 0.30

Mean(X:RE) = 1*0.30 + 2*0.40 + 3*0.30 = 2.0

Mean(X^2:RE) = 1^2*0.30 + 2^2*0.40 + 3^2*0.30 = 4.6

Var(X:RE) = 4.6 - 2.0^2 = 0.60
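
A sketch that recomputes the variance directly from the joint table: sum each X column to get the marginal distribution, then apply Var(X) = Mean(X^2) - Mean(X)^2. The dictionary layout is an illustrative choice.

```python
# Joint probabilities keyed by (x, y), copied from the table in the question.
joint = {
    (1, 1): 0.10, (2, 1): 0.15, (3, 1): 0.05,
    (1, 2): 0.20, (2, 2): 0.25, (3, 2): 0.25,
}

# Marginal distribution of X: add the probabilities in each X column.
marginal_x = {}
for (x, _y), p in joint.items():
    marginal_x[x] = marginal_x.get(x, 0.0) + p

mean_x = sum(x * p for x, p in marginal_x.items())
mean_x2 = sum(x**2 * p for x, p in marginal_x.items())
var_x = mean_x2 - mean_x**2

print(round(mean_x, 2), round(mean_x2, 2), round(var_x, 2))  # 2.0 4.6 0.6
```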
