Regression
More Regression
Measures of Center
Samples
100

What makes a line of best fit, fit "best"?

Closest to the data, minimizes error

100

(a+b+c)^2 = _______

a^2 + b^2 + c^2 + 2ab + 2bc + 2ac

100

What are the 3 main measures of center? Find each for this set of data:

{ 1 , 2 , 3 , 1 , 2 , 3 , 1 , 2 , 3 , 4 }

Mean = 2.2

Median = 2

Mode = {1,2,3}
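As a quick check, the three measures can be computed with Python's standard statistics module. This is only a sketch: the variable names are illustrative, and multimode assumes Python 3.8 or later.

```python
from statistics import mean, median, multimode

data = [1, 2, 3, 1, 2, 3, 1, 2, 3, 4]

data_mean = mean(data)        # sum of the values divided by the count
data_median = median(data)    # middle of the sorted data (average of the two middle values here)
data_modes = multimode(data)  # every value tied for most frequent, in first-seen order

print(data_mean, data_median, data_modes)
```

Using multimode rather than mode matters here, because three values are tied for most frequent.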

100

Given the sample {30, 40, 50} of apples on a tree from a group of 7,000,000 apple trees find: 

1. How many apple trees have fewer than 40 apples?

2. More than 900 apples?

3. Between 30 and 50 apples?


mean = 40;  sample standard deviation = 10; standard error = 5.77

1. P(0) = 50% --> 3.5 mil

2. Q(148.96) = 1- P(148.96) = reasonably 0%

3. P(1) = .8413; P(-1) = 1-.8413 = .1587

.8413-.1587 = .6826 --> 4,778,200 apple trees

200

What are the 4 types of models we can linearize? Provide the general equation for each

1. linear y = ax+b

2. natural log y = alnx+b

3. exponential y = ab^x

4. variation y = ax^b

200

For the given data - Fill in the formula for r with the relevant sums but do not multiply it out.

x    y

1   17

3   42

5   8.3

r = (-17.4)/sqrt((8)(612.127))
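The three sums in that answer can be reproduced with a short sketch. It assumes the deviation form of the formula, r = sum((x-mean(x))(y-mean(y))) / sqrt(sum((x-mean(x))^2) * sum((y-mean(y))^2)), which is the form the numbers above match; the variable names are illustrative.

```python
from math import sqrt

xs = [1, 3, 5]
ys = [17, 42, 8.3]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# The three sums that go into the formula for r
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))  # numerator
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)

r = s_xy / sqrt(s_xx * s_yy)
print(round(s_xy, 3), round(s_xx, 3), round(s_yy, 3), round(r, 3))
```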


200

Find the mean, median, mode, variance, and standard deviation of the data set: 

{32, 54, 63, 12, 56, 23, 9}

mean = 35.57

median = 32

mode = all

variance = 420.24 (population variance, dividing by n)

standard deviation = 20.4998
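These values can be checked with the standard statistics module. The sketch below assumes the population versions (pvariance and pstdev, which divide by n), since those are the ones that reproduce the answers above.

```python
from statistics import mean, median, multimode, pvariance, pstdev

data = [32, 54, 63, 12, 56, 23, 9]

data_mean = mean(data)
data_median = median(data)
data_modes = multimode(data)   # every value appears once, so all seven are returned
pop_variance = pvariance(data) # divides the sum of squared deviations by n
pop_std = pstdev(data)         # square root of the population variance

print(round(data_mean, 2), data_median, round(pop_variance, 2), round(pop_std, 4))
```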

200

What is the only difference in calculating a sample variance from a population variance? Why is this difference included?

Divide by n-1 for a sample instead of n for a population. This gives a slightly larger variance, which compensates for the fact that a sample only captures a small part of the whole population and tends to underestimate its spread.

300

What is the range of an r-value (correlation coefficient)? What does an r-value tell us about a line of best fit?

range: [-1,1]

r tells us how closely the data points cluster around the line and whether the trend is positive or negative --> the closer |r| is to 1, the better a line of best fit models our data.

300

Using your formulas from brute force for m and b, write out lists you would need in your calculator and the sums of those needed to find m and b. You do not need to find m or b.

x   y

1   17

3   42

5   8.3

sum (x) = 9

sum (y) = 67.3

sum(x^2) = 35

sum(xy) = 184.5
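The calculator lists and their sums can be mirrored in a few lines of Python. The formulas for m and b at the end are an assumption: they are the usual closed-form least-squares expressions, which may be written differently from the brute-force formulas used in class.

```python
xs = [1, 3, 5]
ys = [17, 42, 8.3]
n = len(xs)

# The four sums from the calculator lists
sum_x = sum(xs)
sum_y = sum(ys)
sum_x2 = sum(x * x for x in xs)
sum_xy = sum(x * y for x, y in zip(xs, ys))

# Assumed closed-form least-squares formulas for slope and intercept
m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
b = (sum_y - m * sum_x) / n

print(sum_x, sum_y, sum_x2, sum_xy, round(m, 3), round(b, 3))
```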

300

What z-scores (lower and upper) will capture the following: 

1. 68.26% of data centered on mean

2. 95% of data centered on mean

3. 99.7% of data centered on mean

1. {-1,1}

2. {-2,2} (more precisely {-1.96,1.96})

3. {-3,3}
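These cutoffs can be checked against the standard normal distribution. The sketch below uses NormalDist from the standard library (Python 3.8 or later); the helper name central_z is made up for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def central_z(coverage):
    """Return the upper z-score of the symmetric interval holding `coverage` of the data."""
    # Half of the leftover area sits in each tail, so the upper cutoff
    # has cumulative area 0.5 + coverage / 2 to its left.
    return standard_normal.inv_cdf(0.5 + coverage / 2)

z_68 = central_z(0.6826)
z_95 = central_z(0.95)
z_997 = central_z(0.997)

print(round(z_68, 2), round(z_95, 2), round(z_997, 2))
```

The lower z-score in each case is the negative of the upper one, since the interval is centered on the mean.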

300

Find the standard error of the data set: 

{32, 54, 63, 12, 56, 23, 9}

Sample standard deviation = sqrt(2941.71/(7-1)) = 22.14

standard error = 22.14 / sqrt(7) = 8.37
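A short sketch of the same calculation, assuming standard error means the sample standard deviation divided by sqrt(n), as in the answer above:

```python
from math import sqrt
from statistics import stdev

data = [32, 54, 63, 12, 56, 23, 9]

sample_std = stdev(data)                        # divides by n-1
standard_error = sample_std / sqrt(len(data))   # spread of the sample mean

print(round(sample_std, 2), round(standard_error, 2))
```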

400

Find the ssRES for the following data given y'=2x-3

x     y

1    4

2    6

3    7

4   10

5   12

116
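The answer comes from squaring each residual (actual y minus predicted y') and adding them up. A minimal sketch, with predict as an illustrative name for the given line:

```python
xs = [1, 2, 3, 4, 5]
ys = [4, 6, 7, 10, 12]

def predict(x):
    """The given model y' = 2x - 3."""
    return 2 * x - 3

# Residual = actual value minus predicted value
residuals = [y - predict(x) for x, y in zip(xs, ys)]
ss_res = sum(res ** 2 for res in residuals)

print(residuals, ss_res)
```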

400

Find the model that best fits the data and find the equation of best fit to the nearest thousandth place.

x   y

1   4

2   6

3   7

4   10

5   814

exponential 

lny=1.114x-.517

y = .596(3.047)^x
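The exponential fit can be reproduced by linearizing: take ln of each y, fit a straight line to (x, lny), then undo the log. The sketch below covers only that step, not the comparison against the other model types, and uses the ordinary least-squares slope and intercept formulas as an assumption about the method.

```python
from math import exp, log

xs = [1, 2, 3, 4, 5]
ys = [4, 6, 7, 10, 814]

# Linearize: ln(y) = ln(a) + x * ln(b)
ln_ys = [log(y) for y in ys]

n = len(xs)
x_bar = sum(xs) / n
ln_y_bar = sum(ln_ys) / n

# Least-squares line through the points (x, ln y)
slope = (sum((x - x_bar) * (ly - ln_y_bar) for x, ly in zip(xs, ln_ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = ln_y_bar - slope * x_bar

# Undo the linearization to get y = a * b^x
a = exp(intercept)
b = exp(slope)

print(round(slope, 3), round(intercept, 3), round(a, 3), round(b, 3))
```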

400

When finding the standard deviation of a data set, why do we not just find the average distance of each data point from the mean? What do we do instead?

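The deviations from the mean always sum to 0, because the positive and negative distances cancel.

Instead, we square each deviation, average the squares, and then take the square root of the result.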
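A small numeric illustration, using one of the data sets from this game as an assumed example and the population form (dividing by n):

```python
from math import sqrt

data = [32, 54, 63, 12, 56, 23, 9]
mean_value = sum(data) / len(data)

deviations = [x - mean_value for x in data]

# Positive and negative deviations cancel (zero up to floating-point rounding)
sum_of_deviations = sum(deviations)

# Squaring removes the signs; the square root returns to the original units
pop_std = sqrt(sum(d ** 2 for d in deviations) / len(data))

print(round(sum_of_deviations, 9), round(pop_std, 4))
```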

400

A chain cookie store has been selling cookies laced with mercury. Here is a sample of the amount of mercury per batch from 8 stores: 

{3.2, 5.6, 4.3, 7.8, 9.0, 1.1, 4.5, 6.6}

Find the 99% confidence interval and interpret it.

z = 2.576, mean = 5.2625,  sample standard dev = 2.5466, standard error = .900

2.576 = (x - 5.2625) / .900 --> x = 7.582 (upper bound)

margin of error: 7.582 - 5.2625 = 2.319

lower bound: 5.2625 - 2.319 = 2.94

99% confident that the true mean amount of mercury in the cookies is between 2.94 and 7.58.
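The interval above can be reproduced as mean plus or minus z times the standard error. This sketch takes z = 2.576 directly from the answer rather than computing it, and the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

mercury = [3.2, 5.6, 4.3, 7.8, 9.0, 1.1, 4.5, 6.6]

z_star = 2.576  # critical value for 99% confidence, taken from the answer above
sample_mean = mean(mercury)
standard_error = stdev(mercury) / sqrt(len(mercury))

margin = z_star * standard_error
lower = sample_mean - margin
upper = sample_mean + margin

print(round(lower, 2), round(upper, 2))
```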

500

Why do we calculate the ssRES instead of the sum of the residuals? hint: there are 2 reasons

1. The residuals cancel each other out because of their positive and negative signs. Squaring makes them all positive.

2. Squaring exaggerates larger errors, so points far from the line are penalized more.

500

DAILY DOUBLE - SPICY

Reverse the linearization of the equation y = 8x^4 to write it in terms of natural logs. 

Your final answer should look like: lny = alnx+b

lny = ln(8x^4) = ln(8)+4ln(x)


lny = 4lnx + 2.08

500

Randomly sampled salaries in the Bay Area (rounded): {21,000,000; 348,000; 22,000; 125,000; 95,000; 89,000; 206,000; 187,000}

A govt. official is interested in knowing what the average salary in the bay is from this data. Find it.

mean = $2,759,000 --> misleading average, skewed by the 21 mil. outlier.

Median = $156,000 --> less affected by the outlier
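Comparing the two measures side by side shows how far one outlier pulls the mean. A minimal sketch using the standard statistics module:

```python
from statistics import mean, median

salaries = [21_000_000, 348_000, 22_000, 125_000, 95_000, 89_000, 206_000, 187_000]

salary_mean = mean(salaries)      # pulled upward by the 21,000,000 outlier
salary_median = median(salaries)  # average of the two middle values after sorting

print(salary_mean, salary_median)
```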

500

A data scientist collected a sample of n values and found the standard error to be 1, and also found the sum of (x-mean(x))^2 = 12. How many data points must he have collected? How many answers are possible here?

Bonus: Prove there will always only be 1 sensible answer

standard error = sqrt(12/(n-1)) / sqrt(n) = 1 --> n(n-1) = 12, so the equation becomes: n^2-n-12 = 0. 

Either 4 or -3 --> only 4 makes sense in context. 
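The quadratic can be solved numerically as a check. The sketch below assumes standard error = sqrt(SS/(n-1)) / sqrt(n), which rearranges to n^2 - n - SS/SE^2 = 0; the names are illustrative. For the bonus, the product of the two roots equals -SS/SE^2, which is negative whenever SS and SE are positive, so exactly one root is positive.

```python
from math import sqrt

sum_sq_dev = 12      # sum of (x - mean(x))^2
standard_error = 1

# SE^2 = SS / (n (n - 1))  =>  n^2 - n - k = 0  with  k = SS / SE^2
k = sum_sq_dev / standard_error ** 2

discriminant = 1 + 4 * k          # always positive when k > 0
root_pos = (1 + sqrt(discriminant)) / 2
root_neg = (1 - sqrt(discriminant)) / 2

print(root_pos, root_neg)
```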
