Intro
Descriptive Stats (Univariate)
Descriptive Stats (Bivariate)
100

Suppose P = {Al, Barb, Carla}, H = hair color (light, medium, dark), and H* = hair color (1 = blonde, 2 = brown, 3 = other).

What are the variable types of H and H*?

H = Qual-O (these values can be ordered)

H* = Qual-N (these values can't be ordered, and the numbers just represent categories)

100

X:RE = 'roll a die and record the number of updots'

n = 20

dist(X:RE) = { (1,2,3,4,5,6) , (1/20, 3/20, 3/20, 5/20, 4/20, 4/20) }

What is the relative frequency of rolling a six?

= 4/20 = 0.20

100

You are given the two sample variances var(X) = 4.3 and var(Y) = 13.8. Can you compute the sample variance var(15 + 2X - 3Y)? If yes, compute it. If not, why not?

We cannot compute the sample variance because we need the correlation, r, to plug into our formula:

var(a+bX+cY:s) = b^2*sx^2 + c^2*sy^2 + 2*b*c*r*sx*sy
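
A minimal Python sketch of the formula above, to show how much the answer depends on r. The function name and the r values tried here are hypothetical; the question does not supply r.

```python
import math

def var_linear_combo(b, c, var_x, var_y, r):
    """Sample variance of a + b*X + c*Y (the constant a drops out)."""
    sx, sy = math.sqrt(var_x), math.sqrt(var_y)
    return b**2 * var_x + c**2 * var_y + 2 * b * c * r * sx * sy

# r is NOT given in the question; these three values are assumptions
# used only to show that the result changes with r.
for r in (-0.5, 0.0, 0.5):
    print(r, round(var_linear_combo(2, -3, 4.3, 13.8, r), 2))
```

Each assumed r gives a different variance, which is why the question cannot be answered from var(X) and var(Y) alone.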

200

What is the difference between X:s and x_bar?

X:s denotes the x-values for the sample, s. 

x_bar denotes the average of the x-values for the sample, s.

200

X:s = {1, 1, 3, 5, 7, 5, 6, 99}

Which would be a better measure of centrality to describe this sample: mean or median? Why?


Median, since it is resistant to the outlier (99), which pulls the mean well above most of the values

mean = 15.875

median = 5
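
The two values can be checked with Python's standard `statistics` module. Dropping the 99 (done here only for illustration) shows how far the outlier moves the mean while leaving the median unchanged.

```python
from statistics import mean, median

sample = [1, 1, 3, 5, 7, 5, 6, 99]
sample_mean = mean(sample)      # 15.875
sample_median = median(sample)  # 5.0

# Illustration only: remove the outlier and recompute.
without_outlier = [x for x in sample if x != 99]
mean_without = mean(without_outlier)
median_without = median(without_outlier)

print(sample_mean, sample_median)
print(mean_without, median_without)
```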

200

210 people bought both beer and brats, 495 people bought beer but no brats, 398 people bought brats but no beer, and 5392 bought neither. 

Compute and interpret the Risk Ratio of the purchase of beer owing to the purchase of brats.

props(Beer=yes | Brats=yes) / props(Beer=yes | Brats=no)

props(Beer=yes | Brats=yes) = 210 / (210 + 398) = 0.3454

props(Beer=yes | Brats=no) = 495 / (495 + 5392) = 0.0841

Ratio = 0.3454 / 0.0841 = 4.11

The proportion of receipts that include beer is 4.11 times as high for receipts that include brats as for those that do not.
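
The same ratio of conditional proportions, sketched in Python from the four counts in the question. The variable names are illustrative only.

```python
# Counts from the question.
both, beer_only, brats_only, neither = 210, 495, 398, 5392

# Proportion buying beer among brat buyers and among non-brat buyers.
p_beer_given_brats = both / (both + brats_only)
p_beer_given_no_brats = beer_only / (beer_only + neither)

ratio = p_beer_given_brats / p_beer_given_no_brats
print(round(p_beer_given_brats, 4), round(p_beer_given_no_brats, 4), round(ratio, 2))
```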

300

dist(X:s) = {(0,1,2,3) , (1/8, 3/8, 3/8, 1/8)}

If Success = 'X>=2' and Failure = 'X<2', what is props(Failure)?

props(Failure) = 1/8 + 3/8 = 4/8 = 0.50
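
As a quick check, the failure proportion can be summed from the distribution with exact fractions. This is a sketch; the dictionary layout is an assumption about how one might store dist(X:s).

```python
from fractions import Fraction

# dist(X:s) stored as value -> proportion.
dist = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

# Failure = 'X < 2'; Success = 'X >= 2'.
p_failure = sum(p for x, p in dist.items() if x < 2)
p_success = sum(p for x, p in dist.items() if x >= 2)

print(p_failure, p_success)
```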

300

X:s = {487.54, 510.88, 501.01, 223.533, 744.221, 634.08, 112.44}

I use bins of [0, 500] and (500, 1,000] to create a density histogram. What is the density of the (500, 1,000] bin?

Frequency in bin = 4

Relative Frequency of bin = 4/7 = 0.57143

Density = Relative Frequency / bin width = 0.57143 / 500 = 0.00114 (about 0.11% per unit of X)
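
The three steps (count, relative frequency, density) can be sketched in Python as follows, treating the bin as open on the left and closed on the right.

```python
sample = [487.54, 510.88, 501.01, 223.533, 744.221, 634.08, 112.44]
low, high = 500, 1000  # the (500, 1,000] bin

frequency = sum(1 for x in sample if low < x <= high)
relative_frequency = frequency / len(sample)
density = relative_frequency / (high - low)  # relative frequency per unit of X

print(frequency, round(relative_frequency, 5), round(density, 5))
```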

300

(X,Y):s is characterized by the following 2+2+1 summary:

x_bar = 7.85, sx = 1.5566, y_bar = 19.37, sy = 1.442, and r = 0.932

Compute mean(25 + 5X + 2Y:s).

102.99

= 25 + 5*7.85 + 2*19.37
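
The mean of a linear combination needs only the two means, so sx, sy, and r are not used. A short Python check (the function name is hypothetical):

```python
def mean_linear_combo(a, b, c, x_bar, y_bar):
    """Mean of a + b*X + c*Y, computed from the two sample means."""
    return a + b * x_bar + c * y_bar

result = mean_linear_combo(25, 5, 2, x_bar=7.85, y_bar=19.37)
print(round(result, 2))
```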

400

Let's say RE = 'manufacture a widget' and X = 'Quality score from 1-10'

If you carry out RE 1 million times and record X for each widget, the proportion of widgets with a Quality Score greater than 7 (i.e., X>7) is approximately the same as the ______ of randomly selecting a manufactured widget with a Quality Score greater than 7.

Probability

*Due to the Law of Large Numbers (as you continue repeating a process, the process proportion becomes closer and closer to the true probability)
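
A small simulation can illustrate the idea. The true probability used below (0.30) is an assumption made up for the sketch; the question does not state it, and the seed is fixed only so the run is repeatable.

```python
import random

TRUE_P = 0.30            # hypothetical P(X > 7); not given in the question
rng = random.Random(0)   # fixed seed for a repeatable run

def observed_proportion(trials):
    """Proportion of simulated widgets with X > 7 over the given number of trials."""
    hits = sum(1 for _ in range(trials) if rng.random() < TRUE_P)
    return hits / trials

# The observed proportion should settle near TRUE_P as trials grow.
for trials in (100, 10_000, 100_000):
    print(trials, observed_proportion(trials))
```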

400

A course's test scores are normally distributed with a mean score of 85 and a standard deviation of 5. Using the 68-95-99.7 Rule, estimate the probability of scoring less than an 80.

~16%

= 50% (to the left of the mean) - 34% (between the mean and 1 sigma below it, i.e., from 80 to 85)
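
For comparison, the rule-of-thumb estimate can be set beside the exact normal probability using `statistics.NormalDist` from the Python standard library.

```python
from statistics import NormalDist

scores = NormalDist(mu=85, sigma=5)

rule_estimate = 0.50 - 0.68 / 2   # 68-95-99.7 Rule: half of the middle 68%
exact = scores.cdf(80)            # P(score < 80), one sigma below the mean

print(round(rule_estimate, 2), round(exact, 4))
```

The rule gives about 16%, close to the exact value of roughly 15.87%.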

400

Two students took different standardized tests:

1. Alice scored 78 on a test with a mean of 70 and a standard deviation of 4.

2. Ben scored 84 on a test with a mean of 80 and a standard deviation of 8. 

Assuming the scores for both tests are characterized by a normal distribution, which student performed better relative to their group?

Alice: z = (78-70) / 4 = 2 

Ben: z = (84-80) / 8 = 0.5

Alice scored 2 standard deviations above the mean while Ben only scored 0.5 standard deviations above the mean. Alice performed better.

*Note...this is a univariate question.
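
The comparison can be sketched in Python with a small helper (the function name is illustrative):

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above the mean."""
    return (x - mean) / sd

alice = z_score(78, mean=70, sd=4)
ben = z_score(84, mean=80, sd=8)

print(alice, ben)
print("Alice" if alice > ben else "Ben", "performed better relative to their group")
```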

500

Let's say you take a sample (s) of 10 students (n) and record their grade on the most recent assignment. Suppose 3 students received As, 2 students received Bs, and 5 students received Cs.

*Note that X = Qual-O

Create an appropriate visualization for dist(X:s)

X = grade

Y = frequency (or relative frequency)

Frequency (or relative frequency) bar chart w/ three bars indicating each grade

500

X:s = {7, 12, 4, 18, 15, 21, 10}

Report the 5 number summary.

Sorted X:s = {4, 7, 10, 12, 15, 18, 21}

Min = 4

Q1 = x(0.25*7+0.75) = x(2.5) = x(2) + 0.5(x(3)-x(2)) = 7 + 0.5(10-7) = 8.5

Median = 12

Q3 = x(0.75*7+0.25) = x(5.5) = x(5) + 0.5(x(6)-x(5)) = 15 + 0.5(18-15) = 16.5

Max = 21

n = 7
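
A Python sketch of the position formula used above, generalized as position = p*n + (1 - p) on the sorted sample with linear interpolation between neighbors. The generalization and the function name are assumptions drawn from the two worked positions (0.25*7+0.75 and 0.75*7+0.25).

```python
def quantile(sorted_values, p):
    """Value at 1-based position p*n + (1 - p), interpolating linearly."""
    n = len(sorted_values)
    position = p * n + (1 - p)
    lower = int(position)            # whole part of the position
    fraction = position - lower      # how far to move toward the next value
    if lower >= n:                   # position is the last value
        return sorted_values[-1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + fraction * (above - below)

data = sorted([7, 12, 4, 18, 15, 21, 10])
five_number_summary = [quantile(data, p) for p in (0, 0.25, 0.5, 0.75, 1)]
print(five_number_summary)
```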

500

A company collects data on 15 of its employees' commute times. The company found that there are only 4 employees who spend less than 22 minutes commuting to work. Alternatively, the company found that 12 employees spend less than 42 minutes commuting to work.

What is the difference in percentile rank between the 22-minute and 42-minute commute times?

n = 15

L22 = 4... percentile rank = 4 / (15 -1) = 28.57%

L42 = 12... percentile rank = 12 / (15 - 1) = 85.71%

85.71% - 28.57% = 57.14%

*Note...this is a univariate question.
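
The percentile-rank arithmetic above, sketched in Python. It follows the convention used in this answer (count of values below, divided by n - 1); other conventions exist and would give slightly different numbers.

```python
def percentile_rank(count_below, n):
    """Percentile rank as (number of values below) / (n - 1)."""
    return count_below / (n - 1)

n = 15
rank_22 = percentile_rank(4, n)    # 4 employees commute less than 22 minutes
rank_42 = percentile_rank(12, n)   # 12 employees commute less than 42 minutes
difference = rank_42 - rank_22

print(f"{rank_22:.2%} {rank_42:.2%} {difference:.2%}")
```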