One Variable Data
Two Variable Data
Collecting Data
Probability and Random Variables
Sampling Distributions
100

What is a variable?

A variable describes some characteristic of an individual, such as height, hair color, or salary.
100

Describe the relationship between two Categorical and two Quantitative variables.

A response variable measures an outcome of a study, while explanatory variables may help predict or explain changes in a response variable. Two variables have an association if knowing the value of one variable helps us predict the value of the other.

A segmented bar graph displays the possible values of a categorical variable as segments of a rectangle, with the area of each segment proportional to the percent of individuals in the corresponding category.

A scatterplot shows the relationship between two quantitative variables measured on the same individuals. The values of one variable appear on the horizontal axis, and the other on the vertical axis.

100

What is the statistical problem solving process?

A process that involves 4 steps.

Ask questions, collect data, analyze data, and interpret results. 

A valid statistical question is based on data that varies. There are many data collection methods, including sample surveys, observational studies, and experiments.

The scope of inference determines how we interpret results. Random selection of individuals allows inference about the population from which the sample was selected. Random assignment of individuals to groups allows inference about cause and effect.

100

Interpret probability as a long-run relative frequency and explain how to estimate it with a simulation

Chance behavior is unpredictable in the short run, but has a regular and predictable pattern in the long run.

The long-run relative frequency of a chance outcome is its probability. A probability is a number between 0 (never) and 1 (always). The law of large numbers says that in many repetitions of the same chance process, the proportion of times that a particular outcome occurs will approach its probability.

Simulation can be used to imitate chance behavior and estimate probabilities. First, describe how to use a chance device to perform 1 repetition of the simulation. Then perform many repetitions and use the results to answer the question of interest.
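As a minimal sketch of this idea, the Python below estimates the probability of heads for a fair coin as a relative frequency. The function name estimate_probability, the fixed seed, and the choice of a fair coin are assumptions made for illustration.

```python
import random

def estimate_probability(trials, seed=0):
    """Estimate P(heads) for a fair coin as a relative frequency."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    # One repetition = one simulated flip; count how many come up heads.
    heads = sum(1 for _ in range(trials) if rng.random() < 0.5)
    return heads / trials

# More repetitions tend to give estimates closer to the true probability 0.5.
for n in (10, 1_000, 100_000):
    print(n, estimate_probability(n))
```

With few repetitions the estimate can be far from 0.5; with many it settles close to it, which is the law of large numbers at work.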

100

Distinguish between a parameter and a statistic

A parameter is a number that describes some characteristic of a population. A statistic is a number that describes some characteristic of a sample. We use statistics to estimate parameters.

200

What is the difference between Categorical and Quantitative data?

Categorical data is made up of values that are labels. These labels place each individual into one of several groups, such as hair color or eye color.


Quantitative data is made up of numerical values. These values count or measure some characteristic of each individual, such as number of family members or height in inches.

200

What is correlation and how do you determine it?

Correlation (the r value) is a measure of the strength and direction of a linear relationship between two quantitative variables. The correlation takes values between -1 and 1, where positive values indicate a positive association and negative values indicate a negative association. Values closer to 1 or -1 indicate a stronger linear relationship, and values closer to 0 indicate a weaker one.
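One way to compute r by hand is sketched below in Python. The data (hours and scores) are made up for illustration, and the function name correlation is an assumption.

```python
import math

def correlation(xs, ys):
    """Correlation r between two quantitative variables of equal length."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products and squares of deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

hours = [1, 2, 3, 4, 5]          # hypothetical study hours
scores = [52, 60, 61, 70, 77]    # hypothetical test scores
r = correlation(hours, scores)
print(round(r, 3))  # close to 1: a strong positive linear relationship
```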

200

Define the types of bias that can occur within a survey

A convenience sample consists of individuals from the population who are easy to reach. This method is biased because the chosen individuals are typically not representative of the population.

A voluntary response sample consists of people who choose to be in the sample by responding to an invitation. This method is also biased because the individuals who respond are typically not representative of the population.

Undercoverage occurs when some members of the population are less likely to be chosen for the sample. It leads to bias if the less likely individuals differ in relevant ways from other members of the population.

Nonresponse occurs when an individual chosen for the sample cannot be contacted or refuses to participate. It leads to bias if the individuals who do not respond differ in relevant ways from those who do.

Response bias occurs when there is a consistent pattern of inaccurate responses to a survey question. 

200

What are mutually exclusive events and unions?

To understand mutually exclusive events, we first must understand events and the general addition rule.

Events are collections of possible outcomes from the sample space. To find the probability that an event occurs, we use basic rules. 

If all outcomes in the sample space are equally likely, P(A) = number of outcomes in event A/total number of outcomes in sample space

Complement rule: P(A^c) = 1 - P(A), where A^c is the complement of A, that is, the event that A doesn't happen.

General addition rule: for any two events A and B, P(A or B) = P(A) + P(B) - P(A and B).

Events A and B are mutually exclusive if they have no outcomes in common and so can never occur together, that is, if P(A and B) = 0. These can be displayed with Venn diagrams and two-way tables.

The event "A or B" is known as the union of A and B, denoted by A ∪ B. It consists of all outcomes in event A, event B, or both.

The event "A and B" is known as the intersection of A and B, denoted by A ∩ B. It consists of all outcomes that are common to both events.
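These rules can be checked on a small example. The sketch below uses one roll of a fair six-sided die; the events A, B, and C and the helper prob are assumptions chosen for illustration.

```python
from fractions import Fraction

sample_space = set(range(1, 7))  # one roll of a fair six-sided die
A = {2, 4, 6}                    # event: roll is even
B = {4, 5, 6}                    # event: roll is greater than 3
C = {1}                          # event: roll is 1 (shares no outcomes with A)

def prob(event):
    """P(event) when all outcomes in the sample space are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))

p_union = prob(A | B)                            # P(A or B) counted directly
p_addition = prob(A) + prob(B) - prob(A & B)     # general addition rule
p_complement = 1 - prob(A)                       # complement rule: P(A^c)

print(p_union, p_addition)   # both are 2/3
print(prob(A & C))           # 0, so A and C are mutually exclusive
```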



200

What is the Central Limit Theorem?

The central limit theorem states that if you have a population with mean μ and standard deviation σ and take sufficiently large random samples from the population with replacement, then the distribution of the sample means will be approximately normal, even if the population itself is not.
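A small simulation can make this concrete. The sketch below assumes a right-skewed exponential population with mean 1 and standard deviation 1, a sample size of 40, and 5,000 repeated samples; all of these choices are illustrative.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

def sample_mean(n):
    """Mean of one random sample of size n from a skewed (exponential) population."""
    return statistics.fmean([rng.expovariate(1.0) for _ in range(n)])

n = 40
means = [sample_mean(n) for _ in range(5_000)]

center = statistics.fmean(means)   # should be near the population mean, 1
spread = statistics.stdev(means)   # should be near sigma / sqrt(n), about 0.158
print(round(center, 3), round(spread, 3))
```

A histogram of means would look roughly bell-shaped even though the population is strongly skewed.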

300

How would you represent both Quantitative and Categorical data on a graph?

Categorical data can be displayed using bar charts or pie charts. Bar charts especially can be used to compare the distribution of a categorical variable in two or more groups. This is done by using frequencies or relative frequencies.

Quantitative data can be displayed using dotplots, stemplots, or histograms. Dotplots display individual values on a number line, stemplots separate each observation into a "stem" and a one-digit "leaf", and histograms plot the counts (frequencies) or percents (relative frequencies) of values in equal-length intervals. These graphs show overall patterns and departures from them. Shape, center, and variability describe the overall pattern of the distribution. Outliers are observations that fall outside the overall pattern.

300

What is Least Squares Regression?

If the data show a linear relationship between two variables, the line that best fits this relationship is known as the least-squares regression line. It is the line that minimizes the sum of the squared vertical distances (residuals) from the data points to the line.
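A minimal sketch of fitting such a line from scratch is shown below. The data and the function name least_squares_line are assumptions for illustration; the formulas are the standard ones for simple linear regression.

```python
def least_squares_line(xs, ys):
    """Return (intercept, slope) of the least-squares regression line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # slope = sum of products of deviations / sum of squared x deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    # The line always passes through the point (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1, 2, 3, 4, 5]   # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]   # hypothetical response values
intercept, slope = least_squares_line(xs, ys)
print(intercept, slope)  # approximately 2.2 and 0.6
```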

300

What is SRS?

SRS is a simple random sample of size n chosen in such a way that every group of n individuals in the population has an equal chance to be selected as the sample. 

The fact that different random samples of the same size from the same population produce different estimates is called sampling variability.
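Both ideas can be sketched with Python's random.sample, which draws n distinct items so that every group of n is equally likely. The population of values 1 to 100, the seed, and the sample size of 10 are assumptions for illustration.

```python
import random

rng = random.Random(42)            # fixed seed so the example is repeatable
population = list(range(1, 101))   # hypothetical population: the values 1..100

def simple_random_sample(pop, n):
    """Draw an SRS of size n: every group of n individuals is equally likely."""
    return rng.sample(pop, n)

sample_a = simple_random_sample(population, 10)
sample_b = simple_random_sample(population, 10)

# Two estimates of the population mean (50.5). They usually differ from each
# other and from 50.5, which is sampling variability.
mean_a = sum(sample_a) / len(sample_a)
mean_b = sum(sample_b) / len(sample_b)
print(mean_a, mean_b)
```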

300

What are independent events and conditional probability?

Conditional probability describes the probability that one event happens given that another event is already known to have happened. 

To calculate this, we use the formula

P(A|B) = P(A ∩ B)/P(B)

The general multiplication rule calculates the probability that events A and B both occur.

P(A and B) = P(A) * P(B|A)

Tree diagrams help to display conditional probability.

When knowing whether or not one event has occurred does not change the probability that another event happens, we say that the two events are independent.

P(A|B)=P(A|B^c)=P(A) and P(B|A)=P(B|A^c)=P(B)
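The formulas above can be applied to a two-way table of counts. The table below (sport participation by class year) is made up for illustration, and the helper p is an assumption.

```python
from fractions import Fraction

# Hypothetical two-way table of counts: (plays a sport?, class year).
counts = {
    ("sport", "senior"): 30,
    ("sport", "junior"): 20,
    ("no sport", "senior"): 30,
    ("no sport", "junior"): 20,
}
total = sum(counts.values())

def p(condition):
    """Probability that a randomly chosen individual satisfies condition."""
    return Fraction(sum(c for key, c in counts.items() if condition(key)), total)

p_a = p(lambda k: k[0] == "sport")                 # P(A)
p_b = p(lambda k: k[1] == "senior")                # P(B)
p_a_and_b = p(lambda k: k == ("sport", "senior"))  # P(A and B)

p_a_given_b = p_a_and_b / p_b   # P(A|B) = P(A and B) / P(B)
p_b_given_a = p_a_and_b / p_a   # P(B|A) = P(A and B) / P(A)

# Knowing B does not change the probability of A here, so A and B are independent.
independent = p_a_given_b == p_a
print(p_a_given_b, p_a, independent)
```

The general multiplication rule also checks out on this table: P(A) * P(B|A) equals P(A and B).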

300

Define biased and unbiased point estimates

The bias of a point estimator is defined as the difference between the expected value of the estimator and the value of the parameter being estimated. When the expected value of the estimator equals the value of the parameter being estimated, the estimator is considered unbiased.

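Biased estimators include the sample range and the sample standard deviation. The sample variance is also biased if it is computed by dividing by n instead of n - 1.

Unbiased estimators include the sample mean (for the population mean) and the sample proportion (for the population proportion). The sample median is unbiased for the center only when the population is symmetric.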

400

Describe and compare the distribution of a Quantitative Variable

To describe a distribution, we use the term SOCS

S - Shape

O - Outliers

C - Center

S - Spread (variability)

Shape is the overall pattern of the distribution. The distribution can be symmetric, skewed right or left, or even bimodal.

Outliers are unusual values that fall outside the overall pattern. Gaps and clusters are other departures from the pattern worth noting.

The mean and median describe the center of a distribution in different ways. The median is the midpoint of the distribution, where half the observations are larger and half are smaller. The mean is the average of the observations.

Range is the easiest way to measure variability. It is the distance from the maximum value to the minimum value.

When using the mean to describe the center, you can measure variability using the standard deviation. This describes the typical distance of values in a distribution from the mean. The standard deviation is 0 when there is no variability, and gets bigger as variability increases.

When using median to describe the center, you can measure the variability using the interquartile range (IQR). The first quartile Q1 has about 1/4 of the observations below it, and the third quartile Q3 has about 3/4 of the observations below it. The IQR measures variability in the middle half of the distribution and is found by Q3 - Q1.

(To find outliers): Any value that is less than Q1 - 1.5(IQR) or greater than Q3 + 1.5(IQR)
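The 1.5 × IQR rule can be sketched in Python as follows. The data are made up, and the quartiles come from statistics.quantiles with method="inclusive"; textbooks often compute quartiles as medians of the lower and upper halves, so Q1 and Q3 may differ slightly from a hand calculation.

```python
import statistics

def find_outliers(data):
    """Return values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR."""
    # quantiles(..., n=4) returns the three quartile cut points: Q1, median, Q3.
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    return [x for x in data if x < low_fence or x > high_fence]

values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]  # hypothetical data with one unusual value
outliers = find_outliers(values)
print(outliers)  # [30]
```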

400

What is a Linear Regression Model?

A linear regression model describes the relationship between a dependent variable, y, and one or more independent variables, X. The dependent variable is also called the response variable. Independent variables are also called explanatory variables.

400

Define confounding, treatments, and the purpose of control groups within an experiment

Confounding occurs when two variables are associated in such a way that their effects on a response variable cannot be distinguished from each other.

An experiment deliberately imposes treatments on individuals to measure their responses.

A treatment is a specific condition applied to the individuals in an experiment. An experimental unit is the object to which a treatment is being randomly assigned.

In an experiment, a control group is used to provide a baseline for comparing the effects of other treatments. Depending on the purpose of the experiment, a control group may be given a placebo, active treatment, or no treatment at all. 

The placebo effect describes the fact that some subjects will respond favorably to any treatment, even an inactive treatment.

400

Define the mean, standard deviation, and the combining of random variables

The mean of a discrete random variable X is a weighted average of the possible values that the random variable can take. Unlike the sample mean of a group of observations, which gives each observation equal weight, the mean of a random variable weights each outcome according to its probability.

The standard deviation of a random variable is a measure of spread that describes how much the values typically differ from the expected value. The standard deviation of a random variable X is often written as σ.

We can form new distributions by combining random variables. If we know the mean and standard deviation of the original distributions, we can use that information to find the mean and standard deviation of the resulting distribution. We can combine means directly, but we can't do this with standard deviations. For independent random variables, the variances add, so the standard deviation of a sum or difference is the square root of the sum of the variances.
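The sketch below computes the mean and standard deviation of two small discrete random variables and of their sum. The distributions X and Y are made up, and the variables are assumed to be independent, which is what allows the variances to be added.

```python
import math

def mean_rv(dist):
    """Mean of a discrete random variable given as {value: probability}."""
    return sum(x * p for x, p in dist.items())

def sd_rv(dist):
    """Standard deviation of a discrete random variable."""
    mu = mean_rv(dist)
    return math.sqrt(sum((x - mu) ** 2 * p for x, p in dist.items()))

X = {0: 0.5, 1: 0.5}     # hypothetical random variable
Y = {0: 0.25, 2: 0.75}   # hypothetical random variable, assumed independent of X

mean_sum = mean_rv(X) + mean_rv(Y)                    # means add directly
sd_sum = math.sqrt(sd_rv(X) ** 2 + sd_rv(Y) ** 2)     # variances add, not SDs

# Check against the full distribution of X + Y built under independence.
S = {}
for x, px in X.items():
    for y, py in Y.items():
        S[x + y] = S.get(x + y, 0) + px * py

print(mean_sum, mean_rv(S))
print(sd_sum, sd_rv(S))
```

Note that sd_sum is smaller than sd_rv(X) + sd_rv(Y), which is why standard deviations cannot simply be added.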

400

Describe the sampling distribution of a sample proportion and of a sample mean

The sampling distribution of a sample proportion (p̂) describes the distribution of values taken by the sample proportion in all possible samples of the same size from the same population. The mean of p̂ is μ_p̂ = p. This describes the average value of p̂ in repeated random samples. The standard deviation of p̂ is σ_p̂ = √(p(1-p)/n). This describes how far the values of p̂ typically vary from p in repeated random samples. You can assume normality when the Large Counts Condition (np ≥ 10 and n(1-p) ≥ 10) is met.

The sampling distribution of a sample mean (x̄) describes the distribution of values taken by the sample mean in all possible samples of the same size from the same population. The mean of the sampling distribution is μ_x̄ = μ. This describes the average value of x̄ in repeated random samples. The standard deviation of the sampling distribution of x̄ is σ_x̄ = σ/√n. This describes how far the values of x̄ typically vary from μ in repeated random samples. You can assume normality when the population is normal or when the sample size is large enough for the CLT to apply (n ≥ 30).
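As a quick sketch, the formulas for both sampling distributions can be wrapped in two small functions. The function names and the example numbers are assumptions, and the Large Counts check uses the common "at least 10" version.

```python
import math

def phat_distribution(p, n):
    """Mean, standard deviation, and Large Counts check for a sample proportion."""
    mean = p
    sd = math.sqrt(p * (1 - p) / n)
    large_counts = n * p >= 10 and n * (1 - p) >= 10
    return mean, sd, large_counts

def xbar_distribution(mu, sigma, n):
    """Mean and standard deviation of the sampling distribution of a sample mean."""
    return mu, sigma / math.sqrt(n)

# Hypothetical examples: p = 0.5 with n = 100, and mu = 70, sigma = 10 with n = 25.
print(phat_distribution(0.5, 100))   # mean 0.5, sd about 0.05, Large Counts met
print(xbar_distribution(70, 10, 25)) # mean 70, sd 2.0
```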

500

What is the Normal distribution?

A normal distribution is a type of continuous probability distribution in which most data points cluster toward the middle of the range, while the rest go off symmetrically toward either extreme.

500

What are residuals?

A residual is the difference between an actual value of y and the value of y predicted by the regression line.

Residual = actual y - predicted y

A residual plot is a scatterplot that plots the residuals on the vertical axis and the explanatory variable on the horizontal axis.

The standard deviation of the residuals s measures the size of a typical residual. That is, s measures the typical distance between the actual y values and the predicted y values.

The coefficient of determination r^2 measures the percent reduction in the sum of squared residuals when using the least-squares regression line to make predictions rather than the mean value of y. In other words, r^2 measures the percent of the variability in the response variable that is accounted for by the least-squares regression line. This is found by squaring the r value.
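The sketch below fits a least-squares line to a small made-up data set and then computes the residuals, s, and r^2. The data are hypothetical, and s is computed with n - 2 in the denominator, the usual convention for simple linear regression.

```python
import math

xs = [1, 2, 3, 4, 5]   # hypothetical explanatory values
ys = [2, 4, 5, 4, 5]   # hypothetical response values
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Least-squares slope and intercept.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

# Residual = actual y - predicted y.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

sse = sum(e ** 2 for e in residuals)       # squared residuals about the line
sst = sum((y - mean_y) ** 2 for y in ys)   # squared deviations about the mean of y
s = math.sqrt(sse / (n - 2))               # standard deviation of the residuals
r_squared = 1 - sse / sst                  # percent reduction in squared error

print(round(s, 3), round(r_squared, 3))
```

For a least-squares line the residuals always sum to zero (up to rounding), which is a handy check on the arithmetic.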

500

Define what statistically significant means and how experiments use Completely Randomized Designs

An observed difference in responses between the groups in an experiment is statistically significant when it is so large that it is unlikely to be explained by chance variation in the random assignment.

In a completely randomized design, the experimental units are assigned to the treatments (or treatments to the experimental units) completely at random.

500

Define binomial distributions and their parameters and geometric distributions

The binomial distribution is a common probability distribution that models the number of successes in a fixed number of independent trials, where each trial has the same probability of success. It is characterized by two parameters: the number of independent trials n and the probability of success p. There are 4 conditions for a binomial distribution.

1: The number of observations n is fixed. 

2: Each observation is independent. 

3: Each observation represents one of two outcomes ("success" or "failure"). 

4: The probability of "success" p is the same for each observation.

A geometric distribution is a discrete probability distribution of a random variable X that counts the number of trials needed to get the first success. Its conditions are: there is a series of independent trials, each trial has only two possible outcomes (success or failure), and the probability of success p is the same for each trial.
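Both distributions have simple probability formulas, sketched below in Python. The function names are assumptions; the geometric version counts the trial on which the first success occurs (k = 1, 2, 3, ...).

```python
import math

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials with success probability p)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def geometric_pmf(k, p):
    """P(first success occurs on trial k): k - 1 failures, then a success."""
    return (1 - p) ** (k - 1) * p

# Hypothetical examples with a fair coin (p = 0.5).
print(binomial_pmf(2, 4, 0.5))   # 0.375: exactly 2 heads in 4 flips
print(geometric_pmf(3, 0.5))     # 0.125: first head on the 3rd flip
```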

500

What are normal distributions within sampling distributions?

The sampling distribution of a sample proportion can be treated as approximately normal when the Large Counts Condition holds. The sampling distribution of a sample mean can be treated as approximately normal when the population is normal or when the CLT applies.
