About Data
Equations and Expressions 🤢🤢🤢
Statistics with Python 😖😖😖
Pandas Potpourri
Toilsome Terminology
100

What does this code do?

pandas.read_csv('some_data.csv')

Creates a dataframe of some data in CSV format.

100

What does this represent?

Σ ((observed - expected)² / expected)

--------------------------------------------------------

Sum, over all categories, of the squared difference between the observed and expected values, divided by the expected value

A Chi-Squared test
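As a rough illustration, the statistic on this card can be computed by hand with plain Python. The counts below are made-up die rolls, not data from this board.

```python
# Hypothetical counts from 60 rolls of a six-sided die (illustrative only).
observed = [8, 9, 12, 11, 6, 14]
expected = [10, 10, 10, 10, 10, 10]  # a fair die: 60 rolls / 6 faces

# Sum of (observed - expected)^2 / expected over all categories.
chi_squared = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(chi_squared)  # approximately 4.2
```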

100

How do you get the mean of a population without using a method from an imported package (such as .mean())?

hint - use Python functions

sum(population) / len(population)
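A quick check of that answer, using a small made-up population:

```python
# Hypothetical population values (illustrative only).
population = [2, 4, 4, 4, 5, 5, 7, 9]

# Built-in sum() and len() are enough; no imports needed.
mean = sum(population) / len(population)
print(mean)  # 5.0
```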

100

How do you check the datatypes within a pandas dataframe?

pandas.DataFrame.dtypes

100

Measure of the spread of a population or sample of numbers.

Symbols - σ (population) or s (sample)

What is Standard Deviation?

200

Name the type of variable here:

Boolean values (True/False; Yes/No; 1/0)

Discrete

200

What does this represent?

x̄ ± t(s/√ n)

------------------------------------------------------------

sample mean +/- critical t-value * (sample standard deviation / square root of the sample size)

A Confidence Interval
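One way to evaluate that formula is sketched below, assuming a 95% two-sided interval and a made-up sample. The critical t-value is looked up with scipy.stats.t.ppf.

```python
import math
import statistics

from scipy import stats

# Hypothetical sample (illustrative only).
sample = [5.1, 4.9, 5.6, 5.8, 5.0, 5.3, 5.4, 5.2]

n = len(sample)
x_bar = statistics.mean(sample)   # sample mean
s = statistics.stdev(sample)      # sample standard deviation (n - 1 denominator)

# Critical t-value for a 95% two-sided interval with n - 1 degrees of freedom.
t_crit = stats.t.ppf(0.975, df=n - 1)

margin = t_crit * s / math.sqrt(n)
confidence_interval = (x_bar - margin, x_bar + margin)
print(confidence_interval)
```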

200

What is the code to perform a 1-sample t-test using Scipy?

- include common parameters

scipy.stats.ttest_1samp(array, expected_mean, axis=0, nan_policy='propagate')
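A minimal usage sketch of that call, with a made-up sample and an assumed expected mean of 5.0:

```python
from scipy import stats

# Hypothetical sample and hypothesized population mean (illustrative only).
sample = [5.1, 4.9, 5.6, 5.8, 5.0, 5.3, 5.4, 5.2]
expected_mean = 5.0

# The result unpacks into the t statistic and the two-sided p-value.
t_stat, p_value = stats.ttest_1samp(sample, expected_mean, axis=0, nan_policy='propagate')
print(t_stat, p_value)
```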

200

How do you check for the number of null values in a dataframe?

pandas.DataFrame.isna().sum()
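For instance, on a small made-up dataframe (the column names here are placeholders):

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe with a few missing values (illustrative only).
df = pd.DataFrame({
    "age": [25, np.nan, 31],
    "city": ["Oslo", None, None],
})

# isna() marks each missing cell as True; sum() counts them per column.
null_counts = df.isna().sum()
print(null_counts)
```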

200

It's a table that displays the joint frequency distribution of two or more categorical variables.

What is a Contingency Table?

------------------------------------

What is a Cross Tabulation?

300

Name the type of variable here:

All real numbers

Continuous

300

What does this represent?

P(A|B) = P(B|A)P(A) / P(B)

-------------------------------------------------------------

Probability of (A, given B) = (Probability of (B, given A) * Probability of (A)) / (Probability of (B))

Bayes' Theorem
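A small worked example of the theorem, assuming made-up numbers for a diagnostic test (A = has the condition, B = tests positive):

```python
# Hypothetical probabilities (illustrative only).
p_a = 0.01              # P(A): prior probability of the condition
p_b_given_a = 0.95      # P(B|A): positive test given the condition
p_b_given_not_a = 0.05  # P(B|not A): false-positive rate

# P(B), expanded over the two cases A and not A.
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(p_a_given_b)  # roughly 0.16
```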

300

What is the code for a 2-sample t-test using Scipy?

- include common parameters (not defaults)

scipy.stats.ttest_ind(array_1, array_2, axis=0, equal_var=True, nan_policy='propagate')
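A minimal usage sketch with two made-up groups:

```python
from scipy import stats

# Hypothetical independent samples (illustrative only).
group_a = [5.1, 4.9, 5.6, 5.8, 5.0]
group_b = [4.2, 4.8, 4.5, 4.9, 4.4]

# equal_var=True runs the standard (pooled-variance) Student's t-test.
t_stat, p_value = stats.ttest_ind(group_a, group_b, axis=0, equal_var=True, nan_policy='propagate')
print(t_stat, p_value)
```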

300

Which package and constant fill the blank here to replace placeholder values (such as '?') with NaN?

pandas.DataFrame.replace('?', _______)

numpy.nan
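A short sketch of the filled-in call on a made-up dataframe. It uses the lowercase spelling numpy.nan, which works across NumPy versions.

```python
import numpy as np
import pandas as pd

# Hypothetical dataframe where '?' stands in for missing data (illustrative only).
df = pd.DataFrame({"workclass": ["Private", "?", "State-gov"]})

# Replace every '?' with a real missing-value marker.
cleaned = df.replace("?", np.nan)
print(cleaned)
```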

300

It's a number that represents the likelihood of obtaining a test result at least as extreme as the results actually obtained assuming the null hypothesis is correct.

What is a P-value?

400

What does this code return?

df[df['year']==2017]

Returns a dataframe containing only the rows where the df['year'] column is equal to 2017
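To see the filter in action, here is a sketch on a made-up dataframe (the 'sales' column is a placeholder):

```python
import pandas as pd

# Hypothetical dataframe (illustrative only).
df = pd.DataFrame({
    "year": [2016, 2017, 2017, 2018],
    "sales": [10, 12, 15, 9],
})

# df['year'] == 2017 builds a boolean mask; indexing with it keeps the True rows.
subset = df[df["year"] == 2017]
print(subset)
```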

400

What does this particular equation represent?

σ² = (Σ (x - μ)²) / N

-------------------------------------------

(sigma)² = (Sum of (observation - mean of all observations)²) / (Total number of observations)

Population Variance
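The formula can be checked by hand against the standard library, using a small made-up population:

```python
import statistics

# Hypothetical population values (illustrative only).
population = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(population)
mu = sum(population) / n  # population mean

# Sum of squared deviations from the mean, divided by N (not N - 1).
variance = sum((x - mu) ** 2 for x in population) / n
print(variance)  # 4.0

# statistics.pvariance implements the same population formula.
print(statistics.pvariance(population))
```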

400

When making a contingency table, what two lines of code do I need to grab the vertical and horizontal 'All' values assuming the table is 7 x 7 (excluding variable names)?

hint - start with:

contingency_table. ........

row_sums = contingency_table.iloc[0:6, 6]

grabs the 0th, 1st, 2nd, 3rd, 4th, and 5th row values from the 6th column (starting from 0)

col_sums = contingency_table.iloc[6, 0:6]

grabs the 0th, 1st, 2nd, 3rd, 4th, and 5th column values from the 6th row (starting from 0)
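A sketch of those two lines on a made-up 7 x 7 table (six categories per variable plus the 'All' margins). The variable names and category labels are placeholders.

```python
import pandas as pd

# Hypothetical data: six row categories crossed with six column categories,
# one observation per combination (illustrative only).
rows = [r for r in "ABCDEF" for _ in range(6)]
cols = list("UVWXYZ") * 6

contingency_table = pd.crosstab(
    pd.Series(rows, name="row_var"),
    pd.Series(cols, name="col_var"),
    margins=True,  # appends the 'All' row and column
)

# Rows 0-5 of the last column: the row totals.
row_sums = contingency_table.iloc[0:6, 6]
# Columns 0-5 of the last row: the column totals.
col_sums = contingency_table.iloc[6, 0:6]
print(row_sums.tolist(), col_sums.tolist())
```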

400

You notice when using pandas.read_csv() that there are no column headers.  Which parameter do you use to insert column headers?

pandas.read_csv('.csv', names=['list'])
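A self-contained sketch of the names parameter. It reads from an in-memory string instead of a file, and the column names are made up.

```python
import io

import pandas as pd

# Hypothetical header-less CSV content (illustrative only).
raw = io.StringIO("1,Alice,30\n2,Bob,25\n")

# Passing names supplies the column headers; the first line is kept as data.
df = pd.read_csv(raw, names=["id", "name", "age"])
print(df)
```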

400

In the case of deciding whether a six-sided die is fair, it's what you do if an obtained p-value is lower than the previously set significance level.

What is 'Reject the Null Hypothesis that it is a fair six-sided die and suggest the alternative that it is an unfair die'

500

What does this code show you?

df.describe(exclude='number')

Summary statistics for the non-numeric columns of a dataframe

500

What does this represent?

P(A|B) = P(A∩B) / P(B)

-----------------------------------------------------------

Probability of (A, given B) = Probability of (both A and B occurring) / Probability of (B)

Conditional Probability
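The formula on this card, P(A|B) = P(A∩B) / P(B), can be checked on a single roll of a fair die. The events chosen here are an assumption for illustration: A = the roll is even, B = the roll is greater than 3.

```python
from fractions import Fraction

outcomes = {1, 2, 3, 4, 5, 6}              # one roll of a fair die
a = {x for x in outcomes if x % 2 == 0}    # A: roll is even
b = {x for x in outcomes if x > 3}         # B: roll is greater than 3

p_b = Fraction(len(b), len(outcomes))            # P(B)
p_a_and_b = Fraction(len(a & b), len(outcomes))  # P(A∩B)

p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # 2/3
```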

500

What code is used to perform a Chi-Squared test using Scipy?

- Include Parameters

scipy.stats.chisquare(observed, expected)
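A minimal usage sketch with made-up die-roll counts. The observed and expected totals must match, as they do here.

```python
from scipy import stats

# Hypothetical counts from 60 rolls of a six-sided die (illustrative only).
observed = [8, 9, 12, 11, 6, 14]
expected = [10, 10, 10, 10, 10, 10]

result = stats.chisquare(observed, expected)
print(result.statistic, result.pvalue)
```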

500

What code do you use to create a contingency table using two categorical variables?

hint - use Pandas

pandas.crosstab(df['column_1'], df['column_2'], margins=True)
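A short sketch of that call on a made-up dataframe, showing what margins=True adds:

```python
import pandas as pd

# Hypothetical categorical data (illustrative only).
df = pd.DataFrame({
    "column_1": ["a", "a", "b", "b", "b"],
    "column_2": ["x", "y", "x", "x", "y"],
})

# margins=True appends an 'All' row and column holding the totals.
table = pd.crosstab(df["column_1"], df["column_2"], margins=True)
print(table)
```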

500

You use it to determine whether there is a statistically significant difference between the expected and observed frequencies in one or more categories.

What is a Chi-Squared test?
