Numpy Jeopardy!! Hari P

Open Ended

Indexing/Slicing

Array Manipulation

Math

Miscellaneous

100

Suppose I have the following lists/arrays (in the starter code) and I want to return the list/array [6,8,10,12]. How many operations would I need with Python lists vs. NumPy arrays?

# return [6,8,10,12]

x = [1,2,3,4]

y = [5,6,7,8]

# How many operations needed to return [6,8,10,12] with the lists x and y

x1 = np.array(x)

y1 = np.array(y)

# How many operations needed to return [6,8,10,12] with the arrays x1 and y1

With lists: about 12

With arrays: 4
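A minimal sketch of the difference, reusing the x and y from the question: the list version has to visit each pair of elements explicitly, while the array version is one vectorized expression. The names list_sum and array_sum are made up for this illustration.

```python
import numpy as np

x = [1, 2, 3, 4]
y = [5, 6, 7, 8]

# Lists: x + y would concatenate, so each pair is added one at a time.
list_sum = [a + b for a, b in zip(x, y)]

# Arrays: + is applied element-wise in a single vectorized call.
array_sum = np.array(x) + np.array(y)

print(list_sum)
print(array_sum)
```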

100

I want only arr1 to be the array [999999, 2, 4, 6, 8, 10] but when I run the code below both arr1 and arr2 are the same array. Why is this happening and how can I change this?

arr1 = np.arange(0,11,2)

arr2 = arr1

arr1[0] = 999999

print('arr1: ', arr1)

print('arr2: ', arr2)

#why are they same and how do I change this?

A variable stores a reference (a pointer to the memory address) to its value, so arr2 = arr1 makes arr2 point to the same array as arr1. To get an independent array, make a copy with arr2 = arr1.copy() or arr2 = np.copy(arr1).
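A short sketch of the copy-based version, using the same arrays as the question. After the copy, changing arr1 no longer affects arr2.

```python
import numpy as np

arr1 = np.arange(0, 11, 2)
arr2 = arr1.copy()   # independent copy; np.copy(arr1) does the same

arr1[0] = 999999     # only arr1 changes now

print('arr1: ', arr1)
print('arr2: ', arr2)
```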

100

Why doesn't the np.reshape() method actually change the shape of the array below:

arr1 = np.array([1,2,3,4,5,6,7,8])

np.reshape(arr1, (2,4))

arr1.shape

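np.reshape() returns a reshaped array and leaves the original array's shape unchanged. Assign the result back (arr1 = np.reshape(arr1, (2,4))) to keep the new shape.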
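A short sketch of this behaviour, assuming the same arr1 as in the question: the reshaped array is handed back as a return value, so arr1 only changes if that value is assigned to it. The names shape_before and shape_after are illustrative.

```python
import numpy as np

arr1 = np.array([1, 2, 3, 4, 5, 6, 7, 8])

np.reshape(arr1, (2, 4))          # the returned array is discarded
shape_before = arr1.shape         # still one-dimensional

arr1 = np.reshape(arr1, (2, 4))   # keep the returned array
shape_after = arr1.shape          # now two-dimensional

print(shape_before, shape_after)
```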

100

I want to find the sum of the first one million natural numbers (1 + 2 + 3 + ... + 999999 + 1000000). The code below does not work. Can you explain why and how to fix it?

x = list(range(1, 1000001))

x = np.array(x)

result = np.sum(x, dtype=np.int32)

print(result)

The sum (500000500000) is too large to be represented as a 32-bit integer, whose maximum is 2147483647, so it overflows. The accumulator type needs to be np.int64 (dtype=np.int64).
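A sketch comparing the two accumulator types. As an assumption for this illustration, the array is built directly with np.arange as int32 rather than from a list, and the result is checked against the closed form n(n+1)/2.

```python
import numpy as np

n = 1_000_000
x = np.arange(1, n + 1, dtype=np.int32)

# 32-bit accumulator: the running total wraps around silently.
wrapped = np.sum(x, dtype=np.int32)

# 64-bit accumulator: large enough for the true sum.
correct = np.sum(x, dtype=np.int64)

expected = n * (n + 1) // 2   # closed form for 1 + 2 + ... + n

print(wrapped, correct, expected)
```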

100

Briefly explain why you should avoid using for loops when doing some computation on a Python list. When should you use a for loop and when should you not?

Lists store pointers to their values rather than the values themselves, so looping over a list takes more operations per element.

Use a for loop if the list holds different data types.

200

What is the NumPy concept that allows us to operate on arrays with mismatched shapes by replicating the smaller array so that both arrays have compatible shapes?

Broadcasting
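A minimal broadcasting sketch with made-up arrays: a shape (3,) row is added to a shape (2, 3) matrix, and NumPy treats the row as if it were repeated once per matrix row.

```python
import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])     # shape (3,)

# row is stretched along the first axis to match matrix
result = matrix + row

print(result)
```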

200

The starter code loads in the wine dataset (different chemical features of various wines) from sklearn and stores it as a 2D array. Using this 2D array, extract all the columns whose average is greater than 10.

data[:, np.mean(data, axis=0) > 10]
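The same idea on a tiny made-up array (not the wine data), to make the intermediate boolean mask visible. The names mask and selected are illustrative.

```python
import numpy as np

data = np.array([[1.0, 20.0, 3.0],
                 [2.0, 30.0, 50.0]])

# One mean per column, then one True/False per column
mask = np.mean(data, axis=0) > 10

# Keep every row, but only the columns where mask is True
selected = data[:, mask]

print(mask)
print(selected)
```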

200

There are 3 arrays in the starter code. I want the dot product of all 3 arrays to be the column vector result = [39984, 82782, 61530]. How might you do this?

arr1 = np.array([2,3,4,6,7,8,8,9,10,4,7,4]).reshape((3,4))

arr2 = np.array([3,5,2,6,1,5,1,8]).reshape((4, 2))

arr3 = np.array([6,3,4,6,3,2,5,12,14,3]).reshape((2, 5))

arr4 = np.array([2,5,12,14,3]).reshape((5, 1))

# Change the above code

result = arr1.dot(arr2).dot(arr3).dot(arr4)

result
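For comparison, the same chain can be sketched with the @ operator or np.linalg.multi_dot. Each inner dimension has to match its neighbour: (3,4)(4,2)(2,5)(5,1) gives a (3,1) column vector. The name same is made up for this illustration.

```python
import numpy as np

arr1 = np.array([2, 3, 4, 6, 7, 8, 8, 9, 10, 4, 7, 4]).reshape((3, 4))
arr2 = np.array([3, 5, 2, 6, 1, 5, 1, 8]).reshape((4, 2))
arr3 = np.array([6, 3, 4, 6, 3, 2, 5, 12, 14, 3]).reshape((2, 5))
arr4 = np.array([2, 5, 12, 14, 3]).reshape((5, 1))

# @ is the matrix-multiplication operator, equivalent to chained .dot()
result = arr1 @ arr2 @ arr3 @ arr4

# multi_dot multiplies the whole chain in one call
same = np.linalg.multi_dot([arr1, arr2, arr3, arr4])

print(result)
```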

200

The starter code loads in the iris dataset from sklearn and stores it as a 2D array.

"The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters."

Functions for multiple linear regression and r^2 are provided. multlinreg(X, y_obs) takes in the data (X) and the dependent variable (y_obs) and returns y_obs and y_pred (the predicted values of y_obs given X). r_sq(observed, pred) returns the r^2 value between the observed and pred arrays.

Using these two functions, calculate the Variance Inflation Factor (VIF) for each column in the iris dataset (link to learn more in the starter code). "VIFs are calculated by taking a predictor, and regressing it against every other predictor in the model."

import numpy as np
from sklearn import datasets

iris = datasets.load_iris().data

#TODO

def multlinreg(X, y_obs):
    # Least squares via the normal equations: beta = (X^T X)^-1 X^T y
    X_t = np.transpose(X) # X.T
    X_tX = np.dot(X_t, X)
    X_ty = np.dot(X_t, y_obs)
    beta = np.linalg.inv(X_tX).dot(X_ty)
    y_pred = np.dot(X, beta)
    return y_obs, y_pred

def r_sq(observed, pred):
    # r^2 = 1 - (residual sum of squares / total sum of squares)
    num = np.sum((observed - pred)**2)
    denum = np.sum((observed - np.mean(observed))**2)
    return 1 - (num / denum)

def vif(rsq):
    return 1 / (1 - rsq)

for i in range(0, iris.shape[1]):
    # Regress column i against all the other columns
    arr = np.repeat(True, 4)
    arr[i] = False
    X = iris[:, arr]
    y = iris[:, i]
    observed, pred = multlinreg(X, y)
    print(vif(r_sq(observed, pred)))

200

The starter code loads in a pokemon dataset and creates a 2D array called "poke". Column 1 of "poke" is the type of pokemon (water, fire, etc.) and column 2 is the pokemon's HP. A function for the one-way ANOVA test is loaded into the starter code as well.

I want to see if the HP values of the different pokemon types differ from each other in a statistically significant way. Use this test on the data to determine this. (You only need to be able to properly subset the dataset for the function.)

Ho = There is no difference among the avg HP values in Pokemon types.

Ha = At least one group differs significantly from the overall mean of the HP.

Note when subsetting the data make sure to store the type of the pokemon and the array of HP into the dictionary 'types'

Bonus: +100 can you interpret the p-val at a 0.05 significance level?

for type1 in np.unique(subset[:, 0]):
    types[type1] = subset[np.where(subset[:, 0] == type1), 1].ravel()
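A self-contained sketch of the same grouping step on a made-up four-row array (the real dataset and the ANOVA function live in the starter code and are not reproduced here). It uses a boolean mask directly instead of np.where. The layout of poke as an object array is an assumption.

```python
import numpy as np

# Toy stand-in for the starter-code array: column 0 = type, column 1 = HP
poke = np.array([['water', 44],
                 ['fire', 39],
                 ['water', 59],
                 ['fire', 58]], dtype=object)

types = {}
for type1 in np.unique(poke[:, 0]):
    mask = poke[:, 0] == type1                   # rows of this type
    types[type1] = poke[mask, 1].astype(float)   # their HP values

print(types)
```

Each value in types is a 1D array of HP values for one type, which is the kind of per-group input a one-way ANOVA function would typically take.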

300

Give three reasons why Numpy is better than base Python.

1) Numpy arrays take up less memory

2) Numpy operations are faster

3) Numpy is optimized for linear algebra

300

The starter code loads in the breast cancer dataset from sklearn, as a 2D array, and randomly creates missing values within the array.

Extract all the columns that have less than or equal to 25% of their data missing.

data[:, np.sum(np.isnan(data), axis = 0) / data.shape[0] * 100 <= 25]
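A sketch of the same filter on a small made-up array. Since the mean of a boolean array is the fraction of True values, np.isnan(data).mean(axis=0) gives the missing fraction per column directly. The names missing_frac, keep and kept are illustrative.

```python
import numpy as np

nan = np.nan
data = np.array([[1.0, nan, 3.0],
                 [4.0, nan, nan],
                 [7.0, 8.0, 9.0],
                 [10.0, nan, 12.0]])

# Fraction of missing values in each column
missing_frac = np.isnan(data).mean(axis=0)

# Keep columns with at most 25% missing
keep = missing_frac <= 0.25
kept = data[:, keep]

print(missing_frac)
print(kept.shape)
```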

300

Below is an image of apples. The starter code loads in the image as a 2D numpy array and turns it into a gray scale image in which each value of the array is between 0 (black) and 255 (white).

Segment the cut-open apple from this image. Note, you can use multiple values and see which one works the best.

img2 = img.copy()

thres = 175

img2[img > thres] = 255

img2[img <= thres] = 0

plt.imshow(img2 , cmap='gray')
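The thresholding step can also be sketched with a single np.where call. The 2x2 array below is a made-up stand-in for the grayscale image, and 175 is just the threshold used above, not a recommended value.

```python
import numpy as np

# Toy grayscale "image" with values between 0 and 255
img = np.array([[10, 200],
                [180, 90]], dtype=np.uint8)

thres = 175

# Pixels above the threshold become white (255), the rest black (0)
binary = np.where(img > thres, 255, 0).astype(np.uint8)

print(binary)
```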

300

A confusion matrix is used to evaluate model performance.

The starter code has two arrays (y_obs [the actual values] and y_pred)

Use the arrays to calculate the recall, precision, and f1 score of the data. The equations are in the starter code.

# Indices of the positive (1) and negative (0) labels in each array
ones_y_obs = np.where(y_obs == 1)
zeros_y_obs = np.where(y_obs == 0)
ones_y_pred = np.where(y_pred == 1)
zeros_y_pred = np.where(y_pred == 0)

TP = len(np.intersect1d(ones_y_obs, ones_y_pred))
FP = len(np.intersect1d(zeros_y_obs, ones_y_pred))
FN = len(np.intersect1d(ones_y_obs, zeros_y_pred))
TN = len(np.intersect1d(zeros_y_obs, zeros_y_pred))

recall = TP / (TP + FN)
precision = TP / (TP + FP)
f_1 = (2 * recall * precision) / (recall + precision)

print(recall, precision, f_1)
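An alternative sketch that counts the four cells with boolean masks instead of index intersections. The two label arrays are made up for illustration, and the formulas are the standard recall, precision, and F1 definitions.

```python
import numpy as np

y_obs = np.array([1, 0, 1, 1, 0, 0, 1, 0])    # actual labels (toy data)
y_pred = np.array([1, 0, 0, 1, 1, 0, 1, 0])   # predicted labels (toy data)

# Combine element-wise conditions with & and count the True values
TP = np.sum((y_obs == 1) & (y_pred == 1))
FP = np.sum((y_obs == 0) & (y_pred == 1))
FN = np.sum((y_obs == 1) & (y_pred == 0))
TN = np.sum((y_obs == 0) & (y_pred == 0))

recall = TP / (TP + FN)
precision = TP / (TP + FP)
f_1 = (2 * recall * precision) / (recall + precision)

print(recall, precision, f_1)
```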

300

Create a function that returns the dot product of two matrices. Do not use np.dot()

def dot_mul(mat1, mat2):
    ret = np.zeros((mat1.shape[0], mat2.shape[1]))
    for i in range(0, mat1.shape[0]):
        row = mat1[i, :]
        for j in range(0, mat2.shape[1]):
            col = mat2[:, j].ravel()
            ret[i, j] = np.sum(row * col)
    return ret
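A loop-free variant is also possible with broadcasting. The sketch below uses a hypothetical helper name, dot_mul_broadcast, and checks it against the @ operator on two small made-up matrices. It is one possible approach, not necessarily the intended answer.

```python
import numpy as np

def dot_mul_broadcast(mat1, mat2):
    # mat1[:, :, None] has shape (n, k, 1); mat2[None, :, :] has shape (1, k, m).
    # Their product has shape (n, k, m); summing over k gives the (n, m) result.
    return (mat1[:, :, None] * mat2[None, :, :]).sum(axis=1)

a = np.array([[1, 2, 3],
              [4, 5, 6]])        # shape (2, 3)
b = np.array([[7, 8],
              [9, 10],
              [11, 12]])         # shape (3, 2)

product = dot_mul_broadcast(a, b)

print(product)
```

This trades the double loop for a temporary (n, k, m) array, so it can use a lot of memory for large matrices.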

400

Is it pronounced num-pee or num-pie?

num-pie

(if you said num-pee you will be failing this class)

400

The starter code loads in the Linnerud dataset from sklearn as a 2D array and then induces missing values randomly within the data/array.

Replace all the NaN values in the dataset with the median and separately with the mean (make a copy). Which one do you think is better to use?

The median since the mean is easily impacted by outliers.

for i in range(0, data.shape[1]):
    print(np.median(data[:, i][np.invert(np.isnan(data[:, i]))]))
    data[:, i] = np.where(np.isnan(data[:, i]), np.median(data[:, i][np.invert(np.isnan(data[:, i]))]), data[:, i])
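A more compact sketch of both imputations using np.nanmedian and np.nanmean, which ignore NaN values. The array is made up for illustration, with a deliberate outlier (100) in the first column. np.where returns a new array, so data itself is left untouched.

```python
import numpy as np

nan = np.nan
data = np.array([[1.0, 10.0],
                 [nan, 20.0],
                 [3.0, nan],
                 [100.0, 30.0]])

# Per-column statistics that skip NaN, broadcast across the rows
filled_median = np.where(np.isnan(data), np.nanmedian(data, axis=0), data)
filled_mean = np.where(np.isnan(data), np.nanmean(data, axis=0), data)

print(filled_median)
print(filled_mean)
```

In the first column the median fill is 3 while the mean fill is pulled up toward the outlier, which illustrates the sensitivity to outliers mentioned in the answer.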

400

The starter code loads in the California housing dataset from sklearn as a 2D array and induces missing values into the data/array.

Delete all the rows in the dataset with missing values.

data[~np.isnan(data).any(axis=1), :]
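The same row filter on a small made-up array. The expression returns a new array rather than modifying data, so the result has to be assigned to a name (clean here is illustrative).

```python
import numpy as np

nan = np.nan
data = np.array([[1.0, 2.0],
                 [nan, 3.0],
                 [4.0, 5.0]])

# True for rows that contain at least one NaN
has_nan = np.isnan(data).any(axis=1)

# ~ flips the mask, so only complete rows are kept
clean = data[~has_nan, :]

print(clean)
```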

400

The starter code initializes 2 arrays x1 and x2.

Code the equation of the Student's t-test for x1 and x2.

NOTE that in the equation x1 and x2 refer to the averages of x1 and x2 in the starter code.

num = np.mean(x1) - np.mean(x2)

denum = np.sqrt(((np.std(x1) ** 2) / len(x1)) + ((np.std(x2) ** 2) / len(x2)))

num/denum
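The starter-code equation is not reproduced here, so whether it expects the population (ddof=0, the np.std default) or the sample (ddof=1) standard deviation is an assumption to check against it. The sketch below wraps the same formula in a hypothetical helper, t_stat, with ddof exposed, and runs it on two made-up arrays.

```python
import numpy as np

def t_stat(a, b, ddof=0):
    # Difference of means over the combined standard error.
    num = np.mean(a) - np.mean(b)
    denum = np.sqrt(np.var(a, ddof=ddof) / len(a) + np.var(b, ddof=ddof) / len(b))
    return num / denum

x1 = np.array([2.0, 4.0, 6.0, 8.0])
x2 = np.array([1.0, 3.0, 5.0, 7.0])

t_pop = t_stat(x1, x2, ddof=0)      # matches the np.std default
t_sample = t_stat(x1, x2, ddof=1)   # sample variance (n - 1 in the denominator)

print(t_pop, t_sample)
```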

400

(True or False) All NumPy operations are faster than the base Python version of the same operation. If True, explain your reasoning; if False, provide an example of an operation that is faster in base Python than in NumPy.

False. One example is np.append(): it copies the whole array on every call, so repeatedly appending is slower than list.append().
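A small sketch of why appending behaves differently: np.append builds and returns a new array, leaving the original untouched, whereas list.append grows the existing list in place. Timings are deliberately left out, since they depend on the machine and array size.

```python
import numpy as np

arr = np.array([1, 2, 3])
new_arr = np.append(arr, 4)   # allocates a new array and copies arr into it

lst = [1, 2, 3]
lst.append(4)                 # modifies the same list object in place

print(arr, new_arr, lst)
```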