This part of a neural network receives the initial data.
What is the input layer?
In a neural network, we use these as our parameters.
What are weights and biases?
In word embeddings, a word is mapped to a coordinate in space, which in other words is a ...
What is a vector?
The output of a neural network is often called this.
What is a prediction?
This function returns the maximum of 0 and the input.
What is:
def relu(x):
    return max(0, x)
This function takes an input and clamps the negative inputs to zero while leaving the other inputs the same.
What is the ReLU activation function?
In this phase, parameters are learned.
What is training?
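A minimal sketch of what training means in practice, assuming a made-up one-parameter model y = w * x and a single toy data pair:

# Learn w so that pred = w * x matches y (toy data, illustrative only)
x, y = 2.0, 6.0          # target relationship: y = 3 * x
w = 0.0                  # parameter to be learned
for step in range(50):
    pred = w * x
    grad = 2 * (pred - y) * x   # gradient of squared error w.r.t. w
    w -= 0.05 * grad            # update the parameter
print(w)                        # approaches 3.0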
Word embeddings help to capture this kind of relationship between two words...
What is similarity?
In classification problems, the network converts scores into probabilities between 0 and 1 using this function.
What is softmax?
Calculate the cosine similarity between vector A and each of vectors B and C, then calculate the probabilities that vector B or C comes after vector A. Then, return the most probable word.
A = "Amira" = [2, 7]
B = "Paige" = [1, 7]
C = "Aisha" = [4, 8]
Dot products:
Amira^T Paige = (2)(1) + (7)(7) = 2 + 49 = 51
Amira^T Aisha = (2)(4) + (7)(8) = 8 + 56 = 64
Magnitude:
|Amira| = sqrt(53)
|Paige| = sqrt(50)
|Aisha| = sqrt(80)
Cosine Similarity
Amira to Paige: (51)/(sqrt(53) * sqrt(50)) = 0.991
Amira to Aisha: (64)/(sqrt(53) * sqrt(80)) = 0.983
Softmax Probabilities
Paige Next = e^0.991 / (e^0.991 + e^0.983) = 0.502
Aisha Next = e^0.983 / (e^0.991 + e^0.983) = 0.498
So, we should predict Paige as our next word.
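As a check, this worked example can be reproduced in a few lines of NumPy (vector names taken from the clue above):

import numpy as np

A = np.array([2, 7])  # "Amira"
B = np.array([1, 7])  # "Paige"
C = np.array([4, 8])  # "Aisha"

def cosine_similarity(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

sims = np.array([cosine_similarity(A, B), cosine_similarity(A, C)])
probs = np.exp(sims) / np.sum(np.exp(sims))  # softmax over the two scores
print(sims)   # ~[0.991, 0.983]
print(probs)  # ~[0.502, 0.498] -> predict "Paige"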
This function helps introduce non-linearity to our neural network.
What is an activation function?
This value adjusts how much ATTENTION we pay to the gradient.
What is a learning rate?
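A minimal sketch of how the learning rate scales a single gradient-descent step (the toy loss and starting point are made up for illustration):

# One gradient-descent step on the toy loss L(w) = (w - 3)**2
w = 0.0
learning_rate = 0.1
grad = 2 * (w - 3)            # dL/dw at the current w
w = w - learning_rate * grad  # the learning rate scales the step size
print(w)                      # 0.6, a small step toward the minimum at w = 3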
In word embeddings, words that are similar in meaning are usually ____ in space?
What is close/nearby/nearest?
The difference between a network's prediction and the actual value is called this.
What is error/loss?
This function returns a linear function of the input, using a weight and a bias.
What is:
def linear_1D(x, w, b):
    return w*x + b
A model that performs well on training data, but poorly on new data is doing this.
What is overfitting?
More neurons and parameters usually means a more powerful model, but increases the risk of what?
What is overfitting the training data?
Word embeddings are trained using this.
What is a corpus/vocabulary/data?
This is the opposite of error.
What is accuracy?
What value does the following function return?
def weighted_sum():
    inputs = [2, 1]
    weights = [0.5, -0.5]
    bias = 0.1
    output = 0
    for i in range(len(inputs)):
        x, w = inputs[i], weights[i]
        output += x * w
    output += bias
    return output
What is 0.6?
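As a check, the same weighted sum can be computed with a dot product (a NumPy sketch of the loop above):

import numpy as np

inputs = np.array([2, 1])
weights = np.array([0.5, -0.5])
bias = 0.1
print(np.dot(inputs, weights) + bias)  # 0.6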
These are the four basic components of a simple neural network, in order.
What is an input layer, linear function, activation function, and output layer?
In a fully connected neural net layer with 4 inputs and 4 neurons, how many parameters are there?
What is 20?
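The count comes from one weight per input/neuron pair plus one bias per neuron; a quick sketch (the function name layer_param_count is illustrative):

def layer_param_count(n_inputs, n_neurons):
    # one weight per input-neuron pair, plus one bias per neuron
    return n_inputs * n_neurons + n_neurons

print(layer_param_count(4, 4))  # 20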
This measures how similar two vectors are, using the angle between them.
What is cosine similarity?
This is a method we use to test how good our prediction is.
What is MSE, RMSE, Cross-Entropy loss, or any other kind of error we discussed in class?
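As one example from that list, a minimal MSE sketch (the predictions and targets are made up):

import numpy as np

def mse(predictions, targets):
    # mean of the squared differences between predictions and actual values
    return np.mean((predictions - targets) ** 2)

print(mse(np.array([0.9, 0.2]), np.array([1.0, 0.0])))  # ~0.025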
This function returns the closest word in the vocabulary to the input word.
What is:
import numpy as np

def nearest_neighbor(input_point, U, vocab):
    # U holds one embedding vector per word in vocab
    best_dist = float('inf')
    best_point = None
    best_index = None
    for i in range(len(vocab)):
        next_point = U[i]
        diff = (input_point - next_point)
        diffSq = diff**2  # (diff[0]**2, diff[1]**2)
        sumDiffSq = np.sum(diffSq)  # diffSq[0] + diffSq[1]
        dist = np.sqrt(sumDiffSq)  # Euclidean distance
        if dist < best_dist:
            best_dist = dist
            best_point = next_point
            best_index = i
    best_token = vocab[best_index]
    return best_token
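A hypothetical usage example (the embedding matrix U, vocab, and query point below are made up for illustration):

U = np.array([[2, 7], [1, 7], [4, 8]])  # one embedding row per word
vocab = ["Amira", "Paige", "Aisha"]
print(nearest_neighbor(np.array([1, 6]), U, vocab))  # "Paige"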