Spring 2025 Midterm review for Inb321G Jeopardy Template

The biology and statistics

What's that data structure?!
and OOO

Extractions!

if( is.logical(Question) ){
print( "LOGIC!" )
}else{ print("FUNCTIONS!") }

Ultimately, what signal is measured to indicate the nucleotide for a certain cluster in next generation sequencing?

Needed: Fluorescence / light / camera

Explanation: Each fragment during the sequencing stage incorporates a single nucleotide per round. The most recently incorporated nucleotide has an attached fluorophore that gives off a distinctive wavelength of light when stimulated by a laser. This fluorescence is what provides the signal measured by a camera.

Deeper: This signal is not strong enough to be detected from a single nucleotide. This is why fragments on the flow cell are locally amplified first to provide a stronger signal. As errors occur, the fragments fall out of sync with each other, and the signal gradually degrades due to the noise produced by the out-of-sync fragments. This signal degradation is a primary limiting factor in the length of reads during NGS.

Besides being familiar with the object, how could you recognize the class of the iris object based on the below outputs? I am looking for 2 or more distinct reasons.

Printing the contents shows a table organized into rows and columns. This by itself narrows the base R options down to data.frames and matrices.

The presence of multiple object classes organized into named columns is sufficient to identify it as list-type (and rule out a matrix).

The presence of a row (obs.) descriptor in the str() output is indicative of a data.frame class object.

The dollar sign symbols in the str() output are used to describe the names of list-type objects.
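
If you want to confirm these clues in R rather than by eye, here is a minimal sketch. It only uses base R functions and the built-in `iris` dataset; nothing here is part of the original question.

```r
# iris ships with base R (datasets package), so nothing needs to be loaded.
class(iris)          # the class attribute of the object
is.list(iris)        # data.frames are built on top of lists
is.matrix(iris)      # a matrix can only hold a single data type
sapply(iris, class)  # one class per named column (numeric and factor here)
```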

Conceptually describe the answer: Where in `iris` are the extracted values originally coming from in the following?

iris[1:10 , 2:3][,1]

The first ten values from the second column of `iris`.
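
One way to convince yourself is to split the chained extraction into two steps. The names `step1` and `step2` are made up for this sketch.

```r
# First extraction: rows 1-10, columns 2-3 (Sepal.Width and Petal.Length)
step1 <- iris[1:10, 2:3]

# Second extraction: column 1 of the SUBSET, which was column 2 of iris
step2 <- step1[, 1]

# Compare against pulling the values straight from iris
identical(step2, iris$Sepal.Width[1:10])
```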

What value will the following print?

x <- 1
y <- 2

if(x == y){
    print(1)
}else if( x > y){
    print(2)
}else{
    print(3)
}

3, because neither x == y nor x > y is TRUE, so the else branch runs.

Write a pattern that makes the following code return values that contain an "e" eventually followed by an "m".

li <- c("Lorem", "ipsum", "dolor","sit",
        "amet", "consectetur","adipiscing",
        "elit", "sed","do", "eiusmod")
li[grep(____,li)]

Output:
[1] "Lorem"   "eiusmod"

# an e 
# followed by any character 0 or more times 
# followed by an m
li[grep("e.*m", li)]

#2 points if they use a + instead of a *
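
Why the partial credit for `+`? A quick sketch comparing the two quantifiers on the same vector. The names `starMatch` and `plusMatch` are made up for illustration.

```r
li <- c("Lorem", "ipsum", "dolor", "sit",
        "amet", "consectetur", "adipiscing",
        "elit", "sed", "do", "eiusmod")

# * allows ZERO characters between the e and the m, so "em" counts
starMatch <- li[grep("e.*m", li)]

# + requires AT LEAST ONE character between them, so "Lorem" is lost
plusMatch <- li[grep("e.+m", li)]

starMatch
plusMatch
```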

How does read depth data (and count-based data in general) differ from normal distributions? Give me two ways.

Needed: Read depths are discrete (integers) and positive values only due to their count-based nature.

Deeper: A normal distribution is described fully by its mean and standard deviation, and those two values can vary independently of each other. In count-based data the variance is typically tied to the mean.

Extension: Read depth (and count-based data in general) is generally poorly described by the normal distribution, although it can superficially resemble one. Instead, it is more appropriately described using more complicated models such as negative binomial models.

Note: The normal distribution CAN have a mean of 0 and standard deviation of 1, but this is specifically the "Standard Normal Distribution", and not a property of normal distributions in general.
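
To see these properties on something concrete, here is a small simulation sketch. The seed and the negative binomial parameters (`mu = 20`, `size = 2`) are arbitrary assumptions chosen for illustration, not values from a real sequencing run.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical read depths drawn from a negative binomial model
depth <- rnbinom(1000, mu = 20, size = 2)

all(depth >= 0)             # counts are never negative
all(depth == round(depth))  # counts are always whole numbers

# For this model the variance grows with the mean instead of being
# a free, independent parameter as in the normal distribution
c(mean = mean(depth), variance = var(depth))
```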

What is the class of obj1? Note: iris by itself is a data.frame.

obj1 <- iris[1:10,1:2]

data.frame; because more than one column was extracted.
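
For contrast, a short sketch of what happens when only one column is extracted with `[`, and how the `drop` argument changes that. The object names are made up for illustration.

```r
twoCols    <- iris[1:10, 1:2]               # more than one column
oneCol     <- iris[1:10, 1]                 # one column: simplified to a vector
oneColKept <- iris[1:10, 1, drop = FALSE]   # one column, simplification turned off

class(twoCols)
class(oneCol)
class(oneColKept)
```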

What value does the following create? Your whole team may help with this one.

paste0(
    c(LETTERS[5],
      letters[c(24,20,18,1,3,20,9,15,14,19)]),
    collapse = ""
)

"Extractions"

Why does the following produce an error?

x <- c(1, 2, 3)
y <- c(3, 2, 1)

if(x == y){
    print(1)
}else if( x > y){
    print(2)
}else{
    print(3)
}

Because if statements require a length 1 logical vector, and the first comparison (x == y) produces a length 3 logical vector.
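
One possible fix, assuming the intent was "are ALL elements equal / greater", is to collapse each comparison to a single value with all(). Using any() would be just as valid if the intent was different. The `result` variable is added only so the outcome can be inspected. (In R 4.2.0 and later the original code is an error; older versions only warned and used the first element.)

```r
x <- c(1, 2, 3)
y <- c(3, 2, 1)

x == y  # element-wise comparison: one logical value per element

# all() collapses the length 3 logical vector into a length 1 logical
if (all(x == y)) {
    result <- 1
} else if (all(x > y)) {
    result <- 2
} else {
    result <- 3
}
print(result)
```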

Write a pattern that makes the following code return all values that begin with an "s" or end with an "m".

li <- c("Lorem", "ipsum", "dolor","sit",
        "amet", "consectetur","adipiscing",
        "elit", "sed","do", "eiusmod")
li[grep(____,li)]

Output:
[1] "Lorem" "ipsum" "sit"   "sed"

# s at the beginning of the string
# OR
# m at the end of the string
li[grep("^s|m$",li)]

What was the major difference between first and second generation of sequencing?

Sanger sequencing only read a single sequence at a time. Separate solutions had to be used for separate sequences.

NGS is a massively (millions of clusters) parallel sequencing process.

Put the following functions in order of execution in the following code. The last to complete should be last in the reordered values.

Code
avgStrEngMpg <- mean(mtcars[mtcars$vs==1 , "mpg"])

Functions (listed in order of appearance)
<-
mean()
[
$
==

From earliest to the latest in OOO
$
==
[
mean()
<-
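
The same order can be made visible by giving each step its own line. This is only a sketch of the order in which the pieces finish; the intermediate names are made up.

```r
vsCol   <- mtcars$vs               # 1. $  pulls out the vs column
isStr   <- vsCol == 1              # 2. == compares each value to 1
mpgVals <- mtcars[isStr, "mpg"]    # 3. [  keeps the matching mpg values
avgStrEngMpg <- mean(mpgVals)      # 4. mean() finishes, then 5. <- stores it

# Same value as the one-line version
avgStrEngMpg == mean(mtcars[mtcars$vs == 1, "mpg"])
```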

Write code that prints the rows of mtcars where both of the following are TRUE:
- The vs column is 1 (straight engine shape)
- The hp column is less than 100 (horsepower)

This should result in 8 rows and 11 columns.

mtcars[mtcars$vs==1&mtcars$hp<100,]
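
A sketch for checking an answer like this one: store the result, look at its dimensions, and compare against base R's subset(), which is an alternative way to write the same filter. The names `res` and `resAlt` are made up.

```r
# Same answer with spacing added; & requires BOTH conditions to be TRUE
res <- mtcars[mtcars$vs == 1 & mtcars$hp < 100, ]
dim(res)  # rows, then columns

# subset() lets you drop the repeated "mtcars$"
resAlt <- subset(mtcars, vs == 1 & hp < 100)
identical(rownames(res), rownames(resAlt))
```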

What are the three (not a typo) values the following prints to the console?

xyz <- function(x , y = 3, z = 5){ 
    out <- (x - y)/z 
    return(out) 
} 
xyz(18) 
xyz(12,2) 
out <- xyz(z=2,5)
xyz(z=2,13)

xyz(18) # 3
xyz(12,2) # 2
out <- xyz(z=2,5) #Nothing
xyz(z=2,13) # 5
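
The trick in the last two calls is argument matching: named arguments are matched first, and the unnamed value then fills the first open position (x). A sketch that spells this out, with every argument named explicitly for comparison:

```r
xyz <- function(x , y = 3, z = 5){
    out <- (x - y)/z
    return(out)
}

# z is matched by name, so 13 falls into x and y keeps its default of 3
xyz(z = 2, 13)
xyz(x = 13, y = 3, z = 2)   # the same call with nothing left implicit

# Assignment captures the value, so nothing prints until you ask for it
out <- xyz(z = 2, 5)
out
```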

Write a pattern that makes the following code return all values with the following substitution:

- Text where a "p" is followed by an "i" should be replaced with "!!!".
- In the event there is more than one "i" following the "p", the match should stop at the first "i".

li <- c(
    "Hello world!",
    "Character type looks like this!",
    "Numeric type does not look like this."
)
gsub(_____,"!!!",li)

Output:
[1] "Hello world!"          
[2] "Character ty!!!ke this!" 
[3] "Numeric ty!!!ke this."

# .*? stands for "any character 0 or more times but stop once a match is found"

gsub("p.*?i", "!!!", li)
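
To see why the ? matters, here is a sketch comparing the lazy pattern to its greedy version on the same vector. The names `lazy` and `greedy` are made up.

```r
li <- c(
    "Hello world!",
    "Character type looks like this!",
    "Numeric type does not look like this."
)

# Lazy: stop at the FIRST i after the p
lazy <- gsub("p.*?i", "!!!", li)

# Greedy: run all the way to the LAST i after the p
greedy <- gsub("p.*i", "!!!", li)

lazy
greedy
```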

What was the most problematic issue with genome assembly that long-read sequencing overcame to allow for the newest generation of genome assemblies such as the human telomere-to-telomere genome?

Repetitive sequences used to be larger than sequence fragments, so scientists were unable to stitch the genome together from shorter reads across regions of repetitive sequence.

Provide code that calculates the number of FALSEs in the `obj1`:

obj1 <- iris$Sepal.Length<4.5

Many ways:
sum(!obj1)
table(obj1)["FALSE"]
length(obj1)-sum(obj1)
(1-mean(obj1))*length(obj1)
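
A sketch that runs all four approaches side by side to confirm they agree. `[[` is used on the table here only to drop the name so the values sit in one plain vector; the `counts` name is made up.

```r
obj1 <- iris$Sepal.Length < 4.5

counts <- c(
    sum(!obj1),                        # flip TRUE/FALSE, then count the TRUEs
    table(obj1)[["FALSE"]],            # tabulate and pull the FALSE count
    length(obj1) - sum(obj1),          # everything minus the TRUEs
    (1 - mean(obj1)) * length(obj1)    # proportion of FALSEs times the total
)
counts
```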

How many extractions?

iris[iris$Sepal.Length < 4.5 , 1:2 ]$Sepal.Length[1:3]

4 extractions

Explanations
iris$Sepal.Length

iris[iris$Sepal.Length<4.5 , 1:2 ]

iris[iris$Sepal.Length<4.5 , 1:2 ]$Sepal.Length

iris[iris$Sepal.Length<4.5 , 1:2 ]$Sepal.Length[1:3]

What two things will be printed when all below is run (use brain not R)?

Question <- function(x = 3){
    if(x==1){
        print("LOGIC!")
    }else if(x==2){
        print("FUNCTION!")
    }else{
        print("Not anticipated!")
    }
    return(x)
}

x <- 4
if( is.logical(Question) ){
    Question(1)
}else{ 
    Question(2) 
}

3 pts (because the function said to print):

[1] "FUNCTION!"

3 pts (because it returned unassigned data):

[1] 2

Respond with a list of choices / filled in blanks:

The following returns a [vector/data.frame] of the rows of `iris` where [at least one of the / both of the] following are TRUE:
The Species column value contains ______.
The Sepal.Width column contains values less than or equal to 2.3.

iris[grepl("^.e",iris$Species)&iris$Sepal.Width<=2.3,]

The following returns a data.frame of the rows of `iris` where both of the following are TRUE:
The Species column value contains an e following the first letter.
The Sepal.Width column contains values of less than or equal to 2.3.

data.frame #2 pt
both of the #2 pt
an e following the first letter #2 pts
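
If the pattern is hard to read, try it on just the three species names. A minimal sketch (the `species` name is made up):

```r
species <- levels(iris$Species)

# ^ anchors to the start, . is any one character, then a literal e
hasE <- grepl("^.e", species)
setNames(hasE, species)
```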

Calculate coverage of the genome in the below example.

You can use R on this one (the exam just has you set up the math).

Example: A scientist uses a sequencing technique that generates 500 paired-end reads (100 bp per side). This technique is applied to a 10 kilobase target sequence. What will be the average read depth per position assuming all positions are equally likely to be observed?

Coverage = (Total nucleotides of sequencing data) / (Total nucleotides in sequencing target)

Formula:
C = (N * L) / G

Where:
N = Total number of reads
L = Length of each read (in base pairs)
G = Total size of the genome (in base pairs)

Total nucleotides of sequencing data
500 reads * 100 bp per side * 2 sides = 100,000 Total nucleotides

Total nucleotides of sequencing target
10 Kb * 1000 nucleotides / Kb = 10,000 nucleotides

Coverage
100,000 bp of reads / 10,000 in target = 10x coverage

Background
Coverage refers to the average number of times a nucleotide is read during sequencing. It's an important metric that helps assess the completeness and reliability of the sequencing data.
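
The same arithmetic set up in R, as a sketch. The variable names are made up; the numbers come straight from the example.

```r
nReads    <- 500   # paired-end reads
bpPerSide <- 100   # bp sequenced from each side
nSides    <- 2     # paired-end means two sides per read
targetKb  <- 10    # size of the target in kilobases

totalBp  <- nReads * bpPerSide * nSides   # total nucleotides of sequencing data
targetBp <- targetKb * 1000               # total nucleotides in the target

coverage <- totalBp / targetBp
coverage
```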

What is the class of each obj# in the output below?

obj1 <- letters[1:4]
obj2 <- data.frame(lower=obj1,pos = 1:4)
obj3 <- obj2$pos
obj4 <- obj2[2:3, ]
obj5 <- obj2[2:3,1]

#2 pts per correct answer
character
data.frame
integer (numeric or double would suffice for credit)
data.frame
character
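
A sketch for checking all five at once. This assumes R 4.0 or later, where data.frame() keeps character columns as character; in older versions the `lower` column would become a factor by default, which changes the last answer.

```r
obj1 <- letters[1:4]
obj2 <- data.frame(lower = obj1, pos = 1:4)
obj3 <- obj2$pos
obj4 <- obj2[2:3, ]
obj5 <- obj2[2:3, 1]

# Put the objects in a named list and ask for the class of each
classes <- sapply(list(obj1 = obj1, obj2 = obj2, obj3 = obj3,
                       obj4 = obj4, obj5 = obj5), class)
classes
```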

Write code that returns a data.frame with all columns for the rows of `iris` where the petal length is greater than 6.6 or the petal width is greater than 2.4.

This should be 6 rows total. I have shown these rows in red in the plot below for clarity.

#3 pts for left comparison
#3 pts for right comparison
#2 pts for logical operator
#2 pts for extracting correctly
iris[iris$Petal.Length>6.6|iris$Petal.Width>2.4 , ]

You may use R for this question.

Nolan often fidgets while teaching class. This sometimes results in objects flying into the air.

Assume the highly realistic scenario of Nolan flinging a frictionless object directly upwards at an initial velocity of 25 m/s on a post-apocalyptic Earth that has an acceleration due to gravity of exactly -10 m/s^2.

For 5 points, how long would Nolan have to wait for the object to land* back in his unmoved hand? You could rearrange the function, but the answer is an integer, so you can use this to constrain your experiments with the function as written.

For 5 points, how high did it travel?

#Here is a function that calculates displacement in 
#   1-dimension at a constant acceleration.
disp <- function(t, vi, a){
    d <- 0.5*a*t^2+vi*t
    return(d)
}

* Assume that the rogue planetoid that added mass to the Earth (and liquified most of the planet) also vaporized the roof of the building.

#5 pts: You are looking for t values where disp() returns 0
disp(1:10,25,-10) # 5 seconds

#5 pts: Its apogee will occur at half the time to catch (5 / 2 seconds).
disp(5/2,25,-10) # 31.25 meters (Post-apocalyptic Nolan has incredibly powerful fidgeting skills)
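
If you did rearrange the function instead of experimenting, it could look like the sketch below. Setting disp() to 0 and solving for the nonzero t gives t = -2 * vi / a, and the peak sits halfway there. The names `tLand` and `tApex` are made up.

```r
disp <- function(t, vi, a){
    d <- 0.5*a*t^2 + vi*t
    return(d)
}

vi <- 25    # initial velocity (m/s)
a  <- -10   # acceleration due to gravity (m/s^2)

tLand <- -2 * vi / a   # nonzero solution of 0.5*a*t^2 + vi*t = 0
tApex <- tLand / 2     # the flight is symmetric, so the peak is halfway

disp(tLand, vi, a)     # back in the hand: displacement of 0
disp(tApex, vi, a)     # maximum height in meters
```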

Fill in the blanks so that the code produces the output shown in its final comment:

fibLength   <- BLANK1
fibVec      <- rep(NA,fibLength)
fibVec[1:2] <- 1
for(i in 3:fibLength){
    fibVec[BLANK2] <- fibVec[BLANK3]+fibVec[BLANK4]
}
print(fibVec)
#[1]  1  1  2  3  5  8 13 21 34 55

2 pts per blank, +2 for all being correct.

fibLength   <- 10
fibVec      <- rep(NA,fibLength)
fibVec[1:2] <- 1
for(i in 3:fibLength){
    fibVec[i] <- fibVec[i-1]+fibVec[i-2]
}
print(fibVec)
#[1]  1  1  2  3  5  8 13 21 34 55
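
One thing the loop above quietly assumes is that fibLength is at least 3, because 3:fibLength counts backwards otherwise. As a sketch, the same logic can be wrapped in a function with a guard. The name `fibSeq` is made up and is not part of the question.

```r
fibSeq <- function(n){
    fibVec <- rep(NA, n)
    # Fill in the first one or two positions, depending on n
    fibVec[seq_len(min(n, 2))] <- 1
    # Only loop when there is a third position to fill
    if (n >= 3) {
        for (i in 3:n) {
            fibVec[i] <- fibVec[i-1] + fibVec[i-2]
        }
    }
    return(fibVec)
}

fibSeq(10)
fibSeq(2)
```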