Data Analysis
Genomic Methods
Machine Learning
Cancer
Workflows
100

A plot used to show the distribution of a continuous variable.

What is a histogram?

100

A standard text-based file format used to store genomic sequences without quality scores.

What is FASTA?

100

A method that groups similar data points together without labels.

What is clustering?

100

A widely used database containing multi-omics data for various human cancers.

What is The Cancer Genome Atlas (TCGA)?

100

A platform commonly used to host and share code repositories.

What is GitHub?

200

A plot of statistical significance (p-value) against magnitude of change (log fold change).

What is a volcano plot?
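
As a rough illustration of the two axes named in the clue, the sketch below computes one gene's volcano plot coordinates. It assumes the common convention of log2 fold change on the x-axis and -log10 of the p-value on the y-axis. The function name, the pseudocount, and the input values are hypothetical and not taken from any real dataset.

```python
import math


def volcano_coordinates(mean_a, mean_b, p_value, pseudocount=1.0):
    """Return (log2 fold change, -log10 p-value) for one gene.

    The pseudocount is an assumption of this sketch; it avoids dividing
    by zero when a gene has no expression in one condition.
    """
    log2_fc = math.log2((mean_b + pseudocount) / (mean_a + pseudocount))
    neg_log10_p = -math.log10(p_value)
    return log2_fc, neg_log10_p


# Hypothetical values for a single gene, for illustration only.
x, y = volcano_coordinates(mean_a=7.0, mean_b=31.0, p_value=0.001)
```

Genes that land far from zero on the x-axis and high on the y-axis are the ones that are both strongly changed and statistically significant.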

200

A standard text-based file format used to store genomic sequences with quality scores.

What is FASTQ?
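
To make the "with quality scores" part concrete, here is a minimal sketch of a FASTQ reader. It assumes the usual four-line record layout (header, sequence, `+` separator, quality), unwrapped sequences, and Phred+33 quality encoding. The names `read_fastq` and `phred33` are hypothetical helpers, not part of any library.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # skip the '+' separator line
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed record: {header!r}")
        yield header[1:], sequence, quality


def phred33(quality):
    """Convert a quality string to integer scores, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality]


# A made-up single-read example.
example = io.StringIO("@read1\nACGT\n+\nII5!\n")
records = list(read_fastq(example))
scores = phred33(records[0][2])
```

A FASTA record would carry only the header and sequence lines; the quality string is what distinguishes FASTQ.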

200

A machine learning method that learns from labeled training data.

What is supervised learning?

200

A gene that normally helps prevent uncontrolled cell growth.

What is a tumor suppressor gene?

200

A workflow tool that automatically reruns only the steps affected by a file change.

What is Snakemake or Nextflow?

300

You want to compare the overall gene expression distributions of 2 RNA-seq samples. Which plot would you use?

What is a boxplot? (also density or violin plot)

300

A binary file format commonly used to store aligned sequencing reads.

What is BAM?

300

You want to predict whether a sample is normal or cancer. What type of ML problem is this?

What is classification?

300

The spread of cancer cells from the original tumor to another part of the body.

What is metastasis?

300

A version control system used to track changes in code.

What is Git?

400

A PCA plot shows samples clustering by BioProject rather than by biological condition. What problem does this suggest?

What is a batch effect?

400

An ML approach for generating ancestry-unbiased genetic signatures.

What is PhyloFrame?

400

A model performs well on training data but poorly on new samples. What problem does this indicate?

What is overfitting?
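
The train-versus-new-data gap can be sketched with a toy model that simply memorizes its training set. The example below is an assumption-laden illustration: the data are random numbers, the labels are pure noise, and the "model" is a hand-written 1-nearest-neighbor classifier, so there is nothing real to learn.

```python
import random


def nearest_neighbor_predict(train_x, train_y, point):
    """Predict the label of the single closest training point (1-NN)."""
    distances = [sum((a - b) ** 2 for a, b in zip(row, point)) for row in train_x]
    return train_y[distances.index(min(distances))]


def accuracy(train_x, train_y, xs, ys):
    """Fraction of points in (xs, ys) that the memorizing model labels correctly."""
    hits = sum(nearest_neighbor_predict(train_x, train_y, x) == y for x, y in zip(xs, ys))
    return hits / len(ys)


rng = random.Random(0)


def make_samples(n):
    """Random 5-feature samples whose labels are unrelated to the features."""
    xs = [[rng.random() for _ in range(5)] for _ in range(n)]
    ys = [rng.randint(0, 1) for _ in range(n)]
    return xs, ys


train_x, train_y = make_samples(100)
test_x, test_y = make_samples(100)

# Each training point is its own nearest neighbor, so training accuracy is perfect.
train_acc = accuracy(train_x, train_y, train_x, train_y)
# On unseen samples the memorized noise does not help.
test_acc = accuracy(train_x, train_y, test_x, test_y)
```

Perfect training accuracy paired with much lower accuracy on held-out samples is the pattern the clue describes.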

400

A mutation that gives a cell a growth advantage and directly contributes to cancer development.

What is a driver mutation?

400

Two people run the same pipeline, but one of them gets errors because their software is newer. What's the issue?

What is a version mismatch?
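
One way to spot this kind of problem is to record the software versions on each machine and compare them. The sketch below uses the standard library's `importlib.metadata`; the helper names `snapshot_versions` and `differing` are hypothetical, and the version numbers in the comparison are illustrative only.

```python
import platform
from importlib import metadata


def snapshot_versions(packages):
    """Map each package name to its installed version, or None if not installed."""
    versions = {"python": platform.python_version()}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def differing(a, b):
    """Return the entries whose versions differ between two snapshots."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Snapshot of this machine; the package list is whatever the pipeline depends on.
mine = snapshot_versions(["numpy"])

# Illustrative comparison of two hand-written snapshots.
mismatch = differing({"numpy": "1.26.4"}, {"numpy": "2.0.0"})
```

Pinning the recorded versions in an environment file or container is the usual next step, so that both people run the same software.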

500

After batch correction, biological separation disappears in PCA. What went wrong?

What is overcorrection (loss of biological signal)?

500

A genetic signature works well in European-ancestry samples but fails in African-ancestry samples. What is the underlying problem?

What is ancestry bias? (also lack of generalization across populations)

500

If you add more features to a model and the performance decreases on new data, what technique would help fix this?

What is regularization or feature selection?

500

A drug targets a mutation, but some tumors do not respond. What could explain this?

What is tumor heterogeneity or resistance?

500

If 2 people run the same pipeline and get different results, what is the problem?

What is lack of reproducibility?