This is the proposal for the final project. Proposals should describe projects that take some combination of the following three forms:
- Implement a non-trivial data analysis method for a problem of your choosing.
- Design a data science algorithm and rigorously analyze it.
- Prove a theorem about an existing data science method.
The project proposal should be 2-3 pages long and typeset in LaTeX in an appropriate conference format (e.g., the IEEE conference template).
Note: there should be interesting mathematical content in your project. This means that I expect you to do more than take publicly available data and throw it into an existing machine learning pipeline! Although those kinds of projects are perfectly adequate at some conferences, they are not suitable for this class.
I encourage you to take on an ambitious project; if you propose an ambitious project that doesn't work out, you can either (1) fall back to doing an interesting synthesis/review of your chosen topic, or (2) write it up as a negative result, describing why the approach you tried failed and what you might do differently next time. I will not penalize you for failing at an ambitious project, so long as you put in a good-faith effort, learn something from the process, and can articulate it in paper/presentation form.
To help you out, I've also included some examples of appropriate projects below. Note that if you use one of my project ideas, I will expect to be a co-author on any publications after the class ends. More generally, if I contribute to the project (e.g., by giving significant amounts of advice), I expect to be a co-author. However, if you do everything yourself, you will of course get sole credit.
Project ideas
- Use probabilistic sketches to build a streaming variant caller for human genome sequencing reads. We propose to first filter out k-mers in a corpus of reads that match the human reference, using a Bloom filter or cuckoo hashing. Then we will use a count-min sketch to find heavy hitters among the remaining k-mers: true variant k-mers should surface as heavy hitters, while k-mers arising from sequencing errors remain rare and are filtered out. Once aligned to the human genome, the true-variant k-mers can be assembled into a pile-up table. This should be faster than existing methodologies because only the true-variant k-mers need to be aligned, rather than every read. (A minimal count-min sketch for the heavy-hitter stage appears after this list.)
- Use persistent homology as a feature in a machine learning classification pipeline to determine the healthiness of a gut microbiome. We propose that signatures of horizontal gene transfer in bacterial populations may be correlated with the stress of being in an unhealthy microbiome. To analyze this, we will first need to extract appropriate gene or DNA compositional signatures from biological data (perhaps whole-metagenome shotgun sequencing), which we can then feed into a classification pipeline. One challenge will be preventing the ML pipeline from simply memorizing common disease-associated strains of bacteria rather than learning the compositional signatures, so that it generalizes to non-industrialized populations (which tend to have very different gut microbiomes). We hope to meet this challenge by using only topological features, which should mask the specific identities of the bacteria. (A sketch of 0-dimensional persistence appears after this list.)
- Several well-known similarity search data structures seem to have average-case range-query time complexity that is nearly linear in the entropy of the database (e.g., compressed suffix arrays, entropy-scaling search trees). For compressed suffix arrays, this holds regardless of any other properties of the underlying text. For entropy-scaling search trees, however, it holds only if the database has low fractal (Minkowski) dimension, along with a few other uniformity conditions. We conjecture that similar entropy-scaling bounds hold for other existing data structures designed for nearest-neighbor search in generic metric spaces (such as M-trees and M-indexes), possibly with limitations like those on entropy-scaling search trees. (A sketch of the shared triangle-inequality pruning rule appears after this list.)
- Deep neural networks are randomly initialized before training. A multilayer perceptron (fully connected layers) before training is thus a product of random matrices interleaved with nonlinearities. Can you use random matrix theory to prove properties of randomly initialized deep neural networks? Most NNs are not fully connected, however. There has been some work tying the theory of random graphs to convolutional neural nets. Can you similarly prove properties of NNs whose architecture is generated by different random graph models? (A Marchenko-Pastur sanity check appears after this list.)
- Use wavelets to design a DNA compression algorithm. Consider walking along a chromosome as the time domain, and the base at the current position (A, C, G, or T) as a signal taking values 1-4. Can we choose an appropriate wavelet decomposition of genomes that encapsulates DNA patterns in a compressible fashion? Does this perform better or worse than standard run-length-encoding and Burrows-Wheeler-transform-based methods? Is there additional interpretability to the wavelet compression, since it reveals both localized patterns and longer-time-scale self-similarities? (A Haar-wavelet sketch appears after this list.)
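To make these ideas concrete, here are minimal Python sketches, one per idea. First, the variant-caller idea: a count-min sketch applied to k-mers. The width/depth parameters, the heavy-hitter threshold, and the `candidate_variant_kmers` helper are illustrative assumptions, not tuned or canonical choices; a plain set stands in for the Bloom-filter stage.

```python
import hashlib

class CountMinSketch:
    """Count-min sketch: a depth x width grid of counters. An item's
    estimated count is the minimum over its `depth` counters, which
    can overestimate but never underestimate the true count."""

    def __init__(self, width=1 << 16, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One salted hash per row gives `depth` quasi-independent indexes.
        for row in range(self.depth):
            digest = hashlib.blake2b(item.encode(),
                                     salt=row.to_bytes(2, "big")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item):
        for row, col in self._cells(item):
            self.table[row][col] += 1

    def estimate(self, item):
        return min(self.table[row][col] for row, col in self._cells(item))

def candidate_variant_kmers(reads, reference_kmers, k=31, threshold=5):
    """Stream k-mers not matching the reference; report heavy hitters.
    True variants recur across reads, while sequencing errors produce
    mostly rare k-mers. (A plain set stands in for the Bloom filter,
    and collecting candidates in a set is a simplification of a truly
    streaming heavy-hitter structure.)"""
    cms, candidates = CountMinSketch(), set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer in reference_kmers:
                continue
            cms.add(kmer)
            if cms.estimate(kmer) >= threshold:
                candidates.add(kmer)
    return candidates
```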
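For the microbiome idea, this is a minimal sketch of 0-dimensional persistence (connected-component lifetimes of a Vietoris-Rips filtration), computed with union-find. A real pipeline would use a library and higher-dimensional homology; treat the function name and the toy point cloud as illustrative.

```python
import numpy as np

def persistence_h0(points):
    """0-dimensional persistence diagram of a Vietoris-Rips filtration.

    Every point is born at time 0; a connected component dies at the
    length of the edge that merges it into another component. This is
    single-linkage clustering over edges sorted by length (union-find).
    """
    n = len(points)
    edges = sorted(
        (np.linalg.norm(points[i] - points[j]), i, j)
        for i in range(n) for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    diagram = []
    for dist, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            diagram.append((0.0, dist))  # a component dies at this edge
    diagram.append((0.0, np.inf))        # one component survives forever
    return diagram

# Toy usage: two well-separated clusters give one long-lived component.
rng = np.random.default_rng(0)
cloud = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
deaths = sorted(d for _, d in persistence_h0(cloud) if np.isfinite(d))
print(f"longest finite lifetime: {deaths[-1]:.2f}")  # the cluster gap, about 7
```

The finite death times, sorted, binned, or summarized into a persistence image, would then be the topological features fed to the classifier.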
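For the entropy-scaling conjecture, the quantity any proof must bound is the number of nodes a range query visits. All of the structures mentioned prune with the triangle inequality; below is a sketch of that rule in a simple vantage-point tree over a generic metric (the class and parameter names are mine, not taken from M-trees or M-indexes).

```python
class VPNode:
    """Vantage-point tree over a generic metric space: points inside
    the median ball around the vantage point go to one subtree, the
    rest to the other."""

    def __init__(self, points, metric, leaf_size=8):
        self.metric = metric
        if len(points) <= leaf_size:
            self.leaf, self.points = True, points
            return
        self.leaf = False
        self.vantage, rest = points[0], points[1:]
        dists = [metric(self.vantage, p) for p in rest]
        self.mu = sorted(dists)[len(dists) // 2]  # median split radius
        inner = [p for p, d in zip(rest, dists) if d <= self.mu]
        outer = [p for p, d in zip(rest, dists) if d > self.mu]
        self.inside = VPNode(inner, metric, leaf_size) if inner else None
        self.outside = VPNode(outer, metric, leaf_size) if outer else None

    def range_query(self, q, r):
        """All points within distance r of q. The two subtree conditions
        are the triangle-inequality pruning every metric tree relies on;
        how often they fail to prune is what an entropy-scaling analysis
        has to bound."""
        if self.leaf:
            return [p for p in self.points if self.metric(q, p) <= r]
        d, out = self.metric(q, self.vantage), []
        if d <= r:
            out.append(self.vantage)
        if self.inside and d - r <= self.mu:    # inner ball may meet the query ball
            out += self.inside.range_query(q, r)
        if self.outside and d + r >= self.mu:   # outer shell may meet the query ball
            out += self.outside.range_query(q, r)
        return out

# Toy usage with the absolute-difference metric on integers.
tree = VPNode(list(range(100)), metric=lambda a, b: abs(a - b))
print(sorted(tree.range_query(50, 3)))  # [47, 48, 49, 50, 51, 52, 53]
```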
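For the random-matrix idea, a natural starting point is the Marchenko-Pastur law: the eigenvalue spectrum of W Wᵀ for an i.i.d. Gaussian weight matrix W concentrates on a known interval. The snippet below compares one randomly initialized layer's empirical spectrum against the theoretical support edges (the layer sizes and variance are arbitrary illustrative choices).

```python
import numpy as np

rng = np.random.default_rng(0)

# One randomly initialized fully connected layer: W is n x m with
# i.i.d. N(0, sigma^2 / m) entries (a common initialization scaling).
n, m, sigma = 1000, 2000, 1.0
W = rng.normal(0.0, sigma / np.sqrt(m), size=(n, m))

# Marchenko-Pastur: as n, m -> infinity with aspect ratio c = n/m <= 1,
# the eigenvalues of W W^T concentrate on
#   [sigma^2 (1 - sqrt(c))^2,  sigma^2 (1 + sqrt(c))^2].
c = n / m
eigs = np.linalg.eigvalsh(W @ W.T)
lower = sigma**2 * (1 - np.sqrt(c))**2
upper = sigma**2 * (1 + np.sqrt(c))**2
print(f"empirical spectrum:        [{eigs.min():.3f}, {eigs.max():.3f}]")
print(f"Marchenko-Pastur support:  [{lower:.3f}, {upper:.3f}]")
```

A theorem-flavored version of the project could then ask how a nonlinearity such as ReLU deforms this spectrum as signals propagate through layers.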
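Finally, for the wavelet idea, here is a minimal sketch: map each base to an integer in 1-4, run a full Haar decomposition, and count the detail coefficients that are exactly zero as a crude proxy for compressibility. The toy sequence and the base-to-number mapping are illustrative assumptions.

```python
import numpy as np

BASE_TO_NUM = {"A": 1, "C": 2, "G": 3, "T": 4}  # the 1-4 encoding above

def haar_decompose(signal):
    """Full Haar wavelet decomposition (length must be a power of two).
    Returns [coarsest average array] followed by detail-coefficient
    arrays ordered coarse to fine."""
    x = np.asarray(signal, dtype=float)
    details = []
    while len(x) > 1:
        details.append((x[0::2] - x[1::2]) / np.sqrt(2))  # detail coefficients
        x = (x[0::2] + x[1::2]) / np.sqrt(2)              # running averages
    return [x] + details[::-1]

seq = "ACGTACGTAAAATTTT" * 16            # toy repetitive "genome", length 256
coeffs = haar_decompose([BASE_TO_NUM[b] for b in seq])

# Crude compressibility proxy: repetitive DNA should make many detail
# coefficients vanish, leaving few values to encode.
flat = np.concatenate(coeffs[1:])
print(f"{np.mean(flat == 0):.0%} of detail coefficients are exactly zero")
```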