Welcome to MAT1841: Mathematics of Massive Data Analysis: Fundamentals and Applications - Fall 2021!
Course Description
This course will focus on understanding the structure of high-dimensional data and the mathematical tools we can use to characterize and reshape it for computational analysis.
Several major threads will be woven throughout the course:
- What is the geometry of high-dimensional space?
- What are meaningful structures and dynamics present in high-dimensional data generated from real-world phenomenon?
- How can we reshape the data so that it is more amenable to analysis?
This is a survey course, and we intend to cover a range of techniques/applications cursorially. In no particular order, some of the specific topics we may cover include:
- High dimensional space and concentration inequalities
- Johnson-Lindenstrauss Lemma and random projections
- Markov chains
- Hashing and (pseudo)randomness
- Probabilistic streaming and sketching algorithms
- Random graph theory and applies to percolation theory
- Wavelet bases
- Complexity and entropy
- Clustering and classification of data
- Nonlinear dimensionality reduction
- Computational topology
This course will assume background in linear algebra, probability, and algorithms. A few topics may make use of elementary results from group theory and algebraic topology.
Schedule
Wednesdays, Thursdays, and Fridays 3-4pm.
Wednesday lecture will be in Bahen BAB024.
Thursday and Friday lectures will be in Koffler House KP113
The first lecture will be on Thursday, September 9.
Reference texts and notes
The primary reference text is Foundations of Data Science by Blum, Hopcroft, and Kanna (2020) Cambridge University Press. Note that the full text is available from the University of Toronto libraries as an online downloadable resource.
There also exist partial earlier drafts of the textbook elsewhere on the internet, but I recommend you
use the published version from the library.
Lecture notes
- Lec 1, 2021-09-09: PDF and Video. Concentration inequalitie; geometry of high-dimensional Euclidean space. [BHK 2.1-2.3]
- Lec 2, 2021-09-10: PDF and Video. Hyperspheres and hyperballs are weird in high dimensions. [BHK 2.4-2.6]
- Lec 3, 2021-09-15: PDF and Video. Gaussian annulus theorem and Johnson-Lindenstrauss Lemma. [BHK 2.7-2.9]
- Lec 4, 2021-09-16: PDF and Video. Markov Chains and MCMC [BHK 4.1-4.2]
- Lec 5, 2021-09-17: PDF and Video. Convergence of random walks on undirected graphs using conductance [BHK 4.4]
- Lec 6, 2021-09-22: PDF and Video. More Markov chains [BHK 4.2-4.3] (finished up Lec 5 as well)
- Lec 7a, 2021-09-23: PDF and Video. Volume via MCMC and PageRank [BHK 4.2-4.3]
- Lec 7b, 2021-09-23: PDF and Video. Overview of ML topics we will not be covering in class
- Lec 8, 2021-09-24: PDF and Video. Overview of hashing, sketching, and random graph theory
- Lec 9a, 2021-09-29: PDF and Video. Overview of wavelets, low-dimensional embeddings, and persistent homology
- Lec 9b, 2021-09-29: PDF and Video. Hashing theory. Recommended reading: Mikkel Thorup's expository article
- Lec 10, 2021-09-30: PDF and Video. k-independent hash functions. (note that the PDF includes some of the basic algebra proofs we omitted in class)
- Lec 11, 2021-10-06: PDF and Video. Frequency moments of data streams and an idealized count-distinct sketch. Recommended reading: [BHK 6.1-6.2] and [Nelson, 2.2]
- Lec 12, 2021-10-07: PDF and Video Non-idealized count-distinct sketch and F2 sketch. Recommended reading: [BHK 6.1-6.2] and [Nelson, 2.2]
- Lec 13, 2021-10-08: PDF and Video. Frequent items sketch (Misra-Gries and Count-Min), MinHash sketch
- Lec 14, 2021-10-13: Prelec PDF, Lec PDF, and Video. Erdos-Renyi random graphs and phase transitions. Recommended reading: [BHK 8.1-8.2]
- Lec 15, 2021-10-14: Prelec PDF, Lec PDF, and Video. Erdos-Renyi diameter and isolated vertices phase transitions. Recommended reading: [BHK 8.3]
- Lec 16, 2021-10-15: Prelec PDF, Lec PDF, and Video. Erdos-Renyi component size phase transitions. Recommended reading: [BHK 8.4].
- Lec 17, 2021-10-20: Prelec PDF, Video. Percolation theory. Recommended reading Jeffrey Steif's lecture notes, [BHK 8.5].
- Lec 18, 2021-10-21: Prelec PDF, Video. Finishing up percolation theory, and intro to wavelets. [BHK 11.1-11.2]
- Lec 19, 2021-10-22: Video Finishing up Haar wavelet since we didn't get to it in the previous lecture. Supplemental source: David Malone's PhD thesis introduction
- Lec 20, 2021-10-27: Prelec PDF and Video. Necessary conditions for general systems of wavelets. Recommended reading: [BHK 11.3-11.5]
- Lec 21, 2021-10-28: PDF and Video. Sufficient conditions for general systems of wavelets. Note that the "proof" of Lemma 11.6, taken from BHK, is unfortunately actually wrong. I have modified it in the uploaded notes as "Computation 11.6" because it is not really a lemma. Recommended reading: [BHK 11.6-11.7]
- Lec 22, 2021-10-29: PDF and Video. The Fast Fourier Transform. Note that I made a mistake in the video on the "symmetry if real" proof, because I forgot to put in the complex conjugates, which are in the modified lecture notes.
- Lec 23, 2021-11-03: Prelec PDF and Video. Matrix sketching. [BHK: 6.3-6.4].
- Lec 24, 2021-11-04: Prelec PDF and Video. VC-dimension [BHK 5.5]
- Lec 25, 2021-11-05: Prelec PDF and Video. Generalization error in classification [BHK 5.6]
- Reading week 2021-11-08 to 2021-11-12 so no lectures
- Lec 26, 2021-11-17. Prelec PDF and Video. Representing topological spaces by decomposing into complexes. Recommended reading [McGuirl 3, Edelsbrunner and Harer III.1]
- Lec 27, 2021-11-18. Prelec PDF and Video. More complexes and introduction to chains.
- Lec 28, 2021-11-19. Prelec PDF and Video. Computing Simplicial homology.
- Lec 29, 2021-11-24. Prelec PDF and Video. Morse functions. Recommended reading [Zomorodian, Ch. 5 , Edelsbrunner and Harer, VI]
- Lec 30, 2021-11-25. Prelec PDF (same as prior) and Video. Morse-Smale complex and piece-wise linear Morse functions
- Lec 31, 2021-11-26. Video. HyperMinHash. Reference: [Yu, Weber, 2000, IEEE TKDE]
- Lec 32, 2021-12-01. Video. Secure federated count-queries. Reference: [Yu, Weber, 2020, JMIR] and [Leighton, Yu, 2021, bioRxiv preprint
- Lec 33, 2021-12-02. Student presentations
- Lec 34, 2021-12-03. Student presentations
- Lec 35, 2021-12-08. Student presentations
Homework
There will be 10 homework assignments that will be due roughly weekly and will be assigned and
collected online. There will be no extensions to posted homework due dates. However, the lowest
homework mark will be dropped.
The homework problems will be a mix of theory and implementation. I recommend using Python
for implementation, but will accept R, Julia, or C/C++. If you wish to do an implementation in any
other language or framework, please clear it with me beforehand. Notably, I will not be accepting
MATLAB implementations in this course, unless you present a very compelling reason.
Note that solutions should be as short, clear, and concise as possible. I will be taking marks off
for long, meandering solutions to otherwise short problems, even if all of the reasoning is technically
correct. Brevity is the soul of wit.
- Problem set 1. Due date: Wednesday, 9/22, 11:59pm
- Problem set 2. Due date: Thursday, 9/30, 11:59pm
- Problem set 3 - project proposal. Due date: Thursday, 10/07, 11:59pm
- Problem set 4. Due date: Friday, 10/15, 11:59pm
- Problem set 5. Due date: Friday, 10/22, 11:59pm
- Problem set 6. Due date: Wednesday, 11/03, 11:59pm
- Problem set 7. Due date: Wednesday, 11/17, 11:59pm
- Problem set 8. Due date: Wednesday, 11/24, 11:59pm
- Problem set 9. Due date: Wednesday, 12/01, 11:59pm
- Problem set 10 - peer feedback. Due date: Wednesday, 12/08, 11:59pm
Final Project
Final project deadline: Wednesday, Dec 8, 2021.
There will be a final project. You will either (A) design and implement a data analysis method
for a problem of your choosing, (B) prove a new result about one of the methods we covered, or
(C) perform a survey of a group of relevant academic articles.
You will submit a written report in
the format of a conference proceedings or journal article and deliver a short presentation to your peers.
Also, you may work with a partner on this project.
Icons made by xnimrodx from www.flaticon.com.