1. Matrix chain product

   X := M_1 M_2 ... M_n
   What is the "optimal" parenthesization?
   - 1) Find the parenthesization that performs the minimum number of flops
        (matrix chain algorithm).
        Best known algorithm: O(n log n) <- forget about it
        Standard dynamic programming algorithm: O(n^3) (sketch after project 4)
   - 2) Find the parenthesization that results in the best execution time
   - All matrix products are performed through BLAS calls (no parallelism)
   - Compare the execution times for solutions 1) and 2). Quantify the discrepancy.
   - Find examples in which the discrepancy is large (say, > 10%)

2. DGEMM performance modeling (C <- A*B)

   - Matrix sizes: m, n, k; # of threads: p.
     Fix the flags, the ldAs, and the scalars.
   - Objective: for a given BLAS library, construct the function "predict",
     which models the execution time of DGEMM(m,n,k) when executed with
     p threads (a first-order sketch follows after project 4).
   - "predict" takes as input 4 arguments (m, n, k, p), and returns as
     output an estimate of the execution time.
   - Sample DGEMM to acquire knowledge.
   - You are given a time budget; the budget is spent by sampling.
     Initially, assume that the available budget is as high as you need.
   - Define "predict". How much memory does it need?
   - How do you predict the execution time for an input for which no sample exists?
   - How do you predict for inputs that are larger than any of the samples?
   - How accurate is "predict" with respect to an actual call to DGEMM?
   - [research] Consider a limited time budget.

3. Parallelism in matrix chain products

   X := M_1 M_2 ... M_6 (consider only chains of length 6)
   What is the "optimal" parenthesization?
   - 1) As in project 1, find the parenthesization that leads to the minimum
        # of flops. Execute this parenthesization, where each call to DGEMM
        is multi-threaded.
   - 2) Look at all the parenthesizations in which two or more DGEMMs can be
        performed in parallel (no dependencies between them).
        Use OpenMP to distribute your threads across the different DGEMMs
        (oversubscription is also allowed); see the sketch after project 4.
   - Example: 4 matrices and 6 cores.
     For the parenthesization X := (AB)(CD), execute DGEMM(A,B) with 3 threads
     AND, AT THE SAME TIME, DGEMM(C,D) with 3 threads. Also possible:
     DGEMM(A,B) with 1 thread and DGEMM(C,D) with 5 threads;
     DGEMM(A,B) with 6 threads and DGEMM(C,D) with 6 threads; ...
   - Compare the execution times for solutions 1) and 2). Quantify the discrepancy.
   - The objective is to find matrix chains for which solution 1) does not
     lead to the fastest execution.
   - Find examples in which the discrepancy is large (say > 20%)

4. FLAME methodology: derivative of LU

   - D(L * U = A) -> L' * U + L * U' = A'
     L, U, and A' are inputs; L' and U' are the unknowns.
     For readability, L' -> X, U' -> Y, A' -> C:  X U + L Y = C
   - Define the pattern for the problem.
   - Derive the PME and the dependencies (one possible 2x2 PME is written
     out after this project).
   - Find the loop invariants; for each of them, derive the corresponding algorithm.
   - Implement the algorithms (blocked, in C); choose the block size.
   - Time all the algorithms, sequential and multi-threaded.
   - Performance plots.
   - This was: Cholesky
     D(L * L^T = A) -> L' * L^T + L * L'^T = A'
     L and A' are inputs, L' is unknown...
     For readability, L' -> X, A' -> C:  X L^T + L X^T = C
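
   Sketch for project 1 -- the standard O(n^3) dynamic program. A minimal
   sketch, assuming the chain dimensions are passed as dims[0..n], with
   M_i of size dims[i-1] x dims[i]; the function names and the fixed
   NMAX bound are illustrative, not part of the project statement.

       #include <stdio.h>

       #define NMAX 64   /* illustrative bound on the chain length */

       /* m[i][j]: minimum # of flops to compute M_i ... M_j
          s[i][j]: optimal split point, to reconstruct the parenthesization.
          One multiply-add counts as 2 flops. */
       static double m[NMAX][NMAX];
       static int    s[NMAX][NMAX];

       void chain_order(const int *dims, int n)
       {
           for (int i = 1; i <= n; i++) m[i][i] = 0.0;
           for (int len = 2; len <= n; len++) {            /* chain length */
               for (int i = 1; i + len - 1 <= n; i++) {
                   int j = i + len - 1;
                   m[i][j] = -1.0;
                   for (int k = i; k < j; k++) {           /* split point */
                       double cost = m[i][k] + m[k+1][j]
                                   + 2.0 * dims[i-1] * dims[k] * dims[j];
                       if (m[i][j] < 0.0 || cost < m[i][j]) {
                           m[i][j] = cost; s[i][j] = k;
                       }
                   }
               }
           }
       }

       /* Print the optimal parenthesization, e.g. ((M1 M2) M3) */
       void print_order(int i, int j)
       {
           if (i == j) { printf("M%d", i); return; }
           printf("(");  print_order(i, s[i][j]);
           printf(" ");  print_order(s[i][j] + 1, j);
           printf(")");
       }

   After chain_order(dims, n), m[1][n] holds the minimum flop count and
   print_order(1, n) prints the corresponding parenthesization.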
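
   Sketch for project 2 -- a deliberately simple first-order model, meant
   only as a starting point: estimate a sustained flop rate R(p) per
   thread count from a few large samples, and predict t(m,n,k,p) ~
   2mnk / R(p). This ignores shape and cache effects, which the sampled
   grid of the actual project must capture; openblas_set_num_threads is
   OpenBLAS-specific, other libraries use other calls.

       #include <cblas.h>
       #include <stdlib.h>
       #include <time.h>

       static double wall_time(void)
       {
           struct timespec ts;
           clock_gettime(CLOCK_MONOTONIC, &ts);
           return ts.tv_sec + 1e-9 * ts.tv_nsec;
       }

       static double rate[65];          /* R(p), indexed by thread count */

       void sample_rate(int p, int n)   /* n: sample size, e.g. 2000 */
       {
           double *a = malloc(sizeof(double) * n * n);
           double *b = malloc(sizeof(double) * n * n);
           double *c = malloc(sizeof(double) * n * n);
           for (int i = 0; i < n * n; i++) { a[i] = 0.5; b[i] = 0.5; c[i] = 0.0; }
           openblas_set_num_threads(p);          /* library-specific */
           double t0 = wall_time();
           cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                       n, n, n, 1.0, a, n, b, n, 0.0, c, n);
           rate[p] = 2.0 * n * n * n / (wall_time() - t0);
           free(a); free(b); free(c);
       }

       /* The 4-argument interface required by the project */
       double predict(int m, int n, int k, int p)
       {
           return 2.0 * (double)m * n * k / rate[p];
       }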
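
   Sketch for project 3, solution 2), for X := (AB)(CD): run the two
   independent DGEMMs concurrently via OpenMP sections, with t1 and t2
   BLAS threads respectively. How BLAS thread counts are controlled from
   inside a parallel region is library-specific (with MKL,
   mkl_set_num_threads_local is a per-thread setting; OpenBLAS's
   openblas_set_num_threads is global), so those calls are left as
   placeholders here.

       #include <cblas.h>
       #include <omp.h>

       void parallel_pair(int n, int t1, int t2,
                          const double *A, const double *B, double *AB,
                          const double *C, const double *D, double *CD)
       {
           omp_set_max_active_levels(2);     /* allow nested parallelism */
           #pragma omp parallel sections num_threads(2)
           {
               #pragma omp section
               {
                   (void)t1; /* set the BLAS thread count to t1 here
                                (library-specific, see above) */
                   cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                               n, n, n, 1.0, A, n, B, n, 0.0, AB, n);
               }
               #pragma omp section
               {
                   (void)t2; /* set the BLAS thread count to t2 here */
                   cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                               n, n, n, 1.0, C, n, D, n, 0.0, CD, n);
               }
           }
           /* X := (AB)(CD) is then one more DGEMM, with all threads */
       }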
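
   For project 4, one way to write the PME, obtained by partitioning X
   and L as lower triangular, U and Y as upper triangular, C as full,
   and matching quadrants in X U + L Y = C (an illustration, not
   necessarily the exact form used in class):

       \begin{aligned}
       \text{TL:}\; & X_{TL} U_{TL} + L_{TL} Y_{TL} = C_{TL}
                    && \text{(smaller instance of the same problem)} \\
       \text{TR:}\; & Y_{TR} = L_{TL}^{-1}\bigl(C_{TR} - X_{TL} U_{TR}\bigr)
                    && \text{(triangular solve; needs TL)} \\
       \text{BL:}\; & X_{BL} = \bigl(C_{BL} - L_{BL} Y_{TL}\bigr) U_{TL}^{-1}
                    && \text{(triangular solve; needs TL)} \\
       \text{BR:}\; & X_{BR} U_{BR} + L_{BR} Y_{BR}
                      = C_{BR} - X_{BL} U_{TR} - L_{BL} Y_{TR}
                    && \text{(smaller instance; needs TR and BL)}
       \end{aligned}

   The dependency chain TL -> {TR, BL} -> BR is what generates the
   family of loop invariants.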
5. Small eigenproblems

   - You are given thousands of small tridiagonal eigenproblems (n < 60).
   - LAPACK provides three solvers, with different performance and accuracy
     signatures: bisection (DSTEVX), QR (DSTEQR), MR3 (DSTEMR).
   - Objective: compare the three eigensolvers in terms of execution time,
     # of flops, and accuracy.
   - Accuracy of an eigensolver: "residual" = norm(A x - lambda x),
     "orthogonality" = norm(X^T X - I).
   - Task: count the number of flops each algorithm performs.
     (This can be done in different ways, for instance by adding counters
     to the source code, and/or by using external tools such as PAPI.)
   - For your tests, use different matrix types:
     1) Random eigenvalues, uniform distribution (0, 1)
     2) Uniform eigenvalue distribution*
        *) See page 119 of [[http://arxiv.org/pdf/1401.4950v1.pdf]]
   - Generation of tridiagonal matrices with a given eigenspectrum
     (a LAPACKE sketch follows after project 7):
     - store the n desired eigenvalues as a diagonal matrix D
     - generate a random orthogonal matrix, for instance [Q,R] = qr(rand(n,n))
     - construct M := Q^T D Q, and reduce M to tridiagonal form (DSYTRD)
   - [advanced] Study # of flops vs. accuracy, for different accuracy levels.

6. Hyper-cube multiplication

   - You are given a 3-dimensional cube H (size n x n x n) and a square
     matrix M (size n x n).
   - Objective: for a given direction (dimension), multiply all the layers
     of the cube by the matrix M.
     Example: choose n -> 4

           1  5  9 13     17 21 25 29     33 37 41 45     49 53 57 61
       H = 2  6 10 14  ;  18 22 26 30  ;  34 38 42 46  ;  50 54 58 62
           3  7 11 15     19 23 27 31     35 39 43 47     51 55 59 63
           4  8 12 16     20 24 28 32     36 40 44 48     52 56 60 64

     Choose a direction -> top-down (along dimension 1), and multiply
     the 4 layers by M:

        1  5  9 13          2  6 10 14
       17 21 25 29 * M ,   18 22 26 30 * M ,  ...
       33 37 41 45         34 38 42 46
       49 53 57 61         50 54 58 62

   - Solution 1: use BLIS (BLAS with multiple strides).
   - Solution 2: transpose H + loop of DGEMMs (sketch after project 7).
     For the transpositions, use HPTT: https://github.com/HPAC/hptt
   - Solution 3: express the operation as a tensor contraction, and use TCL:
     https://github.com/springer13/tcl
   - Compare the timings for different sizes, directions, and numbers of threads.

7. Computation of eigenvalues: bisection vs. dqds

   - Fix a matrix size n. Generate a tridiagonal matrix with a given
     eigenspectrum (see project 5).
   - Compute the eigenvalues by dqds (LAPACK: DLARRE -> make sure it invokes
     DLASQ2) and by bisection (https://github.com/HPAC/mr3smp : mrrr ->
     make sure it does NOT call DLASQ2); measure accuracy and execution time.
   - 1) Parallel vs. sequential:
        bisection is slow, but scales beautifully; dqds is super fast,
        but inherently sequential. Where is the crossover?
   - 2) Computation of the full spectrum vs. a subset:
        bisection is slow, but it can compute a subset of the eigenvalues;
        dqds is super fast, but can only compute the full spectrum.
        Where is the crossover?
   - 3) Accuracy:
        dqds computes to full accuracy, no freedom; bisection has arguments
        that control the target accuracy (see the DSTEVX sketch below).
   - Plot the surface that partitions the "configuration space"
     <# of cores, # of eigvals, target (or resulting) accuracy>
     according to the fastest algorithm.
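
   Sketch for project 5 -- generating a tridiagonal matrix with a
   prescribed spectrum, following the recipe above, in C with LAPACKE;
   error checking omitted, names illustrative. On return, d[0..n-1] and
   e[0..n-2] hold the diagonal and off-diagonal of the tridiagonal matrix.

       #include <cblas.h>
       #include <lapacke.h>
       #include <stdlib.h>

       void gen_tridiag(int n, const double *eig, double *d, double *e)
       {
           double *q   = malloc(sizeof(double) * n * n);
           double *dq  = malloc(sizeof(double) * n * n);
           double *mat = malloc(sizeof(double) * n * n);
           double *tau = malloc(sizeof(double) * n);

           /* random orthogonal Q: QR of a random matrix */
           for (int i = 0; i < n * n; i++) q[i] = (double)rand() / RAND_MAX;
           LAPACKE_dgeqrf(LAPACK_COL_MAJOR, n, n, q, n, tau);
           LAPACKE_dorgqr(LAPACK_COL_MAJOR, n, n, n, q, n, tau);

           /* dq := D Q (row i of Q scaled by eig[i]); mat := Q^T dq */
           for (int j = 0; j < n; j++)
               for (int i = 0; i < n; i++)
                   dq[i + j * n] = eig[i] * q[i + j * n];
           cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                       n, n, n, 1.0, q, n, dq, n, 0.0, mat, n);

           /* reduce the full symmetric matrix to tridiagonal form */
           LAPACKE_dsytrd(LAPACK_COL_MAJOR, 'L', n, mat, n, d, e, tau);

           free(q); free(dq); free(mat); free(tau);
       }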
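
   Sketch for project 6, the DGEMM loop of solution 2, assuming H(i,j,k)
   is stored at i + j*n + k*n*n (column-major): the layers with fixed k
   are then contiguous n x n matrices, so this direction needs no
   transposition. For the other two directions, first permute H with
   HPTT so that the target layers become contiguous, then reuse the loop.

       #include <stddef.h>
       #include <cblas.h>

       void multiply_layers(int n, const double *h, const double *m,
                            double *out)
       {
           for (int k = 0; k < n; k++)                 /* layer k of H */
               cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                           n, n, n,
                           1.0, h + (size_t)k * n * n, n,
                           m, n,
                           0.0, out + (size_t)k * n * n, n);
       }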
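
   Sketch for project 7 -- the two bisection "knobs" discussed in 2) and
   3), illustrated with the LAPACK driver DSTEVX (the project itself uses
   DLARRE and mr3smp's mrrr, whose interfaces are more involved):
   range 'I' with il..iu computes only a subset of the eigenvalues, and
   abstol sets the target accuracy.

       #include <stdlib.h>
       #include <lapacke.h>

       /* Eigenvalues il..iu of the tridiagonal (d, e), to accuracy abstol,
          into w. Note: DSTEVX may overwrite d and e; pass copies if the
          matrix must be preserved. */
       int eig_subset(int n, double *d, double *e, int il, int iu,
                      double abstol, double *w)
       {
           lapack_int m;                        /* # of eigenvalues found */
           lapack_int *ifail = malloc(sizeof(lapack_int) * n);
           double *z = malloc(sizeof(double) * n * n); /* unused: jobz='N' */
           lapack_int info =
               LAPACKE_dstevx(LAPACK_COL_MAJOR, 'N', 'I', n, d, e,
                              0.0, 0.0, il, iu, abstol, &m, w, z, n, ifail);
           free(ifail); free(z);
           return info;
       }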