1. Matrix chain product

   X := M_1 M_2 ... M_n
   What is the "optimal" parenthesization?
   - 1) Find the parenthesization that performs the minimum number of flops
        (matrix chain algorithm).
        Best known algorithm: O(n log n) <- forget about it
        Standard dynamic programming algorithm: O(n^3) (sketch after project 4)
   - 2) Find the parenthesization that results in the best execution time
   - All matrix products are performed through BLAS calls (no parallelism)
   - Compare the execution times for solutions 1) and 2). Quantify the discrepancy.
   - Find examples in which the discrepancy is large (say, > 10%)

2. DGEMM performance modeling (C <- A*B)

   - Matrix sizes: m, n, k; # of threads: p.
     Fix the flags, the ldAs, and the scalars.
   - Objective: for a given BLAS library, construct the function "predict",
     which models the execution time of DGEMM(m,n,k) when executed with
     p threads (a first-order sketch follows after project 4).
   - "predict" takes as input 4 arguments (m, n, k, p), and returns as
     output an estimate of the execution time.
   - Sample DGEMM to acquire knowledge.
   - You are given a time budget; the budget is spent by sampling.
     Initially, assume that the available budget is as high as you need.
   - Define "predict". How much memory does it need?
   - How do you predict the execution time for an input for which no sample exists?
   - How do you predict for inputs that are larger than any of the samples?
   - How accurate is "predict" with respect to an actual call to DGEMM?
   - [research] Consider a limited time budget.

3. Parallelism in matrix chain products

   X := M_1 M_2 ... M_6 (consider only chains of length 6)
   What is the "optimal" parenthesization?
   - 1) As in project 1, find the parenthesization that leads to the minimum
        # of flops. Execute this parenthesization, where each call to DGEMM
        is multi-threaded.
   - 2) Look at all the parenthesizations in which two or more DGEMMs can be
        performed in parallel (no dependencies between them).
        Use OpenMP to distribute your threads across the different DGEMMs
        (oversubscription is also allowed); see the sketch after project 4.
   - Example: 4 matrices and 6 cores.
     For the parenthesization X := (AB)(CD), execute DGEMM(A,B) with 3 threads
     AND, AT THE SAME TIME, DGEMM(C,D) with 3 threads. Also possible:
     DGEMM(A,B) with 1 thread and DGEMM(C,D) with 5 threads;
     DGEMM(A,B) with 6 threads and DGEMM(C,D) with 6 threads; ...
   - Compare the execution times for solutions 1) and 2). Quantify the discrepancy.
   - The objective is to find matrix chains for which solution 1) does not
     lead to the fastest execution.
   - Find examples in which the discrepancy is large (say > 20%)

4. FLAME methodology: derivative of LU

   - D(L * U = A) -> L' * U + L * U' = A'
     L, U, and A' are inputs; L' and U' are the unknowns.
     For readability, L' -> X, U' -> Y, A' -> C:  X U + L Y = C
   - Define the pattern for the problem.
   - Derive the PME and the dependencies (one possible 2x2 PME is written
     out after this project).
   - Find the loop invariants; for each of them, derive the corresponding algorithm.
   - Implement the algorithms (blocked, in C); choose the block size.
   - Time all the algorithms, sequential and multi-threaded.
   - Performance plots.
   - This was: Cholesky
     D(L * L^T = A) -> L' * L^T + L * L'^T = A'
     L and A' are inputs, L' is unknown...
     For readability, L' -> X, A' -> C:  X L^T + L X^T = C
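
   Sketch for project 1 -- the standard O(n^3) dynamic program. A minimal
   sketch, assuming the chain dimensions are passed as dims[0..n], with
   M_i of size dims[i-1] x dims[i]; the function names and the fixed
   NMAX bound are illustrative, not part of the project statement.

       #include <stdio.h>

       #define NMAX 64   /* illustrative bound on the chain length */

       /* m[i][j]: minimum # of flops to compute M_i ... M_j
          s[i][j]: optimal split point, to reconstruct the parenthesization.
          One multiply-add counts as 2 flops. */
       static double m[NMAX][NMAX];
       static int    s[NMAX][NMAX];

       void chain_order(const int *dims, int n)
       {
           for (int i = 1; i <= n; i++) m[i][i] = 0.0;
           for (int len = 2; len <= n; len++) {            /* chain length */
               for (int i = 1; i + len - 1 <= n; i++) {
                   int j = i + len - 1;
                   m[i][j] = -1.0;
                   for (int k = i; k < j; k++) {           /* split point */
                       double cost = m[i][k] + m[k+1][j]
                                   + 2.0 * dims[i-1] * dims[k] * dims[j];
                       if (m[i][j] < 0.0 || cost < m[i][j]) {
                           m[i][j] = cost; s[i][j] = k;
                       }
                   }
               }
           }
       }

       /* Print the optimal parenthesization, e.g. ((M1 M2) M3) */
       void print_order(int i, int j)
       {
           if (i == j) { printf("M%d", i); return; }
           printf("(");  print_order(i, s[i][j]);
           printf(" ");  print_order(s[i][j] + 1, j);
           printf(")");
       }

   After chain_order(dims, n), m[1][n] holds the minimum flop count and
   print_order(1, n) prints the corresponding parenthesization.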
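
   Sketch for project 2 -- a deliberately simple first-order model, meant
   only as a starting point: estimate a sustained flop rate R(p) per
   thread count from a few large samples, and predict t(m,n,k,p) ~
   2mnk / R(p). This ignores shape and cache effects, which the sampled
   grid of the actual project must capture; openblas_set_num_threads is
   OpenBLAS-specific, other libraries use other calls.

       #include <cblas.h>
       #include <stdlib.h>
       #include <time.h>

       static double wall_time(void)
       {
           struct timespec ts;
           clock_gettime(CLOCK_MONOTONIC, &ts);
           return ts.tv_sec + 1e-9 * ts.tv_nsec;
       }

       static double rate[65];          /* R(p), indexed by thread count */

       void sample_rate(int p, int n)   /* n: sample size, e.g. 2000 */
       {
           double *a = malloc(sizeof(double) * n * n);
           double *b = malloc(sizeof(double) * n * n);
           double *c = malloc(sizeof(double) * n * n);
           for (int i = 0; i < n * n; i++) { a[i] = 0.5; b[i] = 0.5; c[i] = 0.0; }
           openblas_set_num_threads(p);          /* library-specific */
           double t0 = wall_time();
           cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                       n, n, n, 1.0, a, n, b, n, 0.0, c, n);
           rate[p] = 2.0 * n * n * n / (wall_time() - t0);
           free(a); free(b); free(c);
       }

       /* The 4-argument interface required by the project */
       double predict(int m, int n, int k, int p)
       {
           return 2.0 * (double)m * n * k / rate[p];
       }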
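
   Sketch for project 3, solution 2), for X := (AB)(CD): run the two
   independent DGEMMs concurrently via OpenMP sections, with t1 and t2
   BLAS threads respectively. How BLAS thread counts are controlled from
   inside a parallel region is library-specific (with MKL,
   mkl_set_num_threads_local is a per-thread setting; OpenBLAS's
   openblas_set_num_threads is global), so those calls are left as
   placeholders here.

       #include <cblas.h>
       #include <omp.h>

       void parallel_pair(int n, int t1, int t2,
                          const double *A, const double *B, double *AB,
                          const double *C, const double *D, double *CD)
       {
           omp_set_max_active_levels(2);     /* allow nested parallelism */
           #pragma omp parallel sections num_threads(2)
           {
               #pragma omp section
               {
                   (void)t1; /* set the BLAS thread count to t1 here
                                (library-specific, see above) */
                   cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                               n, n, n, 1.0, A, n, B, n, 0.0, AB, n);
               }
               #pragma omp section
               {
                   (void)t2; /* set the BLAS thread count to t2 here */
                   cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                               n, n, n, 1.0, C, n, D, n, 0.0, CD, n);
               }
           }
           /* X := (AB)(CD) is then one more DGEMM, with all threads */
       }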
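
   For project 4, one way to write the PME, obtained by partitioning X
   and L as lower triangular, U and Y as upper triangular, C as full,
   and matching quadrants in X U + L Y = C (an illustration, not
   necessarily the exact form used in class):

       \begin{aligned}
       \text{TL:}\; & X_{TL} U_{TL} + L_{TL} Y_{TL} = C_{TL}
                    && \text{(smaller instance of the same problem)} \\
       \text{TR:}\; & Y_{TR} = L_{TL}^{-1}\bigl(C_{TR} - X_{TL} U_{TR}\bigr)
                    && \text{(triangular solve; needs TL)} \\
       \text{BL:}\; & X_{BL} = \bigl(C_{BL} - L_{BL} Y_{TL}\bigr) U_{TL}^{-1}
                    && \text{(triangular solve; needs TL)} \\
       \text{BR:}\; & X_{BR} U_{BR} + L_{BR} Y_{BR}
                      = C_{BR} - X_{BL} U_{TR} - L_{BL} Y_{TR}
                    && \text{(smaller instance; needs TR and BL)}
       \end{aligned}

   The dependency chain TL -> {TR, BL} -> BR is what generates the
   family of loop invariants.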
5. Small eigenproblems

   - You are given thousands of small tridiagonal eigenproblems (n < 60).
   - LAPACK provides three solvers, with different performance and accuracy
     signatures: bisection (DSTEVX), QR (DSTEQR), MR3 (DSTEMR).
   - Objective: compare the three eigensolvers in terms of execution time,
     # of flops, and accuracy.
   - Accuracy of an eigensolver: "residual" = norm(A x - lambda x),
     "orthogonality" = norm(X^T X - I).
   - Task: count the number of flops each algorithm performs.
     (This can be done in different ways, for instance by adding counters
     to the source code, and/or by using external tools such as PAPI.)
   - For your tests, use different matrix types:
     1) Random eigenvalues, uniform distribution (0, 1)
     2) Uniform eigenvalue distribution*
        *) See page 119 of [[http://arxiv.org/pdf/1401.4950v1.pdf]]
   - Generation of tridiagonal matrices with a given eigenspectrum
     (a LAPACKE sketch follows after project 7):
     - store the n desired eigenvalues as a diagonal matrix D
     - generate a random orthogonal matrix, for instance [Q,R] = qr(rand(n,n))
     - construct M := Q^T D Q, and reduce M to tridiagonal form (DSYTRD)
   - [advanced] Study # of flops vs. accuracy, for different accuracy levels.

6. Hyper-cube multiplication

   - You are given a 3-dimensional cube H (size n x n x n) and a square
     matrix M (size n x n).
   - Objective: for a given direction (dimension), multiply all the layers
     of the cube by the matrix M.
     Example: choose n -> 4

           1  5  9 13     17 21 25 29     33 37 41 45     49 53 57 61
       H = 2  6 10 14  ;  18 22 26 30  ;  34 38 42 46  ;  50 54 58 62
           3  7 11 15     19 23 27 31     35 39 43 47     51 55 59 63
           4  8 12 16     20 24 28 32     36 40 44 48     52 56 60 64

     Choose a direction -> top-down (along dimension 1), and multiply
     the 4 layers by M:

        1  5  9 13          2  6 10 14
       17 21 25 29 * M ,   18 22 26 30 * M ,  ...
       33 37 41 45         34 38 42 46
       49 53 57 61         50 54 58 62

   - Solution 1: use BLIS (BLAS with multiple strides).
   - Solution 2: transpose H + loop of DGEMMs (sketch after project 7).
     For the transpositions, use HPTT: https://github.com/HPAC/hptt
   - Solution 3: express the operation as a tensor contraction, and use TCL:
     https://github.com/springer13/tcl
   - Compare the timings for different sizes, directions, and numbers of threads.

7. Computation of eigenvalues: bisection vs. dqds

   - Fix a matrix size n. Generate a tridiagonal matrix with a given
     eigenspectrum (see project 5).
   - Compute the eigenvalues by dqds (LAPACK: DLARRE -> make sure it invokes
     DLASQ2) and by bisection (https://github.com/HPAC/mr3smp : mrrr ->
     make sure it does NOT call DLASQ2); measure accuracy and execution time.
   - 1) Parallel vs. sequential:
        bisection is slow, but scales beautifully; dqds is super fast,
        but inherently sequential. Where is the crossover?
   - 2) Computation of the full spectrum vs. a subset:
        bisection is slow, but it can compute a subset of the eigenvalues;
        dqds is super fast, but can only compute the full spectrum.
        Where is the crossover?
   - 3) Accuracy:
        dqds computes to full accuracy, no freedom; bisection has arguments
        that control the target accuracy (see the DSTEVX sketch below).
   - Plot the surface that partitions the "configuration space"
     <# of cores, # of eigvals, target (or resulting) accuracy>
     according to the fastest algorithm.
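
   Sketch for project 5 -- generating a tridiagonal matrix with a
   prescribed spectrum, following the recipe above, in C with LAPACKE;
   error checking omitted, names illustrative. On return, d[0..n-1] and
   e[0..n-2] hold the diagonal and off-diagonal of the tridiagonal matrix.

       #include <cblas.h>
       #include <lapacke.h>
       #include <stdlib.h>

       void gen_tridiag(int n, const double *eig, double *d, double *e)
       {
           double *q   = malloc(sizeof(double) * n * n);
           double *dq  = malloc(sizeof(double) * n * n);
           double *mat = malloc(sizeof(double) * n * n);
           double *tau = malloc(sizeof(double) * n);

           /* random orthogonal Q: QR of a random matrix */
           for (int i = 0; i < n * n; i++) q[i] = (double)rand() / RAND_MAX;
           LAPACKE_dgeqrf(LAPACK_COL_MAJOR, n, n, q, n, tau);
           LAPACKE_dorgqr(LAPACK_COL_MAJOR, n, n, n, q, n, tau);

           /* dq := D Q (row i of Q scaled by eig[i]); mat := Q^T dq */
           for (int j = 0; j < n; j++)
               for (int i = 0; i < n; i++)
                   dq[i + j * n] = eig[i] * q[i + j * n];
           cblas_dgemm(CblasColMajor, CblasTrans, CblasNoTrans,
                       n, n, n, 1.0, q, n, dq, n, 0.0, mat, n);

           /* reduce the full symmetric matrix to tridiagonal form */
           LAPACKE_dsytrd(LAPACK_COL_MAJOR, 'L', n, mat, n, d, e, tau);

           free(q); free(dq); free(mat); free(tau);
       }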
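
   Sketch for project 6, the DGEMM loop of solution 2, assuming H(i,j,k)
   is stored at i + j*n + k*n*n (column-major): the layers with fixed k
   are then contiguous n x n matrices, so this direction needs no
   transposition. For the other two directions, first permute H with
   HPTT so that the target layers become contiguous, then reuse the loop.

       #include <stddef.h>
       #include <cblas.h>

       void multiply_layers(int n, const double *h, const double *m,
                            double *out)
       {
           for (int k = 0; k < n; k++)                 /* layer k of H */
               cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                           n, n, n,
                           1.0, h + (size_t)k * n * n, n,
                           m, n,
                           0.0, out + (size_t)k * n * n, n);
       }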
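
   Sketch for project 7 -- the two bisection "knobs" discussed in 2) and
   3), illustrated with the LAPACK driver DSTEVX (the project itself uses
   DLARRE and mr3smp's mrrr, whose interfaces are more involved):
   range 'I' with il..iu computes only a subset of the eigenvalues, and
   abstol sets the target accuracy.

       #include <stdlib.h>
       #include <lapacke.h>

       /* Eigenvalues il..iu of the tridiagonal (d, e), to accuracy abstol,
          into w. Note: DSTEVX may overwrite d and e; pass copies if the
          matrix must be preserved. */
       int eig_subset(int n, double *d, double *e, int il, int iu,
                      double abstol, double *w)
       {
           lapack_int m;                        /* # of eigenvalues found */
           lapack_int *ifail = malloc(sizeof(lapack_int) * n);
           double *z = malloc(sizeof(double) * n * n); /* unused: jobz='N' */
           lapack_int info =
               LAPACKE_dstevx(LAPACK_COL_MAJOR, 'N', 'I', n, d, e,
                              0.0, 0.0, il, iu, abstol, &m, w, z, n, ifail);
           free(ifail); free(z);
           return info;
       }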