Wednesday, August 10, 2016

Week 8

What was done ?
1) Used C++ to perform eigendecomposition of the matrix on a single node when the whole matrix fits into one partition. Generated Eigen and LAPACK implementations for this.
Running Time Now:
1) 1000 x 1000 matrix - 100 components
  ECL code - 3 min 42 sec
  Eigen - 32 sec
2) BBC example - 9000 x 2000 matrix - 100 components
 ECL code - 4 min 26 sec
 Eigen - 55 sec
3) BBC example - 9000 x 2000 matrix - 50 components
 ECL code - 3 min 36 sec
 Eigen - 51 sec
4) 10000 x 10000 matrix - 100 components
 ECL code - 4 min 57 sec
 Eigen - 59 sec

All other tests are working with this new code.
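The single-node path above can be sketched in NumPy (a sketch of the idea only; the actual implementation uses Eigen and LAPACK from C++): take the eigendecomposition of AᵀA to get V and the singular values, then recover U = A·V·S⁻¹.

```python
import numpy as np

def svd_via_eig(A, k):
    """Truncated SVD of A from the eigendecomposition of A^T A.

    Sketch of the single-node fallback: when the whole matrix fits in
    one partition, a local dense eigensolver replaces the distributed code.
    """
    gram = A.T @ A                            # symmetric, so eigh applies
    eigvals, V = np.linalg.eigh(gram)         # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:k]       # keep the k largest eigenpairs
    eigvals, V = eigvals[idx], V[:, idx]
    S = np.sqrt(np.clip(eigvals, 0.0, None))  # singular values of A
    U = (A @ V) / S                           # U = A V S^-1
    return U, S, V
```

Note that forming AᵀA squares the condition number, so this trades some accuracy for the speed of a local symmetric eigensolver.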

2) Opened a pull request for the finished work. Waiting for reviews.

What needs to be done ?
1) Perform a full inspection of the code and make any style changes required.
2) Make any changes suggested on the pull request.

Monday, July 25, 2016

Week 6 and 7

What was done ?
1) Improved timings by identifying the sources of runtime bottlenecks (matrix multiplication in the range finder).
Current timings (100 components, 100 partitions, 100-node cluster):
10000 x 2000 - 2 min 51 sec
50000 x 10000 - 3 min 21 sec
100000 x 10000 - 6 min 
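The range finder whose matrix multiplication dominated the runtime is the first stage of the randomised SVD; a minimal NumPy sketch of the pipeline (function names are illustrative, not the actual ECL ones):

```python
import numpy as np

def range_finder(A, k, oversample=10, seed=0):
    """Orthonormal basis Q for the (approximate) range of A.

    The product A @ omega is the step that dominated the distributed
    runtime; everything after it operates on much smaller matrices.
    """
    rng = np.random.default_rng(seed)
    omega = rng.standard_normal((A.shape[1], k + oversample))
    Y = A @ omega                 # the expensive tall-by-thin multiply
    Q, _ = np.linalg.qr(Y)
    return Q

def randomized_svd(A, k):
    Q = range_finder(A, k)
    B = Q.T @ A                   # small projected problem
    Ub, S, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ Ub[:, :k], S[:k], Vt[:k]
```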

2) Experimented with sparse matrix multiplication in blocked format using the Eigen library.
Observations :
i) Using CSC format for storage reduces the cost of distributing blocks.
ii) After local multiplication, blocks are no longer sparse. This becomes a bottleneck: we get no benefit from sparse addition (axpy), yet bringing sparse blocks in and out of C++ is time consuming.
Thus, this approach is not viable in its current state.
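The fill-in observation can be reproduced with a small SciPy sketch (scipy.sparse standing in for the Eigen sparse types used in the actual C++ code):

```python
import numpy as np
from scipy.sparse import random as sparse_random

# Two sparse blocks stored in CSC format (compact to distribute).
A = sparse_random(200, 200, density=0.05, format='csc', random_state=0)
B = sparse_random(200, 200, density=0.05, format='csc', random_state=1)

C = A @ B   # local block multiplication

# Fill-in: the product is far denser than either input, so downstream
# block additions (axpy) no longer benefit from sparsity.
print(A.nnz, B.nnz, C.nnz)
```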

What needs to be done ?
1) Continue working with sparse matrix multiplication to experiment with other approaches.

Saturday, July 9, 2016

Week 4 and 5

What was done?
1) I have written four different tests for my LSA implementation, each based on one of its applications. These confirm that the implementation works correctly and produces appropriate results on real-world tasks.
The tasks performed in these tests are:
 i) Document classification by SVM
 ii) Document classification by KNN using the cosine distance measure
 iii) Document clustering using k-means
 iv) Document retrieval
For each of these tasks, I used the BBC news dataset with five classes, containing 2225 documents, each with 9654 features. Using LSA, I reduced the features to 50.
LSA takes approximately 1 min 15 sec on this matrix.
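A minimal NumPy sketch of how the KNN and retrieval tests use the reduced space: project documents onto the top-k LSA dimensions and retrieve by cosine similarity (a toy random matrix stands in for the 2225 x 9654 BBC matrix; function names are illustrative):

```python
import numpy as np

def lsa_doc_vectors(X, k):
    """Project documents (rows of X) onto the top-k LSA dimensions."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:k].T

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

rng = np.random.default_rng(0)
X = rng.random((20, 50))          # toy document-term matrix
docs = lsa_doc_vectors(X, 5)      # features reduced from 50 to 5

# 1-NN retrieval by cosine similarity, as in the KNN/retrieval tests:
query = docs[0]
best = int(np.argmax([cosine_sim(query, d) for d in docs]))
```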

2) I have diagnosed and rectified the problem with the current implementation of LibSVM so that it now works with multiclass classification.

What will be done from now on?
1) As such, LSA is now completely implemented. For the remaining time, I will focus on improving timings and researching other methods for SVD. Since not many implementations are available for shared-nothing architectures like the HPCC platform, this will take some experimentation and trial and error.

Saturday, June 25, 2016

Third Week

What was done ?
1) Both standard and randomised SVD have been converted into PBblas format. Code can be found on GitHub.
2) Current running times are:
    Test.ecl - 12 sec
    Performance.ecl - 20 x 10 matrix - 49 sec
    Performance.ecl - 200 x 100 matrix - 1 min 31 sec

For next week?
1) From here on, the main concern will be reducing running time as much as possible. As such, I will spend next week researching better algorithms that can be easily parallelised and implemented.
2) Run some tests on semantic similarity and document clustering based on current implementation.

Tuesday, June 21, 2016

Second Week

What was done ?
1) Completed the algorithm for randomised SVD.
2) Implemented a rudimentary SVD algorithm for dense matrices using PBblas, based on Cholesky factorisation. Can be found at: https://github.com/successar/ecl-ml/tree/lsa/ML/LSA/DenseSVD
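The Cholesky-based approach can be sketched as CholeskyQR (a sketch under the assumption that this matches the factorisation used): from AᵀA = RᵀR one gets Q = A·R⁻¹, which suits a shared-nothing layout because both steps are single passes over the distributed A.

```python
import numpy as np

def cholesky_qr(A):
    """QR factorisation via Cholesky: A^T A = R^T R, then Q = A R^{-1}.

    Cheap on a distributed matrix (A^T A and A R^{-1} are one pass each)
    but squares the condition number, so A must be well conditioned.
    """
    G = A.T @ A
    R = np.linalg.cholesky(G).T          # G = R^T R with R upper triangular
    Q = np.linalg.solve(R.T, A.T).T      # Q = A R^{-1} without forming R^{-1}
    return Q, R
```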

For next week?
1) Convert the existing dense SVD implementation into PBblas format completely.
2) Improve running time for dense matrix SVD by implementing a better QR algorithm using PBblas.

Friday, June 10, 2016

First Week

1) Completed a basic interface for LSA.

The interface includes the following functions:
  i) ComputeSVD - Computes eigenvectors for V and eigenvalues for S, then calculates eigenvectors for U = A·V·S⁻¹
  ii) ComputeQueryVector
  iii) GetDocVectors
  iv) CosineSim

The interface is validated on a small test dataset and results are included.
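A NumPy sketch of the first two interface functions, assuming the standard LSA folding-in formula q̂ = (qᵀU)·S⁻¹ for ComputeQueryVector (the actual ECL signatures may differ):

```python
import numpy as np

def compute_svd(A, k):
    """Truncated SVD; V and S as in the interface, with U = A V S^-1
    recoverable from them (here taken directly from the SVD)."""
    U, S, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], S[:k], Vt[:k].T

def compute_query_vector(q, U, S):
    """Fold a query term vector into the k-dimensional LSA space."""
    return (q @ U) / S

rng = np.random.default_rng(0)
A = rng.random((30, 4)) @ rng.random((4, 12))   # rank-4 term-document matrix
U, S, V = compute_svd(A, 4)
# Folding in column j of A reproduces document j's coordinates (row j of V).
```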

For next week ?
1) Do performance testing with bigger matrices.
2) More Validation tests for various sizes and shapes of matrices.
3) Begin Randomized Truncated SVD implementation for faster results.