
Assignments

Assignment 3 - Hadoop in Machine Learning

This assignment is optional. It will not count towards your grade.

Instructions

  • Deadline: May 21st, 2013 before midnight.
  • Other requirements similar to earlier assignments.

Resources

Description

pdf | tar.gz (LaTeX source)

Datasets

MNIST Encoder

  • MNIST 800 encoder: tar.gz
  • Visualization: png

The encoder has 800 outputs. The architecture is Linear → SoftPlus. It was trained using predictive sparse decomposition, after the original data was filtered through a local contrast normalization with a 5×5 Gaussian window with standard deviation 0.25.
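The local contrast normalization step can be sketched as follows. This is a minimal NumPy/SciPy approximation for intuition only, not the exact Torch pipeline used to train the encoder; in particular, interpreting the 0.25 standard deviation as being in pixel units is an assumption here.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(img, sigma=0.25, eps=1e-8):
    """Subtractive + divisive local contrast normalization (sketch).

    A rough approximation of the preprocessing described above; the
    actual Torch implementation may differ. truncate is chosen so the
    Gaussian window is 5x5 (radius 2) regardless of sigma.
    """
    truncate = 2.0 / sigma
    # Subtractive step: remove the local (Gaussian-weighted) mean.
    local_mean = gaussian_filter(img, sigma, truncate=truncate)
    centered = img - local_mean
    # Divisive step: divide by the local standard deviation.
    local_var = gaussian_filter(centered ** 2, sigma, truncate=truncate)
    local_std = np.sqrt(local_var)
    # Clamp the divisor from below so flat regions are not amplified.
    divisor = np.maximum(local_std, local_std.mean())
    return centered / (divisor + eps)
```

The clamp against the mean standard deviation is a common choice to keep near-uniform patches from blowing up to pure noise.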

To load the encoder into Torch, please specify the format as “ascii”. For example:

encoder = torch.load("encoder.t7","ascii")

Overview

Here is a brief overview of what the assignment contains:

  1. (20%) Getting hands-on with Hadoop streaming.
  2. (40%) One-vs-one multiclass classification using Hadoop; design the output format of the mapper.
  3. (40%) One-vs-all multiclass classification using Hadoop.
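As a starting point for the one-vs-one part, a mapper might replicate each labeled example to every class pair it belongs to, keyed by the pair, so that each reducer trains one binary classifier. The sketch below (and the assumed "label feat1 feat2 ..." input format) is only one plausible design; designing the actual mapper output format is part of the problem.

```python
"""Hypothetical Hadoop-streaming mapper for one-vs-one classification.

For a 10-class problem, each labeled example is emitted 9 times, once
per class pair it participates in. The key (a pair like "3,7") routes
the example to the reducer that trains that pair's binary classifier.
"""
import sys

NUM_CLASSES = 10  # e.g. the 10 MNIST digit classes

def map_line(line):
    # Assumed input format: "label feat1 feat2 ..."; adapt to yours.
    label, _, features = line.strip().partition(' ')
    label = int(label)
    records = []
    for other in range(NUM_CLASSES):
        if other == label:
            continue
        lo, hi = sorted((label, other))
        # Binary target: +1 if this example belongs to the smaller
        # class index of the pair, -1 otherwise.
        y = 1 if label == lo else -1
        records.append('%d,%d\t%d %s' % (lo, hi, y, features))
    return records

# In the real streaming job, drive it from standard input:
#   for line in sys.stdin:
#       if line.strip():
#           for record in map_line(line):
#               print(record)
```

Keying on the sorted pair means Hadoop's shuffle groups all examples for one binary subproblem at one reducer.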

Assignment 2 - Deep Learning using Sparse Coding

  • Update April 19th 2013: the assignment deadline has been extended to April 23rd.
  • Update April 19th 2013: the third term in the predictive sparse decomposition energy in problem 4 should be squared. This has been fixed.
  • Update April 19th 2013: another way of obtaining the ImageNet data from CIMS servers has been posted.

Instructions

  • Deadline: April 23rd, 2013 before midnight.
  • Other requirements similar to assignment 1.

Resources

Description

pdf | tar.gz (LaTeX source)

Datasets

  • MNIST and its reader: tar.gz
  • ImageNet: if you have access to the hadoop cluster, the tarballs are available at the paths below (each extracts into a directory containing the JPEGs):
  /home/xz558/public/train256.tar
  /home/xz558/public/test256.tar
  • ImageNet on Hadoop FS:
  /user/xz558/public/train256
  /user/xz558/public/test256
  • ImageNet on CIMS 'energon3' server
  /scratch/xz558/public/train256.tar
  /scratch/xz558/public/test256.tar

Clément's tutorial on unsupervised learning

link (if you are stuck on the implementation, this sample code may help)

Overview

Here is a brief overview of what the assignment contains:

  1. (5%) Intuition behind sparsity.
  2. (30%) Inference in sparse coding with FISTA; the shrinkage operator; line search in FISTA; estimating the Lipschitz constant with a spectral approximation.
  3. (30%) Dictionary learning in sparse coding; implement the alternating-direction algorithm; test on the MNIST dataset.
  4. (35%) Training a predictive sparse decomposition architecture; apply it to preprocessed ImageNet data; the rectified linear unit and its smooth version.
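The core of the FISTA part is the shrinkage (soft-thresholding) operator, the proximal operator of the L1 penalty. The sketch below shows it inside plain ISTA; FISTA adds a momentum sequence on top of the same update. The dictionary `D`, step bound `L`, and variable names are illustrative, not the assignment's notation.

```python
import numpy as np

def shrink(x, t):
    """Soft-thresholding: prox of t * ||.||_1, applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(D, y, lam, L, n_iter=100):
    """Plain ISTA for min_z 0.5*||y - D z||^2 + lam*||z||_1.

    Didactic sketch: no FISTA momentum and no line search. L should
    upper-bound the largest eigenvalue of D^T D, e.g. the squared
    spectral norm of D.
    """
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - y)          # gradient of the smooth term
        z = shrink(z - grad / L, lam / L)  # gradient step, then prox
    return z
```

With `D` equal to the identity, the iteration reaches the closed-form solution `shrink(y, lam)` in one step, which is a handy sanity check for your own implementation.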

Assignment 1 - Binary Linear Learning

  • Update March 3, 2013: A new version of the description file is available, correcting some notation errors.

Instructions

  • Deadline: March 12, 2013 before midnight.
  • You must answer the questions by yourself, but you may discuss the results of experiments with other students. We recommend submitting questions to our Piazza forum, where you can get help from both instructors and students.
  • Send an assignment report, along with all of your source code, to the TA ONLY. Please package everything into a single tarball or zip file named: LASTNAME_FIRSTNAME.tar.gz
  • Write all your answers to the questions below in the report, in PDF format. Please do not include any doc, docx, odt, html, or plain-text files. If you hate typesetting math, we can accept hand-written reports submitted before class on the day of the deadline. You still have to send all the source code to the TA if you choose to submit a hand-written report.
  • Do not include any dataset in your submission.
  • We accept late submissions, but a penalty will be applied to your score.

Resources

Description: pdf | tar.gz (LaTeX source)

Skeleton code: tar.gz

The datasets used are the spambase dataset and the malicious URL dataset. You should put the file spambase.data into the src/learn directory in order to do problem 2. To avoid excessive downloads from the original distributors, we provide the following back-up ways of obtaining the datasets:

  • Using the NYU Files service: spambase; malicious url
  • If you have access to CIMS servers or the NYU hadoop cluster, the datasets can be copied from the following path:
  /home/xz558/public/lsml/assign1-dat

Overview

Here is a brief overview of what the assignment contains:

  1. (20%) Semantics of loss functions: square, quantile, hinge, and logistic.
  2. (45%) Demo with preconditioned SGD and batch GD. Implement each of the loss functions, and train with the SGD, BFGS, CG, and L-BFGS algorithms on the spambase dataset.
  3. (35%) Vowpal Wabbit with the malicious URL dataset. Write code to compute test error, precision, and recall. Train on the dataset with different weights on the two labels to observe the precision-recall trade-off.

The three parts are relatively independent; you can start with any of them. Please refer to the description file for more details.
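For part 1, the four losses can be written down as below, for a label y in {-1, +1} and prediction f. These are common textbook forms, given only as sketches; the exact conventions (margins, scaling, the quantile parameter) should be taken from the description file.

```python
import numpy as np

def square_loss(f, y):
    # Squared error between prediction and label.
    return 0.5 * (f - y) ** 2

def quantile_loss(f, y, tau=0.5):
    # Pinball loss on the residual; tau is the target quantile.
    r = y - f
    return np.maximum(tau * r, (tau - 1.0) * r)

def hinge_loss(f, y):
    # Zero once the margin y*f exceeds 1, linear below that.
    return np.maximum(0.0, 1.0 - y * f)

def logistic_loss(f, y):
    # Smooth surrogate; log1p keeps it numerically stable.
    return np.log1p(np.exp(-y * f))
```

Plotting all four against the margin y*f is a quick way to build the intuition the first problem asks for.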

 
 
/srv/www/cilvr/htdocs/data/pages/courses/bigdata/assignments/start.txt · Last modified: 2013/05/13 15:31 by xz558