A09: Recipe classification

This assignment helps you practice classification of textual documents. We will be using data from the What’s Cooking? Kaggle competition.

Task

Predict the kind of cuisine based on the ingredients. Find the best data transformations and classification technique, and report your procedure and accuracy.
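As one illustration of a data transformation, the sketch below builds a bag-of-words count vector from a space-separated ingredient string. It is written in Python 3, and the function names `build_vocabulary` and `to_count_vector` are hypothetical. The assignment does not prescribe this representation; it is only one common starting point for text classification.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct token in the corpus to a column index."""
    tokens = sorted({tok for doc in documents for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}


def to_count_vector(document, vocabulary):
    """Turn one document into a list of token counts, one per column."""
    counts = Counter(document.split())
    vector = [0] * len(vocabulary)
    for tok, n in counts.items():
        if tok in vocabulary:  # ignore tokens not seen when building the vocabulary
            vector[vocabulary[tok]] = n
    return vector


# Hypothetical two-recipe corpus; each ingredient is already one token.
docs = ["olive_oil feta_cheese garlic", "soy_sauce garlic ginger"]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(d, vocab) for d in docs]
print(vocab)
print(vectors)
```

Variations worth comparing include binary presence instead of counts, TF-IDF weighting, and splitting multi-word ingredients into separate words.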

Data

Create a free Kaggle account and download the train.json file. Use the Python script below to transform this file into CSV format. You may wish to modify this script to clean up the recipe ingredients (remove bogus characters, etc.).

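# Python 2 script: converts train.json into a two-column CSV on stdout.
import sys
import re
import json
import io

# Python 2 only: make UTF-8 the default encoding so that non-ASCII
# ingredient names survive when stdout is redirected to a file.
reload(sys)
sys.setdefaultencoding('utf-8')

with io.open(sys.argv[1], encoding='utf8') as jsonin:
    jrecipes = json.load(jsonin)
    print "ingredients,cuisine"
    for recipe in jrecipes:
        # Replace whitespace inside each ingredient with '_' so that every
        # ingredient is a single token, then join the tokens with spaces.
        ingredients = ' '.join(re.sub(r'\s', '_', x)
                               for x in recipe['ingredients'])
        print '"%s",%s' % (ingredients, recipe['cuisine'])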

Run the script as follows, assuming the script is saved as json2csv.py and that python refers to Python 2 (the script uses Python 2 print statements):

python json2csv.py train.json > train.csv

or if on delenn:

PYTHONIOENCODING=utf8 python2 json2csv.py train.json > train.csv
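As a starting point for the suggested ingredient cleanup, here is a minimal Python 3 sketch. The function names `clean_ingredient` and `recipes_to_csv`, the sample recipes, and the choice of which characters count as bogus are all assumptions made for illustration; adjust them to whatever you find in the data. It uses the standard csv module so that any quoting is handled automatically.

```python
import csv
import io
import re


def clean_ingredient(name):
    """Normalize one ingredient string into a single lowercase token."""
    name = name.lower()
    # Drop anything that is not a letter, digit, underscore, whitespace,
    # or hyphen (for example percent signs or trademark symbols).
    name = re.sub(r"[^\w\s-]", "", name)
    # Collapse runs of whitespace into one underscore so the ingredient
    # stays a single token, as in the original script.
    return re.sub(r"\s+", "_", name.strip())


def recipes_to_csv(recipes, out):
    """Write an 'ingredients,cuisine' CSV for a list of recipe dicts."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["ingredients", "cuisine"])
    for recipe in recipes:
        tokens = [clean_ingredient(x) for x in recipe["ingredients"]]
        tokens = [t for t in tokens if t]  # skip ingredients left empty
        writer.writerow([" ".join(tokens), recipe["cuisine"]])


# Hypothetical sample in the same shape as the entries of train.json.
sample = [
    {"cuisine": "greek", "ingredients": ["Feta Cheese crumbles", "romaine lettuce"]},
    {"cuisine": "italian", "ingredients": ["1% low-fat milk", "olive oil"]},
]
buf = io.StringIO()
recipes_to_csv(sample, buf)
print(buf.getvalue())
```

To process the real data, load the list with json.load from train.json and pass an open output file instead of the in-memory buffer.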

delenn for training

Feel free to use delenn to train. Refer to the Weka notes for details.

Deliverables

Submit each of the following:

Grading rubric

Grading of this assignment will be somewhat subjective. I’ll be looking for evidence of poor methodology.

To earn full credit, data must be processed correctly and the classifiers must be evaluated correctly. The best classifier must have a weighted average F-measure > 0.710 on 10-fold cross-validation with the training set. The methodology must be sophisticated, correct, and well-documented.
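To make the target metric concrete, the Python 3 sketch below computes a weighted average F-measure from lists of actual and predicted labels. It assumes the weighted average means per-class F-measures weighted by the number of true instances of each class, which is my understanding of the summary figure Weka reports; verify this against your own Weka output. The function name `weighted_f_measure` and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_f_measure(actual, predicted):
    """Average of per-class F-measures, weighted by each class's true count."""
    support = Counter(actual)
    total = 0.0
    for label, count in support.items():
        pairs = list(zip(actual, predicted))
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = count - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f_measure = 2 * precision * recall / (precision + recall)
        else:
            f_measure = 0.0
        total += f_measure * count
    return total / len(actual)


# Toy example with three cuisines.
actual = ["greek", "greek", "italian", "italian", "italian", "thai"]
predicted = ["greek", "italian", "italian", "italian", "thai", "thai"]
print(round(weighted_f_measure(actual, predicted), 3))
```

In 10-fold cross-validation, the predictions from all ten held-out folds would be pooled (or the per-fold scores averaged) before comparing against the 0.710 threshold.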

CSCI 431 material by Joshua Eckroth is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Source code for this website available at GitHub.