A09: Recipe classification
This assignment helps you practice classification of textual documents. We will be using data from the What’s Cooking? Kaggle competition.
Predict the kind of cuisine based on the ingredients. Find the best data transformations and classification technique, and report your procedure and accuracy.
Create a free Kaggle account and download the train.json file. Use the Python script below to transform this file into CSV format. You may wish to modify this script to cleanup the recipe ingredients (remove bogus characters, etc.).
import sys import re import json import io reload(sys) sys.setdefaultencoding('utf-8') from pprint import pprint with io.open(sys.argv, encoding='utf8') as jsonin: jrecipes = json.load(jsonin) print "ingredients,cuisine" for recipe in jrecipes: print '\"%s\",%s' % (' '.join(map(lambda x: re.sub(r'\s', '_', x), recipe['ingredients'])), recipe['cuisine'])
Run the script as follows, assuming the script is saved as
python json2csv.py train.json > train.csv
or if on delenn:
PYTHONIOENCODING=utf8 python2 json2csv.py train.json > train.csv
delenn for training
Feel free to use delenn to train. Refer to the Weka notes for details.
Submit each of the following:
- Any code you write to transform the data.
- Documentation of the Weka operations you performed, in the order performed.
- A report of the accuracy of the ZeroR and OneR classifiers with 10-fold cross validation (include pct correct, prec/recall/f-measure, confusion matrix).
- A report of the accuracy of your best classifier with 10-fold cross validation (include pct correct, prec/recall/f-measure, confusion matrix).
Grading of this assignment will be somewhat subjective. I’ll be looking for evidence of poor methodology.
To earn full credit, data must be processed correctly and the classifiers must be evaluated correctly. The best classifier must have a weighted average F-measure > 0.710 on 10-fold cross-validation with the training set. The methodology must be sophisticated, correct, and well-documented.