A07: Detecting Spam Comments Part Deux

Using this dataset, build a neural network classifier that detects spam YouTube comments with at least 95% accuracy. Use all the data files, which total 1956 rows. Build the neural network with Keras.

You are also required to demonstrate a few extra features:

  • Use 5-fold cross-validation (each fold is an 80%/20% train/test split): train and test the model on each split, then print the average accuracy across the folds. Consider using StratifiedKFold from scikit-learn to generate the splits.
  • Create two Python files: one that trains a model on all the data and saves it, and a second that loads the saved model, interactively asks the user for a new comment, and tells the user whether it is spam.
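
The cross-validation requirement above might be organized roughly as follows. This is only a sketch under stated assumptions, not a reference solution: it assumes the comments have already been turned into a NumPy feature matrix X (for example, bag-of-words counts) with 0/1 labels y, and the names cross_validate, fit_and_score, and keras_fit_and_score, as well as the layer size and epoch count, are illustrative choices that do not come from the assignment.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold


def cross_validate(X, y, fit_and_score, n_splits=5, seed=0):
    """Return the per-fold accuracies from stratified k-fold cross-validation.

    X and y are assumed to be NumPy arrays. fit_and_score is a
    caller-supplied (hypothetical) function that trains a fresh model on
    the training portion and returns its accuracy on the test portion.
    """
    skf = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in skf.split(X, y):
        score = fit_and_score(X[train_idx], y[train_idx],
                              X[test_idx], y[test_idx])
        scores.append(score)
    return scores


def keras_fit_and_score(X_train, y_train, X_test, y_test):
    """Assumed example model: one hidden layer over numeric features."""
    from tensorflow import keras  # imported lazily; only needed when training

    model = keras.Sequential([
        keras.Input(shape=(X_train.shape[1],)),
        keras.layers.Dense(64, activation="relu"),    # size is a guess
        keras.layers.Dense(1, activation="sigmoid"),  # spam probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=10, batch_size=32, verbose=0)
    _, accuracy = model.evaluate(X_test, y_test, verbose=0)
    return accuracy
```

With features and labels in hand, `np.mean(cross_validate(X, y, keras_fit_and_score))` would give the average accuracy to print. A new model is built inside each call so that no fold benefits from weights learned on another fold's test data.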

Deliverables

Produce two Python files: train.py (which performs cross-validation, then trains a model on all the data and saves it), and run.py (which loads the saved model and asks for user input).
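
One possible shape for the save/load hand-off between the two files is sketched below. It assumes the text is converted to features by a fitted scikit-learn-style vectorizer (anything with a transform method whose result supports toarray), which has to be saved alongside the model so run.py encodes new comments the same way. The file names, function names, and the 0.5 threshold are hypothetical, and the ".keras" format assumes a reasonably recent Keras; older versions may need a different extension such as ".h5".

```python
import pickle

MODEL_PATH = "spam_model.keras"    # hypothetical file names
VECTORIZER_PATH = "vectorizer.pkl"


def save_artifacts(model, vectorizer):
    """train.py side: persist the Keras model and the fitted vectorizer."""
    model.save(MODEL_PATH)
    with open(VECTORIZER_PATH, "wb") as f:
        pickle.dump(vectorizer, f)


def load_artifacts():
    """run.py side: reload what train.py saved."""
    from tensorflow import keras  # imported lazily; only needed at load time

    model = keras.models.load_model(MODEL_PATH)
    with open(VECTORIZER_PATH, "rb") as f:
        vectorizer = pickle.load(f)
    return model, vectorizer


def is_spam(comment, model, vectorizer, threshold=0.5):
    """Return True if the predicted spam probability is at least threshold."""
    features = vectorizer.transform([comment]).toarray()
    probability = float(model.predict(features, verbose=0)[0][0])
    return probability >= threshold


def main():
    """Ask for one comment and report the model's verdict."""
    model, vectorizer = load_artifacts()
    comment = input("Enter a comment: ")
    print("Spam" if is_spam(comment, model, vectorizer) else "Not spam")
```

In this layout train.py would call save_artifacts after fitting on all the data, and run.py would call main(). Keeping is_spam separate from the input prompt makes the prediction step easy to check without a terminal.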

Grading rubric

  • 5 pts: All deliverables present.
  • 4 pts: All deliverables present except the model saving and reloading and interactive user input, or 95% accuracy was not achieved.
  • 3 pts: All deliverables present except the model saving and reloading and cross-validation.
  • 2 pts: Inappropriate dataset processing, but most code is correct.
  • 0-1 pts: Inappropriate dataset processing and most code is incorrect.

CSCI 431 material by Joshua Eckroth is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Source code for this website is available on GitHub.