
A05: Can you predict which developers prefer to use [insert your favorite IDE here]?

Using logistic regression and the StackOverflow 2018 developer survey, predict which developers prefer to use your favorite IDE, e.g., Vim. Use the dataset on delenn found in /home/jeckroth/csci431-public/stackoverflow_survey_2018. Also be sure to meet the following criteria:

  • Only load the columns you need. Use pandas and load them as categorical values (they're strings, but the model needs numbers).
  • Use “one-hot encoding” for the input columns (and use 1/0 for the IDE column if it has your IDE of choice). The pandas pd.get_dummies function is good here: https://stackoverflow.com/a/37269683
  • If you have a column with multiple values (joined by ‘;’), split that column making one-hot encoding: https://stackoverflow.com/a/50069709
  • Shuffle the dataset, then split it into 80% training and 20% testing.
  • Print the % of developers who use your preferred IDE in the training and testing datasets (or don’t use it, whichever number is higher). Your model must achieve better accuracy than these %’s (i.e., your model must be better than a random guess).
  • Use a single logistic regression function, predicting one output (yes/no your favorite IDE is used by that developer).
  • Report statistics for each epoch: loss, accuracy on training data, accuracy on test data.
  • Use TensorFlow for the model, like we did in class.
  • Do not write more code than necessary; i.e., do not attempt to blindly adapt existing code from the internet. Write code specific to this assignment.
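The encoding, splitting, and shuffling steps above might look roughly like this on a toy frame (the column names and values here are made up stand-ins; check the real survey's schema):

```python
import pandas as pd

# Toy stand-in for the survey data; real column names will differ.
df = pd.DataFrame({
    'Hobby': ['Yes', 'No', 'Yes', 'No'],
    'IDE': ['Vim;Emacs', 'Visual Studio', 'Vim', 'Eclipse;Notepad++'],
})

# One-hot encode a single-valued categorical input column.
df = pd.get_dummies(df, columns=['Hobby'], prefix=['hobby'])

# Collapse the multi-valued IDE column into a 1/0 target for one IDE.
# isinstance guards against NaN (blank survey answers).
df['uses_vim'] = df['IDE'].apply(
    lambda s: 1 if isinstance(s, str) and 'Vim' in s else 0)
df = df.drop(columns=['IDE'])

# A multi-valued *input* column can instead be expanded into one
# 1/0 column per value with str.get_dummies:
langs = pd.DataFrame({'Lang': ['Python;C', 'C', 'Python;Rust']})
onehot = langs['Lang'].str.get_dummies(sep=';')

# Shuffle, then split 80% / 20%.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
split = int(0.8 * len(df))
train, test = df.iloc[:split], df.iloc[split:]
```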

For example, I was able to achieve 77% accuracy on predicting whether a person uses Vim, while only 28% do use Vim and 72% do not (so always guessing "no" would already give 72% accuracy).
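The majority-class baseline described above can be computed directly from the 0/1 target column; a sketch (the labels here are an arbitrary stand-in for your real target column):

```python
import pandas as pd

y = pd.Series([1, 0, 0, 1, 0, 0, 0])  # stand-in 0/1 labels

frac_positive = y.mean()                       # fraction who use the IDE
baseline = max(frac_positive, 1 - frac_positive)  # always-guess-majority accuracy
print("baseline accuracy: %.2f" % baseline)
```

Compute this separately on the training and testing sets; your model's test accuracy must beat the test-set baseline.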

Note, most of the effort in this assignment is spent on data processing, not machine learning. Give yourself plenty of time to figure out how to do this. Use StackOverflow a lot, particularly searching on pandas and numpy.

Grading

  • 4/5 points: all requirements complete, learning occurs, but model does not predict any better than chance

Example code from class

import tensorflow as tf
import random

num_examples = 100
xs = []
ys = []
random.seed(123)
for i in range(num_examples):
    x = random.uniform(-4, 17)
    xs.append([x])
    if x < 12:
        ys.append([1])
    else:
        ys.append([0])

x = tf.placeholder(tf.float32, [None, 1])
y = tf.placeholder(tf.float32, [None, 1])

W = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))

pred = tf.nn.sigmoid(tf.matmul(x, W) + b)

cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred+1e-07)+(1-y)*tf.log(1-pred+1e-07), reduction_indices=1))

num_epochs = 200
print("learningrate,epoch,loss,accuracy,W,b")
for learning_rate in [0.001, 0.01, 0.1, 0.3]:
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)

        for epoch in range(num_epochs):
            total_loss = 0
            for example_x, example_y in zip(xs, ys):
                _, loss = sess.run([optimizer, cost], feed_dict={x: [example_x], y: [example_y]})
                total_loss += loss
            correct_prediction = tf.equal(tf.round(pred), y)
            accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)).eval({x: xs, y: ys})
            Wtmp = sess.run(W)
            btmp = sess.run(b)
            print("%.4f,%d,%.4f,%.4f,%.4f,%.4f" % (learning_rate, epoch, total_loss, accuracy, Wtmp[0][0], btmp[0]))

Data demo from class

import pandas as pd
import numpy as np
import tensorflow as tf

df = pd.read_csv('/home/jeckroth/csci431-public/stackoverflow_survey_2018/survey_results_public.csv',
                 usecols=['Hobby', 'Employment', 'IDE'], dtype='category')

df = pd.get_dummies(df, prefix=['hobby_', 'emp_'], columns=['Hobby', 'Employment'])
# Some respondents leave IDE blank (NaN), so guard against non-strings.
df['IDE'] = df['IDE'].apply(lambda s: 1 if isinstance(s, str) and 'Vim' in s else 0)

print(df.columns)
print(df.head())
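To feed a frame like this into the TensorFlow model above, you need a float feature matrix and an (n, 1) target array. A sketch, using a made-up already-encoded frame in place of the real survey data:

```python
import numpy as np
import pandas as pd

# Stand-in for the processed frame: 1/0 target 'IDE' plus one-hot inputs.
df = pd.DataFrame({'IDE': [1, 0, 1],
                   'hobby__Yes': [1, 0, 1],
                   'hobby__No': [0, 1, 0]})

ys = df[['IDE']].values.astype(np.float32)               # shape (n, 1) targets
xs = df.drop(columns=['IDE']).values.astype(np.float32)  # shape (n, k) features
```

The placeholder shapes in the class example would then change from `[None, 1]` to `[None, xs.shape[1]]` for `x`, and `W` to shape `[xs.shape[1], 1]`.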

CSCI 431 material by Joshua Eckroth is licensed under a Creative Commons Attribution-ShareAlike 3.0 Unported License. Source code for this website available at GitHub.