# A05: Can you predict which developers prefer to use [insert your favorite IDE here]?

Using logistic regression on the StackOverflow 2018 developer survey, predict which developers prefer your favorite IDE (e.g., Vim). Use the dataset on delenn in /home/jeckroth/csci431-public/stackoverflow_survey_2018. Be sure to meet the following criteria:

• Only load the columns you need. Use pandas and load them as categorical values (since they’re strings but we need numbers).
• Use “one-hot encoding” for the input columns, and a single 1/0 output column for the IDE (1 if the developer’s IDE list includes your IDE of choice, 0 otherwise). The pandas pd.get_dummies function is good here: https://stackoverflow.com/a/37269683
• If a column holds multiple values joined by ‘;’, split it into separate one-hot columns: https://stackoverflow.com/a/50069709
• Shuffle the dataset, then split it into 80% training and 20% testing.
• Print the % of developers in the training and testing datasets who use your preferred IDE (or who don’t, whichever number is higher). Your model must beat these percentages, i.e., it must do better than always guessing the majority class.
• Use a single logistic regression function, predicting one output (yes/no your favorite IDE is used by that developer).
• Report statistics for each epoch: loss, accuracy on training data, accuracy on test data.
• Use TensorFlow for the model, like we did in class.
• Do not write more code than necessary; i.e., do not attempt to blindly adapt existing code from the internet. Write code specific to this assignment.

For example, I was able to achieve 77% accuracy on predicting whether a person uses Vim, while only 28% do use Vim and 72% do not (so random guessing would give me 72% accuracy).

Note, most of the effort in this assignment is spent on data processing, not machine learning. Give yourself plenty of time to figure out how to do this. Use StackOverflow a lot, particularly searching on pandas and numpy.

• 4/5 points: all requirements complete, learning occurs, but model does not predict any better than chance

## Example code from class

```python
import tensorflow as tf
import random

# Toy dataset: label 1 if x < 12, else 0.
num_examples = 100
xs = []
ys = []
random.seed(123)
for i in range(num_examples):
    x = random.uniform(-4, 17)
    xs.append([x])
    if x < 12:
        ys.append([1])
    else:
        ys.append([0])

x = tf.placeholder(tf.float32, [None, 1])
y = tf.placeholder(tf.float32, [None, 1])

W = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))

pred = tf.nn.sigmoid(tf.matmul(x, W) + b)

# Cross-entropy loss; the 1e-07 guards against log(0).
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred+1e-07) + (1-y)*tf.log(1-pred+1e-07),
                                     reduction_indices=1))

num_epochs = 200
print("learningrate,epoch,loss,accuracy,W,b")
for learning_rate in [0.001, 0.01, 0.1, 0.3]:
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)

        for epoch in range(num_epochs):
            total_loss = 0
            for example_x, example_y in zip(xs, ys):
                _, loss = sess.run([optimizer, cost],
                                   feed_dict={x: [example_x], y: [example_y]})
                total_loss += loss
            correct_prediction = tf.equal(tf.round(pred), y)
            accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)).eval({x: xs, y: ys})
            Wtmp = sess.run(W)
            btmp = sess.run(b)
            print("%.4f,%d,%.4f,%.4f,%.4f,%.4f" %
                  (learning_rate, epoch, total_loss, accuracy, Wtmp[0][0], btmp[0]))
```


## Data demo from class

```python
import pandas as pd
import numpy as np
import tensorflow as tf
```