# A05: Can you predict which developers prefer to use [insert your favorite IDE here]?

Using logistic regression and the StackOverflow 2018 developer survey, predict which developers prefer to use your favorite IDE, e.g., Vim. Use the dataset on delenn found in `/home/jeckroth/csci431-public/stackoverflow_survey_2018`. Also be sure to meet the following criteria:

- Only load the columns you need. Use `pandas` and load them as categorical values (since they're strings but we need numbers).
- Use "one-hot encoding" for the input columns (and use 1/0 for the IDE column depending on whether it includes your IDE of choice). The pandas `pd.get_dummies` function is good here: https://stackoverflow.com/a/37269683
- If you have a column with multiple values (joined by ';'), split that column into one-hot encoded columns: https://stackoverflow.com/a/50069709
- Shuffle the dataset, then split it into 80% training and 20% testing.
- Print the % of developers who use your preferred IDE in the training and testing datasets (or don’t use it, whichever number is higher). Your model must achieve better accuracy than these %’s (i.e., your model must be better than a random guess).
- Use a single logistic regression function, predicting one output (yes/no your favorite IDE is used by that developer).
- Report statistics for each epoch: loss, accuracy on training data, accuracy on test data.
- Use TensorFlow for the model, like we did in class.
- Do not write more code than necessary; i.e., do not attempt to blindly adapt existing code from the internet. Write code specific to this assignment.
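The multi-value column requirement above is the trickiest part. Here is a minimal sketch of the idea using a hypothetical three-row frame standing in for the survey's `IDE` column (the real column has many more values); pandas' `str.get_dummies` splits on the separator and one-hot encodes each distinct value:

```python
import pandas as pd

# Toy stand-in for the survey's IDE column, where one cell can list
# several IDEs joined by ';'.
df = pd.DataFrame({'IDE': ['Vim;Emacs', 'Visual Studio Code', 'Vim']})

# str.get_dummies splits each cell on ';' and produces one 0/1 column
# per distinct value across the whole Series.
onehot = df['IDE'].str.get_dummies(sep=';')
print(onehot)
```

From the resulting frame you can take the column for your favorite IDE as the 0/1 target and keep the rest (plus the other encoded survey columns) as inputs.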

For example, I was able to achieve 77% accuracy on predicting whether a person uses Vim, while only 28% do use Vim and 72% do not (so random guessing would give me 72% accuracy).
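That baseline is just the majority-class fraction. A quick sketch, using made-up labels with the same 28/72 split as the Vim example above:

```python
import numpy as np

# Hypothetical labels: 1 = uses the IDE, 0 = does not (28% vs. 72%).
ys = np.array([1] * 28 + [0] * 72)

# Always guessing the more common label yields this accuracy; your
# trained model must beat it.
majority_frac = max(ys.mean(), 1 - ys.mean())
print("baseline accuracy: %.2f" % majority_frac)
```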

Note: most of the effort in this assignment goes into data processing, not machine learning. Give yourself plenty of time to figure out how to do this. Use StackOverflow a lot, particularly for pandas and numpy questions.

## Grading

- 4/5 points: all requirements complete, learning occurs, but model does not predict any better than chance

## Example code from class

```
import tensorflow as tf
import random

# generate a toy 1-D dataset: label is 1 when x < 12, else 0
num_examples = 100
xs = []
ys = []
random.seed(123)
for i in range(num_examples):
    x = random.uniform(-4, 17)
    xs.append([x])
    if x < 12:
        ys.append([1])
    else:
        ys.append([0])

x = tf.placeholder(tf.float32, [None, 1])
y = tf.placeholder(tf.float32, [None, 1])
W = tf.Variable(tf.zeros([1, 1]))
b = tf.Variable(tf.zeros([1]))
pred = tf.nn.sigmoid(tf.matmul(x, W) + b)
# cross-entropy loss; the 1e-07 keeps log() away from zero
cost = tf.reduce_mean(-tf.reduce_sum(y*tf.log(pred+1e-07) + (1-y)*tf.log(1-pred+1e-07),
                                     reduction_indices=1))

num_epochs = 200
print("learningrate,epoch,loss,accuracy,W,b")
for learning_rate in [0.001, 0.01, 0.1, 0.3]:
    optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(cost)
    init = tf.global_variables_initializer()
    with tf.Session() as sess:
        sess.run(init)
        for epoch in range(num_epochs):
            total_loss = 0
            for example_x, example_y in zip(xs, ys):
                _, loss = sess.run([optimizer, cost],
                                   feed_dict={x: [example_x], y: [example_y]})
                total_loss += loss
            correct_prediction = tf.equal(tf.round(pred), y)
            accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32)).eval({x: xs, y: ys})
            Wtmp = sess.run(W)
            btmp = sess.run(b)
            print("%.4f,%d,%.4f,%.4f,%.4f,%.4f" % (learning_rate, epoch, total_loss,
                                                   accuracy, Wtmp[0][0], btmp[0]))
```

## Data demo from class

```
import pandas as pd
import numpy as np
import tensorflow as tf

df = pd.read_csv('/home/jeckroth/csci431-public/stackoverflow_survey_2018/survey_results_public.csv',
                 usecols=['Hobby', 'Employment', 'IDE'], dtype='category')
df = pd.get_dummies(df, prefix=['hobby_', 'emp_'], columns=['Hobby', 'Employment'])
# some respondents left IDE blank; drop those rows so the lambda below
# never sees NaN (which would raise a TypeError on the 'in' test)
df = df.dropna(subset=['IDE'])
df['IDE'] = df['IDE'].apply(lambda s: 1 if 'Vim' in s else 0)
print(df.columns)
print(df.head())
```
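The demo stops before the shuffle-and-split requirement. A minimal sketch of that step, on a hypothetical ten-row frame standing in for the encoded survey data:

```python
import pandas as pd

# Toy stand-in for the one-hot-encoded survey frame.
df = pd.DataFrame({'feat': range(10), 'IDE': [1, 0] * 5})

# sample(frac=1) shuffles the rows; then slice 80% / 20%.
df = df.sample(frac=1, random_state=42).reset_index(drop=True)
split = int(0.8 * len(df))
train, test = df.iloc[:split], df.iloc[split:]
print(len(train), len(test))
```

The `random_state` value is arbitrary; fixing it just makes the split reproducible between runs.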