In the glut of Python data analysis tools, I’m sometimes embarrassed by my lack of comfort with Python for analysis. Static types, Java/Scaladoc, and slick IDEs working in concert with compilers provide guides that I haven’t been able to replace in Python. Dynamic typing also seems to exacerbate library-interoperability problems. With Anaconda and Jupyter, though, I can share some quick notes on getting started.
Here are some notes on surveying admittedly canned data to classify tumors as malignant or benign. The Web is littered with examples of using sklearn to classify iris species from feature measurements, so I thought I would share some notes exploring one of the other datasets included with scikit-learn, the Breast Cancer Wisconsin (Diagnostic) Data Set. I’ve also decided to use Python 3 to take advantage of comprehensions and because that’s what the Python community uses where I work.
The notebook below illustrates how to load the demo data (loading a CSV is simple, too), convert the scikit-learn matrix to a DataFrame if you want to use Pandas for analysis, and apply linear and logistic regression to classify tumors as malignant or benign.
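Since the notebook uses scikit-learn’s bundled data, here is a minimal sketch of the CSV route with Pandas for comparison; tumors.csv and its diagnosis column are hypothetical stand-ins for your own file and label:

# Hypothetical CSV with one row per tumor and a 'diagnosis' label column.
df = pd.read_csv('tumors.csv')
X = df.drop(columns=['diagnosis']).values  # feature matrix
y = df['diagnosis'].values                 # class labels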
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import pylab as pl
import pandas as pd
from sklearn import datasets
# Load the demo data and view the NumPy matrix as a Pandas DataFrame.
bc = datasets.load_breast_cancer()
pbc = pd.DataFrame(data=bc.data, columns=bc.feature_names)
pbc.describe()
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
# Plot training-set size versus classifier accuracy.
def make_test_train(train_count):
    # Train on the first train_count samples; test on the second half of the data.
    n = bc.target.size
    trainX = bc.data[0:train_count, :]
    trainY = bc.target[0:train_count]
    testX = bc.data[n//2:n, :]
    testY = bc.target[n//2:n]
    return trainX, trainY, testX, testY
def eval_lin(trainX, trainY, testX, testY):
    # Fit ordinary least squares, threshold predictions at 0.5 to get class labels,
    # and return accuracy with its standard error.
    regr = LinearRegression()
    regr.fit(trainX, trainY)
    y = regr.predict(testX)
    correct = (y > 0.5).astype(int) == testY
    return np.mean(correct), np.std(correct) / sqrt(correct.size)
def eval_log(trainX, trainY, testX, testY):
    # Fit logistic regression and return accuracy with its standard error.
    regr = LogisticRegression()
    regr.fit(trainX, trainY)
    correct = regr.predict(testX) == testY
    return np.mean(correct), np.std(correct) / sqrt(correct.size)
def lin_log_cmp(n):
    # Compare linear and logistic accuracy for a training set of size n (min 20).
    trainX, trainY, testX, testY = make_test_train(n)
    lin_acc, lin_stderr = eval_lin(trainX, trainY, testX, testY)
    log_acc, log_stderr = eval_log(trainX, trainY, testX, testY)
    return lin_acc, log_acc
xs = range(20, 280, 20)
lin_log_acc = [lin_log_cmp(x) for x in xs]
pl.figure()
lin_line, = pl.plot(xs, [y[0] for y in lin_log_acc], label='linear')
log_line, = pl.plot(xs, [y[1] for y in lin_log_acc], label='logistic')
pl.legend(handles=[lin_line, log_line])
pl.xlabel('training size (of ' + str(bc.target.size) + ' samples total)')
pl.ylabel('accuracy');
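One caveat on the split above: taking the first n rows for training and the second half for testing is only reasonable if the rows aren’t ordered by class. With a recent scikit-learn, a shuffled split is a one-liner; here’s a sketch (the 260-sample training size mirrors the largest point plotted above):

from sklearn.model_selection import train_test_split

# Shuffle the data, then hold out everything beyond the training set for testing.
trainX, testX, trainY, testY = train_test_split(
    bc.data, bc.target, train_size=260, random_state=0)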
