In the glut of Python data analysis tools, I’m sometimes embarrassed by my lack of comfort with Python for analysis. Static types, Java/Scaladoc, and slick IDEs working in concert with compilers provide guides that I haven’t been able to replace in Python. Dynamic typing also seems to exacerbate library-interoperability problems. With Anaconda and Jupyter, though, I can share some quick notes on getting started.
Here are some notes on surveying admittedly canned data to classify tumors as malignant or benign. The Web is littered with examples of using sklearn to classify iris species from feature measurements, so I thought I would share some notes exploring one of the other datasets included with scikit-learn, the Breast Cancer Wisconsin (Diagnostic) Data Set. I’ve also decided to use Python 3 to take advantage of comprehensions and because that’s what the Python community uses where I work.
The notebook below illustrates how to load the demo data (loading a CSV is simple, too), convert the scikit-learn matrix to a DataFrame if you want to use Pandas for analysis, and apply linear and logistic regression to classify tumors as malignant or benign.
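Since the notebook uses scikit-learn’s bundled data, here is a minimal sketch of the CSV route with Pandas for comparison; tumors.csv and its diagnosis column are hypothetical stand-ins for your own file and label:

# Hypothetical CSV with one row per tumor and a 'diagnosis' label column.
df = pd.read_csv('tumors.csv')
X = df.drop(columns=['diagnosis']).values  # feature matrix
y = df['diagnosis'].values                 # class labels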
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
import pylab as pl
import pandas as pd
from sklearn import datasets
# Load the demo data and view the NumPy matrix as a Pandas DataFrame.
bc = datasets.load_breast_cancer()
pbc = pd.DataFrame(data=bc.data, columns=bc.feature_names)
pbc.describe()
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import LogisticRegression
# Plot training-set size versus classifier accuracy.
def make_test_train(train_count):
    # Train on the first train_count samples; test on the second half of the data.
    n = bc.target.size
    trainX = bc.data[0:train_count, :]
    trainY = bc.target[0:train_count]
    testX = bc.data[n//2:n, :]
    testY = bc.target[n//2:n]
    return trainX, trainY, testX, testY
def eval_lin(trainX, trainY, testX, testY):
    # Fit ordinary least squares, threshold predictions at 0.5 to get class labels,
    # and return accuracy with its standard error.
    regr = LinearRegression()
    regr.fit(trainX, trainY)
    y = regr.predict(testX)
    correct = (y > 0.5).astype(int) == testY
    return np.mean(correct), np.std(correct) / sqrt(correct.size)
def eval_log(trainX, trainY, testX, testY):
    # Fit logistic regression and return accuracy with its standard error.
    regr = LogisticRegression()
    regr.fit(trainX, trainY)
    correct = regr.predict(testX) == testY
    return np.mean(correct), np.std(correct) / sqrt(correct.size)
def lin_log_cmp(n):
    # Compare linear and logistic accuracy for a training set of size n (min 20).
    trainX, trainY, testX, testY = make_test_train(n)
    lin_acc, lin_stderr = eval_lin(trainX, trainY, testX, testY)
    log_acc, log_stderr = eval_log(trainX, trainY, testX, testY)
    return lin_acc, log_acc
xs = range(20, 280, 20)
lin_log_acc = [lin_log_cmp(x) for x in xs]
pl.figure()
lin_line, = pl.plot(xs, [y[0] for y in lin_log_acc], label='linear')
log_line, = pl.plot(xs, [y[1] for y in lin_log_acc], label='logistic')
pl.legend(handles=[lin_line, log_line])
pl.xlabel('training size (of ' + str(bc.target.size) + ' samples total)')
pl.ylabel('accuracy');
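One caveat on the split above: taking the first n rows for training and the second half for testing is only reasonable if the rows aren’t ordered by class. With a recent scikit-learn, a shuffled split is a one-liner; here’s a sketch (the 260-sample training size mirrors the largest point plotted above):

from sklearn.model_selection import train_test_split

# Shuffle the data, then hold out everything beyond the training set for testing.
trainX, testX, trainY, testY = train_test_split(
    bc.data, bc.target, train_size=260, random_state=0)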
