Category: programming

  • Exploring Python with Data

    Amid the glut of Python data analysis tools, I’m sometimes embarrassed by my lack of comfort with Python for analysis. Static types, Java/Scaladoc, and slick IDEs working in concert with compilers provide guides that I haven’t been able to replace in Python, and dynamic typing seems to exacerbate library interoperability problems. With Anaconda and Jupyter, though, I can share some quick notes on getting started.

    Here are some notes on surveying admittedly canned data to classify tumors as malignant or benign. The Web is littered with examples of using sklearn to classify iris species from feature dimensions, so I thought I would share some notes exploring one of the other datasets included with scikit-learn: the Breast Cancer Wisconsin (Diagnostic) Data Set. I’ve also decided to use Python 3, both to take advantage of comprehensions and because that’s what the Python community uses where I work.

    The notebook below illustrates how to load the demo data (loading a csv is simple, too; see the sketch just below), how to convert the scikit-learn matrix to a DataFrame if you want to use Pandas for analysis, and how to apply linear and logistic regression to classify tumors as malignant or benign.
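    As an aside on the csv route, a minimal sketch (the file name and its contents are hypothetical; any csv with a header row loads the same way):

    import pandas as pd

    # hypothetical file; pandas infers column names from the header row
    df = pd.read_csv('tumors.csv')
    df.describe()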

    In [7]:
    %matplotlib inline
    import numpy as np
    import pandas as pd
    import pylab as pl
    from sklearn import datasets
    
    # load the bundled demo data and wrap the numpy feature matrix in a Pandas DataFrame
    bc = datasets.load_breast_cancer()
    pbc = pd.DataFrame(data=bc.data, columns=bc.feature_names)
    pbc.describe()
    
    Out[7]:
                               count        mean         std         min         25%         50%         75%          max
    mean radius                  569   14.127292    3.524049    6.981000   11.700000   13.370000   15.780000    28.110000
    mean texture                 569   19.289649    4.301036    9.710000   16.170000   18.840000   21.800000    39.280000
    mean perimeter               569   91.969033   24.298981   43.790000   75.170000   86.240000  104.100000   188.500000
    mean area                    569  654.889104  351.914129  143.500000  420.300000  551.100000  782.700000  2501.000000
    mean smoothness              569    0.096360    0.014064    0.052630    0.086370    0.095870    0.105300     0.163400
    mean compactness             569    0.104341    0.052813    0.019380    0.064920    0.092630    0.130400     0.345400
    mean concavity               569    0.088799    0.079720    0.000000    0.029560    0.061540    0.130700     0.426800
    mean concave points          569    0.048919    0.038803    0.000000    0.020310    0.033500    0.074000     0.201200
    mean symmetry                569    0.181162    0.027414    0.106000    0.161900    0.179200    0.195700     0.304000
    mean fractal dimension       569    0.062798    0.007060    0.049960    0.057700    0.061540    0.066120     0.097440
    worst radius                 569   16.269190    4.833242    7.930000   13.010000   14.970000   18.790000    36.040000
    worst texture                569   25.677223    6.146258   12.020000   21.080000   25.410000   29.720000    49.540000
    worst perimeter              569  107.261213   33.602542   50.410000   84.110000   97.660000  125.400000   251.200000
    worst area                   569  880.583128  569.356993  185.200000  515.300000  686.500000 1084.000000  4254.000000
    worst smoothness             569    0.132369    0.022832    0.071170    0.116600    0.131300    0.146000     0.222600
    worst compactness            569    0.254265    0.157336    0.027290    0.147200    0.211900    0.339100     1.058000
    worst concavity              569    0.272188    0.208624    0.000000    0.114500    0.226700    0.382900     1.252000
    worst concave points         569    0.114606    0.065732    0.000000    0.064930    0.099930    0.161400     0.291000
    worst symmetry               569    0.290076    0.061867    0.156500    0.250400    0.282200    0.317900     0.663800
    worst fractal dimension      569    0.083946    0.018061    0.055040    0.071460    0.080040    0.092080     0.207500

    8 rows × 30 columns (shown transposed; the notebook display elides the ten “… error” feature columns)

    In [8]:
    from math import sqrt
    from sklearn.linear_model import LinearRegression
    from sklearn.linear_model import LogisticRegression
    
    # Plot training-set size versus classifier accuracy.
    # Train on the first train_count rows; always test on the second half of the data.
    def make_test_train(train_count):
        n = bc.target.size
        trainX = bc.data[0:train_count, :]
        trainY = bc.target[0:train_count]
        testX = bc.data[n//2:n, :]
        testY = bc.target[n//2:n]
        return trainX, trainY, testX, testY
    
    def eval_lin(trainX, trainY, testX, testY):
        regr = LinearRegression()
        regr.fit(trainX, trainY)
        # threshold the continuous prediction at 0.5 to get a 0/1 label
        correct = (regr.predict(testX) > 0.5) == testY
        return sum(correct) / correct.size, np.std(correct) / sqrt(correct.size)
    
    def eval_log(trainX, trainY, testX, testY):
        regr = LogisticRegression()
        regr.fit(trainX, trainY)
        correct = regr.predict(testX) == testY
        return sum(correct) / testY.size, np.std(correct) / sqrt(correct.size)
    
    def lin_log_cmp(n):
        trainX, trainY, testX, testY = make_test_train(n)  # n should be at least ~20
        lin_acc, lin_stderr = eval_lin(trainX, trainY, testX, testY)
        log_acc, log_stderr = eval_log(trainX, trainY, testX, testY)
        return lin_acc, log_acc
    
    xs = range(20, 280, 20)
    lin_log_acc = [lin_log_cmp(x) for x in xs]
    
    pl.figure()
    lin_line, = pl.plot(xs, [y[0] for y in lin_log_acc], label='linear')
    log_line, = pl.plot(xs, [y[1] for y in lin_log_acc], label='logistic')
    pl.legend(handles=[lin_line, log_line])
    pl.xlabel('training size from ' + str(bc.target.size))
    pl.ylabel('accuracy');
    

    Incidentally, I used IPython’s nbconvert to paste the notebook here.

    Caveats: Without types, it’s pretty easy to make mistakes when manipulating the raw data. Python and numpy scalar, array, and matrix arithmetic operators are lenient about the shapes they accept, so you may get a surprise or two if you’re not careful. That, combined with working through black-box analysis tools, leaves me somewhat skeptical of any conclusions, but it’s a start, and the investment was cheap.
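    Here is a toy example of the kind of shape surprise I mean (not from the notebook; just numpy broadcasting):

    import numpy as np

    a = np.zeros(3)         # shape (3,)
    b = np.zeros((3, 1))    # shape (3, 1)
    # broadcasting quietly yields a 3x3 matrix instead of raising an error
    print((b - a).shape)    # (3, 3)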

    Other Plotting Tools: Seaborn’s pairplot generates slick scatter plots and histograms that help identify outliers, describe ranges, and reveal redundancy among the data dimensions. I tried removing some of the obviously redundant data columns, which produced no quality change in logistic classification and a less-than-statistically-significant reduction in linear classification.
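    A minimal pairplot sketch along those lines (the column selection is just an example):

    import seaborn as sns

    # a few size-related features, colored by diagnosis (0 = malignant, 1 = benign)
    sub = pbc[['mean radius', 'mean perimeter', 'mean area', 'mean concavity']].copy()
    sub['target'] = bc.target
    sns.pairplot(sub, hue='target')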

    Linear or Logistic? It surprised me that logistic regression proved the inferior classifier here, but economists frequently use linear regression to model 0/1 variables. Paul von Hippel has a post comparing the relative advantages of linear and logistic regression. As a student, I had trouble both with applying logistic regression and with conveying my travails to a thesis adviser. I wish I had read more commentary comparing the two twenty years ago.
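    One concrete difference is easy to check with the notebook’s variables (a sketch; the training size of 260 is arbitrary): linear regression’s fitted values are unbounded, while logistic regression’s predicted probabilities are confined to (0, 1).

    # linear fits can stray outside [0, 1]; logistic probabilities cannot
    trainX, trainY, testX, testY = make_test_train(260)

    lin = LinearRegression().fit(trainX, trainY)
    y = lin.predict(testX)
    print(y.min(), y.max())               # may fall below 0 or above 1

    log = LogisticRegression().fit(trainX, trainY)
    p = log.predict_proba(testX)[:, 1]    # probability of class 1
    print(p.min(), p.max())               # always strictly within (0, 1)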