Validation Set Data Munging

Rec'd validation data July 23, 2014.

Data munged July 24, 2014.

The original source file is data/ELISA_Training_Validation_Combined.xslx (md5:13fe02727dee37d3ba8e69103b01d8a1)

This was modified to be convertible to csv in data/ELISA_Validation_PREP.xlsx (md5:6f3a3c7e3eb4c5eff7ac2f826185774e)

Which was then exported to csv as data/ELISA_Validation_PREP.csv (md5:8aca344297b5eea7d9d8ad7b6e0ebed8)

In [2]:
import hashlib
import numpy as np
import pandas.io.parsers
def md5_for_file(f, block_size=2**20):
    """
    See: http://stackoverflow.com/questions/1131220/get-md5-hash-of-big-files-in-python
    """
    md5 = hashlib.md5()
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    return md5.hexdigest()

with open('data/ELISA_Training_Validation_Combined.xlsx', 'rb') as orig:
    print "ELISA_Training_Validation_Combined.xlsx md5(%s)" % md5_for_file( orig )
        
with open('data/ELISA_Validation_PREP.xlsx', 'rb') as mung1:
    print "ELISA_Validation_PREP.xlsx md5(%s)" % md5_for_file( mung1 )
        
with open('data/ELISA_Validation_PREP.csv', 'rb') as mung2:
    print "ELISA_Validation_PREP.csv md5(%s)" % md5_for_file( mung2 )
ELISA_Training_Validation_Combined.xlsx md5(13fe02727dee37d3ba8e69103b01d8a1)
ELISA_Validation_PREP.xlsx md5(6f3a3c7e3eb4c5eff7ac2f826185774e)
ELISA_Validation_PREP.csv md5(8aca344297b5eea7d9d8ad7b6e0ebed8)

Now the data needs to be put into the format the validation script expects.

In [3]:
PREP_val = pandas.io.parsers.read_csv('data/ELISA_Validation_PREP.csv')
PREP_val = PREP_val.set_index('Sample')
PREP_val
Out[3]:
Age Gender VCAM-Replicate-1 VCAM-Replicate-2 TGFBI-Replicate-1 TGFBI-Replicate-2 HMOX1-Replicate-1 HMOX1-Replicate-2 CD44-Replicate-1 CD44-Replicate-2
Sample
Cont-1 48 F 333.247910 437.103798 461.024828 331.843011 3.102960 2.506669 104.093206 86.129974
Cont-2 72 F 764.150100 869.078205 608.574046 446.550627 7.125417 6.361283 182.388445 176.225880
Cont-3 52 F 2138.863437 2070.821721 356.827990 543.079193 5.926707 5.270235 240.456677 231.559896
Cont-4 54 F 1585.525405 1209.702021 489.428493 372.487074 4.717413 4.526218 263.128888 253.737912
Cont-5 56 F 3984.529733 3510.811850 660.424129 846.459872 9.862739 8.523904 274.169871 265.119235
Cont-6 45 F 1245.486362 1240.686171 484.894522 677.125411 4.625589 3.993123 156.826953 152.019083
Cont-7 58 F 1147.753670 1032.489290 555.627802 668.057355 4.549140 4.236398 161.524256 160.884629
Cont-8 57 M 1186.143702 1184.753853 363.379299 353.560770 2.938339 2.930867 162.335527 151.653790
Cont-9 61 M 938.810429 796.962937 644.645152 586.127055 10.364301 9.799141 199.199077 192.691346
Cont-10 56 M 605.226653 588.381125 511.677696 521.115158 8.177032 8.098301 154.257016 150.124406
Cont-11 56 M 386.633844 298.003913 413.983659 242.808714 7.462061 6.571363 120.663172 118.689892
Cont-12 49 M 441.692409 366.509724 555.627802 275.251235 3.501352 2.617967 183.549210 176.062908
Cont-13 60 M 903.757865 722.169366 428.516296 316.772472 3.932421 3.328162 206.222271 206.174133
Cont-14 54 M 512.551485 505.506157 407.784436 307.451567 6.883224 6.174843 166.005297 163.604132
Cont-15 59 M 627.762309 563.711402 432.164105 282.455621 15.897184 13.713407 189.681595 180.902854
Cont-16 76 M 624.438162 663.193879 486.760491 286.794951 8.666017 7.909509 183.180488 172.447718
Cont-17 61 M 515.186482 403.651844 437.646680 277.408873 9.703783 8.051082 115.644285 107.626430
Cont-18 43 F 800.843824 726.158545 269.513101 201.179715 2.373446 2.588263 125.667543 123.613181
Cont-19 54 F 695.683123 771.997941 392.876251 249.847054 9.275275 11.892330 156.410801 154.977496
Cont-20 58 M 951.398520 782.166861 521.925589 430.600046 18.129317 20.598555 180.902854 168.941741
Cont-21 54 M 356.817905 305.142528 259.526599 197.347167 6.283565 6.602515 119.935234 112.765504
GBM-CP-1 68 M 288.754543 337.207953 240.704621 225.150623 10.069578 12.302105 103.522598 100.381812
GBM-CP-2 55 M 438.940854 376.130247 233.019291 130.933301 6.812990 6.563576 73.565014 70.544339
GBM-CP-3 71 M 324.289058 253.017766 290.178550 200.051057 8.816155 10.979011 115.343856 121.820916
GBM-CP-4 68 F 623.606351 563.711402 362.874558 277.408873 6.649258 8.019610 135.908515 139.387623
GBM-CP-5 45 F 215.800546 173.114172 161.984092 164.155158 9.878642 11.627565 71.308700 70.434625
GBM-CP-6 64 M 312.234648 329.274636 135.951042 137.841780 10.915061 11.547394 89.832033 87.532319
GBM-CP-7 57 F 382.823715 351.944317 206.159908 178.176284 11.291039 11.731830 83.174693 81.617441
GBM-CP-8 59 F 388.534965 349.989590 152.068192 155.934284 14.580057 18.334459 113.071178 113.147498
GBM-CP-9 61 F 137.480008 146.298124 148.007275 144.815993 26.258244 32.175966 54.273994 52.920107
GBM-CP-10 49 M 283.580071 244.406643 145.453202 149.928291 10.332418 12.664255 63.352875 62.994857
GBM-CP-11 39 F 387.584731 394.222786 184.155241 111.634601 6.875418 7.195811 86.224008 82.884420
GBM-CP-12 68 F 214.679677 169.529509 131.558572 89.442094 12.543478 13.083400 60.072245 53.600332
GBM-CP-13 49 F 142.537692 68.289101 146.090936 110.422775 2.314358 3.365774 42.234818 38.358589
GBM-CP-14 76 M 405.530225 285.653009 117.730575 90.412713 22.377688 24.863941 72.066825 68.212060
GBM-CP-15 64 M 209.050428 145.047688 196.447411 117.934817 11.130969 12.704526 84.423668 80.235897
GBM-CP-16 69 M 258.358131 232.425112 107.805734 92.945477 9.568775 10.627525 86.411839 73.458779
GBM-CP-17 70 M 50.593805 83.310165 141.215134 87.506797 23.581718 26.467304 70.653921 63.471870
GBM-CP-18 64 M 234.616227 183.746629 101.812095 77.575035 17.359057 21.507865 94.826572 80.434374
GBM-CP-19 51 F 224.709932 132.369627 189.277420 108.408618 13.438602 15.449359 42.397131 38.533301
GBM-CP-20 60 M 111.330866 123.289792 119.980886 94.510599 16.713061 19.518338 47.213286 41.744978
GBM-CP-21 84 F 189.579237 97.603582 104.800859 80.797112 33.690016 42.569100 89.467132 81.421180

42 rows × 10 columns

Create new matrix for test

From 05-Independent Confirmation

Data needs to be

In [4]:
SID = []
Sample_SID_map = []
for samp_name in list(PREP_val.index):
    parsed = samp_name.split('-')
    if 'Cont' in parsed:
        SID.append('H%s' % parsed[-1] )
    else:
        SID.append('G%s' % parsed[-1] )
    Sample_SID_map.append( ( samp_name, SID[-1] ) )
col_names = ['SID', 'HMOX1', 'TGFBI', 'VCAM1', 'CD44']
index = PREP_val.loc
ind_test = pandas.DataFrame(np.zeros((len(PREP_val.index), len(col_names))),columns=col_names)
ind_test['SID'] = SID
ind_test = ind_test.set_index('SID')
proteins = col_names[1:]
for protein in proteins:
    #name mismatch
    if protein == 'VCAM1':
        prot = 'VCAM'
    else:
        prot = protein
    PREP_mean = PREP_val[['%s-Replicate-1' % prot,'%s-Replicate-2' % prot]].sum(axis=1)/2
    for prep_samp, ind_samp in Sample_SID_map:
        ind_test.loc[ind_samp, protein] = PREP_mean[prep_samp]  
ind_test
Out[4]:
HMOX1 TGFBI VCAM1 CD44
SID
H1 2.804814 396.433919 385.175854 95.111590
H2 6.743350 527.562336 816.614153 179.307162
H3 5.598471 449.953592 2104.842579 236.008287
H4 4.621815 430.957784 1397.613713 258.433400
H5 9.193321 753.442001 3747.670791 269.644553
H6 4.309356 581.009966 1243.086266 154.423018
H7 4.392769 611.842579 1090.121480 161.204443
H8 2.934603 358.470035 1185.448777 156.994658
H9 10.081721 615.386104 867.886683 195.945211
H10 8.137666 516.396427 596.803889 152.190711
H11 7.016712 328.396187 342.318879 119.676532
H12 3.059659 415.439519 404.101066 179.806059
H13 3.630292 372.644384 812.963615 206.198202
H14 6.529033 357.618002 509.028821 164.804715
H15 14.805295 357.309863 595.736856 185.292225
H16 8.287763 386.777721 643.816021 177.814103
H17 8.877432 357.527776 459.419163 111.635357
H18 2.480855 235.346408 763.501184 124.640362
H19 10.583802 321.361652 733.840532 155.694149
H20 19.363936 476.262818 866.782690 174.922298
H21 6.443040 228.436883 330.980217 116.350369
G1 11.185841 232.927622 312.981248 101.952205
G2 6.688283 181.976296 407.535551 72.054676
G3 9.897583 245.114804 288.653412 118.582386
G4 7.334434 320.141715 593.658877 137.648069
G5 10.753103 163.069625 194.457359 70.871663
G6 11.231228 136.896411 320.754642 88.682176
G7 11.511435 192.168096 367.384016 82.396067
G8 16.457258 154.001238 369.262277 113.109338
G9 29.217105 146.411634 141.889066 53.597051
G10 11.498336 147.690746 263.993357 63.173866
G11 7.035615 147.894921 390.903758 84.554214
G12 12.813439 110.500333 192.104593 56.836288
G13 2.840066 128.256855 105.413397 40.296703
G14 23.620814 104.071644 345.591617 70.139443
G15 11.917748 157.191114 177.049058 82.329782
G16 10.098150 100.375606 245.391621 79.935309
G17 25.024511 114.360965 66.951985 67.062896
G18 19.433461 89.693565 209.181428 87.630473
G19 14.443980 148.843019 178.539780 40.465216
G20 18.115699 107.245743 117.310329 44.479132
G21 38.129558 92.798986 143.591410 85.444156

42 rows × 4 columns

Write out to csv for 05-Independent Confirmation

Above table is at data/ind-test.csv

In [5]:
ind_test.to_csv('data/ind-test.csv')