Mapping the Antibody Response¶

For the final exercise, we will visualize the affinity of the antibody response to different regions of the HIV envelope protein. The data come from experiments in which animals are given an HIV vaccine, and pre- and post-vaccination samples are tested on an HIV peptide array containing thousands of HIV envelope sequences. Each array has triplicates of each sequence.

We need to do the following for each animal:

read in pre-vaccination peptide array data a text file
read in post-vaccination peptide array data a text file
find the change in fluorescence reading from pre- to post-vaccination sample
find the median value across the 3 replicates per sequence
extract the clade data for each peptide
read in a metadata file showing the mapping of peptide sequence to a region on the HIV envelope
find the region each sequence is mapped to

Then, the investigator has asked us to plot the following heatmap where each cell contains the sum of values for sequences belonging to that clade, normalized by the number of such sequences in each clade.

heatmap

This non-trivial exercise will clearly test your mastery of the workshop data manipulation and visualization lessons!

The data comes from the paper Schiffner et al., J Virol. 2013 Sep;87(18):10163-72.

In [1]:

import os
import glob
import numpy as np
import pandas as pd
from pandas import DataFrame, Series
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

Preview¶

Here is some code that can generate the above figure from raw data, but it is likely to be incomprehensible. Later we will unpack the code into the individual steps.

Paths to data and metadata files¶

We use os.path.join to create paths to files that will work regardless of the operating system. In particular, Windows uses the forward slash while Unix and Macs use the backwards slash to separate parts of the path.

In [2]:

base_dir = os.path.join('data', 'capstone2')
meta_path = os.path.join(base_dir, 'Copy_Peptides Lib_HIV_Revised BK_20160905.txt')
pre_files = sorted(glob.glob(os.path.join(base_dir, '*pre_*anon.txt')))
post_files = sorted(glob.glob(os.path.join(base_dir, '*wk24_*anon.txt')))
pairs = list(zip(pre_files, post_files))

Step-by-step¶

What are the paths to files that were created?¶

In [3]:

base_dir

Out[3]:

'data/capstone2'

In [4]:

meta_path

Out[4]:

'data/capstone2/Copy_Peptides Lib_HIV_Revised BK_20160905.txt'

In [5]:

pairs

Out[5]:

[('data/capstone2/770 pre_20130118_S1_460_anon.txt',
  'data/capstone2/770 wk24_20130118_S2_460_anon.txt'),
 ('data/capstone2/771 pre_20130118_S3_460_anon.txt',
  'data/capstone2/771 wk24_20130118_S4_460_anon.txt'),
 ('data/capstone2/776 pre_20130118_S5_460_anon.txt',
  'data/capstone2/776 wk24_20130118_S6_460_anon.txt'),
 ('data/capstone2/778 pre_20130118_S7_460_anon.txt',
  'data/capstone2/778 wk24_20130118_S8_460_anon.txt'),
 ('data/capstone2/780 pre_20130118_S9_460_anon.txt',
  'data/capstone2/780 wk24_20130118_S10_460_anon.txt')]

Load the metadata file into a `pandas` data frame¶

Looking at the original text file, we see that - indicates blank space and can be coded as missing. Also we note that there are blank rows in the text file that we must take care of. We take a peek at the text file using Unix shell commands (the ! sends commands to the shell).

In [6]:

! head -n 15 "$meta_path"

Use `pandas` to read in table¶

In [7]:

meta = pd.read_table(meta_path, na_values='-')

We have some blank rows¶

In [8]:

meta.ix[:11, :8]

Out[8]:

	OriNum	NewNum	PeptideSeq	Region	Seqs	N.ident	Batch	12
0	1.0	E1	MRVMGIQRNWQHLWR	1.0	1	1.0	Ori	NaN
1	2.0	E2	MRVMGIQRNCQHLWR	1.0	2	1.0	Ori	NaN
2	3.0	E3	MRVKGIRKNYQHLWR	1.0	3	1.0	Ori	NaN
3	4.0	E4	MRVRGILRNWQQWWI	1.0	4	1.0	Ori	NaN
4	5.0	E5	MRVRGIERNYQHLWR	1.0	5	1.0	Ori	NaN
5	6.0	E6	MRVKETQMNWPNLWK	1.0	6	1.0	Ori	NaN
6	7.0	E7	MRVMGIQKNYPLLWR	1.0	7	1.0	Ori	NaN
7	NaN	E1426	MRVRGIWKNWPQWLI	1.0	11	1.0	New	NaN
8	NaN	E1427	MRVMGTQKNCQQWWI	1.0	12	1.0	New	NaN
9	NaN	E1920	MRVREILRNWQRWWT	1.0	13	1.0	New	C.ZM651
10	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
11	8.0	E8	MGIQRNWQHLWRWGT	2.0	1	1.0	Ori	NaN

Now get rid of blank rows¶

In [9]:

meta = meta.dropna(axis=0, how='all')

Check that blank rows have been removed¶

In [10]:

meta.ix[:11, :8]

Out[10]:

	OriNum	NewNum	PeptideSeq	Region	Seqs	N.ident	Batch	12
0	1.0	E1	MRVMGIQRNWQHLWR	1.0	1	1.0	Ori	NaN
1	2.0	E2	MRVMGIQRNCQHLWR	1.0	2	1.0	Ori	NaN
2	3.0	E3	MRVKGIRKNYQHLWR	1.0	3	1.0	Ori	NaN
3	4.0	E4	MRVRGILRNWQQWWI	1.0	4	1.0	Ori	NaN
4	5.0	E5	MRVRGIERNYQHLWR	1.0	5	1.0	Ori	NaN
5	6.0	E6	MRVKETQMNWPNLWK	1.0	6	1.0	Ori	NaN
6	7.0	E7	MRVMGIQKNYPLLWR	1.0	7	1.0	Ori	NaN
7	NaN	E1426	MRVRGIWKNWPQWLI	1.0	11	1.0	New	NaN
8	NaN	E1427	MRVMGTQKNCQQWWI	1.0	12	1.0	New	NaN
9	NaN	E1920	MRVREILRNWQRWWT	1.0	13	1.0	New	C.ZM651
11	8.0	E8	MGIQRNWQHLWRWGT	2.0	1	1.0	Ori	NaN

Extract clades from `meta`¶

Each peptide sequence may match one or more HIV clades. The clade information is in the columns '12', '11', '10', '9', '8', '1', '2', '3', '4', '5', '6', '7', 'Unnamed: 19', 'Unnamed: 20', and we will need to extract them from there.

In [11]:

clade_cols = ['12', '11', '10', '9', '8', '1', '2', '3', '4', '5', '6', '7', 'Unnamed: 19', 'Unnamed: 20']
meta[clade_cols].head()

Out[11]:

	12	11	10	9	8	1	2	3	4	5	6	7	Unnamed: 19	Unnamed: 20
0	NaN	NaN	NaN	NaN	NaN	NaN	M Con	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	A Con	NaN	NaN	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	B Con	NaN	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	C Con	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	D Con	NaN	NaN	AE.A244

First find all unique clades in the values within `clade_cols`¶

In [12]:

clades = sorted(set(meta[clade_cols].values.ravel()) - set([np.nan]))

In [13]:

clades

Out[13]:

['A Con',
 'AE Con',
 'AE.A244',
 'AE.TH023',
 'AG Con',
 'B Con',
 'B.MN',
 'C Con',
 'C.1086',
 'C.TV1',
 'C.ZM651',
 'D Con',
 'M Con']

Create a new column for each clade where the value is True if the clade is present in that row and false otherwise¶

In [14]:

for clade in clades:
    meta[clade] = (meta[clade_cols] == clade).any(axis=1)

In [15]:

meta.columns

Out[15]:

Index(['OriNum', 'NewNum', 'PeptideSeq', 'Region', 'Seqs', 'N.ident', 'Batch',
       '12', '11', '10', '9', '8', '1', '2', '3', '4', '5', '6', '7',
       'Unnamed: 19', 'Unnamed: 20', 'A Con', 'AE Con', 'AE.A244', 'AE.TH023',
       'AG Con', 'B Con', 'B.MN', 'C Con', 'C.1086', 'C.TV1', 'C.ZM651',
       'D Con', 'M Con'],
      dtype='object')

Drop all columns except `Region`, `PeptideSeq` and the columns for clades¶

In [16]:

meta = meta[['PeptideSeq', 'Region'] + clades]
meta.head()

Out[16]:

	PeptideSeq	Region	A Con	AE Con	AE.A244	AE.TH023	AG Con	B Con	B.MN	C Con	C.1086	C.TV1	C.ZM651	D Con	M Con
0	MRVMGIQRNWQHLWR	1.0	False	False	False	False	False	False	False	False	False	False	False	False	True
1	MRVMGIQRNCQHLWR	1.0	True	False	False	False	False	False	False	False	False	False	False	False	False
2	MRVKGIRKNYQHLWR	1.0	False	False	False	False	False	True	False	False	False	False	False	False	False
3	MRVRGILRNWQQWWI	1.0	False	False	False	False	False	False	False	True	False	False	False	False	False
4	MRVRGIERNYQHLWR	1.0	False	False	True	False	False	False	False	False	False	False	False	True	False

Reshape `meta` so that it only contains 4 columns `PeptideSeq`, `Region` , `variable` and `value`¶

Then drop all rows where the value of variable is False.

In [17]:

meta = pd.melt(meta, ['PeptideSeq', 'Region'])
meta.head()

Out[17]:

	PeptideSeq	Region	variable	value
0	MRVMGIQRNWQHLWR	1.0	A Con	False
1	MRVMGIQRNCQHLWR	1.0	A Con	True
2	MRVKGIRKNYQHLWR	1.0	A Con	False
3	MRVRGILRNWQQWWI	1.0	A Con	False
4	MRVRGIERNYQHLWR	1.0	A Con	False

In [18]:

meta = meta[meta.value == True].drop('value', axis=1)
meta.head()

Out[18]:

	PeptideSeq	Region	variable
1	MRVMGIQRNCQHLWR	1.0	A Con
11	MGIQRNCQHLWRWGT	2.0	A Con
21	QRNCQHLWRWGTMIL	3.0	A Con
31	CQHLWRWGTMILGMI	4.0	A Con
40	LWRWGTMILGMIIIC	5.0	A Con

Reanme the columns of meta as `seq`, `region`, `clade`¶

In [19]:

meta.columns = ['seq', 'region', 'clade']

Load a data file into a dataframe¶

We’ll just use one of the pre files as an example.

In [20]:

path = pairs[0][0]
path

Out[20]:

'data/capstone2/770 pre_20130118_S1_460_anon.txt'

In [21]:

! head -n 15 "$path"

ATF     1.0
32      58
"Type=GenePix Export 3"
"DateTime=2013/01/18 16:53:39"
"Settings="
"GalFile="
"PixelSize=10"
"Wavelengths=635    532"
"ImageFiles="
"NormalizationMethod=None"
"NormalizationFactors=1     1"
"JpegImage="
"StdDev=Type 1"
"RatioFormulations=W1/W2 (635/532)"
"FeatureType=Circular"

Hmm, the top part of the file does not actually contain the data we are looking for¶

It turns out that the first 34 lines are information about the assay we don’t need for this exercise. We use some Unix shell commands to show the lines 31-40.

In [22]:

! head -n 40 "$path" | tail -n 10

"ScanArea=1"
"Supplier=Arrayjet Ltd"
"ArrayerSoftwareName=Aj Marathon"
"ArrayerSoftwareVersion=1.3.4"
"Flags" "Normalize"     "Autoflag"      "Block" "Column"        "Row"   "Name"  "ID"    "X"     "Y"     "Dia."  "F635 Median"   "F635 Mean"     "F635 SD"       "F635 CV"       "B635"  "B635 Median"   "B635 Mean"     "B635 SD"       "B635 CV"       "% > B635+1SD"       "% > B635+2SD"       "F635 % Sat."   "F532 Median"   "F532 Mean"     "F532 SD"       "F532 CV"       "B532"  "B532 Median"   "B532 Mean"     "B532 SD"       "B532 CV"       "% > B532+1SD"       "% > B532+2SD"       "F532 % Sat."   "F635 Total Intensity"  "F532 Total Intensity"  "SNR 635"       "SNR 532"       "Ratio of Medians (635/532)"    "Ratio of Means (635/532)"      "Median of Ratios (635/532)"    "Mean of Ratios (635/532)"      "Ratios SD (635/532)"   "Rgn Ratio (635/532)"   "Rgn R2 (635/532)"      "Log Ratio (635/532)"   "F635 Median - B635"    "F532 Median - B532"    "F635 Mean - B635"      "F532 Mean - B532"      "Sum of Medians (635/532)"      "Sum of Means (635/532)"        "F Pixels"      "B Pixels"      "Circularity"   "Index" "Annotation"
0       0       0       1       1       1       "E73-73"        "WVTVYYGVPVWKDAE"       2050    5160    100     347     359     160     44      267     267     276     84      30      49      6       0       2844    2962    1104    37      605     605     681     431     63      98      93      0       28686   236979  0.988   5.292   0.036   0.039   0.040   0.038   2.916   0.050   0.241   -4.807  80      2239    92      2357    2319    2449    80      560     100     1       "2729_1_2|798770|10010268_A09_1014|30006056_E09_0650
"
0       0       0       1       2       1       "E8-8"  "MGIQRNWQHLWRWGT"       2350    5170    60      602     588     75      12      292     292     292     60      20      100     97      0       1283    1199    448     37      551     551     627     390     62      69      44      0       18824   38372   4.933   1.467   0.423   0.457   0.394   0.459   2.407   0.273   0.331   -1.240  310     732     296     648     1042    944     32      176     100     2       "2729_1_1|798676|10010267_A11_1103|30006056_A11_0861
"
-50     0       0       1       3       1       "E1932" "FCASDAKSYEKEVHN"       2640    5180    140     142     156     64      41      327     327     332     79      23      0       0       0       641     734     518     70      615     615     1051    3998    380     0       0       0       24295   114537  -2.228  -0.079  -7.115  -1.437  0.618   0.758   2.609   -0.001  0.000   Error   -185    26      -171    119     -159    -52     156     1140    100     3       "2729_1_2|798820|10010268_E11_1219|30006056_G11_0959

Use `pandas` to read in the file, dropping the first 34 rows¶

In [23]:

pre = pd.read_table(path, skiprows=34)

In [24]:

pre.ix[:10, :8]

Out[24]:

	Flags	Block	Column	Row	Name	ID
0	0	1	1	1	E73-73	WVTVYYGVPVWKDAE
1	0	1	2	1	E8-8	MGIQRNWQHLWRWGT
2	-50	1	3	1	E1932	FCASDAKSYEKEVHN
3	0	1	4	1	E1435	WLIWSILGFWIGNME
4	0	1	5	1	E1449	EDLWVTVYYGVPVWR
5	0	1	6	1	E1427	MRVMGTQKNCQQWWI
6	-50	1	7	1	E1467	FCASDAKAQETEAHN
7	-50	1	8	1	E42-42	MILGMIIICSAAENL
8	0	1	9	1	E75-75	WVTVYYGVPVWRDAD
9	0	1	10	1	E7-7	MRVMGIQKNYPLLWR
10	0	1	11	1	E105-105	FCASDAKSYKTEAHN

Drop unused columns¶

For this analysis, we are only interested in the columns named “ID”, “Block” and “F635 Median - B635”, and we can drop the rest of the many columns for simplicity. We never, ever touch the raw data files, so we can always get the other columns by re-reading them in if necessary.

In [25]:

pre = pre[['ID', 'Block', 'F635 Median - B635']]

In [26]:

pre.head()

Out[26]:

	ID	Block	F635 Median - B635
0	WVTVYYGVPVWKDAE	1	80
1	MGIQRNWQHLWRWGT	1	310
2	FCASDAKSYEKEVHN	1	-185
3	WLIWSILGFWIGNME	1	158
4	EDLWVTVYYGVPVWR	1	568

Do the same thing for the file in `pairs[0][1]` and create a `post` data frame. All the data files have the exact same format, so you must drop the first 34 rows as before.¶

In [27]:

path = pairs[0][1]
post = pd.read_table(path, skiprows=34)
post = post[['ID', 'Block', 'F635 Median - B635']]

In [28]:

post.head()

Out[28]:

	ID	Block	F635 Median - B635
0	WVTVYYGVPVWKDAE	1	0
1	MGIQRNWQHLWRWGT	1	82
2	FCASDAKSYEKEVHN	1	-13
3	WLIWSILGFWIGNME	1	119
4	EDLWVTVYYGVPVWR	1	954

Find the change in intensity before and after vaccination¶

First change the column names for convenience¶

For pre, call the columns seq, block, pre
For post call the columns seq, block, post

In [29]:

pre.columns = ['seq', 'block', 'pre']
post.columns = ['seq', 'block', 'post']

Calculate summary statistics¶

from pre, create a new data frame baseline that contains seq and a column equal to the median value across blocks + 3 SD - if the resulting value is less than 200, use 200 instead
for post create a new data frame vaccine for the median value across blocks

In [30]:

def baseline(x):
    return max(200, x.median() + 3*x.std(ddof=1))

In [31]:

base = pre.groupby('seq').agg({'pre': baseline})
base = base.reset_index()
base.head()

Out[31]:

	seq	pre
0	AA	200.000000
1	AAARLGAVILFVVIV	200.000000
2	AARAVELLGRSSLRG	335.874922
3	AARIVELLGRRGWEA	537.473794
4	AARTVELLGHSSLKG	200.000000

In [32]:

vaccine = post.groupby('seq').agg({'post': np.median})
vaccine = vaccine.reset_index()
vaccine.head()

Out[32]:

	seq	post
0	AA	-94
1	AAARLGAVILFVVIV	-54
2	AARAVELLGRSSLRG	492
3	AARIVELLGRRGWEA	636
4	AARTVELLGHSSLKG	357

Create a datta frame `df` by merging baseline and vaccine on `seq`¶

In [33]:

df = pd.merge(base, vaccine, on=['seq'])
df.head()

Out[33]:

	seq	pre	post
0	AA	200.000000	-94
1	AAARLGAVILFVVIV	200.000000	-54
2	AARAVELLGRSSLRG	335.874922	492
3	AARIVELLGRRGWEA	537.473794	636
4	AARTVELLGHSSLKG	200.000000	357

Create a new column `signal` in `df` by subtracting the `pre` column from the `post` column¶

If the resulting value is less than 1, use 1 instead.

In [34]:

df['signal'] = df['post'] - df['pre']
df['signal'] = df['signal'].clip_lower(1)
df.head()

Out[34]:

	seq	pre	post	signal
0	AA	200.000000	-94	1.000000
1	AAARLGAVILFVVIV	200.000000	-54	1.000000
2	AARAVELLGRSSLRG	335.874922	492	156.125078
3	AARIVELLGRRGWEA	537.473794	636	98.526206
4	AARTVELLGHSSLKG	200.000000	357	157.000000

Drop the `pre` and `post` columns since we don’t need them anymore¶

In [35]:

df = df.drop(['pre', 'post'], axis=1)
df.head()

Out[35]:

	seq	signal
0	AA	1.000000
1	AAARLGAVILFVVIV	1.000000
2	AARAVELLGRSSLRG	156.125078
3	AARIVELLGRRGWEA	98.526206
4	AARTVELLGHSSLKG	157.000000

Merge with metadta¶

Merge `df` and `meta` on `seq` to create a new data frame called `data`¶

In [36]:

data = pd.merge(df, meta, on='seq')

In [37]:

data.head()

Out[37]:

	seq	signal	region	clade
0	AARAVELLGRSSLRG	156.125078	254.0	C Con
1	AARIVELLGRRGWEA	98.526206	254.0	D Con
2	AARTVELLGHSSLKG	157.000000	254.0	A Con
3	AARTVELLGHSSLKG	157.000000	254.0	AE Con
4	AARTVELLGHSSLKG	157.000000	254.0	AG Con

Now do this for all files in a loop¶

Refactor code into functions to make the code more manageable¶

In [38]:

def process_meta(path, clade_cols):
    """Collect steps to transform metadata here."""

    meta = pd.read_table(meta_path, na_values='-')
    clades = sorted(set(meta[clade_cols].values.ravel()) - set([np.nan]))
    for clade in clades:
        meta[clade] = (meta[clade_cols] == clade).any(axis=1)
    meta = meta[['PeptideSeq', 'Region'] + clades]
    meta = pd.melt(meta, ['PeptideSeq', 'Region'])
    meta = meta[meta.value == True].drop('value', axis=1)
    meta.columns = ['seq', 'region', 'clade']
    return meta

def process_baseline(path):
    """Steps to calculate baseline values."""

    pre = pd.read_table(path, skiprows=34)
    pre = pre[['ID', 'Block', 'F635 Median - B635']]
    pre.columns = ['seq', 'block', 'pre']
    d = pre.groupby('seq').agg({'pre': baseline})
    d = d.reset_index()
    return d

def process_vaccinated(path):
    """Steps to calculate vaccinated median values."""

    post = pd.read_table(path, skiprows=34)
    post = post[['ID', 'Block', 'F635 Median - B635']]
    post.columns = ['seq', 'block', 'post']
    d = post.groupby('seq').agg({'post': np.median})
    d = d.reset_index()
    return d

def process_pair(pre_path, post_path):
    """Process a pair of pre- and post-files."""

    s1 = os.path.split(pre_path)[-1][:3]
    s2 = os.path.split(pre_path)[-1][:3]
    assert(s1 == s2)

    pre = process_baseline(pre_path)
    post = process_vaccinated(post_path)
    df = pd.merge(pre, post, on=['seq'])
    df['signal'] = df['post'] - df['pre']
    df['signal'] = df['signal'].clip_lower(1)
    df['subject'] = s1
    df = df.drop(['pre', 'post'], axis=1)
    return df

In [39]:

def process_all(meta_path, pairs, clade_cols):
    """Process metadata and all pairs of files into a single data frame."""

    meta = meta = process_meta(meta_path, clade_cols)

    dfs = []
    for pre_path, post_path in pairs:
        df = process_pair(pre_path, post_path)
        dfs.append(df)
    dfs = pd.concat(dfs)
    data = pd.merge(dfs, meta)
    return data

In [40]:

clade_cols = ['12', '11', '10', '9', '8', '1', '2', '3', '4', '5', '6', '7', 'Unnamed: 19', 'Unnamed: 20']
base_dir = os.path.join('data', 'capstone2')
meta_path = os.path.join(base_dir, 'Copy_Peptides Lib_HIV_Revised BK_20160905.txt')
pre_files = sorted(glob.glob(os.path.join(base_dir, '*pre_*anon.txt')))
post_files = sorted(glob.glob(os.path.join(base_dir, '*wk24_*anon.txt')))
pairs = list(zip(pre_files, post_files))

In [41]:

data = process_all(meta_path, pairs, clade_cols)
data.head()

Out[41]:

	seq	signal	subject	region	clade
0	AARAVELLGRSSLRG	156.125078	770	254.0	C Con
1	AARAVELLGRSSLRG	1.000000	771	254.0	C Con
2	AARAVELLGRSSLRG	1.000000	776	254.0	C Con
3	AARAVELLGRSSLRG	1.000000	778	254.0	C Con
4	AARAVELLGRSSLRG	1.000000	780	254.0	C Con

Assigned groups¶

Group information GLA-Rb Study:

WT Group: 770, 771, 776
GLA Group: 778, 780, 783

In [42]:

wt = ['770', '771', '776']
gla = ['778', '780', '783']

In [43]:

data.ix[data.subject.isin(wt), 'group'] = 'wt'
data.ix[data.subject.isin(gla), 'group'] = 'gla'
data.head()

Out[43]:

	seq	signal	subject	region	clade	group
0	AARAVELLGRSSLRG	156.125078	770	254.0	C Con	wt
1	AARAVELLGRSSLRG	1.000000	771	254.0	C Con	wt
2	AARAVELLGRSSLRG	1.000000	776	254.0	C Con	wt
3	AARAVELLGRSSLRG	1.000000	778	254.0	C Con	gla
4	AARAVELLGRSSLRG	1.000000	780	254.0	C Con	gla

Assigned envelope proteins¶

If region <= 163, then protein = gp120, otherwise protein = gp144

Graphics¶

In [44]:

data.ix[data.region <= 163, 'protein'] = 'gp120'
data.ix[data.region >163, 'protein'] = 'gp144'

In [45]:

data.head()

Out[45]:

	seq	signal	subject	region	clade	group	protein
0	AARAVELLGRSSLRG	156.125078	770	254.0	C Con	wt	gp144
1	AARAVELLGRSSLRG	1.000000	771	254.0	C Con	wt	gp144
2	AARAVELLGRSSLRG	1.000000	776	254.0	C Con	wt	gp144
3	AARAVELLGRSSLRG	1.000000	778	254.0	C Con	gla	gp144
4	AARAVELLGRSSLRG	1.000000	780	254.0	C Con	gla	gp144

Plotting signal against region¶

In [46]:

plt.scatter(data.region, data.signal)
plt.margins(0.05)
plt.xlabel('Region')
plt.ylabel('Signal')
pass

Conditioning plots on categorical variables¶

In [47]:

gp120 = data[data.protein == 'gp120']
g = sns.FacetGrid(gp120, row='clade', col='group', aspect=2, margin_titles=True)
g.map(plt.bar, 'region', 'signal')
pass

Plotting heatmaps¶

In [48]:

hm = data.groupby(['region', 'subject']).agg({'signal': np.mean})
hm.head()

Out[48]:

		signal
region	subject
1.0	770	1.000000
	771	144.941271
	776	54.898099
	778	70.446511
	780	3.236776

In [49]:

hm = hm.unstack().T
hm.ix[:4, :6]

Out[49]:

	region	1.0	2.0	3.0	4.0	5.0	6.0
	subject
signal	770	1.000000	65.258850	2.400000	1.000000	22.888889	30.400000
	771	144.941271	99.295281	30.186508	94.206611	43.111111	24.425327
	776	54.898099	17.535661	1.000000	33.542959	38.111111	48.678128
	778	70.446511	111.469690	37.500000	135.000000	166.111111	126.300000

In [50]:

hm = hm.reset_index(level=0, drop=True)
hm.ix[:4, :6]

Out[50]:

region	1.0	2.0	3.0	4.0	5.0	6.0
subject
770	1.000000	65.258850	2.400000	1.000000	22.888889	30.400000
771	144.941271	99.295281	30.186508	94.206611	43.111111	24.425327
776	54.898099	17.535661	1.000000	33.542959	38.111111	48.678128
778	70.446511	111.469690	37.500000	135.000000	166.111111	126.300000

In [51]:

plt.figure(figsize=(16, 3))
sns.heatmap(hm)
xlim = [1, 163]
plt.xlim(xlim)
plt.xticks(range(xlim[0], xlim[1], 5), range(xlim[0], xlim[1], 5))
pass

Conditioning heatmaps¶

In [52]:

nrows = len(clades)
ncols = 2
fig, axes = plt.subplots(nrows, ncols, figsize=(16, 3*nrows))

low, high, step = 1, 162, 5
for i, (clade, df_clade) in enumerate(data[data.protein == 'gp120'].groupby('clade')):
    for j, (group, df_group) in enumerate(df_clade.groupby('group')):
        ax = axes[i, j]
        hm = df_group.groupby(['region', 'subject'])[['signal']].mean()
        hm = hm.unstack().T.reset_index(level=0, drop=True)
        hm.columns = hm.columns.astype('int')
        sns.heatmap(hm, ax=ax, linewidth=0.01, cbar=j)
        ax.set_title(clade)
        ax.set_xlim(low, high)
        ax.set_xticks(range(low, high+1, step))
        ax.set_xticklabels(range(low, high+1, step))
        ax.set_ylabel('Animal')
plt.tight_layout()

Code to remove identifying laboratory information¶

for f in glob.glob(os.path.join(base_dir, '7*txt')):
    if 'anon' in f:
        continue
    with open(f) as fin:
        with open(f.replace('.txt', '_anon.txt'), 'w') as fout:
            for line in fin:
                if line.startswith('"GalFile='):
                    fout.write('"GalFile="\n')
                elif line.startswith('"ImageFiles='):
                    fout.write('"ImageFiles="\n')
                else:
                    fout.write(line)