Saving and sharing data

Many data science applications require an intermediate storage format to transfer data, and the data to be stored may be structurally complex or large. This is the problem that serialization addresses.

Serialization

From Wikipedia

In computing, serialization (US spelling) or serialisation (UK spelling) is the process of translating a data structure or object state into a format that can be stored (for example, in a file or memory data buffer) or transmitted (for example, across a computer network) and reconstructed later (possibly in a different computer environment).

ML example

For example, in ML applications we often need to store details about a machine learning model (including the train/test data) so that we can compare it with other models. These may then need to be transferred across computers to perform comparative analysis.

We illustrate with an example from scikit-learn docs.

[1]:
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
X, y = make_classification(random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    random_state=0)
pipe = Pipeline([('scaler', StandardScaler()), ('svc', SVC())])
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)
[1]:
0.88

We monkey-patch the pipeline object to give it a name attribute.

[2]:
pipe.name = 'my_pipeline_0.0.1'

A pipeline has several parameters.

[3]:
pipe.get_params()
[3]:
{'memory': None,
 'steps': [('scaler', StandardScaler()), ('svc', SVC())],
 'verbose': False,
 'scaler': StandardScaler(),
 'svc': SVC(),
 'scaler__copy': True,
 'scaler__with_mean': True,
 'scaler__with_std': True,
 'svc__C': 1.0,
 'svc__break_ties': False,
 'svc__cache_size': 200,
 'svc__class_weight': None,
 'svc__coef0': 0.0,
 'svc__decision_function_shape': 'ovr',
 'svc__degree': 3,
 'svc__gamma': 'scale',
 'svc__kernel': 'rbf',
 'svc__max_iter': -1,
 'svc__probability': False,
 'svc__random_state': None,
 'svc__shrinking': True,
 'svc__tol': 0.001,
 'svc__verbose': False}

We also want to know the data used to train and test the model. Here the first 2 training samples are shown.

[4]:
X_train[:2]
[4]:
array([[-0.65240858,  0.49374178,  1.30184623, -1.28532883, -1.94473774,
         2.06449286, -2.03068447,  1.02017271,  0.68981816,  0.28634369,
        -0.43265956,  0.60884383,  1.21114529, -0.11610394, -0.69204985,
        -0.39095338,  1.53637705, -1.30819171, -1.04525337, -0.11054066],
       [ 0.35178011, -0.47003288, -0.37914756, -0.15902752, -2.23460699,
        -0.17858909, -0.9301565 ,  0.41731882,  0.11514787, -1.40596292,
         1.13712778, -0.59005765, -1.66069981, -0.21673147, -0.94436849,
         0.37923553,  0.23810315, -2.38076394, -0.11048941, -1.55042935]])
[5]:
y_train[:2]
[5]:
array([0, 0])

We combine these into a single data structure.

[6]:
python_model = {
    'model': pipe,
    'X_train': X_train,
    'y_train': y_train,
    'X_test': X_test,
    'y_test': y_test
}
[7]:
import pendulum

filename_base = f'{pipe.name}_{pendulum.now()}'
filename_base
[7]:
'my_pipeline_0.0.1_2020-11-11T19:27:47.226967-05:00'

Python native data formats

If you only ever use Python and don’t need to share your data with anyone else, you can use efficient serialization formats native to Python.

Pickle

[8]:
import pickle
[9]:
# Note that we need to open the file in binary write mode ('wb')
pickle_file = f'{filename_base}.pickle'
with open(pickle_file, 'wb') as f:
    pickle.dump(python_model, f)
[10]:
! head -c 200 $pickle_file
��/w}�(�model��sklearn.pipeline��Pipeline���)��}�(�steps�]�(�scaler��sklearn.preprocessing._data��StandardScaler���)��}�(�        with_mean���with_std���copy���n_features_in_�K�n_samples_s
[11]:
with open(pickle_file, 'rb') as f:
    m_pickle = pickle.load(f)
print(m_pickle.keys())
dict_keys(['model', 'X_train', 'y_train', 'X_test', 'y_test'])

This is super convenient because the model is immediately usable!

[12]:
m_pickle['model'].score(m_pickle['X_test'], m_pickle['y_test'])
[12]:
0.88
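
Pickle can also serialize to an in-memory bytes object rather than a file, which is what you would actually transmit over a network. A minimal sketch using the same python_model (the blob and m_bytes names are just illustrative):

blob = pickle.dumps(python_model)    # bytes, ready to send over a socket or message queue
m_bytes = pickle.loads(blob)         # reconstruct on the receiving side
m_bytes['model'].score(m_bytes['X_test'], m_bytes['y_test'])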

Joblib

Joblib is more efficient for objects that contain large NumPy arrays. Behind the scenes it builds on pickle, adding optimized handling of large arrays and optional compression.

[13]:
import joblib
[14]:
joblib_file = f'{filename_base}.joblib'
joblib.dump(python_model, joblib_file)
[14]:
['my_pipeline_0.0.1_2020-11-11T19:27:47.226967-05:00.joblib']
[15]:
! head -c 200 $joblib_file
���}�(�model��sklearn.pipeline��Pipeline���)��}�(�steps�]�(�scaler��sklearn.preprocessing._data��StandardScaler���)��}�(�        with_mean���with_std���copy���n_features_in_�K�n_samples_s
[16]:
m_joblib = joblib.load(joblib_file)
[17]:
m_joblib['model'].score(m_joblib['X_test'], m_joblib['y_test'])
[17]:
0.88
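
Joblib can also compress as it writes. A sketch, where the compress level and the _compressed filename suffix are just illustrative choices:

joblib_gz_file = f'{filename_base}_compressed.joblib'
joblib.dump(python_model, joblib_gz_file, compress=3)   # compress=3 trades CPU time for a smaller file
m_joblib_gz = joblib.load(joblib_gz_file)               # load auto-detects the compression
m_joblib_gz['model'].score(m_joblib_gz['X_test'], m_joblib_gz['y_test'])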

Portable data formats

Here we generally cannot store Python objects automatically, so we create a generic data structure to serialize instead. Serialization using these non-native formats usually takes more work.

Note: some Python libraries, such as pyyaml, provide mechanisms for directly storing and recreating Python objects in the style of pickle and joblib; these are not covered in these lecture notes.

[18]:
generic_model = {
    'name': pipe.name,
    'params': pipe.get_params(),
    'X_train': X_train,
    'y_train': y_train,
    'X_test': X_test,
    'y_test': y_test
}

CSV

CSV cannot handle non-tabular data structures, so we would have to do something like store 5 different files:

  • model key, value pairs (one per line)

  • X_train

  • X_test

  • y_train

  • y_test

[19]:
import csv

csv_file = f'{pipe.name}_{pendulum.now()}.csv'
with open(csv_file, 'w') as f:
    writer = csv.writer(f, delimiter=',', quotechar='"')
    writer.writerow(['name', pipe.name])
    for k, v in pipe.get_params().items():
        writer.writerow([k, v])
[20]:
! head -c 200 $csv_file
name,my_pipeline_0.0.1
memory,
steps,"[('scaler', StandardScaler()), ('svc', SVC())]"
verbose,False
scaler,StandardScaler()
svc,SVC()
scaler__copy,True
scaler__with_mean,True
scaler__with_std,

Reading back using the csv module handles the problem of commas embedded within quotes.

[21]:
with open(csv_file, 'r') as f:
    reader = csv.reader(f, delimiter=',', quotechar='"')
    for i, row in enumerate(reader):
        print(row)
        if i >= 2:
            break
['name', 'my_pipeline_0.0.1']
['memory', '']
['steps', "[('scaler', StandardScaler()), ('svc', SVC())]"]
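
If we want the parameters back as a dictionary rather than printed rows, we can rebuild one from the same file. A sketch (restored_params is just an illustrative name); note that every value comes back as a string, or an empty string where the original was None:

with open(csv_file, 'r') as f:
    reader = csv.reader(f, delimiter=',', quotechar='"')
    restored_params = {k: v for k, v in reader}
restored_params['svc__C']   # '1.0' -- a string, not a float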

We can write the numpy arrays to CSV in the same way, but it’s easier to do so directly with numpy.

[22]:
import numpy as np
[23]:
X_train_filename = f'X_train_{filename_base}'
np.savetxt(X_train_filename, X_train, delimiter=',')
[24]:
! head -c 200 $X_train_filename
-6.524085823870200418e-01,4.937417773491884487e-01,1.301846229564998403e+00,-1.285328829789109673e+00,-1.944737744352711406e+00,2.064492861359319420e+00,-2.030684467781494362e+00,1.020172711715799707e

Reading back into numpy is also straightforward.

[25]:
np.loadtxt(X_train_filename, delimiter=',').shape
[25]:
(75, 20)

JSON

JSON is ubiquitous as a data format and is the de facto standard for REST APIs. Generally, JSON only understands basic data types - string, number, object (like a Python dictionary), array (like a Python list), boolean and null - so it is inefficient for transferring large binary objects such as numpy arrays.

[26]:
import json
import numpy as np

Unfortunately, the get_params method returns values that are Python objects, such as StandardScaler(), which cannot be directly serialized to JSON.

[27]:
json_file = f'{filename_base}.json'

with open(json_file, 'w') as f:
    try:
        json.dump(generic_model, f)
    except TypeError as e:
        print(e)
Object of type StandardScaler is not JSON serializable

We need to convert to strings first.

[28]:
def serialize(m):
    """Serialize all objects to their string represntation."""
    d = {}
    for k, v in m.items():
        if type(v) is np.ndarray:
            d[k] = v.tolist()
        else:
            d[k] = str(v)
    return d
[29]:
with open(json_file, 'w') as f:
    json.dump(serialize(generic_model), f)
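
An alternative to pre-converting with serialize is to give json.dump a default hook, which is called for any object json cannot handle natively. A sketch (to_jsonable and json_file_alt are hypothetical names, and the file is separate so the output below still refers to the serialize version); unlike serialize, this keeps params as a nested JSON object and only stringifies the individual estimators:

def to_jsonable(obj):
    """Fallback used by json.dump for objects it cannot serialize."""
    if isinstance(obj, np.ndarray):
        return obj.tolist()
    return str(obj)

json_file_alt = f'{filename_base}_alt.json'   # hypothetical alternative filename
with open(json_file_alt, 'w') as f:
    json.dump(generic_model, f, default=to_jsonable)
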
[30]:
! head -c 200 $json_file
{"name": "my_pipeline_0.0.1", "params": "{'memory': None, 'steps': [('scaler', StandardScaler()), ('svc', SVC())], 'verbose': False, 'scaler': StandardScaler(), 'svc': SVC(), 'scaler__copy': True, 'sc

The price is that the parameters are now strings (and the arrays are plain lists), so you need to do the reconstruction yourself.

See the scikit-learn docs for how to restore scikit-learn models.

It is simple to restore numpy arrays.

[31]:
with open(json_file, 'r') as f:
    m_json = json.load(f)
[32]:
X_test_json = np.asarray(m_json['X_test'])
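
A quick sanity check (assuming the original X_test from above is still in memory); the floats survive the JSON round trip, so this should print True:

print(np.allclose(X_test_json, X_test))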

YAML

  • YAML Ain’t Markup Language

  • YAML is often used for configuration - for example, in docker-compose to specify containers

YAML is a superset of JSON, so anything that can be serialized as JSON will work. However, YAML is more flexible. See the YAML docs for more information, especially how to use YAML aliases and references.
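
A minimal sketch of aliases (the document content is just illustrative): an anchor (&shared) defines a value once, and each alias (*shared) reuses it when the document is loaded.

import yaml

doc = """
defaults: &shared
  kernel: rbf
  C: 1.0
svc_params: *shared
"""
yaml.safe_load(doc)   # {'defaults': {'kernel': 'rbf', 'C': 1.0}, 'svc_params': {'kernel': 'rbf', 'C': 1.0}}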

[33]:
import yaml
[34]:
yaml_file = f'{filename_base}.yaml'

with open(yaml_file, 'w') as f:
    yaml.safe_dump(serialize(generic_model), f)
[35]:
! head -c 200 $yaml_file
X_test:
- - -0.16137353627777917
  - 0.0275097020298358
  - -0.5110404635801098
  - 0.8566996977399313
  - 0.1140320833086349
  - 1.3674149824601585
  - -0.10497970101895356
  - 0.15364446081566638

[36]:
with open(yaml_file, 'r') as f:
    m_yaml = yaml.safe_load(f)
[37]:
m_yaml.keys()
[37]:
dict_keys(['X_test', 'X_train', 'name', 'params', 'y_test', 'y_train'])

XML

XML represents data as a recursive tree of elements.

[38]:
import xml.etree.ElementTree as ET

XML is painful to create manually, so I will convert from JSON instead.
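
For a sense of what building XML by hand looks like, here is a minimal sketch using the ET alias imported above (root_el is just an illustrative name, and this element is not used further below):

root_el = ET.Element('model')
ET.SubElement(root_el, 'name').text = pipe.name
ET.SubElement(root_el, 'svc__C').text = str(pipe.get_params()['svc__C'])
ET.tostring(root_el, encoding='unicode')
# '<model><name>my_pipeline_0.0.1</name><svc__C>1.0</svc__C></model>'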

[39]:
! python3 -m pip install --quiet json2xml
[40]:
from json2xml import json2xml
from json2xml.utils import readfromjson
[41]:
xml_file = f'{filename_base}.xml'

data = readfromjson(json_file)
xml = json2xml.Json2xml(data).to_xml()
[42]:
with open(xml_file, 'w') as f:
    f.write(xml)
[43]:
! head -c 200 $xml_file
<?xml version="1.0" ?>
<all>
        <name type="str">my_pipeline_0.0.1</name>
        <params type="str">{'memory': None, 'steps': [('scaler', StandardScaler()), ('svc', SVC())], 'verbose': False, 'scaler': Standa
[44]:
tree = ET.parse(xml_file)
root = tree.getroot()
[45]:
for item in root:
    print(item)
<Element 'name' at 0x12cb41ae0>
<Element 'params' at 0x12cb41b80>
<Element 'X_train' at 0x12cb41bd0>
<Element 'y_train' at 0x12cb52db0>
<Element 'X_test' at 0x12cb3f9a0>
<Element 'y_test' at 0x12cc8fe50>

Use XPath notation to navigate the XML tree.

[46]:
name = root.find('.//name')
name.tag, name.text
[46]:
('name', 'my_pipeline_0.0.1')
[47]:
len(root.findall('.//item'))
[47]:
2200

HDF5

HDF5 was designed to store large and heterogeneous data sets. It is ideal if you need to store lots of numerical data along with annotations (metadata).

There are two popular libraries in Python:

  • h5py

  • pytables

I find h5py to have a friendlier interface, but the implementation supported by pandas is pytables.
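
As a sketch of the pandas route (assuming pandas and the tables package are installed; the pandas_h5_file name is just illustrative), to_hdf and read_hdf use pytables under the hood:

import pandas as pd

pandas_h5_file = f'{filename_base}_pandas.h5'
pd.DataFrame(X_train).to_hdf(pandas_h5_file, key='X_train', mode='w')
pd.read_hdf(pandas_h5_file, key='X_train').shape   # (75, 20)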

[48]:
h5_file = f'{filename_base}.h5'
[49]:
import h5py
[50]:
with h5py.File(h5_file, 'w') as f:
    g = f.create_group(pipe.name)
    g.create_dataset(name='X_train', data=python_model['X_train'])
    g.create_dataset(name='y_train', data=python_model['y_train'])
    g.create_dataset(name='X_test', data=python_model['X_test'])
    g.create_dataset(name='y_test', data=python_model['y_test'])
    g.attrs['name'] = pipe.name
    for k, v in pipe.get_params().items():
        g.attrs[k] = str(v)
[51]:
! head -c 200 $h5_file


��������(f��������`����TREE�����������������
[52]:
with h5py.File(h5_file, 'r') as f:
    for k in f:
        g = f[k]
        print(g)
        for attr in g.attrs:
            print(attr, g.attrs[attr])
        for item in (g):
            print(item, g[item])
<HDF5 group "/my_pipeline_0.0.1" (4 members)>
memory None
name my_pipeline_0.0.1
scaler StandardScaler()
scaler__copy True
scaler__with_mean True
scaler__with_std True
steps [('scaler', StandardScaler()), ('svc', SVC())]
svc SVC()
svc__C 1.0
svc__break_ties False
svc__cache_size 200
svc__class_weight None
svc__coef0 0.0
svc__decision_function_shape ovr
svc__degree 3
svc__gamma scale
svc__kernel rbf
svc__max_iter -1
svc__probability False
svc__random_state None
svc__shrinking True
svc__tol 0.001
svc__verbose False
verbose False
X_test <HDF5 dataset "X_test": shape (25, 20), type "<f8">
X_train <HDF5 dataset "X_train": shape (75, 20), type "<f8">
y_test <HDF5 dataset "y_test": shape (25,), type "<i8">
y_train <HDF5 dataset "y_train": shape (75,), type "<i8">
[53]:
with h5py.File(h5_file, 'r') as f:
    xs = f['my_pipeline_0.0.1/X_train']
    print(xs[:2, :5])
[[-0.65240858  0.49374178  1.30184623 -1.28532883 -1.94473774]
 [ 0.35178011 -0.47003288 -0.37914756 -0.15902752 -2.23460699]]

Google Protocol Buffer (protobuf)

This is typically used to transmit data for ML prediction, especially for ML deployments on a cloud platform. It is a binary format, so it is much more efficient than JSON for large data sets.

From the official docs, there are 3 steps:

  • Define message formats in a .proto file.

  • Use the protocol buffer compiler.

  • Use the Python protocol buffer API to write and read messages.

This will make more sense when we deploy an ML model, so we’ll punt the example till then.