{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "Preparing Data for Analysis\n", "====\n", "\n", "In biomedical contexts, data most often comes from external text files such as spreadsheets. Here we will look at how to prepare such data for importing into R as a data frame. In order to read spreadsheets correctly, you need to be follow some simple rules when cosntrucitng the table:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Do this\n", "\n", "- A table has column headers and a number of rows and nothing else – it is RECTANGULAR\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Not this\n", "\n", "- Do not put more than 1 table in a worksheet\n", "- Do not use non-rectangular tables\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Do this\n", "\n", "- One cell = one value\n", "- Easy to filter by tube, sample or subject\n", "- Easy to write validation rules or lookup table\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Not this\n", "\n", "- ID column has 3 different values\n", "- Need to do text parsing to recover information – very error prone\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Round-trip from Excel to CSV and back to Excel\n", "\n", "#### Before\n", "\n", "- Inofmration in highlighting\n", "- Information in comment notes\n", "- Information in font color\n", "- Merged cells\n", "\n", "\n", "\n", "### After\n", "\n", "- Comments are lost\n", "- Highlighting is lost\n", "- Bad cell formatting is lost\n", "- Merged cells become missing information\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Other suggestions\n", "----\n", "\n", "- When sorting, make sure you sort all the columns, even the ones that are not on the screen\n", "- Use a lookup table rather than typing if possible to avoid errors due to typos\n", "- Use a special marker to indicate misssing values - do not use 0 or 999 etc\n", "- Do not keeep multiple copies of the same spredsheet \n", "- If you must keep multiple copies, make sure you version them clearly in the fileanme\n", "- Excel is OK if you use almost NONE of its features!" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "collapsed": true }, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.11" } }, "nbformat": 4, "nbformat_minor": 0 }