Importing and Exporting Data
My typical advice is that, if you're doing your analysis in Matlab, you can focus on importing data and forget about exporting it. But in reality, it's helpful to know how to do both, and the commands are quite similar.
Unfortunately, Matlab is not nearly as versatile when it comes to data import/export as a language like R, which was designed for statisticians who deal with data in lots of forms. In this lesson, we'll focus on a simple example, and I'll point you to more complicated functions at the end.
This is going to be a complicated discussion for a few reasons:
- The most common data format out in the world is Microsoft's .xls spreadsheet format. Unfortunately, this is a proprietary format, which means that, in order to open .xls files in Matlab, you will need a copy of Excel, and you will need to be running Windows. Sorry, but them's the breaks.
- Even if you are using Microsoft's semi-open .xlsx format, which is supposed to open on Macs and Unix/Linux under conditions below, you may run into problems. Your data need to be very cleanly formatted.
- Fine, you say, Excel will convert my data to csv (comma-separated value), which is nice and portable and easy to open up in a text editor to read. And yes, Matlab will read csv. But only numbers. Too bad if you have a column of strings. This will be okay with Excel files, but the outcome will be a little funky.
Thus, in general, you will need to do one of the following:
- Recode everything in your spreadsheet as numbers, save as csv, and import.
- Use a Windows machine with Excel installed to import your data.
- Succeed in getting Matlab to load xls or xlsx files in "basic" mode with xlsread. (I had to clean the data a lot to get this working.)
As a result, I'm going to list several options below. You'll have to play around to figure out which is best for you.
First you'll need to download the data (xls xlsx csv). These are data extracted from the Multnomah County Jury Project, which tabulated jury decisions in Multnomah County, Oregon, in the period 1973-1976. The data are currently stored in a comma-separated value file (csv), which is a pretty typical and portable data format. (Excel will read and save files in this format, and it's actually human-readable, since each line is just the entries in each row, separated by a delimeter: commas, spaces, tabs, etc.)
Now move to the directory where the file is located. Those of you with a spreadsheet program installed may want to take a look at the data.
In its simplest form, we want to use one of the following commands to pull data
[num,txt,raw] = xlsread('jury_data.xlsx'); num = csvread('jury_data.csv',1,0);
- xlsread returns three arguments, the numerical data, num (in an array), the text data, txt (in a cell array the same size as the spreadsheet, with all numerical data removed), and the "raw" data in a cell array (raw), where each cell array is a cell in the spreadsheet. In raw, depending on whether or not you have stray spaces, Matlab may convert your numbers to strings inside the cell array.
- csvread always returns a numerical matrix. The two extra arguments supplied are the row and column at which to start. Both are indexed from 0, so that the arguments 1 and 0 above mean to start at the second row, first column.
If we used the xlsread option, we can do some data cleaning (advanced):
dat=nan(size(raw)); %make an empty cell array strtype = cellfun(@ischar,raw); %apply the ischar function to every element of raw dat(~strtype) = cell2mat(raw(~strtype)); %copy over cells that are already numbers dat(strtype) = str2double(raw(strtype)); %convert non-numeric cells to numbers, then copy
Why do we solve the problem like this? Or better yet: what should our thought process be when we approach such a problem? Some thoughts:
- This is not simple. It took me several minutes to work out how it had to be done.
- This would be really painful to do with a loop, but if you know how to do it with a loop, and that's the only way you know how to do it, then use a loop! Go for a solution that works before you waste to much time on being elegant.
- Our basic plan is to copy all the data over to a new array in two pieces. First, we copy over everything that's already a number, then we convert the rest to numbers and copy them the same way.
- To do this, we need to know which cells contain numbers. As you will find, functions like ischar that test whether a single variable is a string, don't work so well on cell array. cellfun is a function that applies a function to every element of a cell array. To do that, we need to supply the name of a function to use on each cell. For reasons we'll cover in the intermediate class, you need to do that using @. The return from cellfun is then a matrix the same size as the cell array, with a 1 if the corresponding cell contains a string and 0 otherwise.
- Next, we want to copy all the numbers. We use logical indexing, with ~strtype as our logical (it's 0 for strings and 1 for numbers). On the right-hand side, we need to get cells from raw using ~strtype, but that returns cells. We can convert those to an array using the cell2mat function.
- Finally, we need to convert the string cells to number and copy them over. Here, we do have a nice, vectorized function, str2double that converts all our string-valued cells to double-precision numbers (this won't be quite right for your integers, since it will add decimal points, but you can fix that later). Again, we use logical indexing with strtype to select the string values in raw, convert them to numbers, and assign these to the correct subset of dat.
Two things to take away from this:
- There is a lot of Matlab left to learn, and unfortunately, it's sometimes needed for very basic things. But there aren't that many new concepts, just more functions to learn.
- Matlab can be really powerful. It only took four lines to solve a data cleaning problem that would be absurd to do by hand-editing.
Thankfully, writing data out is basically the reverse process of reading data in. For more information, see the help files listed below. The process should be self-explanatory, with the caveat that you need to be aware of the limits of what type of data can be written out using each command.
This one's easy. To load data simply use the load command:
load practice.mat % you have to be in the directory containing practice.mat load('practice.mat') %also works
Now, if you're inside a function, load won't put variables directly into the workspace, since it might accidentally overwrite something existing. In that case (or anytime), you can get an output from load
myvars = load('practice.mat'); %returns a structure; each field is a variable
As for saving data, you do that with save, which takes as its argument an output file name and list of variables:
a=2; b=1; save myoutput.mat a b %save variables a and b to myoutput.mat save('myoutput.mat','a','b') %does the same thing