{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Python: Machine Learning with `sklearn`\n", "\n", "This tutorial illustrates how to use `scikit-learn` to build common machine learning pipelines. It is NOT meant to show how to do machine learning well; take a machine learning course for that." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": true }, "outputs": [], "source": [ "%matplotlib inline\n", "import itertools as it\n", "import numpy as np\n", "import pandas as pd\n", "from pandas import DataFrame, Series\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Resources\n", "----\n", "\n", "[Official scikit-learn documentation](http://scikit-learn.org/stable/documentation.html)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Example\n", "----\n", "\n", "We will try to separate rocks from mines using this [data set](https://archive.ics.uci.edu/ml/datasets/Connectionist+Bench+(Sonar,+Mines+vs.+Rocks)).\n", "\n", "From the description provided:\n", "```\n", "Data Set Information:\n", "\n", "The file \"sonar.mines\" contains 111 patterns obtained by bouncing sonar signals off a metal cylinder at various angles and under various conditions. The file \"sonar.rocks\" contains 97 patterns obtained from rocks under similar conditions. The transmitted sonar signal is a frequency-modulated chirp, rising in frequency. The data set contains signals obtained from a variety of different aspect angles, spanning 90 degrees for the cylinder and 180 degrees for the rock. \n", "\n", "Each pattern is a set of 60 numbers in the range 0.0 to 1.0. Each number represents the energy within a particular frequency band, integrated over a certain period of time. The integration aperture for higher frequencies occur later in time, since these frequencies are transmitted later during the chirp. 
\n", "\n", "The label associated with each record contains the letter \"R\" if the object is a rock and \"M\" if it is a mine (metal cylinder). The numbers in the labels are in increasing order of aspect angle, but they do not encode the angle directly.\n", "```" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/undocumented/connectionist-bench/sonar/sonar.all-data', header=None)\n", "# The `prefix` argument of read_csv was removed in pandas 2.0, so name the columns X0..X60 explicitly\n", "df.columns = [f'X{i}' for i in df.columns]" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(208, 61)" ] }, "execution_count": 3, "metadata": {}, "output_type": "execute_result" } ], "source": [ "df.shape" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "#### The last column holds the labels - make it a category" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "df.rename(columns={'X60':'Label'}, inplace=True)\n", "df.Label = df.Label.astype('category')" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|  | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | ... | X51 | X52 | X53 | X54 | X55 | X56 | X57 | X58 | X59 | Label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0200 | 0.0371 | 0.0428 | 0.0207 | 0.0954 | 0.0986 | 0.1539 | 0.1601 | 0.3109 | 0.2111 | ... | 0.0027 | 0.0065 | 0.0159 | 0.0072 | 0.0167 | 0.0180 | 0.0084 | 0.0090 | 0.0032 | R |
| 1 | 0.0453 | 0.0523 | 0.0843 | 0.0689 | 0.1183 | 0.2583 | 0.2156 | 0.3481 | 0.3337 | 0.2872 | ... | 0.0084 | 0.0089 | 0.0048 | 0.0094 | 0.0191 | 0.0140 | 0.0049 | 0.0052 | 0.0044 | R |
| 2 | 0.0262 | 0.0582 | 0.1099 | 0.1083 | 0.0974 | 0.2280 | 0.2431 | 0.3771 | 0.5598 | 0.6194 | ... | 0.0232 | 0.0166 | 0.0095 | 0.0180 | 0.0244 | 0.0316 | 0.0164 | 0.0095 | 0.0078 | R |
| 3 | 0.0100 | 0.0171 | 0.0623 | 0.0205 | 0.0205 | 0.0368 | 0.1098 | 0.1276 | 0.0598 | 0.1264 | ... | 0.0121 | 0.0036 | 0.0150 | 0.0085 | 0.0073 | 0.0050 | 0.0044 | 0.0040 | 0.0117 | R |
| 4 | 0.0762 | 0.0666 | 0.0481 | 0.0394 | 0.0590 | 0.0649 | 0.1209 | 0.2467 | 0.3564 | 0.4459 | ... | 0.0031 | 0.0054 | 0.0105 | 0.0110 | 0.0015 | 0.0072 | 0.0048 | 0.0107 | 0.0094 | R |
5 rows × 61 columns
|  | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | ... | X50 | X51 | X52 | X53 | X54 | X55 | X56 | X57 | X58 | X59 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0200 | 0.0371 | 0.0428 | 0.0207 | 0.0954 | 0.0986 | 0.1539 | 0.1601 | 0.3109 | 0.2111 | ... | 0.0232 | 0.0027 | 0.0065 | 0.0159 | 0.0072 | 0.0167 | 0.0180 | 0.0084 | 0.0090 | 0.0032 |
| 1 | 0.0453 | 0.0523 | 0.0843 | 0.0689 | 0.1183 | 0.2583 | 0.2156 | 0.3481 | 0.3337 | 0.2872 | ... | 0.0125 | 0.0084 | 0.0089 | 0.0048 | 0.0094 | 0.0191 | 0.0140 | 0.0049 | 0.0052 | 0.0044 |
| 2 | 0.0262 | 0.0582 | 0.1099 | 0.1083 | 0.0974 | 0.2280 | 0.2431 | 0.3771 | 0.5598 | 0.6194 | ... | 0.0033 | 0.0232 | 0.0166 | 0.0095 | 0.0180 | 0.0244 | 0.0316 | 0.0164 | 0.0095 | 0.0078 |
3 rows × 60 columns
|  | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | ... | X50 | X51 | X52 | X53 | X54 | X55 | X56 | X57 | X58 | X59 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.399551 | -0.040648 | -0.026926 | -0.715105 | 0.364456 | -0.101253 | 0.521638 | 0.297843 | 1.125272 | 0.021186 | ... | 0.595283 | -1.115432 | -0.597604 | 0.680897 | -0.295646 | 1.481635 | 1.763784 | 0.069870 | 0.171678 | -0.658947 |
| 1 | 0.703538 | 0.421630 | 1.055618 | 0.323330 | 0.777676 | 2.607217 | 1.522625 | 2.510982 | 1.318325 | 0.588706 | ... | -0.297902 | -0.522349 | -0.256857 | -0.843151 | 0.015503 | 1.901046 | 1.070732 | -0.472406 | -0.444554 | -0.419852 |
| 2 | -0.129229 | 0.601067 | 1.723404 | 1.172176 | 0.400545 | 2.093337 | 1.968770 | 2.852370 | 3.232767 | 3.066105 | ... | -1.065875 | 1.017585 | 0.836373 | -0.197833 | 1.231812 | 2.827246 | 4.120162 | 1.309360 | 0.252761 | 0.257582 |
3 rows × 60 columns
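
The values in the table above look like the features after standardization (each column shifted to zero mean and scaled to unit variance). A minimal sketch of that step, assuming scikit-learn's `StandardScaler`; the frame `X_demo` is synthetic stand-in data and the names `X_demo` and `X_std` are illustrative, not taken from the notebook:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sonar features (assumption: values in [0, 1])
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.uniform(0, 1, size=(20, 4)),
                      columns=[f'X{i}' for i in range(4)])

# Subtract each column's mean and divide by its population standard deviation
scaler = StandardScaler()
X_std = pd.DataFrame(scaler.fit_transform(X_demo), columns=X_demo.columns)

print(X_std.mean().round(6))       # column means are approximately 0
print(X_std.std(ddof=0).round(6))  # column standard deviations are approximately 1
```

Note that `StandardScaler` uses the population standard deviation (`ddof=0`), which is why the check passes `ddof=0` to pandas.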
|  | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | ... | X50 | X51 | X52 | X53 | X54 | X55 | X56 | X57 | X58 | X59 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.126126 | 0.200000 | 0.217949 | -0.581931 | 0.528726 | 0.096125 | 0.642271 | 0.538267 | 1.163123 | 0.182309 | ... | 0.750000 | -0.920635 | -0.310433 | 0.723288 | -0.037736 | 1.595142 | 1.791822 | 0.385185 | 0.390977 | -0.387097 |
| 1 | 1.013514 | 0.682540 | 1.282051 | 0.619315 | 0.896746 | 2.476155 | 1.486320 | 2.646482 | 1.330279 | 0.665714 | ... | -0.112903 | -0.317460 | -0.066158 | -0.493151 | 0.238994 | 1.983806 | 1.197026 | -0.133333 | -0.180451 | -0.165899 |
| 2 | 0.153153 | 0.869841 | 1.938462 | 1.601246 | 0.560868 | 2.024590 | 1.862517 | 2.971685 | 2.987903 | 2.775925 | ... | -0.854839 | 1.248677 | 0.717557 | 0.021918 | 1.320755 | 2.842105 | 3.814126 | 1.570370 | 0.466165 | 0.460829 |
3 rows × 60 columns
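
The values in the table above look like the result of a robust scaling, where each column is centered on its median and divided by its interquartile range so that outliers have less influence than with mean and standard deviation. A minimal sketch, assuming scikit-learn's `RobustScaler` with default settings; `X_demo` is synthetic and all names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Synthetic stand-in data with one extreme value to mimic an outlier
rng = np.random.default_rng(0)
X_demo = pd.DataFrame(rng.uniform(0, 1, size=(21, 3)),
                      columns=[f'X{i}' for i in range(3)])
X_demo.iloc[0, 0] = 25.0

# Center on the median and scale by the interquartile range (75th - 25th percentile)
scaler = RobustScaler()
X_rob = pd.DataFrame(scaler.fit_transform(X_demo), columns=X_demo.columns)

# After scaling, each column has median 0 and interquartile range 1
iqr = X_rob.quantile(0.75) - X_rob.quantile(0.25)
print(X_rob.median().round(6))
print(iqr.round(6))
```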
|  | pc | explained | cumsum |
|---|---|---|---|
| 0 | 0 | 0.203466 | 0.203466 |
| 1 | 1 | 0.188972 | 0.392438 |
| 2 | 2 | 0.085500 | 0.477938 |
| 3 | 3 | 0.056792 | 0.534730 |
| 4 | 4 | 0.050071 | 0.584800 |
| 5 | 5 | 0.040650 | 0.625450 |
| 6 | 6 | 0.032790 | 0.658240 |
| 7 | 7 | 0.030465 | 0.688705 |
| 8 | 8 | 0.025660 | 0.714364 |
| 9 | 9 | 0.024911 | 0.739275 |
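
The columns `pc`, `explained` and `cumsum` in the table above read like a per-component summary of explained variance from a principal component analysis. One way such a table could be built is sketched below, assuming scikit-learn's `PCA` fitted on standardized features; the data is synthetic and the variable names are illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with some correlation between the first two columns
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 6))
X_demo[:, 1] += 2 * X_demo[:, 0]

# Standardize, then fit PCA keeping all components
X_std = StandardScaler().fit_transform(X_demo)
pca = PCA().fit(X_std)

# Fraction of total variance explained by each component, plus the running total
summary = pd.DataFrame({
    'pc': np.arange(pca.n_components_),
    'explained': pca.explained_variance_ratio_,
})
summary['cumsum'] = summary['explained'].cumsum()
print(summary)
```

The `cumsum` column is a convenient way to see how many components are needed to reach a chosen share of the total variance.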
|  | X0 | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | ... | X20 | X21 | X22 | X23 | X24 | X25 | X26 | X27 | X28 | X29 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.921168 | -1.370893 | -1.666476 | 0.837913 | -1.057324 | 1.712504 | 1.785716 | -1.581264 | 0.335418 | -1.028065 | ... | -1.208238 | 0.723202 | 0.304876 | 0.120470 | -0.458567 | -0.021847 | -1.089710 | 0.096606 | 0.168123 | -0.753434 |
| 1 | -0.480125 | 7.586388 | -1.275734 | 3.859346 | 2.121112 | -2.186818 | -1.742764 | 1.517061 | 0.307933 | -1.341882 | ... | -2.388110 | 0.021429 | -0.145524 | -0.246021 | 0.117770 | 0.704112 | -0.052387 | -0.240064 | -0.178744 | -0.554605 |
| 2 | 3.859228 | 6.439860 | -0.030635 | 5.454599 | 1.552060 | 1.181619 | -1.820138 | -1.495929 | -1.152459 | -1.006030 | ... | -1.740823 | -2.000942 | -0.295682 | 1.931963 | 0.758036 | -0.113901 | 0.964319 | 0.214707 | 0.527529 | -0.033003 |
| 3 | 4.597419 | -3.104089 | -1.785344 | -1.115908 | -2.785528 | -2.072673 | 2.084530 | 1.707289 | 0.452390 | -1.117318 | ... | -0.685825 | 1.307367 | -0.662918 | 1.142591 | -0.352601 | -0.491193 | -0.061186 | 0.150725 | 1.389191 | 0.642030 |
| 4 | -0.533868 | 1.849847 | -0.860097 | 3.302076 | 2.808954 | -0.783945 | 0.362657 | 0.812621 | 0.184578 | -0.023594 | ... | 0.503340 | 0.258970 | 0.253982 | 1.199262 | -0.165722 | -0.041342 | -0.589311 | -0.500720 | -1.549835 | -0.783667 |
5 rows × 30 columns
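
The 30-column table above looks like the data projected onto a reduced set of principal components. A minimal sketch of such a projection, assuming scikit-learn's `PCA` with a fixed `n_components`; the data is synthetic, and `k = 3` is a hypothetical choice made only to keep the example small:

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the standardized feature matrix
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 8))
X_std = StandardScaler().fit_transform(X_demo)

# Keep only the first k principal components (k is a hypothetical choice here)
k = 3
pca = PCA(n_components=k)
X_reduced = pd.DataFrame(pca.fit_transform(X_std),
                         columns=[f'X{i}' for i in range(k)])

print(X_reduced.shape)  # (40, 3)
```

The projected columns are mutually uncorrelated and ordered by decreasing variance, so truncating to the first `k` keeps the directions that carry the most variance.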