Introduction to Python for statistics¶

Adam Scherling¶

Stats 404, Winter 2018¶

Outline¶

1) Introduction

what is Python
why use Python
installation suggestions
Python 2 vs. Python 3

2) Python basics

data types
syntax
basic commands
file reading/writing

3) Python packages for statistics & data science

numpy
matplotlib
scipy
pandas
statsmodels
scikit-learn

Introduction¶

What is Python?¶

Python is a general-purpose programming language. Like R, it is a high-level, object-oriented language. Unlike R, it was not developed specifically for the practice of statistics. However, a number of useful packages have been developed in Python for mathematics, statistics, and data science.

Why use Python?¶

Having been developed in the computer science community, Python is used heavily for data science and machine learning. Some reasons you might use Python rather than R:

You're interested in using a particular package in Python (e.g. pandas, scikit-learn)
You want to integrate your code into a project (e.g. a website) that is already coded in Python
Personal preference: you're more comfortable coding in Python (functionality, syntax)
Your coworkers or collaborators are using Python

Disadvantages:

Documentation isn't great compared to R (but still pretty good)
Working with data can be a little clunky compared to R

My advice: give it a try and see if you like it!

Anaconda¶

The Anaconda distribution of Python is an easy way to install Python along with the most useful packages. You can download it here: https://www.continuum.io/downloads

Python 2.7 vs 3.6¶

There are two widely supported versions of Python. Apart from some slight differences in syntax (e.g. whether print requires parentheses), the practical differences are small and usually not important. I'm using Python 3 and would recommend doing the same unless you have a prior reason to do otherwise (working with older Python code, coworkers using Python 2, etc.).
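A small sketch of the syntax differences mentioned above. It runs under Python 3; the remarks about Python 2 are comments only and are not executed.

```python
# Python 3: print is a function, so parentheses are required
print('hello')

# Python 3: / always performs true division, // performs floor division
true_div = 7 / 2     # 3.5
floor_div = 7 // 2   # 3

# In Python 2, `print 'hello'` (no parentheses) was valid,
# and 7 / 2 evaluated to 3 because both operands are integers.
print(true_div, floor_div)
```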

Note:¶

This presentation was prepared using Jupyter Notebook: http://jupyter.org

The basics¶

Comments¶

Like R, comments in Python use the # character.

# This code does nothing.

Variable names¶

You can include underscores and numbers in variable names, but not periods.

The period is saved for methods on objects.

Note: like R, no need to declare variable types.

# no problem
my_var = 0
var0 = 0

# problem
my.var = 0

---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-34465469c6e3> in <module>()
      1 # problem
----> 2 my.var = 0

NameError: name 'my' is not defined

A brief note about methods¶

Explanation: Python is looking for a variable called my that has an attribute called var. Since no variable called my exists, it raises a NameError.

A method is a function that is saved as an attribute of an object and generally manipulates that object in some way.
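To make the idea of a method concrete, here is a minimal sketch using built-in string and list methods. The variable names are made up for illustration.

```python
# upper() and count() are methods of string objects
word = 'sasquatch'
shout = word.upper()       # returns a new string; word is unchanged
n_s = word.count('s')      # number of times 's' appears

# append() is a method of list objects and modifies the list in place
values = [1, 2]
values.append(3)

print(shout, n_s, values)
```

Note the difference: string methods return a new object, while list methods such as append() change the existing one.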

Declaring multiple variables¶

A little trick: you can declare or set multiple variables at once.

x, y = 5, 10

print(x)
print(y)

5
10

ints and floats¶

Like R, generally no need to declare the class of a number.

x = 5
y = 2.75
z = 8 / 3

print(x,y,z)

5 2.75 2.6666666666666665

print(type(x), type(y), type(z))

<class 'int'> <class 'float'> <class 'float'>

Strings¶

Like R, strings can be defined using either single quotes '' or double quotes "".

len() returns the length of a string.

Concatenate strings using +.

x = 'taco'

print(len(x))

4

y = 'sauce'
z = x + ' ' + y

print(z)

taco sauce

Split a string using the split() method.

z = z.split(' ')

print(z)

['taco', 'sauce']

You can also collapse a list of strings using join. (Note: join is a method of all strings.)

x = ['One', 'two', 'red', 'blue']
x1 = ', '.join(x)

print(x1)

One, two, red, blue

Tuples, lists, and dictionaries¶

The objects in square brackets above are lists.

Python has several different data types for combining multiple elements into a single object.

# list
x = [1,2,3]

# tuple
y = (4,5,6)

# dictionary
z = {'a':7, 'b':8, 'c':9}

Why these different data structures?

Lists are simple and relatively easy to work with.
Tuples take up less memory than lists, but are immutable: once you've created one, you can't make any edits.
Dictionaries allow for non-numeric indexing and make it fast to look up a value by its key, without searching through every element.
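The differences listed above can be seen in a short sketch. The variable names and values are made up for illustration.

```python
point = (4, 5, 6)
try:
    point[0] = 10          # tuples do not support item assignment
except TypeError as err:
    error_message = str(err)
    print('TypeError:', error_message)

# lists can be edited in place
numbers = [1, 2, 3]
numbers[0] = 10

# dictionaries look up values by key, and keys can be added later
ages = {'ann': 31, 'bob': 27}
ages['cat'] = 45
print(numbers, ages['bob'], 'cat' in ages)
```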

len() returns the length of these objects.

print(len(x), len(y), len(z))

3 3 3

Note: len() also works for strings.

len("apple")

5

Notably, these data structures can include any kind of object:

mylist = [1,'taco',x,y,z]
mytuple = (1,'taco',x,y,z)
mydictionary = {0:1, 1:'taco', 2:x, 3:y, 4:z}

print(mylist)
print(mytuple)
print(mydictionary)

[1, 'taco', [1, 2, 3], (4, 5, 6), {'a': 7, 'b': 8, 'c': 9}]
(1, 'taco', [1, 2, 3], (4, 5, 6), {'a': 7, 'b': 8, 'c': 9})
{0: 1, 1: 'taco', 2: [1, 2, 3], 3: (4, 5, 6), 4: {'a': 7, 'b': 8, 'c': 9}}

A few useful methods of dictionaries:

z = {'a':7, 'b':8, 'c':9}

list(z.keys())

['a', 'b', 'c']

list(z.values())

[7, 8, 9]

list(z.items())

[('a', 7), ('b', 8), ('c', 9)]

Indexing¶

List and tuple elements (and substrings) can be accessed by index. Python starts indexes with 0, rather than 1.

Dictionary entries are accessed using the dictionary keys.

# list
x = [1,2,3]

# tuple
y = (4,5,6)

# dictionary
z = {'a':7, 'b':8, 'c':9}

# string
s = "sasquatch"

# print an individual element of each object
print(x[0], y[2], z['a'], s[3])

1 6 7 q

Unlike R, negative indexes do not remove an element from a list.

Instead, negative indexes count from the end: x[-1] is the last entry in x, x[-2] is the second to last, etc.

x = [1,2,3]

print(x[0],x[1],x[2])
print(x[-1],x[-2],x[-3])

1 2 3
3 2 1

Slicing¶

Subset a list or tuple (or string) using square brackets and a colon. Think of the indexes as the walls separating different boxes. To return the contents of multiple boxes, you have to give the indexes of the outside walls.

# Example:
x = [8,9,10]

print("indexes:  0   1   2    3\nelements: | 8 | 9 | 10 |")

indexes:  0   1   2    3
elements: | 8 | 9 | 10 |

print(x[0:2], x[1:3])

[8, 9] [9, 10]

Note that you can leave out the numbers on either side of the colon to extend the slice as far as possible.

print(x[0:3], x[-3:], x[:])

[8, 9, 10] [8, 9, 10] [8, 9, 10]

Slicing also works with strings.

s = "sasquatch"

print(s[0:6])

sasqua

Accessing non-consecutive elements is a little tricky. One way of doing so is by list comprehension, which I'll talk about more later.

x = [2,4,6,8]

# take the subset corresponding to indexes 1 and 3
x_sub = [x[i] for i in [1,3]]
print(x_sub)

[4, 8]
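When the elements you want are evenly spaced, another option is a slice with a step, written start:stop:step. A minimal sketch:

```python
x = [2, 4, 6, 8]

# start at index 1 and take every second element
odd_positions = x[1::2]

# a negative step walks backwards, which reverses the list
reversed_x = x[::-1]

print(odd_positions, reversed_x)
```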

A note of caution: copying variables¶

When you set a variable equal to another variable, Python saves space by pointing to the same object in memory. It does not create a unique copy of the object.

If you edit the object, both variables will change.

# problem
x = [1,2,3]
y = x
y[0] = 10

print(x)

[10, 2, 3]

Safe ways of copying: slice it, use list(), or use the copy() method

x = [1,2,3]

# make a copy by slicing x
a = x[:]

# make a copy using list()
b = list(x)

# make a copy using the x.copy() method
c = x.copy()

# make changes to the copies
a[0] = 10
b[1] = 20
c[2] = 30

print('x: ', x)
print('a: ', a)
print('b: ', b)
print('c: ', c)

x:  [1, 2, 3]
a:  [10, 2, 3]
b:  [1, 20, 3]
c:  [1, 2, 30]
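One caveat: the three approaches above make shallow copies, so objects nested inside the list are still shared. The sketch below, with a made-up nested list, uses copy.deepcopy from the standard library to copy the inner lists as well.

```python
import copy

nested = [[1, 2], [3, 4]]

shallow = nested[:]              # new outer list, same inner lists
deep = copy.deepcopy(nested)     # new outer list and new inner lists

shallow[0][0] = 10               # also changes nested[0][0]
deep[1][1] = 40                  # does not affect nested

print(nested)
print(deep)
```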

If statements¶

x = 5

if (x > 10):
    print('x is greater than 10')
elif (x < 0):
    print('x is negative')
else:
    print('x is between 0 and 10')

x is between 0 and 10

If you like, you can put an if statement on a single line.

y = 7

if y==7: print('y is seven!')

y is seven!

Note: booleans in Python are True and False. In a condition, 1 is treated as True and 0 as False.

if True:
    print('yes')

if False:
    print('no')

yes

if 1:
    print('yes')

if 0:
    print('no')

yes

Loops¶

x = ['apple','banana','cherry']

for fruit in x:
    print(fruit)

apple
banana
cherry

for i in range(3):
    print(i)

0
1
2

for a,b in [(1,1),(2,4),(3,9)]:
    print(a,'squared equals',b)

1 squared equals 1
2 squared equals 4
3 squared equals 9

i = 10

while (i < 13):
    print(i)
    i+=1

10
11
12

Functions¶

def square(x):
    return(x**2)

print(square(4))

16

# like R, you can give default parameter values
def prod(x=1,y=2):
    return(x*y)

print(prod(), prod(2), prod(3,4))

2 4 12
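Arguments can also be passed by name, which is handy when you only want to override one default. A short sketch reusing the prod function from above:

```python
def prod(x=1, y=2):
    return x * y

# name an argument to override only that default
only_y = prod(y=5)          # x keeps its default of 1

# keyword arguments can be given in any order
swapped = prod(y=3, x=4)

print(only_y, swapped)
```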

Indentation¶

In Python, indentation is part of the syntax. If you fail to indent properly, Python raises an IndentationError. This is particularly relevant for if statements, loops, and functions.

if True:
print('true')

  File "<ipython-input-38-ce54c251ecbd>", line 2
    print('true')
        ^
IndentationError: expected an indented block

for fruit in x:
print(fruit)

  File "<ipython-input-39-4a529fdc0120>", line 2
    print(fruit)
        ^
IndentationError: expected an indented block

def square(x):
return(x**2)

  File "<ipython-input-40-3d5cef0cc8ea>", line 2
    return(x**2)
         ^
IndentationError: expected an indented block

Basic operations on tuples and lists¶

For tuples and lists, the + operator performs concatenation and * performs repetition.

x = [1,2,3]
y = [4,5,6]

print(x + y)

[1, 2, 3, 4, 5, 6]

print(x * 3)

[1, 2, 3, 1, 2, 3, 1, 2, 3]

# create a long list of zeros
x = [0] * 50

print(x)

[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

# or a list of empty placeholders
x = [None] * 10

print(x)

[None, None, None, None, None, None, None, None, None, None]

# The following all cause errors:
x + 2
x - y
x / 2

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-845de10f1db6> in <module>()
      1 # The following all cause errors:
----> 2 x + 2
      3 x - y
      4 x / 2

TypeError: can only concatenate list (not "int") to list

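Sort a list in place using the sort() method. Tuples are immutable and have no sort() method; use the sorted() function instead, which returns a new sorted list.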

x = [2,3,1]
x.sort()

print(x)

[1, 2, 3]
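The built-in sorted() function works on any sequence, including tuples, and leaves the original untouched. A minimal sketch:

```python
t = (2, 3, 1)

# sorted() leaves t unchanged and returns a new list
ascending = sorted(t)
descending = sorted(t, reverse=True)

# convert back to a tuple if that is the type you need
t_sorted = tuple(ascending)

print(ascending, descending, t_sorted)
```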

Manipulating lists and dictionaries¶

Python employs list comprehension and dictionary comprehension for creating and manipulating lists and dictionaries.

Recall that range() can be used to create a range of integers.

x = range(4,12,2)

print(list(x))

[4, 6, 8, 10]

Great. But what if you want decimals? Or something more fancy?

# decimals
x = [i/10 for i in range(10)]

print(x)

[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# a familiar sequence?
x = [(1/i) * (-1)**(i+1) for i in range(1,10)]

print(x)

[1.0, -0.5, 0.3333333333333333, -0.25, 0.2, -0.16666666666666666, 0.14285714285714285, -0.125, 0.1111111111111111]

This is also useful for manipulating an existing list. Suppose you want to apply a function to every number in a list:

x = [3,2,6,4,12]
x_squared = [i**2 for i in x]

print(x_squared)

[9, 4, 36, 16, 144]

Or suppose you want to filter a list:

y = [v for v in x if v > 5]

print(y)

[6, 12]

The same general idea also works for dictionaries.

You do have to use the items() method, which yields the dictionary's (key, value) pairs so that both can be unpacked in the loop.

x = {'a':4, 'b':7, 'c':10}

y = {k:v**2 for k,v in x.items()}

print(y)

{'a': 16, 'b': 49, 'c': 100}

Reading and writing to files¶

In this regard, Python is quite similar to C.

# create some data to write to file
x = [1,2,3,4,5]
squares = [i**2 for i in x]
colors = ['red','blue','yellow','green','red']

# convert x and squares to strings
x = [str(i) for i in x]
squares = [str(i) for i in squares]

print(x)
print(squares)

['1', '2', '3', '4', '5']
['1', '4', '9', '16', '25']

# zip the three variables together
mydata = list(zip(x, squares, colors))
mydata

[('1', '1', 'red'),
 ('2', '4', 'blue'),
 ('3', '9', 'yellow'),
 ('4', '16', 'green'),
 ('5', '25', 'red')]

# open a file for writing
f = open('testfile.csv','w')

# write one row at a time from mydata
# the for loop automatically pulls one element at a time (in this case one tuple at a time)
for row in mydata:
    f.write(','.join(row) + '\n')
    
# close the file
f.close()

Now to read the data back in:

# look at the raw data
# think about how this can be parsed
f = open('testfile.csv', 'r')

f.read()

'1,1,red\n2,4,blue\n3,9,yellow\n4,16,green\n5,25,red\n'

# read in the data
f = open('testfile.csv', 'r')

x2, squares2, colors2 = [], [], []
for line in f:
    # note: strip() takes out the \n
    tmp = line.strip().split(',')
    x2.append(tmp[0])
    squares2.append(tmp[1])
    colors2.append(tmp[2])

# close the file
f.close()

print(x2)
print(squares2)
print(colors2)

['1', '2', '3', '4', '5']
['1', '4', '9', '16', '25']
['red', 'blue', 'yellow', 'green', 'red']

# convert the numeric strings to integers
x2 = [int(i) for i in x2]
squares2 = [int(i) for i in squares2]

print(x2)
print(squares2)

[1, 2, 3, 4, 5]
[1, 4, 9, 16, 25]
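A common alternative to calling open() and close() by hand is a with block, which closes the file automatically, combined with the csv module from the standard library. The sketch below is one way to do it; the file name testfile_with.csv and the sample rows are made up for illustration.

```python
import csv

rows = [(1, 1, 'red'), (2, 4, 'blue'), (3, 9, 'yellow')]

# the file is closed automatically when the with block ends
with open('testfile_with.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows(rows)

with open('testfile_with.csv', 'r', newline='') as f:
    # csv.reader splits each line on commas and returns lists of strings
    data = [row for row in csv.reader(f)]

numbers = [int(row[0]) for row in data]
colors = [row[2] for row in data]
print(numbers, colors)
```

As with the manual approach, everything is read in as strings, so the numeric columns still need to be converted.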

View or set the working directory¶

# import the os package. (More about packages next.)
import os

# view the working directory
os.getcwd()

'/Users/adamscherling/Documents/ucla/stat404/python'

# change the working directory
os.chdir('/Users/adamscherling/Documents/ucla/stat404')
os.getcwd()

'/Users/adamscherling/Documents/ucla/stat404'

The numpy package¶

R is designed so that everything is a vector, which is great for statistics. In Python, vectors do not come as naturally.

The numpy package makes life easier. It includes many useful tools for mathematics and statistics. For more information see http://www.numpy.org.

Packages are loaded in Python using import. It is common to specify an abbreviated name for loaded packages; these names help prevent mistakes if the same function is defined in multiple packages.

# it's standard practice to use the np prefix for numpy
import numpy as np

Arrays¶

Numpy arrays are convenient for representing vectors and matrices.

np.arange creates an array rather than a range object or list

np.arange(10)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

np.arange also allows non-integer step sizes

np.arange(0,3,0.2)

array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ,  1.2,  1.4,  1.6,  1.8,  2. ,
        2.2,  2.4,  2.6,  2.8])

The linspace command is also helpful:

np.linspace(0,1,10)

array([ 0.        ,  0.11111111,  0.22222222,  0.33333333,  0.44444444,
        0.55555556,  0.66666667,  0.77777778,  0.88888889,  1.        ])

Unlike operations on Python lists, array operations are elementwise.

x = np.array([1,2,3])
y = np.array([4,5,6])

print(x + y)

[5 7 9]

print(x**2)

[1 4 9]
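Elementwise behavior also covers the operations that fail for plain lists (adding a number, subtracting, dividing). A minimal sketch:

```python
import numpy as np

x = np.array([1, 2, 3])
y = np.array([4, 5, 6])

# each of these raises a TypeError for plain Python lists
shifted = x + 2      # add 2 to every element
diff = x - y         # elementwise subtraction
halved = x / 2       # elementwise division gives floats

print(shifted, diff, halved)
```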

Arrays can be reshaped to use as matrices.

x = np.array([1,2,2,4,3,6]).reshape(3,2)

print(x)

[[1 2]
 [2 4]
 [3 6]]

Special types of matrices can be created easily.

np.zeros([3,2])

array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])

np.ones([2,3])

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

np.diag([1,2,3])

array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])

You still have to be careful about making copies.

x = np.arange(5)
y = x
y[0] = 17

print(x)

[17  1  2  3  4]

x = np.arange(5)
y = x.copy()
y[0] = 17

print(x)
print(y)

[0 1 2 3 4]
[17  1  2  3  4]

Concatenate matrices and arrays¶

# note the use of chain syntax; it's often useful to chain commands together
x = np.array([1,2,2,4,3,6,4,8]).reshape(4,2)

print(x)

[[1 2]
 [2 4]
 [3 6]
 [4 8]]

To concatenate arrays, the dimensions must align. Even for one-dimensional vectors you have to be explicit about the shape.

y = np.array([3,6,9,12])
x = np.concatenate([x,y],axis=1)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-74-2564b3f87d13> in <module>()
      1 y = np.array([3,6,9,12])
----> 2 x = np.concatenate([x,y],axis=1)

ValueError: all the input arrays must have same number of dimensions

y = np.array([3,6,9,12]).reshape(4,1)
x = np.concatenate([x,y],axis=1)

print(x)

[[ 1  2  3]
 [ 2  4  6]
 [ 3  6  9]
 [ 4  8 12]]
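Two shortcuts that may help when attaching a one-dimensional vector as a column: reshape(-1, 1), which lets numpy work out the number of rows, and np.column_stack. This is a sketch of alternatives, not a replacement for the approach above.

```python
import numpy as np

x = np.array([1, 2, 2, 4, 3, 6, 4, 8]).reshape(4, 2)
y = np.array([3, 6, 9, 12])

# -1 asks numpy to infer that dimension from the array's length
col = y.reshape(-1, 1)
a = np.concatenate([x, col], axis=1)

# column_stack treats one-dimensional inputs as columns
b = np.column_stack([x, y])

print(a.shape, np.array_equal(a, b))
```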

z = np.array([5,10,15]).reshape(1,3)
x = np.concatenate([x,z],axis=0)

print(x)

[[ 1  2  3]
 [ 2  4  6]
 [ 3  6  9]
 [ 4  8 12]
 [ 5 10 15]]

x = np.concatenate([x,x,x], axis=1)
print(x)

[[ 1  2  3  1  2  3  1  2  3]
 [ 2  4  6  2  4  6  2  4  6]
 [ 3  6  9  3  6  9  3  6  9]
 [ 4  8 12  4  8 12  4  8 12]
 [ 5 10 15  5 10 15  5 10 15]]

Matrix algebra¶

Matrix multiplication is available through the dot() method of arrays.

x = np.array([1,2,2,4,3,6]).reshape(3,2)
y = np.array([1,1])

print('x: ', x, '\n')
print('y: ', y, '\n')
print(x.dot(y))

# note: x.dot(y) is equivalent to np.dot(x,y)

x:  [[1 2]
 [2 4]
 [3 6]] 

y:  [1 1] 

[3 6 9]
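In Python 3.5 and later, the @ operator is another way to write matrix multiplication. The sketch below also shows that * is not matrix multiplication for arrays.

```python
import numpy as np

x = np.array([1, 2, 2, 4, 3, 6]).reshape(3, 2)
y = np.array([1, 1])

# the @ operator performs matrix multiplication (Python 3.5 and later)
via_operator = x @ y
via_method = x.dot(y)

# by contrast, * multiplies elementwise (y is broadcast across the rows)
elementwise = x * y

print(via_operator, via_method)
print(elementwise)
```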

Transpose a matrix:

x.T

array([[1, 2, 3],
       [2, 4, 6]])

Invert a matrix:

x = np.arange(4).reshape(2,2)
print(x)

[[0 1]
 [2 3]]

y = np.linalg.inv(x)
print(y)

[[-1.5  0.5]
 [ 1.   0. ]]

print(np.dot(x,y))

[[ 1.  0.]
 [ 0.  1.]]

Find eigenvalues and eigenvectors:

eig = np.linalg.eig(x)
val = eig[0]
vec = eig[1]

print(val, '\n')
print(vec)

[-0.56155281  3.56155281] 

[[-0.87192821 -0.27032301]
 [ 0.48963374 -0.96276969]]
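A quick sanity check on the output of np.linalg.eig: each column of the second array should be an eigenvector v satisfying A v = lambda v. A minimal sketch, using the same 2x2 matrix under the name A:

```python
import numpy as np

A = np.arange(4).reshape(2, 2)
val, vec = np.linalg.eig(A)

# each column of vec is an eigenvector: A v = lambda v
for i in range(len(val)):
    v = vec[:, i]
    print(np.allclose(A.dot(v), val[i] * v))

# the same check for all eigenvectors at once
all_ok = np.allclose(A.dot(vec), vec * val)
```

np.allclose is used rather than == because the results are floating point numbers.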

Other useful numpy functions¶

np.pi

3.141592653589793

x = np.arange(3)

# note that these allow the input to be an array
y = np.exp(x)
z = np.log(y)

print(y.round(2))
print(z)

[ 1.    2.72  7.39]
[ 0.  1.  2.]

x = np.linspace(0,2*np.pi,5)
y = np.sin(x)

print(x.round(2))
print(y.round(2))

[ 0.    1.57  3.14  4.71  6.28]
[ 0.  1.  0. -1. -0.]

Plotting using matplotlib¶

Matplotlib is the standard plotting package in Python. For a tutorial see https://matplotlib.org/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py.

Other packages such as ggplot and seaborn are also available.

import matplotlib.pyplot as plt

x = np.linspace(0,2*np.pi,1000)
y = np.sin(x)

plt.plot(x,y)
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.show()

Random variables using the scipy package¶

R has built-in functionality for working with common probability distributions. In Python similar functions are available through the stats portion of the scipy package.

scipy has a lot more than just probability distributions; there are sub-libraries for topics such as linear algebra, optimization, Fourier transforms, signal processing, and more.

For more information about scipy see https://docs.scipy.org/doc/scipy/reference/.

from scipy import stats

# plot normal pdf
x = np.linspace(-4,4,1000)
y = stats.norm().pdf(x)

plt.plot(x,y)
plt.show()

# plot normal cdf
# (of course, you can also use this for Z tests, etc.)
x = np.linspace(-4,4,1000)
y = stats.norm().cdf(x)

plt.plot(x,y)
plt.show()
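As an example of using the cdf for a Z test, the sketch below computes a two-sided p-value. The observed statistic z = 1.96 is made up for illustration.

```python
from scipy import stats

z = 1.96

# P(Z <= z) under the standard normal
lower_tail = stats.norm().cdf(z)

# two-sided p-value for the observed z statistic
p_value = 2 * (1 - stats.norm().cdf(abs(z)))

# ppf is the inverse of cdf: the 97.5th percentile is about 1.96
critical = stats.norm().ppf(0.975)

print(round(lower_tail, 3), round(p_value, 3), round(critical, 2))
```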

# generate random variables from a distribution
# (note: you can also do this in numpy using np.random)
x = stats.norm(3,2).rvs(3)
# note: the first argument of expon is loc (a shift), not a rate
y = stats.expon(4).rvs(4)
z = stats.poisson(1).rvs(5)

print(x.round(2))
print(y.round(2))
print(z)

[ 0.21  4.77  3.84]
[  4.07  11.07   4.6   12.06]
[0 0 0 1 2]
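Two details worth knowing about rvs. First, scipy parameterizes distributions by loc and scale, so stats.expon(4) is a standard exponential shifted by 4, not an exponential with rate 4. Second, the random_state argument fixes the seed so results can be reproduced. A sketch, with the rate and sample size chosen for illustration:

```python
from scipy import stats

# stats.expon(4) shifts the distribution by loc=4;
# to get rate 4 (mean 1/4), set scale = 1 / rate instead
rate = 4
draws = stats.expon(scale=1/rate).rvs(size=10000, random_state=0)
shifted = stats.expon(4).rvs(size=10000, random_state=0)

# random_state fixes the seed so the draws are reproducible
again = stats.expon(scale=1/rate).rvs(size=10000, random_state=0)

print(draws.mean().round(2), shifted.min() >= 4, (draws == again).all())
```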

The pandas package¶

The pandas package provides its own universe of special objects and functions for easily reading in and manipulating data, time series functionality, and other useful tools. In many ways pandas mimics R, but it also has some interesting tricks and tools of its own.

For more information see http://pandas.pydata.org/pandas-docs/stable/.

import pandas as pd

The `Series` object¶

pandas has its own answer to vectorizing Python: the Series object. These have an immense number of special attributes. See https://pandas.pydata.org/pandas-docs/stable/api.html#series.

They can also be easily converted into numpy arrays if you so choose.

x = pd.Series(stats.norm().rvs(10))
print(x)

0    1.252194
1   -0.914974
2    2.317017
3    0.211160
4    0.235011
5   -1.285713
6    0.075007
7    0.126686
8   -0.453423
9   -0.738689
dtype: float64
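A minimal sketch of moving between a Series and a numpy array. It assumes a reasonably recent pandas, where to_numpy() is available; in older versions the values attribute serves the same purpose. The sample values are made up.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.5, -0.5, 2.0])

# to_numpy() returns the underlying values as a numpy array
arr = s.to_numpy()

# numpy functions also accept a Series directly and return a Series
exp_s = np.exp(s)

print(type(arr), type(exp_s))
```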

# some of the many available summary statistics
[x.min(), x.quantile(0.25), x.median(), x.quantile(0.75), x.max(), x.mean(), x.var(), x.autocorr()]

[-1.2857132036083179,
 -0.6673725452398099,
 0.10084687854416731,
 0.2290481375996089,
 2.3170173620705352,
 0.08242761805914933,
 1.1368443837056643,
 -0.30413002684341717]

# find unique values
x.round().unique()

array([ 1., -1.,  2.,  0.])

# filter
x.where(x > 0)

0    1.252194
1         NaN
2    2.317017
3    0.211160
4    0.235011
5         NaN
6    0.075007
7    0.126686
8         NaN
9         NaN
dtype: float64

# filter and remove missing values
x.where(x > 0).dropna()

0    1.252194
2    2.317017
3    0.211160
4    0.235011
6    0.075007
7    0.126686
dtype: float64

As with numpy arrays, operations are elementwise.

x + x

0    2.504387
1   -1.829948
2    4.634035
3    0.422320
4    0.470022
5   -2.571426
6    0.150015
7    0.253373
8   -0.906846
9   -1.477378
dtype: float64

x ** 2

0    1.567989
1    0.837178
2    5.368569
3    0.044589
4    0.055230
5    1.653058
6    0.005626
7    0.016049
8    0.205592
9    0.545662
dtype: float64

Concatenate using pd.concat. (Older versions of pandas also offered x.append(x), which was removed in pandas 2.0.)

x2 = pd.concat([x, x])
print(x2)

0    1.252194
1   -0.914974
2    2.317017
3    0.211160
4    0.235011
5   -1.285713
6    0.075007
7    0.126686
8   -0.453423
9   -0.738689
0    1.252194
1   -0.914974
2    2.317017
3    0.211160
4    0.235011
5   -1.285713
6    0.075007
7    0.126686
8   -0.453423
9   -0.738689
dtype: float64

The `DataFrame` object¶

# create using a dictionary
data = {'x': stats.uniform().rvs(10), 'y': stats.norm().rvs(10)}
df = pd.DataFrame(data)
print(df)

          x         y
0  0.573899  1.174264
1  0.635064 -0.076698
2  0.300348 -1.862586
3  0.511668  0.231349
4  0.981913  0.392888
5  0.827418  1.784487
6  0.533860  0.408539
7  0.175385 -1.062335
8  0.435950  0.258134
9  0.474151  0.124744

# create using a list
data = [stats.uniform().rvs(10),stats.norm().rvs(10)]
df = pd.DataFrame(data)
print(df)

          0         1         2         3         4         5         6  \
0  0.981552  0.710948  0.158218  0.281315  0.643304  0.762147  0.264676   
1  0.177340 -1.098031 -1.926916  0.336361 -1.004373 -0.771140  0.685671   

          7         8         9  
0  0.472766  0.743807  0.195761  
1 -1.112838 -0.197350 -0.731538

# transpose
df = df.T
print(df)

          0         1
0  0.981552  0.177340
1  0.710948 -1.098031
2  0.158218 -1.926916
3  0.281315  0.336361
4  0.643304 -1.004373
5  0.762147 -0.771140
6  0.264676  0.685671
7  0.472766 -1.112838
8  0.743807 -0.197350
9  0.195761 -0.731538

# rename columns
df.columns = ['x','y']
print(df)

          x         y
0  0.981552  0.177340
1  0.710948 -1.098031
2  0.158218 -1.926916
3  0.281315  0.336361
4  0.643304 -1.004373
5  0.762147 -0.771140
6  0.264676  0.685671
7  0.472766 -1.112838
8  0.743807 -0.197350
9  0.195761 -0.731538

# pick out a column by name
df.x

# equivalent: df['x']
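To pick several columns at once, pass a list of column names inside the brackets, or use loc with a label slice. The small DataFrame below is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'x': [0.1, 0.2, 0.3],
                   'y': [1.0, 2.0, 3.0],
                   'z': ['a', 'b', 'c']})

# a list of names inside the brackets returns a DataFrame
xy = df[['x', 'y']]

# loc selects by label: all rows, columns 'x' through 'z' inclusive
xz = df.loc[:, 'x':'z']

print(list(xy.columns), list(xz.columns))
```

Unlike ordinary Python slices, label slices with loc include the end label.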

Note that columns in a pandas DataFrame are Series objects.

type(df.x)

pandas.core.series.Series

# select columns by index
df2 = df.iloc[:,1]
print(df2)

0    0.177340
1   -1.098031
2   -1.926916
3    0.336361
4   -1.004373
5   -0.771140
6    0.685671
7   -1.112838
8   -0.197350
9   -0.731538
Name: y, dtype: float64

# select rows by index
df2 = df.iloc[0:5,]
print(df2)

          x         y
0  0.981552  0.177340
1  0.710948 -1.098031
2  0.158218 -1.926916
3  0.281315  0.336361
4  0.643304 -1.004373

# select rows by value
df2 = df.where(df['x']>0.5).dropna()
print(df2)

          x         y
0  0.981552  0.177340
1  0.710948 -1.098031
4  0.643304 -1.004373
5  0.762147 -0.771140
8  0.743807 -0.197350
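Another common way to select rows by value is boolean indexing, which skips the intermediate NaN rows. A sketch on a small made-up DataFrame:

```python
import pandas as pd

df = pd.DataFrame({'x': [0.9, 0.2, 0.7, 0.4],
                   'y': [0.1, -1.0, 0.3, 0.6]})

# a boolean Series inside the brackets keeps only the rows where it is True
big_x = df[df['x'] > 0.5]

# combine conditions with & (and) or | (or); parentheses are required
both = df[(df['x'] > 0.5) & (df['y'] > 0.2)]

print(big_x.index.tolist(), both.index.tolist())
```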

Reading in data¶

pandas can read from and write to many different formats: tables, csv, json, Excel, html, SQL, Stata, SAS...

Often much simpler than trying to do it manually using base Python.

# write df to csv
# index=False tells pandas not to write the row names
df.to_csv('test_df.csv',index=False)

# read it back in
df2 = pd.read_csv('test_df.csv')
print(df2)

          x         y
0  0.981552  0.177340
1  0.710948 -1.098031
2  0.158218 -1.926916
3  0.281315  0.336361
4  0.643304 -1.004373
5  0.762147 -0.771140
6  0.264676  0.685671
7  0.472766 -1.112838
8  0.743807 -0.197350
9  0.195761 -0.731538

The Statsmodels package¶

The statsmodels package provides a number of basic statistical models like linear models, GLMs, ANOVA, and more.

For more info see http://www.statsmodels.org.

Simple example: a linear model

# load statsmodels
import statsmodels.api as sm

# make up some data
X = np.array(stats.norm().rvs(100)).reshape(25,4)
beta = [3,0.5,1,2]
epsilon = stats.norm().rvs(25)
Y = np.dot(X,beta) + epsilon

# fit a model
lm = sm.OLS(Y,X)
lm_fit = lm.fit()
lm_fit.summary()

/Users/adamscherling/anaconda/lib/python3.6/site-packages/statsmodels/compat/pandas.py:56: FutureWarning: The pandas.core.datetools module is deprecated and will be removed in a future version. Please use the pandas.tseries module instead.
  from pandas.core import datetools
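To see what the linear model is doing under the hood, the least squares coefficients can be computed directly with numpy. This is a sketch on freshly simulated data (seeded with an arbitrary value), not a reproduction of the fit above. Note also that sm.OLS does not add an intercept on its own; sm.add_constant(X) adds a column of ones if you want one.

```python
import numpy as np

rng = np.random.RandomState(0)       # fixed seed so the sketch is reproducible
X = rng.normal(size=(25, 4))
beta = np.array([3, 0.5, 1, 2])
Y = X.dot(beta) + rng.normal(size=25)

# least squares solution of Y = X b
beta_hat, _, _, _ = np.linalg.lstsq(X, Y, rcond=None)

# the same estimate from the normal equations (X'X) b = X'Y
beta_normal = np.linalg.solve(X.T.dot(X), X.T.dot(Y))

print(beta_hat.round(2))
print(np.allclose(beta_hat, beta_normal))
```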

The scikit-learn package¶

scikit-learn is the go-to package for machine learning in Python. It includes supervised and unsupervised methods, regression and classification, and more. For more information see http://scikit-learn.org.

Using the same data: fit a random forest regression

# import the RandomForestRegressor class
from sklearn.ensemble import RandomForestRegressor

# fit the model
rf_reg = RandomForestRegressor(n_estimators=10)
rf_fit = rf_reg.fit(X, Y)

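# compare the in-sample error to that from the linear model
# calculate fitted values
rf_pred = rf_fit.predict(X)
lm_pred = lm_fit.predict(X)

# calculate the root of the summed squared errors for each model
# (note: this is not the MSE; for the MSE use np.mean((Y-pred)**2))
rf_err = np.sqrt(np.sum((Y-rf_pred)**2))
lm_err = np.sqrt(np.sum((Y-lm_pred)**2))

print(lm_err, rf_err)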

5.34276761181 5.85583374927
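For reference, the numbers printed above are the square root of the summed squared errors, which differs from the mean squared error by a factor that depends on the sample size. The sketch below shows the three related quantities on made-up values.

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.0, 2.0, 4.5])

errors = y_true - y_pred

sse = np.sum(errors**2)       # sum of squared errors
mse = np.mean(errors**2)      # mean squared error = sse / n
rmse = np.sqrt(mse)           # root mean squared error, in the units of y

print(sse, mse, round(rmse, 3))
```

Because both models here are evaluated on the same 25 observations, the ranking is the same whichever of these measures is used.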

# take a look at the fitted values from the two models
plt.plot(Y,Y,label='Y=Y')
plt.scatter(Y, lm_pred, label='lm fit')
plt.scatter(Y, rf_pred, label='rf fit')
plt.xlabel('Y')
plt.ylabel('fitted values')
plt.legend()
plt.show()

OLS Regression Results (output of lm_fit.summary() in the statsmodels example)

Dep. Variable:     y                  R-squared:           0.942
Model:             OLS                Adj. R-squared:      0.931
Method:            Least Squares      F-statistic:         84.71
Date:              Tue, 27 Feb 2018   Prob (F-statistic):  1.21e-12
Time:              12:13:05           Log-Likelihood:      -37.131
No. Observations:  25                 AIC:                 82.26
Df Residuals:      21                 BIC:                 87.14
Df Model:          4
Covariance Type:   nonrobust

      coef  std err       t  P>|t|  [0.025  0.975]
x1  3.2630    0.230  14.157  0.000   2.784   3.742
x2  0.4055    0.199   2.034  0.055  -0.009   0.820
x3  0.9779    0.201   4.870  0.000   0.560   1.395
x4  1.9184    0.247   7.774  0.000   1.405   2.432

Omnibus:         1.407   Durbin-Watson:     1.563
Prob(Omnibus):   0.495   Jarque-Bera (JB):  1.002
Skew:            -0.482  Prob(JB):          0.606
Kurtosis:        2.817   Cond. No.          1.57