Introduction to Python for statistics

Adam Scherling

Stats 404, Winter 2018

Outline

1) Introduction

  • what is Python
  • why use Python
  • installation suggestions
  • Python 2 vs. Python 3

2) Python basics

  • data types
  • syntax
  • basic commands
  • file reading/writing

3) Python packages for statistics & data science

  • numpy
  • matplotlib
  • scipy
  • pandas
  • statsmodels
  • scikit-learn

Introduction

What is Python?

Python is a general-purpose programming language. Like R, it is a high-level, object-oriented language. Unlike R, it was not developed specifically for the practice of statistics. However, a number of useful packages have been developed in Python for mathematics, statistics, and data science.

Why use Python?

Having been developed in the computer science community, Python is used heavily for data science and machine learning. Some reasons you might use Python rather than R:

  • You're interested in using a particular package in Python (e.g. pandas, scikit-learn)
  • You want to integrate your code into a project (e.g. a website) that is already coded in Python
  • Personal preference: you're more comfortable coding in Python (functionality, syntax)
  • Your coworkers or collaborators are using Python

Disadvantages:

  • Documentation isn't great compared to R (but still pretty good)
  • Working with data can be a little clunky compared to R

My advice: give it a try and see if you like it!

Anaconda

The Anaconda distribution of Python is an easy way to install Python along with the most useful packages. You can download it here: https://www.continuum.io/downloads

Python 2.7 vs 3.6

There are two widely supported versions of Python. Apart from some slight differences in syntax (e.g. whether print requires parentheses), the practical differences are usually small. I'm using Python 3 and would recommend doing the same unless you have a prior reason to do otherwise (working with older Python code, coworkers using Python 2, etc.).

Note:

This presentation was prepared using Jupyter Notebook: http://jupyter.org

The basics

Comments

Like R, comments in Python use the # character.

In [1]:
# This code does nothing.

Variable names

You can include underscores and numbers in variable names, but not periods.

The period is saved for methods on objects.

Note: like R, no need to declare variable types.

In [2]:
# no problem
my_var = 0
var0 = 0
In [3]:
# problem
my.var = 0
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-3-34465469c6e3> in <module>()
      1 # problem
----> 2 my.var = 0

NameError: name 'my' is not defined

A brief note about methods

Explanation: Python reads my.var as the attribute var of an object called my. Since no object named my exists, it raises a NameError.

A method is a function that is saved as an attribute of an object and generally manipulates that object in some way.
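
As a small illustration of the distinction (a sketch; the variable names are made up for this example), a method is called with dot notation on the object itself, while an ordinary function takes the object as an argument:

```python
# illustrative names only; not part of the original notebook
word = 'taco'
nums = [3, 1, 2]

# upper() is a method of the string object; it returns a new string
shout = word.upper()

# append() is a method of the list object; it modifies the list in place
nums.append(4)

# len() is an ordinary function, not a method
n = len(nums)

print(shout, nums, n)
```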

Declaring multiple variables

A little trick: you can declare or set multiple variables at once.

In [4]:
x, y = 5, 10

print(x)
print(y)
5
10

ints and floats

Like R, generally no need to declare the class of a number.

In [5]:
x = 5
y = 2.75
z = 8 / 3

print(x,y,z)
5 2.75 2.6666666666666665
In [6]:
print(type(x), type(y), type(z))
<class 'int'> <class 'float'> <class 'float'>

Strings

Like R, strings can be defined using either single quotes '' or double quotes "".

len() returns the length of a string.

Concatenate strings using +.

In [7]:
x = 'taco'

print(len(x))
4
In [8]:
y = 'sauce'
z = x + ' ' + y

print(z)
taco sauce

Split a string using the split() method.

In [9]:
z = z.split(' ')

print(z)
['taco', 'sauce']

You can also collapse a list of strings using join. (Note: join is a method of all strings.)

In [10]:
x = ['One', 'two', 'red', 'blue']
x1 = ', '.join(x)

print(x1)
One, two, red, blue

Tuples, lists, and dictionaries

The objects in square brackets above are lists.

Python has several different data types for combining multiple elements into a single object.

In [11]:
# list
x = [1,2,3]

# tuple
y = (4,5,6)

# dictionary
z = {'a':7, 'b':8, 'c':9}

Why these different data structures?

  • Lists are simple and relatively easy to work with.
  • Tuples take up less memory than lists, but are immutable: once you've created one, you can't make any edits.
  • Dictionaries allow for non-numeric indexing and offer fast lookup and retrieval by key.
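
The trade-offs listed above can be seen in a short sketch. The data here are hypothetical and chosen only for illustration:

```python
# hypothetical example data
point = (4, 5, 6)
ages = {'ann': 31, 'bob': 27}

# tuples are immutable: item assignment raises a TypeError
try:
    point[0] = 10
    tuple_changed = True
except TypeError:
    tuple_changed = False

# dictionaries look values up by key, and keys can be tested with `in`
has_bob = 'bob' in ages
bob_age = ages['bob']

print(tuple_changed, has_bob, bob_age)
```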

len() returns the length of these objects.

In [12]:
print(len(x), len(y), len(z))
3 3 3

Note: len() also works for strings.

In [13]:
len("apple")
Out[13]:
5

Notably, these data structures can include any kind of object:

In [14]:
mylist = [1,'taco',x,y,z]
mytuple = (1,'taco',x,y,z)
mydictionary = {0:1, 1:'taco', 2:x, 3:y, 4:z}

print(mylist)
print(mytuple)
print(mydictionary)
[1, 'taco', [1, 2, 3], (4, 5, 6), {'a': 7, 'b': 8, 'c': 9}]
(1, 'taco', [1, 2, 3], (4, 5, 6), {'a': 7, 'b': 8, 'c': 9})
{0: 1, 1: 'taco', 2: [1, 2, 3], 3: (4, 5, 6), 4: {'a': 7, 'b': 8, 'c': 9}}

A few useful methods of dictionaries:

In [15]:
z = {'a':7, 'b':8, 'c':9}

list(z.keys())
Out[15]:
['a', 'b', 'c']
In [16]:
list(z.values())
Out[16]:
[7, 8, 9]
In [17]:
list(z.items())
Out[17]:
[('a', 7), ('b', 8), ('c', 9)]

Indexing

List and tuple elements (and substrings) can be accessed by index. Python starts indexes with 0, rather than 1.

Dictionary entries are accessed using the dictionary keys.

In [18]:
# list
x = [1,2,3]

# tuple
y = (4,5,6)

# dictionary
z = {'a':7, 'b':8, 'c':9}

# string
s = "sasquatch"

# print an individual element of each object
print(x[0], y[2], z['a'], s[3])
1 6 7 q

Unlike R, negative indexes do not remove an element from a list.

Instead, negative indexes count from the end, e.g. x[-1] is the last entry in x, x[-2] is the second to last, etc.

In [19]:
x = [1,2,3]

print(x[0],x[1],x[2])
print(x[-1],x[-2],x[-3])
1 2 3
3 2 1

Slicing

Subset a list or tuple (or string) using square brackets and a colon. Think of the indexes as the walls separating different boxes. To return the contents of multiple boxes, you have to give the indexes of the outside walls.

In [20]:
# Example:
x = [8,9,10]
In [21]:
print("indexes:  0   1   2    3\nelements: | 8 | 9 | 10 |")
indexes:  0   1   2    3
elements: | 8 | 9 | 10 |
In [22]:
print(x[0:2], x[1:3])
[8, 9] [9, 10]

Note that you can leave out the numbers on either side of the colon to extend the slice as far as possible.

In [23]:
print(x[0:3], x[-3:], x[:])
[8, 9, 10] [8, 9, 10] [8, 9, 10]

Slicing also works with strings.

In [24]:
s = "sasquatch"

print(s[0:6])
sasqua

Accessing non-consecutive elements is a little tricky. One way of doing so is by list comprehension, which I'll talk about more later.

In [25]:
x = [2,4,6,8]

# take the subset corresponding to indexes 1 and 3
x_sub = [x[i] for i in [1,3]]
print(x_sub)
[4, 8]
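
When the elements you want are evenly spaced, a slice with a step (start:stop:step) may be simpler than a list comprehension. A minimal sketch using the same list:

```python
x = [2, 4, 6, 8]

# start at index 1 and take every second element
every_other = x[1::2]

# a step of -1 reverses the list
reversed_x = x[::-1]

print(every_other, reversed_x)
```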

A note of caution: copying variables

When you set a variable equal to another variable, Python does not create a separate copy of the object. Both names point to the same object in memory.

If you edit the object, both variables will change.

In [26]:
# problem
x = [1,2,3]
y = x
y[0] = 10

print(x)
[10, 2, 3]

Safe ways of copying a list: slice it, use list(), or use the copy() method. (These are shallow copies: nested objects inside the list are still shared.)

In [27]:
x = [1,2,3]

# make a copy by slicing x
a = x[:]

# make a copy using list()
b = list(x)

# make a copy using the x.copy() method
c = x.copy()

# make changes to the copies
a[0] = 10
b[1] = 20
c[2] = 30

print('x: ', x)
print('a: ', a)
print('b: ', b)
print('c: ', c)
x:  [1, 2, 3]
a:  [10, 2, 3]
b:  [1, 20, 3]
c:  [1, 2, 30]
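
One caveat worth knowing: slicing, list(), and copy() duplicate only the outer list. If the list contains other lists, the inner lists are still shared. The sketch below (with a made-up nested list) contrasts this with copy.deepcopy from the standard library:

```python
import copy

nested = [[1, 2], [3, 4]]

shallow = nested.copy()        # new outer list, same inner lists
deep = copy.deepcopy(nested)   # inner lists are copied too

# editing an inner list through the shallow copy also changes the original
shallow[0][0] = 10
# editing through the deep copy does not
deep[1][1] = 40

print(nested)
print(shallow)
print(deep)
```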

If statements

In [28]:
x = 5

if (x > 10):
    print('x is greater than 10')
elif (x < 0):
    print('x is negative')
else:
    print('x is between 0 and 10')
x is between 0 and 10

If you like, you can put an if statement on a single line.

In [29]:
y = 7

if y==7: print('y is seven!')
y is seven!

Note: booleans in Python are True and False. You can also use 0 and 1.

In [30]:
if True:
    print('yes')

if False:
    print('no')
yes
In [31]:
if 1:
    print('yes')

if 0:
    print('no')
yes

Loops

In [32]:
x = ['apple','banana','cherry']

for fruit in x:
    print(fruit)
apple
banana
cherry
In [33]:
for i in range(3):
    print(i)
0
1
2
In [34]:
for a,b in [(1,1),(2,4),(3,9)]:
    print(a,'squared equals',b)
1 squared equals 1
2 squared equals 4
3 squared equals 9
In [35]:
i = 10

while (i < 13):
    print(i)
    i+=1
10
11
12

Functions

In [36]:
def square(x):
    return(x**2)

print(square(4))
16
In [37]:
# like R, you can give default parameter values
def prod(x=1,y=2):
    return(x*y)

print(prod(), prod(2), prod(3,4))
2 4 12
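
Also like R, arguments can be passed by name. A short sketch reusing the prod function defined above:

```python
def prod(x=1, y=2):
    return x * y

# pass arguments by name, in any order, to override only some defaults
a = prod(y=5)        # x keeps its default of 1
b = prod(y=3, x=4)

print(a, b)
```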

Indentation

In Python, indentation is part of the syntax. If you fail to indent properly, Python raises an error. This is particularly relevant for if statements, loops, and functions.

In [38]:
if True:
print('true')
  File "<ipython-input-38-ce54c251ecbd>", line 2
    print('true')
        ^
IndentationError: expected an indented block
In [39]:
for fruit in x:
print(fruit)
  File "<ipython-input-39-4a529fdc0120>", line 2
    print(fruit)
        ^
IndentationError: expected an indented block
In [40]:
def square(x):
return(x**2)
  File "<ipython-input-40-3d5cef0cc8ea>", line 2
    return(x**2)
         ^
IndentationError: expected an indented block

Basic operations on tuples and lists

For tuples and lists, the + operator performs concatenation and * performs repetition.

In [41]:
x = [1,2,3]
y = [4,5,6]

print(x + y)
[1, 2, 3, 4, 5, 6]
In [42]:
print(x * 3)
[1, 2, 3, 1, 2, 3, 1, 2, 3]
In [43]:
# create a long list of zeros
x = [0] * 50

print(x)
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
In [44]:
# or a list of None placeholders
x = [None] * 10

print(x)
[None, None, None, None, None, None, None, None, None, None]
In [45]:
# The following all cause errors:
x + 2
x - y
x / 2
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-45-845de10f1db6> in <module>()
      1 # The following all cause errors:
----> 2 x + 2
      3 x - y
      4 x / 2

TypeError: can only concatenate list (not "int") to list

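Sort a list in place using the sort() method. (Tuples are immutable and have no sort() method; the built-in sorted() function works on both and returns a new list.)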

In [46]:
x = [2,3,1]
x.sort()

print(x)
[1, 2, 3]
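
The sort() method changes a list in place. The built-in sorted() function instead returns a new list and accepts any iterable, including tuples. A minimal sketch:

```python
t = (2, 3, 1)

# sorted() accepts any iterable and returns a new list
ascending = sorted(t)
descending = sorted(t, reverse=True)

# convert back to a tuple if needed
t_sorted = tuple(ascending)

print(ascending, descending, t_sorted)
```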

Manipulating lists and dictionaries

Python employs list comprehension and dictionary comprehension for creating and manipulating lists and dictionaries.

Recall that range() can be used to create a range of integers.

In [47]:
x = range(4,12,2)

print(list(x))
[4, 6, 8, 10]

Great. But what if you want decimals? Or something more fancy?

In [48]:
# decimals
x = [i/10 for i in range(10)]

print(x)
[0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
In [49]:
# a familiar sequence?
x = [(1/i) * (-1)**(i+1) for i in range(1,10)]

print(x)
[1.0, -0.5, 0.3333333333333333, -0.25, 0.2, -0.16666666666666666, 0.14285714285714285, -0.125, 0.1111111111111111]

This is also useful for manipulating an existing list. Suppose you want to apply a function to every number in a list:

In [50]:
x = [3,2,6,4,12]
x_squared = [i**2 for i in x]

print(x_squared)
[9, 4, 36, 16, 144]

Or suppose you want to filter a list:

In [51]:
y = [v for v in x if v > 5]

print(y)
[6, 12]

The same general idea also works for dictionaries.

You do have to use the items() method, which returns the dictionary's (key, value) pairs so that the loop can unpack them.

In [52]:
x = {'a':4, 'b':7, 'c':10}

y = {k:v**2 for k,v in x.items()}

print(y)
{'a': 16, 'b': 49, 'c': 100}
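
Dictionary comprehensions can also build a dictionary from two lists or filter an existing one. A sketch with made-up keys and values:

```python
keys = ['a', 'b', 'c']
values = [4, 7, 10]

# build a dictionary from two lists
d = {k: v for k, v in zip(keys, values)}

# keep only the entries whose value exceeds 5
big = {k: v for k, v in d.items() if v > 5}

print(d)
print(big)
```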

Reading and writing to files

In this regard, Python is quite similar to C.

In [53]:
# create some data to write to file
x = [1,2,3,4,5]
squares = [i**2 for i in x]
colors = ['red','blue','yellow','green','red']

# convert x and squares to strings
x = [str(i) for i in x]
squares = [str(i) for i in squares]

print(x)
print(squares)
['1', '2', '3', '4', '5']
['1', '4', '9', '16', '25']
In [54]:
# zip the three variables together
mydata = list(zip(x, squares, colors))
mydata
Out[54]:
[('1', '1', 'red'),
 ('2', '4', 'blue'),
 ('3', '9', 'yellow'),
 ('4', '16', 'green'),
 ('5', '25', 'red')]
In [55]:
# open a file for writing
f = open('testfile.csv','w')

# write one row at a time from mydata
# the for loop automatically pulls one element at a time (in this case one tuple at a time)
for row in mydata:
    f.write(','.join(row) + '\n')
    
# close the file
f.close()

Now to read the data back in:

In [56]:
# look at the raw data
# think about how this can be parsed
f = open('testfile.csv', 'r')

f.read()
Out[56]:
'1,1,red\n2,4,blue\n3,9,yellow\n4,16,green\n5,25,red\n'
In [57]:
# read in the data
f = open('testfile.csv', 'r')

x2, squares2, colors2 = [], [], []
for line in f:
    # note: strip() takes out the \n
    tmp = line.strip().split(',')
    x2.append(tmp[0])
    squares2.append(tmp[1])
    colors2.append(tmp[2])

# close the file
f.close()

print(x2)
print(squares2)
print(colors2)
['1', '2', '3', '4', '5']
['1', '4', '9', '16', '25']
['red', 'blue', 'yellow', 'green', 'red']
In [58]:
# convert the numeric strings to integers
x2 = [int(i) for i in x2]
squares2 = [int(i) for i in squares2]

print(x2)
print(squares2)
[1, 2, 3, 4, 5]
[1, 4, 9, 16, 25]
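
An alternative worth knowing is the with statement, which closes the file for you, together with the csv module from the standard library, which handles the joining and splitting. The sketch below is one possible approach rather than the method used above; the file name and the temporary directory are assumptions made for the example:

```python
import csv
import os
import tempfile

rows = [(1, 1, 'red'), (2, 4, 'blue'), (3, 9, 'yellow')]

# write to a temporary location so no existing file is overwritten
path = os.path.join(tempfile.mkdtemp(), 'example.csv')

# the with statement closes the file automatically, even if an error occurs
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# csv.reader splits each line into a list of strings
with open(path, 'r', newline='') as f:
    read_back = [row for row in csv.reader(f)]

print(read_back)
```

As with the manual approach, everything comes back as strings, so numeric columns still need to be converted with int() or float().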

View or set the working directory

In [59]:
# import the os package. (More about packages next.)
import os

# view the working directory
os.getcwd()
Out[59]:
'/Users/adamscherling/Documents/ucla/stat404/python'
In [60]:
# change the working directory
os.chdir('/Users/adamscherling/Documents/ucla/stat404')
os.getcwd()
Out[60]:
'/Users/adamscherling/Documents/ucla/stat404'

The numpy package

R is designed so that everything is a vector, which is great for statistics. In Python, vectors do not come as naturally.

The numpy package makes life easier. It includes many useful tools for mathematics and statistics. For more information see http://www.numpy.org.

Packages are loaded in Python using import. It is common to specify an abbreviated name for loaded packages; these names help prevent mistakes if the same function is defined in multiple packages.

In [61]:
# it's standard practice to use the np prefix for numpy
import numpy as np

Arrays

Numpy arrays are convenient for representing vectors and matrices.

np.arange creates an array rather than a range object or list

In [62]:
np.arange(10)
Out[62]:
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

np.arange also allows non-integer step sizes

In [63]:
np.arange(0,3,0.2)
Out[63]:
array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ,  1.2,  1.4,  1.6,  1.8,  2. ,
        2.2,  2.4,  2.6,  2.8])

The linspace command is also helpful:

In [64]:
np.linspace(0,1,10)
Out[64]:
array([ 0.        ,  0.11111111,  0.22222222,  0.33333333,  0.44444444,
        0.55555556,  0.66666667,  0.77777778,  0.88888889,  1.        ])

Unlike operations on Python lists, array operations are elementwise.

In [65]:
x = np.array([1,2,3])
y = np.array([4,5,6])

print(x + y)
[5 7 9]
In [66]:
print(x**2)
[1 4 9]
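
To make the contrast with lists concrete, here is a small side-by-side sketch (the variable names are chosen for illustration):

```python
import numpy as np

a_list = [1, 2, 3]
an_array = np.array([1, 2, 3])

# for a list, * means repetition
repeated = a_list * 2

# for an array, arithmetic is applied to each element
doubled = an_array * 2
shifted = an_array + 2   # the scalar is applied to every element

print(repeated)
print(doubled, shifted)
```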

Arrays can be reshaped to use as matrices.

In [67]:
x = np.array([1,2,2,4,3,6]).reshape(3,2)

print(x)
[[1 2]
 [2 4]
 [3 6]]

Special types of matrices can be created easily.

In [68]:
np.zeros([3,2])
Out[68]:
array([[ 0.,  0.],
       [ 0.,  0.],
       [ 0.,  0.]])
In [69]:
np.ones([2,3])
Out[69]:
array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])
In [70]:
np.diag([1,2,3])
Out[70]:
array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])

You still have to be careful about making copies.

In [71]:
x = np.arange(5)
y = x
y[0] = 17

print(x)
[17  1  2  3  4]
In [72]:
x = np.arange(5)
y = x.copy()
y[0] = 17

print(x)
print(y)
[0 1 2 3 4]
[17  1  2  3  4]

Concatenate matrices and arrays

In [73]:
# note the use of chain syntax; it's often useful to chain commands together
x = np.array([1,2,2,4,3,6,4,8]).reshape(4,2)

print(x)
[[1 2]
 [2 4]
 [3 6]
 [4 8]]

To concatenate arrays, the dimensions must align. Even for one-dimensional vectors you have to be explicit about the shape.

In [74]:
y = np.array([3,6,9,12])
x = np.concatenate([x,y],axis=1)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-74-2564b3f87d13> in <module>()
      1 y = np.array([3,6,9,12])
----> 2 x = np.concatenate([x,y],axis=1)

ValueError: all the input arrays must have same number of dimensions
In [75]:
y = np.array([3,6,9,12]).reshape(4,1)
x = np.concatenate([x,y],axis=1)

print(x)
[[ 1  2  3]
 [ 2  4  6]
 [ 3  6  9]
 [ 4  8 12]]
In [76]:
z = np.array([5,10,15]).reshape(1,3)
x = np.concatenate([x,z],axis=0)

print(x)
[[ 1  2  3]
 [ 2  4  6]
 [ 3  6  9]
 [ 4  8 12]
 [ 5 10 15]]
In [77]:
x = np.concatenate([x,x,x], axis=1)
print(x)
[[ 1  2  3  1  2  3  1  2  3]
 [ 2  4  6  2  4  6  2  4  6]
 [ 3  6  9  3  6  9  3  6  9]
 [ 4  8 12  4  8 12  4  8 12]
 [ 5 10 15  5 10 15  5 10 15]]

Matrix algebra

Matrix multiplication is available through the dot method of arrays.

In [78]:
x = np.array([1,2,2,4,3,6]).reshape(3,2)
y = np.array([1,1])

print('x: ', x, '\n')
print('y: ', y, '\n')
print(x.dot(y))

# note: x.dot(y) is equivalent to np.dot(x,y)
x:  [[1 2]
 [2 4]
 [3 6]] 

y:  [1 1] 

[3 6 9]
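
In Python 3.5 and later, the @ operator is another way to write the same product. The sketch below also shows that * remains elementwise, which is a common source of confusion:

```python
import numpy as np

x = np.array([1, 2, 2, 4, 3, 6]).reshape(3, 2)
y = np.array([1, 1])

# matrix-vector product with the @ operator (Python 3.5+)
xy = x @ y

# by contrast, * is elementwise (y is broadcast across the rows of x)
elementwise = x * y

print(xy)
print(elementwise)
```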

Transpose a matrix:

In [79]:
x.T
Out[79]:
array([[1, 2, 3],
       [2, 4, 6]])

Invert a matrix:

In [80]:
x = np.arange(4).reshape(2,2)
print(x)
[[0 1]
 [2 3]]
In [81]:
y = np.linalg.inv(x)
print(y)
[[-1.5  0.5]
 [ 1.   0. ]]
In [82]:
print(np.dot(x,y))
[[ 1.  0.]
 [ 0.  1.]]

Find eigenvalues and eigenvectors:

In [83]:
eig = np.linalg.eig(x)
val = eig[0]
vec = eig[1]

print(val, '\n')
print(vec)
[-0.56155281  3.56155281] 

[[-0.87192821 -0.27032301]
 [ 0.48963374 -0.96276969]]
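
A quick way to check the result is to confirm the defining property: multiplying the matrix by each eigenvector should give that eigenvector scaled by its eigenvalue. A minimal sketch, assuming the same 2x2 matrix as above:

```python
import numpy as np

x = np.arange(4).reshape(2, 2)
val, vec = np.linalg.eig(x)

# each column of vec is an eigenvector: x @ v equals lambda * v
lhs = x @ vec
rhs = vec * val   # scales column j by val[j]

print(np.allclose(lhs, rhs))
```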

Other useful numpy functions

In [84]:
np.pi
Out[84]:
3.141592653589793
In [85]:
x = np.arange(3)

# note that these allow the input to be an array
y = np.exp(x)
z = np.log(y)

print(y.round(2))
print(z)
[ 1.    2.72  7.39]
[ 0.  1.  2.]
In [86]:
x = np.linspace(0,2*np.pi,5)
y = np.sin(x)

print(x.round(2))
print(y.round(2))
[ 0.    1.57  3.14  4.71  6.28]
[ 0.  1.  0. -1. -0.]

Plotting using matplotlib

Matplotlib is the standard plotting package in Python. For a tutorial see https://matplotlib.org/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py.

Other packages such as ggplot and seaborn are also available.

In [87]:
import matplotlib.pyplot as plt
In [88]:
x = np.linspace(0,2*np.pi,1000)
y = np.sin(x)

plt.plot(x,y)
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.show()

Random variables using the scipy package

R has built-in functionality for working with common probability distributions. In Python similar functions are available through the stats portion of the scipy package.

scipy has a lot more than just probability distributions; there are sub-libraries for topics such as linear algebra, optimization, Fourier transforms, signal processing, and more.

For more information about scipy see https://docs.scipy.org/doc/scipy/reference/.

In [89]:
from scipy import stats
In [90]:
# plot normal pdf
x = np.linspace(-4,4,1000)
y = stats.norm().pdf(x)

plt.plot(x,y)
plt.show()
In [91]:
# plot normal cdf
# (of course, you can also use this for Z tests, etc.)
x = np.linspace(-4,4,1000)
y = stats.norm().cdf(x)

plt.plot(x,y)
plt.show()
In [92]:
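# generate random variables from a distribution
# (note: you can also do this in numpy using np.random)
# scipy distributions are parameterized by loc and scale:
# norm(3,2) has mean 3 and standard deviation 2, while expon(4) is a
# standard exponential shifted by loc=4 (use scale=1/rate to set a rate)
x = stats.norm(3,2).rvs(3)
y = stats.expon(4).rvs(4)
z = stats.poisson(1).rvs(5)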

print(x.round(2))
print(y.round(2))
print(z)
[ 0.21  4.77  3.84]
[  4.07  11.07   4.6   12.06]
[0 0 0 1 2]
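
Two details are easy to miss when coming from R. First, scipy's continuous distributions take loc and scale arguments, so an exponential with rate 4 is written with scale=1/4. Second, rvs accepts a random_state argument for reproducible draws. The sketch below illustrates both; the seed value is arbitrary:

```python
from scipy import stats

# norm takes loc (mean) and scale (standard deviation)
normal = stats.norm(loc=3, scale=2)

# for expon, a rate of 4 corresponds to scale = 1/4
exponential = stats.expon(scale=1/4)

# passing the same random_state reproduces the same draws
draw1 = normal.rvs(3, random_state=404)
draw2 = normal.rvs(3, random_state=404)

print(draw1.round(2))
print((draw1 == draw2).all())
print(exponential.mean())
```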

The pandas package

The pandas package provides its own universe of special objects and functions for easily reading in and manipulating data, time series functionality, and other useful tools. In many ways pandas mimics R, but it also has some interesting tricks and tools of its own.

For more information see http://pandas.pydata.org/pandas-docs/stable/.

In [93]:
import pandas as pd

The Series object

pandas has its own answer to vectorizing Python: the Series object. These have an immense number of special attributes. See https://pandas.pydata.org/pandas-docs/stable/api.html#series.

They can also be easily converted into numpy arrays if you so choose.

In [94]:
x = pd.Series(stats.norm().rvs(10))
print(x)
0    1.252194
1   -0.914974
2    2.317017
3    0.211160
4    0.235011
5   -1.285713
6    0.075007
7    0.126686
8   -0.453423
9   -0.738689
dtype: float64
In [95]:
# some of the many available summary statistics
[x.min(), x.quantile(0.25), x.median(), x.quantile(0.75), x.max(), x.mean(), x.var(), x.autocorr()]
Out[95]:
[-1.2857132036083179,
 -0.6673725452398099,
 0.10084687854416731,
 0.2290481375996089,
 2.3170173620705352,
 0.08242761805914933,
 1.1368443837056643,
 -0.30413002684341717]
In [96]:
# find unique values
x.round().unique()
Out[96]:
array([ 1., -1.,  2.,  0.])
In [97]:
# filter
x.where(x > 0)
Out[97]:
0    1.252194
1         NaN
2    2.317017
3    0.211160
4    0.235011
5         NaN
6    0.075007
7    0.126686
8         NaN
9         NaN
dtype: float64
In [98]:
# filter and remove missing values
x.where(x > 0).dropna()
Out[98]:
0    1.252194
2    2.317017
3    0.211160
4    0.235011
6    0.075007
7    0.126686
dtype: float64
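
Another common way to filter is boolean indexing, which drops the non-matching rows directly instead of replacing them with NaN. A sketch on a small made-up Series:

```python
import pandas as pd

s = pd.Series([1.25, -0.91, 2.32, 0.21, -1.29])

# a boolean Series used as an index keeps only the True rows
positive = s[s > 0]

# conditions combine with & (and) and | (or); the parentheses are required
middle = s[(s > 0) & (s < 2)]

print(positive)
print(middle)
```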

As with numpy arrays, operations are elementwise.

In [99]:
x + x
Out[99]:
0    2.504387
1   -1.829948
2    4.634035
3    0.422320
4    0.470022
5   -2.571426
6    0.150015
7    0.253373
8   -0.906846
9   -1.477378
dtype: float64
In [100]:
x ** 2
Out[100]:
0    1.567989
1    0.837178
2    5.368569
3    0.044589
4    0.055230
5    1.653058
6    0.005626
7    0.016049
8    0.205592
9    0.545662
dtype: float64

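Concatenate using pd.concat. (Older versions of pandas also offered x.append(x), but the append method was removed in pandas 2.0.)

In [101]:
x2 = pd.concat([x, x])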
print(x2)
0    1.252194
1   -0.914974
2    2.317017
3    0.211160
4    0.235011
5   -1.285713
6    0.075007
7    0.126686
8   -0.453423
9   -0.738689
0    1.252194
1   -0.914974
2    2.317017
3    0.211160
4    0.235011
5   -1.285713
6    0.075007
7    0.126686
8   -0.453423
9   -0.738689
dtype: float64

The DataFrame object

In [102]:
# create using a dictionary
data = {'x': stats.uniform().rvs(10), 'y': stats.norm().rvs(10)}
df = pd.DataFrame(data)
print(df)
          x         y
0  0.573899  1.174264
1  0.635064 -0.076698
2  0.300348 -1.862586
3  0.511668  0.231349
4  0.981913  0.392888
5  0.827418  1.784487
6  0.533860  0.408539
7  0.175385 -1.062335
8  0.435950  0.258134
9  0.474151  0.124744
In [103]:
# create using a list
data = [stats.uniform().rvs(10),stats.norm().rvs(10)]
df = pd.DataFrame(data)
print(df)
          0         1         2         3         4         5         6  \
0  0.981552  0.710948  0.158218  0.281315  0.643304  0.762147  0.264676   
1  0.177340 -1.098031 -1.926916  0.336361 -1.004373 -0.771140  0.685671   

          7         8         9  
0  0.472766  0.743807  0.195761  
1 -1.112838 -0.197350 -0.731538  
In [104]:
# transpose
df = df.T
print(df)
          0         1
0  0.981552  0.177340
1  0.710948 -1.098031
2  0.158218 -1.926916
3  0.281315  0.336361
4  0.643304 -1.004373
5  0.762147 -0.771140
6  0.264676  0.685671
7  0.472766 -1.112838
8  0.743807 -0.197350
9  0.195761 -0.731538
In [105]:
# rename columns
df.columns = ['x','y']
print(df)
          x         y
0  0.981552  0.177340
1  0.710948 -1.098031
2  0.158218 -1.926916
3  0.281315  0.336361
4  0.643304 -1.004373
5  0.762147 -0.771140
6  0.264676  0.685671
7  0.472766 -1.112838
8  0.743807 -0.197350
9  0.195761 -0.731538
In [106]:
# pick out a column by name
df.x

# equivalent: df['x']
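
To pick out several columns at once, pass a list of column names. The sketch below uses a small hypothetical DataFrame with three columns rather than the df above:

```python
import pandas as pd

# hypothetical data with three columns
df = pd.DataFrame({'x': [1, 2, 3], 'y': [4, 5, 6], 'z': [7, 8, 9]})

# a list of names inside the brackets returns a DataFrame
sub = df[['x', 'z']]

# .loc selects rows and columns by label at the same time
# note: label slices include the end label
sub2 = df.loc[0:1, ['x', 'y']]

print(sub)
print(sub2)
```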

Note that columns in a pandas DataFrame are Series objects.

In [107]:
type(df.x)
Out[107]:
pandas.core.series.Series
In [108]:
# select columns by index
df2 = df.iloc[:,1]
print(df2)
0    0.177340
1   -1.098031
2   -1.926916
3    0.336361
4   -1.004373
5   -0.771140
6    0.685671
7   -1.112838
8   -0.197350
9   -0.731538
Name: y, dtype: float64
In [109]:
# select rows by index
df2 = df.iloc[0:5,]
print(df2)
          x         y
0  0.981552  0.177340
1  0.710948 -1.098031
2  0.158218 -1.926916
3  0.281315  0.336361
4  0.643304 -1.004373
In [110]:
# select rows by value
# (boolean indexing is equivalent here: df[df['x'] > 0.5])
df2 = df.where(df['x']>0.5).dropna()
print(df2)
          x         y
0  0.981552  0.177340
1  0.710948 -1.098031
4  0.643304 -1.004373
5  0.762147 -0.771140
8  0.743807 -0.197350

Reading in data

pandas can read from and write to many different formats: tables, csv, json, Excel, html, SQL, Stata, SAS...

This is often much simpler than trying to do it manually using base Python.

In [111]:
# write df to csv
# index=False tells pandas not to write the row names
df.to_csv('test_df.csv',index=False)
In [112]:
# read it back in
df2 = pd.read_csv('test_df.csv')
print(df2)
          x         y
0  0.981552  0.177340
1  0.710948 -1.098031
2  0.158218 -1.926916
3  0.281315  0.336361
4  0.643304 -1.004373
5  0.762147 -0.771140
6  0.264676  0.685671
7  0.472766 -1.112838
8  0.743807 -0.197350
9  0.195761 -0.731538

The statsmodels package

The statsmodels package provides a number of basic statistical models like linear models, GLMs, ANOVA, and more.

For more info see http://www.statsmodels.org.

Simple example: a linear model

In [113]:
# load statsmodels
import statsmodels.api as sm

# make up some data
X = np.array(stats.norm().rvs(100)).reshape(25,4)
beta = [3,0.5,1,2]
epsilon = stats.norm().rvs(25)
Y = np.dot(X,beta) + epsilon

# fit a model
# note: sm.OLS does not add an intercept automatically (see sm.add_constant)
lm = sm.OLS(Y,X)
lm_fit = lm.fit()
lm_fit.summary()
Out[113]:
OLS Regression Results

Dep. Variable:                   y    R-squared:              0.942
Model:                         OLS    Adj. R-squared:         0.931
Method:              Least Squares    F-statistic:            84.71
Date:             Tue, 27 Feb 2018    Prob (F-statistic):  1.21e-12
Time:                     12:13:05    Log-Likelihood:       -37.131
No. Observations:               25    AIC:                    82.26
Df Residuals:                   21    BIC:                    87.14
Df Model:                        4
Covariance Type:         nonrobust

        coef   std err         t     P>|t|    [0.025    0.975]
x1    3.2630     0.230    14.157     0.000     2.784     3.742
x2    0.4055     0.199     2.034     0.055    -0.009     0.820
x3    0.9779     0.201     4.870     0.000     0.560     1.395
x4    1.9184     0.247     7.774     0.000     1.405     2.432

Omnibus:           1.407    Durbin-Watson:         1.563
Prob(Omnibus):     0.495    Jarque-Bera (JB):      1.002
Skew:             -0.482    Prob(JB):              0.606
Kurtosis:          2.817    Cond. No.               1.57
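
The coefficients in the summary are ordinary least squares estimates, so they can be cross-checked with numpy alone. The sketch below simulates fresh data in the same way (so its numbers will differ from the table above) and uses an arbitrary seed; it is a sanity check under those assumptions, not a replacement for statsmodels:

```python
import numpy as np

rng = np.random.default_rng(404)   # arbitrary seed, for reproducibility
X = rng.normal(size=(25, 4))
beta = np.array([3, 0.5, 1, 2])
Y = X @ beta + rng.normal(size=25)

# least-squares solution of X b = Y (no intercept, as in the model above)
beta_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# the same estimate from the normal equations (X'X) b = X'Y
beta_ne = np.linalg.solve(X.T @ X, X.T @ Y)

print(beta_hat.round(3))
print(np.allclose(beta_hat, beta_ne))
```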

The scikit-learn package

scikit-learn is the go-to package for machine learning in Python. It includes supervised and unsupervised methods, regression and classification, and more. For more information see http://scikit-learn.org.

Using the same data: fit a random forest regression

In [114]:
# import the RandomForestRegressor class
from sklearn.ensemble import RandomForestRegressor

# fit the model
rf_reg = RandomForestRegressor(n_estimators=10)
rf_fit = rf_reg.fit(X, Y)

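# compare the in-sample error to that from the linear model
# calculate fitted values
rf_pred = rf_fit.predict(X)
lm_pred = lm_fit.predict(X)

# calculate the root of the sum of squared errors for each model
# (note: this is not the MSE, which would be np.mean((Y-pred)**2))
rf_err = np.sqrt(np.sum((Y-rf_pred)**2))
lm_err = np.sqrt(np.sum((Y-lm_pred)**2))

print(lm_err, rf_err)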
5.34276761181 5.85583374927
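
The quantity printed above is the square root of the sum of squared residuals. For reference, a minimal sketch (on made-up numbers) of how the sum of squared errors, the mean squared error, and its root relate:

```python
import numpy as np

# hypothetical observed and fitted values
y_obs = np.array([1.0, 2.0, 3.0, 4.0])
y_fit = np.array([1.5, 1.5, 3.0, 5.0])

resid = y_obs - y_fit
sse = np.sum(resid**2)       # sum of squared errors
mse = np.mean(resid**2)      # mean squared error = sse / n
rmse = np.sqrt(mse)          # root mean squared error

print(sse, mse, rmse)
```

Because all of these are computed on the training data, they measure in-sample fit only. Comparing models fairly usually calls for held-out data.
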
In [115]:
# take a look at the fitted values from the two models
plt.plot(Y,Y,label='Y=Y')
plt.scatter(Y, lm_pred, label='lm fit')
plt.scatter(Y, rf_pred, label='rf fit')
plt.xlabel('Y')
plt.ylabel('fitted values')
plt.legend()
plt.show()