1) Introduction
2) Python basics
3) Python packages for statistics & data science
Python is a general-purpose programming language. Like R, it is a high-level, object-oriented language. Unlike R, it was not developed specifically for the practice of statistics. However, a number of useful packages have been developed in Python for mathematics, statistics, and data science.
Having been developed in the computer science community, Python is used heavily for data science and machine learning. Some reasons you might use Python rather than R:
Disadvantages:
My advice: give it a try and see if you like it!
The Anaconda distribution of Python is an easy way to install Python along with the most useful packages. You can download it here: https://www.continuum.io/downloads
There are two major versions of Python, 2 and 3. Apart from some differences in syntax (e.g., whether print requires parentheses), the practical differences are usually small. Python 2 reached end of life in January 2020, so I'm using Python 3 and would recommend doing the same unless you have some prior reason for doing otherwise (working with older Python code, coworkers using Python 2, etc.).
This presentation was prepared using Jupyter Notebook: http://jupyter.org
# This code does nothing.
You can include underscores and numbers in variable names, but not periods.
The period is saved for methods on objects.
Note: like R, no need to declare variable types.
# no problem
my_var = 0
var0 = 0
# problem
my.var = 0
Explanation: Python reads my.var as the attribute var of an object called my, and no such object exists.
A method is a function that is saved as an attribute of an object and generally manipulates that object in some way.
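To make the idea of a method concrete, here is a small sketch using built-in types. The variable names are made up for illustration.

```python
# Methods are called with the period syntax: object.method(arguments)
fruits = ['pear', 'apple']
fruits.append('fig')      # append() is a list method; it modifies the list in place

word = 'taco'
shout = word.upper()      # upper() is a string method; it returns a new string

print(fruits)
print(shout)
```

Note that some methods change the object itself (append) while others leave it alone and return a new object (upper).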
A little trick: you can declare or set multiple variables at once.
x, y = 5, 10
print(x)
print(y)
Like R, generally no need to declare the class of a number.
x = 5
y = 2.75
z = 8 / 3
print(x,y,z)
print(type(x), type(y), type(z))
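One detail worth knowing, assuming Python 3 as recommended above: the / operator always returns a float, even when both inputs are integers. A quick sketch of the related operators:

```python
a = 8 / 4      # true division: always a float in Python 3
b = 8 // 3     # floor division: rounds down to an integer
c = 8 % 3      # remainder after floor division
d = 2 ** 10    # exponentiation (note: ^ is not the power operator in Python)

print(a, b, c, d)
print(type(a), type(b))
```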
Like R, strings can be defined using either single quotes '' or double quotes "".
len() returns the length of a string.
Concatenate strings using +.
x = 'taco'
print(len(x))
y = 'sauce'
z = x + ' ' + y
print(z)
Split a string using the split() method.
z = z.split(' ')
print(z)
You can also collapse a list of strings using join. (Note: join is a method of all strings.)
x = ['One', 'two', 'red', 'blue']
x1 = ', '.join(x)
print(x1)
The objects in square brackets above are lists.
Python has several different data types for combining multiple elements into a single object.
# list
x = [1,2,3]
# tuple
y = (4,5,6)
# dictionary
z = {'a':7, 'b':8, 'c':9}
Why these different data structures?
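One way to see the difference, as a rough sketch: lists can be changed after they are created, tuples cannot, and dictionaries look up values by key rather than by position. The variable names here are illustrative.

```python
# lists are mutable: elements can be changed and added
scores = [1, 2, 3]
scores[0] = 10
scores.append(4)

# tuples are immutable: trying to change an element raises a TypeError
point = (4, 5, 6)
try:
    point[0] = 40
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False

# dictionaries map keys to values
ages = {'a': 7, 'b': 8}
ages['c'] = 9

print(scores, point, tuple_is_mutable, ages)
```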
len() returns the length of these objects.
print(len(x), len(y), len(z))
Note: len() also works for strings.
len("apple")
Notably, these data structures can include any kind of object:
mylist = [1,'taco',x,y,z]
mytuple = (1,'taco',x,y,z)
mydictionary = {0:1, 1:'taco', 2:x, 3:y, 4:z}
print(mylist)
print(mytuple)
print(mydictionary)
A few useful methods of dictionaries:
z = {'a':7, 'b':8, 'c':9}
list(z.keys())
list(z.values())
list(z.items())
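A few more dictionary operations that come up often, shown as a short sketch. The names has_a, missing, and fallback are just for illustration.

```python
z = {'a': 7, 'b': 8, 'c': 9}

has_a = 'a' in z            # membership tests look at the keys
missing = z.get('d')        # get() returns None instead of raising a KeyError
fallback = z.get('d', 0)    # or a default value that you supply
z['d'] = 10                 # assigning to a new key adds an entry

print(has_a, missing, fallback)
print(z)
```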
List and tuple elements (and substrings) can be accessed by index. Python starts indexes with 0, rather than 1.
Dictionary entries are accessed using the dictionary keys.
# list
x = [1,2,3]
# tuple
y = (4,5,6)
# dictionary
z = {'a':7, 'b':8, 'c':9}
# string
s = "sasquatch"
# print an individual element of each object
print(x[0], y[2], z['a'], s[3])
Unlike R, negative indexes do not remove an element from a list.
Instead negative indexes go the other direction, e.g. x[-1] is the last entry in x, x[-2] is the second to last, etc.
x = [1,2,3]
print(x[0],x[1],x[2])
print(x[-1],x[-2],x[-3])
Subset a list or tuple (or string) using square brackets and a colon. Think of the indexes as the walls separating different boxes. To return the contents of multiple boxes, you have to give the indexes of the outside walls.
# Example:
x = [8,9,10]
print("indexes: 0 1 2 3\nelements: | 8 | 9 | 10 |")
print(x[0:2], x[1:3])
Note that you can leave out the numbers on either side of the colon to extend the slice as far as possible.
print(x[0:3], x[-3:], x[:])
Slicing also works with strings.
s = "sasquatch"
print(s[0:6])
Accessing non-consecutive elements is a little tricky. One way of doing so is by list comprehension, which I'll talk about more later.
x = [2,4,6,8]
# take the subset corresponding to indexes 1 and 3
x_sub = [x[i] for i in [1,3]]
print(x_sub)
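When the elements you want are evenly spaced, a slice with a third number (the step) is an alternative to the list comprehension. A small sketch:

```python
x = [2, 4, 6, 8]

every_other_from_1 = x[1::2]   # start at index 1, then every second element
every_other_from_0 = x[::2]    # start at index 0, then every second element
backwards = x[::-1]            # a step of -1 walks through the list in reverse

print(every_other_from_1, every_other_from_0, backwards)
```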
When you set a variable equal to another variable, Python saves space by pointing to the same object in memory. It does not create a unique copy of the object.
If you edit the object, both variables will change.
# problem
x = [1,2,3]
y = x
y[0] = 10
print(x)
Safe ways of copying: slice it, use list(), or use the copy() method
x = [1,2,3]
# make a copy by slicing x
a = x[:]
# make a copy using list()
b = list(x)
# make a copy using the x.copy() method
c = x.copy()
# make changes to the copies
a[0] = 10
b[1] = 20
c[2] = 30
print('x: ', x)
print('a: ', a)
print('b: ', b)
print('c: ', c)
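One caveat: slicing, list(), and copy() all make shallow copies. If the list contains other lists, the inner lists are still shared. The sketch below uses copy.deepcopy from the standard library to copy everything; the variable names are illustrative.

```python
import copy

nested = [[1, 2], [3, 4]]
shallow = nested[:]              # new outer list, but the same inner lists
deep = copy.deepcopy(nested)     # new outer list and new inner lists

shallow[0][0] = 10               # also changes nested, since the inner list is shared
deep[1][1] = 40                  # changes only the deep copy

print('nested: ', nested)
print('shallow:', shallow)
print('deep:   ', deep)
```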
x = 5
if (x > 10):
    print('x is greater than 10')
elif (x < 0):
    print('x is negative')
else:
    print('x is between 0 and 10')
If you like, you can put an if statement on a single line.
y = 7
if y==7: print('y is seven!')
Note: booleans in Python are True and False. You can also use 0 and 1.
if True:
    print('yes')
if False:
    print('no')
if 1:
    print('yes')
if 0:
    print('no')
x = ['apple','banana','cherry']
for fruit in x:
    print(fruit)
for i in range(3):
    print(i)
for a,b in [(1,1),(2,4),(3,9)]:
    print(a,'squared equals',b)
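Two built-in helpers make for loops more convenient: enumerate() supplies an index alongside each element, and zip() walks through several lists in parallel. A brief sketch, with made-up data:

```python
fruits = ['apple', 'banana', 'cherry']
prices = [0.5, 0.25, 3.0]

# enumerate() yields (index, element) pairs
labeled = []
for i, fruit in enumerate(fruits):
    labeled.append(str(i) + ': ' + fruit)

# zip() yields one tuple per position across the lists
pairs = []
for fruit, price in zip(fruits, prices):
    pairs.append((fruit, price))

print(labeled)
print(pairs)
```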
i = 10
while (i < 13):
    print(i)
    i+=1
def square(x):
    return(x**2)
print(square(4))
# like R, you can give default parameter values
def prod(x=1,y=2):
    return(x*y)
print(prod(), prod(2), prod(3,4))
In Python, indentation is part of the syntax. If you fail to indent properly, Python raises an IndentationError. This is particularly relevant for if statements, loops, and functions.
if True:
    print('true')
for fruit in x:
    print(fruit)
def square(x):
    return(x**2)
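To see what the error looks like without crashing a script, this sketch passes a badly indented snippet to the built-in compile() function and catches the exception. The string bad_source is made up for illustration.

```python
# the body of the if statement is not indented
bad_source = "if True:\nprint('true')\n"

try:
    compile(bad_source, '<example>', 'exec')
    error_name = None
except IndentationError as err:
    error_name = type(err).__name__

print(error_name)
```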
For tuples and lists, the + operator performs concatenation and * performs repetition.
x = [1,2,3]
y = [4,5,6]
print(x + y)
print(x * 3)
# create a long list of zeros
x = [0] * 50
print(x)
# or a list of empty placeholders (None) to fill in later
x = [None] * 10
print(x)
# The following all cause errors:
x + 2
x - y
x / 2
Sort a list in place using the sort() method. (Tuples are immutable and have no sort() method; use the built-in sorted() function for those.)
x = [2,3,1]
x.sort()
print(x)
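The built-in sorted() function is a related tool: it returns a new sorted list and leaves the original alone, and it accepts tuples as well as lists. A short sketch:

```python
x = [2, 3, 1]
y = sorted(x)                       # new sorted list; x itself is unchanged
x_desc = sorted(x, reverse=True)    # sort in descending order

t = (2, 3, 1)
t_sorted = tuple(sorted(t))         # sorted() returns a list, so convert back to a tuple

print(x, y, x_desc, t_sorted)
```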
Python employs list comprehension and dictionary comprehension for creating and manipulating lists and dictionaries.
Recall that range() can be used to create a range of integers.
x = range(4,12,2)
print(list(x))
Great. But what if you want decimals? Or something more fancy?
# decimals
x = [i/10 for i in range(10)]
print(x)
# a familiar sequence?
x = [(1/i) * (-1)**(i+1) for i in range(1,10)]
print(x)
This is also useful for manipulating an existing list. Suppose you want to apply a function to every number in a list:
x = [3,2,6,4,12]
x_squared = [i**2 for i in x]
print(x_squared)
Or suppose you want to filter a list:
y = [v for v in x if v > 5]
print(y)
The same general idea also works for dictionaries.
You do have to use the items() method, which returns the dictionary's (key, value) pairs so that the comprehension can loop over them.
x = {'a':4, 'b':7, 'c':10}
y = {k:v**2 for k,v in x.items()}
print(y)
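Filtering works in a dictionary comprehension too. The sketch below also shows what items() hands to the loop; the names pairs, big, and swapped are illustrative.

```python
x = {'a': 4, 'b': 7, 'c': 10}

pairs = list(x.items())                          # the (key, value) tuples
big = {k: v for k, v in x.items() if v > 5}      # keep only entries with value > 5
swapped = {v: k for k, v in x.items()}           # swap keys and values

print(pairs)
print(big)
print(swapped)
```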
Reading and writing files in base Python is quite similar to C: you open a file, read or write, and then close it.
# create some data to write to file
x = [1,2,3,4,5]
squares = [i**2 for i in x]
colors = ['red','blue','yellow','green','red']
# convert x and squares to strings
x = [str(i) for i in x]
squares = [str(i) for i in squares]
print(x)
print(squares)
# zip the three variables together
mydata = list(zip(x, squares, colors))
mydata
# open a file for writing
f = open('testfile.csv','w')
# write one row at a time from mydata
# the for loop automatically pulls one element at a time (in this case one tuple at a time)
for row in mydata:
    f.write(','.join(row) + '\n')
# close the file
f.close()
Now to read the data back in:
# look at the raw data
# think about how this can be parsed
f = open('testfile.csv', 'r')
f.read()
# read in the data
f = open('testfile.csv', 'r')
x2, squares2, colors2 = [], [], []
for line in f:
    # note: strip() takes out the \n
    tmp = line.strip().split(',')
    x2.append(tmp[0])
    squares2.append(tmp[1])
    colors2.append(tmp[2])
# close the file
f.close()
print(x2)
print(squares2)
print(colors2)
# convert the numeric strings to integers
x2 = [int(i) for i in x2]
squares2 = [int(i) for i in squares2]
print(x2)
print(squares2)
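A common alternative, sketched here under my own assumptions rather than as the one right way: a with block closes the file automatically, and the csv module from the standard library handles the joining and splitting. The file name example.csv and the temporary directory are made up for illustration.

```python
import csv
import os
import tempfile

rows = [(1, 1, 'red'), (2, 4, 'blue')]
path = os.path.join(tempfile.mkdtemp(), 'example.csv')

# the file is closed automatically at the end of the with block
with open(path, 'w', newline='') as f:
    csv.writer(f).writerows(rows)

# csv.reader splits each line into a list of strings
with open(path, newline='') as f:
    read_back = [row for row in csv.reader(f)]

print(read_back)
```

As with the manual approach, everything comes back as strings, so numeric columns still need to be converted with int() or float().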
# import the os package. (More about packages next.)
import os
# view the working directory
os.getcwd()
# change the working directory
os.chdir('/Users/adamscherling/Documents/ucla/stat404')
os.getcwd()
R is designed so that everything is a vector, which is great for statistics. In Python, vectors do not come as naturally.
The numpy package makes life easier. It includes many useful tools for mathematics and statistics. For more information see http://www.numpy.org.
Packages are loaded in Python using import. It is common to specify an abbreviated name for loaded packages; these names help prevent mistakes if the same function is defined in multiple packages.
# it's standard practice to use the np prefix for numpy
import numpy as np
Numpy arrays are convenient for representing vectors and matrices.
np.arange creates an array rather than a range object or list
np.arange(10)
np.arange also allows non-integer step sizes
np.arange(0,3,0.2)
The linspace command is also helpful:
np.linspace(0,1,10)
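The two commands differ in how they treat the endpoint, which is easy to trip over. A small sketch, assuming numpy is installed:

```python
import numpy as np

a = np.arange(0, 1, 0.25)    # you give the step size; the stop value is excluded
b = np.linspace(0, 1, 5)     # you give the number of points; the stop value is included

print(a)
print(b)
```

With non-integer steps, rounding can make the length of an arange result hard to predict, so linspace is often the safer choice when you know how many points you want.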
Unlike operations on Python lists, array operations are elementwise.
x = np.array([1,2,3])
y = np.array([4,5,6])
print(x + y)
print(x**2)
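To make the contrast with lists explicit, here is a side-by-side sketch. The variable names are illustrative.

```python
import numpy as np

a_list = [1, 2, 3]
b_list = [4, 5, 6]
a_arr = np.array(a_list)
b_arr = np.array(b_list)

list_sum = a_list + b_list    # lists: + concatenates
arr_sum = a_arr + b_arr       # arrays: + adds element by element
list_rep = a_list * 2         # lists: * repeats
arr_scaled = a_arr * 2        # arrays: * multiplies each element

print(list_sum, arr_sum)
print(list_rep, arr_scaled)
```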
Arrays can be reshaped to use as matrices.
x = np.array([1,2,2,4,3,6]).reshape(3,2)
print(x)
Special types of matrices can be created easily.
np.zeros([3,2])
np.ones([2,3])
np.diag([1,2,3])
You still have to be careful about making copies.
x = np.arange(5)
y = x
y[0] = 17
print(x)
x = np.arange(5)
y = x.copy()
y[0] = 17
print(x)
print(y)
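There is one more difference from lists to watch for: slicing a list makes a copy, but slicing a numpy array returns a view onto the same data. A sketch:

```python
import numpy as np

x = np.arange(5)
view = x[1:3]            # a view: shares memory with x
view[0] = 99             # this also changes x[1]

safe = x[1:3].copy()     # an independent copy of the slice
safe[1] = -1             # x is not affected

print(x)
print(safe)
```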
# note the use of chain syntax; it's often useful to chain commands together
x = np.array([1,2,2,4,3,6,4,8]).reshape(4,2)
print(x)
To concatenate arrays, the dimensions must align. Even for one-dimensional vectors you have to be explicit about the shape.
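# problem: y is one-dimensional, so this raises an error
y = np.array([3,6,9,12])
x = np.concatenate([x,y],axis=1)
# no problem: reshape y into a column first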
y = np.array([3,6,9,12]).reshape(4,1)
x = np.concatenate([x,y],axis=1)
print(x)
z = np.array([5,10,15]).reshape(1,3)
x = np.concatenate([x,z],axis=0)
print(x)
x = np.concatenate([x,x,x], axis=1)
print(x)
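If the reshaping feels tedious, there are a couple of shortcuts. This sketch assumes the same kind of 4x2 matrix and length-4 vector used above: reshape(-1, 1) lets numpy work out the number of rows, and np.column_stack treats a one-dimensional array as a column.

```python
import numpy as np

x = np.array([1, 2, 2, 4, 3, 6, 4, 8]).reshape(4, 2)
y = np.array([3, 6, 9, 12])                  # shape (4,)

y_col = y.reshape(-1, 1)                     # shape (4, 1); -1 means "infer this dimension"
a = np.concatenate([x, y_col], axis=1)
b = np.column_stack([x, y])                  # accepts the 1-D array directly

print(a)
print(b)
```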
Matrix multiplication is available through the dot() method of arrays.
x = np.array([1,2,2,4,3,6]).reshape(3,2)
y = np.array([1,1])
print('x: ', x, '\n')
print('y: ', y, '\n')
print(x.dot(y))
# note: x.dot(y) is equivalent to np.dot(x,y)
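In Python 3.5 and later, the @ operator is another way to write matrix multiplication. It is easy to confuse with *, which is elementwise. A short sketch:

```python
import numpy as np

x = np.array([1, 2, 2, 4, 3, 6]).reshape(3, 2)
y = np.array([1, 1])

matrix_product = x @ y     # same result as x.dot(y)
elementwise = x * y        # multiplies each row of x by y, element by element

print(matrix_product)
print(elementwise)
```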
Transpose a matrix:
x.T
Invert a matrix:
x = np.arange(4).reshape(2,2)
print(x)
y = np.linalg.inv(x)
print(y)
print(np.dot(x,y))
Find eigenvalues and eigenvectors:
eig = np.linalg.eig(x)
val = eig[0]
vec = eig[1]
print(val, '\n')
print(vec)
np.pi
x = np.arange(3)
# note that these allow the input to be an array
y = np.exp(x)
z = np.log(y)
print(y.round(2))
print(z)
x = np.linspace(0,2*np.pi,5)
y = np.sin(x)
print(x.round(2))
print(y.round(2))
Matplotlib is the standard plotting package in Python. For a tutorial see https://matplotlib.org/tutorials/introductory/pyplot.html#sphx-glr-tutorials-introductory-pyplot-py.
Other packages such as ggplot and seaborn are also available.
import matplotlib.pyplot as plt
x = np.linspace(0,2*np.pi,1000)
y = np.sin(x)
plt.plot(x,y)
plt.xlabel('x')
plt.ylabel('sin(x)')
plt.show()
R has built-in functionality for working with common probability distributions. In Python similar functions are available through the stats portion of the scipy package.
scipy has a lot more than just probability distributions; there are sub-libraries for topics such as linear algebra, optimization, Fourier transforms, signal processing, and more.
For more information about scipy see https://docs.scipy.org/doc/scipy/reference/.
from scipy import stats
# plot normal pdf
x = np.linspace(-4,4,1000)
y = stats.norm().pdf(x)
plt.plot(x,y)
plt.show()
# plot normal cdf
# (of course, you can also use this for Z tests, etc.)
x = np.linspace(-4,4,1000)
y = stats.norm().cdf(x)
plt.plot(x,y)
plt.show()
# generate random variables from a distribution
# (note: you can also do this in numpy using np.random)
x = stats.norm(3,2).rvs(3)
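# note: the first argument of expon is loc (a shift), not a rate; for rate 4 use stats.expon(scale=1/4)
y = stats.expon(4).rvs(4)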
z = stats.poisson(1).rvs(5)
print(x.round(2))
print(y.round(2))
print(z)
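A note on parameters, since they differ from R: for norm and expon, the positional arguments in scipy.stats are loc and scale, so an exponential rate has to be passed as scale=1/rate. The sketch below checks this using the means of the distributions, and also shows cdf() and ppf(), which play the roles of R's pnorm and qnorm. The variable names are illustrative.

```python
from scipy import stats

shifted = stats.expon(4)              # loc=4: a standard exponential shifted right by 4
rate4 = stats.expon(scale=1/4)        # an exponential with rate 4

print(shifted.mean(), rate4.mean())

p = stats.norm.cdf(1.96)              # P(Z <= 1.96) for a standard normal
q = stats.norm.ppf(0.975)             # the 97.5th percentile of a standard normal
print(round(p, 3), round(q, 2))
```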
The pandas package provides its own universe of special objects and functions for easily reading in and manipulating data, time series functionality, and other useful tools. In many ways pandas mimics R, but it also has some interesting tricks and tools of its own.
For more information see http://pandas.pydata.org/pandas-docs/stable/.
import pandas as pd
Series object
pandas has its own answer to vectorizing Python: the Series object. These have an immense number of special attributes. See https://pandas.pydata.org/pandas-docs/stable/api.html#series.
They can also be easily converted into numpy arrays if you so choose.
x = pd.Series(stats.norm().rvs(10))
print(x)
# some of the many available summary statistics
[x.min(), x.quantile(0.25), x.median(), x.quantile(0.75), x.max(), x.mean(), x.var(), x.autocorr()]
# find unique values
x.round().unique()
# filter
x.where(x > 0)
# filter and remove missing values
x.where(x > 0).dropna()
As with numpy arrays, operations are elementwise.
x + x
x ** 2
Concatenate using pd.concat. (The older Series.append method was removed in pandas 2.0.)
x2 = pd.concat([x, x])
print(x2)
DataFrame object
# create using a dictionary
data = {'x': stats.uniform().rvs(10), 'y': stats.norm().rvs(10)}
df = pd.DataFrame(data)
print(df)
# create using a list
data = [stats.uniform().rvs(10),stats.norm().rvs(10)]
df = pd.DataFrame(data)
print(df)
# transpose
df = df.T
print(df)
# rename columns
df.columns = ['x','y']
print(df)
# pick out a column by name
df.x
# equivalent: df['x']
# pick out multiple columns by passing a list of names
df[['x','y']]
Note that columns in a pandas DataFrame are Series objects.
type(df.x)
# select columns by index
df2 = df.iloc[:,1]
print(df2)
# select rows by index
df2 = df.iloc[0:5,]
print(df2)
# select rows by value
df2 = df[df['x'] > 0.5]
print(df2)
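For reference, here is a compact sketch of the main selection styles side by side. It uses a small made-up DataFrame with row labels r1 to r3 so that label-based selection (loc) and position-based selection (iloc) are easy to tell apart.

```python
import pandas as pd

df = pd.DataFrame({'x': [0.1, 0.7, 0.9], 'y': [1.0, -2.0, 3.0]},
                  index=['r1', 'r2', 'r3'])

by_label = df.loc['r2', 'y']        # select by row and column labels
by_position = df.iloc[1, 1]         # select by integer positions
both_cols = df[['x', 'y']]          # a list of names selects multiple columns
filtered = df[df['x'] > 0.5]        # a boolean Series selects rows

print(by_label, by_position)
print(filtered)
```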
pandas can read from and write to many different formats: tables, csv, json, Excel, html, SQL, Stata, SAS...
This is often much simpler than trying to do it manually using base Python.
# write df to csv
# index=False tells pandas not to write the row names
df.to_csv('test_df.csv',index=False)
# read it back in
df2 = pd.read_csv('test_df.csv')
print(df2)
The statsmodels package provides a number of basic statistical models like linear models, GLMs, ANOVA, and more.
For more info see http://www.statsmodels.org.
Simple example: a linear model
# load statsmodels
import statsmodels.api as sm
# make up some data
X = np.array(stats.norm().rvs(100)).reshape(25,4)
beta = [3,0.5,1,2]
epsilon = stats.norm().rvs(25)
Y = np.dot(X,beta) + epsilon
# fit a model
lm = sm.OLS(Y,X)
lm_fit = lm.fit()
lm_fit.summary()
scikit-learn is the go-to package for machine learning in Python. It includes supervised and unsupervised methods, regression and classification, and more. For more information see http://scikit-learn.org.
Using the same data: fit a random forest regression
# import the RandomForestRegressor class
from sklearn.ensemble import RandomForestRegressor
# fit the model
rf_reg = RandomForestRegressor(n_estimators=10)
rf_fit = rf_reg.fit(X, Y)
# compare the MSE to that from the linear model
# calculate fitted values
rf_pred = rf_fit.predict(X)
lm_pred = lm_fit.predict(X)
# calculate MSEs (the mean of the squared residuals)
rf_mse = np.mean((Y-rf_pred)**2)
lm_mse = np.mean((Y-lm_pred)**2)
print(lm_mse, rf_mse)
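Since several similar error measures are easy to mix up, here is a small sketch of how they relate, using made-up numbers: the sum of squared errors (SSE), the mean squared error (MSE), and its square root (RMSE).

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.0, 2.0, 3.0, 6.0])
residuals = y_true - y_pred

sse = np.sum(residuals**2)      # sum of squared errors
mse = np.mean(residuals**2)     # mean squared error: SSE divided by the number of observations
rmse = np.sqrt(mse)             # root mean squared error, in the same units as y

print(sse, mse, rmse)
```

Keep in mind that both models above are evaluated on the same data they were fit to, so these are in-sample errors and will tend to flatter a flexible model like a random forest.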
# take a look at the fitted values from the two models
plt.plot(Y,Y,label='Y=Y')
plt.scatter(Y, lm_pred, label='lm fit')
plt.scatter(Y, rf_pred, label='rf fit')
plt.xlabel('Y')
plt.ylabel('fitted values')
plt.legend()
plt.show()