Pandas#

Pandas is a Python package for working with tabular data.

You may need to install pandas first, as well as seaborn:

(pycourse) $ conda install pandas seaborn
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
pd.__version__

Overview#

Pandas provides a DataFrame object, which is used to hold tables of data (the name DataFrame comes from a similar object in R). The primary difference compared to a NumPy ndarray is that you can easily handle different data types in each column.

Difference between a DataFrame and NumPy Array#

Pandas DataFrames and NumPy arrays both have similarities to Python lists.

  • NumPy arrays are designed to contain data of a single type (e.g. int, float, …)

  • DataFrames can contain different types of data (Int, Float, String, …)

    • Usually each column has the same type

Both arrays and DataFrames offer storage and performance optimizations beyond plain Python lists

Pandas is also powerful for working with missing data, handling time series data, reading and writing data, and for reshaping, grouping, and merging your data.
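To make the contrast with NumPy concrete, here is a minimal sketch (the table contents are made up for illustration) showing that a DataFrame keeps one dtype per column, while a NumPy array coerces mixed values to a single type:

```python
import numpy as np
import pandas as pd

# A small, hypothetical table mixing column types; each column keeps its own dtype
df = pd.DataFrame({
    "name": ["Ada", "Grace", "Alan"],
    "age": [36, 45, 41],
    "score": [9.5, 8.7, 9.1],
    "passed": [True, True, False],
})
print(df.dtypes)  # object, int64, float64, bool

# Packing one mixed row into a NumPy array coerces everything to one dtype
arr = np.array(["Ada", 36, 9.5, True])
print(arr.dtype)  # a unicode string dtype: every value became a string
```

This per-column typing is what lets pandas mix strings, numbers, and booleans in one table without losing numeric operations on the numeric columns.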

Key Features#

  • File I/O - integrations with multiple file formats

  • Working with missing data (.dropna(), pd.isnull())

  • Standard table operations: merging and joining, groupby functionality, and reshaping via stack and pivot_table

  • Time series-specific functionality:

    • date range generation and frequency conversion, moving window statistics/regressions, date shifting and lagging, etc.

  • Built-in Matplotlib integration
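The time-series bullets above can be sketched in a few lines; the dates and values below are invented for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical daily observations over two weeks
idx = pd.date_range("2024-01-01", periods=14, freq="D")
s = pd.Series(np.arange(14, dtype=float), index=idx)

# Frequency conversion: daily -> weekly means
weekly = s.resample("W").mean()

# Moving-window statistic: 3-day rolling mean
rolling = s.rolling(window=3).mean()

# Date shifting / lagging by one observation
lagged = s.shift(1)
```

Each of these returns a new Series aligned on dates, so the results can be combined or plotted directly.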

Other Strengths#

  • Strong community, support, and documentation

  • Size mutability: columns can be inserted and deleted from DataFrame and higher dimensional objects

  • Powerful, flexible group by functionality to perform split-apply-combine operations on data sets, for both aggregating and transforming data

  • Makes it easy to convert ragged, differently-indexed data in other Python and NumPy data structures into DataFrame objects

  • Intelligent label-based slicing, fancy indexing, and subsetting of large data sets

Python/Pandas vs. R#

  • R is a language dedicated to statistics. Python is a general-purpose language with statistics modules.

  • R has more statistical analysis features than Python, and specialized syntaxes.

However, when it comes to building complex analysis pipelines that mix statistics with e.g. image analysis, text mining, or control of a physical experiment, the richness of Python is an invaluable asset.

Objects and Basic Creation#

Name | Dimensions | Description
--- | --- | ---
pd.Series | 1 | 1D labeled homogeneously-typed array
pd.DataFrame | 2 | General 2D labeled, size-mutable tabular structure
pd.Panel | 3 | General 3D labeled, size-mutable array (removed in pandas 1.0; use a MultiIndex DataFrame or xarray instead)

Get Started#

We’ll load the Titanic data set, which is part of seaborn’s example data. It contains information about 891 passengers on the infamous Titanic.

See the Pandas tutorial on reading/writing data for more information.

titanic = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv")
df = titanic
df
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 0 2 male 27.0 0 0 13.0000 S Second man True NaN Southampton no True
887 1 1 female 19.0 0 0 30.0000 S First woman False B Southampton yes True
888 0 3 female NaN 1 2 23.4500 S Third woman False NaN Southampton no False
889 1 1 male 26.0 0 0 30.0000 C First man True C Cherbourg yes True
890 0 3 male 32.0 0 0 7.7500 Q Third man True NaN Queenstown no True

891 rows × 15 columns

We see that there is a variety of columns. Some contain numeric values, some contain strings, some contain booleans, etc.

If you want to access a particular column, you can create a Series using the column title.

ages = df["age"] # __getitem__
ages
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64

Column titles are also accessible as object attributes:

ages = df.age
ages
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64

Pandas DataFrames and Series have a variety of built-in methods to examine data

print(ages.max(), ages.min(), ages.mean(), ages.median())
80.0 0.42 29.69911764705882 28.0

You can display a variety of statistics using describe(). Note that NaN values are simply ignored.

ages.describe()
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: age, dtype: float64

Depending on the type of data held in the Series, describe() provides different statistics.

df['sex'].describe()
count      891
unique       2
top       male
freq       577
Name: sex, dtype: object

You can find some more discussion in the Pandas tutorials.

Selecting Subsets of DataFrames#

You can select multiple columns

df2 = df[["age", "survived"]]
df2
age survived
0 22.0 0
1 38.0 1
2 26.0 1
3 35.0 1
4 35.0 0
... ... ...
886 27.0 0
887 19.0 1
888 NaN 0
889 26.0 1
890 32.0 0

891 rows × 2 columns

If you want to select certain rows based on some criterion, you can create a Series of booleans to use as a mask

mask = df2['age'] > 30
df2[mask]
age survived
1 38.0 1
3 35.0 1
4 35.0 0
6 54.0 0
11 58.0 1
... ... ...
873 47.0 0
879 56.0 1
881 33.0 0
885 39.0 0
890 32.0 0

305 rows × 2 columns

We see that we only have 305 rows now

If you just want to access specific rows, you can use iloc

df.iloc[:4]
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False

Aside: Attributes#

Recall that you can access columns of a DataFrame using attributes:

df.age
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64

You have typically defined attributes using syntax like

import numpy as np

class MyDF():
    def __init__(self, **kwargs):
        self.is_df = True

df2 = MyDF(a=1, b=np.ones(3))
df2.is_df
True
df2.a
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-19-f30d3ef5cdfe> in <module>
----> 1 df2.a

AttributeError: 'MyDF' object has no attribute 'a'

However, there are other options for defining attributes.

Object attributes are stored in a Python dictionary. If you want to set attributes dynamically, you can insert them into the object's dictionary via __dict__:

import numpy as np
class MyDF():
    def __init__(self, **kwargs):
        """
        Store all keyword arguments as attributes
        """
        for k, v in kwargs.items():
            self.__dict__[k] = v
            
        self.is_df = True
        
    
df2 = MyDF(a=1, b=np.ones(3))
print(df2.a)
print(df2.b)
1
[1. 1. 1.]

You can also make methods behave like attributes using the @property decorator

class MyDF():
    def __init__(self, **kwargs):
        """
        Store all keyword arguments as attributes
        """
        for k, v in kwargs.items():
            self.__dict__[k] = v
            
        self.is_df = True
        
           
    @property
    def nitems(self):
        return len(self.__dict__)
    
    
    @property
    def get_number(self):
        return np.random.rand()


    
df2 = MyDF(a=1, b=np.ones(3))
df2.nitems
3
df2.get_number
0.21378456767590537

Missing data#

In data analysis, knowing how to handle missing data properly is very important. Sometimes we don’t want to simply ignore missing values, especially when the number of observations is small. There are various ways to fill them in, such as using the column mean or K-Nearest Neighbors (KNN) methods.
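A few of the standard tools, sketched on a tiny hand-made series (the values are invented for illustration):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0, np.nan, 5.0])

print(s.isna().sum())      # count the missing entries
print(s.dropna())          # drop rows with missing values
print(s.fillna(s.mean()))  # fill with the mean of the observed values
print(s.interpolate())     # linear interpolation between neighbors
```

Note that fillna and interpolate return new objects; the original series is unchanged unless you assign the result back.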

Exercise#

Here, as an exercise, suppose we want to replace the missing values in the age column of the Titanic data set with that column's mean.

titanic = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv")
MissDataPra = titanic.copy()
MissDataPra
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
0 0 3 male 22.0 1 0 7.2500 S Third man True NaN Southampton no False
1 1 1 female 38.0 1 0 71.2833 C First woman False C Cherbourg yes False
2 1 3 female 26.0 0 0 7.9250 S Third woman False NaN Southampton yes True
3 1 1 female 35.0 1 0 53.1000 S First woman False C Southampton yes False
4 0 3 male 35.0 0 0 8.0500 S Third man True NaN Southampton no True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
886 0 2 male 27.0 0 0 13.0000 S Second man True NaN Southampton no True
887 1 1 female 19.0 0 0 30.0000 S First woman False B Southampton yes True
888 0 3 female NaN 1 2 23.4500 S Third woman False NaN Southampton no False
889 1 1 male 26.0 0 0 30.0000 C First man True C Cherbourg yes True
890 0 3 male 32.0 0 0 7.7500 Q Third man True NaN Queenstown no True

891 rows × 15 columns

## Your code here
ages = MissDataPra["age"]
ages
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: age, Length: 891, dtype: float64
# Solution 
ageMean = ages.mean()
MissDataPra["age"] = ages.fillna(ageMean)
MissDataPra["age"]
0      22.000000
1      38.000000
2      26.000000
3      35.000000
4      35.000000
         ...    
886    27.000000
887    19.000000
888    29.699118
889    26.000000
890    32.000000
Name: age, Length: 891, dtype: float64

Series#

  • a Series is a one-dimensional labeled array capable of holding any data type (integers, strings, floating point numbers, Python objects, etc.). The axis labels are collectively referred to as the index.

  • Basic method to create a series: s = pd.Series(data, index = index)

  • Data can be many things:

    • A Python Dictionary

    • An ndarray (or reg. list)

    • A scalar

  • The passed index is a list of axis labels (its interpretation depends on what data is)

Think “Series = Vector + labels”

first_series = pd.Series([2**i for i in range(7)])
print(type(first_series))
print(first_series)
<class 'pandas.core.series.Series'>
0     1
1     2
2     4
3     8
4    16
5    32
6    64
dtype: int64
s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
print('-'*50)
print(s.index)
a    0.451096
b    1.335011
c    0.433332
d    1.729655
e    1.092280
dtype: float64
--------------------------------------------------
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')
s['a']
0.45109622266084853

If data is a dictionary and an index is passed, the values in data corresponding to the labels in the index will be pulled out; otherwise an index will be constructed from the dict's keys (in insertion order in modern pandas; older versions sorted them)

d = {'a': [0., 0], 'b': {'1':1.}, 'c': 2.}
pd.Series(d)
a      [0.0, 0]
b    {'1': 1.0}
c             2
dtype: object

You can create a series from a scalar, but you need to specify the indices

pd.Series(5, index = ['a', 'b', 'c'])
a    5
b    5
c    5
dtype: int64

You can index and slice series like you would numpy arrays/python lists

s = pd.Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
a    0.264398
b    0.701476
c    1.106238
d    0.455066
e    1.202556
dtype: float64
print(s['b':'d'])
b    0.701476
c    1.106238
d    0.455066
dtype: float64
s[s > s.mean()]

You can iterate over a Series as well

for idx, val in s.items():  # iteritems() was removed in pandas 2.0
    print(idx, val)
a 0.2643980344953476
b 0.7014760363611338
c 1.106238389833158
d 0.4550662058745087
e 1.2025560049383854

or sort by index or value

print(s.sort_index())
print(s.sort_values())
a    0.264398
b    0.701476
c    1.106238
d    0.455066
e    1.202556
dtype: float64
a    0.264398
d    0.455066
b    0.701476
c    1.106238
e    1.202556
dtype: float64

You can also count unique values:

s = pd.Series([0,0,0,1,1,1,2,2,2,2])
s.value_counts()
2    4
1    3
0    3
dtype: int64

Exercise#

  • Consider the series s of letters in a sentence.

  • What is the count of each letter in the sentence? Output a series sorted by the counts.

  • Create a list with only the top 5 most common letters (not including the space).

## Your code here
sentence = 'Consider the series s of letters in a sentence.'
s = pd.Series([c for c in sentence])
s
0     C
1     o
2     n
3     s
4     i
5     d
6     e
7     r
8      
9     t
10    h
11    e
12     
13    s
14    e
15    r
16    i
17    e
18    s
19     
20    s
21     
22    o
23    f
24     
25    l
26    e
27    t
28    t
29    e
30    r
31    s
32     
33    i
34    n
35     
36    a
37     
38    s
39    e
40    n
41    t
42    e
43    n
44    c
45    e
46    .
dtype: object
s.value_counts()
e    9
     8
s    6
n    4
t    4
i    3
r    3
o    2
.    1
l    1
d    1
C    1
h    1
a    1
c    1
f    1
dtype: int64
sc = s[s != ' '].value_counts()
sc[:5]
e    9
s    6
n    4
t    4
i    3
dtype: int64

Data Frames#

  • a DataFrame is a 2-dimensional labeled data structure with columns of potentially different types. You can think of it like a spreadsheet or SQL table, or a dict of Series objects. It is generally the most commonly used pandas object.

  • You can create a DataFrame from:

    • Dict of 1D ndarrays, lists, dicts, or Series

    • 2-D numpy array

    • A list of dictionaries

    • A Series

    • Another Dataframe

df = pd.DataFrame(data, index = index, columns = columns)
  • index/columns is a list of the row/column labels. If you pass an index and/or columns, you are guaranteeing the index and/or columns of the resulting DataFrame.

  • If you do not pass them in, they will be constructed from the input data by “common sense” rules

Documentation: pandas.DataFrame

DataFrame Creation#

from dictionaries:

end_string = "\n -------------------- \n"

# Create a dictionary of series
d = {'one': pd.Series([1,2,3], index  = ['a', 'b', 'c']), 
     'two': pd.Series(list(range(4)), index = ['a','b', 'c', 'd'])}
df = pd.DataFrame(d)
print(df, end = end_string)

d= {'one': {'a': 1, 'b': 2, 'c':3}, 
     'two': pd.Series(list(range(4)), index = ['a','b', 'c', 'd'])}
# Columns are dictionary keys, indices and values obtained from series
df = pd.DataFrame(d)
# Notice how it fills the column one with NaN for d
print(df, end = end_string)
   one  two
a  1.0    0
b  2.0    1
c  3.0    2
d  NaN    3
 -------------------- 
   one  two
a  1.0    0
b  2.0    1
c  3.0    2
d  NaN    3
 -------------------- 

From dictionaries of ndarrays or lists:

d = {'one' : [1., 2., 3., 4], 'two' : [4., 3., 2., 1.]}
pd.DataFrame(d)
one two
0 1.0 4.0
1 2.0 3.0
2 3.0 2.0
3 4.0 1.0

from a list of dicts:

data = []
for i in range(100):
    data += [ {'Column' + str(j):np.random.randint(100) for j in range(5)} ]
    # dictionary comprehension!
    
data[:5]
[{'Column0': 87, 'Column1': 30, 'Column2': 20, 'Column3': 90, 'Column4': 49},
 {'Column0': 36, 'Column1': 17, 'Column2': 22, 'Column3': 9, 'Column4': 80},
 {'Column0': 9, 'Column1': 43, 'Column2': 93, 'Column3': 68, 'Column4': 31},
 {'Column0': 82, 'Column1': 28, 'Column2': 68, 'Column3': 28, 'Column4': 37},
 {'Column0': 6, 'Column1': 60, 'Column2': 1, 'Column3': 30, 'Column4': 74}]
# Creation from a list of dicts
df = pd.DataFrame(data)
df.head()
Column0 Column1 Column2 Column3 Column4
0 87 30 20 90 49
1 36 17 22 9 80
2 9 43 93 68 31
3 82 28 68 28 37
4 6 60 1 30 74

Adding and Deleting Columns#

# Adding and accessing columns 
d = {'one': pd.Series([1,2,3], index  = ['a', 'b', 'c']), 
     'two': pd.Series(range(4), index = ['a','b', 'c', 'd'])}
df = pd.DataFrame(d)
# multiply 
df['three'] =  df['one']*df['two']
# Create a boolean flag
df['flag'] = df['one'] > 2
df
one two three flag
a 1.0 0 0.0 False
b 2.0 1 2.0 False
c 3.0 2 6.0 True
d NaN 3 NaN False
# inserting column in specified location, with values
df.insert(1, 'bar', pd.Series(3, index=['a', 'b', 'd']))
df
one bar two three flag
a 1.0 3.0 0 0.0 False
b 2.0 3.0 1 2.0 False
c 3.0 NaN 2 6.0 True
d NaN 3.0 3 NaN False
# Deleting Columns  
three = df.pop('three')
df
one bar two flag
a 1.0 3.0 0 False
b 2.0 3.0 1 False
c 3.0 NaN 2 True
d NaN 3.0 3 False

Indexing and Selection#

Three main methods: [], .loc, .iloc (a fourth, .ix, was deprecated and removed in pandas 1.0)

Operation | Syntax | Result
--- | --- | ---
Select column | df[col] | Series
Select row by label | df.loc[label] | Series
Select row by integer location | df.iloc[idx] | Series
Slice rows | df[5:10] | DataFrame
Select rows by boolean mask | df[mask] | DataFrame

Note that all the operations below are also valid on Series, restricted to one dimension.

Indexing using []

  • Series: selecting a label: s[label]

  • DataFrame: selecting single or multiple columns:

df['col'] or df[['col1', 'col2']]
  • DataFrame: slicing the rows:

df['rowlabel1': 'rowlabel2']

or

df[boolean_mask]
# Lets create a data frame
pd.options.display.max_rows = 4
dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C','D'])
df
A B C D
2000-01-01 2.481613 -0.577964 0.222924 -1.098381
2000-01-02 0.342951 -0.696970 1.165167 0.839895
... ... ... ... ...
2000-01-07 -1.135999 0.948088 -0.060405 1.720806
2000-01-08 0.683102 -0.914922 0.613417 0.461496

8 rows × 4 columns

# column 'A'
df['A']
2000-01-01    2.481613
2000-01-02    0.342951
                ...   
2000-01-07   -1.135999
2000-01-08    0.683102
Freq: D, Name: A, Length: 8, dtype: float64
# all rows, columns A, B
df.loc[:,"A":"B"]
A B
2000-01-01 2.481613 -0.577964
2000-01-02 0.342951 -0.696970
... ... ...
2000-01-07 -1.135999 0.948088
2000-01-08 0.683102 -0.914922

8 rows × 2 columns

# multiple columns
df[['A', 'B']]
A B
2000-01-01 2.481613 -0.577964
2000-01-02 0.342951 -0.696970
... ... ...
2000-01-07 -1.135999 0.948088
2000-01-08 0.683102 -0.914922

8 rows × 2 columns

# slice by rows
df['2000-01-01': '2000-01-04']
A B C D
2000-01-01 2.481613 -0.577964 0.222924 -1.098381
2000-01-02 0.342951 -0.696970 1.165167 0.839895
2000-01-03 -2.449466 -0.704328 -1.625973 0.233635
2000-01-04 -0.936724 -0.764115 0.049249 -0.893543
# boolean mask 
df[df['A'] > df['B']]
A B C D
2000-01-01 2.481613 -0.577964 0.222924 -1.098381
2000-01-02 0.342951 -0.696970 1.165167 0.839895
2000-01-08 0.683102 -0.914922 0.613417 0.461496
# Assign via []
df['A'] = df['B'].values
df
A B C D
2000-01-01 -0.577964 -0.577964 0.222924 -1.098381
2000-01-02 -0.696970 -0.696970 1.165167 0.839895
... ... ... ... ...
2000-01-07 0.948088 0.948088 -0.060405 1.720806
2000-01-08 -0.914922 -0.914922 0.613417 0.461496

8 rows × 4 columns

Selecting by label: .loc

  • is primarily label based, but may also be used with a boolean array.

  • .loc will raise KeyError when the items are not found

  • Allowed inputs:

    1. A single label

    2. A list of labels

    3. A boolean array

## Selection by label .loc
df.loc['2000-01-01']
A   -0.577964
B   -0.577964
C    0.222924
D   -1.098381
Name: 2000-01-01 00:00:00, dtype: float64
df.loc[:, 'A':'C']
A B C
2000-01-01 -0.577964 -0.577964 0.222924
2000-01-02 -0.696970 -0.696970 1.165167
... ... ... ...
2000-01-07 0.948088 0.948088 -0.060405
2000-01-08 -0.914922 -0.914922 0.613417

8 rows × 3 columns

# Get columns for which value is greater than 0 on certain day, get all rows
df.loc[:, df.loc['2000-01-01'] > 0]
C
2000-01-01 0.222924
2000-01-02 1.165167
... ...
2000-01-07 -0.060405
2000-01-08 0.613417

8 rows × 1 columns

Selecting by position: iloc

  • The .iloc attribute is purely integer-position based. The following are valid inputs:

    • An integer

    • A list of integers

    • A slice

    • A boolean array

df1 = pd.DataFrame(np.random.randn(6,4), 
                   index=list(range(0,12,2)), columns=list(range(0,12,3)))

df1
0 3 6 9
0 0.475385 -0.631032 -1.535313 0.662338
2 0.289818 -0.659518 0.877284 -1.113570
... ... ... ... ...
8 -0.635411 0.512394 -0.638799 1.760938
10 -0.690578 -0.656628 0.463184 -1.155488

6 rows × 4 columns

# rows 0-2
df1.iloc[:3]
0 3 6 9
0 0.475385 -0.631032 -1.535313 0.662338
2 0.289818 -0.659518 0.877284 -1.113570
4 -0.130941 1.969738 1.103449 -1.368122
#  rows 1:5  and columns 2 : 4
df1.iloc[1:5, 2:4]
6 9
2 0.877284 -1.113570
4 1.103449 -1.368122
6 1.145936 -0.435587
8 -0.638799 1.760938
# select via integer list
df1.iloc[[1,3,5], [1,3]]
3 9
2 -0.659518 -1.113570
6 0.017810 -0.435587
10 -0.656628 -1.155488
# selecting via boolean mask
boolean_mask = df1.iloc[:, 1] > 0.0
df1.iloc[boolean_mask.values,1]
4    1.969738
6    0.017810
8    0.512394
Name: 3, dtype: float64

Exercise#

Given the dataframe df below, find the following

  • Last two rows of columns A and D

  • Last three rows that satisfy column A > column B

dates = pd.date_range('1/1/2000', periods=8)
df = pd.DataFrame(np.random.randn(8, 4), index=dates, columns=['A', 'B', 'C','D'])
df
## Your code here
# Solution: Last two rows of columns A and D
df.iloc[[6,7], [0,3]]
# Solution: last three rows that satisfy column A > column B
mask = df['A'] > df['B']   # True where column A > column B
df[mask].iloc[-3:]         # keep the matching rows, then take the last three

Plotting#

DataFrames and Series have a variety of built-in plotting methods:

Let’s take a look at the Iris data set

  1. sepal length in cm

  2. sepal width in cm

  3. petal length in cm

  4. petal width in cm

  5. class:

     • Iris Setosa

     • Iris Versicolour

     • Iris Virginica

iris = pd.read_csv(
    "https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
    names=["sepal length", "sepal width", "petal length", "petal width", "class"]
)
iris
sepal length sepal width petal length petal width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
... ... ... ... ... ...
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica

150 rows × 5 columns

iris.plot()
plt.show()
# Titanic DataSet
df = pd.read_csv("https://raw.githubusercontent.com/mwaskom/seaborn-data/master/titanic.csv")

df.age.plot()
plt.show()

plot simply plots values against the index, which doesn’t make much sense unless the index is meaningful (like time). In this case, a histogram makes more sense:

df.age.hist()
plt.show()

we can also create scatter plots of two columns

df.plot.scatter(x='age', y='fare', alpha=0.5)
plt.show()

You can also create box plots

df.age.plot.box()
plt.show()

Summary Statistics#

Basic summary statistics are built into Pandas. These are easy to compute on columns/series

print(df.age.mean())
print(df.age.median())
29.69911764705882
28.0

You can also compute statistics on multiple (or all columns)

df.mean(numeric_only=True)  # restrict the mean to numeric columns (required for mixed dtypes in pandas >= 2.0)
survived      0.383838
pclass        2.308642
                ...   
adult_male    0.602694
alone         0.602694
Length: 8, dtype: float64
df[['age', 'fare']].mean()
age     29.699118
fare    32.204208
dtype: float64

You can also compute statistics grouping by category

df[['age', 'sex']].groupby('sex').mean()
age
sex
female 27.915709
male 30.726645
df.groupby('sex').mean(numeric_only=True)
survived pclass age sibsp parch fare adult_male alone
sex
female 0.742038 2.159236 27.915709 0.694268 0.649682 44.479818 0.000000 0.401274
male 0.188908 2.389948 30.726645 0.429809 0.235702 25.523893 0.930676 0.712305

You can count how many records are in each category for categorical variables

df['pclass'].value_counts()
3    491
1    216
2    184
Name: pclass, dtype: int64
df.groupby('sex')['pclass'].value_counts()
sex     pclass
female  3         144
        1          94
                 ... 
male    1         122
        2         108
Name: pclass, Length: 6, dtype: int64

Table Manipulation#

You can sort tables by a column value:

df.sort_values(by='age')[:10]
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
803 1 3 male 0.42 0 1 8.5167 C Third child False NaN Cherbourg yes False
755 1 2 male 0.67 1 1 14.5000 S Second child False NaN Southampton yes False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
381 1 3 female 1.00 0 2 15.7417 C Third child False NaN Cherbourg yes False
164 0 3 male 1.00 4 1 39.6875 S Third child False NaN Southampton no False

10 rows × 15 columns

You can also sort by a primary key and secondary key

df.sort_values(by=['pclass', 'age'])[:600]
survived pclass sex age sibsp parch fare embarked class who adult_male deck embark_town alive alone
305 1 1 male 0.92 1 2 151.5500 S First child False C Southampton yes False
297 0 1 female 2.00 1 2 151.5500 S First child False C Southampton no False
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
162 0 3 male 26.00 0 0 7.7750 S Third man True NaN Southampton no True
207 1 3 male 26.00 0 0 18.7875 C Third man True NaN Cherbourg yes True

600 rows × 15 columns

Groups#

pandas objects can be split on any of their axes. The abstract definition of grouping is to provide a mapping of labels to group names:

Syntax:

  • groups = df.groupby(key)

  • groups = df.groupby(key, axis = 1)

  • groups = df.groupby([key1, key2], axis = 1)

The groupby concept is that we want to apply the same function to subsets of the DataFrame, based on some key used to split it into those subsets

This idea is referred to as the “split-apply-combine” operation:

  • Split the data into groups based on some criteria

  • Apply a function to each group independently

  • Combine the results


df = pd.DataFrame({'key':['A','B','C','A','B','C','A','B','C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})
df
key data
0 A 0
1 B 5
... ... ...
7 B 15
8 C 20

9 rows × 2 columns

df.groupby('key')
<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7f60022988e0>
df.groupby('key').sum()
data
key
A 15
B 30
C 45
sums = df.groupby('key').agg('sum')  # string names are preferred over passing np.sum directly
sums
data
key
A 15
B 30
C 45
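Aggregation collapses each group to a single row. The split-apply-combine idea also covers transforming, where the per-group result is broadcast back to the original rows. A sketch on the same df (the new column names are made up):

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'C'],
                   'data': [0, 5, 10, 5, 10, 15, 10, 15, 20]})

# transform applies a function per group but returns a result
# aligned with the original rows (same length as df)
df['group_mean'] = df.groupby('key')['data'].transform('mean')
df['demeaned'] = df['data'] - df['group_mean']
```

Because transform preserves the original shape, it is the natural tool for per-group normalization, while agg is the tool for per-group summaries.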
ax = iris.groupby('class') \
         .get_group('Iris-setosa') \
         .boxplot(column=["sepal length","sepal width"], return_type='axes')

Pivot Tables#

Say you want the mean age for each sex grouped by class. We can create a pivot table to display the data:

titanic.pivot_table(values="age", index="sex",
                columns="pclass", aggfunc="mean")
pclass 1 2 3
sex
female 34.611765 28.722973 21.750000
male 41.281386 30.740707 26.507589

you can change the aggregation function to compute other statistics

titanic.pivot_table(values="age", index="sex",
                columns="pclass", aggfunc="median")
pclass 1 2 3
sex
female 35.0 28.0 21.5
male 40.0 30.0 25.0
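pivot_table can also append row and column totals via margins=True. A sketch on a small hand-made frame standing in for the Titanic columns used above (the values are invented):

```python
import pandas as pd

# Hypothetical frame with the same column names as the Titanic examples
df = pd.DataFrame({
    "sex": ["female", "female", "male", "male", "male"],
    "pclass": [1, 2, 1, 1, 2],
    "age": [30.0, 20.0, 40.0, 50.0, 25.0],
})

# margins=True adds an "All" row and column holding the overall aggregate
table = df.pivot_table(values="age", index="sex", columns="pclass",
                       aggfunc="mean", margins=True)
print(table)
```

The "All" entries use the same aggfunc, so here they are grand means over each row, each column, and the whole table.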

Merging DataFrames#

pd.merge(left, right, how='inner', on=None, left_on=None, right_on=None,
         left_index=False, right_index=False, sort=False)
# Example of merge
left = pd.DataFrame({'key': ['foo', 'bar'], 'lval': [4, 2]})
right = pd.DataFrame({'key': ['bar', 'zoo'], 'rval': [4, 5]})

print("left: ",left,"right: ",right, sep=end_string)
left: 
 -------------------- 
   key  lval
0  foo     4
1  bar     2
 -------------------- 
right: 
 -------------------- 
   key  rval
0  bar     4
1  zoo     5
merged = pd.merge(left, right, how="inner")
print(merged)
   key  lval  rval
0  bar     2     4
merged = pd.merge(left, right, how="outer")
print(merged)
   key  lval  rval
0  foo   4.0   NaN
1  bar   2.0   4.0
2  zoo   NaN   5.0
merged = pd.merge(left, right, how="left")
print(merged)
   key  lval  rval
0  foo     4   NaN
1  bar     2   4.0
merged = pd.merge(left, right, how="right")
print(merged)
   key  lval  rval
0  bar   2.0     4
1  zoo   NaN     5
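When the key columns are named differently on each side, left_on/right_on specify them, and indicator=True records each row's origin. A sketch with made-up frames:

```python
import pandas as pd

left = pd.DataFrame({'lkey': ['foo', 'bar'], 'lval': [4, 2]})
right = pd.DataFrame({'rkey': ['bar', 'zoo'], 'rval': [4, 5]})

merged = pd.merge(left, right, how='outer',
                  left_on='lkey', right_on='rkey',
                  indicator=True)  # adds a '_merge' column: left_only / both / right_only
print(merged)
```

The _merge column is handy for auditing a join: filtering on left_only or right_only shows exactly which keys failed to match.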

Functions#

Row or Column-wise Function Application: Applies function along input axis of DataFrame

df.apply(func, axis = 0)

Elementwise: apply the function to every element in the df

df.applymap(func)   # renamed to df.map(func) in pandas 2.1
  • Note, applymap is analogous to the built-in map function, applied to every element.

  • Note, Series objects support .map instead of applymap

df1 = pd.DataFrame(np.random.randn(6,4), index=list(range(0,12,2)), columns=list('abcd'))
df1
a b c d
0 -0.086992 -0.084940 0.732539 0.055577
2 0.414442 -0.085030 0.108881 0.392840
... ... ... ... ...
8 -1.401044 -0.814559 -0.227711 -1.525791
10 -1.349603 -0.058741 1.319168 0.215847

6 rows × 4 columns

A = np.random.randn(6,4)
print(np.mean(A, axis=0))
print(A[0])
[-0.30215707  0.32244258 -0.19654789 -0.17821756]
[-0.88573991  1.51802081 -0.79924036 -0.34815962]
print(np.mean(A, axis=1))
print(A[:,0])
[-0.12877977 -0.25276604  0.02348098  0.63743703 -0.20947629 -0.60161581]
[-0.88573991  0.03663791  0.69401035 -1.54572936 -0.68917893  0.57705751]
# Apply to each column
df1.apply(np.mean, axis=0)
a   -0.554604
b   -0.243100
c    0.496747
d    0.087104
dtype: float64
# Apply to each row
df1.apply(np.mean, axis=1)
0     0.154046
2     0.207783
        ...   
8    -0.992276
10    0.031668
Length: 6, dtype: float64
# Use lambda functions to normalize each column
df1.apply(lambda x: (x - x.mean())/ x.std(), axis = 0)
a b c d
0 0.567312 0.279936 0.440002 -0.037146
2 1.175658 0.279778 -0.723779 0.360238
... ... ... ... ...
8 -1.026911 -1.011459 -1.351878 -1.900419
10 -0.964503 0.326306 1.534685 0.151694

6 rows × 4 columns

df1.applymap(np.exp)
a b c d
0 0.916684 0.918567 2.080356 1.057151
2 1.513526 0.918485 1.115030 1.481181
... ... ... ... ...
8 0.246340 0.442835 0.796355 0.217449
10 0.259343 0.942951 3.740308 1.240912

6 rows × 4 columns

Bibliography - this notebook used content from some of the following sources: