Using Pandas for Analyzing Data - Data Munging¶

wavedatalab.github.io ¶

%matplotlib
import numpy as np
import pandas as pd

Using matplotlib backend: Qt4Agg

Read the csv file of your choice using Pandas¶

ver=pd.read_csv("ver.csv")

View the first few rows of the file and set the number of columns displayed. Note: Using ver.head() will display the first five rows.¶

pd.set_option('display.max_columns', 80) 
ver.head(3)

Determine the number of rows and columns in the dataset¶

ver.shape

(34573, 46)

Find the number of rows in the dataset¶

len(ver)

34573

Get the names of the columns¶

ver.columns

Index([u'action_taken', u'action_taken_name', u'agency_code', u'agency_abbr', u'agency_name', u'applicant_ethnicity', u'applicant_ethnicity_name', u'applicant_income_000s', u'applicant_race_1', u'applicant_race_name_1', u'applicant_sex', u'applicant_sex_name', u'census_tract_number', u'co_applicant_ethnicity', u'co_applicant_ethnicity_name', u'co_applicant_race_1', u'co_applicant_race_name_1', u'co_applicant_sex', u'co_applicant_sex_name', u'county_code', u'county_name', u'hoepa_status', u'hoepa_status_name', u'lien_status', u'lien_status_name', u'loan_purpose', u'loan_purpose_name', u'loan_type', u'loan_type_name', u'owner_occupancy', u'owner_occupancy_name', u'preapproval', u'preapproval_name', u'property_type', u'property_type_name', u'purchaser_type', u'purchaser_type_name', u'hud_median_family_income', u'loan_amount_000s', u'number_of_1_to_4_family_units', u'number_of_owner_occupied_units', u'minority_population', u'population', u'tract_to_msamd_income', u'logloanamt', u'logincome'], dtype='object')

Get the first five rows of a column by name¶

ver['action_taken'][:5]

0    1
1    1
2    1
3    1
4    1
Name: action_taken, dtype: int64

Create categorical ranges for numerical data. Note that that you can specifiy the number of ranges you wish.¶

incomeranges = pd.cut(ver['applicant_income_000s'], 14)
incomeranges[:5]

0    (-8.998, 715.143]
1    (-8.998, 715.143]
2    (-8.998, 715.143]
3    (-8.998, 715.143]
4    (-8.998, 715.143]
Name: applicant_income_000s, dtype: category
Categories (14, object): [(-8.998, 715.143] < (715.143, 1429.286] < (1429.286, 2143.429] < (2143.429, 2857.571] ... (7142.429, 7856.571] < (7856.571, 8570.714] < (8570.714, 9284.857] < (9284.857, 9999]]

Look at the value counts in the ranges created above¶

pd.value_counts(incomeranges)

(-8.998, 715.143]       34306
(715.143, 1429.286]       200
(1429.286, 2143.429]       33
(2143.429, 2857.571]       21
(2857.571, 3571.714]        6
(3571.714, 4285.857]        4
(5714.143, 6428.286]        2
(9284.857, 9999]            1
(8570.714, 9284.857]        0
(7856.571, 8570.714]        0
(7142.429, 7856.571]        0
(6428.286, 7142.429]        0
(5000, 5714.143]            0
(4285.857, 5000]            0
dtype: int64

Index into the first six columns of the first row¶

ver.ix[0,0:6]

action_taken                                                   1
action_taken_name                                Loan originated
agency_code                                                    1
agency_abbr                                                  OCC
agency_name            Office of the Comptroller of the Currency
applicant_ethnicity                                            2
Name: 0, dtype: object

Order the data by specified column¶

ver['loan_amount_000s'].order()[:5]

4193     1
32737    1
32965    1
34335    1
34342    1
Name: loan_amount_000s, dtype: int64

Sort by a column and that obtain a cross-section of that data¶

sorteddata = ver.sort(['loan_amount_000s'])
sorteddata.ix[:,0:6].head(3)

Obtain the first three rows and first three columns of the sorted data¶

sorteddata.iloc[0:3,0:3]

Obtain value counts of specifiec column¶

ver['action_taken_name'].value_counts()

Loan originated                                21852
Application denied by financial institution     4894
Application withdrawn by applicant              2759
Loan purchased by the institution               2671
Application approved but not accepted           1421
File closed for incompleteness                   976
dtype: int64

A way to obtain the datatype for every column¶

zip(ver.columns, [type(x) for x in ver.ix[0,:]])

[('action_taken', numpy.int64),
 ('action_taken_name', str),
 ('agency_code', numpy.int64),
 ('agency_abbr', str),
 ('agency_name', str),
 ('applicant_ethnicity', numpy.int64),
 ('applicant_ethnicity_name', str),
 ('applicant_income_000s', numpy.int64),
 ('applicant_race_1', numpy.int64),
 ('applicant_race_name_1', str),
 ('applicant_sex', numpy.int64),
 ('applicant_sex_name', str),
 ('census_tract_number', numpy.float64),
 ('co_applicant_ethnicity', numpy.int64),
 ('co_applicant_ethnicity_name', str),
 ('co_applicant_race_1', numpy.int64),
 ('co_applicant_race_name_1', str),
 ('co_applicant_sex', numpy.int64),
 ('co_applicant_sex_name', str),
 ('county_code', numpy.int64),
 ('county_name', str),
 ('hoepa_status', numpy.int64),
 ('hoepa_status_name', str),
 ('lien_status', numpy.int64),
 ('lien_status_name', str),
 ('loan_purpose', numpy.int64),
 ('loan_purpose_name', str),
 ('loan_type', numpy.int64),
 ('loan_type_name', str),
 ('owner_occupancy', numpy.int64),
 ('owner_occupancy_name', str),
 ('preapproval', numpy.int64),
 ('preapproval_name', str),
 ('property_type', numpy.int64),
 ('property_type_name', str),
 ('purchaser_type', numpy.int64),
 ('purchaser_type_name', str),
 ('hud_median_family_income', numpy.int64),
 ('loan_amount_000s', numpy.int64),
 ('number_of_1_to_4_family_units', numpy.int64),
 ('number_of_owner_occupied_units', numpy.int64),
 ('minority_population', numpy.float64),
 ('population', numpy.int64),
 ('tract_to_msamd_income', numpy.float64),
 ('logloanamt', numpy.float64),
 ('logincome', numpy.float64)]

The Pandas way to obtain datatypes for every column¶

ver.dtypes

action_taken                        int64
action_taken_name                  object
agency_code                         int64
agency_abbr                        object
agency_name                        object
applicant_ethnicity                 int64
applicant_ethnicity_name           object
applicant_income_000s               int64
applicant_race_1                    int64
applicant_race_name_1              object
applicant_sex                       int64
applicant_sex_name                 object
census_tract_number               float64
co_applicant_ethnicity              int64
co_applicant_ethnicity_name        object
co_applicant_race_1                 int64
co_applicant_race_name_1           object
co_applicant_sex                    int64
co_applicant_sex_name              object
county_code                         int64
county_name                        object
hoepa_status                        int64
hoepa_status_name                  object
lien_status                         int64
lien_status_name                   object
loan_purpose                        int64
loan_purpose_name                  object
loan_type                           int64
loan_type_name                     object
owner_occupancy                     int64
owner_occupancy_name               object
preapproval                         int64
preapproval_name                   object
property_type                       int64
property_type_name                 object
purchaser_type                      int64
purchaser_type_name                object
hud_median_family_income            int64
loan_amount_000s                    int64
number_of_1_to_4_family_units       int64
number_of_owner_occupied_units      int64
minority_population               float64
population                          int64
tract_to_msamd_income             float64
logloanamt                        float64
logincome                         float64
dtype: object

Get the unique values for a column by name.¶

ver['county_name'].unique()

array(['Grand Isle County', 'Chittenden County', 'Washington County',
       'Franklin County', 'Caledonia County', 'Addison County',
       'Orange County', 'Orleans County', 'Rutland County',
       'Windham County', 'Lamoille County', 'Windsor County',
       'Bennington County', 'Essex County'], dtype=object)

Get a count of the unique values of a column¶

len(ver['county_name'].unique())

14

Index into a column and get the first four rows¶

ver.ix[0:3,'preapproval_name']

0               Not applicable
1               Not applicable
2    Preapproval was requested
3               Not applicable
Name: preapproval_name, dtype: object

Obtain binary values¶

ver.ix[0:3,'preapproval_name'] == "Preapproval was requested"

0    False
1    False
2     True
3    False
Name: preapproval_name, dtype: bool

	action_taken	action_taken_name	agency_code	agency_abbr	agency_name	applicant_ethnicity	applicant_ethnicity_name	applicant_income_000s	applicant_race_1	applicant_race_name_1	applicant_sex	applicant_sex_name	census_tract_number	co_applicant_ethnicity	co_applicant_ethnicity_name	co_applicant_race_1	co_applicant_race_name_1	co_applicant_sex	co_applicant_sex_name	county_code	county_name	hoepa_status	hoepa_status_name	lien_status	lien_status_name	loan_purpose	loan_purpose_name	loan_type	loan_type_name	owner_occupancy	owner_occupancy_name	preapproval	preapproval_name	property_type	property_type_name	purchaser_type	purchaser_type_name	hud_median_family_income	loan_amount_000s	number_of_1_to_4_family_units	number_of_owner_occupied_units	minority_population	population	tract_to_msamd_income	logloanamt	logincome
0	1	Loan originated	1	OCC	Office of the Comptroller of the Currency	2	Not Hispanic or Latino	161	5	White	1	Male	201	2	Not Hispanic or Latino	5	White	2	Female	13	Grand Isle County	2	Not a HOEPA loan	1	Secured by a first lien	3	Refinancing	1	Conventional	2	Not owner-occupied as a principal dwelling	3	Not applicable	1	One-to-four family dwelling (other than manufa…	1	Fannie Mae (FNMA)	76700	200	2681	1186	7.12	3272	80.459999	5.298317	5.081404
1	1	Loan originated	1	OCC	Office of the Comptroller of the Currency	2	Not Hispanic or Latino	92	2	Asian	1	Male	34	2	Not Hispanic or Latino	2	Asian	2	Female	7	Chittenden County	2	Not a HOEPA loan	1	Secured by a first lien	3	Refinancing	1	Conventional	1	Owner-occupied as a principal dwelling	3	Not applicable	1	One-to-four family dwelling (other than manufa…	0	Loan was not originated or was not sold in cal…	76700	279	2818	2345	4.58	7144	154.830002	5.631212	4.521789
2	1	Loan originated	5	NCUA	National Credit Union Administration	2	Not Hispanic or Latino	106	5	White	1	Male	29	1	Hispanic or Latino	5	White	2	Female	7	Chittenden County	2	Not a HOEPA loan	1	Secured by a first lien	1	Home purchase	1	Conventional	1	Owner-occupied as a principal dwelling	1	Preapproval was requested	1	One-to-four family dwelling (other than manufa…	1	Fannie Mae (FNMA)	76700	244	2447	2136	3.42	6227	124.199997	5.497168	4.663439

	action_taken	action_taken_name	agency_code	agency_abbr	agency_name	applicant_ethnicity
4193	1	Loan originated	5	NCUA	National Credit Union Administration	2
32737	6	Loan purchased by the institution	9	CFPB	Consumer Financial Protection Bureau	3
32965	6	Loan purchased by the institution	9	CFPB	Consumer Financial Protection Bureau	3

Using Pandas for Analyzing Data - Data Munging¶

wavedatalab.github.io¶

Read the csv file of your choice using Pandas¶

View the first few rows of the file and set the number of columns displayed. Note: Using ver.head() will display the first five rows.¶

Determine the number of rows and columns in the dataset¶

Find the number of rows in the dataset¶

Get the names of the columns¶

Get the first five rows of a column by name¶

Create categorical ranges for numerical data. Note that that you can specifiy the number of ranges you wish.¶

Look at the value counts in the ranges created above¶

Index into the first six columns of the first row¶

Order the data by specified column¶

Sort by a column and that obtain a cross-section of that data¶

Obtain the first three rows and first three columns of the sorted data¶

Obtain value counts of specifiec column¶

A way to obtain the datatype for every column¶

The Pandas way to obtain datatypes for every column¶

Get the unique values for a column by name.¶

Get a count of the unique values of a column¶

Index into a column and get the first four rows¶

Obtain binary values¶

wavedatalab.github.io ¶