fn_babs_datamunging

Practical Data Analysis with Python - Data Munging

Background

Background: Bay Area Bike Share commenced its pilot phase of operation in the San Francisco Bay Area in August 2013 with plans to expand. It is the first bike-sharing scheme in California. Because it is meant for short trips, bikes should be returned to a dock within thirty minutes or an additional fee is incurred, according to the website. There are two types of memberships: customer and subscriber. A subscriber holds an annual membership, while a customer is someone using either a twenty-four-hour or three-day pass. Currently (September 2014), it costs nine dollars for twenty-four hours, twenty-two dollars for three days, and eighty-eight dollars for the year. Overtime fees are four dollars for an extra thirty minutes and seven dollars for each thirty minutes after that. Data on the first six months of operations were released as part of a data challenge. The data included three files covering trip history, weather information, and dock availability; the merged data was used for the following analysis.
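The overtime schedule lends itself to a quick sanity check. The step function below is a minimal sketch of our reading of the fee schedule described above (an interpretation for illustration, not an official calculator):

```python
import math

def overtime_fee(duration_minutes):
    """Overtime fee in dollars, per the Sept 2014 schedule as we read it:
    trips of 30 minutes or less incur no overtime charge, the first extra
    30 minutes costs $4, and each further 30-minute block (or part of one)
    costs $7."""
    if duration_minutes <= 30:
        return 0
    extra_blocks = int(math.ceil((duration_minutes - 30) / 30.0))
    return 4 + 7 * (extra_blocks - 1)
```

For example, a 75-minute trip runs 45 minutes over, spanning two extra blocks, for a fee of 4 + 7 = 11 dollars.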

Data Munging and Carpentry

First, we'll read in the data, inspect the columns and datatypes, and think about what questions we want to ask of the data and what we are interested in learning from it. Be curious and empathetic in considering what the various stakeholders, including the city, the customers, and other interested parties, would want to glean, keeping fiscal, civic, and social goals in mind. In addition, quite a bit of cleaning and data carpentry will be needed to get the data into a format useful for analysis.

The dataset comes from three CSV files released for the Bay Area Bike Share data challenge. We merged the data in R, as we started a similar analysis there, but really wanted to use IPython and the superb time-series functionality in pandas. We'll leave merging the CSV files to the reader; once merged, you'll be able to reproduce the analysis.
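For readers who prefer to stay in Python, the merge can be done with pandas as well. Below is a toy sketch of the join pattern on synthetic frames; the real challenge files have different names, many more columns, and possibly different key columns, so treat the column names here as placeholders inferred from the merged output shown later:

```python
import pandas as pd

# Toy stand-ins for the three challenge files (column names are assumptions).
trips = pd.DataFrame({'Trip.ID': [1, 2], 'Start.Terminal': [62, 65],
                      'date': ['8/29/2013', '8/29/2013'], 'zip': [94107, 94107]})
weather = pd.DataFrame({'date': ['8/29/2013'], 'zip': [94107],
                        'mean_temperature_f': [68]})
stations = pd.DataFrame({'station_id': [62, 65],
                         'name': ['2nd at Folsom', 'Townsend at 7th']})

# Join weather on date and zip, then station metadata on the start terminal.
merged = trips.merge(weather, on=['date', 'zip'])
merged = merged.merge(stations, left_on='Start.Terminal', right_on='station_id')
```

Writing `merged.to_csv('finalmerge.csv', index=False)` on the real frames would then produce the file read in below.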

In [2]:
%load_ext ipycache
In [3]:
%logstart
Activating auto-logging. Current session state plus future input saved.
Filename       : ipython_log.py
Mode           : rotate
Output logging : False
Raw input log  : False
Timestamping   : False
State          : active

In [4]:
import numpy as np
import pandas as pd
from datetime import datetime, time
In [5]:
%matplotlib inline
import seaborn as sns
from ggplot import *
In [6]:
print pd.__version__
print np.__version__
0.14.1
1.9.0

Read the data

In [5]:
dmerge4 = pd.read_csv('finalmerge.csv', parse_dates=['Start.Date'])
dmerge4.head(3)
Out[5]:
lubstartdate Start.Station Trip.ID Duration Start.Date Start.Terminal End.Date End.Station End.Terminal Bike.. Events Wind_Dir_Degrees zip landmark station_id name lat long dockcount installation
0 2013-08-29 2nd at Folsom 4636 186 2013-08-29 15:11:00 62 8/29/2013 15:14 2nd at Townsend 61 366 NaN 286 94107 San Francisco 62 2nd at Folsom 37.785299 -122.396236 19 8/22/2013
1 2013-08-29 2nd at Folsom 4820 813 2013-08-29 17:35:00 62 8/29/2013 17:48 Townsend at 7th 65 409 NaN 286 94107 San Francisco 62 2nd at Folsom 37.785299 -122.396236 19 8/22/2013
2 2013-08-29 2nd at Folsom 5001 236 2013-08-29 20:00:00 62 8/29/2013 20:04 Market at Sansome 77 567 NaN 286 94107 San Francisco 62 2nd at Folsom 37.785299 -122.396236 19 8/22/2013

3 rows × 43 columns

Format column names

In [6]:
#format the column names
dmerge4.columns = dmerge4.columns.map(lambda x: x.replace('.', '').lower())
In [7]:
dmerge4['zip_name'] = dmerge4.zip.replace({94107: 'San Francisco', 95113: 'San Jose', 
                                           94301:'Palo Alto',94041:'Mountain View',94063:'Redwood City'})

Inspecting the datatypes and variables is the first thing to do in data munging, and we'll do it again after our cleaning and formatting operations. Notice that enddate and lubstartdate are not currently in datetime format, so let's remedy that and take care of some other things for the questions related to time series.

In [8]:
dmerge4.dtypes
Out[8]:
lubstartdate                          object
startstation                          object
tripid                                 int64
duration                               int64
startdate                     datetime64[ns]
startterminal                          int64
enddate                               object
endstation                            object
endterminal                            int64
bike                                   int64
subscriptiontype                      object
zipcode                               object
date                                  object
max_temperature_f                      int64
mean_temperature_f                     int64
min_temperaturef                       int64
max_dew_point_f                        int64
meandew_point_f                        int64
min_dewpoint_f                         int64
max_humidity                           int64
mean_humidity                          int64
min_humidity                           int64
max_sea_level_pressure_in            float64
mean_sea_level_pressure_in           float64
min_sea_level_pressure_in            float64
max_visibility_miles                   int64
mean_visibility_miles                  int64
min_visibility_miles                   int64
max_wind_speed_mph                     int64
mean_wind_speed_mph                    int64
max_gust_speed_mph                   float64
precipitation_in                      object
cloud_cover                            int64
events                                object
wind_dir_degrees                       int64
zip                                    int64
landmark                              object
station_id                             int64
name                                  object
lat                                  float64
long                                 float64
dockcount                              int64
installation                          object
zip_name                              object
dtype: object

Working with dates and time

Parse date columns into datetime format

In [9]:
dmerge4['enddate'] = pd.to_datetime(dmerge4['enddate'])
dmerge4['lubstartdate'] = pd.to_datetime(dmerge4['lubstartdate'])

Set datetime index column

In [10]:
#Set datetime index column
dmerge4.set_index('startdate', inplace=True, drop=False, append=False)
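One payoff of setting a DatetimeIndex is partial-string indexing and resampling. A small illustrative frame stands in for dmerge4 below (syntax shown for a recent pandas; older versions used `resample('D', how='count')`):

```python
import pandas as pd

# Toy frame with a DatetimeIndex, mimicking dmerge4's startdate index.
idx = pd.to_datetime(['2013-08-29 15:11', '2013-08-29 17:35', '2013-09-01 09:00'])
df = pd.DataFrame({'duration': [186, 813, 300]}, index=idx)

august = df.loc['2013-08']                      # all August rows by partial date string
daily = df['duration'].resample('D').count()    # trips per day
```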

Extract hour, day, and month from the datetime index

In [11]:
dmerge4['day'] = dmerge4.index.day
dmerge4['hour'] = dmerge4.index.hour
dmerge4['month'] = dmerge4.index.month

Having the date and time is great. For the analysis, however, it would be useful to group time into categories that help us understand usage a little better. We are particularly interested in the differences and similarities in usage and duration patterns between the morning commute and the evening rush period. It would also be interesting to know the profile and usage of riders in the middle of the day, at night, and in what we refer to as the wee hours. So we've decided to split the time into the following segments.

morning: 5, 6, 7, 8, 9, 10
mid-day: 11, 12, 13, 14
evening: 15, 16, 17, 18, 19
night: 20, 21, 22, 23, 0
wee hours: 1, 2, 3, 4

In [12]:
dmerge4['timeofday'] = dmerge4.hour.replace(
    {0: 'night', 1: 'wee_hours', 2: 'wee_hours', 3: 'wee_hours', 4: 'wee_hours',
     5: 'morning', 6: 'morning', 7: 'morning', 8: 'morning', 9: 'morning', 10: 'morning',
     11: 'mid_day', 12: 'mid_day', 13: 'mid_day', 14: 'mid_day',
     15: 'evening', 16: 'evening', 17: 'evening', 18: 'evening', 19: 'evening',
     20: 'night', 21: 'night', 22: 'night', 23: 'night'})
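Writing out all twenty-four entries by hand invites typos; an equivalent mapping can be generated from the segment definitions themselves. A sketch that produces the same hour-to-label dict:

```python
# Build the hour -> label mapping from the segment definitions above.
segments = {'morning': [5, 6, 7, 8, 9, 10],
            'mid_day': [11, 12, 13, 14],
            'evening': [15, 16, 17, 18, 19],
            'night': [20, 21, 22, 23, 0],
            'wee_hours': [1, 2, 3, 4]}
hour_to_label = {h: label for label, hours in segments.items() for h in hours}
# dmerge4['timeofday'] = dmerge4.hour.map(hour_to_label)  # same result as replace()
```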

Difference between end date and start date

In [13]:
#difference between enddate and startdate; vectorized datetime subtraction
dmerge4['diff'] = dmerge4['enddate'] - dmerge4['startdate']

Difference between end hour and start hour

In [14]:
dmerge4['hourdiff'] = dmerge4.apply(lambda x: x['enddate'].hour - x['startdate'].hour, axis=1)

Now, one of the most important variables in the dataset is duration: how long a customer or subscriber keeps a bicycle. Duration is defined as the number of seconds from taking a bicycle from a dock to returning it to a dock. Many of the interesting questions, such as the breakdown of duration between customers and subscribers by landmark or by time of day, are relevant to both end users and civic authorities, since duration affects how many bikes will be available and where they will be needed. Let's transform duration into minutes and then define a binary variable to compare trips of thirty minutes or less with trips over thirty minutes.

Duration is in seconds so let's convert it to minutes

In [15]:
#duration is in seconds so convert to minutes
dmerge4['durationminutes'] = dmerge4['duration']/60

Convert duration to an integer type

In [16]:
#round and convert to integer
dmerge4["duration_i"] = dmerge4['durationminutes'].round(0).astype('int64')

Duration as float type to two decimal places

In [17]:
#round to two decimal places and keep as float
dmerge4["duration_f"] = dmerge4['durationminutes'].round(2)

We'll write a function that maps 'in_thirty' to all values of 30 minutes or less and 'over_thirty' to all values over 30 minutes

In [18]:
def isitthirty(x):
    if x <= 30:
        return 'in_thirty'
    elif x >= 31:
        return 'over_thirty'
    else:
        return 'None'

dmerge4["thirtymin"] = dmerge4['duration_i'].map(isitthirty)
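An alternative to the hand-written conditional is pd.cut, which bins the values in one call; a sketch on a small Series, using the same labels as above:

```python
import pandas as pd

# Bin durations at the 30-minute boundary: (-1, 30] -> in_thirty, (30, inf) -> over_thirty.
minutes = pd.Series([3, 30, 31, 186])
labels = pd.cut(minutes, bins=[-1, 30, float('inf')],
                labels=['in_thirty', 'over_thirty'])
```

As with isitthirty, a value of exactly 30 still lands in the 'in_thirty' bin, since pd.cut's intervals are closed on the right by default.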
In [22]:
dmerge4.dtypes
Out[22]:
lubstartdate                   datetime64[ns]
startstation                           object
tripid                                  int64
duration                                int64
startdate                      datetime64[ns]
startterminal                           int64
enddate                        datetime64[ns]
endstation                             object
endterminal                             int64
bike                                    int64
subscriptiontype                       object
zipcode                                object
date                                   object
max_temperature_f                       int64
mean_temperature_f                      int64
min_temperaturef                        int64
max_dew_point_f                         int64
meandew_point_f                         int64
min_dewpoint_f                          int64
max_humidity                            int64
mean_humidity                           int64
min_humidity                            int64
max_sea_level_pressure_in             float64
mean_sea_level_pressure_in            float64
min_sea_level_pressure_in             float64
max_visibility_miles                    int64
mean_visibility_miles                   int64
min_visibility_miles                    int64
max_wind_speed_mph                      int64
mean_wind_speed_mph                     int64
max_gust_speed_mph                    float64
precipitation_in                       object
cloud_cover                             int64
events                                 object
wind_dir_degrees                        int64
zip                                     int64
landmark                               object
station_id                              int64
name                                   object
lat                                   float64
long                                  float64
dockcount                               int64
installation                           object
zip_name                               object
day                                     int64
hour                                    int64
month                                   int64
timeofday                              object
diff                          timedelta64[ns]
hourdiff                                int64
durationminutes                       float64
duration_i                              int64
duration_f                            float64
thirtymin                              object
Length: 54, dtype: object
In [23]:
%logstop