Background: Bay Area Bike Share commenced its pilot phase of operation in the San Francisco Bay Area in August 2013, with plans to expand. It is the first bike sharing scheme in California. Because the system is meant for short trips, a bike must be returned to a dock within thirty minutes or an additional fee is incurred, according to the website. There are two types of membership: customer and subscriber. A subscriber holds an annual membership, while a customer is someone using either a twenty-four-hour or three-day pass. Currently (September 2014), it costs nine dollars for twenty-four hours, twenty-two dollars for three days, and eighty-eight dollars for the year. Overtime fees are four dollars for an extra thirty minutes and seven dollars for each thirty minutes after that. Data on the first six months of operations were released as part of a data challenge. The data included three files covering trip history, weather, and dock availability. The merged data was used for the following analysis.
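As a concrete reading of that fee schedule (our interpretation of the pricing above, not an official calculator), the overtime charge for a trip of a given length works out like this:
import math

def overtime_fee(minutes):
    #our reading of the schedule: the first 30 minutes are free, the next
    #30 cost $4, and each further 30-minute block (or part of one) costs $7
    if minutes <= 30:
        return 0
    blocks_past_free = int(math.ceil((minutes - 30) / 30.0))
    return 4 + 7 * (blocks_past_free - 1)

print(overtime_fee(25))  # 0: within the free half hour
print(overtime_fee(45))  # 4: one extra half-hour block
print(overtime_fee(95))  # 18: 4 + 7 + 7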
First, we'll read in the data, inspect the columns and datatypes, and think about what questions we want to ask of the data and what we're interested in learning from it. Be curious and empathetic in considering what the various stakeholders, including the city, the customers, and other interested people, would want to glean, keeping fiscal, civic, and social goals in mind. In addition, quite a bit of cleaning and data carpentry will be needed to get the data into a format useful for analysis.
The dataset comes from three CSV files from the Bay Area Bike Share data challenge. We merged the data in R, as we started a similar analysis there, but really wanted to use IPython and the superb time series functionality in pandas. We'll leave the merge to the reader (a sketch follows below); once you have the merged file, you'll be able to reproduce the analysis.
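For readers who would rather stay in pandas, here is a minimal sketch of the joins. The file and column names are assumptions based on the public data challenge download, so verify them against your copy. Note also that our merged file came out of R, so its column names (e.g. Start.Date, lubstartdate) differ from what this sketch produces, and you would need to rename columns to match the code below.
import pandas as pd

#assumed file and column names; check the headers in your own download
trips = pd.read_csv('201402_trip_data.csv', parse_dates=['Start Date', 'End Date'])
stations = pd.read_csv('201402_station_data.csv')
weather = pd.read_csv('201402_weather_data.csv', parse_dates=['Date'])

#attach station metadata (notably 'landmark', the station's city) to each trip
trips = trips.merge(stations, left_on='Start Terminal', right_on='station_id', how='left')

#weather is daily, one row per city zip code; map each landmark to its zip
landmark_zip = {'San Francisco': 94107, 'San Jose': 95113, 'Palo Alto': 94301,
                'Mountain View': 94041, 'Redwood City': 94063}
trips['zip'] = trips['landmark'].map(landmark_zip)

#join the daily weather onto each trip by calendar date and zip code
#(assumes the weather file has a 'zip' column; if not, add one from its city field)
trips['date'] = trips['Start Date'].apply(lambda ts: ts.date())
weather['date'] = weather['Date'].apply(lambda ts: ts.date())
merged = trips.merge(weather, on=['date', 'zip'], how='left')
merged.to_csv('finalmerge.csv', index=False)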
%load_ext ipycache
%logstart
import numpy as np
import pandas as pd
from datetime import datetime, time
%matplotlib inline
import seaborn as sns
from ggplot import *
#print library versions for reproducibility
print(pd.__version__)
print(np.__version__)
Read the data
dmerge4 = pd.read_csv('finalmerge.csv', parse_dates=['Start.Date'])
dmerge4.head(3)
Format column names
#format the column names
dmerge4.columns = dmerge4.columns.map(lambda x: x.replace('.', '').lower())
dmerge4['zip_name'] = dmerge4.zip.replace({94107: 'San Francisco', 95113: 'San Jose',
                                           94301: 'Palo Alto', 94041: 'Mountain View',
                                           94063: 'Redwood City'})
Inspecting the datatypes and variables is the first thing to do in data munging, and we'll do it again after our cleaning and formatting operations. Notice that the end date and start date are not currently in datetime format, so let's remedy that and take care of some other things for the time-series questions.
dmerge4.dtypes
Parse date columns into datetime format
dmerge4['enddate'] = pd.to_datetime(dmerge4['enddate'])
dmerge4['lubstartdate'] = pd.to_datetime(dmerge4['lubstartdate'])
Set datetime index column
#Set datetime index column
dmerge4.set_index('startdate', inplace=True, drop=False, append=False)
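With a datetime index in place, time-based grouping becomes straightforward. As a quick illustration, here is a count of trips started on each calendar day (written with groupby so it works across pandas versions):
#number of trips started on each calendar day
trips_per_day = dmerge4.groupby(dmerge4.index.date).size()
trips_per_day.head()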
Extract the day, hour, and month from the datetime index
dmerge4['day'] = dmerge4.index.day
dmerge4['hour'] = dmerge4.index.hour
dmerge4['month'] = dmerge4.index.month
Having the date and time is great. For the analysis, however, it would be useful to group time into categories that help us understand usage a little better. We're particularly interested in the differences and similarities in usage and duration patterns between the morning commute and the evening rush. It would also be interesting to know the profile and usage of riders in the middle of the day, at night, and in what we refer to as the wee hours. So we've decided to split the time into the following segments:
morning: 5, 6, 7, 8, 9, 10
mid-day: 11, 12, 13, 14
evening: 15, 16, 17, 18, 19
night: 20, 21, 22, 23, 0
wee hours: 1, 2, 3, 4
#map each hour to a time-of-day bucket
hours_to_timeofday = {
    0: 'night', 1: 'wee_hours', 2: 'wee_hours', 3: 'wee_hours', 4: 'wee_hours',
    5: 'morning', 6: 'morning', 7: 'morning', 8: 'morning', 9: 'morning', 10: 'morning',
    11: 'mid_day', 12: 'mid_day', 13: 'mid_day', 14: 'mid_day',
    15: 'evening', 16: 'evening', 17: 'evening', 18: 'evening', 19: 'evening',
    20: 'night', 21: 'night', 22: 'night', 23: 'night'}
dmerge4['timeofday'] = dmerge4.hour.replace(hours_to_timeofday)
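A quick sanity check that every hour landed in a bucket and that the five labels have plausible counts:
dmerge4.timeofday.value_counts()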
Difference between end date and start date
#difference between enddate and startdate (vectorized; apply is unnecessary here)
dmerge4['diff'] = dmerge4['enddate'] - dmerge4['startdate']
Difference between end hour and start hour
#note: raw hour difference; this will be negative for trips that cross midnight
dmerge4['hourdiff'] = dmerge4.apply(lambda x: x['enddate'].hour - x['startdate'].hour, axis=1)
Now, one of the most important variables in the dataset is duration: how long a customer or subscriber has a bicycle, defined as the number of seconds between taking a bike from a dock and returning it to a dock. Many of the interesting questions, such as the breakdown of duration between customers and subscribers by landmark or by time of day, are relevant to both end users and civic authorities, since duration affects how many bikes will be available and where they will be needed. Let's transform duration into minutes and then define a binary variable to compare trips of thirty minutes or less with trips over thirty minutes.
Duration is in seconds so let's convert it to minutes
#duration is in seconds so convert to minutes
dmerge4['durationminutes'] = dmerge4['duration']/60
Convert duration to an integer type
#round and convert to integer
dmerge4["duration_i"] = dmerge4['durationminutes'].round(0).astype('int64')
Duration as float type to two decimal places
#round to two decimal places and keep as float
dmerge4["duration_f"] = dmerge4['durationminutes'].round(2)
We'll write a function that maps 'in_thirty' to all values of thirty minutes or less and 'over_thirty' to all values over thirty minutes
def isitthirty(x):
    if x <= 30:
        return 'in_thirty'
    elif x > 30:
        return 'over_thirty'
    else:
        return 'None'
dmerge4["thirtymin"] = dmerge4['duration_i'].map(isitthirty)
dmerge4.dtypes
%logstop