Time-series data is fun and interesting. Working with time-series data has already been covered extensively in the data munging and grouping & aggregating notebooks. This notebook covers ad-hoc time-series topics relevant to data analysis; the time-series plots produced from these datasets are presented in the visualization section.
We have implemented and shown varying strategies for working with datetime data to gain meaning and insight from the dataset. Now, we'll look more closely at specific dates and times, categorical time intervals, timedeltas, and general properties of datetime and timestamp data.
We can analyze the data over long spans of dates, at a specific date or time, or within a particular time interval. This dataset was chosen in part because it has several properties that make it well suited to illustrating time-series techniques.
Preliminary work with getting date and time data into a nice format for data analysis has been covered in the data munging section. Now, let's take a deeper look at working with time-series data to gain meaning and insight from our bike sharing dataset.
import numpy as np
import pandas as pd
from datetime import datetime, time
%matplotlib inline
import seaborn as sns
from ggplot import *
print(pd.__version__)
print(np.__version__)
%load_ext ipycache
Resampling lets us convert the data to a different time frequency and compute summary statistics at that frequency.
Resample the dataset to daily means, so each row holds the mean of all trips started on the same day
# Daily means
daily_means = dmerge4.resample('D').mean(numeric_only=True).reset_index(drop=False)
daily_means.head(2)
Resample the dataset to monthly means, so each row holds the mean of all trips started in the same month
#monthly means
monthly_means = dmerge4.resample('M').mean(numeric_only=True).reset_index(drop=False)
monthly_means.head(2)
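As a self-contained illustration of the same pattern (using a small synthetic DataFrame in place of `dmerge4`, whose construction is covered in the data munging notebook), resampling a datetime-indexed frame to daily means looks like this:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for dmerge4: one row per hour over four days,
# indexed by a DatetimeIndex (resample requires a datetime-like index)
idx = pd.date_range('2014-01-01', periods=96, freq='h')
trips = pd.DataFrame({'duration_f': np.arange(96, dtype=float)}, index=idx)

# Daily means: one row per calendar day, averaging the 24 hourly rows
daily = trips.resample('D').mean()
print(daily)
```

The same frame resampled with `'M'` instead of `'D'` would collapse all four days into a single January row.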
The time-series functionality in pandas allows granularity down to the second, minute, or hour. For instance, one can look at average duration at four pm on a monthly basis and compare it to eight am. However, it may be more instructive to look at a range of times, such as the rush hour or morning commute interval.
Average duration at eight am and four pm resampled to monthly mean
#the datetime index at 8am resampled to monthly
eight_am = dmerge4.at_time(time(8, 0)).resample('M').mean(numeric_only=True)[['duration_f']]
eight_am
four_pm = dmerge4.at_time(time(16, 0)).resample('M').mean(numeric_only=True)[['duration_f']]
four_pm
In every month, average duration at four pm is higher than at eight am. When looking at a range of times, such as monthly morning duration versus monthly evening duration, the differences are less pronounced. This is interesting because knowing the hours or time ranges when bicycles are likely to be unavailable, based on duration, is useful for demand and infrastructure planning. This information can also inform marketing strategies: perhaps incentives can be offered to shift durations now that we know more from the time-series data.
Look at average duration between 5am and 10am on a monthly basis
morning_monthly = dmerge4.between_time(time(5, 0), time(10, 0)).resample('M').mean(numeric_only=True)[['duration_f']]
morning_monthly
Summary statistics for morning time interval
dmerge4.between_time(time(5, 0), time(10, 0)).describe()
Look at average duration between three pm and seven pm on a monthly basis
evening_monthly = dmerge4.between_time(time(15, 0), time(19, 0)).resample('M').mean(numeric_only=True)[['duration_f']]
evening_monthly
Summary statistics for evening time interval which was chosen to encompass the ‘rush hour’ time period.
dmerge4.between_time(time(15, 0), time(19, 0)).describe()
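The `at_time` and `between_time` selections above can be sketched on synthetic data (the index must be a `DatetimeIndex`; the `duration_f` values here are purely illustrative):

```python
import pandas as pd
from datetime import time

# Two days of hourly rows
idx = pd.date_range('2014-01-01', periods=48, freq='h')
df = pd.DataFrame({'duration_f': range(48)}, index=idx)

# Rows whose timestamp falls exactly on 08:00, on any date
eight = df.at_time(time(8, 0))

# Rows between 05:00 and 10:00 (inclusive on both ends by default)
morning = df.between_time(time(5, 0), time(10, 0))
print(eight)
print(morning)
```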
So, by slicing and dicing the datetime object into morning and evening commuting hours, we've learned that out of the total there are 47592 observations in the evening window, 38885 in the morning window, and 86477 in the combined time period; evening trips make up about 55% of the combined window, roughly 10 percentage points more than the morning.
When we munged the data, we created a categorical field for time intervals by extracting the hour component of the timestamp. This field is useful for grouping, aggregating, and plotting based on the labelled ranges.
Here is a table of average duration by time of day category.
dmerge4.groupby('timeofday')[['duration_f']].mean()
The results are very interesting. We were curious as to why the data suggested that trips were primarily shorter rides. Splitting the data into labelled time-of-day intervals reveals some interesting insights. Perhaps evening and morning rides are commutes, while rides at other times of day run closer to the thirty-minute time limit for each trip. But why is the average duration longer in the wee hours? In the grouping and aggregation portion of the analysis, we run further queries by time of day to learn more.
- morning: 5, 6, 7, 8, 9, 10
- midday: 11, 12, 13, 14
- evening: 15, 16, 17, 18, 19
- night: 20, 21, 22, 23, 0
- wee hours: 1, 2, 3, 4
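A minimal sketch of how such a categorical field can be built from the hour component (the exact munging code lives in the data munging notebook; the `label_timeofday` helper and sample timestamps here are illustrative):

```python
import pandas as pd

def label_timeofday(hour):
    # Bins follow the intervals listed above
    if 5 <= hour <= 10:
        return 'morning'
    if 11 <= hour <= 14:
        return 'midday'
    if 15 <= hour <= 19:
        return 'evening'
    if hour >= 20 or hour == 0:
        return 'night'
    return 'wee hours'  # hours 1-4

starts = pd.Series(pd.to_datetime(['2014-01-01 08:30', '2014-01-01 13:00',
                                   '2014-01-01 17:45', '2014-01-01 23:10',
                                   '2014-01-02 03:00']))
timeofday = starts.dt.hour.map(label_timeofday)
print(timeofday.tolist())
```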
The column we created, named diff, holds the difference between the datetime objects of end date and start date
dmerge4['diff'].head()
This extracts the minutes component from the timedelta column named diff. Note that this differs from the extracted hour and minute columns, because those are based on the start date only.
np.round((dmerge4['diff']/ np.timedelta64(1, 'm') % 60))
This returns the minutes component of the timedelta rather than the total minutes, which is what we want for this analysis. We mention it here because the component form can be useful, particularly when working with financial time-series data.
np.round((dmerge4['diff']/ np.timedelta64(1, 'm') % 60)).max()
By looking at the hour component, notice that most trips were under one hour
deltahour = np.round((dmerge4['diff']/ np.timedelta64(1, 'h')))
deltahour
This gives the total duration in minutes rather than just the minutes component
dmerge4['diff'].apply(lambda x: x / np.timedelta64(1, 'm'))
The max function confirms that total minutes, rather than just the minutes component, are returned. Success.
dmerge4['diff'].apply(lambda x: x / np.timedelta64(1, 'm')).max()
This gives the same result without a lambda function
np.round((dmerge4['diff']/ np.timedelta64(1, 'm'))).max()
dmerge4['diff'].apply(lambda x: x / np.timedelta64(1, 'm')).plot()
One thing to keep in mind, especially for general Python users, is that pandas uses NumPy's datetime64 while the standard library provides datetime.datetime. The latest release of pandas also introduces a datetime accessor for working with dates and times, which is worth checking out.
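For example, with the `.dt` accessor a timedelta column can be converted to total minutes directly, without dividing by `np.timedelta64(1, 'm')`. A sketch with a synthetic diff column:

```python
import pandas as pd

df = pd.DataFrame({
    'start': pd.to_datetime(['2014-01-01 08:00', '2014-01-01 09:00']),
    'end':   pd.to_datetime(['2014-01-01 08:12', '2014-01-01 10:30']),
})
df['diff'] = df['end'] - df['start']

# Total minutes via the .dt accessor; equivalent to the
# division-by-timedelta64 approach used above
total_minutes = df['diff'].dt.total_seconds() / 60
print(total_minutes.tolist())  # [12.0, 90.0]
```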