fn_babs_timeseries

Practical Data Analysis with Python - Time Series

Working with time-series data

Time-series data is fun and interesting. Working with time-series data has already been covered extensively in the data munging and grouping & aggregating notebooks. This notebook covers ad-hoc time-series topics relevant to data analysis. The time-series plots produced from the datasets in these notebooks are presented in the visualization section.

We have already implemented and shown several strategies for working with datetime data to gain meaning and insight from the dataset. Now, we'll look more deeply at specific dates and times, categorical time intervals, timedeltas, and general properties of datetime and timestamp data.

We can analyze the data over longer periods of time, at a specific time or date, or within a particular time interval. This dataset was chosen in part because it has several nice properties for illustrating work with time-series data.

Preliminary work on getting date and time data into a workable format for analysis was covered in the data munging section. Now, let's take a deeper look at working with time-series data to gain meaning and insight from our bike sharing dataset.

In [2]:
import numpy as np
import pandas as pd
from datetime import datetime, time
In [3]:
%matplotlib inline
import seaborn as sns
from ggplot import *
In [11]:
print pd.__version__
print np.__version__
0.14.1
1.9.0

In [4]:
%load_ext ipycache

Resampling

Resampling converts the time series to a different frequency and computes summary statistics for each period.

Resample the dataset to daily frequency so each row holds the mean values for trips started on the same day.

In [7]:
# Daily means
daily_means = dmerge4.resample('D', how='mean').reset_index(drop=False) 
daily_means.head(2)
Out[7]:
startdate tripid duration startterminal endterminal bike max_temperature_f mean_temperature_f min_temperaturef max_dew_point_f lat long dockcount day hour month hourdiff durationminutes duration_i duration_f
0 2013-08-29 4661.505348 1561.296791 54.679144 55.017380 431.470588 74.871658 68.491979 61.318182 61.201872 37.728908 -122.345651 18.970588 29 15.381016 8 0.223262 26.021613 26.028075 26.021564
1 2013-08-30 5837.312325 2703.133053 54.168067 54.177871 439.878151 79.400560 69.901961 60.456583 61.390756 37.724274 -122.337827 19.235294 30 14.277311 8 0.439776 45.052218 45.051821 45.052227

2 rows × 38 columns

Resample the dataset to monthly frequency so each row holds the mean values for trips started in the same month.

In [8]:
# Monthly means
monthly_means = dmerge4.resample('M', how='mean').reset_index(drop=False) 
monthly_means.head(2)
Out[8]:
startdate tripid duration startterminal endterminal bike max_temperature_f mean_temperature_f min_temperaturef max_dew_point_f lat long dockcount day hour month hourdiff durationminutes duration_i duration_f
0 2013-08-31 5763.824453 2588.182683 55.104186 55.247859 433.639391 75.504757 67.814462 59.870599 60.127973 37.730096 -122.346329 19.064700 29.948620 14.493815 8 0.430067 43.136378 43.139867 43.136413
1 2013-09-30 24370.800578 1578.130729 56.731054 56.811908 438.652102 74.284079 65.888048 56.996989 56.373965 37.745550 -122.357832 19.261142 16.282415 13.664660 9 0.265539 26.302179 26.302302 26.302204

2 rows × 38 columns
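As a hedged aside: the `how=` argument to `resample` used above matches the pandas 0.14 API in this notebook; in modern pandas it has been replaced by method chaining. A minimal sketch on synthetic data (the trip log here is a stand-in for `dmerge4`, which is assumed to be indexed by `startdate`):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the trip log: hourly rows indexed by start time
rng = pd.date_range("2013-08-29", periods=72, freq="h")
trips = pd.DataFrame({"duration_f": np.arange(72, dtype=float)}, index=rng)

# Modern equivalent of resample('D', how='mean'): chain the aggregation
daily_means = trips.resample("D").mean()

# Monthly means; 'MS' anchors each bin at the month start
monthly_means = trips.resample("MS").mean()
```

With 72 hourly rows spanning three days, `daily_means` has three rows and `monthly_means` collapses them into one.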

Working with time data

The time-series functionality in pandas allows granularity down to the hour, minute, or second. For instance, one can look at average duration at four pm on a monthly basis and compare it to eight am. However, it may be more instructive to look at a range of times, such as the rush-hour interval of the morning commute.

Average duration at eight am and four pm resampled to monthly mean

In [10]:
#the datetime index at 8am resampled to monthly
eight_am = dmerge4.at_time(time(8,0)).resample('M', how='mean')[['duration_f']]
eight_am
Out[10]:
duration_f
startdate
2013-08-31 10.980000
2013-09-30 10.226316
2013-10-31 10.027674
2013-11-30 10.429714
2013-12-31 9.469492
2014-01-31 9.413051
2014-02-28 8.042712
In [11]:
four_pm = dmerge4.at_time(time(16,0)).resample('M', how='mean')[['duration_f']]
four_pm
Out[11]:
duration_f
startdate
2013-08-31 16.785000
2013-09-30 14.976786
2013-10-31 26.602927
2013-11-30 19.147391
2013-12-31 11.453214
2014-01-31 12.583846
2014-02-28 14.857083
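The `at_time` selection used above can be sketched on synthetic data; this toy example uses modern pandas syntax (`.resample(...).mean()` rather than the notebook's `how='mean'`):

```python
import numpy as np
import pandas as pd

# Synthetic hourly trip log over two days
rng = pd.date_range("2013-08-29", periods=48, freq="h")
trips = pd.DataFrame({"duration_f": np.arange(48, dtype=float)}, index=rng)

# Keep only rows whose clock time is exactly 08:00
eight_am = trips.at_time("08:00")

# Then take monthly means of just those rows
eight_am_monthly = eight_am.resample("MS").mean()
```

Two days of hourly data yield exactly two 08:00 rows, which the resample then averages into a single monthly value.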

In every month, average duration at four pm is higher than at eight am. When looking at a range of times, such as monthly morning duration versus monthly evening duration, the differences are less sharply delineated. This matters because knowing at what hour or range of times bicycles will not be available, based on duration, is useful for demand and infrastructure planning, and it can also inform marketing strategy. For example, incentives could be offered to shift trip durations now that we know more from the time-series data.

Look at average duration between 5am and 10am on a monthly basis

In [7]:
morning_monthly = dmerge4.between_time(time(5, 0), time(10, 0)).resample('M', how='mean')[['duration_f']]
morning_monthly
Out[7]:
duration_f
startdate
2013-08-31 37.648913
2013-09-30 17.973624
2013-10-31 13.302877
2013-11-30 13.209249
2013-12-31 15.384694
2014-01-31 11.679421
2014-02-28 11.441120
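The `between_time` filter works like `at_time` but over a clock-time window; a minimal sketch on one synthetic day:

```python
import numpy as np
import pandas as pd

# One synthetic day of hourly observations
rng = pd.date_range("2013-08-29", periods=24, freq="h")
trips = pd.DataFrame({"duration_f": np.arange(24, dtype=float)}, index=rng)

# Rows whose clock time falls between 05:00 and 10:00;
# both endpoints are included by default
morning = trips.between_time("05:00", "10:00")
```

Hours 5 through 10 inclusive give six rows, so the default endpoint handling is worth keeping in mind when defining commute windows.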

Summary statistics for morning time interval

In [9]:
dmerge4.between_time(time(5, 0), time(10, 0)).describe()
Out[9]:
tripid duration startterminal endterminal bike max_temperature_f mean_temperature_f min_temperaturef max_dew_point_f meandew_point_f lat long dockcount day hour month hourdiff durationminutes duration_i duration_f
count 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000
mean 108075.706699 822.261926 59.107522 57.958313 440.186139 65.528764 57.483477 49.009850 49.158262 44.304565 37.750334 -122.365098 19.981664 15.667507 7.919789 7.396040 0.223660 13.704365 13.704616 13.704364
std 54154.593638 4807.622205 16.421185 16.264455 134.382220 7.346052 6.013163 5.988127 5.980818 6.967008 0.117364 0.119254 3.846183 8.451855 0.990138 4.387635 0.697399 80.127037 80.128487 80.127029
min 4069.000000 60.000000 2.000000 2.000000 9.000000 46.000000 38.000000 25.000000 23.000000 16.000000 37.329732 -122.418954 11.000000 1.000000 5.000000 1.000000 -7.000000 1.000000 1.000000 1.000000
25% 62969.000000 339.000000 53.000000 50.000000 353.000000 60.000000 54.000000 45.000000 46.000000 41.000000 37.776317 -122.403234 19.000000 9.000000 7.000000 2.000000 0.000000 5.650000 6.000000 5.650000
50% 108193.000000 491.000000 63.000000 61.000000 448.000000 65.000000 57.000000 49.000000 50.000000 45.000000 37.785299 -122.395569 19.000000 16.000000 8.000000 10.000000 0.000000 8.183333 8.000000 8.180000
75% 155471.000000 688.000000 70.000000 70.000000 546.000000 71.000000 62.000000 53.000000 53.000000 49.000000 37.791300 -122.394203 23.000000 23.000000 9.000000 11.000000 0.000000 11.466667 11.000000 11.470000
max 198297.000000 619322.000000 83.000000 82.000000 717.000000 95.000000 76.000000 66.000000 66.000000 61.000000 37.804770 -121.877349 27.000000 31.000000 10.000000 12.000000 15.000000 10322.033333 10322.000000 10322.030000

8 rows × 37 columns

Look at average duration between three pm and seven pm on a monthly basis

In [11]:
evening_monthly = dmerge4.between_time(time(15, 0), time(19, 0)).resample('M', how='mean')[['duration_f']]
evening_monthly
Out[11]:
duration_f
startdate
2013-08-31 37.240278
2013-09-30 24.621531
2013-10-31 16.752767
2013-11-30 17.277542
2013-12-31 14.980330
2014-01-31 14.962542
2014-02-28 15.718395

Summary statistics for the evening time interval, which was chosen to encompass the 'rush hour' period.

In [12]:
dmerge4.between_time(time(15, 0), time(19, 0)).describe()
Out[12]:
tripid duration startterminal endterminal bike max_temperature_f mean_temperature_f min_temperaturef max_dew_point_f meandew_point_f lat long dockcount day hour month hourdiff durationminutes duration_i duration_f
count 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000
mean 101102.863128 1070.586275 56.689212 58.069213 436.845583 66.428854 58.189738 49.515360 49.735649 44.907926 37.746342 -122.358438 19.100227 15.762775 16.629118 7.629497 0.211653 17.843105 17.842789 17.843105
std 56272.442662 6060.799465 16.969014 17.303616 138.118301 7.600397 6.295636 6.244173 6.174970 7.177395 0.126618 0.130883 4.109463 8.633882 1.031338 4.200082 0.671495 101.013324 101.013812 101.013307
min 4621.000000 60.000000 2.000000 2.000000 9.000000 46.000000 38.000000 25.000000 23.000000 16.000000 37.329732 -122.418954 11.000000 1.000000 15.000000 1.000000 -18.000000 1.000000 1.000000 1.000000
25% 51994.500000 359.000000 50.000000 50.000000 348.000000 61.000000 54.000000 45.000000 46.000000 41.000000 37.776619 -122.403234 15.000000 9.000000 16.000000 2.000000 0.000000 5.983333 6.000000 5.980000
50% 99347.000000 542.000000 60.000000 62.000000 446.000000 66.000000 57.000000 50.000000 50.000000 46.000000 37.786978 -122.398108 19.000000 16.000000 17.000000 10.000000 0.000000 9.033333 9.000000 9.030000
75% 150233.250000 794.000000 69.000000 70.000000 545.000000 71.000000 63.000000 54.000000 54.000000 49.000000 37.794139 -122.392738 23.000000 23.000000 17.000000 11.000000 0.000000 13.233333 13.000000 13.230000
max 198701.000000 597517.000000 82.000000 82.000000 717.000000 95.000000 78.000000 66.000000 66.000000 61.000000 37.804770 -121.877349 27.000000 31.000000 19.000000 12.000000 7.000000 9958.616667 9959.000000 9958.620000

8 rows × 37 columns

So, by slicing the datetime index into morning and evening commuting hours, we've learned that there are 47,592 observations in the evening and 38,885 in the morning (86,477 in the combined time period out of the total), and roughly 22% more observations in the evening than in the morning.

Categorical time intervals

When we munged the data, we created a categorical field of time intervals by extracting the hour component of the timestamp. This field is useful for grouping, aggregating, and plotting based on the labelled ranges.

Here is a table of average duration by time of day category.

In [28]:
dmerge4.groupby('timeofday')[['duration_f']].mean()
Out[28]:
duration_f
timeofday
evening 17.499759
mid_day 29.401846
morning 16.110753
night 23.363483
wee_hours 60.553483

The results are very interesting. We were curious why the data suggested that trips were primarily short rides, and splitting the data into labelled time-of-day intervals reveals some insights. Perhaps evening and morning rides are commutes, while trips at other times of day run closer to the thirty-minute limit per trip. But why is the average duration longest in the wee hours? In the grouping and aggregation portion of the analysis, we run further queries by time of day to learn more.

Hour ranges for each category: morning: 5-10, mid_day: 11-14, evening: 15-19, night: 20-23 and 0, wee_hours: 1-4
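A minimal sketch of how such a categorical could be built from the hour component, using the bucket boundaries listed above (the notebook's actual munging code lives in an earlier section, so this mapping function is illustrative):

```python
import pandas as pd

def time_of_day(hour):
    # Bucket boundaries from the legend above
    if 5 <= hour <= 10:
        return "morning"
    if 11 <= hour <= 14:
        return "mid_day"
    if 15 <= hour <= 19:
        return "evening"
    if hour >= 20 or hour == 0:
        return "night"
    return "wee_hours"

# Map a Series of start hours to their time-of-day labels
hours = pd.Series([8, 12, 17, 22, 0, 3])
labels = hours.map(time_of_day)
```

Grouping on such a column (`df.groupby('timeofday')`) then yields the table of means shown above.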

Understanding timedeltas

The column we created named diff holds the difference between the end-date and start-date datetime objects.

In [7]:
dmerge4['diff'].head()
Out[7]:
startdate
2013-08-29 15:11:00   00:03:00
2013-08-29 17:35:00   00:13:00
2013-08-29 20:00:00   00:04:00
2013-08-29 17:30:00   00:03:00
2013-08-29 19:07:00   00:14:00
Name: diff, dtype: timedelta64[ns]

This extracts the minutes component from the timedelta column named diff. Note that this differs from using the extracted hour and minute columns, because those are based on the start date only.

In [17]:
np.round((dmerge4['diff']/ np.timedelta64(1, 'm') % 60))
Out[17]:
startdate
2013-08-29 15:11:00     3
2013-08-29 17:35:00    13
2013-08-29 20:00:00     4
2013-08-29 17:30:00     3
2013-08-29 19:07:00    14
2013-08-29 12:33:00    14
2013-08-29 19:01:00    13
2013-08-29 19:11:00    25
2013-08-29 14:14:00     6
2013-08-29 19:13:00    23
2013-08-29 19:08:00    13
2013-08-29 19:01:00    14
2013-08-29 13:57:00     2
2013-08-29 09:08:00     3
2013-08-29 14:04:00    28
...
2014-02-28 09:45:00    15
2014-02-28 18:34:00    16
2014-02-28 08:28:00     9
2014-02-28 13:07:00    12
2014-02-28 17:03:00     5
2014-02-28 16:44:00     5
2014-02-28 10:05:00     9
2014-02-28 09:50:00    10
2014-02-28 16:19:00     5
2014-02-28 19:13:00     6
2014-02-28 18:48:00     4
2014-02-28 07:42:00     2
2014-02-28 18:13:00     6
2014-02-28 17:20:00     5
2014-02-28 13:22:00     6
Name: diff, Length: 144015

This returns the minutes component of the timedelta rather than the total minutes, and total minutes is what we want for this analysis. We mention the component form here because it can be useful, particularly when working with financial time-series data.

In [18]:
np.round((dmerge4['diff']/ np.timedelta64(1, 'm') % 60)).max()
Out[18]:
59.0
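The distinction between total minutes and the minutes component can be seen on a few hand-built timedeltas (synthetic values, not the notebook's data):

```python
import numpy as np
import pandas as pd

# Three trip durations: 3 min, 65 min, and 1650 min (27.5 hours)
diff = pd.Series(pd.to_timedelta([3, 65, 1650], unit="min"))

# Total minutes: divide the timedelta by a one-minute unit
total_minutes = diff / np.timedelta64(1, "m")

# Minutes *component*: the % 60 trick above discards whole hours,
# which is why its maximum can never exceed 59
component_minutes = total_minutes % 60
```

The 65-minute trip becomes a component of 5 and the 1650-minute trip a component of 30, so the `% 60` result caps at 59 no matter how long a trip runs.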

Rounding the trip duration to whole hours shows that most trips round to zero, i.e., most trips were well under one hour.

In [21]:
deltahour = np.round((dmerge4['diff']/ np.timedelta64(1, 'h')))
deltahour
Out[21]:
startdate
2013-08-29 15:11:00    0
2013-08-29 17:35:00    0
2013-08-29 20:00:00    0
2013-08-29 17:30:00    0
2013-08-29 19:07:00    0
2013-08-29 12:33:00    0
2013-08-29 19:01:00    0
2013-08-29 19:11:00    0
2013-08-29 14:14:00    0
2013-08-29 19:13:00    0
2013-08-29 19:08:00    0
2013-08-29 19:01:00    0
2013-08-29 13:57:00    0
2013-08-29 09:08:00    0
2013-08-29 14:04:00    0
...
2014-02-28 09:45:00    0
2014-02-28 18:34:00    0
2014-02-28 08:28:00    0
2014-02-28 13:07:00    0
2014-02-28 17:03:00    0
2014-02-28 16:44:00    0
2014-02-28 10:05:00    0
2014-02-28 09:50:00    0
2014-02-28 16:19:00    0
2014-02-28 19:13:00    0
2014-02-28 18:48:00    0
2014-02-28 07:42:00    0
2014-02-28 18:13:00    0
2014-02-28 17:20:00    0
2014-02-28 13:22:00    0
Name: diff, Length: 144015

This gives us what we want, total minutes, rather than just extracting the minutes element.

In [6]:
dmerge4['diff'].apply(lambda x: x / np.timedelta64(1, 'm'))
Out[6]:
startdate
2013-08-29 15:11:00     3
2013-08-29 17:35:00    13
2013-08-29 20:00:00     4
2013-08-29 17:30:00     3
2013-08-29 19:07:00    14
2013-08-29 12:33:00    14
2013-08-29 19:01:00    13
2013-08-29 19:11:00    25
2013-08-29 14:14:00     6
2013-08-29 19:13:00    23
2013-08-29 19:08:00    13
2013-08-29 19:01:00    14
2013-08-29 13:57:00     2
2013-08-29 09:08:00     3
2013-08-29 14:04:00    28
...
2014-02-28 09:45:00    15
2014-02-28 18:34:00    16
2014-02-28 08:28:00     9
2014-02-28 13:07:00    12
2014-02-28 17:03:00     5
2014-02-28 16:44:00     5
2014-02-28 10:05:00     9
2014-02-28 09:50:00    10
2014-02-28 16:19:00     5
2014-02-28 19:13:00     6
2014-02-28 18:48:00     4
2014-02-28 07:42:00     2
2014-02-28 18:13:00     6
2014-02-28 17:20:00     5
2014-02-28 13:22:00     6
Name: diff, Length: 144015

The max confirms that total minutes, rather than just the minutes component, is returned. Success.

In [7]:
dmerge4['diff'].apply(lambda x: x / np.timedelta64(1, 'm')).max()
Out[7]:
12037.0

This gives the same result without a lambda function.

In [10]:
np.round((dmerge4['diff']/ np.timedelta64(1, 'm'))).max()
Out[10]:
12037.0
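Another route, assuming a recent pandas: the timedelta accessor's `total_seconds()`, which avoids numpy unit arithmetic altogether (shown on synthetic timedeltas):

```python
import pandas as pd

# Timedeltas standing in for the diff column
diff = pd.Series(pd.to_timedelta([3, 13, 12037], unit="min"))

# total_seconds() returns the full elapsed time, not a clock component,
# so dividing by 60 yields total minutes
total_minutes = diff.dt.total_seconds() / 60
```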
In [25]:
dmerge4['diff'].apply(lambda x: x / np.timedelta64(1, 'm')).plot()
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe640261e10>

One thing to keep in mind is that pandas uses numpy's datetime64 type, while the Python standard library has its own datetime.datetime; general Python users should be aware of the distinction. The latest release of pandas also introduces a new datetime accessor (.dt) for working with dates and times, which is worth checking out.
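A quick sketch of that accessor (introduced in pandas 0.15, so one release ahead of the 0.14.1 used in this notebook), applied to a plain datetime Series rather than a DatetimeIndex:

```python
import pandas as pd

# A small Series of trip start timestamps
ts = pd.Series(pd.to_datetime(["2013-08-29 15:11:00",
                               "2014-02-28 08:28:00"]))

# The .dt accessor exposes datetime components on a Series
hours = ts.dt.hour
months = ts.dt.month
```

This gives column-wise access to components like hour and month without reaching for the index, complementing the index-based `at_time`/`between_time` selections used earlier.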