fn_babs_timeseries

Practical Data Analysis with Python - Time Series

Working with time-series data

Time-series data is fun and interesting. Working with time-series data has already been covered extensively in the data munging and grouping & aggregating notebooks. This notebook covers ad-hoc time-series topics relevant to data analysis. The time-series plots produced from the datasets in these notebooks are presented in the visualization section.

We have already implemented and shown several strategies for working with datetime data to gain meaning and insight from the dataset. Now, we'll look more deeply at specific dates and times, categorical time intervals, timedeltas, and general properties of datetime and timestamp data.

We can analyze the data over longer periods of time, at a specific time or date, or within a particular time interval. This dataset was chosen in part because it has several nice properties for illustrating work with time-series data.

Preliminary work on getting date and time data into a workable format for analysis was covered in the data munging section. Now, let's take a deeper look at working with time-series data to gain meaning and insight from our bike sharing dataset.

In [2]:
import numpy as np
import pandas as pd
from datetime import datetime, time
In [3]:
%matplotlib inline
import seaborn as sns
from ggplot import *
In [11]:
print pd.__version__
print np.__version__
0.14.1
1.9.0

In [4]:
%load_ext ipycache

Resampling

Resampling converts the time series to a different frequency and computes summary statistics for each period.

Resample the dataset to daily frequency so each row holds the mean values for trips started on the same day.

In [7]:
# Daily means
daily_means = dmerge4.resample('D', how='mean').reset_index(drop=False) 
daily_means.head(2)
Out[7]:
startdate tripid duration startterminal endterminal bike max_temperature_f mean_temperature_f min_temperaturef max_dew_point_f lat long dockcount day hour month hourdiff durationminutes duration_i duration_f
0 2013-08-29 4661.505348 1561.296791 54.679144 55.017380 431.470588 74.871658 68.491979 61.318182 61.201872 37.728908 -122.345651 18.970588 29 15.381016 8 0.223262 26.021613 26.028075 26.021564
1 2013-08-30 5837.312325 2703.133053 54.168067 54.177871 439.878151 79.400560 69.901961 60.456583 61.390756 37.724274 -122.337827 19.235294 30 14.277311 8 0.439776 45.052218 45.051821 45.052227

2 rows × 38 columns

Resample the dataset to monthly frequency so each row holds the mean values for trips started in the same month.

In [8]:
# Monthly means
monthly_means = dmerge4.resample('M', how='mean').reset_index(drop=False) 
monthly_means.head(2)
Out[8]:
startdate tripid duration startterminal endterminal bike max_temperature_f mean_temperature_f min_temperaturef max_dew_point_f lat long dockcount day hour month hourdiff durationminutes duration_i duration_f
0 2013-08-31 5763.824453 2588.182683 55.104186 55.247859 433.639391 75.504757 67.814462 59.870599 60.127973 37.730096 -122.346329 19.064700 29.948620 14.493815 8 0.430067 43.136378 43.139867 43.136413
1 2013-09-30 24370.800578 1578.130729 56.731054 56.811908 438.652102 74.284079 65.888048 56.996989 56.373965 37.745550 -122.357832 19.261142 16.282415 13.664660 9 0.265539 26.302179 26.302302 26.302204

2 rows × 38 columns
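As a hedged aside: the `how=` argument to `resample` used above matches the pandas 0.14 API in this notebook; in modern pandas it has been replaced by method chaining. A minimal sketch on synthetic data (the trip log here is a stand-in for `dmerge4`, which is assumed to be indexed by `startdate`):

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the trip log: hourly rows indexed by start time
rng = pd.date_range("2013-08-29", periods=72, freq="h")
trips = pd.DataFrame({"duration_f": np.arange(72, dtype=float)}, index=rng)

# Modern equivalent of resample('D', how='mean'): chain the aggregation
daily_means = trips.resample("D").mean()

# Monthly means; 'MS' anchors each bin at the month start
monthly_means = trips.resample("MS").mean()
```

With 72 hourly rows spanning three days, `daily_means` has three rows and `monthly_means` collapses them into one.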

Working with time data

The time-series functionality in pandas allows granularity down to the hour, minute, or second. For instance, one can look at average duration at four pm on a monthly basis and compare it to eight am. However, it may be more instructive to look at a range of times, such as the rush-hour interval of the morning commute.

Average duration at eight am and four pm resampled to monthly mean

In [10]:
#the datetime index at 8am resampled to monthly
eight_am = dmerge4.at_time(time(8,0)).resample('M', how='mean')[['duration_f']]
eight_am
Out[10]:
duration_f
startdate
2013-08-31 10.980000
2013-09-30 10.226316
2013-10-31 10.027674
2013-11-30 10.429714
2013-12-31 9.469492
2014-01-31 9.413051
2014-02-28 8.042712
In [11]:
four_pm = dmerge4.at_time(time(16,0)).resample('M', how='mean')[['duration_f']]
four_pm
Out[11]:
duration_f
startdate
2013-08-31 16.785000
2013-09-30 14.976786
2013-10-31 26.602927
2013-11-30 19.147391
2013-12-31 11.453214
2014-01-31 12.583846
2014-02-28 14.857083
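The `at_time` selection used above can be sketched on synthetic data; this toy example uses modern pandas syntax (`.resample(...).mean()` rather than the notebook's `how='mean'`):

```python
import numpy as np
import pandas as pd

# Synthetic hourly trip log over two days
rng = pd.date_range("2013-08-29", periods=48, freq="h")
trips = pd.DataFrame({"duration_f": np.arange(48, dtype=float)}, index=rng)

# Keep only rows whose clock time is exactly 08:00
eight_am = trips.at_time("08:00")

# Then take monthly means of just those rows
eight_am_monthly = eight_am.resample("MS").mean()
```

Two days of hourly data yield exactly two 08:00 rows, which the resample then averages into a single monthly value.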

In every month, average duration at four pm is higher than at eight am. When looking at a range of times, such as monthly morning duration versus monthly evening duration, the differences are less sharply delineated. This matters because knowing at what hour or range of times bicycles will not be available, based on duration, is useful for demand and infrastructure planning, and it can also inform marketing strategy. For example, incentives could be offered to shift trip durations now that we know more from the time-series data.

Look at average duration between 5am and 10am on a monthly basis

In [7]:
morning_monthly = dmerge4.between_time(time(5, 0), time(10, 0)).resample('M', how='mean')[['duration_f']]
morning_monthly
Out[7]:
duration_f
startdate
2013-08-31 37.648913
2013-09-30 17.973624
2013-10-31 13.302877
2013-11-30 13.209249
2013-12-31 15.384694
2014-01-31 11.679421
2014-02-28 11.441120
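The `between_time` filter works like `at_time` but over a clock-time window; a minimal sketch on one synthetic day:

```python
import numpy as np
import pandas as pd

# One synthetic day of hourly observations
rng = pd.date_range("2013-08-29", periods=24, freq="h")
trips = pd.DataFrame({"duration_f": np.arange(24, dtype=float)}, index=rng)

# Rows whose clock time falls between 05:00 and 10:00;
# both endpoints are included by default
morning = trips.between_time("05:00", "10:00")
```

Hours 5 through 10 inclusive give six rows, so the default endpoint handling is worth keeping in mind when defining commute windows.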

Summary statistics for morning time interval

In [9]:
dmerge4.between_time(time(5, 0), time(10, 0)).describe()
Out[9]:
tripid duration startterminal endterminal bike max_temperature_f mean_temperature_f min_temperaturef max_dew_point_f meandew_point_f lat long dockcount day hour month hourdiff durationminutes duration_i duration_f
count 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000 38885.000000
mean 108075.706699 822.261926 59.107522 57.958313 440.186139 65.528764 57.483477 49.009850 49.158262 44.304565 37.750334 -122.365098 19.981664 15.667507 7.919789 7.396040 0.223660 13.704365 13.704616 13.704364
std 54154.593638 4807.622205 16.421185 16.264455 134.382220 7.346052 6.013163 5.988127 5.980818 6.967008 0.117364 0.119254 3.846183 8.451855 0.990138 4.387635 0.697399 80.127037 80.128487 80.127029
min 4069.000000 60.000000 2.000000 2.000000 9.000000 46.000000 38.000000 25.000000 23.000000 16.000000 37.329732 -122.418954 11.000000 1.000000 5.000000 1.000000 -7.000000 1.000000 1.000000 1.000000
25% 62969.000000 339.000000 53.000000 50.000000 353.000000 60.000000 54.000000 45.000000 46.000000 41.000000 37.776317 -122.403234 19.000000 9.000000 7.000000 2.000000 0.000000 5.650000 6.000000 5.650000
50% 108193.000000 491.000000 63.000000 61.000000 448.000000 65.000000 57.000000 49.000000 50.000000 45.000000 37.785299 -122.395569 19.000000 16.000000 8.000000 10.000000 0.000000 8.183333 8.000000 8.180000
75% 155471.000000 688.000000 70.000000 70.000000 546.000000 71.000000 62.000000 53.000000 53.000000 49.000000 37.791300 -122.394203 23.000000 23.000000 9.000000 11.000000 0.000000 11.466667 11.000000 11.470000
max 198297.000000 619322.000000 83.000000 82.000000 717.000000 95.000000 76.000000 66.000000 66.000000 61.000000 37.804770 -121.877349 27.000000 31.000000 10.000000 12.000000 15.000000 10322.033333 10322.000000 10322.030000

8 rows × 37 columns

Look at average duration between three pm and seven pm on a monthly basis

In [11]:
evening_monthly = dmerge4.between_time(time(15, 0), time(19, 0)).resample('M', how='mean')[['duration_f']]
evening_monthly
Out[11]:
duration_f
startdate
2013-08-31 37.240278
2013-09-30 24.621531
2013-10-31 16.752767
2013-11-30 17.277542
2013-12-31 14.980330
2014-01-31 14.962542
2014-02-28 15.718395

Summary statistics for the evening time interval, which was chosen to encompass the 'rush hour' period.

In [12]:
dmerge4.between_time(time(15, 0), time(19, 0)).describe()
Out[12]:
tripid duration startterminal endterminal bike max_temperature_f mean_temperature_f min_temperaturef max_dew_point_f meandew_point_f lat long dockcount day hour month hourdiff durationminutes duration_i duration_f
count 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000 47592.000000
mean 101102.863128 1070.586275 56.689212 58.069213 436.845583 66.428854 58.189738 49.515360 49.735649 44.907926 37.746342 -122.358438 19.100227 15.762775 16.629118 7.629497 0.211653 17.843105 17.842789 17.843105
std 56272.442662 6060.799465 16.969014 17.303616 138.118301 7.600397 6.295636 6.244173 6.174970 7.177395 0.126618 0.130883 4.109463 8.633882 1.031338 4.200082 0.671495 101.013324 101.013812 101.013307
min 4621.000000 60.000000 2.000000 2.000000 9.000000 46.000000 38.000000 25.000000 23.000000 16.000000 37.329732 -122.418954 11.000000 1.000000 15.000000 1.000000 -18.000000 1.000000 1.000000 1.000000
25% 51994.500000 359.000000 50.000000 50.000000 348.000000 61.000000 54.000000 45.000000 46.000000 41.000000 37.776619 -122.403234 15.000000 9.000000 16.000000 2.000000 0.000000 5.983333 6.000000 5.980000
50% 99347.000000 542.000000 60.000000 62.000000 446.000000 66.000000 57.000000 50.000000 50.000000 46.000000 37.786978 -122.398108 19.000000 16.000000 17.000000 10.000000 0.000000 9.033333 9.000000 9.030000
75% 150233.250000 794.000000 69.000000 70.000000 545.000000 71.000000 63.000000 54.000000 54.000000 49.000000 37.794139 -122.392738 23.000000 23.000000 17.000000 11.000000 0.000000 13.233333 13.000000 13.230000
max 198701.000000 597517.000000 82.000000 82.000000 717.000000 95.000000 78.000000 66.000000 66.000000 61.000000 37.804770 -121.877349 27.000000 31.000000 19.000000 12.000000 7.000000 9958.616667 9959.000000 9958.620000

8 rows × 37 columns

So, by slicing the datetime index into morning and evening commuting hours, we've learned that there are 47,592 observations in the evening and 38,885 in the morning (86,477 in the combined time period out of the total), and roughly 22% more observations in the evening than in the morning.

Categorical time intervals

When we munged the data, we created a categorical field of time intervals by extracting the hour component of the timestamp. This field is useful for grouping, aggregating, and plotting based on the labelled ranges.

Here is a table of average duration by time of day category.

In [28]:
dmerge4.groupby('timeofday')[['duration_f']].mean()
Out[28]:
duration_f
timeofday
evening 17.499759
mid_day 29.401846
morning 16.110753
night 23.363483
wee_hours 60.553483

The results are very interesting. We were curious why the data suggested that trips were primarily short rides, and splitting the data into labelled time-of-day intervals reveals some insights. Perhaps evening and morning rides are commutes, while trips at other times of day run closer to the thirty-minute limit per trip. But why is the average duration longest in the wee hours? In the grouping and aggregation portion of the analysis, we run further queries by time of day to learn more.

Hour ranges for each category: morning: 5-10, mid_day: 11-14, evening: 15-19, night: 20-23 and 0, wee_hours: 1-4
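A minimal sketch of how such a categorical could be built from the hour component, using the bucket boundaries listed above (the notebook's actual munging code lives in an earlier section, so this mapping function is illustrative):

```python
import pandas as pd

def time_of_day(hour):
    # Bucket boundaries from the legend above
    if 5 <= hour <= 10:
        return "morning"
    if 11 <= hour <= 14:
        return "mid_day"
    if 15 <= hour <= 19:
        return "evening"
    if hour >= 20 or hour == 0:
        return "night"
    return "wee_hours"

# Map a Series of start hours to their time-of-day labels
hours = pd.Series([8, 12, 17, 22, 0, 3])
labels = hours.map(time_of_day)
```

Grouping on such a column (`df.groupby('timeofday')`) then yields the table of means shown above.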

Understanding timedeltas

The column we created named diff holds the difference between the end-date and start-date datetime objects.

In [7]:
dmerge4['diff'].head()
Out[7]:
startdate
2013-08-29 15:11:00   00:03:00
2013-08-29 17:35:00   00:13:00
2013-08-29 20:00:00   00:04:00
2013-08-29 17:30:00   00:03:00
2013-08-29 19:07:00   00:14:00
Name: diff, dtype: timedelta64[ns]

This extracts the minutes component from the timedelta column named diff. Note that this differs from using the extracted hour and minute columns, because those are based on the start date only.

In [17]:
np.round((dmerge4['diff']/ np.timedelta64(1, 'm') % 60))
Out[17]:
startdate
2013-08-29 15:11:00     3
2013-08-29 17:35:00    13
2013-08-29 20:00:00     4
2013-08-29 17:30:00     3
2013-08-29 19:07:00    14
2013-08-29 12:33:00    14
2013-08-29 19:01:00    13
2013-08-29 19:11:00    25
2013-08-29 14:14:00     6
2013-08-29 19:13:00    23
2013-08-29 19:08:00    13
2013-08-29 19:01:00    14
2013-08-29 13:57:00     2
2013-08-29 09:08:00     3
2013-08-29 14:04:00    28
...
2014-02-28 09:45:00    15
2014-02-28 18:34:00    16
2014-02-28 08:28:00     9
2014-02-28 13:07:00    12
2014-02-28 17:03:00     5
2014-02-28 16:44:00     5
2014-02-28 10:05:00     9
2014-02-28 09:50:00    10
2014-02-28 16:19:00     5
2014-02-28 19:13:00     6
2014-02-28 18:48:00     4
2014-02-28 07:42:00     2
2014-02-28 18:13:00     6
2014-02-28 17:20:00     5
2014-02-28 13:22:00     6
Name: diff, Length: 144015

This returns the minutes component of the timedelta rather than the total minutes, and total minutes is what we want for this analysis. We mention the component form here because it can be useful, particularly when working with financial time-series data.

In [18]:
np.round((dmerge4['diff']/ np.timedelta64(1, 'm') % 60)).max()
Out[18]:
59.0
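The distinction between total minutes and the minutes component can be seen on a few hand-built timedeltas (synthetic values, not the notebook's data):

```python
import numpy as np
import pandas as pd

# Three trip durations: 3 min, 65 min, and 1650 min (27.5 hours)
diff = pd.Series(pd.to_timedelta([3, 65, 1650], unit="min"))

# Total minutes: divide the timedelta by a one-minute unit
total_minutes = diff / np.timedelta64(1, "m")

# Minutes *component*: the % 60 trick above discards whole hours,
# which is why its maximum can never exceed 59
component_minutes = total_minutes % 60
```

The 65-minute trip becomes a component of 5 and the 1650-minute trip a component of 30, so the `% 60` result caps at 59 no matter how long a trip runs.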

Rounding the trip duration to whole hours shows that most trips round to zero, i.e., most trips were well under one hour.

In [21]:
deltahour = np.round((dmerge4['diff']/ np.timedelta64(1, 'h')))
deltahour
Out[21]:
startdate
2013-08-29 15:11:00    0
2013-08-29 17:35:00    0
2013-08-29 20:00:00    0
2013-08-29 17:30:00    0
2013-08-29 19:07:00    0
2013-08-29 12:33:00    0
2013-08-29 19:01:00    0
2013-08-29 19:11:00    0
2013-08-29 14:14:00    0
2013-08-29 19:13:00    0
2013-08-29 19:08:00    0
2013-08-29 19:01:00    0
2013-08-29 13:57:00    0
2013-08-29 09:08:00    0
2013-08-29 14:04:00    0
...
2014-02-28 09:45:00    0
2014-02-28 18:34:00    0
2014-02-28 08:28:00    0
2014-02-28 13:07:00    0
2014-02-28 17:03:00    0
2014-02-28 16:44:00    0
2014-02-28 10:05:00    0
2014-02-28 09:50:00    0
2014-02-28 16:19:00    0
2014-02-28 19:13:00    0
2014-02-28 18:48:00    0
2014-02-28 07:42:00    0
2014-02-28 18:13:00    0
2014-02-28 17:20:00    0
2014-02-28 13:22:00    0
Name: diff, Length: 144015

This gives us what we want, total minutes, rather than just extracting the minutes element.

In [6]:
dmerge4['diff'].apply(lambda x: x / np.timedelta64(1, 'm'))
Out[6]:
startdate
2013-08-29 15:11:00     3
2013-08-29 17:35:00    13
2013-08-29 20:00:00     4
2013-08-29 17:30:00     3
2013-08-29 19:07:00    14
2013-08-29 12:33:00    14
2013-08-29 19:01:00    13
2013-08-29 19:11:00    25
2013-08-29 14:14:00     6
2013-08-29 19:13:00    23
2013-08-29 19:08:00    13
2013-08-29 19:01:00    14
2013-08-29 13:57:00     2
2013-08-29 09:08:00     3
2013-08-29 14:04:00    28
...
2014-02-28 09:45:00    15
2014-02-28 18:34:00    16
2014-02-28 08:28:00     9
2014-02-28 13:07:00    12
2014-02-28 17:03:00     5
2014-02-28 16:44:00     5
2014-02-28 10:05:00     9
2014-02-28 09:50:00    10
2014-02-28 16:19:00     5
2014-02-28 19:13:00     6
2014-02-28 18:48:00     4
2014-02-28 07:42:00     2
2014-02-28 18:13:00     6
2014-02-28 17:20:00     5
2014-02-28 13:22:00     6
Name: diff, Length: 144015

The max confirms that total minutes, rather than just the minutes component, is returned. Success.

In [7]:
dmerge4['diff'].apply(lambda x: x / np.timedelta64(1, 'm')).max()
Out[7]:
12037.0

This gives the same result without a lambda function.

In [10]:
np.round((dmerge4['diff']/ np.timedelta64(1, 'm'))).max()
Out[10]:
12037.0
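Another route, assuming a recent pandas: the timedelta accessor's `total_seconds()`, which avoids numpy unit arithmetic altogether (shown on synthetic timedeltas):

```python
import pandas as pd

# Timedeltas standing in for the diff column
diff = pd.Series(pd.to_timedelta([3, 13, 12037], unit="min"))

# total_seconds() returns the full elapsed time, not a clock component,
# so dividing by 60 yields total minutes
total_minutes = diff.dt.total_seconds() / 60
```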
In [25]:
dmerge4['diff'].apply(lambda x: x / np.timedelta64(1, 'm')).plot()
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fe640261e10>

One thing to keep in mind is that pandas uses numpy's datetime64 type, while the Python standard library has its own datetime.datetime; general Python users should be aware of the distinction. The latest release of pandas also introduces a new datetime accessor (.dt) for working with dates and times, which is worth checking out.
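A quick sketch of that accessor (introduced in pandas 0.15, so one release ahead of the 0.14.1 used in this notebook), applied to a plain datetime Series rather than a DatetimeIndex:

```python
import pandas as pd

# A small Series of trip start timestamps
ts = pd.Series(pd.to_datetime(["2013-08-29 15:11:00",
                               "2014-02-28 08:28:00"]))

# The .dt accessor exposes datetime components on a Series
hours = ts.dt.hour
months = ts.dt.month
```

This gives column-wise access to components like hour and month without reaching for the index, complementing the index-based `at_time`/`between_time` selections used earlier.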