Let's make some visualizations as part of the iterative process of data munging, aggregating, and visualizing. Visualizations help us understand the data better. At best, exploratory data visualization inspires questions and informs our analysis while identifying trends and patterns to further learn from the data.
We will see that displaying similar information in a variety of ways and using different types and styles of plots can reveal even more about the data.
%matplotlib inline
import numpy as np
import pandas as pd
from datetime import datetime, time
from ggplot import *
import seaborn as sns
print(pd.__version__)
print(np.__version__)
%load_ext ipycache
In practical data analysis, we want to make our plots as efficiently and succinctly as possible as part of an iterative process. One visualization raises questions that lead to further visualizations using pandas, ggplot, and seaborn. In exploratory visualization, grouping and aggregating data and plotting are all part of that iterative process. We want to learn from our data which further visualizations will provide insight and, hopefully, lead to more statistical questions and more visualizations.
Mean temperature
#mean temperature
dmerge4.mean_temperature_f.plot()
In this plot, we can see that most trips were under sixty minutes and fluctuated for the most part between under twenty minutes up to around fifty minutes depending on the time of day.
#daily average duration for the first six months of operation
print(ggplot(aes(x='startdate', y='duration_i'), data=daily_means) + \
geom_line() + geom_smooth())
Since San Francisco doesn't typically have harsh winters, temperature did not seem to have much effect on the length of bicycle trips.
print (ggplot(aes(x='startdate', y='mean_temperature_f'), data=daily_means) + \
geom_line())
This plot is included here to compare the above graph with a plot of the entire dataset. Taking the daily mean in the previous plots consolidates multiple observations on a single day and gives a less noisy graph. In this graph, every single observation is plotted. Taking the daily mean is a way to see the trend more cleanly and also allows for further analysis on daily data.
print (ggplot(aes(x='startdate', y='mean_temperature_f'), data=dmerge4) + \
geom_line())
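To make the idea of consolidating observations into a daily mean concrete, here is a minimal sketch on a small synthetic frame. The values and the `trips` name are made up for illustration; only the column name `duration_f` and the index name `startdate` mirror the notebook. It assumes the frame is indexed by trip start time, and it uses `resample`, which may or may not be how `daily_means` was actually built.

```python
import pandas as pd

# Synthetic stand-in for the trip data (hypothetical values):
# several observations per day, indexed by trip start time.
start = pd.to_datetime([
    "2013-09-01 08:00", "2013-09-01 12:30", "2013-09-01 17:45",
    "2013-09-02 07:15", "2013-09-02 18:05",
    "2013-09-03 09:40", "2013-09-03 13:10", "2013-09-03 22:30",
])
trips = pd.DataFrame(
    {"duration_f": [10.0, 20.0, 30.0, 8.0, 12.0, 15.0, 25.0, 50.0]},
    index=pd.DatetimeIndex(start, name="startdate"),
)

# One row per calendar day: the mean of all observations on that day.
daily = trips.resample("D").mean()
print(daily)
```

Eight raw observations collapse into three daily values, which is why the daily plot is so much smoother than the plot of every observation.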
The monthly duration takes away some of the noise indicated by the daily data. We do actually see a declining trend as we go from Autumn to Winter.
print (ggplot(aes(x='startdate', y='duration_i'), data=monthly_means) + \
geom_line())
Remember that we concatenated two time series intervals for the morning and evening commute. We can now see in this graph that the average duration by month is mostly higher in the evening than in the morning. Keeping in mind that August was the first month of operation, the plot also indicates that from September to February, on average, the bicycle was returned to a dock within the initial thirty minutes from the start time.
concatenated.plot()
What's really nice about Pandas time series functionality is that we can also look at plots at specific times as well. Here is a plot of total duration at 4pm for the entire six month period. The plot indicates that a user had a bicycle for a really long time at 4pm in October but mostly the trip durations were under fifty minutes.
#total duration at particular time
dmerge4.duration_i.at_time(time(16, 0)).plot()
four_pm_df=pd.DataFrame(dmerge4.duration_f.at_time(time(16, 0))).reset_index()
ggplot(four_pm_df,aes(x='startdate',y='duration_f')) + geom_line()
#duration at 4pm for every day of the entire six-month period
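One detail worth keeping in mind when reading the 4pm plot: `at_time` keeps only rows whose timestamp is exactly the given time, not everything in that hour. The sketch below uses a made-up series (the `durations` name and values are hypothetical) to contrast it with `between_time`, which selects a range of clock times.

```python
from datetime import time

import pandas as pd

# Hypothetical trip durations (minutes) indexed by start time.
idx = pd.to_datetime([
    "2013-10-01 15:59", "2013-10-01 16:00", "2013-10-01 16:01",
    "2013-10-02 16:00", "2013-10-03 16:30",
])
durations = pd.Series([12.0, 7.0, 9.0, 45.0, 11.0], index=idx)

# at_time keeps only the rows stamped exactly 16:00.
exactly_four = durations.at_time(time(16, 0))

# between_time keeps every row from 16:00 through 16:59
# (both endpoints are included by default).
four_oclock_hour = durations.between_time(time(16, 0), time(16, 59))

print(exactly_four)
print(four_oclock_hour)
```

If the goal is "trips that started during the four o'clock hour", `between_time` (or grouping on the hour) is probably closer to what we want than `at_time`.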
Durations are shorter during the morning and evening commuting hours and slightly higher during the mid-day and at night. The data also indicates that people are keeping bicycles for a longer period of time after midnight until four am.
dmerge4.groupby('timeofday')[['duration_i']].mean().plot(kind='bar')
This plot gives a really nice visual of average duration by landmark and time of day. Across the board, morning durations are shortest, which would make sense for morning commuters. Morning durations are highest in Palo Alto. Perhaps there is more traffic or there are greater distances to travel. In addition, we know that there are fewer bicycle docks in Palo Alto. Mid-day start times have the longest trip durations except in Mountain View, where evening bicycle trips are the longest.
Duration in San Francisco is the shortest, and this is also the area that has the highest number of docks in the dataset, as shown in the grouping and aggregating section of the analysis.
sns.factorplot("landmark", "duration_f", "timeofday", data=timelandmark);
It is evident in this plot that dockcounts were lowest in Palo Alto where average durations were also the highest.
sns.factorplot("landmark", "duration_f", "dockcount", data=timelandmark,palette="BuPu_d")
This plot is nice because it allows us to visually see the differences in duration by landmark by categorical time of day.
sns.factorplot("landmark","duration_f", data=timelandmark, row="timeofday",
margin_titles=True, aspect=3, size=2,palette="BuPu_d");
Palo Alto had the highest duration in almost every month except for February, when Mountain View had the highest average duration.
#faceted by month
sns.factorplot("landmark","duration_f", data=dmerge4, row="month",
margin_titles=True, aspect=3, size=2,palette="Pastel1");
This plot shows the mean, minimum, and maximum of the maximum temperature for each month.
dmerge4.groupby(dmerge4.index.month)['max_temperature_f'].agg(['mean','min','max']).plot()
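As a small illustration of what the line above computes, here is a sketch on a made-up weather frame (the `weather` name and the temperatures are hypothetical). Grouping by `index.month` and passing a list of function names to `agg` yields one column per statistic, and each column becomes one line when plotted.

```python
import pandas as pd

# Hypothetical daily maximum temperatures indexed by date.
idx = pd.to_datetime([
    "2013-08-29", "2013-08-30",
    "2013-09-01", "2013-09-02", "2013-09-03",
])
weather = pd.DataFrame(
    {"max_temperature_f": [74.0, 78.0, 70.0, 72.0, 80.0]},
    index=idx,
)

# Group rows by calendar month number, then compute three summary
# statistics at once; the result has columns 'mean', 'min', 'max'.
by_month = (
    weather.groupby(weather.index.month)["max_temperature_f"]
    .agg(["mean", "min", "max"])
)
print(by_month)
```

Note that grouping by month number alone would merge the same month from different years; that is harmless here only because the data covers less than one year.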
Duration is lowest during the morning and evening commute hours and highest late in the evening and in the wee hours.
dmerge4.groupby(dmerge4.index.hour)['duration_f'].agg(['mean','min']).plot()
day_means['duration_f'].plot(kind='bar')
The highest duration occurred in December for the Customer subscription type in Redwood City.
dmerge4.groupby([pd.Grouper(freq='M',key='startdate'),"landmark","subscriptiontype"])[['duration_f']].mean().unstack().unstack().plot(colormap='gist_rainbow')
In December, January, and February, the highest average duration was for customers in Redwood City.
three_grouper_unstack_twice = dmerge4.groupby([pd.Grouper(freq='M',key='startdate'),"landmark","subscriptiontype"])[['duration']].mean().unstack().unstack()
three_grouper_unstack_twice.plot(kind="barh",colormap='gist_rainbow')
#change color scheme
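The group-by-three-keys-then-unstack-twice pattern is easier to follow on a tiny example. The sketch below uses made-up trips (all values hypothetical) and, as an assumption for portability, derives the month with `dt.to_period("M")` instead of the `pd.Grouper` call used above. Each `unstack` moves the innermost remaining index level into the columns, so we end with one row per month and one column per (subscription type, landmark) pair.

```python
import pandas as pd

# Hypothetical trips; column names mirror the notebook.
trips = pd.DataFrame({
    "startdate": pd.to_datetime([
        "2013-12-03", "2013-12-15", "2013-12-20",
        "2014-01-05", "2014-01-09", "2014-01-21",
    ]),
    "landmark": [
        "Redwood City", "San Francisco", "San Francisco",
        "Redwood City", "San Francisco", "Redwood City",
    ],
    "subscriptiontype": [
        "Customer", "Subscriber", "Customer",
        "Customer", "Subscriber", "Subscriber",
    ],
    "duration_f": [90.0, 10.0, 40.0, 120.0, 12.0, 15.0],
})

# Assumption: calendar month derived with to_period rather than pd.Grouper.
month = trips["startdate"].dt.to_period("M").rename("month")

# Series indexed by (month, landmark, subscriptiontype).
grouped = trips.groupby([month, "landmark", "subscriptiontype"])["duration_f"].mean()

# First unstack: subscriptiontype -> columns.
# Second unstack: landmark -> columns, nested under subscriptiontype.
wide = grouped.unstack().unstack()
print(wide)
```

Combinations that never occur in a given month show up as NaN, which is why some lines in the plot have gaps.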
From September until December, Palo Alto had the highest average duration by month while San Jose and San Francisco had the lowest. Average duration by month increased in Redwood City from September to February.
grouperl = dmerge4.groupby([pd.Grouper(freq='M',key='startdate'),'landmark']).mean()[['duration_f']]
grouperl.unstack().plot()
This dodged bar plot shows that duration was highest for Redwood City in February and highest for Palo Alto in November. In every month, durations were lowest for both San Francisco and San Jose. We could hypothesize this is due to commuting, traffic patterns, and demand for bicycles. In any case, knowing where durations are higher can inform estimates of demand and decisions about dock count.
grouperl.unstack().plot(kind="bar")
#avg duration monthly faceted by time of day
grouperm = dmerge4.groupby([pd.Grouper(freq='M',key='startdate'),'timeofday']).mean()[['duration_f']]
grouperm.unstack().plot(kind="bar")
dmerge4.groupby("landmark").mean()[['duration_f']].plot(kind='bar')
ggplot(dmerge4, aes('hour', 'duration_f',color='landmark')) + geom_point(stat="identity")
#how to add legend
Subscribers for the most part are below sixty minute durations while customers have higher durations and the widest range of duration in Redwood City followed by Mountain View.
three_grouper = dmerge4.groupby([pd.Grouper(freq='M',key='startdate'),"landmark","subscriptiontype"])[['duration_f']].mean().reset_index()
import seaborn as sns
sns.factorplot("landmark", "duration_f", "subscriptiontype", three_grouper, kind="box",
palette="PRGn", aspect=2.25)
It doesn't get too warm or too cold in Northern California between August and February.
dmerge4['mean_temperature_f'].value_counts().sort_index().plot(kind='bar')
Bar plot of average duration by landmark and subscription type. In this case, we only wanted to compare duration by one type of customer.
dmerge4.groupby(['landmark','subscriptiontype'])[['duration_f']].agg([np.mean, np.std]).query('subscriptiontype == "Customer"').plot(kind="bar")
Note, this is the exact same plot as above except that the unstacking does not affect the result because we specified the subscription type value.
dmerge4.groupby(['landmark','subscriptiontype'])[['duration_f']].agg([np.mean, np.std]).query('subscriptiontype == "Customer"').unstack().plot(kind="barh")
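The two cells above rely on `query` being able to filter on an index level by name. Here is a minimal sketch of that behavior on made-up data (the `trips` frame and its values are hypothetical). To keep the column labels flat, it selects `duration_f` as a Series before aggregating, which differs slightly from the double-bracket selection used above.

```python
import pandas as pd

# Hypothetical trips; column names mirror the notebook.
trips = pd.DataFrame({
    "landmark": ["Palo Alto", "Palo Alto", "Palo Alto",
                 "San Jose", "San Jose", "San Jose"],
    "subscriptiontype": ["Customer", "Customer", "Subscriber",
                         "Customer", "Customer", "Subscriber"],
    "duration_f": [40.0, 60.0, 10.0, 20.0, 30.0, 8.0],
})

# Mean and standard deviation per (landmark, subscriptiontype).
stats = trips.groupby(["landmark", "subscriptiontype"])["duration_f"].agg(["mean", "std"])

# query can refer to an index level by its name, not just to columns.
customers = stats.query('subscriptiontype == "Customer"')

# unstack moves subscriptiontype into the columns; the numbers are the
# same Customer statistics, now indexed by landmark only.
customers_wide = customers.unstack()
print(customers)
print(customers_wide)
```

Since only one subscription type survives the filter, unstacking changes the labels but not the bars that get drawn.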
Looking at the mean duration by subscription type shows that subscribers on average stay within the thirty minutes while ‘customers’ often exceed that time period. Remember that seventy-nine percent of total observations in the dataset are subscribers. Subscribers have an annual membership while customers hold either a twenty-four-hour or a three-day pass.
dmerge4.groupby(['landmark','subscriptiontype'])['duration_f'].mean().unstack().plot(kind="bar")
We love this plot. It gives a very clear idea of the differences in average duration between customers and subscribers with dodged plots by landmark.
dmerge4.groupby(['subscriptiontype','landmark'])['duration_f'].mean().unstack().plot(kind="bar")
The plot above reveals really interesting insights when compared to this plot of the number of subscribers versus the number of customers in the bikesharing system in the first six months of operation. This plot indicates that the majority of bike trips were taken by subscribers in San Francisco. The previous plot indicated that they had the shortest duration time. Looking at these plots together tells a story.
dmerge4.groupby(['subscriptiontype', 'landmark']).size().unstack().plot(kind='barh')
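For reference, this is roughly what the counting step looks like on a handful of made-up rows (the `trips` frame here is hypothetical). `size()` counts rows per group, and `unstack` turns the landmark level into columns so each subscription type becomes one row of the bar chart.

```python
import pandas as pd

# Hypothetical trips: one row per trip.
trips = pd.DataFrame({
    "subscriptiontype": ["Subscriber"] * 5 + ["Customer"] * 3,
    "landmark": ["San Francisco"] * 4 + ["San Jose"]
                + ["San Francisco"] * 2 + ["Palo Alto"],
})

# Number of trips per (subscriptiontype, landmark); missing pairs become 0.
counts = trips.groupby(["subscriptiontype", "landmark"]).size().unstack(fill_value=0)

# Share of all trips taken by each subscription type.
share = counts.sum(axis=1) / counts.values.sum()
print(counts)
print(share)
```

Computing the share alongside the counts makes it easy to check a statement such as "most trips were taken by subscribers" directly against the table.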
The highest daily average duration occurred in December in Palo Alto.
dmerge4.groupby([pd.Grouper(freq='D',key='startdate'),'landmark'])[['duration_f']].mean().unstack().plot()
The most common end station when a bicyclist starts at the Powell Street BART station is University and Emerson.
dmerge4twolevels.query('startterminal == 53').groupby('endstation')['duration_f'].mean().plot(kind='barh')
grouplevel1 = dmerge4twolevels.groupby(level=1)
grouplevel1.mean()[['duration_f','dockcount']].plot(kind='bar')
When working with a multilevel index, this is one way of accessing a particular level.
dmerge4twolevels.groupby(level=['landmark']).mean()['duration_f'].plot(kind="bar")
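A short sketch of grouping on one level of a multilevel index follows. The frame is made up, and it assumes `landmark` is the second index level (position 1); the actual level order in `dmerge4twolevels` is not shown here. A level can be addressed either by name or by position.

```python
import pandas as pd

# Hypothetical two-level index: (startdate, landmark).
index = pd.MultiIndex.from_tuples(
    [
        ("2013-09-01", "San Francisco"),
        ("2013-09-01", "Palo Alto"),
        ("2013-09-02", "San Francisco"),
        ("2013-09-02", "Palo Alto"),
    ],
    names=["startdate", "landmark"],
)
frame = pd.DataFrame(
    {"duration_f": [10.0, 40.0, 14.0, 60.0], "dockcount": [19, 11, 15, 11]},
    index=index,
)

# Group on an index level by name ...
by_name = frame.groupby(level="landmark").mean()

# ... or by position (0 is the outermost level).
by_position = frame.groupby(level=1).mean()
print(by_name)
```

Referring to the level by name tends to be more robust, since it keeps working if the level order changes.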
dmerge4.groupby('thirtymin')[['duration_f']].mean().plot(kind='bar')
dmerge4.groupby(['thirtymin','timeofday'])[['duration_f']].mean().unstack().plot(kind="bar")
dmerge4.query('thirtymin == "in_thirty"')[['duration_f']].hist()
It's very interesting that the majority of rides of thirty minutes or less are between five and ten minutes! So, perhaps the system is mainly used for short commutes as intended rather than for recreational purposes. We have seen this to be the case for rides originating in San Francisco, where the majority of trips began and most docks are located.
dmerge4.query('thirtymin == "in_thirty"').groupby('duration_f').size().plot()
#total duration of trips thirty minutes and under
dmerge4.query('thirtymin == "in_thirty"')[['duration_f']].hist()
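To check a claim like "most short rides last between five and ten minutes" numerically rather than by eye, one option is to bin the durations explicitly. The sketch below uses made-up durations and five-minute bins chosen for illustration; it is not necessarily the binning that `hist()` picks by default.

```python
import pandas as pd

# Hypothetical durations (minutes) for rides of thirty minutes or less.
durations = pd.Series(
    [3.0, 6.5, 7.0, 8.2, 9.9, 12.0, 14.5, 22.0, 28.0], name="duration_f"
)

# Five-minute bins, closed on the right: (0, 5], (5, 10], ..., (25, 30].
bins = [0, 5, 10, 15, 20, 25, 30]
binned = pd.cut(durations, bins=bins)

# Count rides per bin, keeping the bins in their natural order.
counts = binned.value_counts().sort_index()

# The bin holding the most rides.
busiest = counts.idxmax()
print(counts)
print(busiest)
```

With explicit bins, the count per interval can be quoted directly in the text instead of being estimated from the height of a histogram bar.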
hourly2 = dmerge4.groupby([pd.Grouper(freq='H',key='startdate'),'landmark'],as_index=False).mean()
ggplot(hourly2, aes('startdate', 'duration',color='landmark')) + geom_line()
It's nice to see the differences by landmark. For instance, it appears that durations are higher in Palo Alto near the end of the month.
#mean duration by day of the week
dmerge4.groupby(['day','landmark']).aggregate(np.mean)[['duration']].unstack().plot()
Redwood City in the eleven pm hour has the longest average duration.
#mean duration by hour of the day
dmerge4.groupby(['hour','landmark']).aggregate(np.mean)[['duration_f']].unstack().plot(kind='barh')
For every hour of the day, Customers have longer average duration than Subscribers especially in the wee hours.
#mean duration by hour of the day faceted by subscriber type
dmerge4.groupby(['hour','subscriptiontype']).aggregate(np.mean)[['duration']].unstack().plot(kind='barh')
There were more bicycle trips taken during the morning commute and the evening commute.
hour_counts = dmerge4.groupby('hour').aggregate(np.size)
hour_counts['duration_i'].plot(kind='bar')
Total duration was highest from eight am to six pm which also corresponds to working hours.
hour_sum = dmerge4.groupby('hour').aggregate(np.sum)
hour_sum['duration_i'].plot(kind='bar')
This plot indicates that subscriber trips were generally shorter, while customer trips had longer durations.
from ggplot import *
ggplot(aes(x='hour', y='duration_f',color='subscriptiontype'), data=dmerge4) + \
geom_point(alpha=.6)
ggplot(dmerge4, aes(x='duration', color='landmark')) + \
geom_density() + xlim(0,20000)
ggplot(aes(x='hour', y='duration', colour='subscriptiontype'), data=dmerge4) + \
geom_point() + facet_wrap("landmark")
This plot illustrates that trips in San Francisco for the most part had shorter total durations, while there were some outliers in Palo Alto.
ggplot(aes(x='month', y='duration', color='landmark'), data=dmerge4) + \
geom_point()
This type of plot makes it easier to pinpoint outliers by time of day.
ggplot(aes(x='hour', y='duration', color='landmark'), data=dmerge4) + \
geom_point()
sns.factorplot("month", "duration",data=dmerge4,hue="subscriptiontype",
palette="PRGn", aspect=1.25)
#unable to select order of x-axis(how to handle a time type in seaborn)
sns.factorplot("landmark", "duration",data=dmerge4,hue="subscriptiontype",
palette="PRGn", aspect=1.25)
sns.factorplot("landmark", "duration", col="subscriptiontype",data=dmerge4, palette="PuBu_d",size=4,aspect=1.5);
sns.factorplot("subscriptiontype", "duration_f", col="landmark", data=dmerge4, palette="PuBu_d",size=4,aspect=.5);
dmerge4["startstation"].value_counts()[:10]
dmerge4["startstation"].value_counts()[:5].plot(kind="bar")
dmerge4.groupby('startstation').duration_f.mean().sort_values(ascending=True)[:5].plot(kind='barh')
#mean duration
startstation = dmerge4.groupby('startstation').duration_f.mean().sort_values(ascending=True)[:5]
startstation.plot(kind='barh')
dmerge4.groupby(['landmark','subscriptiontype']).agg({'duration_f' : np.sum}).unstack().plot(kind="bar")
This plot shows that the average dock count tends to decrease as mean temperature increases.
#shows that avg dock count decr when mean temp incr
ggplot(h, aes(x = 'mean_temperature_f', y ='dockcount')) + geom_point() + geom_smooth(method = 'lm')