fn_babs_visualization

Practical Data Analysis with Python - Visualization

Data visualization

Let's make some visualizations as part of the iterative process of data munging, aggregating, and visualizing. Visualizations help us understand the data better. At best, exploratory data visualization inspires questions and informs our analysis while identifying trends and patterns to further learn from the data.

We will see that displaying similar information in a variety of ways and using different types and styles of plots can reveal even more about the data.

In [1]:
%matplotlib inline
In [2]:
import numpy as np
import pandas as pd
from datetime import datetime, time
In [3]:
from ggplot import *
import seaborn as sns
In [7]:
print pd.__version__
print np.__version__
0.14.1
1.9.0

In [8]:
%load_ext ipycache

Exploratory data visualization

In practical data analysis, we want to make our plots in the most efficient and succinct way possible, following an iterative data analysis process. A visualization will raise more questions that lead to further visualizations using pandas, ggplot, and seaborn. In exploratory visualization, grouping and aggregating data and plotting are part of that iterative process. We want to learn from our data which further visualizations will provide insight and, hopefully, lead to more statistical questions and more visualizations.

Mean temperature

In [8]:
#mean temperature
dmerge4.mean_temperature_f.plot()
Out[8]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0b122f7990>

Plotting time-series data

Daily average duration for the first six months of operation of the bike sharing system

In this plot, we can see that most daily averages were under sixty minutes, fluctuating for the most part between just under twenty and around fifty minutes depending on the day.

In [9]:
#daily average duration for the first six months of operation
print (ggplot(aes(x='startdate', y='duration_i'), data=daily_means) + \
    geom_line()) + geom_smooth()
<ggplot: (8744552623097)>

Daily average mean temperature

Since San Francisco doesn't typically have harsh winters, temperature did not seem to have much effect on the length of bicycle trips.

In [10]:
print (ggplot(aes(x='startdate', y='mean_temperature_f'), data=daily_means) + \
    geom_line())
<ggplot: (8742867033829)>

Mean temperatures for the entire dataset

This plot is included here to compare the graph above with a plot of the entire dataset. Taking the daily mean in the previous plots consolidates multiple observations on a single day and gives a less noisy graph. In this graph, every single observation is plotted. Taking the daily mean is a way to see the trend more cleanly and also allows for further analysis of daily data.

In [10]:
print (ggplot(aes(x='startdate', y='mean_temperature_f'), data=dmerge4) + \
    geom_line())
<ggplot: (8746630520053)>
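The daily_means frame is built elsewhere in the analysis, so here is a minimal sketch of how a daily mean could be computed with resample. The trips frame, its values, and the column selection are made up for illustration, and we assume the data is indexed by start time.

```python
import pandas as pd

# Hypothetical stand-in for the trip-level data: one row per trip,
# indexed by start time. All values are invented for illustration.
trips = pd.DataFrame(
    {"duration_f": [10.0, 20.0, 30.0, 50.0],
     "mean_temperature_f": [60.0, 60.0, 62.0, 62.0]},
    index=pd.to_datetime(["2013-09-01 08:00", "2013-09-01 17:30",
                          "2013-09-02 09:15", "2013-09-02 18:45"]),
)
trips.index.name = "startdate"

# Collapse all trips that start on the same calendar day into one row of means.
daily = trips.resample("D").mean()

# Move the index into a regular column so that a plotting call
# such as aes(x='startdate', ...) can refer to it by name.
daily_for_plot = daily.reset_index()
print(daily_for_plot)
```

Four observations become two daily rows, which is why the daily plots look so much smoother than the plot of every observation.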

Average duration on a monthly basis for the six month period

The monthly averages take away some of the noise seen in the daily data. We do see a declining trend as we go from autumn to winter.

In [14]:
print (ggplot(aes(x='startdate', y='duration_i'), data=monthly_means) + \
    geom_line())
<ggplot: (8746629296849)>

Remember that we concatenated two time series intervals for the morning and evening commute. We can now see in this graph that the average duration by month is mostly higher in the evening than in the morning. Keeping in mind that August was the first month of operation, the plot also indicates that, on average, the bicycle was returned to a dock within the initial thirty minutes from the start time from September to February.

Average duration by month faceted by morning and evening commuting hours

In [18]:
concatenated.plot()
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f47bbb13790>
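The concatenated frame was created in an earlier step. As a rough sketch of the idea, between_time can pick out each commute window before the monthly means are put side by side with concat. The commute hours, the monthly_mean helper, and the data below are assumptions for illustration, not the values used in the analysis.

```python
import pandas as pd

# Hypothetical trips indexed by start time (invented values).
trips = pd.DataFrame(
    {"duration_f": [8.0, 12.0, 20.0, 30.0, 10.0, 40.0]},
    index=pd.to_datetime([
        "2013-09-03 08:00", "2013-09-10 08:30",  # September mornings
        "2013-09-03 17:00", "2013-09-10 18:00",  # September evenings
        "2013-10-01 07:45",                      # October morning
        "2013-10-01 17:15",                      # October evening
    ]),
)

def monthly_mean(frame, start, end, label):
    """Mean duration per calendar month for trips starting between start and end."""
    window = frame.between_time(start, end)["duration_f"]
    means = window.groupby(window.index.to_period("M")).mean()
    return means.rename(label)

# Assumed commute windows; the notebook's actual windows are not shown here.
morning = monthly_mean(trips, "07:00", "09:00", "morning")
evening = monthly_mean(trips, "16:00", "19:00", "evening")

# One column per commute window, aligned on month.
commute = pd.concat([morning, evening], axis=1)
print(commute)
```

Calling commute.plot() on a frame shaped like this draws one line per column, which matches the two-line plot above.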

Average duration at 4pm for the six month period

What's really nice about pandas time series functionality is that we can also look at plots at specific times. Here is a plot of duration at 4pm for the entire six month period. The plot indicates that a user had a bicycle for a really long time at 4pm in October, but most trip durations were under fifty minutes.

In [19]:
#total duration at particular time
dmerge4.duration_i.at_time(time(16, 0)).plot() 
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f47bceb0750>
In [14]:
four_pm_df=pd.DataFrame(dmerge4.duration_f.at_time(time(16, 0))).reset_index()
ggplot(four_pm_df,aes(x='startdate',y='duration_f')) + geom_line()

#duration at 4pm for everyday the entire six month period
Out[14]:
<ggplot: (16850305)>
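One detail worth keeping in mind is that at_time matches an exact clock time, so only trips that start at 16:00 sharp are selected. A small sketch with made-up trips shows the difference from between_time, which covers a whole window.

```python
import pandas as pd
from datetime import time

# Hypothetical trips indexed by start time (invented values).
trips = pd.DataFrame(
    {"duration_f": [12.0, 45.0, 9.0, 300.0]},
    index=pd.to_datetime(["2013-10-01 16:00", "2013-10-01 16:05",
                          "2013-10-02 08:00", "2013-10-02 16:00"]),
)

# Exactly 16:00:00 on any day; the 16:05 trip is not included.
four_pm = trips["duration_f"].at_time(time(16, 0))

# Every trip that starts during the 4pm hour (both endpoints inclusive).
four_pm_hour = trips["duration_f"].between_time("16:00", "16:59")

print(four_pm)
print(four_pm_hour)
```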

Average duration by time of day

Durations are shorter during the morning and evening commuting hours and slightly higher during the mid-day and at night. The data also indicates that people are keeping bicycles for a longer period of time after midnight until four am.

In [26]:
dmerge4.groupby('timeofday')[['duration_i']].mean().plot(kind='bar')
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f47bb96c4d0>
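The timeofday column comes from an earlier step of the analysis. One way such a categorical column could be derived from the hour is pd.cut. The bin edges and labels below are assumptions chosen for illustration and may differ from the ones used to build dmerge4.

```python
import pandas as pd

# Hypothetical trips with the start hour already extracted (invented values).
trips = pd.DataFrame({
    "hour": [2, 8, 12, 17, 22],
    "duration_f": [40.0, 9.0, 18.0, 11.0, 25.0],
})

# Assumed bins, left-closed: [0,5), [5,10), [10,16), [16,20), [20,24).
bins = [0, 5, 10, 16, 20, 24]
labels = ["wee_hours", "morning", "mid_day", "evening", "night"]
trips["timeofday"] = pd.cut(trips["hour"], bins=bins, labels=labels, right=False)

# Mean duration per category, in the order the labels were given.
by_timeofday = trips.groupby("timeofday", observed=False)["duration_f"].mean()
print(by_timeofday)
```

Because the result is an ordered categorical, a bar plot of by_timeofday keeps the bars in time order rather than alphabetical order.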

Dodged bar plot of average duration by landmark and time of day

This plot gives a really nice visual of average duration by landmark and time of day. Across the board, morning durations are shortest, which would make sense for morning commuters. Morning durations are longest in Palo Alto. Perhaps there is more traffic or greater distances to travel. In addition, we know that there are fewer bicycle docks in Palo Alto. Trips that start mid-day have the longest durations, except in Mountain View, where evening bicycle trips are the longest.

Duration in San Francisco is the shortest, and this is also the area that has the highest number of docks in the dataset, as shown in the grouping and aggregating section of the analysis.

In [32]:
sns.factorplot("landmark", "duration_f", "timeofday", data=timelandmark);

Dockcounts by landmark and duration

It is evident in this plot that dockcounts were lowest in Palo Alto where average durations were also the highest.

In [37]:
sns.factorplot("landmark", "duration_f", "dockcount", data=timelandmark,palette="BuPu_d")
Out[37]:
<seaborn.axisgrid.FacetGrid at 0x7f47bbb0b750>

Average duration by landmark faceted by timeofday

This plot is nice because it lets us see the differences in duration by landmark for each categorical time of day.

In [40]:
sns.factorplot("landmark","duration_f", data=timelandmark, row="timeofday",
               margin_titles=True, aspect=3, size=2,palette="BuPu_d");

Average duration for each landmark faceted by month

Palo Alto had the highest duration in almost every month except for February, when Mountain View had the highest average duration.

In [42]:
#faceted by month
sns.factorplot("landmark","duration_f", data=dmerge4, row="month",
               margin_titles=True, aspect=3, size=2,palette="Pastel1");

Plot of mean, minimum, and maximum temperature by month

This plot shows the mean, minimum, and maximum of max_temperature_f for each month.

In [43]:
dmerge4.groupby(dmerge4.index.month)['max_temperature_f'].agg(['mean','min','max']).plot()
Out[43]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f47c8fcc290>
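To make the shape of that aggregation concrete, here is a small sketch on invented weather values. Passing a list of function names to agg returns one column per statistic, and each column becomes one line in the plot.

```python
import pandas as pd

# Hypothetical daily weather readings indexed by date (invented values).
weather = pd.DataFrame(
    {"max_temperature_f": [70.0, 80.0, 60.0, 64.0]},
    index=pd.to_datetime(["2013-09-01", "2013-09-15",
                          "2013-12-01", "2013-12-20"]),
)

# Group by the month number of the index and compute three statistics at once.
monthly_stats = (
    weather.groupby(weather.index.month)["max_temperature_f"]
    .agg(["mean", "min", "max"])
)
print(monthly_stats)
```

Note that grouping by month number sorts the x-axis from 1 to 12, so a period that runs from August to February will show January and February first.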

Mean and minimum duration by hour of the day

Duration is lowest during the morning and evening commute hours and highest late at night and in the wee hours.

In [47]:
dmerge4.groupby(dmerge4.index.hour)['duration_f'].agg(['mean','min']).plot()
Out[47]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f47c910b450>

What are the average durations by day of the month?

In [10]:
day_means['duration_f'].plot(kind='bar')
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f0b0ef8cc90>

In what month did the highest duration occur and what was the subscription type?

The highest duration occurred in December for a subscription type of customer in Redwood City.

In [25]:
dmerge4.groupby([pd.Grouper(freq='M',key='startdate'),"landmark","subscriptiontype"])[['duration_f']].mean().unstack().unstack().plot(colormap='gist_rainbow')
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f400153ee90>

Dodged bar plots by month of average duration by landmark and subscription type

In December, January, and February, the highest average duration was for a customer in Redwood City.

In [35]:
three_grouper_unstack_twice = dmerge4.groupby([pd.Grouper(freq='M',key='startdate'),"landmark","subscriptiontype"])[['duration']].mean().unstack().unstack()
three_grouper_unstack_twice.plot(kind="barh",colormap='gist_rainbow')
#change color scheme
Out[35]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f3ffff109d0>

Monthly average duration by landmark

From September until December, Palo Alto had the highest average duration by month while San Jose and San Francisco had the lowest. Average duration by month increased in Redwood City from September to February.

In [55]:
grouperl = dmerge4.groupby([pd.Grouper(freq='M',key='startdate'),'landmark']).mean()[['duration_f']]
grouperl.unstack().plot()
Out[55]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f47c8f01b10>

Dodged bar plot of monthly average duration by landmark

This dodged barplot shows that duration was highest for Redwood City in February and highest for Palo Alto in November. In every month, durations were lowest for both San Francisco and San Jose. We could hypothesize this is due to commuting, traffic patterns, and demand for bicycles. In any case, knowing where durations are higher can inform decisions about demand and dockcount.

In [56]:
grouperl.unstack().plot(kind="bar")
Out[56]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f47c9513ad0>

Average monthly duration faceted by time of day

In [58]:
#avg duration monthly faceted by time of day
grouperm = dmerge4.groupby([pd.Grouper(freq='M',key='startdate'),'timeofday']).mean()[['duration_f']]
grouperm.unstack().plot(kind="bar")
Out[58]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f47bc3cc090>

Average duration by landmark

In [59]:
dmerge4.groupby("landmark").mean()[['duration_f']].plot(kind='bar')
Out[59]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f47bc2b1990>

Average duration by hour of the day for the entire six month period

In [13]:
ggplot(dmerge4, aes('hour', 'duration_f',color='landmark')) + geom_point(stat="identity")
    
    #how to add legend
Out[13]:
<ggplot: (1767377)>

Boxplot of monthly average duration by landmark

Subscribers for the most part stay below sixty minutes, while customers have higher durations, with the widest range of duration in Redwood City followed by Mountain View.

In [64]:
three_grouper = dmerge4.groupby([pd.Grouper(freq='M',key='startdate'),"landmark","subscriptiontype"])[['duration_f']].mean().reset_index()
import seaborn as sns
sns.factorplot("landmark", "duration_f", "subscriptiontype", three_grouper, kind="box",
                   palette="PRGn", aspect=2.25)
Out[64]:
<seaborn.axisgrid.FacetGrid at 0x7f47bbf9d550>

Distribution of mean temperature

It doesn't get too warm or too cold in Northern California between August and February.

In [69]:
dmerge4['mean_temperature_f'].value_counts().sort_index().plot(kind='bar') 
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f47ba67e550>

Average duration and standard deviation for subscription type of Customer by landmark

Bar plot of average duration by landmark and subscription type. In this case, we only wanted to compare duration by one type of customer.

In [71]:
dmerge4.groupby(['landmark','subscriptiontype'])[['duration_f']].agg([np.mean, np.std]).query('subscriptiontype == "Customer"').plot(kind="bar")
Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f479e4be310>

Note that the next plot shows the same values as the one above, drawn as horizontal bars. The unstacking does not affect the result because we specified a single subscription type value.

In [73]:
dmerge4.groupby(['landmark','subscriptiontype'])[['duration_f']].agg([np.mean, np.std]).query('subscriptiontype == "Customer"').unstack().plot(kind="barh")
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f479e545d90>
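A small sketch on invented data shows why unstacking changes nothing here. Once the query keeps a single value of subscriptiontype, moving that level into the columns only relabels the same numbers.

```python
import pandas as pd

# Hypothetical trips (invented values).
trips = pd.DataFrame({
    "landmark": ["Palo Alto", "Palo Alto", "San Jose", "San Jose", "San Jose"],
    "subscriptiontype": ["Customer", "Subscriber", "Customer", "Customer", "Subscriber"],
    "duration_f": [60.0, 10.0, 20.0, 40.0, 8.0],
})

# Mean and standard deviation per (landmark, subscriptiontype) pair.
stats = trips.groupby(["landmark", "subscriptiontype"])["duration_f"].agg(["mean", "std"])

# query can refer to a named index level just like a column.
customers = stats.query('subscriptiontype == "Customer"')

# Moving the only remaining subscription type into the columns keeps the values.
wide = customers.unstack()

print(customers)
print(wide)
```

The standard deviation is NaN for any group with a single observation, so a missing bar in such a plot does not necessarily mean a zero spread.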

Average duration by subscription type

Looking at the mean duration by subscription type shows that subscribers are on average within the thirty minutes while ‘customers’ often exceed that time period. Remember that seventy-nine percent of total observations in the dataset are subscribers. Subscribers have an annual membership, while customers are defined as holders of either a twenty-four-hour or a three-day pass.

In [74]:
dmerge4.groupby(['landmark','subscriptiontype'])['duration_f'].mean().unstack().plot(kind="bar")
Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f479ec85310>

Plot that indicates the differences in average duration between subscribers and customers by landmark

We love this plot. It gives a very clear idea of the differences in average duration between customers and subscribers with dodged plots by landmark.

In [75]:
dmerge4.groupby(['subscriptiontype','landmark'])['duration_f'].mean().unstack().plot(kind="bar")
Out[75]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f479eb78c10>
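The last two plots come from the same numbers. Only the order of the groupby keys differs, and that order decides which variable ends up on the x-axis and which one becomes the dodged bars. A sketch with invented values:

```python
import pandas as pd

# Hypothetical trips (invented values).
trips = pd.DataFrame({
    "landmark": ["Palo Alto", "Palo Alto", "San Francisco", "San Francisco"],
    "subscriptiontype": ["Customer", "Subscriber", "Customer", "Subscriber"],
    "duration_f": [70.0, 12.0, 35.0, 9.0],
})

# unstack() moves the LAST key into the columns.
# Rows (x-axis groups) are landmarks, columns (bars within a group) are subscription types.
by_landmark = trips.groupby(["landmark", "subscriptiontype"])["duration_f"].mean().unstack()

# Swapping the keys swaps the roles: rows are subscription types, columns are landmarks.
by_type = trips.groupby(["subscriptiontype", "landmark"])["duration_f"].mean().unstack()

print(by_landmark)
print(by_type)
```

The second table is simply the transpose of the first, so choosing the key order is really a choice about which comparison the reader should see first.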

Number of observations in the dataset by landmark and subscription type

The plot above reveals really interesting insights when compared to this plot of the number of subscribers versus the number of customers in the bikesharing system in the first six months of operation. This plot indicates that the majority of bike trips were taken by subscribers in San Francisco. The previous plot indicated that they had the shortest duration time. Looking at these plots together tells a story.

In [76]:
dmerge4.groupby(['subscriptiontype', 'landmark']).size().unstack().plot(kind='barh')
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f479ea95bd0>
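Counting observations works the same way as averaging them, with size() in place of mean(). The sketch below uses a handful of invented trips and also shows how a share of observations per subscription type could be checked with value_counts.

```python
import pandas as pd

# Hypothetical trips (invented values): four subscribers and one customer.
trips = pd.DataFrame({
    "subscriptiontype": ["Subscriber", "Subscriber", "Subscriber", "Customer", "Subscriber"],
    "landmark": ["San Francisco", "San Francisco", "San Francisco",
                 "San Francisco", "Palo Alto"],
})

# Number of rows per (subscriptiontype, landmark); missing pairs are filled with 0.
counts = trips.groupby(["subscriptiontype", "landmark"]).size().unstack(fill_value=0)

# Fraction of all observations that belongs to each subscription type.
share = trips["subscriptiontype"].value_counts(normalize=True)

print(counts)
print(share)
```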

Daily average duration by landmark for the six month period

The highest daily average duration occurred in December in Palo Alto.

In [29]:
dmerge4.groupby([pd.Grouper(freq='D',key='startdate'),'landmark'])[['duration_f']].mean().unstack().plot()
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f39b25ff090>

What is the most common end station when a bicyclist starts at the Powell Street BART?

The most common end station when a bicyclist starts at the Powell Street BART is University and Emerson.

In [20]:
dmerge4twolevels.query('startterminal == 53').groupby('endstation')['duration_f'].mean().plot(kind='barh')
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb3a90e0450>
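The cell above plots the mean duration per end station. To count how often each end station occurs instead, a sketch like the following could be used. The station names and values are placeholders, and only the terminal number 53 is taken from the cell above.

```python
import pandas as pd

# Hypothetical trips (invented values and placeholder station names).
trips = pd.DataFrame({
    "startterminal": [53, 53, 53, 53, 70],
    "endstation": ["Station A", "Station B", "Station A", "Station C", "Station A"],
    "duration_f": [5.0, 40.0, 7.0, 12.0, 9.0],
})

from_53 = trips[trips["startterminal"] == 53]

# How often each end station occurs for trips starting at terminal 53.
end_counts = from_53["endstation"].value_counts()
most_common = end_counts.idxmax()

# For comparison: the end station with the longest mean duration.
mean_by_end = from_53.groupby("endstation")["duration_f"].mean()
longest = mean_by_end.idxmax()

print(most_common, longest)
```

In this toy data the most common end station and the one with the longest average trip are different stations, so it matters which of the two questions a plot is answering.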

Plot duration and dockcount by subscription type

In [22]:
grouplevel1 = dmerge4twolevels.groupby(level=1)
grouplevel1.mean()[['duration_f','dockcount']].plot(kind='bar')
Out[22]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb3a8fc6f10>

When working with a multilevel index, this is one way of accessing a particular level.

In [24]:
dmerge4twolevels.groupby(level=['landmark']).mean()['duration_f'].plot(kind="bar")
Out[24]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb3b0092a10>
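As a sketch of the two ways of addressing a level, the toy frame below is indexed by landmark and subscriptiontype. We are assuming that dmerge4twolevels has a similar two-level index in that order, and the values are invented.

```python
import pandas as pd

# Hypothetical frame with a two-level index (invented values).
trips = pd.DataFrame({
    "landmark": ["Palo Alto", "Palo Alto", "San Francisco", "San Francisco"],
    "subscriptiontype": ["Customer", "Subscriber", "Customer", "Subscriber"],
    "duration_f": [70.0, 12.0, 35.0, 9.0],
    "dockcount": [15, 15, 19, 23],
}).set_index(["landmark", "subscriptiontype"])

# A level can be addressed by position (0 is the outermost level)...
by_type = trips.groupby(level=1)[["duration_f", "dockcount"]].mean()

# ...or by name, which is easier to read and survives a reordering of levels.
by_landmark = trips.groupby(level="landmark")["duration_f"].mean()

print(by_type)
print(by_landmark)
```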

Plot the average duration for trips that were thirty minutes and under versus trips over thirty minutes

In [25]:
dmerge4.groupby('thirtymin')[['duration_f']].mean().plot(kind='bar')
Out[25]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb3b0092490>
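The thirtymin column was added in an earlier step. A flag like this could be created with np.where, as sketched below. The label "in_thirty" appears in the queries used in this notebook, while "over_thirty" and the choice to count exactly thirty minutes as within the limit are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical trip durations in minutes (invented values).
trips = pd.DataFrame({"duration_f": [5.0, 29.0, 30.0, 31.0, 120.0]})

# Label each trip by whether it stayed within thirty minutes.
# "over_thirty" is a made-up label for the other group.
trips["thirtymin"] = np.where(trips["duration_f"] <= 30, "in_thirty", "over_thirty")

# Mean duration within each group.
means = trips.groupby("thirtymin")["duration_f"].mean()
print(means)
```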

Average duration for trips thirty minutes and under compared to trips over thirty minutes faceted by time of day

In [33]:
dmerge4.groupby(['thirtymin','timeofday'])[['duration_f']].mean().unstack().plot(kind="bar")
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f39b245c710>

Histogram of duration of trips thirty minutes and less

In [35]:
dmerge4.query('thirtymin == "in_thirty"')[['duration_f']].hist()
Out[35]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f39b2257790>]], dtype=object)

Number of observations for each duration of thirty minutes and under

It's very interesting that the majority of rides thirty minutes and less are between five and ten minutes! So, perhaps the system is mainly used for short commutes as intended rather than for recreational purposes. We have seen this to be the case for rides originating in San Francisco, where the majority of trips began and most docks are located.

In [36]:
dmerge4.query('thirtymin == "in_thirty"').groupby('duration_f').size().plot()
Out[36]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f39b20df190>
In [38]:
#total duration of trips thirty minutes and under
dmerge4.query('thirtymin == "in_thirty"')[['duration_f']].hist()
Out[38]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x7f39b1f53f90>]], dtype=object)
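To read a claim like "most short rides last between five and ten minutes" off the numbers rather than the picture, the counts behind a histogram can be computed directly with np.histogram. The durations and the five-minute bin width below are invented for illustration.

```python
import numpy as np

# Hypothetical durations in minutes for trips of thirty minutes or less.
durations = np.array([3.0, 6.0, 7.0, 8.0, 9.0, 12.0, 25.0, 30.0])

# Five-minute bins: [0,5), [5,10), ..., [25,30]; the last bin includes 30.
edges = list(range(0, 35, 5))
counts, edges_out = np.histogram(durations, bins=edges)

# Left edge of the fullest bin.
busiest_start = edges_out[counts.argmax()]

print(counts)
print(busiest_start)
```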

Hourly average duration for the six month period

In [12]:
hourly2 =  dmerge4.groupby([pd.Grouper(freq='H',key='startdate'),'landmark'],as_index=False).mean()
ggplot(hourly2, aes('startdate', 'duration',color='landmark')) + geom_line()
Out[12]:
<ggplot: (11835345)>

What days of the week are bicycles being used for the longest average duration by landmark?

It's nice to see the differences by landmark. For instance, it appears that durations are higher in Palo Alto near the end of the month.

In [42]:
#mean duration by day of the week
dmerge4.groupby(['day','landmark']).aggregate(np.mean)[['duration']].unstack().plot()
Out[42]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f39b1a6e390>

What hour of the day has the longest duration by landmark?

Redwood City in the eleven pm hour has the longest average duration.

In [28]:
#mean duration by hour of the day
dmerge4.groupby(['hour','landmark']).aggregate(np.mean)[['duration_f']].unstack().plot(kind='barh')
Out[28]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb3a8fc6a10>

Average duration by hour of the day faceted by subscriber type

For every hour of the day, Customers have a longer average duration than Subscribers, especially in the wee hours.

In [44]:
#mean duration by hour of the day faceted by subscriber type
dmerge4.groupby(['hour','subscriptiontype']).aggregate(np.mean)[['duration']].unstack().plot(kind='barh')
Out[44]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f39b186cd90>

Distribution of number of observations in the dataset by hour of the day

There were more bicycle trips taken during the morning commute and the evening commute.

In [31]:
hour_counts = dmerge4.groupby('hour').aggregate(np.size)
hour_counts['duration_i'].plot(kind='bar')
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb3aa947e50>

Distribution of total duration by hour of the day

Total duration was highest from eight am to six pm which also corresponds to working hours.

In [33]:
hour_sum = dmerge4.groupby('hour').aggregate(np.sum)
hour_sum['duration_i'].plot(kind='bar')
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb3a8c67050>

Average duration by hour of the day and subscription type.

This plot indicates that subscriber trips were generally shorter while customer trips had higher durations.

In [7]:
from ggplot import *
ggplot(aes(x='hour', y='duration_f',color='subscriptiontype'), data=dmerge4) + \
    geom_point(alpha=.6)
Out[7]:
<ggplot: (5823961)>

Density plot of average duration by landmark

In [10]:
ggplot(dmerge4, aes(x='duration', color='landmark')) + \
    geom_density() + xlim(0,20000)
Out[10]:
<ggplot: (16841881)>

Duration by hour for both subscription types faceted by landmark

In [49]:
ggplot(aes(x='hour', y='duration', colour='subscriptiontype'), data=dmerge4) + \
    geom_point() + facet_wrap("landmark")
Out[49]:
<ggplot: (8742859725017)>

Monthly total duration by landmark.

This plot illustrates that trips in San Francisco for the most part had shorter total durations while there were some outliers in Palo Alto.

In [6]:
ggplot(aes(x='month', y='duration', color='landmark'), data=dmerge4) + \
    geom_point()
Out[6]:
<ggplot: (8062841)>

Total duration by hour of day faceted by landmark

This type of plot makes it easier to pinpoint outliers by time of day.

In [7]:
ggplot(aes(x='hour', y='duration', color='landmark'), data=dmerge4) + \
    geom_point()
Out[7]:
<ggplot: (7042513)>

Plot of duration by month faceted by subscription type

In [52]:
sns.factorplot("month", "duration",data=dmerge4,hue="subscriptiontype",
                 palette="PRGn", aspect=1.25)
#unable to select order of x-axis(how to handle a time type in seaborn)
Out[52]:
<seaborn.axisgrid.FacetGrid at 0x7f39b0ad29d0>

Plot of duration by landmark faceted by subscription type

In [53]:
sns.factorplot("landmark", "duration",data=dmerge4,hue="subscriptiontype",
                 palette="PRGn", aspect=1.25)
Out[53]:
<seaborn.axisgrid.FacetGrid at 0x7f39b0333990>

Plot of total duration by landmark faceted by subscription type

In [57]:
sns.factorplot("landmark", "duration", col="subscriptiontype",data=dmerge4, palette="PuBu_d",size=4,aspect=1.5);

Plot of average duration by landmark faceted by subscription type

In [61]:
sns.factorplot("subscriptiontype", "duration_f", col="landmark", data=dmerge4, palette="PuBu_d",size=4,aspect=.5);

The top five start stations

In [9]:
dmerge4["startstation"].value_counts()[:10]
Out[9]:
San Francisco Caltrain (Townsend at 4th)         9838
Harry Bridges Plaza (Ferry Building)             7343
Embarcadero at Sansome                           6545
Market at Sansome                                5922
Temporary Transbay Terminal (Howard at Beale)    5113
Market at 4th                                    5030
2nd at Townsend                                  4987
San Francisco Caltrain 2 (330 Townsend)          4976
Steuart at Market                                4913
Townsend at 7th                                  4493
dtype: int64
In [10]:
dmerge4["startstation"].value_counts()[:5].plot(kind="bar")
Out[10]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb1acc6bfd0>

Plot of the five lowest average durations by start station

In [23]:
dmerge4.groupby('startstation').duration_f.mean().order(ascending=True)[:5].plot(kind='barh')
Out[23]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb1ae79b590>

Plot of the top five average durations by start station

In [21]:
#mean duration, sorted in descending order so the first five are the highest
startstation = dmerge4.groupby('startstation').duration_f.mean().order(ascending=False)[:5]
startstation.plot(kind='barh')
Out[21]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb1ae7a4190>
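The sort direction is what separates the lowest five from the top five. In later pandas releases Series.order was removed and sort_values takes its place, so here is a sketch of both selections using that method on invented data with placeholder station names.

```python
import pandas as pd

# Hypothetical trips (invented values and placeholder station names).
trips = pd.DataFrame({
    "startstation": ["A", "A", "B", "C", "D", "E", "F", "G"],
    "duration_f": [10.0, 20.0, 5.0, 50.0, 30.0, 8.0, 90.0, 12.0],
})

mean_by_station = trips.groupby("startstation")["duration_f"].mean()

# Ascending sort: the first five entries are the five lowest means.
lowest_five = mean_by_station.sort_values(ascending=True)[:5]

# Descending sort: the first five entries are the five highest means.
top_five = mean_by_station.sort_values(ascending=False)[:5]

print(lowest_five)
print(top_five)
```

The shortcuts mean_by_station.nsmallest(5) and mean_by_station.nlargest(5) select the same stations without an explicit sort.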

Plot of the total duration by landmark faceted by subscriber type

In [26]:
dmerge4.groupby(['landmark','subscriptiontype']).agg({'duration_f' : np.sum}).unstack().plot(kind="bar")
Out[26]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fb1ae2f13d0>

Scatter plot with a smoothing line

This plot shows that the average dock count has a decreasing trend as mean temperature increases.

In [29]:
#shows that avg dock count decr when mean temp incr
ggplot(h, aes(x = 'mean_temperature_f', y ='dockcount')) + geom_point() + geom_smooth(method = 'lm') 
Out[29]:
<ggplot: (8775068078681)>
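A smoothing line with method = 'lm' is a straight line fitted by least squares. To put a number on the trend, the slope of such a line can be estimated with np.polyfit, as in this sketch. The temperature and dock count values are invented, and this is not the frame h used above.

```python
import numpy as np

# Hypothetical points that fall exactly on a downward-sloping line (invented values).
temperature = np.array([50.0, 55.0, 60.0, 65.0, 70.0])
dockcount = np.array([23.0, 21.0, 19.0, 17.0, 15.0])

# Degree-1 least-squares fit: dockcount is approximately slope * temperature + intercept.
slope, intercept = np.polyfit(temperature, dockcount, deg=1)

print(slope, intercept)
```

A negative slope corresponds to the decreasing trend described above. With real data the points scatter around the line, so the slope summarizes the direction of the trend rather than an exact relationship.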