Friday, June 14, 2013

2D Histograms


Histograms visualize the statistical distribution of data sets. In this example we will generate some random sets of data and view their 1D and 2D histograms.

First, generate some random data.
from numpy import *
 
# generate some random sets of data
y0 = random.randn(100)/5. + 0.5 
x0 = random.randn(100)/5. + 0.5 
 
y1 = random.rayleigh(size=20)/7. + 0.1
x1 = random.rayleigh(size=20)/8. + 1.1
 
y2 = random.randn(50)/10. + 0.9
x2 = random.rayleigh(size=50)/10. + 0.1
 
y3 = random.randn(50)/8. + 0.1
x3 = random.randn(50)/8. + 0.1
 
y = concatenate([y0,y1,y2,y3])
x = concatenate([x0,x1,x2,x3])
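Before plotting, it can help to look at the bin counts directly. The following is a minimal sketch, not part of the original post: it rebuilds the x mixture with a seeded generator (the seed and the choice of 20 bins are arbitrary assumptions) and bins it with numpy.histogram.

```python
import numpy as np

# Seed chosen arbitrarily so the sketch is repeatable (an assumption,
# the original post draws unseeded samples).
rng = np.random.default_rng(0)

# Same mixture recipe as x above: 100 + 20 + 50 + 50 samples.
x = np.concatenate([
    rng.standard_normal(100) / 5.0 + 0.5,
    rng.rayleigh(size=20) / 8.0 + 1.1,
    rng.rayleigh(size=50) / 10.0 + 0.1,
    rng.standard_normal(50) / 8.0 + 0.1,
])

# counts[i] is the number of samples between edges[i] and edges[i + 1];
# the last bin also includes its right edge.
counts, edges = np.histogram(x, bins=20)

print(x.size)        # total number of samples in the mixture
print(counts.sum())  # every sample lands in exactly one bin
```

The bar heights in a 1D histogram plot are these counts, and the bar boundaries are the edges.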

The distribution of one variable looks like:
And the distribution of the other variable looks like:

We can view the distribution of both variables with a 2D histogram. The color corresponds to the number of points in a cell. Here, cells with lighter colors have more points.
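The cell counts behind a 2D histogram can be computed without any plotting library. This is a rough sketch using numpy.histogram2d on stand-in data; the seed, the sample size of 200, and the 10-by-10 grid are assumptions made for illustration, and Plotly picks its own binning.

```python
import numpy as np

# Stand-in data: two independent normal variables (illustrative only).
rng = np.random.default_rng(0)
x = rng.standard_normal(200) / 5.0 + 0.5
y = rng.standard_normal(200) / 5.0 + 0.5

# counts[i, j] is the number of points whose x falls in x-bin i
# and whose y falls in y-bin j.
counts, xedges, yedges = np.histogram2d(x, y, bins=10)

# Locate the fullest cell, which a heatmap would draw in its extreme color.
i, j = np.unravel_index(np.argmax(counts), counts.shape)

print(counts.shape)  # one entry per grid cell
print(counts.sum())  # total equals the number of points
print(xedges[i], xedges[i + 1], yedges[j], yedges[j + 1])  # bounds of the fullest cell
```

Each entry of counts corresponds to one colored cell in the 2D histogram.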


These plots were made with Plotly (https://plot.ly) inside their Python sandbox.
They are interactive: click-drag to zoom, double-click to autoscale, shift-click to pan.

The following code generated these plots:

from numpy import *
## place the data into Plotly's dict format

# histograms
histx = {'x': x, 'type':'histogramx'}
histy = {'y': y, 'type':'histogramy'}
hist2d = {'x': x, 'y': y, 'type':'histogram2d'}

# scatter plots above the 1D histograms
# "jitter" the scatter plot points to make their distribution easier to distinguish
jitterx = {'x': x, 'y': 60+3*random.rand(len(x)), 'type':'scatter','mode':'markers','marker':{'size':4,'opacity':0.5,'symbol':'square'}}

jittery = {'x': y, 'y': 35+3*random.rand(len(y)), 'type':'scatter','mode':'markers','marker':{'size':4,'opacity':0.5,'symbol':'square'}}

# scatter points in the 2D histogram
xy = {'x': x, 'y': y, 'type':'scatter','mode':'markers','marker':{'size':5,'opacity':0.5,'symbol':'square'}}

# plot() is not defined in this listing; it is presumably supplied by the Plotly sandbox
plot([histx, jitterx], layout={'title': 'Distribution of Variable 1'})
plot([histy, jittery], layout={'title': 'Distribution of Variable 2'})
plot([hist2d,xy], layout={'title': 'Distribution of Variable 1 and Variable 2'})
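The "jitter" used for the scatter points above can be isolated into a small helper. This is only a sketch: jitter_band is a hypothetical name introduced here, not a Plotly or NumPy function, and the sample values are made up. The idea is that points sharing the same data value get random vertical offsets within a narrow band, so they no longer sit on top of each other.

```python
import numpy as np

def jitter_band(n, base, height, rng):
    """Return n values spread uniformly over [base, base + height).

    Hypothetical helper for illustration; it mirrors the
    60 + 3*random.rand(len(x)) expression used for jitterx.
    """
    return base + height * rng.random(n)

rng = np.random.default_rng(0)  # arbitrary seed, for repeatability

# Three identical values would overlap exactly if drawn on one line.
values = np.array([0.5, 0.5, 0.5, 0.9])
offsets = jitter_band(len(values), base=60.0, height=3.0, rng=rng)

# Trace dict in the same shape as the ones in the listing above.
trace = {'x': values, 'y': offsets, 'type': 'scatter', 'mode': 'markers'}
print(trace['y'])
```

The base and height only decide where the band of markers sits relative to the histogram bars; they carry no statistical meaning.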
