No Pylab Thanks

Please Stop using Pylab

[Edit] thanks to Randy Olson for sending me a PR with grammar and spelling corrections.

TL;DR, Please stop advising people to use the pylab flag when using IPython. It is harmful. If you want to help IPython, try to avoid retweeting, and promoting things that use the pylab flag and make the authors aware of the issues.

Use explicit import, and %matplotlib inline magic. It is better (and supports switching between inline/not inline.)

This was mainly prompted by a day where I came across consecutive issues due to the pylab flag, and people happy to discover it.

Pylab, my worst friend

When IPython 1.0 was released almost 6 months ago, we had quite a few decisions to make in less than a week and loads of discussions at the last moment. One of those decisions comes and goes often: we want to get rid of this stupid pylab flag. If you look at our stable (1.0) dev doc and examples there shouldn't be a mention of the pylab flag anywhere. If there is, we will quickly remove it. Why? Because it is harmful, first to us, then to new users, to the Python community, and finally to research in general.

Do you know about pylab? If you don't, please be extra careful and try to avoid it as much as possible. It is like smoking: it looks cool at first, then after a few years you realise you can live without it, but it makes you sick.

You think you know what pylab does? Really? Take few seconds to think what the pylab flag is doing, then read the rest.

What is it supposed to do?

The pylab package main purpose was to build a transition tool from other languages to Python. As it was more and more common and painful to do the same import in IPython every time when only the IPython shell was around, the --pylab flag was added. Basically, it did the following:

import numpy
import matplotlib
from matplotlib import pylab, mlab, pyplot
np = numpy
plt = pyplot

from IPython.core.pylabtools import figsize, getfigs

from pylab import *
from numpy import *

Did you get it right the first time without cheating ? Are you able to say what has been imported in from pylab import * ? in from numpy import * ?

Of course, that is not the only thing it does. But was it your intention to do that the last time you used pylab?

It is irreversible

Once you activate pylab mode, there is no going back. You cannot unimport things. Of course, with the %pylab magic you can always restart your kernel, but with the flag, all your kernels will start in pylab mode. Are you sure you will not need a non-pylab kernel?

Unclear

When using the pylab flag, your audience usually has no way of knowing that you used the flag. If I use plot(range(10)), what will happen? Will it pop up a figure ? Inline it? Or throw an error?

In [ ]:
plot(range(10))

Because I'm mean, I won't execute the cell, so that you don't know whether or not I'm running in pylab mode or not. You might think it's not a big deal, but in teaching and research it is important to communicate exactly what you are doing.

It pollutes the namespace

In [1]:
len(get_ipython().user_ns.keys())
Out[1]:
20
In [2]:
%pylab
Using matplotlib backend: Agg
Populating the interactive namespace from numpy and matplotlib

IPython 2.0-dev gives some warning, and will tell you wether it clobbers already non-built-in variables in the user namespace.

In [3]:
len(get_ipython().user_ns.keys())
Out[3]:
968

Can you really tell me the 948 additional things you now have in your namespace? Not a big deal? Really?

It replaces built-ins

In [6]:
sum,all
Out[6]:
(<function numpy.core.fromnumeric.sum>, <function numpy.core.fromnumeric.all>)

Both used to be <function sum>,<function all> before using pylab, and this changes the behavior of your programs!

For example, it leads to non-pickable object in some cases:

@treycausey on Twitter

@ogrisel @myusuf3 Ah, very curious. It loads fine in IPython but not in an IPython notebook. I must have a namespace collision or something.

Me:

Let me guess: $ipython notebook --pylab? But $ipython alone?

Other unrelated actions

IPython developed and added the rich display protocol, so we added from IPython.display import display to what import pylab does.

Because matplotlib can work with different event loops, pylab added options to activate qt/gtk/osx/etc. event loops, and with the QtConsole and Notebook creation, the ability to select the inline backend, which registers display hooks.

The main things users remember, and often the main message that you can read here and there is that:

You need to use Pylab to get inline figures.

Which is false. Inline pylab mode is a convenient method to activate inline, and set up a display hook with matplotlib. You do not need pylab to have inline images, nor do you need matplotlib. With 1.0 and above the recommended way to set up inline figures would be to use %matplotlib.

Side Effects

The worst part is it has the side effect of making people use pylab without even knowing it. Later on, they start asking question about how to reuse a graph in matplotlib because no one ever encounters the following code any more:

fig,ax = subplots(1,1)
ax.plot(...)

or why their script suddenly does not work in plain Python but works in IPython.

It kills kittens, with a spoon...

Maybe not kittens, but it will force developers to explain the same things, again, and again, and again...

In [8]:
from IPython.display import YouTubeVideo
YouTubeVideo('9VDvgL58h_Y',start=4*60+27)
Out[8]:

Make --pylab disappear.

Of course, we will not remove the --pylab flag for compatibility reasons, but please, for 2014, make the following resolutions:

  • Use %matplotlib [inline|qt|osx|gtx] to select the right backend/event hook.
  • Use explicit imports
  • Make people aware of the issues
  • Help by making no new articles/tutorials that mention --pylab

If really you only have 10 seconds left, the %pylab magic is fine.

As usual, this has been written in an IPython Notebook. You can send me pull request if I made mistakes, or if you have any additional remarks on why you shouldn't use --pylab.

Matplotlib and IPython love child

This is an experiement of how one could try to improve matplotlib configuration system using IPython's traitlets system; It has probably a huge number of disatvantages going from slowing down all the matplotlib stack, as making every class name matter in backward compatibility.

Far from beeing a perfect lovechild of two awesome project, this is for now an horible mutant that at some point inspect up it's statck to find information about it's caller in some places. It woudl need a fair amount of work to be nicely integrated into matplotlib.

Warning

This post has been written with a patched version of Matplotlib, so you will not be able to reproduce this post by re-executing this notebook.

I've also ripped out part of IPython configurable sytem into a small self contained package that you would need.

TL,DR;

Here is an example where the color of any Text object of matplotlib has a configurable color as long as the object that creates it is (also) an Artist, with minimal modification of matplotlib.

In [1]:
%pylab inline
from IPConfigurable.configurable import Config
matplotlib.rc('font',size=20)
Populating the interactive namespace from numpy and matplotlib

Here is the interesting part where one cas see that everything is magically configurable.

In [2]:
matplotlib.config = Config()

# by default all Text are now purple
matplotlib.config.Text.t_color = 'purple'

# Except Text created by X/Y Axes will  red/aqua
matplotlib.config.YAxis.Text.t_color='red'
matplotlib.config.XAxis.Text.t_color='aqua'

# If this is the text of a Tick it should be orange 
matplotlib.config.Tick.Text.t_color='orange'

# unless this is an XTick, then it shoudl be Gray-ish
# as (XTick <: Tick) it will have precedence over Tick
matplotlib.config.XTick.Text.t_color=(0.4,0.4,0.3)

## legend
matplotlib.config.TextArea.Text.t_color='y'
matplotlib.config.AxesSubplot.Text.t_color='pink'
In [3]:
plt.plot(sinc(arange(0,3*np.pi,0.1)),'green', label='a sinc')
plt.ylabel('sinc(x)')
plt.xlabel('This is X')
plt.title('Title')
plt.annotate('Max',(20,0.2))
plt.legend()
Out[3]:
<matplotlib.legend.Legend at 0x1050cc910>

I love Matplotlib

I love Matplotlib and what you can do with it, I am always impressed by how Jake Van Der Plas is able to bend Matpltolib in dooing amazing things. That beeing said, default color for matplotlib graphics are not that nice, mainly because of legacy, and even if Matplotlib is hightly configurable, many libraries are trying to fix it and receipt on the net are common.

IPython configuration is magic

I'm not that familiar with Matplotlib internal, but I'm quite familiar with IPython's internal. In particular, using a lightweight version of Enthought Traits we call Traitlets, almost every pieces of IPython is configurable.

According to Clarke's third law

Any sufficiently advanced technology is indistinguishable from magic.

So I'll assume that IPython configuration system is magic. Still there is some rule you shoudl know.

In IPython, any object that inherit from Configurable can have attributes that are configurable. The name of the configuration attribute that will allow to change the value of this attribute are easy, it's Class.attribute = Value, and if the creator of an object took care of passing a reference to itself, you can nest config as ParentClass.Class.attribute = value, by dooing so only Class created by ParentClass will have value set. With a dummy example.

class Foo(Configurable):

    length = Integer(1,config=True)
    ...

class Bar(Configurable):

    def __init__(self):
        foo = Foo(parent=self)

class Rem(Bar):
    pass

every Foo object length can be configured with Foo.length=2 or you can target a subset of foo by setting Rem.Foo.length or Bar.Foo.lenght.

But this might be a little abstarct, let's do a demo with matplotlib

In [4]:
cd ~/matplotlib/
/Users/bussonniermatthias/matplotlib

let's make matplotlib Artist an IPython Configurable, grab default config from matplotlib.config if it exist, and pass it to parrent.

-class Artist(object):
+class Artist(Configurable):

-    def __init__(self):
+    def __init__(self, config=None, parent=None):
+        
+        c = getattr(matplotlib,'config',Config({}))
+        if config :
+            c.merge(config)
+        super(Artist, self).__init__(config=c, parent=parent)

Now we will define 2 attributes of Patches (a subclass of Artist) ; t_color, t_lw that are respectively a Color or a Float and set the default color of current Patch to this attribute.

 class Patch(artist.Artist):

+    t_color = MaybeColor(None,config=True)
+    t_lw = MaybeFloat(None,config=True)
+
     ...
         if linewidth is None:
-            linewidth = mpl.rcParams['patch.linewidth']
+            if self.t_lw is not None:
+                linewidth = self.t_lw
+            else:
+                linewidth = mpl.rcParams['patch.linewidth']
         ...
         if color is None:
-            color = mpl.rcParams['patch.facecolor']
+            if self.t_color is not None:
+                color = self.t_color
+            else :
+                color = mpl.rcParams['patch.facecolor']

One could also set _t_color_default to mpl.rcParams['patch.facecolor'] but it becommes complicaed for the explanation

That's enough

This is the minimum viable to have this to work we can know magically configure independently any Subclass of Patches

We know that Wedge, Ellipse,... and other are part of this category, so let's play with their t_color

In [5]:
# some minimal imports
import matplotlib.pyplot as plt;

import numpy as np
import matplotlib.path as mpath
import matplotlib.lines as mlines
import matplotlib.patches as mpatches
from matplotlib.collections import PatchCollection
In [6]:
matplotlib.config = Config({
                            "Wedge"         :{"t_color":"0.4"},
                            "Ellipse"       :{"t_color":(0.9, 0.3, 0.7)},
                            "Circle"        :{"t_color":'red'},
                            "Arrow"         :{"t_color":'green'},
                            "RegularPolygon":{"t_color":'aqua'},
                            "FancyBboxPatch":{"t_color":'y'},
                            })

Let's see what this gives :

In [7]:
"""
example derived from 
http://matplotlib.org/examples/shapes_and_collections/artist_reference.html
"""
fig, ax = plt.subplots()
grid = np.mgrid[0.2:0.8:3j, 0.2:0.8:3j].reshape(2, -1).T

patches = []
patches.append(mpatches.Circle(grid[0], 0.1,ec="none"))
patches.append(mpatches.Rectangle(grid[1] - [0.025, 0.05], 0.05, 0.1, ec="none"))
patches.append(mpatches.Wedge(grid[2], 0.1, 30, 270, ec="none"))
patches.append(mpatches.RegularPolygon(grid[3], 5, 0.1))
patches.append(mpatches.Ellipse(grid[4], 0.2, 0.1))
patches.append(mpatches.Arrow(grid[5, 0]-0.05, grid[5, 1]-0.05, 0.1, 0.1, width=0.1))
patches.append(mpatches.FancyBboxPatch(
        grid[7] - [0.025, 0.05], 0.05, 0.1,
        boxstyle=mpatches.BoxStyle("Round", pad=0.02)))
collection = PatchCollection(patches, match_original=True)
ax.add_collection(collection)

plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
plt.axis('equal')
plt.axis('off')
plt.show()

It works !!! Isn't that great ? Free configuration for all Artists ; of course as long as you don't explicitely set the color, or course.

Let's be ugly.

We need slightly more to have nested configuration, each Configurable have to be passed the parent keyword, but Matplotlib is not made to pass the parent keyword to every Artist it creates, this prevent the use of nested configuration. Still using inspect, we can try to get a handle on the parent, by walking up the stack.

adding the following in Artist constructor:

import inspect 
def __init__(self, config=None, parent=None):
    i_parent = inspect.currentframe().f_back.f_back.f_locals.get('self',None)
    if (i_parent is not self) and (parent is not i_parent) :
        if (isinstance(i_parent,Configurable)):
            parent = i_parent
    ....

let's patch Text to also accept a t_color configurable, bacause Text is a good candidate for nesting configurability:

 class Text(Artist):
+    t_color = MaybeColor(None,config=True)

         if color is None:
-            color = rcParams['text.color']
+            if self.t_color is not None:
+                color = self.t_color
+            else :
+                color = rcParams['text.color']
+        if self.t_color is not None:
+            color = self.t_color
+
         if fontproperties is None:
             fontproperties = FontProperties()

Now we shoudl be able to make default Text always purple, nice things about Config object is that once created they accept acces of any attribute with dot notation.

In [8]:
matplotlib.config.Text.t_color = 'purple'
In [9]:
fig,ax = plt.subplots(1,1)
plt.plot(sinc(arange(0,6,0.1)))
plt.ylabel('SinC(x)')
plt.title('SinC of X')
Out[9]:
<matplotlib.text.Text at 0x10512e0d0>

Ok, not much further than current matplotlib configuratin right ?

We also know that XAxis and Yaxis inherit from Axis which itself inherit from Artist. Both are responsible from creating the x- and y-label

In [10]:
matplotlib.config.YAxis.Text.t_color='r'
matplotlib.config.YAxis.Text.t_color='aqua'

same goes for Tick, XTick and YTicks. I can of course set a parameter to the root class:

In [11]:
matplotlib.config.Tick.Text.t_color='orange'

and overwrite it for a specific subclass:

In [12]:
matplotlib.config.XTick.Text.t_color='gray'
In [13]:
fig,ax = plt.subplots(1,1)
plt.plot(sinc(arange(0,6,0.1)))
plt.ylabel('SinC(x)')
plt.title('SinC of X')
Out[13]:
<matplotlib.text.Text at 0x10527d610>

This is, as far as I know not possible to do with current matplotlib confuguration system. At least not without adding a rc-param for each and every imaginable combinaison.

What more ?

First thing it that this make it trivial for external library to plug into matplotlib configuration system to have their own defaults/configurable defaults.

You also can of course refine configurability by use small no-op class that inherit from base classes and give them meaning. Especially right now, Ticks are separated in XTick and YTick with a major/minor attribute. They shoudl probably be refactor into MajorTick/MinorTick. With that you can mix and match configuration from the most global Axis.Tick.value=... to the more precise YAxis.MinorTick.

Let's do an example with a custom artist that create 2 kinds of circles. We'll need a custom no-op class that inherit Circle.

In [14]:
class CenterCircle(mpatches.Circle):
   pass
In [15]:
matplotlib.config = Config()
matplotlib.config.Circle.t_color='red'
matplotlib.config.CenterCircle.t_color='aqua'
In [16]:
from IPConfigurable.configurable import Configurable
import math

class MyGenArtist(Configurable):
    
    def n_circle(self, x,y,r,n=3):
        pi = math.pi
        sin,cos = math.sin, math.cos
        l= []
        for i in range(n):
            l.append(mpatches.Circle(  ## here Circle
                                (x+2*r*cos(i*2*pi/n),y+2*r*sin(i*2*pi/n)),
                                 r,
                                 ec="none",
                                ))
        l.append(CenterCircle((x,y),r))  ## Here CenterCircle
        return l
        
        


fig, ax = plt.subplots()
patches = []
patches.extend(MyGenArtist().n_circle(4,1,0.5))
patches.extend(MyGenArtist().n_circle(2,4,0.5,n=6))
patches.extend(MyGenArtist().n_circle(1,1,0.5,n=5))
collection = PatchCollection(patches, match_original=True)
ax.add_collection(collection)
plt.subplots_adjust(left=0, right=1, bottom=0, top=1)
plt.axis('equal')
plt.axis('off')
Out[16]:
(-1.0, 6.0, -1.0, 6.0)

What's next ?

This configuration system is, of course not limited to Matplotib. But to use it it should probably be better decoupled into a separated package first, independently of IPython.

Also if it is ever accepted into matplotlib, there will still be a need to adapt current mechnisme to work on top of this.

Bonus

This patches version of matplotlib keep track of all the Configurable it discovers while you use it. Here is a non exaustive list.

In [17]:
from matplotlib import artist
print "-------------------------"
print "Some single configurables"
print "-------------------------"
for k in  sorted(artist.Artist.s):
    print k
print ""
print "----------------------------------"
print "Some possible nested configurables"
print "----------------------------------;"
for k in sorted(artist.Artist.ps):
    print k
-------------------------
Some single configurables
-------------------------
Annotation
Arrow
AxesSubplot
CenterCircle
Circle
DrawingArea
Ellipse
FancyBboxPatch
Figure
HPacker
Legend
Line2D
PatchCollection
Rectangle
RegularPolygon
Spine
Text
TextArea
VPacker
Wedge
XAxis
XTick
YAxis
YTick

----------------------------------
Some possible nested configurables
----------------------------------;
AxesSubplot.Legend
AxesSubplot.Text
AxesSubplot.XAxis
AxesSubplot.YAxis
TextArea.Text
XAxis.Text
XTick.Text
YAxis.Text
YTick.Text

Dear DiffLib

Dear difflib, What The F@&#! ?

TL,DR:

Acording to difflib, the following strings: 'baaaaa', 'aaabaa' and 'aaaaba' share 5 caracters with 'aaaaaa', but 'aabaaa' share only 3 .... but only if the change is in the first half of the string.

In [1]:
import difflib
print sum(x.size for x in difflib.SequenceMatcher(None, '-aaaaa', 'aaaaaa').get_matching_blocks())
print sum(x.size for x in difflib.SequenceMatcher(None, 'a-aaaa', 'aaaaaa').get_matching_blocks())
print sum(x.size for x in difflib.SequenceMatcher(None, 'aa-aaa', 'aaaaaa').get_matching_blocks())
print sum(x.size for x in difflib.SequenceMatcher(None, 'aaa-aa', 'aaaaaa').get_matching_blocks())
print '- Control -'
print sum(x.size for x in difflib.SequenceMatcher(None, 'aaaaaa', 'aaaaaa').get_matching_blocks())
5
4
3
5
- Control -
6

It only get weirder and more fractal if you read the rest

Context

A few weeks back I was in EuroSciPy, where I had to dive a little deeper than usual into difflib. Indead, one often requested feature in IPython is to be able to diff notebooks, so I started looking at how this can be done. Thanks to @MinRK for helping me figuring out the rest of this post and give me some nices ideas of graphs.

Naturaly I turned myself toward difflib :

This module provides classes and functions for comparing sequences. It can be used for example, for comparing files, and can produce difference information in various formats, including HTML and context and unified diffs. For comparing directories and files, see also, the filecmp module.

More especially, I am interested in SequenceMatcher which aims to be :

[...] a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable.

It is also explicitely state that generated diff might not be minimal, but might looks more "human readable" than classical diff (emphasis mine).

The basic algorithm [..] is a little fancier than,[...] “gestalt pattern matching.” The idea is to find the longest contiguous matching subsequence that contains no “junk” elements [...]. The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.

To do that, we need to define "junk" which are things you don't want the algorithme to match on. Let see a common example in python where when you add a function with a derorator, the decorator is often added to the "next" function.

In [2]:
s1 = """
@decorate
def fun1():
    return 1
    
@decorate
def fun3():
    return 3
"""

s2 = """
@decorate
def fun1():
    return 1
    
@decorate
def fun2():
    return 2
    
@decorate
def fun3():
    return 3
"""

Classical diff

In [3]:
import difflib
print ''.join(difflib.ndiff(s1.splitlines(1), s2.splitlines(1)))
  
  @decorate
  def fun1():
      return 1
      
  @decorate
+ def fun2():
+     return 2
+     
+ @decorate
  def fun3():
      return 3

Now we will tell it that blank line are junk:

In [4]:
blankline = lambda x:x.strip() ==''
print ''.join(difflib.ndiff(s1.splitlines(1), s2.splitlines(1), linejunk=blankline))
  
  @decorate
  def fun1():
      return 1
      
+ @decorate
+ def fun2():
+     return 2
+     
  @decorate
  def fun3():
      return 3

This is clearly better, as hunk do not have blank lines in the midlle, but more on sides.

Where it gets weird

Things is, SequenceMatcher is also the method that help you get the proximity between two strings. If the sequence you pass to SequenceMatcher is a string, it will try to match each caracters.

In [5]:
from difflib import SequenceMatcher
In [6]:
print SequenceMatcher('hello world','hello world').ratio()
print SequenceMatcher('xyz','abc').ratio()
0.0
0.0

Oh, sorry, you need to explicitelty, pass the isjunk function. Hopefully, it accepts None.

In [7]:
print SequenceMatcher(None, 'hello world','hello world').ratio()
print SequenceMatcher(None, 'xyz','abc').ratio()
1.0
0.0

Ok, that's better, it goes from 0.0 for completely different strings (nothing in common) to 1.0 for perfectly matching strings.

API is weird, but for compatibility we keep the old one... fine with me

let's try longer...

In [8]:
print SequenceMatcher(None, 'y'+'abc'*150,'x'+'abc'*150).ratio()
0.0

Still, don't pass it strings longer than 200 char, it automatically detect junk... Yes this is documented, but not obvisous.

In [9]:
print SequenceMatcher(None, 'y'+'abc'*150,'x'+'abc'*150, autojunk=False).ratio()
0.9977827051

So let's define a custom SequenceMatcher that make sens for the rest of the post, so that isjunkis None by default, and no autojunk for n > 200, and a simple ratio(a, b) method as a shortcut.

In [10]:
def MySequenceMatcher( seq1, seq2, isjunk=None, autojunk=False):
    return SequenceMatcher(isjunk, seq1, seq2, autojunk=autojunk)

def ratio(a,b):
    return MySequenceMatcher(a,b).ratio()
In [10]:
print ratio('abc','abc')
print ratio('abcd','abcf')
1.0
0.75

Where it gets weirder

I'll probably won't go into how we found out the following, but for some reason ratio(a,b) was different from ratio(reversed(a),reversed(b)), that is to say ratio('ab...yz','ab...yz') != ratio('zy...ba','zy...ba') in some case.

So let's look at what happend if we compare a string against itself, with only one modification, as a function as the position of the modification, that is to say the ratio of aaaa... vs baaa... vs abaa... vs aaba...

In [11]:
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
In [12]:
n = 100
step = 1
s1 = u'a'*n
r = lambda x : ratio(x, s1)

def modified(i):
    """ return the string s1, where the i-th is replaced by -"""
    return s1[:i-1]+u'-'+s1[i+1:]

xx = range(1, n, step)


print(s1[:10]),'...'
for i in range(10):
    print(modified(i)[:10]),'...'
print('..............')
distance = map( r, [modified(i) for i in xx])

plt.plot(xx, distance)
plt.ylabel('similarity')
plt.xlabel('position of modified caracter')
aaaaaaaaaa ...
aaaaaaaaaa ...
-aaaaaaaaa ...
a-aaaaaaaa ...
aa-aaaaaaa ...
aaa-aaaaaa ...
aaaa-aaaaa ...
aaaaa-aaaa ...
aaaaaa-aaa ...
aaaaaaa-aa ...
aaaaaaaa-a ...
..............
Out[12]:
<matplotlib.text.Text at 0x1037590d0>

WWWHHHHAAAAAT ???

Hum..let's look with some other string, basically same as before, but with repeating sequence ababab.., abcabcabcabc...

In [13]:
import string
In [14]:
def test_rep(k=1, n=128):
    s1 = string.letters[:k]*int(n/k*2)
    s1 = s1[:n]
    r = lambda x : ratio(x, s1)
    
    def modified(i):
        """ return the string s1, where the i-th is replaced by -"""
        return s1[:i-1]+u'-'+s1[i+1:]
    
    xx = range(1, n, step)
    distance = map( r, [modified(i) for i in xx])
    return xx,distance

fig,ax = plt.subplots(1,1)
fig.set_figwidth(16)
fig.set_figheight(8)

for k in [1,2,4,8,16,32]:
    xx, distance = test_rep(k, n=64)
    plt.step(xx, distance, where='post', label='k={k}'.format(k=k))
    

plt.ylabel('similarity')
plt.xlabel('position of modified caracter')
plt.legend(loc=4)
Out[14]:
<matplotlib.legend.Legend at 0x103b97ed0>

(bottom of the graph is at 0.5, not 0.0)

Huummm... this definitively does not look at what I was expecting (mainly something constant), but looks more like a Sierpinski triangle to me.

Bottom line, even if difflib make some pretty looking diff, I cannot trust it for giving me proximity of two sequences.

Still Some good sides...

Anyway, I knew what levenstein distance was, but not really what the efficient algorithme where. I played around a little bit with them, and recoded them in pure python. I'll post the link soon, just time to clean things up im make them as pure python module. Below is how it looks like.

Thing is, as far as I can tell, libdiff stays an order of magnitude faster in some cases, as well as computes the matches at the same time.

In [15]:
def lcs_len3(Seq1 , Seq2):
    """ Compute the LCS len 2 sequences
    
    Do not calculate the matrix and try to be as efficient as possible 
    in storing only the minimal ammount of elelment in memory, mainly the previous
    matrix row + 1 element.
    """
    LL1 = len(Seq1)+1
    LL2 = len(Seq2)+1

    ## we will do the big loop over the longest sequence (L1)
    ## and store the previous row of the matrix (L2+1)
    if LL2 > LL1 : 
        Seq2, Seq1 = Seq1, Seq2
        LL2, LL1 = LL1, LL2

    
    previousrow = [0]*(LL2)
    cindex = 0

    for Seq1ii in Seq1:
        for jj in range(1,LL2):
            cindex = (cindex+1) % LL2

            if Seq1ii == Seq2[jj-1]:
                if jj == 1:
                    previousrow[cindex] = 1
                else :
                    previousrow[cindex]+=1
            if Seq1ii != Seq2[jj-1] :
                up = previousrow[(cindex+1) % LL2]
                
                if jj != 1 :
                    left = previousrow[(cindex-1) % LL2]
                    if left > up :
                        previousrow[cindex] = left
                        continue    
                previousrow[cindex] = up


    return previousrow[cindex]
In [16]:
import numpy as np
import difflib
In [17]:
def compare(s1,s2):
    m0 = difflib.SequenceMatcher(None, s1, s2, autojunk=False).get_matching_blocks()
    m1 = lcs_len3(s1,s2)
    a,b = sum([x.size for x in m0]),m1
    return a,b 

for k in range(10):
    s1 = np.random.randint(0,250, 100)
    s2 = np.random.randint(0,250, 100)
    a,b = compare(s1,s2)
    print 'random',u'√' if a == b else 'x',a,b
    a,b = compare(sorted(s1),sorted(s2))
    print 'sorted',u'√' if a == b else 'x',a,b
    
random x 4 8
sorted √ 23 23
random x 4 9
sorted √ 26 26
random x 6 10
sorted √ 25 25
random x 4 12
sorted √ 27 27
random x 5 7
sorted √ 27 27
random x 7 10
sorted √ 30 30
random x 6 13
sorted √ 33 33
random x 9 10
sorted √ 25 25
random x 2 7
sorted √ 28 28
random x 5 10
sorted √ 27 27

Except on the sorted case, SequenceMatcher from stdlib give the wrong matches length almost all the time (sometime it is right)

In [18]:
%timeit difflib.SequenceMatcher(None, s1, s2, autojunk=False).get_matching_blocks()
%timeit lcs_len3(s1,s2)
print '--------- sorted ----------'
s1.sort()
s2.sort()
%timeit difflib.SequenceMatcher(None, s1, s2, autojunk=False).get_matching_blocks()
%timeit lcs_len3(s1,s2)
1000 loops, best of 3: 402 µs per loop
100 loops, best of 3: 9.38 ms per loop
--------- sorted ----------
1000 loops, best of 3: 734 µs per loop
100 loops, best of 3: 9.33 ms per loop

On both sorted and unsorted arrays, SequenceMatcher is way faster (10 to 20 times) than I am, but wrong.

In [19]:
s1 = 'a'*251
i = 124
s2 = 'a'*(i)+'b'+'a'*(250-i)
%timeit MySequenceMatcher(s1, s2).get_matching_blocks()        
%timeit lcs_len3(s1,s2)
10 loops, best of 3: 26 ms per loop
10 loops, best of 3: 30.8 ms per loop
In [20]:
sum(x.size for x in MySequenceMatcher(s1, s2).get_matching_blocks())
Out[20]:
126

But in the cases where SequenceMatcher is clearly wrong for the optimal sequence, we are around the same time to execute.

Not that I don't like current python libdiff, but I hope to be able to build a lower lever library (pure python of course) without any surprise behavior, on which maby one can rebuild the current libdiff and avoid to rely on regular expression to parse the output of another function.

I'll be happy to get any tips to make it faster, of course it is not possible in all the cases but I'm sure we can figure it out.


As usual comments, typo and PRs welcommed on the repo that host thoses notebooks. I should probably learn how to use Pelican now that it support notebooks.


Edit: At the request of some person, I opened a place to comment on this post until I turn it into a real blogpost.

07-the-sound-of-hydrogen

The sound Of Hydrogen

Inspired by minutephysics, and the explanation do do it in mathematica: The sound of hydrogen.

The goal of this notebook is to show how one can play a sound file in notebook, using Html5 <audio> tag to play it dirrectly inside the browser.

To do this we use the spectrum of hydrogen that we shift the into the audible range. You can listen to it in the last cell of this notebook. Wait a few second if you are on nbviewer, the notebook is not light (someone to update it to use mp3? ogg? or a compressed format?)

Please be aware that the html5 player is not working on some old browser and IE.

[Edit] apparently the process of converting something not audible into sound have diffent naming: sonify, sonification or auralizing according to the comment on twitter.

In [1]:
%matplotlib inline
In [2]:
import scipy.constants as const
import numpy as np
import scipy

from matplotlib.pyplot import plot
from scipy.io import wavfile
from IPython.core.display import HTML, display
from __future__ import division
In [3]:
# this is a wrapper that take a filename and publish an html <audio> tag to listen to it
# 
# note that this is not needed anymore in recent version of IPython and SciPy.
import sys
if sys.version_info < (3,):
    from StringIO import StringIO as BytesIO
else:
    from io import BytesIO
import base64

def wavPlayer(data, rate):
    """ will display html 5 player for compatible browser

    The browser need to know how to play wav through html5.

    there is no autoplay to prevent file playing when the browser opens
    
    Adapted from SciPy.io.
    """
    
    buffer = BytesIO()
    buffer.write(b'RIFF')
    buffer.write(b'\x00\x00\x00\x00')
    buffer.write(b'WAVE')

    buffer.write(b'fmt ')
    if data.ndim == 1:
        noc = 1
    else:
        noc = data.shape[1]
    bits = data.dtype.itemsize * 8
    sbytes = rate*(bits // 8)*noc
    ba = noc * (bits // 8)
    buffer.write(struct.pack('<ihHIIHH', 16, 1, noc, rate, sbytes, ba, bits))

    # data chunk
    buffer.write(b'data')
    buffer.write(struct.pack('<i', data.nbytes))

    if data.dtype.byteorder == '>' or (data.dtype.byteorder == '=' and sys.byteorder == 'big'):
        data = data.byteswap()

    buffer.write(data.tostring())
#    return buffer.getvalue()
    # Determine file size and place it in correct
    #  position at start of the file.
    size = buffer.tell()
    buffer.seek(4)
    buffer.write(struct.pack('<i', size-8))
    
    val = buffer.getvalue()
    
    src = """
    <head>
    <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
    <title>Simple Test</title>
    </head>
    
    <body>
    <audio controls="controls" style="width:600px" >
      <source controls src="data:audio/wav;base64,{base64}" type="audio/wav" />
      Your browser does not support the audio element.
    </audio>
    </body>
    """.format(base64=base64.encodestring(val))
    display(HTML(src))

Update notes, the above is not need anymore as IPython now provide the "Audio" object.

In [4]:
try:
    from IPython.display import Audio
    def wavPlayer(data, rate):
        display(Audio(data, rate=rate))
except ImportError:
    pass
In [5]:
## some consstant for our audio file 
from numpy import sin, pi
rate = 44100 #44.1 khz
duration =2.5 # in sec

# this will give us sin with the righ amplitude to use with wav files
normedsin = lambda f,t : 2**13*sin(2*pi*f*t)

time = np.linspace(0,duration, num=rate*duration)

Test the wav Player

let's try to first just play an A (440 Hz).

In [6]:
from __future__ import division, print_function, absolute_import

import numpy
import struct
import warnings    
In [7]:
# define A as a 440 Hz sin function 
la    = lambda t : normedsin(440,t)

# look at it on the first 25 ms
plot(time[0:1000], la(time)[0:1000])

ampl = la(time).astype(np.int16)

# write the file on disk, and show in in a Html 5 audio player
wavPlayer(ampl, rate)

The differents frequencies emmited by an hydrogen atom is given by the rydberg formulae :

$$ {1 \over \lambda} = R \left({1\over n_1}-{1\over n_2}\right) $$

Which gives a similar relation on the emitted frequencies of the Hydrogen :

$$ f_{n,m}={c \over \lambda} = {R_h\over h} \left({1\over n}-{1\over m}\right) $$

for $n=1$ we've got the Lyman series, and for $n=2$ we have the Balmer series

In [8]:
# fondamental frequency of hydrogen
f0 = const.Rydberg*const.c
print("The highest frequency of hydrogen is ",f0,"Hz. and correspond to n = 1, m = ∞")
fshift = 440
print("we can shift the spectrum for it to be at 440 Hz (A)")
The highest frequency of hydrogen is  3289841960355208.5 Hz. and correspond to n = 1, m = ∞
we can shift the spectrum for it to be at 440 Hz (A)
In [9]:
ryd = lambda n,m : fshift*(1/(n**2) -1/(m**2))
flyman = lambda x : ryd(1,x)
fbalmer = lambda x : ryd(2,x)
In [10]:
## define the sum, 
ser = lambda t : sum( [normedsin(flyman(i),t)+normedsin(fbalmer(i+1),t) for i in range(2,8)])

# and a verorialized function to work on a by element basis with matlab
serv = scipy.vectorize(ser)
In [11]:
ss = serv(time)
In [12]:
plot(time,ss)
ss = 2**15*ss/ ss.max()
In [13]:
#wavfile.write('hydrogen.wav', rate, ss.astype(np.int16))
wavPlayer(ss.astype(np.int16),rate)

As always, Pull-Request and comment to fix typo and any other things are welcommed on github.

06-NBconvert-Doc-Draft

How to Use NBConvert

NBconvert migration

NBconvert has now been merged into IPython itself. You will need IPython 1.0 or above to have this works (asuuming the API have not changed)

Intro

In this post I will introduce you to the programatic API of nbconvert to show you how to use it in various context.

For this I will use one of @jakevdp great blog post. I've explicitely chosen a post with no javascript tricks as Jake seem to be found of right now, for the reason that the becommings of embeding javascript in nbviewer, which is based on nbconvert is not fully decided yet.

This will not focus on using the command line tool to convert file. The attentive reader will point-out that no data are read from, or written to disk during the conversion process. Indeed, nbconvert as been though as much as possible to avoid IO operation and work as well in a database, or web-based environement.

Quick overview

The main principle of nbconvert is to instanciate a Exporter that controle a pipeline through which each notebook you want to export with go through.

Let's start by importing what we need from the API, and download @jakevdp's notebook.

In [1]:
import requests
response = requests.get('http://jakevdp.github.com/downloads/notebooks/XKCD_plots.ipynb')
response.content[0:60]+'...'
Out[1]:
'{\n "metadata": {\n  "name": "XKCD_plots"\n },\n "nbformat": 3,\n...'

We read the response into a slightly more convenient format which represent IPython notebook. There are not real advantages for now, except some convenient methods, but with time this structure should be able to guarantee that the notebook structure is valid.

In [2]:
from IPython.nbformat import current as nbformat
jake_notebook = nbformat.reads_json(response.content)
jake_notebook.worksheets[0].cells[0]
Out[2]:
{u'cell_type': u'heading',
 u'level': 1,
 u'metadata': {},
 u'source': u'XKCD plots in Matplotlib'}

So we have here Jake's notebook in a convenient for, which is mainly a Super-Powered dict and list nested. You don't need to worry about the exact structure.

The nbconvert API exposes some basic exporter for common format and default options. We will start by using one of them. First we import it, instanciate an instance with all the defautl parameters and fed it the downloaded notebook.

In [3]:
import IPython.nbconvert
In [6]:
from IPython.config import Config
from IPython.nbconvert import HTMLExporter

## I use basic here to have less boilerplate and headers in the HTML.
## we'll see later how to pass config to exporters.
exportHtml = HTMLExporter(config=Config({'HTMLExporter':{'default_template':'basic'}}))
In [7]:
(body,resources) = exportHtml.from_notebook_node(jake_notebook)

The exporter returns a tuple containing the body of the converted notebook, here raw HTML, as well as a resources dict. The resource dict contains (among many things) the extracted PNG, JPG [...etc] from the notebook when applicable. The basic HTML exporter does keep them as embeded base64 into the notebook, but one can do ask the figures to be extracted. Cf advance use. So for now the resource dict should be mostly empty, except for 1 key containing some css, and 2 others whose content will be obvious.

Exporter are stateless, you won't be able to extract any usefull information (except their configuration) from them. You can directly re-use the instance to convert another notebook. Each exporter expose for convenience a from_file and from_filename methods if you need.

In [6]:
print resources.keys()
print resources['metadata']
print resources['output_extension']
# print resources['inlining'] # too lng to be shown
['inlining', 'output_extension', 'metadata']
defaultdict(None, {'name': 'Notebook'})
html
In [7]:
# Part of the body, here the first Heading
start = body.index('<h1 id', )
print body[:400]+'...'
<div class="text_cell_render border-box-sizing rendered_html">
<h1 id="XKCD-plots-in-Matplotlib">XKCD plots in Matplotlib<a class="anchor-link" href="#XKCD-plots-in-Matplotlib">&#182;</a></h1>
</div>

<div class="text_cell_render border-box-sizing rendered_html">
<p>This notebook originally appeared as a blog post at <a href="http://jakevdp.github.com/blog/2012/10/07/xkcd-style-plots-in-matplotli...

You can directly write the body into an HTML file if you wish, as you see it does not contains any body tag, or style declaration, but thoses are included in the default HtmlExporter if you do not pass it a config object as I did.

Extracting Figures

When exporting one might want to extract the base64 encoded figures to separate files, this is by default what does the RstExporter does, let see how to use it.

In [8]:
from IPython.nbconvert import RSTExporter

rst_export = RSTExporter()

(body,resources) = rst_export.from_notebook_node(jake_notebook)
In [9]:
print body[:970]+'...'
print '[.....]'
print body[800:1200]+'...'
XKCD plots in Matplotlib
========================


This notebook originally appeared as a blog post at `Pythonic
Perambulations <http://jakevdp.github.com/blog/2012/10/07/xkcd-style-plots-in-matplotlib/>`_
by Jake Vanderplas.

 *Update: the matplotlib pull request has been merged! See* `*This
post* <http://jakevdp.github.io/blog/2013/07/10/XKCD-plots-in-matplotlib/>`_
*for a description of the XKCD functionality now built-in to
matplotlib!*

One of the problems I've had with typical matplotlib figures is that
everything in them is so precise, so perfect. For an example of what I
mean, take a look at this figure:
In[1]:
.. code:: python

    from IPython.display import Image
    Image('http://jakevdp.github.com/figures/xkcd_version.png')





.. image:: output_3_0.png



Sometimes when showing schematic plots, this is the type of figure I
want to display. But drawing it by hand is a pain: I'd rather just use
matplotlib. The problem is, matplotlib is a bit...
[.....]
owing schematic plots, this is the type of figure I
want to display. But drawing it by hand is a pain: I'd rather just use
matplotlib. The problem is, matplotlib is a bit too precise. Attempting
to duplicate this figure in matplotlib leads to something like this:
In[2]:
.. code:: python

    Image('http://jakevdp.github.com/figures/mpl_version.png')





.. image:: output_5_0.png



It just doesn'...

Here we see that base64 images are not embeded, but we get what look like file name. Actually those are (Configurable) keys to get back the binary data from the resources dict we havent inspected earlier.

So when writing a Rst Plugin for any blogengine, Sphinx or anything else, you will be responsible for writing all those data to disk, in the right place. Of course to help you in this task all those naming are configurable in the right place.

let's try to see how to get one of these images

In [10]:
resources['outputs'].keys()
Out[10]:
[u'output_13_1.text',
 u'output_18_0.text',
 u'output_3_0.text',
 u'output_18_1.png',
 u'output_12_0.text',
 u'output_5_0.text',
 u'output_5_0.png',
 u'output_13_1.png',
 u'output_16_0.text',
 u'output_13_0.text',
 u'output_18_1.text',
 u'output_3_0.png',
 u'output_16_0.png']

We have extracted 5 binary figures, here pngs, but they could have been svg, and then wouldn't appear in the binary sub dict. keep in mind that a object having multiple repr will store all it's repr in the notebook.

Hence if you provide _repr_javascript_,_repr_latex_ and _repr_png_to an object, you will be able to determine at conversion time which representaition is the more appropriate. You could even decide to show all the representaition of an object, it's up to you. But this will require beeing a little more involve and write a few line of Jinja template. This will probably be the subject of another tutorial.

Back to our images,

In [11]:
from IPython.display import Image
Image(data=resources['outputs']['output_3_0.png'],format='png')
Out[11]:

Yep, this is indeed the image we were expecting, and I was able to see it without ever writing or reading it from disk. I don't think I'll have to show to you what to do with those data, as if you are here you are most probably familiar with IO.

Extracting figures with HTML Exporter ?

Use case:

I write an awesome blog in HTML, and I want all but having base64 embeded images. Having one html file with all inside is nice to send to coworker, but I definitively want resources to be cached ! So I need an HTML exporter, and I want it to extract the figures !

Some theory

The process of converting a notebook to a another format with the nbconvert Exporters happend in a few steps:

  • Get the notebook data and other required files. (you are responsible for that)
  • Feed them to the exporter that will
    • sequentially feed the data to a number of Transformers. Transformer only act on the structure of the notebook, and have access to it all.
    • feed the notebook through the jinja templating engine
      • the use templates are configurable.
      • templates make use of configurable macros called filters.
  • The exporter return the converted notebook as well as other relevant resources as a tuple.
  • Write what you need to disk, or elsewhere. (You are responsible for it)

Here we'll be interested in the Transformers. Each Transformer is applied successively and in order on the notebook before going through the conversion process.

We provide some transformer that do some modification on the notebook structure by default. One of them, the ExtractOutputTransformer is responsible for crawling notebook, finding all the figures, and put them into the resources directory, as well as choosing the key (filename_xx_y.extension) that can replace the figure in the template.

The ExtractOutputTransformer is special in the fact that it should be availlable on all Exporters, but is just inactive by default on some exporter.

In [12]:
# second transformer shoudl be Instance of ExtractFigureTransformer
exportHtml._transformers # 3rd one shouel be <ExtractOutputTransformer>
Out[12]:
[<function IPython.nbconvert.transformers.coalescestreams.wrappedfunc>,
 <IPython.nbconvert.transformers.svg2pdf.SVG2PDFTransformer at 0x10c203e90>,
 <IPython.nbconvert.transformers.extractoutput.ExtractOutputTransformer at 0x10c20e410>,
 <IPython.nbconvert.transformers.csshtmlheader.CSSHTMLHeaderTransformer at 0x10c20e490>,
 <IPython.nbconvert.transformers.revealhelp.RevealHelpTransformer at 0x10c1cbf10>,
 <IPython.nbconvert.transformers.latex.LatexTransformer at 0x10c203550>,
 <IPython.nbconvert.transformers.sphinx.SphinxTransformer at 0x10c203690>]

To enable it we will use IPython configuration/Traitlets system. If you are have already set some IPython configuration options, this will look pretty familiar to you. Configuration option are always of the form:

ClassName.attribute_name = value

A few ways exist to create such config, like reading a config file in your profile, but you can also do it programatically usign a dictionary. Let's create such a config object, and see the difference if we pass it to our HtmlExporter

In [13]:
from IPython.config import Config

c =  Config({
            'ExtractOutputTransformer':{'enabled':True}
            })

exportHtml = HTMLExporter()
exportHtml_and_figs = HTMLExporter(config=c)

(_, resources)          = exportHtml.from_notebook_node(jake_notebook)
(_, resources_with_fig) = exportHtml_and_figs.from_notebook_node(jake_notebook)

print 'resources without the "figures" key :'
print resources.keys()

print ''
print 'Here we have one more field '
print resources_with_fig.keys()
resources_with_fig['outputs'].keys() 
resources without the "figures" key :
['inlining', 'output_extension', 'metadata']

Here we have one more field 
['outputs', 'inlining', 'output_extension', 'metadata']
Out[13]:
[u'output_13_1.text',
 u'output_18_0.text',
 u'output_3_0.text',
 u'output_18_1.png',
 u'output_12_0.text',
 u'output_5_0.text',
 u'output_5_0.png',
 u'output_13_1.png',
 u'output_16_0.text',
 u'output_13_0.text',
 u'output_18_1.text',
 u'output_3_0.png',
 u'output_16_0.png']

So now you can loop through the dict and write all those figures to disk in the right place...

Custom transformer

Of course you can imagine many transformation that you would like to apply to a notebook. This is one of the reason we provide a way to register your own transformers that will be applied to the notebook after the default ones.

To do so you'll have to pass an ordered list of Transformers to the Exporter constructor.

But what is an transformer ? Transformer can be either decorated function for dead-simple Transformers that apply independently to each cell, for more advance transformation that support configurability You have to inherit from Transformer and define a call method as we'll see below.

All transforers have a magic attribute that allows it to be activated/disactivate from the config dict.

In [14]:
from IPython.nbconvert.transformers import Transformer
import IPython.config
print "Four relevant docstring"
print '============================='
print Transformer.__doc__
print '============================='
print Transformer.call.__doc__
print '============================='
print Transformer.transform_cell.__doc__
print '============================='
Four relevant docstring
=============================
 A configurable transformer

    Inherit from this class if you wish to have configurability for your
    transformer.

    Any configurable traitlets this class exposed will be configurable in profiles
    using c.SubClassName.atribute=value

    you can overwrite transform_cell to apply a transformation independently on each cell
    or __call__ if you prefer your own logic. See corresponding docstring for informations.

    Disabled by default and can be enabled via the config by
        'c.YourTransformerName.enabled = True'
    
=============================

        Transformation to apply on each notebook.
        
        You should return modified nb, resources.
        If you wish to apply your transform on each cell, you might want to 
        overwrite transform_cell method instead.
        
        Parameters
        ----------
        nb : NotebookNode
            Notebook being converted
        resources : dictionary
            Additional resources used in the conversion process.  Allows
            transformers to pass variables into the Jinja engine.
        
=============================

        Overwrite if you want to apply a transformation on each cell.  You 
        should return modified cell and resource dictionary.
        
        Parameters
        ----------
        cell : NotebookNode cell
            Notebook cell being processed
        resources : dictionary
            Additional resources used in the conversion process.  Allows
            transformers to pass variables into the Jinja engine.
        index : int
            Index of the cell being processed
        
=============================

We don't provide convenient method to be aplied on each worksheet as the data structure for worksheet will be removed. (not the worksheet functionnality, which is still on it's way)


Example

I'll now demonstrate a specific example requested while nbconvert 2 was beeing developped. The ability to exclude cell from the conversion process based on their index.

I'll let you imagin how to inject cell, if what you just want is to happend static content at the beginning/end of a notebook, plese refer to templating section, it will be much easier and cleaner.

In [15]:
from IPython.utils.traitlets import Integer
In [16]:
class PelicanSubCell(Transformer):
    """A Pelican specific transformer to remove somme of the cells of a notebook"""
    
    # I could also read the cells from nbc.metadata.pelican is someone wrote a JS extension
    # But I'll stay with configurable value. 
    start = Integer(0, config=True, help="first cell of notebook to be converted")
    end   = Integer(-1,