05-YAML Notebook


YAML IPython notebook

Little experiment based on the fact that YAML is apparently meant to be more readable by humans than JSON. We've also had some complaints that metadata are not kept by nbconvert when roundtripping through markdown; those two things made me think I could try to see what ipynb files stored as YAML would look like.

I'll also use this post to experiment with future nbviewer features, so if you see anything wrong with the CSS on some device, please tell me.

First attempt

Apparently JSON is a subset of YAML:

cp foo.ipynb foo.ipyamlnb

Yeah, mission accomplished!
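
A minimal sketch (not from the original post), using the PyYAML loader we'll install in the next section, to check that the renamed file really parses as YAML:

import yaml

# JSON documents are, for practical purposes, valid YAML, so the renamed
# .ipynb file should load as-is with a YAML parser.
with open('foo.ipyamlnb') as f:
    nb_as_yaml = yaml.safe_load(f)

print(nb_as_yaml['nbformat'])  # the notebook format version, e.g. 3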

Second try

Install PyYAML, and see what we can do.

In [42]:
import json
import yaml
In [43]:
from IPython.nbformat import current as nbf
In [44]:
ls Y*.ipynb
YAML Notebook.ipynb
In [45]:
with open('YAML Notebook.ipynb') as f:
    nbook = nbf.read( f, 'json')
In [46]:
nbook.worksheets[0].cells[9]
Out[46]:
{u'cell_type': u'code',
 u'collapsed': False,
 u'input': u'from IPython.nbformat import current as nbf',
 u'language': u'python',
 u'metadata': {},
 u'outputs': []}

I'll skip the fiddling around with the YAML converter. In short, you have to explicitly mark the parts you want dumped in literal form, otherwise they are exported as lists of strings, which is a little painful to edit afterward. I'm using the safe_dump and safe_load methods (or passing SafeLoader/SafeDumper to load/dump). Those should be the default; otherwise you could deserialize arbitrary objects and have code executed.

We probably don't want to reproduce the critical Rails YAML vulnerability that happened not so long ago.
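
To make the risk concrete, here is a small illustration (not in the original notebook): safe_load refuses YAML tags that would instantiate arbitrary Python objects, whereas an unsafe loader would happily build them.

# Hypothetical payload: a python/object/apply tag would call os.system
# if this document were loaded with an unsafe loader.
payload = "!!python/object/apply:os.system ['echo pwned']"

try:
    yaml.safe_load(payload)
except yaml.YAMLError as e:
    # safe_load has no constructor for python/* tags, so it refuses to build the object
    print(e)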

In [47]:
# we'll patch a safe Yaml Dumper
sd = yaml.SafeDumper

# Dummy class, just to mark the part we want with custom dumping
class folded_unicode(unicode): pass
class literal_unicode(unicode): pass

I know class names should be capitalized, but we want to hide the fact that these are classes from the end user. At the same time I define a folded variant to use with markdown cells: when markdown contains really long lines, those will be wrapped in the YAML document.

In [48]:
def folded_unicode_representer(dumper, data):
    return dumper.represent_scalar(u'tag:yaml.org,2002:str', data, style='>')
def literal_unicode_representer(dumper, data):
    return dumper.represent_scalar(u'tag:yaml.org,2002:str', data, style='|')

sd.add_representer(folded_unicode, folded_unicode_representer)
sd.add_representer(literal_unicode, literal_unicode_representer)


with open('YAML Notebook.ipynb') as f:
    nbjson =  json.load(f)
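
As a quick hedged check (not part of the original notebook), we can dump one scalar of each wrapper type through the patched dumper to see the two block styles:

# literal style: one YAML line per source line, whitespace preserved
print(yaml.dump(literal_unicode(u"import json\nimport yaml\n"), Dumper=sd))

# folded style: a single long line should be wrapped by the dumper,
# which is handy for markdown prose
long_md = folded_unicode(u"This markdown cell has a single fairly long line "
                         u"that the folded style should wrap onto several lines.")
print(yaml.dump(long_md, Dumper=sd))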

Now we patch the parts of the ipynb file that we know we want to be literal or folded.

In [49]:
for tcell in nbjson['worksheets'][0]['cells']:
    if 'source' in tcell.keys():
        tcell['source'] = folded_unicode("".join(tcell['source']))
    if 'input' in tcell.keys():
        tcell['input'] = literal_unicode("".join(tcell['input']))
In [50]:
with open('Yaml.ipymlnb','w') as f:
    f.write(yaml.dump(nbjson, default_flow_style=False, Dumper=sd))

You can round-trip it to JSON, and it's still a valid ipynb file that can be loaded. I haven't fiddled with it much more. There are just a few gotchas with empty lines and trailing whitespace at the end of lines, which can respectively disappear or make the dumper fall back to quoted-string style to store the values.
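
A minimal sketch of that round trip (hypothetical output file name, not from the original notebook):

# Load the YAML flavour back, re-serialize it as JSON, and check that
# the notebook reader still accepts the result.
with open('Yaml.ipymlnb') as f:
    nb_dict = yaml.safe_load(f)

with open('roundtrip.ipynb', 'w') as f:
    json.dump(nb_dict, f)

with open('roundtrip.ipynb') as f:
    nb_again = nbf.read(f, 'json')

print(nb_again.worksheets[0].cells[0].cell_type)  # should print 'heading'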

You can skip down to the end of this notebook to see how it looks. It's probably more compact than the current JSON we emit, and in some cases it might be easier to read, but I don't think it is worth considering for the format specification.

ipynb files are meant to be fixable by humans, and I strongly prefer having a consistent format with simple rules over having to explain the meaning of shenanigans like : |2+ for a literal string.

Also, YAML support across languages is not consistent, and it would probably be too much of a security burden for all the code that supports loading ipynb files to take care of sanitizing YAML.

One area where I would use it is to describe the ipynb format in a talk, for example, and/or to make metadata editing more human readable/writable.

In [51]:
!cat Yaml.ipymlnb
metadata:
  name: YAML Notebook
nbformat: 3
nbformat_minor: 0
worksheets:
- cells:
  - cell_type: heading
    level: 1
    metadata: {}
    source: >-
      YAML IPython notebook
  - cell_type: markdown
    metadata: {}
    source: "Little experiment base on the fact that apparently YAML is made to be\
      \ better readable by Humans than JSON.\nWe've also had some complaint that metadata\
      \ are not keep in nbconvert when roundtripping through markdown, those two\n\
      made me think that I could try to see what ipynb files stored as YAML would\
      \ look like. "
  - cell_type: heading
    level: 4
    metadata: {}
    source: >-
      First atempt
  - cell_type: markdown
    metadata: {}
    source: >-
      Apparently Json is a subset of YAML:
  - cell_type: markdown
    metadata: {}
    source: >2+
          cp foo.ipynb foo.ipyamlnb


  - cell_type: markdown
    metadata: {}
    source: >-
      Yeah, Mission acomplished !
  - cell_type: heading
    level: 4
    metadata: {}
    source: >-
      Second try
  - cell_type: markdown
    metadata: {}
    source: "Install PyYaml, and see what we can do. "
  - cell_type: code
    collapsed: false
    input: |-
      import json
      import yaml
    language: python
    metadata: {}
    outputs: []
  - cell_type: code
    collapsed: false
    input: |-
      from IPython.nbformat import current as nbf
    language: python
    metadata: {}
    outputs: []
  - cell_type: code
    collapsed: false
    input: |-
      ls Y*.ipynb
    language: python
    metadata: {}
    outputs: []
  - cell_type: code
    collapsed: false
    input: |-
      with open('YAML Notebook.ipynb') as f:
          nbook = nbf.read( f, 'json')
    language: python
    metadata: {}
    outputs: []
  - cell_type: code
    collapsed: false
    input: |-
      nbook.worksheets[0].cells[9]
    language: python
    metadata: {}
    outputs: []
  - cell_type: markdown
    metadata: {}
    source: >-
      I'll skipp the fiddling around with the yaml converter. In short, you have to
      specify explicitely the part you want to dump in the literal form, otherwise
      they are exported as list of strings, which is a little painfull to edit afterward.
      I'm using the `safe_dump` and `safe_load` methods (or pass safeLoader and Dumper).
      Those should be default or otherwise you could unserialise arbitrary object,
      and have code exucuted.


      We probably don't want to reproduct the recent file Rail's critical vulnerability
      that append not so long ago.
  - cell_type: code
    collapsed: false
    input: |-
      # we'll patch a safe Yaml Dumper
      sd = yaml.SafeDumper

      # Dummy class, just to mark the part we want with custom dumping
      class folded_unicode(unicode): pass
      class literal_unicode(unicode): pass
    language: python
    metadata: {}
    outputs: []
  - cell_type: markdown
    metadata: {}
    source: >-
      I know classes should be wit upper case, but we just want to hide the fact that
      thoses a class to end user. At the same time I define a folded method if I want
      to use it later.
  - cell_type: code
    collapsed: false
    input: |-
      def folded_unicode_representer(dumper, data):
          return dumper.represent_scalar(u'tag:yaml.org,2002:str', data, style='>')
      def literal_unicode_representer(dumper, data):
          return dumper.represent_scalar(u'tag:yaml.org,2002:str', data, style='|')

      sd.add_representer(folded_unicode, folded_unicode_representer)
      sd.add_representer(literal_unicode, literal_unicode_representer)


      with open('YAML Notebook.ipynb') as f:
          nbjson =  json.load(f)
    language: python
    metadata: {}
    outputs: []
  - cell_type: markdown
    metadata: {}
    source: >-
      now we patch the part of the ipynb file we know we want to be literal or folded
  - cell_type: code
    collapsed: false
    input: |-
      for tcell in nbjson['worksheets'][0]['cells']:
          if 'source' in tcell.keys():
              tcell['source'] = folded_unicode("".join(tcell['source']))
          if 'input' in tcell.keys():
              tcell['input'] = literal_unicode("".join(tcell['input']))
    language: python
    metadata: {}
    outputs: []
  - cell_type: code
    collapsed: false
    input: |-
      with open('Yaml.ipymlnb','w') as f:
          f.write(yaml.dump(nbjson, default_flow_style=False, Dumper=sd))
    language: python
    metadata: {}
    outputs: []
  - cell_type: markdown
    metadata: {}
    source: >-
      You can round trip it to json, and it's still a valid ipynb file that can be
      loaded. Haven't fiddled with it much more.

      There are just a few gotchas with empty lines as well as trailing whitespace
      at EOL that can respectively diseapear or make the dumper fall back to a string
      quoted methods to store values.


      One could also try to tiker with `folded_unicode` in markdown cell that tipically
      have long lines to play a little more nicely with VCS.
  - cell_type: markdown
    metadata: {}
    source: >-
      You can skip down to the end of this notebook to loko at how it looks like.
      It's probably much compact than the current json we emit, in **some** cases
      it might be more easy to read, but I don't think it is worth considering using
      in the format specification.


      ipynb files are ment to be humanely fixable, and I strongly prefere having a
      consistent format with simple rules than having to explain what are the meaning
      of the differents shenigan like `: |2+` for literal string.


      Also support across languages are not consistent, and it would probably be too
      much of a security burden for all code that will support loading ipynb to take
      care of sanitazing Yaml.


      One area where I woudl use it would be to describe the ipynb format at a talk
      for example, and/or to have metadata editing more human readable/writable.
  - cell_type: code
    collapsed: false
    input: |-
      !cat Yaml.ipymlnb
    language: python
    metadata: {}
    outputs: []
  metadata: {}