Using IPython notebooks under version control

What is a good strategy for keeping IPython notebooks under version control?

The notebook format is quite amenable for version control: if one wants to version control the notebook and the outputs then this works quite well. The annoyance comes when one wants only to version control the input, excluding the cell outputs (aka. "build products") which can be large binary blobs, especially for movies and plots. In particular, I am trying to find a good workflow that:

allows me to choose between including or excluding output,
prevents me from accidentally committing output if I do not want it,
allows me to keep output in my local version,
allows me to see when I have changes in the inputs using my version control system (i.e. if I only version control the inputs but my local file has outputs, then I would like to be able to see if the inputs have changed (requiring a commit). Using the version control status command will always register a difference since the local file has outputs.)
allows me to update my working notebook (which contains the output) from an updated clean notebook. (update)

As mentioned, if I chose to include the outputs (which is desirable when using nbviewer for example), then everything is fine. The problem is when I do not want to version control the output. There are some tools and scripts for stripping the output of the notebook, but frequently I encounter the following issues:

I accidentally commit a version with the the output, thereby polluting my repository.
I clear output to use version control, but would really rather keep the output in my local copy (sometimes it takes a while to reproduce for example).
Some of the scripts that strip output change the format slightly compared to the Cell/All Output/Clear menu option, thereby creating unwanted noise in the diffs. This is resolved by some of the answers.
When pulling changes to a clean version of the file, I need to find some way of incorporating those changes in my working notebook without having to rerun everything. (update)

I have considered several options that I shall discuss below, but have yet to find a good comprehensive solution. A full solution might require some changes to IPython, or may rely on some simple external scripts. I currently use mercurial, but would like a solution that also works with git: an ideal solution would be version-control agnostic.

This issue has been discussed many times, but there is no definitive or clear solution from the user's perspective. The answer to this question should provide the definitive strategy. It is fine if it requires a recent (even development) version of IPython or an easily installed extension.

Update: I have been playing with my modified notebook version which optionally saves a .clean version with every save using Gregory Crosswhite's suggestions. This satisfies most of my constraints but leaves the following unresolved:

This is not yet a standard solution (requires a modification of the ipython source. Is there a way of achieving this behaviour with a simple extension? Needs some sort of on-save hook.
A problem I have with the current workflow is pulling changes. These will come in to the .clean file, and then need to be integrated somehow into my working version. (Of course, I can always re-execute the notebook, but this can be a pain, especially if some of the results depend on long calculations, parallel computations, etc.) I do not have a good idea about how to resolve this yet. Perhaps a workflow involving an extension like ipycache might work, but that seems a little too complicated.

Notes

Removing (stripping) Output

When the notebook is running, one can use the Cell/All Output/Clear menu option for removing the output.
There are some scripts for removing output, such as the script nbstripout.py which remove the output, but does not produce the same output as using the notebook interface. This was eventually included in the ipython/nbconvert repo, but this has been closed stating that the changes are now included in ipython/ipython,but the corresponding functionality seems not to have been included yet. (update) That being said, Gregory Crosswhite's solution shows that this is pretty easy to do, even without invoking ipython/nbconvert, so this approach is probably workable if it can be properly hooked in. (Attaching it to each version control system, however, does not seem like a good idea ? this should somehow hook in to the notebook mechanism.)

Newsgroups

Thoughts on the notebook format for version control.

Issues

977: Notebook feature requests (Open).
1280: Clear-all on save option (Open). (Follows from this discussion.)
3295: autoexported notebooks: only export explicitly marked cells (Closed). Resolved by extension 11 Add writeandexecute magic (Merged).

Pull Requests

1621: clear In[] prompt numbers on "Clear All Output" (Merged). (See also 2519 (Merged).)
1563: clear_output improvements (Merged).
3065: diff-ability of notebooks (Closed).
3291: Add the option to skip output cells when saving. (Closed). This seems extremely relevant, however was closed with the suggestion to use a "clean/smudge" filter. A relevant question what can you use if you want to strip off output before running git diff? seems not to have been answered.
3312: WIP: Notebook save hooks (Closed).
3747: ipynb -> ipynb transformer (Closed). This is rebased in 4175.
4175: nbconvert: Jinjaless exporter base (Merged).
142: Use STDIN in nbstripout if no input is given (Open).

Answer by Gregory Crosswhite for Using IPython notebooks under version control

Unfortunately, I do not know much about Mercurial, but I can give you a possible solution that works with Git, in the hopes that you might be able to translate my Git commands into their Mercurial equivalents.

For background, in Git the add command stores the changes that have been made to a file into a staging area. Once you have done this, any subsequent changes to the file are ignored by Git unless you tell it to stage them as well. Hence, the following script, which, for each of the given files, strips out all of the outputs and prompt_number sections, stages the stripped file, and then restores the original:

NOTE: If running this gets you an error message like ImportError: No module named IPython.nbformat, then use ipython to run the script instead of python.

from IPython.nbformat import current  import io  from os import remove, rename  from shutil import copyfile  from subprocess import Popen  from sys import argv    for filename in argv[1:]:      # Backup the current file      backup_filename = filename + ".backup"      copyfile(filename,backup_filename)        try:          # Read in the notebook          with io.open(filename,'r',encoding='utf-8') as f:              notebook = current.reads(f.read(),format="ipynb")            # Strip out all of the output and prompt_number sections          for worksheet in notebook["worksheets"]:              for cell in worksheet["cells"]:                 cell.outputs = []                 if "prompt_number" in cell:                      del cell["prompt_number"]            # Write the stripped file          with io.open(filename, 'w', encoding='utf-8') as f:              current.write(notebook,f,format='ipynb')            # Run git add to stage the non-output changes          print("git add",filename)          Popen(["git","add",filename]).wait()        finally:          # Restore the original file;  remove is needed in case          # we are running in windows.          remove(filename)          rename(backup_filename,filename)

Once the script has been run on the files whose changes you wanted to commit, just run git commit.

Answer by Pietro Battiston for Using IPython notebooks under version control

Here is my solution with git. It allows you to just add and commit (and diff) as usual: those operations will not alter your working tree, and at the same time (re)running a notebook will not alter your git history.

Although this can probably be adapted to other VCSs, I know it doesn't satisfy your requirements (at least the VSC agnosticity). Still, it is perfect for me, and although it's nothing particularly brilliant, and many people probably already use it, I didn't find clear instructions about how to implement it by googling around. So it may be useful to other people.

Save a file with this content somewhere (for the following, let us assume ~/bin/ipynb_output_filter.py)
Make it executable (chmod +x ~/bin/ipynb_output_filter.py)
Create the file ~/.gitattributes, with the following content
```
*.ipynb    filter=dropoutput_ipynb  
```

Run the following commands:

git config --global core.attributesfile ~/.gitattributes  git config --global filter.dropoutput_ipynb.clean ~/bin/ipynb_output_filter.py  git config --global filter.dropoutput_ipynb.smudge cat

Done!

Limitations:

it works only with git
in git, if you are in branch somebranch and you do git checkout otherbranch; git checkout somebranch, you usually expect the working tree to be unchanged. Here instead you will have lost the output and cells numbering of notebooks whose source differs between the two branches.
more in general, the output is not versioned at all, as with Gregory's solution. In order to not just throw it away every time you do anything involving a checkout, the approach could be changed by storing it in separate files (but notice that at the time the above code is run, the commit id is not known!), and possibly versioning them (but notice this would require something more than a git commit notebook_file.ipynb, although it would at least keep git diff notebook_file.ipynb free from base64 garbage).
that said, incidentally if you do pull code (i.e. committed by someone else not using this approach) which contains some output, the output is checked out normally. Only the locally produced output is lost.

My solution reflects the fact that I personally don't like to keep generated stuff versioned - notice that doing merges involving the output is almost guaranteed to invalidate the output or your productivity or both.

EDIT:

if you do adopt the solution as I suggested it - that is, globally - you will have trouble in case for some git repo you want to version output. So if you want to disable the output filtering for a specific git repository, simply create inside it a file .git/info/attributes, with

**.ipynb filter=

as content. Clearly, in the same way it is possible to do the opposite: enable the filtering only for a specific repository.

the code is now maintained in its own git repo
if the instructions above result in ImportErrors, try adding "ipython" before the path of the script:
```
git config --global filter.dropoutput_ipynb.clean ipython ~/bin/ipynb_output_filter.py  
```

Answer by Albert for Using IPython notebooks under version control

I use a very pragmatic approach; which work well for several notebooks, at several sides. And it even enables me to 'transfer' notebooks around. It works both for Windows as Unix/MacOS.
Al thought it is simple, is solve the problems above...

Concept

Basically, do not track the .ipnyb-files, only the corresponding .py-files.
By starting the notebook-server with the --script option, that file is automatically created/saved when the notebook is saved.

Those .py-files do contain all input; non-code is saved into comments, as are the cell-borders. Those file can be read/imported ( and dragged) into the notebook-server to (re)create a notebook. Only the output is gone; until it is re-run.

Personally I use mercurial to version-track the .py files; and use the normal (command-line) commands to add, check-in (ect) for that. Most other (D)VCS will allow this to.

Its simple to track the history now; the .py are small, textual and simple to diff. Once and a while, we need a clone (just branch; start a 2nd notebook-sever there), or a older version (check-it out and import into a notebook-server), etc.

Tips & tricks

Add *.ipynb to '.hgignore', so Mercurial knows it can ignore those files
Create a (bash) script to start the server (with the --script option) and do version-track it
Saving a notebook does save the .py-file, but does not check it in.
- This is a drawback: One can forget that
- It's a feature also: It possible to save a notebook (and continue later) without clustering the repository-history.

Wishes

It would be nice to have a buttons for check-in/add/etc in the notebook Dashboard
A checkout to (by example) file@date+rev.py) should be helpful It would be to much work to add that; and maybe I will do so once. Until now, I just do that by hand.

Answer by Rich Signell for Using IPython notebooks under version control

We have a collaborative project where the product is Ipython Notebooks, and we've use an approach for the last six months that is working great: we activate saving the .py files automatically and track both .ipynb files and the .py files.

That way if someone wants to view/download the latest notebook they can do that via github or nbviewer, and if someone wants to see how the the notebook code has changed, they can just look at the changes to the .py files.

For ipython 2 notebook servers, this can be accomplished by starting the server using:

ipython notebook --script

or by adding the line

c.FileNotebookManager.save_script = True

to the ipython_notebook_config.py file and restarting the notebook server.

For ipython 3 notebook servers, this can be accomplished by adding the lines

import os  from subprocess import check_call    def post_save(model, os_path, contents_manager):      """post-save hook for converting notebooks to .py scripts"""      if model['type'] != 'notebook':          return # only do this for notebooks      d, fname = os.path.split(os_path)      check_call(['ipython', 'nbconvert', '--to', 'script', fname], cwd=d)    c.FileContentsManager.post_save_hook = post_save

to the ipython_notebook_config.py file and restarting the notebook server. These lines are from a github issues answer @minrk provided and @dror includes them in his SO answer as well.

If you aren't sure in which directory to find your ipython_notebook_config.py file, you can type ipython locate profile default, and if you don't find the file there, you can create it by typing ipython profile create.

Here's our project on github that is using this approach: and here's a github example of exploring recent changes to a notebook.

We've been very happy with this.

Answer by Peter for Using IPython notebooks under version control

I did what Albert & Rich did - Don't version .ipynb files (as these can contain images, which gets messy). Instead, either always run ipython notebook --script or put c.FileNotebookManager.save_script = True in your config file, so that a (versionable) .py file is always created when you save your notebook.

To regenerate notebooks (after checking out a repo or switching a branch) I put the script py_file_to_notebooks.py in the directory where I store my notebooks.

Now, after checking out a repo, just run python py_file_to_notebooks.py to generate the ipynb files. After switching branch, you may have to run python py_file_to_notebooks.py -ov to overwrite the existing ipynb files.

Just to be on the safe side, it's good to also add *.ipynb to your .gitignore file.

Edit: I no longer do this because (A) you have to regenerate your notebooks from py files every time you checkout a branch and (B) there's other stuff like markdown in notebooks that you lose. I instead strip output from notebooks using a git filter. Discussion on how to do this is here.

Answer by Spencer Boucher for Using IPython notebooks under version control

Here is a new solution from Cyrille Rossant for IPython 3.0, which persists to markdown files rather than json-based ipymd files:

https://github.com/rossant/ipymd

Answer by Dror for Using IPython notebooks under version control

As pointed out by, the --script is deprecated in 3.x. This approach can be used by applying a post-save-hook. In particular, add the following to ipython_notebook_config.py:

import os  from subprocess import check_call    def post_save(model, os_path, contents_manager):      """post-save hook for converting notebooks to .py scripts"""      if model['type'] != 'notebook':          return # only do this for notebooks      d, fname = os.path.split(os_path)      check_call(['ipython', 'nbconvert', '--to', 'script', fname], cwd=d)    c.FileContentsManager.post_save_hook = post_save

The code is taken from #8009.

Answer by Peter for Using IPython notebooks under version control

Ok, so it looks like the current best solution, as per a discussion here, is to make a git filter to automatically strip output from ipynb files on commit.

Here's what I did to get it working (copied from that discussion):

I modified cfriedline's nbstripout file slightly to give an informative error when you can't import the latest IPython: https://github.com/petered/plato/blob/fb2f4e252f50c79768920d0e47b870a8d799e92b/notebooks/config/strip_notebook_output And added it to my repo, lets say in ./relative/path/to/strip_notebook_output

Also added the file .gitattributes file to the root of the repo, containing:

*.ipynb filter=stripoutput

And created a setup_git_filters.sh containing

git config filter.stripoutput.clean "$(git rev-parse --show-toplevel)/relative/path/to/strip_notebook_output"   git config filter.stripoutput.smudge cat  git config filter.stripoutput.required true

And ran source setup_git_filters.sh. The fancy $(git rev-parse...) thing is to find the local path of your repo on any (Unix) machine.

Answer by Guillaume Dumas for Using IPython notebooks under version control

To follow up on the excellent script by Pietro Battiston, if you get a Unicode parsing error like this:

Traceback (most recent call last):    File "/Users/kwisatz/bin/ipynb_output_filter.py", line 33, in   write(json_in, sys.stdout, NO_CONVERT)    File "/Users/kwisatz/anaconda/lib/python2.7/site-packages/IPython/nbformat/__init__.py", line 161, in write  fp.write(s)  UnicodeEncodeError: 'ascii' codec can't encode character u'\u2014' in position 11549: ordinal not in range(128)

You can add at the beginning of the script:

reload(sys)  sys.setdefaultencoding('utf8')

Answer by kynan for Using IPython notebooks under version control

I have created nbstripout, based on MinRKs gist, which supports both Git and Mercurial (thanks to mforbes). It is intended to be used either standalone on the command line or as a filter, which is easily (un)installed in the current repository via nbstripout install / nbstripout uninstall.

Get it from PyPI or simply

pip install nbstripout

Fatal error: Call to a member function getElementsByTagName() on a non-object in D:\XAMPP INSTALLASTION\xampp\htdocs\endunpratama9i\www-stackoverflow-info-proses.php on line 72

Discussion of Coding

Blog coding and discussion of coding about JavaScript, PHP, CGI, general web building etc.

Saturday, April 9, 2016

Using IPython notebooks under version control

Using IPython notebooks under version control

Notes

Removing (stripping) Output

Newsgroups

Issues

Pull Requests

Answer by Gregory Crosswhite for Using IPython notebooks under version control

Answer by Pietro Battiston for Using IPython notebooks under version control

Answer by Albert for Using IPython notebooks under version control

Concept

Tips & tricks

Wishes

Answer by Rich Signell for Using IPython notebooks under version control

Answer by Peter for Using IPython notebooks under version control

Answer by Spencer Boucher for Using IPython notebooks under version control

Answer by Dror for Using IPython notebooks under version control

Answer by Peter for Using IPython notebooks under version control

Answer by Guillaume Dumas for Using IPython notebooks under version control

Answer by kynan for Using IPython notebooks under version control

0 comments:

Post a Comment

Popular Posts

Fun Page