Posted: Mon 6th Apr 2009, 1:06am
Tags: pylons, book, restructuredtext
As many of you are aware I wrote the entire book The Definitive Guide to Pylons in reStructuredText format.
I found reStructuredText to be a pleasure to work with. It was rich enough to express all the constructs I needed and yet simple enough that it didn't get in my way. In very little time I had learnt all the syntax I needed and, with the exception of figures and tables, never really needed to look up anything in the reStructuredText documentation.
After I completed each draft it would be checked into subversion. The editors were free to add comments (although they very rarely did) and Mike Orr, the technical reviewer, was able to check out and review the drafts.
We used a convention where the comments would be written in the reStructuredText documents themselves as reStructuredText comments starting with two dots and the person's initials. If something was broken we'd use XXX to make it stand out. These rules meant that we'd have sections of markup like the one below directly embedded in the source.
.. XXX JG Personally I don't like this behaviour. I think get() and  should
.. return getone() and getall() respectively depending on whether there are
.. multiple values or not. It cases a huge headache with repeating fields and
.. FormEncode schema as we'll see in the example application.

.. XXX MO No! The application will crash if it gets a list when it's expecting
.. XXX MO a string or vice-versa. That was a major problem with the cgi module.
.. XXX MO Reword this section to explain single/multiple values more clearly.
.. XXX MO Also, you've mentioned MultiDict without explaining it.
.. XXX MO Will MultiDict also change with WebOb?
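A convention like this is easy to check mechanically. The following is only an illustrative sketch, not something we used on the book: the function name and the exact pattern (two dots, an optional XXX flag, then two capital letters as initials) are my assumptions about how the convention could be matched.

```python
import re

# Matches comment lines such as ".. XXX MO Reword this section" or
# ".. JG some note". The XXX flag is optional; the initials are assumed
# to be exactly two capital letters.
REVIEW_COMMENT = re.compile(
    r"^\.\.\s+(?P<flag>XXX\s+)?(?P<initials>[A-Z]{2})\s+(?P<text>.*)$"
)


def find_review_comments(lines):
    """Yield (line_number, initials, is_broken, text) for each review comment."""
    for number, line in enumerate(lines, start=1):
        match = REVIEW_COMMENT.match(line.strip())
        if match:
            yield (
                number,
                match.group("initials"),
                bool(match.group("flag")),
                match.group("text"),
            )
```

Continuation lines that start with two dots but carry no initials are skipped, so only the first line of an unflagged multi-line comment is reported.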
As the drafts evolved and I fixed the problems or re-worded paragraphs I'd remove the comments which were no longer necessary but I found it helpful to have the comments directly in the source.
Another beauty of using reStructuredText was that I could easily monitor changesets via graphical diffs generated by the Trac instance I installed to manage them. I was able to easily deploy the drafts on the http://pylonsbook.com website as and when they were ready and the whole approach worked very well right up to the copyediting stage.
At that point Apress required the copyediting to be done using Microsoft Word, but they kindly offered to handle the conversion to Word for me. Unfortunately this involved nothing more sophisticated than copying and pasting from the generated HTML into Word and then trying to tidy things up. Although I'm grateful I didn't have to do the conversion work myself, there were a number of problems with this approach:
Unicode characters frequently got lost or mis-interpreted
Some of the raw reStructuredText source markup ended up as text by mistake
Syntax highlighting produced by Pygments when I generated the HTML got mis-interpreted as essential markup and quite a lot of time was spent by the Apress team putting things in bold and italics unnecessarily before I noticed what was happening
A lot of the meaning contained in the carefully-constructed reStructuredText markup got lost
The Apress Word styles didn't quite match up to reStructuredText styles, for example Apress format new paragraphs and definition lists differently
After a lot of work by everyone the manuscript eventually ended up properly formatted in Word but, being a free software man myself, I don't own a copy of Microsoft Word! The closest thing I had at the time was a rather buggy OpenOffice 2.4 for Mac OS X so I had to do all the copyedit reviews using that software. When text was removed by someone it remained in place but with a line through it, and as new text was added by the copyeditor it appeared in a different colour. Although this sounds straightforward, compared with the purity and simplicity of working with reStructuredText, subversion and diff I found working in Word rather awkward, particularly with all the crossings-out everywhere.
Any comments or questions were added using Word's comment system which appeared (some of the time) to the right of the page in OpenOffice, but often didn't display correctly. Scrolling in OpenOffice frequently left artefacts on the screen which made it very difficult to work out what had changed and what hadn't and to make matters worse the whole system regularly crashed.
Despite the minor problems we managed to get the chapters through copyedit and another team had the job of taking the Word documents and producing the final PDFs for me to review. Any comments or corrections had to be made on the PDFs using Adobe Acrobat Reader's highlight and comment tools. Although these are better than the OpenOffice comment tools I found making corrections a tedious, albeit necessary, process.
Generating graphics was much easier. I simply used the Gimp <http://www.gimp.org/> on Ubuntu 8.04 to take the screenshots I needed and saved them as .tiff files which the Apress production team downloaded from the pylonsbook.com site to work with. (Incidentally I'd taken the screenshots for the first draft of the book on a Mac but Gimp for the Mac couldn't take the screenshots itself, so I had to use Apple's Grab program and then load them in Gimp. Ubuntu was much easier!)
The book also contained three diagrams which needed to be produced as vector graphics. I drew these in pencil, photographed them and emailed through the .jpg files. The production team returned them as .pdf files which I could directly import into Inkscape to convert to .png files for the HTML version on pylonsbook.com.
Once the final PDF review was complete my work, as far as the physical book was concerned, was done. But my work getting the book back into reStructuredText was just about to begin. There had been so many changes to the text over the many months of reviews that I decided it would be less effort to start from scratch than to try to re-apply all the Word and PDF changes to my original reStructuredText source. So in January Apress kindly provided me with Word documents generated from the final PDFs that had been used to generate the book.
I was left with two choices:
Save each Word document as a text file and re-apply the reStructuredText markup manually
Write a program to parse the Apress Word format and auto-generate the reStructuredText, fixing any problems by hand
Both options would require me to compare each sentence with the physical book to make sure it was correct. I started off on the manual approach but after 6 chapters I just couldn't take it any more and decided to use an automatic approach.
I started by using OpenOffice to save Chapter 18 in each format it supported, looking for one which I could use as a starting point for parsing. It was only after trying all of them and giving up that I realised .odt files are simply zip files containing XML. I'd finally found a format I could parse.
I used the OpenOffice unoconv tool to batch convert the Word files to .odt files on Ubuntu 8.10 like this:
$ unoconv -f odt 9349ch*_backout.doc
The content I needed was inside a file called content.xml so I started by extracting that file from each .odt and stripping the top and bottom before trying to parse the rest.
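Because an .odt file is an ordinary zip archive, pulling out content.xml takes only a few lines with Python's standard zipfile module. This is a minimal sketch of the extraction step, not the code from my script; the function name is made up for illustration.

```python
import zipfile


def read_content_xml(odt_path):
    """Return the raw bytes of content.xml from an OpenDocument (.odt) file."""
    with zipfile.ZipFile(odt_path) as archive:
        return archive.read("content.xml")
```

The returned bytes can be handed directly to an XML parser.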
The first approach I took was to try to write an XSLT document which, when used with lxml, would produce XML markup which could be parsed with the xml2rst.xsl file which is in the docutils SVN repository. It quickly became clear that XSLT wasn't up to the job (the document was just not neatly-formed enough) so next I looked at Python's SAX parser but, as with every other time I've tried to use a SAX parser for any real-world task, I gave up - it is too difficult to track the state of what's just happened. That left me with writing a DOM parser and after a huge amount of effort I eventually got a good enough implementation to generate the entire book. If you are interested I'd encourage you to look at it. It is called apress_odt_to_rst.py and requires the table.py module which accompanies it.
You use it like this:
$ python apress_odt_to_rst.py *.odt
and it produces .rst files for each chapter along with the stripped and partially transformed XML which it uses as a starting point. It is useful if you need to track down the markup which could not be handled by the program.
Although the tool I'd written did the best it could, there were plenty of occasions where so much had been lost in the use of Word (not to mention formatting errors, misapplied styles and other problems which don't affect how a document looks but do affect what the markup means) that it wasn't possible for the program to determine how to handle specific cases. In those circumstances the program just gives an error which has to be manually resolved later on.
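To give a flavour of the DOM approach, here is a much-simplified sketch using xml.dom.minidom. It is not the code from apress_odt_to_rst.py: it only understands top-level headings (text:h) and paragraphs (text:p), underlines every heading with "=" regardless of level, and records anything else as a problem to fix by hand. The function names are hypothetical; the namespace URIs are the standard OpenDocument ones.

```python
from xml.dom import minidom

TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"
OFFICE_NS = "urn:oasis:names:tc:opendocument:xmlns:office:1.0"


def node_text(node):
    """Concatenate all text found beneath a DOM node."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(node_text(child) for child in node.childNodes)


def odt_body_to_rst(content_xml):
    """Convert headings and paragraphs; collect everything else as problems."""
    document = minidom.parseString(content_xml)
    # <office:text> holds the document body inside content.xml.
    body = document.getElementsByTagNameNS(OFFICE_NS, "text")[0]
    output, problems = [], []
    for child in body.childNodes:
        if child.nodeType != child.ELEMENT_NODE:
            continue
        if child.namespaceURI == TEXT_NS and child.localName == "h":
            title = node_text(child).strip()
            output.extend([title, "=" * len(title), ""])
        elif child.namespaceURI == TEXT_NS and child.localName == "p":
            output.extend([node_text(child).strip(), ""])
        else:
            # Tables, lists and so on need their own handling.
            problems.append("unhandled element: " + child.tagName)
    return "\n".join(output), problems
```

Returning the problems alongside the output, rather than stopping at the first one, means a whole chapter can be converted in one pass and the failures worked through afterwards.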
Download the software as it was for the Pylons Book
Next I used Sphinx to generate the HTML. This gave a load of warnings too so for three weeks I just worked through all the errors and warnings from both tools, comparing each paragraph with the book until I finally produced the finished reStructuredText today. I also updated the text to respond to any errata and formatting errors I noticed along the way.
The reStructuredText produced isn't the prettiest you'll have ever seen. It uses tabs for indentation and has no wrapping: every line is as long as it needs to be (which is only really a problem with tables!). I don't intend to accept patches which simply re-format the reStructuredText markup, because having no artificial line breaks has one advantage: an edit never causes the text to re-flow, so changes to the text don't get lost among changes to the formatting.
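The effect is easy to demonstrate with Python's difflib. This is just an illustration with a made-up sentence and an arbitrary 30-column wrap width: it counts how many lines of the old version a line-based diff reports as removed after a single word is inserted.

```python
import difflib
import textwrap


def removed_line_count(before_lines, after_lines):
    """Count lines of the old version that a line-based diff marks as removed."""
    return sum(
        1
        for entry in difflib.ndiff(before_lines, after_lines)
        if entry.startswith("- ")
    )


before = ("The quick brown fox jumps over the lazy dog and then runs "
          "all the way home before the rain starts to fall.")
after = before.replace("The quick", "The very quick")

# One paragraph per line: the edit touches exactly one line.
unwrapped = removed_line_count([before], [after])

# Hard-wrapped at 30 columns: the extra word pushes text onto later lines.
wrapped = removed_line_count(textwrap.wrap(before, 30), textwrap.wrap(after, 30))

print(unwrapped, wrapped)
```

With the wrapped version, a one-word change shows up as several changed lines, which is exactly the noise I want to keep out of the changesets.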
To convert the images from .tiff to .png files I just installed ImageMagick and ran this command:
$ mogrify -format png *.tiff
All in all I thoroughly enjoyed working with reStructuredText. As is described in the book, it is possible to generate a PDF directly from reStructuredText using a tool called Sphinx. Although I know the styling and level of control isn't sufficient for a production book yet, I very much look forward to the day when I can write a book with production-quality formatting directly from reStructuredText without needing to jump through the Word and PDF hoops.
You can see the final book, generated from reStructuredText at http://pylonsbook.com/en/1.0/ along with the source and you can buy a hard copy at http://www.amazon.com/Definitive-Guide-Pylons/dp/1590599349. You can also view the errata and download the example code from http://apress.com/book/errata/773.
Incidentally, I noticed the first review of the book on Amazon today and was delighted the reviewer rated it 5 stars with such positive comments so hopefully all this work has paid off in helping people in the Pylons community get a deeper grasp of Pylons!