Digging into ODT file contents


I love that open source is built on open standards. One example is LibreOffice. If you aren’t familiar with LibreOffice, it has an interesting history, which I’ll describe briefly:

In the 1980s, a German company called StarDivision released Star-Writer, a word processor for the CP/M operating system, and later ported to DOS. Over the years, StarWriter (they dropped the hyphen in 1991) added more features and functionality, even providing compatibility with Microsoft Word files. In 1996, they released StarOffice 3.1, which was the first version to support Linux.

I bought StarOffice in 1997, and it was great! It allowed me to do work on my Linux machine, and remain compatible with the Microsoft Office files at the office.

In 1999, Sun Microsystems purchased StarDivision and released StarOffice for free. Later, they released it as open source software, to become OpenOffice.org. That’s where the Open Document Format (ODF) came from; it became an OASIS standard in 2005. However, after Oracle acquired Sun Microsystems (a deal announced in 2009 and completed in 2010), developers forked OpenOffice.org to become LibreOffice, supported by a foundation called The Document Foundation. LibreOffice has remained under active development since then, and maintains its open roots, including the ODF open file format.

What’s in an ODT file?

ODF comes in several “flavors”, the most common of which are: ODT for word processor files (Open Document: Text), ODS for spreadsheet files (Open Document: Spreadsheet), and ODP for presentation files (Open Document: Presentation). These are all just zip file containers with XML data and metadata. And that means we can explore them using the unzip command line tool; let’s experiment with a sample ODT file.

I created this one-line document in LibreOffice Writer and saved it as sample.odt:

[Screenshot: a one-line test file in LibreOffice Writer]

The zipinfo tool shows the internal structure of this file:

$ zipinfo sample.odt 
Archive:  sample.odt
Zip file size: 9479 bytes, number of entries: 17
-rw----     2.0 fat       39 b- stor 24-May-17 14:38 mimetype
-rw----     2.0 fat        0 b- stor 24-May-17 14:38 Configurations2/accelerator/
-rw----     2.0 fat        0 b- stor 24-May-17 14:38 Configurations2/images/Bitmaps/
-rw----     2.0 fat        0 b- stor 24-May-17 14:38 Configurations2/toolpanel/
-rw----     2.0 fat        0 b- stor 24-May-17 14:38 Configurations2/floater/
-rw----     2.0 fat        0 b- stor 24-May-17 14:38 Configurations2/statusbar/
-rw----     2.0 fat        0 b- stor 24-May-17 14:38 Configurations2/toolbar/
-rw----     2.0 fat        0 b- stor 24-May-17 14:38 Configurations2/progressbar/
-rw----     2.0 fat        0 b- stor 24-May-17 14:38 Configurations2/popupmenu/
-rw----     2.0 fat        0 b- stor 24-May-17 14:38 Configurations2/menubar/
-rw----     2.0 fat    12782 bl defN 24-May-17 14:38 styles.xml
-rw----     2.0 fat      899 bl defN 24-May-17 14:38 manifest.rdf
-rw----     2.0 fat     3878 bl defN 24-May-17 14:38 content.xml
-rw----     2.0 fat      975 bl defN 24-May-17 14:38 meta.xml
-rw----     2.0 fat    13831 bl defN 24-May-17 14:38 settings.xml
-rw----     2.0 fat     1220 b- stor 24-May-17 14:38 Thumbnails/thumbnail.png
-rw----     2.0 fat     1061 bl defN 24-May-17 14:38 META-INF/manifest.xml
17 files, 34685 bytes uncompressed, 7383 bytes compressed:  78.7%

Notice that the first file in the archive is called mimetype and is saved uncompressed (stor means the entry is “stored,” with no compression). According to the standard, mimetype must be the first file in the archive, and must be uncompressed. This allows any tool to verify that this is an ODT file by reading the start of the zip archive:

  1. The first two bytes of a zip file will be PK (because the zip file format was defined by Phil Katz at PKWare in the 1980s)
  2. Skip ahead 28 more bytes (the rest of the 30-byte zip local file header for the first entry)
  3. Find the string “mimetypeapplication/vnd.oasis.opendocument.text” which is the file name mimetype followed immediately by the one-line uncompressed contents of the mimetype file

If you are interested in programming, you can write your own program that uses this method to examine files to determine if they are valid ODT files. One such implementation might look like this:

#include <stdio.h>
#include <string.h>

char buf[47]; /* global */

int magic(FILE *in)
{
  /* read magic number */

  /* a short read (fewer than 2 bytes) cannot be a zip file */
  if (fread(buf, 1, 2, in) == 2 && strncmp(buf, "PK", 2) == 0) {
    puts("PK: this is a zip file");
    return 1; /* yes */
  }

  puts("not a zip file");
  return 0; /* no */
}

int skip28(FILE *in)
{
  /* skip 28 more bytes */

  /* 1 if all 28 bytes were read, 0 on a short read */
  return fread(buf, 1, 28, in) == 28;
}

int mimetype(FILE *in)
{
  /* read "mimetype" */

  /* compare only if all 47 bytes (file name + contents) were read */
  if (fread(buf, 1, 47, in) == 47 &&
      strncmp(buf, "mimetypeapplication/vnd.oasis.opendocument.text", 47) == 0) {
    puts("ODT: mimetype found");
    return 1; /* yes */
  }

  puts("didn't find mimetype");
  return 0; /* no */
}

void test_odt(FILE *in)
{
  if (!magic(in)) {
    return;
  }

  if (feof(in)) {
    puts("unexpected EOF");
    return;
  }

  skip28(in);

  if (feof(in)) {
    puts("unexpected EOF");
    return;
  }

  mimetype(in);

  if (feof(in)) {
    puts("unexpected EOF");
  }

  return;
}

int main(int argc, char **argv)
{
  FILE *odt;
  int i;

  for (i = 1; i < argc; i++) {
    odt = fopen(argv[i], "rb");

    if (odt) {
      puts("-----");
      puts(argv[i]);
      test_odt(odt);
      fclose(odt);
    }
    else {
      fputs("cannot open file: ", stdout);
      puts(argv[i]);
    }
  }

  return 0;
}

If I save this as testodt.c and compile it, I can demonstrate that the sample.odt file has the structure described above:

$ gcc -Wall -o testodt testodt.c

$ ./testodt sample.odt 
-----
sample.odt
PK: this is a zip file
ODT: mimetype found

The horizontal line makes it easier to see the output if you test several files at once – although I’ve only tested one file here.
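
The fixed 28-byte skip works because a zip local file header is always 30 bytes long before the file name: a 4-byte signature (PK followed by the bytes 3 and 4), then the version, flags, compression method, timestamp, checksum, and sizes, and finally a 2-byte file name length and a 2-byte extra field length. As a rough sketch of a more explicit check, the function below examines those fields directly instead of skipping over them. The name first_entry_is_mimetype and its parameters are my own invention for illustration, and it assumes the start of the file has already been read into memory:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define ZIP_LOCAL_HEADER_SIZE 30  /* fixed part of a zip local file header */

/* Read a 16-bit little-endian value from two bytes. */
static unsigned read_u16le(const unsigned char *p)
{
  return (unsigned)p[0] | ((unsigned)p[1] << 8);
}

/*
 * Return 1 if the first len bytes of a file look like a zip archive whose
 * first entry is an uncompressed file named "mimetype" that starts with
 * the string want, otherwise 0. (Hypothetical helper, not a library call.)
 */
int first_entry_is_mimetype(const unsigned char *data, size_t len,
                            const char *want)
{
  static const unsigned char sig[4] = { 'P', 'K', 3, 4 };
  size_t want_len, name_len, extra_len, body;

  assert(data != NULL && want != NULL);
  want_len = strlen(want);

  if (len < ZIP_LOCAL_HEADER_SIZE || memcmp(data, sig, 4) != 0) {
    return 0;  /* too short, or no local file header signature */
  }

  if (read_u16le(data + 8) != 0) {
    return 0;  /* compression method 0 means "stored" */
  }

  name_len = read_u16le(data + 26);   /* file name length */
  extra_len = read_u16le(data + 28);  /* extra field length */

  if (name_len != 8) {
    return 0;  /* "mimetype" is 8 characters */
  }

  /* the file contents start after the header, name, and extra field */
  body = ZIP_LOCAL_HEADER_SIZE + name_len + extra_len;

  if (len < body + want_len) {
    return 0;  /* not enough data to hold the contents */
  }

  if (memcmp(data + ZIP_LOCAL_HEADER_SIZE, "mimetype", 8) != 0) {
    return 0;
  }

  return memcmp(data + body, want, want_len) == 0;
}
```

To use it, you might fread() the first hundred or so bytes of a file into an unsigned char array and pass "application/vnd.oasis.opendocument.text" as want. Like testodt.c, this only compares the start of the contents, so it would also accept a longer media type that begins with the same text.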

Unzipping the ODT file

We can use the unzip command to extract the contents of the sample ODT file to examine it further. I’ll save my copy in a new directory called sample_odt so it’s named similarly to the sample.odt file I saved from LibreOffice Writer:

$ unzip sample.odt -d sample_odt
Archive:  sample.odt
 extracting: sample_odt/mimetype     
   creating: sample_odt/Configurations2/accelerator/
   creating: sample_odt/Configurations2/images/Bitmaps/
   creating: sample_odt/Configurations2/toolpanel/
   creating: sample_odt/Configurations2/floater/
   creating: sample_odt/Configurations2/statusbar/
   creating: sample_odt/Configurations2/toolbar/
   creating: sample_odt/Configurations2/progressbar/
   creating: sample_odt/Configurations2/popupmenu/
   creating: sample_odt/Configurations2/menubar/
  inflating: sample_odt/styles.xml   
  inflating: sample_odt/manifest.rdf  
  inflating: sample_odt/content.xml  
  inflating: sample_odt/meta.xml     
  inflating: sample_odt/settings.xml  
 extracting: sample_odt/Thumbnails/thumbnail.png  
  inflating: sample_odt/META-INF/manifest.xml  

To locate the contents of an ODT file, we need to first examine the manifest.xml file, located in the META-INF directory. This is an XML document, so it is saved as plain text, which we can display using the cat command:

$ cat sample_odt/META-INF/manifest.xml 
<?xml version="1.0" encoding="UTF-8"?>
<manifest:manifest xmlns:manifest="urn:oasis:names:tc:opendocument:xmlns:manifest:1.0" manifest:version="1.3" xmlns:loext="urn:org:documentfoundation:names:experimental:office:xmlns:loext:1.0">
 <manifest:file-entry manifest:full-path="/" manifest:version="1.3" manifest:media-type="application/vnd.oasis.opendocument.text"/>
 <manifest:file-entry manifest:full-path="Configurations2/" manifest:media-type="application/vnd.sun.xml.ui.configuration"/>
 <manifest:file-entry manifest:full-path="styles.xml" manifest:media-type="text/xml"/>
 <manifest:file-entry manifest:full-path="manifest.rdf" manifest:media-type="application/rdf+xml"/>
 <manifest:file-entry manifest:full-path="content.xml" manifest:media-type="text/xml"/>
 <manifest:file-entry manifest:full-path="meta.xml" manifest:media-type="text/xml"/>
 <manifest:file-entry manifest:full-path="settings.xml" manifest:media-type="text/xml"/>
 <manifest:file-entry manifest:full-path="Thumbnails/thumbnail.png" manifest:media-type="image/png"/>
</manifest:manifest>

This file contains the “master” metadata for the ODT file, and indicates where everything is saved. Several entries have the text/xml media type: styles.xml, meta.xml, and settings.xml hold the styles, document metadata, and application settings, while content.xml holds the text of the document. Since the content.xml entry doesn’t include a directory in its path, the file is in the “root” of the ODT file.
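
If you would rather pick out those entries with a program than by eye, here is a minimal sketch in C. It is a naive string scan, not a real XML parser: it assumes each manifest:file-entry element sits on its own line with double-quoted attributes separated by spaces, as in the output above. The function names get_attr, entry_matches, and list_entries are made up for this example:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Copy the value of the attribute attr from one XML start tag into out.
 * Returns 1 on success, 0 if the attribute is missing or does not fit.
 */
int get_attr(const char *tag, const char *attr, char *out, size_t outsz)
{
  char key[80];
  const char *start, *end;
  int n;

  assert(tag != NULL && attr != NULL && out != NULL && outsz > 0);

  /* build the search key, such as: manifest:full-path=" (with a leading space) */
  n = snprintf(key, sizeof key, " %s=\"", attr);
  if (n < 0 || (size_t)n >= sizeof key) {
    return 0;
  }

  start = strstr(tag, key);
  if (start == NULL) {
    return 0;
  }
  start += n;

  end = strchr(start, '"');  /* closing quote of the value */
  if (end == NULL || (size_t)(end - start) >= outsz) {
    return 0;
  }

  memcpy(out, start, (size_t)(end - start));
  out[end - start] = '\0';
  return 1;
}

/*
 * Return 1 and copy the entry's path if line is a manifest:file-entry
 * with exactly the given media type, otherwise return 0.
 */
int entry_matches(const char *line, const char *media_type,
                  char *path, size_t pathsz)
{
  char type[128];

  if (strstr(line, "<manifest:file-entry") == NULL) {
    return 0;
  }
  if (!get_attr(line, "manifest:media-type", type, sizeof type)) {
    return 0;
  }
  if (strcmp(type, media_type) != 0) {
    return 0;
  }
  return get_attr(line, "manifest:full-path", path, pathsz);
}

/* Print the path of every matching entry; return how many were found. */
int list_entries(FILE *in, const char *media_type)
{
  char line[1024], path[256];
  int count = 0;

  while (fgets(line, sizeof line, in) != NULL) {
    if (entry_matches(line, media_type, path, sizeof path)) {
      puts(path);
      count++;
    }
  }
  return count;
}
```

Calling list_entries() on an open sample_odt/META-INF/manifest.xml with "text/xml" should list the four XML parts of this sample: styles.xml, content.xml, meta.xml, and settings.xml. For anything beyond a quick look, a proper XML library would be the safer choice.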

Again, the content.xml file is a plain text XML file. However, it’s quite long; this sample document has one line of text, but its content.xml is over 3,800 characters long. So I don’t want to display it with cat, or I’ll fill up my screen with XML that’s hard for humans to read. Instead, let’s use xmllint with the --format option to reindent the XML and put each element on its own line:

$ xmllint --format sample_odt/content.xml 
<?xml version="1.0" encoding="UTF-8"?>
<office:document-content xmlns:css3t="http://www.w3.org/TR/css3-text/" xmlns:grddl="http://www.w3.org/2003/g/data-view#" xmlns:xhtml="http://www.w3.org/1999/xhtml" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xmlns:xforms="http://www.w3.org/2002/xforms" xmlns:dom="http://www.w3.org/2001/xml-events" xmlns:script="urn:oasis:names:tc:opendocument:xmlns:script:1.0" xmlns:form="urn:oasis:names:tc:opendocument:xmlns:form:1.0" xmlns:math="http://www.w3.org/1998/Math/MathML" xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0" xmlns:ooo="http://openoffice.org/2004/office" xmlns:fo="urn:oasis:names:tc:opendocument:xmlns:xsl-fo-compatible:1.0" xmlns:ooow="http://openoffice.org/2004/writer" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:drawooo="http://openoffice.org/2010/draw" xmlns:oooc="http://openoffice.org/2004/calc" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:calcext="urn:org:documentfoundation:names:experimental:calc:xmlns:calcext:1.0" xmlns:style="urn:oasis:names:tc:opendocument:xmlns:style:1.0" xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" xmlns:of="urn:oasis:names:tc:opendocument:xmlns:of:1.2" xmlns:tableooo="http://openoffice.org/2009/table" xmlns:draw="urn:oasis:names:tc:opendocument:xmlns:drawing:1.0" xmlns:dr3d="urn:oasis:names:tc:opendocument:xmlns:dr3d:1.0" xmlns:rpt="http://openoffice.org/2005/report" xmlns:formx="urn:openoffice:names:experimental:ooxml-odf-interop:xmlns:form:1.0" xmlns:svg="urn:oasis:names:tc:opendocument:xmlns:svg-compatible:1.0" xmlns:chart="urn:oasis:names:tc:opendocument:xmlns:chart:1.0" xmlns:officeooo="http://openoffice.org/2009/office" xmlns:table="urn:oasis:names:tc:opendocument:xmlns:table:1.0" xmlns:meta="urn:oasis:names:tc:opendocument:xmlns:meta:1.0" xmlns:loext="urn:org:documentfoundation:names:experimental:office:xmlns:loext:1.0" xmlns:number="urn:oasis:names:tc:opendocument:xmlns:datastyle:1.0" 
xmlns:field="urn:openoffice:names:experimental:ooo-ms-interop:xmlns:field:1.0" office:version="1.3">
  <office:scripts/>
  <office:font-face-decls>
    <style:font-face style:name="Liberation Sans" svg:font-family="'Liberation Sans'" style:font-family-generic="swiss" style:font-pitch="variable"/>
    <style:font-face style:name="Liberation Serif" svg:font-family="'Liberation Serif'" style:font-family-generic="roman" style:font-pitch="variable"/>
    <style:font-face style:name="Noto Sans CJK SC" svg:font-family="'Noto Sans CJK SC'" style:font-family-generic="system" style:font-pitch="variable"/>
    <style:font-face style:name="Noto Sans Devanagari" svg:font-family="'Noto Sans Devanagari'" style:font-family-generic="swiss"/>
    <style:font-face style:name="Noto Sans Devanagari1" svg:font-family="'Noto Sans Devanagari'" style:font-family-generic="system" style:font-pitch="variable"/>
    <style:font-face style:name="Noto Serif CJK SC" svg:font-family="'Noto Serif CJK SC'" style:font-family-generic="system" style:font-pitch="variable"/>
  </office:font-face-decls>
  <office:automatic-styles>
    <style:style style:name="P1" style:family="paragraph" style:parent-style-name="Standard">
      <style:text-properties officeooo:rsid="00157cf3" officeooo:paragraph-rsid="00157cf3"/>
    </style:style>
  </office:automatic-styles>
  <office:body>
    <office:text>
      <text:sequence-decls>
        <text:sequence-decl text:display-outline-level="0" text:name="Illustration"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Table"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Text"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Drawing"/>
        <text:sequence-decl text:display-outline-level="0" text:name="Figure"/>
      </text:sequence-decls>
      <text:p text:style-name="P1">This is a LibreOffice file.</text:p>
    </office:text>
  </office:body>
</office:document-content>

There’s a lot of overhead in the XML structure, including some style definitions. But the content is easy enough to find: my document’s one line of text is in an XML element called text:p that carries a text:style-name attribute with the value P1 (which is the name of a style defined a few lines earlier in the file).

In fact, we can extract just the file’s paragraph contents by filtering the output with grep to find just the text:p tags:

$ xmllint --format sample_odt/content.xml | grep 'text:p'
      <text:p text:style-name="P1">This is a LibreOffice file.</text:p>

You can do the same to find other content stored in any ODT file you have, such as headings, which are saved as text:h elements.
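
The grep output still includes the XML tags. As a rough illustration of going one step further, this C sketch pulls out just the text inside the next text:p or text:h element of an XML string held in memory. The function next_element_text is hypothetical, written only for this example. It is deliberately naive: it drops any nested tags such as text:span, does not decode entities like &amp;amp;, and assumes elements of the same name are not nested inside each other:

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Find the next element called name (such as "text:p") in xml and copy
 * its text, without any nested tags, into out. Returns a pointer just
 * past the element so the caller can continue the search from there, or
 * NULL if no complete element was found.
 */
const char *next_element_text(const char *xml, const char *name,
                              char *out, size_t outsz)
{
  char open[64], close[64];
  const char *p, *end;
  size_t n = 0, open_len;
  int in_tag = 0;

  assert(xml != NULL && name != NULL && out != NULL && outsz > 0);

  if (snprintf(open, sizeof open, "<%s", name) >= (int)sizeof open ||
      snprintf(close, sizeof close, "</%s>", name) >= (int)sizeof close) {
    return NULL;  /* element name too long for this sketch */
  }
  open_len = strlen(open);

  /* find an opening tag whose name is exactly name, not just a prefix */
  for (p = strstr(xml, open); p != NULL; p = strstr(p + 1, open)) {
    char c = p[open_len];
    if (c == '>' || c == ' ' || c == '/') {
      break;
    }
  }
  if (p == NULL) {
    return NULL;
  }

  p = strchr(p, '>');  /* end of the opening tag */
  if (p == NULL) {
    return NULL;
  }

  out[0] = '\0';
  if (p[-1] == '/') {
    return p + 1;  /* self-closing tag: an empty element */
  }
  p++;

  end = strstr(p, close);
  if (end == NULL) {
    return NULL;
  }

  /* copy the characters between tags, truncating if out is too small */
  for (; p < end; p++) {
    if (*p == '<') {
      in_tag = 1;
    }
    else if (*p == '>') {
      in_tag = 0;
    }
    else if (!in_tag && n + 1 < outsz) {
      out[n++] = *p;
    }
  }
  out[n] = '\0';

  return end + strlen(close);
}
```

A caller could read content.xml into a string, then call next_element_text() in a loop, passing the returned pointer back in as xml each time until it returns NULL. Running it once with "text:p" and once with "text:h" gives the paragraphs and the headings separately.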

ODT files are open data

Not every file format is like this; for example, some other word processors (especially earlier systems before “open source” became the norm) essentially saved a file by dumping the contents of memory into a file. This provided a fast way to save and load data, but meant the file format remained closed and made it more difficult to import into other programs that didn’t have the same internal memory structures.

ODT, like every other file type in the Open Document Format (ODF) family, is an open file format that any program can read. This avoids “vendor lock-in” because the open nature of ODT means you can always convert your ODT files to another format if you wish, even without using LibreOffice.

This article is adapted from What’s inside a LibreOffice ODT file by Jim Hall, and is republished with the author’s permission.
