Monday, October 1, 2012

How to convert ods to csv

It was big surprise for me that there are so many options to convert proprietary XLS to CSV, but there are nothing available to convert open source ODS to CSV easily. Frankly speaking I have found some scripts, but I failed to make them to do what I need. Some of them were too old with outdated libraries. Some scripts were just macros intended to use the OpenOffice and LibreOffice cores, but for me it was not a good idea because of the speed and memory consumption. This situation really strange, maybe I overlook something. Once I read one guy complaining that ODS format is too complicated to make conversion library. He suggested to use XLS to have possibility easily export any data table to CSV. So far I had not so many data and I converted them just through LibreOffice interface manually. But that was really boring if you have to do this several times per hour. Let's find a more or less simple python solution.

 An ODS file is a ZIP archive containing several files and folders. The data are contained in a file content.xml. To unpack data from the file we will use zipfile library.
import zipfile
xml = zipfile.ZipFile(filepath).read('content.xml') #filepath is a full path to ods file

So in xml variable we have data in XML format. We will use pyquery and lxml libraries to parse XML.
from pyquery import PyQuery as pq
from lxml.cssselect import CSSSelector
d = pq(xml, parser='xml')
ns={'table': 'urn:oasis:names:tc:opendocument:xmlns:table:1.0'}
selr = CSSSelector('table|table-row', namespaces=ns)
selc = CSSSelector('table|table-cell', namespaces=ns)
rowxs = pq(selr(d[0]))
data = []
for ir,rowx in enumerate(rowxs):
    cells = pq(selc(rowx))
    if cells.text():
        data.append([cells.eq(ic).text().encode('utf-8') for ic in range(len(cells))])

Now we have to save our data in a CSV file.
import os,csv
root,ext=os.path.splitext(filepath)
with open(''.join([root,'.csv']),'wb') as f:
    for row in data:
        dw = csv.writer(f)
        dw.writerow(row)

That's it! Easy - no? Not so easy. The pyquery library does not always parse ODS xml in a good way. There is a problem: special denotation is used to show that there are several empty cells in a table or there are several cells with the same data. I had to use rather dirty hack to overcome this problem. I replace all these occurrences by multiple simple xml tags before parsing.
import re
def rep_repl(match):
    return '<table:table-cell>%s' %match.group(2) * int(match.group(1))
def repl_empt(match):
    n = int(match.group(1))
    pat = '<table:table-cell/>'
    return pat*n if (n<100) else pat
p_repl = re.compile(r'<table:table-cell [^>]*?repeated="(\d+)[^/>]*>(.+?table-cell>)')
p_empt = re.compile(r'<table:table-cell [^>]*?repeated="(\d+)[^>]*>')
xml = re.sub(p_repl, rep_repl, xml)
xml = re.sub(p_empt, repl_empt, xml)
I limited maximum number of empty cells by 100. More than one hundred cells is truncated to one. I have to do this as there can be several thousands of empty useless sells at the end of the rows and empty rows with thousands empty cells. Of course, it is possible to add two more regular expressions for this cases, but I don't like such hacks and maybe I will find some other more beautiful way to get rid of regex completely. Maybe you know better solutions? You are welcome to leave your comments!

 So the whole function csv2ods to make conversion is the following:
import sys,zipfile,re,os,csv
from pyquery import PyQuery as pq
from lxml.cssselect import CSSSelector

def ods2csv(filepath):

    xml = zipfile.ZipFile(filepath).read('content.xml')

    def rep_repl(match):
        return '<table:table-cell>%s' %match.group(2) * int(match.group(1))
    def repl_empt(match):
        n = int(match.group(1))
        pat = '<table:table-cell/>'
        return pat*n if (n<100) else pat

    p_repl = re.compile(r'<table:table-cell [^>]*?repeated="(\d+)[^/>]*>(.+?table-cell>)')
    p_empt = re.compile(r'<table:table-cell [^>]*?repeated="(\d+)[^>]*>')
    xml = re.sub(p_repl, rep_repl, xml)
    xml = re.sub(p_empt, repl_empt, xml)

    d = pq(xml, parser='xml')
    ns={'table': 'urn:oasis:names:tc:opendocument:xmlns:table:1.0'}
    selr = CSSSelector('table|table-row', namespaces=ns)
    selc = CSSSelector('table|table-cell', namespaces=ns)
    rowxs = pq(selr(d[0]))
    data = []
    for ir,rowx in enumerate(rowxs):
        cells = pq(selc(rowx))
        if cells.text():
            data.append([cells.eq(ic).text().encode('utf-8') for ic in range(len(cells))])

    root,ext=os.path.splitext(filepath)
    with open(''.join([root,'.csv']),'wb') as f:
        for row in data:
            dw = csv.writer(f)
            dw.writerow(row)

ods2csv(os.path.expanduser('~/foo.ods')) #example
Script is tested for the python v2.7.2.

6 comments:

  1. Superb. Thank you so much! How can I add an argument to get something other than the first sheet in the ods file?

    ReplyDelete
  2. Thank you, but I never did that, sorry. You have to look into the file content.xml to try to understand what is the difference between single and multiple sheet ods files. Probably you have to add some loop to read all sheets.

    ReplyDelete
  3. I get this error : TypeError: __init__() got an unexpected keyword argument 'namespaces' :(

    ReplyDelete
  4. Probably this is a problem with lxml library. I have python-lxml 2.3.2-1 installed. Maybe you have older or newer version? Also I have to note again, that the script is tested with python v2.7.2 and 2.7.3 only. Maybe you use Python 3?

    ReplyDelete
  5. New and improved "Casino" game - stillcasino
    New and 카지노사이트 improved "Casino" game, or "Casino" matchpoint is the new live casino game developed by 인카지노 Evolution Gaming and released in 2021.

    ReplyDelete
  6. Casino 2021 - Mapyro
    Casino 2021 is here! Learn everything you need to know 순천 출장안마 about the casino and 나주 출장안마 get a casino titanium wire bonus! Rating: 목포 출장샵 2.6 · ‎12 reviews 강릉 출장샵 · ‎Price range: $$$

    ReplyDelete