Converting HTML To RML

HTML often forms part of the input to a system, but generating a PDF directly from content that contains HTML can cause problems. ReportLab have developed tools within our rlextra package to deal with two related issues:

1. Cleaning input before the data is saved - removing tags and other content that might cause problems

2. Writing out HTML content to a PDF

Cleaning input before the data is saved

This section covers some content also included on this site under XML helper utilities. Note: rlextra and html_cleaner do not handle every HTML tag and attribute; they focus on a smaller subset of tags and attributes. The functionality lives in rlextra/radxml/html_cleaner.py. Some basic examples follow; for more comprehensive examples, look directly at the test function in html_cleaner.py. These examples assume you have the necessary imports.

>>> from rlextra.radxml.html_cleaner import cleanPlain, cleanBlocks, cleanInline

`cleanBlocks`

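Accepts markup as one or more blocks. In the example, the unknown tag and the unmatched closing </em> are removed while the text they surround is kept. Example: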

>>> data = "<p>This is <unkown>raw data</unkown> with HTML</em> <b>paragraph</b></p>"
>>> cleanBlocks(data)
'<p>This is raw data with HTML <b>paragraph</b></p>'
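The idea behind this kind of block cleaning can be sketched with the standard library alone. This is not how html_cleaner is implemented; it is a minimal illustration of an allowlist filter, and the names ALLOWED_TAGS, AllowlistFilter and filter_blocks are hypothetical.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist for illustration; the real cleaner has its own rules.
ALLOWED_TAGS = {"p", "b", "i", "em", "strong"}


class AllowlistFilter(HTMLParser):
    """Re-emit only allowlisted tags; keep the text of everything else."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.open_tags = []  # allowlisted tags that are currently open

    def handle_starttag(self, tag, attrs):
        if tag in ALLOWED_TAGS:
            self.open_tags.append(tag)
            self.out.append(f"<{tag}>")  # attributes are dropped for brevity

    def handle_endtag(self, tag):
        # Emit a closing tag only when it matches the innermost open tag,
        # so a stray closer such as an unmatched </em> is discarded.
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Text is always kept, re-escaped so the output stays well formed.
        self.out.append(escape(data, quote=False))


def filter_blocks(markup):
    parser = AllowlistFilter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


data = "<p>This is <unkown>raw data</unkown> with HTML</em> <b>paragraph</b></p>"
print(filter_blocks(data))
```

For this particular input the sketch produces the same string as the cleanBlocks example above, but it makes no attempt to repair badly nested markup or to keep attributes.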

`cleanInline`

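Accepts and normalizes markup for inline use. In the example, the unknown attribute is dropped, attribute quoting is normalized and an empty alt attribute is added. Example: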

>>> data = "<img width='100' unkown='x' src='photo.png'/>"
>>> cleanInline(data)
'<img width="100" src="photo.png" alt=""/>'
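Attribute cleaning follows the same allowlist idea. The sketch below is an assumption-laden illustration using only the standard library, not the rlextra implementation; ALLOWED_ATTRS, DEFAULT_ATTRS, AttrFilter and filter_inline are hypothetical names, and the default alt value simply imitates the output shown above.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical per-tag attribute allowlist and defaults, for illustration only.
ALLOWED_ATTRS = {"img": ("width", "height", "src", "alt")}
DEFAULT_ATTRS = {"img": {"alt": ""}}


class AttrFilter(HTMLParser):
    """Re-emit known empty tags with only their allowlisted attributes."""

    def __init__(self):
        super().__init__()
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Also reached for self-closing tags: the default
        # handle_startendtag calls handle_starttag first.
        allowed = ALLOWED_ATTRS.get(tag)
        if allowed is None:
            return  # unknown tag: drop it
        # Keep allowlisted attributes in their original order.
        kept = {name: value or "" for name, value in attrs if name in allowed}
        # Fill in any defaults the input did not supply.
        for name, value in DEFAULT_ATTRS.get(tag, {}).items():
            kept.setdefault(name, value)
        rendered = "".join(
            f' {name}="{escape(value, quote=True)}"' for name, value in kept.items()
        )
        self.out.append(f"<{tag}{rendered}/>")


def filter_inline(markup):
    parser = AttrFilter()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)


data = "<img width='100' unkown='x' src='photo.png'/>"
print(filter_inline(data))
```

The sketch only handles empty elements such as img; text and container tags are ignored to keep it short.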

`cleanPlain`

Removes all tags to output plain text. Example:

>>> from rlextra.radxml.html_cleaner import cleanPlain
>>> data = "<p>This is raw data with <em>HTML</em> <b>paragraph</b></p>"
>>> cleanPlain(data)
'This is raw data with HTML paragraph'
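For readers without rlextra to hand, the effect of stripping tags can be reproduced with a few lines of standard-library Python. This is only a sketch of the concept; TextOnly and strip_tags are hypothetical names and the code is not taken from html_cleaner.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes and ignore every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(markup):
    parser = TextOnly()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return "".join(parser.chunks)


data = "<p>This is raw data with <em>HTML</em> <b>paragraph</b></p>"
print(strip_tags(data))
```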

Writing out HTML content to a PDF

Here we detail rendering HTML in a PDF, including use of the aforementioned cleanPlain.

There are a number of approaches that can be taken depending on your input.

In these snippets we use the following imports:

    from preppy import SafeString
    from rlextra.radxml.xhtml2rml import xhtml2rml
    from rlextra.radxml.html_cleaner import cleanPlain

The input examples are:

    data = "<p>This is raw data with <em>HTML</em> <b>paragraph</b></p>"
    data2 = "This is raw data with <em>HTML</em> <b>paragraph</b>"

1: Raw XHTML data example. preppy's quoting escapes the tags, so they appear as literal text rather than being treated as markup

    <para>{{data}}</para>
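What escaping does to the example input can be illustrated with the standard library. The sketch uses xml.sax.saxutils.escape as a stand-in; it is an assumption that preppy's quoting behaves equivalently for these characters, and preppy's own quoting function is not shown here.

```python
from xml.sax.saxutils import escape

data = "<p>This is raw data with <em>HTML</em> <b>paragraph</b></p>"

# escape() replaces &, < and > with XML entities, so the markup
# becomes ordinary text once it is placed inside a <para>.
escaped = escape(data)
print(escaped)
```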

2: cleanPlain example. cleanPlain strips the XHTML tags, leaving plain text

    <para style="normal">{{cleanPlain(data)}}</para>

3: XHTML data without para tags but with inline tags. Ensure the data is enclosed in an RML para tag. SafeString tells preppy not to XML-escape the contents; xhtml2rml converts the XHTML to RML

    <para style="normal">{{SafeString(xhtml2rml(data2))}}</para>

4: XHTML to RML data example, without a specified paraStyle. Ensure there are no RML para tags around the data. When no styles are specified with the content, xhtml2rml assumes that paraStyle='normal', tableStyle='noPaddingStyle' and bulletStyle='bullet' exist in your style sheets

    {{SafeString(xhtml2rml(data))}}

5: XHTML to RML data example, with a specified paraStyle. Ensure there are no RML para tags around the data

    {{SafeString(xhtml2rml(data, paraStyle="normal"))}}
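The choice between approaches 3 and 5 depends on whether the input already contains block-level tags. A minimal sketch of that decision follows; it is not part of rlextra. The helper name template_expression, its parameters and the list of tags treated as block-level are all assumptions made for illustration.

```python
import re

# Assumption: these tags count as block-level for the purpose of this sketch.
BLOCK_TAG = re.compile(r"<\s*(p|ul|ol|table|h[1-6])\b", re.IGNORECASE)


def template_expression(markup, var_name="data", para_style="normal"):
    """Return the preppy/RML snippet suited to the given markup.

    Block-level input must not be wrapped in <para>; inline-only input
    needs an enclosing <para>.
    """
    call = f"SafeString(xhtml2rml({var_name}"
    if BLOCK_TAG.search(markup):
        # Approach 5: let xhtml2rml create the paragraphs itself.
        return "{{" + call + f', paraStyle="{para_style}"))' + "}}"
    # Approach 3: inline markup only, so supply the <para> wrapper.
    return f'<para style="{para_style}">' + "{{" + call + "))}}" + "</para>"


data = "<p>This is raw data with <em>HTML</em> <b>paragraph</b></p>"
data2 = "This is raw data with <em>HTML</em> <b>paragraph</b>"

print(template_expression(data))
print(template_expression(data2, var_name="data2"))
```

The helper only builds the template text; rendering it still requires preppy and rlextra as described above.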