Wednesday, March 7, 2012

Convert RTF to plain text

I have a SQL Server 2000 table with a few fields of "text" data type
that contain rich text. I have to downstream this data and the
recipient cannot handle rich text. I need to figure out a way to
convert it back to plain text. Any suggetions?

TIAIt's painful, but you can loop thru every character. Use the ASCII
function to identify and remove any non-alphanumerics (0-9 or a-z or
A-Z or space or period or comma ...).|||"Ted" <teddy_theo@.yahoo.com> wrote in message
news:1108133581.962856.208040@.l41g2000cwc.googlegr oups.com...
>I have a SQL Server 2000 table with a few fields of "text" data type
> that contain rich text. I have to downstream this data and the
> recipient cannot handle rich text. I need to figure out a way to
> convert it back to plain text. Any suggetions?
> TIA

The best idea is probably to export the data to file and convert it
externally - the MSSQL string functions are extremely basic, and writing a
script in something like Perl, Python, C# or whatever will be much more
efficient. With a bit of Googling, you'll probably be able to find something
for your preferred language - there's already a Perl module, for example.

Simon|||agreed. i was trying to avoid a front end process because the client
is going to be pulling data directly from a view in a production
environment. i'll have to throw a little .net app together to do the
conversion i suppose. thanks for all the feedback!!|||"louis" <louisducnguyen@.gmail.com> wrote:

>It's painful, but you can loop thru every character. Use the ASCII
>function to identify and remove any non-alphanumerics (0-9 or a-z or
>A-Z or space or period or comma ...).

I've written software (as a standalone utility, not in the context of
SQL) that goes the other way, but can't offer anything that helps in
this direction. I can offer some advice though.

You need to be a bit more careful than outlined above. In RTF the
backslash "\" and brace characters "{}" are reserved. RTF is
essentially a markup language and the "tags" start with a backslash,
and can contain alphanumerics (typically alphas and then - optionally
- numerics). Braces are used to delimit sections. Some sections
such as those in the header (info and font tables) can be entirely
discarded from the visible output.

Braces and backslashes in the text are escaped - IIRC - with a
backslash. Using this, you could indeed convert most of the RTF
to text.

This approach would recover most text, but some features such as
lists might come out strange, and it wouldn't be formatted nicely,
unless you wanted to honour the \par and \line tags to give line
formatting, but in the context of insertion into a database I guess
that's the most you's want to do.

Also, depending on the source of the RTF you may be dealing with easy
to parse snippets, as opposed to a fully-featured document.

I should imagine that programming the above in SQL would be - as you
rightly point out - quite painful.

--
HTML-to-text and markup removal with Detagger
http://www.jafsoft.com/detagger/

No comments:

Post a Comment