Topic: Word doc to HTML

ergophobe

  • Inner Core
  • Hero Member
  • *
  • Posts: 3622
Word doc to HTML
« on: August 02, 2017, 04:33:48 PM »
Always a stupid pain in the butt, but someone sent me a Word doc that had to be converted to reasonable HTML.

This works great for cleaning out the garbage:
https://wordhtml.com/

This looks promising too:
https://html-online.com/editor/

Rumbas

  • Global Moderator
  • Hero Member
  • *****
  • Posts: 1779
  • Viking Wrath
Re: Word doc to HTML
« Reply #1 on: August 02, 2017, 06:00:57 PM »
Save as html in Word? ..or am I missing something?

littleman

  • Administrator
  • Hero Member
  • *****
  • Posts: 3590
Re: Word doc to HTML
« Reply #2 on: August 02, 2017, 08:36:04 PM »
LibreOffice can save it as HTML too, not sure about the quality though.

ergophobe

Re: Word doc to HTML
« Reply #3 on: August 02, 2017, 10:08:09 PM »
Quote from: Rumbas
Save as html in Word? ..or am I missing something?

If you do that, you will have hundreds of lines of CSS, all sorts of added classes, and useless tags. I have a file of regular expressions to use with sed or PowerGREP to strip all that garbage, but it's a hassle.

The utility I mentioned above works better than my regexes because it catches pretty much everything, whereas my regexes only catch what I've seen before.
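
A minimal sketch of the regex approach described above, written in Python rather than sed or PowerGREP. The function name `strip_word_markup` and the specific patterns are illustrative assumptions, not the poster's actual expression file. Like any such list, it only catches the patterns it was written for, and it does not handle nested spans.

```python
import re


def strip_word_markup(html: str) -> str:
    """Remove some common Word 'Save as HTML' cruft (illustrative, not exhaustive)."""
    # Drop Office namespace tags such as <o:p> and </o:p>.
    html = re.sub(r"</?o:[^>]*>", "", html)
    # Drop style attributes quoted with ' or " (the value may span lines).
    html = re.sub(r"""\s+style\s*=\s*(?:'[^']*'|"[^"]*")""", "", html,
                  flags=re.IGNORECASE)
    # Drop Mso* class attributes, quoted or bare (class=MsoNormal).
    html = re.sub(r"""\s+class\s*=\s*["']?Mso\w*["']?""", "", html,
                  flags=re.IGNORECASE)
    # Unwrap spans that no longer carry any attributes (non-nested only).
    html = re.sub(r"<span>(.*?)</span>", r"\1", html,
                  flags=re.IGNORECASE | re.DOTALL)
    return html
```

Run against a Word heading line such as `<h1><span style='mso-fareast-font-family:"Times New Roman"'>Friday 9.29.17<o:p></o:p></span></h1>`, this reduces it to `<h1>Friday 9.29.17</h1>`. Anything not covered by the four patterns (table widths, `border` attributes, wrapper divs) is left untouched, which is the weakness of the regex route.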

ergophobe

Re: Word doc to HTML
« Reply #4 on: August 02, 2017, 10:22:26 PM »
I just did a check. The document I was working with was 2,183 lines (61,555 bytes) when saved as Word filtered HTML.

It was 4,091 lines (138,063 bytes) when saved with the default Word Save as HTML, though only 104 of those lines were in the BODY.

After using the utility above, it was 140 lines (2,739 bytes).
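
To put those byte counts in proportion, here is a quick calculation. The helper name `percent_reduction` is hypothetical; the figures are the ones quoted in this post.

```python
def percent_reduction(before_bytes: int, after_bytes: int) -> float:
    """Return how much smaller the cleaned file is, as a percentage of the original."""
    return 100.0 * (before_bytes - after_bytes) / before_bytes


# Byte counts quoted in the post.
word_outputs = {
    "Word filtered HTML": 61_555,
    "Word default Save as HTML": 138_063,
}
cleaned_bytes = 2_739

for label, size in word_outputs.items():
    print(f"{label}: {percent_reduction(size, cleaned_bytes):.1f}% smaller after cleanup")
```

Either way, well over nine tenths of what Word writes out is markup the cleaned version does without.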

Sample code from Word
Code:
<body lang=EN-US link=blue vlink=purple style='tab-interval:.5in'>

<div class=WordSection1>

<p class=MsoTitle>Title Goes Here</p>

<p class=MsoTitle>Subtitle here</p>

<div style='border-top:solid #215868 1.0pt;border-left:none;border-bottom:solid #215868 1.0pt;
border-right:none;padding:1.0pt 0in 1.0pt 0in'>

<h1><span style='mso-fareast-font-family:"Times New Roman"'>Friday 9.29.17<o:p></o:p></span></h1>

</div>

<table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0
 summary="Conference agenda information layout table #1" width="100%"
 style='width:100.0%;border-collapse:collapse;mso-yfti-tbllook:1184;mso-padding-alt:
 0in 0in 0in 0in'>
 <tr style='mso-yfti-irow:0;mso-yfti-firstrow:yes'>
  <td width=179 style='width:118.6pt;border:none;border-right:solid gray 1.0pt;
  padding:0in 2.9pt 0in 0in'>
  <p class=MsoNormal><b>3:00 pm to 5:00 pm </b></p>
  </td>
  <td width=529 style='width:349.4pt;padding:0in 0in 0in 2.9pt'>
  <p class=MsoNormal><b>Check in </b></p>
  </td>
 </tr>
 <tr style='mso-yfti-irow:1'>
  <td width=179 valign=top style='width:118.6pt;border:none;border-right:solid gray 1.0pt;
  padding:0in 2.9pt 0in 0in'>
  <p class=MsoNormal><b>3:00 pm to 9:00 pm</b></p>
  </td>
  <td width=529 valign=bottom style='width:349.4pt;padding:0in 0in 0in 2.9pt'>
  <p class=MsoNormal><b>Spa appointments </b></p>

After
Code:
<p>Title Goes Here</p>
<p>Subtitle here</p>
<h1>Friday 9.29.17</h1>
<table width="100%">
<tbody>
<tr>
<td width="179">
<p><strong>3:00 pm to 5:00 pm </strong></p>
</td>
<td width="529">
<p><strong>Check in </strong></p>
</td>
</tr>
<tr>
<td width="179">
<p><strong>3:00 pm to 9:00 pm</strong></p>
</td>
<td width="529">
<p><strong>Spa appointments </strong></p>

rcjordan

  • I'm consulting the authorities on the subject
  • Global Moderator
  • Hero Member
  • *****
  • Posts: 6398
  • Debbie says...
Re: Word doc to HTML
« Reply #5 on: August 02, 2017, 10:42:21 PM »
I had a couple of UltraEdit macros to do some of the cleanup, but it always took a lot of manual editing to finish it.

This *might* come in handy as a 3-page-brochure-site-builder for those (like my math tutor daughter) who are proficient in Word and are never going to invest any time in HTML.

littleman

Re: Word doc to HTML
« Reply #6 on: August 02, 2017, 11:04:14 PM »
I just tested opening up a doc with LibreOffice and saving it as HTML.  I was surprised by how clean the HTML turned out -- not bad at all.

Using:
http://www.snee.com/xml/xslt/sample.doc

This is the output:
Code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">
<TITLE></TITLE>
<META NAME="GENERATOR" CONTENT="LibreOffice 3.5  (Linux)">
<META NAME="AUTHOR" CONTENT="Lexis Nexis Group">
<META NAME="CREATED" CONTENT="20010910;14520000">
<META NAME="CHANGEDBY" CONTENT="Lexis Nexis Group">
<META NAME="CHANGED" CONTENT="20030210;13460000">
<META NAME="Info 1" CONTENT="">
<META NAME="Info 2" CONTENT="">
<META NAME="Info 3" CONTENT="">
<META NAME="Info 4" CONTENT="">
<STYLE TYPE="text/css">
<!--
@page { margin-right: 1.25in; margin-top: 1in; margin-bottom: 1in }
P { margin-bottom: 0.08in; direction: ltr; color: #000000; widows: 2; orphans: 2 }
P.western { font-family: "Times New Roman", serif; font-size: 12pt; so-language: en-US }
P.cjk { font-family: "Times New Roman", serif; font-size: 12pt }
P.ctl { font-family: "Times New Roman", serif; font-size: 12pt; so-language: ar-SA }
H1 { margin-bottom: 0.04in; direction: ltr; color: #000000; widows: 2; orphans: 2 }
H1.western { font-family: "Arial", sans-serif; font-size: 16pt; so-language: en-US }
H1.cjk { font-family: "Times New Roman", serif; font-size: 16pt }
H1.ctl { font-family: "Arial", sans-serif; font-size: 16pt; so-language: ar-SA }
H2 { margin-bottom: 0.04in; direction: ltr; color: #000000; widows: 2; orphans: 2 }
H2.western { font-family: "Arial", sans-serif; font-size: 14pt; so-language: en-US; font-style: italic }
H2.cjk { font-family: "Times New Roman", serif; font-size: 14pt; font-style: italic }
H2.ctl { font-family: "Arial", sans-serif; font-size: 14pt; so-language: ar-SA; font-style: italic }
-->
</STYLE>
</HEAD>
<BODY LANG="en-US" TEXT="#000000" DIR="LTR">
<H1 CLASS="western">This is Heading1 Text</H1>
<P CLASS="western" STYLE="margin-bottom: 0in">This is a regular
paragraph with the default style of Normal. This is a regular
paragraph with the default style of Normal. This is a regular
paragraph with the default style of Normal. This is a regular
paragraph with the default style of Normal. This is a regular
paragraph with the default style of Normal.</P>
<P STYLE="margin-top: 0.08in"><FONT FACE="Arial, sans-serif"><FONT SIZE=4><I><B>This
is a Defined Block Style Called BlockStyleTest</B></I></FONT></FONT></P>
<P CLASS="western" STYLE="margin-bottom: 0in">This is more Normal
text.</P>
<H2 CLASS="western">This is Heading 2 text</H2>
<P CLASS="western" STYLE="margin-bottom: 0in">This is more Normal
text. <B>This is bold, </B><I>this is italic</I>, <I><B>and this is
bold italic</B></I>. This is normal. <FONT FACE="Courier New, monospace"><FONT SIZE=2 STYLE="font-size: 9pt">This
is in a defined inline style called InlineStyle</FONT></FONT>. This
is normal. <FONT COLOR="#ff0000">This is red text.</FONT> This is
normal.
</P>
<P CLASS="western" ALIGN=CENTER STYLE="margin-bottom: 0in">This block
is centered.</P>
<P CLASS="western" STYLE="margin-bottom: 0in">This is left-aligned.
</P>
<P CLASS="western" STYLE="margin-bottom: 0in"><BR>
</P>
<UL>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">First item of
bulleted list.
</P>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">Second item of
bulleted list.</P>
</UL>
<P CLASS="western" STYLE="margin-left: 0.5in; margin-bottom: 0in">Second
paragraph of second item of bulleted list.
</P>
<UL>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">Third item of
bulleted list.</P>
<UL>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">First item of
third item’s nested list</P>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">Second item of
third item’s nested list</P>
</UL>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">Fourth and final
item of main bulleted list.</P>
</UL>
<P CLASS="western" STYLE="margin-bottom: 0in"><BR>
</P>
<P CLASS="western" STYLE="margin-bottom: 0in">This is Normal text.</P>
<P CLASS="western" STYLE="margin-bottom: 0in"><BR>
</P>
<OL>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">First item of
numbered list.
</P>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">Second item of
numbered list.</P>
</OL>
<P CLASS="western" STYLE="margin-left: 0.5in; margin-bottom: 0in">Second
paragraph of second item of numbered list.
</P>
<OL START=3>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">Third item of
numbered list.</P>
</OL>
<P CLASS="western" STYLE="margin-bottom: 0in"><BR>
</P>
<P CLASS="western" STYLE="margin-bottom: 0in">Here is a BMP picture:</P>
<P CLASS="western" STYLE="margin-bottom: 0in"><IMG SRC="sample_html_m7a692fe0.png" NAME="graphics1" ALIGN=BOTTOM WIDTH=100 HEIGHT=100 BORDER=0></P>
<P CLASS="western" STYLE="margin-bottom: 0in">Here is a table:</P>
<TABLE WIDTH=591 CELLPADDING=7 CELLSPACING=0>
<COL WIDTH=103>
<COL WIDTH=104>
<COL WIDTH=104>
<COL WIDTH=104>
<COL WIDTH=104>
<TR VALIGN=TOP>
<TD WIDTH=103 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western"><BR>
</P>
</TD>
<TD COLSPAN=2 WIDTH=222 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western" ALIGN=CENTER>New York</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western" ALIGN=CENTER>Boston</P>
</TD>
<TD WIDTH=104 STYLE="border: 1px solid #000000; padding: 0in 0.08in">
<P CLASS="western" ALIGN=CENTER>Detroit</P>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=103 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Baseball</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Mets</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Yankees</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Red Sox</P>
</TD>
<TD WIDTH=104 STYLE="border: 1px solid #000000; padding: 0in 0.08in">
<P CLASS="western">Tigers</P>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=103 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Hockey</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Rangers</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Islanders</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Bruins</P>
</TD>
<TD WIDTH=104 STYLE="border: 1px solid #000000; padding: 0in 0.08in">
<P CLASS="western">Red Wings</P>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=103 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Football</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Giants</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Jets</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Patriots</P>
</TD>
<TD WIDTH=104 STYLE="border: 1px solid #000000; padding: 0in 0.08in">
<P CLASS="western">Lions</P>
</TD>
</TR>
</TABLE>
<P CLASS="western" STYLE="margin-bottom: 0in"><BR>
</P>
<P CLASS="western" STYLE="margin-bottom: 0in">Here is an embedded
Excel spreadsheet:</P>
<P CLASS="western" STYLE="margin-bottom: 0in"><BR>
</P>
<P CLASS="western" STYLE="margin-bottom: 0in"><IMG SRC="sample_html_m7b38147.gif" NAME="Object1" ALIGN=BOTTOM WIDTH=338 HEIGHT=120></P>
<P CLASS="western" STYLE="margin-bottom: 0in"><BR>
</P>
<P CLASS="western" STYLE="margin-bottom: 0in">This concludes our
test.</P>
<P CLASS="western" STYLE="margin-bottom: 0in"><BR>
</P>
</BODY>
</HTML>

Added: It may not be as good as the utility you referenced, since LibreOffice turns embedded spreadsheets into images, but it translated a 32 KB .doc into 9,959 bytes of HTML.
« Last Edit: August 02, 2017, 11:10:46 PM by littleman »
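
The test above was done by opening the file in LibreOffice and saving it by hand. For batches of files, the same conversion can probably be scripted with LibreOffice's headless mode. This is a sketch under assumptions: the binary is assumed to be on the PATH as `soffice` (some installs name it `libreoffice`), and the helper names `build_convert_command` and `convert_to_html` are hypothetical. The output may also differ from what the GUI export produced in this thread.

```python
import subprocess
from pathlib import Path


def build_convert_command(doc_path: str, out_dir: str = ".") -> list:
    """Build the LibreOffice headless command that converts one document to HTML."""
    return [
        "soffice",                # assumed binary name; adjust for your install
        "--headless",             # run without opening the GUI
        "--convert-to", "html",   # target format
        "--outdir", out_dir,      # where the .html file is written
        doc_path,
    ]


def convert_to_html(doc_path: str, out_dir: str = ".") -> Path:
    """Run the conversion and return the expected path of the HTML file."""
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    return Path(out_dir) / (Path(doc_path).stem + ".html")


# Example (requires LibreOffice to be installed):
# convert_to_html("sample.doc", "out")
```

The command is built separately from the call that runs it so it can be inspected or logged before anything is executed.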