Th3 Core
Topic: Word doc to HTML
ergophobe
« on: August 02, 2017, 04:33:48 PM »

Converting Word to HTML is always a stupid pain in the butt, but someone sent me a Word doc that had to be converted to reasonable HTML.

This works great for cleaning out the garbage:
https://wordhtml.com/

This one looks promising too:
https://html-online.com/editor/
Rumbas
« Reply #1 on: August 02, 2017, 06:00:57 PM »

Save as HTML in Word? ...or am I missing something?
littleman
« Reply #2 on: August 02, 2017, 08:36:04 PM »

LibreOffice can save it as HTML too; I'm not sure about the quality, though.
ergophobe
« Reply #3 on: August 02, 2017, 10:08:09 PM »

Quote from Rumbas: Save as HTML in Word? ...or am I missing something?

If you do that, you will have hundreds of lines of CSS, all sorts of added classes, and useless tags. I have a file of regular expressions that I run through sed or PowerGREP to strip all that garbage, but it's a hassle.

The utility I mentioned above works better than my regexes because it catches pretty much everything, whereas my regexes only catch what I've seen at least once before.
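To illustrate the blocklist approach described here, this is a minimal Python sketch, not the actual sed/PowerGREP file. The pattern list and the name `strip_word_junk` are hypothetical; each pattern targets one kind of Word markup that appears in the sample further down the thread.

```python
import re

# Hypothetical blocklist: each entry removes one kind of Word markup
# that has been seen in an earlier export. Anything not listed survives.
WORD_JUNK_PATTERNS = [
    re.compile(r"<o:p>.*?</o:p>", re.S),    # Office paragraph markers
    re.compile(r'\s+class=("?)Mso\w+\1'),   # class=MsoNormal, class="MsoTitle"
    re.compile(r"\s+style='[^']*'"),        # single-quoted inline styles
    re.compile(r'\s+style="[^"]*"'),        # double-quoted inline styles
    re.compile(r"</?span\b[^>]*>"),         # span wrappers (text is kept)
]


def strip_word_junk(html: str) -> str:
    """Apply every blocklist pattern in order and return the result."""
    for pattern in WORD_JUNK_PATTERNS:
        html = pattern.sub("", html)
    return html


if __name__ == "__main__":
    print(strip_word_junk("<p class=MsoTitle>Title Goes Here</p>"))
    # A class the list has never seen is left untouched:
    print(strip_word_junk("<div class=WordSection1>"))
```

The second call shows the weakness mentioned above: `class=WordSection1` is not on the list, so it passes through unchanged until someone adds a pattern for it.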
ergophobe
« Reply #4 on: August 02, 2017, 10:22:26 PM »

I just did a check. A document I was working with was 2,183 lines (61,555 bytes) when saved as Word's filtered HTML.

It was 4,091 lines (138,063 bytes) when saved with the default Word Save as HTML. Only 104 of those lines are in the BODY, though.

After using the utility above, it was 140 lines (2,739 bytes).
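For a sense of scale, the byte counts above can be turned into percentage reductions. This is just arithmetic on the numbers quoted in this post; the function name `percent_reduction` is made up for the example.

```python
# Byte counts quoted in the post above.
SIZES = {
    "Word default Save as HTML": 138_063,
    "Word filtered HTML": 61_555,
}
CLEANED_BYTES = 2_739  # size after running the cleanup utility


def percent_reduction(before: int, after: int) -> float:
    """Return how much smaller `after` is than `before`, in percent."""
    return round(100 * (1 - after / before), 1)


for label, size in SIZES.items():
    print(f"{label}: {percent_reduction(size, CLEANED_BYTES)}% smaller after cleanup")
```

That works out to roughly 98% smaller than the default export and about 96% smaller than the filtered one.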

Sample code from Word:
Code:
<body lang=EN-US link=blue vlink=purple style='tab-interval:.5in'>

<div class=WordSection1>

<p class=MsoTitle>Title Goes Here</p>

<p class=MsoTitle>Subtitle here</p>

<div style='border-top:solid #215868 1.0pt;border-left:none;border-bottom:solid #215868 1.0pt;
border-right:none;padding:1.0pt 0in 1.0pt 0in'>

<h1><span style='mso-fareast-font-family:"Times New Roman"'>Friday 9.29.17<o:p></o:p></span></h1>

</div>

<table class=MsoNormalTable border=0 cellspacing=0 cellpadding=0
 summary="Conference agenda information layout table #1" width="100%"
 style='width:100.0%;border-collapse:collapse;mso-yfti-tbllook:1184;mso-padding-alt:
 0in 0in 0in 0in'>
 <tr style='mso-yfti-irow:0;mso-yfti-firstrow:yes'>
  <td width=179 style='width:118.6pt;border:none;border-right:solid gray 1.0pt;
  padding:0in 2.9pt 0in 0in'>
  <p class=MsoNormal><b>3:00 pm to 5:00 pm </b></p>
  </td>
  <td width=529 style='width:349.4pt;padding:0in 0in 0in 2.9pt'>
  <p class=MsoNormal><b>Check in </b></p>
  </td>
 </tr>
 <tr style='mso-yfti-irow:1'>
  <td width=179 valign=top style='width:118.6pt;border:none;border-right:solid gray 1.0pt;
  padding:0in 2.9pt 0in 0in'>
  <p class=MsoNormal><b>3:00 pm to 9:00 pm</b></p>
  </td>
  <td width=529 valign=bottom style='width:349.4pt;padding:0in 0in 0in 2.9pt'>
  <p class=MsoNormal><b>Spa appointments </b></p>

After:
Code:
<p>Title Goes Here</p>
<p>Subtitle here</p>
<h1>Friday 9.29.17</h1>
<table width="100%">
<tbody>
<tr>
<td width="179">
<p><strong>3:00 pm to 5:00 pm </strong></p>
</td>
<td width="529">
<p><strong>Check in </strong></p>
</td>
</tr>
<tr>
<td width="179">
<p><strong>3:00 pm to 9:00 pm</strong></p>
</td>
<td width="529">
<p><strong>Spa appointments </strong></p>
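The before/after pair above suggests an allowlist approach: keep a small set of tags and attributes and drop everything else, instead of hunting for known junk. Below is a minimal sketch of that idea using Python's standard `html.parser`. It is not how the linked utility is implemented; the `ALLOWED` and `RENAME` tables and the `clean` function are assumptions chosen to match the sample output (this sketch does not add `<tbody>`, for instance).

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical allowlist: tags to keep, with the attributes allowed on each.
ALLOWED = {
    "p": set(), "h1": set(), "strong": set(),
    "table": {"width"}, "tr": set(), "td": {"width"},
}
RENAME = {"b": "strong"}  # presentational tag -> semantic equivalent


class AllowlistCleaner(HTMLParser):
    """Re-emit only allowlisted tags and attributes; keep all text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        tag = RENAME.get(tag, tag)
        if tag not in ALLOWED:
            return  # unknown tag (div, span, o:p, ...): drop it, keep its text
        kept = "".join(
            f' {name}="{escape(value or "")}"'
            for name, value in attrs
            if name in ALLOWED[tag]
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        tag = RENAME.get(tag, tag)
        if tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(escape(data, quote=False))


def clean(html: str) -> str:
    parser = AllowlistCleaner()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)


if __name__ == "__main__":
    print(clean("<td width=179 style='width:118.6pt'><p class=MsoNormal><b>Check in </b></p></td>"))
```

Because unknown markup is dropped by default, a class or tag that has never been seen before is removed without anyone having to write a new pattern for it.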
rcjordan
« Reply #5 on: August 02, 2017, 10:42:21 PM »

I had a couple of UltraEdit macros to do some of the cleanup, but it always took a lot of manual editing to finish the job.

This *might* come in handy as a 3-page-brochure-site-builder for those (like my math tutor daughter) who are proficient in Word and are never going to invest any time in HTML.
littleman
« Reply #6 on: August 02, 2017, 11:04:14 PM »

I just tested opening up a doc with LibreOffice and saving it as HTML. I was surprised by how clean the HTML turned out; not bad at all.

Using:
http://www.snee.com/xml/xslt/sample.doc

This is the output:
Code:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<META HTTP-EQUIV="CONTENT-TYPE" CONTENT="text/html; charset=utf-8">
<TITLE></TITLE>
<META NAME="GENERATOR" CONTENT="LibreOffice 3.5  (Linux)">
<META NAME="AUTHOR" CONTENT="Lexis Nexis Group">
<META NAME="CREATED" CONTENT="20010910;14520000">
<META NAME="CHANGEDBY" CONTENT="Lexis Nexis Group">
<META NAME="CHANGED" CONTENT="20030210;13460000">
<META NAME="Info 1" CONTENT="">
<META NAME="Info 2" CONTENT="">
<META NAME="Info 3" CONTENT="">
<META NAME="Info 4" CONTENT="">
<STYLE TYPE="text/css">
<!--
@page { margin-right: 1.25in; margin-top: 1in; margin-bottom: 1in }
P { margin-bottom: 0.08in; direction: ltr; color: #000000; widows: 2; orphans: 2 }
P.western { font-family: "Times New Roman", serif; font-size: 12pt; so-language: en-US }
P.cjk { font-family: "Times New Roman", serif; font-size: 12pt }
P.ctl { font-family: "Times New Roman", serif; font-size: 12pt; so-language: ar-SA }
H1 { margin-bottom: 0.04in; direction: ltr; color: #000000; widows: 2; orphans: 2 }
H1.western { font-family: "Arial", sans-serif; font-size: 16pt; so-language: en-US }
H1.cjk { font-family: "Times New Roman", serif; font-size: 16pt }
H1.ctl { font-family: "Arial", sans-serif; font-size: 16pt; so-language: ar-SA }
H2 { margin-bottom: 0.04in; direction: ltr; color: #000000; widows: 2; orphans: 2 }
H2.western { font-family: "Arial", sans-serif; font-size: 14pt; so-language: en-US; font-style: italic }
H2.cjk { font-family: "Times New Roman", serif; font-size: 14pt; font-style: italic }
H2.ctl { font-family: "Arial", sans-serif; font-size: 14pt; so-language: ar-SA; font-style: italic }
-->
</STYLE>
</HEAD>
<BODY LANG="en-US" TEXT="#000000" DIR="LTR">
<H1 CLASS="western">This is Heading1 Text</H1>
<P CLASS="western" STYLE="margin-bottom: 0in">This is a regular
paragraph with the default style of Normal. This is a regular
paragraph with the default style of Normal. This is a regular
paragraph with the default style of Normal. This is a regular
paragraph with the default style of Normal. This is a regular
paragraph with the default style of Normal.</P>
<P STYLE="margin-top: 0.08in"><FONT FACE="Arial, sans-serif"><FONT SIZE=4><I><B>This
is a Defined Block Style Called BlockStyleTest</B></I></FONT></FONT></P>
<P CLASS="western" STYLE="margin-bottom: 0in">This is more Normal
text.</P>
<H2 CLASS="western">This is Heading 2 text</H2>
<P CLASS="western" STYLE="margin-bottom: 0in">This is more Normal
text. <B>This is bold, </B><I>this is italic</I>, <I><B>and this is
bold italic</B></I>. This is normal. <FONT FACE="Courier New, monospace"><FONT SIZE=2 STYLE="font-size: 9pt">This
is in a defined inline style called InlineStyle</FONT></FONT>. This
is normal. <FONT COLOR="#ff0000">This is red text.</FONT> This is
normal.
</P>
<P CLASS="western" ALIGN=CENTER STYLE="margin-bottom: 0in">This block
is centered.</P>
<P CLASS="western" STYLE="margin-bottom: 0in">This is left-aligned.
</P>
<P CLASS="western" STYLE="margin-bottom: 0in"><BR>
</P>
<UL>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">First item of
bulleted list.
</P>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">Second item of
bulleted list.</P>
</UL>
<P CLASS="western" STYLE="margin-left: 0.5in; margin-bottom: 0in">Second
paragraph of second item of bulleted list.
</P>
<UL>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">Third item of
bulleted list.</P>
<UL>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">First item of
third item’s nested list</P>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">Second item of
third item’s nested list</P>
</UL>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">Fourth and final
item of main bulleted list.</P>
</UL>
<P CLASS="western" STYLE="margin-bottom: 0in"><BR>
</P>
<P CLASS="western" STYLE="margin-bottom: 0in">This is Normal text.</P>
<P CLASS="western" STYLE="margin-bottom: 0in"><BR>
</P>
<OL>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">First item of
numbered list.
</P>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">Second item of
numbered list.</P>
</OL>
<P CLASS="western" STYLE="margin-left: 0.5in; margin-bottom: 0in">Second
paragraph of second item of numbered list.
</P>
<OL START=3>
<LI><P CLASS="western" STYLE="margin-bottom: 0in">Third item of
numbered list.</P>
</OL>
<P CLASS="western" STYLE="margin-bottom: 0in"><BR>
</P>
<P CLASS="western" STYLE="margin-bottom: 0in">Here is a BMP picture:</P>
<P CLASS="western" STYLE="margin-bottom: 0in"><IMG SRC="sample_html_m7a692fe0.png" NAME="graphics1" ALIGN=BOTTOM WIDTH=100 HEIGHT=100 BORDER=0></P>
<P CLASS="western" STYLE="margin-bottom: 0in">Here is a table:</P>
<TABLE WIDTH=591 CELLPADDING=7 CELLSPACING=0>
<COL WIDTH=103>
<COL WIDTH=104>
<COL WIDTH=104>
<COL WIDTH=104>
<COL WIDTH=104>
<TR VALIGN=TOP>
<TD WIDTH=103 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western"><BR>
</P>
</TD>
<TD COLSPAN=2 WIDTH=222 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western" ALIGN=CENTER>New York</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western" ALIGN=CENTER>Boston</P>
</TD>
<TD WIDTH=104 STYLE="border: 1px solid #000000; padding: 0in 0.08in">
<P CLASS="western" ALIGN=CENTER>Detroit</P>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=103 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Baseball</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Mets</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Yankees</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Red Sox</P>
</TD>
<TD WIDTH=104 STYLE="border: 1px solid #000000; padding: 0in 0.08in">
<P CLASS="western">Tigers</P>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=103 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Hockey</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Rangers</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Islanders</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Bruins</P>
</TD>
<TD WIDTH=104 STYLE="border: 1px solid #000000; padding: 0in 0.08in">
<P CLASS="western">Red Wings</P>
</TD>
</TR>
<TR VALIGN=TOP>
<TD WIDTH=103 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Football</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Giants</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Jets</P>
</TD>
<TD WIDTH=104 STYLE="border-top: 1px solid #000000; border-bottom: 1px solid #000000; border-left: 1px solid #000000; border-right: none; padding-top: 0in; padding-bottom: 0in; padding-left: 0.08in; padding-right: 0in">
<P CLASS="western">Patriots</P>
</TD>
<TD WIDTH=104 STYLE="border: 1px solid #000000; padding: 0in 0.08in">
<P CLASS="western">Lions</P>
</TD>
</TR>
</TABLE>
<P CLASS="western" STYLE="margin-bottom: 0in"><BR>
</P>
<P CLASS="western" STYLE="margin-bottom: 0in">Here is an embedded
Excel spreadsheet:</P>
<P CLASS="western" STYLE="margin-bottom: 0in"><BR>
</P>
<P CLASS="western" STYLE="margin-bottom: 0in"><IMG SRC="sample_html_m7b38147.gif" NAME="Object1" ALIGN=BOTTOM WIDTH=338 HEIGHT=120></P>
<P CLASS="western" STYLE="margin-bottom: 0in"><BR>
</P>
<P CLASS="western" STYLE="margin-bottom: 0in">This concludes our
test.</P>
<P CLASS="western" STYLE="margin-bottom: 0in"><BR>
</P>
</BODY>
</HTML>

Added: It may not be as good as the utility you referenced, and LibreOffice turns embedded spreadsheets into images, but it converted a 32 KB .doc into 9,959 bytes of HTML.
« Last Edit: August 02, 2017, 11:10:46 PM by littleman »
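The LibreOffice test above was done by opening the file and saving it by hand. For anyone who wants to script it, here is a rough sketch that drives LibreOffice's command-line converter from Python and reports the size of the result. It assumes a recent LibreOffice with the `soffice` binary on the PATH, and that the headless export is comparable to what Save As produces, which was not checked in this thread. The function names are hypothetical.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List, Optional


def build_convert_command(doc_path: str, out_dir: str, binary: str = "soffice") -> List[str]:
    """Build the headless LibreOffice command that converts a document to HTML."""
    return [binary, "--headless", "--convert-to", "html", "--outdir", out_dir, doc_path]


def convert_and_report(doc_path: str, out_dir: str = ".") -> Optional[int]:
    """Convert doc_path to HTML and return the output size in bytes.

    Returns None when soffice is not installed (assumption: binary name).
    """
    if shutil.which("soffice") is None:
        return None
    subprocess.run(build_convert_command(doc_path, out_dir), check=True)
    html_path = Path(out_dir) / (Path(doc_path).stem + ".html")
    return html_path.stat().st_size


if __name__ == "__main__":
    # "sample.doc" is a placeholder for a local copy of the test document.
    size = convert_and_report("sample.doc")
    print("soffice not found" if size is None else f"{size} bytes of HTML")
```

Comparing the returned size against the size of the original .doc gives the same kind of figure as the 32 KB to 9,959 bytes result above.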