HTML vs MS Word - some help

Started by bayonetbrant, March 08, 2013, 01:29:47 PM


bayonetbrant

I got this by email at work today. Might be useful to someone out there.

Quote:
Hi folks, if this isn't a new trick, I apologize for spamming you. It's new to me so I figured I'd share. Dreading the thought of cleaning up a set of Word-exported HTML by hand, I thought about using jQuery to do a lot of the dirty work for me. It works pretty well!

Get yourself a copy of jQuery (below I'm using jquery-1.8.3.min.js), drop it in the folder with the HTML source you'd like to clean, and add this to the HEAD of the HTML file(s) you exported from Word:

<script type="text/javascript" src="jquery-1.8.3.min.js" ></script>
<script type="text/javascript">
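$(document).ready( function() {
      // Strip Word's presentational attributes from every element
      $('*').removeAttr("style").removeAttr("class").removeAttr("vlink").removeAttr("lang").removeAttr("link");
      // Remove style blocks and scripts (including this clean-up script)
      $('style,script').remove();
      // Replace each span with its contents, repeating until nested spans are gone too
      while ( $('span').length > 0 ) { $('span').each( function () { var h = $(this).html(); $(this).replaceWith( h ); } ); }
      // Capture the cleaned markup, then show it in a textarea for copying
      var bdy = $('html').html();
      $('body').html("").append('<form><textarea style="width: 100%; height: 100%"></textarea></form>');
      $('textarea').val( bdy );
} );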
</script>


When you load the document(s) in your browser (try Chrome; it produces cleaner output than IE and doesn't squawk about local JavaScript), you should see a textarea containing the cleaned-up HTML. A _much_ easier starting point for manual cleanup than before!

The script above strips any style/class/lang/vlink/link attributes, then removes any style tags, spans (keeping their contents), and the clean-up JavaScript itself from the resulting source. As always, your mileage may vary depending on just how messy the HTML coming out of Word happens to be, but it's pretty easy to add to or change what the script cleans up if you're familiar with jQuery/CSS selectors. For example, the version below adds cleaning for the align attribute and for paragraphs containing a single &nbsp; character.

<script type="text/javascript" src="jquery-1.8.3.min.js" ></script>
<script type="text/javascript">
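$(document).ready( function() {
      // Remove obj from the page when its inner HTML is exactly str
      var checkRepl = function ( obj, str ) { if ( $(obj).html() == str ) { $(obj).replaceWith(""); } };
      // Strip Word's presentational attributes from every element, now including align
      $('*').removeAttr("style").removeAttr("class").removeAttr("vlink").removeAttr("lang").removeAttr("link").removeAttr("align");
      // Remove style blocks and scripts (including this clean-up script)
      $('style,script').remove();
      // Replace each span with its contents, repeating until nested spans are gone too
      while ( $('span').length > 0 ) { $('span').each( function () { var h = $(this).html(); $(this).replaceWith( h ); } ); }
      // Drop elements holding only &nbsp;; the two extra passes remove wrappers left empty by the previous pass
      $('p,p>b,p>b>i').each( function () { checkRepl( this, "&nbsp;" ); } ).each( function () { checkRepl( this, "" ); } ).each( function () { checkRepl( this, "" ); } );
      // Capture the cleaned markup, then show it in a textarea for copying
      var bdy = $('html').html();
      $('body').html("").append('<form><textarea style="width: 100%; height: 100%"></textarea></form>');
      $('textarea').val( bdy );
} );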
</script>
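
If you expect to keep tweaking which attributes get stripped, it may be easier to keep them in one list than to extend the removeAttr chain each time. The following is only a rough sketch of that idea, not part of the original email: ATTRS_TO_STRIP, attrList and stripAttrs are made-up names, and it assumes jQuery 1.7 or later, where removeAttr accepts a space-separated list of attribute names.

```javascript
// Hypothetical helper names; none of these come from the original script.
// Attributes to strip from every element. Add or remove names as needed.
var ATTRS_TO_STRIP = ["style", "class", "vlink", "lang", "link", "align"];

// Join the names into the space-separated form removeAttr accepts (jQuery 1.7+).
function attrList(names) {
  return names.join(" ");
}

// Strip the listed attributes from every element in the page.
// $ is passed in so the function does not depend on a global jQuery.
function stripAttrs($, names) {
  return $("*").removeAttr(attrList(names));
}

// Only run when jQuery is actually loaded, i.e. inside the browser page.
if (typeof jQuery !== "undefined") {
  jQuery(document).ready(function () {
    stripAttrs(jQuery, ATTRS_TO_STRIP);
  });
}
```

This would replace only the attribute-stripping line; the style/script removal, span unwrapping and textarea steps from the scripts above would still have to follow inside the same ready handler.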
