Perl Remove Unwanted HTML

If you want to strip unwanted HTML such as span and div tags and other junk from a block of text, you can use HTML::Restrict like this:


#!/usr/bin/perl
use strict;
use warnings;
use HTML::Restrict;

my $hr = HTML::Restrict->new();
$hr->set_rules({
    # allowed tags (an empty list means no attributes are kept)
    p  => [],
    li => [],
    ul => [],
    h4 => [],
    h3 => [],
    h2 => [],

    # not allowed (everything by default is not allowed!)
    # img => [qw( alt / )],
    # h1  => [],
});

while ( my $line = <DATA> ) {
    $line =~ s/&nbsp;/ /g;    # replace &nbsp; entities with a plain space
    $line =~ s/\s+/ /g;       # collapse any run of whitespace (tabs included) into one space
    $line =~ s/^\s+//;        # trim leading spaces
    $line =~ s/\s+$//;        # trim trailing spaces

    # Fall back to an empty string in case process() returns undef
    my $clean = $hr->process($line) // '';
    print "$clean\n";
}
__DATA__
Paste your code here below this line
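
To see what the four whitespace substitutions in the loop do on their own, here is a minimal sketch that uses only core Perl. The helper name normalize_whitespace and the sample string are made up for illustration and are not part of HTML::Restrict.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Restrict): applies the same four
# substitutions as the loop in the script to a single string.
sub normalize_whitespace {
    my ($line) = @_;
    $line =~ s/&nbsp;/ /g;    # replace the &nbsp; entity with a plain space
    $line =~ s/\s+/ /g;       # collapse each run of whitespace into one space
    $line =~ s/^\s+//;        # trim leading whitespace
    $line =~ s/\s+$//;        # trim trailing whitespace
    return $line;
}

# Made-up sample input with entities, a tab, and stray spaces
my $sample = "  <p>Hello&nbsp;&nbsp;world</p>\t <span>junk</span> \n";
print normalize_whitespace($sample), "\n";
```

This prints <p>Hello world</p> <span>junk</span>. The span tag is still there because the helper only touches whitespace; removing the tag is left to HTML::Restrict's process method.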
