Python - remove excessive HTML tags


I have this text:

<i>this article written </i><a href="http://google.com"><i>test</i></a><i>.</i> 

I think it is valid HTML; however, I want to clean it up by removing the excessive <i> tags and simplifying them into a single <i> tag:

<i>this article written <a href="http://google.com">test</a>.</i> 

I tried to clean it up myself, but I'd need to look ahead in the text, and I haven't had any success with this. Is there a package I can use, or some other way to do it, or would I have to do it manually?

Thank you.

Using an HTML parser would be the more reliable solution. It would be able to cope with tags split across many lines.

The following would solve your example, but not much more...

    import re

    def outeri(text):
        # Greedily match from the first <i> to the last </i> on the line.
        outer = re.search(r"(.*?)(<i>.*</i>)(.*)", text)
        if outer:
            # Strip every <i>/</i> inside the span, then wrap it in one pair.
            inner = re.sub(r"</?[iI]>", "", outer.group(2))
            return "%s<i>%s</i>%s" % (outer.group(1), inner, outer.group(3))
        else:
            return text

    print(outeri('<i>this article written </i><a href="http://google.com"><i>test</i></a><i>.</i>'))
    print(outeri('text before <i>this article written </i><a href="http://google.com"><i>test</i></a><i>.</i> text after'))
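
For comparison, the parser-based route mentioned above could look roughly like the sketch below. It uses only the standard library's html.parser module. The names ItalicTokens and merge_italics are made up for this illustration and are not part of any package. The sketch assumes the non-<i> tags in the input are properly nested, and it deliberately returns the input unchanged whenever merging might produce badly nested output (plain text between the italic runs, an unclosed tag such as <br> inside the span, or a span that would cut across an enclosing tag).

```python
from html.parser import HTMLParser


class ItalicTokens(HTMLParser):
    """Collect tokens, dropping <i> tags but remembering which text was italic."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.depth = 0      # current <i> nesting depth in the input
        self.tokens = []    # (kind, raw_text, is_italic)

    def handle_starttag(self, tag, attrs):
        if tag == "i":
            self.depth += 1
        else:
            self.tokens.append(("open", self.get_starttag_text(), False))

    def handle_endtag(self, tag):
        if tag == "i":
            self.depth = max(0, self.depth - 1)
        else:
            self.tokens.append(("close", "</%s>" % tag, False))

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are passed through untouched.
        self.tokens.append(("other", self.get_starttag_text(), False))

    def handle_data(self, data):
        self.tokens.append(("text", data, self.depth > 0))

    def handle_entityref(self, name):
        self.tokens.append(("text", "&%s;" % name, self.depth > 0))

    def handle_charref(self, name):
        self.tokens.append(("text", "&#%s;" % name, self.depth > 0))

    def handle_comment(self, data):
        self.tokens.append(("other", "<!--%s-->" % data, False))


def merge_italics(html):
    """Wrap one contiguous italic span in a single <i>...</i> pair, if safe."""
    parser = ItalicTokens()
    parser.feed(html)
    parser.close()
    toks = parser.tokens

    # Indices of text tokens that are italic and not just whitespace.
    italic = [n for n, (kind, raw, it) in enumerate(toks)
              if kind == "text" and it and raw.strip()]
    if not italic:
        return html
    first, end = italic[0], italic[-1] + 1

    balance = 0  # open tags minus close tags seen inside the candidate span
    for kind, raw, it in toks[first:end]:
        if kind == "text" and raw.strip() and not it:
            return html      # plain text between italic runs: leave input alone
        if kind == "open":
            balance += 1
        elif kind == "close":
            balance -= 1
            if balance < 0:
                return html  # span would cut across an enclosing tag
    # Pull in closing tags that directly follow the last italic text.
    while balance > 0 and end < len(toks) and toks[end][0] == "close":
        balance -= 1
        end += 1
    if balance != 0:
        return html

    def join(part):
        return "".join(tok[1] for tok in part)

    return "%s<i>%s</i>%s" % (join(toks[:first]), join(toks[first:end]), join(toks[end:]))


print(merge_italics('<i>this article written </i><a href="http://google.com"><i>test</i></a><i>.</i>'))
# <i>this article written <a href="http://google.com">test</a>.</i>
```

Unlike the regular expression, this also handles a start tag that is broken across lines and upper-case <I> tags, because the parser normalises tag names and hands back each start tag's raw text as a single unit.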
