Python - remove excessive HTML tags
So I have this text:
<i>this article written </i><a href="http://google.com"><i>test</i></a><i>.</i>
I think this is valid HTML; however, I want to clean it up by removing the excessive <i> tags and simplifying them into a single <i> tag:
<i>this article written <a href="http://google.com">test</a>.</i>
I tried to clean it up myself, but I'd need to look ahead in the text, and I haven't had any success with this. Is there a package I can use or some way I can do this, or would I have to do it manually?
Thank you.
Using an HTML parser would be a more reliable solution. It would be able to cope with tags split across many lines.
The following would solve your example, but not much more...
import re

def outeri(text):
    # Capture everything from the first <i> to the last </i> (single line only).
    outer = re.search(r"(.*?)(<i>.*</i>)(.*)", text)
    if outer:
        # Drop every inner <i>/</i>, then wrap the whole span in one pair.
        inner = re.sub(r"</?[iI]>", "", outer.group(2))
        return "%s<i>%s</i>%s" % (outer.group(1), inner, outer.group(3))
    return text

print(outeri('<i>this article written </i><a href="http://google.com"><i>test</i></a><i>.</i>'))
print(outeri('text before <i>this article written </i><a href="http://google.com"><i>test</i></a><i>.</i> text after'))
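For comparison, here is a rough sketch of what the parser-based approach could look like, using only the standard library's html.parser. The names ItalicMerger and merge_italics are made up for this illustration, and the rule it applies is an assumption on my part: it only collapses the tags when every piece of non-blank text in the fragment sits inside an <i>, and otherwise returns the input untouched. Comments and other declarations are simply dropped, so treat it as a starting point rather than a complete solution.

```python
from html.parser import HTMLParser


class ItalicMerger(HTMLParser):
    """Rebuilds a fragment without <i> tags and notes whether all text was italic."""

    def __init__(self):
        # Keep entities such as &amp; as written instead of decoding them.
        super().__init__(convert_charrefs=False)
        self.parts = []         # output pieces with every <i>/</i> left out
        self.italic_depth = 0   # number of <i> elements currently open
        self.saw_text = False   # True once any non-blank text is seen
        self.all_italic = True  # False once non-blank text appears outside <i>

    def _text(self, raw):
        if raw.strip():
            self.saw_text = True
            if self.italic_depth == 0:
                self.all_italic = False
        self.parts.append(raw)

    def handle_starttag(self, tag, attrs):
        if tag == "i":
            self.italic_depth += 1
        else:
            # Re-emit other tags exactly as they appeared in the source.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags like <br/> are copied through unchanged.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag == "i":
            self.italic_depth = max(0, self.italic_depth - 1)
        else:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self._text(data)

    def handle_entityref(self, name):
        self._text("&%s;" % name)

    def handle_charref(self, name):
        self._text("&#%s;" % name)


def merge_italics(fragment):
    """Wrap the fragment in one <i> if all of its text was italic (hypothetical helper)."""
    parser = ItalicMerger()
    parser.feed(fragment)
    parser.close()
    if parser.saw_text and parser.all_italic:
        return "<i>%s</i>" % "".join(parser.parts)
    return fragment


print(merge_italics('<i>this article written </i><a href="http://google.com"><i>test</i></a><i>.</i>'))
```

On the text from the question this gives the single-<i> form that was asked for. With plain text before or after the italic run (the second test string above), it leaves the input unchanged, so merging only part of a fragment would need extra logic.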