ruby on rails - Scrape only HTML+Microdata with Nokogiri
problem
i need to scrape html pages and extract only the html that has microdata in it.
example
<div itemscope itemtype="http://schema.org/movie">
  <span>something else</span>
  <script>something</script>
  <h1 itemprop="name">avatar</h1>
  <div itemprop="director" itemscope itemtype="http://schema.org/person">
    director: <span itemprop="name">james cameron</span>
    (born <span itemprop="birthdate">august 16, 1954)</span>
    <img url="doesnt matter" />
  </div>
  <span itemprop="genre">science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">trailer</a>
</div>
<div>something else</div>
<div itemscope itemtype="http://schema.org/product">
  <span itemprop="brand">acme</span>
  <span itemprop="name">executive anvil</span>
  <span itemprop="offers" itemscope itemtype="http://schema.org/offer">
    <span>something else</span>
    <meta itemprop="pricecurrency" content="usd" />
    $<span itemprop="price">119.99</span>
  </span>
</div>
goal: only the html that carries microdata
<div itemscope itemtype="http://schema.org/movie">
  <h1 itemprop="name">avatar</h1>
  <div itemprop="director" itemscope itemtype="http://schema.org/person">
    director: <span itemprop="name">james cameron</span>
    (born <span itemprop="birthdate">august 16, 1954)</span>
  </div>
  <span itemprop="genre">science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">trailer</a>
</div>
<div itemscope itemtype="http://schema.org/product">
  <span itemprop="brand">acme</span>
  <span itemprop="name">executive anvil</span>
  <span itemprop="offers" itemscope itemtype="http://schema.org/offer">
    <meta itemprop="pricecurrency" content="usd" />
    $<span itemprop="price">119.99</span>
  </span>
</div>
attempt
i tried to use:
doc.css("*[itemtype]").each do |container|
  puts container.to_html
end
but this doesn't work, because it iterates over every element that has an itemtype, outputs each one together with its nested itemtype children, and then visits those children again, so things are duplicated, i.e., movie+person > person > product+offer > offer.