ruby on rails - Scrape only HTML+Microdata with Nokogiri


problem

I need to scrape HTML pages and extract only the HTML that has microdata in it.

example

<div itemscope itemtype="http://schema.org/movie">
  <span>something else</span>
  <script>something</script>
  <h1 itemprop="name">avatar</h1>
  <div itemprop="director" itemscope itemtype="http://schema.org/person">
  director: <span itemprop="name">james cameron</span> (born <span itemprop="birthdate">august 16, 1954)</span>
  <img url="doesnt matter" />
  </div>
  <span itemprop="genre">science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">trailer</a>
</div>

<div>something else</div>

<div itemscope itemtype="http://schema.org/product">
  <span itemprop="brand">acme</span>
  <span itemprop="name">executive anvil</span>

  <span itemprop="offers" itemscope itemtype="http://schema.org/offer">
    <span>something else</span>
    <meta itemprop="pricecurrency" content="usd" />
    $<span itemprop="price">119.99</span>
  </span>
</div>

goal: only the html that carries microdata

<div itemscope itemtype="http://schema.org/movie">
  <h1 itemprop="name">avatar</h1>
  <div itemprop="director" itemscope itemtype="http://schema.org/person">
  director: <span itemprop="name">james cameron</span> (born <span itemprop="birthdate">august 16, 1954)</span>
  </div>
  <span itemprop="genre">science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">trailer</a>
</div>
<div itemscope itemtype="http://schema.org/product">
  <span itemprop="brand">acme</span>
  <span itemprop="name">executive anvil</span>
  <span itemprop="offers" itemscope itemtype="http://schema.org/offer">
    <meta itemprop="pricecurrency" content="usd" />
    $<span itemprop="price">119.99</span>
  </span>
</div>

attempt

I tried to use:

# print every element that declares an itemtype, nested ones included
doc.css("*[itemtype]").each do |container|
  puts container.to_html
end

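But this doesn't work: it iterates over every element with an itemtype and outputs it together with its children, then iterates again over the nested itemtype elements, so things are duplicated, i.e., movie+person > person > product+offer > offer. It also keeps the children that carry no microdata at all.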

