Python BeautifulSoup web crawling: Formatting output -


The site I'm trying to crawl is http://www.boxofficemojo.com/yearly/chart/?yr=2013&p=.htm. The specific page I'm focusing on is http://www.boxofficemojo.com/movies/?id=catchingfire.htm.

From this page, I'm having trouble with two things. The first is the "Foreign" gross amount (under Total Lifetime Grosses). I get the amount with this function:

    def getForeign(item_url):
        response = requests.get(item_url)
        soup = BeautifulSoup(response.content)
        print soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip=True)

The problem is that while I can print the amount to the console, I can't append these values to a list or write them to a CSV file. For the previous data I needed from this site, I got each individual piece of information per movie, appended it to a list, and exported the list to a CSV file.

How can I get the "Foreign" gross amount as a separate value for each movie? What do I need to change?
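The underlying issue is that a function which only prints evaluates to None at the call site. A minimal sketch of the difference, using placeholder dollar strings rather than real scraped values:

```python
# Sketch of the print-vs-return problem, with placeholder values.
# A function that only prints implicitly returns None, so appending
# its "result" fills the list with None entries.
def printer(x):
    print(x)        # side effect only; the function returns None

def returner(x):
    return x        # hands the value back to the caller

values = [printer("$10"), printer("$20")]
assert values == [None, None]       # why the list/CSV export comes out wrong

values = [returner("$10"), returner("$20")]
assert values == ["$10", "$20"]     # now the amounts can be stored or written
```

Applied to the function above, that would mean replacing the print with a return and letting the caller do the appending.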

The second problem is getting the list of actors/actresses for each movie. I have this function:

    def getActors(item_url):
        source_code = requests.get(item_url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        tempActors = []
        print soup.find(text="Actors:").find_parent("tr").text[7:]

This prints out the list of actors: Jennifer LawrenceJosh HutchersonLiam HemsworthElizabeth BanksStanley TucciWoody HarrelsonPhilip Seymour HoffmanJeffrey WrightJena MaloneAmanda PlummerSam ClaflinDonald SutherlandLenny Kravitz - and so on.

I'm having the same problem as with the foreign gross amount: I want each individual actor separately, so I can append them to a temporary list and later append that list to the full list of movies. I did this for the list of directors, but since the directors are links and not all of the actors/actresses have HTML links, I can't do the same here. The other issue is that there is no space between the actors' names.
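One workaround is a heuristic sketch, not the only option: split the run-together string wherever a lowercase letter is immediately followed by an uppercase one. Note it would mis-split names with internal capitals such as "McConaughey"; if each name sits in its own text node, `find_all(text=True)` on the row in BeautifulSoup should also return them separately.

```python
import re

# Run-together actor string as produced by .text on the table row
# (sample trimmed from the question).
raw = ("Jennifer LawrenceJosh HutchersonLiam Hemsworth"
       "Philip Seymour HoffmanLenny Kravitz")

# Zero-width split at every lowercase -> uppercase boundary.
actors = re.split(r'(?<=[a-z])(?=[A-Z])', raw)
print(actors)
# -> ['Jennifer Lawrence', 'Josh Hutcherson', 'Liam Hemsworth',
#     'Philip Seymour Hoffman', 'Lenny Kravitz']
```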

Why are my current functions not working, and how can I fix them?

More code:

    def spider(max_pages):
        page = 1
        while page <= max_pages:
            url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2013&p=.htm'
            source_code = requests.get(url)
            plain_text = source_code.text
            soup = BeautifulSoup(plain_text)
            for link in soup.select('td > b > font > a[href^="/movies/?"]'):
                href = 'http://www.boxofficemojo.com' + link.get('href')
                details(href)
                listOfForeign.append(getForeign(href))
                listOfDirectors.append(getDirectors(href))
                str(listOfDirectors).replace('[', '').replace(']', '')
                getActors(href)
                title = link.string
                listOfTitles.append(title)
            page += 1
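For the eventual CSV export step, the standard csv module handles the quoting. A sketch with placeholder titles and amounts; in the spider, the rows would be built from the values the per-movie helper functions return:

```python
import csv
import io

# Placeholder rows; real ones would be (title, foreign gross) pairs
# collected from the per-movie helpers.
rows = [
    ("Movie A", "$1,000"),
    ("Movie B", "$2,000"),
]

buf = io.StringIO()   # swap in open('movies.csv', 'w', newline='') for a real file
writer = csv.writer(buf)
writer.writerow(["Title", "Foreign gross"])  # header row
writer.writerows(rows)                       # one line per movie
print(buf.getvalue())
```

The writer quotes the dollar amounts automatically because they contain commas, so the columns stay intact when the file is read back.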

listOfForeign = []

    def getForeign(item_url):
        s = urlopen(item_url).read()
        soup = BeautifulSoup(s)
        return soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip=True)

    def spider(max_pages):
        page = 1
        while page <= max_pages:
            url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2013&p=.htm'
            source_code = requests.get(url)
            plain_text = source_code.text
            soup = BeautifulSoup(plain_text)
            for link in soup.select('td > b > font > a[href^="/movies/?"]'):
                href = 'http://www.boxofficemojo.com' + link.get('href')
                listOfForeign.append(getForeign(href))
            page += 1

    print listOfForeign

This returns:

    Traceback (most recent call last):
      File "C:/Users/younjin/PycharmProjects/untitled/movies.py", line 75, in <module>
        spider(1)
      File "C:/Users/younjin/PycharmProjects/untitled/movies.py", line 29, in spider
        listOfForeign.append(getForeign(href))
      File "C:/Users/younjin/PycharmProjects/untitled/movies.py", line 73, in getForeign
        return soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip=True)
    AttributeError: 'NoneType' object has no attribute 'find_parent'
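The traceback means soup.find(text="Foreign:") came back as None for some movie page (not every page has a "Foreign:" cell), and None has no find_parent attribute. A defensive sketch of the lookup; FakeSoup is a hypothetical stand-in so the guard can be demonstrated without fetching a real page:

```python
def getForeignSafe(soup):
    # Guard against pages with no "Foreign:" label: soup.find() returns
    # None there, and chaining .find_parent(...) onto None raises
    # AttributeError. Return a placeholder instead of crashing.
    label = soup.find(text="Foreign:")
    if label is None:
        return "n/a"
    return label.find_parent("td").find_next_sibling("td").get_text(strip=True)

class FakeSoup:
    # Hypothetical stand-in simulating a page without a "Foreign:" cell.
    def find(self, text=None):
        return None

print(getForeignSafe(FakeSoup()))   # -> n/a instead of an AttributeError
```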

