Python BeautifulSoup web crawling: formatting output
The site I'm trying to crawl is http://www.boxofficemojo.com/yearly/chart/?yr=2013&p=.htm. The specific page I'm focusing on is http://www.boxofficemojo.com/movies/?id=catchingfire.htm.
On this page, I'm having trouble with two things. The first is the "Foreign Gross" amount (under Total Lifetime Grosses). I get the amount with this function:
    def getforeign(item_url):
        response = requests.get(item_url)
        soup = BeautifulSoup(response.content)
        print soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip=True)
The problem is that while I can print the amount to the console, I can't append these values to a list or write them to a CSV file. For the previous data I needed from this site, I got each individual piece of information per movie, appended it to one list, and exported that list to a CSV file.
How can I get the "Foreign Gross" amount as a separate value for each movie? What do I need to change?
The second problem is related to getting the list of actors/actresses for each movie. I have this function:
    def getactors(item_url):
        source_code = requests.get(item_url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text)
        tempactors = []
        print soup.find(text="Actors:").find_parent("tr").text[7:]
This prints out the list of actors: Jennifer LawrenceJosh HutchersonLiam HemsworthElizabeth BanksStanley TucciWoody HarrelsonPhilip Seymour HoffmanJeffrey WrightJena MaloneAmanda PlummerSam ClaflinDonald SutherlandLenny Kravitz - and so on.
I'm having the same problem here as with the foreign gross amount: I want to get each individual actor separately, append them to a temporary list, and later append that list to a full list of movies. I did this for the list of directors, but since the directors are links and not all of the actors/actresses have HTML links, I can't do the same thing. The issue right now is that there is no space between the actors' names.
Why are my current functions not working, and how can I fix them?
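For what it's worth, here is a minimal sketch of one way to keep the names apart, using a made-up HTML fragment shaped roughly like the page's Actors row (the real markup may differ): each name is a separate text node in the tree, so iterating over `stripped_strings` yields them one at a time instead of concatenating them the way `.text` does.

```python
from bs4 import BeautifulSoup

# Made-up fragment for illustration: one linked name plus two plain names
# separated by <br> tags, as a page without links for every actor might have.
html = ('<tr><td>Actors:</td><td>'
        '<a href="/people/?id=lawrence.htm">Jennifer Lawrence</a>'
        '<br>Josh Hutcherson<br>Liam Hemsworth</td></tr>')
soup = BeautifulSoup(html, 'html.parser')
row = soup.find(text="Actors:").find_parent("tr")

# stripped_strings yields each text node separately, so the names stay apart
actors = [name for name in row.stripped_strings if name != "Actors:"]
# actors == ['Jennifer Lawrence', 'Josh Hutcherson', 'Liam Hemsworth']
```

Returning that list (rather than printing a slice of `.text`) would let the caller append it to the temporary list the way the directors are handled.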
More code:
    def spider(max_pages):
        page = 1
        while page <= max_pages:
            url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2013&p=.htm'
            source_code = requests.get(url)
            plain_text = source_code.text
            soup = BeautifulSoup(plain_text)
            for link in soup.select('td > b > font > a[href^=/movies/?]'):
                href = 'http://www.boxofficemojo.com' + link.get('href')
                details(href)
                listofforeign.append(getforeign(href))
                listofdirectors.append(getdirectors(href))
                str(listofdirectors).replace('[','').replace(']','')
                getactors(href)
                title = link.string
                listoftitles.append(title)
            page += 1
    listofforeign = []

    def getforeign(item_url):
        s = urlopen(item_url).read()
        soup = BeautifulSoup(s)
        return soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip=True)

    def spider(max_pages):
        page = 1
        while page <= max_pages:
            url = 'http://www.boxofficemojo.com/yearly/chart/?page=' + str(page) + '&view=releasedate&view2=domestic&yr=2013&p=.htm'
            source_code = requests.get(url)
            plain_text = source_code.text
            soup = BeautifulSoup(plain_text)
            for link in soup.select('td > b > font > a[href^=/movies/?]'):
                href = 'http://www.boxofficemojo.com' + link.get('href')
                listofforeign.append(getforeign(href))
            page += 1
        print listofforeign
This returns:
    Traceback (most recent call last):
      File "C:/Users/younjin/PycharmProjects/untitled/movies.py", line 75, in <module>
        spider(1)
      File "C:/Users/younjin/PycharmProjects/untitled/movies.py", line 29, in spider
        listofforeign.append(getforeign(href))
      File "C:/Users/younjin/PycharmProjects/untitled/movies.py", line 73, in getforeign
        return soup.find(text="Foreign:").find_parent("td").find_next_sibling("td").get_text(strip=True)
    AttributeError: 'NoneType' object has no attribute 'find_parent'
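That AttributeError suggests that `soup.find(text="Foreign:")` returned None for at least one movie page, presumably one with no "Foreign:" row, and chaining `.find_parent` onto None then fails. A hedged sketch of a guard for that case, demonstrated on made-up HTML fragments rather than the real pages:

```python
from bs4 import BeautifulSoup

def getforeign_from_soup(soup):
    # soup.find returns None when the label is absent, so check before chaining
    label = soup.find(text="Foreign:")
    if label is None:
        return "N/A"  # placeholder value for movies with no foreign gross row
    return label.find_parent("td").find_next_sibling("td").get_text(strip=True)

# Illustrative fragments: one page with a Foreign row, one without
with_row = BeautifulSoup(
    '<tr><td>Foreign:</td><td>$539,365,811</td></tr>', 'html.parser')
without_row = BeautifulSoup('<tr><td>Domestic:</td></tr>', 'html.parser')
# getforeign_from_soup(with_row)    returns '$539,365,811'
# getforeign_from_soup(without_row) returns 'N/A'
```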